2008-04-24

Python Command Line (-c option)

Perl has a -n option which implicitly runs a while-loop over all lines in STDIN (while (<>) { }). This mode is handy in a command shell when Perl is the recipient of the output of another command and you don't want to write a script. Can we do the same for Python?

Python has a -c option which runs the code in the string following it. The string can hold one or more statements (separated by semicolons when written on a single line), and I found that you can write some useful filters by combining list functions and generator expressions using this template:

python -c "import sys, <package>; print '\n'.join(<list function>(lambda x: <expression>, (s.strip() for s in sys.stdin)))"

To use this template, replace <package> with a package name (e.g. os), <list function> with a list function (e.g. filter) and <expression> with, well, an expression. Whatever else you import, sys must be among the imports, because the template reads sys.stdin. The rest of the template just strips the trailing "\n" from each input line and prints the results, one per line.

For simple string processing, the list function and lambda expression are not required, resulting in a simplified version of this template, where <fn> is any function (or other expression) applied to each line:

python -c "import sys, <package>; print '\n'.join(<fn>(s.strip()) for s in sys.stdin)"
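As a rough sketch of how the two templates relate, here they are written as ordinary functions. This is Python 3 (where print is a function and filter returns an iterator), not the Python 2 used in the rest of this post, and the names stripped_lines, filter_lines and map_lines are made up for illustration.

```python
import io

def stripped_lines(stream):
    """Yield each line of stream without surrounding whitespace or newline."""
    return (s.strip() for s in stream)

def filter_lines(stream, predicate):
    """First template: keep only the lines for which predicate is true."""
    return filter(predicate, stripped_lines(stream))

def map_lines(stream, fn):
    """Simplified template: apply fn to every stripped line."""
    return (fn(s) for s in stripped_lines(stream))

# io.StringIO stands in for sys.stdin so the sketch runs on its own.
sample = io.StringIO("alpha\n\nbeta  \ngamma\n")
print("\n".join(filter_lines(sample, lambda x: x != "")))
```

On the command line, a Python 3 one-liner in the same spirit would be python3 -c "import sys; print('\n'.join(s.strip().upper() for s in sys.stdin))".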

While researching this topic, I found an ASPN Python Recipe called Pyline to help write commands. Here are the examples in that recipe, rewritten using my template:

Print the first 20 characters of each line:

tail test.txt | python -c "import sys; print '\n'.join(s.strip()[:20] for s in sys.stdin)"

Print the 7th word in each line, assuming the separator is ' ':

tail test.txt | python -c "import sys; print '\n'.join(''.join(s.strip().split(' ')[6:7]) for s in sys.stdin)"

Note that you can also get columns of text from a file using the cut command. The list slice ([6:7]) is used instead of an index ([6]) to avoid an IndexError exception when a line has fewer than seven words; the inner ''.join() turns the resulting one-element (or empty) list back into a string.
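To see why a slice is safer than an index here, a small sketch in Python 3 may help. The function name nth_word is hypothetical, chosen only for this illustration.

```python
def nth_word(line, n, sep=" "):
    """Return the n-th word (1-based) of line, or '' if there are too few."""
    words = line.strip().split(sep)
    # words[n - 1] would raise IndexError on short lines; the slice
    # words[n - 1:n] yields an empty list instead, which joins to ''.
    return "".join(words[n - 1:n])

print(nth_word("a b c d e f g h", 7))  # g
print(repr(nth_word("too short", 7)))  # ''
```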

List all files that are greater than 1024 bytes in size:

ls | python -c "import os, sys; print '\n'.join(filter(lambda x: os.path.isfile(x) and os.stat(x).st_size > 1024, (s.strip() for s in sys.stdin)))"

Generate MD5 digest values for a list of files, like md5sum:

ls *.txt | python -c "import md5, sys; print '\n'.join('%s %s' % (md5.new(file(s.strip()).read()).hexdigest(), s.strip()) for s in sys.stdin)"
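The md5 module used above was superseded by hashlib in Python 2.5 and no longer exists in Python 3. A minimal Python 3 sketch of the same idea follows; the helper names md5_hex and md5sum_line are my own, and reading in chunks is an addition so that large files are not loaded into memory at once.

```python
import hashlib

def md5_hex(path, chunk_size=65536):
    """Return the hex MD5 digest of the file at path, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def md5sum_line(path):
    """Format one output line: digest, two spaces, file name."""
    return "%s  %s" % (md5_hex(path), path)
```

Piping file names in would then be a matter of calling md5sum_line(s.strip()) for each s in sys.stdin.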

2008-04-26: Replaced the list comprehension (for-in with square brackets) with a generator expression (for-in with parentheses) in the template, to avoid building very large lists in memory.

Added the MD5 digest example, and realised that list functions (e.g. filter()) are only needed if you want to change which lines end up in the resulting list. Otherwise, the simpler template suffices.

2008-04-11

Firefox Greasemonkey Kills Google Groups Spam

If you read Usenet newsgroups, you are no doubt familiar with spam messages spruiking credit, fake jewellery, external organ enlargements and free graduate degrees. On a PC, you can use killfiles in newsreading software to ignore spam messages. If you read newsgroups using the Google Groups web-based reader with Firefox, you can ignore annoying spam messages using a Greasemonkey script called Google Groups Killfile (GGK).

You can add entries to your killfile list using GGK's context menu but the list becomes hard to view and manage once you have a lot of entries. It is easier to edit GGK's kill list variable:

  • Enter "about:config" in Firefox's location bar.
  • Enter "kill" in the Filter field.
  • Click on greasemonkey.scriptvals.www.penney.org/Google Groups Killfile.GoogleKillFile and edit the configuration string.

2008-04-14: If you use regular expressions (REs), you can reduce the number of entries in the killfile list by using wildcards and the alternation operator (the vertical bar, "|"). You can reduce the number of patterns further by making GGK compare case-insensitively: search for the calls to the RE "compile()" function in the GGK script and add "i" as a second argument.
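GGK itself is JavaScript, but the effect of alternation plus case-insensitive matching can be illustrated with Python's re module. The patterns and the is_spam helper below are invented for this sketch and are not taken from GGK.

```python
import re

# Hypothetical kill pattern: one alternation replaces several plain entries,
# and re.IGNORECASE plays the role of the "i" flag mentioned above.
KILL = re.compile(r"replica (watch|handbag)e?s|free .* degree", re.IGNORECASE)

def is_spam(subject):
    """Return True if the subject line matches any alternative in KILL."""
    return KILL.search(subject) is not None

print(is_spam("REPLICA Watches at low prices"))  # True
print(is_spam("Reading CSV files in Python"))    # False
```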

2008-04-08

Functional Python Palindromes

To find all palindromes from a list of words in a file, one word per line, you could write a procedural Python program like this:

for row in file('test.txt'):
  s = row.strip()
  if s == s[::-1]:
    print s

Here's a functional Python version, with notes below:

from itertools import imap
filter(lambda s: s == s[::-1], imap(str.strip, file('test.txt')))
  5       3              4       2               1
  1. Create a file iterator.
  2. When we read a line from a file into a string, the string has a trailing newline character (e.g. 'add\n'). We want to remove that trailing newline character, so we use the itertools.imap() function to create a new iterator that applies str.strip() to each line read. The result is an iterator that provides strings without the newline character.
  3. Define an anonymous function using the lambda keyword that returns true if the input string is a palindrome.
  4. Python idiom for returning a reversed sequence (a string is a sequence of characters).
  5. Use the filter function to return a list of palindromes.

Using this input file …

add
dad
dam
mad
made
madam
set

… the result of running the functional script is:

['dad', 'madam']
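For comparison, here is a sketch of the same pipeline in Python 3, where the built-in map() is already lazy (so itertools.imap is gone) and filter() returns an iterator rather than a list. The function name palindromes is my own, and io.StringIO stands in for the input file.

```python
import io

def palindromes(lines):
    """Lazily yield the stripped lines that read the same in both directions."""
    return filter(lambda s: s == s[::-1], map(str.strip, lines))

words = io.StringIO("add\ndad\ndam\nmad\nmade\nmadam\nset\n")
print(list(palindromes(words)))  # ['dad', 'madam']
```

Note that an empty line also equals its own reverse, so blank lines in the input would show up as empty strings in the result.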

2008-04-07

Reading CSV Files in Python

Python has a csv module for reading and writing CSV files (usually exported from Excel or database tables). The basic use of this module is documented in the on-line help. My CSV files usually have a header row, so the idiomatic way to skip it is to open the CSV file and call the file object's next() method immediately:

from csv import reader
f = open("blah.csv", "rb")
f.next()
for row in reader(f):
  print row
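A variation worth considering is to skip the header through the reader rather than the file, so that the header is parsed as a CSV record as well (which matters if it contains quoted fields). The sketch below is Python 3, the read_rows name is hypothetical, and io.StringIO stands in for the opened file.

```python
import csv
import io

def read_rows(stream):
    """Return (header, rows) from a CSV stream whose first record is a header."""
    reader = csv.reader(stream)
    header = next(reader, None)  # None if the stream is empty
    return header, list(reader)

sample = io.StringIO('name,note\nann,"likes, commas"\nbob,plain\n')
header, rows = read_rows(sample)
print(header)  # ['name', 'note']
print(rows)    # [['ann', 'likes, commas'], ['bob', 'plain']]
```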

If your CSV files are pretty simple (e.g. only single-line data, no quotes, etc.), you can use a list comprehension and list slicing:

                   1       2                                      3
for row in [line.strip().split(',') for line in file("blah.csv")][1:]:
  print row

Notes:

  1. You have to remove the trailing "\n" from each line.
  2. Split the input line using the delimiter, typically a comma.
  3. The list comprehension returns all lines, so to ignore the first line, you take a slice of the list starting from the second line.
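The list comprehension above reads the whole file into memory before slicing off the header. If that is a concern, one possible alternative is itertools.islice, which skips the first line lazily. This is a Python 3 sketch with a made-up function name, simple_rows, and it makes the same assumption of simple, unquoted data.

```python
import io
from itertools import islice

def simple_rows(lines, delimiter=","):
    """Lazily split each line after the first on delimiter."""
    # islice(lines, 1, None) skips the header without building a list.
    return (line.strip().split(delimiter) for line in islice(lines, 1, None))

sample = io.StringIO("a,b\n1,2\n3,4\n")
for row in simple_rows(sample):
    print(row)
```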