05 July 2008

Extract Columns From Tabular Text - Cut

A quick way to extract one or more columns from tabular or character-delimited data, such as Web pages or log files, is to use the GnuWin cut command.

Some examples:

  • Print just the bug number and title from a list of bugs in a Web page (e.g. from Bugzilla): cut -f1,8.
  • Print the URLs requested from an Apache log (in common log format): cut -d" " -f7.

The -f switch specifies the column or columns to extract. By default the delimiter is TAB; the -d switch specifies an alternative delimiter.
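
As a rough illustration of the two switches, here is a minimal sketch. The bug list and the log line are made-up sample data, not taken from a real Bugzilla page or server log, and the variable names are only for illustration.

```shell
# Hypothetical sample data: three tab-separated columns (id, status, title).
sample=$(printf '101\topen\tCrash on start\n102\tfixed\tTypo in menu\n')

# -f1,3 selects the first and third columns; TAB is the default delimiter.
first_and_third=$(printf '%s\n' "$sample" | cut -f1,3)
printf '%s\n' "$first_and_third"

# -d" " switches the delimiter to a single space. In this made-up
# common-format log line, the requested URL is the seventh field.
log_line='127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
url=$(printf '%s\n' "$log_line" | cut -d" " -f7)
printf '%s\n' "$url"
```

Note that cut treats every single delimiter character as a field boundary, so two spaces in a row produce an empty field between them.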

One limitation of cut is that you can't specify columns relative to the last column unless you know the index of the last column. If your data has a varying number of columns, such as the path strings printed by the find . -type f command in the example below …

./Profiles/9ls0tqn1.default/blocklist.xml
./Profiles/9ls0tqn1.default/bookmarkbackups/bookmarks-2008-06-14.html

… you can't easily extract just the file name (the last column) in every line.
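
To make the limitation concrete, the sketch below runs cut with a fixed field index over the two paths above, then shows one possible workaround. The workaround is an assumption on my part rather than something cut offers: it hands the job to awk, whose NF variable holds the number of fields on the current line.

```shell
# The two sample paths from above, one per line.
paths='./Profiles/9ls0tqn1.default/blocklist.xml
./Profiles/9ls0tqn1.default/bookmarkbackups/bookmarks-2008-06-14.html'

# A fixed index only works when every line has the same depth: field 4 is
# the file name on the first line but a directory name on the second.
fixed=$(printf '%s\n' "$paths" | cut -d/ -f4)
printf '%s\n' "$fixed"

# Assumed workaround (not a cut feature): awk's NF is the field count of
# the current line, so $NF is always the last field.
last=$(printf '%s\n' "$paths" | awk -F/ '{ print $NF }')
printf '%s\n' "$last"
```

The first command prints blocklist.xml and bookmarkbackups, while the awk version prints the file name for both lines.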

A related command is colrm, which removes character columns from the input. It's quite limited and does the opposite of what I expect, so I haven't used it.

Cut is a simple utility for extracting columns of data; it can't process the column data the way a scripting language can. I'll write a bit more about processing tabular data in future.

See Also