28 July 2008

Extract Lines with Line Numbers using Gawk, Groovy, Perl, Python and Ruby

More ways to extract a block of text from a stream and prepend the line number to each line.

Below is the Gawk version. The built-in variable NR holds the number of the current line and $0 holds its content. In all the examples, replace r1 and r2 with the first and last line numbers to extract.

gawk "(NR >= r1 && NR <= r2) {printf("""%4d %s\n""", NR, $0)}"

The Perl and Ruby scripts are exactly the same. The built-in variable $. holds the number of the current line and $_ holds the text of the current line.

perl|ruby -ne "printf '%4d %s', $., $_ if $. >= r1 && $. <= r2"

The Groovy command line options are similar to the Perl and Ruby version, except that you have to separate -n and -e. The built-in variable count holds the number of the current line and line holds the text of the current line.

groovy -n -e "if (count >= r1 && count <= r2) out.format '%4d %s\n', count, line"

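The Python version is verbose because it needs boilerplate code to iterate through all lines of the input. Note that enumerate() counts from zero, so one is added to get the line number:

python -c "import sys; sys.stdout.write(''.join('%4d %s' % (r + 1, l) for r, l in enumerate(sys.stdin) if r1 <= r + 1 <= r2))"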
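If the one-liner gets too cramped, the same idea can be written as a small function. This is only a sketch in Python 3 (the one-liners in this post were written for Python 2); the name number_lines and its parameters are my own, not part of any library.

```python
def number_lines(lines, first, last):
    """Yield 'NNNN text' for lines whose 1-based number is in [first, last]."""
    for number, text in enumerate(lines, start=1):
        if number > last:
            break  # Nothing further can match, so stop reading early.
        if number >= first:
            yield "%4d %s" % (number, text)


sample = ["alpha\n", "beta\n", "gamma\n", "delta\n"]
for numbered in number_lines(sample, 2, 3):
    print(numbered, end="")
```

To use it as a filter, pass sys.stdin instead of the sample list. Stopping at the last wanted line avoids reading the rest of a large input.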

2008-07-29: Added Groovy version.

27 July 2008

Basic Perl Tk HTTP Server Monitor

Here's a port of my simple Python HTTP Server Monitor to Perl, using the Tkx module to interface with Tk. A minor difference is to use the Tk options database to specify the font of the headers in a configuration file.

# Basic HTTP Server Monitor by Kam-Hung Soh 2008.
use strict;
use warnings;
use Log::Log4perl qw(:easy);
use LWP::Simple;
use Text::CSV;
use POSIX;
use Tkx;

use constant CONFIGURATION_PATH => 'HttpServerMonitor.csv';
use constant LOG_PATH           => 'HttpServerMonitor.log';
use constant OPTION_PATH        => 'HttpServerMonitor.db';
use constant REFRESH_INTERVAL   => 60000; # Milliseconds
use constant TIME_FORMAT        => '%H:%M:%S %d-%m-%y';

sub create_widgets {
  my $logger = get_logger;
  my $app = Tkx::widget->new('.');
  Tkx::wm_title($app, 'HTTP Server Monitor');

  # Header row. Each label is named so the options file can set its font.
  my $col = 0;
  for my $text ('Name', 'Host', 'Port', 'Status', 'Last Check') {
    $app->new_label(-name => 'header' . $col, -text => $text)
      ->g_grid(-row => 0, -column => $col, -padx => 2, -pady => 2);
    $col++;
  }

  # One row per server listed in the CSV file.
  my $row = 1;
  my $csv = Text::CSV->new;
  open(CSV, '<', CONFIGURATION_PATH)
    or $logger->logdie('Cannot open ' . CONFIGURATION_PATH . ": $!");
  <CSV>; # Skip header row
  while (<CSV>) {
    $logger->debug('$. = ' . $.);
    if ($csv->parse($_)) {
      my @field = $csv->fields;
      $col = 0;
      for my $s (@field) {
        $app->new_label(-text => $s)
          ->g_grid(-row => $row, -column => $col, -padx => 2, -pady => 2, -sticky => 'W');
        $col++;
      }
      my ($name, $host, $port) = @field;
      my $key = $host . ':' . $port;
      $::status_label{$key} = $app->new_label(-background => 'yellow', -text => 'unknown');
      $::status_label{$key}->g_grid(-row => $row, -column => $col, -padx => 2, -pady => 2, -sticky => 'W');
      $::time_label{$key} = $app->new_label(-text => strftime TIME_FORMAT, localtime);
      $::time_label{$key}->g_grid(-row => $row, -column => $col+1, -padx => 2, -pady => 2, -sticky => 'W');
    }
    # The header occupies line 1, so the line number just read is the next free grid row.
    $row = $.;
  }
  close CSV;

  $app->new_button(-text => "Refresh", -command => \&refresh)
    ->g_grid(-row => $row, -column => 4, -padx => 2, -pady => 2, -sticky => 'E');
}

sub refresh {
  my $logger = get_logger;
  for my $key (keys %::status_label) {
    my $url = 'http://' . $key;
    # head() returns a true value only if the server answers the HEAD request.
    if (head $url) {
      $::status_label{$key}->configure(-background => 'green', -text => 'up');
    } else {
      $::status_label{$key}->configure(-background => 'red', -text => 'down');
    }
    $::time_label{$key}->configure(-text => strftime TIME_FORMAT, localtime);
  }
  # Schedule the next check.
  Tkx::after(REFRESH_INTERVAL, \&refresh);
}

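# Main program. The start-up steps below are an assumed sequence:
# set up logging, load the Tk options file, build the window, run the
# first check and enter the Tk event loop.
Log::Log4perl->easy_init({ level => $DEBUG, file => '>>' . LOG_PATH });
my $logger = get_logger;
Tkx::option_readfile(OPTION_PATH);
create_widgets;
refresh;
Tkx::MainLoop();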

This script reads the list of servers to monitor from the CSV file named in CONFIGURATION_PATH. The file has three columns: the display name, the host name and the port. The first row is a header and is skipped.


When the script starts, it reads Tk widget configuration from an Xdefaults-style file in OPTION_PATH, such as the one below. Note that according to Options and Tk - A Beginner's Guide, you can't set grid options (that's why the script is peppered with padx and pady options).

*header0.font : -size 10 -weight bold
*header1.font : -size 10 -weight bold
*header2.font : -size 10 -weight bold
*header3.font : -size 10 -weight bold
*header4.font : -size 10 -weight bold

The Tk configuration file is more verbose than I expected. Each widget in Tk belongs to a container, containers can be members of other containers, and all widgets belong to a root container (similar to a file system). In each line of a Tk configuration file, you specify the path to a widget (all text up to the last dot), the option (the text between the last dot and the colon) and the value to use (the text after the colon).

!---- pathname ---+ +option+     +----- value -------+
application.header0.font       : -size 10 -weight bold
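To make the three parts concrete, here is a rough Python 3 sketch that splits one line of the options file the same way. The function parse_option_line is my own helper, not part of Tk, and it is deliberately simplified: it ignores comment lines starting with "!" and patterns where an asterisk is the final separator (such as *font).

```python
def parse_option_line(line):
    """Split 'pathname.option : value' into (pathname, option, value)."""
    # Everything before the first colon names the widget and option.
    name, _, value = line.partition(":")
    # The option is the text after the last dot; the rest is the pathname.
    pathname, _, option = name.strip().rpartition(".")
    return pathname, option, value.strip()


print(parse_option_line("application.header0.font       : -size 10 -weight bold"))
print(parse_option_line("*header0.font : -size 10 -weight bold"))
```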

You can use an asterisk in the widget pathname if you don't care about the container of the widget. However, there's no wildcard for the widget's name, so in this case, I have to enumerate each widget that I want to configure.

13 July 2008

Basic Python Tk HTTP Server Monitor

We had some servers which would occasionally go offline, so I wrote a basic HTTP server monitor using Python and Tkinter (the interface to the Tk GUI library):

# HTTP Server Monitor by Kam-Hung Soh 2008
from csv     import reader
from httplib import HTTPConnection
from logging import basicConfig, error, info, INFO
from os.path import exists
from time    import strftime
from tkFont  import Font
from Tkinter import Button, Frame, Label

CONFIGURATION_PATH = 'HttpServerMonitor.csv'
LOG_PATH           = 'HttpServerMonitor.log'
REFRESH_INTERVAL   = 60000 # Milliseconds
TIME_FORMAT        = '%H:%M:%S %d-%m-%y'
GRID_DEFAULT       = {'padx':2, 'pady':2}

class Application(Frame):
  def __init__(self, master=None):
    Frame.__init__(self, master)
    self.status_label = {}
    self.time_label = {}

  def create_widgets(self):
    for i, s in enumerate(['Name', 'Host', 'Port', 'Status', 'Last Check']):
      Label(self, font=Font(size=10, weight='bold'), text=s).grid(column=i, row=0)

    if not exists(CONFIGURATION_PATH):
      error("Cannot open,%s" % CONFIGURATION_PATH)
      return # Nothing to display without the server list

    f = open(CONFIGURATION_PATH, "rb")
    f.next() # Skip header row
    for r, p in enumerate(reader(f)):
      row_num = r + 1
      for col_num, s in enumerate(p):
        Label(self, justify='left', text="%s" % s).grid(column=col_num, row=row_num, sticky='w', **GRID_DEFAULT)
      host_name, host, port = p
      key = host + ":" + port
      self.status_label[key] = Label(self, background='yellow', text='unknown')
      self.status_label[key].grid(column=col_num + 1, row=row_num, sticky='w', **GRID_DEFAULT)
      self.time_label[key] = Label(self, text='%s' % strftime(TIME_FORMAT))
      self.time_label[key].grid(column=col_num + 2, row=row_num, sticky='w', **GRID_DEFAULT)

    Button(self, text='Refresh', command=self.refresh).grid(column=4, sticky='e', **GRID_DEFAULT)

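  def refresh(self):
    for key in self.status_label.keys():
      label = self.status_label[key]
      h = HTTPConnection(key)
      try:
        # The server counts as up if a TCP connection can be opened.
        h.connect()
        label.config(background='green', text='up')
      except Exception:
        label.config(background='red', text='down')
      finally:
        h.close()
      # Record the time of this check in the 'Last Check' column.
      self.time_label[key].config(text=strftime(TIME_FORMAT))
    self.after(REFRESH_INTERVAL, self.refresh)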

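if __name__ == "__main__":
  # Assumed start-up sequence: set up logging, build the window,
  # run the first check and enter the Tk event loop.
  basicConfig(filename=LOG_PATH, level=INFO)
  app = Application()
  app.master.title('HTTP Server Monitor')
  app.grid()
  app.create_widgets()
  app.refresh()
  app.mainloop()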

This program reads the CSV file named in the CONFIGURATION_PATH constant for a list of servers to monitor. The CSV file has three columns: the display name, the server address and the server's port. The first line of the CSV file is for information only; it is not used by the program. Below is a sample CSV file:

My server,myserver.com,80
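As a rough illustration of how such a file is read, here is a Python 3 sketch (the program above is Python 2). The function read_servers and the header text "Name,Host,Port" are my own assumptions for the example; only the three-column layout and the skipped first line come from the program.

```python
import csv
import io


def read_servers(csv_file):
    """Return a list of (name, host, port) tuples, skipping the header row."""
    rows = csv.reader(csv_file)
    next(rows, None)  # Skip the header row; tolerate an empty file.
    servers = []
    for row in rows:
        if len(row) != 3:
            continue  # Ignore blank or malformed rows.
        name, host, port = (field.strip() for field in row)
        servers.append((name, host, int(port)))
    return servers


# The header line here is hypothetical; any first line would be skipped.
sample = io.StringIO("Name,Host,Port\nMy server,myserver.com,80\n")
print(read_servers(sample))
```

Converting the port to an integer early means a typo in the file fails when it is read rather than at the first check.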

You can define the time interval between checks by modifying the REFRESH_INTERVAL constant. This constant is in milliseconds, not seconds, so don't set too small a value!

If you are using Windows, run it using pythonw HttpServerMonitor.py.

12 July 2008

Extract Columns From Tabular Text - Powershell and Python

Finishing off the different ways to extract columns, here are the PowerShell and Python versions:

foreach-object { $_.Split('<delimiter>')[-1] }

$_ is the current object (or record) in the loop. When processing tabular text, $_ is a .NET String object, so we use its Split() method to divide the input on the <delimiter>. Split() returns a String array, and index -1 refers to the last String (or column) in that array.

python -c "import sys; print ''.join(s.split('<delimiter>')[-1] for s in sys.stdin)"

Unlike Perl or Ruby, Python doesn't have any special command-line support to iterate through all lines of input or split the input, so we have to use this generator hack. Like the PowerShell version, each record (s) is a string, so we use a string's split() function to divide the input into an array and use index -1 to refer to the last column in that array.
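Written out as a function, the generator hack looks something like the sketch below. It is Python 3 rather than the Python 2 of the one-liner, and the name extract_column is my own.

```python
def extract_column(lines, delimiter, index=-1):
    """Yield one column of each line; index -1 means the last column."""
    for line in lines:
        # Drop the trailing newline so it doesn't stick to the last column.
        yield line.rstrip("\n").split(delimiter)[index]


rows = ["a,b,c\n", "d,e\n"]
print(list(extract_column(rows, ",")))
```

A positive index that is past the end of a short row raises IndexError, so this only suits data where the requested column always exists.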

11 July 2008

Extract Columns From Tabular Text - Perl and Ruby

My previous posting described using the GnuWin cut command to extract columns from tabular text data, but you couldn't specify columns relative to the last column. The cut command is pretty easy to use in a command console, so if you want to overcome this limitation without too much additional effort, you could write an ad-hoc script in Perl or Ruby.

A Perl solution: perl -F<delimiter> -ane "print $F[-1]".

A Ruby solution: ruby -F<delimiter> -ane "print $F[-1]".

Both Perl and Ruby have the same command line switches for splitting lines: -n makes the interpreter iterate through all lines of input for the statement specified in the -e switch, the -a switch turns on the auto-split mode and -F changes the character used to split a line. The delimiter must follow -F directly, with no space in between.

All columns in a record are collected in the global F array (@F in Perl, $F in Ruby). In both languages, a single element is written $F[n], so you extract column two using $F[1]. To extract the last column in a record, use $F[-1].

05 July 2008

Browser Usage Forecast

The W3Schools Browser Statistics page shows that in June 2008, 41% of hits came from developers using Firefox and 53.5% from developers using MSIE7 or MSIE6.

What would the list be like at the end of this year? IE7 should cross 30%, IE6 should be about 22% and IE5 may disappear from the list. Unlike IE5, Moz could barely stay on the list because its number of hits is declining more slowly than IE5's. FF may cross 43%, after the jump caused by the release of FF3 in June has been absorbed. Opera and Safari will noodle along at about 5% in total.

Enough crystal ball gazing …

Extract Columns From Tabular Text - Cut

A quick way to extract one or more columns from tabular or character delimited data, such as Web pages or log files, is to use the GnuWin cut command.

Some examples:

  • Print just the bug number and title from a list of bugs in a Web page (e.g. from Bugzilla): cut -f1,8.
  • Print the URLs requested from Apache log (in common format): cut -d" " -f7.

The -f switch specifies the column to extract. By default, the delimiter is TAB and the -d switch specifies an alternative delimiter.
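For readers who don't have cut installed, here is a rough Python 3 sketch of what the -f and -d switches do. The function name cut and the sample log line are my own; it only covers a list of single field numbers, not ranges or the other switches of the real command.

```python
def cut(lines, fields, delimiter="\t"):
    """Yield the selected 1-based fields of each line, like cut -d -f."""
    wanted = sorted(set(fields))  # cut prints fields in input order.
    for line in lines:
        text = line.rstrip("\n")
        columns = text.split(delimiter)
        if len(columns) == 1:
            yield text  # No delimiter found: cut passes the line through.
            continue
        yield delimiter.join(columns[i - 1] for i in wanted if i <= len(columns))


# Made-up log line in Apache common format.
log_line = '192.0.2.1 - - [05/Jul/2008:10:00:00 +1000] "GET /index.html HTTP/1.1" 200 512\n'
print(list(cut([log_line], [7], delimiter=" ")))
```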

One limitation of cut is that you can't specify columns relative to the last column; you have to know the index of the last column. If your data has a varying number of columns, such as the path strings printed by the find . -type f command, you can't easily extract just the file name (the last column) in every line.
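A small Python 3 sketch shows why a fixed column index doesn't work for paths. The paths are invented for illustration and the helper last_field is my own.

```python
def last_field(line, delimiter="/"):
    """Return the text after the final delimiter in a line."""
    return line.rstrip("\n").rsplit(delimiter, 1)[-1]


# Hypothetical output of find . -type f, with paths of different depths.
paths = ["./README\n", "./src/main.c\n", "./src/lib/util/strings.c\n"]

# The number of columns differs per line, so no single -f value fits.
print([len(p.rstrip("\n").split("/")) for p in paths])

# Counting from the end always lands on the file name.
print([last_field(p) for p in paths])
```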

A related command is colrm, which removes character columns from the input. It's quite limited and does the opposite of what I expect, so I haven't used it.

Cut is a simple utility to extract columns of data, and it can't process the column data like a scripting language. I'll write a bit more about processing tabular data in future.
