21 December 2007

PowerShell Associative Arrays and Anagrams

Jon Bentley's Programming Pearls describes the following pipeline for finding anagrams from a list of words: generate a signature for the word, then group together all words with the same signature. The signature is just a sorted list of all letters in a word. For instance, dame, edam, made and mead all have the same signature adem.

Associative Arrays

To implement an anagram-finder in PowerShell, let's use an associative array and we store the signatures in the keys and each word that has the same signature is stored in an array related to that key. Below is a concrete example of what we plan to do:

> $a = @{adem:("dame","edam","made","mead")}
> $a
Name                           Value
----                           -----
adem                           {dame, edam, made, mead}

Notes

  • The key for an associative array does not have to be quoted if it is a string without a whitespace.
  • Oddly, arrays in the value column are printed delimited by braces instead of parentheses.

Generating Word Signatures

A word signature is just a string with the letters in the original word sorted. We split a string into a char[] type, sort it, then make it into a string again:

> $sig = [string]("edam".ToCharArray() | sort-object)
> $sig
a d e m
> $sig.length
7

Note that the signature sig is a string with a space between each character. We can prettify the signature but it doesn't hurt because all the signatures will have the same format.

Find Anagrams in a Word List

Now that we can create a signature, we can find all anagrams in a word list by assigning each word's signature as a key in the associative array's value and adding the word to that key's array:

> get-content <test.txt> | foreach-object { $h = @{} } { $t = $_.clone(); $sig = [string]($t.ToCharArray() | sort-object); if (!$h.containsKey($sig)) { $h[$sig] = @() } $h[$sig] += $t } { $h }

Name                           Value
----                           -----
adem                           {dame, edam, made, mead}

Let's decompose this longish statement to understand what it is doing:

get-content <test.txt> |
Send every line in the input file into a pipeline.
foreach-object
Apply some operation on each object in the pipeline.
{ $h = @{} }
Initialize the associative array h in the begin script block.
{ process block }
Here is where words with the same signature are grouped together in the associative array.
$t = $_.clone();
Copy the input word from the current pipeline object before it is overwritten in the next statement.
$sig = [string]($t.ToCharArray() | sort-object);
Get the signature of the input word.
if (!$h.containsKey($sig)) { $h[$sig] = @() }
Create an array of anagrams if the signature does not already exist.
$h[$sig] += $t
Add the current word to the array of anagrams.
{ $h }
Output the associative array h in the end script block.

Conclusion

This article presented an imperative approach to grouping data with the same property using a loop and an associative array. You could apply the same style to any programming language and get the same result. In a future article, I will explore how to use a more streamlined approach to solve similar problems.

19 December 2007

Code Noise Ratio

Using Scott Hickey's article on reducing code noise in Groovy as a starting point, can we measure of the noisiness of a programming language and environment? Let's say that the quietest code where the developer has to add the least amount boilerplate code to implement a particular feature. Just to keep the metric simple, let's just measure the size of the source files for different implementations of the same feature and assume that the developer is trying to write sensible code. We assume that the shortest version is the quietest and calculate the ratio between the shortest version and all other versions.

Let's test this ratio on several versions of Hello World implemented using different scripting languages in previous articles. What is the noisiness of each implementation?

VersionPlatformSizeRatio
Groovy + SwingBuilderJava2400%
JythonJava2535.42%
GroovyJava27012.50%
IronPython.Net47391.25%
PowerShell.Net579141.25%

What does this table tell us about reducing code noise for small programs?

  • Script environment should pre-import common classes. Groovy + SwingBuilder and plain Groovy makes it very easy to write a small GUI program because all the Java Swing references are pre-imported. The Jython and IronPython implementations are nearly the same but the IronPython version is longer because it has to load .Net references.
  • Use class aliases. One reason the PowerShell version is very long is you can't add a class name into the current namespace (such as Python's from <library> import <class> or C# using namespace <blah>). You can define class aliases like this: $Form = [System.Windows.Forms.Form] but that only starts reducing noise when you use that class more than once.
  • Define properties in constructors. Another reason the PowerShell version is long is that only the .Net constructors for GUI controls are available through the platform interface, so to define a control, you have to write a sequence of statements starting with creating a new object followed by some SetX() methods. I wonder if PowerShell adaptors can overload constructors?

Find Longest (or Shortest) Line in a File

Quick scripts to find the longest line in a file. To find the shortest line, just invert the appropriate test.

Gawk

gawk "{ if (length > max) { max = length; m = $0 } }  END { print m }" <file>

Gawk + Sort + Tail

gawk "{ printf("""%4.0f %s\n""", length,$0) }" <file> | sort | tail -1

In Windows Cmd, three double-quotes are required to escape a double-quote, and the %4.0f format control ensures that lines up to 9999 characters long are sorted correctly.

PowerShell

get-content <file> | sort-object -property length | select-object -last 1

WC

wc -L gives the length of the longest line, but not the line itself.

2008-04-20: Consolidated all samples into this article.

17 December 2007

PowerShell String and Char[] Sort and Conversion

Introduction

I wanted to sort the characters in a string to use as a signature for that string. A string is basically an array of characters but System.String class does not have a Sort() method. Hmm, looks like we have to break the process into the following steps:

  1. Convert a string to a character array.
  2. Sort the character array.
  3. Convert the character array back into a string.

First cut

A hitch: System.String can only be converted into a char[], so we have to use an external sorter such as Sort-Object cmdlet. If we could have made an Array of char, then we could have used the Array's Sort() method. The first cut looks like this:

> $s = "mad"
> $sig = sort-object -inputObject $s.ToCharArray()
> $sig
m
a
d

Eh? Why isn't sig sorted? Either Sort-Object or PowerShell doesn't do the expected when presented with an array using the -inputObject parameter. Let's test this:

> "m","a","d" | sort-object
a
d
m
> sort-object -inputObject "m","a","d"
m
a
d

Second Cut

Because of the strange behaviour found in the previous section, we call Sort-Object in a pipeline. In addition, we reconstruct the output into a string:

> $s = "mad"
> $sig = (string)($s.ToCharArray() | sort-object)
> $sig
a d m
> $sig.length
5

What's wrong now? Why does sig have a whitespace between each character? This was getting a bit deep into PowerShell for me at the moment, so let's use an appropriate constructor in System.String such as String(char[]).

21-Dec-2007: Solution is to change OFS (Output Field Separator) to an empty string, like this: $OFS = "".

Third Cut

We try the System.String(char[]) constructor and make the sorted array output into a string:

> $s = "mad"
> $sig = new-object String(($s.ToCharArray() | sort-object))
New-Object : Exception calling ".ctor" with "3" argument(s): "Index was out of range. Must be non-negative and less than the size of the collection.
Parameter name: startIndex"
At line:1 char:16
+ $t = new-object  <<<< String($s.ToCharArray())

What gives? The error indicates that a different constructor, String(char[], startIndex, length) should be used. It's not clear why the first constructor is not available.

Fourth Cut

Finally, taking into account all that we learnt above, we end up with the following statement which gives us the result we wanted:

> $s = "mad"
> $sig = new-object String(($s.ToCharArray() | sort-object), 0, $s.length)
> $sig
adm

Conclusion

The resulting statement to sort characters in a string is mostly noise because the purpose of the statement statement is obscured by the need to convert an object from one type to another. In an earlier article about palindromes, another way to make a string from an character array is to use the static method [string]::Join(). That is …

> $sig = [string]::Join("", ($s.ToCharArray() | sort-object))

The second method is shorter but still rather obscure because it relies on the side-effect of the empty string argument when calling the Join() method. It's a rather disappointing end to this exercise because I spent most of the time fighting instead of using PowerShell.

19-Dec-2007. PowerShell 2.0 will have a new Join operator that should make this exercise moot.

21-Dec-2007. Fourth method is to change OFS first, leading to:

> $OFS = ""
> $s = "mad"
$gt; $sig = [string]($s.ToCharArray() | sort-object)
> $sig
adm
> $sig.length
3

15 December 2007

Two Hello World Windows in Groovy

Groovy is scripting language on the Java platform. Groovy is interesting because it can be compiled into Java byte-code and makes heavy use of closures.

Groovy Hello World

In the beginning, I ported my Jython version into Groovy. It looks pretty much the same as the Jython version:

// Hello World in Groovy
f = new javax.swing.JFrame("Hello World")
f.setSize(170,70)
f.contentPane.layout = new java.awt.FlowLayout()
f.defaultCloseOperation = javax.swing.JFrame.EXIT_ON_CLOSE
f.add(new javax.swing.JLabel("Label me:"))
f.add(new javax.swing.JButton("Press me"))
f.show()

Groovy + SwingBuilder Hello World

Groovy has a helper class called SwingBuilder to make it much easier to write Swing applications. Here's one way to re-write the program:

// Hello World Window in Groovy + SwingBuilder
sb = new groovy.swing.SwingBuilder()
f = sb.frame(
  title:"Hello World"
  ,size:[170,70]
  ,defaultCloseOperation: javax.swing.JFrame.EXIT_ON_CLOSE) {
  flowLayout()
  label(text:"Label me:")
  button(text:"Press me")
}
f.show()

SwingBuilder is quite wonderful because it allows you to define a GUI with a minimum of code noise. The definition of a JFrame probably maps a keyword X to a setX() function but I haven't worked out how SwingBuilder converts words such as flowLayout(), label and button to Swing objects and functions.

PowerShell Hello World Window

Here's an attempt at a Hello World Window using PowerShell:

# PowerShell Hello World Window
[System.Reflection.Assembly]::LoadWithPartialName("System.Windows.Forms")

$b = new-object System.Windows.Forms.Button
$b.Text = "Press Me"

$l = new-object System.Windows.Forms.Label
$l.Text = "A label:"
$l.Size = new-object System.Drawing.Size(50,20)
$l.TextAlign = [System.Drawing.ContentAlignment]::BottomLeft

$p = new-object System.Windows.Forms.FlowLayoutPanel
$p.Controls.Add($l)
$p.Controls.Add($b)

$f = new-object System.Windows.Forms.Form
$f.Text = "Hello World"
$f.Size = new-object System.Drawing.Size(160,70)
$f.Controls.Add($p)
$f.ShowDialog()

Mini Observations

  • System.Drawing namespace and assembly is loaded with System.Windows.Forms at the start.
  • Unlike IronPython, only the actual .Net constructors for System.Windows.Forms classes are available, so you have to set properties of objects in separate statements after an object is created.

14 December 2007

IronPython and Jython Hello Windows

The Python language has been ported to two major virtual machine platforms: IronPython for Microsoft .Net and Jython for Java. To get a flavour of these implementations, here are two Hello World scripts that open a window containing a label and a button.

IronPython Hello World Window

import clr
clr.AddReference("System.Drawing")
from System.Drawing import ContentAlignment, Size
clr.AddReference("System.Windows.Forms")
from System.Windows.Forms import Button, FlowLayoutPanel, Form, Label

p = FlowLayoutPanel()
p.Controls.Add(Label(Text="A label:", Size=Size(50,20), TextAlign=ContentAlignment.BottomLeft))
p.Controls.Add(Button(Text="Press Me"))
f = Form(Text="Hello World", Size=Size(160,70))
f.Controls.Add(p)
f.ShowDialog()

Jython Hello World Window

# Jython Hello Window
from java.awt import FlowLayout
from javax.swing import JButton, JFrame, JLabel

f = JFrame("Hello", defaultCloseOperation=JFrame.DISPOSE_ON_CLOSE, size=(170,70), layout=FlowLayout())
f.add(JLabel("A label:"))
f.add(JButton("Press Me"))
f.show()

Mini Observations

  • IronPython makes loading the .Net libraries explicit by using the clr module.
  • Both implementations allow setX functions in object constructor's argument list.

12 December 2007

Sax BASIC Initialize Array

If you're using Sax BASIC 6.4.x, you can only initialize an array of Variant using the Array(), not any other type. For example:

Option Explicit
Dim s1() As Variant
s1 = Array("a", "b", "c") 'Works
Dim s2() As String
s2 = Array("a", "b", "c") 'Fails with (10902) Builtin function/instruction is not implemented

10 December 2007

PowerShell Where-Object Predicates

In an earlier article about palindromes , I wrote an if-statement in the script block of the Where-Object cmdlet. While that's fine when the test is simple, the purpose of the Where-Object is obscured when the script block spans several lines, and it is not possible to re-use the code in the script block.

For example, let's find all prime numbers from a list of numbers. If you follow a top-down programming style, your code might look like this:

> 1..20 | Where-Object { Test-Prime $_ }

Now we have to supply a predicate Test-Prime that returns True if a number is prime, and False otherwise. Below is an example using Eratosthenes Sieve:

> function Test-Prime { 
  $isprime = $True
  foreach ($i in 2..([math]::Sqrt($args[0]))) {
    if ($args[0]%$i -eq 0) {
      $isprime = false
      break
    }
  }
  if ($isprime) {
    $args[1] += $args[0]
    return $True
  } else {
    return $False
  }
}

Test-Prime requires an array of existing primes, so the resulting script block is slightly different from our original plan. Oh well, can't plan everything in all the time. Let's make the modification and do a test run:

> $primes = (1,2); 1..20 | Where-Object { Test-Prime $_ $primes }
3
5
7
11
13
17
19

Not too shabby. It would be nice if Test-Prime can maintain its own list of prime numbers instead of relying on the caller to create it. I'll revisit this problem in future when I study more about PowerShell.

Convert Number Bases

Convert a number from base-10 to another base using the .Net Convert.ToString(number, target-base) static function. Test in PowerShell:

> [Convert]::ToString(15,16)
f
> [Convert]::ToString(15,2)
1111

To convert from base-n to base-10, use .Net Convert.ToInt32(String, source-base). There are also ToInt16() and ToInt64() methods. Examples using ToInt32() below:

> [Convert]::ToInt32("f",16)
15
> [Convert]::ToInt32("1111",2)
15

If you use the wrong source-base, PowerShell shows this error:

Exception calling "ToInt32" with "2" argument(s): "Could not find any recognizable digits."
At line:1 char:19
+ [Convert]::ToInt32( <<<< "f",2)

Note: Don't mix this class with DOS convert command which converts a filesystem from FAT32 to NTFS!

09 December 2007

Find Duplicate Objects in PowerShell

You can find unique lines in a file using Get-Unique but what if you want to know which lines are duplicates? Below are two variations of a one-liner that finds duplicate lines. In both cases, an object is added into an array $a if it does not already exist in that array ($a -contains …). If the object exists in the array (i.e. not unique), it is printed out. A side-effect is that the unique lines are available in $a after the statement is finished.

> $a = @(); foreach ($l in Get-Content <file>) { if ($a -contains $l) { $l } else { $a += $l } }
> Get-Content <file> | Foreach-Object { $a = @() } { if ($a -contains $_) { $_ } else { $a += $_ } }

The first version is implemented using the foreach statement. The $a array is initialized in the first statement, then the file is read into $l one line at a time in the second statement. It's actually a short two-liner.

The second version is implemented using the Foreach-Object cmdlet in a command pipeline. The Foreach-Object can have three command blocks: beginning, middle and ending. The $a array is initialized in the beginning command block ($a = @()) and the test and output executed in the middle command block (if () …). The ending command block is not used in this example. This version is more tidy because only one variable, $a has to be created; the other version requires another variable $l.

08 December 2007

Powershell Palindromes

A palindrome is a word or phrase that reads the same forwards or backwards. A well-known example is Madam, I am Adam. Another example is Weird Al Yankovic's song Bob, which is composed of rhyming palindromes. If you have a list of words in a file, here's how you can find palindromes using a one-liner in PowerShell.

> Get-Content <file> | Where-Object {[string]::Join("",$_[-1..-$_.Length]) -eq $_ }
…
aha
ala
dud
mm
mum
pap
pip
…

The one-liner translates to: for every line in an input file, display only lines which are the same forward and backward.

The complex part of this pipeline is the filter in Where-Object. In this filter, we …

  1. Convert the input string, $_, into an array of characters in reverse order: $_[-1..-$_.Length].
  2. Convert the array back into a string by using the Join() function with an empty separator: [string]::Join("",<array>).
  3. Compare the reversed and original strings using <s1> -eq <s2>.

Where-Object only outputs a line when comparison operator returns True.

Slicing PowerShell Arrays and Ranges

Introduction

You can slice (or extract elements) from an array in PowerShell in several ways. Let's start with a simple array of numbers. We start with 0, because PowerShell's array indices start from 0 and to make it easier to see the effect of each statement. We use the Write-Host cmdlet where necessary to keep the number of lines in the output to a minimum.

> $a = 0,1,2,3,4,5
> $a.length
6
> Write-Host $a
0 1 2 3 4 5

The rest of this article describes how to obtain a single item, multiple items and a reverse slice, and how to combine an index list with a range.

Get an Item

Let's retrieve an item from different indices in our array, $a:

> $a[0]
0
> $a[$a.length]
> $a[-$a.length-1]
> $a[$a.length-1] -eq $a[-1]
True
> $a[$a.length-2] -eq $a[-2]
True

The first item is in index 0 and the last index is $a.length-1. Also, $a.length-x is the same as -x. If you try to access an index value greater than $a.length-1 (e.g. 6) or less than -$a.length (e.g. -7), you will get nothing.

Get Arbitrary Set of Items

We can also retrieve multiple items and items more than once.

> Write-Host $a[0,-1,5,2,4,1]
0 5 5 2 4 1

Get Multiple Items Using a Range

We can also retrieve a range of values using the .. (dot-dot) operator.

> Write-Host $a[0..2]
0 1 2
> Write-Host $a[2..5]
2 3 4 5
> Write-Host $a[0..($a.length-1)]
0 1 2 3 4 5

We enclose the arithmetic expression in parentheses otherwise the command processor generates the error below. It seems like 0..$a.length-1 is expanded into an range and the length method is called for the last number. The error seems a bit unexpected because I would have thought that the binary minus operator had a higher precedence than the range operator. Also, the number converted into System.Object, doesn't have a length method, and PowerShell would have reported that error instead of an error about the op_Subtraction method.

> write-host $a[0..$a.length-1]
Method invocation failed because [System.Object[]] doesn't contain a method named 'op_Subtraction'.
At line:1 char:28
+ write-host $a[0..$a.length-1 <<<< ]
>> write-host 0..$a.length-1
0..0 1 2 3 4 5.length-1

Get a Reversed Slice

We can also get a reversed slice of the array by specifying the right index before the left index in the range.

> Write-Host $a[5..2]
5 4 3 2
> Write-Host $a[($a.length-1)..0]

We know that -1 represents the last item, so can we simplify the second statement?

> Write-Host $a[-1..0]
5 0

What went wrong? How does PowerShell interpret -1..0?

> -1..0
-1
0
> Write-Host $a[-1] $a[0]
5 0

The range operator expanded -1..0 to 5 and 0. Is there another way specify a reverse slice of the original array? Let's try the following line of reasoning.

  1. From first section, $a.length-x refers to the same item as -x
  2. Replace x with $a.length
  3. So $a.length-$a.length (or 0) refers to the same item as -$a.length
> Write-Host (-1..-$a.length)
-1 -2 -3 -4 -5 -6
> Write-Host $a[-1..-$a.length]
5 4 3 2 1 0

Combining Numeric and Range Indices

Since an array slice can be specified by either a list of numbers or a range, can we combine them?

> Write-Host $a[0,3,2..5]
Cannot convert "System.Object[]" to "System.Int32".
At line:1 char:22
+ Write-Host $a[0,3,2..5 <<<< ]
> 0,3,2..5
Cannot convert "System.Object[]" to "System.Int32".
At line:1 char:8
+ 0,3,2..5 <<<<
> Write-Host 0,3,(2..5)
0 3 2 3 4 5
> Write-Host $a[0,3,(2..5)]
0 3

In the first and second statements, the comma array constructor operator has higher precedence than the range operator, so PowerShell could not create the list from 2..5. In the third statement, we add parentheses to modify the precedence so that the range operator creates a list 2 3 4 5.

However, in the fourth statement, it seems like PowerShell has ignored the list generated by the range operator because only two items are displayed. You can combine a range with a list of indices using the + (Plus) operator. The syntax is a little inconsistent because you can produce an index list without this operator:

> Write-Host $a[0,3+2..5]
0 3 2 3 4 5

Furthermore, you have to use the Plus operator before and after a range:

> Write-Host $a[0,3+2..5,1]
Cannot convert "System.Object[]" to "System.Int32".
At line:1 char:22
+ write-host $n[0,3+2..5 <<<< ,1]
> write-host $n[0,3+2..5+1,4]
0 3 2 3 4 5 1 4

Conclusion

This article has demonstrated how to slice arrays in PowerShell, starting from single item slices, progressing to multiple items, using the range operator and described two approaches to obtain a reversed list of items from an array. We also found that you have to use the Plus operator to combine the comma array constructor operator and range operator to get an array slice.

Renesis SVG Plug-in

It's old news that Adobe no longer develops its SVG Viewer plug-in. The only browser that really required an SVG plug-in is MSIE; Firefox, Opera and Safari browsers support SVG. If you're still using MSIE and want to view SVG, Examotion has a pre-release Renesis SVG plug-in.

I viewed some pages containing SVG with Renesis and found Renesis can display static images and Javascript animation in an <object> tag. Renesis doesn't currently support in-line SVG (essentially SVG code in XML).

07 December 2007

Outlook 2003 Smart Quotes and Right Justified Plain Text

Outlook 2003 mail editor has some undocumented shortcut keys for formatting plain text messages (!) that I accidentally triggered.

  • Ctrl+R right justifies the current paragraph and all new text. To restore your message back to left-justified formatting, select Format / HTML then select Format / Plain Text menu items. The shortcut key and menu item to left-justify a paragraph aren't available in plain text mode.
  • Ctrl+Shift+' toggles the Smart Quote (or Curly Quote) feature, where a leading and trailing quotes are ‘ ’ and “ ”.

These shortcut keys and features aren't useful in plain text mode. On the other hand, the useful shortcut key, Shift+F3 to switch case of selected text, doesn't work.

04 December 2007

PowerShell Directories, Collection and Range Operator

What are the biggest files in a directory? Using the Select-Object Cmdlet provides this command chain:

Get-ChildItem c:\windows\*.* | Sort-Object length -descending | Select-Object -first 3

In the article, the first two commands produce a collection and Select-Object picks the first three items. It's a little tedious having long command chains, so can we make it shorter? Since we have a collection, we can access its items using the array syntax and an index:

(Get-ChildItem c:\windows\*.* | Sort-Object length -descending)[3]

Using the dot-dot range operator (..), we can generate a list of indices using [0..2] (collections start with a 0 index) and reproduce the effect of Select-Object -first 3:

(Get-ChildItem c:\windows\*.* | Sort-Object length -descending)[0..2]

What about getting last 3 items using Select-Object -last 3? It can be replaced by [-3..-1]. Note that the last item in a collection starts from -1.

(Get-ChildItem c:\windows\*.* | Sort-Object length -descending)[-3..-1]

01 December 2007

Making Paragraphs With Vim

E-books from Project Gutenberg are broken into 70-character lines. Fixed length lines can be hard to read on a mobile phone, so I wanted to let the phone's viewer break the lines and use only a blank line between paragraphs. Here's some text from Arthur Conan Doyle's A Scandal in Bohemia originally fro Project Gutenberg:

"Wedlock suits you," he remarked. "I think, Watson, that you have
put on seven and a half pounds since I saw you."

"Seven!" I answered.

"Indeed, I should have thought a little more. Just a trifle more,
I fancy, Watson. And in practice again, I observe. You did not
tell me that you intended to go into harness."

Below is how this text should become (paste the before and after text into your text editor to see the difference):

"Wedlock suits you," he remarked. "I think, Watson, that you have put on seven and a half pounds since I saw you."

"Seven!" I answered.

"Indeed, I should have thought a little more. Just a trifle more, I fancy, Watson. And in practice again, I observe. You did not tell me that you intended to go into harness."

The following command in Vim did the required conversion: %s/\(.\)\n\(.\)/\1 \2/. Basically, for every line, replace any new line character between any two characters with a whitespace.