01 January 2008

PowerShell Group-Object and Anagrams

In an earlier article, we used an associative array to group words with the same property (in this case, the same set of letters) to find anagrams. While that solution worked, it seemed to me that there should be an easier solution using the Group-Object cmdlet.

> "add", "dad", "dam", "mad", "made", "madam", "set" | group { $_.toCharArray() | sort-object }

Count Name                      Group
----- ----                      -----
    2 a d d                     {add, dad}
    2 a d m                     {dam, mad}
    1 a d e m                   {made}
    1 a a d m m                 {madam}
    1 e s t                     {set}
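To see what the script block is computing, it can help to work out the signature of a single word by hand. This is a minimal sketch: `Get-Signature` is a hypothetical helper name, not a built-in cmdlet, and joining the sorted letters into one string is a variation on the article's script block, which returns them as an array (hence the space-separated Name column above).

```powershell
# Hypothetical helper: returns the letters of a word in sorted order.
function Get-Signature([string] $word) {
    # Sort the characters, then glue them back into a single string.
    [string]::Join('', ($word.ToCharArray() | Sort-Object))
}

Get-Signature 'dam'
Get-Signature 'mad'   # same signature as 'dam', so the two words are anagrams
```

Two words land in the same group exactly when their signatures are equal.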

Looking good, so let's try a bigger set of words in a file:

> get-content test.txt | group-object { $_.toCharArray() | sort-object }

Count Name                      Group
----- ----                      -----
    2 a d d                     {test.txt, test.txt}
    2 a d m                     {test.txt, test.txt}
    1 a d e m                   {test.txt}
    1 a a d m m                 {test.txt}
    1 e s t                     {test.txt}

That's mighty weird. For some reason, the Group column shows the name of the file rather than the actual words, even though the signature in the Name column is computed correctly. Is the problem in the expression passed to group-object? Let's try a simpler expression:

> get-content test.txt | group-object { $_.length }

Count Name                      Group
----- ----                      -----
    5 3                         {test.txt, test.txt, test.txt, test.txt...}
    1 4                         {test.txt}
    1 5                         {test.txt}

It's very puzzling: group-object seems to be treating each file, rather than each word, as its input. But then, why is the expression being computed for each word?

Even stranger, assigning the contents of the file to a variable first gives the same result!

> $l = get-content test.txt
> move-item test.txt test2.txt # Ensure the original file is no longer available.
> $l | group-object {$_.length}

Count Name                      Group
----- ----                      -----
    5 3                         {test.txt, test.txt, test.txt, test.txt...}
    1 4                         {test.txt}
    1 5                         {test.txt}

In this case, I would have thought that group-object would operate on a list of words and not refer to the original file.
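One plausible explanation, not confirmed here against the PowerShell 1.0 formatter, is that Get-Content does not return bare strings: it attaches provider note properties such as PSPath and PSChildName to every line, and those travel with the strings into the variable. The sketch below only shows that the properties exist. The file name `anagram-demo.txt` is made up for illustration.

```powershell
# Write a small sample file to the temp directory (hypothetical file name).
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'anagram-demo.txt'
Set-Content -Path $path -Value 'add', 'dad', 'set'

# Each line comes back as a string carrying extra note properties.
$lines = Get-Content $path
$propertyNames = $lines[0] | Get-Member -MemberType NoteProperty | ForEach-Object { $_.Name }
$propertyNames          # includes PSChildName, PSPath and ReadCount

# PSChildName holds the file name, even though the value itself is just the word.
$lines[0].PSChildName
```

If the Group column is rendered from one of these properties rather than from the string value, that would account for the file name appearing even after the file has been moved.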

Later … .NET has a method, File.ReadAllLines(), that returns an array of strings (string[]), so the following works a treat:

> [System.IO.File]::ReadAllLines("C:\temp\download\doc\language\test.txt") | group-object {$_.ToCharArray() | sort-object}

Count Name                      Group
----- ----                      -----
    2 a d d                     {add, dad}
    2 a d m                     {dam, mad}
    1 a d e m                   {made}
    1 a a d m m                 {madam}
    1 e s t                     {set}
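The full path in that command matters: .NET methods resolve a relative path against the process's working directory, which is not necessarily PowerShell's current location. The sketch below shows one way to keep using a relative name by resolving it first. It creates its own sample `test.txt` in the temp directory, and the Where-Object filter, which keeps only groups that actually contain anagrams, is an addition for illustration.

```powershell
# Set up a throwaway word list in the temp directory (sample data for illustration).
$dir = [System.IO.Path]::GetTempPath()
Set-Content -Path (Join-Path $dir 'test.txt') -Value 'add', 'dad', 'dam', 'mad', 'made', 'madam', 'set'
Push-Location $dir

# Resolve the relative name against PowerShell's current location first.
$fullPath = (Resolve-Path 'test.txt').ProviderPath

# Group by sorted letters and keep only groups with more than one word.
$anagrams = [System.IO.File]::ReadAllLines($fullPath) |
    Group-Object { $_.ToCharArray() | Sort-Object } |
    Where-Object { $_.Count -gt 1 }

# List the words in each remaining group.
$anagrams | ForEach-Object { $_.Group }
Pop-Location
```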

At least PowerShell's integration with the .NET Framework makes it possible to solve a problem when the built-in cmdlets don't work as you expect.

2-Jan-2008. If you're using PowerShell 2.0 CTP, the Get-Content version works.