Software Salariman: GnuWin

2013-09-20

Concatenating two specific lines in sed

I wanted to concatenate specific rows of data so that I could analyse and chart them in Excel. My data had rows starting with a date and a keyword:

01/01/2013
12
46
79
xyz 0.1
02/01/2013
56
23
xyz 0.5

And the output I wanted is:

01/01/2013 xyz 0.1
02/01/2013 xyz 0.5

Below is a sed script to concatenate those two lines and ignore the others:

# Use: sed -n -f Concat.sed test.log
# H Hold <- Hold + \n + Pattern
# h Hold <- Pattern
# g Pattern <- Hold
# First pattern
/..\/..\/..../ {
 h
}
# Second pattern
/xyz/ {
 H
 g
 s/\n/ /
 p
}

The "trick" is to collect the rows that I need in the hold space first, then copy them to the pattern space, where the line break can be replaced by a space.
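For comparison, here is a rough Python sketch of the same idea. The function name and the `held` variable are my own illustration; `held` plays the role of sed's hold space. Unlike the sed script, it skips a keyword line that appears before any date line.

```python
import re

DATE = re.compile(r"../../....")   # same loose date pattern as the sed script
KEYWORD = re.compile(r"xyz")

def concat_date_and_keyword(lines):
    """Yield 'date keyword-line' strings and ignore every other line."""
    held = None                        # stands in for sed's hold space
    for line in lines:
        line = line.rstrip("\n")
        if DATE.search(line):
            held = line                # like h: remember the date line
        if KEYWORD.search(line) and held is not None:
            yield held + " " + line    # like H, g, s/\n/ /, p

sample = ["01/01/2013", "12", "46", "79", "xyz 0.1",
          "02/01/2013", "56", "23", "xyz 0.5"]
print("\n".join(concat_date_and_keyword(sample)))
```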

2013-09-13

Portable GnuWin32

How to make GnuWin32 portable.

Install all the GnuWin32 packages you require on a "build" computer. The packages are normally installed in a "GnuWin32" folder.
Copy the GnuWin32 folder from your "build" computer to your portable drive or target computer.
On your target computer, add the path to the GnuWin32\bin folder to your user (not system) PATH environment variable.

(Moving to portable applications to save time installing programming tools each time I have to work on another computer, especially when I don't have administrator rights.)

2013-06-26

Sed two-line match and replace using pattern and hold space

Sed is a line-oriented text processing tool, so matching and replacing a two-line pattern requires accumulating lines in the hold space, then testing whether those lines should be changed. Today, I had a chance to use this feature to remap some Hyperion Financial Management (HFM) journal files.

HFM journal files have the following heading format. The scenario and year lines only appear once in a journal file.

!Scenario=s
!Year=yyyy

The task was to remap journal files for only one scenario and year to another scenario. Below is the resulting sed script.

# x = Exchange Pattern and Hold
# H = Hold = Hold + \n + pattern
/!Scenario=/ {
 x
}
/!Year=2010/ {
 H
 x
 s/!Scenario=S1\n!Year=2010/!Scenario=S2\n!Year=2010/
}

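Not obvious: in the first rule, x swaps the scenario line into the hold space and leaves the pattern space empty, so sed prints only a blank line at that point. The second rule then outputs the scenario and year lines together, whether or not they were changed. Note that the second rule only fires for !Year=2010, so the script should only be run on journal files for that year; for any other year the scenario line stays in the hold space and is never printed.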
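The same two-line test can be sketched in Python by looking back at the previous line when the year line arrives. The function name and its parameters are hypothetical, chosen for illustration only, and this sketch returns every input line.

```python
def remap_scenario(lines, old="S1", new="S2", year="2010"):
    """Rename the scenario only when the following line is the wanted year."""
    out = []
    for line in lines:
        is_wanted_year = (line == "!Year=" + year)
        if is_wanted_year and out and out[-1] == "!Scenario=" + old:
            out[-1] = "!Scenario=" + new   # rewrite the line already collected
        out.append(line)
    return out

print(remap_scenario(["!Scenario=S1", "!Year=2010"]))
```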

2013-01-31

tr: extra operand

Kept getting this error when trying to delete Ctrl+Z (octal 032) from a file:

tr -d \032 file.txt
tr: extra operand `file.txt'
Only one string may be given when deleting without squeezing repeats.
Try `tr --help' for more information.

I keep forgetting that tr reads only from standard input and does not accept a file name, so the solution is to redirect the input: tr -d \032 < file.txt.
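If Python is at hand, the same deletion is a one-line byte replacement. This is only a sketch; the function name is mine, and octal 032 is written as hex 1A.

```python
def strip_ctrl_z(data: bytes) -> bytes:
    """Remove every Ctrl+Z byte (octal 032, hex 1A) from the data."""
    return data.replace(b"\x1a", b"")

print(strip_ctrl_z(b"abc\x1adef\x1a"))
```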

2012-11-10

Quick Play with Pattern and Hold Spaces in Sed

I had some rows of data that required two changes and both changes had to appear on separate lines. For example, if the input was ...

123
456
789

... and the requirement was to find '456', replace '5' with 'a' on one line and 'b' on another line so that the output would look like this:

123
4a6
4b6
789

The sed script below applies the first change, prints it, applies the second change and lets sed print the pattern space at the end of the block.

/456/ {
  s/5/a/p   # pattern space = 4a6, print pattern space
  s/a/b/    # pattern space = 4b6
}           # sed prints the pattern space, 4b6

In the script above, notice that the input string was first modified from '456' to '4a6' so the second statement has to replace 'a' with 'b'.

Another approach is to create two lines first (i.e. 456\n456) then replace the first '5' with 'a' and replace the second '5' with 'b'.

/456/ {
  h        # hold space = pattern space, 456
  H        # hold space = 456\n456 (456 + \n + pattern space)
  g        # pattern space = hold space, 456\n456
  s/5/a/   # pattern space = 4a6\n456
  s/5/b/   # pattern space = 4a6\n4b6
}          # sed prints the pattern space, '4a6\n4b6'

The script above uses sed's hold space to construct the two lines first. I don't know if the second script is clearer than the first but it was a simple problem that I could use to play with sed's pattern and hold spaces.
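For what it's worth, the second approach (make two copies, then change each one) maps naturally onto a small Python generator. This is just a sketch with a made-up function name, not a replacement for the sed scripts.

```python
def expand(lines):
    """Emit two edited copies of each line containing '456'."""
    for line in lines:
        if "456" in line:
            yield line.replace("5", "a", 1)   # first copy: 5 -> a
            yield line.replace("5", "b", 1)   # second copy: 5 -> b
        else:
            yield line

print("\n".join(expand(["123", "456", "789"])))
```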

2012-09-08

Convert DOS Lines Into One Comma-Separated Line

Note to myself: Convert multiple DOS lines into one comma-separated line using this GnuWin command: tr -s "\r\n" ,. DOS lines use two control characters, \r\n, to mark the end of a line. tr maps both characters to a comma, and the -s option squeezes each resulting run of commas into a single comma.
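To see what the translate-then-squeeze steps do, here is a small Python sketch of the same transformation. The function name is hypothetical. Like tr, it leaves a trailing comma when the input ends with a line break.

```python
import re

def join_dos_lines(text):
    """Mimic tr -s "\\r\\n" , : map CR and LF to commas, then squeeze runs."""
    translated = text.translate(str.maketrans("\r\n", ",,"))
    return re.sub(r",+", ",", translated)

print(join_dos_lines("a\r\nb\r\nc\r\n"))
```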

2012-06-23

Flatten or Collapse Excel Multi-column Data Into One Dimension

You can flatten or collapse multi-column Excel data into one row or column using GnuWin utilities.

Copy the worksheet data into the clipboard.
Open a CMD window and enter this chain of commands: getclip | tr -s [:cntrl:] \n | putclip.
Paste the data back into your worksheet. You should get a column of data.

The getclip-putclip pair of programs gets and puts data in the system clipboard and is part of the Cygutils package in GnuWin32. tr translates control characters (e.g. TABs) to NEWLINEs and the -s option squeezes out repeated characters. This chain of commands works because the columns of Excel data are separated by TAB characters in the clipboard. Here are more examples of this pattern.

If you want to remove duplicates in your data, insert sort -u into the chain: getclip | tr -s [:cntrl:] \n | sort -u | putclip. For example, if you start with this input (note the trailing TABs in rows 1 and 2) ...

... you end up with the following:

If you want a single row output, replace all control characters with TABs: getclip | tr -s [:cntrl:] \t | putclip. In this case, you cannot include sort because it sorts lines of data and there is only one line in the output.

1 2 4 3 5 7 4 6 8 5 7 9
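The tr step can be sketched in Python as "replace every run of control characters with one separator". The function name and the sample text are my own illustration, and the character range assumes the C locale's definition of [:cntrl:] (codes 0 to 31 and 127).

```python
import re

def flatten(clip_text, sep="\n"):
    """Mimic tr -s [:cntrl:] SEP on text copied from a worksheet."""
    return re.sub(r"[\x00-\x1f\x7f]+", sep, clip_text)

# Illustrative clipboard text: TABs between cells, CR-LF between rows,
# and a trailing TAB (an empty cell) in the first two rows.
clip = "1\t2\t\r\n3\t4\t\r\n5\t6\r\n"
print(flatten(clip))         # one value per line
print(flatten(clip, "\t"))   # a single TAB-separated row
```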

2011-11-11

Grep -f: No blank lines

If you search for a list of patterns in a data file using grep like below, don't include a blank line in the pattern file otherwise grep prints the entire data file.

grep -f patterns.txt data.txt

It's as if you had typed:

grep "" data.txt

See GNU Grep Manual
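The effect is easy to reproduce in Python, because an empty regular expression also matches every string. This sketch uses Python's re module rather than grep's own pattern syntax, and grep_f is a made-up helper name.

```python
import re

def grep_f(patterns, lines):
    """Return the lines that match at least one of the patterns."""
    return [line for line in lines
            if any(re.search(p, line) for p in patterns)]

data = ["apple", "banana", "cherry"]
print(grep_f(["an"], data))       # only the matching line
print(grep_f(["an", ""], data))   # the empty pattern matches everything
```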

2011-02-12

Negate Numbers as String in Gawk

I had to negate numbers in a data file. My initial script simply negated the required field. Here is a test:

gawk "{ print -$1 }"
123
-123
123.45678
-123.457

I wanted to preserve all the digits after the decimal point so I tried printf() (the multiple double-quotes are there to escape double-quotes in Windows CMD):

gawk "{ printf("""%10lf\n""", -$1) }"
123
-123.000000
123.45678
-123.456780
123.456
-123.456000

Hm, now the result was padded on the right with zeros to printf's default six decimal places. Since I only wanted to negate the values and not do any arithmetic, a trick is to add or delete the leading minus sign in the input string, depending on whether the input value is positive or negative, respectively:

gawk "{ print $1<0 ? substr($1, 2) : """-""" $1 }"
123
-123
123.45678
-123.45678
-123
123
-123.45678
123.45678
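The same string trick in Python, as a sketch. The function name is mine, and it tests for a leading minus sign rather than comparing the value with zero as the gawk one-liner does.

```python
def negate(text):
    """Flip the sign of a number held as a string, keeping every digit."""
    return text[1:] if text.startswith("-") else "-" + text

for value in ["123", "123.45678", "-123", "-123.45678"]:
    print(negate(value))
```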

2009-04-20

Outlook 2003 Paste Special disabled

When you write or respond to an HTML-formatted message in Outlook 2003, and paste some text from another source (e.g. a Web page), the pasted text looks out of place because Outlook uses the string's original formatting, which is almost always different from the formatting in the message. If, like me, you find multiple fonts in a message ugly, you'd want to paste the text into the message without any formatting, as in MS Word's Paste Special command. However, when editing a message in Outlook 2003 in HTML format, the Edit / Paste Special menu item is disabled. According to this thread, it's only enabled if you use MS Word as your editor!

One workaround is to use GnuWin32 commands and a pipe: getclip | putclip. getclip outputs the unformatted string from the clipboard and putclip copies its input string back into the clipboard. Now, when you paste your text, it won't have any formatting.

2008-07-05

Extract Columns From Tabular Text - Cut

A quick way to extract one or more columns from tabular or character delimited data, such as Web pages or log files, is to use the GnuWin cut command.

Some examples:

Print just the bug number and title from a list of bugs in a Web page (e.g. from Bugzilla): cut -f1,8.
Print the URLs requested from Apache log (in common format): cut -d" " -f7.

The -f switch specifies the column to extract. By default, the delimiter is TAB and the -d switch specifies an alternative delimiter.

One limitation of cut is that you can't specify the columns relative to the last column, unless you know the index of the last column. If your data has a varying number of columns, like the path strings printed by the find . -type f command in the example below …

./Profiles/9ls0tqn1.default/blocklist.xml
./Profiles/9ls0tqn1.default/bookmarkbackups/bookmarks-2008-06-14.html

… you can't easily extract just the file name (the last column) in every line.

A related command is colrm, which removes character columns from the input. It's quite limited and does the opposite of what I expect, so I haven't used it.

Cut is a simple utility to extract columns of data, and it can't process the column data like a scripting language. I'll write a bit more about processing tabular data in future.
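As a small taste of what a scripting language adds, here is a Python sketch that picks the last column no matter how many columns a line has. The function name is hypothetical.

```python
def last_field(line, delimiter="/"):
    """Return the text after the final delimiter (the last column)."""
    return line.rsplit(delimiter, 1)[-1]

paths = [
    "./Profiles/9ls0tqn1.default/blocklist.xml",
    "./Profiles/9ls0tqn1.default/bookmarkbackups/bookmarks-2008-06-14.html",
]
for path in paths:
    print(last_field(path))
```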

2008-06-04

Match Multiple String Patterns

To find multiple string patterns in an input file or stream, these commands are equivalent:

sed -n -e "/pattern1/p" -e "/pattern2/p". -n suppresses printing all input lines.
sed -n -r -e "/pattern1|pattern2/p". -r enables extended regular expressions.
grep -e "pattern1" -e "pattern2".
grep -E "pattern1|pattern2". -E enables extended regular expressions.
findstr "pattern1 pattern2". You have to delimit the patterns in a single string argument. To find strings containing white spaces, you have to use the \s (whitespace) character class in your pattern.
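The alternation form carries over directly to Python's re module. This is a sketch with an invented helper name and made-up sample lines; each pattern is wrapped in a non-capturing group so that the joined expression keeps the patterns separate.

```python
import re

def match_any(lines, *patterns):
    """Return the lines that match at least one pattern, each line once."""
    combined = re.compile("|".join("(?:%s)" % p for p in patterns))
    return [line for line in lines if combined.search(line)]

log = ["error: disk full", "ok", "warning: cpu hot"]
print(match_any(log, "error", "warning"))
```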

2008-05-24

Fix Incorrectly Encoded Unicode Files with Python

The Problem

We had a lot of text files committed into our CVS repository as Unicode format. When these files were checked out later, we found that they weren't really text files nor Unicode files because CVS had only prepended two bytes to the start of these files, FF FE, but left only one byte for encoding each character. Some text editors such as Vim could open these files but other applications such as Notepad and Excel showed only gibberish.

Unicode Encoded Text in Files

Unicode is an encoding standard … for processing, storage and interchange of text data in any language. For the purpose of fixing this problem, we just have to know how to identify and write valid Unicode files.

We use two tools to experiment and visualize the effect of different encoding methods:

Microsoft Notepad editor, because it can save text files using different encoding methods.
GnuWin32 od utility to output the data in a file as byte values.

Open Notepad and enter this text: Hello World. Select the File / Save As menu item. In the Save As dialog, there are four encoding methods in the Encoding drop down list: ANSI, Unicode, Unicode big endian and UTF-8. Save the same text using each of the encoding methods into four files, say HelloANSI.txt, HelloUnicode.txt, HelloUnicodeBigEndian.txt and HelloUTF8.txt, respectively.

Examine the contents of each file using od:

>od -A x -t x1 HelloANSI.txt
000000 48 65 6c 6c 6f 20 57 6f 72 6c 64
00000b

>od -A x -t x1 HelloUnicode.txt
000000 ff fe 48 00 65 00 6c 00 6c 00 6f 00 20 00 57 00
000010 6f 00 72 00 6c 00 64 00
000018

>od -A x -t x1 HelloUnicodeBigEndian.txt
000000 fe ff 00 48 00 65 00 6c 00 6c 00 6f 00 20 00 57
000010 00 6f 00 72 00 6c 00 64
000018

>od -A x -t x1 HelloUTF8.txt
000000 ef bb bf 48 65 6c 6c 6f 20 57 6f 72 6c 64
00000e

The ANSI encoded file contains 11 bytes representing the characters you typed. The Unicode encoded files contain 24 bytes, starting with a two-byte BOM and using two bytes to represent each character. If the first two bytes are FF FE, then the two bytes of each character are stored in low-byte, high-byte order. Conversely, if the first two bytes are FE FF, then they are stored in high-byte, low-byte order. Finally, when a file starts with the bytes EF BB BF, only one byte is used to encode each ASCII character and two or more bytes are used to encode every other character (not demonstrated).
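The same byte layouts can be reproduced in Python 3 without Notepad or od. This is a sketch: it assumes Notepad's "ANSI" means the Western European code page cp1252, which depends on the Windows locale, and the variable names are mine.

```python
import codecs

text = "Hello World"
encodings = [
    ("ANSI", text.encode("cp1252")),   # assumption: ANSI = cp1252
    ("Unicode", codecs.BOM_UTF16_LE + text.encode("utf-16-le")),
    ("Unicode big endian", codecs.BOM_UTF16_BE + text.encode("utf-16-be")),
    ("UTF-8", text.encode("utf-8-sig")),   # utf-8-sig prepends EF BB BF
]
for name, data in encodings:
    # Print the length and the bytes in hex, like od -t x1.
    print(name, len(data), " ".join("%02x" % b for b in data))
```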

Fixing Incorrectly Encoded Files in Python

Now we know the format of a Unicode encoded file: it starts with FF FE and stores each character in low-byte, high-byte order. Our text files in CVS just have ANSI characters, so we just have to insert a 0 byte after each character, starting from the third byte. Julian W. wrote a short Python script to do this. I don't have his code right now, so here's my version for correcting the Unicode encoding for a file:

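import codecs

# Read the raw bytes in binary mode so Windows does not translate line endings.
raw = map(ord, file(r'HelloBadUnicode.txt', 'rb').read())
# An FF FE BOM followed by a non-zero fourth byte means one byte per character.
if raw[0] == 255 and raw[1] == 254 and raw[3] != 0:
  # The UTF-16 codec writes the BOM and two bytes per character.
  output = codecs.open(r'HelloFixedUnicode.txt', 'w', 'UTF-16')
  for i in raw[2:]:
    output.write(chr(i))
  output.close()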
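The scripts in this article use Python 2 built-ins (file() and a list-returning map()). For Python 3, a sketch of the same fix might look like the following. The function name is hypothetical, and decoding the bytes as Latin-1 is my assumption for "one byte per character".

```python
import codecs

def fix_bad_unicode(raw: bytes) -> bytes:
    """Turn 'BOM + one byte per character' data into valid UTF-16 LE."""
    looks_bad = (len(raw) >= 4
                 and raw[:2] == codecs.BOM_UTF16_LE
                 and raw[3] != 0)
    if not looks_bad:
        return raw                      # leave other files untouched
    text = raw[2:].decode("latin-1")    # each byte becomes one character
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")

print(fix_bad_unicode(b"\xff\xfeHello"))
```

To fix a file, read it with open(name, 'rb'), pass the bytes through the function, and write the result with open(name, 'wb').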

References

Unicode Consortium's FAQ on UTF-8, UTF-16, UTF-32 & BOM.
Wikipedia's Byte-order mark.

Postscript

I started with a more complicated piece of Python code using lists and generators:

from itertools import repeat
from operator import concat

# Binary mode ('rb' / 'wb') stops Windows from translating line endings.
raw = map(ord, file(r'HelloBadUnicode.txt', 'rb').read())
if raw[0] == 255 and raw[1] == 254 and raw[3] != 0:
  output = file(r'HelloFixedUnicode.txt', 'wb')
  output.write(chr(255))
  output.write(chr(254))
  for i in reduce(concat, zip(raw[2:], repeat(0, len(raw)-2))):
    output.write(chr(i))
  output.close()

But then I realised I just had to write a 0 byte after each ANSI character, so here's a simpler version:

# Binary mode ('rb' / 'wb') stops Windows from translating line endings.
raw = map(ord, file(r'HelloBadUnicode.txt', 'rb').read())
if raw[0] == 255 and raw[1] == 254 and raw[3] != 0:
  output = file(r'HelloFixedUnicode.txt', 'wb')
  output.write(chr(255))
  output.write(chr(254))
  for i in raw[2:]:
    output.write(chr(i))
    output.write(chr(0))
  output.close()

2008-05-25. I remembered that Python had no problems with writing Unicode files, resulting in the even simpler code in the body of this article.

2008-05-20

GnuWin32 find and missing argument for exec

Reminder on how to use -exec action in GnuWin32 find command in Windows cmd.exe. For example, if you want to find a string, the format is:

find . -type f -exec grep <pattern> {} ;

If you do any of the following, you can get this cryptic error message: find: missing argument to `-exec'

Put double-quote marks around the command:

find . -type f -exec "grep <pattern> {} ;"

Don't leave a space between braces and semi-colon:

find . -type f -exec "grep <pattern> {};"

Use Unix shell escape character:

find . -type f -exec grep <pattern> {} \;

Finally, if all else fails and you lack time to investigate, use xargs:

find . -type f | xargs grep <pattern>

2008-05-07

Sed Translate / Transform / Transliterate Command

Note to self: sed's (Stream EDitor) y/list1/list2/ command transforms (transliterates) each character found in list1 into the character at the same position in list2. list1 and list2 must be explicit character lists, not regular expressions (and hence, not character classes). In other words, if you enter y/[a-z]/[A-Z]/, sed will look for the characters '[', 'a', '-', 'z' and ']' in the input and replace them with '[', 'A', '-', 'Z' and ']' respectively; sed does not expand the character class [a-z] to replace it with [A-Z]. The same goes for POSIX character class names such as [:lower:] and [:upper:].

I had confused sed's transform command with the tr (translate) command, which supports interpreted sequences, e.g. tr [:lower:] [:upper:] will transform all lower case characters to upper case.
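Python's str.maketrans builds the same kind of position-by-position table as sed's y command, so it makes the pitfall easy to see. A small sketch (the variable names are mine):

```python
import string

# Like y/[a-z]/[A-Z]/: only the five literal characters are mapped.
literal = str.maketrans("[a-z]", "[A-Z]")
print("lazy dog".translate(literal))   # only 'a' and 'z' change

# Spelling out the full lists gives what tr [:lower:] [:upper:] does.
full = str.maketrans(string.ascii_lowercase, string.ascii_uppercase)
print("lazy dog".translate(full))
```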

2008-05-01

More Uses of Getclip-Putclip

More uses of GnuWin32 / Cygutils tools getclip and putclip using this recipe: getclip | <command chain> | putclip.

Copy m'th and n'th column of a table from a browser: cut -fm,n.
Copy columns from Excel and replace tab character with space: tr \t " ".
Capitalize letters: tr [:lower:] [:upper:]. (Duh! Enter Shift-F3 in Microsoft Word, thanks to Maria H.).
Remove indentation from e-mail messages: sed "s/> //".
Remove indentation from source code in Word document: sed -e "s/^ //" (5-May-2008).
Join lines broken into multiple lines by e-mail clients: dos2unix | tr -d \n. On a Windows system, tr doesn't recognise CR-LF pairs for terminating a line, so you have to convert them to a Unix-style LF using dos2unix first (6-May-2008).
Another way to join broken lines: tr -d \r\n using escape codes for carriage return and line feed, respectively (11-May-2008).
Remove formatting from string: getclip | putclip. This is equivalent to Microsoft Word's Paste Special / Unformatted Text. Also a work-around for an annoyance in Outlook 2003, where Edit / Paste Special is disabled when you are responding to an HTML-formatted document (7-May-2008).
Remove HTML / XML formatting from input: sed -e "s/<[^>]*>//g" (12-Jun-2008).

A second recipe is (for /f %i in ('getclip') do @command %i) | putclip if command cannot be used in a pipeline. Two examples are basename (return name of file in a path string) and dirname (return path string without file name).

2008-05-01: Don't simply list transformations and filters that can be done with GnuWin32 tools, but ones where existing applications (e.g. Excel, Firefox, Outlook or Word) don't have an easy way to achieve a particular action.

2012-06-25: Flatten or collapse Excel multi-column data.

2008-04-30

Using Clipboard in the Command Line

GnuWin32 / Cygutils package has two tools for interacting with the Windows clipboard: getclip and putclip. The first copies text from the clipboard to standard output and the second copies text from standard input to the clipboard. These tools are useful when you want to process text from one Windows application before pasting the text into another application, in the following recipe: getclip | <filters> | putclip.

For example, I want to paste all DLL file names in a folder into a document:

Navigate to the required folder using 2xExplorer browser.
Type Alt+a to select all files.
Type Alt+c to copy all file names. 2xExplorer copies the absolute path for each file.
Start cmd.exe console.
In cmd.exe console, enter: getclip | cut -d\ -fn | grep dll$ | putclip. cut is a GnuWin32 tool which selects a column of data given a column delimiter (-d\ defines backslash) and field number (-fn defines column n). grep filters the output to only list files whose names end with "dll".
Start editor.
Paste the text in the clipboard in destination document.
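The filter part of this recipe (the cut and grep steps) can be sketched in Python as follows. The function name and the sample paths are invented for illustration, and the sketch takes the last path component instead of a fixed column n.

```python
import ntpath

def dll_names(clip_text):
    """Return file names ending in 'dll' from a list of Windows paths."""
    names = (ntpath.basename(path) for path in clip_text.splitlines())
    return [name for name in names if name.endswith("dll")]

# Hypothetical clipboard contents: one absolute path per line.
clip = "C:\\Tools\\bin\\libintl3.dll\r\nC:\\Tools\\bin\\sed.exe\r\n"
print("\n".join(dll_names(clip)))
```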

Of course, you can do the same using Excel:

Navigate to the required folder using 2xExplorer browser.
Type Alt+a to select all files.
Type Alt+c to copy all file names. 2xExplorer copies the absolute path for each file.
Start Excel.
Paste data in a worksheet column.
Select all cells by typing Shift+Space.
Open Convert Text to Columns Wizard by typing Alt+d+e.
Select Delimited data type by typing Alt+d.
Type Alt+n to go to page 2.
Select Other delimiter by typing Alt+o, then enter "\" for paths.
Type Alt+f to run the wizard.
Start Auto Filter by typing Alt+d+f+f.
Move to filter column using the mouse (no keyboard shortcuts?) then select from the drop down list (Custom …).
Select the ends with criterion, enter .dll, then press Enter.
Move cursor to required column and select it using Control+Space.
Copy column by typing Control+C.
Start editor.
Paste the text in the clipboard in destination document.

The Excel solution has many more steps than the getclip-putclip solution but Excel leads you through to a solution step-by-step. If you're familiar with GNU tools, then getclip-putclip recipe is faster to use and much more extensible.

2008-05-07. I should have remembered that the basename command would output the name of the file without the leading path string. See later article More Uses of Getclip-PutClip about how to use basename in a pipeline.

2008-04-25

Strange GnuWin32 Invalid Argument Error Messages

When chaining GnuWin32 commands in Windows cmd.exe, you may encounter strange error messages like this:

> ls | grep
…
ls: write error: Invalid argument

The first command reports a write error but the error is really in the second command after the pipe symbol.

You may also encounter a similar write error if the wrong command is found in your PATH variable. For instance, Windows and GnuWin32 both have a find and sort command which support different command-line options, so depending on the order of directories listed in your PATH variable, one version or the other is used. If you enter the wrong command-line options for these commands, they won't start and cause the command earlier in the chain to report some sort of I/O error.