26 May 2008

Pivot Table Hack in Sqlite3 and MySQL

Introduction

A pivot table or cross tabulation is a reporting feature that BAs love to use to summarise transaction data, such as server logs and sales figures. Spreadsheet programs such as Microsoft Excel or OpenOffice.org Calc have nifty wizards to help you create a pivot table. You can also create pivot tables in databases. For example, Microsoft Access has a TRANSFORM … PIVOT SQL statement for generating a crosstab or pivot table.

What if you're using a database program that doesn't directly support pivot tables? For example, Sqlite 3 and MySQL don't seem to have any SQL statements for pivot tables.

All is not lost; another way to express a pivot table is to use aggregate functions, condition clauses and GROUP BY clause in this template:

SELECT col1, col2, … <aggregate>(<condition>) … FROM table1 GROUP BY col1, col2, ….

For Sqlite 3, the aggregate functions and GROUP BY is similar to SQL in other database programs. The condition clause we can use has this syntax: case when <expression> then <expression> end.

In the next section, we'll demonstration how to create pivot tables in Sqlite 3 using this template. All examples will be shown using Sqlite 3's command line interface, sqlite3.exe.

Sqlite 3 Pivot Table Demonstration

First, you have to download some sample transaction data. I used the NumberGo Pivot Table Tutorial AcmeShirtsCompany.xls spreadsheet as the raw data for this demonstration.

We start sqlite3.exe and use the -column -header arguments make the output of queries more readable.

sqlite3 -column -header test.db
SQLite version 3.5.9
Enter ".help" for instructions

Now we create a shirt table based on the headings in that spreadsheet:

create table shirt (Region varchar(8), Category varchar(8), Shirt_Style varchar(8), ShipDate date, Units integer, Price double, Cost double);

Next we load some transaction data into the shirt table:

insert into shirt values ('East','Boys','Tee',date('2005-01-01'),11,5.25,4.66);
insert into shirt values ('East','Boys','Golf',date('2005-01-01'),12,5.26,4.57);
insert into shirt values ('East','Boys','Polo',date('2005-01-01'),13,5.27,5.01);
insert into shirt values ('East','Girls','Tee',date('2005-01-01'),14,5.28,5.01);
insert into shirt values ('East','Girls','Golf',date('2005-01-01'),15,5.29,5.10);
insert into shirt values ('East','Girls','Polo',date('2005-01-01'),16,5.30,4.67);
insert into shirt values ('West','Boys','Tee',date('2005-01-01'),33,6.25,5.36);
insert into shirt values ('West','Boys','Golf',date('2005-01-01'),35,6.26,6.24);
insert into shirt values ('West','Boys','Polo',date('2005-01-01'),36,6.27,6.03);
…

Let's begin our analysis with a simple question: How many shirts were sold in each region?

select Region, sum(Units) from shirt group by Region;
Region      sum(Units)
----------  ----------
East        21841
North       27275
South       29994
West        23984

Next: in each region, how many Boys' and Girls' shirts were sold? Here's where a pivot table is useful:

select
  Region
  , sum(case when Category = 'Boys' then Units end) as Boys
  , sum(case when Category = 'Girls' then Units end) as Girls
  , sum(Units) as SubTotal
from shirt
group by Region;
Region      Boys        Girls       SubTotal
----------  ----------  ----------  ----------
East        10586       11255       21841
North       14049       13226       27275
South       14312       15682       29994
West        10763       13221       23984

We can drill further into the data: How many of each shirt style were sold?

select
  Region
  , Category
  , sum(case when Shirt_Style = 'Tee' then Units end) as Tee
  , sum(case when Shirt_Style = 'Golf' then Units end) as Golf
  , sum(case when Shirt_Style = 'Polo' then Units end) as Polo
  , sum(Units) as SubTotal
from shirt
group by Region, Category;
Region      Category    Tee         Golf        Polo        SubTotal
----------  ----------  ----------  ----------  ----------  ----------
East        Boys        3458        3096        4032        10586
East        Girls       3688        3481        4086        11255
North       Boys        4597        4702        4750        14049
North       Girls       4196        4598        4432        13226
South       Boys        5192        4670        4450        14312
South       Girls       5113        5377        5192        15682
West        Boys        3722        3791        3250        10763
West        Girls       4472        4235        4514        13221

The pattern becomes obvious, if rather tedious, when you want to use a specific values as a new virtual column.

Discussion

In this article, I've presented a SQL template for generating pivot tables for database programs, such as Sqlite 3, that do not have explicit support for this feature. While this template is extensible, it relies on the developer knowing beforehand the possible values (e.g. Category has 'Boys' and 'Girls', or Shirt Styles has 'Tee', 'Golf' and 'Polo') to use in the condition clause of the template. If there are many possible values, then it becomes very tedious to enumerate each of them in the SQL case when … then … end clause.

2008-06-01: I had a play with MySQL and found that I can use the same SQL statements to create the pivot tables.

I had saved the data in AcmeShirtsCompany.sql, so to set up my MySQL database, I created the shirt table using mysql.exe, exit the interpreter, then loaded the data into the database using this cmd.exe command: mysql -u root -p -D test < AcmeShirtsCompany.sql.

2008-06-06: See also SQL Cookbook by Anthony Molinaro, O'Reilly Media.

25 May 2008

Disable Vista Memory Diagnostic Tool

Vista has a Memory Diagnostic Tool which you can turn on to test your computer's memory when you restart it. Once it is enabled, this tool starts every time you restart your computer. Be warned: the Vista help system doesn't explain how to disable it!

After some Web searching, I found this tip:

- Open command prompt as Admistrator: by typing in start ''cmd'' right click the .exe file and then clicking on adminstrator. - Then typing in the console: ''bcdedit /bootsequence {memdiag} /remove'' press enter, after that you can restart your com. and it wont start

24 May 2008

Fix Incorrectly Encoded Unicode Files with Python

The Problem

We had a lot of text files committed into our CVS repository as Unicode format. When these files were checked out later, we found that they weren't really text files nor Unicode files because CVS had only prepended two bytes to the start of these files, FF FE, but left only one byte for encoding each character. Some text editors such as Vim could open these files but other applications such as Notepad and Excel showed only gibberish.

Unicode Encoded Text in Files

Unicode is an encoding standard … for processing, storage and interchange of text data in any language. For the purpose of fixing this problem, we just have to know how to identify and write valid Unicode files.

We use two tools to experiment and visualize the effect of different encoding methods:

  1. Microsoft Notepad editor, because it can save text files using different encoding methods.
  2. GnuWin32 od utility to output the data in a file as byte values.

Open Notepad and enter this text: Hello World. Select the File / Save As menu item. In the Save As dialog, there are four encoding methods in the Encoding drop down list: ANSI, Unicode, Unicode big endian and UTF-8. Save the same text using each of the encoding methods into four files, say TestANSI.txt, TestUnicode.txt, TestUnicodeBigEndian.txt and TestUTF8.txt, respectively.

Examine the contents of each file using od:

>od -A x -t x1 HelloANSI.txt
000000 48 65 6c 6c 6f 20 57 6f 72 6c 64
00000b

>od -A x -t x1 HelloUnicode.txt
000000 ff fe 48 00 65 00 6c 00 6c 00 6f 00 20 00 57 00
000010 6f 00 72 00 6c 00 64 00
000018

>od -A x -t x1 HelloUnicodeBigEndian.txt
000000 fe ff 00 48 00 65 00 6c 00 6c 00 6f 00 20 00 57
000010 00 6f 00 72 00 6c 00 64
000018

>od -A x -t x1 HelloUTF8.txt
000000 ef bb bf 48 65 6c 6c 6f 20 57 6f 72 6c 64
00000e

The ANSI encoded file contains 11 bytes representing the characters you typed. The Unicode encoded files contain 24 bytes, starting with a two-byte BOM and using two bytes to represent each character. If the first two bytes are FF FE, then the two bytes are stored in low-byte, high-byte order. Conversely, if the first two bytes are FE FF, then the two bytes are stored in high-byte, low-byte order. Finally, when a file starts with byte EF BB BF, only one byte is used to encode each ANSI character and two or more bytes are used to encode non-ANSI characters (not demonstrated).

Fixing Incorrectly Encoded Files in Python

Now we know the format of a Unicode encoded file: it starts with FF FE and stores each character in low-byte, high-byte order. Our text files in CVS just have ANSI characters, so we just have to insert a 0 byte between each character, starting from the third byte. Julian W. wrote a short Python script that to do this. I don't have his code right now, so here's my version for correcting the Unicode encoding for a file:

import codecs
raw = map(ord, file(r'HelloBadUnicode.txt').read())
if raw[0] == 255 and raw[1] == 254 and raw[3] != 0:
  output = codecs.open(r'HelloFixedUnicode.txt', 'w', 'UTF-16')
  for i in raw[2:]:
    output.write(chr(i))
  output.close()

References

Postscript

I started with a more complicated piece of Python code using lists and generators:

from itertools import repeat
from operator import concat

raw = map(ord, file(r'HelloBadUnicode.txt').read())
if raw[0] == 255 and raw[1] == 254 and raw[3] != 0:
  output = file(r'HelloFixedUnicode.txt','w')
  output.write(chr(255))
  output.write(chr(254))
  for i in reduce(concat, zip(raw[2:], repeat(0, len(raw)-2))):
    output.write(chr(i))
  output.close()

But then I realised I just had to write a 0 byte after each ANSI character, so here's a simpler version:

raw = map(ord, file(r'HelloBadUnicode.txt').read())
if raw[0] == 255 and raw[1] == 254 and raw[3] != 0:
  output = file(r'HelloFixedUnicode.txt','w')
  output.write(chr(255))
  output.write(chr(254))
  for i in raw[2:]:
    output.write(chr(i))
    output.write(chr(0))
  output.close()

2008-05-25. I remembered that Python had no problems with writing Unicode files, resulting in the even simpler code in the body of this article.

23 May 2008

MDI Child Window Menu Shortcuts

I accidently moved PythonWin's Interactive window out of sight when I grabbed and dropped it with my mouse pointer. Restarting PythonWin didn't help because the position of the child window was stored in Windows Registry, so it remained hidden even after restarting PythonWin. I considered hacking Windows Registry to reset the child window's position, until I found the keyboard shortcuts to move the keyboard focus to a child window, show the System Menu and select the Move menu item.

Background: A child window is a window in a MDI application.

The keyboard shortcuts required to bring the child window back into view were:

  1. Move focus through child windows: Control+F6.
  2. Show a child window's System Menu: Alt+- (Alt Minus).
  3. Move child window: m, then press the cursor keys.

Keyboard shortcuts for Microsoft Windows applications: KB 126449.

20 May 2008

GnuWin32 find and missing argument for exec

Reminder on how to use -exec action in GnuWin32 find command in Windows cmd.exe. For example, if you want to find a string, the format is:

find . -type f -exec grep <pattern> {} ;

If you do any of the following, you can get this cryptic error message: find: missing argument to `-exec'

  • Put double-quote marks around the command:
    find . -type f -exec "grep <pattern> {} ;"
  • Don't leave a space between braces and semi-colon:
    find . -type f -exec "grep <pattern> {};"
  • Use Unix shell escape character:
    find . -type f -exec grep <pattern> {} \;

Finally, if all else fails and you lack time to investigate, use xargs:

find . -type f | xargs grep <pattern>

Python Command Line (-c option) Test 2

Julian W. suggested that I write a one line Python loop for my command scripts instead of map(), as in my earlier article. Instead of map(lambda l: expression(l), sys.stdin), I could write for l in sys.stdin: expression(l). An example trivial command to echo all input lines would be:

python -c "import sys; for line in sys.stdin: print line,"

Problem is that the Python interpreter complains:

  File "", line 1
    import sys; for line in sys.stdin: print line,
                  ^
SyntaxError: invalid syntax

13 May 2008

Outlook 2003 Save HTML Limitation

If want to save an e-mail message in HTML format in Microsoft Outlook 2003, you may find that Outlook, unlike MSIE or Firefox browsers, only saves the text in the message but not any of the embedded images or attachments. Worse, Outlook doesn't warn you that it is not saving the entire message.

Another annoyance is that if you try to save an image in the message using the context menu item Save Picture As, then you can only save using BMP format.

11 May 2008

Assign USB Drives to Folder

Each time you plug in a USB device to your Windows computer, Windows can assign a different drive letter to your device. If you have programs that rely on a fixed drive letter (e.g. portable applications on a USB drive or backups) or if you use more than one computer regularly, then it gets annoying to reset the programs' configuration after plugging in your drive or remember to plug in devices in a particular sequence. Assign USB Drives to Folder describes how to use Window's Disk Management to assign a fixed path to each device.

I wonder if it's possible to refer to a device using its volume name, which would make this method redundant?

07 May 2008

Sed Translate / Transform / Transliterate Command

Note to self: sed's (Stream EDitor) command y/list1/list2/ to transform / transliterate each character is based on its position in list1 to a character in the same position in list2. list1 and list2 must be an explicit character list, not a regular expression (and hence, not a character class). In other words, if you enter y/[a-z]/[A-Z]/, sed will look for these characters in the input, '[', 'a', '-', 'z' and ']', to replace with '[', 'A', '-', 'Z' and ']' respectively; sed does not expand a character class [a-z] to replace with [A-Z]. Same with Posix character class names such as [:lower:] and [:upper:].

I incorrectly mixed up the idea that sed's transform command with the tr (translate) command, which supports interpreted sequences, e.g. tr [:lower:] [:upper:] will transform all lower case characters to upper case.

05 May 2008

Obscure Cmd.exe Output Replacement (Back Tick)

With Unix shells such as bash and [t]csh, you can set the value of a variable to the result of a command using the back-tick operator (or output replacement). For example, LINES = `wc -l filename`, would set the variable LINES with the result of wc -l, which is the number of lines in filename. This technique is useful when you want pass the value of a computed variable to subsequent commands in a script.

Windows' cmd.exe also supports this feature, in a obscure way, using the for command: for /f %i in ('command') do set VARIABLE=%i. To reproduce the previous example in cmd.exe: for /f %i in ('wc -l filename') do set LINES=%i.

Notes:

  • Use %%i in a script.
  • Use single-quote marks to delimit a command. If you use double-quote marks, for treats the argument in parentheses as a string.

I saw this and other cmd.exe hacks somewhere but I didn't take a note of it. Grr. Remind myself to update this page when I find that site again.

01 May 2008

More Uses of Getclip-Putclip

More uses of GnuWin32 / Cygutils tools getclip and putclip using this recipe: getclip | <command chain> | putclip.

  • Copy m'th and n'th column of a table from a browser: cut -fm,n.
  • Copy columns from Excel and replace tab character with space: tr \t " ".
  • Capitalize letters: tr [:lower:] [:upper:]. (Duh! Enter Shift-F3 in Microsoft Word, thanks to Maria H.).
  • Remove indentation from e-mail messages: sed "s/> //".
  • Remove indentation from source code in Word document: sed -e "s/^ //" (5-May-2008).
  • Join lines broken into multiple lines by e-mail clients: dos2unix | tr -d \n. On a Windows system, tr doesn't recognise CR-LF pairs for terminating a line, so you have to convert them to a Unix-style LF using dos2unix first (6-May-2008).
  • Another way to join broken lines: tr -d \r\n using escape codes for carriage return and line feed, respectively (11-May-2008).
  • Remove formatting from string: getclip | putclip. This is equivalent to Microsoft Word's Paste Special / Unformatted Text. Also to work-around an annoyance in Outlook 2003, were the Edit / Paste Special is disabled when you are responding to an HTML-formatted document (7-May-2008).
  • Remove HTML / XML formatting from input: sed -e "s/<[^>]*>//g" (12-Jun-2008).

A second recipe is (for /f %i in ('getclip') do @command %i) | putclip if command cannot be used in a pipeline. Two examples are basename (return name of file in a path string) and dirname (return path string without file name).

2008-05-01: Don't simply list transformations and filters that can be done with GnuWin32 tools, but ones where existing applications (e.g. Excel, Firefox, Outlook or Word) don't have an easy way to achieve a particular action.

2012-06-25: Flatten or collapse Excel multi-column data.