26 May 2008

Pivot Table Hack in Sqlite3 and MySQL

Introduction

A pivot table or cross tabulation is a reporting feature that BAs love to use to summarise transaction data, such as server logs and sales figures. Spreadsheet programs such as Microsoft Excel or OpenOffice.org Calc have nifty wizards to help you create a pivot table. You can also create pivot tables in databases. For example, Microsoft Access has a TRANSFORM … PIVOT SQL statement for generating a crosstab or pivot table.

What if you're using a database program that doesn't directly support pivot tables? For example, Sqlite 3 and MySQL don't seem to have any SQL statements for pivot tables.

All is not lost; another way to express a pivot table is to use aggregate functions, condition clauses and GROUP BY clause in this template:

SELECT col1, col2, … <aggregate>(<condition>) … FROM table1 GROUP BY col1, col2, ….

For Sqlite 3, the aggregate functions and GROUP BY is similar to SQL in other database programs. The condition clause we can use has this syntax: case when <expression> then <expression> end.

In the next section, we'll demonstration how to create pivot tables in Sqlite 3 using this template. All examples will be shown using Sqlite 3's command line interface, sqlite3.exe.

Sqlite 3 Pivot Table Demonstration

First, you have to download some sample transaction data. I used the NumberGo Pivot Table Tutorial AcmeShirtsCompany.xls spreadsheet as the raw data for this demonstration.

We start sqlite3.exe and use the -column -header arguments make the output of queries more readable.

sqlite3 -column -header test.db
SQLite version 3.5.9
Enter ".help" for instructions

Now we create a shirt table based on the headings in that spreadsheet:

create table shirt (Region varchar(8), Category varchar(8), Shirt_Style varchar(8), ShipDate date, Units integer, Price double, Cost double);

Next we load some transaction data into the shirt table:

insert into shirt values ('East','Boys','Tee',date('2005-01-01'),11,5.25,4.66);
insert into shirt values ('East','Boys','Golf',date('2005-01-01'),12,5.26,4.57);
insert into shirt values ('East','Boys','Polo',date('2005-01-01'),13,5.27,5.01);
insert into shirt values ('East','Girls','Tee',date('2005-01-01'),14,5.28,5.01);
insert into shirt values ('East','Girls','Golf',date('2005-01-01'),15,5.29,5.10);
insert into shirt values ('East','Girls','Polo',date('2005-01-01'),16,5.30,4.67);
insert into shirt values ('West','Boys','Tee',date('2005-01-01'),33,6.25,5.36);
insert into shirt values ('West','Boys','Golf',date('2005-01-01'),35,6.26,6.24);
insert into shirt values ('West','Boys','Polo',date('2005-01-01'),36,6.27,6.03);
…

Let's begin our analysis with a simple question: How many shirts were sold in each region?

select Region, sum(Units) from shirt group by Region;
Region      sum(Units)
----------  ----------
East        21841
North       27275
South       29994
West        23984

Next: in each region, how many Boys' and Girls' shirts were sold? Here's where a pivot table is useful:

select
  Region
  , sum(case when Category = 'Boys' then Units end) as Boys
  , sum(case when Category = 'Girls' then Units end) as Girls
  , sum(Units) as SubTotal
from shirt
group by Region;
Region      Boys        Girls       SubTotal
----------  ----------  ----------  ----------
East        10586       11255       21841
North       14049       13226       27275
South       14312       15682       29994
West        10763       13221       23984

We can drill further into the data: How many of each shirt style were sold?

select
  Region
  , Category
  , sum(case when Shirt_Style = 'Tee' then Units end) as Tee
  , sum(case when Shirt_Style = 'Golf' then Units end) as Golf
  , sum(case when Shirt_Style = 'Polo' then Units end) as Polo
  , sum(Units) as SubTotal
from shirt
group by Region, Category;
Region      Category    Tee         Golf        Polo        SubTotal
----------  ----------  ----------  ----------  ----------  ----------
East        Boys        3458        3096        4032        10586
East        Girls       3688        3481        4086        11255
North       Boys        4597        4702        4750        14049
North       Girls       4196        4598        4432        13226
South       Boys        5192        4670        4450        14312
South       Girls       5113        5377        5192        15682
West        Boys        3722        3791        3250        10763
West        Girls       4472        4235        4514        13221

The pattern becomes obvious, if rather tedious, when you want to use a specific values as a new virtual column.

Discussion

In this article, I've presented a SQL template for generating pivot tables for database programs, such as Sqlite 3, that do not have explicit support for this feature. While this template is extensible, it relies on the developer knowing beforehand the possible values (e.g. Category has 'Boys' and 'Girls', or Shirt Styles has 'Tee', 'Golf' and 'Polo') to use in the condition clause of the template. If there are many possible values, then it becomes very tedious to enumerate each of them in the SQL case when … then … end clause.

2008-06-01: I had a play with MySQL and found that I can use the same SQL statements to create the pivot tables.

I had saved the data in AcmeShirtsCompany.sql, so to set up my MySQL database, I created the shirt table using mysql.exe, exit the interpreter, then loaded the data into the database using this cmd.exe command: mysql -u root -p -D test < AcmeShirtsCompany.sql.

2008-06-06: See also SQL Cookbook by Anthony Molinaro, O'Reilly Media.