Archive für April 2011

iPhone Tracking: How to read the consolidated.db with Kettle

I will not discuss about the buzz about the iPhone tracking - all that needs to be discussed is already out. That iPhone is storing locations in the consolidated.db was known a long time ago, but now we got a prove of concept by Pete Warden and Alasdair Allan and this inspired me to dig into this a bit more.

After an amazing short time of less than one hour I made it possible to read in my own iPhone consolidated.db with Kettle. [It takes more time to write about this… and the other real hard thing was to locate this file in my file system.] When we can read this file with Kettle we are open and can look up any additional data, filter by any criteria, store this in another database, use the Pentaho BI Suite to create maps and anything else that can a human being imagine. Just have a look at the Pentaho Sandbox or more general to the Pentaho site.

Here are my findings and steps to reproduce:

1) Locate the consolidated.db in your filesystem.

This is a little bit tricky, since the filename is mapped to an ID. I found it very easy by just searching for the string “CellLocation” in the iTunes backup folder, described over here for Windows:

findstr /M CellLocation C:\Users\[Username]\AppData\Roaming\Apple Computer\MobileSync\Backup\[latest backup folder]\*.*

I copied the file 4096c9ec676f2847dc283405900e284a7c815836 (this was my sample filename, yours will be most likely different) to consolidated.db to another folder.

A more robust and flexible solution is to read in and analyze the Manifest.mbdb / Manifest.mbdx as described over here in a Python script. It shouldn’t be too hard to port this to Java or JavaScript for a flexible Kettle solution.

2) Read in the consolidated.db with Kettle

Create a transformation with a SQLite connection type. The database name is the location of the filename as shown below as a sample:

SQLite connection type for consolidated.db

Let’s create a simple transformation (attached): Read in the Cell data, convert the Mac Absolute Time to a Java time and just group on a daily basis for simplification:

Transformation for consolidated.db

Let’s do a preview and browse the data a bit. Well I’m located in Mainz, known to be exactly at Latitude 50. So every other data is of interest and here comes a sample:

Sample Result for consolidated.db

Where I have been there? My forensic research on my own calendar found that I have been at the Pentaho EMEA Partner Summit & Community Event in Lisbon…. or was it only my iPhone?

How the SQL CASE WHEN construct can help you in pivoting data?

Within Kettle you have the Normalizer and Denormalizer steps to help with pivoting (a simplified use case for pivoting is changing the row/column axis, see also transpose). Imagine you would like to have a cross table, you may use a Kettle transformation to accomplish this. (In these days Pentaho Reporting’s cross tab functionality is on the road map but not implemented, yet. One solution would be to use a Kettle Transformation as a data source.).

Another solution for a cross tab is to use the SQL CASE WHEN construct. Here is an example:

The result of the following sample query select status, year_id, sum(totalprice) as totalprice from orderfact group by status, year_id looks like this:

Result of sample query1

When we want to put the years (2003, 2004, 2005) into separate columns, we can use the SQL CASE WHEN construct:

CASE WHEN Boolean_expression THEN result_expression [ …n ] [ ELSE else_result_expression ]

to accomplish this, e.g.:

select status,
sum(case when year_id=2003 then totalprice else 0 end) as TOTAL_2003,
sum(case when year_id =2004 then totalprice else 0 end) as TOTAL_2004,
sum(case when year_id =2005 then totalprice else 0 end) as TOTAL_2005

from orderfact group by status

Et voilà, you have data that can be used as a cross tab:

Result from the CASE WHEN

Another use case for dealing with planning data

In another use case, we store actual and planning data (e.g. for turnover, quantity, weights etc.). To make it a bit more challenging, planning data is often stored as a budget and multiple forecasts or different plan versions (scenarios).

One solution for this problem is to store these data in separate columns, e.g. have columns for actual and other columns for planning data. Having multiple measures (see above) needs to have multiple columns for each type of the measures, e.g. turnover_actuals, quantity_actuals etc. and have this also for all types of planning scenarios. Thus you would get a big amount of columns and the biggest issue is: What happens when you need a new planning scenario? You would need to add more columns and your data model is not flexible enough and needs to be changed.

Another solution is to store the definition of an actual or planning type as an extra TYPE_ID in our table. This means each row is marked whether it is an actual or any kind of planning type.

But, when you sum up the whole table, the result would be wrong since it contains a mix of actual and planning data.

This can be solved also with the CASE WHEN construct even within a Mondrian schema definition:

    <Measure name=”Sales Actuals” aggregator=”sum” formatString=”#,###.00″>
    <MeasureExpression>    <SQL dialect=”generic”>
             (case when type_id=0 then TOTALPRICE else 0 end)
   
</SQL>    </MeasureExpression>    </Measure>

    <Measure name=”Sales Budget” aggregator=”sum” formatString=”#,###.00″>
    <MeasureExpression>    <SQL dialect=”generic”>
          (case when type_id=1 then TOTALPRICE else 0 end)
    </SQL>    </MeasureExpression>    </Measure>

[for more details see the Mondrian documentation about Measures]

With this solution you are flexible with your data model and the result will be correct.

Another more complex use case is to combine this with a reporting solution for project costing or contribution accounting. When you have a large number of different costs and would store these costs in separate columns you will reach a non manageable amount of columns and your data model is not flexible enough. It’s better to store these different types of costs in rows and you are much more flexible.

By all the flexibility with the data model you need to consider performance aspects of your database, e.g. if column based databases behave different with the CASE WHEN (I have not tested this) and different approaches should be tested. Any comments and findings are much appreciated.

|