How the SQL CASE WHEN construct can help you in pivoting data?
20.4.2011 von Jens Bleuel.
Within Kettle you have the Normalizer and Denormalizer steps to help with pivoting (a simplified use case for pivoting is changing the row/column axis, see also transpose). Imagine you would like to have a cross table, you may use a Kettle transformation to accomplish this. (In these days Pentaho Reporting’s cross tab functionality is on the road map but not implemented, yet. One solution would be to use a Kettle Transformation as a data source.).
Another solution for a cross tab is to use the SQL CASE WHEN construct. Here is an example:
The result of the following sample query select status, year_id, sum(totalprice) as totalprice from orderfact group by status, year_id looks like this:
When we want to put the years (2003, 2004, 2005) into separate columns, we can use the SQL CASE WHEN construct:
CASE WHEN Boolean_expression THEN result_expression [ …n ] [ ELSE else_result_expression ]
to accomplish this, e.g.:
select status,
sum(case when year_id=2003 then totalprice else 0 end) as TOTAL_2003,
sum(case when year_id =2004 then totalprice else 0 end) as TOTAL_2004,
sum(case when year_id =2005 then totalprice else 0 end) as TOTAL_2005
from orderfact group by status
Et voilà, you have data that can be used as a cross tab:
Another use case for dealing with planning data
In another use case, we store actual and planning data (e.g. for turnover, quantity, weights etc.). To make it a bit more challenging, planning data is often stored as a budget and multiple forecasts or different plan versions (scenarios).
One solution for this problem is to store these data in separate columns, e.g. have columns for actual and other columns for planning data. Having multiple measures (see above) needs to have multiple columns for each type of the measures, e.g. turnover_actuals, quantity_actuals etc. and have this also for all types of planning scenarios. Thus you would get a big amount of columns and the biggest issue is: What happens when you need a new planning scenario? You would need to add more columns and your data model is not flexible enough and needs to be changed.
Another solution is to store the definition of an actual or planning type as an extra TYPE_ID in our table. This means each row is marked whether it is an actual or any kind of planning type.
But, when you sum up the whole table, the result would be wrong since it contains a mix of actual and planning data.
This can be solved also with the CASE WHEN construct even within a Mondrian schema definition:
<Measure name=”Sales Actuals” aggregator=”sum” formatString=”#,###.00″>
<MeasureExpression> <SQL dialect=”generic”>
(case when type_id=0 then TOTALPRICE else 0 end)
</SQL> </MeasureExpression> </Measure>
<Measure name=”Sales Budget” aggregator=”sum” formatString=”#,###.00″>
<MeasureExpression> <SQL dialect=”generic”>
(case when type_id=1 then TOTALPRICE else 0 end)
</SQL> </MeasureExpression> </Measure>
[for more details see the Mondrian documentation about Measures]
With this solution you are flexible with your data model and the result will be correct.
Another more complex use case is to combine this with a reporting solution for project costing or contribution accounting. When you have a large number of different costs and would store these costs in separate columns you will reach a non manageable amount of columns and your data model is not flexible enough. It’s better to store these different types of costs in rows and you are much more flexible.
By all the flexibility with the data model you need to consider performance aspects of your database, e.g. if column based databases behave different with the CASE WHEN (I have not tested this) and different approaches should be tested. Any comments and findings are much appreciated.
Geschrieben in General | Keine Kommentare »
Connecting with SAP BI Systems
3.3.2011 von Jens Bleuel.
There are a lot of different options to connect with a SAP system from Kettle. It was back in 2009 were I tested the functionality to access a SAP BI system via Webservices (XML/A) and here are my (historic) findings:
To analyze the result of the MDX query with Kettle is complicated (you need for every number of columns a separate transformation) but is possible. In these days other Pentaho frontend tools are able to process this direct via OLAP4J. I tested this with the Webservices step (using Basic Authentication) and for specific usecases this would be an option. Further development brought the OLAP Input step and this simplifies a lot and was tested successfully.
Here are some further information that might be of help for setting this up:
For the XML/A interface the following webservice should be switched on in transaction SICF:
/default_host/sap/bw/xml/soap/xmla
XMLA has two methods
Discover - This method is used to query metadata and master data (this corresponds to the BAPI MDDataProviderBW).
Execute - You can execute MDX commands with this method to receive the corresponding result set (this corresponds to the BAPI MDDataSetBW).
Description of the service can be reached by: <Protocol>://<Server>:<Port>/sap/bw/xml/soap/xmla?wsdl
For XMLA with SAP BI you have to create MDX commands generated from the metadata you receive from the system.
Below is transaction MDXTEST in SAP BI to test MDX query on a SAP BI infoprovider or query:
SELECT
{[Measures].[0HDCNT_LAST],
[Measures].[ZKFHRFTE]} ON COLUMNS,
[0EMPLOYEE].MEMBERS ON ROWS
FROM [$ZMCHRPA1]
Some syntax templates are available in the MDX test editor.
Another service that is available is to get data directly from a query or query view. The service has one method: GetQueryViewData
Go to SICF and activate this web service: /sap/bc/srt/rfc/sap/QUERY_VIEW_DATA
The output will be in format of Datasets: AXIS_INFO (Metadata description of the characteristics, attributes, and structures on the axes), AXIS_DATA (Characteristic values, attribute values etc.), CELL_DATA (Value cells: The cells are arranged by row first, and then by column), TXT_SYMBOLS (Text elements for the query).
Further information:
Unfortuneatly all screen shots or data samples I have are from a real system but I hope some of the information given here are useful nevertheless to set up the SAP side.
Thanks to Paul Stoellberger, who continued to work on this. He provided me with some screen shots for Pentaho Analysis Tool (PAT), Pentaho Reporting and Kettle accessing a SAP BI system.
Geschrieben in Kettle (PDI) | Keine Kommentare »
Operational Patterns and the Watchdog Concept for Kettle
1.2.2011 von Jens Bleuel.
The Watchdog Concept for Kettle was presented at the Pentaho Community Event in Cascais, Portugal in September 2010. It came into my mind when I created the operational patterns for the Pentaho Data Integration for Database Developers training course and combined this with solutions from electronics to solve these types of problems (e.g. detect software or hardware errors reliably) and apply immune aware development concepts.
What’s the idea behind this?
When you want to check if your Kettle jobs and transformations execute correct, you can use some operational patterns like
- Health Checks for the JVM
- Health checks for clustered environments and failover of a master server
- Analyze log entries
- Use restartable solutions
- and others that are described in more detail in the Pentaho PDI training class
When you need a different approach for more safety and to measure additional values, then you can use the concept of a Watchdog and combine this with the normal patterns. A use case is e.g. you are building a very critical system (imagine a rocket flying to space) and engage different software developers, use different measuring units and different computers to test if everything is correct. When one of the units fails, you can signal an event and e.g. abort the launch. Well, the Kettle world is most times not such critical but it is used more and more in systems that need a very high reliability and 24×7 with high and continuously guaranteed throughput.
With a Kettle solution you may need many components to check, e.g. the Data Integration Server, Carte and clustered environments, Kitchen/Pan jobs and transformations, detect dead locks and also for external components like the JVM (e.g. memory, used CPUs), the server (e.g. test with a network ping) and the databases (up and running, capacity etc.).
You also need a signal to noise detection: Define what is normal (noise) and notify on exceptions (e.g. set thresholds – absolute or relative eventually by average), define what is unusual and notify on these events and define events (e.g. actions, notifications & alerts). By all these checks you have a primary constraint: You need to minimise the footprint and impact to the system by the measurements.
Why a Watchdog can help here?
One solution for checking if everything is on track is to use the concept of a Watchdog with tasks and events:
A software crash might go undetected by conventional watchdog strategies when you have multitasking (e.g. many PDI jobs and cluster nodes). The success lies in weaving the watchdog into the fabric of all of the system’s tasks, which is much easier than it sounds:
- Build a watchdog task
- Create a data structure (database table) that has one entry per task
- When a task starts it increments its entry in the structure. Tasks that only start once and stay active forever can increment the appropriate value each time through their main loops, e.g. every 10,000 rows
- As the job or transformation runs the number of counts for each task advances.
- Infrequently but at regular intervals the watchdog runs.
- The watchdog scans the structure, checking that the count stored for each task is reasonable. One that runs often should have a high count; another which executes infrequently will produce a smaller value.
- If the counts are unreasonable, halt and let the watchdog timeout and fire an event. If everything is OK, set all of the counts to zero and exit.
This is a derived concept from Jack Ganssle (2004): Great Watchdogs (especially the section WDTs for Multitasking).
An example implementation with Kettle
Most of the Kettle health checks can be accomplished with this Watchdog concept. An example implementation with Kettle is task oriented, not server oriented. This means it will check, if the task (a Transformation or Job) is running as expected independently in what environment (e.g. clustered or not).
Since Kettle has a wide variety of transformation steps and job entries I tend often not to program in a classical programming language but solve this solely with Kettle.
Here is an example of the event handling that could be altered very flexible by adding new events:
And here is an example of the above described logic of the Watchdog step. Sooner or later this could be simplified by building some Kettle steps that encapsulate these logics but for now it just works:
Geschrieben in Kettle (PDI) | 1 Kommentar »
Another blog about Kettle aka Pentaho Data Integration (PDI)
22.1.2011 von Jens Bleuel.
This is the starting point for the blog Fun Stuff about the Open Source ETL Tool Kettle aka Pentaho Data Integration (PDI).
The short history: Five years ago, in early December 2005, Matt Casters released the initial open source version of Kettle. I knew Kettle already before this time, developed a plug-in to get data out of SAP in 2005, did a lot of things around this great tool and joined Pentaho in mid 2007. You can read a bit more about the story in the book Pentaho Kettle Solutions by Matt Casters, Roland Bouman and Jos van Dongen.
![]()
More about Pentaho:
.
![]()
More about me:
Geschrieben in Kettle (PDI), General | Keine Kommentare »
