Archiv der Kategorie Kettle (PDI)

How to deal with Kettle bugs?

I’m always amazed about our great community: finding bugs and even fixing them!

That’s the way it goes:

Found a Kettle bug

Thanks to Paul for making my day by this post.

When you want to learn more about bug reports, have a look at: Bug Reports and Feature Requests FAQ

Data Profiling and Data Quality (Human Inference) Integration with Kettle

Data Profiling with DataCleaner (Human Inference) and Kettle

It was already possible to profile your data in an easy way with Kettle: Open the Database Explorer, chose a table and right click in the context menu on Data Profile. The result was basic information about the data like Min, Max, Count all for strings and additional information for numeric data but these were only basic metrics about your data. We have a much more better solution now:

Human Inference (DataCleaner) and Pentaho (Kettle) worked together to integrate their tools and the result is a nice and seamless integration of DataCleaner into Kettle. A sample for introduction and FAQ can be found at Kettle Data Profiling with DataCleaner

You can right click on any step within your transformation and profile your data. It is also possible to clean or harmonize your data and check the result directly within your transformation.

DC context menu

Here are some screen shots:

  • Number Analyzer

DC Number Analyzer

  • String Analyzer

DC String Analyzer

  • Pattern Analyzer (e.g. you see how many single words with first capital letters etc. exist in your data):

DC Pattern Analyzer

And there are much more Analyzers you can choose from: Matching & Deduplication, Boolean, Character Set Distribution, Data Gap, Date/Time, Reference Data Matcher, Value & Weekday DistributionAnalyzers

DC Analyzers

For a complete reference look at http://datacleaner.eobjects.org and for a quick introduction and sample check out: http://wiki.pentaho.com/display/EAI/Kettle+Data+Profiling+with+DataCleaner

Data Quality with EasyDQ (Human Inference) and Kettle

Additional to the Data Profiling capabilities, a couple of steps for Data Quality have been introduced into Kettle as plug-ins. This covers

  • Name Validation, Standardization and Cleansing
  • Address Validation, Standardization and Cleansing
  • E-Mail and Telephone Validation, Standardization and Cleansing
  • Duplicate Detection and Merge Duplicates

Here is a validation example:

DQ Sample 1


More details, download instructions, a sample video and examples can be found over here:

Pentaho Kettle for Big Data

All of Pentaho’s big data capabilities will be available as open source in the new Pentaho Kettle 4.3 release: Big data capabilities include the ability to input, output, manipulate and report on data using the following Hadoop and NoSQL stores: Cassandra, Hadoop HDFS, Hadoop MapReduce, Hadapt, HBase, Hive, HPCC Systems and MongoDB.

With regard to Hadoop, Pentaho Kettle makes available job orchestration steps for Hadoop, Amazon Elastic MapReduce, Pentaho MapReduce, HDFS File Operations, and Pig scripts. All major Hadoop distributions are supported including: Amazon Elastic MapReduce, Apache Hadoop, Cloudera’s Distribution including Apache Hadoop (CDH), Cloudera Enterprise, EMC Greenplum HD, HortonWorks Data Platform powered by Apache Hadoop, and MapR’s M3 Free and M5 Edition. Pentaho Kettle can execute ETL transforms outside the Hadoop cluster or within the nodes of the cluster, taking advantage of Hadoop’s distributed processing and reliability.

Download, how-to docs, videos and more at http://community.pentaho.com/BigData, see also Matt’s blog at http://www.ibridge.be/?p=207

Things to avoid with PDI (Part 1 to 5)

Just watch…

Things to avoid with PDI - Part 1

Things to avoid with PDI - Part 2

Things to avoid with PDI - Part 3

Things to avoid with PDI - Part 4

Things to avoid with PDI - Part 5

The new XML Input Stream (StAX) step in PDI 4.2

This step provides the ability to read data from any type of XML file using the StAX parser. The existing Get Data from XML step is easier to use but uses DOM parsers that need in memory processing and even the purging of parts of the file is not sufficient when these parts are very big.

The XML Input Stream (StAX) step uses a completely different approach to solve use cases with very big and complex data stuctures and the need for very fast data loads: Since Kettle has so many own steps to process data in different ways, the processing logic has been moved more into the transformation and the step itself provides the raw XML data stream together with additional and helpful processing information.

Since the processing logic of some XML files can sometimes be very tricky, a good knowledge of the existing Kettle steps is recommended to use this step. Please see the different samples at the Kettle Wiki for illustrations of the usage.

Note: In almost all use cases, a Set/Reset functionality was needed. At this time it can be accomplished by the Modified Java Script Value step or the User Defined Java Class step where the latter one is recommended and much more faster. An own Kettle step with Set/Reset functionality is one the road map to solve these and other similar use cases, see PDI-6389 for more details.

Choose this step, whenever you have limitations with other steps or when you are in need of parsing XML with the following conditions:

  • Very fast and independend of the memory regardless of the file size (GBs and more are possible due to the streaming approach)
  • Very flexible reading different parts of the XML file in different ways (and avoid parsing the file many times)

Here is an example of parsing the following XML file with 2 main sample data blocks (Analyzer Lists & Products):

XML Input Stream (StAX) Test 2 - XML

A preview on the step may look like this (depending on the selected fields):

XML Input Stream (StAX) Test 2 - Preview

You see you really get almost the original streaming information with Elements and Attributes from the XML file together with helpful other fields like the element level.

Since the processing logic of some XML files can sometimes be very tricky, a good knowledge of the existing Kettle steps is recommended to use this step. Please see the different samples of this step for illustrations of the usage.

The transformation looks like this:

XML Input Stream (StAX) Test 2 - Transformation

The end result for the Analyzer List block:

XML Input Stream (StAX) Test 2 - Analyzer List

The end result for the Products block (splitted for example into two separate data streams for the end system):

XML Input Stream (StAX) Test 2 - Both Products And Attributes

And there are a lot more options in the step to help to solve your needs:

XML Input Stream (StAX) Test 2 - Step Options

More details about this step can be found in the XML Input Stream (StAX) documentation.

And for more details on the features for the upcoming PDI 4.2, please have a look at Matt’s blog.

Security Considerations and Encryption with Kettle

Kettle is used more and more in enterprises where the standard obfuscation of credentials is not sufficient enough. There are requirements to use strong encryption methods and even to store internal data encrypted (covered in PDI-6168 and PDI-6170). The above use cases inspired me to create some simple transformations to test and play around with encryption.

The transformations and some test data are attached to the Kettle Exchange page Security Considerations and Encryption with Kettle.

Let’s start with creating a key by the cryptographyCreateSecretKey transformation:

cryptographyCreateSecretKey

The generateKey step uses the User Defined Java Class step and implements sample code for AES, the Advanced Encryption Standard is a symmetric-key encryption standard, see also http://java.sun.com/developer/technicalArticles/Security/AES/AES_v1.html. The key serialization to file is a little trick to obfuscate the key. Other methods can be included instead of the clear text file output.

Now that we have the key file, we can encrypt our secret data:

cryptographySampleUnencryptedInput

With the transformation cryptographyEncrypt:

cryptographyEncrypt

We keep it simple and assume the key is available in each row (accomplished by the Join Key).

The encrypted result looks like this:

cryptographySampleEncrypted

Let’s decrypt it with the transformation cryptographyDecrypt:

cryptographyDecrypt

The result is correct but only when the key file is the same and the encrypted data was not modified. You can test it yourself and see what error messages come up or the resulting files look like when the key file or data was modified.

Instead of storing the decrypted data to a file there are a lof of other options, e.g.:

  • use the decrypted data as credentials in subsequent steps or transformations
  • put the decrypted data into variables visible in a limited scope (e.g. parent job) and use them as credentials for databases, repository etc. (see PDI-6168)
  • and many more options

We may consider:

  • Symmetric-key algorithm vs. asymmetric key algorithms (public-key cryptography)
  • Diffie-Hellman key exchange is a specific method of exchanging keys.
  • Ensure integrity e.g. by hash-codes
  • Key file handling could be optimized in different ways.
  • Please keep in mind that unencrypted data is in RAM (see PDI-6170 for a circumvention to prevent heap dumps)
  • Beneath the binary or indexed storage type, an encrypted storage type may be possible in Kettle core.

In the end: Don’t lose your key!

Update since Kettle 4.2: There are two steps in the experimental section: Secret key generator, Symmetric Cryptography that cover this use case.

iPhone Tracking: How to read the consolidated.db with Kettle

I will not discuss about the buzz about the iPhone tracking - all that needs to be discussed is already out. That iPhone is storing locations in the consolidated.db was known a long time ago, but now we got a prove of concept by Pete Warden and Alasdair Allan and this inspired me to dig into this a bit more.

After an amazing short time of less than one hour I made it possible to read in my own iPhone consolidated.db with Kettle. [It takes more time to write about this… and the other real hard thing was to locate this file in my file system.] When we can read this file with Kettle we are open and can look up any additional data, filter by any criteria, store this in another database, use the Pentaho BI Suite to create maps and anything else that can a human being imagine. Just have a look at the Pentaho Sandbox or more general to the Pentaho site.

Here are my findings and steps to reproduce:

1) Locate the consolidated.db in your filesystem.

This is a little bit tricky, since the filename is mapped to an ID. I found it very easy by just searching for the string “CellLocation” in the iTunes backup folder, described over here for Windows:

findstr /M CellLocation C:\Users\[Username]\AppData\Roaming\Apple Computer\MobileSync\Backup\[latest backup folder]\*.*

I copied the file 4096c9ec676f2847dc283405900e284a7c815836 (this was my sample filename, yours will be most likely different) to consolidated.db to another folder.

A more robust and flexible solution is to read in and analyze the Manifest.mbdb / Manifest.mbdx as described over here in a Python script. It shouldn’t be too hard to port this to Java or JavaScript for a flexible Kettle solution.

2) Read in the consolidated.db with Kettle

Create a transformation with a SQLite connection type. The database name is the location of the filename as shown below as a sample:

SQLite connection type for consolidated.db

Let’s create a simple transformation (attached): Read in the Cell data, convert the Mac Absolute Time to a Java time and just group on a daily basis for simplification:

Transformation for consolidated.db

Let’s do a preview and browse the data a bit. Well I’m located in Mainz, known to be exactly at Latitude 50. So every other data is of interest and here comes a sample:

Sample Result for consolidated.db

Where I have been there? My forensic research on my own calendar found that I have been at the Pentaho EMEA Partner Summit & Community Event in Lisbon…. or was it only my iPhone?

Connecting with SAP BI Systems

There are a lot of different options to connect with a SAP system from Kettle. It was back in 2009 were I tested the functionality to access a SAP BI system via Webservices (XML/A) and here are my (historic) findings:

To analyze the result of the MDX query with Kettle is complicated (you need for every number of columns a separate transformation) but is possible. In these days other Pentaho frontend tools are able to process this direct via OLAP4J. I tested this with the Webservices step (using Basic Authentication) and for specific usecases this would be an option. Further development brought the OLAP Input step and this simplifies a lot and was tested successfully.

Here are some further information that might be of help for setting this up:

For the XML/A interface the following webservice should be switched on in transaction SICF:
/default_host/sap/bw/xml/soap/xmla

SAP XML/A

XMLA has two methods
Discover - This method is used to query metadata and master data (this corresponds to the BAPI MDDataProviderBW).
Execute - You can execute MDX commands with this method to receive the corresponding result set (this corresponds to the BAPI MDDataSetBW).

Description of the service can be reached by: <Protocol>://<Server>:<Port>/sap/bw/xml/soap/xmla?wsdl

For XMLA with SAP BI you have to create MDX commands generated from the metadata you receive from the system.

Below is transaction MDXTEST in SAP BI to test MDX query on a SAP BI infoprovider or query:
SELECT
{[Measures].[0HDCNT_LAST],
[Measures].[ZKFHRFTE]} ON COLUMNS,
[0EMPLOYEE].MEMBERS ON ROWS
FROM [$ZMCHRPA1]

Some syntax templates are available in the MDX test editor.

Another service that is available is to get data directly from a query or query view. The service has one method: GetQueryViewData
Go to SICF and activate this web service: /sap/bc/srt/rfc/sap/QUERY_VIEW_DATA

The output will be in format of Datasets: AXIS_INFO (Metadata description of the characteristics, attributes, and structures on the axes), AXIS_DATA (Characteristic values, attribute values etc.), CELL_DATA (Value cells: The cells are arranged by row first, and then by column), TXT_SYMBOLS (Text elements for the query).

Further information:

Unfortuneatly all screen shots or data samples I have are from a real system but I hope some of the information given here are useful nevertheless to set up the SAP side.

Thanks to Paul Stoellberger, who continued to work on this. He provided me with some screen shots for Pentaho Analysis Tool (PAT), Pentaho Reporting and Kettle accessing a SAP BI system.

Operational Patterns and the Watchdog Concept for Kettle

The Watchdog Concept for Kettle was presented at the Pentaho Community Event in Cascais, Portugal in September 2010. It came into my mind when I created the operational patterns for the Pentaho Data Integration for Database Developers training course and combined this with solutions from electronics to solve these types of problems (e.g. detect software or hardware errors reliably) and apply immune aware development concepts.

What’s the idea behind this?

When you want to check if your Kettle jobs and transformations execute correct, you can use some operational patterns like

  • Health Checks for the JVM
  • Health checks for clustered environments and failover of a master server
  • Analyze log entries
  • Use restartable solutions
  • and others that are described in more detail in the Pentaho PDI training class

When you need a different approach for more safety and to measure additional values, then you can use the concept of a Watchdog and combine this with the normal patterns. A use case is e.g. you are building a very critical system (imagine a rocket flying to space) and engage different software developers, use different measuring units and different computers to test if everything is correct. When one of the units fails, you can signal an event and e.g. abort the launch. Well, the Kettle world is most times not such critical but it is used more and more in systems that need a very high reliability and 24×7 with high and continuously guaranteed throughput.

With a Kettle solution you may need many components to check, e.g. the Data Integration Server, Carte and clustered environments, Kitchen/Pan jobs and transformations, detect dead locks and also for external components like the JVM (e.g. memory, used CPUs), the server (e.g. test with a network ping) and the databases (up and running, capacity etc.).

You also need a signal to noise detection: Define what is normal (noise) and notify on exceptions (e.g. set thresholds – absolute or relative eventually by average), define what is unusual and notify on these events and define events (e.g. actions, notifications & alerts). By all these checks you have a primary constraint: You need to minimise the footprint and impact to the system by the measurements.

Why a Watchdog can help here?

One solution for checking if everything is on track is to use the concept of a Watchdog with tasks and events:

 PDI Watchdog

A software crash might go undetected by conventional watchdog strategies when you have multitasking (e.g. many PDI jobs and cluster nodes). The success lies in weaving the watchdog into the fabric of all of the system’s tasks, which is much easier than it sounds:

  1. Build a watchdog task
  2. Create a data structure (database table) that has one entry per task
  3. When a task starts it increments its entry in the structure. Tasks that only start once and stay active forever can increment the appropriate value each time through their main loops, e.g. every 10,000 rows
  4. As the job or transformation runs the number of counts for each task advances.
  5. Infrequently but at regular intervals the watchdog runs.
  6. The watchdog scans the structure, checking that the count stored for each task is reasonable. One that runs often should have a high count; another which executes infrequently will produce a smaller value.
  7. If the counts are unreasonable, halt and let the watchdog timeout and fire an event. If everything is OK, set all of the counts to zero and exit. 

This is a derived concept from Jack Ganssle (2004): Great Watchdogs (especially the section WDTs for Multitasking).

An example implementation with Kettle

Most of the Kettle health checks can be accomplished with this Watchdog concept. An example implementation with Kettle is task oriented, not server oriented. This means it will check, if the task (a Transformation or Job) is running as expected independently in what environment (e.g. clustered or not).

Since Kettle has a wide variety of transformation steps and job entries I tend often not to program in a classical programming language but solve this solely with Kettle.

Here is an example of the event handling that could be altered very flexible by adding new events:

Watchdog Event Handling

And here is an example of the above described logic of the Watchdog step. Sooner or later this could be simplified by building some Kettle steps that encapsulate these logics but for now it just works:

Watchdog check logic

Another blog about Kettle aka Pentaho Data Integration (PDI)

This is the starting point for the blog Fun Stuff about the Open Source ETL Tool Kettle aka Pentaho Data Integration (PDI).

The short history: Five years ago, in early December 2005, Matt Casters released the initial open source version of Kettle. I knew Kettle already before this time, developed a plug-in to get data out of SAP in 2005, did a lot of things around this great tool and joined Pentaho in mid 2007. You can read a bit more about the story in the book Pentaho Kettle Solutions by Matt Casters, Roland Bouman and Jos van Dongen.

Pentaho

More about Pentaho:

Pentaho Links
Pentaho Books

.

JB

More about me:

Homepage
XING Profile
Twitter

|