Archive for June 2011
The new XML Input Stream (StAX) step in PDI 4.2
24.6.2011 by Jens Bleuel.
This step reads data from any type of XML file using the StAX parser. The existing Get Data from XML step is easier to use, but it relies on DOM parsers that process the data in memory; even purging parts of the file is not sufficient when those parts are very big.
The XML Input Stream (StAX) step takes a completely different approach to use cases with very big and complex data structures and the need for very fast data loads. Since Kettle already has many steps to process data in different ways, the processing logic has been moved into the transformation, and the step itself provides the raw XML data stream together with additional, helpful processing information.
Since the processing logic for some XML files can be very tricky, a good knowledge of the existing Kettle steps is recommended when using this step. Please see the samples on the Kettle Wiki for illustrations of its usage.
Note: Almost all use cases needed a Set/Reset functionality. At this time it can be accomplished with the Modified Java Script Value step or the User Defined Java Class step; the latter is recommended and much faster. A dedicated Kettle step with Set/Reset functionality is on the road map to solve these and similar use cases; see PDI-6389 for more details.
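To make the Set/Reset idea more concrete, here is a minimal Java sketch of the kind of logic one might put into a User Defined Java Class step. It is an illustration under assumptions, not the planned Kettle step: it assumes that Set/Reset means "remember a value when a start condition occurs and clear it when an end condition occurs", the class name SetResetLatch and the row format ("START:<name>", "END", anything else is data) are hypothetical, and rows are reduced to plain strings.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: holds a value from a "set" event until a "reset" event. */
final class SetResetLatch {
    private String current; // null means "nothing is set"

    void set(String value) { current = value; }

    void reset() { current = null; }

    /** Returns the latched value, or an empty string when nothing is set. */
    String get() { return current == null ? "" : current; }

    /**
     * Prefixes every data row with the value latched from the most recent
     * "START:<name>" row. An "END" row clears the latch again.
     * The row format is invented for this sketch.
     */
    static List<String> tagRows(List<String> rows) {
        SetResetLatch latch = new SetResetLatch();
        List<String> out = new ArrayList<>();
        for (String row : rows) {
            if (row.startsWith("START:")) {
                latch.set(row.substring("START:".length()));
            } else if (row.equals("END")) {
                latch.reset();
            } else {
                out.add(latch.get() + "|" + row);
            }
        }
        return out;
    }
}
```

In a real transformation the "set" and "reset" conditions would typically be the start and end of a parent element, so that every row in between can carry the parent's value.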
Choose this step whenever you run into limitations with other steps or need to parse XML under the following conditions:
- Very fast and independent of memory, regardless of the file size (GBs and more are possible thanks to the streaming approach)
- Very flexible, reading different parts of the XML file in different ways (and avoiding parsing the file many times)
Here is an example of parsing the following XML file with 2 main sample data blocks (Analyzer Lists & Products):
A preview on the step may look like this (depending on the selected fields):
As you can see, you get almost the original streaming information, with elements and attributes from the XML file, together with other helpful fields such as the element level.
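For readers who have not worked with StAX directly, the following Java sketch shows how a streaming parser can be turned into flat rows of event type, element level, name and value. It only illustrates the idea: the class name StaxRowSketch and the "type|level|name|value" row layout are assumptions made for this example and are not the actual output fields of the step.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical sketch: flattens StAX events into "type|level|name|value" rows. */
final class StaxRowSketch {

    static List<String> toRows(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Deliver adjacent text as one CHARACTERS event instead of several.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));

        List<String> rows = new ArrayList<>();
        int level = 0; // nesting depth of the current element
        try {
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        level++;
                        rows.add("START_ELEMENT|" + level + "|" + reader.getLocalName() + "|");
                        // Attributes are only available while positioned on the start element.
                        for (int i = 0; i < reader.getAttributeCount(); i++) {
                            rows.add("ATTRIBUTE|" + level + "|"
                                    + reader.getAttributeLocalName(i) + "|"
                                    + reader.getAttributeValue(i));
                        }
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        if (!reader.isWhiteSpace()) { // skip pure formatting whitespace
                            rows.add("CHARACTERS|" + level + "||" + reader.getText().trim());
                        }
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        rows.add("END_ELEMENT|" + level + "|" + reader.getLocalName() + "|");
                        level--;
                        break;
                    default:
                        break; // comments, processing instructions etc. are ignored here
                }
            }
        } finally {
            reader.close();
        }
        return rows;
    }
}
```

Only the current event is held at any time, which is why memory use does not grow with the file size. A real reader would stream from a file and pass each row on immediately instead of collecting them in a list.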
The transformation looks like this:
The end result for the Analyzer List block:
The end result for the Products block (split, for example, into two separate data streams for the end system):
And there are many more options in the step to help meet your needs:
More details about this step can be found in the XML Input Stream (StAX) documentation.
And for more details on the features for the upcoming PDI 4.2, please have a look at Matt’s blog.
Posted in Kettle (PDI) | No comments »
Pentaho BI 4 Delivers Power to the User
22.6.2011 by Jens Bleuel.
New interactive reporting and enhanced visualizations enable fast and affordable user-driven BI
More details can be found over here: http://www.pentaho.com/power-to-the-user
Watch the video!
Posted in General | No comments »
Security Considerations and Encryption with Kettle
7.6.2011 by Jens Bleuel.
Kettle is used more and more in enterprises where the standard obfuscation of credentials is not sufficient. There are requirements to use strong encryption methods and even to store internal data encrypted (covered in PDI-6168 and PDI-6170). These use cases inspired me to create some simple transformations to test and play around with encryption.
The transformations and some test data are attached to the Kettle Exchange page Security Considerations and Encryption with Kettle.
Let’s start with creating a key by the cryptographyCreateSecretKey transformation:
The generateKey step uses the User Defined Java Class step and implements sample code for AES. The Advanced Encryption Standard is a symmetric-key encryption standard; see also http://java.sun.com/developer/technicalArticles/Security/AES/AES_v1.html. The key serialization to file is a little trick to obfuscate the key. Other methods can be used instead of the clear text file output.
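As a rough idea of what AES key generation looks like in plain Java, here is a minimal sketch using the standard javax.crypto API. It is not the code of the generateKey step: the class name AesKeySketch and its methods are hypothetical, the 128-bit key size is an assumption, and the key is written as raw bytes without the serialization trick mentioned above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.NoSuchAlgorithmException;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical sketch: create an AES key and store it in a key file. */
final class AesKeySketch {

    /** Generates a random 128-bit AES key (the key size is an assumption). */
    static SecretKey generateKey() throws NoSuchAlgorithmException {
        KeyGenerator generator = KeyGenerator.getInstance("AES");
        generator.init(128);
        return generator.generateKey();
    }

    /** Writes the raw key bytes to a file; no obfuscation is applied here. */
    static void writeKey(SecretKey key, Path file) throws IOException {
        Files.write(file, key.getEncoded());
    }

    /** Rebuilds the key from the raw bytes stored in the key file. */
    static SecretKey readKey(Path file) throws IOException {
        return new SecretKeySpec(Files.readAllBytes(file), "AES");
    }
}
```

Anyone who can read such a file can decrypt the data, so in practice the file needs restrictive permissions or a proper key store.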
Now that we have the key file, we can encrypt our secret data:
With the transformation cryptographyEncrypt:
We keep it simple and assume the key is available in each row (accomplished by the Join Key).
The encrypted result looks like this:
Let’s decrypt it with the transformation cryptographyDecrypt:
The result is correct, but only when the key file is the same and the encrypted data has not been modified. You can test this yourself and see which error messages come up, or what the resulting files look like, when the key file or the data has been modified.
Instead of storing the decrypted data in a file, there are a lot of other options, e.g.:
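The encrypt and decrypt round trip can also be tried outside of Kettle. The following Java sketch is an assumption-laden illustration, not the code of the sample transformations: it deliberately uses AES in GCM mode (which may differ from what the transformations use) because GCM makes decryption fail with an exception when the key is wrong or the data was modified. The class name AesGcmSketch and the "IV followed by ciphertext" message layout are hypothetical choices for this example.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

/** Hypothetical sketch: AES-GCM round trip with tamper detection. */
final class AesGcmSketch {
    private static final int IV_BYTES = 12;  // common IV length for GCM
    private static final int TAG_BITS = 128; // length of the authentication tag

    /** Returns IV + ciphertext (the ciphertext includes the authentication tag). */
    static byte[] encrypt(SecretKey key, String plainText) throws GeneralSecurityException {
        byte[] iv = new byte[IV_BYTES];
        new SecureRandom().nextBytes(iv); // a fresh IV for every message
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
        byte[] cipherText = cipher.doFinal(plainText.getBytes(StandardCharsets.UTF_8));
        byte[] message = Arrays.copyOf(iv, IV_BYTES + cipherText.length);
        System.arraycopy(cipherText, 0, message, IV_BYTES, cipherText.length);
        return message;
    }

    /** Throws a GeneralSecurityException if the key or the message does not match. */
    static String decrypt(SecretKey key, byte[] message) throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, key,
                new GCMParameterSpec(TAG_BITS, message, 0, IV_BYTES));
        byte[] plain = cipher.doFinal(message, IV_BYTES, message.length - IV_BYTES);
        return new String(plain, StandardCharsets.UTF_8);
    }
}
```

Flipping a single byte of the message, or decrypting with a different key, makes decrypt throw instead of returning garbage, which mirrors the "only when the key file is the same and the data was not modified" behavior described above.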
- use the decrypted data as credentials in subsequent steps or transformations
- put the decrypted data into variables visible in a limited scope (e.g. parent job) and use them as credentials for databases, repository etc. (see PDI-6168)
- and many more options
We may consider:
- Symmetric-key algorithm vs. asymmetric key algorithms (public-key cryptography)
- Diffie-Hellman key exchange is a specific method of exchanging keys.
- Ensure integrity, e.g. by hash codes
- Key file handling could be optimized in different ways.
- Please keep in mind that unencrypted data is in RAM (see PDI-6170 for a workaround to prevent heap dumps)
- Besides the binary or indexed storage types, an encrypted storage type may be possible in Kettle core.
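For the integrity point in the list above, one possible approach is a keyed hash (HMAC) stored next to the data. The Java sketch below is only an illustration of that idea with the standard javax.crypto.Mac API; the class name IntegritySketch, the choice of HMAC-SHA256 and the use of a separate integrity key are assumptions and not part of the sample transformations.

```java
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical sketch: detect modified data with an HMAC-SHA256 tag. */
final class IntegritySketch {

    /** Computes the tag that is stored or sent alongside the data. */
    static byte[] hmac(byte[] key, byte[] data) throws GeneralSecurityException {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(key, "HmacSHA256"));
        return mac.doFinal(data);
    }

    /** Recomputes the tag and compares it with the expected one. */
    static boolean verify(byte[] key, byte[] data, byte[] expectedTag)
            throws GeneralSecurityException {
        // MessageDigest.isEqual avoids leaking the mismatch position through timing.
        return MessageDigest.isEqual(hmac(key, data), expectedTag);
    }
}
```

A plain hash without a key would only detect accidental corruption; the keyed variant also detects deliberate modification by someone who does not know the key.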
In the end: Don’t lose your key!
Update since Kettle 4.2: There are two steps in the experimental section, Secret key generator and Symmetric Cryptography, that cover this use case.
Posted in Kettle (PDI) | No comments »










