Pentaho Community Contributions (#KCM19)

Leonardo Coelho, Software Engineer at Hitachi Vantara, and I presented at the Kettle Community Meeting 2019 (#KCM19): How can a Community Member contribute to Pentaho?

We both think it is helpful for the community to have a blog post listing the noteworthy links we provided; it might also help them get found by *any* search engine. The full presentation with more details can be downloaded as a PDF, and the recording is available on YouTube.

Links to Useful Documentation and Repositories

  • https://github.com/pentaho/pentaho-coding-standards – Pentaho Coding Standards
  • https://help.pentaho.com/Documentation/8.3/Developer_center/Embed_and_extend_PDI_functionality – Embed and extend PDI functionality
  • https://github.com/pentaho/pentaho-engineering-samples – Pentaho Engineering Samples
  • https://github.com/pentaho/pdi-sdk-plugins – Samples as part of the PDI SDK
  • https://github.com/pentaho/maven-project-archetypes – Maven project archetypes. For example, if you wanted to create a CE PDI step plugin, you would follow the instructions located in the README here: https://github.com/pentaho/maven-project-archetypes/tree/master/pentaho-pdi-step-plugin-archetype

Main Links for Pentaho Community Contributions

  • https://community.hitachivantara.com/s/pentaho – Main community web site for Pentaho, including forums and articles
  • https://help.pentaho.com – Pentaho documentation (some content can still be found on https://wiki.pentaho.com)
  • https://jira.pentaho.com – Where you search for and file defects, improvements, and feature requests
  • https://github.com/pentaho – Where the source code lives
  • https://sourceforge.net/projects/pentaho/ – Site that hosts compiled stable binaries
  • https://nexus.pentaho.org – Artifact repository where build dependencies get resolved from
  • https://help.pentaho.com/Documentation/8.3/Developer_center – APIs, customizing Pentaho products, extending and embedding (as of 8.3)
  • https://github.com/pentaho/marketplace-metadata – Pentaho Marketplace repository (see also https://marketplace.hitachivantara.com/pentaho/)

Pentaho Community Meeting PCM18 is coming to Bologna


In short: the 11th Pentaho Community Event will take place in beautiful Bologna on November 23 through 25, 2018. More details here: PCM18

First of all, it is awesome to see how the Italia Pentaho User Group, together with it-novum, is organizing this event. These guys are doing a fantastic job; thanks to all who are involved! (We definitely need a group picture of all the hard workers at the event.)

As of today, two months before the event, we have reached over 220 registrations – again a record in Pentaho history!

And we already have a lot of remarkable talks, split across two tracks: see and discuss the latest and greatest in Pentaho products and exciting geek stuff (techie track), as well as best practices from Pentaho implementations and successful projects (business track).

  • I already have a favorite talk, and it has to do with my early days of Kettle and Pentaho. I won't write more, otherwise I'll get too sentimental 🙂
  • Another favorite is seeing how synergy works through the integration between Hitachi Vantara products and PDI; there are very exciting use cases behind it! Again, I won't tell you the names of these talks: there are so many amazing talks that I will not promote one over the other, so just have a look at the agenda.
  • Also, check #PCM18 on Twitter to see the latest news, including talks and interviews!

There is only one bad (or good) thing at the time of writing this: the techie track is already full. So, just in case you would like to participate with a talk, bring it into the business track, where we still have some spots left. You can apply for your talk submission here: Call for Papers

And just in case you are not registered, register now here: PCM18

Looking forward to seeing you in Bologna and sharing all this with you! Let's make this a great event, similar to PCM17 last year in Mainz:

Oh, I forgot the most important part: There is already a playlist … songs to make the Saturday evening party great and PCM18 even more unforgettable!


Pentaho Community Meeting PCM17 – Call for Collectibles

Just in case you don't know yet: Pentaho Community Meeting PCM17 celebrates its 10th anniversary this November (10th through 13th) with a three-day event full of presentations and networking. And it-novum has the honor of hosting the event together with Pentaho. Read more here: PCM17 and FAQ

And I'm sure you already know the additional great news from last month: Pentaho is now part of Hitachi Vantara. If you want to know what will happen with the Pentaho community, read Pedro Alves' blog post: Hello Hitachi Vantara!

So, back to PCM17: it will take place in Mainz, where everything started in 2008. I'm very excited about this, since it will be the same location, just with much bigger(!) rooms. And I'm happy to assist and do things here and there, just as I did back in 2008. The difference is that back in 2008 I organized it almost all on my own, while this time we have so many helping hands – it's amazing!

Here is a short status: we already have the agenda filled up with talks and, for the first time ever, we added technical short talks on Friday together with the Hackathon. There are so many nice things to tell – we will make it all possible! And we have the highest number of attendees ever; stay tuned for the final numbers. We will make it happen so that everyone can get in.

So there is no Call for Presentations anymore, sorry… but there is a Call for Collectibles. Let me explain why and what this is:

For about two years now, I have been a member of the non-profit organization GO for Tansania e.V., which supports The Brilliant Pre-Primary and Primary School in Tanzania. This school teaches children from the poorest families of Tanzania, many of whom are orphans. It is financed mainly through donations from the annual fundraising run at the Gymnasium Mainz-Oberstadt school and further donations.

To support this non-profit organization, we will hold an auction on Saturday evening with a lot of nice Pentaho collectibles. BTW: since Pentaho is now part of Hitachi Vantara, it might be even more interesting to get hold of some of these 🙂

Here are two highlights of what we already have for the auction:

#1 The Pentaho tablecloth that you can see in the picture from the first Pentaho community meeting in 2008:

#2 The Pentaho flag that hung at the entrance of the first event:

Furthermore, we have scratch books, blank DVDs all with the nice Pentaho logo.

Call for Collectibles: Please bring what you have and what you want to donate via the auction to support children in Tanzania.

If you have something but are not able to attend the event, it would be great if you could send it over to me – please contact me. We will make sure your name is on the "Walk of Fame" of donors 🙂

Thanks a lot and stay tuned for more …

And just in case you are not registered yet for the 10th Pentaho Community Event, here you will find everything you need to know, so check it out: PCM17 and FAQ

Looking forward to seeing you in Mainz!


Metadata Injection – Examples for Special Scenarios

Since Metadata Injection (MDI) with Pentaho Data Integration is getting more and more popular and is used in a lot of projects, this blog post provides examples that help in special scenarios.

It is a follow-up to my previous blog post about Metadata Injection, which provides some more basics and background. The feature has also been improved in PDI 7.0, and we will continue to improve it in future releases. For more details, have a look at the Pentaho Documentation about Metadata Injection.

All the following examples can be downloaded here: MDI_Examples.zip

Special Example: MDI with the Filter Rows step

The Filter Rows step is a special MDI scenario, since it has a nested structure of filter conditions (the same applies to the Join Rows step).

Example: a OR b

Example: a OR (x AND y)

The MDI condition is given in XML notation, for example:

The condition XML has the same format in which the transformation metadata is stored in a KTR file. We do not have a DTD (Document Type Definition) for the KTR XML format, nor for the condition, but it is very easy to get to an XML condition:

  1. Create a sample Filter Rows step with the different conditions you need. This gives you all the information you need, for example the value for the function element (in our example, = represents the equals function).
  2. Select the step and copy it to the clipboard, then paste it into a text editor. Alternatively, you can save the KTR and open it in a text editor.
  3. Find the <condition> element and its nested elements and modify them accordingly for your MDI scenario.

Hint for removing the CR and LF characters within the XML in case you need it on a single line:

  1. Use Notepad++ or a similar text editor.
  2. Make sure to switch to the Extended search mode.
  3. Replace \r\n and leave the "Replace with" field empty.
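If you prefer to do this programmatically instead of in a text editor, a simple string replacement is enough. A minimal Java sketch (the XML string is just a placeholder):

public class StripNewlines {
    public static void main(String[] args) {
        // Placeholder for your condition XML spread over several lines
        String conditionXml = "<condition>\r\n  ...\r\n</condition>";
        // Remove all CR and LF characters so the XML fits on a single line
        String singleLine = conditionXml.replaceAll("[\\r\\n]", "");
        System.out.println(singleLine);
    }
}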

Different MDI Architectures

In general, an MDI process can:

  1. "Just" inject metadata and call the template transformation –> we call this the MDI Standard use case
  2. Additionally, the data can get pushed (streamed) from the main to the template transformation and also pulled back. This is needed when the template processes dynamic data from the main transformation. –> we call this the MDI Data Flow use case
  3. For big data use cases, we can fill the template transformation with metadata and store it (phase 1). Then this stored transformation can be copied and executed on multiple nodes (phase 2). –> we call this the MDI 2 Phase Processing use case, also used in our Onboarding Blueprint for Big Data: Filling the Data Lake

The following graphics illustrate these different architectures.

Scenario & Architecture for Standard:

Scenario & Architecture for Data Flow:

Scenario & Architecture for 2 Phase Processing – Phase 1:

Scenario & Architecture for 2 Phase Processing – Phase 2:

All the examples can be downloaded here: MDI_Examples.zip

And last but not least: if you miss a feature, let us know in JIRA (if you are a community user) or contact Pentaho Customer Support (if you are a Pentaho customer) to tell us about your feature requests and use cases. The more we know about what is needed, the more we can help. Join us, either as a customer or as a community member, to leverage the power of Pentaho.


How to visualize PDI Data Lineage with yEd

A description of the PDI Data Lineage feature, including setup instructions, can be found in the Pentaho 6.1 Documentation and in Pedro Alves' blog post Seeing PDI lineage information. This post is a step-by-step example showing how to use PDI Data Lineage with yEd.

Set up PDI Data Lineage

Modify …\system\karaf\etc\pentaho.metaverse.cfg (Client & DI-Server when needed)

  • Set lineage.execution.runtime = on
  • Set lineage.execution.generation.strategy=latest
  • Default folder for lineage GraphML files: lineage.execution.output.folder=./pentaho-lineage-output

How to use yEd?

Access Your Lineage Output

Use the following steps to access your lineage output from PDI:

  1. After you run your jobs and transformations, the created lineage files can be found in the specified default folder, for example pentaho-lineage-output.
  2. Open a lineage file in yEd and see how it looks:

yEd_howTo_01
yEd_howTo_02

Refine Your Output

The default layout does not give you a very useful visualization, so here are the instructions to make it useful:

  1. Download the default_yed_configuration.cnfx file for PDI. The file download location might change in a future version.
  2. Within the yEd menu, select Edit / Properties Mapper…
  3. Click on the Imports additional configurations icon and import the .cnfx file

yEd_howTo_03

  4. Select each of the configurations and press Apply:
  • Pentaho Metaverse Nodes (Node)
  • Pentaho Metaverse Edges (Edge)
  • PDI Nodes (Node)

yEd_howTo_04

  5. Finally, close this window with Ok.
  6. Within the yEd menu, select Layout > Hierarchical and press Ok in the window (keep the default settings).

Now it looks better, but we have way too much information:

yEd_howTo_05a

Optimize the View

Let’s optimize the view. We will look mostly at the Neighborhood window to search for information.

  1. Drag and drop the Neighborhood window to the right hand side.

yEd_howTo_06

  2. Resize the Neighborhood window on the right-hand side.

Note: This window does not show any information when there is too much to display, so change the window size and make it bigger.

yEd_howTo_07

  3. Search for something in the Structure View. In our example, we have a target field called totalprice, and we want to find out where this field is coming from.
  4. Enter totalprice in the search view. You will see that there are multiple totalprice nodes. In the following example, we see it is a database column (written into the fact_sales table).

yEd_howTo_08_totalprice3a

  5. Select another totalprice node – in the following example, we see it is also the name of a step (a Calculator step, for example) and the PDI hops to/from the next steps.

yEd_howTo_08_totalprice2a

Other Visual Option

Another view can give us the information we need: totalprice is also a stream field and is derived from two other stream fields, quantityordered and priceeach.

yEd_howTo_08_totalprice1a

  1. Double-click on a derived field (quantityordered, for example); we can trace back the data lineage in this example until we reach the orderdetails Table Input step.

yEd_howTo_08_totalprice4

  2. Double-click on the query attribute to look at the SQL within the Properties View and see the details:

yEd_howTo_08_totalprice5
yEd_howTo_08_totalprice6

  3. Look at the database connections and see that the orderdetails step uses the pentaho_olap database connection.

yEd_howTo_08_totalprice7

You see how much detailed information we already collect with data lineage and how you can visualize it all with yEd. Stay tuned for more updates and improvements in future releases!

Contribute to improve the lineage

And last but not least: if you are a developer and want to contribute to improving the lineage, this documentation is a good starting point: Contribute Additional Step and Job Entry Analyzers to the Pentaho Metaverse


Pentaho 6.1 is Released – Part 2 – Metadata Injection

As I mentioned yesterday, Metadata Injection is my favorite part of this release. Now I get to explain why.

Metadata Injection (MDI) gives you the ability to modify transformations at execution time.

By dynamically passing source metadata to PDI at run time, IT teams can drive hundreds of data ingestion and preparation processes through just a few actual transformations. This promotes reusability and productivity, accelerating the creation of complex data onboarding processes.

How do Pentaho 6.1 and data onboarding relate back to the analytic data pipeline?

6.1 pipeline image

Organizations face challenges scaling their data pipelines to accommodate exploding data variety, volume, and complexity. In particular, it can take considerable time and resources to engineer and prepare data for the following types of enterprise use cases:

  • Migrating hundreds of tables between databases
  • Ingesting hundreds of changing data sources into Hadoop
  • Enabling business users to onboard a variety of data themselves

The modern data onboarding necessary in these projects is more than just ‘connecting to’ or ‘loading’ data. Rather, it involves managing a scalable, repeatable process to ingest an array of changing data sources.

Traditional static ETL or hand-coding approaches require repetitive, time-consuming manual design that can increase the risk of human error. This also leads to staffing challenges and an increased backlog: IT labor hours are tied up in risky, long-term development projects while other, potentially more valuable projects could be worked on instead (opportunity cost).

What is metadata injection and why is it useful?

Recall that metadata is “a set of data that describes and gives information about other data.” In the context of a simple PDI transformation, examples of metadata include the field names, field types (string, number, etc), and field lengths (5 characters or 50?) from a CSV file input step in PDI. Normally, this metadata is accessed manually when a user creates the transformation, such as by hitting the Get Fields button in the CSV input step.

Metadata injection refers to the dynamic passing of metadata to PDI transformations at run time in order to control complex data integration logic. The metadata (from the data source, a user defined file, or an end user request) can be injected on the fly into a transformation template, providing the “instructions” to generate actual transformations. This enables teams to drive hundreds of data ingestion and preparation processes through just a few actual transformations, heavily accelerating time to data insights and monetization. In data onboarding use cases, metadata injection reduces development time and resources required, accelerating time to value. At the same time, the risk of human error is reduced.

BTW: if you are a fan of Data Vault Modeling, MDI can help a lot here as well.

Here is a comparison of static data integration with metadata injection:

MDI Static vs MDI

How it all began: this feature has existed since PDI 4.1, and Matt Casters blogged about it with a very nice example back in 2011: Parse nasty XLS with dynamic ETL

In Pentaho 6.1, we have hardened and enhanced our metadata injection capabilities. This includes enabling metadata injection with new steps, providing new documentation and examples on help.pentaho.com, and making other standardizations and improvements.

Here is a list of PDI steps that support metadata injection as of PDI 6.1:

MDI Steps

Sample: We also added a step-by-step MDI tutorial that documents how to create a solution that parses, processes, and unifies different Excel files containing invoice data from suppliers:

MDI Sample 1

MDI Sample 2

MDI Sample 3

MDI Sample 4


For more details, have a look at the Pentaho Documentation about Metadata Injection. You can download the complete sample here: flex_file_reader_metadata_injection.zip

This is a simplified example for illustration purposes that stores the data in a text file. The same approach can be used in all sorts of use cases and can store the data in SQL databases, NoSQL databases, or big data stores. You have all the capabilities of PDI – sometimes I hear the term Swiss Army knife 🙂

The journey continues, and our engineering teams are already working full steam ahead to add more and more capabilities in the upcoming releases – stay tuned!

And last but not least: if you miss a feature, let us know in JIRA (if you are a community user) or contact Pentaho Customer Support (if you are a Pentaho customer) to tell us about your feature requests and use cases. The more we know about what is needed, the more we can help. Join us, either as a customer or as a community member, to leverage the power of Pentaho.

You can read the details with more information in our documentation What’s New in Pentaho 6.1.

Learn more about Pentaho 6.1 including a video overview, download link, webinar and more: Just follow this link.


Pentaho 6.1 is Released – Part 1 – New Features and Improvements

I am very excited to share that today we announced Pentaho 6.1 (see also the German press release). There is so much to share that I am dividing this into two blog posts. Today I will share a summary of the new features and improvements of Pentaho 6.1. Stay tuned for tomorrow when I will go into detail about my personal favorite new feature in this release – Metadata Injection.

New Features and Improvements in Pentaho 6.1

Pentaho Data Services

This powerful feature has been around since PDI 5.4 (originally known as the Thin Kettle JDBC driver), and it solves a couple of really nice use cases, as the following picture illustrates:

Pentaho Data Services

What we did in 6.0: usability and performance were improved through caching and push-down optimization. We are not finished yet: there are remaining use cases where we continue these improvements and harden this feature for enterprise use.

And this is what we worked on in 6.1:

  • Added Parameter Pushdown Optimization for Data Services – You can improve the performance of your Pentaho data service through the new Parameter Pushdown optimization technique. This technique is helpful if your transformation contains any step that should be optimized, including input steps like REST where a parameter in the URL could limit the results returned by a web service.
  • Driver Download for Data Services in Pentaho Data Integration – When connecting to a Pentaho Data Service from a non-Pentaho tool, you previously needed to manually download a Pentaho Data Service driver and install it. Now in 6.1, you can use the Driver Details dialog in Pentaho Data Integration to download the driver (see the connection sketch after this list).
  • Pentaho Data Service as a Build Model Source – You can use a Pentaho Data Service as the source in your Build Model job entry, which streamlines the ability to generate data models when you are working with virtual tables.
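For the non-Pentaho-tool case, once that driver is on the classpath, a data service can be queried like any other JDBC source. The sketch below is only an illustration: the driver class name and URL format are assumptions from memory, and the host, port, web app name, service name, and credentials are placeholders, so take the authoritative values from the downloaded driver and the Pentaho documentation.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DataServiceClient {
    public static void main(String[] args) throws Exception {
        // Assumed driver class and URL format for the PDI "thin" JDBC driver;
        // verify both against the driver you downloaded via the Driver Details dialog.
        Class.forName("org.pentaho.di.trans.dataservice.jdbc.ThinDriver");
        String url = "jdbc:pdi://localhost:9080/kettle?webappname=pentaho-di"; // placeholder host/port/web app

        try (Connection con = DriverManager.getConnection(url, "user", "password");
             Statement stmt = con.createStatement();
             // "my_service" is a placeholder for the name of a published data service
             ResultSet rs = stmt.executeQuery("SELECT * FROM my_service")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}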

Further PDI improvements

  • Improved JSON Input Performance – The JSON library used by Pentaho Data Integration has been replaced. The new library increases speed by up to 10 times or more and can now handle files of up to hundreds of MB in size. Although the library has been updated, the user interface to this functionality is fully backward compatible.
  • New Run Options Window – The Execute a transformation and Execute a job windows in Pentaho Data Integration have been replaced with new Run Options windows. These new windows streamline your interaction while running a transformation or job, especially with the Always show dialog on run option. With this option disabled, the Run Options window does not appear the next time you run your transformation or job. The options currently set are reused until you access the Run Options window again.

Run Options

  • New Run Context Menu – Besides the Action menu or pressing F8 in Pentaho Data Integration, you can access the Run Options window through the new Run context menu. In this menu, select Run to run your transformation or job, and select Run Options to change any of the options before running.

Run Options

Related to Analyzer and Data Model

  • Hide a Calculated Measure in the Data Model – You can hide a calculated measure in the data model. In the Annotate Stream step, select the Hide this calculated measure in the model check box to include the measure as part of the model, but to hide it from users when the data source is opened in Analyzer.
  • Publish a Data Model with Hidden Attributes – Hide attributes and calculated measures in Analyzer when creating them on a data model in the Annotate dialog box by setting the Hidden property to 'True'. The Hidden property is useful for attributes needed to build a proper data model, but not needed for analytic purposes.
  • Edit Calculated Measures in Analyzer – You can now update the properties on a calculated measure created via inline model editing within Analyzer, such as if you want to rename the measure or adjust the MDX formula on the calculation. Also, you can now easily identify calculated measures in the Available fields list by the icon, 'f(x)', which only displays next to calculated measures created in Analyzer via inline model editing.
  • Show and Hide Available Fields – You can select to hide or show fields in the list of Available fields for a report in Analyzer. Hiding fields is helpful when you want a clear view of only those fields you are interested in for your report. When you hide a field, it is no longer available for selection in the report.

Related to Platform, CTools & Documentation

  • Platform Improvements – We continued with the foundational improvements to the Pentaho Platform first introduced in Pentaho 6.0. Highlights include continued Karaf hardening and Jackrabbit performance improvements.
  • CTools Documentation Added to Help – We’ve added CTools documentation to the Help. You can now learn the basics of the Community Dashboard Editor (CDE) with the CDE Quick Start Guide and its companion articles: CTools Overview, Activate CDE, CDE Dashboard Overview, CDE Quick Start Guide
  • CDF API Documentation – A full set of API documentation has been created for OEMs who want to use JavaScript APIs to build custom dashboards with the Community Dashboard Framework (CDF).
  • Improved Prompting Documentation – We’ve added new dashboard design documentation for the Date Picker prompt.
  • Help Site Improvements – Our Help site has been upgraded to a more responsive version of MindTouch. This upgrade improves content readability and simplifies navigation, along with a better interaction on a mobile device.

And last but not least: if you miss a feature, let us know in JIRA (if you are a community user) or contact Pentaho Customer Support (if you are a Pentaho customer) to tell us about your feature requests and use cases.

You can read the details with more information in our documentation What’s New in Pentaho 6.1.

Learn more about Pentaho 6.1 including a video overview, download link, webinar and more: Just follow this link.

Continue reading part 2 featuring my favorite new feature in 6.1: Metadata Injection


Some Historic Cornerstones of Kettle & Pentaho

Every time when there is something new coming on the horizon, it’s a good time to look back. The intention of Hitachi Data Systems to acquire Pentaho is such a remarkable move that it opens up a great new chapter in Pentaho’s history.

As I researched the details of the journey, I was very happy to find that the referenced postings can still be found on the Internet after all these years. All the other details came from my memories, so I hope they are all accurate, especially the times of day! And please consider that this is from my personal view and that I'm Kettle-addicted, so European and German filter rules, together with a priority on Pentaho Data Integration (aka Kettle) topics, might apply here and there…

Sep 8, 2004 (right after dinner) – I read Matt Casters' post #30 on the BIRT mailing list about Kettle version 2.0. BTW: the BIRT project was just at the beginning – this was even before the project was officially accepted by the Eclipse Foundation on Oct 6, 2004.

At that time Kettle was closed source, and I got a trial license code from Matt to play with it. Our intention at Proratio (my employer at the time) was to use it to load the company's data warehouse from our ERP system, which had an extremely complex condition system, and the combination of steps needed for that would have been a nightmare. I asked Matt if there was some kind of plug-in possibility for Java code, and his answer was roughly: "This is a good idea, but since I'm doing this part time and on the weekends, I don't know when it will be finished."

In the meantime I developed a connector to SAP (some of the root bits and bytes went into the godesys SAP-Connector, formerly ProERPConn by Proratio) and thought it would be a good time to check with Matt again. Surprisingly, he had just finished the plug-in capability. Sometime in the summer of 2005 we ended up sitting together in Mainz and managed to integrate the SAP connector into Kettle in a very short time (between breakfast and lunch).

Early Dec 2005 – Matt Casters released Kettle as open source under an LGPL license (it moved to Apache 2.0 later on). For those who would like to know more about the history of Kettle (Matt started working on it in 2001), have a look at the Pentaho forum post Project road map, history of kettle. It also contains a zip file of the very first Java Kettle version ever and a screenshot of version 1.0. You can also read a bit more about the story in the book Pentaho Kettle Solutions.

Dec 9, 2005 (right after tea time) – I became aware of Pentaho by reading a press release in Computerwoche, "5 Millionen Dollar Venture Capital für Open-Source-BI" (5 million dollars of venture capital for open-source BI), and let Matt know about this interesting company.

Dec 12, 2005 – Matt Casters' first forum post at Pentaho: "[…] Therefor, I would like to convince you to include a powerfull ETL tool that allows users to quickly build and maintain a full data warehouse. For me this means including slowly changing dimensions. Copying a couple of tables and hoping that the reporting engine will be able to cope just won't do. Please feel free to consider including Kettle. Kettle is an ETL tool that turned LGPL about 10 days ago. […]"

Apr 4, 2006 – The press release went out: Pentaho Acquires Kettle Project – Fastest Growing Open Source BI Project Strengthens Portfolio With Best-in-Class Data Integration. This was also when Kettle became known as Pentaho Data Integration.

May 23, 2006 – Press release announcing Kettle (Pentaho Data Integration) training for Europe, starting in Mainz, Germany, offered by Proratio. Proratio became a Pentaho partner, and we delivered trainings and workshops across Europe.

Jun 2007 – I joined Pentaho as employee #3 in Europe. #1 was Thomas Morgner (Founder and Chief Architect of the Pentaho Reporting engine) and #2 was Matt Casters. Since we had a limited headcount (not only in Europe), I wore multiple hats: working on the code base, documentation, QA, support, training, proofs of concept, presales, community, even guerrilla marketing, and maybe something else that I forgot.

Dec 2007 – Pentaho won its first German customer – a large health insurance company.

Jun 2008 – The first Pentaho community event in Mainz was a great success, bringing people and their ideas and projects together, letting everyone get to know each other personally, and, last but not least, laying a foundation stone for continuing and growing the community over the following years.

2008–2015 – Read some more key points along the way from Richard Daley: The Pentaho Journey – A Big Data Booster

Feb 10, 2015 – Hitachi Data Systems Announces Intent to Acquire Pentaho to Deliver More Value From Big Data and the Internet of Things That Matter.

This acquisition builds on an existing OEM relationship between the two companies and is a core component of the HDS strategy to accelerate its Social Innovation business and become a leader in IoT. Social Innovation is the unifying strategy across Hitachi businesses to deliver solutions that enable healthier, safer and smarter societies. The Pentaho vision to create transformational value from data generated and interconnected across people and things is brought to life by a big data orchestration platform to power embedded analytics.

For further details, check out the Social Innovation News on HDS Community Innovation Center.

This is an exciting time for both companies, our customers, and partners. We both share a vision around analytics and in particular around big data opportunities.

I'm very thankful for the right decisions, the timing, and the passionate people all around at Pentaho, in the community, and in the partner landscape who formed this success as a whole.

This was the short story of this great journey and it still continues 🙂


PDI/Kettle Solution Share (Presented at #PCMD15)

This has been presented at the German Pentaho Community Meeting 2015 (#PCMD15):

Before you share a PDI solution with the community, clients, etc., you probably want to ensure that it does not contain any security-related information you do not want to expose.

There is a lot of information within transformations and jobs that you may not want to expose to others, such as hostnames, usernames, creator information, server names, and file paths.

Use the PDI/Kettle Solution Share (explained on the Pentaho Wiki) to remove this unwanted (secure and private) information from your Kettle/PDI jobs and transformations.

It uses the XSL Transformation job entry to modify the jobs and transformations, which are stored as XML.

solution_share_job1

It loops over all files and replaces the variables shown:

solution_share_job2

It should also encourage you to share your solution with others, e.g. via the Kettle Exchange on the Pentaho Wiki.

This works for multiple files as well as for a repository export.

Here is the how-to (see the latest version on the Pentaho Wiki for any changes):

  1. Download the solution_share.zip (attached to PDI/Kettle Solution Share), unzip it to a folder of your choice
  2. Open solution_share_main.kjb from Spoon
  3. Execute it with the test files within the solution_share/share folder
  4. A new folder solution_share/share_date_time (e.g. share_public_2015-02-08_22-06-27) has been created with the cleaned files
  5. You may want to compare the files to verify that they do not contain any private information like connections, user names, etc.
  6. Put the files you want to share into the ./share folder instead of the sample files (you can also modify the default location by changing the parameter within solution_share.ktr)

If you want to add and/or modify the elements that are removed from your solution, you can do so by editing the XSL Transformation: share_remove.xsl

share_remove_xsl

Please see the complete share_remove.xsl on the PDI/Kettle Solution Share wiki page.


PDI/Kettle Telemetry (Presented at #PCM14)

This has been presented at the Pentaho Community Meeting 2014 (#PCM14):

There are many reasons to collect usage statistics, for example:

  • It can help in improving the product in the most-used areas and features (steps, job entries, database types, etc.)
  • It can help the user determine if some features are affected by a planned upgrade (the upgrade notes for each release cover affected steps, job entries, etc.)
  • When combined with usage statistics from development/test/production, you can also determine whether some jobs/transformations are never used

Here is one solution with a how-to:

Analyze the used steps, job entries and database types

  • Download the solution analyze_trans_job
  • Within PDI/Kettle, please open the job _analyze_trans_job/transformations_jobs/0_analyze_trans_job.kjb
  • Look at the comment within the job; it gives you all the usage information.
  • It is also possible to anonymize file names, transformation and step names: please see the option anonymize_names within the parameters.txt file.

If you want to contribute to this solution, the jobs/transformations are hosted on GitHub.

Note: This is currently limited to the file system and does not support a repository or a repository export file.

Further information about the usage statistics that can be achieved by using the Pentaho Operations Mart can be found on the Pentaho Community Wiki.


Pentaho 5.2 Released at PentahoWorld 2014

A lot of exciting things happened at PentahoWorld 2014 – you can read a lot about it on Twitter under #PWorld2014. It is amazing what momentum we have reached, with over 400 attendees at this conference. BTW: did you know that Pentaho celebrated its 10th anniversary last week, too?

PentahoWorld 2014
PentahoWorld 2014

For me, as a "PDI/Kettle addicted person" 😉 one of the major topics was also the release of Pentaho 5.2 (BA & PDI). You can read about all the great new features over here: Upgrade Existing Pentaho Systems (the press release is not out yet at the time of writing this article, but will come soon here: Pentaho Press Releases).

PDI 5.2
PDI 5.2

Here are the highlights of PDI 5.2:

Pentaho Data Integration 5.2 delivers many exciting and powerful features that help you quickly and securely access, blend, transform, and explore data.

New Streamlined Data Refinery Feature

The Streamlined Data Refinery (SDR)  is a simplified, ad hoc ETL refinery composed of a series of PDI jobs that take raw data, augment and blend it through the request form, and then publish it to the BA Server for report designers to use in Analyzer.

R Script Executor Step Improvements

The R Script Executor, Weka Forecasting, and Weka Scoring steps form the core of the Data Science Pack and transform PDI into a powerful predictive analytics tool. The R Script Executor step allows you to incorporate R scripts in your transformation so that you can include R-based statistical programming in your data flow. In PDI version 5.2 you can now "plug and play" R scripts without extra customization. You can pass incoming field metadata to the output field metadata, use a more intuitive user interface to run scripts by rows or by batches, and test scripts.

New DI Server Administration Features

Porting content from one environment to another and performing general DI Repository maintenance is easier with the introduction of the new Purge Utility.  The Purge Utility permanently purges the repository of versions of shared objects, such as database connection information, jobs, and transformations.  You can also turn DI Repository versioning and comment capturing capabilities on and off.

Kerberos Security Support for CDH 5.1 and HDP 2.1

If you are already using Kerberos to authenticate access to a Cloudera Distributed Hadoop 5.1 or Hortonworks Data Platform 2.1 cluster, with a little extra configuration, you can also use Kerberos to authenticate Pentaho DI users who need to access those clusters.

New Marketplace Plugins

Pentaho Marketplace continues to grow with many more of your contributions. Pentaho Marketplace is a home for community-developed plugins and a place where you can contribute, learn, benefit from, and connect to others. New contributions include:

  • LookupTimeDimensionStep: Looks up and creates an entry on a data warehouse dimension time table and returns the ID.
  • Probabilistic Row Distributions: Contains a collection of Row Distribution plugins for PDI that use probabilistic methods for determining the distribution of rows.
  • PDI Groovy Console: Adds a Groovy console to the Help menu that has helper methods and classes that interact with the PDI environment.
  • Gremlin Script Step: Provides a Gremlin script step for graph pipeline processing.
  • Avro Output Plugin: Allows you to output Avro files. Avro files are commonly used in Hadoop, allowing for schema evolution and truly separating the write schema from the read schema.

Improved Upgrade Experience

Upgrading PDI is easier because it is no longer a manual process.  You can now upgrade from 5.1.x to 5.2 using the same upgrade utility used for patch releases.

More Details and Download


PentahoWorld 2014 – The 1st Annual Worldwide Conference is coming soon

I’m very happy to see PentahoWorld coming soon on October 8 through 10, 2014. It is our first(!) worldwide conference for Users, Advocates and Partners of Pentaho.

After many years of Pentaho community meetings, this is the next logical step in boosting the great Pentaho story. I initiated and organized the first Pentaho community meeting. It was held in my hometown of Mainz, Germany, back in 2008, and we had a great audience, a lot of information, talks, and fun. Since then, it has taken place successfully every year and will continue in the future: the next community meeting is scheduled for November 14 through 16 in Antwerp, Belgium.

BTW: while writing this, a lot of history came to my mind, so I will continue to write a bit more about the history that formed and still forms our future in IT, Pentaho, and big data in the next couple of blog posts – stay tuned via my Twitter account or the RSS feed of this blog.

And for now, have a look at some details of PentahoWorld 2014:

Training and Interactive Sessions: Three days of rich content to fuel the data-driven business

Learn how to apply the latest Pentaho innovations, see how customers use Pentaho to solve real-world problems, gain insight into product roadmaps and strategic plans, and take away practical knowledge and best practices.

Test-Drive Product Training: Hands-on sessions will provide the opportunity to explore Pentaho with Hadoop, NoSQL databases and WEKA predictive algorithms.

Meet the Experts: At the event, schedule 1:1 meetings with Pentaho innovators, partners and architects who will offer real-time advice, best practices and a look into next-generation products.

Breakout Sessions around:

  • Data Integration
  • Business Analytics
  • Advanced Analytics
  • Big Data Today
  • Embedded Analytics

For more information, check out the agenda.

You also get compelling keynotes and general sessions that help you develop a roadmap for scale, growth and success: See the list of speakers.

And last but not least: it's always worth making a trip to Orlando, especially at this time of the year! I can attest to this from my experience of the last seven years with Pentaho.

We also paved the way to give you 10 great reasons to attend and even arguments at hand to convince your boss 🙂


Pentaho 5.1 EE & CE is released – What’s New in PDI & Big Data?

Pentaho 5.1 delivers code-free analytics directly on MongoDB, simplifies data preparation for data scientists and adds full YARN support. See the full list of Pentaho Business Analytics 5.1 features and technical specifications.

Here are some more details about the new PDI 5.1 release:

Native (YARN) Hadoop integration

PDI includes support for YARN capabilities including enhanced scalability, compatibility with MapReduce, improved cluster use, and greater support for agile development processes.  YARN also provides support for non-MapReduce workloads.

YARN for Carte Kettle Clusters

The Start a YARN Kettle Cluster and Stop a YARN Kettle Cluster entries make it possible to execute Carte transformations in parallel using YARN. Carte clusters are implemented using the resources and data nodes of a Hadoop cluster, which optimizes resources and speeds processing.

Cloudera, Hortonworks, and MapR Hadoop Distribution Support

Use Pentaho’s innovative Big Data Adaptive Layer to connect to more Hadoop Distributions, including Cloudera 5, Hortonworks 2.1, and MapR 3.1.  These certified and tested YARN-based distributions allow you to use PDI to build scalable solutions that are optimized for performance. Pentaho supports over 20 different Hadoop Distribution versions from vendors such as Apache, Hortonworks, Intel, and MapR.  Pentaho also supports Cloudera distributions and is a certified Cloudera partner.

Data Science Pack with Weka and R

The R Script Executor, Weka Forecasting, and Weka Scoring steps form the core of the Data Science Pack and transform PDI into a powerful predictive analytics tool. The R Script Executor step, which is new for 5.1, lets you include R scripts in your transformations and jobs. You can customize random seed sampling, limit the batch and reservoir size, adjust logging level messages, and more. You can also choose to load the script from a file at runtime, enabling you to have more flexibility in transformation design.

Security Enhancements

PDI security has been enhanced to include support for more standard security protocols and specifications.

  • AES Password Support: Use Advanced Encryption Standard (AES) to encrypt passwords for databases, slave servers, web service connections, and more. AES uses a symmetric key algorithm to secure your data.
  • New Execute Permission: You can now choose whether to grant permission to execute transformations and jobs by user role.  This provides more finely-tuned access controls for different groups and can be useful for auditing, deployment, or quality assurance purposes.
  • Kerberos Security Support for Hadoop & MongoDB: If you are already using Kerberos to authenticate access to a data source, with a little extra configuration, you can also use Kerberos to authenticate DI users who need to access your data.

Teradata and Vertica Bulkloaders

There are two new bulk loader steps: Teradata TPT Bulkloader and Vertica Bulkloader. Also, newer versions of Teradata and Vertica are now supported.

New Marketplace Plugins

Pentaho Marketplace continues to grow with many more of your contributions.  As a testimonial to the power of community, Pentaho Marketplace is a home for your plugins and a place where you can contribute, learn, benefit from, and connect to others. New contributions include:

  • Riak Consumer and Producer:  Links with Maven to provide dependency management.
  • Load Text From File:  Uses Apache Tika to extract text from files in many different formats, such as PDF and XLS.
  • Top / Bottom / First / Last filter: Filters rows based on a field’s values or row numbers.
  • Map (key/value pair) type:  Provides a ValueMeta plugin for key/value pairs that are backed by a java.util.Map.
  • PDI Tableau Data Extract Output:  Use Pentaho’s ETL capabilities to generate a Tableau Data Extract (tde) file.
  • PDI NuoDB: Provides a PDI database dialect for the NuoDB NewSQL database that works in the cloud.
  • Neo4j JDBC Connector:  Provides a PDI database dialect for the Neo4j graph database.
  • HTML to XML: Uses JTidy to convert HTML into XML or XHTML.
  • Apache Kafka Consumer and Producer: Reads and sends binary messages to and from Apache Kafka message queues.
  • LegStar z/OS File Reader: Reads raw z/OS records from a file and transforms them to PDI rows.
  • Compare Fields:  Compares 2 sets of fields in a row and directs it to the appropriate destination step based on the result. This step detects identical, changed, added, and removed rows.

There are many more new plugins such as IC AMQP, IC Bloom filter, JaRE  (Java Rule Engine), JDBC Metadata, Memcached Input/Output, Google Spreadsheet Input/Output, and Sentiment Analysis.

… and the list keeps growing – check out the latest contributions (within the Data Integration design interface Spoon under Help / Marketplace).

Documentation

And last but not least: We have a great new documentation system, check out our documentation around New Features in Pentaho Data Integration 5.1

Download 5.1

Test 5.1 and download the Community Edition (with limited functionality, no support, etc., but free as in beer) or start with our evaluation Enterprise Edition.


Pentaho 5.0 CE is released – What’s New in PDI & Big Data?

In addition to the release of the Enterprise Edition (EE), Pentaho has released the stable build of the 5.0 Community Edition (CE).

If you can't wait to get to the download, have a look at our new community.pentaho.com. It hosts all the information and the download link. And to download a free 30-day trial of Pentaho EE, visit pentaho.com/download.

Among all the other great additions in the Pentaho Business Analytics Suite 5.0, here are the highlights for the new PDI 5.0 release:

PDI Community Edition includes the following new features:

  • Inline Help links on transformation steps and job entries
  • Simplified looping – call jobs and transformations within a transformation and loop through the rows
  • Better previewing with dedicated window in run tab
  • Detailed timing metrics for low-level operations – helps to analyze bottlenecks in more detail, e.g. database connect & query time vs. transformation time
  • Extended monitoring of sub-jobs and transformations in the Carte and DI Server (Expand remote job option)
  • Introduction of REST services (Carte and DI-Server)
  • Several new Transformation Steps and Job Entries including: Table compare, ZIP file, OpenERP input/output, Telnet, Nagios traps…
  • Lots of new configuration options for tuning the engine back-ends
  • Marketplace – Share and/or download new plug-ins
  • A series of new plugin systems to make it easier to extend Kettle (data types, extension points, carte servlets, logging tables, step-to-step row distributions…)
  • and many more…

PDI Enterprise Edition includes the following new features in addition to the CE features:

  • Kettle JDBC driver – Query data using SQL/JDBC from any transformation, see how this can be used to blend data on the source in Pentaho’s Big Data Blend of the Week
  • Job Restart-ability – Set checkpoints, Restart jobs from last successful checkpoint
  • Transactional Job Execution – Provides the ability to roll back job execution on failure
  • Security on Database Connections – Full permissions: R/W/D, Securely share connections with other users or groups
  • Load Balancing of data within Transformations
  • Splunk input/output steps
  • WebSphere MQ / MQSeries integration
  • 4.4 to 5.0 Migration – Ease-of-use for Administrators

Matt Casters, Pentaho’s Chief Architect of PDI & Kettle Project Founder, will post more details over the next coming weeks, but just in case you can’t wait, test it yourself and go to community.pentaho.com or pentaho.com/download.

And what about big data?

Pentaho continues to lead in embracing and extending the big data ecosystem

Enhanced Big Data Functionality (5.0):

  • New InstaView use case templates for Hadoop and Splunk
  • Expanded NoSQL Integration

Expanded Big Data Ecosystem Support (5.0):

  • Hadoop High Availability support
  • New Integrations: RedShift, Impala, Splunk
  • New Hadoop Certs: Intel, Hortonworks, DataStax
  • Support for latest versions of CDH, MapR, MongoDB and Cassandra

Adaptive Big Data Layer (Transparent Access to & Integration of Big Data):

  • Insulates from changing versions, vendors, data stores
  • Give customers broad flexibility of choice, rapid time to value, reduced risk
  • Provides native integration into the big data ecosystem
  • Broadest, deepest Big Data Support

We are able to deliver support for new technologies as plug-ins very fast, even without depending on the release cycles of our core products. For the latest additions, see also the New and updated content in the Pentaho Big Data wiki space – stay tuned!

Learn more about Pentaho & Big Data and check out: pentahobigdata.com


What’s New in PDI 4.4?

Among the other great additions like Mobile and many other feature enhancements in the Pentaho Business Analytics Suite 4.8, here are the highlights for the new PDI 4.4 release:

Pentaho Instaview

Pentaho Instaview is the fastest way to start using Pentaho Data Integration to analyze and visualize data. Instaview uses templates to manage the complexities of data access and preparation. You can focus on selecting and filtering the data you want to explore, rather than spending time creating source connections and identifying measure and dimension fields. Once the data has been selected, Instaview automatically generates transformation and metadata models, executes them, and launches Pentaho Analyzer. This allows you to explore your data in the Analyzer desktop user interface.
As your data requirements become more advanced, you have the ability to create your own templates and use the full power of Pentaho Data Integration (PDI).
Watch this video and see the Getting Started with Pentaho Data Integration Instaview Guide to understand and learn more about Pentaho Instaview.

PDI Operations Mart

The PDI Operations Mart enables administrators to collect and query PDI log data into one centralized data mart for easy reporting and analysis. The operations mart has predefined samples for Pentaho Analyzer, Interactive Reporting, and Dashboards. You can create individualized reports to meet your specific needs.

Sample inquiries include

  • How many jobs or transformations have been successful compared to how many failed in a given period?
  • How many jobs or transformations are currently running?
  • What are the longest running jobs or transformations in a given period?
  • What is the highest failure rate of job or transformations in a given period?
  • How many rows have been processed in a particular time period? This enables you to see a trend of rows or time in time series for selected transformations.

The operations mart provides setup procedures for MySQL, Oracle, and PostgreSQL databases. Install instructions for the PDI Operations Mart are available in the Pentaho InfoCenter.

Concat Fields Step

The Concat Fields step is used to join multiple fields into one target field. The fields can be delimited by a separator and the enclosure logic is completely compatible with the Text File Output step.

This step is very useful for joining fields as key/value pairs for the Hadoop MapReduce Output step.

SAS Input Step

The SAS Input step reads files in the sas7bdat format created by SAS software, allowing PDI developers to import sas7bdat files into their transformations.

EDI to XML Step

The EDI to XML step converts EDI message text, which conforms to the ISO 9735 standard, to generic XML. The XML text is more accessible and enables selective data extraction using XPath and the Get Data From XML step.

New Pentaho Data Integration Software Development Kit (PDI SDK)

Extending and Embedding Pentaho Data Integration enables developers to utilize PDI beyond the out-of-the-box functionality. This guide explains the mechanics of extending PDI with plugins. It also explains embedding PDI functionality directly into Java applications. Pentaho provides sample code for all plugin types and embedding scenarios. More details can be found in the Pentaho InfoCenter under Embedding and Extending Pentaho Data Integration.
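To give a feel for the embedding side, here is a minimal sketch of running an existing transformation from a Java application, assuming the PDI libraries (kettle-core, kettle-engine, and their dependencies) are on the classpath; the file path and parameter name are placeholders:

import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class RunTransformation {
    public static void main(String[] args) throws Exception {
        // Initialize the Kettle environment (plugins, logging, etc.)
        KettleEnvironment.init();

        // Load the transformation definition from a .ktr file (placeholder path)
        TransMeta transMeta = new TransMeta("samples/my_transformation.ktr");

        // Optionally set a named parameter defined in the transformation (placeholder name)
        transMeta.setParameterValue("INPUT_FILE", "data/input.csv");

        // Create and run the transformation, then wait until it finishes
        Trans trans = new Trans(transMeta);
        trans.execute(null);
        trans.waitUntilFinished();

        if (trans.getErrors() > 0) {
            throw new RuntimeException("Transformation finished with errors");
        }
    }
}

A job can be run in the same way via the JobMeta and Job classes; the SDK samples cover both scenarios.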

The complete PDI 4.4 change log: Now with JIRA Components and PDI Sub-Components

This is the first release with very detailed information about what has changed on a more granular level, so you can query JIRA by components and PDI sub-components like steps, job entries, and specific database topics. This helps in the upgrade process and in finding already existing issues in JIRA. More details can be found in the Pentaho Wiki page JIRA PDI Components, Sub-components and Labels, and the complete PDI 4.4 change log can be found in JIRA. You need to be logged into JIRA to include the fields in your default report, but you can also download the complete list when you select "View" / "Excel (All fields)".

Some background about versioning and compatibility for Kettle core

From the Kettle core perspective, 4.3.0 and 4.4.0 are still bug fix releases for 4.2.x but with significant new features. As such, we have decided to call the release version 4.4.0 rather than 4.3.1. As always, please see the PDI Upgrade Guide for specific topics when upgrading.

Download PDI 4.4

And if you don't have it already, the download is over here: www.pentaho.com/download (Enterprise Edition); it should also be available soon on SourceForge (Community Edition).


Call Kettle from PostgreSQL – Part 1

postgresql_kettle_integration_components_2012-09-29.png

Some questions and use cases that brought me to investigate into this area:

  • Can Kettle be called directly from a database trigger? So when rows get inserted, deleted, changed, I want to call a Kettle transformation.
  • Can we call a function in a SQL statement that returns the fraud probability calculated by a WEKA data mining model?
  • Can we enrich some inserted data rows by external data sources?
  • Also performance considerations: I don't want to transport the data to the process, I want to process the data where it is.

So, let's try to integrate and embed Kettle into PostgreSQL:

Here is a short how-to (based on a Windows 7 64-bit machine running PostgreSQL 9.1):

  • Download pljava-xxx.tar.gz from http://pgfoundry.org/projects/pljava – PL/Java is needed to call Java from PostgreSQL
  • Stop the PostgreSQL service and make a backup
  • After unpacking, follow the instructions given in pljava-x86_64-w64-mingw32-pg9.1-1.4.3/docs/docs/readme.html

Example for modifying the postgresql.conf:

  • dynamic_library_path = '$libdir;C:/Pentaho/_solutions_samples/realtime/pljava'
  • log_min_messages = info
  • custom_variable_classes = 'pljava'
  • pljava.classpath = C:/Pentaho/_solutions_samples/realtime/pljava/pljava.jar
  • [When needed, add and modify:] pljava.vmoptions = '-Xmx512M'

You may need to change the Windows system PATH since I found an issue when loading pljava.dll:
ERROR: could not load library "[…]/pljava.dll": The specified module could not be found.

Dependency Walker helped in finding the missing parts, so I added the following to the system PATH (this might be different on your setup and is only an example):

  • C:\Program Files\Java\jre6\bin\server
  • C:\Program Files\PostgreSQL\9.1\bin

And follow the other steps in the readme (depending on your needs) and finally:

CREATE FUNCTION getsysprop(VARCHAR) RETURNS VARCHAR
AS 'java.lang.System.getProperty'
LANGUAGE java;

SELECT getsysprop('user.home');

The result of the query and function call is C:\User\postgres since the PostgreSQL service and the JVM run as user postgres.

  • Now, integrate the kettle-core.jar library for a first test:

SELECT sqlj.install_jar('file:///Pentaho/pdi-ee-4.3.0-GA/data-integration/lib/kettle-core.jar', 'kettle_core', false);

SELECT sqlj.set_classpath('public', 'kettle_core');

CREATE FUNCTION initCapKettle(VARCHAR) RETURNS VARCHAR AS 'org.pentaho.di.core.util.StringUtil.initCap' LANGUAGE javaU;

If you like, have a look at the loaded classes in the table sqlj.jar_entry:
sqljar_entry.png

After executing select initCapKettle('jens'); you can check that the first letter is changed into a capital 'J' by the Kettle method:

initcaptest.PNG

So, the first part of the tests worked very well.

Considerations for the upcoming part 2 of the proof of concept, “Testing a Kettle transformation”:

  • PL/Java is limited to a single thread in the JVM: we need to use the Kettle single threading engine (see the sketch after this list)
  • We may need to keep instances of the transformation meta definition. When to clean up?
  • We may need to define a threshold for processing a specific number of rows in a chunk or after a timeout, to avoid putting an excessive burden on the database.
  • I’m sure there are other unknowns…
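As a starting point for part 2, here is a rough sketch of driving a transformation with the single threading engine, assuming the SingleThreadedTransExecutor API from the Kettle code base (the .ktr path is a placeholder, and a real PL/Java integration would of course feed the trigger data into the transformation instead of reading a file):

import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.SingleThreadedTransExecutor;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;
import org.pentaho.di.trans.TransMeta.TransformationType;

public class SingleThreadedRun {
  public static void main(String[] args) throws Exception {
    KettleEnvironment.init();

    // Mark the transformation for the single threading engine (placeholder path).
    TransMeta transMeta = new TransMeta("/path/to/sample.ktr");
    transMeta.setTransformationType(TransformationType.SingleThreaded);

    Trans trans = new Trans(transMeta);
    trans.prepareExecution(null);

    SingleThreadedTransExecutor executor = new SingleThreadedTransExecutor(trans);
    executor.init();

    // Each iteration pushes rows through all steps in this single thread;
    // assumption: oneIteration() returns true while steps still have work to do.
    while (executor.oneIteration()) {
      // a PL/Java trigger could call this once per chunk of inserted rows
    }

    executor.dispose();
  }
}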

This was first presented at the Pentaho Community Meeting 2012 in Amsterdam on September 29th, 2012, and part 2 with more tests will come soon.

Veröffentlicht in Kettle (PDI) | Kommentare deaktiviert für Call Kettle from PostgreSQL – Part 1

Carte as a Windows Service

Carte is a simple web server that allows you to execute transformations and jobs remotely and to run clustered transformations. Carte is normally started with a .bat file in Windows environments, but there are some use cases for running Carte as a Windows service, e.g.:

  • When a Carte instance is running in a command window, anyone could close the window by mistake and Carte will go down.
  • The Carte.bat command window is tied to the user session that called the batch file, and that user needs to stay logged in.
  • With a Windows service you can start Carte at machine startup and also configure it to restart after a crash.

After you have completed the instructions on the Pentaho Wiki, you will get Carte running as a Windows service like this:

yajsw_Pentaho_DI_Carte_Service

It is also possible to get a system tray icon

yajsw_system_tray

that shows the console output of Carte, like this after the initial start of the Windows service:

yajsw_console1

The console shows the same information as you get via the browser when you are logged into Carte, e.g. after the execution of the sample transformation:

yajsw_console2

Many thanks to the developers of YAJSW (Yet Another Java Service Wrapper) for this great solution to wrap Java programs as a service, and to Dan for bringing this to my attention.

More details can be found on the Pentaho Wiki: Carte as a Windows Service

Veröffentlicht in Kettle (PDI) | Kommentare deaktiviert für Carte as a Windows Service

Kettle and NoSQL: MongoDB

On May 23rd 2012, Pentaho and 10gen jointly announced a partnership to provide direct integration between Pentaho Business Analytics and MongoDB.

MongoDB is a scalable, high-performance, open source NoSQL database featuring document-oriented storage, auto-sharding for horizontal scalability, rich document-based queries and fast in-place updates. MongoDB is designed with both scalability and developer agility in mind. Instead of storing your data in tables and rows as you would with a relational database, in MongoDB you store JSON-like documents with dynamic schemas. The goal of MongoDB is to bridge the gap between key-value stores (which are fast and scalable) and relational databases (which have rich functionality).

Working together, Pentaho and 10gen offer the first MongoDB-based big data analytics solution to the market. This solution combines MongoDB with Pentaho’s visual interfaces for high-performance data input, output and manipulation, as well as data discovery, visualization and predictive analytics. This makes it easy and productive for IT staff, developers, data scientists and business analysts to operationalize, integrate and analyze both big data and traditional data sources.

MongoDB and further big data and NoSQL connectors have been available in all editions of Kettle, including Community, Basic, Professional and Enterprise, since version 4.3.

MongoDB Sample 1 MongoDB Sample 2

Kettle makes it really easy to extract data from multiple big data and traditional sources, and integrate it into a data mart or warehouse. The visual interfaces for big data will accelerate development and maintenance of MongoDB-based solutions a lot!

More details and resources can be found over here: pentaho.com/10gen

Veröffentlicht in General, Kettle (PDI) | Kommentare deaktiviert für Kettle and NoSQL: MongoDB

Data Profiling and Data Quality (Human Inference) Integration with Kettle

Data Profiling with DataCleaner (Human Inference) and Kettle

It was already possible to profile your data in an easy way with Kettle: open the Database Explorer, choose a table and right-click on Data Profile in the context menu. The result was basic information about the data like Min, Max and Count for strings, plus additional information for numeric data, but these were only basic metrics about your data. We have a much better solution now:

Human Inference (DataCleaner) and Pentaho (Kettle) worked together to integrate their tools, and the result is a nice and seamless integration of DataCleaner into Kettle. An introductory sample and FAQ can be found at Kettle Data Profiling with DataCleaner.

You can right click on any step within your transformation and profile your data. It is also possible to clean or harmonize your data and check the result directly within your transformation.

DC context menu

Here are some screen shots:

  • Number Analyzer

DC Number Analyzer

  • String Analyzer

DC String Analyzer

  • Pattern Analyzer (e.g. you see how many single words with first capital letters etc. exist in your data):

DC Pattern Analyzer

And there are many more Analyzers you can choose from: Matching & Deduplication, Boolean, Character Set Distribution, Data Gap, Date/Time, Reference Data Matcher, Value & Weekday Distribution Analyzers

DC Analyzers

For a complete reference look at http://datacleaner.eobjects.org and for a quick introduction and sample check out: http://wiki.pentaho.com/display/EAI/Kettle+Data+Profiling+with+DataCleaner

Data Quality with EasyDQ (Human Inference) and Kettle

In addition to the data profiling capabilities, a couple of steps for data quality have been introduced into Kettle as plug-ins. This covers:

  • Name Validation, Standardization and Cleansing
  • Address Validation, Standardization and Cleansing
  • E-Mail and Telephone Validation, Standardization and Cleansing
  • Duplicate Detection and Merge Duplicates

Here is a validation example:

DQ Sample 1


More details, download instructions, a sample video and examples can be found over here:

Veröffentlicht in Kettle (PDI) | Kommentare deaktiviert für Data Profiling and Data Quality (Human Inference) Integration with Kettle

Pentaho Kettle for Big Data

All of Pentaho’s big data capabilities will be available as open source in the new Pentaho Kettle 4.3 release. They include the ability to input, output, manipulate and report on data using the following Hadoop and NoSQL stores: Cassandra, Hadoop HDFS, Hadoop MapReduce, Hadapt, HBase, Hive, HPCC Systems and MongoDB.

With regard to Hadoop, Pentaho Kettle makes available job orchestration steps for Hadoop, Amazon Elastic MapReduce, Pentaho MapReduce, HDFS File Operations, and Pig scripts. All major Hadoop distributions are supported including: Amazon Elastic MapReduce, Apache Hadoop, Cloudera’s Distribution including Apache Hadoop (CDH), Cloudera Enterprise, EMC Greenplum HD, HortonWorks Data Platform powered by Apache Hadoop, and MapR’s M3 Free and M5 Edition. Pentaho Kettle can execute ETL transforms outside the Hadoop cluster or within the nodes of the cluster, taking advantage of Hadoop’s distributed processing and reliability.

Download, how-to docs, videos and more at http://community.pentaho.com/BigData, see also Matt’s blog at http://www.ibridge.be/?p=207

Veröffentlicht in Kettle (PDI) | Kommentare deaktiviert für Pentaho Kettle for Big Data

ExeBatLauncher: Calling .bat Files by .exe Executables (e.g. Having a Spoon.exe)

The ExeBatLauncher is a simple way of calling .bat files as .exe files: it provides an .exe that calls the .bat file of the same name.

For example, if you want to call the Spoon.bat file via Spoon.exe, simply rename the ExeBatLauncher.exe (found in the distrib folder) to Spoon.exe, copy it to the same directory, and you are done.

If you want it now, just download it and use it, it’s free 🙂

If you want to have a polished Spoon.exe with the right icon to test with, just download it from the PDI-6949 attachment.

The reason why I created this was a Pentaho Data Integration JIRA case: PDI-6949 – “While in PDI-2925 the feature of kettle.exe was removed, you cannot add a .bat file to the quick launch of Windows 7. The workaround is to add a cmd.exe to the quick launch and add /c <path to .bat> to modify it. This is due to microsoft security in Win 7.”

I also thought a lot about launching Kettle just by clicking on a .ktr or .kjb file and accomplished this with a wrapper .bat (calling the target .bat with the /file: option). There are also nasty ways of launching .bat files by creating self-extracting .exe archives (nice trick), but this is just too dirty to be professional. I tried this as well and it even worked with the Spoon icon, but limitations like having to extract the .bat file from the self-extracting archive into the distribution folder make it not a good solution. Further investigation into existing solutions for calling Java from an .exe brought up some nice projects, but that is not the solution we want at this time.

I wanted to have a pretty simple .exe that just calls the .bat with the same name and here is the code for this:

ExeBatLauncher.cpp

As you see it supports:

  • calling the .exe/.bat from every folder
  • passing all arguments from the caller
  • returning the return code to the caller

If you want to compile it, follow the instructions in the _README.txt file in the attachment: ExeBatLauncher.zip

I look forward to your experiences!

Veröffentlicht in General | Kommentare deaktiviert für ExeBatLauncher: Calling .bat Files by .exe Executables (e.g. Having a Spoon.exe)

First German Pentaho Customer Meeting in Munich

Today was our first German customer meeting, which Bruno initiated; it took place at one of our customer locations, Wirecard AG (many thanks!).

A nice (and squeezed…) agenda and around 20 attendees: we discussed the presented customer solutions, the attendees’ experiences and what Pentaho can do better, but also what really works well, which is a lot and proven in medium and large deployments. Here are some examples in addition to “normal” BI:

  • Processing 500 thousand transactions per day (credit card and alternative payment methods)
  • Merchant deposit monitoring including fraud detection of merchants (!)
  • Application and data row level role based security in a multi-tenant solution using special role management via a Mondrian plug-in
  • Highly performant solutions with load-balancing servers and high availability (99.95%), one of them with failover across two completely separate locations
  • Pentaho integrates very well in existing IT infrastructures
  • Event triggered ETL by an ESB (Enterprise Service Bus) / SOA architecture: JBoss ESB triggers PDI Jobs for application integration (JMS, File, HTTP, SMTP, FTP etc.). At this time via the Kettle API but it is planned to use Kettle Web-Services
  • PDI is used as a data source for reporting, just a nice EAI use case
  • Pentaho is integrated into JBoss EAP (the customer version of JBoss AS, everything in this platform is secured by default)
  • 3 of the attendees are using Infobright
  • Report bursting was accomplished by 3 different methods: 1) using xActions, 2) using ant scripts, 3) using Kettle

Major discussed improvements:

  • More integrated and consistent security throughout all products (it’s possible but special requirements are tricky to implement)
  • Report bursting with scheduling should be more standardized and user friendly
  • Since Pentaho can be extended and embedded very well via extension APIs, the upgrade to next versions should be more seamless by special dedicated documentation for OEM and stable API extension points [this will be addressed in the road map by the SDK]
  • Upgrade process: upgrade tools would be a huge benefit here, also when some releases were skipped (e.g. jumping from 3.x to 4.1 instead of going through 4.0)

As part of the Pentaho Product Management team, I presented the product road map and gave some quick insights in the Kettle Star Modeler.

I look forward to the next event of this type for knowledge sharing between Pentaho customers and users. We already found one volunteer (Guy) to take it from here 🙂

BTW: This event was held in German only (at this time!). Thanks again to Wirecard, Rob & Bruno for organizing it and to the attendees for their presentations and active participation!

Here are some impressions….

 2011-12-06 Munic A  2011-12-06 Munic B  2011-12-06 Munic C

Veröffentlicht in General | Kommentare deaktiviert für First German Pentaho Customer Meeting in Munich

The new XML Input Stream (StAX) step in PDI 4.2

This step provides the ability to read data from any type of XML file using the StAX parser. The existing Get Data from XML step is easier to use, but it uses DOM parsers that require in-memory processing, and even purging parts of the file is not sufficient when these parts are very big.

The XML Input Stream (StAX) step uses a completely different approach to solve use cases with very big and complex data structures and the need for very fast data loads: since Kettle has so many steps of its own to process data in different ways, the processing logic has been moved into the transformation, and the step itself provides the raw XML data stream together with additional, helpful processing information.
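To give a feel for the kind of raw stream the step exposes, here is a tiny plain-Java StAX sketch (not the step’s actual implementation) that prints element names, attributes and the nesting level, which is roughly the shape of the rows you get out of the step:

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxPeek {
  public static void main(String[] args) throws Exception {
    XMLInputFactory factory = XMLInputFactory.newInstance();
    XMLStreamReader reader = factory.createXMLStreamReader(new FileInputStream(args[0]));
    int level = 0;
    while (reader.hasNext()) {
      int event = reader.next();
      if (event == XMLStreamConstants.START_ELEMENT) {
        level++;
        System.out.println(level + " START ELEMENT " + reader.getLocalName());
        for (int i = 0; i < reader.getAttributeCount(); i++) {
          System.out.println(level + " ATTRIBUTE     " + reader.getAttributeLocalName(i)
              + " = " + reader.getAttributeValue(i));
        }
      } else if (event == XMLStreamConstants.CHARACTERS && !reader.isWhiteSpace()) {
        System.out.println(level + " VALUE         " + reader.getText());
      } else if (event == XMLStreamConstants.END_ELEMENT) {
        System.out.println(level + " END ELEMENT   " + reader.getLocalName());
        level--;
      }
    }
    reader.close();
  }
}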

Since the processing logic of some XML files can sometimes be very tricky, a good knowledge of the existing Kettle steps is recommended to use this step. Please see the different samples at the Kettle Wiki for illustrations of the usage.

Note: In almost all use cases, a set/reset functionality was needed. At this time it can be accomplished with the Modified Java Script Value step or the User Defined Java Class step, where the latter is recommended and much faster (a sketch of the pattern follows below). A dedicated Kettle step with set/reset functionality is on the road map to solve these and other similar use cases, see PDI-6389 for more details.
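The set/reset idea itself is just a carry-forward: remember the last value seen at a parent element, attach it to the following detail rows, and reset it when the next parent starts. Here is a plain-Java sketch of that pattern (the element names, field names and levels are made up for illustration; this is not the UDJC API):

import java.util.ArrayList;
import java.util.List;

public class SetResetSketch {
  // Simplified stand-in for the rows coming out of the StAX step.
  static class StaxRow {
    String element;  // element name
    String value;    // element or attribute value
    int level;       // nesting level
    StaxRow(String e, String v, int l) { element = e; value = v; level = l; }
  }

  public static void main(String[] args) {
    List<StaxRow> rows = new ArrayList<StaxRow>();
    rows.add(new StaxRow("Product", null, 2));     // new group starts -> reset
    rows.add(new StaxRow("Name", "Widget A", 3));  // value to carry forward -> set
    rows.add(new StaxRow("Price", "9.99", 3));
    rows.add(new StaxRow("Product", null, 2));     // next group -> reset again
    rows.add(new StaxRow("Name", "Widget B", 3));
    rows.add(new StaxRow("Price", "14.99", 3));

    String currentName = null;                     // the carried ("set") value
    for (StaxRow row : rows) {
      if ("Product".equals(row.element) && row.level == 2) {
        currentName = null;                        // reset at each group boundary
      } else if ("Name".equals(row.element)) {
        currentName = row.value;                   // set
      } else {
        System.out.println(currentName + " -> " + row.element + " = " + row.value);
      }
    }
  }
}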

Choose this step, whenever you have limitations with other steps or when you are in need of parsing XML with the following conditions:

  • Very fast and independent of memory, regardless of the file size (GBs and more are possible due to the streaming approach)
  • Very flexible in reading different parts of the XML file in different ways (avoiding parsing the file many times)

Here is an example of parsing the following XML file with 2 main sample data blocks (Analyzer Lists & Products):

XML Input Stream (StAX) Test 2 - XML

A preview on the step may look like this (depending on the selected fields):

XML Input Stream (StAX) Test 2 - Preview

You see, you really get almost the original streaming information with elements and attributes from the XML file, together with other helpful fields like the element level.

The transformation looks like this:

XML Input Stream (StAX) Test 2 - Transformation

The end result for the Analyzer List block:

XML Input Stream (StAX) Test 2 - Analyzer List

The end result for the Products block (split, for example, into two separate data streams for the end system):

XML Input Stream (StAX) Test 2 - Both Products And Attributes

And there are a lot more options in the step to help solve your needs:

XML Input Stream (StAX) Test 2 - Step Options

More details about this step can be found in the XML Input Stream (StAX) documentation.

And for more details on the features for the upcoming PDI 4.2, please have a look at Matt’s blog.

Veröffentlicht in Kettle (PDI) | Kommentare deaktiviert für The new XML Input Stream (StAX) step in PDI 4.2

Security Considerations and Encryption with Kettle

Kettle is used more and more in enterprises where the standard obfuscation of credentials is not sufficient. There are requirements to use strong encryption methods and even to store internal data encrypted (covered in PDI-6168 and PDI-6170). These use cases inspired me to create some simple transformations to test and play around with encryption.

The transformations and some test data are attached to the Kettle Exchange page Security Considerations and Encryption with Kettle.

Let’s start with creating a key by the cryptographyCreateSecretKey transformation:

cryptographyCreateSecretKey

The generateKey step uses the User Defined Java Class step and implements sample code for AES; the Advanced Encryption Standard is a symmetric-key encryption standard, see also http://java.sun.com/developer/technicalArticles/Security/AES/AES_v1.html. Serializing the key to a file is a little trick to obfuscate the key; other methods can be used instead of the clear text file output.

Now that we have the key file, we can encrypt our secret data:

cryptographySampleUnencryptedInput

With the transformation cryptographyEncrypt:

cryptographyEncrypt

We keep it simple and assume the key is available in each row (accomplished by the Join Key).

The encrypted result looks like this:

cryptographySampleEncrypted

Let’s decrypt it with the transformation cryptographyDecrypt:

cryptographyDecrypt

The result is correct, but only when the key file is the same and the encrypted data was not modified. You can test it yourself and see what error messages come up or what the resulting files look like when the key file or data has been modified.
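For reference, the whole generate / encrypt / decrypt flow condenses to a few standard javax.crypto calls. Here is a plain-Java sketch of the idea (not the exact UDJC code from the attached transformations):

import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.SecretKeySpec;

public class AesRoundTrip {
  public static void main(String[] args) throws Exception {
    // Generate a 128-bit AES key (roughly what the generateKey step boils down to).
    KeyGenerator keyGen = KeyGenerator.getInstance("AES");
    keyGen.init(128);
    SecretKey key = keyGen.generateKey();
    byte[] rawKey = key.getEncoded();  // serialize / store this, e.g. obfuscated in a file

    // Encrypt a field value with the stored key material.
    SecretKeySpec keySpec = new SecretKeySpec(rawKey, "AES");
    Cipher cipher = Cipher.getInstance("AES");
    cipher.init(Cipher.ENCRYPT_MODE, keySpec);
    byte[] encrypted = cipher.doFinal("my secret value".getBytes("UTF-8"));

    // Decrypt again; this only works with the same key and unmodified ciphertext.
    cipher.init(Cipher.DECRYPT_MODE, keySpec);
    String decrypted = new String(cipher.doFinal(encrypted), "UTF-8");
    System.out.println(decrypted);  // prints: my secret value
  }
}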

Instead of storing the decrypted data in a file, there are a lot of other options, e.g.:

  • use the decrypted data as credentials in subsequent steps or transformations
  • put the decrypted data into variables visible in a limited scope (e.g. parent job) and use them as credentials for databases, repository etc. (see PDI-6168)
  • and many more options

We may consider:

  • Symmetric-key algorithm vs. asymmetric key algorithms (public-key cryptography)
  • Diffie-Hellman key exchange is a specific method of exchanging keys.
  • Ensure integrity e.g. by hash-codes
  • Key file handling could be optimized in different ways.
  • Please keep in mind that unencrypted data is in RAM (see PDI-6170 for a circumvention to prevent heap dumps)
  • Besides the binary or indexed storage types, an encrypted storage type may be possible in Kettle core.

In the end: Don’t lose your key!

Update since Kettle 4.2: There are two steps in the experimental section, Secret key generator and Symmetric Cryptography, that cover this use case.

Veröffentlicht in Kettle (PDI) | Kommentare deaktiviert für Security Considerations and Encryption with Kettle

iPhone Tracking: How to read the consolidated.db with Kettle

I will not discuss the buzz about the iPhone tracking – everything that needs to be discussed is already out there. That the iPhone stores locations in consolidated.db was known a long time ago, but now we have a proof of concept by Pete Warden and Alasdair Allan, and this inspired me to dig into it a bit more.

After an amazingly short time of less than one hour I was able to read my own iPhone’s consolidated.db with Kettle. [It takes more time to write about this… and the other really hard thing was to locate this file in my file system.] Once we can read this file with Kettle, we are open and can look up any additional data, filter by any criteria, store the data in another database, use the Pentaho BI Suite to create maps, and do anything else a human being can imagine. Just have a look at the Pentaho Sandbox or, more generally, at the Pentaho site.

Here are my findings and steps to reproduce:

1) Locate the consolidated.db in your filesystem.

This is a little bit tricky, since the filename is mapped to an ID. I found it very easy to just search for the string “CellLocation” in the iTunes backup folder, as described here for Windows:

findstr /M CellLocation "C:\Users\[Username]\AppData\Roaming\Apple Computer\MobileSync\Backup\[latest backup folder]\*.*"

I copied the file 4096c9ec676f2847dc283405900e284a7c815836 (this was my sample filename; yours will most likely be different) to another folder and renamed it to consolidated.db.

A more robust and flexible solution is to read in and analyze the Manifest.mbdb / Manifest.mbdx as described over here in a Python script. It shouldn’t be too hard to port this to Java or JavaScript for a flexible Kettle solution.

2) Read in the consolidated.db with Kettle

Create a transformation with a SQLite connection type. The database name is the location of the file, as shown below in a sample:

SQLite connection type for consolidated.db

Let’s create a simple transformation (attached): Read in the Cell data, convert the Mac Absolute Time to a Java time and just group on a daily basis for simplification:

Transformation for consolidated.db
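The Mac absolute time conversion in the transformation boils down to a fixed epoch offset: Mac (Cocoa) absolute time counts seconds since 2001-01-01 00:00:00 UTC, which is 978307200 seconds after the Unix epoch. In Java this is roughly (the sample value is just an illustration):

import java.util.Date;

public class MacTimeToJava {
  // Mac (Cocoa) absolute time: seconds since 2001-01-01 00:00:00 UTC,
  // i.e. 978307200 seconds after the Unix epoch.
  static final long MAC_EPOCH_OFFSET_SECONDS = 978307200L;

  public static Date toJavaDate(double macAbsoluteTime) {
    return new Date((long) ((macAbsoluteTime + MAC_EPOCH_OFFSET_SECONDS) * 1000.0));
  }

  public static void main(String[] args) {
    System.out.println(toJavaDate(325266000.0)); // illustrative timestamp value only
  }
}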

Let’s do a preview and browse the data a bit. Well, I’m located in Mainz, known to be at exactly latitude 50, so all other data is of interest, and here comes a sample:

Sample Result for consolidated.db

Where have I been? My forensic research on my own calendar found that I was at the Pentaho EMEA Partner Summit & Community Event in Lisbon… or was it only my iPhone?

Veröffentlicht in Kettle (PDI) | Kommentare deaktiviert für iPhone Tracking: How to read the consolidated.db with Kettle

How the SQL CASE WHEN construct can help you in pivoting data?

Within Kettle you have the Normalizer and Denormalizer steps to help with pivoting (a simplified use case for pivoting is changing the row/column axis, see also transpose). Imagine you would like to have a cross table; you may use a Kettle transformation to accomplish this. (At the time of writing, Pentaho Reporting’s cross tab functionality is on the road map but not yet implemented. One solution would be to use a Kettle transformation as a data source.)

Another solution for a cross tab is to use the SQL CASE WHEN construct. Here is an example:

The result of the following sample query select status, year_id, sum(totalprice) as totalprice from orderfact group by status, year_id looks like this:

Result of sample query1

When we want to put the years (2003, 2004, 2005) into separate columns, we can use the SQL CASE WHEN construct:

CASE WHEN Boolean_expression THEN result_expression [ …n ] [ ELSE else_result_expression ] END

to accomplish this, e.g.:

select status,
  sum(case when year_id = 2003 then totalprice else 0 end) as TOTAL_2003,
  sum(case when year_id = 2004 then totalprice else 0 end) as TOTAL_2004,
  sum(case when year_id = 2005 then totalprice else 0 end) as TOTAL_2005
from orderfact
group by status

Et voilà, you have data that can be used as a cross tab:

Result from the CASE WHEN

Another use case for dealing with planning data

In another use case, we store actual and planning data (e.g. for turnover, quantity, weights etc.). To make it a bit more challenging, planning data is often stored as a budget plus multiple forecasts or different plan versions (scenarios).

One solution for this problem is to store these data in separate columns, e.g. some columns for actuals and other columns for planning data. Having multiple measures (see above) requires multiple columns for each type of measure, e.g. turnover_actuals, quantity_actuals etc., and the same again for all planning scenarios. Thus you would get a large number of columns, and the biggest issue is: what happens when you need a new planning scenario? You would need to add more columns, so your data model is not flexible enough and needs to be changed.

Another solution is to store the definition of an actual or planning type as an extra TYPE_ID column in the table. This means each row is marked as either an actual or some kind of planning type.

But when you sum up the whole table, the result would be wrong, since it contains a mix of actual and planning data.

This can also be solved with the CASE WHEN construct, even within a Mondrian schema definition:

<Measure name="Sales Actuals" aggregator="sum" formatString="#,###.00">
  <MeasureExpression>
    <SQL dialect="generic">
      (case when type_id=0 then TOTALPRICE else 0 end)
    </SQL>
  </MeasureExpression>
</Measure>

<Measure name="Sales Budget" aggregator="sum" formatString="#,###.00">
  <MeasureExpression>
    <SQL dialect="generic">
      (case when type_id=1 then TOTALPRICE else 0 end)
    </SQL>
  </MeasureExpression>
</Measure>

[for more details see the Mondrian documentation about Measures]

With this solution you are flexible with your data model and the result will be correct.

Another, more complex use case is to combine this with a reporting solution for project costing or contribution accounting. When you have a large number of different costs and store them in separate columns, you will reach an unmanageable number of columns and your data model is not flexible enough. It is better to store these different types of costs in rows; that way you are much more flexible.

For all the flexibility in the data model, you need to consider the performance aspects of your database, e.g. whether column-based databases behave differently with CASE WHEN (I have not tested this); different approaches should be tested. Any comments and findings are much appreciated.

Veröffentlicht in General | Kommentare deaktiviert für How the SQL CASE WHEN construct can help you in pivoting data?

Connecting with SAP BI Systems

There are a lot of different options to connect to a SAP system from Kettle. It was back in 2009 when I tested the functionality to access a SAP BI system via web services (XML/A), and here are my (historic) findings:

Analyzing the result of the MDX query with Kettle is complicated (you need a separate transformation for every number of columns) but possible. These days, other Pentaho frontend tools are able to process this directly via OLAP4J. I tested this with the Web services step (using Basic Authentication), and for specific use cases this would be an option. Further development brought the OLAP Input step, which simplifies things a lot and was tested successfully.

Here is some further information that might help when setting this up:

For the XML/A interface the following webservice should be switched on in transaction SICF:
/default_host/sap/bw/xml/soap/xmla

SAP XML/A

XMLA has two methods:

  • Discover – used to query metadata and master data (this corresponds to the BAPI MDDataProviderBW).
  • Execute – executes MDX commands and returns the corresponding result set (this corresponds to the BAPI MDDataSetBW).

The description of the service can be reached at: <Protocol>://<Server>:<Port>/sap/bw/xml/soap/xmla?wsdl

For XMLA with SAP BI you have to create MDX commands generated from the metadata you receive from the system.

Below is an example from transaction MDXTEST in SAP BI, used to test an MDX query against a SAP BI InfoProvider or query:
SELECT
{[Measures].[0HDCNT_LAST],
[Measures].[ZKFHRFTE]} ON COLUMNS,
[0EMPLOYEE].MEMBERS ON ROWS
FROM [$ZMCHRPA1]

Some syntax templates are available in the MDX test editor.

Another available service gets data directly from a query or query view. It has one method: GetQueryViewData.
Go to SICF and activate this web service: /sap/bc/srt/rfc/sap/QUERY_VIEW_DATA

The output comes as datasets: AXIS_INFO (metadata description of the characteristics, attributes, and structures on the axes), AXIS_DATA (characteristic values, attribute values etc.), CELL_DATA (value cells: the cells are arranged by row first, then by column) and TXT_SYMBOLS (text elements for the query).

Further information:

Unfortunately, all screen shots and data samples I have are from a real system, but I hope some of the information given here is nevertheless useful for setting up the SAP side.

Thanks to Paul Stoellberger, who continued to work on this. He provided me with some screen shots of Pentaho Analysis Tool (PAT), Pentaho Reporting and Kettle accessing a SAP BI system.

Update: SAP recommends using Enterprise Services, aka web services, to access the functionality remotely. Further information can be found over here:
http://esworkplace.sap.com/

Veröffentlicht in Kettle (PDI) | Kommentare deaktiviert für Connecting with SAP BI Systems

Operational Patterns and the Watchdog Concept for Kettle

The Watchdog Concept for Kettle was presented at the Pentaho Community Event in Cascais, Portugal in September 2010. The idea came to me when I created the operational patterns for the Pentaho Data Integration for Database Developers training course and combined them with solutions from electronics for this kind of problem (e.g. detecting software or hardware errors reliably), applying immune-aware development concepts.

What’s the idea behind this?

When you want to check whether your Kettle jobs and transformations execute correctly, you can use operational patterns like

  • Health Checks for the JVM
  • Health checks for clustered environments and failover of a master server
  • Analyze log entries
  • Use restartable solutions
  • and others that are described in more detail in the Pentaho PDI training class

When you need a different approach for more safety and to measure additional values, you can combine the normal patterns with the concept of a watchdog. A use case: you are building a very critical system (imagine a rocket flying to space) and engage different software developers, different measuring units and different computers to test that everything is correct. When one of the units fails, you can signal an event and e.g. abort the launch. Well, the Kettle world is usually not that critical, but Kettle is used more and more in systems that need very high reliability and 24×7 operation with high, continuously guaranteed throughput.

With a Kettle solution you may have many components to check, e.g. the Data Integration Server, Carte and clustered environments, Kitchen/Pan jobs and transformations, deadlock detection, and also external components like the JVM (e.g. memory, used CPUs), the server (e.g. test with a network ping) and the databases (up and running, capacity etc.).

You also need signal-to-noise detection: define what is normal (noise) and notify on exceptions (e.g. set thresholds, absolute or relative, possibly based on averages), define what is unusual and notify on these events, and define the events themselves (e.g. actions, notifications & alerts). With all these checks you have one primary constraint: you need to minimise the footprint and impact of the measurements on the system.

Why can a Watchdog help here?

One solution for checking if everything is on track is to use the concept of a Watchdog with tasks and events:

 PDI Watchdog

A software crash might go undetected by conventional watchdog strategies when you have multitasking (e.g. many PDI jobs and cluster nodes). The success lies in weaving the watchdog into the fabric of all of the system’s tasks, which is much easier than it sounds:

  1. Build a watchdog task
  2. Create a data structure (database table) that has one entry per task
  3. When a task starts it increments its entry in the structure. Tasks that only start once and stay active forever can increment the appropriate value each time through their main loops, e.g. every 10,000 rows
  4. As the job or transformation runs the number of counts for each task advances.
  5. Infrequently but at regular intervals the watchdog runs.
  6. The watchdog scans the structure, checking that the count stored for each task is reasonable. One that runs often should have a high count; another which executes infrequently will produce a smaller value.
  7. If the counts are unreasonable, halt and let the watchdog timeout and fire an event. If everything is OK, set all of the counts to zero and exit. 

This concept is derived from Jack Ganssle (2004): Great Watchdogs (especially the section WDTs for Multitasking).
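The steps above condense into very little code. Here is a minimal in-memory Java sketch of the heartbeat-counter idea (the real Kettle solution keeps the counters in a database table instead, as described in step 2):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class Watchdog {
  private final Map<String, AtomicLong> counters = new ConcurrentHashMap<String, AtomicLong>();
  private final Map<String, Long> minExpected = new ConcurrentHashMap<String, Long>();

  // Register a task with the minimum count it should reach between two watchdog runs.
  public void register(String task, long expectedMinimum) {
    counters.put(task, new AtomicLong(0));
    minExpected.put(task, expectedMinimum);
  }

  // Called by the task itself, e.g. every 10,000 processed rows or once per main loop.
  public void heartbeat(String task) {
    counters.get(task).incrementAndGet();
  }

  // Called infrequently but at regular intervals; fires an event if a task fell behind,
  // otherwise resets all counters for the next interval.
  public void check() {
    for (Map.Entry<String, AtomicLong> entry : counters.entrySet()) {
      long count = entry.getValue().get();
      if (count < minExpected.get(entry.getKey())) {
        fireEvent(entry.getKey(), count); // e.g. notify, restart, abort
      }
      entry.getValue().set(0);
    }
  }

  private void fireEvent(String task, long count) {
    System.err.println("Watchdog alert: task " + task + " only reported " + count + " heartbeats");
  }
}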

An example implementation with Kettle

Most of the Kettle health checks can be accomplished with this watchdog concept. An example implementation with Kettle is task-oriented, not server-oriented: it checks whether the task (a transformation or job) is running as expected, independent of the environment it runs in (e.g. clustered or not).

Since Kettle has a wide variety of transformation steps and job entries, I often tend not to program in a classical programming language but to solve this solely with Kettle.

Here is an example of the event handling, which can be extended very flexibly by adding new events:

Watchdog Event Handling

And here is an example of the watchdog check logic described above. Sooner or later this could be simplified by building some Kettle steps that encapsulate this logic, but for now it just works:

Watchdog check logic

Veröffentlicht in Kettle (PDI) | Kommentare deaktiviert für Operational Patterns and the Watchdog Concept for Kettle

Another blog about Kettle aka Pentaho Data Integration (PDI)

This is the starting point for the blog Fun Stuff about the Open Source ETL Tool Kettle aka Pentaho Data Integration (PDI).

The short history: Five years ago, in early December 2005, Matt Casters released the initial open source version of Kettle. I knew Kettle already before this time, developed a plug-in to get data out of SAP in 2005, did a lot of things around this great tool and joined Pentaho in mid 2007. You can read a bit more about the story in the blog post Some Historic Cornerstones of Kettle & Pentaho and in the book Pentaho Kettle Solutions by Matt Casters, Roland Bouman and Jos van Dongen.

Pentaho

More about Pentaho:

Pentaho Links
Pentaho Books


JB

More about me:

Homepage
XING Profile
Twitter

Veröffentlicht in General, Kettle (PDI) | Kommentare deaktiviert für Another blog about Kettle aka Pentaho Data Integration (PDI)