Pentaho 5.2 Released at PentahoWorld 2014

A lot of exciting things happened at PentahoWorld 2014; you can read a lot about it on Twitter under #PWorld2014. It is amazing what momentum we have reached: over 400 attendees came to this conference. BTW: Did you know that Pentaho celebrated its 10th anniversary last week, too?

PentahoWorld 2014

For me, as a “PDI/Kettle addicted person” ;-) one of the major topics was the release of Pentaho 5.2 (BA & PDI). You can read about all the great new features over here: Upgrade Existing Pentaho Systems (the press release is not out yet at the time of writing this article, but will come soon over here: Pentaho Press Releases).

PDI 5.2

Here are the highlights of PDI 5.2:

Pentaho Data Integration 5.2 delivers many exciting and powerful features that help you quickly and securely access, blend, transform, and explore data.

New Streamlined Data Refinery Feature

The Streamlined Data Refinery (SDR)  is a simplified, ad hoc ETL refinery composed of a series of PDI jobs that take raw data, augment and blend it through the request form, and then publish it to the BA Server for report designers to use in Analyzer.

R Script Executor Step Improvements

The R Script Executor, Weka Forecasting, and Weka Scoring steps form the core of the Data Science Pack and transform PDI into a powerful predictive analytics tool. The R Script Executor step allows you to incorporate R scripts in your transformation so that you can include R-based statistical programming in your data flow. In PDI version 5.2 you can now “plug and play” R scripts without extra customization. You can pass incoming field metadata to the output field metadata, use a more intuitive user interface to run scripts by rows or by batches, and test scripts.

New DI Server Administration Features

Porting content from one environment to another and performing general DI Repository maintenance is easier with the introduction of the new Purge Utility.  The Purge Utility permanently purges the repository of versions of shared objects, such as database connection information, jobs, and transformations.  You can also turn DI Repository versioning and comment capturing capabilities on and off.

Kerberos Security Support for CDH 5.1 and HDP 2.1

If you are already using Kerberos to authenticate access to a Cloudera Distributed Hadoop 5.1 or Hortonworks Data Platform 2.1 cluster, with a little extra configuration, you can also use Kerberos to authenticate Pentaho DI users who need to access those clusters.

New Marketplace Plugins

Pentaho Marketplace continues to grow with many more of your contributions. Pentaho Marketplace is a home for community-developed plugins and a place where you can contribute, learn, benefit from, and connect to others. New contributions include:

  • LookupTimeDimensionStep: Looks up and creates an entry on a data warehouse dimension time table and returns the ID.
  • Probabilistic Row Distributions: Contains a collection of Row Distribution plugins for PDI that use probabilistic methods for determining the distribution of rows.
  • PDI Groovy Console: Adds a Groovy console to the Help menu that has helper methods and classes that interact with the PDI environment.
  • Gremlin Script Step: Provides a Gremlin script step for graph pipeline processing.
  • Avro Output Plugin: Outputs Avro files. Avro files are commonly used in Hadoop, allowing for schema evolution and truly separating the write schema from the read schema.

Improved Upgrade Experience

Upgrading PDI is easier because it is no longer a manual process.  You can now upgrade from 5.1.x to 5.2 using the same upgrade utility used for patch releases.

More Details and Download


PentahoWorld 2014 – The 1st Annual Worldwide Conference is coming soon

I’m very happy to see PentahoWorld coming soon on October 8 through 10, 2014. It is our first(!) worldwide conference for Users, Advocates and Partners of Pentaho.

After many years of Pentaho community meetings, this is the next logical step in boosting the great Pentaho story. I initiated and organized the first Pentaho community meeting. It was held in my hometown Mainz, Germany back in 2008, and we had a great audience, a lot of information, talks and fun. It has continued every year since and will continue in the future: the next community meeting is scheduled for November 14 through 16 in Antwerp, Belgium.

BTW: While writing this, a lot of history comes to my mind, so I will continue to write a bit more about the history that formed and forms our future in IT, Pentaho and Big Data in the next couple of blog posts. Stay tuned via my Twitter account or the RSS feeds of this blog.

And for now, have a look at some details of PentahoWorld 2014:

Training and Interactive Sessions: Three days of rich content to fuel the data-driven business

Learn how to apply the latest Pentaho innovations, see how customers use Pentaho to solve real-world problems, gain insight into product roadmaps and strategic plans, and take away practical knowledge and best practices.

Test-Drive Product Training: Hands-on sessions will provide the opportunity to explore Pentaho with Hadoop, NoSQL databases and WEKA predictive algorithms.

Meet the Experts: At the event, schedule 1:1 meetings with Pentaho innovators, partners and architects who will offer real-time advice, best practices and a look into next-generation products.

Breakout Sessions around:

  • Data Integration
  • Business Analytics
  • Advanced Analytics
  • Big Data Today
  • Embedded Analytics

For more information, check out the agenda.

You also get compelling keynotes and general sessions that help you develop a roadmap for scale, growth and success: See the list of speakers.

And last but not least: it’s always worth making a trip to Orlando, especially at this time of the year! I can vouch for this from my experience of the last seven years with Pentaho.

We also paved the way by giving you 10 great reasons to attend and even arguments at hand to convince your boss :-)

 


Pentaho 5.1 EE & CE is released – What’s New in PDI & Big Data?

Pentaho 5.1 delivers code-free analytics directly on MongoDB, simplifies data preparation for data scientists and adds full YARN support. See the full list of Pentaho Business Analytics 5.1 features and technical specifications.

Here are some more details about the new PDI 5.1 release:

Native (YARN) Hadoop integration

PDI includes support for YARN capabilities including enhanced scalability, compatibility with MapReduce, improved cluster use, and greater support for agile development processes.  YARN also provides support for non-MapReduce workloads.

YARN for Carte Kettle Clusters

The Start a YARN Kettle Cluster and Stop a YARN Kettle Cluster entries make it possible to execute Carte transformations in parallel using YARN. Carte clusters are implemented using the resources and data nodes of a Hadoop cluster, which optimizes resources and speeds processing.

Cloudera, Hortonworks, and MapR Hadoop Distribution Support

Use Pentaho’s innovative Big Data Adaptive Layer to connect to more Hadoop Distributions, including Cloudera 5, Hortonworks 2.1, and MapR 3.1.  These certified and tested YARN-based distributions allow you to use PDI to build scalable solutions that are optimized for performance. Pentaho supports over 20 different Hadoop Distribution versions from vendors such as Apache, Hortonworks, Intel, and MapR.  Pentaho also supports Cloudera distributions and is a certified Cloudera partner.

Data Science Pack with Weka and R

The R Script Executor, Weka Forecasting, and Weka Scoring steps form the core of the Data Science Pack and transform PDI into a powerful predictive analytics tool. The R Script Executor step, which is new in 5.1, lets you include R scripts in your transformations and jobs. You can customize random seed sampling, limit the batch and reservoir size, adjust logging level messages, and more. You can also choose to load the script from a file at runtime, giving you more flexibility in transformation design.

Security Enhancements

PDI security has been enhanced to include support for more standard security protocols and specifications.

  • AES Password Support: Use Advanced Encryption Standard (AES) to encrypt passwords for databases, slave servers, web service connections, and more. AES uses a symmetric key algorithm to secure your data.
  • New Execute Permission: You can now choose whether to grant permission to execute transformations and jobs by user role.  This provides more finely-tuned access controls for different groups and can be useful for auditing, deployment, or quality assurance purposes.
  • Kerberos Security Support for Hadoop & MongoDB: If you are already using Kerberos to authenticate access to a data source, with a little extra configuration you can also use Kerberos to authenticate DI users who need to access your data.

Teradata and Vertica Bulkloaders

There are two new bulkloader steps: Teradata TPT Bulkloader and Vertica Bulkloader.  Also, newer versions of Teradata and Vertica are now supported.

New Marketplace Plugins

Pentaho Marketplace continues to grow with many more of your contributions.  As a testimonial to the power of community, Pentaho Marketplace is a home for your plugins and a place where you can contribute, learn, benefit from, and connect to others. New contributions include:

  • Riak Consumer and Producer:  Links with Maven to provide dependency management.
  • Load Text From File:  Uses Apache Tika to extract text from files in many different formats, such as PDF and XLS.
  • Top / Bottom / First / Last filter: Filters rows based on a field’s values or row numbers.
  • Map (key/value pair) type:  Provides a ValueMeta plugin for key/value pairs that are backed by a java.util.Map.
  • PDI Tableau Data Extract Output:  Use Pentaho’s ETL capabilities to generate a Tableau Data Extract (tde) file.
  • PDI NuoDB: Provides a PDI database dialect for the NuoDB NewSQL database that works in the cloud.
  • Neo4j JDBC Connector:  Provides a PDI database dialect for the Neo4j graph database.
  • HTML to XML: Uses JTidy to convert HTML into XML or XHTML.
  • Apache Kafka Consumer and Producer: Reads and sends binary messages to and from Apache Kafka message queues.
  • LegStar z/OS File Reader: Reads raw z/OS records from a file and transforms them to PDI rows.
  • Compare Fields:  Compares 2 sets of fields in a row and directs it to the appropriate destination step based on the result. This step detects identical, changed, added, and removed rows.

There are many more new plugins such as IC AMQP, IC Bloom filter, JaRE  (Java Rule Engine), JDBC Metadata, Memcached Input/Output, Google Spreadsheet Input/Output, and Sentiment Analysis.

… and the list keeps growing; check out the latest contributions (within the Data Integration design interface Spoon under Help / Marketplace).

Documentation

And last but not least: We have a great new documentation system, check out our documentation around New Features in Pentaho Data Integration 5.1

Download 5.1

Test 5.1 and download the Community Edition (with limited functionality, no support etc., but free as in beer) or start with our evaluation Enterprise Edition.


Pentaho 5.0 CE is released – What’s New in PDI & Big Data?

In addition to the release of the Enterprise Edition (EE), Pentaho released the stable build of the 5.0 Community Edition (CE).

If you can’t wait to get to the download, have a look at our new community.pentaho.com. It hosts all information and the download link. And for the download of a free 30-day trial of Pentaho EE, visit pentaho.com/download.

Among all the other great additions in the Pentaho Business Analytics Suite 5.0, here are the highlights for the new PDI 5.0 release:

PDI Community Edition includes the following new features:

  • Inline Help links on transformation steps and job entries
  • Simplified looping – call jobs and transformations within a transformation and loop through the rows
  • Better previewing with dedicated window in run tab
  • Detailed timing metrics for low-level operations – helps to analyze bottlenecks in more detail, e.g. database connect & query time vs. transformation time
  • Extended monitoring of sub-jobs and transformations in the Carte and DI Server (Expand remote job option)
  • Introduction of REST services (Carte and DI-Server)
  • Several new Transformation Steps and Job Entries including: Splunk input/output, Table compare, ZIP file, OpenERP input/output, Telnet, Nagios traps…
  • Lots of new configuration options for tuning the engine back-ends
  • Marketplace – Share and/or download new plug-ins
  • A series of new plugin systems to make it easier to extend Kettle (data types, extension points, carte servlets, logging tables, step-to-step row distributions…)
  • and many more…

PDI Enterprise Edition includes the following new features in addition to the CE features:

  • Kettle JDBC driver – Query data using SQL/JDBC from any transformation, see how this can be used to blend data on the source in Pentaho’s Big Data Blend of the Week
  • Job Restart-ability – Set checkpoints, Restart jobs from last successful checkpoint
  • Transactional Job Execution – Provides the ability to roll back job execution on failure
  • Security on Database Connections – Full permissions: R/W/D, Securely share connections with other users or groups
  • Load Balancing of data within Transformations
  • WebSphere MQ / MQSeries integration
  • 4.4 to 5.0 Migration – Ease-of-use for Administrators

Matt Casters, Pentaho’s Chief Architect of PDI & Kettle Project Founder, will post more details over the coming weeks, but in case you can’t wait, test it yourself and go to community.pentaho.com or pentaho.com/download.

And what about big data?

Pentaho continues to lead in embracing and extending the big data ecosystem

Enhanced Big Data Functionality (5.0):

  • New InstaView use case templates for Hadoop and Splunk
  • Expanded NoSQL Integration

Expanded Big Data Ecosystem Support (5.0):

  • Hadoop High Availability support
  • New Integrations: RedShift, Impala, Splunk
  • New Hadoop Certs: Intel, Hortonworks, DataStax
  • Support for latest versions of CDH, MapR, MongoDB and Cassandra

Adaptive Big Data Layer (Transparent Access to & Integration of Big Data):

  • Insulates from changing versions, vendors, data stores
  • Gives customers broad flexibility of choice, rapid time to value, reduced risk
  • Provides native integration into the big data ecosystem
  • Broadest, deepest Big Data Support

We are able to deliver support for new technologies as plug-ins very quickly, without depending on the release cycles of our core products. For the latest additions, see also the new and updated content in the Pentaho Big Data wiki space – stay tuned!

Learn more about Pentaho & Big Data and check out: pentahobigdata.com


What’s New in PDI 4.4?

Among the other great additions like Mobile and many other feature enhancements in the Pentaho Business Analytics Suite 4.8, here are the highlights for the new PDI 4.4 release:

Pentaho Instaview

Pentaho Instaview is the fastest way to start using Pentaho Data Integration to analyze and visualize data. Instaview uses templates to manage the complexities of data access and preparation. You can focus on selecting and filtering the data you want to explore, rather than spending time creating source connections and identifying measure and dimension fields. Once the data has been selected, Instaview automatically generates transformation and metadata models, executes them, and launches Pentaho Analyzer. This allows you to explore your data in the Analyzer desktop user interface.
As your data requirements become more advanced, you have the ability to create your own templates and use the full power of Pentaho Data Integration (PDI).
Watch this video and see the Getting Started with Pentaho Data Integration Instaview Guide to understand and learn more about Pentaho Instaview.

PDI Operations Mart

The PDI Operations Mart enables administrators to collect PDI log data into one centralized data mart and query it for easy reporting and analysis. The operations mart has predefined samples for Pentaho Analyzer, Interactive Reporting, and Dashboards. You can create individualized reports to meet your specific needs.

Sample inquiries include:

  • How many jobs or transformations have been successful compared to how many failed in a given period?
  • How many jobs or transformations are currently running?
  • What are the longest running jobs or transformations in a given period?
  • What is the highest failure rate of jobs or transformations in a given period?
  • How many rows have been processed in a particular time period? This enables you to see rows processed over time as a time series for selected transformations.

The operations mart provides setup procedures for MySQL, Oracle, and PostgreSQL databases. Installation instructions for the PDI Operations Mart are available in the Pentaho InfoCenter.

Concat Fields Step

The Concat Fields step is used to join multiple fields into one target field. The fields can be delimited by a separator and the enclosure logic is completely compatible with the Text File Output step.

This step is very useful for joining fields as key/value pairs for the Hadoop MapReduce Output step.

SAS Input Step

The SAS Input step reads files in the sas7bdat format created by SAS software, allowing PDI developers to import sas7bdat files directly.

EDI to XML Step

The EDI to XML step converts EDI message text, which conforms to the ISO 9735 standard, to generic XML. The XML text is more accessible and enables selective data extraction using XPath and the Get Data From XML step.

New Pentaho Data Integration Software Development Kit (PDI SDK)

Extending and Embedding Pentaho Data Integration enables developers to utilize PDI beyond the out-of-the-box functionality. This guide explains the mechanics of extending PDI with plugins. It also explains embedding PDI functionality directly into Java applications. Pentaho provides sample code for all plugin types and embedding scenarios. More details can be found in the Pentaho InfoCenter under Embedding and Extending Pentaho Data Integration.

The complete PDI 4.4 change log: Now with JIRA Components and PDI Sub-Components

This is the first release with very detailed information about what has changed at a more granular level, so you can query JIRA by components and PDI sub-components such as steps, job entries and specific database topics. This helps in the upgrade process and in finding existing issues in JIRA. More details can be found in the Pentaho Wiki page JIRA PDI Components, Sub-components and Labels, and the complete PDI 4.4 change log can be found in JIRA. You need to be logged into JIRA to include the fields in your default report, but you can also download the complete list when you select “View” / “Excel (All fields)”.

Some background about versioning and compatibility for Kettle core

From the Kettle core perspective, 4.3.0 and 4.4.0 are still bug fix releases for 4.2.x but with significant new features. As such, we have decided to call the release version 4.4.0 rather than 4.3.1. As always, please see the PDI Upgrade Guide for specific topics when upgrading.

Download PDI 4.4

And if you don’t have it already, the download is over here: www.pentaho.com/download (Enterprise Edition) and should soon be on SourceForge (Community Edition).


Call Kettle from PostgreSQL – Part 1

postgresql_kettle_integration_components_2012-09-29.png

Some questions and use cases that led me to investigate this area:

  • Can Kettle be called directly from a database trigger? So when rows get inserted, deleted, changed, I want to call a Kettle transformation.
  • Can we call a function in a SQL statement that returns the fraud probability calculated by a WEKA data mining model?
  • Can we enrich some inserted data rows by external data sources?
  • Performance considerations: I don’t want to transport the data to the process, I want to process the data where it is

So, let’s try to integrate and embed Kettle into PostgreSQL:

Here is a short how-to (based on a Windows 7 64-bit machine running PostgreSQL 9.1):

  • Download pljava-xxx.tar.gz from http://pgfoundry.org/projects/pljava – PL/Java is needed to call Java from PostgreSQL
  • Stop the postgresql Service, make a backup
  • After unpacking, follow the instructions given in pljava-x86_64-w64-mingw32-pg9.1-1.4.3/docs/docs/readme.html

Example for modifying the postgresql.conf:

  • dynamic_library_path = '$libdir;C:/Pentaho/_solutions_samples/realtime/pljava'
  • log_min_messages = info
  • custom_variable_classes = 'pljava'
  • pljava.classpath = C:/Pentaho/_solutions_samples/realtime/pljava/pljava.jar
  • [When needed, add and modify:] pljava.vmoptions = '-Xmx512M'

You may need to change the Windows system PATH; I found an issue when loading pljava.dll:
ERROR: could not load library "[...]/pljava.dll": The specified module could not be found.

Dependency Walker helped in finding the missing parts, so I added the following to the system PATH (this might be different on your setup and is only an example):

  • C:\Program Files\Java\jre6\bin\server
  • C:\Program Files\PostgreSQL\9.1\bin

And follow the other steps in the readme (depending on your needs) and finally:

CREATE FUNCTION getsysprop(VARCHAR) RETURNS VARCHAR
AS 'java.lang.System.getProperty'
LANGUAGE java;

SELECT getsysprop('user.home');

The result of the query and function call is C:\User\postgres since the PostgreSQL service and the JVM run as the user postgres.

  • Now, integrate the kettle-core.jar library for a first test:

SELECT sqlj.install_jar('file:///Pentaho/pdi-ee-4.3.0-GA/data-integration/lib/kettle-core.jar', 'kettle_core', false);

SELECT sqlj.set_classpath('public', 'kettle_core');

CREATE FUNCTION initCapKettle(VARCHAR) RETURNS VARCHAR
AS 'org.pentaho.di.core.util.StringUtil.initCap'
LANGUAGE javaU;

If you like, have a look at the loaded classes in the table sqlj.jar_entry:
sqljar_entry.png

After executing select initCapKettle('jens'); you can check that the first letter is changed into a capital 'J' by a Kettle method:
initcaptest.PNG

So, the first part of the tests worked very well.

Considerations for the future part 2 “Testing a Kettle transformation” of the proof of concept:

  • PL/Java is limited to single threading in the JVM: we need to use the Kettle single threaded engine (see the sketch after this list)
  • We may need to keep instances of the transformation meta definition. When to clean up?
  • We may need to define a threshold for processing a specific number of rows in a chunk, or after a timeout, to avoid putting excessive load on the database.
  • I’m sure there are other unknowns…
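To give an idea of where part 2 is heading, here is a rough sketch of driving a transformation with Kettle's single threaded engine from plain Java. The class and method names are from the Kettle 4.x API as I remember them and may differ slightly between versions, and the .ktr path is only an illustration; treat this as pseudocode until part 2 is published:

import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.SingleThreadedTransExecutor;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class SingleThreadedSketch {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();                        // expensive: do this only once per JVM

        // Load and prepare the transformation, but do not start any step threads.
        TransMeta transMeta = new TransMeta("C:/Pentaho/_solutions_samples/realtime/sample.ktr");
        Trans trans = new Trans(transMeta);
        trans.prepareExecution(null);

        SingleThreadedTransExecutor executor = new SingleThreadedTransExecutor(trans);
        executor.init();

        // Each time the trigger has collected a batch of rows, drive the engine one iteration:
        // every step processes the rows currently waiting in its input buffer, all in one thread.
        executor.oneIteration();

        // ... repeat oneIteration() for further batches, then clean up:
        executor.dispose();
    }
}

In a real embedding, the rows collected by the trigger would be handed to the transformation via an Injector step and a row producer, and the results picked up via a row listener; that is exactly the part that needs testing in part 2.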

This was first presented at the Pentaho Community Meeting 2012 in Amsterdam on September 29th, 2012, and part 2 with more tests will come soon.


Carte as a Windows Service

Carte is a simple web server that allows you to execute transformations and jobs remotely and to execute clustered transformations. Carte is normally started with a .bat file within Windows environments, but there are some use cases for running Carte as a Windows service, e.g.:

  • When Carte instances are running in a command window, anyone could close the window by mistake and Carte would go down.
  • The Carte.bat command window is tied to the user session that called the batch file, and that user needs to stay logged in.
  • With a Windows Service you can start the Carte service at machine startup and also configure it to restart after a crash.

After you have completed the instructions on the Pentaho Wiki, you are able to get Carte running as a Windows service like this:

yajsw_Pentaho_DI_Carte_Service

It is also possible to get a system tray icon

yajsw_system_tray

that shows the console output of Carte, e.g. after the initial start of the Windows service:

yajsw_console1

The console shows the same information as you get via the browser when you are logged into Carte, e.g. after the execution of the sample transformation:

yajsw_console2

Many thanks to the developers of YAJSW (Yet Another Java Service Wrapper) for this great solution for wrapping Java programs as services, and to Dan for bringing this to my attention.

More details can be found on the Pentaho Wiki: Carte as a Windows Service


Kettle and NoSQL: MongoDB

On May 23rd 2012 Pentaho and 10gen are jointly announcing a partnership to provide direct integration between Pentaho Business Analytics and MongoDB.

MongoDB is a scalable, high-performance, open source NoSQL database featuring document-oriented storage, auto-sharding for horizontal scalability, rich document-based queries and fast in-place updates. MongoDB is designed with both scalability and developer agility in mind. Instead of storing your data in tables and rows as you would with a relational database, in MongoDB you store JSON-like documents with dynamic schemas. The goal of MongoDB is to bridge the gap between key-value stores (which are fast and scalable) and relational databases (which have rich functionality).

Working together, Pentaho and 10gen offer the first MongoDB-based big data analytics solution to the market. This solution combines MongoDB with Pentaho’s visual interfaces for high-performance data input, output and manipulation, as well as data discovery, visualization and predictive analytics. This makes it easy and productive for IT staff, developers, data scientists and business analysts to operationalize, integrate and analyze both big data and traditional data sources.

MongoDB and further BigData and NoSQL connectors are available in all editions of Kettle including Community, Basic, Professional and Enterprise since Version 4.3.

MongoDB Sample 1 MongoDB Sample 2

Kettle makes it really easy to extract data from multiple big data and traditional sources, and integrate it into a data mart or warehouse. The visual interfaces for big data will accelerate development and maintenance of MongoDB-based solutions a lot!

More details and resources can be found over here: pentaho.com/10gen


Data Profiling and Data Quality (Human Inference) Integration with Kettle

Data Profiling with DataCleaner (Human Inference) and Kettle

It was already possible to profile your data in an easy way with Kettle: open the Database Explorer, choose a table and right-click in the context menu on Data Profile. The result was basic information about the data such as min, max and count for strings, plus additional information for numeric data, but these were only basic metrics. We have a much better solution now:

Human Inference (DataCleaner) and Pentaho (Kettle) worked together to integrate their tools and the result is a nice and seamless integration of DataCleaner into Kettle. A sample for introduction and FAQ can be found at Kettle Data Profiling with DataCleaner

You can right click on any step within your transformation and profile your data. It is also possible to clean or harmonize your data and check the result directly within your transformation.

DC context menu

Here are some screen shots:

  • Number Analyzer

DC Number Analyzer

  • String Analyzer

DC String Analyzer

  • Pattern Analyzer (e.g. you see how many values consist of a single word starting with a capital letter, etc.):

DC Pattern Analyzer

And there are many more analyzers you can choose from: Matching & Deduplication, Boolean, Character Set Distribution, Data Gap, Date/Time, Reference Data Matcher, Value & Weekday Distribution Analyzers.

DC Analyzers

For a complete reference look at http://datacleaner.eobjects.org and for a quick introduction and sample check out: http://wiki.pentaho.com/display/EAI/Kettle+Data+Profiling+with+DataCleaner

Data Quality with EasyDQ (Human Inference) and Kettle

In addition to the data profiling capabilities, a couple of steps for data quality have been introduced into Kettle as plug-ins. This covers:

  • Name Validation, Standardization and Cleansing
  • Address Validation, Standardization and Cleansing
  • E-Mail and Telephone Validation, Standardization and Cleansing
  • Duplicate Detection and Merge Duplicates

Here is a validation example:

DQ Sample 1


More details, download instructions, a sample video and examples can be found over here:


Pentaho Kettle for Big Data

All of Pentaho’s big data capabilities will be available as open source in the new Pentaho Kettle 4.3 release: Big data capabilities include the ability to input, output, manipulate and report on data using the following Hadoop and NoSQL stores: Cassandra, Hadoop HDFS, Hadoop MapReduce, Hadapt, HBase, Hive, HPCC Systems and MongoDB.

With regard to Hadoop, Pentaho Kettle makes available job orchestration steps for Hadoop, Amazon Elastic MapReduce, Pentaho MapReduce, HDFS File Operations, and Pig scripts. All major Hadoop distributions are supported including: Amazon Elastic MapReduce, Apache Hadoop, Cloudera’s Distribution including Apache Hadoop (CDH), Cloudera Enterprise, EMC Greenplum HD, HortonWorks Data Platform powered by Apache Hadoop, and MapR’s M3 Free and M5 Edition. Pentaho Kettle can execute ETL transforms outside the Hadoop cluster or within the nodes of the cluster, taking advantage of Hadoop’s distributed processing and reliability.

Download, how-to docs, videos and more at http://community.pentaho.com/BigData, see also Matt’s blog at http://www.ibridge.be/?p=207


ExeBatLauncher: Calling .bat Files by .exe Executables (e.g. Having a Spoon.exe)

The ExeBatLauncher is a simple way of calling .bat files as .exe files: it creates an .exe command that calls the .bat command of the same name.

For example, if you want to call the Spoon.bat file via Spoon.exe, simply rename the ExeBatLauncher.exe (found in the distrib folder) to Spoon.exe, copy it to the same directory, and you are done.

If you want it now, just download it and use it, it’s free :-)

If you want to have a polished Spoon.exe with the right icon to test with, just download it from the PDI-6949 attachment.

The reason why I created this was a Pentaho Data Integration JIRA case: PDI-6949 – “While in PDI-2925 the feature of kettle.exe was removed, you cannot add a .bat file to the quick launch of Windows 7. The workaround is to add a cmd.exe to the quick launch and add /c <path to .bat> to modify it. This is due to microsoft security in Win 7.”

I also thought a lot about launching Kettle just by clicking on a .ktr or .kjb file and accomplished this with a wrapper .bat (calling the target .bat with the /file: option). There are also nasty ways of launching .bat files by creating self-extracting .exe archives (nice trick), but this is just too dirty to be professional. I tried this as well and it worked, even with the Spoon icon, but limitations like having to extract the .bat file from the self-extracting archive into the distribution folder make it not a good solution. Further investigation into existing solutions for calling Java from .exe brought up some nice projects, but that is not the solution we want at this time.

I wanted to have a pretty simple .exe that just calls the .bat with the same name and here is the code for this:

ExeBatLauncher.cpp

As you see it supports:

  • calling the .exe/.bat from every folder
  • passing all arguments from the caller
  • returning the return code to the caller

If you want to compile it, follow the instructions in the _README.txt file in the attachment: ExeBatLauncher.zip

I look forward to your experiences!


First German Pentaho Customer Meeting in Munich

Today was our first German customer meeting, which Bruno initiated; it took place at one of our customer locations, @Wirecard AG (many thanks!).

With a nice (and packed…) agenda, around 20 attendees discussed the presented customer solutions, their experiences and what Pentaho can do better, but also what really works well, and that is a lot, proven in medium and large deployments. Here are some examples in addition to “normal” BI:

  • Processing 500 thousand transactions per day (credit card and alternative payment methods)
  • Merchant deposit monitoring including fraud detection of merchants (!)
  • Application and data row level role based security in a multi-tenant solution using special role management via a Mondrian plug-in
  • Highly performant solutions with load-balancing servers and high availability (99.95%), one of them with failover across two completely separate locations
  • Pentaho integrates very well in existing IT infrastructures
  • Event triggered ETL by an ESB (Enterprise Service Bus) / SOA architecture: JBoss ESB triggers PDI Jobs for application integration (JMS, File, HTTP, SMTP, FTP etc.). At this time via the Kettle API but it is planned to use Kettle Web-Services
  • PDI is used as a data source for reporting, just a nice EAI use case
  • Pentaho is integrated into JBoss EAP (the customer version of JBoss AS, everything in this platform is secured by default)
  • 3 of the attendees are using Infobright
  • Report bursting was accomplished by 3 different methods: 1) using xActions, 2) using ant scripts, 3) using Kettle

Major discussed improvements:

  • More integrated and consistent security throughout all products (it’s possible but special requirements are tricky to implement)
  • Report bursting with scheduling should be more standardized and user friendly
  • Since Pentaho can be extended and embedded very well via extension APIs, the upgrade to next versions should be more seamless by special dedicated documentation for OEM and stable API extension points [this will be addressed in the road map by the SDK]
  • Upgrade process: upgrade tools would be a huge benefit for supporting it, also when some releases were not installed (e.g. jump from 3.x to 4.1 instead of 4.0)

As part of the Pentaho Product Management team, I presented the product road map and gave some quick insights in the Kettle Star Modeler.

I look forward to the next event of this type for knowledge sharing between Pentaho customers and users. We found already one volunteer (Guy) to take it from here :-)

BTW: This event was only in German language (at this time!) and thanks again to Wirecard, Rob & Bruno for organizing this and to the attendees for their presentations and active participation!

Here are some impressions….

 2011-12-06 Munic A  2011-12-06 Munic B  2011-12-06 Munic C


The new XML Input Stream (StAX) step in PDI 4.2

This step provides the ability to read data from any type of XML file using the StAX parser. The existing Get Data from XML step is easier to use but uses DOM parsers that require in-memory processing, and even purging parts of the file is not sufficient when those parts are very big.

The XML Input Stream (StAX) step uses a completely different approach to solve use cases with very big and complex data structures and the need for very fast data loads: since Kettle has so many steps of its own to process data in different ways, the processing logic has been moved into the transformation, and the step itself provides the raw XML data stream together with additional, helpful processing information.

Since the processing logic of some XML files can sometimes be very tricky, a good knowledge of the existing Kettle steps is recommended to use this step. Please see the different samples at the Kettle Wiki for illustrations of the usage.

Note: In almost all use cases, a Set/Reset functionality was needed. At this time it can be accomplished with the Modified Java Script Value step or the User Defined Java Class step, where the latter is recommended and much faster. A dedicated Kettle step with Set/Reset functionality is on the road map to solve these and similar use cases, see PDI-6389 for more details.
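To illustrate what such a Set/Reset helper has to do, here is the pattern in plain Java, detached from any Kettle API (the element names, levels and fields are invented for this example): remember values while you are inside an XML block and emit/reset them when the next block starts.

import java.util.ArrayList;
import java.util.List;

public class SetResetSketch {

    public static void main(String[] args) {
        // Flattened rows roughly as the XML Input Stream (StAX) step produces them:
        // element level, element name, element/attribute value.
        String[][] rows = {
            {"2", "Product", null},   // START of the first Product block
            {"3", "Name", "Beer"},
            {"3", "Price", "3.50"},
            {"2", "Product", null},   // START of the next Product block: emit and reset
            {"3", "Name", "Wine"},
            {"3", "Price", "7.90"}
        };

        String name = null, price = null;              // the "set" variables
        List<String> denormalized = new ArrayList<String>();

        for (String[] row : rows) {
            if ("2".equals(row[0]) && "Product".equals(row[1])) {
                if (name != null) {
                    denormalized.add(name + ";" + price);   // emit the previous block
                }
                name = null;                                // "reset"
                price = null;
            } else if ("Name".equals(row[1])) {
                name = row[2];                              // "set"
            } else if ("Price".equals(row[1])) {
                price = row[2];
            }
        }
        if (name != null) {
            denormalized.add(name + ";" + price);           // emit the last block
        }
        for (String line : denormalized) {
            System.out.println(line);                       // Beer;3.50 and Wine;7.90
        }
    }
}

Inside a User Defined Java Class step the same logic would live in processRow(), with the remembered values kept as class fields between rows.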

Choose this step, whenever you have limitations with other steps or when you are in need of parsing XML with the following conditions:

  • Very fast and independent of memory, regardless of file size (GBs and more are possible due to the streaming approach)
  • Very flexible reading different parts of the XML file in different ways (and avoid parsing the file many times)

Here is an example of parsing the following XML file with 2 main sample data blocks (Analyzer Lists & Products):

XML Input Stream (StAX) Test 2 - XML

A preview on the step may look like this (depending on the selected fields):

XML Input Stream (StAX) Test 2 - Preview

You see you really get almost the original streaming information with Elements and Attributes from the XML file together with helpful other fields like the element level.


The transformation looks like this:

XML Input Stream (StAX) Test 2 - Transformation

The end result for the Analyzer List block:

XML Input Stream (StAX) Test 2 - Analyzer List

The end result for the Products block (split, for example, into two separate data streams for the end system):

XML Input Stream (StAX) Test 2 - Both Products And Attributes

And there are a lot more options in the step to help solve your needs:

XML Input Stream (StAX) Test 2 - Step Options

More details about this step can be found in the XML Input Stream (StAX) documentation.

And for more details on the features for the upcoming PDI 4.2, please have a look at Matt’s blog.


Security Considerations and Encryption with Kettle

Kettle is used more and more in enterprises where the standard obfuscation of credentials is not sufficient. There are requirements to use strong encryption methods and even to store internal data encrypted (covered in PDI-6168 and PDI-6170). These use cases inspired me to create some simple transformations to test and play around with encryption.

The transformations and some test data are attached to the Kettle Exchange page Security Considerations and Encryption with Kettle.

Let’s start with creating a key by the cryptographyCreateSecretKey transformation:

cryptographyCreateSecretKey

The generateKey step uses the User Defined Java Class step and implements sample code for AES; the Advanced Encryption Standard is a symmetric-key encryption standard, see also http://java.sun.com/developer/technicalArticles/Security/AES/AES_v1.html. The key serialization to a file is a little trick to obfuscate the key. Other methods can be included instead of the clear text file output.
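For reference, the core of what these transformations do can be written in a few lines of plain Java with the standard javax.crypto API. This is a simplified sketch (modern Java, default AES mode) and not the exact code inside the User Defined Java Class step; for production use you would prefer an authenticated mode such as AES/GCM:

import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.SecretKeySpec;

public class AesSketch {
    public static void main(String[] args) throws Exception {
        // 1) Generate a 128-bit AES key; the raw bytes are what the key file contains.
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(128);
        SecretKey key = keyGen.generateKey();
        byte[] rawKey = key.getEncoded();

        // 2) Encrypt a value with that key (what cryptographyEncrypt does per row).
        Cipher cipher = Cipher.getInstance("AES");   // defaults to AES/ECB/PKCS5Padding
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(rawKey, "AES"));
        byte[] encrypted = cipher.doFinal("my secret credential".getBytes("UTF-8"));
        System.out.println(Base64.getEncoder().encodeToString(encrypted));

        // 3) Decrypt it again (cryptographyDecrypt); this only works with the identical key.
        cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(rawKey, "AES"));
        System.out.println(new String(cipher.doFinal(encrypted), "UTF-8"));
    }
}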

Now that we have the key file, we can encrypt our secret data:

cryptographySampleUnencryptedInput

With the transformation cryptographyEncrypt:

cryptographyEncrypt

We keep it simple and assume the key is available in each row (accomplished by the Join Key).

The encrypted result looks like this:

cryptographySampleEncrypted

Let’s decrypt it with the transformation cryptographyDecrypt:

cryptographyDecrypt

The result is correct but only when the key file is the same and the encrypted data was not modified. You can test it yourself and see what error messages come up or the resulting files look like when the key file or data was modified.

Instead of storing the decrypted data in a file, there are a lot of other options, e.g.:

  • use the decrypted data as credentials in subsequent steps or transformations
  • put the decrypted data into variables visible in a limited scope (e.g. parent job) and use them as credentials for databases, repository etc. (see PDI-6168)
  • and many more options

We may consider:

  • Symmetric-key algorithm vs. asymmetric key algorithms (public-key cryptography)
  • Diffie-Hellman key exchange is a specific method of exchanging keys.
  • Ensure integrity e.g. by hash-codes
  • Key file handling could be optimized in different ways.
  • Please keep in mind that unencrypted data is in RAM (see PDI-6170 for a circumvention to prevent heap dumps)
  • Besides the binary or indexed storage types, an encrypted storage type may be possible in Kettle core.

In the end: Don’t lose your key!

Update since Kettle 4.2: There are two steps in the experimental section: Secret key generator, Symmetric Cryptography that cover this use case.


iPhone Tracking: How to read the consolidated.db with Kettle

I will not add to the buzz about iPhone tracking; everything that needs to be discussed is already out there. That the iPhone stores locations in consolidated.db was known a long time ago, but now we have a proof of concept by Pete Warden and Alasdair Allan, and this inspired me to dig into it a bit more.

After an amazingly short time of less than one hour, I was able to read my own iPhone’s consolidated.db with Kettle. [It takes more time to write about this... and the other really hard part was locating the file in my file system.] Once we can read this file with Kettle, we are open to look up any additional data, filter by any criteria, store it in another database, use the Pentaho BI Suite to create maps, and do anything else a human being can imagine. Just have a look at the Pentaho Sandbox or, more generally, at the Pentaho site.

Here are my findings and steps to reproduce:

1) Locate the consolidated.db in your filesystem.

This is a little bit tricky, since the filename is mapped to an ID. I found it very easy by just searching for the string “CellLocation” in the iTunes backup folder, described over here for Windows:

findstr /M CellLocation "C:\Users\[Username]\AppData\Roaming\Apple Computer\MobileSync\Backup\[latest backup folder]\*.*"

I copied the file 4096c9ec676f2847dc283405900e284a7c815836 (this was my sample filename; yours will most likely be different) to another folder as consolidated.db.

A more robust and flexible solution is to read in and analyze the Manifest.mbdb / Manifest.mbdx as described over here in a Python script. It shouldn’t be too hard to port this to Java or JavaScript for a flexible Kettle solution.

2) Read in the consolidated.db with Kettle

Create a transformation with a SQLite connection type. The database name is the location of the file, as shown in this sample:

SQLite connection type for consolidated.db

Let’s create a simple transformation (attached): Read in the Cell data, convert the Mac Absolute Time to a Java time and just group on a daily basis for simplification:

Transformation for consolidated.db
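For reference, the Mac Absolute Time conversion in the middle of this transformation is just an epoch shift: Apple counts seconds from January 1, 2001 (UTC), while Java counts milliseconds from January 1, 1970 (UTC). In plain Java the conversion looks like this (the same arithmetic can be done in a Calculator or User Defined Java Class step; the sample timestamp is made up):

import java.util.Date;

public class MacTimeSketch {
    // Seconds between the Unix epoch (1970-01-01) and the Mac/Core Data epoch (2001-01-01), both UTC.
    private static final long MAC_EPOCH_OFFSET_SECONDS = 978307200L;

    public static Date macAbsoluteTimeToDate(double macTimestamp) {
        long unixMillis = (long) ((macTimestamp + MAC_EPOCH_OFFSET_SECONDS) * 1000.0);
        return new Date(unixMillis);
    }

    public static void main(String[] args) {
        // Example: a timestamp as it might be stored in consolidated.db.
        System.out.println(macAbsoluteTimeToDate(325677123.0));
    }
}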

Let’s do a preview and browse the data a bit. Well, I’m located in Mainz, known to be at exactly latitude 50, so any other data is of interest, and here comes a sample:

Sample Result for consolidated.db

Where have I been? My forensic research on my own calendar found that I had been at the Pentaho EMEA Partner Summit & Community Event in Lisbon… or was it only my iPhone?


How the SQL CASE WHEN construct can help you in pivoting data?

Within Kettle you have the Normalizer and Denormalizer steps to help with pivoting (a simplified use case for pivoting is changing the row/column axis, see also transpose). If you would like to have a cross table, you can use a Kettle transformation to accomplish this. (These days, Pentaho Reporting’s cross tab functionality is on the road map but not yet implemented. One solution would be to use a Kettle transformation as a data source.)

Another solution for a cross tab is to use the SQL CASE WHEN construct. Here is an example:

The result of the following sample query select status, year_id, sum(totalprice) as totalprice from orderfact group by status, year_id looks like this:

Result of sample query1

When we want to put the years (2003, 2004, 2005) into separate columns, we can use the SQL CASE WHEN construct:

CASE WHEN Boolean_expression THEN result_expression [ ...n ] [ ELSE else_result_expression ]

to accomplish this, e.g.:

select status,
sum(case when year_id=2003 then totalprice else 0 end) as TOTAL_2003,
sum(case when year_id =2004 then totalprice else 0 end) as TOTAL_2004,
sum(case when year_id =2005 then totalprice else 0 end) as TOTAL_2005

from orderfact group by status

Et voilà, you have data that can be used as a cross tab:

Result from the CASE WHEN

Another use case for dealing with planning data

In another use case, we store actual and planning data (e.g. for turnover, quantity, weights etc.). To make it a bit more challenging, planning data is often stored as a budget and multiple forecasts or different plan versions (scenarios).

One solution for this problem is to store these data in separate columns, e.g. have some columns for actuals and other columns for planning data. Having multiple measures (see above) requires multiple columns for each type of measure, e.g. turnover_actuals, quantity_actuals etc., and the same again for all planning scenarios. Thus you would end up with a large number of columns, and the biggest issue is: what happens when you need a new planning scenario? You would need to add more columns, so your data model is not flexible enough and needs to be changed.

Another solution is to store the definition of an actual or planning type as an extra TYPE_ID in our table. This means each row is marked whether it is an actual or any kind of planning type.

But, when you sum up the whole table, the result would be wrong since it contains a mix of actual and planning data.

This can be solved also with the CASE WHEN construct even within a Mondrian schema definition:

<Measure name="Sales Actuals" aggregator="sum" formatString="#,###.00">
  <MeasureExpression>
    <SQL dialect="generic">
      (case when type_id=0 then TOTALPRICE else 0 end)
    </SQL>
  </MeasureExpression>
</Measure>

<Measure name="Sales Budget" aggregator="sum" formatString="#,###.00">
  <MeasureExpression>
    <SQL dialect="generic">
      (case when type_id=1 then TOTALPRICE else 0 end)
    </SQL>
  </MeasureExpression>
</Measure>

[for more details see the Mondrian documentation about Measures]

With this solution you are flexible with your data model and the result will be correct.

Another, more complex use case is to combine this with a reporting solution for project costing or contribution accounting. When you have a large number of different costs and store them in separate columns, you will reach an unmanageable number of columns and your data model is not flexible enough. It is better to store these different types of costs in rows; then you are much more flexible.

With all this flexibility in the data model, you still need to consider the performance aspects of your database, e.g. whether column-based databases behave differently with the CASE WHEN construct (I have not tested this); different approaches should be tested. Any comments and findings are much appreciated.


Connecting with SAP BI Systems

There are a lot of different options for connecting to a SAP system from Kettle. It was back in 2009 when I tested the functionality to access a SAP BI system via web services (XML/A), and here are my (historic) findings:

Analyzing the result of the MDX query with Kettle is complicated (you need a separate transformation for every number of columns), but it is possible. These days, other Pentaho front-end tools can process this directly via OLAP4J. I tested this with the Web Services step (using Basic Authentication), and for specific use cases this would be an option. Further development brought the OLAP Input step, which simplifies things a lot and was tested successfully.

Here is some further information that might help in setting this up:

For the XML/A interface the following webservice should be switched on in transaction SICF:
/default_host/sap/bw/xml/soap/xmla

SAP XML/A

XMLA has two methods
Discover - This method is used to query metadata and master data (this corresponds to the BAPI MDDataProviderBW).
Execute - You can execute MDX commands with this method to receive the corresponding result set (this corresponds to the BAPI MDDataSetBW).

A description of the service can be reached at: <Protocol>://<Server>:<Port>/sap/bw/xml/soap/xmla?wsdl

For XMLA with SAP BI you have to create MDX commands generated from the metadata you receive from the system.

Below is an example from transaction MDXTEST in SAP BI, which you can use to test an MDX query against a SAP BI InfoProvider or query:
SELECT
{[Measures].[0HDCNT_LAST],
[Measures].[ZKFHRFTE]} ON COLUMNS,
[0EMPLOYEE].MEMBERS ON ROWS
FROM [$ZMCHRPA1]

Some syntax templates are available in the MDX test editor.

Another available service gets data directly from a query or query view. The service has one method: GetQueryViewData.
Go to SICF and activate this web service: /sap/bc/srt/rfc/sap/QUERY_VIEW_DATA

The output will be in the form of datasets: AXIS_INFO (metadata description of the characteristics, attributes, and structures on the axes), AXIS_DATA (characteristic values, attribute values, etc.), CELL_DATA (value cells; the cells are arranged by row first, then by column), and TXT_SYMBOLS (text elements for the query).

Further information:

Unfortunately, all screenshots and data samples I have are from a real system, but I hope some of the information given here is useful nevertheless for setting up the SAP side.

Thanks to Paul Stoellberger, who continued to work on this. He provided me with some screen shots for Pentaho Analysis Tool (PAT), Pentaho Reporting and Kettle accessing a SAP BI system.

Update: SAP recommends using Enterprise Services, aka web services, to access the functionality remotely. Further information can be found over here:
http://esworkplace.sap.com/


Operational Patterns and the Watchdog Concept for Kettle

The Watchdog Concept for Kettle was presented at the Pentaho Community Event in Cascais, Portugal in September 2010. It came to my mind when I created the operational patterns for the Pentaho Data Integration for Database Developers training course and combined them with solutions from electronics for these types of problems (e.g. detecting software or hardware errors reliably) and with immune-aware development concepts.

What’s the idea behind this?

When you want to check whether your Kettle jobs and transformations execute correctly, you can use operational patterns such as:

  • Health Checks for the JVM
  • Health checks for clustered environments and failover of a master server
  • Analyze log entries
  • Use restartable solutions
  • and others that are described in more detail in the Pentaho PDI training class

When you need a different approach for more safety and to measure additional values, you can use the concept of a watchdog and combine it with the normal patterns. A use case: you are building a very critical system (imagine a rocket flying to space) and engage different software developers, different measuring units and different computers to test that everything is correct. When one of the units fails, you can signal an event and e.g. abort the launch. Well, the Kettle world is usually not that critical, but Kettle is used more and more in systems that need very high reliability, 24×7 operation and continuously guaranteed high throughput.

With a Kettle solution you may have many components to check, e.g. the Data Integration Server, Carte and clustered environments, Kitchen/Pan jobs and transformations, deadlock detection, and also external components like the JVM (e.g. memory, used CPUs), the server (e.g. test with a network ping) and the databases (up and running, capacity etc.).

You also need signal-to-noise detection: define what is normal (noise) and notify on exceptions (e.g. set thresholds, absolute or relative, possibly against an average), define what is unusual and notify on those events, and define events (e.g. actions, notifications & alerts). With all these checks you have one primary constraint: the measurements must have a minimal footprint and impact on the system.

Why a Watchdog can help here?

One solution for checking if everything is on track is to use the concept of a Watchdog with tasks and events:

 PDI Watchdog

A software crash might go undetected by conventional watchdog strategies when you have multitasking (e.g. many PDI jobs and cluster nodes). The success lies in weaving the watchdog into the fabric of all of the system’s tasks, which is much easier than it sounds:

  1. Build a watchdog task
  2. Create a data structure (database table) that has one entry per task
  3. When a task starts it increments its entry in the structure. Tasks that only start once and stay active forever can increment the appropriate value each time through their main loops, e.g. every 10,000 rows
  4. As the job or transformation runs the number of counts for each task advances.
  5. Infrequently but at regular intervals the watchdog runs.
  6. The watchdog scans the structure, checking that the count stored for each task is reasonable. One that runs often should have a high count; another which executes infrequently will produce a smaller value.
  7. If the counts are unreasonable, halt and let the watchdog timeout and fire an event. If everything is OK, set all of the counts to zero and exit. 

This is a derived concept from Jack Ganssle (2004): Great Watchdogs (especially the section WDTs for Multitasking).
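A minimal sketch of steps 2 to 7 in plain Java/JDBC; the table and column names are invented for illustration, and in the real implementation shown below the same logic is built entirely with regular Kettle steps:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class WatchdogSketch {

    /** Called by each task (job/transformation), e.g. every 10,000 rows: bump its heartbeat counter. */
    public static void heartbeat(Connection con, String taskName) throws Exception {
        try (PreparedStatement ps = con.prepareStatement(
                "UPDATE watchdog_heartbeat SET counter = counter + 1 WHERE task_name = ?")) {
            ps.setString(1, taskName);
            ps.executeUpdate();
        }
    }

    /** Called by the watchdog infrequently but at regular intervals: check counts, fire events, reset. */
    public static void watchdogRun(Connection con) throws Exception {
        try (PreparedStatement ps = con.prepareStatement(
                "SELECT task_name, counter, min_expected FROM watchdog_heartbeat");
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                if (rs.getLong("counter") < rs.getLong("min_expected")) {
                    // Unreasonable count: fire an event (mail, abort job, restart a node, ...).
                    System.err.println("Watchdog alert for task " + rs.getString("task_name"));
                }
            }
        }
        // Everything checked: reset all counters for the next watchdog interval.
        try (PreparedStatement reset = con.prepareStatement(
                "UPDATE watchdog_heartbeat SET counter = 0")) {
            reset.executeUpdate();
        }
    }

    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost/watchdog", "user", "secret")) {
            heartbeat(con, "load_customers");   // normally done inside the running task itself
            watchdogRun(con);                   // normally runs on its own schedule
        }
    }
}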

An example implementation with Kettle

Most of the Kettle health checks can be accomplished with this watchdog concept. An example implementation with Kettle is task-oriented, not server-oriented: it checks whether the task (a transformation or job) is running as expected, independently of the environment it runs in (e.g. clustered or not).

Since Kettle has a wide variety of transformation steps and job entries, I often tend not to program this in a classical programming language but to solve it solely with Kettle.

Here is an example of the event handling, which can be extended very flexibly by adding new events:

Watchdog Event Handling

And here is an example of the above-described logic of the watchdog step. Sooner or later this could be simplified by building some Kettle steps that encapsulate this logic, but for now it just works:

Watchdog check logic


Another blog about Kettle aka Pentaho Data Integration (PDI)

This is the starting point for the blog Fun Stuff about the Open Source ETL Tool Kettle aka Pentaho Data Integration (PDI).

The short history: five years ago, in early December 2005, Matt Casters released the initial open source version of Kettle. I already knew Kettle before that, developed a plug-in to get data out of SAP in 2005, did a lot of things around this great tool, and joined Pentaho in mid-2007. You can read a bit more about the story in the book Pentaho Kettle Solutions by Matt Casters, Roland Bouman and Jos van Dongen.

Pentaho

More about Pentaho:

Pentaho Links
Pentaho Books


JB

More about me:

Homepage
XING Profile
Twitter
