Data Profiling and Data Quality (Human Inference) Integration with Kettle

Data Profiling with DataCleaner (Human Inference) and Kettle

It was already possible to profile your data easily with Kettle: open the Database Explorer, choose a table, and select Data Profile from the right-click context menu. The result was basic information about the data, such as Min, Max and Count for strings, plus some additional information for numeric data, but these were only basic metrics. We now have a much better solution:

Human Inference (DataCleaner) and Pentaho (Kettle) worked together to integrate their tools, and the result is a seamless integration of DataCleaner into Kettle. An introductory sample and an FAQ can be found at Kettle Data Profiling with DataCleaner.

You can right click on any step within your transformation and profile your data. It is also possible to clean or harmonize your data and check the result directly within your transformation.

DC context menu

Here are some screenshots:

  • Number Analyzer

DC Number Analyzer

  • String Analyzer

DC String Analyzer

  • Pattern Analyzer (e.g. you can see how many values in your data consist of a single capitalized word, and so on):

DC Pattern Analyzer
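To make the idea behind pattern analysis more concrete, here is a minimal Python sketch. It is not DataCleaner's implementation, and the function names (`to_pattern`, `pattern_counts`) and the sample values are made up for illustration. It assumes a simple convention: uppercase letters become `A`, lowercase letters become `a`, digits become `9`, and everything else is kept as-is. Counting the resulting patterns shows how many values share the same shape.

```python
from collections import Counter


def to_pattern(value: str) -> str:
    """Replace each character with a symbol describing its class.

    Assumed convention (illustrative only): 'A' = uppercase letter,
    'a' = lowercase letter, '9' = digit, anything else is kept unchanged.
    """
    symbols = []
    for ch in value:
        if ch.isupper():
            symbols.append("A")
        elif ch.islower():
            symbols.append("a")
        elif ch.isdigit():
            symbols.append("9")
        else:
            symbols.append(ch)
    return "".join(symbols)


def pattern_counts(values):
    """Count how often each pattern occurs, skipping missing values."""
    return Counter(to_pattern(v) for v in values if v is not None)


if __name__ == "__main__":
    # Hypothetical sample column
    sample = ["Anna", "Bill", "mary", "Jean Luc", "R2D2", None]
    for pattern, count in pattern_counts(sample).most_common():
        print(f"{pattern!r}: {count}")
```

A real pattern analyzer would typically also group patterns of different lengths together; this sketch keeps one symbol per character to stay short.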

There are many more analyzers to choose from: Matching & Deduplication, Boolean, Character Set Distribution, Data Gap, Date/Time, Reference Data Matcher, Value Distribution and Weekday Distribution analyzers.

DC Analyzers

For a complete reference, look at http://datacleaner.eobjects.org, and for a quick introduction and a sample, check out http://wiki.pentaho.com/display/EAI/Kettle+Data+Profiling+with+DataCleaner

Data Quality with EasyDQ (Human Inference) and Kettle

In addition to the Data Profiling capabilities, a couple of Data Quality steps have been introduced into Kettle as plug-ins. They cover:

  • Name Validation, Standardization and Cleansing
  • Address Validation, Standardization and Cleansing
  • E-Mail and Telephone Validation, Standardization and Cleansing
  • Duplicate Detection and Merge Duplicates
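As a rough illustration of what validation, standardization and duplicate detection mean in practice, here is a small Python sketch. It has nothing to do with how the EasyDQ steps work internally; the regular expression, the function names and the sample records are all assumptions chosen for the example, and real services use far more sophisticated reference data and matching.

```python
import re
from collections import defaultdict

# Deliberately simple plausibility check (an assumption for this sketch):
# something@something.something, with no spaces and a single '@'.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def standardize_email(raw: str) -> str:
    """Standardization: trim surrounding whitespace and lowercase."""
    return raw.strip().lower()


def is_plausible_email(email: str) -> bool:
    """Validation: check the overall shape only, not deliverability."""
    return EMAIL_RE.match(email) is not None


def match_key(record: dict) -> tuple:
    """Build a normalized key so trivially different records compare equal."""
    name = " ".join(record["name"].lower().split())
    return (name, standardize_email(record["email"]))


def find_duplicates(records):
    """Duplicate detection: group records that share the same match key."""
    groups = defaultdict(list)
    for record in records:
        groups[match_key(record)].append(record)
    return [group for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    # Hypothetical sample records
    records = [
        {"name": "Jane Doe", "email": "jane.doe@example.com"},
        {"name": "jane  doe", "email": " Jane.Doe@Example.com "},
        {"name": "John Smith", "email": "not-an-email"},
    ]
    for record in records:
        email = standardize_email(record["email"])
        print(record["name"], "->", email, is_plausible_email(email))
    print("duplicate groups:", len(find_duplicates(records)))
```

Exact matching on a normalized key only catches trivial duplicates; fuzzy matching on names and addresses is where dedicated data quality tools earn their keep.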

Here is a validation example:

DQ Sample 1


More details, download instructions, a sample video and examples can be found over here:
