A description of the PDI Data Lineage feature, including setup instructions, can be found in the Pentaho 6.1 Documentation and in Pedro Alves' blog post Seeing PDI lineage information. This post is a step-by-step example showing how to use PDI Data Lineage with yEd.
Set Up PDI Data Lineage
Modify …\system\karaf\etc\pentaho.metaverse.cfg (on the client and, when needed, on the DI Server)
- Set lineage.execution.runtime = on
- Set lineage.execution.generation.strategy=latest
- Default folder for lineage GraphML files: lineage.execution.output.folder=./pentaho-lineage-output
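To double-check the edited file, a small script can read the settings back and compare them with the values listed above. This is only a minimal sketch: it assumes the file uses plain key=value lines with # comments, the helper names parse_properties and lineage_problems are made up for this post, and SAMPLE is a hypothetical excerpt rather than the real file.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


# Values taken from the setup list above.
EXPECTED = {
    "lineage.execution.runtime": "on",
    "lineage.execution.generation.strategy": "latest",
}


def lineage_problems(text):
    """Return a list of problems; an empty list means the settings match."""
    settings = parse_properties(text)
    problems = []
    for key, wanted in EXPECTED.items():
        actual = settings.get(key)
        if actual != wanted:
            problems.append(f"{key} is {actual!r}, expected {wanted!r}")
    if "lineage.execution.output.folder" not in settings:
        problems.append("lineage.execution.output.folder is not set")
    return problems


# Hypothetical excerpt of pentaho.metaverse.cfg, for illustration only.
SAMPLE = """\
# lineage settings
lineage.execution.runtime = on
lineage.execution.generation.strategy=latest
lineage.execution.output.folder=./pentaho-lineage-output
"""

print(lineage_problems(SAMPLE))
```

To check a real installation, read the contents of your own pentaho.metaverse.cfg and pass the text to lineage_problems.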
How to use yEd?
Access Your Lineage Output
Use the following steps to access your lineage output from PDI:
- After you run your jobs and transformations, the generated lineage files can be found in the configured output folder (pentaho-lineage-output by default).
- Open a lineage file in yEd and see how it looks.
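Before opening a file in yEd, it can help to confirm that it actually contains a graph. The sketch below counts nodes and edges with Python's standard XML parser. It assumes the lineage file uses the standard GraphML namespace, and SAMPLE is a tiny hand-written stand-in, not real PDI output.

```python
import xml.etree.ElementTree as ET

# Standard GraphML namespace in ElementTree's {uri}tag notation.
GRAPHML_NS = "{http://graphml.graphdrawing.org/xmlns}"


def summarize_graphml(xml_text):
    """Return (node_count, edge_count) for a GraphML document."""
    root = ET.fromstring(xml_text)
    nodes = root.findall(f".//{GRAPHML_NS}node")
    edges = root.findall(f".//{GRAPHML_NS}edge")
    return len(nodes), len(edges)


# Hand-written stand-in for a lineage file (hypothetical ids).
SAMPLE = """\
<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph id="G" edgedefault="directed">
    <node id="n1"/>
    <node id="n2"/>
    <edge id="e1" source="n1" target="n2"/>
  </graph>
</graphml>
"""

print(summarize_graphml(SAMPLE))
```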
Refine Your Output
The default view is not very useful, so follow these steps to improve it:
- Download the default_yed_configuration.cnfx file for PDI. The file download location might change in a future version.
- Within the yEd menu, select Edit > Properties Mapper…
- Click on the Imports additional configurations icon and import the .cnfx file
- Select each of the following configurations and press Apply:
  - Pentaho Metaverse Nodes (Node)
  - Pentaho Metaverse Edges (Edge)
  - PDI Nodes (Node)
- Finally, close this window with Ok
- Within the yEd menu, select Layout > Hierarchical and press Ok in the window (keep the default settings).
Now it looks better, but it still shows far too much information.
Optimize the View
Let’s optimize the view. We will look mostly at the Neighborhood window to search for information.
- Drag and drop the Neighborhood window to the right-hand side.
- Resize the Neighborhood window on the right-hand side.
Note: This window does not show any information when there is too much to display, so make the window bigger if it stays empty.
- Search for something in the Structure View. In our example we have a target field called totalprice. We want to find out where this field is coming from.
- Enter totalprice in the search view. You will see there are multiple totalprice nodes. In the following example, we see it is a database column (written into the fact_sales table).
- Select another totalprice node. In the following example, we see it is also the name of a step (a Calculator step, for example), shown with the PDI hops to and from the neighboring steps.
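The same search can be sketched outside yEd by scanning the GraphML file for nodes with a given name. This is a rough illustration under several assumptions: the file uses the standard GraphML namespace, and each node carries data entries whose keys resolve to name and type. The type values in SAMPLE are invented for the example and may not match what PDI writes.

```python
import xml.etree.ElementTree as ET

NS = {"g": "http://graphml.graphdrawing.org/xmlns"}


def node_properties(xml_text):
    """Map node id -> {attribute name: value}, resolving <key> ids via attr.name."""
    root = ET.fromstring(xml_text)
    key_names = {
        k.get("id"): k.get("attr.name", k.get("id"))
        for k in root.findall("g:key", NS)
    }
    props = {}
    for node in root.iterfind(".//g:node", NS):
        data = {}
        for d in node.findall("g:data", NS):
            attr = key_names.get(d.get("key"), d.get("key"))
            data[attr] = (d.text or "").strip()
        props[node.get("id")] = data
    return props


def find_by_name(xml_text, name):
    """Return sorted (node id, type) pairs for nodes whose 'name' matches."""
    return sorted(
        (node_id, p.get("type", "?"))
        for node_id, p in node_properties(xml_text).items()
        if p.get("name") == name
    )


# Hand-written stand-in; the "name"/"type" keys and type values are assumptions.
SAMPLE = """\
<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="name" for="node" attr.name="name" attr.type="string"/>
  <key id="type" for="node" attr.name="type" attr.type="string"/>
  <graph id="G" edgedefault="directed">
    <node id="n1"><data key="name">totalprice</data><data key="type">Database Column</data></node>
    <node id="n2"><data key="name">totalprice</data><data key="type">Transformation Step</data></node>
    <node id="n3"><data key="name">quantityordered</data><data key="type">Stream Field</data></node>
  </graph>
</graphml>
"""

for node_id, node_type in find_by_name(SAMPLE, "totalprice"):
    print(node_id, node_type)
```

As in yEd, one name can match several nodes of different kinds, which is why the sketch returns the type alongside each id.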
Other Visual Option
Another view gives us the information we need: totalprice is also a stream field, derived from two other stream fields, quantityordered and priceeach.
- Double click on a derived field (quantityordered, for example). In this example, we can trace the data lineage back to the orderdetails Table Input step.
- Double click on the query attribute to look at the SQL in the Properties View and see the details.
- Look at the database connections and see that the step orderdetails uses the pentaho_olap database connection.
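Tracing a field back to its origin, as done by double-clicking above, amounts to walking the graph's edges backwards. The sketch below does a breadth-first walk over a GraphML file. It assumes edges point from the source of the data to whatever is derived from it, which should be checked against a real lineage file. The node ids in SAMPLE reuse the names from this example purely for readability.

```python
import xml.etree.ElementTree as ET
from collections import deque

NS = {"g": "http://graphml.graphdrawing.org/xmlns"}


def upstream(xml_text, start_id):
    """Return ids of all nodes that can reach start_id, in breadth-first order."""
    root = ET.fromstring(xml_text)
    # Index edges by target so we can walk them in reverse.
    incoming = {}
    for edge in root.iterfind(".//g:edge", NS):
        incoming.setdefault(edge.get("target"), []).append(edge.get("source"))

    seen = {start_id}
    order = []
    queue = deque([start_id])
    while queue:
        current = queue.popleft()
        for source in incoming.get(current, []):
            if source not in seen:
                seen.add(source)
                order.append(source)
                queue.append(source)
    return order


# Hand-written stand-in; ids and edge direction are assumptions for illustration.
SAMPLE = """\
<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph id="G" edgedefault="directed">
    <node id="orderdetails"/>
    <node id="quantityordered"/>
    <node id="priceeach"/>
    <node id="totalprice"/>
    <edge source="orderdetails" target="quantityordered"/>
    <edge source="orderdetails" target="priceeach"/>
    <edge source="quantityordered" target="totalprice"/>
    <edge source="priceeach" target="totalprice"/>
  </graph>
</graphml>
"""

print(upstream(SAMPLE, "totalprice"))
```

The seen set keeps the walk from visiting orderdetails twice, even though both derived fields lead back to it.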
You can see how much detailed information data lineage already collects and how you can visualize it all with yEd. Stay tuned for more updates and improvements in future releases!
Contribute to Improving the Lineage
Last but not least, if you are a developer and want to help improve the lineage, this documentation is a good starting point: Contribute Additional Step and Job Entry Analyzers to the Pentaho Metaverse