
Pentaho Data Integration (PDI) tutorial

The following tutorial is intended for users who are new to the Pentaho suite or who are evaluating Pentaho as
a data integration and business analysis solution. The tutorial consists of six basic steps, demonstrating how to
build a data integration transformation and a job using the features and tools provided by Pentaho Data
Integration (PDI).

The Data Integration perspective of PDI allows you to create two basic file types: transformations and jobs.
Transformations describe the data flows for ETL, such as reading from a source, transforming data, and loading
it into a target location. Jobs coordinate ETL activities, such as defining the flow and dependencies for the
order in which transformations should be run, or preparing for execution by checking conditions such as
"Is my source file available?" or "Does a table exist in my database?"

The aim of this tutorial is to walk you through the basic concepts and processes involved in building a
transformation with PDI in a typical business scenario. In this scenario, you are loading a flat file (CSV) of sales
data into a database to generate mailing lists. Several of the customer records are missing postal codes that
must be resolved before loading into the database. Along the way, you will use a combination of PDI steps to
cleanse, format, standardize, and categorize the sample data. The six basic steps are:

Step 1: Extract and load data

Step 2: Filter for missing codes

Step 3: Resolve missing data

Step 4: Clean the data

Step 5: Run the transformation

Step 6: Orchestrate with jobs

Parent Topic

• Setup
Child Topics

• Prerequisites
• Step 1: Extract and load data
• Step 2: Filter for missing codes
• Step 3: Resolve missing data
• Step 4: Clean the data
• Step 5: Run the transformation
• Step 6: Orchestrate with jobs

Prerequisites
To complete this tutorial, you need the following items:

• An installed version of the Pentaho 30-day trial.

Parent Topic

• Pentaho Data Integration (PDI) tutorial

Step 1: Extract and load data


In Step 1, you will retrieve data from a CSV flat file by using the Text File Input step to connect to the source
file, view the file schema, and retrieve the data contents.

Parent Topic

• Pentaho Data Integration (PDI) tutorial


Child Topics

• Create a new transformation


• View the content in the sample file
• Edit and save the transformation
• Load data into a relational database

Create a new transformation


Follow these steps to create a new transformation.
Procedure

1. Select File > New > Transformation in the upper-left corner of the PDI window.

2. Under the Design tab, expand the Input node, then select and drag a Text File Input step onto the
canvas.

3. Double-click the Text File input step. In the Text file input window, you can set the properties of the
step.

4. In the Step Name field, type Read Sales Data.


The Text file input step is now renamed to Read Sales Data.

5. Click Browse to locate the sales_data.csv source file in the ...\design-tools\data-
integration\samples\transformations\files folder. The Browse button appears in the upper-
right side of the window near the File or Directory field.

6. Change File type to *.csv. Select sales_data.csv, then click OK.


The path to the source file appears in the File or directory field.
7. Click Add.
The path to the file appears under Selected Files.

Parent Topic

• Step 1: Extract and load data

View the content in the sample file


Follow these steps to look at the contents of the sample file.
Procedure

1. Click the Content tab, then set the Format field to Unix.

2. Click the File tab again and click Show file content in the lower section of the window.

3. The Number of lines (0=all lines) window appears. Click OK to accept the default.

4. The Content of first file window displays the file. Examine the file to see how the input file is delimited,
which enclosure character is used, and whether a header row is present.
In the sample, the input file is comma delimited, the enclosure character is a quotation mark ("), and a
single header row contains the field names (see the sketch after this procedure).
5. Click the Close button to close the window.
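If it helps to relate these settings to something outside of PDI, the following is a minimal Python sketch (not part of the tutorial) that reads a comma-delimited file with a quotation-mark enclosure and a single header row, mirroring what the preview just showed. The file path and the fields printed are assumptions based on the sample data described in this tutorial.

    import csv

    # Assumed location of the sample file; the tutorial elides the install prefix.
    path = r"design-tools\data-integration\samples\transformations\files\sales_data.csv"

    with open(path, newline="", encoding="utf-8") as f:
        # delimiter and quotechar mirror the separator (,) and enclosure (") seen in the preview
        reader = csv.DictReader(f, delimiter=",", quotechar='"')
        for i, row in enumerate(reader):
            print(row.get("COUNTRY"), row.get("POSTALCODE"))
            if i == 4:  # look at only the first few records
                break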

Parent Topic

• Step 1: Extract and load data

Edit and save the transformation


Follow these steps to provide information about the data's content.
Procedure

1. Click the Content tab. Use the fields under the Content tab to define how your data is formatted.

2. Verify that the Separator is set to comma (,) and that the Enclosure is set to quotation mark ("). Select
Header and enter 1 in the Number of header lines field.

3. Click the Fields tab and click Get Fields to retrieve the input fields from your source file. When the
Number of lines to sample window appears, enter 0 in the field, then click OK.

4. If the Scan Result window displays, click Close to close the window.

5. To verify that the data is read correctly, click the Content tab, then click Preview Rows.

6. In the Enter the number of rows you would like to preview window, click OK to accept the default.
The Examine preview data window appears.
7. Review the data. Do you notice any missing, incomplete, or inconsistent values in the data?

◦ STATE & POSTALCODE both contain <null>


◦ COUNTRY contains both USA and United States.

8. Click OK to save the information that you entered in the step.

9. Enter a name for the transformation and provide additional properties using the Transformation
Properties window. There are multiple ways to open the Transformation Properties window.
◦ Right-click any empty space on the canvas and select Properties.
◦ Double-click any empty space on the canvas.
◦ Press the CTRL+T keyboard combination.
10. In the Transformation Name field, enter Getting Started Transformation.
Below the name, the Filename field is empty because the transformation has not been saved yet.
11. Click OK to close the Transformation Properties window.

12. To save the transformation, select File > Save.


When saving your transformation for the first time, you are prompted for a file location and name of
your choice. Transformations are usually saved with the .ktr file extension.

Parent Topic

• Step 1: Extract and load data

Load data into a relational database


Now you are ready to take the records read from the sales data file and load them into a database table. (In
Step 2, you will add a Filter Rows step so that only records where the POSTALCODE is not null, the true
condition, reach this table.) You will use the Table Output step and a hop from the Text File Input step to direct
the data stream into a database table. This section of the tutorial uses a pre-existing database established
during the Pentaho installation, which is started along with the server.

Parent Topic

• Step 1: Extract and load data


Child Topics

• Create the Table Output step


• Create a connection to the database
• Define the Data Definition Language (DDL)

Create the Table Output step


Follow these instructions to create the Table Output step.
Procedure

1. Under the Design tab, expand the contents of the Output node.

2. Click and drag a Table Output step into your transformation.

3. Create a hop between the Read Sales Data and Table Output steps. To create the hop:

1. Press the SHIFT key.

2. Click the Read Sales Data (Text File Input) step and drag the mouse to draw a line to the Table
Output step.

3. Release the SHIFT key.

4. Click the Table Output step.

4. Double-click the Table Output step to open its Edit properties dialog box.

5. Rename your Table Output step to Write to Database.

Parent Topic

• Load data into a relational database

Create a connection to the database


Follow these steps to create a connection to the database.
Procedure

1. Click New next to the Connection field. You must create a connection to the database.

The Database Connection window appears.
2. Provide the settings for connecting to the database.

Connection Name: Sample Data
Connection Type: Hypersonic
Host Name: localhost
Database Name: sampledata
Port Number: 9001
User Name: pentaho_admin
Password: password (If the password does not work, check with your system administrator.)

The sketch after this procedure shows how these settings combine into a standard HSQLDB connection URL.

3. Click Test to verify your entries are correct. A success message appears. Click OK.
Note: If you get an error when testing your connection, ensure that you have provided the correct
settings as described in the table and that the sample database is running. See Start and
Stop the Pentaho Server for information about how to start the Pentaho Server.
4. Click OK to exit the Database Connection window.
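For orientation only, the settings above correspond to the conventional HSQLDB (Hypersonic) client connection URL. The short Python sketch below simply assembles that URL as a way of reading the table; treat the URL form as an assumption about how the Hypersonic connection type resolves these fields, since PDI builds the real connection for you from the dialog.

    # Illustrative only: combine the connection settings into the usual
    # HSQLDB server-mode JDBC URL, jdbc:hsqldb:hsql://host:port/database.
    settings = {
        "host": "localhost",
        "port": 9001,
        "database": "sampledata",
    }

    jdbc_url = "jdbc:hsqldb:hsql://{host}:{port}/{database}".format(**settings)
    print(jdbc_url)  # jdbc:hsqldb:hsql://localhost:9001/sampledata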

Parent Topic

• Load data into a relational database

Define the Data Definition Language (DDL)


DDLs are the SQL commands that define the different structures in a database such as CREATE TABLE.
Fortunately, Pentaho can help you create the necessary DDL.
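To make the idea concrete, here is a hedged Python sketch of what generating DDL from a stream layout means: it builds a CREATE TABLE statement from a list of field names and types. The field list and SQL types shown are assumptions for illustration only; the actual statement comes from the metadata of the Read Sales Data stream and appears in the Simple SQL editor in the steps below.

    # Illustrative only: derive a CREATE TABLE statement from field metadata,
    # the way the SQL button derives DDL from the fields entering Table output.
    # The fields below are an assumption, not the full sales_data.csv layout.
    fields = [
        ("CITY", "VARCHAR(50)"),
        ("STATE", "VARCHAR(50)"),
        ("POSTALCODE", "VARCHAR(9)"),
        ("COUNTRY", "VARCHAR(50)"),
        ("SALES", "DOUBLE"),
    ]

    columns = ",\n  ".join(f"{name} {sql_type}" for name, sql_type in fields)
    ddl = f"CREATE TABLE SALES_DATA (\n  {columns}\n);"
    print(ddl)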
Procedure

1. Enter SALES_DATA in the Target Table text field.

2. This table does not exist in the target database, so Pentaho can generate the DDL to create the table
and execute it. In this scenario, the DDL is based on the stream of data coming from the previous step,
which is the Read Sales Data step.

3. In the Table Output window, select the Truncate Table property.

4. Click the SQL button at the bottom of the Table output dialog box to generate the DDL for creating your
target table.

5. The Simple SQL editor window appears with the SQL statements needed to create the table.

6. Click Execute to execute the SQL statement.
The Results of the SQL statements window appears.
7. Examine the results, then click OK to close the Results of the SQL statements window.

8. Click Close in the Simple SQL editor window.

9. Click OK to close the Table output window.

10. Save your transformation.

Parent Topic

• Load data into a relational database

Step 2: Filter for missing codes


After completing Step 1: Extract and load data, you are ready to add a transformation component to your data
pipeline. The source file contains several records that are missing postal codes. This section of the tutorial adds
a Filter Rows step that separates the records with missing postal codes from the complete records (where the
POSTALCODE is not null, the true condition), ensuring that only complete records are loaded into the database
table.
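Conceptually, the Filter Rows step you are about to add splits the stream on a single condition. The Python sketch below shows the same split on an in-memory list of records; it is not PDI code, and the sample records are made up for illustration.

    # Split records on the condition the Filter Rows step will use:
    # POSTALCODE IS NOT NULL goes to the true stream, everything else to the false stream.
    records = [
        {"CITY": "NYC", "STATE": "NY", "POSTALCODE": "10022"},      # made-up example rows
        {"CITY": "San Rafael", "STATE": "CA", "POSTALCODE": None},
    ]

    true_stream = [r for r in records if r["POSTALCODE"] is not None]  # loaded into the database
    false_stream = [r for r in records if r["POSTALCODE"] is None]     # resolved in Step 3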

Parent Topic

• Pentaho Data Integration (PDI) tutorial


Child Topics

• Preview the rows read by the input step


• Separate the records with missing postal codes

Preview the rows read by the input step


Follow these steps to preview the rows read by the input step.
Procedure

1. Right-click on the Read Sales Data step and select Preview.

2. Specify the number of rows to preview. Optionally, you can configure break-points which pause
execution based on a defined condition, such as a field having a specific value or exceeding a threshold.

3. Click the Quick Launch button. Preview the data and notice that several of the input rows are missing
values for the POSTALCODE field.

4. Click Stop on the preview window to end the preview.

Parent Topic

• Step 2: Filter for missing codes

Separate the records with missing postal codes


Follow these instructions to use the Filter Rows transformation step to separate out those records missing
postal codes. These records are resolved later in the tutorial.
Procedure

1. Add a Filter Rows step to your transformation. Under the Design tab, select Flow > Filter Rows.

2. Insert your Filter Rows step between your Read Sales Data step and your Write to Database step.

1. Right-click and delete the hop between the Read Sales Data step and Write to Database steps.

2. Create a hop between the Read Sales Data step and the Filter Rows step: hold down the SHIFT key,
click the Read Sales Data step, and drag to draw a line to the Filter Rows step.

3. Create a hop between the Filter Rows step and Write to Database step.

4. In the dialog box that appears, select Result is TRUE.

3. Double-click the Filter Rows step. The Filter Rows window appears.

4. In the Step Name field, enter Filter Missing Zips.

5. Click in The condition field to open the Fields window. The available fields appear.

6. In the Fields window select POSTALCODE and click OK.

7. Click the comparison operator field, which is set to = by default. The Functions window appears.

8. Select IS NOT NULL from the list of functions, and then click OK to close the Functions window.

9. Click OK to exit the Filter Rows window.


Note: You will return to this step later to configure the Send true data to step and Send false data to
step settings after adding their target steps to your transformation.
10. Save your transformation.

Parent Topic

• Step 2: Filter for missing codes

Step 3: Resolve missing data


After completing Step 2: Filter for missing codes, you are ready to resolve the missing postal codes. In this
section, you will learn how to use a second text file, containing a list of cities, states, and postal codes, to look
up the postal codes for the records in which the field is missing (the false branch of your Filter Rows step).

First, you will use a Text file input step to read from the source file. Then, you will use a Stream lookup step to
bring the resolved postal codes into the stream. Lastly, you will use the Select values step to rename fields on
the stream, remove unnecessary fields, and more.
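Before building it in PDI, it may help to see the lookup idea in plain Python. The sketch below indexes a lookup file by (CITY, STATE) and attaches the matching postal code to a record as ZIP_RESOLVED; it is illustrative only, the record is made up, and the file path assumes the sample folder used throughout this tutorial.

    import csv

    # Index the lookup file by (CITY, STATE), the same keys configured in the Stream lookup step.
    lookup = {}
    with open("Zipssortedbycitystate.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter=",", quotechar='"'):
            lookup[(row["CITY"], row["STATE"])] = row["POSTALCODE"]

    def resolve(record):
        # Attach the looked-up value as a new field, as the step adds ZIP_RESOLVED to the stream.
        record["ZIP_RESOLVED"] = lookup.get((record["CITY"], record["STATE"]))
        return record

    print(resolve({"CITY": "San Rafael", "STATE": "CA", "POSTALCODE": None}))  # made-up record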

Parent Topic

• Pentaho Data Integration (PDI) tutorial


Child Topics

• Retrieve data from your lookup file


• View the contents of the sample file
• Edit and save the transformation

• Resolve missing zip code information
• Preview your transformation
• Apply formatting to your transformation

Retrieve data from your lookup file


Follow these steps to retrieve data from your lookup file.
Procedure

1. Add a new Text File Input step to your transformation.


This step retrieves the records from your lookup file. Do not add a hop yet.

2. Open the Text File Input step window, then enter Read Postal Codes in the Step name property.

3. Click Browse to navigate to the Zipssortedbycitystate.csv source file located in the directory
...\design-tools\data-integration\samples\transformations\files.

4. Change File type to *.csv, select Zipssortedbycitystate.csv, and click OK.


The path to the source file appears in the File or directory field.
5. Click Add.
The path to the file appears under Selected files.

Parent Topic

• Step 3: Resolve missing data

View the contents of the sample file


Follow these steps to view the contents of the sample file.
Procedure

1. Click the Content tab, then set the Format field to Unix.

2. Click the File tab again and click Show file content near the bottom of the window.

3. The Number of lines (0=all lines) window appears. Click the OK button to accept the default.

4. The Content of first file window displays the file. Examine the file to see how the input file is delimited,
which enclosure character is used, and whether a header row is present. In the example, the input file is
comma (,) delimited and the enclosure character is the quotation mark ("). A single header row contains
field names.

5. Click Close to close the window.

Parent Topic

• Step 3: Resolve missing data

Edit and save the transformation


Follow these steps to edit and save your transformation.
Procedure

1. In the Content tab, change the Separator character to a comma (,) and confirm that the Enclosure
setting is a quotation mark ("). Verify that the Header option is selected.

2. Under the Fields tab, click Get Fields to retrieve the data from your CSV file.

3. The Number of lines to sample window appears. Enter 0 in the field, then click OK.

4. If the Scan Result window displays, click Close to close it.

5. Click Preview rows to verify that your entries are correct.

1. When prompted to enter the preview size, click OK.

2. Review the information in the window, then click Close.

6. Click OK to exit the Text File input window.

7. Save the transformation.

Parent Topic

• Step 3: Resolve missing data

Resolve missing zip code information


Follow these steps to resolve the missing postal code information.
Procedure

1. Add a Stream Lookup step to your transformation by clicking the Design tab, expanding the Lookup
folder, then selecting Stream Lookup.

2. Draw a hop from the Filter Missing Zips to the Stream lookup step. In the dialog box that appears,
select Result is FALSE.

3. Create a hop from the Read Postal Codes step to the Stream lookup step.

4. Double-click the Stream lookup step to open the Stream Value Lookup window.

5. Rename Stream Lookup to Lookup Missing Zips.

6. From the Lookup step drop-down box, select Read Postal Codes as the lookup step. Perform the
following:

1. In the key(s) to look up the value(s) table, define the CITY and STATE fields.

2. In row #1, open the drop-down menu in the Field column and select CITY.

3. Click in the LookupField column and select CITY.

4. In row #2, open the drop-down menu in the Field column and select STATE.

5. Click in the LookupField column and select STATE.

7. Click Get Lookup Fields to pull the three fields from the Read Postal Codes step.

8. POSTALCODE is the only field you want to retrieve. To delete the CITY and STATE lines, right-click in the
line and select Delete Selected Lines.

9. In the New Name field, change the name POSTALCODE to ZIP_RESOLVED and verify that Type is set to
String.

10. Select Use sorted list (i.s.o. hashtable).

11. Click OK to close the Stream Value Lookup edit properties dialog box.

12. Save your transformation.

Parent Topic

• Step 3: Resolve missing data

Preview your transformation


Follow these steps to preview your transformation.
Procedure

1. To preview the data, right-click the Lookup Missing Zips step. From the menu that appears,
select Preview.

2. In the Transformation debug dialog window, click Quick Launch to preview the data flowing through
this step.

3. In the Examine preview data window that appears, note that the new field, ZIP_RESOLVED, has been
added to the stream containing your resolved postal codes.

4. Click Close to close the window.

5. If the Select the preview step window appears, click Close.

Results
The execution results near the bottom of the PDI window display updated metrics in the Step Metrics tab.

Parent Topic

• Step 3: Resolve missing data

Apply formatting to your transformation


Follow these steps to clean up the field layout on your lookup stream so that it matches the format and layout
of the other stream going to the Write to Database step.
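In plain-Python terms (not PDI code), the work in this procedure amounts to the small sketch below: drop the old null POSTALCODE, rename ZIP_RESOLVED to POSTALCODE, and force it into a 9-character string so that both streams arriving at Write to Database share the same layout. The sample record is made up.

    def prepare_field_layout(record):
        # Mirror the Select values configuration described below: remove the old (null)
        # POSTALCODE, rename ZIP_RESOLVED in its place, and keep it a 9-character string.
        record.pop("POSTALCODE", None)
        record["POSTALCODE"] = str(record.pop("ZIP_RESOLVED"))[:9]
        return record

    print(prepare_field_layout({"CITY": "San Rafael", "STATE": "CA",
                                "POSTALCODE": None, "ZIP_RESOLVED": "94901"}))  # made-up record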
Procedure

1. Add a Select Values step to your transformation by expanding the Transform folder and clicking Select
Values.

2. Create a hop from the Lookup Missing Zips to the Select Values step.

3. Double-click the Select Values step to open its properties dialog box.

4. Rename the Select Values step to Prepare Field Layout.

5. Click Get fields to select to retrieve all fields and begin modifying the stream layout.

6. In the Fields list, find the # column and click the number for the ZIP_RESOLVED field.
Use CTRL+UP (on macOS, COMMAND+UP) to move ZIP_RESOLVED just below the POSTALCODE field, which
is the one that still contains null values.

7. Select the old POSTALCODE field in the list (line 20), right-click the line, and select Delete Selected Lines.

8. The original POSTALCODE field was formatted as a 9-character string. You must modify your new field
to match that format. Click the Meta-Data tab.

9. In the first row of the Fields to alter the meta-data for table, click in the Fieldname column
and select ZIP_RESOLVED. Perform the following steps:

1. Enter POSTALCODE in the Rename to column.

2. Select String in the Type column and enter 9 in the Length column.

3. Click OK to exit the edit properties dialog box.

10. Draw a hop from the Prepare Field Layout (Select values) step to the Write to Database (Table output)
step.

11. When prompted, select the Main output of the step option.

12. Save your transformation.

Parent Topic

• Step 3: Resolve missing data

Step 4: Clean the data


After completing Step 3: Resolve missing data, you can further cleanse and categorize the data into buckets
before loading it into a relational database. In this section, you will cleanse the COUNTRY field data by mapping
United States to USA using the Value mapper step. Cleaning the data ensures there is only one version of USA.

In addition, you will learn how to use buckets for categorizing the SALES data into small, medium, and large
categories using the Number range step. You will learn how to insert these cleaning and categorizing functions
into your transformation just prior to the Write to Database step on the canvas.
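As a plain-Python illustration of the Value mapper idea (not PDI code), the cleansing reduces to a small source-to-target mapping applied to one field:

    # The same mapping you will enter in the Value mapper step:
    # the COUNTRY value "United States" becomes "USA"; other values pass through unchanged.
    country_map = {"United States": "USA"}

    def clean_country(record):
        record["COUNTRY"] = country_map.get(record["COUNTRY"], record["COUNTRY"])
        return record

    print(clean_country({"COUNTRY": "United States"}))  # {'COUNTRY': 'USA'}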

Parent Topic

• Pentaho Data Integration (PDI) tutorial


Child Topics

• Add a Value mapper step to the transformation


• Set the properties in the Value Mapper step
• Apply ranges
• Execute the SQL statement

Add a Value mapper step to the transformation


Follow these steps to add the Value mapper step to the transformation.
Procedure

1. Delete both hops connected to the Write to Database step. For each hop, right-click and select Delete.

2. Create some extra space on the canvas. Drag the Write to Database step toward the right side of your
canvas.

3. Add the Value mapper step to your transformation by expanding the Transform folder and choosing
Value mapper.

4. Create a hop between the Filter Missing Zips and Value mapper steps. In the dialog box that appears,
select Result is TRUE.

5. Create a hop between the Prepare Field Layout and Value mapper steps. When prompted, select the
Main output of the step option.

Parent Topic

• Step 4: Clean the data

Set the properties in the Value Mapper step


Follow these steps to set the properties in the Value mapper step.
Procedure

1. Double-click the Value mapper step to open its properties dialog box.

2. Click in the Fieldname to use field and select COUNTRY.

3. In the Field Values table, define the United States and USA field values.

1. In row #1, click the field in the Source value column and enter United States

2. Then, click the field in the Target value column and enter USA

4. Click OK.

5. Save your transformation.

Parent Topic

• Step 4: Clean the data

Apply ranges
Follow these steps to apply ranges to your transformation.
Procedure

1. Add a Number range step to your transformation by expanding the Transform folder and selecting
Number range.

2. Create a hop between the Value mapper and Number range steps.

3. Create a hop between the Number range and Write to Database (which was built using Table output)
steps. When prompted, select the Main output of the step option.

4. Double-click the Number range step to open its properties dialog box.

5. Click in the Input field and select SALES from the list.

6. In the Output field enter DEALSIZE.

7. In the Ranges (min <= x < max) table, define the Lower Bound and Upper Bound field ranges along with
the bucket Value. The same bucketing logic is shown in the sketch after this procedure.

1. In row #1, click the field in the Upper Bound column and enter 3000.0. Then, click the field in
the Value column and enter Small.

2. In row #2, click the field in the Lower Bound column and enter 3000.0. Then, click the field in
the Upper Bound column and enter 7000.0. Click the field in the Value column and enter
Medium.

3. In row #3, click the field in the Lower Bound column and enter 7000.0. Then, click the field in
the Value column and enter Large.

8. Click OK.
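The bucketing you just configured can be read as a simple range check, shown here as an illustrative Python sketch rather than PDI code. The boundaries are the ones entered in the table above and follow the step's min <= x < max convention.

    def deal_size(sales):
        # Ranges follow the min <= x < max convention from the Number range step.
        if sales < 3000.0:
            return "Small"
        elif sales < 7000.0:
            return "Medium"
        else:
            return "Large"

    print(deal_size(2500.0), deal_size(3000.0), deal_size(7000.0))  # Small Medium Large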

Parent Topic

• Step 4: Clean the data

Execute the SQL statement


Your database table does not yet contain the field DEALSIZE. Perform these steps to execute the SQL
statement.
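For orientation only, the generated SQL in this case is typically an ALTER TABLE statement that adds the new column to the existing SALES_DATA table. The Python sketch below shows that general shape; the column type and length are assumptions, so rely on whatever the Simple SQL editor actually displays.

    # Rough shape of the DDL the SQL button generates here: SALES_DATA already exists,
    # so the change is adding the DEALSIZE column. The type/length are assumptions.
    new_field = ("DEALSIZE", "VARCHAR(20)")
    ddl = f"ALTER TABLE SALES_DATA ADD {new_field[0]} {new_field[1]};"
    print(ddl)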
Procedure

1. Double-click the Write to Database step to open its properties dialog box.

2. Click the SQL button at the bottom of the window to generate the new DDL for editing your original
target table. Note that the Write to Database step was built using Table output.

1. The Simple SQL editor window appears with the SQL statements needed to alter the table.

2. Click Execute to execute the SQL statement.

3. The Results of the SQL statements window appears. Examine the results, then click OK to close
the window.

4. Click Close in the Simple SQL editor window to close it.

5. Click OK to close the Write to Database window.

3. Save your transformation.

Parent Topic

• Step 4: Clean the data

Step 5: Run the transformation


Pentaho Data Integration provides a number of deployment options. Running a Transformation explains these
and other options available for execution. In this section of the tutorial, you run the transformation using the
Local run option.
Procedure

1. In the PDI client window, select Action > Run.


The Run Options window appears.
2. Keep the default Pentaho local option for this exercise.
It uses the native Pentaho engine and runs the transformation on your local machine. See Run
configurations if you are interested in setting up configurations that use another engine, such as Spark,
to run a transformation.
3. Click Run.
The transformation executes.

Results
After the transformation runs, the Execution Results panel opens below the canvas.

Parent Topic

• Pentaho Data Integration (PDI) tutorial


Child Topics

• Viewing the execution results

Viewing the execution results


Use the tabs in the Execution Results section of the window to view how the transformation executed, pinpoint
errors, and monitor performance.

• Step Metrics

Provides statistics for each step in your transformation, including how many records were read, written,
or caused an error, as well as processing speed (rows per second) and more. This tab also indicates
whether an error occurred in a transformation step.

This tutorial introduces no intentional transformation errors, so the transformation should run
correctly. If a mistake does occur, the steps that caused the transformation to fail are highlighted in red.
For example, a misconfigured Lookup Missing Zips step would appear highlighted on this tab.

• Logging

Displays the logging details for the most recent execution of the transformation. It also allows you to
drill deeper to determine where errors occur. Error lines are highlighted in red. For example, the Lookup
Missing Zips step would cause an error if it attempted to look up values on a field called POSTALCODE2
that did not exist in the lookup stream.

• Execution History

Provides access to the step metrics and log information from previous executions of the
transformation. This feature works only if you have configured your transformation to log to a database
through the Logging tab of the Transformation Settings dialog box. For more information on
configuring logging or viewing the execution history, see Analyze your transformation results.

• Performance Graph

Analyzes the performance of steps based on a variety of metrics including how many records were
read, written, or caused an error, as well as processing speed (rows per second) and more. Like the
execution history, this feature requires you to configure your transformation to log to a database
through the Logging tab found in the Transformation Settings dialog box.

• Metrics tab

Displays a Gantt chart after the transformation or job runs. This information includes how long it takes
to connect to a database, how much time is spent executing a SQL query, or how long it takes to load a
transformation.

• Preview Data

Displays a preview of the data.

Parent Topic

• Step 5: Run the transformation

Step 6: Orchestrate with jobs


Jobs are used to coordinate ETL activities such as:

• Defining the flow and dependencies that control the linear order for the transformations to run.
• Preparing for execution by checking conditions such as, "Is my source file available?" or "Does a table
exist?"
• Performing bulk load database operations.
• Assisting file management, such as posting or retrieving files using FTP, copying files, and deleting files.
• Sending success or failure notifications through email.
For this part of the tutorial, imagine that an external system is responsible for placing your sales_data.csv
input in its source location every Saturday night at 9 p.m. You want to create a job that will verify that the file
has arrived and then run the transformation to load the records into the database. In a subsequent exercise,
you will schedule the job to run every Sunday morning at 9 a.m.
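Outside of PDI, the same orchestration pattern looks like the short Python sketch below: check that the expected input file exists, and only then run the load. It is purely an illustration of the Start, File Exists, and Transformation flow built in this step; the file path and the run_getting_started_transformation() placeholder are assumptions, not real PDI APIs.

    import os
    import sys

    # Placeholder for the work the Transformation job entry performs;
    # in PDI this is the Getting Started Transformation built earlier.
    def run_getting_started_transformation():
        print("loading sales records into SALES_DATA ...")

    # File Exists job entry: the install prefix is elided in the tutorial,
    # so treat this path as an assumption about where sales_data.csv lands.
    source_file = r"design-tools\data-integration\samples\transformations\files\sales_data.csv"

    if os.path.exists(source_file):           # "Is my source file available?"
        run_getting_started_transformation()  # run the transformation only if the file arrived
    else:
        sys.exit("sales_data.csv has not arrived yet")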

The following steps assume that you have built the Getting Started Transformation as described in Step 1:
Extract and load data of the tutorial.

Procedure

1. Go to File > New > Job.

2. Expand the General folder and drag a Start job entry onto the canvas.
The Start job entry defines where the execution will begin.
Note: Job entries run in sequential order, while the steps within a transformation run in parallel.
3. Expand the Conditions folder and add a File Exists job entry.

4. Draw a hop from the Start job entry to the File Exists job entry.

5. Double-click the File Exists job entry to open its properties dialog box. Click Browse and set the filter
near the bottom of the window to All Files. Select the sales_data.csv from the following directory:
...\design-tools\data-integration\samples\transformations\files.

6. Click OK to exit the Open File window.

7. Click OK to exit the Check if a file exists window.

8. Expand the General folder and add a Transformation job entry.

9. Draw a hop between the File Exists and the Transformation job entries.

10. Double-click the Transformation job entry to open its properties dialog box.

11. Click Browse to open the Select repository object window. Browse to and select the Getting Started
transformation.

12. Click OK to close the Transformation window.

13. Save your job as Sample Job.

14. Click the Run icon in the toolbar. When the Run Options window appears, select the Local environment
type and click Run. The Execution Results panel opens, showing the job metrics and log information
for the job execution.

Parent Topic

• Pentaho Data Integration (PDI) tutorial
