Data Preparation Practice
Exploring
You will be working in actual Talend software, not a simulation. We hope you have fun and get lots of practice using the software!
However, if you work on tasks beyond the scope of the training, you could run out of time with the environment, or you could mess up
data or Jobs needed for subsequent exercises. We suggest finishing the course first, and if you have remaining time, explore as you
wish. Keep in mind that our technical support team can’t assist with your exploring beyond the course materials.
Sharing
This course is provided for your personal use under an agreement with Talend. You may not take screenshots or redistribute the content or software.
LESSON 2
Getting Started
This chapter discusses:
Concepts
Overview
Exploring the Environment
Creating Users and Groups in TAC
Connecting to Talend Data Preparation
Review
Concepts
User settings are role based. Data Preparation access permissions are granted according to three role profiles: administrator, dataset manager, and data preparator.
Use case
Talend software is a modular suite made up of several applications working together to provide a distributed development environment.
In this training session, you will use these modules:
Talend Data Preparation server
Talend Administration Center
Talend Studio
You can install each application on separate computers or install multiple applications on the same computer.
In this course, you use a complete Talend environment hosted on a single virtual machine (VM) that contains all the items you need.
This environment is similar to what you would find in a new Talend installation, with default parameters and configuration.
In this lesson, you will learn about Talend Data Preparation server configuration. You will connect to Talend Administration Center
(TAC) and create several users, as well as a user group. You will start the Data Preparation server and test your connection to the
Data Preparation web UI.
Objectives
After completing this lesson, you will be able to:
Create Data Preparation users in TAC
Create a user group in TAC
Start the Talend Data Preparation server
Connect to the Data Preparation web UI
Next step
You are ready to set up and explore the environment.
Exploring the Environment
Task outline
You will explore the training environment by viewing the configuration of the Talend Data Preparation server and verifying
that the services you need are running on the machine.
NOTE:
The password is already encrypted with the password you will create in TAC. If you need to update the password or use another user account, enter the credentials directly in the application.properties file, then restart the Talend Data Preparation Server 6.4.1 service to encrypt the new password.
MongoDB settings are configured in this configuration file as well. MongoDB is a prerequisite for the Data Preparation server, as Data Preparation uses it to store metadata for datasets and preparations. In your training environment, MongoDB is installed and running, and there is a MongoDB user with required credentials.
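For reference, here is a minimal sketch of what the relevant entries in application.properties might look like, assuming a default local installation. The key names shown are assumptions that may differ in your Talend version; check your own file rather than copying these verbatim.

# Illustrative MongoDB settings (assumed key names)
spring.data.mongodb.host=localhost
spring.data.mongodb.port=27017
spring.data.mongodb.database=dataprep
# A password entered in clear text here is re-encrypted the next time
# the Talend Data Preparation Server service restarts
spring.data.mongodb.password=mypassword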
Search for the MongoDB and Components Catalog services and verify they are running.
NOTE:
The list of Talend services may vary depending on the version and type of installation.
Next step
You are ready to connect to TAC and create users.
Task outline
In this section, you will connect to TAC to create and set up all the users you need for these exercises.
First you must create a system user that can:
Enable communication between the Data Preparation server and TAC
Execute Talend Studio Jobs
Update semantic types in Talend Dictionary Service
Then you will create two Data Preparation business users with different privileges: an administrator and an operator.
You will also create a user group to easily manage the permissions of the two Data Preparation users.
Connect to TAC
1. IN A WEB BROWSER, OPEN THE TAC URL
On the right tile of the Windows Start menu, click Talend Administration Center.
NOTE:
TAC user names must be in the format of an e-mail address. On some virtual keyboard configurations, the AltGr key is
not active. You can use a combination of the CTRL and Alt keys instead.
3. LOG IN
Click the Login button.
The WELCOME page appears.
2. ADD A USER
On the USERS tab, click the Add button, as shown in the screenshot.
b. Select the Data Preparation User check box.
c. For Data Preparation Role, select Data Preparator.
Create Data Preparation users in TAC
1. ADD AN ADMINISTRATOR DATA PREPARATION USER
Adam Brown will be a Data Preparation business user with administration rights.
As an administrator, he has permission to create live datasets from Talend Studio Jobs. You will use live datasets later.
a. Still on the Users tab, click the Add button.
b. Configure the Data Preparation administrator user as shown in the screenshot.
Login: [email protected]
First Name: Adam
Last Name: Brown
Password: talend
Type: No Project Access
c. Select the Data Preparation User check box.
d. For Data Preparation Role, select Administrator/Dataset Manager/Data Preparator.
e. Confirm that the Active check box is selected.
The user details appear as shown in the screenshot.
f. Click Save.
3. VERIFY THAT JOHN SMITH AND ADAM BROWN ARE ON THE LIST
For Label, enter DataPrep_US, and for Type, select Data Preparation.
c. Click Save.
The user group is created.
Next step
You are ready to start the Data Preparation server and test your connection.
Task outline
You can now start the Data Preparation server, and, as a business user, test your connection.
Wait until the status is Running.
b. Click the LOG IN button.
Next step
You have almost finished this section. Time for a quick review.
Review
You started this lesson by analyzing the Data Preparation server configuration file, and then you connected to TAC and created a system user. You created two Data Preparation users and grouped them in a Data Preparation user group. You noted that Data Preparation user roles and permissions are handled in TAC. You started the Data Preparation server and tested your connection using the Data Preparation web UI.
More information
Talend documentation:
About Talend Data Preparation
Talend Administration Center User Guide
Concepts
Overview
Creating a Data Preparation and Related Dataset
Adding a Join to a Data Preparation
Promoting the Preparation
Review
Overview
Use case
In this exercise, your role is that of a business user who has received an extract of US customer data from Salesforce.com.
You will create your preparation on the development environment before exporting it to the production environment.
Connected to the development environment, you will view the data and use several Data Preparation functions to cleanse and standardize the initial customer data file. Then you will create a version of the preparation to capture the state of the recipe. You will share the preparation and the dataset with another business user who wants to enrich the customer data with business regions. You will reuse the shared data preparation to join the customer data with the list of business regions corresponding to states. You will group the customers by business region and export the resulting customer data to a CSV file. Then you will export the preparation to reuse it on the production environment.
Here are the steps in this lesson:
Objectives
After completing this lesson, you will be able to:
Create a data preparation and dataset
Use Data Preparation to discover data
Use prebuilt Data Preparation functions to cleanse data
Create a version of the preparation
Share a data preparation and dataset
Add a lookup table to a data preparation
Export the results file
Export the preparation
Next step
You are ready to create a data preparation and related dataset.
Creating a Data Preparation and Related Dataset
Task outline
A data preparation applies a recipe to a dataset to produce an outcome. The original dataset is never modified.
The recipe corresponds to the set of functions that is applied to the initial dataset.
The dataset holds the data that can be used as the raw material for preparations. It is presented as a table to which you
can apply recipes without affecting the original data.
In this section, you will create a data preparation to cleanse a customer dataset.
You will find inconsistencies in the dataset and then create a recipe that lets you correct the errors you detected and standardize columns.
NOTE:
The Data Preparation server hosted by the training VM can be viewed as the development environment.
c. In the File Upload wizard, go to the C:/StudentFiles/DataPrep folder and select Customers.csv.
The separator in the file is a comma and the encoding is UTF-8, so you do not need to adjust the parameters.
To close the window, click the settings icon again.
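As a purely hypothetical illustration of what the parser expects, a comma-separated, UTF-8 file of this kind could begin like this (the real Customers.csv has more columns, and these values are invented):

ID,LAST_NAME,STATE,EMAIL
1,smith,TX,jane.smith@example.com
2,jones,Texas,bob.jones@example.com

Note the lowercase last name and the spelled-out state in the second row: these are exactly the kinds of inconsistencies you will correct later in this lesson.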
b. Hover over the us_state_code type and view the matching percentage rates that allow the system to establish the
column semantic type.
NOTE:
Talend Data Preparation suggests the correct data type for each column in your dataset. However, you can change these suggestions at any time, based on your experience.
b. To view the number of customers from a state, hover over the state (this works on the bar chart as well).
c. To view the number of valid and invalid rows, click the VALUE tab.
d. To view the most common format in this field, click the PATTERN tab.
Notice that the most common pattern for the state is a value of two uppercase letters.
c. You can apply additional filters by clicking a pattern on the statistics panel.
To redisplay all the records, click the Enable/Disable all filters button.
Create a recipe
1. ADD A FIRST STEP FOR THE RECIPE
Click the LAST_NAME column and wait for the list of functions on the right to refresh.
a. Use the search bar to find the Change to UPPER case function.
A discovery pop-up may appear. Click the NEXT button until the tour ends.
e. Do not worry about saving the recipe. As you can see at the top of the added step, all the steps are automatically
saved.
3. CORRECT INVALID VALUES
Some inconsistent state codes can be replaced.
b. To change all occurrences of Texas to TX, select the Replace the Cells that Match... function.
c. Fill in the boxes, check Overwrite entire cell and click SUBMIT.
A new step is added in the recipe.
The filter icon next to the step indicates that the replace function has been applied to only the filtered rows.
d. To redisplay all the records, remove the filter.
4. FORMAT THE DATE
To standardize the date format, select the DATE column.
a. Search for and select the Change Date Format... function.
Click SUBMIT.
Three columns are added to the dataset.
To see the complete column names, enlarge the column sizes.
6. RENAME A COLUMN
Rename the CAMPAIGN_ID_SPLIT_2 column to Quarter_Year.
c. Click SUBMIT.
d. Use the same process to rename the CAMPAIGN_ID_SPLIT_1 column to CAMPAIGN_NAME.
d. Click SUBMIT.
The values are updated with the replacement pattern.
8. REMOVE A COLUMN
Remove the CAMPAIGN_ID_SPLIT_3 column.
Click the down arrow on the column and select Delete Column.
9. MASK DATA
To see how masking differs based on the column type, apply data masking to two columns.
b. Click the function and notice that only the domain of each email address remains readable.
In the first part of the email addresses, characters have been replaced with Xes.
c. Click the CREDITCARDNUMBER column and again search for the Mask data function.
d. Click the function and notice that while the numbers have changed, the pattern of each value is intact.
b. The step is removed from the recipe and the dataset is updated.
11. UNDO YOUR LAST ACTION
Undo your last action to reapply data masking to the CREDITCARDNUMBER column.
In the top right corner of the web UI, click the Undo button.
The step is added to your recipe, and all the steps in the recipe are automatically saved in the data preparation.
TIP:
Keep your recipe clean.
A function applied to data can be canceled in the GUI by modifying its parameters or by applying another function.
b. The home page is displayed and your preparation appears on the Preparations list.
c. Confirm that the imported file has also been saved as a dataset.
Click Datasets.
b. In the Enter Folder Name text box, enter US.
c. Click OK.
The folder is created in the main HOME folder.
c. Click MOVE.
Confirm that the preparation is in the US folder.
Next step
Now you will learn how to share a dataset and preparation, define a new dataset, and add a lookup to your existing data preparation.
Adding a Join to a Data Preparation
Task outline
The first business user has created a data preparation for the Customers dataset in order to cleanse the file. He can create a version of the preparation to mark a milestone in the recipe development. The second business user wants to join this data preparation with a lookup dataset containing a list of business regions. Each region is composed of several states.
To do this, you must create the version, then share the dataset and data preparation with the second business user. Then you can add a lookup to the data preparation.
d. Click SUBMIT.
You will reuse this version later.
b. To leave the read-only mode, click SWITCH TO CURRENT STATE.
Share a dataset
1. DISPLAY THE DATASETS
Still logged in as Adam Brown, click Datasets.
2. SHARE A DATASET
Adam Brown wants to share his Customers dataset with all Data Preparation users in the US.
a. To enable additional options, hover over the Customers dataset.
Click the Share Dataset button.
Share a data preparation folder
1. DISPLAY DATA PREPARATION FOLDERS
Still logged in as Adam Brown, click Preparations.
b. On the All Users and Groups list, select DataPrep_US and the Operator System user.
c. Click the Add to List button.
The group and the user are added to the Current Collaborators list.
b. Click Datasets.
e. To grant data access to other collaborators, hover over the Businessregions_States dataset.
Click the Share Dataset button.
Share the dataset with DataPrep_US and Operator System then click the CONFIRM button.
Add a join to the data preparation
1. OPEN A SHARED DATA PREPARATION
Open the Customers preparation you created as Adam Brown in the US folder.
a. Open the US folder in PREPARATIONS.
3. CREATE A JOIN
Join the Customers preparation with the BusinessRegions_States dataset on the STATE column.
a. The new dataset is displayed below the original one.
In the lookup table, the first column corresponding to the STATE column is selected by default.
In the original dataset, select the STATE column.
Now you will learn why you have this issue, as well as how to correct it by adding a step and moving it ahead of the lookup
step in the preparation sequence.
a. Click the REGION column.
The chart shows many empty rows.
b. Use the data quality bar for the REGION column to display only rows with empty values.
This reveals that the associated states contain unwanted white spaces.
The white spaces are removed, but the number of empty cells in the REGION column remains the same.
d. Hover over the last step of the recipe until a handle symbol appears on its left.
e. To move the step ahead of the lookup step, use the handle or click the arrow above it.
Using a cleansed STATE column, the lookup is more efficient. The REGION column contains fewer empty cells.
The EXPORT wizard appears.
c. Click CONFIRM. The browser has been set up to automatically save the CSV file in the Windows default downloads
folder.
d. Close the data preparation by clicking the X in the upper-right corner of the window.
c. Confirm that the results file contains the clean, enriched data.
Next step
The preparation is ready on the development environment. Now you will learn how to promote the preparation across environments.
Promoting the Preparation
Task outline
To comply with IT best practices, you will promote your preparation into another environment.
In the development environment, you will export the preparation in a JSON file. You can use this file to import the preparation into another (test or production) environment.
WARNING:
To prevent errors during import, the datasets used by the preparation must exist in the other environment, with the same names and schemas.
NOTE:
Because only one instance of Data Preparation runs on the training VM, you will practice by importing the version into the same environment, but in another folder.
b. In the Enter Folder Name text box, enter PRODUCTION and click OK.
c. Select the JSON file and click Open.
Next step
You have almost finished this section. Time for a quick review.
More information
Talend documentation:
Talend Data Preparation Getting Started Guide
Regular expressions
LESSON 4
Working with Large Data Volumes
This chapter discusses:
Concepts
Overview
Creating a Dataset from a Database
Using selective sampling
Exporting preparations
Review
Overview
In the previous lesson, you created datasets in the Talend Data Preparation web UI by importing files.
Talend Data Preparation can also use a database as a source for creating datasets.
In this lab, you will create a dataset from a MySQL database stored on your virtual machine. Then you will apply your preparation to
this dataset.
This database contains a substantial number of rows, so you will use some sampling and export features that are available only for
large data volumes.
Here is a diagram of this lesson:
Objectives
After completing this lesson, you will be able to:
Create a dataset from a MySQL database
Apply a preparation to this dataset
Progressively apply filters to a large dataset to get the most accurate data sample for your preparation
Export a sample of cleansed data
Export the full, cleansed dataset
Next step
You are ready to create a dataset from a database.
Task outline
In this section, you will use the Talend Data Preparation web UI to quickly access and cleanse data stored in a database.
A MySQL database is available locally on your virtual machine. Notice that the features described in this exercise are not
restricted to local databases and can be used with different types of network architecture.
The local training database contains two tables:
customers, which contains 1,000 rows and is an exact replica of the CSV data file you used in the previous lesson
customersfull, which contains 11,000 rows; it has the same structure as the first table but needs more cleansing
For this exercise, you will use the customersfull table.
As a dataset manager, John Smith can see five sources.
An administrator such as Adam Brown sees an additional source for live datasets, which you will create in a future lesson.
b. Click From Database.
Talend Data Preparation uses a JDBC URL to connect to the database. To simplify the setup process, a URL template is provided (a filled-in example follows the list):
localhost is the server address; if necessary, replace it with the server IP address
3306 is the default port for MySQL; if necessary, replace it with another port
db must be replaced with the name of the database
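Substituting the training values into the template gives the JDBC URL you will use in the next step:

jdbc:mysql://localhost:3306/db        (template)
jdbc:mysql://localhost:3306/training  (value used in this course)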
b. To connect to the database, enter these credentials:
Dataset name: Customers Full
Database type: MYSQL
JDBC URL: jdbc:mysql://localhost:3306/training
Username: root
Password: root
If you do not get the success message, make corrections and retest the connection.
4. ENTER THE QUERY
To simply select all the columns of the customersfull table, in the Query text box, enter select * from customersfull
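Any valid SELECT statement can define the dataset. For example, assuming the table keeps the STATE column name from the source file, a query like this (illustrative only, not used in this exercise) would restrict the dataset to one state:

select * from customersfull where STATE = 'TX'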
b. Make sure your ADD A DATABASE DATASET screen looks like the one in the screenshot, and click the ADD
DATASET button.
The new dataset opens in Talend Data Preparation. By default, a sample of 10,000 rows is displayed.
5. CLOSE THE DATASET
Close the dataset to confirm that it was automatically saved.
a. To close the dataset, in the upper right corner of the window, click the X symbol.
b. The home page is displayed. Click the DATASETS tab and confirm that the new dataset is there.
c. To grant data access to other collaborators, hover over the Customers Full dataset.
Click the Share Dataset button.
Share the dataset with DataPrep_US and Operator System then click the CONFIRM button.
Applying a preparation
The new dataset has the same structure as the CSV file you used earlier. Therefore, you can apply the same preparation.
1. APPLY THE PREPARATION
Open the new dataset and apply the preparation you created earlier.
a. On the DATASETS tab, open the Customers Full dataset.
b. On the toolbar, click the preparation icon as shown in the screenshot.
2. UPDATE SEMANTIC TYPES
Some semantic types may not be recognized correctly, which may impact the efficiency of some steps in the recipe.
If needed, update the semantic types and move the update steps to the top of the recipe.
b. Hover over the last step in the recipe until a handle symbol appears on its left, then use it to move the step to the top
of the recipe.
Close the preparation and save it in the US directory.
a. To close the preparation, in the upper right corner of the window, click the X.
b. Select the US directory and click the SAVE IT button to save the preparation by its default name.
Next step
You will continue working on this large dataset, learning about the use of selective sampling features.
Task outline
By default, the Data Preparation web UI displays a data sample of a maximum of 10,000 rows. Some features have been
introduced for datasets that exceed this limitation.
For instance, filters are applied only to the data sample, but what if a data preparator wants to set up a filter on all rows?
Selective sampling allows the data preparator to specify the sample with which to interact.
In this lesson, you will set up a one-click filter to display only rows with empty values. You will use selective sampling to
select more rows that match the current filter, and you will correct all invalid data.
This shows that the dataset exceeds the sample limitation. Only 10,000 rows are displayed, but the entire dataset is
kept intact.
NOTE:
You can set another value by editing the dataset.records.limit parameter in the application.properties file. Keep in mind that a higher value might degrade application performance.
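The entry looks like this in application.properties, shown here with its default value; a service restart is presumably required for a change to take effect:

dataset.records.limit=10000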
b. On the list of preset filters, click Display rows with empty values.
The filter is applied to the sample. The grid displays fewer than 10,000 rows. A FETCH MORE button appears next to
the number of displayed rows.
The process stops when 10,000 rows are reached, or at the end of the dataset.
Using the FETCH MORE button with one of the preset filters allows you to display all the rows that potentially need
rework in the same sample. Then you can use data quality bars to profile data issues column by column.
TIP: Use of the FETCH MORE button is not restricted to invalid or empty rows. Use it to fetch all rows that match the current filter, whatever that filter is. Keep in mind that it never fetches more than 10,000 rows.
2. REPLACE EMPTY VALUES
Use the Fill Empty Cells with Text function to copy the content of the Name column to the empty cells of the last_name
column.
a. The last_name column should be selected; if not, click the column header.
b. Use the search bar to search for the Fill Empty Cells with Text function.
e. Click SUBMIT.
f. To display all the data again, remove the filter on the last_name column by clicking the X symbol.
The number of displayed rows does not change: the rows you modified still contain other data issues.
g. Close the data preparation by clicking the X symbol in the upper right corner of the window.
Next step
Now you will learn about export features.
Task outline
You already exported a 1,000-row preparation to a CSV file. In this lesson, you will explore additional features available
when exporting a preparation from a large dataset. You will start by exporting a sample of the data, then export all data in
a single CSV file.
b. To export only the sample, select Current sample. The Apply filters slider must be activated.
NOTE:
The Current sample option must be checked to export the sample and not the whole dataset. This option has an
impact only when there are more than 10,000 filtered rows (not the case here).
The Apply filters slider must be activated to export only rows with empty values.
c. Select the radio button for Local CSV file. Additional options appear.
In the Delimiter text box, select Comma.
In the Filename text box, enter Customer_Full_Preparation_Sample
The EXPORT window must be configured as in the screenshot.
d. Click CONFIRM.
3. CHECK THE EXPORTED DATA
The Current sample option must be checked to export the sample and not the whole dataset.
The Apply filters slider must be deactivated.
c. Select the Local CSV file option. Additional options appear.
For Delimiter, select Comma.
In the Filename text box, enter Customer_Full_Preparation_SampleNoFilter
Configure the EXPORT window as in the screenshot.
d. Click CONFIRM.
2. CHECK THE EXPORTED DATA
Navigate to the default downloads folder and open the file with Notepad++.
b. To set up the new filter, in the Add a filter text box, enter gmail.
b. To export only filtered rows, set up the export as in the screenshot.
The file is kept in memory and must be downloaded on demand. To do so, click the Export history icon.
d. The EXPORT HISTORY page opens. It lists all the full exports processed for a given preparation.
For now, only one export is listed.
To display the export details, click the arrow icon on the right.
c. To export all rows, set up the export as in the screenshot.
NOTE:
For a given export format, only the latest preparation export is available for download in the
EXPORT HISTORY page.
b. To log out, in the upper right corner, click John Smith and the Logout button.
Next step
You have almost finished this section. Time for a quick review.
Review
In this lesson, you learned how to manage large data volumes in Talend Data Preparation.
First you imported a dataset from a MySQL database and applied a preparation to it. Then you used selective sampling to create an
ad-hoc sample, filtering rows from the whole dataset. You exported the sample data and the whole dataset in CSV files, with or
without filters.
More information
Talend documentation:
Working with JDBC datasets
Working on large datasets
Concepts
Overview
Discovering Talend Dictionary Service
Creating a Dictionary Semantic Type
Creating a Regular Expression Semantic Type
Creating a Compound Semantic Type
Review
Concepts
To ease semantic type creation, two files are available on your VM: the first contains a list of values and the second a regular expression.
Use case
In earlier lessons, you worked with semantic types. Each column of the preparation can be associated with a semantic type. Talend
Data Preparation automatically recognizes some of them.
That was the case with the EMAIL, LAST_NAME, and STATE columns in your preparation. You saw that data operators can easily change semantic types by clicking the column header.
Applying a semantic type to a column helps identify cells that do not conform to the values or data patterns expected by the semantic type.
This list of available semantic types is provided by Talend Dictionary Service. The semantic types are stored, along with their formats
and values, in a MongoDB database.
The dictionary server communicates with Talend Data Preparation using Apache Kafka, an open source messaging system.
After content analysis, Talend Dictionary Service can assign the correct semantic type to each column in a preparation.
All the necessary modules are installed locally on your virtual training machine. The dictionary server is installed as a Windows service
along with Kafka. Both services are already running.
Notice that the features described in this lesson are not restricted to local modules and can be used with different types of network
architecture.
The semantic types stored in the MongoDB database can be updated through command lines.
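As a sketch of what such a command-line session might look like, assuming the Mongo shell is on the path (the database name below is an assumption that depends on your installation):

mongo
> use dataprep
> show collections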
Here is a diagram of the modules involved:
In Talend Dictionary Service, there are three categories of semantic types:
Regular expression types are based on data patterns
Dictionary types are based on a list of values
Compound types are created by grouping several existing semantic types
In this lesson, you will create and update semantic types in all three categories using the web UI.
Objectives
After completing this lesson, you will be able to:
List all semantic types available in Talend Dictionary Service
Create a dictionary semantic type and apply it to a column in your preparation
Add new values to this semantic type
Create a regular expression semantic type and apply it to a column in your preparation
Create a new dictionary and group it with another dictionary in a compound semantic type
Next step
You are ready to learn about Talend Dictionary Service.
Task outline
You will explore Talend Dictionary Service, which is installed locally in your training environment, by viewing the services
installed on the machine.
You will review the user rights needed to communicate with Talend Dictionary Service. Then you will use the system user
you created earlier to connect to Data Preparation and access the semantic types tab.
You can also see the other services used for the solution: Talend Data Preparation Server, Talend Kafka, and Talend MongoDB.
A Data Management type with Operation Manager role
A Data Preparation user with Data Preparator role
a. Click the Airport semantic type.
The type is designated as Dictionary and the Use for validation slider is deactivated. Therefore, this semantic type
is based on a list of values and used for discovery only.
The list of values is displayed at the bottom of the page.
The type is designated as Regular expression and the Use for validation slider is activated. Therefore, this
semantic type is based on a data pattern and used for discovery and validation.
The regular expression is displayed at the bottom of the page.
a. Scroll down the list and click the North American state code semantic type.
The type is designated as Compound type and the Use for validation slider is activated.
This compound semantic type groups two other dictionaries to create a list of Canadian and American codes.
Next step
You are ready to create your own semantic type.
Task outline
In this lesson, you will create a dictionary for the REGION column. The region labels are listed in a source file that you will
upload during the semantic type creation process. In a second step, you will manually add values.
b. In the CHART section, examine the available values.
c. To display the list of semantic types suggested by Talend Dictionary Service, in the column header, click the menu
icon.
c. A new GENERAL window opens.
d. To use this dictionary for data validation, keep the Use for validation slider activated.
To ease the validation process by ignoring punctuation, white spaces, case, and accents, change the Validation criterion field to Simplified text (most permissive). With this criterion, for example, midwest, MIDWEST, and Mid-West would all match the value Midwest.
g. To be able to immediately use this new semantic type in Data Preparation, click the SAVE AND PUBLISH button.
The Region dictionary is added to the list.
NOTE:
In the text file used to load values, non-alphabetical values must be enclosed in quotes.
It is possible to load synonyms by using multiple values on the same row. In this case, values must be separated
by commas.
When importing the file, a deduplication process is performed automatically.
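Putting those rules together, a hypothetical values file could look like this: one value per row, synonyms comma-separated on a single row, and a non-alphabetical value in quotes.

Northeast
Midwest
South,Southern
"Region 51"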
This is because some region labels are missing. A value that exists on the column but not in the semantic type is considered
invalid.
To filter the sample on invalid values, below the REGION column header, click the orange bar on the data quality bar and
click Select rows with invalid values for REGION.
The West region must have been omitted when creating the semantic type.
a. In the upper right corner of the list of values, click the Add item icon.
TIP:
As this dictionary validates data using the simplified text criterion, there is no need to type the first letter of the
region name in the correct case.
d. To be able to immediately use the updated semantic type in Data Preparation, click the SAVE AND PUBLISH button.
b. Remove the filter on rows with invalid values by clicking the X next to it.
Next step
You are ready to create a regular expression semantic type.
Creating a Regular Expression Semantic Type
Task outline
In this lesson you will create a new column in the preparation. Then you will create a regular expression semantic type to
validate the pattern followed by the values of the new column.
a. To rename the column, click the menu icon on the column header and select Rename Column.
b. In the New name field, enter CUSTOMER_CODE and click the SUBMIT button.
TIP:
The customer codes pattern is a valid two-letter American state code followed by a hyphen (-) and an integer.
Customer codes are invalid if data in the STATE or ID columns is invalid or missing.
The regular expression to use is:
^(A[KLRZ]|C[AOT]|DE|FL|GA|HI|I[ADLN]|K[SY]|LA|M[ADEINOST]|N[CDEHJMVY]|O[HKR]|PA|RI|S[CD]|T[NX]|UT|V[AT]|W[AIVY])-([0-9]{1,9})$
You can copy it from the CustomerCode.txt file in the C:\StudentFiles\DataPrep\source directory.
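Reading the expression piece by piece (a plain-language breakdown of the pattern above, not new syntax):

^( ... )         the alternation of valid state codes, anchored at the start;
                 for example, A[KLRZ] matches AK, AL, AR, and AZ
-                a literal hyphen
([0-9]{1,9})$    one to nine digits, anchored at the end of the value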
For Name, enter Customer code.
For Description, enter Customer code pattern.
Keep the Type field set to Regular expression.
To use this dictionary for data validation, keep the Use for validation slider activated.
d. Copy the regular expression from the C:\StudentFiles\DataPrep\source\CustomerCode.txt file and paste it in
the Validation pattern box.
e. To be able to immediately use this new semantic type in Data Preparation, click the SAVE AND PUBLISH button.
The new semantic type is available and automatically selected. If this is not the case, manually select the new semantic
type.
The data quality bar shows that about 10% of the values do not match the regular expression.
Next step
Now you will create a compound semantic type.
Task outline
In this lesson you will fill empty cells in the STATE column with a single value. Then you will create a dictionary with this
value and associate it with the US states codes dictionary into a compound semantic type.
This is a good way to add values to a dictionary while keeping the original untouched. You will not impact other preparations for which the original dictionary is used.
d. In the Value field, enter N/A.
c. A new GENERAL window opens.
d. To use this dictionary for data validation, keep the Use for validation slider activated.
To use the most restrictive validation criterion, keep the Validation criterion field set to the default value Exact
value (most restrictive).
TIP:
This dictionary validates data using the exact value criterion, so make sure that you enter the correct case and
punctuation.
d. To be able to immediately use the new dictionary in Data Preparation, click the SAVE AND PUBLISH button.
NOTE:
Compound semantic types use the validation criteria of their child types.
a. Click the ADD SEMANTIC TYPE button.
b. Select No state.
TIP:
You can type the first letters of the dictionary to display a short list of matching dictionary names.
c. To be able to immediately use the compound semantic type in Data Preparation, click the SAVE AND PUBLISH button.
3. CONFIRM THE RESULTS IN DATA PREPARATION
Confirm that the semantic type of the STATE column has been updated.
a. From the PREPARATIONS tab, open Customers Preparation.
The semantic type of the STATE column has been updated and the N/A value is no longer considered invalid.
Next step
You have almost finished this section. Time for a quick review.
More information
Talend documentation:
Enriching semantic types
LESSON 6
Using DI for Data Preparation
This chapter discusses:
Concepts
Overview
Publishing a Dataset to Data Preparation
Executing a Preparation in Talend Studio
Challenge
Solution
Review
Overview
Use case
You can use Talend Data Preparation along with other Talend products.
For instance, you can use Data Preparation components in Talend Studio to:
Extract data from the database and publish it to Data Preparation for business users.
Execute data preparations and write the results to output files without passing through the Talend Data Preparation web UI.
Access datasets created by business users in the web UI.
Objectives
After completing this lesson, you will be able to:
Use Data Integration to publish a dataset in Data Preparation
Execute a preparation in a Data Integration Job
Create a Job to read a dataset that a business user has uploaded to the Data Preparation server
Next step
You are ready to use Data Integration to publish a dataset to Data Preparation.
Publishing a Dataset to Data Preparation
Task outline
In this section, you will use the Talend Data Integration suite to empower business users to quickly access and cleanse
data stored in a database.
First you will create a new Talend Studio project. Then, from a predefined project, you will import a Job and its related
metadata. The purpose of this Job is to export the customer data from a database to a simple CSV file. To publish the
dataset to Data Preparation, instead of writing the data to a CSV file, you will duplicate this Job and modify it.
Here is a diagram of the process:
When publishing a dataset to the Data Preparation server, you can use either of these modes:
Batch mode: The dataset is stored on the Data Preparation server and updated every time the DI Job is
executed
Live mode: The dataset is not stored on the Data Preparation server side; it is updated on demand every time
the dataset is opened from Data Preparation
In this lesson, you will implement Batch mode. Live mode is covered in the next lesson.
2. CREATE A PROJECT
Create a project called DataPrep.
b. Click Create.
The project appears on the list of projects.
NOTE:
Although you do not access the online community in this course, Talend recommends creating an account
from your installation environment and becoming an active member of the online community, which provides
several valuable resources.
When the initialization is complete, Talend Studio may display the Welcome page.
b. Click the Start now! button.
If the Integration perspective is not showing, click the Integration button.
d. Click Open.
The list of available items is displayed.
The archive contains a prebuilt Job, database connection metadata, and file-delimited metadata.
f. Click Finish.
The items are added to the Repository.
2. VIEW THE IMPORTED ITEMS
a. In the Repository, expand the Metadata folder.
Confirm that you have Db Connection and File delimited metadata defined.
The Db Connection is configured to connect to the local MySQL database.
The File delimited metadata describes the structure of the Customers data.
c. Use Windows Explorer to confirm that the CustomersOut.csv file has been written to the C:/Temp folder.
d. Open the file using Notepad++ and view the contents of the Customers data extracted from the database.
c. Click OK.
The Job appears in the Repository.
2. DELETE A COMPONENT
Right-click the tFileOutputDelimited component and select Delete.
c. Right-click the Local_MySQL component, select Row > Main, then click the output component.
b. If there is a warning on the component, click the Sync columns button for the schema.
Click the Edit Schema button and verify that the output schema contains 12 columns.
To close the schema, click OK.
d. To save the updated Job, press CTRL+S.
It contains 1,000 lines.
Next step
Now you will learn how to execute a preparation in Talend Studio using Data Preparation components.
Task outline
In this section, you will use Talend Studio to create a Job that runs the preparation you created in the previous chapter,
and then write the results to a CSV file. You will choose the preparation version to execute.
This allows administrators and operators to automate tasks that business users used to do manually.
Here is a diagram of the Job:
a. Right-click the ExtractDatafromDB Job and select Duplicate.
TIP:
To insert a component inside a data flow, you can also place the new component on the designer and manually
reorganize the links between components.
c. When asked if you want the schema of the target component, click the Yes button.
The link is created.
e. Click OK.
Confirm that for Preparation Id, "Customers Preparation" is selected.
j. To see how the schema was updated, click the Edit Schema button.
c. If there is a warning on the output component, click the Sync Columns button.
Using Notepad++, open CleanDBCustomersOut.csv.
Examine some fields to ensure that the preparation was applied.
NOTE:
The REGION column does not appear in the export file, as you ran Adam Brown's version.
Next step
You have almost completed this lesson. Now you can test your knowledge with a challenge exercise.
Exercise
Solution
Here is a solution to the challenge exercise. Your solution may be slightly different, but still valid.
5. To get the schema, click the Fetch Schema button. When asked if you want to propagate the schema, click Yes.
NOTE:
As in the previous exercise, the REGION column does not appear in the export file.
Next step
You have almost finished this section. Time for a quick review.
Review
In this lesson, you created a Talend Studio project and imported a simple Job with related metadata items. The initial Job simply read data from a database and wrote it to a CSV file.
You duplicated the initial Job and modified it to publish the dataset to Data Preparation instead of writing the data to an output file. To
publish the data to the Data Preparation server, you used the tDatasetOutput component.
You again modified the initial Job to add an additional step: execute a specific version of a data preparation on the data extracted
from the database before writing the data to an output file containing the clean Customers data. To execute a data preparation in
Talend Studio, you used the tDataprepRun component.
You enhanced your knowledge with a challenge exercise. The objective was to build a Job that executes the data preparation on a
dataset manually uploaded by a business user to the Data Preparation server. You used the tDatasetInput component to read the
dataset in Talend Studio.
More information
Talend documentation:
Talend Data Integration Getting Started Guide
Talend Data Integration Studio User Guide
tDatasetOutput component
tDataprepRun component
tDatasetInput component
Concepts
Overview
Implementing Live Dataset Mode in Talend Studio
Deploying a Job in TAC
Creating a Dataset from a Talend Job
Review
Concepts
You will copy the previous Job and update the tDatasetOutput configuration.
Then you will export the Job in a Zip file to deploy it on TAC.
In Data Preparation, you will create the live dataset and update it on demand.
Overview
Use case
In the previous lesson, you used Talend Studio to publish a dataset to Data Preparation.
In this lesson you will implement the live dataset scenario.
To do this, you will:
Implement the live dataset option
Prepare your TAC environment to deploy the new Job: create a project, assign project authorizations to users, create a local Job server, and create an execution task
Create a dataset from the deployed Job in the Data Preparation web UI
Here is a diagram of the interaction between Talend components in the live dataset scenario:
Objectives
After completing this lesson, you will be able to:
Implement the live dataset method in a Talend Job
Build a Talend Job and compress it in a Zip file
Create a project and assign project authorizations in TAC
Configure a local Job server in TAC
Create a task in TAC
Create a dataset from a deployed Talend Job in Data Preparation
Next step
You are ready to implement the live dataset option in Talend Studio.
Implementing Live Dataset Mode in Talend Studio
Task outline
When publishing a dataset to the Talend Data Preparation server, you can use one of two modes:
Batch mode: The dataset is stored on the Data Preparation server and updated every time the DI Job is
executed. If the Job is deployed in TAC, the update frequency depends on the configuration defined by the operations manager in TAC.
Live mode: The dataset is not stored on the Data Preparation server side. It is updated on demand every time it
is opened from Data Preparation. This means the update frequency is not fixed and depends on business user
requests.
In this section, you will use Talend Studio to duplicate and modify the Publish Dataset Job to implement the live dataset
publishing method.
c. Click OK.
The Job appears in the Repository.
d. To save the updated Job, press CTRL+S.
3. CONFIGURE THE tDatasetOutput COMPONENT
Update the component settings with the Live Dataset mode and the context variables you created.
a. Double-click the tDatasetOutput component to open the Component view.
d. To save the updated Job, press CTRL+S.
Next step
You are ready to deploy a Job from a Zip build in TAC.
Deploying a Job in TAC
Task outline
Talend Administration Center provides a feature called Job Conductor, which helps you configure Job servers, schedule
Job execution, and configure Job deployment. When using Job Conductor, there are three ways to deploy Jobs in TAC:
Using Zip files generated from Talend Studio
Using artifacts stored in Nexus artifact repository
Using Publisher to publish artifacts from SVN sources in TAC
NOTE:
In this section, you will use the first method to deploy a Job from a Zip file. The other methods are detailed in
the Talend Data Integration Administration course.
The method you use to deploy a Job in TAC has no impact on the behavior of the Data Preparation live dataset.
To deploy the Job, you will:
Create a project with the same name as the one you created in Talend Studio, then assign authorizations on this
project for the system user (remember, the system user is used by the Data Preparation server to communicate
with TAC)
Check the Job server configuration used to deploy the Job
Assign server project authorizations
Create a task and deploy it on the local Job server
Connect to TAC
1. CONNECT TO TAC IN THE WEB UI
Connect to TAC with the system user credentials:
Login: [email protected]
2. LOG IN
Click the Login button.
Create a project in TAC
1. OPEN THE PROJECTS TAB
On the Menu pane, in Settings, click Projects.
2. ADD A PROJECT
Create a project called DataPrep.
Note: The project name in TAC must match the name of the Data Integration project you created in Talend Studio.
Define project authorizations
1. OPEN THE PROJECT AUTHORIZATIONS TAB
On the Menu pane, in Settings, click Project authorizations.
2. ASSIGN ACCESS
Assign read/write permissions to the Operator user on the DataPrep project.
a. On the Project list, click DataPrep.
b. On the User/Group Authorizations list, in the Right column next to the Operator user, click the read/write (person with pencil) icon.
c. Click Save.
The updated server appears on the list.
To expand the window, click the plus (+) symbol. TAC may display errors until it successfully communicates with the Job server, which can take a couple of minutes.
Confirm that the server status is UP.
Create a task
1. OPEN THE JOB CONDUCTOR TAB
On the Menu, expand Conductor and click the Job Conductor tab.
2. ADD A NORMAL TASK
Click the Add button and select Normal Task.
b. In C:\Talend\6.4.1\dataprep\config, open application.properties and view the value of the tac.task-prefix variable.
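Based on the task naming used later in this lesson (an execution task labeled dataprep_customers that Data Preparation lists as customers), the entry is expected to look like this:

tac.task-prefix=dataprep_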
b. Click Browse....
For Execution server, select the default LocalServer.
Next step
You imported a Zip build generated from Talend Studio as a task in TAC. In addition, you used Job Conductor to deploy the task on a
Job server.
You can also deploy Jobs by using Nexus artifact repository (covered in the Data Integration Administration course).
In the next section, you will create a dataset from a Talend Job in the Data Preparation web UI.
Creating a Dataset from a Talend Job
Task outline
In the previous lessons, you learned how to create a dataset from a file right in the Data Preparation web UI, as well as
how to use the Data Integration tool to publish a dataset to Data Preparation.
In this section, you will create a dataset from a deployed Talend Job in the Data Preparation web UI. This option is available only to users in the Data Preparation administrator role.
Then you will apply an existing preparation to the newly created dataset and test the full-run functionality.
c. In the Dataset name text box, enter Live_Customers
In the User text box, enter [email protected]
For Password, enter talend
For Talend job, select customers
NOTE:
The specified user has permissions to run the execution task on the Job server.
The label of the Talend job (customers) is the label of the execution task you created in TAC (dataprep_customers) without the prefix (dataprep_).
d. Click OK.
The dataset opens with the last version of data extracted from the database. Ten thousand lines are displayed.
NOTE:
The cache retains the same data for an hour.
c. Close the data preparation by clicking the X in the upper right corner of the window.
c. In the US folder, view the Preparations.
Next step
You have almost finished this section. Time for a quick review.
Review
In this lesson, you learned how to set up the live dataset scenario. You:
Implemented live dataset mode by reusing an existing Job
Built the Zip archive of a Job and deployed it in TAC
Created a dataset in Data Preparation from the deployed Talend Job
More information
Talend documentation:
Working with datasets based on on-demand Job executions
Talend Help Center