Data Preperation Prac

Uploaded by

Evan van Zyl
Participant Guide

Talend Data Preparation for Implementers


Version 2.1
Copyright 2017 Talend Inc. All rights reserved.
Information in this document is subject to change without notice. The software described in this document is furnished under a license
agreement or nondisclosure agreement. The software may be used or copied only in accordance with the terms of those agreements.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or any means electronic
or mechanical, including photocopying and recording for any purpose other than the purchaser's personal use without the written
permission of Talend Inc.
Talend Inc.
800 Bridge Parkway, Suite 200
Redwood City, CA 94065
United States
+1 (650) 539 3200
Welcome to Talend Training

Congratulations on choosing a Talend training course.

Working through the course


You will develop your skills by working through use cases and practice exercises using live software. Completing the exercises is
critical to learning!
If you are following a self-paced, on-demand training (ODT) module, and you need an answer to proceed with a particular exercise,
use the help suggestions on your image desktop. If you can’t access your image, contact [email protected].

Exploring
You will be working in actual Talend software, not a simulation. We hope you have fun and get lots of practice using the software!
However, if you work on tasks beyond the scope of the training, you could run out of time with the environment, or you could mess up
data or Jobs needed for subsequent exercises. We suggest finishing the course first, and if you have remaining time, explore as you
wish. Keep in mind that our technical support team can’t assist with your exploring beyond the course materials.

For more information


Talend product documentation (help.talend.com)
Talend Community (community.talend.com)

Sharing
This course is provided for your personal use under an agreement with Talend. You may not take screenshots or redistribute the
content or software.
CONTENTS
LESSON 1 Data Preparation in Context
Concepts
LESSON 2 Getting Started
Concepts
Overview
Exploring the Environment
Creating Users and Groups in TAC
Connecting to Talend Data Preparation
Review
LESSON 3 Creating Data Preparation
Concepts
Overview
Creating a Data Preparation and Related Dataset
Adding a Join to a Data Preparation
Promoting the Preparation
Review
LESSON 4 Working with Large Data Volumes
Concepts
Overview
Creating a Dataset from a Database
Using selective sampling
Exporting preparations
Review
LESSON 5 Using Talend Dictionary Service
Concepts
Overview
Discovering Talend Dictionary Service
Creating a Dictionary Semantic Type
Creating a Regular Expression Semantic Type
Creating a Compound Semantic Type
Review
LESSON 6 Using DI for Data Preparation
Concepts
Overview
Publishing a Dataset to Data Preparation
Executing a Preparation in Talend Studio
Challenge
Solution
Review
LESSON 7 Implementing a Live Dataset
Concepts
Overview
Implementing Live Dataset Mode in Talend Studio
Deploying a Job in TAC
Creating a Dataset from a Talend Job
Review

LESSON 1
Data Preparation in Context
This lesson gives a business overview of Talend Data Preparation.
This chapter discusses:

Concepts
Concepts

Talend Data Fabric combines Talend products into a common set of powerful, easy-to-use solutions.

Self-service data preparation is provided as an extension of an enterprise-class integration platform rather
than a stand-alone capability. Talend customers can implement Talend Data Preparation without creating
another data silo.

Here are the objectives of the course.

In many companies, IT users have direct access to
raw data. Data must be prepared before being
passed along to business users.

Self-service data preparation is delivered through a common platform that authorized users can access.
Not only can this bring self-service data to anyone, including the data analyst, data scientist, and
occasional information worker, it also facilitates collaboration between these people so that they can
share datasets and preparations.

Apply changes and see the impact immediately.
Create rules to cleanse and standardize your data, make changes, and easily go back and forth to get it
right.
LESSON 2
Getting Started
This chapter discusses:

Concepts
Overview
Exploring the Environment
Creating Users and Groups in TAC
Connecting to Talend Data Preparation
Review
Concepts

You can create datasets directly in Data Preparation, loading data from a file or database. You can also
create them by using a Studio data integration Job.

When new data arrives, there is no need to start from scratch. As you work on correcting data, your
progress is tracked in the recipe. You can apply changes and immediately see the effect. You can get it
right and then reuse your recipe.

MongoDB stores preparation steps and metadata for datasets and preparations: name, sharing status,
user rights, and so on.
Datasets are stored in the file system (on the server side).

You create Data integration (DI) Jobs in Studio to interact with Data Preparation. Preparations can be
executed, and datasets can be created or updated in DI Jobs.

You create Data Preparation users in Talend Administration Center (TAC), which is also where you can
create execution tasks for Studio Jobs.

These services must be running:


MongoDB
Talend Administration Center (TAC)
Talend Data Preparation

The architecture of Talend Dictionary Service is explained later in the course.

User settings are role based. Data Preparation access permissions are granted according to job profiles:
administrator, dataset manager, data preparator.

To start the web server, run the Talend Data Preparation service.
For a better user experience, Talend provides a single-sign-on connection to let users switch between
Talend Data Preparation and Talend Data Stewardship without typing additional passwords.
This process, totally transparent for users, is managed by the Talend Identity Access Management (IAM) service.


Overview

Use case
Talend software is a modular suite made up of several applications working together to provide a distributed development
environment.
In this training session, you will use these modules:
Talend Data Preparation server
Talend Administration Center
Talend Studio

You can install each application on separate computers or install multiple applications on the same computer.
In this course, you use a complete Talend environment hosted on a single virtual machine (VM) that contains all the items you need.
This environment is similar to what you would find in a new Talend installation, with default parameters and configuration.
In this lesson, you will learn about Talend Data Preparation server configuration. You will connect to Talend Administration Center
(TAC) and create several users, as well as a user group. You will start the Data Preparation server and test your connection to the
Data Preparation web UI.

Objectives
After completing this lesson, you will be able to:
Create Data Preparation users in TAC
Create a user group in TAC
Start the Talend Data Preparation server
Connect to the Data Preparation web UI

Next step
You are ready to set up and explore the environment.

Exploring the Environment

Task outline

You will explore the training environment by viewing the configuration of the Talend Data Preparation server and verifying
that the services you need are running on the machine.

Open the configuration file


1. OPEN THE TALEND DATA PREPARATION SERVER CONFIGURATION FILE
In C:\Talend\6.4.1\dataprep\config, open the application.properties file in the Notepad++ text editor.
a. Use Windows Explorer to navigate to the C:\Talend\6.4.1\dataprep\config folder.

b. Right-click the application.properties file and select Edit with Notepad++.

2. EXPLORE THE CONFIGURATION FILE

The tac.url parameter points to the master TAC server URL of your Talend system. TAC manages Data Preparation
users and authentication. In the training environment, all the modules are installed locally, so the parameter
is the "studentpc" URL.
The public.ip parameter is the name of the server hosting the Data Preparation server. On your training VM, it is
"studentpc".
The iam.ip parameter is the name of the server hosting the Identity Access Management (IAM) service that manages
single sign-on (SSO) authentication. For the training, all the components are installed on the same machine,
so the server name is also "studentpc".
The tac.user and tac.password parameters must contain the credentials of the user that the Data Preparation
server uses to request information from TAC. This TAC user, which you will create later, needs a specific role
designation (administrator or operations manager/designer) and authorization for Data Preparation projects.

NOTE:
The password in this file is already stored in encrypted form and matches the password you will create in TAC. If you
need to update the password or use another user account, enter the credentials directly in the application.properties
file, then restart the Talend Data Preparation Server 6.4.1 service to encrypt the new password.

MongoDB settings are configured in this configuration file as well. MongoDB is a prerequisite for the Data Preparation
server, as Data Preparation uses it to store metadata for datasets and preparations. In your training environment,
MongoDB is installed and running, and there is a MongoDB user with required credentials.
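
If you prefer to check these parameters without opening an editor, a short script can list them. The following is a minimal sketch, not a Talend tool: the key names come from the description above, while the sample content and all of its values are placeholders invented for illustration. It assumes the file uses simple key=value lines.

```python
from typing import Dict


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blank lines and # or ! comments."""
    props: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!":
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


# Hypothetical sample: the key names are the ones described in this section,
# but every value is a placeholder, not the content of the real training file.
SAMPLE = """
# Illustrative values only
tac.url=http://studentpc:PORT/PATH
public.ip=studentpc
iam.ip=studentpc
tac.user=PLACEHOLDER_USER
tac.password=PLACEHOLDER_ENCRYPTED_VALUE
"""

props = parse_properties(SAMPLE)
for key in ("tac.url", "public.ip", "iam.ip", "tac.user"):
    print(f"{key} = {props.get(key, '<missing>')}")
```

To run it against the real file, you would read the text of application.properties from the config folder and pass it to parse_properties instead of SAMPLE.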

Confirm that the services are running


1. OPEN THE LIST OF WINDOWS SERVICES
On the Windows taskbar, click the Services button, as shown in the screenshot.

2. VERIFY THAT MONGODB AND COMPONENTS CATALOG SERVICES ARE RUNNING

Search for the MongoDB and Components Catalog services and verify they are running.

3. VERIFY THAT TAC SERVICE IS RUNNING


Search for the Talend Administration Center 6.4.1 service and verify that the Status is Running.

NOTE:
The list of Talend services may vary depending on the version and type of installation.
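
The same check can be scripted. The sketch below only parses text in the layout printed by the Windows `sc query` command; it does not call Windows itself, and the sample output and service name are invented for illustration. The internal names of the Talend services are not given in this guide, so they are left out.

```python
import re
from typing import Optional

# Matches the STATE line of `sc query` output, for example:
#         STATE              : 4  RUNNING
STATE_LINE = re.compile(r"^\s*STATE\s*:\s*\d+\s+(\w+)", re.MULTILINE)


def service_state(sc_output: str) -> Optional[str]:
    """Return the state name (such as RUNNING) or None if no STATE line is found."""
    match = STATE_LINE.search(sc_output)
    return match.group(1) if match else None


# Hypothetical output for a service called ExampleService.
SAMPLE_OUTPUT = """
SERVICE_NAME: ExampleService
        TYPE               : 10  WIN32_OWN_PROCESS
        STATE              : 4  RUNNING
                                (STOPPABLE, NOT_PAUSABLE, ACCEPTS_SHUTDOWN)
        WIN32_EXIT_CODE    : 0  (0x0)
"""

print(service_state(SAMPLE_OUTPUT))  # RUNNING
```

On the training VM, you could capture the text of `sc query` for a given service and pass it to service_state.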

Next step
You are ready to connect to TAC and create users.

Creating Users and Groups in TAC

Task outline

In this section, you will connect to TAC to create and set up all the users you need for these exercises.
First you must create a system user that can:
Enable communication between the Data Preparation server and TAC
Execute Talend Studio Jobs
Update semantic types in Talend Dictionary Service
Then you will create two Data Preparation business users with different privileges: an administrator and an operator.
You will also create a user group to easily manage the permissions of the two Data Preparation users.

Connect to TAC
1. IN A WEB BROWSER, OPEN THE TAC URL
On the right tile of the Windows Start menu, click Talend Administration Center.

2. FILL IN THE ADMIN CREDENTIALS


The TAC log-in page appears. Log in with the default account credentials:
Login: [email protected]
Password: admin

NOTE:
TAC user names must be in the format of an e-mail address. On some virtual keyboard configurations, the AltGr key is
not active. You can use a combination of the CTRL and Alt keys instead.

3. LOG IN
Click the Login button.
The WELCOME page appears.

You are logged in with administrator privileges.

Create a system user in TAC
1. ACCESS THE USERS TAB
In Menu, in Settings, click Users.

2. ADD A USER
On the USERS tab, click the Add button, as shown in the screenshot.

3. CONFIGURE THE USER


This system user must have these permissions:
A Data Integration type with administrator, operations manager, and designer roles to:
Enable communication between the Data Preparation server and TAC
Execute Talend Studio Jobs
A Data Management type with operations manager role and a Data Preparation user with a Data Preparator role
to:
Update the Talend Dictionary Service
For convenience, all of these privileges are assigned to the same system user in this course. In your own environment, you can
still have different users execute Talend Studio Jobs and update the dictionary.
a. Knowing that Data Management includes the Data Integration permissions, define this system user:
Login: [email protected]
First Name: Operator
Last Name: System
Password: talend
Type: Data Management
Roles: Administrator, Operation manager and Designer

b. Select the Data Preparation User check box.
c. For Data Preparation Role, select Data Preparator.

d. Confirm that the Active check box is selected.
The user details appear as shown in the screenshot.

4. SAVE THE USER


Click Save.
The user is added to the user list.

Create Data Preparation users in TAC
1. ADD AN ADMINISTRATOR DATA PREPARATION USER
Adam Brown will be a Data Preparation business user with administration rights.
As an administrator, he has permission to create live datasets from Talend Studio Jobs. You will use live datasets later.
a. Still on the Users tab, click the Add button.
b. Configure the Data Preparation administrator user as shown in the screenshot.
Login: [email protected]
First Name: Adam
Last Name: Brown
Password: talend
Type: No Project Access
c. Select the Data Preparation User check box.
d. For Data Preparation Role, select Administrator/Dataset Manager/Data Preparator.
e. Confirm that the Active check box is selected.
The user details appear as shown in the screenshot.

f. Click Save.
The user is added to the user list.

2. ADD A BUSINESS USER


John Smith will be a Data Preparation operator.
a. On the Users tab, click the Add button.
b. Configure the first business user like this:
Login: [email protected]
First Name: John
Last Name: Smith
Password: talend
Type: No Project Access
c. Select the Data Preparation User check box.
d. For Data Preparation Role, select Dataset Manager/Data Preparator.
e. Make sure the Active check box is selected.
The user details appear as in the screenshot.

f. Click Save.
3. VERIFY THAT JOHN SMITH AND ADAM BROWN ARE ON THE LIST

Create a user group in TAC
1. ACCESS THE USER GROUPS TAB
On the Menu, in Settings, click User groups.

2. CREATE A NEW USER GROUP


Create a user group named DataPrep_US.
a. Click the Add a user group button.

b. Define the user group.

For Label, enter DataPrep_US, and for Type, select Data Preparation.

c. Click Save.
The user group is created.

3. CONFIGURE THE USER GROUP


The new user group contains all business users from the United States.
Drag John Smith and Adam Brown to this group.

Next step
You are ready to start the Data Preparation server and test your connection.

Connecting to Talend Data Preparation

Task outline

You can now start the Data Preparation server, and, as a business user, test your connection.

Start the Data Preparation server


1. OPEN THE LIST OF WINDOWS SERVICES
On the Windows taskbar, click the Services button.

2. SEARCH FOR TALEND DATA PREPARATION SERVER SERVICE


Select Talend Data Preparation Server 6.4.1 and click Start the service.

3. VERIFY THAT THE SERVICE IS RUNNING

Wait until the status is Running.

Check the connection to Data Preparation


1. OPEN THE DATA PREPARATION WEB UI
In a web browser, enter the Data Preparation URL, http://studentpc:9999
You are redirected to the login page provided by IAM.

NOTE:
You will not be able to access this page if the Talend Data Preparation Server 6.4.1 service is not completely started.
Wait a few moments before trying again.
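
Instead of retrying by hand, you can poll the URL until the server answers. This is a minimal sketch using only the Python standard library. The URL is the one given in this exercise; the function names, the number of attempts, and the delay are arbitrary choices made for the example.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 3.0) -> bool:
    """Return True if the server sends any HTTP response for this URL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True  # the server answered, even if with an error status
    except (urllib.error.URLError, OSError):
        return False  # nothing is listening yet, or the host is unreachable


def wait_until_up(
    url: str,
    probe: Callable[[str], bool] = http_probe,
    attempts: int = 10,
    delay: float = 5.0,
) -> bool:
    """Call probe(url) up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


# Example call on the training VM (not executed here):
# wait_until_up("http://studentpc:9999")
```

The probe is passed in as a parameter so that the retry logic can be tried out without a running server.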

2. CONNECT TO DATA PREPARATION


Connect as business user John Smith.
a. Enter these credentials:
Email: [email protected]
Password: talend

b. Click the LOG IN button.

3. LEARN ABOUT THE DATA PREPARATION WEB UI


The interface takes you on a "discovery tour."
a. Read the description and click Next when prompted.

b. Click the LET ME TRY button.


When the tour ends, you see the main page.

4. LOG OUT OF THE DATA PREPARATION WEB UI


In the upper right corner, click John Smith, then click the Logout button.

WARNING:
To avoid issues, always log out properly when you leave Data Preparation.

Next step
You have almost finished this section. Time for a quick review.

Review
You started this lesson by analyzing the Data Preparation server configuration file, and then you connected to TAC and created a
system user. You created two Data Preparation users and grouped them in a Data Preparation user group. You noted that Data
Preparation user roles and permissions are handled in TAC. You started the Data Preparation server and tested your connection using
the Data Preparation web UI.

More information
Talend documentation:
About Talend Data Preparation
Talend Administration Center User Guide

LESSON 3
Creating Data Preparation
This chapter discusses:

Concepts
Overview
Creating a Data Preparation and Related Dataset
Adding a Join to a Data Preparation
Promoting the Preparation
Review
Concepts

You will use both options during the exercise.

You can reorder the steps in the recipe or delete them.

Data Preparation provides functions based on column data type.

Each version of a preparation can be individually exported and selected in Data Integration as well as Big
Data Jobs.
Preparation exporting and importing are managed in the Data Preparation web UI.
To prevent an error during import, datasets used by the preparation must exist in the other environment,
with the same names and schemas.

You created these two users in TAC. Adam Brown is the data administrator and John Smith is the data
preparator.

Several tools for discovering data are available in Data Preparation:
Each column is associated with a semantic type
The data quality bar displays, by column, the number of fields that have correct data, empty fields, and incorrect data
Column statistics are displayed in charts

Columns from a second dataset are added to the initial one. The join is based on a common column. Several
output types are supported: CSV, XLSX, Tableau.

Overview

Use case
In this exercise, your role is that of a business user who has received an extract of US customer data from Salesforce.com.
You will create your preparation on the development environment before exporting it to the production environment.
Connected to the development environment, you will view the data and use several Data Preparation functions to cleanse and standardize
the initial customer data file. Then you will create a version of the preparation to capture the state of the recipe. You will share
the preparation and the dataset with another business user who wants to enrich the customer data with business regions. You will
reuse the shared data preparation to join the customer data with the list of business regions corresponding to states. You will group
the customers by business region and export the resulting customer data to a CSV file. Then you will export the preparation to reuse
it on the production environment.
Here are the steps in this lesson:

Objectives
After completing this lesson, you will be able to:
Create a data preparation and dataset
Use Data Preparation to discover data
Use prebuilt Data Preparation functions to cleanse data
Create a version of the preparation
Share a data preparation and dataset
Add a lookup table to a data preparation
Export the results file
Export the preparation

Next step
You are ready to create a data preparation and related dataset.

Creating a Data Preparation and Related Dataset

Task outline

A data preparation applies a recipe to a dataset to produce an outcome. The original dataset is never modified.
The recipe corresponds to the set of functions that is applied to the initial dataset.
The dataset holds the data that can be used as the raw material for preparations. It is presented as a table to which you
can apply recipes without affecting the original data.
In this section, you will create a data preparation to cleanse a customer dataset.
You will find inconsistencies in the dataset and then create a recipe that lets you correct the errors you detected and standardize
columns.
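
To make the relationship between dataset, recipe, and outcome concrete, here is a minimal sketch in Python. It is an illustration of the idea only, not how Talend Data Preparation is implemented. The two step functions and the sample row are invented for the example.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]
Step = Callable[[Row], Row]


def upper_last_name(row: Row) -> Row:
    """Recipe step: return a copy of the row with LAST_NAME in upper case."""
    return {**row, "LAST_NAME": row["LAST_NAME"].upper()}


def trim_name(row: Row) -> Row:
    """Recipe step: return a copy of the row with whitespace stripped from NAME."""
    return {**row, "NAME": row["NAME"].strip()}


def apply_recipe(dataset: List[Row], recipe: List[Step]) -> List[Row]:
    """Apply every step, in order, to a copy of each row of the dataset."""
    result = []
    for row in dataset:
        new_row = dict(row)  # work on a copy so the dataset is never modified
        for step in recipe:
            new_row = step(new_row)
        result.append(new_row)
    return result


# Hypothetical one-row dataset and two-step recipe.
dataset = [{"NAME": " Ann ", "LAST_NAME": "lee"}]
recipe = [upper_last_name, trim_name]

prepared = apply_recipe(dataset, recipe)
print(prepared)  # [{'NAME': 'Ann', 'LAST_NAME': 'LEE'}]
print(dataset)   # [{'NAME': ' Ann ', 'LAST_NAME': 'lee'}]
```

Because the recipe is just an ordered list of steps, the same recipe can be applied again when new data arrives.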

Add a data preparation


1. CONNECT TO DATA PREPARATION
As user Adam Brown, connect to the Data Preparation web console.
Enter these credentials:
Email: [email protected]
Password: talend
Click LOG IN.

NOTE:
The Data Preparation server hosted by the training VM can be viewed as the development environment.

The home page for Adam Brown appears.

2. ADD A PREPARATION
Create a data preparation using the Customers.csv file in the C:/StudentFiles/DataPrep folder.
a. Click the ADD PREPARATION button.

A wizard displays your recently used datasets.

b. There is no dataset in your environment, so you need to create one.


Click the Import File button.

c. In the File Upload wizard, go to the C:/StudentFiles/DataPrep folder and select Customers.csv.

d. Click Open and wait for the file to upload.


Confirm that the number of lines in the dataset is 1000.

e. To expand the dataset parameters, click the settings icon in the upper left corner.

The separator in the file is a comma and the encoding is UTF-8, so you do not need to adjust the parameters.
To close the window, click the settings icon again.

Explore the dataset


1. VIEW THE COLUMN SEMANTIC TYPES
The semantic type indicates what the data in a column represents.
View the columns in your dataset and notice that the semantic data types have been automatically detected.
a. Click the menu icon in the column header for the STATE column.

b. Hover over the us_state_code type and view the matching percentage rates that allow the system to establish the
column semantic type.

NOTE:
Talend Data Preparation suggests a data type for each column in your dataset. However, you can change a suggestion
at any time, based on your knowledge of the data.

2. EXPLORE THE COLUMN STATISTICS


Still in the STATE column, view the statistics on the lower right side of the screen.
The state distribution is charted on a geographical map.

a. To display the same information in a bar chart, click the bar chart icon.

b. To view the number of customers from a state, hover over the state (this works on the bar chart as well).

c. To view the number of valid and invalid rows, click the VALUE tab.

d. To view the most common format in this field, click the PATTERN tab.
Notice that the most common pattern for the state is two uppercase characters.
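
The idea behind the PATTERN tab can be sketched as follows. The mapping used here (A for an uppercase letter, a for a lowercase letter, 9 for a digit) is an assumption chosen to match patterns such as Aaaaa used in this lesson; the sample values are invented.

```python
from collections import Counter


def char_pattern(value: str) -> str:
    """Replace each character with a class symbol: A upper, a lower, 9 digit."""
    symbols = []
    for ch in value:
        if ch.isupper():
            symbols.append("A")
        elif ch.islower():
            symbols.append("a")
        elif ch.isdigit():
            symbols.append("9")
        else:
            symbols.append(ch)  # keep punctuation and spaces as they are
    return "".join(symbols)


# Hypothetical STATE values.
values = ["TX", "CA", "Texas", "NY", "tx"]
counts = Counter(char_pattern(v) for v in values)
print(counts.most_common(1))  # [('AA', 3)]
```

Values whose pattern differs from the most common one, such as Aaaaa here, are good candidates for cleansing.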

3. UNDERSTAND THE DATA QUALITY BAR

Still in the STATE column, notice the three-color data quality bar below the semantic type. It shows the number of rows with
correct data, empty fields, and incorrect data.
Green: Data matches cell format
White: Empty cell
Orange: Data does not match cell format
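
The three counts behind the quality bar can be sketched like this. The set of accepted codes is a tiny illustrative subset, not the actual us_state_code semantic type, and the column values are invented.

```python
from collections import Counter

# Illustrative subset only; the real semantic type covers all US state codes.
KNOWN_CODES = {"CA", "NY", "TX"}


def classify(value: str) -> str:
    """Return 'empty', 'valid', or 'invalid' for one cell of the STATE column."""
    if value.strip() == "":
        return "empty"
    return "valid" if value in KNOWN_CODES else "invalid"


# Hypothetical STATE column.
column = ["TX", "CA", "", "Texas", "NY", "ZZ"]
quality = Counter(classify(v) for v in column)
print(quality["valid"], quality["empty"], quality["invalid"])  # 3 1 2
```

The three numbers correspond to the green, white, and orange parts of the bar.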
4. APPLY FILTERS
Display only invalid values.
a. Still in the STATE column, click the orange tile and select Select rows with invalid values for STATE.

b. View the invalid values.

c. You can apply additional filters by clicking a pattern on the statistics panel.

d. To remove the pattern filter, click the pattern.


5. DISABLE THE FILTERS

To redisplay all the records, click the Enable/Disable all filters button.

6. REMOVE THE FILTERS


Notice that filters appear at the top of your dataset.
To remove a filter, click the X next to it.

7. EXPLORE THE OTHER COLUMNS


Play with the options to determine potential inconsistencies in the rest of the columns.
When you are finished profiling the data, you can start using the available functions to build a recipe.

Create a recipe
1. ADD A FIRST STEP FOR THE RECIPE
Click the LAST_NAME column and wait for the list of functions on the right to refresh.
a. Use the search bar to find the Change to UPPER case function.

b. Click the Change to UPPER Case function.


c. The values are updated. This first step of the recipe appears on the left pane.

2. ADD A FUNCTION TO SEVERAL COLUMNS AT ONCE


To avoid repetitive action, you can apply functions across multiple columns.

a. To select the LAST_NAME and NAME columns, press the CTRL key and click both columns.
b. Use the search bar to find the Remove trailing and leading characters... function, which removes extra characters
in the data in the selected columns.

c. Click the function.


d. By default, the padding character box is already set to Whitespace.
To remove extra spaces, click SUBMIT.

Two new steps are added to the recipe.

A discovery pop-up may appear. Click the NEXT button until the tour ends.

e. Do not worry about saving the recipe. As you can see at the top of the added step, all the steps are automatically
saved.
3. CORRECT INVALID VALUES
Some inconsistent state codes can be replaced.

a. As in the "Explore the dataset" section, select the STATE column and filter values on the Aaaaa pattern.

b. To change all occurrences of Texas to TX, select the Replace the Cells that Match... function.

c. Fill in the boxes, check Overwrite entire cell and click SUBMIT.
A new step is added in the recipe.

The filter icon next to the step indicates that the replace function has been applied to only the filtered rows.
d. To redisplay all the records, remove the filter.
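
The effect of applying a replace function while a filter is active can be sketched as follows. This is an illustration under assumptions, not Talend's implementation: the pattern symbols (A for uppercase, a for lowercase) and the sample rows are invented for the example.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def letter_pattern(value: str) -> str:
    """Map uppercase letters to A and lowercase letters to a (assumed convention)."""
    return "".join("A" if c.isupper() else "a" if c.islower() else c for c in value)


def matches_aaaaa(row: Row) -> bool:
    """Filter: keep rows whose STATE value has the pattern Aaaaa."""
    return letter_pattern(row["STATE"]) == "Aaaaa"


def replace_where(rows: List[Row], column: str, row_filter: Callable[[Row], bool],
                  current: str, replacement: str) -> List[Row]:
    """Overwrite the cell only in filtered rows whose value equals `current`."""
    result = []
    for row in rows:
        if row_filter(row) and row[column] == current:
            result.append({**row, column: replacement})
        else:
            result.append(dict(row))
    return result


# Hypothetical rows: Idaho passes the filter but does not match "Texas".
rows = [{"STATE": "Texas"}, {"STATE": "TX"}, {"STATE": "Idaho"}]
fixed = replace_where(rows, "STATE", matches_aaaaa, "Texas", "TX")
print([r["STATE"] for r in fixed])  # ['TX', 'TX', 'Idaho']
```

Rows outside the filter are copied unchanged, which is why the step is marked with a filter icon in the recipe.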
4. FORMAT THE DATE
To standardize the date format, select the DATE column.
a. Search for and select the Change Date Format... function.

b. For New Format, select custom.


For Your Format, enter MM.dd.yyyy.

c. Click SUBMIT.
The settings of the new step can be updated directly in the recipe.
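
For readers who think in code, the custom format MM.dd.yyyy means two-digit month, two-digit day, and four-digit year, separated by periods. In Python's notation this is %m.%d.%Y. The sketch below assumes, for illustration only, that the source dates look like 2017-03-09; the guide does not state the original format of the DATE column.

```python
from datetime import datetime


def change_date_format(value: str,
                       source_format: str = "%Y-%m-%d",
                       target_format: str = "%m.%d.%Y") -> str:
    """Parse a date string in source_format and rewrite it in target_format."""
    return datetime.strptime(value, source_format).strftime(target_format)


# The source format is an assumption made for this example.
print(change_date_format("2017-03-09"))  # 03.09.2017
```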

5. SPLIT A COLUMN INTO PARTS


Split the CAMPAIGN_ID column into three parts.
a. Click the CAMPAIGN_ID column and search for the Split the Text in Parts... function.
For Parts, select 3, and for Separator, select _

Click SUBMIT.
Three columns are added to the dataset.
b. To see the complete column names, enlarge the column sizes.
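
The split step behaves much like a string split limited to a fixed number of parts. In this sketch, the sample campaign identifier is invented, and the handling of values with too few or too many separators is an assumption rather than documented Talend behavior.

```python
from typing import List


def split_in_parts(value: str, separator: str = "_", parts: int = 3) -> List[str]:
    """Split value into exactly `parts` pieces.

    Extra separators stay in the last piece; missing pieces become empty strings.
    """
    pieces = value.split(separator, parts - 1)
    return pieces + [""] * (parts - len(pieces))


# Hypothetical CAMPAIGN_ID value.
print(split_in_parts("SPRING_Y17Q02_US"))  # ['SPRING', 'Y17Q02', 'US']
```

The three results correspond to the CAMPAIGN_ID_SPLIT_1, CAMPAIGN_ID_SPLIT_2, and CAMPAIGN_ID_SPLIT_3 columns.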

6. RENAME A COLUMN
Rename the CAMPAIGN_ID_SPLIT_2 column QUARTER_YEAR.

a. Click the CAMPAIGN_ID_SPLIT_2 column and select Rename Column.

b. For New name, enter QUARTER_YEAR

c. Click SUBMIT.

d. Use the same process to rename the CAMPAIGN_ID_SPLIT_1 column CAMPAIGN_NAME.

7. TRANSFORM A COLUMN USING A REGULAR EXPRESSION


Change the pattern of the QUARTER_YEAR column from Y(2 digits)Q(2 digits) to Quarter (2 digits) Year (2 digits).
The regex of the current format of the column is Y(\d{2})Q(\d{2}) where (\d{2}) corresponds to 2 digits:
\d=Digit
{2}=Number of times digit is repeated
The replacement expression refers to the first group appearing in the current expression as $1 and the second as $2. The first
group captures the year and the second captures the quarter, so the replacement expression is Quarter $2 Year $1.
a. Click the QUARTER_YEAR column and search for the Replace the Cells that Match... function.

b. Change the operator for the Current field to RegEx.

c. For Current, enter Y(\d{2})Q(\d{2})


For Replacement, enter Quarter $2 Year $1
Check Overwrite entire cell

d. Click SUBMIT.
The values are updated with the replacement pattern.
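
The same replacement can be tried outside the tool. The sketch below uses Python's re module, where groups are written \1 and \2 in the replacement instead of $1 and $2. Replacing the whole cell when the pattern is found is this example's reading of the Overwrite entire cell option, and the sample value is invented.

```python
import re

# Group 1 captures the two-digit year, group 2 the two-digit quarter.
CURRENT = re.compile(r"Y(\d{2})Q(\d{2})")


def quarter_year(value: str) -> str:
    """Return 'Quarter QQ Year YY' if the pattern is found, else the value unchanged."""
    match = CURRENT.search(value)
    if match is None:
        return value
    # Equivalent of the replacement expression "Quarter $2 Year $1".
    return match.expand(r"Quarter \2 Year \1")


print(quarter_year("Y17Q02"))  # Quarter 02 Year 17
```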

8. REMOVE A COLUMN
Remove the CAMPAIGN_ID_SPLIT_3 column.
Click the down arrow on the column and select Delete Column.

9. MASK DATA
To see how masking differs based on the column type, apply data masking to two columns.

a. Click the EMAIL column and search for the Mask data (obfuscation) function.

b. Click the function and notice that only the domain of the email address is shown.
In the first section of the email addresses, characters have been replaced with the letter X.

c. Click the CREDITCARDNUMBER column and again search for the Mask data function.

d. Click the function and notice that while the numbers have changed, the pattern of each value is intact.
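
The two masking behaviors you observed can be imitated with a short sketch. This is not the algorithm used by the Mask data function; it only reproduces the visible effect under simple assumptions: every character before the @ becomes X, and every digit is replaced with a random digit while other characters stay in place. The sample values are invented.

```python
import random


def mask_email(email: str) -> str:
    """Replace every character before the @ with X and keep the domain."""
    local, sep, domain = email.partition("@")
    if not sep:
        return "X" * len(email)  # no @ found: mask the whole value
    return "X" * len(local) + "@" + domain


def mask_digits(value: str, rng: random.Random) -> str:
    """Replace each digit with a random digit; keep separators so the pattern is intact."""
    return "".join(str(rng.randrange(10)) if ch.isdigit() else ch for ch in value)


print(mask_email("jane.doe@example.com"))  # XXXXXXXX@example.com
print(mask_digits("4111-1111-1111-1111", random.Random()))
```

Because the digits are random, masking the same column again generally produces different values with the same pattern.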

10. REMOVE A STEP IN THE RECIPE


You may not have noticed that numbers changed when you applied data masking to the CREDITCARDNUMBER column.
You will now delete this step, and then reapply it to see how the values change.
a. Click the recipe panel. Hover over the step in the recipe until the recycling bin symbol appears, then click it.

b. The step is removed from the recipe and the dataset is updated.
11. UNDO YOUR LAST ACTION
Undo your last action to reapply data masking to the CREDITCARDNUMBER column.
In the top right corner of the web UI, click the Undo button.

The step is added to your recipe, and all the steps in the recipe are automatically saved in the data preparation.

TIP:
Keep your recipe clean.
A function applied to data can be canceled in the GUI by modifying parameters or by applying another function.
For instance, you can rename a column and then cancel the name change by renaming the column back to its
original name. However, this creates two unnecessary rename steps in your recipe.
To keep your recipe clean, always cancel a modification by deleting the step in the recipe or using the Undo button.
This keeps you from using resources to manage useless steps.

12. CLOSE AND CONFIRM


a. Close the data preparation by clicking the X in the upper-right corner of the window.

b. The home page is displayed and your preparation appears on the Preparations list.

c. Confirm that the imported file has also been saved as a dataset.
Click Datasets.

Move the data preparation


1. CREATE A NEW FOLDER
Create a folder called US for storing all preparations from the US.
a. On the menu, click Preparations.
Click the ADD FOLDER button.

b. In the Enter Folder Name text box, enter US.

c. Click OK.
The folder is created in the main HOME folder.

2. MOVE THE PREPARATION


Move the Customers data preparation to the US folder.
a. Hover over Customers preparation.
Click the Copy or Move preparation button.



b. The Copy/Move Item window opens. Select the US folder.

c. Click MOVE.
Confirm that the preparation is in the US folder.

Next step
Now you will learn how to share a dataset and preparation, define a new dataset, and add a lookup to your existing data preparation.

Adding a Join to a Data Preparation

Task outline

The first business user has created a data preparation for the Customers dataset in order to cleanse the file. He can
create a version of the preparation to mark a milestone in the recipe development. The second business user wants to join
this data preparation with a lookup dataset containing a list of business regions. Each region is composed of several
states.
To do this, you must create the version, then share the dataset and data preparation with the second business user. Then
you can add a lookup to the data preparation.

Create a preparation version


To create a version, reopen the preparation.
1. CREATE THE VERSION
a. Click the MANAGE VERSIONS button.

No version is available yet, so you have to create one.


b. Click the ADD VERSION button.



c. In the Description text box, enter Adam's version.

d. Click SUBMIT.

The new version is created.


2. OPEN THE VERSION
a. To test the version availability, click the version.

The version opens in read only mode.

You will reuse this version later.
b. To leave the read only mode, click SWITCH TO CURRENT STATE.

c. Close the preparation.

Share a dataset
1. DISPLAY THE DATASETS
Still logged in as Adam Brown, click Datasets.

2. SHARE A DATASET
Adam Brown wants to share his Customers dataset with all Data Preparation users in the US.
a. To enable additional options, hover over the Customers dataset.
Click the Share Dataset button.



b. On the All Users and Groups list, select DataPrep_US.

c. Click the Add to List button.


The group is added to the Current Collaborators list.

d. Follow the same procedure to add the Operator System user.

e. Click the CONFIRM button.

Share a data preparation folder
1. DISPLAY DATA PREPARATION FOLDERS
Still logged in as Adam Brown, click Preparations.

2. SHARE A DATA PREPARATION FOLDER


Adam Brown wants to share his Customers preparation with all Data Preparation users in the US.
a. To enable additional options, hover over the US folder.
Click the Share this folder button.

b. On the All Users and Groups list, select DataPrep_US and the Operator System user.
c. Click the Add to List button.
The group and the user are added to the Current Collaborators list.

d. Click the CONFIRM button.


3. LOG OUT
In the upper right corner, click Adam Brown, then click the Logout button.



Check access to the shared folder
1. CONNECT TO THE DATA PREPARATION UI
Connect as user John Smith using these credentials:
Email: [email protected]
Password: talend
Click LOG IN.

2. VIEW THE SHARED FOLDER


The home page is displayed.
a. Click the US folder.
Access Customers Preparation.

b. Click Datasets.

Add a dataset for joining data


1. UPLOAD A NEW DATASET
Create a new dataset using the BusinessRegions_States.csv file in the C:/StudentFiles/DataPrep folder.
a. Still on the Datasets tab, click the ADD DATASET button.

b. Go to the C:/StudentFiles/DataPrep folder and select BusinessRegions_States.csv.

c. Click Open and wait for the file to upload.


Recall that this file contains the mapping between US business regions and states.
The STATE column in this dataset has the same pattern as the STATE column in the Customers dataset, so the
column is used to join the initial data preparation with the lookup dataset.



d. Close the dataset and confirm that it appears on the Datasets list owned by John Smith.

e. To grant data access to other collaborators, hover over the BusinessRegions_States dataset.
Click the Share Dataset button.

Share the dataset with DataPrep_US and Operator System then click the CONFIRM button.

Add a join to the data preparation
1. OPEN A SHARED DATA PREPARATION
Open the Customers preparation you created as Adam Brown in the US folder.
a. Open the US folder in PREPARATIONS.

b. Click Customers preparation.


The recipe appears on the left.

2. ADD A SECOND DATASET AS A LOOKUP


Add the BusinessRegions_States dataset as a lookup.
a. Click the Lookup: combine two datasets button.



b. A panel on the lower part of the screen guides you through the steps to build the join.
Click the Add a dataset to Lookup button.

c. Select the BUSINESSREGIONS_STATES dataset and click the ADD button.

3. CREATE A JOIN
Join the Customers preparation with the BusinessRegions_States dataset on the STATE column.
a. The new dataset is displayed below the original one.
In the lookup table, the first column corresponding to the STATE column is selected by default.

In the original dataset, select the STATE column.

b. The columns used for the join are highlighted in blue.


In the lookup dataset, in the REGION column, select the Add to Dataset check box.



c. Click the CONFIRM button.
A REGION column is added to the dataset and a new step is added to the recipe.

d. The preparation is saved. Do not close it yet.
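
Conceptually, the lookup you just built behaves like a left join on the STATE column. The following Python sketch illustrates the idea under that assumption; it does not show how Talend Data Preparation implements the lookup, and the names and region values are made up.

```python
customers = [
    {"NAME": "Ann", "STATE": "CA"},
    {"NAME": "Bob", "STATE": "TX"},
    {"NAME": "Cy", "STATE": "ZZ"},   # no matching region in the lookup data
]
regions = [
    {"STATE": "CA", "REGION": "West"},
    {"STATE": "TX", "REGION": "South"},
]

# Index the lookup dataset by its join column.
region_by_state = {row["STATE"]: row["REGION"] for row in regions}

# Left join: every customer row is kept; unmatched rows get an empty REGION.
joined = [{**c, "REGION": region_by_state.get(c["STATE"], "")} for c in customers]
for row in joined:
    print(row)
```

Rows whose STATE value has no exact match in the lookup dataset end up with an empty REGION cell.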


4. MOVE A STEP IN THE PREPARATION SEQUENCE
The data quality bar for the REGION column reveals that many cells are empty.

Now you will learn why you have this issue, as well as how to correct it by adding a step and moving it ahead of the lookup
step in the preparation sequence.
a. Click the REGION column.
The chart shows many empty rows.

b. Use the data quality bar for the REGION column to display only rows with empty values.

This reveals that the associated states contain unwanted white spaces.



c. To remove the white spaces, click the STATE column, then search for and click the Remove trailing and leading
characters... function.

To remove extra spaces, use the default padding character.

The white spaces are removed, but the number of empty cells in the REGION column remains the same, because the new step runs after the lookup step.

d. Hover over the last step of the recipe until a handle symbol appears on its left.

e. To move the step ahead of the lookup step, use the handle or click the arrow above it.

With a cleansed STATE column, the lookup matches more rows, so the REGION column contains fewer empty cells.



f. To remove the filter, click the trash icon.
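
Why the position of the trimming step matters can be shown with a short Python sketch. It assumes the lookup compares STATE values exactly, so a value with stray spaces finds no match; the lookup values and function names are hypothetical.

```python
region_by_state = {"CA": "West", "TX": "South"}   # made-up lookup values
states = ["CA", " TX ", "CA  "]                    # two values carry stray spaces

def lookup(values):
    """Return the region for each state, or an empty string when nothing matches."""
    return [region_by_state.get(v, "") for v in values]

def trim(values):
    """Remove leading and trailing white spaces."""
    return [v.strip() for v in values]

# Order 1: lookup first, trim afterwards. The join already ran on dirty keys.
lookup_then_trim = lookup(states)
# Order 2: trim first, lookup afterwards. Every key now matches.
trim_then_lookup = lookup(trim(states))

print(lookup_then_trim.count(""), trim_then_lookup.count(""))
```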

Export the data preparation results


1. SAVE THE RESULTS FILE
Save the cleaned and enriched data to a delimited file, in the Windows default downloads folder.
a. On the top menu, click the EXPORT button.

The EXPORT wizard appears.

b. Select the Local CSV file format.


For Delimiter, select Comma.
Change the Filename to Customers_Preparation_Result.

c. Click CONFIRM. The browser has been set up to automatically save the CSV file in the Windows default downloads
folder.
d. Close the data preparation by clicking the X in the upper-right corner of the window.

2. OPEN THE EXPORTED DATA


Navigate to the default downloads folder and open Customers_Preparation_Result.csv.



a. Use Windows Explorer to navigate to the default downloads folder. You can use the shortcut on the left.

b. Right-click the CSV file and select Edit with Notepad++.

c. Confirm that the results file contains the clean, enriched data.
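
If you prefer to check the export with a script rather than by eye, a sketch along these lines could work. It uses an in-memory stand-in for the file, and the column names and values are invented for illustration; in practice you would open the exported CSV file from the downloads folder instead.

```python
import csv
import io

# Stand-in for the exported file; replace with open(path, newline="") on the real export.
exported = io.StringIO(
    "NAME,STATE,REGION\n"
    "Ann,CA,West\n"
    "Bob,TX,\n"
)

rows = list(csv.DictReader(exported, delimiter=","))
empty_regions = [r for r in rows if not r["REGION"].strip()]
print(f"{len(rows)} rows, {len(empty_regions)} with an empty REGION")
```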

Next step
The preparation is ready on the development environment. Now you will learn how to promote the preparation across environments.

Promoting the Preparation

Task outline

To comply with IT best practices, you will promote your preparation into another environment.
In the development environment, you will export the preparation to a JSON file. You can use this file to import the
preparation into another (test or production) environment.

WARNING:
To prevent an error during importing, datasets used by the preparation must exist in the other environment,
with the same names and schemas.

NOTE:
Because only one instance of Data Preparation runs on the training VM, you will practice by importing the
version into the same environment, but in another folder.
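
The prerequisite stated in the warning can be pictured as a simple comparison of dataset names and column lists between two environments. The sketch below is only an illustration: the inventories are hypothetical dictionaries, and it does not call any Talend API.

```python
def missing_or_mismatched(source: dict, target: dict) -> list:
    """Return names of datasets that are absent from the target or whose columns differ."""
    problems = []
    for name, columns in source.items():
        if target.get(name) != columns:
            problems.append(name)
    return problems

# Hypothetical inventories: dataset name -> ordered list of column names.
development = {
    "Customers": ["NAME", "EMAIL", "STATE"],
    "BusinessRegions_States": ["STATE", "REGION"],
}
production = {
    "Customers": ["NAME", "EMAIL", "STATE"],
    "BusinessRegions_States": ["STATE", "AREA"],   # schema differs
}

print(missing_or_mismatched(development, production))
```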

Exporting the preparation


You can export preparations from the Preparation tab.
1. EXPORT THE PREPARATION
Still logged in as John Smith, display the contents of the US folder.
a. To enable additional options, hover over Customers Preparation.

b. Click the Export Preparation button.


2. CONFIRM THAT THE JSON FILE HAS BEEN EXPORTED
Use Windows Explorer to navigate to the default downloads folder. You can use the shortcut on the left.
a. Open Windows Explorer and browse to the downloads folder.

b. Notice the time stamp in the JSON file name.

Importing the preparation


You are ready to import the preparation in a new folder.



1. CREATE THE FOLDER
Display the preparations Home folder.
a. Click the ADD FOLDER button to create a new folder.

b. In the Enter Folder Name text box, enter PRODUCTION and click OK.

The new folder is created.

2. IMPORT THE PREPARATION


Import the JSON file in the PRODUCTION folder.
a. Open the PRODUCTION folder.
b. Click the IMPORT PREPARATION button.

c. Select the JSON file and click Open.

The preparation has been imported.

Next step
You have almost finished this section. Time for a quick review.



Review
In this lesson, you imported a dataset into the Talend Data Preparation web UI by using a delimited CSV file. Then you created a data
preparation to cleanse the data, and you explored several functions.
You created a version of the preparation. Then, to collaborate with another business user, you shared your dataset and data
preparation. To enrich the data, you added a lookup table to the data preparation, and you moved a step in the preparation sequence to
correct errors. Then you exported the results.
Finally, you exported the preparation from the development environment in order to import it into the production environment.

More information
Talend documentation:
Talend Data Preparation Getting Started Guide
Regular expressions

LESSON 4
Working with Large Data Volumes
This chapter discusses:

Concepts
Overview
Creating a Dataset from a Database
Using selective sampling
Exporting preparations
Review
Concepts

In the Data Preparation web UI, a data preparator can create a dataset by extracting data from a database.
The Data Preparation web UI does not display the complete dataset, only a data sample.
The default sample size is 10,000 rows. This limit is configurable.
The selective sampling feature lets you apply filters not only to the rows displayed in the sample but to the entire dataset.
When exporting data, the data preparator can export only the rows displayed onscreen or the full, unlimited dataset.

A data preparator can create a dataset by:
Importing data from a local file
Extracting data from a database
Importing data from an HDFS file system
On demand, an administrator can create a live dataset from a Studio Job.

You can easily set up one-click filters to select invalid and empty rows. Used with selective sampling, this lets
you select all rows that need rework in the same sample.

Exports made from datasets larger than 10,000 rows are kept in memory. You can download them on
demand from the export history page.

Overview
In the previous lesson, you created datasets in the Talend Data Preparation web UI by importing files.
Talend Data Preparation can also use a database as a source for creating datasets.
In this lab, you will create a dataset from a MySQL database stored on your virtual machine. Then you will apply your preparation to
this dataset.
This database contains a substantial number of rows, so you will use some sampling and export features that are available only for
large data volumes.
Here is a diagram of this lesson:

Objectives
After completing this lesson, you will be able to:
Create a dataset from a MySQL database
Apply a preparation to this dataset
Progressively apply filters to a large dataset to get the most accurate data sample for your preparation
Export a sample of cleansed data
Export the full, cleansed dataset

Next step
You are ready to create a dataset from a database.



Creating a Dataset from a Database

Task outline

In this section, you will use the Talend Data Preparation web UI to quickly access and cleanse data stored in a database.
A MySQL database is available locally on your virtual machine. Note that the features described in this exercise are not
restricted to local databases and can be used with different types of network architecture.
The local training database contains two tables:
customers, which contains 1,000 rows and is an exact replica of the CSV data file you used in the previous lesson
customersfull, which contains 11,000 rows; it has the same structure as the first table but needs more cleansing
For this exercise, you will use the customersfull table.

Creating a dataset from a MySQL database


1. DISPLAY THE DATASETS TAB
As John Smith, click the DATASETS tab.
a. Make sure you are connected to Talend Data Preparation in the John Smith account.

b. On the HOME page, click the DATASETS tab.

2. CREATE THE DATASET


You will create a dataset from a MySQL database named training, which is stored on your virtual machine.
a. To open the list of available dataset sources, click the descending arrow on the right side of the ADD DATASET
button.

As a dataset manager, John Smith can see five sources.

An administrator such as Adam Brown sees an additional source for live datasets, which you will create in a future lesson.
b. Click From Database.

You already created a dataset from a local file.


A dataset manager can also create datasets from data stored on an HDFS cluster, on Amazon S3, or on Salesforce.
The process of using Talend Data Preparation with an HDFS cluster is covered in an additional training module.
3. CONNECT TO THE DATABASE
The ADD A DATABASE DATASET window opens.



a. In the Database type drop-down list, select MySQL.

Talend Data Preparation uses a JDBC URL to connect to the database. To simplify the set-up process, a URL template
is provided:

You must adapt this template to match your environment:

localhost is the server address; if necessary, replace it with the server IP address
3306 is the default port for MySQL; if necessary, replace it with another port
db must be replaced with the name of the database
b. To connect to the database, enter these credentials:
Dataset name: Customers Full
Database type: MySQL
JDBC URL: jdbc:mysql://localhost:3306/training
Username: root
Password: root

c. To validate the connection details, click TEST CONNECTION.


A connection success notification appears.

If you do not get the success message, make corrections and retest the connection.
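
To see which part of the JDBC URL carries which setting, the following Python sketch splits the URL used in this exercise into host, port, and database name. The helper name is hypothetical, and the sketch only handles the simple jdbc:mysql://host:port/database form shown here.

```python
from urllib.parse import urlsplit

def parse_jdbc_mysql_url(jdbc_url: str) -> dict:
    """Split a jdbc:mysql://host:port/database URL into its parts."""
    prefix = "jdbc:"
    if not jdbc_url.startswith(prefix):
        raise ValueError("not a JDBC URL")
    parts = urlsplit(jdbc_url[len(prefix):])
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": parts.port or 3306,          # 3306 is the MySQL default port
        "database": parts.path.lstrip("/"),
    }

print(parse_jdbc_mysql_url("jdbc:mysql://localhost:3306/training"))
```

If the connection test fails, checking these three parts one by one is usually a quick way to find the mistake.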
4. ENTER THE QUERY



a. A new query box appears at the bottom of the window.

To simply select all the columns of the customersfull table, in the Query text box, enter select * from customersfull
b. Make sure your ADD A DATABASE DATASET screen looks like the one in the screenshot, and click the ADD
DATASET button.

The new dataset opens in Talend Data Preparation. By default, a sample of 10,000 rows is displayed.
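
The relationship between the query result and the displayed sample can be sketched in Python. SQLite is used here purely as a stand-in for the MySQL training database, and the column names are invented; the point is only that the query addresses all 11,000 rows while the sample holds the first 10,000.

```python
import sqlite3

SAMPLE_LIMIT = 10_000  # default sample size mentioned in this lesson

conn = sqlite3.connect(":memory:")  # SQLite stands in for the MySQL training database
conn.execute("CREATE TABLE customersfull (id INTEGER, last_name TEXT)")
conn.executemany(
    "INSERT INTO customersfull VALUES (?, ?)",
    [(i, f"name{i}") for i in range(11_000)],
)

cursor = conn.execute("select * from customersfull")
sample = cursor.fetchmany(SAMPLE_LIMIT)   # only the first rows are loaded for display
total = conn.execute("SELECT COUNT(*) FROM customersfull").fetchone()[0]
print(len(sample), "rows in the sample out of", total)
conn.close()
```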

5. CLOSE THE DATASET
Close the dataset to confirm that it was automatically saved.
a. To close the dataset, in the upper right corner of the window, click the X symbol.

b. The home page is displayed. Click the DATASETS tab and confirm that the new dataset is there.

c. To grant data access to other collaborators, hover over the Customers Full dataset.
Click the Share Dataset button.

Share the dataset with DataPrep_US and Operator System then click the CONFIRM button.

Adding new database types


By default, Talend Data Preparation offers connectivity to the MySQL, Derby, PostgreSQL, SQL Server and Azure SQL databases.



You can install additional JDBC drivers to connect to other databases.
The JDBC driver installation process is not in the scope of this course. To read more about it, you can use the links available on the
Review page.

Applying a preparation
The new dataset has the same structure as the CSV file you used earlier. Therefore, you can apply the same preparation.
1. APPLY THE PREPARATION
Open the new dataset and apply the preparation you created earlier.
a. On the DATASETS tab, open the Customers Full dataset.
b. On the toolbar, click the preparation icon as shown in the screenshot.

c. The window lists the compatible preparations. Click Customers Preparation.

The preparation is applied to the dataset.

2. UPDATE SEMANTIC TYPES
Some semantic types may not be recognized correctly, which may impact the efficiency of some steps in the recipe.
If needed, update the semantic types and move the update steps to the top of the recipe.



a. In the column header for the column, click the down arrow and select the correct semantic type.

b. Hover over the last step in the recipe until a handle symbol appears on its left, then use it to move the step to the top
of the recipe.

3. SAVE THE PREPARATION

Close the preparation and save it in the US directory.
a. To close the preparation, in the upper right corner of the window, click the X.

b. Select the US directory and click the SAVE IT button to save the preparation by its default name.

Next step
You will continue working on this large dataset, learning about the use of selective sampling features.



Using selective sampling

Task outline

By default, the Data Preparation web UI displays a data sample of a maximum of 10,000 rows. Some features have been
introduced for datasets that exceed this limitation.
For instance, filters are applied only to the data sample, but what if a data preparator wants to set up a filter on all rows?
Selective sampling allows the data preparator to specify the sample with which to interact.
In this lesson, you will set up a one-click filter to display only rows with empty values. You will use selective sampling to
select more rows that match the current filter, and you will correct all invalid data.

Filtering the whole preparation on invalid or empty values


1. OPEN THE PREPARATION
On the PREPARATION tab, in the US subdirectory, open Customers Full Preparation.

The first 10,000 rows are displayed.


2. DISPLAY THE DATASET PARAMETER
a. To expand the dataset parameters, on the toolbar, click the settings icon as shown in the screenshot.

This shows that the dataset exceeds the sample limitation. Only 10,000 rows are displayed, but the entire dataset is
kept intact.

NOTE:
You can set another value by editing the dataset.records.limit parameter in the application.properties file. Keep
in mind that a higher value might degrade application performance.

b. To close the panel, click the settings icon.


3. SET UP THE FILTER
You can see preset filters in the upper left area of the grid.
a. In the upper left part of the grid, click the menu icon as shown in the screenshot.

b. On the list of preset filters, click Display rows with empty values.

The filter is applied to the sample. The grid displays fewer than 10,000 rows. A FETCH MORE button appears next to
the number of displayed rows.



c. To display more rows with empty values, click the FETCH MORE button.
The FETCH ADDITIONAL ROWS window opens while Talend Data Preparation retrieves more rows from the dataset.

The process stops when 10,000 rows are reached, or at the end of the dataset.

Using the FETCH MORE button with one of the preset filters allows you to display all the rows that potentially need
rework in the same sample. Then you can use data quality bars to profile data issues column by column.

TIP: The FETCH MORE button is not restricted to invalid or empty rows. Use it to fetch all rows that
match the current filter, whatever the filter is. Keep in mind that it never fetches more than 10,000 rows.
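
The difference between filtering the default sample and fetching more rows can be sketched in Python. This is a simplified model under stated assumptions, not the product's implementation: the default sample is taken to be the first 10,000 rows, and fetching more is modeled as scanning the whole dataset for matching rows up to the same limit. The function names and the synthetic data are hypothetical.

```python
from itertools import islice

SAMPLE_LIMIT = 10_000

def has_empty_value(row: dict) -> bool:
    """True when at least one cell in the row is empty."""
    return any(value.strip() == "" for value in row.values())

def plain_sample(dataset):
    """Default behavior: take the first rows, then filter what is on screen."""
    return [row for row in islice(dataset, SAMPLE_LIMIT) if has_empty_value(row)]

def selective_sample(dataset, predicate):
    """Fetch-more behavior: scan the whole dataset, keep matches up to the limit."""
    return list(islice((row for row in dataset if predicate(row)), SAMPLE_LIMIT))

# Synthetic 11,000-row dataset in which every 100th row lacks a last name.
dataset = [
    {"id": str(i), "last_name": "" if i % 100 == 0 else "Smith"}
    for i in range(11_000)
]

print(len(plain_sample(dataset)), len(selective_sample(dataset, has_empty_value)))
```

In this model, the rows beyond the first 10,000 are only reached by the selective sample, which is why it returns more matching rows.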

Updating empty cells


The data quality bar shows that some values are missing in the last_name column.
1. DISPLAY MISSING VALUES
Click the white area of the data quality bar and click Select rows with empty values for last_name.

Only five rows are displayed on the grid.

2. REPLACE EMPTY VALUES
Use the Fill Empty Cells with Text function to copy the content of the Name column to the empty cells of the last_name
column.
a. The last_name column should be selected; if not, click the column header.
b. Use the search bar to search for the Fill Empty Cells with Text function.

c. In the Use with drop-down list, select Other column.



d. In the Column drop-down list, select Name.

e. Click SUBMIT.

The function is applied to only the selected rows.


The grid is empty because there are no longer empty cells in the last_name column.

f. To again display the data, remove the filter on the last_name column by clicking the X symbol.

The number of rows displayed does not change, because the rows you modified contain other data issues.
g. Close the data preparation by clicking the X symbol in the upper right corner of the window.
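
The effect of filling empty cells from another column can be pictured with a few lines of Python. This is a sketch of the idea only; the helper name and sample rows are made up, and it assumes that a cell counts as empty when it holds an empty string.

```python
rows = [
    {"Name": "Ann Lee", "last_name": "Lee"},
    {"Name": "Bob Ray", "last_name": ""},      # empty cell to be filled
    {"Name": "Cy Orr", "last_name": "Orr"},
]

def fill_empty_from_column(rows, target, source):
    """Copy the source column into the target column wherever the target is empty."""
    for row in rows:
        if row[target] == "":
            row[target] = row[source]
    return rows

fill_empty_from_column(rows, target="last_name", source="Name")
print([row["last_name"] for row in rows])
```

Cells that already contain a value are left untouched; only the empty ones receive the content of the other column.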

Next step
Now you will learn about export features.



Exporting preparations

Task outline

You already exported a 1,000-row preparation in a CSV file. In this lesson, you will explore additional features available
when exporting a preparation from a large dataset. You will start by exporting a sample of the data, then export all data in
a single CSV file.

Export the filtered data sample


1. OPEN THE PREPARATION
Reopen the Customers Full preparation and confirm that the filters you set up in the previous exercise were saved.
a. On the PREPARATION tab, click Customers Full Preparation.
b. Confirm that the grid displays the sample you set up earlier.

2. EXPORT THE FILTERED SAMPLE


Export the rows with invalid and empty values in a single CSV file.
a. Click the EXPORT button.

The EXPORT window opens.

b. To export only the sample, select Current sample. The Apply filters slider must be activated.

NOTE:
The Current sample option must be checked to export the sample and not the whole dataset. This option has an
impact only when there are more than 10,000 filtered rows (not the case here).
The Apply filters slider must be activated to export only rows with empty values.

c. Select the radio button for Local CSV file. Additional options appear.
In the Delimiter text box, select Comma.
In the Filename text box, enter Customer_Full_Preparation_Sample
The EXPORT window must be configured as in the screenshot.

d. Click CONFIRM.
3. CHECK THE EXPORTED DATA



Navigate to the default downloads folder and open the file with Notepad++.

Export the data sample without filters


1. EXPORT THE WHOLE SAMPLE
Use the same process to export the 10,000 rows of the current sample, with no filter applied.
a. Click the EXPORT button.

The EXPORT window opens.


b. To export the whole sample, use these settings:

The Current sample option must be checked to export the sample and not the whole dataset.
The Apply filters slider must be deactivated.

c. Select the Local CSV file option. Additional options appear.
For Delimiter, select Comma.
In the Filename text box, enter Customer_Full_Preparation_SampleNoFilter
Configure the EXPORT window as in the screenshot.

d. Click CONFIRM.
2. CHECK THE EXPORTED DATA
Navigate to the default downloads folder and open the file with Notepad++.

Export all data with filters


1. SET UP NEW FILTERS
Remove the filter on rows with empty values and set up another filter to keep customers who have gmail addresses.
a. To remove the rows with empty values filter, click the X symbol.

b. To set up the new filter, in the Add a filter text box, enter gmail.

Filter suggestions are displayed in a drop-down list.



c. Click gmail in email to activate it.

The sample displays only customers who have a gmail address.

2. EXPORT ALL DATA


Export all filtered rows of the whole dataset in a single CSV file.
a. Click the EXPORT button.

The EXPORT window opens.

b. To export only filtered rows, set up the export as in the screenshot.

For a full export, select the All data option.


Activate the Apply filters slider.
Select the Local CSV file option with the Comma separator.
In the Filename text box, enter Customer_Full_Preparation
c. Click CONFIRM and wait for the extraction to be completed.
When the file is ready, a green circle appears to the right of the Export history icon.

The file is kept in memory and must be downloaded on demand. To do so, click the Export history icon.
d. The EXPORT HISTORY page opens. It lists all the full exports processed for a given preparation.
For now, only one export is listed.
To display the export details, click the arrow icon on the right.



Details are displayed.

To collapse the section, click the arrow icon on the right.


e. To download the file, click the download icon.

3. CHECK THE EXPORTED DATA


Navigate to the default downloads folder and open the file with Notepad++.

Export all data without filters


1. EXPORT ALL DATA
Use the same process to export all rows in a single CSV file, with no filter applied.
a. To browse back to the preparation, click the back arrow.

b. Click the EXPORT button again.

The EXPORT window opens.

c. To export all rows, set up the export as in the screenshot.

For Export to CSV, select All data.


Deactivate the Apply filters slider.
Select the Local CSV file option with the Comma delimiter.
Add NoFilter to the end of the default Filename.
d. Click CONFIRM and wait for the extraction to be completed.
When the file is ready, a green circle appears to the right of the Export history icon.

Click the Export history icon.


e. A second extraction is listed at the top of the EXPORT HISTORY page.

NOTE:
For a given export format, only the latest preparation export is available for download in the
EXPORT HISTORY page.

f. To download the file, click the download icon.


2. CHECK THE EXPORTED DATA



Navigate to the default downloads folder and open the file with Notepad++.
3. CLOSE THE PREPARATION
Close the preparation and log out.
a. To close the preparation, click the X symbol in the upper right corner of the window.

b. To log out, in the upper right corner, click John Smith, then click the Logout button.
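
The four exports performed in this section combine two choices: the scope (current sample or all data) and whether filters are applied. The Python sketch below models those combinations on a synthetic dataset. The option names, the function, and the data are illustrative assumptions; the sample is taken to be the first 10,000 rows, as it is before any additional rows are fetched.

```python
SAMPLE_LIMIT = 10_000

def rows_to_export(sample, dataset, predicate, scope, apply_filters):
    """Pick the rows an export would contain.

    scope is "current_sample" or "all_data" (illustrative names only).
    """
    rows = sample if scope == "current_sample" else dataset
    return [r for r in rows if predicate(r)] if apply_filters else list(rows)

# Synthetic 11,000-row dataset; one address in four is a gmail address.
dataset = [
    {"email": f"user{i}@{'gmail.com' if i % 4 == 0 else 'example.com'}"}
    for i in range(11_000)
]
sample = dataset[:SAMPLE_LIMIT]   # what the grid holds before fetching more rows

def is_gmail(row):
    """Same idea as the gmail in email filter."""
    return "gmail" in row["email"]

counts = {
    (scope, apply_filters): len(
        rows_to_export(sample, dataset, is_gmail, scope, apply_filters)
    )
    for scope in ("current_sample", "all_data")
    for apply_filters in (True, False)
}
for key, count in counts.items():
    print(key, count)
```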

Next step
You have almost finished this section. Time for a quick review.

Review
In this lesson, you learned how to manage large data volumes in Talend Data Preparation.
First you imported a dataset from a MySQL database and applied a preparation to it. Then you used selective sampling to create an
ad-hoc sample, filtering rows from the whole dataset. You exported the sample data and the whole dataset in CSV files, with or
without filters.

More information
Talend documentation:
Working with JDBC datasets
Working on large datasets



LESSON 5
Using Talend Dictionary Service
This chapter discusses:

Concepts
Overview
Discovering Talend Dictionary Service
Creating a Dictionary Semantic Type
Creating a Regular Expression Semantic Type
Creating a Compound Semantic Type
Review
Concepts

A Data Management license is necessary to use Talend Dictionary Service.
Talend Dictionary Service can be connected with Talend Data Stewardship.

Data can be aggregated from several sources. Business users may not be familiar with all of these sources.
Data discovery is another way to prepare data for business users and let them visually browse data to
uncover patterns, trends, and nuances.

When creating a dictionary semantic type, you must specify whether or not it is used for validation.



You will create several semantic types in the exercise.

Data Preparation and Dictionary Service have a similar structure. A Kafka server is used for internal messaging
between the two blocks.
Semantic type management is accessible directly in the Data Preparation web UI for authorized users.

Data Preparation and Dictionary Service can be installed on different machines.
All the components are installed on your local VM.

Only Data Preparation users set up with a Data Management type in TAC can see and access the
SEMANTIC TYPES tab.

To simplify semantic type creation, two files are available on your VM: the first contains a list of values
and the second a regular expression.



Overview

Use case
In earlier lessons, you worked with semantic types. Each column of the preparation can be associated with a semantic type. Talend
Data Preparation automatically recognizes some of them.
That was the case with the EMAIL, LAST_NAME, and STATE columns in your preparation. You saw that data operators can easily
change semantic types by clicking the column header.

Applying a semantic type to a column helps identify cells that do not conform to the values or data patterns expected by the semantic type.
The list of available semantic types is provided by Talend Dictionary Service. The semantic types are stored, along with their formats
and values, in a MongoDB database.
The dictionary server communicates with Talend Data Preparation using Apache Kafka, an open source messaging system.
After content analysis, Talend Dictionary Service can assign the correct semantic type to each column in a preparation.
All the necessary modules are installed locally on your virtual training machine. The dictionary server is installed as a Windows service
along with Kafka. Both services are already running.
Note that the features described in this lesson are not restricted to local modules and can be used with different types of network
architecture.
The semantic types stored in the MongoDB database can be updated from the command line.
Here is a diagram of the modules involved:

In Talend Dictionary Service, there are three categories of semantic types:
Regular expression types are based on data patterns
Dictionary types are based on a list of values
Compound types are created by grouping several existing semantic types
In this lesson, you will use the web UI to create a semantic type in each of these three categories and to update one of them.
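To make the idea of content analysis more concrete, here is a minimal sketch of how a column's values could be scored against candidate semantic types. It is not the Dictionary Service's actual algorithm: the candidate names, the sample values, the email pattern, and the 0.5 threshold are all assumptions made for illustration.

```python
import re

# Hypothetical candidate types; names and values are illustrative only,
# not the catalog shipped with Talend Dictionary Service.
STATE_CODES = {"CA", "NY", "TX", "FL"}
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

CANDIDATES = {
    "US_STATE_CODE": lambda v: v in STATE_CODES,            # dictionary type
    "EMAIL": lambda v: EMAIL_PATTERN.match(v) is not None,  # regular expression type
}


def match_ratio(values, predicate):
    """Fraction of non-empty values accepted by the predicate."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return 0.0
    return sum(1 for v in non_empty if predicate(v)) / len(non_empty)


def discover(values, threshold=0.5):
    """Return the best-matching type name, or None if nothing reaches the threshold."""
    best_name, best_ratio = None, 0.0
    for name, predicate in CANDIDATES.items():
        ratio = match_ratio(values, predicate)
        if ratio > best_ratio:
            best_name, best_ratio = name, ratio
    return best_name if best_ratio >= threshold else None


print(discover(["CA", "NY", "", "TX", "ZZ"]))  # US_STATE_CODE
print(discover(["red", "green", "blue"]))      # None
```

A compound type would fit the same shape: its predicate accepts a value when any of its child predicates does.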

Objectives
After completing this lesson, you will be able to:
List all semantic types available in Talend Dictionary Service
Create a dictionary semantic type and apply it to a column in your preparation
Add new values to this semantic type
Create a regular expression semantic type and apply it to a column in your preparation
Create a new dictionary and group it with another dictionary in a compound semantic type

Next step
You are ready to learn about Talend Dictionary Service.



Discovering Talend Dictionary Service

Task outline

You will explore Talend Dictionary Service, which is installed locally in your training environment, by viewing the services
installed on the machine.
You will review the user rights needed to communicate with Talend Dictionary Service. Then you will use the system user
you created earlier to connect to Data Preparation and access the semantic types tab.

Confirm that services are running


1. OPEN THE LIST OF WINDOWS SERVICES
On the Windows taskbar, click the Services button.

2. VERIFY THAT DICTIONARY SERVICE IS RUNNING


Search for Talend Dictionary Service and verify that the status is Running.

You can also see the other services used for the solution: Talend Data Preparation Server, Talend Kafka, and Talend
MongoDB.

Reviewing user privileges


In the first lesson, you set up a system user in Talend Administration Center (TAC).
The [email protected] user has the following privileges needed to access the semantic types tab in Data Preparation:

A Data Management type with Operation Manager role
A Data Preparation user with Data Preparator role

Displaying the semantic types menu in Data Preparation


1. CONNECT TO DATA PREPARATION
As the system user, connect to the Data Preparation web console.
Enter these credentials:
Email: [email protected]
Password: talend
Click LOG IN.



2. DISPLAY THE HOME PAGE
The home page appears. Notice the SEMANTIC TYPES tab.

3. ACCESS SEMANTIC TYPES


To access the list of semantic types available in the Dictionary Service, click the SEMANTIC TYPES tab.
The list of semantic types appears.
4. OPEN A DICTIONARY SEMANTIC TYPE

a. Click the Airport semantic type.

The type is designated as Dictionary and the Use for validation slider is deactivated. Therefore, this semantic type
is based on a list of values and used for discovery only.
The list of values is displayed at the bottom of the page.

b. Click the CANCEL button to navigate back to the previous page.


5. OPEN A REGULAR EXPRESSION SEMANTIC TYPE



a. Click the Amex Card semantic type.

The type is designated as Regular expression and the Use for validation slider is activated. Therefore, this
semantic type is based on a data pattern and used for discovery and validation.
The regular expression is displayed at the bottom of the page.

b. Click the CANCEL button to navigate back to the previous page.


6. OPEN A COMPOUND SEMANTIC TYPE

a. Browse down the list and click the North American state code semantic type.

The type is designated as Compound type and the Use for validation slider is activated.
This compound semantic type groups two other dictionaries to create a list of Canadian and American codes.

b. Click the CANCEL button to navigate back to the previous page.

Next step
You are ready to create your own semantic type.



Creating a Dictionary Semantic Type

Task outline

In this lesson, you will create a dictionary for the REGION column. The region labels are listed in a source file that you will
upload during the semantic type creation process. In a second step, you will manually add values.

Examining the REGION column


1. OPEN THE CUSTOMERS PREPARATION
Still logged in as the Operator System user, open the US directory and the Customers Preparation, the first one you
created from a dataset of 1,000 rows.

2. DISPLAY AVAILABLE SEMANTIC TYPES


Display the semantic types available for the REGION column.
a. Locate and select the REGION column.

b. In the CHART section, examine the available values.

c. To display the list of semantic types suggested by Talend Dictionary Service, in the column header, click the menu
icon.

Notice that none of the listed semantic types corresponds to regions.


3. CLOSE THE PREPARATION
To close the preparation, click the X in the upper right corner of the window.

Creating the dictionary semantic type


You can create the new dictionary by uploading the list of values.
1. ADD A NEW DICTIONARY
First, you must access the semantic types interface.



a. Click the SEMANTIC TYPES tab.

b. Click the ADD SEMANTIC TYPE button.

c. A new GENERAL window opens.

For Name, enter Region.


For Description, enter Region name.
Change the Type field to Dictionary.

d. To use this dictionary for data validation, keep the Use for validation slider activated.
To ease the validation process by ignoring punctuation, white spaces, case, and accents, change the Validation
criterion field to Simplified text (most permissive).



e. To load the region names, click the Import values from a file icon.

f. Browse to the C:\StudentFiles\DataPrep\source directory.


Select Regions.txt and click the Open button.

Region names are uploaded and appear in the Values box.

g. To be able to immediately use this new semantic type in Data Preparation, click the SAVE AND PUBLISH button.
The Region dictionary is added to the list.

NOTE:
In the text file used to load values, non-alphabetical values must be enclosed in quotes.
It is possible to load synonyms by using multiple values on the same row. In this case, values must be separated
by commas.
When importing the file, a deduplication process is performed automatically.
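The file rules in the note above can be pictured with a short sketch. It assumes the values file behaves like a comma-separated file with double-quote quoting and that deduplication means dropping exact repeats; the sample content and the `load_values` helper are hypothetical, and the real Regions.txt is not reproduced here.

```python
import csv
import io

# Illustrative content only: one value per row, synonyms on the same row,
# and a non-alphabetical value enclosed in quotes.
SAMPLE = '''North
South
East,Eastern
"N/A"
South
'''


def load_values(text):
    """Parse rows of comma-separated values and drop exact duplicates."""
    seen = set()
    values = []
    for row in csv.reader(io.StringIO(text)):
        for raw in row:
            value = raw.strip()
            if value and value not in seen:
                seen.add(value)
                values.append(value)
    return values


print(load_values(SAMPLE))
# ['North', 'South', 'East', 'Eastern', 'N/A']
```

The sketch flattens synonyms into a single list; whether the Dictionary Service keeps them grouped per row is not stated in this guide.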

2. TEST THE NEW SEMANTIC TYPE


Open the Customers Preparation and check the semantic type of the REGION column.
a. From the PREPARATIONS tab, open Customers Preparation.



b. Locate the REGION column.

The new semantic type is available and automatically selected.

Updating the dictionary semantic type


1. ANALYZE INVALID VALUES
Below the column header, the data quality bar shows invalid values.

This is because some region labels are missing from the dictionary. A value that exists in the column but not in the semantic type is considered
invalid.
To filter the sample on invalid values, below the REGION column header, click the orange bar on the data quality bar and
click Select rows with invalid values for REGION.

The West region must have been omitted when creating the semantic type.

2. CLOSE THE PREPARATION


To close the preparation, click the X in the upper right corner of the window.

3. DISPLAY THE VALUES OF THE REGION SEMANTIC TYPE


You can use the search engine to open the Region dictionary.
a. In the upper bar, click the Toggle search input icon.

b. In the search text box, enter Region.

After a while, the search engine displays a list of matching results.



c. Click the Region semantic type.
The dictionary opens. There is no West region on the list.

4. UPDATE THE SEMANTIC TYPE


You can manually add the missing region.

a. In the upper right corner of the list of values, click the Add item icon.

b. In the text box, enter west.

TIP:
As this dictionary validates data using the simplified text criterion, there is no need to type the first letter of the
region name in the correct case.

c. Click the Validate and Add icon.

d. To be able to immediately use the updated semantic type in Data Preparation, click the SAVE AND PUBLISH button.

5. CONFIRM THE RESULTS IN DATA PREPARATION


Remove the filter on invalid region names and determine whether the update has been taken into account in the
preparation.



a. From the PREPARATIONS tab, open Customers Preparation.

b. Remove the filter on rows with invalid values by clicking the X next to it.

The REGION column no longer contains invalid values.
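To see why adding west in lowercase was enough, here is a minimal sketch of a simplified-text comparison. It assumes the criterion amounts to removing accents, punctuation, and white space and then ignoring case; the exact normalization applied by the Dictionary Service is not documented in this guide, and the region values other than West are made up.

```python
import string
import unicodedata


def simplify(text):
    """Drop accents, punctuation, and white space, then ignore case."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    kept = [c for c in no_accents
            if c not in string.punctuation and not c.isspace()]
    return "".join(kept).casefold()


def invalid_values(column, dictionary):
    """Column values whose simplified form is not in the dictionary."""
    allowed = {simplify(v) for v in dictionary}
    return [v for v in column if simplify(v) not in allowed]


# Illustrative data, not the actual content of the REGION column.
region_column = ["North", "south ", "WEST", "West", "Nord-Est"]
region_dictionary = ["North", "South", "Nord Est"]

print(invalid_values(region_column, region_dictionary))             # ['WEST', 'West']
print(invalid_values(region_column, region_dictionary + ["west"]))  # []
```

With the Exact value criterion, the comparison would skip `simplify` and only the entry typed exactly as it appears in the data would match.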

Next step
You are ready to create a regular expression semantic type.

Creating a Regular Expression Semantic Type

Task outline

In this lesson you will create a new column in the preparation. Then you will create a regular expression semantic type to
validate the pattern followed by the values of the new column.

Creating the CUSTOMER_CODE column


The new CUSTOMER_CODE column must contain a concatenation of the STATE and ID columns.
1. CONCATENATE THE DATA
Select the STATE column and set up the Concatenate with function. The two columns must be connected with a hyphen
(-).
a. Click the STATE column and wait for the list of functions on the right to refresh.
b. Use the search bar to find the Concatenate with function.

c. Click the function.


d. For Use with, leave the ID column selected by default.
For Separator, enter a hyphen (-).



e. Click the SUBMIT button.
The new column is created with concatenated data. It appears as a text column with no associated specific
semantic type.

2. RENAME THE NEW COLUMN


The column must be renamed CUSTOMER_CODE.

a. To rename the column, click the menu icon on the column header and select Rename Column.

b. In the New name field, enter CUSTOMER_CODE and click the SUBMIT button.

3. CLOSE THE PREPARATION


To close the preparation, click the X in the upper right corner of the window.

Creating the regular expression semantic type


A new semantic type must be created to validate the pattern of the customer codes.

TIP:
The customer code pattern is a valid two-letter American state code followed by a hyphen (-) and an integer of one to nine digits.
Customer codes are invalid if data in the STATE or ID columns is invalid or missing.
The regular expression to use is:
^(A[KLRZ]|C[AOT]|DE|FL|GA|HI|I[ADLN]|K[SY]|LA|M[ADEINOST]|N[CDEHJMVY]|O[HKR]|PA|RI|S[CD]|T[NX]|UT|V[AT]|W[AIVY])-([0-9]{1,9})$
You can copy it from the CustomerCode.txt file in the C:\StudentFiles\DataPrep\source directory.
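If you want to try the pattern outside Data Preparation, the following sketch applies it with Python's `re` module. The `customer_code` helper and the sample rows are hypothetical; the sketch also assumes that concatenating an empty STATE still emits the separator, which this guide does not confirm.

```python
import re

# Pattern copied from the TIP above, split over two lines for readability.
CUSTOMER_CODE = re.compile(
    r"^(A[KLRZ]|C[AOT]|DE|FL|GA|HI|I[ADLN]|K[SY]|LA|M[ADEINOST]|"
    r"N[CDEHJMVY]|O[HKR]|PA|RI|S[CD]|T[NX]|UT|V[AT]|W[AIVY])-([0-9]{1,9})$"
)


def customer_code(state, customer_id):
    """Mimic 'Concatenate with' using a hyphen separator (assumed behavior)."""
    return f"{state}-{customer_id}"


def is_valid(code):
    """True when the whole string matches the customer code pattern."""
    return CUSTOMER_CODE.fullmatch(code) is not None


rows = [("CA", "1042"), ("", "17"), ("ZZ", "5"), ("TX", "")]  # made-up rows
for state, customer_id in rows:
    code = customer_code(state, customer_id)
    print(f"{code!r}: {'valid' if is_valid(code) else 'invalid'}")
# 'CA-1042': valid
# '-17': invalid
# 'ZZ-5': invalid
# 'TX-': invalid
```

The last three rows show the cases the TIP describes: a missing state, an unknown state code, and a missing ID all produce a code that fails the pattern.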

1. ADD A NEW REGULAR EXPRESSION SEMANTIC TYPE


Access the semantic types interface.



a. Click the SEMANTIC TYPES tab.

b. Click the ADD SEMANTIC TYPE button.

c. The new GENERAL window opens.

For Name, enter Customer code.
For Description, enter Customer code pattern.
Keep the Type field set on Regular expression.
To use this semantic type for data validation, keep the Use for validation slider activated.

d. Copy the regular expression from the C:\StudentFiles\DataPrep\source\CustomerCode.txt file and paste it in
the Validation pattern box.

e. To be able to immediately use this new semantic type in Data Preparation, click the SAVE AND PUBLISH button.

The Customer code semantic type is added to the list.



2. TEST THE NEW SEMANTIC TYPE
Open the Customers Preparation and check the semantic type of the CUSTOMER_CODE column.
a. From the PREPARATIONS tab, open Customers Preparation.

b. Locate the CUSTOMER_CODE column.

The new semantic type is available and automatically selected. If this is not the case, manually select the new semantic
type.

The data quality bar shows that about 10% of the values do not match the regular expression.

Next step
Now you will create a compound semantic type.



Creating a Compound Semantic Type

Task outline

In this lesson you will fill empty cells in the STATE column with a single value. Then you will create a dictionary with this
value and combine it with the US State Code dictionary in a compound semantic type.
This is a good way to add values to a dictionary while keeping the original untouched. You will not impact other
preparations for which the original dictionary is used.

Updating the STATE column


Fill in the empty cells of the STATE column with the N/A value.
1. FILL EMPTY CELLS
Select the STATE column and use the search bar to find the Fill empty cells with text... function.
a. Click the STATE column and wait for the list of functions on the right to refresh.
b. Use the search bar to find the Fill empty cells with text... function.

c. Click the function.

d. In the Value field, enter N/A.

e. Click the SUBMIT button.

2. CHECK THE RESULT


The STATE column is updated and no longer contains empty cells.
Notice that the N/A value is considered invalid.

3. CLOSE THE PREPARATION


To close the preparation, click the X in the upper right corner of the window.

Creating a new dictionary for the N/A value


Create a new dictionary and manually add the N/A value.
1. CREATE A DICTIONARY
Create a dictionary named No state. This dictionary must validate data using the most restrictive validation criterion.



a. Click the SEMANTIC TYPES tab.

b. Click the ADD SEMANTIC TYPE button.

c. A new GENERAL window opens.

For Name, enter No state.


For Description, enter No state code.
Change the Type field to Dictionary.

d. To use this dictionary for data validation, keep the Use for validation slider activated.
To use the most restrictive validation criterion, keep the Validation criterion field set to the default value Exact
value (most restrictive).

2. ADD THE VALUE MANUALLY


This dictionary will contain only one value: N/A.
a. In the upper right corner of the list of values, click the Add item icon.



b. In the text field, enter N/A.

TIP:
This dictionary validates data using the exact value criterion, so make sure that you enter the correct case and
punctuation.

c. Click the Validate and Add icon.

d. To be able to immediately use the new dictionary in Data Preparation, click the SAVE AND PUBLISH button.

The No state semantic type is added to the list.

Creating a new compound semantic type


Create a compound semantic type to associate the original US states dictionary with the new one you created.
1. CREATE A COMPOUND SEMANTIC TYPE
Create a compound semantic type called US state code and more, which will be used for data validation.

NOTE:
Compound semantic types use validation criteria of their children types.

a. Click the ADD SEMANTIC TYPE button.

b. A new GENERAL window opens.

For Name, enter US state code and more.


For Description, enter US state code and the N/A value.
Change the Type field to Compound type.



c. To use this compound type for data validation, keep the Use for validation slider activated.

2. SPECIFY THE CHILDREN TYPES


Select the US State Code and No state dictionaries.
a. On the Children types drop-down list, select US State Code.

b. Select No state.

TIP:
You can type the first letters of the dictionary to display a short list of matching dictionary names.

c. To be able to immediately use the compound semantic type in Data Preparation, click the SAVE AND PUBLISH button.

The compound semantic type is added to the list.

3. CONFIRM THE RESULTS IN DATA PREPARATION
Confirm that the semantic type of the STATE column has been updated.
a. From the PREPARATIONS tab, open Customers Preparation.

b. Locate the STATE column.

The semantic type of the STATE column has been updated and the N/A value is no longer considered invalid.
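The whole exercise can be summarized in a small sketch: a compound type accepts a value as soon as one of its children accepts it, and each child applies its own criterion. The shortened state list, the function names, and the exact-match behavior assumed for the US State Code dictionary are illustrative assumptions, not details taken from the product.

```python
# Shortened stand-in for the built-in US State Code dictionary.
US_STATE_CODES = {"CA", "NY", "TX"}
# The single-value dictionary created in this section.
NO_STATE = {"N/A"}


def matches_us_state_code(value):
    # Assumption: exact match. The criterion actually configured on the
    # built-in dictionary is not stated in this guide.
    return value in US_STATE_CODES


def matches_no_state(value):
    # Exact value (most restrictive): case and punctuation must match.
    return value in NO_STATE


def matches_compound(value, children):
    """A compound type accepts a value if any child type accepts it."""
    return any(child(value) for child in children)


def fill_empty(column, text):
    """Mimic 'Fill empty cells with text...'."""
    return [value if value else text for value in column]


state_column = fill_empty(["CA", "", "TX", "n/a"], "N/A")
children = [matches_us_state_code, matches_no_state]

for value in state_column:
    print(value, matches_us_state_code(value), matches_compound(value, children))
# CA True True
# N/A False True
# TX True True
# n/a False False
```

The last row shows the effect of the exact value criterion on the No state child: a lowercase n/a remains invalid.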

Next step
You have almost finished this section. Time for a quick review.



Review
In this lesson, you learned how to interact with Talend Dictionary Service.
First you browsed the semantic types list and displayed semantic types in the dictionary, regular expression, and compound
categories. Then you created a dictionary semantic type and applied it to your preparation. You updated the dictionary with an additional
value to reduce the number of invalid rows. You created a regular expression semantic type to validate the pattern of a column in
your preparation. Finally, you created a compound semantic type, associating one of your personal dictionaries with a standard
dictionary.

More information
Talend documentation:
Enriching semantic types

LESSON 6
Using DI for Data Preparation
This chapter discusses:

Concepts
Overview
Publishing a Dataset to Data Preparation
Executing a Preparation in Talend Studio
Challenge
Solution
Review
Concepts

A DI Job is created in Talend Studio. It extracts data
from a MySQL database and creates a dataset in Data
Preparation.

A second DI Job is created in Talend Studio. It extracts
data from a MySQL database, executes a preparation
retrieved from the Data Preparation server on the
extracted data, and exports the resulting data to a
CSV file.

The tDataprepRun component is used to execute a
data preparation from a DI Job in Studio.
tDataprepRun can also be used in Big Data Jobs. To learn
more, register for the Talend Data Preparation on Big
Data course.



The tDatasetOutput component is used to create a
dataset in a DI Job.
In this exercise, you will use Batch mode to create a
dataset that is updated every time the Job is executed.
You will use Live mode in the next lesson.

The tDatasetOutput component uses user credentials
to create the dataset; the new dataset is available for
this specific user and can be shared with others.

The tDataprepRun component uses user credentials to
access a preparation on the Data Preparation server.
Make sure the user has permission to access the given
preparation.

The tDatasetInput component is used to access a dataset
on the Data Preparation server.

Overview

Use case
You can use Talend Data Preparation along with other Talend products.
For instance, you can use Data Preparation components in Talend Studio to:
Extract data from the database and publish it to Data Preparation for business users.
Execute data preparations and write the resulting data to output files without going through the Talend Data Preparation web
UI.
Access datasets created by business users in the web UI.

Objectives
After completing this lesson, you will be able to:
Use Data Integration to publish a dataset in Data Preparation
Execute a preparation in a Data Integration Job
Create a Job to read a dataset that a business user has uploaded to the Data Preparation server

Next step
You are ready to use Data Integration to publish a dataset to Data Preparation.

Publishing a Dataset to Data Preparation

Task outline

In this section, you will use the Talend Data Integration suite to empower business users to quickly access and cleanse
data stored in a database.
First you will create a new Talend Studio project. Then, from a predefined project, you will import a Job and its related
metadata. The purpose of this Job is to export the customer data from a database to a simple CSV file. To publish the
dataset to Data Preparation, instead of writing the data to a CSV file, you will duplicate this Job and modify it.
Here is a diagram of the process:

When publishing a dataset to the Data Preparation server, you can use either of these modes:
Batch mode: The dataset is stored on the Data Preparation server and updated every time the DI Job is
executed
Live mode: The dataset is not stored on the Data Preparation server side; it is updated on demand every time
the dataset is opened from Data Preparation
In this lesson, you will implement Batch mode. Live mode is covered in the next lesson.
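The difference between the two modes can be pictured with a small sketch. The `BatchDataset` and `LiveDataset` classes below are hypothetical stand-ins, not Talend APIs; they only illustrate when the data is refreshed in each mode.

```python
class BatchDataset:
    """Snapshot stored server-side; refreshed only when the Job runs."""

    def __init__(self, fetch_rows):
        self._fetch_rows = fetch_rows
        self._stored = []

    def run_job(self):
        # Each Job execution replaces the stored snapshot.
        self._stored = list(self._fetch_rows())

    def open(self):
        return self._stored


class LiveDataset:
    """Nothing stored; rows are fetched each time the dataset is opened."""

    def __init__(self, fetch_rows):
        self._fetch_rows = fetch_rows

    def open(self):
        return list(self._fetch_rows())


source = [{"ID": 1}]        # stand-in for the database table
batch = BatchDataset(lambda: source)
live = LiveDataset(lambda: source)

batch.run_job()
source.append({"ID": 2})    # the database changes after the Job ran

print(len(batch.open()))    # 1
print(len(live.open()))     # 2
```

In Batch mode, business users see the new row only after the Job is executed again.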

Create a project in Talend Studio


1. START TALEND STUDIO
On the Windows taskbar, click the Talend Studio icon as shown in the screenshot.

2. CREATE A PROJECT
Create a project called DataPrep.



a. Click the Create a new project radio button, and in the text box, enter DataPrep.

b. Click Create.
The project appears on the list of projects.

c. Click Finish to open the project.


3. ACCESS THE PROJECT
You may have to log in to TalendForge and then close the Welcome page to access the Integration perspective.
a. The Connect to TalendForge window may appear. Log in with your existing Talend account. If you don't have a
Talend account yet, you can create one or click Skip this Step.

NOTE:
Although you do not access the online community in this course, Talend recommends creating an account
from your installation environment and becoming an active member of the online community, which provides
several valuable resources.

When the initialization is complete, Talend Studio may display the Welcome page.
b. Click the Start now! button.



The Integration perspective is displayed.

If the Integration perspective is not showing, click the Integration button.

Import items from an existing project


1. IMPORT ITEMS
a. On the top menu bar, click the Import items button.

b. The Import items window opens.


Select the radio button for Select archive file and click the Browse . . . button.



c. In the C:/StudentFiles/DataPrep folder, select the DIStartProject.zip archive.

d. Click Open.
The list of available items is displayed.

The archive contains a prebuilt Job, database connection metadata, and file-delimited metadata.



e. Click Select All.

f. Click Finish.
The items are added to the Repository.

2. VIEW THE IMPORTED ITEMS
a. In the Repository, expand the Metadata folder.
Confirm that you have Db Connection and File delimited metadata defined.
The Db Connection is configured to connect to the local MySQL database.
The File delimited metadata describes the structure of the Customers data.

b. In the Repository, expand Job Designs.


Double-click the ExtractDataFromDB Job to open it.
The Job queries the local MySQL database to retrieve the Customers data and writes it to a CSV file using the file
delimited metadata.

3. TEST THE IMPORTED JOB


Run the imported Job and check the results.
a. Double-click the tFileOutputDelimited component to open the Component view.
Notice the output file name.



b. In the Run view, click the Run button.

c. Use Windows Explorer to confirm that the CustomersOut.csv file has been written to the C:/Temp folder.

d. Open the file using Notepad++ and view the contents of the Customers data extracted from the database.



Modify the existing Job
1. DUPLICATE THE EXISTING JOB
a. Right-click the ExtractDataFromDB Job and select Duplicate.

b. In the Input new name text box, enter PublishDatasetToDataPrep

c. Click OK.
The Job appears in the Repository.

d. Double-click the Job to open it.

2. DELETE A COMPONENT
Right-click the tFileOutputDelimited component and select Delete.

3. ADD A tDatasetOutput COMPONENT


Add a tDatasetOutput component.



a. On the Palette, expand Talend Data Preparation.

b. Drag the tDatasetOutput component into the work area.

c. Right-click the Local_MySQL component, select Row>Main, and click the output component.

d. The link between the components is created.

4. CONFIGURE THE tDatasetOutput COMPONENT


The tDatasetOutput component accepts four modes: Create, Update, CreateOrUpdate, and LiveDataset.
Create mode is used to publish a new dataset to Data Preparation in Batch mode.
Update mode is used to update an existing dataset in Data Preparation with new data in Batch mode.
CreateOrUpdate mode is used to update an existing dataset, or to create it if a dataset with the specified name
does not exist (in Batch mode for both possibilities).
LiveDataset mode is used to create a Job that publishes a dataset and keeps the data live once deployed in Talend
Administration Center (TAC). In this mode, you cannot run the job directly from Talend Studio. This corresponds to
Live mode, which is covered in the next chapter.
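The three Batch modes listed above can be summarized in a short sketch. The `DatasetStore` class is a hypothetical in-memory stand-in for the Data Preparation server, not the component's implementation. Two behaviors are assumptions because this guide does not state them: Create is rejected when the name already exists, and Update is rejected when it does not. The `limit` parameter is assumed to cap the number of rows published, like the Limit field configured below. LiveDataset mode is left out because it cannot be run directly from Talend Studio.

```python
class DatasetStore:
    """In-memory stand-in for datasets held by the Data Preparation server."""

    def __init__(self):
        self.datasets = {}

    def publish(self, name, rows, mode, limit=None):
        """Publish rows under a dataset name and return the number of rows kept."""
        rows = list(rows)[:limit]  # limit=None keeps every row
        exists = name in self.datasets
        if mode == "Create":
            if exists:
                # Assumption: creating over an existing name is an error.
                raise ValueError(f"dataset already exists: {name}")
        elif mode == "Update":
            if not exists:
                # Assumption: updating a missing dataset is an error.
                raise KeyError(f"dataset not found: {name}")
        elif mode != "CreateOrUpdate":
            raise ValueError(f"unsupported mode: {mode}")
        self.datasets[name] = rows
        return len(rows)


store = DatasetStore()
extracted = ({"ID": i} for i in range(1500))  # made-up rows

print(store.publish("Customers_From_DB", extracted, "Create", limit=1000))  # 1000
print(store.publish("Customers_From_DB", [{"ID": 1}], "CreateOrUpdate"))    # 1
```

CreateOrUpdate is the convenient choice for a Job that runs repeatedly, since it succeeds whether or not an earlier run already published the dataset.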



a. Open the Component view by double-clicking the tDatasetOutput component.

b. If there is a warning on the component, click the Sync columns button for the schema.
Click the Edit Schema button and verify that the output schema contains 12 columns.
To close the schema, click OK.

c. In the URL text box, enter "http://studentpc:9999"


In the Email text box, enter "[email protected]"
In the Password text box, enter "talend"
For Mode, select Create
In the Dataset Name text box, enter "Customers_From_DB"
In the Limit text box, enter 1000

d. To save the updated Job, press Ctrl+S.

Run the Job and view the results


1. RUN THE JOB
In the Run view, click the Run button.

2. CONNECT TO DATA PREPARATION


Connect to Data Preparation as Adam Brown.
For Email, enter [email protected]
For Password, enter talend
Click SIGN IN.



3. VIEW THE DATASETS
Open the list of datasets and verify creation of the Customers_From_DB dataset.
a. Click the Datasets button.
A dataset named CUSTOMERS_FROM_DB is added.

b. Click the CUSTOMERS_FROM_DB dataset to open it.

It contains 1,000 lines.

c. Close the dataset and log out.

Next step
Now you will learn how to execute a preparation in Talend Studio using Data Preparation components.



Executing a Preparation in Talend Studio

Task outline

In this section, you will use Talend Studio to create a Job that runs the preparation you created in the previous chapter
and then writes the results to a CSV file. You will choose the preparation version to execute.
This allows administrators and operators to automate tasks that business users used to do manually.
Here is a diagram of the Job:

Modify an existing Job


1. DUPLICATE THE JOB
Duplicate the ExtractDataFromDB Job as ExecuteDataPreparation then open the new Job.

a. Right-click the ExtractDataFromDB Job and select Duplicate.

b. In the Input new name text box, enter ExecuteDataPreparation



c. Click OK.
The Job appears in the Repository.

d. Double-click the Job to open it.

2. ADD A tDataprepRun COMPONENT


Between the Local_MySQL and tFileOutputDelimited components, add a tDataprepRun component.
a. Click the Row1 (Main) row and enter tDatap.

b. Select the tDataprepRun component and press the ENTER key.


The component is added between the Local_MySQL and tFileOutputDelimited components.

TIP:
To insert a component inside a data flow, you can also place the new component on the designer and manually
reorganize the links between components.

c. When asked if you want to get the schema of the target component, click the Yes button.
The link is created.

3. CONFIGURE THE COMPONENT


Configure the tDataprepRun component.
a. Double-click the tDataprepRun component to open the Component view.

b. Click the Edit Schema button.



Verify that the schema has been propagated.
To close the schema, click OK.

c. In the URL text box, enter "http://studentpc:9999"


In the Username text box, enter "[email protected]"
In the Password text box, enter "talend"
Click the Choose an existing preparation button.

d. The Select an existing preparation window opens.


Select Customers Preparation.

e. Click OK.
Confirm that for Preparation Id, "Customers Preparation" is selected.

f. This preparation has been modified by John Smith.


To select the version validated by Adam Brown, click Choose a Version.

g. The Set the version window opens.


Select version 1.



h. Click OK.
Confirm that for Version, "1" is selected.

i. Click the Fetch Schema button.


In the Confirm changes ? window, click the OK button.

j. To see how the schema was updated, click the Edit Schema button.

k. To close the schema editor, click OK.


4. UPDATE THE tFileOutputDelimited COMPONENT CONFIGURATION
Update the configuration of the tFileOutputDelimited component to write the data to CleanDBCustomersOut.csv.
a. Double-click the tFileOutputDelimited component to open the Component view.
b. In the File Name text box, enter "C:/Temp/CleanDBCustomersOut.csv"

c. If there is a warning on the output component, click the Sync Columns button.



Run the Job
1. RUN THE JOB
In the Run view, click the Run button.

2. VIEW THE RESULTS


In Windows Explorer, go to C:/Temp.
Confirm that CleanDBCustomersOut.csv has been created.

3. EXPLORE THE OUTPUT FILE

Using Notepad++, open CleanDBCustomersOut.csv.
Examine some fields to ensure that the preparation was applied.

NOTE:
The regions column does not appear in the export file, as you ran Adam Brown's version.
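The Job you just ran can be pictured as a three-stage pipeline: extract rows, apply the steps of the chosen preparation version, and write the result as delimited text. The sketch below is only an illustration of that flow; the step functions, the version contents, and the sample row are hypothetical and do not reproduce the actual Customers Preparation.

```python
import csv
import io


# Hypothetical preparation steps; each takes a row and returns a new row.
def uppercase_state(row):
    return {**row, "STATE": row["STATE"].upper()}


def strip_last_name(row):
    return {**row, "LAST_NAME": row["LAST_NAME"].strip()}


# Each version is the ordered list of steps that existed at that point.
PREPARATION_VERSIONS = {
    1: [uppercase_state],
    2: [uppercase_state, strip_last_name],
}


def run_preparation(rows, version):
    """Apply every step of the chosen version to each row, in order."""
    steps = PREPARATION_VERSIONS[version]
    for row in rows:
        for step in steps:
            row = step(row)
        yield row


def write_csv(rows, fieldnames):
    """Return the rows as delimited text (stand-in for the output file)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


extracted = [{"ID": "1", "LAST_NAME": " Doe ", "STATE": "ca"}]  # made-up row
print(write_csv(run_preparation(extracted, version=1), ["ID", "LAST_NAME", "STATE"]))
```

Selecting version 1 leaves out every step added in later versions, which is why changes made after that version do not show up in the export file.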

Next step
You have almost completed this lesson. Now you can test your knowledge with a challenge exercise.



Challenge

Exercise

Complete this exercise to reinforce your understanding of the lesson.

Read a Dataset from Data Preparation in Talend Studio


In this lesson, you executed a data preparation in a Talend Studio Job.
For this exercise, you will reuse this Job to execute the data preparation on the Customers dataset that a business user has manually
uploaded to the Data Preparation server. You must:
Duplicate the existing ExecuteDataPreparation Job and open it
Change the input Local_MySQL component to a tDatasetInput component
Configure the tDatasetInput component
Run the Job and check the results
Good luck, and if you get stuck, you can read ahead to a possible solution.

Solution

Solving the challenge

Here is a solution to the challenge exercise. Your solution may be slightly different, but still valid.

Read a dataset from Data Preparation in Talend Studio


Review the steps to execute the data preparation in Talend Studio on the Customers dataset that a business user manually
uploaded to the Data Preparation server.
1. Duplicate the existing ExecuteDataPreparation Job and name it ReadDataset.
2. In the new Job, delete the Local_MySQL component.
3. Add a tDatasetInput component and connect it to the tDataprepRun component using a Main row.
4. Configure the component as in the screenshot.

5. To get the schema, click the Fetch Schema button. When asked if you want to propagate the schema, click Yes.



6. In the tFileOutputDelimited component view, change the output file name to CleanInitialCustomersFile.csv

7. In the Run view, click Run.


8. View the file in the C:/Temp folder.

Use Notepad++ to open the file.


Examine some fields to ensure that the preparation was applied.

NOTE:
As in the previous exercise, the regions column does not appear in the export file.

Next step
You have almost finished this section. Time for a quick review.

Review
In this lesson, you created a Talend Studio project and imported a simple Job with related metadata items. The initial Job was simply
reading data from a database and writing it to a CSV file.
You duplicated the initial Job and modified it to publish the dataset to Data Preparation instead of writing the data to an output file. To
publish the data to the Data Preparation server, you used the tDatasetOutput component.
You again modified the initial Job to add an additional step: execute a specific version of a data preparation on the data extracted
from the database before writing the data to an output file containing the clean Customers data. To execute a data preparation in
Talend Studio, you used the tDataprepRun component.
You enhanced your knowledge with a challenge exercise. The objective was to build a Job that executes the data preparation on a
dataset manually uploaded by a business user to the Data Preparation server. You used the tDatasetInput component to read the
dataset in Talend Studio.

More information
Talend documentation:
Talend Data Integration Getting Started Guide
Talend Data Integration Studio User Guide
tDatasetOutput component
tDataprepRun component
tDatasetInput component



LESSON 7
Implementing a Live Dataset
This chapter discusses:

Concepts
Overview
Implementing Live Dataset Mode in Talend Studio
Deploying a Job in TAC
Creating a Dataset from a Talend Job
Review
Concepts

The dataset update is triggered from the Data Preparation web UI.

Developer creates data flow in Studio.


Administrator deploys data flow in TAC.
Authorized business user creates dataset from Job in
Data Preparation.

You will copy the previous Job and update the tDatasetOutput configuration.
Then you will export the Job in a Zip file to deploy it on TAC.
In Data Preparation, you will create the live dataset and update it on demand.



You can set up the tDatasetOutput component using either Batch or Live mode.
You used Batch mode in the previous lesson; now you will use Live mode.

When using tDatasetOutput in LiveDataset mode, the URL and limit parameters are set automatically.

The execution task prefix accepted by the Data Preparation server is set up in the application.properties file.

Adam Brown has administrator permissions.


Data is sent to the Data Preparation server through a
REST interface.

Overview

Use case
In the previous lesson, you used Talend Studio to publish a dataset to Data Preparation.
In this lesson you will implement the live dataset scenario.
To do this, you will:
Implement the live dataset option
Prepare your TAC environment to deploy the new Job: create a project, assign users with project authorizations, create a
local Job server, and create an execution task
Create a dataset from the deployed Job in the Data Preparation web UI
Here is a diagram of the interaction between Talend components in the live dataset scenario:

Objectives
After completing this lesson, you will be able to:
Implement the live dataset method in a Talend Job
Build a Talend Job and compress it in a Zip file
Create a project and assign project authorizations in TAC
Configure a local Job server in TAC
Create a task in TAC
Create a dataset from a deployed Talend Job in Data Preparation

Next step
You are ready to implement the live dataset option in Talend Studio.

Implementing Live Dataset Mode in Talend Studio

Task outline

When publishing a dataset to the Talend Data Preparation server, you can use one of two modes:
Batch mode: The dataset is stored on the Data Preparation server and updated every time the DI Job is
executed. If the Job is deployed in TAC, the update frequency depends on the configuration defined by the
operations manager in TAC.

Live mode: The dataset is not stored on the Data Preparation server side. It is updated on demand every time it
is opened from Data Preparation. This means the update frequency is not fixed and depends on business user
requests.
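The difference between the two modes can be modeled with a short sketch. This is a conceptual illustration only, not how the Data Preparation server is implemented: the class and method names are hypothetical, and the model ignores any caching the server may apply.

```python
class BatchDataset:
    """Batch mode model: rows are stored and replaced only when the Job runs."""

    def __init__(self):
        self._stored_rows = []

    def job_executed(self, rows):
        # Each Job execution overwrites the stored copy.
        self._stored_rows = list(rows)

    def open(self):
        # Opening the dataset returns the stored copy without running the Job.
        return list(self._stored_rows)


class LiveDataset:
    """Live mode model: nothing is stored; opening the dataset runs the Job."""

    def __init__(self, run_job):
        self._run_job = run_job  # callable that returns the current rows

    def open(self):
        # Every open triggers a fresh extraction.
        return list(self._run_job())
```

In this model, a change in the source data is visible in the batch dataset only after the next Job execution, whereas the live dataset reflects it the next time it is opened.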

In this section, you will use Talend Studio to duplicate and modify the PublishDatasetToDataPrep Job to implement the
live dataset publishing method.

Modify an existing Job


1. DUPLICATE THE JOB
Duplicate the PublishDatasetToDataPrep Job as LiveDataset then open the new Job.



a. Right-click the PublishDatasetToDataPrep Job and select Duplicate.

b. In the Input new name text box, enter LiveDataset

c. Click OK.
The Job appears in the Repository.

d. Open the Job by double-clicking it.

2. CONFIGURE THE INPUT COMPONENT


Configure the Local_MySQL component to read data from a different table, customersFull,
which contains 11,000 records (instead of 1,000 as in the previous customer table).



a. Open the Component view by double-clicking the Local_MySQL component.

b. In the Table Name text box, enter "customersFull"

c. Click the Guess Query button.


The query is updated.

d. To save the updated Job, press CTRL+S.
3. CONFIGURE THE tDatasetOutput COMPONENT
Update the component settings with the Live Dataset mode and the context variables you created.
a. Double-click the tDatasetOutput component to open the Component view.

b. For Mode, select LiveDataset.


Remember that live dataset mode is used to publish a dataset that keeps the data live once deployed in TAC. In this
mode, you cannot run the Job directly from Talend Studio.



Notice that the content of the Url and Limit boxes has been automatically updated with context variables. You don't
have to create these variables in Studio; TAC fills them in when it receives a request from the Data Preparation server.
c. If there is a warning on the tDatasetOutput component, click the Sync columns button to resolve the schema
issue.

d. To save the updated Job, press CTRL+S.

Build the Job


1. BUILD THE JOB
Right-click the LiveDataset Job and select Build Job.

2. DEFINE THE BUILD JOB PROPERTIES


The Build Job window opens.
a. Click Browse, and in the To archive file: text box, enter C:/Temp/LiveDataset.zip.
The _0.1 version stamp is automatically added, so the archive is saved as LiveDataset_0.1.zip.



b. Confirm that the Build type is Standalone Job.
Leave the other default options.
c. Click Finish.
Confirm that the Zip file was generated in the C:/Temp folder.
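As an alternative to checking in Windows Explorer, you can confirm that the archive exists and list its entries with a short script. This is a minimal sketch; the function name archive_entries is hypothetical, and no assumption is made about what the build places inside the archive.

```python
import zipfile
from pathlib import Path


def archive_entries(archive_path):
    """Return the entry names in a Zip archive.

    Raises FileNotFoundError if the archive does not exist.
    """
    path = Path(archive_path)
    if not path.is_file():
        raise FileNotFoundError(f"Archive not found: {path}")
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()
```

For example, archive_entries("C:/Temp/LiveDataset_0.1.zip") returns the list of files packaged by the build.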

Next step
You are ready to deploy a Job from a Zip build in TAC.

Deploying a Job in TAC

Task outline

Talend Administration Center provides a feature called Job Conductor, which helps you configure Job servers, schedule
Job execution, and configure Job deployment. When using Job Conductor, there are three ways to deploy Jobs in TAC:
Using Zip files generated from Talend Studio
Using artifacts stored in Nexus artifact repository
Using Publisher to publish artifacts from SVN sources in TAC

NOTE:
In this section, you will use the first method to deploy a Job from a Zip file. The other methods are detailed in
the Talend Data Integration Administration course.

The method you use to deploy a Job in TAC has no impact on the behavior of the Data Preparation live dataset.
To deploy the Job, you will:
Create a project with the same name as the one you created in Talend Studio, then assign authorizations on this
project for the system user (remember, the system user is used by the Data Preparation server to communicate
with TAC)
Check the Job server configuration used to deploy the Job
Assign server project authorizations
Create a task and deploy it on the local Job server

Connect to TAC
1. CONNECT TO TAC IN THE WEB UI
Connect to TAC with the system user credentials:
Login: [email protected]



Password: talend

2. LOG IN
Click the Login button.

Create a project in TAC
1. OPEN THE PROJECTS TAB
On the Menu pane, in Settings, click Projects.

2. ADD A PROJECT
Create a project called DataPrep.
Note: The project name in TAC must correspond to the Data Integration project from Talend Studio.



a. Click the Add button as shown in the screenshot.

b. For Label, enter DataPrep


For Project type, select Data Integration/ESB
For Storage, select the radio button for None.

c. Click the Save button.


The project appears on the project list.

Define project authorizations
1. OPEN THE PROJECT AUTHORIZATIONS TAB
On the Menu pane, in Settings, click Project authorizations.

2. ASSIGN ACCESS
Assign read/write permissions to the Operator user on the DataPrep project.
a. On the Project list, click DataPrep.

b. On the User/Group Authorizations list, in the Right column next to the Operator user, click the read/write
(person with pencil) icon.

c. Notice that one user is allowed on the DataPrep project.

Configure the local server


Before you can deploy Jobs on a Job server, you must declare and configure the server in TAC, which automatically tracks activity and
statistics for all declared Job servers.



1. OPEN THE SERVERS TAB
On the Menu, expand Conductor and click the Servers tab.

2. UPDATE THE SERVER CONFIGURATION


A Job server already exists on the training environment.
a. To view the details, click the serv1 default server.

b. Update the Label text box to LocalServer.


In the Host text box, enter studentpc.

c. Click Save.
The updated server appears on the list.
TAC may display errors until it successfully communicates with the Job server, which can take a couple of minutes.
To expand the window and see the corresponding message, click the plus (+) symbol.
Confirm that the server status is UP.

Check the server project authorizations


1. OPEN THE SERVER PROJECT AUTHORIZATIONS TAB
On the Menu, expand Conductor and click the Server Project authorizations tab.



2. VERIFY THE SERVER CONFIGURATION
A Job server already exists in the training environment.
a. Click the DataPrep project.
b. Confirm that LocalServer is authorized by default on the DataPrep project.
The Authorized (person) icon in the Right column next to the LocalServer server on the Server Authorizations
list must be selected. If not, select it.

Create a task
1. OPEN THE JOB CONDUCTOR TAB
On the Menu, expand Conductor and click the Job Conductor tab.

2. ADD A NORMAL TASK
Click the Add button and select Normal Task.

3. DEFINE THE EXECUTION TASK LABEL


The label for Data Preparation tasks must always be preceded by dataprep_ (or the prefix defined in the
application.properties configuration file).
a. In the Label text box, enter dataprep_customers

b. In C:\Talend\6.4.1\dataprep\config, open application.properties and view the value of the tac.task-prefix variable.



Data Preparation lists only tasks with prefixes defined here.

4. IMPORT THE ZIP ARCHIVE


The normal task is based on a previously created Zip archive.
a. For Job, click the Import zip button as in the screenshot.

b. Click Browse....

c. In the C:/Temp/ folder, select LiveDataset_0.1.zip.

d. Click Open and then Launch upload.



e. Several details are filled in:

5. SELECT THE EXECUTION SERVER

For Execution server, select the default LocalServer.

6. SAVE THE TASK


Click Save.
The Job Conductor refreshes and you can see the dataprep_customers task on the list.
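The label rule from step 3 (only tasks whose label starts with the configured prefix are listed in Data Preparation) can be sketched as follows. This is an illustrative model of the described behavior, not Talend code. The tac.task-prefix key and the dataprep_ value come from this guide; the function names and the nightly_load label are hypothetical, and the parser handles only simple key=value lines.

```python
def read_property(text, key, default=None):
    """Return the value of `key` from Java-style .properties text.

    Handles only simple `key=value` lines; comments start with # or !.
    """
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        name, separator, value = line.partition("=")
        if separator and name.strip() == key:
            return value.strip()
    return default


def visible_jobs(task_labels, prefix):
    """Keep labels that start with `prefix` and return them without it."""
    return [label[len(prefix):] for label in task_labels if label.startswith(prefix)]
```

With the prefix dataprep_, visible_jobs(["dataprep_customers", "nightly_load"], "dataprep_") keeps only the first task and returns it as customers, the name without the prefix.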

Deploy the Job


1. SELECT THE TASK
On the list, select the dataprep_customers task.
The Generate button is grayed out because Talend Studio already generated the Job.



2. DEPLOY THE JOB
To send the build to your local Job server for deployment, click Deploy.
Wait for the task status to change to Ready to run.

Next step
You imported a Zip build generated from Talend Studio as a task in TAC. In addition, you used Job Conductor to deploy the task on a
Job server.
You can also deploy Jobs by using Nexus artifact repository (covered in the Data Integration Administration course).
In the next section, you will create a dataset from a Talend Job in the Data Preparation web UI.

Creating a Dataset from a Talend Job

Task outline

In the previous lessons, you learned how to create a dataset from a file right in the Data Preparation web UI, as well as
how to use the Data Integration tool to publish a dataset to Data Preparation.
In this section, you will create a dataset from a deployed Talend Job in the Data Preparation web UI. This option is
available only to users in the Data Preparation administrator role.
Then you will apply an existing preparation to the newly created dataset and test the full-run functionality.

Add a live dataset


1. CONNECT TO DATA PREPARATION
Connect to the Data Preparation web console as Adam Brown (he is a Data Preparation administrator; John Smith is not).
Enter these credentials:
Email: [email protected]
Password: talend
Click SIGN IN.

2. DISPLAY THE DATASETS


On the menu on the left, click Datasets.
You see a list of current datasets.



3. ADD A DATASET
Create a dataset from a deployed Talend Job.
a. To the right of ADD DATASET, click the down arrow.

b. Select From Talend Job.


The ADD TALEND JOB DATASET window opens.

c. In the Dataset name text box, enter Live_Customers
In the User text box, enter [email protected]
For Password, enter talend
For Talend job, select customers

NOTE:
The specified user has permissions to run the execution task on the Job server.
The label of the Talend job (customers) is the label of the execution task you created in TAC
(dataprep_customers) without the prefix (dataprep_).

d. Click OK.
The dataset opens with the last version of data extracted from the database. Ten thousand lines are displayed.

NOTE:
The cache retains the same data for an hour.



e. Remember that the LiveDataset Job was configured to extract data from a table containing 11,000 records.
However, only 10,000 are displayed. Once again, this corresponds to the dataset.records.limit variable defined in the
application.properties file. Only 10,000 rows are displayed, but the entire dataset is kept intact.

f. Close the dataset and confirm that it was saved.
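The two behaviors noted in steps d and e (the same data is retained for an hour, and at most 10,000 rows are displayed) can be modeled with a small sketch. This is a conceptual illustration under stated assumptions, not the server's implementation: the class name CachedPreview is hypothetical, and the exact expiry rule and where the cache lives are assumptions.

```python
import time
from itertools import islice


class CachedPreview:
    """Illustrative model: reuse a fetched sample for `ttl_seconds`
    and show at most `limit` rows."""

    def __init__(self, fetch_rows, limit=10_000, ttl_seconds=3600,
                 clock=time.monotonic):
        self._fetch_rows = fetch_rows  # callable returning an iterable of rows
        self._limit = limit
        self._ttl = ttl_seconds
        self._clock = clock
        self._cached = None
        self._fetched_at = None

    def open(self):
        now = self._clock()
        expired = (self._fetched_at is None
                   or now - self._fetched_at >= self._ttl)
        if expired:
            # Keep only the first `limit` rows for display;
            # the source itself is not modified.
            self._cached = list(islice(self._fetch_rows(), self._limit))
            self._fetched_at = now
        return self._cached
```

In this model, opening a source of 11,000 rows displays 10,000 of them, and reopening within the hour reuses the cached rows instead of fetching again.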

Apply an existing data preparation to a live dataset


Adam Brown created a data preparation on the initial Customers dataset. He wants to apply it to the live dataset.
There is only one constraint: to apply an existing data preparation to a new dataset, the format (number and name of columns) of the
new dataset must be the same as that of the initial one on which the data preparation was built.
In this case, the format is the same, so there are no issues.
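The format constraint can be expressed as a simple check on the column names of the two datasets. This is a minimal sketch: the function names are hypothetical, and because the guide does not say whether column order matters, the check ignores order.

```python
def same_format(initial_columns, new_columns):
    """True when both datasets have the same number of columns
    with the same names (order is ignored in this sketch)."""
    return sorted(initial_columns) == sorted(new_columns)


def format_differences(initial_columns, new_columns):
    """Return (missing, extra) column names of the new dataset
    relative to the initial one."""
    missing = [name for name in initial_columns if name not in new_columns]
    extra = [name for name in new_columns if name not in initial_columns]
    return missing, extra
```

If same_format returns False, format_differences shows which columns are missing from or extra in the new dataset, which helps explain why an existing preparation cannot be applied.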
1. OPEN A DATASET
Still logged in as Adam Brown, click the Live_Customers dataset again to open it.
2. APPLY AN EXISTING PREPARATION
Apply the Customers Preparation to the Live_Customers dataset then close the preparation.
a. Click the Choose a preparation to apply to this dataset button.

b. The corresponding wizard is displayed.


Select Customers preparation.

c. Close the data preparation by clicking the X in the upper right corner of the window.

d. Save the new preparation in the US folder.



e. Open the US folder and confirm that the new data preparation was saved.

3. VIEW ACCESS TO THE PREPARATION


Confirm that John Smith has access to the Live_Customers Preparation.
a. Disconnect Adam Brown.

b. Reconnect as John Smith.

c. In the US folder, view the Preparations.

d. To open the Live_Customers Preparation, click on it.



NOTE:
Even if John Smith cannot create a live dataset, he can access a data preparation built on a live dataset that was
shared with him. This means the security restriction is linked only to the user who created the live dataset in
Data Preparation (not to the user who runs it).

Next step
You have almost finished this section. Time for a quick review.

Review
In this lesson, you learned how to set up the live dataset scenario. You:
Implemented live dataset mode by reusing an existing Job
Built the Zip archive of a Job and deployed it in TAC
Created a dataset in Data Preparation from the deployed Talend Job

More information
Talend documentation:
Working with datasets based on on-demand Job executions
Talend Help Center

