Talend Tutorial Task Aid >

Writing and Reading Data in HDFS

This tutorial demonstrates how to write data to HDFS and read it back using Talend Data Fabric. It involves:
1. Generating random customer data with a tRowGenerator component.
2. Writing the data to HDFS with a tHDFSOutput component.
3. Reading the data back from HDFS with a tHDFSInput component that uses the same schema.
4. Sorting the data by customer ID with a tSortRow component.
5. Displaying the sorted data in the console with a tLogRow component.
The Job is then run to test the process.

This tutorial uses Talend Data Fabric Studio version 6 and a Hadoop cluster: Cloudera CDH version 5.4.

1. Create a new standard Job


a. Ensure that the Integration perspective is selected.
b. To confirm that the Hadoop cluster connection and the HDFS connection metadata have been created in the Project Repository, expand Hadoop Cluster.
c. In the Repository, expand Job Designs, right-click Standard, and click Create Standard Job. In the Name field of the New Job wizard, type ReadWriteHDFS. In the Purpose field, type Read/Write data in HDFS. In the Description field, type Standard job to write and read customers data to and from HDFS, then click Finish.
The Job opens in the Job Designer.

2. Add and configure a tRowGenerator component to generate random customer data

a. To generate random customer data, add a tRowGenerator component in the Job Designer.
b. To set the schema and function parameters for the tRowGenerator component, double-click the tRowGenerator_1 component.
c. To add columns to the schema, click the [+] icon three times and name the columns CustomerID, FirstName, and LastName. Next, you will configure the attributes of these fields.
d. To change the Type of the CustomerID column, click its Type field and select Integer. Then set the Functions field of the three columns to Numeric.random(int,int), TalendDataGenerator.getFirstName(), and TalendDataGenerator.getLastName(), respectively.
e. In the table, select the CustomerID column. Then, in the Function parameters tab, set the max value to 1000.
f. In the Number of Rows for RowGenerator field, type 1000, and click OK to save the configuration.
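For orientation: Talend Jobs compile to Java, so the tRowGenerator configuration above behaves roughly like the following sketch. The name pools are illustrative placeholders, not Talend's built-in lists; Numeric.random(min,max) returns a random integer in the given range, and the ';' separator is only for display here.

    import java.util.Random;

    public class CustomerGenerator {
        // Illustrative name pools; TalendDataGenerator draws from its own built-in lists.
        private static final String[] FIRST_NAMES = {"Ava", "Noah", "Mia", "Liam", "Zoe"};
        private static final String[] LAST_NAMES  = {"Smith", "Jones", "Garcia", "Chen", "Patel"};

        public static void main(String[] args) {
            Random rnd = new Random();
            for (int i = 0; i < 1000; i++) {        // 1000 rows, as configured above
                int customerId = rnd.nextInt(1001); // ~ Numeric.random(0, 1000)
                String first = FIRST_NAMES[rnd.nextInt(FIRST_NAMES.length)];
                String last = LAST_NAMES[rnd.nextInt(LAST_NAMES.length)];
                System.out.println(customerId + ";" + first + ";" + last);
            }
        }
    }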


3. Write data to HDFS


For this, you will create a new tHDFSOutput component that reuses the existing HDFS metadata available in the Project Repository.
a. From the Repository, under Metadata > HadoopCluster > MyHadoopCluster > HDFS, click MyHadoopCluster_HDFS and drag it to the Job Designer.
b. In the Components list, select tHDFSOutput and click OK.
c. Create a flow of data from the tRowGenerator_1 component to the MyHadoopCluster_HDFS component by linking the two with the Main row. Then double-click the MyHadoopCluster_HDFS component to open the Component view.
Note that the component is already configured with the predefined HDFS metadata connection information.
d. In the File Name box, type /user/student/CustomersData, and in the Action list, select Overwrite.
The first subjob, which writes data to HDFS, is now complete. It takes the data generated by the tRowGenerator you created earlier and writes it to HDFS over a connection defined in metadata.
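Outside of Talend Studio, the same write can be expressed directly against the Hadoop HDFS Java API. This is a minimal sketch, not the code Talend generates; the NameNode URI and the ';' field separator are assumptions, since in the Job both come from the repository metadata and the component settings.

    import java.io.BufferedWriter;
    import java.io.OutputStreamWriter;
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteCustomersToHdfs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed NameNode address

            try (FileSystem fs = FileSystem.get(conf);
                 // The boolean argument mirrors the Overwrite action selected above.
                 FSDataOutputStream out = fs.create(new Path("/user/student/CustomersData"), true);
                 BufferedWriter writer = new BufferedWriter(
                         new OutputStreamWriter(out, StandardCharsets.UTF_8))) {
                writer.write("42;Ava;Smith"); // one customer record per line
                writer.newLine();
            }
        }
    }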

4. Read data from HDFS


Next, you will build a subjob that reads the customer data from HDFS, sorts it, and displays it in the console. To read the customer data from HDFS, you will create a new tHDFSInput component that reuses the existing HDFS metadata available in the Project Repository.
a. From the Repository, under Metadata > HadoopCluster > MyHadoopCluster > HDFS, click MyHadoopCluster_HDFS and drag it to the Job Designer.
b. In the Components list, select tHDFSInput and click OK.
c. To open the Component view of the MyHadoopCluster_HDFS input component, double-click it.
Note that the component is already configured with the predefined HDFS metadata connection information.
d. In the File Name box, type /user/student/CustomersData.
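The read side mirrors the write. A minimal sketch with the same assumed NameNode address, again illustrating the mechanics rather than Talend's generated code:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadCustomersFromHdfs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed NameNode address

            try (FileSystem fs = FileSystem.get(conf);
                 BufferedReader reader = new BufferedReader(
                         new InputStreamReader(fs.open(new Path("/user/student/CustomersData")),
                                               StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line); // each line holds one customer record
                }
            }
        }
    }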


5. Specify the schema in the MyHadoopCluster_HDFS input component to read the data from HDFS

a. To open the schema editor, in the Component view of the MyHadoopCluster_HDFS input component, click Edit schema.
b. To add columns to the schema, click the [+] icon three times and name the columns CustomerID, FirstName, and LastName.
c. To change the Type of the CustomerID column, click its Type field and select Integer.
Note: This schema is the same as in tRowGenerator and tHDFSOutput. You can copy it from either of those components and paste it into this schema.
d. Connect the tRowGenerator component to the MyHadoopCluster_HDFS input component using the OnSubjobOk trigger.
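In generated code, a schema amounts to a row class with one typed field per column, which is why the input schema must match what tHDFSOutput wrote. As a mental model only (the class and the ';' separator are illustrative, not Talend's generated code):

    // Illustrative row type matching the three-column schema.
    public class CustomerRow {
        public int customerId;   // CustomerID: Integer
        public String firstName; // FirstName: String
        public String lastName;  // LastName: String

        // Parse one line as written by the first subjob; ';' separator is an assumption.
        public static CustomerRow parse(String line) {
            String[] fields = line.split(";", -1);
            CustomerRow row = new CustomerRow();
            row.customerId = Integer.parseInt(fields[0]);
            row.firstName = fields[1];
            row.lastName = fields[2];
            return row;
        }
    }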

6. Sort data in ascending order of customer ID using the tSortRow component

a. Add a tSortRow component and connect it to the MyHadoopCluster_HDFS input component with the Main row.
b. To open the Component view of the tSortRow component, double-click the component.
c. To configure the schema, click Sync columns.
d. To add a new criterion to the Criteria table, click the [+] icon. In the Schema column, type CustomerID; in the sort num or alpha? column, select num; and in the Order asc or desc? column, select asc.
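Conceptually, this tSortRow configuration is a numeric ascending sort on the CustomerID field. A plain-Java sketch of the same idea, reusing the assumed ';' separator:

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    public class SortCustomers {
        public static void main(String[] args) {
            List<String> lines = new ArrayList<>(List.of(
                    "847;Noah;Chen", "12;Ava;Smith", "305;Mia;Patel"));

            // num + asc in the Criteria table: compare CustomerID as an integer, ascending.
            lines.sort(Comparator.comparingInt(line -> Integer.parseInt(line.split(";")[0])));

            lines.forEach(System.out::println); // 12;... then 305;... then 847;...
        }
    }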

7. Display the sorted data in the console using a tLogRow component

a. Add a tLogRow component and connect it to the tSortRow component with the Main row.
b. To open the Component view of the tLogRow component, double-click the component.
c. In the Mode panel, select Table.
Your Job is now ready to run. First, it generates data and writes it to HDFS. Then it reads the data from HDFS, sorts it, and displays it in the console.
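In Table mode, tLogRow prints each row in an aligned grid with a header. A rough plain-Java approximation of that console output (the column widths are arbitrary):

    import java.util.List;

    public class TablePrinter {
        public static void main(String[] args) {
            List<String[]> rows = List.of(
                    new String[]{"12", "Ava", "Smith"},
                    new String[]{"847", "Noah", "Chen"});

            // Header and rows in fixed-width columns, similar to tLogRow's Table mode.
            System.out.printf("%-12s|%-12s|%-12s%n", "CustomerID", "FirstName", "LastName");
            for (String[] row : rows) {
                System.out.printf("%-12s|%-12s|%-12s%n", row[0], row[1], row[2]);
            }
        }
    }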


8. Run the Job and observe the result in the console


a. To run the Job, in the Run view, click Run.
The sorted data is displayed in the console.
