Talend Tutorial 12: Writing and Reading Data in HDFS
This tutorial demonstrates how to write data to HDFS and read data from HDFS using Talend Data Fabric. It involves:
1. Generating random customer data using a tRowGenerator component.
2. Writing the data to HDFS using a tHDFSOutput component.
3. Reading the data from HDFS using a tHDFSInput component with the same schema.
4. Sorting the data by customer ID using a tSortRow component.
5. Displaying the sorted data in the console using a tLogRow component.
The Job is then run to test the process.
This tutorial uses Talend Data Fabric Studio version 6 and a Hadoop cluster running Cloudera CDH version 5.4.
1. Create a new standard Job
a. Ensure that the Integration perspective is selected.
b. To ensure that the Hadoop cluster connection and the HDFS connection metadata have been created in the Project Repository, expand Hadoop Cluster.
c. In the Repository, expand Job Designs, right-click Standard, and click Create Standard Job. In the Name field of the New Job wizard, type ReadWriteHDFS. In the Purpose field, type Read/Write data in HDFS, and in the Description field, type Standard job to write and read customers data to and from HDFS. Click Finish. The Job opens in the Job Designer.
2. Add and configure a tRowGenerator component to generate random customer data
a. To generate random customer data, in the Job Designer, add a tRowGenerator component.
b. To set the schema and function parameters for the tRowGenerator component, double-click the tRowGenerator_1 component.
c. To add columns to the schema, click the [+] icon three times and type the column names CustomerID, FirstName, and LastName. Next, you will configure the attributes for these fields.
d. To change the Type for the CustomerID column, click the Type field and click Integer. Then set the Functions field of the three columns to Numeric.random(int,int), TalendDataGenerator.getFirstName(), and TalendDataGenerator.getLastName(), respectively.
e. In the table, select the CustomerID column, then, in the Function parameters tab, set the max value to 1000.
f. In the Number of Rows for RowGenerator field, type 1000, and click OK to save the configuration. A code-level sketch of what this step generates follows.
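Talend compiles every Job into Java, so it can help to see the equivalent logic in plain code. The following is a minimal, standalone sketch of what this step produces, not Talend's actual generated code: the name pools stand in for TalendDataGenerator's internal lists, and the semicolon-delimited output format is an assumption.

    import java.util.Random;

    // Illustrative stand-in for the tRowGenerator configuration above.
    public class CustomerGenerator {
        // Sample pools standing in for TalendDataGenerator.getFirstName()/getLastName().
        private static final String[] FIRST_NAMES = {"Alice", "Bob", "Carol", "David"};
        private static final String[] LAST_NAMES  = {"Smith", "Jones", "Brown", "Taylor"};
        private static final Random RANDOM = new Random();

        public static void main(String[] args) {
            // 1000 rows, matching the "Number of Rows for RowGenerator" setting.
            for (int i = 0; i < 1000; i++) {
                int customerId = RANDOM.nextInt(1000); // like Numeric.random with max 1000
                String firstName = FIRST_NAMES[RANDOM.nextInt(FIRST_NAMES.length)];
                String lastName = LAST_NAMES[RANDOM.nextInt(LAST_NAMES.length)];
                System.out.println(customerId + ";" + firstName + ";" + lastName);
            }
        }
    }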
3. Write data to HDFS
For this, you will create a new tHDFSOutput component that reuses the existing HDFS metadata available in the Project Repository.
a. From the Repository, under Metadata > HadoopCluster > MyHadoopCluster > HDFS, click MyHadoopCluster_HDFS and drag it to the Job Designer.
b. In the Components list, select tHDFSOutput and click OK.
c. Create a flow of data from the tRowGenerator_1 component to the MyHadoopCluster_HDFS component by linking the two components with the Main row, then double-click the MyHadoopCluster_HDFS component to open the Component view. Note that the component is already configured with the predefined HDFS metadata connection information.
d. In the File Name box, type /user/student/CustomersData, and in the Action list, select Overwrite.
The first subjob, which writes data to HDFS, is now complete. It takes the data generated by the tRowGenerator component you created earlier and writes it to HDFS over the connection defined in the metadata, as the sketch below illustrates.
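The MyHadoopCluster_HDFS output component hides the HDFS client calls behind metadata. For readers who want to see the moving parts, here is a minimal sketch of the same write done directly with the standard Hadoop FileSystem API; the NameNode URI hdfs://namenode:8020 and the sample row are assumptions, and Talend's generated code differs in detail.

    import java.io.BufferedWriter;
    import java.io.OutputStreamWriter;
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteCustomersToHdfs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumed NameNode address; the tutorial takes this from the HDFS metadata.
            conf.set("fs.defaultFS", "hdfs://namenode:8020");

            try (FileSystem fs = FileSystem.get(conf);
                 BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                         // The second argument 'true' overwrites an existing file,
                         // matching the Overwrite action selected above.
                         fs.create(new Path("/user/student/CustomersData"), true),
                         StandardCharsets.UTF_8))) {
                out.write("42;Alice;Smith"); // one CustomerID;FirstName;LastName row
                out.newLine();
            }
        }
    }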
4. Read data from HDFS
Next, you will build a subjob that reads the customer data from HDFS, sorts it, and displays it in the console. To read the customer data from HDFS, you will create a new tHDFSInput component that reuses the existing HDFS metadata available in the Project Repository.
a. From the Repository, under Metadata > HadoopCluster > MyHadoopCluster > HDFS, click MyHadoopCluster_HDFS and drag it to the Job Designer.
b. In the Components list, select tHDFSInput and click OK.
c. To open the Component view of the MyHadoopCluster_HDFS input component, double-click it. Note that the component is already configured with the predefined HDFS metadata connection information.
d. In the File Name box, type /user/student/CustomersData.
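As with the write, the input component wraps a plain HDFS read. A minimal sketch using the Hadoop FileSystem API, under the same assumed NameNode URI as before:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadCustomersFromHdfs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed NameNode address

            try (FileSystem fs = FileSystem.get(conf);
                 BufferedReader in = new BufferedReader(new InputStreamReader(
                         fs.open(new Path("/user/student/CustomersData")),
                         StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line); // raw CustomerID;FirstName;LastName rows
                }
            }
        }
    }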
5. Specify the schema in the MyHadoopCluster_HDFS input component to read the data from HDFS
a. To open the schema editor, in the Component view of the MyHadoopCluster_HDFS input component, click Edit schema.
b. To add columns to the schema, click the [+] icon three times and type the column names CustomerID, FirstName, and LastName.
c. To change the Type for the CustomerID column, click the Type field and click Integer.
Note: This schema is the same as in tRowGenerator and tHDFSOutput. You can copy it from either of those components and paste it into this schema.
d. Connect the tRowGenerator component to the MyHadoopCluster_HDFS input component using the OnSubjobOk trigger.
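The schema matters because the column types control how each field is parsed when the file is read back. As a hedged illustration only, the same three-column schema could be written as a plain Java type; the Customer class and the semicolon delimiter are assumptions for this sketch, not part of the tutorial.

    // Illustrative typed row matching the CustomerID / FirstName / LastName schema.
    public class Customer {
        public final int customerId;   // schema type Integer
        public final String firstName; // schema type String
        public final String lastName;  // schema type String

        public Customer(int customerId, String firstName, String lastName) {
            this.customerId = customerId;
            this.firstName = firstName;
            this.lastName = lastName;
        }

        // Parse one semicolon-delimited line read from HDFS into a typed row.
        public static Customer parse(String line) {
            String[] fields = line.split(";");
            return new Customer(Integer.parseInt(fields[0]), fields[1], fields[2]);
        }
    }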
6. Sort data in ascending order of customer ID using the tSortRow component
a. Add a tSortRow component and connect it to the MyHadoopCluster_HDFS input component with the Main row.
b. To open the Component view of the tSortRow component, double-click the component.
c. To configure the schema, click Sync columns.
d. To add a new criterion to the Criteria table, click the [+] icon. In the Schema column, type CustomerID; in the "sort num or alpha?" column, select num; and in the "Order asc or desc?" column, select asc.
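Selecting num and asc tells tSortRow to compare CustomerID as a number rather than as text, so 9 sorts before 10, whereas an alphabetic sort would put "10" before "9". A small self-contained sketch of the same comparison in plain Java, with hypothetical sample rows:

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    public class SortCustomers {
        public static void main(String[] args) {
            // Sample CustomerID;FirstName;LastName rows, as if read back from HDFS.
            List<String> rows = new ArrayList<>();
            rows.add("512;Carol;Brown");
            rows.add("17;Alice;Smith");
            rows.add("903;Bob;Jones");

            // num + asc: compare CustomerID numerically, in ascending order.
            rows.sort(Comparator.comparingInt(r -> Integer.parseInt(r.split(";")[0])));

            rows.forEach(System.out::println); // prints 17;..., 512;..., 903;...
        }
    }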
7. Display the sorted data in the console using a tLogRow component
a. Add a tLogRow component and connect it to the tSortRow component with the Main row.
b. To open the Component view of the tLogRow component, double-click the component.
c. In the Mode panel, select Table.
Your Job is now ready to run. First, it generates data and writes it to HDFS. Then it reads the data from HDFS, sorts it, and displays it in the console.
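In Table mode, tLogRow renders each row inside an ASCII table in the console rather than as raw delimited lines. A rough, purely illustrative imitation of that kind of output (column widths and sample values are made up):

    public class LogRowTable {
        public static void main(String[] args) {
            String[][] rows = {
                {"17", "Alice", "Smith"},
                {"512", "Carol", "Brown"},
            };
            // Print a header and rows in fixed-width columns, roughly like Table mode.
            System.out.printf("|%-10s|%-10s|%-10s|%n", "CustomerID", "FirstName", "LastName");
            for (String[] r : rows) {
                System.out.printf("|%-10s|%-10s|%-10s|%n", r[0], r[1], r[2]);
            }
        }
    }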
8. Run the Job and observe the result in the console
a. To run the Job, in the Run view, click Run. The sorted data is displayed in the console.