Hadoop - getmerge Command

Last Updated: 29 Jun, 2020

The Hadoop -getmerge command merges multiple files stored in HDFS (Hadoop Distributed File System) into a single output file in the local file system. In this example, we want to merge two files present in our HDFS, file1.txt and file2.txt, into a single file output.txt in our local file system.

Steps To Use the -getmerge Command

Step 1: Check the content of file1.txt and file2.txt that are available in our HDFS.

[Images: contents of file1.txt and file2.txt]

In this case, both files have been copied into the /Hadoop_File folder in HDFS. If you don't know how to create the directory and copy files to HDFS, use the commands below.

Making the /Hadoop_File directory in HDFS:

hdfs dfs -mkdir /Hadoop_File

Copying the files to HDFS:

hdfs dfs -copyFromLocal /home/dikshant/Documents/hadoop_file/file1.txt /home/dikshant/Documents/hadoop_file/file2.txt /Hadoop_File

[Image: file1.txt and file2.txt listed inside the /Hadoop_File directory in HDFS]

Step 2: Now use the -getmerge command to merge these files into a single output file in the local file system.

Syntax:

hdfs dfs -getmerge [-nl] /path1 /path2 ... /pathN /destination

The -nl flag adds a newline between the contents of the N input files. In this case, the merged output is written to the /hadoop_file folder inside the /Documents directory:

hdfs dfs -getmerge -nl /Hadoop_File/file1.txt /Hadoop_File/file2.txt /home/dikshant/Documents/hadoop_file/output.txt

Now let's check whether the files were merged into output.txt.

[Image: merged content of output.txt]

As the output shows, the two files were merged successfully into output.txt.
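For reference, below is a minimal end-to-end sketch of the whole workflow, from creating the input files to verifying the merged output. The sample sentences written into file1.txt and file2.txt are illustrative placeholders (the original screenshots are not reproduced here), and the /home/dikshant/Documents/hadoop_file paths are taken from the article; adjust both to your own environment.

# Create two sample input files locally (contents are illustrative placeholders)
echo "Hello, this is file1." > /home/dikshant/Documents/hadoop_file/file1.txt
echo "Hello, this is file2." > /home/dikshant/Documents/hadoop_file/file2.txt

# Create the target directory in HDFS
hdfs dfs -mkdir /Hadoop_File

# Copy both files from the local file system into HDFS
hdfs dfs -copyFromLocal /home/dikshant/Documents/hadoop_file/file1.txt /home/dikshant/Documents/hadoop_file/file2.txt /Hadoop_File

# Confirm that both files landed in HDFS
hdfs dfs -ls /Hadoop_File

# Merge them into a single local file, inserting a newline between the inputs
hdfs dfs -getmerge -nl /Hadoop_File/file1.txt /Hadoop_File/file2.txt /home/dikshant/Documents/hadoop_file/output.txt

# Verify the merged result locally
cat /home/dikshant/Documents/hadoop_file/output.txt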
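As a side note (not covered in the article itself), -getmerge also accepts a directory as its source, in which case every file inside it is merged, and a similar result can be obtained by streaming with -cat and redirecting to a local file:

# Passing a directory merges every file inside it
hdfs dfs -getmerge -nl /Hadoop_File /home/dikshant/Documents/hadoop_file/output.txt

# Alternative: stream the files to stdout and redirect into a local file
hdfs dfs -cat /Hadoop_File/file1.txt /Hadoop_File/file2.txt > /home/dikshant/Documents/hadoop_file/output.txt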