Hadoop - Different Modes of Operation
Last Updated: 22 Jun, 2020
As we all know, Hadoop is an open-source framework mainly used for storing, maintaining, and analyzing large amounts of data or datasets on clusters of commodity hardware, which means it is essentially a data-management tool. Hadoop also possesses a scale-out storage property, which means that we can scale the number of nodes up or down as per our requirements in the future, which is a really useful feature.
Hadoop mainly works in 3 different modes:
- Standalone Mode
- Pseudo-distributed Mode
- Fully-Distributed Mode
1. Standalone Mode
In Standalone Mode none of the daemons run, i.e. Namenode, Datanode, Secondary NameNode, Job Tracker, and Task Tracker. We use the Job Tracker and Task Tracker for processing purposes in Hadoop 1; in Hadoop 2 we use the Resource Manager and Node Manager instead. Standalone Mode also means that we install Hadoop on only a single system. By default, Hadoop is configured to run in this Standalone Mode, which is also called Local Mode. We mainly use Hadoop in this mode for learning, testing, and debugging.
Hadoop runs fastest in this mode among all 3 modes. HDFS (Hadoop Distributed File System), one of the major components of Hadoop that is used for storage, is not utilized in this mode. You can think of HDFS as similar to the file systems available for Windows, i.e. NTFS (New Technology File System) and FAT32 (File Allocation Table with 32-bit table entries). When Hadoop works in this mode, there is no need to configure the files hdfs-site.xml, mapred-site.xml, and core-site.xml for the Hadoop environment. In this mode, all of your processes run in a single JVM (Java Virtual Machine), and this mode is suitable only for small development purposes.
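Standalone Mode works out of the box because Hadoop's shipped defaults already point everything at the local machine. As a rough sketch (using the Hadoop 2/3 property names documented in core-default.xml and mapred-default.xml), the built-in defaults are equivalent to settings like the following, which is why no *-site.xml changes are needed:

```xml
<!-- Sketch of the defaults behind Standalone (Local) Mode.
     These values ship with Hadoop, so the *-site.xml files can stay empty. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>file:///</value>            <!-- use the local file system, not HDFS -->
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>local</value>               <!-- run MapReduce inside a single local JVM -->
  </property>
</configuration>
```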
2. Pseudo Distributed Mode (Single Node Cluster)
In Pseudo-Distributed Mode we also use only a single node, but the main thing is that the cluster is simulated, which means that all the processes inside the cluster run independently of each other. All the daemons, that is Namenode, Datanode, Secondary NameNode, Resource Manager, Node Manager, etc., run as separate processes on separate JVMs (Java Virtual Machines), or we can say they run as different Java processes; that is why it is called Pseudo-Distributed.
One thing we should remember is that, as we are using only a single-node setup, all the Master and Slave processes are handled by the single system. The Namenode and Resource Manager are used as Masters, and the Datanode and Node Manager are used as Slaves. The Secondary NameNode is also used as a Master; its purpose is to keep a periodic (by default hourly) checkpoint of the Namenode's metadata. In this Mode,
- Hadoop is used both for development and for debugging purposes.
- Our HDFS (Hadoop Distributed File System) is utilized for managing the input and output processes.
- We need to change the configuration files mapred-site.xml, core-site.xml, and hdfs-site.xml to set up the environment (a minimal example is sketched after this list).
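As a hedged illustration, the snippets below mirror the single-node setup described in the official Hadoop documentation; port 9000 and a replication factor of 1 are the commonly used values for a pseudo-distributed cluster, and your host names, ports, or paths may differ:

```xml
<!-- core-site.xml: point the default file system at HDFS on this machine -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: a single Datanode can hold only one replica of each block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- mapred-site.xml: hand MapReduce jobs to YARN instead of the local runner -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```

After editing these files, the Namenode is formatted once and the HDFS and YARN daemons are started, after which each daemon shows up as its own Java process (for example, when listed with the jps command).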
3. Fully Distributed Mode (Multi-Node Cluster)
This is the most important mode, in which multiple nodes are used: a few of them run the Master daemons, that is the Namenode and the Resource Manager, and the rest of them run the Slave daemons, that is the DataNodes and Node Managers. Here Hadoop runs on a cluster of machines or nodes, and the data is distributed across the different nodes. This is actually the Production Mode of Hadoop; let's clarify or understand this mode in a better way in physical terms.
When you download Hadoop as a tar or zip file, you normally extract and install it on one system and run all the processes there. In Fully Distributed Mode, however, we extract this tar or zip file on each of the nodes in the Hadoop cluster and then use a particular node for a particular process. Once you distribute the processes among the nodes, you define which nodes work as Masters and which of them work as Slaves.
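For comparison with the pseudo-distributed snippets above, here is a hedged sketch of how the same files typically change in a multi-node cluster; the hostname namenode-host is a made-up placeholder for your master node, and 3 is simply HDFS's usual default replication factor:

```xml
<!-- core-site.xml (on every node): point clients and slaves at the master's Namenode.
     "namenode-host" is a hypothetical hostname; replace it with your master node. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode-host:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: with many Datanodes, each block can be replicated
     across machines (3 is the usual default). -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```

The list of slave machines is kept in the workers file (called slaves in older Hadoop versions), so that the start-up scripts on the master know where to launch the DataNode and Node Manager daemons.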