Introduction to Hadoop
• 1. It is fault-tolerant.
• 2. It is highly available.
• 3. Its programming model is easy.
• 4. It provides huge, flexible storage.
• 5. It is low cost.
Differences Between RDBMS and Hadoop
• RDBMS: A traditional row-and-column database, used mainly for data storage, manipulation, and retrieval. Hadoop: Open-source software used for storing data and running applications or processes concurrently.
• RDBMS: Mostly structured data is processed. Hadoop: Both structured and unstructured data are processed.
• RDBMS: Best suited for an OLTP environment. Hadoop: Best suited for big data.
• RDBMS: The data schema is static. Hadoop: The data schema is dynamic.
• RDBMS: High data integrity. Hadoop: Lower data integrity than RDBMS.
Hadoop – Architecture
1. MapReduce
2. HDFS(Hadoop Distributed File System)
3. YARN(Yet Another Resource Negotiator)
4. Common Utilities or Hadoop Common
1. MapReduce
• Combiner: The Combiner is used for grouping the data in the Map workflow. It is similar to a local reducer: the intermediate key-value pairs generated by the Map are combined with its help. Using a combiner is optional, not required (see the word-count sketch after this list).
• Shuffle and Sort: The task of the Reducer starts with this step. The process in which the intermediate key-value pairs generated by the Mapper are transferred to the Reducer task is known as Shuffling. During shuffling, the system sorts the data by key. Shuffling begins as soon as some of the Map tasks are done, rather than waiting for all Mapper work to complete, which makes the overall process faster.
• Reduce: The main task of Reduce is to gather the tuples generated by Map and then perform sorting and aggregation on those key-value pairs according to their key element.
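To make these stages concrete, here is a minimal sketch of the classic word-count job in Java (Hadoop's native language). Class and job names are illustrative; note that the same reducer class is registered as the optional combiner, so map output is pre-aggregated locally before the shuffle.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce: sum the counts for each key. The same class doubles as the
  // optional combiner, acting as a local reducer on each mapper's output.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // optional local aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The shuffle-and-sort phase needs no code here: the framework itself moves the mappers' intermediate pairs to the reducers and sorts them by key.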
2. HDFS(Hadoop Distributed File System)
• NameNode(Master)
• DataNode(Slave)
• NameNode: The NameNode works as the Master in a Hadoop cluster and guides the DataNodes (Slaves). The NameNode is mainly used for storing the metadata, i.e. the data about the data. Metadata can be, for example, the transaction logs that keep track of user activity in a Hadoop cluster.
• DataNode: DataNodes work as Slaves. They are mainly used for storing the data in a Hadoop cluster; the number of DataNodes can range from 1 to 500 or even more. The more DataNodes there are, the more data the Hadoop cluster can store. It is therefore advised that each DataNode have high storage capacity, so it can hold a large number of file blocks.
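As a small illustration of this Master/Slave split, the sketch below (assuming a configured Hadoop client and a hypothetical file path) reads a file through the HDFS Java API: the client asks the NameNode for block locations, then streams the actual bytes from the DataNodes.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);     // metadata requests go to the NameNode
    Path path = new Path("/user/demo/input.txt"); // hypothetical path
    // fs.open() obtains block locations from the NameNode; the bytes themselves
    // are streamed from the DataNodes that hold the file's blocks.
    try (BufferedReader reader =
             new BufferedReader(new InputStreamReader(fs.open(path)))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}
```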
3. YARN(Yet Another Resource Negotiator)
Features of YARN:
1. Multi-Tenancy
2. Scalability
3. Cluster-Utilization
4. Compatibility
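As a rough sketch of how an application can ask the ResourceManager about cluster utilization, the snippet below uses the YarnClient API (the class name ClusterInfo is ours; it assumes a yarn-site.xml on the classpath pointing at a running cluster).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class ClusterInfo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // reads yarn-site.xml from the classpath
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();
    // Ask the ResourceManager for the running NodeManagers and their resources.
    for (NodeReport node : yarnClient.getNodeReports(NodeState.RUNNING)) {
      System.out.println(node.getNodeId()
          + " used=" + node.getUsed()
          + " capacity=" + node.getCapability());
    }
    yarnClient.stop();
  }
}
```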
4. Hadoop Common or Common Utilities
These are the common Java libraries and utilities that are used and shared by the other Hadoop modules (MapReduce, HDFS, and YARN).