Assignment 2
1. Explain the design of HDFS. What are the key concepts behind HDFS, such as block
sizes, data replication, and block abstraction? How does HDFS ensure fault tolerance and
scalability?
2. Describe how HDFS stores, reads, and writes files. How does HDFS achieve high
throughput when handling large datasets? Explain the data flow in HDFS from the
client’s perspective.
3. What are the key differences between the Hadoop Distributed File System (HDFS)
command-line interface and its Java API? How can you interact with HDFS using both
the command line and Java?
4. Describe the steps involved in setting up a Hadoop cluster. What are the main
configurations that need to be considered during Hadoop installation? How do you ensure
security in a Hadoop environment?
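As a starting point for question 4, the sketch below shows two of the core configuration files touched during installation. The hostname and port are placeholders, not values from any particular cluster; real values depend on the deployment.

```xml
<!-- core-site.xml: the default filesystem URI clients use
     (namenode-host is a placeholder) -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode-host:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: block replication factor (3 is the common default) -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```

Beyond these, a typical setup also sets JAVA_HOME in hadoop-env.sh and the ResourceManager address in yarn-site.xml.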
5. Explain the role of YARN in the Hadoop ecosystem. How does YARN improve
resource management in Hadoop 2.0? What are the main differences between MRv1 and
MRv2?
6. What are the key characteristics of NoSQL databases? Explain how MongoDB fits into
the NoSQL landscape. How do you create, update, delete, and query documents in
MongoDB?
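To ground question 6, here is a toy in-memory sketch in Scala that mimics MongoDB's CRUD verbs on map-like documents. This is an analogy only, not the real MongoDB driver API: the collection is a plain mutable Map, and a Scala predicate stands in for a query document.

```scala
import scala.collection.mutable

// Toy "collection": each document is a string-keyed map, keyed by _id.
type Doc = Map[String, Any]
val users = mutable.Map.empty[Int, Doc]

// Create (like insertOne)
users(1) = Map("name" -> "Ada", "age" -> 36)
users(2) = Map("name" -> "Alan", "age" -> 41)

// Update (like a $set: merge new field values into the document)
users(1) = users(1) ++ Map("age" -> 37)

// Query (like find, with a predicate instead of a query document)
val over40 = users.values.filter(d => d("age").asInstanceOf[Int] > 40).toList

// Delete (like deleteOne)
users.remove(2)
```

In real MongoDB the same operations would be query documents such as `{ age: { $gt: 40 } }` sent to the server; the point here is only the create/update/query/delete life cycle of a document.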
7. Describe the concept of Resilient Distributed Datasets (RDDs) in Spark. How do Spark
applications, jobs, stages, and tasks work in the context of distributed data processing?
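For question 7, a useful intuition is that RDD transformations are lazy recipes and actions force computation. The sketch below imitates that behavior with plain Scala collections and a lazy view; it is an analogy, not Spark itself (no SparkContext, no cluster).

```scala
// Not Spark: a plain-Scala analogy for RDD semantics.
// Transformations on a lazy view only record what to do;
// nothing runs until an action forces evaluation.
val data = (1 to 10).toList        // stand-in for an input RDD

val pipeline = data.view           // lazy, like an RDD lineage
  .map(_ * 2)                      // transformation
  .filter(_ % 3 == 0)              // transformation

// Actions (like collect() or sum() in Spark) trigger evaluation.
val result = pipeline.toList
val total  = pipeline.sum
```

In real Spark, the recorded lineage is also what enables fault tolerance: a lost partition is recomputed from the transformations, and each action launches a job that is split into stages and tasks.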
8. Provide an overview of the basic syntax and concepts in Scala. How does Scala support
object-oriented and functional programming? Describe the use of functions, closures, and
inheritance in Scala.
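The three Scala features named in question 8 can be shown in a few lines. The names below (`square`, `withTax`, `Shape`) are illustrative, not from any assigned codebase.

```scala
// A method and a function value
def square(x: Int): Int = x * x

// A closure: the lambda captures `rate` from the enclosing scope
val rate = 0.08
val withTax: Double => Double = price => price * (1 + rate)

// Inheritance: an abstract base class and a concrete subclass
abstract class Shape {
  def area: Double
}
class Rectangle(w: Double, h: Double) extends Shape {
  override def area: Double = w * h
}

val r: Shape = new Rectangle(3, 4)
```

Note how `withTax` keeps working even where `rate` is not otherwise in scope, which is exactly what distinguishes a closure from a plain function.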
9. Compare and contrast the three Hadoop ecosystem frameworks: Pig, Hive, and HBase.
How do they differ in terms of data processing, querying, and storage? Provide examples
of their use cases.
10. What is Zookeeper and how does it help in monitoring a Hadoop cluster? Explain its role
in coordination and configuration management for distributed applications in a cluster
environment.