HDFS
Q1) What's the purpose of a distributed file system such as HDFS or AFS?
A distributed file system like AFS or HDFS manages storage across multiple machines,
so the storage capacity is not restricted by the physical limits of a single machine. By
pooling the machines in a cluster, very large data sets can be stored and processed.
Regular file systems like ext3, ext4, NTFS and FAT32, on the other hand, manage the
storage of a single machine and are bound by its physical limits.
Q2) What are the design considerations behind HDFS architecture?
The design considerations for HDFS are
- Very large files (in TB and PB)
- Write once and read many times (as in analytics)
- High sustained throughput rather than low latency (batch analytics, as in OLAP)
- Using commodity machines to store the data
Q3) What are the different modes in which Hadoop can be run and what's the
purpose of these modes?
Standalone mode - runs in a single JVM with no daemons; for developing/debugging applications (e.g. in Eclipse)
Pseudo Distributed mode - all the daemons run on a single machine; for debugging/testing applications on one machine
Fully Distributed mode - the daemons run across a cluster of machines, as in staging and production environments
Q4) What's the disadvantage of running the NameNode and the DataNode on
the same machine?
The NameNode requires a machine with a lot of memory to hold the metadata, and it
should be a reliable machine because the NameNode is a single point of failure. The
DataNode stores the blocks on its hard disks, so it is recommended to use machines with
high hard disk capacity.
Since the purpose and the machine configuration of the NameNode and the DataNode are
different, it is recommended to run them on separate machines, as is done in production.
For a POC, evaluation or development setup, running the different processes on a single
machine is fine.
Q4) What are the commands for
- deleting the log and data files
- cd $HADOOP_HOME
- rm -rf logs/*
- rm -rf data/*
- Formatting the HDFS
- hadoop namenode -format
- starting/stopping HDFS
- start-dfs.sh and stop-dfs.sh
- make sure that the required processes are running
- jps
Q5) What are the ways to add a new DataNode to a running HDFS cluster?
a) Add the DataNode's hostname to the slaves file and start the DataNode with
`bin/hadoop-daemon.sh start datanode`.
b) Add the DataNode's hostname to the slaves file and restart HDFS.
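A minimal sketch of option (a), assuming the standard Hadoop 1.x layout and a hypothetical hostname `slave3`:
```
# On the master: register the new node (hostname is hypothetical)
echo "slave3" >> conf/slaves

# On the new node: start the DataNode daemon
bin/hadoop-daemon.sh start datanode
```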
Q13) Does HDFS consider the record boundaries while splitting the data into
blocks?
HDFS doesn't consider record boundaries while splitting the data into blocks. The data in
HDFS is exactly split at the block size. So, a record might be split across multiple blocks.
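To see how a particular file was actually split, `hadoop fsck` can list its blocks, their sizes and locations; the path below is hypothetical:
```
hadoop fsck /data/sales.log -files -blocks -locations
```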
Q14) How to encrypt/compress the data in HDFS?
HDFS doesn't support encryption/compression of the data. So the data has to be
encrypted/compressed manually before it is put into HDFS.
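A minimal sketch of compressing a file before loading it; the file name and target directory are hypothetical:
```
gzip sales.log                      # compress on the local file system first
hadoop fs -put sales.log.gz /data/  # then load the compressed file into HDFS
```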
Q15) What happens when DataNode doesn't send a Heart Beat to NameNode?
When a DataNode stops sending heartbeats, the NameNode waits for a timeout period
and then marks the DataNode as dead. The NameNode then re-replicates the blocks that
were on that DataNode onto other nodes, to bring them back to the configured
replication factor.
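To check which DataNodes the NameNode currently considers live or dead, the dfsadmin report can be used:
```
hadoop dfsadmin -report
```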
Q16) How does HDFS verify that a block of data got corrupted and what action
does it take when a block got corrupted?
HDFS stores a checksum (aka signature) for each block along with the actual block, and
the DataNode periodically verifies the blocks against their checksums. If there is a
mismatch, the block is assumed to be corrupted; the corrupt replica is discarded and the
NameNode re-replicates a healthy copy from another DataNode to maintain the
replication factor.
Q17) How does Hadoop know which rack a machine belongs to?
Hadoop has to be made rack aware by setting the `net.topology.script.file.name` property
in core-site.xml to a shell script. The script takes a hostname/ip and returns the rack to
which the host belongs. Hadoop is made rack aware so that the blocks can be placed to
use the network efficiently and to improve fault tolerance.
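A minimal sketch of such a topology script, with hypothetical IP ranges and rack names:
```
#!/bin/bash
# Print one rack name for every host/IP passed as an argument.
while [ $# -gt 0 ]; do
  case "$1" in
    10.1.1.*) echo "/rack1" ;;
    10.1.2.*) echo "/rack2" ;;
    *)        echo "/default-rack" ;;
  esac
  shift
done
```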
Q18) What is the purpose of installing/configuring ssh on the
master/slave nodes in HDFS?
The start-up scripts in the bin folder use ssh to log in to the remote machines in the
cluster and manage (start/stop) the services. The ssh client has to be installed on the
master and the ssh server has to be installed on the slaves.
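The scripts also expect passwordless login, which is typically set up as sketched below; the user and hostname are hypothetical:
```
ssh-keygen -t rsa -P ""        # generate a key pair with an empty passphrase
ssh-copy-id hduser@slave1      # copy the public key to each slave
ssh hduser@slave1              # verify that login now works without a password
```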
Q19) What are the commands to install ssh server and client on Ubuntu (debian
based) and RHEL (RPM based) systems?
On Debian based systems
sudo apt-get install openssh-client
sudo apt-get install openssh-server
On RPM based systems
sudo yum install openssh-clients
sudo yum install openssh-server
Q20) What software has to be installed on the nodes of a Hadoop cluster?
- Linux
- Java
- ssh server
- Hadoop
Q24) What does a HDFS Client need to know about HDFS cluster to perform a
file system operation?
The HDFS client only needs to know the NameNode's hostname/ip and port number to
perform a file system operation in HDFS. There is no need to know any of the DataNode
details.
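For example, a client can address the cluster purely through the NameNode's URI; the hostname and port below are hypothetical:
```
hadoop fs -ls hdfs://namenode-host:8020/
```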
Q25) Where does NameNode store the metadata?
The NameNode stores the metadata in memory as well as on the hard disk. The copy in
memory allows quick access, while the copy on the hard disk (the fsimage and edits files)
provides reliable persistence across restarts.
Q26) Where does DataNode store the blocks of data? What else is stored along
with the block of data?
The DataNode stores the blocks of data on the hard disk. Along with each block it also
stores some metadata about the block, such as the CRC checksum (in a separate .meta file).
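As an illustration, the block directory on a DataNode can be listed; the path below is hypothetical (it is whatever dfs.data.dir points to), and each blk_* file sits next to a .meta file holding its checksums:
```
ls /data/dfs/data/current
```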
Q27) How does the start/stop scripts know where to start/stop the different
services related to HDFS?
The start/stop scripts go through the masters/slaves file to figure out where to start/stop
the different services. The masters file corresponds to the location of the Secondary
NameNode/CheckPointNode and the slaves file corresponds to the location of the
TaskTracker/DataNode.
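A sketch of the two files, with hypothetical hostnames:
```
# conf/masters - where the Secondary NameNode/CheckPointNode runs
master2

# conf/slaves - where the DataNodes/TaskTrackers run
slave1
slave2
slave3
```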
Q28) What are the typical characteristics of machine to run a NameNode and
DataNode?
The NameNode is a single point of failure and stores the metadata in RAM, so it should
typically be a highly reliable machine with a lot of RAM. The DataNodes store all the
blocks on their hard disks, and there are many of them in a cluster, so the DataNodes
should be commodity machines with a lot of hard disk storage on each machine.
Q29) How does NameNode know the location of the different DataNodes in a
HDFS cluster?
The NameNode does not need a pre-configured list of DataNodes. Each DataNode knows
the NameNode's address (from core-site.xml), registers with it on start-up and keeps
sending heartbeats; that is how the NameNode learns about the DataNodes. The slaves
file is only used by the start/stop scripts to know where to launch the DataNode daemons.
Q30) How does a DataNode know the location of the NameNode in a HDFS
cluster?
The NameNode's hostname/ip and port number are stored in core-site.xml (the
fs.default.name property). The DataNode uses this to communicate with the NameNode.
Q31) What happens when some of the DataNodes are overloaded in terms of
block storage? How to rectify this?
When some DataNodes hold a disproportionately large share of the blocks, a good
percentage of the reads will be directed at those DataNodes and the read performance of
HDFS degrades. As a solution, the Hadoop balancer script can be run to move blocks
from the over-utilized nodes to the under-utilized nodes.
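For example, the balancer can be started and told to stop once every DataNode is within a given percentage of the average utilization; the threshold value below is just an example:
```
start-balancer.sh -threshold 5
```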
Q32) What are the commands (rwx) to change the permissions on the files?
The HDFS chmod option can be used to change the rwx permission on the files.
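For example (paths and modes are hypothetical):
```
hadoop fs -chmod 640 /data/sales.log    # set permissions on a single file
hadoop fs -chmod -R 750 /data/reports   # apply recursively to a directory
```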
Q33) How does the get command know where to get the blocks of data from?
The client/gateway interacts with the NameNode to get the information on which
DataNode the blocks are. Once it gets this information, it will directly contact the
DataNodes to get the actual blocks.
Q34) What does rwx permissions mean in case of files and folders in HDFS?
The read permission is required to read files or list the contents of a directory. The write
permission is required to write a file, or for a directory, to create or delete files or
directories in it. The execute permission is ignored for a file because you can't execute a
file on HDFS (unlike POSIX), and for a directory this permission is required to access its
children.
Q35) What are the commands to modify the user/group to which the file
belongs to in HDFS?
The Hadoop chown and chgrp options can be used to modify the user/group to which a
file in HDFS belongs.
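For example (user, group and path are hypothetical):
```
hadoop fs -chown joe:analysts /data/sales.log   # change owner and group
hadoop fs -chgrp analysts /data/sales.log       # change only the group
```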
Q36) What is the difference between copyFromLocal and the put option in HDFS
`hadoop fs` options?
The HDFS put option supports multiple source file systems, but with copyFromLocal the
source is restricted to the local file reference.
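For example (file names and target directory are hypothetical):
```
hadoop fs -put file1.txt file2.txt /data/   # put can take several sources at once
hadoop fs -copyFromLocal file1.txt /data/   # copyFromLocal only takes local sources
```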
Q37) What is the difference between rm and rmr option in HDFS `hadoop fs`
options?
The rm option deletes files and empty directories, while rmr deletes a directory and its
contents recursively.
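For example (paths are hypothetical):
```
hadoop fs -rm /data/old.log        # deletes a file (or an empty directory)
hadoop fs -rmr /data/old-reports   # deletes a directory and everything in it
```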
Q38) What are the different options for looking at the contents of a file in HDFS?
a) The HDFS -cat option.
b) The HDFS web UI (file browser).
c) The file can be copied out of HDFS with the -get option and then viewed on the local
file system.
Q39) What are the different ways of creating data in HDFS?
The data can be inserted using
- the HDFS fs commands
- SQOOP
- Flume
- NoSQL Database like HBase, Cassandra etc
- HDFS Java API
- and others