Big Data
All content following this page was uploaded by Dr.Chandrakant Naikodi on 05 December 2014.
CHANDRAKANT NAIKODI
ABOUT AUTHOR
• Dr. Chandrakant N
Dr. Chandrakant Naikodi is presently working as a Senior Software Engineer at an MNC and
as a visiting professor at Cambridge Institute of Technology, Bangalore, India. He completed
a Diploma in CSE, a BE degree in Information Science and Engg., and an ME in Information
Technology, and earned his PhD in Computer Science and Engg. from
Bangalore University. He has published many research papers in refereed international
journals and conferences. He is the author of many technical books, including “C: Test
Your Aptitude” and “1000 Questions and Answers in C++”, published by Tata McGraw-Hill,
“Programming in C and Data Structure”, published by Vikas Publication, and
“Wireless Sensor Network for Beginners”, published by Mudranik Technologies. His areas of interest
include Computer Networks, MANETs, WSN, Programming Languages, and Big Data.
PREFACE
• Big data is a popular term for the exponential growth and availability of
data, both structured and unstructured. Hadoop is an open-source software framework
for storing and processing big data in a distributed fashion on large clusters of commodity
hardware. This book concentrates on Hadoop architecture, HDFS, MapReduce, etc. The
author would appreciate suggestions and feedback from readers and users of this book;
kindly send them by email to chandrakant.naikodi@{yahoo.in, gmail.com,
facebook.com}.
ACKNOWLEDGEMENTS
My deep gratitude and thanks to my wife Mrs. Vidyadhare Chandrakant and my daughter
Vaishnavi N for their immense patience, prayers and support. My sincere thanks to my father
Mr. Dharmanna N and mother Mrs. Shanthabhai Dharmanna for their blessings and support.
I thank my brothers Mr. Shankar N and Mr. Surykant N, father-in-law Mr. Venkatesh J, mother-in-law
Mrs. Kalavathi V, my brother-in-law Mr. Nataraj and Mr. Raghavendra.
I am greatly thankful to my well-wishers and teachers, especially the late Sri. B C Bhavikatti, Sri.
M B Naikodi, Sri. Somashekar Muniyappa, Sri. Sanjeevkumar Chetti and Sri. Santhoshkumar
Hanaji, who supported and encouraged me greatly at every step.
This book could not have been completed without the support of my family and friends and the help
of several individuals who extended their valuable support in the preparation and compilation
of this book.
To my daughter Vaishnavi
Contents
3 BASICS OF HADOOP
3.1 Data Format
3.2 Analyzing Data with Hadoop
3.2.1 Map and Reduce
3.2.2 Java MapReduce
3.3 Scaling Out
3.4 Data Flow
3.5 Hadoop Streaming
3.6 Hadoop Pipes
3.7 Design of Hadoop Distributed File System (HDFS)
3.8 HDFS Concepts
3.8.1 Blocks
3.8.2 Namenodes (NN) and Datanodes (DN)
3.8.3 HDFS Federation
3.8.4 HDFS High-Availability
3.9 Java interface
3.9.1 Reading Data from a Hadoop URL
3.9.2 Reading Data using the FileSystem API
3.9.3 Writing Data
3.9.4 Directories
3.9.5 Querying the Filesystem
3.9.6 Deleting Data
3.10 Hadoop I/O
3.10.1 Data Integrity
3.10.2 Compression
3.10.3 Serialization
3.10.4 File-based data structures
4 MAP-REDUCE APPLICATIONS
4.1 Map-Reduce Workflows
4.1.1 Decomposing a Problem into MapReduce Jobs
4.1.2 JobControl
4.1.3 Apache Oozie
4.2 Unit tests with MRUnit
4.2.1 Mappers Testing
4.2.2 Reducers Testing