Semester: Autumn
Objective:
The course gives a comprehensive introduction to storing and processing 'big data' using modern big data
systems such as MapReduce and Spark that run on large commodity clusters. The primary focus is on
algorithm design and programming at 'scale', applied to all major domains: text, graph, streaming and
relational data. The course also introduces scalable machine learning algorithms using Spark. The course will
use the Databricks cloud platform for hands-on demos. Students can use Databricks or any other cluster
computing framework for their assignments.
Content:
• Introduction (2 hrs)
Characteristics of big data, Challenges in processing big data, Limitations of classical algorithms on big
data: Case studies, Applications of big data: Examples from healthcare, education, economics, agriculture
and public policy
• Storing big data (2 hrs)
Distributed file systems: Google File System (GFS), Hadoop Distributed File System (HDFS)
Random data access in distributed storage: Apache HBase
• Overview of cluster computing (3 hrs)
Cluster organization, Cluster managers: YARN, Mesos
Programming challenges in clusters
• Programming for big data (4 hrs)
Functional programming with Python and Scala
• MapReduce: Simplified Data Processing on Large Clusters (6 hrs)
MapReduce internals, MapReduce programming model: Mapper, reducer, combiner and partitioner
Algorithm design in MapReduce: Value-to-key design pattern, Handling data shuffling, Customizing the
partitioner for better parallelization
Example algorithms: Language model from text, Word co-occurrence, Database join, k-means clustering
• Spark: Fast Data Processing on Large Clusters (8 hrs)
Spark internals, Spark APIs: Resilient distributed datasets (RDD), RDD creation, RDD transformation, RDD
persistence, RDD partitioning, RDD lineage graph, DataFrames, Datasets
Programming in Spark
• Graph data processing (3 hrs)
Property graph model
GraphX: Graph operators, Property operators, Structural operators, Join operators, Neighborhood
aggregators
Graph builders, Vertex RDD, Edge RDD
Example algorithms: PageRank, Graph clustering, Triangle counting
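Illustration (not part of the official course material): the MapReduce programming model listed above (mapper, combiner, partitioner, reducer) can be sketched in a single Python process using the word co-occurrence example. All function names, the toy input splits, and the character-sum partitioning rule are assumptions made for this sketch; a real Hadoop or Spark job distributes these steps across a cluster.

```python
from collections import defaultdict
from itertools import combinations


def mapper(line):
    """Emit ((w1, w2), 1) for each unordered pair of distinct words in a line."""
    words = sorted(set(line.lower().split()))
    for pair in combinations(words, 2):
        yield pair, 1


def combiner(pairs):
    """Pre-aggregate counts within one input split to reduce shuffle volume."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return local.items()


def partitioner(key, num_reducers):
    """Hypothetical custom partitioner: route pairs by their first word.

    A character sum is used instead of hash() so the result is deterministic.
    """
    return sum(ord(c) for c in key[0]) % num_reducers


def reducer(key, values):
    """Sum the partial counts for one word pair."""
    return key, sum(values)


def run_job(splits, num_reducers=2):
    # Shuffle stage: one bucket (key -> list of values) per reducer.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:
        mapped = (kv for line in split for kv in mapper(line))
        for key, value in combiner(mapped):
            buckets[partitioner(key, num_reducers)][key].append(value)
    # Reduce stage: each reducer processes its keys in sorted order.
    result = {}
    for bucket in buckets:
        for key in sorted(bucket):
            k, total = reducer(key, bucket[key])
            result[k] = total
    return result


# Two toy input splits, each a list of lines.
splits = [["big data systems", "big data clusters"], ["data systems"]]
counts = run_job(splits)
print(counts[("big", "data")])  # 2
```

Partitioning on the first word only keeps all pairs that share that word on one reducer, which is the kind of choice the "customizing the partitioner" topic covers.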
References:
Books:
Papers:
1. MapReduce: Simplified Data Processing on Large Clusters, Dean and Ghemawat, OSDI, 2004.
2. Bigtable: A Distributed Storage System for Structured Data, Chang et al., OSDI, 2006.
3. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, Zaharia et
al., NSDI, 2012.
4. Discretized Streams: Fault-Tolerant Streaming Computation at Scale, Zaharia et al., SOSP, 2013.