This document outlines a course on big data processing that introduces modern big data systems like MapReduce and Spark. The course focuses on algorithm design and programming at scale for different data types like text, graphs, streams, and relational data. It covers topics like storing big data in distributed file systems, cluster computing frameworks, programming models for MapReduce and Spark, processing graph and stream data, and machine learning with Spark. Students will get hands-on experience using the Databricks cloud platform.


Name of the Subject: Big Data Processing

L-T-P : 3-0-0, Credit : 3

Semester: Autumn

Prerequisites: Programming and Data Structure

Objective:
The course gives a comprehensive introduction to storing and processing 'big data' using modern big data
systems such as MapReduce and Spark, which run on large commodity clusters. The primary focus is on
algorithm design and programming at 'scale', applied to all major data domains: text, graph, streaming and
relational data. The course also introduces scalable machine learning algorithms using Spark. The Databricks
cloud platform will be used for hands-on demonstrations; students may use Databricks or any other cluster
computing framework for their assignments.

Content:
• Introduction (2 hrs)
Characteristics of big data, Challenges in processing big data, Limitations of classical algorithms on big
data: Case studies, Applications of Big data: Examples from Healthcare, Education, Economics, Agriculture
and Public Policy
• Storing big data (2 hrs)
Distributed file system: Google file system, Hadoop Distributed File System (HDFS)
Random data access in distributed storage: Apache HBase
• Overview of Cluster computing (3 hrs)
Cluster organization, Cluster managers: YARN, Mesos
Programming challenges in Cluster
• Programming for big data (4 hrs)
Functional programming with Python and Scala
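
The functional style in this unit maps directly onto the big-data APIs covered later. A minimal sketch in plain Python (the sample words are illustrative; the same `map`/`filter`/`reduce` trio exists in Scala and underpins the MapReduce and Spark programming models):

```python
from functools import reduce

words = ["spark", "hadoop", "spark", "hdfs", "spark"]

# map: transform every element; filter: keep a subset; reduce: fold to one value
lengths = list(map(len, words))
long_words = list(filter(lambda w: len(w) > 4, words))
total_chars = reduce(lambda acc, n: acc + n, lengths, 0)

print(lengths, long_words, total_chars)
```

In Spark, the same three operations reappear as RDD transformations (`map`, `filter`) and an action (`reduce`), applied per partition across the cluster.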
• MapReduce: Simplified Data Processing on Large Clusters (6 hrs)
MapReduce internals, MapReduce programming model: Mapper, reducer, combiner and partitioner
Algorithm design in MapReduce: Value-to-key design pattern, Handling data shuffling, Customizing the
partitioner for better parallelization
Example algorithms: Language model from text, Word co-occurrence, Database join, k-means clustering
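
The mapper/reducer flow above can be sketched in plain Python; the sample lines are made up, and an in-memory sort stands in for the distributed shuffle phase, so this is an analogue rather than a cluster implementation:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Emit (word, 1) for every word, as in the classic word-count example
    for word in line.split():
        yield (word, 1)

def reducer(word, counts):
    # Sum all counts that the shuffle grouped under this word
    return (word, sum(counts))

lines = ["big data big clusters", "data data"]

# Map phase: apply the mapper to every input record
pairs = [kv for line in lines for kv in mapper(line)]

# Shuffle phase (simulated): sort and group by key
pairs.sort(key=itemgetter(0))

# Reduce phase: one reducer call per distinct key
result = dict(reducer(k, (c for _, c in grp))
              for k, grp in groupby(pairs, key=itemgetter(0)))
print(result)  # {'big': 2, 'clusters': 1, 'data': 3}
```

A combiner would run the same summation on each mapper's local output before the shuffle, cutting network traffic; a custom partitioner would control which reducer receives which keys.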
• Spark: Fast Data Processing on Large Clusters (8 hrs)
Spark internals, Spark APIs: Resilient distributed datasets (RDDs), RDD creation, RDD transformations, RDD
persistence, RDD partitioning, RDD lineage graph, DataFrames, Datasets
Programming in Spark
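
A key idea in this unit is that RDD transformations are lazy: Spark only records the lineage graph until an action forces computation. A rough plain-Python analogue using generators (illustrative only; no Spark is involved):

```python
# Lazy "transformations" with generators: nothing runs until an "action"
nums = range(1, 6)                           # source dataset
squared = (n * n for n in nums)              # transformation: recorded, not executed
evens = (n for n in squared if n % 2 == 0)   # another recorded transformation

# The "action" pulls data through the whole recorded pipeline at once
total = sum(evens)
print(total)  # 4 + 16 = 20
```

In Spark the recorded pipeline is the lineage graph, which also enables fault tolerance: a lost partition is recomputed by replaying its lineage rather than restored from a checkpoint.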
• Graph data processing (3 hrs)
Property graph model
GraphX: Graph operators, Property operators, Structural operators, Join operators, Neighborhood
aggregators
Graph builders, Vertex RDD, Edge RDD
Example algorithms: PageRank, Graph clustering, Triangle counting
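
As a preview of the PageRank example, here is a plain-Python power-iteration sketch on a hypothetical three-node graph; GraphX would distribute the same computation over vertex and edge RDDs using neighborhood aggregation:

```python
# Tiny directed graph as adjacency lists: node -> out-neighbours (illustrative data)
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
damping, n = 0.85, len(graph)
rank = {v: 1.0 / n for v in graph}

for _ in range(50):  # power iteration; 50 rounds is ample for 3 nodes
    contrib = {v: 0.0 for v in graph}
    for v, outs in graph.items():
        share = rank[v] / len(outs)  # each node splits its rank over its out-edges
        for u in outs:
            contrib[u] += share
    # standard PageRank update with damping factor 0.85
    rank = {v: (1 - damping) / n + damping * contrib[v] for v in graph}

print({v: round(r, 3) for v, r in sorted(rank.items())})
```

Node "c" ends up ranked above "b" because it receives links from both "a" and "b", while "b" is linked only from "a".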

• Stream Processing (3 hrs)
Characteristics of stream data, Discretized stream processing in Spark, Combining stream and batch
processing in Spark, Use cases of stream processing
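
Discretized stream processing chops the stream into micro-batches and runs each one as a small batch job, optionally carrying state across batches. A plain-Python sketch with made-up batches, where a `Counter` plays the role of Spark Streaming's state carried across micro-batches:

```python
from collections import Counter

# Simulated micro-batches: the stream is discretized into short time slices
batches = [["spark", "stream"], ["stream", "stream"], ["spark"]]

running = Counter()   # state maintained across batches
per_batch = []
for batch in batches:
    counts = Counter(batch)          # ordinary batch computation on one slice
    running.update(counts)           # stateful update across the stream
    per_batch.append(dict(counts))

print(per_batch, dict(running))
```

Because each micro-batch is just a batch job, the same code paths (and fault-tolerance story) serve both streaming and batch workloads, which is what makes combining the two straightforward.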
• Machine Learning with Spark (5 hrs)
Classification, Regression, Clustering, Collaborative Filtering, Training Deep Neural Network
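
As a taste of the regression topic, a minimal gradient-descent fit in plain Python on toy data; the data and learning rate are illustrative, and Spark's MLlib would parallelize the gradient computation over partitions:

```python
# Fit y = w * x on toy data by gradient descent (true slope is 2)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w, lr = 0.0, 0.01
for _ in range(1000):
    # gradient of mean squared error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad

print(round(w, 3))  # converges to 2.0
```

The `sum` over data points is the part a cluster framework distributes: each partition computes a partial gradient, and a reduce step combines them before the weight update.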
• Processing structured data (1 hr)
Spark SQL
• High-Level Language for Data Analytics on Clusters (2 hrs)
Pig Latin

References:

Books:

1. Mining of Massive Datasets: Rajaraman and Ullman

2. Data-Intensive Text Processing with MapReduce: Lin and Dyer

3. Learning Spark: Konwinski et al.

4. Spark - The Definitive Guide: Chambers and Zaharia

Papers:

1. MapReduce: Simplified Data Processing on Large Clusters, Dean and Ghemawat, OSDI, 2004.

2. Bigtable: A Distributed Storage System for Structured Data, Chang et al., OSDI, 2006.

3. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, Zaharia et
al., NSDI, 2012.

4. Discretized Streams: Fault-Tolerant Streaming Computation at Scale, Zaharia et al., SOSP, 2013.
