Syllabus E63 Spring2016-2

This document provides information on the Big Data Analytics course CSCI E-63 to be offered in Spring 2016. The course will introduce tools for storing, manipulating, and analyzing large volumes of unstructured data, including Hadoop, Spark, SQL databases, NoSQL databases and data streaming technologies. It will cover topics like MapReduce, machine learning algorithms, and data visualization. Grades will be based mainly on homework assignments, with a final project accounting for 15% of the grade. The course will include both lectures and optional online lab sections on various topics related to big data.

CSCI E-63 Big Data Analytics (24038) 2016 Spring term (4 credits)

Zoran B. Djordjević, PhD, Senior Enterprise Architect, NTT Data, Inc.

Lectures: Fridays starting January 29th, 2016 at 5:30 to 7:30 PM (EST), 1 Story Street, Room
306, Cambridge, MA
Optional Online Sections: Saturdays starting January 30th, 2016 at 10-11:30 AM (EST).

The recent explosion of social media and the computerization of every aspect of economic
activity have resulted in the creation of large volumes of mostly unstructured data: web logs,
videos, speech recordings, photographs, e-mails, Tweets, and the like. In a parallel development,
computers keep getting ever more powerful and storage ever cheaper. Today, we have the
ability to reliably and cheaply store huge volumes of data, efficiently analyze them, and extract
business and socially relevant information. This course introduces you to several key IT
technologies that you will be able to use to manipulate, store, and analyze big data. We will
look at the basic tools for statistical analysis, R and Python, and a few key methods used in
Machine Learning. We will review MapReduce techniques for parallel processing and Hadoop,
an open source framework that allows us to cheaply and efficiently implement MapReduce on
internet-scale problems. We will spend considerable time mastering Spark, a memory-based
evolution of Hadoop. We will touch on related tools, such as Hive, that provide SQL-like access
to unstructured data. We will analyze so-called NoSQL storage solutions, exemplified by
Cassandra, for their critical features: speed of reads and writes, data consistency, and ability to
scale to extreme volumes. We will examine memory-resident databases (VoltDB) and streaming
technologies, which allow analysis of data in flight, i.e., in near real time. Students will gain the
ability to design highly scalable systems that can accept, store, and analyze large volumes of
unstructured data in batch mode and/or real time.
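
As a rough illustration of the MapReduce idea mentioned above, the sketch below counts words in a few strings using separate map, shuffle, and reduce steps. It is a single-process toy in plain Python, not Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample documents are assumptions made up for this example.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield (word, 1)


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample input, chosen only for illustration.
counts = word_count(["big data big ideas", "data in flight"])
print(counts)
```

In a real cluster the map and reduce calls would run in parallel on different machines, and the framework would perform the shuffle across the network; here everything runs in one process so the three steps are easy to follow.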
Prerequisites: Familiarity with intermediate Java is advised. Most assignments could easily be
done in Python, Scala, C# or Perl as well. We will assume no familiarity with Linux and will
introduce you to all essential Linux commands. Students need access to a computer with a
64-bit operating system and at least 4 GB of RAM. Note: 8 GB or more of RAM is strongly advised.
Lectures: Lectures will be delivered live, webcast simultaneously, and made available afterwards
for online viewing through the BlackBoard Collaborate Web Conferencing tool. A streaming
recording might also be available. Links to the BlackBoard Collaborate recordings will be
accessible on the course Web site within a few hours after the end of each lecture. If streaming
video is provided, recorded lectures will become available with a delay of up to two days.
References: Detailed handouts with references to material on the Web will be handed out
every week. There is no required textbook.
Grading: Practically every class will be followed by a homework assignment. Grades on the
class assignments constitute approximately 85% of the final grade; the remaining 15% will be
earned through the final project. Final projects will be assigned a few weeks before the end
of the class. You will produce a paper (10+ pages of MS Word text), 10+ PowerPoint slides,
a working demo, a 15-minute YouTube video of your presentation, and a brief 2-minute YouTube
video that might be presented to the class on the day of final presentations. Several students
will be invited to present their final projects live to the entire class. Grades: a cumulative
grade of 95% or higher on all assignments and the final project gives you an A as the final
grade in the course, 90-94.9% gives you an A-, 85-89.9% a B+, 80-84.9% a B, etc.
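
The grading arithmetic above can be sketched as follows. This is only an illustration under stated assumptions: the weights are taken as exactly 85% and 15% (the syllabus says "approximately"), band edges are treated as inclusive lower bounds, and bands below B are left unspecified because the syllabus does not list them. The function names and sample scores are hypothetical.

```python
def final_percentage(assignment_pct, project_pct,
                     assignment_weight=0.85, project_weight=0.15):
    """Weighted cumulative grade; the weights are assumed to be exact."""
    return assignment_weight * assignment_pct + project_weight * project_pct


def letter_grade(cumulative_pct):
    """Map a cumulative percentage to the letter bands listed in the syllabus.

    Returns None below 80%, since the syllabus does not spell out those bands.
    """
    bands = [(95.0, "A"), (90.0, "A-"), (85.0, "B+"), (80.0, "B")]
    for cutoff, letter in bands:
        if cumulative_pct >= cutoff:
            return letter
    return None


# Hypothetical student: 92% average on assignments, 100% on the final project.
score = final_percentage(92.0, 100.0)
print(round(score, 1), letter_grade(score))
```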

Communications: [email protected], also Canvas message box once the class starts.

Tentative List of Class Topics:
No.  Date      Topic
1    01/29/16  R, a language developed by statisticians for statisticians. Whatever you do
               with Big Data has something to do with statistics. Learning R might be a
               good idea.
2    02/05/16  MapReduce Framework and Hadoop. Embarrassingly parallel processes and other
               design patterns for big data processing. Cloudera virtual machine. HDFS
               (Hadoop Distributed Filesystem), YARN (Yet Another Resource Negotiator).
3    02/12/16  MapReduce 2 API. We will examine some of the advanced details of the Hadoop
               MapReduce Java API.
4    02/19/16  Spark. A memory-based evolution of the MapReduce framework with considerable
               improvement in execution speed. Spark RDDs.
5    02/26/16  Hive is a data warehouse built atop HDFS and Hadoop. It allows SQL queries
               over data stored in HDFS.
6    03/04/16  Spark Data Frames and Spark SQL are tools (APIs) within the Spark ecosystem
               that allow you to manipulate data (RDDs) in a most efficient manner.
7    03/11/16  NoSQL Databases and Cassandra are non-traditional database engines built with
               some relaxed features, like consistency, but providing very high performance
               of reads or writes.
--   03/18/16  No class (spring break)
8    03/25/16  Spark Streaming, Kafka and Cassandra are becoming a standard stack for
               processing of fast data.
9    04/01/16  Visualizing Large Data Sets with D3. We will introduce a JavaScript API and
               techniques that enable more insightful use of graphs and charts to present
               the content and features of large data sets.
10   04/08/16  Neo4j, a Graph Database. A storage and retrieval system based on hierarchical
               structures which have proven themselves very efficient for fast queries among
               highly correlated data.
11   04/15/16  Natural Language Processing. Basic mechanisms for processing and analysis of
               written text.
12   04/22/16  Spark MLlib, Machine Learning with Spark. We will review a few algorithms that
               can learn from and make predictions on data.
13   04/29/16  Compute Unified Device Architecture (CUDA) is a parallel computing platform
               and API model created by NVIDIA. It allows software developers to use graphics
               processing units (GPUs) for general-purpose processing. Basic programming
               techniques.
14   05/06/16  Data Flow Computing is a new revolutionary way of performing computations,
               completely different from computing with conventional CPUs or GPUs. Dataflow
               computers focus on optimizing the movement of data in an application and
               utilize massive parallelism between thousands of tiny dataflow cores to
               provide order-of-magnitude benefits in performance, space and power
               consumption. We will learn to perform Data Flow computations using Maxeler's
               technology.
15   05/13/16  Presentations of selected student projects.

Tentative List of Lab Topics

No.  Date      Topic                                   Led by
1    01/30/16  R                                       Zoran
2    02/06/16  MapReduce Framework and Hadoop          Zoran
3    02/13/16  MapReduce 2 API                         Marina
4    02/20/16  Spark. Spark RDDs                       Diane
5    02/27/16  Hive                                    Rahul
6    03/05/16  Spark Data Frames and Spark SQL         Olena
7    03/12/16  NoSQL Databases and Cassandra           Blagoje (AWS)
--   03/19/16  No class (spring break)
8    03/26/16  Spark Streaming, Kafka and Cassandra    Marina
9    04/02/16  Visualizing Large Data Sets with D3     Diane
10   04/09/16  Neo4j, a Graph Database                 Zoran
11   04/16/16  Natural Language Processing             Blagoje
12   04/23/16  Spark MLlib                             Rahul
13   04/30/16  CUDA                                    Blagoje/Zoran
14   05/07/16  Data Flow Computing                     Guest
15   05/13/16  Presentations
