CSE6006 NoSQL-Databases ETH 1 AC41
Subject Code: CSE6006
NoSQL Databases
L,T,P,J,C: 2,0,2,4,4
Objective: This course explores the origins of NoSQL databases and the
characteristics that distinguish them from traditional relational database
management systems.
It covers the architectures and common features of the main types of
NoSQL databases (key-value stores, document databases, column-family
stores, and graph databases).
Finally, it discusses the criteria that decision makers should consider when
choosing between relational and non-relational databases, and techniques for
selecting the NoSQL database that best addresses a specific use case.
Expected Outcomes: After successfully completing the course, the student should be able to:
1. Explain the detailed architecture of NoSQL databases, define objects, load data,
query data and tune performance.
2. Define NoSQL and describe its characteristics, history and the primary benefits
of using NoSQL databases.
3. Define the major types of NoSQL databases, including a primary use
case and the advantages and disadvantages of each type.
4. Analyze semi-structured data and choose an appropriate storage
structure.
SLOs: 2, 7, 12
Module Topics L Hrs SLO
1 INTRODUCTION TO NOSQL CONCEPTS (L Hrs: 4, SLO: 2)
Database revolutions: First generation, second generation, third
generation, Managing Transactions and Data Integrity, ACID and BASE
for reliable database transactions, Speeding performance by strategic use
of RAM, SSD, and disk, Achieving horizontal scalability with database
sharding, Brewer’s CAP theorem.
2 NOSQL DATA ARCHITECTURE PATTERNS (L Hrs: 4, SLO: 12)
NoSQL data models: Aggregate Models (Document Data Model, Key-Value
Data Model, Columnar Data Model), Graph-Based Data Model (Graph Data
Model), NoSQL system ways to handle big data problems,
Moving queries to the data, not data to the query, Hash rings to distribute
data on clusters, Replication to scale reads, Letting the database distribute
queries to data nodes.
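As an illustration of the "hash rings to distribute data on clusters" topic, the following is a minimal Python sketch of a consistent-hash ring. The class name HashRing, the node names, the use of MD5 and the choice of 64 virtual nodes per server are assumptions made for this example; they are not prescribed by the syllabus or taken from any particular NoSQL product.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring mapping keys to nodes (illustrative sketch)."""

    def __init__(self, nodes, vnodes=64):
        self._vnodes = vnodes  # virtual nodes per server (assumed value)
        self._ring = []        # sorted list of (position, node) pairs
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _position(label):
        # Map any string to a stable integer position on the ring.
        return int(hashlib.md5(label.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self._vnodes):
            bisect.insort(self._ring, (self._position(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def node_for(self, key):
        if not self._ring:
            raise ValueError("ring has no nodes")
        # First virtual node clockwise from the key; wrap around at the end.
        idx = bisect.bisect_left(self._ring, (self._position(key), ""))
        return self._ring[idx % len(self._ring)][1]


# Hypothetical cluster of three servers and 1000 keys.
ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}
ring.remove_node("node-b")
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

In this sketch, only the keys that were stored on the removed server change owner; keys on the surviving servers stay where they were, which is the property that makes hash rings attractive for clusters that grow and shrink.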
3 KEY-VALUE DATA STORES (L Hrs: 5, SLO: 7)
From array to key-value databases, Essential features of key-value
Databases, Properties of keys, Characteristics of Values, Key-Value
Database Data Modeling Terms, Key-Value Architecture and
Implementation Terms, Designing Structured Values, Limitations of
Key-Value Databases, Design Patterns for Key-Value Databases, Case Study:
Key-Value Databases for Mobile Application Configuration
4 DOCUMENT-ORIENTED DATABASE (L Hrs: 5, SLO: 7)
Document, Collection, Naming, CRUD operations, Querying, Indexing,
Replication, Sharding, Consistency Implementation: Distributed
consistency, Eventual Consistency, Capped Collection,
Case studies: document-oriented database: MongoDB and/or Cassandra
5 COLUMNAR DATA MODEL - I (L Hrs: 3, SLO: 7)
Data warehousing schemas: Comparison of columnar and row-oriented
storage, Column-store Architectures: C-Store and VectorWise,
Column-store internals, Inserts/updates/deletes, Indexing, Adaptive
Indexing and Database Cracking.
6 COLUMNAR DATA MODEL - II (L Hrs: 3, SLO: 7)
Advanced techniques: Vectorized Processing, Compression, Write
penalty, Operating Directly on Compressed Data, Late Materialization,
Joins, Group-by, Aggregation and Arithmetic Operations, Case Studies
7 DATA MODELING WITH GRAPH (L Hrs: 4, SLO: 7)
Comparison of Relational and Graph Modeling, Property Graph Model,
Graph Analytics: Link analysis algorithm, Web as a graph, PageRank,
Markov chain, PageRank computation, Topic-specific PageRank (PageRank
computation techniques: iterative processing, random walk
distribution), Querying Graphs: Introduction to Cypher, Case study:
Building a Graph Database Application, community detection
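To make the "PageRank computation by iterative processing" topic concrete, here is a small Python sketch of the power-iteration form of PageRank. The function name pagerank, the damping factor of 0.85, the convergence tolerance and the four-page example graph are assumptions chosen for illustration only.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Iterative PageRank.

    links maps each page to the list of pages it links to.
    Returns a dict mapping each page to its rank; ranks sum to 1.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by dangling pages (no out-links) is spread over all pages.
        dangling = sum(rank[u] for u in nodes if not links.get(u))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for u in nodes:
            targets = links.get(u, [])
            for v in targets:
                # Each page shares its rank equally among its out-links.
                new_rank[v] += damping * rank[u] / len(targets)
        delta = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical four-page web graph.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
```

In this example graph, page C receives links from every other page and ends up with the highest rank, while page D, which nobody links to, keeps only the share contributed by the random-jump term.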
8 Recent Trends (L Hrs: 2, SLO: 2)
Lab (Indicative List of Experiments) (Hrs: 30, SLO: 14)
1. Import the Hubway data into Neo4j and configure Neo4j. Then answer the
following questions using the Cypher Query Language:
a) List the top 10 stations with the most outbound trips (show station name and
number of trips).
b) List the top 10 stations with the most inbound trips (show station name and
number of trips).
c) List the top 5 routes with the most trips (show starting station name, ending
station name and number of trips).
d) List the hour number (for example, 13 means 1 pm to 2 pm) and the number of
trips that start from the station "B.U. Central".
e) List the hour number (for example, 13 means 1 pm to 2 pm) and the number of
trips that end at the station "B.U. Central".
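The "hour number" used in the last two questions can be checked outside the database with a few lines of Python. The sketch below is only a cross-check for results obtained with Cypher, not a replacement for the Cypher queries; the function name trips_per_hour and the timestamp format are assumptions and should be adjusted to match the actual Hubway export.

```python
from collections import Counter
from datetime import datetime


def trips_per_hour(start_times, fmt="%m/%d/%Y %H:%M:%S"):
    """Count trips per hour number (13 covers 13:00:00 up to 13:59:59)."""
    hours = Counter(datetime.strptime(ts, fmt).hour for ts in start_times)
    return sorted(hours.items())


# Hypothetical start times of three trips from one station.
demo = trips_per_hour([
    "7/28/2011 13:05:00",
    "7/28/2011 13:59:59",
    "7/29/2011 08:00:00",
])
print(demo)
```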
2. The flight data can be found at http://stat-computing.org/dataexpo/2009/the-data.html.
You need to download just one year, from which you can sample a
subset of at least 10,000 records. You can use the data from a full year if you
want, but we recommend using a smaller dataset for simplicity.
Hint: If you need to unzip the data file, you can use the command bzip2 -d
datafile from a terminal. For example, for 2008, download the file and
unzip it using: bzip2 -d 2008.csv.bz2. The airport data can be found at
http://stat-computing.org/dataexpo/2009/supplemental-data.html.
(1) Download the flight dataset and the airport dataset.
(2) Clean the dataset (for example, remove columns you do not need, remove
records with missing information, remove duplicate records and so on).
(3) Add a header row to the CSV files.
(4) Import the data into Neo4j.
(5) Write queries to answer the following questions:
(5.1) List the top 10 airports with the most outbound flights.
(5.2) List the top 10 airports with the most inbound flights.
(5.3) List the top 5 routes with the most flights on weekdays.
(5.4) List the top 5 routes with the most flights on weekends.
(5.5) List the hour number (for example, 13 means 1 pm to 2 pm) and the
number of flights that depart from a specific airport in your data (e.g.,
Boston Logan Airport).
(5.6) List the hour number (for example, 13 means 1 pm to 2 pm) and the
number of flights that arrive at a specific airport in your data (e.g.,
Boston Logan Airport).
In your report, you should answer the following questions:
(a) State the year of the flights that you downloaded and prepared for this
assignment. You may use a sample of one year's data, but the number of
flights cannot be smaller than 10,000.
(b) Describe how you cleaned the data (Which columns did you remove and
why? Which rows did you remove and why?). Hint: You can clean your data by writing a
small program in Java, Python, C, Matlab or any other programming language.
(c) Describe the header you gave to the CSV files.
(d) Write down the command used for importing the data.
(e) Write and execute the queries from step (5) above.
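The cleaning step (2) and the hint in (b) can be approached with a short Python program along the following lines. This is a sketch under stated assumptions: the function name clean_flights is hypothetical, the column names in KEEP are assumed to match the header of the downloaded flight file, and "NA" is assumed to be the marker for missing values. Which columns to keep is a design decision that the report should justify.

```python
import csv
import io

# Columns assumed to be needed for the later queries (an assumption).
KEEP = ["Year", "Month", "DayofMonth", "DayOfWeek", "CRSDepTime",
        "CRSArrTime", "UniqueCarrier", "FlightNum", "Origin", "Dest"]


def clean_flights(source, sink, keep=KEEP, missing=("", "NA")):
    """Copy rows from source to sink, keeping only the chosen columns and
    dropping rows with missing values or exact duplicates.
    Returns the number of data rows written."""
    reader = csv.DictReader(source)
    writer = csv.DictWriter(sink, fieldnames=list(keep), lineterminator="\n")
    writer.writeheader()
    seen = set()
    written = 0
    for row in reader:
        values = tuple((row.get(col) or "").strip() for col in keep)
        if any(v in missing for v in values):
            continue  # incomplete record
        if values in seen:
            continue  # duplicate record (judged on the kept columns only)
        seen.add(values)
        writer.writerow(dict(zip(keep, values)))
        written += 1
    return written


# Tiny made-up sample: one duplicate row and one row with a missing airport.
sample = io.StringIO(
    "Year,Month,DayofMonth,DayOfWeek,CRSDepTime,CRSArrTime,"
    "UniqueCarrier,FlightNum,TailNum,Origin,Dest\n"
    "2008,1,3,4,1955,2225,WN,335,N712SW,BOS,BWI\n"
    "2008,1,3,4,1955,2225,WN,335,N712SW,BOS,BWI\n"
    "2008,1,3,4,0735,1000,WN,3231,NA,BOS,NA\n"
    "2008,1,4,5,0620,0750,WN,448,N428WN,IND,BWI\n"
)
cleaned = io.StringIO()
rows_written = clean_flights(sample, cleaned)
```

For real data, source and sink would be files opened with open(..., newline=""). Note that duplicates are detected on the kept columns only, so the more columns are dropped, the more rows may be treated as duplicates.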
3. Download the zip code dataset at http://media.mongodb.org/zips.json.
Use mongoimport to import the zip code dataset into MongoDB.
After importing the data, answer the following questions by using aggregation
pipelines:
(1) Find all the states that have a city called "BOSTON".
(2) Find all the states and cities whose names include the string "BOST".
(3) Each city has several zip codes. Find the city in each state with the largest number
of zip codes, and rank those cities, together with their states, by city population.
(4) MongoDB can query on spatial information.
Assume a spatial position [-72, 42]. Within a range of 2 of that position (for
example [-71.5, 41.5], [-72.5, 42.5] or somewhere else), there may be a number
of zip codes. Find the states in that range. Return the total
population and the number of cities for each state in that range, and rank the states
by number of cities.
(5) Consider a rectangular area whose vertices are [-80, 30], [-90, 30],
[-90, 40] and [-80, 40]. Find and report the 10 largest cities (by
population) in this area.
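As a starting point for the aggregation questions, the sketch below builds two pipelines as plain Python data structures: one in the spirit of question (1) and one in the spirit of question (5). The function names are hypothetical, and the field names city, state, pop and loc are assumed to match the imported zip code documents; the pipelines have not been run against a live server here, so treat them as a draft to verify.

```python
def states_with_city(city_name):
    """Pipeline for question (1): states that contain a city with this name."""
    return [
        {"$match": {"city": city_name}},
        {"$group": {"_id": "$state"}},
        {"$sort": {"_id": 1}},
    ]


def largest_cities_in_box(lower_left, upper_right, limit=10):
    """Pipeline for question (5): most populous cities inside a rectangle.

    lower_left and upper_right are [longitude, latitude] pairs.
    """
    return [
        {"$match": {"loc": {"$geoWithin": {"$box": [lower_left, upper_right]}}}},
        # A city spans several zip codes, so add up their populations first.
        {"$group": {
            "_id": {"city": "$city", "state": "$state"},
            "population": {"$sum": "$pop"},
        }},
        {"$sort": {"population": -1}},
        {"$limit": limit},
    ]


boston_pipeline = states_with_city("BOSTON")
box_pipeline = largest_cities_in_box([-90, 30], [-80, 40])
```

With the PyMongo driver, a pipeline of this kind would typically be passed to the collection's aggregate method; the database and collection names depend on how mongoimport was invoked. Only the zip codes that fall inside the rectangle contribute to a city's population in the second pipeline.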
4. Create a database that stores road cars. Cars have a manufacturer and a type. Each
car has a maximum performance and a maximum torque value. Use ifconfig to
determine a machine’s IP address. To check whether Cassandra is running in the
background, run: ps aux | grep cassandr[a]
Do the following:
a) Test Cassandra’s replication schema and consistency models.
b) Network partition without replication
c) Network partition with replication and weak consistency
d) Network partition with replication and quorum consistency
5. Cars have different powertrains. Each type can be described with different
parameters:
a) Internal combustion engine: fuel type, displacement, maximum torque,
maximum power
b) Electric motor: maximum torque, maximum power
c) Both: all of the above and the combined maximum torque and power values
6. The class hierarchy for different powertrain types
7. Extend the cars column family to store the powertrain of each car.
8. Write a query that collects the cars with an internal combustion engine.
9. Write a query that collects the cars with an internal combustion engine or an
electric motor.
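Before running the network partition experiments, it can help to predict which requests should succeed. The Python sketch below models only the arithmetic behind the consistency levels ONE, QUORUM and ALL, where a quorum is taken as floor(replication factor / 2) + 1. The function names are hypothetical, and the overlap rule R + W > N is a simplified model that ignores mechanisms such as hinted handoff and read repair.

```python
def required_acks(consistency, replication_factor):
    """Replica responses needed for a request at the given consistency level."""
    levels = {
        "ONE": 1,
        "QUORUM": replication_factor // 2 + 1,
        "ALL": replication_factor,
    }
    return levels[consistency]


def can_serve(consistency, replication_factor, reachable_replicas):
    """True if enough replicas are reachable to satisfy the request."""
    return reachable_replicas >= required_acks(consistency, replication_factor)


def reads_see_latest_write(write_cl, read_cl, replication_factor):
    """Simplified rule: read and write replica sets overlap when R + W > N."""
    r = required_acks(read_cl, replication_factor)
    w = required_acks(write_cl, replication_factor)
    return r + w > replication_factor


# Replication factor 3, partition leaves only one replica reachable.
weak_ok = can_serve("ONE", 3, 1)
quorum_ok = can_serve("QUORUM", 3, 1)
# No replication: losing the single replica makes the data unavailable.
unreplicated_ok = can_serve("ONE", 1, 0)
```

Under these assumptions, a partition that isolates one replica out of three still allows requests at level ONE but not at QUORUM, which is the behaviour the experiments above are meant to confirm or refute on a real cluster.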
Project (Hrs: 60, SLO: 17)
Projects may be given as group projects.
The following are sample projects that can be given to students to implement:
1. Analyzing and visualizing social networks such as Facebook and Twitter using
NoSQL databases.
4. Project on combining database management and cloud storage systems.
5. CarTel. In the CarTel project, we are building a system for collecting and
managing data from automobiles. There are several possible CarTel related
projects:
a) One of the features of CarTel is a GUI for browsing geo‐spatial data collected
from cars. We currently have a primitive interface for retrieving parts of the data
that are of interest, but developing a more sophisticated interface or query
language for browsing and exploring this data would make a great project.
b) One of the dangers of building a system like CarTel is that it collects relatively
sensitive personal information about users’ locations and driving habits.
Protecting this information from casual browsers, insurance companies, or other
undesired users is important. However, it is also important to be able to combine
different users’ data to do things like intelligent route planning or
vehicle anomaly detection. The goal of this project would be to find a way to
securely perform certain types of aggregate queries over CarTel data without
exposing personally identifiable information.
Reference Books
1. Guy Harrison, “Next Generation Databases: NoSQL, NewSQL, and Big Data”,
Apress, 1st Edition, 2015.
2. Daniel G. McCreary and Ann M. Kelly, “Making Sense of NoSQL”, Manning
Publications, illustrated edition, 2013.
3. Shashank Tiwari, “Professional NoSQL”, Wrox, 1st Edition, 2011.
4. Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze,
“Introduction to Information Retrieval”, Cambridge University Press, 2008.
5. Daniel Abadi, Peter Boncz, Stavros Harizopoulos, “The Design and
Implementation of Modern Column-Oriented Database Systems”, Now
Publishers, 2013.
6. Kristina Chodorow, “MongoDB: The Definitive Guide”, O’Reilly Media, 2013.
2. Knowledge Areas that contain topics and learning outcomes covered in the course
Total 30
Total hours 30
This course is designed with 100 minutes of in-classroom sessions per week, 100 minutes of lab
hours per week, and 200 minutes of non-contact time spent on implementing the course-related
project. Generally, the course should combine lectures, in-class discussion, case
studies, guest lectures, mandatory off-class reading material, and assignments.
Students are assessed on a combination of group activities, classroom discussion, projects, and
continuous and final assessment tests.
Additional weightage will be given to students working on projects based on different
databases, and to competitions and projects that handle large databases.
Students can earn additional weightage with a certificate of completion of a related MOOC
course.
Session-wise plan
Class Hour | Lab Hour | Topics Covered | Level of mastery | Text/Reference Book | Remarks
2 | | Database revolutions: first generation, second generation, third generation, Managing Transactions and Data Integrity | Familiarity | 1,2 |
ACID and BASE for reliable
database transactions