syllabus sem 6
Course Outcome:
Unit 1:
Introduction: Distributed Data Processing, Distributed Database Systems, advantages and
drawbacks of DDBSs, Distributed DBMS Architecture : Models- Autonomy, Distribution,
Heterogeneity, DDBMS Architecture – Client/Server, Peer to peer, MDBS
Unit 2:
Database Distribution
Design Alternatives – localized data, distributed data; Fragmentation – Vertical, Horizontal
(primary & derived), hybrid, general guidelines, correctness rules; Distribution transparency
– location, fragmentation, replication; Impact of distribution on user queries – No Global Data
Dictionary (GDD), GDD containing location information
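As a rough illustration of the fragmentation alternatives and correctness rules listed in this unit, the sketch below fragments a small in-memory relation horizontally and vertically, then checks completeness, reconstruction, and disjointness. The EMP relation, its attributes, and all helper names are hypothetical and are not taken from the suggested readings.

```python
# Hypothetical EMP relation: each row is a dict keyed by attribute name.
EMP = [
    {"eno": 1, "name": "Asha", "city": "Delhi", "salary": 50},
    {"eno": 2, "name": "Ravi", "city": "Pune", "salary": 40},
    {"eno": 3, "name": "Meera", "city": "Delhi", "salary": 65},
]

def horizontal_fragment(relation, predicate):
    """Select the rows that satisfy a predicate (a horizontal fragment)."""
    return [row for row in relation if predicate(row)]

def vertical_fragment(relation, attributes):
    """Project the rows onto a subset of attributes (a vertical fragment)."""
    return [{a: row[a] for a in attributes} for row in relation]

def by_key(rows):
    """Sort rows on the key eno so that fragments can be compared."""
    return sorted(rows, key=lambda r: r["eno"])

# Primary horizontal fragmentation on city.
emp_delhi = horizontal_fragment(EMP, lambda r: r["city"] == "Delhi")
emp_other = horizontal_fragment(EMP, lambda r: r["city"] != "Delhi")

# Vertical fragmentation; the key eno is repeated in both fragments.
emp_public = vertical_fragment(EMP, ["eno", "name", "city"])
emp_pay = vertical_fragment(EMP, ["eno", "salary"])

# Completeness and reconstruction: the union of the horizontal fragments is EMP.
horizontal_ok = by_key(emp_delhi + emp_other) == by_key(EMP)

# Disjointness: no employee appears in both horizontal fragments.
disjoint_ok = not ({r["eno"] for r in emp_delhi} & {r["eno"] for r in emp_other})

# Reconstruction of the vertical fragments: join them on the key eno.
pay_by_eno = {r["eno"]: r for r in emp_pay}
rejoined = [{**r, **pay_by_eno[r["eno"]]} for r in emp_public]
vertical_ok = by_key(rejoined) == by_key(EMP)

print(horizontal_ok, disjoint_ok, vertical_ok)
```

Repeating the key in every vertical fragment is what makes the join-based reconstruction possible here.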
Unit 3:
Query Processing
Query Processing, Query Processing in Centralized Systems, Layers of Query Processing,
Distributed query processing, Query decomposition, Localization of distributed data
Unit 4:
Optimization of Distributed Queries
Query Optimization, Centralized Query Optimization, Distributed Query Optimization
Algorithms
Unit 5:
Distributed Transaction Management & Concurrency Control
Transaction concept, ACID property, Distributed Concurrency Control,
Serializability and recoverability, Distributed Serializability, Enhanced lock based
and timestamp based protocols, Multiple granularity, Multi version schemes,
Optimistic Concurrency Control techniques, Distributed deadlock & Recovery
Suggested Readings
1. M.T. Ozsu and P. Valduriez, “Principles of Distributed Database Systems”, Pearson
Publication
2. D. Bell and J. Grimson, “Distributed Database System”, Addison-Wesley
3. Stefano Ceri and Giuseppe Pelagatti, “Distributed Databases: principles and systems”,
Course No. Type Subject L T P Credits CA MS ES CA ES Pre-requisites
CDCSE19 ED Data Science Tools Workshop 0 2 4 4 Python
COURSE OUTCOMES
COURSE CONTENTS
UNIT-1: Data Science an Introduction: Computer Science, Data Science, and Real Science, What is Data
Science? Need for Data Science, Data Science Components, Tools for Data Science, Data Science Lifecycle,
Applications of Data Science.
UNIT-2: Python and R Programming for Data Science: Introduction to Python
Programming (Python Basics, Python Data Structures, Python Programming Fundamentals, Working
with Data in Python, Working with NumPy, Pandas, SciPy, and Matplotlib).
UNIT-3: Data Processing: Data Operations, Data cleansing, Processing CSV Data, Processing JSON
Data, Processing XLS Data, Relational databases, NoSQL Databases, Date and Time, Data
Wrangling, Data Aggregation, Reading HTML Pages, Processing Unstructured Data, Word
tokenization, Stemming and Lemmatization
UNIT-4: Statistical Data Analysis: Measuring Central Tendency, Measuring Variance, Normal
Distribution, Binomial Distribution, Poisson Distribution, Bernoulli Distribution, P-Value,
Correlation, Chi-square Test, Linear Regression
UNIT-5: Data Visualization: Chart Properties, Chart Styling, Box Plots, Heat Maps, Scatter Plots,
Bubble Charts, 3D Charts, Time Series, Geographical Data, Graph Data
Reference Book(s):
• An Introduction to Probability and Statistics by V.K. Rohatgi & A.K. Md. E. Saleh, Wiley,
(2008), 3rd ed.
• Introduction to Probability Theory and Statistical Inference by H.J. Larson, John Wiley
& Sons, (2005) 3rd ed.
S No Description Hours
1 (Hours: 3)
• Write a Python/R program to create a vector of a specified type and length.
Create a vector of numeric, complex, logical, and character types of length 6.
• Write a Python/R program to add two vectors of integer type and length 3.
• Write a Python/R program to create a list containing a vector, a matrix, and
a list and remove the second element.
• Write a Python/R program to create a list containing a vector, a matrix, and
a list and update the last element.
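One possible Python take on the exercises above is sketched below. It assumes plain Python lists are an acceptable stand-in for R vectors (a NumPy array would be another option); the variable names and sample values are illustrative only.

```python
# Vectors of length 6, one per type, modelled as plain Python lists.
numeric_vec = [0.0] * 6
complex_vec = [0j] * 6
logical_vec = [False] * 6
character_vec = [""] * 6

# Element-wise addition of two integer vectors of length 3.
u = [1, 2, 3]
v = [10, 20, 30]
vector_sum = [a + b for a, b in zip(u, v)]

def make_container():
    """Build a list holding a vector, a matrix (list of rows), and a list."""
    return [[1, 2, 3], [[1, 2], [3, 4]], ["a", "b"]]

# Remove the second element (index 1 in Python, index 2 in R).
without_second = make_container()
del without_second[1]

# Update the last element.
updated_last = make_container()
updated_last[-1] = ["x", "y", "z"]

print(vector_sum)
print(without_second)
print(updated_last)
```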
2 (Hours: 3)
Write programs in both Python and R to solve the following tasks.
• Read numbers from a file, and print them out in sorted order.
• Read a text file, and count the total number of words.
• Read a text file, and count the total number of distinct words.
• Read a file of numbers, and plot a frequency histogram of them.
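The file-handling tasks above could be approached along the following lines in Python. This is only a sketch: the sample files are created on the fly with hypothetical contents, words are split on whitespace without stripping punctuation, and a text histogram stands in for a plotted one (matplotlib's `hist` would be the usual choice for the plot).

```python
import os
import tempfile
from collections import Counter

def read_numbers(path):
    """Read whitespace-separated numbers from a text file."""
    with open(path) as f:
        return [float(tok) for tok in f.read().split()]

def count_words(path):
    """Return (total words, distinct words), ignoring case."""
    with open(path) as f:
        words = f.read().lower().split()
    return len(words), len(set(words))

def text_histogram(numbers):
    """Return one line per distinct value, with one '#' per occurrence."""
    freq = Counter(numbers)
    return [f"{value:g}: {'#' * freq[value]}" for value in sorted(freq)]

def write_temp(text):
    """Write sample text to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w") as f:
        f.write(text)
    return path

# Hypothetical sample data for the three kinds of task.
numbers_path = write_temp("3 1 2\n2 3 3\n")
words_path = write_temp("the cat saw the other cat\n")

sorted_numbers = sorted(read_numbers(numbers_path))
total_words, distinct_words = count_words(words_path)
histogram = text_histogram(read_numbers(numbers_path))

print(sorted_numbers)
print(total_words, distinct_words)
print("\n".join(histogram))

os.remove(numbers_path)
os.remove(words_path)
```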
3 (Hours: 3)
Statistical Data Analysis
Write a program to solve linear regression for a given data set.
Y = aX + b
where
a = (nΣxy - ΣxΣy) / (nΣx^2 - (Σx)^2)
b = (Σy - aΣx) / n
Here
Y: response variable
X: predictor variable
a, b: regression coefficients
Read data set
X Y
-2 -1
1 1
3 2
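For the three (X, Y) points above, the closed-form least-squares sums can be evaluated directly. The sketch below is one way to do so; the function name `fit_line` is an illustrative choice, not part of the exercise.

```python
def fit_line(xs, ys):
    """Return (a, b) for Y = aX + b using the closed-form least-squares sums."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    a = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = (sum_y - a * sum_x) / n
    return a, b

# Data set from the exercise.
xs = [-2, 1, 3]
ys = [-1, 1, 2]
a, b = fit_line(xs, ys)
print(f"a = {a:.4f}, b = {b:.4f}")  # a = 0.6053, b = 0.2632
```

With n = 3, Σx = 2, Σy = 2, Σxy = 9 and Σx^2 = 14, the slope works out to 23/38 and the intercept to 5/19.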
2006 19
2007 29
2008 37
2009 45
5 (Hours: 2)
1 1500 2 5 4 10
2 1650 3 6 5 10
3 1750 3 3 5 12
4 1400 2 3 3 9
5 2000 4 4 6 15
6 2200 5 6 6 14
7 2100 1 5 4 12
8 2750 5 8 7 15
9 2900 8 9 8 25
10 1100 3 3 2 7
11 1000 4 2 1 5
12 1350 6 4 4 12
13 1550 4 6 4 11
Here you will get an error because the y values must lie between 0 and 1, so modify the Y values accordingly.
6 (Hours: 2)
Statistical Data Analysis
• In an entrance examination, there are twenty multiple-choice questions. Each
question has four options, and only one of them is correct. Find the
probability of having seven or fewer correct answers if a student
attempts to answer every question at random.
• Let us assume that the test scores of an entrance exam fit a normal distribution
where the mean test score is 67, and the standard deviation is 13.7. Calculate
the percentage of students scoring 80 or more in the exam.
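Both questions above reduce to evaluating a cumulative distribution. A minimal Python sketch, assuming only the standard library (`math.comb` and `statistics.NormalDist`, available from Python 3.8), might look like this; the helper name `binomial_cdf` is illustrative.

```python
from math import comb
from statistics import NormalDist

def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

# Twenty questions, four options each, so p = 1/4 for a random guess.
p_seven_or_fewer = binomial_cdf(7, 20, 0.25)

# Scores ~ Normal(mean=67, sd=13.7); upper tail at 80.
p_80_or_more = 1 - NormalDist(mu=67, sigma=13.7).cdf(80)

print(f"P(X <= 7) = {p_seven_or_fewer:.4f}")
print(f"P(score >= 80) = {100 * p_80_or_more:.1f}%")
```

The first probability comes out at roughly 0.898 and the second at roughly 17%. In R, `pbinom` and `pnorm` play the same roles.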
Mid-Semester Lab Examination
9 (Hours: 3)
Data Visualization
Construct a revealing visualization of some aspect of your favorite data set, using:
• A well-designed table.
• A dot and/or line plot.
• A scatter plot.
• A heatmap.
• A bar plot or pie chart.
• A histogram.
• A data map.
10 (Hours: 3)
Data Visualization
Create ten different versions of line charts for a particular set of (x, y) points.
Which ones are best and which ones worst? Explain why.
11 (Hours: 3)
Data Visualization
Construct scatter plots for sets of 10, 100, 1000, and 10,000 points. Experiment
with the point size to find the most revealing value for each data set.
12 (Hours: 3)
Data Visualization
Experiment with different color scales to construct scatter plots for a particular
set of (x, y, z) points, where color is used to represent the z dimension. Which
color schemes work best? Which are the worst? Explain why.
End-Semester Lab Examination
Total Lab hours 28
Course No. Type Subject L T P Credits CA MS ES CA ES Pre-requisites
CDCSC20 CC Query Processing and Optimization 3 1 0 4 25 25 50 - - CDCSC05
COURSE OUTCOMES
UNIT I
Query Processing: Introduction, Steps: Parsing and Translation, Optimization, Evaluation;
Measures of Query Cost
Relational Algebra, Operations from Set Theory, Translating SQL Queries into Relational
Algebra, Equivalence rules, Equivalence derivability and minimality, Enumeration of Equivalent
Expressions.
UNIT II
Algorithms for Selection Operations: using indices, comparisons, complex selections; Algorithms
for External Sorting, Algorithms for SELECT Operations, Aggregation Operation
Algorithms for JOIN Operations: Nested-Loop Join, Block Nested Loop Join, Indexed Nested loop
join, Merge-Join, Hash Join, Hybrid Hash Join, Complex Joins, Outer Join, Algorithms for Project
and Set Operations
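To make the contrast between two of the join algorithms in this unit concrete, the sketch below implements an in-memory nested-loop join and a simple hash join on an equality condition. It does not model disk blocks, buffers, or partitioning, and the relations and function names are hypothetical.

```python
from collections import defaultdict

def nested_loop_join(r, s, key_r, key_s):
    """Compare every pair of tuples: |r| * |s| comparisons."""
    return [{**tr, **ts} for tr in r for ts in s if tr[key_r] == ts[key_s]]

def hash_join(r, s, key_r, key_s):
    """Build a hash table on s, then probe it once per tuple of r."""
    buckets = defaultdict(list)
    for ts in s:
        buckets[ts[key_s]].append(ts)
    return [{**tr, **ts} for tr in r for ts in buckets.get(tr[key_r], [])]

# Hypothetical relations for illustration.
students = [
    {"sid": 1, "name": "Asha"},
    {"sid": 2, "name": "Ravi"},
    {"sid": 3, "name": "Meera"},
]
takes = [
    {"sid": 1, "course": "CDCSC20"},
    {"sid": 3, "course": "CDCSE19"},
    {"sid": 1, "course": "CDCSE19"},
]

nl_result = nested_loop_join(students, takes, "sid", "sid")
hj_result = hash_join(students, takes, "sid", "sid")
print(nl_result == hj_result, len(nl_result))
```

Both functions return the same tuples; the hash join replaces the inner scan with a dictionary lookup, which is the source of its lower cost on equi-joins.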
UNIT III
Evaluation of Expression, Transformation of Relational Expression, Combining Operations using
Pipelining, Producer-driven pipelining, Double pipelining join technique, Materialization,
Materialized Evaluation
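The difference between materialized and pipelined evaluation can be sketched with Python generators, which behave like a demand-driven (pull-based) pipeline: each operator produces a tuple only when its consumer asks for one. The relation and operator names below are hypothetical.

```python
def scan(relation):
    """Leaf operator: yield tuples one at a time."""
    for row in relation:
        yield row

def select(rows, predicate):
    """Selection operator: pass on the rows that satisfy the predicate."""
    for row in rows:
        if predicate(row):
            yield row

def project(rows, attributes):
    """Projection operator: keep only the listed attributes."""
    for row in rows:
        yield {a: row[a] for a in attributes}

# Hypothetical relation for illustration.
employees = [
    {"name": "Asha", "salary": 50},
    {"name": "Ravi", "salary": 40},
    {"name": "Meera", "salary": 65},
]

# Materialized evaluation: each intermediate result is stored in full.
selected = [row for row in employees if row["salary"] > 45]
materialized = [{"name": row["name"]} for row in selected]

# Pipelined evaluation: no intermediate list is built; rows are pulled
# through scan -> select -> project one at a time.
pipeline = project(select(scan(employees), lambda r: r["salary"] > 45), ["name"])
first = next(pipeline)            # only as much work as one output row needs
pipelined = [first] + list(pipeline)

print(materialized == pipelined, first)
```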
UNIT IV
Query Optimization, Introduction, Query Evaluation Plan(QEP), cost based query optimization,
Estimation of QEP cost, using heuristics in query optimization, Selectivity and Cost Estimation,
Semantic Query optimization
UNIT V
Estimating Statistics of Expression Results, Cost Estimation, Statistical Information for cost
estimation: Histograms, Selection and JOIN Size Estimation, Projection and aggregation size
estimation
Choice of Evaluation Plans, Dynamic programming in Optimization, Cost of Optimization,
Structure of Query Optimizers, Materialized Views, Optimization in distributed databases.
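As a small worked illustration of the size-estimation topics in this unit, the sketch below builds an equi-width histogram and uses it to estimate the size of a range selection. It also includes the common textbook estimates n_r / V(A, r) for an equality selection and n_r * n_s / max(V(A, r), V(A, s)) for an equi-join. The data, bucket boundaries, and function names are assumptions made for the example.

```python
def build_histogram(values, low, high, buckets):
    """Equi-width histogram: count of values per bucket over [low, high)."""
    width = (high - low) / buckets
    counts = [0] * buckets
    for v in values:
        index = min(int((v - low) / width), buckets - 1)
        counts[index] += 1
    return counts, width

def estimate_less_than(counts, low, width, v):
    """Estimate the size of the selection A < v, assuming values are
    spread uniformly within each bucket."""
    total = 0.0
    for i, count in enumerate(counts):
        start = low + i * width
        end = start + width
        if v >= end:
            total += count                         # whole bucket qualifies
        elif v > start:
            total += count * (v - start) / width   # partial bucket
    return total

def estimate_equality(n_r, distinct_values):
    """Estimate the size of the selection A = v as n_r / V(A, r)."""
    return n_r / distinct_values

def estimate_join_size(n_r, n_s, distinct_r, distinct_s):
    """Estimate the equi-join size as n_r * n_s / max(V(A, r), V(A, s))."""
    return n_r * n_s / max(distinct_r, distinct_s)

# Hypothetical attribute values.
ages = [21, 22, 22, 25, 31, 33, 38, 41, 47, 49]
counts, width = build_histogram(ages, low=20, high=50, buckets=3)
estimate = estimate_less_than(counts, 20, width, 35)
actual = sum(1 for a in ages if a < 35)

print(counts, estimate, actual)
print(estimate_equality(1000, 100), estimate_join_size(1000, 500, 100, 50))
```

Here the histogram estimate (5.5 rows) is close to, but not equal to, the true count (6 rows), which is the kind of error a cost-based optimizer has to tolerate.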
SUGGESTED READINGS
1. Raghu Ramakrishnan and J. Gehrke, Database Management Systems, 3rd Edition,
McGraw Hill
2. A. Silberschatz, H. F. Korth & A. Sudarshan, Database System Concepts, McGraw Hill, 5th
ed., 2006.