CS8091-Big-Data-Analytics
OBJECTIVES:
To know the fundamental concepts of big data and analytics.
To explore tools and practices for working with big data.
To learn about stream computing.
To know about the research that requires the integration of large amounts of data.
TOTAL: 45 PERIODS
OUTCOMES: Upon completion of the course, the students will be able to:
Work with big data tools and their analysis techniques
Analyze data by utilizing clustering and classification algorithms
Learn and apply different mining algorithms and recommendation systems for large
volumes of data
Perform analytics on data streams
Learn NoSQL databases and their management.
TEXT BOOKS:
1. Anand Rajaraman and Jeffrey David Ullman, "Mining of Massive Datasets", Cambridge
University Press, 2012.
2. David Loshin, "Big Data Analytics: From Strategic Planning to Enterprise Integration
with Tools, Techniques, NoSQL, and Graph", Morgan Kaufmann/Elsevier Publishers,
2013.
REFERENCES:
1. EMC Education Services, "Data Science and Big Data Analytics: Discovering,
Analyzing, Visualizing and Presenting Data", Wiley publishers, 2015.
2. Bart Baesens, "Analytics in a Big Data World: The Essential Guide to Data Science and
its Applications", Wiley Publishers, 2015.
3. Dietmar Jannach and Markus Zanker, "Recommender Systems: An Introduction",
Cambridge University Press, 2010.
4. Kim H. Pries and Robert Dunnigan, "Big Data Analytics: A Practical Guide for Managers",
CRC Press, 2015.
5. Jimmy Lin and Chris Dyer, "Data-Intensive Text Processing with MapReduce", Synthesis
Lectures on Human Language Technologies, Vol. 3, No. 1, Pages 1-177, Morgan & Claypool
Publishers, 2010.
UNIT I
INTRODUCTION TO BIG DATA
Evolution of Big Data – Best Practices for Big Data Analytics – Big Data Characteristics – Validating
– The Promotion of the Value of Big Data – Big Data Use Cases – Characteristics of Big Data
Applications – Perception and Quantification of Value – Understanding Big Data Storage – A
General Overview of High-Performance Architecture – HDFS – MapReduce and YARN – MapReduce
Programming Model
COURSE OBJECTIVE: To know the fundamental concepts of big data and analytics.
11. How does Hadoop MapReduce work? (An)
In the classic word-count example, the map phase counts the words in each document, while
the reduce phase aggregates the counts per word across the entire collection. During the map
phase, the input data is divided into splits that are analyzed by map tasks running in parallel
across the Hadoop cluster, as in the sketch below.
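As a concrete illustration of that word-count example, a minimal mapper and reducer written
against the standard org.apache.hadoop.mapreduce API might look like the following sketch
(the class names are illustrative, not from the text):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: runs in parallel on each input split, one call per line.
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Emit (word, 1) for every token in the line.
            for (String token : line.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: all counts for the same word arrive together.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get(); // aggregate across the collection
            context.write(word, new IntWritable(sum));
        }
    }
}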
12. Explain what shuffling is in MapReduce. (R)
The process by which the system performs the sort and transfers the map outputs to the
reducers as inputs is known as the shuffle.
13. Explain what the Distributed Cache is in the MapReduce framework. (U)
Distributed Cache is an important feature provided by the MapReduce framework. It is used
when you want to share files across all nodes in a Hadoop cluster. The files could be
executable JAR files or simple properties files.
14. Explain what the NameNode is in Hadoop. (U)
The NameNode in Hadoop is the node where Hadoop stores all the file-location information
for HDFS (Hadoop Distributed File System). In other words, the NameNode is the centerpiece
of an HDFS file system. It keeps a record of all the files in the file system and tracks the file
data across the cluster's machines.
15. Explain what a heartbeat is in HDFS. (U)
A heartbeat is a signal used between a DataNode and the NameNode, and between the
TaskTracker and the JobTracker; if the NameNode or JobTracker does not respond to the
signal, it is assumed that there is some issue with the DataNode or TaskTracker.
16. Explain what combiners are and when you should use a combiner in a MapReduce job.
(U)
Combiners are used to increase the efficiency of a MapReduce program: they reduce the
amount of data that needs to be transferred across the network to the reducers. If the
operation performed is commutative and associative, you can use your reducer code as the
combiner. Note that the execution of a combiner is not guaranteed in Hadoop; a driver
sketch follows below.
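As an illustrative sketch, the driver below wires the reducer in as a combiner (it assumes the
hypothetical WordCount classes from question 11; reusing the reducer is safe here only
because summing is commutative and associative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenMapper.class);
        // The combiner pre-aggregates map output locally, shrinking shuffle traffic.
        // Hadoop may run it zero or more times, so it must not change the result.
        job.setCombinerClass(WordCount.SumReducer.class);
        job.setReducerClass(WordCount.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}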
17. Explain what Speculative Execution is. (R)
During speculative execution, Hadoop launches a certain number of duplicate tasks: multiple
copies of the same map or reduce task can be executed on different slave nodes. In simple
terms, if a particular node is taking a long time to complete a task, Hadoop creates a duplicate
task on another node; the copy that finishes first is accepted, and the copies that do not finish
first are killed.
18. Explain what the basic parameters of a Mapper are. (U)
The basic parameters of a Mapper are its input and output key/value types:
LongWritable and Text (input key and value)
Text and IntWritable (output key and value)
19. Explain what the function of the MapReduce partitioner is. (U)
The function of the MapReduce partitioner is to make sure that all the values of a single key
go to the same reducer, which eventually helps achieve an even distribution of the map output
over the reducers.
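For illustration, a custom partitioner that mimics the behaviour of Hadoop's default hash
partitioner could be sketched like this (the class name is hypothetical):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class HashLikePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so the result is non-negative, then bucket by
        // reducer count: every occurrence of a key lands in the same partition.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}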
20. Explain the difference between an Input Split and an HDFS Block. (U)
The logical division of data is known as an Input Split, while the physical division of data is
known as an HDFS Block.
21. Explain what happens in text input format. (U)
In text input format, each line in the text file is a record. The value is the content of the line,
while the key is the byte offset of the line; that is, Key: LongWritable, Value: Text.
22. Mention the main configuration parameters that the user needs to specify to run a
MapReduce job. (An)
The user of the MapReduce framework needs to specify:
a. Job's input locations in the distributed file system
b. Job's output location in the distributed file system
c. Input format
d. Output format
e. Class containing the map function
f. Class containing the reduce function
g. JAR file containing the mapper, reducer and driver classes
23. Explain what WebDAV is in Hadoop. (U)
WebDAV is a set of extensions to HTTP that supports editing and updating files. On most
operating systems, WebDAV shares can be mounted as filesystems, so it is possible to access
HDFS as a standard filesystem by exposing HDFS over WebDAV.
24. Explain what Sqoop is in Hadoop. (R)
Sqoop is a tool used to transfer data between relational database management systems
(RDBMS) and Hadoop HDFS. Using Sqoop, data can be imported from an RDBMS such as
MySQL or Oracle into HDFS, and data can be exported from HDFS back to an RDBMS.
25. Explain how the JobTracker schedules a task. (R)
The TaskTracker sends heartbeat messages to the JobTracker, usually every few seconds, to
assure the JobTracker that it is active and functioning. The message also informs the
JobTracker of the number of available slots, so the JobTracker can stay up to date on where
in the cluster work can be delegated.
26. Explain what the sequence file input format is. (R)
The sequence file input format is used for reading files in sequence. It is a specific
compressed binary file format optimized for passing data from the output of one MapReduce
job to the input of another MapReduce job.
27. Explain what conf.setMapperClass does. (An)
conf.setMapperClass sets the mapper class and everything related to the map job, such as
reading the data and generating key-value pairs out of the mapper.
28. Explain what the Context object is in Hadoop. (R)
The Context object enables the mapper to interact with the rest of the Hadoop system. It
includes configuration data for the job, as well as interfaces which allow it to emit output.
51. Mention what the next step after Mapper or MapTask is. (R)
The next step after the Mapper or MapTask is that the output of the Mapper is sorted, and
partitions are created for the output.
52. Mention what the default partitioner in Hadoop is. (R)
In Hadoop, the default partitioner is the "Hash" partitioner.
53. Explain what is the purpose of RecordReader in Hadoop? (R)
In Hadoop, the RecordReader loads the data from its source and converts it into (key, value)
pairs suitable for reading by the Mapper.
54. Explain how data is partitioned before it is sent to the reducer if no custom partitioner is
defined in Hadoop. (R)
If no custom partitioner is defined in Hadoop, then the default partitioner computes a hash
value for the key and assigns the partition based on the result.
55. Explain what happens when Hadoop has spawned 50 tasks for a job and one of the tasks
fails. (U)
Hadoop will restart the task on some other TaskTracker; only if the task fails more than the
defined limit is the job killed.
56. Mention what is the best way to copy files between HDFS clusters? (R)
The best way to copy files between HDFS clusters is by using multiple nodes and the distcp
command, so the workload is shared.
57. Mention what is the difference between HDFS and NAS?(An)
HDFS data blocks are distributed across local drives of all machines in a cluster while NAS
data is stored on dedicated hardware.
58. Mention how Hadoop is different from other data processing tools? (R)
In Hadoop, you can increase or decrease the number of mappers without worrying about the
volume of data to be processed.
59. Mention what the JobConf class does. (R)
The JobConf class separates different jobs running on the same cluster. It handles job-level
settings, such as declaring a job in a real environment.
60. Mention what the Hadoop MapReduce API contract for a key and value class is. (U)
For a key and value class, there are two Hadoop MapReduce API contracts:
a. The value class must implement the org.apache.hadoop.io.Writable interface.
b. The key class must implement the org.apache.hadoop.io.WritableComparable interface.
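A minimal custom value type satisfying the Writable contract might look like the following
sketch (PointWritable is a hypothetical example, not from the text):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class PointWritable implements Writable {
    private double x;
    private double y;

    public PointWritable() {}  // Hadoop requires a no-argument constructor

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeDouble(x);    // serialize the fields in a fixed order
        out.writeDouble(y);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        x = in.readDouble();   // deserialize in exactly the same order
        y = in.readDouble();
    }
}

A key class would additionally implement WritableComparable, which adds a compareTo()
method so that keys can be sorted during the shuffle.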
61. Mention the three modes in which Hadoop can be run. (R)
The three modes in which Hadoop can be run are:
a. Pseudo-distributed mode
b. Standalone (local) mode
c. Fully distributed mode
62. Mention what the text input format does. (R)
The text input format creates a line object keyed by the byte offset of the line. The value is
the whole line of text, while the key is the line's byte offset. The mapper receives the value
as a 'Text' parameter and the key as a 'LongWritable' parameter.
63. Mention what the distributed cache in Hadoop is. (R)
The distributed cache in Hadoop is a facility provided by the MapReduce framework to cache
files at job-execution time. The framework copies the necessary files to the slave node before
the execution of any task at that node.
64. Explain how the Hadoop classpath plays a vital role in stopping or starting Hadoop
daemons. (U)
The classpath consists of a list of directories containing the JAR files needed to stop or start
the daemons.
65. List the main characteristics of Big Data. (U) (NOVEMBER /DECEMBER 2020/
APRIL / MAY 2021)
Veracity, Variety, Velocity, Volume, Validity, Variability, Volatility, Visualization and
Value.
66. Why is HDFS preferred to RDBMS? (U) (NOVEMBER /DECEMBER 2020/ APRIL /
MAY 2021)
It is more flexible in storing, processing, and managing data than traditional RDBMS. Unlike
traditional systems, Hadoop enables multiple analytical processes on the same data at the same
time. It supports scalability very flexibly.
PART-B
1. Analyse in detail the challenges of Big Data in modern data analytics. (An)
2. Justify the statement "Web Data is the Most Popular Big Data" with reference to the data
analytics professional. (E)
3. Comment on the statement "Is the 'Big' Part or the 'Data' Part More Important?". (E)
4. Develop the role of Analytic Sandbox and its benefits in the Analytic Process.(C)
5. List the features of Hadoop and explain the functionalities of Hadoop cluster? (U)
6. Describe briefly about Hadoop input and output and write a note on data integrity? (U)
7. Discuss the various core components of the Hadoop.(U)
8. Assess the significance of MapReduce. (U)
9. Explain about Hadoop distributed file system architecture with neat diagram.(U)
10. Summarize briefly on:
a. Algorithms using MapReduce. (U)
b. Extensions to MapReduce. (U)
13. Compare and contrast Hadoop and MapR. (U)
14. Analyse the steps of the MapReduce algorithm. (An)
15. Describe the concepts of HDFS. (U)
16. (i) Explain the management of computing resources and the management of the data
across the network of storage nodes in High performance architecture (4) (U)
NOVEMBER /DECEMBER 2020/ APRIL / MAY 2021
17. Write short notes on the following programming model:(9) (R) NOVEMBER
/DECEMBER 2020/ APRIL / MAY 2021
1. HDFS
2. MapReduce
3. YARN
18. (i) Brief about the characteristics of Big Data applications. (5) (U) NOVEMBER
/DECEMBER 2020/ APRIL / MAY 2021
19. Explain the role of Big Data Analytics in the following: (8) (U) NOVEMBER
/DECEMBER 2020/ APRIL / MAY 2021
1. Credit Fraud Detection
2. Clustering and data segmentation
3. Recommendation engines
4. Price modeling
COURSE OUTCOMES:
Students will be able to work with big data tools and their analysis techniques.
UNIT II
CLUSTERING AND CLASSIFICATION
Advanced Analytical Theory and Methods: Overview of Clustering – K-means – Use Cases –
Overview of the Method – Determining the Number of Clusters – Diagnostics – Reasons to Choose
and Cautions – Classification: Decision Trees – Overview of a Decision Tree – The General
Algorithm – Decision Tree Algorithms – Evaluating a Decision Tree – Decision Trees in R – Naïve
Bayes – Bayes' Theorem – Naïve Bayes Classifier.
COURSE OBJECTIVE:
To explore tools and practices for working with big data
1. What are the three stages of IDA process? (R)
o Data preparation
o Data mining and rule finding
o Result validation and interpretation
2. What is linear regression? (R)
Linear regression is an approach for modeling the relationship between a scalar
dependent variable y and one or more explanatory variables (or independent variables) denoted
X. The case of one explanatory variable is called simple linear regression.
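In the simple case, the model can be written in the standard notation (the symbols below are
conventional, not taken from the text):

y = \beta_0 + \beta_1 x + \varepsilon

where \beta_0 is the intercept, \beta_1 is the slope fitted from the data, and \varepsilon is a
random error term.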
3. Explain Bayesian Inference ?(U)
Bayesian inference is a method of statistical inference in which Bayes' theorem is used
to update the probability for a hypothesis as more evidence or information becomes available.
Bayesian inference is an important technique in statistics, and especially in mathematical
statistics.
4. What is meant by rule induction? (R)
Rule induction is an area of machine learning in which formal rules are extracted from a
set of observations. The rules extracted may represent a full scientific model of the data, or
merely represent local patterns in the data.
5. What are the two strategies in the Learn-One-Rule function? (R)
General to specific
Specific to general
15. What is Classification? (R)
It is a data analysis task, i.e. the process of finding a model that describes and
distinguishes data classes and concepts. Classification is the problem of identifying to which of a
set of categories (subpopulations) a new observation belongs, on the basis of a training set of
data containing observations whose category membership is known.
16. What is a discriminative model? (R)
It is a very basic classifier that determines just one class for each row of data. It tries to
model the classes by depending only on the observed data, so it depends heavily on the quality
of the data rather than on distributions.
Example: Logistic Regression
17. Define a generative model. (R)
It models the distribution of the individual classes and tries to learn the model that generates the
data behind the scenes by estimating the assumptions and distributions of the model. It is used to
predict unseen data. Example: Naive Bayes Classifier
18. List out Classifiers Of Machine Learning(U)
Decision Trees
Bayesian Classifiers
Neural Networks
K-Nearest Neighbour
Support Vector Machines
Linear Regression
Logistic Regression
19. List the associated tools and languages used to mine/extract useful information
from raw data. (U)
a) Main Languages used: R, SAS, Python, SQL
b) Major Tools used: RapidMiner, Orange, KNIME, Spark, Weka
c) Libraries used: Jupyter, NumPy, Matplotlib, Pandas, ScikitLearn, NLTK,
TensorFlow, Seaborn, Basemap, etc.
20. List real-life examples of classification. (U)
a) Market Basket Analysis:
It is a modeling technique that has been associated with frequent transactions of
buying some combination of items.
Example: Amazon and many other Retailers use this technique. While viewing
some product, certain suggestions for the commodities are shown that some people
have bought in the past.
b) Weather Forecasting:
Changing Patterns in weather conditions needs to be observed based on parameters
such as temperature, humidity, wind direction. This keen observation also requires
the use of previous records in order to predict it accurately.
21. List the advantages of classification. (U)
Mining-based methods are cost-effective and efficient
Helps in identifying criminal suspects
Helps in predicting the risk of diseases
Helps banks and financial institutions identify defaulters before they approve
cards, loans, etc.
22. List the disadvantages of classification. (U)
Privacy: When data is collected, there are chances that a company may give some information
about its customers to other vendors or use that information for its own profit.
Accuracy problem: An accurate model must be selected in order to get the best
accuracy and results.
23. List the applications of classification. (U)
Marketing and Retailing
Manufacturing
Telecommunication Industry
Intrusion Detection
Education System
Fraud Detection
24. State Bayes theorem (R) NOVEMBER /DECEMBER 2020/ APRIL / MAY 2021
Bayes Theorem is the extension of Conditional probability. Conditional probability helps us
to determine the probability of A given B, denoted by P(A|B). So Bayes' theorem says if we
know P(A|B) then we can determine P(B|A), given that P(A) and P(B) are known to us.
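In symbols, using the standard notation (not from the text):

P(B \mid A) = \frac{P(A \mid B)\, P(B)}{P(A)}

For example, if P(A|B) = 0.9, P(B) = 0.01 and P(A) = 0.1, then
P(B|A) = (0.9 x 0.01) / 0.1 = 0.09.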
25. What is the application of clustering in medical domain? (R) NOVEMBER
/DECEMBER 2020/ APRIL / MAY 2021
Clustering is a powerful machine learning tool for detecting structures in datasets. In the
medical field, clustering has been proven to be a powerful tool for discovering patterns and
structure in labeled and unlabeled datasets
PART-B
1. Analyze the statement in detail: "Data Analysis is not a decision-making system, but a
decision-supporting system". (An)
2. Create a regression model for "happy people get many hours of sleep" using your own data and
state what kind of inferences it provides. (C)
3. Summarize hierarchical clustering in detail. Analyse the given diagram and draw the dendrogram
using the hierarchical clustering algorithm. (U)
4. Compose the K-means partitioning algorithm using the given data. (C)
Consider five points {X1, X2, X3, X4, X5} with the following coordinates as a two-dimensional
sample for clustering: X1 = (0,2.5); X2 = (0,0); X3 = (1.5,0); X4 = (5,0); X5 = (5,2)
5. Cluster the following eight points into three clusters using the K-means clustering algorithm
and Euclidean distance: A1(2,10), A2(2,5), A3(8,4), A4(5,8), A5(7,5), A6(6,4), A7(1,2),
A8(4,9). a) Create a distance matrix by calculating the Euclidean distance between each pair of points.
(C)
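For both exercises, the Euclidean distance between two points p = (p_1, p_2) and
q = (q_1, q_2) is the standard

d(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2}

so, for example, d(A1, A2) = \sqrt{(2-2)^2 + (10-5)^2} = 5.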
6. Brief about K-means clustering with an example. (8) (U) NOVEMBER /DECEMBER 2020/
APRIL / MAY 2021
7. Explain the several decisions that the practitioner must make for the following parameters in K-
means clustering: (5) (U) NOVEMBER /DECEMBER 2020/ APRIL / MAY 2021
1. Object attributes
2. Units of measures
3. Rescaling
8. Write short notes on the following decision tree topics: (13) (C) NOVEMBER /DECEMBER
2020/ APRIL / MAY 2021
1. ID3 algorithm
2. C4.5
3. CART
4. Evaluation of Decision Tree
COURSE OUTCOME:
Students will be able to analyze data by utilizing clustering and classification algorithms.
UNIT III
ASSOCIATION AND RECOMMENDATION SYSTEM
Advanced Analytical Theory and Methods: Association Rules – Overview – Apriori Algorithm –
Evaluation of Candidate Rules – Applications of Association Rules – Finding Association & Finding
Similarity – Recommendation System: Collaborative Recommendation – Content-Based
Recommendation – Knowledge-Based Recommendation – Hybrid Recommendation Approaches.
COURSE OBJECTIVE:
To learn about association and recommendation system.
COURSE OUTCOME:
Students will be able to learn and apply different mining algorithms and recommendation systems
for large volumes of data.
UNIT IV
STREAM MEMORY
Introduction to Streams Concepts – Stream Data Model and Architecture – Stream Computing,
Sampling Data in a Stream – Filtering Streams – Counting Distinct Elements in a Stream –
Estimating moments – Counting oneness in a Window – Decaying Window – Real time Analytics
Platform(RTAP) applications – Case Studies – Real Time Sentiment Analysis, Stock Market
Predictions. Using Graph Analytics for Big Data: Graph Analytics
COURSE OBJECTIVE:
To learn about stream computing
1. What is Association Rule Mining? (R)
The main purpose of association rule mining is to discover, from the frequent itemsets of a
large dataset, a set of if-then rules called association rules. An association rule has the form
I → j, where I is a set of items (products) and j is a particular item.
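Candidate rules of this form are commonly evaluated by their support and confidence; in the
standard notation (not from the text):

\mathrm{support}(I) = \frac{|\{t : I \subseteq t\}|}{N}, \qquad
\mathrm{confidence}(I \to j) = \frac{\mathrm{support}(I \cup \{j\})}{\mathrm{support}(I)}

where t ranges over the N transactions (baskets) in the dataset.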
2.List any two algorithms for Finding Frequent Itemset.(R)
o Apriori Algorithm
o FP-Growth Algorithm
o SON algorithm
o PCY algorithm
3. What is meant by the curse of dimensionality? (R)
Points in high-dimensional Euclidean spaces, as well as points in non-Euclidean spaces,
often behave unintuitively. Two unexpected properties of these spaces are that random points
are almost always at about the same distance from each other, and random vectors are almost
always orthogonal.
4. Define Toivonen's Algorithm. (R)
Toivonen's algorithm makes only one full pass over the database and thus produces exact
association rules in that single pass. The algorithm gives neither false negatives nor false
positives, but there is a small yet non-zero probability that it will fail to produce any answer
at all. Toivonen's algorithm begins by selecting a small sample of the input dataset and
finding from it the candidate frequent itemsets.
5.List out some applications of clustering. (R)
Collaborative filtering
Customer segmentation
Data summarization
Dynamic trend detection
Multimedia data analysis
Biological data analysis
Social network analysis
6.What are the types of Hierarchical Clustering Methods? (R)
Single- link clustering
Complete-link clustering
Average-link clustering
Centroid link clustering
7. Define CLIQUE. (R)
CLIQUE is a subspace clustering algorithm that automatically finds subspaces with high-
density clusters in high-dimensional attribute spaces. CLIQUE is a simple grid-based method
for finding density-based clusters in subspaces, and the procedure for this grid-based
clustering is relatively simple.
8. What is meant by the k-means algorithm? (R)
This family of algorithms is of the point-assignment type and assumes a Euclidean space.
It is assumed that there are exactly k clusters for some known k. After picking k initial cluster
centroids, the points are considered one at a time and assigned to the closest centroid, as in
the sketch below.
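As an illustration of the point-assignment idea, a minimal K-means sketch in plain Java is
shown below (illustrative code, not from the text; the sample points in main() are made up):

import java.util.Arrays;
import java.util.Random;

public class KMeans {

    public static double[][] cluster(double[][] points, int k, int iters) {
        Random rnd = new Random(42);
        double[][] centroids = new double[k][];
        // Pick k initial centroids from the data (duplicates possible; fine for a sketch).
        for (int j = 0; j < k; j++)
            centroids[j] = points[rnd.nextInt(points.length)].clone();

        int[] assign = new int[points.length];
        for (int it = 0; it < iters; it++) {
            // Assignment step: each point goes to its closest centroid.
            for (int i = 0; i < points.length; i++)
                assign[i] = nearest(points[i], centroids);
            // Update step: each centroid moves to the mean of its assigned points.
            for (int j = 0; j < k; j++) {
                double[] sum = new double[points[0].length];
                int count = 0;
                for (int i = 0; i < points.length; i++) {
                    if (assign[i] != j) continue;
                    for (int d = 0; d < sum.length; d++) sum[d] += points[i][d];
                    count++;
                }
                if (count > 0)
                    for (int d = 0; d < sum.length; d++) centroids[j][d] = sum[d] / count;
            }
        }
        return centroids;
    }

    private static int nearest(double[] p, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int j = 0; j < centroids.length; j++) {
            double dist = 0;
            for (int d = 0; d < p.length; d++)
                dist += (p[d] - centroids[j][d]) * (p[d] - centroids[j][d]);
            if (dist < bestDist) { bestDist = dist; best = j; }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] pts = { {0, 2.5}, {0, 0}, {1.5, 0}, {5, 0}, {5, 2} };
        System.out.println(Arrays.deepToString(cluster(pts, 2, 10)));
    }
}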
9.List out the types of Data streaming operators (U)
Stateless operators
Stateful operators
10. List out the steps to be followed to deploy a Big Data solution (U)
Data Ingestion
Data Storage
Data Processing
11. What is FSCK?
FSCK (File System Check) is a command used to detect inconsistencies and issues in the file
system.
12. What are the real-time applications of Hadoop?
Some of the real-time applications of Hadoop are in the fields of:
Content management.
Financial agencies.
Defense and cyber security.
Managing posts on social media.
13. What is the function of HDFS? (R)
The HDFS (Hadoop Distributed File System) is Hadoop‘s default storage unit. It is used for
storing different types of data in a distributed environment.
14. What is commodity hardware?(R)
Commodity hardware can be defined as the basic hardware resources needed to run
the Apache Hadoop framework.
15. Name a few daemons used for testing JPS command.(U)
NameNode
NodeManager
DataNode
ResourceManager
16. What are the most common input formats in Hadoop?(R)
Text Input Format
Key Value Input Format
PART-B
1. Describe the Data Stream model with a neat architecture diagram (U)
2. Illustrate briefly about the sources of data stream. (U)
3. Explain issues in data stream queries .(U)
4. (i) List the issues in data streaming. (R)
5. (ii) Summarize the stream data model and its architecture. (U)
6. Analyse and write a short note on Aurora system model.(An)
7. Explain Sampling in Data Streams . (U)
8. Explain the sampling types in detail (U)
9. Describe about Aurora query model. (U)
10. Generalize how mining is done with data streams. (U)
11. Describe briefly how to count the distinct elements in a stream. (U)
What do you mean by the count-distinct problem? (R)
12. Quote short notes on:
a) Sliding window concept (R)
b) Landmark window concept (U)
13. Illustrate how you would describe the various windowing approaches to data stream
mining. (U)
14. List the methods for analyzing time series data. (R)
15. What are the several types of motivation and data analysis available for time series? (R)
16. Explain about time series in detail and discuss its significance (R)
17. Evaluate the process of Data Stream Mining with suitable examples. (E)
18. Summarize data streaming algorithms in detail. (U)
19. Discuss the challenges associated with each problem. (U)
20. Generalize how is data analysis used in (U)
a) stock market predictions.
b) weather forecasting predictions.
21. Explain the Count distinct problem and Flajolet-Martin algorithm in stream (7) (U)
NOVEMBER /DECEMBER 2020/ APRIL / MAY 2021
22. Explain in detail about the Alon–Matias-Szegedy Algorithm for estimating second moments
in stream (6) (U) NOVEMBER /DECEMBER 2020/ APRIL / MAY 2021)
23. Explain in detail about the Sampling in Data stream (9) (U) NOVEMBER /DECEMBER
2020/ APRIL / MAY 2021
24. Brief about the features of a graph analytics platform to be considered for various Big data
applications (4) (U) NOVEMBER /DECEMBER 2020/ APRIL / MAY 2021
COURSE OUTCOME:
Students will be able to perform analytics on data streams.
UNIT V
NOSQL DATA MANAGEMENT FOR BIG DATA AND VISUALIZATION
NoSQL Databases: Schema-less Models – Increasing Flexibility for Data Manipulation – Key-Value
Stores – Document Stores – Tabular Stores – Object Data Stores – Graph Databases – Hive – Sharding
– Hbase – Analyzing Big Data with Twitter – Big Data for E-Commerce – Big Data for Blogs – Review
of Basic Data Analytic Methods using R.
COURSE OBJECTIVE: To know about the research that requires the integration of large amounts
of data.
PART-A
1. Compare NoSQL & RDBMS (An)
5. Outline the brief history of NoSQL. (R)
1998- Carlo Strozzi uses the term NoSQL for his lightweight, open-source relational
database
2000- Graph database Neo4j is launched
2004- Google BigTable is launched
2005- CouchDB is launched
2007- The research paper on Amazon Dynamo is released
2008- Facebook open-sources the Cassandra project
6. List out the features of NoSQL. (R)
NoSQL databases never follow the relational model
Never provide tables with flat fixed-column records
Work with self-contained aggregates or BLOBs
Doesn't require object-relational mapping and data normalization
No complex features like query languages, query planners,
referential integrity, joins, or ACID
7. Explain about Schema-free (U)
NoSQL databases are either schema-free or have relaxed schemas
Do not require any sort of definition of the schema of the data
Offers heterogeneous structures of data in the same domain
8. Write about the Simple API characteristic of NoSQL. (U)
Offers easy-to-use interfaces for storing and querying the data provided
APIs allow low-level data manipulation and selection methods
Text-based protocols are mostly used, with HTTP REST and JSON
Mostly no standards-based query language is used
Web-enabled databases run as internet-facing services
9. List out the types of NoSQL databases. (R)
Key-value pair based
Column-oriented
Graph-based
Document-oriented
10. What are the query mechanism tools for NoSQL? (R)
The most common data retrieval mechanism is the REST-based retrieval of a value
based on its key/ID with a GET resource.
Document store databases offer more complex queries, as they understand the value
in a key-value pair. For example, CouchDB allows defining views with MapReduce.
11. What is the CAP theorem? (R)
The CAP theorem is also called Brewer's theorem. It states that it is impossible for a
distributed data store to offer more than two of the following three guarantees
simultaneously:
Consistency
Availability
Partition Tolerance
15. What is the difference between the library() and require() functions in R? (R)
library() gives an error message if the desired package cannot be found, whereas require() is
used inside functions and throws a warning message whenever a particular package is not
loaded.
library() loads the package whether or not it is already loaded; require() just checks whether
it is loaded, and loads it if it isn't (use it in functions that rely on a certain package). The
documentation explicitly states that neither function will reload an already loaded package.
16. What is R? (R)
R is a programming language which is used for developing statistical software and data
analysis. It is being increasingly deployed for machine learning applications as well.
17. How are comments written in R? (An)
Comments are written by using # at the start of a line of code, like # division.
18. What is t.test() in R? (R)
The t.test() function is used to determine whether or not the means of two groups are
equal.
19. What are the disadvantages of R Programming? (R)
The disadvantages are:-
Lack of standard GUI
Not good for big data.
Does not provide spreadsheet view of data.
20. What is the use of the with() and by() functions in R? (R)
The with() function applies an expression to a dataset:
# with(data, expression)
The by() function applies a function to each level of a factor:
# by(data, factorlist, function)
21. What is the use of the subset() and sample() functions in R? (R)
subset() is used to select variables and observations, and the sample() function is used
to generate a random sample of size n from a dataset.
22. Explain what transpose is. (U)
Transpose is used for reshaping the data before analysis. A transpose is performed
with the t() function.
23. What are the advantages of R? (R)
The advantages are:
It is used for managing and manipulating data.
No license restrictions.
Free and open-source software.
The graphical capabilities of R are good.
It runs on many operating systems and on different hardware, including 32- and 64-bit
processors.
26. What is the function used for adding datasets in R? (R)
For adding (row-binding) two datasets, the rbind() function is used, but the columns of the
two datasets must be the same.
COURSE OUTCOME: Students will be able to learn NoSQL databases and their management.
CS8091.1 An ability to work with big data tools and its analysis techniques
CS8091.2 An ability to analyse data by utilizing clustering and classification algorithms
CS8091.3 An ability to learn and apply different mining algorithms and recommendation systems
for large volumes of data
CS8091.4 An ability to perform analytics on data streams
CS8091.5 An ability to learn NoSQL databases and management.
CO-PO MATRIX:
CO PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
CS8091.1 3 3 3 3 2 - - - - - -
CS8091.2 3 3 3 3 2 - - - - - -
CS8091.3 3 3 3 3 2 - - - - - -
CS8091.4 3 3 3 3 2 - - - - - -
CS8091.5 3 3 3 3 2 - - - - - -
CS8091 3 3 3 3 2 - - - - - -
CO-PSO MATRIX:
CO PSO1 PSO2 PSO3
CS8091.1 2 - 2
CS8091.2 2 - 2
CS8091.3 2 - 2
CS8091.4 2 - 2
CS8091.5 2 - 2
CS8091 2 - 2