DA PYQs

Here is the complete and detailed solution to the June 2022 BTech Semester VI Data Analytics (KIT601) theory exam.

🟦 SECTION A — Short Answer Type Questions


(10 × 2 = 20 Marks)

(a) Discuss the need of data analytics.

CO1
Data analytics is essential for extracting useful insights from raw data. It enables
organizations to:

 Understand patterns and trends


 Make data-driven decisions
 Improve operational efficiency
 Predict future outcomes
 Enhance customer satisfaction
Data analytics supports better business intelligence and competitive advantage in the
data-driven digital era.

(b) Give the classification of data.

CO1
Data can be classified into:

1. Qualitative (Categorical) Data:


o Nominal: No order (e.g., gender, color)
o Ordinal: With order (e.g., rating scales)
2. Quantitative (Numerical) Data:
o Discrete: Countable (e.g., number of students)
o Continuous: Measurable (e.g., height, weight)

(c) Define neural network.

CO2
A neural network is a computational model inspired by the structure and function of the
human brain. It consists of layers of interconnected nodes (neurons) that process input data
and learn patterns through training to perform tasks such as classification, regression, and
pattern recognition.
(d) What is multivariate analysis?

CO2
Multivariate analysis involves the simultaneous analysis of more than two variables to
understand relationships and patterns. It is used in scenarios where multiple outcomes or
predictor variables exist, e.g., principal component analysis (PCA), factor analysis, multiple
regression, etc.

(e) Give the full form of RTAP and discuss its application.

CO3
RTAP: Real-Time Analytical Processing
It refers to the processing of data in real-time to generate instant analytics.
Applications:

 Fraud detection in banking


 Stock trading systems
 Real-time traffic monitoring
 Smart city management

(f) What is the role of sampling data in a stream?

CO3
Sampling in data streams is crucial for:

 Reducing memory usage


 Ensuring fast computation
 Preserving representative patterns
 Enabling real-time analytics with limited resources
Common sampling techniques include reservoir sampling, sliding windows, and
stratified sampling.
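
A minimal reservoir-sampling sketch in base R (illustrative only; the stream is held in a vector here just for demonstration, and its length is assumed to exceed k):

reservoir_sample <- function(stream, k) {
  reservoir <- stream[1:k]                 # fill reservoir with the first k items
  for (i in (k + 1):length(stream)) {
    j <- sample.int(i, 1)                  # random position in 1..i
    if (j <= k) reservoir[j] <- stream[i]  # replace with probability k/i
  }
  reservoir                                # a uniform sample of size k
}

set.seed(42)
reservoir_sample(1:1000, k = 5)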

(g) Discuss the use of limited pass algorithm.

CO4
Limited pass algorithms process data in a single pass or a few passes, which is essential for stream processing where the full dataset cannot be stored.
Use:

 Memory efficiency
 Suitable for large/streaming datasets
 Fast computation with approximation algorithms (e.g., Count-Min Sketch)
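
As a toy illustration of a one-pass algorithm, here is a simplified Count-Min Sketch in R; the multiplicative hash functions are ad-hoc stand-ins (real implementations use pairwise-independent hashes):

# Simplified Count-Min Sketch: d hash rows x w counters, integer keys only
d <- 3; w <- 16
cms <- matrix(0L, nrow = d, ncol = w)
a <- c(3, 5, 7); b <- c(11, 13, 17)              # per-row hash coefficients
h <- function(x, i) ((a[i] * x + b[i]) %% 97) %% w + 1

cms_add <- function(x) {
  for (i in 1:d) cms[i, h(x, i)] <<- cms[i, h(x, i)] + 1L
}
cms_count <- function(x) {
  min(sapply(1:d, function(i) cms[i, h(x, i)]))  # min over rows bounds the overcount
}

for (e in c(4, 4, 9, 4, 2)) cms_add(e)
cms_count(4)   # approximate frequency of 4 (exact count here: 3)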
(h) What is the principle behind hierarchical clustering technique?

CO4
Hierarchical clustering builds a hierarchy of clusters either:

 Agglomerative (bottom-up): Start with individual points and merge them


 Divisive (top-down): Start with all points and split them
It uses distance metrics (e.g., Euclidean) and linkage criteria (e.g., single, complete,
average) to form a dendrogram.
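
A short sketch using base R's hclust() on ten random 2-D points (illustrative data):

# Agglomerative hierarchical clustering with base R
set.seed(1)
x <- matrix(rnorm(20), ncol = 2)     # 10 points in 2-D

d <- dist(x, method = "euclidean")   # pairwise distance matrix
hc <- hclust(d, method = "average")  # average linkage

plot(hc)                             # dendrogram
cutree(hc, k = 3)                    # cut into 3 clusters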

(i) List five R functions used in descriptive statistics.

CO5

1. mean() – Mean value


2. median() – Median value
3. sd() – Standard deviation
4. summary() – Summary of statistics
5. var() – Variance
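
For instance, applied to a small vector:

x <- c(4, 8, 15, 16, 23, 42)
mean(x)     # 18
median(x)   # 15.5
sd(x)       # standard deviation
var(x)      # variance
summary(x)  # min, quartiles, median, mean, max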

(j) List the names of any 2 visualization tools.

CO5

1. Tableau
2. Power BI
Others: QlikView, Google Data Studio, Matplotlib (Python), ggplot2 (R)

🟦 SECTION B — Long Answer Type Questions


(3 × 10 = 30 Marks)

2(a) Explain the process model and computation model for Big Data platform.

CO1

Process Model:

Describes the stages involved in managing big data.

 Data Collection: Sensors, web logs, transactions


 Data Storage: Distributed file systems (e.g., HDFS)
 Data Processing: Batch (Hadoop) or Stream (Spark)
 Data Analysis: ML models, statistics
 Data Visualization: Dashboards

Computation Model:

 Batch Processing: (Hadoop MapReduce)


o Handles large-scale offline data
o Slower but cost-effective
 Stream Processing: (Apache Storm, Spark Streaming)
o Handles real-time data
o Faster with lower latency
 In-Memory Processing: (Apache Spark)
o Faster processing using RAM
o Suitable for iterative ML tasks

2(b) Explain the use and advantages of decision trees.

CO2

Use of Decision Trees:

 Classification (spam detection, loan approval)


 Regression (predicting values)
 Feature selection
 Rule extraction

Advantages:

 Easy to understand and interpret


 Requires little data preprocessing
 Handles numerical and categorical data
 Non-linear relationships captured
 Can be visualized as flowcharts

2(c) Explain the architecture of data stream model.

CO3
Data Stream Architecture Components:

1. Data Sources: Sensors, logs, web APIs


2. Data Ingestion Layer: Tools like Apache Kafka, Flume
3. Stream Processing Engine: Apache Storm, Spark Streaming
4. Storage: NoSQL (Cassandra), HDFS
5. Analytics and Visualization: Dashboards, ML models
Key Features:

 Continuous input
 Real-time processing
 Memory-efficient algorithms
 Time-windowing

2(d) Illustrate the K-means algorithm in detail with its advantages.

CO4

Steps:

1. Choose k initial centroids


2. Assign each point to the nearest centroid
3. Update centroid positions based on assigned points
4. Repeat steps 2–3 until convergence

Advantages:

 Simple and fast


 Scales to large datasets
 Works well when clusters are spherical

Limitations:

 Sensitive to initial choice


 Fixed number of clusters k
 Poor performance with non-spherical data
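
These steps correspond directly to base R's kmeans(); a short sketch on the built-in iris measurements:

# K-means with base R on the built-in iris data
set.seed(42)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 10)  # k = 3, 10 random starts

km$centers                       # final centroid positions
table(km$cluster, iris$Species)  # clusters vs. true species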

2(e) Differentiate between NoSQL and RDBMS databases.

CO5

Feature | RDBMS | NoSQL
Data Model | Relational (tables) | Document, Key-Value, Columnar
Schema | Fixed schema | Dynamic schema
Scalability | Vertical | Horizontal
Query Language | SQL | No standard query language
Examples | MySQL, Oracle | MongoDB, Cassandra, CouchDB
Suitability | Structured data | Semi/unstructured data
🟦 SECTION C — Case Study / Application Based Questions

(4 × 10 = 40 Marks)

3(a) Explain the various phases of data analytics life cycle.

CO1

1. Discovery: Define problem, identify data sources


2. Data Preparation: Clean, transform, integrate data
3. Model Planning: Choose analytical techniques
4. Model Building: Create and train models
5. Operationalize: Deploy models in real-time or batch
6. Communicate Results: Visualizations, reports, insights

3(b) Explain modern data analytics tools in detail.

CO1

1. Apache Hadoop: Batch processing, HDFS, MapReduce


2. Apache Spark: Fast, in-memory analytics
3. Tableau: Drag-and-drop visualizations
4. Power BI: Microsoft tool for dashboards
5. Python/R: Statistical programming languages
6. Apache Flink: Real-time stream processing
7. Google BigQuery: Serverless analytics data warehouse

4(a) Compare various types of support vector and kernel methods of data analysis.

CO2

 Support Vector Machines (SVM):


o Supervised learning for classification/regression
o Finds hyperplane that maximizes margin
 Kernels: Allow SVM to work in higher-dimensional feature spaces
o Linear Kernel: Fast but limited
o Polynomial Kernel: Non-linear relationships
o RBF (Gaussian) Kernel: Handles complex data
o Sigmoid Kernel: Similar to neural networks
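
A brief sketch, assuming the e1071 package (a common R interface to LIBSVM), fitting an RBF-kernel SVM on the built-in iris data:

library(e1071)  # assumed: e1071 provides svm()

# RBF (Gaussian) kernel SVM classifier
model <- svm(Species ~ ., data = iris, kernel = "radial")

# Confusion matrix of predicted vs. actual classes
table(predicted = predict(model, iris), actual = iris$Species)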
4(b) PCA Example

Given data:

X = [2, 3, 4, 5, 6, 7]
Y = [1, 5, 3, 6, 7, 8]

Steps:

1. Standardize data
2. Compute covariance matrix
3. Calculate eigenvalues and eigenvectors
4. Select principal components
5. Project data

The full matrix arithmetic is lengthy by hand; a brief R sketch using base R's prcomp(), which performs exactly these steps, is shown below.
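
Sketch with base R's prcomp() (centering and scaling standardize the data; rotation holds the eigenvectors):

X <- c(2, 3, 4, 5, 6, 7)
Y <- c(1, 5, 3, 6, 7, 8)

pca <- prcomp(cbind(X, Y), center = TRUE, scale. = TRUE)
summary(pca)  # proportion of variance explained by each component
pca$rotation  # eigenvectors (principal component loadings)
pca$x         # data projected onto the principal components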

5(a) Explain any one algorithm to count number of distinct elements in a data stream.

CO3

Flajolet-Martin Algorithm:

 Uses hash functions to map elements to binary strings
 Tracks the position of the least-significant 1-bit in each hash value (i.e., the number of trailing zeros)
 Estimates the distinct count from the maximum trailing-zero position observed, averaging over several independent hash functions for accuracy

Steps:

1. Hash each element


2. Find max trailing zeroes
3. Estimate distinct count = 2^R (R = maximum trailing-zero count)
Useful for big data streams with memory limits.
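
A toy R sketch of Flajolet-Martin (the single multiplicative hash is an illustrative stand-in; real implementations average over many independent hashes):

# Count trailing zeros in the binary form of a non-negative integer
trailing_zeros <- function(x) {
  if (x == 0) return(0)  # convention for this sketch
  r <- 0
  while (x %% 2 == 0) { x <- x %/% 2; r <- r + 1 }
  r
}

fm_estimate <- function(stream, a = 7, b = 3, m = 2^20) {
  R <- 0
  for (e in stream) {
    hv <- (a * e + b) %% m          # hash element to [0, m)
    R <- max(R, trailing_zeros(hv)) # track max trailing zeros seen
  }
  2^R                               # estimated number of distinct elements
}

fm_estimate(c(1, 5, 3, 1, 5, 9, 2, 7))  # 8, for 6 distinct values (rough estimate)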

5(b) Discuss the case study of stock market predictions in detail.

CO3

Steps in stock market prediction:

1. Data Collection: Historical stock prices, news, indicators


2. Preprocessing: Remove noise, normalize data
3. Feature Engineering: Technical indicators (MACD, RSI)
4. Modeling:
o ARIMA: Time series
o LSTM: Deep learning
o Random Forest: Feature-based
5. Evaluation: Accuracy, RMSE
6. Deployment: Predict future trends for trading
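
As a hedged illustration of the ARIMA modeling step, base R's arima() fits and forecasts a time series; AirPassengers is a built-in series standing in for price history, and the (1,1,1) order is an arbitrary illustrative choice:

# Illustrative ARIMA fit and forecast with base R
fit <- arima(AirPassengers, order = c(1, 1, 1))  # AR(1), one difference, MA(1)
predict(fit, n.ahead = 12)$pred                  # 12-step-ahead forecast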

6(a) Differentiate between CLIQUE and ProCLUS clustering.

CO4

Feature | CLIQUE | ProCLUS
Approach | Grid-based | Projected clustering
Dimensionality | Handles high-dimensional data | Works on subspace clustering
Complexity | Efficient | More complex due to projections
Output | Dense regions in subspaces | Clusters with relevant dimensions
Examples | Market basket, customer segments | Gene expression clustering

6(b) Apriori Algorithm

Transactions:

TID Items
T100 M,O,N,K,E,Y
T200 D,O,N,K,E,Y
T300 M,A,K,E
T400 M,U,C,K,Y
T500 C,O,O,K,I,E

min_sup = 60% (i.e., 3 transactions)


min_conf = 80%

Frequent Itemsets:
Use Apriori algorithm to find frequent 1-itemsets, then 2-itemsets, and so on by pruning
infrequent ones.

1. Frequent 1-itemsets (support ≥ 3): M, O, K, E, Y
2. Frequent 2-itemsets (support ≥ 3): {M,K}, {O,K}, {O,E}, {K,E}, {K,Y}
3. Frequent 3-itemsets (support ≥ 3): {O,K,E}

Strong Association Rules:


e.g.,

{M} ⇒ {K}: support = 60%, confidence = 3/3 = 100%

{O} ⇒ {E}: support = 60%, confidence = 3/3 = 100%

{E} ⇒ {K}: support = 80%, confidence = 4/4 = 100%

{K} ⇒ {E}: support = 80%, confidence = 4/5 = 80%

(Other rules, such as {Y} ⇒ {K} and {O,E} ⇒ {K}, can be derived the same way.)
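
As a cross-check, the same result can be produced with the arules package (assuming it is installed; apriori() and inspect() are its standard functions):

library(arules)  # assumed: the arules package implements the Apriori algorithm

txns <- as(list(
  T100 = c("M", "O", "N", "K", "E", "Y"),
  T200 = c("D", "O", "N", "K", "E", "Y"),
  T300 = c("M", "A", "K", "E"),
  T400 = c("M", "U", "C", "K", "Y"),
  T500 = c("C", "O", "K", "I", "E")
), "transactions")

rules <- apriori(txns, parameter = list(supp = 0.6, conf = 0.8))
inspect(sort(rules, by = "confidence"))  # strong rules, highest confidence first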

7(a) Explain the HIVE architecture with its features in detail.

CO5

Hive Architecture:

1. User Interface: CLI, Web UI


2. Driver: Manages lifecycle
3. Compiler: Converts queries to execution plans
4. Metastore: Stores schema
5. Execution Engine: Executes using MapReduce
6. HDFS: Stores actual data

Features:

 SQL-like language (HiveQL)


 Supports batch processing
 Schema on read
 Compatible with Hadoop
 Scalable and fault-tolerant

7(b) Write R function to check whether the given number is prime or not.

CO5

isPrime <- function(n) {
  if (n <= 1) return(FALSE)      # 0, 1, and negatives are not prime
  if (n <= 3) return(TRUE)       # 2 and 3 are prime
  for (i in 2:floor(sqrt(n))) {  # trial division up to sqrt(n)
    if (n %% i == 0) return(FALSE)
  }
  return(TRUE)
}

# Example:
isPrime(13) # Output: TRUE
Here is the complete, detailed solution to all questions from the June 2023 BTech Semester VI Data Analytics (KIT-601) theory exam paper.

SECTION A – Short Answer Questions (2 × 10 = 20 Marks)
(a) What are the main characteristics of Big Data?

The five major characteristics (5 V’s) of Big Data are:

1. Volume – Massive amounts of data generated every second.


2. Velocity – Speed at which new data is generated and processed.
3. Variety – Diverse data formats (structured, semi-structured, unstructured).
4. Veracity – Reliability and trustworthiness of data.
5. Value – Useful insights extracted from data.

(b) Generalize the role of analytical tools in Big Data?

Analytical tools process, analyze, and visualize Big Data to:

 Discover patterns, correlations, and trends.


 Enable decision-making through predictive and descriptive analytics.
 Automate data pipelines and enhance data-driven strategies.

(c) What are the purposes of regression analysis?

 To predict a dependent variable based on independent variables.


 To identify relationships between variables.
 To forecast trends and future outcomes.
 To assess the strength of predictors.
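
For example, a simple linear regression with base R's lm() on the built-in mtcars data:

# Predict fuel efficiency (mpg) from car weight (wt)
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)                                  # coefficients, R-squared, p-values
predict(fit, newdata = data.frame(wt = 3.0))  # forecast for a 3000 lb car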

(d) What do you mean by fuzzy qualitative model?

A fuzzy qualitative model uses fuzzy logic to handle imprecise or vague data:

 Useful where binary logic fails (e.g., temperature: cold, warm, hot).
 Allows reasoning with approximate values.
 Widely used in AI, control systems, and decision-making.
(e) Define association rule.

An association rule uncovers relationships among items in a dataset:

Form: X ⇒ Y (If X occurs, Y is likely to occur).


Example: {Milk} ⇒ {Bread}, with support and confidence measures.

(f) State the benefits of analytic sandbox.

 Provides a secure, isolated environment for experimentation.


 Allows analysts to explore and manipulate data freely.
 Prevents risk to production systems.
 Facilitates advanced analytics and model testing.

(g) What do you mean by data stream management system?

A Data Stream Management System (DSMS):

 Handles continuous, real-time data flow.


 Enables real-time querying, aggregation, and filtering.
 Examples: Apache Kafka, Apache Flink, Apache Storm.

(h) What do you mean by response modeling?

Response modeling:

 Predicts customer responses (e.g., purchase, click).


 Common in marketing to identify likely responders.
 Uses classification, regression, or machine learning models.

(i) What are the benefits of visual data exploration?

 Quick pattern and trend detection.


 Easy identification of anomalies and outliers.
 Better stakeholder communication.
 Supports faster and more informed decision-making.

(j) Mention some main goals of Hadoop.


 Distributed storage of large datasets (via HDFS).
 Parallel data processing using MapReduce.
 High fault tolerance.
 Scalability and flexibility to handle diverse data formats.

SECTION B – Descriptive Questions (10 × 3 = 30 Marks)


(a) Compare and contrast analysis and reporting in data analytics with suitable example.

Aspect | Analysis | Reporting
Purpose | Discover insights, patterns, predictions | Communicate findings and KPIs
Nature | Exploratory or predictive | Summarized and static
Techniques | Machine learning, statistics | Dashboards, static tables
Example | Predicting customer churn | Monthly sales report

Example: Analysis might reveal why sales dropped (e.g., due to pricing). Reporting will
simply state that sales dropped by 10%.

(b) What is a neural network? How can it be used in analytics?

Neural Network: A machine learning model inspired by the human brain. It has layers of
interconnected nodes (neurons) that process inputs to produce outputs.

Use in Analytics:

 Classification: Image or speech recognition.


 Regression: Predicting stock prices.
 Clustering: Customer segmentation.
 Forecasting: Sales, weather, or demand prediction.
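
A minimal classification sketch, assuming the nnet package (a standard single-hidden-layer network for R):

library(nnet)  # assumed: nnet provides a single-hidden-layer neural network

set.seed(1)
model <- nnet(Species ~ ., data = iris, size = 5, trace = FALSE)  # 5 hidden units

# Confusion matrix: predicted vs. actual species
table(predicted = predict(model, iris, type = "class"), actual = iris$Species)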

(c) Explain Apriori association rule mining algorithm.

Apriori Algorithm Steps:

1. Identify frequent itemsets using support.


2. Generate candidate itemsets from previous frequent itemsets.
3. Prune infrequent itemsets.
4. Generate strong association rules using confidence.

Example:
Dataset:
 T1: A, B
 T2: A, C
 T3: A, B, C
 T4: B, C

From frequent itemsets:

 Rule: A ⇒ B (Support = 2/4 = 50%, Confidence = 2/3 ≈ 66.7%)

(d) List the advantages and disadvantages of K-Means clustering.

Advantages:

 Simple and fast.


 Efficient for large datasets.
 Scales well with dimensions.

Disadvantages:

 Requires pre-defined number of clusters (K).


 Sensitive to initial centroids.
 Not ideal for non-spherical clusters or noise.

(e) What is HDFS? How does it handle Big Data?

HDFS (Hadoop Distributed File System):

 Distributed storage system in Hadoop.

Handles Big Data by:

 Splitting data into blocks (default 128MB).


 Storing blocks across multiple nodes.
 Replicating blocks (default: 3 copies) for fault tolerance.
 Ensuring data locality for faster processing.

SECTION C – Long Answer Questions (10 × 5 = 50 Marks)
Q3 (a) What are the various stages in Big Data Analytics Life Cycle?

Stages of Big Data Analytics Life Cycle:


1. Data Discovery – Identify data sources and understand requirements.
2. Data Preparation – Cleaning, transforming, and integrating data.
3. Model Planning – Choose techniques and tools (e.g., regression, clustering).
4. Model Building – Train and test models.
5. Operationalize – Deploy models in production.
6. Communicate Results – Share insights through reports and dashboards.

Diagram (You can draw the six-stage wheel or cycle with arrows connecting each step).

Q3 (b) What is the difference between regression modelling and Bayesian modeling?

Feature | Regression Modeling | Bayesian Modeling
Type | Deterministic | Probabilistic
Assumptions | Fixed parameters | Parameters have distributions
Output | Point estimate | Probability distributions
Use | Predict value | Incorporate prior beliefs and update with data
Example | Linear regression | Bayesian linear regression with priors

Q4 (a) Explain the role of Principal Component Analysis (PCA) in neural networks.

PCA Role in Neural Networks:

 Dimensionality reduction before training.


 Reduces noise and redundancy.
 Speeds up training and improves accuracy.
 Enhances feature selection by projecting onto significant principal components.
 Preprocessing step for input normalization.

Q4 (b) What are the parameters used to characterize any fuzzy membership function?

Key parameters:

1. Support: Range of input values where the membership > 0.


2. Core: Input values with membership = 1.
3. Crossover points: Inputs where membership = 0.5.
4. Shape: Triangular, trapezoidal, bell-shaped.
5. Continuity and smoothness: Determines function behavior and differentiability.
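
These parameters can be made concrete with a small triangular membership function; tri_mf below is a hypothetical helper, not a standard library function:

# Triangular membership: 0 outside [a, c], rising to 1 at the peak b
tri_mf <- function(x, a, b, c) {
  pmax(pmin((x - a) / (b - a), (c - x) / (c - b)), 0)
}

# Membership of temperatures in a "warm" set peaking at 20
tri_mf(c(10, 15, 20, 25, 30), a = 10, b = 20, c = 30)
# 0.0 0.5 1.0 0.5 0.0  -> support (10, 30), core {20}, crossovers at 15 and 25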
Q5 (a) Discriminate the concept of sampling data in a stream.

Sampling in Data Streams:

 Extracts representative data from infinite stream.


 Reduces memory and processing overhead.

Techniques:

 Reservoir Sampling – Maintains a sample of size k from stream of unknown length.


 Sliding Window Sampling – Samples last N items or time duration.
 Bernoulli Sampling – Select each item with fixed probability p.

Q5 (b) Illustrate various Real Time Analytics Platforms (RTAPs) with examples.

RTAPs process data as it arrives in real-time. Examples:

Platform | Description
Apache Storm | Distributed real-time computation system.
Apache Flink | Handles stream and batch data.
Apache Kafka + Kafka Streams | Messaging + real-time processing.
Spark Streaming | Micro-batch streaming with Spark.

Used in fraud detection, IoT monitoring, stock trading, etc.

Q6 (a) Explain the working of CLIQUE algorithm in brief.

CLIQUE (CLustering In QUEst):

 Grid-based and density-based subspace clustering.


Steps:

1. Divide data space into uniform grids.


2. Identify dense units (grids exceeding density threshold).
3. Merge dense units into clusters.
4. Automatically identifies subspaces relevant for clusters.

Advantages:

 Scales to high dimensions.


 Efficient and automatic subspace detection.
Q6 (b) Identify the major issues in data stream query processing.

Major Issues:

1. Memory limitations – Data is infinite, memory is finite.


2. Real-time constraints – Low-latency processing needed.
3. Approximation – Exact answers are hard; rely on probabilistic models.
4. Concept drift – Data distribution changes over time.
5. Out-of-order arrivals – Need time windows or buffering.

Q7 (a) Illustrate and explain the concept of MapReduce framework in brief.

MapReduce is a programming model for processing large data sets.

Phases:

1. Map – Converts input data into key-value pairs.


2. Shuffle – Groups key-value pairs by keys.
3. Reduce – Aggregates values for each key.

Example: Word Count

 Map: (word, 1)
 Reduce: (word, total count)

Highly fault-tolerant, parallel, scalable.
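
A word-count sketch in plain R mimics the three phases (illustrative only, not actual Hadoop code):

docs <- c("big data big analytics", "data stream data")

# Map: emit a (word, 1) pair for every word in every document
mapped <- do.call(rbind, lapply(docs, function(d) {
  data.frame(key = strsplit(d, " ")[[1]], value = 1)
}))

# Shuffle + Reduce: group pairs by key and sum the values per key
reduced <- aggregate(value ~ key, data = mapped, FUN = sum)
reduced  # analytics 1, big 2, data 3, stream 1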

Q7 (b) Write R function to check whether the given number is prime or not?
is_prime <- function(n) {
  if (n <= 1) {
    return(FALSE)                # 0, 1, and negatives are not prime
  }
  if (n <= 3) {
    return(TRUE)                 # 2 and 3 are prime
  }
  for (i in 2:floor(sqrt(n))) {  # trial division up to sqrt(n)
    if (n %% i == 0) {
      return(FALSE)
    }
  }
  return(TRUE)
}

# Example:
is_prime(7)  # Output: TRUE
is_prime(10) # Output: FALSE
