DHANALAKSHMI SRINIVASAN COLLEGE OF ENGINEERING

(AUTONOMOUS)
Coimbatore – 641105
B. E/ B.Tech. DEGREE EXAMINATIONS, NOV / DEC- 2024
Fifth Semester
CCS334– BIG DATA ANALYTICS (Artificial Intelligence and Data Science)
(Regulations 2021)

QP ID: N24224

PART – A (10 x 2 = 20 Marks)


1 List the two characteristics of big data.
 Volume: Large amounts of data generated.
 Variety: Different types of data (structured, unstructured, and semi-
structured).

2 Recall the term "data explosion."


 "Data explosion" refers to the rapid and exponential growth of data
generated due to digital technologies like social media, IoT, and mobile
devices.

3 What is Cassandra? Mention its purpose.


 Apache Cassandra is a highly scalable, distributed NoSQL database.
 Purpose: To handle large volumes of data across many servers without a
single point of failure. It's widely used for real-time big data applications.

4 State the significance of eventual consistency.


 Eventual consistency ensures that all replicas in a distributed system will
converge to the same value over time, even if updates are not immediately
synchronized.

5 Relate data format in the context of big data.


 Data formats in big data include:
 Structured (e.g., relational databases)
 Unstructured (e.g., images, videos, text)
 Semi-structured (e.g., JSON, XML).
These formats impact storage, processing, and analysis in big data
systems.

6 Mention the primary purpose of Hadoop in data analysis.


 Hadoop enables distributed storage (HDFS) and processing (MapReduce)
of large datasets across clusters of computers, making data analysis scalable
and fault-tolerant.
7 Name one built-in output format used in Hadoop.
 TextOutputFormat: It writes key-value pairs as plain text.

8 List the different types of MapReduce jobs.


 Mapper-only job
 Reducer-only job
 Map and Reduce job
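
The job types above differ only in how the driver is configured. A minimal, hedged sketch of a Hadoop driver for a mapper-only job (the class name, pass-through mapper, and input/output paths are illustrative), which also uses the TextOutputFormat named in Question 7:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MapOnlyJob {
    // Illustrative pass-through mapper: emits each input line unchanged.
    public static class PassThroughMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            context.write(key, value);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only example");
        job.setJarByClass(MapOnlyJob.class);
        job.setMapperClass(PassThroughMapper.class);
        job.setNumReduceTasks(0);                         // mapper-only job: no shuffle or reduce
        job.setOutputFormatClass(TextOutputFormat.class); // writes key <TAB> value as plain text
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input path supplied at run time
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path supplied at run time
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Setting the number of reduce tasks to zero skips the shuffle, sort, and reduce stages entirely; the map output is written directly by the configured output format.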

9 Differentiate HiveQL from SQL.
HiveQL:
 Designed for querying big data in Hadoop.
 Executes on top of Hive, translating queries into MapReduce jobs.
SQL:
 Used for relational databases (RDBMS).
 Follows ACID compliance and uses traditional database engines.

10 Draw the structure of the Pig data model.


 The Pig data model follows:
 Atom: Single value.
 Tuple: Collection of fields.
 Bag: Collection of tuples.
 Map: Key-value pairs.

PART – B (5 x 13 = 65 Marks)
11(a) Outline the challenges and opportunities of unstructured data in big data
analytics.
1. Challenges of Unstructured Data:
 Volume: Unstructured data is generated in massive quantities, making
storage and management difficult.
 Variety: The data comes in diverse forms (e.g., images, text, audio, videos),
making it hard to process.
 Lack of Schema: Unlike structured data, there is no predefined schema,
which complicates analysis.
 Storage Costs: Unstructured data needs scalable storage systems like
Hadoop or cloud solutions.
 Data Quality: It may contain inconsistencies, noise, or redundant
information.
 Processing Complexity: Requires advanced tools like NLP, computer
vision, and AI for meaningful extraction.
2. Opportunities and Examples:
 Opportunities:
o Deriving actionable insights from social media trends and customer behaviours.
o Enhancing decision-making by analysing customer sentiments or
video data.
o Training AI models for tasks like speech recognition and image
classification.
 Examples of Unstructured Data Sources:
o Social media: Insights into user engagement, trends (Twitter,
Facebook).
o Videos and Images: Used in healthcare (X-ray analysis) and
autonomous driving.
o Emails and Texts: Sentiment analysis to gauge customer
satisfaction.
o Sensors and IoT: Generating log files and real-time monitoring
data.

OR
11(b) Summarize the importance of Web Analytics in understanding user behaviour and improving website performance. Discuss the key metrics that organizations should focus on.
1. Importance of Web Analytics:
 Helps analyse user behaviours and optimize website content.
 Provides insights into user journeys to identify pain points.
 Improves website performance by identifying bottlenecks (e.g., slow pages,
broken links).
 Aids in decision-making for marketing strategies through data-driven
insights.
 Tracks user interaction across platforms to enhance user experience (UX).
 Helps monitor the success of campaigns (ads, SEO).
2. Key Metrics Organizations Should Focus On:
 Page Views: Number of pages viewed by users.
 Bounce Rate: Percentage of visitors leaving without interaction.
 Session Duration: Average time users spend on the site.
 Conversion Rate: Percentage of users completing desired actions (e.g.,
purchases).
 Traffic Sources: Identifies where users come from (e.g., search engines,
social media).
 Click-through Rate (CTR): The rate of users clicking on ads or links.
 Exit Rate: Tracks the page where users most often leave.
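
As a small worked illustration of how two of these metrics are computed (the traffic counts below are purely illustrative):

public class WebMetrics {
    public static void main(String[] args) {
        // Illustrative counts for one day of traffic (not real data).
        int sessions = 10_000;
        int singlePageSessions = 4_200;   // sessions that ended without further interaction
        int conversions = 350;            // sessions that completed a purchase

        double bounceRate = 100.0 * singlePageSessions / sessions;   // 42.0 %
        double conversionRate = 100.0 * conversions / sessions;      //  3.5 %

        System.out.printf("Bounce rate: %.1f%%, Conversion rate: %.1f%%%n",
                bounceRate, conversionRate);
    }
}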

12(a) Outline the Role of Cassandra Clients in Interacting with the Database. Explain the importance of Client Libraries and Drivers in Application Development.
 Cassandra clients allow developers to interact with the database by sending
queries and receiving responses.
 They act as an intermediary between the application and the distributed
database nodes.
 Clients help manage data replication across nodes and ensure fault
tolerance.
 They maintain consistency levels (e.g., ONE, QUORUM, ALL) as per
application requirements.
 Examples of Cassandra clients include:
 CQLSH: Command-line interface for running CQL queries.
 DataStax Java Driver: Widely used for Java-based applications.
Importance of Client Libraries and Drivers in Application Development:
 Simplifies Development: Libraries offer pre-built functions for connecting
to Cassandra and executing queries.
 Language Support: Drivers are available for multiple programming
languages like Python, Java, and C++.
 Performance Optimization: Drivers handle connection pooling, retries, and
load balancing automatically.
 Integration: Helps integrate Cassandra with various applications and
frameworks.
 Scalability: Client libraries support distributed architecture, allowing
applications to scale seamlessly.
 Example Libraries:
o Java Driver for Java applications.
o PyCassa or Cassandra-driver for Python-based tools.
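
A minimal sketch of a client application built on the DataStax Java Driver (assuming driver 4.x and a single Cassandra node running locally; the query targets Cassandra's built-in system.local table):

import java.net.InetSocketAddress;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.ResultSet;
import com.datastax.oss.driver.api.core.cql.Row;

public class CassandraClientExample {
    public static void main(String[] args) {
        // The driver handles connection pooling, retries and load balancing internally.
        try (CqlSession session = CqlSession.builder()
                .addContactPoint(new InetSocketAddress("127.0.0.1", 9042)) // assumed local node
                .withLocalDatacenter("datacenter1")   // must match the node's configured datacenter
                .build()) {
            ResultSet rs = session.execute(
                "SELECT release_version FROM system.local");  // built-in system table
            Row row = rs.one();
            System.out.println("Connected to Cassandra " + row.getString("release_version"));
        }
    }
}

The same connect–query–close pattern applies to drivers in other languages; only the connection and statement APIs differ.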

OR
12(b) Illustrate various consistency models available in NoSQL databases and their
implications for data integrity and application performance.
 NoSQL databases support flexible data storage, but
consistency can vary based on the use case.
 Consistency models determine how updated data is made
visible across distributed systems.

Types of Consistency Models:


a. Strong Consistency
 Ensures that all reads return the most recent write.
 Example: Databases like HBase.
 Implications: High data integrity but can lead to increased
latency.
b. Eventual Consistency
 Guarantees that all replicas converge to the same value over
time.
 Example: DynamoDB.
 Implications: Better performance and availability but may
temporarily serve stale data.
c. Causal Consistency
 Ensures operations that are causally related are seen in the same
order.
 Example: Systems requiring session consistency.
 Implications: Balances between performance and consistency
for certain workflows.
d. Read-Your-Writes Consistency
 Ensures a user always sees their latest updates.
 Example: Systems with user-specific data updates.
 Implications: Provides a tailored consistency at the cost of
computational overhead.
Comparison of Implications:
 Strong consistency is ideal for systems requiring high data
accuracy, like financial applications.
 Eventual consistency is suitable for high-availability
applications like social media platforms.
 Trade-offs exist between consistency, latency, and availability
(CAP theorem).
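
Cassandra (see 12(a)) exposes these trade-offs directly as tunable, per-statement consistency levels. A hedged sketch assuming the DataStax Java Driver 4.x and a local node (keyspace, table, and column names are illustrative):

import com.datastax.oss.driver.api.core.ConsistencyLevel;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.DefaultConsistencyLevel;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public class ConsistencyLevelExample {
    // Builds a read with a per-statement consistency level.
    // QUORUM behaves closer to strong consistency; ONE favours latency and availability
    // and accepts eventually consistent (possibly stale) reads.
    static SimpleStatement readProfile(String userId, ConsistencyLevel level) {
        return SimpleStatement.newInstance(
                "SELECT * FROM app.user_profile WHERE user_id = ?", userId) // illustrative schema
            .setConsistencyLevel(level);
    }

    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {  // assumes default 127.0.0.1:9042
            session.execute(readProfile("u42", DefaultConsistencyLevel.QUORUM)); // stronger read
            session.execute(readProfile("u42", DefaultConsistencyLevel.ONE));    // faster, possibly stale
        }
    }
}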

13(a) Importance of Scaling Out in Big Data Environments and Hadoop’s Architecture

1. Definition of Scaling Out:


o Scaling out refers to adding more nodes (servers or machines) to a
system to increase processing capacity. It contrasts with scaling up,
which involves upgrading the hardware of existing servers.
2. Handling Massive Data Volumes:
o Big data environments deal with enormous datasets that cannot be
processed efficiently by a single machine. Scaling out allows data to
be distributed across multiple nodes, ensuring parallel processing.
3. Cost-Effectiveness:
o Scaling out is more cost-effective compared to scaling up because
organizations can use commodity hardware instead of high-cost
enterprise systems.
4. Performance and Fault Tolerance:
o Distributed systems ensure improved performance by processing
data simultaneously on multiple nodes.
o It also increases fault tolerance: if one node fails, others can
continue working without system failure.
5. Elasticity:
o Scaling out allows organizations to handle growing data demands by
adding more machines dynamically, providing flexibility to scale
infrastructure as needed.
6. Examples:
o Organizations like Facebook, Twitter, and Google use distributed
systems to scale out their data processing operations.
o Hadoop clusters use horizontal scaling to handle petabytes of data.

Hadoop’s Architecture Supporting Scaling Out:
1. Hadoop Distributed File System (HDFS):


o HDFS splits large data files into smaller blocks and distributes them
across nodes in the cluster.
o This enables parallel processing, a key benefit of horizontal scaling.
2. MapReduce Programming Model:
o Hadoop’s MapReduce framework processes data in parallel across
multiple nodes. Tasks are divided into Map and Reduce jobs,
enabling efficient scaling for big data.
3. Cluster-Based Processing:
o Hadoop operates on a cluster of machines where each machine
(node) performs part of the overall computation.
o Adding nodes to the cluster improves processing speed and handles
larger datasets.
4. Fault Tolerance in Hadoop:
o Data is replicated across multiple nodes, ensuring no data loss if a
node fails. Horizontal scaling supports this fault-tolerant
mechanism.
5. Commodity Hardware Support:
o Hadoop is designed to run on low-cost, commodity hardware,
making horizontal scaling affordable and feasible for organizations.
6. Benefits of Horizontal Scaling:
o Improved throughput: More nodes mean more tasks can be
processed simultaneously.
o Enhanced reliability: Redundancy ensures system resilience to
hardware failures.
o Cost-efficiency: Scaling out avoids expensive hardware upgrades.
7. Example:
o In a Hadoop cluster, data processing for an e-commerce website
with 10 TB of sales data can be split across multiple nodes to reduce
processing time significantly.

OR
13(b) Draw the architecture of HDFS, including the key components such as
NameNode and DataNode. Discuss how this architecture enables reliable and
scalable storage.

Architecture of HDFS:
1. Overview of HDFS:
o The Hadoop Distributed File System (HDFS) is a distributed file
system designed to store and manage large datasets across a cluster
of machines.
2. Key Components:
o NameNode:
 The master node responsible for managing metadata, such as
file names, block locations, and directory structure.
 Example: NameNode keeps track of where file blocks are
stored.
o DataNodes:
 Worker nodes that store data blocks and handle read/write
requests from clients.
 DataNodes regularly send heartbeat signals to the
NameNode to confirm availability.
o Blocks:
 Files are split into fixed-size blocks (default size: 128 MB)
and distributed across DataNodes.
 Each block is replicated across multiple nodes to ensure fault
tolerance.
3. HDFS Write Process:
o Client sends the data to the NameNode.
o The NameNode assigns DataNodes to store data blocks.
o Data is replicated (default replication factor: 3) to ensure
redundancy.
4. HDFS Read Process:
o Client requests the file from the NameNode.
o NameNode provides the locations of DataNodes containing the
required file blocks.
o Client fetches data directly from the DataNodes.
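
From the client's point of view, both flows are hidden behind the HDFS FileSystem API. A minimal sketch (assuming a configured Hadoop client on the classpath; the file path is illustrative, and block placement and replication happen transparently behind these calls):

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();       // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);           // client contacts the NameNode for metadata
        Path path = new Path("/user/demo/sample.txt");  // illustrative path

        // Write: the NameNode assigns DataNodes; the client streams block data to them.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the NameNode returns block locations; data is fetched directly from DataNodes.
        try (FSDataInputStream in = fs.open(path)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}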

Reliability and Scalability of HDFS:


1. Fault Tolerance:
o HDFS ensures reliability by replicating data blocks across multiple
DataNodes. If one node fails, data can still be accessed from other
nodes.
2. Scalability:
o Horizontal scaling is achieved by adding more nodes to the HDFS
cluster. This allows HDFS to manage growing data without
performance degradation.
3. Data Integrity:
o HDFS uses checksums to verify data integrity during transmission.
If a corrupted block is detected, it is replaced with a healthy replica.
4. High Throughput:
o HDFS is optimized for sequential reads and writes, providing high
throughput for large files.
5. Commodity Hardware Support:
o HDFS can run on inexpensive hardware, reducing costs while
maintaining scalability and reliability.
6. Replication Management:
o Replication ensures high availability: the NameNode automatically
reassigns block replicas if a DataNode fails.
7. Example:
o In a 5-node HDFS cluster, a 1 GB file split into 8 blocks (128 MB
each) will be replicated across nodes for redundancy, ensuring data
availability.

14(a) Illustrate the shuffle and sort process in detail, explaining how it facilitates the
grouping of intermediate data for the Reduce phase. Use diagrams to illustrate
the process.
1. Introduction to Shuffle and Sort
o The Shuffle step transfers intermediate key-value pairs from the
Map phase to the Reduce phase.
o The Sort step organizes these key-value pairs by key to prepare
them for the Reducer.
o Essential for aggregating and processing grouped data efficiently.
2. Detailed Process of Shuffle and Sort
o Map Output: The Mapper outputs key-value pairs.
o Partitioning: Determines which Reducer processes which keys
based on a partitioning function.
o Shuffling: Intermediate data is sent to corresponding Reducers.
o Sorting:
 Keys are sorted within each partition.
 This ensures that data is grouped by key for the Reducer.
o Combiner (Optional): Reduces the size of intermediate data by
performing local aggregation.
3. Diagram Explanation
o A diagram should include:
 Output from Mapper.
 Partitioning function directing data to Reducers.
 Sorted key-value pairs grouped by key in the Reducer input.
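
As a concrete illustration of what shuffle and sort deliver to the Reduce phase, a hedged word-count sketch (class names are illustrative; the driver configuration is omitted, see the sketch under Question 8): every (word, 1) pair emitted by the Mapper is partitioned, shuffled, and sorted so that each reduce() call receives one key together with all of its values.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map phase: emit (word, 1) for every token in the input line.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // intermediate key-value pair
                }
            }
        }
    }

    // Reduce phase: shuffle and sort have already grouped the intermediate pairs by word,
    // so each call receives one key and an iterable of all its counts.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }
}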

OR
14(b) Determine the failure handling mechanisms in Classic MapReduce and how
they compare to those in YARN. Explain how YARN improves fault tolerance
and job reliability.
Failure Handling in Classic MapReduce (4 marks)
 Task Failures: Retries failed tasks up to a defined limit.
 Node Failures:
 Job Tracker reassigns tasks from failed nodes to healthy nodes.
 Heartbeat mechanism detects failures.
 Job Tracker Failure: Entire job fails as Job Tracker is a single point of
failure.

Failure Handling in YARN (4 marks)


 Resource Manager: Monitors and reassigns tasks in case of node failures.
 Node Manager: Detects and reports failures to Resource Manager.
 Application Master: Handles application-level failures and ensures task
retries.
 Eliminates Single Point of Failure: Resource Manager and Application
Master reduce reliance on a single Job Tracker.

Comparison and Improvements (5 marks)


 YARN decentralizes responsibilities across Resource Manager and Node
Manager.
 Enhanced fault detection and quicker recovery.
 Better scalability and reliability compared to Classic MapReduce.

15(a) Analyse a real-world example of data manipulation using HiveQL, detailing the steps taken and the outcomes achieved.
Introduction to HiveQL (3 marks)
 HiveQL is used to query and manipulate data stored in HDFS.
 Real-world use case: Analyse website log data to identify peak user activity
times.
Steps in HiveQL Data Manipulation (6 marks)
a. Loading Data: Import website log data into a Hive table.

LOAD DATA INPATH '/logs/user_activity' INTO TABLE activity_log;

b. Data Filtering: Select records from a specific date.

SELECT * FROM activity_log WHERE date = '2024-12-01';

c. Aggregation: Calculate hourly user activity.

SELECT hour, COUNT(*) AS user_count
FROM activity_log
GROUP BY hour
ORDER BY user_count DESC;

d. Storing Results: Save processed data into a new table for reporting.

CREATE TABLE peak_hours AS
SELECT hour, COUNT(*) AS user_count
FROM activity_log
GROUP BY hour
HAVING COUNT(*) > 1000;

Outcomes Achieved (4 marks)


 Identified peak hours for website traffic.
 Results used to optimize server loads and improve user experience.
 Data saved for historical analysis and business strategy decisions.

OR
15(b) Distinguish the syntax and semantics of Pig Latin. Provide examples of common operations and how they are expressed in Pig Latin scripts.
Introduction to Pig Latin
 Pig Latin is a high-level scripting language for analysing large data sets in
Hadoop.
 Focuses on procedural data processing compared to SQL’s declarative
approach.

Syntax and Semantics


 Syntax:
 Operations are expressed as a sequence of steps (LOAD →
TRANSFORM → STORE).
 Example:
data = LOAD 'data.txt' USING PigStorage(',');
grouped_data = GROUP data BY $0;
results = FOREACH grouped_data GENERATE COUNT($1);
STORE results INTO 'output';
 Semantics:
 Procedural nature allows intermediate transformations.
 Operations execute lazily (only upon a DUMP or STORE).

Common Operations (5 marks)


 Loading Data:
raw_data = LOAD 'input.txt' USING PigStorage(',');

 Filtering Data:
filtered_data = FILTER raw_data BY $0 > 100;

 Grouping and Aggregating:


grouped_data = GROUP raw_data BY $1;
aggregated = FOREACH grouped_data GENERATE COUNT(raw_data);

 Storing Results:
STORE aggregated INTO 'output';

PART – C (1 x 15 = 15 Marks)
16(a) Elaborate various applications of big data across sectors such as marketing,
finance, and supply chain management. Provide examples of specific tools or
methods used in each application.
Big Data in Marketing (5 Marks)
 Applications:
o Customer segmentation and targeting based on purchase history and
browsing behaviours.
o Personalized marketing campaigns using predictive analytics.
o Sentiment analysis to gauge public opinion about a brand or
product.
 Examples of Tools/Methods:
o Google Analytics for tracking and analysing web traffic.
o Apache Spark for real-time data processing in marketing campaigns.
o Natural Language Processing (NLP) for analysing customer reviews
and social media sentiments.
Big Data in Finance (5 Marks)
 Applications:
 Fraud detection through anomaly detection techniques in
transactional data.
 Risk management and credit scoring by analysing customer credit
behaviours.
 High-frequency trading using predictive models.
 Examples of Tools/Methods:
 SAS for financial data analytics.
 Hadoop for processing large volumes of transaction data.
 Machine Learning algorithms for credit risk modelling and fraud
detection.
Big Data in Supply Chain Management (5 Marks)
 Applications:
 Inventory optimization using real-time data from IoT devices.
 Demand forecasting to ensure product availability during peak
seasons.
 Route optimization for logistics and delivery using GPS and
historical data.
 Examples of Tools/Methods:
 Tableau for visualizing supply chain metrics.
 Apache Kafka for streaming real-time data from IoT sensors.
 Predictive analytics for demand and supply planning.

OR
16(b) Relate the concepts of data replication and fault tolerance in HDFS. Explain
how these concepts ensure data integrity and availability in a distributed
environment.
Concept of Data Replication (5 Marks)
 Definition: In HDFS, data replication means duplicating blocks of data
across multiple nodes.
 Purpose: Ensures data availability and fault tolerance in case of node
failures.
 Replication Factor: Default is three copies.
o Example: A 128 MB block is replicated across three different nodes.
 Advantages:
o High availability: If one node fails, replicas on other nodes can be
accessed.
o Scalability: Distributed copies allow parallel processing.
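A small sketch of how the replication factor is controlled from a client: the dfs.replication property sets the default for new files, and FileSystem.setReplication changes it for an existing file (the file path is illustrative).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");           // default replication factor for new files
        FileSystem fs = FileSystem.get(conf);

        // Request 4 replicas for an existing (illustrative) file;
        // the NameNode schedules the extra copies asynchronously.
        boolean requested = fs.setReplication(new Path("/user/demo/sales.csv"), (short) 4);
        System.out.println("Replication change requested: " + requested);
        fs.close();
    }
}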
Fault Tolerance in HDFS (5 Marks)
 Definition: The ability of a system to continue functioning in the event of a
hardware or software failure.
 Mechanisms:
o Heartbeat mechanism: NameNode checks health of DataNodes.
o Automatic re-replication: Missing blocks are automatically
replicated to healthy nodes.
 Example: If a DataNode storing a block fails, HDFS retrieves the block
from another node with a replica.
Ensuring Data Integrity and Availability (5 Marks)
 Data Integrity:
o Each block is stored with a checksum.
o During reads, checksums are verified to ensure no corruption.
 Data Availability:
o Replication ensures access even during node failures.
o NameNode maintains metadata to locate replicas.
 Distributed Environment:
o Allows data storage and processing across multiple nodes, ensuring
reliability and scalability.
o Example: Large-scale systems like Netflix use HDFS to ensure
uninterrupted service despite failures.
