(AUTONOMOUS)
Coimbatore – 641105
B.E. / B.Tech. DEGREE EXAMINATIONS, NOV / DEC 2024
Fifth Semester
CCS334 – BIG DATA ANALYTICS (Artificial Intelligence and Data Science)
(Regulations 2021)
QP ID: N24224
9. HiveQL:
Designed for querying big data stored in Hadoop.
Runs on Hive, which translates queries into MapReduce jobs.
SQL:
Used for relational databases (RDBMS).
Follows ACID compliance and uses traditional database engines.
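A minimal connection sketch in Python using the PyHive library (the HiveServer2 host/port, database, and the web_logs table are illustrative assumptions):
    from pyhive import hive

    # Connect to HiveServer2 (host, port, and database are assumed values).
    conn = hive.connect(host='localhost', port=10000, database='default')
    cursor = conn.cursor()

    # HiveQL looks like SQL, but Hive compiles it into MapReduce jobs
    # that run over data stored in Hadoop.
    cursor.execute("SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page")
    for page, hits in cursor.fetchall():
        print(page, hits)

    cursor.close()
    conn.close()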
PART – B (5 x 13 = 65 Marks)
11(a) Outline the challenges and opportunities of unstructured data in big data
analytics.
1. Challenges of Unstructured Data:
Volume: Unstructured data is generated in massive quantities, making
storage and management difficult.
Variety: The data comes in diverse forms (e.g., images, text, audio, videos),
making it hard to process.
Lack of Schema: Unlike structured data, there is no predefined schema,
which complicates analysis.
Storage Costs: Unstructured data needs scalable storage systems like
Hadoop or cloud solutions.
Data Quality: It may contain inconsistencies, noise, or redundant
information.
Processing Complexity: Requires advanced tools like NLP, computer
vision, and AI for meaningful extraction.
2. Opportunities and Examples:
Opportunities:
o Deriving actionable insights from social media trends and customer behaviours.
o Enhancing decision-making by analysing customer sentiments or
video data.
o Training AI models for tasks like speech recognition and image
classification.
Examples of Unstructured Data Sources:
o Social media: Insights into user engagement, trends (Twitter,
Facebook).
o Videos and Images: Used in healthcare (X-ray analysis) and
autonomous driving.
o Emails and Texts: Sentiment analysis to gauge customer
satisfaction.
o Sensors and IoT: Generating log files and real-time monitoring
data.
OR
11(b) Summarize the importance of web analytics in understanding user behaviour and improving website performance. Discuss key metrics that organizations should focus on.
1. Importance of Web Analytics:
Helps analyse user behaviours and optimize website content.
Provides insights into user journeys to identify pain points.
Improves website performance by identifying bottlenecks (e.g., slow pages,
broken links).
Aids in decision-making for marketing strategies through data-driven
insights.
Tracks user interaction across platforms to enhance user experience (UX).
Helps monitor the success of campaigns (ads, SEO).
2. Key Metrics Organizations Should Focus On:
Page Views: Number of pages viewed by users.
Bounce Rate: Percentage of visitors leaving without interaction.
Session Duration: Average time users spend on the site.
Conversion Rate: Percentage of users completing desired actions (e.g.,
purchases).
Traffic Sources: Identifies where users come from (e.g., search engines,
social media).
Click-through Rate (CTR): The rate of users clicking on ads or links.
Exit Rate: Tracks the page where users most often leave.
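A toy Python computation of these metrics from per-session records (the session fields and figures below are illustrative assumptions, not output from any real analytics tool; a single-page session is counted as a bounce):
    sessions = [
        {"pages": 1, "seconds": 8,   "converted": False, "clicked_ad": False},
        {"pages": 5, "seconds": 240, "converted": True,  "clicked_ad": True},
        {"pages": 3, "seconds": 95,  "converted": False, "clicked_ad": True},
        {"pages": 1, "seconds": 12,  "converted": False, "clicked_ad": False},
    ]

    n = len(sessions)
    page_views = sum(s["pages"] for s in sessions)
    bounce_rate = 100 * sum(1 for s in sessions if s["pages"] == 1) / n
    avg_duration = sum(s["seconds"] for s in sessions) / n
    conversion_rate = 100 * sum(s["converted"] for s in sessions) / n
    ctr = 100 * sum(s["clicked_ad"] for s in sessions) / n

    print(f"Page views: {page_views}, Bounce rate: {bounce_rate:.0f}%")
    print(f"Avg session: {avg_duration:.0f}s, "
          f"Conversion: {conversion_rate:.0f}%, CTR: {ctr:.0f}%")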
12(a) Outline the role of Cassandra clients in interacting with the database. Explain the importance of client libraries and drivers in application development.
Role of Cassandra Clients:
Cassandra clients allow developers to interact with the database by sending queries and receiving responses.
They act as an intermediary between the application and the distributed
database nodes.
Clients help manage data replication across nodes and ensure fault
tolerance.
They maintain consistency levels (e.g., ONE, QUORUM, ALL) as per
application requirements.
Examples of Cassandra clients include:
CQLSH: Command-line interface for running CQL queries.
DataStax Java Driver: Widely used for Java-based applications.
Importance of Client Libraries and Drivers in Application Development:
Simplifies Development: Libraries offer pre-built functions for connecting
to Cassandra and executing queries.
Language Support: Drivers are available for multiple programming
languages like Python, Java, and C++.
Performance Optimization: Drivers handle connection pooling, retries, and
load balancing automatically.
Integration: Helps integrate Cassandra with various applications and
frameworks.
Scalability: Client libraries support distributed architecture, allowing
applications to scale seamlessly.
Example Libraries:
o Java Driver for Java applications.
o PyCassa or Cassandra-driver for Python-based tools.
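A minimal sketch using the DataStax Python driver (cassandra-driver); the contact point, keyspace, and users table are assumptions, while ConsistencyLevel.QUORUM illustrates the per-query consistency setting described above:
    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    # The driver hides node discovery, connection pooling, and retries.
    cluster = Cluster(['127.0.0.1'])              # contact point (assumed)
    session = cluster.connect('demo_keyspace')    # keyspace (assumed)

    # Read at QUORUM: a majority of replicas must answer before the
    # result is returned, trading some latency for stronger consistency.
    stmt = SimpleStatement(
        "SELECT name, email FROM users WHERE id = %s",
        consistency_level=ConsistencyLevel.QUORUM,
    )
    print(session.execute(stmt, (42,)).one())

    cluster.shutdown()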
OR
12(b) Illustrate various consistency models available in NoSQL databases and their
implications for data integrity and application performance.
NoSQL databases support flexible data storage, but the consistency they guarantee varies with the use case.
A consistency model defines when an update becomes visible to reads across a distributed system.
Strong consistency: every read returns the most recent write (typically via quorum or all-replica coordination); it protects data integrity but increases latency.
Eventual consistency: replicas converge over time, so a read may briefly return stale data; it favours availability and performance.
Tunable consistency: systems such as Cassandra let each operation pick a level (ONE, QUORUM, ALL), trading integrity against performance per request.
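A toy illustration of the difference (the three in-memory dictionaries stand in for replicas; real systems coordinate with quorums and anti-entropy repair):
    replicas = [{"x": 0}, {"x": 0}, {"x": 0}]

    def write_eventual(key, value):
        # Acknowledge after one replica: fast, but other replicas lag.
        replicas[0][key] = value

    def propagate(key):
        # Background synchronization eventually updates the laggards.
        for r in replicas[1:]:
            r[key] = replicas[0][key]

    def write_strong(key, value):
        # Acknowledge only after every replica applies the write: slower,
        # but any subsequent read sees the latest value.
        for r in replicas:
            r[key] = value

    write_eventual("x", 1)
    print([r["x"] for r in replicas])   # [1, 0, 0] -> a read may be stale
    propagate("x")
    print([r["x"] for r in replicas])   # [1, 1, 1] -> replicas converged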
OR
13(b) Draw the architecture of HDFS, including the key components such as
NameNode and DataNode. Discuss how this architecture enables reliable and
scalable storage.
Architecture of HDFS:
1. Overview of HDFS:
o The Hadoop Distributed File System (HDFS) is a distributed file
system designed to store and manage large datasets across a cluster
of machines.
2. Key Components:
o NameNode:
The master node responsible for managing metadata, such as
file names, block locations, and directory structure.
Example: NameNode keeps track of where file blocks are
stored.
o DataNodes:
Worker nodes that store data blocks and handle read/write
requests from clients.
DataNodes regularly send heartbeat signals to the
NameNode to confirm availability.
o Blocks:
Files are split into fixed-size blocks (default size: 128 MB)
and distributed across DataNodes.
Each block is replicated across multiple nodes to ensure fault
tolerance.
3. HDFS Write Process:
o The client asks the NameNode where to place the file's blocks.
o The NameNode assigns DataNodes to hold each block.
o The client streams the blocks directly to those DataNodes; each block is replicated (default replication factor: 3) through a DataNode pipeline to ensure redundancy.
4. HDFS Read Process:
o Client requests the file from the NameNode.
o NameNode provides the locations of DataNodes containing the
required file blocks.
o Client fetches data directly from the DataNodes.
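A toy, in-memory model of these two paths (the class and method names are illustrative, not Hadoop APIs; the block size and replication factor mirror the defaults quoted above):
    BLOCK_SIZE = 128      # MB, the default block size
    REPLICATION = 3       # the default replication factor

    class NameNode:
        """Holds only metadata: which DataNodes store each block."""
        def __init__(self, datanodes):
            self.datanodes = datanodes
            self.block_map = {}          # filename -> [(block_id, nodes)]

        def allocate(self, filename, size_mb):
            n_blocks = -(-size_mb // BLOCK_SIZE)      # ceiling division
            placement = [(f"{filename}_blk{i}",
                          [self.datanodes[(i + r) % len(self.datanodes)]
                           for r in range(REPLICATION)])
                         for i in range(n_blocks)]
            self.block_map[filename] = placement
            return placement

    nn = NameNode(["dn1", "dn2", "dn3", "dn4"])

    # Write path: ask the NameNode for placements, then stream each
    # block directly to its assigned DataNodes.
    for block, nodes in nn.allocate("sales.log", 300):
        print(f"write {block} -> {nodes}")

    # Read path: look up block locations, then fetch from any replica.
    for block, nodes in nn.block_map["sales.log"]:
        print(f"read {block} from {nodes[0]}")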
14(a) Illustrate the shuffle and sort process in detail, explaining how it facilitates the
grouping of intermediate data for the Reduce phase. Use diagrams to illustrate
the process.
1. Introduction to Shuffle and Sort
o The Shuffle step transfers intermediate key-value pairs from the
Map phase to the Reduce phase.
o The Sort step organizes these key-value pairs by key to prepare
them for the Reducer.
o Essential for aggregating and processing grouped data efficiently.
2. Detailed Process of Shuffle and Sort
o Map Output: The Mapper outputs key-value pairs.
o Partitioning: Determines which Reducer processes which keys
based on a partitioning function.
o Shuffling: Intermediate data is sent to corresponding Reducers.
o Sorting:
Keys are sorted within each partition.
This ensures that data is grouped by key for the Reducer.
o Combiner (Optional): Reduces the size of intermediate data by
performing local aggregation.
3. Diagram Explanation
o A diagram should include:
Output from Mapper.
Partitioning function directing data to Reducers.
Sorted key-value pairs grouped by key in the Reducer input.
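A single-process Python sketch of the whole pipeline for word count (the input lines and the two-reducer setup are assumptions; a real cluster moves each partition over the network):
    from itertools import groupby
    from collections import defaultdict

    NUM_REDUCERS = 2
    lines = ["big data analytics", "big data", "data tools"]

    def partition(key):
        # Stable hash so equal keys always reach the same reducer.
        return sum(key.encode()) % NUM_REDUCERS

    # Map phase: emit intermediate (key, value) pairs.
    map_output = [(word, 1) for line in lines for word in line.split()]

    # Partitioning + shuffle: route each pair to its reducer's bucket.
    buckets = defaultdict(list)
    for key, value in map_output:
        buckets[partition(key)].append((key, value))

    # Sort: order each bucket by key so equal keys become adjacent,
    # then reduce: aggregate the grouped values.
    for reducer_id in sorted(buckets):
        pairs = sorted(buckets[reducer_id])
        for key, group in groupby(pairs, key=lambda kv: kv[0]):
            print(f"reducer {reducer_id}: ({key}, {sum(v for _, v in group)})")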
OR
14(b) Determine the failure handling mechanisms in Classic MapReduce and how
they compare to those in YARN. Explain how YARN improves fault tolerance
and job reliability.
Failure Handling in Classic MapReduce (4 marks)
Task Failures: Failed tasks are retried, up to a configurable attempt limit.
Node Failures:
o The JobTracker reassigns tasks from failed nodes to healthy nodes.
o A heartbeat mechanism detects node failures.
JobTracker Failure: The entire job fails, because the JobTracker is a single point of failure.
Failure Handling and Improvements in YARN:
Task failures are retried by the per-application ApplicationMaster, much as in classic MapReduce.
An ApplicationMaster failure affects only its own job, and the ResourceManager can restart it.
NodeManager failures are detected through heartbeats, and the affected containers are rescheduled on healthy nodes.
The ResourceManager supports high-availability (active/standby) failover, removing the single point of failure of the classic JobTracker and improving fault tolerance and job reliability.
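A small sketch of the retry-up-to-a-limit policy (the limit of four mirrors Hadoop's default maximum task attempts; the flaky task below is a stand-in for a real Map or Reduce task):
    MAX_ATTEMPTS = 4    # Hadoop retries a task up to 4 times by default

    def run_with_retries(task, max_attempts=MAX_ATTEMPTS):
        for attempt in range(1, max_attempts + 1):
            try:
                return task()
            except Exception as err:
                print(f"attempt {attempt} failed: {err}")
        raise RuntimeError("task exceeded max attempts; job marked failed")

    outcomes = iter([RuntimeError("node lost"), RuntimeError("timeout"), "done"])

    def flaky_task():
        result = next(outcomes)
        if isinstance(result, Exception):
            raise result
        return result

    print(run_with_retries(flaky_task))   # succeeds on the third attempt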
OR
15(b) Distinguish the syntax and semantics of Pig Latin. Provide examples of common operations and how they are expressed in Pig Latin scripts.
Introduction to Pig Latin
Pig Latin is a high-level scripting language for analysing large data sets in
Hadoop.
Focuses on procedural data processing compared to SQL’s declarative
approach.
Loading Data:
raw_data = LOAD 'input' AS (f1:int);
Filtering Data:
filtered_data = FILTER raw_data BY $0 > 100;
Grouping Data:
aggregated = GROUP filtered_data BY $0;
Storing Results:
STORE aggregated INTO 'output';
PART – C (1 x 15 = 15 Marks)
16(a) Elaborate various applications of big data across sectors such as marketing,
finance, and supply chain management. Provide examples of specific tools or
methods used in each application.
Big Data in Marketing (5 Marks)
Applications:
o Customer segmentation and targeting based on purchase history and
browsing behaviours.
o Personalized marketing campaigns using predictive analytics.
o Sentiment analysis to gauge public opinion about a brand or
product.
Examples of Tools/Methods:
o Google Analytics for tracking and analysing web traffic.
o Apache Spark for real-time data processing in marketing campaigns.
o Natural Language Processing (NLP) for analysing customer reviews
and social media sentiments.
Big Data in Finance (5 Marks)
Applications:
o Fraud detection through anomaly detection techniques in transactional data.
o Risk management and credit scoring by analysing customer credit behaviours.
o High-frequency trading using predictive models.
Examples of Tools/Methods:
o SAS for financial data analytics.
o Hadoop for processing large volumes of transaction data.
o Machine Learning algorithms for credit risk modelling and fraud detection (a sketch follows this list).
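A minimal fraud-detection sketch using scikit-learn's IsolationForest (the transaction amounts and the 5% contamination rate are illustrative assumptions):
    import numpy as np
    from sklearn.ensemble import IsolationForest

    rng = np.random.default_rng(0)
    normal = rng.normal(50, 15, size=(200, 1))   # typical transaction amounts
    fraud = np.array([[900.0], [1200.0]])        # two extreme transactions
    amounts = np.vstack([normal, fraud])

    model = IsolationForest(contamination=0.05, random_state=0).fit(amounts)
    labels = model.predict(amounts)              # -1 = anomaly, 1 = normal
    print(amounts[labels == -1].ravel())         # flagged values include the extremes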
Big Data in Supply Chain Management (5 Marks)
Applications:
o Inventory optimization using real-time data from IoT devices.
o Demand forecasting to ensure product availability during peak seasons.
o Route optimization for logistics and delivery using GPS and historical data.
Examples of Tools/Methods:
o Tableau for visualizing supply chain metrics.
o Apache Kafka for streaming real-time data from IoT sensors.
o Predictive analytics for demand and supply planning.
OR
16(b) Relate the concepts of data replication and fault tolerance in HDFS. Explain
how these concepts ensure data integrity and availability in a distributed
environment.
Concept of Data Replication (5 Marks)
Definition: In HDFS, data replication means duplicating blocks of data
across multiple nodes.
Purpose: Ensures data availability and fault tolerance in case of node
failures.
Replication Factor: Default is three copies.
o Example: A 128 MB block is replicated across three different nodes.
Advantages:
o High availability: If one node fails, replicas on other nodes can be
accessed.
o Scalability: Distributed copies allow parallel processing.
Fault Tolerance in HDFS (5 Marks)
Definition: The ability of a system to continue functioning in the event of a
hardware or software failure.
Mechanisms:
o Heartbeat mechanism: NameNode checks health of DataNodes.
o Automatic re-replication: Missing blocks are automatically
replicated to healthy nodes.
Example: If a DataNode storing a block fails, HDFS retrieves the block
from another node with a replica.
Ensuring Data Integrity and Availability (5 Marks)
Data Integrity:
o Each block is stored with a checksum.
o During reads, checksums are verified to ensure no corruption.
Data Availability:
o Replication ensures access even during node failures.
o NameNode maintains metadata to locate replicas.
Distributed Environment:
o Allows data storage and processing across multiple nodes, ensuring
reliability and scalability.
o Example: Large-scale systems like Netflix use HDFS to ensure
uninterrupted service despite failures.
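A small Python sketch of the checksum verification described under Data Integrity (CRC32 stands in for HDFS's per-block checksums; the block contents are illustrative):
    import zlib

    def store_block(data: bytes):
        # The checksum is written alongside the block at write time.
        return data, zlib.crc32(data)

    def read_block(data: bytes, checksum: int) -> bytes:
        # Verified on every read; a mismatch triggers a read from a replica.
        if zlib.crc32(data) != checksum:
            raise IOError("corrupt block: fetch a replica from another DataNode")
        return data

    block, crc = store_block(b"big data block")
    print(read_block(block, crc))               # passes verification

    try:
        read_block(b"bIg data block", crc)      # simulated bit flip
    except IOError as err:
        print(err)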