
Big Data Analytics [1010206714] [2107020701005]

PRACTICAL – 1
AIM : Implement the following using Map-Reduce: a. Matrix multiplication, b. Sorting, c. Indexing.

Code:

A. Matrix Multiplication

• Mapper Function:
def mapper(matrix_entry):
    matrix, i, j, value = matrix_entry
    if matrix == 'A':
        for k in range(1, N + 1):
            yield (i, k), ('A', j, value)
    else:
        for k in range(1, N + 1):
            yield (k, j), ('B', i, value)

• Reducer Function:

from collections import defaultdict

def reducer(index, values):
    A = defaultdict(int)
    B = defaultdict(int)

    for matrix, idx, value in values:
        if matrix == 'A':
            A[idx] = value
        else:
            B[idx] = value

    product = sum(A[i] * B[i] for i in A if i in B)

    return (index, product)

• Example Usage:

N = 2  # Dimension of the matrices

matrix_entries = [
    ('A', 1, 1, 2), ('A', 1, 2, 3),
    ('B', 1, 1, 4), ('B', 2, 1, 5)
]


mapped_entries = []
for entry in matrix_entries:
    mapped_entries.extend(mapper(entry))

# Group by key
grouped_entries = defaultdict(list)
for key, value in mapped_entries:
    grouped_entries[key].append(value)

# Reduce phase
result = []
for key, values in grouped_entries.items():
    result.append(reducer(key, values))

# Display the result
for ((i, j), value) in result:
    print(f"Element ({i}, {j}) = {value}")

Output:

B. Sorting
def mapper(value):
    yield (value, None)

def reducer(key, values):
    yield key

data = [3, 1, 4, 1, 5, 9, 2, 6, 5]

mapped_data = []
for value in data:
    mapped_data.extend(mapper(value))

# Sort by key
sorted_data = sorted(mapped_data, key=lambda x: x[0])


# Reduce phase
sorted_result = []
for key, _ in sorted_data:
    sorted_result.extend(reducer(key, None))

print(sorted_result)
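In a real MapReduce job the framework's shuffle phase performs this sorting of keys; here it is simulated with sorted(). For the sample list the printed result is [1, 1, 2, 3, 4, 5, 5, 6, 9]; duplicates are preserved because every occurrence is emitted as its own key/value pair.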

Output:

C. Indexing
def mapper(document_id, document):
    for word in document.split():
        yield (word, document_id)

from collections import defaultdict

def reducer(word, document_ids):
    yield (word, list(set(document_ids)))

documents = [
    ("doc1", "hello world"),
    ("doc2", "hello mapreduce world")
]

mapped_data = []
for doc_id, text in documents:
    mapped_data.extend(mapper(doc_id, text))

# Group by key
grouped_data = defaultdict(list)
for key, value in mapped_data:
    grouped_data[key].append(value)

# Reduce phase
index = {}
for word, document_ids in grouped_data.items():
    index.update(reducer(word, document_ids))

print(index)
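For the two sample documents the resulting inverted index is {'hello': ['doc1', 'doc2'], 'world': ['doc1', 'doc2'], 'mapreduce': ['doc2']}; the order within each document list may vary because the reducer deduplicates with set().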

Output:


PRACTICAL – 2
AIM : Distributed Cache & Map-Side Join, Reduce-Side Join; building and running a Spark application; word count in Hadoop and Spark; manipulating RDDs.

Code :

• Map Side Join:


from collections import defaultdict

print("Map Side Join:")


# Assuming we have two datasets, dataset1 and dataset2, where dataset2 is small enough to fit into memory
dataset1 = [("A", 1), ("B", 2), ("C", 3)]
dataset2 = [("A", "X"), ("B", "Y")]

# Distributed Cache - dataset2 is cached
cache_dict = {key: value for key, value in dataset2}

# Mapper function
def mapper(record):
    key, value = record
    if key in cache_dict:
        yield (key, (value, cache_dict[key]))

mapped_result = []
for record in dataset1:
    mapped_result.extend(mapper(record))

print(mapped_result)
print("-------------------------------------------")

• Reduce Side Join


print("Reduce Side Join")

# Assuming dataset1 and dataset2 are large datasets
dataset1 = [("A", 1), ("B", 2), ("C", 3)]
dataset2 = [("A", "X"), ("B", "Y")]

# Mapper function
def mapper1(record):
    key, value = record
    yield key, ("dataset1", value)

def mapper2(record):
    key, value = record
    yield key, ("dataset2", value)

mapped_result1 = []
mapped_result2 = []

for record in dataset1:
    mapped_result1.extend(mapper1(record))
for record in dataset2:
    mapped_result2.extend(mapper2(record))

# Combine both mapped results
mapped_result = mapped_result1 + mapped_result2

# Group by key
grouped_result = defaultdict(list)
for key, value in mapped_result:
    grouped_result[key].append(value)

# Reducer function
def reducer(key, values):
    dataset1_values = [v for source, v in values if source == "dataset1"]
    dataset2_values = [v for source, v in values if source == "dataset2"]
    return [(key, (v1, v2)) for v1 in dataset1_values for v2 in dataset2_values]

reduced_result = []
for key, values in grouped_result.items():
    reduced_result.extend(reducer(key, values))

print(reduced_result)
print("-------------------------------------------")

Output :


• Word Count in Spark:


from pyspark.sql import SparkSession

# Initialize Spark Session
spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Read input file
input_file = "path/to/input.txt"
text_file = spark.read.text(input_file).rdd

# Word Count Logic
words = text_file.flatMap(lambda line: line.value.split())
word_counts = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

# Collect and print the result
output = word_counts.collect()
for word, count in output:
    print(f"{word}: {count}")

spark.stop()
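The script can be submitted with spark-submit (for example, spark-submit wordcount.py, assuming the code is saved under that name and input_file points to an existing text file), or run interactively in the PySpark shell.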

• Word Count in Hadoop


Word Count Mapper:
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] tokens = value.toString().split("\\s+");
        for (String token : tokens) {
            word.set(token);
            context.write(word, one);
        }
    }
}


Word Count Reducer:


import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
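Note that the mapper and reducer classes alone are not runnable: a driver class is also required to configure the Hadoop Job (mapper, reducer, output key/value types, input and output paths), and the compiled JAR is then submitted with the hadoop jar command.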

• Manipulating RDD in Spark


Basic RDD Operations:
from pyspark.sql import SparkSession

# Initialize Spark Session
spark = SparkSession.builder.appName("RDDExamples").getOrCreate()

# Create an RDD
data = [1, 2, 3, 4, 5]
rdd = spark.sparkContext.parallelize(data)

# Map Transformation
squared_rdd = rdd.map(lambda x: x * x)

# Filter Transformation
filtered_rdd = rdd.filter(lambda x: x % 2 == 0)

# Reduce Action
sum_of_elements = rdd.reduce(lambda a, b: a + b)

# Collect Action
collected_elements = rdd.collect()

print("Squared RDD:", squared_rdd.collect())


print("Filtered RDD:", filtered_rdd.collect())
print("Sum of elements:", sum_of_elements)


print("Collected elements:", collected_elements)

spark.stop()

Output :


PRACTICAL – 3
AIM : Implementation of Matrix algorithms in Spark Sql programming.

Code:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MatrixMultiplication").getOrCreate()

# Creating DataFrame for Matrix A
data_a = [(1, 1, 1), (1, 2, 2), (2, 1, 3), (2, 2, 4)]
df_a = spark.createDataFrame(data_a, ["row", "col", "value"])

# Creating DataFrame for Matrix B
data_b = [(1, 1, 5), (1, 2, 6), (2, 1, 7), (2, 2, 8)]
df_b = spark.createDataFrame(data_b, ["row", "col", "value"])

df_a.createOrReplaceTempView("matrix_a")
df_b.createOrReplaceTempView("matrix_b")

result = spark.sql("""
SELECT a.row AS row, b.col AS col, SUM(a.value * b.value) AS value
FROM matrix_a a
JOIN matrix_b b
ON a.col = b.row
GROUP BY a.row, b.col
ORDER BY a.row, b.col
""")

result.show()
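As a sanity check, the sample data encodes A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]], so result.show() should display the product [[19, 22], [43, 50]], i.e. the rows (1, 1, 19), (1, 2, 22), (2, 1, 43) and (2, 2, 50).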

Output:


PRACTICAL – 4
AIM : Implementing K-Means Clustering algorithm using Map-Reduce.

Code:
import math

def mapper(data_point, centroids):
    min_dist = float('inf')
    nearest_centroid = None
    for centroid in centroids:
        dist = math.sqrt(sum((data_point[i] - centroid[i]) ** 2 for i in range(len(data_point))))
        if dist < min_dist:
            min_dist = dist
            nearest_centroid = centroid
    yield nearest_centroid, data_point

from collections import defaultdict
import numpy as np

def reducer(centroid, data_points):
    data_points = np.array(data_points)
    new_centroid = data_points.mean(axis=0)
    return centroid, new_centroid

def k_means_map_reduce(data, initial_centroids, max_iterations=10):
    centroids = initial_centroids
    for _ in range(max_iterations):
        # Map step
        mapped = []
        for point in data:
            mapped.extend(mapper(point, centroids))

        # Group by centroid
        grouped = defaultdict(list)
        for centroid, point in mapped:
            grouped[centroid].append(point)

        # Reduce step
        new_centroids = []
        for centroid, points in grouped.items():
            _, new_centroid = reducer(centroid, points)
            new_centroids.append(tuple(new_centroid))

        # Check for convergence
        if set(new_centroids) == set(centroids):
            break
        centroids = new_centroids

    return centroids

# Example usage
data = [
    (1.0, 2.0), (1.5, 1.8), (5.0, 8.0),
    (8.0, 8.0), (1.0, 0.6), (9.0, 11.0)
]
initial_centroids = [(1.0, 1.0), (5.0, 5.0)]

final_centroids = k_means_map_reduce(data, initial_centroids)
print("Final centroids:", final_centroids)


Output:


PRACTICAL – 5
AIM : Implementing any one Frequent Itemset algorithm using Map-Reduce.

Code:
def mapper(transaction):
    items = transaction.split()
    item_pairs = []

    # Generate all possible pairs of items
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            item_pairs.append((frozenset([items[i], items[j]]), 1))

    return item_pairs

from collections import defaultdict

def reducer(item_pairs):
    pair_counts = defaultdict(int)

    for item_pair, count in item_pairs:
        pair_counts[item_pair] += count

    return pair_counts

transactions = [
    "bread milk",
    "bread butter",
    "milk butter",
    "bread milk butter",
    "bread",
    "milk"
]

# Map step
mapped_data = []
for transaction in transactions:
    mapped_data.extend(mapper(transaction))


# Reduce step
reduced_data = reducer(mapped_data)

# Print all itemset counts
for itemset, count in reduced_data.items():
    print(f"Itemset: {itemset}, Count: {count}")

min_support = 2

frequent_itemsets = {itemset: count for itemset, count in reduced_data.items() if count >= min_support}

for itemset, count in frequent_itemsets.items():
    print(f"Frequent Itemset: {itemset}, Count: {count}")

Output:


PRACTICAL – 6
AIM : Create A Data Pipeline Based On Messaging Using PySpark And Hive
- Covid-19 Analysis.

Step 1: Data Ingestion


First, gather the COVID-19 data from various sources like APIs, CSV files, or databases.
Code:
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder \
    .appName("COVID-19 Analysis") \
    .enableHiveSupport() \
    .getOrCreate()

# Read data from a CSV file
covid_data = spark.read.csv("path/to/covid_data.csv", header=True, inferSchema=True)

Step 2: Data Processing


Process the data to clean and transform it for analysis.
Code:
from pyspark.sql.functions import col

# Select relevant columns and clean data
covid_data_cleaned = covid_data.select(
    col("date"),
    col("state"),
    col("confirmed_cases"),
    col("deaths"),
    col("recovered")
).filter(col("confirmed_cases").isNotNull())

Output:


Step 3: Data Storage in Hive


Store the processed data in a Hive table for querying.
Code:
# Save data to a Hive table

covid_data_cleaned.write.mode("overwrite").saveAsTable("covid_analysis.covid_data")

# Verify data is stored in Hive
spark.sql("SELECT * FROM covid_analysis.covid_data").show()

Output:

Step 4: Data Analysis


Run queries on the Hive table to perform analysis.
Code:
# Run a query to get total confirmed cases per state
total_cases_per_state = spark.sql("""
SELECT state, SUM(confirmed_cases) as total_cases
FROM covid_analysis.covid_data
GROUP BY state
ORDER BY total_cases DESC
""")
total_cases_per_state.show()

Output:


Step 5: Messaging and Notification


Set up a messaging system to notify users about significant data insights.

Code:

import smtplib
from email.mime.text import MIMEText

def send_email(subject, body, to):
    msg = MIMEText(body)
    msg["Subject"] = subject
    msg["From"] = "sender@example.com"  # placeholder sender address
    msg["To"] = to

    # Send email
    with smtplib.SMTP("smtp.example.com") as server:
        server.login("sender@example.com", "password")
        server.sendmail("sender@example.com", to, msg.as_string())

# Example usage
send_email("COVID-19 Update", "Total confirmed cases have increased.", "recipient@example.com")

Output:

An email will be sent to the recipient with the subject "COVID-19 Update" and
body "Total confirmed cases have increased."


PRACTICAL – 7
AIM : Case Study: Stage 1: Selection of case study topics and formation of
small working groups of 2–3 students per group. Students engage with the
cases, read through background material provided in the session and work
through an initial set of questions to deepen the understanding of the case.
Sample applications and data will be provided to help students familiarize
themselves with the cases and available (big) data.

Stage 2: The groups are given a specific task relevant to the case in question
and are expected to develop a corresponding big data concept using the
knowledge gained in the course and the parameters set by the case study
scenario. A set of questions that help guide through the scenarios will be
provided.

Stage 3: Each group prepares a short 2 – 5 page report on their results and
a 10 min oral presentation of their big data concept.

Case Study on Amazon


1. Introduction

Amazon is a multinational technology company that was founded in 1994 by Jeff Bezos in
Seattle, Washington. Initially conceived as an online bookstore, Amazon has since expanded
into a variety of other e-commerce categories, including electronics, apparel, groceries, and
digital services like cloud computing (AWS), streaming (Amazon Prime Video), and artificial
intelligence.

Amazon's meteoric rise can be attributed to its pioneering approach to online shopping, focus
on customer-centric services, vast product offerings, innovation in logistics, and continuous
diversification. Today, Amazon is one of the world’s most valuable companies and a dominant
force in both the e-commerce and technology sectors.

2. Key Business Segments

Amazon operates across a variety of business segments, with its revenue and profits driven by
the following key areas:


a. E-commerce Retail

Amazon Marketplace: This is Amazon’s core business, where it allows third-party sellers to
list products alongside its own inventory. This segment includes categories like books,
electronics, clothing, toys, and more.

Amazon Prime: A subscription service that offers free shipping, access to streaming media,
and other benefits. It is a significant driver of customer loyalty and recurring revenue.

Amazon Fresh & Whole Foods: With acquisitions like Whole Foods and the introduction of
Amazon Fresh, Amazon is now a major player in the grocery retail industry.

b. Amazon Web Services (AWS)

Cloud Computing: AWS is the largest cloud computing provider globally, offering services
such as computing power, storage, and databases to businesses. AWS is a critical part of
Amazon’s profitability, contributing a significant portion of its total operating income.

c. Digital Streaming

Amazon Prime Video: Competing with services like Netflix and Disney+, Prime Video offers
a range of original content and licensed films and TV shows. This has helped Amazon penetrate
the entertainment and media sector.

d. Amazon Devices & AI

Alexa & Echo Devices: Amazon’s entry into AI and smart home technology with its Alexa
voice assistant and Echo devices has been a major success. Alexa enables users to control smart
devices, stream music, and access services.

Kindle: Amazon revolutionized digital reading with the Kindle e-reader, which has become
synonymous with e-books.

3. Business Model

Amazon’s business model is primarily based on the following strategies:

a. Customer-Centricity

Amazon has built its business around the philosophy of being "Earth’s most customer-centric
company." It consistently prioritizes customer experience through fast delivery, easy returns,
competitive pricing, and personalized recommendations.

b. Diversification

Amazon's continuous diversification into new industries—cloud computing, entertainment, grocery retail, AI, and logistics—has reduced its reliance on any single market and created a robust revenue model with multiple income streams.


c. Economies of Scale & Logistics

Amazon operates a vast network of fulfillment centers, warehouses, and delivery systems that
allow it to achieve economies of scale. This gives Amazon a competitive advantage in both
product availability and delivery speed, with options like same-day or two-day delivery for
Prime members.

d. Data-Driven Decisions

Amazon uses vast amounts of data to inform its business decisions. Customer browsing
behavior, purchase patterns, and search trends help Amazon optimize its product
recommendations, pricing strategy, and inventory management. This data also powers Alexa
and other AI-driven products.

e. Subscription Revenue

Through Amazon Prime, the company has built a substantial recurring revenue stream. With
benefits extending beyond shipping (e.g., streaming, exclusive deals), Prime has become a
powerful customer retention tool.

4. Challenges Faced by Amazon

a. Competition

Amazon faces competition from both traditional brick-and-mortar retailers (like Walmart and
Target) and online-only rivals (like eBay, Alibaba, and other specialized e-commerce
platforms). Additionally, AWS competes with Microsoft Azure, Google Cloud, and other cloud
providers.

b. Regulation and Antitrust Scrutiny

As Amazon’s dominance continues to grow, it has faced increased scrutiny from regulators,
particularly concerning issues like data privacy, market dominance, tax practices, and labor
rights. The company has been subject to antitrust investigations in several countries.

c. Profitability in Retail Business

While Amazon’s retail business is a significant revenue driver, it often operates on thin profit
margins. The company frequently reinvests its profits into expanding its infrastructure,
logistics, and new services, which can limit overall profitability in the short term.

d. Labor and Ethical Concerns

Amazon has faced criticism over its treatment of workers, including reports of high turnover
rates, safety concerns in warehouses, and issues regarding wages and benefits for fulfillment
center employees. The company has also faced accusations of undercutting small businesses
and squeezing suppliers with low-cost demands.


5. Strategic Initiatives and Innovation

a. Amazon Go and Automation

Amazon has ventured into physical retail with Amazon Go stores, which use sensors and AI to
allow customers to shop without checkout lines. This aligns with the company’s focus on
streamlining operations through automation and technology.

b. Acquisitions and Partnerships

Over the years, Amazon has acquired several companies to expand its reach and capabilities.
Key acquisitions include Whole Foods (grocery retail), Ring (smart security), Zoox
(autonomous driving), and PillPack (online pharmacy). These acquisitions reflect Amazon’s
strategy of entering and transforming different industries.

c. Sustainability and Green Initiatives

Amazon has made significant strides toward sustainability, pledging to reach net-zero carbon
by 2040. The company has invested in renewable energy, electric delivery vehicles, and
sustainable packaging to reduce its environmental footprint.

6. Financial Performance

Revenue Growth: Amazon has shown impressive revenue growth over the years, driven by
its diversified business model. For instance, its annual revenue exceeded $500 billion in 2023.

Profit Margins: While Amazon's retail business operates on low margins, AWS delivers high profit margins, making it a key driver of the company's overall profitability.

Stock Performance: Amazon's stock has performed exceptionally well since its IPO in 1997,
with the company now being one of the most valuable in the world.

7. Future Outlook

Amazon's future growth is likely to continue through:

Global Expansion: Amazon is expanding its e-commerce footprint in international markets like India and Europe.

Technological Innovations: With advancements in AI, machine learning, and robotics, Amazon is well-positioned to continue leading in logistics, customer experience, and product offerings.

Amazon Prime: The continued growth of Amazon Prime will likely drive recurring revenue, customer loyalty, and data insights.


Sustainability: Given the increasing focus on climate change and environmental responsibility, Amazon's green initiatives will play a crucial role in its long-term brand positioning and regulatory standing.

8. Conclusion

Amazon has evolved from a small online bookstore to one of the most influential companies
in the world. Its commitment to customer-centricity, continuous innovation, and strategic
diversification has allowed it to dominate various industries. However, as it expands into new
territories, Amazon will need to address challenges related to competition, regulation, and labor
practices. The company's ability to adapt and innovate in these areas will be key to its future
success.

