Big Data Analytics [1010206714]
PRACTICAL – 1
AIM : Implement the following using Map-Reduce: a. Matrix Multiplication, b. Sorting, c. Indexing.
Code:
A. Matrix Multiplication
• Mapper Function:
def mapper(matrix_entry):
    matrix, i, j, value = matrix_entry
    if matrix == 'A':
        for k in range(1, N + 1):
            yield (i, k), ('A', j, value)
    else:
        for k in range(1, N + 1):
            yield (k, j), ('B', i, value)
• Reducer Function:
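A minimal reducer sketch (assumed): it pairs the A and B values that share the same inner index and sums their products for the output cell given by the key.
def reducer(key, values):
    # Separate A and B contributions and join them on the shared inner index
    a_values = {j: v for m, j, v in values if m == 'A'}
    b_values = {j: v for m, j, v in values if m == 'B'}
    total = sum(a_values[j] * b_values[j] for j in a_values if j in b_values)
    return (key, total)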
• Example Usage:
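Assumed example input: two 2x2 matrices stored as (matrix, row, col, value) tuples, with N the matrix dimension.
from collections import defaultdict  # needed by the grouping step below

N = 2  # dimension of the square matrices (assumed example)
matrix_entries = [
    ('A', 1, 1, 1), ('A', 1, 2, 2), ('A', 2, 1, 3), ('A', 2, 2, 4),
    ('B', 1, 1, 5), ('B', 1, 2, 6), ('B', 2, 1, 7), ('B', 2, 2, 8),
]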
mapped_entries = []
for entry in matrix_entries:
    mapped_entries.extend(mapper(entry))

# Group by key
grouped_entries = defaultdict(list)
for key, value in mapped_entries:
    grouped_entries[key].append(value)

# Reduce phase
result = []
for key, values in grouped_entries.items():
    result.append(reducer(key, values))
Output:
B. Sorting
def mapper(value):
    yield (value, None)
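A minimal reducer sketch (assumed): for sorting, the reducer simply emits each key.
def reducer(key, _):
    yield key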
data = [3, 1, 4, 1, 5, 9, 2, 6, 5]
mapped_data = []
for value in data:
    mapped_data.extend(mapper(value))

# Sort by key
sorted_data = sorted(mapped_data, key=lambda x: x[0])

# Reduce phase
sorted_result = []
for key, _ in sorted_data:
    sorted_result.extend(reducer(key, None))
print(sorted_result)
Output:
C. Indexing
def mapper(document_id, document):
    for word in document.split():
        yield (word, document_id)
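A minimal reducer sketch (assumed): it maps each word to the list of documents that contain it.
def reducer(word, document_ids):
    return {word: sorted(set(document_ids))}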
from collections import defaultdict

documents = [
    ("doc1", "hello world"),
    ("doc2", "hello mapreduce world")
]

mapped_data = []
for doc_id, text in documents:
    mapped_data.extend(mapper(doc_id, text))

# Group by key
grouped_data = defaultdict(list)
for key, value in mapped_data:
    grouped_data[key].append(value)

# Reduce phase
index = {}
for word, document_ids in grouped_data.items():
    index.update(reducer(word, document_ids))
print(index)
Output:
PRACTICAL – 2
AIM : Distributed Cache & Map-Side Join, Reduce-Side Join; Building and Running a Spark Application; Word Count in Hadoop and Spark; Manipulating RDDs.
Code :
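Assumed example inputs (placeholders for the original data): a small lookup dictionary standing in for the distributed cache, plus two key-value datasets to join.
from collections import defaultdict

# Distributed-cache lookup table (small dataset shipped to every mapper)
cache_dict = {"k1": "cached1", "k2": "cached2"}
# Datasets to be joined on their keys
dataset1 = [("k1", "v1"), ("k2", "v2"), ("k3", "v3")]
dataset2 = [("k1", "w1"), ("k3", "w3")]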
# Mapper function (map-side join: look up each key in the cached dictionary)
def mapper(record):
    key, value = record
    if key in cache_dict:
        yield (key, (value, cache_dict[key]))

mapped_result = []
for record in dataset1:
    mapped_result.extend(mapper(record))
print(mapped_result)
print("-------------------------------------------")
# Mapper functions for the reduce-side join: tag each record with its source dataset
def mapper1(record):
    key, value = record
    yield key, ("dataset1", value)

def mapper2(record):
    key, value = record
    yield key, ("dataset2", value)

mapped_result1 = []
mapped_result2 = []
# Map step (assumed, mirroring the loop above): run each dataset through its mapper
for record in dataset1:
    mapped_result1.extend(mapper1(record))
for record in dataset2:
    mapped_result2.extend(mapper2(record))

# Group by key over the combined map output
grouped_result = defaultdict(list)
for key, value in mapped_result1 + mapped_result2:
    grouped_result[key].append(value)

# Reducer function: join the values coming from the two datasets for the same key
def reducer(key, values):
    dataset1_values = [v for source, v in values if source == "dataset1"]
    dataset2_values = [v for source, v in values if source == "dataset2"]
    return [(key, (v1, v2)) for v1 in dataset1_values for v2 in dataset2_values]

reduced_result = []
for key, values in grouped_result.items():
    reduced_result.extend(reducer(key, values))
print(reduced_result)
print("-------------------------------------------")
Output :
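Word count in Spark: a minimal PySpark sketch (assumed; the input path "input.txt" and the application name are placeholders).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Read the input file, split lines into words, and count each word
lines = spark.sparkContext.textFile("input.txt")
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

for word, count in counts.collect():
    print(word, count)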
spark.stop()
// WordCount Mapper (class declaration and the word/one fields are assumed from the
// standard Hadoop WordCount example)
public static class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] tokens = value.toString().split("\\s+");
        for (String token : tokens) {
            word.set(token);
            context.write(word, one);
        }
    }
}
from pyspark.sql import SparkSession

# Create a SparkSession (the application name is an assumed placeholder)
spark = SparkSession.builder.appName("RDDManipulation").getOrCreate()

# Create an RDD
data = [1, 2, 3, 4, 5]
rdd = spark.sparkContext.parallelize(data)
# Map Transformation
squared_rdd = rdd.map(lambda x: x * x)
# Filter Transformation
filtered_rdd = rdd.filter(lambda x: x % 2 == 0)
# Reduce Action
sum_of_elements = rdd.reduce(lambda a, b: a + b)
# Collect Action
collected_elements = rdd.collect()
spark.stop()
Output :
PRACTICAL – 3
AIM : Implementation of matrix algorithms using Spark SQL programming.
Code:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("MatrixMultiplication").getOrCreate()
df_a.createOrReplaceTempView("matrix_a")
df_b.createOrReplaceTempView("matrix_b")
result = spark.sql("""
SELECT a.row AS row, b.col AS col, SUM(a.value * b.value) AS value
FROM matrix_a a
JOIN matrix_b b
ON a.col = b.row
GROUP BY a.row, b.col
ORDER BY a.row, b.col
""")
result.show()
Output:
PRACTICAL – 4
AIM : Implementing the K-Means clustering algorithm using Map-Reduce.
Code:
import math
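The mapper and reducer below are a minimal sketch (assumed): the mapper assigns each point to its nearest centroid, and the reducer averages the points assigned to a centroid.
from collections import defaultdict  # used by the grouping step below

def mapper(point, centroids):
    # Emit (nearest centroid, point)
    distances = [math.dist(point, c) for c in centroids]
    nearest = distances.index(min(distances))
    return (centroids[nearest], point)

def reducer(centroid, points):
    # New centroid = coordinate-wise mean of the assigned points
    new_centroid = [sum(coords) / len(points) for coords in zip(*points)]
    return (centroid, new_centroid)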
def kmeans(data, centroids, iterations=10):  # iteration count assumed
    for _ in range(iterations):
        # Map step: assign every point to its nearest centroid
        mapped = [mapper(point, centroids) for point in data]

        # Group by centroid
        grouped = defaultdict(list)
        for centroid, point in mapped:
            grouped[centroid].append(point)

        # Reduce step
        new_centroids = []
        for centroid, points in grouped.items():
            _, new_centroid = reducer(centroid, points)
            new_centroids.append(tuple(new_centroid))
        centroids = new_centroids
    return centroids

# Example usage
data = [
    (1.0, 2.0), (1.5, 1.8), (5.0, 8.0),
    (8.0, 8.0), (1.0, 0.6), (9.0, 11.0)
]
initial_centroids = [(1.0, 1.0), (5.0, 5.0)]
print(kmeans(data, initial_centroids))
Output:
PRACTICAL – 5
AIM : Implementing any one frequent itemset algorithm using Map-Reduce.
Code:
from collections import defaultdict
from itertools import combinations

def mapper(transaction):
    items = transaction.split()
    item_pairs = []
    # Emit every 2-item pair in the transaction with a count of 1
    # (pair generation assumed; the body was not shown in the original listing)
    for pair in combinations(sorted(items), 2):
        item_pairs.append((pair, 1))
    return item_pairs

def reducer(item_pairs):
    pair_counts = defaultdict(int)
    # Sum the counts emitted for each pair (assumed)
    for pair, count in item_pairs:
        pair_counts[pair] += count
    return pair_counts
transactions = [
    "bread milk",
    "bread butter",
    "milk butter",
    "bread milk butter",
    "bread",
    "milk"
]

# Map step
mapped_data = []
for transaction in transactions:
    mapped_data.extend(mapper(transaction))
# Reduce step
reduced_data = reducer(mapped_data)
min_support = 2
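A filtering step (assumed) that keeps only the item pairs meeting the minimum support:
frequent_pairs = {pair: count for pair, count in reduced_data.items()
                  if count >= min_support}
print(frequent_pairs)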
Output:
PRACTICAL – 6
AIM : Create a Data Pipeline Based on Messaging Using PySpark and Hive for COVID-19 Analysis.
Output:
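A minimal sketch of the steps that produce covid_data_cleaned (assumed: the CSV path "covid_data.csv", the application name, and the cleaning rule are placeholders; the covid_analysis database name matches the save step below).
from pyspark.sql import SparkSession

# Hive-enabled SparkSession
spark = SparkSession.builder \
    .appName("CovidPipeline") \
    .enableHiveSupport() \
    .getOrCreate()

# Read the raw COVID-19 data and drop rows with missing values
covid_data = spark.read.csv("covid_data.csv", header=True, inferSchema=True)
covid_data_cleaned = covid_data.dropna()

# Hive database for the analysis tables
spark.sql("CREATE DATABASE IF NOT EXISTS covid_analysis")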
covid_data_cleaned.write.mode("overwrite").saveAsTable("covid_analysis.covid_data")
Output:
Output:
Code:
import smtplib
from email.mime.text import MIMEText

def send_email(subject, body, to):
    # Build the message (function wrapper and msg construction assumed from the usage below)
    msg = MIMEText(body)
    msg["Subject"] = subject
    msg["From"] = "[email protected]"
    msg["To"] = to
    # Send email
    with smtplib.SMTP("smtp.example.com") as server:
        server.login("[email protected]", "password")
        server.sendmail("[email protected]", to, msg.as_string())

# Example usage
send_email("COVID-19 Update", "Total confirmed cases have increased.",
           "[email protected]")
Output:
An email will be sent to the recipient with the subject "COVID-19 Update" and
body "Total confirmed cases have increased."
PRACTICAL – 7
AIM : Case Study: Stage 1: Selection of case study topics and formation of
small working groups of 2-3 students per group. Students engage with the
cases, read through background material provided in the session and work
through an initial set of questions to deepen the understanding of the case.
Sample applications and data will be provided to help students familiarize
themselves with the cases and available (big) data.
Stage 2: The groups are given a specific task relevant to the case in question
and are expected to develop a corresponding big data concept using the
knowledge gained in the course and the parameters set by the case study
scenario. A set of questions that help guide through the scenarios will be
provided.
Stage 3: Each group prepares a short 2-5 page report on their results and
a 10-minute oral presentation of their big data concept.
Amazon is a multinational technology company that was founded in 1994 by Jeff Bezos in
Seattle, Washington. Initially conceived as an online bookstore, Amazon has since expanded
into a variety of other e-commerce categories, including electronics, apparel, groceries, and
digital services like cloud computing (AWS), streaming (Amazon Prime Video), and artificial
intelligence.
Amazon's meteoric rise can be attributed to its pioneering approach to online shopping, focus
on customer-centric services, vast product offerings, innovation in logistics, and continuous
diversification. Today, Amazon is one of the world’s most valuable companies and a dominant
force in both the e-commerce and technology sectors.
Amazon operates across a variety of business segments, with its revenue and profits driven by
the following key areas:
a. E-commerce Retail
Amazon Marketplace: This is Amazon’s core business, where it allows third-party sellers to
list products alongside its own inventory. This segment includes categories like books,
electronics, clothing, toys, and more.
Amazon Prime: A subscription service that offers free shipping, access to streaming media,
and other benefits. It is a significant driver of customer loyalty and recurring revenue.
Amazon Fresh & Whole Foods: With acquisitions like Whole Foods and the introduction of
Amazon Fresh, Amazon is now a major player in the grocery retail industry.
b. Cloud Computing
Cloud Computing: AWS is the largest cloud computing provider globally, offering services
such as computing power, storage, and databases to businesses. AWS is a critical part of
Amazon’s profitability, contributing a significant portion of its total operating income.
c. Digital Streaming
Amazon Prime Video: Competing with services like Netflix and Disney+, Prime Video offers
a range of original content and licensed films and TV shows. This has helped Amazon penetrate
the entertainment and media sector.
Alexa & Echo Devices: Amazon’s entry into AI and smart home technology with its Alexa
voice assistant and Echo devices has been a major success. Alexa enables users to control smart
devices, stream music, and access services.
Kindle: Amazon revolutionized digital reading with the Kindle e-reader, which has become
synonymous with e-books.
3. Business Model
a. Customer-Centricity
Amazon has built its business around the philosophy of being "Earth’s most customer-centric
company." It consistently prioritizes customer experience through fast delivery, easy returns,
competitive pricing, and personalized recommendations.
b. Diversification
Amazon operates a vast network of fulfillment centers, warehouses, and delivery systems that
allow it to achieve economies of scale. This gives Amazon a competitive advantage in both
product availability and delivery speed, with options like same-day or two-day delivery for
Prime members.
d. Data-Driven Decisions
Amazon uses vast amounts of data to inform its business decisions. Customer browsing
behavior, purchase patterns, and search trends help Amazon optimize its product
recommendations, pricing strategy, and inventory management. This data also powers Alexa
and other AI-driven products.
e. Subscription Revenue
Through Amazon Prime, the company has built a substantial recurring revenue stream. With
benefits extending beyond shipping (e.g., streaming, exclusive deals), Prime has become a
powerful customer retention tool.
a. Competition
Amazon faces competition from both traditional brick-and-mortar retailers (like Walmart and
Target) and online-only rivals (like eBay, Alibaba, and other specialized e-commerce
platforms). Additionally, AWS competes with Microsoft Azure, Google Cloud, and other cloud
providers.
As Amazon’s dominance continues to grow, it has faced increased scrutiny from regulators,
particularly concerning issues like data privacy, market dominance, tax practices, and labor
rights. The company has been subject to antitrust investigations in several countries.
While Amazon’s retail business is a significant revenue driver, it often operates on thin profit
margins. The company frequently reinvests its profits into expanding its infrastructure,
logistics, and new services, which can limit overall profitability in the short term.
Amazon has faced criticism over its treatment of workers, including reports of high turnover
rates, safety concerns in warehouses, and issues regarding wages and benefits for fulfillment
center employees. The company has also faced accusations of undercutting small businesses
and squeezing suppliers with low-cost demands.
Amazon has ventured into physical retail with Amazon Go stores, which use sensors and AI to
allow customers to shop without checkout lines. This aligns with the company’s focus on
streamlining operations through automation and technology.
Over the years, Amazon has acquired several companies to expand its reach and capabilities.
Key acquisitions include Whole Foods (grocery retail), Ring (smart security), Zoox
(autonomous driving), and PillPack (online pharmacy). These acquisitions reflect Amazon’s
strategy of entering and transforming different industries.
Amazon has made significant strides toward sustainability, pledging to reach net-zero carbon
by 2040. The company has invested in renewable energy, electric delivery vehicles, and
sustainable packaging to reduce its environmental footprint.
6. Financial Performance
Revenue Growth: Amazon has shown impressive revenue growth over the years, driven by
its diversified business model. For instance, its annual revenue exceeded $500 billion in 2023.
Profit Margins: While Amazon's retail business operates on low margins, AWS delivers high
profit margins, making it a key driver of the company's overall profitability.
Stock Performance: Amazon's stock has performed exceptionally well since its IPO in 1997,
with the company now being one of the most valuable in the world.
7. Future Outlook
Amazon Prime: The continued growth of Amazon Prime will likely drive recurring revenue,
customer loyalty, and data insights.
8. Conclusion
Amazon has evolved from a small online bookstore to one of the most influential companies
in the world. Its commitment to customer-centricity, continuous innovation, and strategic
diversification has allowed it to dominate various industries. However, as it expands into new
territories, Amazon will need to address challenges related to competition, regulation, and labor
practices. The company's ability to adapt and innovate in these areas will be key to its future
success.