You Are Asked To Write A MapReduce Program With Python

The document describes writing a MapReduce program in Python to perform k-medoids clustering on trip data, grouping trips by pickup location. The program implements the Partitioning Around Medoids (PAM) algorithm with Hadoop streaming, using mappers to calculate distances between points and medoids and reducers to determine cluster assignments and updated medoids. A shell script is also provided to run the program, taking the number of clusters k and the number of iterations v as arguments.


Source: chegg.com/homework-help/questions-and-answers/asked-write-mapreduce-program-python-cluster-trips-tripstxt-based-pickup-locations-code-im-q118482218

Question


You are asked to write a MapReduce program with Python to cluster trips in Trips.txt based
on pickup locations. Your code should implement the k-medoids clustering algorithm known as
Partitioning Around Medoids (PAM), which is described below:

1. Initialize: randomly select k of the n data points as the medoids.
2. Assignment step: associate each data point with the closest medoid.
3. Update step: for each medoid m and each data point o associated with m, swap m and o
and compute the total cost of the configuration (that is, the average dissimilarity of o to
all the data points associated with m). Select the point o with the lowest configuration
cost as the new medoid.
4. Repeat steps 2 and 3 until there is no change in the assignments, or until a given
number v of iterations has been performed.

The code must work with 3 reducers, for different settings of k, and for different settings
of v. You should also write a shell script named task2-run.sh. Running the shell script
performs the task, with the shell script and code files in the same folder (no subfolders).
Note that k and v must be passed to task2-run.sh as arguments when it is executed.
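
Before distributing the algorithm, it can help to see PAM on a single machine. The following
is a minimal sketch, not part of the required submission, assuming 2-D points as tuples and
Euclidean dissimilarity (the exact format of Trips.txt is not given in the question):

import math
import random

def pam(points, k, v):
    """Minimal single-machine PAM sketch over a list of (x, y) tuples.

    Requires Python 3.8+ for math.dist.
    """
    medoids = random.sample(points, k)
    for _ in range(v):
        # Assignment step: attach each point to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for p in points:
            nearest = min(medoids, key=lambda m: math.dist(p, m))
            clusters[nearest].append(p)
        # Update step: within each cluster, the member with the lowest
        # total dissimilarity to the others becomes the new medoid.
        new_medoids = [
            min(members, key=lambda o: sum(math.dist(o, q) for q in members))
            for members in clusters.values() if members
        ]
        if set(new_medoids) == set(medoids):
            break  # converged: assignments can no longer change
        medoids = new_medoids
    return medoids

The MapReduce version below distributes the same two steps: the assignment step runs in the
mappers and the update step runs in the reducers, once per iteration.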

Expert Answer

Step 1/2
The MapReduce program in Python clusters the trips in Trips.txt by pickup location,
implementing the k-medoids clustering algorithm known as Partitioning Around Medoids (PAM).
With Hadoop streaming, the mapper and reducer must be separate executable scripts:
mapper.py performs the assignment step and reducer.py performs the update step.
mapper.py:

#!/usr/bin/env python3
# mapper.py -- PAM assignment step.
# Reads trips as "trip_id,pickup_x,pickup_y" lines from stdin and emits
# "cluster_index <TAB> trip_id,x,y". The current medoids come from
# medoids.txt, a local file shipped to every mapper by the driver
# script via -files.
import math
import sys

def load_medoids(path='medoids.txt'):
    medoids = []
    with open(path) as f:
        for line in f:
            x, y = line.strip().split(',')
            medoids.append((float(x), float(y)))
    return medoids

def main():
    medoids = load_medoids()
    for line in sys.stdin:
        fields = line.strip().split(',')
        if len(fields) < 3:
            continue  # skip malformed lines
        trip_id = fields[0]
        px, py = float(fields[1]), float(fields[2])
        # Distance from this pickup point to every medoid; keep the closest.
        dists = [math.hypot(px - mx, py - my) for mx, my in medoids]
        nearest = dists.index(min(dists))
        sys.stdout.write('%d\t%s,%f,%f\n' % (nearest, trip_id, px, py))

if __name__ == '__main__':
    main()

reducer.py:

#!/usr/bin/env python3
# reducer.py -- PAM update step.
# Hadoop streaming groups and sorts mapper output by key, so each reducer
# sees all points of a cluster consecutively. For every cluster we emit
# one assignment line per trip, plus one tagged "medoid" line holding the
# member with the lowest total dissimilarity to the rest of the cluster.
import math
import sys

def emit_cluster(cluster_id, points):
    best, best_cost = None, float('inf')
    for _, ox, oy in points:
        # Total dissimilarity of candidate o to all points in the cluster.
        cost = sum(math.hypot(ox - px, oy - py) for _, px, py in points)
        if cost < best_cost:
            best, best_cost = (ox, oy), cost
    for trip_id, _, _ in points:
        sys.stdout.write('%s,%s\n' % (trip_id, cluster_id))
    sys.stdout.write('medoid\t%f,%f\n' % best)

def main():
    current_key, points = None, []
    for line in sys.stdin:
        key, value = line.strip().split('\t')
        trip_id, x, y = value.split(',')
        if current_key is not None and key != current_key:
            emit_cluster(current_key, points)
            points = []
        current_key = key
        points.append((trip_id, float(x), float(y)))
    if current_key is not None:
        emit_cluster(current_key, points)

if __name__ == '__main__':
    main()
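
If the pickup coordinates in Trips.txt are longitude/latitude pairs (an assumption; the file
format is not specified in the question), plain Euclidean distance is only a rough
approximation, and a great-circle distance such as the haversine formula could be swapped
into both scripts in place of math.hypot:

import math

def haversine(lon1, lat1, lon2, lat2):
    # Great-circle distance in kilometres between two (lon, lat) points.
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * math.asin(math.sqrt(a))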

Step 2/2
The shell script task2-run.sh drives the iterations: it picks k random initial medoids, then runs the Hadoop streaming job up to v times, feeding each job's medoids back into the next one:

#!/bin/bash
# task2-run.sh -- run PAM clustering with k medoids and v iterations.
k=$1
v=$2

# Initialize: pick k random trips' pickup coordinates as the starting
# medoids. Assumes Trips.txt is already on HDFS in the form
# trip_id,pickup_x,pickup_y.
hdfs dfs -cat Trips.txt | shuf -n "$k" | cut -d',' -f2,3 | sort > medoids.txt

for ((i = 1; i <= v; i++)); do
  hdfs dfs -rm -r -f output
  # Streaming jar location shown for Hadoop 2/3; adjust if your
  # distribution puts it elsewhere.
  hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -files mapper.py,reducer.py,medoids.txt \
    -mapper mapper.py \
    -reducer reducer.py \
    -numReduceTasks 3 \
    -input Trips.txt \
    -output output

  # Collect the updated medoids; stop early if the clustering is stable.
  hdfs dfs -cat output/part-* | grep '^medoid' | cut -f2 | sort > new_medoids.txt
  if cmp -s medoids.txt new_medoids.txt; then
    break
  fi
  mv new_medoids.txt medoids.txt
done

Explanation:

To run the MapReduce program, you need Hadoop installed on your machine. Upload Trips.txt
to HDFS (for example, hdfs dfs -put Trips.txt) and make the scripts executable
(chmod +x mapper.py reducer.py task2-run.sh). Then run the shell script, passing the k and
v values as arguments. For example, to run the program with k=3 and v=10, you would run the
following command:

./task2-run.sh 3 10

Final answer

The MapReduce program clusters the trips in Trips.txt based on pickup locations, using the
k-medoids (PAM) algorithm. The final medoids are left in medoids.txt, and the last job's
output directory on HDFS contains one assignment line per trip, pairing the trip ID with
the index of the cluster it belongs to, plus the tagged medoid lines that the driver
script consumes between iterations.
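
For instance, with k=3 the assignment lines in the output would look like the following
(the trip IDs shown here are hypothetical, purely to illustrate the format):

trip_0017,0
trip_0042,2
trip_0108,1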