Problem Description: Sensitivity: Internal & Restricted
Problem Description
You have two files, movies_en.json and artists_en.json, containing a small movie database in
JSON format. You need to load them into Spark DataFrames and perform analysis.
Sample record from movies_en.json (the file holds one record per line; line breaks have been added
here for readability):
{
  "id": "movie:14",
  "title": "Se7en",
  "year": 1995,
  "genre": "Crime",
  "summary": " Two detectives, a rookie and a veteran, hunt a serial killer who uses the seven deadly sins as his modus operandi.",
  "country": "USA",
  "director": {
    "id": "artist:31",
    "last_name": "Fincher",
    "first_name": "David",
    "year_of_birth": "1962"
  },
  "actors": [
    {"id": "artist:18", "role": "Doe"},
    {"id": "artist:22", "role": "Somerset"},
    {"id": "artist:32", "role": "Mills"}
  ]
}
movies_en.json contains the full names and years of birth of movie directors, but only the identifiers of
actors. The full names and years of birth of all artists, as well as their identifier, are listed in
artists_en.json.
Assignment 1:
Connect to the Hadoop cluster and copy the two files movies_en.json and artists_en.json to a folder in
HDFS.
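One possible way to script the copy step is sketched below in Python, wrapping the standard `hdfs dfs -mkdir -p` and `hdfs dfs -put -f` commands. It assumes you are already logged in to a cluster node where the `hdfs` client is on the PATH. The target folder `/user/hadoop/moviedb` and the function names are hypothetical placeholders, not values given in the assignment.

```python
import subprocess


def hdfs_upload_commands(local_files, hdfs_dir):
    """Build the `hdfs dfs` commands that create hdfs_dir and copy the files into it."""
    # -p: create parent folders as needed and do not fail if the folder already exists
    commands = [["hdfs", "dfs", "-mkdir", "-p", hdfs_dir]]
    for path in local_files:
        # -f: overwrite the destination file if it is already present
        commands.append(["hdfs", "dfs", "-put", "-f", path, hdfs_dir])
    return commands


def upload(local_files, hdfs_dir):
    """Run the commands one by one; a non-zero exit status raises CalledProcessError."""
    for command in hdfs_upload_commands(local_files, hdfs_dir):
        subprocess.run(command, check=True)


# Example call (hypothetical target folder, requires the hdfs client):
# upload(["movies_en.json", "artists_en.json"], "/user/hadoop/moviedb")
```

Building the command lists separately from running them makes it easy to print and check the commands before anything touches the cluster.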
Assignment 2:
Write Spark SQL code to perform the following:
a. Read the files into DataFrames.
b. Write a query that groups the titles of American movies by year.
c. Display the first 5 records on the command line.
Sample Output:
[...]
(1988,{(Rain Man),(Die Hard)})
(1990,{(The Godfather: Part III),(Die Hard 2),(The Silence of the Lambs),(King of New York)})
(1992,{(Unforgiven),(Bad Lieutenant),(Reservoir Dogs)})
(1994,{(Pulp Fiction)})
[...]
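A minimal PySpark sketch of steps a to c follows. It assumes `spark` is an existing SparkSession (for example the one provided by the `pyspark` shell) and that the files sit in a folder passed as `hdfs_dir`. The view names `movies` and `artists` and the function names are illustrative choices. The filter value `'USA'` is taken from the sample record. `collect_list` returns one array of titles per year, so the printed layout will differ from the sample output above.

```python
# Spark SQL query: one row per year with the list of titles of movies made in the USA.
US_TITLES_BY_YEAR = """
    SELECT year, collect_list(title) AS titles
    FROM movies
    WHERE country = 'USA'
    GROUP BY year
    ORDER BY year
"""


def load_frames(spark, hdfs_dir):
    """Read both JSON files (one record per line) and expose them as temp views."""
    movies = spark.read.json(f"{hdfs_dir}/movies_en.json")
    artists = spark.read.json(f"{hdfs_dir}/artists_en.json")
    movies.createOrReplaceTempView("movies")
    artists.createOrReplaceTempView("artists")
    return movies, artists


def show_us_titles_by_year(spark, n=5):
    """Print the first n grouped records without truncating long title lists."""
    spark.sql(US_TITLES_BY_YEAR).show(n, truncate=False)


# Example calls (hypothetical folder):
# load_frames(spark, "/user/hadoop/moviedb")
# show_us_titles_by_year(spark)
```

`spark.read.json` expects one JSON object per line by default, which matches the file layout described in the problem statement.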
Assignment 3:
Write Spark code to normalize the DataFrames created above and store the output as Parquet files.
The output will be stored in 3 folders:
Folder 1: Stores artist details
Folder 2: Stores movie details
Folder 3: Stores the link between movie, artist and role played
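One way the normalization could look is sketched below in PySpark. It assumes `movies` and `artists` are DataFrames read from the two JSON files, and that artists_en.json uses the same field names as the `director` object in the sample record (`id`, `first_name`, `last_name`, `year_of_birth`); that second point is an assumption, since the problem statement shows no artist record. The folder names `artists`, `movies` and `roles`, the output column names, and the choice to keep the director as a `director_id` column on the movie table are all illustrative.

```python
# One Spark SQL query per output folder.
NORMALIZED = {
    # Folder 1: artist details (field names assumed to match the director object)
    "artists": """
        SELECT id AS artist_id, first_name, last_name,
               CAST(year_of_birth AS INT) AS year_of_birth
        FROM artists
    """,
    # Folder 2: movie details, with the nested director reduced to its identifier
    "movies": """
        SELECT id AS movie_id, title, year, genre, summary, country,
               director.id AS director_id
        FROM movies
    """,
    # Folder 3: one row per (movie, actor) pair, obtained by exploding the actors array
    "roles": """
        SELECT movie_id, actor.id AS artist_id, actor.role AS role
        FROM (SELECT id AS movie_id, explode(actors) AS actor FROM movies) exploded
    """,
}


def write_normalized_parquet(spark, movies, artists, out_dir):
    """Register the source frames as views, then write each query result as Parquet."""
    movies.createOrReplaceTempView("movies")
    artists.createOrReplaceTempView("artists")
    for name, query in NORMALIZED.items():
        spark.sql(query).write.mode("overwrite").parquet(f"{out_dir}/{name}")


# Example call (hypothetical output folder):
# write_normalized_parquet(spark, movies, artists, "/user/hadoop/moviedb/normalized")
```

Keeping the director as a foreign key on the movie table avoids repeating his or her name and year of birth in every movie row, because those details are already present in the artist table.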
Assignment 4:
Write Spark code to normalize the DataFrames created above and store the output as 3 Hive tables in
Parquet format.
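A hedged sketch of the Hive step in PySpark follows. It assumes the three normalized DataFrames are already available in a dictionary keyed by table name, and that the cluster has a Hive metastore that Spark can reach. The database name `moviedb`, the application name and the function names are hypothetical.

```python
def hive_session(app_name="movie-db"):
    """Create a SparkSession that can read and write Hive tables."""
    # Imported inside the function so this module still loads where PySpark is absent.
    from pyspark.sql import SparkSession
    return SparkSession.builder.appName(app_name).enableHiveSupport().getOrCreate()


def save_as_hive_tables(spark, frames, database="moviedb"):
    """Write each DataFrame in `frames` as a Parquet-backed table named <database>.<key>."""
    spark.sql(f"CREATE DATABASE IF NOT EXISTS {database}")
    for name, frame in frames.items():
        frame.write.mode("overwrite").format("parquet").saveAsTable(f"{database}.{name}")


# Example calls (the three frames are assumed to exist already):
# spark = hive_session()
# save_as_hive_tables(spark, {"artists": artists_df, "movies": movies_df, "roles": roles_df})
```

`saveAsTable` registers the tables in the metastore, so they remain queryable by name from later Spark or Hive sessions.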
Assignment 5:
Execute a Spark SQL query on the Hive tables to list MovieID, Title, Year, Genre, Country, Director and Actors
by joining the tables created in Assignment 4, and display the first 5 records on the command line.
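The join could be sketched as follows in PySpark. The table and column names are assumptions about how the Hive tables were laid out in Assignment 4: a `movies` table with `movie_id` and `director_id`, an `artists` table keyed by `artist_id`, and a `roles` table linking `movie_id` to `artist_id`, all in a hypothetical database `moviedb`. `spark` is assumed to be a SparkSession created with Hive support.

```python
# {db} is filled in with the database name before the query is run.
MOVIE_DETAILS = """
    SELECT m.movie_id, m.title, m.year, m.genre, m.country,
           concat_ws(' ', d.first_name, d.last_name) AS director,
           collect_list(concat_ws(' ', a.first_name, a.last_name)) AS actors
    FROM {db}.movies m
    JOIN {db}.artists d ON d.artist_id = m.director_id
    JOIN {db}.roles r ON r.movie_id = m.movie_id
    JOIN {db}.artists a ON a.artist_id = r.artist_id
    GROUP BY m.movie_id, m.title, m.year, m.genre, m.country,
             d.first_name, d.last_name
    ORDER BY m.movie_id
"""


def show_movie_details(spark, database="moviedb", n=5):
    """Run the join against the Hive tables and print the first n rows in full."""
    spark.sql(MOVIE_DETAILS.format(db=database)).show(n, truncate=False)


# Example call:
# show_movie_details(spark)
```

The `artists` table is joined twice, once for the director and once for the actors. Because the joins are inner joins, a movie with no matching director or no listed actors is left out of the result; whether that is acceptable depends on the data.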