DataGrokr Technical Assignment - Data Engineering
Thank you for your interest in the Data Engineering Internship at DataGrokr.
We anticipate that selected candidates will work on Data Engineering and Cloud-related projects, so this assignment is designed to test
candidates' skills in those areas. Candidates who are already proficient in SQL, Python, and Spark will have an edge, but even if you
don't know anything about these technologies you should be able to complete this assignment by following the instructions and studying the links
provided.
Please note that the ability to learn new technologies while following instructions will really help you in your day-to-day activities at DataGrokr.
Note:
Please follow industry standards while writing the code. Preferred Programming language – Python
Spark is written in the Scala programming language and requires the Java Virtual Machine (JVM) to run. Therefore, our first task is to
download Java.
!apt-get update -y
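After updating the package lists, Java itself can be installed. A minimal sketch, assuming the default Colab Ubuntu image and OpenJDK 8 (matching the JAVA_HOME path set below):
# Install OpenJDK 8 (headless); the -qq flag and redirect keep the notebook output clean
!apt-get install openjdk-8-jdk-headless -qq > /dev/null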
Next, we will download and unzip Apache Spark with Hadoop 2.7 to install it.
!wget -q https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop2.7.tgz
!tar xf spark-3.1.2-bin-hadoop2.7.tgz
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.2-bin-hadoop2.7"
Then we need to install and import the 'findspark' library, which locates Spark on the system so that pyspark can be imported like a regular library.
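A minimal sketch of this step (the package is available on PyPI as findspark):
!pip install -q findspark
import findspark
findspark.init()  # uses the SPARK_HOME variable set above to locate the Spark installation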
Now, import SparkSession from pyspark.sql and create a SparkSession, which will be the entry point to Spark.
from pyspark.sql import SparkSession
spark = (SparkSession
         .builder
         .appName("<app_name>")
         .getOrCreate())
2. Google Drive link for the files needed to complete this assignment.
3. Create dataframes for each of the datasets. Give proper column names and datatypes, and include basic schema validation (refer to the schema provided
with the data for reference). Check out spark.read.format from here to create a Spark dataframe; a hedged sketch is shown below.
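A minimal sketch, assuming a CSV input and placeholder column names and types (the actual schema must come from the files provided with the assignment):
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Hypothetical schema for illustration only; replace with the schema provided with the data
games_schema = StructType([
    StructField("game_id", IntegerType(), True),
    StructField("event", StringType(), True),
    StructField("winner", StringType(), True),
])

games_df = (spark.read.format("csv")
            .option("header", "true")
            .schema(games_schema)
            .load("gdrive/MyDrive/<path_to_data>/games.csv"))
games_df.printSchema()  # basic check that the columns and datatypes match the expected schema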
1. Your SparkSession has an attribute called catalog which lists all the data inside the cluster. This attribute has a few methods for extracting different
pieces of information.
One of the most useful methods is .listTables(), which returns all table names in your cluster as a list.
Register those dataframes as tables using .createOrReplaceTempView(). This method registers the DataFrame as a table in the catalog, but because the
table is temporary, it can only be accessed from the specific SparkSession used to create the Spark DataFrame. To read more about it, refer to this
link. A short sketch follows below.
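A short sketch, using an illustrative temp view name "games":
games_df.createOrReplaceTempView("games")  # register the dataframe as a temporary table in the catalog
print(spark.catalog.listTables())          # lists the tables currently registered in the catalog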
2. You can either use PySpark dataframe functions like .select(), or use the spark.sql() function and provide the SQL query inside it to solve the problem.
Check out this link for more info; both styles are sketched below.
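For example, an assumed equivalent pair of queries against the illustrative "games" view registered above:
# DataFrame API
games_df.select("game_id", "event").show(5)

# SQL on the registered temporary view
spark.sql("SELECT game_id, event FROM games").show(5)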
4. Check out the diagram to see all the different ways your Spark data structures can interact with each other.
2. List of players with the number of times they have won a tournament, in descending order (max to min).
Result attributes: player_name, number_of_wins
5. Longest and shortest game ever played in a world championship in terms of moves.
Chess Funda: a "move" is completed once both White and Black have played one turn (e.g., if a game lasts 10 moves, both White and Black have each
played 10 turns).
Result attributes: game_id, event, tournament_name, number_of_moves
Final result will have only two rows
9. How many times players with a low rating won matches, along with their total win count.
Result attributes: player_name, win_count
11. Total number of games where the losing player has a higher captured score than the winning player.
Hint: The captured score is cumulative, i.e., the 3rd capture will include the scores for the 1st, 2nd, and 3rd captures.
Result attributes: total_number_of_games
Final result will have only one row
15. List games where a player has won the game without a queen.
Result attributes: game_id, event, player_name
2. Mount your Google Drive onto your Colab notebook if it is not already mounted (see the sketch after the next step).
3. Create a directory in your Google Drive, if not already created, with the name: DE_SOLUTION_FirstName_LastName/results/
Create the directory using os, Python's built-in library.
The current working directory is '/content/', so to create the above directory you need to pass
'gdrive/MyDrive/DE_SOLUTION_FirstName_LastName/results/' as your path parameter.
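A minimal sketch of these two steps, assuming the drive is mounted at the default /content/gdrive mount point:
from google.colab import drive
drive.mount('/content/gdrive')  # mount Google Drive if not already mounted

import os
# exist_ok=True makes this safe to re-run if the directory already exists
os.makedirs('gdrive/MyDrive/DE_SOLUTION_FirstName_LastName/results/', exist_ok=True)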
Note: We convert the result to a pandas dataframe to avoid a single csv file being saved as multiple part files.
5. Convert the pandas dataframe to csv and save it to Google Drive under DE_SOLUTION_FirstName_LastName/results/.
6. File names should be df1.csv, df2.csv, and so on, and they should be inside the results folder. A sketch of this step is shown below.
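A minimal sketch, assuming df1 is one of the Spark dataframes holding a final result:
# Collect the Spark result into pandas so it is written as a single csv file rather than multiple part files
df1.toPandas().to_csv('gdrive/MyDrive/DE_SOLUTION_FirstName_LastName/results/df1.csv', index=False)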
Deliverables:
1. A single Colab notebook where you have developed the code for Section 1, Section 2, and Section 3.
Move the Colab notebook to the DE_SOLUTION_FirstName_LastName/ folder in your drive with the name
Assignment_Solution_FirstName_LastName.ipynb
Upload your up-to-date resume to the DE_SOLUTION_FirstName_LastName/ folder in your drive with the name FirstName_LastName.pdf
Now email us your Google Drive folder link for DE_SOLUTION_FirstName_LastName/
NOTE: Make sure to share the Google Drive link after choosing: Anyone with the link and Viewer mode
We will run the Colab notebook on our end and evaluate your submission.
2. Your code will be evaluated not just on the basis of the final results but also on code quality. Here are a few tips:
3. Your final submission should be sent to [email protected]. Your submissions are due to us by end of day 19th October 2022 and subject
If you have any questions during the assignment, send your questions to [email protected]
Good luck and we hope you learn something new in this process!