Database Systems
SQL
Overview
BigQuery is Google’s service for big data. Throughout the three class
projects, we will be using BigQuery’s basic SQL querying interface,
its interface with Colaboratory, and its built-in machine learning
features.
Google has published many datasets on BigQuery -- these range from
StackOverflow statistics to real-time air quality data. In this first part
of the course project, you will be using BigQuery’s SQL interface to
answer questions about the NCAA Basketball Dataset. To find the
dataset in BigQuery, follow the instructions in the Getting Started with
BigQuery support document.
Please note: this is a solo assignment. You may discuss ideas at a
high level with other students, but all work should be your own. Please
list the names and SUNet IDs of any students you collaborate with.
Task A: Getting Set Up
Before proceeding, make sure you have read and understood the
Getting Started with BigQuery support document (available on the
course website) which describes how to get up and running with your
BigQuery account, how to manage your course credit, etc.
Note: This is a very important step, as we will be unable to give extra Google
Cloud credit to students who use up all of their credit. If you have any questions
about Google Cloud, your account, or your credits, check Eduko for similar
questions or use email ([email protected]).
Task B: Familiarize yourself with the NCAA Basketball Dataset
Now that you’ve oriented yourself in BigQuery, your second task is to
examine the schemas and the descriptions of the NCAA Basketball
dataset tables and understand the data that you will be working with.
Database Systems-Manual X Page | 1
You may try running some simple queries over the tables to get a feel
for them, or use BigQuery’s “Preview” tab to see what the data looks
like.
Figure 1: mascots Dataset
Some notes:
● _sr stands for “Sportradar”, which is a company that collects
sports data, down to the x/y coordinates of events (shot
attempted, rebound, turnover, foul).
● The historical data makes a distinction between tournament
games and regular season games. Please make sure you're using
the right table!
● “pbp” means play-by-play, which is very granular data about
each event that happens in the game.
Task C: Querying!
Now that you’ve gotten comfortable and familiar with BigQuery and
its SQL querying interface, let’s get to work and answer some
questions about the NCAA Basketball dataset.
We intend for part of this assignment to be about how to translate a
question in plain English to a schema - in other words, we want you to
read the tables and explore the data and think about which tables and
columns are necessary to answer the question we’re asking. This
skill is necessary for the remainder of the projects, and it is exactly
how real world data querying and analysis works!
Your queries should be fairly efficient -- they should each take at most
ten seconds to execute on BigQuery, and most of them will finish
in under about 4 seconds. If any of your queries take much longer
than that, you’ve probably written them in a particularly inefficient way;
please try rewriting them, and see the course staff if you need help.
Please also check that you’re not querying more than a couple of
GB of data - we’ve specifically chosen this dataset so that no one needs
to exhaust their credits completing the assignment. All the queries you
write should fit within the 1TB of free querying you’re allotted for the
month.
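One way to check how much data a query would scan before running it is a BigQuery dry run. The sketch below assumes the third-party google-cloud-bigquery package is installed and that your credentials are already configured; the helper names and the 2 GB default limit are illustrative choices for "a couple of GB", not course requirements.

```python
def estimate_bytes(sql: str) -> int:
    """Dry-run `sql` on BigQuery and return the bytes it would scan.

    A dry run validates the query and reports its size without executing it.
    """
    # Imported lazily so the rest of this sketch runs without the package.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=job_config)
    return job.total_bytes_processed


def within_budget(num_bytes: int, limit_gb: float = 2.0) -> bool:
    """Return True if num_bytes is at most limit_gb (decimal GB; an assumption)."""
    return num_bytes <= limit_gb * 10**9


QUERY = "SELECT * FROM `bigquery-public-data.ncaa_basketball.mascots`"
# With credentials in place, a check could look like:
#   print(within_budget(estimate_bytes(QUERY)))
```

The BigQuery web interface also shows an estimate of the data a query will process before you run it, so the script is optional.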
You can save your queries for each question from the BigQuery
interface directly, or you can keep track of your queries in separate
files yourself. Remember that you can use BigQuery’s “Query
History” tab to inspect previous queries you’ve run.
Note: When querying in BigQuery, table names should be wrapped in
backticks (`). For example, instead of saying:
SELECT * FROM bigquery-public-data.ncaa_basketball.mascots
say:
SELECT * FROM `bigquery-public-data.ncaa_basketball.mascots`
Questions:
We will provide answers for these questions so that you can check your
work. Please make sure your output from BigQuery matches these
answers, both in terms of values and ordering. Read instructions
carefully; if we ask for rounded answers, we may deduct points for not
rounding.
Note: While matching these answers is a good sanity check, it does not guarantee
a perfect score. The datasets we will use to grade your assignment may not
perfectly match the datasets on BigQuery; therefore, make sure that your queries
are generalizable to other datasets (given that schemas are identical).
We reserve the right to deduct points from your project if your queries are
hard-coded in some way or are not generalizable to other tables.
For the following questions, unless otherwise specified, a game can
be either a tournament game or a regular season game.
Write standard SQL queries to answer the following questions:
1. What is the name and capacity of Stanford’s NCAA basketball team
venue? (1 point)
Answer:
Row venue_name venue_capacity
1 Maples Pavilion 7392
2. How many games were played in Stanford’s venue in the 2013-2014
season? (1 point)
Answer:
Row games_at_stanford
1 16
3. Hexadecimal color codes are a way of representing color on a
computer. Hex color codes are of the form #AABBCC, where AA, BB,
and CC are hexadecimal numbers (00, 01, … , FE, FF) indicating
the intensity of red, green, and blue in the color, respectively.
Hint: be careful with the case of the colors in the dataset -- some
use lower case characters and some use upper case characters. Note
that in the expected answer below, the original case from the dataset
is kept.
What teams have the maximum possible red intensity in their color?
Give (team market, color) as your answer. Order your results
alphabetically by the team name. (1 point)
Answer:
Row market Color
1 Idaho State #ff7800
2 Morehead State #ffc300
3 North Carolina A&T #ffb82b
4 Northern Colorado #ffb500
5 Oklahoma State #FF6600
6 Pacific #ff6900
7 South Dakota #ff2310
8 Syracuse #ff5113
9 Tennessee-Martin #ff6900
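To make the #AABBCC format concrete, here is a minimal Python sketch that extracts the red channel from a hex color code. The function names are hypothetical and the sample list is arbitrary; the point is only to show how the first two hex digits map to a red intensity regardless of letter case.

```python
def red_intensity(hex_color: str) -> int:
    """Return the red channel (0-255) of a '#RRGGBB' color code."""
    code = hex_color.lstrip("#")
    if len(code) != 6:
        raise ValueError(f"expected #RRGGBB, got {hex_color!r}")
    # int(..., 16) accepts both upper-case and lower-case hex digits.
    return int(code[0:2], 16)


def has_max_red(hex_color: str) -> bool:
    """True when the red channel is at its maximum value, FF (255)."""
    return red_intensity(hex_color) == 0xFF


# Arbitrary sample codes in mixed case.
for color in ["#ff7800", "#FF6600", "#8C1515"]:
    print(color, red_intensity(color), has_max_red(color))
```

Note that "ff" and "FF" decode to the same intensity, which is why a case-sensitive string comparison alone would miss some rows.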
4. How many home games has Stanford won in seasons 2013 to 2017
(inclusive)? Give (number of games won, average score for Stanford
in those games, average score of the opponents in those games) as
your answer. Round any decimal values to two places. (1 point)
Answer:
Depending on which table you use for your query, you may get
slightly different values. Either of the following results is
acceptable.
Row number avg_stanford avg_opponent
1 71 78.04 64.21
Row number avg_stanford avg_opponent
1 71 78.07 64.13
5. How many players have been on a team based in the same city
where they were born? For this question, please only use the
player’s birth city and state (do not include the player’s birth
country). (2 points)
Answer:
Row num_players
1 606
6. What is the biggest margin of victory in the historical tournament
data? Output the winning team name, losing team name, winning
team points, losing team points, and the win margin of that game. (2
points)
Answer:
Row win_name lose_name win_pts lose_pts Margin
1 Jayhawks Panthers 110 52 58
7. In a basketball tournament, teams are ranked from best to worst
prior to starting the matches. This ranking is called the “seed” of the
team (1 is the best team, and a higher number indicates a worse
team). In general, a higher ranked team is expected to beat a lower
ranked team.
Definition: An upset occurs whenever a team with seed A beats a
team with seed B, and A > B.
What percentage of historical tournament games are upsets?
Round to two decimal places. For example, if 50.2489% of games
are upsets, your query should return 50.25. (3 points)
Answer:
Row upset_percentage
1 27.26
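The upset definition and the rounding rule can be illustrated on a toy list of games. This is a Python sketch over made-up (winner seed, loser seed) pairs, not a query against the dataset; the function name is hypothetical.

```python
def upset_percentage(games):
    """Percentage of games where the winner's seed number exceeds the loser's.

    games: iterable of (winner_seed, loser_seed) pairs.
    The result is rounded to two decimal places.
    """
    games = list(games)
    if not games:
        raise ValueError("no games given")
    upsets = sum(1 for win_seed, lose_seed in games if win_seed > lose_seed)
    return round(100 * upsets / len(games), 2)


# Toy data: only the second game is an upset (seed 12 beats seed 5).
toy_games = [(1, 16), (12, 5), (2, 15)]
print(upset_percentage(toy_games))
```

Here one of three games is an upset, so the raw value 33.3333... is reported as 33.33.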
8. Which pairs of NCAA basketball teams are 1) based in the same
state and 2) have the same team color? Output the team names and
the state. Put the team name that comes alphabetically first in each
pair on the leftmost column, and order the rows alphabetically by
the first column. (3 points)
Answer:
Row teamA teamB State
1 Bearcats Norse KY
2 Cougars Red Raiders TX
3 Razorbacks Red Wolves AR
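The pairing and ordering rules for this question can be sketched on toy data. The teams, states, and colors below are made up, and the function name is hypothetical; colors are compared as exact strings here, which is an assumption of this sketch.

```python
from itertools import combinations


def same_state_same_color_pairs(teams):
    """Return sorted (teamA, teamB, state) rows for matching team pairs.

    teams: list of (name, state, color) tuples.
    Each pair appears once, with the alphabetically first name as teamA.
    """
    pairs = []
    for (name1, state1, color1), (name2, state2, color2) in combinations(teams, 2):
        # Exact string match on color (assumption of this sketch).
        if state1 == state2 and color1 == color2:
            team_a, team_b = sorted((name1, name2))
            pairs.append((team_a, team_b, state1))
    # Sorting the rows orders them alphabetically by the first column.
    return sorted(pairs)


toy_teams = [
    ("Wolves", "AA", "#111111"),
    ("Bears", "AA", "#111111"),
    ("Hawks", "BB", "#111111"),
    ("Eagles", "AA", "#222222"),
]
print(same_state_same_color_pairs(toy_teams))
```

Using unordered combinations avoids listing the same pair twice (once in each order) and avoids pairing a team with itself.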
9. Definition: A geographical location L is a unique tuple (city, state,
country).
Definition: A geographical location L “makes” points for a team T
whenever a player that was born in L scores points for T.
What three geographical locations made the most points for
Stanford’s team in seasons 2013 through 2017, and how many points
did they make? (3 points)
Restrictions:
-For the purposes of this query, avoid using the “birth_place”
column.
Answer:
Row city state country total_points
1 Phoenix AZ USA 2223
2 Minneapolis MN USA 1427
3 Rock Island IL USA 1399
10. Since the start of the 2013 season, which teams have had more
than 5 players score 15 or more points in the first half in a single
game? Note: These players did not all have to score 15+ points in
the first half of the same game.
Output the top 5 team markets and the number of players for each
team meeting this criterion from most to least, breaking ties by team
markets in alphabetical order. (4 points)
Answer:
Row team_market num_players
1 Kentucky 14
2 Oregon 14
3 UCLA 14
4 Duke 13
5 Marquette 13
11. Definition: Team X is a top performer on season Y if no other team
had more wins than X in the same season. This includes teams with
either null or non-null markets.
What five teams (identify them here by their “markets”) were top
performers in the most seasons between 1900 and 2000
(inclusive), and how many times were they top performers?
Output the team markets and the number of times each team was
a top performer. If there are ties in the final output, break them by
giving a higher ranking to team markets that come first
alphabetically. Ignore teams with NULL markets only in the final
output. (4 points)
Answer:
Row top market top_performer_count
1 University of California, Los Angeles 6
2 University of Kentucky 6
3 Texas Southern University 5
4 University of Pennsylvania 5
5 Western Kentucky University 5
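The top-performer definition means that teams tied for the most wins in a season all count as top performers for that season. The Python sketch below shows this on made-up win totals, along with the alphabetical tie-break in the final ranking; the function names and data are hypothetical.

```python
from collections import Counter


def top_performer_counts(wins_by_season):
    """Count, per team, the seasons in which no other team had more wins.

    wins_by_season: {season: {team: wins}}.
    """
    counts = Counter()
    for wins in wins_by_season.values():
        if not wins:
            continue
        best = max(wins.values())
        # Every team that reaches the season maximum is a top performer.
        counts.update(team for team, won in wins.items() if won == best)
    return counts


def rank(counts, limit=5):
    """Sort by count (descending), then by team name (ascending)."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:limit]


toy = {
    1990: {"Alpha": 20, "Beta": 20, "Gamma": 15},  # Alpha and Beta tie
    1991: {"Alpha": 18, "Beta": 22},
    1992: {"Alpha": 25, "Gamma": 10},
}
print(rank(top_performer_counts(toy)))
```

In the toy data Alpha and Beta are each top performers twice, and Alpha is listed first because it comes first alphabetically.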
Submission Instructions
Once you have written queries that answer all questions and conform to
the given result schemas, you’re ready to submit.
To submit:
1. Copy all the queries you wrote for Task C into the
project1_submission.py file (available on the Eduko
website), pasting all of your queries into the
corresponding places.
2. If you collaborated with others to generate your queries, add
their names and SUNet IDs to the comment at the top of the
project1_submission.py file.
3. Submit this Python file in the Eduko assignment section.
You may resubmit as many times as you like; however, only the latest
submission and timestamp will be saved, and we will use your latest
submission for grading your work and determining any late penalties
that may apply. Submissions via email will not be accepted!
IMPORTANT SUBMISSION NOTES:
When you submit to Eduko, we will run a syntax checker that will make sure that
your SQL runs OK. It should run immediately and return whether the query ran
OK or if there were errors - please make sure that you get a positive result from
this test in your final submission.
You will not see a final grade until after the project deadline. The answers are
provided above for the questions so that you may check your work yourself. It is
your responsibility to ensure that your final submission is free from Python or SQL
syntax errors and that you follow all instructions in this section.
We reserve the right to deduct points from your project if you do not follow the
submission instructions, or if you have syntax errors in your queries.
FAQ
Question:
I’m getting syntax errors when I submit to Eduko, but I don’t see these
syntax errors when I run on BigQuery.
Answer:
Some things to check:
● Were queries copied correctly to the submission file?
● Are you using Standard SQL on BigQuery?
● Did you use backticks around table names?
● Check the name of your file.
● Check the zip format.
● Do not add subfolders in the zip file.
Question:
Do I have to match the column names given in the solutions? For
example, in question 7, do I have to name the column
“upset_percentage”?
Answer:
Yes, your column names must match the desired ones.