Data Analytics Course 3
Databases enable analysts to store, manipulate, and process data, which helps them search through data far
more efficiently to get the best insights.
Relational databases
A relational database is a database that contains a series of tables that can be connected to show relationships.
Basically, they allow data analysts to organize and link data based on what the data has in common.
In a non-relational table, you will find all of the possible variables you might be interested in analyzing all
grouped together. This can make it really hard to sort through. This is one reason why relational databases are
so common in data analysis: they simplify a lot of analysis processes and make data easier to find and use
across an entire database.
Database Normalization
Normalization is a process of organizing data in a relational database, such as creating tables and
establishing relationships between those tables. It is applied to eliminate data redundancy, increase data
integrity, and reduce complexity in a database.
A primary key is an identifier in a table that references a column in which each value is unique; it identifies a
single record in the table. By contrast, a foreign key is a field within a table that is a primary key in another
table. A table can have only one primary key, but it can have multiple foreign keys. These keys are what create
the relationships between tables in a relational database, which helps organize and connect data across
multiple tables in the database.
Some tables don't require a primary key. For example, a revenue table can have multiple foreign keys and not
have a primary key. A primary key may also be constructed using multiple columns of a table. This type of
primary key is called a composite key. For example, if customer_id and location_id are two columns of a
composite key for a customer table, the values assigned to those fields in any given row must be unique within
the entire table.
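As a sketch of these key concepts, the SQLite snippet below defines a customer table with a composite primary key and a revenue table that has foreign keys but no primary key of its own. The table and column names are invented for illustration, not taken from the course.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

# location: a simple single-column primary key.
conn.execute("""CREATE TABLE location (
    location_id INTEGER PRIMARY KEY,
    city        TEXT)""")

# customer: composite primary key over (customer_id, location_id), so that
# PAIR of values must be unique in every row of the table.
conn.execute("""CREATE TABLE customer (
    customer_id INTEGER,
    location_id INTEGER,
    name        TEXT,
    PRIMARY KEY (customer_id, location_id))""")

# revenue: no primary key of its own, but multiple foreign keys that relate
# each row back to the other tables.
conn.execute("""CREATE TABLE revenue (
    customer_id INTEGER,
    location_id INTEGER,
    amount      REAL,
    FOREIGN KEY (location_id) REFERENCES location (location_id),
    FOREIGN KEY (customer_id, location_id)
        REFERENCES customer (customer_id, location_id))""")

conn.execute("INSERT INTO location VALUES (10, 'Denver')")
conn.execute("INSERT INTO customer VALUES (1, 10, 'Acme')")
conn.execute("INSERT INTO revenue VALUES (1, 10, 5000.0)")

# A second row reusing the same composite key pair is rejected:
try:
    conn.execute("INSERT INTO customer VALUES (1, 10, 'Duplicate')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

Running the last insert raises an IntegrityError, which is exactly the uniqueness guarantee a composite key provides.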
SQL? You’re speaking my language
Databases use a special language to communicate called a query language. Structured Query Language (SQL)
is a type of query language that lets data analysts communicate with a database. So, a data analyst will use
SQL to create a query to view the specific data that they want from within the larger set. In a relational database,
data analysts can write queries to get data from the related tables. SQL is a powerful tool for working with
databases — which is why you are going to learn more about it coming up!
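As a small illustration of querying related tables, here is a sketch using SQLite from Python. The customers/orders schema is invented for the example; BigQuery, used later in this course, follows the same SELECT ... FROM ... JOIN pattern.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
conn.execute("INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex')")
conn.execute("INSERT INTO orders VALUES (100, 1, 25.0), (101, 2, 40.0)")

# The query pulls data from both related tables at once by matching
# rows on the shared customer_id key.
rows = conn.execute("""
    SELECT c.name, o.total
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    ORDER BY o.total""").fetchall()
print(rows)  # [('Acme', 25.0), ('Globex', 40.0)]
```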
Metadata is as important as the data itself
Data analytics, by design, is a field that thrives on collecting and organizing data. In this reading, you are going
to learn about how to analyze and thoroughly understand every aspect of your data.

Take a look at any data you find. What is it? Where did it come from? Is it useful? How do you know? This is
where metadata comes in to provide a deeper understanding of the data. To put it simply, metadata is data
about data. In database management, it provides information about other data and helps data analysts interpret
the contents of the data within a database.
Regardless of whether you are working with a large or small quantity of data, metadata is the mark of a
knowledgeable analytics team, helping to communicate about data across the business and making it easier to
reuse data. In essence, metadata tells the who, what, when, where, which, how, and why of data.
Elements of metadata
Before looking at metadata examples, it is important to understand what type of information metadata typically
provides: a title and description of the data, its tags and categories, who created it and when, who last modified
it and when, and who can access or update it.
Examples of metadata
In today’s digital world, metadata is everywhere, and it is becoming a more common practice to provide
metadata on a lot of media and information you interact with. Here are some real-world examples of where to
find metadata:
Photos
Whenever a photo is captured with a camera, metadata such as the filename, date, time, and geolocation is
gathered and saved with it.
Emails
When an email is sent or received, there is lots of visible metadata, such as the subject line, the sender, the
recipient, and the date and time sent. There is also hidden metadata, including server names, IP addresses,
HTML format, and software details.
Websites
Every web page has a number of standard metadata fields, such as tags and categories, the site creator’s name,
the web page title and description, the time of creation, and any iconography.
Digital files
Usually, if you right-click on any computer file, you will see its metadata. This could consist of the file name, file
size, date of creation and modification, and type of file.
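This kind of file metadata can also be read programmatically. Here is a minimal sketch with Python's standard library; the file and its contents are created just for the example.

```python
import os
import tempfile
from datetime import datetime

# Create a throwaway file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "report.csv")
with open(path, "w", newline="") as f:
    f.write("region,sales\nnorth,100\n")

# os.stat exposes the same metadata the right-click dialog shows.
info = os.stat(path)
print("file name:", os.path.basename(path))
print("file size (bytes):", info.st_size)
print("last modified:", datetime.fromtimestamp(info.st_mtime))
```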
Books
Metadata is not only digital. Every book carries a number of standard metadata elements on its covers and
inside, informing you of its title, the author’s name, a table of contents, publisher information, a copyright
description, an index, and a brief description of the book’s contents.
In this reading, you’ll learn more about the benefits of metadata, metadata repositories, and metadata of external
databases.
The benefits of metadata
Reliability
Data analysts use reliable and high-quality data to identify the root causes of any problems that might occur
during analysis and to improve their results. If the data being used to solve a problem or to make a data-driven
decision is unreliable, there’s a good chance the results will be unreliable as well.
Metadata helps data analysts confirm their data is reliable by making sure it is:
Accurate
Precise
Relevant
Timely
It does this by helping analysts ensure that they’re working with the right data and that the data is described
correctly. For example, a data analyst completing a project with data from 2022 can use metadata to easily
determine if they should use data from a particular file.
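One way to sketch that decision in code is to filter a list of files by the year in their last-modified metadata. The helper name and backdated timestamp below are invented for the example.

```python
import os
import tempfile
from datetime import datetime

def files_from_year(paths, year):
    """Keep only files whose last-modified timestamp falls in `year`."""
    return [p for p in paths
            if datetime.fromtimestamp(os.stat(p).st_mtime).year == year]

# Two throwaway files so the example is self-contained.
folder = tempfile.mkdtemp()
old = os.path.join(folder, "survey_2022.csv")
new = os.path.join(folder, "survey_recent.csv")
for p in (old, new):
    open(p, "w").close()

# Backdate one file's modified time to mid-August 2022
# (an arbitrary timestamp chosen for the example).
os.utime(old, (1660000000, 1660000000))

print(files_from_year([old, new], 2022))  # only the backdated file
```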
Consistency
Data analysts thrive on consistency and aim for uniformity in their data and databases, and metadata helps
make this possible. For example, to use survey data from two different sources, data analysts use metadata to
make sure the same collection methods were applied in the survey so that both datasets can be compared
reliably.
When a database is consistent, it’s easier to discover relationships between the data inside the database and
data that exists elsewhere. When data is uniform, it is:
Organized: Data analysts can easily find tables and files, monitor the creation and alteration of assets,
and store metadata.
Classified: Data analysts can categorize data when it follows a consistent format, which is beneficial in
cleaning and processing data.
Stored: Consistent and uniform data can be efficiently stored in various data repositories. This
streamlines storage management tasks such as managing a database.
Accessed: Users, applications, and systems can efficiently locate and use data.
Together, these benefits empower data analysts to effectively analyze and interpret their data.
Metadata repositories
Metadata repositories help data analysts ensure their data is reliable and consistent.
Metadata repositories are specialized databases specifically created to store and manage metadata. They can
be kept in a physical location or a virtual environment—like data that exists in the cloud.
Metadata repositories describe where the metadata came from and store that data in an accessible form with a
common structure. This provides data analysts with quick and easy access to the data. If data analysts didn’t
use a metadata repository, they would have to select each file to look up its information and compare the data
manually, which would waste a lot of time and effort.
Data analysts also use metadata repositories to bring together multiple sources for data analysis. Metadata
repositories do this by describing the state and location of the data, the structure of the tables inside the data,
and, through user logs, who has accessed the data.
Metadata of external databases
Data analysts should understand the metadata of external databases to confirm that it is consistent and reliable.
In some cases, they should also contact the owner of the third-party data to confirm that it is accessible and
available for purchase. Confirming that the data is reliable and that the proper permissions to use it have been
obtained are best practices when using data that comes from another organization.
Key takeaways
Metadata helps data analysts make data-driven decisions more quickly and efficiently. It also ensures that data
and databases are reliable and consistent.
Metadata repositories are used to store metadata—including data from second-party and third-party companies.
These repositories describe the state and location of the metadata, the structure of the tables inside it, and who
has accessed the repository. Data analysts use metadata repositories to ensure that they use the right data
appropriately.
Fortunately, there are tools to help you automate data imports so you don’t need to continually update the data
in your current spreadsheet. Take a small general store as an example. The store has three cash registers
handled by three clerks. At the end of each day, the owner wants to determine the total sales and the amount of
cash in each register. Each clerk is responsible for counting their money and entering their sales total into a
spreadsheet. The owner has the spreadsheets set up to import each clerk’s data into another spreadsheet,
which automatically calculates the total sales for all three registers. Without this automation, each clerk
would have to take turns entering their data into the owner’s spreadsheet. This is an example of a dynamic
method of importing data, which saves the owner and clerks time and energy. When data is dynamic, it is
interactive and automatically changes and updates over time.
In the following sections you’ll learn how to import data into Google Sheets dynamically. To use the
IMPORTRANGE function, you need two pieces of information:
1. The URL of the Google Sheet from which you’ll import data.
2. The name of the sheet and the range of cells you want to import into your Google Sheet.
Once you have this information, open the Google Sheet into which you want to import data and select the cell
into which the first cell of data should be copied. Enter = to indicate you will enter a function, then complete the
IMPORTRANGE function with the URL and range you identified in the following manner:
=IMPORTRANGE("URL", "sheet_name!cell_range"). Note that an exclamation point separates the sheet name
and the cell range in the second part of this function.
=IMPORTRANGE("https://docs.google.com/thisisatestabc123", "sheet1!A1:F13")
Note: This URL is for syntax purposes only. It is not meant to be entered into your own spreadsheet.
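To illustrate the exclamation-point rule, here is a toy sketch that splits the reference string into the two parts the function's second argument is built from. This is only an illustration, not Google Sheets' actual parser.

```python
def split_range_ref(ref):
    """Split "sheet_name!cell_range" at the first "!"."""
    sheet_name, _, cell_range = ref.partition("!")
    return sheet_name, cell_range

print(split_range_ref("sheet1!A1:F13"))  # ('sheet1', 'A1:F13')
```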
Once you’ve completed the function, a box will pop up prompting you to allow access to the Google Sheet from
which you’re importing data. You must allow access to the spreadsheet containing the data the first time you
import it into Google Sheets. To try it yourself, replace the URL with one for a spreadsheet you have created, so
you can control access by selecting the Allow access button.
Refer to the Google Help Center's IMPORTRANGE page for more information about the syntax. You’ll also learn
more about this later in the program.
In Google Sheets, you can use the IMPORTHTML function to import the data from an HTML table (or list) on a
web page. This function is similar to the IMPORTRANGE function. Refer to the Google Help Center's
IMPORTHTML page for more information about the syntax.
You can use the IMPORTDATA function in a Google Sheet to import data into a Google Sheet. This function is
similar to the IMPORTRANGE function. Refer to Google Help Center's IMPORTDATA page for more information
and the syntax.
Spreadsheets can import data from several sources, including:
Other spreadsheets
CSV files
HTML tables (in web pages)
Importing data from other spreadsheets
In a lot of cases, you might have an existing spreadsheet open and need to add additional data from another
spreadsheet.
Google Sheets
In Google Sheets, you can use the IMPORTRANGE function. It enables you to specify a range of cells in the
other spreadsheet to duplicate in the spreadsheet you are working in. You must allow access to the spreadsheet
containing the data the first time you import the data. The example URL shown earlier is for syntax purposes
only; don't enter it in your own spreadsheet. Replace it with a URL to a spreadsheet you have created so you
can control access to it by clicking the Allow access button.
Refer to the Google Help Center's IMPORTRANGE page for more information about the syntax. There is also an
example of its use later in the program in Advanced functions for speedy data cleaning.
Microsoft Excel
To import data from another spreadsheet, do the following:
Step 1: Open a new or existing spreadsheet.
Step 2: Click Get Data, and then select From File within the toolbar. In the drop-down, choose From Excel
Workbook.
Step 3: Browse for and select the spreadsheet file and then click Import.
Step 4: In the Navigator window, select which worksheet to import.
Step 5: Click Load to import all the data in the worksheet; or click Transform Data to open the Power Query
Editor to adjust the columns and rows of data you want to import.
Step 6: If you clicked Transform Data, click Close & Load when you are finished, then choose whether to load
the data into the current worksheet or a new one.
If you are using Numbers, search the Numbers User Guide for directions.
Importing data from CSV files

Google Sheets
Step 1: Open the File menu in your spreadsheet and select Import to open the Import file window.
Step 2: Select the Upload tab and browse for the CSV file, or drag and drop it into the window.
Step 3: Choose your import location and separator type options.
Step 4: Select Import data. The data in the CSV file will be loaded into your sheet, and you can begin using it!
Note: You can also use the IMPORTDATA function in a spreadsheet cell to import data using the URL to a CSV
file. Refer to Google Help Center's IMPORTDATA page for more information and the syntax.
Microsoft Excel
Step 1: Open a new or existing spreadsheet
Step 2: Click Data in the main menu and select the From Text/CSV option.
Step 3: Browse for and select the CSV file and then click Import.
Step 4: From here, you will have a few options. You can change the delimiter from a comma to another character
such as a semicolon. You can also turn automatic data type detection on or off. And, finally, you can transform
your data by clicking Transform Data to open the Power Query Editor.
Step 5: In most cases, accept the default settings in the previous step and click Load to load the data in the CSV
file to the spreadsheet. The data in the CSV file will be loaded into the spreadsheet, and you can begin working
with the data.
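The delimiter option in Step 4 mirrors what happens when CSV text is parsed directly. A quick sketch with Python's csv module, using made-up data:

```python
import csv
import io

comma_text = "region,sales\nnorth,100\n"
semicolon_text = "region;sales\nnorth;100\n"

# The default delimiter is a comma...
rows = list(csv.reader(io.StringIO(comma_text)))
print(rows)  # [['region', 'sales'], ['north', '100']]

# ...but it can be switched to another character, such as a semicolon.
rows_semi = list(csv.reader(io.StringIO(semicolon_text), delimiter=";"))
print(rows_semi)  # [['region', 'sales'], ['north', '100']]
```

Both inputs parse to the same rows once the correct delimiter is chosen, which is exactly why the import dialog lets you change it.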
If these directions do not work for your version of Excel, visit Microsoft Excel for Windows Training, a free
online training center where you will find everything you need to know, all in one place.
If you are using Numbers, search the Numbers User Guide for directions.
Google Sheets
In Google Sheets, you can use the IMPORTHTML function. It enables you to import the data from an HTML
table (or list) on a web page.
Refer to the Google Help Center's IMPORTHTML page for more information about the syntax. If you are
importing a list, replace "table" with "list" in the function. The index number (for example, 4) refers to the order
of the tables on a web page. It is like a pointer indicating which table on the page you want to import the data
from.
You can try this yourself! In blank worksheets, copy and paste each of the following IMPORTHTML functions into
cell A1 and watch what happens. You will actually be importing the data from four different HTML tables in a
Wikipedia article: Demographics of India. You can compare your imported data with the tables in the article.
=IMPORTHTML("http://en.wikipedia.org/wiki/Demographics_of_India","table",1)
=IMPORTHTML("http://en.wikipedia.org/wiki/Demographics_of_India","table",2)
=IMPORTHTML("http://en.wikipedia.org/wiki/Demographics_of_India","table",3)
=IMPORTHTML("http://en.wikipedia.org/wiki/Demographics_of_India","table",4)
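For a sense of how a table index works under the hood, the sketch below uses Python's standard-library HTMLParser to collect every table on a page; IMPORTHTML's index 1 then corresponds to the first table collected. This is only an illustration with made-up HTML, not how Google Sheets is implemented.

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collects the text of each <td>/<th> cell, grouped by table."""
    def __init__(self):
        super().__init__()
        self.tables = []      # list of tables; each table is a list of rows
        self.row = None
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.row = []
        elif tag in ("td", "th") and self.row is not None:
            self.in_cell = True
            self.row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False
        elif tag == "tr" and self.row is not None:
            self.tables[-1].append(self.row)
            self.row = None

    def handle_data(self, data):
        if self.in_cell:
            self.row[-1] += data.strip()

page = """
<table><tr><th>Year</th><th>Population</th></tr>
       <tr><td>2001</td><td>1.03B</td></tr></table>
<table><tr><td>Delhi</td><td>16M</td></tr></table>
"""
parser = TableExtractor()
parser.feed(page)
# index 1 in IMPORTHTML corresponds to the first table found on the page
print(parser.tables[0])  # [['Year', 'Population'], ['2001', '1.03B']]
```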
Microsoft Excel
You can import data from web pages using the From Web option:
Step 1: Open a new or existing spreadsheet.
Step 2: Click Data in the main menu and select the From Web option.
Step 3: Enter the URL of the web page and click OK.
Step 4: In the Navigator window, select which table to import.
Step 5: Click Load to load the data from the table into your spreadsheet.
If these directions do not work for your version of Excel, visit Microsoft Excel for Windows Training, a free
online training center where you will find everything you need to know, all in one place.
If you are using Numbers, search the Numbers User Guide for directions.
The Google Cloud Public Datasets give data analysts access to high-demand public datasets and make it easy
to uncover insights in the cloud.
The Dataset Search can help you find available datasets online with keyword searches.
Kaggle has an Open Data search function that can help you find datasets to practice with.
Finally, BigQuery hosts 150+ public datasets you can access and use.
Public health datasets
1. Global Health Observatory data: You can search for datasets from this page or explore featured data
collections from the World Health Organization.
2. The Cancer Imaging Archive (TCIA) dataset: Just like the earlier dataset, this data is hosted by the
Google Cloud Public Datasets and can be uploaded to BigQuery.
3. 1000 Genomes: This is another dataset from the Google Cloud Public resources that can be uploaded
to BigQuery.
Public climate datasets
1. National Climatic Data Center: The NCDC Quick Links page has a selection of datasets you can
explore.
2. NOAA Public Dataset Gallery: The NOAA Public Dataset Gallery contains a searchable collection
of public datasets.
Public social-political datasets
1. UNICEF State of the World’s Children: This dataset from UNICEF includes a collection of tables
that can be downloaded.
2. CPS Labor Force Statistics: This page contains links to several available datasets that you can
explore.
3. The Stanford Open Policing Project: This dataset can be downloaded as a .CSV file for your own
use.
Cleaning data is an important part of the data analysis process. If data analysis is based on bad or dirty data, it
may be biased, erroneous, and uninformed. Sorting and filtering are essential skills for every data analyst, and are
also very useful for cleaning data.
Think about everything you’ve learned about spreadsheets and databases. In many ways, they are
similar. In other ways, they are different.
For example, both spreadsheets and databases store and organize data. However, databases can be
relational while spreadsheets cannot. This means that spreadsheets are better suited to self-contained
data, where the data exists in one place. Meanwhile, you can use databases to store data from external
tables, allowing you to change data in several places by editing in only one place.
Take a moment to consider these examples and come up with a few areas of comparison of your own.
An upcoming activity is performed in BigQuery. This reading provides instructions to create your own BigQuery
account, select public datasets, and upload CSV files. At the end of this reading, you can confirm your access to
the BigQuery console before you move on to the activity.
Note: Additional getting started resources for a few other SQL database platforms are also provided at the end of
this reading if you choose to work with them instead of BigQuery.
A free sandbox account doesn’t ask for a method of payment. It does, however, limit you to 12 projects. It
also doesn't allow you to insert new records to a database or update the field values of existing
records. These data manipulation language (DML) operations aren't supported in the sandbox.
A free trial account requires a method of payment to establish a billable account, but offers full
functionality during the trial period.
With either type of account, you can upgrade to a paid account at any time and retain all of your existing
projects. If you set up a free trial account but choose not to upgrade to a paid account when your trial period
ends, you can still set up a free sandbox account at that time. However, projects from your trial account won't
transfer to your sandbox account. It would be like starting from scratch again.
Follow these step-by-step instructions or watch the video, Setting up BigQuery, including sandbox and
billing options. The free trial offers $300 in credit over the next 90 days. You won’t get anywhere near
that spending limit if you just use the BigQuery console to practice SQL queries. After you spend the
$300 credit (or after 90 days), your free trial will expire, and you will need to choose to upgrade
to a paid account to keep using Google Cloud Platform services, including BigQuery. Your method of
payment will never be automatically charged after your free trial ends. You will only begin to be billed
if you choose to upgrade your account.
After you set up your account, you will see My First Project in the banner and, above the banner, the status of
your account: your credit balance and the number of days remaining in your trial period.
How to get to the BigQuery console
In your browser, go to console.cloud.google.com/bigquery.
Note: Going to console.cloud.google.com in your browser takes you to the main dashboard for the Google Cloud
Platform. To navigate to BigQuery from the dashboard, open the navigation menu, select BigQuery, and then
choose BigQuery Studio.
Getting started with MySQL: This is a guide to setting up and using MySQL.
Getting started with Microsoft SQL Server: This is a tutorial to get started using SQL Server.
Getting started with PostgreSQL: This is a tutorial to get started using PostgreSQL.
Getting started with SQLite: This is a quick start guide for using SQLite.
Set up your BigQuery account
As you’ve been learning, BigQuery is a database you can use to access, explore, and analyze data from many
sources. Now, you’ll begin using BigQuery, which will help you gain SQL knowledge by typing out commands
and troubleshooting errors. This reading will guide you through the process of setting up your very own BigQuery
account.
Note: Working with BigQuery is not a requirement of this program. Additional resources for other SQL database
platforms are also provided at the end of this reading if you choose to use them instead.
This reading provides instructions for setting up either account type. An effective first step is to begin with a
sandbox account and switch to a free-of-charge trial account when needed to run the SQL presented in
upcoming courses.
Sandbox account
The sandbox account is available at no cost, and anyone with a Google account can use it. However, it does
have some limitations. For instance, you are limited to a maximum of 12 projects at a time. This means that, to
create a 13th project, you'll need to delete one of your existing 12 projects. Additionally, the sandbox account
doesn't support all operations you’ll do in this program. For example, there are limits on the amount of data you
can process and you can’t insert new records into a database or update the values of existing records. However,
a sandbox account is perfect for most program activities, including all of the activities in this course. Additionally,
you can convert your sandbox account into a free-of-charge trial account at any time.
Free-of-charge trial
If you wish to explore more of BigQuery's capabilities with fewer limitations, consider the Google Cloud Free
Trial. It provides you with $300 in credit for Google Cloud usage during the first 90 days. If you're primarily using
BigQuery for SQL queries, you're unlikely to come close to this spending limit. After you've used up the $300
credit or after 90 days, your free trial will expire, and you will only be able to use this account if you pay to do so.
Google won't automatically charge your payment method when the trial ends. However, you'll need to set up a
payment option with Google Cloud, which means entering your financial information. Rest assured, Google
won't charge you unless you consciously opt to upgrade to a paid account. If you're uncomfortable providing
payment information, don't worry; you can use the BigQuery sandbox account instead.
Key takeaways
BigQuery offers multiple account options. Keep the following in mind when you choose an account type:
Account tiers: BigQuery provides various account tiers to cater to a wide range of user requirements.
Whether you're starting with a sandbox account or exploring a paid account with the free-of-charge trial
option, BigQuery offers flexibility to choose the option that aligns best with your needs and budget.
Sandbox limitations: While a sandbox account is a great starting point, it comes with some limitations,
such as a cap on the number of projects and restrictions on data manipulation operations like inserting or
updating records, which you will encounter later in this program. Be aware of these limitations if you
choose to work through this course using a sandbox account.
Easy setup and upgrades: Getting started with any BigQuery account type is quick and easy. And if your
needs evolve, you have the flexibility to modify your account status at any time. Additionally, projects can
be retained even when transitioning between account types.
Choose the right BigQuery account type to match your specific needs and adapt as your requirements change!
Log in to BigQuery
When you log in to BigQuery using the landing page, you will automatically open your project space. This
is a high-level overview of your project, including the project information and the current resources being
used. From here, you can check your recent activity.
Navigate to your project’s BigQuery Studio by selecting BigQuery from the navigation menu and BigQuery
Studio from the dropdown menu.
The Navigation pane
On the console page, find the Navigation pane. This is how you navigate from the project space to the BigQuery
tool. This menu also contains a list of other Google Cloud Platform (GCP) data tools. During this program, you
will focus on BigQuery, but it’s useful to understand that the GCP has a collection of connected tools data
professionals use every day.
The Explorer pane lists your current projects and any starred projects
you have added to your console. It’s also where you’ll find the + ADD
button, which you can use to add datasets.
This button opens the Add dialog that allows you to open or import a variety of datasets.
Select Public Datasets. This takes you to the Public Datasets Marketplace, where you can search for and
select public datasets to add to your BigQuery console. For example, search for the "noaa lightning" dataset
in the Marketplace search bar. When you search for this dataset, you will find NOAA’s Cloud-to-Ground
Lightning Strikes data.
Select the dataset to read its description. Select View dataset to create a tab of the dataset’s information
within the SQL workspace.

The Explorer pane lists the noaa_lightning and other public datasets.
Starring bigquery-public-data will enable you to search for and add public datasets by scrolling in the
Explorer pane or by searching for them in the Explorer search bar.
For example, you might want to select a different public dataset. If you select the second dataset,
"austin_311," it will expand to list the table stored in it, “311_service_requests.”
Additionally, you can select the Query button from the menu
bar in the SQL Workspace to query this table.
The final menu pane in your console is the SQL Workspace. This is where you will actually write and
execute queries in BigQuery.
The SQL Workspace also gives you access to your personal and project history, which stores a record of the
queries you’ve run. This can be useful if you want to return to a query to run it again or use part of it in another
query.
Key takeaways
BigQuery's SQL workspace allows you to search for public datasets, run SQL queries, and even upload your
own data for analysis. Whether you're working with public datasets, running SQL queries, or uploading your
own data, BigQuery’s SQL workspace offers a range of features to support all kinds of data analysis tasks.
Throughout this program, you will be using BigQuery to practice your SQL skills, so being familiar with the
major components of your BigQuery console will help you navigate it effectively in the future!
Keep this guide open as you watch the video. It can serve as a helpful reference if you need additional context
or clarification while following the video steps. This is not a graded activity, but you can complete these steps to
practice the skills demonstrated in the video.
Example 1: Explore a public dataset
1. Log in to BigQuery and go to your console. You should find the Welcome to your SQL Workspace!
landing page open. Select COMPOSE A NEW QUERY in the BigQuery console. Make sure that no tabs
are open so that the entire workspace is displayed, including the Explorer pane.
2. Enter sunroof in the search bar. In the search results, expand sunroof_solar and then select the
solar_potential_by_postal_code dataset.
3. Observe the Schema tab of the Explorer pane to explore the table fields.
4. Select the Preview tab to view the regions, states, yearly sunlight, and more.
Example 2: Use SQL to view the entire dataset
1. The first step is finding out the complete, correct name of the dataset. Select the ellipses by the dataset
solar_potential_by_postal_code, then select Query. A new tab will populate on your screen. Select the
tab. The name of the dataset should be written inside the two backticks.
2. Select the dataset name by highlighting the text including the backticks and copy it.
3. Now, click on the plus sign to create a new query. Notice that BigQuery doesn’t automatically generate a
SELECT statement in this window. Enter SELECT and add a space after it.
4. Put an asterisk * after SELECT to indicate you want to return the entire dataset. The asterisk lets the
database know to include all columns. Without this shortcut, you would have to manually enter every
column name!
5. Next, press the Enter/Return key and Enter FROM on the second line. FROM indicates where the data is
coming from. After FROM, add another space.
6. Paste in the name of the dataset that you copied earlier. It will read
`bigquery-public-data.sunroof_solar.solar_potential_by_postal_code`
7. Execute the query by selecting the RUN button.
Example 3: Use SQL to view a piece of data
If the project doesn’t require every field to be completed, you can use SQL to see a particular piece, or pieces, of
data. To do this, specify a certain column name in the query.
1. For example, you might only need data from Pennsylvania. You’d begin your query the same way you
just did in the previous examples: Click on the plus sign, enter SELECT, add a space, an asterisk (*), and
then press Enter/Return.
2. Enter FROM and then paste `bigquery-public-data.sunroof_solar.solar_potential_by_postal_code`. Press
Enter/Return.
3. This time, add WHERE. It will be on the same line as the FROM statement. Add a space, enter
state_name, and then add another space after it.
4. Because you only want data from Pennsylvania, add = and 'Pennsylvania' on the same line as state_name. In
SQL, single quotes represent the beginning and ending of a string.
5. Execute the query with the RUN button.
6. Review the data on solar potential for Pennsylvania. Scroll through the query results.
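Putting the steps above together, the sketch below runs the same SELECT/FROM/WHERE shape against a tiny SQLite stand-in table. The rows and the sunlight column are invented for the example; in BigQuery, the FROM clause would reference the backticked public dataset name instead.

```python
import sqlite3

# Hypothetical stand-in for the BigQuery table; in BigQuery the FROM clause
# reads `bigquery-public-data.sunroof_solar.solar_potential_by_postal_code`.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE solar_potential_by_postal_code (
    postal_code TEXT, state_name TEXT, yearly_sunlight REAL)""")
conn.executemany(
    "INSERT INTO solar_potential_by_postal_code VALUES (?, ?, ?)",
    [("15201", "Pennsylvania", 12000000.0),
     ("19103", "Pennsylvania", 25000000.0),
     ("94103", "California",   41000000.0)])

# SELECT * returns every column; WHERE filters the rows to one state,
# with single quotes marking the beginning and end of the string.
rows = conn.execute(
    "SELECT * FROM solar_potential_by_postal_code "
    "WHERE state_name = 'Pennsylvania'").fetchall()
print(rows)  # only the two Pennsylvania rows come back
```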
Keep in mind that SQL queries can be written in a lot of different ways and still return the same results. You
might discover other ways to write these queries!