Coursera - Data Analytics - Course 4

This document discusses using SQL to clean large datasets. It compares the features of spreadsheets and SQL databases for data cleaning. SQL can process large amounts of data more quickly than spreadsheets. The document provides learning objectives about using basic SQL queries and functions for cleaning string variables and transforming data in a database. It explains why standard SQL is useful as it works with most databases and requires few syntax changes between dialects.

Uploaded by

Utjale
Copyright
© All Rights Reserved

Coursera Google Data Analytics

Course 4 Notes
Week 1 - The Importance of Integrity

The Importance of Integrity


Course 4.1 Learning Objectives:
● Describe statistical measures associated with data integrity including
statistical power, hypothesis testing, and margin of error
● Describe strategies that can be used to address insufficient data
● Discuss the importance of sample size with reference to sample bias and
random samples
● Describe the relationship between data and related business objectives
● Define data integrity with reference to types and risks
● Discuss the importance of pre-cleaning activities

Data Integrity and analytics objectives


Why data integrity is important
Data Integrity
The accuracy, completeness, consistency, and trustworthiness of data throughout its
lifecycle.

Data replication
The process of storing data in multiple locations

Data Transfer
The process of copying data from a storage device to memory, or from one computer to
another.

Data Manipulation
The process of changing data to make it more organized and easier to read.

Other threats to data integrity:


- Human error
- Viruses
- Malware
- Hacking
- System Failures

Balancing objectives with data integrity

It's important to check that the data you use aligns with the business objective.

Well-aligned objectives and data


You can gain powerful insights and make accurate conclusions when data is well-aligned to
business objectives.
Clean data + alignment to business objective =
accurate conclusions

Overcoming The Challenges of Insufficient Data


Dealing with Insufficient Data
What you can do when you have insufficient data

Types of insufficient data


- Data from only one source
- Data that keeps updating
- Outdated data
- Geographically-limited data

Ways to address insufficient data


- Identify trends with the available data
- Wait for more data if time allows
- Talk with stakeholders and adjust your objective
- Look for a new dataset

A decision tree (figure not reproduced in these notes) serves as a reminder of how to deal
with data errors or not enough data.
The importance of sample size
Random Sampling
A way of selecting a sample from a population so that every possible type of sample has an
equal chance of being chosen.
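
As a rough illustration of random sampling, the sketch below draws a sample without replacement using Python's standard library. The population of customer IDs, the sample size, and the function name are hypothetical and only serve the example.

```python
import random


def simple_random_sample(population, sample_size, seed=None):
    """Draw a simple random sample without replacement.

    Every subset of the requested size is equally likely to be chosen.
    """
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(list(population), sample_size)


# Hypothetical population: customer IDs 1 through 1000.
customer_ids = range(1, 1001)
sample = simple_random_sample(customer_ids, sample_size=50, seed=42)
print(len(sample))  # 50
```

Passing a seed is optional; it is used here only so the same sample can be drawn again when checking results.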

Testing Your Data


Using Statistical Power
Statistical Power
The probability of getting meaningful results from a test.

Hypothesis Testing
A way to see if a survey or experiment has meaningful results.

If a test is statistically significant, it means the results of the test are real and not an error
caused by random chance.

Usually, you need a statistical power of at least 0.8 or 80% to consider your results statistically
significant.
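
To make the 0.8 threshold more concrete, here is a minimal sketch that approximates the power of a two-sided, one-sample z-test. The choice of test, the effect size of 0.5, and the function names are assumptions made for illustration; the notes do not specify a particular test.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(effect_size, sample_size, alpha=0.05):
    """Approximate power of a two-sided one-sample z-test.

    effect_size is the true difference divided by the standard deviation.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # critical value for the test
    shift = effect_size * sqrt(sample_size)     # how far the truth sits from the null
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


def smallest_sample_for_power(effect_size, target_power=0.8, alpha=0.05):
    """Smallest sample size whose power reaches the target (simple search)."""
    n = 2
    while z_test_power(effect_size, n, alpha) < target_power:
        n += 1
    return n


# Assumed effect size of 0.5 (a hypothetical value).
print(smallest_sample_for_power(0.5))
```

Under these assumptions, a larger sample or a larger effect size raises the power, which is why sample size and statistical power are planned together.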

Determine the Best Sample Size


Confidence Level
The probability that your sample accurately reflects the greater population.
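
One common way to estimate a sample size is the formula for a proportion with a finite population correction. The sketch below assumes that formula and a worst-case proportion of 0.5; the function name and the example values are hypothetical, and online sample size calculators may use slightly different conventions.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(population_size, confidence_level=0.95,
                         margin_of_error=0.05, proportion=0.5):
    """Estimate the sample size needed for a given margin of error."""
    # z-score that matches the confidence level (about 1.96 for 95%).
    z = NormalDist().inv_cdf((1 + confidence_level) / 2)
    # Sample size if the population were effectively unlimited.
    n_unlimited = z ** 2 * proportion * (1 - proportion) / margin_of_error ** 2
    # Shrink the estimate for a finite population.
    n_adjusted = n_unlimited / (1 + (n_unlimited - 1) / population_size)
    return ceil(n_adjusted)


# Hypothetical population of 500 people, 95% confidence, 5% margin of error.
print(required_sample_size(500))  # 218
```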

Consider the Margin of Error


Evaluate the Reliability of Your Data
Margin of Error
The maximum that the sample results are expected to differ from those of the actual population.

To calculate margin of error you need:


- Population Size
- Sample Size
- Confidence Level
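
The three inputs above can be combined as in the following sketch. It assumes the standard margin-of-error formula for a proportion with a finite population correction and a worst-case proportion of 0.5; the function name and the example numbers are illustrative only.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(population_size, sample_size,
                    confidence_level=0.95, proportion=0.5):
    """Approximate margin of error for a sampled proportion."""
    # z-score that matches the confidence level (about 1.96 for 95%).
    z = NormalDist().inv_cdf((1 + confidence_level) / 2)
    standard_error = sqrt(proportion * (1 - proportion) / sample_size)
    # Finite population correction: sampling a large share of the
    # population reduces the uncertainty.
    correction = sqrt((population_size - sample_size) / (population_size - 1))
    return z * standard_error * correction


# Hypothetical survey: 100 responses out of a population of 500.
print(round(margin_of_error(500, 100), 3))  # 0.088
```

In this hypothetical case the sample results would be expected to fall within roughly 9 percentage points of the population value at a 95% confidence level.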
Week 2 - Sparkling Clean Data

Sparkling Clean Data


Course 4.2 Learning Objectives:
● Differentiate between clean and dirty data
● Explain the characteristics of dirty data
● Describe data cleaning techniques with reference to identifying errors,
redundancy, compatibility, and continuous monitoring
● Identify common pitfalls when cleaning data
● Demonstrate an understanding of the use of spreadsheets to clean data

Data Cleaning is a Must


Clean it up!
Dirty data
Data that are incomplete, incorrect, or irrelevant to the problem you're trying to solve.

Clean data
Data that are complete, correct, and relevant to the problem you're trying to solve.

Why data cleaning is important


Data Engineers
Transform data into a useful format for analysis and give it a reliable infrastructure. This
means they develop, maintain, and test databases, data processors and related systems.

Data warehousing specialists


Develop processes and procedures to effectively store and organize data. They make sure that
data is available, secure, and backed up to prevent loss.

Begin Cleaning Data


Data-cleaning tools and techniques
Before removing anything, make a copy of the dataset.
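
The copy-first habit can be sketched in Python as follows. The records and the cleaning rules (drop exact duplicates and rows with missing values) are hypothetical and chosen only to show the pattern.

```python
import copy


def clean_copy(rows):
    """Return a cleaned copy of rows, leaving the original list untouched."""
    working_rows = copy.deepcopy(rows)  # clean the copy, never the original
    seen = set()
    cleaned = []
    for row in working_rows:
        key = tuple(sorted(row.items(), key=lambda item: item[0]))
        # Skip exact duplicates and records with missing values.
        if key in seen or None in row.values():
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


# Hypothetical raw data with one duplicate and one incomplete record.
raw_rows = [
    {"name": "Ana", "city": "Lima"},
    {"name": "Ana", "city": "Lima"},
    {"name": "Ben", "city": None},
]
cleaned_rows = clean_copy(raw_rows)
print(len(raw_rows), len(cleaned_rows))  # 3 1
```

Because the original list is never modified, any row removed by mistake can still be recovered from it.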

Cleaning Data in Spreadsheets


Data-cleaning features in spreadsheets
Conditional Formatting
A spreadsheet tool that changes how cells appear when values meet specific conditions.

Text String
A group of characters within a cell, most often composed of letters.
Split
A tool that divides a text string around the specified character and puts each fragment into a
new and separate cell. Split is helpful when you have more than one piece of data in a cell
and you want to separate them out.

Specified Text Separator = Delimiter

CONCATENATE
A function that joins multiple text strings into a single string.
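
For comparison, the same two ideas (splitting on a delimiter and joining strings) look like this in Python. The sample value and variable names are made up for the example.

```python
# Hypothetical cell value holding two pieces of data separated by a comma.
full_name = "Garcia,Maria"

# Split: divide the string around the delimiter into separate values.
last_name, first_name = full_name.split(",")

# Concatenate: join several strings into a single string.
display_name = first_name + " " + last_name

print(display_name)  # Maria Garcia
```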

Optimize the data-cleaning process


Spreadsheet functions:
- COUNTIF is a function that returns the number of cells that match a specified value.
- LEN is a function that tells you the length of the text string by counting the number of
characters it contains.
- LEFT is a function that gives you a set number of characters from the left side of a
text string.
- RIGHT is a function that gives you a set number of characters from the right side of a
text string.
- MID is a function that gives you a segment from the middle of a text string.
- TRIM is a function that removes leading, trailing, and repeated spaces in data.
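
The behavior of these spreadsheet functions can be approximated in Python as shown below. The helper names are hypothetical stand-ins written for this sketch, not spreadsheet or library APIs, and details such as how whitespace other than plain spaces is handled may differ between spreadsheet programs.

```python
def countif(values, target):
    """Count the values that match the target, like COUNTIF."""
    return sum(1 for value in values if value == target)


def left(text, count):
    """Return the first `count` characters, like LEFT."""
    return text[:count]


def right(text, count):
    """Return the last `count` characters, like RIGHT."""
    return text[-count:] if count > 0 else ""


def mid(text, start, count):
    """Return `count` characters from a 1-based start position, like MID."""
    return text[start - 1:start - 1 + count]


def trim(text):
    """Remove leading, trailing, and repeated whitespace, like TRIM."""
    return " ".join(text.split())


product_code = "CA-1234"  # hypothetical value
print(len(product_code))          # 7, the equivalent of LEN
print(left(product_code, 2))      # CA
print(right(product_code, 4))     # 1234
print(mid(product_code, 4, 2))    # 12
print(trim("  New   York "))      # New York
print(countif(["a", "b", "a"], "a"))  # 2
```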
Week 3 - Cleaning Data with SQL

Cleaning Data with SQL


Course 4.3 Learning Objectives:
● Describe how SQL can be used to clean large datasets
● Compare spreadsheet data-cleaning functions to those associated with SQL
in databases
● Develop basic SQL queries for use with databases
● Apply basic SQL functions for use in cleaning string variables in a database
● Apply basic SQL functions for transforming data variables

Using SQL to Clean Data


Understanding SQL capabilities
SQL
Structured Query Language, a language that analysts use to work with databases. SQL can
process large amounts of data much more quickly than spreadsheets. This is one of
the reasons why data analysts use SQL when working with vast, complex datasets.
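
As a small illustration of SQL being used for cleaning, the sketch below runs a query against an in-memory SQLite database through Python's built-in sqlite3 module. The table, columns, and rows are hypothetical, and SQLite is an assumption made so the example is self-contained; function names can vary slightly between SQL dialects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER, name TEXT, country TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (1, "  Ana Diaz ", "US"),   # extra spaces around the name
        (1, "  Ana Diaz ", "US"),   # duplicate row
        (2, "Ben Okoro", "USA"),    # inconsistent country code length
    ],
)

# DISTINCT removes duplicates, TRIM strips extra spaces,
# SUBSTR keeps the first two characters, LENGTH filters empty names.
rows = conn.execute(
    """
    SELECT DISTINCT
        customer_id,
        TRIM(name) AS name,
        SUBSTR(country, 1, 2) AS country
    FROM customers
    WHERE LENGTH(TRIM(name)) > 0
    ORDER BY customer_id
    """
).fetchall()
conn.close()

print(rows)  # [(1, 'Ana Diaz', 'US'), (2, 'Ben Okoro', 'US')]
```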

Spreadsheets vs SQL
Features of Spreadsheets                             | Features of SQL Databases
Smaller datasets                                     | Larger datasets
Enter data manually                                  | Access tables across a database
Create graphs and visualizations in the same program | Prepare data for further analysis in another software
Built-in spell check and other useful functions      | Fast and powerful functionality
Best when working solo on a project                  | Great for collaborative work and tracking queries run by all users

Spreadsheets and SQL Similarities


In SQL, just as in spreadsheets, you can still:
1. Perform arithmetic
2. Use formulas
3. Join data
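
These shared capabilities can be sketched with a single query run through Python's built-in sqlite3 module. The tables, columns, and values are hypothetical, and SQLite is used only so the example runs without any setup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
    CREATE TABLE orders (
        order_id INTEGER, customer_id INTEGER,
        quantity INTEGER, unit_price REAL
    );
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben');
    INSERT INTO orders VALUES
        (10, 1, 2, 5.0), (11, 1, 1, 3.0), (12, 2, 4, 2.5);
    """
)

# Arithmetic (quantity * unit_price), a function (SUM),
# and a join between the two tables in one query.
totals = conn.execute(
    """
    SELECT c.name, SUM(o.quantity * o.unit_price) AS total_spent
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
conn.close()

print(totals)  # [('Ana', 13.0), ('Ben', 10.0)]
```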

Why Standard SQL?


Standard SQL works with a majority of databases and requires a small number of syntax
changes to adapt to other dialects.

Advanced SQL
