Techniques Used to Transform Data, Part 1

Uploaded by

Constant HOUEHA

Techniques used to transform data, part 1

Data transformation is the process of taking raw or inconsistent data and converting it into a
usable format for analysis and visualization. In this reading, you’ll learn more about data
transformation techniques for aggregation, deduplication, derivation, and filtering, and how
you can perform them in SQL.

Data aggregation
Data aggregation is the process of combining data from multiple sources and summarizing it
to provide general insights, or report overall statistics. Common aggregation functions in SQL
are SUM, COUNT, AVG, MAX, and MIN.

SUM and COUNT

The SUM function returns the total sum of a numeric column, like sales or expenditure for the
month, year, or quarter.

This query uses the SUM function to calculate the sum of the values in the sales_amount
column of the total_sales table. The SELECT statement returns the total, which is aliased as
total_sales_amount.

Unset
SELECT SUM(sales_amount) AS total_sales_amount
FROM total_sales;

If the total_sales table contains these rows of data:

Unset
| sales_amount |
|--------------|
| 100000 |
| 200000 |
| 300000 |
| 400000 |

The query will return this result:

Unset
| total_sales_amount |
|--------------------|
| 1000000 |

The COUNT function returns the number of rows that match specific criteria. These rows can
contain numeric and non-numeric values.

Suppose you want to count the number of customers you’ve served in the last quarter to make a
projection for the next quarter. The customer_id column contains alphanumeric values.
The SELECT statement counts the non-NULL values in the customer_id column of the customers
table, and the result is aliased as number_of_customers.

Unset
SELECT COUNT(customer_id) AS number_of_customers
FROM customers;

If the customers table contains these rows of data:

Unset
| customer_id |
|-------------|
| CUST001 |
| CUST002 |
| CUST003 |
| CUST004 |
| CUST005 |

The query will return this result:

Unset
| number_of_customers |
|---------------------|
| 5 |
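
One detail worth checking in your own database: COUNT(column) counts only the rows where that column is not NULL, while COUNT(*) counts every row. The sketch below illustrates the difference using Python's built-in sqlite3 module and an in-memory copy of the customers table. The sixth row, with a missing customer_id, is a hypothetical addition for illustration and is not part of the sample data above.

```python
import sqlite3

# In-memory database; the final NULL row is a hypothetical addition.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id TEXT)")
conn.executemany(
    "INSERT INTO customers (customer_id) VALUES (?)",
    [("CUST001",), ("CUST002",), ("CUST003",), ("CUST004",), ("CUST005",), (None,)],
)

# COUNT(customer_id) skips the NULL row; COUNT(*) counts all six rows.
non_null_ids, all_rows = conn.execute(
    "SELECT COUNT(customer_id), COUNT(*) FROM customers"
).fetchone()

print(non_null_ids, all_rows)  # 5 6
conn.close()
```

If the column you count can contain missing values, pick the form that matches the question you are asking.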

SUM and COUNT both return a number, but they compute it differently. The SUM function adds
the numeric values in a column and returns their total. The COUNT function returns the number
of rows counted, not the sum of the values in their cells.

SUM and COUNT can be used together in one query. Consider a scenario where you want to
know how many orders customers made, and the total revenue from those orders. The
SELECT statement counts the values in the order_id column, aliased as number_of_orders, and
sums the values in the total_price column, aliased as total_revenue.

Unset
SELECT
COUNT(order_id) AS number_of_orders,
SUM(total_price) AS total_revenue
FROM orders;

If the orders table contains these rows of data:

Unset
| order_id | customer_id | total_price |
|----------|-------------|-------------|
| 1 | CUST001 | 100 |
| 2 | CUST002 | 150 |
| 3 | CUST003 | 200 |
| 4 | CUST001 | 50 |
| 5 | CUST004 | 250 |

The query will return this result:

Unset
| number_of_orders | total_revenue |
|------------------|---------------|
| 5 | 750 |

AVG

The AVG function returns the mean (average) value of a numeric column. Analysts use averages
to make comparisons, identify trends and patterns, and set benchmarks.

An analyst might use the AVG function to calculate a student satisfaction score for different
class sessions. The SELECT statement calculates the average of the student_rating column
from the students table, and aliases the result as student_satisfaction.

Unset
SELECT AVG(student_rating) AS student_satisfaction
FROM students;

If the table contains these rows of data:

Unset
| student_id | student_rating |
|------------|----------------|
| ANSM1995 | 5 |
| JOLE1998 | 4 |
| SARO2000 | 4 |
| MATO1997 | 3 |
| KIST2002 | 5 |

The query will return this result:

Unset
| student_satisfaction |
|----------------------|
| 4.2 |
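
To get a separate score for each class session, AVG can be combined with GROUP BY. The sketch below runs in Python's built-in sqlite3 module against the sample ratings. The class_session column and its values ('A' and 'B') are assumptions added for illustration; the students table above does not include them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# class_session is a hypothetical column, not part of the sample table.
conn.execute(
    "CREATE TABLE students (student_id TEXT, class_session TEXT, student_rating INTEGER)"
)
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [
        ("ANSM1995", "A", 5),
        ("JOLE1998", "A", 4),
        ("SARO2000", "B", 4),
        ("MATO1997", "B", 3),
        ("KIST2002", "B", 5),
    ],
)

# Overall average, as in the query above.
overall = conn.execute(
    "SELECT AVG(student_rating) AS student_satisfaction FROM students"
).fetchone()[0]

# One average per session.
per_session = conn.execute(
    """
    SELECT class_session, AVG(student_rating) AS student_satisfaction
    FROM students
    GROUP BY class_session
    ORDER BY class_session
    """
).fetchall()

print(overall)      # 4.2
print(per_session)  # [('A', 4.5), ('B', 4.0)]
conn.close()
```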

MAX & MIN

The MAX and MIN functions in SQL return the largest and smallest values in a column. They
can be used on numeric and non-numeric values. For numeric values, the MAX and MIN
functions return the largest and smallest numbers in the column, whether they are integers
or floats; for dates, they return the latest and earliest dates. In this example, the SELECT
statement calculates the maximum and minimum grade received by students to determine the
range of scores, aliased as highest_score and lowest_score.

Unset
SELECT MAX(grade) AS highest_score, MIN(grade) AS lowest_score
FROM students;

If the table contains these rows of data:

Unset
| student_id | grade |
|------------|-------|
| ANSM1995 | 85 |
| JOLE1998 | 90 |
| SARO2000 | 78 |
| MATO1997 | 88 |
| KIST2002 | 82 |

The query will return this result:

Unset
| highest_score | lowest_score |
|---------------|--------------|
| 90 | 78 |

Note: MAX and MIN can be used to find the first or last string values using alphabetical order.
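
As a small illustration of the note above, the sketch below applies MIN and MAX to the student_id strings from the sample table, using Python's built-in sqlite3 module. The aliases first_id and last_id are made up for this example. How strings are ordered depends on the database's collation settings, so results can differ when values mix upper and lower case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (student_id TEXT, grade INTEGER)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?)",
    [
        ("ANSM1995", 85),
        ("JOLE1998", 90),
        ("SARO2000", 78),
        ("MATO1997", 88),
        ("KIST2002", 82),
    ],
)

# On text columns, MIN and MAX compare strings rather than numbers.
first_id, last_id = conn.execute(
    "SELECT MIN(student_id) AS first_id, MAX(student_id) AS last_id FROM students"
).fetchone()

print(first_id, last_id)  # ANSM1995 SARO2000
conn.close()
```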

Data deduplication
The purpose of data deduplication is to identify and remove duplicates from a dataset.
Duplicates in a dataset negatively impact analysis by skewing results, making outcomes or
insights inaccurate. Analysts use a combination of the WHERE clause, the DISTINCT keyword,
and the GROUP BY clause in SQL for deduplication:

WHERE

The WHERE clause is used to filter rows based on specific condition(s). You can use WHERE to
locate all instances of duplicates based on the condition set for the query.

An analyst has a list of employee data containing information about everyone in the
organization. The analyst would like to check for duplicate employees in the sales department
to ensure the most up-to-date information. To be sure the entries are actually duplicates, the
SELECT statement queries both the first and last name of employees, and the employee_id in
the sales department.

Unset
SELECT employee_id, first_name, last_name
FROM employees
WHERE department = 'Sales';

If the table contains these rows of data:

Unset
| employee_id | first_name | last_name | department |
|-------------|------------|-----------|------------|
| E001 | John | Doe | Sales |
| E002 | Jane | Smith | HR |
| E003 | Alice | Johnson | Sales |
| E004 | Bob | White | Marketing |
| E001 | John | Doe | Sales |
| E006 | Charlie | Brown | Sales |
| E007 | John | Thompson | IT |

The query returns every row for the sales department, including the duplicate entry for
John Doe:

Unset
| employee_id | first_name | last_name |
|-------------|------------|-----------|
| E001 | John | Doe |
| E003 | Alice | Johnson |
| E001 | John | Doe |
| E006 | Charlie | Brown |

The analyst can then delete the duplicates and update the table.

DISTINCT

DISTINCT is used to remove duplicate rows from a result set. Analysts use DISTINCT in
combination with WHERE to drill down on their data and identify duplicates.

An analyst is working with this employees table. It lists four rows for the sales
department, but one of them is a duplicate entry for John Doe.

Unset
| employee_id | first_name | last_name | department |
|-------------|------------|-----------|------------|
| E001 | John | Doe | Sales |
| E002 | Jane | Smith | HR |
| E003 | Alice | Johnson | Sales |
| E004 | Bob | White | Marketing |
| E001 | John | Doe | Sales |
| E006 | Charlie | Brown | Sales |
| E007 | John | Thompson | IT |

To query the results with no duplicates, the analyst adds the DISTINCT keyword to the SELECT
statement, so that rows with identical employee_id, first_name, and last_name values are
returned only once.

Unset
SELECT DISTINCT employee_id, first_name, last_name
FROM employees
WHERE department = 'Sales';

The query returns this output:

Unset
| employee_id | first_name | last_name |
|-------------|------------|-----------|
| E001 | John | Doe |
| E003 | Alice | Johnson |
| E006 | Charlie | Brown |

Note that this is the result of a specific query. The analyst will need to update the table to
remove the duplicate permanently.
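
One possible way to remove the duplicate permanently is sketched below with Python's built-in sqlite3 module. It relies on SQLite's implicit rowid column to tell identical rows apart, which is an assumption specific to SQLite; other databases need a different approach, so treat this as an illustration rather than a general recipe.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees "
    "(employee_id TEXT, first_name TEXT, last_name TEXT, department TEXT)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        ("E001", "John", "Doe", "Sales"),
        ("E002", "Jane", "Smith", "HR"),
        ("E003", "Alice", "Johnson", "Sales"),
        ("E004", "Bob", "White", "Marketing"),
        ("E001", "John", "Doe", "Sales"),
        ("E006", "Charlie", "Brown", "Sales"),
        ("E007", "John", "Thompson", "IT"),
    ],
)

# Keep the first copy (lowest rowid) of each identical row and delete the rest.
# rowid is SQLite-specific.
conn.execute(
    """
    DELETE FROM employees
    WHERE rowid NOT IN (
        SELECT MIN(rowid)
        FROM employees
        GROUP BY employee_id, first_name, last_name, department
    )
    """
)
conn.commit()

remaining = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]
john_doe_rows = conn.execute(
    "SELECT COUNT(*) FROM employees WHERE employee_id = 'E001'"
).fetchone()[0]

print(remaining, john_doe_rows)  # 6 1
conn.close()
```

Before running a DELETE like this on real data, it is safer to run the inner SELECT first and confirm which rows would be kept.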

GROUP BY

GROUP BY groups rows that have the same values in specified columns into summary rows.
GROUP BY is ideal for identifying duplicates within groups. If an analyst has this employees
table, and wants to count the number of employees by department, they can use the GROUP
BY clause.

Unset
SELECT department, COUNT(DISTINCT employee_id) AS unique_employee_count
FROM employees
GROUP BY department;

The SELECT statement combines the COUNT function with the DISTINCT keyword to ensure that
duplicates are not counted.

If the table contains these rows of data:

Unset
| employee_id | first_name | last_name | department |
|-------------|------------|-----------|------------|
| E001 | John | Doe | Sales |
| E002 | Jane | Smith | HR |
| E003 | Alice | Johnson | Sales |
| E004 | Bob | White | Marketing |
| E001 | John | Doe | Sales |
| E005 | Charlie | Brown | Sales |
| E006 | John | Thompson | IT |

The query will result in this output:

Unset
| department | unique_employee_count |
|------------|-----------------------|
| Sales | 3 |
| HR | 1 |
| Marketing | 1 |
| IT | 1 |
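
GROUP BY can also list the duplicated rows themselves. The sketch below groups on the identifying columns and keeps only the groups that occur more than once. It uses the HAVING clause, which this reading does not cover, so treat it as a supplementary illustration. The copies alias is made up for the example, and the code runs in Python's built-in sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees "
    "(employee_id TEXT, first_name TEXT, last_name TEXT, department TEXT)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        ("E001", "John", "Doe", "Sales"),
        ("E002", "Jane", "Smith", "HR"),
        ("E003", "Alice", "Johnson", "Sales"),
        ("E004", "Bob", "White", "Marketing"),
        ("E001", "John", "Doe", "Sales"),
        ("E005", "Charlie", "Brown", "Sales"),
        ("E006", "John", "Thompson", "IT"),
    ],
)

# Group identical employees together, then keep only groups with more than one row.
duplicates = conn.execute(
    """
    SELECT employee_id, first_name, last_name, COUNT(*) AS copies
    FROM employees
    GROUP BY employee_id, first_name, last_name
    HAVING COUNT(*) > 1
    """
).fetchall()

print(duplicates)  # [('E001', 'John', 'Doe', 2)]
conn.close()
```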

Data derivation
Data derivation is the process of deriving new values from other existing data. For data
derivation, you can use the CASE statement, a conditional expression that allows you to derive
values based on specific criteria.

CASE can be used in a variety of ways to extrapolate data by: creating new columns based on
existing column values, customizing the sorting of results, and replacing or transforming data
values based on specific criteria.

You can use a CASE statement to create a grading scale for test scores. Instead of listing
information by each individual score, the CASE statement enables you to create ranges of
values, and assign the data to groups of your choosing. The SELECT statement returns each
student’s first name, last name, and grade, and labels each row with a letter based on the
grade:

Unset
SELECT
  first_name,
  last_name,
  grade,
  CASE
    WHEN grade < 60 THEN 'F'
    WHEN grade BETWEEN 60 AND 69 THEN 'D'
    WHEN grade BETWEEN 70 AND 79 THEN 'C'
    WHEN grade BETWEEN 80 AND 89 THEN 'B'
    ELSE 'A'
  END AS grading_scale
FROM students;

If the table contains these rows of data:

Unset
| student_id | first_name | last_name | grade |
|------------|------------|-----------|-------|
| S001 | John | Doe | 85 |
| S002 | Jane | Smith | 58 |
| S003 | Alice | Johnson | 78 |
| S004 | Bob | White | 68 |
| S005 | Charlie | Brown | 95 |

The query will result in this output:

Unset
| first_name | last_name | grade | grading_scale |
|------------|-----------|-------|---------------|
| John | Doe | 85 | B |
| Jane | Smith | 58 | F |
| Alice | Johnson | 78 | C |
| Bob | White | 68 | D |
| Charlie | Brown | 95 | A |

The CASE statement is a powerful tool for adding logic to your queries and deriving new
information based on existing data.
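
CASE can also customize the sorting of results, as mentioned earlier in this section. The sketch below places CASE inside ORDER BY so that students with a grade below 60 are listed first, followed by everyone else in alphabetical order by last name. This sorting rule is a hypothetical example chosen for illustration. The code runs in Python's built-in sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students "
    "(student_id TEXT, first_name TEXT, last_name TEXT, grade INTEGER)"
)
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?, ?)",
    [
        ("S001", "John", "Doe", 85),
        ("S002", "Jane", "Smith", 58),
        ("S003", "Alice", "Johnson", 78),
        ("S004", "Bob", "White", 68),
        ("S005", "Charlie", "Brown", 95),
    ],
)

# The CASE expression yields 0 for grades below 60 and 1 otherwise, so rows
# with 0 sort first; last_name breaks ties within each group.
rows = conn.execute(
    """
    SELECT first_name, last_name, grade
    FROM students
    ORDER BY
      CASE WHEN grade < 60 THEN 0 ELSE 1 END,
      last_name
    """
).fetchall()

sorted_last_names = [last_name for _, last_name, _ in rows]
print(sorted_last_names)  # ['Smith', 'Brown', 'Doe', 'Johnson', 'White']
conn.close()
```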

Data filtering
Data filtering allows you to select specific portions of your dataset based on certain
conditions. The WHERE clause is helpful for setting a condition in your query. You can filter
your data even further by using AND or OR operators.

The AND operator requires both WHERE clause conditions to be true. OR requires that at least
one condition is true. The WHERE clause condition you choose will determine the outcome of
the query.

An analyst wants to determine which employees exceeded their sales quota and are in a senior
position. The SELECT statement queries the first and last names, sales amount, and title from
the employees table. The WHERE clause specifies that the result should only include records
for employees who are in a senior position, and have surpassed their sales quota. Both
conditions must be met in order to be included in the results.

Unset
SELECT first_name, last_name, sales, title
FROM employees
WHERE sales > quota AND title = 'Senior';

If the table contains these rows of data:

Unset
| employee_id | first_name | last_name | sales | quota | title |
|-------------|------------|-----------|-------|-------|--------|
| E001 | John | Doe | 5000 | 4500 | Senior |
| E002 | Jane | Smith | 4200 | 4500 | Junior |
| E003 | Alice | Johnson | 4800 | 4500 | Senior |
| E004 | Bob | White | 4600 | 4500 | Senior |
| E005 | Charlie | Brown | 4300 | 4200 | Junior |
| E006 | Daisy | Green | 4400 | 4500 | Senior |

The query will output these results:

Unset
| first_name | last_name | sales | title |
|------------|-----------|-------|--------|
| John | Doe | 5000 | Senior |
| Alice | Johnson | 4800 | Senior |
| Bob | White | 4600 | Senior |

The analyst modifies the SELECT statement by adjusting the WHERE clause to include
employees that exceeded their sales quota, OR hold a senior position.

Unset
SELECT first_name, last_name, sales, title
FROM employees
WHERE sales > quota OR title = 'Senior';

The output is noticeably different. Only one condition needs to be true for a row to be
included, so Charlie Brown (over quota, but junior) and Daisy Green (senior, but under quota)
now appear. Only Jane Smith, who meets neither condition, is excluded.

Unset
| first_name | last_name | sales | title |
|------------|-----------|-------|--------|
| John | Doe | 5000 | Senior |
| Alice | Johnson | 4800 | Senior |
| Bob | White | 4600 | Senior |
| Charlie | Brown | 4300 | Junior |
| Daisy | Green | 4400 | Senior |

Whether you use AND or OR depends on the result you are trying to achieve.
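
To compare the two filters side by side, the sketch below loads the six sample employees into an in-memory SQLite database through Python's built-in sqlite3 module and runs both WHERE clauses. The ORDER BY employee_id clause is an addition that keeps the row order predictable; it is not part of the queries above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_id TEXT, first_name TEXT, last_name TEXT, "
    "sales INTEGER, quota INTEGER, title TEXT)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("E001", "John", "Doe", 5000, 4500, "Senior"),
        ("E002", "Jane", "Smith", 4200, 4500, "Junior"),
        ("E003", "Alice", "Johnson", 4800, 4500, "Senior"),
        ("E004", "Bob", "White", 4600, 4500, "Senior"),
        ("E005", "Charlie", "Brown", 4300, 4200, "Junior"),
        ("E006", "Daisy", "Green", 4400, 4500, "Senior"),
    ],
)

# AND: both conditions must hold for a row to be returned.
and_names = [
    row[0]
    for row in conn.execute(
        "SELECT first_name FROM employees "
        "WHERE sales > quota AND title = 'Senior' ORDER BY employee_id"
    )
]

# OR: a row is returned when at least one condition holds.
or_names = [
    row[0]
    for row in conn.execute(
        "SELECT first_name FROM employees "
        "WHERE sales > quota OR title = 'Senior' ORDER BY employee_id"
    )
]

print(and_names)  # ['John', 'Alice', 'Bob']
print(or_names)   # ['John', 'Alice', 'Bob', 'Charlie', 'Daisy']
conn.close()
```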

Key takeaways
In this reading, you learned about data aggregation, deduplication, derivation and filtering in
SQL, and explored specific examples of how to perform these functions. These data
transformation techniques help analysts correct errors and reduce unnecessary details,
making the data usable and accessible.
