0% found this document useful (0 votes)
6 views

An_Introduction_to_SQL_1731971471

Uploaded by

Talal Ben
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
6 views

An_Introduction_to_SQL_1731971471

Uploaded by

Talal Ben
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 57

An introduction to SQL

Alkistis Pourtsidou
Queen Mary, University of London

Image credit: Hayden Planetarium, 2014


WHAT IS SQL?

✦ SQL stands for Structured Query Language

✦ Used to query (“talk to”) a database server

✦ Data manipulation, database creation

✦ Almost all companies use databases to store their data

✦ Have a look at the Indeed Job Search Engine…tens of


thousands of jobs mentioning SQL!

✦ https://www.indeed.co.uk/
WHY DO WE NEED DATABASES?

✦ Concurrency: multiple simultaneous changes to data

✦ Data changes regularly

✦ Large data sets but only need subsets

✦ Sharing large data sets

✦ Rapid queries

✦ Data web interfaces (dynamic data)


WAYS TO USE SQL

✦ Standard console command (e.g. mysql -u user -p dbname)

✦ GUI interfaces often available

✦ Interfaces to many programming languages (Python, R, …)

✦ SQLite - use SQL without a database server: this is what we


are going to use in our tutorial

✦ PostgreSQL - the world’s most advanced open source


database: we’ll see how that works later on
MORE ABOUT DATABASES

✦ A database server can contain many databases

✦ Basically, databases are collections of tables with rows


(observations) and columns (variables)

✦ Limited mathematical operations available

✦ Very good at combining information from several related tables

✦ We’ll explore the above (and more) in detail


duction to
SQL Finding Your Way Around
EXPLORING the Server
THE SERVER
✦ A given server can support multiple databases

✦ EachSince
database contains
a single servermany tables many databases, each
can support
containing many tables, with each table having a variety
✦ Each table contains many columns
of columns, it’s easy to get lost when you’re working with
databases.
✦ But keeping These
things commands
under control is will help figure out what’s
straightforward!
available:
I SHOW DATABASES;
I SHOW TABLES IN database;
I SHOW COLUMNS IN table;
I DESCRIBE table; - shows the columns and their
types
duction to
SQL Variable Types
VARIABLE TYPES
✦ SQLSQL
supports a variety
supports a veryoflarge
different
numberformats for storing
of di↵erent information
formats for
internal storage of information.

Numeric
I INTEGER, SMALLINT, BIGINT
I NUMERIC(w,d), DECIMAL(w,d) - numbers with width
w and d decimal places
I REAL, DOUBLE PRECISION - machine and database
dependent
I FLOAT(p) - floating point number with p binary
digits of precision
VARIABLE TYPES
uction to
SQL Variable Types (cont’d)
✦ SQL supports a variety of different formats for storing information

Character
I CHARACTER(L) - a fixed-length character of length L
I CHARACTER VARYING(L) or VARCHAR(L) - supports
maximum length of L
Binary
I BIT(L), BIT VARYING(L) - like corresponding
characters
I BINARY LARGE OBJECT(L) or BLOB(L)
Temporal
I DATE
I TIME
I TIMESTAMP
SQL: THE BASICS
✦ Enough intro! Let’s dive into SQL with hands-on examples

✦ We will use SQLite3, which is part of Python, for details see


https://www.pythoncentral.io/introduction-to-sqlite-in-python/

✦ SQLite is an embedded SQL database engine. It doesn’t have a


separate server process, which makes it really easy to use,
immediately.
✦ Using Python (in the form of a Jupyter notebook) to run our SQL
code allows us to use Pandas for importing our results and make
everything look nice and clear!
SQL: THE BASICS
SQL: THE BASICS

✦ Let’s first define a function that takes our query, stored as a string, as an input.

✦ Then shows the result as a formatted data frame (we’ll see this in action in a
bit…)
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS

✦ As we saw, the database has two tables, TRIPS and


STATIONS. We will first work with the TRIPS table. Let’s see
what kind of information it contains:
SQL: THE BASICS

✦ See all columns:


SQL: THE BASICS

✦ See specific columns:


SQL: THE BASICS
SQL: THE BASICS

✦ DESC: Descending
SQL: THE BASICS

✦ The longest trip lasts


a bit less than 3
hours.
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS

✦ So far we’ve been looking at queries pulling data from the


TRIPS table

✦ But as you might remember there’s also the STATIONS table

✦ The STATIONS table contains information about every station


in the Hubway network

✦ It also includes an id column referenced by the TRIPS table

✦ So these tables can be combined to extract useful information

✦ Let’s have a look…


SQL: THE BASICS

ID is a unique identifier for each station, corresponding to the


start_station and end_station columns in the TRIPS table
SQL: THE BASICS
✦ Let’s say we want to know which station is the most
popular starting point

✦ For that we need to combine information from both the TRIPS


and STATIONS tables

✦ We will use the JOIN command


SQL: THE BASICS

✦ We will use SELECT to return the station column from the stations table using the
table.column syntax, i.e. stations.station in our case
✦ We also return the COUNT of the number of rows from the trips table

✦ To tell the database how the stations and trips tables are connected, we’ll use JOIN and
ON.
✦ JOIN specifies which tables should be connected

✦ ON specifies which columns in each table are related

✦ INNER JOIN means rows will only be returned where there is a much in the columns
specified by ON
✦ Tables are connected ON trips.start_station = stations.id

✦ Then we group by the station column so that COUNT will give the number of trips for
each station separately
✦ Finally we ORDER BY descending order
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS
✦ Let’s slightly expand this query to see which are the most
popular round-trip stations:
EXERCISES/TASKS

✦ Code up the queries we just learned in Jupyter and reproduce


the results

✦ How many trips lasted more than half an hour? (this induces
extra charges)

✦ Which bike was used for the least total time?

✦ Did registered or casual users take more round trips?

✦ Pick up any publicly available database and play with it!


PostgreSQL

✦ PostgreSQL: “the world’s most advanced open source relational


database
✦ Active development for 30 years now!

✦ www.postgresql.org

✦ Installation: For MAC OS I strongly recommend using


Postgress.app, see https://www.calhoun.io/how-to-install-
postgresql-9-6-on-mac-os-x/
✦ For other systems, see https://www.dataquest.io/blog/sql-
intermediate/ and https://www.systems.ethz.ch/sites/default/files/
ex1a_postgresql_jupyter_setup.pdf (not tested…)
PostgreSQL

✦ First we need to create new user, database, and tables

✦ Follow the instructions in https://www.dataquest.io/blog/sql-


intermediate/ to run psql, create a new user named ‘oracle’ (or
another name of your preference) and a new database

✦ The new database contains consumer complaints

✦ It has two tables: one for bank account complaints and one
for credit card complaints
PostgreSQL

✦ We need to populate these with actual data!

✦ We will use data from here https://data.world/dataquest/bank-


and-credit-card-complaints

✦ Again, follow the instructions in https://www.dataquest.io/blog/


sql-intermediate/ to load the data

✦ They are CSV files

✦ They have identical fields: complaint_id, date_received,


product, …, issue, consumer_complaint_narrative, etc.
PostgreSQL
✦ Before having a look and playing with the data, we need to
create two helper functions

✦ One to run queries and one to run commands


PostgreSQL
✦ Before having a look and playing with the data, we need to
create two helper functions

✦ One to run queries and one to run commands


PostgreSQL
✦ OK, now let’s test everything works OK.

✦ First let’s see how the credit card complaints table looks like.
PostgreSQL
✦ Then let’s get the number of records using the COUNT
function

✦ Works well! (try the bank account complaints table too)


PostgreSQL
✦ How to deal with NULL values

✦ Let’s see how many records in each table have null values for
the consumer complaint narrative field

✦ When comparing a column to null (no value), we cannot use


arithmetic operators. Instead we use IS NULL / IS NOT NULL.
PostgreSQL: Views
✦ So we just saw a large amount of records had null values for
the consumer complaint narrative field.

✦ Instead of having to filter on this field later, we’ll create a view


with this subset only.

✦ Syntax is simple:
PostgreSQL: Views
✦ Let’s have a look:
PostgreSQL: String Concatenation

✦ Extremely useful, combines two or more strings (text values)


together to form a single string

✦ For example say we have a “month” field and a “year” field but
we need to show “month-year” instead

✦ Syntax:

✦ Let’s try it out with our credit card complaints table

✦ Let’s select complaint_id, product, company, and concatenate


separated by a hyphen
PostgreSQL: String Concatenation
PostgreSQL: Subqueries
✦ Subqueries (“inline views”) create a mini view within a single
query

✦ The best way to understand how they work is via an example:


TASKS

✦ Read about UNION/UNION ALL, and put them in action using


different views of the banking data

✦ Same with INTERSECT/EXCEPT

✦ Explore subqueries, for example by reproducing the


“Subqueries in action” examples in https://www.dataquest.io/
blog/sql-intermediate/

✦ Go through the SQL/Pandas tutorial in https://


www.dataquest.io/blog/python-pandas-databases/
REFERENCES

✦ Introduction to SQL https://www.stat.berkeley.edu/~spector/


sql.pdf

✦ https://www.w3schools.com/sql/

✦ http://www.sql-tutorial.net/

✦ https://www.kaggle.com/learn/sql

✦ https://www.dataquest.io/blog/sql-basics/

✦ https://www.dataquest.io/blog/sql-intermediate/

✦ https://www.dataquest.io/blog/python-pandas-databases/

You might also like