
2 Data Formats Relational DB

The document discusses a data engineering lecture that covers commonly used file formats like CSV, JSON, and XML. It provides examples of each format and discusses how they are used to store and exchange data. It also recaps relational databases and SQL, including topics like schema design, normalization, and publishing SQL data as XML.


VC: 623.500
Data Formats, Relational DB, Advanced Queries

Julius Köpke, [email protected]

Data Engineering - WS 2023 - Julius Köpke 10/12/2023


Agenda

• Topics today:
  • Commonly used file formats
  • Recap on Relational Databases / RDBMS
  • Relational DB + XML
  • Relational DB + JSON
  • Advanced SQL Queries

Commonly used File Formats

• Text files
  • Comma Separated Values (CSV)
  • JSON
  • XML
• Binary, open formats
• Proprietary binary formats

CSV Files

• Plain text file.
• Fields separated by a comma or another delimiter.
• Typically, the first row contains the column names.
• Other rows contain data: one record per row.

MatNo, name, studyplan, lecture, grade
1, John Doe, CS, "Data Engineering", 1
2, Bob Byte, CS, "Data Engineering", 2
3, Alice Block, CS, "Data Engineering", 4
4, John Tuple, CS, "Data Engineering", 1

• Very simple.
• Human readable.
• Strings containing the separator require quoting.
• Quoting and delimiter handling may become complex.
• Good for exchange, less suited for search in large files. Only flat tabular structure.
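
The quoting point can be illustrated with a small sketch using Python's standard csv module. The sample data below is an assumption made for illustration: it follows the slide's columns, but the lecture title "Databases, Advanced" is invented so that one field contains the delimiter.

```python
import csv
import io

# Hypothetical sample: the second lecture title contains a comma,
# so it has to be quoted to remain a single field.
raw = '''MatNo,name,studyplan,lecture,grade
1,John Doe,CS,"Data Engineering",1
2,Bob Byte,CS,"Databases, Advanced",2
'''

# A CSV-aware reader honours the quotes and uses the first row as column names.
rows = list(csv.DictReader(io.StringIO(raw)))
lectures = [row["lecture"] for row in rows]

# A naive split on the delimiter tears the quoted field apart.
naive_fields = raw.splitlines()[2].split(",")

print(lectures)
print(len(naive_fields), "fields instead of 5")
```

Note that every value arrives as a string (the grade is "1", not 1): CSV carries no type information, so the application has to convert values itself.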
XML files

<?xml version="1.0" encoding="UTF-8"?>
<stations>
  <station id="1" name="Klagenfurt Flughafen" minC="-30" maxC="38"/>
  <station id="2" name="Klagenfurt Innenstadt" minC="-28" maxC="41"/>
  <station id="100" name="Innsbruck Flughafen" minC="-34" maxC="27"/>
</stations>

• A text file containing tags forming a tree.
• Suitable for structured or semi-structured data.
• Better standardization than CSV.
• Human readable.
• Very popular for data exchange (Web Services, Office file formats, Business Process Models, …).
• Usage (for Web Services) partially declining in favor of JSON.
• Large set of mature additional standards (XML-Schema, XPath, XQuery, XSLT, …).
• Quite verbose, good to read, less good to write (manually).
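
As a minimal sketch of how such a tree can be read programmatically, the station document can be parsed with Python's standard xml.etree.ElementTree. The XML declaration is omitted here for brevity, and the variable names are assumptions of this example.

```python
import xml.etree.ElementTree as ET

xml_text = """<stations>
  <station id="1" name="Klagenfurt Flughafen" minC="-30" maxC="38"/>
  <station id="2" name="Klagenfurt Innenstadt" minC="-28" maxC="41"/>
  <station id="100" name="Innsbruck Flughafen" minC="-34" maxC="27"/>
</stations>"""

root = ET.fromstring(xml_text)  # root element of the tree: <stations>

# Attribute values are always strings, so numeric attributes are converted explicitly.
max_by_station = {
    station.get("name"): int(station.get("maxC"))
    for station in root.iter("station")
}

# Station with the highest recorded maximum temperature.
hottest = max(max_by_station, key=max_by_station.get)
print(hottest, max_by_station[hottest])
```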
JSON

{
  "student" : [
    {"name" : "John Doe", "studyPlan": "CS", "lecture" : "Data Engineering"},
    {"name" : "Bob Byte", "studyPlan": "CS", "lecture" : "Data Engineering"}
  ]
}

• JavaScript Object Notation: a format originating from JavaScript.
• Very popular on the Web (see lecture Web Technologies).
• Can be parsed very efficiently by browsers (being JavaScript's native object serialization format).
• Contains keys and values.
• Always one root element. The root may be an array.
• Data types inherited from JavaScript:
  Null, Boolean (true, false), Number, String "…", Array [], Object {}
• Much smaller set of (and less mature) additional standards than XML
  (JSON Schema, JSON-LD, JSON Pointer, …)
• No schema for validation (unless the emerging JSON Schema is used)
  → Application needs to validate document!
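
What "the application needs to validate" can mean in practice is sketched below with Python's standard json module. The rule set (every student needs the string fields name, studyPlan and lecture) and the function name validate_students are assumptions made for this example, not part of the lecture.

```python
import json

doc = json.loads("""
{
  "student": [
    {"name": "John Doe", "studyPlan": "CS", "lecture": "Data Engineering"},
    {"name": "Bob Byte", "studyPlan": "CS", "lecture": "Data Engineering"}
  ]
}
""")

REQUIRED_KEYS = {"name", "studyPlan", "lecture"}  # assumed rule for this sketch


def validate_students(document):
    """Return a list of problems; an empty list means the document passed."""
    students = document.get("student") if isinstance(document, dict) else None
    if not isinstance(students, list):
        return ["'student' must be an array"]
    problems = []
    for i, student in enumerate(students):
        if not isinstance(student, dict):
            problems.append(f"student[{i}] is not an object")
            continue
        missing = REQUIRED_KEYS - student.keys()
        if missing:
            problems.append(f"student[{i}] is missing {sorted(missing)}")
        for key in REQUIRED_KEYS & student.keys():
            if not isinstance(student[key], str):
                problems.append(f"student[{i}].{key} is not a string")
    return problems


print(validate_students(doc))
print(validate_students({"student": [{"name": "X"}]}))
```

The parser only guarantees well-formed JSON; everything about required keys and value types is checked by hand here, which is the work a schema language would otherwise take over.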
Relational DB

• The No. 1 standard for classical databases.
• See lecture "Datenbanken".
• Data stored in relations with fixed schema.
• Aiming at physical and logical data independence.
• Focus on consistency, ACID transactions.
• Schema quality defined via normal forms:
  • Minimize redundancies to avoid anomalies.
Schema Example

Primary Key
Foreign Key

• student(matno, firstname, lastname)
• enrolment(lvnr, matno, semester)
• lecture(lvnr, semester, title, ssn)
• lecturer(ssn, firstname, lastname, degree, title)
Recap: Schema Quality / NF I

• Anomalies: insert, update, and delete anomalies.
• Informal recap on normal forms (see lecture "Databases" for full definitions):
  • 1NF: Each attribute value must only contain atomic data. No complex content.
  • 2NF: Non-prime attributes must not depend on proper subsets of a key.
    • Counter example: enrolment(matno, lvnr, semester, firstname, lastname)
      (firstname and lastname depend on matno alone)
  • 3NF: Non-prime attributes must only depend on the key / no transitive dependencies between key and non-prime attributes.
    • Counter example: student(matno, firstname, lastname, studyplanID, studyPlanTitle)
      (studyPlanTitle depends on studyplanID, which is not a key)
  • BCNF: Every non-trivial dependency must have the form A → B, where A is a (super)key of the relation.
    • Counter example: enrolment(matno, assistant, labCourse)
      {matno, labCourse} → assistant, assistant → labCourse
      ⇒ Keys: {matno, labCourse}, {matno, assistant}; assistant → labCourse violates BCNF because assistant alone is not a key.
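
The dependencies of the BCNF counter example can be checked against sample data with a few lines of Python. This is only a sketch: the rows and the helper fd_holds are invented for illustration, and a sample instance can refute a functional dependency but never prove that it holds for every possible instance.

```python
def fd_holds(rows, lhs, rhs):
    """True if, in this instance, equal lhs values always come with equal rhs values."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # setdefault stores the first rhs seen for this lhs and returns it afterwards.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical instance of enrolment(matno, assistant, labCourse)
enrolment = [
    {"matno": 1, "assistant": "Ann", "labCourse": "DB Lab"},
    {"matno": 2, "assistant": "Ann", "labCourse": "DB Lab"},
    {"matno": 1, "assistant": "Tom", "labCourse": "OS Lab"},
    {"matno": 2, "assistant": "Ben", "labCourse": "OS Lab"},
]

print(fd_holds(enrolment, ["matno", "labCourse"], ["assistant"]))  # key determines assistant
print(fd_holds(enrolment, ["assistant"], ["labCourse"]))           # the BCNF-violating dependency
print(fd_holds(enrolment, ["assistant"], ["matno"]))               # assistant alone is no key
```

In this instance the pair (Ann, DB Lab) is stored twice, which is exactly the redundancy that the dependency assistant → labCourse causes.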
Recap: Schema Quality / NF II

• Data is split up into multiple tables; aim: no redundancies.
• Each table represents one real-world concept or relation.
• We cannot mix multiple real-world concepts or relations in one table.

• ++ We can get rid of anomalies.
• -- Queries might become complex and joins are expensive.
• -- However, there are many ways to tune (bitmap-join indexes, materialized views, …; see lecture DBT).
ACID Criteria of Transactions

• Atomicity
• Consistency
• Isolation
• Durability
Recap on Concurrent Access

• A database is typically accessed by multiple users in parallel.
• Typical concurrency problems:
  • Lost Update
  • Uncommitted Dependency / Dirty Read
  • Inconsistent Analysis / Phantom Problem
• Correctness criterion of a schedule:
  • Serializability
  • Typically achieved by locking protocols.
• Different isolation levels to balance consistency and performance (see lecture DB-T):
  • Serializable, Repeatable Read, Read Committed, Read Uncommitted
Recovery

• Recovery in case of HW/SW failures.
• Rollback of aborted transactions.
• Idea:
  • Each update is written to a log.
  • Forward and backward recovery.
• Very powerful:
  • Log (hopefully) stored on a different disk than the data.
  • Given a backup and the most recent log, we can restore all committed transactions.
  • Log can be used to replicate the DB to other nodes (Hot/Warm Standby).
SQL

The standard for interacting with relational databases.
Declarative query language, much more convenient than writing queries in some programming language.
Allows for complex queries; however, intentionally not Turing complete.
Rooted in relational algebra and tuple calculus.
Allows for automatic query optimization.

Relational DBMS

Very mature technology. One of the major success stories of CS.
Tons of vendors and open-source projects (Oracle, Postgres, MS SQL, MySQL, SQLite, …).
For easy application integration: object-relational mapping. Allows very convenient access for most cases (typically 80/20 rule).
Typical application: fixed schema, OLTP processing (ERP, accounting, student management, …).
SQL Extensions for XML and JSON

• First normal form forbids structured attributes.
• However, for many applications we need to
  • store XML or JSON data in relational databases,
  • retrieve relational data in the form of XML or JSON.
• XML support was added with SQL:2003 and extended in 2008.
• JSON support was added with SQL:2016.
Publishing SQL as XML

• Publishing relational data as XML: usage of XML constructor functions in SQL select statements:
  • XMLELEMENT()
  • XMLATTRIBUTES()
  • XMLAGG()
  • XMLROOT()
  • XMLCONCAT()
  • XMLPI(), XMLCOMMENT()
• Typical problem: relational tables are flat, XML is nested.
• See the Postgres documentation for details.


Publishing SQL as XML Example 1

• Create XML elements for each person

SELECT xmlelement(
  name person, xmlattributes(email, geschlecht), vorname || ' ' || nachname
)
FROM person;
Publishing SQL as XML Example 2

• Nest it in a "persons" element

select xmlelement(
  name persons,
  xmlagg(
    xmlelement(name person, xmlattributes(email, geschlecht), vorname || ' ' || nachname)
  )
)
from person

<persons>
<person email="[email protected]" geschlecht="W">Hannah Müller</person>
<person email="[email protected]" geschlecht="W">Hanna Schmidt</person>
<person email="[email protected]" geschlecht="W">Leoni Schneider</person>
...
</persons>
Publishing SQL as XML Example 3

• Include friends

select xmlelement(
  name person,
  xmlattributes(p.email, geschlecht, vorname as fn, nachname as ln),
  xmlelement(name friends,
    xmlagg(
      xmlelement(name friend, h.emailfreund)
    )
  )
)
from person p, hatfreund h
where p.email = h.email group by p.email
Publishing SQL as XML Example 4

<persons>
<person email="[email protected]" geschlecht="M" fn="Phillip" ln="Winkler"><friends><friend>[email protected]</friend><friend>M.Kuehn@sms.at</friend></friends>
</person>
<person email="[email protected]" geschlecht="W" fn="Lina" ln="Hartmann"><friends><friend>[email protected]</friend><friend>Laura.Heinri[email protected]</friend><friend>[email protected]</friend></friends>
</person>
...
</persons>

select xmlelement(
  name persons,
  xmlagg(pq.pe)
)
from (
  select xmlelement(
    name person,
    xmlattributes(p.email, geschlecht, vorname as fn, nachname as ln),
    xmlelement(name friends,
      xmlagg(
        xmlelement(name friend, h.emailfreund)
      )
    )
  ) as pe
  from person p, hatfreund h
  where p.email = h.email group by p.email
) as pq
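
The same flat-to-nested step can be sketched outside the database: a flat join result (one row per person and friend pair) is grouped into one person element with a nested friends element, which is what GROUP BY plus xmlagg do in the query above. The rows and e-mail addresses below are invented placeholders, and the attribute set is reduced for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical flat join result of person and hatfreund: (email, fn, ln, emailfreund)
rows = [
    ("a@example.org", "Phillip", "Winkler", "b@example.org"),
    ("a@example.org", "Phillip", "Winkler", "c@example.org"),
    ("d@example.org", "Lina", "Hartmann", "a@example.org"),
]

persons = ET.Element("persons")
friends_by_email = {}  # email -> the <friends> element of that person

for email, fn, ln, friend_email in rows:
    if email not in friends_by_email:
        # First row of a group: create the <person> element once.
        person = ET.SubElement(persons, "person", email=email, fn=fn, ln=ln)
        friends_by_email[email] = ET.SubElement(person, "friends")
    # Every row of the group contributes one <friend> child.
    ET.SubElement(friends_by_email[email], "friend").text = friend_email

xml_text = ET.tostring(persons, encoding="unicode")
print(xml_text)
```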
Storing XML in Relations

-- Create a table with a column of type xml.
-- Note: Check your DBMS manual!
create table weatherReportXML (
  id integer not null PRIMARY KEY,
  report xml not null
);

-- Insert some rows containing XML strings.
insert into weatherReportXML (id, report) values
  (1, '<station name="Klagenfurt Airport" temp="21"/>');
insert into weatherReportXML (id, report) values
  (2, '<station name="Villacher Alpe" temp="6"/>');

-- Returns the data as XML.
select report from weatherReportXML;


Querying XML in SQL

-- Result: {"Klagenfurt Airport"}, {"Villacher Alpe"}
select xpath('//station/@name', report) as station
from weatherReportXML;

-- xpath_exists returns true if the XPath expression returns a non-empty set of nodes.
select xpath('//station/@name', report) as station
from weatherReportXML
where (xpath_exists('//station[@temp > 6]', report));

→ We can use xpath() in the select and the where clause.

[ Details ]
More SQL Related DB features

• In-place update of XML values via UPDATEXML() (not in PostgreSQL).
• Map XML to tables with a registered schema (not in PostgreSQL).
• XQuery support (not in PostgreSQL).
• Index on XML data, e.g. XMLIndex (PostgreSQL: only functional indexes over xpath, full-text).
Storing JSON in Relational DB

-- Create a table with a column of type JSON or JSONB (binary).
-- Note: Check your DBMS manual!
create table weatherReport (
  id integer not null PRIMARY KEY,
  report json not null
);

-- Insert some rows containing JSON strings.
insert into weatherReport (id, report) values
  (1, '{"station" : "Klagenfurt Airport", "temp" : "21" }');
insert into weatherReport (id, report) values
  (2, '{"station" : "Villacher Alpe", "temp" : "6" }');

-- Returns the data as JSON.
select report from weatherReport;


Addressing JSON in SQL I

Operator | Right operand type | Description                                                                        | Example                                           | Example result
---------+--------------------+------------------------------------------------------------------------------------+---------------------------------------------------+---------------
->       | int                | Get JSON array element (indexed from zero, negative integers count from the end)   | '[{"a":"foo"},{"b":"bar"},{"c":"baz"}]'::json->2  | {"c":"baz"}
->       | text               | Get JSON object field by key                                                       | '{"a": {"b":"foo"}}'::json->'a'                   | {"b":"foo"}
->>      | int                | Get JSON array element as text                                                     | '[1,2,3]'::json->>2                               | 3
->>      | text               | Get JSON object field as text                                                      | '{"a":1,"b":2}'::json->>'b'                       | 2
#>       | text[]             | Get JSON object at specified path                                                  | '{"a": {"b":{"c": "foo"}}}'::json#>'{a,b}'        | {"c": "foo"}
#>>      | text[]             | Get JSON object at specified path as text                                          | '{"a":[1,2,3],"b":[4,5,6]}'::json#>>'{a,2}'       | 3

[ Details and Source ]



Addressing JSON in SQL II

-- -> returns JSON values: "Klagenfurt Airport", "Villacher Alpe"
select report -> 'station' as station from weatherReport;

-- ->> returns text: Klagenfurt Airport, Villacher Alpe
select report ->> 'station' as station from weatherReport;

-- We can chain -> if the JSON is nested.
select report -> 'station' ->> 'name' as name from weatherReport;

-- We can use ->> and -> in the where clause as well.
select report ->> 'station' as station from weatherReport
where report ->> 'temp' = '6';

-- However, we might need to cast data accordingly.
select report ->> 'station' as station from weatherReport
where cast(report ->> 'temp' as integer) < 10;
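
The difference between -> (result stays JSON) and ->> (result becomes text), and the reason for the cast, can be mimicked in a few lines of Python. The helper names arrow and arrow_text are inventions of this sketch and only roughly approximate the PostgreSQL operators; for instance, they do not distinguish a JSON null from a missing key.

```python
import json


def _select(doc_text, key):
    """Pick an object field (str key) or an array element (int key); None if absent."""
    value = json.loads(doc_text)
    if isinstance(value, dict) and isinstance(key, str):
        return value.get(key)
    if isinstance(value, list) and isinstance(key, int):
        if -len(value) <= key < len(value):
            return value[key]
    return None


def arrow(doc_text, key):
    """Rough analogue of ->: the selected value, still serialized as JSON."""
    selected = _select(doc_text, key)
    return None if selected is None else json.dumps(selected)


def arrow_text(doc_text, key):
    """Rough analogue of ->>: the selected value as plain text."""
    selected = _select(doc_text, key)
    if selected is None:
        return None
    return selected if isinstance(selected, str) else json.dumps(selected)


report = '{"station": "Villacher Alpe", "temp": "6"}'
nested = '{"station": {"name": "Villacher Alpe"}}'  # hypothetical nested variant

print(arrow(report, "station"))       # JSON string, quotes included
print(arrow_text(report, "station"))  # plain text, no quotes
print(arrow_text(arrow(nested, "station"), "name"))  # chaining on nested JSON
print(arrow_text(report, "temp") < "10", int(arrow_text(report, "temp")) < 10)
```

The last line shows why the cast matters: compared as text, '6' sorts after '10', while the integer comparison gives the intended result.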
Publishing Relational data as JSON I

to_json(anyelement) and to_jsonb(anyelement):

Returns the value as json or jsonb. Arrays and composites are converted (recursively) to
arrays and objects; otherwise, if there is a cast from the type to json, the cast function will
be used to perform the conversion; otherwise, a scalar value is produced. For any scalar
type other than a number, a Boolean, or a null value, the text representation will be used, in
such a fashion that it is a valid json or jsonb value.

Example:
person(email, vorname, nachname,geburtsdatum, geschlecht)
select to_json(p) from person p
Publishing Relational Data as JSON III

• json_agg aggregates sets of tuples to a JSON array [ Details ]
• Output each person and a nested array of friends.

SELECT json_agg(pt) FROM
(
  select p.*, json_agg(h.emailfreund) as friends
  from person p, hatfreund h
  where p.email = h.email group by p.email
) as pt

[{
  "email": "[email protected]",
  "vorname": "Phillip",
  "nachname": "Winkler",
  "geburtsdatum": "1985-10-02",
  "geschlecht": "M",
  "friends": [
    "[email protected]",
    "[email protected]"
  ]
},{
  "email": "[email protected]",
  "vorname": "Lina",
  "nachname": "Hartmann",
  "geburtsdatum": "1988-01-25",
  "geschlecht": "W",
  "friends": [
    "[email protected]",
    "[email protected]",
    "[email protected]"
  ]
}
...
]
JSON / XML Conclusions

• We can store XML and JSON in relational databases.
• We can query XML and JSON data with relational databases.
• Good for hybrid approaches where some data is relational and some is hierarchical.
• Allows light-weight implementations of REST services.
• Allows processing XML and JSON in the DB as a step of a data pipeline.
Advanced SQL Queries

• After attending "Databases" you should be fluent in:
  • Subqueries, correlated subqueries, quantifiers.
  • Group by, having.
  • Complex queries mixing everything.
• Here we discuss some types of queries you did not see in the lecture "Databases":
  • Window functions
  • Recursive queries
  • Aggregate functions for statistics
Window Functions

• So far, we know
  • simple functions that can only access the current row,
  • aggregate functions.
• However, aggregate functions always condense multiple rows to one (maybe in a group).
• What if we want to compare some row to "similar" other rows?
  • Return each employee, her salary and the average salary of her department.
  • Return the number of friends of each person and the average number of friends per gender.

[ Tutorial, Details ]
Simple Window Functions I

SELECT depname, empno, salary, avg(salary) OVER (PARTITION BY depname)
FROM empsalary;

OVER specifies the partition for the aggregate function.

  depname  | empno | salary |          avg
-----------+-------+--------+-----------------------
 develop   |    11 |   5200 | 5020.0000000000000000
 develop   |     7 |   4200 | 5020.0000000000000000
 develop   |     9 |   4500 | 5020.0000000000000000
 develop   |     8 |   6000 | 5020.0000000000000000
 develop   |    10 |   5200 | 5020.0000000000000000
 personnel |     5 |   3500 | 3700.0000000000000000
 personnel |     2 |   3900 | 3700.0000000000000000
 sales     |     3 |   4800 | 4866.6666666666666667
 sales     |     1 |   5000 | 4866.6666666666666667
 sales     |     4 |   4800 | 4866.6666666666666667
(10 rows)
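
To try the query without a PostgreSQL server, the following sketch replays it with Python's built-in sqlite3 module, on the assumption that the bundled SQLite is version 3.25 or newer (the first release with window functions). The table contents are copied from the output above; the GROUP BY query is added for contrast.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE empsalary (depname TEXT, empno INTEGER, salary INTEGER)")
conn.executemany(
    "INSERT INTO empsalary VALUES (?, ?, ?)",
    [
        ("develop", 11, 5200), ("develop", 7, 4200), ("develop", 9, 4500),
        ("develop", 8, 6000), ("develop", 10, 5200),
        ("personnel", 5, 3500), ("personnel", 2, 3900),
        ("sales", 3, 4800), ("sales", 1, 5000), ("sales", 4, 4800),
    ],
)

# Window function: every row is kept and annotated with its department average.
rows = conn.execute("""
    SELECT depname, empno, salary,
           avg(salary) OVER (PARTITION BY depname) AS dep_avg
    FROM empsalary
    ORDER BY depname, empno
""").fetchall()

# Plain aggregate: the rows of each department collapse into a single row.
grouped = conn.execute("""
    SELECT depname, avg(salary) FROM empsalary
    GROUP BY depname ORDER BY depname
""").fetchall()

print(len(rows), "rows with OVER,", len(grouped), "rows with GROUP BY")
```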
Simple Window Functions II

select f.vorname, f.nachname, f.geschlecht, f.freunde,
       avg(f.freunde) OVER (PARTITION BY f.geschlecht)
from
(
  select p.vorname, p.nachname, p.geschlecht,
         count(h.emailfreund) as freunde
  from person p, hatfreund h
  where p.email = h.email
  group by p.email
) AS f
Window Functions 3

OVER with ORDER BY but without partitioning: the aggregate is computed over all rows from the start of the ordering up to the current row (a running sum).

select f.nachname, f.freunde,
       sum(f.freunde) OVER (ORDER BY f.nachname)
from
(
  select p.nachname, count(h.emailfreund) as freunde
  from person p, hatfreund h
  where p.email = h.email
  group by p.email
  order by freunde
) AS f
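
A runnable sketch of the running sum, again using sqlite3 (SQLite 3.25 or newer assumed). The table f stands in for the subquery result, and its four rows are invented; two people share the last name Schmidt on purpose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for the subquery result f(nachname, freunde).
conn.execute("CREATE TABLE f (nachname TEXT, freunde INTEGER)")
conn.executemany(
    "INSERT INTO f VALUES (?, ?)",
    [("Hartmann", 3), ("Schmidt", 1), ("Winkler", 2), ("Schmidt", 4)],
)

running = conn.execute("""
    SELECT nachname, freunde,
           sum(freunde) OVER (ORDER BY nachname) AS running_sum
    FROM f
    ORDER BY nachname, freunde
""").fetchall()

for row in running:
    print(row)
```

With the default window frame, rows that tie on the ORDER BY value are peers and receive the same running sum, so both Schmidt rows show 8 rather than 4 and 8. Adding ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW to the OVER clause would give a strictly row-by-row total.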
Window Function Definition

[ Tutorial, Details, Detailed Tutorial ]


Recursive Queries I

• We never really talked about recursive associations.
  (Diagram: entity Person with a many-to-many association hasFriend to itself.)
• Output all friends and friends of friends of Hanna Schmidt

select h.emailfreund
from person p, hatfreund h
where p.email = h.email and
      p.vorname = 'Hanna' and
      p.nachname = 'Schmidt'

• But this query will only return direct friends.


Common Table Expressions

• Can be used to structure a complex query.
• Defined by the with clause:

with boys as (
  select p.* from person p where geschlecht = 'M'
),
girls as (
  select p.* from person p where geschlecht = 'W'
)
select * from boys UNION select * from girls;

Differences to views: Views are schema objects. A view cannot reference itself → no recursion.
[ Details ]
Recursive Common Table Expressions

WITH RECURSIVE friends(freund) AS (
  -- Non-recursive part
  select h.emailfreund as freund
  from person p, hatfreund h
  where p.email = h.email and
        p.vorname = 'Hanna' and
        p.nachname = 'Schmidt'
  UNION
  -- Recursive part
  SELECT h1.emailfreund
  FROM hatfreund h1, friends f
  WHERE f.freund = h1.email
)
select * from friends;
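
The recursive query can be tried on a toy friendship graph with sqlite3, which also supports WITH RECURSIVE. All persons, e-mail addresses and friendships below are invented for this sketch; the graph deliberately contains a cycle back to Hanna.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (email TEXT PRIMARY KEY, vorname TEXT, nachname TEXT);
    CREATE TABLE hatfreund (email TEXT, emailfreund TEXT);
    INSERT INTO person VALUES
        ('hanna@example.org', 'Hanna', 'Schmidt'),
        ('b@example.org', 'Bob', 'Byte'),
        ('c@example.org', 'Alice', 'Block'),
        ('d@example.org', 'John', 'Tuple');
    INSERT INTO hatfreund VALUES
        ('hanna@example.org', 'b@example.org'),
        ('b@example.org', 'c@example.org'),
        ('c@example.org', 'hanna@example.org'),
        ('d@example.org', 'hanna@example.org');
""")

reachable = sorted(row[0] for row in conn.execute("""
    WITH RECURSIVE friends(freund) AS (
        SELECT h.emailfreund
        FROM person p, hatfreund h
        WHERE p.email = h.email
          AND p.vorname = 'Hanna' AND p.nachname = 'Schmidt'
        UNION
        SELECT h1.emailfreund
        FROM hatfreund h1, friends f
        WHERE f.freund = h1.email
    )
    SELECT freund FROM friends
"""))
print(reachable)
```

Two things are worth noticing. The query follows friendships to any depth, not just two levels, so it returns everyone reachable from Hanna. And it terminates despite the cycle only because UNION discards rows that were already produced; with UNION ALL this example would loop forever.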
Recursive Query Evaluation

• Evaluate the non-recursive term. For UNION (but not UNION ALL), discard duplicate rows. Include all remaining rows in the result of the recursive query, and also place them in a temporary working table.
• So long as the working table is not empty, repeat these steps:
  • Evaluate the recursive term, substituting the current contents of the working table for the recursive self-reference. For UNION (but not UNION ALL), discard duplicate rows and rows that duplicate any previous result row. Include all remaining rows in the result of the recursive query, and also place them in a temporary intermediate table.
  • Replace the contents of the working table with the contents of the intermediate table, then empty the intermediate table.
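
The evaluation steps above can be written down as a short Python loop. This is a sketch of the UNION variant only; the function names and the toy friendship edges are assumptions made for illustration.

```python
def evaluate_recursive(non_recursive_rows, recursive_term):
    """Working-table evaluation of a recursive query with UNION semantics."""
    result = set(non_recursive_rows)   # step 1: duplicates are discarded
    working = set(result)              # ... and the rows go into the working table
    while working:                     # step 2: repeat until the working table is empty
        intermediate = set()
        for row in recursive_term(working):
            if row not in result:      # drop rows that duplicate any previous result row
                result.add(row)
                intermediate.add(row)
        working = intermediate         # the intermediate table becomes the working table
    return result


# Hypothetical friendship edges (person, friend), including a cycle.
edges = {("hanna", "bob"), ("bob", "alice"), ("alice", "hanna"), ("john", "hanna")}


def friends_of(working):
    """Recursive term: friends of everyone currently in the working table."""
    return [friend for (person, friend) in edges if person in working]


closure = evaluate_recursive({"bob"}, friends_of)  # start from Hanna's direct friends
print(sorted(closure))
```

Because only rows that are new to the result enter the next working table, the loop stops once an iteration finds nothing new, even on cyclic data.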
Aggregate Functions for Statistics

• Today's DBMSs come with a whole range of aggregate functions useful for statistics.
• Here we introduce the list for PostgreSQL:

• Covariance: covar_pop(Y,X)
• Correlation: corr(Y,X) // values in range of -1 to +1
• Standard deviation: stddev(exp), stddev_pop(exp), stddev_samp(exp) (stddev is an alias for stddev_samp)
• Variance: variance(exp), var_pop(exp), var_samp(exp) (variance is an alias for var_samp)
• Linear regression: regr_intercept(Y,X), regr_slope(Y,X), …

Idea: y = ax + b; a is the slope, b is the intercept.

[ Details ]
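
What the regression aggregates compute can be sketched in plain Python under the usual least-squares definitions: the slope is the population covariance of X and Y divided by the population variance of X, and the intercept follows from the means. The function names mirror the SQL aggregates for readability but are ordinary Python functions written for this example, and the data points are invented so that they lie exactly on y = 2x + 1.

```python
from statistics import mean, pvariance, variance


def covar_pop(ys, xs):
    """Population covariance of the paired values."""
    my, mx = mean(ys), mean(xs)
    return sum((y - my) * (x - mx) for y, x in zip(ys, xs)) / len(xs)


def regr_slope(ys, xs):
    """Least-squares slope a of y = ax + b."""
    return covar_pop(ys, xs) / pvariance(xs)


def regr_intercept(ys, xs):
    """Least-squares intercept b of y = ax + b."""
    return mean(ys) - regr_slope(ys, xs) * mean(xs)


xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]  # hypothetical points on y = 2x + 1

print(regr_slope(ys, xs), regr_intercept(ys, xs))
print(pvariance(xs), variance(xs))  # population vs. sample variance
```

The last line shows the difference behind the _pop and _samp suffixes: the population variant divides by n, the sample variant by n - 1.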
Correlation: An example

• corr(Y,X) aggregates all Y and X values to a single value between -1 and +1.
• Calculates the Pearson correlation: +1 is the maximal positive correlation, -1 the maximal negative correlation, 0 means no correlation.

Age | Size
----+-----
  3 |   25
  4 |   27
  5 |   30
  6 |   31
  7 |   32
  3 |   24
  4 |   28
  … |    …

select corr(age, size) from kids
Will return 0.9635.
The higher the age, the larger the size.
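
The Pearson correlation can be recomputed by hand in a few lines of Python. The sketch uses only the seven rows that are visible in the table (any further rows behind the "…" are not available here); for these seven rows the coefficient is 185/192, which rounds to the 0.9635 stated above.

```python
from math import sqrt

ages = [3, 4, 5, 6, 7, 3, 4]
sizes = [25, 27, 30, 31, 32, 24, 28]


def pearson(ys, xs):
    """Pearson correlation coefficient of the paired values."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r = pearson(ages, sizes)
print(round(r, 4))
```

The coefficient is symmetric in its arguments, so corr(age, size) and corr(size, age) give the same value.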
To Wrap Up…

• DBMSs provide functionalities to treat file formats for data exchange.
  • Major formats: CSV, JSON, XML
  • This overcomes inherent limitations of the relational model, e.g., normalized schema vs. structured attributes.
  • We have seen how to publish, store, and query JSON and XML.
• Applications may require support for complex queries that are not based on functions operating on single rows or on aggregate functions:
  • Window functions
  • Recursive queries
  • Aggregate functions for statistics
Homework

• Publish relational data as XML
• Publish relational data as JSON
• Query JSON data and output as relation
• Working with window functions
• Hierarchical queries
