Session - 6 - Complex Data Types

The document discusses different types of semi-structured data including their features and common uses. It also describes JSON, XML, and RDF as examples of semi-structured data formats and how they are supported in SQL databases.

Uploaded by

alexsburg

Advanced Databases

Session 6
Academic year 2023-2024

Professor: Luis Angel Galindo


Semi-structured Data
Features
• Flexible schema. Different instances of data within the same dataset can have
different structures
• Wide column representation: allow each tuple to have a different set of
attributes, can add new attributes at any time (eg. Apache Cassandra)
• Sparse column representation: schema has a fixed but large set of attributes, but
each tuple may store only a subset (e.g. HDF5)
• Hierarchy and Nesting. Elements can be organized in a nested manner, forming a
tree-like or graph-like structure. This is common in formats like JSON and XML.
• Loose Data Typing. Data types in semi-structured data are often loosely defined.
Unlike strict relational databases with fixed data types for each column, semi-
structured data formats may allow for a variety of data types within a single dataset.
• No or Partial Integrity Constraints. Data may lack or have fewer constraints. This
allows for more flexibility but can also introduce challenges in maintaining data
integrity.
Features
• Textual Representation. Often uses textual formats (JSON, XML, YAML…), which
are human-readable and facilitate easy exchange of data between different
systems.
• Support for Arrays and Lists. Elements within a dataset can be organized as arrays,
allowing for the representation of multiple values for a single attribute. Array
databases add compressed storage and query language extensions for such data
(e.g. Oracle GeoRaster, PostGIS, …)
• Dynamic Evolution. New fields or attributes can be added without affecting existing
data, supporting a more agile development and adaptation to changing requirements.
• Common Use Cases. Semi-structured data is often found in scenarios where the
structure of the data is not completely known in advance, such as in web scraping, log
files, configuration files, and certain types of NoSQL databases.
JSON - JavaScript Object Notation
• Textual representation widely used for data exchange
{
  "ID": "22222",
  "name": {
    "firstname": "Albert",
    "lastname": "Einstein"
  },
  "deptname": "Physics",
  "children": [
    {"firstname": "Hans", "lastname": "Einstein"},
    {"firstname": "Eduard", "lastname": "Einstein"}
  ]
}
• name is an object with two key-value pairs
• children is an array with two elements, each an object of key-value pairs
• Types: integer, real, string, boolean, and the structured types:
• Objects: key-value maps, i.e. sets of (attribute name, value) pairs
• Arrays: ordered lists of values, which can be seen as key-value maps from offset to
value (e.g. "skills": ["JavaScript", "HTML", "CSS"])
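The object and array types can be seen by loading the instructor record from the slide with Python's standard `json` module. This is a minimal sketch; the variable names are chosen here for illustration.

```python
import json

# The instructor record from the slide, as JSON text.
text = """
{
  "ID": "22222",
  "name": {"firstname": "Albert", "lastname": "Einstein"},
  "deptname": "Physics",
  "children": [
    {"firstname": "Hans", "lastname": "Einstein"},
    {"firstname": "Eduard", "lastname": "Einstein"}
  ]
}
"""

record = json.loads(text)

# A JSON object becomes a dict: a map from attribute name to value.
lastname = record["name"]["lastname"]

# A JSON array becomes a list: a map from offset (0, 1, ...) to value.
first_child = record["children"][0]["firstname"]
child_names = [child["firstname"] for child in record["children"]]

print(type(record["name"]).__name__, type(record["children"]).__name__)
print(lastname, first_child, child_names)
```

Nested access such as `record["children"][0]["firstname"]` follows the same path a JSON path expression would describe.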
JSON - JavaScript Object Notation
• JSON is ubiquitous in data exchange today
• Widely used for web services
• Most modern applications are architected around web services
• SQL extensions for
• JSON types for storing JSON data
• Extracting data from JSON objects using path expressions
SELECT json_extract(json_data, '$.employee.name') AS employee_name
FROM employees;
• Generating JSON from relational data
• E.g. json_build_object('ID', 12345, 'name', 'Einstein') in PostgreSQL
• Creation of JSON collections using aggregation
• E.g. json_agg aggregate function in PostgreSQL
• JSON has compact binary representations such as BSON (Binary JSON), used
for efficient data storage
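The three SQL extensions listed above (path extraction, JSON generation, aggregation) can be tried with SQLite from Python. This is a sketch under assumptions: it relies on SQLite's JSON functions being available in the local build, and it uses SQLite's `json_object` and `json_group_array`, which play the role that `json_build_object` and `json_agg` play in PostgreSQL. The table contents are made up for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, json_data TEXT)")

# Hypothetical sample documents, stored as JSON text.
docs = [
    {"employee": {"name": "Einstein", "dept": "Physics"}},
    {"employee": {"name": "Curie", "dept": "Chemistry"}},
]
conn.executemany(
    "INSERT INTO employees (json_data) VALUES (?)",
    [(json.dumps(d),) for d in docs],
)

# Path expression: '$' is the document root, '.employee.name' walks down.
names = [
    row[0]
    for row in conn.execute(
        "SELECT json_extract(json_data, '$.employee.name') "
        "FROM employees ORDER BY id"
    )
]

# Generating JSON from relational values.
built = conn.execute(
    "SELECT json_object('ID', 12345, 'name', 'Einstein')"
).fetchone()[0]

# Creating a JSON collection with an aggregate function.
depts = conn.execute(
    "SELECT json_group_array(json_extract(json_data, '$.employee.dept')) "
    "FROM employees"
).fetchone()[0]

print(names, built, depts)
```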
XML - Extensible Markup Language
• XML uses tags to mark up text. Tags make the data self-documenting and can be
hierarchical
<purchase_order>
  <identifier> P-101 </identifier>
  <purchaser>
    <name> Cray Z. Coyote </name>
    <address> Route 66, Mesa Flats, Arizona 86047, USA </address>
  </purchaser>
  <itemlist>
    <item>
      <identifier> RS1 </identifier>
      <description> Atom powered rocket sled </description>
      <quantity> 2 </quantity>
      <price> 199.95 </price>
    </item>
  </itemlist>
  <total_cost> 429.85 </total_cost>
  ….
</purchase_order>
XML - Extensible Markup Language
• Support for XML in SQL is typically provided through SQL extensions or specific
functionalities that allow for the storage, retrieval, and manipulation of XML data
within a relational database
• SQL standards support XML
CREATE TABLE MyTable (
ID INT PRIMARY KEY,
XmlColumn XML
);
• XPath and XQuery are query languages designed for navigating and querying XML
documents. E.g. with PostgreSQL:
SELECT xpath('/bookstore/book/title/text()', xml_column) AS book_title
FROM books;
• Support for XML schema definition and validation to ensure XML documents
conform to a specified structure. The statement below is illustrative only: the
schema-binding clause is vendor-specific, and MySQL itself has no native XML
column type
CREATE TABLE MyTable (
ID INT PRIMARY KEY,
XmlColumn XML
) XMLSCHEMA 'path/to/schema.xsd';
• This creates a table that stores XML data in a specific column and uses an
external schema to validate it
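Path navigation over XML can also be tried outside a database. The sketch below uses Python's standard `xml.etree.ElementTree`, which supports only a limited subset of XPath, on a shortened version of the purchase order. Tag names are written with underscores because XML names cannot contain spaces.

```python
import xml.etree.ElementTree as ET

# Shortened purchase order (one item, no total) for illustration.
doc = """
<purchase_order>
  <identifier> P-101 </identifier>
  <purchaser>
    <name> Cray Z. Coyote </name>
  </purchaser>
  <itemlist>
    <item>
      <identifier> RS1 </identifier>
      <description> Atom powered rocket sled </description>
      <quantity> 2 </quantity>
      <price> 199.95 </price>
    </item>
  </itemlist>
</purchase_order>
"""

root = ET.fromstring(doc)

# A path relative to the root element selects a nested element.
purchaser = root.find("purchaser/name").text.strip()

# findall returns every element matching the path.
item_ids = [e.text.strip() for e in root.findall("./itemlist/item/identifier")]

# Element text is a string; convert it before doing arithmetic.
line_totals = [
    int(item.find("quantity").text) * float(item.find("price").text)
    for item in root.iter("item")
]

print(purchaser, item_ids, line_totals)
```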
RDF - Resource Description Framework
• A framework for representing information
about resources on the web.
• Key components are:
• Triples. (subject, predicate, object)
• E.g. (NBA-2019, winner, Raptors)
• URIs (Uniform Resource Identifiers) used
to identify things on the web, and they are
used to create globally unique identifiers
for resources.
• Graph Structure where nodes represent
resources, and edges represent
relationships between resources.
RDF - Resource Description Framework
RDF – Queries using SPARQL
• Look for the names of students who are taking the course titled "Intro. to Computer
Science"

SELECT ?name
WHERE {
?cid title "Intro. to Computer Science" .   # the course titled "Intro. to Computer Science" is bound to ?cid
?sid course ?cid .   # ?sid refers to the specified computer science course
?id takes ?sid .   # the entity with identifier ?id takes the course represented by ?sid
?id name ?name .   # the entity with identifier ?id has a name, bound to ?name
}
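How the four triple patterns combine can be sketched with a tiny in-memory matcher. This is not a SPARQL engine, only an illustration of how shared variables join the patterns; the triples, identifiers, and student names below are hypothetical.

```python
def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`; None on failure."""
    new = dict(binding)
    for p, v in zip(pattern, triple):
        if p.startswith("?"):
            # A variable must agree with any value it was bound to earlier.
            if new.setdefault(p, v) != v:
                return None
        elif p != v:
            return None
    return new


def query(patterns, triples):
    """Return all variable bindings that satisfy every pattern."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for b in bindings
            for t in triples
            if (extended := match(pattern, t, b)) is not None
        ]
    return bindings


# Hypothetical data: courses c1/c2, sections s1/s2, students st1..st3.
triples = [
    ("c1", "title", "Intro. to Computer Science"),
    ("c2", "title", "Databases"),
    ("s1", "course", "c1"),
    ("s2", "course", "c2"),
    ("st1", "takes", "s1"),
    ("st2", "takes", "s2"),
    ("st3", "takes", "s1"),
    ("st1", "name", "Ana"),
    ("st2", "name", "Ben"),
    ("st3", "name", "Eva"),
]

patterns = [
    ("?cid", "title", "Intro. to Computer Science"),
    ("?sid", "course", "?cid"),
    ("?id", "takes", "?sid"),
    ("?id", "name", "?name"),
]

names = sorted(b["?name"] for b in query(patterns, triples))
print(names)
```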
Object-relational Data
Object orientation
• Approaches for integrating object-orientation with databases
• Build an object-relational database, adding object-oriented features to a
relational database
• Automatically convert data between programming language model and
relational model; data conversion specified by object-relational mapping
• Build an object-oriented database that natively supports object-oriented data
and direct access from programming language
Object-relational Database

• User-defined types
create type Person
(ID varchar(20) primary key,
name varchar(20),
address varchar(20)) ref from(ID);  -- instances of this type can be referenced by their ID
create table people of Person;

• Table types
create type interest as table (  -- interest is a table type
topic varchar(20),
degree_of_interest int);
create table users (
ID varchar(20),
name varchar(20),
interests interest);  -- interests has type interest, so it holds a table
Object-relational Database
• Type inheritance
create type Student under Person
(degree varchar(20)) ;
create type Teacher under Person
(salary integer);
• Table inheritance and hierarchy
-- PostgreSQL-style syntax
create table students
(degree varchar(20))
inherits (people);
create table teachers
(salary integer)
inherits (people);
-- Alternative syntax using typed tables
create table people of Person;  -- creates table people as the base table
create table students of Student
under people;
create table teachers of Teacher
under people;
Object-relational Database
• Creating reference types
create type Department (
dept_name varchar(20),
head ref(Person) scope people);  -- head is a reference to a Person and must be within the scope of table people
create table departments of Department;
insert into departments values ('CS', '12345');  -- 12345 is the ID of a row in table people
• System generated references can be retrieved using subqueries
select ref(p)
from people as p
where ID = '12345';
• Using references in path expressions
select head->name, head->address
from departments;
Object-relational mapping
• Object-relational mapping (ORM) systems allow
• Specification of mapping between programming language objects and database
tuples
• Automatic creation of database tuples upon creation of objects
• Automatic update/delete of database tuples when objects are updated/deleted
• Interface to retrieve objects satisfying specified conditions
• Tuples in the database are queried, and objects are created from the tuples
• E.g. Django ORM for Python
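The mapping idea can be sketched without any framework. The `Mapper` class below is hypothetical and written only for illustration: it maps one dataclass to one SQLite table and requires an explicit `save` call, whereas a real ORM such as Django's automates far more.

```python
import sqlite3
from dataclasses import dataclass, fields


@dataclass
class Person:
    ID: str
    name: str
    address: str


class Mapper:
    """Maps one dataclass to one table; the first field is the primary key."""

    def __init__(self, conn, cls, table):
        self.conn, self.cls, self.table = conn, cls, table
        self.cols = [f.name for f in fields(cls)]
        conn.execute(
            f"CREATE TABLE {table} ({', '.join(self.cols)}, "
            f"PRIMARY KEY ({self.cols[0]}))"
        )

    def save(self, obj):
        # Insert a new tuple, or replace the tuple with the same key.
        marks = ", ".join("?" for _ in self.cols)
        self.conn.execute(
            f"INSERT OR REPLACE INTO {self.table} VALUES ({marks})",
            [getattr(obj, c) for c in self.cols],
        )

    def delete(self, obj):
        key = self.cols[0]
        self.conn.execute(
            f"DELETE FROM {self.table} WHERE {key} = ?", (getattr(obj, key),)
        )

    def filter(self, **conditions):
        # Query tuples and build one object per tuple.
        where = " AND ".join(f"{c} = ?" for c in conditions) or "1"
        rows = self.conn.execute(
            f"SELECT {', '.join(self.cols)} FROM {self.table} WHERE {where}",
            list(conditions.values()),
        )
        return [self.cls(*row) for row in rows]


conn = sqlite3.connect(":memory:")
people = Mapper(conn, Person, "people")

p = Person("12345", "Einstein", "Princeton")
people.save(p)            # object -> tuple
p.address = "Bern"
people.save(p)            # object update -> tuple update
found = people.filter(name="Einstein")
people.delete(p)          # object delete -> tuple delete
remaining = people.filter()
print(found, remaining)
```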
Textual Data
Textual data
• Information retrieval: querying of unstructured data
• Simple model of keyword queries: given query keywords, retrieve documents
containing all the keywords
• More advanced models rank the relevance of documents
• Today, keyword queries return many types of information as answers
• E.g., a query “cricket” typically returns information about ongoing cricket
matches
• Relevance ranking
• Essential since there are usually many documents matching keywords
Ranking using TF-IDF
• Term: keyword occurring in a document/query
• Term Frequency: TF(d, t), the relevance of a term t to a document d
• One definition: TF(d, t) = log(1 + n(d,t)/n(d))
where
• n(d,t) = number of occurrences of term t in document d
• n(d) = number of terms in document d
• Inverse Document Frequency: IDF(t)
• One definition: IDF(t) = log(N/n(t))
where
• N = total number of documents in the collection
• n(t) = number of documents containing the term t
• Relevance of a document d to a set of terms Q
• One definition: r(d, Q) = ∑t∈Q TF(d, t) ∗ IDF(t)
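The three definitions translate directly into code. The sketch below assumes natural logarithms (the slide does not fix a base), treats IDF as 0 for a term that appears in no document, and uses a made-up three-document collection.

```python
import math

# Hypothetical collection: each document is a list of terms.
docs = {
    "d1": "database systems store structured data".split(),
    "d2": "spatial data and spatial indexes".split(),
    "d3": "cricket match results".split(),
}


def tf(doc, t):
    """TF(d, t) = log(1 + n(d,t)/n(d))."""
    return math.log(1 + doc.count(t) / len(doc))


def idf(t):
    """IDF(t) = log(N/n(t)); 0 if no document contains t (assumption)."""
    n_t = sum(1 for d in docs.values() if t in d)
    return math.log(len(docs) / n_t) if n_t else 0.0


def relevance(doc, query):
    """r(d, Q) = sum over t in Q of TF(d, t) * IDF(t)."""
    return sum(tf(doc, t) * idf(t) for t in query)


query = ["spatial", "data"]
scores = {name: relevance(doc, query) for name, doc in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

Here d2 ranks first because "spatial" is both frequent in d2 and rare in the collection, while "data" contributes less since it occurs in two of the three documents.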
Ranking using Hyperlinks
• Hyperlinks provide very important clues to importance
• Google introduced PageRank, a measure of popularity/importance based on
hyperlinks to pages
• Pages hyperlinked from many pages should have higher PageRank
• Pages hyperlinked from pages with higher PageRank should have higher
PageRank
• Formalized by random walk model
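The random walk model can be sketched as a simple iteration: a surfer follows a random outgoing link with some probability and otherwise jumps to a random page. The damping factor 0.85, the iteration count, and the four-page link graph are assumptions made for this example, and every link target must itself be a key of `links`.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Iteratively estimate PageRank for a dict page -> list of linked pages."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Random jump: every page receives a small base share.
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                # Following a link: p passes its rank equally to its targets.
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:
                # A page without links spreads its rank over all pages.
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank


# Hypothetical graph: A, B and D link to C; C links to D.
links = {"A": ["C"], "B": ["C"], "C": ["D"], "D": ["C"]}
rank = pagerank(links)
print(sorted(rank, key=rank.get, reverse=True))
```

C is linked from the most pages and ends up highest; D has a single incoming link, but it comes from the highly ranked C, so D still ranks above A and B.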
Measures of Effectiveness
• Precision is the fraction of retrieved documents that are relevant to a query

• Recall is the fraction of relevant documents that were successfully retrieved

• F1 Score is the harmonic mean of precision and recall, providing a balanced measure
of both

• Accuracy measures the overall correctness of the retrieval system

• Query response time


• …
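The first four measures can be computed from the sets of retrieved and relevant documents. In this sketch the function name `evaluate`, the document identifiers, and the collection size are made up; accuracy is taken as the fraction of all documents classified correctly (retrieved and relevant, or not retrieved and not relevant).

```python
def evaluate(retrieved, relevant, collection_size):
    """Return precision, recall, F1 and accuracy for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)      # relevant and retrieved
    fp = len(retrieved - relevant)      # retrieved but not relevant
    fn = len(relevant - retrieved)      # relevant but missed
    tn = collection_size - tp - fp - fn # correctly left out

    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    accuracy = (tp + tn) / collection_size
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical query result over a collection of 10 documents.
result = evaluate(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d2", "d4", "d5"],
    collection_size=10,
)
print(result)
```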
Spatial Data
Spatial DBs
• Store information related to spatial locations and support efficient storage,
indexing, and querying of spatial data.
• Geographic data -- road maps, land-usage maps, topographic elevation maps,
political maps showing boundaries, land-ownership maps… These use a round-earth
coordinate system (latitude, longitude, elevation)
• Geometric data. Design information about how objects are constructed (designs of
buildings, aircraft, layouts of integrated-circuits) by using 2 or 3 dimensional
Euclidean space with (X, Y, Z) coordinates
• Geometric data represented by:
• A line segment can be represented by the coordinates of its endpoints.
• A polyline or line string consists of a connected sequence of line segments and
can be represented by a list of the coordinates of the endpoints of the
segments, in sequence.
• A polygon is represented by a list of its vertices in order
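These representations map naturally onto lists of coordinate pairs. The sketch below uses 2-D Euclidean coordinates with made-up values; the length and area helpers are added only to show that the coordinate lists carry enough information for such computations.

```python
import math

# A line segment: the coordinates of its two endpoints.
segment = [(0.0, 0.0), (3.0, 4.0)]

# A polyline (line string): endpoints of connected segments, in sequence.
polyline = [(0.0, 0.0), (3.0, 4.0), (3.0, 10.0)]

# A polygon: its vertices in order (the last vertex connects back to the first).
polygon = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]


def length(points):
    """Total length of the segments joining consecutive points."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def area(vertices):
    """Area of a simple polygon from its ordered vertices (shoelace formula)."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:] + vertices[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


print(length(segment), length(polyline), area(polygon))
```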
Representation of geometric information

• Representation of points and line segments in 3-D is similar to 2-D, except that
points have an extra z component
• Represent arbitrary polyhedra by dividing
them into tetrahedrons, like triangulating
polygons.
• Alternative: List their faces, each of which is a
polygon, along with an indication of which
side of the face is inside the polyhedron.
Design DBs
• Represent design components as objects (generally geometric objects); the
connections between the objects indicate how the design is structured.
• Simple two-dimensional objects: points, lines, triangles, rectangles, polygons.
• Complex two-dimensional objects: formed from simple objects via union,
intersection, and difference operations.
• Complex three-dimensional objects: formed from simpler objects such as spheres,
cylinders, and cuboids, by union, intersection, and difference operations.
• Wireframe models represent three-dimensional surfaces as a set of simpler objects
• Design databases also store non-spatial information about objects (e.g.,
construction material, color, etc.)
Representation of Geographic data
• Raster data consist of bit maps or pixel maps, in two (grid) or more dimensions.
• Eg. 2-D raster image: satellite image of cloud cover, where each pixel stores the
cloud visibility in a particular area.
• Additional dimensions might include the temperature at different altitudes at
different regions, or measurements taken at different points in time.
• Design databases generally do not store raster data.
• Vector data are constructed from basic geometric objects: points, line segments,
and polygons in two dimensions, and cylinders, spheres, cuboids, and other
polyhedrons in three dimensions.
• Vector format often used to represent map data.
• Roads can be considered as two-dimensional and represented by lines and
curves.
• Some features, such as rivers, may be represented either as complex curves or
as complex polygons, depending on whether their width is relevant.
• Features such as regions and lakes can be depicted as polygons.
