INTRODUCTION
• Data
• Information
• Database
• DBMS
• Table/relation
• Oracle
• SQL
What is SQL?
• SQL (Structured Query Language) is a computer
language for storing, manipulating and
retrieving data stored in a relational
database.
• SQL is the standard language for relational
database systems. Relational database
management systems such as MySQL, MS Access,
Oracle, Sybase, Informix and SQL Server all use
SQL as their standard database language.
SQL Components
• DDL – create table/rename/alter/drop/truncate
• DML – insert/update/delete
• DCL – grant/revoke
• TCL – commit/rollback
• DQL – select
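For illustration, here is one representative statement from each component; the emp table and the user name scott are only hypothetical:

-- DDL: define or change structure
CREATE TABLE emp (eno NUMBER(5), ename VARCHAR2(20), esal NUMBER(8));
-- DML: change data
INSERT INTO emp (eno, ename, esal) VALUES (1, 'Abid', 5000);
-- DQL: retrieve data
SELECT eno, ename FROM emp;
-- DCL: grant or revoke privileges (scott is a hypothetical user)
GRANT SELECT ON emp TO scott;
-- TCL: end the transaction
COMMIT;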
Basic data types
• Char
• Varchar(size)/varchar2
• Date
• Number
• Long
• Raw/long raw
Create Command
• This command is used to create/generate new tables.
• Syntax
CREATE TABLE table name
(column name datatype(size), column name datatype(size)
--- );
• Example
CREATE TABLE Student
(std_name varchar2(20), father_name varchar2(20), DOB
number(10), address varchar2(20));
Insert Command
• This command is used to insert rows into the table
containing already defined columns.
• Syntax
INSERT INTO table name
(column name, column name, --- )
VALUES (expression, expression, ---);
• Example
INSERT INTO Student
(std_name, father_name, DOB, address)
VALUES ('Abid', 'Ibrahim', 1984, 'KARAK');
Select Command
• This command is used for viewing/retrieving of data from
table.
• Syntax
SELECT column name, column name, ---
FROM table name;
• Examples
• Retrieving student names and father names from student
SELECT std_name, father_name
FROM Student;
• Retrieving all records from student
SELECT * FROM Student;
• Selected columns and all rows
SELECT std_name, father_name
FROM Student;
• Selected rows and all columns
SELECT * FROM Student WHERE DOB=1984;
• Selected columns and selected rows
SELECT std_name, DOB FROM Student WHERE address='KARAK';
• Elimination of duplicates from Select
SELECT DISTINCT * FROM Student;
Delete Command
• This is used to delete specified rows from a table.
• Syntax
DELETE FROM table name
[WHERE condition];
• Examples
• Removing all rows from table Student
DELETE FROM Student;
• Removing selected rows from table Student
DELETE FROM Student WHERE address='KARAK';
Alter Command
• This is used to change the definition of a column or to add an
extra column.
• Syntax
ALTER TABLE table name
ADD (new column name data type (size), new column
name data type (size) ---);
• Example
– ALTER TABLE Student
ADD (marks number(3), gender varchar2(2));
– ALTER TABLE Student
MODIFY (address char(20));
Update Command
• This is used for changing one or more values in the rows of
a table.
• Syntax
UPDATE table name SET column name = expression,
column name = expression ---
[WHERE condition];
• Example
UPDATE Student SET marks = marks+5;
UPDATE Student SET marks = marks+5 WHERE std_name='Abid';
Rename Command
• This is used to rename old/existing table.
• Syntax
RENAME old table name TO new table name;
• Example
RENAME Student TO Personal_Data;
Drop Command
• This is used to remove table from database.
• Syntax
DROP TABLE table name;
• Example
DROP Table Student;
Describe Command
• This command is used to display the column names, data types
and constraints of a table.
• Syntax
DESCRIBE table name;
• Example
DESCRIBE Student;
• SQL Components: DDL/DML/DCL/TCL
• create table emp
  ( rno number(5),
    name varchar2(15),
    marks number(5));
• update emp set marks=null where marks=55;
• update emp set marks=null;
• select rowid from emp;
• delete from emp where rowid='AAADVTAABAAAKS6AAC';
Integrity constraints
• Primary key
• Foreign key
• Check
• Not null
• Unique
• Default
• NOT NULL Constraint: Ensures that a column cannot have a
NULL value.
• DEFAULT Constraint: Provides a default value for a column
when none is specified.
• UNIQUE Constraint: Ensures that all values in a column are
different.
• PRIMARY Key: Uniquely identifies each row/record in a
database table.
• FOREIGN Key: Refers to a row/record that is uniquely
identified in another database table.
• CHECK Constraint: Ensures that all values in a column satisfy
certain conditions.
Primary key
• alter table emp
  add primary key(rno);
Default
• alter table emp
  modify name default 'aaa';
Check
• alter table emp
  add check(rno>5);
Foreign key
• alter table emp
  add foreign key (dno) references dept(depno);
Unique
• alter table Persons
  add unique (P_Id);
Not null
• alter table emp
  modify esal number(5) not null;
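The same constraints can also be declared inline when a table is created. A minimal sketch follows; the column names and sizes are assumptions, and the dept(depno) reference just mirrors the foreign key example above:

CREATE TABLE emp
( rno   NUMBER(5)     PRIMARY KEY,
  name  VARCHAR2(15)  DEFAULT 'aaa' NOT NULL,
  esal  NUMBER(5)     CHECK (esal > 0),
  email VARCHAR2(30)  UNIQUE,
  dno   NUMBER(3)     REFERENCES dept(depno) );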
Joins
• 1. The purpose of a join is to combine data
across tables.
• 2. A join is actually performed by the where
clause, which combines the specified rows of
the tables.
• 3. If a join involves more than two tables,
Oracle joins the first two tables based on the
join condition and then compares the result
with the next table, and so on.
Types of Joins
• Natural join
• Inner join
• Outer join
✓ left outer join
✓ right outer join
✓ full outer join
• Self join
• Cross join (Cartesian product)
• Equi & Non Equi joins
Assume that we have the following tables.
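The slide's sample tables are not reproduced here; the join examples below can be read against tables shaped roughly like these (names and sizes are assumptions, and later examples vary the department column name between deptno, depno and dno):

CREATE TABLE dept
( deptno NUMBER(3) PRIMARY KEY,
  dname  VARCHAR2(15),
  loc    VARCHAR2(15) );

CREATE TABLE emp
( eno    NUMBER(5) PRIMARY KEY,
  ename  VARCHAR2(20),
  job    VARCHAR2(15),
  esal   NUMBER(8),
  deptno NUMBER(3) REFERENCES dept(deptno) );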
• 1. EQUI JOIN
• A join which contains an equal to ‘=’ operator
in the joins condition.
• Ex: SQL> select eno,ename,esal,dname from
emp ,dept where emp.deptno=dept.deptno;
• 2. NON-EQUI JOIN
• A join which contains an operator other than
equal to ‘=’ in the joins condition.
• Ex: SQL> select eno,ename,esal,dname from
emp ,dept where emp.deptno>dept.deptno;
• 3. SELF JOIN
Joining the table itself is called self join.
• Select a.name "teacher", c.name "hod" from
teacher a, teacher c where a.hod=c.id;

  ID  NAME  HOD
  --  ----  ---
  1   M     2
  2   N
  3   O     4
  4   P
• 4. NATURAL JOIN
• Natural join compares all the common
columns.
• Ex: SQL> select eno,ename,dname,loc from
emp natural join dept;
• 5. CROSS JOIN
• This gives the cross product (Cartesian product).
• Ex: SQL> select empno,ename,esal,dname,loc
from emp cross join dept;
• 6. OUTER JOIN
• Outer join gives the non-matching records along with the
matching records.
• LEFT OUTER JOIN
• This displays all matching records, plus the rows from the
left-hand table that have no match in the right-hand table.
• Ex: SQL> select eno,ename,job,dname,loc from emp
left outer join dept on(emp.depno=dept.dno);
• Or
• SQL> select eno,ename,job,dname,loc from emp ,dept
where emp.depno=dept.dno(+);
• RIGHT OUTER JOIN
• This displays all matching records, plus the rows from the
right-hand table that have no match in the left-hand table.
• Ex:
• SQL> select empno,ename,job,dname,loc from emp right
outer join dept on(emp.depno=dept.dno);
• Or
• SQL> select empno,ename,job,dname,loc from emp ,dept
where emp.depno(+) = dept.dno;
• FULL OUTER JOIN
• This displays all matching records plus the non-matching
records from both tables.
• Ex:
• SQL> select empno,ename,job,dname,loc from emp full
outer join dept on(emp.depno=dept.dno);
• 7. INNER JOIN
• This displays only the records that match in
both tables.
• Ex: SQL> select empno,ename,job,dname,loc
from emp inner join dept using(deptno);
Operators and clauses
• IN
• OR
• AND
• Between
• Like
• Distinct
• Rowid
• Order by
• The LIKE operator is used for string or pattern
matching.
• The % character matches any string of any
length.
• The _ character matches a single character.
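For example, using the Student table from earlier (the name values and patterns are illustrative):

SELECT * FROM Student WHERE std_name LIKE 'A%';   -- names starting with A
SELECT * FROM Student WHERE std_name LIKE '_b%';  -- names whose second character is b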
Aggregate functions
• These functions operate on the multiset of values
of a column of a relation, and return a value.
• avg: average value
• min: minimum value
• max: maximum value
• sum: sum of values
• count: number of values
Find the average account balance at the Perryridge branch.
select avg(balance)
from account
where branch-name = 'Perryridge'
Find the number of depositors in the bank.
select count (distinct customer-name)
from depositor
Find the number of tuples in the customer relation.
select count (*)
from customer
Group By and Having
Find the names of all branches where the average account
balance is more than $1,200.
Select branch-name, avg(balance)
From account
group by branch-name having avg(balance) > 1200
Note: predicates in the having clause are applied after the
formation of groups, whereas predicates in the where
clause are applied before forming groups.
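A small sketch contrasting the two clauses, reusing the account relation above (the column is written here as branch_name so it runs as-is, and the balance > 0 filter is only illustrative):

SELECT branch_name, AVG(balance)
FROM   account
WHERE  balance > 0              -- row filter, applied before the groups are formed
GROUP BY branch_name
HAVING AVG(balance) > 1200;     -- group filter, applied after the groups are formed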
Problems
1) select count(empno), dname from emp111, dep111 where
emp111.deptno=dep111.deptno group by dname having
dname in ('cse','ece');
2) select count(empno), dname from emp111, dep111 where
emp111.deptno=dep111.deptno and sal between 15000
and 70000 group by dname having dname in ('cse','me');
3) select empno, ename, sal, dname from emp111 left join
dep111 on emp111.deptno=dep111.deptno;
4) select e.empno, 'works-under', m.mgr from emp111 e,
emp111 m where m.empno=e.empno;
Who is BI for?
• BI for management
• Operational BI
• BI for process improvement
• BI for performance improvement
• BI to improve customer experience
Scenario 1
ABC Pvt Ltd is a company with branches at
Mumbai, Delhi, Chennai and Bangalore. The
Sales Manager wants a quarterly sales report.
Each branch has a separate operational
system.
Scenario 1 : ABC Pvt Ltd.
[Diagram: the operational systems at the Mumbai, Delhi, Chennai and Bangalore branches feed the Sales Manager's request for sales per item type per branch for the first quarter.]
Solution 1:ABC Pvt Ltd.
• Extract sales information from each database.
• Store the information in a common repository at
a single site.
Solution 1 : ABC Pvt Ltd.
[Diagram: data from the Mumbai, Delhi, Chennai and Bangalore systems is loaded into a data warehouse; query and analysis tools deliver the report to the Sales Manager.]
Scenario 2
One Stop Shopping Super Market has a huge
operational database. Whenever executives want
some report, the OLTP system becomes
slow and data entry operators have to wait for
some time.
Scenario 2 : One Stop Shopping
[Diagram: data entry operators and management reports compete for the same operational database, so the operators have to wait.]
Solution 2
• Extract data needed for analysis from operational
database.
• Store it in warehouse.
• Refresh the warehouse at regular intervals so that it
contains up-to-date information for analysis.
• The warehouse will contain data with a historical
perspective.
Solution 2
[Diagram: transaction data is extracted from the operational database into a data warehouse; data entry operators continue to use the operational database while the manager's reports run against the warehouse.]
Scenario 3
Cakes & Cookies is a small,new company.President
of the company wants his company should grow.He
needs information so that he can make correct
decisions.
Solution 3
• Improve the quality of data before loading it
into the warehouse.
• Perform data cleaning and transformation
before loading the data.
Solution 3
[Diagram: query and analysis tools over the data warehouse show the President sales improvement and expansion over time.]
Data warehousing
• It is the process which prepares the basic
repository of data that becomes the data
source from which we extract information.
• A data warehouse is a subject-oriented,
integrated, time-variant and nonvolatile
collection of data that supports
management's decision-making process. Let's
explore this definition of a data warehouse.
• A data warehouse is built by extracting data
from multiple heterogeneous and external
sources, cleansing it to detect errors in the data
and rectify them wherever possible,
integrating and transforming the data from legacy
format to warehouse format, and then loading
the data after sorting and summarizing.
What is a Data Warehouse?
• Defined in many different ways, but not rigorously.
– A decision support database that is maintained separately from the
organization’s operational database
– Support information processing by providing a solid platform of
consolidated, historical data for analysis.
• “A data warehouse is a subject-oriented, integrated, time-variant, and
nonvolatile collection of data in support of management’s decision-
making process.”—W. H. Inmon
• Data warehousing:
– The process of constructing and using data warehouses
Data Warehouse—Subject-Oriented
• Organized around major subjects, such as customer, product,
sales
• The data warehouse is subject oriented because it provides
information around a subject rather than the organization's
ongoing operations.
• These subjects can be product, customers, suppliers, sales,
revenue etc.
• The data warehouse does not focus on ongoing operations;
rather, it focuses on modelling and analysis of data for decision
making.
Data Warehouse—Integrated
• Constructed by integrating multiple, heterogeneous data
sources such as relational databases, flat files, on-line
transaction records. This integration enhances the effective
analysis of data.
• Data cleaning and data integration techniques are applied.
– Ensure consistency in naming conventions, encoding
structures, attribute measures, etc. among different data
sources
• E.g., Hotel price: currency, tax, breakfast covered, etc.
– When data is moved to the warehouse, it is converted.
Data Warehouse—Time Variant
• The data in a data warehouse is identified with a particular time
period and provides information from a
historical point of view.
• The time horizon for the data warehouse is significantly longer than
that of operational systems
– Operational database: current value data
– Data warehouse data: provide information from a historical
perspective (e.g., past 5-10 years)
• Every key structure in the data warehouse
– Contains an element of time, explicitly or implicitly
– But the key of operational data may or may not contain “time
element”
Data Warehouse—Nonvolatile
• Nonvolatile means that previous data is not removed when
new data is added. The data warehouse is kept separate from
the operational database; therefore frequent changes in the
operational database are not reflected in the data warehouse.
• A physically separate store of data transformed from the
operational environment
• Operational update of data does not occur in the data warehouse
environment
– Does not require transaction processing, recovery, and
concurrency control mechanisms
– Requires only two operations in data accessing:
• initial loading of data and access of data
• Metadata - Metadata is simply defined as data about data.
Data that is used to represent other data is known as
metadata. For example, the index of a book serves as
metadata for the contents of the book. In other words, we can
say that metadata is the summarized data that leads us to the
detailed data.
A Data Warehouse Is A Process
[Diagram: data flows from source OLTP systems (raw detail, no/minimal history) through a central repository / data warehouse (integrated, scrubbed, history, summaries) to architected data marts (targeted, specialized OLAP) and end-user workstations. Process steps: design and mapping; extract, scrub, transform; load, index, aggregate; replication and data set distribution; access and analysis with resource scheduling and distribution. Metadata and system monitoring span the entire process.]
There Are Many Options
[Diagram: extraction systems move data from operational source systems into an operational data store, a data warehouse feeding architected data marts, or independent data marts, all of which are accessed from user workstations.]
OLTP vs. OLAP
                     OLTP                               OLAP
users                clerk, IT professional             knowledge worker
function             day to day operations              decision support
DB design            application-oriented               subject-oriented
data                 current, up-to-date, detailed,     historical, summarized,
                     flat relational, isolated          multidimensional, integrated,
                                                        consolidated
usage                repetitive                         ad-hoc
access               read/write,                        lots of scans
                     index/hash on primary key
unit of work         short, simple transaction          complex query
# records accessed   tens                               millions
# users              thousands                          hundreds
DB size              100MB-GB                           100GB-TB
metric               transaction throughput             query throughput, response
Why a Separate Data Warehouse?
• High performance for both systems
  – DBMS tuned for OLTP: access methods, indexing, concurrency control,
    recovery
  – Warehouse tuned for OLAP: complex OLAP queries, multidimensional
    view, consolidation
• Different functions and different data:
  – missing data: decision support requires historical data which
    operational DBs do not typically maintain
  – data consolidation: DS requires consolidation (aggregation,
    summarization) of data from heterogeneous sources
  – data quality: different sources typically use inconsistent data
    representations, codes and formats which have to be reconciled
• Note: there are more and more systems which perform OLAP analysis
  directly on relational databases
Overview of ETL
• In computing, Extract, Transform and Load (ETL)
refers to:
  – Extract data from an outside source
  – Transform it to fit business needs
  – Load it into the end target
ETL systems are commonly used to integrate data
from multiple applications, typically developed and
supported by different vendors or hosted on
separate computer hardware.
Overview of ETL
• Extract
The first part of an ETL process involves extracting the data from the source
systems. In many cases this is the most challenging aspect of ETL, since extracting
data correctly sets the stage for how subsequent processes go further.
Most analytics projects consolidate data from different source systems. Each
separate system may also use a different data organization and/or format.
Common data source formats are relational databases and flat files, but may
include non-relational database structures such as Information Management
System (IMS) or even fetching from outside sources such as through web spidering
or screen-scraping. The streaming of the extracted data source and load on-the-fly
to the destination database is another way of performing ETL when no
intermediate data storage is required. In general, the goal of the extraction phase
is to convert the data into a single format appropriate for transformation
processing.
Overview of ETL
• Transform
The transform stage applies a series of rules or functions to the extracted data to
derive the data for loading into the end target. Some data sources require
very little or even no manipulation of data, whereas others require transformation
to meet the business requirements. Some of the transformations are:
  – Translating coded values
  – Encoding free-form values
  – Sorting
  – Joining
  – Aggregation
  – Transposing or pivoting
• Load
The load phase loads the data into the end target, usually the data warehouse (DW),
but it can be any other format such as a flat file or a relational database.
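As a rough relational sketch of a transform-and-load step, assuming a staging table stg_sales and a warehouse table dw_sales (both names and all columns are hypothetical):

INSERT INTO dw_sales (product_id, sale_date, amount)
SELECT product_id,
       TO_DATE(sale_date_txt, 'YYYY-MM-DD'),   -- transform: legacy text dates to DATE
       amount_cents / 100                      -- transform: convert units
FROM   stg_sales
WHERE  amount_cents IS NOT NULL;               -- cleaning: skip rows with missing amounts
COMMIT;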
Overview of ETL
• Performance
ETL vendors benchmark their record-systems at multiple TB (terabytes) per hour
(or ~1 GB per second) using powerful servers with multiple CPUs, multiple hard
drives, multiple gigabit-network connections, and lots of memory. The fastest ETL
record is currently held by Syncsort, Vertica and HP at 5.4TB in under an hour,
which is more than twice as fast as the earlier record held by Microsoft and Unisys.
Overview of ETL
▪ Parallel Processing
ETL software commonly uses parallel processing. This enables a number of methods to
improve the overall performance of ETL processes when dealing with large volumes of
data.
ETL applications implement three main types of parallelism:
– Data: By splitting a single sequential file into smaller data files to provide
parallel access.
– Pipeline: Allowing the simultaneous running of several components on the
same data stream. For example: looking up a value on record 1 at the same
time as adding two fields on record 2.
– Component: The simultaneous running of multiple processes on different data
streams in the same job, for example, sorting one input file while removing
duplicates on another file.
Extraction, Transformation, and Loading (ETL)
Data extraction
get data from multiple, heterogeneous, and external sources
Data cleaning
detect errors in the data and rectify them when possible
Data transformation
convert data from legacy or host format to warehouse format
Load
sort, summarize, consolidate, compute views, check integrity,
and build indices and partitions
Refresh
propagate the updates from the data sources to the
warehouse
Metadata Repository
Meta data is the data defining warehouse objects. It stores:
Description of the structure of the data warehouse
schema, view, dimensions, hierarchies, derived data definitions, data mart
locations and contents
Operational meta-data
data lineage (history of migrated data and transformation path), currency
of data (active, archived, or purged), monitoring information (warehouse
usage statistics, error reports, audit trails)
The algorithms used for summarization
The mapping from operational environment to the data warehouse
Data related to system performance
warehouse schema, view and derived data definitions
Business data
business terms and definitions, ownership of data, charging policies
Data Warehouse: A Multi-Tiered Architecture
[Diagram: bottom tier - data sources (operational DBs and other sources) are extracted, transformed, loaded and refreshed, with a monitor and integrator and a metadata repository, into the data warehouse and data marts (data storage tier); middle tier - OLAP servers (OLAP engine); top tier - front-end tools for query, reports, analysis and data mining.]
OLAP
• OLTP (On-line Transaction Processing) : is
characterized by a large number of short on-line
transactions (INSERT, UPDATE, DELETE).
• The main emphasis for OLTP systems is put on
very fast query processing, maintaining data
integrity in multi-access environments and an
effectiveness measured by number of
transactions per second.
• In an OLTP database there is detailed and current
data, and the schema used to store transactional
data is the entity (ER) model.
• OLAP (On-line Analytical Processing) : is characterized
by relatively low volume of transactions.
• Queries are often very complex and involve
aggregations.
• For OLAP systems, response time is the
effectiveness measure.
• OLAP applications are widely used by Data Mining
techniques.
• In OLAP database there is aggregated, historical data,
stored in multi-dimensional schemas (usually star
schema).
Data models for OLTP and OLAP
• For OLTP: ER model
• For OLAP: star or snowflake schema
[Diagram: snowflake model]
[Diagram: ER diagram]
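A minimal star-schema sketch (all table and column names are assumptions), followed by the kind of aggregate query an OLAP tool would issue against it:

CREATE TABLE dim_time    (time_id NUMBER PRIMARY KEY, sale_year NUMBER, sale_quarter NUMBER);
CREATE TABLE dim_product (product_id NUMBER PRIMARY KEY, category VARCHAR2(20));
CREATE TABLE fact_sales
( time_id    NUMBER REFERENCES dim_time(time_id),
  product_id NUMBER REFERENCES dim_product(product_id),
  amount     NUMBER );

SELECT t.sale_year, p.category, SUM(f.amount)
FROM   fact_sales f, dim_time t, dim_product p
WHERE  f.time_id = t.time_id
AND    f.product_id = p.product_id
GROUP BY t.sale_year, p.category;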
Information services
• It is not just the process of producing information;
it also involves ensuring that the
information produced is aligned with business
requirements and can be acted upon to
produce value for the company.
• Information is delivered in the form of
reports, charts and dashboards.
• Data mining is a practice used to increase the
body of knowledge.
• Applied analytics is generally used to drive
action and produce outcomes.
Why Data Mining?
• The Explosive Growth of Data: from terabytes to yottabytes
– Data collection and data availability
• Automated data collection tools, database systems, Web,
computerized society
– Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: Remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras, YouTube
• We are drowning in data, but starving for knowledge!
What Is Data Mining?
• Data mining (knowledge discovery from data)
– Extraction of interesting (non-trivial, implicit, previously unknown and
potentially useful) patterns or knowledge from huge amounts of data
• Alternative names
– Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging,
information harvesting, business intelligence, etc.
Knowledge Discovery (KDD) Process
• This is a view from typical database
systems and data warehousing
communities.
• Data mining plays an essential role in
the knowledge discovery process.
[Diagram: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
• 1. Data cleaning (to remove noise and inconsistent
data)
• 2. Data integration (where multiple data sources
may be combined)
• 3. Data selection (where data relevant to the analysis
task are retrieved from the database)
• 4. Data transformation (where data are transformed
or consolidated into forms appropriate for mining by
performing summary or aggregation operations, for
instance)
• 5. Data mining (an essential process where
intelligent methods are applied in order to extract
data patterns)
• 6. Pattern evaluation (to identify the truly interesting
patterns representing knowledge based on some
interestingness measures)
• 7. Knowledge presentation (where visualization and
knowledge representation techniques are used to
present the mined knowledge to the user)
ARCHITECTURE OF DATA MINING
REPRESENTATION FOR VISUALIZING
THE DISCOVERED PATTERNS
• This refers to the form in which discovered
patterns are to be displayed. These
representations may include the following:
• Rules
• Tables
• Charts
• Graphs
• Decision Trees
• Cubes
Why Data Preprocessing?
• Data in the real world is dirty
– incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
– Quality decisions must be based on quality data
– Data warehouse needs consistent integration of quality
data
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains reduced representation in volume but produces the same or
similar analytical results
Data discretization
Part of data reduction but with particular importance, especially for
numerical data
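As a small illustration of the normalization step listed above, min-max scaling of a numeric attribute can be written directly in SQL; the emp table and esal column are assumptions:

SELECT eno,
       (esal - MIN(esal) OVER ()) /
       NULLIF(MAX(esal) OVER () - MIN(esal) OVER (), 0) AS esal_scaled   -- rescaled to [0,1]
FROM   emp;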
Forms of data preprocessing
Data Cleaning
• Data cleaning tasks
– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data
Missing Data
• Data is not always available
– E.g., many tuples have no recorded value for several attributes, such
as customer income in sales data
• Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the time of entry
– not register history or changes of the data
• Missing data may need to be inferred.
How to Handle Missing Data?
• Ignore the tuple: usually done when the class label is missing (assuming the
task is classification); not effective when the percentage of missing values
per attribute varies considerably.
• Fill in the missing value manually: tedious + infeasible!
• Use a global constant to fill in the missing value: e.g., “unknown”, a new
class?!
• Use the attribute mean to fill in the missing value
• Use the attribute mean for all samples belonging to the same class to fill in
the missing value: smarter
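A sketch of the "attribute mean" strategy in SQL, assuming the emp table from earlier and that marks is the attribute with missing values:

UPDATE emp
SET    marks = (SELECT ROUND(AVG(marks)) FROM emp)   -- AVG ignores NULLs
WHERE  marks IS NULL;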
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming convention
Other data problems which require data cleaning
duplicate records
incomplete data
inconsistent data
THANKS