CABA (Theory): Chapter 1 - Information and Database
What is Data?
• Unprocessed / unorganized material (raw facts) on the basis of which no
decision can be taken.
• It comes from the Latin word "Datum".
• It is the atomic unit of information.
• Electronic data is processed by computing devices.
• Example: 20, 90
What is Information?
• A collection of processed data on the basis of which we can take a
decision or draw a conclusion.
• May be collected about a particular subject.
• Collected data organized and presented in a systematic fashion so that its
underlying meaning can be understood.
• It can answer "who", "where", "when", and "what" type queries.
• It is data in a useful form.
• Example: The average mark of the students is 75.
Characteristics of Valuable Information
• Relevant
• Complete
• Accurate
• Current
• Economical
What is Knowledge?
• The awareness and understanding of a set of information and of the ways that
information can be made useful to support a specific task or reach a decision.
• Wisdom is the ability to judge which aspects of knowledge are true, right, lasting, and
applicable to a particular situation.
DIKW Pyramid
In decentralized processing, different CPUs are connected on the network and
each processor can do its job independently of the others. For example, in a net café,
all computers can perform their own tasks. This type of network is called a
decentralized network.
In distributed processing, a problem is divided into many tasks; the tasks are completed
by different systems connected to each other over the network. It is less secure, but the
system can continue even if one system fails.
Another type of processing also exists, named centralized processing. In this type of
processing, different terminals are connected to the network and are controlled by a single
CPU. For example, in an air reservation system there are different terminals,
requests come from many locations, and all the terminals are controlled by a
single main processor. This type of network is called a centralized network.
Traditional File-Based Approach
Separate files are created and stored for each application program.
A flat file system is a way of organizing files in an operating system in which all files are
stored in a single directory.
Major Drawbacks
• Data redundancy
– Duplication of data in separate files
• Lack of data integrity
– Data integrity is the degree to which the data in any one file is accurate; when the
same data is duplicated across files, the copies easily become inconsistent.
Advantages of the Database Approach
– Improved strategic use of corporate data
– Reduced data redundancy
– Improved data integrity
– Easier modification and updating
– Data and program independence
– Better access to data and information
– Standardization of data access
– A framework for program development
– Better overall protection of the data
– Shared data and information
The data in an RDBMS is stored in database objects called tables.
A table is a collection of related data entries and consists of columns and rows.
Entity
• A generalized class of people, places, or things (objects) for
which data are collected, stored, and maintained
• Example: Customer, Employee
Attribute
• A characteristic of an entity; something the entity is identified by
• Example: Customer name, Employee name
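In a relational database, an entity typically becomes a table and its attributes become the
table's columns. The following is a minimal sketch of that mapping; the Customer table, its
columns, and the sample row are illustrative assumptions, not part of the original notes.

-- Hypothetical mapping of the Customer entity to a relational table;
-- each attribute of the entity becomes a column of the table.
CREATE TABLE Customer (
    customer_id   INT PRIMARY KEY,       -- unique identifier for each customer
    customer_name VARCHAR(100) NOT NULL, -- the "Customer name" attribute
    city          VARCHAR(50)
);

-- Each row then describes one instance of the entity.
INSERT INTO Customer (customer_id, customer_name, city)
VALUES (1, 'A. Sen', 'Kolkata');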
Data Dictionary
A data dictionary is a detailed description of all the data used in the database; in other
words, it stores metadata.
Metadata is "data that provides information about other data". Many
distinct types of metadata exist, among them descriptive metadata, structural
metadata, administrative metadata, reference metadata, and statistical
metadata.
• Descriptive metadata describes a resource for purposes such as discovery and
identification. It can include elements such as title, abstract, author, and
keywords.
• Structural metadata is metadata about containers of data and indicates how
compound objects are put together, for example, how pages are ordered to
form chapters. It describes the types, versions, relationships, and other
characteristics of digital materials.
• Administrative metadata provides information to help manage a resource,
such as when and how it was created, file type and other technical
information, and who can access it.
• Reference metadata describes the contents and quality of statistical data.
• Statistical metadata may also describe processes that collect, process, or
produce statistical data; such metadata are also called process data.
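Many relational systems expose their data dictionary through catalog views such as the
standard INFORMATION_SCHEMA. The query below is a small sketch, assuming a system that
supports INFORMATION_SCHEMA and the hypothetical Customer table used earlier.

-- List metadata (column names, data types, nullability) about the Customer table
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_name = 'Customer';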
DDL
Data Definition Language (DDL) statements are used to define or modify the database
structure or schema.
Some examples:
o CREATE - creates objects, such as tables, in the database
o ALTER - alters the structure of an existing object, for example adding or dropping a
COLUMN <column_name>;
o DROP - deletes objects from the database
o TRUNCATE - removes all records from a table, including the space allocated for the records
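A minimal sketch of typical DDL statements, continuing the hypothetical Customer table from
earlier (the column and table names are only illustrative):

-- Add a new column to an existing table
ALTER TABLE Customer ADD phone VARCHAR(20);

-- Remove a column that is no longer needed
ALTER TABLE Customer DROP COLUMN phone;

-- Remove all rows and release the space allocated to them
TRUNCATE TABLE Customer;

-- Remove the table definition itself
DROP TABLE Customer;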
DML
In a procedural DML, the user has to specify what data are needed and how to obtain them.
In a non-procedural DML, the user has to specify what data are needed without specifying how
to get them.
Example: SQL (Structured Query Language)
Data Manipulation Language (DML) statements are used for managing data within schema
objects.
Some examples:
o SELECT - retrieves data from a database
o INSERT - inserts data into a table
o UPDATE - updates existing data within a table
o DELETE - deletes records from a table; the space allocated for the records remains
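A minimal sketch of the four DML statements against the hypothetical Customer table (the
names and values are illustrative assumptions):

-- SELECT: retrieve rows that match a condition
SELECT customer_name, city FROM Customer WHERE city = 'Kolkata';

-- INSERT: add a new row
INSERT INTO Customer (customer_id, customer_name, city) VALUES (2, 'B. Roy', 'Delhi');

-- UPDATE: change existing data
UPDATE Customer SET city = 'Mumbai' WHERE customer_id = 2;

-- DELETE: remove rows (all rows if no WHERE clause is given)
DELETE FROM Customer WHERE customer_id = 2;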
DCL
Data Control Language (DCL) statements are used to control database access permissions.
Some examples:
o GRANT - gives users access privileges to the database
o REVOKE - withdraws access privileges given with the GRANT command
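A minimal sketch of DCL statements; the user name clerk_user and the Customer table are
illustrative assumptions:

-- Allow a user to read and insert rows in the Customer table
GRANT SELECT, INSERT ON Customer TO clerk_user;

-- Later, take back the INSERT privilege
REVOKE INSERT ON Customer FROM clerk_user;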
TCL
Transaction Control Language (TCL) statements are used to manage the changes made by DML
statements; examples are COMMIT, ROLLBACK, and SAVEPOINT.
ACID Properties
A transaction must have four properties: Atomicity, Consistency, Isolation, and
Durability. These properties ensure the accuracy and integrity of the data in the
database, ensuring that the data does not become corrupt as a result of some failure,
and guaranteeing the validity of the data even when errors or failures occur.
The ACID properties allow us to write applications without considering the complexity
of the environment in which the application is executed. This is essential for processing
transactions reliably.
Transaction
A transaction is a single unit of work, and a transaction may consist of one or many steps. A
transaction accesses data using read and write operations. Each transaction is a group of
operations that acts as a single unit, produces consistent results, acts in isolation from
other operations, and is stored durably once committed.
The goal of a transaction is to preserve the integrity and consistency of the data.
If a transaction succeeds, the data that were modified during the transaction will be
saved in the database. If some error occurs and the transaction needs to be cancelled or
reverted, the modified data are rolled back to their previous state.
When we work with a database, we execute SQL statements, and those statements
are generally executed in blocks; those blocks are the transactions. They allow us to
insert, update, delete, and search data, and so on.
For example, transferring money between bank accounts is a transaction: the value must
be debited from one account and credited to another account.
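The bank transfer can be sketched as a single transaction in SQL. The accounts table, its
columns, and the amount are illustrative assumptions, and the exact statement for starting a
transaction varies between database systems (BEGIN, START TRANSACTION, and so on):

-- Both updates must succeed together or not at all
BEGIN TRANSACTION;

UPDATE accounts SET balance = balance - 500 WHERE account_no = 'A';  -- debit account A
UPDATE accounts SET balance = balance + 500 WHERE account_no = 'B';  -- credit account B

-- Make the changes permanent only if both updates succeeded
COMMIT;

-- If anything had failed in between, the changes would be undone instead:
-- ROLLBACK;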
Atomicity
A transaction must be an atomic unit of work, which means that either all of its
modifications are performed or none of them are. The transaction must execute
completely or fail completely; if one part of the transaction fails, the whole transaction
fails. This provides reliability because, if there is a failure in the middle of a transaction,
none of its changes are saved.
For example, in a financial transaction the money goes out from account A and goes to
account B. Both operations should be executed together, and if one of them fails, the
other will not be executed. So the transaction is treated as a single entity, as a single
command. A transaction can have more than two operations, but either all of them are
executed or none are. In this example, when the money is being transferred
from account A to account B, if something fails, the entire transaction is aborted,
leaving the data consistent. The transaction creates a new valid state of the data,
and if some failure happens, it returns all the data to the state it had before the
transaction was executed.
Consistency
The goal is to ensure that the database is consistent before and after the transaction.
If a transaction leaves data in an invalid state, the transaction is aborted and an error
is reported.
The data that is saved in the database must always be valid (valid according to the
defined rules, including any constraints, cascades, and triggers that have been
applied to the database); this way, the corruption of the database that could
otherwise result is avoided. For example, if you try to insert a record into the
sale table with the code of a product that does not exist in the product table, the
transaction will fail. As another example, if you have a column that does not allow
negative numbers and you try to add or modify a record using a value lower than zero
in that column, the transaction will fail and the changes will be rolled back.
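Such rules can be declared so that the database itself rejects inconsistent data. The sketch
below is illustrative only: the sale table, its columns, and the referenced product table are
assumptions.

CREATE TABLE sale (
    sale_id    INT PRIMARY KEY,
    product_id INT NOT NULL REFERENCES product(product_id), -- must refer to an existing product
    quantity   INT NOT NULL CHECK (quantity >= 0),          -- negative values are rejected
    amount     DECIMAL(10,2) NOT NULL CHECK (amount >= 0)
);

-- This insert would fail and be rolled back: product 999 does not exist and the quantity is negative
-- INSERT INTO sale (sale_id, product_id, quantity, amount) VALUES (1, 999, -2, 50.00);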
Isolation
This property ensures the isolation of each transaction, ensuring that transactions do
not interfere with one another. It means that each transaction in progress will not be
interfered with by any other transaction until it is completed.
For example, if two clients are trying to buy the last available product on the web
site at the same time, when the first user finishes the shopping, the product becomes
unavailable, so the second user's transaction cannot also succeed.
Durability
This property ensures that, once a transaction is committed, its changes are stored
permanently in the database. This property ensures that the information that is
saved will not be lost even if a failure later affects the system.
Once the transaction is committed, it will remain in this state even if a serious
problem occurs, such as a crash or a power outage. For this purpose, the completed
transactions are recorded on non-volatile storage, such as disk drives, so the data will
always be available, even if the DB instance is restarted.
Conclusion
The ACID properties ensure the integrity and the consistency of the data in the
database, ensuring that the data does not become corrupt as a result of some failure.
The databases that apply the ACID properties will ensure that only transactions that
were completely successful are processed; if some failure happens before a
transaction completes, none of its changes are saved.
Data Warehouse:
A data warehouse is a database, which is kept separate from the
organization's operational database.
There is no frequent updating done in a data warehouse.
It possesses consolidated historical data, which helps the organization to
analyze its business.
A data warehouse helps executives to organize, understand, and use their
data to take strategic decisions.
Data warehouse systems help in the integration of a diversity of application
systems.
A data warehouse system helps in consolidated historical data analysis.
A data warehouse is kept separate from operational databases for the following
reasons:
An operational database is constructed for well-known tasks and workloads
such as searching particular records, indexing, etc. In contrast, data
warehouse queries are often complex and they present a general form of data.
Operational databases support concurrent processing of multiple
transactions. Concurrency control and recovery mechanisms are required for
operational databases to ensure robustness and consistency of the database.
An operational database query allows read and modify operations, while
an OLAP query needs only read-only access to stored data.
An operational database maintains current data. On the other hand, a data
warehouse maintains historical data.
Application:
A data warehouse helps business executives to organize, analyze, and use their data
for decision making. A data warehouse serves as a sole part of a plan-execute-assess
"closed-loop" feedback system for the enterprise management.
OLAP
An Online Analytical Processing (OLAP) server is based on the multidimensional data
model. It allows managers and analysts to get insight into information through
fast, consistent, and interactive access to it.
Data Mining
Data Mining is defined as extracting information from huge sets of data. In other words,
we can say that data mining is the procedure of mining knowledge from data.
The information or knowledge extracted in this way can be used for applications such as
market analysis, fraud detection, customer retention, production control, and science
exploration.
In simple words, data mining is defined as a process used to extract usable data from a
larger set of raw data. It implies analyzing data patterns in large batches of data using
one or more software tools. Data mining has applications in multiple fields, such as science and
research. As an application of data mining, businesses can learn more about their
customers and develop more effective strategies related to various business functions and
in turn leverage resources in a more optimal and insightful manner. This helps businesses
be closer to their objective and make better decisions. Data mining involves effective data
collection and warehousing as well as computer processing. For segmenting the data and
evaluating the probability of future events, data mining uses sophisticated mathematical
algorithms. Data mining is also known as Knowledge Discovery in Data (KDD).
• Clustering - based on finding and visually documenting groups of facts not previously known.
Although most database systems have incorporated backup and recovery tools into their
interfaces and infrastructure, it is imperative to understand what the backup and recovery
process involves.
It is not just data files that need to be part of the backup process. Transaction logs must
also be backed up. Without the transaction logs the data files are useless in a recovery
event.
There are three main causes of failure that occur often enough to be worth incorporating into
a backup and recovery plan. User error is the biggest cause of data damage, loss, or
corruption. In this type of failure, an application modifies or destroys the data
on its own or through a user's choice. To fix this problem, the user must recover and restore
the database to the point in time before the corruption occurred. This returns the data to its
original state, at the cost of any other changes that were being made to the data after the
point the corruption took place. Hardware failure can also cause data loss or damage.
Hardware failure can happen when the drives on which the data files or transaction logs are
stored fail. Most databases will be stored on computer hard drives or across groups of hard
drives on designated servers.
The third reason for database failure is a disastrous or catastrophic event. This can be in
the form of fire, flood, or any naturally occurring storm. It can also happen through
electrical outage, a virus, or the deliberate hacking of your data. Any of these events can
corrupt or cause the loss of your data. The true disaster would be the lack of data
backup or the lack of a recovery plan during an event this severe. Without data backup,
recovery is impossible. And without a recovery plan there is no guarantee that your data
backup will make it through the recovery process.
Physical backups are backups of the physical files used in storing and recovering your
database, such as data files, control files, and archived redo logs. Ultimately, every physical
backup is a copy of files storing database information to some other location, whether on
disk or some offline storage.
Logical backups contain logical data (for example, tables or stored procedures) exported
from a database with an Oracle export utility and stored in a binary file, for later re-
importing into a database using the corresponding Oracle import utility.
Physical backups are the foundation of any sound backup and recovery strategy. Logical
backups are a useful supplement to physical backups in many circumstances but are not
sufficient protection against data loss without physical backups. Unless otherwise
specified, the term "backup" as used in the backup and recovery documentation refers to
physical backups.
Online backup has a lot of advantages over traditional backup methods such as
memory sticks, cards, discs, and tapes. However, like everything, online backup
naturally has some disadvantages too.
As noted earlier, Online Analytical Processing (OLAP) is based on the multidimensional
data model and allows managers and analysts to get insight into information through
fast, consistent, and interactive access to it. This section covers the types of OLAP and
the operations that can be performed on OLAP data.
Types of OLAP
Relational OLAP (ROLAP)
Multidimensional OLAP (MOLAP)
Hybrid OLAP (HOLAP)
Relational OLAP
ROLAP servers are placed between the relational back-end server and the client front-end
tools. To store and manage warehouse data, ROLAP uses a relational or extended-relational
DBMS.
Multidimensional OLAP
MOLAP uses array-based multidimensional storage engines for multidimensional
views of data. With multidimensional data stores, the storage utilization may be low if
the data set is sparse. Therefore, many MOLAP servers use two levels of data
storage representation to handle dense and sparse data sets.
Hybrid OLAP
Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers the higher
scalability of ROLAP and the faster computation of MOLAP. HOLAP servers can
store large volumes of detailed data.
OLAP Operations
Since OLAP servers are based on multidimensional view of data, we will discuss
OLAP operations in multidimensional data.
Here is the list of OLAP operations −
Roll-up
Drill-down
Slice and dice
Pivot (rotate)
Roll-up
Roll-up performs aggregation on a data cube in any of the following ways −
By climbing up a concept hierarchy for a dimension
By dimension reduction
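As a rough illustration, rolling sales up from the city level to the country level of a
location hierarchy can be expressed with grouped aggregation in SQL; the sales table and its
columns are illustrative assumptions:

-- Detail level: total sales per city
SELECT country, city, SUM(amount) AS total_sales
FROM sales
GROUP BY country, city;

-- Rolled up one level in the location hierarchy: total sales per country
SELECT country, SUM(amount) AS total_sales
FROM sales
GROUP BY country;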
A Data Lake is like a large container, very similar to a real lake or river. Just as a lake
has multiple tributaries coming in, a data lake has structured data, unstructured data,
machine-to-machine data, and logs flowing through in real time.
Risks of Data Lake
After some time, a Data Lake may lose relevance and momentum.
There is a larger amount of risk involved while designing a Data Lake.
Unstructured data may lead to ungoverned chaos and unusable data, and it takes
disparate and complex tools to achieve enterprise-wide collaboration and a unified,
consistent, and common view of the data.
It also increases storage and compute costs.
There is no way to get insights from others who have worked with the data,
because there is no account of the lineage of findings by previous analysts.
The biggest risk of data lakes is security and access control. Sometimes data can be
placed into a lake without any oversight, even though some of the data may have
privacy and regulatory requirements.
Summary:
A Data Lake is a storage repository that can store large amounts of
structured, semi-structured, and unstructured data.
The main objective of building a data lake is to offer an unrefined
view of data to data scientists.
Unified operations tier, Processing tier, Distillation tier and HDFS are
important layers of Data Lake Architecture
Data Ingestion, Data Storage, Data Quality, Data Auditing, Data
Exploration, and Data Discovery are some important components of Data
Lake Architecture.
Design of Data Lake should be driven by what is available instead of
what is required.
Data Lake reduces long-term cost of ownership and allows economic
storage of files
The biggest risk of data lakes is security and access control.
Sometimes data can be placed into a lake without any oversight, as
some of the data may have privacy and regulatory needs.