
CABA (Theory): Chapter 1- Information and Database

Data Organization Basics

What is Data?
• Unprocessed, unorganized material (raw facts) from which no decision can be taken.
• The word comes from the Latin "datum".
• It is the atomic unit of information.
• Electronic data are processed by computing devices.
• Example: 20, 90

What is Information?
• A collection of processed data from which we can take a decision or draw a conclusion.
• May be collected about a particular subject.
• Collected data organized and presented in a systematic fashion to reveal the underlying meaning.
• It can answer "who", "where", "when" and "what" type queries.
• Data in a useful form.
• Example: The average marks of the students is 75.

An organization needs to decide the nature and volume of data required to derive information.
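The jump from data to information can be sketched in a couple of lines of plain Python (the individual marks are made-up values chosen to match the example above):

```python
# Raw data: individual marks, meaningless in isolation.
marks = [80, 70, 75, 72, 78]  # hypothetical values

# Processing turns the data into information we can act on.
average = sum(marks) / len(marks)
print(f"Average marks of the students is {average}")  # -> 75.0
```

The individual numbers are data; the computed average is information from which a conclusion can be drawn.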

Characteristics of good information:
 Relevant
 Complete
 Accurate
 Current
 Economical

What is Knowledge?

 Application of data and information to answer "how" type queries.
 Facts and ideas that we acquire through study, research, investigation, observation, or experience.
 A deterministic process; provides a high level of predictability.
 Information is collected, stored and applied to solve a problem.
Example: How to evaluate the marks of the students; how to market a product so that it sells more.

Knowledge is derived from information; information is derived from data.

Wisdom is the ability to judge which aspects of knowledge are true, right, lasting, and applicable.

By Dr. Tapalina Bhattasali

DIKW Pyramid

[Figure: the DIKW pyramid, with Data at the base, then Information, Knowledge, and Wisdom at the apex.]

Data Organization in the context of Database (Data Hierarchy)

[Figure: the data hierarchy, from bit to character (byte) to field to record to file to database.]

Data Processing System

In centralized processing, all processing is handled by a central system: one or more terminals are connected to a single processor (CPU), which executes every command. Note that a terminal is just the combination of a mouse, keyboard, and screen, with no processor of its own. For example, in a library one processor is attached to several terminals, and library users can search for any book from any terminal. This arrangement is more secure, but the whole system fails if the central processor crashes. This type of network is called a centralized network.

In decentralized processing, different processors are connected on the network, and each system does its job independently of the others. For example, in an Internet cafe, all computers perform their own tasks. This type of network is called a decentralized network.

In distributed processing, a problem is divided into many tasks, and the tasks are completed by different systems connected to each other over the network. It is less secure, but processing can continue even if one system fails. For example, in an airline reservation system there are terminals at many locations, and the processing is shared among computers across the network. This type of network is called a distributed network.

Flat File Organization

Separate files are created and stored for each application program. A flat file system organizes files in an operating system so that all files are stored in a single directory.

Major Drawbacks

• Data redundancy
– The same data is duplicated in separate files
• Lack of data integrity
– Data integrity is the degree to which the data in any one file is accurate; duplicated copies easily fall out of sync

Database

A pool of related data is shared by multiple application programs. Rather than having separate data files, each application uses a collection of data that is joined or related in the database.

Advantages
– Improved strategic use of corporate data
– Reduced data redundancy
– Improved data integrity
– Easier modification and updating
– Data and program independence
– Better access to data and information
– Standardization of data access
– A framework for program development
– Better overall protection of the data
– Shared data and information

Disadvantages: High Complexity; Cost Overhead

Database Management System (DBMS)


• A DBMS is general-purpose software (a collection of programs) that enables users to create and maintain a database.
• It facilitates the processes of defining, constructing, manipulating and sharing databases among various users and applications.
Defining: specifying the data types, structures, and constraints for the data.
Constructing: storing the data itself on some storage medium.
Manipulating: querying the database to retrieve specific data, updating the data to reflect changes, and generating reports from the data.
Sharing: allowing multiple users to access the database concurrently.

Relational Database Management System (RDBMS)

RDBMS stands for relational database management system. A relational model represents data as tables of rows and columns. A relational database has the following major components:
1. Table
2. Record or Tuple
3. Field or Column name or Attribute
4. Domain
5. Instance
6. Schema
7. Keys

The data in an RDBMS is stored in database objects called tables. A table is a collection of related data entries and consists of columns and rows.

A domain is the set of permitted values for an attribute in a table. For example, a domain of month-of-year can accept January, February, ..., December as values; a domain of dates can accept all possible valid dates; etc.

The design of a database is called its schema. A schema is of three types: physical schema, logical schema and view schema.
• Physical schema: the design of the database at the physical level; it describes how the data is stored in blocks of storage.
• Logical schema: the design of the database at the logical level, where programmers and database administrators work. Data is described as certain types of data records stored in data structures, while internal details such as the implementation of those data structures remain hidden.
• View schema: the design of the database at the view level. It generally describes end-user interaction with the database system.

The data stored in the database at a particular moment of time is called an instance of the database.

Entity
• A generalized class of people, places, or things (objects) for which data are collected, stored, and maintained
• Example: Customer, Employee

Attribute
• A characteristic of an entity; something the entity is identified by
• Example: Customer name, Employee name

Keys
• A field or set of fields in a record that is used to identify the record
• Example: a field or set of fields that uniquely identifies a record, such as roll_no or emp_id

A super key with no redundant attributes is known as a candidate key. Candidate keys are selected from the set of super keys; the only thing we take care of while selecting a candidate key is that it should not have any redundant attributes. That is why they are also termed minimal super keys.

Example: a table has three attributes: Emp_Id, Emp_Number and Emp_Name. The candidate keys are:
{Emp_Id}
{Emp_Number}

A primary key is a minimal set of attributes (columns) in a table that uniquely identifies the tuples (rows) in that table. Its value must be unique and not null (known as the entity integrity constraint). Example: roll_no.

Foreign keys are columns of a table that point to the primary key of another table. They act as a cross-reference between tables. Example: the Stu_Id column in the Course_enrollment table is a foreign key, as it points to the primary key of the Student table.

Referential integrity: a foreign key value must match a value of the attribute that is the primary key of the referenced table, or it may be null.
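These key constraints can be sketched with Python's built-in sqlite3 module. The column names follow the chapter's example; note that SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

# Student: roll_no is the primary key (unique and not null).
conn.execute("CREATE TABLE Student (roll_no TEXT PRIMARY KEY, name TEXT)")
# Course_enrollment: Stu_Id is a foreign key referencing Student.
conn.execute("""CREATE TABLE Course_enrollment (
                    course TEXT,
                    Stu_Id TEXT REFERENCES Student(roll_no))""")

conn.execute("INSERT INTO Student VALUES ('S101', 'Asha')")
conn.execute("INSERT INTO Course_enrollment VALUES ('DBMS', 'S101')")  # OK

try:
    # Violates referential integrity: no student 'S999' exists.
    conn.execute("INSERT INTO Course_enrollment VALUES ('DBMS', 'S999')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)  # FOREIGN KEY constraint failed
```

The second insert is rejected because 'S999' does not appear as a primary key value in Student.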

Data Dictionary

A detailed description of all data used in the database:
– Provides a standard definition of terms and data elements
– Assists programmers in designing and writing programs
– Simplifies database modification
– Reduces data redundancy
– Increases data reliability
– Supports faster program development
– Supports easier modification of data and information
Sample Data Dictionary

[Table: a sample data dictionary, listing each field's name, data type, size, and description.]

Metadata is "data that provides information about other data". Many distinct types of metadata exist, among them descriptive metadata, structural metadata, administrative metadata, reference metadata and statistical metadata.
 Descriptive metadata describes a resource for purposes such as discovery and
identification. It can include elements such as title, abstract, author, and
keywords.
 Structural metadata is metadata about containers of data and indicates how
compound objects are put together, for example, how pages are ordered to
form chapters. It describes the types, versions, relationships and other
characteristics of digital materials.
 Administrative metadata provides information to help manage a resource,
such as when and how it was created, file type and other technical
information, and who can access it.
 Reference metadata describes the contents and quality of statistical data.
 Statistical metadata may also describe processes that collect, process, or
produce statistical data; such metadata are also called process data.


SQL (Structured Query Language)

SQL is a standard language for accessing and manipulating databases.

 SQL can execute queries against a database


 SQL can retrieve data from a database
 SQL can insert records in a database
 SQL can update records in a database
 SQL can delete records from a database
 SQL can create new databases
 SQL can create new tables in a database
 SQL can create stored procedures in a database
 SQL can create views in a database
 SQL can set permissions on tables, procedures, and views

DDL

Data Definition Language (DDL) statements are used to define the database structure or schema. Some examples:

 CREATE - creates objects in the database
 ALTER - alters the structure of the database
 DROP - deletes objects from the database

Syntax: CREATE TABLE <table_name> (<column_name1> <datatype1> <constraint1>, <column_name2> <datatype2> <constraint2>, ..., <constraint-list>);
e.g. CREATE TABLE Student (Reg_no text(10) PRIMARY KEY, Sname text(30), DOB date, Saddress text(50));

Syntax: ALTER TABLE <table_name> ADD COLUMN <column_name> <datatype>;
e.g. ALTER TABLE Student ADD COLUMN (Age number(2), Marks number(3));

Syntax: ALTER TABLE <table_name> DROP COLUMN <column_name>;
e.g. ALTER TABLE Student DROP COLUMN Age;

Syntax: ALTER TABLE <table_name> MODIFY (<column_name> <NewDataType>(<NewSize>));
e.g. ALTER TABLE Student MODIFY (Sname text(40));

Syntax: DROP TABLE <table_name>;
e.g. DROP TABLE Student;
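As a runnable sketch, some of these DDL statements can be tried in SQLite through Python's built-in sqlite3 module. SQLite's dialect differs slightly from the syntax shown above: ADD COLUMN takes one column per statement, and MODIFY is not part of SQLite's dialect, so it is omitted here:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# CREATE: define the Student table with a primary key constraint.
conn.execute("""CREATE TABLE Student (
                    Reg_no   text(10) PRIMARY KEY,
                    Sname    text(30),
                    DOB      date,
                    Saddress text(50))""")

# ALTER ... ADD COLUMN: SQLite adds one column per statement.
conn.execute("ALTER TABLE Student ADD COLUMN Marks number(3)")

cols = [row[1] for row in conn.execute("PRAGMA table_info(Student)")]
print(cols)  # -> ['Reg_no', 'Sname', 'DOB', 'Saddress', 'Marks']

# DROP: delete the table object itself.
conn.execute("DROP TABLE Student")
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'").fetchall()
print(tables)  # -> []
```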

DML

In a procedural DML, the user has to specify what data are needed and how to obtain them.
In a non-procedural DML, the user has to specify what data are needed without specifying how to get them. Example: SQL (Structured Query Language).
Data Manipulation Language (DML) statements are used for managing data within schema objects. Some examples:
o SELECT - retrieves data from a database
o INSERT - inserts data into a table
o UPDATE - updates existing data within a table
o DELETE - deletes records from a table (the space for the records remains)

Syntax: SELECT * FROM <tablename>;
e.g. SELECT * FROM Student;

Syntax: SELECT <column1, column2, ...> FROM <tablename> WHERE <condition>;
e.g. SELECT First_name, DOB FROM Student WHERE Reg_no = 'S101';

Syntax: INSERT INTO <tablename> (column-1, column-2, ..., column-n) VALUES (value-1, value-2, ..., value-n);
e.g. INSERT INTO Student (reg_no, first_name, last_name, dob, address, pincode) VALUES ('A101', 'Mohd', 'Imran', '01-MAR-89', 'Allahabad', 211001);

Syntax: UPDATE <tablename> SET column1=value1, column2=value2, ... WHERE <condition>;
e.g. UPDATE Student SET course='MCA' WHERE reg_no='A101';

Syntax: DELETE FROM <tablename> WHERE <condition>;
e.g. DELETE FROM Student WHERE reg_no='A101';
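A minimal, runnable version of this DML sequence, again using Python's built-in sqlite3 module (SQLite dialect; the `?` parameter placeholders are an sqlite3 convention, not part of the chapter's syntax):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Student (
                    reg_no TEXT PRIMARY KEY, first_name TEXT,
                    last_name TEXT, course TEXT)""")

# INSERT: parameter placeholders (?) avoid building SQL by string pasting.
conn.execute("INSERT INTO Student (reg_no, first_name, last_name) VALUES (?, ?, ?)",
             ("A101", "Mohd", "Imran"))

# UPDATE: change existing data in matching rows.
conn.execute("UPDATE Student SET course='MCA' WHERE reg_no='A101'")

# SELECT: retrieve data.
row = conn.execute(
    "SELECT first_name, course FROM Student WHERE reg_no='A101'").fetchone()
print(row)  # -> ('Mohd', 'MCA')

# DELETE: remove matching rows.
conn.execute("DELETE FROM Student WHERE reg_no='A101'")
print(conn.execute("SELECT COUNT(*) FROM Student").fetchone()[0])  # -> 0
```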

DCL

Data Control Language (DCL) statements are used to control database access permissions. Some examples:
o GRANT - gives users access privileges to the database
o REVOKE - withdraws access privileges given with the GRANT command

TCL

Transaction Control Language (TCL) statements are used to manage the changes made by DML statements. They allow statements to be grouped together into logical transactions.
 COMMIT - saves the work done
 SAVEPOINT - identifies a point in a transaction to which you can later roll back
 ROLLBACK - restores the database to its original state since the last COMMIT
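These TCL statements can be demonstrated with SQLite, which supports COMMIT, SAVEPOINT and ROLLBACK. A sketch via Python's sqlite3 module, with transactions managed by hand (the table and values are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # manage transactions by hand
cur = conn.cursor()
cur.execute("CREATE TABLE t (x INTEGER)")

cur.execute("BEGIN")                      # start a transaction
cur.execute("INSERT INTO t VALUES (1)")
cur.execute("SAVEPOINT sp1")              # mark a point we can roll back to
cur.execute("INSERT INTO t VALUES (2)")
cur.execute("ROLLBACK TO sp1")            # undo work done after the savepoint
cur.execute("COMMIT")                     # make the surviving changes permanent

print(cur.execute("SELECT x FROM t").fetchall())  # -> [(1,)]
```

Only the insert made before the savepoint survives the ROLLBACK TO and is committed.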

ACID is a concept (and an acronym) that refers to the four properties of a transaction in database systems: Atomicity, Consistency, Isolation and Durability. These properties ensure the accuracy and integrity of the data in the database, guaranteeing that the data does not become corrupt and remains valid even when errors or failures occur.

The ACID properties allow us to write applications without considering the complexity of the environment in which the application executes. This is essential for processing transactions in databases: because of the ACID properties, we can focus on the application logic instead of failures, recovery and synchronization of the data.

Transaction

Before explaining the four ACID properties, we need to understand what a transaction is. A transaction is a sequence of operations executed as a single unit of work; it may consist of one or many steps. A transaction accesses data using read and write operations. Each transaction is a group of operations that acts as a single unit, produces consistent results, acts in isolation from other operations, and whose updates are durably stored.

The goal of a transaction is to preserve the integrity and consistency of the data. If a transaction succeeds, the data modified during the transaction are saved in the database. If an error occurs and the transaction must be cancelled or reverted, the changes made to the data are not applied.

When we work with a database, we execute SQL statements, and those statements are generally executed in blocks; those blocks are the transactions. They allow us to insert, update, delete and search data.


For example, transferring money between bank accounts is a transaction: the value must be debited from one account and credited to another account.
Atomicity

A transaction must be an atomic unit of work, which means that either all of its modifications are performed or none of them are. The transaction either executes completely or fails completely; if one part of the transaction fails, the whole transaction fails. This provides reliability, because if there is a failure in the middle of a transaction, none of the changes in that transaction are committed.

For example, in a financial transaction the money goes out of account A and into account B. Both operations must be executed together; if one of them fails, the other is not executed. The transaction is treated as a single entity, a single command. A transaction can have more than two operations, but they are always executed all together or not at all. In this example, if something fails while the money is being transferred from account A to account B, the entire transaction is aborted and rolled back.
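The bank-transfer example can be sketched with Python's sqlite3 module. The exception raised between the debit and the credit is an artificial, simulated crash, but it shows the whole transaction being rolled back:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES ('A', 100), ('B', 0)")
conn.commit()

try:
    # Debit and credit must succeed together, or not at all.
    conn.execute("UPDATE account SET balance = balance - 30 WHERE name='A'")
    raise RuntimeError("failure mid-transfer")  # simulated crash before the credit
    conn.execute("UPDATE account SET balance = balance + 30 WHERE name='B'")
    conn.commit()
except RuntimeError:
    conn.rollback()  # atomicity: the debit is undone as well

print(conn.execute("SELECT name, balance FROM account ORDER BY name").fetchall())
# -> [('A', 100), ('B', 0)]
```

After the rollback, neither account has changed: the half-finished transfer left no trace.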


Consistency

This property ensures that the transaction maintains the data integrity constraints, leaving the data consistent. The transaction creates a new valid state of the data, and if a failure happens, all the data return to the state they were in before the transaction was executed.

The goal is to ensure that the database is consistent both before and after the transaction. If a transaction would leave the data in an invalid state, the transaction is aborted and an error is reported.

The data saved in the database must always be valid according to the defined rules, including any constraints, cascades, and triggers applied to the database; in this way, corruption of the database by an illegal transaction is avoided. For example, if we try to add a record to a sales table with the code of a product that does not exist in the product table, the transaction will fail. As another example, if a column does not allow negative numbers and we try to add or modify a record using a value lower than zero in that column, the transaction will fail.
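The "no negative numbers" example can be sketched with SQLite, where a CHECK constraint stands in for the rule (the table and values are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint is the "no negative numbers" rule from the example.
conn.execute("CREATE TABLE account (name TEXT, balance INTEGER CHECK (balance >= 0))")
conn.execute("INSERT INTO account VALUES ('A', 50)")

try:
    conn.execute("UPDATE account SET balance = -10 WHERE name='A'")
except sqlite3.IntegrityError as e:
    print("rejected:", e)  # the invalid state is never stored

print(conn.execute("SELECT balance FROM account").fetchone())  # -> (50,)
```

The update that would violate the constraint fails, and the stored data stays in its last valid state.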


Isolation

This property ensures the isolation of each transaction: a transaction in progress is not interfered with or changed by any other concurrent transaction until it is completed.

For example, if two clients try to buy the last available product on a web site at the same time, the first user to finish the purchase causes the transaction of the other user to be interrupted.


Durability

Once a transaction is completed and committed, its changes are persisted permanently in the database. This property ensures that the information saved in the database is immutable until another update or deletion transaction affects it.

Once the transaction is committed, it remains in this state even if a serious problem occurs, such as a crash or a power outage. For this purpose, completed transactions are recorded on permanent (non-volatile) storage devices such as hard drives, so the data remain available even if the DB instance is restarted.
Conclusion

The ACID properties ensure the integrity and consistency of the data in the database, ensuring that the data does not become corrupt as a result of a failure. Databases that apply the ACID properties process only transactions that complete successfully; if a failure happens before a transaction completes, the data is not changed.


Advanced Concepts of DBMS

Data Warehouse:
A data warehouse is a database which is kept separate from the organization's operational database.
 There is no frequent updating in a data warehouse.
 It holds consolidated historical data, which helps the organization analyze its business.
 A data warehouse helps executives organize, understand, and use their data to take strategic decisions.
 Data warehouse systems help in integrating a diversity of application systems.
 A data warehouse system supports consolidated historical data analysis.

A data warehouse is kept separate from operational databases for the following reasons:
 An operational database is constructed for well-known tasks and workloads such as searching particular records, indexing, etc. In contrast, data warehouse queries are often complex and present a general form of data.
 Operational databases support concurrent processing of multiple transactions; concurrency control and recovery mechanisms are required to ensure robustness and consistency of the database.
 An operational database query allows read and modify operations, while an OLAP query needs only read-only access to the stored data.
 An operational database maintains current data; a data warehouse maintains historical data.

Data Warehouse Features


The key features of a data warehouse are discussed below:
 Subject Oriented - A data warehouse is subject oriented because it provides information around a subject rather than the organization's ongoing operations. These subjects can be products, customers, suppliers, sales, revenue, etc. A data warehouse does not focus on ongoing operations; rather, it focuses on modeling and analysis of data for decision making.
 Integrated - A data warehouse is constructed by integrating data from heterogeneous sources such as relational databases, flat files, etc. This integration enhances the effective analysis of data.
 Time Variant - The data collected in a data warehouse is identified with a particular time period. The data in a data warehouse provides information from a historical point of view.

 Non-volatile - Non-volatile means that previous data is not erased when new data is added. A data warehouse is kept separate from the operational database, and therefore frequent changes in the operational database are not reflected in the data warehouse.

A data warehouse does not require transaction processing, recovery, or concurrency control, because it is stored physically separate from the operational database.

Application:
A data warehouse helps business executives organize, analyze, and use their data for decision making. It serves as a central part of a plan-execute-assess "closed-loop" feedback system for enterprise management.

Types of Data Warehouse


Information processing, analytical processing and data mining are the three types of data warehouse applications, discussed below:
 Information Processing - A data warehouse allows processing of the data stored in it. The data can be processed by means of querying, basic statistical analysis, and reporting using crosstabs, tables, charts, or graphs.
 Analytical Processing - A data warehouse supports analytical processing of the information stored in it. The data can be analyzed by means of basic OLAP operations, including slice-and-dice, drill-down, drill-up, and pivoting.
 Data Mining - Data mining supports knowledge discovery by finding hidden patterns and associations, constructing analytical models, and performing classification and prediction. Mining results can be presented using visualization tools.
Functions of Data Warehouse
The following are the functions of data warehouse tools and utilities:
Data Extraction - Involves gathering data from multiple heterogeneous sources.
Data Cleaning - Involves finding and correcting the errors in data.
Data Transformation - Involves converting the data from legacy format to
warehouse format.
Data Loading - Involves sorting, summarizing, consolidating, checking
integrity, and building indices and partitions.
Refreshing - Involves updating the warehouse from the data sources.

A data cube helps us to represent data in multiple dimensions. It is defined by


dimensions and facts. The dimensions are the entities with respect to which an
enterprise preserves the records.


Data marts contain a subset of organization-wide data that is valuable to specific groups of people in an organization. In other words, a data mart contains only the data specific to a particular group. For example, a marketing data mart may contain only data related to items, customers, and sales. Data marts are confined to subjects.

OLAP
Online Analytical Processing (OLAP) servers are based on the multidimensional data model. OLAP allows managers and analysts to gain insight into information through fast, consistent, and interactive access to it.


Data Mining
Data Mining is defined as extracting information from huge sets of data. In other words, we can say that data mining is the procedure of mining knowledge from data.

The information or knowledge extracted in this way can be used in a variety of applications, such as data exploration.

In simple words, data mining is a process used to extract usable data from a larger set of raw data. It implies analyzing data patterns in large batches of data using one or more software tools. Data mining has applications in multiple fields, such as science and research. As an application of data mining, businesses can learn more about their customers and develop more effective strategies for various business functions, and in turn leverage resources in a more optimal and insightful manner. This helps businesses get closer to their objectives and make better decisions. Data mining involves effective data collection and warehousing as well as computer processing. For segmenting the data and evaluating the probability of future events, data mining uses sophisticated mathematical algorithms. Data mining is also known as Knowledge Discovery in Data (KDD).

Key features of data mining:

• Automatic pattern prediction based on trend and behaviour analysis.
• Prediction based on likely outcomes.
• Creation of decision-oriented information.
• Focus on large data sets and databases for analysis.
• Clustering based on finding and visually documenting groups of facts not previously known.

The Data Mining Process:

Technological infrastructure required:
1. Database size: creating a more powerful system requires more data to be processed and maintained.
2. Query complexity: the more complex the queries and the greater their number, the more powerful the system required.
Uses:
1. Data mining techniques are useful in many research projects, including mathematics, cybernetics, genetics and marketing.
2. With data mining, a retailer can manage and use point-of-sale records of customer purchases to send targeted promotions based on an individual's purchase history. The retailer can also develop products and promotions that appeal to specific customer segments, based on mining demographic data from comment or warranty cards.

Database Backup and Recovery

Although most database systems have incorporated backup and recovery tools into their interfaces and infrastructure, it is imperative to understand what the backup and recovery process involves.

Database Backup and Recovery Needs

It is not just data files that need to be part of the backup process; transaction logs must also be backed up. Without the transaction logs, the data files are useless in a recovery event.

Backup and Recovery and Database Failure

There are three main causes of failure that happen often enough to be worth incorporating into a backup and recovery plan. User error is the biggest cause of data damage, loss, or corruption: an application modifies or destroys the data, either on its own or through a user's choice. To fix this problem, the user must recover and restore to the point in time before the corruption occurred. This returns the data to its original state, at the cost of any other changes made to the data since the corruption took place. Hardware failure can also cause data loss or damage; it can happen when the drives on which the data files or transaction logs are stored fail. Most databases are stored on computer hard drives, or across groups of hard drives on designated servers.

Backup and Recovery and Disaster

The third cause of database failure is a disastrous or catastrophic event. This can take the form of fire, flood, or any naturally occurring storm. It can also happen through an electrical outage, a virus, or the deliberate hacking of your data. Any of these events can corrupt or cause the loss of your data. The true disaster would be the lack of a data backup, or of a recovery plan, during an event this severe. Without a data backup, recovery is impossible; and without a recovery plan, there is no guarantee that your data backup will make it through the recovery process.

Physical and Logical Backups

Physical backups are backups of the physical files used in storing and recovering your
database, such as data files, control files, and archived redo logs. Ultimately, every physical
backup is a copy of files storing database information to some other location, whether on
disk or some offline storage.
Logical backups contain logical data (for example, tables or stored procedures) exported from a database with an Oracle export utility and stored in a binary file, for later re-importing into a database using the corresponding Oracle import utility.
Physical backups are the foundation of any sound backup and recovery strategy. Logical
backups are a useful supplement to physical backups in many circumstances but are not
sufficient protection against data loss without physical backups. Unless otherwise
specified, the term "backup" as used in the backup and recovery documentation refers to
physical backups.

Online backup has many advantages over traditional backup methods such as memory sticks, cards, discs and tapes. Like everything, however, online backup naturally has some disadvantages too. Below we explore both.

Online backup advantages over traditional backup methods…


 Data is stored offsite – Perhaps the most important aspect of online backup is that
your data is stored in a different location from the original data. This protects your
files from onsite risks such as floods, fire or employee theft. Traditional backup
requires manually taking the backup media offsite.
 Removes user intervention & manual steps – Online backup doesn't require
someone to change tapes over or label and store multiple CDs or memory sticks.
Online backup removes that user intervention and other manual steps, saving time
and money.
 Automatic Backups – Online backup records changes at each subsequent backup,
in accordance with an automated schedule you have chosen. The process comes
with a built-in 'retry' to account for any possible disruptions, unlike the
unreliability of human error or the hardware and software faults of traditional
backup methods. Some remote backup services work continuously, backing up files
as they are changed.
 Fully Encrypted Backups – Most remote backup services use 128- to 448-bit
encryption to send data over links such as the internet; AES 256-bit encryption
is a trusted industry standard.
 Unlimited Data Backup – Unlimited data retention (presuming the backup
provider stays in business).
 Most remote backup services will maintain a list of versions of your files.
 A few remote backup services can reduce backup time and size by transmitting
only the binary data bits that have changed.
Disadvantages
 Whilst storing data offsite avoids USBs going through the wash with your
jacket, or discs being scratched or lost, the restoration of data from online
backup can be slow, depending on the available network bandwidth. Because data
is stored offsite, it must be recovered either via the internet or via a disk
shipped from the online backup service provider.
 Some backup service providers offer no guarantee that stored data will be kept
private — for example, from their own employees. However, this is easily
overcome: most recommend that files be encrypted before upload to prevent this.
 Like any business, the online backup service provider could go out of
business or be acquired, which may affect the accessibility of your data or the
cost of continuing to use the service.
 If the encryption password is lost, data recovery will be impossible. With
managed services, however, this should not be a problem. Whatever you do, do
not lose that encryption key!
 If you are using residential broadband services, they can often have monthly limits
that can preclude large backups.

By Dr. Tapalina Bhattasali Page 25
Online Analytical Processing (OLAP) servers are based on the multidimensional
data model. OLAP allows managers and analysts to gain insight into information
through fast, consistent, and interactive access. This section covers the types
of OLAP, OLAP operations, and the differences between OLAP, statistical
databases, and OLTP.

Types of OLAP
 Relational OLAP (ROLAP)
 Multidimensional OLAP (MOLAP)
 Hybrid OLAP (HOLAP)

Relational OLAP
ROLAP servers are placed between the relational back-end server and client
front-end tools. To store and manage warehouse data, ROLAP uses a relational or
extended-relational DBMS.
ROLAP includes the following −

 Implementation of aggregation navigation logic.
 Optimization for each DBMS back end.
 Additional tools and services.

Multidimensional OLAP
MOLAP uses array-based multidimensional storage engines for multidimensional
views of data. With multidimensional data stores, the storage utilization may be low if
the data set is sparse. Therefore, many MOLAP servers use two levels of data
storage representation to handle dense and sparse data sets.

Hybrid OLAP
Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers the higher
scalability of ROLAP and the faster computation of MOLAP. HOLAP servers can
store large volumes of detailed information.

OLAP Operations
Since OLAP servers are based on a multidimensional view of data, we will
discuss OLAP operations on multidimensional data.
Here is the list of OLAP operations −

 Roll-up
 Drill-down
 Slice and dice
 Pivot (rotate)
Roll-up
Roll-up performs aggregation on a data cube in either of the following ways −

 By climbing up a concept hierarchy for a dimension
 By dimension reduction
For example −

 Roll-up is performed by climbing up a concept hierarchy for the dimension
location.
 Initially the concept hierarchy was "street < city < province < country".
 On rolling up, the data is aggregated by ascending the location hierarchy from
the level of city to the level of country.
 The data is grouped into countries rather than cities.
 When roll-up is performed, one or more dimensions from the data cube are
removed.
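A roll-up along the location hierarchy (city to country) can be sketched in plain Python; the sales figures and the city-to-country mapping below are invented for illustration:

```python
from collections import defaultdict

# Fact data at the "city" level of the location hierarchy
# (street < city < province < country); the measure is a sales amount.
sales_by_city = {
    ("Toronto", "Q1"): 100,
    ("Vancouver", "Q1"): 150,
    ("Mumbai", "Q1"): 200,
}

# Mapping one level up the concept hierarchy: city -> country.
city_to_country = {"Toronto": "Canada", "Vancouver": "Canada", "Mumbai": "India"}

def roll_up(facts, mapping):
    """Aggregate measures by climbing the location hierarchy one level."""
    out = defaultdict(int)
    for (city, quarter), amount in facts.items():
        out[(mapping[city], quarter)] += amount
    return dict(out)

print(roll_up(sales_by_city, city_to_country))
# {('Canada', 'Q1'): 250, ('India', 'Q1'): 200}
```

Note how the city level disappears from the result: after rolling up, facts are only addressable at the country level.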
Drill-down
Drill-down is the reverse operation of roll-up. It is performed in either of
the following ways −

 By stepping down a concept hierarchy for a dimension
 By introducing a new dimension

For example −

 Drill-down is performed by stepping down a concept hierarchy for the
dimension time.
 Initially the concept hierarchy was "day < month < quarter < year".
 On drilling down, the time dimension is descended from the level of quarter to
the level of month.
 When drill-down is performed, one or more dimensions from the data cube are
added.
 It navigates from less detailed data to more detailed data.
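Drill-down works because the detailed facts are still stored underneath the aggregate view. A minimal sketch (the month-level figures and month-to-quarter mapping are invented for illustration):

```python
# Detailed facts at the month level; the cube is currently viewed at the
# quarter level, and we want to descend from quarter back to month.
sales_by_month = {"Jan": 40, "Feb": 35, "Mar": 25, "Apr": 50}
month_to_quarter = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1", "Apr": "Q2"}

def drill_down(quarter):
    """Step down the time hierarchy: expand a quarter into its months."""
    return {m: v for m, v in sales_by_month.items()
            if month_to_quarter[m] == quarter}

print(drill_down("Q1"))
# {'Jan': 40, 'Feb': 35, 'Mar': 25}
```

The sum of the drilled-down details (40 + 35 + 25 = 100) equals the quarter-level aggregate, which is what makes roll-up and drill-down inverse operations.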
Slice
The slice operation selects one particular dimension from a given cube and
provides a new sub-cube. For example −
 Here slice is performed for the dimension "time" using the criterion time =
"Q1".
 It forms a new sub-cube by fixing a single value along that dimension.
Dice
Dice selects two or more dimensions from a given cube and provides a new
sub-cube. For example, a dice operation based on the following selection
criteria involves three dimensions −
 (location = "Toronto" or "Vancouver")
 (time = "Q1" or "Q2")
 (item = "Mobile" or "Modem")
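Slice and dice are both filters over the cube's dimensions; slice fixes one value on one dimension, dice restricts several dimensions to sets of values. A minimal sketch, with the cube rows invented for illustration:

```python
# A tiny cube as a list of (location, time, item, measure) tuples.
cube = [
    ("Toronto", "Q1", "Mobile", 100),
    ("Toronto", "Q3", "Modem", 50),
    ("Vancouver", "Q2", "Mobile", 80),
    ("New York", "Q1", "Mobile", 120),
]

def slice_cube(cube, time):
    """Slice: fix a single value on one dimension (here, time)."""
    return [row for row in cube if row[1] == time]

def dice_cube(cube, locations, times, items):
    """Dice: restrict several dimensions to sets of values."""
    return [row for row in cube
            if row[0] in locations and row[1] in times and row[2] in items]

print(slice_cube(cube, "Q1"))
print(dice_cube(cube, {"Toronto", "Vancouver"}, {"Q1", "Q2"}, {"Mobile", "Modem"}))
```

The dice call mirrors the three selection criteria above: (location = "Toronto" or "Vancouver"), (time = "Q1" or "Q2"), (item = "Mobile" or "Modem").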
Pivot
The pivot operation is also known as rotation. It rotates the data axes in the
view in order to provide an alternative presentation of the data.
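Pivoting does not change any values, only which dimension labels the rows and which labels the columns. A minimal sketch (the 2-D view and its numbers are invented for illustration):

```python
# A 2-D view of the cube: rows are items, columns are quarters.
view = {
    "Mobile": {"Q1": 220, "Q2": 80},
    "Modem":  {"Q1": 0,   "Q2": 30},
}

def pivot(view):
    """Pivot (rotate): swap the row and column axes of the view."""
    rotated = {}
    for row_key, cols in view.items():
        for col_key, value in cols.items():
            rotated.setdefault(col_key, {})[row_key] = value
    return rotated

print(pivot(view))
# {'Q1': {'Mobile': 220, 'Modem': 0}, 'Q2': {'Mobile': 80, 'Modem': 30}}
```

Pivoting twice returns the original view, which confirms that rotation only re-labels the axes.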
What is Data Lake?
A Data Lake is a storage repository that can store large amounts of
structured, semi-structured, and unstructured data. It is a place to store
every type of data in its native format, with no fixed limits on account size
or file. It offers high data quantity to increase analytic performance and
native integration.

A Data Lake is like a large container, very similar to a real lake fed by
rivers. Just as a lake has multiple tributaries coming in, a data lake has
structured data, unstructured data, machine-to-machine data, and logs flowing
through in real time.

Reasons for using Data Lake are:


 With the onset of storage engines like Hadoop, storing disparate
information has become easy. There is no need to model data into an
enterprise-wide schema with a Data Lake.
 With the increase in data volume, data quality, and metadata, the
quality of analyses also increases.
 Data Lake offers business Agility
 Machine Learning and Artificial Intelligence can be used to make
profitable predictions.
 It offers a competitive advantage to the implementing organization.
 There is no data silo structure. Data Lake gives 360 degrees view of
customers and makes analysis more robust.

Benefits and Risks of using Data Lake:


Here are some major benefits in using a Data Lake:

 Helps fully with productionizing & advanced analytics


 Offers cost-effective scalability and flexibility
 Offers value from unlimited data types
 Reduces long-term cost of ownership
 Allows economic storage of files
 Quickly adaptable to changes
 The main advantage of a data lake is the centralization of different
content sources
 Users from various departments, who may be scattered around the globe,
can have flexible access to the data

Risk of Using Data Lake:

 After some time, a Data Lake may lose relevance and momentum
 There is a larger amount of risk involved while designing a Data Lake
 Unstructured data may lead to ungoverned chaos: unusable data, disparate
and complex tools, and the lack of enterprise-wide collaboration or a unified,
consistent, common view
 It also increases storage & compute costs
 There is no way to get insights from others who have worked with the
data, because there is no account of the lineage of findings by
previous analysts
 The biggest risk of data lakes is security and access control.
Sometimes data can be placed into a lake without any oversight, and
some of that data may have privacy and regulatory needs

Summary:
 A Data Lake is a storage repository that can store large amounts of
structured, semi-structured, and unstructured data.
 The main objective of building a data lake is to offer an unrefined
view of data to data scientists.
 Unified operations tier, Processing tier, Distillation tier and HDFS are
important layers of Data Lake Architecture
 Data Ingestion, Data Storage, Data Quality, Data Auditing, Data
Exploration, and Data Discovery are some important components of Data
Lake Architecture
 Design of Data Lake should be driven by what is available instead of
what is required.
 Data Lake reduces long-term cost of ownership and allows economic
storage of files
 The biggest risk of data lakes is security and access control.
Sometimes data can be placed into a lake without any oversight, and
some of that data may have privacy and regulatory needs.
