Data and Data Storage

The document discusses various types of data found in organizations, including structured, unstructured, and semi-structured data, along with their advantages and disadvantages. It also covers data storage options such as relational databases, NoSQL databases, data warehouses, and data lakes, as well as different types of data centers like on-premises, public cloud, colocation, and hybrid models. Additionally, it highlights the differences between cloud and on-prem data centers, emphasizing the shared responsibility model and data center tiers.

Uploaded by

ganesh697todkari


Types of data commonly found in an organization:
• Unstructured data: Emails, white papers, product specifications and PDF
files.
• Transactional data: Business events and transactions, such as sales,
invoices and claims.
• Metadata: Data about other data, such as report definitions and log files.
• Hierarchical data: Relationships between data, such as organizational
structures or product lines.
• Reference data: Data used to classify or categorize other data, such as
country and currency codes and other industry classifications.
• Master data: Core data describing key business entities, such as customers,
products and locations.
What Is Structured Data
• Structured data is typically stored in tabular form and managed in a relational
database (RDBMS).
• Fields contain data of a predefined format.
• Some fields might have a strict format, such as phone numbers or addresses,
while other fields can have variable-length text strings, such as names or
descriptions.
• Structured data might be generated by either humans or machines.
• It is easy to manage and highly searchable, both via human-generated queries
and automated analysis by traditional statistical methods and machine learning
(ML) algorithms.
• Structured data is used in almost every industry. Common examples of
applications that rely on structured data include customer relationship
management (CRM), invoicing systems, product databases, and contact lists.
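The points above can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and sample rows are made up for illustration and do not come from the slides; the idea is only that fields with a predefined format make the data easy to query.

```python
import sqlite3

# In-memory database; table, columns, and rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contacts ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " phone TEXT NOT NULL,"
    " city TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO contacts (name, phone, city) VALUES (?, ?, ?)",
    [
        ("Alice", "555-0100", "Pune"),
        ("Bob", "555-0101", "Mumbai"),
        ("Carol", "555-0102", "Pune"),
    ],
)

# Every row has the same fields, so a query can filter on any of them.
pune_contacts = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM contacts WHERE city = ? ORDER BY name", ("Pune",)
    )
]
print(pune_contacts)  # ['Alice', 'Carol']
```

The same query would be much harder to express against free-form text, which is the searchability advantage described above.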
Advantages of Structured Data
• Easy to use for business users—structured data can be used by business
users who understand the subject matter related to the data. It is useful for
entry level users with access to basic tools like Excel, and can be even more
useful for power users familiar with SQL or business intelligence (BI) tools.
• Extensive tools support—structured data is several decades old and most
data management and analytics tools support it. There is a huge variety of
RDBMS, data analytics, and big data management tools for structured
datasets.
• Instantly usable—structured data can be used, with no further processing,
by a variety of business processes. For example, customer data in
structured form can be visualized and manipulated by a CRM system.
Disadvantages of Structured Data
• Data preparation: data often needs to undergo complex transformations before
it can enter a structured data store.
• Not flexible—structured data requires users to create schema data definitions in
advance. It is difficult to change the structure over time, and because there is a
fixed, predefined structure, data can only be used for its intended purpose. This
limits the use cases that can be served by structured data.
• High overhead—structured data is often stored in data warehouses, which can
store structured data at large scale and enable fast access for user queries. A data
warehouse is a complex system requiring significant resources to run, operate
and maintain.
• Complex data structures—as organizations grow, the number of databases,
tables, and fields grows exponentially. It becomes difficult to manage structured
data, and it is common to have overlaps between datasets, redundant data, and
stale or low quality data.
What Is Unstructured Data
• Unstructured data includes various content such as documents, videos,
audio files, posts on social media, and emails.
• These data types can be difficult to standardize and categorize.
• Unstructured data often consists of data collections rather than a clear
data element—for example, a document with thousands of words
addressing multiple topics. In this case, the document’s contents cannot
easily be defined as one entity.
• Generally, tools that handle structured data cannot parse unstructured
documents to help categorize their data.
• Unstructured data is manageable, but data items are typically stored as
objects in their original format.
• Users and tools can manipulate the data when needed; otherwise, it
remains in its raw form. Applying structure only when the data is read is
known as schema-on-read.
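Schema-on-read can be sketched in a few lines of Python. This is an illustrative example only: the raw_store list and the read_subjects helper are hypothetical names, and a real data lake would use object storage rather than an in-memory list.

```python
import json

# Raw items are stored unchanged, as text, with no schema enforced on write.
raw_store = [
    '{"type": "email", "subject": "Invoice 42", "body": "Please find attached."}',
    '{"type": "post", "text": "Great service!", "likes": 12}',
    "plain text note with no markup at all",
]

def read_subjects(store):
    """Apply a schema only at read time: keep items that parse as JSON
    objects and carry a 'subject' field, and skip everything else."""
    subjects = []
    for item in store:
        try:
            record = json.loads(item)
        except json.JSONDecodeError:
            continue  # not JSON; the item simply stays in its raw form
        if isinstance(record, dict) and "subject" in record:
            subjects.append(record["subject"])
    return subjects

subjects = read_subjects(raw_store)
print(subjects)  # ['Invoice 42']
```

Nothing was rejected when the items were written; the reader decides which structure it expects and ignores what does not fit.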
Advantages of Unstructured Data
• Native format—unstructured data can be stored in its native format
until needed, with no pre-processing.
• Flexible—unstructured data can be used for many different purposes
and can contain a much wider variety of data, including textual data,
images, videos, and source code.
• Low overhead—unstructured data can be stored and processed at
much lower cost using elastically scalable data lakes.
Disadvantages of Unstructured Data
• Lack of visibility—it is difficult to tell what is stored in a data lake and
whether the data is useful. Data lakes can turn into “data swamps”
with large amounts of data, which is not useful for the organization,
yet incurs costs to store and manage it.
• Requires advanced analytics—there is typically a need for data
science skills and advanced algorithms to analyze and extract insights
from unstructured data. This also means it is not useful for most
business users, who do not have the skills to perform advanced
analytics.
• Requires dedicated tools—retrieving and processing unstructured
data requires specialized tooling and expertise.
Key Difference : Formats
Usually, structured data is in the form of numbers and text, presented
in standardized, readable formats. XML and CSV are common examples.
In structured data models, the data format is predetermined.
On the other hand, unstructured data often comes in various shapes
and sizes. It does not conform to a predefined data model and stays in
the native (original) formats. Examples include video files (e.g., WMV, MP4)
and audio files (e.g., MP3, WAV).
Key Difference : Data Model
• Structured data follows a predefined relational data model describing
the relationship of data elements.
• Unstructured data does not have a set data model but can have a
hidden structure.
Key Difference : Storage
• Organizations store structured data in relational databases. Data
warehouses help centralize large volumes of stored structured data
from different databases.
• Organizations store unstructured data in raw formats, not in
databases. Data lakes can store large amounts of unstructured data.
Key Difference : Database Type
• Structured data typically resides in a relational database, arranged in tables
with rows and columns. Labels specify the data types. A table’s schema
consists of the data column and type configuration. Relational databases
process data using SQL, an easy syntax for users to read.

• Unstructured data often resides in a non-relational (NoSQL) database. This
database type supports multiple data models without tables, usually
document, wide-column, graph, or key-value. It can process
large data volumes and handle high loads. A document database, for example,
contains collections of documents that resemble rows but don’t use a tabular
schema, so there can be different data types in the same collection. The
non-relational model can enable faster queries for some workloads.
Key Difference : Searchability and Ease of Use
• Structured data is usually easier to search and use.
• Unstructured data involves more complex search and analysis.
Unstructured data requires processing to understand it, such as
stacking before placing it in a relational database.
• Structured data is older, so there are more analytics tools available.
Standard data mining solutions cannot handle unstructured data.
Key Difference : Quantitative vs. Qualitative
• Structured data is quantitative, meaning that it has countable
elements. It is easier to analyze by classifying items based on
common characteristics, investigating the relationships between
variables, or clustering the data into attribute-based groups.

• Unstructured data is qualitative, meaning the information it contains
is subjective, and traditional analytics tools and methods can’t handle
it. For example, customer feedback on social media can generate data
in text form, requiring advanced analytics to process it. Techniques
include splitting and stacking data volumes into logical groupings,
data mining, and pattern detection.
Semi Structured Data
• Data that contains elements of both structured and unstructured
data, typically organized with markers (e.g., tags or keys) to separate
data fields but without a rigid schema.
• Key Characteristics
• Flexible Schema: No fixed schema, but it has identifiable elements.
• Hierarchical or Nested: Often stored in tree-like or nested structures.
• Easier Parsing: Easier to parse than unstructured data due to tags or
delimiters.
• Examples: JSON, XML, YAML, emails, CSV with irregular columns.
Examples of semi structured data
JSON (JavaScript Object Notation)

{
  "name": "Alice",
  "age": 30,
  "skills": ["Python", "Data Analysis"]
}

XML (eXtensible Markup Language)

<person>
  <name>Alice</name>
  <age>30</age>
  <skills>
    <skill>Python</skill>
    <skill>Data Analysis</skill>
  </skills>
</person>
YAML (YAML Ain't Markup Language, originally Yet Another Markup Language)

name: Alice
age: 30
skills:
- Python
- Data Analysis
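As a rough illustration of why tags and keys make semi-structured data easier to parse, the sketch below reads the same "Alice" record from JSON and from XML using only the Python standard library and checks that both yield the same fields. YAML is left out because parsing it needs a third-party package.

```python
import json
import xml.etree.ElementTree as ET

json_text = '{"name": "Alice", "age": 30, "skills": ["Python", "Data Analysis"]}'
xml_text = (
    "<person><name>Alice</name><age>30</age>"
    "<skills><skill>Python</skill><skill>Data Analysis</skill></skills>"
    "</person>"
)

# JSON keys map directly onto a Python dictionary.
from_json = json.loads(json_text)

# XML tags play the same role, but values arrive as text and must be converted.
root = ET.fromstring(xml_text)
from_xml = {
    "name": root.findtext("name"),
    "age": int(root.findtext("age")),
    "skills": [skill.text for skill in root.findall("skills/skill")],
}

print(from_json == from_xml)  # True
```

Note that neither format enforces a schema here: a second record could add or omit fields and both parsers would still accept it.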
Data Storage Options
• Relational databases: store structured data organized in tables of
rows and columns that are related to each other.
• NoSQL databases: store unstructured and semi-structured data in non-tabular formats.
• Data warehouse: a centralized repository that stores cleaned
and processed data that is structured and historical, and organizes
information from multiple sources for business analysis and
reporting.
• Data lake: a centralized repository that stores, processes, and
secures large amounts of raw data, including structured and
unstructured data, at any scale.
Non-Relational Databases (NoSQL databases)
They are well suited to storing and analyzing large volumes of unstructured data.

• Key-value databases: Store and manage an associative array (dictionary or hash table),
which consists of a collection of key-value pairs in which a key serves as a unique identifier to
retrieve an associated value. Values can be anything from simple objects, like integers or
strings, to more complex objects, like JSON structures.

• Document-oriented databases: Store data in the form of documents, where each
document has a unique identifier (its key) and the document itself serves as the
value. Each document contains some kind of metadata that provides a degree of
structure to the data. Document stores often come with an API or query language that
allows users to retrieve documents based on the metadata they contain.

• Columnar databases (column-oriented databases): Data is stored in a column-wise
fashion, with all the values of a specific attribute stored together, which can allow faster
retrieval for analytical queries and better compression.
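The three models above can be compared with a small sketch that uses plain Python dictionaries and lists as stand-ins. It is not how any particular NoSQL product stores data; the variable names and the find_documents helper are hypothetical and only show how the same records are laid out and accessed in each model.

```python
# Key-value store: an opaque value looked up by a unique key.
key_value = {
    "user:1": '{"name": "Alice", "age": 30}',
    "user:2": '{"name": "Bob", "age": 25}',
}

# Document store: each document has an id, and its fields can be queried.
documents = {
    "1": {"name": "Alice", "age": 30, "skills": ["Python"]},
    "2": {"name": "Bob", "age": 25},  # documents need not share the same fields
}

def find_documents(docs, field, value):
    """Return the ids of documents whose `field` equals `value`."""
    return [doc_id for doc_id, doc in docs.items() if doc.get(field) == value]

# Column-oriented layout: all values of one attribute are stored together.
columns = {
    "name": ["Alice", "Bob"],
    "age": [30, 25],
}

print(key_value["user:1"])                        # lookup by key only
print(find_documents(documents, "age", 25))       # ['2']
print(sum(columns["age"]) / len(columns["age"]))  # 27.5
```

The average age is computed by touching only the "age" column, which is the access pattern that column-oriented storage is designed for.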
The Three-Tier Architecture
The three-tier (or three-schema)
architecture supported by popular DBMSs
achieves two important things:
1. It insulates programmers and end-users of
the database from the way that data is
physically stored in the computer(s).
2. It enables different users of the data to
see only the subset of data relevant to
them, organized to suit their particular
needs.

• The internal schema describes how the data will be physically stored and accessed, using the facilities provided by a
particular DBMS.

• The conceptual schema describes the organization of the data into tables and columns.

• The external schemas specify views that enable different users of the data to see it in different ways.
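A minimal sketch of the three levels, using Python's built-in sqlite3 module: the employees table, the index, and the staff_directory view are illustrative names, not part of the slides. SQLite manages most physical storage details itself, so the index stands in for an internal-level choice here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Conceptual level: the data organized into a table with columns.
conn.execute(
    "CREATE TABLE employees ("
    " id INTEGER PRIMARY KEY, name TEXT, department TEXT, salary REAL)"
)
conn.executemany(
    "INSERT INTO employees (name, department, salary) VALUES (?, ?, ?)",
    [("Alice", "Sales", 5000.0), ("Bob", "IT", 6000.0)],
)

# Internal level: an index changes how data is accessed physically,
# without changing what queries return.
conn.execute("CREATE INDEX idx_employees_department ON employees (department)")

# External level: a view that exposes only the columns one user group needs.
conn.execute(
    "CREATE VIEW staff_directory AS SELECT name, department FROM employees"
)

directory = conn.execute(
    "SELECT name, department FROM staff_directory ORDER BY name"
).fetchall()
print(directory)  # [('Alice', 'Sales'), ('Bob', 'IT')]
```

Users of staff_directory never see the salary column, and dropping or adding the index would not change their results, which is the insulation the architecture aims for.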
What is a data center?
• A data center is a physical location that stores computing machines
and their related hardware equipment. It contains the computing
infrastructure that IT systems require, such as servers, data storage
drives, and network equipment. It is the physical facility that stores
any company’s digital data.
• Key Components of enterprise data center infrastructure:
o Compute
o Storage
o Network
Types of Data Centers
• On-premises (enterprise)
• Public cloud
• Colocation
• Cloud
• Hybrid
Enterprise (on-premises) data centers
• In this data center model, all IT infrastructure and data is hosted on-premises. Many companies
choose to have their own on-premises data centers because they feel they have more control
over information security, and can more easily comply with regulations such as the European
Union General Data Protection Regulation (GDPR) or the U.S. Health Insurance Portability and
Accountability Act (HIPAA). In an enterprise data center, the company is responsible for all
deployment, monitoring, and management tasks.
• On-premises data centers are fully owned company data centers that store sensitive data and
critical applications for that company. You set up the data center, manage its ongoing operations,
and purchase and maintain the equipment.

• Benefits: An enterprise data center can give better security because you manage risks internally.
You can customize the data center to meet your requirements.
• Limitations: It is costly to set up your own data center and manage ongoing staffing and running
costs. You also need multiple data centers because just one can become a single high-risk point of
failure.
Public cloud data centers
• Cloud data centers (also called cloud computing data centers) house IT
infrastructure resources for shared use by multiple customers—from scores
to millions of customers—via an Internet connection.

• Many of the largest cloud data centers, called hyperscale data centers,
are run by major cloud service providers like Amazon Web Services (AWS),
Google Cloud Platform, IBM Cloud, Microsoft Azure, and Oracle Cloud
Infrastructure. In fact, most leading cloud providers run several hyperscale
data centers around the world. Typically, cloud service providers also maintain
smaller edge data centers located closer to cloud customers (and cloud
customers’ customers). For real-time, data-intensive workloads such as big
data analytics, artificial intelligence (AI), and content delivery applications,
edge data centers can help minimize latency, improving overall application
performance and customer experience.
Colocation data centers
• Colocation facilities are large data center facilities in which you can rent space to store
your servers, racks, and other computing hardware. The colocation center typically
provides security and support infrastructure such as cooling and network bandwidth.

• Benefits: Colocation facilities reduce ongoing maintenance costs and provide fixed
monthly costs to house your hardware. You can also geographically distribute hardware
to minimize latency and to be closer to your end users.
• Limitations: It can be challenging to source colocation facilities across the globe and in
different geographical areas you target. Costs could also add up quickly as you expand.

• In a managed data center, the client company leases dedicated servers, storage and
networking hardware from the data center provider, and the data center provider
handles the administration, monitoring and management for the client company.
Cloud data centers
• A cloud data center moves a traditional on-prem data center off-site.
Instead of personally managing their own infrastructure, an organization
leases infrastructure managed by a third-party partner and accesses data
center resources over the Internet. Under this model, the cloud service
provider is responsible for maintenance, updates, and meeting service level
agreements (SLAs) for the parts of the infrastructure stack under their
direct control.

• Benefits: A cloud data center reduces both hardware investment and the
ongoing maintenance cost of any infrastructure. It gives greater flexibility in
terms of usage options, resource sharing, availability, and redundancy.
Difference between Cloud and On Prem Data Center
S.No | Cloud | On-prem
1 | Cloud is a virtual resource that helps businesses store, organize, and operate data efficiently. | A data center is a physical resource that helps businesses store, organize, and operate data efficiently.
2 | Scaling the cloud requires less investment. | Scaling a data center requires a large investment compared with the cloud.
3 | The maintenance cost is lower because service providers maintain it. | The maintenance cost is high because the organization’s developers do the maintenance.
4 | A third party needs to be trusted with the organization’s data. | The organization’s developers are trusted with the data stored in data centers.
5 | Performance is high relative to investment. | Performance is lower relative to investment.
6 | Customizing the cloud requires a plan. | It is easily customizable without an elaborate plan.
7 | It requires a stable internet connection to function. | It may or may not require an internet connection.
8 | Cloud is easy to operate and is considered a viable option. | Data centers require experienced developers to operate and are not considered a viable option.
9 | Data is generally collected from the internet. | Data is collected from the organization’s network.
10 | It suits scenarios where security is not a critical aspect, so small web applications can be hosted easily. | It suits scenarios where the project requires a high level of security.
Shared Responsibility Model
The migration from an on-premises data center to a cloud data center doesn’t mean moving
everything to the cloud. Many companies have hybrid cloud data centers, which mix
on-premises data center components with virtual data center components.
Data Center Tiers
• Data centers can be defined by different levels of reliability or flexibility,
sometimes referred to as data center tiers.
• Tier I : These are the most basic types of data centers and include an
uninterruptible power supply (UPS). Tier I data centers do not provide
redundant systems but must guarantee at least 99.671% uptime.
• Tier II : These data centers include system, power and cooling redundancy
and guarantee at least 99.741% uptime.
• Tier III : These data centers offer partial fault tolerance, 72-hour outage
protection, full redundancy, and a 99.982% uptime guarantee.
• Tier IV : These data centers guarantee 99.995% uptime - or no more than
26.3 minutes of downtime per year - as well as full fault tolerance, system
redundancy, and 96 hours of outage protection.
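The uptime percentages above translate directly into a maximum yearly downtime. The short sketch below does that arithmetic for each tier, assuming a 365-day year; the function and variable names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored

# Uptime guarantees as listed for each tier.
tier_uptime_percent = {
    "Tier I": 99.671,
    "Tier II": 99.741,
    "Tier III": 99.982,
    "Tier IV": 99.995,
}

def max_downtime_minutes(uptime_percent):
    """Downtime allowed per year, in minutes, for a given uptime guarantee."""
    return (100.0 - uptime_percent) / 100.0 * MINUTES_PER_YEAR

for tier, uptime in tier_uptime_percent.items():
    minutes = max_downtime_minutes(uptime)
    print(f"{tier}: {minutes:.1f} minutes/year ({minutes / 60:.1f} hours)")
```

For Tier IV this gives about 26.3 minutes per year, matching the figure quoted above, while Tier I allows roughly 28.8 hours.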
Basic Data Center Operations
• Managing and monitoring servers.
• Ensuring high availability and performance.
• Maintaining security and disaster recovery systems.
