
Data and Data Storage

The document discusses various types of data found in organizations, including structured, unstructured, and semi-structured data, along with their advantages and disadvantages. It also covers data storage options such as relational databases, NoSQL databases, data warehouses, and data lakes, as well as different types of data centers like on-premises, public cloud, colocation, and hybrid models. Additionally, it highlights the differences between cloud and on-prem data centers, emphasizing the shared responsibility model and data center tiers.

Uploaded by

ganesh697todkari
Copyright
© All Rights Reserved

Data and Data Storage

Types of data commonly found in an organization:
• Unstructured data: Emails, white papers, product specifications and PDF
files.
• Transactional data: Business events and transactions, such as sales,
invoices and claims.
• Metadata: Data about other data, such as report definitions and log files.
• Hierarchical data: Relationships between data, such as organizational
structures or product lines.
• Reference data: Data used to classify or categorize other data, such as
country and currency codes and other industry classifications.
• Master data: Core data describing key business entities, such as customers,
products and locations.
What Is Structured Data
• Structured data is typically stored in tabular form and managed in a relational
database (RDBMS).
• Fields contain data of a predefined format.
• Some fields might have a strict format, such as phone numbers or addresses,
while other fields can have variable-length text strings, such as names or
descriptions.
• Structured data might be generated by either humans or machines.
• It is easy to manage and highly searchable, both via human-generated queries
and automated analysis by traditional statistical methods and machine learning
(ML) algorithms.
• Structured data is used in almost every industry. Common examples of
applications that rely on structured data include customer relationship
management (CRM), invoicing systems, product databases, and contact lists.
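To make the idea of predefined fields and easy searchability concrete, here is a minimal sketch using Python's built-in sqlite3 module. The contacts table, its columns, and the sample rows are hypothetical and only illustrate the points above.

```python
import sqlite3

# In-memory relational database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE contacts (
        id      INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,   -- variable-length text field
        phone   TEXT NOT NULL,   -- field with a stricter expected format
        country TEXT NOT NULL
    )
""")

rows = [
    ("Alice", "+1-555-0100", "US"),
    ("Bob", "+44-20-5550-0101", "GB"),
    ("Carol", "+1-555-0102", "US"),
]
conn.executemany(
    "INSERT INTO contacts (name, phone, country) VALUES (?, ?, ?)", rows
)

# Because every row follows the same schema, a search is a one-line query.
us_contacts = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM contacts WHERE country = ? ORDER BY name", ("US",)
    )
]
print(us_contacts)  # ['Alice', 'Carol']
```

The schema is declared before any data is inserted, which is what makes the later query simple.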
Advantages of Structured Data
• Easy to use for business users—structured data can be used by business
users who understand the subject matter related to the data. It is useful for
entry level users with access to basic tools like Excel, and can be even more
useful for power users familiar with SQL or business intelligence (BI) tools.
• Extensive tools support—structured data is several decades old and most
data management and analytics tools support it. There is a huge variety of
RDBMS, data analytics, and big data management tools for structured
datasets.
• Instantly usable—structured data can be used, with no further processing,
by a variety of business processes. For example, customer data in
structured form can be visualized and manipulated by a CRM system.
Disadvantages of Structured Data
• Data preparation: data often needs to undergo complex transformations before
it can be loaded into a structured data store.
• Not flexible: structured data requires users to define a schema in
advance. It is difficult to change the structure over time, and because there is a
fixed, predefined structure, data can only be used for its intended purpose. This
limits the use cases that can be served by structured data.
• High overhead: structured data is often stored in data warehouses, which can
store structured data at large scale and enable fast access for user queries. A data
warehouse is a complex system requiring significant resources to run, operate
and maintain.
• Complex data structures: as organizations grow, the number of databases,
tables, and fields grows rapidly. It becomes difficult to manage structured
data, and it is common to have overlaps between datasets, redundant data, and
stale or low-quality data.
What Is Unstructured Data
• Unstructured data includes various content such as documents, videos,
audio files, posts on social media, and emails.
• These data types can be difficult to standardize and categorize.
• Unstructured data often consists of data collections rather than a clear
data element—for example, a document with thousands of words
addressing multiple topics. In this case, the document’s contents cannot
easily be defined as one entity.
• Generally, tools that handle structured data cannot parse unstructured
documents to help categorize their data.
• Unstructured data is manageable, but data items are typically stored as
objects in their original format.
• Users and tools can manipulate the data when needed; otherwise, it
remains in its raw form—a process known as schema-on-read.
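Schema-on-read can be illustrated with a small sketch: items are stored exactly as they arrive, and a structure is applied only when someone reads them. The sample records and the read_with_schema helper are hypothetical, not part of any particular data lake product.

```python
import json

# Raw items kept in their original form (hypothetical sample data).
raw_store = [
    '{"user": "alice", "text": "Great product", "rating": 5}',
    '{"user": "bob", "text": "Arrived late"}',
    "free-form note with no structure at all",
]

def read_with_schema(raw_items):
    """Apply a (user, rating) schema at read time; skip items that do not fit."""
    parsed = []
    for item in raw_items:
        try:
            record = json.loads(item)
        except json.JSONDecodeError:
            continue  # not JSON: the item simply stays in the store untouched
        if not isinstance(record, dict):
            continue
        # Missing fields become None instead of being rejected at write time.
        parsed.append((record.get("user"), record.get("rating")))
    return parsed

ratings = read_with_schema(raw_store)
print(ratings)  # [('alice', 5), ('bob', None)]
```

Nothing was validated or transformed when the items were stored; a different reader could apply a different schema to the same raw items.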
Advantages of Unstructured Data
• Native format—unstructured data can be stored in its native format
until needed, with no pre-processing.
• Flexible—unstructured data can be used for many different purposes
and can contain a much wider variety of data, including textual data,
images, videos, and source code.
• Low overhead—unstructured data can be stored and processed at
much lower cost using elastically scalable data lakes.
Disadvantages of Unstructured Data
• Lack of visibility: it is difficult to tell what is stored in a data lake and
whether the data is useful. Data lakes can turn into “data swamps”:
large amounts of data that are not useful to the organization
yet still incur storage and management costs.
• Requires advanced analytics: there is typically a need for data
science skills and advanced algorithms to analyze and extract insights
from unstructured data. This also means it is not useful for most
business users, who do not have the skills to perform advanced
analytics.
• Requires dedicated tools: retrieving and processing unstructured
data requires specialized tooling and expertise.
Key Difference : Formats
Usually, structured data is in the form of numbers and text, presented
in standardized, readable formats. XML and CSV are among the most popular
formats. In structured data models, the data format is predetermined.
On the other hand, unstructured data comes in various shapes
and sizes. It does not conform to a predefined data model and stays in
its native (original) format. Examples include video files (e.g., WMV, MP4)
and audio files (e.g., MP3, WAV).
Key Difference : Data Model
• Structured data follows a predefined relational data model describing
the relationship of data elements.
• Unstructured data does not have a set data model but can have a
hidden structure.
Key Difference : Storage
• Organizations store structured data in relational databases. Data
warehouses help centralize large volumes of stored structured data
from different databases.
• Organizations store unstructured data in raw formats, not in
databases. Data lakes can store large amounts of unstructured data.
Key Difference : Database Type
• Structured data typically resides in a relational database, arranged in tables
with rows and columns. Column definitions specify the data types, and a table’s
schema consists of its columns and their types. Relational databases
process data using SQL, whose syntax is easy for users to read.

• Unstructured data often resides in a non-relational (NoSQL) database. This
database type supports multiple data models without tables, usually
document, wide-column, graph, or key-value. It can process
large data volumes and handle high loads. A document database, for example,
contains collections of documents that resemble rows but don’t use a tabular
schema, so there can be different data types in the same collection. The
non-relational model can enable faster queries for some workloads.
Key Difference : Searchability and Ease of Use
• Structured data is usually easier to search and use.
• Unstructured data involves more complex search and analysis.
Unstructured data requires processing to understand it, such as
stacking before placing it in a relational database.
• Structured data is older, so there are more analytics tools available.
Standard data mining solutions cannot handle unstructured data.
Key Difference : Quantitative vs. Qualitative
• Structured data is quantitative, meaning that it has countable
elements. It is easier to analyze by classifying items based on
common characteristics, investigating the relationships between
variables, or clustering the data into attribute-based groups.

• Unstructured data is qualitative, meaning the information it contains
is subjective, and traditional analytics tools and methods can’t handle
it. For example, customer feedback on social media can generate data
in text form, requiring advanced analytics to process it. Techniques
include splitting and stacking data volumes into logical groupings,
data mining, and pattern detection.
Semi-Structured Data
• Data that contains elements of both structured and unstructured
data, typically organized with markers (e.g., tags or keys) to separate
data fields but without a rigid schema.
• Key Characteristics
• Flexible Schema: No fixed schema, but it has identifiable elements.
• Hierarchical or Nested: Often stored in tree-like or nested structures.
• Easier Parsing: Easier to parse than unstructured data due to tags or
delimiters.
• Examples: JSON, XML, YAML, emails, CSV with irregular columns.
Examples of semi-structured data
JSON (JavaScript Object Notation)

{
  "name": "Alice",
  "age": 30,
  "skills": ["Python", "Data Analysis"]
}

XML (eXtensible Markup Language)

<person>
  <name>Alice</name>
  <age>30</age>
  <skills>
    <skill>Python</skill>
    <skill>Data Analysis</skill>
  </skills>
</person>
YAML (YAML Ain't Markup Language, originally Yet Another Markup Language)

name: Alice
age: 30
skills:
- Python
- Data Analysis
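Because the tags and keys mark where each field begins and ends, the JSON and XML versions of this record can be parsed with standard tools and shown to carry the same information. The sketch below uses only Python's standard library; YAML is left out because parsing it requires a third-party package.

```python
import json
import xml.etree.ElementTree as ET

json_text = '{"name": "Alice", "age": 30, "skills": ["Python", "Data Analysis"]}'

xml_text = """
<person>
  <name>Alice</name>
  <age>30</age>
  <skills>
    <skill>Python</skill>
    <skill>Data Analysis</skill>
  </skills>
</person>
""".strip()

# JSON keeps basic types (numbers, lists), so one call is enough.
from_json = json.loads(json_text)

# XML carries text only, so types and nesting are rebuilt by hand.
root = ET.fromstring(xml_text)
from_xml = {
    "name": root.findtext("name"),
    "age": int(root.findtext("age")),
    "skills": [skill.text for skill in root.findall("skills/skill")],
}

print(from_json == from_xml)  # True
```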
Data Storage Options
• Relational databases: Store structured data in tables of rows and
columns that are related to each other.
• NoSQL databases: Store unstructured data in a non-tabular format.
• Data warehouse: A centralized repository that stores cleaned,
processed, structured, historical data and organizes
information from multiple sources for business analysis and
reporting.
• Data lake: A centralized repository that stores, processes, and
secures large amounts of raw data, both structured and
unstructured, at any scale.
Non-Relational Databases (NoSQL databases)
They are very efficient at analyzing large volumes of unstructured data.

• Key-value databases: Store and manage an associative array (dictionary or hash
table), which consists of a collection of key-value pairs in which a key serves as a unique
identifier to retrieve an associated value. Values can be anything from simple objects, like
integers or strings, to more complex objects, like JSON structures.

• Document-oriented databases: Store data in the form of documents, where each
document has a unique identifier (its key) and the document itself serves as the
value. Each document contains some kind of metadata that provides a degree of
structure to the data. Document stores often come with an API or query language that
allows users to retrieve documents based on the metadata they contain.

• Columnar databases / column-oriented databases: Data is stored in a column-wise
fashion, with all the values of a specific attribute stored together, which allows for faster
data retrieval and compression.
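The three models above can be sketched with plain Python data structures. This is only an illustration of how the data is laid out, not how any real NoSQL product is implemented; all names and sample values are hypothetical.

```python
# Key-value: values are retrieved only through their unique key.
kv_store = {
    "session:42": "alice",
    "cart:42": {"items": ["pen", "ink"]},  # values may be complex objects
}

# Document: each document has a key, but can also be queried by its fields.
documents = {
    "u1": {"name": "Alice", "city": "Pune", "skills": ["Python"]},
    "u2": {"name": "Bob", "city": "Mumbai"},  # no "skills" field: schema is flexible
}

def find_documents(docs, field, value):
    """Return the keys of documents whose field equals value."""
    return sorted(key for key, doc in docs.items() if doc.get(field) == value)

# Columnar: all values of one attribute are stored together.
rows = [("Alice", 30, "Pune"), ("Bob", 25, "Mumbai"), ("Carol", 35, "Pune")]
columns = {
    "name": [r[0] for r in rows],
    "age": [r[1] for r in rows],
    "city": [r[2] for r in rows],
}
# An aggregate over one attribute touches a single column, not every row.
average_age = sum(columns["age"]) / len(columns["age"])

print(kv_store["session:42"])                      # alice
print(find_documents(documents, "city", "Pune"))   # ['u1']
print(average_age)                                 # 30.0
```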
The Three-Tier Architecture
The three-tier (or three-schema) architecture supported by popular DBMSs
achieves two important things:
1. It insulates programmers and end-users of the database from the way that
data is physically stored in the computer(s).
2. It enables different users of the data to see only the subset of data
relevant to them, organized to suit their particular needs.

• The internal schema describes how the data will be physically stored and
accessed, using the facilities provided by a particular DBMS.

• The conceptual schema describes the organization of the data into tables and columns.

• The external schemas specify views that enable different users of the data to see it in different ways.
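A rough sketch of the three schemas using Python's built-in sqlite3 module follows. The employees table, the two views, and the index are hypothetical. The index is only a loose stand-in for the internal schema, since most physical storage details are handled inside the DBMS itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Conceptual schema: the logical organization into tables and columns.
conn.execute("""
    CREATE TABLE employees (
        id         INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        department TEXT NOT NULL,
        salary     INTEGER NOT NULL
    )
""")

# Internal schema (analogy): a physical access path. Queries keep working
# unchanged whether or not this index exists.
conn.execute("CREATE INDEX idx_employees_department ON employees (department)")

# External schemas: views that show each group of users only what they need.
conn.execute("CREATE VIEW directory AS SELECT name, department FROM employees")
conn.execute("""
    CREATE VIEW payroll_summary AS
    SELECT department, SUM(salary) AS total_salary
    FROM employees
    GROUP BY department
""")

conn.executemany(
    "INSERT INTO employees (name, department, salary) VALUES (?, ?, ?)",
    [("Alice", "Sales", 50000), ("Bob", "Sales", 40000), ("Carol", "IT", 60000)],
)

directory = conn.execute("SELECT * FROM directory ORDER BY name").fetchall()
payroll = conn.execute(
    "SELECT * FROM payroll_summary ORDER BY department"
).fetchall()
print(directory)  # [('Alice', 'Sales'), ('Bob', 'Sales'), ('Carol', 'IT')]
print(payroll)    # [('IT', 60000), ('Sales', 90000)]
```

Users of the directory view never see salaries, and users of payroll_summary never see individual names, even though both views read the same underlying table.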
What is a data center?
• A data center is a physical location that houses computing machines
and their related hardware equipment. It contains the computing
infrastructure that IT systems require, such as servers, data storage
drives, and network equipment. It is the physical facility that stores
a company’s digital data.
• Key Components of enterprise data center infrastructure:
o Compute
o Storage
o Network
Types of Data Centers
• On-prem
• Public cloud
• Colocation
• Cloud
• Hybrid
Enterprise (on-premises) data centers
• In this data center model, all IT infrastructure and data is hosted on-premises. Many companies
choose to have their own on-premises data centers because they feel they have more control
over information security, and can more easily comply with regulations such as the European
Union General Data Protection Regulation (GDPR) or the U.S. Health Insurance Portability and
Accountability Act (HIPAA). In an enterprise data center, the company is responsible for all
deployment, monitoring, and management tasks.
• On-premises data centers are fully owned company data centers that store sensitive data and
critical applications for that company. You set up the data center, manage its ongoing operations,
and purchase and maintain the equipment.

• Benefits: An enterprise data center can give better security because you manage risks internally.
You can customize the data center to meet your requirements.
• Limitations: It is costly to set up your own data center and manage ongoing staffing and running
costs. You also need multiple data centers because just one can become a single high-risk point of
failure.
Public cloud data centers
• Cloud data centers (also called cloud computing data centers) house IT
infrastructure resources for shared use by multiple customers—from scores
to millions of customers—via an Internet connection.

• Many of the largest cloud data centers, called hyperscale data centers,
are run by major cloud service providers like Amazon Web Services (AWS),
Google Cloud Platform, IBM Cloud, Microsoft Azure, and Oracle Cloud
Infrastructure. In fact, most leading cloud providers run several hyperscale
data centers around the world. Typically, cloud service providers maintain
smaller, edge data centers located closer to cloud customers (and cloud
customers’ customers). For real-time, data-intensive workloads such as big
data analytics, artificial intelligence (AI), and content delivery applications,
edge data centers can help minimize latency, improving overall application
performance and customer experience.
Colocation data centers
• Colocation facilities are large data center facilities in which you can rent space to store
your servers, racks, and other computing hardware. The colocation center typically
provides security and support infrastructure such as cooling and network bandwidth.

• Benefits: Colocation facilities reduce ongoing maintenance costs and provide fixed
monthly costs to house your hardware. You can also geographically distribute hardware
to minimize latency and to be closer to your end users.
• Limitations: It can be challenging to source colocation facilities across the globe and in
different geographical areas you target. Costs could also add up quickly as you expand.

• In a managed data center, the client company leases dedicated servers, storage and
networking hardware from the data center provider, and the data center provider
handles the administration, monitoring and management for the client company.
Cloud data centers
• A cloud data center moves a traditional on-prem data center off-site.
Instead of personally managing their own infrastructure, an organization
leases infrastructure managed by a third-party partner and accesses data
center resources over the Internet. Under this model, the cloud service
provider is responsible for maintenance, updates, and meeting service level
agreements (SLAs) for the parts of the infrastructure stack under their
direct control.

• Benefits: A cloud data center reduces both hardware investment and the
ongoing maintenance cost of any infrastructure. It gives greater flexibility in
terms of usage options, resource sharing, availability, and redundancy.
Difference Between Cloud and On-Prem Data Centers
S.No | Cloud | On-Prem
1. | The cloud is a virtual resource that helps businesses store, organize, and operate data efficiently. | A data center is a physical resource that helps businesses store, organize, and operate data efficiently.
2. | Scaling the cloud requires less investment. | Scaling a data center requires a much larger investment than the cloud.
3. | The maintenance cost is lower because the service provider maintains it. | The maintenance cost is high because the organization’s developers do the maintenance.
4. | A third party needs to be trusted with the organization’s data. | The organization’s developers are trusted with the data stored in the data center.
5. | Performance is high relative to the investment. | Performance is lower relative to the investment.
6. | Customizing the cloud requires a plan. | It is easily customizable without any elaborate plan.
7. | It requires a stable internet connection to function. | It may or may not require an internet connection.
8. | The cloud is easy to operate and is considered a viable option. | Data centers require experienced developers to operate and are not considered a viable option.
9. | Data is generally collected from the internet. | Data is collected from the organization’s network.
10. | It finds use in scenarios where security is not a critical aspect. Hence, small web applications can be hosted easily. | It finds use in scenarios where the project requires a high level of security.
Shared Responsibility Model
Migrating from an on-premises data center to a cloud data center doesn’t mean moving
everything to the cloud. Many companies have hybrid cloud data centers, which mix
on-premises data center components with virtual data center components.
Data Center Tiers
• Data centers can be defined by different levels of reliability or flexibility,
sometimes referred to as data center tiers.
• Tier I: These are the most basic type of data center and include an
uninterruptible power supply (UPS). Tier I data centers do not provide
redundant systems but must guarantee at least 99.671% uptime.
• Tier II: These data centers include system, power and cooling redundancy
and guarantee at least 99.741% uptime.
• Tier III: These data centers offer partial fault tolerance, 72-hour outage
protection, full redundancy, and a 99.982% uptime guarantee.
• Tier IV: These data centers guarantee 99.995% uptime - or no more than
26.3 minutes of downtime per year - as well as full fault tolerance, system
redundancy, and 96 hours of outage protection.
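The uptime percentages above translate directly into a maximum yearly downtime. The following minimal sketch does the arithmetic, assuming a 365-day year; the function and constant names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored

# Guaranteed uptime per tier, as listed above.
TIER_UPTIME_PERCENT = {
    "Tier I": 99.671,
    "Tier II": 99.741,
    "Tier III": 99.982,
    "Tier IV": 99.995,
}

def max_downtime_minutes(uptime_percent):
    """Maximum downtime per year, in minutes, allowed by an uptime guarantee."""
    return (100.0 - uptime_percent) / 100.0 * MINUTES_PER_YEAR

for tier, uptime in TIER_UPTIME_PERCENT.items():
    minutes = max_downtime_minutes(uptime)
    print(f"{tier}: {minutes:.1f} minutes/year ({minutes / 60:.1f} hours)")
```

For Tier IV this gives about 26.3 minutes per year, matching the figure quoted above, while Tier I works out to roughly 28.8 hours per year.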
Basic Data Center Operations
• Managing and monitoring servers.
• Ensuring high availability and performance.
• Maintaining security and disaster recovery systems.
