Chapter 1 Notes

The document provides an overview of digital data classification, categorizing it into structured, semi-structured, and unstructured data. It explains the characteristics and examples of each type, emphasizing the importance of managing and analyzing data to derive valuable insights. Additionally, it discusses techniques for handling unstructured data, such as data mining and natural language processing.

Uploaded by

Abhishek Pisal

Chapter 1

Introduction to Types of Digital Data

Classification of Digital Data:-

• Irrespective of the size of the enterprise, whether big or small, data continues
to be a precious and irreplaceable asset.
• Data is present in homogeneous as well as heterogeneous sources. The need of the
hour is to understand, manage, and process the data, and to analyze it to draw
valuable insights.
• Data generates information, and from information we can draw valuable insights.
As represented in the figure below, digital data can be broadly classified into
structured, semi-structured, and unstructured data.

Digital data is classified as follows:

1) Unstructured data:- This is data that does not conform to a data model or is
not in a form that can be used easily by a computer program. About 80% of an
organization's data is in this format; for example, memos, chat rooms, PowerPoint
presentations, images, videos, letters, research, white papers, the body of an email, etc.
2) Semi-structured data:- Semi-structured data is also said to have a self-describing
structure. This is data that does not conform to a data model but has some
structure. However, it is not in a form that can be used easily by a
computer program. About 10% of an organization's data is in this format; for example,
HTML, XML, JSON, email data, etc.
3) Structured data:- When data follows a pre-defined schema or structure, we say it is
structured data. This is data in an organized form (e.g., in rows and
columns) that can be easily used by a computer program. Relationships exist between
entities of data, such as classes and their objects. About 10% of an organization's data
is in this format. Data stored in databases is an example of structured data.
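
To make the three categories concrete, here is a minimal Python sketch that holds roughly the same facts in each form. The customer, the column names, and the note text are made up for illustration and are not from the notes.

```python
import json

# Structured: a fixed schema; every row has the same columns in the same order.
columns = ("customer_id", "name", "city")
row = (101, "Asha", "Pune")

# Semi-structured: self-describing; the keys travel together with the values.
document = json.loads('{"customer_id": 101, "name": "Asha", "tags": ["retail", "gold"]}')

# Unstructured: free text; a program must interpret it to find the same facts.
note = "Asha from Pune called about her gold membership."

structured_name = row[columns.index("name")]   # looked up by column position
semi_structured_name = document["name"]        # looked up by key
unstructured_has_name = "Asha" in note         # only a plain substring search

print(structured_name, semi_structured_name, unstructured_has_name)
```

The first two lookups rely on structure that the data itself carries; the last one is only a text search, which is why unstructured data needs the techniques discussed later in the chapter.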

Big Data Analytics

Structured Data:-
• This is data in an organized form (e.g., in rows and columns) that can
be easily used by a computer program.
• Relationships exist between entities of data, such as classes and their objects.
• Data stored in databases is an example of structured data.
• If the data is highly structured, one can house it in any of the available
RDBMS, such as Oracle (Oracle Corp.), DB2 (IBM), Microsoft SQL Server (Microsoft),
Greenplum (EMC), Teradata (Teradata), MySQL (open source), or PostgreSQL (advanced
open source).
• These databases are typically used to hold the transaction/operational data
generated and collected by day-to-day business activities. In other words, the data of
On-Line Transaction Processing (OLTP) systems is generally quite structured.
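
As a small illustration of rows, columns, and a fixed schema, the sketch below uses Python's built-in sqlite3 module as a stand-in for the RDBMS products listed above. SQLite is not one of them, and the `orders` table and its values are hypothetical.

```python
import sqlite3

# In-memory database; nothing is written to disk in this sketch.
conn = sqlite3.connect(":memory:")

# The schema is declared up front: every row must fit these columns.
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, customer TEXT NOT NULL, amount REAL NOT NULL)"
)

# Typical operational (OLTP-style) statements: insert, update, delete.
conn.executemany(
    "INSERT INTO orders (order_id, customer, amount) VALUES (?, ?, ?)",
    [(1, "Asha", 250.0), (2, "Ravi", 120.5), (3, "Asha", 80.0)],
)
conn.execute("UPDATE orders SET amount = 130.0 WHERE order_id = 2")
conn.execute("DELETE FROM orders WHERE order_id = 3")
conn.commit()

rows = conn.execute(
    "SELECT order_id, customer, amount FROM orders ORDER BY order_id"
).fetchall()
print(rows)
conn.close()
```

Because the schema is known in advance, a program can address any value by table, column, and key without having to interpret the content.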

Sources of Structured Data:-

• The figure below shows the different sources of structured data.

Figure: Sources of Structured Data


Ease with Structured Data:-
• Structured data provides ease of working. The ease is with respect to the
following:
1) Insert/update/delete: The Data Manipulation Language (DML) operations provide
the required ease with data input, storage, access, processing, analysis, etc.
2) Security: How does one ensure the security of information? Encryption and
tokenization solutions are available to warrant the security of information
throughout its lifecycle. Organizations are able to retain control and maintain
compliance by ensuring that only authorized individuals are able to decrypt
and view sensitive information.
3) Indexing: Indexing is a way to optimize the performance of a database by
minimizing the number of disk accesses required when a query is processed. It is a
data structure technique used to quickly locate and access the data in a
database. It speeds up data retrieval operations (primarily the SELECT DML
statement) at the cost of additional writes and storage space, but the benefits in
search operations are usually worth that cost.
4) Scalability: The storage and processing capabilities of a traditional RDBMS can
be easily scaled up by increasing the horsepower of the database server (increasing
the primary and secondary or peripheral storage capacity, the processing capacity of
the processor, etc.).
5) Transaction processing: An RDBMS supports the Atomicity, Consistency,
Isolation, and Durability (ACID) properties of transactions.
a) Atomicity:- It means that either the entire transaction takes place at once or it
doesn't happen at all. There is no midway, i.e., transactions do not occur
partially. Each transaction is considered as one unit and either runs to completion or is
not executed at all. It involves the following two operations.
- Abort: If a transaction aborts, changes made to the database are not visible.
- Commit: If a transaction commits, changes made are visible.
Atomicity is also known as the ‘All or nothing rule’.
b) Consistency:- This means that integrity constraints must be maintained so that the
database is consistent before and after the transaction. It refers to the correctness of a
database. The database moves from one consistent state to another consistent state. In
other words, if the same piece of information is stored at two or more places, they
are in complete agreement.
c) Isolation:- This property ensures that multiple transactions can occur concurrently
without leading to inconsistency of the database state.
• Transactions occur independently without interference. Changes occurring in a
particular transaction will not be visible to any other transaction until that particular
change in that transaction is written to memory or has been committed.
• This property ensures that executing transactions concurrently results in a
state that is equivalent to a state achieved if they were executed serially in some order.
d) Durability:- All changes made to the database during a transaction are permanent, and
that accounts for the durability of the transaction. This property ensures that once the
transaction has completed execution, the updates and modifications to the database
are written to disk and persist even if a system failure occurs.
• These updates now become permanent and are stored in non-volatile memory. The
effects of the transaction, thus, are never lost.

Figure: Ease of Working with Structured Data
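
Two of the points above, indexing and the "all or nothing" behaviour of a transaction, can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumptions: SQLite stands in for a full RDBMS, and the `accounts` table, the index name, and the `transfer` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK clause is an integrity constraint: a balance may never go negative.
conn.execute(
    "CREATE TABLE accounts ("
    "account_id INTEGER PRIMARY KEY, owner TEXT NOT NULL, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?, ?)",
                 [(1, "Asha", 100), (2, "Ravi", 50)])
conn.commit()

# Indexing: after creating an index on owner, ask SQLite how it would run a lookup.
conn.execute("CREATE INDEX idx_accounts_owner ON accounts (owner)")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM accounts WHERE owner = ?", ("Asha",)
).fetchall()
plan_text = " ".join(str(step[-1]) for step in plan)  # last column is the description
uses_index = "idx_accounts_owner" in plan_text

def transfer(connection, src, dst, amount):
    """Move amount from src to dst as one unit; return True if committed."""
    try:
        with connection:  # commits on success, rolls back if an exception escapes
            connection.execute(
                "UPDATE accounts SET balance = balance + ? WHERE account_id = ?",
                (amount, dst))
            connection.execute(
                "UPDATE accounts SET balance = balance - ? WHERE account_id = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False  # the credit made before the failure was rolled back too

def balances(connection):
    return connection.execute(
        "SELECT account_id, balance FROM accounts ORDER BY account_id").fetchall()

failed = transfer(conn, 1, 2, 500)       # would overdraw account 1, so it aborts
balances_after_failure = balances(conn)
succeeded = transfer(conn, 1, 2, 30)     # valid, so both updates commit together
balances_after_success = balances(conn)
print(uses_index, failed, balances_after_failure, succeeded, balances_after_success)
```

In the failed transfer the credit to the second account runs first, yet it is not visible afterwards: the abort undoes the whole unit, and the database stays in a state that satisfies its constraints.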


Semi-Structured Data:-
• This is data that does not conform to a data model but has some structure.
• However, it is not in a form that can be used easily by a computer program.
• Examples: emails, XML, and markup languages such as HTML. Metadata for this
data typically travels with the data itself, for example as tags.
It has the following features:
1) It does not conform to the data models that one typically associates with relational
databases or any other form of data tables.
2) It uses tags to segregate semantic elements.
3) Tags are also used to enforce hierarchies of records and fields within the data.
4) There is no separation between the data and the schema. The amount of structure
used is dictated by the purpose at hand.
5) In semi-structured data, entities belonging to the same class and grouped
together need not have the same set of attributes. Even if they do have
the same set of attributes, the order of the attributes may differ, and for all
practical purposes the order is not important.
Sources of Semi-structured Data:-
• Amongst the sources of semi-structured data, the front runners are XML and
JSON.
• XML: eXtensible Markup Language (XML) was popularized by web
services developed using the Simple Object Access Protocol (SOAP) principles.
• JSON: JavaScript Object Notation (JSON) is used to transmit data between a
server and a web application.
• JSON was popularized by web services developed using Representational
State Transfer (REST), an architectural style for creating scalable web services.
• MongoDB (open-source, distributed, NoSQL, document-oriented database) and
Couchbase (originally known as Membase; open-source, distributed, NoSQL,
document-oriented database) store data natively in JSON format.
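
The way XML tags segregate semantic elements and enforce a hierarchy of records and fields can be sketched with Python's standard xml.etree.ElementTree module. The `customers` document here is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up XML: two records of the same class with different sets of fields.
xml_text = """
<customers>
  <customer id="101">
    <name>Asha</name>
    <email>asha@example.com</email>
  </customer>
  <customer id="102">
    <name>Ravi</name>
    <phone>555-0100</phone>
  </customer>
</customers>
""".strip()

root = ET.fromstring(xml_text)

records = []
for customer in root.findall("customer"):   # each <customer> element is a record
    record = {"id": customer.get("id")}     # an attribute on the tag
    for field in customer:                  # child tags act as the fields
        record[field.tag] = field.text
    records.append(record)

print(records)
```

The tag names act as the schema, so the two records can be read without any separate table definition even though their fields differ.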

Figure: Sources of Semi-structured Data

Example of HTML is as below
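
A small HTML fragment, made up for illustration, together with a Python sketch that walks its tags using the standard html.parser module. The `TagCollector` class is a hypothetical helper and assumes well-formed markup in which every tag is closed.

```python
from html.parser import HTMLParser

# Illustrative HTML: tags describe presentation and give the text some structure.
html_text = """
<html>
  <head><title>Order summary</title></head>
  <body>
    <h1>Order 1</h1>
    <p>Customer: Asha</p>
    <p>Amount: 250</p>
  </body>
</html>
"""

class TagCollector(HTMLParser):
    """Record each piece of text together with the tag that encloses it."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open tags
        self.content = []     # (enclosing tag, text) pairs

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        self.open_tags.pop()  # assumes well-formed, properly nested markup

    def handle_data(self, data):
        text = data.strip()
        if text:              # skip whitespace between tags
            self.content.append((self.open_tags[-1], text))

parser = TagCollector()
parser.feed(html_text)
parser.close()
content = parser.content
print(content)
```

The tags tell a program where the title and paragraphs are, but not what "Amount: 250" means; that remains text to be interpreted.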

Example of a JSON document
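
An illustrative JSON document, with invented values, read with Python's standard json module. It also shows feature 5 from the list of features: records of the same class need not share the same attributes or list them in the same order.

```python
import json

# Made-up customer records: same class, different attributes, different key order.
json_text = """
[
  {"id": 101, "name": "Asha", "email": "asha@example.com"},
  {"name": "Ravi", "id": 102, "phones": ["555-0100", "555-0101"]},
  {"id": 103, "name": "Meera", "address": {"city": "Pune", "pin": "411001"}}
]
"""

customers = json.loads(json_text)

# The union of keys across records; no single record carries all of them.
all_keys = sorted({key for customer in customers for key in customer})

# A reader has to tolerate missing attributes, so .get() is used with defaults.
cities = [customer.get("address", {}).get("city") for customer in customers]

print(all_keys)
print(cities)
```

The keys are stored with every record, which is what makes the data self-describing and also why there is no separate schema to consult.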


Characteristics of Semi-structured Data:-
• The following figure illustrates the characteristics of semi-structured data.

Unstructured Data:-

• This is data that does not conform to a data model or is not in a form that
can be used easily by a computer program.
• About 80% of an organization's data is in this format.
• Examples: memos, chat rooms, PowerPoint presentations, images, videos, letters,
research, white papers, the body of an email, etc., as illustrated in the following
figure.

Issues with the Terminology “Unstructured Data”:-

• Although unstructured data is known NOT to conform to a pre-defined data model or
to be organized in a pre-defined manner, there are instances in which the structure of
data placed in the unstructured category can be implied.
• The figure above mentions some reasons for placing data in the unstructured
category despite it having some structure or even being highly structured.

How to Deal with Unstructured Data?

• The following techniques are used to find patterns in, or interpret, unstructured data.
1) Data Mining:-
• First, we deal with large data sets.
• Second, we use methods at the intersection of artificial intelligence, machine
learning, statistics, and database systems to unearth consistent patterns in large data
sets and/or systematic relationships between variables.
• It is the analysis step of the knowledge discovery in databases process.
• A few popular data mining algorithms are as follows.
a) Association rule mining:
- It is also called “market basket analysis” or “affinity analysis”.
- It is used to determine “What goes with what?”
- It is about identifying, when you buy a product, which other product
you are likely to purchase with it.
- For example, if you pick up bread from the grocery, are you likely
to pick up eggs or cheese to go with it?
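
The bread example can be worked through in a few lines of Python. Support and confidence are the standard measures used in association rule mining; they are introduced here for illustration and are not defined in the notes. The five baskets are made up.

```python
# Each set is one customer's basket; the data is invented for this sketch.
transactions = [
    {"bread", "eggs", "milk"},
    {"bread", "cheese"},
    {"bread", "eggs"},
    {"milk", "cheese"},
    {"bread", "eggs", "cheese"},
]

def count(itemset):
    """Number of baskets that contain every item in itemset."""
    return sum(1 for basket in transactions if itemset <= basket)

def support(itemset):
    """Fraction of all baskets that contain the itemset."""
    return count(itemset) / len(transactions)

def confidence(antecedent, consequent):
    """Of the baskets containing the antecedent, the share that also has the consequent."""
    return count(antecedent | consequent) / count(antecedent)

bread_eggs_support = support({"bread", "eggs"})
bread_to_eggs = confidence({"bread"}, {"eggs"})
bread_to_cheese = confidence({"bread"}, {"cheese"})

print(bread_eggs_support, bread_to_eggs, bread_to_cheese)
```

In this toy data, three of the four bread baskets also contain eggs while only two contain cheese, so "bread goes with eggs" is the stronger rule.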


b) Collaborative filtering:
- Here we look at predicting whether User 4 will prefer to learn using videos or is a
textual learner, based on one or a couple of his or her known preferences.
- We analyze the preferences of similar user profiles and, on that basis, predict
that User 4 will also like to learn using videos and is not a textual learner.
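
The User 4 prediction can be sketched as a tiny similarity-weighted vote in Python. The preference values, the extra "quizzes" item, and the `similarity` and `predict` helpers are all hypothetical; real systems use richer data and similarity measures.

```python
# 1 = likes, 0 = does not like; a missing key means the preference is unknown.
preferences = {
    "User 1": {"videos": 1, "text": 0, "quizzes": 1},
    "User 2": {"videos": 1, "text": 0, "quizzes": 1},
    "User 3": {"videos": 0, "text": 1, "quizzes": 0},
    "User 4": {"quizzes": 1},  # only one preference is known
}

def similarity(a, b):
    """Share of commonly rated items on which two users agree."""
    shared = [item for item in a if item in b]
    if not shared:
        return 0.0
    return sum(1 for item in shared if a[item] == b[item]) / len(shared)

def predict(target, item):
    """Similarity-weighted average of other users' ratings for item."""
    known = preferences[target]
    weighted_sum = 0.0
    total_weight = 0.0
    for user, prefs in preferences.items():
        if user == target or item not in prefs:
            continue
        weight = similarity(known, prefs)
        weighted_sum += weight * prefs[item]
        total_weight += weight
    return weighted_sum / total_weight if total_weight else None

videos_score = predict("User 4", "videos")
text_score = predict("User 4", "text")
print(videos_score, text_score)
```

User 4 agrees with Users 1 and 2 on the one known preference, so their liking for videos carries over, while User 3 gets zero weight.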
2) Text Analytics or Text Mining:-
• Compared to the structured data stored in relational databases, text is largely
unstructured, amorphous, and difficult to deal with algorithmically.
• Text mining is the process of gleaning high-quality and meaningful information
from text (through the devising of patterns and trends by means of statistical pattern
learning).
• It includes tasks such as text categorization, text clustering, sentiment analysis,
concept/entity extraction, etc.
3) Natural language processing (NLP):- It is related to the area of human-computer
interaction. It is about enabling computers to understand human or natural language
input.
4) Noisy text analytics:- It is the process of extracting structured or semi-structured
information from noisy unstructured data such as chats, blogs, wikis, emails,
message boards, text messages, etc. The noisy unstructured data usually comprises
one or more of the following: spelling mistakes, abbreviations, acronyms,
non-standard words, missing punctuation, missing letter case, filler words such as
“uh”, “um”, etc.
5) Manual tagging with metadata:- This is about tagging manually with adequate
metadata to provide the requisite semantics to understand unstructured data.
6) Part-of-speech tagging:- It is also called POS or POST or grammatical tagging. It
is the process of reading text and tagging each word in the sentence as belonging to a
particular part of speech such as “noun”, “verb”, “adjective”, etc.
7) Unstructured Information Management Architecture (UIMA):- It is an open-source
platform that originated at IBM. It is used for real-time content analytics. It is about
processing text and other unstructured data to find latent meaning and relevant
relationships.
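
As a rough illustration of the noisy text analytics item, the Python sketch below removes filler words, expands a few abbreviations, and normalizes letter case and punctuation. The word lists are tiny and hypothetical; a real system would need far larger dictionaries and spelling correction.

```python
import re

# Hypothetical, deliberately tiny lookup tables for the sketch.
FILLERS = {"uh", "um"}
ABBREVIATIONS = {"pls": "please", "u": "you", "thx": "thanks", "msg": "message"}

def normalize(text):
    """Lowercase the text, split it into words, drop fillers, expand abbreviations."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())  # punctuation is discarded
    cleaned = []
    for token in tokens:
        if token in FILLERS:
            continue
        cleaned.append(ABBREVIATIONS.get(token, token))
    return cleaned

cleaned_words = normalize("Um, pls send me the msg... uh THX u")
print(cleaned_words)
```

Once the text is reduced to a clean list of words, it can be passed on to steps such as part-of-speech tagging or text categorization.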
