SESSION 2016-2017

B.TECH (CSE) YEAR: IV SEMESTER: VIII

BIG DATA AND ANALYTICS
(CSE448)
MODULE 1 (L1)
Presented By
Dilip Kumar Sharma , Rahul Pradhan, Vivek Kumar, Yogesh Gupta
Dept of Computer Engineering & Applications
GLA University India
Classification of Digital Data

Digital data is classified into the following categories:
 Structured data
 Semi-structured data
 Unstructured data
Classification of Digital Data

 Unstructured data:
 This is data which does not conform to a data
model or is not in a form which can be used easily
by a computer program.
 About 80-90% of an organization's data is in this
format; for example, memos, chat rooms, PowerPoint
presentations, images, videos, letters, research,
white papers, the body of an email, etc.
Classification of Digital Data..
 Semi-structured data: This is data which does not
conform to a data model but has some structure.
However, it is not in a form which can be used easily by a
computer program.
 Examples: emails, XML, markup languages like HTML, etc.
Metadata for this data is available but is not sufficient.
 Structured data: This is data which is in an
organized form (e.g., in rows and columns) and can be
easily used by a computer program. Relationships exist
between entities of data, such as classes and their objects.
Data stored in databases is an example of structured
data.
Approximate Percentage
Distribution of Digital Data
[Figure: approximate percentage distribution of digital data]
Structured Data

 This is the data which is in an organized form (e.g., in
rows and columns) and can be easily used by a
computer program.
 Relationships exist between entities of data, such as
classes and their objects.
 Data stored in databases is an example of structured
data.
Sources of Structured Data

 If your data is highly structured, you can house it in
any of the available RDBMSs:
 Oracle (Oracle Corp.), DB2 (IBM), Microsoft SQL Server
(Microsoft), Greenplum (EMC), Teradata (Teradata),
MySQL (open source), PostgreSQL (advanced open
source), etc.
 These databases are typically used to hold
transaction/operational data generated and collected by
day-to-day business activities. In other words, the data of
On-Line Transaction Processing (OLTP) systems is
generally quite structured.
Sources of Structured Data

Structured data sources:
 Databases such as Oracle, DB2, Teradata, MySQL, PostgreSQL, etc.
 Spreadsheets
 OLTP systems
Ease of Working with Structured Data
The ease is with respect to the following:
 Insert/update/delete: The Data Manipulation Language
(DML) operations provide the required ease with data
input, storage, access, processing, analysis, etc.
 Security: How does one ensure the security of
information? Encryption and tokenization solutions are
available to safeguard information throughout its
lifecycle.
 Organizations are able to retain control and maintain
compliance by ensuring that only authorized
individuals are able to decrypt and view sensitive
information.
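The insert/update/delete operations above can be sketched with Python's built-in sqlite3 module. This is only a minimal illustration: the `employee` table, its columns, and the sample rows are hypothetical and do not come from the slides.

```python
import sqlite3

# In-memory database; the table and rows below are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT, salary REAL)"
)

# INSERT: add three rows.
conn.executemany(
    "INSERT INTO employee (emp_id, name, salary) VALUES (?, ?, ?)",
    [(1, "Asha", 50000.0), (2, "Ravi", 42000.0), (3, "Meena", 61000.0)],
)

# UPDATE: raise one employee's salary.
conn.execute("UPDATE employee SET salary = salary + 5000 WHERE emp_id = ?", (2,))

# DELETE: remove one row.
conn.execute("DELETE FROM employee WHERE emp_id = ?", (3,))
conn.commit()

# SELECT: read back what is left, ordered by key.
rows = conn.execute(
    "SELECT emp_id, name, salary FROM employee ORDER BY emp_id"
).fetchall()
print(rows)
```

Because every row has the same columns, each statement can address the data by column name and key, which is the ease the slide refers to.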
Ease of Working with Structured Data

 Indexing: An index is a data structure that speeds up
data retrieval operations (primarily the SELECT
statement) at the cost of additional writes and storage
space. The gains in search performance usually outweigh
that cost.
 Scalability: The storage and processing capabilities of the
traditional RDBMS can be easily scaled up by increasing the
horsepower of the database server (increasing the primary
and secondary or peripheral storage capacity, the processing
capacity of the processor, etc.).
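The effect of an index can be observed, as a rough sketch, by asking SQLite for its query plan before and after creating one. The `employee` table, the index name `idx_employee_name`, and the generated rows are assumptions made for this example; the exact wording of the plan text depends on the SQLite version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT, salary REAL)"
)
# Hypothetical data: 1000 generated rows.
conn.executemany(
    "INSERT INTO employee (name, salary) VALUES (?, ?)",
    [(f"emp{i}", 1000.0 + i) for i in range(1000)],
)

def query_plan(connection):
    """Return SQLite's plan description for a lookup by name."""
    rows = connection.execute(
        "EXPLAIN QUERY PLAN SELECT salary FROM employee WHERE name = ?",
        ("emp500",),
    ).fetchall()
    # The last column of each plan row holds the human-readable detail.
    return " ".join(str(row[-1]) for row in rows)

plan_before = query_plan(conn)  # no index yet: the table is scanned
conn.execute("CREATE INDEX idx_employee_name ON employee (name)")
plan_after = query_plan(conn)   # the plan now mentions the index

print(plan_before)
print(plan_after)
```

The index has to be maintained on every insert, update, and delete, which is the extra write and storage cost mentioned above.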
Ease of Working with Structured Data
 Transaction processing: RDBMS has support for the Atomicity,
Consistency, Isolation, and Durability (ACID) properties of
transactions.
 Atomicity: A transaction is atomic, meaning that either it happens in its
entirety or not at all.
 Consistency: The database moves from one consistent state to another
consistent state. In other words, if the same piece of information is
stored at two or more places, the copies are in complete agreement.
 Isolation: Resources are allocated to the transaction such that the
transaction gets the impression that it is the only transaction happening,
in isolation.
 Durability: All changes made to the database by a committed transaction
are permanent, and that accounts for the durability of the transaction.
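Atomicity can be illustrated with a small funds-transfer sketch using Python's sqlite3 module. The `account` table, the balances, and the `transfer` helper are hypothetical; the CHECK constraint is used here only to force a failure partway through a transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: a balance may never go negative.
conn.execute(
    "CREATE TABLE account ("
    "acc_id INTEGER PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

def transfer(connection, src, dst, amount):
    """Move amount from src to dst; either both updates apply or neither."""
    try:
        with connection:  # commits on success, rolls back on an exception
            connection.execute(
                "UPDATE account SET balance = balance + ? WHERE acc_id = ?",
                (amount, dst),
            )
            connection.execute(
                "UPDATE account SET balance = balance - ? WHERE acc_id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(conn, 1, 2, 30)       # succeeds: balances become 70 and 80
failed = transfer(conn, 1, 2, 500)  # debit violates CHECK, credit is undone

balances = dict(conn.execute("SELECT acc_id, balance FROM account"))
print(ok, failed, balances)
```

In the second call the credit to account 2 has already been applied when the debit fails, yet the rollback leaves both balances unchanged.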
Ease with Structured Data

Ease with structured data:
 Insert / Update / Delete
 Security
 Indexing / Searching
 Scalability
 Transaction Processing
Semi-structured Data

 This is the data which does not conform to a data
model but has some structure.
 However, it is not in a form which can be used easily
by a computer program.
 Examples: emails, XML, markup languages like HTML,
etc. Metadata for this data is available but is not
sufficient.
Semi-structured Data
It has the following features:
 It does not conform to the data models that one typically associates with
relational databases or any other form of data tables.
 It uses tags to segregate semantic elements.
 Tags are also used to enforce hierarchies of records and fields within
data.
 There is no separation between the data and the schema.
 The amount of structure used is dictated by the purpose at hand.
 In semi-structured data, entities belonging to the same class and
grouped together need not have the same set of attributes.
 Even when they do have the same set of attributes, the order of the
attributes may differ, and for all practical purposes the order is not
important.
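These features can be seen in a short XML sketch parsed with Python's standard library. The `<books>` document below is made up for illustration (the second book is hypothetical): tags mark the semantic elements and the hierarchy, the two records carry different attribute sets, and the shared attributes appear in a different order.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: two records of the same class with different fields.
xml_text = """
<books>
  <book>
    <title>Fundamentals of Business Analytics</title>
    <publisher>Wiley India</publisher>
  </book>
  <book>
    <publisher>Example Press</publisher>
    <title>A Hypothetical Second Book</title>
    <year>2015</year>
  </book>
</books>
"""

root = ET.fromstring(xml_text)

# Turn each <book> element into a dict of tag -> text.
records = [{child.tag: child.text for child in book} for book in root.findall("book")]

# The attribute sets differ between records of the same class.
fields = [sorted(record) for record in records]
print(fields)
```

The schema (tag names) travels with the data itself, so nothing outside the document declares which fields a book must have.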
Sources of Semi-structured Data

 Amongst the sources for semi-structured data, the front runners
are "XML" and "JSON".
 XML: eXtensible Markup Language (XML) was popularized
by web services developed using the Simple Object Access
Protocol (SOAP).
Sources of Semi-structured Data

Semi-structured data sources:
 XML (eXtensible Markup Language)
 Other markup languages
 JSON (JavaScript Object Notation)
Characteristics of Semi-structured Data

Semi-structured data:
 Inconsistent structure
 Self-describing (label/value pairs)
 Schema information is often blended with data values
 Data objects may have different attributes not known beforehand
Sources of Semi-structured Data

 JSON: JavaScript Object Notation (JSON) is used to transmit
data between a server and a web application.
 JSON was popularized by web services developed using
Representational State Transfer (REST), an architectural style for
creating scalable web services.
 MongoDB (open-source, distributed, NoSQL, document-
oriented database) and Couchbase (originally known as
Membase, open-source, distributed, NoSQL, document-oriented
database) store data natively as JSON-style documents (MongoDB
uses BSON, a binary encoding of such documents).
Sources of Semi-structured Data
An example of HTML is as follows:
<HTML>
<HEAD>
<TITLE>Place your title here</TITLE>
</HEAD>
<BODY BGCOLOR="FFFFFF">
<CENTER><IMG SRC="clouds.jpg" ALIGN="BOTTOM"></CENTER>
<HR> <a href="http://bigdatauniversity.com">Link Name</a>
<H1>this is a Header</H1>
<H2>this is a sub Header</H2>
Send me mail at <a href="mailto:[email protected]"> [email protected]</a>.
<P>a new paragraph!
<P><B>a new paragraph!</B>
<BR><B><I>this is a new sentence without a paragraph break, in bold italics.</I></B>
<HR>
</BODY>
</HTML>
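The tags in a page like this give it enough structure for a program to pull out specific elements, even though there is no fixed schema. As a minimal sketch, the following uses Python's standard html.parser on a trimmed version of the page; the `LinkCollector` class is a hypothetical helper written for this example.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lowercase.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

# Trimmed version of the sample page.
html_text = (
    "<HTML><BODY>"
    '<a href="http://bigdatauniversity.com">Link Name</a>'
    "<H1>this is a Header</H1>"
    "</BODY></HTML>"
)

parser = LinkCollector()
parser.feed(html_text)
links = parser.links
print(links)
```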
Sources of Semi-structured Data

Sample JSON document

{
"_id": 9,
"BookTitle": "Fundamentals of Business Analytics",
"AuthorName": "Seema Acharya",
"Publisher": "Wiley India",
"YearofPublication": "2011"
}
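A document like this can be read with Python's standard json module, as sketched below. Strict JSON requires keys and string values to be enclosed in straight double quotes, so the text passed to the parser is written that way; the variable names are chosen only for this example.

```python
import json

# The sample document, written as strict JSON (double-quoted keys and strings).
doc_text = """
{
  "_id": 9,
  "BookTitle": "Fundamentals of Business Analytics",
  "AuthorName": "Seema Acharya",
  "Publisher": "Wiley India",
  "YearofPublication": "2011"
}
"""

doc = json.loads(doc_text)

# The label/value pairs are self-describing: fields are looked up by name.
print(doc["BookTitle"], "by", doc["AuthorName"])
print(sorted(doc))
```

Note that `_id` is parsed as a number while `YearofPublication` stays a string, exactly as written in the document.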
Unstructured Data

 This is the data which does not conform to a data
model or is not in a form which can be used easily by
a computer program.
 About 80–90% of an organization's data is in this
format.
 Examples: memos, chat rooms, PowerPoint
presentations, images, videos, letters, research,
white papers, the body of an email, etc.
Sources of Unstructured Data
Unstructured data sources:
 Web pages
 Images
 Free-form text
 Audios
 Videos
 Body of email
 Text messages
 Chats
 Social media data
 Word documents
Issues with terminology –
Unstructured Data

 Structure can be implied despite not being
formally defined.
 Data with some structure may still be labeled
unstructured if the structure doesn't help with the
processing task at hand.
 Data may have some structure or may even be
highly structured in ways that are unanticipated
or unannounced.
How to Deal with Unstructured Data?

 Today, unstructured data constitutes approximately
80% of the data that is being generated in any
enterprise.
Dealing with Unstructured Data

Approaches for dealing with unstructured data:
 Data mining
 Natural Language Processing (NLP)
 Text analytics
 Noisy text analytics

Issues with "Unstructured" Data

 Data Mining:
 First, we deal with large data sets.
 Second, we use methods at the intersection of
artificial intelligence, machine learning, statistics,
and database systems to unearth consistent
patterns in large data sets and/or systematic
relationships between variables.
 It is the analysis step of the "knowledge discovery
in databases" process.
Issues with "Unstructured" Data

A few popular data mining algorithms are as follows:

 Association rule mining:
 It is also called "market basket analysis" or
"affinity analysis".
 It is used to determine "What goes with what?"
 It asks: when you buy a product, which other
product are you likely to purchase with it?
 For example, if you pick up bread from the
grocery, are you likely to pick up eggs or cheese to go
with it?
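The "what goes with what" question is usually quantified with two measures, support and confidence. The sketch below computes both for the bread example over a tiny, made-up list of shopping baskets; the transactions are hypothetical and the functions are a plain illustration, not a full association rule mining algorithm.

```python
# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"bread", "eggs"},
    {"bread", "cheese"},
    {"bread", "eggs", "milk"},
    {"milk", "cheese"},
    {"bread", "eggs", "cheese"},
]

def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for basket in transactions if itemset <= basket)

def support(itemset):
    """Fraction of all transactions that contain itemset."""
    return count(itemset) / len(transactions)

def confidence(antecedent, consequent):
    """Of the baskets with antecedent, the fraction that also hold consequent."""
    return count(antecedent | consequent) / count(antecedent)

bread_eggs_support = support({"bread", "eggs"})
bread_to_eggs = confidence({"bread"}, {"eggs"})
bread_to_cheese = confidence({"bread"}, {"cheese"})

print(bread_eggs_support, bread_to_eggs, bread_to_cheese)
```

With these baskets, a shopper who picks up bread is more likely to also pick up eggs than cheese.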
Issues with "Unstructured" Data

 Regression analysis:
 It estimates the relationship between variables so
that the value of one can be predicted from the others.
 The variable whose value needs to be predicted is
called the dependent variable and the variables
which are used to predict the value are referred to
as the independent variables.
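For one independent variable, a common choice is simple linear regression fitted by ordinary least squares. The sketch below assumes that technique and uses made-up data points; `x` plays the role of the independent variable and `y` the dependent variable.

```python
# Hypothetical observations: x is the independent variable, y the dependent one.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# Ordinary least squares estimates for the line y = intercept + slope * x.
slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
intercept = mean_y - slope * mean_x

def predict(value):
    """Predict the dependent variable for a new independent value."""
    return intercept + slope * value

prediction = predict(6.0)
print(slope, intercept, prediction)
```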

Issues with "Unstructured" Data

 Collaborative filtering:
 It is about predicting a user's preference or
preferences based on the preferences of a group of
users.
 For example, take a look at the table on the next slide.
 We want to predict whether User 4 will prefer to
learn using videos or is a textual learner, based on one
or a couple of his or her known preferences.
 We analyze the preferences of similar user profiles and, on
that basis, predict that User 4 will also like to learn using
videos and is not a textual learner.
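The reasoning above can be sketched as a very small user-based collaborative filter. The preference values below are invented for illustration and are not the table from the slides; the similarity measure (share of known preferences that match) and the 0.5 threshold are also assumptions.

```python
# Hypothetical preferences (1 = likes, 0 = does not like); not the slide's table.
prefs = {
    "User 1": {"audio": 1, "games": 1, "videos": 1, "text": 0},
    "User 2": {"audio": 1, "games": 1, "videos": 1, "text": 0},
    "User 3": {"audio": 0, "games": 0, "videos": 0, "text": 1},
}

# What we already know about User 4.
user4_known = {"audio": 1, "games": 1}

def similarity(known, other):
    """Share of the known preferences on which the other user agrees."""
    matches = sum(1 for item, value in known.items() if other[item] == value)
    return matches / len(known)

def predict(known, item):
    """Similarity-weighted vote of the other users on item."""
    weights = {user: similarity(known, p) for user, p in prefs.items()}
    total = sum(weights.values())
    if total == 0:
        return None  # nobody is similar enough to say anything
    score = sum(weight * prefs[user][item] for user, weight in weights.items()) / total
    return score >= 0.5

likes_videos = predict(user4_known, "videos")
likes_text = predict(user4_known, "text")
print(likes_videos, likes_text)
```

User 4 agrees with Users 1 and 2 on every known preference and with User 3 on none, so only the first two influence the prediction.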
Issues with "Unstructured" Data

Table . Sample record depicting learners' preferences for mode of learning
Issues with "Unstructured" Data

 Text Analytics or Text Mining: Compared to the
structured data stored in relational databases, text is
largely unstructured, amorphous, and difficult to deal
with algorithmically.
 Text mining is the process of gleaning high-quality and
meaningful information (through the devising of patterns
and trends by means of statistical pattern learning)
from text.
 It includes tasks such as text categorization, text
clustering, sentiment analysis, concept/entity extraction,
etc.
Issues with "Unstructured" Data
 Natural language processing (NLP): It is related to the
area of human-computer interaction. It is about enabling
computers to understand human or natural language
input.
 Noisy text analytics: It is the process of extracting
structured or semi-structured information from noisy
unstructured data such as chats, blogs, wikis, emails,
message boards, text messages, etc.
 The noisy unstructured data usually comprises one or more of the
following: spelling mistakes, abbreviations, acronyms, non-
standard words, missing punctuation, missing letter case, filler
words such as "uh", "um", etc.
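A first step in noisy text analytics is often to normalize the text. The sketch below handles three of the noise types listed above (letter case, abbreviations, filler words); the abbreviation table and the sample message are hypothetical, and real systems use much larger dictionaries or learned models.

```python
import re

# Hypothetical lookup tables, kept tiny for illustration.
ABBREVIATIONS = {"u": "you", "pls": "please", "thx": "thanks"}
FILLERS = {"uh", "um"}

def normalize(text):
    """Lowercase the text, drop filler words, and expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    cleaned = [ABBREVIATIONS.get(token, token) for token in tokens if token not in FILLERS]
    return " ".join(cleaned)

cleaned_message = normalize("Uh, can u pls send the report... um, thx")
print(cleaned_message)
```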
Issues with "Unstructured" Data

 Manual tagging with metadata: This is about
tagging manually with adequate metadata to
provide the requisite semantics to understand
unstructured data.
 Part-of-speech tagging: It is also called POS or POST
or grammatical tagging. It is the process of reading text
and tagging each word in the sentence as belonging
to a particular part of speech such as "noun", "verb",
"adjective", etc.
Issues with "Unstructured" Data

 Unstructured Information Management Architecture
(UIMA): It is an open source platform from IBM. It is
used for real-time content analytics.
 It is about processing text and other unstructured data to
find latent meaning and relevant relationships buried
therein. Read more on UIMA at the link
http://www.ibm.com/developerworks/data/downloads/uima/
Summary

 Structured data: It conforms to a data model. For example,
an RDBMS conforms to the relational data model. It has a pre-defined
schema.
 Semi-structured data: For this format of data, some metadata is
available, but it is insufficient. Semi-structured data has a
self-describing structure. There is little or no separation between
data and schema.
 Unstructured data: This data is growing by the day, by
leaps and bounds. It has innumerable sources, such as human-
generated data (social media data, emails, Word documents,
presentations, audio and video files that we create and share
every day, etc.) and machine-generated data (sensors, web
server logs, call data records, etc.).
Answer a few quick questions …
 Match the following
Column A                  Column B
NLP                       Content analytics
Text analytics            Text messages
UIMA                      Chats
Noisy unstructured data   Text mining
Data mining               Comprehend human or natural language input
Noisy unstructured data   Uses methods at the intersection of statistics, artificial intelligence, machine learning & DBs
IBM                       UIMA
Questions
 In which category (structured, semi-structured, or
unstructured) will you place a web page?
 In which category (structured, semi-structured, or
unstructured) will you place a Word document?
 State a few examples of human-generated and
machine-generated data.
