Big Data and Analytics Cse448 Module 1 L
Classification of Digital Data
Unstructured data:
This is the data which does not conform to a data
model or is not in a form which can be used easily
by a computer program.
About 80-90% of an organization's data is in this
format; for example, memos, chat rooms, PowerPoint
presentations, images, videos, letters, research,
white papers, body of an email, etc.
Classification of Digital Data (contd.)
Semi-structured data: This is the data which does not
conform to a data model but has some structure.
However, it is not in a form which can be used easily by a
computer program;
for example, XML and markup languages like HTML.
Metadata for this data is available but is not sufficient.
Structured data: This is the data which is in an
organized form (e.g., in rows and columns) and can be
easily used by a computer program. Relationships exist
between entities of data, such as classes and their objects.
Data stored in databases is an example of structured
data.
Approximate Percentage Distribution of Digital Data
[Figure: approximate percentage distribution of digital data]
Structured Data
Databases such as Oracle, DB2, Teradata, MySQL, PostgreSQL, etc.
OLTP systems
Ease of Working with Structured Data
The ease is with respect to the following:
Insert/update/delete: The Data Manipulation Language
(DML) operations provide the required ease with data
input, storage, access, process, analysis, etc.
Security: How does one ensure the security of
information? Encryption and tokenization solutions are
available to ensure the security of information
throughout its lifecycle.
Organizations are able to retain control and maintain
compliance adherence by ensuring that only authorized
individuals are able to decrypt and view sensitive
information.
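The insert/update/delete operations above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the employee table, its columns, and the sample rows are hypothetical and do not come from the slides.

```python
import sqlite3


def run_dml_demo():
    """Run INSERT, UPDATE and DELETE on a small in-memory table."""
    conn = sqlite3.connect(":memory:")  # throwaway database, nothing is written to disk
    # Hypothetical table: structured data organized in rows and columns.
    conn.execute(
        "CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)"
    )
    # INSERT: add three rows (sample values are made up for illustration).
    conn.executemany(
        "INSERT INTO employee (id, name, dept) VALUES (?, ?, ?)",
        [(1, "Asha", "Sales"), (2, "Ravi", "HR"), (3, "Meena", "IT")],
    )
    # UPDATE: move employee 2 to another department.
    conn.execute("UPDATE employee SET dept = ? WHERE id = ?", ("Finance", 2))
    # DELETE: remove employee 3.
    conn.execute("DELETE FROM employee WHERE id = ?", (3,))
    conn.commit()
    rows = conn.execute(
        "SELECT id, name, dept FROM employee ORDER BY id"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in run_dml_demo():
        print(row)
```

Because the schema is fixed in advance, each operation is a single declarative statement; this is the "ease" the slide refers to.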
Ease of Working with Structured Data
Input / Update / Delete
Security
Scalability
Transaction Processing
Semi-structured Data
Inconsistent structure
Self-describing (label/value pairs)
Schema information is often blended with data values
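A small sketch of what these characteristics look like in practice, using JSON as one possible semi-structured format. The two records and their field names are hypothetical; they are chosen only to show label/value pairs and an inconsistent structure.

```python
import json

# Hypothetical records: each one carries its own labels (self-describing),
# and the two records do not share the same set of fields (inconsistent structure).
RAW = """[
  {"name": "Asha", "email": "asha@example.com"},
  {"name": "Ravi", "phone": "555-0101", "skills": ["SQL", "Python"]}
]"""


def collect_labels(raw):
    """Return the sorted list of labels found in each record."""
    records = json.loads(raw)
    return [sorted(record.keys()) for record in records]


def common_labels(raw):
    """Return the labels that every record has."""
    label_sets = [set(labels) for labels in collect_labels(raw)]
    return set.intersection(*label_sets)


if __name__ == "__main__":
    print(collect_labels(RAW))
    print(common_labels(RAW))
```

The schema (the labels) travels with the values, so a program has to discover the structure record by record instead of relying on a fixed table definition.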
Unstructured data
Images
Free-form text
Audios
Videos
Body of email
Text messages
Chats
Social media data
Word documents
Issues with Terminology: Unstructured Data
Data Mining
Data Mining:
First, we deal with large data sets.
Regression analysis:
It helps to predict the relationship between two
variables.
The variable whose value needs to be predicted is
called the dependent variable and the variables
which are used to predict the value are referred to
as the independent variables.
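As a minimal sketch of the idea, the following fits a straight line with ordinary least squares for one independent variable and one dependent variable. The data (hours studied versus test score) is made up for illustration, and simple linear regression is only one of several regression techniques.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Predict the dependent variable for a new value of the independent variable."""
    return slope * x + intercept


# Hypothetical data: hours studied (independent) and test score (dependent).
HOURS = [1, 2, 3, 4, 5]
SCORES = [52, 54, 56, 58, 60]

if __name__ == "__main__":
    slope, intercept = fit_line(HOURS, SCORES)
    print(slope, intercept)
    print(predict(6, slope, intercept))
```

Here the score is the dependent variable and the hours studied is the independent variable used to predict it.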
Issues with "Unstructured" Data
Collaborative filtering:
It is about predicting a user’s preference or
preferences based on the preferences of a group of
users.
For example, take a look at the table on the next slide.
We are looking at predicting whether User 4 will prefer to
learn using videos or is a textual learner, depending on one
or a couple of his or her known preferences.
We analyze the preferences of similar user profiles and on
the basis of it, predict that User 4 will also like to learn using
videos and is not a textual learner.
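The reasoning above can be sketched as a tiny user-based collaborative filter. The preference table below is hypothetical (the slide's table is not reproduced here), and the similarity measure, the share of matching known preferences, is an assumption chosen for simplicity.

```python
# Hypothetical preference table: 1 = likes, 0 = does not like.
PREFERENCES = {
    "User 1": {"hands_on": 1, "group_study": 1, "videos": 1, "text": 0},
    "User 2": {"hands_on": 1, "group_study": 0, "videos": 1, "text": 0},
    "User 3": {"hands_on": 0, "group_study": 1, "videos": 0, "text": 1},
}

# Only a couple of User 4's preferences are known.
USER_4 = {"hands_on": 1, "group_study": 1}


def similarity(known, other):
    """Fraction of the known preferences on which the other user agrees."""
    shared = [item for item in known if item in other]
    if not shared:
        return 0.0
    matches = sum(1 for item in shared if known[item] == other[item])
    return matches / len(shared)


def predict_preference(known, others, item):
    """Similarity-weighted average of the other users' preference for item."""
    weighted_sum = 0.0
    total_weight = 0.0
    for prefs in others.values():
        if item not in prefs:
            continue
        weight = similarity(known, prefs)
        weighted_sum += weight * prefs[item]
        total_weight += weight
    if total_weight == 0:
        return None  # no similar users to learn from
    return weighted_sum / total_weight


if __name__ == "__main__":
    for item in ("videos", "text"):
        print(item, predict_preference(USER_4, PREFERENCES, item))
```

A score above 0.5 is read as "likely to prefer". With this made-up table, the users most similar to User 4 favor videos, so the prediction leans toward videos and away from text.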
Issues with "Unstructured" Data
IBM UIMA
Questions and Answers
Which category (structured, semi-structured, or
unstructured) will you place a Web Page in?
Which category (structured, semi-structured, or
unstructured) will you place Word Document in?
State a few examples of human generated and
machine-generated data.