Chapter 2 - Types of Digital Data
Learning objectives
[Figure: pie chart of data by type: unstructured data 80%, structured data 10%, semi-structured data 10%]
Unstructured data
• Unstructured data – data which does not conform to a data model.
▫ No identifiable structure within this kind of data is available
▫ Data cannot be stored in rows and columns in a relational database
▫ It is not in a form which can be used easily by a computer program
▫ Advantage - no additional effort on its classification is necessary
▫ Limitation - no controlled navigation within unstructured content is
possible
• Storing data in an unstructured form without any defined data
schema is a common way of filing information
▫ About 80-90% of an organization's data is in this format
▫ Example:- memos, chat rooms, PowerPoint presentations, images, videos, letters,
white papers, the body of an email etc.
• A common technology to search in unstructured text documents is
full-text search.
▫ A well-known full-text search engine library is Apache Lucene. Other examples are
the full-text indexes of MySQL and PostgreSQL
▫ Advantage of full-text search - it is completely decoupled from the data
▫ This makes it very flexible - it can be used on every kind of textual data
▫ Limitation - it cannot be used to search for pictures or videos
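The idea behind full-text search can be sketched with a tiny inverted index. This is a minimal illustration only, not how Lucene, MySQL or PostgreSQL implement it; the function names and the sample documents are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical sample documents (assumed for illustration).
documents = {
    "memo-1": "Quarterly review of the nursing training program",
    "email-7": "New drug combination for the training of staff",
    "letter-3": "Invitation to the annual review meeting",
}
index = build_index(documents)
print(sorted(search(index, "training program")))
print(sorted(search(index, "review")))
```

The index only needs text as input, which is why the approach is decoupled from the data and works on any textual source, but not on pictures or videos.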
Fully Structured data
• Structured data – data follows a
predefined schema i.e., data conforms
to some specification
▫ can be used easily by a computer program
▫ Example:- Data stored in databases in rows and columns
• Well-defined schema of fully structured data enables efficient data
processing, improved storage and navigation of content
• Designing a database schema is an elaborate process. It has to be
defined before the content is created. It defines the type and structure
of data and its relations. The figure above illustrates an ER diagram and
its concrete tables within an RDBMS
▫ Limitation - difficult to subsequently extend a previously defined database
schema that already contains content.
▫ Advantage - existing tools & web frameworks support the development of
database-focused applications.
▫ For instance, Hibernate and Oracle TopLink are Object/Relational (O/R)
Mapping frameworks, which map classes and objects to relational database
tables and rows.
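A predefined schema can be sketched with Python's built-in `sqlite3` module. The table and column names below are assumptions made for illustration; they are not taken from the figure referred to above.

```python
import sqlite3

# In-memory database; the schema must exist before any content is stored.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("""
    CREATE TABLE department (
        dept_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL
    )""")
conn.execute("""
    CREATE TABLE employee (
        emp_id  INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        dept_id INTEGER NOT NULL REFERENCES department(dept_id)
    )""")

# Hypothetical rows that conform to the schema.
conn.execute("INSERT INTO department VALUES (1, 'Cardiology')")
conn.execute("INSERT INTO employee VALUES (10, 'A. Nurse', 1)")

# The known structure makes querying across related tables straightforward.
rows = conn.execute("""
    SELECT e.name, d.name
    FROM employee e JOIN department d ON d.dept_id = e.dept_id""").fetchall()
print(rows)

# A row that does not conform to the schema (missing name) is rejected.
try:
    conn.execute("INSERT INTO employee VALUES (11, NULL, 1)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)
```

Because every row must match the declared columns and constraints, programs can rely on the structure, at the price of having to change the schema before differently shaped data can be stored.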
Semi-structured data
• In some applications, data is collected in an ad-hoc manner before it is
known how it will be stored and managed
• Semi-structured data does not conform to a data model but has some
structure
▫ Not all the information collected will have identical structure
▫ The schema information is mixed in with the data values, since each data object can
have different attributes that are not known in advance. Hence, this type of data is
sometimes referred to as self-describing data.
▫ Metadata for this data is available but is not sufficient
▫ Example:- emails, XML, markup languages like HTML etc.
▫ Example:- XML is a language for data representation and exchange on the web. In XML,
data can be encoded directly, and a Document Type Definition (DTD) or XML Schema
(XMLS) defines the structure of the XML document
▫ Advantage - ability to accommodate variations in structure
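The "self-describing" nature of semi-structured data can be sketched with a small XML fragment read by Python's standard `xml.etree.ElementTree`. The element names and values are hypothetical; the point is that each object carries its own attribute names, and two objects of the same kind need not share them.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML: the two <patient> objects have different attributes.
xml_text = """
<patients>
  <patient id="p1">
    <name>Patient One</name>
    <age>42</age>
  </patient>
  <patient id="p2">
    <name>Patient Two</name>
    <allergy>penicillin</allergy>
    <phone>000-0000</phone>
  </patient>
</patients>
"""

root = ET.fromstring(xml_text)
records = []
for patient in root.findall("patient"):
    # The tags inside the data tell us which attributes this object has.
    record = {"id": patient.get("id")}
    for child in patient:
        record[child.tag] = child.text
    records.append(record)

for record in records:
    print(sorted(record))
```

No schema had to be declared up front: the structure is discovered while reading, which is what allows variations between objects.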
Case Study: GoodLife HealthCare Group
Organizational Structure
• GoodLife HealthCare Group has a Board of Directors at its helm.
There are 4 directors in all, each an exceptional leader from the healthcare
industry. The group lives by the norm that disciplined training is the key to
success. The senior doctors and paramedical staff are very hands-on. They
walk the talk at all times. Strategic decisions are taken by the company’s board
after cautious consultation, thorough planning and review. The strategic
decisions are then conveyed to all the stakeholders such as the shareholders,
vendor partners, employees, external consultants, etc. The group has an acute
focus on the quality of service being offered.
Quality Management
• The healthcare group spends a great deal of time and effort in ensuring
that its people are the best. They believe that the nursing and paramedical
staff constitute the backbone of the organization. They have a clearly laid
out process for recruitment, training and on-the-job monitoring. The
organization has its own curriculum and a grueling training program that
all the new hires have to religiously undertake. Senior doctors act as
mentors of the junior doctors. It is their responsibility to groom their
junior partners. Every employee is well aware of the organization’s
philosophy and the way of life prevalent in the organization.
Marketing
• GoodLife HealthCare Group realized the power of marketing long ago.
They have utilized practically all channels to best sell their services. They
regularly advertise in newspapers and magazines, and more so when
they introduce a new therapy or treatment. They have huge hoardings
speaking about their facilities. They advertise on television with campaigns
that viewers cannot help but sit through. They have had
the most marketable sportspersons endorse their products. Lastly, the group
understands that “word of mouth” is very powerful and that the best
advertisers are the patients themselves who have been treated at one of the
group’s facilities.
Alliance Management
• Over the years GoodLife HealthCare Group has established a good
alliance with specialist doctors, consultants, surgeons,
physiotherapists etc. The group has also built a strong network of
responsible and highly dependable supplier partners. There is a very
transparent system of communication to all its supplier partners.
All its vendor partners are aware of the values
and philosophy that define the healthcare group. The
group believes in a win-win strategy for all. It is with the help and
support of these vendor partners that the group is able to stock
just the required amount of inventory and to procure
emergency inventory supplies at very short notice. The group
focuses only on its core business processes and outsources the
remaining processes to remain in control and stay ahead of the
competition.
Future Outlook
• GoodLife HealthCare Group is looking at expanding into other countries
and at growing in its existing markets.
It has 27,000 employees today and is looking at growing to 60,000 in
the next 5 years. The group already has a dedicated wing for the treatment of
bone deformities. It aspires to set up a chemist store within its premises to
make it convenient for its patients. It would like to set up an artificial
limb center for the production of artificial limbs and the rehabilitation of
orthopedic surgery patients. The GoodLife HealthCare Group realizes its
social obligation too and is looking forward to setting up a free hospital
with 250 beds in a couple of years’ time.
Information Technology
• Web presence
▫ GoodLife HealthCare Group has an excellent web presence: it leverages its website, social
networking and banner ads on mobile devices.
▫ Leverages internet technology for surveys and targeted mailers.
▫ Self-help portal for online registration for treatment of ailments.
• Front office management
▫ Patient relationship management
▫ Alliance management
▫ Registration and discharge of patients
▫ Billing
▫ Help desk
Human Capital Management & Training Management
• Employee satisfaction surveys
• Employee retention program management
• Employee training and development program management
Personal Productivity
• Email, web access, PDA connect
• Suggestions
• Quick surveys
• Feedback from communities of patients and from communities of
specialist doctors, consultants and physiotherapists
Research questions
• Where is each type of its data present?
• How is it stored?
• How is the desired information extracted from it?
• How important is the information provided by it?
• How can this information augment public health and
healthcare services?
Terminology
• Entity – thing or object in real world that is distinguishable from all
other objects
▫ Example:- each person in an enterprise is an entity.
• Entity set – set of entities of the same type that share the same
properties/ attributes
▫ Example: - customers of a given bank can be defined as entity set
customer
• Attribute – an entity is represented by a set of attributes. Attributes
are descriptive properties possessed by each member of an entity set
▫ Possible attributes of the customer entity set are customer name, loan
amount etc.
• Database – includes a collection of entity sets each of which
contains any number of entities of the same type.
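The four terms can be sketched as plain Python data structures. This is only an illustration of the vocabulary: `customer_id` and the sample values are assumptions added so that entities are distinguishable; only customer name and loan amount come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    """One entity; its fields are the attributes of the entity set."""
    customer_id: str      # assumed identifier, not named in the slides
    customer_name: str
    loan_amount: float


# Entity set: entities of the same type sharing the same attributes.
customer_set = {
    Customer("C001", "Customer A", 12000.0),
    Customer("C002", "Customer B", 0.0),
}

# Database: a collection of entity sets, keyed here by entity-set name.
database = {"customer": customer_set}

names = sorted(c.customer_name for c in database["customer"])
print(names)
```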
Getting into “GOODLIFE” database
• GoodLife witnesses enormous amounts of data being exchanged in
the following forms:
▫ Doctors’/ nurses’ notes in an electronic report
▫ Emails sharing information about consultations/ investigations
▫ Narrative portions of electronic medical records
▫ Investigative reports
▫ Chat rooms
Structured data
[Figure: characteristics of structured data. Legible labels: “Attributes in …” (truncated), “Data resided …” (truncated), “Definition, format and meaning of data is explicitly known”]
• Data coming from sources such as Access databases, OLTP systems, SQL databases,
Excel etc. is in a structured format
• Working with structured data is easy when it comes to storage,
scalability, security and update and delete operations
▫ Storage – both standard & user-defined data types can be used
▫ Scalability – not generally an issue with increase in data
▫ Security – ensuring security is easy
▫ Update and delete operations – easy due to structured format
• Hassle-free retrieval
▫ Retrieving information – well defined structure helps in easy retrieval of
data
▫ Indexing & searching – enables streamlined search
▫ Mining data – can be easily mined and knowledge can be extracted from it
▫ BI operations – works extremely well with structured data
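The problem of IR
[Figure: the information retrieval (IR) problem. An information need is expressed as a query; the IR system evaluates the query against a document collection and returns an answer list. Example: Google as the IR system, with the Web as the document collection.]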
2. Unstructured data
Dr.Sami, Dr.Raj & Dr.Rahul work at the medical facility of GoodLife. Over the past
few days, Dr.Sami & Dr.Raj had been exchanging long emails about a particular case of
gastro-intestinal problem. Dr.Raj has chanced upon a particular combination of drugs that has successfully
cured the disorder in his patients. He has written an email about this combination of drugs
to Dr.Sami.
Dr.Rahul has a patient with quite a similar case of the gastro-intestinal disorder whose
cure Dr.Raj has chanced upon. Dr.Rahul has already tried regular drugs but with no positive
results so far. He quickly searches the organization’s database for the process, but with no luck.
The information is tucked away in the email conversation between Dr.Sami & Dr.Raj.
Dr.Rahul would have accessed the process had the storage & analysis of unstructured data
been undertaken by GoodLife.
Dr.Raj’s email to Dr.Sami has not been successfully updated into the medical system database as it fell
in the unstructured format.
Characteristics of unstructured data
• Does not conform to any data model
• Cannot be stored in the form of rows & columns in a database
• Has no easily identifiable structure
• Does not follow any rules/ semantics
• Not in any particular format/ sequence
• Not easily usable by a program
• Unstructured data cannot be stored in the form of rows & columns and
hence it is difficult to determine the meaning of the data
• It does not follow any rules/ semantics. It can be of any type & hence is
unpredictable
• Unstructured data can be classified into 2 broad categories:
▫ Bitmap objects – image, video or audio files etc.
▫ Textual objects – Word documents, emails, Excel spreadsheets etc.
• Web pages are said to be unstructured data even though they are
defined by HTML, which has a rich structure
▫ HTML is solely used for rendering & presentations
▫ Web pages usually carry links & references to external unstructured content
such as images, XML files etc.
How to manage unstructured data
• A few generic tasks to be performed to enable storage & search of
unstructured data are:
▫ Indexing – an index is defined on the basis of some value in the data.
The index is an identifier that represents the larger record in the data set.
In the absence of an index, the whole data set has to be scanned to retrieve the desired
data
▫ Tags/ Metadata – using metadata, data in a document etc. can be tagged.
This enables search & retrieval
▫ Classification/ Taxonomy – taxonomy is classifying data on the basis of the
relationships that exist between data.
Data can be arranged in groups & placed in hierarchies based on the taxonomy
prevalent in an organization
▫ CAS (Content Addressable Storage) – stores data based on their metadata.
It assigns a unique name to every object stored in it.
The object is retrieved based on its content & not its location.
Used extensively to store emails etc.
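The content-addressing idea can be sketched in a few lines. This is a minimal in-memory illustration, assuming the unique name is a SHA-256 digest of the object's bytes (a common choice, but an assumption here); the class and method names are hypothetical and do not describe any particular CAS product.

```python
import hashlib


class ContentAddressableStore:
    """Toy store in which an object's name is derived from its content."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        # The address depends only on the content, not on where it is kept.
        address = hashlib.sha256(data).hexdigest()
        self._objects[address] = data
        return address

    def get(self, address: str) -> bytes:
        return self._objects[address]


store = ContentAddressableStore()
addr1 = store.put(b"Email: drug combination notes")
addr2 = store.put(b"Email: drug combination notes")  # identical content
addr3 = store.put(b"Email: meeting agenda")

print(addr1 == addr2)  # same content, same address
print(addr1 == addr3)  # different content, different address
```

Identical content always maps to the same name, so duplicates are stored once, which is one reason the approach suits archives of emails and similar fixed content.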
How to store unstructured data
• Challenges faced while storing unstructured data are:
▫ Storage space – a lot of space is required to store unstructured data. It is difficult
to store images, videos, audio etc.
▫ Scalability – as the data grows, scalability becomes an issue & the cost of
storing such data increases
▫ Retrieve information – difficult to retrieve & recover data
▫ Security – difficult due to varied sources of data, e.g., emails, web pages etc.
▫ Update & delete – very difficult as retrieval is difficult due to no clear
structure
▫ Indexing & searching – indexing unstructured data is difficult & error-prone
as the structure is not clear & attributes are not pre-defined.
As a result, search results are not very accurate.
Indexing becomes more difficult as the volume of the data grows
[Figure: challenges with unstructured data. Legible labels: Storage space; Scalability; Retrieve information; Interpretation; Tags; Indexing; Classification/ Taxonomy]
Possible solutions to challenges faced in extracting information
from unstructured data
[Figure: two-stage flow. Analysis: unstructured data such as chat, images, email etc. is acquired from various sources and subjected to semantic analysis, producing structured information. Delivery: structured information access, followed by query & presentation to users.]
3. Semi-structured data
Dr.Vishnu of the “GoodLife HealthCare” organization usually gets a
blood test done for migraine patients visiting her. It is her observation that
patients with migraine have a high platelet count. She makes a note of this in
the diagnosis & conclusion section in the blood test report of patients. One
day, another doctor, Dr.Mamatha, searches the database when he is unable
to find the cause of migraine in one of his patients, but with no luck! The
answer he is looking for is nestled in the vast hoards of data.
Dr.Vishnu’s blood test reports on patients were not successfully
updated into the medical system database as they were in the semi-structured
format.
Blood test report – Example for semi-structured data
GoodLife HealthCare
Blood Test Report
Date <>
Department <>    Attending Doctor <>
Patient Name <>    Patient Age <>
Hemoglobin Content <>
RBC Count <>
WBC Count <>
Platelet Count <>
Diagnosis <notes>
Conclusion <notes>
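The report form above mixes fixed fields with free-text notes. As a rough sketch, assuming each report is kept as a Python dictionary (the field names, values and helper functions are hypothetical), the fixed fields can be filtered by value while the notes need a text search:

```python
# Hypothetical reports; values are invented for illustration only.
reports = [
    {
        "patient_name": "Patient A",
        "platelet_count": 480000,
        "diagnosis": "Migraine; platelet count is high",
        "conclusion": "Monitor platelets",
    },
    {
        "patient_name": "Patient B",
        "platelet_count": 250000,
        "diagnosis": "Routine check",
        "conclusion": "No action",
    },
]


def filter_field(reports, field, predicate):
    """Select reports by a structured field, e.g. a numeric count."""
    return [r for r in reports if field in r and predicate(r[field])]


def search_notes(reports, keyword):
    """Select reports whose free-text notes mention the keyword."""
    keyword = keyword.lower()
    return [
        r for r in reports
        if any(keyword in r.get(f, "").lower()
               for f in ("diagnosis", "conclusion"))
    ]


high = filter_field(reports, "platelet_count", lambda v: v > 400000)
hits = search_notes(reports, "migraine")
print([r["patient_name"] for r in high])
print([r["patient_name"] for r in hits])
```

A database that stores only the fixed fields would answer the first query but miss the second, which is the situation Dr.Mamatha runs into.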
• It is important to understand, manage, and analyze semi-structured
data coming from heterogeneous sources
• Semi-structured data does not conform to any data model
• Data cannot be stored in rows & columns as in a database
• Semi-structured data, however, has tags & markers which help group
the data & describe how the data is stored, giving some metadata, but
they are not sufficient for management & automation of data
• In semi-structured data, similar entities are grouped
& organized in a hierarchy
• The attributes/ properties within a group
may/ may not be the same
[Figure: characteristics of semi-structured data. Does not conform to a data model but has tags & elements (metadata); data cannot be stored in the form of rows & columns; similar entities are grouped]
• The blood test report prepared
by Dr.Vishnu is semi-structured
• It has structured fields like Date,
Department, Patient Name etc.,
and unstructured fields like
Diagnosis, Conclusion etc.
• Another example of semi-structured
data are web pages
▫ These pages have content
embedded within HTML & often
have some degree of metadata
within tags
▫ This implies certain details of the
data being presented.
Example HTML page:
<HTML>
<HEAD>
...
<TABLE>
<TR>
<TH><I> header 1 </I></TH>
<TH><I> header 2 </I></TH>
<TH><I> header 3 </I></TH>
</TR>
<TR>
<TD> text 1 </TD>
<TD><A HREF=http://www.stuff/> text 2 </A></TD>
<TD> text 3 </TD>
</TR>
...
</TABLE>
...
</BODY>
</HTML>
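How tags act as metadata can be sketched with Python's standard `html.parser`. The class name, the sample page and its link target are assumptions for illustration; the page mirrors the shape of the table above.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect header cells, data cells and link targets from a page."""

    def __init__(self):
        super().__init__()
        self.headers = []
        self.cells = []
        self.links = []
        self._current = None  # "th" or "td" while inside such a cell

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lowercase.
        if tag in ("th", "td"):
            self._current = tag
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._current = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._current == "th":
            self.headers.append(text)
        elif self._current == "td":
            self.cells.append(text)


# Hypothetical page; the URL is a placeholder.
page = """
<html><body>
<table>
<tr><th><i> header 1 </i></th><th><i> header 2 </i></th><th><i> header 3 </i></th></tr>
<tr><td> text 1 </td><td><a href="http://example.invalid/"> text 2 </a></td><td> text 3 </td></tr>
</table>
</body></html>
"""
parser = TableExtractor()
parser.feed(page)
print(parser.headers)
print(parser.cells)
print(parser.links)
```

The tags tell a program which text is a header, which is a value and which is a link, but they say nothing about what "header 1" means, so the metadata is present yet insufficient for full automation.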
• Sources of semi-structured data
▫ Email, XML, TCP/ IP packets, Zipped files, Binary executables, Mark-up
languages, Integration of data from heterogeneous sources
• Characteristics of semi-structured data
▫ It is organized into semantic entities
▫ Similar entities are grouped together
▫ Entities in the same group may not have same attributes
▫ Order of attributes is not necessarily important
▫ Not all attributes are always required
▫ Size of the same attributes in a group may differ
▫ Type of the same attributes in a group may differ
• Integration of data from heterogeneous sources (e.g., RDBMS, OODBMS,
Structured file, Legacy system) leads to the data being semi-structured
Challenges for storing semi-structured data
• Storage cost
• RDBMS
• Irregular & partial structure
• Implicit structure
• Evolving schemas
Challenges for extracting semi-structured data
• Flat files
• Heterogeneous data
• Incomplete/ irregular structure
Possible solutions to challenges faced in extracting information
from semi-structured data