
Introduction To Data Science - UNIT-1 (Session-2) - Dr.R.Richards - 09-03-2025

The document provides an introduction to data science, covering its definition, evolution, types of data, and the importance of data wrangling. It explains the various types of data, including structured, unstructured, semi-structured, and metadata, and discusses the applications and benefits of data science in decision-making and predictive analytics. Additionally, it highlights the challenges in data processing and the significance of big data in the modern data landscape.

BCA 602 - Introduction to Data Science

Unit No. I
Introduction to Data Science
Data Evolution (Data to Data Science)

Centre for Distance and Online Education


Objectives
After completion of Unit I, you will be able to understand the following areas:
• Data to Data Science - Understanding data: Introduction - Types of Data
• Data Evolution - Data Sources. Preparing and gathering data and knowledge
• Philosophies of data science - data all around us: the virtual wilderness
• Data wrangling: from capture to domestication
• Data science in a big data world
• Benefits and uses of data science and big data - facets of data


Data Science - Definition
Data science is the area of study that extracts insights from vast amounts of data using scientific methods, algorithms, and processes. It helps you discover hidden patterns in raw data.

Data science is a multidisciplinary science whose objective is to perform data analysis that generates knowledge for decision making. This knowledge can take the form of similar patterns, predictive planning models, forecasting models, etc.

A data science application collects data and information from multiple heterogeneous sources; cleans, integrates, processes, and analyses this data using various tools; and presents the resulting information and knowledge in various visual forms.
Advantages:

• It helps in making business decisions, such as assessing the health of companies with which an organization plans to collaborate.
• It may help in making better predictions for the future, such as drawing up the company's strategic plans based on present trends.
• It may identify similarities among various data patterns, leading to applications like fraud detection and targeted marketing.

Types of Data in Data Science
Data and big data involve huge volume, high velocity, and an extensive variety of data.

This data is of four types:

1. Unstructured data: Word, PDF, text, images, audio, and video
2. Semi-structured data: XML data
3. Metadata: data about data
4. Structured data: relational data


Unstructured Big Data
Any data whose form or structure is unknown is classified as unstructured data. In addition to being huge in size, unstructured data poses multiple challenges when it is processed to derive value from it. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, audio, video, etc.
Semi-Structured data
Semi-structured data can contain both forms of data. It appears structured in form, but its structure is not actually defined in advance by, for example, a table definition as in a relational DBMS.
An example of semi-structured data is data represented in an XML file. Web pages written in HTML are also an example of semi-structured data.
Personal data stored in an XML file:
<rec><name>Amitav</name><gender>Male</gender><age>45</age></rec>
<rec><name>Sudipta</name><gender>Male</gender><age>17</age></rec>
<rec><name>Soumya</name><gender>Male</gender><age>15</age></rec>
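The personal records above can be read with Python's standard library. This is a minimal sketch, not part of the original slides. The <people> wrapper element is an assumption, added because a well-formed XML document needs a single root element.

```python
import xml.etree.ElementTree as ET

# The three records from the slide. The <people> root element is an
# assumption added here so that the text forms one well-formed document.
xml_text = """
<people>
  <rec><name>Amitav</name><gender>Male</gender><age>45</age></rec>
  <rec><name>Sudipta</name><gender>Male</gender><age>17</age></rec>
  <rec><name>Soumya</name><gender>Male</gender><age>15</age></rec>
</people>
"""

root = ET.fromstring(xml_text)

# Each <rec> carries its field names as tags, so no table definition is needed.
people = [
    {
        "name": rec.findtext("name"),
        "gender": rec.findtext("gender"),
        "age": int(rec.findtext("age")),
    }
    for rec in root.findall("rec")
]

# Once parsed, the records can be filtered like any other Python data.
adults = [person["name"] for person in people if person["age"] >= 18]
print(adults)
```

The tags describe each value, which is what gives the data its partial structure even though no schema is declared.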


Meta Data
Metadata is defined as the data providing information about one or more
aspects of the data. It is used to summarize basic information about data
which can make tracking and working with specific data easier.
There are three main types of metadata:
• Descriptive metadata describes a resource for identification. It can
include elements such as the title of a book, abstract, and keywords.
• Structural metadata indicates how compound objects are put
together, e.g. how pages are ordered to form chapters.
• Administrative metadata provides information to help manage a
resource, such as when and how it was created, file type and other
technical information, and who can access it.
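
As an illustration of administrative metadata, the sketch below reads a few facts about a file without looking at its content. It is an assumed example using only the Python standard library. The temporary file and the dictionary keys are hypothetical names chosen for this illustration.

```python
import os
import tempfile
import time

# Create a small throwaway file so that the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write("hello data science")
    path = handle.name

info = os.stat(path)

# Administrative metadata: facts about the file, not the text inside it.
metadata = {
    "file_type": os.path.splitext(path)[1],
    "size_bytes": info.st_size,
    "modified": time.ctime(info.st_mtime),
    "writable": os.access(path, os.W_OK),
}
print(metadata)

# Remove the throwaway file again.
os.remove(path)
```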



Metadata example (figure not reproduced)
Structured Data
Any data that can be stored, accessed, and processed in a fixed format is termed ‘structured’ data. In other words, it is any data that can be stored in an SQL database in the form of a table with rows and columns.


Semi-structured Data
As the name suggests, semi-structured data has some structure in it.
This structure comes from the use of tags or key/value pairs.
Common forms of semi-structured data include XML, JSON objects,
server logs, EDI data, etc.
<Book>
  <title>Data Science and Big Data</title>
  <author>R Raman</author>
  <author>C V Shekhar</author>
  <yearofpublication>2020</yearofpublication>
</Book>

{
  "Book": {
    "Title": "Data Science",
    "Price": 5000,
    "Year": 2020
  }
}
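
The key/value structure of JSON can be seen by loading the book record in Python. This is a minimal sketch using the standard json module. The outer braces and the added "Authors" key are assumptions made for illustration.

```python
import json

# The JSON fragment from the slide, wrapped in outer braces (an assumption)
# so that it is a complete JSON document.
text = '{"Book": {"Title": "Data Science", "Price": 5000, "Year": 2020}}'

record = json.loads(text)
book = record["Book"]

# Keys describe the values they hold. A new key can be added to one record
# without changing any other record, unlike a column in a fixed table.
book["Authors"] = ["R Raman", "C V Shekhar"]

print(book["Title"], book["Year"])
print(json.dumps(record, indent=2))
```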
Structured and Semi-structured Data (figures not reproduced)
Examples of Data Science

I. Examples of data science and its applications are everywhere. Data science has applications in everything from food delivery and sports to traffic and health. Data is everywhere, so data science can be applied to everything.

II. In terms of food, Uber is investing in Uber Eats, an expansion of its ride-sharing system focused on the delivery of food. Uber Eats needs to get people their food in a timely fashion, while it is still hot and fresh. For this to happen, the company's data scientists need to use statistical modeling that takes into account aspects like distance from restaurants to ...

Philosophies of data science - data all around us: the virtual wilderness

Data science is the extraction of knowledge from data. Simple enough, but that description doesn’t distinguish data science from the many other similar terms, except perhaps to claim that data science is an umbrella term for the whole lot. On the other hand, this era of data science has a property that no previous era had, and it is, to me, a fairly compelling reason to apply a new term to the types of things that data scientists do that previous applied statisticians and data-oriented software engineers did not. This reason helps me underscore an often-overlooked but very important aspect of data science.


• Measuring things in real life
• Measuring things online
• Scripting and web scraping
• Data-collection devices: Today’s concept of the Internet of Things gets considerable media buzz, partially for its value in creating data from physical devices, some of which are capable of recording the physical world (for example, cameras, thermometers, and gyroscopes).

• Log files or archives: Sometimes jargonized as a digital trail or exhaust, log files are (or can be) left behind by many software applications.


Data wrangling: from capture to domestication
In data science, "data wrangling" refers to the process of taking raw, unorganized data and transforming it into a clean, structured format that is suitable for analysis. It essentially "domesticates" the data by cleaning, organizing, and structuring it so that it is usable for further operations like modeling and visualization. This often involves capturing data from various sources, then cleaning it by handling missing values, inconsistencies, and duplicates, and finally reshaping it to fit the analysis needs.

Key points about data wrangling:

Goals: To prepare raw data for analysis by making it consistent, accurate, and accessible.


Steps involved:

1. Data capture: Gathering data from different sources, including databases, APIs, files, etc.
2. Data cleaning: Identifying and correcting errors like missing values, outliers, duplicates, and inconsistent formatting.
3. Data transformation: Reshaping data by combining datasets, splitting variables, aggregating data, or converting data types to fit analysis needs.
4. Data enrichment: Adding relevant information to the dataset to enhance analysis.
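
The four steps can be sketched on a tiny data set using only the Python standard library. This is an illustrative example under stated assumptions, not a prescribed method. The raw rows, the column names, and the CITY_TO_STATE lookup table are all made up.

```python
import csv
import io

# Hypothetical raw export; the column names and values are made up.
RAW = """name,city,age
 Asha ,pune,34
Ravi,Delhi,
Asha,Pune,34
Meena,DELHI,29
"""

# Hypothetical lookup table used for the enrichment step.
CITY_TO_STATE = {"Pune": "Maharashtra", "Delhi": "Delhi"}

# 1. Capture: read the raw rows (here from a string instead of a real file).
rows = list(csv.DictReader(io.StringIO(RAW)))

# 2. Clean: trim whitespace, normalise capitalisation, drop exact duplicates.
cleaned, seen = [], set()
for row in rows:
    record = (row["name"].strip(), row["city"].strip().title(), row["age"].strip())
    if record not in seen:
        seen.add(record)
        cleaned.append(record)

# 3. Transform: convert age to an integer, using None for missing values.
typed = [(name, city, int(age) if age else None) for name, city, age in cleaned]

# 4. Enrich: add the state for each city from the lookup table.
wrangled = [
    {"name": name, "city": city, "age": age, "state": CITY_TO_STATE.get(city)}
    for name, city, age in typed
]

for record in wrangled:
    print(record)
```

In practice a library such as pandas is often used for these steps; plain Python is used here only to keep the sketch self-contained.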
Why is data wrangling important?

• Accuracy of analysis: Clean and properly formatted data leads to more reliable and meaningful results from analysis.
• Efficiency: By preparing data upfront, data scientists can spend more time on analysis and modeling instead of struggling with messy data.
• Model performance: Quality data is crucial for training accurate machine learning models.

Common data wrangling challenges:

• Incomplete data: Dealing with missing values and finding ways to impute them.
• Inconsistent formatting: Standardizing data formats across different sources.
• Data quality issues: Identifying and correcting errors like typos or incorrect data types.
• Data integration: Combining data from multiple sources while ...
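
One simple way to impute missing values is to replace them with the mean of the observed values. The sketch below assumes a made-up list of sensor readings in which None marks a missing value. Mean imputation is only one of several possible strategies.

```python
from statistics import mean

# Hypothetical sensor readings; None marks a missing value.
readings = [21.5, None, 22.0, 23.5, None, 21.0]

# Mean imputation: replace each missing value with the mean of the observed ones.
observed = [value for value in readings if value is not None]
fill_value = mean(observed)
imputed = [fill_value if value is None else value for value in readings]

print(fill_value)
print(imputed)
```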
The data wrangling process typically involves these steps:

• Discovering
• Structuring
• Cleaning
• Enriching
• Validating

Tools for data wrangling:

• Programming languages: Python (with libraries like pandas, NumPy), R
• Data analysis platforms: Tableau, Power BI
• Data wrangling tools: Alteryx, Trifacta


Data science in a big data world:

• Big data is a blanket term for any collection of data sets so large or complex that it becomes difficult to process them using traditional data management techniques such as the RDBMS (relational database management system).

• The widely adopted RDBMS has long been regarded as a one-size-fits-all solution, but the demands of handling big data have shown otherwise.

• Data science involves using methods to analyze massive amounts of data and extract the knowledge it contains. You can think of the relationship between big data and data science as being like the relationship between crude oil and an oil refinery. Data science and big data evolved from statistics and traditional data management but are now considered to be distinct disciplines.


Benefits and uses of data science and big data - facets of data
Data science and big data can help businesses and organizations make better decisions, improve customer experience, and increase efficiency.
• Predictive analytics: Uses historical data and algorithms to make forecasts and predictions.
• Machine learning: Used to create predictive models and analyze data.
• Data visualization: Helps users understand complex data sets.
• Recommendation systems: Helps users find relevant information.
• Fraud detection: Helps identify fraudulent transactions.
• Sentiment analysis: Helps understand how people feel about a product or service.
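
Predictive analytics can be illustrated with a very small forecast: fit a straight line to historical values and extend it one step ahead. The sketch below assumes made-up monthly sales figures. The helper name fit_line is hypothetical, and a least-squares line is only one of many possible forecasting models.

```python
# Hypothetical monthly sales figures (units sold in months 1 to 6).
months = [1, 2, 3, 4, 5, 6]
sales = [100, 110, 120, 130, 140, 150]

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Use the historical trend to forecast the next month.
slope, intercept = fit_line(months, sales)
forecast_month_7 = slope * 7 + intercept
print(round(forecast_month_7, 2))
```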


Big data - Uses
• Improved decision-making: Helps organizations make data-driven decisions.
• Better customer experiences: Helps personalize customer experiences.
• More efficient operations: Helps streamline operations.
• Improved risk management: Helps mitigate risk and handle setbacks.
• Increased agility and innovation: Helps organizations respond to market demands.
• Data science and big data are used in many industries, including healthcare, finance, marketing, and technology.


Facets of data
In data science and big data you’ll come across many different types of data, and each of them tends to require different tools and techniques. The main categories of data are these:
• Structured
• Unstructured
• Natural language
• Machine-generated
• Graph-based
• Audio, video, and images
• Streaming


STRUCTURED DATA
Structured data is data that depends on a data model and resides in a fixed field within a record. As such, it’s often easy to store structured data in tables within databases or Excel files (figure 1.1). SQL, or Structured Query Language, is the preferred way to manage and query data that resides in databases. You may also come across structured data that might give you a hard time storing it in a traditional relational database. Hierarchical ...
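
A short sketch of structured data in a relational table, using Python's built-in sqlite3 module. The table name person and its columns are assumptions chosen for this illustration.

```python
import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")

# Hypothetical table: every record has the same fixed fields.
conn.execute("CREATE TABLE person (name TEXT, gender TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [("Amitav", "Male", 45), ("Sudipta", "Male", 17), ("Soumya", "Male", 15)],
)

# Because the structure is fixed, SQL can filter and aggregate directly.
minors = conn.execute(
    "SELECT name FROM person WHERE age < 18 ORDER BY name"
).fetchall()
average_age = conn.execute("SELECT AVG(age) FROM person").fetchone()[0]

print(minors)
print(average_age)
conn.close()
```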
UNSTRUCTURED DATA
Unstructured data is data that isn’t easy to fit into a data model because the content is context-specific or varying. One example of unstructured data is your regular email (figure 1.2). Although email contains structured elements such as the sender, title, and body text, it’s a challenge to find the number of people who have written an email complaint about a specific employee because so many ways exist to refer to a person, for example. The thousands of different languages and dialects out there further complicate this.
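
The email example can be made concrete with Python's standard email package. The sketch assumes a made-up message. It shows that the header fields are easy to read by name, while a naive search of the body misses references to the same employee.

```python
from email import message_from_string

# Hypothetical message; the addresses and names are made up.
raw = """From: customer@example.com
To: support@example.com
Subject: Complaint

I spoke to Mr. R. Sharma yesterday. Rajesh was rude, and Sharma-ji
never called me back.
"""

message = message_from_string(raw)

# Structured part: header fields can be read by name.
sender = message["From"]
subject = message["Subject"]

# Unstructured part: the body is free text. An exact-match search for one
# spelling of the employee's name finds nothing, although a human reader
# sees three references to the same person.
body = message.get_payload()
exact_matches = body.count("Rajesh Sharma")

print(sender, subject, exact_matches)
```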
NATURAL LANGUAGE
• Natural language is a special type of unstructured data; it’s challenging to process because it requires knowledge of specific data science techniques and linguistics.

• The natural language processing community has had success in entity recognition, topic recognition, summarization, text completion, and sentiment analysis, but models trained in one domain don’t generalize well to other domains. Even state-of-the-art techniques aren’t able to decipher the meaning of every piece of text. This shouldn’t be a surprise though: humans struggle with natural language as well. It’s ambiguous by nature. The concept of meaning itself is questionable here. Have two people listen to the same conversation. Will they get the same meaning? The meaning of the same words can vary when coming from someone upset or joyous.
MACHINE-GENERATED DATA
Machine-generated data is information that’s automatically created by a computer, process, application, or other machine without human intervention. Machine-generated data is becoming a major data resource and will continue to do so. Wikibon has forecast that the market value of the industrial Internet (a term coined by Frost & Sullivan to refer to the integration of complex physical machinery with networked sensors and software) will be approximately $540 billion in 2020. IDC (International Data Corporation) has estimated there will be 26 times more connected things than people in 2020. This network is commonly referred to as the internet of things.
The analysis of machine data relies on highly scalable tools, due to its high volume and speed. Examples of machine data are web server logs, call detail records, network event logs, and telemetry.
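
Web server logs are a typical example of machine data that must be parsed before analysis. The sketch below assumes three made-up lines in the widely used Common Log Format. The regular expression is a simplification that covers only these fields.

```python
import re
from collections import Counter

# Hypothetical lines in the Common Log Format.
log_lines = [
    '192.0.2.10 - - [09/Mar/2025:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 5120',
    '192.0.2.11 - - [09/Mar/2025:10:15:33 +0000] "GET /missing HTTP/1.1" 404 312',
    '192.0.2.10 - - [09/Mar/2025:10:15:35 +0000] "POST /login HTTP/1.1" 200 128',
]

# Simplified pattern: client address, timestamp, request line, status, size.
PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+)'
)

# Turn each raw line into a dictionary of named fields; skip non-matching lines.
events = [m.groupdict() for line in log_lines if (m := PATTERN.match(line))]

# Count how often each HTTP status code occurs.
status_counts = Counter(event["status"] for event in events)
print(status_counts)
```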
GRAPH-BASED OR NETWORK DATA
“Graph data” can be a confusing term because any data can be shown in a graph. “Graph” in this case points to mathematical graph theory. In graph theory, a graph is a mathematical structure to model pair-wise relationships between objects. Graph or network data is, in short, data that focuses on the relationship or adjacency of objects. The graph structures use nodes, edges, and properties to represent and store graphical data. Graph-based data is a natural way to represent social networks, and its structure allows you to calculate specific metrics such as the influence of a person and the shortest path between two people.
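
The shortest path between two people can be found with a breadth-first search over the graph. The sketch below assumes a small made-up social network stored as an adjacency list. The number of direct connections is used as a very simple stand-in for influence.

```python
from collections import deque

# Hypothetical social network stored as an adjacency list: each person (node)
# maps to the people they are directly connected to (edges).
friends = {
    "Asha": ["Ravi", "Meena"],
    "Ravi": ["Asha", "Kiran"],
    "Meena": ["Asha", "Kiran"],
    "Kiran": ["Ravi", "Meena", "Dev"],
    "Dev": ["Kiran"],
}

def shortest_path(graph, start, goal):
    """Breadth-first search; returns a shortest list of nodes, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph[node]:
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None

path = shortest_path(friends, "Asha", "Dev")

# Degree (number of direct connections) as a very simple influence measure.
most_connected = max(friends, key=lambda person: len(friends[person]))
print(path, most_connected)
```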
AUDIO, IMAGE, AND VIDEO
Audio, image, and video are data types that pose specific challenges to a data scientist. Tasks that are trivial for humans, such as recognizing objects in pictures, turn out to be challenging for computers. MLBAM (Major League Baseball Advanced Media) announced in 2014 that they’ll increase video capture to approximately 7 TB per game for the purpose of live, in-game analytics. High-speed cameras at stadiums will capture ball and athlete movements to calculate in real time, for example, the path taken by a defender relative to two baselines.

STREAMING DATA
While streaming data can take almost any of the previous forms, it has an extra property. The data flows into the system when an event happens instead of being loaded into a data store in a batch. Although this isn’t really a different type of data, we treat it here as such because you need to adapt your process to deal with ...
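
The difference between batch and streaming processing can be sketched with Python generators. The function names event_stream and running_mean and the temperature values are assumptions made for illustration. The statistic is updated as each event arrives instead of after loading the whole data set.

```python
def event_stream():
    """Hypothetical source that yields temperature readings one at a time."""
    for value in [20.0, 22.0, 21.0, 25.0]:
        yield value

def running_mean(stream):
    """Yield the mean of all values seen so far, updated per event."""
    count = 0
    total = 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count

# running_mean keeps only a count and a total, never the full input.
# The results are collected into a list here only so they can be printed.
means = list(running_mean(event_stream()))
print(means)
```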
Summary
• Data to Data Science - Understanding data: Introduction - Types of Data
• Data Evolution - Data Sources. Preparing and gathering data and knowledge
• Philosophies of data science - data all around us: the virtual wilderness
• Data wrangling: from capture to domestication
• Data science in a big data world
• Benefits and uses of data science and big data - facets of data
Additional Resources
Video Links:
https://www.youtube.com/watch?v=lSwIe0TMUhc

https://www.youtube.com/watch?v=KxryzSO1Fjs

Tutorial Lesson Web site URL:

https://www.geeksforgeeks.org/introduction-to-data-science/

https://www.tutorialspoint.com/data_science/index.htm

https://www.tpointtech.com/data-science


Any Questions?


Thank You!