1. Data Science

The document provides an overview of data science and big data, highlighting their definitions, characteristics, and the processes involved in managing and analyzing large datasets. It discusses the exponential growth of data, the challenges associated with big data, and the various types of data, including structured, unstructured, and streaming data. Additionally, it outlines the benefits of big data in various industries and the technologies used for operational and analytical purposes.

Data Science

Tushar B. Kute,
http://tusharkute.com
Objectives

• Defining data science and big data
• Recognizing the different types of data
• Gaining insight into the data science process
Data All Around

• Lots of data is being collected and warehoused
– Web data, e-commerce
– Financial transactions, bank/credit transactions
– Online trading and purchasing
– Social networks
– Cloud
Data and Big Data

• “90% of the world’s data was generated in the last few years.”
• Due to the advent of new technologies, devices, and communication means such as social networking sites, the amount of data produced by mankind is growing rapidly every year.
• The amount of data produced from the beginning of time until 2003 was 5 billion gigabytes. Piled up in the form of disks, it would fill an entire football field.
• The same amount was created every two days in 2011, and every six minutes in 2016. This rate is still growing enormously.
Big Data Definition

• No single standard definition…

“Big Data” is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it…
What is Big Data

• Big Data is a collection of large datasets that cannot be processed using traditional computing techniques.
• It is not a single technique or tool; rather, it involves many areas of business and technology.
Big Data

• Big Data is any data that is expensive to manage and hard to extract value from
– Volume
• The size of the data
– Velocity
• The latency of data processing relative to the growing demand for interactivity
– Variety and Complexity
• The diversity of sources, formats, quality, and structures
Big Data
Characteristics of Big Data: Volume

• Data Volume
• 44x increase from 2009 to 2020
• From 0.8 zettabytes (ZB) to 35 ZB
• Data volume is increasing exponentially

Exponential increase in collected/generated data
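
The two figures on this slide are enough to estimate how fast the volume grows per year. The following is a minimal sketch that assumes smooth compound growth between 2009 and 2020, which the slide does not state; the variable names are illustrative.

```python
import math

# Figures quoted on the slide: 0.8 ZB in 2009 and 35 ZB in 2020.
start_zb, end_zb = 0.8, 35.0
years = 2020 - 2009

# Overall multiple between the two years (the slide rounds this to 44x).
growth_factor = end_zb / start_zb

# Compound annual growth rate, assuming the same percentage growth every year.
annual_rate = growth_factor ** (1 / years) - 1

# Time needed for the volume to double at that rate.
doubling_years = math.log(2) / math.log(1 + annual_rate)

print(f"overall growth: {growth_factor:.2f}x")
print(f"implied annual growth: {annual_rate:.1%}")
print(f"implied doubling time: {doubling_years:.1f} years")
```

Under that assumption the data volume grows by roughly 40% per year and doubles about every two years, which is what "exponential" means here.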
Computer Memory Units
Characteristics of Big Data: Variety

• Various formats, types, and structures
• Text, numerical, images, audio, video, sequences, time series, social media data, multi-dimensional arrays, etc.
• Static data vs. streaming data
• A single application can be generating/collecting many types of data

To extract knowledge, all these types of data need to be linked together
Characteristics of Big Data: Velocity

• Data is being generated fast and needs to be processed fast
• Online Data Analytics
• Late decisions mean missed opportunities
• Examples
• E-Promotions: Based on your current location, your purchase history, and what you like, send promotions right now for the store next to you.
• Healthcare monitoring: Sensors monitor your activities and body; any abnormal measurements require immediate reaction.
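
The healthcare example can be sketched as a check that runs on every reading as it arrives, instead of on a stored batch. This is only an illustration: the `monitor` function, the heart-rate readings, and the 50 to 120 range are assumptions made up for the example.

```python
def monitor(readings, low=50, high=120):
    """Yield (timestamp, value) as soon as a reading falls outside [low, high]."""
    for timestamp, bpm in readings:
        if bpm < low or bpm > high:
            # React immediately; the rest of the stream has not been seen yet.
            yield (timestamp, bpm)


# Hypothetical (second, beats-per-minute) readings from a wearable sensor.
readings = [(0, 72), (1, 75), (2, 131), (3, 78), (4, 44)]

alerts = list(monitor(readings))
print(alerts)
```

Because `monitor` is a generator, each alert is produced when its reading is consumed, which mirrors the "process fast" requirement on the slide.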
Big Data: 3 Vs
Big Data: The 4th V
What Comes Under Big Data?

• Black Box Data: The black box is a component of helicopters, airplanes, jets, etc. It captures the voices of the flight crew, recordings of microphones and earphones, and the performance information of the aircraft.
• Social Media Data: Social media such as Facebook and Twitter hold information and views posted by millions of people across the globe.
• Stock Exchange Data: Stock exchange data holds information about the ‘buy’ and ‘sell’ decisions that customers make on shares of different companies.
• Power Grid Data: Power grid data holds information about the power consumed by a particular node with respect to a base station.
What Comes Under Big Data?

• Transport Data: Transport data includes the model, capacity, distance, and availability of a vehicle.
• Search Engine Data: Search engines retrieve lots of data from different databases.
• Structured data: Relational data.
• Semi-structured data: XML data.
• Unstructured data: Word, PDF, text, media logs.
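
To make the last three categories concrete, here is a small sketch that holds the same fact (an order amount) in structured, semi-structured, and unstructured form. The order record, the XML layout, and the note text are invented for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Structured: fixed, named fields, like one row of a relational table.
row = {"order_id": 1001, "customer": "Asha", "amount": 250.0}

# Semi-structured: tags describe the fields, but records may differ in shape.
xml_text = "<order id='1001'><customer>Asha</customer><amount>250.0</amount></order>"
order = ET.fromstring(xml_text)
xml_amount = float(order.findtext("amount"))

# Unstructured: free text; the field has to be guessed with a pattern.
note = "Asha called about order 1001 and paid 250.0 rupees in cash."
match = re.search(r"paid (\d+(?:\.\d+)?)", note)
text_amount = float(match.group(1)) if match else None

print(row["amount"], xml_amount, text_amount)
```

The structured value is read by name, the XML value by tag, and the text value only by a fragile pattern that breaks as soon as the wording changes.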
Benefits of Big Data

• Using the information kept in social networks like Facebook, marketing agencies are learning about the response to their campaigns, promotions, and other advertising mediums.
• Using information from social media, such as the preferences and product perception of their consumers, product companies and retail organizations are planning their production.
• Using data about the previous medical history of patients, hospitals are providing better and quicker service.
Big Data Technologies

• Operational Big Data
• Analytical Big Data
Operational Big Data

• These include systems like MongoDB that provide operational capabilities for real-time, interactive workloads where data is primarily captured and stored.
• NoSQL Big Data systems are designed to take advantage of new cloud computing architectures that have emerged over the past decade to allow massive computations to be run inexpensively and efficiently.
Analytical Big Data

• These include systems like Massively Parallel Processing (MPP) database systems and MapReduce that provide analytical capabilities for retrospective and complex analysis that may touch most or all of the data.
• MapReduce provides a new method of analyzing data that is complementary to the capabilities provided by SQL, and a system based on MapReduce can be scaled up from a single server to thousands of high- and low-end machines.
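
The MapReduce idea can be sketched on a single machine with a word count, the usual introductory example. This is not a distributed implementation and does not use any real framework's API; the function names `map_phase`, `shuffle_phase`, and `reduce_phase` are chosen here only to label the three steps.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


documents = ["big data is big", "data science uses data"]

# Each document could be mapped on a different machine; here it is a plain loop.
pairs = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle_phase(pairs))
print(word_counts)
```

Because every map call looks at one document only and every reduce call at one key only, the work can be spread over many machines, which is where the scalability mentioned on the slide comes from.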
Who generates Big Data?

• Social media and networks (all of us are generating data)
• Scientific instruments (collecting all sorts of data)
• Sensor technology and networks (measuring all kinds of data)
• Mobile devices (tracking all objects all the time)
Big Data generation models

• The model of generating/consuming data has changed

Old Model: A few companies generate data; all others consume data

New Model: All of us generate data, and all of us consume data
Challenges in Big Data

• The major challenges associated with big data are as follows:
– Capturing data
– Curation
– Storage
– Searching
– Sharing
– Transfer
– Analysis
– Presentation
Types of Data

• Relational Data (Tables/Transaction/Legacy Data)
• Text Data (Web)
• Semi-structured Data (XML)
• Graph Data
– Social Network, Semantic Web (RDF), …
• Streaming Data
What to do with this data?

• Aggregation and Statistics
– Data warehousing and OLAP
• Indexing, Searching, and Querying
– Keyword based search
– Pattern matching (XML/RDF)
• Knowledge discovery
– Data Mining
– Statistical Modeling
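
As one illustration of the "Indexing, Searching, and Querying" item, keyword-based search is commonly built on an inverted index that maps each word to the documents containing it. The sketch below uses three made-up documents and a hypothetical `search` helper.

```python
from collections import defaultdict

# Hypothetical document collection: id -> text.
documents = {
    1: "big data needs new tools",
    2: "data science uses statistics",
    3: "streaming data arrives continuously",
}

# Inverted index: word -> set of ids of the documents that contain it.
index = defaultdict(set)
for doc_id, text in documents.items():
    for word in text.lower().split():
        index[word].add(doc_id)


def search(*keywords):
    """Return the sorted ids of documents that contain every keyword."""
    sets = [index.get(keyword.lower(), set()) for keyword in keywords]
    return sorted(set.intersection(*sets)) if sets else []


print(search("data"))
print(search("data", "science"))
```

The index is built once, so each query only intersects a few small sets instead of scanning every document.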
What is Data Science?

• An area that manages, manipulates, extracts, and interprets knowledge from tremendous amounts of data.
• Data science (DS) is a multidisciplinary field of study whose goal is to address the challenges in big data.
• Data science principles apply to all data, big and small.
What is Data Science?

• Theories and techniques from many fields and disciplines are used to investigate and analyze large amounts of data to help decision makers in many areas such as science, engineering, economics, politics, finance, and education
– Computer Science
• Pattern recognition, visualization, data warehousing, high-performance computing, databases, AI
– Mathematics
• Mathematical modeling
– Statistics
• Statistical and stochastic modeling, probability
Data Science Disciplines
Real Life Examples

• Internet Search
• Digital Advertisements (Targeted Advertising and re-targeting)
• Recommender Systems
• Image Recognition
• Speech Recognition
• Gaming
• Price Comparison Websites
• Airline Route Planning
• Fraud and Risk Detection
• Delivery logistics
Internet Search
Targeting Advertisement
Recommender System
Image Recognition
Speech Recognition
Computer Games
Price Comparison Website
Airline Route Planning
Fraud Detection
Delivery Logistics
Facets of Data

• In data science and big data you’ll come across many different types of data, and each of them tends to require different tools and techniques. The main categories of data are these:
– Structured
– Unstructured
– Natural language
– Machine-generated
– Graph-based
– Audio, video, and images
– Streaming
Structured Data

• Structured data is data that depends on a data model and resides in a fixed field within a record.
• As such, it’s often easy to store structured data in tables within databases or Excel files. SQL, or Structured Query Language, is the preferred way to manage and query data that resides in databases.
• You may also come across structured data that is hard to store in a traditional relational database.
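
A short sketch of structured data being stored in a table and queried with SQL, using Python's built-in `sqlite3` module and an in-memory database. The `employees` table and its rows are invented for the example.

```python
import sqlite3

# In-memory database, so the example leaves no file behind.
conn = sqlite3.connect(":memory:")

# Every record has the same fixed fields: this is what makes the data structured.
conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, dept TEXT, salary REAL)"
)
conn.executemany(
    "INSERT INTO employees (name, dept, salary) VALUES (?, ?, ?)",
    [("Asha", "Sales", 52000), ("Ravi", "Sales", 48000), ("Meera", "IT", 61000)],
)

# SQL query: average salary per department.
avg_by_dept = conn.execute(
    "SELECT dept, AVG(salary) FROM employees GROUP BY dept ORDER BY dept"
).fetchall()
conn.close()

print(avg_by_dept)
```

The query names the fields it needs and the database does the rest, which is only possible because each value sits in a known column.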
Structured Data
Unstructured Data

• Unstructured data is data that isn’t easy to fit into a data model because the content is context-specific or varying. One example of unstructured data is your regular email.
• Although email contains structured elements such as the sender, title, and body text, it’s a challenge to find the number of people who have written an email complaint about a specific employee because so many ways exist to refer to a person, for example.
• The thousands of different languages and dialects out there further complicate this.
• A human-written email, as shown in the next figure, is also a perfect example of natural language data.
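
The email point can be illustrated with Python's standard `email` package: the headers parse cleanly, while the body is free text in which one person is referred to in several ways. The message, the addresses, and the employee name below are made up for the sketch.

```python
import re
from email import message_from_string

# Hypothetical complaint email; the addresses use the reserved example.com domain.
raw = (
    "From: customer@example.com\n"
    "To: support@example.com\n"
    "Subject: Complaint about service\n"
    "\n"
    "I was treated rudely by Mr. Sharma at the front desk.\n"
    "Later R. Sharma ignored my request, and Rajesh never called back.\n"
)

msg = message_from_string(raw)

# Structured part: headers are addressable by name.
sender, subject = msg["From"], msg["Subject"]

# Unstructured part: the body is just a string.
body = msg.get_payload()

# Searching for the full name misses every mention in this body.
exact_mentions = body.count("Rajesh Sharma")

# A looser pattern finds more, but would also match other people named Sharma.
loose_mentions = len(re.findall(r"\b(?:Sharma|Rajesh)\b", body))

print(sender, subject, exact_mentions, loose_mentions)
```

Neither count is reliable on its own, which is the difficulty the slide describes.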
Unstructured Data
Natural Language

• Natural language is a special type of unstructured data; it’s challenging to process because it requires knowledge of specific data science techniques and linguistics.
• The natural language processing community has had success in entity recognition, topic recognition, summarization, text completion, and sentiment analysis, but models trained in one domain don’t generalize well to other domains.
• Even state-of-the-art techniques aren’t able to decipher the meaning of every piece of text. This shouldn’t be a surprise though: humans struggle with natural language as well. It’s ambiguous by nature.
Machine Generated Data

• Machine-generated data is information that’s automatically created by a computer, process, application, or other machine without human intervention.
• Machine-generated data is becoming a major data resource and will continue to do so. Wikibon has forecast that the market value of the industrial Internet (a term coined by Frost & Sullivan to refer to the integration of complex physical machinery with networked sensors and software) will be approximately $540 billion in 2020.
• IDC (International Data Corporation) has estimated there will be 26 times more connected things than people in 2020. This network is commonly referred to as the internet of things.
Machine Generated Data
Graph or Network Data

• “Graph data” can be a confusing term because any data can be shown in a graph.
• “Graph” in this case points to mathematical graph theory. In graph theory, a graph is a mathematical structure to model pair-wise relationships between objects. Graph or network data is, in short, data that focuses on the relationship or adjacency of objects.
• The graph structures use nodes, edges, and properties to represent and store graphical data. Graph-based data is a natural way to represent social networks, and its structure allows you to calculate specific metrics such as the influence of a person and the shortest path between two people.
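
The two metrics mentioned above can be sketched on a tiny graph. The example assumes an undirected friendship graph stored as an adjacency list, uses the number of direct connections (degree) as a very rough stand-in for influence, and finds a shortest path with breadth-first search. The names and the graph are invented.

```python
from collections import deque

# Hypothetical friendship graph: person -> list of direct friends.
graph = {
    "Ann": ["Bob", "Cara"],
    "Bob": ["Ann", "Dev"],
    "Cara": ["Ann", "Dev"],
    "Dev": ["Bob", "Cara", "Eli"],
    "Eli": ["Dev"],
}


def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path as a list, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Degree: how many direct connections each person has.
degree = {person: len(friends) for person, friends in graph.items()}
most_connected = max(degree, key=degree.get)

path = shortest_path(graph, "Ann", "Eli")
print(most_connected, path)
```

Real influence measures are more elaborate than degree, but they are computed from the same node-and-edge structure.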
Graph or Network Data

• Examples of graph-based data can be found on many social media websites. For instance, on LinkedIn you can see who you know at which company.
• Your follower list on Twitter is another example of graph-based data. The power and sophistication come from multiple, overlapping graphs of the same nodes. For example, imagine the connecting edges here to show “friends” on Facebook.
• Imagine another graph with the same people which connects business colleagues via LinkedIn.
• Imagine a third graph based on movie interests on Netflix. Overlapping the three different-looking graphs makes more interesting questions possible.
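
Overlapping graphs over the same people can be sketched with plain set operations, treating each graph as a set of unordered pairs. The three edge sets below are invented and only stand in for the friends, colleagues, and movie-interest graphs described above.

```python
# Each edge is an unordered pair of people, so frozenset is used as the edge type.
friends = {frozenset(p) for p in [("Ann", "Bob"), ("Ann", "Cara"), ("Bob", "Dev")]}
colleagues = {frozenset(p) for p in [("Ann", "Bob"), ("Cara", "Dev")]}
movie_taste = {frozenset(p) for p in [("Ann", "Bob"), ("Bob", "Dev"), ("Cara", "Dev")]}

# Pairs connected in all three graphs.
in_all_three = friends & colleagues & movie_taste

# Friends who share movie interests but do not work together.
friends_movies_not_work = (friends & movie_taste) - colleagues

print(sorted(sorted(edge) for edge in in_all_three))
print(sorted(sorted(edge) for edge in friends_movies_not_work))
```

Questions like the second one cannot be answered from any single graph, which is why combining them is more informative.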
Graph or Network Data
Audio, Video and Image

• Audio, image, and video are data types that pose specific
challenges to a data scientist.
• Tasks that are trivial for humans, such as recognizing
objects in pictures, turn out to be challenging for
computers. MLBAM (Major League Baseball Advanced
Media) announced in 2014 that they’ll increase video
capture to approximately 7 TB per game for the purpose
of live, in-game analytics.
• High-speed cameras at stadiums will capture ball and
athlete movements to calculate in real time, for example,
the path taken by a defender relative to two baselines.
Audio, Video and Image

• Recently a company called DeepMind succeeded at creating an algorithm that’s capable of learning how to play video games.
• This algorithm takes the video screen as input and learns to interpret everything via a complex process of deep learning. It’s a remarkable feat that prompted Google to buy the company for their own Artificial Intelligence (AI) development plans.
• The learning algorithm takes in data as it’s produced by the computer game; it’s streaming data.
Streaming Data

• While streaming data can take almost any of the previous forms, it has an extra property.
• The data flows into the system when an event happens instead of being loaded into a data store in a batch.
• Although this isn’t really a different type of data, we treat it here as such because you need to adapt your process to deal with this type of information.
• Examples are the “What’s trending” on Twitter, live sporting or music events, and the stock market.
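
A rough sketch of the "What's trending" example: counts are updated as each event arrives, and a current ranking is available after every event rather than after a batch load. The hashtag list and the function names are hypothetical stand-ins for a live feed.

```python
from collections import Counter


def hashtag_stream():
    """Stand-in for a live feed: yields one event at a time."""
    for tag in ["#ipl", "#budget", "#ipl", "#monsoon", "#ipl", "#budget"]:
        yield tag


def trending(stream, top_n=2):
    """Update counts per event and yield the current top tags after each one."""
    counts = Counter()
    for tag in stream:
        counts[tag] += 1
        yield counts.most_common(top_n)


# Collect the ranking after every event; a real consumer would act on each one.
snapshots = list(trending(hashtag_stream()))
latest = snapshots[-1]
print(latest)
```

The state kept between events is only the counter, so the process never needs the full history loaded at once.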
Thank you
This presentation was created using LibreOffice Impress 5.1.6.2 and can be used freely as per the GNU General Public License.

/mITuSkillologies  @mitu_group  /company/mitu-skillologies  MITUSkillologies

Web Resources
https://mitu.co.in
http://tusharkute.com
