
SAPTHAGIRI NPS UNIVERSITY – BANGALORE

School of Engineering & Technology (SOET)


Course Title: Fundamentals of Data Science
Course-Code: 24BTELY107 Semester: 1
A.Y: 2024-25 Module-1-PPT

Dr. Prof. Rajasekharaiah K.M.


BE-CSE M.Tech-CSE PhD-CSE M.Phil-CS
PGDIT LM-ISTE-New Delhi
Professor – CSE Dept.
School of Computer Science & Engineering
(SCSE)
Fundamentals of Data Science
Module-1 8 hrs
• Introduction To Data Science
Definition—Big Data and Data Science Hype—
Datafication—Data Science Profile—Meta Data—
Definition—Data Scientist—Statistical Inference—
Populations and Samples—Populations and Samples of
Big Data—Modeling-Data Warehouse—Philosophy of
Exploratory Data Analysis—The Data Science Process—A
Data Scientist’s Role in this Process - Case Study: Real
Direct—Housing Market Analysis
Fundamentals of DS
Introduction To Data Science :
Today’s world runs on Big Data: organizations
store gigabytes to petabytes of data. Hence the
era of Big Data emerged, and storing data using
scientific methods has become inevitable. This
problem of data storage has largely been solved
by using the HADOOP framework (in wide use
after 2010).
Introduction To Data Science :
• Over the past few years, there’s been a lot of hype in the
media about “data science” and “Big Data.” Today, “Data
rules the world”. This has resulted in a huge demand for
Data Scientists.
• A Data Scientist helps companies with data-driven decisions,
to make their business better. Data science is a field that
deals with unstructured, structured data, and semi-
structured data. It involves practices like data cleansing, data
preparation, data analysis, and much more.
• Data science combines knowledge of statistics,
mathematics, and programming for problem-solving,
along with capturing data in ingenious ways. This
umbrella term (DS) includes various techniques that
are used when extracting insights and information
from data.
What is Data Science? (10)
1. Data science is a multidisciplinary field drawing on
knowledge of –
• Mathematics
• Statistics
• Computer Science
• Domain knowledge
All of the above, together with domain knowledge and
practical skills, is required to extract insights from data
(databases, Hadoop, Big Data).
2. Data Science blends (in hybrid mode) various
tools, algorithms, and machine learning principles.
What is Data Science?
3. Simply, it involves obtaining meaningful
information or insights from structured or
unstructured data through a combination of
analysis, programming, and business skills.
4. Data Science is a combination of
multiple disciplines that uses statistics,
data analysis, and machine learning to
analyze data and to extract knowledge
and insights from it.
What is Data Science?
5. Data Science is about data gathering, analysis,
extraction, and decision-making.
Data Science is about finding patterns in data through
analysis and making future predictions.
By using Data Science, companies are able to make:
• Better decisions (should we choose A or B)
• Predictive analysis (what will happen next?)
• Pattern discoveries (find pattern, or maybe hidden
information in the data)
What is Data Science?
6. DS is a field containing many elements like
mathematics, statistics, computer science, etc. Those
who are good at these respective fields, with enough
knowledge of the domain in which they are willing to
work, can call themselves Data Scientists.
7. It is not an easy thing to do, but it is not impossible
either. You need to start from data and its
visualization, then move to programming, formulation,
development, and deployment of your model.
8. In the future, there will be great demand for data
scientist jobs. Keeping that in mind, be ready to
prepare yourself to fit into this world.
What is Data Science?
9. Data Science is a deep study of a large amount
of data (Big Data), which involves extracting
meaningful insights from raw, structured, and
unstructured data that is processed using the
scientific method, different tools, technologies,
and algorithms.
10. Data science uses the most powerful
hardware, programming (Python or R language),
and the most efficient algorithms to solve data-
related problems. It is the future of artificial
intelligence.
What is Data Science?
Definition
Best Definition:
Data science is the study of data to extract
meaningful insights for business. It is a
multidisciplinary approach that combines
principles and practices from the fields of
mathematics, statistics, artificial intelligence,
and computer engineering to analyze large
amounts of data. (Definition from AWS –
Amazon Web Services.)
Need for Data Science
• Data Science is used in many industries in the world
today, e.g. banking, consultancy, healthcare, and
manufacturing, and also in education, politics, etc.
• For route planning: to discover the best routes to
ship goods
• To foresee delays for flights/ships/trains etc. (through
predictive analysis)
• To create promotional offers in the business sector
• To find the best-suited time to deliver goods
• To forecast next year’s revenue for a company
• To analyze the health benefits of training
• To predict who will win elections (politics)
Why is Data Science important?
• Data science is important because it combines
tools, methods, and technology to generate
meaning from data.
• Modern organizations are inundated with data;
there is a proliferation (rapid increase) of
devices that can automatically collect and store
information.
• Online systems and payment portals capture
more data in the fields of e-commerce,
medicine, finance, and every other aspect of
human life. We have text, audio, video, and
image data available in vast quantities.
Future of Data Science?
Artificial intelligence and machine
learning innovations have made data
processing faster and more efficient. Industry
demand has created an ecosystem of courses,
degrees, and job positions within the field of
data science. Because of the cross-functional
skillset and expertise required, data science
shows strong projected growth over the
coming decades.
Where is Data Science Needed?
Applications
Data Science is used in many industries in the world today,
e.g. banking, consultancy, healthcare, and manufacturing.
Examples of where Data Science is needed:
• For route planning: to discover the best routes to ship goods
• To foresee delays for flights/ships/trains etc. (through
predictive analysis)
• To create promotional offers
• To find the best-suited time to deliver goods
• To forecast next year’s revenue for a company
• To analyze the health benefits of training
• To predict who will win elections
DS – Applications (12)
1. HealthCare
2. Consumer goods – Malls & Marts
3. Finance & Stock Markets
4. Industry
5. Logistics (Transportation)
6. Political Field
7. Air-Line Route Planning
DS – Applications
8. E-commerce (On-Line Business)
9. Pattern Recognition
10. Speech Recognition
11. Internet Search
12. Video Games
How Does a Data Scientist Work?
A Data Scientist requires expertise in several
backgrounds:
• Machine Learning
• Statistics
• Programming (Python or R)
• Mathematics
• Databases
• A Data Scientist must find patterns within the data.
Before he/she can find the patterns, he/she must
organize the data in a standard format.
Data Science - Conclusions
Data Science is a multidisciplinary field that
uses scientific methods, algorithms, and
computer systems to extract knowledge and
insights from structured and unstructured data.
It combines aspects of statistics, machine
learning, and domain expertise to analyze data
and make informed decisions.
How Data Science Works
Big Data (Hadoop)
Big data refers to significant volumes of data that
cannot be processed effectively with the
traditional (database) applications currently in
use. Examples of such systems include dBase,
MS Access, MySQL, NoSQL, Oracle, Sybase, data
warehouses, and Hadoop. The processing of big
data begins with raw data that is not aggregated
and is often impossible to store in the memory of
a single computer.
Big Data - Outlook
Big-Data - Applications
Big Data and Data Science Hype (Boom)
What is Big Data?
Big data refers to large, diverse sets of
information that grow at ever-increasing rates.
The term encompasses the volume of
information, the velocity or speed at which it is
created and collected, and the variety or scope
of the data points being covered (commonly
known as the "Three V's" of big data). Big data
provides the raw material used in data mining.
What Is Big Data?
Three V's of big data
1. V – Volume
Volume refers to the amount of data
2. V – Velocity
Velocity refers to the speed of data processing
3. V – Variety
Variety refers to the number of types of data.
The above three V’s define the properties, or
dimensions, of big data.
Types of Big Data
Types of Big data:
1. Structured
Structured data is data that has a standardized format for efficient access by
software and humans alike. It is typically tabular, with rows and columns
that clearly define data attributes. Computers can effectively process
structured data for insights due to its quantitative nature.
2. Unstructured
Unstructured data consists of datasets (typically large collections of
files) that aren't stored in a structured database format. Unstructured data
has an internal structure, but it's not predefined through data models. It
might be human-generated or machine-generated, in a textual or non-
textual format.
3. Semi-Structured
Semi-structured data refers to data that is not captured or formatted in
conventional ways. Semi-structured data does not follow the format of a
tabular data model or relational databases because it does not have a fixed
schema.
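
To make the semi-structured category concrete, here is a minimal sketch in Python using JSON, a common semi-structured format; the two records below are invented for illustration. They share some fields but follow no fixed schema, which is exactly what distinguishes semi-structured data.

import json

# Two invented records: both have "id" and "name", but the other
# fields differ from record to record (no fixed schema).
records = json.loads("""
[
  {"id": 1, "name": "Asha", "email": "asha@example.com"},
  {"id": 2, "name": "Ravi", "phone": "98450-00000", "tags": ["vip"]}
]
""")

for r in records:
    print(r.get("name"), "->", r.get("email", "no email on file"))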
Big Data - Uses
Big data Uses:
• Big data is analyzed to generate insights, which can
lead to better decisions and strategic business moves.
• Big data is high-volume, and high-velocity or high-
variety information assets that demand cost-
effective, innovative forms of information
processing that enable enhanced insight, decision
making, and process automation.
• Businesses that use big data effectively hold a
potential competitive advantage over those that
don't because they're able to make faster and more
informed business decisions.
DATA COLLECTION
• Data collection is the process of acquiring, collecting,
extracting, and storing the voluminous amount of data
which may be in the structured or unstructured form like
text, video, audio, XML files, records, or other image files
used in later stages of data analysis.
• In the process of big data analysis, “Data collection” is the
initial step before starting to analyze the patterns or
useful information in data. The data which is to be
analyzed must be collected from different valid sources.
• The collected data is further divided mainly into
two types, known as:
1. Primary data
2. Secondary data
Datafication (2013-BigData)
What is Datafication?
Datafication is a technological trend turning many
aspects of our life into data which is subsequently
transferred into information realized as a new form of
value.
What is the function of datafication?
Datafication is the act of transforming tasks, processes,
and behaviors into data. It's a technological trend that
turns actions into quantifiable, measurable data that
can be used for tracking in real time, analytics, and
insight.
Big Data: “A revolution that will transform how we
live, work, and think” (Mayer-Schönberger & Cukier, 2013).
Benefits of Datafication
1. Datafication is a technique that is
financially advantageous to pursue since it
provides great opportunity for streamlining
corporate procedures.
2. Datafication is a cutting-edge process
for creating a futuristic framework that is
both secure and inventive.
Data Science Profiles
1. Data Analyst
Data Analysts are the individuals responsible for
reviewing data so that they can identify the key
information in their customers' businesses. Their work is
the process of collecting, processing, and analyzing data to
extract meaningful insights, and data analysts support
decision-making processes.
2. Data Scientist
Data Scientists are the individuals who use data to
understand it. These data scientists are responsible for
collecting, analyzing, and interpreting data to help drive
decision-making.
Data Science Profiles:
3. Data Engineer
Data Engineers are experts who are
responsible for designing, maintaining, and
optimizing the data infrastructure for data
management and transformation.
Data Science – 12 Benefits
1. Enhanced decision-making capabilities
2. Streamlined operations
3. Customer insights and personalization
4. Improved work efficiency
5. High demand in the job market
6. Measuring performance
7. Providing information to internal finances
8. Developing better products
9. Increasing efficiency
10. Mitigating (reducing) risk and fraud
11. Predicting outcomes and trends
12. Improving customer experiences
Data Science Contents?
Meta Data - Meaning
Metadata is information about data that helps
users understand, find, and reuse it. It is a key
part of data science and can be used for many
purposes, including:
• Data quality: Metadata can help confirm that
data is accurate and reliable.
• Data governance: Metadata is important for
compliance with corporate policies and
agreed standards.
Meta Data - Definition
Metadata is defined as the data providing
information about one or more aspects of the
data; it is used to summarize basic information
about data that can make tracking and working
with specific data easier. Some examples
include the means of creation of the data and
the purpose of the data.
Meta Data – Examples
Metadata makes finding and working with data
easier – allowing the user to sort or locate
specific documents. Some examples of basic
metadata are author, date created, date
modified, and file size. Metadata is also used
for unstructured data such as images, video,
web pages, spreadsheets (Excel), etc.
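
As a minimal sketch of file-level metadata, the following Python snippet (standard library only) reads a file's size and last-modified time; the file name "report.pdf" is a hypothetical placeholder.

import os
from datetime import datetime, timezone

path = "report.pdf"  # hypothetical file; replace with a real path
info = os.stat(path)  # metadata the operating system keeps about the file

print("File size (bytes):", info.st_size)
print("Last modified:", datetime.fromtimestamp(info.st_mtime, tz=timezone.utc))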
Statistical Inference (DS)
Meaning:
Using data analysis and statistics to make
conclusions about a population is
called statistical inference.
• a conclusion reached on the basis of evidence
and reasoning.
• "researchers are entrusted with drawing
inferences from the data"
Populations and Samples of Big Data
A population refers to the entire group of
individuals, objects, or items that share a common
characteristic within a given context. On the other
hand, a sample is a subset of the population that is
selected for analysis.
What is population and sample in big data?
A population is the entire group that you want to
draw conclusions about. A sample is the specific
group that you will collect data from. The size of
the sample is always less than the total size of the
population.
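
A minimal sketch of the population/sample distinction, assuming NumPy is installed; the one-million-value "population" below is synthetic and purely illustrative. The mean of a small random sample estimates the mean of the whole population, which is the core idea of statistical inference.

import numpy as np

rng = np.random.default_rng(seed=42)

# Synthetic population of one million values.
population = rng.normal(loc=50, scale=10, size=1_000_000)

# A random sample of 1,000 values drawn without replacement.
sample = rng.choice(population, size=1_000, replace=False)

print("Population mean:", population.mean())
print("Sample mean (estimate):", sample.mean())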
Population vs Sample

Population: Collecting data from an entire population can be
time-consuming, expensive, and sometimes impractical or
impossible.

Sample: Samples offer a more feasible approach to studying
populations, allowing researchers to draw conclusions based
on smaller, manageable datasets.
Modeling in Data Science
Data modeling is the process of analyzing and
defining all the different data types your business
collects and produces, as well as the relationships
between those bits of data.
What is data modeling? (IBM)
Data modeling is the process of creating a visual
representation of either a whole information
system or parts of it to communicate connections
between data points and structures.
Data Science Modeling Steps (10) – see the sketch after this list
1. Define Your Objective
2. Collect Data
3. Clean Your Data
4. Explore Your Data
5. Split Your Data
6. Choose a Model
7. Train Your Model
8. Evaluate Your Model
9. Improve Your Model
10. Deploy Your Model
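
A minimal sketch of steps 5 to 8 above, assuming scikit-learn is installed; the synthetic dataset from make_classification stands in for real collected data.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-in for a cleaned, explored dataset (steps 2-4).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Step 5: split the data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Steps 6-7: choose a model and train it.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Step 8: evaluate on held-out data.
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))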
Data Warehouse?
1. A data warehouse is an enterprise system
used for the analysis and reporting of
structured and semi-structured data from
multiple sources, such as point-of-sale
transactions, marketing automation, customer
relationship management, and more. A data
warehouse is suited for ad hoc analysis as well as
custom reporting.
ETL - Extract, Transform and Load (Tool in DW)
Data Warehouse - Defined
2. A data warehouse is a type of data management
system that is designed to enable and support business
intelligence (BI) activities, especially analytics. Data
warehouses are solely intended to perform queries and
analysis and often contain large amounts of historical
data.
• Data warehousing is essential for modern data
management, providing a strong foundation for
organizations to consolidate and analyze data
strategically.
What is ETL? & Uses
Extract, transform and load (ETL) is the process
of combining data from multiple sources into a
large, central repository called a data
warehouse. ETL uses a set of business rules
to clean and organize raw data and prepare it
for storage, data analytics, and machine
learning (ML).
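
A minimal ETL sketch, assuming pandas is installed and a hypothetical "sales.csv" source file; SQLite plays the role of the central repository here purely for illustration.

import sqlite3
import pandas as pd

# Extract: read raw data from a source file.
raw = pd.read_csv("sales.csv")  # hypothetical source file

# Transform: apply simple business rules to clean the data.
raw = raw.dropna(subset=["amount"])          # drop incomplete records
raw["amount"] = raw["amount"].astype(float)  # enforce a numeric type

# Load: write the cleaned data into the central repository.
conn = sqlite3.connect("warehouse.db")
raw.to_sql("sales", conn, if_exists="replace", index=False)
conn.close()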
Philosophy of Exploratory Data
Analysis (EDA)
Exploratory Data Analysis (EDA) adopts a
philosophy different from confirmatory, hypothesis-
driven analysis. It focuses on maximizing
insights, extracting crucial variables, detecting
outliers, and testing assumptions to deeply
understand the data.
What is Exploratory Data Analysis?
Exploratory Data Analysis (EDA), which frequently
uses charts or images, is a key step in fully
understanding data and learning all its features.
Recognizing patterns, figuring out important
variables, and seeing if some things are connected
are all crucial. It's also important to find mistakes in
the data.
EDA helps you get important information, making it
easier to understand the data while getting rid of
extra or unnecessary values. It involves looking for
patterns, finding strange things, testing ideas,
checking guesses using simple statistics and
pictures.
Why is Exploratory Data Analysis important in data
science?
• It facilitates a comprehensive understanding of
data before any assumptions are made.
• Its primary objective is to unveil patterns,
identify errors, and pinpoint outliers.
• It ensures data scientists produce valid results
aligned with desired business outcomes.
• Leveraging vast datasets for critical
decision-making has become a cornerstone of
success.
EDA Tools and Techniques:
Python or R programming languages for data exploration
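
A minimal EDA sketch in Python, assuming pandas and matplotlib are installed; "data.csv" is a hypothetical file standing in for a real dataset.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")  # hypothetical dataset

print(df.describe())    # summary statistics for numeric columns
print(df.isna().sum())  # count of missing values per column

df.hist(figsize=(10, 6))  # distribution of each numeric column
plt.tight_layout()
plt.show()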
Benefits of EDA: (Exploratory Data Analysis)
• EDA offers numerous benefits when dealing with
datasets: it aids in generating insights and in
framing queries or investigations.
• Secondly, EDA contributes to assessing the
quality and authenticity of the data. Identifying
errors, absent values, or predispositions in the
data becomes crucial for rectification or
adjustment.
• Thirdly, EDA assists in selecting the most suitable
techniques and models for your data.
Conclusions: EDA
 Exploratory Data Analysis (EDA) is insightful, as it
encapsulates the attributes and qualities of a dataset.
 EDA adapts to the characteristics of the data; it operates
more as a method than as a rigid procedure.
 It employs visual representations like charts and
diagrams.
 Both graphical and non-graphical statistical approaches
find use in executing EDA.
 EDA is an efficient data analysis tool.
 In AI-driven sectors like retail, e-commerce, banking,
finance, agriculture, healthcare, and more, EDA holds
pivotal significance.
Data Science Process - Life Cycles (6 steps)
Step 1: Define the Problem and Create a Project
Charter.
Clearly defining the research goals is the first step
in the Data Science Process. A project
charter outlines the objectives, resources,
deliverables, and timeline, ensuring that all
stakeholders are aligned.
Step 2: Retrieve Data
Data can be stored in databases, data warehouses,
or data lakes within an organization. Accessing this
data often involves navigating company policies
and requesting permissions.
Data Science Process - Life Cycles (steps)
Step 3: Data Cleansing, Integration, and
Transformation
Data cleaning ensures that errors, inconsistencies,
and outliers are removed. Data
integration combines datasets from different
sources, while data transformation prepares the
data for modeling by reshaping variables or
creating new features.
Step 4: Exploratory Data Analysis (EDA)
During EDA, various graphical techniques like
scatter plots, histograms, and box plots are used to
visualize data and identify trends. This phase helps
in selecting the right modeling techniques.
Data Science Process - Life Cycles (steps)
Step 5: Build Models
In this step, machine learning or deep
learning models are built to make predictions or
classifications based on the data. The choice of
algorithm depends on the complexity of the
problem and the type of data.
Step 6: Present Findings and Deploy Models
Once the analysis is complete, results are presented
to stakeholders. Models are deployed into
production systems to automate decision-making
or support ongoing analysis.
Data Science Process - Life Cycles
DATA SCIENTIST’S ROLE
Data Scientist’s role involves leveraging data to derive
actionable insights and support data-driven decision-
making. Their roles and responsibilities can be
summarized into several key areas:
• Problem Definition
Collaborate with stakeholders to understand
business objectives and translate them into data
science problems.
• Data Collection
Identify and gather relevant data from various
sources using techniques like web scraping, APIs,
and database querying (see the sketch below).
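
A minimal sketch of API-based data collection using the requests library (assumed installed); the URL is a hypothetical placeholder, and the endpoint is assumed to return a JSON list.

import requests

# Hypothetical endpoint; substitute a real API URL.
response = requests.get("https://api.example.com/v1/records", timeout=10)
response.raise_for_status()  # fail loudly on HTTP errors

records = response.json()    # parse the JSON payload (assumed to be a list)
print("Fetched", len(records), "records")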
DATA SCIENTIST’S ROLE
• Data Cleaning and Preprocessing
Clean and preprocess data by handling
missing values, removing duplicates, and
transforming data into a suitable format for
analysis (see the sketch below).
• Exploratory Data Analysis (EDA)
Perform descriptive statistics and create
visualizations to uncover patterns and
insights within the data.
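
A minimal cleaning-and-preprocessing sketch with pandas (assumed installed); the small DataFrame is an invented stand-in for real data.

import pandas as pd

df = pd.DataFrame({
    "age": [25, None, 31, 25],
    "city": ["Bangalore", "Delhi", None, "Bangalore"],
})

df = df.drop_duplicates()  # remove exact duplicate rows
df["age"] = df["age"].fillna(df["age"].mean())  # impute missing ages
df["city"] = df["city"].fillna("Unknown")       # flag missing cities
print(df)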
DATA SCIENTIST’S ROLE
• Feature Engineering
Create and select important features to enhance
model performance and reduce dimensionality.
• Model Building
Choose appropriate algorithms, train models, and
fine-tune parameters to optimize performance.
• Model Evaluation
Evaluate models using relevant metrics and
validate them to ensure they generalize well to
new data (see the sketch below).
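
A minimal sketch of checking generalization with k-fold cross-validation, assuming scikit-learn is installed; the synthetic dataset is illustrative only.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

# Five-fold cross-validation: train on 4/5 of the data, score on the
# remaining 1/5, rotating through all five folds.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("Mean CV accuracy:", scores.mean())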
DATA SCIENTIST’S ROLE
• Model Deployment
Develop and implement a strategy for deploying models
into production environments, ensuring seamless
integration with existing systems.
• Monitoring and Maintenance
Continuously monitor model performance, update and
retrain models as necessary, and perform error analysis to
refine models.
• Communication and Reporting
Present findings and insights through clear and compelling
storytelling, creating reports and dashboards for
stakeholders.
DATA SCIENTIST’S ROLE
• Ethical Considerations
Ensure data privacy, compliance with regulations, and
mitigate biases to promote fairness and ethical use of data.
• Continuous Learning
Stay updated with the latest advancements in data science
and continuously experiment with new techniques and
tools.
Conclusion - a Data Scientist combines technical skills
with business acumen to transform data into valuable
insights that drive strategic decisions and operational
improvements.
Fundamentals of Data Science
// Module – 1 // End
*** All the best ***

You might also like