
What is data science? Explain big data and hype in data science.

Data science is a field that involves using algorithms, methods, and systems to extract insights and knowledge from structured and unstructured data. It combines aspects of statistics, computer science, and domain expertise to analyze data and derive meaningful patterns and information. Data science is distinguished from traditional statistics by its focus on large-scale data processing, machine learning, and the use of advanced computational techniques.

Big Data

Big Data refers to extremely large datasets that traditional data processing software cannot handle efficiently. The characteristics of Big Data are often summarized by the "four Vs":

1. **Volume**: The sheer amount of data generated every second, requiring new methods of storage and processing.
2. **Velocity**: The speed at which new data is generated and moves around; this involves real-time or near-real-time processing.
3. **Variety**: The different types of data, including structured, unstructured, text, multimedia, etc.
4. **Veracity**: The quality and accuracy of the data, dealing with issues such as uncertainty, biases, noise, and abnormalities in data.

Hype in Data Science

The hype around data science and Big Data can be both confusing and
misleading. Here are key points that contribute to this hype:

- **Lack of Clear Definitions**: Terms like "Big Data" and "data science" are often used without precise definitions, leading to ambiguity.
- **Disregard for Existing Research**: The media often overlooks the extensive history of work in statistics, computer science, and other fields that underpin data science.
- **Exaggerated Claims**: Media and industry often exaggerate the capabilities and impact of data science, comparing data scientists to "Masters of the Universe" and similar grandiose titles.
- **Overlap with Statistics**: There's a perception that data science is just a rebranding of statistics or machine learning, which can feel dismissive to statisticians.
- **Doubt about its Scientific Nature**: Some argue that anything needing to label itself a "science" may not be a true science, implying data science may be more of a craft or applied discipline.

## What is a Model?

A model is a simplified representation of reality, created to understand and explain complex systems by focusing on essential aspects and omitting extraneous details. Different fields use models to capture specific attributes of the subjects they study:

- **Architecture**: Uses blueprints and scaled-down three-dimensional versions to represent buildings.
- **Molecular Biology**: Uses three-dimensional visualizations to
represent protein structures and connections between amino acids.
- **Statistics and Data Science**: Uses mathematical functions to
capture the uncertainty and randomness of data-generating processes.

Models are artificial constructions that help us understand and predict the behavior of systems by abstracting away unnecessary details. However, it's crucial to consider what might have been overlooked during this abstraction process.

### Statistical Modeling

Statistical modeling involves creating mathematical representations of the relationships between variables to understand underlying processes and make predictions. The steps include:

1. **Conceptualization**: Drawing a picture or diagram of the underlying process to visualize relationships and causality.

2. **Mathematical Representation**: Expressing relationships using mathematical equations. For example, for a linear relationship between two variables x and y, you might write:

   y = β0 + β1x

   where β0 and β1 are parameters whose values are unknown.

3. **Parameter Estimation**: Determining the values of the parameters β0 and β1 using data.

The goal is to create a model that accurately represents the data and
helps in making informed decisions and predictions.
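As a brief illustration of these three steps, here is a minimal sketch in Python; the data and the "true" parameter values are synthetic and assumed purely for illustration:

```python
# Minimal sketch: estimating b0 and b1 in y = b0 + b1*x from synthetic data.
import numpy as np

rng = np.random.default_rng(0)

# Conceptualization: we posit that y depends linearly on x, plus random noise.
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=100)   # assumed "true" b0 = 2.0, b1 = 0.5

# Parameter estimation: least-squares estimates of b0 and b1 from the data.
b1_hat, b0_hat = np.polyfit(x, y, deg=1)              # polyfit returns [slope, intercept]

print(f"estimated model: y = {b0_hat:.2f} + {b1_hat:.2f} * x")
```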
Role of Statistical Inference in Data Science

Statistical inference is essential in data science for making sense of complex data,
understanding the underlying randomness and processes, and making informed
decisions based on data analysis.

**Complexity and Data Generation:**
- The world is a complex, random, and uncertain place, continuously generating
data through everyday activities (commuting, shopping, emailing, etc.).
- Real-world processes naturally produce data, which can be collected and
analyzed.

**Data Collection and Subjectivity:**
- Data represents traces of real-world processes.
- The choice of data collection methods is subjective, influencing which traces are
gathered.

**Sources of Uncertainty:**
- **Process Uncertainty**: The inherent randomness and unpredictability in the
processes themselves.
- **Data Collection Uncertainty**: Uncertainty arising from the methods and
procedures used to gather data.

**Simplifying Data:**
- Raw data from real-world processes can be vast and unwieldy.
- To understand and analyze this data, it must be simplified into more
comprehensible forms, such as statistical models or estimators.

**Statistical Estimators:**
- These are mathematical models or functions that simplify and summarize data.
- Estimators help capture the essence of the data in a more concise and
understandable way.

**Statistical Inference:**
- The field of statistical inference deals with developing methods and procedures
to extract meaningful insights from data generated by stochastic (random)
processes.
- It involves the process of turning real-world phenomena into data and then using that data to understand and describe the world.

**Functions of Statistical Inference:**
1. **Description**: Summarizing and describing data to understand underlying
processes.
2. **Understanding**: Gaining insights into real-world phenomena through data
analysis.
3. **Prediction**: Using data to forecast future trends and behaviors.
4. **Decision Making**: Informing decisions based on statistical analysis and
data-driven insights.
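To make these functions concrete, here is a minimal sketch in Python using one of the simplest estimators, the sample mean; the "commute time" data are synthetic and assumed purely for illustration:

```python
# Minimal sketch: description, inference, and prediction with a simple estimator.
import numpy as np

rng = np.random.default_rng(42)
# Pretend these are traces of a real-world process, e.g. daily commute times in minutes.
data = rng.normal(loc=32.0, scale=6.0, size=500)

# Description: summarize the data with estimators (sample mean and standard deviation).
mean_hat = data.mean()
sd_hat = data.std(ddof=1)

# Understanding: a rough 95% confidence interval for the underlying mean,
# using the normal approximation.
se = sd_hat / np.sqrt(len(data))
ci = (mean_hat - 1.96 * se, mean_hat + 1.96 * se)

# Prediction: a plausible range for a single future observation.
pred = (mean_hat - 1.96 * sd_hat, mean_hat + 1.96 * sd_hat)

print(f"estimated mean: {mean_hat:.1f}, 95% CI: ({ci[0]:.1f}, {ci[1]:.1f})")
print(f"plausible range for a new observation: ({pred[0]:.1f}, {pred[1]:.1f})")
```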

5. Explain the data science process with a neat diagram.

Data science is the field focused on extracting insights and knowledge from data. It involves collecting, processing, analyzing, and interpreting large volumes of data using techniques from statistics, computer science, and domain-specific knowledge. The main goal is to turn data into actionable insights for decision-making and problem-solving. Key activities include data collection, cleaning, analysis, modeling, visualization, and communication of results.

Below is a detailed breakdown of the data science process:
1. **Real World Data Generation:**
- The real world consists of numerous activities generating raw data (e.g., people using
Google+, athletes competing, spammers sending emails, etc.).
- This raw data can take various forms, such as logs, records, emails, or genetic information.

2. **Data Collection:**
- Collect raw data related to the specific activity or phenomenon of interest.
- Raw data often contains noise and lacks structure, necessitating further processing.

3. **Data Cleaning and Munging:**
- **Objective**: Transform raw data into a clean, structured format suitable for analysis.
- **Activities**: Joining, scraping, wrangling data using tools like Python, shell scripts, R, or
SQL.
- **Output**: A structured dataset, typically in tabular form (e.g., columns like name, event,
year, gender, event time).

4. **Exploratory Data Analysis (EDA):**
- **Objective**: Understand the data’s characteristics and identify any issues (a short code sketch after this list illustrates steps 3 to 6 in Python).
- **Activities**: Checking for duplicates, missing values, outliers, and incorrect data entries.
- **Outcome**: Refined dataset ready for modeling. If issues are found, additional data
cleaning or collection may be required.

5. **Model Design and Selection:**
- **Objective**: Choose an appropriate model based on the type of problem (classification, prediction, description).
- **Common Algorithms**: k-nearest neighbor (k-NN), linear regression, Naive Bayes, etc.
- **Considerations**: The model choice depends on the problem’s nature and the dataset’s
characteristics.

6. **Model Interpretation and Communication:**
- **Objective**: Interpret the model’s results and communicate findings.
- **Activities**: Visualization, reporting results to stakeholders, publishing papers, or
presenting academic talks.
- **Goal**: Ensure that results are understandable and actionable for decision-making.

7. **Building Data Products:**
- **Objective**: Develop prototypes or products like spam classifiers, search ranking algorithms, recommendation systems.
- **Integration**: These data products are deployed in the real world, where user interactions
generate more data.
- **Feedback Loop**: The interaction of users with the data product generates new data,
creating a feedback loop that influences future data and models.

8. **Continuous Improvement:**
- **Objective**: Adjust and improve models based on the feedback loop and new data.
- **Activities**: Monitoring model performance, retraining models with new data, addressing
biases introduced by the model.
- **Outcome**: Enhanced model accuracy and effectiveness over time.
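The sketch below, in Python, walks through several of these steps on a small synthetic dataset; the column names, values, and the choice of a simple linear model are all assumptions made purely for illustration rather than part of any specific real-world pipeline:

```python
# Minimal sketch of steps 3-6: munging, EDA, modeling, and interpretation.
import numpy as np
import pandas as pd

# Steps 2-3: collect and munge raw data into a structured table (synthesized here).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "year": rng.integers(2000, 2020, size=200),
    "gender": rng.choice(["M", "F"], size=200),
    "event_time": rng.normal(loc=150.0, scale=10.0, size=200),
})
df.loc[5, "event_time"] = np.nan            # simulate a missing value
df = pd.concat([df, df.iloc[[0]]])          # simulate a duplicate row

# Step 4: EDA -- check for duplicates, missing values, and outliers.
print("duplicate rows:", df.duplicated().sum())
print("missing values per column:\n", df.isna().sum())
df = df.drop_duplicates().dropna()
z = (df["event_time"] - df["event_time"].mean()) / df["event_time"].std()
df = df[z.abs() < 3]                        # drop extreme outliers

# Step 5: model design -- a simple linear regression of event_time on year.
b1, b0 = np.polyfit(df["year"], df["event_time"], deg=1)

# Step 6: interpretation -- report the fitted trend so it is actionable.
print(f"fitted model: event_time = {b0:.1f} + {b1:.4f} * year")
```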
### What is RealDirect and How Does it Make Money?

RealDirect is a real estate company founded by Doug Perlson that aims to improve the way people sell and buy houses using data-driven approaches. It addresses the inefficiencies of the traditional broker system by employing a team of licensed real estate agents who pool their knowledge and use advanced data tools.

RealDirect provides an online platform for sellers to receive data-driven tips and real-time recommendations, optimizing the sales process through the use of both historical and real-time data.

The company offers a subscription service and reduced commission rates to its clients, enhancing efficiency and reducing costs compared to traditional brokerage services.

**Problems Addressed:**
1. **Broker System:**
- Brokers usually operate as independent agents and closely guard their data.
- Experienced brokers have only slightly more data than inexperienced ones.

2. **Data Quality:**
- Publicly available real estate data is often outdated, with a three-month lag
between a sale and when the data becomes available.

**Solutions Provided by RealDirect:**
1. **Team Approach:**
- RealDirect hires licensed real estate agents who work together and share their
knowledge.
- It provides an interface for sellers with data-driven tips and real-time
recommendations.

2. **Data Expertise:**
- Brokers at RealDirect become data experts, using tools to track new and
relevant data.
- Access to both public data and recent sources like co-op sales.

3. **Real-Time Data:**
- RealDirect works on providing real-time data feeds about searches, initial
offers, time between offer and close, and online search behavior.
#### How Does RealDirect Make Money?

**Subscription Model:**
- **Fee:** Sellers pay about $395 a month to access the selling tools provided by
RealDirect.

**Commission Model:**
- **Reduced Commission:** Sellers can use RealDirect’s agents at a reduced
commission rate (typically 2% of the sale) compared to the usual 2.5% or 3%.
- **Efficiency through Data:** By pooling data, RealDirect optimizes the selling
process, allowing them to handle more volume at lower commission rates.
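As a rough worked example of that trade-off (the sale price below is hypothetical and not from the source), the commission difference can be quantified:

```python
# Toy comparison of commission costs; all figures assumed except the rates quoted above.
sale_price = 1_000_000                       # hypothetical sale price in dollars
realdirect_commission = 0.02 * sale_price    # reduced rate of roughly 2%
traditional_commission = 0.03 * sale_price   # typical rate of 2.5%-3%; 3% used here

print(f"RealDirect commission:  ${realdirect_commission:,.0f}")
print(f"Traditional commission: ${traditional_commission:,.0f}")
print(f"Seller savings:         ${traditional_commission - realdirect_commission:,.0f}")
```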

**Additional Services:**
- **Value Addition:** Provides detailed information on buyer concerns such as
nearby parks, subways, schools, and price comparisons per square foot.

RealDirect leverages data to optimize the real estate process, providing a cost-effective and efficient alternative to traditional broker services.
