
What is data science? Explain big data and hype in data science.

Data science is a field that involves using algorithms, methods, and systems to extract insights and knowledge from structured and unstructured data. It combines aspects of statistics, computer science, and domain expertise to analyze data and derive meaningful patterns and information. Data science is distinguished from traditional statistics by its focus on large-scale data processing, machine learning, and the use of advanced computational techniques.

Big Data

Big Data refers to extremely large datasets that traditional data processing software cannot handle efficiently. The characteristics of Big Data are often summarized by the "four Vs":

1. **Volume**: The sheer amount of data generated every second, requiring new methods of storage and processing.
2. **Velocity**: The speed at which new data is generated and moves around; this involves real-time or near-real-time processing.
3. **Variety**: The different types of data, including structured, unstructured, text, multimedia, etc.
4. **Veracity**: The quality and accuracy of the data, dealing with issues such as uncertainty, biases, noise, and abnormalities in data.

Hype in Data Science

The hype around data science and Big Data can be both confusing and
misleading. Here are key points that contribute to this hype:

- **Lack of Clear Definitions**: Terms like "Big Data" and "data science" are often used without precise definitions, leading to ambiguity.
- **Disregard for Existing Research**: The media often overlooks the extensive history of work in statistics, computer science, and other fields that underpin data science.
- **Exaggerated Claims**: Media and industry often exaggerate the capabilities and impact of data science, comparing data scientists to "Masters of the Universe" and similar grandiose titles.
- **Overlap with Statistics**: There's a perception that data science is just a rebranding of statistics or machine learning, which can feel dismissive to statisticians.
- **Doubt about its Scientific Nature**: Some argue that anything needing to label itself a "science" may not be a true science, implying data science may be more of a craft or applied discipline.

## What is a Model?

A model is a simplified representation of reality, created to understand and explain complex systems by focusing on essential aspects and omitting extraneous details. Different fields use models to capture specific attributes of the subjects they study:

- **Architecture**: Uses blueprints and scaled-down three-dimensional versions to represent buildings.
- **Molecular Biology**: Uses three-dimensional visualizations to
represent protein structures and connections between amino acids.
- **Statistics and Data Science**: Uses mathematical functions to
capture the uncertainty and randomness of data-generating processes.

Models are artificial constructions that help us understand and predict the behavior of systems by abstracting away unnecessary details. However, it's crucial to consider what might have been overlooked during this abstraction process.

### Statistical Modeling

Statistical modeling involves creating mathematical representations of the relationships between variables to understand underlying processes and make predictions. The steps include:

1. **Conceptualization**: Drawing a picture or diagram of the underlying process to visualize relationships and causality.

2. **Mathematical Representation**: Expressing relationships using mathematical equations. For example, for a linear relationship between two variables x and y, you might write:

   y = β0 + β1x

   where β0 and β1 are parameters whose values are unknown.

3. **Parameter Estimation**: Determining the values of the parameters β0 and β1 using data.

The goal is to create a model that accurately represents the data and
helps in making informed decisions and predictions.
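As a brief illustration of these three steps, here is a minimal sketch in Python; the data and the "true" parameter values are synthetic and assumed purely for illustration:

```python
# Minimal sketch: estimating b0 and b1 in y = b0 + b1*x from synthetic data.
import numpy as np

rng = np.random.default_rng(0)

# Conceptualization: we posit that y depends linearly on x, plus random noise.
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=100)   # assumed "true" b0 = 2.0, b1 = 0.5

# Parameter estimation: least-squares estimates of b0 and b1 from the data.
b1_hat, b0_hat = np.polyfit(x, y, deg=1)              # polyfit returns [slope, intercept]

print(f"estimated model: y = {b0_hat:.2f} + {b1_hat:.2f} * x")
```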
Role of Statistical Inference in Data Science

Statistical inference is essential in data science for making sense of complex data,
understanding the underlying randomness and processes, and making informed
decisions based on data analysis.

**Complexity and Data Generation:**
- The world is a complex, random, and uncertain place, continuously generating
data through everyday activities (commuting, shopping, emailing, etc.).
- Real-world processes naturally produce data, which can be collected and
analyzed.

**Data Collection and Subjectivity:**
- Data represents traces of real-world processes.
- The choice of data collection methods is subjective, influencing which traces are
gathered.

**Sources of Uncertainty:**
- **Process Uncertainty**: The inherent randomness and unpredictability in the
processes themselves.
- **Data Collection Uncertainty**: Uncertainty arising from the methods and
procedures used to gather data.

**Simplifying Data:**
- Raw data from real-world processes can be vast and unwieldy.
- To understand and analyze this data, it must be simplified into more
comprehensible forms, such as statistical models or estimators.

**Statistical Estimators:**
- These are mathematical models or functions that simplify and summarize data.
- Estimators help capture the essence of the data in a more concise and
understandable way.

**Statistical Inference:**
- The field of statistical inference deals with developing methods and procedures
to extract meaningful insights from data generated by stochastic (random)
processes.
- It involves the process of turning real-world phenomena into data and then using that data to understand and describe the world.

**Functions of Statistical Inference:**
1. **Description**: Summarizing and describing data to understand underlying
processes.
2. **Understanding**: Gaining insights into real-world phenomena through data
analysis.
3. **Prediction**: Using data to forecast future trends and behaviors.
4. **Decision Making**: Informing decisions based on statistical analysis and
data-driven insights.
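To make these functions concrete, here is a minimal sketch in Python using one of the simplest estimators, the sample mean; the "commute time" data are synthetic and assumed purely for illustration:

```python
# Minimal sketch: description, inference, and prediction with a simple estimator.
import numpy as np

rng = np.random.default_rng(42)
# Pretend these are traces of a real-world process, e.g. daily commute times in minutes.
data = rng.normal(loc=32.0, scale=6.0, size=500)

# Description: summarize the data with estimators (sample mean and standard deviation).
mean_hat = data.mean()
sd_hat = data.std(ddof=1)

# Understanding: a rough 95% confidence interval for the underlying mean,
# using the normal approximation.
se = sd_hat / np.sqrt(len(data))
ci = (mean_hat - 1.96 * se, mean_hat + 1.96 * se)

# Prediction: a plausible range for a single future observation.
pred = (mean_hat - 1.96 * sd_hat, mean_hat + 1.96 * sd_hat)

print(f"estimated mean: {mean_hat:.1f}, 95% CI: ({ci[0]:.1f}, {ci[1]:.1f})")
print(f"plausible range for a new observation: ({pred[0]:.1f}, {pred[1]:.1f})")
```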

5. Explain the data science process with a neat diagram.

Data science is the field focused on extracting insights and knowledge from data. It involves collecting, processing, analyzing, and interpreting large volumes of data using techniques from statistics, computer science, and domain-specific knowledge. The main goal is to turn data into actionable insights for decision-making and problem-solving. Key activities include data collection, cleaning, analysis, modeling, visualization, and communication of results.

Below is a detailed breakdown of the data science process:
1. **Real World Data Generation:**
- The real world consists of numerous activities generating raw data (e.g., people using
Google+, athletes competing, spammers sending emails, etc.).
- This raw data can take various forms, such as logs, records, emails, or genetic information.

2. **Data Collection:**
- Collect raw data related to the specific activity or phenomenon of interest.
- Raw data often contains noise and lacks structure, necessitating further processing.

3. **Data Cleaning and Munging:**
- **Objective**: Transform raw data into a clean, structured format suitable for analysis.
- **Activities**: Joining, scraping, wrangling data using tools like Python, shell scripts, R, or
SQL.
- **Output**: A structured dataset, typically in tabular form (e.g., columns like name, event,
year, gender, event time).

4. **Exploratory Data Analysis (EDA):**
- **Objective**: Understand the data’s characteristics and identify any issues (a short code sketch after this list illustrates steps 3 to 6 in Python).
- **Activities**: Checking for duplicates, missing values, outliers, and incorrect data entries.
- **Outcome**: Refined dataset ready for modeling. If issues are found, additional data
cleaning or collection may be required.

5. **Model Design and Selection:**
- **Objective**: Choose an appropriate model based on the type of problem (classification, prediction, description).
- **Common Algorithms**: k-nearest neighbor (k-NN), linear regression, Naive Bayes, etc.
- **Considerations**: The model choice depends on the problem’s nature and the dataset’s
characteristics.

6. **Model Interpretation and Communication:**
- **Objective**: Interpret the model’s results and communicate findings.
- **Activities**: Visualization, reporting results to stakeholders, publishing papers, or
presenting academic talks.
- **Goal**: Ensure that results are understandable and actionable for decision-making.

7. **Building Data Products:**
- **Objective**: Develop prototypes or products like spam classifiers, search ranking algorithms, recommendation systems.
- **Integration**: These data products are deployed in the real world, where user interactions
generate more data.
- **Feedback Loop**: The interaction of users with the data product generates new data,
creating a feedback loop that influences future data and models.

8. **Continuous Improvement:**
- **Objective**: Adjust and improve models based on the feedback loop and new data.
- **Activities**: Monitoring model performance, retraining models with new data, addressing
biases introduced by the model.
- **Outcome**: Enhanced model accuracy and effectiveness over time.
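The sketch below, in Python, walks through several of these steps on a small synthetic dataset; the column names, values, and the choice of a simple linear model are all assumptions made purely for illustration rather than part of any specific real-world pipeline:

```python
# Minimal sketch of steps 3-6: munging, EDA, modeling, and interpretation.
import numpy as np
import pandas as pd

# Steps 2-3: collect and munge raw data into a structured table (synthesized here).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "year": rng.integers(2000, 2020, size=200),
    "gender": rng.choice(["M", "F"], size=200),
    "event_time": rng.normal(loc=150.0, scale=10.0, size=200),
})
df.loc[5, "event_time"] = np.nan            # simulate a missing value
df = pd.concat([df, df.iloc[[0]]])          # simulate a duplicate row

# Step 4: EDA -- check for duplicates, missing values, and outliers.
print("duplicate rows:", df.duplicated().sum())
print("missing values per column:\n", df.isna().sum())
df = df.drop_duplicates().dropna()
z = (df["event_time"] - df["event_time"].mean()) / df["event_time"].std()
df = df[z.abs() < 3]                        # drop extreme outliers

# Step 5: model design -- a simple linear regression of event_time on year.
b1, b0 = np.polyfit(df["year"], df["event_time"], deg=1)

# Step 6: interpretation -- report the fitted trend so it is actionable.
print(f"fitted model: event_time = {b0:.1f} + {b1:.4f} * year")
```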
### What is RealDirect and How Does it Make Money?

RealDirect is a real estate company founded by Doug Perlson that aims to improve the way people sell and buy houses using data-driven approaches. It addresses the inefficiencies of the traditional broker system by employing a team of licensed real estate agents who pool their knowledge and use advanced data tools.

RealDirect provides an online platform for sellers to receive data-driven tips and real-time recommendations, optimizing the sales process through the use of both historical and real-time data.

The company offers a subscription service and reduced commission rates to its clients, enhancing efficiency and reducing costs compared to traditional brokerage services.

**Problems Addressed:**
1. **Broker System:**
- Brokers usually operate as independent agents and closely guard their data.
- Experienced brokers have only slightly more data than inexperienced ones.

2. **Data Quality:**
- Publicly available real estate data is often outdated, with a three-month lag
between a sale and when the data becomes available.

**Solutions Provided by RealDirect:**
1. **Team Approach:**
- RealDirect hires licensed real estate agents who work together and share their
knowledge.
- It provides an interface for sellers with data-driven tips and real-time
recommendations.

2. **Data Expertise:**
- Brokers at RealDirect become data experts, using tools to track new and
relevant data.
- Access to both public data and recent sources like co-op sales.

3. **Real-Time Data:**
- RealDirect works on providing real-time data feeds about searches, initial
offers, time between offer and close, and online search behavior.
#### How Does RealDirect Make Money?

**Subscription Model:**
- **Fee:** Sellers pay about $395 a month to access the selling tools provided by
RealDirect.

**Commission Model:**
- **Reduced Commission:** Sellers can use RealDirect’s agents at a reduced
commission rate (typically 2% of the sale) compared to the usual 2.5% or 3%.
- **Efficiency through Data:** By pooling data, RealDirect optimizes the selling
process, allowing them to handle more volume at lower commission rates.
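As a rough worked example of that trade-off (the sale price below is hypothetical and not from the source), the commission difference can be quantified:

```python
# Toy comparison of commission costs; all figures assumed except the rates quoted above.
sale_price = 1_000_000                       # hypothetical sale price in dollars
realdirect_commission = 0.02 * sale_price    # reduced rate of roughly 2%
traditional_commission = 0.03 * sale_price   # typical rate of 2.5%-3%; 3% used here

print(f"RealDirect commission:  ${realdirect_commission:,.0f}")
print(f"Traditional commission: ${traditional_commission:,.0f}")
print(f"Seller savings:         ${traditional_commission - realdirect_commission:,.0f}")
```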

**Additional Services:**
- **Value Addition:** Provides detailed information on buyer concerns such as
nearby parks, subways, schools, and price comparisons per square foot.

RealDirect leverages data to optimize the real estate process, providing a cost-effective and efficient alternative to traditional broker services.
