CHAPTER 1
Introduction to Data Science
1.1 What is data science?
Data Science is the in-depth study and analysis of massive
datasets, aimed at extracting meaningful insights from both
structured and unstructured data. By employing the scientific
method, advanced technologies, and powerful algorithms, this
multidisciplinary field uncovers new and valuable information
from raw data. It leverages a wide range of tools and techniques to process and analyze data and to derive innovative solutions. Some of the most common applications of data science include the following:
1. Healthcare
Revolutionizing medical equipment and aiding in disease
detection and treatment.
2. Gaming
Video and computer games are now created with the help of data science, which has taken the gaming experience to the next level.
3. Image Recognition
Identifying patterns and objects within images, from social media
tagging to diagnostics.
4. Recommendation Systems
Netflix and Amazon give movie and product recommendations
based on what you like to watch, purchase, or browse on their
platforms.
5. Logistics
Data Science is used by logistics companies to optimize routes to
ensure faster delivery of products and increase operational
efficiency.
6. Fraud Detection
Banking and financial institutions use data science and related
algorithms to detect fraudulent transactions.
7. Internet Search
When we think of search, we immediately think of Google. However, other search engines, such as Yahoo, DuckDuckGo, Bing, AOL, and Ask, also employ data science algorithms to return the best results for a query in a matter of seconds. Google alone handles more than 20 petabytes of data per day; it would not be the 'Google' we know today if data science did not exist.
8. Speech Recognition
Speech recognition is dominated by data science techniques.
Virtual assistants like Siri, Alexa, and Google Assistant rely on
speech-to-text data science technologies.
9. Targeted Advertising
Digital marketing utilizes user behavior to tailor advertisements,
resulting in higher engagement compared to traditional
marketing.
10. Airline Route Planning
Predicting delays and optimizing routes for efficiency and
profitability.
Data can be stored in many forms, ranging from simple text files
to tables in a database. The objective now is acquiring all the data
you need. This may be difficult, and even if you succeed, data is
often like a diamond in the rough: it needs polishing to be of any
use to you.
2.3. Step 3: Data Preparation
Cleansing, integrating, and transforming data
The data received from the data retrieval phase is rarely ready to use. Your task now is to cleanse it and prepare it for use in the modeling and reporting phase. The data model needs the data in a specific format, so data transformation will always come into play. It's a good habit to
correct data errors as early on in the process as possible. Figure
2.4. shows the most common actions to take during the data
cleansing, integration, and transformation phase.
Joining tables
Joining tables allows you to combine the information of one
observation found in one table with the information that you find
in another table. The focus is on enriching a single observation.
Let’s say that the first table contains information about the
purchases of a customer and the other table contains information
about the address where your customer lives. Joining the tables
allows you to combine the information, as shown in figure 2.7.
To join tables, you use variables that represent the same object in
both tables, such as a date, a country name, or a Social Security
number. These common fields are known as keys.
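For concreteness, here is a minimal sketch of such a join using pandas; the customers, items, and cities are made-up values, not the contents of figure 2.7:

import pandas as pd

# Hypothetical purchases table and addresses table sharing the key column "Client".
purchases = pd.DataFrame({
    "Client": ["Ram", "Sita"],
    "Item": ["Pen", "Book"],
})
addresses = pd.DataFrame({
    "Client": ["Ram", "Sita"],
    "City": ["Delhi", "Mumbai"],
})

# Join on the shared key to enrich each purchase with the customer's address.
enriched = purchases.merge(addresses, on="Client", how="inner")
print(enriched)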
Appending tables
Appending or stacking tables is effectively adding observations
from one table to another table. Figure 2.8 shows an example of
appending tables. One table contains the observations from the
month January and the second table contains observations from
the month March. The result of appending these tables is a larger
one with the observations from January as well as March. The
equivalent operation in set theory would be the union, and this is
also the command in SQL, the common language of relational
databases. Other set operators are also used in data science,
such as set difference and intersection.
Figure 2.8 Appending data from tables is a common operation, but it requires an equal structure in the tables being appended.

First table (January):
Client   Item     Month
Ram      Copy     January
Sita     Book     January

Second table (March):
Client   Item     Month
Ram      Pen      March
Sita     Pencil   March
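A minimal sketch of appending (stacking) these two tables with pandas; the rows mirror figure 2.8, and pd.concat is the pandas counterpart of the SQL UNION ALL:

import pandas as pd

# The January table and the March table have exactly the same columns.
january = pd.DataFrame({
    "Client": ["Ram", "Sita"],
    "Item": ["Copy", "Book"],
    "Month": ["January", "January"],
})
march = pd.DataFrame({
    "Client": ["Ram", "Sita"],
    "Item": ["Pen", "Pencil"],
    "Month": ["March", "March"],
})

# Stack the observations of both tables into one larger table.
combined = pd.concat([january, march], ignore_index=True)
print(combined)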
The techniques you'll use now are borrowed from the fields of machine learning, data mining, and statistics.
Building a model is an iterative process. The way you build your
model depends on whether you go with classic statistics or the
somewhat more recent machine learning, and the type of
technique you want to use. Either way, most models consist of the
following main steps:
1. Selection of a modeling technique and variables to enter in the
model
2. Execution of the model
3. Diagnosis and model comparison
Model and variable selection
You’ll need to select the variables you want to include in your
model and a modeling technique. Your findings from the
exploratory analysis should already have given you a fair idea of what
variables will help you construct a good model. Many modeling
techniques are available, and choosing the right model for a
problem requires judgment on your part.
Model execution
Once you’ve chosen a model you’ll need to implement it in code.
Most programming languages, such as Python, already have libraries such as StatsModels or scikit-learn, and these packages implement several of the most popular techniques. Coding a model from scratch is a
nontrivial task in most cases, so having these libraries available
can speed up the process.
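As a minimal sketch of model execution, the following fits a linear regression with scikit-learn on a small synthetic dataset; the choice of LinearRegression and the made-up data are illustrative assumptions, not a prescribed technique:

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: one predictor with a noisy linear relationship to the response.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 3 * X[:, 0] + 2 + rng.normal(0, 1, size=50)

model = LinearRegression()   # the chosen modeling technique
model.fit(X, y)              # executing (estimating) the model
print(model.coef_, model.intercept_)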
Model diagnostics and model comparison
You’ll be building multiple models from which you then choose
the best one based on multiple criteria. Working with a holdout
sample helps you pick the best-performing model. A holdout
sample is a part of the data you leave out of the model building
so it can be used to evaluate the model afterward. The principle
here is simple: the model should work on unseen data. You use
only a fraction of your data to estimate the model and the other
part, the holdout sample, is kept out of the equation. The model is
then unleashed on the unseen data and error measures are
calculated to evaluate it.
Figure 2.11. A holdout sample helps you compare models and
ensures that you can generalize results to data that the model
has not yet seen.
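A minimal sketch of a holdout evaluation using scikit-learn; the 70/30 split, the error measure, and the synthetic data are all illustrative choices:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: 70% is used to estimate the model, 30% is kept out as the holdout sample.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X[:, 0] + 2 + rng.normal(0, 1, size=200)

X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=1)
model = LinearRegression().fit(X_train, y_train)

# The error measure on the unseen holdout data shows how well the model generalizes.
print(mean_squared_error(y_hold, model.predict(X_hold)))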
Many models make strong assumptions, such as independence of
the inputs, and you have to verify that these assumptions are
indeed met. This is called model diagnostics.
This section gave a short introduction to the steps required to
build a valid model. Once you have a working model, you’re ready
to go to the last step.
2.6. Step 6: Presenting findings and building applications
on top of them
After you’ve successfully analyzed the data and built a well-
performing model, you're ready to present your findings to the world (figure 2.12). This is an exciting part; all your hours of hard
work have paid off and you can explain what you found to the
stakeholders.
Introduction
In this chapter, we will explore the data science libraries in Python. However, before learning about the various data science libraries, it is important to set up an environment for installing and using them. Setting up the environment
for utilizing data science libraries like NumPy, SciPy,
Matplotlib, Pandas, and others in Python ensures
effective data analysis and modeling workflows. This
entails managing dependencies, version control, and
package management to guarantee project compatibility
and reproducibility. By creating isolated environments,
potential conflicts between different library versions are
mitigated, facilitating seamless collaboration and
reproducibility of results. So, let us get into the intricate
details of Python installation and Integrated
Development Environments (IDEs) like VSCode and
Jupyter Notebook for writing and executing the Python
code.
Structure
In this chapter, we will discuss the following topics:
Introduction to Python
Setup and installation of Jupyter Notebook on Windows
Insights into Jupyter Notebook
Demo program using Jupyter Notebook
Introduction to Data Science Libraries in Python
Objectives
This chapter aims to provide a comprehensive guide on
creating an efficient Python programming environment,
highlighting the importance of an Integrated Development
Environment (IDE). It begins by detailing the step-by-step
installation of Jupyter Notebook on a Windows system, offering
guidance on setup and functionality. Next, it covers the
installation and utilization of Visual Studio Code (VSCode) for
Python programming. Finally, the chapter introduces key data
science libraries in Python, equipping learners with essential
tools for their programming endeavors.
Introduction to Python
Before diving into Python, it is crucial to set up a proper
development environment. This chapter will walk you through
the process of installing Python on your system, ensuring you
are ready to write and execute Python code.
Python is an open-source, high-level programming language
known for its simplicity, readability, and versatility. It supports
multiple programming paradigms, making it an excellent
choice for beginners and experienced developers alike.
Whether you are exploring web development, data analysis,
machine learning, or scientific computing, Python provides
powerful tools to support your journey.
The first step in using Python is installing the Python
interpreter, which executes Python code and grants access to
a vast ecosystem of libraries and tools. This chapter will guide
you through the installation process on Windows, macOS, and
Linux, ensuring a smooth setup for your development
environment.
Platform-specific installation references:
https://docs.anaconda.com/anaconda/install/linux/
https://code.visualstudio.com/docs/setup/mac
https://code.visualstudio.com/docs/setup/linux
Adding Elements:
In [ ]:  list.append(5)   # Adds 5 to the end of the list
         print(list)
Out [ ]: [1, 2, 3, 4, 5]

Removing Elements:
In [ ]:  list.remove(3)   # Removes the value 3
         print(list)
Out [ ]: [1, 2, 4, 5]
Array
In NumPy, arrays are the backbone of data storage and
manipulation. They act as structured grids or tables where every
value has a consistent type, known as the "array dtype." This
consistency simplifies accessing, processing, and interpreting
individual elements.
To create a NumPy array, you can start with a Python list, using
nested lists for data with multiple dimensions, such as a 2D or 3D
array.
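A minimal sketch of creating arrays from Python lists; the numbers are arbitrary:

import numpy as np

# 1D array from a plain Python list.
a = np.array([1, 2, 3, 4])

# 2D array from a nested list; every element shares one dtype.
b = np.array([[1, 2, 3], [4, 5, 6]])

print(a.dtype, a.shape)   # e.g. int64 (4,)
print(b.dtype, b.shape)   # e.g. int64 (2, 3)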
3D random array
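A minimal sketch of one way to build such an array, using np.random.rand (described below); the 2 x 3 x 4 shape is arbitrary:

import numpy as np

# 3D array: 2 blocks, each 3 rows x 4 columns, filled with random floats in [0, 1).
arr3d = np.random.rand(2, 3, 4)
print(arr3d.shape)   # (2, 3, 4)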
Randint
Random Matrix Using NumPy's Randint
The np.random.randint() function generates random integer values within a specified range. By default, the starting value (low) is 0, and the upper bound is exclusive.
Numbers within the range can repeat, as the function allows
duplicates.
This is useful for creating random numerical datasets for
simulations or testing.
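A minimal sketch; the shapes and bounds are arbitrary:

import numpy as np

# 3x3 matrix of random integers in [0, 10); values may repeat.
m = np.random.randint(10, size=(3, 3))

# Explicit low and high: random integers in [5, 15).
m2 = np.random.randint(5, 15, size=(3, 3))

print(m)
print(m2)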
Rand
Random Matrix Using NumPy's Rand
The np.random.rand() function in NumPy generates random numbers following a uniform distribution over the half-open interval [0, 1), where every value in that range has an equal chance of being selected. The output values are floating-point numbers.
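A minimal sketch; the 2 x 4 shape is arbitrary:

import numpy as np

# 2x4 matrix of uniformly distributed floats in [0, 1).
u = np.random.rand(2, 4)
print(u)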
Randn
Random Matrix Using NumPy's Randn
The np.random.randn() function is a convenient method to create
matrices filled with random values drawn from a standard normal
distribution (mean of 0 and standard deviation of 1). This is often
useful in simulations, algorithm testing, or initializing parameters
in machine learning models. The generated values can be positive or negative.
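A minimal sketch; the 3 x 3 shape is arbitrary:

import numpy as np

# 3x3 matrix drawn from the standard normal distribution (mean 0, std 1).
# The values can be positive or negative.
n = np.random.randn(3, 3)
print(n)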
Uniform
Random Matrix Using NumPy's Uniform
The numpy.random.uniform() function provides a method for generating random floating-point numbers distributed uniformly within a specified range. Here's how it works:
Specified Range: If both low and high values are provided, the function generates random floats within the half-open interval [low, high); because the values are continuous floats, repeated values are practically impossible.
Single Value: Note that, unlike randint, a single positional argument is treated as the low value (with high keeping its default of 1.0), not as the upper bound.
Default Behavior: If no arguments are given, the function generates random float values within the range [0, 1).
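A minimal sketch; the bounds and sizes are arbitrary:

import numpy as np

# Five floats drawn uniformly from [2.0, 5.0).
a = np.random.uniform(2.0, 5.0, size=5)

# No low/high given: defaults to the range [0, 1).
b = np.random.uniform(size=4)

print(a)
print(b)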
Choice
Random Matrix Using NumPy's Choice
The np.random.choice() function selects a random value from a specified sequence (e.g., a list or array).
If a single integer n is given instead of a sequence, it behaves as if the sequence were np.arange(n) and selects a random value from 0 up to (but not including) n.
The selected elements can be repeated by default. To prevent
repetition, set the parameter replace=False.
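A minimal sketch; the sequence and sizes are arbitrary:

import numpy as np

# Pick 3 values from a sequence; repeats are allowed by default.
picks = np.random.choice([10, 20, 30, 40], size=3)

# A single integer n behaves like np.arange(n): one value from 0 to 4.
idx = np.random.choice(5)

# replace=False prevents the same element from being chosen twice.
unique_picks = np.random.choice([10, 20, 30, 40], size=3, replace=False)

print(picks, idx, unique_picks)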
Arange
The np.arange() function generates a sequence of numerical
values, such as integers or floats, based on the provided range
and step size.
It accepts a flexible number of positional arguments:
1. np.arange(start, stop): Generates numbers starting from
start (inclusive) to stop (exclusive).
2. np.arange(start, stop, step): Additionally specifies a step size
to control the spacing between consecutive values.
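A minimal sketch of both call forms; the bounds and step are arbitrary:

import numpy as np

print(np.arange(2, 8))         # [2 3 4 5 6 7]          start inclusive, stop exclusive
print(np.arange(0, 1, 0.25))   # [0.   0.25 0.5  0.75]  a float step controls the spacing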
Identity Matrix
An identity matrix is a square matrix where all diagonal elements
are 1, and all off-diagonal elements are 0. It is widely used in data
science, especially in linear algebra and machine learning, for
matrix operations and solving systems of equations.
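A minimal sketch; the 4 x 4 size is arbitrary:

import numpy as np

# 4x4 identity matrix: ones on the diagonal, zeros elsewhere.
I = np.identity(4)
print(I)

# Multiplying a conformable matrix by I leaves its values unchanged.
A = np.arange(16).reshape(4, 4)
print(np.array_equal(A @ I, A))   # True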
Reshape
The reshape function is used to change or modify the shape of a
NumPy array without altering its data.
The product of the dimensions specified in the reshape arguments
must equal the total number of elements in the original array.
Now let's try to reshape an existing array:
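A minimal sketch, assuming a 12-element array as the starting point:

import numpy as np

arr = np.arange(12)           # 12 elements: 0 .. 11

print(arr.reshape(3, 4))      # valid: 3 x 4 = 12
print(arr.reshape(2, 2, 3))   # valid: 2 x 2 x 3 = 12
# arr.reshape(5, 3) would raise an error because 5 x 3 != 12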
Scalar operations on arrays
A scalar operation involves a single number (scalar) operating on
every element of an array. These operations are fundamental
in data science for data preprocessing, feature scaling,
normalization, and other numerical transformations.
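A minimal sketch; the array values and scalars are arbitrary:

import numpy as np

arr = np.array([10, 20, 30, 40])

print(arr + 5)    # [15 25 35 45]  the scalar is added to every element
print(arr * 2)    # [20 40 60 80]
print(arr / 10)   # [1. 2. 3. 4.]  e.g. a simple feature-scaling step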
Arithmetic operation
Arithmetic operations refer to performing calculations such as addition, subtraction, multiplication, division, or modulo directly on each element within a NumPy array. These operations
are applied uniformly across all elements, enabling efficient and
concise computations that are crucial in data science workflows.
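A minimal sketch of element-wise arithmetic between two arrays; the values are arbitrary:

import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

print(a + b)   # [11 22 33 44]    element-wise addition
print(b - a)   # [ 9 18 27 36]
print(a * b)   # [ 10  40  90 160]
print(b % a)   # [0 0 0 0]        element-wise modulo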
Relational operation
Relational operators, also referred to as comparison operators,
are used to evaluate the relationship between values within a
dataset. These operators return a Boolean array where each
element is either True or False, based on whether the specified
condition is satisfied for the operands being compared.
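A minimal sketch; the array and thresholds are arbitrary:

import numpy as np

arr = np.array([5, 12, 7, 20])

print(arr > 10)        # [False  True False  True]
print(arr == 7)        # [False False  True False]
print(arr[arr > 10])   # [12 20]  a Boolean mask used to filter the data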
Vector Operation