
UNIT 1

What is Data Science?


Data science is the science of analyzing raw data using statistics and machine
learning techniques with the purpose of drawing conclusions about that
information.
Facets of data:
1. Structured
2. Semi-structured
3. Unstructured
• Natural language
• Machine-generated
• Graph-based
• Audio, video, and images
• Streaming
1. Structured Data:
1. Structured data is data that can be stored in a relational (SQL) database, in tables with rows and columns.
2. It has relational keys and can easily be mapped into pre-designed fields.
3. Today, structured data is the most processed kind of data in development and the simplest way to manage information.
4. But structured data represents only 5 to 10% of all informatics data.
2. Semi-Structured Data:
1. Semi-structured data is information that doesn't reside in a relational database but does have some organizational properties that make it easier to analyze.
2. With some processing, it can be stored in a relational database (this can be very hard for some kinds of semi-structured data), but the semi-structure exists to ease storage, clarity, or computation.
3. Like structured data, semi-structured data represents only a small part of all data (5 to 10%).
Examples of semi-structured data: JSON, CSV, and XML documents are semi-structured documents.
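To make this concrete, here is a minimal sketch (with made-up field names) showing how a semi-structured JSON record can be parsed in Python using the standard json module; the record is self-describing but follows no fixed relational schema:

import json

# A hypothetical semi-structured record: the field names describe the data,
# but there is no fixed relational schema ("address" is nested, and other
# records might omit "skills" entirely).
raw = '{"id": 101, "name": "Asha", "address": {"city": "Chennai", "pin": "600001"}, "skills": ["SQL", "Python"]}'

record = json.loads(raw)          # parse the JSON text into a Python dict
print(record["name"])             # Asha
print(record["address"]["city"])  # Chennai
print(record["skills"])           # ['SQL', 'Python']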
3. Unstructured Data:
1. Unstructured data represents around 80% of all data.
2. It often includes text and multimedia content.
3. Examples include e-mail messages, word processing documents, videos, photos, audio files, presentations, webpages, and many other kinds of business documents.
4. Unstructured data is everywhere.
5. In fact, most individuals and organizations conduct their lives around unstructured data.
6. Just as with structured data, unstructured data is either machine-generated or human-generated.
• Here are some examples of machine-generated unstructured data:
Satellite images: This includes weather data or the data that the government captures in
its satellite surveillance imagery. Just think about Google Earth, and you get the picture.
Photographs and video: This includes security, surveillance, and traffic video.
Radar or sonar data: This includes vehicular, meteorological, and seismic or oceanographic data.
• The following list shows a few examples of human-generated unstructured data:
Social media data: This data is generated from social media platforms such as
YouTube, Facebook, Twitter, LinkedIn, and Flickr.
Mobile data: This includes data such as text messages and location information.
Website content: This comes from any site delivering unstructured content, like YouTube,
Flickr, or Instagram.
i)Natural Language:
Natural language is a special type of unstructured data; it’s challenging to process
because it requires knowledge of specific data science techniques and linguistics.
The natural language processing community has had success in entity recognition, topic
recognition, summarization, and sentiment analysis, but models trained in one domain
don’t generalize well to other domains.
ii)Graph based or Network Data:
In graph theory, a graph is a mathematical structure to model pair-wise relationships
between objects.
Graph or network data is, in short, data that focuses on the relationship or adjacency of
objects.
The graph structures use nodes, edges, and properties to represent and store graphical
data. Graph-based data is a natural way to represent social networks.
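As a small illustration, the sketch below builds a toy social network with the third-party networkx library (assumed to be installed; the names are hypothetical) to show how nodes, edges, and properties represent graph data:

import networkx as nx  # third-party library, assumed installed (pip install networkx)

# A tiny social network: nodes are people, edges are "knows" relationships.
G = nx.Graph()
G.add_edge("Alice", "Bob")
G.add_edge("Bob", "Carol")
G.add_edge("Alice", "Carol")
G.add_edge("Carol", "Dave")

# Properties can be attached to nodes and edges directly.
G.nodes["Alice"]["city"] = "Mumbai"

print(G.number_of_nodes(), "people,", G.number_of_edges(), "relationships")
print("People connected to Carol:", list(G.neighbors("Carol")))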
iii)Audio, Image & Video:
Audio, image, and video are data types that pose specific challenges to a data scientist.
MLBAM (Major League Baseball Advanced Media) announced in 2014 that they’ll
increase video capture to approximately 7 TB per game for the purpose of live, in-game
analytics. High-speed cameras at stadiums will capture ball and athlete movements to
calculate in real time, for example, the path taken by a defender relative to two baselines.
iv)Streaming Data:
Streaming data is data that is generated continuously by thousands of data sources, which
typically send in the data records simultaneously, and in small sizes (order of Kilobytes).
Examples include log files generated by customers using your mobile or web applications,
online game activity, "What's trending" on Twitter, live sporting or music events, and the
stock market.

Why do we need data science?


One of the reasons for the acceleration of data science in recent years is the enormous
volume of data currently available and being generated. Not only are huge amounts of
data being collected about many aspects of the world and our lives, but we concurrently
have the rise of inexpensive computing. This has formed the perfect storm in which we
have rich data and the tools to analyze it: advancing computer memory capacities, more
enhanced software, more capable processors, and now, more data scientists
with the skills to put all of this to use and answer questions using the data.
What is big data?
We frequently hear the term Big Data. So it deserves an introduction here – since it has
been so integral to the rise of data science.

What does big data mean?

Big Data literally means large amounts of data. Big data is the pillar behind the idea that
one can make useful inferences with a large body of data that wasn’t possible before with
smaller datasets. So extremely large data sets may be analyzed computationally to reveal
patterns, trends, and associations that are not transparent or easy to identify.

How much data is Big Data?


• Google processes 20 Petabytes(PB) per day (2008)
• Facebook has 2.5 PB of user data + 15 TB per day (2009)
• eBay has 6.5 PB of user data + 50 TB per day (2009)
• CERN’s Large Hadron Collider(LHC) generates 15 PB a year
What is data?
A set of values of qualitative or quantitative variables.

Top 10 Data Analytics Challenges for Businesses


Data on its own isn’t all that useful—it’s the analysis of data that lets teams make more
informed decisions and respond better to changing business conditions. Data analytics as a
process is central to an organization becoming truly data-driven. However, crafting,
implementing, and running a data analytics strategy takes time and effort, and the process
comes with some well-known yet formidable challenges.
1. Data quality

One of the biggest challenges most businesses face is ensuring that the data they collect is
reliable. When data suffers from inaccuracy, incompleteness, inconsistencies, and duplication,
that can lead to incorrect insights and poor decision-making. There are many tools available
for data preparation, deduplication, and enhancement, and ideally some of this functionality
is built into your analytics platform.

Non-standardized data can also be an issue—for example, when units, currencies, or date
formats vary. Standardizing as much as possible, as early as possible, will minimize cleansing
efforts and enable better analysis.
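As a small, hedged example of such standardization (hypothetical column names, using pandas; pd.to_datetime with format="mixed" requires pandas 2.0 or newer), dates and units can be normalized early in the pipeline:

import pandas as pd

# Hypothetical order data with inconsistent date formats and mixed weight units.
df = pd.DataFrame({
    "order_date": ["2024-01-05", "January 5, 2024", "05-Jan-2024"],
    "weight_kg": [1.2, None, 0.8],
    "weight_lb": [None, 3.3, None],
})

# Standardize dates into a single datetime type; unparseable values become NaT.
df["order_date"] = pd.to_datetime(df["order_date"], format="mixed", errors="coerce")

# Standardize weights to a single unit (kilograms) and drop the redundant column.
df["weight_kg"] = df["weight_kg"].fillna(df["weight_lb"] * 0.4536)
df = df.drop(columns=["weight_lb"])
print(df)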

By implementing solutions such as data validation, data cleansing, and proper data
governance, organizations can ensure their data is accurate, consistent, complete, accessible,
and secure. This high-quality data can act as the fuel for effective data analysis and
ultimately lead to better decision-making.

2. Data access

Companies often have data scattered across multiple systems and departments, and in
structured, unstructured, and semi-structured formats. This makes it both difficult to
consolidate and analyze and vulnerable to unauthorized use. Disorganized data poses
challenges for analytics, machine learning, and artificial intelligence projects that work best
with as much data as possible to draw from.

For many companies, the goal is democratization—granting data access across the entire
organization regardless of department. To achieve this while also guarding against
unauthorized access, companies should gather their data in a central repository, such as a
data lake, or connect it directly to analytics applications using APIs and other integration
tools. IT departments should strive to create streamlined data workflows with built-in
automation and authentication to minimize data movement, reduce compatibility or format
issues, and keep a handle on what users and systems have access to their information.

3. Bad visualizations

Transforming data into graphs or charts through data visualization efforts helps present
complex information in a tangible, accurate way that makes it easier to understand. But using
the wrong visualization method or including too much data can lead to misleading
visualizations and incorrect conclusions. Input errors and oversimplified visualizations could
also cause the resulting report to misrepresent what’s actually going on.
Effective data analytics systems support report generation, provide guidance on
visualizations, and are intuitive enough for business users to operate. Otherwise, the burden
of preparation and output falls on IT, and the quality and accuracy of visualizations can be
questionable. To avoid this, organizations must make sure that the system they choose can
handle structured, unstructured, and semi-structured data.
So how do you achieve effective data visualization? Start with the following three key
concepts:

Know your audience: Tailor your visualization to the interests of your viewers. Avoid
technical jargon or complex charts and be selective about the data you include. A CEO wants
very different information than a department head.
Start with a clear purpose: What story are you trying to tell with your data? What key
message do you want viewers to take away? Once you know this, you can choose the most
appropriate chart type. To that end, don’t just default to a pie or bar chart. There are many
visualization options, each suited for different purposes. Line charts show trends over time,
scatter plots reveal relationships between variables, and so on.
Keep it simple: Avoid cluttering your visualization with unnecessary elements. Use clear
labels, concise titles, and a limited color palette for better readability. Avoid misleading
scales, distorted elements, or chart types that might misrepresent the data.
4. Data privacy and security

Controlling access to data is a never-ending challenge that requires data classification as well
as security technology.

At a high level, careful attention must be paid to who is allowed into critical operational
systems to retrieve data, since any damage done here can bring a business to its knees.
Similarly, businesses need to make sure that when users from different departments log into
their dashboards, they see only the data that they should see. Businesses must establish
strong access controls and ensure that their data storage and analytics systems are secure
and compliant with data privacy regulations at every step of the data collection, analysis, and
distribution process.

Before you can decide which roles should have access to various types or pools of data, you
need to understand what that data is. That requires setting up a data classification system.
To get started, consider the following steps:
See what you have: Identify the types of data your organization collects, stores, and
processes, then label it based on sensitivity, potential consequences of a breach, and
regulations it’s subject to, such as HIPAA or GDPR.
Develop a data classification matrix: Define a schema with different categories, such as
public, confidential, and internal use only, and establish criteria for applying these
classifications to data based on its sensitivity, legal requirements, and your company policies.
See who might want access: Outline roles and responsibilities for data classification,
ownership, and access control. A finance department employee will have different access
rights than a member of the HR team, for example.
Then, based on the classification policy, work with data owners to categorize your data. Once
a scheme is in place, consider data classification tools that can automatically scan and
categorize data based on your defined rules.

Finally, set up appropriate data security controls and train your employees on them,
emphasizing the importance of proper data handling and access controls.

5. Talent shortage

Many companies can’t find the talent they need to turn their vast supplies of data into usable
information. The demand for data analysts, data scientists, and other data-related roles has
outpaced the supply of qualified professionals with the necessary skills to handle complex
data analytics tasks. And there’s no signs of that demand leveling out, either. By 2026, the
number of jobs requiring data science skills is projected to grow by nearly 28%, according to
the US Bureau of Labor Statistics.

Fortunately, many analytics systems today offer advanced data analytics capabilities, such as
built-in machine learning algorithms, that are accessible to business users without
backgrounds in data science. Tools with automated data preparation and cleaning
functionalities, in particular, can help data analysts get more done.

Companies can also upskill, identifying employees with strong analytical or technical
backgrounds who might be interested in transitioning to data roles and offering paid
training programs, online courses, or data bootcamps to equip them with the necessary
skills.

6. Too many analytics systems and tools

It’s not uncommon that, once an organization embarks on a data analytics strategy, it ends
up buying separate tools for each layer of the analytics process. Similarly, if departments act
autonomously, they may wind up buying competing products with overlapping or
counteractive capabilities; this can also be an issue when companies merge.

The result is a hodgepodge of technology, and if it’s deployed on-premises, then somewhere
there’s a data center full of different software and licenses that must be managed.
Altogether, this can lead to waste for the business and add unnecessary complexity to the
architecture. To prevent this, IT leaders should create an organization-wide strategy for data
tools, working with various department heads to understand their needs and requirements.
Issuing a catalog that includes various cloud-based options can help get everyone on a
standardized platform.

7. Cost

Data analytics requires investment in technology, staff, and infrastructure. But unless
organizations are clear on the benefits they’re getting from an analytics effort, IT teams may
struggle to justify the cost of implementing the initiative properly.
Deploying a data analytics platform via a cloud-based architecture can eliminate most
upfront capital expenses while reducing maintenance costs. It can also rein in the problem of
too many one-off tools.
Operationally, an organization’s return on investment comes from the insights that data
analytics can reveal to optimize marketing, operations, supply chains, and other business
functions. To show ROI, IT teams must work with stakeholders to define clear success metrics
that tie back to business goals. Examples might be that findings from data analytics led to a
10% increase in revenue, an 8% reduction in customer churn, or a 15% improvement in
operational efficiency. Suddenly, that cloud service seems like a bargain.

While quantifiable data is important, some benefits might be harder to measure directly, so
IT teams need to think beyond just line-item numbers. For example, a data project might
improve decision-making agility or customer experience, which can lead to long-term gains.

8. Changing technology

The data analytics landscape is constantly evolving, with new tools, techniques, and
technologies emerging all the time. For example, the race is currently on for companies to
get advanced capabilities such as artificial intelligence (AI) and machine learning (ML) into
the hands of business users as well as data scientists. That means introducing new tools that
make these techniques accessible and relevant. But for some organizations, new analytics
technologies may not be compatible with legacy systems and processes. This can cause data
integration challenges that require greater transformations or custom-coded connectors to
resolve.
Evolving feature sets also mean continually evaluating the best product fit for an
organization’s particular business needs. Again, using cloud-based data analytics tools can
smooth over feature and functionality upgrades, as the provider will ensure the latest version
is always available. Compare that to an on-premises system that might only be updated
every year or two, leading to a steeper learning curve between upgrades.

9. Resistance to change

Applying data analytics often requires what can be an uncomfortable level of change.
Suddenly, teams have new information about what’s happening in the business and different
options for how they should react. Leaders accustomed to operating on intuition rather than
data may also feel challenged—or even threatened—by the shift.

To prevent such a backlash, IT staff should collaborate with individual departments to
understand their data needs, then communicate how new analytics software can improve
their processes. As part of the rollout, IT teams can show how data analytics advancements
lead to more efficient workflows, deeper data insights, and ultimately, better decision-
making across the business.

10. Goal setting

Without clear goals and objectives, businesses will struggle to determine which data sources
to use for a project, how to analyze data, what they want to do with results, and how they’ll
measure success. A lack of clear goals can lead to unfocused data analytics efforts that don’t
deliver meaningful insights or returns. This can be mitigated by defining the objectives and
key results of a data analytics project before it begins.

Data Science Process


What is Data Science?
Data can prove very fruitful if we know how to manipulate it to uncover the hidden
patterns within it. The logic behind the data, or the process behind this manipulation, is
what is known as Data Science. The Data Science process covers everything from formulating
the problem statement and collecting data to extracting the required results from it, and the
professional who ensures that the whole process goes smoothly is known as the
Data Scientist. But there are other job roles in this domain as well, such as:
1. Data Engineers : They build and maintain data pipelines.
2. Data Analysts: They focus on interpreting data and generating reports.
3. Data Architect : They design data management systems.
4. Machine Learning Engineer : They develop and deploy predictive models.
5. Deep Learning Engineer : They create more advanced AI models to process complex
data.
Data Science Process Life Cycle
Some steps are necessary for any of the tasks that are being done in the field of data
science to derive any fruitful results from the data at hand.
• Data Collection – After formulating any problem statement, the main task is to
collect data that can help us in our analysis and manipulation. Sometimes data is
collected by performing some kind of survey, and at other times it is done by
web scraping.
• Data Cleaning – Most of the real-world data is not structured and requires cleaning
and conversion into structured data before it can be used for any analysis or modeling.
• Exploratory Data Analysis – This is the step in which we try to find the hidden
patterns in the data at hand. We also analyze the different factors that affect the
target variable and the extent to which they do so. How the independent features are
related to each other, and what can be done to achieve the desired results, are
questions this process answers as well. It also gives us a direction in
which to work when we start the modeling process.
• Model Building – Different types of machine learning algorithms and techniques
have been developed that can easily identify complex patterns in the data, a task that
would be very tedious for a human to do.
• Model Deployment – After a model is developed and gives good results on the
holdout or real-world dataset, we deploy it and monitor its performance. This is
the main part, where we apply what we have learned from the data to real-world
applications and use cases.
Key Components of Data Science Process
Data Science is a very vast field and to get the best out of the data at hand one has to apply
multiple methodologies and use different tools to make sure the integrity of the data remains
intact throughout the process keeping data privacy in mind. If we try to point out the main
components of Data Science then it would be:
• Data Analysis – There are times when there is no need to apply advanced deep
learning or other complex methods to the data at hand to derive patterns from it. For
this reason, before moving on to the modeling part, we first perform an exploratory data
analysis to get a basic idea of the data and the patterns available in it; this gives
us a direction to work in if we later want to apply more complex analysis methods to our
data.
• Statistics – It is a natural phenomenon that many real-life datasets follow a normal
distribution. And when we already know that a particular dataset follows some known
distribution then most of its properties can be analyzed at once. Also, descriptive
statistics and correlation and covariances between two features of the dataset help us
get a better understanding of how one factor is related to the other in our dataset.
• Data Engineering – When we deal with a large amount of data, we have to make
sure that the data is kept safe from online threats and that it is easy to retrieve and
modify. Data Engineers play a crucial role in ensuring that the data is used
efficiently.
• Advanced Computing
o Machine Learning – Machine Learning has opened new horizons and
helped us build advanced applications and methodologies so
that machines become more efficient, provide a personalized
experience to each individual, and perform in an instant tasks that
earlier required heavy human labor and time.
o Deep Learning – This is also a part of Artificial Intelligence and Machine
Learning but it is a bit more advanced than machine learning itself. High
computing power and a huge corpus of data have led to the emergence of
this field in data science.
Knowledge and Skills for Data Science Professionals
Becoming proficient in Data Science requires a combination of skills, including:
• Statistics: Wikipedia defines it as the study of the collection, analysis, interpretation,
presentation, and organization of data. Therefore, it shouldn’t be a surprise that data
scientists need to know statistics.
• Programming Language R/Python: Python and R are two of the most widely used
languages by Data Scientists. The primary reason is the number of packages available
for numeric and scientific computing.
• Data Extraction, Transformation, and Loading: Suppose we have multiple data
sources, such as a MySQL database, MongoDB, and Google Analytics. You have to extract data from
such sources and then transform it into a proper format or structure for the
purposes of querying and analysis. Finally, you have to load the data into the data
warehouse, where you will analyze it. So, for people from an ETL (Extract,
Transform, and Load) background, Data Science can be a good career option.
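A minimal ETL sketch is shown below, assuming a hypothetical CSV export named sales_export.csv and using SQLite as a stand-in warehouse; real sources such as MySQL or MongoDB would need their own connectors, so treat the file, column, and table names as illustrative only:

import sqlite3
import pandas as pd

# --- Extract: read raw data from a (hypothetical) CSV export of a source system.
raw = pd.read_csv("sales_export.csv")

# --- Transform: standardize the data into a structure suited to querying and analysis.
raw["order_date"] = pd.to_datetime(raw["order_date"], errors="coerce")
raw["amount"] = raw["amount"].fillna(0).astype(float)
clean = raw.dropna(subset=["order_date"]).drop_duplicates()

# --- Load: write the cleaned table into the warehouse (SQLite stands in here).
conn = sqlite3.connect("warehouse.db")
clean.to_sql("sales_fact", conn, if_exists="replace", index=False)
conn.close()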
Steps for Data Science Processes:
Step 1: Define the Problem and Create a Project Charter
Clearly defining the research goals is the first step in the Data Science Process.
A project charter outlines the objectives, resources, deliverables, and timeline, ensuring
that all stakeholders are aligned.
Step 2: Retrieve Data
Data can be stored in databases, data warehouses, or data lakes within an organization.
Accessing this data often involves navigating company policies and requesting
permissions.
Step 3: Data Cleansing, Integration, and Transformation
Data cleaning ensures that errors, inconsistencies, and outliers are removed. Data
integration combines datasets from different sources, while data
transformation prepares the data for modeling by reshaping variables or creating new
features.
Step 4: Exploratory Data Analysis (EDA)
During EDA, various graphical techniques like scatter plots, histograms, and box plots
are used to visualize data and identify trends. This phase helps in selecting the right
modeling techniques.
Step 5: Build Models
In this step, machine learning or deep learning models are built to make predictions or
classifications based on the data. The choice of algorithm depends on the complexity of
the problem and the type of data.
Step 6: Present Findings and Deploy Models
Once the analysis is complete, results are presented to stakeholders. Models are deployed
into production systems to automate decision-making or support ongoing analysis.
Benefits and uses of data science and big data
• Governmental organizations are also aware of data’s value. A data scientist in a
governmental organization gets to work on diverse projects such as detecting fraud
and other criminal activity or optimizing project funding.
• Nongovernmental organizations (NGOs) are also no strangers to using data. They
use it to raise money and defend their causes. The World Wildlife Fund (WWF), for
instance, employs data scientists to increase the effectiveness of their fundraising
efforts.
• Universities use data science in their research but also to enhance the study
experience of their students. Example: MOOCs (Massive Open Online Courses).
Tools for Data Science Process
As time has passed, the tools used to perform different tasks in Data Science have evolved to a
great extent. Software such as Matlab and Power BI, and programming languages
such as Python and R, provide many utility features that help us
complete even the most complex tasks efficiently and within a very limited time.
Usage of Data Science Process
The Data Science Process is a systematic approach to solving data-related problems and
consists of the following steps:
1. Problem Definition: Clearly defining the problem and identifying the goal of the
analysis.
2. Data Collection: Gathering and acquiring data from various sources, including data
cleaning and preparation.
3. Data Exploration: Exploring the data to gain insights and identify trends, patterns, and
relationships.
4. Data Modeling: Building mathematical models and algorithms to solve problems and
make predictions.
5. Evaluation: Evaluating the model’s performance and accuracy using appropriate
metrics.
6. Deployment: Deploying the model in a production environment to make predictions or
automate decision-making processes.
7. Monitoring and Maintenance: Monitoring the model’s performance over time and
making updates as needed to improve accuracy.
Challenges in the Data Science Process
1. Data Quality and Availability: Data quality can affect the accuracy of the models
developed and therefore, it is important to ensure that the data is accurate, complete,
and consistent. Data availability can also be an issue, as the data required for analysis
may not be readily available or accessible.
2. Bias in Data and Algorithms: Bias can exist in data due to sampling techniques,
measurement errors, or imbalanced datasets, which can affect the accuracy of models.
Algorithms can also perpetuate existing societal biases, leading to unfair or
discriminatory outcomes.
3. Model Overfitting and Underfitting: Overfitting occurs when a model is too complex
and fits the training data too well, but fails to generalize to new data. On the other
hand, underfitting occurs when a model is too simple and is not able to capture the
underlying relationships in the data.
4. Model Interpretability: Complex models can be difficult to interpret and understand,
making it challenging to explain the model's decisions and reasoning. This can be an
issue when it comes to making business decisions or gaining stakeholder buy-in.
5. Privacy and Ethical Considerations: Data science often involves the collection and
analysis of sensitive personal information, leading to privacy and ethical concerns. It is
important to consider privacy implications and ensure that data is used in a responsible
and ethical manner.
What is Exploratory Data Analysis (EDA)?
Exploratory Data Analysis (EDA) is a crucial initial step in data science projects. It involves
analyzing and visualizing data to understand its key characteristics, uncover patterns,
locate outliers, and identify relationships between variables. EDA is normally carried out
as a preliminary step before undertaking more formal statistical analyses or modeling.
Key aspects of EDA include:
• Distribution of Data: Examining the distribution of data points to understand their
range, central tendencies (mean, median), and dispersion (variance, standard
deviation).
• Graphical Representations: Utilizing charts such as histograms, box plots, scatter
plots, and bar charts to visualize relationships within the data and distributions of
variables.
• Outlier Detection: Identifying unusual values that deviate from other data points.
Outliers can influence statistical analyses and might indicate data entry errors or
unique cases.
• Correlation Analysis: Checking the relationships between variables to understand
how they might affect each other. This includes computing correlation coefficients and
creating correlation matrices.
• Handling Missing Values: Detecting and deciding how to address missing data
points, whether by imputation or removal, depending on their impact and the amount of
missing data.
• Summary Statistics: Calculating key statistics that provide insight into data trends and
nuances.
• Testing Assumptions: Many statistical tests and models assume the data meet
certain conditions (like normality or homoscedasticity). EDA helps verify these
assumptions.
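The sketch below (a minimal example with a hypothetical DataFrame) shows how several of the aspects listed above (summary statistics, missing values, correlations, and a simple outlier check) can be computed with pandas:

import pandas as pd

# Hypothetical dataset; in practice df would come from pd.read_csv(...) or a database.
df = pd.DataFrame({
    "age":    [23, 31, 35, 29, 41, 28, None, 33],
    "income": [28000, 42000, 51000, 39000, 72000, 36000, 40000, 300000],  # last value looks like an outlier
})

print(df.describe())                 # summary statistics: mean, std, quartiles, etc.
print(df.isnull().sum())             # count of missing values per column
print(df.corr(numeric_only=True))    # correlation matrix between numeric variables

# A quick rule-of-thumb outlier check using the interquartile range (IQR).
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
print(df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)])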
Types of Exploratory Data Analysis
EDA, or Exploratory Data Analysis, refers to the process of analyzing and examining
data sets to uncover patterns, identify relationships, and gain insights. There are
various kinds of EDA techniques that can be employed depending on the nature of the data and
the goals of the analysis. Depending on the number of columns we are analyzing, we
can divide EDA into three types: univariate, bivariate, and multivariate.
1. Univariate Analysis
Univariate analysis focuses on a single variable to understand its internal structure. It is
primarily concerned with describing the data and finding patterns existing in a single
feature. This kind of analysis examines individual variables within
the data set. It involves summarizing and visualizing a single variable at a time to
understand its distribution, central tendency, spread, and other relevant characteristics.
Common techniques include:
• Histograms: Used to visualize the distribution of a variable.
• Box plots: Useful for detecting outliers and understanding the spread and skewness of
the data.
• Bar charts: Employed for categorical data to show the frequency of each category.
• Summary statistics: Calculations like mean, median, mode, variance, and standard
deviation that describe the central tendency and dispersion of the data.
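The following minimal sketch (hypothetical marks data, using pandas and matplotlib) applies these univariate techniques to a single variable:

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical single variable: exam marks out of 100.
marks = pd.Series([45, 67, 72, 58, 90, 61, 75, 49, 88, 95, 67, 70])

print(marks.describe())              # mean, std, min, quartiles, max
print("mode:", marks.mode().tolist())

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].hist(marks, bins=6)          # histogram: shape of the distribution
axes[0].set_title("Histogram of marks")
axes[1].boxplot(marks, vert=False)   # box plot: spread, skewness, outliers
axes[1].set_title("Box plot of marks")
plt.tight_layout()
plt.show()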
2. Bivariate Analysis
Bivariate analysis involves exploring the relationship between two variables. It helps find
associations, correlations, and dependencies between pairs of variables, making it
a crucial form of exploratory data analysis. Some key techniques used in bivariate analysis:
• Scatter Plots: These are one of the most common tools used in bivariate analysis. A
scatter plot helps visualize the relationship between two continuous variables.
• Correlation Coefficient: This statistical measure (often Pearson’s correlation
coefficient for linear relationships) quantifies the degree to which two variables are
related.
• Cross-tabulation: Also known as contingency tables, cross-tabulation is used to
analyze the relationship between two categorical variables. It shows the frequency
distribution of categories of one variable in rows and the other in columns, which helps
in understanding the relationship between the two variables.
• Line Graphs: In the context of time series data, line graphs can be used to compare
two variables over time. This helps in identifying trends, cycles, or patterns that emerge
in the interaction of the variables over the specified period.
• Covariance: Covariance is a measure used to determine how much two random
variables change together. However, it is sensitive to the scale of the variables, so it’s
often supplemented by the correlation coefficient for a more standardized assessment
of the relationship.
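A minimal sketch of bivariate analysis is given below, using hypothetical hours-studied and marks data to compute Pearson's correlation, draw a scatter plot, and build a cross-tabulation of two categorical variables:

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data: hours studied versus marks obtained.
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6, 7, 8],
    "marks": [35, 45, 50, 58, 62, 70, 74, 85],
})

# Correlation coefficient (Pearson's r) quantifies the linear relationship.
print("Pearson r:", df["hours"].corr(df["marks"]))

# Scatter plot visualizes the relationship between two continuous variables.
plt.scatter(df["hours"], df["marks"])
plt.xlabel("Hours studied")
plt.ylabel("Marks")
plt.show()

# Cross-tabulation (contingency table) for two categorical variables.
cats = pd.DataFrame({"gender": ["M", "F", "F", "M", "F"],
                     "passed": ["yes", "yes", "no", "no", "yes"]})
print(pd.crosstab(cats["gender"], cats["passed"]))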
3. Multivariate Analysis
Multivariate analysis examines the relationships among more than two variables in the
dataset. It aims to understand how variables interact with one another, which is crucial for
most statistical modeling techniques. Techniques include:
• Pair plots: Visualize relationships across several variables simultaneously to capture a
comprehensive view of potential interactions.
• Principal Component Analysis (PCA): A dimensionality reduction technique used to
reduce the dimensionality of large datasets, while preserving as much variance as
possible.
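The sketch below shows PCA on a small synthetic dataset using scikit-learn (assumed to be installed); the data, feature count, and number of components are illustrative choices only:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic dataset with four correlated numeric features (20 rows).
rng = np.random.default_rng(0)
base = rng.normal(size=(20, 2))
X = np.hstack([base, base + rng.normal(scale=0.1, size=(20, 2))])

# Standardize first, since PCA is sensitive to the scale of the variables.
X_scaled = StandardScaler().fit_transform(X)

# Reduce from 4 dimensions to 2 while preserving as much variance as possible.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print("Reduced shape:", X_reduced.shape)                      # (20, 2)
print("Explained variance ratio:", pca.explained_variance_ratio_)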
Specialized EDA Techniques
In addition to univariate and multivariate analysis, there are specialized EDA techniques
tailored for specific types of data or analysis needs:
• Spatial Analysis: For geographical data, using maps and spatial plotting to
understand the geographical distribution of variables.
• Text Analysis: Involves techniques like word clouds, frequency distributions, and
sentiment analysis to explore text data.
• Time Series Analysis: This type of analysis is mainly applied to data sets that
have a temporal component. Time series analysis involves inspecting and
modeling patterns, trends, and seasonality in the data over time.
Techniques like line plots, autocorrelation analysis, moving averages, and ARIMA
(AutoRegressive Integrated Moving Average) models are commonly used in time
series analysis.
Tools for Performing Exploratory Data Analysis
Exploratory Data Analysis (EDA) can be effectively performed using a variety of tools and
software, each offering unique features suitable for handling different types of data and
analysis requirements.
1. Python Libraries
• Pandas: Provides extensive functions for data manipulation and analysis, including
data structure handling and time series functionality.
• Matplotlib: A plotting library for creating static, interactive, and animated visualizations
in Python.
• Seaborn: Built on top of Matplotlib, it provides a high-level interface for drawing
attractive and informative statistical graphics.
• Plotly: An interactive graphing library for making interactive plots and offers more
sophisticated visualization capabilities.
Steps for Performing Exploratory Data Analysis
Performing Exploratory Data Analysis (EDA) involves a series of steps designed to help
you understand the data you’re working with, uncover underlying patterns, identify
anomalies, test hypotheses, and ensure the data is clean and suitable for further analysis.

What is Data Cleaning?


Data cleaning is a crucial step in the machine learning (ML) pipeline, as it involves identifying
and removing any missing, duplicate, or irrelevant data. The goal of data cleaning is to
ensure that the data is accurate, consistent, and free of errors, as incorrect or inconsistent
data can negatively impact the performance of the ML model. Professional data scientists
usually invest a very large portion of their time in this step because of the belief that “Better
data beats fancier algorithms”.
Data cleaning, also known as data cleansing or data preprocessing, is a crucial step in
the data science pipeline that involves identifying and correcting or removing errors,
inconsistencies, and inaccuracies in the data to improve its quality and usability. Data
cleaning is essential because raw data is often noisy, incomplete, and inconsistent, which
can negatively impact the accuracy and reliability of the insights derived from it.
Steps to Perform Data Cleaning
Performing data cleaning involves a systematic process to identify and
rectify errors, inconsistencies, and inaccuracies in a dataset. The following
are essential steps to perform data cleaning.

• Removal of Unwanted Observations: Identify and eliminate irrelevant or redundant
observations from the dataset. The step involves scrutinizing data entries for duplicate
records, irrelevant information, or data points that do not contribute meaningfully to the
analysis. Removing unwanted observations streamlines the dataset, reducing noise
and improving the overall quality.
• Fixing Structure errors: Address structural issues in the dataset, such as
inconsistencies in data formats, naming conventions, or variable types. Standardize
formats, correct naming discrepancies, and ensure uniformity in data representation.
Fixing structure errors enhances data consistency and facilitates accurate analysis and
interpretation.
• Managing Unwanted outliers: Identify and manage outliers, which are data points
significantly deviating from the norm. Depending on the context, decide whether to
remove outliers or transform them to minimize their impact on analysis. Managing
outliers is crucial for obtaining more accurate and reliable insights from the data.
• Handling Missing Data: Devise strategies to handle missing data effectively. This
may involve imputing missing values based on statistical methods, removing records
with missing values, or employing advanced imputation techniques. Handling missing
data ensures a more complete dataset, preventing biases and maintaining the integrity
of analyses.
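The following minimal sketch (hypothetical employee data, using pandas) walks through the four steps above; the capping rule and placeholder category are illustrative choices, not the only valid ones:

import pandas as pd

# Hypothetical raw dataset with the usual problems: duplicates, inconsistent
# labels, a suspicious outlier, and a missing value.
df = pd.DataFrame({
    "name":   ["Asha", "Ravi", "Ravi", "Meena", "John"],
    "dept":   ["sales", "Sales", "Sales", "HR", None],
    "salary": [32000, 41000, 41000, 38000, 9900000],
})

# 1. Removal of unwanted observations: drop exact duplicate rows.
df = df.drop_duplicates()

# 2. Fixing structure errors: standardize inconsistent category labels.
df["dept"] = df["dept"].str.strip().str.lower()

# 3. Managing unwanted outliers: cap salaries at the 95th percentile (one possible choice).
df["salary"] = df["salary"].clip(upper=df["salary"].quantile(0.95))

# 4. Handling missing data: fill the missing department with a placeholder category.
df["dept"] = df["dept"].fillna("unknown")
print(df)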
What is an outlier?
An outlier is a data point that significantly deviates from the rest of the data. It can be
either much higher or much lower than the other data points, and its presence can have a
significant impact on the results of machine learning algorithms. They can be caused by
measurement or execution errors. The analysis of outlier data is referred to as outlier
analysis or outlier mining.
Types of Outliers
There are two main types of outliers:
• Global outliers: Global outliers are isolated data points that are far away from the
main body of the data. They are often easy to identify and remove.
• Contextual outliers: Contextual outliers are data points that are unusual in a specific
context but may not be outliers in a different context. They are often more difficult to
identify and may require additional information or domain knowledge to determine their
significance.
Algorithm (see the code sketch further below)
1. Calculate the mean of each cluster
2. Initialize the Threshold value
3. Calculate the distance of the test data from each cluster mean
4. Find the nearest cluster to the test data
5. If (Distance > Threshold) then, Outlier

2. Transformation: Transforming the data to reduce the impact of outliers, for example:
o Winsorization: Replacing outlier values with the nearest non-outlier value.
o Log transformation: Applying a logarithmic transformation to compress the
data and reduce the impact of extreme values.
3. Robust Estimation:
• This involves using algorithms that are less sensitive to outliers. Some examples
include:
o Robust regression: Algorithms like L1-regularized regression or Huber
regression are less influenced by outliers than least squares regression.
o M-estimators: These algorithms estimate the model parameters based on a
robust objective function that down weights the influence of outliers.
o Outlier-insensitive clustering algorithms: Algorithms like DBSCAN are
less susceptible to the presence of outliers than K-means clustering.
4. Modeling Outliers:
• This involves explicitly modeling the outliers as a separate group. This can be done by:
o Adding a separate feature: Create a new feature indicating whether a data
point is an outlier or not.
o Using a mixture model: Train a model that assumes the data comes from
a mixture of multiple distributions, where one distribution represents the
outliers.
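Here is a minimal sketch of the cluster-mean distance algorithm listed above, using a toy one-dimensional dataset with two pre-formed clusters; the threshold value is an arbitrary illustrative choice:

import numpy as np

# Toy 1-D data already grouped into two clusters (hypothetical values).
clusters = [
    np.array([10.0, 11.0, 9.5, 10.5]),   # cluster A
    np.array([50.0, 52.0, 49.0, 51.0]),  # cluster B
]

means = [c.mean() for c in clusters]     # step 1: mean of each cluster
threshold = 5.0                          # step 2: initialize the threshold value

def is_outlier(x, means, threshold):
    distances = [abs(x - m) for m in means]  # step 3: distance from each cluster mean
    nearest = min(distances)                 # step 4: nearest cluster
    return nearest > threshold               # step 5: outlier if distance > threshold

print(is_outlier(10.8, means, threshold))    # False: close to cluster A
print(is_outlier(30.0, means, threshold))    # True: far from both cluster means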
What is a Missing Value?
Missing values are data points that are absent for a specific variable in a dataset. They
can be represented in various ways, such as blank cells, null values, or special symbols
like “NA” or “unknown.” These missing data points pose a significant challenge in data
analysis and can lead to inaccurate or biased results.
Missing values can pose a significant challenge in data analysis, as they can:
• Reduce the sample size: This can decrease the accuracy and reliability of your
analysis.
• Introduce bias: If the missing data is not handled properly, it can bias the results of
your analysis.
• Make it difficult to perform certain analyses: Some statistical techniques require
complete data for all variables, making them inapplicable when missing values are
present.
Types of Missing Values
There are three main types of missing values:
1. Missing Completely at Random (MCAR): MCAR is a specific type of missing data in
which the probability of a data point being missing is entirely random and independent
of any other variable in the dataset. In simpler terms, whether a value is missing or not
has nothing to do with the values of other variables or the characteristics of the data
point itself.
2. Missing at Random (MAR): MAR is a type of missing data where the probability of a
data point missing depends on the values of other variables in the dataset, but not on
the missing variable itself. This means that the missingness mechanism is not entirely
random, but it can be predicted based on the available information.
3. Missing Not at Random (MNAR): MNAR is the most challenging type of missing data
to deal with. It occurs when the probability of a data point being missing is related to
the missing value itself. This means that the reason for the missing data is informative
and directly associated with the variable that is missing.
Effective Strategies for Handling Missing Values
in Data Analysis
Removing Rows with Missing Values
• Simple and efficient: Removes data points with missing values altogether.
• Reduces sample size: Can lead to biased results if missingness is not random.
• Not recommended for large datasets: Can discard valuable information.
Imputation Methods
• Replacing missing values with estimated values.
• Preserves sample size: Doesn’t reduce data points.
• Can introduce bias: Estimated values might not be accurate.
Here are some common imputation methods:
1- Mean, Median, and Mode Imputation:
• Replace missing values with the mean, median, or mode of the relevant variable.
• Simple and efficient: Easy to implement.
• Can be inaccurate: Doesn’t consider the relationships between variables.
In this example, we are explaining the imputation techniques for handling missing values in
the ‘Marks’ column of the DataFrame (df). It calculates and fills missing values with the
mean, median, and mode of the existing values in that column, and then prints the results
for observation.
1. Mean Imputation: Calculates the mean of the ‘Marks’ column in the DataFrame (df).
• df['Marks'].fillna(...): Fills missing values in the ‘Marks’ column with the
mean value.
• mean_imputation: The result is stored in the variable mean_imputation.
2. Median Imputation: Calculates the median of the ‘Marks’ column in the DataFrame
(df).
• df['Marks'].fillna(...): Fills missing values in the ‘Marks’ column with the
median value.
• median_imputation: The result is stored in the variable median_imputation.
3. Mode Imputation: Calculates the mode of the ‘Marks’ column in the DataFrame (df).
The result is a Series.
• .iloc[0]: Accesses the first element of the Series, which represents the mode.
• df['Marks'].fillna(...): Fills missing values in the ‘Marks’ column with the
mode value.
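The original example code is not reproduced in these notes, so the following is a minimal sketch assuming a DataFrame df with a 'Marks' column containing some missing values:

import pandas as pd

# Hypothetical DataFrame with missing marks.
df = pd.DataFrame({"Marks": [78, 85, None, 62, None, 90, 85]})

mean_imputation = df["Marks"].fillna(df["Marks"].mean())
median_imputation = df["Marks"].fillna(df["Marks"].median())
mode_imputation = df["Marks"].fillna(df["Marks"].mode().iloc[0])  # mode() returns a Series

print(mean_imputation.tolist())
print(median_imputation.tolist())
print(mode_imputation.tolist())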
2. Forward and Backward Fill
• Replace missing values with the previous or next non-missing value in the same
variable.
• Simple and intuitive: Preserves temporal order.
• Can be inaccurate: Assumes missing values are close to observed values
These fill methods are particularly useful when there is a logical sequence or order in the
data, and missing values can be reasonably assumed to follow a pattern.
The method parameter in fillna() allows to specify the filling strategy, and here, it’s set
to ‘ffill’ for forward fill and ‘bfill’ for backward fill.
1. Forward Fill (forward_fill)
• df['Marks'].fillna(method='ffill'): This method fills missing values in
the ‘Marks’ column of the DataFrame (df) using a forward fill strategy. It replaces
missing values with the last observed non-missing value in the column.
• forward_fill: The result is stored in the variable forward_fill.
2. Backward Fill (backward_fill)
• df['Marks'].fillna(method='bfill'): This method fills missing values in
the ‘Marks’ column using a backward fill strategy. It replaces missing values with
the next observed non-missing value in the column.
• backward_fill: The result is stored in the variable backward_fill.
Note
• Forward fill uses the last valid observation to fill missing values.
• Backward fill uses the next valid observation to fill missing values.
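A minimal sketch of both fills is shown below with hypothetical data; newer pandas versions prefer the ffill()/bfill() shortcuts, which are equivalent to the fillna(method='ffill') and fillna(method='bfill') calls described above:

import pandas as pd

df = pd.DataFrame({"Marks": [78, None, None, 62, None, 90]})  # hypothetical data

# Forward fill: carry the last observed value forward.
forward_fill = df["Marks"].ffill()    # older style: df["Marks"].fillna(method="ffill")

# Backward fill: pull the next observed value backward.
backward_fill = df["Marks"].bfill()   # older style: df["Marks"].fillna(method="bfill")

print(forward_fill.tolist())     # [78.0, 78.0, 78.0, 62.0, 62.0, 90.0]
print(backward_fill.tolist())    # [78.0, 62.0, 62.0, 62.0, 90.0, 90.0]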
3. Interpolation Techniques
• Estimate missing values based on surrounding data points using techniques like linear
interpolation or spline interpolation.
• More sophisticated than mean/median imputation: Captures relationships between
variables.
• Requires additional libraries and computational resources.
These interpolation techniques are useful when the relationship between data points can
be reasonably assumed to follow a linear or quadratic pattern. The method parameter of
the interpolate() method allows you to specify the interpolation strategy.
1. Linear Interpolation
• df['Marks'].interpolate(method='linear'): This method performs linear
interpolation on the ‘Marks’ column of the DataFrame (df). Linear interpolation
estimates missing values by considering a straight line between two adjacent non-
missing values.
• linear_interpolation: The result is stored in the
variable linear_interpolation.
2. Quadratic Interpolation
• df['Marks'].interpolate(method='quadratic'): This method
performs quadratic interpolation on the ‘Marks’ column. Quadratic interpolation
estimates missing values by considering a quadratic curve that passes through
three adjacent non-missing values.
• quadratic_interpolation: The result is stored in the
variable quadratic_interpolation.
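Below is a minimal sketch with hypothetical data; note that quadratic interpolation in pandas relies on SciPy being installed and needs at least three known values:

import pandas as pd

df = pd.DataFrame({"Marks": [40.0, None, 60.0, None, None, 90.0]})  # hypothetical data

# Linear interpolation: fill gaps along a straight line between known values.
linear_interpolation = df["Marks"].interpolate(method="linear")

# Quadratic interpolation: fit a smoother curve through adjacent known values
# (requires SciPy to be installed).
quadratic_interpolation = df["Marks"].interpolate(method="quadratic")

print(linear_interpolation.tolist())    # [40.0, 50.0, 60.0, 70.0, 80.0, 90.0]
print(quadratic_interpolation.tolist())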

Data Warehouse Architecture


A Data Warehouse can be described as a system that consolidates and manages
data from different sources to assist an organization in making proper decisions. This makes
the work of handling and reporting data easier. Two main construction approaches are used:
the Top-Down approach and the Bottom-Up approach, and each of them has its own
strengths and weaknesses.
A Data-Warehouse is a heterogeneous collection of data sources organized under a unified
schema. There are 2 approaches for constructing a data warehouse: The top-down
approach and the Bottom-up approach are explained below.
What is Top-Down Approach?
The initial approach, developed by Bill Inmon and known as the top-down approach, starts with
building a single, central data warehouse for the whole company. External data is merged and
processed through the ETL (Extract, Transform, Load) process and subsequently stored
in the data warehouse. Specialized data marts for different organizational departments,
for instance the finance department, are then formed from it. The strength of this method
is that it offers a clear structure for managing data; however, it can be expensive
as well as time-consuming, and for that reason it is ideal for large organizations only.

The essential components are discussed below:


1. External Sources: An external source is a source from which data is collected,
irrespective of the type of data. The data can be structured, semi-structured, or
unstructured.

2. Stage Area: Since the data extracted from the external sources does not follow a
particular format, it needs to be validated before it is loaded into the data warehouse.
For this purpose, it is recommended to use an ETL tool.
• E (Extract): Data is extracted from the external data source.

• T (Transform): Data is transformed into the standard format.

• L (Load): Data is loaded into the data warehouse after transforming it into the standard
format.
3. Data Warehouse: After cleansing, the data is stored in the data warehouse as the central
repository. The warehouse stores the metadata, while the actual data is stored in the data
marts. Note that in this top-down approach the data warehouse stores the data in its purest
form.

4. Data Marts: A data mart is also part of the storage component. It stores the information of
a particular function of an organisation, which is handled by a single authority. There can
be as many data marts in an organisation as there are functions. We
can also say that a data mart contains a subset of the data stored in the data warehouse.

5. Data Mining: The practice of analysing the big data present in the data warehouse is data
mining. It is used to find the hidden patterns present in the database or data
warehouse with the help of data mining algorithms.
Inmon defines this approach as follows: the data warehouse is a central repository for the
complete organisation, and data marts are created from it after the complete data
warehouse has been created.
Advantages of Top-Down Approach
1. Since the data marts are created from the data warehouse, this approach provides a
consistent dimensional view of the data marts.
2. This model is also considered the strongest model for business changes, which is
why big organisations prefer to follow this approach.
3. Creating data mart from data warehouse is easy.
4. Improved data consistency: The top-down approach promotes data consistency by
ensuring that all data marts are sourced from a common data warehouse. This ensures
that all data is standardized, reducing the risk of errors and inconsistencies in
reporting.
5. Easier maintenance: Since all data marts are sourced from a central data warehouse,
it is easier to maintain and update the data in a top-down approach. Changes can be
made to the data warehouse, and those changes will automatically propagate to all the
data marts that rely on it.
6. Better scalability: The top-down approach is highly scalable, allowing organizations to
add new data marts as needed without disrupting the existing infrastructure. This is
particularly important for organizations that are experiencing rapid growth or have
evolving business needs.
7. Improved governance: The top-down approach facilitates better governance by
enabling centralized control of data access, security, and quality. This ensures that all
data is managed consistently and that it meets the organization’s standards for quality
and compliance.
8. Reduced duplication: The top-down approach reduces data duplication by ensuring
that data is stored only once in the data warehouse. This saves storage space and
reduces the risk of data inconsistencies.
9. Better reporting: The top-down approach enables better reporting by providing a
consistent view of data across all data marts. This makes it easier to create accurate
and timely reports, which can improve decision-making and drive better business
outcomes.
10. Better data integration: The top-down approach enables better data integration by
ensuring that all data marts are sourced from a common data warehouse. This makes
it easier to integrate data from different sources and provides a more complete view of
the organization’s data.
Disadvantages of Top-Down Approach
1. The cost and time taken in designing and maintaining it are very high.
2. Complexity: The top-down approach can be complex to implement and maintain,
particularly for large organizations with complex data needs. The design and
implementation of the data warehouse and data marts can be time-consuming and
costly.
3. Lack of flexibility: The top-down approach may not be suitable for organizations that
require a high degree of flexibility in their data reporting and analysis. Since the design
of the data warehouse and data marts is pre-determined, it may not be possible to
adapt to new or changing business requirements.
4. Limited user involvement: The top-down approach can be dominated by IT
departments, which may lead to limited user involvement in the design and
implementation process. This can result in data marts that do not meet the specific
needs of business users.
5. Data latency: The top-down approach may result in data latency, particularly when
data is sourced from multiple systems. This can impact the accuracy and timeliness of
reporting and analysis.
6. Data ownership: The top-down approach can create challenges around data
ownership and control. Since data is centralized in the data warehouse, it may not be
clear who is responsible for maintaining and updating the data.
7. Cost: The top-down approach can be expensive to implement and maintain,
particularly for smaller organizations that may not have the resources to invest in a
large-scale data warehouse and associated data marts.
8. Integration challenges: The top-down approach may face challenges in integrating
data from different sources, particularly when data is stored in different formats or
structures. This can lead to data inconsistencies and inaccuracies.
What is Bottom-Up Approach?
Bottom up Approach is the Ralph Kimball’s approach of the construction of individual data
marts that lie at the center of specific business goals or functions such as marketing or sales.
These data marts are extracted transformed & loaded first to provide organizations’ ability
to generate reports instantly. In turn, these data marts are affiliated to the more centralized
and broad data warehouse system. This is a more flexible method of training, cheaper and
best recommendable in smaller organizations. Nevertheless, it entails the creation of data
silos and disparities, and this may not allow an organization to have a coherent perspective
in its various departments.

1. First, the data is extracted from external sources (same as happens in top-down
approach).

2. Then, the data goes through the staging area (as explained above) and is loaded into data
marts instead of the data warehouse. The data marts are created first and provide reporting
capability; each addresses a single business area.

3. These data marts are then integrated into datawarehouse.


Kimball defines this approach as follows: data marts are created first and provide a thin view
for analysis, and the data warehouse is created after the complete set of data marts has been created.
Advantages of Bottom-Up Approach
1. As the data marts are created first, reports are generated quickly.
2. More data marts can be accommodated here, and in this way the data warehouse
can be extended.
3. Also, the cost and time taken in designing this model are comparatively low.
4. Incremental development: The bottom-up approach supports incremental
development, allowing for the creation of data marts one at a time. This allows for quick
wins and incremental improvements in data reporting and analysis.
5. User involvement: The bottom-up approach encourages user involvement in the
design and implementation process. Business users can provide feedback on the data
marts and reports, helping to ensure that the data marts meet their specific needs.
6. Flexibility: The bottom-up approach is more flexible than the top-down approach, as it
allows for the creation of data marts based on specific business needs. This approach
can be particularly useful for organizations that require a high degree of flexibility in
their reporting and analysis.
7. Faster time to value: The bottom-up approach can deliver faster time to value, as the
data marts can be created more quickly than a centralized data warehouse. This can
be particularly useful for smaller organizations with limited resources.
8. Reduced risk: The bottom-up approach reduces the risk of failure, as data marts can
be tested and refined before being incorporated into a larger data warehouse. This
approach can also help to identify and address potential data quality issues early in the
process.
9. Scalability: The bottom-up approach can be scaled up over time, as new data marts
can be added as needed. This approach can be particularly useful for organizations
that are growing rapidly or undergoing significant change.
10. Data ownership: The bottom-up approach can help to clarify data ownership and
control, as each data mart is typically owned and managed by a specific business unit.
This can help to ensure that data is accurate and up-to-date, and that it is being used
in a consistent and appropriate way across the organization.
Disadvantages of Bottom-Up Approach
1. This model is not as robust as the top-down approach, because the dimensional view of the data marts is not as consistent as it is in that approach.
2. Data silos: The bottom-up approach can lead to the creation of data silos, where
different business units create their own data marts without considering the needs of
other parts of the organization. This can lead to inconsistencies and redundancies in
the data, as well as difficulties in integrating data across the organization.
3. Integration challenges: Because the bottom-up approach relies on the integration of
multiple data marts, it can be more difficult to integrate data from different sources and
ensure consistency across the organization. This can lead to issues with data quality
and accuracy.
4. Duplication of effort: In a bottom-up approach, different business units may duplicate
effort by creating their own data marts with similar or overlapping data. This can lead to
inefficiencies and higher costs in data management.
5. Lack of enterprise-wide view: The bottom-up approach can result in a lack of
enterprise-wide view, as data marts are typically designed to meet the needs of
specific business units rather than the organization as a whole. This can make it
difficult to gain a comprehensive understanding of the organization’s data and business
processes.
6. Complexity: The bottom-up approach can be more complex than the top-down
approach, as it involves the integration of multiple data marts with varying levels of
complexity and granularity. This can make it more difficult to manage and maintain the
data warehouse over time.
7. Risk of inconsistency: Because the bottom-up approach allows for the creation of
data marts with different structures and granularities, there is a risk of inconsistency in
the data. This can make it difficult to compare data across different parts of the
organization or to ensure that reports are accurate and reliable.
What is Data Mining?
Data mining is the process of extracting knowledge or insights from large amounts of data
using various statistical and computational techniques. The data can be structured, semi-
structured or unstructured, and can be stored in various forms such as databases, data
warehouses, and data lakes.
The primary goal of data mining is to discover hidden patterns and relationships in the data
that can be used to make informed decisions or predictions. This involves exploring the
data using various techniques such as clustering, classification, regression analysis,
association rule mining, and anomaly detection.
Data Mining refers to the detection and extraction of new patterns from the already
collected data. Data mining is the amalgamation of the field of statistics and computer
science aiming to discover patterns in incredibly large datasets and then transform them
into a comprehensible structure for later use.
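As a quick illustration of one of the techniques mentioned above (clustering), the sketch below groups a synthetic dataset with scikit-learn. The dataset, number of clusters, and random seed are assumptions made purely for demonstration.
```python
# A minimal sketch of one data mining technique (clustering) on synthetic data;
# the dataset, number of clusters, and random seed are purely illustrative.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data standing in for "already collected" records.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Discover hidden groupings (patterns) in the data.
model = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = model.fit_predict(X)

print(labels[:10])             # cluster assignment of the first few records
print(model.cluster_centers_)  # centers of the discovered groups
```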
The architecture of Data Mining:
Basic Working:
1. It all starts when the user submits a data mining request; the request is then sent to the data mining engine for pattern evaluation.
2. The system tries to answer the query using the data already present in the database.
3. The extracted metadata is sent to the data mining engine for analysis, which sometimes interacts with the pattern evaluation modules to determine the result.
4. The result is then presented to the user at the front end in an easily understandable form through a suitable interface.
A detailed description of the parts of the data mining architecture follows:
1. Data Sources: Database, World Wide Web(WWW), and data warehouse are parts of
data sources. The data in these sources may be in the form of plain text, spreadsheets,
or other forms of media like photos or videos. WWW is one of the biggest sources of
data.
2. Database Server: The database server contains the actual data ready to be
processed. It performs the task of handling data retrieval as per the request of the
user.
3. Data Mining Engine: It is one of the core components of the data mining architecture
that performs all kinds of data mining techniques like association, classification,
characterization, clustering, prediction, etc.
4. Pattern Evaluation Modules: They are responsible for finding interesting patterns in
the data and sometimes they also interact with the database servers for producing the
result of the user requests.
5. Graphical User Interface: Since the user cannot fully understand the complexity of the data mining process, a graphical user interface helps the user communicate effectively with the data mining system.
6. Knowledge Base: Knowledge Base is an important part of the data mining engine that
is quite beneficial in guiding the search for the result patterns. Data mining engines
may also sometimes get inputs from the knowledge base. This knowledge base may
contain data from user experiences. The objective of the knowledge base is to make
the result more accurate and reliable.
Types of Data Mining architecture:
1. No Coupling: A no-coupling architecture retrieves data directly from particular data sources and does not use a database system, even though a database would be a more efficient and accurate way to retrieve the data. This architecture is weak and is used only for very simple data mining processes.
2. Loose Coupling: In a loose-coupling architecture, the data mining system retrieves data from the database and stores its results back in those systems. This architecture suits memory-based data mining.
3. Semi-Tight Coupling: It tends to use various advantageous features of the data
warehouse systems. It includes sorting, indexing, and aggregation. In this architecture,
an intermediate result can be stored in the database for better performance.
4. Tight Coupling: In this architecture, the data warehouse is considered one of the most important components, and its features are employed for performing data mining tasks. This architecture provides scalability, performance, and integrated information.
Advantages of Data Mining:
• Assists in preventing future adverse situations by accurately predicting future trends.
• Contributes to the making of important decisions.
• Compresses data into valuable information.
• Provides new trends and unexpected patterns.
• Helps to analyze huge data sets.
• Aids companies to find, attract and retain customers.
• Helps the company to improve its relationship with the customers.
• Assists companies in optimizing production according to the demand for a given product, thereby saving costs.
Disadvantages of Data Mining:
• The work is intensive and requires skilled teams and staff training.
• The requirement of large investments can also be considered a problem as sometimes
data collection consumes many resources that suppose a high cost.
• Lack of security could also put the data at huge risk, as the data may contain private
customer details.
• Inaccurate data may lead to the wrong output.
• Huge databases are quite difficult to manage.
Time Series Analysis and Forecasting
Time series analysis and forecasting are crucial for predicting future trends and behaviors based on historical data. It helps businesses make informed decisions,
optimize resources, and mitigate risks by anticipating market demand, sales fluctuations,
stock prices, and more. Additionally, it aids in planning, budgeting, and strategizing across
various domains such as finance, economics, healthcare, climate science, and resource
management, driving efficiency and competitiveness.
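As a minimal illustration of forecasting from historical data, the sketch below fits a Holt-Winters exponential smoothing model from statsmodels to a synthetic monthly demand series and forecasts the next six months. The series, the additive trend/seasonality settings, and the horizon are illustrative assumptions, not prescriptions.
```python
# A minimal forecasting sketch with Holt-Winters exponential smoothing (statsmodels);
# the synthetic monthly demand series and model settings are illustrative assumptions.
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

idx = pd.date_range("2019-01-01", periods=60, freq="MS")
demand = pd.Series(
    100 + 2 * np.arange(60) + 15 * np.sin(2 * np.pi * np.arange(60) / 12),  # trend + yearly cycle
    index=idx,
)

# Fit an additive trend + 12-month seasonality model, then forecast the next 6 months.
model = ExponentialSmoothing(demand, trend="add", seasonal="add", seasonal_periods=12)
fit = model.fit()
print(fit.forecast(6))
```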
Components of Time Series Data
There are four main components of a time series:
1. Trend: Trend represents the long-term movement or directionality of the data over
time. It captures the overall tendency of the series to increase, decrease, or remain
stable. Trends can be linear, indicating a consistent increase or decrease, or nonlinear,
showing more complex patterns.
2. Seasonality: Seasonality refers to periodic fluctuations or patterns that occur at
regular intervals within the time series. These cycles often repeat annually, quarterly,
monthly, or weekly and are typically influenced by factors such as seasons, holidays,
or business cycles.
3. Cyclic variations: Cyclical variations are longer-term fluctuations in the time series
that do not have a fixed period like seasonality. These fluctuations represent economic
or business cycles, which can extend over multiple years and are often associated with
expansions and contractions in economic activity.
4. Irregularity (or Noise): Irregularity, also known as noise or randomness, refers to the
unpredictable or random fluctuations in the data that cannot be attributed to the trend,
seasonality, or cyclical variations. These fluctuations may result from random events,
measurement errors, or other unforeseen factors. Irregularity makes it challenging to
identify and model the underlying patterns in the time series data.
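The four components above can be seen directly by decomposing a series. The sketch below builds a synthetic monthly series with a known trend, yearly seasonality, and random noise, then splits it with statsmodels' seasonal_decompose; the series itself and the choice of an additive model are assumptions made for illustration.
```python
# A minimal sketch that separates a synthetic monthly series into trend, seasonal,
# and residual parts; the series and the additive model are illustrative assumptions.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

rng = np.random.default_rng(0)
idx = pd.date_range("2018-01-01", periods=72, freq="MS")
values = (
    0.5 * np.arange(72)                              # long-term trend
    + 10 * np.sin(2 * np.pi * np.arange(72) / 12)    # yearly seasonality
    + rng.normal(0, 1, 72)                           # irregular (noise) component
)
series = pd.Series(values, index=idx)

result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())     # estimated trend component
print(result.seasonal.head())           # estimated seasonal component
print(result.resid.dropna().head())     # what remains: the irregular component
```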
Time Series Visualization
Time series visualization is the graphical representation of data collected over successive
time intervals. It encompasses various techniques such as line plots, seasonal subseries
plots, autocorrelation plots, histograms, and interactive visualizations. These methods help
analysts identify trends, patterns, and anomalies in time-dependent data for better
understanding and decision-making.
Different Time series visualization graphs
1. Line Plots: Line plots display data points over time, allowing easy observation of
trends, cycles, and fluctuations.
2. Seasonal Plots: These plots break down time series data into seasonal components,
helping to visualize patterns within specific time periods.
3. Histograms and Density Plots: Shows the distribution of data values over time,
providing insights into data characteristics such as skewness and kurtosis.
4. Autocorrelation and Partial Autocorrelation Plots: These plots visualize correlation
between a time series and its lagged values, helping to identify seasonality and lagged
relationships.
5. Spectral Analysis: Spectral analysis techniques, such as periodograms and
spectrograms, visualize frequency components within time series data, useful for
identifying periodicity and cyclical patterns.
6. Decomposition Plots: Decomposition plots break down a time series into its trend,
seasonal, and residual components, aiding in understanding the underlying patterns.
These visualization techniques allow analysts to explore, interpret, and communicate
insights from time series data effectively, supporting informed decision-making and
forecasting.
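As a small illustration of two of these visualizations (a line plot and an autocorrelation plot), the sketch below uses pandas and matplotlib on a synthetic random-walk series; the data and the figure layout are assumptions made only for demonstration.
```python
# A small sketch of two of the plots above (a line plot and an autocorrelation plot)
# on a synthetic random-walk series; the data and layout are purely illustrative.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import autocorrelation_plot

rng = np.random.default_rng(1)
idx = pd.date_range("2022-01-01", periods=200, freq="D")
series = pd.Series(np.cumsum(rng.normal(0, 1, 200)), index=idx)

fig, axes = plt.subplots(2, 1, figsize=(8, 6))
series.plot(ax=axes[0], title="Line plot: trend and fluctuations over time")
autocorrelation_plot(series, ax=axes[1])   # correlation of the series with its lags
plt.tight_layout()
plt.show()
```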
Feature Selection Techniques in Machine
Learning
Feature selection:
Feature selection is a process that chooses a subset of features from the original features
so that the feature space is optimally reduced according to a certain criterion.
Feature selection is a critical step in the feature construction process. In text categorization problems, some words simply do not appear very often. Perhaps the word "groovy" appears in exactly one training document, which happens to be positive. Is it really worth keeping this word around as a feature? It is a risky choice, because with just one training example it is hard to tell whether the word is really correlated with the positive class or is just noise. You could hope that your learning algorithm is smart enough to figure it out, or you could simply remove it.
There are three general classes of feature selection algorithms: Filter methods, wrapper
methods and embedded methods.
The role of feature selection in machine learning is,
1. To reduce the dimensionality of feature space.
2. To speed up a learning algorithm.
3. To improve the predictive accuracy of a classification algorithm.
4. To improve the comprehensibility of the learning results.
Feature selection algorithms are as follows:
1. Instance based approaches: There is no explicit procedure for feature subset
generation. Many small data samples are sampled from the data. Features are weighted
according to their roles in differentiating instances of different classes for a data sample.
Features with higher weights can be selected.
2. Nondeterministic approaches: Genetic algorithms and simulated annealing are also
used in feature selection.
3. Exhaustive/complete approaches: Branch and Bound evaluates estimated accuracy, and ABB (Automatic Branch and Bound) checks a monotonic inconsistency measure. Both start with the full feature set and remove features until the preset bound can no longer be maintained.
While building a machine learning model for a real-life dataset, we come across many features, and not all of them are important every time. Adding unnecessary features while training the model tends to reduce the overall accuracy, increase the complexity, decrease the generalization capability of the model, and make the model biased. The saying "sometimes less is better" applies to machine learning models as well. Hence, feature selection is one of the important steps in building a machine learning model; its goal is to find the best possible set of features for the model.
Some popular techniques of feature selection in machine learning are:
• Filter methods
• Wrapper methods
• Embedded methods
Filter Methods
These methods are generally applied during the pre-processing step and select features from the dataset irrespective of any machine learning algorithm. In terms of computation, they are very fast and inexpensive and are very good at removing duplicated, correlated, and redundant features, but they do not remove multicollinearity. Each feature is evaluated individually, which can work well when features are useful in isolation (have no dependency on other features) but falls short when it is a combination of features that improves the overall performance of the model. A short scikit-learn sketch of two of these techniques is given after the list below.
(Figure: Filter Methods Implementation)
Some techniques used are:
• Information Gain – It is defined as the amount of information provided by the feature
for identifying the target value and measures reduction in the entropy values.
Information gain of each attribute is calculated considering the target values for feature
selection.
• Chi-square test – The chi-square method (χ²) is generally used to test the relationship between categorical variables. It compares the observed frequencies of different attributes of the dataset with their expected frequencies.
Chi-square formula: χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ, where Oᵢ is the observed frequency and Eᵢ is the expected frequency.
• Fisher’s Score – Fisher’s Score selects each feature independently according to their
scores under Fisher criterion leading to a suboptimal set of features. The larger the
Fisher’s score is, the better is the selected feature.
• Correlation Coefficient – Pearson’s Correlation Coefficient is a measure of
quantifying the association between the two continuous variables and the direction of
the relationship with its values ranging from -1 to 1.
• Variance Threshold – It is an approach where all features are removed whose
variance doesn’t meet the specific threshold. By default, this method removes features
having zero variance. The assumption made using this method is higher variance
features are likely to contain more information.
• Mean Absolute Difference (MAD) – This method is similar to variance threshold
method but the difference is there is no square in MAD. This method calculates the
mean absolute difference from the mean value.
• Dispersion Ratio – Dispersion ratio is defined as the ratio of the Arithmetic mean (AM)
to that of Geometric mean (GM) for a given feature. Its value ranges from +1 to ∞ as
AM ≥ GM for a given feature. Higher dispersion ratio implies a more relevant feature.
• Mutual Dependence – This method measures if two variables are mutually dependent,
and thus provides the amount of information obtained for one variable on observing the
other variable. Depending on the presence/absence of a feature, it measures the
amount of information that feature contributes to making the target prediction.
• Relief – This method measures the quality of attributes by repeatedly sampling an instance from the dataset at random and updating each feature's weight based on the differences between the sampled instance and its two nearest neighbors, one from the same class and one from the opposite class.
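Below is the minimal scikit-learn sketch referenced above, covering two of the filter techniques from this list (variance threshold and the chi-square test); the bundled iris dataset, the 0.2 threshold, and k = 2 are illustrative assumptions.
```python
# A minimal sketch of two filter techniques from the list above (variance threshold
# and the chi-square test) using scikit-learn; the iris dataset, the 0.2 threshold,
# and k = 2 are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2

X, y = load_iris(return_X_y=True)

# Drop features whose variance falls below the chosen threshold.
vt = VarianceThreshold(threshold=0.2)
X_high_var = vt.fit_transform(X)

# Keep the k features most associated with the target by the chi-square statistic
# (chi2 needs non-negative feature values, which holds for these measurements).
skb = SelectKBest(score_func=chi2, k=2)
X_top2 = skb.fit_transform(X_high_var, y)

print("after variance threshold:", X_high_var.shape)
print("after chi-square selection:", X_top2.shape)
print("chi-square scores:", skb.scores_)
```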
Wrapper methods:
Wrapper methods, also referred to as greedy algorithms, train the model on a subset of features in an iterative manner. Based on the conclusions drawn from the previously trained model, features are added or removed. The stopping criterion for selecting the best subset is usually pre-defined by the person training the model, for example when the performance of the model starts to decrease or a specific number of features has been reached. The main advantage of wrapper methods over filter methods is that they provide a well-performing set of features for training the model, usually resulting in better accuracy than filter methods, but they are computationally more expensive.
(Figure: Wrapper Methods Implementation)
Some techniques used are:
• Forward selection – This is an iterative approach in which we start with an empty set of features and, after each iteration, add the feature that best improves the model. The process stops when adding a new variable no longer improves the performance of the model.
• Backward elimination – This is also an iterative approach in which we start with all features and, after each iteration, remove the least significant feature. The process stops when removing a feature no longer improves the performance of the model.
• Bi-directional elimination – This method uses both forward selection and backward
elimination technique simultaneously to reach one unique solution.
• Exhaustive selection – This technique is considered as the brute force approach for
the evaluation of feature subsets. It creates all possible subsets and builds a learning
algorithm for each subset and selects the subset whose model’s performance is best.
• Recursive feature elimination – This greedy optimization method selects features by recursively considering smaller and smaller sets of features. The estimator is trained on the initial set of features, and the importance of each feature is obtained from an attribute such as coef_ or feature_importances_. The least important features are then removed from the current set, and the process repeats until the required number of features remains.
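A minimal scikit-learn sketch of recursive feature elimination follows; the synthetic dataset, the logistic-regression estimator, and the choice of keeping four features are assumptions made for illustration.
```python
# A minimal sketch of recursive feature elimination with scikit-learn; the synthetic
# dataset, the logistic-regression estimator, and keeping 4 features are assumptions.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, only some of which are informative.
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           n_redundant=2, random_state=0)

# The estimator is refit repeatedly; at each step the least important feature
# (judged by the magnitude of coef_) is dropped until 4 features remain.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4, step=1)
selector.fit(X, y)

print("selected feature mask:", selector.support_)
print("feature ranking (1 = selected):", selector.ranking_)
```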
Embedded methods:
In embedded methods, feature selection is blended into the learning algorithm itself, so the algorithm has its own built-in feature selection mechanism. Embedded methods address the drawbacks of filter and wrapper methods and merge their advantages: they are fast like filter methods, more accurate than filter methods, and take combinations of features into consideration as well.
(Figure: Embedded Methods Implementation)
Some techniques used are:
• Regularization – This method adds a penalty to different parameters of the machine
learning model to avoid over-fitting of the model. This approach of feature selection
uses Lasso (L1 regularization) and Elastic nets (L1 and L2 regularization). The penalty
is applied over the coefficients, thus bringing down some coefficients to zero. The
features having zero coefficient can be removed from the dataset.
• Tree-based methods – Methods such as Random Forest and Gradient Boosting provide a feature importance measure that can also be used to select features. Feature importance tells us which features have the greatest impact on the target feature.
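A minimal scikit-learn sketch of both embedded ideas above (L1 regularization and tree-based importances) is given below; the diabetes dataset and the hyperparameters are illustrative assumptions.
```python
# A minimal sketch of both embedded ideas above: L1 regularization (Lasso) and
# tree-based feature importances; the diabetes dataset and hyperparameters are
# illustrative assumptions.
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

# L1 regularization drives some coefficients to exactly zero; those features can go.
lasso = Lasso(alpha=1.0).fit(X, y)
selected = SelectFromModel(lasso, prefit=True).transform(X)
print("non-zero Lasso coefficients:", int((lasso.coef_ != 0).sum()))
print("shape after Lasso-based selection:", selected.shape)

# Tree ensembles expose feature_importances_, which can be used to rank features.
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print("feature importances:", forest.feature_importances_.round(3))
```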