Data Science Notes
Course Objectives:
Course Outcomes:
Students will be able to:
Gain a comprehensive understanding of data terminologies and concepts.
Collect, clean, and preprocess data for analysis.
Create compelling visual representations of data to aid in decision-making processes.
Implement data analytics projects, demonstrating project management, teamwork, and problem-solving abilities.
Develop proficiency in statistical analysis and apply various techniques to extract meaningful insights from data.
---------------------------------------------------------------------------------------------------------------------
UNIT – I: INTRODUCTION TO DATA ANALYTICS
Introduction: Meaning of Data Analytics- Need of Data Analytics- Business Analytics vs. Data
Analytics - Categorization of Data Analytical Models. Data Scientist vs. Data Engineer vs.
Data Analyst- Role of Data Analyst- Data Analytics in Practice.
INTRODUCTION:
Data analytics converts raw data into actionable insights. It includes a range of tools,
technologies, and processes used to find trends and solve problems by using data. Data
analytics can shape business processes, improve decision-making, and foster business growth.
For example, descriptive statistical analysis could show the sales distribution across a group of
employees and the average sales figure per employee; this type of analysis helps describe or
summarize quantitative data by presenting statistics, and it answers the question, "What happened?"
MEANING:
Data Analysis is the process of systematically applying statistical and/or logical techniques to
describe and illustrate, condense and recap, and evaluate data.
NEED OF DATA ANALYTICS:
1. Value chain:
There are companies that will help you find the insights hidden in the value chains that already
exist in your organization, and this is done through data analytics.
2. Industry knowledge:
Industry knowledge is another thing you will be able to comprehend once you get into data
analytics; it shows how you can run your business in the near future and what the economy is
already moving towards. That is how you can gain the benefit before anyone else.
The economy keeps changing, and keeping pace with its dynamic trends is very important; at
the same time, profit-making is what an organization aims for most of the time. Data analytics
provides analyzed data that helps us see opportunities ahead of time, which is another way of
unlocking more options.
BUSINESS ANALYTICS vs. DATA ANALYTICS
Business Analytics:
Risk Management: Analyzing historical and current data can help businesses identify potential
risks and develop strategies to mitigate them, reducing the likelihood of losses.
Innovation: By analyzing data, businesses can identify patterns and trends that can lead to
innovative product or service offerings, improving their position in the market.
Data Analytics:
Data analytics is the process of examining, cleaning, transforming, and interpreting complex
datasets to extract valuable insights, draw conclusions, and support decision-making. It involves
various techniques from statistics, mathematics, and computer science to uncover patterns,
trends, correlations, and other useful information within the data. The primary goal of data
analytics is to gain a deeper understanding of data, solve problems, and aid in strategic decision-
making.
DATA ANALYTICS
Predictive Analytics
Predictive analytics turns data into valuable, actionable information. It uses data to determine
the probable outcome of an event or the likelihood of a situation occurring. Predictive analytics
draws on a variety of statistical techniques from modeling, machine learning, data mining, and
game theory that analyze current and historical facts to make predictions about future events.
Techniques used for predictive analytics are:
Linear Regression
Time Series Analysis and Forecasting
Data Mining
Basic Cornerstones of Predictive Analytics
Predictive modeling
Decision Analysis and optimization
Transaction profiling
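As a simple illustration of the linear regression technique listed above, the following is a minimal sketch using scikit-learn; the advertising-spend and sales figures are hypothetical.

```python
# Minimal predictive-analytics sketch: fitting a linear regression model
# to hypothetical historical data to predict a likely future outcome.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical history: advertising spend (in thousands) vs. sales (in units)
X = np.array([[10], [20], [30], [40], [50]])   # predictor
y = np.array([120, 190, 310, 405, 480])        # observed outcome

model = LinearRegression()
model.fit(X, y)

# Predict the probable outcome for a new, unseen input
predicted_sales = model.predict(np.array([[60]]))
print(f"Predicted sales for a spend of 60: {predicted_sales[0]:.1f}")
```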
Descriptive Analytics
Descriptive analytics looks at data and analyzes past events for insight into how to approach
future events. It examines past performance and, by mining historical data, seeks to understand
the causes of past success or failure. Almost all management reporting, such as sales, marketing,
operations, and finance, uses this type of analysis.
The descriptive model quantifies relationships in data in a way that is often used to classify
customers or prospects into groups. Unlike a predictive model that focuses on predicting the
behavior of a single customer, Descriptive analytics identifies many different relationships
between customer and product.
Common examples of Descriptive analytics are company reports that provide historic
reviews like:
Data Queries
Reports
Descriptive Statistics
Data dashboard
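For instance, the descriptive statistics mentioned above can be produced in a few lines with pandas; the employee names and sales figures below are hypothetical.

```python
# Descriptive-analytics sketch: summarising hypothetical sales figures per employee
# with basic descriptive statistics, as in the report examples listed above.
import pandas as pd

sales = pd.DataFrame({
    "employee": ["Asha", "Ravi", "Meena", "John"],
    "sales":    [12500, 9800, 14300, 11200],
})

print(sales["sales"].describe())                 # count, mean, std, min, quartiles, max
print("Average sales per employee:", sales["sales"].mean())
```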
Prescriptive Analytics
Prescriptive analytics automatically synthesizes big data, mathematical science, business rules,
and machine learning to make a prediction and then suggests a decision option to take
advantage of the prediction.
Prescriptive analytics goes beyond predicting future outcomes by also suggesting actions that
take advantage of the predictions and showing the decision maker the implications of each
decision option. Prescriptive analytics anticipates not only what will happen and when it will
happen, but also why it will happen. Further, prescriptive analytics can suggest decision options
on how to take advantage of a future opportunity or mitigate a future risk, and illustrate the
implications of each decision option.
For example, Prescriptive Analytics can benefit healthcare strategic planning by using
analytics to leverage operational and usage data combined with data of external factors such as
economic data, population demography, etc.
Diagnostic Analytics
In this analysis, we generally rely on historical data to answer a question or solve a problem.
We try to find dependencies and patterns in the historical data of the particular problem.
For example, companies use this analysis because it gives great insight into a problem, and they
keep detailed information at their disposal; otherwise, data collection would have to be repeated
for every individual problem, which would be very time-consuming. Common techniques used
for Diagnostic Analytics are:
Data discovery
Data mining
Correlations
Data Analytics in Practice: data analytics helps, for example, transportation businesses cut
expenses and speed up delivery times.
DATA SCIENTIST vs. DATA ENGINEER vs. DATA ANALYST
Data Analyst
Most entry-level professionals interested in getting into a data-related job start off as Data
analysts. Qualifying for this role is as simple as it gets. All you need is a bachelor’s degree and
good statistical knowledge. Strong technical skills would be a plus and can give you an edge
over most other applicants. Other than this, companies expect you to understand data handling,
modeling and reporting techniques along with a strong understanding of the business.
Data Engineer
A Data Engineer either acquires a master's degree in a data-related field or gathers a good amount
of experience as a Data Analyst. A Data Engineer needs to have a strong technical background with
the ability to create and integrate APIs. They also need to understand data pipelining and
performance optimization.
Data Scientist
A Data Scientist is one who analyzes and interprets complex digital data. While there are several
ways to get into a data scientist's role, the most seamless one is by acquiring enough experience
and learning the various data scientist skills. These skills include advanced statistical analysis, a
complete understanding of machine learning, data conditioning, etc.
For a better understanding of these professionals, let’s dive deeper and understand their required
skill-sets.
ROLE OF DATA ANALYST
Common responsibilities for Data Analysts include extracting data using special tools and
software, responding to data-related queries, setting up processes to make data more efficient,
analyzing and interpreting trends from the data, and reporting trends to add business value.
1. Data Collection:
Data Gathering: Data analysts collect data from various sources, including surveys, databases,
and web sources.
Data Cleaning: They clean and preprocess data to remove inconsistencies, errors, and ensure it's
ready for analysis.
2. Data Analysis:
Exploratory Data Analysis (EDA): They perform EDA to understand the data's structure,
relationships, and distributions.
Statistical Analysis: Utilize statistical methods to analyze data and extract meaningful insights.
Predictive Analysis: Build predictive models using statistical techniques and machine learning
algorithms.
3. Data Visualization:
Data Visualization: Present data findings through visual means such as charts, graphs, and
dashboards, making complex data more accessible to stakeholders.
Dashboard Creation: Develop interactive dashboards that allow users to explore data on their
own.
4. Interpreting Results:
Interpretation: Analyze the results of data analyses and translate them into actionable business
insights.
5. Communication:
Reporting: Prepare and deliver reports and presentations to both technical and non-technical
stakeholders.
Storytelling: Frame data insights into compelling stories that can be easily understood by people
without a technical background.
6. Continuous Learning:
Stay Updated: Keep up-to-date with the latest tools, technologies, and techniques in data
analysis and data science.
7. Problem-Solving:
Identify Problems: Work with stakeholders to identify business problems that can be solved
using data.
Innovative Solutions: Develop innovative solutions using data analysis methods to address
business challenges.
8. Data Ethics and Privacy:
Data Ethics: Adhere to ethical standards and guidelines when working with sensitive or private
data.
Data Privacy: Ensure compliance with data privacy regulations and protect the confidentiality of
data.
9. Collaboration:
Collaborative Work: Collaborate with data engineers, data scientists, and other professionals to
achieve common goals.
Data analytics has a wide range of applications across various industries and sectors, such as
finance, healthcare, marketing, e-commerce, and transportation. Useful tools and learning
resources for getting started include:
• SQL Bolt.
• Tableau.
• Python Tutorial.
UNIT – II: DIGITAL DATA AND THE DATA SCIENCE PROJECT LIFE CYCLE
Data: Introduction to Digital Data- Types of Digital Data - Data Collection- Data
Preprocessing- Data Preprocessing Elements: Data Quality, Data Cleaning, and Data
Integration - Data Reduction-Data Transformation-Data Discretization.
Data Science Project Life Cycle: Business Requirement- Data Acquisition- Data Preparation-
Hypothesis and Modeling- Evaluation and Interpretation- Deployment- Operations-
Optimization- Applications for Data Science.
INTRODUCTION
Digital data is the electronic representation of information in a format or language that machines
can read and understand. In more technical terms, digital data is a binary format of information
that's converted into a machine-readable digital format. The power of digital data is that any
analog inputs, from very simple text documents to genome sequencing results, can be
represented with the binary system.
Whenever you send an email, read a social media post, or take pictures with your digital
camera, you are working with digital data.
In general, data can be any character or text, numbers, voice messages, SMS, WhatsApp
messages, pictures, sound, and video.
MEANING:
Any data that can be processed by a digital computer and stored as sequences of 0s and 1s
(binary language) is known as digital data.
Digital data can represent a wide range of information, including text, images, videos,
sound, and more. Different types of data are encoded using specific digital formats. For
instance:
Text: Text characters are represented using character encoding standards like ASCII or
Unicode.
Images: Images are represented as a grid of pixels, each pixel being represented by
binary numbers indicating its color.
Audio: Sound waves are converted into digital signals through a process called sampling,
where the amplitude of the sound wave is measured at regular intervals and converted
into binary numbers.
Video: Videos are a sequence of images displayed rapidly. Each frame of the video is
represented as a digital image.
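A small sketch of these encodings in Python (the string and the pixel value are arbitrary examples):

```python
# Sketch of how text and image information become binary digital data.
text = "Data"

# Text: each character is mapped to a number by an encoding standard (here UTF-8/ASCII)
encoded = text.encode("utf-8")
print(list(encoded))                        # [68, 97, 116, 97]
print([format(b, "08b") for b in encoded])  # the same bytes written as 0s and 1s

# Images: a pixel's colour is also just numbers, e.g. an RGB triple
pixel = (255, 128, 0)                       # an orange pixel
print([format(channel, "08b") for channel in pixel])
```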
TYPES OF DIGITAL DATA
Unstructured Data: This is data which does not conform to a data model or is not in a form
which can be used easily by a computer program. About 80-90% of an organization's data is in
this format (e.g., memos, chat rooms, PowerPoint presentations, images, videos, letters,
research papers, white papers, the body of an email, etc.).
Semi-Structured Data: This is data which does not conform to a data model but has some
structure. However, it is not in a form which can be used easily by a computer program; e.g.,
emails, XML, and markup languages like HTML.
Structured Data: This is the data which is in an organized form (e.g: in rows and columns) and
can be easily used by a computer program. Relationships exist between entities of data, such as
classes and their objects. Data stored in databases is an example of structured data.
[Figure: Composition of digital data - roughly 10% structured, 10% semi-structured, and 80% unstructured.]
DATA COLLECTION
Data collection is the process of gathering and measuring information on variables of interest, in
a systematic and organized manner. It is a fundamental step in the research process and is crucial
for making informed decisions, conducting analyses, and drawing conclusions. Here are the key
aspects of data collection:
1. Define Objectives:
Clearly define the purpose of the data collection. What do you want to achieve? What questions
are you trying to answer?
2. Identify Data Sources:
Determine the sources from which data will be collected. Sources can be primary (collected
firsthand) or secondary (already existing data).
3. Choose Data Collection Methods:
Sensor Data: Gathering data from various sensors (temperature, GPS, etc.).
Focus Groups: Discussions with a selected group of people about a specific topic.
4. Design Data Collection Instruments:
Develop surveys, questionnaires, interview protocols, or other tools needed for data collection.
Ensure they are clear, unbiased, and appropriate for the target audience.
5. Pilot Testing:
Test the data collection tools on a small scale to identify and fix any issues with the questions or
methods.
6. Data Collection:
Implement the data collection process. This may involve administering surveys, conducting
interviews, or other chosen methods.
7. Data Recording:
Record the collected data systematically. Ensure proper storage and organization to prevent data
loss or corruption.
8. Data Validation and Cleaning:
Validate the collected data to ensure accuracy and reliability. Clean the data by removing
inconsistencies, errors, or outliers.
9. Data Analysis:
Interpret the results of the data analysis.
10. Reporting:
Prepare reports, charts, graphs, or presentations to convey the findings effectively.
11. Ethical Considerations:
Ensure that data collection follows ethical guidelines, including informed consent, privacy, and
confidentiality of participants.
12. Monitoring:
Continuously monitor the data collection process to identify and address any issues promptly.
Good data collection practices are essential for ensuring the reliability and validity of the
collected information, which forms the basis for meaningful analysis and informed decision-
making.
DATA PREPROCESSING
Data preprocessing is an important step in the data mining process. It refers to the cleaning,
transforming, and integrating of data in order to make it ready for analysis. The goal of data
preprocessing is to improve the quality of the data and to make it more suitable for the specific
data mining task.
It is a data mining technique that involves transforming raw data into an understandable
format.
Meaning:-Data preprocessing is a crucial step in the data analysis and machine learning process.
It involves cleaning, transforming, and organizing raw data into a format that can be effectively
utilized for analysis or training machine learning models.
ELEMENTS:-
Data Quality:
Data quality measures how well a dataset meets criteria for accuracy, completeness, validity,
consistency, uniqueness, timeliness, and fitness for purpose, and it is critical to all data
governance initiatives within an organization.
Accuracy:
Completeness
Consistency
Timeliness
Believability
Value added
Interpretability
Accessibility
Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data cleaning is
done. It involves handling of missing data, noisy data etc.
(a). Missing Data:
This situation arises when some values are missing in the data. It can be handled in various
ways. Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and multiple
values are missing within a tuple.
(b). Noisy Data:
Noisy data is data containing errors or outliers; it can be smoothed in the following ways:
1. Binning:
Sorted data values are divided into bins (segments) of equal size and smoothed by replacing
them with the bin mean or the bin boundaries.
2. Regression:
Here data can be made smooth by fitting it to a regression function. The regression used
may be linear (having one independent variable) or multiple (having multiple
independent variables).
3. Clustering:
This approach groups the similar data in a cluster. The outliers may be undetected or it
will fall outside the clusters.
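As a small illustration of handling missing data, the sketch below uses pandas on a hypothetical table; dropping incomplete tuples corresponds to "ignore the tuples" above, while mean imputation is shown as one common alternative (not listed above) for filling missing values.

```python
# Data-cleaning sketch: handling missing values in a small hypothetical dataset.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 31, 47, np.nan],
    "income": [32000, 45000, np.nan, 52000, 41000],
})

# Option 1: ignore (drop) the tuples that contain missing values
dropped = df.dropna()

# Option 2: fill missing values, e.g. with each column's mean
imputed = df.fillna(df.mean(numeric_only=True))

print(dropped)
print(imputed)
```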
Data Integration: This involves combining data from multiple sources to create a unified
dataset. Data integration can be challenging as it requires handling data with different formats,
structures, and semantics. Techniques such as record linkage and data fusion can be used for
data integration.
Data Reduction:
Data reduction is a crucial step in the data mining process that involves reducing the size of the
dataset while preserving the important information. This is done to improve the efficiency of
data analysis and to avoid overfitting of the model. Some common steps involved in data
reduction are:
Feature Selection: This involves selecting a subset of relevant features from the dataset.
Feature selection is often performed to remove irrelevant or redundant features from the
dataset. It can be done using various techniques such as correlation analysis, mutual
information, and principal component analysis (PCA).
Feature Extraction: This involves transforming the data into a lower-dimensional space while
preserving the important information. Feature extraction is often used when the original
features are high-dimensional and complex. It can be done using techniques such as PCA,
linear discriminant analysis (LDA), and non-negative matrix factorization (NMF).
Sampling: This involves selecting a subset of data points from the dataset. Sampling is often
used to reduce the size of the dataset while preserving the important information. It can be
done using techniques such as random sampling, stratified sampling, and systematic sampling.
Clustering: This involves grouping similar data points together into clusters. Clustering is
often used to reduce the size of the dataset by replacing similar data points with a
representative centroid. It can be done using techniques such as k-means, hierarchical
clustering, and density-based clustering.
Compression: This involves compressing the dataset while preserving the important
information. Compression is often used to reduce the size of the dataset for storage and
transmission purposes. It can be done using techniques such as wavelet compression, JPEG
compression, and gzip compression.
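A brief sketch of two of these reduction techniques, feature extraction with PCA and random sampling, assuming scikit-learn, pandas, and NumPy are available and using synthetic data:

```python
# Data-reduction sketch: PCA feature extraction and simple random sampling
# on a synthetic dataset of 1000 rows and 10 numeric features.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(1000, 10)),
                  columns=[f"f{i}" for i in range(10)])

# Feature extraction: project the 10 features down to 3 principal components
reduced = PCA(n_components=3).fit_transform(df)
print(reduced.shape)          # (1000, 3)

# Sampling: keep a 10% random subset of the rows
sample = df.sample(frac=0.1, random_state=42)
print(sample.shape)           # (100, 10)
```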
Data Transformation:
This step is taken in order to transform the data in appropriate forms suitable for mining
process. This involves following ways:
Normalization:
It is done in order to scale the data values in a specified range (-1.0 to 1.0 or 0.0 to 1.0)
Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help
the mining process.
Discretization:
This is done to replace the raw values of numeric attribute by interval levels or
conceptual levels.
Data Discretization: This involves dividing continuous data into discrete categories or
intervals. Discretization is often used in data mining and machine learning algorithms that
require categorical data. Discretization can be achieved through techniques such as equal width
binning, equal frequency binning, and clustering
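The normalization and discretization steps described above can be illustrated with pandas on a small hypothetical attribute; the bin labels are arbitrary.

```python
# Data-transformation sketch: min-max normalization to the 0.0-1.0 range and
# equal-width discretization (binning) of a numeric attribute, using pandas.
import pandas as pd

ages = pd.Series([18, 22, 25, 31, 38, 45, 52, 67])

# Normalization: rescale the raw values into the range 0.0 to 1.0
normalized = (ages - ages.min()) / (ages.max() - ages.min())

# Discretization: replace raw values with interval (conceptual) labels
bins = pd.cut(ages, bins=3, labels=["young", "middle-aged", "senior"])

print(normalized.round(2).tolist())
print(bins.tolist())
```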
Data Science Project Life Cycle
Earlier, data was much smaller in volume and generally available in a well-structured form that
could be saved effortlessly and easily in Excel sheets, and processed efficiently with the help of
Business Intelligence tools.
Today, however, we deal with huge amounts of data: about 3.0 quintillion bytes of records are
produced each and every day, which ultimately results in an explosion of records and data.
According to recent research, it is estimated that about 1.9 MB of data and records are created
every second by a single individual.
So it is a very big challenge for any organization to deal with such a massive amount of data
being generated every second. Handling and evaluating this data requires very powerful,
complex algorithms and technologies, and this is where Data Science comes into the picture.
The following are some primary motives for the use of Data science technology:
1. It helps to convert big quantities of raw and unstructured records into meaningful
insights.
2. It can assist in making predictions, for example for surveys, elections, etc.
3. It also helps in automating transportation, such as developing self-driving cars, which
may be the future of transportation.
4. Companies are shifting towards Data Science and opting for this technology.
Companies such as Amazon and Netflix, which cope with huge quantities of data, use
data science algorithms to provide a better consumer experience.
The data science project life cycle outlines the stages and steps involved in executing a data
science project from conception to deployment. While the exact process can vary depending on
the project and organization, the following is a common framework for the data science project
life cycle:
1. Problem Definition (Business Understanding):
Define the problem or business question you want to address with data science.
Understand the domain and context of the problem.
Determine the objectives and goals of the project.
2. Data Collection:
4. Feature Engineering:
6. Model Evaluation:
Assess model performance using appropriate metrics (e.g., accuracy, precision, recall,
F1-score, RMSE, etc.), as illustrated in the sketch after this list.
Compare multiple models and select the best-performing one.
Perform cross-validation to estimate model generalization.
7. Model Deployment:
9. Documentation:
Document the entire project, including data sources, data pre-processing steps, modelling
techniques, and results.
Ensure code is well-documented for future reference and collaboration.
If the project proves successful, consider scaling up the deployment to handle larger data
volumes or additional use cases.
Properly archive the project and its assets for future reference and compliance purposes.
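As referenced in the model evaluation step above, the common classification metrics can be computed directly from true and predicted labels. Below is a minimal sketch; it assumes scikit-learn is available and uses small hypothetical label lists rather than a real model's output.

```python
# Model-evaluation sketch: comparing predictions against true labels with
# common classification metrics.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # actual (hypothetical) labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # labels predicted by some model

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
```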
The data science project life cycle is an iterative process, and it's important to maintain flexibility
throughout each stage as new insights and challenges may arise. Collaboration, effective
communication, and a focus on business objectives are key elements for success in data science
projects.
The Data Science lifecycle revolves around the use of machine learning and different analytical
strategies to produce insights and predictions from data in order to achieve a business
objective. The complete method includes a number of steps like data cleaning, preparation,
modelling, and model evaluation. It is a lengthy procedure and may take several months to
complete. So, it is very important to have a generic structure to follow for each and every
problem at hand. The globally recognized structure for solving any analytical problem is
referred to as the Cross Industry Standard Process for Data Mining, or CRISP-DM, framework.
3. Preparation of Data: Next comes the data preparation stage. This consists of steps like
selecting the applicable data, integrating the data by merging the data sets, cleaning the
data, treating missing values by either eliminating or imputing them, treating inaccurate
data by eliminating it, and also checking for outliers using box plots and handling them.
It also involves constructing new data and deriving new features from existing ones,
formatting the data into the preferred structure, and eliminating unwanted columns and
features. Data preparation is the most time-consuming yet arguably the most essential
step in the complete life cycle. Your model will only be as good as your data.
4. Exploratory Data Analysis: This step involves getting some idea about the solution
and the factors affecting it before building the actual model. The distribution of data
within the different variables is explored graphically using bar graphs, and relations
between different features are captured through graphical representations like scatter
plots and heat maps. Many data visualization techniques are used extensively to explore
each feature individually and in combination with other features.
5. Data Modelling: Data modelling is the heart of data analysis. A model takes the
prepared data as input and gives the desired output. This step consists of selecting the
suitable kind of model, depending on whether the problem is a classification problem, a
regression problem, or a clustering problem. After deciding on the model family, we
need to carefully choose, among the algorithms in that family, the ones to implement,
and then implement them. We need to tune the hyperparameters of each model to obtain
the desired performance. We also need to make sure there is the right balance between
performance and generalizability: we do not want the model to memorize the data and
perform poorly on new data. (A short illustrative sketch of this step and the next follows
after this list.)
6. Model Evaluation: Here the model is evaluated to check whether it is ready to be
deployed. The model is tested on unseen data and evaluated on a carefully thought-out
set of evaluation metrics. We also need to make sure that the model conforms to reality.
If we do not achieve a quality result in the evaluation, we have to re-iterate the complete
modelling process until the desired level of the metrics is achieved. Any data science
solution or machine learning model, just like a human, must evolve, must be capable of
improving itself with new data and adapting to a new evaluation metric. We can build
more than one model for a given phenomenon, but many of them may be imperfect.
Model evaluation helps us select and build the ideal model.
7. Model Deployment: After a rigorous assessment, the model is finally deployed in the
preferred structure and channel. This is the last step in the data science life cycle.
Each step in the data science life cycle described above must be worked upon carefully.
If any step is performed improperly, it will affect the subsequent steps and the complete
effort goes to waste. For example, if data is not collected properly, you will lose records
and will not build an ideal model. If the data is not cleaned properly, the model will not
work. If the model is not evaluated properly, it will fail in the real world. Right from
business understanding to model deployment, every step has to be given appropriate
attention, time, and effort.
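To make the modelling and evaluation steps sketched above more concrete, here is a minimal, illustrative example. It assumes scikit-learn and its bundled iris dataset; the decision-tree model family and the max_depth grid are arbitrary choices for illustration, not a workflow prescribed by these notes.

```python
# Data-modelling sketch: choosing a classifier and tuning a hyperparameter,
# using cross-validation to balance performance and generalizability.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Try several tree depths; 5-fold cross-validation guards against overfitting
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": [2, 3, 4, 5]},
                      cv=5)
search.fit(X_train, y_train)

print("Best max_depth:", search.best_params_["max_depth"])
print("Held-out accuracy:", search.score(X_test, y_test))   # evaluation on unseen data
```

The held-out test score at the end plays the role of the evaluation step: it checks the model on data it never saw during training.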
Optimization
Optimization in machine learning involves adjusting a model's parameters or an algorithm's
settings so that it better fits the data and the desired objective, typically by minimizing an error
(loss) function. It is a great way to understand and improve how a model uses data. For example,
an optimization algorithm can be trained to catch inaccuracies or inconsistencies in the system,
removing the need for you to comb through data by hand.
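As a tiny illustration of this idea, the following sketch applies plain gradient descent to minimize a squared-error loss for a one-parameter model; the data points and learning rate are made up for demonstration.

```python
# Optimization sketch: plain gradient descent minimizing a mean squared-error
# loss for a one-parameter model y = w * x (illustrative values only).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]    # roughly y = 2x

w = 0.0                      # initial guess for the parameter
lr = 0.01                    # learning rate (step size)

for step in range(200):
    # gradient of the mean squared error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad           # move against the gradient to reduce the loss

print(round(w, 3))           # converges close to 2.0
```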
Applications for Data Science
1. In Search Engines
The most visible application of Data Science is in search engines. When we want to search for
something on the internet, we mostly use search engines like Google, Yahoo, Bing, etc. Data
Science is used to make searches faster and more relevant.
For example, when we search for something, say "Data Structure and algorithm courses", we
get the GeeksforGeeks Courses link among the first results. This happens because the
GeeksforGeeks website is visited most often for information regarding data structure courses
and computer-related subjects. This analysis is done using Data Science, which surfaces the
most visited web links.
2. In Transport
Data Science has also entered real-time domains such as transport, for example driverless cars.
With the help of driverless cars, it becomes easier to reduce the number of accidents.
For example, in driverless cars the training data is fed into the algorithm and, with the help of
Data Science techniques, the data is analyzed: what the speed limit is on highways, busy
streets, and narrow roads, and how to handle different situations while driving.
3. In Finance
Data Science plays a key role in the financial industry. Financial firms constantly face fraud
and the risk of losses, so they need to automate risk-of-loss analysis in order to make strategic
decisions for the company. Financial firms also use Data Science analytics tools to predict the
future, which allows them to predict customer lifetime value and stock market moves.
For example, Data Science plays a major role in the stock market: past behavior is examined
using historical data with the goal of estimating the future outcome. Data is analyzed in such a
way that it becomes possible to predict future stock prices over a set time frame.
4. In E-Commerce
E-commerce websites like Amazon, Flipkart, etc. use Data Science to create a better user
experience with personalized recommendations.
For example, when we search for something on e-commerce websites, we get suggestions
similar to our choices based on our past data, and we also get recommendations for the
most-bought, most-rated, and most-searched products. This is all done with the help of Data
Science.
5. In Health Care
In the healthcare industry, Data Science acts as a boon. Data Science is used for:
Detecting tumors.
Drug discoveries.
Medical Image Analysis.
Virtual Medical Bots.
Genetics and Genomics.
Predictive Modeling for Diagnosis etc.
6. Image Recognition
Currently, Data Science is also used in image recognition. For example, when we upload a
photo with a friend on Facebook, Facebook suggests tagging who is in the picture. This is done
with the help of machine learning and Data Science. When an image is recognized, data
analysis is done on one's Facebook friends, and if a face present in the picture matches
someone's profile, Facebook suggests auto-tagging.
7. Targeting Recommendation
Targeted recommendation is one of the most important applications of Data Science. Whatever
a user searches for on the internet, he or she will later see related posts everywhere. This can be
explained with an example: suppose I want a mobile phone, so I search for it on Google, and
after that I change my mind and decide to buy it offline. In the real world, Data Science helps
the companies who pay for advertisements for that mobile phone, so everywhere on the
internet, in social media, on websites, and in apps, I will see recommendations for the mobile
phone I searched for, which nudges me to buy it online.
8. Airline Route Planning
With the help of Data Science, the airline sector is also improving: for example, it becomes
easier to predict flight delays. Data Science also helps decide whether to fly directly to the
destination or take a halt in between; a flight can take a direct route from Delhi to the U.S.A. or
halt in between before reaching the destination.
9. Data Science in Gaming
In most games where a user plays against a computer opponent, Data Science concepts are used
together with machine learning: with the help of past data, the computer improves its
performance. Many games, such as chess and EA Sports titles, use Data Science concepts in
this way.
10. Medicine and Drug Development
The process of creating a medicine is very difficult and time-consuming and has to be done with
full discipline, because it is a matter of someone's life. Without Data Science, it takes a lot of
time, resources, and finance to develop a new medicine or drug, but with the help of Data
Science it becomes easier, because the prediction of the success rate can be determined based
on biological data and factors. Algorithms based on Data Science can forecast how a compound
will react in the human body without lab experiments.
11. In Delivery Logistics
Various Logistics companies like DHL, FedEx, etc. make use of Data Science. Data Science
helps these companies to find the best route for the Shipment of their Products, the best time
suited for delivery, the best mode of transport to reach the destination, etc.
12. Autocomplete
The autocomplete feature is an important application of Data Science: the user types just a few
letters or words and gets suggestions for completing the whole line. In Google Mail, when we
are writing a formal mail to someone, the Data Science concept behind autocomplete offers an
efficient way to complete the whole line. The autocomplete feature is also widely used in search
engines, in social media, and in various apps.
UNIT – III: BIG DATA MANAGEMENT AND DATA VISUALIZATION
Introduction to Big Data: Evolution of Big Data concept – Features of Big Data- Big Data
Challenges-Big Data Analytics.
Introduction to Data Visualization: Data Visualization concept- Importance of data
visualization - Structure of Visualization - Tools for Data visualization- Data Queries-Data
Dashboards- Principles of effective Data Dashboards-Applications of Data Dashboards.
The era of digital intelligence has led to a substantial shift towards fast data generation. With the rise
of social media and ecommerce, 2.5 quintillion bytes of data are produced every day. For example,
the New York Stock Exchange generates about one terabyte of new trade data per day. This amount
of information is not easy to process, especially insofar as 90% of it comes unstructured and
disorganized. This continuous generation of huge data volumes that are difficult to analyze using
traditional data mining techniques is called big data, and it is one of the most conspicuous
phenomena in the 21st-century world.
The concept of Big Data has evolved significantly over the years, driven by advancements in
technology, changes in data generation and consumption patterns, and the growing need for
businesses and organizations to extract meaningful insights from vast amounts of data. Here's a brief
overview of the evolution of the Big Data concept:
Early Days (Pre-2000):
The term "Big Data" wasn't widely used, and the challenges associated with handling large
volumes of data were not as prominent.
Traditional databases and data processing techniques were primarily used for relatively
smaller datasets.
Emergence of the Term (2000-2010):
With the proliferation of the internet, social media, and e-commerce, the volume of data
being generated increased exponentially.
Doug Laney's famous 3Vs model (Volume, Velocity, Variety) gained prominence as a way
to characterize the challenges posed by Big Data.
Technologies like Apache Hadoop, an open-source framework for distributed storage and
processing of large data sets, emerged.
Analytics and Real-Time Processing (2010-2018):
Big Data became closely intertwined with advanced analytics and machine learning, as
organizations sought to derive actionable insights from their data.
The integration of Big Data with real-time analytics became more prevalent, allowing
organizations to make quicker decisions based on up-to-date information.
The rise of streaming data technologies enabled the processing of data in real time.
Focus on Data Governance and Privacy (2018-present):
As the importance of data privacy and governance increased, there was a growing emphasis
on ensuring ethical and responsible use of Big Data.
Regulatory frameworks, such as GDPR (General Data Protection Regulation), influenced
how organizations handle and process personal data.
Beyond the 3Vs (2020s and Beyond):
The original 3Vs model expanded to include additional characteristics like Veracity,
Variability, and Value.
The concept of Edge Computing gained importance as organizations sought to process data
closer to the source to reduce latency and bandwidth usage.
The use of AI and machine learning in handling and analyzing Big Data continued to grow,
allowing for more sophisticated insights.
Hybrid and Multi-Cloud Approaches (Ongoing):
Organizations increasingly adopted hybrid and multi-cloud strategies to manage their Big
Data workloads, combining on-premises infrastructure with cloud services.
The focus shifted towards optimizing costs, performance, and flexibility in Big Data
solutions.
The evolution of the Big Data concept is ongoing, and it continues to be shaped by
technological advancements, changing business requirements, and the evolving landscape of
data-related challenges and opportunities.
Definition of Big Data
Big Data refers to extremely large and complex datasets that are beyond the capability of traditional
data processing tools and methods to capture, store, manage, and analyze within a reasonable
timeframe.
The concept of Big Data is often described using the "3Vs" model:
1. Volume: Refers to the sheer size of the data. Big Data involves handling datasets that are
typically measured in terabytes, petabytes, or even exabytes.
2. Velocity: Refers to the speed at which data is generated, processed, and made available for
analysis. Big Data often involves real-time or near-real-time processing to keep up with the
constant flow of data.
3. Variety: Encompasses the diversity of data types, formats, and sources. Big Data includes
structured, semi-structured, and unstructured data from various sources such as text, images,
videos, social media, sensor data, and more.
Over time, the concept has evolved to include additional characteristics such as veracity (the
quality and reliability of the data), variability (inconsistency in data flow), value (the
importance and usefulness of insights derived from the data), and others.
The processing and analysis of Big Data require advanced technologies and tools, including
distributed computing frameworks like Apache Hadoop and Apache Spark, NoSQL
databases, machine learning algorithms, and cloud computing infrastructure. Organizations
leverage Big Data to extract valuable insights, make data-driven decisions, and gain a
competitive advantage in various domains, including business, healthcare, finance, and
research.
The concept is characterized by three main dimensions, often referred to as the "3Vs":
Volume, Velocity, and Variety. Additionally, other characteristics like Veracity, Variability,
and Value have been included to provide a more comprehensive understanding of the
challenges and opportunities associated with Big Data.
Here's a breakdown of the key aspects of the definition:
1. Volume:
Definition: Refers to the vast amounts of data generated or collected by organizations. This
data could come from various sources, such as social media, sensors, transactions, and more.
Significance: The sheer quantity of data is one of the defining features of Big Data.
Traditional databases may struggle to handle such large volumes.
2. Velocity:
Definition: Describes the speed at which data is generated, processed, and made
available for analysis. Big Data often involves real-time or near-real-time processing to
keep up with the continuous flow of data.
Significance: The pace at which data is generated and needs to be processed is a crucial
factor. Real-time processing enables quick decision-making based on the most current
information.
3. Variety:
Definition: Encompasses the diversity of data types, formats, and sources. Big Data
includes structured, semi-structured, and unstructured data from a wide range of origins,
such as text, images, videos, and more.
Significance: Big Data is not limited to structured databases but includes a variety of
data types. Handling this diversity requires flexible data processing and analytics tools.
4. Veracity:
Definition: Refers to the quality and reliability of the data. Big Data may include data
with varying levels of accuracy and reliability, and managing this aspect is crucial for
meaningful analysis.
Significance: Ensuring the trustworthiness of the data is essential to prevent inaccurate
or misleading insights.
5. Variability:
Definition: Reflects the inconsistency in the data flow, which may have irregular patterns or
periodic variations. Big Data processing systems need to handle fluctuations and variations
in data flow.
Significance: Big Data is not always consistent, and the ability to manage variations in data
flow is essential for accurate analysis.
6. Value:
Definition: Represents the importance and usefulness of the insights derived from Big
Data. The ultimate goal is to extract valuable insights and actionable information.
Significance: The ultimate goal of Big Data is to extract valuable insights and
actionable information that can benefit an organization in terms of decision-making,
efficiency, and innovation.
7. Complexity:
Definition: Refers to the intricacy of data relationships and the challenges associated with
analyzing interconnected datasets.
Significance: Big Data solutions often deal with complex data structures, requiring
advanced analytics and tools to uncover meaningful patterns and relationships.
Big data includes information produced by humans and devices. Device-driven data is
largely clean and organized, but of far greater interest is human-driven data, which exists in
various formats and needs more sophisticated tools for proper processing and management.
– Real-time data. They are produced on online streaming media, such as YouTube, Twitch,
Skype, or Netflix.
– Transactional data. They are gathered when a user makes an online purchase (information
on the product, time of purchase, payment methods, etc.)
– Natural language data. These data are gathered mostly from voice searches that can be
made on different devices accessing the Internet.
– Time series data. This type of data is related to the observation of trends and phenomena
taking place at this very moment and over a period of time, for instance, global
temperatures, mortality rates, pollution levels, etc.
– Linked data. They are based on HTTP, RDF, SPARQL, and URIs web technologies and
meant to enable semantic connections between various databases so that computers could
read and perform semantic queries correctly.
Big Data Challenges
1. Storage: With vast amounts of data generated daily, the greatest challenge is storage
(especially when the data is in different formats) within legacy systems. Unstructured data
cannot be stored in traditional databases.
2. Processing: Processing big data refers to the reading, transforming, extraction, and
formatting of useful information from raw information. The input and output of information
in unified formats continue to present difficulties.
Big Data Analytics: Common Tools and Techniques
Hadoop Ecosystem
Apache Spark
NoSQL Databases
R Software
Predictive Analytics
Prescriptive Analytics
9. Real-Time Insights
The term "real-time analytics" describes the practice of performing analyses on data as a
system is collecting it. Decisions may be made more efficiently and with more accurate
information thanks to real-time analytics tools, which use logic and mathematics to deliver
insights on this data quickly.
INTRODUCTION TO DATA VISUALIZATION
Data visualization is the representation of information and data using charts, graphs, maps, and
other visual tools. These visualizations allow us to easily understand any patterns, trends, or outliers
in a data set.
Data visualization also presents data to the general public or specific audiences without technical
knowledge in an accessible manner. For example, the health agency in a government might provide
a map of vaccinated regions.
The purpose of data visualization is to help drive informed decision-making and to add colourful
meaning to an otherwise bland database.
Meaning
Data visualization is the process of representing data in a visual way, such as using graphs, charts,
or maps. It's used to communicate complex information in an intuitive way, making it easier for
people to understand and analyze data.
The goal of data visualization is to make it easier to identify trends, patterns, and outliers in large
data sets. It can also be used to deliver visual reporting on the performance, operations, or general
statistics of an application, network, hardware, or IT asset.
The four pillars of data visualization are: Distribution, Relationship, Comparison, and Composition.
Data visualization can be used in many contexts in nearly every field, like public policy, finance,
marketing, retail, education, sports, history, and more. Here are the benefits of data visualization:
Storytelling: People are drawn to colors and patterns in clothing, arts and culture, architecture, and
more. Data is no different—colors and patterns allow us to visualize the story within the data.
Visualize relationships: It’s easier to spot the relationships and patterns within a data set when the
information is presented in a graph or chart.
Exploration: More accessible data means more opportunities to explore, collaborate, and inform
actionable decisions.
Tableau
Google Charts
Dundas BI
Power BI
JupyteR
Infogram
ChartBlocks
D3.js
FusionCharts
Grafana
Visualizing data can be as simple as a bar graph or scatter plot, but it becomes powerful when
used to analyze and communicate large or complex datasets.
Some common types of data visualization include: Bar charts, Line charts, Scatter plots, Pie charts,
Heat maps.
Table: A table is data displayed in rows and columns, which can be easily created in a
Word document or Excel spreadsheet.
Chart or graph: Information is presented in graphical form with data displayed along an x
and y axis, usually with bars, points, or lines, to represent data in comparison. An
infographic is a special type of chart that combines visuals and words to illustrate the data.
Gantt chart: A Gantt chart is a bar chart that portrays a timeline and tasks specifically used
in project management.
Pie chart: A pie chart divides data into percentages featured in “slices” of a pie, all adding
up to 100%.
Geospatial visualization: Data is depicted in map form with shapes and colours that
illustrate the relationship between specific locations, such as a choropleth or heat map.
Dashboard: Data and visualizations are displayed, usually for business purposes, to help
analysts understand and present data.
Line Graph Visualization – is used to give a more in-depth view of the trend, projection,
average, and growth of a metric over time. The benefit of this type of visualization is that it
enables you to visualize patterns over a longer period and compare data to a previous period
or goal. In addition, this is a common visualization that will be familiar and easy to read by
most.
Bar Graph Visualization – is best used to demonstrate comparisons between values. It is
especially valuable when you want to compare different Google Analytics Dimensions to
each other.
Pie Chart Visualization – is usually used to illustrate numerical proportions through the
size of the slices. It is easy to read and ideal for demonstrating distribution.
Table Visualization – is mostly used to list Metrics by order of importance, ranking, and so
on. It is a great option for producing a clear view and comparisons of Metric values. It can
show Total SUM, AVG, MIN, or MAX and data for multiple or just a single Metric.
Funnel Visualization – is a great option for tracking the stages potential customers travel
through in order to become a customer.
Number Visualization – is best for showing a simple Metric or to draw attention to one key
number. It is compact, clear and precise and is great for highlighting progress in your
Dashboard.
Gauge Visualization – much like the progress bar, gauge visualization is best used to show
progress towards reaching a Goal but also for presenting maximum value.
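As a simple illustration of the line and bar graph visualizations described above, the following sketch uses matplotlib with hypothetical monthly sales figures.

```python
# Data-visualization sketch: a line graph of a metric over time and a bar graph
# comparing values, using matplotlib (hypothetical monthly sales figures).
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [120, 135, 150, 145, 170, 190]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.plot(months, sales, marker="o")   # trend of the metric over time
ax1.set_title("Monthly sales (line graph)")

ax2.bar(months, sales)                # comparison between values
ax2.set_title("Monthly sales (bar graph)")

plt.tight_layout()
plt.show()
```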
1. Numerical Data:
Numerical data is also known as quantitative data. Numerical data is any data that generally
represents an amount, such as the height, weight, or age of a person. Numerical data visualization
is the easiest way to visualize data. It is generally used to help others digest large data sets and
raw numbers in a way that makes them easier to interpret into action. Numerical data is
categorized into two categories:
o Continuous Data –
It can take any value within a given range and can be meaningfully subdivided
(Example: height measurements).
o Discrete Data –
This type of data is not "continuous"; it takes only separate, countable values
(Example: the number of cars or children a household has).
The type of visualization techniques that are used to represent numerical data visualization is
Charts and Numerical Values. Examples are Pie Charts, Bar Charts, Averages, Scorecards, etc.
2. Categorical Data:
Categorical data is also known as qualitative data. Categorical data is any data that generally
represents groups. It simply consists of categorical variables that are used to represent
characteristics such as a person's ranking, a person's gender, etc. Categorical data visualization
is all about depicting key themes, establishing connections, and lending context. Categorical
data is classified into three categories:
Nominal Data – In this, classification is based on naming or labelling without any inherent order
(Example: gender, colours).
Binary Data – In this, there are only two mutually exclusive categories (Example: yes/no,
pass/fail).
Ordinal Data – In this, classification is based on ordering of information (Example: Timeline or
processes).
The type of visualization techniques that are used to represent categorical data is Graphics,
Diagrams, and Flowcharts. Examples are Word clouds, Sentiment Mapping, Venn Diagram, etc.
Applications of data visualisation
Healthcare Industries
A dashboard that visualises a patient's history might aid a current or new doctor in comprehending
the patient's health. In the event of an emergency, it might enable faster care based on the illness.
Instead of sifting through hundreds of pages of information, data visualisation may assist in
finding trends.
Health care is a time-consuming procedure, and the majority of that time is spent evaluating prior
reports. Data visualisation provides a superior selling point by boosting response time: it gives
metrics that make analysis easier, resulting in a faster reaction time.
Business intelligence
When compared to local options, cloud connection can provide the cost-effective “heavy lifting” of
processor-intensive analytics, allowing users to see bigger volumes of data from numerous sources
to help speed up decision-making.
Because such systems can be diverse, comprised of multiple components, and may use their own
data storage and interfaces for access to stored data, additional integrated tools, such as those geared
toward business intelligence (BI), help provide a cohesive view of an organization's entire data
system (e.g., web services, databases, historians, etc.).
Multiple datasets can be correlated using analytics/BI tools, which allow for searches using a
common set of filters and/or parameters. The acquired data may then be displayed in a standardised
manner using these technologies, giving logical "shaping" and better comparison grounds for end
users.
Military
For the military it is a matter of life and death: having clarity of actionable data is critical, and
taking the appropriate action requires clear data from which actionable insights can be pulled.
The adversary is present in the field today, and also poses a danger through digital warfare and
cyber attacks. It is critical to collect data from a variety of sources, both structured and
unstructured. The volume of data is enormous, and data visualisation technologies are essential
for rapid delivery of accurate information in the most condensed form feasible. A greater grasp
of past data allows for more accurate forecasting.
Dynamic Data Visualization aids in a better knowledge of geography and climate, resulting in a
more effective approach. The cost of military equipment and tools is extremely significant; with bar
and pie charts, analysing current inventories and making purchases as needed is simple.
Finance Industries
For exploring/explaining data of linked customers, understanding consumer behaviour, having a
clear flow of information, the efficiency of decision making, and so on, data visualisation tools are
becoming a requirement for financial sectors.
For associated organisations and businesses, data visualisation aids in identifying patterns, which
in turn supports better investment strategies. For improved business prospects, data visualisation
highlights the most recent trends.
Data science
Data scientists generally create visualisations for their personal use or to communicate information
to a small group of people. Visualization libraries for the specified programming languages and
tools are used to create the visual representations.
Open source programming languages, such as Python, and proprietary tools built for
complicated data analysis are commonly used by data scientists and academics. These data
scientists and researchers use data visualisation to better comprehend data sets and spot patterns and
trends that might otherwise go undiscovered.
Marketing
In marketing analytics, data visualisation is a boon. We may use visuals and reports to analyse
various patterns and trends analysis, such as sales analysis, market research analysis, customer
analysis, defect analysis, cost analysis, and forecasting. These studies serve as a foundation for
marketing and sales.
Visual aids can help your audience grasp your main message by engaging them visually. The
major advantage of visualising data is that it can communicate a point faster than a plain
spreadsheet.
In B2B firms, data-driven yearly reports and presentations often don't fulfil the needs of the
people viewing them, because they fail to engage their audience in a meaningful or memorable
manner. Your audience will be more interested in your facts if you present them as visual
statistics, and they will be more inclined to act on your findings.
Food delivery services
When you place an order for food on your phone, it is given to the nearest delivery person. There is
a lot of math involved here, such as the distance between the delivery executive's present position
and the restaurant, as well as the time it takes to get to the customer's location.
Customer orders, delivery location, GPS service, tweets, social media messages, verbal comments,
pictures, videos, reviews, comparative analyses, blogs, and updates have all become common ways
of data transmission.
Users may obtain data on average wait times, delivery experiences, other records, customer service,
meal taste, menu options, loyalty and reward point programmes, and product stock and inventory
data with the help of the data.
Education
Users may visually engage with data, answer questions quickly, make more accurate, data-informed
decisions, and share their results with others using intuitive, interactive dashboards.
Institutions can monitor students' progress throughout the semester, allowing advisers to act
quickly with outreach to struggling students. When end users are given access to interactive,
self-service analytic visualisations as well as ad hoc visual data discovery and exploration,
quick insights become accessible to everyone, even those with little prior experience with
analytics.
E-commerce
In e-commerce, any chance to improve the customer experience should be taken. The key to
running a successful internet business is getting rapid insights. This is feasible with data
visualisation, because cross-referencing data reveals features that would otherwise be hidden.
Your marketing team may use data visualisation to produce excellent content for your audience that
is rich in unique information. Data may be utilised to produce attractive narrative through the use of
info graphics, which can easily and quickly communicate findings.
Patterns may be seen all throughout the data. You can quickly and readily detect them if you
make them visible. These behaviours indicate a variety of consumer trends, providing you with
knowledge to help you attract new clients and close sales.
Conclusion
When data is visualised, it is processed more quickly. Data visualisation organises all of the
information in a way that the traditional technique would miss.
We can observe ten data visualisation applications in many sectors, and as these industries continue
to expand, the usage of data visualisations will proliferate and evolve into key assets for these
organisations.
Database Query Definition
A data query is a request for specific information or insights from a dataset or database. It involves
retrieving, filtering, aggregating, or manipulating data based on certain criteria or conditions. Data
queries can be simple or complex, depending on the complexity of the desired information and the
structure of the dataset.
In the context of databases, data queries are typically expressed using query languages such as SQL
(Structured Query Language) for relational databases or specialized query languages for NoSQL
databases. These queries can range from basic operations like selecting specific columns from a table
to more complex operations involving joins, filtering, grouping, and aggregation.
Data queries are fundamental to data analysis, reporting, and decision-making processes in various
domains such as business, finance, healthcare, and scientific research. They enable users to extract
actionable insights from large volumes of data, facilitating informed decision-making and problem-
solving.
There are several types of data queries, each serving different purposes and allowing users to extract
specific information or insights from a dataset. Here are some common types of data queries:
Select Query: A select query retrieves data from one or more tables in a database. It allows
users to specify the columns they want to retrieve and may include filtering conditions to
restrict the returned rows.
Filtering Query: Filtering queries are used to retrieve data that meets specific criteria or
conditions. These queries typically include WHERE clauses to specify the conditions that
the data must satisfy.
Join Query: A join query combines data from multiple tables based on a related column or
key. Joins are used to retrieve information that spans multiple tables, allowing users to
correlate data from different sources.
Subquery: A subquery is a query nested within another query. Subqueries can be used to
retrieve data dynamically based on the results of another query. They are often used in
filtering conditions or as part of a larger query's logic.
Sorting Query: Sorting queries arrange the retrieved data in a specified order, such as
ascending or descending order based on one or more columns. Sorting is useful for
organizing data for presentation or analysis.
Grouping Query: Grouping queries group the data based on one or more columns and then
perform aggregate calculations within each group. They are used to analyze data at different
levels of granularity and to calculate summary statistics for each group.
Insert Query: An insert query adds new data to a database table. It specifies the values to be
inserted into each column of the table.
Update Query: An update query modifies existing data in a database table. It specifies the
changes to be made to one or more rows based on specified conditions.
Delete Query: A delete query removes data from a database table based on specified
conditions. It deletes one or more rows that meet the specified criteria.
These are some of the common types of data queries used in database management systems and
data analysis tasks. The choice of query type depends on the specific requirements of the analysis or
operation being performed.
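To make these query types concrete, here is a minimal sketch using Python's built-in sqlite3 module; the customers and orders tables, their columns, and the sample rows are hypothetical and exist only for illustration.

```python
import sqlite3

# In-memory database with hypothetical "customers" and "orders" tables.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                [(1, "Asha", "Pune"), (2, "Ravi", "Delhi")])
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 1, 250.0), (2, 1, 480.0), (3, 2, 120.0)])

# Select + filtering query: orders above a threshold.
cur.execute("SELECT id, amount FROM orders WHERE amount > 200")
print(cur.fetchall())

# Join + grouping + sorting query: total order amount per customer, highest first.
cur.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY total DESC
""")
print(cur.fetchall())

# Insert, update, and delete queries.
cur.execute("INSERT INTO orders VALUES (4, 2, 300.0)")
cur.execute("UPDATE orders SET amount = 150.0 WHERE id = 3")
cur.execute("DELETE FROM orders WHERE amount < 200")
conn.commit()
conn.close()
```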
Data Dashboards
A data dashboard is an interactive analysis tool used by businesses to track and monitor the
performance of their strategies with quality KPIs. Armed with real-time data, these tools enable
companies to extract actionable insights and ensure continuous growth.
They offer users a comprehensive overview of their company’s various internal departments, goals,
initiatives, processes, or projects. These are measured through key performance indicators (KPIs),
which provide insights that will enable you to foster growth and improvement.
Online dashboards provide immediate navigable access to actionable analysis that has the power to
boost your bottom line through continual commercial evolution.
To properly define dashboards, you need to consider the fact that, without the existence of
dashboards and dashboard reporting practices, businesses would need to sift through colossal stacks
of unstructured information, which is both inefficient and time-consuming.
Alternatively, a business would have to "shoot in the dark" concerning its most critical processes,
projects, and internal insights, which is far from ideal in today’s world.
Creating effective data dashboards involves adhering to several principles to ensure they are
informative, actionable, and user-friendly. Here are some key principles to consider when
designing data dashboards:
1. Clarity of Purpose: Clearly define the purpose of the dashboard and the specific audience
it serves. Understanding the users' needs and objectives will help guide the design and
content of the dashboard.
2. Focus on Key Metrics: Prioritize the display of key metrics and insights that are most
relevant to the users' goals and decision-making processes. Avoid cluttering the dashboard
with unnecessary information that may distract from the main objectives.
3. Simplicity and Conciseness: Keep the dashboard design simple and concise to facilitate
easy comprehension. Use clear and intuitive visualizations that convey information
efficiently without overwhelming the user.
4. Consistent Design and Layout: Maintain a consistent design and layout throughout the
dashboard to provide a cohesive user experience. Use standardized formatting, color
schemes, and typography to enhance readability and usability.
5. Contextualization and Storytelling: Provide context and narrative around the data to help users interpret the insights and understand their implications. Use annotations, captions, and narrative elements to tell a compelling story with the data.
6. Data Quality and Accuracy: Ensure the accuracy and reliability of the data presented on the dashboard by performing thorough data validation and quality checks. Display data sources, definitions, and any relevant caveats to promote transparency and trust.
7. Iterative Improvement: Continuously solicit feedback from users and stakeholders to identify areas for improvement and refinement. Iterate on the dashboard design based on user feedback and evolving requirements to ensure its ongoing effectiveness.
By following these principles, you can create data dashboards that effectively communicate
insights, support decision-making, and drive action within organizations.
1) Management Dashboard
Our first data dashboard template is a management dashboard. It is a good example of a “higher level” dashboard for a C-level executive, focused on the management KPIs that matter most for steering the business.
2) Financial Dashboard
Maintaining your financial health and efficiency is pivotal to your business's continual growth and evolution. Without keeping your fiscal processes water-tight, your organization will essentially become a leaky tap, draining your budgets right under your nose.
3) Sales Dashboard
This sales dashboard is a sales manager’s dream. It shows how long customers take, on average, to move through your funnel, and expands on this by comparing how different sales managers perform relative to one another.
4) IT Project Management Dashboard
Our IT project management dashboard offers a wealth of at-a-glance insights geared towards
reducing project turnaround times while enhancing efficiency in key areas, including total tickets
vs. open tickets, projects delivered on budget, and average handling times.
All KPIs complement each other, shedding light on real-time workloads, overdue tasks, budgetary trends, upcoming deadlines, and more. This perfectly balanced mix of metrics will help everyone within your IT department balance their personal workloads while gaining the insight required to put strategic enhancements into action and deliver projects with consistent success.
5) Human Resources Talent Dashboard
You can also take a look at a quick overview of the hiring stats that include the time to fill,
training costs, new hires, and cost per hire. The total number of employees, monthly salary, and
vacancies will provide you with an at-a-glance overview of the talent management stats within
the first quarter of the year.
UNIT – IV: DESCRIPTIVE STATISTICAL MEASURES
MEASURES OF LOCATION:
AVERAGES
Averages, also known as measures of central tendency, are statistical tools used to represent a set
of data with a single value that summarizes the "center" or "typical" value of the dataset.
Averages are fundamental in various fields, including statistics, mathematics, economics, and
business, as they help in understanding and interpreting data, making comparisons, and drawing
meaningful conclusions. They provide a way to simplify complex data into a single,
representative value, making it easier to analyze and work with large datasets.
DEFINITION OF AVERAGES:
Averages are mathematical calculations that yield a single value that is considered representative
of a set of data points. They aim to provide insights into where the bulk of the data is
concentrated, helping to identify central tendencies or typical values within the dataset. There are
several types of averages, each with its method of calculation and specific use cases.
Objectives of Averaging:
Types of Averages:
There are several types of averages, each with its unique characteristics and applications:
1. Mean (Arithmetic Mean):
Definition: The mean is the most common measure of central tendency. It's calculated by adding up all the values in a dataset and then dividing the sum by the total number of data points.
Formula: Mean (μ) = (Sum of all values) / (Total number of values).
Characteristics:
Sensitive to outliers: Extreme values can greatly affect the mean.
Appropriate for interval and ratio data (continuous data).
Application: The mean is used to calculate average sales, costs, revenues, and other financial
metrics. It helps in budgeting, forecasting, and performance analysis
2. Median:
Definition: The median is the middle value in a dataset when it's arranged in ascending or
descending order. If there's an even number of values, the median is the average of the two
middle values.
Characteristics:
Not affected by outliers: The median is resistant to extreme values.
Useful for skewed or non-normally distributed data.
Applicable to ordinal, interval, and ratio data.
Application: The median is often used in salary studies, income distribution analysis, and
market research to understand the typical income or price range.
3. Mode:
Definition: The mode is the value that appears most frequently in a dataset.
Characteristics:
Useful for nominal data (categorical data).
A dataset can have one mode (unimodal), multiple modes (multimodal), or no mode.
Application: In marketing, it helps identify the most popular product or service. In inventory
management, it can indicate the most commonly ordered item.
4. Weighted Mean:
Definition: The weighted mean is calculated by multiplying each value by its corresponding
weight and then summing the products. It's often used when some data points are more
significant than others.
Formula: Weighted Mean (μ) = (Σ(xi * wi)) / Σwi, where xi is the data point, and wi is its
weight.
Application: In financial analysis, it's used to calculate the weighted average cost of capital
(WACC) by considering the weights of debt and equity
5. Geometric Mean:
Definition: The geometric mean is used for datasets with values that represent growth rates or
ratios. It's calculated as the nth root of the product of n values.
Formula: Geometric Mean (μ) = (x1 * x2 * ... * xn)^(1/n).
Characteristics:
Appropriate for data with multiplicative relationships.
Used in finance, biology, and other fields where ratios are important.
6. Harmonic Mean:
Definition: The harmonic mean is calculated as the reciprocal of the arithmetic mean of the
reciprocals of the values. It's used for averaging rates or proportions.
Formula: Harmonic Mean (μ) = n / (1/x1 + 1/x2 + ... + 1/xn), where xi is the data point.
Characteristics:
Appropriate for averaging rates or proportions, e.g., speed, efficiency, or time.
Each measure of location has its strengths and is suitable for different types of data and research
questions. Choosing the right measure of central tendency depends on the nature of the dataset
and the goals of the analysis. It's also common to use multiple measures in combination to gain a
more comprehensive understanding of the data's central tendency.
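As a quick illustration of these measures, the following Python sketch computes each of them for a small made-up list of values (the numbers and weights are arbitrary); it assumes Python 3.8 or later for statistics.geometric_mean.

```python
import statistics

values = [12, 15, 15, 18, 22, 40]   # arbitrary sample data
weights = [1, 2, 2, 1, 1, 1]        # arbitrary weights for the weighted mean

mean = statistics.mean(values)                # arithmetic mean
median = statistics.median(values)            # middle value when sorted
mode = statistics.mode(values)                # most frequent value
weighted_mean = sum(x * w for x, w in zip(values, weights)) / sum(weights)
geometric_mean = statistics.geometric_mean(values)  # nth root of the product
harmonic_mean = statistics.harmonic_mean(values)    # reciprocal of the mean of reciprocals

print(mean, median, mode, weighted_mean, geometric_mean, harmonic_mean)
```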
MEASURES OF DISPERSION
Introduction to Variation:
Variation is a fundamental concept in statistics and data analysis. It refers to the degree of
difference or fluctuation in data points within a dataset. Understanding variation is crucial in
various fields, including business, economics, science, and quality control, as it helps us make
sense of data, identify patterns, and make informed decisions. Studying variation allows us to
gain insights into the underlying processes and factors that affect data, ultimately leading to
improved decision-making and problem-solving.
Definition of Variation:
Variation is the extent to which data points or observations differ within a dataset. It can
manifest as differences in values, patterns, or trends. Variation can be classified into different
types, such as:
1. Common Cause Variation: This variation is inherent in any process and arises from random or
common factors. It represents the natural variability in data and is typically expected.
2. Special Cause Variation: Special cause variation results from specific, identifiable factors not
part of the normal process. It indicates unusual or unexpected deviations from the expected
pattern.
Importance of Studying Variation:
1. Quality Improvement: In quality control and manufacturing, identifying and reducing variation is critical for producing consistent and high-quality products or services.
2. Data Analysis: Variation analysis helps data analysts and researchers identify trends, anomalies,
and potential outliers within datasets, leading to more accurate conclusions.
3. Process Improvement: In business and operations management, studying variation is essential
for process improvement, efficiency optimization, and cost reduction.
4. Risk Assessment: Variation analysis is used in risk assessment to identify potential risks and
uncertainties in financial forecasts, investment portfolios, and project timelines.
5. Decision-Making: Managers and decision-makers rely on variation analysis to make informed
decisions, set realistic goals, and allocate resources effectively.
Methods of Studying Dispersion:
1. Range:
Description: The range is the simplest measure of dispersion and represents the difference between the highest and lowest values in a dataset.
Significance: It provides a quick assessment of the spread of data. However, it's sensitive to
outliers and may not capture the full picture of dispersion in the middle of the dataset.
Formula: Range = Maximum Value - Minimum Value
2. Variance:
Description: Variance measures the average of the squared differences between each data point
and the mean. It quantifies the overall variability in the dataset.
Significance: Variance takes into account all data points, giving more weight to larger
deviations. It's widely used but can be influenced by extreme values.
Formula: Variance (σ²) = Σ(xi - μ)² / N, where xi is each data point, μ is the mean, and N is the
number of data points.
3. Standard Deviation:
Description: The standard deviation is the square root of the variance. It provides a measure of
dispersion that is in the same unit as the data.
Significance: Standard deviation is a common and intuitive measure of spread. It quantifies how
much individual data points deviate from the mean.
Formula: Standard Deviation (σ) = √(Variance)
4. Coefficient of Variation (CV):
Description: CV expresses the standard deviation as a percentage of the mean. It allows for comparing the relative variation between datasets with different means.
Significance: It's useful for comparing the variability of datasets with different scales or units.
Formula: CV = (Standard Deviation / Mean) * 100
5. Interquartile Range (IQR):
Description: IQR measures the range between the 25th and 75th percentiles (Q1 and Q3). It's resistant to outliers.
Significance: IQR provides a robust measure of the spread in the middle 50% of the data, making it useful for skewed distributions.
Formula: IQR = Q3 - Q1, where Q3 is the 75th percentile and Q1 is the 25th percentile.
6. Mean Absolute Deviation (MAD):
Description: MAD calculates the average of the absolute differences between each data point and the mean. It's less sensitive to outliers than the standard deviation.
Significance: MAD provides a robust measure of dispersion that can be useful when dealing with datasets containing extreme values.
Formula: MAD = Σ|xi - μ| / N, where |xi - μ| represents the absolute differences.
7. Coefficient of Quartile Variation (CQV):
Description: CQV compares the IQR to the median. It measures the relative spread of data within the interquartile range.
Significance: CQV helps assess the relative variability of data within the middle 50% of the dataset.
Formula: CQV = (IQR / Median) * 100
These methods of studying dispersion are essential for understanding the spread, variability, and
distribution of data. The choice of method depends on the characteristics of the dataset and the
specific goals of the analysis, including the need for resistance to outliers and the desire to
express variability relative to other statistics like the mean or median.
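The sketch below computes the dispersion measures described above for a small, invented dataset using NumPy; the numbers are arbitrary and chosen only to show the effect of one extreme value.

```python
import numpy as np

data = np.array([4, 8, 6, 5, 3, 9, 7, 30])     # arbitrary data with one large value

data_range = data.max() - data.min()
variance = data.var()                           # population variance, Σ(xi - μ)² / N
std_dev = data.std()                            # square root of the variance
cv = std_dev / data.mean() * 100                # coefficient of variation (%)

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                                   # interquartile range
mad = np.mean(np.abs(data - data.mean()))       # mean absolute deviation
cqv = iqr / np.median(data) * 100               # coefficient of quartile variation (%)

print(data_range, variance, std_dev, cv, iqr, mad, cqv)
```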
Measures of association
Methods of association, also known as measures of association or correlation analysis, are
statistical techniques used to quantify and understand the relationships between variables in a
dataset. These methods are fundamental in various fields, including statistics, social sciences,
economics, epidemiology, and data analysis. They help researchers and analysts uncover
patterns, dependencies, and interactions between variables, which in turn enables better decision-
making, prediction, and hypothesis testing.
Identify Relationships: These methods help identify and quantify relationships between
variables, allowing researchers to understand how changes in one variable are associated
with changes in another.
Predict Outcomes: By quantifying associations, researchers can make predictions about
one variable based on the values of others, which is valuable for forecasting and decision-
making.
Hypothesis Testing: Correlation analysis is used to test hypotheses about the existence
and strength of relationships, helping researchers draw meaningful conclusions from data.
Data Exploration: These methods are essential for exploratory data analysis, enabling
researchers to gain insights into complex datasets.
Variable Selection: In regression modeling and machine learning, measures of
association assist in selecting relevant predictors and building effective models.
Probability Distribution
A probability distribution is a fundamental concept in probability theory and statistics that
describes how the likelihood of different outcomes or events is spread or distributed over a
sample space. It provides a mathematical framework for quantifying uncertainty and making
predictions in a wide range of fields, including mathematics, science, economics, engineering,
and social sciences. Here are some elaborative notes on probability distributions:
Elements
1. Random Variables:
A random variable is a variable that takes on different values based on the outcomes of a
random experiment.
Random variables can be classified into two types: discrete and continuous.
Discrete Random Variables: Take on a countable number of distinct values (e.g.,
the number of heads in a series of coin tosses).
Continuous Random Variables: Can take on any value within a range (e.g., the
height of individuals in a population).
2. Probability Mass Function (PMF):
For discrete random variables, the probability distribution is described by a Probability
Mass Function (PMF).
The PMF provides the probabilities associated with each possible value of the random
variable.
Properties of a PMF:
Sum of probabilities equals 1.
Non-negative probabilities for each value.
P(X = x) represents the probability of the random variable X taking on the value
x.
3. Probability Density Function (PDF):
For continuous random variables, the probability distribution is described by a
Probability Density Function (PDF).
The PDF provides the relative likelihood of a random variable taking on a specific range
of values.
Properties of a PDF:
Area under the curve equals 1.
The PDF doesn't directly give probabilities for specific values; instead, it gives
probabilities for intervals.
4. Common Probability Distributions:
There are several well-known probability distributions, including:
Bernoulli Distribution: Models a binary outcome (e.g., success/failure).
Binomial Distribution: Models the number of successes in a fixed number of
Bernoulli trials.
Poisson Distribution: Models the number of events occurring in a fixed interval
of time or space.
Normal Distribution (Gaussian): Characterized by a bell-shaped curve and
often used for modeling continuous data.
Exponential Distribution: Models the time between events in a Poisson process.
Uniform Distribution: All values in a range are equally likely.
5. Moments of a Distribution:
Moments of a probability distribution describe various characteristics of the distribution.
The first moment is the mean (expected value), the second moment is the variance, and
the square root of the variance is the standard deviation.
6. Use in Statistics:
Probability distributions are essential in statistical analysis, hypothesis testing, and
parameter estimation.
They help quantify uncertainty, model real-world phenomena, and make predictions.
7. Simulation and Modeling:
Probability distributions are used in simulations to model random events and make
predictions in various fields, such as finance, engineering, and computer science.
8. Central Limit Theorem:
The Central Limit Theorem states that the sum or average of a large number of
independent and identically distributed random variables approaches a normal
distribution, regardless of the original distribution of the variables.
9. Applications:
Probability distributions find applications in diverse fields, including risk assessment,
quality control, genetics, economics, and machine learning.
Understanding and working with probability distributions is crucial for making informed
decisions in situations involving uncertainty and randomness. Different situations may require
the use of specific probability distributions that best capture the underlying phenomena.
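A brief sketch with scipy.stats can illustrate the PMF and PDF ideas above; the chosen distributions and parameters (a Binomial with n = 10, p = 0.5 and a standard Normal) are arbitrary examples.

```python
import numpy as np
from scipy import stats

# Discrete example: Binomial(n=10, p=0.5) - the PMF gives P(X = k) and sums to 1.
binom = stats.binom(n=10, p=0.5)
ks = np.arange(0, 11)
print(binom.pmf(5))            # probability of exactly 5 successes
print(binom.pmf(ks).sum())     # ≈ 1.0

# Continuous example: Normal(mean=0, std=1) - the PDF integrates to 1,
# and probabilities are obtained for intervals via the CDF.
norm = stats.norm(loc=0, scale=1)
print(norm.pdf(0))                     # density at x = 0 (not a probability)
print(norm.cdf(1) - norm.cdf(-1))      # P(-1 < X < 1) ≈ 0.683

# Moments: mean, variance, standard deviation.
print(norm.mean(), norm.var(), norm.std())
```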
Probability Distributions
Probability distributions have numerous applications in the business world, where they are used to model and analyze uncertainty, make informed decisions, and manage risks. Broadly, probability distributions can be classified as discrete or continuous.
Discrete Probability Distributions:
A discrete probability distribution is a statistical distribution that describes the probabilities of all
possible outcomes or values of a discrete random variable. Discrete random variables are those
that can only take on a finite or countable number of distinct values. In a discrete probability
distribution, each possible value of the random variable is associated with a probability that sums
to 1. Here are some key characteristics and examples of discrete probability distributions:
1. Countable Outcomes: Discrete random variables can take on values that are countable, such as
integers or specific categories.
2. Probability Mass Function (PMF): A discrete probability distribution is typically defined by its
Probability Mass Function (PMF). The PMF specifies the probability associated with each
possible value of the random variable.
3. Probability Sum: The sum of the probabilities for all possible outcomes must equal 1, as the
random variable must take on one of these values.
4. Disjoint Outcomes: The events corresponding to different values of the random variable are
mutually exclusive, meaning that only one of them can occur at a time.
5. Example: A common example of a discrete probability distribution is the probability distribution
for the outcome of rolling a fair six-sided die. The random variable can take on values 1, 2, 3, 4,
5, or 6, each with a probability of 1/6.
Continuous Probability Distributions:
A continuous probability distribution describes the probabilities of the possible values of a continuous random variable. Key characteristics include:
1. Infinite Outcomes: Continuous random variables can take on an infinite number of values within a specified interval. This interval is often denoted as [a, b], where 'a' and 'b' represent the lower and upper bounds.
2. Probability Density Function (PDF): The PDF is the counterpart of the Probability Mass
Function (PMF) for discrete distributions. It defines the probability density or likelihood of the
random variable falling within a particular interval.
3. Probability Integration: Instead of summing probabilities as in discrete distributions,
probabilities in continuous distributions are calculated by integrating the PDF over a specified
range.
4. Probability at a Single Point: For continuous random variables, the probability of the variable
taking on a specific point value is technically zero. In other words, P(X = x) = 0 for any single
value 'x.'
5. Example: A common example of a continuous probability distribution is the normal distribution
(Gaussian distribution), which describes many natural phenomena and is characterized by its
bell-shaped curve.
Common Continuous Probability Distributions:
1. Normal Distribution: The normal distribution is widely used to model continuous data in various fields, including finance, biology, and engineering. It is characterized by two parameters: the mean (μ) and the standard deviation (σ).
2. Exponential Distribution: This distribution models the time between events in a Poisson
process, such as the arrival of customers at a service center or the decay of radioactive particles.
3. Uniform Distribution: The uniform distribution describes a situation where all values within a
specified interval are equally likely. It is often used for random sampling and simulation.
4. Lognormal Distribution: The lognormal distribution is used to model data that is skewed
positively and can be transformed into a normal distribution by taking the natural logarithm of
the data.
5. Weibull Distribution: Commonly used in reliability engineering, the Weibull distribution
models the time until failure of a product or system.
6. Gamma Distribution: The gamma distribution is versatile and used to model waiting times,
income distributions, and other continuous data.
7. Beta Distribution: The beta distribution is often used in Bayesian statistics to model the
probability distribution of a random variable constrained to a finite range, such as proportions or
probabilities.
8. Cauchy Distribution: The Cauchy distribution has heavy tails and is used in physics and
engineering for modeling certain physical phenomena.
9. Triangular Distribution: This distribution is often used when data is assumed to be bounded by
minimum and maximum values and follows a triangular shape.
10. Pareto Distribution: Commonly used in economics, the Pareto distribution
describes the distribution of income and wealth, where a small percentage of the population
holds the majority of resources.
Continuous probability distributions are essential for modeling and analyzing real-world
phenomena, as they allow for the representation of continuous and uncountable data. They are
particularly valuable in fields where precision and accuracy in statistical analysis are crucial,
such as finance, engineering, and scientific research
Data Modeling
Data modeling is a fundamental process in data management, database design, and information
systems that involves creating a structured representation of data and its relationships to enable
efficient data storage, retrieval, and analysis. Here are elaborative notes on data modeling:
3. Entities and Attributes:
Data models typically involve entities (also called objects or concepts) and their attributes (characteristics or properties).
Entities are represented as tables in a relational database, and each row in a table represents an
instance of an entity.
Attributes describe the properties or features of entities, and each attribute corresponds to a
column in a table.
4. Relationships:
Data models depict relationships between entities. Common relationship types include one-to-
one, one-to-many, and many-to-many.
Relationships are often represented graphically in Entity-Relationship Diagrams (ERDs) using
lines connecting related entities.
5. Normalization:
Normalization is a process applied to data models to eliminate data redundancy and improve data
integrity.
It involves organizing data into separate tables and linking them through relationships.
The goal is to minimize data duplication and ensure that each piece of data is stored in only one
place in the database.
6. Data Modeling Tools:
Various software tools are available to assist in creating and visualizing data models. These tools include:
ERD Tools: Specialized tools for creating Entity-Relationship Diagrams.
Database Management Systems (DBMS): Many DBMSs offer built-in data modeling
capabilities.
General-purpose Modeling Tools: Tools like Microsoft Visio and Lucidchart can be
used for data modeling.
7. Modeling Languages:
Data models can be represented using standardized modeling languages, such as:
Unified Modeling Language (UML): Widely used for modeling not only data but also
software systems.
Entity-Relationship notations (such as Chen or crow's foot notation): notations designed specifically for data modeling.
8. Iterative Process:
Data modeling is often an iterative process that evolves as project requirements become clearer.
Models may undergo revisions and refinements as stakeholders' needs change.
9. Real-world Applications:
Data modeling is essential in various domains, including database design, software development,
business analysis, and data warehousing.
It plays a crucial role in industries such as finance, healthcare, retail, and manufacturing.
In conclusion, data modeling is a structured and systematic approach to representing data and its
relationships, ensuring data accuracy, consistency, and efficient management. It is a foundational
step in database design and information system development, facilitating effective data
organization and utilization in various applications and industries
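As a small illustration of entities, attributes, relationships, and normalization, the sketch below uses Python's sqlite3 module to create two related tables; the customer/order design and all column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Entity: Customer, with attributes as columns. Customer details are stored once
# (normalization), not repeated on every order row.
conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        city        TEXT
    )
""")

# Entity: Order, related to Customer one-to-many through a foreign key.
conn.execute("""
    CREATE TABLE customer_order (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        order_date  TEXT,
        amount      REAL
    )
""")

conn.execute("INSERT INTO customer VALUES (1, 'Asha', 'Pune')")
conn.execute("INSERT INTO customer_order VALUES (101, 1, '2024-01-15', 499.0)")
conn.commit()
conn.close()
```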
Distribution Fitting:
Distribution fitting is a statistical technique used to identify and select the probability distribution
that best describes a given dataset. Here are short notes on distribution fitting:
1. Purpose: Distribution fitting is used to model the underlying data-generating process. It helps
statisticians and analysts understand the distribution of data and make predictions.
2. Types of Distributions: Common probability distributions used for fitting include normal
(Gaussian), exponential, Poisson, binomial, and many others. The choice of distribution depends
on the nature of the data.
3. Goodness-of-Fit Tests: Statistical tests, such as the Kolmogorov-Smirnov test or chi-square test,
are used to assess how well a chosen distribution fits the data. These tests compare observed data
with expected values from the distribution.
4. Parameters Estimation: For many distributions, there are parameters (e.g., mean and standard
deviation for a normal distribution) that need to be estimated from the data. Maximum likelihood
estimation is a common method for parameter estimation.
5. Visual Inspection: Data analysts often start by plotting the data and comparing it to the
probability distribution graphically. This helps identify deviations from the assumed distribution.
6. Real-world Applications: Distribution fitting is used in various fields, including finance (e.g.,
stock returns), biology (e.g., species population data), and quality control (e.g., defect counts).
7. Fitting Algorithms: Software tools such as R and Python (with libraries like SciPy and statsmodels), as well as specialized statistical packages, provide functions and algorithms for distribution fitting.
8. Limitations: Distribution fitting assumes that the data follows a specific distribution, which may
not always be the case. It's a simplifying assumption that may not capture complex data patterns.
In summary, data modeling and distribution fitting are essential techniques in data analysis and
statistics. Data modeling structures and simplifies data for analysis, while distribution fitting
helps identify the probability distribution that best describes the data, aiding in making informed
statistical inferences and predictions.
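A minimal distribution-fitting sketch in Python, assuming synthetic, roughly normal data generated only for illustration: the normal distribution's parameters are estimated by maximum likelihood with norm.fit and the fit is then checked with a Kolmogorov–Smirnov test (strictly, estimating the parameters from the same data makes the KS p-value approximate).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
data = rng.normal(loc=50, scale=5, size=500)   # synthetic data for illustration

# Maximum likelihood estimates of the normal distribution's parameters.
mu_hat, sigma_hat = stats.norm.fit(data)

# Goodness-of-fit: Kolmogorov-Smirnov test against the fitted normal.
ks_stat, p_value = stats.kstest(data, "norm", args=(mu_hat, sigma_hat))

print(f"estimated mean={mu_hat:.2f}, std={sigma_hat:.2f}")
print(f"KS statistic={ks_stat:.3f}, p-value={p_value:.3f}")
```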
Random Sampling from a Probability Distribution
Random sampling from a probability distribution means generating values that follow that distribution. The main considerations and steps are outlined below.
1. Choice of Distribution: Select the probability distribution (for example normal, exponential, or Poisson) that models the phenomenon you want to simulate.
2. PDF or PMF: Each probability distribution is characterized by its Probability Density Function (PDF) for continuous distributions or Probability Mass Function (PMF) for discrete distributions. The PDF or PMF defines the probabilities associated with different values or ranges of values.
3. Population or Sample:
You should identify whether you are dealing with a population (the entire set of data) or a sample (a subset of the population). The methods may vary depending on this distinction.
4. Random Number Generation:
To perform random sampling, you need a source of random numbers. Modern programming languages and statistical software provide functions or libraries for generating random numbers that follow a uniform distribution between 0 and 1.
5. Inverse Transform Method:
The Inverse Transform Method is a common technique for random sampling from probability distributions. It involves the following steps (a short sketch follows the list):
Calculate the Cumulative Distribution Function (CDF) for the chosen distribution. The
CDF provides the cumulative probability up to a given value.
Generate a random number from a uniform distribution between 0 and 1.
Use the inverse of the CDF to map the random number to a specific value from the
chosen distribution.
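These steps can be sketched for an exponential distribution, whose CDF F(x) = 1 - e^(-λx) inverts to x = -ln(1 - u)/λ; the rate λ = 2 is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0                                 # arbitrary rate of the exponential distribution

u = rng.uniform(0.0, 1.0, size=10_000)    # uniform random numbers in [0, 1)
samples = -np.log(1.0 - u) / lam          # map through the inverse CDF

# The sample mean should be close to the theoretical mean 1/lam = 0.5.
print(samples.mean())
```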
6. Repeated Sampling:
To obtain multiple random samples, repeat the random sampling process as many times as
needed. Each iteration generates one value or data point according to the chosen probability
distribution.
7. Sample Size:
The sample size determines how many random values or data points you generate. The sample
size should be determined based on the requirements of your analysis or study.
8. Applications:
Random sampling from probability distributions has wide-ranging applications in fields such as:
Finance: For risk assessment and portfolio optimization.
Manufacturing: To simulate production processes and quality control.
Epidemiology: For disease modeling and forecasting.
Engineering: In reliability analysis and system design.
Social Sciences: To model and simulate various social phenomena.
9. Limitations: It is important to note that random sampling assumes that the chosen probability distribution accurately represents the real-world phenomenon. If the distribution is not a good fit, the results may not be representative or valid.
In summary, random sampling from a probability distribution is a powerful tool for generating
data points that follow a specific statistical distribution. It is a fundamental technique in statistics,
simulation, and various scientific and engineering fields, enabling data-driven decision-making
and analysis
UNIT – V: PREDICTIVE ANALYTICS
Karl Pearson's correlation coefficient, often referred to as Pearson's r or simply the Pearson correlation, is a statistical method used to measure the linear association between two continuous variables. Here are some key points to note about Pearson's correlation technique:
1. Definition: Pearson's correlation coefficient is a statistic that quantifies the strength and direction
of the linear relationship between two continuous variables. It provides a value between -1 and 1,
where -1 indicates a perfect negative linear relationship, 1 indicates a perfect positive linear
relationship, and 0 indicates no linear relationship.
2. Formula: The formula for Pearson's correlation coefficient is given by:
r = (Σ((X - X̄)(Y - Ȳ))) / [√(Σ(X - X̄)²) * √(Σ(Y - Ȳ)²)]
Where:
r is the Pearson correlation coefficient.
X and Y are the individual data points.
X̄ and Ȳ are the means of the X and Y data, respectively.
3. Interpretation:
r = 1: Indicates a perfect positive linear relationship.
r = 0: Suggests no linear relationship.
r = -1: Suggests a perfect negative linear relationship.
Values between -1 and 1 indicate the strength and direction of the linear relationship. The
closer the absolute value of r is to 1, the stronger the relationship.
4. Assumptions:
Pearson's correlation assumes that both variables are normally distributed.
It assumes that the relationship between the variables is linear.
It is sensitive to outliers, so extreme data points can significantly affect the correlation
coefficient.
5. Use Cases:
Pearson's correlation is commonly used in various fields, including statistics, economics,
social sciences, and more.
It is used to explore and quantify relationships between variables, such as the correlation
between income and education, temperature and ice cream sales, or test scores and study
time.
6. Strengths:
It provides a simple and intuitive measure of the linear relationship between two
variables.
The coefficient is standardized, making it easier to compare the strength of relationships
across different studies or data sets.
It is widely used and accepted in statistical analysis.
7. Limitations:
Pearson's correlation only measures linear relationships, so it may not capture nonlinear
associations.
It is sensitive to outliers and can be influenced by extreme values.
It does not imply causation. A significant correlation does not necessarily mean that one
variable causes the other.
8. Statistical Significance:
Hypothesis testing can be used to determine if the calculated correlation coefficient is
statistically significant. This involves testing the null hypothesis that there is no
correlation in the population.
9. Alternative Correlation Coefficients:
If the assumptions of Pearson's correlation are not met, other correlation coefficients, like
Spearman's rank correlation or Kendall's tau, can be used to assess relationships between
variables.
Pearson's correlation is a valuable tool for analyzing and understanding the linear relationships
between continuous variables. However, it is important to be aware of its assumptions and
limitations when applying it to real-world data
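A short sketch using scipy.stats.pearsonr on made-up study-time and test-score data (the numbers are invented purely for illustration):

```python
from scipy import stats

study_hours = [2, 4, 5, 7, 8, 10]       # hypothetical data
test_scores = [55, 60, 68, 74, 80, 88]

r, p_value = stats.pearsonr(study_hours, test_scores)
print(f"Pearson r = {r:.3f}, p-value = {p_value:.4f}")
# r close to +1 indicates a strong positive linear relationship;
# the p-value tests the null hypothesis of no correlation in the population.
```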
Multiple correlation, in statistics, quantifies the linear relationship between a single dependent
variable and multiple independent variables. It essentially tells you how well you can predict the
dependent variable based on a linear combination of the independent variables
1. Types of Correlation:
Positive Correlation: Both variables move in the same direction.
Negative Correlation: Variables move in opposite directions.
No Correlation: There is no discernible relationship.
2. Correlation Coefficient:
The correlation coefficient quantifies the strength and direction of the relationship.
Common coefficients include Pearson's correlation coefficient (for linear relationships)
and Spearman's rank correlation coefficient (for monotonic relationships).
3. Limitations:
Correlation does not imply causation.
Outliers can influence results.
Non-linear relationships may not be captured.
Multiple Correlation:
1. Definition:
Multiple correlation involves examining the relationship between a dependent variable
and two or more independent variables simultaneously.
2. Multiple Correlation Coefficient:
The multiple correlation coefficient (R) measures the strength and direction of the linear relationship between the dependent variable and a combination of independent variables.
3. Regression Analysis:
Often used in conjunction with regression analysis, where the goal is to predict the
dependent variable using a linear combination of independent variables.
4. Interpretation:
R² (the coefficient of determination) indicates the proportion of variance in the dependent variable explained by the independent variables.
5. Assumptions:
Linearity: The relationships are assumed to be linear.
Independence: Observations should be independent.
Homoscedasticity: The variance of errors should be constant
Applications:
Regression analysis: Multiple correlation forms the basis for linear regression, where you predict
a continuous dependent variable based on multiple independent variables.
Multivariate analysis: It can be used in various multivariate techniques like factor analysis and
principal component analysis to understand relationships between multiple variables.
Scientific research: It's widely used in various fields to investigate the relationships between
different factors influencing a phenomenon.
1. Definition:
Spearman's rank correlation is a non-parametric measure of association that quantifies the degree
and direction of the monotonic relationship between two variables.
It is often used when the data does not meet the assumptions of parametric tests, such as the
assumption of a linear relationship or normally distributed data.
2. Formula:
ρ = 1 - (6Σd²) / (n(n² - 1)), where d is the difference between the ranks of each pair of observations and n is the number of observations.
3. Interpretation:
ρ ranges from -1 to +1. A value of +1 indicates a perfect increasing monotonic relationship, -1 indicates a perfect decreasing monotonic relationship, and 0 indicates no monotonic association; the closer |ρ| is to 1, the stronger the relationship.
4. Use Cases:
Spearman's rank correlation is used when working with data that doesn't meet the assumptions of
parametric methods.
It is valuable for assessing relationships between ordinal, ranked, or non-normally distributed
data.
Common applications include analyzing the relationship between customer preferences, exam
scores, or performance rankings.
5. Advantages:
Robust to outliers: It is less influenced by extreme values, making it suitable for data with
outliers.
No distributional assumptions: Unlike parametric tests, Spearman's rank correlation does not
assume the normality of data.
Handles ordinal data: It is appropriate for data with ranked or ordered categories.
6. Limitations:
Spearman's rank correlation only captures monotonic relationships, so it may miss more complex, non-monotonic associations. It is also somewhat less powerful than Pearson's correlation when the data truly are linear and normally distributed, and many tied ranks can distort the coefficient.
7. Statistical Significance and Computation:
As with Pearson's r, hypothesis tests can be used to determine whether the observed rank correlation differs significantly from zero. Statistical software packages like R and Python (with libraries such as SciPy or statsmodels), and dedicated statistical software such as SPSS and SAS, can be used to calculate Spearman's rank correlation.
Spearman's rank correlation is a valuable tool when dealing with data that doesn't conform to the
assumptions of parametric tests. It provides a non-parametric measure of association that is
robust, especially in the presence of outliers or non-normally distributed data, making it widely
used in various fields for assessing relationships between variables
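The sketch below, using invented monotonic but non-linear data, shows how Spearman's ρ can capture a relationship that Pearson's r understates:

```python
import numpy as np
from scipy import stats

x = np.arange(1, 11)
y = x ** 3 + np.array([0, 1, -2, 3, 0, -1, 2, 0, 1, -3])   # monotonic, non-linear, slightly noisy

rho, p_spear = stats.spearmanr(x, y)
r, p_pear = stats.pearsonr(x, y)
print(f"Spearman rho = {rho:.3f}, Pearson r = {r:.3f}")
# rho is essentially 1 because the relationship is monotonic,
# while r is lower because the relationship is not linear.
```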
Regression Analysis
Regression analysis is a set of statistical methods used for the estimation of relationships between a dependent variable and one or more independent variables. It can be utilized to assess the strength of the relationship between variables and for modeling the future relationship between them.
Multiple regression analysis is a statistical technique that analyzes the relationship between two
or more variables and uses the information to estimate the value of the dependent variables. In
multiple regression, the objective is to develop a model that describes a dependent variable y to
more than one independent variable.
Multiple Regression Formula
In simple linear regression, there is only one independent variable and one dependent variable. In multiple regression, there is a set of independent variables that helps us better explain or predict the dependent variable y:
y = b0 + b1x1 + b2x2 + … + bkxk + e
where x1, x2, …, xk are the k independent variables, y is the dependent variable, b0, b1, …, bk are the regression coefficients, and e is the random error term.
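A minimal multiple-regression sketch using NumPy's least-squares solver; the two predictors, the response, and the "true" coefficients are synthetic, generated only to illustrate estimating b0, b1, and b2.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
y = 3.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(0, 1, n)   # true coefficients: 3, 2, -1.5

# Design matrix with a column of ones for the intercept b0.
X = np.column_stack([np.ones(n), x1, x2])
coeffs, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = coeffs
print(f"b0={b0:.2f}, b1={b1:.2f}, b2={b2:.2f}")       # should be close to 3, 2, -1.5
```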
Stepwise regression is a step-by-step process that begins by developing a regression model with a single predictor variable and then adds or deletes predictor variables one step at a time.
Stepwise multiple regression is the method to determine a regression equation that begins with a
single independent variable and adds independent variables one by one. The stepwise multiple
regression method is also known as the forward selection method because we begin with no
independent variables and add one independent variable to the regression equation at each of the
iterations. There is another method called the backward elimination method, which begins with
an entire set of variables and eliminates one independent variable at each of the iterations.
Residual: The variation in the dependent variable that is not explained by the regression model is called residual or error variation. It is also known as random error, or sometimes just “error”, and arises from random sampling variation.
Key features of stepwise multiple regression:
Only independent variables with non-zero regression coefficients are included in the
regression equation.
The changes in the multiple standard errors of estimate and the coefficient of
determination are shown.
The stepwise multiple regression is efficient in finding the regression equation with only
significant regression coefficients.
The steps involved in developing the regression equation are clear.
The method of least squares finds the solution that minimizes the sum of the squares of the deviations (errors) between the observed values and the values predicted by the model. The sum of squared errors is the quantity used to measure this variation in the observed data.
The least-squares method is often applied in data fitting. The best-fit result is assumed to reduce
the sum of squared errors or residuals which are stated to be the differences between the
observed or experimental value and the corresponding fitted value given in the model.
In linear regression, the line of best fit is a straight line. The fit is obtained by minimizing the residuals, or offsets, of each data point from the line. In practice, vertical offsets (rather than perpendicular offsets) are almost always minimized, whether fitting a line, polynomial, surface, or hyperplane.
Least Square Method Formula
The least-square method states that the curve that best fits a given set of observations, is said to
be a curve having a minimum sum of the squared residuals (or deviations or errors) from the
given data points. Let us assume that the given points of data are (x1, y1), (x2, y2), (x3, y3), …,
(xn, yn) in which all x’s are independent variables, while all y’s are dependent ones. Also,
suppose that f(x) is the fitting curve and d represents the error or deviation from each given
point.
d1 = y1 − f(x1)
d2 = y2 − f(x2)
d3 = y3 − f(x3)
…..
dn = yn – f(xn)
The least-squares principle states that the curve of best fit is the one for which the sum of the squares of all the deviations from the given values is a minimum, i.e.,
S = d1² + d2² + d3² + … + dn² = Σ(yi − f(xi))² is minimized.
Suppose we have to determine the equation of the line of best fit, y = ax + b, for the given data. We then use the following two normal equations:
Σy = aΣx + nb
Σxy = aΣx² + bΣx
By solving these two normal equations we obtain the values of a and b, and hence the line of best fit y = ax + b.
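A short sketch solving the two normal equations above for a small, invented dataset:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)   # hypothetical data
y = np.array([2.1, 4.3, 6.2, 8.4, 10.1])

n = len(x)
# Normal equations for y = a*x + b:
#   Σy  = a*Σx  + n*b
#   Σxy = a*Σx² + b*Σx
A = np.array([[x.sum(), n],
              [(x ** 2).sum(), x.sum()]])
rhs = np.array([y.sum(), (x * y).sum()])
a, b = np.linalg.solve(A, rhs)
print(f"line of best fit: y = {a:.3f}x + {b:.3f}")
```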
Building a regression model typically involves the following steps:
1. Data collection:
Gather data on the dependent variable and the candidate independent variables relevant to your question.
2. Data preparation:
Clean and pre-process your data: handle missing values, outliers, and scaling.
Explore and visualize your data to understand its distribution and potential relationships.
3. Model selection:
Choose the appropriate regression model based on your data and the research question. Some
common choices include:
o Linear regression: for modeling linear relationships.
o Logistic regression: for modeling binary outcomes (e.g., yes/no).
o Decision trees: for non-linear relationships and complex interactions.
o Random forest: an ensemble method combining multiple decision trees for improved accuracy.
4. Model training and evaluation:
Fit the chosen model to the data (for example, on a training set) and evaluate its performance on held-out data using appropriate metrics such as R², RMSE, or classification accuracy.
5. Model improvement:
Analyze the model's residuals and identify potential areas for improvement.
Try different model configurations (e.g., different features, hyperparameters) and compare their
performance.
Regularize your model to prevent overfitting (e.g., L1 or L2 regularization).
Categorical Variables in Regression
1. Nominal Variables:
Nominal variables represent categories with no inherent order or ranking. Examples include:
Gender: Male, Female, Non-Binary
Color: Red, Blue, Green
Country: USA, Canada, India
When dealing with nominal variables in regression analysis, dummy coding or one-hot encoding
is commonly used to convert these categories into numerical values that can be used in
regression models.
2. Ordinal Variables:
Ordinal variables, on the other hand, have a meaningful order or ranking among their
categories, but the intervals between the categories are not uniform. Examples include:
Education Level: High School, Bachelor's, Master's, PhD
Income Bracket: Low, Medium, High
Ordinal variables can be treated similarly to nominal variables, but you might also consider
ordinal coding, where numerical values are assigned based on the order of the categories.
Handling categorical variables appropriately is crucial for building accurate and interpretable
regression models. The choice of encoding method depends on the nature of the data and the
assumptions of the model being used
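A small sketch of dummy (one-hot) coding for a nominal variable and ordinal coding for an ordered variable using pandas; the column names and categories are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["Red", "Blue", "Green", "Blue"],                       # nominal
    "education": ["High School", "Master's", "Bachelor's", "PhD"],   # ordinal
})

# One-hot (dummy) encoding for the nominal variable; drop_first avoids
# perfect collinearity with the regression intercept.
dummies = pd.get_dummies(df["color"], prefix="color", drop_first=True)

# Ordinal coding: assign numeric values that respect the category order.
order = {"High School": 1, "Bachelor's": 2, "Master's": 3, "PhD": 4}
df["education_code"] = df["education"].map(order)

encoded = pd.concat([df, dummies], axis=1)
print(encoded)
```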
Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis is one of the most popular dimensionality reduction techniques used for supervised classification problems in machine learning. It is also considered a pre-processing step for modeling differences between groups in pattern classification applications.
Whenever there is a requirement to separate two or more classes that have multiple features efficiently, Linear Discriminant Analysis is one of the most common techniques for solving such classification problems. For example, if we have two classes with multiple features and try to classify them using a single feature, the classes may overlap.
To overcome this overlapping in the classification process, we must consider more than one feature.
Example:
Let's assume we have to classify two different classes, each having its own set of data points, in a 2-dimensional plane.
It may be impossible to draw a straight line in the 2-D plane that separates these data points efficiently, but using Linear Discriminant Analysis we can reduce the 2-D plane to a 1-D line. Using this technique, we can also maximize the separability between multiple classes.
How does Linear Discriminant Analysis (LDA) work?
Let's consider an example where we have two classes in a 2-D plane having an X-Y axis, and we need to classify them efficiently. LDA uses the X-Y axes to create a new axis, projecting the data onto this new axis in a way that separates the two classes of data points as well as possible.
Hence, we can maximize the separation between these classes and reduce the 2-D plane into 1-D.
LDA generates the new axis using two criteria: it maximizes the distance between the means of the two classes and minimizes the variation within each class. In other words, the new axis increases the separation between the data points of the two classes when they are projected onto it.
Why LDA?
o Logistic Regression is one of the most popular classification algorithms that performs
well for binary classification but falls short in the case of multiple classification problems
with well-separated classes. At the same time, LDA handles these quite efficiently.
o LDA can also be used in data pre-processing to reduce the number of features, just as
PCA, which reduces the computing cost significantly.
o LDA is also used in face detection algorithms. In Fisherfaces, LDA is used to extract
useful data from different faces. Coupled with eigenfaces, it produces effective results.
However, LDA is specifically used to solve supervised classification problems for two or more classes, which standard logistic regression does not handle as well. LDA also fails in some cases, such as when the means of the class distributions are shared (equal), because it then cannot find a new axis that separates the classes.
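A compact sketch of LDA as both a classifier and a dimensionality-reduction step, using scikit-learn on synthetic two-class data (all means, spreads, and sample sizes are arbitrary):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(7)
# Two classes in a 2-D plane with different means.
class0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))
class1 = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(50, 2))
X = np.vstack([class0, class1])
y = np.array([0] * 50 + [1] * 50)

lda = LinearDiscriminantAnalysis(n_components=1)
X_1d = lda.fit_transform(X, y)      # 2-D points projected onto a single new axis
print(X_1d.shape)                   # (100, 1)
print(lda.score(X, y))              # classification accuracy on the training data
```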
ANOVA
A key statistical test in research fields including biology, economics, and psychology, analysis of variance (ANOVA) is very useful for analyzing datasets. It allows comparisons to be made between three or more groups of data. Here, we summarize the key differences between one-way and two-way ANOVA, including the assumptions and hypotheses that must be made about each type of test.
A one-way ANOVA compares three or more categorical groups to establish whether there is a difference between them. Within each group, there should be three or more observations (here, this means walruses), and the means of the samples are compared; a short code sketch of this test follows the hypotheses below.
The null hypothesis (H0) is that there is no difference between the groups and equality
between means (walruses weigh the same in different months).
The alternative hypothesis (H1) is that there is a difference between the means and
groups (walruses have different weights in different months)
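A minimal one-way ANOVA sketch with SciPy, using three invented groups of walrus weights (one group per month); f_oneway tests the null hypothesis that all group means are equal.

```python
from scipy import stats

# Hypothetical walrus weights (kg) sampled in three different months.
june = [950, 1010, 980, 1005, 990]
july = [1020, 1045, 1010, 1050, 1030]
august = [1100, 1080, 1120, 1095, 1105]

f_stat, p_value = stats.f_oneway(june, july, august)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A small p-value (e.g., < 0.05) leads us to reject H0 and conclude that
# mean weight differs between at least two of the months.
```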
A two-way ANOVA is, like a one-way ANOVA, a hypothesis-based test. However, in the two-
way ANOVA each sample is defined in two ways, and resultingly put into two categorical
groups. Thinking again of our walruses, researchers might use a two-way ANOVA if their
question is: “Are walruses heavier in early or late mating season and does that depend on the sex
of the walrus?” In this example, both “month in mating season” and “sex of walrus” are factors –
meaning in total, there are two factors. Once again, each factor's number of groups must be considered: each factor should contain two or more categorical groups.
The two-way ANOVA therefore examines the effect of two factors (month and sex) on a
dependent variable – in this case, weight, and also examines whether the two factors affect each
other to influence the continuous variable.
The assumptions of a two-way ANOVA are:
Your dependent variable – here, “weight”, should be continuous – that is, measured on a scale that can be subdivided using increments (i.e. grams, milligrams)
Your two independent variables – here, “month” and “sex”, should be in categorical,
independent groups.
Sample independence – that each sample has been drawn independently of the other
samples
Variance Equality – That the variance of data in the different groups should be the same
Normality – That each sample is taken from a normally distributed population
Because the two-way ANOVA considers the effect of two categorical factors, and the effect of the categorical factors on each other, there are three pairs of null and alternative hypotheses for the two-way ANOVA. Here, we present them for our walrus experiment, where the month of mating season and sex are the two independent variables.