Data analytics encompasses the extraction (or collection) of raw data, the
preparation and subsequent analysis of that data, and storytelling—sharing
key insights from the data, using them to explain or predict certain scenarios
and outcomes, and to inform decisions, strategies, and next steps.
An example
Imagine you’re a data analyst working for a public transport network—think MTA in
New York City, or TFL in London. There’s a major sporting event coming up in the city,
and you know that people will be flying in from all over to attend.
To avoid absolute chaos, you need to adapt the usual public transport
schedule to accommodate this influx of people and increase in travel throughout
the city. How do you plan ahead with accuracy?
You guessed it…data analytics! You analyze data from similar events that have
happened in the past and use it to predict the number, frequency, and types of
journeys that are likely to occur around this event. With these insights, you’re able
to ensure that public transportation continues to run smoothly.
As you can see, data analytics replaces guesswork with data-driven insights. It helps
you make sense of the past and predict future trends and behaviors, leaving you
much better equipped to make smart decisions.
What does a data analyst do?
As a data analyst, it’s your job to turn raw data into meaningful insights.
Any kind of data analysis usually starts with a specific problem you want to solve, or
a question you need to answer—
for example:
“Why did we lose so many customers in the last quarter?” or
“Why are patients dropping out of their therapy programs at the halfway mark?”
To find the insights and answers you need, you’ll generally go through the following
steps:
Data Analysis Process
1. Defining the question
2. Collecting the data
3. Cleaning the data
4. Analyzing the data
5. Sharing your results
6. Embracing failure
7. Summary
1. Defining the question
The first step in any data analysis process is to define your objective. In data
analytics jargon, this is sometimes called the ‘problem statement’.
Defining your objective means coming up with a hypothesis and figuring out how
to test it. Start by asking: What business problem am I trying to solve?
While this might sound straightforward, it can be trickier than it seems.
A data analyst’s job is to understand the business and its goals in enough
depth that they can frame the problem the right way.
An interesting example.
Let’s say you work for a fictional company called TopNotch Learning. TopNotch
creates custom training software for its clients. While it is excellent at securing
new clients, it has much lower repeat business. As such, your question might not
be, “Why are we losing customers?” but, “Which factors are negatively impacting
the customer experience?” or better yet: “How can we boost customer retention
while minimizing costs?”
Now that you’ve defined a problem, you need to determine which sources of data will
best help you solve it. This is where your business acumen comes in again. For
instance, perhaps you’ve noticed that the sales process for new clients is very
slick, but that the production team is inefficient. Knowing this, you could
hypothesize that the sales process wins lots of new clients, but the subsequent
customer experience is lacking. Could this be why customers don’t come back?
Which sources of data will help you answer this question?
Tools to help define your objective
Defining your objective is mostly about soft skills, business knowledge, and
lateral thinking. But you’ll also need to keep track of business metrics and key
performance indicators (KPIs).
Monthly reports allow you to track problem points in the business. Some
KPI dashboards, like Databox and DashThis, come with a fee, but you’ll
also find open-source software like Grafana, Freeboard, and Dashbuilder.
These are great for producing simple dashboards, both at the beginning and
the end of the data analysis process.
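As a rough sketch of what tracking a KPI in code can look like, here is a minimal pandas example that computes a repeat-purchase rate from an orders export; the file name and column names (orders.csv, customer_id, order_date) are hypothetical and not tied to any particular tool.

import pandas as pd

# Hypothetical orders export: one row per order, with a customer ID and an order date.
orders = pd.read_csv("orders.csv", parse_dates=["order_date"])

# Count how many orders each customer has placed.
orders_per_customer = orders.groupby("customer_id").size()

# Repeat-purchase rate: the share of customers with more than one order.
repeat_rate = (orders_per_customer > 1).mean()
print(f"Repeat-purchase rate: {repeat_rate:.1%}")

A KPI like this could feed one of the dashboards mentioned above, and it maps directly onto the TopNotch retention question.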
2. Collecting the data
Once you’ve established your objective, you’ll need to create a strategy for
collecting and aggregating the appropriate data. A key part of this is
determining which data you need.
A data management platform (DMP) is a piece of software that allows you to identify
and aggregate data from numerous sources, before manipulating them, segmenting
them, and so on.
There are many DMPs available. Some well-known enterprise DMPs include
Salesforce DMP, SAS, and the data integration platform, Xplenty. If you want
to play around, you can also try some open-source platforms like Pimcore or
D:Swarm.
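To make the idea of aggregating data from numerous sources more concrete, here is a toy pandas sketch (not how a DMP works internally) that combines hypothetical CSV exports from a CRM and a support tool sharing a customer_id column.

import pandas as pd

# Hypothetical exports from two different systems.
crm = pd.read_csv("crm_customers.csv")        # e.g. customer_id, signup_date, plan
support = pd.read_csv("support_tickets.csv")  # e.g. customer_id, ticket_id, opened_at

# Aggregate tickets per customer, then join the result onto the CRM data.
tickets_per_customer = (
    support.groupby("customer_id")["ticket_id"]
           .count()
           .rename("ticket_count")
           .reset_index()
)
combined = crm.merge(tickets_per_customer, on="customer_id", how="left")
combined["ticket_count"] = combined["ticket_count"].fillna(0)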
3. Cleaning the data
Once you’ve collected your data, the next step is to get it ready for analysis.
This means cleaning, or ‘scrubbing’ it, and is crucial in making sure that you’re
working with high-quality data. Key data cleaning tasks include:
Removing major errors, duplicates, and outliers—all of which are inevitable
problems when aggregating data from numerous sources.
Removing unwanted data points—extracting irrelevant observations that
have no bearing on your intended analysis.
Bringing structure to your data—general ‘housekeeping’, i.e. fixing typos or
layout issues, which will help you map and manipulate your data more easily.
Filling in major gaps—as you’re tidying up, you might notice that important
data are missing. Once you’ve identified gaps, you can go about filling them.
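As a rough illustration of these tasks in practice, here is a minimal pandas sketch; the file name, column names, and the 3-standard-deviation outlier rule are all hypothetical choices, not fixed rules.

import pandas as pd

df = pd.read_csv("raw_data.csv")  # hypothetical aggregated export

# Remove exact duplicates introduced when combining sources.
df = df.drop_duplicates()

# Remove unwanted data points, e.g. rows with no bearing on the analysis.
df = df[df["region"] != "internal_test"]

# Drop crude outliers, e.g. order values more than 3 standard deviations from the mean.
z_scores = (df["order_value"] - df["order_value"].mean()) / df["order_value"].std()
df = df[z_scores.abs() <= 3]

# Bring structure to the data: fix obvious layout issues in text columns.
df["city"] = df["city"].str.strip().str.title()

# Fill in major gaps where a sensible default exists; otherwise leave them visible.
df["discount"] = df["discount"].fillna(0)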
Key benefits of data cleaning
Staying organized: Today’s businesses collect lots of information from clients,
customers, product users, and so on. These details include everything from
addresses and phone numbers to bank details and more. Cleaning this data
regularly means keeping it tidy. It can then be stored more effectively and
securely.
Avoiding mistakes: Dirty data doesn’t just cause problems for data analytics. It
also affects daily operations. For instance, marketing teams usually have a
customer database. If that database is in good order, they’ll have access to
helpful, accurate information. If it’s a mess, mistakes are bound to happen, such
as using the wrong name in personalized mail-outs.
Improving productivity: Regularly cleaning and updating data means rogue
information is quickly purged. This saves teams from having to wade through old
databases or documents to find what they’re looking for.
Avoiding unnecessary costs: Making business decisions with bad data can
lead to expensive mistakes. But bad data can incur costs in other ways too.
Simple things, like processing errors, can quickly snowball into bigger
problems. Regularly checking data allows you to detect blips sooner. This gives
you a chance to correct them before they require a more time-consuming
(and costly) fix.
Data validity
Validity is the degree to which a dataset conforms to a defined format or set of rules. These rules, or constraints,
are easy to enforce with modern data capture systems, e.g. online forms. Since forms are a common source of
data capture (one we’re all familiar with) let’s use them to highlight a few examples:
🡪 Data type: In an online form, values must match the field’s data type, e.g. numbers go into numerical fields,
true/false values into Boolean fields, and so on.
🡪 Range: Data must fall within a particular range. Ever tried putting a false year of birth into a form (e.g. 1700)? It
will tell you this is invalid because it falls outside of the accepted date range.
🡪 Mandatory data: It’s happened to us all. You hit submit and the form comes back at you with an angry, red
warning to say you can’t leave cell ‘X’ empty. This is mandatory data. In online forms, it includes things like email
addresses and customer ID numbers.
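Outside of an online form, the same constraints can be checked after the fact. Here is a minimal pandas sketch, assuming a hypothetical sign-up export with year_of_birth and email columns; the accepted year range is also just an example.

import pandas as pd

signups = pd.read_csv("signups.csv")  # hypothetical form export

# Data type: year_of_birth should be numeric; non-numeric entries become NaN.
signups["year_of_birth"] = pd.to_numeric(signups["year_of_birth"], errors="coerce")

# Range: flag years of birth outside the accepted window.
bad_range = ~signups["year_of_birth"].between(1900, 2024)

# Mandatory data: flag rows with a missing email address.
missing_email = signups["email"].isna() | (signups["email"].str.strip() == "")

print("Rows failing the range check:", bad_range.sum())
print("Rows missing a mandatory email:", missing_email.sum())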
Data accuracy
Accuracy is a simple measure of whether your data are correct.
This could be anything from your date of birth to your bank balance, eye color, or geographical
location. Data accuracy is important for the obvious reason that if data are incorrect, they’ll hurt the
results of any analysis (and subsequent business decisions).
Unfortunately, accuracy is hard to measure, because there is rarely an existing ‘gold standard’
dataset to test against.
Data completeness
Data completeness is how exhaustive a dataset is. In short, do you have all the necessary
information needed to complete your task?
Identifying an incomplete dataset isn’t always as easy as looking for empty cells. Let’s say you have
a database of customer contact details, missing half the surnames. If you wanted to list the
customers alphabetically, the dataset would be incomplete. But if your only aim was to analyze
customer dialing codes to determine geographical location, surnames wouldn’t matter. Like data
accuracy, incomplete data are challenging to fix.
This is because it’s not always possible to infer missing data based on what you already have.
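A quick way to gauge completeness, column by column, is to count missing values. Here is a small pandas sketch; the file and column set are hypothetical.

import pandas as pd

contacts = pd.read_csv("customer_contacts.csv")  # hypothetical contact list

# Share of missing values per column, as a percentage.
missing_share = contacts.isna().mean().sort_values(ascending=False) * 100
print(missing_share.round(1))

Whether a given level of missingness matters still depends on the task, as the surname example above shows.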
Data consistency
Data consistency refers to whether your data match information from other sources. This determines its
reliability.
For instance, if you work at a doctor’s surgery, you may find patients with two phone numbers or postal
addresses. The data here are inconsistent. It’s not always possible to return to the source, so determining
data consistency requires smart thinking. You may be able to infer which data are correct by looking at the
most recent entry, or by determining reliability in some other way.
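One pragmatic version of that smart thinking is to keep the most recently updated record per patient. A minimal pandas sketch, with hypothetical column names:

import pandas as pd

records = pd.read_csv("patient_contacts.csv", parse_dates=["updated_at"])

# Keep only the most recently updated contact record for each patient.
latest = (
    records.sort_values("updated_at")
           .drop_duplicates(subset="patient_id", keep="last")
)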
Data uniformity
Data uniformity looks at units of measure, metrics, and so on.
For instance, imagine you’re combining two datasets on people’s weight. One dataset uses the metric
system, the other, imperial. For the data to be of any use during analysis, all the measurements must be
uniform, i.e. all in kilograms or all in pounds. This means converting it all to a single unit. Luckily, this aspect
of data quality is easier to manage. It doesn’t mean filling in gaps or determining accuracy…phew!
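Here is a minimal sketch of making weights uniform in pandas, assuming a hypothetical unit column that marks each row as kilograms ("kg") or pounds ("lb"):

import pandas as pd

weights = pd.read_csv("combined_weights.csv")  # hypothetical merged dataset

# Convert imperial rows to kilograms so every row uses the same unit.
is_imperial = weights["unit"].str.lower().eq("lb")
weights.loc[is_imperial, "weight"] = weights.loc[is_imperial, "weight"] * 0.453592
weights.loc[is_imperial, "unit"] = "kg"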
Data relevance
Data relevance is a more subjective measure of data quality. It looks at whether data is sufficiently complete,
uniform, consistent (and so on) to fulfill its given task.
Another aspect of data relevance, though, is timeliness. Is the data available when you need it? Is it accessible
to everyone who requires it? For instance, if you’re reporting to the Board with quarterly profits and losses,
you need the most up-to-date information. With only the previous quarter’s figures, you’ll have
lower-quality data and can therefore only offer lower-quality insights.
How to clean
1: Get rid of unwanted observations
The first stage in any data cleaning process is to remove the observations (or data points) you don’t
want. This includes irrelevant observations, i.e. those that don’t fit the problem you’re looking to
solve.
For instance, if we were running an analysis on vegetarian eating habits, we could remove any
meat-related observations from our data set.
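In pandas, that kind of filtering is a one-line operation. A small sketch, assuming a hypothetical survey export with a food_category column:

import pandas as pd

meals = pd.read_csv("eating_habits.csv")  # hypothetical survey export

# Drop meat-related observations; they have no bearing on a vegetarian analysis.
veg_meals = meals[meals["food_category"] != "meat"]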
Free cleaning tools offer limited functionality for very large datasets. Python
libraries (e.g. pandas) and some R packages are better suited to heavy data
scrubbing, although you will, of course, need to be familiar with the languages.
Enterprise tools are also available as an alternative.
If you’re familiar with Python and R, there are also many data visualization
libraries and packages available. For instance, check out the Python
libraries Plotly, Seaborn, and Matplotlib. Whichever data visualization tools
you use, make sure you polish up your presentation skills, too. Remember:
Visualization is great, but communication is key!
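As a small illustration, here is a minimal matplotlib/seaborn sketch of the kind of chart you might share; the dataset and column names are hypothetical.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

results = pd.read_csv("retention_by_month.csv")  # hypothetical analysis output

# A simple line chart of monthly retention, ready to drop into a report.
sns.lineplot(data=results, x="month", y="retention_rate")
plt.title("Customer retention by month")
plt.ylabel("Retention rate")
plt.tight_layout()
plt.show()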
Basic Statistics
Frequency Distribution.
A frequency distribution is a representation, either in a graphical or tabular
format, that displays the number of observations within a given interval.
The frequency is how often a value occurs in an interval, while the
distribution is the pattern of frequency of the variable.
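For example, a frequency distribution of test marks can be built in Python by grouping the values into intervals and counting how many fall into each one; the marks below are made up.

import pandas as pd

marks = pd.Series([20, 22, 24, 26, 28, 30, 23, 27, 24, 29])  # made-up marks

# Group the marks into intervals (bins) and count the observations per interval.
bins = [20, 24, 28, 32]
freq = pd.cut(marks, bins=bins, right=False).value_counts().sort_index()
print(freq)  # [20, 24): 3, [24, 28): 4, [28, 32): 3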
Data presentation is the process of visually representing data sets to convey
information effectively to an audience. In an era where the amount of data
generated is vast, visually presenting data using methods such as diagrams,
graphs, and charts has become crucial.
Data can be presented in two ways: graphically and numerically.
The following are numerical examples:
Q1. The marks obtained by 6 students in a class test are 20, 22, 24,
26, 28, 30. Find the mean.
Arithmetic mean = Sum of observations/Total number of observations
= (20 + 22 + 24 + 26 + 28 + 30)/6
= 150/6
= 25
Therefore, the mean of the marks is 25.
Q2. If the arithmetic mean of 14 observations 26, 12, 14, 15, x, 17, 9, 11, 18,
16, 28, 20, 22, 8 is 17, find the missing observation.
Given 14 observations are: 26, 12, 14, 15, x, 17, 9, 11, 18, 16, 28, 20, 22, 8
Arithmetic mean = 17
We know that,
Arithmetic mean = Sum of observations/Total number of observations
Hence,
17 = (216 + x)/14
216 + x = 17 × 14
216 + x = 238
x = 238 – 216
x = 22
Therefore, the missing observation is 22.
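Both mean calculations are easy to double-check with Python’s built-in statistics module:

from statistics import mean

# Q1: the mean of the six class-test marks.
print(mean([20, 22, 24, 26, 28, 30]))  # 25

# Q2: with x = 22, the 14 observations do average 17.
observations = [26, 12, 14, 15, 22, 17, 9, 11, 18, 16, 28, 20, 22, 8]
print(mean(observations))  # 17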
Q3. Find the Variance and Standard deviation of the following numbers: 1, 3, 5, 5,
6, 7, 9, 10.
The mean = (1+ 3+ 5+ 5+ 6+ 7+ 9+ 10)/8 = 46/ 8 = 5.75
Step 1: Subtract the mean value from individual value
(1 – 5.75), (3 – 5.75), (5 – 5.75), (5 – 5.75), (6 – 5.75), (7 – 5.75), (9 – 5.75), (10 –
5.75)
= -4.75, -2.75, -0.75, -0.75, 0.25, 1.25, 3.25, 4.25
Step 2: Squaring the above values
we get, 22.5625, 7.5625, 0.5625, 0.5625, 0.0625, 1.5625, 10.5625, 18.0625
Step 3: 22.5625 + 7.5625 + 0.5625 + 0.5625 + 0.0625 + 1.5625 + 10.5625 + 18.0625
= 61.5
Step 4: n = 8, therefore variance (σ²) = 61.5/8 = 7.6875 ≈ 7.69
Now, Standard deviation (σ) = √7.6875 ≈ 2.77
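The same result can be confirmed with the statistics module, which provides population variance and population standard deviation directly:

from statistics import pvariance, pstdev

data = [1, 3, 5, 5, 6, 7, 9, 10]
print(pvariance(data))         # 7.6875
print(round(pstdev(data), 2))  # 2.77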
Q4. Calculate the range and coefficient of range for the following data values:
45, 55, 63, 76, 67, 84, 75, 48, 62, 65
Let Xi values be: 45, 55, 63, 76, 67, 84, 75, 48, 62, 65
Here, Maximum value (Xmax) = 84 and Minimum value (Xmin) = 45
Range = Maximum value - Minimum value = 84 – 45 = 39
Coefficient of range = (Xmax – Xmin)/(Xmax + Xmin)
= (84 – 45)/(84 + 45)
= 39/129
= 0.302 (approx)
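A quick Python check of the range and the coefficient of range:

values = [45, 55, 63, 76, 67, 84, 75, 48, 62, 65]
data_range = max(values) - min(values)
coefficient = data_range / (max(values) + min(values))
print(data_range)             # 39
print(round(coefficient, 3))  # 0.302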
Q5. Find the median, lower quartile, upper quartile and inter-quartile range of the
following data set of scores: 19, 21, 23, 20, 23, 27, 25, 24, 31
First, let’s arrange the values in ascending order:
19, 20, 21, 23, 23, 24, 25, 27, 31
Median:
The middle (5th) term
= 23 = Median
Lower Quartile:
Average of 2nd and 3rd terms
= (20 + 21)/2 = 20.5 = Lower Quartile
Upper Quartile:
Average of 7th and 8th terms
= (25 + 27)/2 = 26 = Upper Quartile
IQR = Upper quartile – Lower quartile
= 26 – 20.5
= 5.5
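Python’s statistics.quantiles (with its default ‘exclusive’ method) reproduces these quartiles for this dataset, so the result is easy to verify:

from statistics import quantiles

scores = [19, 21, 23, 20, 23, 27, 25, 24, 31]
q1, q2, q3 = quantiles(scores, n=4)  # default method="exclusive"
print(q2)               # 23.0 (the median)
print(q1, q3, q3 - q1)  # 20.5 26.0 5.5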