FDA Practical_Book
Sr. No.   Name of Experiments
1         Apply pivot table of Excel to perform data analysis
6         Perform the Logistic Regression on a given dataset and interpret the regression table
Practical No. 01
Aim :
Apply pivot table of Excel to perform data analysis.
Theory:
Analysing a large set of data is often necessary. It involves summarizing the
data, obtaining the needed values and presenting the results.
Excel provides the PivotTable feature, which enables you to summarize thousands of
data values easily and quickly so as to obtain the required results.
Consider the following table of sales data. From this data, you might have to
summarize total sales region wise, month wise, or salesperson wise. The easy way
to handle these tasks is to create a PivotTable that you can dynamically modify to
summarize the results the way you want.
Creating PivotTable
As you can see in the dialog box, you can use either a Table or Range from the current
workbook or use an external data source.
In the Table / Range Box, type the table name.
Click New Worksheet to tell Excel where to keep the PivotTable.
Click OK.
A Blank PivotTable and a PivotTable fields list appear.
Recommended PivotTables
In case you are new to PivotTables or you do not know which fields to select from the
data, you can use the Recommended PivotTables that Excel provides.
Click the data table.
Click the INSERT tab.
Click on Recommended PivotTables in the Tables group. The
Recommended PivotTables dialog box appears.
In the recommended PivotTables dialog box, the possible customized PivotTables
that suit your data are displayed.
Click each of the PivotTable options to see the preview on the right side.
Click the PivotTable Sum of Order Amount by Salesperson and month.
Click OK. The selected PivotTable appears on a new worksheet. You can observe
the PivotTable fields that were selected in the PivotTable fields list.
PivotTable Fields
The headers in your data table will appear as the fields in the PivotTable.
You can select / deselect them to instantly change your PivotTable to display only the
information you want and in a way that you want. For example, if you want to display
the account information instead of order amount information, deselect Order Amount
and select Account.
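The summarizing that a PivotTable performs can be sketched outside Excel as well. The following is a minimal Python sketch, not Excel's own implementation: the sales records, the salesperson names and the helper `pivot_sum` are all hypothetical, and only the field names (Region, Month, Salesperson, Order Amount) follow the text above.

```python
from collections import defaultdict

# Hypothetical sales records; only the field names follow the text above.
sales = [
    {"Region": "East", "Month": "Jan", "Salesperson": "Asha", "Order Amount": 1200},
    {"Region": "East", "Month": "Feb", "Salesperson": "Asha", "Order Amount": 800},
    {"Region": "West", "Month": "Jan", "Salesperson": "Ravi", "Order Amount": 1500},
    {"Region": "West", "Month": "Feb", "Salesperson": "Ravi", "Order Amount": 700},
    {"Region": "East", "Month": "Jan", "Salesperson": "Ravi", "Order Amount": 300},
]

def pivot_sum(rows, row_field, col_field, value_field):
    """Sum value_field for every (row_field, col_field) combination."""
    table = defaultdict(lambda: defaultdict(float))
    for row in rows:
        table[row[row_field]][row[col_field]] += row[value_field]
    return {key: dict(cols) for key, cols in table.items()}

# "Sum of Order Amount by Salesperson and Month", then the same by Region.
by_person_month = pivot_sum(sales, "Salesperson", "Month", "Order Amount")
by_region_month = pivot_sum(sales, "Region", "Month", "Order Amount")

for person, months in by_person_month.items():
    print(person, months)
```

Changing the `row_field` and `col_field` arguments plays the same role as selecting or deselecting fields in the PivotTable fields list.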
Input Dataset:
Output:
Practical No. 02
Aim :
Perform Descriptive statistics of given dataset using Data Analysis Toolbox of Excel.
Theory:
Data Analysis Toolbox of Excel
Excel provides a data analysis tool called Descriptive Statistics which produces a
summary of the key statistics for a data set.
Example 1: Provide a table of the most common descriptive statistics for the scores
in column A of Figure 1.
The output from the tool is shown in the right side of Figure 1. To use the tool,
select Data > Analysis | Data Analysis and choose the Descriptive Statistics
option.
A dialog box appears as in Figure 2.
Input Dataset:
Figure 2 – Dialog box for Excel’s data analysis tool
Now click on Input Range and highlight the scores in column A (i.e. cells A3:A14).
If you include the heading, as is done here, check the Labels in first row checkbox.
Since we want the output to start in cell C3, click the Output Range radio button
and insert C3 (or click on cell C3). Finally, click the Summary statistics checkbox
and press the OK button. Note that if we had also checked the Kth Largest checkbox,
the output would also contain the value for LARGE(A4:A14, k), where k is the number
we insert in the box to the right of the label Kth Largest. Similarly, checking the
Kth Smallest checkbox outputs SMALL(A4:A14, k).
The Confidence Level for Mean option generates a confidence interval
using the t distribution as explained in One Sample t-Test.
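For comparison, most of the summary statistics can be reproduced with Python's standard `statistics` module. This is a minimal sketch under assumptions: the eleven scores are hypothetical stand-ins for cells A4:A14, and the helpers `kth_largest` and `kth_smallest` are illustrative names that mimic LARGE and SMALL.

```python
import statistics

# Hypothetical scores standing in for the values in cells A4:A14.
scores = [23, 45, 12, 67, 45, 89, 34, 56, 45, 78, 61]

def kth_largest(values, k):
    """Counterpart of Excel's LARGE(range, k)."""
    return sorted(values, reverse=True)[k - 1]

def kth_smallest(values, k):
    """Counterpart of Excel's SMALL(range, k)."""
    return sorted(values)[k - 1]

summary = {
    "Mean": statistics.mean(scores),
    "Median": statistics.median(scores),
    "Mode": statistics.mode(scores),
    "Standard Deviation": statistics.stdev(scores),  # sample (n - 1) version
    "Sample Variance": statistics.variance(scores),
    "Range": max(scores) - min(scores),
    "Minimum": min(scores),
    "Maximum": max(scores),
    "Sum": sum(scores),
    "Count": len(scores),
    "Largest(2)": kth_largest(scores, 2),
    "Smallest(2)": kth_smallest(scores, 2),
}

for name, value in summary.items():
    print(f"{name:<20}{value}")
```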
Practical No. 03
Aim :
Perform the Histogram Analysis of given dataset using Data Analysis Toolbox of
Excel.
Theory:
Histogram in Excel
What is a histogram?
"Histogram is a graphical representation of the distribution of numerical data."
Absolutely true, and… totally unclear :) Well, let's think about histograms in another
way.
Have you ever made a bar or column chart to represent some numerical data? I bet
everyone has. A histogram is a specific use of a column chart where each column
represents the frequency of elements in a certain x range. In other words, a
histogram graphically displays the number of elements within the consecutive
non-overlapping intervals, or bins.
For example, you can make a histogram to display the number of days with a
temperature between 61-65, 66-70, 71-75, etc. degrees, the number of sales with
amounts between $100-$199, $200-$299, $300-$399, the number of students with
test scores between 41-60, 61-80, 81-100, and so on.
The following screenshot gives an idea of what an Excel histogram can look like:
8. Click OK.
INPUT :
OUTPUT:
Parts of a Histogram
1. The title: The title describes the information included in the
histogram.
2. X-axis: The X-axis shows the intervals (bins) into which the
measurements fall.
3. Y-axis: The Y-axis shows the number of times that the values
occurred within the intervals set by the X-axis.
4. The bars: The height of the bar shows the number of times that the
values occurred within the interval, while the width of the bar
shows the interval that is covered. For a histogram with equal bins,
the width should be the same across all bars.
Importance of a Histogram
Creating a histogram provides a visual representation of data distribution.
Histograms can display a large amount of data and the frequency of the
data values. The shape of the distribution and the approximate position of the
median can be read from a histogram. In addition, it can show any outliers or
gaps in the data.
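The counting behind a histogram can be sketched in a few lines of Python. The test scores below are hypothetical, the bins reuse the 41-60, 61-80 and 81-100 ranges from the example earlier, and `bin_label` is an illustrative helper rather than anything Excel exposes.

```python
from collections import Counter

# Hypothetical test scores; the bins reuse the 41-60, 61-80, 81-100 example.
scores = [45, 52, 58, 61, 64, 67, 70, 73, 79, 82, 88, 95, 100, 41]
bins = [(41, 60), (61, 80), (81, 100)]

def bin_label(value, bins):
    """Return the label of the bin that contains value, or None if out of range."""
    for low, high in bins:
        if low <= value <= high:
            return f"{low}-{high}"
    return None

# Frequency of scores per bin: this is what the Y-axis of the histogram shows.
counts = Counter(bin_label(s, bins) for s in scores)

# Text rendering: one row per bin, bar length equals the frequency.
for low, high in bins:
    label = f"{low}-{high}"
    print(f"{label:>7} | {'#' * counts[label]} ({counts[label]})")
```

Because the bins are consecutive and non-overlapping, every score falls into exactly one bin.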
Practical No. 04
Aim :
Perform Simple Linear Regression using Data Analysis Toolbox of Excel or with
Python and Interpret the regression table
Theory:
Linear regression analysis, in general, is a statistical method that shows or predicts
the relationship between two variables or factors.
There are 2 types of factors in regression analysis:
Dependent variable (y): Also called the ‘criterion variable’, ‘response’, or
‘outcome’, this is the factor being predicted.
Independent variable (x): Also known as the ‘explanatory variable’ or
‘predictor’, this is the factor used to predict the dependent variable because of its
influence or effect on that variable.
Usually, this type of analysis is used when one is trying to find or establish the
correlation between variables.
Here’s the linear regression formula:
y = bx + a + ε
As you can see, the equation shows how y is related to x.
On an Excel chart, there’s a trendline you can see which illustrates the regression
line — the rate of change.
Here’s a more detailed definition of the formula’s parameters:
y (dependent variable)
b (the slope of the regression line)
x (independent variable)
a (y-intercept of the regression line)
ε (the error term, which accounts for the variability in y that can’t be explained by
the analysis)
The analysis includes an error term because errors can’t be completely eliminated,
especially in a predictive analysis such as this.
But don’t be surprised if you can’t find the error term in Excel. The program does it in
the background.
In summary, here’s what you need to do to insert a scatter plot in Excel:
Format your data in such a way that the independent variable is on the left column
and the dependent variable on the right.
Highlight your data.
Find and click the ‘Scatter’ icon in the ‘Charts’ group on the ‘Insert’ tab of
the ribbon.
To draw the regression line, let’s add a trendline on the chart. Click on any of the
data points and right-click. Select ‘Add Trendline’.
After that, a window will open at the right-hand side.
‘Linear’ is the default ‘Trendline Options’. If it’s not selected, click on it.
Also, if you like to show the equation on the chart, tick the ‘Display Equation on
chart’ box.
How to interpret the results
Primarily, what you’re looking for in a simple linear regression is the correlation
between the variables. In Excel, the trendline shows it for you.
The trendline will tell you if the relationship of your variables is positive or negative.
Positive: If the line shows an upward trend. This indicates that as the
independent variable increases, the dependent variable also increases. In
our example, as the pageviews increase, we can expect to see a rise
in sales as well.
Negative: If the line shows a downward trend. This suggests that as the
independent variable increases, the dependent variable decreases.
None at all: This is easy to spot. There is no correlation between the variables
(therefore, no way to predict the next values) when the points in the scatter plot
don’t resemble a line as they are scattered. You can still see a line if you add a
trendline no matter how random the points are, but the line is usually close to a
horizontal line.
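What the trendline computes can be sketched in Python with the usual least-squares formulas for y = bx + a. This is a minimal sketch, assuming hypothetical pageview and sales figures; `fit_line` and `direction` are illustrative helpers, and treating a near-zero slope as "none" is a simplification of the visual check described above.

```python
# Hypothetical data: pageviews (x) and sales (y).
pageviews = [100, 200, 300, 400, 500]
sales = [12, 19, 31, 42, 49]

def fit_line(x, y):
    """Least-squares slope b and intercept a for y = b*x + a."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def direction(slope, tol=1e-9):
    """Classify the trend by the sign of the slope (simplified reading)."""
    if slope > tol:
        return "positive"
    if slope < -tol:
        return "negative"
    return "none"

slope, intercept = fit_line(pageviews, sales)
print(f"y = {slope:.4f}x + {intercept:.4f} ({direction(slope)})")
```

The printed equation corresponds to what Excel shows when ‘Display Equation on chart’ is ticked.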
Practical No. 05
Aim :
Perform Multiple Linear Regression using Data Analysis Toolbox of Excel or with
Python and Interpret the regression table.
Theory:
Regression Analysis With Excel
In the real world, you will probably never conduct multiple regression analysis by
hand. Most likely, you will use computer software (SAS, SPSS, Minitab, Excel, etc.).
In this lesson, using data from the table, we are going to complete the following
tasks:
If the Data Analysis button is not visible, the Analysis ToolPak is not enabled. In that
case, do the following:
This enables the Analysis ToolPak. Now, when you click the Data tab, you will see a
Data Analysis button in the upper right corner under the Data tab. (If this explanation
of how to enable the Analysis ToolPak is unclear,
go to https://stattrek.com/anova/excel-analysis-toolpak for more detailed instruction.)
Data Entry With Excel
Data entry with Excel is easy. There are three main steps:
o Enter data on spreadsheet.
o Identify independent and dependent variables.
o Specify desired analyses.
To illustrate the process, we'll walk through each step, using data from our sample
problem. First, we want to enter data on an Excel spreadsheet.
By default, Excel will produce a standard set of outputs. For this sample problem,
that's all we need; so click OK to generate standard regression outputs.
Note: If desired, you can request additional outputs in the form of residual plots and
normal probability plots. To produce the plots, check the appropriate box(es) under
Output options on the Regression dialog box.
Excel provides everything we need to address the tasks we defined for this sample
problem. Recall that we wanted to do three things:
Regression Equation
The first task in our analysis is to define a linear, least-squares regression equation
to predict test score, based on IQ and study hours. Since we have two independent
variables, the equation takes the following form:
ŷ = b0 + b1x1 + b2x2
In this equation, ŷ is the predicted test score. The independent variables are IQ and
study hours, which are denoted by x1 and x2, respectively. The regression
coefficients are b0, b1, and b2. On the right side of the equation, the only unknowns
are the regression coefficients; so to specify the equation, we need to assign
values to these coefficients.
Excel does all the hard work behind the scenes, and displays the result in a
regression coefficients table:
Here, we see that the regression intercept (b0) is 23.156, the regression coefficient
for IQ (b1) is 0.509, and the regression coefficient for study hours (b2) is 0.467. So
the least-squares regression equation can be re-written as:
ŷ = 23.156 + 0.509 * IQ + 0.467 * Hours
This is the only linear equation that satisfies a least-squares criterion. That means
this equation fits the data from which it was created better than any other linear
equation.
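Once the coefficients are known, predicting a test score is a direct substitution into the equation. The sketch below uses the three coefficients reported in the text; the student (IQ 110, 3 study hours) and the function name `predict_score` are hypothetical.

```python
# Coefficients as reported in the regression coefficients table in the text.
B0, B1, B2 = 23.156, 0.509, 0.467

def predict_score(iq, study_hours):
    """Predicted test score from y_hat = b0 + b1*x1 + b2*x2."""
    return B0 + B1 * iq + B2 * study_hours

# Hypothetical student: IQ 110 and 3 study hours.
predicted = predict_score(110, 3)
print(round(predicted, 3))
```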
How well the equation fits the data is measured by the coefficient of multiple
determination (R²):
R² = SSR / SSTO
SSR = Σ (ŷ − ȳ)²
SSTO = Σ (y − ȳ)²
where SSR is the sum of squares due to regression, SSTO is the total sum of
squares, ŷ is the predicted value of the dependent variable, ȳ is the dependent
variable mean, and y is the dependent variable raw score.
Luckily, you will never have to compute the coefficient of multiple determination by
hand. It is a standard output of Excel (and most other analysis packages), as shown
below.
A quick glance at the output suggests that the regression equation fits the data
pretty well. The coefficient of multiple determination is 0.905. For our sample
problem, this means 90.5% of test score variation can be explained by IQ and by
hours spent in study.
An Alternative View of R²
The coefficient of multiple determination (R²) is the square of the correlation between
actual and predicted values of the dependent variable. Thus,
R² = (r_y,ŷ)²
where y is the dependent variable raw score, ŷ is the predicted value of the
dependent variable, and r_y,ŷ is the correlation between y and ŷ.
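The two views of the coefficient of multiple determination can be checked numerically. The following is a minimal pure-Python sketch, assuming a small hypothetical data set (not the one used in this practical) and an illustrative helper `solve` for the normal equations; in practice a numerical library would normally do the fitting.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

# Hypothetical data for eight students (not the data set of this practical).
iq = [95, 100, 105, 110, 115, 120, 125, 130]
hours = [10, 4, 12, 6, 15, 8, 18, 11]
score = [62, 60, 70, 68, 80, 74, 90, 84]

# Least-squares coefficients from the normal equations (X'X) b = X'y.
X = [[1.0, x1, x2] for x1, x2 in zip(iq, hours)]
k = len(X[0])
xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
xty = [sum(row[i] * y for row, y in zip(X, score)) for i in range(k)]
b0, b1, b2 = solve(xtx, xty)

y_hat = [b0 + b1 * x1 + b2 * x2 for x1, x2 in zip(iq, hours)]
y_bar = sum(score) / len(score)

# First view: ratio of regression sum of squares to total sum of squares.
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
ssto = sum((y - y_bar) ** 2 for y in score)
r_squared = ssr / ssto

# Alternative view: squared correlation between actual and predicted values.
yh_bar = sum(y_hat) / len(y_hat)
cov = sum((y - y_bar) * (yh - yh_bar) for y, yh in zip(score, y_hat))
var_hat = sum((yh - yh_bar) ** 2 for yh in y_hat)
r = cov / math.sqrt(ssto * var_hat)

print(f"SSR/SSTO = {r_squared:.4f}, squared correlation = {r ** 2:.4f}")
```

Both printed values should agree up to rounding, because the equation is a least-squares fit that includes an intercept.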
ANOVA Table
Another way to evaluate the regression equation would be to assess the statistical
significance of the regression sum of squares. For that, we examine the ANOVA
table produced by Excel:
This table tests the statistical significance of the independent variables as predictors
of the dependent variable. The last column of the table shows the results of an
overall F test. The F statistic (33.4) is big, and the p value (0.00026) is small. This
indicates that one or both independent variables have explanatory power beyond
what would be expected by chance.
Like the coefficient of multiple determination, the overall F test found in the ANOVA
table suggests that the regression equation fits the data well.
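The overall F statistic is tied to the coefficient of multiple determination through F = (R² / k) / ((1 − R²) / (n − k − 1)), where k is the number of independent variables and n the number of observations. The sketch below plugs in the reported R² of 0.905 and k = 2; the sample size n = 10 is an assumption, since it is not stated in this section.

```python
def overall_f(r_squared, n_obs, n_predictors):
    """Overall F statistic: (R^2 / k) / ((1 - R^2) / (n - k - 1))."""
    k = n_predictors
    return (r_squared / k) / ((1.0 - r_squared) / (n_obs - k - 1))

# R^2 = 0.905 and two predictors come from the text; n = 10 is an assumption.
f_stat = overall_f(0.905, n_obs=10, n_predictors=2)
print(round(f_stat, 2))
```

Under that assumption the result lands close to the reported 33.4; a small gap would be expected because 0.905 is itself a rounded value.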
The regression coefficients table shows the following information for each
coefficient: its value, its standard error, a t-statistic, and the significance of the
t-statistic. In this example, the t-statistics for IQ and study hours are both statistically
significant at the 0.05 level. This means that IQ contributes significantly to the
regression after effects of study hours are taken into account. And study hours
contribute significantly to the regression after effects of IQ are taken into account.
Note: This analysis omits any consideration of multicollinearity, a topic we will cover
in the next lesson. Be aware, however, that it is best practice to assess
multicollinearity in the independent variables before testing significance of
regression coefficients.
INPUT:
OUTPUT:
Final Thoughts / Conclusion
This lesson was all about multiple regression analysis. We used Excel, but the
analysis would be much the same with other software packages. All major software
packages (SAS, SPSS, Minitab, etc.) produce three key outputs:
Practical No. 06
Aim :
Perform the Logistic Regression on a given dataset and interpret the regression
table.
Use the following steps to perform logistic regression in Excel for a dataset that
shows whether or not college basketball players got drafted into the NBA (draft: 0 =
no, 1 = yes) based on their average points, rebounds, and assists in the previous
season.
Next, we will have to create a few new columns that we will use to optimize for these
regression coefficients including the logit, elogit, probability, and log likelihood.
Next, we will create the logit column by using the following formula:
Step 4: Create values for elogit.
Next, we will create values for elogit by using the following formula:
Next, we will create values for probability by using the following formula:
Lastly, we will find the sum of the log likelihoods, which is the number we will
attempt to maximize to solve for the regression coefficients.
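The helper columns can be sketched in Python to show how they chain together. This is a sketch under assumptions, not the worksheet's actual formulas (which are not reproduced in this text): the player rows are hypothetical, the probability column is modelled directly as P(draft = 1), and the log likelihood uses the standard form y·ln(p) + (1 − y)·ln(1 − p).

```python
import math

# Hypothetical rows: (points, rebounds, assists, draft)
players = [
    (12.0, 6.0, 3.0, 1),
    (8.0, 4.0, 2.0, 0),
    (15.0, 5.0, 6.0, 1),
    (6.0, 3.0, 1.0, 0),
    (10.0, 7.0, 4.0, 1),
    (9.0, 2.0, 5.0, 0),
]

def row_columns(coefs, row):
    """Return (logit, elogit, probability, log likelihood) for one player."""
    b0, b1, b2, b3 = coefs
    points, rebounds, assists, draft = row
    logit = b0 + b1 * points + b2 * rebounds + b3 * assists
    elogit = math.exp(logit)
    probability = elogit / (1.0 + elogit)  # assumed here to mean P(draft = 1)
    log_lik = draft * math.log(probability) + (1 - draft) * math.log(1.0 - probability)
    return logit, elogit, probability, log_lik

def total_log_likelihood(coefs, rows):
    """Sum of the log likelihood column: the quantity to be maximized."""
    return sum(row_columns(coefs, row)[3] for row in rows)

start = (0.0, 0.0, 0.0, 0.0)  # arbitrary starting guess for the coefficients
total = total_log_likelihood(start, players)
print(round(total, 4))
```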
Step 8: Use the Solver to solve for the regression coefficients.
If you haven’t already installed the Solver in Excel, use the following steps to do so:
Click File.
Click Options.
Click Add-Ins.
Click Solver Add-In, then click Go.
In the new window that pops up, check the box next to Solver Add-In, then
click OK.
Once the Solver is installed, go to the Analysis group on the Data tab and
click Solver. Enter the following information:
Set Objective: Choose cell H14 that contains the sum of the log likelihoods.
By Changing Variable Cells: Choose the cell range B15:B18 that contains
the regression coefficients.
Make Unconstrained Variables Non-Negative: Uncheck this box.
Select a Solving Method: Choose GRG Nonlinear.
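What the Solver does here is search for the coefficients that maximize the sum of the log likelihoods. The sketch below illustrates the same idea with plain gradient ascent, which is a different and simpler method than Excel's GRG Nonlinear; the player rows, step count and learning rate are all hypothetical choices, and the model is written directly for P(draft = 1).

```python
import math

# Hypothetical rows: (points, rebounds, assists, draft)
players = [
    (12.0, 6.0, 3.0, 1),
    (8.0, 4.0, 2.0, 0),
    (15.0, 5.0, 6.0, 1),
    (6.0, 3.0, 1.0, 0),
    (10.0, 7.0, 4.0, 0),
    (9.0, 2.0, 5.0, 1),
    (14.0, 3.0, 2.0, 0),
    (7.0, 6.0, 5.0, 1),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(coefs, rows):
    """Sum of y*ln(p) + (1 - y)*ln(1 - p) over all rows."""
    total = 0.0
    for *features, y in rows:
        p = sigmoid(coefs[0] + sum(b * x for b, x in zip(coefs[1:], features)))
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def fit(rows, steps=5000, learning_rate=0.005):
    """Gradient ascent on the log likelihood, starting from all-zero coefficients."""
    coefs = [0.0] * 4
    n = len(rows)
    for _ in range(steps):
        grad = [0.0] * 4
        for *features, y in rows:
            x = [1.0, *features]  # leading 1.0 pairs with the intercept
            p = sigmoid(sum(b * xi for b, xi in zip(coefs, x)))
            for j in range(4):
                grad[j] += (y - p) * x[j]
        coefs = [b + learning_rate * g / n for b, g in zip(coefs, grad)]
    return coefs

before = log_likelihood([0.0] * 4, players)
fitted = fit(players)
after = log_likelihood(fitted, players)
print(f"log likelihood: {before:.4f} -> {after:.4f}")
```

Leaving the coefficients free to go negative matches unchecking Make Unconstrained Variables Non-Negative in the Solver dialog.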
However, typically in logistic regression we’re interested in the probability that the
response variable = 1.
So, we can simply reverse the signs on each of the regression coefficients:
Now these regression coefficients can be used to find the probability that draft = 1.
For example, suppose a player averages 14 points per game, 4 rebounds per game,
and 5 assists per game. The probability that this player will get drafted into the NBA
can be calculated as:
Since this probability is greater than 0.5, we predict that this player would get drafted
into the NBA.
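The prediction step can be sketched as follows. The coefficients in `example_coefs` are placeholders chosen for illustration only and are not the Solver results of this practical; the player (14 points, 4 rebounds, 5 assists) and the 0.5 cut-off follow the text.

```python
import math

def draft_probability(coefs, points, rebounds, assists):
    """P(draft = 1) from a logistic model with coefficients (b0, b1, b2, b3)."""
    b0, b1, b2, b3 = coefs
    logit = b0 + b1 * points + b2 * rebounds + b3 * assists
    return 1.0 / (1.0 + math.exp(-logit))

# Placeholder coefficients for illustration only; NOT the Solver results.
example_coefs = (-8.0, 0.4, 0.3, 0.5)

p = draft_probability(example_coefs, points=14, rebounds=4, assists=5)
prediction = 1 if p > 0.5 else 0  # 1 = drafted, 0 = not drafted
print(round(p, 4), prediction)
```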
INPUT:
Practical No. 07
Aim :
Install Tableau, Understand the User Interface, Dimensions, Measures, Pages,
Filters, and Marks and Show Me, Dataset Connections and Create a visualization.
Theory:
There are two points to consider here:
Tableau Public is free.
Tableau Desktop is available only for commercial use.
2- The file will start downloading in “.exe” format. You can view the download progress
on the bottom-left corner of the tab.
3- Once the progress is 100 percent, open the file. Accept the terms and conditions
by selecting the checkboxes and click on the “Install” button.
4- Once the installation is complete, open Tableau. The start screen of Tableau
Public appears as shown below.
2- Click on the “TRY NOW” button in the top-right corner of the website as shown
below.
3- Once you click on the “TRY NOW” button, you will be redirected to a page that will
ask you to enter your official email address. After filling in the email address, click
on the “DOWNLOAD FREE TRIAL” button.
4- The latest version of Tableau Desktop will start downloading, and you will be able
to view the download progress in the bottom-left corner of the screen.
5- Once downloaded, open the file. Accept the terms and conditions, and click on the
“Install” button.
6- A pop-up option will appear asking for the approval of the administrator to install
the software. Click on “YES” to approve and move further.
7- On approval, the installation will start. On the completion of the installation, open
Tableau.
8- This is the final stage that asks for registration. Click on “Activate Tableau” and
enter your license details or credentials.
9- Click on “Start Trial Now” and wait for the registration process to complete.
1- To open the Tableau Workspace, go to File in the Start window and click on “New.”
The Tableau Workspace looks like the following screenshot.
Tableau Navigation
Data Source: It is typically used for either the addition of a new data
source or the modification of an existing source.
Current Sheet: In the image, what you see as “Sheet 1” refers to the
current sheet. All sheets and dashboards in the current workbook can be
viewed here.
New Sheet: The first square box with a “+” sign refers to this option. It is
used for creating new worksheets in Tableau Desktop.
New Dashboard: The second square box with a “+” sign refers to this
option. This icon is used to create a new dashboard in the Tableau workbook.
New Story: The third square box with a “+” sign refers to this option. This
icon is used to create a new story in the Tableau workbook.
INPUT & OUTPUT:
Practical No. 08
Aim :
Various graphs in Tableau, Integration of Map and geo-locations, Creating
Interactive Dashboard and Publishing your Dashboard to Tableau Public Site.
Theory:
Importance of Spatial Data
Spatial data is one of the most in-demand data types because it can help
us:
determine relationships
identify patterns
make predictions
Maps are certainly a great way to display spatial data. Nowadays, you don’t need
any GIS Software for creating maps and publishing them. BI Tools are now capable
of creating a thematic map.
There are many Business Intelligence tools, such as Tableau, Qlik Sense, Power BI,
Looker and MicroStrategy, that help you create powerful and effective graphs,
dashboards and visualisations.
Tableau Public is free software that allows anyone to connect to a spreadsheet or
file and create interactive data visualizations for the web. Tableau Public is
available for download for Windows and Mac.
However, you need to create a profile to present your visualisations on the internet.
When you open Tableau Public Application, you can see the “Connect” area at the
upper left.
Browse to your LondonBoroughProfile.csv file and click Open. Now you have two
connections and two files: one is a “Spatial file” and the other is a “Text file”.
If you want to preview your tabular data, you can click the “View Data” button and
preview your data.
To create a relationship between your two files, select the LondonBoroughProfile.csv
file, then drag and drop it to the relationship area.
Now you have to select the common column (borough name) of the two files. The name of
the column in LondonBoroughs.geojson is “Neighbourhood” and the name of the
column in LondonBoroughProfile.csv is “Borough Name”.
After creating the relationship between your files, you can see connections as shown
below.
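Conceptually, the relationship matches each map feature to the profile row that has the same borough name. The Python sketch below illustrates that matching only and is not how Tableau implements relationships; the records and values are hypothetical, and only the key column names “Neighbourhood” and “Borough Name” come from the text.

```python
# Hypothetical extracts of the two files; only the key column names follow the text.
geo_features = [
    {"Neighbourhood": "Camden", "geometry": "<polygon>"},
    {"Neighbourhood": "Hackney", "geometry": "<polygon>"},
    {"Neighbourhood": "City of London", "geometry": "<polygon>"},
]
profile_rows = [
    {"Borough Name": "Camden", "Average Age 2017": 36.4},
    {"Borough Name": "Hackney", "Average Age 2017": 33.1},
]

def relate(features, rows, left_key, right_key):
    """Attach the matching profile row (or None) to each map feature."""
    lookup = {row[right_key]: row for row in rows}
    return [
        {**feature, "profile": lookup.get(feature[left_key])}
        for feature in features
    ]

joined = relate(geo_features, profile_rows, "Neighbourhood", "Borough Name")
for item in joined:
    print(item["Neighbourhood"], "->", item["profile"])
```

A feature without a matching row keeps its shape on the map but carries no profile values.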
Step 3: Creating a Map Sheet
Now you have a basic map of London Boroughs. You need to configure your map
with some appearance features such as label, colour, tooltip etc.
To create a label: Select Neighbourhood, then drag and drop it to Label.
OUTPUT:
Practical No. 09
Aim :
Scatter Plots, Data Highlighter, Pages and Cards, Annotations, Creating Story and
publishing on Tableau Public
Theory:
Add new features as a Tooltip:
A tooltip provides additional information about attributes, so you can add
more attributes for the user to see when the cursor moves over the map. Select one
or more attributes, then drag and drop them to the Tooltip area. In this tutorial, I
selected the “Employment Rate (%) 2015”, “Average Age 2017” and “Number of
Cars, (2011 Census)” attributes and renamed them as shown below:
Employment Rate
Average Age
Number of Cars
To create new graphs, click New Worksheet, which is placed at the lower left.
After you create the new worksheet, an empty worksheet appears. Select Borough
Name, then drag and drop it to the Columns area. Select Crime Rates, then drag and
drop it to the Rows area.
To colourise and add a label, select Crime Rates and drag and drop it to Colour, then
drag and drop Crime Rates again to Label.
There is no data for the “City of London” borough, so you need to remove this record from
your graph. Click the arrow next to Borough Name and select Filter. Then uncheck City of
London.
You can create more worksheets that include different graphs. Finally, you can
combine these worksheets into a dashboard. To create a new dashboard, click the New
Dashboard button and an empty dashboard appears.
Drag Sheet1 (Thematic Map) and Sheet2 (Bar Chart) and drop them onto the dashboard area.
Now, if you select a borough from the map or the graph, the other graphs update
dynamically.
Step 6: Publishing
You have now prepared a dashboard that includes your map and graph. Next, you need
to publish this dashboard to your Tableau Public profile. Click File and select Save to
Tableau Public As.
After you save the dashboard, your Tableau Public profile opens automatically. You
can use the Full-Screen button to maximize your dashboard.
By following the above steps, you can convert the chart into a scatter plot or any
other type of visualization.
OUTPUT:
Practical No. 10
Aim: Given a case study: Perform Interactive Data Visualization with Tableau.
Case Study: