Data Science Imp Q and A

Uploaded by Sowjanya Balaga

1. Differentiate between supervised learning and unsupervised learning.

 Supervised learning: Supervised learning is learning a model from data with an
input variable (say, X) and an output variable (say, Y), using an algorithm to map
the input to the output.
That is, Y = f(X)
 The basic aim is to approximate the mapping function so well that when there is a
new input data (x) then the corresponding output variable can be predicted.
 The machine learns under supervision. It contains a model that is able to predict
with the help of a labeled dataset. A labeled dataset is one where you already
know the target answer.

 In this case, we have images that are labeled as a spoon or a knife.


 This known data is fed to the machine, which analyzes and learns the association
of these images based on their features such as shape, size, sharpness, etc.
 Now when a new image is fed to the machine without any label, the machine is
able to predict accurately that it is a spoon with the help of the past data.

Unsupervised Learning:

 In unsupervised learning, the machine uses unlabeled data and learns on its own
without any supervision, i.e.,
o Unsupervised learning is a machine learning technique in which models are
not supervised using a training dataset. Instead, the model itself finds the
hidden patterns and insights in the given data.
o The machine tries to find a pattern in the unlabeled data and gives a response.
 For example

o The unsupervised learning algorithm is given an input dataset containing
images of different types of spoons and knives.

o The algorithm is never trained on the given dataset,

 which means it does not have any idea about the features of the
dataset,
 i.e., whether an image is a spoon or a knife.
 The machine identifies patterns in the given set and groups the images
based on their patterns, similarities, etc.
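The contrast above can be sketched in a few lines of R. As an illustrative stand-in for the spoon/knife images (not part of the original example), this uses the built-in iris data: the labeled copy is what a supervised learner sees, and the label-free copy is what an unsupervised algorithm such as k-means works with.

```r
# Supervised: the data are labeled, each row has a known target (Species).
labeled <- iris                 # features plus the target column
head(labeled$Species)           # the "answers" the model learns from

# Unsupervised: drop the labels and let the algorithm find groups on its own.
unlabeled <- iris[, 1:4]        # features only, no target
set.seed(1)
groups <- kmeans(unlabeled, centers = 3, nstart = 10)
table(groups$cluster)           # 3 discovered groups, never told what they are
```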

2. What is meant by logistic regression?


 Logistic regression is the appropriate regression analysis to conduct when the
dependent variable is dichotomous (binary).
 Logistic regression is a predictive analysis.
 Logistic regression is used to describe data and to explain the relationship between
one dependent binary variable and one or more nominal, ordinal, interval or ratio-
level independent variables.
 Logistic Regression is a “Supervised machine learning” algorithm that can be used to
model the probability of a certain class or event. It is used when the data is linearly
separable and the outcome is binary or dichotomous in nature.
 That means Logistic regression is usually used for Binary classification problems.
 Binary Classification refers to predicting the output variable that is discrete
in two classes.
 A few examples of Binary classification are Yes/No, Pass/Fail, Win/Lose,
Cancerous/Non-cancerous, etc.
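In R, a binary logistic regression of the kind described above is commonly fitted with the base function glm() and family = binomial. The pass/fail data below are made up purely for illustration.

```r
# Hypothetical study-hours vs. pass/fail data (illustrative only)
hours  <- c(0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5)
passed <- c(0,   0, 0,   0, 1,   0, 1,   1, 1,   1)   # dichotomous outcome

# Fit the logistic regression model
model <- glm(passed ~ hours, family = binomial)

# Predicted probability of passing after 4 hours of study
p <- predict(model, data.frame(hours = 4), type = "response")
p
```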

3. Draw a box plot of the following observations 28, 42, 25, 34, 37, 26, 33, 28, 36, 33, 22.

Boxplots are a measure of how well the data in a data set are distributed. A boxplot
divides the data set into quartiles and represents the minimum, first quartile, median,
third quartile and maximum of the data set. It is also useful for comparing the
distribution of data across data sets by drawing a boxplot for each of them.
Boxplots are created in R by using the boxplot() function.
Syntax
The basic syntax to create a boxplot in R is −
boxplot(x, data, notch, varwidth, names, main)
Following is the description of the parameters used −
 x is a vector or a formula.
 data is the data frame.
 notch is a logical value. Set as TRUE to draw a notch.
 varwidth is a logical value. Set as true to draw width of the box proportionate to
the sample size.
 names are the group labels which will be printed under each boxplot.
 main is used to give a title to the graph.
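Using the syntax above, the box plot for the eleven observations in the question can be drawn like this (the five-number summary the box is built from is also printed):

```r
# Observations from the question
x <- c(28, 42, 25, 34, 37, 26, 33, 28, 36, 33, 22)

# Five-number summary drawn by the box plot: min, lower hinge, median, upper hinge, max
fivenum(x)

boxplot(x, main = "Box plot of the 11 observations")
```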

4. Write the purpose of clustering.


 Clustering is the task of dividing the population or data points into a number of
groups such that data points in the same groups are more similar to other data
points in the same group than those in other groups.
 In simple words, the aim is to segregate groups with similar
traits and assign them into clusters.

 For example,
o you are the head of a rental store and wish to understand the preferences of
your customers to scale up your business.
o Is it possible for you to look at the details of each customer and devise a
unique business strategy for each one of them?
 Definitely not. But what you can do is cluster all of your
customers into, say, 10 groups based on their purchasing habits and
use a separate strategy for the customers in each of these 10 groups.
And this is what we call clustering.
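The rental-store idea can be sketched with k-means in R. The purchasing-habit data below are synthetic, generated only to illustrate grouping customers (here into 2 clusters rather than 10, to keep it small).

```r
# Synthetic customer data: two loose "types" of customers (illustrative only)
set.seed(42)
habits <- data.frame(
  visits_per_month = c(rpois(20, 2),      rpois(20, 8)),
  avg_spend        = c(rnorm(20, 10, 2),  rnorm(20, 40, 5))
)

# Cluster customers into 2 groups instead of devising a strategy per customer
clusters <- kmeans(scale(habits), centers = 2, nstart = 25)

table(clusters$cluster)    # how many customers landed in each group
aggregate(habits, by = list(cluster = clusters$cluster), FUN = mean)
```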

5. How do you do a regression analysis in R?

Refer the classnotes for theory

6. What are the different R packages?


1) tidyr
The word tidyr comes from the word tidy, which means clear. The tidyr package
is used to make the data 'tidy'.
2) ggplot2
R allows us to create graphics declaratively. R provides the ggplot2 package for this
purpose.
3) ggraph
R provides an extension of ggplot2 known as ggraph. ggraph removes ggplot2's
dependency on tabular data.
4) dplyr
This library provides several functions for manipulating data frames in R.
5) tidyquant
The tidyquant package is a financial package which is used for carrying out
quantitative financial analysis.
6) dygraphs
The dygraphs package provides an R interface to the dygraphs JavaScript charting
library. This package is essentially used for plotting time-series data in R.
7) leaflet
For creating interactive visualizations, R provides the leaflet package, an interface
to the open-source Leaflet JavaScript library.
8) ggmap
For delineating spatial visualization, the ggmap package is used. It is a mapping
package which consists of various tools for geolocating and routing.
9) glue
R provides the glue package to perform data-wrangling operations. This
package is used for evaluating R expressions which are present within strings.
10) shiny
R allows us to develop interactive and aesthetically pleasing web apps through the
shiny package.

7. How do you create a Data frame in R?


 Data Frames are data displayed in the format of a table.
 Data Frames can have different types of data inside them. While the first column
can be character, the second and third can be numeric or logical. However, within
each column every value should have the same type of data.
 Use the data.frame() function to create a data frame:
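A minimal example of data.frame() with mixed column types (the names and values are made up for illustration):

```r
# Each column has one type; different columns may differ in type
df <- data.frame(
  name   = c("Amit", "Sowmya", "Ravi"),   # character
  age    = c(21, 23, 22),                 # numeric
  passed = c(TRUE, TRUE, FALSE)           # logical
)

print(df)
str(df)    # shows one type per column
```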
8. Distinguish between invalid values and outliers.
 Outliers are extreme values that differ from most other data points in a dataset.
They can have a big impact on your statistical analyses and skew the results of
any hypothesis tests.
 It’s important to carefully identify potential outliers in your dataset and deal with
them in an appropriate manner for accurate results.
 Outliers are values at the extreme ends of a dataset.
 Some outliers represent true values from natural variation in the population. Other
outliers may result from incorrect data entry, equipment malfunctions, or
other measurement errors.
 An outlier isn’t always a form of dirty or incorrect data, so you have to be careful
with them in data cleansing.
Note: No need to write the example in the internal exam; it is for your understanding.
Interquartile range method

1. Sort your data from low to high


2. Identify the first quartile (Q1), the median, and the third quartile (Q3).
3. Calculate your IQR = Q3 – Q1
4. Calculate your upper fence = Q3 + (1.5 * IQR)
5. Calculate your lower fence = Q1 – (1.5 * IQR)
6. Use your fences to highlight any outliers, all values that fall outside your fences.

Example: Using the interquartile range to find outliers


We’ll walk you through the popular IQR method for identifying outliers using a step-by-
step example.

Your dataset has 11 values. You have a couple of extreme values in your dataset, so you’ll
use the IQR method to check whether they are outliers.

25 37 24 28 35 22 31 53 41 64 29

Step 1: Sort your data from low to high


First, you’ll simply sort your data in ascending order.
22 24 25 28 29 31 35 37 41 53 64

Step 2: Identify the median, the first quartile (Q1), and the third quartile (Q3)
The median is the value exactly in the middle of your dataset when all values are ordered
from low to high.

Since you have 11 values, the median is the 6th value. The median value is 31.

22 24 25 28 29 31 35 37 41 53 64

Next, we’ll use the exclusive method for identifying Q1 and Q3. This means we remove
the median from our calculations.

The Q1 is the value in the middle of the first half of your dataset, excluding the median.
The first quartile value is 25.

22 24 25 28 29

Your Q3 value is in the middle of the second half of your dataset, excluding the median.
The third quartile value is 41.

35 37 41 53 64

Step 3: Calculate your IQR


The IQR is the range of the middle half of your dataset. Subtract Q1 from Q3 to calculate the IQR.

Formula: IQR = Q3 – Q1
Calculation: Q1 = 25, Q3 = 41
IQR = 41 – 25 = 16

Step 4: Calculate your upper fence


The upper fence is the boundary around the third quartile. It tells you that any values exceeding the
upper fence are outliers.

Formula: Upper fence = Q3 + (1.5 * IQR)
Calculation: Upper fence = 41 + (1.5 * 16)

= 41 + 24

= 65

Step 5: Calculate your lower fence


The lower fence is the boundary around the first quartile. Any values less than the lower fence are
outliers.

Formula: Lower fence = Q1 – (1.5 * IQR)
Calculation: Lower fence = 25 – (1.5 * 16)

= 25 – 24

= 1

Step 6: Use your fences to highlight any outliers


Go back to your sorted dataset from Step 1 and highlight any values that are greater than
the upper fence or less than your lower fence. These are your outliers.

 Upper fence = 65


 Lower fence = 1

22 24 25 28 29 31 35 37 41 53 64

With the exclusive quartile method, every value falls inside the fences, so this dataset has
no outliers by the 1.5 * IQR rule; 64 only comes close to the upper fence. (Other quartile
conventions, such as the Tukey hinges used by R's boxplot(), give narrower fences and
would flag 64 as an outlier.)
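The steps above can be sketched in R. This computes the exclusive-method quartiles by hand; note that R's own quantile() and boxplot() use different quartile conventions (interpolation and Tukey hinges, respectively), so their fences differ slightly from a hand calculation.

```r
x <- c(25, 37, 24, 28, 35, 22, 31, 53, 41, 64, 29)
s <- sort(x)                       # Step 1: sort from low to high
n <- length(s)

# Step 2 (exclusive method): drop the median, take the medians of the two halves
q1 <- median(s[1:floor(n / 2)])            # 22 24 25 28 29  -> 25
q3 <- median(s[(ceiling(n / 2) + 1):n])    # 35 37 41 53 64  -> 41
iqr <- q3 - q1                             # Step 3

upper_fence <- q3 + 1.5 * iqr              # Step 4
lower_fence <- q1 - 1.5 * iqr              # Step 5
outliers <- s[s < lower_fence | s > upper_fence]   # Step 6

c(Q1 = q1, Q3 = q3, IQR = iqr, lower = lower_fence, upper = upper_fence)
boxplot.stats(x)$out   # Tukey-hinge fences are narrower and flag 64
```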

Invalid Data

 Invalid data are values that are originally generated incorrectly.


 They may be individual data points or include all the measurements for a specific
metric.
 Invalid data can be difficult to identify visually but may become apparent during an
exploratory statistical analysis.
 They generally have to be removed from the data set.

LONG ANSWER Q&A

1) (i) Discuss data types in R with examples.


A variable can store different types of values such as numbers, characters etc. These
different types of data that we can use in our code are called data types. For example,
x <- 123L
Here, 123L is an integer data. So the data type of the variable x is integer.
We can verify this by printing the class of x.
x <- 123L
# print value of x
print(x)

# print type of x
print(class(x))
Output
[1] 123
[1] "integer"
Here, x is a variable of data type integer.

Different Types of Data Types


In R, there are 6 basic data types:
 logical
 numeric
 integer
 complex
 character
 raw

1. Logical Data Type


The logical data type in R is also known as boolean data type. It can only have two
values: TRUE and FALSE. For example,

bool1 <- TRUE


print(bool1)
print(class(bool1))
bool2 <- FALSE
print(bool2)
print(class(bool2))
Output
[1] TRUE
[1] "logical"
[1] FALSE
[1] "logical"

Note: You can also define logical variables with a single letter
- T for TRUE or F for FALSE.

For example,
is_weekend <- F
print(class(is_weekend)) # "logical"

2. Numeric Data Type


In R, the numeric data type represents all real numbers with or without decimal values.
For example,

# floating point values


weight <- 63.5
print(weight)
print(class(weight))

# real numbers
height <- 182

print(height)
print(class(height))
Output
[1] 63.5
[1] "numeric"
[1] 182
[1] "numeric"
Here, both weight and height are variables of numeric type.

3. Integer Data Type

The integer data type specifies whole-number values without decimal points. We use the
suffix L to specify integer data. For example,

integer_variable <- 186L


print(class(integer_variable))
Output
[1] "integer"

Here, 186L is an integer data. So we get "integer" when we print the class
of integer_variable.

4. Complex Data Type

The complex data type is used to specify values with an imaginary part in R. We use the
suffix i to specify the imaginary part. For example,

# 2i represents imaginary part


complex_value <- 3 + 2i

# print class of complex_value


print(class(complex_value))
Output
[1] "complex"
Here, 3 + 2i is of complex data type because it has an imaginary part 2i.

5. Character Data Type


The character data type is used to specify character or string values in a variable.
In programming, a string is a set of characters. For example, 'A' is a single character
and "Apple" is a string.
You can use single quotes '' or double quotes "" to represent strings. R does not
distinguish single characters from strings: both quoting styles create character data,
and double quotes are the usual convention.

For example,
# create a string variable
fruit <- "Apple"

print(class(fruit))

# create a character variable


my_char <- 'A'

print(class(my_char))
Output
[1] "character"
[1] "character"
Here, both the variables - fruit and my_char - are of character data type.

6. Raw Data Type


A raw data type specifies values as raw bytes. You can use the following methods to
convert character data types to a raw data type and vice-versa:
charToRaw() - converts character data to raw data
rawToChar() - converts raw data to character data

For example,
# convert character to raw
raw_variable <- charToRaw("Welcome to Programiz")

print(raw_variable)
print(class(raw_variable))

# convert raw to character


char_variable <- rawToChar(raw_variable)

print(char_variable)
print(class(char_variable))
Output
[1] 57 65 6c 63 6f 6d 65 20 74 6f 20 50 72 6f 67 72 61 6d 69 7a
[1] "raw"
[1] "Welcome to Programiz"
[1] "character"
In this program,
We have first used the charToRaw() function to convert the string "Welcome to
Programiz" to raw bytes.

This is why we get "raw" as output when we print the class of raw_variable.
Then, we have used the rawToChar() function to convert the data in raw_variable back to
character form.
This is why we get "character" as output when we print the class of char_variable.

(ii) Explain about various control structures in R.


List of R Control Structures with Examples

1. if Condition in R
The statements inside the if block are carried out only if the test condition evaluates to
TRUE. R makes it even easier: there is no then keyword; the body simply follows the
condition in braces.
Syntax:
if (test_expression) {
statement
}

2. if-else Condition in R
An if…else statement contains the same elements as an if statement (see the preceding
section), with some extra elements:
 The keyword else, placed after the first code block.
 The second block of code, contained within braces, that has to be carried out, only if
the result of the condition in the if() statement is FALSE.
Syntax:
if (test_expression) {
statement
} else {
statement
}
3. for Loop in R
A loop is a sequence of instructions that is repeated until a certain condition is reached. for,
while and repeat, with the additional clauses break and next are used to construct loops.

Example:
A for loop is executed a known number of times: it iterates over the elements of a
vector or list, running the block contained within curly braces once per element.
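A minimal for loop illustrating this:

```r
# Iterate over the vector 1:5, collecting the square of each element
squares <- c()
for (i in 1:5) {
  squares <- c(squares, i^2)   # body runs once per element
}
print(squares)   # 1 4 9 16 25
```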

4. Nested Loop in R
A nested loop is a loop placed inside the body of another loop; the inner loop runs to
completion on every iteration of the outer loop. (For parallel programming, the foreach
package offers a similar construct that, unlike many parallel packages for R, does not
require the body of the loop to be turned into a function; its %:% nesting operator is
used to create nested foreach loops.)
Example:
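A small nested-loop sketch, building a 3 x 3 multiplication table:

```r
m <- matrix(0, nrow = 3, ncol = 3)
for (i in 1:3) {        # outer loop: rows
  for (j in 1:3) {      # inner loop: completes fully on every outer iteration
    m[i, j] <- i * j
  }
}
print(m)
```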

5. while Loop in R
The format is while (cond) expr, where cond is the condition to test and expr is an
expression, usually a block in braces.
The condition is evaluated before every iteration, so any variable it uses must already
exist before the loop starts; and if the condition is FALSE at the first attempt, the loop
body is not executed at all.

Example:
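A small while-loop sketch, summing the numbers 1 to 5:

```r
count <- 1          # the condition's variable must exist before the loop
total <- 0
while (count <= 5) {
  total <- total + count
  count <- count + 1
}
print(total)   # 15

# If the condition is FALSE at the first attempt, the body never runs:
while (FALSE) print("never reached")
```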

6. repeat and break Statement in R


We use the break statement inside a loop (repeat, for, while) to stop the iterations and
pass control outside of the loop. In a nested looping situation, where there is a loop inside
another loop, this statement exits from the innermost loop that is being evaluated.
A repeat loop is used to iterate over a block of code, multiple numbers of times. There is
no condition check in a repeat loop to exit the loop. We ourselves put a condition explicitly
inside the body of the loop and use the break statement to exit the loop. Failing to do so
will result in an infinite loop.
Syntax:
repeat {
# simulations; generate some value have an expectation if within some range,
# then exit the loop
if ((value - expectation) <= threshold) {
break
}
}

Example of Break Statement in R:
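Following the syntax above, a concrete repeat/break sketch:

```r
# repeat has no built-in condition check, so we must break explicitly,
# otherwise the loop runs forever
i <- 0
repeat {
  i <- i + 1
  if (i >= 4) {
    break        # exit the innermost loop
  }
}
print(i)   # 4
```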


7. next Statement in R
next jumps to the next cycle of a loop without completing the current iteration: control
moves straight to the evaluation of the loop's condition. The next statement thus enables
skipping the current iteration of a loop without terminating it.

Example:
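A small sketch of next, collecting only the odd numbers from 1 to 10:

```r
odds <- c()
for (i in 1:10) {
  if (i %% 2 == 0) {
    next           # skip even numbers: jump straight to the next cycle
  }
  odds <- c(odds, i)
}
print(odds)   # 1 3 5 7 9
```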

8. return Statement in R
Many times, we will require a function to do some processing and return the result.
This is accomplished with the return() statement in R.
Syntax:
return(expression)

Example
:
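A small sketch of return() inside a function (the grading rule is made up for illustration):

```r
# return() hands a value back and ends the function's execution immediately
grade <- function(marks) {
  if (marks >= 50) {
    return("Pass")   # control leaves the function here
  }
  return("Fail")
}

print(grade(72))   # "Pass"
print(grade(35))   # "Fail"
```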

2) ( i ) Describe briefly about the assumptions of linear regression?


Linear regression is a useful statistical method we can use to understand the
relationship between two variables, x and y. However, before we conduct linear
regression, we must first make sure that four assumptions are met:
1. Linear relationship: There exists a linear relationship between the independent
variable, x, and the dependent variable, y.
How to determine if this assumption is met
The easiest way to detect if this assumption is met is to create a scatter plot of x vs.
y. If there is a linear relationship between the two variables, the points in the plot
will fall along a roughly straight line, and we can say the assumption is met.

2. Independence: The residuals are independent. In particular, there is no correlation


between consecutive residuals in time series data.
The simplest way to test if this assumption is met is to look at a residual time series
plot, which is a plot of residuals vs. time. Ideally, most of the residual
autocorrelations should fall within the 95% confidence bands around zero, which
are located at about ±2/√n, where n is the sample size.
3. Homoscedasticity: The next assumption of linear regression is that the residuals
have constant variance at every level of x. This is known as homoscedasticity.
The assumption of a classical linear regression model is that there should be
homoscedasticity in the data. A scatterplot of residuals is again the ideal way to
check it. The data are said to be homoscedastic when the residuals are equal in spread
across the line of regression; in other words, the variance is constant.
4. Normality: The residuals of the model are normally distributed. This assumption of
linear regression is that all the variables in the data set should be multivariate normal.
In other words, it suggests that the linear combination of the random variables should have
a normal distribution.
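The four checks above can be sketched with base R diagnostics. This uses R's built-in cars data as a stand-in example (not from the original text):

```r
# Fit a simple linear regression on the built-in cars data
fit <- lm(dist ~ speed, data = cars)
res <- residuals(fit)

plot(cars$speed, cars$dist)   # 1. linearity: scatter plot of x vs. y

acf(res)                      # 2. independence: residual autocorrelations near zero

plot(fitted(fit), res)        # 3. homoscedasticity: no funnel shape in residuals

qqnorm(res); qqline(res)      # 4. normality: points near the Q-Q line
```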

(ii) Describe about multinomial logistic regression model?


 The multinomial logistic regression model is a classification algorithm that
extends logistic regression to problems with more than two possible outcome
classes and one or more independent variables.
 While binary logistic regression predicts binary outcomes (0 or 1, yes or no,
spam or not spam, etc.),
o The multinomial regression model predicts one out of k possible
outcomes (k can be any arbitrary positive integer).
 The dependent variable in the multinomial logistic regression algorithm can have
two or more possible outcomes.
 Multinomial logistic regression is practical to make classifications based on the
values of a set of predictor variables.
 Here’s a simple example to understand the dependent and independent variables
in multinomial logistic regression:
o Suppose we have a machine learning model that uses multinomial logistic
regression to predict the ice cream flavour a person is likely to choose.
o Here, factors such as the person’s age, gender, mood, occasion, income
status, and price of ice cream are the independent variables that determine
the ice cream flavour the person will possibly go for.
o In this example, the dependent variable is the ice cream flavour that can
belong to many categories (chocolate, vanilla, butterscotch, coffee, etc.).
Assumptions for Multinomial Logistic Regression
Assumption #1
The dependent variable should be either nominal or ordinal. A nominal variable has two or
more categories with no meaningful ordering, such as three types of cuisines: Continental,
Chinese, and Italian. On the contrary, ordinal variables have two or more categories with
an order. An example of an ordinal variable would be the grades in an exam, that is,
Excellent (A), Good (B), and Average ( C ).
Assumption #2
You have a set of one or more independent variables that can be continuous, nominal, or
ordinal. Continuous variables are numeric variables and can have an infinite number of
values within a specified range.
Assumption #3
The observations must be independent, and the dependent variables must be mutually
exhaustive and exclusive. Mutually exhaustive implies every observation must fall into
some category of the dependent variable. On the other hand, mutually exclusive means
when there are two or more categories of the variable, no observation falls into more than
one category.
Assumption #4
There must be no multicollinearity amidst independent variables. Multicollinearity happens
when more than two independent variables have a high correlation, making it difficult to
understand the contribution of each independent variable to the dependent variable
category.
Assumption #5
The data points must not have outliers, highly influential points, or high leverage values.
Assumption #6
Lastly, any constant independent variable and the dependent variable’s logit transformation
must have a linear relationship. The idea behind a logit is to restrict the probability values
between 0 and 1 using a logarithmic function.
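One common way to fit a multinomial logistic regression in R is multinom() from the nnet package (a recommended package shipped with R); other packages exist as well. This sketch predicts iris Species, a nominal dependent variable with three categories, from four continuous predictors:

```r
library(nnet)

# Dependent variable: Species (3 nominal categories); 4 continuous predictors
fit <- multinom(Species ~ ., data = iris, trace = FALSE)

# Predicted class and class probabilities for the first flower
predict(fit, iris[1, ])
predict(fit, iris[1, ], type = "probs")
```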
3) Explain the K-means algorithm and implementation of k-means in R.

4) Explain K-nearest neighbour technique and implementation of KNN in R.


 KNN, which stands for K Nearest Neighbors, is a Supervised Machine Learning
algorithm that classifies a new data point into the target class, depending on the
features of its neighboring data points.
 For example, suppose we want a machine to distinguish between images of cats and
dogs. To do this we must input a dataset of cat and dog images, and we have to train
our model to detect the animals based on certain features.


 When a new image is given to the model,

o the KNN algorithm will classify it into either cats or dogs depending on the
similarity in their features.

o So if the new image has pointy ears, it will classify that image as a cat
because it is similar to the cat images.

 In this manner, the KNN algorithm classifies data points based on how similar they
are to their neighboring data points.

Features Of KNN Algorithm


The KNN algorithm has the following features:

 KNN is a Supervised Learning algorithm that uses labeled input data set to predict
the output of the data points.
 It is one of the most simple Machine learning algorithms and it can be easily
implemented.
 It is mainly based on feature similarity.
o KNN checks how similar a data point is to its neighbor and classifies the data
point into the class it is most similar to.

 KNN is a non-parametric model which means that it does not make any assumptions
about the data set.
 KNN is a lazy algorithm, this means that it memorizes the training data set instead
of learning a discriminative function from the training data.
 KNN can be used for solving both classification and regression problems.

KNN Algorithm Example


To understand how the KNN algorithm works, let's consider the following scenario:

 Suppose we have two classes of data, namely Class A (squares) and Class
B (triangles)
 The problem statement is to assign the new input data point to one of the two classes
by using the KNN algorithm
 The first step in the KNN algorithm is to define the value of ‘K’. ‘K’ stands for the
number of Nearest Neighbors and hence the name K Nearest Neighbors (KNN).

 Suppose the value of ‘K’ is 3. This means that the algorithm will
consider the three neighbors that are the closest to the new data point in order to
decide the class of this new data point.
 The closeness between the data points is calculated by using measures such as
Euclidean and Manhattan distance
 At ‘K’ = 3, the neighbors include two squares and 1 triangle. So, if we were to
classify the new data point based on ‘K’ = 3, then it would be assigned to Class A
(squares).
 But what if the ‘K’ value is set to 7? That means we are telling the algorithm to look
for the seven nearest neighbors and classify the new data point into the class it is
most similar to.
 At ‘K’ = 7, the neighbors include three squares and four triangles. So, if I were to
classify the new data point based on ‘K’ = 7, then it would be assigned to Class B
(triangles) since the majority of its neighbors were of class B.

In practice, there’s a lot more to consider while implementing the KNN algorithm.

 KNN uses Euclidean distance as a measure to check the distance between a new
data point and its neighbors, let’s see how.

 Let's measure the distance between two points P1 and P2 by using the Euclidean
distance measure.
 The coordinates for P1 and P2 are (1, 4) and (5, 1) respectively.
 The Euclidean distance can be calculated like so:
D(P1, P2) = sqrt((5 - 1)^2 + (1 - 4)^2) = sqrt(16 + 9) = 5

It is as simple as that! KNN makes use of simple measure in order to solve complex
problems, this is one of the reasons why KNN is such a commonly used algorithm.

To sum it up, let’s look at the pseudocode for KNN Algorithm.


KNN Algorithm Pseudocode
Consider the set, (Xi, Ci),

 Where Xi denotes feature variables and ‘i’ are data points ranging from i=1, 2, ….., n
 Ci denotes the output class for Xi for each i

The condition, Ci ∈ {1, 2, 3, ……, c} is acceptable for all values of ‘i’ by assuming that the
total number of classes is denoted by ‘c’.

KNN Algorithm Pseudocode:


 Calculate D(x, xi), where 'i' =1, 2, ….., n and 'D' is the Euclidean measure between the
data points.
 The calculated Euclidean distances must be arranged in ascending order.
 Initialize k and take the first k distances from the sorted list.
 Figure out the k points for the respective k distances.
 Calculate ki, which indicates the number of data points belonging to the ith class among
the k points, i.e. ki ≥ 0
 If ki > kj ∀ i ≠ j, put x in class i.
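The pseudocode above is implemented in R by knn() from the class package (shipped with R). This sketch classifies iris flowers with K = 3; the random split is illustrative:

```r
library(class)

set.seed(1)
idx   <- sample(nrow(iris), 100)          # random train/test split
train <- iris[idx, 1:4]
test  <- iris[-idx, 1:4]
cl    <- iris$Species[idx]                # labels of the training points

# Classify each test point by majority vote among its K = 3 nearest neighbors
# (Euclidean distance)
pred <- knn(train, test, cl, k = 3)

acc <- mean(pred == iris$Species[-idx])   # accuracy on held-out points
acc
```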

5) (i) Discuss the steps involved in Downloading and Installing R?


Installing R on Windows OS
To install R on Windows OS:

1. Go to the CRAN website.


2. Click on "Download R for Windows".
3. Click on "install R for the first time" link to download the R executable
(.exe) file.
4. Run the R executable file to start installation, and allow the app to make
changes to your device.
5. Select the installation language.
6. Follow the installation instructions.
7. Click on "Finish" to exit the installation setup.

R has now been successfully installed on your Windows OS. Open the R GUI to start
writing R code.

(ii) Explain about various data objects in R.

Vectors are the basic R data objects and there are 6 types of atomic vectors. They can
be
 Integer,
 Logical,
 Double,
 Complex,
 Character and
 Raw

Creation of Vector
There are two types of vector creation:
 Single Element Vector
 Multiple Elements Vector
Single Element Vector
Whenever a single value is written in R, it becomes a vector of length 1 and fits into
one of the above vector types.

Multiple Elements Vectors in R programming


 Using the colon operator with numeric data
The colon operator generates a regular sequence of numbers between two limits.

Using the sequence (seq()) function

Accessing Vector Elements


Indexing helps access the elements of a vector. The[ ] brackets are used for indexing.

Lists
Lists are the R objects with numbers, strings, vectors and another list or matrix inside it.
Creating a List
Example to create a list containing numbers, strings, vectors, and logical values.

Naming List Elements


Names can be given to list elements and can be accessed using the corresponding names.

Accessing List Elements


Elements of the list can be accessed using the index of the element.
Syntax:
list_name <- list(.,..,.)
names(list_name) <- c(.,.,.)
print(list_name[1])

Matrices
Matrices are the R objects wherein the elements are organized in a 2-D rectangular shape.
In a matrix, it contains elements of the same atomic types.
The matrix function is denoted as a matrix().
Syntax
matrix(data, nrow, ncol, byrow, dimnames)
 data is the parameter of input,
 nrow is number of rows and
 ncol is the number of columns to be created;
 byrow has TRUE or FALSE as its logical values, and dimnames gives the names of the
rows and columns.

Access Matrix Items

You can access the items by using [ ] brackets. The first number "1" in the bracket
specifies the row-position, while the second number "2" specifies the column-position:

Data Frames
Data Frames are data displayed in the format of a table.
Data Frames can have different types of data inside them. While the first column can
be character, the second and third can be numeric or logical. However, each column should
have the same type of data.
Use the data.frame() function to create a data frame:
Tables
Another common way to store information is in a table. First let us see how to create one
way table. One way to create a table is using the table command. The argument it takes is a
vector of factors, and it calculates the frequency that each factor occurs
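The objects above can be sketched together in a few lines (the values are made up for illustration):

```r
v  <- seq(1, 9, by = 2)                    # multi-element vector via seq()
l  <- list(name = "R", version = 4, ok = TRUE)   # list of mixed types
m  <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
df <- data.frame(id = 1:3, grade = c("A", "B", "A"))
t  <- table(df$grade)                      # one-way frequency table

v[2]          # vector indexing with [ ]
l$name        # list element accessed by name
m[1, 2]       # matrix element: row 1, column 2
print(t)      # "A" occurs twice, "B" once
```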

6) (i) Briefly discuss about multiple linear regression.

(ii) Explain about logistic regression in detail.

 Logistic regression is the appropriate regression analysis to conduct when the


dependent variable is dichotomous (binary).

 Logistic regression is a predictive analysis.

 Logistic regression is used to describe data and to explain the relationship between
one dependent binary variable and one or more nominal, ordinal, interval or ratio-
level independent variables.

 Logistic Regression is another statistical analysis method used when our dependent
variable is dichotomous or binary.
o It just means a variable that has only 2 outputs, for example,

 A person will survive this accident or not,

 The student will pass this exam or not.

 The outcome can either be yes or no (2 outputs).

o This regression technique is similar to linear regression and can be used to


predict the Probabilities for classification problems.

Why do we use Logistic Regression rather than Linear Regression?

 logistic regression only used when our dependent variable is binary .


 But
o linear regression this dependent variable is continuous.
o The second use linear regression to find the best fit line which aims at
minimizing the distance between the predicted value and actual value, the
line will be like this:

 Suppose we classify with a threshold value of 0.5,


 which means that if the value of h(x) is greater than 0.5 we predict a malignant
tumor (1), and if it is less than 0.5 we predict a benign tumor (0). Everything
seems okay here, but now let's change it a bit: if we add some outliers to our
dataset, the best-fit line will shift toward those points.
 The old threshold of 0.5 no longer separates the classes correctly; to keep our
predictions right we would have to lower the threshold, to maybe 0.2. Hence we can
say that linear regression is prone to outliers: only when h(x) is greater than 0.2
would this regression give correct outputs.
 Another problem with linear regression is that the predicted values may be out of range.
We know that probability can be between 0 and 1, but if we use linear regression this
probability may exceed 1 or go below 0.

 To overcome these problems we use Logistic Regression,


 which converts this straight best fit line in linear regression to an S-curve using the
sigmoid function, which will always give values between 0 and 1.

Logistic Function(Sigmoid function)

 How does logistic regression squeeze the output of linear regression between 0 and 1?
 Let's start by mentioning the formula of the logistic function:
P = 1 / (1 + e^-(b0 + b1*x))

We all know the equation of the best fit line in linear regression is:
y = b0 + b1*x

 Let’s say instead of y we are taking probabilities (P).


 But there is an issue here,
 the value of (P) will exceed 1 or go below 0 and we know that range of Probability
is (0-1).

 To overcome this issue we take the “odds” of P (the odds are defined as the probability
that the event will occur divided by the probability that the event will not occur):
P / (1 - P) = b0 + b1*x
 We know that odds are always positive, which means the range will always be (0, +∞).
 The problem here is that the range is still restricted on one side, and it is difficult
to model a variable that has a restricted range. To control this we take the log of
odds, which has a range of (-∞, +∞):
log(P / (1 - P)) = b0 + b1*x

 Now we just want a function of P, because we want to predict probability, right? Not
the log of odds.
 To do so we exponentiate both sides and then solve for P:
P / (1 - P) = e^(b0 + b1*x), which gives P = 1 / (1 + e^-(b0 + b1*x))
 Now we have our logistic function, also called the sigmoid function.
 The graph of a sigmoid function is as shown below. It squeezes a straight line into an S-
curve.
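The derivation above can be sketched in code: the sigmoid squeezes the linear predictor b0 + b1*x into a probability strictly between 0 and 1. The coefficients here are illustrative, not fitted values.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

b0 <- -4; b1 <- 1.5            # illustrative coefficients
x  <- seq(-5, 10, by = 0.1)
p  <- sigmoid(b0 + b1 * x)

range(p)                        # always within (0, 1)
plot(x, p, type = "l", main = "Sigmoid: straight line squeezed into an S-curve")
```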

7) Explain about various performance measures in detail.
