Data Science Imp Q and A
Supervised learning: In supervised learning, the model is trained on data that has an
input variable (say, X) and an output variable (say, Y), and an algorithm learns to map
the input to the output.
That is, Y = f(X)
The basic aim is to approximate the mapping function so well that when there is
new input data (X), the corresponding output variable (Y) can be predicted.
The machine learns under supervision. It contains a model that is able to predict
with the help of a labeled dataset. A labeled dataset is one where you already
know the target answer.
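As a minimal sketch of this idea, the R snippet below fits a linear model to a small labeled dataset and predicts the output for a new input. The data values and the names labeled_data, model and new_input are made up for illustration, and lm() is only one of many algorithms that can approximate f.

```r
# Labeled dataset: every input x comes with a known target y (here y = 2x + 1)
labeled_data <- data.frame(x = c(1, 2, 3, 4, 5),
                           y = c(3, 5, 7, 9, 11))

# Learn an approximation of the mapping Y = f(X)
model <- lm(y ~ x, data = labeled_data)

# Predict the output for an input the model has not seen
new_input <- data.frame(x = 6)
predicted <- predict(model, newdata = new_input)
print(predicted)
```

Because the illustrative data lie exactly on a straight line, the prediction for x = 6 should be very close to 13.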
Unsupervised Learning:
In unsupervised learning, the machine uses unlabeled data and learns on its own
without any supervision, i.e.,
o Unsupervised learning is a machine learning technique in which models are
not supervised using a labeled training dataset. Instead, the models themselves find the hidden
patterns and insights in the given data.
o The machine tries to find a pattern in the unlabeled data and gives a response.
For example, suppose the machine is given a set of unlabeled objects such as spoons and knives.
It does not have any idea about the features of the
dataset,
i.e., whether an object is a spoon or a knife.
The machine identifies patterns in the given set and groups the objects
based on their patterns, similarities, etc.
3. Draw a box plot of the following observations 28, 42, 25, 34, 37, 26, 33, 28, 36, 33, 22.
A boxplot shows how the values in a data set are distributed. It is built on the three
quartiles, which divide the data set into four equal parts. The graph represents the minimum, maximum, median,
first quartile and third quartile of the data set. It is also useful for comparing the
distribution of data across data sets by drawing a boxplot for each of them.
Boxplots are created in R by using the boxplot() function.
Syntax
The basic syntax to create a boxplot in R is −
boxplot(x, data, notch, varwidth, names, main)
Following is the description of the parameters used −
x is a vector or a formula.
data is the data frame.
notch is a logical value. Set as TRUE to draw a notch.
varwidth is a logical value. Set as TRUE to draw the width of each box proportional to
the square root of the sample size of its group.
names are the group labels which will be printed under each boxplot.
main is used to give a title to the graph.
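A minimal sketch for the observations in the question: the variable names obs and five_numbers and the plot labels are illustrative choices. fivenum() returns the five values that boxplot() draws (minimum, lower hinge, median, upper hinge, maximum).

```r
# Observations from the question
obs <- c(28, 42, 25, 34, 37, 26, 33, 28, 36, 33, 22)

# Five-number summary used by the plot: minimum, lower hinge, median, upper hinge, maximum
five_numbers <- fivenum(obs)
print(five_numbers)

# Draw the box plot (a non-interactive session writes it to a file such as Rplots.pdf)
boxplot(obs, main = "Box plot of the observations", ylab = "Value")
```

For these data the five numbers are 22, 27, 33, 35 and 42. Note that the hinges used by boxplot() can differ slightly from the quartiles returned by quantile(), which uses a different rule by default.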
For example,
o You are the head of a rental store and wish to understand the preferences of
your customers to scale up your business.
o Is it possible for you to look at the details of each customer and devise a
unique business strategy for each one of them?
Definitely not. But what you can do is cluster all of your
customers into, say, 10 groups based on their purchasing habits and
use a separate strategy for the customers in each of these 10 groups.
And this is what we call clustering.
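The rental-store idea can be sketched with k-means, one common clustering algorithm (the notes do not say which algorithm is meant, so this choice is an assumption). The customer data below are simulated, and the column names visits_per_month and avg_spend are hypothetical.

```r
set.seed(42)  # k-means starts from random centers, so fix the seed for repeatability

# Hypothetical purchasing habits for 200 customers (simulated, not real data)
customers <- data.frame(
  visits_per_month = runif(200, min = 1, max = 20),
  avg_spend        = runif(200, min = 5, max = 100)
)

# Put both features on the same scale, then group the customers into 10 clusters
scaled <- scale(customers)
fit <- kmeans(scaled, centers = 10, nstart = 5, iter.max = 50)

# Cluster label (1 to 10) assigned to each customer, and the size of each group
customers$group <- fit$cluster
print(table(customers$group))
```

Each customer now carries a group label, so a separate strategy can be designed per group instead of per customer.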
Your dataset has 11 values. You have a couple of extreme values in your dataset, so you’ll
use the IQR method to check whether they are outliers.
26 37 24 28 35 22 31 53 41 64 29
Step 1: Sort your data from low to high
22 24 26 28 29 31 35 37 41 53 64
Step 2: Identify the median, the first quartile (Q1), and the third quartile (Q3)
The median is the value exactly in the middle of your dataset when all values are ordered
from low to high.
Since you have 11 values, the median is the 6th value. The median value is 31.
Next, we’ll use the exclusive method for identifying Q1 and Q3. This means we remove
the median from our calculations.
Q1 is the value in the middle of the first half of your dataset, excluding the median.
The first quartile value is 26.
22 24 26 28 29
Q3 is the value in the middle of the second half of your dataset, excluding the median.
The third quartile value is 41.
35 37 41 53 64
Step 3: Calculate the IQR
Formula: IQR = Q3 - Q1
Calculation: Q1 = 26, Q3 = 41
IQR = 41 - 26
= 15
Step 4: Calculate the upper fence
Formula: Upper fence = Q3 + (1.5 × IQR)
Calculation: Upper fence = 41 + (1.5 × 15)
= 41 + 22.5
= 63.5
Step 5: Calculate the lower fence
Formula: Lower fence = Q1 - (1.5 × IQR)
Calculation: Lower fence = 26 - (1.5 × 15)
= 26 - 22.5
= 3.5
Step 6: Use the fences to identify outliers
Any value below 3.5 or above 63.5 is an outlier. Only 64 lies outside the fences, so it is the one outlier in this dataset.
22 24 26 28 29 31 35 37 41 53 64
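The IQR method can be sketched in R as follows, using the values 26, 37, 24, 28, 35, 22, 31, 53, 41, 64, 29 so that Q1 = 26 as in the calculation. The quartiles are computed by hand with the exclusive method; the variable names are illustrative.

```r
values <- c(26, 37, 24, 28, 35, 22, 31, 53, 41, 64, 29)

sorted <- sort(values)
half   <- floor(length(sorted) / 2)

# Exclusive method: for an odd number of values the median is left out of both halves
q1 <- median(head(sorted, half))
q3 <- median(tail(sorted, half))

iqr         <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers <- sorted[sorted < lower_fence | sorted > upper_fence]
print(c(Q1 = q1, Q3 = q3, IQR = iqr, lower = lower_fence, upper = upper_fence))
print(outliers)
```

R's built-in quantile() and IQR() functions use a different interpolation rule by default, so they may return slightly different quartiles than the exclusive method used here.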
Invalid Data
# define an integer variable
x <- 123L
print(x)
# print type of x
print(class(x))
Output
[1] 123
[1] "integer"
Here, x is a variable of data type integer.
Note: You can also define logical variables with a single letter
- T for TRUE or F for FALSE.
For example,
is_weekend <- F
print(class(is_weekend)) # "logical"
# floating point values
weight <- 63.5
print(weight)
print(class(weight))
# real numbers
height <- 182
print(height)
print(class(height))
Output
[1] 63.5
[1] "numeric"
[1] 182
[1] "numeric"
Here, both weight and height are variables of numeric type.
The integer data type specifies whole numbers, that is, values without decimal points. We use the suffix L to
specify integer data. For example,
integer_variable <- 186L
print(class(integer_variable))
Here, 186L is integer data. So we get "integer" when we print the class
of integer_variable.
The complex data type is used to specify values that have a real part and an imaginary part in R. We use the
suffix i to specify the imaginary part. For example,
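A minimal sketch of a complex value, with complex_value as an illustrative name that does not come from the original notes:

```r
# 3 is the real part, 2i is the imaginary part
complex_value <- 3 + 2i
print(complex_value)
print(class(complex_value))   # "complex"

# Re() and Im() extract the two parts
print(Re(complex_value))
print(Im(complex_value))
```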
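The character data type is used to store text, such as single characters or strings, written inside single or double quotes.
For example,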
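# create a string variable
fruit <- "Apple"
print(class(fruit))
# create a single-character variable (the value 'A' is chosen here for illustration)
my_char <- 'A'
print(class(my_char))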
Output
[1] "character"
[1] "character"
Here, both the variables - fruit and my_char - are of character data type.
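The raw data type stores values as raw bytes. The charToRaw() and rawToChar() functions convert between character data and raw data.
For example,
# convert character to raw
raw_variable <- charToRaw("Welcome to Programiz")
print(raw_variable)
print(class(raw_variable))
# convert raw back to character
char_variable <- rawToChar(raw_variable)
print(char_variable)
print(class(char_variable))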
Output
[1] 57 65 6c 63 6f 6d 65 20 74 6f 20 50 72 6f 67 72 61 6d 69 7a
[1] "raw"
[1] "Welcome to Programiz"
[1] "character"
In this program,
We have first used the charToRaw() function to convert the string "Welcome to
Programiz" to raw bytes.
This is why we get "raw" as output when we print the class of raw_variable.
Then, we have used the rawToChar() function to convert the data in raw_variable back to
character form.
This is why we get "character" as output when we print the class of char_variable.
1. if Condition in R
The task inside an if statement is carried out only if the condition evaluates to TRUE. R makes it even easier:
you can drop the word then and place the code to run in braces right after the condition.
Syntax:
if (test_expression) {
statement
}
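A minimal sketch of the syntax above; the variable names and the threshold of 30 are illustrative.

```r
temperature <- 35   # illustrative value

message_text <- "normal"
if (temperature > 30) {
  # Runs only because the condition is TRUE
  message_text <- "hot"
}
print(message_text)
```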
2. if-else Condition in R
An if…else statement contains the same elements as an if statement (see the preceding
section), with some extra elements:
The keyword else, placed after the first code block.
The second block of code, contained within braces, that has to be carried out, only if
the result of the condition in the if() statement is FALSE.
Syntax:
if (test_expression) {
statement
} else {
statement
}
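A minimal sketch of if…else, wrapped in a hypothetical helper function classify_number() so that both branches can be exercised:

```r
classify_number <- function(n) {
  if (n %% 2 == 0) {
    result <- "even"
  } else {
    # Runs only when the condition above is FALSE
    result <- "odd"
  }
  result
}

print(classify_number(4))
print(classify_number(7))
```

In R the else keyword should sit on the same line as the closing brace of the if block, as in the syntax above.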
3. for Loop in R
A loop is a sequence of instructions that is repeated until a certain condition is reached. for,
while and repeat, with the additional clauses break and next, are used to construct loops.
A for loop is executed a known number of times: once for each element of a sequence. In a flowchart, it is drawn as
a rectangular ‘init’ box followed by a decision diamond. The body of the loop is a block contained within curly braces.
Example:
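A minimal sketch of a for loop that sums the elements of a vector; the names numbers, value and total are illustrative.

```r
numbers <- c(2, 4, 6, 8)
total <- 0

# The body runs once per element, so the number of iterations is known in advance
for (value in numbers) {
  total <- total + value
}
print(total)
```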
4. Nested Loop in R
A nested loop is a loop placed inside the body of another loop. The foreach package offers an alternative:
its foreach loop is similar to the standard for loop, which makes it easy to convert a for loop to a foreach
loop. Unlike many parallel programming packages for R, foreach doesn’t require the body
of the for loop to be turned into a function. Its %:% operator is called the nesting operator because it is
used to create nested foreach loops.
Example:
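A minimal sketch of a nested loop in base R (plain for loops rather than the foreach package, so that no extra package is needed). It fills a small multiplication table; the names rows, cols and products are illustrative.

```r
rows <- 3
cols <- 4
products <- matrix(0, nrow = rows, ncol = cols)

for (i in seq_len(rows)) {        # outer loop: one pass per row
  for (j in seq_len(cols)) {      # inner loop: runs fully for every value of i
    products[i, j] <- i * j
  }
}
print(products)
```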
5. while Loop in R
The format is while(cond) expr, where cond is the condition to test and expr is an
expression.
Any variable used in cond must be given a value before the loop starts. Otherwise R
complains about the missing value that was supposed to provide the required TRUE or FALSE,
because it does not know a variable such as ‘response’ before it is used in the loop. Initializing the variable first
also makes sense because, if the condition is already FALSE at the first test (for example, we answer right at the first attempt), the loop is not executed at all.
Example:
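A minimal sketch of a while loop that collects the squares of 1 to 5; the names count and squares are illustrative.

```r
# 'count' must exist before the loop, because the condition is tested first
count <- 1
squares <- c()

while (count <= 5) {
  squares <- c(squares, count^2)
  count <- count + 1   # without this update the loop would never end
}
print(squares)
```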
8. return Statement in R
Many times, we need a function to do some processing and return the result.
This is accomplished with the return() statement in R.
Syntax:
return(expression)
Example:
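A minimal sketch of return(), using a hypothetical function rectangle_area() that is not part of the original notes:

```r
# Illustrative function: returns the area of a rectangle
rectangle_area <- function(width, height) {
  if (width < 0 || height < 0) {
    return(NA)   # return() leaves the function immediately
  }
  return(width * height)
}

print(rectangle_area(3, 4))
print(rectangle_area(-1, 4))
```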
Suppose we have labeled images of cats and dogs and are given a new, unlabeled image.
o The KNN algorithm will classify it as either a cat or a dog depending on the
similarity in their features.
o So if the new image has pointy ears, it will classify that image as a cat
because it is similar to the cat images.
In this manner, the KNN algorithm classifies data points based on how similar they
are to their neighboring data points.
KNN is a Supervised Learning algorithm that uses a labeled input data set to predict
the output of the data points.
It is one of the simplest Machine Learning algorithms and it can be easily
implemented.
It is mainly based on feature similarity.
o KNN checks how similar a data point is to its neighbors and classifies the data
point into the class it is most similar to.
KNN is a non-parametric model, which means that it does not make any assumptions
about the underlying distribution of the data.
KNN is a lazy algorithm, which means that it memorizes the training data set instead
of learning a discriminative function from the training data.
KNN can be used for solving both classification and regression problems.
Consider two classes of data, namely Class A (squares) and Class
B (triangles).
The problem statement is to assign a new input data point to one of the two classes
by using the KNN algorithm.
The first step in the KNN algorithm is to define the value of ‘K’. ‘K’ stands for the
number of Nearest Neighbors, hence the name K Nearest Neighbors (KNN).
Suppose the value of ‘K’ is 3. This means that the algorithm will
consider the three neighbors that are the closest to the new data point in order to
decide the class of this new data point.
The closeness between the data points is calculated by using measures such as
Euclidean and Manhattan distance
At ‘K’ = 3, the neighbors include two squares and 1 triangle. So, if we were to
classify the new data point based on ‘K’ = 3, then it would be assigned to Class A
(squares).
But what if the ‘K’ value is set to 7? This means we are telling the algorithm to look
for the seven nearest neighbors and classify the new data point into the class it is
most similar to.
At ‘K’ = 7, the neighbors include three squares and four triangles. So, if we were to
classify the new data point based on ‘K’ = 7, then it would be assigned to Class B
(triangles), since the majority of its neighbors are of Class B.
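The voting procedure described above can be sketched in base R. The function knn_predict() and the training points are hypothetical; the points are placed so that the three nearest neighbors contain two squares and one triangle, and the seven nearest contain three squares and four triangles.

```r
# Majority vote among the k nearest training points (Euclidean distance)
knn_predict <- function(train_x, train_y, new_point, k) {
  diffs <- sweep(train_x, 2, new_point)   # subtract new_point from every row
  dists <- sqrt(rowSums(diffs^2))
  nearest <- order(dists)[seq_len(k)]     # indices of the k closest points
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]          # class with the most votes
}

# Hypothetical training data: class A = squares, class B = triangles
train_x <- rbind(c(1, 0), c(0, 2), c(6, 0), c(0, 10),    # class A
                 c(3, 0), c(0, 4), c(-5, 0), c(0, -7))   # class B
train_y <- c("A", "A", "A", "A", "B", "B", "B", "B")
new_point <- c(0, 0)

print(knn_predict(train_x, train_y, new_point, k = 3))
print(knn_predict(train_x, train_y, new_point, k = 7))
```

With these points the prediction is "A" for K = 3 and "B" for K = 7. This sketch breaks ties by taking the first class in alphabetical order; for real work, a tested implementation such as knn() from the class package is a better choice.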
In practice, there’s a lot more to consider while implementing the KNN algorithm.
KNN commonly uses Euclidean distance to measure the distance between a new
data point and its neighbors. Let’s see how.
Consider two points, P1 and P2. We’re going to measure the distance between them
by using the Euclidean distance measure.
The coordinates for P1 and P2 are (1,4) and (5,1) respectively.
The Euclidean distance can be calculated like so:
d(P1, P2) = sqrt((5 - 1)^2 + (1 - 4)^2) = sqrt(16 + 9) = sqrt(25) = 5
It is as simple as that! KNN makes use of simple measures in order to solve complex
problems, which is one of the reasons why KNN is such a commonly used algorithm.
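The distance between P1 = (1,4) and P2 = (5,1) can be checked with a short R sketch. The helper names euclidean_distance() and manhattan_distance() are illustrative; the Manhattan distance is included because it is the other measure mentioned earlier.

```r
# Straight-line distance: square root of the summed squared differences
euclidean_distance <- function(p, q) sqrt(sum((p - q)^2))

# Manhattan distance: sum of the absolute differences along each axis
manhattan_distance <- function(p, q) sum(abs(p - q))

p1 <- c(1, 4)
p2 <- c(5, 1)

print(euclidean_distance(p1, p2))   # sqrt(16 + 9) = 5
print(manhattan_distance(p1, p2))   # |1 - 5| + |4 - 1| = 7
```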
Formally, the training data consists of pairs (Xi, Ci), where Xi denotes the feature variables of the i-th data point, for i = 1, 2, ….., n, and
Ci denotes the output class of Xi for each i.
The condition Ci ∈ {1, 2, 3, ……, c} holds for all values of ‘i’, where the
total number of classes is denoted by ‘c’.
R has now been successfully installed on your Windows OS. Open the R GUI to start
writing R code.
Vectors are the basic R data objects and there are 6 types of atomic vectors. They can
be
Integer,
Logical,
Double,
Complex,
Character and
Raw
Creation of Vector
There are two types of vector creation:
Single Element Vector
Multiple Elements Vector
Single Element Vector
Whenever a single value is written in R, it becomes a vector of length 1 and fits into one of
the above vector types.
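A minimal sketch of both kinds of vector creation; the values and the names fruits and one_to_five are illustrative.

```r
# Single element vectors: each value is a vector of length 1
print(class("abc"))               # "character"
print(class(12.5))                # "numeric" (stored as double)
print(class(63L))                 # "integer"
print(class(TRUE))                # "logical"
print(class(2 + 3i))              # "complex"
print(class(charToRaw("hello")))  # "raw"

# Multiple elements vectors: combine values with c() or build a sequence with :
fruits <- c("apple", "banana", "cherry")
one_to_five <- 1:5
print(length(fruits))
print(one_to_five)
```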
Lists
Lists are R objects that can contain elements of different types, such as numbers, strings, vectors, and even another list or a matrix.
Creating a List
Example to create a list containing numbers, strings, vectors, and logical values.
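A minimal sketch of creating a list; the values and the names my_list and person are made up for illustration.

```r
# A list can mix types: number, string, vector and logical value
my_list <- list(42, "Hello", c(1, 2, 3), TRUE)
print(my_list)

# Elements can also be named and then accessed with $ or [[ ]]
person <- list(name = "Asha", scores = c(90, 85), active = TRUE)
print(person$name)
print(person[["scores"]])
```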
Matrices
Matrices are R objects in which the elements are organized in a 2-D rectangular layout.
A matrix contains elements of the same atomic type.
A matrix is created with the matrix() function.
Syntax
matrix(data, nrow, ncol, byrow, dimnames)
data is the input vector whose elements become the matrix entries,
nrow is the number of rows and
ncol is the number of columns to be created;
byrow is a logical value: if TRUE, the matrix is filled row by row (the default, FALSE, fills it column by column), and dimnames gives the names of the rows and
columns.
You can access the items by using [ ] brackets. In an index such as [1, 2], the first number "1"
specifies the row position, while the second number "2" specifies the column position.
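A minimal sketch of creating a matrix and accessing its items; the matrix m and its row and column names are illustrative.

```r
# Fill a 2 x 3 matrix row by row and name its rows and columns
m <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE,
            dimnames = list(c("r1", "r2"), c("c1", "c2", "c3")))
print(m)

# Row 1, column 2
print(m[1, 2])

# Leaving a position empty selects the whole row or column
print(m[2, ])
print(m[, 3])
```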
Data Frames
A data frame is data displayed in a table format.
A data frame can hold different types of data. While the first column can
be character, the second and third can be numeric or logical. However, all values within a single column must
have the same type.
Use the data.frame() function to create a data frame:
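A minimal sketch with made-up data; the data frame students and its columns are illustrative.

```r
# Each column has its own type, but all values inside a column share that type
students <- data.frame(
  name   = c("Ana", "Ben", "Cara"),
  marks  = c(82.5, 91, 77),
  passed = c(TRUE, TRUE, FALSE)
)
print(students)
str(students)                        # shows the type of every column
print(students$marks)                # access one column
print(students[students$passed, ])   # keep only rows where passed is TRUE
```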
Tables
Another common way to store information is in a table. First let us see how to create a one-way
table. One way to create a table is to use the table() command. The argument it takes is a
vector of factors, and it calculates the frequency with which each factor level occurs.
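A minimal sketch of a one-way table; the factor values and the names answers and counts are made up for illustration.

```r
# A factor (categorical) vector
answers <- factor(c("yes", "no", "yes", "yes", "no", "maybe"))

# One-way table: how often each level occurs
counts <- table(answers)
print(counts)

# Frequency of a single level
print(counts[["yes"]])
```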
Logistic regression is used to describe data and to explain the relationship between
one dependent binary variable and one or more nominal, ordinal, interval or ratio-level
independent variables.
Logistic regression is a statistical analysis method used when the dependent
variable is dichotomous, or binary.
o This just means a variable that has only two possible outcomes.
How logistic regression squeezes the output of linear regression between 0 and 1:
Let’s start by mentioning the formula of the logistic function, writing the intercept as β0 and the slope as β1:
P = 1 / (1 + e^(-(β0 + β1x)))
We all know the equation of the best fit line in linear regression is:
y = β0 + β1x
If we simply replace y with the probability P, the right-hand side can take any real value, while a probability must lie between 0 and 1.
To overcome this issue we take the “odds” of P (the odds are defined as the probability
that the event will occur divided by the probability that the event will not occur):
P / (1 - P) = β0 + β1x
Odds are always positive, which means their range is (0, +∞).
The problem here is that the range is still restricted, and
it is difficult to model a variable that has a restricted range. To control this we take
the log of the odds, which has a range of (-∞, +∞):
log(P / (1 - P)) = β0 + β1x
Now we just want a function of P, because we want to predict the probability, not the log
of the odds.
To do so we take the exponential of both sides and then solve for P:
P / (1 - P) = e^(β0 + β1x)
P = e^(β0 + β1x) / (1 + e^(β0 + β1x)) = 1 / (1 + e^(-(β0 + β1x)))
Now we have our logistic function, also called a sigmoid function.
The graph of a sigmoid function is an S-shaped curve: it squeezes a straight line into an
S-curve with values between 0 and 1.
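The squeezing effect can be checked numerically with a short R sketch. The coefficients beta0 and beta1 and the inputs x are illustrative values, not estimates from any data, and sigmoid() is a helper defined here.

```r
# Sigmoid (logistic) function: maps any real number into the interval (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

# Illustrative coefficients, not estimated from any data
beta0 <- -1
beta1 <- 0.5
x <- c(-10, 0, 2, 10)

z <- beta0 + beta1 * x   # linear predictor, can be any real number
p <- sigmoid(z)          # probability, always between 0 and 1
print(round(p, 4))

# Taking the log of the odds recovers the linear predictor
log_odds <- log(p / (1 - p))
print(all.equal(log_odds, z))
```

Base R also provides plogis(), which computes the same function. To estimate the coefficients from data, glm() with family = binomial is the usual tool.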