Big Data Assignments Answer
Assignment No.1
Big data is defined as large data sets that can be analyzed to find
patterns, trends, and associations. Companies use big data to gain a
competitive advantage. For example, Amazon uses big data to
personalize the shopping experience, optimize its supply chain, and
develop new products. Logistics companies use big data to streamline
their operations, for example by tracking warehouse stock levels and
monitoring traffic reports.
Assignment No.2
Ans. Big Data has a wide range of applications across various industries and
sectors. Some of the key applications of Big Data include:
Overall, Big Data has the potential to transform industries and drive
innovation by providing valuable insights, improving decision-making, and
enhancing operational efficiency.
Assignment No. 3
Advantages :
1. High-dimensional data: SVMs are effective in high-dimensional
spaces, even when the number of features exceeds the number of samples.
2. Generalization: SVMs have good generalization capabilities and are
less likely to overfit.
3. Kernel trick: SVMs can solve complex problems with the right kernel
function.
4. Memory efficiency: SVMs use a subset of training points, called
support vectors, in the decision function.
5. Versatility: SVMs can use different kernel functions, and you can
specify custom kernels.
Disadvantages :
1. Noise and outliers: SVMs are sensitive to noise and outliers, which
can affect the model’s boundary.
2. Big data: SVMs can struggle with very large datasets, since training
time grows quickly with the number of samples.
3. Nonlinear SVM: Nonlinear SVMs can be slow to train and to tune.
4. Runtime and memory: SVMs can have increased runtime and
memory requirements.
5. Traditional SVMs: Traditional SVMs may not fully use training data,
which can lead to loss of information and local accuracy issues.
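These trade-offs can be illustrated with a short sketch using the e1071 package (an assumption: e1071 is a third-party package providing an svm() function and must be installed separately; the iris subset is chosen only for illustration):

```r
# Sketch: SVMs on a two-class subset of the built-in iris data,
# using the e1071 package (assumed installed via install.packages("e1071"))
library(e1071)

# Drop one species so this is a simple two-class problem
iris2 <- droplevels(subset(iris, Species != "setosa"))

# Linear-kernel SVM; only the support vectors enter the decision
# function, which is the memory-efficiency point above
model <- svm(Species ~ ., data = iris2, kernel = "linear")
nrow(model$SV)  # number of support vectors, a subset of the 100 rows

# Kernel trick: switching to a nonlinear decision boundary is one argument
model_rbf <- svm(Species ~ ., data = iris2, kernel = "radial")
```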
Regression models that involve one explanatory variable are called Simple
Regression. For example, the relationship between crop yields and rainfall.
Regression models are also divided into linear and nonlinear models,
depending on whether the relationship between the response and
explanatory variables is linear or nonlinear.
Y = a + bx
The slope b of the best-fitting line, based on the least squares criterion, can
be shown to be
b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²
where the summation is over all n pairs of (x, y) values. The value of a, the
y-intercept, can in turn be shown to be a function of b, x̄ and ȳ, i.e.
a = ȳ - b x̄
Fig. 2.8: Regression model: y populations with equal σ² and μ
Y = a + bx + e, where e is the random error term.
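As a quick check, the least-squares formulas for b and a can be computed directly in R and compared with the built-in lm() function (the rainfall/yield numbers here are made up purely for illustration):

```r
# Illustrative data: rainfall (x) vs. crop yield (y); values are made up
x <- c(10, 20, 30, 40, 50)
y <- c(12, 24, 33, 41, 55)

# Least-squares slope and intercept from the formulas above
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a <- mean(y) - b * mean(x)

# Compare with R's built-in simple regression fit
fit <- lm(y ~ x)
c(a = a, b = b)
coef(fit)  # intercept and slope should match a and b
```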
Naïve Bayes model is easy to build and particularly useful for very large data
sets. Along with simplicity, Naïve Bayes is known to outperform even highly
sophisticated classification methods. The principle behind Naïve Bayes is the
Bayes theorem, also known as the Bayes Rule. It is a probabilistic classifier,
which means it predicts on the basis of the probability that an object
belongs to a particular class.
Applications:
1. It is used in medical data classification.
2. It is used for Credit Scoring.
3. It can be used in real-time predictions because Naïve Bayes Classifier
is an eager learner.
4. It is used in Text classification such as Spam filtering and Sentiment
analysis.
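The Bayes Rule behind such a classifier can be sketched with a single-word spam-filter calculation (all the probabilities below are made-up numbers for illustration):

```r
# Made-up numbers: 20% of mail is spam; the word "free" appears in
# 60% of spam and 5% of legitimate mail
p_spam            <- 0.2
p_word_given_spam <- 0.6
p_word_given_ham  <- 0.05

# Total probability of seeing the word "free" in a message
p_word <- p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes Rule: posterior probability the message is spam given "free"
p_spam_given_word <- p_word_given_spam * p_spam / p_word
p_spam_given_word  # 0.75
```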
Assignment No. 4
For example, let’s create a simple scatter plot in R using the built-in “mtcars”
dataset. This dataset contains information about various car models,
including their miles per gallon (mpg) and horsepower (hp) ratings.
data(mtcars)
plot(mtcars$hp, mtcars$mpg, xlab = "Horsepower", ylab = "Miles per Gallon")
This code will generate a scatter plot with horsepower on the x-axis and
miles per gallon on the y-axis, showing the relationship between these two
variables for each car model in the dataset. This visualization can help us
understand how horsepower affects fuel efficiency in different cars.
Advantages of EM algorithm :
(1) It is always guaranteed that the likelihood will increase with each
iteration.
(2) The E-step and M-step are often pretty easy to implement for many
problems.
(3) Solutions to the M-step often exist in closed form.
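The E-step and M-step can be sketched for a simple two-component Gaussian mixture in one dimension (a minimal illustration: both variances are fixed at 1, and the data and starting values are made up):

```r
# Minimal EM sketch for a two-component 1-D Gaussian mixture
set.seed(1)
x <- c(rnorm(100, mean = 0), rnorm(100, mean = 4))  # two clusters

mu <- c(-1, 1)  # initial guesses for the component means
pi1 <- 0.5      # initial mixing weight of component 1
for (iter in 1:50) {
  # E-step: responsibility of component 1 for each point
  d1 <- pi1 * dnorm(x, mean = mu[1])
  d2 <- (1 - pi1) * dnorm(x, mean = mu[2])
  r <- d1 / (d1 + d2)
  # M-step: closed-form updates (advantage (3) above)
  mu <- c(sum(r * x) / sum(r), sum((1 - r) * x) / sum(1 - r))
  pi1 <- mean(r)
}
round(mu, 2)  # estimates near the true means 0 and 4
```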
Disadvantages of EM algorithm :
1. head() : This function returns the first n rows of a matrix or data frame.
Syntax:
head(df)
head(df,n=number)
2. tail() : This function returns the last n rows of a matrix or data frame.
Syntax:
tail(df)
tail(df,n=number)
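For example, on a small data frame built from the built-in mtcars dataset:

```r
# A small data frame from the built-in mtcars dataset
df <- data.frame(model = rownames(mtcars), mpg = mtcars$mpg)

head(df)         # first 6 rows (the default n)
head(df, n = 3)  # first 3 rows
tail(df, n = 2)  # last 2 rows
```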
3. select() : It is used to select data by its column name. We can select any
number of columns in a number of ways.
4. filter() : It is used to find rows with matching criteria. Like select(), we
pass a data frame along with a condition, separated by a comma.
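Both functions come from the dplyr package (an assumption: dplyr must be installed separately), for example:

```r
library(dplyr)  # provides select() and filter()

# Keep only the mpg and hp columns of the built-in mtcars data
cars <- select(mtcars, mpg, hp)

# Keep only rows matching a condition: cars with more than 150 horsepower
fast <- filter(cars, hp > 150)
head(fast)
```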
Mutating joins combine variables from the two data frames x and y:
• inner_join() : return all rows from x where there are matching values in y,
and all columns from x and y. If there are multiple matches between x and y,
all combinations of the matches are returned.
• left_join() : return all rows from x, and all columns from x and y. Rows in x
with no match in y will have NA values in the new columns. If there are
multiple matches between x and y, all combinations of the matches are
returned.
• right_join() : return all rows from y, and all columns from x and y. Rows in y
with no match in x will have NA values in the new columns. If there are
multiple matches between x and y, all combinations of the matches are
returned.
• full_join() : return all rows and all columns from both x and y. Where there
are no matching values, NA is returned for the missing side.
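The four mutating joins can be sketched on two small made-up data frames (dplyr assumed installed):

```r
library(dplyr)  # provides the *_join() functions

x <- data.frame(id = c(1, 2, 3), name = c("a", "b", "c"))
y <- data.frame(id = c(2, 3, 4), score = c(10, 20, 30))

inner_join(x, y, by = "id")  # ids 2 and 3 only: rows matching in both
left_join(x, y, by = "id")   # all of x; score is NA for id 1
right_join(x, y, by = "id")  # all of y; name is NA for id 4
full_join(x, y, by = "id")   # ids 1 through 4, NA where unmatched
```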
There are many different types of probability distributions that can be used
to model various scenarios, such as the normal distribution, binomial
distribution, Poisson distribution, and exponential distribution. Each
distribution has its own set of parameters that determine the shape and
characteristics of the distribution.
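Base R ships density and mass functions for each of these distributions, and the parameters mentioned above appear directly as function arguments, for example:

```r
# Evaluating common distributions at a point with base R functions
dnorm(0, mean = 0, sd = 1)        # standard normal density at 0
dbinom(3, size = 10, prob = 0.5)  # P(X = 3) under Binomial(10, 0.5)
dpois(2, lambda = 4)              # P(X = 2) under Poisson(4)
dexp(1, rate = 1)                 # exponential density at 1
```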