Big Data Assignments Answer

The document discusses various aspects of big data, including types of data analytics (descriptive, diagnostic, predictive, prescriptive, text, and spatial), sources of big data, advantages and disadvantages of machine learning, and applications of big data across industries. It also covers the data analysis process, the life cycle of data analytics, and specific algorithms like Support Vector Machines and Naïve Bayes. Additionally, it touches on data visualization using R and the functions in the 'dplyr' package.


Subject: Big Data

Assignment No. 1

Q.1) Explain different types of data analytics?

Ans. 1. Descriptive Analytics: Descriptive analytics focuses on summarizing historical data to understand what has happened in the past. It involves analyzing data to gain insights into patterns, trends, and relationships within the data.

2. Diagnostic Analytics: Diagnostic analytics aims to understand why certain events occurred in the past by identifying the root causes of specific outcomes. It involves digging deeper into the data to uncover the relationships and correlations that explain those outcomes.

3. Predictive Analytics: Predictive analytics uses statistical algorithms and machine learning techniques to forecast future outcomes based on historical data. It involves building models that can predict future trends and behaviors, allowing organizations to make informed decisions and take proactive actions.

4. Prescriptive Analytics: Prescriptive analytics goes beyond predicting future outcomes by recommending actions to achieve desired outcomes. It involves using optimization and simulation techniques to determine the best course of action to achieve specific goals.

5. Text Analytics: Text analytics involves analyzing unstructured text data, such as social media posts, customer reviews, and emails, to extract insights and patterns. It uses natural language processing and machine learning techniques to analyze and interpret text data.

6. Spatial Analytics: Spatial analytics involves analyzing geographic data to understand spatial relationships and patterns. It is commonly used in fields such as urban planning, logistics, and environmental science to analyze spatial data and make informed decisions based on location-based insights.
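
As a quick illustration of the difference between descriptive and predictive analytics, the short R sketch below (using the built-in mtcars dataset; the chosen variables are only an example) first summarizes historical data and then fits a simple model to forecast a new case.

# Descriptive analytics: summarize what has already happened
data(mtcars)
summary(mtcars$mpg)

# Predictive analytics: fit a model on historical data and forecast a new case
fit <- lm(mpg ~ wt + hp, data = mtcars)
predict(fit, newdata = data.frame(wt = 3.0, hp = 150))
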
Q.2) Differentiate between population and sample.

Q.3) Which are the Big Data Sources with example?

Ans. Big data can come from many sources, including:

 Transactional systems: Banks and stock markets are major sources of big data. For example, banking systems generate structured data like receipts and payments, which include information like the amount, date, and source of money.
 Internet: Big data can come from internet clickstream logs and social
networks.
 Machine-generated data: This includes sensor data from industrial
equipment, manufacturing machines, and internet of things devices, as
well as network and server log files.
 External data: Big data environments can include external data on
consumers, financial markets, weather, traffic conditions, geographic
information, and scientific research.
 Images, videos, and audio files: These are also forms of big data.

Big data is defined as large data sets that can be analyzed to find
patterns, trends, and associations. Companies use big data to gain a
competitive advantage. For example, Amazon uses big data to
personalize the shopping experience, optimize its supply chain, and
develop new products. Logistics companies use big data to streamline
their operations, such as tracking warehouse stock levels and traffic
reports.

Q.4) Give advantages and disadvantages of Machine Learning?

Ans. Advantages of Machine Learning:

1. Efficiency: Machine learning algorithms can process large amounts of data quickly and accurately, making them more efficient than traditional methods.

2. Personalization: Machine learning can be used to personalize user experiences, such as recommending products or services based on past behavior.

3. Automation: Machine learning can automate repetitive tasks, saving time and resources for businesses.

4. Scalability: Machine learning algorithms can scale to handle large amounts of data without sacrificing performance.

5. Improved decision-making: Machine learning can analyze data and provide insights that can help businesses make better decisions.

Disadvantages of Machine Learning:

1. Data dependency: Machine learning algorithms require large amounts of high-quality data to function properly, which can be a challenge for some businesses.
2. Lack of transparency: Some machine learning algorithms are black boxes,
making it difficult to understand how they arrive at their decisions.

3. Bias: Machine learning algorithms can perpetuate biases present in the data used to train them, leading to unfair or discriminatory outcomes.

4. Overfitting: Machine learning models can sometimes be too complex and perform well on training data but poorly on new, unseen data (see the sketch after this list).

5. Security risks: Machine learning systems can be vulnerable to attacks, such as adversarial attacks or data poisoning, which can compromise their performance and reliability.
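
To make the overfitting point (disadvantage 4) concrete, here is a small self-contained R sketch; the data are simulated and the polynomial degrees are arbitrary choices for illustration. The overly flexible model fits the training data closely but does worse on held-out data.

# Simulate data and hold out a test set
set.seed(1)
n <- 60
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.3)
df <- data.frame(x, y)
train <- sample(n, 40)
test <- setdiff(seq_len(n), train)

# A modest model vs. an overly flexible one, both fit on the training rows
simple <- lm(y ~ poly(x, 3), data = df, subset = train)
complex <- lm(y ~ poly(x, 15), data = df, subset = train)

# Root-mean-squared error on the unseen test rows
rmse <- function(model) sqrt(mean((df$y[test] - predict(model, df[test, ]))^2))
c(simple = rmse(simple), complex = rmse(complex))
# The degree-15 model typically shows the larger test error
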

Assignment No. 2

Q.1) Explain applications of Big Data.

Ans. Big Data has a wide range of applications across various industries and sectors. Some of the key applications of Big Data include:

1. Business Intelligence: Big Data is used by businesses to gain insights and make data-driven decisions. It helps in analyzing customer behavior, market trends, and competitor strategies to improve business performance.
2. Healthcare: Big Data is used in healthcare to analyze patient data,
medical records, and clinical trials to improve patient outcomes,
personalize treatments, and reduce healthcare costs.
3. Finance: Big Data is used in the finance industry for fraud detection,
risk management, and predictive analytics. It helps financial
institutions to identify and prevent fraudulent activities, assess credit
risks, and make informed investment decisions.
4. Marketing and Advertising: Big Data is used in marketing and
advertising to analyze customer preferences, target specific audiences,
and measure the effectiveness of marketing campaigns. It helps
businesses to optimize their marketing strategies and improve
customer engagement.
5. Manufacturing: Big Data is used in manufacturing to monitor and
optimize production processes, predict equipment failures, and
improve supply chain management. It helps manufacturers to increase
efficiency, reduce downtime, and minimize costs.
6. Transportation and Logistics: Big Data is used in transportation and
logistics to optimize route planning, track shipments, and improve fleet
management. It helps companies to reduce transportation costs,
improve delivery times, and enhance customer satisfaction.
7. Agriculture: Big Data is used in agriculture to monitor crop health,
optimize irrigation, and predict crop yields. It helps farmers to make
informed decisions, increase productivity, and reduce environmental
impact.

Overall, Big Data has the potential to transform industries and drive
innovation by providing valuable insights, improving decision-making, and
enhancing operational efficiency.

Q.2) Explain the process of data analysis.

Ans. The data analysis process is a systematic way to investigate data to answer specific questions or explore particular topics. It involves a number of steps, including:

 Defining objectives: Define the objectives of the analysis and the questions that will be answered.
 Collecting data: Gather relevant data from credible sources.
 Cleaning data: Remove errors and duplicates, reconcile
inconsistencies, and standardize the data structure and format.
 Analyzing data: Use appropriate statistical methods or tools to analyze
the data.
 Interpreting and visualizing data: Interpret the results and visualize
them in a way that is easy to understand.
 Communicating findings: Communicate the findings effectively through
visualizations or reports.

Data analysis is a crucial part of decision making, problem solving, and innovation. It can help identify trends, patterns, and meaningful insights that can help solve the original problem.
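
As a rough sketch of these steps in R (the data frame, column names, and cleaning rules below are invented purely for illustration), a minimal end-to-end analysis might look like this:

# Collect: assemble (or read in) the raw data
sales <- data.frame(
  region = c("North", "South", "South", "North", NA),
  amount = c(120, 95, 95, NA, 80)
)

# Clean: remove rows with missing values and duplicate rows
clean <- unique(na.omit(sales))

# Analyze: compute a simple summary statistic per region
result <- aggregate(amount ~ region, data = clean, FUN = mean)

# Interpret and visualize: inspect and plot the result
print(result)
barplot(result$amount, names.arg = result$region, ylab = "Mean amount")
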

Q.3) Explain life cycle of data analytics?


Ans.

 Phase 1: Discovery: In Phase 1, the team learns the business domain, including relevant history such as whether the organization or business unit has attempted similar projects in the past from which it can learn. The team assesses the resources available to support the project in terms of people, technology, time, and data. Important activities in this phase include framing the business problem as an analytics challenge that can be addressed in subsequent phases and formulating initial hypotheses (IHs) to test and begin learning the data.
 Phase 2: Data preparation: Phase 2 requires the presence of an
analytic sandbox, in which the team can work with data and perform
analytics for the duration of the project. The team needs to execute
Extract, Load, and Transform (ELT) or Extract, Transform and Load
(ETL) to get data into the sandbox. The ELT and ETL are sometimes
abbreviated as ETLT. Data should be transformed in the ETLT process
so the team can work with it and analyze it. In this phase, the team
also needs to familiarize itself with the data thoroughly and take steps
to condition the data (a small ETL sketch in R is given after this list).
 Phase 3: Model planning: Phase 3 is model planning, where the team
determines the methods, techniques, and workflow it intends to follow
for the subsequent model building phase. The team explores the data
to learn about the relationships between variables and subsequently
selects key variables and the most suitable models.
 Phase 4: Model building: In Phase 4, the team develops datasets for
testing, training, and production purposes. In addition, in this phase
the team builds and executes models based on the work done in the
model planning phase. The team also considers whether its existing
tools will suffice for running the models, or if it will need a more robust
environment for executing models and workflows (for example, fast
hardware and parallel processing, if applicable).
 Phase 5: Communicate results: In Phase 5, the team, in collaboration
with major stakeholders, determines if the results of the project are a
success or a failure based on the criteria developed in Phase 1. The
team should identify key findings, quantify the business value, and
develop a narrative to summarize and convey findings to stakeholders.
 Phase 6: Operationalize: In Phase 6, the team delivers final reports, briefings, code, and technical documents. In addition, the team may run a pilot project to implement the models in a production environment.
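
As a very rough sketch of the ETL step described in Phase 2 (moving data into an analytic sandbox), the R snippet below extracts data from a source export, transforms (conditions) it, and loads it into a working data frame; the file name and column names are hypothetical.

# Extract: read raw data exported from a source system (hypothetical file)
raw <- read.csv("customer_transactions.csv", stringsAsFactors = FALSE)

# Transform: condition the data so the team can work with it
raw$txn_date <- as.Date(raw$txn_date)
raw <- raw[!is.na(raw$amount), ]

# Load: keep the conditioned copy as the sandbox working set
sandbox <- raw
summary(sandbox)
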

Assignment No. 3

Q.1) State advantages and disadvantages of SVM.

Ans. Support Vector Machines (SVMs) have several advantages and disadvantages, including:

 Advantages:
1. High-dimensional data: SVMs are effective in high-dimensional feature spaces, even when the number of features is large relative to the number of samples.
2. Generalization: SVMs have good generalization capabilities and are
less likely to overfit.
3. Kernel trick: SVMs can solve complex problems with the right kernel
function.
4. Memory efficiency: SVMs use a subset of training points, called
support vectors, in the decision function.
5. Versatility: SVMs can use different kernel functions, and you can
specify custom kernels.

 Disadvantages:
1. Noise and outliers: SVMs are sensitive to noise and outliers, which
can affect the model’s boundary.
2. Big data: SVMs can have difficulty with large amounts of data.
3. Nonlinear SVM: Nonlinear SVMs can be slow.
4. Runtime and memory: SVMs can have increased runtime and
memory requirements.
5. Traditional SVMs: Traditional SVMs may not fully use training data,
which can lead to loss of information and local accuracy issues.
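
For reference, here is a minimal sketch of training an SVM in R with the e1071 package (assuming it is installed), using the built-in iris data and the radial kernel mentioned under the kernel trick.

library(e1071)

# Train an SVM classifier with a radial (RBF) kernel
model <- svm(Species ~ ., data = iris, kernel = "radial")

# Predict a few rows and inspect the fitted model (including the support vectors)
predict(model, iris[1:5, ])
summary(model)
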

Q.2) Compare supervised and unsupervised machine learning.


Q.3) Explain different types of regression models.

Ans. The two basic types of regression analysis are:

(1) Simple Regression Analysis:

It is used to estimate the relationship between a dependent variable and a single independent variable. Regression models that involve one explanatory variable are called Simple Regression. For example, the relationship between crop yields and rainfall.

(2) Multiple Regression Analysis:

It is used to estimate the relationship between a dependent variable and two or more independent variables. When two or more explanatory variables are involved, the relationship is called Multiple Regression. For example, the relationship between the salaries of employees and their experience and education. Multiple regression analysis introduces several additional complexities but may produce more realistic results than simple regression analysis.

Regression models are also divided into linear and nonlinear models,
depending on whether the relationship between the response and
explanatory variables is linear or nonlinear.

In a simple linear regression, there are two variables, x and y, where y depends on (or is influenced by) x. Here y is called the dependent or criterion variable and x the independent or predictor variable. The major assumption of a linear regression model is that the relationship between the independent and dependent variables is linear.

The regression line of y on x is expressed as:

Y = a + bx

where a is a constant (the y-intercept) and b is the regression coefficient (the slope); a and b are the two regression parameters. While there are a number of possible criteria for choosing a best-fitting line, one of the most useful is the least squares criterion.

The slope b of the best-fitting line, based on the least squares criterion, can be shown to be

b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²

where the summation is over all n pairs of (x, y) values. The value of a, the y-intercept, can in turn be shown to be a function of b, x̄ and ȳ, i.e.

a = ȳ − b·x̄

Fig. 2.8: Regression model (y populations with equal σ² and μ)

Based on the above, the individual observations of y are:

Y = a + bx + e
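
A minimal sketch in R, fitting the same line with lm() on the built-in cars dataset (speed as x, stopping distance as y) and checking it against the least squares formulas above:

# Fit the regression line Y = a + bx with lm()
data(cars)
fit <- lm(dist ~ speed, data = cars)
coef(fit)          # a (intercept) and b (slope)

# The same slope and intercept computed from the least squares formulas
x <- cars$speed; y <- cars$dist
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a <- mean(y) - b * mean(x)
c(a = a, b = b)    # matches coef(fit)
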

Q.4) Explain naïve bayes with the help of example.

Ans. Naïve Bayes algorithm is a Supervised Learning Algorithm. It is a classification technique based on Bayes' Theorem with an assumption of independence among predictors. In simple terms, a Naïve Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.

For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, all of these properties independently contribute to the probability that this fruit is an apple, and that is why it is known as 'Naïve'.

Naïve Bayes model is easy to build and particularly useful for very large data
sets. Along with simplicity, Naïve Bayes is known to outperform even highly
sophisticated classification methods. The principle behind Naïve Bayes is the
Bayes theorem also known as the Bayes Rule. It is a probabilistic classifier,
which means it predicts on the basis of the probability of an object.
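
A minimal sketch in R using the naiveBayes() function from the e1071 package (assuming it is installed) on the built-in iris data:

library(e1071)

# Train a Naïve Bayes classifier: each feature contributes independently
# to the probability of each class
model <- naiveBayes(Species ~ ., data = iris)

# Predict the class of a few observations
predict(model, iris[1:5, ])
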

 Applications:
1. It is used in medical data classification.
2. It is used for Credit Scoring.
3. It can be used in real-time predictions because Naïve Bayes Classifier
is an eager learner.
4. It is used in Text classification such as Spam filtering and Sentiment
analysis.

Assignment No. 4

Q1) What is data visualization? Explain with example in R?

Ans. Data visualization is the graphical representation of data to help users understand and interpret patterns, trends, and relationships within the data. It involves creating visualizations such as charts, graphs, and maps to make complex data more accessible and easier to comprehend.
One popular tool for data visualization is R, a programming language and
software environment for statistical computing and graphics. R provides a
wide range of packages and functions for creating various types of
visualizations.

For example, let’s create a simple scatter plot in R using the built-in “mtcars”
dataset. This dataset contains information about various car models,
including their miles per gallon (mpg) and horsepower (hp) ratings.

# Load the mtcars dataset

data(mtcars)

# Create a scatter plot of mpg vs. hp

plot(mtcars$hp, mtcars$mpg, xlab = "Horsepower", ylab = "Miles per Gallon", main = "Scatter Plot of MPG vs. HP")

This code will generate a scatter plot with horsepower on the x-axis and
miles per gallon on the y-axis, showing the relationship between these two
variables for each car model in the dataset. This visualization can help us
understand how horsepower affects fuel efficiency in different cars.

Q.2) Advantages and disadvantages of EM algorithm.

Ans. Advantages of EM algorithm:

(1) It is always guaranteed that the likelihood will increase with each iteration.
(2) The E-step and M-step are often quite easy to implement for many problems.
(3) Solutions to the M-step often exist in closed form.

Disadvantages of EM algorithm:

(1) It has slow convergence.
(2) It converges to a local optimum only.
(3) It requires both the forward and backward probabilities (numerical optimization requires only the forward probability).
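
To illustrate the E-step and M-step, here is a minimal hand-rolled R sketch of EM for a two-component, one-dimensional Gaussian mixture; the data and starting values are made up, and packages such as mclust provide production implementations.

# Simulated data from two Gaussian clusters
set.seed(42)
x <- c(rnorm(100, mean = 0, sd = 1), rnorm(100, mean = 5, sd = 1))

# Initial guesses for means, standard deviations, and mixing weights
mu <- c(-1, 6); sigma <- c(1, 1); w <- c(0.5, 0.5)

for (iter in 1:50) {
  # E-step: responsibility of each component for each point
  d1 <- w[1] * dnorm(x, mu[1], sigma[1])
  d2 <- w[2] * dnorm(x, mu[2], sigma[2])
  r1 <- d1 / (d1 + d2)
  r2 <- 1 - r1

  # M-step: closed-form parameter updates
  mu    <- c(sum(r1 * x) / sum(r1), sum(r2 * x) / sum(r2))
  sigma <- c(sqrt(sum(r1 * (x - mu[1])^2) / sum(r1)),
             sqrt(sum(r2 * (x - mu[2])^2) / sum(r2)))
  w     <- c(mean(r1), mean(r2))
}

list(mu = mu, sigma = sigma, weights = w)   # estimated means, sds, and mixing weights
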

Q.3) Explain functions included in “dplyr” package.

Ans. Functions included in the "dplyr" package

Following are some of the important functions included in the dplyr package:

1. head(): This function (from base R, commonly used alongside dplyr) returns the first n rows of a matrix or data frame.

Syntax:

head(df)

head(df, n = number)

where df is a data frame and n is the number of rows.

2. tail(): This function (also from base R) returns the last n rows of a matrix or data frame.

Syntax:

tail(df)

tail(df, n = number)

where df is a data frame and n is the number of rows.

3. select(): It is used to select columns by name. We can select any number of columns in a number of ways.

4. filter(): It is used to find rows with matching criteria. It is used similarly to select(), i.e., we pass a data frame along with a condition separated by a comma.

5. mutate(): It is used to create new columns while preserving the existing columns in a dataset. It is useful for creating attributes that are functions of other attributes in the dataset.

6. arrange(): It is used to sort rows by variables in either ascending or descending order.

7. summarise(): It is used to find insights (mean, median, mode, etc.) from a dataset. It aggregates multiple values into a single value. It is most often used with the group_by() function, and the output has one row per group.

8. group_by(): This function is used to group observations within a dataset by one or more variables. Most data operations are performed on groups defined by variables. (A short sketch combining these verbs with the join functions is given at the end of this answer.)

9. Join functions: These are used to join two data frames.

Currently dplyr supports four types of mutating joins, two types of filtering joins, and a nesting join.

 Mutating joins combine variables from the two data frames x and y:
• inner_join(): return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned.
• left_join(): return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
• right_join(): return all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
• full_join(): return all rows and all columns from both x and y. Where there are no matching values, NA is returned for the missing side.

 Filtering joins keep cases from the left-hand data frame:

• semi_join(): return all rows from x where there are matching values in y, keeping just the columns from x. A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, whereas a semi join will never duplicate rows of x.
• anti_join(): return all rows from x where there are no matching values in y, keeping just the columns from x.

 Nesting joins create a list column of data frames:

• nest_join(): return all rows and all columns from x. It adds a list column of tibbles; each tibble contains all the rows from y that match that row of x. When there is no match, the list column is a 0-row tibble with the same column names and types as y.
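
As a combined sketch of the verbs and joins described above (assuming the dplyr package is installed; the data, columns, and thresholds are invented for illustration):

library(dplyr)

# The main single-table verbs on the built-in mtcars data
mtcars %>%
  select(mpg, cyl, wt, hp) %>%              # keep only some columns
  filter(hp > 100) %>%                      # keep rows matching a condition
  mutate(power_to_weight = hp / wt) %>%     # add a new column
  arrange(desc(mpg)) %>%                    # sort rows
  group_by(cyl) %>%                         # group observations
  summarise(mean_mpg = mean(mpg), n = n())  # one row per group

# The join types on two small invented data frames
orders    <- data.frame(id = c(1, 2, 3), item = c("pen", "book", "lamp"))
customers <- data.frame(id = c(2, 3, 4), name = c("Asha", "Ravi", "Meera"))

inner_join(orders, customers, by = "id")  # only ids present in both
left_join(orders, customers, by = "id")   # all orders, NA where no customer matches
semi_join(orders, customers, by = "id")   # orders with a match, orders' columns only
anti_join(orders, customers, by = "id")   # orders with no matching customer
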

Q.4) Explain probability distribution modeling.

Ans. Probability distribution modeling is a statistical technique used to describe the likelihood of various outcomes in a given situation. It involves using mathematical functions to represent the probability of different events occurring within a specific range or set of values.

There are many different types of probability distributions that can be used
to model various scenarios, such as the normal distribution, binomial
distribution, Poisson distribution, and exponential distribution. Each
distribution has its own set of parameters that determine the shape and
characteristics of the distribution.
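
For example, in R the standard distribution functions can be used to evaluate such models; the parameter values below are arbitrary.

# Binomial: probability of exactly 3 successes in 10 trials with p = 0.5
dbinom(3, size = 10, prob = 0.5)

# Poisson: probability of observing exactly 2 events when the mean rate is 4
dpois(2, lambda = 4)

# Normal: probability that a standard normal value falls below 1.96
pnorm(1.96, mean = 0, sd = 1)

# Exponential: probability that a waiting time exceeds 2 when the rate is 0.5
1 - pexp(2, rate = 0.5)
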

By using probability distribution modeling, researchers and analysts can make predictions and draw conclusions about the likelihood of certain events happening based on the data available. This can be particularly useful in fields such as finance, economics, biology, and engineering, where understanding and predicting uncertainty is crucial.
