DM-Model Question Paper Solutions
DATA MINING
2. Define Prediction.
Ans: PREDICTION:
Prediction is used to find a numerical output. The training dataset contains the
inputs and their numerical output values. From the training dataset, the algorithm
generates a model or predictor. When fresh data is provided, the model should
produce a numerical output. Unlike classification, this approach does not have a
class label: the model predicts a continuous-valued function or ordered value.
Example: predicting the worth of a home based on facts like the number of
rooms, total area, and so on.
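As a minimal sketch of this idea, the snippet below fits a least-squares line to invented (area, price) training pairs and then predicts a price for fresh input; the data values and the fit_line helper are made up purely for illustration.

def fit_line(xs, ys):
    # ordinary least squares for y = a*x + b
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x
    return a, b

areas = [50, 70, 90, 110]      # input attribute: total area (sq. m)
prices = [100, 135, 172, 210]  # numerical output: price (in thousands)

a, b = fit_line(areas, prices)
print("predicted price for an 80 sq. m home:", a * 80 + b)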
3. Define Regression.
Ans: REGRESSION IN DATA MINING:
Regression is a data mining technique used to predict numeric values in a given
data set. It involves fitting a straight line or a curve to numerous data points.
For example, regression might be used to predict the cost of a product or service.
It is also used in various industries for business and marketing behavior, trend
analysis, and financial forecasting.
Regression is divided into five different types:
1. Linear Regression
2. Logistic Regression
3. Lasso Regression
4. Ridge Regression
5. Polynomial Regression
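As a hedged illustration of two of these types, the sketch below fits a straight line (Linear Regression) and a degree-2 curve (Polynomial Regression) to a handful of invented points with NumPy's polyfit; the sample data and the choice of degree are assumptions made for the example.

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 4.2, 9.1, 15.8])   # roughly quadratic in x

linear = np.polyfit(x, y, deg=1)   # Linear Regression: best-fit straight line
quad = np.polyfit(x, y, deg=2)     # Polynomial Regression: degree-2 curve

print("linear coefficients:", linear)
print("quadratic coefficients:", quad)
print("quadratic prediction at x = 5:", np.polyval(quad, 5.0))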
PART-B
II. Answer any Four questions, each carries Five marks. ( 4 x 5 = 20 )
7) What are the differences between Data Mining and Knowledge Discovery in Databases?
Ans: DATA MINING VS KDD.
Basic Definition:
- Data Mining: the process of identifying patterns and extracting details about big data sets using intelligent methods.
- KDD: a complex and iterative approach to knowledge extraction from big data.
Scope:
- Data Mining: in the KDD method, the fourth phase is called "data mining."
- KDD: a broad method that includes data mining as one of its steps.
Example:
- Data Mining: clustering groups of data elements based on how similar they are.
- KDD: data analysis to find patterns and links.
8) What are the various issues associated with Data Mining?
Ans: FACTORS THAT CREATE SOME ISSUES.
PERFORMANCE ISSUES:
Simple algorithm: it uses only the value of K (an odd number) and a distance
function (e.g., Euclidean).
Efficient method for small datasets.
Utilises "lazy learning": the training dataset is stored and used only when making
predictions, so training is quicker than for Support Vector Machines (SVMs) and
Linear Regression.
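A minimal sketch of the behaviour just described (Euclidean distance, an odd K, majority vote, and "lazy" training that merely stores the data); the points and labels are invented for illustration.

import math
from collections import Counter

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_predict(train, query, k=3):
    # lazy learning: no model is built; the stored rows are used at prediction time
    nearest = sorted(train, key=lambda row: euclidean(row[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((4.0, 4.2), "B"), ((4.1, 3.9), "B"), ((3.8, 4.0), "B")]
print(knn_predict(train, (1.1, 0.9), k=3))   # majority of the 3 nearest -> "A"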
10) Describe in detail one of the Decision Tree Algorithms and give examples.
Ans: Decision tree algorithm:
1. Begin with the entire dataset as the root node of the decision tree.
2. Determine the best attribute to split the dataset based on a given
criterion (e.g., information gain).
3. Create a new internal node that corresponds to the best attribute and
connect it to the root node.
4. Partition the dataset into subsets based on the values of the best
attribute.
5. Recursively repeat steps 1-4 for each subset until all instances in a
given subset belong to the same class or no further splitting is
possible.
6. Assign a leaf node to each subset that contains instances belonging to
the same class.
7. Make predictions by traversing the decision tree from the root node to
the leaf node that corresponds to the instance being classified.
The following decision tree is for the concept buys_computer; it indicates
whether a customer at a company is likely to buy a computer or not. Each internal
node represents a test on an attribute. Each leaf node represents a class.
Fig: Decision tree for the concept buys_computer
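Since the figure itself is not reproduced here, the sketch below encodes the classic buys_computer tree from the textbook literature as nested dictionaries and traverses it from root to leaf (step 7 above); the exact tree shape is an assumption.

# internal nodes test an attribute; plain strings are leaf classes
tree = {
    "attribute": "age",
    "branches": {
        "youth": {"attribute": "student",
                  "branches": {"no": "no", "yes": "yes"}},
        "middle_aged": "yes",
        "senior": {"attribute": "credit_rating",
                   "branches": {"excellent": "no", "fair": "yes"}},
    },
}

def classify(node, record):
    # traverse from the root node to a leaf node (a plain string)
    while isinstance(node, dict):
        node = node["branches"][record[node["attribute"]]]
    return node

print(classify(tree, {"age": "youth", "student": "yes"}))          # -> yes
print(classify(tree, {"age": "senior", "credit_rating": "fair"}))  # -> yes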
PART C
III. Answer any Four questions, each carries Eight marks. ( 4 x 8 = 32 )
13) How can you describe Data mining from the perspective of a database?
Ans: repeated.
16) Write a short note on hierarchical clustering.
Ans: repeated
17) What do you mean by Large item-sets? Explain in detail.
Ans: refer notes
18) What is Data Parallelism? Explain in detail.
Ans: Data Parallelism means concurrent execution of the same task on multiple computing cores.
For example, consider summing the contents of an array of size N. On a single-core system, one thread
would simply sum the elements [0] . . . [N − 1]. On a dual-core system, however, thread A, running on core
0, could sum the elements [0] . . . [N/2 − 1] while thread B, running on core 1, could sum the elements
[N/2] . . . [N − 1]. The two threads would run in parallel on separate computing cores.
Because the same operation runs concurrently on different subsets of the data, the speedup grows with the number of cores.
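A runnable sketch of the array-summing example in Python; processes are used instead of threads (to sidestep CPython's global interpreter lock) so the two workers can genuinely run on separate cores. The array size is arbitrary.

from multiprocessing import Pool

def partial_sum(chunk):
    # the same task, executed on each worker's share of the data
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))            # the array of size N
    n = len(data)
    halves = [data[:n // 2], data[n // 2:]]  # worker A's half, worker B's half
    with Pool(processes=2) as pool:
        print(sum(pool.map(partial_sum, halves)))  # equals sum(data)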
Ans: Regression
Regression is a statistical tool that helps determine the cause-and-effect relationship
between variables. It determines the relationship between a dependent and an
independent variable, and it is generally used to predict future trends and events.
Ans: CART is a predictive algorithm used in machine learning; it describes how the target
variable's values can be predicted from the other variables. It is a decision tree in which each
fork is a split on a predictor variable and each leaf node holds a prediction for the target variable.
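To make "a split on a predictor variable" concrete, the sketch below implements just CART's core step for a numeric target: it picks the threshold that minimises the summed squared error of the two children, each of which then predicts its mean. The helper names and data are invented.

def sse(values):
    # sum of squared errors around the mean (CART's regression criterion)
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(xs, ys):
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, threshold, sum(left) / len(left), sum(right) / len(right))
    return best

xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 5.5, 6.0, 20.0, 21.0, 19.5]
cost, threshold, left_mean, right_mean = best_split(xs, ys)
print("split at x <", threshold, "-> predict", left_mean, "else", right_mean)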
PART-B
II. Answer any Four questions, each carries Five marks. ( 4 x 5 = 20 )
7) What are the differences between Data Mining and Knowledge Discovery in Databases?
Ans: repeated (see the Data Mining vs KDD comparison above).
Bayes theorem is named after Thomas Bayes, who first used conditional
probability to provide an algorithm that uses evidence to calculate limits on an unknown
parameter.
Bayes theorem: P(X|Y) = P(Y|X) P(X) / P(Y), where P(X|Y) is the probability of X given that Y is true.
P(X) and P(Y) are the probabilities of observing X and Y independently of each other; this
is known as the marginal probability.
Bayesian interpretation: P(X|Y) = P(X⋂Y) / P(Y), where P(X⋂Y) is the joint probability of
both X and Y being true.
Although the naive Bayes approach is straightforward to use, it does not always yield
satisfactory results. First, the attributes usually are not independent; we could use a subset of
the attributes by ignoring any that are dependent on others. Second, the basic technique does
not handle continuous data.
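A minimal naive Bayes sketch over categorical attributes, applying Bayes theorem per class under the independence assumption criticised above; the toy weather data is invented.

from collections import Counter, defaultdict

def train_nb(rows, labels):
    class_counts = Counter(labels)
    # attr_counts[class][attribute index][value] = frequency
    attr_counts = defaultdict(lambda: defaultdict(Counter))
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            attr_counts[label][i][value] += 1
    return class_counts, attr_counts

def predict_nb(model, row):
    class_counts, attr_counts = model
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / total                # prior P(class)
        for i, value in enumerate(row):      # times each P(value | class)
            score *= attr_counts[label][i][value] / count
        if score > best_score:
            best_label, best_score = label, score
    return best_label

rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "hot")]
labels = ["no", "no", "yes", "yes"]
print(predict_nb(train_nb(rows, labels), ("rainy", "mild")))   # -> "yes"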
Ans: 1. Classification:
This technique is used to obtain important and relevant information about data and
metadata. It helps classify data into different classes.
2. Clustering:
3. Regression:
Regression analysis is the data mining process used to identify and analyze the
relationship between variables in the presence of other factors. It is used to
estimate the likelihood of a specific variable. Regression is primarily a form of
planning and modeling. For example, we might use it to project certain costs,
depending on other factors such as availability, consumer demand, and competition.
Primarily, it gives the exact relationship between two or more variables in the given
data set.
4. Association Rules:
This data mining technique helps discover a link between two or more items. It
finds hidden patterns in the data set.
Association rules are if-then statements that help show the probability of
interactions between data items within large data sets in different types of databases.
For example, given a list of grocery items that you have been buying for the last six
months, it calculates the percentage of items being purchased together.
5. Outlier detection:
This data mining technique relates to the observation of data items in a data
set that do not match an expected pattern or expected behavior. It may be
used in various domains like intrusion detection, fraud detection, etc. It is
also known as Outlier Analysis or Outlier Mining. An outlier is a data point that
diverges too much from the rest of the dataset; the majority of real-world
datasets contain outliers. Outlier detection plays a significant role in the data mining
field and is valuable in numerous areas like network intrusion identification,
credit or debit card fraud detection, and detecting outliers in wireless
sensor network data (see the sketch after this list).
6. Sequential Patterns:
7. Prediction:
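As mentioned under item 5, here is a small sketch of outlier detection by z-score, one simple and common criterion; the data values and the threshold of 2 (chosen because the sample is tiny) are assumptions.

import statistics

data = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0]   # 25.0 diverges from the rest
mean = statistics.mean(data)
stdev = statistics.stdev(data)

# flag points more than 2 standard deviations from the mean
outliers = [x for x in data if abs(x - mean) / stdev > 2]
print(outliers)   # -> [25.0]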
11) Explain Hierarchical clustering in detail.
Ans: HIERARCHICAL ALGORITHMS
As mentioned earlier, hierarchical clustering algorithms actually create sets of
clusters. Hierarchical algorithms differ in how the sets are created. A tree data
structure, called a dendrogram, can be used to illustrate the hierarchical clustering
technique and the sets of different clusters. The root of a dendrogram
contains one cluster where all elements are together. The leaves of the
dendrogram each consist of a single-element cluster. Internal nodes in the
dendrogram represent new clusters formed by merging the clusters that appear as
their children in the tree. Each level in the tree is associated with the distance
measure that was used to merge the clusters. All clusters created at a particular
level were combined because the children clusters had a distance between them
less than the distance value associated with this level in the tree.
Fig: Dendrogram
Fig: Five Levels of Clustering
The figure shows six elements, {A, B, C, D, E, F}, to be clustered. Parts (a) to (e) of the figure
show five different sets of clusters. In part (a) each cluster is viewed to consist of
a single element.
The space complexity for hierarchical algorithms is O(n^2) because this is the space
required for the adjacency matrix. The space required for the dendrogram is O(kn),
which is much less than O(n^2). The time complexity for hierarchical algorithms is
O(kn^2) because there is one iteration for each level in the dendrogram. Depending
on the specific algorithm, however, this could actually be O(maxd * n^2), where maxd
is the maximum distance between points. Different algorithms may actually merge
the closest clusters from the next lowest level or simply create new clusters at each
level with progressively larger distances. Hierarchical techniques are well suited for
many clustering applications that naturally exhibit a nesting relationship between
clusters.
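A compact sketch of agglomerative (bottom-up) hierarchical clustering with single-link distance, printing the cluster sets at each dendrogram level; the six 1-D points labelled A to F are invented stand-ins for the six elements in the figure.

points = {"A": 1.0, "B": 1.5, "C": 5.0, "D": 5.4, "E": 9.0, "F": 9.1}
clusters = [{name} for name in points]   # leaves: single-element clusters

def single_link(c1, c2):
    # cluster distance = smallest pairwise distance between their points
    return min(abs(points[a] - points[b]) for a in c1 for b in c2)

while len(clusters) > 1:
    # merge the two closest clusters; each merge is one level of the dendrogram
    i, j = min(((i, j) for i in range(len(clusters))
                for j in range(i + 1, len(clusters))),
               key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
    d = single_link(clusters[i], clusters[j])
    merged = clusters[i] | clusters[j]
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    print("merge at distance", round(d, 2), "->", [sorted(c) for c in clusters])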
PART C
III. Answer any Four questions, each carries Eight marks. ( 4 x 8 = 32 )
13) How can you describe Data mining from the perspective of a database?
Ans: Data Mining from a Database Perspective.
Data mining refers to extracting or mining knowledge from large amounts of data.
In other words, data mining is the science, art, and technology of exploring large
and complex bodies of data in order to discover useful patterns. Theoreticians and
practitioners are continually seeking improved techniques to make the process more
efficient, cost-effective, and accurate. Any situation can be analyzed in two ways in
data mining:
15) Explain how the K-Means Clustering algorithm works and give examples.
Ans: K Means Clustering:
K-means is an iterative clustering algorithm in which items are moved among sets of
clusters until the desired set is reached. As such, it may be viewed as a type of squared-error
algorithm, although the convergence criteria need not be defined based on the squared
error. A high degree of similarity among elements in clusters is obtained, while a high
degree of dissimilarity among elements in different clusters is achieved simultaneously.
The time complexity of K-means is O(tkn), where t is the number of iterations, k the
number of clusters, and n the number of items.
K-means finds a local optimum and may actually miss the global optimum. K-means
does not work on categorical data because the mean must be defined on the attribute
type.
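Before the general clustering issues below, a minimal K-means sketch for 1-D data (invented points, k = 2) showing the iterative reassignment behind the O(tkn) cost just described.

def kmeans(items, k, max_iter=100):
    means = items[:k]                 # naive initialisation: first k items
    assignment = None
    for _ in range(max_iter):         # t iterations at most
        new_assignment = [min(range(k), key=lambda j: abs(x - means[j]))
                          for x in items]          # k*n distance checks
        if new_assignment == assignment:           # converged (local optimum)
            break
        assignment = new_assignment
        for j in range(k):            # recompute each cluster mean
            members = [x for x, a in zip(items, assignment) if a == j]
            if members:
                means[j] = sum(members) / len(members)
    return means, assignment

items = [1.0, 1.2, 0.8, 8.0, 8.2, 7.9]
print(kmeans(items, k=2))   # two means near 1.0 and 8.0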
o Outlier handling is difficult: outlying elements do not naturally fall into
any cluster.
o Dynamic data in the database implies that cluster membership may
change over time.
o Interpreting the semantic meaning of each cluster may be
difficult. With classification, the labeling of the classes is known ahead
of time; with clustering, this may not be the case. Thus, when
the clustering process finishes creating a set of clusters, the exact
meaning of each cluster may not be obvious. There is no one
correct answer to a clustering problem; in fact, many answers may be
found. The exact number of clusters required is not easy to determine.
Again, a domain expert may be required.
o Another related issue is what data should be used for clustering. Unlike
a classification process, where there is some a priori knowledge concerning
what the attributes of each classification should be, in clustering we have no
supervised learning to aid the process. Indeed, clustering can be viewed as similar
to unsupervised learning.
Apriori property: if an itemset I is not frequent (i.e., its support is below the
minimum support threshold), then any superset of I cannot be frequent either.
#1) In the first iteration of the algorithm, each item is taken as a 1-itemset
candidate. The algorithm counts the occurrences of each item.
#2) Let there be some minimum support, min_sup (e.g., 2). The set of 1-itemsets
whose occurrence satisfies min_sup is determined: only those candidates
whose count is greater than or equal to min_sup are taken ahead to the next
iteration, and the others are pruned.
#3) Next, the frequent 2-itemsets with min_sup are discovered. For this, in the join
step, the 2-itemsets are generated by forming groups of 2, joining the frequent
1-itemsets with themselves.
#4) The 2-itemset candidates are pruned using the min_sup threshold value. The
table will then contain only the 2-itemsets that satisfy min_sup.
#5) The next iteration forms 3-itemsets using the join and prune steps. This
iteration follows the antimonotone property: the 2-itemset subsets of each
candidate 3-itemset must all satisfy min_sup. If all 2-itemset subsets are
frequent, the superset may be frequent and is kept; otherwise it is pruned.
#6) The next step forms 4-itemsets by joining 3-itemsets with themselves and
pruning any candidate whose subsets do not meet the min_sup criterion. The
algorithm stops when no further frequent itemsets can be generated.
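A compact sketch of steps #1 to #6 for a tiny invented transaction set with min_sup = 2, alternating the join and prune steps until no new frequent itemsets appear.

from itertools import combinations

transactions = [{"bread", "milk"},
                {"bread", "butter"},
                {"bread", "milk", "butter"},
                {"milk", "butter"}]
min_sup = 2

def support(itemset):
    # number of transactions containing every item of the itemset
    return sum(itemset <= t for t in transactions)

items = {i for t in transactions for i in t}
# steps #1 and #2: frequent 1-itemsets
frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_sup}]

k = 1
while frequent[-1]:
    # join step: build (k+1)-itemset candidates from frequent k-itemsets
    candidates = {a | b for a in frequent[-1] for b in frequent[-1]
                  if len(a | b) == k + 1}
    # prune step: antimonotone property, then the min_sup count
    next_level = {c for c in candidates
                  if all(frozenset(s) in frequent[-1] for s in combinations(c, k))
                  and support(c) >= min_sup}
    frequent.append(next_level)
    k += 1

for level, sets in enumerate(frequent[:-1], start=1):
    print(level, [sorted(s) for s in sets])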
Advantages
1. Easy to understand algorithm
2. Join and Prune steps are easy to implement on large itemsets in large
databases
Disadvantages
1. It requires heavy computation if the itemsets are very large and the minimum
support is kept very low.
2. The entire database needs to be scanned repeatedly.
Applications: frequent pattern mining has many applications in the field of data
analysis, software bugs, cross-marketing, sale campaign analysis, market basket
analysis, etc.