Data Mining Edited



Student’s Name

Institutional Affiliation

Course Code

Course Title

Lecturer’s Name

Due Date

1. How does data and classifying data impact data mining?

Data mining identifies patterns and trends in large quantities of data that cannot be found using conventional analytical methods, and machine learning systems then use the patterns that data mining discovers to learn something new from the data they are given (Aggarwal, 2015). Classification is the data mining task that assigns each item in a collection to one of several target classes; its goal is to predict the correct target class for every data sample with high precision. Regression, by contrast, predicts a numeric target and is therefore a prediction technique rather than a classification technique (Aggarwal, 2015). Data mining takes many forms, including text mining, social media mining, web mining, and audio and video mining. It draws on a multidisciplinary skill set spanning machine learning, statistics, and artificial intelligence to extract knowledge and anticipate future events, and its insights serve a variety of purposes, including marketing, fraud detection, and scientific discovery (Aggarwal, 2015). Because it uncovers previously unknown or hidden relationships between data sets through statistical analysis, data mining is also described as knowledge discovery in databases (KDD), knowledge extraction, data pattern analysis, and information harvesting (Aggarwal, 2015).
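To make the classification task concrete, the following sketch implements a minimal 1-nearest-neighbour classifier: each new sample is assigned the class of the closest labelled training example. The data, labels, and function names are invented for illustration, not taken from the cited text.

```python
import math

def euclidean(a, b):
    """Distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(sample, training_data):
    """Return the class label of the nearest training example."""
    nearest = min(training_data, key=lambda item: euclidean(sample, item[0]))
    return nearest[1]

# Tiny labelled dataset: (features, class)
training = [
    ((1.0, 1.0), "low"),
    ((1.2, 0.8), "low"),
    ((8.0, 9.0), "high"),
    ((9.5, 8.5), "high"),
]

print(classify((1.1, 0.9), training))  # -> low
print(classify((9.0, 9.0), training))  # -> high
```

Even this toy example shows the defining property of classification: the target is a discrete class label, not the numeric quantity a regression model would predict.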

2. What is association in data mining?

At its most basic level, association rule mining evaluates data in order to identify frequent if-then patterns within a collection of transactions, and then uses support and confidence measures to single out the most significant relationships. Association rules are very helpful when it comes to evaluating large datasets (Solanki & Patel, 2015). In a supermarket, for example, data is acquired through bar code scanners, and the transaction database records every item bought by a customer in a single transaction (Solanki & Patel, 2015). Managers can identify product groupings that are often bought together and then use this knowledge to alter store layouts, cross-sell products, and promote in a more targeted manner (Solanki & Patel, 2015). The rules produced by mining describe how frequently items appear together across all transactions, and they are intended to discover interesting links and relationships in enormous amounts of data that would otherwise go undetected; market-basket analysis is the classic illustration (Solanki & Patel, 2015).

An association rule is a simple if/then statement that, as the name implies, helps establish relationships between items in a relational database or other data repository (Solanki & Patel, 2015). The vast majority of machine learning algorithms operate on numerical datasets and are essentially mathematical in nature (Solanki & Patel, 2015). Association rule mining, by contrast, is appropriate for non-numerical, categorical data and needs only basic counting.
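The counting involved really is basic, as a small market-basket sketch shows: the code below tallies how often each pair of items co-occurs and keeps the pairs whose support (fraction of transactions containing both items) meets a threshold. The transactions and the 0.4 threshold are invented for illustration.

```python
from itertools import combinations
from collections import Counter

transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "diapers"},
    {"bread", "milk", "diapers"},
    {"beer", "bread"},
]

def frequent_pairs(transactions, min_support=0.4):
    """Return item pairs whose co-occurrence frequency meets the threshold."""
    counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

print(frequent_pairs(transactions))  # {('bread', 'milk'): 0.6}
```

Here bread and milk appear together in 3 of 5 baskets (support 0.6), so that pair survives the threshold while every other pair does not.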

3. Select a specific association rule (from the text) and thoroughly explain the key concepts.

Association rule mining first searches for all itemsets whose support exceeds a minimum support threshold, and then uses those large itemsets to construct rules whose confidence exceeds a minimum confidence threshold (Hussain et al., 2018). Market-basket analysis is a common application of association rules. Association rule mining is a technique for identifying patterns, correlations, and connections in data stored in a variety of repositories, including relational databases, transactional databases, and other types of repositories (Hussain et al., 2018).

Association rules have two parts:

i. an antecedent (if), and

ii. a consequent (then).
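The two parts and the two thresholds can be tied together with one concrete rule. The sketch below evaluates the hypothetical rule {bread} -> {milk} (antecedent -> consequent) over invented transactions: support is the fraction of all transactions containing both sides, confidence is the fraction of antecedent-containing transactions that also contain the consequent.

```python
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "diapers"},
    {"bread", "milk", "diapers"},
    {"beer", "bread"},
]

def rule_metrics(transactions, antecedent, consequent):
    """Support and confidence for the rule antecedent -> consequent."""
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    ante = sum(1 for t in transactions if antecedent <= t)
    support = both / len(transactions)
    confidence = both / ante if ante else 0.0
    return support, confidence

support, confidence = rule_metrics(transactions, {"bread"}, {"milk"})
print(support, confidence)  # 0.6 0.75
```

Bread appears in 4 of 5 baskets and bread-with-milk in 3, so the rule has support 0.6 and confidence 0.75; a miner would keep it only if both values clear the chosen minimums.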

4. Discuss cluster analysis concepts.

Cluster analysis (also known as clustering) is a method for partitioning a large number of data objects (or observations) into smaller groups (Kassambara, 2017). Each subset forms a cluster, and the elements within a cluster are similar to one another yet distinct from those in the other clusters. In other words, clustering divides data points into groups such that the points in one group are more similar to each other than they are to the points in other groups (Kassambara, 2017). Clustering may be applied to any population or collection of data points; the goal is to identify members with similar features and assign them to the same cluster. Because a cluster of data elements can be treated as a single entity for reporting purposes, cluster analysis divides a collection of data into groups by degree of similarity and assigns a label to each of those groups (Kassambara, 2017).

In comparison to classification, the primary benefit of clustering is that it allows for more flexibility in responding to changing circumstances and aids in identifying the key characteristics that differentiate distinct groups (Kassambara, 2017). Cluster analysis has a broad range of applications across many disciplines, including market research, pattern recognition, data analysis, and image processing (Kassambara, 2017). In marketing, for example, clustering can identify distinct consumer segments, and businesses may categorize their customers by their purchasing habits and preferences (Kassambara, 2017).
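The assign-then-update logic of cluster analysis can be sketched with a minimal k-means loop in pure Python: each point is assigned to its nearest centroid, then each centroid moves to the mean of its assigned points. The points and starting centroids are invented for illustration.

```python
def kmeans(points, centroids, iterations=10):
    """Minimal k-means: returns final centroids and the point groups."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[idx].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [
            tuple(sum(c) / len(c) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

points = [(1, 1), (1.5, 2), (8, 8), (9, 9), (1, 0.5), (8.5, 9.5)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids)
```

Note that no labels are supplied anywhere: the two groups emerge purely from similarity between points, which is exactly what distinguishes clustering from classification.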


5

5. Explain what an anomaly is and how to avoid it.

In database design, an anomaly is an inconsistency that arises when a relation is poorly structured from the user's point of view; anomalies fall into three categories: deletion anomalies, insertion anomalies, and update anomalies (Agrawal & Agrawal, 2015). An update anomaly is a discrepancy in data that occurs as a result of redundancy and incomplete updates. The quickest and most effective way to prevent update anomalies is to keep each data set focused on the entity it represents: when two different categories of information, such as orders and products, are combined in a single relation, inconsistencies result, so they should be separated (Agrawal & Agrawal, 2015). Normalization is the process of organizing relations into a well-structured form that enables users to insert, remove, and edit tuples without creating database inconsistencies (Agrawal & Agrawal, 2015). It is used in relational databases to prevent data from becoming inconsistent, and many issues arise when conceptual models are imported into a database management system before they have been normalized (Agrawal & Agrawal, 2015).

In data mining, anomaly detection, also known as outlier detection, is the process of identifying occurrences, observations, or unexpected events that differ substantially from the norm (Agrawal & Agrawal, 2015). Data scientists often apply it to unlabeled data, a technique known as unsupervised anomaly detection because it requires no supervision. All kinds of anomaly detection rest on two basic assumptions: anomalies occur rarely in the data, and the characteristics of anomalies differ substantially from those of typical occurrences (Agrawal & Agrawal, 2015).
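Both assumptions are visible in a minimal unsupervised detector: the z-score method below flags any value lying more than a chosen number of standard deviations from the mean, with no labels required. The sensor readings and the threshold of 2.0 are invented for illustration.

```python
import statistics

def find_anomalies(values, threshold=2.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / stdev > threshold]

readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 42.0, 10.1]
print(find_anomalies(readings))  # [42.0]
```

The single extreme reading is both rare and far from the typical values, so it stands out; if anomalies were common or looked like normal data, this kind of detector would fail, which is why the two assumptions matter.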

6. Discuss methods to avoid false discoveries.

Traditionally, multiple-testing control methods have addressed the family-wise error rate (FWER), the probability of at least one erroneous discovery (Goeschel, 2016). High-throughput research, however, tests many hypotheses at a time, often hundreds of thousands to millions, and controlling the false discovery rate (FDR) has emerged as a popular and effective way of keeping error levels down (Goeschel, 2016). While the traditional FDR method takes only p-values as input, more recent FDR approaches have been shown to improve power by incorporating other information, such as informative covariates, weights, and hypothesis structure, into the analysis (Goeschel, 2016). At present, however, there is no agreement on how the existing techniques should be compared with one another. Using rigorous benchmark comparisons based on simulation experiments and six case studies in computational biology, we assess the accuracy, applicability, and ease of use of two standard FDR control techniques as well as six alternative methods (Goeschel, 2016).

Even when the covariate is truly non-informative, the techniques that incorporate informative covariates are considerably more powerful than the classical approach and fall only slightly short of it in terms of precision (Goeschel, 2016). With the exception of two novel techniques that have been tested only in restricted circumstances, the majority of the strategies successfully control FDR. Furthermore, we found that modern FDR techniques outperform classical methods when informative covariates, the total number of hypotheses tested, and the proportion of hypotheses that are truly non-null are taken into account (Goeschel, 2016). The relative benefit of modern FDR methods that incorporate relevant covariates over classical FDR control procedures varies depending on the application and the covariates that are available (Goeschel, 2016). The results are presented as a practical guide, complete with recommendations, to help researchers choose suitable methods for correcting erroneous discoveries in the field of neuroscience.
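A standard example of classical FDR control is the Benjamini-Hochberg procedure, sketched below: sort the p-values, find the largest rank k whose p-value is at most (k/m) * alpha, and reject every hypothesis with a p-value up to that cutoff. The p-values are invented for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the p-values whose hypotheses are rejected at FDR level alpha."""
    m = len(p_values)
    ranked = sorted(p_values)
    cutoff = 0.0
    # Find the largest rank k (1-based) whose p-value clears the BH line.
    for k, p in enumerate(ranked, start=1):
        if p <= (k / m) * alpha:
            cutoff = p
    return [p for p in p_values if p <= cutoff]

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(benjamini_hochberg(pvals, alpha=0.05))  # [0.001, 0.008]
```

This is exactly the "p-values only" classical procedure the paragraph above contrasts with modern covariate-aware methods: because it ranks raw p-values, it cannot exploit any side information about which hypotheses are more likely to be non-null.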


References

Aggarwal, C. C. (2015). Data classification. In Data mining (pp. 285-344). Springer, Cham.

Agrawal, S., & Agrawal, J. (2015). Survey on anomaly detection using data mining techniques. Procedia Computer Science, 60, 708-713.

Goeschel, K. (2016, March). Reducing false positives in intrusion detection systems using data-mining techniques utilizing support vector machines, decision trees, and naive Bayes for off-line analysis. In SoutheastCon 2016 (pp. 1-6). IEEE.

Hussain, S., Atallah, R., Kamsin, A., & Hazarika, J. (2018, April). Classification, clustering and association rule mining in educational datasets using data mining tools: A case study. In Computer Science On-line Conference (pp. 196-211). Springer, Cham.

Kassambara, A. (2017). Practical guide to cluster analysis in R: Unsupervised machine learning (Vol. 1). STHDA.

Solanki, S. K., & Patel, J. T. (2015, February). A survey on association rule mining. In 2015 Fifth International Conference on Advanced Computing & Communication Technologies (pp. 212-216). IEEE.
