
DATA MINING METRICS

INTRODUCTION

The simple productivity measures and hard constraints used in many paratransit vehicle scheduling software programs do not fully capture the interests of all the stakeholders in a typical paratransit organization (e.g., passengers, drivers, municipal government). As a result, many paratransit agencies still retain a human scheduler to look through all of the schedules and manually pick out impractical, unacceptable runs. (A run is one vehicle's schedule for one day.) The goal of this research was to develop a systematic tool that can compute all the relevant performance metrics of a run, predict its overall quality, and identify bad runs automatically.
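As a rough illustration of what such run-level performance metrics might look like, the following minimal Python sketch summarizes one run from a list of stop records. All field names (requested_time, pickup_time, board, and so on), the units (minutes), and the metric set are assumptions for illustration, not the actual scheduling data model used in the study.

```python
# Hypothetical sketch: summarize one vehicle run (one vehicle's schedule for one day)
# from invented stop records. Field names and units are assumptions.

def run_metrics(stops, service_hours, deadhead_minutes):
    """Compute simple run-level performance metrics."""
    passengers = sum(s["board"] for s in stops)
    ride_times = [s["dropoff_time"] - s["pickup_time"] for s in stops]
    wait_times = [s["pickup_time"] - s["requested_time"] for s in stops]
    return {
        "passengers_per_vehicle_hour": passengers / service_hours,
        "avg_ride_time_min": sum(ride_times) / len(ride_times),
        "avg_wait_time_min": sum(wait_times) / len(wait_times),
        "deadhead_minutes": deadhead_minutes,
    }

stops = [  # times are minutes since midnight
    {"requested_time": 540, "pickup_time": 548, "dropoff_time": 570, "board": 1},
    {"requested_time": 560, "pickup_time": 575, "dropoff_time": 600, "board": 2},
]
print(run_metrics(stops, service_hours=8.0, deadhead_minutes=35))
```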

This assignment presents a methodology that includes a number of performance metrics reflecting the key interests of the stakeholders (e.g., number of passengers per vehicle per hour, dead-heading time, passenger wait time, passenger ride time, and degree of zigzagging) and a data-mining tool to fit the metrics to the ratings provided by experienced schedulers. The encouraging preliminary results suggest that the proposed methodology can be easily extended to and implemented in other paratransit organizations to improve efficiency by effectively detecting poor schedules.

Many criteria can be used to evaluate the performance of supervised learning. Different criteria are appropriate in different settings, and it is not always clear which criteria to use.

A further complication is that learning methods that perform well on one criterion may not perform well on other criteria. For example, SVMs and boosting are designed to optimize accuracy, whereas neural nets typically optimize squared error or cross entropy. We conducted an empirical study using a variety of learning methods (SVMs, neural nets, k-nearest neighbor, bagged and boosted trees, and boosted stumps) to compare nine boolean classification performance metrics: Accuracy, Lift, F-Score, Area under the ROC Curve, Average Precision, Precision/Recall Break-Even Point, Squared Error, Cross Entropy, and Probability Calibration. Multidimensional scaling (MDS) shows that these metrics span a low-dimensional manifold.

The three metrics that are appropriate when predictions are interpreted as probabilities (squared error, cross entropy, and calibration) lie in one part of metric space, far from the metrics that depend on the relative order of the predicted values (ROC area, average precision, break-even point, and lift). In between fall two metrics that depend on comparing predictions to a threshold: accuracy and F-score. As expected, maximum-margin methods such as SVMs and boosted trees have excellent performance on metrics like accuracy, but perform poorly on probability metrics such as squared error.

What was not expected was that the margin methods also have excellent performance on ordering metrics such as ROC area and average precision. We introduce a new metric, SAR, that combines squared error, accuracy, and ROC area into one measure. MDS and correlation analysis show that SAR is centrally located and correlates well with the other metrics, suggesting that it is a good general-purpose metric to use when more specific criteria are not known.
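A minimal sketch of how SAR can be computed follows, assuming scikit-learn and the commonly cited formulation of SAR as the average of accuracy, ROC area, and one minus root-mean-squared error; the 0.5 probability threshold and the toy data are assumptions.

```python
# Hedged sketch of SAR: (accuracy + ROC area + (1 - RMSE)) / 3, computed from
# predicted probabilities. Threshold and data are illustrative assumptions.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, mean_squared_error

def sar(y_true, y_prob, threshold=0.5):
    acc = accuracy_score(y_true, (y_prob >= threshold).astype(int))
    auc = roc_auc_score(y_true, y_prob)
    rms = np.sqrt(mean_squared_error(y_true, y_prob))
    return (acc + auc + (1.0 - rms)) / 3.0

y_true = np.array([0, 0, 1, 1, 1])
y_prob = np.array([0.2, 0.4, 0.7, 0.6, 0.9])
print(round(sar(y_true, y_prob), 3))
```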

We investigate the use of data mining for the analysis of software metric databases, and some of the issues in this application domain. Software metrics are collected at various phases of the software development process in order to monitor and control the quality of a software product. However, software quality control is complicated by the complex relationship between these metrics and the attributes of the software development process. Data mining has been proposed as a potential technology for supporting and enhancing our understanding of software metrics and their relationship to software quality.

NINE METRICS USED IN DATA MINING


Predictive analytics enables you to develop mathematical models to help you better
understand the variables driving success. Predictive analytics relies on formulas that compare
past successes and failures, and then uses those formulas to predict future outcomes.
Predictive analytics, pattern recognition, and classification problems are not new. Long used
in the financial services and insurance industries, predictive analytics is about using statistics,
data mining, and game theory to analyze current and historical facts in order to make
predictions about future events. The value of predictive analytics is obvious. The more you
understand customer behavior and motivations, the more effective your marketing will be.

1. Regression analysis. Regression models are the mainstay of predictive analytics. The linear regression model analyzes the relationship between the response or dependent variable and a set of independent or predictor variables. That relationship is expressed as an equation that predicts the response variable as a linear function of the parameters.
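As a minimal sketch of this idea (using scikit-learn and invented data), a linear regression can be fitted and used for prediction as follows:

```python
# Minimal linear-regression sketch: the response is modeled as a linear
# function of the predictor variables. Data values are made up.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0, 20.0], [2.0, 25.0], [3.0, 30.0], [4.0, 28.0]])  # predictors
y = np.array([10.5, 14.0, 18.2, 20.1])                              # response

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)   # fitted linear relationship
print(model.predict([[2.5, 27.0]]))    # prediction for a new observation
```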

2. Choice modeling. Choice modeling is an accurate and general-purpose tool for making probabilistic predictions about decision-making behavior. It behooves every organization to target its marketing efforts at customers who have the highest probabilities of purchase.

Choice models are used to identify the most important factors driving customer
choices. Typically, the choice model enables a firm to compute an individual's likelihood of
purchase, or other behavioral response, based on variables that the firm has in its database,
such as geo-demographics, past purchase behavior for similar products, attitudes, or
psychographics.
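One simple way to sketch such a choice model is a logistic regression that maps customer attributes to a probability of purchase. The feature names and data below are illustrative assumptions, not a prescribed model specification.

```python
# Sketch of a basic choice model: logistic regression producing a likelihood
# of purchase from customer attributes. Features and data are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

# columns: [age, past purchases of similar products]
X = np.array([[25, 0], [34, 2], [45, 5], [52, 1], [29, 3], [61, 4]])
y = np.array([0, 0, 1, 0, 1, 1])  # 1 = purchased

model = LogisticRegression().fit(X, y)
prob_purchase = model.predict_proba([[40, 2]])[0, 1]
print(f"likelihood of purchase: {prob_purchase:.2f}")
```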

3. Rule induction. Rule induction involves developing formal rules that are extracted
from a set of observations. The rules extracted may represent a scientific model of the data or
local patterns in the data. One major rule-induction paradigm is the association rule.
Association rules are about discovering interesting relationships between variables in large
databases. It is a technique applied in data mining and uses rules to discover regularities
between products. For example, if someone buys peanut butter and jelly, he or she is likely to
buy bread. The idea behind association rules is to understand when a customer does X, he or
she will most likely do Y. Understanding those kinds of relationships can help with
forecasting sales, promotional pricing, or product placements.
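The two basic measures behind an association rule, support and confidence, can be computed directly from transaction data. The sketch below uses a made-up basket of transactions for the peanut butter and jelly example.

```python
# Support and confidence for the hypothetical rule
# {peanut butter, jelly} -> {bread}, over invented transactions.
transactions = [
    {"peanut butter", "jelly", "bread"},
    {"peanut butter", "jelly"},
    {"bread", "milk"},
    {"peanut butter", "jelly", "bread", "milk"},
]

antecedent, consequent = {"peanut butter", "jelly"}, {"bread"}
n_antecedent = sum(antecedent <= t for t in transactions)
n_both = sum((antecedent | consequent) <= t for t in transactions)

support = n_both / len(transactions)   # how often the full rule occurs
confidence = n_both / n_antecedent     # P(bread | peanut butter and jelly)
print(support, confidence)
```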

4. Network/Link Analysis. This is another technique for associating like records. Link
analysis is a subset of network analysis. It explores relationships and associations among
many objects of different types that are not apparent from isolated pieces of information. It is
commonly used for fraud detection and by law enforcement. You may be familiar with link
analysis, since several Web-search ranking algorithms use the technique.
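As a small illustration of link analysis, PageRank (one well-known link-analysis algorithm used in Web-search ranking) can be run on a toy graph with the networkx library; the graph itself is an invented example.

```python
# Link-analysis sketch: rank nodes by the structure of the links between them.
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([("A", "B"), ("B", "C"), ("C", "A"), ("D", "C"), ("A", "C")])

scores = nx.pagerank(G)  # importance of each node derived from link structure
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```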

5. Clustering/Ensembles. Cluster analysis, or clustering, is a way to categorize a collection of "objects," such as survey respondents, into groups or clusters in order to look for patterns. Ensemble analysis is a newer approach that leverages multiple cluster solutions (an ensemble of potential solutions). There are various ways to cluster or create ensembles. Regardless of the method, the purpose is generally the same: to use cluster analysis to partition a market into segments and target markets, so as to better understand and predict the behaviors and preferences of those segments. Clustering is a valuable predictive-analytics approach when it comes to product positioning, new-product development, usage habits, product requirements, and selecting test markets.
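A minimal clustering sketch with k-means (one of many possible clustering methods) is shown below; the two attributes and the respondent values are invented.

```python
# K-means sketch: group survey respondents into segments by two attributes.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 3], [2, 2],    # one apparent segment
              [8, 9], [9, 8], [8, 8]])   # another apparent segment

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # cluster assignment per respondent
print(kmeans.cluster_centers_)   # segment profiles
```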

6. Neural networks. Neural networks were designed to mimic how the brain learns
and analyzes information. Organizations develop and apply artificial neural networks to
predictive analytics in order to create a single framework.

The idea is that a neural network is much more efficient and accurate in circumstances
where complex predictive analytics is required, because neural networks comprise a series of
interconnected calculating nodes that are designed to map a set of inputs into one or more
output signals. Neural networks are ideal for deriving meaning from complicated or
imprecise data and can be used to extract patterns and detect trends that are too complex to be
noticed by humans or other computer techniques. Marketing organizations find neural
networks useful for predicting customer demand and customer segmentation.
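The sketch below shows a small feed-forward network of interconnected nodes mapping inputs to an output signal, using scikit-learn's MLPClassifier as one possible implementation; the inputs stand in for customer attributes and the data are invented.

```python
# Neural-network sketch: a small multilayer perceptron mapping two inputs
# to a class probability. Data and architecture are illustrative assumptions.
import numpy as np
from sklearn.neural_network import MLPClassifier

X = np.array([[0.1, 0.9], [0.2, 0.8], [0.9, 0.2], [0.8, 0.1]])
y = np.array([1, 1, 0, 0])

net = MLPClassifier(hidden_layer_sizes=(8,), solver="lbfgs",
                    max_iter=2000, random_state=0).fit(X, y)
print(net.predict([[0.15, 0.85]]))        # predicted class for a new input
print(net.predict_proba([[0.15, 0.85]]))  # output signal as class probabilities
```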

7. Memory-based reasoning (MBR)/Case-based reasoning. This technique produces results similar to a neural network's but goes about it differently. MBR looks for "neighbor" records rather than abstract patterns. It solves new problems based on the solutions to similar past problems. MBR is an empirical classification method that operates by comparing new, unclassified records with known examples and patterns.
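A k-nearest-neighbor classifier is one common way to sketch memory-based reasoning: a new record is classified by the known records closest to it. The labels and values below are invented.

```python
# Memory-based reasoning sketch: classify a new record by comparing it with
# its nearest "neighbor" records rather than fitting an explicit model.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_known = np.array([[1, 1], [1, 2], [6, 6], [7, 6]])   # classified examples
y_known = np.array(["low-risk", "low-risk", "high-risk", "high-risk"])

mbr = KNeighborsClassifier(n_neighbors=3).fit(X_known, y_known)
print(mbr.predict([[2, 1]]))  # new, unclassified record resolved by its neighbors
```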

8. Decision trees. Decision trees apply data-mining algorithms to classification problems. A decision-tree process generates the rules followed in a process. Decision trees are useful for helping you choose among several courses of action and enable you to explore the possible outcomes for various options in order to assess the risks and rewards of each potential course of action. Such an analysis is useful when you need to choose among different strategies or investment opportunities, especially when you have limited resources.
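The sketch below fits a small classification tree and prints the rules it generates; the feature names and data are illustrative assumptions.

```python
# Decision-tree sketch: fit a shallow tree and print its rules.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

X = np.array([[25, 0], [40, 1], [35, 1], [23, 0], [52, 1], [46, 0]])
y = np.array([0, 1, 1, 0, 1, 0])

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["age", "prior_purchase"]))
```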

9. Uplift modeling, aka net-response modeling or incremental-response modeling. This technique directly models the incremental impact of targeting marketing activities.

The uplift of a marketing campaign is usually defined as the difference in response rates between a treated group and a randomized control group. Uplift modeling uses a randomized scientific control to measure the effectiveness of a marketing action and to build a model that predicts the incremental response to the marketing action.
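The basic uplift calculation defined above can be sketched directly as the difference in response rates; the counts used here are invented.

```python
# Uplift sketch: difference in response rates between a treated group and a
# randomized control group. All counts are illustrative assumptions.
treated_responses, treated_size = 220, 4000   # customers who received the campaign
control_responses, control_size = 130, 4000   # randomized control group

treated_rate = treated_responses / treated_size
control_rate = control_responses / control_size
uplift = treated_rate - control_rate

print(f"treated: {treated_rate:.1%}, control: {control_rate:.1%}, uplift: {uplift:.1%}")
```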

CONCLUSION

The detection of function clones in software systems is valuable for code-adaptation and error-checking maintenance activities. This assignment presents an efficient metrics-based data-mining clone-detection approach. First, metrics are collected for all functions in the software system.

A data mining algorithm, fractal clustering, is then utilized to partition the software
system into a relatively small number of clusters. Each of the resulting clusters encapsulates
functions that are within a specific proximity of each other in the metrics space. Finally, clone
classes, rather than pairs, are easily extracted from the resulting clusters. For large software
systems, the approach is very space efficient and linear in the size of the data set. Evaluation
is performed using medium and large open source software systems. In this evaluation, the
effect of the chosen metrics on the detection precision is investigated.

REFERENCES

http://www.marketingprofs.com/articles/2010/3567/the-nine-most-common-data-mining-techniques-used-in-predictive-analytics
http://www.networkworld.com/article/2231920/microsoft-subnet/data-mining-your-performance-metrics---uncover-that-nugget---.html
http://trrjournalonline.trb.org/doi/abs/10.3141/2072-15?journalCode=trr
http://dl.acm.org/citation.cfm?id=1014063
