Measures in Data Mining - Categorization and Computation

Last Updated : 13 Feb, 2025

In data mining, measures are quantitative tools used to extract meaningful information from large datasets. They summarize, describe, and analyze data to support decision-making and predictive analytics. Measures capture aspects of data such as central tendency, variability, and overall distribution, enabling businesses to derive actionable insights. They are pivotal in transforming raw data into understandable statistics for strategic planning and operational efficiency.

Categories of Measures

Data mining measures are essential tools for analyzing and extracting valuable insights from complex datasets. These measures can be categorized into three main types based on the aggregate functions they employ:

Holistic

An aggregate function is said to be holistic if there is no constant bound on the amount of storage needed to describe a sub-aggregate; that is, it cannot be expressed as an algebraic function with a bounded number of arguments.

For example, median(), rank(), and mode() are holistic aggregate functions, and any measure computed using a holistic aggregate function is a holistic measure. Because holistic measures are expensive to compute exactly, most cube applications that work with large amounts of data demand fast computation of distributive and algebraic measures instead.
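A minimal sketch (using Python's standard library and made-up numbers) of why median() is holistic: the medians of two partitions cannot, in general, be combined into the median of the whole dataset, so no fixed-size sub-aggregate suffices.

```python
import statistics

data = [1, 2, 3, 100, 101]
part1, part2 = data[:2], data[2:]          # split into two partitions

# Trying to combine the partition medians does NOT recover the true median:
median_of_medians = statistics.median(
    [statistics.median(part1), statistics.median(part2)]
)
true_median = statistics.median(data)

print(median_of_medians, true_median)      # the two values differ
```

To merge partitions correctly, all the raw values would have to be kept, which is exactly the unbounded storage requirement that makes median() holistic.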

Distributive

An aggregate function is said to be distributive if it can be computed in a distributed manner, as follows. Suppose the data are partitioned into n sets and the function is applied to each partition, resulting in n aggregate values. If the result obtained by applying the function to the n aggregate values is identical to the result obtained by applying the function to the entire dataset (without partitioning), the function can be computed in a distributed manner and is distributive.

For example, count() for a data cube can be calculated by partitioning the cube into a group of sub-cubes, computing count() for each sub-cube, and then summing the partial counts to get the total. So we can conclude that count() is a distributive aggregate function.

Any measure is said to be distributive if it is obtained by applying a distributive aggregate function.

Examples: sum(), count(), min(), max()
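The distributed computation of count() described above can be sketched as follows (a toy example with an arbitrary three-way partition):

```python
data = list(range(1, 11))                  # the full dataset: 1..10
partitions = [data[0:4], data[4:7], data[7:10]]

# Apply count() to each partition, then combine the sub-aggregates.
partial_counts = [len(p) for p in partitions]
total_count = sum(partial_counts)

# Identical to applying count() to the whole, unpartitioned dataset.
print(total_count == len(data))            # True
```

The same pattern works for sum() (sum the partial sums) and min()/max() (take the min/max of the partial results), which is why all of them are distributive.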

Algebraic

An aggregate function is said to be algebraic if it can be computed by an algebraic function with M arguments (where M is a bounded positive integer), each of which is obtained by applying a distributive aggregate function.

Consider the average function, avg(). It is computed as sum()/count(). Both sum() and count() are distributive aggregate functions, and their ratio is an algebraic function. Similarly, min_N() and max_N() (which return the N smallest and N largest values, respectively) and standard_deviation() are algebraic. Any measure obtained by applying an algebraic aggregate function is called an algebraic measure.

Examples: average(), min_N(), max_N(), standard_deviation(), center_of_mass()
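A small sketch (illustrative numbers only) of avg() as an algebraic measure: each partition contributes a bounded-size sub-aggregate, the pair (sum, count), and the final average is a simple function of the combined pairs.

```python
data = [4.0, 8.0, 15.0, 16.0, 23.0, 42.0]
partitions = [data[:3], data[3:]]

# Each partition stores a bounded-size sub-aggregate: (sum, count).
sub_aggregates = [(sum(p), len(p)) for p in partitions]

# avg() is an algebraic function of the two distributive results.
total_sum = sum(s for s, _ in sub_aggregates)
total_count = sum(c for _, c in sub_aggregates)
average = total_sum / total_count

print(average == sum(data) / len(data))    # True
```

Note the contrast with the holistic case: here two numbers per partition are enough, no matter how large the partition is.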

Computation of Measures

In data mining, the computation of measures is a fundamental step that involves applying specific mathematical formulas and algorithms to analyze and summarize large datasets. This process helps in identifying patterns, trends, and anomalies within the data, which are critical for making informed decisions. The typical process for computing measures in data mining includes the following steps:

  • Collection and Pre-processing: Data is first collected from various sources and pre-processed to ensure it is clean and organized. This step often involves removing duplicates, handling missing values, and converting data into a suitable format for analysis.
  • Selection of Measures: Depending on the analysis goals, appropriate measures are selected. For instance, distributive measures like sum and count might be used for preliminary analysis, while algebraic measures like average and variance are used for understanding data distribution.
  • Application of Formulas: Specific mathematical formulas are applied to the data. For example, the mean is calculated by summing all data points and dividing by the number of points, while standard deviation measures the amount of variation or dispersion from the average.
  • Aggregation and Analysis: In cases involving large datasets, measures may need to be aggregated from different data subsets, which involves combining results from multiple computations to form a coherent analysis.
  • Interpretation and Reporting: The results are then interpreted to derive insights. This might include comparing the computed measures against benchmarks or historical data to gauge performance or identify trends.
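The steps above can be sketched end to end with Python's standard library (the readings and the 2-standard-deviation threshold are made up for illustration):

```python
import statistics

# 1. Collection: hypothetical raw readings (with missing values)
raw = [12.0, None, 15.0, 15.0, 18.0, None, 21.0]

# 2. Pre-processing: drop missing values
clean = [x for x in raw if x is not None]

# 3-4. Selection and application of measures
count = len(clean)                      # distributive
mean = statistics.mean(clean)           # algebraic: sum()/count()
stdev = statistics.stdev(clean)         # algebraic
median = statistics.median(clean)       # holistic

# 5. Interpretation: flag values more than 2 standard deviations from the mean
outliers = [x for x in clean if abs(x - mean) > 2 * stdev]
print(count, mean, median, outliers)
```

In a real pipeline the aggregation step would combine such sub-aggregates across many partitions, as shown in the distributive and algebraic sections above.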
