Data Partitioning
Frans Coenen
Department of Computer Science, The University of Liverpool
Liverpool, L69 3BX, UK
[email protected]
Abstract
This paper explores and demonstrates (by experiment) the capabilities of a Multi-Agent
Data Mining (MADM) system in the context of parallel and distributed Data Mining
(DM). The exploration is conducted by considering a specific parallel/distributed DM
scenario, namely data partitioning to achieve parallel/distributed ARM. To facilitate the
partitioning a compressed set enumeration tree data structure (the T-tree) is used together
with an associated ARM algorithm (Apriori-T). The aim of the scenario is to demonstrate
that the MADM vision is capable of exploiting the benefits of parallel computing;
particularly parallel query processing and parallel data accessing. In addition the
approach described offers significant advantages with respect to computational efficiency
when compared to alternative mechanisms for (a) dividing the input data between
processors (agents) and (b) achieving distributed/parallel ARM.
Introduction
A common feature of most DM tasks is that they are resource intensive and operate on
large sets of data. Data sources measured in gigabytes or terabytes are quite common in
DM. This has created a demand for fast DM algorithms that can mine very large databases in a
reasonable amount of time. However, despite the many algorithmic improvements
proposed for serial algorithms, the size and dimensionality of many databases make
mining them on a single processor too slow to be practical. There is therefore a growing
need to develop efficient parallel DM algorithms that can run on distributed systems.
There are several ways in which data distribution can occur, and these require different
approaches to model construction, including:
• Horizontal Data Distribution. The most straightforward form of distribution is
horizontal partitioning, in which different records are collected at different sites, but each
record contains all of the attributes for the object it describes. This is the most common
and natural way in which data may be distributed. For example, a multinational company
deals with customers in several countries, collecting data about different customers in
each country. It may want to understand its customers worldwide in order to construct a
global advertising campaign.
• Vertical Data Distribution. The second form of distribution is vertical partitioning, in
which different attributes of the same set of records are collected at different sites. Each
site collects the values of one or more attributes for each record and so, in a sense, each
site has a different view of the data. For example, a credit-card company may collect data
about transactions by the same customer in different countries and may want to treat the
transactions in different countries as different aspects of the customer's total card usage.
Vertically partitioned data is still rare, but it is becoming more common and important
[85].
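The two forms of partitioning can be illustrated with a small sketch. The record layout, attribute names, and data below are invented for illustration and are not taken from the scenario datasets:

```python
# Illustrative sketch of horizontal vs. vertical partitioning of a
# small relational dataset. Attribute names and values are invented.

records = [
    # (customer_id, country, spend)
    (1, "UK", 120.0),
    (2, "UK", 75.5),
    (3, "FR", 200.0),
    (4, "FR", 10.0),
]

def horizontal_partition(rows, n_sites):
    """Each site holds complete records (all attributes) for a
    subset of the rows."""
    return [rows[i::n_sites] for i in range(n_sites)]

def vertical_partition(rows, attr_groups):
    """Each site holds the values of a subset of the attributes for
    every row, plus the row index so records can be re-joined."""
    return [
        [(i,) + tuple(row[a] for a in attrs) for i, row in enumerate(rows)]
        for attrs in attr_groups
    ]

h = horizontal_partition(records, 2)
# Site 0 sees attributes (id, country); site 1 sees (id, spend).
v = vertical_partition(records, [(0, 1), (0, 2)])
```

Horizontal partitions can be mined independently and their counts summed; vertical partitions need the row index (or a shared key) so that the per-site views can be related back to the same records.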
This paper also addresses a second generic MADM scenario, that of distributed/parallel
DM. This scenario assumes an end user who owns a large data set and wishes to obtain
DM results but lacks the required resources (i.e. processors and memory). The data set is
partitioned into horizontal or vertical partitions that can be distributed among a number of
processors (agents) and independently processed, to identify local itemsets, on each
processor.
In the exploration of the applicability of MADM to parallel/distributed ARM, two
parallel ARM approaches, both based on the Apriori algorithm, are described and their
performance evaluated as indicated above. Recall that DATA-HS makes use of a
horizontal partitioning of the data. The data is apportioned amongst the processors (agents),
typically by horizontally segmenting the dataset into sets of records.
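The DATA-HS idea (segment the records horizontally, mine each segment locally, combine the results) can be sketched as follows. This is a minimal count-and-sum illustration; the actual task builds and collates local T-trees rather than flat count tables, and the function names are assumptions:

```python
# Sketch of horizontal segmentation for ARM: each worker counts
# itemset supports on its own segment of records, and the task agent
# sums the local counts to obtain the global supports.
from collections import Counter
from itertools import combinations

transactions = [
    {"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "c"}, {"a", "b", "c"},
]

def segment(data, n):
    """Horizontal segmentation: contiguous blocks of records."""
    size = (len(data) + n - 1) // n
    return [data[i * size:(i + 1) * size] for i in range(n)]

def local_counts(seg, k):
    """Support counts for all k-itemsets occurring in one segment."""
    counts = Counter()
    for t in seg:
        for itemset in combinations(sorted(t), k):
            counts[itemset] += 1
    return counts

def global_frequent(data, n_workers, k, min_support):
    total = Counter()
    for seg in segment(data, n_workers):  # each call stands in for one worker
        total += local_counts(seg, k)
    return {i: c for i, c in total.items() if c >= min_support}

freq = global_frequent(transactions, 3, 2, 3)
```

Because support counts are additive over disjoint record sets, summing the local counts yields exactly the global supports.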
DATA-VP makes use of a vertical partitioning approach to distributing the input dataset
over the available number of DM (worker) agents. To facilitate the vertical data
partitioning the tree data structure, described in Paper 5, Section 5.3, is again used
together with the Apriori-T ARM algorithm [31]. Using both approaches each partition
can be mined in isolation, while at the same time taking into account the possibility of the
existence of frequent itemsets dispersed across two or more partitions. In the first
approach, DATA-HS, the scenario complements the meta ARM scenario described in the
previous paper.
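A minimal sketch of the vertical idea, assuming each worker is allocated a contiguous range of items and is responsible for every candidate itemset whose smallest item falls in its range. This is a simplified stand-in for handing each worker a set of T-tree branches, not the Apriori-T implementation:

```python
# Sketch of vertical partitioning for ARM: the ordered item list is
# split into ranges, and each worker counts every candidate itemset
# whose first (smallest) item lies in its range. Each itemset is
# therefore counted at exactly one worker.
from collections import Counter
from itertools import combinations

transactions = [
    {"a", "b", "c", "d"}, {"a", "c"}, {"b", "d"}, {"a", "b", "d"},
]
items = sorted({i for t in transactions for i in t})  # ["a","b","c","d"]

def item_ranges(all_items, n_workers):
    """Split the ordered item list into contiguous ranges."""
    size = (len(all_items) + n_workers - 1) // n_workers
    return [set(all_items[i * size:(i + 1) * size]) for i in range(n_workers)]

def worker_counts(my_items, data, k):
    """Count k-itemsets whose smallest item belongs to this worker."""
    counts = Counter()
    for t in data:
        for itemset in combinations(sorted(t), k):
            if itemset[0] in my_items:
                counts[itemset] += 1
    return counts

ranges = item_ranges(items, 2)   # worker 0: {a, b}, worker 1: {c, d}
merged = Counter()
for r in ranges:                 # each iteration stands in for one worker
    merged += worker_counts(r, transactions, 2)
```

Since every itemset has exactly one smallest item, the workers' result sets are disjoint and can be concatenated without any reconciliation step.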
The rest of this paper is organized as follows: In Section 6.1 some background on data
distribution and the motivation for the scenario is described. Data partitioning is
introduced in Section 6.2. Data partitioning may be achieved in either a horizontal or
vertical manner. In Section 6.3 the parallel/distributed ARM task using the Data Horizontal
Segmentation (DATA-HS) algorithm is described. Before describing the vertical
approach, the Apriori-T algorithm is briefly introduced in Section 6.4.
The parallel/distributed task with Data Vertical Partitioning (DATA-VP) algorithm
(which is founded on Apriori-T) is then described in Section 6.5. The DATA-VP MADM
task architecture and network configuration is presented in Section 6.6. Experimentation
and Analysis, comparing the operation of DATA-HS and DATA-VP, is then presented in
Section 6.7. Discussion of how this scenario addresses the goal of this paper is presented
in Section 6.8. Finally a summary is given in Section 6.9.
The DATA-VP task architecture shown in Figure 6.2 assumes the availability of at least
one worker (DM agent), preferably more. Figure 6.2 shows the assumed distribution of
agents and shared data across the network. The figure also shows the house-keeping
JADE agents (AMS and DF) through which agents find each other.
Messaging
Parallel/distributed ARM tends to entail a substantial exchange of messages as the task
proceeds. Messaging represents a significant computational overhead, in some cases
outweighing any other advantage gained. The number of messages sent and the
size of each message are typically the significant factors affecting performance. It is
therefore expedient, in the context of the techniques described here, to minimize the
number of messages that are required to be sent as well as their size.
Figure 6.2: Parallel/Distributed ARM Model for DATA-VP Task Architecture

The technique described here is a One-to-Many approach, where only the task
agent can send/receive messages to/from DM agents. This involves fewer operations, although the
significance of this advantage decreases as the number of agents used increases.
To evaluate the two approaches, in the context of the EMADS vision, a number of
experiments were conducted. These are described and analysed in this section.
The experiments presented here used up to six data partitions and two artificial datasets:
(i) T20.D100K.N250.num, and (ii) T20.D500K.N500.num, where T = 20 (average
number of items per transaction), D = 100K or D = 500K (number of transactions), and
N = 250 or N = 500 (number of items) respectively. The datasets were generated using the
IBM Quest generator used in Agrawal and Srikant [2].
As noted above the most significant overhead of any parallel/distributed system is the
number and size of messages sent and received between agents. For the DATA-VP
EMADS approach, the number of messages sent is independent of the number of levels
in the T-tree; communication takes place only at the end of the tree construction, when
DATA-VP passes entire pruned local (sub) T-tree branches.
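This "communicate once, at the end" pattern can be sketched as follows, using a nested dictionary as a simplified stand-in for a compressed set-enumeration (T-) tree; the structure and names are assumptions for illustration:

```python
# Sketch of the end-of-run communication pattern: a worker builds a
# local tree of itemset counts, prunes it against the support
# threshold, and ships the whole pruned branch back in one message.
# The nested dict (item -> (count, child branch)) is a simplified
# stand-in for a T-tree branch.

def prune(branch, min_support):
    """Keep only nodes whose count meets the threshold; pruning a
    node removes its whole subtree (the Apriori closure property)."""
    return {
        item: (count, prune(children, min_support))
        for item, (count, children) in branch.items()
        if count >= min_support
    }

# A worker's local branch before pruning.
local_branch = {
    "a": (4, {"b": (3, {"c": (1, {})}), "c": (2, {})}),
    "b": (1, {}),
}

message = prune(local_branch, 2)  # the single end-of-run message
```

Shipping the pruned branch whole keeps the number of messages independent of the number of tree levels, at the cost of a potentially large single message.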
Figure 6.3: Average Execution Time for Dataset T20.D100K.N250.num, (a) Number of Data Partitions, (b) Support Threshold

Figure 6.4: Average Execution Time for Dataset T20.D500K.N500.num, (a) Number of Data Partitions, (b) Support Threshold
Therefore, DATA-VP has a clear advantage in terms of the number of messages sent.
Figure 6.3 and Figure 6.4 show the effect of increasing the number of data partitions with
respect to a range of support thresholds. As shown in Figure 6.3 the DATA-VP algorithm
shows better performance compared to the DATA-HS algorithm. This is largely due to
the smaller size of the dataset and the T-tree data structure which: (i) facilitates vertical
distribution of the input dataset, and (ii) readily lends itself to parallelization/distribution.
However, when the data size is increased, as in the second experiment, and further DM
(worker) agents are added (increasing the number of data partitions), the results shown
in Figure 6.4 indicate that the growing message-size overhead outweighs any gain
from using additional agents, so that parallelization/distribution becomes counter-productive.
In this case DATA-HS benefited more from the addition of further
data agents than the DATA-VP approach.
Discussion
MADM can be viewed as an effective distributed and parallel environment where the
constituent agents function autonomously and (occasionally) exchange information with
each other. EMADS is designed with asynchronous, distributed communication protocols
that enable the participating agents to operate independently and collaborate with other
peer agents as necessary, thus eliminating centralized control and synchronization
barriers.
Distributed and parallel DM can improve both efficiency and scalability: first, by
executing the DM processes in parallel, improving run-time efficiency; and second, by
applying the DM processes to smaller subsets of data that are properly partitioned and
distributed so as to fit in main memory (a data reduction technique).
The scenario, described in this paper, demonstrated that MADM provides suitable
mechanisms for exploiting the benefits of parallel computing; particularly parallel data
processing. The scenario also demonstrated that MADM is suitable for re-usability and
illustrated how it is supported by re-employing the meta ARM task agent, described in
the previous paper, with the DATA-HS task.
Conclusion
In this paper a MADM method for parallel/distributed ARM has been described so as to
explore the MADM issues of scalability and re-usability. Scalability is explored by
parallel processing of the data and re-usability is explored by reemploying the meta ARM
task agent with the DATA-HS task.
The solution to the scenario considered in this paper made use of a vertical data
partitioning or a horizontal data segmentation technique to distribute the input data
amongst a number of agents. In the horizontal data segmentation (DATA-HS) method,
the dataset was simply divided into segments each comprising an equal number of
records. Each segment was then assigned to a data agent, which allowed the meta
ARM task to be reused when employed on EMADS. Each DM agent then used its local data agent to
generate a complete local T-tree for its allocated segment. Finally, the local T-trees were
collated into a single tree which contained the overall frequent itemsets. The proposed
vertical partitioning (DATA-VP) was facilitated by the T-tree data structure, and an
associated mining algorithm (Apriori-T), that allowed for computationally effective
parallel/distributed ARM when employed on EMADS.
The reported experimental results showed that the data partitioning methods described
are extremely effective in limiting the maximal memory requirements of the algorithm,
while their execution times scale only slowly and linearly with increasing data
dimensions. Their overall performance, in both execution time and especially memory
requirements, represents a significant improvement.