Load Balancing Query Processing in Metric-Space Similarity Search

All content following this page was uploaded by Veronica Gil Costa on 12 September 2014.
Abstract—Metric-space similarity search has been proven suitable for searching large collections of complex objects such as images. A number of distributed index data structures and respective parallel query processing algorithms have been proposed for clusters of distributed memory processors. Previous work has shown that the best performance is achieved when using global indexing as opposed to local indexing. However, global indexing is prone to performance degradation when the query load becomes unbalanced across processors. This paper proposes a query scheduling algorithm that solves this problem. It adaptively load balances the processing of user queries that are dynamically skewed towards particular sections of the distributed index. Sections highly hit by queries can be kept replicated. Experimental results show that with 1%–10% replication, performance improves significantly (e.g., 35%) under skewed workloads.

Keywords- distributed metric space search; load balance

I. INTRODUCTION

New applications of Web search engines require search operations to be performed on data objects that are more complex than plain text. Metric spaces have proven useful to search complex data objects such as images [4]. In this case, search queries are represented by an object of the same type as those in the database wherein, for instance, one is interested in retrieving the top-k objects that are the most similar to the query. A data structure is used to index the database objects to speed up query processing by reducing the number of database objects compared against the query.

On the other hand, dealing efficiently with multiple user queries, each potentially at a different stage of execution at any time instant, requires the use of suitable parallel computing techniques. User queries tend to be highly skewed along time in an unpredictable and dynamic way, and thereby performing parallel query processing in a balanced manner on a distributed index data structure can be difficult to achieve. The contribution of this paper is the proposal of a query scheduling algorithm devised to overcome this difficulty.

To our knowledge there is no previous work on load balancing this kind of information retrieval application under our specific computational setting. That is, search engines composed of a query receptionist machine that continuously schedules query processing on a large set of processors, where each processor holds a portion of the metric-space index and a subset of the database objects. Similar work could be [16], though it is for text search, and we compare our proposal against an adaptation of that approach as a baseline case. Note that we aim at designing a dedicated search service where processors are not shared by other applications and no virtualization technology is used, so that a highly optimized service is achieved. Queries must be processed in an on-line manner, which rules out systems for off-line parallel computation such as MapReduce/Hadoop. Scheduling has to be performed as new queries arrive, and thereby it must be performed very fast to prevent the query receptionist machine from becoming a bottleneck. There is no margin for iterative task re-assignment computations or expensive communication actions related to re-allocating objects or index sections among processors. Indeed, a given data re-allocation action may no longer be useful a few steps ahead in query processing as new queries shift to a different topic trend.

The proposed scheduling algorithm meets the on-line requirement as it executes a few operations on integers per query to decide in what order each query must visit the processors. It also introduces a modest increase in inter-processor communication by including an integer value in each message. The algorithm is able to dynamically follow changes in query trends that constantly shift the points of imbalance. It does so based on the notion of simulating computations that are inherently asynchronous by means of a bulk-synchronous counterpart, whose well-defined structure allows quantification of alternative execution plans for each active query. To avoid combinatorial explosion, we use as a primitive operation the greedy heuristic "the least loaded processor first" [13] (we can also use alternative heuristics for on-line task assignment [8]).

The remaining sections are organized as follows. Section 2 describes metric-space search and Section 3 reviews related work. Section 4 presents the proposed algorithm. Section 5 contains experiments and Section 6 presents conclusions.
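As an aside, the greedy primitive "the least loaded processor first" mentioned above can be illustrated with a small sketch; the processor count and the task costs below are hypothetical and are not taken from the paper:

```python
# Illustrative sketch (not the paper's implementation): the "least
# loaded processor first" greedy primitive assigns each incoming task
# to the processor with minimum accumulated load.
def assign_least_loaded(load, cost):
    proc = min(range(len(load)), key=load.__getitem__)  # ties go to lowest id
    load[proc] += cost
    return proc

load = [0] * 4                       # four processors, initially idle
for cost in [3, 1, 2, 2, 1, 4]:      # hypothetical task costs
    assign_least_loaded(load, cost)
```

After the six assignments above the per-processor loads are [3, 6, 2, 2]; the heuristic is cheap (one scan per task), which is what makes it usable on-line at the receptionist machine.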
II. METRIC-SPACE SEARCH

A metric space (U, d) is composed of a universe of valid objects U and a distance function d : U × U → R+ defined among them. The distance function determines the similarity between two given objects. The goal is, given a set of objects and a query, to retrieve all objects close enough to the query. This function holds several properties: strict positiveness (d(x, y) > 0, and if d(x, y) = 0 then x = y), symmetry (d(x, y) = d(y, x)), and the triangle inequality (d(x, z) ≤ d(x, y) + d(y, z)). In this setting, a database (instance) X is simply a finite collection of objects from U. The finite subset X ⊂ U, with size n = |X|, is called the database and represents the collection of objects.

There are two main queries of interest: (a) range search, which retrieves all the objects u ∈ X within a radius r of the query q, that is, R_X(q, r) = {u ∈ X : d(q, u) ≤ r}; and (b) k-nearest neighbors search, which retrieves the set kNN_X(q, k) ⊆ X such that |kNN_X(q, k)| = k and, for every u ∈ kNN_X(q, k) and every v ∈ X \ kNN_X(q, k), it holds that d(q, u) ≤ d(q, v). Usually, k-nearest neighbors queries are implemented upon range queries.

In this paper we work with an index data structure called the List of Clusters (LC) [6]. In [12], it has been shown that the LC enhanced with a pivoting scheme can significantly outperform well-known alternative metric-space indexes. The LC data structure is formed from a set of centers (objects) as follows. We first choose a "center" c ∈ X and a radius rc. The center ball (c, rc) is the subset of elements of X which are at distance at most rc from c. We define I_{X,c,rc} = {u ∈ X − {c} : d(c, u) ≤ rc} as the cluster of internal elements which lie inside the center ball (c, rc), and E_{X,c,rc} = {u ∈ X : d(c, u) > rc} as the external elements. The clustering process is recursively applied in E. The work in [6] shows that the best strategy is to choose the next center as the element that maximizes the sum of distances to the previous centers.

There are two simple ways to divide the space into clusters: taking a fixed radius for each partition or using a fixed size. To ensure good load balance across processors, we consider partitions with a fixed size of k elements; thus the radius rc of a cluster with center c is the distance between c and its k-th nearest neighbor. For practical reasons we actually store into a cluster at most k elements. This is because during search it is more efficient to have all elements at distance rc from a center c either inside the cluster or outside it, but never some part inside and the remaining part outside of the cluster.

During the processing of a search query q with radius r, the idea is that if the first center is c and its radius is rc, we evaluate d(q, c) and add c to the result set if d(q, c) ≤ r. Then, we scan exhaustively the cluster I only if the query ball (q, r) intersects the center ball (c, rc). Next, we continue with the set E recursively. However, because of the asymmetry of the data structure, we can stop the search before traversing the whole list of clusters: if the query ball (q, r) is totally and strictly contained in the center ball (c, rc), we do not need to traverse E, since the construction process ensures that all the elements inside the query ball (q, r) have been inserted in I (as shown in line 6 of Figure 1).

Search(LC, q, r)
1. If LC is empty Then Return
2. Let LC = (c, rc, I) : E
3. Compute the distance d(c, q)
4. If d(c, q) ≤ r Then Add c to the set of results
5. If d(c, q) ≤ rc + r Then Search I exhaustively
6. If d(c, q) > rc − r Then Search(E, q, r)

Figure 1. LC search algorithm.

Figure 2. The influence zone of centers c1, c2 and c3.

Figure 2 shows three clusters with centers, in order of construction, c1, c2 and c3, and a query q with radius r. In this example the search algorithm computes d(q, c1) and determines that d(q, c1) > r1 + r. Continuing with the next cluster, the algorithm searches inside the cluster c2 after verifying that the cluster intersects the query ball. Remember that the objects found at the intersection of clusters c2 and c3 are stored only in c2, so the algorithm does not produce duplicate results. Finally the algorithm repeats the search process over the cluster c3.

We assume a parallel processing architecture in which a receptionist machine receives queries from users and evenly distributes their processing onto the processors. We call the receptionist machine the broker. The processors work cooperatively to produce the query answers and pass the results back to the broker. The work in [12] studied various forms of parallelization of the LC strategy, concluding that a global indexing strategy called GG, which stands for Global Index and Global Centers, achieves the best performance.

The GG strategy builds in parallel an LC index for the whole database and distributes the clusters of the LC data structure uniformly at random onto the processors. Upon reception of a query q, the broker sends it to a circularly selected processor. This processor becomes the ranker for the query. It calculates the query plan, namely the list of GG clusters that intersect q. To this end, it broadcasts the query to all processors and they calculate in parallel a fraction 1/P of the query plan. Then they send their nq/P pieces of the global plan to the ranker, which merges them to get the global plan with clusters sorted in construction order.
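To make the construction and the search of Figure 1 concrete, the following is a minimal single-machine sketch under illustrative assumptions: 1-D points with d(x, y) = |x − y|, and a simplified center-selection rule (the paper instead selects the element maximizing the sum of distances to previous centers):

```python
# Minimal sketch of a List of Clusters over 1-D points with
# d(x, y) = |x - y|. Clusters hold at most k elements; rc is the
# distance from the center to its k-th nearest remaining element.
# The center choice here is simplified and purely illustrative.
def lc_build(points, k):
    clusters, rest = [], sorted(points)
    while rest:
        c, rest = rest[0], rest[1:]          # simplified center choice
        rest.sort(key=lambda u: abs(u - c))  # nearest elements join the cluster
        bucket, rest = rest[:k], rest[k:]
        rc = abs(bucket[-1] - c) if bucket else 0
        clusters.append((c, rc, bucket))
    return clusters

def lc_range_search(clusters, q, r):
    out = []
    for c, rc, bucket in clusters:           # clusters in construction order
        d = abs(q - c)
        if d <= r:                           # line 4 of Figure 1
            out.append(c)
        if d <= rc + r:                      # line 5: query ball meets center ball
            out += [u for u in bucket if abs(u - q) <= r]
        if d <= rc - r:                      # line 6: ball strictly inside, stop
            break
    return out
```

The early exit mirrors line 6 of Figure 1: once the query ball lies strictly inside a center ball, the construction guarantees that no later cluster can contribute results.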
The ranker sends the query q and its plan to the processor i containing the first cluster to be visited, namely the first cluster in the building order that intersects q. This processor i goes directly to the GG clusters that intersect q, compares q against the database objects stored in them, and returns to the ranker the objects that are within the range of q. After processor i finishes, it sends the results to the ranker. The remaining part of the query plan is passed to the next processor j and so on, until the processing of the query is completed. Thus, each processor acts both as a ranker of a different subset of all queries and as a processor in charge of processing GG clusters by means of multi-threading.

III. RELATED WORK

Distributed metric-space query processing was first studied in [19]. Four alternative query processing methods are proposed for local indexing. In this case, the dataset is evenly distributed on P processors and a local index is constructed in each processor. The first query processing method sends the query to all P processors and they reply with the top-k results, which are computed using their local indexes. The second method visits each processor sequentially and in each visit determines whether there are better results than the current top-k ones. This tends to reduce the number of results sent to the query receptionist machine. The third method sends the query to f < P processors and determines the best results among the f · k results collected from those processors. Then the other P − f processors are contacted to determine whether they are able to produce better results than the current global top-k ones for the query. The fourth method performs iterations by asking each processor for the top k/P results in each iteration. As database objects are distributed at random, there is a high probability that the global top-k results will be determined in few iterations.

The work in [19] was extended in [12] for the context of a clustering-based index data structure. Several alternatives for index distribution onto processors were studied, concluding that global indexing achieves better performance than local indexing because it significantly reduces the average number of processors contacted per query. Global indexing refers to a single index that is constructed by considering the whole set of database objects and then evenly distributing the index onto the processors. In a way, global indexing helps the receptionist machine to quickly select the above f processors most likely to contain the global top-k results, and makes it very unlikely to go further on the remaining P − f processors. However, a drawback of global indexing is the potential for processor imbalance, which arises when many queries tend to hit the same few processors.

The work in [14] proposes solving the imbalance of global indexing with an off-line method to distribute the index onto processors. However, this requires very expensive O(n²) pre-processing of the n-sized index. In addition, this solution fails when user queries dynamically change focus along time. As opposed to [14], the solution we propose is based on the use of a fast on-line query scheduling algorithm at the query receptionist machine. It achieves better performance than [14] and its index construction cost is very small. The work in [16] applies, for text, the "least loaded processor first" heuristic on global indexing for inverted files. The achieved performance is not better than local indexing. In our application domain, global indexing is more efficient than local indexing provided that imbalance is kept under control.

Load balancing, among other optimizations for multi-dimensional indexing in peer-to-peer (p2p) systems, has been comprehensively reviewed in [25]. Regarding load balance in general p2p systems, there are approaches based on data replication (cf. [20]), namespace balancing (cf. [17]), virtual servers (cf. [26]), node migration (cf. [10]), multiple hash functions (cf. [2]), path replication (cf. [3]), lookup balance (cf. [7]), and consistent-hashing-based methods like in [21]. Work specific to p2p metric spaces has been presented in [15], [18], [9], [24]. Also in [5] there is work that considers moving data around processors to balance workload. On the other hand, there are metric-space algorithms developed for shared memory systems [11] and GPU systems [1]. Techniques for executing similarity queries on cloud infrastructure can be found in [23].

IV. PROPOSED SCHEDULING ALGORITHM

The query receptionist machine (broker) schedules query processing onto processors by using information from a simulation of a bulk-synchronous parallel (BSP) computer [22]. This simulation is performed on-the-fly by the broker by feeding the simulated BSP computer with the actual stream of incoming queries. BSP computations are kept synchronized with the computations performed by actual processors by collecting a few low-cost statistics from them.

In a BSP computer, parallel computation is organized as a sequence of supersteps. During each superstep, processes in processors may only perform computations on local data and/or send messages to other processes. These messages are available for processing at their destinations by the start of the next superstep. Each superstep is ended with the barrier synchronization of all BSP processes. Multi-threading can occur in each superstep and it is efficient to have one multi-threaded BSP process in each multi-core processor.

Depending on query traffic intensity, processors can perform query processing in either asynchronous (Async) or synchronous (Sync) modes of parallel processing [12]. The Async mode, best suited for low query traffic, performs standard asynchronous message passing among multi-threaded processors, where threads concurrently perform computations of single queries (i.e., one thread per query at a time). The Sync mode, best suited for high query traffic, processes batches of queries by using all threads to process each query in parallel and performing communication after ending each
batch. We use BSP to implement the Sync mode, though the broker simulates BSP in either mode.

The BSP computer simulated by the broker processes queries in a round-robin fashion across supersteps by letting each active query process one GG cluster per superstep per processor. The results of the simulation are used to define query plans, which contain information indicating in which round-robin iteration (superstep) the visits to the index clusters must take place. We also replicate a few GG clusters to reduce the imbalance caused by clusters that are required ("visited") by very frequent queries.

We define the plan of a query q as a sequence of nq tuples [ci, d(q, ci), pi, si] with i = 1 ... nq, where (a) nq is the total number of clusters to be visited by the query q, (b) ci is the center/cluster id, (c) d(q, ci) is the distance between q and the center ci, (d) pi is the processor in which the cluster is hosted, and (e) si is the superstep at which the scheduler determines the cluster ci must be visited.

The tuple elements ci and d(q, ci) are determined by the global index GG algorithm, whereas si is set by the scheduler, and pi is also set by the scheduler when the cluster is replicated in two or more processors. For clusters ci that are not replicated, the values pi are fixed beforehand, since it is known in which processors the index clusters are hosted.

The broker maintains a data window which registers processor workloads through supersteps. We call it LOAD, and it is implemented as a table with rows indicating supersteps and columns indicating BSP processes (processors). The table cells register the number of clusters visited by queries.

The steps followed by the load balancing algorithm are shown in Figure 3. This is an on-line scheduling algorithm for single queries. The node data structure keeps information on queries. Assuming that for each query the broker knows its sequence [ci, d(q, ci), pi, si] with i = 1 ... nq, the scheduling algorithm properly defines the values of all pairs (si, pi), with items denoted by node.tuples[i].sstep and node.tuples[i].proc respectively.

SCHEDULE()
1.  while( there are pending queries )
      // get slot for new active query.
2.    node ← EXTRACTMINSSTEP(PQ)
      // assign next query plan to the slot.
3.    node.tuples ← GETQUERYPLAN(node)
      // determine execution order.
4.    node.sstep ← ASSIGNSSTEPS(Load, node)
5.    INSERT(PQ, node)

ASSIGNSSTEPS(Load, node)
1.  sstep ← node.sstep                // get initial superstep.
2.  for( i ← 0; i < node.tuples.size(); i ← i + 1 ){
3.    ss ← sstep
4.    while(true){                    // search for a processor and superstep.
5.      Limit ← EFFICIENCYGOAL( Load[sstep][all processors] )
6.      if( node.tuples[i].cluster is replicated )
          proc ← MIN( Load[sstep][proc replicas] )
7.      else proc ← node.tuples[i].proc
8.      if( Load[sstep][proc] < Limit AND sstep − ss < VL )
9.        break
10.     sstep ← sstep + 1             // go to the next superstep.
11.   }                               // end while
12.   node.tuples[i].proc ← proc
13.   node.tuples[i].sstep ← sstep
14.   Load[sstep][proc] ← Load[sstep][proc] + 1
15.   sstep ← sstep + 1
16. }                                 // end for
17. return sstep

Figure 3. Scheduling single queries (one BSP process per processor).

Each query is scheduled to start at an initial superstep and progress forward in the superstep count whilst visiting the clusters in the query plan. Consistently with the round-robin principle, the last superstep of a query signals the start of a new query in the following superstep.

The routine SCHEDULE() keeps a priority queue PQ which enables it to select the node with the least value of final superstep count among the Q active queries under processing. When scheduling a new query, the superstep value of this node is assigned as the starting superstep of the query, and the node is inserted back into the PQ. This time its superstep value is equal to the final superstep of the newly scheduled query.

The routine ASSIGNSSTEPS() performs the actual scheduling and BSP simulation by assigning values to the pairs (si, pi) in accordance with the current workload and advancing the superstep count. This is effected by taking into consideration a BSP efficiency goal of 15% of imbalance. The BSP efficiency for a measure X is defined by the ratio average(X)/maximum(X) ≤ 1 over the P processors or BSP processes. This efficiency defines the maximum number of clusters that may be visited in each processor at a given superstep. The respective value is stored in the Limit variable. If the algorithm cannot assign the cluster to a processor in a given superstep, it tries the next superstep and so on, until either the efficiency goal is achieved or a maximum number of supersteps VL is reached. This last condition is used to prevent indefinite delay in cluster scheduling and is not expected to occur frequently.

The average number of active queries Q, used by the scheduling algorithm to set the value of its Limit and assign work to the processors, depends on the observed query traffic and also on the capacity of the processors to solve queries. The broker is not allowed to saturate processors by sending them more queries than they can process efficiently, so that query throughput equals the query arrival rate at the processors. As soon as the processing of one active query is ended, a new one is introduced in the scheduling algorithm to determine its execution plan and send it to processing.

The value of Q is user query traffic dependent and dynamically changes along time. We calculate it as follows. Let us assume that n queries are completely processed in a period of ∆ units of time. These queries arrived at different
time instants to the broker machine during the period ∆, and they also finished at different instants, for ∆ long enough. The observed value of Q during ∆ can be estimated using the G/G/∞ queuing model. Let S be the sum of the differences δq = [DepartureTime − ArrivalTime] over the queries, that is, the sum of the intervals of time elapsed between the arrival of the queries at the broker and the end of their complete processing. Then the average Q is given by S/∆. This is because the number of active servers in a G/G/∞ model is defined as the ratio of the arrival rate of events to the service rate of events (λ/µ). If n queries are received during ∆, then the arrival rate is λ = n/∆ and the service rate is µ = n/S, so Q = λ/µ = S/∆.

For the Sync mode, the supersteps of the processors and the supersteps of the BSP computer simulated by the broker are exactly the same. It is straightforward to keep them tightly synchronized. For the Async mode, instead, it is necessary to establish a synchrony relationship between the fully asynchronous computations of the Async mode and the simulated BSP computer. To solve this problem, we provide the broker with an equivalent of the Sync mode by predicting Sync supersteps from the Async computations as follows.

In the Async mode, the obtained value of Q is the average number of active queries at any time instant (a time at which each of them can be at a different stage of execution). The n queries detected in the period ∆ arrived at different time instants. On the other hand, the broker knows the plan of each query, so it can predict the supersteps to be consumed by each query if they were executed in the Sync mode. Therefore, it is necessary to associate the time instant at which each query starts processing in the Async mode with the superstep count of a hypothetical BSP computer serving exactly the same workload during ∆.

To estimate the total number of Sync supersteps from the Async mode during ∆, we can equip each processor with a local superstep counter Cp. All co-resident threads share and update the local counter Cp. Also, any message m departing from a thread with counter value Cp and addressed to a thread located in another processor carries the value m.Cp = Cp + 1. In addition, any thread receiving a message m with value m.Cp > Cp causes Cp = m.Cp, where Cp is the local superstep counter of the processor hosting the thread. In this way the total number of supersteps is defined by the maximum Cp value considering all processors.

Therefore, our main hypothesis is as follows: since the simulated BSP computer and the actual processors are kept in precise synchrony in either the Async or the Sync mode, the query plans delivered by the proposed scheduling algorithm tend to produce well balanced computations across processors. The next section presents results that validate this claim.

For k-NN queries, the sequence [ci, d(q, ci), pi, si] must be processed in increasing order of the d(q, ci) values. This avoids traversing all clusters indicated in the query plan. In this case, the scheduler removes all remaining clusters from LOAD as soon as the k nearest neighbors are found. Cluster replication can be calculated off-line from previous user queries, or clusters that become popular can be silently replicated by a background process running independently of query processing.

Regarding the time complexity of the scheduling algorithm, it is affected by: (a) the number of replicas Ri ≤ P assigned to the cluster with center ci (if the cluster is not replicated, Ri = 1), (b) the set of clusters Cq visited by each query, where |Cq| = nq, (c) the maximum number of iterations (VL) performed in case the condition LOAD[sstep][proc] < Limit never holds, and (d) the cost O(log Q) of the priority queue. Therefore, an upper bound on the cost of executing the scheduler for a query q is log Q + Σ_{i ∈ Cq} (P · VL) = log Q + nq · P · VL. Certainly this is an overestimate, since a small percentage of GG clusters are replicated and most cases are accommodated within a small range of supersteps because the broker does not saturate the processors. The experimental results confirm this claim.

V. EXPERIMENTS

The results were obtained on a cluster with 120 dual processors. We had exclusive access to 64 processors located distantly in terms of the communication network among nodes and separated from all other nodes. All experiments were run with all data in main memory. We show running time values normalized to 1 for each dataset to better illustrate the differences between strategies. In all cases we divide the values by the observed maximum in the respective experiment. We used C++, MPI and BSPonMPI.

To run the experiments we used two large datasets. The first one is a vocabulary obtained from a 1.5TB sample of the UK Web. In this text sample we found 26,000,000 vocabulary terms. The distance function used to determine the similarity between two terms is the edit distance. It counts the number of insertions, deletions or replacements needed to make two strings identical (a potential use is: "did you mean this term?"). On this dataset we executed an actual query log limited to one term per query. The query log was taken from the Yahoo! Search Engine for queries submitted on the UK Web during 2005.

The second dataset, representing 10,000,000 image objects, was generated synthetically as follows. We took the collection of images from a NASA dataset (http://www.sisap.org/library/dbs/vectors/nasa.tar.gz) containing 40,701 image vectors, and we used it as an empirical probability distribution upon which we generated our random image objects. We call this dataset NASA-2. The query log for this dataset was generated in exactly the same way, i.e., by generating random image objects using the NASA dataset as an empirical probability distribution. As we do not have a real user query log for the NASA dataset, we associate each query image object with a text query from the UK
log, and we replicated it in the log the same number of times as the respective text query. This simulated user preferences for particular subsets of objects in the database.

We emphasize that in the execution of the parallel programs we injected the same total number of queries q in each processor. That is, the total number of queries processed in each experiment reported below is q × P = Q. Thereby running times are expected to grow with P, since the communication hardware has at least log P scalability. Thus in the figures shown below, curves for, say, 16 processors are higher in running time than the ones for 4 processors. We have found this setting useful to see the efficiency of the different strategies in the sense of how well they support the inclusion of more processors/queries to work on a data set of a fixed size N. We ran with k = 128 and 1,000,000 queries.

[...] number of processed queries scales up simultaneously. In this last case, the efficiency lost is only logarithmic (the x-axis is in log2 scale in Figure 4.a). This holds for high query traffic.

On the right side of Figure 4.b we show results obtained by using a standard multi-threaded asynchronous (Async) mode of query processing. The left side is the Sync mode by means of BSP. This figure shows performance for different query traffic conditions, when the search engine is operating in either mode separately, and using either global (GG) or local (LL) indexing. Performance differences between the Sync and Async modes can be significant depending on the query traffic situation. For high traffic the Sync mode outperforms the Async mode. The inverse occurs for low traffic. Both modes can be used in combination, but the relevant point for this paper is that the proposed scheduling algorithm can be employed transparently in both modes.
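As a side note, the assignment loop of ASSIGNSSTEPS (Figure 3) can be sketched sequentially in Python. The Load table layout, the fixed LIMIT and VL values, and the example plan below are illustrative assumptions, and the acceptance rule follows the prose of Section IV (accept a slot when the processor is under the efficiency limit, or force it after VL delayed attempts); this is not the authors' implementation:

```python
from collections import defaultdict

# Sketch of the assignment loop of ASSIGNSSTEPS (Figure 3).
# load[superstep][processor] counts scheduled cluster visits.
P, VL, LIMIT = 4, 4, 2   # illustrative values; the paper derives the
                         # limit from a 15% imbalance efficiency goal.

def assign_ssteps(load, plan, start=0):
    """plan: one candidate-processor list per cluster to visit
    (a singleton for non-replicated clusters). Returns the superstep
    following the query's last cluster visit."""
    sstep = start
    for candidates in plan:
        ss = sstep
        while True:
            proc = min(candidates, key=lambda p: load[sstep][p])  # least loaded replica
            if load[sstep][proc] < LIMIT or sstep - ss >= VL:
                break
            sstep += 1                                            # try next superstep
        load[sstep][proc] += 1
        sstep += 1
    return sstep

load = defaultdict(lambda: [0] * P)            # load[superstep][processor]
end = assign_ssteps(load, [[0], [0, 2], [1]])  # second cluster replicated on 0 and 2
```

Each cluster visit lands one superstep after the previous one (the round-robin principle), sliding further forward only when the targeted processor is already at its per-superstep limit.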
1.2 Certainly, applying our scheduling algorithm in the Sync
NASA-2
1
mode is trivial as broker simulation and actual processors
can execute the same supersteps.
Normalized Running Time
0.8
P REDICTING BSP C OMPUTATIONS : In the following we
0.6 GG
present experimental results that show the ability of the bro-
LL
0.4 ker to predict the computations of the simulated BSP com-
search radius= 0.6 search radius= 0.8 puter from the computations performed by the asynchronous
0.2
processors. We compare the quality of the prediction against
0 an actual BSP implementation of the processors. We call
4 8 18 32 64 4 8 16 32 64
Number of Processors
the two parallel realizations as Sync (BSP processors) and
(a) Async (asynchronous processors) respectively.
We divided the execution period in three equally sized
1
P
time intervals. During the first interval we processed queries
Normalized Running Time
300 GG Index
32 P EC ED EP EV
200 16 4 0.77 0.57 0.47 0.74
8 8 0.72 0.42 0.37 0.65
100 16 0.65 0.33 0.31 0.55
4
32 0.55 0.28 0.27 0.47
0
1 1/3 2/3 1
Normalized execution time period
GG2 Index
(a) P EC ED EP EV
170 4 0.88 0.95 1.00 0.83
160 8 0.87 0.95 1.00 0.77
150 16 0.86 0.94 1.00 0.70
140 32 0.85 0.92 1.00 0.63
Active Queries
130
120 A1 A2 A1
110
100
GG
90 GG2
to clusters only. The results show that GG2 improves overall load balance in a significant manner.

The improvement of the BSP efficiencies has a clear impact on running time, as shown in Figures 6.a and 6.b for the UK and NASA-2 datasets respectively. These results are for the Async realization and show that GG2 outperforms GG by a wide margin. The curves also show results for 1%, 5% and 10% replication of index clusters, with curves labeled with the suffixes R1, R5 and R10 respectively. They show that further improvement in performance can be achieved by a modest replication of index clusters onto processors.

Figure 6. Normalized running times (GG vs. GG2; UK and NASA-2 datasets; x-axis: number of processors, 4-32).

To assess the effects of using the GG2 scheduling algorithm instead of plainly using the "least loaded processor first" heuristic [16], [13], we performed experiments including this heuristic. Figure 7 shows that GG2 outperforms the heuristic by about 50%. A key advantage of GG2 is that it avoids the saturation that can arise during unpredictable time intervals: when the scheduling algorithm detects that some processors get more load than a given upper bound, it evenly distributes that load across subsequent supersteps.

VI. CONCLUSIONS

We have proposed a scheduling algorithm to perform well-balanced parallel query processing on a distributed metric-space index. The algorithm is suitable for systems composed of a query receptionist machine and a set of distributed memory processors, where it is critical to prevent bottlenecks at the query receptionist side and also critical to properly load balance computations in order to achieve an efficient rate of queries solved per unit time. Target applications are thus cases where it is necessary to search for non-textual content, such as images, in large Web search engines.
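In simplified form, the upper-bound rule described above, i.e., deferring excess work on a saturated processor to subsequent supersteps, can be sketched as follows. The function name and data layout are assumptions for illustration, not the GG2 implementation:

```python
# Hypothetical sketch of the upper-bound rule: operations are pinned to the
# processor holding the index section they need; when a processor's load in
# the current superstep would exceed `bound`, the operation spills over to a
# later superstep on that processor instead of saturating the current one.

def schedule_with_bound(ops, num_procs, bound):
    """ops: list of (processor, cost) pairs in arrival order.
    Returns a list of (superstep, processor) assignments."""
    loads = [[0.0] * num_procs]      # loads[s][p]: load of p in superstep s
    plan = []
    for proc, cost in ops:
        s = 0
        # Spill to the next superstep while the current one is over budget
        # (the loads[s][proc] > 0 guard admits single ops larger than bound).
        while loads[s][proc] + cost > bound and loads[s][proc] > 0:
            s += 1
            if s == len(loads):
                loads.append([0.0] * num_procs)
        loads[s][proc] += cost
        plan.append((s, proc))
    return plan

# A burst aimed at processor 0 spreads across supersteps 0, 1 and 2,
# while processor 1 keeps its work in superstep 0.
print(schedule_with_bound([(0, 3), (0, 3), (0, 3), (1, 2)], 2, bound=4))
# → [(0, 0), (1, 0), (2, 0), (0, 1)]
```

A 1D "least loaded processor first" heuristic has no such superstep dimension to spill into, which is why it saturates under bursts directed at particular index sections.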
Figure 7. NASA-2 Collection: least load heuristic vs GG2 (y-axis: average query response time, normalized; x-axis: number of processors, 4-32).

The experiments show that efficient performance is achieved by applying the scheduling algorithm to a global index evenly distributed on processors. The algorithm defines the order in which each query must visit processors to prevent imbalance. Importantly, the algorithm also defines the order in which a given operation must take place with respect to other operations scheduled on the same processor. This 2D view of the scheduling problem, namely space and time, gives our proposal a key advantage over 1D strategies such as the commonly used "least loaded processor first" heuristic. Our scheme can accommodate index cluster replication to deal with queries severely skewed towards particular index sections. The results show that a small percentage of replication (e.g., 5%) has a significant impact on performance (e.g., a 35% improvement).

Extension to multiple brokers can be made by simple composition. The rationale is that if query traffic is evenly distributed on the brokers and each broker applies the same algorithm, then overall computations end up well balanced, since each broker produces balanced query plans. The proposed scheduling algorithm can also be applied to any other metric-space indexing strategy based on computing pair-wise distance evaluations among database objects. These operations are used by all well-known indexing strategies and they dominate running time cost.

Acknowledgment. This work has been partially supported by the FONDEF D09I1185 R&D project.

REFERENCES

[1] Barrientos, R.J., Gómez, J.I., Tenllado, C., Prieto-Matías, M., Marín, M.: k-NN query processing in metric spaces using GPUs. In: Euro-Par. pp. 380–392 (2011).
[2] Bauer, D., Hurley, P., Waldvogel, M.: Replica placement and location using distributed hash tables. In: LCN. pp. 315–324 (2007).
[3] Bianchi, S., Serbu, S., Felber, P., Kropf, P.: Adaptive load balancing for DHT lookups. In: ICCCN. pp. 411–418 (2006).
[4] Brin, S.: Near neighbor search in large metric spaces. In: VLDB. pp. 574–584 (1995).
[5] Catalyurek, U.V., Boman, E.G., Devine, K.D., Bozdağ, D., Heaphy, R.T., Riesen, L.A.: A repartitioning hypergraph model for dynamic load balancing. J. Parallel Distrib. Comput. 69, 711–724 (2009).
[6] Chavez, E., Navarro, G.: A compact space decomposition for effective metric indexing. Pattern Recognition Letters 26(9), 1363–1376 (2005).
[7] Chen, Z., Huang, G., Xu, J., Yang, Y.: Adaptive load balancing for lookups in heterogeneous DHT. In: EUC. pp. 513–518 (2008).
[8] Ding, J., Zhang, G.: A note on online scheduling for jobs with arbitrary release times. In: COCOA. pp. 354–362 (2009).
[9] Doulkeridis, C., Vlachou, A., Kotidis, Y., Vazirgiannis, M.: Peer-to-peer similarity search in metric spaces. In: VLDB (2007).
[10] Giakkoupis, G., Hadzilacos, V.: A scheme for load balancing in heterogeneous distributed hash tables. In: PODC. pp. 302–311 (2005).
[11] Gil-Costa, V., Barrientos, R., Marin, M., Bonacic, C.: Scheduling metric-space queries processing on multi-core processors. In: PDP. pp. 187–194 (2010).
[12] Gil-Costa, V., Marin, M., Reyes, N.: Parallel query processing on distributed clustering indexes. J. Discrete Algorithms 7(1), 3–17 (2009).
[13] Graham, R.L.: Bounds for certain multiprocessing anomalies. Bell System Technical Journal 45(9), 1563–1581 (1966).
[14] Marin, M., Ferrarotti, F., Gil-Costa, V.: Distributing a metric-space search index onto processors. In: ICPP. pp. 13–16 (2010).
[15] Marin, M., Gil-Costa, V., Hernández, C.: Dynamic p2p indexing and search based on compact clustering. In: SISAP. pp. 124–131 (2009).
[16] Moffat, A., Webber, W., Zobel, J.: Load balancing for term-distributed parallel retrieval. In: SIGIR. pp. 348–355 (2006).
[17] Naor, M., Wieder, U.: Novel architectures for p2p applications: the continuous-discrete approach. In: SPAA. pp. 50–59 (2003).
[18] Novak, D., Batko, M., Zezula, P.: Metric index: An efficient and scalable solution for precise and approximate similarity search. Inf. Syst. 36(4), 721–733 (2011).
[19] Papadopoulos, A., Manolopoulos, Y.: Distributed processing of similarity queries. Distributed and Parallel Databases 9(1), 67–92 (2001).
[20] Pitoura, T., Ntarmos, N., Triantafillou, P.: Replication, load balancing and efficient range query processing in DHTs. In: EDBT. pp. 131–148 (2006).
[21] Raiciu, C., Huici, F., Rosenblum, D.S., Handley, M.: ROAR: Increasing the flexibility and performance of distributed search. In: SIGCOMM (2009).
[22] Valiant, L.: A bridging model for parallel computation. Comm. ACM 33(8), 103–111 (1990).
[23] Yiu, M.L., Assent, I., Jensen, C.S., Kalnis, P.: Outsourced similarity search on metric data assets. IEEE Trans. Knowl. Data Eng. (TKDE) 99 (2010).
[24] Yuan, Y., Wang, G., Sun, Y.: Efficient peer-to-peer similarity query processing for high-dimensional data. In: Asia-Pacific Web Conference. pp. 195–201 (2010).
[25] Zhang, C., Xiao, W., Tang, D., Tang, J.: P2P-based multidimensional indexing methods: A survey. Journal of Systems and Software 84(12), 2348–2362 (2011).
[26] Zhu, Y., Hu, Y.: Efficient, proximity-aware load balancing for DHT-based p2p systems. IEEE Trans. Parallel Distrib. Syst. 16, 349–361 (2005).