
IT-DREPS Conference, Amman, Jordan Dec 6-8, 2017

Matrix Multiplication of Big Data Using MapReduce: A Review

Mais Haj Qasem, Alaa Abu Sarhan, Raneem Qaddoura, and Basel A. Mahafzah
Computer Science Department
The University of Jordan
Amman, Jordan

Abstract— One popular application for big data is matrix multiplication, which has been solved using many approaches. Recently, researchers have applied MapReduce as a new approach to solve this problem. In this paper, we provide a review of matrix multiplication of big data using MapReduce. This review includes the techniques for solving matrix multiplication using MapReduce, along with the time complexity and the number of mappers needed for each technique. Moreover, this review provides the number of articles published between 2010 and 2016 for each of the following three topics: Big Data, Matrix Multiplication, and MapReduce.

Keywords — Big Data; MapReduce; Matrix Multiplication
presents big data analysis and research trends and a brief
I. INTRODUCTION

Big data analysis is the process of examining and evaluating large and varied data sets. A large amount of data is continually generated every day and is useful for many applications, including social networks, news feeds, business and marketing, and cloud services. To extract useful information from these large data sets, machine learning algorithms or algebraic analyses are performed.

Matrix multiplication, a fundamental operation in linear algebra, has many real-life applications. It becomes a problem when the matrices are huge and considered "big data". Researchers have found many applications for matrices due to the extensive use of personal computers, which has increased the use of matrices in a wide variety of fields, such as economics, engineering, statistics, and other sciences [1].

A massive amount of computation is involved in matrix multiplication, especially when it is considered big data; this has encouraged researchers to investigate computational problems thoroughly to enhance the efficiency of the implemented algorithms for matrix multiplication. Hence, over the years, several parallel and distributed systems for the matrix multiplication problem have been proposed to reduce communication and computation time [2-4].

Parallel and distributed algorithms over various interconnection networks, such as the Chained-Cubic Tree (CCT), Hyper Hexa-Cell (HHC), Optical CCT, and OTIS HHC [5-8], divide large problems into smaller sub-problems and assign each of them to different processors; all processors then run in parallel, which makes the problem more feasible in terms of time. MapReduce is among these computing systems [9].

MapReduce is a parallel approach that consists of two sequential tasks, Map and Reduce, each implemented with several subtasks. There are many applications using MapReduce, such as MapReduce with K-means for remote-sensing image clustering [10], MapReduce with decision trees for classification [10], and MapReduce with expectation maximization for text filtering [11]. MapReduce has also been used in real-time systems [12] and for job scheduling [13].

The rest of the paper is organized as follows. Section 2 presents big data analysis and research trends, along with a brief summary of matrix multiplication and MapReduce. Section 3 reviews work related to using MapReduce to solve the matrix multiplication problem and compares the related works. Section 4 summarizes and concludes the paper.

II. BIG DATA ANALYSIS AND RESEARCH TRENDS

Big data is a hot research topic with many applications in which complex and huge data must be analyzed. The number of articles published on this topic is shown in Table I and illustrated in Fig. 1. Research on this topic is increasing, as big data is used almost everywhere these days: in news articles, professional magazines, and social networks such as tweets, YouTube videos, and blog discussions. Google Scholar was used to extract the number of articles published each year using the query string "Big Data" as an exact search term. As the figure shows, the number of published articles increased significantly from 2010 to 2015, with a slight decrease in 2016.

TABLE I. BIG DATA PUBLISHED ARTICLES

Year    No. of Articles
2010    2,520
2011    4,000
2012    11,200
2013    27,900
2014    47,500
2015    67,300
2016    56,800



Fig. 1. Big data published articles.

One of the most important factors discussed in big data is the speed at which it is analyzed, which is used to measure the efficiency of the algorithms involved, especially when big data is stored on cloud platforms. One of the big challenges facing big data analysis is multiplying matrices, which has been implemented using many approaches and frameworks, such as MapReduce. The following subsections discuss matrix multiplication and MapReduce.

A. Matrix Multiplication in Big Data

Many problems are solved using matrix multiplication, as it is an essential operation in linear algebra; investigating these problems can thus enhance the efficiency of matrix multiplication algorithms. Fig. 2 shows how two matrices are multiplied to form the resulting matrix. In matrix multiplication, the number of columns of the first matrix must equal the number of rows of the second matrix, where the sizes of the first and second matrices are n × m and m × q, respectively.

Fig. 2. Matrix multiplication operations.

The time complexity of the brute-force algorithm is O(n³) for square matrices, since it must visit every element of the multiplied arrays for each output element (a short sketch follows). Better approaches with lower time complexity than the brute-force algorithm have been proposed over the years, such as MapReduce [9].
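For concreteness, a minimal Python sketch of the brute-force algorithm (our illustration, not code from the paper); the triple nested loop is what yields the cubic running time:

def multiply(A, B):
    # Brute-force matrix multiplication: C = A x B, where A is n x m
    # and B is m x q. The three nested loops visit every (i, k, j)
    # combination, giving O(n^3) time for square (n x n) matrices.
    n, m, q = len(A), len(B), len(B[0])
    assert all(len(row) == m for row in A), "columns of A must equal rows of B"
    C = [[0] * q for _ in range(n)]
    for i in range(n):          # each row of A
        for k in range(q):      # each column of B
            for j in range(m):  # accumulate the dot product
                C[i][k] += A[i][j] * B[j][k]
    return C

print(multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]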

Research on this topic is also increasing, as shown in Table II and illustrated in Fig. 3. Google Scholar was used to extract the number of articles published each year using the query string "Matrix Multiplication" as an exact search term. As the figure shows, the number of published articles increased from 2010 to 2016.

TABLE II. MATRIX MULTIPLICATION PUBLISHED ARTICLES

Year    No. of Articles
2010    7,340
2011    7,720
2012    9,250
2013    9,460
2014    9,670
2015    9,850
2016    10,600

Fig. 3. Matrix multiplication published articles.

B. Role of MapReduce in Big Data

Parallel computation has long been used for matrix multiplication and has recently been replaced by MapReduce [14]. MapReduce is a framework for big data in parallel distributed environments. It consists of two sequential tasks, Map and Reduce (a minimal simulation of both tasks is sketched after this list):

• Map: takes a set of data and transforms it into another set of data, formed as key-value pairs, so that each processor works on a different subset of the data.

• Reduce: combines the values of identical keys to form the intended output; it always starts after the map task.
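To make the two tasks concrete, here is a toy single-process simulation in Python (our illustration, not code from the paper; the function names and the word-count example are ours):

from collections import defaultdict

def map_task(record):
    # Example map: emit a (word, 1) key-value pair for each word in a line.
    for word in record.split():
        yield (word, 1)

def shuffle(pairs):
    # Group all values that share an identical key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_task(key, values):
    # Example reduce: combine the grouped values by summing them.
    return (key, sum(values))

records = ["big data", "big matrices"]
pairs = [kv for r in records for kv in map_task(r)]
print([reduce_task(k, vs) for k, vs in shuffle(pairs).items()])
# [('big', 2), ('data', 1), ('matrices', 1)]

In a real cluster these steps are distributed across machines; the simulation only shows the data flow from map, through grouping by key, to reduce.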
The advantage of the map and reduce tasks is that a program can execute on multiple processors in parallel [15]. Hadoop is an open-source Java-based implementation of MapReduce, the programming model introduced by Google [16]; Hadoop adds an intermediate phase called "shuffle", which sorts the data so that records with identical keys are routed to the same processor. The Hadoop architecture is illustrated in Fig. 4.
Hadoop distributes the input datasets to the available processors, which are called mappers. Each mapper implements the map task and sends its output to be shuffled and then passed to the reducer, which combines all the mappers' output to form the final output.

Fig. 4. Hadoop architecture.

Research on this topic is also increasing, as shown in Table III and illustrated in Fig. 5. Google Scholar was used to extract the number of articles published each year using the query string "MapReduce" as an exact search term. As the figure shows, the number of published articles increased from 2010 to 2016.

TABLE III. MAPREDUCE PUBLISHED ARTICLES

Year    No. of Articles
2010    3,380
2011    4,990
2012    8,040
2013    10,700
2014    13,300
2015    15,300
2016    15,100

Fig. 5. MapReduce published articles.

III. MATRIX MULTIPLICATION USING MAPREDUCE

Many applications use matrix multiplication on matrices that are considered big. Thus, finding an efficient matrix multiplication algorithm is a popular research topic. Time and cost are the main challenges of this problem, and several algorithms have been proposed in the literature to solve it [2, 17].

Using the sequential algorithm shown previously in Fig. 2 takes too much space and time, which is why parallel algorithms were introduced for this type of problem. Ballard et al. [18] proposed a new parallel algorithm that is based on Strassen's fast matrix multiplication and minimizes communication. Choi et al. [19] presented the Parallel Universal Matrix Multiplication Algorithm (PUMMA) for distributed-memory concurrent computers. Agarwal et al. [20] proposed a scheme for matrix-matrix multiplication on a distributed-memory parallel computer.

Recently, MapReduce has been used instead of traditional parallel-based algorithms; only four research articles combine matrix multiplication with MapReduce, and these are the articles we discuss and compare throughout this paper.

MapReduce is a parallel framework for big data, and it involves two jobs when applied to matrix multiplication:

• First job: the reduce task is inactive, while the map task simply reads the input file and creates pairs of elements for multiplication.

• Second job: the map task performs the multiplication independently for each pair of elements, while the reduce task combines the results for each output element.

The operations in each MapReduce job are presented in Table IV: the first job is responsible for reading the input elements from the input file, and the second job performs the multiplication and combination. This schema is called the element-by-element technique, since each mapper multiplies individual elements; the technique is shown in Fig. 6, and a minimal sketch of the two jobs follows the figure caption.

TABLE IV. ELEMENT BY ELEMENT OPERATION

         Job 1: Map         Job 2: Map           Job 2: Reduce
Input    Files              ‹a_ij, b_jk›         ‹key, [a_i1*b_1k]…[a_im*b_mk]›
Output   ‹a_ij, b_jk›       ‹key, a_ij*b_jk›     ‹key, a_i1*b_1k + … + a_im*b_mk›

n: number of rows for the first matrix, m: number of columns for the first matrix or number of rows for the second matrix, q: number of columns for the second matrix

Fig. 6. Element by element schema matrix multiplication.
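As an illustration of Table IV, the two jobs can be simulated in a single Python process (our sketch under the paper's notation, not the authors' code; all function names are hypothetical):

from collections import defaultdict

def job1_map(A, B):
    # Job 1: pair every A[i][j] with every B[j][k]; the reduce task is inactive.
    m, q = len(B), len(B[0])
    for i in range(len(A)):
        for j in range(m):
            for k in range(q):
                yield ((i, k), (A[i][j], B[j][k]))

def job2_map(keyed_pair):
    # Job 2 map: one multiplication per input pair.
    (i, k), (a, b) = keyed_pair
    return ((i, k), a * b)

def job2_reduce(products):
    # Job 2 reduce: combine identical keys, C[i][k] = sum_j A[i][j] * B[j][k].
    totals = defaultdict(int)
    for (i, k), p in products:
        totals[(i, k)] += p
    return totals

A, B = [[1, 2], [3, 4]], [[5, 6], [7, 8]]
C = job2_reduce(job2_map(p) for p in job1_map(A, B))
print(C[(0, 0)], C[(1, 1)])  # 19 50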
In order to reduce the overall computation and overcome the disadvantage of the element-by-element technique, a blocking technique has been used. Sun and Rishe [21] proposed a blocking technique, a MapReduce matrix factorization technique, to enhance the efficiency of matrix multiplication.

In this technique, two jobs are also used to complete the multiplication process. The technique decomposes the first matrix into row vectors and the second matrix into column vectors, as shown in Fig. 7. Using this technique, the communication overhead and memory utilization are decreased, while the computation in each map task is increased. The operations in each MapReduce job are presented in Table V, and a sketch of this row-by-column decomposition follows Fig. 7.

TABLE V. ROW BY COLUMN OPERATION

         Job 1: Map                   Job 2: Map                        Job 2: Reduce
Input    Files                        ‹a_i1…a_im, b_1k…b_mk›            ‹key, [a_i1*b_1k]…[a_im*b_mk]›
Output   ‹a_i1…a_im, b_1k…b_mk›       ‹key, [a_i1*b_1k]…[a_im*b_mk]›    ‹key, a_i1*b_1k + … + a_im*b_mk›

n: number of rows for the first matrix, m: number of columns for the first matrix or number of rows for the second matrix, q: number of columns for the second matrix

Fig. 7. Row by column schema matrix multiplication.
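A corresponding single-process sketch of Table V (again ours, with illustrative function names): pairing whole vectors means each of the n × q mappers (n² for square matrices) performs O(m) work instead of a single multiplication:

def job1_map(A, B):
    # Job 1: emit one key-value pair per output element C[i][k],
    # pairing row i of the first matrix with column k of the second.
    for i, row in enumerate(A):
        for k in range(len(B[0])):
            col = [B[j][k] for j in range(len(B))]
            yield ((i, k), (row, col))

def job2(keyed_vectors):
    # Job 2: map emits the m element-wise products; reduce sums them
    # for the same key, shown here as a single fused step.
    (i, k), (row, col) = keyed_vectors
    return ((i, k), sum(a * b for a, b in zip(row, col)))

A, B = [[1, 2], [3, 4]], [[5, 6], [7, 8]]
print(dict(job2(v) for v in job1_map(A, B)))
# {(0, 0): 19, (0, 1): 22, (1, 0): 43, (1, 1): 50}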
Zheng et al. [22] presented two new techniques for matrix multiplication, element-by-row and column-by-row, obtained by decomposing the first matrix into elements or columns instead of rows, and the second matrix into rows instead of columns, as shown in Figs. 8 and 9. Tables VI and VII show the map and reduce processes for these two techniques, and a sketch of the column-by-row decomposition follows Fig. 9.

TABLE VI. ELEMENT BY ROW OPERATION

         Job 1: Map              Job 2: Map                        Job 2: Reduce
Input    Files                   ‹a_ij, b_j1…b_jq›                 ‹key, [a_i1*b_1k]…[a_im*b_mk]›
Output   ‹a_ij, b_j1…b_jq›       ‹key, [a_ij*b_j1]…[a_ij*b_jq]›    ‹key, a_i1*b_1k + … + a_im*b_mk›

n: number of rows for the first matrix, m: number of columns for the first matrix or number of rows for the second matrix, q: number of columns for the second matrix

Fig. 8. Element by row schema matrix multiplication.

TABLE VII. COLUMN BY ROW OPERATION

         Job 1: Map                   Job 2: Map                        Job 2: Reduce
Input    Files                        ‹a_1j…a_nj, b_j1…b_jq›            ‹key, [a_i1*b_1k]…[a_im*b_mk]›
Output   ‹a_1j…a_nj, b_j1…b_jq›       ‹key, [a_1j*b_j1]…[a_nj*b_jq]›    ‹key, [a_i1*b_1k] + … + [a_im*b_mk]›

n: number of rows for the first matrix, m: number of columns for the first matrix or number of rows for the second matrix, q: number of columns for the second matrix

Fig. 9. Column by row schema matrix multiplication.
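A single-process sketch of the column-by-row decomposition of Table VII (ours, not the authors' code): only m mappers are needed, but each computes a full n × q outer product, which is the larger per-mapper cost noted later in Table VIII:

from collections import defaultdict

def mapper(j, A, B):
    # Mapper j holds column j of the first matrix and row j of the second,
    # and emits their full outer product: one partial result per C[i][k].
    col = [row[j] for row in A]
    row = B[j]
    for i, a in enumerate(col):
        for k, b in enumerate(row):
            yield ((i, k), a * b)

def reducer(partials):
    # Sum the partial results that share the same output key.
    C = defaultdict(int)
    for key, p in partials:
        C[key] += p
    return C

A, B = [[1, 2], [3, 4]], [[5, 6], [7, 8]]
C = reducer(p for j in range(len(B)) for p in mapper(j, A, B))
print(C[(1, 0)])  # 43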
Deng and Wu [23] compared the element-by-row and row-by-column matrix multiplication techniques, which are considered block-based techniques. Their experiments show that the element-by-row technique runs faster than the row-by-column technique, and that both are faster than the element-by-element technique. Deng and Wu [23] also modified the reading process in Hadoop by reading the input file in a preprocessing step rather than in the map task; this reduces the number of MapReduce jobs to one and also lowers the overall computation and memory consumption. This preprocessing stage has been implemented in the HAMA project [24] for the same purpose.

To enhance the efficiency of matrix multiplication in MapReduce, Kadhum et al. [25] proposed a new technique that balances the processing overhead and the I/O overhead by using a balanced number of mappers: not too many, which reduces the I/O overhead, and not too few, which reduces the processing overhead. Their technique implements matrix multiplication as an element-to-block technique, as shown in Figs. 10 and 11. In the element-by-row-block scheme, the second matrix is divided into row-based blocks; in the row-block by column-block scheme, the first matrix is divided into row-based blocks and the second matrix into column-based blocks. A rough sketch of this tiling idea follows Fig. 11.

Fig. 10. Element by row-block schema matrix multiplication.

Fig. 11. Row-block by column-block schema matrix multiplication.
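A rough sketch of the row-block by column-block idea (ours; it simplifies the scheme of Kadhum et al. to a single process and assumes a square n × n matrix whose size is divisible by a hypothetical block size L):

from itertools import product

def block_mapper(A, B, r, c, L):
    # Compute the L x L tile of C covering rows r..r+L-1 and columns c..c+L-1.
    n = len(A)
    return [[sum(A[i][j] * B[j][k] for j in range(n))
             for k in range(c, c + L)]
            for i in range(r, r + L)]

def multiply_blocked(A, B, L):
    n = len(A)
    C = [[0] * n for _ in range(n)]
    # One mapper per (row-block, column-block) pair; tiles do not overlap,
    # so no reduce step is needed in this simplified form. Larger L means
    # fewer, heavier mappers.
    for r, c in product(range(0, n, L), repeat=2):
        tile = block_mapper(A, B, r, c, L)
        for di in range(L):
            C[r + di][c:c + L] = tile[di]
    return C

A, B = [[1, 2], [3, 4]], [[5, 6], [7, 8]]
print(multiply_blocked(A, B, 1))  # [[19, 22], [43, 50]]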


We compare all of the previously discussed strategies for matrix multiplication in terms of time complexity and number of mappers, as listed in Table VIII. We assume that the matrices are square (i.e., n × n) and that L is the block size for block partitioning.
TABLE VIII. COMPARISON BETWEEN STRATEGIES OF MATRIX MULTIPLICATION

Matrix Multiplication Strategy     Time Complexity     Number of Mappers
Element-by-Element                 O(log n)            n³
Column-by-Row                      O(n²)               n
Element-by-Row                     O(n)                n²
Row-by-Column                      O(n)                n²
Element-by-Column-Block            O(n/L)              n² × L
Column-Block-by-Row-Block          O(n/L)              n² × L

For example, in element-by-element matrix multiplication, mappers collaborate in computing each element of the output matrix; thus, n³ mappers are needed, as shown in Table VIII. Regarding time complexity, Fig. 12 shows the task-interaction graph for element-by-element matrix multiplication of 4 × 4 matrices. The number of mappers at each level of the graph equals the number of mappers at the previous level divided by 2. Thus, the time complexity of element-by-element matrix multiplication is O(log n), which is the lower bound for parallel matrix multiplication; this technique is therefore the most efficient in terms of time complexity, as shown in Table VIII.

Fig. 12. Task-interaction graph for element-by-element matrix multiplication.

It is also observed from Table VIII that the time complexity of the column-by-row technique is O(n²), that is, n columns multiplied by n rows, which makes it the least feasible technique in terms of time complexity but not in terms of the number of mappers: each mapper holds one column of elements from the first matrix and one row of elements from the second matrix, so only n mappers are needed.

The element-by-column-block and column-block-by-row-block strategies compromise between the time complexity of the algorithm, which is O(n/L), and the number of mappers, which is n² × L; this compromise might make them the most feasible techniques for some applications.

IV. CONCLUSION

In this paper, we provided statistics on the number of articles published between 2010 and 2016 for Big Data, Matrix Multiplication, and MapReduce using the Google Scholar search engine. We observed that articles are increasingly published in each of these three areas of research, while their combination is still a recent research area in which only four papers have been presented; these are discussed throughout this paper.

We reviewed the techniques that use MapReduce to solve the problem of multiplying huge matrices and compared them in terms of time complexity and the number of mappers needed. We concluded that the column-by-row technique, with a time complexity of O(n²) and only n mappers, is probably the best technique, while element-by-column-block and column-block-by-row-block are moderately acceptable, as they compromise between the time complexity of the algorithm and the number of mappers. The element-by-element technique is the worst in terms of the number of mappers.
REFERENCES
[1] X. Liu, N. Iftikhar, and X. Xie, "Survey of real-time processing systems for big data," In: Proceedings of the 18th International Database Engineering & Applications Symposium, ACM, 2014.
progressions,” Journal of Symbolic Computation, vol. 9, pp. 251-280,
1990.
[3] A. Grama, A. Gupta, G. Karypis, and V. Kumar, Introduction to Parallel Computing, Addison Wesley, USA, 2003.
[4] W. Cohen and B. Mahafzah, “Statistical analysis of message passing
programs to guide computer design,” In: Proceedings of the Thirty-First
Annual Hawaii International Conference on System Sciences (HICSS
1998), IEEE Computer Society, vol. 7, pp. 544-553, Kohala Coast,
Hawaii, USA, January 1998.
[5] M. Abdullah, E. Abuelrub, and B. Mahafzah, “The chained-cubic tree
interconnection network,” International Arab Journal of Information
Technology, vol. 8, pp. 334-343, 2011.
[6] B. Mahafzah and I. Al-Zoubi, "Broadcast communication operations for hyper hexa-cell interconnection network," Telecommunication Systems, May 2017. DOI: 10.1007/s11235-017-0322-3.
[7] B. Mahafzah, M. Alshraideh, T. Abu-Kabeer, E. Ahmad, and N. Hamad,
“The optical chained-cubic tree interconnection network: Topological
structure and properties,” Computers & Electrical Engineering, vol. 38,
pp. 330-345, March 2012.
[8] B. Mahafzah, A. Sleit, N. Hamad, E. Ahmad, and T. Abu-Kabeer, "The OTIS hyper hexa-cell optoelectronic architecture," Computing, vol. 94, pp. 411-432, 2012.
[9] J. Norstad, "A mapreduce algorithm for matrix multiplication," 2009. http://www.norstad.org/matrix-multiply/index.html [Accessed May 2017].
[10] G-Q. Wu, et al., "MReC4.5: C4.5 ensemble classification with MapReduce," In: 2009 Fourth ChinaGrid Annual Conference, IEEE, 2009.
[11] J. Lin and C. Dyer, "Data-intensive text processing with MapReduce," Synthesis Lectures on Human Language Technologies, vol. 3, pp. 1-177, 2010.
[12] X. Liu, N. Iftikhar, and X. Xie, "Survey of real-time processing systems for big data," In: Proceedings of the 18th International Database Engineering & Applications Symposium, ACM, 2014.
[13] M. Zaharia, et al., "Job scheduling for multi-user mapreduce clusters," EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2009-55, 2009.
[14] U. Catalyurek and C. Aykanat, "Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication," IEEE Trans. Parallel Distrib. Syst., vol. 10, pp. 673-693, 1999.
[15] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Commun. ACM, vol. 51, pp. 107-113, 2008.
[16] J. Dean and S. Ghemawat, "MapReduce: A flexible data processing tool," Commun. ACM, vol. 53, pp. 72-77, 2010.
[17] K. Thabet and S. Al-Ghuribi, "Matrix multiplication algorithms," Int. J. Comput. Sci. Netw. Secur. (IJCSNS), vol. 12, pp. 74-79, 2012.
[18] G. Ballard, et al., "Communication-optimal parallel algorithm for Strassen's matrix multiplication," In: Proceedings of the Twenty-Fourth Annual ACM Symposium on Parallelism in Algorithms and Architectures, 2012.
[19] J. Choi, D. Walker, and J. Dongarra, "PUMMA: Parallel universal matrix multiplication algorithms on distributed memory concurrent computers," Concurrency: Practice and Experience, vol. 6, pp. 543-570, 1994.
[20] R. Agarwal, F. Gustavson, and M. Zubair, "A high-performance matrix-multiplication algorithm on a distributed-memory parallel computer, using overlapped communication," IBM Journal of Research and Development, vol. 38, pp. 673-681, 1994.
[21] Z. Sun, T. Li, and N. Rishe, "Large-scale matrix factorization using mapreduce," In: 2010 IEEE International Conference on Data Mining Workshops, 2010.
[22] J. Zheng, R. Zhu, and Y. Shen, "Sparse matrix multiplication algorithm based on MapReduce," J. Zhongkai Univ. Agric. Eng., vol. 26, pp. 1-6, 2013.
[23] S. Deng and W. Wu, "Efficient matrix multiplication in Hadoop," Int. J. Comput. Sci. Appl., vol. 13, pp. 93-104, 2016.
[24] S. Seo, et al., "HAMA: An efficient matrix computation with the mapreduce framework," In: 2010 IEEE Second International Conference on Cloud Computing Technology and Science (CloudCom), pp. 721-726, November 2010.
[25] M. Kadhum, M. Qasem, A. Sleit, and A. Sharieh, "Efficient MapReduce matrix multiplication with optimized mapper set," In: Computer Science On-line Conference, pp. 186-196, 2017.
