Unit II Matrix Multiplication
Parallel Algorithms
Introduction to Parallel Matrix Multiplication
Matrix multiplication is a fundamental operation in linear algebra with numerous applications. It is also computationally intensive, and parallel computing can significantly speed it up, especially for large matrices, by distributing the work across multiple processors or threads. A common parallel matrix multiplication algorithm is explained below, with an example, in a clear and structured way. Matrix multiplication involves taking two matrices, A (size m×n) and B (size n×p), and producing a result matrix C (size m×p). The formula for each element C[i][j] is:

C[i][j] = Σₖ A[i][k] · B[k][j], summing over k = 1, …, n
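In code form, this per-element formula gives the classic triple loop. A minimal sequential sketch in Python (the language choice is ours, not prescribed by the notes) serves as a baseline for the parallel versions that follow:

```python
# Naive triple-loop matrix multiplication: C[i][j] = sum over k of A[i][k] * B[k][j].
# A is m x n, B is n x p, C is m x p.
def matmul(A, B):
    m, n, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(m)]
    for i in range(m):
        for j in range(p):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B))  # [[19, 22], [43, 50]]
```

Every parallel scheme below distributes some part of this triple loop across processors.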
Decomposition Strategies
1. Row-wise Decomposition
Distribute matrix rows among processes
Each process multiplies its assigned rows of A with the entire matrix B
No reduction step is needed; the result rows are simply concatenated
2. Column-wise Decomposition
Distribute matrix columns among processes
Each process computes partial results
Requires a final reduction step to obtain the complete result
3. Block-wise (Checkerboard) Decomposition or Block-Based Parallel Multiplication
Divide matrices into rectangular subblocks
Distribute subblocks among processes in a 2D grid pattern
More scalable for large matrices and high numbers of processes
One widely used method is to divide the matrices into smaller blocks (submatrices) and assign each block
computation to a separate processor or thread. This is especially effective on systems like multi-core
CPUs, GPUs, or distributed clusters.
Algorithm Description
1. Partition the Matrices:
o Split A into row blocks.
o Split B into column blocks.
o Split C into corresponding blocks.
2. Assign Tasks:
o Each processor computes a block of C by multiplying a row block of A with a column
block of B.
3. Parallel Execution:
o Processors work simultaneously on their assigned blocks.
4. Combine Results:
o Once all blocks are computed, the full matrix C is assembled.
Parallel Framework
This can be implemented using:
Threads (e.g., OpenMP on a multi-core CPU).
Processes (e.g., MPI on a cluster).
GPU threads (e.g., CUDA).
Example
Let’s multiply two 4x4 matrices in parallel:
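The worked matrices for this example do not survive in this copy of the notes. As a stand-in, here is a sketch of block-based parallel multiplication of two random 4x4 matrices, using Python threads and NumPy (both assumed tools, not prescribed by the text):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Multiply two 4x4 matrices by 2x2 blocks; each task computes one block of C
# from a row block of A and a column block of B, so the four tasks run in parallel.
n, bs = 4, 2
rng = np.random.default_rng(0)
A = rng.integers(0, 10, (n, n))
B = rng.integers(0, 10, (n, n))

def compute_block(i, j):
    # Block C[i:i+bs, j:j+bs] = (row block of A) @ (column block of B)
    return (i, j, A[i:i + bs, :] @ B[:, j:j + bs])

with ThreadPoolExecutor() as pool:
    results = pool.map(lambda ij: compute_block(*ij),
                       [(i, j) for i in range(0, n, bs) for j in range(0, n, bs)])

# Combine results: assemble the computed blocks into the full matrix C.
C = np.zeros((n, n), dtype=A.dtype)
for i, j, block in results:
    C[i:i + bs, j:j + bs] = block

assert np.array_equal(C, A @ B)
```

On a real cluster the same decomposition would be expressed with MPI ranks instead of threads, but the partition/assign/combine structure is identical.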
Why Parallel?
Speed: Each processor handles a smaller chunk simultaneously.
Scalability: Works on multi-core systems, clusters, or GPUs.
Practical Notes
Load Balancing: Ensure each processor gets equal work.
Communication: In distributed systems (e.g., MPI), processors need to share A and B
blocks, adding overhead.
Tools: Libraries like BLAS (CPU) or cuBLAS (GPU) optimize this further.
Specific Algorithms
1. Fox's Algorithm
Theoretical Complexity: Fox's algorithm has a time complexity of O(n³/p) when
using p processors, assuming ideal conditions and no communication overhead. However,
in practice, communication costs can significantly affect performance.
Communication Overhead: Fox's algorithm involves broadcasting diagonal elements and
shifting matrix blocks, which can lead to substantial communication overhead, especially
in distributed memory architectures.
Scalability: It is relatively easy to implement but may not scale as well as other algorithms
due to its communication requirements.
2. Strassen's Algorithm
Theoretical Complexity: Strassen's algorithm has a time complexity
of O(n^(log₂ 7)) ≈ O(n^2.807), making it faster than the
standard O(n³) algorithm for large matrices.
Communication Overhead: While Strassen's algorithm reduces computational complexity,
its communication overhead is generally higher due to the recursive nature of the
algorithm, which requires more data movement between levels of recursion.
Scalability: It is more complex to implement and less stable numerically than the standard
algorithm but offers significant speedup for very large matrices.
3. Cannon's Algorithm
Theoretical Complexity: Cannon's algorithm also has a time complexity
of O(n³/p) when using p processors, similar to Fox's algorithm. However, it is
optimized for 2D mesh architectures.
Communication Overhead: Cannon's algorithm minimizes communication overhead by
aligning matrix blocks in a way that reduces the need for data exchange between
processors. It shifts blocks cyclically, which can be efficient in homogeneous 2D grids.
Scalability: It is well-suited for homogeneous 2D grids but can be challenging to extend to
heterogeneous grids.
🔹 Iteration Formula:
Iterations = √p
Example: If using a 4x4 processor grid (p = 16), the number of iterations is √16 = 4.
Example Calculation:
o First Iteration: Initial matrix shifting & local computation.
o Second Iteration: Shift again & recompute.
o Third Iteration: Repeat.
o Fourth Iteration: Final computation.
🔹 Total Multiplications:
Each processor performs O(n³/p) computations.
If p = n² (maximum parallelism), each processor does O(1) multiplications
per iteration.
Steps:
1. Initial alignment of matrix blocks
2. Perform local block multiplications
3. Shift blocks cyclically
4. Repeat steps 2-3 until all blocks are processed
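The four steps above can be simulated directly. The sketch below models a √p × √p grid with 1x1 blocks (one matrix element per "process"), using NumPy roll operations to stand in for inter-process shifts (an illustrative assumption, not a distributed implementation):

```python
import numpy as np

# Simulate Cannon's algorithm on a 4x4 grid with 1x1 blocks.
n = 4
rng = np.random.default_rng(1)
A = rng.integers(0, 10, (n, n)).astype(float)
B = rng.integers(0, 10, (n, n)).astype(float)

# Step 1: initial alignment - shift row i of A left by i, column j of B up by j.
A_s = np.array([np.roll(A[i, :], -i) for i in range(n)])
B_s = np.column_stack([np.roll(B[:, j], -j) for j in range(n)])

C = np.zeros((n, n))
for _ in range(n):                  # sqrt(p) = n iterations for 1x1 blocks
    C += A_s * B_s                  # Step 2: local block products (elementwise here)
    A_s = np.roll(A_s, -1, axis=1)  # Step 3: shift A blocks left by one
    B_s = np.roll(B_s, -1, axis=0)  #         and B blocks up by one
                                    # Step 4: repeat until all blocks processed

assert np.allclose(C, A @ B)
```

After the alignment, process (i, j) holds A[i, (i+j) mod n] and B[(i+j) mod n, j], so each iteration contributes one term of the sum Σₖ A[i][k]·B[k][j].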
Example 1:
To begin with, we need to partition matrices A and B into smaller blocks. Since we are using a 4x4 grid
of processes, each block will be 1x1 in this case. The initial alignment involves shifting
the rows of matrix A to the left by their row index modulo 4 and shifting the columns of matrix B up by
their column index modulo 4. This ensures that each process can start multiplying its local blocks. We
will use a 4x4 processor grid, meaning we have 16 processes arranged in a 2D grid.
1. Partition Matrices: Divide A and B into 16 blocks, each 1x1 in this case, since we are using a 4x4
grid of processes.
2. Initial Shift: Align the blocks so that each process can start multiplying its local blocks. For
matrix A, shift each row to the left by its row index modulo 4; for matrix B, shift each column up
by its column index modulo 4.
3. Local Multiplication: Each process multiplies its local blocks of A and B and accumulates the
result (C1).
4. Shift and Multiply: Shift the blocks of A left and the blocks of B up by one position; each process
multiplies its new local blocks and adds the result to the accumulated sum: C = C1 + C2.
5. Repeat: Continue shifting and multiplying until all blocks have been processed. After four iterations,
the final result matrix C is obtained by summing all the intermediate results.
Final Result
The final result matrix C is calculated by summing the results from all iterations:
C = C1 + C2 + C3 + C4
After performing the calculations for each step and summing the results, we obtain the final matrix C.
This example illustrates how Cannon's algorithm efficiently performs matrix multiplication in parallel by
minimizing communication overhead through structured data shifts and local multiplications.
2. Fox's Algorithm: Fox’s Algorithm is a parallel algorithm for matrix multiplication designed
for distributed computing. It works efficiently on 2D processor grids (p × p). Unlike Cannon’s
Algorithm, which shifts elements, Fox’s Algorithm broadcasts matrix blocks.
Steps:
1. Organize processes in a 2D grid
2. Broadcast diagonal elements
3. Perform local multiplications
4. Shift matrix B blocks
5. Repeat steps 2-4 for all stages
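Fox's broadcast-and-shift stages can be sketched the same way as Cannon's. The simulation below assumes 1x1 blocks on an n x n grid, with the per-row broadcast modeled by NumPy broadcasting (again an illustrative simulation, not a distributed implementation):

```python
import numpy as np

# Simulate Fox's algorithm with 1x1 blocks. In stage k, row i broadcasts
# A[i, (i+k) mod n] to all processes in that row; every process multiplies
# it with its current B block; then the B blocks shift up by one.
n = 4
rng = np.random.default_rng(2)
A = rng.integers(0, 10, (n, n)).astype(float)
B = rng.integers(0, 10, (n, n)).astype(float)

C = np.zeros((n, n))
B_s = B.copy()
for k in range(n):
    # Step 2: broadcast the k-th generalized diagonal of A along each row.
    diag = np.array([[A[i, (i + k) % n]] for i in range(n)])  # shape (n, 1)
    # Step 3: local multiplication (elementwise, since blocks are 1x1).
    C += diag * B_s
    # Step 4: shift B blocks up cyclically.
    B_s = np.roll(B_s, -1, axis=0)

assert np.allclose(C, A @ B)
```

The key contrast with Cannon's algorithm is visible in the loop: A blocks are broadcast rather than shifted, and only B moves.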
3. Strassen's Algorithm : Strassen’s Algorithm is a divide-and-conquer approach that reduces the
number of multiplications required for matrix multiplication. It is more efficient than the classical method,
reducing the time complexity from O(n³) to approximately O(n^{2.81}).
Steps:
1. Divide matrices into quadrants
2. Compute seven matrix products (M1 to M7)
3. Combine results to form the final matrix
Example (2x2 matrices):
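The 2x2 example appears to have been lost in this copy. The sketch below spells out the seven Strassen products M1 to M7 for 2x2 matrices (the variable names a through h are illustrative):

```python
# Strassen's seven products for 2x2 matrices A = [[a, b], [c, d]],
# B = [[e, f], [g, h]]: 7 multiplications instead of the usual 8.
def strassen_2x2(A, B):
    (a, b), (c, d) = A
    (e, f), (g, h) = B
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    # Combine the products into the four quadrants of C.
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4,           m1 - m2 + m3 + m6]]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(strassen_2x2(A, B))  # [[19, 22], [43, 50]]
```

The saving of one multiplication per level, applied recursively to matrix quadrants, is what yields the O(n^2.81) bound.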
Performance Considerations
The choice of algorithm depends on matrix size and hardware architecture
Strassen's algorithm is typically beneficial for matrices larger than 1000x1000
Practical implementations often use hybrid approaches, switching to standard methods for smaller
submatrices
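Such a hybrid can be sketched as a recursive Strassen that falls back to the standard method below a cutoff. The cutoff value of 64 and the power-of-two matrix sizes are simplifying assumptions; real libraries tune the threshold to the hardware:

```python
import numpy as np

# Hybrid Strassen: recurse with the 7-product scheme for large matrices,
# fall back to standard multiplication below a cutoff. Assumes n is a
# power of two so quadrants split evenly.
CUTOFF = 64

def strassen(A, B):
    n = A.shape[0]
    if n <= CUTOFF:
        return A @ B  # standard method for small submatrices
    h = n // 2
    a, b, c, d = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    e, f, g, k = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    m1 = strassen(a + d, e + k)
    m2 = strassen(c + d, e)
    m3 = strassen(a, f - k)
    m4 = strassen(d, g - e)
    m5 = strassen(a + b, k)
    m6 = strassen(c - a, e + f)
    m7 = strassen(b - d, g + k)
    top = np.hstack([m1 + m4 - m5 + m7, m3 + m5])
    bottom = np.hstack([m2 + m4, m1 - m2 + m3 + m6])
    return np.vstack([top, bottom])

A = np.random.default_rng(4).integers(0, 10, (128, 128))
B = np.random.default_rng(5).integers(0, 10, (128, 128))
assert np.array_equal(strassen(A, B), A @ B)
```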
Comparison: Row-wise vs. Column-wise Decomposition

Matrix Division
  Row-wise: Matrix divided into rows; each processor multiplies assigned rows with the entire second matrix.
  Column-wise: Matrix divided into columns; each processor multiplies the entire first matrix with assigned columns.

Data Distribution and Access
  Row-wise: Each processor needs access to the entire second matrix (all columns), which can be memory-intensive.
  Column-wise: Each processor needs access to the entire first matrix (all rows), which can be memory-intensive.

Computation and Communication
  Row-wise: Each processor independently computes a complete row, reducing inter-processor communication but increasing memory bandwidth usage.
  Column-wise: Each processor computes partial results for multiple rows, requiring more inter-processor communication to combine results.

Scalability and Load Balancing
  Row-wise: Easier to implement; good load balancing if rows are evenly divisible by processors.
  Column-wise: May need extra steps for load balancing, especially if columns are not evenly divisible by processors.
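Row-wise decomposition as described above can be sketched with a small thread pool, each worker multiplying its assigned rows of A by all of B (the pool and worker count are illustrative choices):

```python
import numpy as np
from multiprocessing.pool import ThreadPool

# Row-wise decomposition: split the rows of A among workers; each worker
# needs all of B and independently produces complete rows of C.
n, workers = 6, 3
rng = np.random.default_rng(3)
A = rng.integers(0, 10, (n, n))
B = rng.integers(0, 10, (n, n))

chunks = np.array_split(np.arange(n), workers)  # row indices per worker
with ThreadPool(workers) as pool:
    parts = pool.map(lambda rows: A[rows, :] @ B, chunks)

# Result rows are simply concatenated; no reduction step is needed.
C = np.vstack(parts)
assert np.array_equal(C, A @ B)
```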
Comparison: Strassen’s vs. Fox’s vs. Cannon’s Algorithm

Parallelism
  Strassen’s Algorithm: Limited parallelism.
  Fox’s Algorithm: Works efficiently on 2D processor grids.
  Cannon’s Algorithm: Works well in distributed memory settings.

Processor Grid
  Strassen’s Algorithm: Not processor-dependent.
  Fox’s Algorithm: Requires a p × p processor grid.
  Cannon’s Algorithm: Requires p = √n processors.