Unit II: Matrix Multiplication

This unit discusses parallel algorithms for matrix multiplication, highlighting various decomposition methods such as row-wise, column-wise, and block-wise approaches. It explains specific algorithms such as Cannon's, Fox's, and Strassen's, detailing their complexities, communication overhead, and practical applications, and emphasizes the importance of parallel computing in enhancing the efficiency of matrix multiplication, especially for large matrices.

Chapter 2: Parallel Algorithms
Introduction to Parallel Matrix Multiplication
Matrix multiplication is a fundamental operation in linear algebra with numerous applications. It is a
computationally intensive task, and parallel algorithms speed it up by distributing the work across multiple
processors or threads; parallel computing can therefore significantly reduce the running time, especially for
large matrices. A common parallel matrix multiplication algorithm is explained below with an example.
Matrix multiplication takes two matrices, A (size m×k) and B (size k×n), and produces a result matrix C
(size m×n). The formula for each element C[i][j] is:

C[i][j] = Σ (l = 1 … k) A[i][l] · B[l][j]
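A minimal sequential sketch of this definition in Python (the function name and the use of plain lists are illustrative choices, not from the notes):

```python
# Naive sequential matrix multiplication: C[i][j] = sum over l of A[i][l] * B[l][j]
def matmul(A, B):
    m, k = len(A), len(A[0])       # A is m x k
    k2, n = len(B), len(B[0])      # B is k x n
    assert k == k2, "inner dimensions must match"
    C = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            for l in range(k):
                C[i][j] += A[i][l] * B[l][j]
    return C
```

The parallel schemes below all distribute pieces of these three nested loops across processors.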

Common Approaches used in matrix multiplication:


A. Decomposition Methods
1. Row-wise Decomposition
• Distribute matrix rows among processes
• Each process computes a portion of the result matrix
• Requires the entire second matrix to be available to all processes

2. Column-wise Decomposition
• Distribute matrix columns among processes
• Each process computes partial results
• Requires a final reduction step to obtain the complete result matrix
3. Block-wise (Checkerboard) Decomposition or Block-Based Parallel Multiplication
• Divide matrices into rectangular subblocks
• Distribute subblocks among processes in a 2D grid pattern
• More scalable for large matrices and high numbers of processes
One widely used method is to divide the matrices into smaller blocks (submatrices) and assign each block
computation to a separate processor or thread. This is especially effective on systems like multi-core
CPUs, GPUs, or distributed clusters.

Algorithm Steps:
1. Partition the Matrices:
o Split A into row blocks.
o Split B into column blocks.
o Split C into corresponding blocks.
2. Assign Tasks:
o Each processor computes a block of C by multiplying a row block of A with a column block of B.
3. Parallel Execution:
o Processors work simultaneously on their assigned blocks.
4. Combine Results:
o Once all blocks are computed, the full matrix C is assembled.
Parallel Framework
This can be implemented using:
• Threads (e.g., OpenMP on a multi-core CPU).
• Processes (e.g., MPI on a cluster).
• GPU threads (e.g., CUDA).

Example
Let’s multiply two 4x4 matrices in parallel:

Result matrix C will also be 4x4.

Step 1: Partition into Blocks


Divide each matrix into 2x2 submatrices (assuming 4 processors):

C is partitioned into blocks C11, C12, C21, C22, which are to be computed.


Step 2: Assign to Processors
With 4 processors:
• Processor 1 computes C11 = A11·B11 + A12·B21
• Processor 2 computes C12 = A11·B12 + A12·B22
• Processor 3 computes C21 = A21·B11 + A22·B21
• Processor 4 computes C22 = A21·B12 + A22·B22
Step 3: Compute Each Block
Step 4: Assemble C
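Taken together, these steps can be sketched in Python as below (a minimal illustration, assuming NumPy, with a thread pool standing in for the four processors; the block size and matrix values are examples only):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def block_parallel_matmul(A, B, b=2):
    """Multiply square matrices A and B, computing each b x b block of C in its own worker."""
    n = A.shape[0]
    C = np.zeros((n, n))

    def compute_block(i, j):
        # C_ij = sum over l of A_il @ B_lj  (block indices)
        acc = np.zeros((b, b))
        for l in range(0, n, b):
            acc += A[i:i+b, l:l+b] @ B[l:l+b, j:j+b]
        C[i:i+b, j:j+b] = acc

    with ThreadPoolExecutor(max_workers=(n // b) ** 2) as pool:
        tasks = [pool.submit(compute_block, i, j)
                 for i in range(0, n, b) for j in range(0, n, b)]
        for t in tasks:
            t.result()                      # wait for all blocks and surface any errors

    return C

A = np.arange(16, dtype=float).reshape(4, 4)
B = np.eye(4)
assert np.allclose(block_parallel_matmul(A, B), A @ B)
```

On a cluster, the same decomposition would be expressed with MPI processes instead of threads.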

Why Parallel?
• Speed: Each processor handles a smaller chunk simultaneously.
• Scalability: Works on multi-core systems, clusters, or GPUs.
Practical Notes
• Load Balancing: Ensure each processor gets an equal share of the work.
• Communication: In distributed systems (e.g., MPI), processors need to exchange blocks of A and B, adding overhead.
• Tools: Libraries like BLAS (CPU) or cuBLAS (GPU) optimize this further.
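For example, NumPy's matrix-multiplication operator delegates to the optimized BLAS library it was built against, so a single call is often the fastest CPU option in practice (the sizes below are illustrative):

```python
import numpy as np

A = np.random.rand(1024, 1024)
B = np.random.rand(1024, 1024)
C = A @ B   # dispatched to the underlying BLAS GEMM routine
```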

Specific Algorithms
1. Fox's Algorithm
• Theoretical Complexity: Fox's algorithm has a time complexity of O(n³/p) when using p processors, assuming ideal conditions and no communication overhead. However, in practice, communication costs can significantly affect performance.
• Communication Overhead: Fox's algorithm involves broadcasting diagonal blocks and shifting matrix blocks, which can lead to substantial communication overhead, especially in distributed memory architectures.
• Scalability: It is relatively easy to implement but may not scale as well as other algorithms due to its communication requirements.
2. Strassen's Algorithm
• Theoretical Complexity: Strassen's algorithm has a time complexity of O(n^(log₂ 7)) ≈ O(n^2.807), making it faster than the standard O(n³) algorithm for large matrices.
• Communication Overhead: While Strassen's algorithm reduces computational complexity, its communication overhead is generally higher due to the recursive nature of the algorithm, which requires more data movement between levels of recursion.
• Scalability: It is more complex to implement and less stable numerically than the standard algorithm but offers significant speedup for very large matrices.
3. Cannon's Algorithm
• Theoretical Complexity: Cannon's algorithm also has a time complexity of O(n³/p) when using p processors, similar to Fox's algorithm. However, it is optimized for 2D mesh architectures.
• Communication Overhead: Cannon's algorithm minimizes communication overhead by aligning matrix blocks in a way that reduces the need for data exchange between processors. It shifts blocks cyclically, which can be efficient in homogeneous 2D grids.
• Scalability: It is well-suited for homogeneous 2D grids but can be challenging to extend to heterogeneous grids.

Explanation with example:


1. Cannon's Algorithm : Cannon’s Algorithm is designed for parallel matrix
multiplication. It involves two key phases: initial alignment (shifting the blocks of A and
B) and computation (multiplying and accumulating results over iterations). Let’s compute
C following the algorithm.
Iterations: √p
• If using p processors arranged in a √p × √p grid, the algorithm runs for √p iterations.
• Each processor performs one local matrix multiplication per iteration.
• Matrix shifts occur at each step.

🔹 Iteration Formula:
Iterations = √p
• Example: If using a 4×4 processor grid (p = 16), the number of iterations is 4.
• Example Calculation:
o First Iteration: Initial matrix shifting & local computation.
o Second Iteration: Shift again & recompute.
o Third Iteration: Repeat.
o Fourth Iteration: Final computation.

🔹 Total Multiplications:
• Each processor performs O(n³/p) computations in total.
• If p = n² (maximum parallelism), each processor does O(1) multiplications per iteration.

Steps:
1. Initial alignment of matrix blocks
2. Perform local block multiplications
3. Shift blocks cyclically
4. Repeat steps 2-3 until all blocks are processed

Example 1:

Step 1: Initial Alignment:


Step 2: Perform local block multiplications

Step 3: Shift blocks cyclically and iterate


Then multiply and accumulate the result.

Multiply and add to the previous matrix C values:


Step 4: Repeat steps 2-3 until all blocks are processed

Example: Multiplying Two 4x4 Matrices

To begin with, we need to partition matrices A and B into smaller blocks. Since we are using a 4 by 4 grid
of processes, each block will be 1×1 in this case. The initial alignment involves shifting the rows of matrix A
to the left by their row index modulo 4 and shifting the columns of matrix B up by their column index
modulo 4. This ensures that each process can start multiplying its local blocks. We will use a 4x4 processor
grid, meaning we have 16 processes arranged in a 2D grid.

Given two matrices A and B:


Step 1: Initial Alignment

1. Partition Matrices: Divide A and B into 16 blocks, each 1x1 in this case, since we are using a 4x4
grid of processes.

2. Initial Shift: Perform an initial shift to align the blocks so that each process can start multiplying
its local blocks.

• For matrix A, shift each row to the left by its row index modulo 4.

• For matrix B, shift each column up by its column index modulo 4.

After shifting:
3. Local Multiplication: Each process multiplies its local blocks of A and B and accumulates the
result.

Step 2: Cyclic Shift and Multiply


1. Shift Blocks: Shift each row of A one position to the left and each column of B one position up.

After shifting:
2. Local Multiplication: Each process multiplies its new local blocks of A and B and adds the result
to the accumulated sum.

Accumulate: C = C1 + C2
Step 3 & 4: Repeat Shift and Multiply
Repeat the process of shifting and multiplying until all blocks have been processed. After four iterations,
the final result matrix C is obtained by summing all the intermediate results.

Final Result

The final result matrix C is calculated by summing the results from all iterations:

C = C1 + C2 + C3 + C4
After performing the calculations for each step and summing the results, we obtain the final matrix C.
This example illustrates how Cannon's algorithm efficiently performs matrix multiplication in parallel by
minimizing communication overhead through structured data shifts and local multiplications.
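A minimal single-process simulation of these alignment, multiply, and shift steps in Python (a sketch, assuming NumPy; the blocks that would live on separate processors or MPI ranks are simply held in nested lists here, and the 2×2 grid is an illustrative choice):

```python
import numpy as np

def cannon_matmul(A, B, grid=2):
    """Simulate Cannon's algorithm on a grid x grid block decomposition of square matrices."""
    n = A.shape[0]
    b = n // grid                                  # block size
    Ab = [[A[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(grid)] for i in range(grid)]
    Bb = [[B[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(grid)] for i in range(grid)]
    Cb = [[np.zeros((b, b)) for _ in range(grid)] for _ in range(grid)]

    # Initial alignment: shift row i of A left by i, column j of B up by j.
    Ab = [row[i:] + row[:i] for i, row in enumerate(Ab)]
    Bb = [[Bb[(i + j) % grid][j] for j in range(grid)] for i in range(grid)]

    # sqrt(p) = grid iterations of local multiply-accumulate followed by cyclic shifts.
    for _ in range(grid):
        for i in range(grid):
            for j in range(grid):
                Cb[i][j] += Ab[i][j] @ Bb[i][j]
        Ab = [row[1:] + row[:1] for row in Ab]                                    # shift A left by one
        Bb = [[Bb[(i + 1) % grid][j] for j in range(grid)] for i in range(grid)]  # shift B up by one

    return np.block(Cb)

A = np.random.rand(4, 4)
B = np.random.rand(4, 4)
assert np.allclose(cannon_matmul(A, B), A @ B)
```

In a real distributed implementation, each block would reside on one processor of the 2D mesh and the shifts would be nearest-neighbour messages.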

2. Fox's Algorithm: Fox's Algorithm is a parallel algorithm for matrix multiplication designed for distributed
computing. It works efficiently on 2D processor grids (√p × √p). Unlike Cannon's Algorithm, which cyclically
shifts the blocks of both matrices, Fox's Algorithm broadcasts blocks of A along processor rows and shifts
only the blocks of B.
Steps:
1. Organize processes in a 2D grid
2. Broadcast diagonal blocks of A along their processor rows
3. Perform local multiplications
4. Shift matrix B blocks
5. Repeat steps 2-4 for all stages
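A minimal single-process simulation of these stages in Python (a sketch, assuming NumPy; the 2×2 grid is an illustrative choice, and in an MPI implementation the row broadcast of A and the upward shift of B would be communication steps):

```python
import numpy as np

def fox_matmul(A, B, grid=2):
    """Simulate Fox's algorithm: broadcast one block of A per row, multiply locally, shift B up."""
    n = A.shape[0]
    b = n // grid
    Ab = [[A[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(grid)] for i in range(grid)]
    Bb = [[B[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(grid)] for i in range(grid)]
    Cb = [[np.zeros((b, b)) for _ in range(grid)] for _ in range(grid)]

    for stage in range(grid):
        for i in range(grid):
            k = (i + stage) % grid          # block of A broadcast along processor row i at this stage
            for j in range(grid):
                Cb[i][j] += Ab[i][k] @ Bb[i][j]
        # Shift the blocks of B up by one grid row (cyclically).
        Bb = [[Bb[(i + 1) % grid][j] for j in range(grid)] for i in range(grid)]

    return np.block(Cb)

A = np.random.rand(4, 4)
B = np.random.rand(4, 4)
assert np.allclose(fox_matmul(A, B), A @ B)
```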
3. Strassen's Algorithm: Strassen's Algorithm is a divide-and-conquer approach that reduces the number of
multiplications required for matrix multiplication. It is more efficient than the classical method, reducing
the time complexity from O(n³) to approximately O(n^2.81).

Steps:
1. Divide matrices into quadrants
2. Compute seven matrix products (M1 to M7)
3. Combine results to form the final matrix
Example (2x2 matrices):
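A minimal recursive sketch in Python showing the seven products M1 to M7 and how they combine into the quadrants of C (assuming NumPy and square matrices whose size is a power of two):

```python
import numpy as np

def strassen(A, B):
    """Strassen multiplication for square matrices whose size is a power of two."""
    n = A.shape[0]
    if n == 1:
        return A * B                                   # 1x1 base case

    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

    # Seven recursive products instead of the classical eight.
    M1 = strassen(A11 + A22, B11 + B22)
    M2 = strassen(A21 + A22, B11)
    M3 = strassen(A11, B12 - B22)
    M4 = strassen(A22, B21 - B11)
    M5 = strassen(A11 + A12, B22)
    M6 = strassen(A21 - A11, B11 + B12)
    M7 = strassen(A12 - A22, B21 + B22)

    # Combine the products into the quadrants of C.
    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6
    return np.block([[C11, C12], [C21, C22]])

A = np.random.rand(4, 4)
B = np.random.rand(4, 4)
assert np.allclose(strassen(A, B), A @ B)
```

Practical implementations stop the recursion well above 1×1 and switch to an optimized standard multiply for the small base cases.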
Performance Considerations
• The choice of algorithm depends on matrix size and hardware architecture
• Strassen's algorithm is typically beneficial for matrices larger than 1000x1000
• Practical implementations often use hybrid approaches, switching to standard methods for smaller submatrices

Key Differences between Row-wise and Column-wise Decomposition:

| Key Difference | Row-wise Decomposition | Column-wise Decomposition |
|---|---|---|
| Matrix Division | Matrix divided into rows; each processor multiplies its assigned rows with the entire second matrix. | Matrix divided into columns; each processor multiplies the entire first matrix with its assigned columns. |
| Data Distribution and Access | Each processor needs access to the entire second matrix (all columns), which can be memory-intensive. | Each processor needs access to the entire first matrix (all rows), which can be memory-intensive. |
| Computation and Communication | Each processor independently computes a complete row, reducing inter-processor communication but increasing memory bandwidth usage. | Each processor computes partial results for multiple rows, requiring more inter-processor communication to combine results. |
| Scalability and Load Balancing | Easier to implement; good load balancing if rows are evenly divisible by processors. | May need extra steps for load balancing, especially if columns are not evenly divisible by processors. |
| Memory Usage and Bandwidth | High memory bandwidth required to access the entire second matrix for each row computation. | High memory bandwidth required to access the entire first matrix for each column computation. |
| Synchronization and Reduction | Less synchronization required, since each processor computes a complete row independently. | More synchronization required to combine partial results from different processors. |
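As an illustration of the row-wise case, a minimal sketch using Python's multiprocessing (the worker count and matrix sizes are assumed for the example; note that every worker receives its rows of A together with all of B, matching the data-access pattern described above):

```python
import numpy as np
from multiprocessing import Pool

def multiply_rows(args):
    rows, B = args                  # a worker gets its row block of A and the whole of B
    return rows @ B

if __name__ == "__main__":
    A = np.random.rand(8, 8)
    B = np.random.rand(8, 8)
    n_workers = 4
    row_blocks = np.array_split(A, n_workers, axis=0)      # row-wise decomposition of A
    with Pool(n_workers) as pool:
        partial_results = pool.map(multiply_rows, [(blk, B) for blk in row_blocks])
    C = np.vstack(partial_results)                          # assemble the row blocks of C in order
    assert np.allclose(C, A @ B)
```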

Comparison Table between Cannon's, Fox's and Strassen's Algorithms:

| Feature | Cannon's Algorithm | Strassen's Algorithm | Fox's Algorithm |
|---|---|---|---|
| Type | Parallel & block-based | Divide-and-conquer | Parallel & broadcast-based |
| Time Complexity | O(n³/p) (parallel) | O(n^2.81) (sequential) | O(n³/p) (parallel) |
| Parallelism | Works efficiently on 2D processor grids | Limited parallelism | Works well in distributed memory settings |
| Processor Grid | Requires p processors arranged in a √p × √p grid | Not processor-dependent | Requires a √p × √p processor grid |
| Memory Usage | High (since all submatrices are stored locally) | Low (less storage required) | Medium (distributed memory system) |
| Matrix Size | Works best with square matrices of even size | Works for any power of 2; needs padding for odd sizes | Works with any size, but best for square matrices |
| Communication Cost | High due to frequent data shifting | Low (recursion-based) | Moderate (broadcasting needed) |
| Load Balancing | Even load distribution | Can be imbalanced for non-power-of-2 sizes | Even load distribution |
| Best Used For | Large, distributed parallel systems (2D mesh) | Small to medium matrices where fast sequential computation is needed | Large-scale parallel computing on multiple processors |
