Unit II Matrix Multiplication
Parallel Algorithms
Introduction to Parallel Matrix Multiplication
Matrix multiplication is a fundamental operation in linear algebra with numerous applications. It is also computationally intensive, and parallel computing can significantly speed it up, especially for large matrices, by distributing the work across multiple processors or threads. A common parallel matrix multiplication algorithm is explained below, with an example, in a clear and structured way. Matrix multiplication involves taking two matrices, A (size m×n) and B (size n×p), and producing a result matrix C (size m×p). The formula for each element C[i][j] is:

C[i][j] = Σₖ A[i][k] · B[k][j], summing over k = 1, …, n
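In code form, this per-element formula gives the classic triple loop. A minimal sequential sketch in Python (the language choice is ours, not prescribed by the notes) serves as a baseline for the parallel versions that follow:

```python
# Naive triple-loop matrix multiplication: C[i][j] = sum over k of A[i][k] * B[k][j].
# A is m x n, B is n x p, C is m x p.
def matmul(A, B):
    m, n, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(m)]
    for i in range(m):
        for j in range(p):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B))  # [[19, 22], [43, 50]]
```

Every parallel scheme below distributes some part of this triple loop across processors.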
Decomposition Strategies
1. Row-wise Decomposition
Distribute matrix rows among processes
Each process multiplies its assigned rows of A with the entire matrix B
No reduction step is needed; the result rows are simply concatenated
2. Column-wise Decomposition
Distribute matrix columns among processes
Each process computes partial results
Requires a final reduction step to obtain the complete result
3. Block-wise (Checkerboard) Decomposition or Block-Based Parallel Multiplication
Divide matrices into rectangular subblocks
Distribute subblocks among processes in a 2D grid pattern
More scalable for large matrices and high numbers of processes
One widely used method is to divide the matrices into smaller blocks (submatrices) and assign each block
computation to a separate processor or thread. This is especially effective on systems like multi-core
CPUs, GPUs, or distributed clusters.
Algorithm Description
1. Partition the Matrices:
o Split A into row blocks.
o Split B into column blocks.
o Split C into corresponding blocks.
2. Assign Tasks:
o Each processor computes a block of C by multiplying a row block of A with a column
block of B.
3. Parallel Execution:
o Processors work simultaneously on their assigned blocks.
4. Combine Results:
o Once all blocks are computed, the full matrix C is assembled.
Parallel Framework
This can be implemented using:
Threads (e.g., OpenMP on a multi-core CPU).
Processes (e.g., MPI on a cluster).
GPU threads (e.g., CUDA).
Example
Let’s multiply two 4x4 matrices in parallel:
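The worked matrices for this example do not survive in this copy of the notes. As a stand-in, here is a sketch of block-based parallel multiplication of two random 4x4 matrices, using Python threads and NumPy (both assumed tools, not prescribed by the text):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Multiply two 4x4 matrices by 2x2 blocks; each task computes one block of C
# from a row block of A and a column block of B, so the four tasks run in parallel.
n, bs = 4, 2
rng = np.random.default_rng(0)
A = rng.integers(0, 10, (n, n))
B = rng.integers(0, 10, (n, n))

def compute_block(i, j):
    # Block C[i:i+bs, j:j+bs] = (row block of A) @ (column block of B)
    return (i, j, A[i:i + bs, :] @ B[:, j:j + bs])

with ThreadPoolExecutor() as pool:
    results = pool.map(lambda ij: compute_block(*ij),
                       [(i, j) for i in range(0, n, bs) for j in range(0, n, bs)])

# Combine results: assemble the computed blocks into the full matrix C.
C = np.zeros((n, n), dtype=A.dtype)
for i, j, block in results:
    C[i:i + bs, j:j + bs] = block

assert np.array_equal(C, A @ B)
```

On a real cluster the same decomposition would be expressed with MPI ranks instead of threads, but the partition/assign/combine structure is identical.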
Why Parallel?
Speed: Each processor handles a smaller chunk simultaneously.
Scalability: Works on multi-core systems, clusters, or GPUs.
Practical Notes
Load Balancing: Ensure each processor gets equal work.
Communication: In distributed systems (e.g., MPI), processors need to share A and B
blocks, adding overhead.
Tools: Libraries like BLAS (CPU) or cuBLAS (GPU) optimize this further.
Specific Algorithms
1. Fox's Algorithm
Theoretical Complexity: Fox's algorithm has a time complexity of O(n³/p) when
using p processors, assuming ideal conditions and no communication overhead. However,
in practice, communication costs can significantly affect performance.
Communication Overhead: Fox's algorithm involves broadcasting diagonal elements and
shifting matrix blocks, which can lead to substantial communication overhead, especially
in distributed memory architectures.
Scalability: It is relatively easy to implement but may not scale as well as other algorithms
due to its communication requirements.
2. Strassen's Algorithm
Theoretical Complexity: Strassen's algorithm has a time complexity
of O(n^(log₂ 7)) ≈ O(n^2.807), making it faster than the
standard O(n³) algorithm for large matrices.
Communication Overhead: While Strassen's algorithm reduces computational complexity,
its communication overhead is generally higher due to the recursive nature of the
algorithm, which requires more data movement between levels of recursion.
Scalability: It is more complex to implement and less stable numerically than the standard
algorithm but offers significant speedup for very large matrices.
3. Cannon's Algorithm
Theoretical Complexity: Cannon's algorithm also has a time complexity
of O(n³/p) when using p processors, similar to Fox's algorithm. However, it is
optimized for 2D mesh architectures.
Communication Overhead: Cannon's algorithm minimizes communication overhead by
aligning matrix blocks in a way that reduces the need for data exchange between
processors. It shifts blocks cyclically, which can be efficient in homogeneous 2D grids.
Scalability: It is well-suited for homogeneous 2D grids but can be challenging to extend to
heterogeneous grids.
🔹 Iteration Formula:
Iterations = √p
Example: If using a 4x4 processor grid (p = 16), the number of iterations is √16 = 4.
Example Calculation:
o First Iteration: Initial matrix shifting & local computation.
o Second Iteration: Shift again & recompute.
o Third Iteration: Repeat.
o Fourth Iteration: Final computation.
🔹 Total Multiplications:
Each processor performs O(n³/p) computations.
If p = n² (maximum parallelism), each processor does O(1) multiplications
per iteration.
Steps:
1. Initial alignment of matrix blocks
2. Perform local block multiplications
3. Shift blocks cyclically
4. Repeat steps 2-3 until all blocks are processed
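The four steps above can be simulated directly. The sketch below models a √p × √p grid with 1x1 blocks (one matrix element per "process"), using NumPy roll operations to stand in for inter-process shifts (an illustrative assumption, not a distributed implementation):

```python
import numpy as np

# Simulate Cannon's algorithm on a 4x4 grid with 1x1 blocks.
n = 4
rng = np.random.default_rng(1)
A = rng.integers(0, 10, (n, n)).astype(float)
B = rng.integers(0, 10, (n, n)).astype(float)

# Step 1: initial alignment - shift row i of A left by i, column j of B up by j.
A_s = np.array([np.roll(A[i, :], -i) for i in range(n)])
B_s = np.column_stack([np.roll(B[:, j], -j) for j in range(n)])

C = np.zeros((n, n))
for _ in range(n):                  # sqrt(p) = n iterations for 1x1 blocks
    C += A_s * B_s                  # Step 2: local block products (elementwise here)
    A_s = np.roll(A_s, -1, axis=1)  # Step 3: shift A blocks left by one
    B_s = np.roll(B_s, -1, axis=0)  #         and B blocks up by one
                                    # Step 4: repeat until all blocks processed

assert np.allclose(C, A @ B)
```

After the alignment, process (i, j) holds A[i, (i+j) mod n] and B[(i+j) mod n, j], so each iteration contributes one term of the sum Σₖ A[i][k]·B[k][j].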
Example 1:
To begin with, we need to partition matrices A and B into smaller blocks. Since we are using a 4x4 grid
of processes, each block will be 1x1 in this case. The initial alignment involves shifting
the rows of matrix A to the left by their row index modulo 4 and shifting the columns of matrix B up by
their column index modulo 4. This ensures that each process can start multiplying its local blocks. We
will use a 4x4 processor grid, meaning we have 16 processes arranged in a 2D grid.
1. Partition Matrices: Divide A and B into 16 blocks, each 1x1 in this case, since we are using a 4x4
grid of processes.
2. Initial Shift: Align the blocks so that each process can start multiplying its local blocks. For
matrix A, shift each row to the left by its row index modulo 4; for matrix B, shift each column up
by its column index modulo 4.
3. Local Multiplication: Each process multiplies its local blocks of A and B and accumulates the
result (C1).
4. Shift and Multiply: Shift the blocks of A left and the blocks of B up by one position; each process
multiplies its new local blocks and adds the result to the accumulated sum: C = C1 + C2.
5. Repeat: Continue shifting and multiplying until all blocks have been processed. After four iterations,
the final result matrix C is obtained by summing all the intermediate results.
Final Result
The final result matrix C is calculated by summing the results from all iterations:
C = C1 + C2 + C3 + C4
After performing the calculations for each step and summing the results, we obtain the final matrix C.
This example illustrates how Cannon's algorithm efficiently performs matrix multiplication in parallel by
minimizing communication overhead through structured data shifts and local multiplications.
2. Fox's Algorithm: Fox’s Algorithm is a parallel algorithm for matrix multiplication designed
for distributed computing. It works efficiently on 2D processor grids (p × p). Unlike Cannon’s
Algorithm, which shifts elements, Fox’s Algorithm broadcasts matrix blocks.
Steps:
1. Organize processes in a 2D grid
2. Broadcast diagonal elements
3. Perform local multiplications
4. Shift matrix B blocks
5. Repeat steps 2-4 for all stages
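Fox's broadcast-and-shift stages can be sketched the same way as Cannon's. The simulation below assumes 1x1 blocks on an n x n grid, with the per-row broadcast modeled by NumPy broadcasting (again an illustrative simulation, not a distributed implementation):

```python
import numpy as np

# Simulate Fox's algorithm with 1x1 blocks. In stage k, row i broadcasts
# A[i, (i+k) mod n] to all processes in that row; every process multiplies
# it with its current B block; then the B blocks shift up by one.
n = 4
rng = np.random.default_rng(2)
A = rng.integers(0, 10, (n, n)).astype(float)
B = rng.integers(0, 10, (n, n)).astype(float)

C = np.zeros((n, n))
B_s = B.copy()
for k in range(n):
    # Step 2: broadcast the k-th generalized diagonal of A along each row.
    diag = np.array([[A[i, (i + k) % n]] for i in range(n)])  # shape (n, 1)
    # Step 3: local multiplication (elementwise, since blocks are 1x1).
    C += diag * B_s
    # Step 4: shift B blocks up cyclically.
    B_s = np.roll(B_s, -1, axis=0)

assert np.allclose(C, A @ B)
```

The key contrast with Cannon's algorithm is visible in the loop: A blocks are broadcast rather than shifted, and only B moves.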
3. Strassen's Algorithm : Strassen’s Algorithm is a divide-and-conquer approach that reduces the
number of multiplications required for matrix multiplication. It is more efficient than the classical method,
reducing the time complexity from O(n³) to approximately O(n^{2.81}).
Steps:
1. Divide matrices into quadrants
2. Compute seven matrix products (M1 to M7)
3. Combine results to form the final matrix
Example (2x2 matrices):
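The 2x2 example appears to have been lost in this copy. The sketch below spells out the seven Strassen products M1 to M7 for 2x2 matrices (the variable names a through h are illustrative):

```python
# Strassen's seven products for 2x2 matrices A = [[a, b], [c, d]],
# B = [[e, f], [g, h]]: 7 multiplications instead of the usual 8.
def strassen_2x2(A, B):
    (a, b), (c, d) = A
    (e, f), (g, h) = B
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    # Combine the products into the four quadrants of C.
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4,           m1 - m2 + m3 + m6]]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(strassen_2x2(A, B))  # [[19, 22], [43, 50]]
```

The saving of one multiplication per level, applied recursively to matrix quadrants, is what yields the O(n^2.81) bound.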
Performance Considerations
The choice of algorithm depends on matrix size and hardware architecture
Strassen's algorithm is typically beneficial for matrices larger than 1000x1000
Practical implementations often use hybrid approaches, switching to standard methods for smaller
submatrices
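Such a hybrid can be sketched as a recursive Strassen that falls back to the standard method below a cutoff. The cutoff value of 64 and the power-of-two matrix sizes are simplifying assumptions; real libraries tune the threshold to the hardware:

```python
import numpy as np

# Hybrid Strassen: recurse with the 7-product scheme for large matrices,
# fall back to standard multiplication below a cutoff. Assumes n is a
# power of two so quadrants split evenly.
CUTOFF = 64

def strassen(A, B):
    n = A.shape[0]
    if n <= CUTOFF:
        return A @ B  # standard method for small submatrices
    h = n // 2
    a, b, c, d = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    e, f, g, k = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    m1 = strassen(a + d, e + k)
    m2 = strassen(c + d, e)
    m3 = strassen(a, f - k)
    m4 = strassen(d, g - e)
    m5 = strassen(a + b, k)
    m6 = strassen(c - a, e + f)
    m7 = strassen(b - d, g + k)
    top = np.hstack([m1 + m4 - m5 + m7, m3 + m5])
    bottom = np.hstack([m2 + m4, m1 - m2 + m3 + m6])
    return np.vstack([top, bottom])

A = np.random.default_rng(4).integers(0, 10, (128, 128))
B = np.random.default_rng(5).integers(0, 10, (128, 128))
assert np.array_equal(strassen(A, B), A @ B)
```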
Comparison: Row-wise vs. Column-wise Decomposition

Matrix Division
  Row-wise: Matrix divided into rows; each processor multiplies assigned rows with the entire second matrix.
  Column-wise: Matrix divided into columns; each processor multiplies the entire first matrix with assigned columns.

Data Distribution and Access
  Row-wise: Each processor needs access to the entire second matrix (all columns), which can be memory-intensive.
  Column-wise: Each processor needs access to the entire first matrix (all rows), which can be memory-intensive.

Computation and Communication
  Row-wise: Each processor independently computes a complete row, reducing inter-processor communication but increasing memory bandwidth usage.
  Column-wise: Each processor computes partial results for multiple rows, requiring more inter-processor communication to combine results.

Scalability and Load Balancing
  Row-wise: Easier to implement; good load balancing if rows are evenly divisible by processors.
  Column-wise: May need extra steps for load balancing, especially if columns are not evenly divisible by processors.
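Row-wise decomposition as described above can be sketched with a small thread pool, each worker multiplying its assigned rows of A by all of B (the pool and worker count are illustrative choices):

```python
import numpy as np
from multiprocessing.pool import ThreadPool

# Row-wise decomposition: split the rows of A among workers; each worker
# needs all of B and independently produces complete rows of C.
n, workers = 6, 3
rng = np.random.default_rng(3)
A = rng.integers(0, 10, (n, n))
B = rng.integers(0, 10, (n, n))

chunks = np.array_split(np.arange(n), workers)  # row indices per worker
with ThreadPool(workers) as pool:
    parts = pool.map(lambda rows: A[rows, :] @ B, chunks)

# Result rows are simply concatenated; no reduction step is needed.
C = np.vstack(parts)
assert np.array_equal(C, A @ B)
```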
Comparison: Strassen’s vs. Fox’s vs. Cannon’s Algorithm

Parallelism
  Strassen’s Algorithm: Limited parallelism.
  Fox’s Algorithm: Works efficiently on 2D processor grids.
  Cannon’s Algorithm: Works well in distributed memory settings.

Processor Grid
  Strassen’s Algorithm: Not processor-dependent.
  Fox’s Algorithm: Requires a p × p processor grid.
  Cannon’s Algorithm: Requires p = √n processors.