
Assignment 2

Instructions:

● Attempt both questions.


● Make two separate folders, one for each question, and combine them into a single ZIP
archive named ROLL-NUM_SECTION_A2.ZIP. Each folder must contain
only code and the PDF report (no binaries, no .exe files, etc.). Do not submit a .rar file.
● Each folder must contain the files specified in the guidelines for the
respective question.
● Each file that you submit must be named in the format ROLL-NUM_SECTION_A2
(e.g., 23i-0001_A_A2.c) and must contain your name, student ID, and assignment
number in comments at the top of the file.
● All submissions must be made on Google Classroom within the deadline.
Submissions through any other channel (e.g., email) will not be
accepted.
● You are solely responsible for checking the final ZIP file for issues such as
corrupt files, viruses, or mistakenly included executables. If we cannot download the
file from Google Classroom for any reason, the assignment will receive zero
marks.
● Program output should be well formatted and clearly presented. Use appropriate
comments and indentation in your source code.
● Be prepared for viva or anything else after the submission of the assignment.
● If there is a syntax error in the code, zero marks will be awarded for that part of the
assignment.
● Understanding the assignment is also part of the assignment.
● Zero marks will be awarded to students involved in plagiarism. (Copying
from the internet is the easiest way to get caught.)
● Late Submission policy will be applied as described in the course outline.

Tip: For timely completion of the assignment, start as early as possible.


Note: Follow the given instructions to the letter; failing to do so will result in a zero.
Question#1:
SIMD Optimization using AVX/NEON

Objective:

The goal of this task is to familiarize students with SIMD (Single Instruction, Multiple Data)
programming by optimizing a computationally intensive task using AVX intrinsics (for x86
processors) or NEON intrinsics (for Apple Silicon). Students will compare the performance of
their SIMD-optimized implementation against a scalar (non-SIMD) implementation.

Problem Statement
You are given a computationally intensive task: matrix transposition with element-wise
multiplication. Your task is to implement this operation in two ways:
1. Scalar Implementation: A straightforward, non-SIMD implementation, in two
variants: one using a 2D array and one using a flattened 1D array.
2. SIMD-Optimized Implementation: An optimized version using AVX intrinsics (for x86) or
NEON intrinsics (for Apple Silicon).

The operation is defined as follows:


- Given two matrices, A and B, of size N x N, compute the transpose of A (let’s call it A_T) and
then perform element-wise multiplication of `A_T` and `B`. The result is stored in a matrix `C`
of size `N x N`.

Requirements
1. Scalar Implementation:
- Implement the matrix transposition and element-wise multiplication using scalar operations
(no SIMD).
- Ensure the implementation is correct and works for any square matrix of size `N x N`.

2. SIMD-Optimized Implementation:
- Use AVX intrinsics (for x86) or NEON intrinsics (for Apple Silicon) to optimize the
computation.
- Focus on optimizing both the matrix transposition and the element-wise multiplication.
- Ensure the implementation is correct and works for any square matrix of size `N x N`.

3. Performance Comparison:
- Measure the execution time of both implementations for different matrix sizes (e.g., `N =
256, 512, 1024`).
- Compare the performance of the SIMD-optimized implementation against the scalar
implementation.

4. Report:
- Provide a brief report explaining your approach to SIMD optimization.
- Include performance results (e.g., execution time and speedup, where speedup = scalar time / SIMD time).
- Discuss any challenges you faced and how you addressed them.

Implementation Details
- Use single-precision floating-point numbers (`float`) for matrix elements.
- Assume `N` is a multiple of 8 (for AVX) or 4 (for NEON) to simplify alignment and
vectorization.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <immintrin.h>   // AVX intrinsics (x86)
// #include <arm_neon.h> // NEON intrinsics (Apple Silicon)

#define N 256

void scalar_2Dimplementation(float A[N][N], float B[N][N], float C[N][N]) {
    // Scalar implementation of matrix transposition and element-wise multiplication
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            C[i][j] = A[j][i] * B[i][j];
        }
    }
}

void scalar_1Dimplementation(float *A, float *B, float *C) {
    // Scalar implementation using flattened 1D arrays (element (i, j) is at index i*N + j)
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            // Insert your code here
        }
    }
}

void simd_implementation(float A[N][N], float B[N][N], float C[N][N]) {
    // SIMD-optimized implementation using AVX or NEON intrinsics
    // TODO: Implement SIMD optimization here
}

int main(int argc, char *argv[]) {
    // Add code here to obtain the value of N from command-line parameters.
    // (Note: with N as a compile-time constant, a runtime N requires dynamic
    // allocation; for large N such as 1024, allocate the matrices on the heap
    // to avoid overflowing the stack.)
    float A[N][N], B[N][N], C[N][N];

    // Initialize matrices A and B with random values
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            A[i][j] = (float)rand() / RAND_MAX;
            B[i][j] = (float)rand() / RAND_MAX;
        }
    }

    // Measure performance of the scalar implementations
    clock_t start = clock();
    scalar_2Dimplementation(A, B, C);
    clock_t end = clock();
    printf("Scalar 2D time: %f seconds\n", (double)(end - start) / CLOCKS_PER_SEC);

    start = clock();
    scalar_1Dimplementation(&A[0][0], &B[0][0], &C[0][0]);
    end = clock();
    printf("Scalar 1D time: %f seconds\n", (double)(end - start) / CLOCKS_PER_SEC);

    // Measure performance of the SIMD implementation
    start = clock();
    simd_implementation(A, B, C);
    end = clock();
    printf("SIMD time: %f seconds\n", (double)(end - start) / CLOCKS_PER_SEC);

    return 0;
}
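
As a starting point for the SIMD part, the element-wise multiplication stage maps naturally onto AVX once the transpose of A is available. A minimal sketch, assuming A_T already holds the transpose as a flattened 1D array and N is a multiple of 8 (the function name multiply_avx and the flattened layout are illustrative, not a required interface):

#include <immintrin.h>

// Multiply corresponding elements of A_T and B, 8 floats per iteration.
void multiply_avx(const float *A_T, const float *B, float *C, int n) {
    for (int i = 0; i < n * n; i += 8) {
        __m256 a = _mm256_loadu_ps(A_T + i);          // load 8 floats of A_T
        __m256 b = _mm256_loadu_ps(B + i);            // load 8 floats of B
        _mm256_storeu_ps(C + i, _mm256_mul_ps(a, b)); // C[i..i+7] = a * b
    }
}

The transposition itself is the harder part to vectorize, since reading a column of A is a strided access; a common approach is to transpose in small blocks (e.g., 8×8 tiles built with AVX unpack/shuffle intrinsics) so loads and stores stay contiguous. On NEON, the same multiply pattern uses vld1q_f32, vmulq_f32, and vst1q_f32 with 4 floats per vector.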

Q#1: Submission Guidelines:

1. Source code for both scalar and SIMD implementations.


2. A report (PDF) containing:
- Explanation of your SIMD optimization approach.
- Performance results (execution time and speedup).
- Challenges faced and solutions.

Q#1: Evaluation Criteria:

1. Correctness of the implementation.


2. Performance improvement achieved by the SIMD-optimized implementation.
3. Quality of the report (clarity, depth of analysis).

Question#2

MPI-Based Solution to Compute the Distance Matrix for Multiple Sequence
Alignment (MSA)

Objective:
Design and implement a parallel algorithm using the Message Passing Interface (MPI) to
compute the distance matrix required for large-scale multiple sequence alignment (MSA).
Distance matrices are used by MSA tools such as ClustalW, Clustal Omega, and MAFFT for
multiple sequence alignment. The focus should be on performance optimization, scalability,
and addressing the challenges of distributed computing in bioinformatics.

Problem Description:

Given a large collection of DNA sequences, perform multiple sequence alignment (MSA) to
identify regions of similarity that may indicate functional, structural, or evolutionary
relationships. Implement the progressive alignment method, which is widely used for MSA
due to its efficiency and scalability. A detailed explanation of sequence alignment and an
example using MAFFT is provided below.
Serial Algorithm Overview:

Multiple sequence alignment can be solved using various approaches. The assignment
requires implementing MAFFT, which follows these steps:

Steps Involved:

1. Pairwise Sequence Alignment: Compute pairwise alignments for all sequences
using FFT-based alignment techniques.
2. Guide Tree Construction: Use Fast Fourier Transform (FFT) and progressive
alignment to construct a phylogenetic tree representing sequence relationships.
3. Progressive Alignment: Align sequences progressively based on the guide tree,
starting with the most closely related sequences.
4. Final Refinement: Improve the alignment iteratively using consistency-based
refinement techniques.

What is Sequence Alignment?

Sequence alignment is the process of arranging biological sequences (DNA, RNA, or protein)
to identify regions of similarity. The goal is to determine functional, structural, or evolutionary
relationships between the sequences.

● Pairwise Alignment: Aligns two sequences (e.g., Needleman-Wunsch for global
alignment, Smith-Waterman for local alignment).
● Multiple Sequence Alignment (MSA): Aligns more than two sequences to find
common evolutionary patterns.

Example: Step-by-Step Explanation of FFT-Based MSA

We align four sequences using FFT-based MSA. The sequences:

Seq1: MKTLLILTCLVAVALARPKAQQL
Seq2: MKTVLILTCLVALAKPKAQQL
Seq3: MKTLLILACLVALARKAQQL
Seq4: MKTLLILTCLVALAKPQQL

Step 1: Convert Sequences to Numerical Representation

● Each amino acid or nucleotide is converted into a numerical vector (e.g., using
physicochemical properties or binary encoding).
● Example encoding (simplified for demonstration):
○ A = (1,0,0)
○ C = (0,1,0)
○ G = (0,0,1)

For parallel execution, each process (MPI rank) can handle a subset of sequences.
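
For instance, a one-hot encoding along these lines (a minimal sketch; extending the simplified A/C/G scheme above with T, and the encode function itself, are assumptions for DNA input):

#include <string>
#include <vector>

// One-hot encode a DNA sequence: each character becomes a 4-component vector.
std::vector<float> encode(const std::string &seq) {
    std::vector<float> out;
    for (char c : seq) {
        float v[4] = {0, 0, 0, 0};
        switch (c) {
            case 'A': v[0] = 1; break;
            case 'C': v[1] = 1; break;
            case 'G': v[2] = 1; break;
            case 'T': v[3] = 1; break;
        }
        out.insert(out.end(), v, v + 4);   // append the 4-vector
    }
    return out;
}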
Step 2: Apply FFT to Sequences (Parallelizable)

● FFT transforms each sequence into the frequency domain.


● Parallelization: Each MPI process computes FFT on a subset of sequences.
● Example FFT transformation of Seq1 (simplified output):

Seq1 FFT: [F1, F2, F3, ...]
Seq2 FFT: [G1, G2, G3, ...]
Seq3 FFT: [H1, H2, H3, ...]
Seq4 FFT: [I1, I2, I3, ...]

Step 3: Compute Pairwise Correlations (Parallelizable)

● Cross-correlate sequences in the frequency domain to detect similarities (a code sketch follows the pair list below).

For 4 sequences, possible pairwise alignments:

● (Seq1, Seq2)
● (Seq1, Seq3)
● (Seq1, Seq4)
● (Seq2, Seq3)
● (Seq2, Seq4)
● (Seq3, Seq4)
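
To make the frequency-domain step concrete: the cross-correlation of two signals is the inverse FFT of FFT(x) times the complex conjugate of FFT(y). A minimal sketch using FFTW (the library choice and the cross_correlate function are assumptions, not a requirement; any FFT implementation works; link with -lfftw3f):

#include <fftw3.h>

// corr = IFFT( FFT(x) * conj(FFT(y)) ) for length-n real signals.
// Note: this is circular correlation; zero-pad to 2n for linear correlation.
// The result is unnormalized (FFTW convention); divide by n if needed.
void cross_correlate(const float *x, const float *y, float *corr, int n) {
    float *tx = fftwf_alloc_real(n), *ty = fftwf_alloc_real(n);
    fftwf_complex *X = fftwf_alloc_complex(n / 2 + 1);
    fftwf_complex *Y = fftwf_alloc_complex(n / 2 + 1);

    fftwf_plan px = fftwf_plan_dft_r2c_1d(n, tx, X, FFTW_ESTIMATE);
    fftwf_plan py = fftwf_plan_dft_r2c_1d(n, ty, Y, FFTW_ESTIMATE);
    fftwf_plan pi = fftwf_plan_dft_c2r_1d(n, X, corr, FFTW_ESTIMATE);

    for (int i = 0; i < n; i++) { tx[i] = x[i]; ty[i] = y[i]; }
    fftwf_execute(px);                     // X = FFT(x)
    fftwf_execute(py);                     // Y = FFT(y)

    for (int k = 0; k <= n / 2; k++) {     // X[k] *= conj(Y[k])
        float re = X[k][0] * Y[k][0] + X[k][1] * Y[k][1];
        float im = X[k][1] * Y[k][0] - X[k][0] * Y[k][1];
        X[k][0] = re; X[k][1] = im;
    }
    fftwf_execute(pi);                     // inverse transform into corr

    fftwf_destroy_plan(px); fftwf_destroy_plan(py); fftwf_destroy_plan(pi);
    fftwf_free(tx); fftwf_free(ty); fftwf_free(X); fftwf_free(Y);
}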

The Needleman–Wunsch algorithm is a dynamic programming approach used for global
sequence alignment. It finds the optimal alignment between two sequences (for each pair
above) by considering insertions, deletions, and substitutions, ensuring that the alignment
spans the entire length of both sequences.

Steps Involved:

1. Initialization:

● Create a scoring matrix H with dimensions (n+1)×(m+1), where n and m are the
lengths of the two sequences.
● Initialize the first row and first column with gap penalties:

H(i, 0) = -i × Wg,   H(0, j) = -j × Wg

where Wg is the gap penalty.

2. Scoring:
● Fill in the scoring matrix using the following recurrence relation:

H(i, j) = max( H(i-1, j-1) + s(ai, bj),  H(i-1, j) - Wg,  H(i, j-1) - Wg )

where:
○ s(ai, bj) is the substitution score for aligning characters ai and bj.
○ Wg is the gap penalty.

3. Traceback:

● Start from the bottom-right of the matrix and backtrack to reconstruct the optimal
alignment.
● Move according to the highest scoring path:
○ Diagonal move → Match/Mismatch
○ Up move → Insertion (gap in sequence 2)
○ Left move → Deletion (gap in sequence 1)
● The alignment ends when the top-left cell is reached.
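
A compact sketch of the matrix-fill step under these definitions (the +1/-1 match/mismatch scores and the default gap penalty Wg = 2 are illustrative assumptions; nw_score returns only the optimal score, omitting the traceback):

#include <algorithm>
#include <string>
#include <vector>

int nw_score(const std::string &a, const std::string &b, int Wg = 2) {
    int n = a.size(), m = b.size();
    std::vector<std::vector<int>> H(n + 1, std::vector<int>(m + 1));
    for (int i = 0; i <= n; i++) H[i][0] = -i * Wg;   // first column: gaps
    for (int j = 0; j <= m; j++) H[0][j] = -j * Wg;   // first row: gaps
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++) {
            int s = (a[i - 1] == b[j - 1]) ? 1 : -1;  // substitution score s(ai, bj)
            H[i][j] = std::max({H[i - 1][j - 1] + s,  // diagonal: match/mismatch
                                H[i - 1][j] - Wg,     // up: gap in sequence 2
                                H[i][j - 1] - Wg});   // left: gap in sequence 1
        }
    return H[n][m];  // optimal global alignment score
}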

Step 4: Construct Distance Matrix

● Convert correlation scores into distances.
● Build a guide tree using UPGMA or neighbor-joining (a UPGMA sketch follows the example tree below).

Example distance matrix (simplified):

        Seq1  Seq2  Seq3  Seq4
Seq1    0.0   0.2   0.4   0.3
Seq2    0.2   0.0   0.3   0.2
Seq3    0.4   0.3   0.0   0.3
Seq4    0.3   0.2   0.3   0.0


Or in a more visual representation, in the form of a guide tree:

             ---- Seq1
        ----|
        |    ---- Seq2
   -----|
   |    |------- Seq4
---|
   |------------ Seq3

This tree guides progressive alignment.
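
To make the guide-tree step concrete, here is a minimal UPGMA sketch over the example matrix (a simple O(n^3) average-linkage version; the variable names and output format are illustrative assumptions, not a required design):

#include <cstdio>
#include <limits>
#include <string>
#include <vector>

int main() {
    // Distance matrix from the example above.
    std::vector<std::vector<double>> d = {{0.0, 0.2, 0.4, 0.3},
                                          {0.2, 0.0, 0.3, 0.2},
                                          {0.4, 0.3, 0.0, 0.3},
                                          {0.3, 0.2, 0.3, 0.0}};
    std::vector<std::string> name = {"Seq1", "Seq2", "Seq3", "Seq4"};
    const int n = 4;
    std::vector<int> sz(n, 1);          // cluster sizes for average linkage
    std::vector<bool> dead(n, false);   // clusters already merged away

    for (int step = 0; step < n - 1; step++) {
        // Find the closest pair of live clusters.
        int a = -1, b = -1;
        double best = std::numeric_limits<double>::max();
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                if (!dead[i] && !dead[j] && d[i][j] < best) {
                    best = d[i][j]; a = i; b = j;
                }
        printf("merge %s + %s at distance %.2f\n",
               name[a].c_str(), name[b].c_str(), best);
        // Average-linkage update: d(ab, k) = (sz_a*d(a,k) + sz_b*d(b,k)) / (sz_a + sz_b)
        for (int k = 0; k < n; k++)
            if (!dead[k] && k != a && k != b)
                d[a][k] = d[k][a] =
                    (sz[a] * d[a][k] + sz[b] * d[b][k]) / (sz[a] + sz[b]);
        name[a] = "(" + name[a] + "," + name[b] + ")";
        sz[a] += sz[b];
        dead[b] = true;
    }
    for (int i = 0; i < n; i++)
        if (!dead[i]) printf("guide tree: %s\n", name[i].c_str());
    return 0;
}

Running it reproduces the merge order of the tree above: (Seq1, Seq2) at 0.20, then Seq4 at 0.25, then Seq3 at roughly 0.33.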

Step 5: Perform Progressive Alignment (Optional for Bonus)

● Start with the most similar sequences first.


● Align groups progressively based on the tree.

Example alignment after progressive merging:

Seq1: MKTLLILTCLVAVALARPKAQQL
Seq2: MKTVLILTCLVALA--KPKAQQL
Seq3: MKTLLILACLVALA--RKAQQL
Seq4: MKTLLILTCLVALA--KPQQL

Dataset to be used:
The following link contains a few benchmark datasets for this problem. The repository
also contains a SIMD-based solution that you may explore for learning purposes.
https://github.com/TimoLassmann/kalign/blob/main/scripts/benchmark.org

Sketch of Parallel Algorithm (MPI):

You may deviate from this approach if you feel you have a better idea.

Data Partitioning:

● Distribute the sequences among MPI processes for pairwise sequence alignment
computations.
● Construct the guide tree in a distributed manner to reduce memory overhead.
● Perform progressive alignment in a hierarchical manner across MPI ranks.

Communication:

● Processes exchange aligned sub-sequences as needed.


● Use point-to-point communication for sequence exchange and guide tree
construction.
● Reduce inter-process synchronization to avoid performance bottlenecks.

Synchronization:

● Ensure all processes complete pairwise alignment before constructing the guide tree.
● Implement efficient synchronization during progressive alignment.

Computation:

● Each process calculates the alignment for its assigned sequence pairs.
● Construct and store sub-alignments progressively.

Result Aggregation:

● Gather and merge the partial alignments from all MPI processes into a single process
for final output.
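
As a concrete starting point, the sketch below distributes the pairwise distance computations cyclically across ranks and assembles the matrix on rank 0 using point-to-point calls only, as the implementation requirements below demand. The pair indexing, message layout, and the compute_distance placeholder are illustrative assumptions, not a prescribed interface:

#include <mpi.h>
#include <cstdio>
#include <utility>
#include <vector>

// Placeholder: would run Needleman–Wunsch / FFT correlation on pair (i, j).
double compute_distance(int i, int j) { return 0.1 * (i + j); }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int nseq = 4;                          // example sequence count
    std::vector<std::pair<int, int>> pairs;      // all unordered pairs (i, j), i < j
    for (int i = 0; i < nseq; i++)
        for (int j = i + 1; j < nseq; j++)
            pairs.push_back({i, j});
    const int npairs = (int)pairs.size();

    std::vector<double> D(nseq * nseq, 0.0);     // full matrix, assembled on rank 0

    // Cyclic partitioning: rank r computes pairs r, r+size, r+2*size, ...
    for (int p = rank; p < npairs; p += size) {
        double msg[3] = {(double)pairs[p].first, (double)pairs[p].second,
                         compute_distance(pairs[p].first, pairs[p].second)};
        if (rank == 0) {                         // root keeps its own results
            D[pairs[p].first * nseq + pairs[p].second] = msg[2];
            D[pairs[p].second * nseq + pairs[p].first] = msg[2];
        } else {                                 // workers send results to root
            MPI_Send(msg, 3, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
    }

    if (rank == 0) {
        // Root expects every pair it did not compute itself.
        int own = (npairs + size - 1) / size;    // pairs handled by rank 0
        for (int k = 0; k < npairs - own; k++) {
            double msg[3];
            MPI_Recv(msg, 3, MPI_DOUBLE, MPI_ANY_SOURCE, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            int i = (int)msg[0], j = (int)msg[1];
            D[i * nseq + j] = D[j * nseq + i] = msg[2];
        }
        for (int i = 0; i < nseq; i++) {         // print the distance matrix
            for (int j = 0; j < nseq; j++) printf("%5.2f ", D[i * nseq + j]);
            printf("\n");
        }
    }
    MPI_Finalize();
    return 0;
}

Run with, e.g., mpiexec -n 4 ./distmat. For real datasets, sending batched results rather than one message per pair reduces communication overhead.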

Implementation Requirements:

● Programming Language: Use C++ and MPICH (using point-to-point
communication functions only) for the implementation.
● Serial Version: Implement a serial version of the algorithm to serve as a baseline for
performance comparison.
● Parallel Version: Develop a parallel version addressing the additional constraints
outlined above.

Report Requirements:

1. Parallel Algorithm: Describe the parallelization strategy, data partitioning,
communication, and synchronization mechanisms.
2. Challenges Faced: Discuss any challenges faced, particularly in relation to the
additional constraints, and how they were overcome.
3. Code: Include your well-commented C++ code, ensuring clarity and readability.

Evaluation Criteria:

● Correctness: Accuracy and reliability of the implementation.


● Report Quality: Clarity, completeness, and depth of analysis in the report.
● Code Quality: Readability, documentation, and structure of the code.

Challenges and Considerations:

1. Memory Management: Implement efficient memory usage techniques to handle large
scoring matrices, possibly utilizing sparse representations or block-based
computations.
2. Communication Overhead: Minimize data communication between processes to
enhance performance, potentially through data compression or efficient
communication patterns.
3. Load Balancing: Ensure an even distribution of work among processes.
4. Scalability: Evaluate how well the parallel algorithm scales with the number of
processes (e.g., 8, 12, and 16), identifying any bottlenecks that may arise.

Constraints:

1. Memory Efficiency: Implement strategies to manage memory usage effectively,
ensuring the algorithm can handle datasets that exceed the memory capacity of a
single node.
2. Load Balancing: Ensure an even distribution of computational work among
processes to prevent bottlenecks and optimize performance.
3. Communication Minimization: Minimize inter-process communication overhead,
which can significantly impact performance in distributed environments.
4. Fault Tolerance: Incorporate mechanisms to handle process failures gracefully,
ensuring the algorithm can recover and continue processing without data loss.
5. Dynamic Scalability: Allow the algorithm to adjust dynamically to varying numbers of
processes, facilitating efficient utilization of available computational resources.
