Empirical evaluation of methods for de novo genome assembly

PeerJ Computer Science


Introduction

De novo assembly method

Overlap Layout Consensus (OLC) method

  1. The overlap between each pair of reads is identified by all-against-all pairwise read alignment. Precomputing the k-mers of all reads improves performance considerably: the assembler selects candidate pairs that share k-mers and then computes alignments using those k-mers as seeds. Overlap detection is sensitive to the minimum overlap length and to the k-mer size, so the choice of these parameters can significantly influence the assembler's efficiency. Small values produce too many candidates, while large values can yield accurate but shorter contigs. Consequently, finding a good balance requires a considerable amount of time.

  2. Based on the overlap information, the OLC constructs an overlap graph. Within this step, the OLC searches for a special form of path, i.e., a simple path in which every node is distinct. Because every node must be visited exactly once, this is a Hamiltonian path. However, finding a Hamiltonian path is an NP-hard problem, so in practice it is approximated with a greedy strategy or other heuristic algorithms.

  3. Finally, the OLC performs multiple sequence alignment (MSA). MSA is intended to determine the exact layout of the reads; the consensus sequence is then chosen by a voting strategy or, alternatively, by statistical methods. However, no known method solves the optimal MSA problem efficiently. Therefore, the consensus stage uses pairwise alignments guided by the approximate read layout.
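
The overlap and layout steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any assembler discussed here: it assumes error-free reads and exact suffix-prefix matches in place of real alignment, and the function names (find_overlaps, greedy_assemble) and parameters (k, min_overlap) are hypothetical. The consensus step is omitted because exact overlaps leave nothing to vote on.

```python
from collections import defaultdict


def kmers(read, k):
    """Return every k-mer of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def overlap_length(a, b, min_overlap):
    """Longest suffix of a that equals a prefix of b (0 if shorter than min_overlap)."""
    for length in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def find_overlaps(reads, k, min_overlap):
    """Score only the read pairs that share at least one k-mer (the seeds)."""
    index = defaultdict(set)
    for i, read in enumerate(reads):
        for kmer in kmers(read, k):
            index[kmer].add(i)
    candidates = {(i, j) for ids in index.values() for i in ids for j in ids if i != j}
    overlaps = {}
    for i, j in candidates:
        length = overlap_length(reads[i], reads[j], min_overlap)
        if length:
            overlaps[(i, j)] = length
    return overlaps


def greedy_assemble(reads, k, min_overlap):
    """Greedy layout heuristic: repeatedly merge the pair with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        overlaps = find_overlaps(reads, k, min_overlap)
        if not overlaps:
            break
        # Ties are broken by the pair indices so the result is deterministic.
        (i, j), length = max(overlaps.items(), key=lambda item: (item[1], item[0]))
        merged = reads[i] + reads[j][length:]
        reads = [r for n, r in enumerate(reads) if n not in (i, j)] + [merged]
    return reads


contigs = greedy_assemble(["ATGGCGTG", "GCGTGCAA", "TGCAATCC"], k=3, min_overlap=4)
print(contigs)
```

Choosing min_overlap at least as large as k guarantees that every reported overlap contains a shared k-mer and is therefore among the candidates. Lowering k enlarges the candidate set, which is the trade-off described in step 1.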

Assemblers using OLC

De Bruijn Graph (DBG) method

  • Select a value for k

  • Make k-mers

  • Count the k-mers

  • Make the DBG

  • Categorize the de Bruijn graph by how nodes and edges are represented, which gives two forms: the Hamiltonian and the Eulerian approach. The k-mers are the nodes in the Hamiltonian approach, while they are the edges in the Eulerian approach. The Hamiltonian approach resembles the OLC method: sequences are reconstructed by finding a Hamiltonian path, which visits every node exactly once. This is an NP-complete problem, so it becomes impractical as the number of nodes grows. In the Eulerian approach, sequences are instead reconstructed by finding an Eulerian path, which traverses every edge exactly once and can be found in time linear in the number of edges. Consequently, the Eulerian formulation makes assembly an algorithmically simpler problem, which is the most crucial advantage of DBG.

  • Revise contigs from the simplified graph.
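
The steps above can be sketched as follows for the Eulerian form of the graph. This is a minimal sketch assuming error-free, single-stranded reads; the function names and the min_count parameter are hypothetical, and graph simplification (tip and bubble removal) is left out. Each distinct k-mer becomes one edge from its (k-1)-mer prefix to its (k-1)-mer suffix, and the path is found with Hierholzer's algorithm.

```python
from collections import Counter, defaultdict


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def build_de_bruijn(kmer_counts, min_count=1):
    """Eulerian form: one edge per distinct k-mer, from its prefix to its suffix."""
    graph = defaultdict(list)
    for kmer, count in kmer_counts.items():
        if count >= min_count:  # hypothetical filter for low-coverage k-mers
            graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes the graph actually has an Eulerian path."""
    out_edges = {node: list(targets) for node, targets in graph.items()}
    if not out_edges:
        return []
    balance = Counter()  # out-degree minus in-degree per node
    for node, targets in out_edges.items():
        balance[node] += len(targets)
        for target in targets:
            balance[target] -= 1
    # Start at the node with one surplus outgoing edge, or anywhere for a cycle.
    start = next((n for n, b in balance.items() if b == 1), next(iter(out_edges)))
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if out_edges.get(node):
            stack.append(out_edges[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Turn a path of (k-1)-mers back into a sequence."""
    return path[0] + "".join(node[-1] for node in path[1:])


counts = count_kmers(["ATGGCG", "GGCGTA"], k=3)
contig = spell(eulerian_path(build_de_bruijn(counts)))
print(contig)
```

When the genome contains repeats longer than k-1, the graph can admit several Eulerian paths, so the spelled sequence is not unique in general. The example reads were chosen so that the graph is a simple chain.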

Assemblers using DBG

  • In this stage (k-bimer adjustment), accurate distances between k-mers are estimated from the joint distance histogram and an analysis of the assembly graph.

  • Inspired by the PDBG method (Medvedev et al., 2011), this stage constructs the paired assembly graph.

  • The DBG is constructed.

  • The last stage (contig construction) is completed by backtracking the graph simplifications: SPAdes generates the DNA sequences of the contigs and maps the reads to the contigs.

String graph-based method

Assemblers using string graph

Hybrid method

Assemblers using hybrid methods

Experiments

Dataset

Assemblies

Results

Arabidopsis thaliana

Bacillus cereus

Caenorhabditis elegans

Escherichia coli

Human genome

Saccharomyces cerevisiae

Staphylococcus aureus

Conclusion

Additional Information and Declarations

Competing Interests

The authors declare there are no competing interests.

Author Contributions

Firaol Dida conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.

Gangman Yi conceived and designed the experiments, analyzed the data, authored or reviewed drafts of the paper, and approved the final draft.

Data Availability

The following information was supplied regarding data availability:

The data and code are available at GitHub: https://github.com/Firaol1221/Empirical-Evaluation-of-Methods-forDe-Novo-Genome-Assembly.

Funding

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (NRF-2019R1F1A1064019). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
