GATK Pipeline Presentation: From FASTQ Data to High-Confidence Variants
Apart from these, there are a few other protocols, such as Support Protocols
and Alternate Protocols.
SOFTWARE AND FILES
● Software
○ BWA
○ GATK
○ SAMtools
○ Picard Tools
● Files
○ Sequence of raw reads in FASTQ format
○ Reference Genome in FASTA format
○ Database of known variants in VCF format
Preparing the Reference Sequence
● GATK uses two files to access the reference genome safely:
○ a dictionary of contig names and sizes.
○ an index file for efficient random access to reference bases.
● The index file is created using SAMtools.
○ BWA separately builds its own index files, which it needs for aligning reads.
● The dictionary file is created using Picard Tools.
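The reference preparation above can be sketched as three command lines. This is a minimal sketch, not a definitive recipe: it only builds and prints the commands, and the jar path `picard.jar` and the file name `reference.fasta` are placeholders assumed for illustration. The Picard arguments use the legacy `KEY=value` style.

```python
import shlex
from pathlib import Path


def reference_prep_commands(reference_fasta, picard_jar="picard.jar"):
    """Return the commands that prepare a reference FASTA (nothing is executed).

    picard_jar is an assumed location of the Picard jar; adjust for your setup.
    """
    ref = Path(reference_fasta)
    dict_path = ref.with_suffix(".dict")  # reference.fasta -> reference.dict
    return [
        # FASTA index (.fai) for random access to reference bases
        ["samtools", "faidx", str(ref)],
        # BWA's own index files, needed later for read alignment
        ["bwa", "index", str(ref)],
        # Sequence dictionary of contig names and sizes
        ["java", "-jar", picard_jar, "CreateSequenceDictionary",
         f"R={ref}", f"O={dict_path}"],
    ]


if __name__ == "__main__":
    for cmd in reference_prep_commands("reference.fasta"):
        print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the tools are installed and on the `PATH`.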
FASTQ To BAM
● The reads are first aligned to the reference genome using BWA.
● Duplicate reads among the aligned reads are then marked using Picard Tools so that
downstream tools can ignore them, since they provide no additional information.
● This produces a BAM file with duplicate reads marked.
● Next, the BAM file is compared against known indels to produce a list of target
regions for realignment.
● The reads in these target regions are then realigned locally for a better alignment.
● The above two steps are carried out using GATK.
● Finally, Base Quality Score Recalibration (BQSR) is performed, again using GATK.
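The FASTQ-to-BAM steps above can be laid out as an ordered list of commands. This is a hedged sketch assuming BWA-MEM, Picard, and a GATK 3.x jar (the version that still ships the indel realignment walkers). The jar paths, the sample name, the read-group values, and all file names are placeholders, and the explicit sort and index steps are additions that the slides do not mention. Nothing is executed; the commands are only built and printed.

```python
import shlex

GATK_JAR = "GenomeAnalysisTK.jar"  # assumption: path to a GATK 3.x jar
PICARD_JAR = "picard.jar"          # assumption: path to the Picard jar


def gatk(tool, *args):
    """Build a GATK 3.x style command line for the given walker."""
    return ["java", "-jar", GATK_JAR, "-T", tool, *args]


def picard(tool, *args):
    """Build a Picard command line using KEY=value arguments."""
    return ["java", "-jar", PICARD_JAR, tool, *args]


def fastq_to_bam_steps(ref, fq1, fq2, known_indels, known_sites, sample="sample1"):
    """Return (name, argv, stdout_file) tuples in execution order."""
    # bwa itself expands the literal "\t" sequences in the read-group string.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    sam = f"{sample}.sam"
    sorted_bam = f"{sample}.sorted.bam"
    dedup_bam = f"{sample}.dedup.bam"
    targets = f"{sample}.targets.intervals"
    realigned_bam = f"{sample}.realigned.bam"
    recal_table = f"{sample}.recal.table"
    recal_bam = f"{sample}.recal.bam"
    return [
        # bwa mem writes SAM to stdout, so the output file is tracked separately
        ("align", ["bwa", "mem", "-R", read_group, ref, fq1, fq2], sam),
        ("sort", picard("SortSam", f"I={sam}", f"O={sorted_bam}",
                        "SORT_ORDER=coordinate"), None),
        ("mark_duplicates", picard("MarkDuplicates", f"I={sorted_bam}",
                                   f"O={dedup_bam}",
                                   f"M={sample}.dup_metrics.txt"), None),
        ("index", ["samtools", "index", dedup_bam], None),
        ("realign_targets", gatk("RealignerTargetCreator", "-R", ref,
                                 "-I", dedup_bam, "-known", known_indels,
                                 "-o", targets), None),
        ("realign", gatk("IndelRealigner", "-R", ref, "-I", dedup_bam,
                         "-targetIntervals", targets, "-known", known_indels,
                         "-o", realigned_bam), None),
        ("bqsr_model", gatk("BaseRecalibrator", "-R", ref, "-I", realigned_bam,
                            "-knownSites", known_sites, "-o", recal_table), None),
        ("bqsr_apply", gatk("PrintReads", "-R", ref, "-I", realigned_bam,
                            "-BQSR", recal_table, "-o", recal_bam), None),
    ]


if __name__ == "__main__":
    steps = fastq_to_bam_steps("reference.fasta", "reads_1.fastq", "reads_2.fastq",
                               "known_indels.vcf", "dbsnp.vcf")
    for name, argv, stdout_file in steps:
        line = shlex.join(argv)
        if stdout_file:
            line += " > " + shlex.quote(stdout_file)
        print(f"# {name}\n{line}")
```

Keeping each step as data rather than running it directly makes the order easy to inspect before anything touches the input files.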
HaplotypeCaller Vs UnifiedGenotyper
● HaplotypeCaller is capable of calling SNPs and indels simultaneously via
local de-novo assembly of haplotypes in an active region. This allows the
HaplotypeCaller to be more accurate when calling regions that are
traditionally difficult to call, for example when they contain different types of
variants close to each other.
● UnifiedGenotyper calls SNPs and indels separately by considering each
variant locus independently. The model it uses to do so has been generalized
to work with data from organisms of any ploidy.
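To make the comparison concrete, the two callers can be invoked on the same recalibrated BAM file. The sketch below assumes a GATK 3.x jar, and the jar path and file names are placeholders. For UnifiedGenotyper, `-glm BOTH` asks for SNP and indel calls in one run, although each locus is still evaluated independently.

```python
import shlex

GATK_JAR = "GenomeAnalysisTK.jar"  # assumption: path to a GATK 3.x jar


def variant_calling_commands(ref, bam):
    """Return one command per caller, keyed by caller name (nothing is run)."""
    base = ["java", "-jar", GATK_JAR]
    return {
        # Calls SNPs and indels together via local de novo haplotype assembly
        "HaplotypeCaller": base + ["-T", "HaplotypeCaller", "-R", ref,
                                   "-I", bam, "-o", "raw_variants.hc.vcf"],
        # Evaluates each locus independently; -glm selects the variant model
        "UnifiedGenotyper": base + ["-T", "UnifiedGenotyper", "-R", ref,
                                    "-I", bam, "-glm", "BOTH",
                                    "-o", "raw_variants.ug.vcf"],
    }


if __name__ == "__main__":
    for caller, cmd in variant_calling_commands("reference.fasta",
                                                "sample1.recal.bam").items():
        print(f"# {caller}\n{shlex.join(cmd)}")
```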
RAW VARIANTS TO ANALYSIS-READY VARIANTS
● Because we used HaplotypeCaller, the call set contains two classes of variants:
SNPs and indels.
● Variant Quality Score Recalibration is done on SNPs and Indels separately.
● Specify call sets that should be used to build the recalibration model for Indels
and SNPs.
● Specify which annotations should be used to evaluate the likelihood of SNPs
and Indels being real.
● Recalibration models are built separately for SNPs and indels using GATK, and the
desired level of recalibration is then applied to filter the original SNPs and indels.
● The output is annotated with recalibrated quality scores.
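The recalibration steps above map onto two GATK 3.x walkers, VariantRecalibrator (build the model) and ApplyRecalibration (apply it), run once per variant class. The following is a sketch under stated assumptions: the jar path, the resource file names, the priors, the annotation list, and the output file names are illustrative placeholders rather than recommended settings.

```python
import shlex

GATK_JAR = "GenomeAnalysisTK.jar"  # assumption: path to a GATK 3.x jar


def vqsr_commands(ref, raw_vcf, mode, resources, annotations, ts_filter_level=99.0):
    """Return [build_model_cmd, apply_cmd] for one variant class (nothing is run).

    resources: list of (label, vcf_path) pairs, where label holds the
    resource name and its known/training/truth/prior settings.
    """
    if mode not in ("SNP", "INDEL"):
        raise ValueError("mode must be 'SNP' or 'INDEL'")
    prefix = f"recalibrate_{mode}"
    recal_file = f"{prefix}.recal"
    tranches_file = f"{prefix}.tranches"
    base = ["java", "-jar", GATK_JAR]

    build = base + ["-T", "VariantRecalibrator", "-R", ref, "-input", raw_vcf]
    for label, path in resources:        # call sets used to train the model
        build += [f"-resource:{label}", path]
    for name in annotations:             # annotations used as model features
        build += ["-an", name]
    build += ["-mode", mode, "-recalFile", recal_file,
              "-tranchesFile", tranches_file]

    apply = base + ["-T", "ApplyRecalibration", "-R", ref, "-input", raw_vcf,
                    "-mode", mode, "--ts_filter_level", str(ts_filter_level),
                    "-recalFile", recal_file, "-tranchesFile", tranches_file,
                    "-o", f"{prefix}.vcf"]
    return [build, apply]


if __name__ == "__main__":
    # Hypothetical SNP training resources and annotations, for illustration only
    snp_resources = [
        ("hapmap,known=false,training=true,truth=true,prior=15.0", "hapmap.vcf"),
        ("dbsnp,known=true,training=false,truth=false,prior=2.0", "dbsnp.vcf"),
    ]
    snp_annotations = ["QD", "FS", "MQRankSum", "ReadPosRankSum"]
    for cmd in vqsr_commands("reference.fasta", "raw_variants.vcf", "SNP",
                             snp_resources, snp_annotations):
        print(shlex.join(cmd))
```

Calling the function a second time with `mode="INDEL"`, indel-specific resources, and the SNP-recalibrated VCF as input would give the second pass.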