GATK Pipeline Presentation: From FASTQ Data to High-Confidence Variants

The document summarizes a GATK pipeline for analyzing sequencing data from the NA12878 genome. It describes the major steps: 1) aligning FASTQ reads to the reference genome using BWA to produce a BAM file, 2) calling variants using GATK HaplotypeCaller, 3) filtering and annotating variants with tools like Picard and GATK. It also lists the main software tools used, including BWA, GATK, SAMtools, and Picard Tools. The goal is to produce high confidence variants from the raw sequencing data through alignment, variant calling, and quality control steps.

Uploaded by

Sampreeth Reddy

GATK PIPELINE PRESENTATION

FROM FASTQ DATA TO HIGH-CONFIDENCE VARIANTS


DATASET USED
● The dataset used was NA12878.
● It was generated on an Illumina HiSeq.
● The whole workflow focused on chromosome 20.
WORKFLOW
● Three protocols were followed:
○ FASTQ TO BAM
○ CALLING VARIANTS
○ FILTERING VARIANTS

Apart from these, there are a few other protocols, such as Support Protocols
and Alternate Protocols.
SOFTWARE AND FILES
● Software
○ BWA
○ GATK
○ SAMtools
○ Picard Tools
● Files
○ Raw sequence reads in FASTQ format
○ Reference genome in FASTA format
○ Database of known variants in VCF format
Preparing the Reference Sequence
● GATK uses two files to access the reference genome safely:
○ a dictionary of contig names and sizes.
○ an index file for efficient random access to reference bases.
● The index file is created using SAMtools.
○ BWA is used separately to create its own index files for aligning reads.
● The dictionary file is created using Picard Tools.
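The three preparation commands can be sketched as follows. This is a minimal illustration, not the presenter's exact invocation: the file name reference.fa, the jar name picard.jar, and the helper reference_prep_commands are assumptions made for the example. The function only builds the command lines; nothing is executed.

```python
from pathlib import Path


def reference_prep_commands(reference_fasta, picard_jar="picard.jar"):
    """Return the commands that build the reference support files.

    Hypothetical helper: it only assembles argument lists, which could be
    passed to subprocess.run() on a machine with the tools installed.
    """
    ref = Path(reference_fasta)
    dict_path = ref.with_suffix(".dict")  # reference.fa -> reference.dict
    return [
        # BWA's own index files, used only when aligning reads
        ["bwa", "index", str(ref)],
        # FASTA index (.fai) for random access to reference bases
        ["samtools", "faidx", str(ref)],
        # Sequence dictionary of contig names and sizes
        ["java", "-jar", picard_jar, "CreateSequenceDictionary",
         f"REFERENCE={ref}", f"OUTPUT={dict_path}"],
    ]


for cmd in reference_prep_commands("reference.fa"):
    print(" ".join(cmd))
```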
FASTQ To BAM
● The reads are aligned to the reference genome using BWA.
● Duplicate reads among the aligned reads are marked so that they can be
excluded from analysis, as they don't provide any additional information.
This is done using Picard Tools.
● The above process creates a BAM file with duplicate reads marked.
● The BAM file is then compared against known indels to produce a list of
target regions for realignment.
● The reads in these target regions are realigned locally for a better
alignment.
● The above two steps are carried out using GATK.
● Finally, Base Quality Score Recalibration is performed, again using GATK.
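The steps above can be sketched as an ordered list of command lines. The sketch assumes the GATK 3.x command-line style (`-T` tool selection), since the slides refer to realignment around indels. All file names, jar names, the read group string, and the helper fastq_to_bam_commands are placeholders chosen for the example. The sort and index steps are not on the slide; they are added on the assumption that the downstream tools need a coordinate-sorted, indexed BAM.

```python
def fastq_to_bam_commands(ref, fq1, fq2, known_indels, dbsnp, interval="20",
                          picard="picard.jar", gatk="GenomeAnalysisTK.jar"):
    """Return (step name, argv) pairs for the FASTQ-to-BAM protocol.

    Hypothetical helper: commands are built, not run.
    """
    # Placeholder read group; GATK expects read group information on reads.
    read_group = r"@RG\tID:group1\tSM:NA12878\tPL:illumina\tLB:lib1\tPU:unit1"
    gatk_base = ["java", "-jar", gatk, "-R", ref]
    return [
        # bwa mem writes SAM to stdout; redirect to aligned.sam when running
        ("align", ["bwa", "mem", "-M", "-R", read_group, ref, fq1, fq2]),
        ("sort", ["java", "-jar", picard, "SortSam", "INPUT=aligned.sam",
                  "OUTPUT=sorted.bam", "SORT_ORDER=coordinate"]),
        ("mark_duplicates", ["java", "-jar", picard, "MarkDuplicates",
                             "INPUT=sorted.bam", "OUTPUT=dedup.bam",
                             "METRICS_FILE=metrics.txt"]),
        ("index", ["java", "-jar", picard, "BuildBamIndex",
                   "INPUT=dedup.bam"]),
        # Known indels give the list of target regions for realignment
        ("realign_targets", gatk_base + [
            "-T", "RealignerTargetCreator", "-I", "dedup.bam",
            "-L", interval, "-known", known_indels,
            "-o", "targets.intervals"]),
        ("realign", gatk_base + [
            "-T", "IndelRealigner", "-I", "dedup.bam",
            "-targetIntervals", "targets.intervals",
            "-known", known_indels, "-o", "realigned.bam"]),
        # Base Quality Score Recalibration: build the model, then apply it
        ("bqsr_model", gatk_base + [
            "-T", "BaseRecalibrator", "-I", "realigned.bam", "-L", interval,
            "-knownSites", dbsnp, "-knownSites", known_indels,
            "-o", "recal_data.table"]),
        ("bqsr_apply", gatk_base + [
            "-T", "PrintReads", "-I", "realigned.bam", "-L", interval,
            "-BQSR", "recal_data.table", "-o", "recal_reads.bam"]),
    ]


steps = fastq_to_bam_commands("reference.fa", "reads_1.fq", "reads_2.fq",
                              "known_indels.vcf", "dbsnp.vcf")
for name, argv in steps:
    print(name, " ".join(argv))
```

Restricting the GATK steps to interval 20 mirrors the deck's focus on chromosome 20; the contig name depends on the reference build in use.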
HaplotypeCaller Vs UnifiedGenotyper
● HaplotypeCaller is capable of calling SNPs and indels simultaneously via
local de-novo assembly of haplotypes in an active region. This allows the
HaplotypeCaller to be more accurate when calling regions that are
traditionally difficult to call, for example when they contain different types of
variants close to each other.
● UnifiedGenotyper calls SNPs and indels separately by considering each
variant locus independently. The model it uses to do so has been generalized
to work with data from organisms of any ploidy.
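As a rough illustration of how the two callers would be invoked, the sketch below builds a command line for either tool. It assumes the GATK 3.x command-line style; the helper caller_command and all file names are hypothetical. For UnifiedGenotyper, the `-glm BOTH` option asks for both the SNP and the indel model in one run, and `-ploidy` sets the sample ploidy.

```python
def caller_command(caller, ref, bam, out_vcf, interval="20", ploidy=None,
                   gatk="GenomeAnalysisTK.jar"):
    """Build a variant-calling command line (hypothetical helper, not run)."""
    if caller not in ("HaplotypeCaller", "UnifiedGenotyper"):
        raise ValueError("caller must be HaplotypeCaller or UnifiedGenotyper")
    cmd = ["java", "-jar", gatk, "-T", caller, "-R", ref,
           "-I", bam, "-L", interval, "-o", out_vcf]
    if caller == "UnifiedGenotyper":
        # SNPs and indels use separate per-locus models; BOTH enables both.
        cmd += ["-glm", "BOTH"]
        if ploidy is not None:
            # The slide only discusses ploidy for UnifiedGenotyper,
            # so this sketch only sets it here.
            cmd += ["-ploidy", str(ploidy)]
    return cmd


hc = caller_command("HaplotypeCaller", "reference.fa",
                    "recal_reads.bam", "raw_variants.vcf")
ug = caller_command("UnifiedGenotyper", "reference.fa",
                    "recal_reads.bam", "raw_variants_ug.vcf", ploidy=2)
print(" ".join(hc))
print(" ".join(ug))
```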
RAW VARIANTS To ANALYSIS READY VARIANTS
● Because HaplotypeCaller was used, the raw call set contains two classes of
variants: SNPs and indels.
● Variant Quality Score Recalibration is done on SNPs and indels separately.
● Specify the call sets that should be used to build the recalibration model
for indels and SNPs.
● Specify which annotations should be used to evaluate the likelihood of SNPs
and indels being real.
● Recalibration models are built separately using GATK, and the desired level
of recalibration is applied to the original SNPs and indels.
● The output is annotated with recalibrated quality scores.
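The two-pass recalibration described above can be sketched as follows, again assuming the GATK 3.x command-line style. The helper vqsr_commands, the file names, and the resource sets, priors, and annotations passed in the usage lines are illustrative assumptions; the slides do not state which call sets or annotations were used.

```python
def vqsr_commands(ref, in_vcf, out_vcf, mode, resources, annotations,
                  ts_filter_level=99.0, gatk="GenomeAnalysisTK.jar"):
    """Return [build-model argv, apply-model argv] for one variant class.

    Hypothetical helper: commands are built, not run.
    """
    if mode not in ("SNP", "INDEL"):
        raise ValueError("mode must be 'SNP' or 'INDEL'")
    recal = f"recalibrate_{mode}.recal"
    tranches = f"recalibrate_{mode}.tranches"
    base = ["java", "-jar", gatk, "-R", ref, "-input", in_vcf,
            "-mode", mode, "-recalFile", recal, "-tranchesFile", tranches]

    # Step 1: build the recalibration model from trusted call sets
    build_cmd = base + ["-T", "VariantRecalibrator"]
    for name, (vcf, flags) in resources.items():
        build_cmd += [f"-resource:{name},{flags}", vcf]
    for annotation in annotations:
        build_cmd += ["-an", annotation]

    # Step 2: apply the model at the chosen truth sensitivity level
    apply_cmd = base + ["-T", "ApplyRecalibration",
                        "--ts_filter_level", str(ts_filter_level),
                        "-o", out_vcf]
    return [build_cmd, apply_cmd]


# Illustrative inputs only; substitute the call sets actually available.
snp_resources = {
    "hapmap": ("hapmap.vcf", "known=false,training=true,truth=true,prior=15.0"),
    "dbsnp": ("dbsnp.vcf", "known=true,training=false,truth=false,prior=2.0"),
}
indel_resources = {
    "mills": ("mills.vcf", "known=false,training=true,truth=true,prior=12.0"),
}
annotations = ["QD", "FS", "MQRankSum", "ReadPosRankSum"]

# SNPs first, then indels on the output of the SNP pass
snp_cmds = vqsr_commands("reference.fa", "raw_variants.vcf",
                         "recalibrated_snps_raw_indels.vcf", "SNP",
                         snp_resources, annotations)
indel_cmds = vqsr_commands("reference.fa", "recalibrated_snps_raw_indels.vcf",
                           "recalibrated_variants.vcf", "INDEL",
                           indel_resources, annotations)
for cmd in snp_cmds + indel_cmds:
    print(" ".join(cmd))
```

Chaining the indel pass onto the output of the SNP pass keeps both variant classes in one file while still building the two models separately.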
