DNA-seq_Germline_Variants

Germline Variants Pipeline

This repository provides a step-by-step pipeline for processing germline variant analysis using genomic sequencing data. The pipeline includes raw data preprocessing, alignment, variant calling, and annotation using widely used bioinformatics tools.


Table of Contents

  1. Server Information
  2. Genomic Sequencing Data
  3. QC raw reads
  4. Align reads
  5. Sort alignment
  6. Mark duplicates
  7. Index BAM file
  8. Variant Calling
  9. Combine Variants
  10. Genotype Variants
  11. Variant Filtering
  12. Genotype Concordance
  13. Variant Annotation
  14. Extract Variant Fields
  15. Class Exercise

1. Server Information

Prerequisites

Security Practices

Installing Miniconda

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
source ~/.bashrc
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

Installing software

# conda install mamba
# mamba install sra-tools fastqc trimmomatic multiqc curl 
# mamba install bwa samtools picard gatk4
# mamba install snpeff snpsift

2. Genomic Sequencing Data

Dataset Information

mkdir Germline_Variants
cd Germline_Variants/
# Download raw sequencing data
fastq-dump --split-files -X 100000 SRR1972917
fastq-dump --split-files -X 100000 SRR1972918

3. QC raw reads

Run FASTQC and Trimmomatic for Quality Filtering and Trimming

mkdir qc
fastqc *.fastq -o qc/
# Download adapter file and trim sequences
curl -OL https://raw.githubusercontent.com/BioInfoTools/BBMap/master/resources/adapters.fa > adapters.fa
trimmomatic PE SRR1972917_1.fastq SRR1972917_2.fastq \
    SRR1972917_trimmed_1.fastq SRR1972917_unpaired_1.fastq \
    SRR1972917_trimmed_2.fastq SRR1972917_unpaired_2.fastq \
    ILLUMINACLIP:adapters.fa:2:30:10 LEADING:20 TRAILING:20 AVGQUAL:20 MINLEN:20
trimmomatic PE SRR1972918_1.fastq SRR1972918_2.fastq \
    SRR1972918_trimmed_1.fastq SRR1972918_unpaired_1.fastq \
    SRR1972918_trimmed_2.fastq SRR1972918_unpaired_2.fastq \
    ILLUMINACLIP:adapters.fa:2:30:10 LEADING:20 TRAILING:20 AVGQUAL:20 MINLEN:20
mkdir qc_trimmed
fastqc *trimmed_*.fastq -o qc_trimmed/

4. Align reads

Run bwa mem

bwa mem -R '@RG\tID:SRR1972917\tSM:SRR1972917\tPL:ILLUMINA\tLB:SRR1972917' \
    /home/bqhs/ebola/AF086833.fa SRR1972917_trimmed_1.fastq SRR1972917_trimmed_2.fastq > SRR1972917_raw.sam
bwa mem -R '@RG\tID:SRR1972918\tSM:SRR1972918\tPL:ILLUMINA\tLB:SRR1972918' \
    /home/bqhs/ebola/AF086833.fa SRR1972918_trimmed_1.fastq SRR1972918_trimmed_2.fastq > SRR1972918_raw.sam

5. Sort alignment

Run samtools sort

samtools sort SRR1972917_raw.sam > SRR1972917_sort.bam
samtools sort SRR1972918_raw.sam > SRR1972918_sort.bam

6. Mark duplicates

Run Picard MarkDuplicates

# Mark Duplicates
picard MarkDuplicates -Xmx10g I=SRR1972917_sort.bam O=SRR1972917_dedup.bam M=SRR1972917_dedup.txt
# Collect Alignment Metrics
picard CollectAlignmentSummaryMetrics -Xmx10g INPUT=SRR1972917_dedup.bam OUTPUT=SRR1972917_aln_metrics.txt REFERENCE_SEQUENCE=/home/bqhs/ebola/AF086833.fa
# Provides summary of alignment statistics
samtools flagstat SRR1972917_dedup.bam
picard MarkDuplicates -Xmx10g I=SRR1972918_sort.bam O=SRR1972918_dedup.bam M=SRR1972918_dedup.txt
picard CollectAlignmentSummaryMetrics -Xmx10g INPUT=SRR1972918_dedup.bam OUTPUT=SRR1972918_aln_metrics.txt REFERENCE_SEQUENCE=/home/bqhs/ebola/AF086833.fa
samtools flagstat SRR1972918_dedup.bam

7. Index BAM file

Run samtools index

samtools index SRR1972917_dedup.bam
samtools index SRR1972918_dedup.bam

8. Variant Calling

Run GATK HaplotypeCaller

gatk HaplotypeCaller -R /home/bqhs/ebola/AF086833.fa -I SRR1972917_dedup.bam -O SRR1972917.g.vcf -ERC GVCF
gatk HaplotypeCaller -R /home/bqhs/ebola/AF086833.fa -I SRR1972918_dedup.bam -O SRR1972918.g.vcf -ERC GVCF

Why use -ERC GVCF?


9. Combine Variants

Run GATK CombineGVCFs

gatk CombineGVCFs -R /home/bqhs/ebola/AF086833.fa -V SRR1972917.g.vcf -V SRR1972918.g.vcf -O combined.g.vcf

10. Genotype Variants

Run GATK GenotypeGVCFs

gatk GenotypeGVCFs -R /home/bqhs/ebola/AF086833.fa -V combined.g.vcf -O combined.vcf

What’s the difference between GVCF and VCF?


11. Variant Filtering

Run GATK VariantFiltration

1st Filtering Step (Site-Level Filtering)

gatk VariantFiltration -R /home/bqhs/ebola/AF086833.fa -V combined.vcf -O combined.filter1.vcf \
    -filter "QUAL < 30.0 || DP < 10" --filter-name lowQualDp

2nd Filtering Step (Genotype-Level Filtering)

gatk VariantFiltration -R /home/bqhs/ebola/AF086833.fa -V combined.filter1.vcf -O combined.filter2.vcf \
    -G-filter "GQ < 20.0" -G-filter-name lowGQ

Why Perform Both Steps?


12. Genotype Concordance

Run GATK GenotypeConcordance

gatk GenotypeConcordance \
    -CV combined.filter2.vcf \
    -TV /home/bqhs/ebola/ebola-samples.vcf \
    -O SRR1972917.concordance.grp \
    -CS SRR1972917 \
    -TS SRR1972917

When should you run Genotype Concordance?


13. Variant Annotation

Run snpEff ann

# mamba install snpeff snpsift
snpEff ann -v \
  -c /home/vedbar_2025/miniconda3/share/snpeff-5.2-1/snpEff.config \
  -s snpeff.html \
  AF086833 \
  combined.filter2.vcf > combined.ann.vcf

14. Extract Variant Fields

Run snpSift extractFields

SnpSift extractFields combined.ann.vcf \
    ID CHROM POS REF ALT QUAL DP FILTER \
    ANN[0].GENE ANN[0].GENEID ANN[0].EFFECT ANN[0].IMPACT \
    ANN[0].BIOTYPE ANN[0].HGVS_C ANN[0].HGVS_P \
    GEN[0].GT GEN[0].GQ GEN[0].FT \
    GEN[1].GT GEN[1].GQ GEN[1].FT > combined.txt

Explanation of Each Field:


15. Class Exercise

Create Directory

mkdir Class
cd Class  

Run GATK HaplotypeCaller

gatk HaplotypeCaller -R /home/bqhs/hg38/genome.fa -I /home/bqhs/dna/mother.bam -O mother.g.vcf -ERC GVCF 
gatk HaplotypeCaller -R /home/bqhs/hg38/genome.fa -I /home/bqhs/dna/father.bam -O father.g.vcf -ERC GVCF 
gatk HaplotypeCaller -R /home/bqhs/hg38/genome.fa -I /home/bqhs/dna/son.bam    -O son.g.vcf -ERC GVCF

Run GATK CombineGVCFs

gatk CombineGVCFs -R /home/bqhs/hg38/genome.fa -V mother.g.vcf -V father.g.vcf -V son.g.vcf -O family.g.vcf

Note: Running GATK on big files takes time. So, you can copy vcf files and proceed to downstream analysis.

cp /home/bqhs/dna/Results/*vcf* ./

Run GATK GenotypeGVCFs

gatk GenotypeGVCFs -R /home/bqhs/hg38/genome.fa -V family.g.vcf -O family.vcf

Run GATK Variant Filtration

gatk VariantFiltration \
    -R /home/bqhs/hg38/genome.fa \
    -V family.vcf \
    -O family.filter.vcf \
    -filter "QUAL < 30.0 || DP < 10" \
    --filter-name lowQualDp

# Check the flags
grep "lowQualDp" family.filter.vcf

Run GATK Calculate Genotype Posteriors

cp /home/bqhs/dna/trio.ped ./

Pedigree

gatk CalculateGenotypePosteriors \
    -R /home/bqhs/hg38/genome.fa \
    -V family.filter.vcf \
    -ped trio.ped \
    -supporting /home/bqhs/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz \
    -O family.CGP.vcf

Key Benefits of calculating genotype posterior

Note: It is recommended to run CalculateGenotypePosteriors before VariantFiltration.

Run GATK Variant Filtration

gatk VariantFiltration \
    -R /home/bqhs/hg38/genome.fa \
    -V family.CGP.vcf \
    -O family.CGP.filter.vcf \
    -G-filter "GQ < 20.0" \
    -G-filter-name lowGQ
# Check the flags
grep "lowGQ" family.CGP.filter.vcf

Run GATK CollectVariantCallingMetrics

gatk CollectVariantCallingMetrics \
   -I family.CGP.filter.vcf \
   --DBSNP /home/bqhs/hg38/dbsnp_146.hg38.vcf.gz \
   -O family.CGP.filter.metrics

Run GATK GenotypeConcordance

Run snpEff ann

#conda install snpeff 
#conda install snpsift
#snpEff ann
#SnpSift extractFields
 snpEff  -Xmx10g ann hg38  -v -s snpeff.html family.CGP.filter.vcf > family.ann.vcf

Run SnpSift extractFields

SnpSift annotate /home/bqhs/hg38/dbsnp_146.hg38.vcf.gz family.ann.vcf > family.ann2.vcf

Getting a spreadsheet of variants with annotations

SnpSift extractFields family.ann2.vcf \
ID CHROM POS REF ALT QUAL DP FILTER \
ANN[0].GENE ANN[0].GENEID ANN[0].EFFECT ANN[0].IMPACT \
ANN[0].BIOTYPE ANN[0].HGVS_C ANN[0].HGVS_P \
GEN[0].GT GEN[0].GQ GEN[0].FT \
GEN[1].GT GEN[1].GQ GEN[1].FT \
GEN[2].GT GEN[2].GQ GEN[2].FT > family.txt

Transfer all results to your local machine via FileZilla and view them.

Workflow Review

Workflow


Contributing

Contributions to improve this pipeline are welcome!