This repository provides a step-by-step pipeline for processing germline variant analysis using genomic sequencing data. The pipeline includes raw data preprocessing, alignment, variant calling, and annotation using widely used bioinformatics tools.
168.105.161.70
22
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
source ~/.bashrc
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
# conda install mamba
# mamba install sra-tools fastqc trimmomatic multiqc curl
# mamba install bwa samtools picard gatk4
# mamba install snpeff snpsift
fastq-dump
to download paired-end reads:mkdir Germline_Variants
cd Germline_Variants/
# Download raw sequencing data
fastq-dump --split-files -X 100000 SRR1972917
fastq-dump --split-files -X 100000 SRR1972918
mkdir qc
fastqc *.fastq -o qc/
# Download adapter file and trim sequences
curl -OL https://raw.githubusercontent.com/BioInfoTools/BBMap/master/resources/adapters.fa > adapters.fa
trimmomatic PE SRR1972917_1.fastq SRR1972917_2.fastq \
SRR1972917_trimmed_1.fastq SRR1972917_unpaired_1.fastq \
SRR1972917_trimmed_2.fastq SRR1972917_unpaired_2.fastq \
ILLUMINACLIP:adapters.fa:2:30:10 LEADING:20 TRAILING:20 AVGQUAL:20 MINLEN:20
trimmomatic PE SRR1972918_1.fastq SRR1972918_2.fastq \
SRR1972918_trimmed_1.fastq SRR1972918_unpaired_1.fastq \
SRR1972918_trimmed_2.fastq SRR1972918_unpaired_2.fastq \
ILLUMINACLIP:adapters.fa:2:30:10 LEADING:20 TRAILING:20 AVGQUAL:20 MINLEN:20
mkdir qc_trimmed
fastqc *trimmed_*.fastq -o qc_trimmed/
bwa mem -R '@RG\tID:SRR1972917\tSM:SRR1972917\tPL:ILLUMINA\tLB:SRR1972917' \
/home/bqhs/ebola/AF086833.fa SRR1972917_trimmed_1.fastq SRR1972917_trimmed_2.fastq > SRR1972917_raw.sam
bwa mem -R '@RG\tID:SRR1972918\tSM:SRR1972918\tPL:ILLUMINA\tLB:SRR1972918' \
/home/bqhs/ebola/AF086833.fa SRR1972918_trimmed_1.fastq SRR1972918_trimmed_2.fastq > SRR1972918_raw.sam
samtools sort SRR1972917_raw.sam > SRR1972917_sort.bam
samtools sort SRR1972918_raw.sam > SRR1972918_sort.bam
# Mark Duplicates
picard MarkDuplicates -Xmx10g I=SRR1972917_sort.bam O=SRR1972917_dedup.bam M=SRR1972917_dedup.txt
# Collect Alignment Metrics
picard CollectAlignmentSummaryMetrics -Xmx10g INPUT=SRR1972917_dedup.bam OUTPUT=SRR1972917_aln_metrics.txt REFERENCE_SEQUENCE=/home/bqhs/ebola/AF086833.fa
# Provides summary of alignment statistics
samtools flagstat SRR1972917_dedup.bam
picard MarkDuplicates -Xmx10g I=SRR1972918_sort.bam O=SRR1972918_dedup.bam M=SRR1972918_dedup.txt
picard CollectAlignmentSummaryMetrics -Xmx10g INPUT=SRR1972918_dedup.bam OUTPUT=SRR1972918_aln_metrics.txt REFERENCE_SEQUENCE=/home/bqhs/ebola/AF086833.fa
samtools flagstat SRR1972918_dedup.bam
samtools index SRR1972917_dedup.bam
samtools index SRR1972918_dedup.bam
gatk HaplotypeCaller -R /home/bqhs/ebola/AF086833.fa -I SRR1972917_dedup.bam -O SRR1972917.g.vcf -ERC GVCF
gatk HaplotypeCaller -R /home/bqhs/ebola/AF086833.fa -I SRR1972918_dedup.bam -O SRR1972918.g.vcf -ERC GVCF
-ERC GVCF
?gatk CombineGVCFs -R /home/bqhs/ebola/AF086833.fa -V SRR1972917.g.vcf -V SRR1972918.g.vcf -O combined.g.vcf
combined.g.vcf
) to perform the final joint genotyping step and produce a standard VCF (combined.vcf
) containing the called variants for all samples.gatk GenotypeGVCFs -R /home/bqhs/ebola/AF086833.fa -V combined.g.vcf -O combined.vcf
combined.vcf
file.-R /home/bqhs/ebola/AF086833.fa
: Specifies the reference genome file (AF086833.fa).-V combined.vcf
: Input VCF file (combined.vcf).-O combined.filter1.vcf
: Output VCF file (combined.filter1.vcf) after applying the variant filters.-filter *QUAL < 30.0 || DP < 10*
Applies a filter to flag variants with:
--filter-name lowQualDp
: Labels filtered variants as “lowQualDp” in the VCF file.gatk VariantFiltration -R /home/bqhs/ebola/AF086833.fa -V combined.vcf -O combined.filter1.vcf \
-filter "QUAL < 30.0 || DP < 10" --filter-name lowQualDp
-R /home/bqhs/ebola/AF086833.fa
: Specifies the reference genome file (AF086833.fa).-V combined.filter1.vcf
: Takes the previously filtered VCF file (combined.filter1.vcf) as input.-O combined.filter2.vcf
: Outputs a new VCF file (combined.filter2.vcf) after applying genotype filters.-G-filter "GQ < 20.0"
: Applies a genotype-level filter to flag individual genotype calls where:
-G-filter-name lowGQ
: Labels filtered genotype calls as “lowGQ” in the VCF file.gatk VariantFiltration -R /home/bqhs/ebola/AF086833.fa -V combined.filter1.vcf -O combined.filter2.vcf \
-G-filter "GQ < 20.0" -G-filter-name lowGQ
-CV combined.filter2.vcf
: Callset VCF file (the file you generated from variant calling).-TV /home/bqhs/ebola/ebola-samples.vcf
: Truth VCF file (a reference VCF containing known variants).-O SRR1972917.concordance.grp
: Output concordance report.-CS SRR1972917
: Callset sample name.-TS SRR1972917
: Truth sample name.gatk GenotypeConcordance \
-CV combined.filter2.vcf \
-TV /home/bqhs/ebola/ebola-samples.vcf \
-O SRR1972917.concordance.grp \
-CS SRR1972917 \
-TS SRR1972917
snpEff ann
Annotation mode (ann stands for “annotate”).-v
Verbose mode (useful for debugging).-c /home/vedbar_2025/miniconda3/share/snpeff-5.2-1/snpEff.config
Specifies the configuration file location.-s snpeff.html
Generates a summary HTML report (snpeff.html).AF086833
The reference genome name (must match a genome database in snpEff).combined.filter2.vcf
Input VCF file with variants to annotate.combined.ann.vcf
Output as a new annotated VCF file.# mamba install snpeff snpsift
snpEff ann -v \
-c /home/vedbar_2025/miniconda3/share/snpeff-5.2-1/snpEff.config \
-s snpeff.html \
AF086833 \
combined.filter2.vcf > combined.ann.vcf
SnpSift extractFields combined.ann.vcf \
ID CHROM POS REF ALT QUAL DP FILTER \
ANN[0].GENE ANN[0].GENEID ANN[0].EFFECT ANN[0].IMPACT \
ANN[0].BIOTYPE ANN[0].HGVS_C ANN[0].HGVS_P \
GEN[0].GT GEN[0].GQ GEN[0].FT \
GEN[1].GT GEN[1].GQ GEN[1].FT > combined.txt
ID
→ Variant ID (if available)CHROM
→ Chromosome namePOS
→ Position of the variantREF
→ Reference alleleALT
→ Alternate allele(s)QUAL
→ Quality scoreDP
→ Read depthFILTER
→ Filter status (PASS or other filters)ANN[0].GENE
→ Gene name affected by the variantANN[0].GENEID
→ Gene ID (if available)ANN[0].EFFECT
→ Predicted effect of the variant (e.g., missense_variant, synonymous_variant)ANN[0].IMPACT
→ Impact classification (HIGH, MODERATE, LOW, MODIFIER)ANN[0].BIOTYPE
→ Gene biotype (protein-coding, lncRNA, etc.)ANN[0].HGVS_C
→ HGVS notation for coding sequence changeANN[0].HGVS_P
→ HGVS notation for protein changeGEN[0].GT
→ Genotype of the first sample (0/1, 1/1, etc.)GEN[0].GQ
→ Genotype quality for the first sampleGEN[0].FT
→ Filter status for the first sampleGEN[1].GT
→ Genotype of the second sample (if applicable)GEN[1].GQ
→ Genotype quality for the second sampleGEN[1].FT
→ Filter status for the second samplemkdir Class
cd Class
gatk HaplotypeCaller -R /home/bqhs/hg38/genome.fa -I /home/bqhs/dna/mother.bam -O mother.g.vcf -ERC GVCF
gatk HaplotypeCaller -R /home/bqhs/hg38/genome.fa -I /home/bqhs/dna/father.bam -O father.g.vcf -ERC GVCF
gatk HaplotypeCaller -R /home/bqhs/hg38/genome.fa -I /home/bqhs/dna/son.bam -O son.g.vcf -ERC GVCF
gatk CombineGVCFs -R /home/bqhs/hg38/genome.fa -V mother.g.vcf -V father.g.vcf -V son.g.vcf -O family.g.vcf
Note: Running GATK on big files takes time. So, you can copy vcf files and proceed to downstream analysis.
cp /home/bqhs/dna/Results/*vcf* ./
gatk GenotypeGVCFs -R /home/bqhs/hg38/genome.fa -V family.g.vcf -O family.vcf
gatk VariantFiltration \
-R /home/bqhs/hg38/genome.fa \
-V family.vcf \
-O family.filter.vcf \
-filter "QUAL < 30.0 || DP < 10" \
--filter-name lowQualDp
# Check the flags
grep "lowQualDp" family.filter.vcf
gatk CalculateGenotypePosteriors
is used to refine genotype calls by incorporating population-level allele frequency data and Mendelian inheritance priors (if family data is available).cp /home/bqhs/dna/trio.ped ./
gatk CalculateGenotypePosteriors \
-R /home/bqhs/hg38/genome.fa \
-V family.filter.vcf \
-ped trio.ped \
-supporting /home/bqhs/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz \
-O family.CGP.vcf
Note: It is recommended to run CalculateGenotypePosteriors before VariantFiltration.
gatk VariantFiltration \
-R /home/bqhs/hg38/genome.fa \
-V family.CGP.vcf \
-O family.CGP.filter.vcf \
-G-filter "GQ < 20.0" \
-G-filter-name lowGQ
# Check the flags
grep "lowGQ" family.CGP.filter.vcf
gatk CollectVariantCallingMetrics \
-I family.CGP.filter.vcf \
--DBSNP /home/bqhs/hg38/dbsnp_146.hg38.vcf.gz \
-O family.CGP.filter.metrics
# gatk GenotypeConcordance -CV [your callset vcf] -TV [truth set vcf] -O [output name] -CS [sample name in your callset] -TS [sample name in truth set]
#conda install snpeff
#conda install snpsift
#snpEff ann
#SnpSift extractFields
snpEff -Xmx10g ann hg38 -v -s snpeff.html family.CGP.filter.vcf > family.ann.vcf
SnpSift annotate /home/bqhs/hg38/dbsnp_146.hg38.vcf.gz family.ann.vcf > family.ann2.vcf
SnpSift extractFields family.ann2.vcf \
ID CHROM POS REF ALT QUAL DP FILTER \
ANN[0].GENE ANN[0].GENEID ANN[0].EFFECT ANN[0].IMPACT \
ANN[0].BIOTYPE ANN[0].HGVS_C ANN[0].HGVS_P \
GEN[0].GT GEN[0].GQ GEN[0].FT \
GEN[1].GT GEN[1].GQ GEN[1].FT \
GEN[2].GT GEN[2].GQ GEN[2].FT > family.txt