This repository provides a step-by-step pipeline for processing germline variant analysis using genomic sequencing data. The pipeline includes raw data preprocessing, alignment, variant calling, and annotation using widely used bioinformatics tools.
168.105.161.7022wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
source ~/.bashrc
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
# conda install mamba
# mamba install sra-tools fastqc trimmomatic multiqc curl
# mamba install bwa samtools picard gatk4
# mamba install snpeff snpsift
fastq-dump to download paired-end reads:mkdir Germline_Variants
cd Germline_Variants/
# Download raw sequencing data
fastq-dump --split-files -X 100000 SRR1972917
fastq-dump --split-files -X 100000 SRR1972918
mkdir qc
fastqc *.fastq -o qc/
# Download adapter file and trim sequences
curl -OL https://raw.githubusercontent.com/BioInfoTools/BBMap/master/resources/adapters.fa > adapters.fa
trimmomatic PE SRR1972917_1.fastq SRR1972917_2.fastq \
SRR1972917_trimmed_1.fastq SRR1972917_unpaired_1.fastq \
SRR1972917_trimmed_2.fastq SRR1972917_unpaired_2.fastq \
ILLUMINACLIP:adapters.fa:2:30:10 LEADING:20 TRAILING:20 AVGQUAL:20 MINLEN:20
trimmomatic PE SRR1972918_1.fastq SRR1972918_2.fastq \
SRR1972918_trimmed_1.fastq SRR1972918_unpaired_1.fastq \
SRR1972918_trimmed_2.fastq SRR1972918_unpaired_2.fastq \
ILLUMINACLIP:adapters.fa:2:30:10 LEADING:20 TRAILING:20 AVGQUAL:20 MINLEN:20
mkdir qc_trimmed
fastqc *trimmed_*.fastq -o qc_trimmed/
bwa mem -R '@RG\tID:SRR1972917\tSM:SRR1972917\tPL:ILLUMINA\tLB:SRR1972917' \
/home/bqhs/ebola/AF086833.fa SRR1972917_trimmed_1.fastq SRR1972917_trimmed_2.fastq > SRR1972917_raw.sam
bwa mem -R '@RG\tID:SRR1972918\tSM:SRR1972918\tPL:ILLUMINA\tLB:SRR1972918' \
/home/bqhs/ebola/AF086833.fa SRR1972918_trimmed_1.fastq SRR1972918_trimmed_2.fastq > SRR1972918_raw.sam
samtools sort SRR1972917_raw.sam > SRR1972917_sort.bam
samtools sort SRR1972918_raw.sam > SRR1972918_sort.bam
# Mark Duplicates
picard MarkDuplicates -Xmx10g I=SRR1972917_sort.bam O=SRR1972917_dedup.bam M=SRR1972917_dedup.txt
# Collect Alignment Metrics
picard CollectAlignmentSummaryMetrics -Xmx10g INPUT=SRR1972917_dedup.bam OUTPUT=SRR1972917_aln_metrics.txt REFERENCE_SEQUENCE=/home/bqhs/ebola/AF086833.fa
# Provides summary of alignment statistics
samtools flagstat SRR1972917_dedup.bam
picard MarkDuplicates -Xmx10g I=SRR1972918_sort.bam O=SRR1972918_dedup.bam M=SRR1972918_dedup.txt
picard CollectAlignmentSummaryMetrics -Xmx10g INPUT=SRR1972918_dedup.bam OUTPUT=SRR1972918_aln_metrics.txt REFERENCE_SEQUENCE=/home/bqhs/ebola/AF086833.fa
samtools flagstat SRR1972918_dedup.bam
samtools index SRR1972917_dedup.bam
samtools index SRR1972918_dedup.bam
gatk HaplotypeCaller -R /home/bqhs/ebola/AF086833.fa -I SRR1972917_dedup.bam -O SRR1972917.g.vcf -ERC GVCF
gatk HaplotypeCaller -R /home/bqhs/ebola/AF086833.fa -I SRR1972918_dedup.bam -O SRR1972918.g.vcf -ERC GVCF
-ERC GVCF?gatk CombineGVCFs -R /home/bqhs/ebola/AF086833.fa -V SRR1972917.g.vcf -V SRR1972918.g.vcf -O combined.g.vcf
combined.g.vcf) to perform the final joint genotyping step and produce a standard VCF (combined.vcf) containing the called variants for all samples.gatk GenotypeGVCFs -R /home/bqhs/ebola/AF086833.fa -V combined.g.vcf -O combined.vcf
combined.vcf file.-R /home/bqhs/ebola/AF086833.fa: Specifies the reference genome file (AF086833.fa).-V combined.vcf: Input VCF file (combined.vcf).-O combined.filter1.vcf: Output VCF file (combined.filter1.vcf) after applying the variant filters.-filter *QUAL < 30.0 || DP < 10* Applies a filter to flag variants with:
--filter-name lowQualDp: Labels filtered variants as “lowQualDp” in the VCF file.gatk VariantFiltration -R /home/bqhs/ebola/AF086833.fa -V combined.vcf -O combined.filter1.vcf \
-filter "QUAL < 30.0 || DP < 10" --filter-name lowQualDp
-R /home/bqhs/ebola/AF086833.fa: Specifies the reference genome file (AF086833.fa).-V combined.filter1.vcf: Takes the previously filtered VCF file (combined.filter1.vcf) as input.-O combined.filter2.vcf: Outputs a new VCF file (combined.filter2.vcf) after applying genotype filters.-G-filter "GQ < 20.0": Applies a genotype-level filter to flag individual genotype calls where:
-G-filter-name lowGQ: Labels filtered genotype calls as “lowGQ” in the VCF file.gatk VariantFiltration -R /home/bqhs/ebola/AF086833.fa -V combined.filter1.vcf -O combined.filter2.vcf \
-G-filter "GQ < 20.0" -G-filter-name lowGQ
-CV combined.filter2.vcf: Callset VCF file (the file you generated from variant calling).-TV /home/bqhs/ebola/ebola-samples.vcf: Truth VCF file (a reference VCF containing known variants).-O SRR1972917.concordance.grp: Output concordance report.-CS SRR1972917: Callset sample name.-TS SRR1972917: Truth sample name.gatk GenotypeConcordance \
-CV combined.filter2.vcf \
-TV /home/bqhs/ebola/ebola-samples.vcf \
-O SRR1972917.concordance.grp \
-CS SRR1972917 \
-TS SRR1972917
snpEff ann Annotation mode (ann stands for “annotate”).-v Verbose mode (useful for debugging).-c /home/vedbar_2025/miniconda3/share/snpeff-5.2-1/snpEff.config Specifies the configuration file location.-s snpeff.html Generates a summary HTML report (snpeff.html).AF086833 The reference genome name (must match a genome database in snpEff).combined.filter2.vcf Input VCF file with variants to annotate.combined.ann.vcf Output as a new annotated VCF file.# mamba install snpeff snpsift
snpEff ann -v \
-c /home/vedbar_2025/miniconda3/share/snpeff-5.2-1/snpEff.config \
-s snpeff.html \
AF086833 \
combined.filter2.vcf > combined.ann.vcf
SnpSift extractFields combined.ann.vcf \
ID CHROM POS REF ALT QUAL DP FILTER \
ANN[0].GENE ANN[0].GENEID ANN[0].EFFECT ANN[0].IMPACT \
ANN[0].BIOTYPE ANN[0].HGVS_C ANN[0].HGVS_P \
GEN[0].GT GEN[0].GQ GEN[0].FT \
GEN[1].GT GEN[1].GQ GEN[1].FT > combined.txt
ID → Variant ID (if available)CHROM → Chromosome namePOS → Position of the variantREF → Reference alleleALT → Alternate allele(s)QUAL → Quality scoreDP → Read depthFILTER → Filter status (PASS or other filters)ANN[0].GENE → Gene name affected by the variantANN[0].GENEID → Gene ID (if available)ANN[0].EFFECT → Predicted effect of the variant (e.g., missense_variant, synonymous_variant)ANN[0].IMPACT → Impact classification (HIGH, MODERATE, LOW, MODIFIER)ANN[0].BIOTYPE → Gene biotype (protein-coding, lncRNA, etc.)ANN[0].HGVS_C → HGVS notation for coding sequence changeANN[0].HGVS_P → HGVS notation for protein changeGEN[0].GT → Genotype of the first sample (0/1, 1/1, etc.)GEN[0].GQ → Genotype quality for the first sampleGEN[0].FT → Filter status for the first sampleGEN[1].GT → Genotype of the second sample (if applicable)GEN[1].GQ → Genotype quality for the second sampleGEN[1].FT → Filter status for the second samplemkdir Class
cd Class
gatk HaplotypeCaller -R /home/bqhs/hg38/genome.fa -I /home/bqhs/dna/mother.bam -O mother.g.vcf -ERC GVCF
gatk HaplotypeCaller -R /home/bqhs/hg38/genome.fa -I /home/bqhs/dna/father.bam -O father.g.vcf -ERC GVCF
gatk HaplotypeCaller -R /home/bqhs/hg38/genome.fa -I /home/bqhs/dna/son.bam -O son.g.vcf -ERC GVCF
gatk CombineGVCFs -R /home/bqhs/hg38/genome.fa -V mother.g.vcf -V father.g.vcf -V son.g.vcf -O family.g.vcf
Note: Running GATK on big files takes time. So, you can copy vcf files and proceed to downstream analysis.
cp /home/bqhs/dna/Results/*vcf* ./
gatk GenotypeGVCFs -R /home/bqhs/hg38/genome.fa -V family.g.vcf -O family.vcf
gatk VariantFiltration \
-R /home/bqhs/hg38/genome.fa \
-V family.vcf \
-O family.filter.vcf \
-filter "QUAL < 30.0 || DP < 10" \
--filter-name lowQualDp
# Check the flags
grep "lowQualDp" family.filter.vcf
gatk CalculateGenotypePosteriors is used to refine genotype calls by incorporating population-level allele frequency data and Mendelian inheritance priors (if family data is available).cp /home/bqhs/dna/trio.ped ./

gatk CalculateGenotypePosteriors \
-R /home/bqhs/hg38/genome.fa \
-V family.filter.vcf \
-ped trio.ped \
-supporting /home/bqhs/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz \
-O family.CGP.vcf
Note: It is recommended to run CalculateGenotypePosteriors before VariantFiltration.
gatk VariantFiltration \
-R /home/bqhs/hg38/genome.fa \
-V family.CGP.vcf \
-O family.CGP.filter.vcf \
-G-filter "GQ < 20.0" \
-G-filter-name lowGQ
# Check the flags
grep "lowGQ" family.CGP.filter.vcf
gatk CollectVariantCallingMetrics \
-I family.CGP.filter.vcf \
--DBSNP /home/bqhs/hg38/dbsnp_146.hg38.vcf.gz \
-O family.CGP.filter.metrics
# gatk GenotypeConcordance -CV [your callset vcf] -TV [truth set vcf] -O [output name] -CS [sample name in your callset] -TS [sample name in truth set]
#conda install snpeff
#conda install snpsift
#snpEff ann
#SnpSift extractFields
snpEff -Xmx10g ann hg38 -v -s snpeff.html family.CGP.filter.vcf > family.ann.vcf
SnpSift annotate /home/bqhs/hg38/dbsnp_146.hg38.vcf.gz family.ann.vcf > family.ann2.vcf
SnpSift extractFields family.ann2.vcf \
ID CHROM POS REF ALT QUAL DP FILTER \
ANN[0].GENE ANN[0].GENEID ANN[0].EFFECT ANN[0].IMPACT \
ANN[0].BIOTYPE ANN[0].HGVS_C ANN[0].HGVS_P \
GEN[0].GT GEN[0].GQ GEN[0].FT \
GEN[1].GT GEN[1].GQ GEN[1].FT \
GEN[2].GT GEN[2].GQ GEN[2].FT > family.txt
