DNA-seq_Somatic_Variants

Somatic Variants Pipeline

This repository is designed to analyze somatic mutations from HCC1143 tumor samples with a matched normal control. The pipeline includes preprocessing, somatic variant calling, filtering, annotation, and copy number variation (CNV) detection using widely used bioinformatics tools.


Table of Contents

  1. Server Information
  2. Genomic Sequencing Data
  3. Preprocessing Steps
  4. Somatic Variant Calling and Filtering
  5. Annotate Variants
  6. Copy Number Variation

1. Server Information

Prerequisites

Security Practices

Installing Miniconda

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
source ~/.bashrc
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

Installing software

# conda install mamba
# mamba install sra-tools fastqc trimmomatic multiqc curl 
# mamba install bwa samtools picard gatk4
# mamba install snpeff snpsift

2. Genomic Sequencing Data

Sample Information


3. Preprocessing Steps

Steps Common to Germline and Somatic Variants

  1. Quality Control:
    • FastQC, Trimmomatic → Checks read quality and trims adapters.
  2. Read Alignment:
    • bwa mem → Maps reads to the reference genome.
  3. Sort alignment
    • samtools sort → Organizes BAM file
  4. Mark duplicates
    • picard MarkDuplicates → Removes PCR artifacts
  5. Recalibrate base quality scores
    • GATK BaseRecalibrator & GATK ApplyBQSR → Adjusts systematic sequencing errors
  6. Index BAM file
    • samtools index → Enables efficient access to BAM files.

4. Somatic Variant Calling and Filtering

mkdir -p Somatic_mutation/Exercise
cd Somatic_mutation/Exercise

1. Run GATK Mutect2

 gatk Mutect2 \
    -R /home/bqhs/mutect2/Homo_sapiens_assembly38.fasta \
    -I /home/bqhs/mutect2/tumor.bam -tumor HCC1143_tumor \
    -I /home/bqhs/mutect2/normal.bam -normal HCC1143_normal \
    -pon /home/bqhs/mutect2/chr17_m2pon.vcf.gz \
    --germline-resource /home/bqhs/mutect2/chr17_af-only-gnomad_grch38.vcf.gz \
    -L /home/bqhs/mutect2/chr17plus.interval_list \
    -O somatic_m2.vcf.gz

Extract Read Group Information from BAM Header

samtools view -H /home/bqhs/mutect2/tumor.bam | grep '@RG'
samtools view -H /home/bqhs/mutect2/normal.bam | grep '@RG'

Extract Sample Name Using GATK

gatk GetSampleName -I /home/bqhs/mutect2/tumor.bam  -O tumor.txt
gatk GetSampleName -I /home/bqhs/mutect2/normal.bam  -O normal.txt

Extract subsets records that contain a comma in the 5th column.

zcat somatic_m2.vcf.gz | awk '$5 ~","'

https://gatk.broadinstitute.org/hc/en-us/articles/360035531692-VCF-Variant-Call-Format

https://gatk.broadinstitute.org/hc/en-us/articles/360035531912-Spanning-or-overlapping-deletions-allele-

Displays the first 10,000 lines of the decompressed VCF file

zcat somatic_m2.vcf.gz | head -n 10000

View the standard Format in the VCF header

zcat somatic_m2.vcf.gz | grep '##FORMAT'

View the standard INFO in the VCF header

zcat somatic_m2.vcf.gz | grep '##INFO'

2. Run GATK GetPileupSummaries

# For Normal
gatk GetPileupSummaries \
  -R /home/bqhs/mutect2/Homo_sapiens_assembly38.fasta \
  -I /home/bqhs/mutect2/normal.bam \
  -V /home/bqhs/mutect2/chr17_small_exac_common_3_grch38.vcf.gz \
  -L /home/bqhs/mutect2/targets_chr17.interval_list \
  -O normal.pileups.table
# For Tumor
gatk GetPileupSummaries \
  -R /home/bqhs/mutect2/Homo_sapiens_assembly38.fasta \
  -I /home/bqhs/mutect2/tumor.bam \
  -V /home/bqhs/mutect2/chr17_small_exac_common_3_grch38.vcf.gz \
  -L /home/bqhs/mutect2/targets_chr17.interval_list \
  -O tumor.pileups.table


Displays the contents of the tumor pileup summary table.

cat tumor.pileups.table | grep '^chr17' | awk '$5>=3'
cat normal.pileups.table | grep '^chr17' | awk '$5>=3'

3. Run GATK CalculateContamination

gatk CalculateContamination \
  -I tumor.pileups.table \
  -matched normal.pileups.table \
  -O contamination.table

4. Run GATK FilterMutectCalls

gatk FilterMutectCalls \
  -V somatic_m2.vcf.gz \
  --contamination-table contamination.table \
  -R /home/bqhs/mutect2/Homo_sapiens_assembly38.fasta \
  -O somatic.filtered.vcf.gz
# contamination-table: Tables containing contamination information
zcat somatic.filtered.vcf.gz | grep '##FILTER'
zcat somatic.filtered.vcf.gz | grep 'germ-line'
zcat somatic.filtered.vcf.gz | grep 'PASS'
zcat somatic.filtered.vcf.gz | grep 'contamiation'

5. Run GATK CollectSequencingArtifactMetrics

gatk CollectSequencingArtifactMetrics \
  -R /home/bqhs/mutect2/Homo_sapiens_assembly38.fasta \
  -I /home/bqhs/mutect2/tumor.bam \
  -EXT ".txt" \
  -O tumor_artifact


5. Annotate Variants

Functional annotation

SnpEff does not accept compressed VCF files. So unzip the file.

gunzip -c somatic.filtered.vcf.gz > somatic.filtered.vcf

snpEff is a variant effect predictor that annotates variants with their potential functional impact.

snpEff -Xmx2g ann hg38 -v -s snpeff.html somatic.filtered.vcf > somatic.filtered.ann.vcf

Annotate with dbSNP Using SnpSift

SnpSift annotate /home/bqhs/hg38/dbsnp_146.hg38.vcf.gz somatic.filtered.ann.vcf > somatic.filtered_1.ann.vcf

Lists sample names present in the final annotated VCF

# mamba install bcftools
bcftools query -l somatic.filtered_1.ann.vcf

Extract relevant fields

6. Copy Number Variation

Create a New Conda Environment with Python 3.9

conda create -n cnvkit_env python=3.9 -y

Activate the New Environment

conda activate cnvkit_env

Install Pandas 1.5.3

conda install pandas=1.5.3 -y

Install CNVkit

conda install -c bioconda cnvkit

Verify CNVkit Installation

cnvkit.py version

Run CNV detection using CNVKit

cnvkit.py batch /home/bqhs/mutect2/tumor.bam \
    -n /home/bqhs/mutect2/normal.bam \
    -t /home/bqhs/mutect2/targets_chr17.interval_list \
    -f /home/bqhs/mutect2/Homo_sapiens_assembly38.fasta \
    --annotate /home/bqhs/mutect2/refFlat.txt \
    --scatter --diagram \
    -d cnvkit_output

Call CNVs

cd cnvkit_output
cnvkit.py call tumor.cns -o tumor.call.cns 

Copy Number Variation (CNV) plot for a tumor sample

tumor-scatter

Deactivate Environment After Running

conda deactivate

Few more steps

Examine results in IGV.

cp /home/bqhs/mutect2/tumor.bam ./
cp /home/bqhs/mutect2/normal.bam ./
samtools index tumor.bam
samtools index normal.bam
cp /home/bqhs/mutect2/chr17_m2pon.vcf.gz ./
cp /home/bqhs/mutect2/chr17_af-only-gnomad_grch38.vcf.gz ./

IGV_Result_1

IGV_Result_2