DNA-seq_Genome_Alignment

Genome Alignment Pipeline

This repository provides a step-by-step guide to genome alignment using breast cancer and non-tumor breast cell line datasets. It covers quality control (FastQC, Trimmomatic), read alignment (BWA-MEM), post-processing (samtools, Picard, GATK), and visualization (IGV).


Table of Contents

  1. Server Information
  2. Dataset
  3. Quality Visualization
  4. Quality Filtering and Trimming
  5. Genome Alignment
  6. Sort and Convert to BAM Format
  7. Mark Duplicates
  8. Recalibrate Base Quality Scores
  9. Index BAM File
  10. Visualization using IGV
  11. Exercise
  12. Useful Links

1. Server Information

Prerequisites

Security Practices

Installing Miniconda

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
source ~/.bashrc
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

Installing software

# conda install mamba
# mamba install sra-tools fastqc trimmomatic multiqc curl spades quast
mamba install bwa samtools picard gatk4

2. Dataset

Dataset Information

fastq-dump --split-files -X 100000 SRR097848
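Because `-X 100000` caps the download at the first 100,000 spots and `--split-files` writes one file per mate, a quick sanity check is to confirm that both FASTQ files hold the same number of reads. The sketch below is one way to do that, assuming uncompressed four-line FASTQ records; the helper names `count_reads` and `check_pairs` are illustrative and not part of SRA Toolkit.

```shell
# Count reads in an uncompressed FASTQ file (4 lines per record).
count_reads() {
    local lines
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

# Report both counts and succeed only if the two mate files agree.
check_pairs() {
    local n1 n2
    n1=$(count_reads "$1")
    n2=$(count_reads "$2")
    echo "$1: $n1 reads; $2: $n2 reads"
    [ "$n1" -eq "$n2" ]
}

# Usage (file names as written by the fastq-dump command above):
# check_pairs SRR097848_1.fastq SRR097848_2.fastq
```

A mismatch usually points to an interrupted download, and is worth resolving before trimming, since paired-end tools expect the two files to stay in step.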

3. Quality Visualization

Run FASTQC

mkdir qc
fastqc *.fastq -o qc/

4. Quality Filtering and Trimming

Download Adapter Sequences

curl -L -o adapters.fa https://raw.githubusercontent.com/BioInfoTools/BBMap/master/resources/adapters.fa

Trim Low-Quality Reads and Adapters using Trimmomatic

trimmomatic PE SRR097848_1.fastq SRR097848_2.fastq trimmed_1.fastq unpaired_1.fastq trimmed_2.fastq unpaired_2.fastq ILLUMINACLIP:adapters.fa:2:30:10 LEADING:20 TRAILING:20 AVGQUAL:20 MINLEN:20
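To get a feel for what the `AVGQUAL:20` step filters on, the sketch below prints the mean Phred score of each read in a FASTQ file. It assumes Phred+33 quality encoding (which is what fastq-dump writes) and four-line records; `mean_quality` is a hypothetical helper, not a Trimmomatic feature. Trimmomatic applies its steps in the order given, so it evaluates the average after adapter clipping and end trimming, whereas this sketch looks at whatever file it is handed.

```shell
# Print "read_name<TAB>mean_phred" for every read in a Phred+33 FASTQ file.
mean_quality() {
    awk '
        # Build a character -> ASCII code lookup for the printable range.
        BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
        # Header line: keep the read name without the leading "@".
        NR % 4 == 1 { name = substr($1, 2) }
        # Quality line: average the per-base Phred scores.
        NR % 4 == 0 {
            total = 0
            n = length($0)
            for (i = 1; i <= n; i++) total += ord[substr($0, i, 1)] - 33
            printf "%s\t%.1f\n", name, (n > 0 ? total / n : 0)
        }
    ' "$1"
}

# Usage: list reads whose mean quality is below 20.
# mean_quality trimmed_1.fastq | awk -F '\t' '$2 < 20'
```

Run on the trimmed output, the usage line should print nothing if the filter behaved as expected.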

5. Genome Alignment

Step 1: Index the Reference Genome

# bwa index /home/bqhs/hg38/BWAIndex/version0.6.0/genome.fa

Step 2: Align Paired-End Reads to the Reference Genome

bwa mem -R '@RG\tID:SRR097848\tSM:SRR097848\tPL:ILLUMINA\tLB:SRR097848' /home/bqhs/hg38/BWAIndex/version0.6.0/genome.fa trimmed_1.fastq trimmed_2.fastq > SRR097848_raw.sam
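The `-R` read group string repeats the accession in three fields, so it is easy to mistype when adapting the command to another sample. One option is to build it from a single variable, as in the sketch below. `build_read_group` is an illustrative helper; it reuses the accession for ID, SM, and LB, as the command above does, and defaults the platform to ILLUMINA.

```shell
# Build a read group string for `bwa mem -R` from a sample accession.
# The output contains literal "\t" sequences, which bwa converts to tabs.
build_read_group() {
    local sample="$1" platform="${2:-ILLUMINA}"
    printf '@RG\\tID:%s\\tSM:%s\\tPL:%s\\tLB:%s' \
        "$sample" "$sample" "$platform" "$sample"
}

# Usage (paths as in the alignment command above):
# bwa mem -R "$(build_read_group SRR097848)" \
#     /home/bqhs/hg38/BWAIndex/version0.6.0/genome.fa \
#     trimmed_1.fastq trimmed_2.fastq > SRR097848_raw.sam
```

Read groups matter downstream: GATK tools such as BaseRecalibrator require them to be present in the BAM file.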

6. Sort and Convert to BAM Format

Sort the SAM file by genomic coordinates for efficient downstream processing.

samtools sort -o SRR097848_sort.bam SRR097848_raw.sam

7. Mark Duplicates

Why is it important to mark duplicates in sequencing data?

Why “Mark” Instead of “Remove”?

Step 1: Mark Duplicates

picard MarkDuplicates -Xmx50g I=SRR097848_sort.bam O=SRR097848_dedup.bam M=SRR097848_dedup.txt
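The metrics file named by `M=` is a tab-separated table preceded by comment lines, and the duplication rate sits in its `PERCENT_DUPLICATION` column. The sketch below pulls one column from the first data row. It assumes the usual Picard layout, in which a `## METRICS CLASS` line is followed by a header row and then the data rows; `picard_metric` is a hypothetical helper name.

```shell
# Print the value of one named column from the first data row of a
# Picard metrics file. Exits non-zero if the column is not present.
picard_metric() {
    local file="$1" column="$2"
    awk -F '\t' -v col="$column" '
        /^## METRICS CLASS/ { in_metrics = 1; next }
        # First line after the class marker is the column header row.
        in_metrics && !idx {
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) exit 1
            next
        }
        # Next line is the first data row: print the requested field.
        in_metrics && idx { print $idx; exit }
    ' "$file"
}

# Usage:
# picard_metric SRR097848_dedup.txt PERCENT_DUPLICATION
```

The value is reported as a fraction rather than a percentage, despite the column name.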

Step 2: Collect Alignment Metrics

picard CollectAlignmentSummaryMetrics -Xmx50g INPUT=SRR097848_dedup.bam OUTPUT=SRR097848_aln_metrics.txt REFERENCE_SEQUENCE=/home/bqhs/hg38/genome.fa

8. Recalibrate Base Quality Scores

Step 1: Base Quality Score Recalibration (BQSR Calculation)

gatk BaseRecalibrator -R /home/bqhs/hg38/genome.fa -I SRR097848_dedup.bam --known-sites /home/bqhs/hg38/dbsnp_146.hg38.vcf.gz --known-sites /home/bqhs/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz -O recal.table

Step 2: Apply BQSR Corrections

gatk ApplyBQSR -R /home/bqhs/hg38/genome.fa -I SRR097848_dedup.bam --bqsr-recal-file recal.table -O SRR097848_FINAL.bam

9. Index BAM File

Create an index (.bai) for the final, coordinate-sorted BAM file to enable fast access to genomic regions.

samtools index SRR097848_FINAL.bam
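Sections 6 to 9 run the same commands for every sample and differ only in the accession, so they can be wrapped in a function. The sketch below is one possible arrangement, not the course's official script: `run` and `post_process` are illustrative names, the reference paths are copied from the commands above, and the recalibration table is named per sample (an assumption made here so that tables from different samples cannot overwrite each other). Setting `DRY_RUN=1` prints the commands instead of executing them.

```shell
# Reference locations, copied from the commands in this guide.
REF=/home/bqhs/hg38/genome.fa
KNOWN=/home/bqhs/hg38

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Sort, mark duplicates, recalibrate, and index <sample>_raw.sam.
# The && chain stops at the first step that fails.
post_process() {
    local s="$1"
    run samtools sort -o "${s}_sort.bam" "${s}_raw.sam" &&
    run picard MarkDuplicates -Xmx50g \
        I="${s}_sort.bam" O="${s}_dedup.bam" M="${s}_dedup.txt" &&
    run gatk BaseRecalibrator -R "$REF" -I "${s}_dedup.bam" \
        --known-sites "$KNOWN/dbsnp_146.hg38.vcf.gz" \
        --known-sites "$KNOWN/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz" \
        -O "${s}_recal.table" &&
    run gatk ApplyBQSR -R "$REF" -I "${s}_dedup.bam" \
        --bqsr-recal-file "${s}_recal.table" -O "${s}_FINAL.bam" &&
    run samtools index "${s}_FINAL.bam"
}

# Usage: preview the commands first, then run them for real.
# DRY_RUN=1 post_process SRR097848
# post_process SRR097848
```

The alignment step is left out on purpose because it relies on a shell redirect, which a simple command wrapper like `run` cannot reproduce.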

10. Visualization using IGV

Load the indexed BAM file into the IGV genome browser (IGV Software).

IGV Browser


11. Exercise

11.1 DNAseq Data

mkdir -p breast_cancer
cd breast_cancer
fastq-dump --split-files -X 100000 SRR097849


11.2 QC and Trim Raw Reads

mkdir -p qc
fastqc *.fastq -o qc/
curl -L -o adapters.fa https://raw.githubusercontent.com/BioInfoTools/BBMap/master/resources/adapters.fa
trimmomatic PE SRR097849_1.fastq SRR097849_2.fastq trimmed_1.fastq unpaired_1.fastq trimmed_2.fastq unpaired_2.fastq ILLUMINACLIP:adapters.fa:2:30:10 LEADING:20 TRAILING:20 AVGQUAL:20 MINLEN:20

11.3 Align Reads using BWA MEM

bwa mem -t 10 -R '@RG\tID:SRR097849\tSM:SRR097849\tPL:ILLUMINA\tLB:SRR097849' /home/bqhs/hg38/BWAIndex/version0.6.0/genome.fa trimmed_1.fastq trimmed_2.fastq > SRR097849_raw.sam

11.4 Sort and Convert to BAM Format

samtools sort -o SRR097849_sort.bam SRR097849_raw.sam

11.5 Mark Duplicates

picard MarkDuplicates -Xmx50g I=SRR097849_sort.bam O=SRR097849_dedup.bam M=SRR097849_dedup.txt
picard CollectAlignmentSummaryMetrics -Xmx50g INPUT=SRR097849_dedup.bam OUTPUT=SRR097849_aln_metrics.txt REFERENCE_SEQUENCE=/home/bqhs/hg38/genome.fa

11.6 Recalibrate Base Quality Scores

gatk BaseRecalibrator -R /home/bqhs/hg38/genome.fa -I SRR097849_dedup.bam --known-sites /home/bqhs/hg38/dbsnp_146.hg38.vcf.gz --known-sites /home/bqhs/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz -O recal.table
gatk ApplyBQSR -R /home/bqhs/hg38/genome.fa -I SRR097849_dedup.bam --bqsr-recal-file recal.table -O SRR097849_FINAL.bam

11.7 Index BAM File

samtools index SRR097849_FINAL.bam

11.8 Visualization using IGV

Videos


Contributing

Contributions to improve this pipeline are welcome!