DNA-seq_Genome_Assembly

Genome Assembly Pipeline

This repository provides a detailed step-by-step guide to genome assembly, using a Rhodobacter sphaeroides dataset as the primary example. It covers data preprocessing, genome assembly, evaluation, visualization, comparison, and annotation. It also includes an exercise that assembles a downsampled Zaire Ebolavirus dataset for rapid analysis.


Table of Contents

  1. Server Information
  2. Dataset
  3. Quality Visualization
  4. Quality Filtering and Trimming
  5. Genome Assembly
  6. Assembly Evaluation
  7. Assembly Visualization
  8. Comparing Genomes
  9. Genome Annotation
  10. Exercise
  11. Educational Notes

1. Server Information

Prerequisites

Security Practices

Installing Miniconda

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
source ~/.bashrc
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

Installing software

conda install mamba
mamba install sra-tools fastqc trimmomatic multiqc curl spades quast entrez-direct

2. Dataset

Dataset Information

Why This Dataset?

This dataset was chosen for its small genome size, high-quality sequencing data, and suitability for teaching genome assembly principles.

Download Dataset

  1. Create a directory for the assembly process:
     mkdir -p assembly/rhodobacter
     cd assembly/rhodobacter
    
  2. Use fastq-dump to download paired-end reads:
     fastq-dump --split-files SRR522246
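
A quick sanity check after the download is to confirm that both mate files contain the same number of reads. The sketch below assumes standard four-line FASTQ records and the file names that fastq-dump --split-files produces for this accession; the helper names count_reads and check_pairs are illustrative, not part of any tool.

```shell
# Count reads in a FASTQ file, assuming the usual 4 lines per record.
count_reads() {
  awk 'END { print int(NR / 4) }' "$1"
}

# Print both counts; succeed only if the two mate files agree.
check_pairs() {
  n1=$(count_reads "$1")
  n2=$(count_reads "$2")
  echo "$1: $n1 reads"
  echo "$2: $n2 reads"
  [ "$n1" -eq "$n2" ]
}

# Run the check only when the downloaded files are present.
if [ -f SRR522246_1.fastq ] && [ -f SRR522246_2.fastq ]; then
  check_pairs SRR522246_1.fastq SRR522246_2.fastq ||
    echo "Mate files contain different numbers of reads" >&2
fi
```

A mismatch usually points to an interrupted download, in which case rerunning fastq-dump is the simplest fix.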
    

3. Quality Visualization

Why Perform Quality Control?

Steps

  1. Create a directory for quality control results:
     mkdir -p qc
    
  2. Run FastQC to assess the quality of raw reads:
     fastqc *.fastq -o qc/
    
  3. Aggregate and summarize results using MultiQC:
     multiqc qc/
    
    • Transfer the qc directory and the MultiQC report (multiqc_report.html) to your local machine and open the HTML reports in a browser.
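
FastQC reports the GC content of each file, and that value can be cross-checked directly on the command line. The following is a minimal sketch, assuming four-line FASTQ records; gc_percent is a hypothetical helper name, and the result is an overall percentage rather than FastQC's per-read distribution.

```shell
# Percent GC over all sequence lines of a FASTQ file
# (the sequence is line 2 of every 4-line record).
gc_percent() {
  awk 'NR % 4 == 2 {
         seq = toupper($0)
         total += length(seq)
         gc += gsub(/[GC]/, "", seq)   # gsub returns the number of replacements
       }
       END {
         if (total > 0) printf "%.2f\n", 100 * gc / total
         else print "NA"
       }' "$1"
}

# Report %GC for each raw read file that is present.
for f in SRR522246_1.fastq SRR522246_2.fastq; do
  if [ -f "$f" ]; then
    echo "$f: $(gc_percent "$f")% GC"
  fi
done
```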

Key Metrics to Examine


4. Quality Filtering and Trimming

Why Trim Reads?

Steps

  1. Download adapter sequences:
     curl -OL https://raw.githubusercontent.com/BioInfoTools/BBMap/master/resources/adapters.fa
    
  2. Trim reads using Trimmomatic:
     trimmomatic PE SRR522246_1.fastq SRR522246_2.fastq \
       trimmed_1.fastq unpaired_1.fastq \
       trimmed_2.fastq unpaired_2.fastq \
       ILLUMINACLIP:adapters.fa:2:30:10 \
       LEADING:20 TRAILING:20 AVGQUAL:20 MINLEN:20
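
To confirm that trimming behaved as configured, the read-length distribution of the output can be summarized; with MINLEN:20, no surviving read should be shorter than 20 bases. This is a sketch under the assumption of four-line FASTQ records, and read_length_stats is an illustrative helper name.

```shell
# Print read count, minimum, maximum and mean read length of a FASTQ file.
read_length_stats() {
  awk 'NR % 4 == 2 {
         len = length($0)
         if (n == 0 || len < min) min = len
         if (len > max) max = len
         sum += len
         n++
       }
       END {
         if (n > 0) printf "reads=%d min=%d max=%d mean=%.1f\n", n, min, max, sum / n
         else print "reads=0"
       }' "$1"
}

# Summarize each Trimmomatic output file that is present.
for f in trimmed_1.fastq trimmed_2.fastq unpaired_1.fastq unpaired_2.fastq; do
  if [ -f "$f" ]; then
    echo "$f: $(read_length_stats "$f")"
  fi
done
```

Comparing the reads= value of trimmed_1.fastq with the raw read count also shows how many pairs survived trimming.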
    

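Explanation of Parameters

    • PE : Paired-end mode; takes two input files and writes paired and unpaired output for each mate.
    • ILLUMINACLIP:adapters.fa:2:30:10 : Remove adapter sequences listed in adapters.fa, allowing 2 seed mismatches, with a palindrome clip threshold of 30 and a simple clip threshold of 10.
    • LEADING:20 / TRAILING:20 : Trim bases with quality below 20 from the start and end of each read.
    • AVGQUAL:20 : Drop reads whose average quality is below 20.
    • MINLEN:20 : Drop reads shorter than 20 bases after trimming.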


5. Genome Assembly

What is Genome Assembly?

Genome assembly reconstructs a genome by piecing together overlapping sequencing reads into longer contiguous sequences (contigs).

Steps

  1. Combine unpaired reads:
     cat unpaired_1.fastq unpaired_2.fastq > unpaired.fastq
    
  2. Assemble the genome using SPAdes:
     spades.py -k 21,33,55,77,99 --careful -o spades_output \
       -1 trimmed_1.fastq -2 trimmed_2.fastq -s unpaired.fastq
    

    Explanation of Parameters

    • -k 21,33,55,77,99 : Multiple k-mer sizes improve assembly accuracy.
    • --careful : Reduces errors such as mismatches and short indels.
    • -o : Output directory for assembly results.
    • -1/-2 : Input files containing forward and reverse paired-end reads.
    • -s : Unpaired reads to improve completeness.
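
SPAdes expects the values passed to -k to be odd, below 128, and listed in ascending order, and each k should also be shorter than the reads. The sketch below checks a k-mer list before launching a long assembly; check_kmers is a hypothetical helper, and the read length of 101 in the example call is an assumed value that should be replaced with the length FastQC reports for your data.

```shell
# Validate a comma-separated k-mer list: every k must be odd, below 128,
# shorter than the read length, and larger than the previous k.
check_kmers() {
  klist="$1"
  readlen="$2"
  status=0
  prev=0
  for k in $(printf '%s' "$klist" | tr ',' ' '); do
    if [ $((k % 2)) -eq 0 ]; then
      echo "k=$k: must be odd" >&2
      status=1
    fi
    if [ "$k" -ge 128 ]; then
      echo "k=$k: must be below 128" >&2
      status=1
    fi
    if [ "$k" -ge "$readlen" ]; then
      echo "k=$k: not shorter than read length $readlen" >&2
      status=1
    fi
    if [ "$k" -le "$prev" ]; then
      echo "k=$k: list must be in ascending order" >&2
      status=1
    fi
    prev=$k
  done
  return "$status"
}

# Example call; 101 is an assumed read length, not a measured one.
if check_kmers 21,33,55,77,99 101; then
  echo "k-mer list looks valid"
fi
```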

Note


6. Assembly Evaluation

Why Evaluate?

 esearch -db nucleotide -query NC_007493.2 | efetch -format fasta > ref_genome_NC_007493.fa
 esearch -db nucleotide -query NC_011958 | efetch -format fasta > ref_genome_NC_011958.fa
 quast -R ref_genome_NC_011958.fa spades_output/scaffolds.fasta
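
QUAST reports contiguity metrics such as N50, the length of the shortest sequence among the longest sequences that together cover at least half of the assembly. As a rough cross-check of the QUAST report, the same numbers can be computed with standard tools. This is a minimal sketch: fasta_stats is an illustrative helper, and unlike QUAST it applies no minimum contig length filter, so values may differ slightly.

```shell
# Print sequence count, total length, longest sequence and N50 of a FASTA file.
fasta_stats() {
  # First pass: emit one length per sequence (handles wrapped sequence lines).
  awk '/^>/ { if (len > 0) print len; len = 0; next }
       { gsub(/[ \t\r]/, ""); len += length($0) }
       END { if (len > 0) print len }' "$1" |
  sort -rn |
  # Second pass: walk lengths from longest to shortest until half is covered.
  awk '{ lens[NR] = $1; total += $1 }
       END {
         if (NR == 0) { print "sequences=0"; exit }
         half = total / 2
         for (i = 1; i <= NR; i++) {
           cum += lens[i]
           if (cum >= half) { n50 = lens[i]; break }
         }
         printf "sequences=%d total=%d longest=%d N50=%d\n", NR, total, lens[1], n50
       }'
}

# Summarize the SPAdes scaffolds if the assembly has been run.
if [ -f spades_output/scaffolds.fasta ]; then
  fasta_stats spades_output/scaffolds.fasta
fi
```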

Note

Key Metrics


7. Assembly Visualization

Why Visualize?

Steps

  1. Install Bandage.
  2. Load the Assembly Graph
    • Open Bandage and navigate to the main menu.
    • Click File > Load graph and select your final SPAdes assembly graph file:
      spades_output/assembly_graph.fastg.
  3. Visualize the Graph
    • After loading the graph, click the Draw graph button on the left side of the interface.
    • The visualization will display a clean assembly graph, showcasing the contiguity of your results.
  4. Compare with a Messier Assembly Graph
    • For comparison, load the graph generated using 21-mers:
      spades_output/K21/assembly_graph.fastg.
    • This graph will likely be less clean and show more fragmented or complex regions, demonstrating the impact of k-mer size on assembly quality.

8. Comparing Genomes

Why Compare?

Identify structural variations, conserved regions, and evolutionary differences.

Copy both the reference and your assembled scaffolds FASTA files, then compare them with Mauve.

Reference Genome: NC_011958

Steps

  1. Install Mauve.
  2. Open Mauve and Start Alignment
    • Launch Mauve, and from the main menu, select File > Align with progressiveMauve.
    • A pop-up window will appear to specify input genomes.
  3. Add Sequences: both the reference and your assembled scaffolds FASTA file
    • Click Add Sequence and select the NC_011958 reference genome files you downloaded (the .gb files), then click Open.
    • Click Add Sequence again, and this time select your assembled genome file:
      spades_output/scaffolds.fasta.
    • Click Open to add it to the alignment.
  4. Align and Visualize
    • Once all sequences are added, click Align to process the data.
    • Mauve will generate a visualization comparing the conservation of regions across the reference and assembled genomes.
    • Because the reference genomes are in GenBank format (.gb files), gene annotations will be included in the visualization, making it easier to identify functional regions.

9. Genome Annotation

Why Annotate?

Steps

  1. Use RAST for prokaryotic genome annotation:
    • Upload spades_output/scaffolds.fasta.
    • Select genetic code table 11 (bacterial genomes).
  2. Review the annotated genes and predicted functions.

Alternatives


10. Exercise

Exercise : I. Dataset

Exercise : II. Quality Visualization

mkdir qc
fastqc *.fastq -o qc/
multiqc .

Exercise : III. Quality Filtering and Trimming

curl -OL https://raw.githubusercontent.com/BioInfoTools/BBMap/master/resources/adapters.fa
trimmomatic PE SRR1553425_1.fastq SRR1553425_2.fastq trimmed_1.fastq unpaired_1.fastq trimmed_2.fastq unpaired_2.fastq ILLUMINACLIP:adapters.fa:2:30:10 LEADING:20 TRAILING:20 AVGQUAL:20 MINLEN:20

Exercise : IV. Genome Assembly

cat unpaired_1.fastq unpaired_2.fastq > unpaired.fastq
spades.py -k 21,33,55,77 --careful -o spades_output -1 trimmed_1.fastq -2 trimmed_2.fastq -s unpaired.fastq
# redo for k 21 only
spades.py -k 21 --careful -o spades_output_k21 -1 trimmed_1.fastq -2 trimmed_2.fastq -s unpaired.fastq

Exercise : V. Assembly Evaluation

# conda install -c bioconda entrez-direct
mamba install entrez-direct
esearch -db nucleotide -query NC_002549 | efetch -format fasta > ref_genome.fa

Exercise : VI. Assembly Visualization

  1. Load the Assembly Graph
    • Open Bandage and navigate to the main menu.
    • Click File > Load graph and select your final SPAdes assembly graph file:
      spades_output/assembly_graph.fastg.
  2. Visualize the Graph
    • After loading the graph, click the Draw graph button on the left side of the interface.
    • The visualization will display a clean assembly graph, showcasing the contiguity of your results.
  3. Compare with a Messier Assembly Graph
    • For comparison, load the graph generated using 21-mers:
      spades_output_k21/K21/assembly_graph.fastg.
    • This graph will likely be less clean and show more fragmented or complex regions, demonstrating the impact of k-mer size on assembly quality.

Exercise : VII. Comparing Genomes

  1. Open Mauve and Start Alignment
    • Launch Mauve, and from the main menu, select File > Align with progressiveMauve.
    • A pop-up window will appear to specify input genomes.
  2. Add Sequences
    • Click Add Sequence and select all the reference genome files you downloaded (the .gb files), then click Open.
    • Click Add Sequence again, and this time select your assembled genome file:
      spades_output/scaffolds.fasta.
    • Click Open to add it to the alignment.
  3. Align and Visualize
    • Once all sequences are added, click Align to process the data.
    • Mauve will generate a visualization comparing the conservation of regions across the reference and assembled genomes.
    • Because the reference genomes are in GenBank format (.gb files), gene annotations will be included in the visualization, making it easier to identify functional regions.
      • Download the reference from GenBank in GB format (figure: GenBank_Data).
      • Mauve result (figure: Mauve_result).

Exercise : VIII. Genome Annotation

  1. Annotate Genes in the Assembled Genome
    • The final step is to annotate possible genes in your assembled genome.
  2. Find Open Reading Frames (ORFs)
    • Use NCBI’s ORFfinder to identify open reading frames (ORFs). An ORF is a sequence of DNA that has the potential to code for a protein.
    • Open the spades_output/scaffolds.fasta file and copy the sequence of only the first contig.
    • Paste the sequence into ORFfinder and submit it with the default settings.
  3. Download ORF Results
    • After processing, locate the “Mark subset…” option in the bottom-right box.
    • Select “All ORFs” and click “Download marked set.”
    • By default, this will save the protein predictions in a FASTA file.
    • To download the coding sequences (CDS) instead, use the dropdown menu and switch from “Protein FASTA” to “CDS FASTA.”
  4. Predict Functions Using BLAST
    • Use the downloaded results as BLAST queries to predict gene functions (for example, blastp for the protein FASTA).
  5. Interpret Results
    • Review the BLAST outputs to identify putative gene functions, conserved domains, or homologous sequences in other organisms.
    • Use these annotations to better understand the biological roles of the genes in your assembled genome.
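
Step 2 above asks for the sequence of only the first contig. Rather than copying it by hand from a large file, it can be extracted on the command line. This is a small sketch; first_contig and the output file name first_contig.fasta are illustrative choices, not names used elsewhere in this guide.

```shell
# Print the first record of a FASTA file (header plus all of its sequence lines).
first_contig() {
  awk '/^>/ { n++ }      # count headers as they appear
       n == 1 { print }  # print everything belonging to the first record
       n > 1 { exit }    # stop at the second header' "$1"
}

# Write the first scaffold to its own file if the assembly exists.
if [ -f spades_output/scaffolds.fasta ]; then
  first_contig spades_output/scaffolds.fasta > first_contig.fasta
fi
```

The contents of first_contig.fasta can then be pasted into ORFfinder.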

11. Educational Notes

Key Takeaways

Quality Control and Filtering

Videos


Contributing

Contributions to improve this pipeline are welcome!