Genome Assembly Pipeline
This repository provides a detailed step-by-step guide for genome assembly using Rhodobacter sphaeroides dataset as the primary example. It covers key processes such as data preprocessing, genome assembly, evaluation, visualization, comparison, and annotation. Additionally, it includes a guide for genome assembly using a downsampled Zaire Ebolavirus dataset for rapid analysis.
Table of Contents
- Server Information
- Dataset
- Quality Visualization
- Quality Filtering and Trimming
- Genome Assembly
- Assembly Evaluation
- Assembly Visualization
- Comparing Genomes
- Genome Annotation
- Exercise
- Educational Notes
Prerequisites
Security Practices
- Avoid multiple failed login attempts to prevent account locking.
- Use strong passwords or SSH keys.
- Log out after completing tasks.
Installing Miniconda
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
source ~/.bashrc
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
Installing software
conda install mamba
mamba install sra-tools fastqc trimmomatic multiqc curl spades quast
2. Dataset
- Organism: Rhodobacter sphaeroides
- SRA ID: SRR522246
- Size: Paired-end reads (~4.6 GB each, total 9.2 GB).
- Time to download: > 15 minutes
Why This Dataset?
This dataset was chosen for its small genome size, high-quality sequencing data, and suitability for teaching genome assembly principles.
Download Dataset
- Create a directory for the assembly process:
mkdir -p assembly/rhodobacter
cd assembly/rhodobacter
- Use
fastq-dump
to download paired-end reads:
fastq-dump --split-files SRR522246
3. Quality Visualization
- Identify low-quality bases, adapter contamination, and sequencing artifacts.
- Ensure reliable downstream analysis.
Steps
- Create a directory for quality control results:
- Run FastQC to assess the quality of raw reads:
- Aggregate and summarize results using MultiQC:
- Transfer qc and multiqc results to your local machine and view HTML reports.
Key Metrics to Examine
- Per-base sequence quality: Should be high across all cycles.
- GC content: Compare with the expected range for the organism.
- Overrepresented sequences: Indicates adapter contamination.
4. Quality Filtering and Trimming
Why Trim Reads?
- Improves downstream processes by removing low-quality regions and adapter sequences.
Steps
- Download adapter sequences:
curl -OL https://raw.githubusercontent.com/BioInfoTools/BBMap/master/resources/adapters.fa
- Trim reads using Trimmomatic:
trimmomatic PE SRR522246_1.fastq SRR522246_2.fastq \
trimmed_1.fastq unpaired_1.fastq \
trimmed_2.fastq unpaired_2.fastq \
ILLUMINACLIP:adapters.fa:2:30:10 \
LEADING:20 TRAILING:20 AVGQUAL:20 MINLEN:20
Explanation of Parameters
- PE - Specifies that the input data consists of paired-end reads (two FASTQ files, one for each read direction).
- SRR522246_1.fastq and SRR522246_2.fastq - The input files contain raw reads for forward (_1) and reverse (_2) sequences, respectively.
- trimmed_1.fastq and trimmed_2.fastq - Output files containing trimmed paired reads for forward and reverse sequences.
- unpaired_1.fastq and unpaired_2.fastq - Output files containing unpaired reads that are discarded from the paired-end alignment due to trimming.
- ILLUMINACLIP:adapters.fa:2:30:10 - Removes adapter sequences from reads.
- adapters.fa: The file containing adapter sequences to be clipped.
- 2: Maximum mismatch count allowed between the adapter and the read.
- 30: The minimum score for the adapter match to retain the read.
- 10: The length of the initial seed match for adapter removal.
- LEADING:20 - Trims the low-quality bases if quality score falls below 20 from the start of a read.
- TRAILING:20 - Trims the low-quality bases if quality score falls below 20 from the end of a read.
- AVGQUAL:20 - Discards the entire read if the average quality score across the read falls below 20.
- MINLEN:20 - Discards reads that are shorter than 20 bases after trimming.
5. Genome Assembly
What is Genome Assembly?
Reconstructing a genome by piecing together sequencing reads.
Steps
- Combine unpaired reads:
cat unpaired_1.fastq unpaired_2.fastq > unpaired.fastq
- Assemble the genome using SPAdes:
spades.py -k 21,33,55,77,99 --careful -o spades_output \
-1 trimmed_1.fastq -2 trimmed_2.fastq -s unpaired.fastq
Explanation of Parameters
- -k 21,33,55,77,99 : Multiple k-mers improve assembly accuracy.
- -careful : Flag reduces errors like mismatches and indels.
- -o : Output directory for assembly results.
- -1/-2 : Input files containing forward and reverse paired-end reads.
- -s : Unpaired reads to improve completeness.
Note
- SPAdes uses multiple k-mers for robust assembly. Using multiple k-mer sizes allows it to leverage the advantages of both small and large k-mers, resulting in better handling of repetitive regions and longer contigs.
6. Assembly Evaluation
Why Evaluate?
esearch -db nucleotide -query NC_007493.2 | efetch -format fasta > ref_genome_NC_007493.fa
esearch -db nucleotide -query NC_011958 | efetch -format fasta > ref_genome_NC_011958.fa
quast -R ref_genome_NC_011958.fa spades_output/scaffolds.fasta
Note
- Copy the results folder locally via FileZilla and check out the HTML report
- Rhodobacter sphaeroides Reference genome consists of 2 chromosomes and 5 plasmids
Key Metrics
- N50: Larger values indicate better assembly contiguity.
- Misassemblies: Highlight potential errors in scaffolding.
7. Assembly Visualization
Why Visualize?
- Understand the structure of the assembly and detect misassemblies.
Steps
- Install Bandage.
- Load the Assembly Graph
- Open Bandage and navigate to the main menu.
- Click File > Load graph and select your final SPAdes assembly graph file:
spades_output/assembly_graph.fastg
.
- Visualize the Graph
- After loading the graph, click the Draw graph button on the left side of the interface.
- The visualization will display a clean assembly graph, showcasing the contiguity of your results.
- Compare with a Messier Assembly Graph
- For comparison, load the graph generated using 21-mers:
spades_output/K21/assembly_graph.fastg
.
- This graph will likely be less clean and show more fragmented or complex regions, demonstrating the impact of k-mer size on assembly quality.
8. Comparing Genomes
Why Compare?
Identify structural variations, conserved regions, and evolutionary differences.
Copy both reference and your assembled scaffolds fasta files. Compare them with Mauve
Steps
- Install Mauve.
- Open Mauve and Start Alignment
- Launch Mauve, and from the main menu, select File > Align with progressiveMauve.
- A pop-up window will appear to specify input genomes.
- Add Sequences - Both reference and your assembled scaffolds fasta files
- Click Add Sequence and select all the reference genome
Ref: NC_011958
files you downloaded (the .gb files), then click Open.
- Click Add Sequence again, and this time select your assembled genome file:
spades_output/scaffolds.fasta
.
- Click Open to add it to the alignment.
- Align and Visualize
- Once all sequences are added, click Align to process the data.
- Mauve will generate a visualization comparing the conservation of regions across the reference and assembled genomes.
- Because the reference genomes are in GenBank format (.gb files), gene annotations will be included in the visualization, making it easier to identify functional regions.
9. Genome Annotation
Why Annotate?
- Predict genes, their functions, and associated pathways.
Steps
- Use RAST for prokaryotic genome annotation:
- Upload
spades_output/scaffolds.fasta
.
- Select genetic code table 11 (bacterial genomes).
- Review the annotated genes and predicted functions.
Alternatives
- Prokka: Command-line tool for rapid genome annotation.
- ORFfinder: Predicts open reading frames (ORFs).
10. Exercise
Exercise : I. Dataset
- Dataset Information: Zaire Ebolavirus
- Source: NCBI Sequence Read Archive (SRA)
- SRA ID: SRR1553425
- Details: Paired-end sequencing data from the 2014 Zaire ebolavirus outbreak in Sierra Leone.
- Subset: Use only 100,000 paired-end reads for faster processing:
mkdir -p assembly/ebola
cd assembly/ebola
fastq-dump --split-files -X 100000 SRR1553425
Exercise : II. Quality Visualization
mkdir qc
fastqc *.fastq -o qc/
multiqc .
- Transfer qc and multiqc results to your local machine via FileZilla and view HTML report
Exercise : III. Quality Filtering and Trimming
curl –OL https://raw.githubusercontent.com/BioInfoTools/BBMap/master/resources/adapters.fa > adapters.fa
trimmomatic PE SRR1553425_1.fastq SRR1553425_2.fastq trimmed_1.fastq unpaired_1.fastq trimmed_2.fastq unpaired_2.fastq ILLUMINACLIP:adapters.fa:2:30:10 LEADING:20 TRAILING:20 AVGQUAL:20 MINLEN:20
Exercise : IV. Genome Assembly
cat unpaired_1.fastq unpaired_2.fastq > unpaired.fastq
spades.py -k 21,33,55,77 --careful -o spades_output -1 trimmed_1.fastq -2 trimmed_2.fastq -s unpaired.fastq
# redo for k 21 only
spades.py -k 21 --careful -o spades_output_k21 -1 trimmed_1.fastq -2 trimmed_2.fastq -s unpaired.fastq
Exercise : V. Assembly Evaluation
- Download Reference Genomes
- Obtain the Ebola virus reference genomes in GenBank format (.gb files) for comparison:
Zaire ebolavirus isolate Ebola virus NC_002549.1 GenBank file
# conda install -c bioconda entrez-direct
mamba install entrez-direct
esearch -db nucleotide -query NC_002549 | efetch -format fasta > ref_genome.fa
- Evaluate the assembly using QUAST
quast -s -R ref_genome.fa spades_output/scaffolds.fasta
- Copy the results folder locally via FileZilla and check out the HTML report
Exercise : VI. Assembly Visualization
- Load the Assembly Graph
- Open Bandage and navigate to the main menu.
- Click File > Load graph and select your final SPAdes assembly graph file:
spades_output/assembly_graph.fastg
.
- Visualize the Graph
- After loading the graph, click the Draw graph button on the left side of the interface.
- The visualization will display a clean assembly graph, showcasing the contiguity of your results.
- Compare with a Messier Assembly Graph
- For comparison, load the graph generated using 21-mers:
spades_output_k21/K21/assembly_graph.fastg
.
- This graph will likely be less clean and show more fragmented or complex regions, demonstrating the impact of k-mer size on assembly quality.
Exercise : VII. Comparing Genome
- Open Mauve and Start Alignment
- Launch Mauve, and from the main menu, select File > Align with progressiveMauve.
- A pop-up window will appear to specify input genomes.
- Add Sequences
- Click Add Sequence and select all the reference genome files you downloaded (the .gb files), then click Open.
- Click Add Sequence again, and this time select your assembled genome file:
spades_output/scaffolds.fasta
.
- Click Open to add it to the alignment.
- Align and Visualize
- Once all sequences are added, click Align to process the data.
- Mauve will generate a visualization comparing the conservation of regions across the reference and assembled genomes.
- Because the reference genomes are in GenBank format (.gb files), gene annotations will be included in the visualization, making it easier to identify functional regions.
- Download from GenBank in GB format

- Mauve Result

Exercise : VIII. Genome Annotation
- Annotate Genes in the Assembled Genome
- The final step is to annotate possible genes in your assembled genome.
- Find Open Reading Frames (ORFs)
- Use NCBI’s ORFfinder to identify open reading frames (ORFs). An ORF is a sequence of DNA that has the potential to code for a protein.
- Open the
spades_output/scaffolds.fasta
file and copy the sequence of only the first contig.
- Paste the sequence into ORFfinder and submit it with the default settings.
- Download ORF Results
- After processing, locate the “Mark subset…” option in the bottom-right box.
- Select “All ORFs” and click “Download marked set.”
- By default, this will save the protein predictions in a FASTA file.
- To download the coding sequences (CDS) instead, use the dropdown menu and switch from “Protein FASTA” to “CDS FASTA.”
- Predict Functions Using BLAST
- Use the downloaded results to predict gene functions:
- Interpret Results
- Review the BLAST outputs to identify putative gene functions, conserved domains, or homologous sequences in other organisms.
- Use these annotations to better understand the biological roles of the genes in your assembled genome.
11. Educational Notes
Key Takeaways
Quality Control and Filtering
- Base Quality Assessment (FastQC): Ensures high-quality sequencing data to improve genome assembly.
- Trimming Reads: Eliminates contaminants, low-quality bases, and adapter sequences while preserving critical information.
Genome Assembly Insights
- Selecting Optimal K-mer Sizes: Using multiple k-mers enhances assembly accuracy by capturing short and long sequence overlaps.
- Assembly Quality Metrics: High N50, and comprehensive coverage indicate a well-assembled genome.
Genome Annotation Best Practices
- BLAST-based Functional Annotation: Use BLASTn for nucleotide sequences and BLASTp for protein sequences to predict gene functions.
- Alternative Annotation Tools: Prokka, RAST, ORFfinder, EggNOG-mapper, InterProScan
Visualization Techniques
- Bandage: Visualize genome assembly graphs to identify contigs, misassemblies, and scaffold connections.
- Mauve: Perform whole-genome alignments to compare structural variations and conserved genomic regions.
- IGV (Integrative Genomics Viewer): View read alignments, gene annotations, and genomic variants.
Videos
Contributing
Contributions to improve this pipeline are welcome!