DNA-seq_Genome_Assembly

Genome Assembly Pipeline

This repository provides a detailed step-by-step guide for genome assembly using Rhodobacter sphaeroides dataset as the primary example. It covers key processes such as data preprocessing, genome assembly, evaluation, visualization, comparison, and annotation. Additionally, it includes a guide for genome assembly using a downsampled Zaire Ebolavirus dataset for rapid analysis.

Server Information
Dataset
Quality Visualization
Quality Filtering and Trimming
Genome Assembly
Assembly Evaluation
Assembly Visualization
Comparing Genomes
Genome Annotation
Exercise
Educational Notes

1. Server Information

Prerequisites

Tools: Install PuTTY, FileZilla, Bandage, Mauve.
Server Details:
- IP Address: 168.105.161.70
- Port: 22
- Access: Requires JABSOM or UH network (use VPN for remote access).
- Note: With PuTTY and FileZilla you can connect to server.

Security Practices

Avoid multiple failed login attempts to prevent account locking.
Use strong passwords or SSH keys.
Log out after completing tasks.

Installing Miniconda

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
source ~/.bashrc
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

Installing software

conda install mamba
mamba install sra-tools fastqc trimmomatic multiqc curl spades quast

2. Dataset

Dataset Information

Organism: Rhodobacter sphaeroides
SRA ID: SRR522246
Size: Paired-end reads (~4.6 GB each, total 9.2 GB).
Time to download: > 15 minutes

Why This Dataset?

This dataset was chosen for its small genome size, high-quality sequencing data, and suitability for teaching genome assembly principles.

Download Dataset

Create a directory for the assembly process:

 mkdir -p assembly/rhodobacter
 cd assembly/rhodobacter

Use fastq-dump to download paired-end reads:
```
 fastq-dump --split-files SRR522246
```

3. Quality Visualization

Why Perform Quality Control?

Identify low-quality bases, adapter contamination, and sequencing artifacts.
Ensure reliable downstream analysis.

Steps

Create a directory for quality control results:
```
 mkdir -p qc
```
Run FastQC to assess the quality of raw reads:
```
 fastqc *.fastq -o qc/
```
Aggregate and summarize results using MultiQC:
```
 multiqc qc/
```
- Transfer qc and multiqc results to your local machine and view HTML reports.

Key Metrics to Examine

Per-base sequence quality: Should be high across all cycles.
GC content: Compare with the expected range for the organism.
Overrepresented sequences: Indicates adapter contamination.

4. Quality Filtering and Trimming

Why Trim Reads?

Improves downstream processes by removing low-quality regions and adapter sequences.

Steps

Download adapter sequences:

 curl -OL https://raw.githubusercontent.com/BioInfoTools/BBMap/master/resources/adapters.fa

Trim reads using Trimmomatic:

 trimmomatic PE SRR522246_1.fastq SRR522246_2.fastq \
   trimmed_1.fastq unpaired_1.fastq \
   trimmed_2.fastq unpaired_2.fastq \
   ILLUMINACLIP:adapters.fa:2:30:10 \
   LEADING:20 TRAILING:20 AVGQUAL:20 MINLEN:20

Explanation of Parameters

PE - Specifies that the input data consists of paired-end reads (two FASTQ files, one for each read direction).
SRR522246_1.fastq and SRR522246_2.fastq - The input files contain raw reads for forward (_1) and reverse (_2) sequences, respectively.
trimmed_1.fastq and trimmed_2.fastq - Output files containing trimmed paired reads for forward and reverse sequences.
unpaired_1.fastq and unpaired_2.fastq - Output files containing unpaired reads that are discarded from the paired-end alignment due to trimming.
ILLUMINACLIP:adapters.fa:2:30:10 - Removes adapter sequences from reads.
- adapters.fa: The file containing adapter sequences to be clipped.
- 2: Maximum mismatch count allowed between the adapter and the read.
- 30: The minimum score for the adapter match to retain the read.
- 10: The length of the initial seed match for adapter removal.
LEADING:20 - Trims the low-quality bases if quality score falls below 20 from the start of a read.
TRAILING:20 - Trims the low-quality bases if quality score falls below 20 from the end of a read.
AVGQUAL:20 - Discards the entire read if the average quality score across the read falls below 20.
MINLEN:20 - Discards reads that are shorter than 20 bases after trimming.

5. Genome Assembly

What is Genome Assembly?

Reconstructing a genome by piecing together sequencing reads.

Steps

Combine unpaired reads:

 cat unpaired_1.fastq unpaired_2.fastq > unpaired.fastq

Assemble the genome using SPAdes:
```
 spades.py -k 21,33,55,77,99 --careful -o spades_output \
   -1 trimmed_1.fastq -2 trimmed_2.fastq -s unpaired.fastq
```
Explanation of Parameters
- -k 21,33,55,77,99 : Multiple k-mers improve assembly accuracy.
- -careful : Flag reduces errors like mismatches and indels.
- -o : Output directory for assembly results.
- -1/-2 : Input files containing forward and reverse paired-end reads.
- -s : Unpaired reads to improve completeness.

Note

SPAdes uses multiple k-mers for robust assembly. Using multiple k-mer sizes allows it to leverage the advantages of both small and large k-mers, resulting in better handling of repetitive regions and longer contigs.

6. Assembly Evaluation

Why Evaluate?

Assess the quality, contiguity, and completeness of the assembly.

Note install the software only if you have not downloaded earlier.

# conda install -c bioconda entrez-direct
# OR
# mamba install entrez-direct

Download Reference genome
- Organism: Cereibacter sphaeroides 2.4.1
- Organism: Cereibacter sphaeroides KD131
- Above are different strains of the same species (Rhodobacter sphaeroides), indicating possible genomic variation due to strain-specific adaptations.

 esearch -db nucleotide -query NC_007493.2 | efetch -format fasta > ref_genome_NC_007493.fa
 esearch -db nucleotide -query NC_011958 | efetch -format fasta > ref_genome_NC_011958.fa
 quast -R ref_genome_NC_011958.fa spades_output/scaffolds.fasta

Note

Copy the results folder locally via FileZilla and check out the HTML report
Rhodobacter sphaeroides Reference genome consists of 2 chromosomes and 5 plasmids

Key Metrics

N50: Larger values indicate better assembly contiguity.
Misassemblies: Highlight potential errors in scaffolding.

7. Assembly Visualization

Why Visualize?

Understand the structure of the assembly and detect misassemblies.

Steps

Install Bandage.
Load the Assembly Graph
- Open Bandage and navigate to the main menu.
- Click File > Load graph and select your final SPAdes assembly graph file:
  spades_output/assembly_graph.fastg.
Visualize the Graph
- After loading the graph, click the Draw graph button on the left side of the interface.
- The visualization will display a clean assembly graph, showcasing the contiguity of your results.
Compare with a Messier Assembly Graph
- For comparison, load the graph generated using 21-mers:
  spades_output/K21/assembly_graph.fastg.
- This graph will likely be less clean and show more fragmented or complex regions, demonstrating the impact of k-mer size on assembly quality.

8. Comparing Genomes

Why Compare?

Identify structural variations, conserved regions, and evolutionary differences.

Copy both reference and your assembled scaffolds fasta files. Compare them with Mauve

Reference Genome: NC_011958

Steps

Install Mauve.
Open Mauve and Start Alignment
- Launch Mauve, and from the main menu, select File > Align with progressiveMauve.
- A pop-up window will appear to specify input genomes.
Add Sequences - Both reference and your assembled scaffolds fasta files
- Click Add Sequence and select all the reference genome Ref: NC_011958 files you downloaded (the .gb files), then click Open.
- Click Add Sequence again, and this time select your assembled genome file:
  spades_output/scaffolds.fasta.
- Click Open to add it to the alignment.
Align and Visualize
- Once all sequences are added, click Align to process the data.
- Mauve will generate a visualization comparing the conservation of regions across the reference and assembled genomes.
- Because the reference genomes are in GenBank format (.gb files), gene annotations will be included in the visualization, making it easier to identify functional regions.

9. Genome Annotation

Why Annotate?

Predict genes, their functions, and associated pathways.

Steps

Use RAST for prokaryotic genome annotation:
- Upload spades_output/scaffolds.fasta.
- Select genetic code table 11 (bacterial genomes).
Review the annotated genes and predicted functions.

Alternatives

Prokka: Command-line tool for rapid genome annotation.
ORFfinder: Predicts open reading frames (ORFs).

10. Exercise

Exercise : I. Dataset

Dataset Information: Zaire Ebolavirus
Source: NCBI Sequence Read Archive (SRA)
SRA ID: SRR1553425
Details: Paired-end sequencing data from the 2014 Zaire ebolavirus outbreak in Sierra Leone.

Subset: Use only 100,000 paired-end reads for faster processing:

  mkdir -p assembly/ebola
  cd assembly/ebola
  fastq-dump --split-files -X 100000 SRR1553425

Exercise : II. Quality Visualization

mkdir qc
fastqc *.fastq -o qc/
multiqc .

Transfer qc and multiqc results to your local machine via FileZilla and view HTML report

Exercise : III. Quality Filtering and Trimming

curl –OL https://raw.githubusercontent.com/BioInfoTools/BBMap/master/resources/adapters.fa > adapters.fa
trimmomatic PE SRR1553425_1.fastq SRR1553425_2.fastq trimmed_1.fastq unpaired_1.fastq trimmed_2.fastq unpaired_2.fastq ILLUMINACLIP:adapters.fa:2:30:10 LEADING:20 TRAILING:20 AVGQUAL:20 MINLEN:20

Exercise : IV. Genome Assembly

cat unpaired_1.fastq unpaired_2.fastq > unpaired.fastq
spades.py -k 21,33,55,77 --careful -o spades_output -1 trimmed_1.fastq -2 trimmed_2.fastq -s unpaired.fastq
# redo for k 21 only
spades.py -k 21 --careful -o spades_output_k21 -1 trimmed_1.fastq -2 trimmed_2.fastq -s unpaired.fastq

Exercise : V. Assembly Evaluation

Download Reference Genomes
- Obtain the Ebola virus reference genomes in GenBank format (.gb files) for comparison:
  Zaire ebolavirus isolate Ebola virus NC_002549.1 GenBank file

# conda install -c bioconda entrez-direct
mamba install entrez-direct
esearch -db nucleotide -query NC_002549 | efetch -format fasta > ref_genome.fa

Evaluate the assembly using QUAST

 quast -s -R ref_genome.fa spades_output/scaffolds.fasta

Copy the results folder locally via FileZilla and check out the HTML report

Exercise : VI. Assembly Visualization

Load the Assembly Graph
- Open Bandage and navigate to the main menu.
- Click File > Load graph and select your final SPAdes assembly graph file:
  spades_output/assembly_graph.fastg.
Visualize the Graph
- After loading the graph, click the Draw graph button on the left side of the interface.
- The visualization will display a clean assembly graph, showcasing the contiguity of your results.
Compare with a Messier Assembly Graph
- For comparison, load the graph generated using 21-mers:
  spades_output_k21/K21/assembly_graph.fastg.
- This graph will likely be less clean and show more fragmented or complex regions, demonstrating the impact of k-mer size on assembly quality.

Exercise : VII. Comparing Genome

Open Mauve and Start Alignment
- Launch Mauve, and from the main menu, select File > Align with progressiveMauve.
- A pop-up window will appear to specify input genomes.
Add Sequences
- Click Add Sequence and select all the reference genome files you downloaded (the .gb files), then click Open.
- Click Add Sequence again, and this time select your assembled genome file:
  spades_output/scaffolds.fasta.
- Click Open to add it to the alignment.
Align and Visualize
- Once all sequences are added, click Align to process the data.
- Mauve will generate a visualization comparing the conservation of regions across the reference and assembled genomes.
- Because the reference genomes are in GenBank format (.gb files), gene annotations will be included in the visualization, making it easier to identify functional regions.
  - Download from GenBank in GB format
  - Mauve Result

Exercise : VIII. Genome Annotation

Annotate Genes in the Assembled Genome
- The final step is to annotate possible genes in your assembled genome.
Find Open Reading Frames (ORFs)
- Use NCBI’s ORFfinder to identify open reading frames (ORFs). An ORF is a sequence of DNA that has the potential to code for a protein.
- Open the spades_output/scaffolds.fasta file and copy the sequence of only the first contig.
- Paste the sequence into ORFfinder and submit it with the default settings.
Download ORF Results
- After processing, locate the “Mark subset…” option in the bottom-right box.
- Select “All ORFs” and click “Download marked set.”
- By default, this will save the protein predictions in a FASTA file.
- To download the coding sequences (CDS) instead, use the dropdown menu and switch from “Protein FASTA” to “CDS FASTA.”
Predict Functions Using BLAST
- Use the downloaded results to predict gene functions:
  - For CDS regions, use NCBI’s Nucleotide BLAST (BLASTn).
  - For protein sequences, use NCBI’s Protein BLAST (BLASTp). Align your sequences against known databases (such as Swiss-Prot or NR (Non-redundent protein database) ) to provide functional annotations for the identified ORFs.
Interpret Results
- Review the BLAST outputs to identify putative gene functions, conserved domains, or homologous sequences in other organisms.
- Use these annotations to better understand the biological roles of the genes in your assembled genome.

11. Educational Notes

Key Takeaways

Quality Control and Filtering

Base Quality Assessment (FastQC): Ensures high-quality sequencing data to improve genome assembly.
Trimming Reads: Eliminates contaminants, low-quality bases, and adapter sequences while preserving critical information.
Genome Assembly Insights
Selecting Optimal K-mer Sizes: Using multiple k-mers enhances assembly accuracy by capturing short and long sequence overlaps.
Assembly Quality Metrics: High N50, and comprehensive coverage indicate a well-assembled genome.
Genome Annotation Best Practices
BLAST-based Functional Annotation: Use BLASTn for nucleotide sequences and BLASTp for protein sequences to predict gene functions.
Alternative Annotation Tools: Prokka, RAST, ORFfinder, EggNOG-mapper, InterProScan
Visualization Techniques
Bandage: Visualize genome assembly graphs to identify contigs, misassemblies, and scaffold connections.
Mauve: Perform whole-genome alignments to compare structural variations and conserved genomic regions.
IGV (Integrative Genomics Viewer): View read alignments, gene annotations, and genomic variants.

DNA-seq_Genome_Assembly

Genome Assembly Pipeline

Table of Contents

1. Server Information

Prerequisites

Security Practices

Installing Miniconda

Installing software

2. Dataset

Dataset Information

Why This Dataset?

Download Dataset

3. Quality Visualization

Why Perform Quality Control?

Steps

Key Metrics to Examine

4. Quality Filtering and Trimming

Why Trim Reads?

Steps

Explanation of Parameters

5. Genome Assembly

What is Genome Assembly?

Steps

Explanation of Parameters

Note

6. Assembly Evaluation

Why Evaluate?

Note

Key Metrics

7. Assembly Visualization

Why Visualize?

Steps

8. Comparing Genomes

Why Compare?

Reference Genome: NC_011958

Steps

9. Genome Annotation

Why Annotate?

Steps

Alternatives

10. Exercise

Exercise : I. Dataset

Exercise : II. Quality Visualization

Exercise : III. Quality Filtering and Trimming

Exercise : IV. Genome Assembly

Exercise : V. Assembly Evaluation

Exercise : VI. Assembly Visualization

Exercise : VII. Comparing Genome

Exercise : VIII. Genome Annotation

11. Educational Notes

Key Takeaways

Quality Control and Filtering

Genome Assembly Insights

Genome Annotation Best Practices

Visualization Techniques

Videos

Contributing

Contributions to improve this pipeline are welcome!