Lesson 3: Genome Assembly Foundations
1. What is Genome Assembly?
Section titled “1. What is Genome Assembly?”Genome assembly is the process of reconstructing a genome sequence from smaller sequencing reads.
A sequencer does not usually output one full chromosome as a single read. Instead, it produces many overlapping pieces (reads). Assembly algorithms use these overlaps to rebuild longer sequences called contigs and scaffolds.
In practical terms, assembly answers this question:
How do we go from thousands or millions of short/long fragments to a biologically meaningful genome sequence?
2. Why Assembly Is Important in Bioinformatics
Section titled “2. Why Assembly Is Important in Bioinformatics”Genome assembly is important because many downstream analyses depend on having a good genome representation.
- It helps discover unknown pathogens or strains when no perfect reference exists.
- It supports gene discovery and annotation in non-model organisms.
- It enables structural variant detection (insertions, deletions, rearrangements).
- It provides a foundation for comparative genomics and evolutionary studies.
- It improves interpretation of functional omics data (RNA-seq, metagenomics) by giving correct genomic context.
A poor assembly can mislead all later analysis steps. For this reason, assembly quality is not optional; it is central to reliable biological interpretation.
3. Core Concepts You Must Know
Section titled “3. Core Concepts You Must Know”3.1 Reads, Contigs, and Scaffolds
Section titled “3.1 Reads, Contigs, and Scaffolds”- Reads are raw sequences from the sequencer.
- Contigs are contiguous sequences assembled from overlapping reads.
- Scaffolds are linked contigs with estimated gaps (often using read-pair or long-range information).
3.2 Coverage (Depth)
Section titled “3.2 Coverage (Depth)”Coverage tells us how many times a genomic region is represented by reads.
Approximate formula:
Coverage = (Number_of_reads x Read_length) / Genome_size
Higher coverage usually improves assembly confidence, but very uneven coverage can still cause assembly gaps or errors.
3.3 Repeats
Section titled “3.3 Repeats”Repeated regions are hard to assemble because many genomic positions look similar.
- Short reads struggle more with long repeats.
- Long reads often bridge repeats better and improve contiguity.
4. De novo vs Reference-Based Assembly
Section titled “4. De novo vs Reference-Based Assembly”4.1 De novo Assembly
Section titled “4.1 De novo Assembly”De novo assembly reconstructs a genome from reads alone, without aligning to an existing reference.
When to use it:
- No suitable reference genome exists.
- You expect novel sequence not present in existing references.
- You want unbiased reconstruction.
Strengths:
- Can reveal new sequences and genome rearrangements.
- Avoids reference bias.
Limitations:
- Computationally demanding.
- Sensitive to read quality and coverage.
- Harder in complex or highly repetitive genomes.
4.2 Reference-Based (Reference-Guided) Assembly
Section titled “4.2 Reference-Based (Reference-Guided) Assembly”Reference-based assembly aligns reads to a known reference genome and reconstructs sample sequence relative to it.
When to use it:
- A high-quality, closely related reference is available.
- You mainly want consensus sequence and variants.
- You need a faster and often simpler workflow.
Strengths:
- Efficient and reproducible for known organisms.
- Strong for SNP/indel calling and consensus generation.
Limitations:
- Can miss novel insertions or major rearrangements.
- Introduces reference bias if the sample differs strongly from the reference.
5. Software Choices for Short vs Long Reads
Section titled “5. Software Choices for Short vs Long Reads”Different read types require different algorithmic strategies.
5.1 Short-Read Assembly Tools (Illumina)
Section titled “5.1 Short-Read Assembly Tools (Illumina)”Common options:
- SPAdes
- MEGAHIT
- Velvet (older but important historically)
Why these work for short reads:
They often use de Bruijn graph methods, which are efficient for massive numbers of accurate short reads.
5.2 Long-Read Assembly Tools (Oxford Nanopore / PacBio)
Section titled “5.2 Long-Read Assembly Tools (Oxford Nanopore / PacBio)”Common options:
- Flye
- Canu
- Raven
- wtdbg2
Why these work for long reads:
They are designed to handle higher error rates and long overlaps, which improves assembly across repeats.
5.3 Hybrid Assembly
Section titled “5.3 Hybrid Assembly”Hybrid assembly combines short-read accuracy with long-read contiguity.
Common options:
- Unicycler
- hybridSPAdes
This is often useful when both read types are available and you need strong accuracy plus long contigs.
6. High-Level Assembly Pipeline (Overview)
Section titled “6. High-Level Assembly Pipeline (Overview)”This is a general assembly workflow you can adapt by project.
- Organize files and project folders.
- Inspect raw reads (
head,tail, read counts). - Perform read quality control and contamination filtering.
- Run assembly (de novo or reference-guided).
- Assess assembly quality (N50, total length, number of contigs, completeness).
- Assign biological identity (for example, BLAST on contigs).
- Document software versions, parameters, and outputs for reproducibility.
7. File Organization Pattern for Assembly Projects
Section titled “7. File Organization Pattern for Assembly Projects”Good organization prevents errors and makes your analysis reproducible.
Use clear subfolders for input, intermediate files, and final outputs.
cd Training/long_readsmkdir -p denovo/{raw,clean,assembly,blast,logs,reference}8. Quick Data Inspection Using Existing Training Files
Section titled “8. Quick Data Inspection Using Existing Training Files”Before assembling, verify what files are available and inspect read structure.
cd Training/long_readsls -lh
head -n 8 barcode57.fastqhead -n 8 barcode58.fastq
tail -n 8 barcode57.fastqtail -n 8 barcode58.fastqCount total lines and estimate number of reads (FASTQ uses 4 lines per read).
wc -l barcode57.fastq barcode58.fastq9. Key Take-Home Messages
Section titled “9. Key Take-Home Messages”- Assembly is the bridge between raw sequencing output and interpretable genomes.
- De novo and reference-based assembly solve different biological questions.
- Tool choice depends strongly on read type and study objective.
- Strong folder structure and clear command history are as important as the assembler itself.
In the next lessons, you will run a complete long-read de novo pipeline and then a reference-guided workflow.