Lesson 4 — Abstraction and Algorithm Design
Learning Objectives
By the end of this lesson, you will be able to:
- Identify which details are essential and which are irrelevant for a given bioinformatics problem
- Explain the costs and benefits of different levels of abstraction
- Evaluate multiple algorithms for the same problem using structured trade-off analysis
- Apply the SACRED framework (Speed, Accuracy, Cost, Robustness, Ease, Data suitability) to algorithm selection
- Make and justify algorithm choices under realistic biological and computational constraints
- Recognise when an abstraction has failed and how to revise it
Part 1: Abstraction in Depth
1.1 Levels of Abstraction
Every bioinformatics analysis involves multiple simultaneous levels of abstraction. Understanding which level you are working at — and which levels you are ignoring — is fundamental to correct interpretation.
Consider a single RNA-Seq experiment:
| Level | What is real | What we abstract to |
|---|---|---|
| Molecular | Individual mRNA molecules in a cell | A count of transcript copies |
| Cellular | Heterogeneous cell population | A “bulk” average signal |
| Technical | Probabilistic basecalling, PCR amplification bias | A quality-filtered read |
| Statistical | Continuous expression distribution | A discrete count per gene |
| Biological | Complex gene regulatory networks | A list of differentially expressed genes |
At each level, you discard information to make the problem tractable. The key question at each step: is what I am discarding relevant to my biological question?
If your question is “which genes are differentially expressed on average across a tissue?”, bulk RNA-Seq is an appropriate abstraction. If your question is “do all cells respond uniformly, or is there a rare responding subpopulation?”, bulk RNA-Seq is the wrong abstraction — you need single-cell RNA-Seq, which maintains cellular resolution.
1.2 Case Study: What Details Matter in Genome Assembly?
Scenario: You are assembling a begomovirus genome (~2.7 kb per DNA component) from Nanopore reads.
Details that matter:
| Detail | Why it matters |
|---|---|
| Read length distribution | Determines whether reads can span the full genome |
| Error rate of the reads | Determines polishing requirements and assembler choice |
| Expected genome size | Sets coverage requirements and helps validate assembly |
| Genome circularity | Begomoviruses have circular genomes; most assemblers assume linear — must check |
| Host proportion in sample | Directly affects effective viral coverage after host removal |
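To see how expected genome size feeds into coverage requirements, here is a minimal sketch; the file name and the 2,700 bp genome size are illustrative values, not project-specific ones:

```bash
# Estimated raw coverage = total sequenced bases / expected genome size
awk 'NR % 4 == 2 { bases += length($0) }
     END { printf "%.0fx estimated coverage\n", bases / 2700 }' viral_reads.fastq
```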
Details that can be safely abstracted away at assembly stage:
| Detail | Why it can be ignored |
|---|---|
| Exact basecalling model used | Relevant at QC, but by assembly time reads are already called |
| DNA extraction protocol | Affects yield (relevant earlier) but not assembler input logic |
| Which primers were used for library prep | Relevant for adapter trimming (already done); irrelevant for assembly |
| Sample collection date | Epidemiologically important, but not relevant to assembly algorithm |
| The researcher’s name | Not relevant at any computational step |
The key insight: “Can be safely ignored” depends entirely on the question. The DNA extraction protocol is irrelevant for assembly, but essential for troubleshooting if yields were unexpectedly low. Always ask: “ignored for what purpose?”
1.3 Case Study: Abstractions in Reference-Based Assembly vs. De Novo
When you choose between reference-based and de novo assembly, you are choosing between two different abstractions of the same biological reality.
Reference-based assembly abstracts the genome as: “A set of positions along a known coordinate system (the reference), where some positions may differ from the reference.”
This abstraction is powerful when:
- A close reference exists (> ~90% identity)
- You are interested in variation (SNPs, indels) rather than novel sequence
- Computational resources are limited
This abstraction fails when:
- The sample contains large insertions not in the reference
- The sample is so divergent that most reads fail to map
- You need to discover novel genomic elements
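A cheap empirical check for the second failure mode, assuming reads have already been mapped; the file name and threshold below are illustrative:

```bash
# Overall mapping rate as a diagnostic for the reference-based abstraction
samtools flagstat aln.bam | grep -m1 'mapped ('
# A very low mapped percentage (e.g., well under 50%) suggests the reference
# is too divergent and de novo assembly may be the better abstraction
```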
De novo assembly abstracts the genome as: “A sequence that can be reconstructed from read-read overlaps, without assumptions about the content.”
This abstraction is powerful when:
- No close reference exists
- You expect structural novelty
- You are characterising a new organism or virus
This abstraction fails when:
- Coverage is insufficient for confident overlap detection
- Reads are too short to span complex repeat regions
- The sample is heavily contaminated (overlaps from contaminating DNA cause misassembly)
The lesson: Neither abstraction is universally correct. The begomovirus project in Module 5 was a case where de novo assembly was the appropriate choice — you were working with potentially novel strains, and the reference-based approach would have missed divergent regions. But if you had been typing a known pathogen for clinical surveillance, reference-based mapping would have been faster and sufficient.
1.4 When Abstractions Fail: A Diagnostic Guide
An abstraction fails when the detail you discarded turns out to be relevant after all. Recognising this is a critical debugging skill.
Signs that your abstraction has failed:
- Results are biologically implausible. Assembly size is 50 kb for a known 5 kb virus → you have included contaminating sequences in your abstraction of “viral reads.”
- Results are inconsistent across replicate analyses. Every time you re-run with slightly different parameters, you get a completely different answer → your abstraction is too sensitive to parameters that should not matter, suggesting the true signal is weak.
- A downstream tool refuses to accept your output. For example, GATK expects reads to have read groups (@RG header lines); if your alignment lacks these, the tool will crash. This is a signal that you abstracted away a metadata field (read group assignment) that a downstream algorithm needs (a fix is sketched after this list).
- Statistical diagnostics fail. A QQ plot that shows severe inflation, a coverage plot with extreme bimodality, a BUSCO score near 0% — these are signs that reality does not match your model.
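The read-group failure above has a direct remedy; here is a hedged sketch using samtools, where the file names and read-group values are placeholders rather than values from this project:

```bash
# Check whether the alignment carries any @RG header lines
samtools view -H aln.bam | grep '^@RG' || echo "no read groups found"

# Add a read group retroactively so downstream tools like GATK accept the file
samtools addreplacerg -r 'ID:run1' -r 'SM:sample1' -r 'PL:ILLUMINA' \
    -o aln.rg.bam aln.bam
```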
What to do when an abstraction fails:
Step 1: Identify which abstraction failed (Which detail did you discard that turned out to matter?)
Step 2: Revise the abstraction (Add back the necessary detail)
Step 3: Re-run from the point of failure (Do not re-run the entire pipeline)
Step 4: Document the revision (Why did you add that detail back? What was the evidence?)
Part 2: Algorithm Design in Depth
2.1 The SACRED Framework for Algorithm Selection
When choosing between multiple valid algorithms for the same problem, apply the SACRED framework:
| Letter | Factor | Key question |
|---|---|---|
| S | Speed | How long will this take, given my data size? |
| A | Accuracy | How correct are the results? What are the error rates? |
| C | Cost | What computational resources are required? (RAM, CPU, storage) |
| R | Robustness | How does the algorithm perform on messy, real-world data? |
| E | Ease | How hard is it to install, configure, and interpret? |
| D | Data suitability | Was this algorithm designed for data like mine? |
No algorithm scores highest on all six. The art of algorithm design is knowing which factors matter most for your specific problem.
2.2 Detailed Trade-Off Analysis: Read Mapping Algorithms
Problem: You have Illumina short reads and need to map them to a reference genome for variant calling.
| Factor | BWA-MEM | Bowtie2 | STAR |
|---|---|---|---|
| Speed | Fast (~30 min/sample) | Very fast | Moderate (genome index is slow to build) |
| Accuracy | High, especially for long reads and variants | Good for short reads | Excellent for spliced RNA-Seq; poor for DNA |
| Cost (RAM) | ~5–8 GB | ~4–6 GB | ~30 GB (due to splice-aware index) |
| Robustness | Handles indels well | Less robust to large indels | Handles introns correctly |
| Ease | Well-documented, standard | Well-documented | Complex configuration for splice-aware mode |
| Data suitability | DNA-Seq, amplicons, WGS | DNA-Seq, short reads | RNA-Seq only |
Verdict: For DNA variant calling → BWA-MEM. For RNA-Seq → STAR. Using STAR for DNA variant calling would consume unnecessary memory and risk misinterpreting genomic deletions as splice junctions. Using BWA-MEM for RNA-Seq would fail to map reads that span exon-exon junctions.
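For orientation, the DNA branch of that verdict follows a standard index-map-sort pattern; a minimal sketch with placeholder file names:

```bash
# Index the reference once, then map, sort, and index the alignment
bwa index ref.fa
bwa mem -t 4 ref.fa reads_R1.fastq reads_R2.fastq \
    | samtools sort -o aln.sorted.bam -
samtools index aln.sorted.bam
```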
2.3 Detailed Trade-Off Analysis: Variant Callers
Problem: You want to identify germline variants in a human genome.
| Factor | GATK HaplotypeCaller | samtools mpileup + bcftools | FreeBayes |
|---|---|---|---|
| Speed | Slow (hours per sample) | Fast (minutes) | Moderate |
| Accuracy | Very high, especially for indels | Good for SNPs, weaker for complex indels | Good, especially for polyploid genomes |
| Cost (RAM) | 4–8 GB | 2–4 GB | 4–6 GB |
| Robustness | Handles complex loci via local reassembly | Sensitive to mapping quality at complex loci | Handles multi-allelic sites well |
| Ease | Complex GATK ecosystem, steep learning curve | Simple command-line interface | Moderate complexity |
| Data suitability | Germline, human WGS/WES | Any organism, quick variant survey | Any organism, polyploid support |
Verdict: For clinical-grade germline variant calling in a funded project → GATK HaplotypeCaller. For a rapid exploratory variant survey in a non-human organism → samtools mpileup + bcftools. The cost (time and complexity) of GATK is justified in high-stakes contexts but not always in exploratory analyses.
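As a concrete point of comparison, the rapid-survey option is only a few commands; note that in current htslib releases the mpileup step has moved from samtools into bcftools. File names here are placeholders:

```bash
# Quick exploratory germline variant survey
bcftools mpileup -f ref.fa aln.sorted.bam \
    | bcftools call -mv -Oz -o variants.vcf.gz
bcftools index variants.vcf.gz
```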
2.4 Detailed Trade-Off Analysis: De Novo Assembly Tools
Problem: You want to assemble a 2.7 kb viral genome from Nanopore reads.
| Factor | Flye | Canu | Hifiasm | SPAdes |
|---|---|---|---|---|
| Speed | Fast (minutes for small genome) | Slow (hours) | Fast | Fast |
| Accuracy | Good, needs polishing | Very good after self-correction | Excellent (designed for HiFi) | Good for short reads |
| Cost (RAM) | 4–16 GB | 16–64 GB | 16 GB | 4–8 GB |
| Robustness | Handles high error rate well | Robust to errors via self-correction | Requires HiFi reads | Best for short Illumina reads |
| Ease | Simple, well-documented | Complex parameter tuning | Straightforward | Well-documented |
| Data suitability | Nanopore raw reads (R9/R10) | Nanopore or PacBio CLR | PacBio HiFi (CCS) only | Illumina short reads |
Verdict for the begomovirus project: Flye was the correct choice. Canu would have worked but with longer runtime and more memory. Hifiasm was not appropriate because the data was not HiFi. SPAdes was not appropriate because the reads were long Nanopore reads with a high error profile.
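One way to validate that verdict after the run is to inspect Flye’s per-contig summary, which reports length, coverage, and the circularity flag from section 1.2; a sketch, assuming Flye’s default output layout:

```bash
# Flye writes a tab-separated per-contig summary, including a circularity column
column -t assembly/assembly_info.txt
# Expect one contig near the expected length with the circularity flag set to Y
```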
2.5 Algorithm Design: Multiple Valid Approaches to the Same Problem
A critical insight in computational thinking: for almost every bioinformatics problem, multiple valid algorithms exist. The goal is not to find the only correct algorithm — it is to find a justified algorithm for your constraints.
Example: Detecting Host Contamination Before Assembly
Approach A: Map-and-filter (standard approach)
```bash
# Map all reads to the host genome
minimap2 -ax map-ont host_genome.fa reads.fastq > mapped.sam

# Extract unmapped reads (these are potentially non-host)
samtools view -b -f 4 mapped.sam | samtools fastq - > viral_reads.fastq
```
- Advantage: Fast, well-established, works with any reference genome
- Disadvantage: Requires a good host reference genome; divergent host strains may be missed
Approach B: Taxonomic classification (Kraken2)
```bash
# Classify reads by taxonomy
kraken2 --db k2_standard reads.fastq --output kraken_output.txt --report report.txt

# Extract non-host reads (taxid 3983 = Manihot esculenta);
# KrakenTools writes to -o and needs --fastq-output for FASTQ output
extract_kraken_reads.py -k kraken_output.txt -s reads.fastq \
    --exclude --taxid 3983 --fastq-output -o viral_reads.fastq
```
- Advantage: Does not require a complete host reference genome; classifies to species level
- Disadvantage: Depends on database completeness; novel organisms may be missed; requires larger database download
Approach C: GC-content separation (opportunistic — only when GC profiles are very different)
```bash
# Compute per-read GC content
# (record the read length before gsub, which edits $0 in place)
awk 'NR % 4 == 2 { len = length($0); gc = gsub(/[GCgc]/, ""); print gc / len }' \
    reads.fastq > gc_content.txt
# Then filter reads with GC content in the expected viral range
```
- Advantage: No reference required
- Disadvantage: Crude, unreliable when host and pathogen have similar GC content; would fail for begomoviruses (GC ~40%) in cassava (GC ~34%)
Which approach should you choose?
For the Module 5 scenario:
- Host reference genome (Manihot esculenta) is available on NCBI → Approach A is fast and reliable
- Approach B is valid but requires downloading a large Kraken2 database (~50 GB)
- Approach C is inappropriate — GC profiles too similar
Constraint-based selection:
| Constraint | Preferred approach |
|---|---|
| Good reference genome available, fast turnaround needed | A (map-and-filter) |
| No reference genome, comprehensive classification needed | B (Kraken2) |
| Very limited storage and compute | A with a reduced reference |
| Novel host with no genome | B or A against closest relative |
2.6 How to Choose: A Practical Decision Guide
When you face a choice between multiple algorithms, work through this checklist:
1. What is my data type? → This immediately eliminates algorithms that are incompatible
2. What is my primary constraint? → Time? Memory? Accuracy? Choose the algorithm where that factor scores highest
3. What are the known failure modes of each option? → Does my data trigger any known failure modes?
4. What have others used for this type of data? → Check tool benchmarking papers, GitHub Issues, and bioinformatics forums
5. What is the cost of being wrong? → Clinical application: prioritise accuracy. Exploratory analysis: prioritise speed.
6. Can I test multiple approaches on a subset? → If compute allows, run two approaches on a small subset and compare (a subsetting sketch follows below)
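For point 6, draw a small reproducible subsample before committing compute; a sketch using seqtk, where the seed and fraction are arbitrary choices:

```bash
# Reproducible 10% subsample for a pilot comparison (-s fixes the random seed)
seqtk sample -s100 reads.fastq 0.1 > reads.subset.fastq
```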
2.7 Parameter Tuning: A Special Case of Algorithm Design
Choosing an algorithm is only the first decision. Most bioinformatics tools have multiple parameters, each of which can significantly change the output. Parameter tuning is itself an algorithm design problem.
Guiding principles for parameter tuning:
- Start with defaults. Tool authors set defaults for typical use cases. Understand what “typical” means before departing from defaults.
- Tune one parameter at a time. Changing multiple parameters simultaneously makes it impossible to attribute the effect to a specific change.
- Tune on a subset. Run multiple parameter combinations on a small subset of your data (e.g., one sample, or 10% of reads) before committing to a full run.
- Evaluate with a metric, not by eye. Define your quality metric before tuning: N50, BUSCO score, number of mapped reads, QQ lambda. Then optimise toward that metric — not toward a result that “looks right” (see the sketch after this list).
- Document every parameter change. A run with a non-default parameter that is not documented is unreproducible.
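To make the metric principle concrete, a minimal sketch comparing one pre-defined metric across two parameter settings; the run directories are hypothetical, and the exact flagstat label varies slightly across samtools versions:

```bash
# Compare a single pre-defined metric (mapped-read line) across two runs
for run in run_default run_tuned; do
    printf '%s: ' "$run"
    samtools flagstat "$run/aln.bam" | grep -m1 'mapped ('
done
```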
Example: Flye genome size parameter
```bash
# Correct usage for a ~5 kb begomovirus genome:
flye --nano-raw viral_reads.fastq --genome-size 5k --out-dir assembly/ --threads 4

# What happens if you set --genome-size 50m (50 megabases, wrong by 10,000x)?
# Flye will attempt to allocate enormous memory for the read graph,
# likely fail, or produce a wildly incorrect assembly.
# This is not a hardware failure — it is an algorithm design mistake.
```
2.8 Case Study: RNA-Seq — Multiple Valid Pipelines
Goal: Identify differentially expressed genes between infected and uninfected cassava leaves.
Pipeline A: STAR + featureCounts + DESeq2 (most common)
1. Quality trim reads (Trimmomatic)
2. Map to reference genome (STAR --outSAMtype BAM SortedByCoordinate)
3. Quantify gene-level counts (featureCounts -p -s 0)
4. Import to R, normalise, test (DESeq2)
5. Visualise (volcano plot, heatmap, MA plot)

Pipeline B: Salmon + tximport + DESeq2 (faster, pseudoalignment)
1. Quality trim reads (Trimmomatic or fastp)
2. Quasi-map to reference transcriptome (salmon quant --validateMappings)
3. Import transcript-level counts to R (tximport, summarise to gene level)
4. Test (DESeq2)
5. Visualise

Pipeline C: Trinity (de novo) + RSEM + edgeR (when no reference genome available)
1. Quality trim reads
2. De novo transcriptome assembly (Trinity --seqType fq --max_memory 50G)
3. Quantify (RSEM or kallisto)
4. Test (edgeR)
5. Visualise

Trade-off comparison:
| Factor | Pipeline A | Pipeline B | Pipeline C |
|---|---|---|---|
| Requires reference genome | Yes (genome + GTF) | Yes (transcriptome) | No |
| Speed | Moderate | Fast | Slow (assembly) |
| Accuracy | High | Slightly lower (no alignment) | Variable (depends on assembly) |
| RAM usage | High (STAR index ~30 GB) | Low (Salmon index ~2 GB) | Very high (Trinity) |
| Handles multi-mapping reads | Yes (featureCounts) | Yes (EM algorithm) | Yes (RSEM) |
| Best for | Model organism with genome | Fast exploratory analysis | Novel organism |
For cassava (Manihot esculenta, reference genome available): Pipeline A or B. Pipeline C is unnecessary. For a novel pathosystem without a reference genome: Pipeline C.
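For orientation, the core of Pipeline B is only a few commands; a sketch with placeholder file names, noting that exact flags vary slightly between Salmon versions:

```bash
# Build the transcriptome index once
salmon index -t transcripts.fa -i salmon_index

# Quantify one paired-end sample; -l A auto-detects the library type
salmon quant -i salmon_index -l A \
    -1 sample1_R1.fastq -2 sample1_R2.fastq \
    --validateMappings -o quants/sample1
```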
Summary
Core insight on abstraction: Every model, tool, and pipeline is an abstraction. Your job is not to find the perfect abstraction — it is to find one that discards the right details for your specific question, and to know what it throws away.
Core insight on algorithm design: There is rarely one correct algorithm. There is usually a best-justified algorithm given your data, your constraints, and your question. The justification is as important as the choice.
| Skill | Practice activity |
|---|---|
| Abstraction | For the next tool you run, write down three things it assumes and three things it ignores |
| Algorithm selection | Find two tools for the same task and complete a SACRED comparison table |
| Parameter tuning | Run the same tool with two different key parameters and compare the outputs using a defined quality metric |
Looking Ahead
Lesson 5 brings all four principles together in integrated case studies — showing how computational thinking operates in a real, complete, and messy analysis — including the inevitable failures, unexpected results, and iterative corrections that define real bioinformatics work.