
Lesson 3 — Decomposition and Pattern Recognition in Action

By the end of this lesson, you will be able to:

  • Systematically decompose an unfamiliar bioinformatics problem using a structured approach
  • Apply the “five questions” framework to any new problem before starting an analysis
  • Recognise at least six recurring patterns in NGS data and their standard responses
  • Use a decision tree to select an appropriate solution strategy
  • Identify common decomposition mistakes and how to avoid them
  • Describe how decomposition and pattern recognition were applied in the Module 5 project

Decomposition sounds simple. Break the problem apart. But in practice, it is the skill that most distinguishes experienced bioinformaticians from beginners. Two common failure modes:

Failure mode 1: Decomposing too shallowly. “My problem is: assemble and annotate the genome.” This is not decomposed — it is just restated. A useful decomposition reaches the level where each sub-problem has a clear input, a clear output, and a tool or method that can solve it.

Failure mode 2: Decomposing incorrectly. Sometimes the pieces you identify do not actually combine to solve the original problem. This happens when you decompose based on familiar steps rather than logical structure. For example, decomposing “identify differentially expressed genes” into “run FastQC, trim reads, map reads, call variants” — the last step (variant calling) does not contribute to differential expression analysis. You have recognised familiar tools without thinking about whether they solve your specific problem.


1.2 A Systematic Framework: Five Questions Before You Start


Before decomposing any bioinformatics problem, answer these five questions:

Q1: What is the final biological output? What form does the answer take? A list of genes? A phylogenetic tree? A VCF file? A heatmap? Clarity about the endpoint prevents the analysis from drifting.

Q2: What data do I have at the start? What format? What quality? What technology (Illumina/Nanopore/PacBio)? Single-end or paired-end? How many samples? This determines which tools are available and which assumptions are valid.

Q3: What is the logical gap between Q1 and Q2? The gap defines the problem. Each step that bridges the gap is a sub-problem. Map the gap as explicitly as you can.

Q4: What are the known failure modes at each step? Contamination, low coverage, poor assembly, incorrect annotation, batch effects. Identifying these before you start tells you where to build validation checkpoints.

Q5: What defines “good enough” at each step? Q30 for read quality? N50 > genome size? BUSCO completeness > 90%? Without quality thresholds, you cannot decide when to proceed and when to investigate.


1.3 Case Study: Decomposing a Viral Assembly Problem


Scenario: You have received a FASTQ file from a field sample. The researcher believes it contains a novel plant virus. Your task: characterise the virus.

Applying the five questions:

| Question | Answer |
| --- | --- |
| Q1 (output) | Annotated viral genome sequence, placed in a phylogenetic tree |
| Q2 (data) | Nanopore long reads, mixed sample (plant + potential virus), unknown coverage |
| Q3 (gap) | Raw mixed reads → clean viral reads → assembled viral genome → annotated genome → tree |
| Q4 (failure modes) | Host contamination dominates; virus may be low abundance; assembly may fragment; BLAST may find no close relatives |
| Q5 (thresholds) | Coverage ≥ 30×; assembly completeness assessed by genome fraction covered; BLAST identity ≥ 70% for genus-level assignment |

Decomposition tree:

Characterise the novel virus
├── 1. Data quality assessment
│ ├── 1a. NanoStat on raw reads (read count, N50, mean quality)
│ └── 1b. Flag if mean quality < Q10 (Nanopore baseline)
├── 2. Host read removal
│ ├── 2a. Download or locate host reference genome (e.g., Manihot esculenta)
│ ├── 2b. Map reads to host (minimap2 -ax map-ont)
│ ├── 2c. Extract unmapped reads (samtools view -f 4)
│ └── 2d. Verify enrichment: what % of reads were host?
├── 3. Coverage check
│ ├── 3a. Estimate genome size (assume ~5 kb for begomovirus)
│ ├── 3b. Calculate: (read_count × avg_length) / genome_size
│ └── 3c. Decision: if coverage < 30×, consider combining samples
├── 4. Assembly
│ ├── 4a. Run Flye (--nano-raw, --genome-size 5k)
│ ├── 4b. Evaluate assembly: QUAST (N50, total length, contig count)
│ └── 4c. Polish if needed (Medaka or Racon)
├── 5. Viral contig identification
│ ├── 5a. BLAST all contigs against NCBI viral database
│ ├── 5b. Retain contigs with viral hits (e-value < 1e-5)
│ └── 5c. Flag unexpected hits (plant, bacteria) for removal
├── 6. Genome annotation
│ ├── 6a. Predict ORFs (Prokka or custom BLAST against known viral proteins)
│ └── 6b. Assign gene names and functional categories
└── 7. Phylogenetic analysis
    ├── 7a. Download reference sequences from NCBI (same genus/family)
    ├── 7b. Multiple sequence alignment (MAFFT)
    ├── 7c. Trim alignment (trimAl)
    ├── 7d. Infer tree (IQ-TREE or FastTree)
    └── 7e. Visualise and annotate (FigTree, iTOL)

Each leaf node of this tree has: a clear input, a clear tool, and a clear output. That is a correctly decomposed problem.
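To make the leaf nodes concrete, here is a minimal command-line sketch of steps 2–4 of the tree. It is illustrative rather than prescriptive: the file names (host.fa, reads.fastq.gz), thread count, and the use of seqkit for read statistics are assumptions, not part of the original protocol.

```bash
# 2b. Map raw reads to the host reference with the Nanopore preset
minimap2 -ax map-ont host.fa reads.fastq.gz | samtools sort -o host_aln.bam
samtools index host_aln.bam

# 2c. Keep only reads that did NOT map to the host (SAM flag 4 = unmapped)
samtools fastq -f 4 host_aln.bam > nonhost.fastq

# 2d. Verify enrichment: what fraction of reads mapped to the host?
samtools flagstat host_aln.bam

# 3a-3b. Rough coverage estimate: (read count x mean length) / assumed ~5 kb genome
seqkit stats nonhost.fastq

# 4a. Assemble the non-host reads
flye --nano-raw nonhost.fastq --genome-size 5k --out-dir flye_out --threads 8

# 4b. Evaluate contiguity and total length
quast.py flye_out/assembly.fasta -o quast_out
```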


1.4 Case Study: Decomposing a GWAS

Scenario: A research group has genotyped 3,000 individuals (1,500 cases with chronic kidney disease, 1,500 controls) using a 500,000-SNP array. They want to identify genetic variants associated with disease risk.

Five questions:

| Question | Answer |
| --- | --- |
| Q1 (output) | A ranked list of SNPs with association statistics and annotated candidate genes |
| Q2 (data) | Raw genotype array data (PLINK format), plus phenotype file (case/control), and covariate file (age, sex, ancestry PCs) |
| Q3 (gap) | Raw genotypes → QC-filtered genotypes → association analysis → multiple testing correction → annotation → biological interpretation |
| Q4 (failure modes) | Related individuals inflate statistics; population stratification creates false positives; poorly QC-filtered SNPs generate artefactual associations |
| Q5 (thresholds) | Genome-wide significance p < 5×10⁻⁸; SNP call rate ≥ 95%; sample call rate ≥ 95%; HWE p > 1×10⁻⁶ in controls |

Decomposition:

GWAS: Identify CKD risk variants
├── 1. Genotype QC (per-SNP and per-sample)
│ ├── 1a. Remove SNPs with call rate < 95%
│ ├── 1b. Remove samples with call rate < 95%
│ ├── 1c. Remove SNPs with MAF < 1%
│ ├── 1d. Remove SNPs failing HWE in controls (p < 1e-6)
│ └── 1e. Check sex concordance (reported vs. inferred)
├── 2. Population stratification control
│ ├── 2a. LD-prune SNPs (window = 50 SNPs, step = 5, r² < 0.2)
│ ├── 2b. Merge with HapMap/1000G reference panel
│ ├── 2c. Run PCA (PLINK --pca)
│ └── 2d. Exclude population outliers (> 6 SD from mean on PC1-PC2)
├── 3. Relatedness check
│ ├── 3a. Calculate identity-by-descent (IBD) coefficients
│ └── 3b. Remove one individual from each related pair (PI_HAT > 0.2)
├── 4. Association testing
│ ├── 4a. Logistic regression (binary phenotype: case/control)
│ ├── 4b. Include covariates: age, sex, top 10 PCs
│ └── 4c. Generate summary statistics (beta, SE, p-value per SNP)
├── 5. Visualisation and quality checks
│ ├── 5a. Manhattan plot (p-values across genome)
│ ├── 5b. QQ plot (observed vs. expected p-values)
│ └── 5c. Calculate genomic inflation factor lambda (should be ~1.0)
├── 6. Identify significant loci
│ ├── 6a. Apply genome-wide significance threshold (p < 5e-8)
│ ├── 6b. LD-clump to identify independent signals
│ └── 6c. Regional association plots for each locus
└── 7. Functional annotation
    ├── 7a. Annotate lead SNPs (dbSNP, Ensembl)
    ├── 7b. Check if SNPs fall in coding regions, promoters, eQTLs
    └── 7c. Gene ontology and pathway enrichment for candidate genes
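
A hedged PLINK 1.9 sketch of steps 1, 2, 3a and 4 follows. It assumes the genotypes are already in binary PLINK format (ckd.bed/.bim/.fam) and that covar.txt already holds AGE, SEX and the top 10 principal components; file names are illustrative, while the thresholds mirror Q5 above.

```bash
# 1a-1d. Per-SNP and per-sample QC in one pass
plink --bfile ckd \
      --geno 0.05 --mind 0.05 --maf 0.01 --hwe 1e-6 \
      --make-bed --out ckd_qc

# 2a. LD-prune (window 50 SNPs, step 5, r^2 threshold 0.2)
plink --bfile ckd_qc --indep-pairwise 50 5 0.2 --out pruned

# 2c. PCA on the pruned SNP set
plink --bfile ckd_qc --extract pruned.prune.in --pca 10 --out pcs

# 3a. Identity-by-descent estimates, reporting pairs with PI_HAT > 0.2
plink --bfile ckd_qc --extract pruned.prune.in --genome --min 0.2 --out ibd

# 4a-4c. Logistic regression with covariates (age, sex, top 10 PCs)
plink --bfile ckd_qc --logistic hide-covar \
      --covar covar.txt --covar-name AGE,SEX,PC1-PC10 \
      --out gwas
```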

Common decomposition mistakes, what they look like, and how to fix them:

| Mistake | What it looks like | How to fix it |
| --- | --- | --- |
| Circular decomposition | “Step 1: analyse. Step 2: get results from analysis.” | Each sub-step must reduce the gap between input and output |
| Tool-first decomposition | “Step 1: run FastQC. Step 2: run Trimmomatic. Step 3: run…” | Start with logic, not tools. Add tools once the logic is clear |
| Missing validation steps | No checkpoints between major steps | Add QC steps between each major phase |
| Insufficient depth | Stopping at 2–3 high-level steps | Continue until each leaf has a clear tool and input/output |
| Ignoring failure modes | Not considering what to do if assembly fails | For each step, ask: “What do I do if this gives bad output?” |
| Coupling independent sub-problems | Assuming Step 3 must come from Step 2’s exact output | Keep sub-problems modular so they can be revised independently |

2.1 How to Build a Pattern Recognition Vocabulary


Pattern recognition is a learned skill. Each time you encounter a problem, file it mentally (or literally) under:

  • “What type of problem is this?” (alignment, assembly, clustering, regression, annotation)
  • “What did the data look like?” (low coverage, adapter contamination, batch effect, host contamination)
  • “What worked, and why?”
  • “What failed, and why?”

Over time, you build a library of patterns. The goal of this section is to give you a head start on that library.


2.2 Recurring Patterns in NGS Data and Their Standard Responses

Pattern 1: The Low-Quality Tail

Presentation: FastQC per-base quality plot shows high Q30+ quality for the first 50–80% of read length, then a rapid decline to Q20 or below at the 3’ end.

Cause: Illumina sequencing-by-synthesis chemistry — signal decays and phasing errors accumulate over successive cycles, so base-call confidence drops towards the end of the read.

Standard response: Trim 3’ ends using Trimmomatic (TRAILING:20) or Cutadapt. Do not trim uniformly from both ends — 5’ end quality is usually fine.

Pattern recognition flag: “Declining 3’ quality = chemistry artefact, not biology.”
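
For example, either tool named above can clip the low-quality tail; a minimal sketch with illustrative file names (the trimmomatic wrapper script is assumed to be on the PATH):

```bash
# Trimmomatic: cut bases from the 3' end once quality drops below Q20
trimmomatic SE reads.fastq.gz reads.trimmed.fastq.gz TRAILING:20

# Cutadapt equivalent: quality-trim the 3' end at Q20
cutadapt -q 20 -o reads.trimmed.fastq.gz reads.fastq.gz
```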


Pattern 2: Adapter Contamination

Presentation: FastQC adapter content plot shows rising lines; overrepresented sequences contain known adapter sequences (e.g., AGATCGGAAGAGC for Illumina TruSeq).

Cause: Insert size shorter than read length causes the sequencer to read into the adapter.

Standard response: Adapter trimming with Trimmomatic (ILLUMINACLIP) or Cutadapt. Verify with post-trim FastQC.

Recognition flag: “Any rising line in adapter content = trimming required.”
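
A sketch of the standard response for a paired-end TruSeq library (file names are illustrative; the adapter sequence is the one quoted above):

```bash
# Cutadapt: remove adapter read-through from the 3' end of both reads in each pair
cutadapt -a AGATCGGAAGAGC -A AGATCGGAAGAGC \
         -o r1.trimmed.fastq.gz -p r2.trimmed.fastq.gz \
         r1.fastq.gz r2.fastq.gz

# Verify: the adapter content plot should now be flat
fastqc r1.trimmed.fastq.gz r2.trimmed.fastq.gz
```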


Pattern 3: GC Content Bimodality

Presentation: FastQC per-sequence GC content shows two peaks instead of the expected bell curve — e.g., one peak at ~35% (host) and one at ~55% (pathogen).

Cause: Mixed sample with organisms of different GC composition.

Standard response: This is actually diagnostic — it tells you the sample is mixed and the contaminating organism has a different GC content from the target. Proceed with host filtering; the bimodality will resolve.

Recognition flag: “Two GC peaks = mixed sample. Identify and separate by GC, mapping, or depth.”


Pattern 4: Low Assembly Contiguity Despite High Coverage

Presentation: Assembly produces hundreds of short contigs even though estimated coverage is 100×+. N50 is much smaller than expected genome size.

Cause: Usually one of three things: (1) repeat-rich regions that the assembler cannot resolve; (2) contamination creating ambiguous overlaps; (3) wrong assembler for the data type (e.g., short-read assembler used for long reads, or long-read assembler used for short reads).

Standard response: Check contig length distribution. BLAST short contigs to check for contamination. If repeat-driven, switch to long-read assembly or use a repeat-aware assembler. If contamination-driven, filter before assembly.

Recognition flag: “High coverage + low N50 = repeat or contamination problem, not a data quantity problem.”
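
One way to run the contamination check on the short contigs is sketched below; it assumes a local copy of the nt database and uses seqkit for length filtering, neither of which is specified in the text.

```bash
# Pull out the short contigs that are inflating the contig count
seqkit seq -M 1000 assembly.fasta > short_contigs.fasta

# BLAST them against nt, keeping the best hit per contig with its description
blastn -query short_contigs.fasta -db nt \
       -outfmt "6 qseqid sseqid pident length evalue stitle" \
       -max_target_seqs 1 -evalue 1e-10 -out short_contigs.blast.tsv
```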


Pattern 5: Genomic Inflation in GWAS

Presentation: QQ plot shows observed p-values consistently above the diagonal across all tested SNPs (genomic inflation factor λ > 1.05).

Cause: Cryptic population stratification (samples from different ancestry groups mixed without correction), relatedness among samples, or systematic genotyping bias.

Standard response: Include principal components as covariates in the regression model. Remove related individuals. Check for batch effects in genotyping. If λ remains elevated, apply genomic control correction.

Recognition flag: “QQ inflation = stratification or relatedness — not a sign of true widespread association.”


Pattern 6: Uneven Coverage Across Assembly

Presentation: When mapping reads back to the assembly, some regions have 0× or near-0× coverage while others have 500×+. QUAST reports large numbers of misassembled regions.

Cause: The assembler collapsed repeats (two distinct genomic regions mapped to one contig), or structural variants created coverage anomalies.

Standard response: Inspect low-coverage regions — are they at contig ends? Are they in repetitive regions by BLAST? Flag potential misassemblies. If using long reads, re-polish with a tool that can detect and correct misassemblies (e.g., Medaka with a coverage filter).

Recognition flag: “Extreme coverage heterogeneity = assembly artefact, not data problem.”
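
A quick sketch of how the coverage profile might be inspected (Nanopore reads assumed, as in the case study; file names illustrative):

```bash
# Map the reads back to the assembly
minimap2 -ax map-ont assembly.fasta reads.fastq.gz | samtools sort -o back.bam
samtools index back.bam

# Per-base depth, including zero-coverage positions
samtools depth -a back.bam > depth.txt

# Mean depth per contig (column 1 = contig, column 3 = depth)
awk '{sum[$1]+=$3; n[$1]++} END {for (c in sum) print c, sum[c]/n[c]}' depth.txt
```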


Pattern 7: Zero Differentially Expressed Genes

Presentation: DESeq2 or edgeR returns no significant results despite a clear visual difference between groups in the heatmap.

Cause: Usually low biological replication (n=2 per group), high within-group variability, or incorrect contrast specification in the model.

Standard response: Check the design formula. Visualise sample-level PCA — do the groups actually separate? Check dispersion estimates — are they unreasonably high? With n=2, statistical power is near zero.

Recognition flag: “Zero DE genes with visual group separation = underpowered design or model error, not absence of biology.”


2.3 Decision Trees for Choosing Solution Strategies


Decision trees translate pattern recognition into action. When you recognise a pattern, a decision tree tells you which branch to follow.

Decision Tree 1: Choosing an Assembly Strategy

What type of data do I have?
├── Short reads only (Illumina, 100–300 bp)
│ ├── Do I have a reference genome?
│ │ ├── YES → Reference-based assembly (BWA-MEM + samtools + bcftools)
│ │ └── NO → De novo (SPAdes for small genomes; Velvet for very small)
│ └── What is my genome size?
│ ├── < 10 Mb → SPAdes works well
│ └── > 100 Mb → Consider MaSuRCA (hybrid) or Platanus
├── Long reads only (Nanopore / PacBio)
│ ├── Nanopore → Flye (--nano-raw or --nano-hq depending on chemistry)
│ ├── PacBio CLR → Canu or FALCON
│ ├── PacBio HiFi (CCS) → hifiasm (very high accuracy, preferred)
│ └── Do I need polishing?
│ ├── Nanopore reads → Medaka (neural network polisher) or Racon
│ └── HiFi → usually self-correcting; polishing optional
└── Hybrid (short + long reads)
    └── MaSuRCA, Unicycler (for bacterial/viral), or long-read assembly + Pilon polishing
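
As a rough illustration, the long-read and hybrid branches of this tree translate into commands along these lines (read files, output names and thread counts are assumptions):

```bash
# Nanopore-only
flye --nano-raw ont_reads.fastq.gz --out-dir flye_ont --threads 16

# PacBio HiFi
hifiasm -o asm_hifi -t 16 hifi_reads.fastq.gz

# Hybrid short + long reads for a small (bacterial/viral) genome
unicycler -1 short_R1.fastq.gz -2 short_R2.fastq.gz -l long_reads.fastq.gz -o unicycler_out
```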

Decision Tree 2: Diagnosing a Failed Assembly

Assembly failed or produced poor results. What happened?
├── Zero contigs assembled
│ └── Check: did reads actually pass to assembler? (file size, format check)
├── Hundreds of very short contigs (< 500 bp)
│ ├── Check coverage: is it sufficient (≥ 30×)?
│ │ ├── NO → Sequence more, or combine samples
│ │ └── YES → Likely repeat or contamination problem
│ └── Check GC content (FastQC): bimodal?
│ ├── YES → Mixed sample — filter host reads first
│ └── NO → Try repeat-aware assembler or different k-mer settings
├── Assembly much larger than expected genome size
│ └── Contamination incorporated into assembly — BLAST contigs, filter non-target hits
└── Assembly correct size but poor N50
    ├── Are reads long enough to span repeats?
    │ ├── NO → Switch to long-read technology or accept fragmented assembly
    │ └── YES → Investigate repeat complexity; try graph-based assembly
    └── Is coverage highly uneven? → Possible library preparation bias
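
For the first branch ("did reads actually pass to the assembler?"), a few one-line sanity checks are usually enough (illustrative file name):

```bash
# Non-empty file, and does it look like FASTQ? (records start with '@', 4 lines per read)
ls -lh reads.fastq.gz
zcat reads.fastq.gz | head -4

# Read count (FASTQ has 4 lines per read)
echo $(( $(zcat reads.fastq.gz | wc -l) / 4 ))
```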

2.4 Pattern Recognition Across Domains: A Summary Table

| Domain | Common pattern | What it signals | Standard response |
| --- | --- | --- | --- |
| QC (all technologies) | Declining 3’ quality | Chemistry artefact | Trim 3’ ends |
| QC (Illumina) | Adapter content rising | Short insert library | Trim adapters |
| QC (all) | Bimodal GC curve | Mixed sample | Host filtering or source check |
| Assembly | High coverage, low N50 | Repeats or contamination | Filter, switch assembler |
| Assembly | Contigs >> expected size | Contamination assembled | BLAST and filter |
| Mapping | Low mapping rate (< 60%) | Wrong reference, wrong species, or heavy contamination | Check reference, filter reads |
| GWAS | λ > 1.1 | Stratification or relatedness | Add PCs, remove related individuals |
| GWAS | No hits at genome-wide threshold | Underpowered or wrong phenotype definition | Check sample size, phenotype |
| RNA-Seq | Zero DE genes | Underpowered or model error | Check design, check replication |
| RNA-Seq | Thousands of DE genes | Batch effect or normalisation failure | Check PCA, recheck normalisation |

2.5 Connecting to Module 5: What Patterns Did You Already Recognise?


In the Module 5 begomovirus project, many students implicitly applied pattern recognition:

  • Recognising that host reads would dominate the raw FASTQ (pattern: mixed sample from field collection)
  • Recognising that Flye was the appropriate assembler for Nanopore data (pattern: long-read assembly tool selection)
  • Recognising that BLAST hits to plastid sequences were host contamination, not viral (pattern: unexpected BLAST taxonomy = contamination)
  • Recognising that a single contig covering the full begomovirus DNA-A segment was a sign of a good assembly (pattern: expected genome completeness for a known genome size)

Each of these was pattern recognition. You had enough background from Modules 1–4 to recognise the pattern and act on it — even if you did not have a name for what you were doing.


To keep building these skills:

| Skill | What to practise |
| --- | --- |
| Decomposition | For every new problem, draw a tree before you open a terminal |
| Five questions | Always answer Q1–Q5 before designing an analysis |
| Pattern catalogue | Keep a personal log of patterns you encounter |
| Decision trees | Build your own trees as you gain experience — they are the crystallised form of pattern recognition |

Key takeaway: Decomposition and pattern recognition are what turn a complex, overwhelming problem into a series of familiar, tractable steps. You build pattern recognition by solving many problems. You apply decomposition every time you start a new one.


Lesson 4 examines abstraction and algorithm design in depth — with particular attention to the hardest practical questions: what details can you safely ignore, and how do you choose between multiple valid approaches to the same problem?