Solutions
Note for instructors: These solutions show one correct approach for each exercise. Students may use different (equally valid) commands. Encourage discussion of alternative approaches rather than treating these as the only right answers.
Exercise Set 1: Linux Foundations
Section titled “Exercise Set 1: Linux Foundations”1.1 — Finding Your Bearings
Section titled “1.1 — Finding Your Bearings”-
From
~toTraining/long_reads/:- Absolute path:
/home/<username>/path/to/Training/long_reads/(exact path depends on system) - Relative path:
Training/long_reads/(if Training is directly inside~)
- Absolute path:
-
ls Training/lists filenames only.ls -lh Training/lists with permissions, owner, size (human-readable), and date. -
ls -lhR Training/lists recursively through all subdirectories (-R= recursive).
Discussion answer: Scripts use paths to find files. If a script uses a relative path like ./data/reads.fastq, it will break the moment you run the script from a different directory.
1.2 — Counting Reads
Section titled “1.2 — Counting Reads”-
Example output:
34268 Training/long_reads/barcode57.fastq -
Reads = 34268 ÷ 4 = 8567 reads
-
barcode58.fastq: 36784 lines → 9196 reads |sample3.fastq: 11880 lines → 2970 reads -
Check all three at once:
Terminal window wc -l Training/long_reads/*.fastq
Discussion answer: An odd line count means the file is malformed — a read record is incomplete. This could indicate a truncated file (failed download or interrupted write).
1.3 — Inspecting a FASTQ Record
Section titled “1.3 — Inspecting a FASTQ Record”head -n 4 Training/long_reads/barcode57.fastqtail -n 4 Training/long_reads/barcode57.fastqhead -n 4 Training/short_reads/paired/SRR1553607_1.fastqDiscussion answer:
- Nanopore headers contain a UUID, timestamp, and sequencing chemistry information (e.g.,
r10.4.1_e8.2_400bps_sup@v4.3.0) - Illumina headers contain a run accession, read number, and read length (e.g.,
@SRR1553607.1 1 length=101) - These differences reflect the platforms: Nanopore is real-time streaming; Illumina is batch synthesis
1.4 — Comparing Two Files
Section titled “1.4 — Comparing Two Files”wc -l Training/short_reads/paired/SRR1553607_1.fastq Training/short_reads/paired/SRR1553607_2.fastqBoth should show 813780 lines.
head -n 1 Training/short_reads/paired/SRR1553607_1.fastqhead -n 1 Training/short_reads/paired/SRR1553607_2.fastqHeaders will differ in pair number (.1 vs .2) but share the same read ID.
Discussion answer: Check before trimming. Trimming tools can discard unpaired reads, so after trimming you need to verify pairing is maintained (most trimmers do this automatically, but it is worth confirming).
1.5 — File Permissions and Scripts
Section titled “1.5 — File Permissions and Scripts”-
ls -l Training/short_reads/unpaired/download.sh— look forxin the permissions string. On this system it shows-rwxrwxrwx, so it is executable. -
cat Training/short_reads/unpaired/download.sh— it usescurlto download samples from NCBI SRA via FTP. -
chmod +x ~/count_reads.sh -
bash count_reads.shruns the file with bash regardless of permissions../count_reads.shrequires the file to be executable (+x) and uses the shebang line to determine the interpreter.
Discussion answer:
- The file is not executable (
chmod +xnot run) - The shebang line is wrong or missing (e.g.,
#!/usr/bin/pythonon a system where Python is at a different path)
1.6 — Absolute vs. Relative Paths
Section titled “1.6 — Absolute vs. Relative Paths”-
Absolute paths are safer in scripts because they work regardless of what directory the script is called from.
-
Example absolute path:
/mnt/c/Users/WAVECU001/Documents/Bioinformatic_Fridays/Training/long_reads/barcode57.fastq -
From
Training/short_reads/:../long_reads/barcode57.fastq -
../means “go up one directory level.” Example: fromTraining/short_reads/paired/,../../long_reads/navigates toTraining/long_reads/.
Discussion answer: Use a variable at the top of the script: TRAINING_DIR="/path/to/Training". Each user sets this once. Alternatively, use $1 as an argument so the path is provided at runtime.
Exercise Set 2: Text Processing and Pipelines
Section titled “Exercise Set 2: Text Processing and Pipelines”2.1 — Counting Sample Types
Section titled “2.1 — Counting Sample Types”ls Training/short_reads/unpaired/ | grep Case | wc -l# Output: 3
ls Training/short_reads/unpaired/ | grep Healthy | wc -l# Output: 4Using grep -c:
ls Training/short_reads/unpaired/ | grep -c CaseSame result here, but -c can behave differently when applied directly to files vs. stdin.
Discussion answer: -c counts lines with a match. It is equivalent to piping to wc -l in this context, but pipe + wc -l is more explicit and easier to read.
2.2 — Working with Compressed Files
Section titled “2.2 — Working with Compressed Files”-
zcat(orgzip -dc) -
Terminal window zcat Training/short_reads/unpaired/SRR11282407_Case.fastq.gz | head -n 8 -
Terminal window zcat Training/short_reads/unpaired/SRR11282407_Case.fastq.gz | wc -lDivide result by 4 to get read count.
-
Compressed files: (a) save disk space significantly (gzip typically reduces FASTQ to ~25% of original size); (b) faster to transfer over networks.
Discussion answer: Compressed files take longer to process because they must be decompressed on-the-fly. For pipelines that read the same file many times, it may be worth decompressing once. For pipelines that read each file once, keeping compressed saves space.
2.3 — Extracting Read Headers
Section titled “2.3 — Extracting Read Headers”grep "^@" Training/long_reads/barcode57.fastq | head -n 5grep -c "^@" Training/long_reads/barcode57.fastqExpected: 8567. In practice, Nanopore quality strings rarely contain @ at position 1, but it is theoretically possible.
Discussion answer: A safer approach is to extract every 4th line starting at line 1 using awk 'NR%4==1'. This is position-based rather than content-based and never misidentifies quality lines.
2.4 — Sorting and Deduplication
Section titled “2.4 — Sorting and Deduplication”-
ls -lhS Training/short_reads/unpaired/(capital-Sfor size sort, add-rfor reverse/largest first) -
ls Training/short_reads/unpaired/ | sort -
Output of
sort | uniq -c | sort -rn:2 Sample12 Sample21 Sample3Column 1 = count, column 2 = value.
sort -rnsorts numerically in reverse (most frequent first).
Discussion answer: sort | uniq -c is used in bioinformatics to count occurrences of read IDs (to detect duplicates), count k-mer frequencies, or count how many times each adapter sequence appears.
2.5 — Building a Read Summary Pipeline
Section titled “2.5 — Building a Read Summary Pipeline”for f in Training/long_reads/*.fastq; do reads=$(wc -l < "$f") reads=$((reads / 4)) echo "$(basename "$f") $reads"doneRedirect to file:
for f in Training/long_reads/*.fastq; do reads=$(( $(wc -l < "$f") / 4 )) echo "$(basename "$f") $reads"done > read_counts.txt> overwrites the file each time; >> appends. Running twice with > leaves only the second run’s output. Running twice with >> leaves both.
2.6 — Investigating Read Quality Lines
Section titled “2.6 — Investigating Read Quality Lines”awk 'NR%4==0' Training/long_reads/barcode57.fastq | head -n 5awk 'NR%4==0' Training/short_reads/paired/SRR1553607_1.fastq | head -n 5The one-liner breakdown:
awk 'NR%4==0' barcode57.fastq— extracts every 4th line (quality lines)awk '{print length($0)}'— prints the length of each quality string = read lengthsort -n— sorts numericallytail -n 5— shows the 5 longest reads
Discussion answer: grep cannot be used because quality lines do not have a consistent unique prefix. NR%4==0 uses the mathematical property of FASTQ (every 4th line) rather than content matching.
Exercise Set 3: Scripting and Automation
Section titled “Exercise Set 3: Scripting and Automation”3.1 — From Command to Script
Section titled “3.1 — From Command to Script”#!/usr/bin/env bash
echo -e "File\t\tReads"for f in Training/long_reads/*.fastq; do reads=$(( $(wc -l < "$f") / 4 )) echo -e "$(basename "$f")\t$reads"doneMake executable: chmod +x count_reads.sh
Run: ./count_reads.sh
Discussion answer: If run from a different directory, Training/long_reads/*.fastq will fail because it is a relative path. Fix: use an absolute path or pass the directory as an argument (Exercise 3.2).
3.2 — Making Scripts Flexible with Arguments
Section titled “3.2 — Making Scripts Flexible with Arguments”#!/usr/bin/env bash
DIR=$1
echo -e "File\t\tReads"for f in "$DIR"/*.fastq; do reads=$(( $(wc -l < "$f") / 4 )) echo -e "$(basename "$f")\t$reads"doneRun:
./count_reads.sh Training/long_reads/./count_reads.sh Training/short_reads/paired/Discussion answer: Without an argument, $1 is empty. The loop glob /*.fastq expands to an empty match and the script may print nothing or an error. The user has no idea what went wrong.
3.3 — Adding Input Validation
Section titled “3.3 — Adding Input Validation”#!/usr/bin/env bash
if [ -z "$1" ]; then echo "Error: No directory provided." echo "Usage: ./count_reads.sh <directory>" exit 1fi
if [ ! -d "$1" ]; then echo "Error: '$1' is not a valid directory." exit 1fi
DIR=$1
echo -e "File\t\tReads"for f in "$DIR"/*.fastq; do reads=$(( $(wc -l < "$f") / 4 )) echo -e "$(basename "$f")\t$reads"done-d tests whether a path exists and is a directory.
Discussion answer: exit 0 signals success; exit 1 (or any non-zero) signals failure. Pipelines and workflow managers check exit codes to decide whether to proceed to the next step.
3.4 — Organising Samples by Condition
Section titled “3.4 — Organising Samples by Condition”#!/usr/bin/env bash
mkdir -p results/case results/healthy
for f in Training/short_reads/unpaired/*.fastq.gz; do name=$(basename "$f") if echo "$name" | grep -q "Case"; then cp "$f" results/case/ elif echo "$name" | grep -q "Healthy"; then cp "$f" results/healthy/ fidoneExpected result: 3 files in results/case/, 4 files in results/healthy/.
Discussion answer: For large files, use ln -s (symbolic links) instead of cp — this avoids duplicating hundreds of gigabytes of data while still making files appear in both locations.
3.5 — Logging Script Output
Section titled “3.5 — Logging Script Output”#!/usr/bin/env bash
LOG="read_counts_log.txt"DIR=${1:-.}
echo "=== Run: $(date) ===" >> "$LOG"
echo -e "File\t\tReads"for f in "$DIR"/*.fastq; do reads=$(( $(wc -l < "$f") / 4 )) echo -e "$(basename "$f")\t$reads"done | tee -a "$LOG"tee -a prints to stdout AND appends to the file.
Discussion answer: Appending (>>) causes problems when you need to know which run produced which results and the log grows unboundedly. Overwriting (>) is better when only the latest result matters. In practice, including a timestamp in the log (as above) gives the best of both worlds.
3.6 — Dataset Summary (Challenge)
Section titled “3.6 — Dataset Summary (Challenge)”Reference solution structure (many valid implementations exist):
#!/usr/bin/env bash
TRAINING=${1:-Training}OUTPUT="summary.txt"
{echo "=== Training Dataset Summary ==="echo "Date: $(date)"echo ""
echo "--- Long Reads ---"for f in "$TRAINING/long_reads/"*.fastq; do reads=$(( $(wc -l < "$f") / 4 )) size=$(ls -lh "$f" | awk '{print $5}') printf "%-30s %7d reads %s\n" "$(basename "$f")" "$reads" "$size"done
echo ""echo "--- Short Reads (Paired) ---"for f in "$TRAINING/short_reads/paired/"*.fastq; do reads=$(( $(wc -l < "$f") / 4 )) size=$(ls -lh "$f" | awk '{print $5}') printf "%-30s %7d reads %s\n" "$(basename "$f")" "$reads" "$size"done
echo ""echo "--- Short Reads (Unpaired) ---"for f in "$TRAINING/short_reads/unpaired/"*.fastq.gz; do size=$(ls -lh "$f" | awk '{print $5}') printf "%-40s %s\n" "$(basename "$f")" "$size"done
echo ""total=$(ls "$TRAINING/short_reads/unpaired/"*.fastq.gz | wc -l)cases=$(ls "$TRAINING/short_reads/unpaired/"*.fastq.gz | grep -c Case)healthy=$(ls "$TRAINING/short_reads/unpaired/"*.fastq.gz | grep -c Healthy)echo "Total unpaired samples: $total"echo "Case samples: $cases"echo "Healthy samples: $healthy"} | tee "$OUTPUT"Exercise Set 3: Conda and Environments
Section titled “Exercise Set 3: Conda and Environments”3.1 — Diagnosing Your Setup
Section titled “3.1 — Diagnosing Your Setup”conda --versionconda infoconda env listActive environment is shown with * in conda env list and displayed in the prompt as (envname).
Discussion answer: Installing everything in base risks version conflicts across projects, makes the environment fragile, and violates reproducibility — you cannot easily share just the packages needed for one project.
3.2 — Choosing the Right Channels
Section titled “3.2 — Choosing the Right Channels”conda config --add channels biocondaconda config --add channels conda-forgeconda config --set channel_priority strictconda config --show channelsDiscussion answer: The colleague may be installing from the defaults channel which has commercial licensing restrictions. They may also get outdated or incompatible versions of bioinformatics tools that are better maintained in bioconda.
3.3 — Creating a Reproducible Environment
Section titled “3.3 — Creating a Reproducible Environment”conda create -n qc_env python=3.10conda activate qc_envconda install -c bioconda -c conda-forge fastqc multiqcExport: conda env export > qc_env.yml
Recreate: conda env create -f qc_env.yml
Discussion answer: --no-builds removes platform-specific build strings, making the YAML file portable across Linux/Mac/Windows. Use --no-builds when sharing with colleagues on different systems.
3.4 — Debugging a Bad Install
Section titled “3.4 — Debugging a Bad Install”-
Strategies in order:
- Try installing one package at a time to isolate the conflict
- Specify the channel explicitly:
conda install -c bioconda -c conda-forge minimap2 samtools - Use
mamba(faster solver that handles conflicts better):mamba install minimap2 samtools - Create a fresh environment and install from scratch
-
They installed into
(base), polluting the base environment and risking breaking other workflows. -
conda list
Discussion answer: Bioinformatics tools often depend on compiled C/C++ libraries with strict version requirements (htslib, zlib, etc.). These system-level dependencies create more conflicts than pure Python packages.
3.5 — Environment Strategy
Section titled “3.5 — Environment Strategy”- One shared environment — same pipeline, tools are likely compatible
- Separate environments — explicit version requirement conflict
- One shared environment — single end-to-end pipeline
- Separate environments — you are deliberately testing version differences and need them isolated
Discussion answer: Too many environments waste disk space and make it hard to remember which environment to use. Too few environments cause version conflicts and reproducibility issues. Project-per-environment is usually the right balance.
3.6 — Connecting Conda to the Omics Workflow
Section titled “3.6 — Connecting Conda to the Omics Workflow”conda create -n bioinfo python=3.10conda activate bioinfoconda install -c bioconda -c conda-forge fastqc multiqc flye minimap2 samtoolsconda env export > bioinfo_environment.ymlCheck flye --version failure reasons:
- Environment not activated (
conda activate bioinfo) - Tool not installed in this environment (
conda list | grep flye)
Discussion answer: Give them: (1) bioinfo_environment.yml to recreate the environment, (2) conda env create -f bioinfo_environment.yml command, (3) conda activate bioinfo to use it.
Exercise Set 4: Omics Integration
Section titled “Exercise Set 4: Omics Integration”4.1 — Comparing Sequencing Technologies
Section titled “4.1 — Comparing Sequencing Technologies”head -n 8 Training/long_reads/barcode57.fastqhead -n 8 Training/short_reads/paired/SRR1553607_1.fastqExpected comparison:
| Property | Nanopore (barcode57) | Illumina (SRR1553607_1) |
|---|---|---|
| Header contains | UUID, timestamp, sequencing chemistry | Run accession, read number, read length |
| Approximate read length | Variable, 300–2000+ bp (count sequence characters) | Fixed 101 bp (stated in header) |
| Quality string appearance | Mixed low/high ASCII characters | Mostly uppercase letters (high quality) |
-
The most obvious visual difference: Nanopore sequences are much longer — you can see this just by looking at how many characters are on line 2 of each record. Illumina reads are a uniform short length.
-
Scenario answers:
- A 101 bp Illumina read cannot span a 400 bp repeat — the entire read falls inside the repeat and cannot be placed unambiguously.
- A 1500 bp Nanopore read can span the repeat and extend into unique flanking sequence on both sides, giving the assembler enough context to place it correctly.
- Choose Nanopore for this assembly.
Discussion answer: Even if Nanopore has a higher error rate, a long read that spans a repeat gives the assembler unambiguous placement information. Short reads that land entirely within a repeat look identical to reads from all other copies of that repeat — the assembler cannot tell which copy they came from.
4.2 — Coverage Estimation
Section titled “4.2 — Coverage Estimation”Using the provided average read lengths:
barcode57:
Coverage = (8,567 reads × 2,900 bp) / 5,000 bp = 24,844,300 / 5,000 = ~4,969×barcode58:
Coverage = (9,196 reads × 3,100 bp) / 5,000 bp = 28,507,600 / 5,000 = ~5,702×Combined:
Coverage = ((8,567 + 9,196) reads × ~3,000 bp average) / 5,000 bp ≈ ~10,659×Both samples individually far exceed 30×. Combining them is not necessary for depth — but might help if one sample has quality issues.
Discussion answer: Extremely high coverage can slow assemblers and in some cases introduce errors because there are many more reads to compare and error-correct. Some assemblers (like Canu) automatically subsample to a target depth. Flye is designed to handle very high Nanopore coverage well. The practical concern is mostly computation time.
4.3 — Running and Interpreting QC
Section titled “4.3 — Running and Interpreting QC”-
Commands:
Terminal window fastqc Training/short_reads/unpaired/*.fastq.gz -o QC_results/multiqc QC_results/ -o QC_results/ -
Graph meanings:
Graph What it tells you Per base sequence quality Whether quality drops along the read length Adapter content Whether synthetic adapter sequences contaminate your reads Per sequence GC content Whether the GC distribution matches the expected organism Overrepresented sequences Whether any sequence dominates (e.g., adapters, rRNA) -
Decisions:
- Sample A (quality drop after pos 85): Trim — quality tail at the 3’ end is normal for Illumina; trim to the point where quality drops
- Sample B (92% adapter content): Trim — adapter contamination this high will ruin any downstream analysis
- Sample C (bimodal GC): Investigate further — two GC peaks usually means two organisms (contamination or mixed infection); trimming will not fix this
- Sample D (Q30+ throughout): Keep as-is — this is ideal data
-
Options for low-quality Case sample: (a) trim more aggressively and recheck QC; (b) exclude and note it in methods — but this changes your case/control ratio from 3:4 to 2:4; (c) include it and accept that it may add noise. Never silently remove — always document.
Discussion answer: Removing a Case sample because of quality could introduce selection bias. You are now comparing 2 Cases to 4 Healthy samples, and the removed Case might have been the most biologically important one. Any removal decision must be justified in the methods section.
4.4 — Choosing an Assembly Strategy
Section titled “4.4 — Choosing an Assembly Strategy”-
The problem:
barcode57.fastqcontains a mix of cassava host reads and viral reads. The cassava genome is ~750 Mb. Running Flye on the raw data would attempt to assemble the entire cassava genome alongside the virus, producing a massive, confusing output where viral contigs are buried among thousands of cassava contigs. -
Contamination removal logic:
- Reference: cassava genome (download from NCBI/Phytozome)
- Tool:
minimap2maps the reads to the cassava reference - Keep: reads that did not map to cassava — these are the virus reads
- Output: a cleaned FASTQ file containing only unmapped reads, ready for Flye
-
Choose de novo when: you suspect a novel or divergent virus strain; your field isolate may differ enough from the reference that mapping would miss parts of it. Choose reference-based when: you know the exact strain you are looking for; you need a quick comparison against a known genome for routine surveillance.
-
--genome-size 5ktells Flye the expected genome size so it can set internal parameters (like minimum overlap length) appropriately. If badly wrong (e.g.,--genome-size 750minstead of5k), the assembler would use settings designed for a large genome, likely producing poor results or running for an extremely long time.
Discussion answer: Skipping contamination removal means the assembler sees ~750 Mb of cassava data and ~0.03 Mb of virus. The cassava assembly would dominate completely. The viral signal would be so diluted that most viral reads would either be discarded or misassembled. You would spend days waiting for an assembly that tells you nothing about the virus.
4.5 — Interpreting Assembly Metrics
Section titled “4.5 — Interpreting Assembly Metrics”-
Assembly A is better. It has far fewer contigs (3 vs 47), a much larger largest contig (4,890 bp vs 1,203 bp), a high N50, and a total length close to the expected genome size (6,100 bp vs ~5,000 bp expected). Assembly B is fragmented and bloated.
-
N50 in plain language: Imagine lining up all your contigs from longest to shortest and adding up their lengths one by one. The N50 is the length of the contig you were holding when you reached the halfway point of your total assembly. A high N50 means a few long contigs make up most of the assembly — which is what you want.
-
Assembly B total length explanations: (a) contamination was not fully removed — cassava or other host sequence was assembled alongside the virus, inflating the total; (b) the assembler made mis-joins through repeat regions, creating chimeric contigs and inflating length.
-
The third contig in Assembly A could be: DNA-B (the second segment of the begomovirus genome, which is a separate circular molecule), a satellite DNA molecule associated with the virus, or a small amount of residual host sequence that escaped contamination removal.
Discussion answer: Consider a genome made of two copies of the same sequence. The assembler might join them into one long contig with a very high N50 — but the assembly is wrong because it collapsed what should be two separate loci. N50 should always be interpreted alongside biological expectations (expected genome size, expected number of segments).
4.6 — End-to-End Pipeline Design (Challenge)
Section titled “4.6 — End-to-End Pipeline Design (Challenge)”Example reference design:
project/├── raw/ # original barcode FASTQ files (never modified)├── qc/ # FastQC and MultiQC outputs├── clean/ # host-removed reads per sample├── assembly/ # Flye output per sample├── blast/ # BLAST annotation results├── logs/ # one log file per step per sample└── results/ # final summary table and consensus sequencesWorkflow:
- QC — Run FastQC on all 10 raw files; run MultiQC to aggregate; flag samples with average quality below Q15 or >50% adapter content
- Contamination removal — Map each sample to cassava genome with minimap2; extract unmapped reads (the virus reads); save cleaned FASTQ to
clean/ - Coverage check — Use
wc -lon each cleaned file, divide by 4 for read count, multiply by average read length, divide by 5000; flag samples below 30× - Assembly — Run Flye on each cleaned sample; review
assembly_info.txtfor contig count, N50, and total length - Confirmation — BLAST top contigs against NCBI or a local ACMV reference; require ≥90% identity; anything below is not confirmed ACMV
- Automation — Wrap steps 2–5 in a
forloop with the sample name as a variable (from Module 1 scripting) - Final output — A table with columns: sample ID | raw reads | clean reads | coverage | N50 | top BLAST hit | % identity | ACMV infected (yes/no)
Discussion answer: All per-sample steps (QC, contamination removal, assembly) can run in parallel across samples because each sample is independent. Within a single sample, the steps must run in order: QC → clean → assemble → BLAST. The dependency graph looks like 10 independent vertical pipelines running side by side, not one long chain.