Lesson 5: Reference-Based Assembly

1. Lesson Goal

In this lesson, you will map long reads to a combined African cassava mosaic virus (ACMV) reference that contains both genome components:

ACMV DNA-A
ACMV DNA-B

You will then build a consensus sequence and produce alignment summary files. This is a reference-guided assembly workflow.

2. Why Use Reference-Based Assembly Here

Reference-based assembly is a good fit when you have a trusted reference and want to answer questions like:

Can the full genome be recovered when de novo assembly fails?
Which parts of DNA-A and DNA-B are covered by reads?
What sample-specific differences exist relative to the reference?

Compared with de novo assembly, this method is often faster and easier to interpret for known genomes. The tradeoff is reference bias: sequences or rearrangements that are absent from the reference can be missed, and the result tends to resemble the reference more than it should when the sample differs significantly from it.

3. Input Files for This Lesson

We will use the cleaned FASTQ files from Lesson 4. These are the same reads used for de novo assembly, but now they will be mapped to a combined ACMV reference.

Training/long_reads/denovo/clean/barcode57_clean.fastq
Training/long_reads/denovo/clean/barcode58_clean.fastq (if available; otherwise continue with barcode57 only)

Reference you will provide:

You will obtain this yourself: download the ACMV DNA-A and DNA-B sequences from NCBI, combine them into a single FASTA file, and place it in the reference folder.

4. Folder Organization for Reference-Based Work

Keep this workflow separate from de novo outputs.

cd Training/long_reads
mkdir -p reference_based/{raw,reference,mapping,consensus,stats,logs}

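Copy the cleaned reads from Lesson 4 into the reference-based raw folder. Copy barcode58 only if that file exists.

cp denovo/clean/barcode57_clean.fastq reference_based/raw/
cp denovo/clean/barcode58_clean.fastq reference_based/raw/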

5. Software Environment

Use the same environment as in Lesson 4 if you have already created it.

conda activate bioinfo
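
Before going further, it can help to confirm that the tools used in this lesson are actually on your PATH. The following is a minimal sketch; require_tools is a hypothetical helper name introduced here for illustration, not part of any package.

```shell
# Hypothetical helper: report any required command that is not on PATH.
# Returns 0 if every command is found, 1 otherwise.
require_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage, after `conda activate bioinfo`:
#   require_tools minimap2 samtools && echo "environment looks ready"
```

If a tool is reported as missing, check that the correct conda environment is active before continuing.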

6. Prepare the Combined ACMV Reference

Place your combined ACMV FASTA into reference_based/reference.

Example expected file path:

Training/long_reads/reference_based/reference/acmv_combined.fa

cp <PATH_TO_YOUR_ACMV_COMBINED_FASTA> reference_based/reference/acmv_combined.fa
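
If you downloaded DNA-A and DNA-B as two separate FASTA files, the combining step could look like the sketch below. The function name combine_fasta and the input file names in the usage comment are placeholders, assuming each download holds a single record. The function prints the number of records in the combined file, which should be 2.

```shell
# Hypothetical helper: concatenate two FASTA files into one combined
# reference and print how many records (header lines) the result contains.
# `awk 1` re-emits every line with a trailing newline, so a missing final
# newline in the first file cannot glue the second header onto a sequence line.
combine_fasta() {
  awk 1 "$1" "$2" > "$3"
  grep -c '^>' "$3"
}

# Usage (input names are placeholders for your own NCBI downloads):
#   combine_fasta acmv_dna_a.fasta acmv_dna_b.fasta \
#     reference_based/reference/acmv_combined.fa
```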

7. Map Long Reads to ACMV DNA-A + DNA-B

Map the reads with minimap2 using its Oxford Nanopore preset (map-ont). The -a flag requests SAM output and -t 2 uses two threads.

minimap2 -t 2 -ax map-ont \
  reference_based/reference/acmv_combined.fa \
  reference_based/raw/barcode57_clean.fastq \
  > reference_based/mapping/barcode57_vs_acmv.sam

Convert the SAM file to BAM, sort it by coordinate, and index the sorted BAM.

samtools view -b reference_based/mapping/barcode57_vs_acmv.sam > reference_based/mapping/barcode57_vs_acmv.bam

samtools sort reference_based/mapping/barcode57_vs_acmv.bam -o reference_based/mapping/barcode57_vs_acmv.sorted.bam

samtools index reference_based/mapping/barcode57_vs_acmv.sorted.bam

8. Evaluate Mapping Quality and Coverage

Generate key alignment statistics.

samtools flagstat reference_based/mapping/barcode57_vs_acmv.sorted.bam > reference_based/stats/flagstat.txt

samtools idxstats reference_based/mapping/barcode57_vs_acmv.sorted.bam > reference_based/stats/idxstats.txt

samtools depth -a reference_based/mapping/barcode57_vs_acmv.sorted.bam > reference_based/stats/depth.txt

Check outputs quickly.

head -n 20 reference_based/stats/flagstat.txt
head -n 20 reference_based/stats/idxstats.txt
head -n 20 reference_based/stats/depth.txt

Interpretation guide:

flagstat.txt reports how many reads mapped and the overall mapping percentage.
idxstats.txt shows the number of reads mapped to each reference sequence (DNA-A vs DNA-B).
depth.txt shows per-position coverage, which helps you locate low-coverage regions.
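
The raw tables are easier to interpret once condensed. The sketch below assumes the standard tab-separated layouts: idxstats has name, length, mapped, unmapped columns, and depth has name, position, depth columns. The function names and the default threshold of 10 are illustrative choices, not recommendations from the tools. Note that samtools depth leaves out zero-coverage positions unless it is run with -a, so without that flag the mean is computed over covered positions only.

```shell
# Hypothetical helper: per-reference mapped read count and share of all
# mapped reads, from samtools idxstats output. The "*" row is skipped.
idxstats_summary() {
  awk -F'\t' '
    $1 != "*" { name[++n] = $1; mapped[n] = $3; total += $3 }
    END {
      for (i = 1; i <= n; i++)
        printf "%s\t%d\t%.1f%%\n", name[i], mapped[i], (total ? 100 * mapped[i] / total : 0)
    }' "$1"
}

# Hypothetical helper: per-reference mean depth and number of positions
# below a threshold (second argument, default 10), from samtools depth output.
depth_summary() {
  awk -F'\t' -v min="${2:-10}" '
    !($1 in sum) { order[++n] = $1 }
    { sum[$1] += $3; count[$1]++; if ($3 + 0 < min + 0) low[$1]++ }
    END {
      for (i = 1; i <= n; i++) {
        r = order[i]
        printf "%s\tmean=%.1f\tbelow_%d=%d\n", r, sum[r] / count[r], min + 0, low[r]
      }
    }' "$1"
}

# Usage:
#   idxstats_summary reference_based/stats/idxstats.txt
#   depth_summary reference_based/stats/depth.txt 10
```

A strongly uneven split between DNA-A and DNA-B, or many positions below the threshold, is a cue to inspect those regions before trusting the consensus.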

9. Build a Consensus Sequence

Generate a consensus FASTA from the sorted BAM. The consensus subcommand is available in samtools 1.16 and later.

samtools consensus \
  -o reference_based/consensus/barcode57_acmv_consensus.fa \
  reference_based/mapping/barcode57_vs_acmv.sorted.bam

Preview consensus output.

head -n 40 reference_based/consensus/barcode57_acmv_consensus.fa
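
Looking at the first lines only tells you so much. A quick way to gauge consensus completeness is to count the length and the number of N bases per record, on the assumption that positions the tool could not call confidently are written as N. The sketch below uses a hypothetical helper name, fasta_n_report.

```shell
# Hypothetical helper: for each FASTA record, print its name,
# total sequence length, and the number of N bases (case-insensitive).
fasta_n_report() {
  awk '
    function report() {
      if (name != "") printf "%s\tlength=%d\tN=%d\n", name, len, n
    }
    /^>/ { report(); name = substr($1, 2); len = 0; n = 0; next }
    { seq = toupper($0); len += length(seq); n += gsub(/N/, "", seq) }
    END { report() }
  ' "$1"
}

# Usage:
#   fasta_n_report reference_based/consensus/barcode57_acmv_consensus.fa
```

Comparing each reported length with the corresponding reference length, and checking how many bases are N, gives a first impression of how much of DNA-A and DNA-B was recovered.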

10. Final Output Checklist

Expected important files:

reference_based/mapping/barcode57_vs_acmv.sorted.bam
reference_based/stats/flagstat.txt
reference_based/stats/idxstats.txt
reference_based/stats/depth.txt
reference_based/consensus/barcode57_acmv_consensus.fa
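
The checklist above can also be verified from the command line. This is a minimal sketch with a hypothetical helper, check_outputs, that flags any listed file that is missing or empty.

```shell
# Hypothetical helper: report expected output files that are missing or empty.
# Returns 0 only if every listed file exists and has a non-zero size.
check_outputs() {
  status=0
  for f in "$@"; do
    if [ -s "$f" ]; then
      echo "ok       $f"
    else
      echo "MISSING  $f"
      status=1
    fi
  done
  return "$status"
}

# Usage, from Training/long_reads:
#   check_outputs \
#     reference_based/mapping/barcode57_vs_acmv.sorted.bam \
#     reference_based/stats/flagstat.txt \
#     reference_based/stats/idxstats.txt \
#     reference_based/stats/depth.txt \
#     reference_based/consensus/barcode57_acmv_consensus.fa
```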

11. Key Take-Home Messages

Reference-guided assembly is efficient for known viral genomes.
Mapping statistics are essential, not optional, for judging confidence in the result.
Consensus sequence quality depends on read depth and alignment quality.
Keeping de novo and reference-based folders separate prevents analysis confusion.