Lesson 5: Reference-Based Assembly
1. Lesson Goal
In this lesson, you will map long reads to a combined reference that contains:
- ACMV DNA-A
- ACMV DNA-B
You will then build a consensus sequence and produce alignment summary files. This is a reference-guided assembly workflow.
2. Why Use Reference-Based Assembly Here
Reference-based assembly is ideal when you have a trusted reference and want to answer questions like:
- Can the full genome be recovered when de novo assembly fails?
- Which parts of DNA-A and DNA-B are covered by reads?
- What sample-specific differences exist relative to the reference?
Compared with de novo assembly, this method is often faster and easier to interpret for known genomes. The tradeoff is that you may miss novel sequences or rearrangements not present in the reference. The result is also prone to reference bias if the sample differs significantly from the reference.
3. Input Files for This Lesson
We will use the cleaned FASTQ files from Lesson 4. These are the same reads used for de novo assembly, but now they will be mapped to a combined ACMV reference.
- Training/long_reads/denovo/clean/barcode57_clean.fastq
- Training/long_reads/denovo/clean/barcode58_clean.fastq (if available; otherwise continue with barcode57 only)
Reference you will provide:
- Download the ACMV DNA-A and DNA-B sequences from NCBI yourself, combine them into a single FASTA file, and place it in the reference folder.
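One simple way to combine the two downloads is plain concatenation. The sketch below is only an illustration: the file names acmv_dna_a.fa and acmv_dna_b.fa and the short sequences are hypothetical stand-ins for the records you download from NCBI, written to a temporary folder so the example runs on its own.

```shell
# Toy stand-ins for the two NCBI downloads (hypothetical names and sequences).
workdir=$(mktemp -d)
printf '>ACMV_DNA-A_example\nACGTACGT\n' > "$workdir/acmv_dna_a.fa"
printf '>ACMV_DNA-B_example\nTTGGCCAA\n' > "$workdir/acmv_dna_b.fa"

# Concatenate the two records into one combined reference.
# Each input should end with a newline, otherwise the second header
# would be glued onto the last sequence line of the first file.
cat "$workdir/acmv_dna_a.fa" "$workdir/acmv_dna_b.fa" > "$workdir/acmv_combined.fa"

# A combined DNA-A + DNA-B reference should contain exactly two header lines.
n_records=$(grep -c '^>' "$workdir/acmv_combined.fa")
echo "records in combined reference: $n_records"
grep '^>' "$workdir/acmv_combined.fa"
```

With your real downloads, skip the two printf lines, point cat at your own files, and check that the header count is 2 before you continue.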
4. Folder Organization for Reference-Based Work
Keep this workflow separate from de novo outputs.
cd Training/long_reads
mkdir -p reference_based/{raw,reference,mapping,consensus,stats,logs}
Copy the cleaned reads into the reference-based raw folder (skip the second command if barcode58 is not available).
cp denovo/clean/barcode57_clean.fastq reference_based/raw/
cp denovo/clean/barcode58_clean.fastq reference_based/raw/
5. Software environment
Use the same environment from Lesson 4 if already created.
conda activate bioinfo
6. Prepare the combined ACMV reference
Place your combined ACMV FASTA into reference_based/reference.
Example expected file path:
Training/long_reads/reference_based/reference/acmv_combined.fa
cp <PATH_TO_YOUR_ACMV_COMBINED_FASTA> reference_based/reference/acmv_combined.fa
7. Map long reads to ACMV DNA-A + DNA-B
Map reads with Nanopore preset.
minimap2 -t 2 -ax map-ont \
    reference_based/reference/acmv_combined.fa \
    reference_based/raw/barcode57_clean.fastq \
    > reference_based/mapping/barcode57_vs_acmv.sam
Convert, sort, and index BAM.
samtools view -b reference_based/mapping/barcode57_vs_acmv.sam > reference_based/mapping/barcode57_vs_acmv.bam
samtools sort reference_based/mapping/barcode57_vs_acmv.bam -o reference_based/mapping/barcode57_vs_acmv.sorted.bam
samtools index reference_based/mapping/barcode57_vs_acmv.sorted.bam
8. Evaluate mapping quality and coverage
Generate key alignment statistics.
samtools flagstat reference_based/mapping/barcode57_vs_acmv.sorted.bam > reference_based/stats/flagstat.txt
samtools idxstats reference_based/mapping/barcode57_vs_acmv.sorted.bam > reference_based/stats/idxstats.txt
samtools depth reference_based/mapping/barcode57_vs_acmv.sorted.bam > reference_based/stats/depth.txt
Check outputs quickly.
head -n 20 reference_based/stats/flagstat.txt
head -n 20 reference_based/stats/idxstats.txt
head -n 20 reference_based/stats/depth.txt
Interpretation guide:
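- flagstat.txt tells you the overall alignment rate.
- idxstats.txt shows the number of reads mapped to each reference sequence (DNA-A vs DNA-B).
- depth.txt shows per-position coverage, useful for finding low-coverage regions. By default, samtools depth omits positions with zero coverage; add -a to report every position.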
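The two tabular files can be condensed into a few numbers per segment. The following is a minimal sketch, assuming the standard tab-separated layouts (idxstats: reference name, length, mapped reads, unmapped reads; depth: reference name, position, depth). The function names and the depth cutoff of 10 are illustrative choices, not part of the lesson.

```shell
# Share of mapped reads per reference sequence, from samtools idxstats output.
summarize_idxstats() {
  awk -F'\t' '
    $1 != "*" { mapped[$1] = $3; total += $3 }   # the "*" row holds unplaced reads
    END {
      for (ref in mapped)
        printf "%s\t%d\t%.1f%%\n", ref, mapped[ref],
               (total ? 100 * mapped[ref] / total : 0)
    }
  ' "$1" | sort
}

# Mean depth and number of low-depth positions per reference, from samtools depth output.
# Only positions present in the file are counted, so zero-coverage positions
# are included only if depth.txt was produced with -a.
summarize_depth() {
  depth_file=$1
  min_depth=${2:-10}   # illustrative cutoff, choose your own
  awk -F'\t' -v min="$min_depth" '
    {
      n[$1]++
      sum[$1] += $3
      if ($3 < min) low[$1]++
    }
    END {
      for (ref in n)
        printf "%s\tpositions=%d\tmean_depth=%.1f\tbelow_%d=%d\n",
               ref, n[ref], sum[ref] / n[ref], min, low[ref]
    }
  ' "$depth_file" | sort
}

# Run from Training/long_reads once the stats files exist.
if [ -f reference_based/stats/idxstats.txt ]; then
  summarize_idxstats reference_based/stats/idxstats.txt
fi
if [ -f reference_based/stats/depth.txt ]; then
  summarize_depth reference_based/stats/depth.txt 10
fi
```

A strongly uneven read share between DNA-A and DNA-B, or many positions below the cutoff, is a signal to inspect that segment before trusting the consensus.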
9. Build a consensus sequence
Generate a consensus FASTA. The samtools consensus subcommand requires samtools 1.16 or later.
samtools consensus \
    -o reference_based/consensus/barcode57_acmv_consensus.fa \
    reference_based/mapping/barcode57_vs_acmv.sorted.bam
Preview consensus output.
head -n 40 reference_based/consensus/barcode57_acmv_consensus.fa
10. Final Output Checklist
Expected important files:
- reference_based/mapping/barcode57_vs_acmv.sorted.bam
- reference_based/stats/flagstat.txt
- reference_based/stats/idxstats.txt
- reference_based/stats/depth.txt
- reference_based/consensus/barcode57_acmv_consensus.fa
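The checklist can also be verified from the command line. This is a small sketch rather than part of the lesson workflow: check_outputs is a hypothetical helper name, and it only tests that each file exists and is non-empty, not that its contents are correct.

```shell
# Report whether each expected output exists and is non-empty.
# Takes the reference_based folder as its only argument.
check_outputs() {
  base=$1
  missing=0
  for f in \
    mapping/barcode57_vs_acmv.sorted.bam \
    stats/flagstat.txt \
    stats/idxstats.txt \
    stats/depth.txt \
    consensus/barcode57_acmv_consensus.fa
  do
    if [ -s "$base/$f" ]; then
      echo "OK       $f"
    else
      echo "MISSING  $f"
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]   # exit status 0 only when nothing is missing
}

# Run from Training/long_reads.
check_outputs reference_based || echo "Some expected outputs are missing or empty."
```

Any line marked MISSING points to the step that needs to be rerun.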
11. Key Take-Home Messages
- Reference-guided assembly is efficient for known viral genomes.
- Mapping statistics are essential to judge confidence, not optional.
- Consensus sequence quality depends on read depth and alignment quality.
- Keeping de novo and reference-based folders separate prevents analysis confusion.