Lesson 2 - Text Processing, Pipes, and Redirection
Learning Objectives
- Explain how pipes connect commands in bioinformatics workflows.
- Use redirection to save command output into files.
- Count FASTQ reads with wc and the 4-line structure.
- Use grep to filter filenames and text.
- Use sort and uniq to summarize lists.
Conceptual Overview
Pipes (|) send the output of one command directly into another. This is the foundation of many bioinformatics one-liners. Redirection (> and >>) lets you save command output to a file or append to it. Together, these tools let you build small, reproducible data summaries without opening a spreadsheet.
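You can try a pipe on throwaway input before touching any real data. A minimal sketch (the three-line input is invented, no training files needed):

```shell
# printf emits three lines; the pipe feeds them into wc -l, which counts them
printf 'a\nb\nc\n' | wc -l
# prints 3
```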
FASTQ files store reads in 4-line blocks:
- Header
- Sequence
- Separator (+)
- Quality string
If you count lines with wc -l, you can estimate reads by dividing by 4.
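The 4-line structure and the divide-by-4 estimate can be checked on a tiny, hand-made FASTQ file (mini.fastq and its two reads are invented for illustration):

```shell
# Build a tiny 2-read FASTQ: each read is header, sequence, '+' separator, quality
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > mini.fastq

# Count lines, then divide by 4 to get the number of reads
lines=$(wc -l < mini.fastq)
echo $(( lines / 4 ))
# prints 2
```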
Worked Examples
1) Count reads with wc
wc -l counts the lines in a file. Divide by 4 to estimate reads.
wc -l Training/short_reads/paired/SRR1553607_1.fastq

Output:

813780 Training/short_reads/paired/SRR1553607_1.fastq

2) Count Case samples using a pipe
grep filters lines that match the word Case, and wc -l counts those lines.
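If you don't have the training data at hand, the same pattern can be reproduced with an invented listing (the filenames below are made up):

```shell
# Simulate a directory listing, keep only the 'Case' entries, and count them
printf 'S1_Case.fastq.gz\nS2_Healthy.fastq.gz\nS3_Case.fastq.gz\n' \
  | grep Case | wc -l
# prints 2
```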
ls Training/short_reads/unpaired | grep Case | wc -l

Output:

3

3) Count Healthy samples using a pipe
grep filters lines that match the word Healthy, and wc -l counts those lines.
ls Training/short_reads/unpaired | grep Healthy | wc -l

Output:

4

4) Sort and deduplicate a list
sort arranges filenames alphabetically, and uniq removes duplicate adjacent lines.
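The word "adjacent" matters: uniq only collapses duplicates that sit next to each other, which is why sorting first is the usual idiom. A quick sketch with made-up input:

```shell
# Unsorted input: the duplicate 'b' lines are not adjacent, so uniq keeps both
printf 'b\na\nb\n' | uniq
# prints: b, a, b (nothing removed)

# Sorting first makes duplicates adjacent, so uniq can drop them
printf 'b\na\nb\n' | sort | uniq
# prints: a, b
```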
To better understand sort and uniq, let's create a small dataset with duplicates. For now, don't worry about what the following script does; we will cover that in later lessons.
- First, create a script with nano. Please create it inside the Training directory.
Tip: You can check your current directory with pwd and change directories with cd <directory_name>
The command below creates and opens a new file called generate_data.sh
nano generate_data.sh

- Now copy and paste the following lines into nano:
#!/bin/bash

# Define the output file name
FILE="attendees.csv"

# Array of departments to create repetitions
DEPARTMENTS=("Bioinformatics" "Virology" "Genetics" "Computer Science" "Bioinformatics")

# Array of names
NAMES=("Olabode" "Alice" "Zainab" "Chidi" "Bayo" "Alice" "Zainab")

echo "Generating random data in $FILE..."

# Create or clear the file and add a header
echo "Name,Department" > "$FILE"

# Loop to generate 200 lines of random data
for i in {1..200}
do
  # Pick a random name and department from the arrays
  NAME=${NAMES[$RANDOM % ${#NAMES[@]}]}
  DEPT=${DEPARTMENTS[$RANDOM % ${#DEPARTMENTS[@]}]}

  # Append to the CSV
  echo "$NAME,$DEPT" >> "$FILE"
done

echo "Done! You can now use 'head $FILE' to see the messy data."

- Save and exit nano (Ctrl + O, Enter, Ctrl + X).
- (optional) Make the script executable:
chmod +x generate_data.sh

This allows you to run the script directly with ./generate_data.sh, but it might be easier to just run it with bash as shown below.
- Run the script to generate the data. The command below runs the script we just created and generates a file called attendees.csv in your current directory.
bash generate_data.sh

After running the script, you should see a new file named attendees.csv in your current directory. The directory structure should look like this:
├── attendees.csv
├── generate_data.sh
├── long_reads
└── short_reads

- Let's quickly check the contents of the generated file by looking at the first 10 lines:
head attendees.csv

Output:

Name,Department
Olabode,Virology
Zainab,Virology
Alice,Virology
Zainab,Bioinformatics
Chidi,Genetics
Alice,Virology
Bayo,Computer Science
Bayo,Bioinformatics
Alice,Computer Science

PS: Because the data is randomly generated, your output will likely differ from the example above. You will, however, see the same structure and format.
- Let's see if we remember the command to count how many lines are in the file:

wc -l attendees.csv

Output:

201 attendees.csv

- Now we can use sort and uniq to get a list of the unique entries in the file. Run the command below:
sort attendees.csv | uniq

Output:

Alice,Bioinformatics
Alice,Computer Science
Alice,Genetics
Alice,Virology
Bayo,Bioinformatics
Bayo,Computer Science
Bayo,Genetics
Bayo,Virology
Chidi,Bioinformatics
Chidi,Computer Science
Chidi,Genetics
Chidi,Virology
Name,Department
Olabode,Bioinformatics
Olabode,Computer Science
Olabode,Genetics
Olabode,Virology
Zainab,Bioinformatics
Zainab,Computer Science
Zainab,Genetics
Zainab,Virology

What do you notice about the output?
Note that the header “Name,Department” is also included in the output. This is because sort and uniq treat it like any other line.
Did you also notice that the number of lines in the output is less than the number of lines in the input file? This is because there were duplicate entries in the original file, and uniq removed those duplicates.
You can pipe the output of uniq to wc -l to count the number of unique entries. Alternatively, you can use the -u option with sort to get unique lines directly:

sort -u attendees.csv

If you are interested in knowing how many times each unique entry appears in the file, you can use the -c option with uniq after sorting:
sort attendees.csv | uniq -c

Output:

29 Alice,Bioinformatics
10 Alice,Computer Science
10 Alice,Genetics
9 Alice,Virology
7 Bayo,Bioinformatics
9 Bayo,Computer Science
6 Bayo,Genetics
5 Bayo,Virology
13 Chidi,Bioinformatics
9 Chidi,Computer Science
5 Chidi,Genetics
6 Chidi,Virology
1 Name,Department
13 Olabode,Bioinformatics
4 Olabode,Computer Science
5 Olabode,Genetics
8 Olabode,Virology
24 Zainab,Bioinformatics
10 Zainab,Computer Science
7 Zainab,Genetics
11 Zainab,Virology

Just to confirm how many entries there were based on the counts, you can sum them up using awk: a super powerful command which we will not cover in this module, but just so you see its wonders, run the command below:
sort attendees.csv | uniq -c | awk '{sum += $1} END {print sum}'

Output:

201

That was a quick overview of sort and uniq. We will explore these commands more in future lessons. Now let's proceed to the next section.
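As an aside, the awk one-liner above simply sums the first column of whatever it receives. A toy version with made-up input shaped like uniq -c output:

```shell
# Each input line starts with a count; awk accumulates column 1 and prints the total
printf '2 apples\n3 oranges\n' | awk '{sum += $1} END {print sum}'
# prints 5
```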
5) Save output with redirection
> redirects output to a file and overwrites it if it exists.
No output appears because it is written to Module_1_Linux/healthy_files.txt.
ls Training/short_reads/unpaired | grep Healthy > Module_1_Linux/healthy_files.txt

NB: If you run the above command again, it will overwrite the contents of healthy_files.txt. To add to the file instead of overwriting, use >>, as shown below.
>> appends output to the end of a file.
After this, the file contains Healthy and Case filenames, in the order you wrote them.
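The difference between the two operators is easy to see with a throwaway file (demo.txt is just an example name):

```shell
echo "first"  > demo.txt   # '>' creates the file (or overwrites it)
echo "second" > demo.txt   # '>' again: the file now holds only "second"
echo "third" >> demo.txt   # '>>' appends: "second" then "third"
cat demo.txt
# prints:
# second
# third
```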
ls Training/short_reads/unpaired | grep Case >> Module_1_Linux/healthy_files.txt

Exercises
- Use wc -l to count lines in Training/short_reads/paired/SRR1553607_2.fastq. Estimate the read count by dividing by 4.
- Use a pipe to count how many filenames in Training/short_reads/unpaired contain Healthy. Expected output: 4.
- Use a pipe to count how many filenames in Training/short_reads/unpaired contain Case. Expected output: 3.
- Use sort and uniq to list unique filenames in Training/short_reads/unpaired.
- Challenge: Use wc -l with both paired files in one command and redirect the output to Module_1_Linux/read_line_counts.txt.
- Use sort and uniq -c to get a list of unique entries and then save that list to a file called attendee_summary.txt.
Solutions
Solution 1
wc -l counts lines in the file. Divide by 4 to estimate reads.
wc -l Training/short_reads/paired/SRR1553607_2.fastq

Output:

813780 Training/short_reads/paired/SRR1553607_2.fastq

Solution 2
grep filters for Healthy filenames, and wc -l counts them.
ls Training/short_reads/unpaired | grep Healthy | wc -l

Output:

4

Solution 3
grep filters for Case filenames, and wc -l counts them.
ls Training/short_reads/unpaired | grep Case | wc -l

Output:

3

Solution 4
sort orders the list, and uniq removes duplicates.
ls Training/short_reads/unpaired | sort | uniq

Output:

SRR11282407_Case.fastq.gz
SRR11282408_Healthy.fastq.gz
SRR11282409_Healthy.fastq.gz
SRR11282410_Case.fastq.gz
SRR11282411_Healthy.fastq.gz
SRR11282412_Case.fastq.gz
SRR11282413_Healthy.fastq.gz
download.sh

Solution 5
wc -l can take multiple files and prints one line per file plus a total. No output appears because it is written to Module_1_Linux/read_line_counts.txt.
wc -l Training/short_reads/paired/SRR1553607_1.fastq Training/short_reads/paired/SRR1553607_2.fastq > Module_1_Linux/read_line_counts.txt

Solution 6
sort orders the list, uniq -c counts occurrences, and > saves to a file.
sort attendees.csv | uniq -c > attendee_summary.txt