
Lesson 2 - Text Processing, Pipes, and Redirection

  • Explain how pipes connect commands in bioinformatics workflows.
  • Use redirection to save command output into files.
  • Count FASTQ reads with wc and the 4-line structure.
  • Use grep to filter filenames and text.
  • Use sort and uniq to summarize lists.

Pipes (|) send the output of one command directly into another. This is the foundation of many bioinformatics one-liners. Redirection (> and >>) lets you save command output to a file or append to it. Together, these tools let you build small, reproducible data summaries without opening a spreadsheet.
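As a minimal sketch of the two ideas together (the printf line and the filenames are invented stand-ins for the output of a real command such as ls):

```shell
# A hypothetical three-line listing stands in for the output of ls.
# The pipe (|) hands those lines to grep, which keeps only matches;
# the redirect (>) then writes the result to a file instead of the screen.
printf 'sample_Case.fastq\nsample_Healthy.fastq\nnotes.txt\n' | grep fastq > fastq_files.txt
cat fastq_files.txt   # shows the two .fastq names
rm fastq_files.txt    # remove the demo file
```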

FASTQ files store reads in 4-line blocks:

  1. Header
  2. Sequence
  3. + separator
  4. Quality string

If you count lines with wc -l, you can estimate reads by dividing by 4.

wc -l uses the -l option to count lines in the file. Divide by 4 to estimate reads.

Terminal window
wc -l Training/short_reads/paired/SRR1553607_1.fastq

Output:

813780 Training/short_reads/paired/SRR1553607_1.fastq
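The division by 4 can be done by the shell itself. Here is a small sketch using an invented stand-in file, demo.fastq, so the arithmetic is easy to check:

```shell
# Two fake reads = 8 lines, following the 4-line FASTQ structure.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTAA\n+\nIIII\n' > demo.fastq
# 'wc -l < file' prints only the number, without the filename.
lines=$(wc -l < demo.fastq)
echo $(( lines / 4 ))   # integer division by 4: prints 2
rm demo.fastq           # remove the demo file
```

The same pattern works on the real file: 813780 / 4 = 203445 reads.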

grep filters lines that match the word Case, and wc -l counts those lines.

Terminal window
ls Training/short_reads/unpaired | grep Case | wc -l

Output:

3

grep filters lines that match the word Healthy, and wc -l counts those lines.

Terminal window
ls Training/short_reads/unpaired | grep Healthy | wc -l

Output:

4

sort arranges filenames alphabetically, and uniq removes duplicate adjacent lines.

To better understand sort and uniq, let's create a small dataset with duplicates. For now, don't worry about what the following script does; we will cover scripting in later lessons.

  1. First, create a script with nano. Create it inside the Training directory.

Tip: You can check your current directory with pwd and change directories with cd <directory_name>

The command below creates and opens a new file called generate_data.sh:

Terminal window
nano generate_data.sh
  2. Now copy and paste the following lines into nano:
#!/bin/bash
# Define the output file name
FILE="attendees.csv"
# Array of departments (with a repeat, to create duplicates)
DEPARTMENTS=("Bioinformatics" "Virology" "Genetics" "Computer Science" "Bioinformatics")
# Array of names (also with repeats)
NAMES=("Olabode" "Alice" "Zainab" "Chidi" "Bayo" "Alice" "Zainab")
echo "Generating random data in $FILE..."
# Create or clear the file and add a header
echo "Name,Department" > "$FILE"
# Loop to generate 200 lines of random data
for i in {1..200}
do
  # Pick a random name and department from the arrays
  NAME=${NAMES[$RANDOM % ${#NAMES[@]}]}
  DEPT=${DEPARTMENTS[$RANDOM % ${#DEPARTMENTS[@]}]}
  # Append to the CSV
  echo "$NAME,$DEPT" >> "$FILE"
done
echo "Done! You can now use 'head $FILE' to see the messy data."
  3. Save and exit nano (Ctrl + O, Enter, Ctrl + X).
  4. (Optional) Make the script executable:
Terminal window
chmod +x generate_data.sh

This allows you to run the script directly with ./generate_data.sh. It may be simpler to just run it with bash, as shown below.

  5. Run the script to generate the data:
  • The command below runs the script we just created and generates a file called attendees.csv in your current directory.
Terminal window
bash generate_data.sh

After running the script, you should see a new file named attendees.csv in your current directory. The directory structure should look like this:

├── attendees.csv
├── generate_data.sh
├── long_reads
└── short_reads
  6. Let's quickly check the contents of the generated file by looking at the first 10 lines:
Terminal window
head attendees.csv

Output:

Name,Department
Olabode,Virology
Zainab,Virology
Alice,Virology
Zainab,Bioinformatics
Chidi,Genetics
Alice,Virology
Bayo,Computer Science
Bayo,Bioinformatics
Alice,Computer Science

PS: because the data is randomly generated, your output will likely differ from the example above. You will, however, see the same structure and format.

  7. Let's see if we remember the command to count how many lines are in the file:
Terminal window
wc -l attendees.csv
Output:

201 attendees.csv
  8. Now we can use sort and uniq to get a list of the unique entries in the file. Run the command below:
Terminal window
sort attendees.csv | uniq

Output:

Alice,Bioinformatics
Alice,Computer Science
Alice,Genetics
Alice,Virology
Bayo,Bioinformatics
Bayo,Computer Science
Bayo,Genetics
Bayo,Virology
Chidi,Bioinformatics
Chidi,Computer Science
Chidi,Genetics
Chidi,Virology
Name,Department
Olabode,Bioinformatics
Olabode,Computer Science
Olabode,Genetics
Olabode,Virology
Zainab,Bioinformatics
Zainab,Computer Science
Zainab,Genetics
Zainab,Virology

What do you notice about the output? Note that the header “Name,Department” is also included in the output. This is because sort and uniq treat it like any other line.

Did you also notice that the number of lines in the output is less than the number of lines in the input file? This is because there were duplicate entries in the original file, and uniq removed those duplicates.
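One detail worth seeing for yourself: uniq only collapses duplicates that sit on adjacent lines, which is why we sort first. A tiny sketch with invented names:

```shell
# Unsorted input: the two 'Zainab' lines are not adjacent, so uniq keeps both.
printf 'Zainab\nAlice\nZainab\n' | uniq         # prints 3 lines
# Sorted first: the duplicates become adjacent and collapse into one.
printf 'Zainab\nAlice\nZainab\n' | sort | uniq  # prints 2 lines: Alice, Zainab
```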

You can pipe the output of uniq to wc -l to count the number of unique entries. Alternatively, you can use the -u option with sort to get unique lines directly:

Terminal window
sort -u attendees.csv

If you are interested in knowing how many times each unique entry appears in the file, you can use the -c option with uniq after sorting:

Terminal window
sort attendees.csv | uniq -c

Output:

29 Alice,Bioinformatics
10 Alice,Computer Science
10 Alice,Genetics
9 Alice,Virology
7 Bayo,Bioinformatics
9 Bayo,Computer Science
6 Bayo,Genetics
5 Bayo,Virology
13 Chidi,Bioinformatics
9 Chidi,Computer Science
5 Chidi,Genetics
6 Chidi,Virology
1 Name,Department
13 Olabode,Bioinformatics
4 Olabode,Computer Science
5 Olabode,Genetics
8 Olabode,Virology
24 Zainab,Bioinformatics
10 Zainab,Computer Science
7 Zainab,Genetics
11 Zainab,Virology
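A common follow-up pattern, shown here as an optional sketch, is to rank those counts by piping uniq -c into a second, numeric reverse sort:

```shell
# Rank the (count, entry) pairs produced by 'uniq -c':
# the second sort is numeric (-n) and reversed (-r), so the most
# frequent entries come first; head keeps the top 5.
sort attendees.csv | uniq -c | sort -nr | head -5
```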

Just to confirm how many entries there were based on the counts, you can sum them up using awk, a super powerful command that we will not cover in this module. Just so you see its wonders, run the command below:

Terminal window
sort attendees.csv | uniq -c | awk '{sum += $1} END {print sum}'

Output:

201

That was a quick overview of sort and uniq. We will explore these commands more in future lessons. Now let’s proceed to the next section.

> redirects output to a file and overwrites it if it exists. No output appears because it is written to Module_1_Linux/healthy_files.txt.

Terminal window
ls Training/short_reads/unpaired | grep Healthy > Module_1_Linux/healthy_files.txt

NB: If you run the above command again, it will overwrite the contents of healthy_files.txt. To add to the file instead of overwriting, use >>, as shown below.

>> appends output to the end of a file. After this, the file contains Healthy and Case filenames, in the order you wrote them.

Terminal window
ls Training/short_reads/unpaired | grep Case >> Module_1_Linux/healthy_files.txt
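You can see the difference between the two operators with a throwaway file (demo.txt below is just for illustration):

```shell
echo "first"  > demo.txt   # '>'  creates/overwrites: file has 1 line
echo "second" > demo.txt   # overwritten again: still 1 line ("second")
echo "third" >> demo.txt   # '>>' appends: file now has 2 lines
cat demo.txt               # prints: second, then third
rm demo.txt                # remove the demo file
```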
  1. Use wc -l to count lines in Training/short_reads/paired/SRR1553607_2.fastq. Estimate the read count by dividing by 4.

  2. Use a pipe to count how many filenames in Training/short_reads/unpaired contain Healthy. Expected output: 4.

  3. Use a pipe to count how many filenames in Training/short_reads/unpaired contain Case. Expected output: 3.

  4. Use sort and uniq to list unique filenames in Training/short_reads/unpaired.

  5. Challenge: Use wc -l with both paired files in one command and redirect the output to Module_1_Linux/read_line_counts.txt.

  6. Use sort and uniq -c to get a list of unique entries and then save that list to a file called attendee_summary.txt.

wc -l counts lines in the file. Divide by 4 to estimate reads.

Terminal window
wc -l Training/short_reads/paired/SRR1553607_2.fastq

Output:

813780 Training/short_reads/paired/SRR1553607_2.fastq

grep filters for Healthy filenames, and wc -l counts them.

Terminal window
ls Training/short_reads/unpaired | grep Healthy | wc -l

Output:

4

grep filters for Case filenames, and wc -l counts them.

Terminal window
ls Training/short_reads/unpaired | grep Case | wc -l

Output:

3

sort orders the list, and uniq removes duplicates.

Terminal window
ls Training/short_reads/unpaired | sort | uniq

Output:

SRR11282407_Case.fastq.gz
SRR11282408_Healthy.fastq.gz
SRR11282409_Healthy.fastq.gz
SRR11282410_Case.fastq.gz
SRR11282411_Healthy.fastq.gz
SRR11282412_Case.fastq.gz
SRR11282413_Healthy.fastq.gz
download.sh

wc -l can take multiple files and prints one line per file plus a total. No output appears because it is written to Module_1_Linux/read_line_counts.txt.

Terminal window
wc -l Training/short_reads/paired/SRR1553607_1.fastq Training/short_reads/paired/SRR1553607_2.fastq > Module_1_Linux/read_line_counts.txt

sort orders the list, uniq -c counts occurrences, and > saves to a file.

Terminal window
sort attendees.csv | uniq -c > attendee_summary.txt

WELLDONE!!! You have completed Lesson 2 - Text Processing, Pipes, and Redirection. Go forth and pipe :)!
