AWK: The Structured Data Processor

AWK treats input as structured data organized into records (lines) and fields (columns). AWK processes input line-by-line, automatically splits each line into fields, and performs specific actions on those fields.

It’s best suited for:

Column extraction and manipulation - Pull specific columns from large datasets
Aggregation and calculations - Sum columns, compute averages, count occurrences
Conditional processing - Apply logic based on field values or patterns
Report generation - Format and transform data into structured output
Bioinformatics data processing - Handle FASTA, FASTQ, GFF, VCF, and other tabular formats

AWK Program Structure

An AWK program follows a predictable workflow:

1. Execute commands in the BEGIN block (optional - use for initialization)
2. Main loop: Read a line from input, split into fields, execute commands on that line
3. Repeat step 2 until end of file
4. Execute commands in the END block (optional - use for final calculations/reporting)
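
The workflow above can be sketched with a tiny program that uses all three blocks. The file name numbers.txt and its contents are made up for this illustration; the output is saved to a file only so that it can be inspected afterwards.

```shell
# Create a small hypothetical input file (three numbers, one per line).
cat << 'EOF' > numbers.txt
10
20
30
EOF

# BEGIN runs once before any input is read, the body runs once per line,
# and END runs once after the last line has been read.
awk 'BEGIN {print "Start"}
     {total += $1; print "Read line", NR, "value", $1}
     END {print "Total:", total}' numbers.txt > workflow_out.txt

cat workflow_out.txt
# Expected output:
# Start
# Read line 1 value 10
# Read line 2 value 20
# Read line 3 value 30
# Total: 60
```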

The BEGIN block

The BEGIN block is optional. It runs once, before any input is read, which makes it a good place to initialize variables or print a header:

awk 'BEGIN {print "Processing started..."}'

Body Block (Main Processing)

The body block executes on every input line. You can restrict execution with patterns:

awk '{print $1, $3}' file.txt              # Print the 1st and 3rd fields of every line
awk '/pattern/ {print}' file.txt           # Print only lines matching pattern
awk 'NR > 1 {print}' file.txt              # Skip the first record (line) and print the rest
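
Patterns can also be combined with && (and) and || (or). The following is a minimal sketch on a made-up file, samples.txt, whose column names and values are assumptions chosen for the illustration.

```shell
# Hypothetical table with a header line.
cat << 'EOF' > samples.txt
sample  tissue  reads
s1      liver   120
s2      brain   45
s3      liver   30
EOF

# Skip the header (NR > 1) and keep only liver rows with at least 100 reads.
awk 'NR > 1 && $2 == "liver" && $3 >= 100 {print $1}' samples.txt > pattern_out.txt

cat pattern_out.txt
# Expected output:
# s1
```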

END block

The END block is optional and runs once, after all input has been read.

awk 'END {printf "Number of lines read: %d\n", NR}' file.txt

Basic examples

Let’s start working with some text. We’ll start by creating a small text file using heredoc syntax.

cat << 'EOF' > example_text.txt
1)  Amit    Physics 80
2)  Rahul   Maths   90
3)  Shyam   Biology 87
4)  Kedar   English 85
5)  Hari    History 89
EOF

1. Add a header to the file using printf in the BEGIN block

awk 'BEGIN{printf "ID\tName\tSubject\tgrade\n"} {print}' example_text.txt

ID  Name    Subject grade
1)  Amit    Physics 80
2)  Rahul   Maths   90
3)  Shyam   Biology 87
4)  Kedar   English 85
5)  Hari    History 89

Key points:

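The entire AWK program is enclosed in single quotes so that the shell passes it to AWK unchanged. Inside double quotes, the shell would try to expand $1, $2, and so on before AWK ever sees them.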

{print} in the body block prints the entire line
printf allows formatted output
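
Because the program sits inside single quotes, a shell variable cannot simply be written into it. One common way around this is awk's -v option, which copies a value into an AWK variable. In the sketch below, the file quote_demo.txt, the shell variable min, and the AWK variable threshold are all hypothetical names.

```shell
cat << 'EOF' > quote_demo.txt
a 1
b 2
EOF

# Hypothetical shell variable holding a cutoff value.
min=2

# Inside the single quotes the shell leaves $1 and $2 alone, so AWK sees them as fields.
# The -v option assigns the shell value to the AWK variable named threshold.
awk -v threshold="$min" '$2 >= threshold {print $1}' quote_demo.txt > quote_out.txt

cat quote_out.txt
# Expected output:
# b
```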

2. Extract specific columns using field variables

awk '{print $2, $4}' example_text.txt

Amit 80
Rahul 90
Shyam 87
Kedar 85
Hari 89

Key points:

AWK automatically splits each line into fields. Access them with $1, $2, $3, etc.; $0 represents the whole line.

3. Perform calculations on numeric fields

awk '{sum += $4; count++} 
    END {printf "Average grade: %.1f\n", sum/count}' example_text.txt

Average grade: 86.2

Key points:

There is no need to initialize sum and count. AWK treats an unset variable as 0 in a numeric context (and as an empty string in a string context).
‘++’ is the increment operator: it adds 1 to a variable. It is common to many languages, although Python does not have it.
The % codes in the printf format string are format specifiers; each one is replaced by the corresponding argument. Common specifiers are:
- %s - string substitution
- %d - integer substitution
- %.2f - float substitution with 2 decimal places
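
As a small illustration of these specifiers, the sketch below formats a made-up two-line file (scores.txt). It also uses a field width, %-8s, which left-justifies a string in 8 columns; that detail is an addition for the example.

```shell
cat << 'EOF' > scores.txt
Amit 80
Rahul 90
EOF

# %-8s left-justifies a string in 8 columns, %d prints an integer,
# and %.2f prints a float with two decimal places.
awk '{printf "%-8s %d %.2f\n", $1, $2, $2 / 100}' scores.txt > printf_out.txt

cat printf_out.txt
# Expected output:
# Amit     80 0.80
# Rahul    90 0.90
```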

4. Use if/else statements to filter or transform data based on conditions

awk '$4 >= 85 {print $2, "passed"}' example_text.txt

Rahul passed
Shyam passed
Kedar passed
Hari passed

In the above example, the condition goes before and outside the braces of the body block. This is idiomatic AWK. Alternatively, we can express the same logic with an if statement inside the braces:

awk '{if ($4 >= 85) print $2, "passed"}' example_text.txt

Rahul passed
Shyam passed
Kedar passed
Hari passed
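
An if statement can also take an else branch, which a bare pattern cannot express directly. The sketch below assumes the same pass mark of 85 on a made-up file, grades.txt; the "failed" label is an assumption added for the illustration.

```shell
cat << 'EOF' > grades.txt
Amit 80
Rahul 90
Kedar 85
EOF

# Print one line per student: "passed" at 85 or above, otherwise "failed".
awk '{
    if ($2 >= 85) {
        print $1, "passed"
    } else {
        print $1, "failed"
    }
}' grades.txt > ifelse_out.txt

cat ifelse_out.txt
# Expected output:
# Amit failed
# Rahul passed
# Kedar passed
```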

5. Pattern matching and regular expressions

awk '/Math/ {print}' example_text.txt

2)  Rahul   Maths   90

By placing /PATTERN/ before the body block, we tell AWK to search for the pattern in each line and to execute the body block only if a match is found. Here is a more complex pattern-matching example that uses regular expressions:

awk '$3 ~ /^(Physics|Biology)$/ {print $2}' example_text.txt

Amit
Shyam

Key points:

The ~ operator tests if a field matches a pattern. $3 ~ /^(Physics|Biology)$/ checks if field 3 matches either “Physics” or “Biology”.
This example also demonstrates two key regex operators, ^ and $.
- ^ specifies the start of a line/field. At the start of a pattern match, it indicates that the pattern must be the first text in the line/field.
- $ is the same as ^, but for the end of the line/field. When used together – /^PATTERN$/ – this indicates that the entire line/field must match the pattern.
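
To make the effect of ^ and $ concrete, the sketch below tests each line of a made-up word list (anchor_demo.txt) against four variants of the same pattern and reports which ones match.

```shell
# Hypothetical word list used only for this illustration.
cat << 'EOF' > anchor_demo.txt
Math
Maths
AppliedMath
EOF

# For each line, report which of the four patterns match.
awk '{
    printf "%s:", $0
    if ($0 ~ /Math/)   printf " contains"
    if ($0 ~ /^Math/)  printf " starts"
    if ($0 ~ /Math$/)  printf " ends"
    if ($0 ~ /^Math$/) printf " exact"
    printf "\n"
}' anchor_demo.txt > anchor_out.txt

cat anchor_out.txt
# Expected output:
# Math: contains starts ends exact
# Maths: contains starts
# AppliedMath: contains ends
```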

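Other key regex operators are described below:

- * - matches zero or more of the preceding character
- + - matches one or more of the preceding character
- ? - matches zero or one of the preceding character (makes it optional)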

Examples using these operators:

# Find runs of 4 or more A's
awk '/AAAA+/ {print}' sequences.txt   

# Matches zero or more T's: "AG", "ATG", "ATTG", "ATTTG" 
awk '/AT*G/ {print}' sequences.txt 

# Matches at least 1 A and 1 T: "ATG", "AATG", "ATTG"
awk '/A+T+G/ {print}' sequences.txt 

# Matches "chr" or "chrX" (X is optional)
awk '$1 ~ /^chrX?$/ {print}' chromosomes.txt
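
The commands above refer to files that are not created in this tutorial. As a self-contained sketch, the following builds a made-up file (quantifier_demo.txt) and shows, line by line, which of the quantifier patterns match.

```shell
# Hypothetical sequences chosen to separate the behaviour of *, + and ?.
cat << 'EOF' > quantifier_demo.txt
AG
ATG
ATTG
AAAAG
ACG
EOF

# For each line, list the patterns that match it.
awk '{
    printf "%s:", $0
    if ($0 ~ /AT*G/)  printf " AT*G"
    if ($0 ~ /AT+G/)  printf " AT+G"
    if ($0 ~ /AT?G/)  printf " AT?G"
    if ($0 ~ /AAAA+/) printf " AAAA+"
    printf "\n"
}' quantifier_demo.txt > quantifier_out.txt

cat quantifier_out.txt
# Expected output:
# AG: AT*G AT?G
# ATG: AT*G AT+G AT?G
# ATTG: AT*G AT+G
# AAAAG: AT*G AT?G AAAA+
# ACG:
```

Note that AAAAG matches AT*G and AT?G because an unanchored pattern only has to match somewhere in the line, here the final "AG".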

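Following up on the regex theme, here are some more key pattern-matching operators:

- [ABC] - matches any one of the listed characters
- [0-9] - matches any one character in the given range
- [^ABC] - matches any one character that is not listed
- {n} - matches exactly n repetitions of the preceding character or group
- !~ - tests that a field does not match a pattern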

Examples using these operators:

# print lines containing numbers
awk '/[0-9]+/ {print}' data.txt  

# Print lines where the first field starts with A, T, C or G.
awk '$1 ~ /^[ATCG]/ {print}' seqs

# Print lines where field 1 contains ONLY valid DNA bases (A, T, C, G)
awk '$1 !~ /[^ATCG]/ {print}' seqs  

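# Print lines that consist of exactly three bases, i.e. a single codon such as "ATG", "TAA" or "GGC"
# ({3} is an interval expression; some older awk implementations do not support it)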
awk '/^[ATCG]{3}$/ {print}' codons.txt 

# Print line if the first field matches chromosome names (chr1, chr2, chrX, etc.) 
awk '$1 ~ /^chr[0-9XY]+$/ {print}' genes.gff

# Print line if the 9th field includes ID=LOC followed by a number
awk '$9 ~ /ID=LOC[0-9]+/ {print}' genes.gff
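
Here is a runnable sketch of two of these patterns on a made-up file (bracket_demo.txt). It also shows that one AWK program can hold several pattern-action pairs, each tested against every line.

```shell
# Hypothetical mix of sequences and chromosome names.
cat << 'EOF' > bracket_demo.txt
ATCG
ATNG
chr1
chrX
chrUn
EOF

# First pair: the field contains no character other than A, T, C or G.
# Second pair: the field is "chr" followed only by digits, X or Y.
awk '$1 !~ /[^ATCG]/        {print "DNA:", $1}
     $1 ~ /^chr[0-9XY]+$/   {print "chromosome:", $1}' bracket_demo.txt > bracket_out.txt

cat bracket_out.txt
# Expected output:
# DNA: ATCG
# chromosome: chr1
# chromosome: chrX
```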

Examples relevant to bioinformatics

First, let’s create an example FASTA file:

cat << 'EOF' > sequences.fasta
>seq1
ATCGATCGATCG
>seq2
GCTAGCTAGCTA
>seq3
TTAAGGCCAATT
EOF

6. Now we’ll write an AWK command to calculate the average length of the sequences.

A few notes before looking at the command:

In AWK, each piece of input is called a “record”. By default, each line is a separate record, and the body block acts on each record independently; in other words, the default record separator is the newline character. We can change the record separator so that the input is split on something other than newlines. In the command below, we set the record separator (RS) to “>”.
The first record is everything before the first “>”, so it is empty. To skip it, we put the pattern NR > 1 in front of the body block, so that only records after the first one are processed. Within each remaining record, the default field splitting treats the newline as whitespace, so $1 is the header and $2 is the sequence.

awk 'BEGIN {RS = ">"}
     NR > 1 {
         header = $1           # sequence name (not used further here)
         seq = $2              # first sequence line; assumes unwrapped sequences
         len = length(seq)
         count++
         total_len += len
         print "Sequence", count, "length:", len
     }
     END {printf "Average length: %.0f bp\n", total_len / count}' sequences.fasta

Sequence 1 length: 12
Sequence 2 length: 12
Sequence 3 length: 12
Average length: 12 bp
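
The RS-based command reads only $2, so it measures just the first line of each sequence. FASTA files often wrap sequences over several lines. The sketch below uses a different, line-based technique for that case: it keeps the default record separator and adds up line lengths until the next header. The file wrapped.fasta and its contents are made up for the illustration.

```shell
# Hypothetical FASTA file in which the first sequence is wrapped over two lines.
cat << 'EOF' > wrapped.fasta
>seqA
ATCGAT
CGATCG
>seqB
GGCC
EOF

awk '/^>/ {                          # header line: report the previous record, start a new one
         if (name != "") print name, len
         name = substr($0, 2)        # drop the leading ">"
         len = 0
         next
     }
     {len += length($0)}             # sequence line: add its length to the running total
     END {if (name != "") print name, len}' wrapped.fasta > fasta_len_out.txt

cat fasta_len_out.txt
# Expected output:
# seqA 12
# seqB 4
```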

7. Now let’s look at another example where we filter and reformat a tabular data file

# make the data file
cat << 'EOF' > variants.txt
chr1    1000    A   G   0.05    PASS
chr1    2000    C   T   0.02    PASS
chr2    5000    G   A   0.5 FAIL
chr2    6000    T   C   0.01    PASS
EOF

Extract only the passing variants and reformat the text:

awk '$6 == "PASS" {printf "%s:%d %s->%s (frequency: %.3f)\n", $1, $2, $3, $4, $5}' variants.txt

chr1:1000 A->G (frequency: 0.050)
chr1:2000 C->T (frequency: 0.020)
chr2:6000 T->C (frequency: 0.010)
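
Filtering and aggregation can be combined in one program. As a minimal sketch, the following counts the passing variants and averages their frequencies. It recreates the same data under a hypothetical file name, variants_demo.txt, so that it runs on its own.

```shell
cat << 'EOF' > variants_demo.txt
chr1 1000 A G 0.05 PASS
chr1 2000 C T 0.02 PASS
chr2 5000 G A 0.5 FAIL
chr2 6000 T C 0.01 PASS
EOF

# Accumulate over PASS rows only; guard against division by zero in END.
awk '$6 == "PASS" {n++; total += $5}
     END {if (n > 0) printf "%d passing variants, mean frequency %.3f\n", n, total / n}' \
    variants_demo.txt > variant_summary.txt

cat variant_summary.txt
# Expected output:
# 3 passing variants, mean frequency 0.027
```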

8. Use the built-in number-of-fields variable (NF) to work with records that have varying numbers of fields

# concatenate two of our example text files
cat variants.txt sequences.fasta > temp_text_file.txt

awk '{print "Line", NR, "has", NF, "fields"}' temp_text_file.txt

Line 1 has 6 fields
Line 2 has 6 fields
Line 3 has 6 fields
Line 4 has 6 fields
Line 5 has 1 fields
Line 6 has 1 fields
Line 7 has 1 fields
Line 8 has 1 fields
Line 9 has 1 fields
Line 10 has 1 fields
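
NF is also useful as a filter, and $NF refers to the last field of a record whatever its position. The sketch below runs on a made-up mixed file (mixed.txt) and keeps only the three-column rows.

```shell
# Hypothetical file mixing three-column rows with single-field lines.
cat << 'EOF' > mixed.txt
chr1 1000 PASS
>seq1
chr2 5000 FAIL
ATCG
EOF

# NF == 3 selects rows with exactly three fields; $NF is the last field of each.
awk 'NF == 3 {print $1, $NF}' mixed.txt > nf_out.txt

cat nf_out.txt
# Expected output:
# chr1 PASS
# chr2 FAIL
```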

9. Now let’s look at two different methods of changing the field delimiter from the default (whitespace, i.e. spaces and tabs) to ‘,’

# make some text
cat << 'EOF' > data.csv
name,age,score
Alice,25,95
Bob,30,87
EOF

In the first method, we specify ‘,’ as the delimiter with the -F flag:

awk -F',' '{print $1, $3}' data.csv

name score
Alice 95
Bob 87

In the second method, we specify the new delimiter in the BEGIN block:

awk 'BEGIN {FS=","} {print $1, $3}' data.csv

name score
Alice 95
Bob 87
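
In both outputs the fields are separated by a space, because FS only controls how input is split. The output separator is a separate variable, OFS. The sketch below sets both so that the result is still comma-separated; people.csv is a hypothetical copy of the data above.

```shell
cat << 'EOF' > people.csv
name,age,score
Alice,25,95
Bob,30,87
EOF

# FS controls how input lines are split; OFS is placed between the
# arguments of print on output.
awk 'BEGIN {FS = ","; OFS = ","} {print $1, $3}' people.csv > people_out.csv

cat people_out.csv
# Expected output:
# name,score
# Alice,95
# Bob,87
```

Splitting on a plain comma like this does not handle CSV fields that contain quoted commas, so it is best kept for simple files.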