Bash Text Processing
How to build powerful text processing pipelines with grep, sed, awk, cut, sort, uniq, and tr for log analysis and data transformation.
Overview
Unix text processing tools are designed to be composed into pipelines: each tool does one thing well, and the shell connects them with pipes. A short pipeline can often replace a much longer Python or JavaScript script for log analysis, data extraction, and report generation. This recipe covers the essential tools and how to combine them safely.
When to Use
- Extracting and filtering log lines by pattern, time, or status code
- Transforming CSV or tabular data (sorting, deduplication, aggregation)
- Searching codebases for patterns across thousands of files
- Generating quick reports from structured text output
- Pre-processing data before feeding it to a database or API
When NOT to Use
- Parsing nested or irregular formats (JSON, XML, HTML): use jq, xq, or a proper parser
- Tasks requiring complex state across lines: awk can do it, but Python is more maintainable
- Multi-step transformations where error handling matters: scripting languages have better debugging
- Unicode edge cases: classic tools are byte-oriented and may mangle multibyte characters
Step-by-Step Implementation
grep — Pattern Matching
# Search recursively, show line numbers, ignore binary files (-I)
grep -rnI "ERROR" logs/
# Invert match and count: number of lines that are not comments
grep -vc "^#" config.ini
# Multiple patterns with extended regex
grep -E "(ERROR|FATAL|CRITICAL)" app.log
# Context lines: 2 before, 3 after
grep -B 2 -A 3 "Exception" app.log
# Only filenames containing match (useful for batch operations)
grep -rl "TODO" src/
# Perl-compatible regex (PCRE) with a lookbehind; -o prints only the matched digits (GNU grep)
grep -oP '(?<=user_id=)\d+' access.log
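Not every grep is built with PCRE support, so a portable fallback for the lookbehind is worth having. The sketch below is one way to do it, assuming a made-up log format with a user_id=NUMBER field; the sample lines and variable names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sample log; the line format is an assumption for illustration.
log=$(mktemp)
cat > "$log" <<'EOF'
2024-06-01 10:00:01 INFO  user_id=42 login ok
2024-06-01 10:00:05 ERROR user_id=42 payment failed
2024-06-01 10:00:09 FATAL user_id=7 database unreachable
2024-06-01 10:00:12 INFO  user_id=7 logout
EOF

# Count the lines that mention either severity.
error_count=$(grep -cE 'ERROR|FATAL' "$log")

# Stand-in for the PCRE lookbehind: match the whole key=value pair
# with -o, then keep only the part after "=" and deduplicate.
user_ids=$(grep -oE 'user_id=[0-9]+' "$log" | cut -d= -f2 | sort -un | paste -s -d ' ' -)

echo "error lines: $error_count"
echo "user ids: $user_ids"
rm -f "$log"
```

The cut step does the job that the lookbehind does in the PCRE version, at the cost of one extra process.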
sed — Stream Editing
# Replace first occurrence per line
sed 's/foo/bar/' file.txt
# Replace all occurrences globally
sed 's/foo/bar/g' file.txt
# Replace in-place with backup
sed -i.bak 's/old_domain/new_domain/g' config.conf
# Delete lines matching pattern
sed '/^#/d' config.ini # Remove comments
sed '/^$/d' file.txt # Remove empty lines
# Extract specific lines
sed -n '10,20p' file.txt # Print lines 10-20
sed -n '50,$p' file.txt # Print from line 50 to end
# Append a line after each match (GNU sed one-line form;
# BSD/macOS sed needs a\ followed by a newline)
sed '/pattern/a New line after match' file.txt
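Several of these edits are often needed together. As a rough sketch, the following chains three of them with -e so the file is read only once; the config contents are invented for the example.

```shell
#!/usr/bin/env bash
# Hypothetical config file; keys and values are assumptions for illustration.
conf=$(mktemp)
cat > "$conf" <<'EOF'
# server settings
host=old_domain.example

port=8080
# end
EOF

# Each -e adds one edit: drop comment lines, drop empty lines,
# then substitute on whatever remains.
cleaned=$(sed -e '/^#/d' -e '/^$/d' -e 's/old_domain/new_domain/g' "$conf")

printf '%s\n' "$cleaned"
rm -f "$conf"
```

Order matters here only for efficiency: deleting lines first means the substitution runs on fewer lines.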
awk — Field Processing and Aggregation
# Print specific columns (space/tab delimited)
awk '{print $1, $3}' access.log
# Sum a column
awk '{sum += $2} END {print sum}' sales.txt
# Average with count
awk '{sum += $2; count++} END {if (count) print sum/count}' data.txt
# Filter rows by condition (set the delimiter for CSV input)
awk -F',' '$3 > 100 {print $1, $3}' orders.csv
# Process CSV with custom delimiter
awk -F',' '{print $2, $5}' customers.csv
# Group by and count (like SQL GROUP BY)
awk '{count[$1]++} END {for (k in count) print k, count[k]}' status.log
# Format output with headers
awk 'BEGIN {print "IP", "Requests"} {count[$1]++} END {for (ip in count) print ip, count[ip]}' access.log
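The group-by pattern extends naturally from counting to summing. Here is a minimal sketch on a small CSV with a header row; the column names and figures are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sales data; the layout (region,amount) is an assumption.
sales=$(mktemp)
cat > "$sales" <<'EOF'
region,amount
north,100
south,250
north,50
EOF

# NR > 1 skips the header row; totals are keyed by the first field.
# "for (r in sum)" visits keys in an unspecified order, so the output
# is piped through sort to make it deterministic.
totals=$(awk -F',' 'NR > 1 {sum[$1] += $2} END {for (r in sum) print r "," sum[r]}' "$sales" | sort)

printf '%s\n' "$totals"
rm -f "$sales"
```

This only works for simple CSV without quoted fields or embedded commas.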
cut, sort, uniq — Column Extraction and Deduplication
# Extract columns by position or delimiter
cut -d',' -f1,3,5 data.csv
cut -c1-10 file.txt # First 10 characters
# Sort by a specific column, numerically or in reverse
sort -t',' -k3,3n sales.csv # Sort by 3rd column only, numerically
sort -t',' -k3,3nr sales.csv # Same, largest first
sort -u file.txt # Sort and remove duplicates
# Count unique occurrences
sort file.txt | uniq -c | sort -rn # Most frequent first
# Show only duplicated or only non-repeated lines
sort file.txt | uniq -d # One copy of each line that occurs more than once
sort file.txt | uniq -u # Only lines that occur exactly once
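Every uniq example above sorts first because uniq only compares adjacent lines. A small illustration with invented input shows the difference:

```shell
#!/usr/bin/env bash
# Hypothetical input: "a" and "b" interleaved.
words=$(printf 'b\na\nb\na\na\n')

# Without sort, only the final adjacent pair of "a" lines collapses.
unsorted_count=$(printf '%s\n' "$words" | uniq | wc -l)

# With sort, identical lines become neighbours and collapse fully.
sorted_count=$(printf '%s\n' "$words" | sort | uniq | wc -l)

# Most frequent value: count, order by count descending, take the top row.
most_common=$(printf '%s\n' "$words" | sort | uniq -c | sort -rn | head -n 1 | awk '{print $2}')

echo "distinct without sort: $unsorted_count"
echo "distinct with sort: $sorted_count"
echo "most common: $most_common"
```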
tr — Character Translation
# Convert to uppercase
tr '[:lower:]' '[:upper:]' < file.txt
# Squeeze repeated characters
tr -s ' ' < file.txt # Collapse multiple spaces to one
# Delete characters
tr -d '\r' < file.txt # Remove carriage returns
# Join lines with commas (leaves a trailing comma; paste -sd, lines.txt avoids it)
tr '\n' ',' < lines.txt > comma-separated.txt
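Combining translation, complement (-c), and squeeze (-s) gives a classic word-frequency pipeline. The following is a sketch on an invented sentence rather than a full tokenizer; it treats only ASCII letters as word characters.

```shell
#!/usr/bin/env bash
# Hypothetical input text.
text='The quick fox. The lazy dog; the END'

# 1. Lowercase everything so "The" and "the" count together.
# 2. -c complements the set and -s squeezes: each run of
#    non-letters becomes a single newline, one word per line.
# 3. Count, order by frequency, and keep the top word.
top_word=$(printf '%s\n' "$text" |
  tr '[:upper:]' '[:lower:]' |
  tr -cs '[:alpha:]' '\n' |
  sort | uniq -c | sort -rn |
  head -n 1 | awk '{print $2}')

echo "most frequent word: $top_word"
```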
Complex Pipelines
# Top 10 most frequent error types in a log
awk '$0 ~ /ERROR|FATAL/ {print $5}' app.log | \
sort | uniq -c | sort -rn | head -10
# Extract unique client IPs with request count, sorted
awk '{print $1}' access.log | sort | uniq -c | sort -rn | \
awk '{print $2 "," $1}' > ip_counts.csv
# Find slow queries (>1s) and group by table
awk '$NF > 1 {print}' slow_query.log | \
grep -oP 'FROM \K\w+' | sort | uniq -c | sort -rn
# Filter a date range and convert timestamps to ISO 8601
# (assumes field 1 is a YYYY/MM/DD date and field 2 is the time)
awk '{gsub(/\//, "-", $1)} $1 >= "2024-06-01" && $1 <= "2024-06-07" {print $1 "T" $2}' app.log
# Generate a report: status code distribution
awk '{print $9}' access.log | sort | uniq -c | \
awk -v total="$(wc -l < access.log)" \
'{printf "%s: %d requests (%.1f%%)\n", $2, $1, $1*100/total}'
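The status code report can also be produced in a single awk pass, which avoids reading the log twice. This is an alternative sketch, not a replacement for the pipeline above; the sample lines assume the common access log layout in which the status code is field 9.

```shell
#!/usr/bin/env bash
# Hypothetical access log in the common log format (status is field 9).
access=$(mktemp)
cat > "$access" <<'EOF'
10.0.0.1 - - [01/Jun/2024:10:00:00 +0000] "GET / HTTP/1.1" 200 512
10.0.0.2 - - [01/Jun/2024:10:00:01 +0000] "GET /a HTTP/1.1" 200 128
10.0.0.1 - - [01/Jun/2024:10:00:02 +0000] "GET /b HTTP/1.1" 404 0
10.0.0.3 - - [01/Jun/2024:10:00:03 +0000] "POST /c HTTP/1.1" 200 64
EOF

# Count per status code while reading; in the END block NR still holds
# the total number of lines, so no separate wc -l is needed.
report=$(awk '{n[$9]++}
  END {for (s in n) printf "%s: %d requests (%.1f%%)\n", s, n[s], n[s] * 100 / NR}' "$access" | sort)

printf '%s\n' "$report"
rm -f "$access"
```

For this sample the report lists 200 at 75.0% and 404 at 25.0%.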
Best Practices
- Always quote regex patterns with special characters. grep "$pattern" prevents the shell from expanding * or ? before grep sees them.
- Use awk for columnar data instead of cut when fields vary in width. cut fails on variable spacing; awk splits on any whitespace by default.
- Prefer jq for JSON, xq for XML, csvkit for CSV. Classic tools treat these formats as plain text and will break on quoted fields or nested structures.
- Chain tools left to right in order of filtering. Put grep early to reduce data volume before expensive awk or sort operations.
- Use LC_ALL=C for consistent sorting and performance. It forces byte-wise sorting and avoids locale-dependent behavior.
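Two of these practices are easy to check on small input. The sketch below uses invented data to show byte-wise ordering under LC_ALL=C and a quoted pattern passed to grep unchanged.

```shell
#!/usr/bin/env bash
# Byte-wise ordering: uppercase ASCII letters sort before lowercase ones,
# whatever locale the user's environment selects.
sorted=$(printf 'b\nA\na\nB\n' | LC_ALL=C sort | paste -s -d ' ' -)

# Quoting keeps the shell from treating * as a glob, so grep receives
# the regex exactly as written. "--" guards against patterns that
# begin with a dash.
pattern='a.*c'
matches=$(printf 'abc\naxc\nabd\n' | grep -c -- "$pattern")

echo "C-locale order: $sorted"
echo "lines matching $pattern: $matches"
```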
Common Mistakes
- Parsing JSON/HTML with grep/sed/awk. These tools are line-oriented and unaware of nesting or quoting; use jq, python -m json.tool, or a DOM parser.
- Forgetting that sed and awk operate line by line by default. Multi-line patterns require non-obvious options (sed -z in GNU sed, or changing RS in awk).
- Assuming sort is stable by default. Stability varies by implementation; use sort -s (GNU and BSD sort) if you need it.
- Using cat unnecessarily. cat file | grep pattern is a useless use of cat. Use grep pattern file.
- Not handling empty input. Many pipelines fail silently on empty files; check with [ -s file ] first, or guard the awk END block (for example, if (count) ...).
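The last two points can be sketched briefly. The example assumes a sort that supports -s (a GNU and BSD extension, not POSIX) and uses made-up input.

```shell
#!/usr/bin/env bash
# Empty input: mktemp creates a zero-byte file.
empty=$(mktemp)

# Guard the division so an empty file yields a message, not an error.
avg=$(awk '{sum += $1; n++} END {if (n) print sum / n; else print "no data"}' "$empty")

# [ -s file ] is true only when the file exists and is non-empty.
if [ -s "$empty" ]; then status="has data"; else status="empty"; fi

# Stable sort: -s keeps the input order of lines whose keys compare equal,
# so "b 2" stays ahead of "b 1" when sorting on the first field only.
stable=$(printf 'b 2\na 9\nb 1\n' | sort -s -k1,1)

echo "average: $avg"
echo "file status: $status"
printf '%s\n' "$stable"
rm -f "$empty"
```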