Commands affecting text and text files
sort
File sort utility, often used as a filter in a pipe. This command sorts a text stream or file forwards or backwards, or according to various keys or character positions. Using the -m option, it merges presorted input files. The info page lists its many capabilities and options.
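The options mentioned above can be tried on throwaway data. The following is only a sketch: the function name sort_demo and the sample lines are made up for illustration, and the temporary files come from mktemp.

```shell
#!/bin/bash
# Sketch: a few common sort invocations on scratch data.
# sort_demo and the sample data are hypothetical names/values.

sort_demo () {
  local a b
  a=$(mktemp) && b=$(mktemp) || return 1

  printf '%s\n' 'pear 3' 'apple 10' 'fig 2' > "$a"
  printf '%s\n' 'cherry 1' 'banana 7'       > "$b"

  echo "-- whole-line sort (the default)"
  sort "$a"

  echo "-- numeric sort on field 2, descending"
  sort -k2,2nr "$a"

  echo "-- merge of two presorted files"
  sort "$a" -o "$a"      # -o lets sort write back to its own input file.
  sort "$b" -o "$b"
  sort -m "$a" "$b"      # -m only merges; it assumes sorted inputs.

  rm -f "$a" "$b"
}

sort_demo
```

Note that -m does not check its inputs; merging unsorted files silently produces unsorted output.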
tsort
Topological sort, reading in pairs of whitespace-separated strings and sorting according to input patterns. The original purpose of tsort was to sort a list of dependencies for an obsolete version of the ld linker in an "ancient" version of UNIX.
The results of a tsort will usually differ markedly from those of the standard sort command, above.
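A small sketch may make the difference from sort concrete. Each input pair "A B" is read as "A comes before B". The build-step names below are invented for illustration.

```shell
#!/bin/bash
# Sketch: tsort versus sort on the same input.
# The step names (configure, compile, ...) are hypothetical.

deps () {
  # Each line is a pair: the left item must precede the right item.
  printf '%s\n' 'configure compile' 'compile link' 'link install'
}

echo "-- tsort: dependency order"
deps | tsort

echo "-- sort -u: plain alphabetical order of the same words"
deps | tr ' ' '\n' | sort -u
```

With a simple chain like this one, only one ordering satisfies every pair. When several orderings are valid, the one tsort picks may differ between implementations.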
uniq
This filter removes duplicate lines from a sorted file. It is often seen in a pipe coupled with sort.
cat list-1 list-2 list-3 | sort | uniq > final.list
# Concatenates the list files,
# sorts them,
# removes duplicate lines,
# and finally writes the result to an output file.
bash$ cat testfile
This line occurs only once.
This line occurs twice.
This line occurs twice.
This line occurs three times.
This line occurs three times.
This line occurs three times.
bash$ uniq -c testfile
1 This line occurs only once.
2 This line occurs twice.
3 This line occurs three times.
bash$ sort testfile | uniq -c | sort -nr
3 This line occurs three times.
2 This line occurs twice.
1 This line occurs only once.
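uniq compares only adjacent lines, which is why its input is normally sorted first. The sketch below shows this, along with the -d (duplicated lines only) and -u (unrepeated lines only) options. The helper name make_list and its data are made up for the example.

```shell
#!/bin/bash
# Sketch: why uniq wants sorted input, plus the -d and -u options.
# make_list is a hypothetical helper producing sample lines.

make_list () {
  printf '%s\n' twice once thrice twice thrice thrice
}

echo "-- unsorted input: non-adjacent duplicates survive"
make_list | uniq

echo "-- sorted input: one copy of each line"
make_list | sort | uniq

echo "-- only lines that were repeated"
make_list | sort | uniq -d

echo "-- only lines that were not repeated"
make_list | sort | uniq -u
```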
# Crude word frequency analysis on a text file.
# This is a more efficient version of a similar script.
# Check for input file on command-line.
ARGS=1
E_BADARGS=85   # Exit code values are assumed here.
E_NOFILE=86
if [ $# -ne "$ARGS" ] # Correct number of arguments passed to script?
then
  echo "Usage: `basename $0` filename"
  exit $E_BADARGS
fi
if [ ! -f "$1" ] # Check if file exists.
then
  echo "File \"$1\" does not exist."
  exit $E_NOFILE
fi
# main ()
sed -e 's/\.//g' -e 's/\,//g' -e 's/ /\
/g' "$1" | tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr
# =========================
# Frequency of occurrence
# Filter out periods and commas, and
#+ change space between words to linefeed,
#+ then shift characters to lowercase, and
#+ finally prefix occurrence count and sort numerically.
# Arun Giridhar suggests modifying the above to:
# . . . | sort | uniq -c | sort +1 [-f] | sort +0 -nr
# This adds a secondary sort key, so instances of
#+ equal occurrence are sorted alphabetically.
# As he explains it:
# "This is effectively a radix sort, first on the
#+ least significant column
#+ (word or string, optionally case-insensitive)
#+ and last on the most significant column (frequency)."
# As Frank Wang explains, the above is equivalent to
#+ . . . | sort | uniq -c | sort +0 -nr
#+ and the following also works:
#+ . . . | sort | uniq -c | sort -k1,1nr -k2
# (The "+N" key syntax is obsolete; current versions of sort expect -k.)
exit 0
# Exercises:
# ---------
# 1) Add 'sed' commands to filter out other punctuation,
#+ such as semicolons.
# 2) Modify the script to also filter out multiple spaces and
#+ other whitespace.
bash$ cat testfile
This line occurs only once.
This line occurs twice.
This line occurs twice.
This line occurs three times.
This line occurs three times.
This line occurs three times.
bash$ ./ testfile
6 this
6 occurs
6 line
3 times
3 three
2 twice
1 only
1 once
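In the output above, words with equal counts appear in no particular alphabetical order. The secondary sort key discussed in the script comments addresses that. The following sketch is not the script above: it swaps the sed step for tr -cs to split words, and the function name word_freq is made up for illustration.

```shell
#!/bin/bash
# Sketch: word frequency with ties broken alphabetically.
# word_freq is a hypothetical name; tr -cs replaces the sed step
# used in the script above.

word_freq () {
  tr -cs '[:alpha:]' '\n' < "$1" |   # Every run of non-letters becomes one newline.
    tr '[:upper:]' '[:lower:]'   |   # Fold to lowercase.
    sort | uniq -c               |   # Count identical words.
    sort -k1,1nr -k2                 # Count descending, then word ascending.
}

if [ -f "${1:-}" ]
then
  word_freq "$1"
fi
```

Run against the testfile shown earlier, this lists the three words with count 6 as line, occurs, this, in that order.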
expand, unexpand
The expand filter converts tabs to spaces. It is often used in a pipe.
The unexpand filter converts spaces to tabs. This reverses the effect of expand.
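A brief sketch of both filters follows. The -t option sets the tab stop width. By default, unexpand converts only leading blanks and assumes tab stops every 8 columns.

```shell
#!/bin/bash
# Sketch: expand and unexpand on one-line inputs.

# A tab after "a" advances to the next tab stop (column 4 here),
# so it is replaced by three spaces.
printf 'a\tb\n' | expand -t 4

# Eight leading spaces reach the first default tab stop,
# so unexpand replaces them with a single tab.
printf '        x\n' | unexpand | od -c
```

The od -c call is there only to make the tab visible as \t.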
cut
A tool for extracting fields from files. It is similar to the print $N command set in awk, but more limited. It may be simpler to use cut in a script than awk. Particularly important are the -d (delimiter) and -f (field specifier) options.
Using cut to obtain a listing of the mounted filesystems:
cut -d ' ' -f1,2 /etc/mtab
Using cut to list the OS and kernel version:
uname -a | cut -d" " -f1,3,11,12
Using cut to extract message headers from an e-mail folder:
bash$ grep '^Subject:' read-messages | cut -c10-80
Re: Linux suitable for mission-critical apps?
Spam complaint
Re: Spam complaint
# cut -c extracts by character position (here, columns 10 through 80); see man cut.
Using cut to parse a file:
# List all the users in /etc/passwd.
FILENAME=/etc/passwd
for user in $(cut -d: -f1 $FILENAME)
do
  echo $user
done
# Thanks, Oleg Philon for suggesting this.
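To compare -d/-f with -c, and cut with awk, here is a sketch on made-up passwd-style records. The function name sample_passwd and the user alice are invented for the example.

```shell
#!/bin/bash
# Sketch: field and character extraction with cut.
# sample_passwd and its records are hypothetical.

sample_passwd () {
  printf '%s\n' 'root:x:0:0:root:/root:/bin/bash' \
                'alice:x:1000:1000:Alice:/home/alice:/bin/zsh'
}

echo "-- fields 1 and 7, colon-delimited"
sample_passwd | cut -d: -f1,7

echo "-- the same with awk's print \$N"
sample_passwd | awk -F: '{ print $1 ":" $7 }'

echo "-- characters 1 through 4 of each line"
sample_passwd | cut -c1-4
```

cut always emits fields in their input order, so -f7,1 gives the same result as -f1,7. Reordering fields is one case where awk is needed.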
head
Lists the beginning of a file to stdout. The default is 10 lines, but a different number can be specified. The command has a number of interesting options.
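A quick sketch of the default and of two common options, using seq to generate input. The -c (byte count) option is widely available but is not required by POSIX.

```shell
#!/bin/bash
# Sketch: head with its default, a line count, and a byte count.

echo "-- default: first 10 lines"
seq 1 20 | head

echo "-- first 3 lines"
seq 1 20 | head -n 3

echo "-- first 2 bytes"
printf 'abcdef\n' | head -c 2
echo    # head -c 2 stopped before the newline, so add one.
```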
# Detects scripts within a directory.
TESTCHARS=2 # Test first 2 characters.
SHABANG='#!' # Scripts begin with a "sha-bang."
for file in * # Traverse all the files in current directory.
do
  if [[ `head -c$TESTCHARS "$file"` = "$SHABANG" ]]
  # The output of "head -c2" is compared with "#!".
  # The '-c' option to "head" outputs a specified
  #+ number of characters, rather than lines (the default).
  then
    echo "File \"$file\" is a script."
  else
    echo "File \"$file\" is *not* a script."
  fi
done
exit 0
# Exercises:
# ---------
# 1) Modify this script to take as an optional argument
#+ the directory to scan for scripts
#+ (rather than just the current working directory).
# 2) As it stands, this script gives "false positives" for
#+ Perl, awk, and other scripting language scripts.
# Correct this.
Example 16-14. Generating 10-digit random numbers
# Outputs a 10-digit random number
# Script by Stephane Chazelas.
head -c4 /dev/urandom | od -N4 -tu4 | sed -ne '1s/.* //p'
# =================================================================== #
# Analysis
# --------
# head:
# -c4 option takes first 4 bytes.
# od:
# -N4 option limits output to 4 bytes.
# -tu4 option selects unsigned decimal format for output.
# sed:
# -n option, in combination with "p" flag to the "s" command,
# outputs only matched lines.
# The author of this script explains the action of 'sed', as follows.
# head -c4 /dev/urandom | od -N4 -tu4 | sed -ne '1s/.* //p'
# ----------------------------------> |
# Assume output up to "sed" --------> |
# is 0000000 1198195154\n
# sed begins reading characters: 0000000 1198195154\n.
# Here it finds a newline character,
#+ so it is ready to process the first line (0000000 1198195154).
# It looks at its <range><action>s. The first and only one is
# range action
# 1 s/.* //p
# The line number is in the range, so it executes the action:
#+ tries to substitute the longest string ending with a space in the line
# ("0000000 ") with nothing (//), and if it succeeds, prints the result
# ("p" is a flag to the "s" command here, this is different
#+ from the "p" command).
# sed is now ready to continue reading its input. (Note that before
#+ continuing, if -n option had not been passed, sed would have printed
#+ the line once again).
# Now, sed reads the remainder of the characters, and finds the
#+ end of the file.
# It is now ready to process its 2nd line (which is also numbered '$' as
#+ it's the last one).
# It sees it is not matched by any <range>, so its job is done.
# In a few words, this sed command means:
# "On the first line only, remove any character up to the right-most space,
#+ then print it."
# A better way to do this would have been:
# sed -e 's/.* //;q'
# Here, two <range><action>s (could have been written
# sed -e 's/.* //' -e q).
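As an aside, the sed step exists only to strip the offset column that od prints. The sketch below is a different technique, not the script author's: it uses od's -An option to suppress that column. The function name rand_u32 is made up. Note that four random bytes give a value between 0 and 4294967295, so the result has at most 10 digits, not always exactly 10.

```shell
#!/bin/bash
# Sketch: random unsigned 32-bit number without sed.
# rand_u32 is a hypothetical name; this is an alternative,
# not the method of the script above.

rand_u32 () {
  # -An     : no offset column.
  # -N4     : read only 4 bytes.
  # -tu4    : print them as one unsigned 4-byte decimal.
  # tr -d   : strip od's padding blanks and the newline.
  od -An -N4 -tu4 /dev/urandom | tr -d ' \n'
  echo
}

rand_u32
```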
tail
Lists the (tail) end of a file to stdout. The default is 10 lines.
filename=sys.log # Output file; this name is an assumption.
cat /dev/null > $filename; echo "Creating / cleaning out file."
# Creates the file if it does not already exist,
#+ and truncates it to zero length if it does.
# : > filename and > filename also work.
tail /var/log/messages > $filename
# /var/log/messages must have world read permission for this to work.
echo "$filename contains tail end of system log."
exit 0
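The tail command used in the script above prints the last lines of its input, 10 by default. Here is a sketch of the -n option, and of pairing head with tail to pull a block of lines from the middle of a stream.

```shell
#!/bin/bash
# Sketch: tail with a line count, and head + tail together.

echo "-- last 3 lines"
seq 1 20 | tail -n 3

echo "-- lines 5 through 7: keep the first 7, then the last 3 of those"
seq 1 20 | head -n 7 | tail -n 3
```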
grep
A multi-purpose file search tool that uses Regular Expressions. It was originally a command/filter in the venerable ed line editor: g/re/p -- global - regular expression - print.
bash$ grep '[rst]ystem.$' osinfo.txt
The GPL governs the distribution of the Linux operating system.
bash$ ps ax | grep clock
765 tty1 S 0:00 xclock
901 pts/1 S 0:00 grep clock
The -i option causes a case-insensitive search.
The -w option matches only whole words.
The -l option lists only the files in which matches were found, but not the matching lines.
The -r (recursive) option searches files in the current working directory and all subdirectories below it.
The -n option lists the matching lines, together with line numbers.
bash$ grep -n Linux osinfo.txt
2:This is a file containing information about Linux.
6:The GPL governs the distribution of the Linux operating system.
The -v (or --invert-match) option filters out matches.
grep pattern1 *.txt | grep -v pattern2
# Matches all lines in "*.txt" files containing "pattern1",
# but ***not*** "pattern2".
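The options described above can be combined. The following sketch runs several of them against made-up input; the function name grep_demo_file and its four lines are invented for illustration.

```shell
#!/bin/bash
# Sketch: -c, -i, -n, -w, and -v on sample text.
# grep_demo_file and its contents are hypothetical.

grep_demo_file () {
  printf '%s\n' 'Linux is a kernel.' \
                'The linux word here is lowercase.' \
                'GNU/Linux systems ship grep.' \
                'Nothing relevant on this line.'
}

echo "-- count of lines containing 'Linux' (case-sensitive)"
grep_demo_file | grep -c Linux

echo "-- the same count, ignoring case"
grep_demo_file | grep -ci linux

echo "-- whole-word match for 'grep', with line number"
grep_demo_file | grep -n -w grep

echo "-- lines mentioning linux (any case) but not GNU"
grep_demo_file | grep -i linux | grep -v GNU
```

Note that -c counts matching lines, not individual occurrences within a line.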
grep -c txt *.sgml # (number of occurrences of "txt" in "*.sgml" files)
# grep -cz .
# ^ dot
# means count (-c) zero-separated (-z) items matching "."
# that is, non-empty ones (containing at least 1 character).
printf 'a b\nc d\n\n\n\n\n\000\n\000e\000\000\nf' | grep -cz . # 3
printf 'a b\nc d\n\n\n\n\n\000\n\000e\000\000\nf' | grep -cz '$' # 5
printf 'a b\nc d\n\n\n\n\n\000\n\000e\000\000\nf' | grep -cz '^' # 5
printf 'a b\nc d\n\n\n\n\n\000\n\000e\000\000\nf' | grep -c '$' # 9
# By default, newline chars (\n) separate items to match.
# Note that the -z option is GNU "grep" specific.
# Thanks, S.C.
tr
Character translation filter.
Must use quoting and/or brackets, as appropriate. Quotes prevent the shell from reinterpreting the special characters in tr command sequences. Brackets should be quoted to prevent expansion by the shell.
Either tr "A-Z" "*" <filename or tr A-Z \* <filename changes all the uppercase letters in filename to asterisks (writes to stdout). On some systems this may not work, but tr A-Z '[**]' will.
The -d option deletes a range of characters.
echo "abcdef" # abcdef
echo "abcdef" | tr -d b-d # aef
tr -d 0-9 <filename
# Deletes all digits from the file "filename".
The --squeeze-repeats (or -s) option deletes all but the first instance of a string of consecutive characters. This option is useful for removing excess whitespace.
bash$ echo "XXXXX" | tr --squeeze-repeats 'X'
X
The -c "complement" option inverts the character set to match. With this option, tr acts only upon those characters not matching the specified set.
(-c means "complement", much like the -v inverted-match option of other commands.)
bash$ echo "acfdeb123" | tr -c b-d +
+c+d+b++++
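The -s, -c, and -d options are often combined. The sketch below uses made-up input strings. In the last command, the complement of the letters (everything that is not a letter) is turned into newlines and then squeezed, which splits text into one word per line.

```shell
#!/bin/bash
# Sketch: combining tr options on sample strings.

echo "-- squeeze runs of spaces"
echo "aaa   bbb" | tr -s ' '

echo "-- delete everything except letters and the newline"
echo "Hello, World 42" | tr -cd '[:alpha:]\n'

echo "-- one word per line"
echo "one, two;  three" | tr -cs '[:alpha:]' '\n'
```

In the transcript above, the trailing + characters include one for the newline that echo appends, since a newline is also outside the set b-d.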
toupper: Transforms a file to all uppercase.
# Changes a file to all uppercase.
E_BADARGS=85 # Exit code value is assumed here.
if [ -z "$1" ] # Standard check for command-line arg.
then
  echo "Usage: `basename $0` filename"
  exit $E_BADARGS
fi
tr a-z A-Z <"$1"
# Case conversion.
# Same effect as above, but using POSIX character set notation:
# tr '[:lower:]' '[:upper:]' <"$1"
# Thanks, S.C.
# Or even . . .
# cat "$1" | tr a-z A-Z
# Or dozens of other ways . . .
exit 0
# Exercise:
# Rewrite this script to give the option of changing a file
#+ to *either* upper or lowercase.
# Hint: Use either the "case" or "select" command.
# Changes every filename in working directory to all lowercase.
# Inspired by a script of John Dubois,
#+ which was translated into Bash by Chet Ramey,
#+ and considerably simplified by the author of the ABS Guide.
for filename in * # Traverse all files in directory.
do
  fname=`basename $filename`
  n=`echo $fname | tr A-Z a-z` # Change name to lowercase.
  if [ "$fname" != "$n" ] # Rename only files not already lowercase.
  then
    mv $fname $n
  fi
done
exit $?
# The two code sections (above and below) are equivalent.
# Code below this line will not execute because of "exit".
# To run it, delete script above line.
# The above script will not work on filenames containing blanks or newlines.
# Stephane Chazelas therefore suggests the following alternative:
for filename in * # Not necessary to use basename,
# since "*" won't return any file containing "/".
do n=`echo "$filename/" | tr '[:upper:]' '[:lower:]'`
# POSIX char set notation.
# Slash added so that trailing newlines are not
# removed by command substitution.
  # Variable substitution:
  n=${n%/} # Removes trailing slash, added above, from filename.
  [[ $filename == $n ]] || mv "$filename" "$n"
  # Checks if filename already lowercase.
done
exit $?
du: DOS to UNIX text file conversion.
# DOS to UNIX text file converter.
E_WRONGARGS=85 # Exit code value is assumed here.
if [ -z "$1" ]
then
  echo "Usage: `basename $0` filename-to-convert"
  exit $E_WRONGARGS
fi
NEWFILENAME=$1.unx # Output file; the ".unx" suffix is an assumption.
CR='\015' # Carriage return.
# 015 is octal ASCII code for CR.
# Lines in a DOS text file end in CR-LF.
# Lines in a UNIX text file end in LF only.
tr -d $CR < "$1" > "$NEWFILENAME"
# Delete CR
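To see the conversion at the byte level, here is a sketch that applies the same tr -d '\015' to made-up input and inspects the result with od. The function name strip_cr is invented for the example. tr itself interprets the octal escape, which is why the single-quoted string works.

```shell
#!/bin/bash
# Sketch: deleting carriage returns and checking the result.
# strip_cr is a hypothetical helper name.

strip_cr () {
  tr -d '\015'    # tr expands \015 (octal) to the CR character.
}

echo "-- before: each line ends in \\r \\n"
printf 'line one\r\nline two\r\n' | od -c

echo "-- after: each line ends in \\n only"
printf 'line one\r\nline two\r\n' | strip_cr | od -c
```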
fmt
Simple-minded file formatter, used as a filter in a pipe to "wrap" long lines of text output.
WIDTH=40 # 40 columns wide.
b=`ls /usr/local/bin` # Get a file listing...
echo $b | fmt -w $WIDTH
# Could also have been done by
# echo $b | fold - -s -w $WIDTH
exit 0
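Here is a sketch that does not depend on the contents of /usr/local/bin. It wraps one made-up sentence with both fmt and fold. The exact line breaks may differ between the two tools and between fmt implementations, so no particular output is shown.

```shell
#!/bin/bash
# Sketch: wrapping one long line with fmt and with fold.
# long_line is sample text invented for the example.

long_line="The quick brown fox jumps over the lazy dog and keeps running through the forest."

echo "-- fmt -w 30"
echo "$long_line" | fmt -w 30

echo "-- fold -s -w 30"
echo "$long_line" | fold -s -w 30
```

fold without -s breaks at exactly the given width, even in the middle of a word. The -s option makes it break at blanks instead.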