16.4. Text Processing Commands: Study Notes (wuxiaobo)

Commands affecting text and text files

sort

File sort utility, often used as a filter in a pipe. This command sorts a text stream or file forwards or backwards, or according to various keys or character positions. Using the -m option, it merges presorted input files. The info page lists its many capabilities and options.
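As a rough illustration of sorting on a key and of merging presorted input, here is a minimal sketch. The sample data and the temporary file names are made up for this example, and LC_ALL=C is set only so that the ordering does not depend on the locale.

```shell
#!/bin/bash
# Hypothetical sample data: "name score" pairs (not from the original notes).
export LC_ALL=C   # Byte-wise ordering, independent of the user's locale.

data='carol 72
alice 90
bob 85'

# Sort numerically on the second field only, highest value first.
by_score=$(printf '%s\n' "$data" | sort -k2,2nr)
echo "$by_score"

# Merge two files that are already sorted; -m interleaves them without re-sorting.
tmpdir=$(mktemp -d)
printf 'apple\ncherry\n' > "$tmpdir/a.txt"
printf 'banana\ndate\n'  > "$tmpdir/b.txt"
merged=$(sort -m "$tmpdir/a.txt" "$tmpdir/b.txt")
echo "$merged"
rm -r "$tmpdir"
```

Note that -m assumes each input really is sorted; if it is not, the merged output will not be sorted either.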

tsort

Topological sort, reading in pairs of whitespace-separated strings and sorting according to input patterns. The original purpose of tsort was to sort a list of dependencies for an obsolete version of the ld linker in an "ancient" version of UNIX.

The results of a tsort will usually differ markedly from those of the standard sort command, above.
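To make the difference from sort concrete, the sketch below feeds tsort a chain of "A must come before B" pairs. The step names are invented for the example; the chain is chosen so that only one valid ordering exists.

```shell
#!/bin/bash
export LC_ALL=C

# Each input line is a pair "A B", meaning A must come before B.
topo_order=$(printf '%s\n' \
    'configure compile' \
    'compile link' \
    'link install' | tsort)
echo "$topo_order"      # Dependency order.

# The same four words through plain sort come out alphabetically instead.
alpha_order=$(printf '%s\n' configure compile link install | sort)
echo "$alpha_order"
```

When the pairs do not pin down a single ordering, tsort may print any ordering that respects them, so scripts should not rely on one particular result in that case.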

uniq

This filter removes duplicate lines from a sorted file. It is often seen in a pipe coupled with sort.


cat list-1 list-2 list-3 | sort | uniq > final.list
# Concatenates the list files,
# sorts them,
# removes duplicate lines,
# and finally writes the result to an output file.


bash$ cat testfile
This line occurs only once.
This line occurs twice.
This line occurs twice.
This line occurs three times.
This line occurs three times.
This line occurs three times.
bash$ uniq -c testfile
1 This line occurs only once.
2 This line occurs twice.
3 This line occurs three times.
bash$ sort testfile | uniq -c | sort -nr
3 This line occurs three times.
2 This line occurs twice.
1 This line occurs only once.


#!/bin/bash
# wf.sh: Crude word frequency analysis on a text file.
# This is a more efficient version of the "wf2.sh" script.
# Check for input file on command-line.
ARGS=1
E_BADARGS=85
E_NOFILE=86
if [ $# -ne "$ARGS" ] # Correct number of arguments passed to script?
then
echo "Usage: `basename $0` filename"
exit $E_BADARGS
fi
if [ ! -f "$1" ] # Check if file exists.
then
echo "File \"$1\" does not exist."
exit $E_NOFILE
fi
########################################################
# main ()
sed -e 's/\.//g' -e 's/\,//g' -e 's/ /\
/g' "$1" | tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr
# =========================
# Frequency of occurrence
# Filter out periods and commas, and
#+ change space between words to linefeed,
#+ then shift characters to lowercase, and
#+ finally prefix occurrence count and sort numerically.
# Arun Giridhar suggests modifying the above to:
# . . . | sort | uniq -c | sort +1 [-f] | sort +0 -nr
# This adds a secondary sort key, so instances of
#+ equal occurrence are sorted alphabetically.
# As he explains it:
# "This is effectively a radix sort, first on the
#+ least significant column
#+ (word or string, optionally case-insensitive)
#+ and last on the most significant column (frequency)."
#
# As Frank Wang explains, the above is equivalent to
#+ . . . | sort | uniq -c | sort +0 -nr
#+ and the following also works:
#+ . . . | sort | uniq -c | sort -k1nr -k
########################################################
exit 0
# Exercises:
# ---------
# 1) Add 'sed' commands to filter out other punctuation,
#+ such as semicolons.
# 2) Modify the script to also filter out multiple spaces and
#+ other whitespace.
bash$ cat testfile
This line occurs only once.
This line occurs twice.
This line occurs twice.
This line occurs three times.
This line occurs three times.
This line occurs three times.
bash$ ./wf.sh testfile
6 this
6 occurs
6 line
3 times
3 three
2 twice
1 only
1 once
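The comments in wf.sh mention adding a secondary sort key. The "+1" and "+0" key syntax is obsolete and may not be accepted by current versions of sort, so the sketch below uses -k instead. The word list is made up, and the final awk step is only there to normalize the leading padding that uniq -c prints.

```shell
#!/bin/bash
export LC_ALL=C

# Hypothetical word list, one word per line, as wf.sh produces after 'tr'.
words='this
occurs
line
this
line
occurs
once'

# Primary key: count (field 1, numeric, descending).
# Secondary key: the word itself (field 2 onward, ascending).
ranked=$(printf '%s\n' "$words" | sort | uniq -c |
         sort -k1,1nr -k2 | awk '{ print $1, $2 }')
echo "$ranked"
```

With a single "sort -nr", words that share a count can appear in an order that is hard to predict; the second key makes ties come out alphabetically.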

expand, unexpand

The expand filter converts tabs to spaces. It is often used in a pipe.

The unexpand filter converts spaces to tabs. This reverses the effect of expand.
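A small round trip may help here. This sketch assumes a tab width of 4, which is passed explicitly with -t (the default tab width is 8), and uses a made-up one-word line.

```shell
#!/bin/bash
# A line that starts with a single tab.
indented=$(printf '\tdone\n' | expand -t 4)
printf '[%s]\n' "$indented"       # The tab is now four spaces.

# unexpand turns the leading spaces back into a tab.
restored=$(printf '%s\n' "$indented" | unexpand -t 4)
printf '%s\n' "$restored" | od -c | head -n 1   # Shows \t followed by the word.
```

By default unexpand only converts leading blanks; the -a option asks it to convert blanks elsewhere in the line as well.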

cut

A tool for extracting fields from files. It is similar to the print $N command set in awk, but more limited. It may be simpler to use cut in a script than awk. Particularly important are the -d (delimiter) and -f (field specifier) options.

Using cut to obtain a listing of the mounted filesystems:


cut -d ' ' -f1,2 /etc/mtab


Using cut to list the OS and kernel version:
uname -a | cut -d" " -f1,3,11,12


Using cut to extract message headers from an e-mail folder:
bash$ grep '^Subject:' read-messages | cut -c10-80
Re: Linux suitable for mission-critical apps?
MAKE MILLIONS WORKING AT
Spam complaint
Re: Spam complaint
# Note: cut -c extracts by character position; see "man cut".
Using cut to parse a file:
# List all the users in /etc/passwd.
FILENAME=/etc/passwd
for user in $(cut -d: -f1 $FILENAME)
do
echo $user
done
# Thanks, Oleg Philon for suggesting this.
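One way to see why cut is described as more limited than awk: cut treats every single delimiter as a field boundary, while awk, by default, collapses runs of whitespace. The sample strings below are invented for the illustration.

```shell
#!/bin/bash
line='alpha   beta gamma'     # Three spaces between the first two words.

# cut: field 2 is the empty string between the first and second space.
cut_field=$(printf '%s\n' "$line" | cut -d ' ' -f2)
printf 'cut: [%s]\n' "$cut_field"

# awk: runs of blanks count as one separator, so field 2 is the second word.
awk_field=$(printf '%s\n' "$line" | awk '{ print $2 }')
printf 'awk: [%s]\n' "$awk_field"

# With a strict single-character delimiter, cut is the simpler tool.
login_shell=$(printf 'root:x:0:0:root:/root:/bin/bash\n' | cut -d: -f1,7)
echo "$login_shell"
```

For whitespace-aligned output, squeezing the blanks first with tr -s ' ' is a common workaround before piping into cut.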

head

Lists the beginning of a file to stdout. The default is 10 lines, but a different number can be specified. The command has a number of interesting options.
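A quick check of the default and of an explicit line count, using a made-up list of twelve numbers as input:

```shell
#!/bin/bash
numbers=$(printf '%s\n' 1 2 3 4 5 6 7 8 9 10 11 12)

# With no option, head prints the first 10 lines.
default_count=$(printf '%s\n' "$numbers" | head | wc -l | tr -d ' ')
echo "Lines printed by default: $default_count"

# -n selects a different number of lines.
first_three=$(printf '%s\n' "$numbers" | head -n 3)
echo "$first_three"
```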


#!/bin/bash
# script-detector.sh: Detects scripts within a directory.
TESTCHARS=2 # Test first 2 characters.
SHABANG='#!' # Scripts begin with a "sha-bang."
for file in * # Traverse all the files in current directory.
do
if [[ `head -c$TESTCHARS "$file"` = "$SHABANG" ]]
# head -c2 #!
# The '-c' option to "head" outputs a specified
#+ number of characters, rather than lines (the default).
then
echo "File \"$file\" is a script."
else
echo "File \"$file\" is *not* a script."
fi
done
exit 0
# Exercises:
# ---------
# 1) Modify this script to take as an optional argument
#+ the directory to scan for scripts
#+ (rather than just the current working directory).
#
# 2) As it stands, this script gives "false positives" for
#+ Perl, awk, and other scripting language scripts.
# Correct this.


Example 16-14. Generating 10-digit random numbers
#!/bin/bash
# rnd.sh: Outputs a 10-digit random number
# Script by Stephane Chazelas.
head -c4 /dev/urandom | od -N4 -tu4 | sed -ne '1s/.* //p'
# =================================================================== #
# Analysis
# --------
# head:
# -c4 option takes first 4 bytes.
# od:
# -N4 option limits output to 4 bytes.
# -tu4 option selects unsigned decimal format for output.
# sed:
# -n option, in combination with "p" flag to the "s" command,
# outputs only matched lines.
# The author of this script explains the action of 'sed', as follows.
# head -c4 /dev/urandom | od -N4 -tu4 | sed -ne '1s/.* //p'
# ----------------------------------> |
# Assume output up to "sed" --------> |
# is 0000000 1198195154\n
# sed begins reading characters: 0000000 1198195154\n.
# Here it finds a newline character,
#+ so it is ready to process the first line (0000000 1198195154).
# It looks at its <range><action>s. The first and only one is
# range action
# 1 s/.* //p
# The line number is in the range, so it executes the action:
#+ tries to substitute the longest string ending with a space in the line
# ("0000000 ") with nothing (//), and if it succeeds, prints the result
# ("p" is a flag to the "s" command here, this is different
#+ from the "p" command).
# sed is now ready to continue reading its input. (Note that before
#+ continuing, if -n option had not been passed, sed would have printed
#+ the line once again).
# Now, sed reads the remainder of the characters, and finds the
#+ end of the file.
# It is now ready to process its 2nd line (which is also numbered '$' as
#+ it's the last one).
# It sees it is not matched by any <range>, so its job is done.
# In a few words, this sed command means:
# "On the first line only, remove any character up to the right-most space,
#+ then print it."
# A better way to do this would have been:
# sed -e 's/.* //;q'
# Here, two <range><action>s (could have been written
# sed -e 's/.* //

tail

Lists the (tail) end of a file to stdout. The default is 10 lines.


#!/bin/bash
filename=sys.log
cat /dev/null > $filename; echo "Creating / cleaning out file."
# Creates the file if it does not already exist,
#+ and truncates it to zero length if it does.
# : > filename and > filename also work.
tail /var/log/messages > $filename
# /var/log/messages must have world read permission for this to work.
echo "$filename contains tail end of system log."
exit 0

grep

A multi-purpose file search tool that uses Regular Expressions. It was originally a command/filter in the venerable ed line editor: g/re/p -- global - regular expression - print.


bash$ grep '[rst]ystem.$' osinfo.txt
The GPL governs the distribution of the Linux operating system.


bash$ ps ax | grep clock
765 tty1 S 0:00 xclock
901 pts/1 S 0:00 grep clock
The -i option causes a case-insensitive search.
The -w option matches only whole words.
The -l option lists only the files in which matches were found, but not the matching lines.
The -r (recursive) option searches files in the current working directory and all subdirectories below it.
The -n option lists the matching lines, together with line numbers.
bash$ grep -n Linux osinfo.txt
2:This is a file containing information about Linux.
6:The GPL governs the distribution of the Linux operating system.
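The options listed above can be tried out on throwaway files. In this sketch the directory, file names, and file contents are all made up; only the grep options themselves come from the notes.

```shell
#!/bin/bash
tmpdir=$(mktemp -d)
printf 'Linux is a kernel.\nlinux-like systems exist.\nGNU is not Unix.\n' \
    > "$tmpdir/osinfo.txt"
printf 'Nothing relevant here.\n' > "$tmpdir/other.txt"

# -i: case-insensitive, so both "Linux" and "linux" match.
insensitive=$(grep -i 'linux' "$tmpdir/osinfo.txt")

# -w: whole words only, so the "is" inside "exist" does not count.
whole=$(grep -w 'is' "$tmpdir/osinfo.txt")

# -r with -l: search the directory tree, list only the names of matching files.
files=$(cd "$tmpdir" && grep -rl 'Linux' .)

# -n: prefix each matching line with its line number.
numbered=$(grep -n 'Unix' "$tmpdir/osinfo.txt")

printf '%s\n' "$insensitive" "$whole" "$files" "$numbered"
rm -r "$tmpdir"
```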


The -v (or --invert-match) option filters out matches.
grep pattern1 *.txt | grep -v pattern2
# Matches all lines in "*.txt" files containing "pattern1",
# but ***not*** "pattern2".


grep -c txt *.sgml # (number of lines containing "txt" in each of the "*.sgml" files)
# grep -cz .
# ^ dot
# means count (-c) zero-separated (-z) items matching "."
# that is, non-empty ones (containing at least 1 character).
#
printf 'a b\nc d\n\n\n\n\n\000\n\000e\000\000\nf' | grep -cz . # 3
printf 'a b\nc d\n\n\n\n\n\000\n\000e\000\000\nf' | grep -cz '$' # 5
printf 'a b\nc d\n\n\n\n\n\000\n\000e\000\000\nf' | grep -cz '^' # 5
#
printf 'a b\nc d\n\n\n\n\n\000\n\000e\000\000\nf' | grep -c '$' # 9
# By default, newline chars (\n) separate items to match.
# Note that the -z option is GNU "grep" specific.
# Thanks, S.C.


tr

Character translation filter.

Caution: Must use quoting and/or brackets, as appropriate. Quotes prevent the shell from reinterpreting the special characters in tr command sequences. Brackets should be quoted to prevent expansion by the shell.

Either tr "A-Z" "*" <filename or tr A-Z \* <filename changes all the uppercase letters in filename to asterisks (writes to stdout). On some systems this may not work, but tr A-Z '[**]' will.

The -d option deletes a range of characters.


echo "abcdef" # abcdef
echo "abcdef" | tr -d b-d # aef
tr -d 0-9 <filename
# Deletes all digits from the file "filename".


# The --squeeze-repeats (or -s) option deletes all but the first instance of a string of consecutive characters. This option is useful for removing excess whitespace.
bash$ echo "XXXXX" | tr --squeeze-repeats 'X'
X
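For the whitespace case mentioned above, a minimal sketch with a made-up input string:

```shell
#!/bin/bash
# Squeeze each run of spaces down to a single space.
squeezed=$(printf 'too    many   spaces\n' | tr -s ' ')
echo "$squeezed"
```

Only runs of the listed character are squeezed; other repeated characters in the input are left alone.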


# The -c "complement" option inverts the character set to match. With this option, tr acts only upon those characters not matching the specified set.
# -c stands for "complement", similar to the inverted match that -v gives in other commands.
bash$ echo "acfdeb123" | tr -c b-d +
+c+d+b++++


toupper: Transforms a file to all uppercase.
#!/bin/bash
# Changes a file to all uppercase.
E_BADARGS=85
if [ -z "$1" ] # Standard check for command-line arg.
then
echo "Usage: `basename $0` filename"
exit $E_BADARGS
fi
tr a-z A-Z <"$1"
# Case conversion.
# Same effect as above, but using POSIX character set notation:
# tr '[:lower:]' '[:upper:]' <"$1"
# Thanks, S.C.
# Or even . . .
# cat "$1" | tr a-z A-Z
# Or dozens of other ways . . .
exit 0
# Exercise:
# Rewrite this script to give the option of changing a file
#+ to *either* upper or lowercase.
# Hint: Use either the "case" or "select" command.


#!/bin/bash
#
# Changes every filename in working directory to all lowercase.
#
# Inspired by a script of John Dubois,
#+ which was translated into Bash by Chet Ramey,
#+ and considerably simplified by the author of the ABS Guide.
for filename in * # Traverse all files in directory.
do
fname=`basename $filename`
n=`echo $fname | tr A-Z a-z` # Change name to lowercase.
if [ "$fname" != "$n" ] # Name is not already all lowercase.
then
mv $fname $n
fi
done
exit $?
# The two versions of the code (above and below) are equivalent.
# Code below this line will not execute because of "exit".
#--------------------------------------------------------#
# To run it, delete script above line.
# The above script will not work on filenames containing blanks or newlines.
# Stephane Chazelas therefore suggests the following alternative:
for filename in * # Not necessary to use basename,
# since "*" won't return any file containing "/".
do n=`echo "$filename/" | tr '[:upper:]' '[:lower:]'`
# POSIX char set notation.
# Slash added so that trailing newlines are not
# removed by command substitution.
# Variable substitution:
n=${n%/} # Removes trailing slash, added above, from filename.
[[ $filename == $n ]] || mv "$filename" "$n"
# Checks if filename already lowercase.
done
exit $?


#du: DOS to UNIX text file conversion.
#!/bin/bash
# Du.sh: DOS to UNIX text file converter.
E_WRONGARGS=85
if [ -z "$1" ]
then
echo "Usage: `basename $0` filename-to-convert"
exit $E_WRONGARGS
fi
NEWFILENAME=$1.unx
CR='\015' # Carriage return.
# 015 is octal ASCII code for CR.
# Lines in a DOS text file end in CR-LF.
# Lines in a UNIX text file end in LF only.
tr -d "$CR" < "$1" > "$NEWFILENAME"
# Delete CR
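To check that the CR deletion does what the script claims, one can count carriage returns before and after. This is only a sketch: the two-line sample text and the helper function dos_sample are made up for the test.

```shell
#!/bin/bash
# Hypothetical helper: emits two lines with DOS (CR-LF) line endings.
dos_sample() { printf 'first line\r\nsecond line\r\n'; }

# tr -cd '\015' deletes everything EXCEPT carriage returns, so wc -c counts them.
cr_before=$(dos_sample | tr -cd '\015' | wc -c | tr -d ' ')
cr_after=$(dos_sample | tr -d '\015' | tr -cd '\015' | wc -c | tr -d ' ')
echo "CRs before: $cr_before, after: $cr_after"

# The converted text keeps its LF line endings.
unix_text=$(dos_sample | tr -d '\015')
echo "$unix_text"
```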

fmt

Simple-minded file formatter, used as a filter in a pipe to "wrap" long lines of text output.


#!/bin/bash
WIDTH=40 # 40 columns wide.
b=`ls /usr/local/bin` # Get a file listing...
echo $b | fmt -w $WIDTH
# Could also have been done by
# echo $b | fold - -s -w $WIDTH
exit 0
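The script above depends on the contents of /usr/local/bin, so here is a self-contained sketch of the same idea on a fixed, made-up sentence. It compares fmt with the fold alternative mentioned in the comments and measures the longest resulting line rather than assuming where the breaks fall.

```shell
#!/bin/bash
WIDTH=40
text='The quick brown fox jumps over the lazy dog and then keeps running through the forest until dusk.'

# fmt re-flows the words into lines no wider than WIDTH.
wrapped=$(printf '%s\n' "$text" | fmt -w "$WIDTH")
echo "$wrapped"

# fold -s breaks at the last blank that fits within WIDTH columns.
folded=$(printf '%s\n' "$text" | fold -s -w "$WIDTH")
echo "$folded"

# Hypothetical helper: length of the longest line read from stdin.
longest_line() { awk '{ if (length($0) > max) max = length($0) } END { print max + 0 }'; }
fmt_longest=$(printf '%s\n' "$wrapped" | longest_line)
fold_longest=$(printf '%s\n' "$folded" | longest_line)
fmt_lines=$(printf '%s\n' "$wrapped" | wc -l | tr -d ' ')
echo "fmt: $fmt_lines lines, longest $fmt_longest; fold: longest $fold_longest"
```

The two tools may choose different break points: fmt tries to balance line lengths, while fold simply cuts at the width limit.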
