16.4. Text Processing Commands (Study Notes)

2014-01-06  wuxiaobo_2009
Category: LINUX

Commands affecting text and text files

sort

File sort utility, often used as a filter in a pipe. This command sorts a text stream or file forwards or backwards, or according to various keys or character positions. Using the -m option, it merges presorted input files. The info page lists its many capabilities and options.
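A few common invocations, sketched here only as illustrations (scores.txt, sorted-a.txt, and sorted-b.txt are hypothetical file names):

sort -k2 -n -r scores.txt          # Sort on the 2nd field, numerically, in descending order.
sort -t: -k3 -n /etc/passwd        # Sort a colon-delimited file on its 3rd field (the UID).
sort -m sorted-a.txt sorted-b.txt > merged.txt
                                   # Merge two files that are each already sorted.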

tsort

Topological sort, reading in pairs of whitespace-separated strings and sorting according to input patterns. The original purpose of tsort was to sort a list of dependencies for an obsolete version of the ld linker in an "ancient" version of UNIX.

The results of a tsort will usually differ markedly from those of the standard sort command, above.
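As a minimal sketch of the input format: each whitespace-separated pair "x y" asserts that x must come before y, and tsort prints one ordering consistent with all the pairs (when several orderings are valid, the exact output may vary between implementations):

bash$ echo "a b b c a c" | tsort
a
b
c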

uniq

This filter removes duplicate lines from a sorted file. It is often seen in a pipe coupled with sort.


cat list-1 list-2 list-3 | sort | uniq > final.list
# Concatenates the list files,
# sorts them,
# removes duplicate lines,
# and finally writes the result to an output file.



bash$ cat testfile
This line occurs only once.
This line occurs twice.
This line occurs twice.
This line occurs three times.
This line occurs three times.
This line occurs three times.


bash$ uniq -c testfile
      1 This line occurs only once.
      2 This line occurs twice.
      3 This line occurs three times.


bash$ sort testfile | uniq -c | sort -nr
      3 This line occurs three times.
      2 This line occurs twice.
      1 This line occurs only once.



#!/bin/bash
# wf.sh: Crude word frequency analysis on a text file.
# This is a more efficient version of the "wf2.sh" script.


# Check for input file on command-line.
ARGS=1
E_BADARGS=85
E_NOFILE=86

if [ $# -ne "$ARGS" ]  # Correct number of arguments passed to script?
then
  echo "Usage: `basename $0` filename"
  exit $E_BADARGS
fi

if [ ! -f "$1" ]       # Check if file exists.
then
  echo "File \"$1\" does not exist."
  exit $E_NOFILE
fi



########################################################
# main ()
sed -e 's/\.//g' -e 's/\,//g' -e 's/ /\
/g' "$1" | tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr
#                           =========================
#                            Frequency of occurrence

#  Filter out periods and commas, and
#+ change space between words to linefeed,
#+ then shift characters to lowercase, and
#+ finally prefix occurrence count and sort numerically.

#  Arun Giridhar suggests modifying the above to:
#  . . . | sort | uniq -c | sort +1 [-f] | sort +0 -nr
#  This adds a secondary sort key, so instances of
#+ equal occurrence are sorted alphabetically.
#  As he explains it:
#  "This is effectively a radix sort, first on the
#+ least significant column
#+ (word or string, optionally case-insensitive)
#+ and last on the most significant column (frequency)."
#
#  As Frank Wang explains, the above is equivalent to
#+       . . . | sort | uniq -c | sort +0 -nr
#+ and the following also works:
#+       . . . | sort | uniq -c | sort -k1nr -k
########################################################

exit 0

# Exercises:
# ---------
# 1) Add 'sed' commands to filter out other punctuation,
#+   such as semicolons.
# 2) Modify the script to also filter out multiple spaces and
#+   other whitespace.

bash$ cat testfile
This line occurs only once.
This line occurs twice.
This line occurs twice.
This line occurs three times.
This line occurs three times.
This line occurs three times.


bash$ ./wf.sh testfile
      6 this
      6 occurs
      6 line
      3 times
      3 three
      2 twice
      1 only
      1 once

expand, unexpand

The expand filter converts tabs to spaces. It is often used in a pipe.

The unexpand filter converts spaces to tabs. This reverses the effect of expand.
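A few illustrative one-liners (tabbed-file.txt, spaced-file.txt, and retabbed-file.txt are hypothetical names):

expand tabbed-file.txt > spaced-file.txt      # Tabs to spaces, default 8-column tab stops.
expand -t 4 tabbed-file.txt                   # Use 4-column tab stops instead.
unexpand spaced-file.txt > retabbed-file.txt  # Leading spaces back to tabs.
unexpand -a spaced-file.txt                   # Convert all runs of spaces, not just leading ones.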

cut

A tool for extracting fields from files. It is similar to the print $N command set in awk, but more limited. It may be simpler to use cut in a script than awk. Particularly important are the -d (delimiter) and -f (field specifier) options.

Using cut to obtain a listing of the mounted filesystems:



cut -d ' ' -f1,2 /etc/mtab


Using cut to list the OS and kernel version:

uname -a | cut -d" " -f1,3,11,12



Using cut to extract message headers from an e-mail folder:

bash$ grep '^Subject:' read-messages | cut -c10-80
Re: Linux suitable for mission-critical apps?
MAKE MILLIONS WORKING AT
Spam complaint
Re: Spam complaint
# --- cut -c extracts a range of characters by position; see "man cut".

Using cut to parse a file:

# List all the users in /etc/passwd.

FILENAME=/etc/passwd

for user in $(cut -d: -f1 $FILENAME)
do
  echo $user
done

# Thanks, Oleg Philon for suggesting this.
head

Lists the beginning of a file to stdout. The default is 10 lines, but a different number can be specified. The command has a number of interesting options.
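For instance (logfile.txt is a made-up name, and the last form works only with GNU head):

head -n 3 /etc/passwd     # First 3 lines instead of the default 10.
head -c 16 /etc/passwd    # First 16 bytes rather than lines.
head -n -5 logfile.txt    # Everything except the last 5 lines (GNU head only).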



#!/bin/bash
# script-detector.sh: Detects scripts within a directory.

TESTCHARS=2    # Test first 2 characters.
SHABANG='#!'   # Scripts begin with a "sha-bang."

for file in *  # Traverse all the files in current directory.
do
  if [[ `head -c$TESTCHARS "$file"` = "$SHABANG" ]]
  #      head -c2                      #!
  #  The '-c' option to "head" outputs a specified
  #+ number of characters, rather than lines (the default).
  then
    echo "File \"$file\" is a script."
  else
    echo "File \"$file\" is *not* a script."
  fi
done

exit 0

# Exercises:
# ---------
# 1) Modify this script to take as an optional argument
#+   the directory to scan for scripts
#+   (rather than just the current working directory).
#
# 2) As it stands, this script gives "false positives" for
#+   Perl, awk, and other scripting language scripts.
#    Correct this.


Example 16-14. Generating 10-digit random numbers

#!/bin/bash
# rnd.sh: Outputs a 10-digit random number

# Script by Stephane Chazelas.

head -c4 /dev/urandom | od -N4 -tu4 | sed -ne '1s/.* //p'


# =================================================================== #

# Analysis
# --------

# head:
# -c4 option takes first 4 bytes.

# od:
# -N4 option limits output to 4 bytes.
# -tu4 option selects unsigned decimal format for output.

# sed:
# -n option, in combination with "p" flag to the "s" command,
# outputs only matched lines.



# The author of this script explains the action of 'sed', as follows.

# head -c4 /dev/urandom | od -N4 -tu4 | sed -ne '1s/.* //p'
# ----------------------------------> |

# Assume output up to "sed" --------> |
# is 0000000 1198195154\n

#  sed begins reading characters: 0000000 1198195154\n.
#  Here it finds a newline character,
#+ so it is ready to process the first line (0000000 1198195154).
#  It looks at its <range><action>s. The first and only one is

#   range     action
#   1         s/.* //p

#  The line number is in the range, so it executes the action:
#+ tries to substitute the longest string ending with a space in the line
#  ("0000000 ") with nothing (//), and if it succeeds, prints the result
#  ("p" is a flag to the "s" command here, this is different
#+ from the "p" command).

#  sed is now ready to continue reading its input. (Note that before
#+ continuing, if -n option had not been passed, sed would have printed
#+ the line once again).

#  Now, sed reads the remainder of the characters, and finds the
#+ end of the file.
#  It is now ready to process its 2nd line (which is also numbered '$' as
#+ it's the last one).
#  It sees it is not matched by any <range>, so its job is done.

#  In a few words, this sed command means:
#  "On the first line only, remove any character up to the right-most space,
#+ then print it."

# A better way to do this would have been:
#           sed -e 's/.* //;q'

# Here, two <range><action>s (could have been written
#           sed -e 's/.* //' -e q) do the same job.

tail

Lists the (tail) end of a file to stdout. The default is the last 10 lines, but this can be changed with the -n option.

#!/bin/bash

filename=sys.log

cat /dev/null > $filename; echo "Creating / cleaning out file."
#  Creates the file if it does not already exist,
#+ and truncates it to zero length if it does.
#  : > filename and > filename also work.

tail /var/log/messages > $filename
# /var/log/messages must have world read permission for this to work.

echo "$filename contains tail end of system log."

exit 0
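A few more tail invocations, for illustration (data.txt is a hypothetical file):

tail -n 20 /var/log/messages   # Last 20 lines instead of the default 10.
tail -n +2 data.txt            # Everything from line 2 on (skip a header line).
tail -f /var/log/messages      # Follow the file, printing new lines as they are appended.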
grep

A multi-purpose file search tool that uses regular expressions. It was originally a command/filter in the venerable ed line editor: g/re/p -- global - regular expression - print.



bash$ grep '[rst]ystem.$' osinfo.txt
The GPL governs the distribution of the Linux operating system.


bash$ ps ax | grep clock
765 tty1      S      0:00 xclock
901 pts/1     S      0:00 grep clock


The -i option causes a case-insensitive search.

The -w option matches only whole words.

The -l option lists only the files in which matches were found, but not the matching lines.

The -r (recursive) option searches files in the current working directory and all subdirectories below it.

The -n option lists the matching lines, together with line numbers.

bash$ grep -n Linux osinfo.txt
2:This is a file containing information about Linux.
6:The GPL governs the distribution of the Linux operating system.
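Brief sketches of the options just described (osinfo.txt is the sample file used above; the other names are only illustrative):

grep -i linux osinfo.txt    # Case-insensitive: matches "Linux", "LINUX", "linux", . . .
grep -w GPL osinfo.txt      # Matches "GPL" only as a whole word.
grep -l Linux *.txt         # Lists only the names of files containing a match.
grep -r Linux .             # Searches the current directory and all subdirectories.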


The -v (or --invert-match) option filters out matches.

grep pattern1 *.txt | grep -v pattern2

# Matches all lines in "*.txt" files containing "pattern1",
# but ***not*** "pattern2".



grep -c txt *.sgml   # (number of occurrences of "txt" in "*.sgml" files)


#   grep -cz .
#            ^ dot
#   means count (-c) zero-separated (-z) items matching "."
#   that is, non-empty ones (containing at least 1 character).
#
printf 'a b\nc d\n\n\n\n\n\000\n\000e\000\000\nf' | grep -cz .     # 3
printf 'a b\nc d\n\n\n\n\n\000\n\000e\000\000\nf' | grep -cz '$'   # 5
printf 'a b\nc d\n\n\n\n\n\000\n\000e\000\000\nf' | grep -cz '^'   # 5
#
printf 'a b\nc d\n\n\n\n\n\000\n\000e\000\000\nf' | grep -c '$'    # 9
# By default, newline chars (\n) separate items to match.

# Note that the -z option is GNU "grep" specific.


# Thanks, S.C.


tr

Character translation filter.

Caution: Must use quoting and/or brackets, as appropriate. Quotes prevent the shell from reinterpreting the special characters in tr command sequences. Brackets should be quoted to prevent expansion by the shell.

Either tr "A-Z" "*" <filename or tr A-Z \* <filename changes all the uppercase letters in filename to asterisks (writes to stdout). On some systems this may not work, but tr A-Z '[**]' will.

The -d option deletes a range of characters.


echo "abcdef"                  # abcdef
echo "abcdef" | tr -d b-d      # aef


tr -d 0-9 <filename
# Deletes all digits from the file "filename".



The --squeeze-repeats (or -s) option deletes all but the first instance of a string of consecutive characters. This option is useful for removing excess whitespace.

bash$ echo "XXXXX" | tr --squeeze-repeats 'X'
X


The -c "complement" option inverts the character set to match. With this option, tr acts only upon those characters not matching the specified set.
# -c means "complement" -- similar in spirit to the -v invert-match option of other commands.

bash$ echo "acfdeb123" | tr -c b-d +
+c+d+b++++


toupper: Transforms a file to all uppercase.

#!/bin/bash
# Changes a file to all uppercase.

E_BADARGS=85

if [ -z "$1" ]  # Standard check for command-line arg.
then
  echo "Usage: `basename $0` filename"
  exit $E_BADARGS
fi

tr a-z A-Z <"$1"
# Converts lowercase letters to uppercase.

# Same effect as above, but using POSIX character set notation:
#        tr '[:lower:]' '[:upper:]' <"$1"
# Thanks, S.C.

# Or even . . .
#        cat "$1" | tr a-z A-Z
# Or dozens of other ways . . .

exit 0

# Exercise:
# Rewrite this script to give the option of changing a file
#+ to *either* upper or lowercase.
# Hint: Use either the "case" or "select" command.



#!/bin/bash
#
# Changes every filename in working directory to all lowercase.
#
# Inspired by a script of John Dubois,
#+ which was translated into Bash by Chet Ramey,
#+ and considerably simplified by the author of the ABS Guide.


for filename in *                # Traverse all files in directory.
do
   fname=`basename $filename`
   n=`echo $fname | tr A-Z a-z`  # Change name to lowercase.
   if [ "$fname" != "$n" ]       # If the name is not already lowercase . . .
   then
     mv $fname $n
   fi
done

exit $?

# The two code segments (above and below) do the same job.
# Code below this line will not execute because of "exit".
#--------------------------------------------------------#
# To run it, delete script above line.

# The above script will not work on filenames containing blanks or newlines.
# Stephane Chazelas therefore suggests the following alternative:


for filename in *    # Not necessary to use basename,
                     # since "*" won't return any file containing "/".
do n=`echo "$filename/" | tr '[:upper:]' '[:lower:]'`
                     # POSIX char set notation.
                     # Slash added so that trailing newlines are not
                     # removed by command substitution.
   # Variable substitution:
   n=${n%/}          # Removes trailing slash, added above, from filename.
   [[ $filename == $n ]] || mv "$filename" "$n"
                     # Checks if filename already lowercase.
done

exit $?




du: DOS to UNIX text file conversion.

#!/bin/bash
# Du.sh: DOS to UNIX text file converter.

E_WRONGARGS=85

if [ -z "$1" ]
then
  echo "Usage: `basename $0` filename-to-convert"
  exit $E_WRONGARGS
fi

NEWFILENAME=$1.unx

CR='\015'  # Carriage return.
           # 015 is octal ASCII code for CR.
           # Lines in a DOS text file end in CR-LF.
           # Lines in a UNIX text file end in LF only.

tr -d $CR < $1 > $NEWFILENAME
# Delete CRs.

fmt

Simple-minded file formatter, used as a filter in a pipe to "wrap" long lines of text output.



#!/bin/bash

WIDTH=40                 # 40 columns wide.

b=`ls /usr/local/bin`    # Get a file listing...

echo $b | fmt -w $WIDTH

# Could also have been done by
#    echo $b | fold - -s -w $WIDTH

exit 0

