All about awk

Date: 2006-09-13 Source: 蜡笔pearly

Outline

General structure of awk scripts
Elementary awk programming

Elementary examples

Advanced awk programming

Advanced examples

Important things which will bite you

General structure of awk (Aho, Weinberger, and Kernighan)

awk, oawk, nawk, gawk, mawk

The original version, based on the first edition of The awk Programming Language was called awk
2nd edition of book led to nawk
Unices usually ship with three different names for awk: oawk, nawk, and awk; either oawk=awk or nawk=awk.
gawk is the FSF version.
mawk is a speedier rewrite which does a partial compilation

The awk command line is:

awk [program|-f programfile] [flags/variables] [files]

Command line flags

-f file -- Read the awk script from the specified file rather than the command line
-F re -- Use the given regular expression re as the field separator rather than the default "white space"
variable=value -- Initialize the awk variable with the specified value
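
A minimal sketch of these flags in use, assuming a POSIX shell and any standard awk. The helper name big_files and the colon-separated sample data are invented for illustration; they are not part of awk.

```shell
# big_files is a hypothetical helper: print names whose size field is >= the given minimum.
big_files() {
    # -F: makes the colon the field separator; min=... initializes the awk variable min
    awk -F: '$2 + 0 >= min + 0 { print $1 }' min="$1"
}

printf 'a.txt:120\nb.txt:80\n' | big_files 100
```

Adding 0 to both sides forces a numeric comparison, so 80 is not treated as greater than 100 the way a string comparison would treat it.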

An awk program consists of one or more awk commands separated by either \n or semicolons.

The structure of awk commands

Each awk command consists of a selector and/or an action; both may not be omitted in the same command. Braces surround the action.
selector [only] -- action is print
{action}[only] -- selector is every line
selector {action} -- perform action on each line where selector is true
Each action may have multiple statements separated from each other by semicolons or \n

Line selection

A selector is either zero, one, or two selection criteria; in the latter case the criteria are separated by commas
A selection criterion may be either an RE or a boolean expression (BE) which evaluates to true or false
Commands which have no selection criteria are applied to each line of the input data set
Commands which have one selection criterion are applied to every line which matches or makes true the criterion depending upon whether the criterion is an RE or a BE
Commands which have two selection criteria are applied to the first line which matches the first criterion, the next line which matches the second criterion and all the lines between them.
Unless a prior applied command has a next in it, every selector is tested against every line of the input data set.
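
The three kinds of selector described above can be tried on a few lines of input. This is only a sketch; the sample data is made up, and a POSIX shell is assumed.

```shell
sample='start
alpha
stop
beta'

# One RE criterion, no action: matching lines are printed
printf '%s\n' "$sample" | awk '/^al/'

# One boolean criterion with an explicit action
printf '%s\n' "$sample" | awk 'length($0) > 4 { print NR ": " $0 }'

# Two criteria: every line from the first /start/ match through the next /stop/ match
printf '%s\n' "$sample" | awk '/start/,/stop/'
```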

Processing

The BEGIN block(s) is(are) run (mawk's -v runs first)
Command line variables are assigned
For each line in the input data set

It is read and NR, NF, $i, etc. are set
For each command, its criteria are evaluated
If the criteria are true or match, the command is executed

After the input data set is exhausted, the END block(s) is(are) run
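
The order of processing can be observed with a small experiment. This is a sketch under the assumption of a POSIX shell; order_demo is a hypothetical name, and n is an arbitrary variable assigned on the command line.

```shell
# order_demo is a hypothetical helper that reports what each phase sees.
order_demo() {
    awk 'BEGIN   { print "begin: n=" n }      # runs before n=5 is assigned
         /skip/  { next }                     # later commands never see this record
                 { count++ }
         END     { print "end: n=" n ", count=" count ", NR=" NR }' n=5
}

printf 'a\nskip\nb\n' | order_demo
```

The BEGIN block prints an empty n because command line variables are assigned only after BEGIN has run; the END block sees n, the count of records that were not skipped, and the final value of NR.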

Elementary awk programming

Constants

Strings are enclosed in quotes (")
Numbers are written in the usual decimal way; non-integer values are indicated by including a period (.) in the representation.
REs are delimited by /

Variables

Need not be declared
May contain any type of data, their data type may change over the life of the program
Are named as any token beginning with a letter and continuing with letters, digits and underscores
As in C, case matters; since the built-in variables are all uppercase, avoid all-uppercase names for your own variables.
Some of the commonly used built-in variables are:

NR -- The current line's sequential number
NF -- The number of fields in the current line
FS -- The input field separator; defaults to whitespace and is reset by the -F command line parameter

Fields

Each record is separated into fields named $1, $2, etc
$0 is the entire record
NF contains the number of fields in the current line
FS contains the field separator RE; by default fields are separated by runs of white space, roughly the RE /[<SPACE><TAB>]+/, and leading and trailing white space is ignored
Fields may be accessed either by $n or by $var where var contains a value between 0 and NF
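
A short sketch of field access, assuming a POSIX shell; fields_demo is a hypothetical name and the input line is invented.

```shell
# fields_demo is a hypothetical helper showing $n, $var, $NF and field assignment.
fields_demo() {
    awk '{ n = 2
           print NF ": first=" $1 " second=" $n " last=" $NF
           $2 = "X"          # assigning a field rebuilds $0 using OFS
           print $0 }'
}

printf 'one two   three\n' | fields_demo
```

Note that the run of blanks between the second and third fields counts as a single separator, and that it is replaced by one blank once the record is rebuilt.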

print/printf

print prints each of the values of $1 through $NF separated by OFS then prints a \n onto stdout; the default value of OFS is a blank
print value value ... prints the value(s) in order and then puts out a \n onto stdout;
printf(format,value,value,...) prints the value(s) using the format supplied onto stdout, just like C. There is no default \n for each printf so multiples can be used to build a line. There must be as many values in the list as there are item descriptors in format.
Values in print or printf may be constants, variables, or expressions in any order
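
The difference between print and printf is easiest to see side by side. A sketch, assuming a POSIX shell; print_demo and the input line are invented for the example.

```shell
# print_demo is a hypothetical helper contrasting print and printf.
print_demo() {
    awk '{
        print $1, $2 * $3              # comma inserts OFS; print appends a newline
        print $1 $2                    # no comma: the values are concatenated
        printf("%-8s|", $1)            # printf adds no newline ...
        printf("%5.2f\n", $2 * $3)     # ... so several calls can build one line
    }'
}

printf 'widget 4 2.5\n' | print_demo
```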

Operators

awk has many of the same operators as C, excepting the bit operators. It also adds some text processing operators.

Built-in functions

substr(s,p,l) -- The substring of s starting at p and continuing for l characters
index(s1,s2) -- The first location of s2 within s1; 0 if not found
length(e) -- The length of e, converted to character string if necessary, in bytes
sin, cos -- Standard C trig functions (awk has no tan; use sin(x)/cos(x))
atan2(y,x) -- Standard quadrant oriented arctangent function; note that y comes first
exp, log -- Standard C exponential functions
srand(s), rand() -- Random number seed and access functions
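
A few of these functions in action, as a sketch that needs no input because everything runs in a BEGIN block. The name builtin_demo and the sample string are assumptions made for the example.

```shell
# builtin_demo is a hypothetical helper exercising the elementary built-ins.
builtin_demo() {
    awk 'BEGIN {
        s = "hello world"
        print length(s)                    # number of characters in s
        print index(s, "wor")              # position of the first match
        print substr(s, 7, 5)              # 5 characters starting at position 7
        print index(s, "xyz")              # 0 means not found
        printf("%.4f\n", atan2(0, -1))     # pi, from atan2(y, x)
    }'
}

builtin_demo
```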

Elementary examples and uses

length($0)>72 -- print all of the lines whose length exceeds 72 bytes
{$2="";print} -- remove the second field from each line
{print $2} -- print only the second field of each line
/Ucast/{print $1 "=" $NF} -- for each line which contains the string 'Ucast' print the first field, an equal sign and the last field (awk code to create awk code; a common trick)
BEGIN{FS="/"};NF<4 -- using '/' as a field separator, print only those records with less than four fields; when applied to the output of du, gives a two level summary
{n++;t+=$4};END{print n " " t} -- when applied to the output of an ls -l command provides a count and total size of the listed files; I use it as part of an alias for dir. Depending on your flavor of UNIX, the $4 may need to be changed to $5.
$0==prv{ct++;next} NR>1{printf("%8d %s\n",ct,prv)} {ct=1;prv=$0} END{if(NR)printf("%8d %s\n",ct,prv)} -- prints each unique record with a count of the number of occurrences of it; presumes input is sorted

Advanced awk programming

Program structure (if, for, while, etc.)

if(boolean) statement1 else statement2 if the boolean expression evaluates to true execute statement1, otherwise execute statement 2
for(v=init;boolean;v change) statement Standard C for loop, assigns v the value of init then while the boolean expression is true executes the statement then the v change
for(v in array) statement Assigns to v each of the values of the subscripts of array, not in any particular order, then executes statement
while(boolean) statement While the boolean expression is true, execute the statement
do statement while(boolean) execute statement, evaluate the boolean expression and if true, repeat
statement in any of the above constructs may be either a simple statement or a series of statements enclosed in {}, again like C; a further requirement is that the opening { must be on the line with the beginning keyword (if, for, while, do) either physically or logically via \ .
break -- exit from an enclosing for or while loop
continue -- restart the enclosing for or while loop from the top
next -- stop processing the current record, read the next record and begin processing with the first command
exit -- terminate all input processing and, if present, execute the END command
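
The constructs above combine much as they do in C. The following is a sketch only, assuming a POSIX shell; bars_demo and its input are invented. It skips comment records with next, finds the largest field with for and if, and draws a bar of that length with while.

```shell
# bars_demo is a hypothetical helper: print the largest field of each record and a bar.
bars_demo() {
    awk '
    /^#/ { next }                          # skip comment records entirely
    {
        max = $1
        for (i = 2; i <= NF; i++)
            if ($i > max) max = $i
        bar = ""
        i = 0
        while (i < max) { bar = bar "*"; i++ }
        print max, bar
    }'
}

printf '3 1 2\n# comment\n5 5\n' | bars_demo
```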

Arrays

There are two types of arrays in awk - standard and generalized
Standard arrays take the usual integer subscripts; awk does not fix a starting index, though built-ins such as split number elements from 1; multidimensional arrays are allowed and behave as expected
Generalized arrays take any type of variable(s) as subscripts, but the subscript(s) are treated as one long string expression.
The use of for(a in x) on a generalized array will return all of the valid subscripts in some order, not necessarily the one you wished.
The subscript separator is called SUBSEP and has a default value of "\034" (an unprintable character); each comma in a subscript list is replaced by it
Elements can be deleted from an array via the delete array[subscript] statement
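
A sketch of a generalized array with a two-part subscript, assuming a POSIX shell. The name sum_pairs and the sample records are made up; the point is that sum[$1, $2] is really indexed by the single string $1 SUBSEP $2.

```shell
# sum_pairs is a hypothetical helper: total the third field per (first, second) pair.
sum_pairs() {
    awk '
    { sum[$1, $2] += $3 }                  # subscript is $1 SUBSEP $2
    END {
        delete sum["b", "y"]               # drop one element
        for (k in sum) {                   # order of k is not specified
            split(k, part, SUBSEP)         # recover the two halves of the subscript
            print part[1], part[2], sum[k]
        }
    }'
}

printf 'a x 1\nb y 2\na x 3\n' | sum_pairs
```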

Built-in variables

FILENAME -- The name of the file currently being processed
OFS -- Output Field Separator default ' '
RS -- Input Record Separator default \n
ORS -- Output Record Separator default \n
FNR -- Current line's number with respect to the current file
OFMT -- Output format for printed numbers default %.6g
RSTART -- The location of the data matched using the match built-in function
RLENGTH -- The length of the data matched using the match built-in function

Built-in functions

gsub(re,sub,str) -- replace, in str, each occurrence of the regular expression re with sub; return the number of substitutions performed
int(expr) -- return the value of expr with all fractional parts removed
match(str,re) -- return the location in str where the regular expression re occurs and set RSTART and RLENGTH; if re is not found return 0
split(str,arrname,sep) -- split str into pieces using sep as the separator and assign the pieces in order to the elements from 1 up of arrname; use FS if sep is not given
sprintf(format,value,value,...) -- write the values, as the format indicates, into a string and return that string
sub(re,sub,str) -- replace, in str, the first occurrence of the regular expression re with sub; return 1 if successful, 0 otherwise
system(command) -- pass command to the local operating system to execute and return the exit status code returned by the operating system
tolower(str) -- return a copy of str with all capital letters changed to lower case
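
The string functions are often used together. A sketch, with text_demo and the sample string invented for the example: match locates a date, split breaks it apart, gsub rewrites it in place, and sprintf reassembles the pieces.

```shell
# text_demo is a hypothetical helper exercising match, split, gsub, sprintf and tolower.
text_demo() {
    awk 'BEGIN {
        s = "2006-09-13 awk notes"
        if (match(s, /[0-9]+-[0-9]+-[0-9]+/))
            date = substr(s, RSTART, RLENGTH)    # the matched text
        n = split(date, p, "-")                  # p[1], p[2], p[3]
        changed = gsub(/-/, "/", date)           # edits date in place
        print date, n, changed
        print sprintf("%02d.%02d", p[3], p[2]), tolower("AWK")
    }'
}

text_demo
```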

Other file I/O

print and printf may have > (or >>) filename or | command appended and the output will be sent to the named file or command; once a file is opened, it remains open until explicitly closed
getline var < filename will read the next line from filename into var. Again, once a file is opened, it remains so until it is explicitly closed
close(filename) explicitly closes the file named by the filename expression
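
A sketch of writing a file and reading it back within one awk program, assuming a POSIX shell with mktemp available. The variable names tmp, out and count are arbitrary. The close between writing and reading matters: without it the data may still be buffered and the file is still open for output.

```shell
tmp=$(mktemp)

count=$(printf 'b\na\n' | awk '
    { print $0 > out }                     # the file stays open across records
END {
    close(out)                             # finish writing before reading back
    while ((getline line < out) > 0) n++   # getline returns 1 per line, 0 at end of file
    close(out)
    print n
}' out="$tmp")

echo "$count lines written to and read back from $tmp"
rm -f "$tmp"
```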

Writing your own functions

A function begins with a function header of the form:

function name(argument(s), localvar(s)) {

and ends with the matching }
The value of the function is returned via a statement of the form:

return value

Functions do not have to return a value and the value returned by a function (either built-in or written locally) may be ignored by just placing the function with its arguments as a whole, separate statement
The local variables indicated in the localvars of the heading replace the global variables of the same name until the function completes, at which time the globals are restored
Functions may have side effects such as updating global variables, doing I/O or running other functions with side effects; beware the frumious bandersnatch
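
A small sketch of a user-written function with a local variable, assuming a POSIX shell. The names max_demo, max and tmp are chosen for the example; the extra spaces before tmp in the header are only a convention to mark it as a local rather than an argument.

```shell
# max_demo is a hypothetical helper defining and calling an awk function.
max_demo() {
    awk '
    # max(a, b) returns the larger argument; tmp is a local variable
    function max(a, b,    tmp) {
        tmp = a
        if (b > a) tmp = b
        return tmp
    }
    BEGIN {
        tmp = "global"
        print max(3, 7)
        print tmp             # unchanged: the local tmp did not touch the global
    }'
}

max_demo
```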

Advanced examples and uses

{ split($1,t,":")
 $1 = (t[1]*60+t[2])*60+t[3]
 print
}

Replaces an HH:MM:SS time stamp in the first field with a seconds since midnight value which can be more easily plotted, computed with, etc.

 { for(i = 1; i<=NF; i++) ct[$i] += 1 }
END { for(w in ct) {
 printf("%6d %s\n",ct[w],w)
 }
 }

This reads a file of text and prints each unique word along with the number of occurrences of the word in the text.

NR==1 { t0=$1; tp = $1; for(i=1;i<=nv;i++) dp[i] = $(i+1);next}
 { dt=$1-tp;
 tp = $1
 printf("%d ",$1-t0)
 for(i=1;i<=nv;i++) {
 printf("%d ",($(i+1)-dp[i])/dt)
 dp[i] = $(i+1)
 }
 printf("\n")
 }

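Take a set of time stamped data and convert the data from absolute time and counts to relative time and average counts. The variable nv, the number of data columns following the time stamp, is not set in the script and so must be supplied on the command line. The data is presumed to be all amenable to treatment as integers. If not, formats better than %d must be used.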

BEGIN{ printf("set term postscript\n") > "plots"
 printf("set output '|lpr -Php'\n") > "plots" }
 { if(system("test -s " $1 ".r") == 0) { # status 0: file exists and is not empty
 print "process1 " $1 ".r " $2
 printf("plot '%s.data' using 2:5 title '%s'\n",\
 $1,$3) > "plots"
 }
 }
END { print "gnuplot < plots" }

Write a pair of set lines to a file called plots. For each input line, if a file whose name is the first field on the line with a .r appended exists and is not empty, write a command to the stdout file containing the file name and the second field from the line; also write a plot statement to the file called plots using the first and third fields from the input line. After the file has been processed, add a gnuplot command to the stdout file. If all of the output is passed to sh or csh through a pipe, the commands will be executed.

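BEGIN { l[1]=25; l[2]=20; l[3]=50 }
/^[ABC]/ {
 i = index("ABC", substr($0,1,1))
 a = sprintf("%-50s", $0) # pad with blanks to the longest target length
 print substr(a,1,l[i])
 next # do not print the line a second time
 }
 { print }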

Make lines whose first characters are 'A', 'B', or 'C' have lengths of 25, 20, and 50 bytes respectively, changing no other lines.

/^\+/ { hold = hold "\r" substr($0,2); next}
 { if( unfirst ) print hold
 hold =""
 }
/^1/ { hold = "\f" }
/^0/ { hold = "\n" }
/^-/ { hold = "\n\n" }
 { unfirst = 1
 hold = hold substr($0,2)
 }
END { if(unfirst) print hold }

This routine will take FORTRAN-type output with leading ANSI vertical motion indicators and convert it to a stream with ASCII printer control sequences in it.

BEGIN { b=""; if(ll==0) ll=72 }
NF==0 { print b; b=""; print ""; next }
 { if(substr(b,length(b),1)=="-") {
 b=substr(b,1,length(b)-1) $0 }
 else b=b " " $0
 while(length(b)>ll) {
 i = ll
 while(substr(b,i,1)!=" ") i--
 print substr(b,1,i-1)
 b = substr(b,i+1)
 }
 }
END { print b; print "" }

This will take an arbitrary stream of text (where paragraphs are indicated by consecutive \n) and make all the lines approximately the same length. The default output line length is 72, but it may be set via a parameter on the awk command line. Both long and short lines are taken care of but extra spaces/tabs within the text are not correctly handled.

BEGIN { FS = "\t" # make tab the field separator
 printf("%10s %6s %5s %s\n\n",
 "COUNTRY", "AREA", "POP", "CONTINENT")
 }
 { printf("%10s %6d %5d %s\n", $1, $2, $3, $4)
 area = area +$2
 pop = pop + $3
 }
END { printf("\n%10s %6d %5d\n", "TOTAL", area, pop) }

This will take a variable width table of data with four tab separated fields and print it as a fixed length table with headings and totals.

Important things which will bite you

$1 inside the awk script is not $1 of the shell script; use variable assignment on the command line to move data from the shell to the awk script.
Actions go inside {}; selections do not
Every selection is applied to each input line after the previously selected actions have occurred; this means that a previous action can cause unexpected selections or selection misses.
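
The last point can be seen with two commands, where the first rewrites the record and so changes what the second selects. A sketch with invented data, assuming a POSIX shell; bite_demo is a hypothetical name.

```shell
# bite_demo is a hypothetical helper showing an action that alters a later selection.
bite_demo() {
    awk '
    /cat/ { sub(/cat/, "dog") }     # rewrites the record in place
    /dog/ { print NR ": " $0 }      # now matches both records
    '
}

printf 'cat\ndog\n' | bite_demo
```

Both lines are printed, even though only the second line of the input contained dog.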

Operators

" "               The blank is the concatenation operator
+ - * / %         All of the usual C arithmetic operators: add, subtract, multiply, divide and mod
== != < <= > >=   All of the usual C relational operators: equal, not equal, less than, less than or equal, greater than, greater than or equal
&& ||             The C boolean operators and and or
= += -= *= /= %=  The C assignment operators
~ !~              Matches and doesn't match
?:                C conditional value operator
^                 Exponentiation
++ --             Variable increment/decrement

Note the absence of the C bit operators &, |, << and >>
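
A brief sketch of the operators that differ most from C, assuming a POSIX shell; ops_demo and its values are invented for the example.

```shell
# ops_demo is a hypothetical helper exercising concatenation, ~, ?:, ^ and %.
ops_demo() {
    awk 'BEGIN {
        s = "abc" "123"                    # juxtaposition concatenates
        print s
        r = (s ~ /[0-9]+$/) ? "ends in digits" : "no trailing digits"
        print r
        print 2 ^ 10, 7 % 3
        n = 5; n += 2; n++
        print n
    }'
}

ops_demo
```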

[s]printf format items

Format strings in the printf statement and sprintf function consist of three different types of items: literal characters, escaped literal characters and format items. Literal characters are just that: characters which will print as themselves. Escaped literal characters begin with a backslash (\) and are used to represent control characters; the common ones are: \n for new line, \t for tab and \r for return. Format items are used to describe how program variables are to be printed.

All format items begin with a percent sign (%). The next part is an optional length and precision field. The length is an integer indicating the minimum field width of the item, negative if the data is to be placed at the left of the field. If the length field begins with a zero (0), then instead of padding the value with leading blanks, the item will be padded with leading 0s. The precision is a decimal point followed by the number of decimal digits to be displayed for various floating point representations. Next is an optional source field size modifier, usually 'l' (ell). The last item is the actual source data type, commonly one of the list below:

 d  Integer
 f  Floating point in fixed point format
 e  Floating point in exponential format
 g  Floating point in "best fit" format; integer, fixed point, or exponential, depending on exact value
 s  Character string
 c  Integer to be interpreted as a character
 x  Integer to be printed as hexadecimal

Examples:

 %-20s   Print a string in the left portion of a 20 character field
 %d      Print an integer in however many spaces it takes
 %6d     Print an integer in at least 6 spaces; used to format pretty output
 %9ld    Print a long integer in at least 9 spaces
 %09ld   Print a long integer in at least 9 spaces with leading 0s, not blanks
 %.6f    Print a float with 6 digits after the decimal and as many before it as needed
 %10.6f  Print a float in a 10 space field with 6 digits after the decimal
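
Format items like these can be checked directly from a BEGIN block. A sketch, assuming a POSIX shell; fmt_demo is a hypothetical name, the square brackets are only there to make the padding visible, and the 'l' modifier is left out because not every awk handles it the same way.

```shell
# fmt_demo is a hypothetical helper printing one value per format item.
fmt_demo() {
    awk 'BEGIN {
        printf("[%-10s]\n", "awk")         # left-justified in 10 columns
        printf("[%6d]\n", 42)              # right-justified, blank padded
        printf("[%06d]\n", 42)             # zero padded
        printf("[%.3f]\n", 3.14159)        # 3 digits after the decimal point
        printf("[%8.2f]\n", 3.14159)       # 8 columns, 2 digits after the point
        printf("[%x]\n", 255)              # hexadecimal
    }'
}

fmt_demo
```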