Regular Expressions

Date: 2010-02-28  Source: hs272307562

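Regular expression syntax:
A regular expression can contain the following kinds of elements:
1. Literal characters (English letters and so on).
2. An item that matches a single character: a matching character, a character set, or a character class (a character set lists characters or a range of characters inside []; a character class names a type of character such as digits or letters).
3. Quantifiers, which repeat the preceding item (the main ones are *, +, and ?).
4. Alternation, which chooses between two alternatives with "|": hello|Hello, (h|H)ello, [hH]ello.
5. Subpatterns grouped with parentheses.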
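
The three spellings in item 4 are meant to accept the same two words. A minimal sketch of that claim, assuming Python's re module as a stand-in engine (the helper name matches_all is made up for this illustration):

```python
import re

# Three spellings of the same alternation, taken from item 4 above.
patterns = [r"hello|Hello", r"(h|H)ello", r"[hH]ello"]

def matches_all(text):
    """Return True only if every pattern matches the whole text."""
    return all(re.fullmatch(p, text) is not None for p in patterns)

print(matches_all("hello"))
print(matches_all("Hello"))
print(matches_all("jello"))
```

The character set form [hH]ello is usually the easiest to read when the alternatives differ in a single character.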

Anchoring a pattern:
Use ^ and $ to anchor a pattern to the beginning or the end of the line.
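
A short sketch of anchoring, again assuming Python's re module; the sample line is invented for the example:

```python
import re

line = "error: disk full"

# ^ ties the pattern to the start, $ ties it to the end.
starts_with_error = re.search(r"^error", line) is not None
ends_with_full = re.search(r"full$", line) is not None

# "disk" occurs in the line, but not at the start, so ^disk fails.
starts_with_disk = re.search(r"^disk", line) is not None

print(starts_with_error, ends_with_full, starts_with_disk)
```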

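Backslash quoting:
A backslash removes the special meaning of the following characters:
. * ? + [ ] ( ) ^ $ | \
A backslash followed by certain letters forms an escape sequence with its own meaning.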
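
To make the effect of quoting concrete, here is a small sketch assuming Python's re module. The sample strings are invented; the point is only that an unquoted "." matches any character while "\." matches a literal dot:

```python
import re

samples = "3.50 3x50"

# Unquoted dot: matches any character, so "3x50" is found too.
unquoted = re.findall(r"3.50", samples)

# Quoted dot: only the literal "3.50" is found.
quoted = re.findall(r"3\.50", samples)

# Parentheses must be quoted as well to be matched literally.
paren = re.search(r"\(approx\.\)", "price: 3.50 (approx.)")

print(unquoted, quoted, paren.group())
```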

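Matching precedence:
If a pattern can match several parts of a string, the match that begins earliest in the input is returned; if more than one match begins at that position, the longest one is returned.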
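
The "earliest first" rule can be sketched as follows, assuming Python's re module. One caveat: Python resolves alternation with "|" by taking the first alternative that works rather than the longest one, so this example sticks to quantifiers, where the behavior agrees with the rule above.

```python
import re

text = "xx aab aaaab"

# Two substrings match a+b; the one that starts earliest wins,
# even though a longer candidate ("aaaab") appears later.
first = re.search(r"a+b", text)
print(first.group(), first.start())

# From a given starting point, a greedy quantifier takes the longest run.
longest = re.search(r"a+", "baaa")
print(longest.group())
```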

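Capturing subpatterns:
Parentheses capture a subpattern: the substring matched by each parenthesized subpattern is returned along with the overall match.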
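
A minimal sketch of capturing, assuming Python's re module; the "width=80" input is made up for the illustration:

```python
import re

m = re.search(r"(\w+)=(\d+)", "width=80")

whole = m.group(0)   # the overall match
name = m.group(1)    # text matched by the first pair of parentheses
value = m.group(2)   # text matched by the second pair

print(whole, name, value)
```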

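Advanced regular expressions:
1. Backslash escapes:
. + * ? [ ] can be matched literally by preceding them with "\". The expression as a whole is usually enclosed in {} to avoid confusion with the surrounding command syntax.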

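2. Character classes
A character class is written [:identifier:] and is used inside a bracket expression.

[A-Za-z] is equivalent to [[:alpha:]]
[0-9] is equivalent to [[:digit:]] or \d
[ \f\n\r\t\v] is equivalent to [[:space:]] or \s
[[:digit:][:alpha:]_] is equivalent to [\d[:alpha:]_], [[:alnum:]_], or \w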
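
Python's re module does not accept the [[:name:]] notation, so the sketch below only checks the shorthand escapes against explicit sets. It assumes the re.ASCII flag, which restricts \d, \s, and \w to ASCII characters; the sample string is invented.

```python
import re

sample = "ab_12 \t\n-Z"

# \d versus an explicit digit range.
digits_short = re.findall(r"\d", sample, re.ASCII)
digits_set = re.findall(r"[0-9]", sample)

# \s versus an explicit whitespace set.
space_short = re.findall(r"\s", sample, re.ASCII)
space_set = re.findall(r"[ \f\n\r\t\v]", sample)

# \w versus letters, digits, and underscore.
word_short = re.findall(r"\w", sample, re.ASCII)
word_set = re.findall(r"[A-Za-z0-9_]", sample)

print(digits_short, space_short, word_short)
```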

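3. Nongreedy quantifiers

Greedy (matches as much as possible; because . also matches a newline by default, this runs through the last newline):
.+\n
Nongreedy (matches as little as possible, so this stops at the first newline):
.+?\n
Without a nongreedy quantifier, stopping at the first newline requires a negated set:
[^\n]+\n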
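
The difference can be sketched with Python's re module. One assumption to note: in Python "." does not match a newline unless re.DOTALL is given, so the flag is used here to mimic the default described in this document.

```python
import re

text = "first\nsecond\nrest"

# Greedy: runs through the last newline in the text.
greedy = re.match(r".+\n", text, re.DOTALL).group()

# Nongreedy: stops at the first newline.
lazy = re.match(r".+?\n", text, re.DOTALL).group()

# Negated set: same result as the nongreedy form, no special quantifier needed.
negated = re.match(r"[^\n]+\n", text).group()

print(repr(greedy), repr(lazy), repr(negated))
```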

4. Bound quantifiers
{m,n} matches at least m and at most n instances of the preceding item.
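
A small sketch of a bound, assuming Python's re module, which uses the same {m,n} notation; the digit strings are arbitrary test inputs:

```python
import re

# Between two and four digits, matched against the whole string.
ok = re.fullmatch(r"\d{2,4}", "123") is not None
too_short = re.fullmatch(r"\d{2,4}", "1") is not None
too_long = re.fullmatch(r"\d{2,4}", "12345") is not None

# On a longer input the greedy form takes the maximum (4),
# while the nongreedy form {2,4}? takes the minimum (2).
greedy = re.match(r"\d{2,4}", "123456").group()
lazy = re.match(r"\d{2,4}?", "123456").group()

print(ok, too_short, too_long, greedy, lazy)
```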

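5. Back references
\1, \2, and so on refer back to the text matched by an earlier parenthesized subpattern, so that the same text must appear again later in the match.
("[^"]*"|'[^']*') can be written as:

('|").*?\1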
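
A sketch of the quoted-string pattern, assuming Python's re module, which also spells back references as \1. The input text is invented. finditer is used rather than findall because findall would return only the captured quote character.

```python
import re

# The opening quote is captured; \1 demands the same quote to close.
quoted = re.compile(r"""('|").*?\1""")

text = '''say "it's" or 'ok' '''
found = [m.group(0) for m in quoted.finditer(text)]
print(found)

# A string opened with one kind of quote and closed with the other
# does not match.
mismatched = quoted.search('"abc\'')
print(mismatched)
```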

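6. Lookahead
^A.*(\.txt$): matches a string that begins with A and ends with .txt; the .txt is part of the match.

^A.*(?=\.txt$): positive lookahead. The .txt must be present, but only the text before it is matched.

^A.*(?!\.txt$): negative lookahead. The match must end at a point that is not followed by .txt at the end of the string.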
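
The three patterns can be compared with a short sketch, assuming Python's re module, which uses the same (?=...) and (?!...) notation. The file names are invented. The last pattern, with the lookahead moved in front of .*, is not from the notes above; it is one possible variant for rejecting names that end in .txt, shown because the greedy .* in the original negative pattern can simply run to the end of the string, where the constraint is trivially satisfied.

```python
import re

# Capturing group: the suffix is part of the match.
with_suffix = re.search(r"^A.*(\.txt$)", "Archive.txt").group()

# Positive lookahead: the suffix is required but not consumed.
before_suffix = re.search(r"^A.*(?=\.txt$)", "Archive.txt").group()
no_match = re.search(r"^A.*(?=\.txt$)", "Archive.doc")

# Negative lookahead after a greedy .* still matches a .txt name,
# because at the very end of the string nothing follows.
greedy_negative = re.search(r"^A.*(?!\.txt$)", "Archive.txt").group()

# Variant (an assumption, not from the notes): test the whole tail first.
rejected = re.search(r"^A(?!.*\.txt$).*", "Archive.txt")
accepted = re.search(r"^A(?!.*\.txt$).*", "Archive.doc").group()

print(with_suffix, before_suffix, no_match, greedy_negative, rejected, accepted)
```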

Syntax summary:
Metacharacters:
.
Matches any character.

*
Matches zero or more instances of the previous pattern item.

+
Matches one or more instances of the previous pattern item.

?
Matches zero or one instances of the previous pattern item.

( )
Groups a subpattern. The repetition and alternation operators apply to the preceding subpattern.

|
Alternation.

[ ]
Delimit a set of characters. Ranges are specified as [x-y]. If the first character in the set is ^, then there is a match if the remaining characters in the set are not present.

^
Anchor the pattern to the beginning of the string. Only when first.

$
Anchor the pattern to the end of the string. Only when last.

Advanced regular expressions:
{m}
Matches m instances of the previous pattern item.

{m}?
Matches m instances of the previous pattern item. Nongreedy.

{m,}
Matches m or more instances of the previous pattern item.

{m,}?
Matches m or more instances of the previous pattern item. Nongreedy.

{m,n}
Matches m through n instances of the previous pattern item.

{m,n}?
Matches m through n instances of the previous pattern item. Nongreedy.

*?
Matches zero or more instances of the previous pattern item. Nongreedy.

+?
Matches one or more instances of the previous pattern item. Nongreedy.

??
Matches zero or one instances of the previous pattern item. Nongreedy.

(?:re)
Groups a subpattern, re, but does not capture the result.

(?=re)
Positive look-ahead. Matches the point where re begins.

(?!re)
Negative look-ahead. Matches the point where re does not begin.

(?abc)
Embedded options, where abc is any number of option letters listed in Table 11-5 (the list of embedded option characters at the end of this summary).

\c
One of many backslash escapes listed in Table 11-4 (the list of backslash expressions in this summary).

[: :]
Delimits a character class within a bracketed expression. See Table 11-3 (the list of character classes in this summary).

[. .]
Delimits a collating element within a bracketed expression.

[= =]
Delimits an equivalence class within a bracketed expression.
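
The difference between ( ) and (?:re) in the list above shows up in what is captured. A minimal sketch, assuming Python's re module, which supports the same (?:...) notation; the input string is arbitrary:

```python
import re

# Capturing group: the repeated "ab" takes up capture slot 1.
capturing = re.search(r"(ab)+(\d)", "ababab7")

# Non-capturing group: "ab" is still grouped for +, but only the
# digit is captured.
non_capturing = re.search(r"(?:ab)+(\d)", "ababab7")

print(capturing.groups(), non_capturing.groups())
```

Both patterns match the same text; only the numbering of the captured subpatterns changes.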

Character classes:
alnum
Upper and lower case letters and digits.

alpha
Upper and lower case letters.

blank
Space and tab.

cntrl
Control characters: \u0001 through \u001F.

digit
The digits zero through nine. Also \d.

graph
Printing characters that are not in cntrl or space.

lower
Lowercase letters.

print
The same as alnum.

punct
Punctuation characters.

space
Space, newline, carriage return, tab, vertical tab, form feed. Also \s.

upper
Uppercase letters.

xdigit
Hexadecimal digits: zero through nine, a-f, A-F.

Backslash expressions:
\a
Alert, or "bell", character.

\A
Matches only at the beginning of the string.

\b
Backspace character, \u0008.

\B
Synonym for backslash.

\cX
Control-X.

\d
Digits. Same as [[:digit:]]

\D
Not a digit. Same as [^[:digit:]]

\e
Escape character, \u001B.

\f
Form feed, \u000C.

\m
Matches the beginning of a word.

\M
Matches the end of a word.

\n
Newline, \u000A.

\r
Carriage return, \u000D.

\s
Space. Same as [[:space:]]

\S
Not a space. Same as [^[:space:]]

\t
Horizontal tab, \u0009.

\uXXXX
A 16-bit Unicode character code.

\v
Vertical tab, \u000B.

\w
Letters, digits, and underscore. Same as [[:alnum:]_]

\W
Not a letter, digit, or underscore. Same as [^[:alnum:]_]

\xhh
An 8-bit hexadecimal character code. Consumes all hex digits after \x.

\y
Matches the beginning or end of a word.

\Y
Matches a point that is not the beginning or end of a word.

\Z
Matches the end of the string.

\0
NULL, \u0000

\x
Where x is a digit, this is a back-reference.

\xy
Where x and y are digits, either a decimal back-reference, or an 8-bit octal character code.

\xyz
Where x, y and z are digits, either a decimal back-reference or an 8-bit octal character code.

Embedded option characters:
b
The rest of the pattern is a basic regular expression (a la vi or grep).

c
Case sensitive matching. This is the default.

e
The rest of the pattern is an extended regular expression (a la Tcl 8.0).

i
Case insensitive matching.

m
Synonym for the n option.

n
Newline sensitive matching. Both lineanchor and linestop mode.

p
Partial newline sensitive matching. Only linestop mode.

q
The rest of the pattern is a literal string.

s
No newline sensitivity. This is the default.

t
Tight syntax; no embedded comments. This is the default.

w
Inverse partial newline-sensitive matching. Only lineanchor mode.

x
Expanded syntax with embedded white space and comments.
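
Two of the option letters above, i and x, can be tried with Python's re module, which accepts the same (?i) and (?x) prefixes. This is only a sketch under that assumption: most of the other letters in the list (b, e, q, p, w, and so on) have no Python counterpart, and the comment rules of Python's verbose mode may differ in detail from the expanded syntax described here.

```python
import re

# (?i): case-insensitive matching.
insensitive = re.fullmatch(r"(?i)hello", "HeLLo") is not None

# (?x): white space in the pattern is ignored and "#" starts a comment.
date_re = r"""(?x)
    (\d{4}) - (\d{2}) - (\d{2})   # year, month, day
"""
m = re.search(date_re, "Date: 2010-02-28")

print(insensitive, m.groups())
```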