Learning Python Regular Expressions

Date: 2009-03-06 Source: hkebao

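^ matches the start of the string.
$ matches the end of the string.
\b matches a word boundary.
\d matches any digit.
\D matches any non-digit character.
x? matches an optional x character (in other words, it matches x zero or one time).
x* matches x zero or more times.
x+ matches x one or more times.
x{n,m} matches x at least n times and at most m times.
(a|b|c) matches exactly one of a, b, or c.
(x) normally denotes a remembered group. You can get its value from the groups() method of the object returned by re.search.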
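
The cheat sheet above can be tried out directly. The following is a minimal sketch; the sample strings (a phone-like number, "color"/"colour", and so on) are made up for illustration and are not from the original notes.

```python
import re

# \b anchors at word boundaries; \d{3} and \d{4} require exact digit counts.
m = re.search(r"\b(\d{3})-(\d{4})\b", "call 555-1234 now")
number_groups = m.groups()  # values of the two remembered groups

# x? makes the preceding character optional (zero or one time).
optional = [bool(re.search(r"^colou?r$", s)) for s in ("color", "colour", "colouur")]

# (a|b|c) matches exactly one of the alternatives.
alternative = re.search(r"^(cat|dog|bird)$", "dog").group(1)

# x{n,m}: at least n and at most m repetitions.
bounded = [bool(re.search(r"^a{2,3}$", s)) for s in ("a", "aa", "aaa", "aaaa")]

print(number_groups, optional, alternative, bounded)
```

Writing the patterns as raw strings (r"...") keeps \b and \d from being interpreted by Python before the re module sees them.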

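[abc] matches any one of the characters "a", "b", or "c"; the same set can also be written as the range [a-c], with identical effect. If you want to match only lowercase letters, the RE should be [a-z].
You can match characters that are not in a set by complementing it. To do so, put "^" as the first character of the class; a "^" anywhere else in the class simply matches the "^" character itself. For example, [^5] matches any character except "5".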
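
A short sketch of these character-class rules, using an arbitrary sample string chosen only for illustration:

```python
import re

text = "cab-5^z"

in_set = re.findall(r"[abc]", text)       # explicit set
in_range = re.findall(r"[a-c]", text)     # same set written as a range
not_five = re.findall(r"[^5]", "1525")    # complemented set: anything but "5"
caret_inside = re.findall(r"[5^]", text)  # "^" not in first position is literal

print(in_set, in_range, not_five, caret_inside)
```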

\d
Matches any decimal digit; this is equivalent to the class [0-9].

\D
Matches any non-digit character; this is equivalent to the class [^0-9].

\s
Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].

\S
Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].

\w
Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].

\W
Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].
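
The predefined classes can be checked with a few findall calls. This is only a sketch on made-up sample strings. One caveat, assuming Python 3: for str patterns these classes are Unicode-aware by default, so the ASCII equivalences listed above hold exactly only with the re.ASCII flag (or for bytes patterns).

```python
import re

sample = "id_42\tok\n"  # contains a real tab and a real newline

digits = re.findall(r"\d", sample)
whitespace = re.findall(r"\s", sample)
non_space_runs = re.findall(r"\S+", sample)
word_runs = re.findall(r"\w+", "a-b_c 9!")
non_word = re.findall(r"\W", "a-b_c 9!")

# Assumption: Python 3 str patterns. U+0663 is an Arabic-Indic digit.
mixed = "\u0663" + "3"
unicode_digits = re.findall(r"\d", mixed)          # Unicode-aware by default
ascii_digits = re.findall(r"\d", mixed, re.ASCII)  # restricted to [0-9]

print(digits, whitespace, non_space_runs, word_runs, non_word)
```
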
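The first repetition metacharacter we discuss is *. * does not match the literal character "*"; instead, it specifies that the previous character can be matched zero or more times, rather than exactly once.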
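
To illustrate, here is a small sketch with the pattern ca*t, where only the "a" is repeated. The test strings are illustrative, and fullmatch assumes Python 3.4 or later.

```python
import re

pattern = re.compile(r"ca*t")

# "a" may appear zero or more times between "c" and "t".
results = {s: bool(pattern.fullmatch(s)) for s in ("ct", "cat", "caaat", "cbt")}

# To match a literal asterisk, escape it.
star_literal = re.search(r"\*", "2*3").group()

print(results, star_literal)
```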

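Put simply, to match a single literal backslash you have to write '\\\\' as the RE string, because the regular expression itself must be "\\", and each of those backslashes must in turn be written as "\\" in an ordinary Python string literal. In REs that use backslashes heavily, this leads to long runs of repeated backslashes and makes the resulting strings hard to read.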

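If you prefix a string literal with "r", backslashes are not handled in any special way, so r"\n" is a two-character string containing "\" and "n", while "\n" is a one-character string containing a newline. (So that's how it works!)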
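
The difference is easy to verify. The following sketch uses a hypothetical Windows-style path as test data; both patterns match the same single backslash.

```python
import re

raw_len = len(r"\n")    # two characters: backslash and "n"
plain_len = len("\n")   # one character: a newline

path = "C:\\temp"       # the string C:\temp, which contains ONE backslash

# Without a raw string: four backslashes in source become the regex \\ .
m_plain = re.search("\\\\", path)
# With a raw string: two backslashes in source are already the regex \\ .
m_raw = re.search(r"\\", path)

print(raw_len, plain_len, m_plain.span(), m_raw.span())
```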

match()	Determine if the RE matches at the beginning of the string.
search()	Scan through a string, looking for any location where this RE matches.
findall()	Find all substrings where the RE matches, and return them as a list.
finditer()	Find all substrings where the RE matches, and return them as an iterator.
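
The four methods can be compared side by side on one string. This is a minimal sketch; the pattern and text are chosen only for illustration.

```python
import re

pattern = re.compile(r"\d+")
text = "abc 12 de 345"

at_start = pattern.match(text)     # None: the string does not start with a digit
anywhere = pattern.search(text)    # first match anywhere in the string
all_found = pattern.findall(text)  # list of matched substrings
spans = [m.span() for m in pattern.finditer(text)]  # iterate over match objects

print(at_start, anywhere.group(), all_found, spans)
```

Note that match() and search() return None when nothing matches, so real code should test the result before calling methods on it.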

group()	Return the string matched by the RE.
start()	Return the starting position of the match.
end()	Return the ending position of the match.
span()	Return a tuple containing the (start, end) positions of the match.

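group() returns the substring matched by the RE. start() and end() return the starting and ending index of the match. span() returns both indexes in a single tuple. Because the match() method only checks whether the RE matches at the start of the string, start() will always be zero for its results. The search() method of a RegexObject instance, however, scans through the string, so in that case the match may not start at zero.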
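
A brief sketch of the match-object methods, contrasting match() with search(). The pattern and input strings are illustrative.

```python
import re

p = re.compile(r"[a-z]+")

# match() anchors at the beginning, so start() is 0 here.
m = p.match("tempo 42")
matched = (m.group(), m.start(), m.end(), m.span())

# search() scans forward, so the match can begin later in the string.
s = p.search("::: message")
searched = (s.group(), s.span())

print(matched, searched)
```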