1. The re module

1.1. re.compile(pattern, flags=0)

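Compiles a regular expression pattern into a regular expression object (a pattern object), whose methods such as search() can then be called directly.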

>>> import re
>>> pt = re.compile('hello')
>>> pt
re.compile('hello')

>>> pt.search('hello world')
<_sre.SRE_Match object; span=(0, 5), match='hello'>

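If the same regular expression is used many times, compiling it up front with re.compile() and reusing the resulting object makes the program more efficient.

The module-level functions also cache the patterns they have compiled recently. A program that only uses a handful of regular expressions therefore does not need to worry about compiling them first; the performance is essentially the same.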
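As a rough sketch of the compile-once pattern described above (the sample log lines and the name `error_re` are made up for illustration):

```python
import re

# Hypothetical log lines, invented for this example.
lines = ["error: disk full", "ok", "error: timeout"]

# Compile the pattern once, outside the loop, and reuse the object.
error_re = re.compile(r"error: (\w+)")

found = []
for line in lines:
    m = error_re.search(line)
    if m:  # search() returns None when nothing matches
        found.append(m.group(1))

print(found)  # ['disk', 'timeout']
```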

1.2. re.search(pattern, string, flags=0)

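Scans the whole string and returns a match object for the first location where the pattern matches, or None if there is no match.

re.match() only tries to match at the beginning of the string, so it can give up sooner than re.search() when the start does not match. It is used less often.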

>>> m = re.search('hello', 'hello world')
>>> m
<_sre.SRE_Match object; span=(0, 5), match='hello'>

>>> re.match('world', 'hello world') # No match
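The difference in anchoring can be seen side by side. This sketch also uses `re.fullmatch()`, which the text above does not cover; it is a standard `re` function that requires the pattern to match the entire string.

```python
import re

text = "hello world"

at_start = re.match("hello", text)      # anchored at position 0
not_at_start = re.match("world", text)  # None: "world" is not at the start
anywhere = re.search("world", text)     # scans forward until something matches
whole = re.fullmatch("hello", text)     # None: the pattern must cover the whole string

print(at_start.span(), not_at_start, anywhere.span(), whole)
```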

1.3. re.findall(pattern, string, flags=0)

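Finds all substrings of the string that the pattern matches and returns them as a list; returns an empty list if nothing matches.

re.finditer() returns an iterator that yields one match object per match.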

>>> re.findall('l', 'hello world')
['l', 'l', 'l']

>>> for m in re.finditer('l', 'hello world'):
...     m.span()
(2, 3)
(3, 4)
(9, 10)

1.4. re.sub(pattern, repl, string, count=0, flags=0)

Replaces each match of pattern with repl and returns the resulting string.

>>> re.sub('l', 'l0', 'hello world')
'hel0l0o worl0d'

>>> re.sub('l', 'l0', 'hello world', count=1)	# replace only once
'hel0lo world'

>>> re.sub('l', lambda x: x.group() + '0', 'hello world')	# repl can also be a function
'hel0l0o worl0d'
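Two related features are not shown above and are sketched here as a supplement: the replacement string may refer back to captured groups with `\1`, `\2`, and so on, and `re.subn()` additionally reports how many replacements were made. The input strings are invented for the example.

```python
import re

# Swap "key=value" into "value=key"; \1 and \2 refer to the captured groups.
swapped = re.sub(r"(\w+)=(\w+)", r"\2=\1", "a=1 b=2")

# re.subn() returns a (new_string, number_of_replacements) tuple.
result, count = re.subn("l", "L", "hello world")

print(swapped)        # 1=a 2=b
print(result, count)  # heLLo worLd 3
```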

1.5. re.split(pattern, string, maxsplit=0, flags=0)

Splits the string at every match of the pattern and returns a list.

>>> re.split('l', 'hello world')
['he', '', 'o wor', 'd']

>>> re.split('l', 'hello world', maxsplit=1)	# split only once
['he', 'lo world']
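One behaviour worth knowing, added here as a supplement to the text: when the pattern contains a capturing group, `re.split()` also returns the text of the delimiters. A minimal sketch with an invented input string:

```python
import re

# Without a group, the delimiters are dropped.
plain = re.split(r"[,;]", "a,b;c")

# With a capturing group, each delimiter is kept as its own list item.
with_delims = re.split(r"([,;])", "a,b;c")

print(plain)        # ['a', 'b', 'c']
print(with_delims)  # ['a', ',', 'b', ';', 'c']
```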

2. Match objects

2.1. Getting a match object

>>> m = re.search('he(l)(l)o', 'hello world')
>>> m
<_sre.SRE_Match object; span=(0, 5), match='hello'>

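Code that uses a match object usually guards it with an if check, because a failed match returns None and calling a method on None raises an AttributeError: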
match = re.search(pattern, string)
if match:
    process(match)
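A runnable version of that guard might look like the following. The helper name `first_number` is hypothetical and only serves the example.

```python
import re

def first_number(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    match = re.search(r"\d+", text)
    if match is None:  # no match: do not touch match.group()
        return None
    return int(match.group())

print(first_number("order 42 shipped"))  # 42
print(first_number("no digits here"))    # None
```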

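2.2. Extracting matched groups

>>> m = re.search('he(l)(l)o', 'hello world')

>>> m.groups()	# returns all groups, in order, as a tuple
('l', 'l')
>>> re.search('hello', 'hello world').groups()	# with no groups in the pattern, returns an empty tuple
()

>>> m.group()	# the argument defaults to 0, meaning the whole match
'hello'
>>> m.group(1)	# returns the first group
'l'

>>> m.group(1, 2)	# returns a tuple
('l', 'l')

>>> m[0]	# same as m.group(0)
'hello'

# groupdict() only has entries for groups that were given a name
# For how to name a group, see the `(?P<name>...)` syntax below
>>> m.groupdict()
{}

If a group matches several times, only its last match is kept.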

>>> re.match(r"(..)+", "a1b2c3").groups()
('c3',)

2.3. Extracting match positions

>>> m.span() 	# the argument defaults to 0, meaning the position of the whole match
(0, 5)
>>> m.span(1)	# position of the first group
(2, 3)

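2.4. Extracting other information

>>> m = re.search('he(l)(l)o', 'hello world')

>>> m.string	# the input string
'hello world'

>>> m.re	# the pattern object that produced this match
re.compile('he(l)(l)o')

>>> m.pos	# index in the string where the search started
0

>>> m.endpos	# index in the string beyond which the search did not go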
11
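`pos` and `endpos` become more interesting when a compiled pattern is asked to search only part of a string, since the `search()` method of a pattern object accepts optional `pos` and `endpos` arguments. A small sketch:

```python
import re

pt = re.compile("l")
text = "hello world"

# Search only the window text[4:10], i.e. 'o worl'.
m = pt.search(text, 4, 10)

print(m.pos, m.endpos)  # 4 10
print(m.span())         # (9, 10): indices still refer to the full string

# Without the window, the first 'l' is found at index 2.
print(pt.search(text).span())  # (2, 3)
```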

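3. Regular expression syntax

It helps to divide regular expression symbols into two kinds. Ordinary symbols match exactly what is written. Special (built-in) symbols have a meaning of their own; everything below is about special symbols.

3.1. Special characters

Pattern	Description
`^`	Matches the start of the string
`$`	Matches the end of the string, or just before a newline at the end of the string
`.`	Matches any character except a newline. With `re.DOTALL`, or by writing `[\s\S]` instead, a newline is matched as well. (`[.\n]` does not work for this: inside `[]` the dot is a literal dot.)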

>>> re.findall('^.', 'hello world')	# the first character of the string
['h']
>>> re.findall('.$', 'hello world')	# the last character of the string
['d']

>>> re.findall('.$', 'hello\n world')	# `$` is not the same thing as a newline
['d']

>>> re.findall('.+', 'hello\r\n world') # `.` does not match a newline
['hello\r', ' world']
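How `^`, `$` and `.` behave depends on the flags in effect. The following sketch contrasts the defaults with `re.M` (`re.MULTILINE`) and `re.S` (`re.DOTALL`) on a two-line string invented for the example:

```python
import re

text = "first\nsecond"

start_default = re.findall(r"^\w+", text)          # ^ only at the start of the string
start_multiline = re.findall(r"^\w+", text, re.M)  # ^ also after every newline

dot_default = re.findall(r".+", text)        # . stops at the newline
dot_dotall = re.findall(r".+", text, re.S)   # . also matches the newline

print(start_default, start_multiline)  # ['first'] ['first', 'second']
print(dot_default, dot_dotall)         # ['first', 'second'] ['first\nsecond']
```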

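3.2. Repetition

Pattern	Description
`*`	Matches the preceding element any number of times (including 0)
`+`	Matches the preceding element at least once
`{m,n}`	Matches the preceding element m to n times
`{m}`	Matches the preceding element exactly m times
`{,m}`	Matches the preceding element 0 to m times
`?`	Matches the preceding element 0 or 1 times

A ? placed after another quantifier makes it non-greedy: it then matches as few repetitions as possible.

{m,n}? tries m repetitions first and takes more only if the rest of the pattern requires it; with nothing after it, it behaves like {m}

? is equivalent to {,1}

{,m}? behaves like *? capped at m repetitions: the repeated characters are skipped whenever the rest of the pattern can match without them

{,1}? is equivalent to ??: the character is skipped if it can be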

# The repeated character is present
>>> re.findall('hel*', 'hello world')
['hell']
>>> re.findall('hel+', 'hello world')
['hell']
>>> re.findall('hel?', 'hello world')
['hel']

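# The repeated character is absent
>>> re.findall('hed*', 'hello world')	# the missing character is skipped
['he']
>>> re.findall('hed+', 'hello world')	# no match
[]
>>> re.findall('hed?', 'hello world')	# the missing character is skipped
['he']

# Non-greedy mode
>>> re.findall('hel*?', 'hello world')	# the repeated character is left out
['he']
>>> re.findall('hel+?', 'hello world')	# matched only once
['hel']

# A specific number of repetitions
>>> re.findall('hel{3}', 'hello world')	# the character occurs fewer than 3 times: no match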
[]
>>> re.findall('hel{1,3}', 'hello world')
['hell']
>>> re.findall('hel{,3}', 'hello world')
['hell']
>>> re.findall('hel{1,3}?', 'hello world')
['hel']
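>>> re.findall('hel{,3}?', 'hello world')	# the repeated character is left out
['he']

# `.?` matches zero or one arbitrary character
>>> re.findall('h.?', 'hello world')
['he']

# `.*?` matches as few arbitrary characters as possible, none at all when it can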
>>> re.findall('h.*?', 'hello world')
['h']
>>> re.findall('h.*?l', 'hello world')
['hel']
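A common place where the greedy/non-greedy distinction matters is text with repeated delimiters. The HTML-like string below is invented for the example:

```python
import re

html = "<b>bold</b>"

# Greedy: .* runs to the last '>' in the string.
greedy = re.findall(r"<.*>", html)

# Non-greedy: .*? stops at the first '>' that lets the match succeed.
lazy = re.findall(r"<.*?>", html)

print(greedy)  # ['<b>bold</b>']
print(lazy)    # ['<b>', '</b>']
```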

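3.3. Groups

An expression enclosed in parentheses is a group. The whole expression is matched as usual, and the substring matched by each group inside it is captured along the way.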

>>> m = re.search('he((l)(l))o', 'hello world')
>>> m.groups()
('ll', 'l', 'l')

# If the pattern contains groups, findall() returns only the groups
>>> re.findall('he((l)(l))o', 'hello world')
[('ll', 'l', 'l')]

>>> re.findall('((h)(e))((l)(l))o', 'hello world')	# groups are numbered by their opening parenthesis, outer before inner
[('he', 'h', 'e', 'll', 'l', 'l')]

>>> re.findall('he(l)+', 'hello world')	# a repeated group keeps only its last match
['l']

>>> re.findall('a((b+)(c+))a', 'abcaabbcca')	# one tuple per match
[('bc', 'b', 'c'), ('bbcc', 'bb', 'cc')]

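3.4. Escape sequences

\ is the escape character. Combined with other characters it forms special sequences; some common ones are listed below.

Pattern	Description
`\w`, `\W`	Matches a letter, digit or underscore; anything that is not a letter, digit or underscore
`\s`, `\S`	Matches any whitespace character (for ASCII, `[ \t\n\r\f\v]`); any non-whitespace character
`\d`, `\D`	Matches any digit (for ASCII, `[0-9]`); any non-digit
`\A`	Matches only at the start of the string; the same as `^` unless `re.MULTILINE` is set
`\Z`	Matches only at the very end of the string. Unlike in Perl/PCRE, Python's `\Z` does not also match before a trailing newline (that is what `$` does)
`\z`	Perl/PCRE spelling of the absolute end of the string; older versions of Python's `re` reject it as a bad escape, so use `\Z`
`\G`	Position where the previous match ended (Perl/PCRE); not supported by Python's `re`
`\b`, `\B`	Matches a word boundary; a non-boundary. A word boundary is the position between a `\w` character and a `\W` character, or between a `\w` character and the start or end of the string
`\r`, `\n`, `\t`, `\f`	Matches a carriage return; newline; tab; form feed
`\number`	Matches the same text that group number matched earlier
`\10`	Refers to group 10. In Python an octal character code has to start with 0 or be three octal digits long (for example `\012`); otherwise the digits are read as a group reference
`\(`, `\[`, etc.	Turns a special symbol into an ordinary one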

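>>> re.findall(r'\w+', '我爱你，China') # what counts as a letter depends on the matching mode; in Unicode mode Chinese characters are letters too
['我爱你', 'China']
>>> re.findall(r'你\b', '我爱你，China')
['你']

>>> re.findall(r'(\?)', 'hi?')	# a literal '?' right after '(' must be escaped, because '(?' starts an extension
['?']

>>> re.findall(r'he(l)\1o', 'hello world')	# the first group matched `l`, which stands in for `\1`, so the effective pattern is `he(l)lo`
['l']

When a pattern contains backslashes, put an r in front of the string literal to make it a raw string, so that Python does not process \ as a string escape before the re module sees the pattern. Otherwise every backslash has to be written as \\. In many cases the result happens to be the same with or without the r, because Python leaves escapes it does not recognize (such as \d) alone, but that is not always so. (Confusing? Just add the r and be done with it (╯‵□′)╯︵┻━┻)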
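`\b` is a case where leaving out the `r` silently changes the result, because `"\b"` in an ordinary Python string literal is the backspace character. A short sketch:

```python
import re

# In an ordinary literal, "\b" is a single backspace character (U+0008).
assert "\b" == "\x08"
# In a raw literal it stays as two characters: a backslash and 'b'.
assert len(r"\b") == 2

without_r = re.findall("\bworld\b", "hello world")    # looks for backspace characters
with_r = re.findall(r"\bworld\b", "hello world")      # word boundaries, as intended
doubled = re.findall("\\bworld\\b", "hello world")    # same pattern without a raw string

print(without_r, with_r, doubled)  # [] ['world'] ['world']
```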

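3.5. Matching alternatives

[...] is mostly used as an OR over single characters

[^...] negates the set

...|... is mostly used as an OR over longer expressions or special symbols, for example ^|$

>>> re.findall(r'[()*+{}\.]', '[()*+{}].')	# some special symbols become ordinary characters inside [], for example `[()*+{}]`
['(', ')', '*', '+', '{', '}', '.']

>>> re.findall('[a-zA-Z0-9]', 'abcABC012')
['a', 'b', 'c', 'A', 'B', 'C', '0', '1', '2']

>>> re.findall('hello|hi', 'hello world')	# `abc|def` matches either 'abc' or 'def'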
['hello']
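Because `|` splits the whole expression it appears in, it usually needs parentheses to limit its reach. The sketch below uses the non-capturing group `(?:…)` (covered in the extension notation section) for that purpose and also shows a negated set; the input strings are invented.

```python
import re

# Without parentheses the alternatives are 'gra' and 'ey'.
unscoped = re.findall(r"gra|ey", "gray grey")

# Parentheses restrict the alternation to the single letter.
scoped = re.findall(r"gr(?:a|e)y", "gray grey")

# [^...] matches any one character that is NOT listed.
consonants = re.findall(r"[^aeiou ]", "hello world")

print(unscoped)    # ['gra', 'ey']
print(scoped)      # ['gray', 'grey']
print(consonants)  # ['h', 'l', 'l', 'w', 'r', 'l', 'd']
```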

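3.6. Extension notation

A construct of the form (?…) is not a group but extension syntax. Extensions usually do not create a new group; (?P<name>...) is the only exception.

An important concept in what follows is consumption. Within one scan, every character that has already been returned as part of a match is consumed, and consumed characters cannot be used by later matches.

# After 'a1a' is returned it is consumed and the remaining string is '2a3a', so 'a2a' cannot be returned
>>> re.findall(r'a\da', 'a1a2a3a')
['a1a', 'a3a']

So an extension that does not return the text it matched does not consume that text either.

This part is the hardest nut to crack; many of the concepts are convoluted.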

3.6.1. (?#…)

A comment; the contents of the parentheses are ignored when matching.

>>> re.findall('hel(?#this is a comment)lo', 'hello world')
['hello']

3.6.2. (?P<name>…)

Gives the group a name, in addition to its number.

>>> m = re.search('(?P<name1>h)(?P<name2>e)llo', 'hello world')
>>> m.groupdict()
{'name1': 'h', 'name2': 'e'}
>>> m.group('name1')
'h'
>>> m['name1']
'h'

3.6.3. (?P=name)

Matches the same text that the earlier group named name matched.

# The expression below is equivalent to r'he(?P<name>l)\1o'
>>> m = re.search('he(?P<name>l)(?P=name)o', 'hello world')
>>> m.groupdict()
{'name': 'l'}
>>> m.group()
'hello'

3.6.4. (?:…)

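The expression in the parentheses is not treated as a group (it is non-capturing). Mostly used together with repetition.

# With `re.search`, group(0) is the same for both; the only difference is that the former has a group and the latter does not, which `re.findall` makes obvious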
>>> re.findall(r'a(\d)+', 'a1234')
['4']
>>> re.findall(r'a(?:\d)+', 'a1234')
['a1234']

3.6.5. (?=…)

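Positive lookahead assertion. The expression in the parentheses is a condition on the text that follows, but the text it matches is neither returned nor consumed.

To illustrate the matching process: for abc(?=def), the engine matches 'abc' and then checks, without moving forward, that 'def' comes next.

(?=…) is usually placed at the end of an expression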

>>> re.findall(r'a\d(?=a)', 'a1a2a3a')
['a1', 'a2', 'a3']
>>> re.findall(r'a\d(?=a)', 'a1a2b3a')
['a1']

>>> re.findall('(?=abc)def', 'abcdef')
[]
>>> re.findall('abc(?=d)ef', 'abcdef')
[]

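(?=…) can in fact also be placed at the beginning or in the middle of an expression. For example, with (?=2)23 the engine first checks that the next character is '2' and then matches '23' starting from that same position.

So (?=pattern2)pattern1 means: match 'pattern1', but only where the text also starts with 'pattern2'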

>>> re.findall(r'(?=2)23', '123abc1234abcd12345abcde')
['23', '23', '23']
>>> re.findall(r'1(?=2)23', '123abc1234abcd12345abcde')
['123', '123', '123']

# A use case
# Admittedly of little use here: the plain expression r'4\d*' gives the same result.
>>> re.findall(r'(?=4)\d+', '123abc1234abcd12345abcde')
['4', '45']
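Two places where a lookahead is harder to replace are sketched below. Since it consumes nothing, it can be used to report overlapping matches (compare the 'a1a' example at the start of this section), and several lookaheads at one position act as an AND of conditions. The name `strong` and its password-style rule are hypothetical, chosen only for illustration.

```python
import re

# The lookahead consumes nothing, so the scan advances one character at a
# time and the inner group captures every overlapping occurrence.
overlapping = re.findall(r"(?=(a\da))", "a1a2a3a")

# Conditions checked from the same position: contains a digit, contains a
# lowercase letter, and is at least 8 characters long.
strong = re.compile(r"(?=.*\d)(?=.*[a-z]).{8,}")
checks = [bool(strong.fullmatch(s)) for s in ("abc12345", "abcdefgh", "abc1")]

print(overlapping)  # ['a1a', 'a2a', 'a3a']
print(checks)       # [True, False, False]
```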

3.6.6. (?!…)

Negative lookahead assertion; the opposite of (?=…).

>>> re.findall(r'a\d(?!a)', 'a1a2a3a')
[]
>>> re.findall(r'a\d(?!a)', 'a1a2b3a')
['a2']

3.6.7. (?<=…)

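Positive lookbehind assertion. It is similar to (?=…): the expression in the parentheses is a condition, and the text it matches is neither returned nor consumed. The difference is that it checks the text before the current position rather than the text after it.

To illustrate the matching process: for (?<=abc)def, the engine checks at each candidate position that the text immediately before it is 'abc', and then matches 'def'.

(?<=…) is usually placed at the beginning of an expression

The pattern inside (?<=…) must have a fixed length, so a* or a{3,4} are not allowed.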

>>> re.findall(r'(?<=a)\da', 'a1a2a3a')
['1a', '2a', '3a']
>>> re.findall(r'(?<=a)\da', 'a1a2b3a')
['1a']
>>> re.findall(r'(?<=a)\d(?=a)', 'a1a2a3a')
['1', '2', '3']

>>> re.findall('abc(?<=def)', 'abcdef')
[]
>>> re.findall('abc(?<=d)ef', 'abcdef')
[]

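Likewise, (?<=…) can also be placed at the end or in the middle of an expression. For example, with 23(?<=3) the engine first matches '23' and then checks that the character just before the current position is '3'.

So pattern1(?<=pattern2) means: match 'pattern1', but only where it also ends with 'pattern2'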

>>> re.findall(r'23(?<=3)', '123abc1234abcd12345abcde')
['23', '23', '23']
>>> re.findall(r'23(?<=3)a', '123abc1234abcd12345abcde')
['23a']

# A use case
# Again of little use: for this input the plain expression r'\d+5' gives the same result.
>>> re.findall(r'\d+(?<=5)', '123abc1234abcd12345abcde')
['12345']

3.6.8. (?<!…)

Negative lookbehind assertion; the opposite of (?<=…)

>>> re.findall(r'(?<!a)\da', 'a1a2a3a')
[]
>>> re.findall(r'(?<!a)\da', 'a1a2b3a')
['3a']
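Some more practical uses of lookbehind, as a sketch: picking out numbers by what precedes them, and inserting text at positions defined purely by their surroundings. The input strings are invented for the example.

```python
import re

text = "cost: $15, tax: $3, qty 2"

# Numbers that directly follow a dollar sign (the '$' itself is not returned).
prices = re.findall(r"(?<=\$)\d+", text)

# Whole numbers that do NOT follow a dollar sign.
not_prices = re.findall(r"(?<!\$)\b\d+", text)

# Insert thousands separators: a position that has a digit before it and a
# multiple of three digits after it, up to the end of the string.
grouped = re.sub(r"(?<=\d)(?=(?:\d{3})+$)", ",", "1234567")

print(prices)      # ['15', '3']
print(not_prices)  # ['2']
print(grouped)     # 1,234,567
```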

3.6.9. (?(id/name)yes-pattern[|no-pattern])

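If the group with the given id or name took part in the match, yes-pattern is used; otherwise no-pattern is used. no-pattern is optional.

# Meant to match '<...>' or '...' but not '<...'. Because `.+` can itself match '<', a stricter body such as `\w+` is needed to really reject '<...'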
>>> re.search('(<)?.+(?(1)>|$)', 'hello world')
<_sre.SRE_Match object; span=(0, 11), match='hello world'>
>>> re.search('(<)?.+(?(1)>|$)', '<hello world>')
<_sre.SRE_Match object; span=(0, 13), match='<hello world>'>
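The effect of the conditional is easier to see when the body of the pattern cannot match '<' on its own. In this sketch `loose` is the pattern used above and `strict` is a hypothetical variant that uses `\w+` instead of `.+`:

```python
import re

loose = re.compile(r"(<)?.+(?(1)>|$)")    # the pattern from the example above
strict = re.compile(r"(<)?\w+(?(1)>|$)")  # hypothetical stricter variant

samples = ["hello", "<hello>", "<hello"]

# With .+ the engine can leave group 1 empty and let '.' swallow the '<',
# so even the unbalanced input matches.
loose_hits = [bool(loose.match(s)) for s in samples]

# With \w+ the '<' can only be matched by group 1, which then demands a '>'.
strict_hits = [bool(strict.match(s)) for s in samples]

print(loose_hits)   # [True, True, True]
print(strict_hits)  # [True, True, False]
```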

3.6.10. (?[aLuimsx][-[imsx]]:…)

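Sets flags inline for the enclosed part of the pattern, which saves passing a flags argument

Zero or more of the letters 'a', 'i', 'L', 'm', 's', 'u', 'x' may be given, optionally followed by '-' and one or more of 'i', 'm', 's', 'x'. Letters before the '-' switch the corresponding flag on and letters after it switch the flag off. The flags are re.A (ASCII-only matching), re.I (ignore case), re.L (locale dependent), re.M (multi-line), re.S (dot matches all characters), re.U (Unicode matching) and re.X (verbose); they are described in the next section. 'a', 'L' and 'u' are mutually exclusive.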
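A small sketch of inline flags, assuming Python 3.6 or later (where the scoped `(?i:…)` and `(?-i:…)` forms are available). The input strings are invented.

```python
import re

text = "HELLO world, hello WORLD"

# Scoped: only the part inside (?i:...) ignores case.
scoped = re.findall(r"(?i:hello) world", text)

# Global: (?i) at the very start applies to the whole pattern.
whole = re.findall(r"(?i)hello world", text)

# (?-i:...) switches a flag off again for the enclosed part.
turned_off = re.findall(r"(?i)a(?-i:b)", "AB Ab ab")

print(scoped)      # ['HELLO world']
print(whole)       # ['HELLO world', 'hello WORLD']
print(turned_off)  # ['Ab', 'ab']
```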

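4. Flags

Flag	Description
`re.I`/`re.IGNORECASE`	Makes matching case-insensitive
`re.L`/`re.LOCALE`	Locale-aware matching (only usable with bytes patterns in Python 3)
`re.M`/`re.MULTILINE`	Multi-line matching: `^` and `$` also match at the start and end of each line
`re.S`/`re.DOTALL`	Makes `.` match every character, including a newline
`re.A`/`re.ASCII`	Makes `\w`, `\W`, `\b`, `\B`, `\d`, `\D`, `\s` and `\S` match ASCII only instead of Unicode
`re.U`/`re.UNICODE`	Redundant in Python 3, where str patterns already match Unicode by default
`re.X`/`re.VERBOSE`	Ignores whitespace in the pattern and allows `#` comments, so the regular expression can be laid out more readably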
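Flags are passed through the `flags` argument and can be combined with `|`. The sketch below shows a pattern laid out with `re.VERBOSE` and a combination of `re.I` and `re.M`; the date pattern and its group names are made up for the example.

```python
import re

# In verbose mode, whitespace and '#' comments inside the pattern are ignored.
date_re = re.compile(r"""
    (?P<year>\d{4})  -   # four-digit year
    (?P<month>\d{2}) -   # two-digit month
    (?P<day>\d{2})       # two-digit day
""", re.VERBOSE)

parts = date_re.search("released on 2020-06-05").groupdict()

# Combine flags with |: ignore case and let ^ match at every line start.
combined = re.findall(r"^python", "Python\npython", re.I | re.M)

print(parts)     # {'year': '2020', 'month': '06', 'day': '05'}
print(combined)  # ['Python', 'python']
```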

5. references

https://docs.python.org/zh-cn/3/library/re.html

https://www.runoob.com/python/python-reg-expressions.html

https://docs.python.org/zh-cn/3/reference/datamodel.html

2020-06-05 Work notes