正则表达式复习

常用匹配规则

常用方法

match()

从字符串的开头开始匹配，不匹配则失败。

贪婪与非贪婪

import re  

content = 'Hello 1234567 World_This is a Regex Demo'  

#.*匹配任意个字符（除了换行符），默认贪心，加上?为非贪心，一般用非贪心.*?
#但要注意在末尾匹配任意字符时，应用贪心匹配.*，否则匹配不到任何内容
result = re.match('^He.*?(\d+).*Demo$', content)
result1 = re.match('http.*?comment/(.*?)', content)  

print(result)
print(result.group(1))  #提取第一个括号内匹配的内容
print(result.span())    #匹配到的范围
print('result1', result1.group(1))  

output：
<re.Match object; span=(0, 40), match='Hello 1234567 World_This is a Regex Demo'>
1234567
(0, 40)
result1

修饰符
result = re.match(regex, content, re.S | re.I)
re.S和re.I十分常用。
re.DOTALL可以匹配换行符。
re.VERBOSE忽略regex中的空白符和注释。
转义匹配
在前面加上反斜线，如\.匹配.

search()

返回第一个匹配成功的结果。

result = re.search('^He.*?(\d+).*Demo$', content, re.S)

findall()

返回匹配正则表达式的所有内容。
返回结果为列表类型，列表中的元素为元组类型。

results = re.findall('<li.*?href="(.*?)".*?singer="(.*?)">(.*?)</a>', html, re.S)
for result in results:
    print(result)
    print(result[0], result[1], result[2])

sub()

用来修改文本。

import re

content = '54aK54yr5oiR54ix5L2g'
content = re.sub('\d+', '', content)
print(content)

output：
aKyroiRixLg

compile()

将正则字符串编译成正则表达式对象，以便日后复用。

pattern = re.compile('\d{2}:\d{2}')
result1 = re.sub(pattern, '', content)

常用Regex

email

emailRegex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+
    @
    [a-zA-Z0-9.-]+
    \.[a-zA-Z]{2,4}
    )''', re.VERBOSE)