Python Pattern Matching and Regular Expressions

The pattern matching workflow

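  1. Import the regex module with import re.
  2. Create a Regex object with re.compile() (remember to pass a raw string).
  3. Pass the string you want to search to the Regex object's search() method, which returns a Match object.
  4. Call the Match object's group() method to get the text that actually matched.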
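
The four steps above can be sketched as a short script. The sample sentence and the variable names are made up for illustration. Since search() returns None when nothing matches, the sketch checks for that before calling group().

```python
import re  # Step 1: import the regex module

# Step 2: compile the pattern; the raw string keeps the backslashes literal.
phone_regex = re.compile(r'\d{3}-\d{3}-\d{4}')

# Step 3: search() returns a Match object, or None if nothing matches.
match = phone_regex.search('Call me at 415-555-1011 tomorrow.')

# Step 4: group() returns the text that actually matched.
found = match.group() if match is not None else None
print(found)
```

Guarding against None avoids an AttributeError when the text contains no phone number.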

Creating a Regex object

Find a phone number of the form 'xxx-xxx-xxxx' in a passage of text:

>>> import re
>>> phoneNumRegex = re.compile(r'\d{3}-\d{3}-\d{4}') # \d matches a single digit character
>>> mo = phoneNumRegex.search('My number is 123-456-7890') # search the string with the Regex object
>>> mo.group()
'123-456-7890'
>>> print(mo)
<re.Match object; span=(13, 25), match='123-456-7890'>

The same lookup can also be written as one chained expression:

>>> re.compile(r'\d{3}-\d{3}-\d{4}').search('My number is 123-456-7890').group()
'123-456-7890'

Matching more patterns with regular expressions

Grouping with parentheses

>>> phoneNumRegex = re.compile(r'(\d{3})-(\d{3}-\d{4})')
>>> mo = phoneNumRegex.search('My number is 123-456-7890')
>>> mo.group()
'123-456-7890'
>>> mo.group(1)
'123'
>>> mo.group(2)
'456-7890'
>>> mo.group(0)
'123-456-7890'
>>> mo.groups()
('123', '456-7890')
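
Because groups() returns a plain tuple, the groups can be unpacked straight into variables. The following is a small sketch with illustrative variable names; it also uses the fact that group() accepts several indices at once and then returns a tuple.

```python
import re

phone_regex = re.compile(r'(\d{3})-(\d{3}-\d{4})')
match = phone_regex.search('My number is 123-456-7890')

# groups() returns one string per parenthesized group, in order,
# so tuple unpacking gives each part its own name.
area_code, main_number = match.groups()
print(area_code)
print(main_number)

# group() with several indices returns the selected groups as a tuple.
both = match.group(1, 2)
print(both)
```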

To match literal parentheses in the text, escape them as \( and \):

>>> phoneNumRegex = re.compile(r'(\(\d{3}\)) (\d{3}-\d{4})')
>>> mo = phoneNumRegex.search('My number is (123) 456-7890')
>>> mo
<re.Match object; span=(13, 27), match='(123) 456-7890'>
>>> mo.group()
'(123) 456-7890'
>>> mo.groups()
('(123)', '456-7890')

Matching alternatives with the pipe character

>>> heroRegex = re.compile(r'Tom|Job')
>>> mo1 = heroRegex.search('Tom is better Job')
>>> mo2 = heroRegex.search('Job is better Tom')
>>> mo1
<re.Match object; span=(0, 3), match='Tom'>
>>> mo1.group()
'Tom'
>>> mo2.group()
'Job'
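
The pipe can also be used inside parentheses, so that several alternatives share a common prefix. The words below are invented for this sketch and are not part of the example above.

```python
import re

# 'Bar' is written once; the group lists the possible endings.
bar_regex = re.compile(r'Bar(man|maid|tender)')
match = bar_regex.search('The Bartender is better than the Barman')

whole = match.group()    # the full matched text
ending = match.group(1)  # only the alternative that matched
print(whole, ending)
```

As with Tom|Job, search() reports the match that starts earliest in the string, which is why 'Bartender' wins here.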

Optional matching with ?

(wo)? means the group wo may appear zero times or once.

>>> batRegex = re.compile(r'Bar(wo)?man')
>>> temp1 = batRegex.search('Barman')
>>> temp2 = batRegex.search('Barwoman')
>>> temp1.group()
'Barman'
>>> temp2.group()
'Barwoman'
>>> temp1
<re.Match object; span=(0, 6), match='Barman'>
>>> temp2
<re.Match object; span=(0, 8), match='Barwoman'>

* matches zero or more times

>>> batRegex = re.compile(r'Bar(wo)*man')
>>> temp1 = batRegex.search('Barman')
>>> temp2 = batRegex.search('Barwoman')
>>> temp3 = batRegex.search('Barwowowowoman')
>>> print(temp1, temp2, temp3, sep='\n')
<re.Match object; span=(0, 6), match='Barman'>
<re.Match object; span=(0, 8), match='Barwoman'>
<re.Match object; span=(0, 14), match='Barwowowowoman'>
>>> print(temp1.group(),temp2.group(),temp3.group())
Barman Barwoman Barwowowowoman

+ matches one or more times

>>> batRegex = re.compile(r'Bar(wo)+man')
>>> temp1 = batRegex.search('Barman')
>>> temp2 = batRegex.search('Barwoman')
>>> temp3 = batRegex.search('Barwowowowoman')
>>> print(temp1, temp2, temp3, sep='\n') # temp1 is None: 'Barman' contains no 'wo', so nothing matches
None
<re.Match object; span=(0, 8), match='Barwoman'>
<re.Match object; span=(0, 14), match='Barwowowowoman'>
>>> print(temp2.group(),temp3.group())
Barwoman Barwowowowoman
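
To see the three quantifiers side by side, the sketch below runs ?, * and + against the same word list. The word list and the dictionary layout are assumptions made for the comparison.

```python
import re

patterns = {
    '?': r'Bar(wo)?man',  # zero or one 'wo'
    '*': r'Bar(wo)*man',  # zero or more 'wo'
    '+': r'Bar(wo)+man',  # one or more 'wo'
}
words = ['Barman', 'Barwoman', 'Barwowoman']

results = {}
for symbol, pattern in patterns.items():
    # fullmatch() succeeds only if the whole word fits the pattern.
    results[symbol] = [w for w in words if re.fullmatch(pattern, w)]
    print(symbol, results[symbol])
```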

Matching a specific number of repetitions with { }

(Ha){3} matches the same text as (Ha)(Ha)(Ha)
(Ha){3,5} matches the same text as ((Ha)(Ha)(Ha))|((Ha)(Ha)(Ha)(Ha))|((Ha)(Ha)(Ha)(Ha)(Ha))

>>> haRegex = re.compile(r'(ha){3}')
>>> mo = haRegex.search('hahaha')
>>> mo.group()
'hahaha'
>>> mo.group(0)
'hahaha'
>>> mo.group(1)
'ha'
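
The braces also accept open-ended ranges: {3,} means three or more, and {,2} means zero to two. A small sketch, using a sample string with four repetitions and illustrative variable names:

```python
import re

text = 'HaHaHaHa'  # four repetitions of 'Ha'

exactly_three = re.fullmatch(r'(Ha){3}', text)    # needs exactly 3
three_to_five = re.fullmatch(r'(Ha){3,5}', text)  # 3, 4 or 5
three_or_more = re.fullmatch(r'(Ha){3,}', text)   # no upper bound
up_to_two = re.fullmatch(r'(Ha){,2}', text)       # 0, 1 or 2

for name, m in [('{3}', exactly_three), ('{3,5}', three_to_five),
                ('{3,}', three_or_more), ('{,2}', up_to_two)]:
    print(name, 'matches' if m else 'does not match')
```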

Greedy and non-greedy matching

>>> import re
>>> greedyHaRegex = re.compile(r'(Ha){3,5}')
>>> mo1 = greedyHaRegex.search('HaHaHaHaHa')
>>> mo1.group()
'HaHaHaHaHa'
>>> nongreedyHaRegex = re.compile(r'(Ha){3,5}?')
>>> mo2 = nongreedyHaRegex.search('HaHaHaHaHa')
>>> mo2.group()
'HaHaHa'
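
The same difference shows up with digits. In this sketch (the sample string is an assumption) the greedy form takes as many repetitions as the range allows, while the form with a trailing ? stops at the minimum.

```python
import re

text = '1234567'

# Greedy: take up to five digits if they are available.
greedy = re.search(r'\d{3,5}', text).group()

# Non-greedy: the ? after the braces asks for as few as possible.
lazy = re.search(r'\d{3,5}?', text).group()

print(greedy, lazy)
```

Note that ? has two unrelated meanings: after a group it marks the group as optional, and after a quantifier it makes that quantifier non-greedy.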

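The findall() method

findall() returns every match in the string. If the pattern has no groups it returns a list of strings; if it has groups it returns a list of tuples, with one string per group.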

>>> phoneNumRegex = re.compile(r'\d{3}-\d{3}-\d{4}')
>>> mo = phoneNumRegex.findall('Cell:415-555-9999 work: 212-555-1000')
>>> mo
['415-555-9999', '212-555-1000']
>>> phoneNumRegex = re.compile(r'(\d{3})-(\d{3})-(\d{4})')
>>> mo = phoneNumRegex.findall('Cell:415-555-9999 work: 212-555-1000')
>>> mo
[('415', '555', '9999'), ('212', '555', '1000')]
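
When the pattern has groups but the full numbers are still needed, the tuples can be joined back together. The sketch below also uses finditer(), which yields Match objects instead of strings; the variable names are illustrative.

```python
import re

text = 'Cell:415-555-9999 work: 212-555-1000'
grouped = re.compile(r'(\d{3})-(\d{3})-(\d{4})')

# With groups, findall() yields tuples; join them to rebuild each number.
numbers = ['-'.join(parts) for parts in grouped.findall(text)]
print(numbers)

# finditer() yields Match objects, so the full text and its position
# in the string are both available.
spans = [(m.group(), m.span()) for m in grouped.finditer(text)]
print(spans)
```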

Character classes

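Shorthand class   Meaning
\d                Any digit, [0-9]
\D                Any character that is not a digit, [^0-9]
\w                Any letter, digit, or underscore
\W                Any character that is not a letter, digit, or underscore
\s                Any space, tab, or newline character
\S                Any character that is not a space, tab, or newline
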
>>> xmasRegex = re.compile(r'\d+\s\w+')
>>> mo = xmasRegex.findall('12 dru, 11 pip, 10 lords, 5 rings')
>>> mo
['12 dru', '11 pip', '10 lords', '5 rings']
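
To get a feel for the shorthand classes, it helps to run several of them over one string. The sample string here is made up for the purpose.

```python
import re

sample = 'Room_7 opens at 9:30\tsharp!'

digits = re.findall(r'\d', sample)      # single digits
non_word = re.findall(r'\W', sample)    # anything that is not \w
whitespace = re.findall(r'\s', sample)  # spaces and the tab
words = re.findall(r'\w+', sample)      # runs of letters, digits, underscores

print(digits)
print(non_word)
print(whitespace)
print(words)
```

The underscore counts as a \w character, which is why 'Room_7' comes back as one word.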

Making your own character classes

>>> vowelRegex = re.compile(r'[123abcefg]')
>>> mo = vowelRegex.findall('Hello Word 2018!')
>>> mo
['e', '2', '1']
>>> vowelRegex = re.compile(r'[1-5.]')
>>> mo = vowelRegex.findall('Hello Word 2018.')
>>> mo
['2', '1', '.']

A ^ placed right after the opening [ negates the class: any character not listed inside the [] will match.

>>> vowelRegex = re.compile(r'[^123abcefg]')
>>> mo = vowelRegex.findall('Hello Word 2018.')
>>> mo
['H', 'l', 'l', 'o', ' ', 'W', 'o', 'r', 'd', ' ', '0', '8', '.']

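The caret and dollar sign characters

Unlike negation inside [], a caret at the start of the pattern, as in (r'^xxxx'), means the match must begin at the start of the string.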

>>> begRegex = re.compile(r'^hello')
>>> mo = begRegex.search('hello world')
>>> mo
<re.Match object; span=(0, 5), match='hello'>
>>> mo.group()
'hello'

If nothing matches, search() returns None:

>>> begRegex = re.compile(r'^hello')
>>> mo = begRegex.search('hi world')
>>> mo is None
True

A dollar sign at the end of the pattern, as in (r'world$'), means the string must end with the given text:

>>> begRegex = re.compile(r'world$')
>>> mo = begRegex.search('hello world')
>>> mo.group()
'world'

(r'^\d+$') means the entire string must consist of digits:

>>> begRegex = re.compile(r'^\d+$')
>>> mo = begRegex.search('123hello world')
>>> mo
>>> mo is None
True
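
A common use of ^ and $ together is input validation. The helper below is a hypothetical example, not part of the original code. One detail worth knowing: $ also matches just before a trailing newline, whereas re.fullmatch() requires the pattern to cover the entire string.

```python
import re

whole_number = re.compile(r'^\d+$')

def is_all_digits(text):
    """Return True if text consists of digits only (hypothetical helper)."""
    return whole_number.search(text) is not None

print(is_all_digits('1234567890'))
print(is_all_digits('123hello world'))
print(is_all_digits(''))

# $ tolerates one trailing newline; fullmatch() does not.
with_dollar = whole_number.search('123\n') is not None
with_fullmatch = re.fullmatch(r'\d+', '123\n') is not None
print(with_dollar, with_fullmatch)
```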

The wildcard character

The . (dot) matches any single character except a newline; spaces are matched too:

>>> atRegex = re.compile(r'.at')
>>> atRegex.findall('The cat in the hat sat on the flat mat.')
['cat', 'hat', 'sat', 'lat', 'mat']

.* matches any run of zero or more characters other than newline:

>>> nameRegex = re.compile(r'First Name:(.*) Last Name:(.*)')
>>> mo = nameRegex.search('First Name:AL Last Name:Sweigart')
>>> mo.groups()
('AL', 'Sweigart')
>>> mo
<re.Match object; span=(0, 32), match='First Name:AL Last Name:Sweigart'>
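
Because .* is greedy, it grabs as much text as it can. Adding a ? makes it non-greedy. A minimal sketch with a made-up string containing two closing angle brackets:

```python
import re

text = '<To serve man> for dinner.>'

# Greedy: .* runs to the last '>' in the string.
greedy = re.search(r'<.*>', text).group()

# Non-greedy: .*? stops at the first '>' it can.
lazy = re.search(r'<.*?>', text).group()

print(greedy)
print(lazy)
```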

Making the dot match newlines with re.DOTALL

>>> newLineREgex = re.compile(r'.*', re.DOTALL)
>>> mo = newLineREgex.search("Hi:\nYes,I can do It.\nThank you.")
>>> print(mo.group())
Hi:
Yes,I can do It.
Thank you.
>>> newLineREgex = re.compile(r'.*')
>>> mo = newLineREgex.search("Hi:\nYes,I can do It.\nThank you.")
>>> print(mo.group())
Hi:

Case-insensitive matching with re.I or re.IGNORECASE

>>> robocop = re.compile(r'robocop', re.I)
>>> mo = robocop.findall('RobocoproBOCOPRRROBOCOP')
>>> print(mo)
['Robocop', 'roBOCOP', 'ROBOCOP']

Replacing text with the sub() method

>>> namesRegex = re.compile(r'Agent \w+')
>>> namesRegex.sub('CENSORED', 'Agent Alice gave the secret documents to Agent Bob')
'CENSORED gave the secret documents to CENSORED'

Keeping the first character of each name

>>> namesRegex = re.compile(r'Agent (\w)\w+')
>>> namesRegex.sub(r'\1****', 'Agent Alice told Agent Carol that Agent ')
'A**** told C**** that Agent '

To keep the first three characters instead:

>>> namesRegex = re.compile(r'Agent (\w{3})\w+')
>>> namesRegex.sub(r'\1****', 'Agent Alice told Agent Carol that Agent ')
'Ali**** told Car**** that Agent '

The same technique can mask the middle four digits of an 11-digit phone number:

>>> namesRegex = re.compile(r'(\d{3})\d{4}(\d{4})')
>>> namesRegex.findall(' tell:15735184252;call:13835213493')
[('157', '4252'), ('138', '3493')]
>>> namesRegex.sub(r'\1****\2', ' tell:15735184252;call:13835213493')
' tell:157****4252;call:138****3493'
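
The replacement passed to sub() can also be a function, which is handy when the replacement depends on what was matched. This sketch, with a hypothetical mask() helper, replaces every letter after the first with one asterisk, so the masked name keeps its length.

```python
import re

agent_regex = re.compile(r'Agent (\w)(\w*)')

def mask(match):
    # Keep the first letter; replace each remaining letter with '*'.
    return match.group(1) + '*' * len(match.group(2))

masked = agent_regex.sub(mask, 'Agent Alice told Agent Bob')
print(masked)
```

sub() calls the function once per match and inserts whatever string it returns.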

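Managing complex regular expressions

re.VERBOSE lets you spread a regular expression over several lines; whitespace and # comments inside the pattern are ignored.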

>>> namesRegex = re.compile(r'''
(\d{3}|\(\d{3}\))? # area code
(\s|-|\.)? # separator
\d{4} # first 4 digits
(\s|-|\.)? # separator
\d{4} # last 4 digits
''', re.VERBOSE)
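
To check that the verbose layout changes nothing but readability, the sketch below compiles the pattern above in both verbose and single-line form and runs them on the same text. The sample text and the variable names are assumptions.

```python
import re

verbose_regex = re.compile(r'''
    (\d{3}|\(\d{3}\))?   # area code (optional)
    (\s|-|\.)?           # separator
    \d{4}                # first 4 digits
    (\s|-|\.)?           # separator
    \d{4}                # last 4 digits
    ''', re.VERBOSE)

# The same pattern squeezed onto one line, without re.VERBOSE.
compact_regex = re.compile(r'(\d{3}|\(\d{3}\))?(\s|-|\.)?\d{4}(\s|-|\.)?\d{4}')

text = 'Office: 138-1234-5678, mobile: (139) 8765 4321'
verbose_hits = [m.group() for m in verbose_regex.finditer(text)]
compact_hits = [m.group() for m in compact_regex.finditer(text)]
print(verbose_hits)
print(verbose_hits == compact_hits)
```

finditer() is used here rather than findall() because the pattern contains groups, and findall() would return tuples of the groups instead of the full matches.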

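Combining re.DOTALL, re.VERBOSE, and re.I

re.compile() accepts only a single value as its second argument, so combine the flags with the pipe character (the bitwise OR operator):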

>>> someRegex = re.compile('foo', re.IGNORECASE|re.VERBOSE|re.DOTALL)
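
A slightly fuller sketch shows all three flags doing something at once. The pattern and the test string are invented for illustration.

```python
import re

pattern = r'''
    hello   # greeting
    .       # any character; with DOTALL this includes a newline
    world
    '''

all_flags = re.compile(pattern, re.IGNORECASE | re.VERBOSE | re.DOTALL)
no_dotall = re.compile(pattern, re.IGNORECASE | re.VERBOSE)

text = 'HELLO\nWorld'
with_dotall = all_flags.search(text)
without_dotall = no_dotall.search(text)

print(repr(with_dotall.group()))
print(without_dotall)
```

Dropping re.DOTALL from the combination is enough to make the match fail, because the dot then refuses to cross the newline.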

Mini project: a phone number and email extractor

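#! python3
# Extracts phone numbers and email addresses from the clipboard.
import re

import pyperclip

mailRegex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+     # username
    @                     # @ symbol
    [a-zA-Z0-9.-]+        # domain name
    (\.[a-zA-Z]{2,4})     # dot-something, e.g. .com
    )''', re.VERBOSE)
phoneRegex = re.compile(r'''
    \(?(\d{3})\)?         # area code, optionally in parentheses
    (\s|-|\.)?            # optional separator
    (\d{3})               # first 3 digits
    (\s|-|\.)             # separator
    (\d{4})               # last 4 digits
    ''', re.VERBOSE)

text = str(pyperclip.paste())
matches = []
# Normalize every phone number to the form 123-456-1234.
for group in phoneRegex.findall(text):
    phoneNum = '-'.join([group[0], group[2], group[4]])
    matches.append(phoneNum)
for group in mailRegex.findall(text):
    matches.append(group[0])  # index 0 is the outermost group: the whole address

if matches:
    pyperclip.copy('\n'.join(matches))
    print('Copied to clipboard:')
    print('\n'.join(matches))
else:
    print('No phone numbers or email addresses found.')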
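
For trying the extraction logic without a clipboard, the matching part can be pulled into a function that takes plain text. This is a sketch under a few assumptions: extract_contacts is a hypothetical name, the sample text is made up, and the patterns are modeled on the ones above, with a dot allowed in the email username.

```python
import re

mail_regex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+     # username (the dot is an assumed addition)
    @
    [a-zA-Z0-9.-]+        # domain name
    (\.[a-zA-Z]{2,4})     # top-level domain
    )''', re.VERBOSE)
phone_regex = re.compile(r'''
    \(?(\d{3})\)?         # area code
    (\s|-|\.)?            # optional separator
    (\d{3})               # first 3 digits
    (\s|-|\.)             # separator
    (\d{4})               # last 4 digits
    ''', re.VERBOSE)

def extract_contacts(text):
    """Return normalized phone numbers followed by email addresses."""
    found = []
    for groups in phone_regex.findall(text):
        # Indices 0, 2 and 4 hold the digit groups; 1 and 3 are separators.
        found.append('-'.join([groups[0], groups[2], groups[4]]))
    for groups in mail_regex.findall(text):
        found.append(groups[0])  # outermost group: the whole address
    return found

sample = ('Call 415-555-1234 or 800.555.9999; write to '
          'info@example.com or sales@shop.example.org today')
contacts = extract_contacts(sample)
print('\n'.join(contacts))
```

Keeping the clipboard calls out of the function makes it easy to test with ordinary strings before wiring it back up to pyperclip.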
