Regular Expressions in Python

LittleAO
2023-05-24
Last updated: 2023-11-13

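Regular expressions

A regular expression, or RegEx, is a special text string that describes a search pattern. It can be used to check whether a pattern occurs in a piece of text and to extract or replace the parts that match. To use RegEx in Python, first import the built-in module named re.

The re module

Once the module is imported, we can use it to detect or find patterns.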

import re

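Methods in the re module

To find a pattern, we use the functions below, each of which searches a string for matches in a different way.

re.match(): looks for the pattern only at the very beginning of the string and returns a match object if it is found there, otherwise None.
re.search(): scans the whole string, including multi-line strings, and returns a match object for the first match, otherwise None.
re.findall(): returns a list containing all matches
re.split(): takes a string, splits it at each match, and returns a list
re.sub(): replaces one or more matches within a string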
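
To make the differences concrete, here is a minimal sketch that runs all five functions on one sentence. The sample text and variable names are made up for illustration and do not come from the examples that follow.

```python
import re

text = "Cats chase rats; rats fear cats."

# match: the pattern is tried only at position 0 of the string.
at_start = re.match(r"rats", text)            # None, 'rats' is not at the start
at_start_ci = re.match(r"cats", text, re.I)   # matches 'Cats' thanks to re.I

# search: scans forward and returns the first match only.
first_hit = re.search(r"rats", text)

# findall: every non-overlapping match, as a list of strings.
all_hits = re.findall(r"[cr]ats", text, re.I)

# split: cut the string wherever the pattern matches.
parts = re.split(r";\s*", text)

# sub: replace every match with the replacement string.
replaced = re.sub(r"rats", "mice", text)

print(at_start)              # None
print(at_start_ci.group())   # Cats
print(first_hit.span())      # (11, 15)
print(all_hits)              # ['Cats', 'rats', 'rats', 'cats']
print(parts)                 # ['Cats chase rats', 'rats fear cats.']
print(replaced)              # Cats chase mice; mice fear cats.
```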

match

# Syntax
re.match(substring, string, re.I)
# substring is a string or pattern, string is the text we search in, and re.I makes the match case-insensitive.
import re

txt = 'I love to teach python and javaScript'
# It returns an object with span and match.
match = re.match('I love to teach', txt, re.I)
print(match)  # <re.Match object; span=(0, 15), match='I love to teach'>
# The span method gives the start and end positions of the match as a tuple
span = match.span()
print(span)     # (0, 15)
# Unpack the start and end positions from the span
start, end = span
print(start, end)  # 0 15
substring = txt[start:end]
print(substring)       # I love to teach

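As the example above shows, the pattern (or substring) we are looking for is 'I love to teach'. The match function returns an object only if the text starts with the pattern.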

import re

txt = 'I love to teach python and javaScript'
match = re.match('I like to teach', txt, re.I)
print(match)  # None

The string does not start with 'I like to teach', so there is no match and the match function returns None.
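
Because a failed match returns None, calling span() on the result directly would raise an AttributeError. A minimal sketch of guarding against that is shown here; the helper name describe_prefix is hypothetical and only used for illustration.

```python
import re

txt = 'I love to teach python and javaScript'

def describe_prefix(pattern, text):
    """Return the matched prefix, or a fallback message when there is none."""
    m = re.match(pattern, text, re.I)
    if m is None:
        return 'no match at the start'
    return m.group()

print(describe_prefix('I love to teach', txt))  # I love to teach
# 'teach' occurs in the text, but not at the start, so match finds nothing
print(describe_prefix('teach', txt))            # no match at the start
```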

search

import re

txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

# It returns an object containing the matched text and its span.
match = re.search('first', txt, re.I)
print(match)  # <re.Match object; span=(100, 105), match='first'>
# The span method gives the start and end positions of the match as a tuple
span = match.span()
print(span)     # (100, 105)
# Unpack the start and end positions from the span
start, end = span
print(start, end)  # 100 105
substring = txt[start:end]
print(substring)       # first

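As you can see, search is more flexible than match because it looks for the pattern throughout the text rather than only at the start. search returns a match object for the first match found, otherwise None. When every occurrence is needed, findall is the better fit: it checks the whole string for the pattern and returns all matches as a list.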
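
findall returns only the matched strings. If the position of every match is also needed, one option is re.finditer, which is not covered elsewhere in this post and is shown here only as a sketch. It yields one match object per occurrence.

```python
import re

txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

# search stops at the first occurrence
first = re.search('language', txt)
print(first.span())  # (29, 37)

# finditer yields a match object for every occurrence
positions = [(m.group(), m.span()) for m in re.finditer('language', txt)]
print(positions)  # [('language', (29, 37)), ('language', (118, 126))]
```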

Searching for all matches using findall

findall() returns a list of all matches.

txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

# It returns a list
matches = re.findall('language', txt, re.I)
print(matches)  # ['language', 'language']

As you can see, the word 'language' appears twice in the string. Let us practice some more. Now we will look for both Python and python in the string:

txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

# It returns a list
matches = re.findall('python', txt, re.I)
print(matches)  # ['Python', 'python']

Because we used re.I, both lowercase and uppercase letters are matched. Without the re.I flag, we have to write the pattern differently. Let us check:

txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

matches = re.findall('Python|python', txt)
print(matches)  # ['Python', 'python']

# Or use a character set
matches = re.findall('[Pp]ython', txt)
print(matches)  # ['Python', 'python']

Replacing substrings

txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

# Pass re.I by keyword: the fourth positional parameter of re.sub is count, not flags
match_replaced = re.sub('Python|python', 'JavaScript', txt, flags=re.I)
print(match_replaced)
# JavaScript is the most beautiful language that a human being has ever created.
# I recommend JavaScript for a first programming language
# Or
match_replaced = re.sub('[Pp]ython', 'JavaScript', txt, flags=re.I)
print(match_replaced)  # same two lines as above
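
One detail about re.sub is easy to miss: its signature is re.sub(pattern, repl, string, count=0, flags=0), so a flag passed as the fourth positional argument is read as count. The sketch below uses a made-up sample string to show what each keyword does, and what the integer value of re.I would mean if it were taken as a count.

```python
import re

txt = 'Python, python, PYTHON'
pattern = 'python'

# flags=re.I: case-insensitive, every match is replaced
all_replaced = re.sub(pattern, 'JavaScript', txt, flags=re.I)
print(all_replaced)  # JavaScript, JavaScript, JavaScript

# count=1: stop after the first replacement
first_only = re.sub(pattern, 'JavaScript', txt, count=1, flags=re.I)
print(first_only)    # JavaScript, python, PYTHON

# re.I has the integer value 2, so treating it as count means
# "at most two replacements, case-sensitive"
print(int(re.I))     # 2
as_count = re.sub(pattern, 'JavaScript', txt, count=int(re.I))
print(as_count)      # Python, JavaScript, PYTHON
```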

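Let us add one more example. The following string is really hard to read unless we remove the % symbols. Replacing % with an empty string cleans up the text.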

txt = '''%I a%m te%%a%%che%r% a%n%d %% I l%o%ve te%ach%ing. 
T%he%re i%s n%o%th%ing as r%ewarding a%s e%duc%at%i%ng a%n%d e%m%p%ow%er%ing p%e%o%ple.
I fo%und te%a%ching m%ore i%n%t%er%%es%ting t%h%an any other %jobs. 
D%o%es thi%s m%ot%iv%a%te %y%o%u to b%e a t%e%a%cher?'''

matches = re.sub('%', '', txt)
print(matches)
# I am teacher and  I love teaching.
# There is nothing as rewarding as educating and empowering people.
# I found teaching more interesting than any other jobs.
# Does this motivate you to be a teacher?

Splitting text

txt = '''I am teacher and  I love teaching.
There is nothing as rewarding as educating and empowering people.
I found teaching more interesting than any other jobs.
Does this motivate you to be a teacher?'''
print(re.split('\n', txt)) # split at \n, the end-of-line character
# ['I am teacher and  I love teaching.', 'There is nothing as rewarding as educating and empowering people.', 'I found teaching more interesting than any other jobs.', 'Does this motivate you to be a teacher?']
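
Splitting on a fixed character such as \n could also be done with str.split. re.split becomes more useful when the separator varies. The following sketch uses an invented record string to split on commas or semicolons with optional surrounding whitespace.

```python
import re

record = 'apple, banana;cherry ,  date'

# \s* absorbs any whitespace around the , or ; separator
fields = re.split(r'\s*[,;]\s*', record)
print(fields)  # ['apple', 'banana', 'cherry', 'date']

# maxsplit limits the number of cuts; the rest of the string is kept whole
head_and_rest = re.split(r'\s*[,;]\s*', record, maxsplit=1)
print(head_and_rest)  # ['apple', 'banana;cherry ,  date']
```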

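Writing RegEx patterns

To declare a string variable we use single or double quotes. To declare a RegEx pattern we use a raw string, r'', so that backslashes reach the re module unchanged. The following pattern matches only the lowercase 'apple'; to make it case-insensitive we either rewrite the pattern or add a flag.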

import re

regex_pattern = r'apple'
txt = 'Apple and banana are fruits. An old cliche says an apple a day a doctor way has been replaced by a banana a day keeps the doctor far far away. '
matches = re.findall(regex_pattern, txt)
print(matches)  # ['apple']

# Use re.I to make the match case-insensitive
matches = re.findall(regex_pattern, txt, re.I)
print(matches)  # ['Apple', 'apple']
# Or we can use a set of characters
regex_pattern = r'[Aa]pple'  # the first letter can be uppercase (Apple) or lowercase (apple)
matches = re.findall(regex_pattern, txt)
print(matches)  # ['Apple', 'apple']
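  • []: a set of characters
    • [a-c] means a or b or c
    • [a-z] means any lowercase letter from a to z
    • [A-Z] means any uppercase letter from A to Z
    • [0-3] means 0 or 1 or 2 or 3
    • [0-9] means any digit from 0 to 9
    • [A-Za-z0-9] means any single character that is a to z, A to Z, or 0 to 9
  • \: escapes a special character or introduces a special sequence
    • \d matches a single digit (0-9)
    • \D matches a single character that is not a digit
  • .: any character except the newline (\n)
  • ^: starts with
    • r'^subString', e.g. r'^love', a string that starts with the word "love"
    • r'[^abc]' means not a, not b, not c
  • $: ends with
    • r'subString$', e.g. r'love$', a string that ends with the word "love"
  • *: zero or more times
    • r'[a]*' means a occurs zero or more times
  • +: one or more times
    • r'[a]+' means a occurs at least once
  • ?: zero or one time
    • r'[a]?' means a occurs zero times or once
  • {3}: exactly 3 repetitions of the preceding element
  • {3,}: at least 3 repetitions
  • {3,8}: 3 to 8 repetitions
  • |: either or
    • r'apple|banana' means either apple or banana
  • (): capture and group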
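
Capture groups and the $ anchor appear in the list above but have no example of their own further down. Here is a small sketch, assuming dates written as "Month day, year". The pattern is illustrative only and not a general date parser.

```python
import re

txt = 'made on December 6, 2019 and revised on July 8, 2021'

# Each pair of parentheses captures one part of the match.
# With groups in the pattern, findall returns a tuple per match.
dates = re.findall(r'([A-Z][a-z]+) (\d{1,2}), (\d{4})', txt)
print(dates)  # [('December', '6', '2019'), ('July', '8', '2021')]

# $ anchors the pattern at the end of the string
last_year = re.search(r'(\d{4})$', txt)
print(last_year.group(1))  # 2021
```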


Examples follow.

Square brackets

Let us use square brackets to cover both upper and lower case.

regex_pattern = r'[Aa]pple' # the square brackets mean either A or a
txt = 'Apple and banana are fruits. An old cliche says an apple a day a doctor way has been replaced by a banana a day keeps the doctor far far away.'
matches = re.findall(regex_pattern, txt)
print(matches)  # ['Apple', 'apple']

To look for banana as well, we change the code as follows:

regex_pattern = r'[Aa]pple|[Bb]anana'
txt = 'Apple and banana are fruits. An old cliche says an apple a day a doctor way has been replaced by a banana a day keeps the doctor far far away.'
matches = re.findall(regex_pattern, txt)
print(matches)  # ['Apple', 'banana', 'apple', 'banana']

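Using square brackets and the or operator, we extracted Apple, apple, and both occurrences of banana. The [Bb] set would also match Banana if it appeared in the text.

The escape character (\) in RegEx

regex_pattern = r'\d'  # \d is a special sequence that matches a digit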
txt = 'This regular expression example was made on December 6,  2019 and revised on July 8, 2021'
matches = re.findall(regex_pattern, txt)
print(matches)  # ['6', '2', '0', '1', '9', '8', '2', '0', '2', '1'], this is not what we want

One or more times (+)

regex_pattern = r'\d+'  # \d is a special sequence that matches a digit, + means one or more times
txt = 'This regular expression example was made on December 6,  2019 and revised on July 8, 2021'
matches = re.findall(regex_pattern, txt)
print(matches)  # ['6', '2019', '8', '2021'] - now, this is better!
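
The uppercase sequence \D is the complement of \d: it matches any single character that is not a digit. The following sketch uses a made-up order string to contrast the two.

```python
import re

txt = 'Order 66 shipped on 2021-07-08'

digit_runs = re.findall(r'\d+', txt)
print(digit_runs)      # ['66', '2021', '07', '08']

non_digit_runs = re.findall(r'\D+', txt)
print(non_digit_runs)  # ['Order ', ' shipped on ', '-', '-']

# Deleting every non-digit leaves only the digits
digits_only = re.sub(r'\D', '', txt)
print(digits_only)     # 6620210708
```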

Period (.)

regex_pattern = r'[a].'  # the square brackets match a, and the period matches any character except a newline
txt = '''Apple and banana are fruits'''
matches = re.findall(regex_pattern, txt)
print(matches)  # ['an', 'an', 'an', 'a ', 'ar']

regex_pattern = r'[a].+'  # . any character, + repeats that any-character one or more times
matches = re.findall(regex_pattern, txt)
print(matches)  # ['and banana are fruits']

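Zero or more times (*)

Zero or more times. The pattern may not occur at all, or it may occur many times.

regex_pattern = r'[a].*'  # . any character, * repeats that any-character zero or more times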
txt = '''Apple and banana are fruits'''
matches = re.findall(regex_pattern, txt)
print(matches)  # ['and banana are fruits']
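
On this text, .+ and .* give the same result, so the difference between the quantifiers is easier to see on a smaller input. The sketch below uses an invented string and applies *, + and ? to the single letter b.

```python
import re

words = 'a ab abbb b'

# b* : a followed by zero or more b, so a lone 'a' still matches
zero_or_more = re.findall(r'ab*', words)
print(zero_or_more)  # ['a', 'ab', 'abbb']

# b+ : at least one b is required, so the lone 'a' is dropped
one_or_more = re.findall(r'ab+', words)
print(one_or_more)   # ['ab', 'abbb']

# b? : at most one b is consumed
zero_or_one = re.findall(r'ab?', words)
print(zero_or_one)   # ['a', 'ab', 'ab']
```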

Zero or one time (?)

Zero or one time. The pattern may not occur, or it may occur once.

txt = '''I am not sure if there is a convention how to write the word e-mail.
Some people write it as email others may write it as Email or E-mail.'''
regex_pattern = r'[Ee]-?mail'  # ? means the '-' is optional
matches = re.findall(regex_pattern, txt)
print(matches)  # ['e-mail', 'email', 'Email', 'E-mail']

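Quantifiers in RegEx

We can use curly braces to specify how many times the preceding element must repeat. Suppose we are interested in runs of exactly four digits:

txt = 'This regular expression example was made on December 6,  2019 and revised on July 8, 2021'
regex_pattern = r'\d{4}'  # exactly four digits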
matches = re.findall(regex_pattern, txt)
print(matches)  # ['2019', '2021']

txt = 'This regular expression example was made on December 6,  2019 and revised on July 8, 2021'
regex_pattern = r'\d{1,4}'   # 1 to 4 digits; no space is allowed inside the braces
matches = re.findall(regex_pattern, txt)
print(matches)  # ['6', '2019', '8', '2021']
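
Whitespace inside the braces matters. In Python's re module, a brace expression containing a space, such as {1, 4}, is not parsed as a quantifier, and the characters are matched literally instead. The sketch below compares the two spellings on a shortened version of the sample sentence and adds an open-ended {3,}.

```python
import re

txt = 'made on December 6,  2019 and revised on July 8, 2021'

# Proper quantifier: one to four digits
with_quantifier = re.findall(r'\d{1,4}', txt)
print(with_quantifier)   # ['6', '2019', '8', '2021']

# With a space, '{1, 4}' is literal text that never occurs in txt
with_space = re.findall(r'\d{1, 4}', txt)
print(with_space)        # []

# Open-ended form: three or more digits
three_or_more = re.findall(r'\d{3,}', txt)
print(three_or_more)     # ['2019', '2021']
```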

Caret (^)

  • Starts with
txt = 'This regular expression example was made on December 6,  2019 and revised on July 8, 2021'
regex_pattern = r'^This'  # ^ means starts with
matches = re.findall(regex_pattern, txt)
print(matches)  # ['This']
  • Negation
txt = 'This regular expression example was made on December 6,  2019 and revised on July 8, 2021'
regex_pattern = r'[^A-Za-z ]+'  # inside a set, ^ means negation: not A to Z, not a to z, no space
matches = re.findall(regex_pattern, txt)
print(matches)  # ['6,', '2019', '8,', '2021']
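
By default, ^ and $ anchor to the start and end of the whole string. For multi-line text, the re.M (re.MULTILINE) flag, which is not used elsewhere in this post, makes them anchor at every line instead. This sketch uses an invented three-line string.

```python
import re

txt = 'I love\nyou love\nlove wins'

# Without re.M, ^ only matches at the very start of the string
start_default = re.findall(r'^love', txt)
print(start_default)    # []

# With re.M, ^ also matches right after each newline
start_multiline = re.findall(r'^love', txt, re.M)
print(start_multiline)  # ['love']

# The same applies to $ and line endings
end_default = re.findall(r'love$', txt)
print(end_default)      # []
end_multiline = re.findall(r'love$', txt, re.M)
print(end_multiline)    # ['love', 'love']
```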