Python Study Notes: Regular Expression

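A regular expression is a small, old pattern language that is very powerful but also very hard to read. Most programming languages are interpreted line by line, whereas a regular expression is interpreted character by character.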

Regular Expression Module

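Regular expressions are especially useful for text processing and text mining, and they are widely used across programming languages.

In Python, regular expressions are available through the re module in the standard library.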

import re

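Commonly used functions include match(), search(), findall(), and split(). Among them:

search(): finds the first substring that matches the pattern and returns it as a Match object (or None if nothing matches), similar to the string method str.find().
findall(): finds all matching substrings and returns them as a list; often used to extract text.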

import re

test = "Quarantine Summary Report - Mar. 14, 23:00 for test@abc.com"

# search() returns a Match object for the first match, or None
result = re.search(r"\S+@\S+", test)
if result:
    print("This line has an email address.")

# findall() returns every match as a list of strings
emails = re.findall(r"\S+@\S+", test)
if emails:
    print(emails)
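
match() and split() are mentioned above but not demonstrated. The following is a minimal sketch of how they behave next to search(), reusing the same sample line; the split pattern and the short date string are assumptions chosen for illustration only.

```python
import re

line = "Quarantine Summary Report - Mar. 14, 23:00 for test@abc.com"

# match() only succeeds if the pattern matches at the very start of the string.
no_match = re.match(r"\S+@\S+", line)      # None: the line does not start with an email
at_start = re.match(r"Quarantine", line)   # a Match object for 'Quarantine'
print(no_match, at_start.group())

# search() scans the whole string; Match.start() is comparable to str.find().
found = re.search(r"\S+@\S+", line)
print(found.group(), found.start(), line.find("test@abc.com"))

# split() cuts the string wherever the pattern matches
# (here: any run of commas and whitespace).
fields = re.split(r"[,\s]+", "Mar. 14, 23:00")
print(fields)   # ['Mar.', '14', '23:00']
```

Unlike str.find(), which returns -1 when nothing is found, search() returns None, so the result should be checked before calling group() or start().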

Greedy matching

To extract text, we can combine the two functions above with a suitable pattern. Watch out for greedy matching, though. For example:

text = "123 123 123 123"
match = re.search(r"1.*3", text)
print(match)
# <re.Match object; span=(0, 15), match='123 123 123 123'>

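In this example, four different substrings starting at the first character satisfy the pattern ("123", "123 123", "123 123 123", and the whole string), but matching is greedy by default, so the longest possible one is returned. To get the shortest match instead, add ? after the quantifier: *?, +?, ??, and so on.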

text = "123 123 123 123"
match = re.search(r"1.*?3", text)
print(match)
# <re.Match object; span=(0, 3), match='123'>

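In short, greedy matching keeps consuming characters even after part of the string already satisfies the pattern, and stops only when it cannot match any further. Lazy matching does the opposite: it stops as soon as the pattern has matched successfully.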
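
The difference is easiest to see with findall(), which returns every non-overlapping match. This is a small sketch using the same text as above; the second string with angle-bracket tags is a made-up sample, not taken from the original notes.

```python
import re

text = "123 123 123 123"

# Greedy: .* runs to the end of the string, then backs up to the last '3',
# so the whole string is consumed by one match.
greedy = re.findall(r"1.*3", text)
print(greedy)   # ['123 123 123 123']

# Lazy: .*? stops at the first '3' that completes the pattern,
# so findall() finds four short matches.
lazy = re.findall(r"1.*?3", text)
print(lazy)     # ['123', '123', '123', '123']

# The same effect on a hypothetical tagged string.
sample = "<b>bold</b> and <i>italic</i>"
greedy_tags = re.findall(r"<.+>", sample)
lazy_tags = re.findall(r"<.+?>", sample)
print(greedy_tags)   # ['<b>bold</b> and <i>italic</i>']
print(lazy_tags)     # ['<b>', '</b>', '<i>', '</i>']
```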

Regular Expression Quick Guide

A short list of commonly used metacharacters.

Syntax	Description
^	Matches the beginning of the line
$	Matches the end of the line
.	Matches any character (except a newline, by default)
\s	Matches a whitespace character
\S	Matches any non-whitespace character
*	Repeats the preceding element zero or more times
+	Repeats the preceding element one or more times
[]	Matches a single character from the listed set
[^ ]	Matches a single character not in the listed set
[0-9]	Matches any digit
()	Marks where the text to extract (the captured group) begins and ends
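
Several entries in the table can be tried out on the sample line from earlier. The patterns below are illustrative assumptions, not taken from the original notes; they are only meant to show each piece of syntax doing its job.

```python
import re

line = "Quarantine Summary Report - Mar. 14, 23:00 for test@abc.com"

# ^ anchors the pattern at the start of the line, $ at the end.
starts = bool(re.search(r"^Quarantine", line))
ends = bool(re.search(r"\.com$", line))
print(starts, ends)   # True True

# [0-9]+ matches a run of one or more digits.
digits = re.findall(r"[0-9]+", line)
print(digits)         # ['14', '23', '00']

# [^ ]+ matches a run of non-space characters; the parentheses mark the
# part to extract, so findall() returns only what is inside them.
domain = re.findall(r"@([^ ]+)", line)
print(domain)         # ['abc.com']

# With several groups, Match.groups() returns the captured pieces as a tuple.
hour, minute = re.search(r"([0-9]+):([0-9]+)", line).groups()
print(hour, minute)   # 23 00
```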

Source: https://www.cnblogs.com/yukiwu/p/14537825.html