Removing control characters from a string & the difference between match and search in the re module - ChinaUnix Blog

1) Removing control characters from a string.

While extracting links from web pages, I found that some sites mix control characters, such as carriage returns and tabs, into the link text.

Removing them is simple: just use the re module.

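import re

# Strip the control characters \x01-\x1f (tab, CR, LF, ...) from the link text.
# The variable is named `text` so that it does not shadow the built-in `str`.
text = re.sub(r'[\x01-\x1f]', '', text)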

Note: this only covers the ASCII control characters \x01 through \x1f; text in other encodings has to be handled case by case.
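As a rough sketch of what handling non-ASCII text might look like, the snippet below works on already-decoded text and drops every character whose Unicode general category is Cc (control). The function names strip_ascii_control and strip_control_chars are hypothetical, chosen for illustration. Note that the second variant also removes \x00, \x7f and the C1 controls (\x80-\x9f), which the regex above leaves alone.

```python
import re
import unicodedata

# Same character range as the one-liner above, compiled once for reuse.
ASCII_CONTROL = re.compile(r'[\x01-\x1f]')


def strip_ascii_control(text):
    """Remove only the control characters \\x01-\\x1f."""
    return ASCII_CONTROL.sub('', text)


def strip_control_chars(text):
    """Remove every character in Unicode category Cc (all control characters)."""
    return ''.join(ch for ch in text if unicodedata.category(ch) != 'Cc')


link_text = 'Home\r\n\tPage\x85'   # \x85 (NEL) is a control character outside ASCII
print(repr(strip_ascii_control(link_text)))   # 'HomePage\x85'
print(repr(strip_control_chars(link_text)))   # 'HomePage'
```

Which variant fits depends on the data: if tabs or newlines inside link text are meaningful, replacing them with a space instead of deleting them may be the better choice.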

2) re.match vs. re.search

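A colleague said that the following regex failed to extract anything:

# The group name "num" is a placeholder.
re.match(r'(?P<num>\d+)', 'xxxx123xxxx')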

Can you spot the problem? If not, a look at the documentation makes it clear.

match(pattern, string, flags=0)
    Try to apply the pattern at the start of the string, returning
    a match object, or None if no match was found.

search(pattern, string, flags=0)
    Scan through string looking for a match to the pattern, returning
    a match object, or None if no match was found.
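The difference described in these docstrings can be checked directly. Below is a small sketch using the same input string; the group name num is a placeholder chosen for illustration.

```python
import re

pattern = re.compile(r'(?P<num>\d+)')
text = 'xxxx123xxxx'

# match() only tries position 0, where the string has 'x', so it fails.
m = pattern.match(text)
print(m)                  # None

# search() scans forward and finds the digits in the middle of the string.
s = pattern.search(text)
print(s.group('num'))     # 123
print(s.start())          # 4

# match() does succeed when the digits are at the very start.
print(pattern.match('123xxxx').group('num'))   # 123
```

So the fix for the colleague's code is simply to call re.search instead of re.match.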

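I have come to realize that, rather than reading lots of articles, it is better to sit down and read the primary documentation: RFCs, API references, and the like.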