Python Tutorial: Common String-Handling Techniques for Web Pages

2022-06-18

First, some simple, commonly used Python string operations. I'll add more as I come across them.

1. Collapse repeated spaces

s = "hello   hello   hello"
s = ' '.join(s.split())
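The reason this works is that split() with no argument behaves differently from split(' '). A small sketch of the difference, using made-up sample strings:

```python
# Hypothetical sample text mixing spaces, a tab, and a newline.
text = "  hello \t hello\n  hello  "

# split() with no argument splits on runs of any whitespace and drops
# leading/trailing whitespace, so the join leaves single spaces only.
collapsed = " ".join(text.split())
print(collapsed)  # hello hello hello

# split(" ") splits on every single space and keeps empty strings.
pieces = "a  b".split(" ")
print(pieces)  # ['a', '', 'b']
```

Note that this approach also turns tabs and newlines into single spaces, which may or may not be what you want.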

2. Remove all newlines (or any other character or substring)

s = "hello\nhello\nhello hello\n"
print(s)
s = s.replace("\n", "")
print(s)
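When several different characters need to go at once, chaining replace() calls gets tedious. One possible alternative (a sketch, not the only way) is str.translate for deleting characters, or re.sub for collapsing whitespace. The sample string here is invented for illustration:

```python
import re

s = "hello\r\nhello\thello hello\n"

# str.maketrans with a third argument maps each listed character to None,
# so translate() deletes all of them in one pass.
no_ctrl = s.translate(str.maketrans("", "", "\r\n\t"))
print(no_ctrl)  # hellohellohello hello

# re.sub can instead replace each run of whitespace with a single space.
spaced = re.sub(r"\s+", " ", s).strip()
print(spaced)  # hello hello hello hello
```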

3. Find the first occurrence of a substring (returns -1 if not found)

s = "hello\nhello\nhello hello\n"
print(s.find('\n'))
print(s.find('la'))
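Because -1 is truthy in Python, writing `if s.find(x):` is a common mistake; compare against -1 explicitly. find() also accepts a start index, which makes it easy to collect every position. The helper name find_all_positions below is my own, not something from the standard library:

```python
s = "hello\nhello\nhello hello\n"

def find_all_positions(text, sub):
    """Return the start index of every non-overlapping occurrence of sub."""
    if not sub:
        raise ValueError("sub must be a non-empty string")
    positions = []
    start = text.find(sub)
    while start != -1:  # -1 means "not found", so stop there
        positions.append(start)
        # Resume the search just past the match we already recorded.
        start = text.find(sub, start + len(sub))
    return positions

print(find_all_positions(s, "\n"))  # [5, 11, 23]
print(find_all_positions(s, "la"))  # []
```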

4. Find the last occurrence of a substring, searching from the end (returns -1 if not found)

s = "hello\nhello\nhello hello\n"
print(s.rfind('\n'))
print(s.rfind('la'))
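A typical web-related use of rfind() is cutting the last path segment out of a URL. This is just a sketch with plain string slicing; for real URLs with query strings, urllib.parse is probably the safer tool:

```python
url = "https://www.cnblogs.com/djdjdj123/p/16388652.html"

# Everything after the last "/" is the file name.
slash = url.rfind("/")
filename = url[slash + 1:]
print(filename)  # 16388652.html

# Everything before the last "." is the name without its extension.
dot = filename.rfind(".")
stem = filename[:dot] if dot != -1 else filename
print(stem)  # 16388652
```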

5. Convert a string to a list

s = "hello\nhello\nhello hello\n"
print(list(s))
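Keep in mind that list(s) produces one element per character, newlines included. If what you actually want is a list of lines or words, splitlines() and split() are likely a better fit, as this small comparison suggests:

```python
s = "hello\nhello\nhello hello\n"

chars = list(s)         # one element per character, including '\n'
lines = s.splitlines()  # split on line boundaries, no trailing empty item
words = s.split()       # split on any run of whitespace

print(len(chars))  # 24
print(lines)       # ['hello', 'hello', 'hello hello']
print(words)       # ['hello', 'hello', 'hello', 'hello']
```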

6. Find all matching substrings

import re

s = "hello\nhello\nhello hello\n"
print(re.findall('hello', s))  # 'hello' can also be a regular expression
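To illustrate the regular-expression case: when the pattern contains exactly one capture group, re.findall returns the group contents rather than the whole match. The HTML snippet below is made up, and a regex like this is fragile on real pages, so treat it as a quick-and-dirty sketch:

```python
import re

# Hypothetical HTML fragment for illustration.
html = '<a href="https://baike.baidu.com">Baike</a> <a href="/item/Python">Python</a>'

# The parentheses capture only the attribute value between the quotes.
links = re.findall(r'href="([^"]*)"', html)
print(links)  # ['https://baike.baidu.com', '/item/Python']
```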

Next, some more advanced techniques for processing web page strings, combining the requests, BeautifulSoup, and re modules:

1. Fetch a URL with requests and write the content to a file unchanged

import requests

r = requests.get('https://baike.baidu.com')
with open('test.html', 'wb') as fd:
    for chunk in r.iter_content(100):
        fd.write(chunk)
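For larger pages it may be worth streaming the response and failing early on HTTP errors. The following is only a sketch: the function name download, its get parameter (there so the helper can be tried without a network), the 10-second timeout, and the 8192-byte chunk size are all my own assumptions.

```python
def download(url, path, get=None, chunk_size=8192):
    """Stream the body of url into the file at path, unchanged."""
    if get is None:
        import requests  # imported lazily so the helper can be tested offline
        get = requests.get
    # stream=True avoids loading the whole body into memory at once.
    response = get(url, stream=True, timeout=10)
    # Raise an exception on 4xx/5xx instead of saving an error page.
    response.raise_for_status()
    with open(path, "wb") as fd:
        for chunk in response.iter_content(chunk_size):
            fd.write(chunk)
    return path
```

Calling `download('https://baike.baidu.com', 'test.html')` should then produce the same file as the snippet above.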

2. Read the entire contents of a file into a string

# encoding: utf-8

with open('test.html', 'r', encoding='utf-8') as f:
    content = f.read()  # read the whole file into a single string
# content = content.replace('\n', '')  # uncomment to strip newlines
print(content)

3. Load the page string into BeautifulSoup for processing

from bs4 import BeautifulSoup

soup = BeautifulSoup(content, 'html.parser')
print(soup.prettify())

4. Once the string is loaded into BeautifulSoup, you can do whatever you like with it. For example, extract all <a> tags:

soup = BeautifulSoup(content, 'html.parser')
print(soup.find_all('a'))

Or extract all <a> tags and <b> tags:

soup = BeautifulSoup(content, 'html.parser')
print(soup.find_all(['a', 'b']))
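To try find_all without downloading anything, here is a small self-contained sketch on an invented HTML fragment. It pulls the text and href out of each link, and shows that passing a list returns the matching tags in document order:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment for illustration.
html = """
<p>See <a href="https://baike.baidu.com">Baike</a> and
<b>bold text</b>, or <a href="/item/Python">Python</a>.</p>
"""
soup = BeautifulSoup(html, "html.parser")

# get_text() returns the visible text; get("href") returns None if missing.
links = [(a.get_text(), a.get("href")) for a in soup.find_all("a")]
print(links)  # [('Baike', 'https://baike.baidu.com'), ('Python', '/item/Python')]

# A list of names matches any of them, in the order they appear.
names = [tag.name for tag in soup.find_all(["a", "b"])]
print(names)  # ['a', 'b', 'a']
```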

These are all BeautifulSoup features.

5. Split a string on multiple delimiters

import re
re.split('; |, ', text)  # text is the string to split

>>> a = 'Beautiful, is; better*than\nugly'
>>> import re
>>> re.split(r'; |, |\*|\n', a)
['Beautiful', 'is', 'better', 'than', 'ugly']
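When the delimiters are single characters, a character class can be shorter than a long alternation. The sketch below also drops the empty strings that appear when delimiters are adjacent or trailing; the second sample string is invented:

```python
import re

a = 'Beautiful, is; better*than\nugly'

# [;,*\n] matches any one delimiter; \s* swallows whitespace after it.
parts = re.split(r'[;,*\n]\s*', a)
print(parts)  # ['Beautiful', 'is', 'better', 'than', 'ugly']

# Adjacent or trailing delimiters produce empty strings, so filter them out.
messy = 'a,, b;'
cleaned = [p for p in re.split(r'[;,]\s*', messy) if p]
print(cleaned)  # ['a', 'b']
```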

Source: https://www.cnblogs.com/djdjdj123/p/16388652.html