Preface 😋
Hello, everyone~ This is Xixi, who loves looking at pretty girls
Development environment:
- Python 3.8
- PyCharm
Modules used:
- requests
- parsel
- csv
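Basic workflow: how to go about implementing the program
I. Data source analysis:
- Work out where the data we want lives: which site do we request to get it?
- Capture the traffic with the browser developer tools to find where that data is:
  - Press F12, or right-click and choose Inspect, then select Network
  - Search by keyword to find the relevant request ---> check its headers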
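Headers copied from the developer tools arrive as plain `Name: value` lines, while requests expects a dict. Below is a minimal sketch of one way to convert them; the helper name `headers_from_text` is made up for this illustration, and the sample text only reuses header names and values that appear in this article.

```python
def headers_from_text(raw: str) -> dict:
    """Turn 'Name: value' lines copied from the Headers tab into a dict."""
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines, lines without a colon, and HTTP/2 pseudo-headers
        # such as ":authority", which requests should not be given.
        if not line or ":" not in line or line.startswith(":"):
            continue
        # Split on the first colon only, because values (URLs) contain colons too.
        name, value = line.split(":", 1)
        headers[name.strip()] = value.strip()
    return headers


raw_headers = """
Host: www.19lou.com
Referer: https://www.19lou.com/r/1/19lnsxq-4.html
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64)
"""
headers = headers_from_text(raw_headers)
print(headers)
```

The resulting dict can be passed as `headers=headers` to `requests.get`.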
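II. Implementation steps:
1. Send the request: imitate a browser and send a request to the url
2. Get the data: get the response data the server returns ----> corresponds to Response in the developer tools
3. Parse the data: extract the content we want (the basic profile information)
4. Save the data: save it to a table / images can be saved to a folder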
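The save step can be tried on its own before any request is sent. The sketch below writes one row with `csv.DictWriter` into an in-memory buffer instead of a real file; the English field names and the sample row are invented for illustration and are not data from the site.

```python
import csv
import io

# Hypothetical column names, shortened for the example.
FIELDS = ["title", "lucky_number", "sex", "age"]


def save_rows(rows, out):
    """Write a header line followed by one CSV line per dict in rows."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# An in-memory text buffer stands in for open('file.csv', 'a', newline='').
buffer = io.StringIO()
sample = {"title": "demo post", "lucky_number": "12345", "sex": "female", "age": "28"}
save_rows([sample], buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

When writing to a real file, open it with `newline=''` as the full script does, otherwise blank lines can appear between rows on Windows.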
Code
# Import the request module ---> third-party, install from cmd with: pip install requests
import requests
# Import the parsing module ---> third-party, install from cmd with: pip install parsel
import parsel
# Import the csv module ---> built in, no installation needed
import csv
# Import os to create the image folder ---> built in
import os

# Create the output file
f = open('对象_1.csv', mode='a', encoding='utf-8', newline='')
csv_writer = csv.DictWriter(f, fieldnames=[
    '标题',      # title
    '幸运号',    # lucky number
    '性别',      # sex
    '年龄',      # age
    '星座',      # constellation
    '年薪',      # annual salary
    '学历',      # education
    '身高',      # height
    '爱情宣言',  # love declaration
    '照片',      # photo
    '详情页',    # detail page
])
# Write the header row
csv_writer.writeheader()
# Make sure the image folder exists before saving pictures into it
os.makedirs('img', exist_ok=True)
# URL of the list page
link = 'https://www.19lou.com/r/1/19lnsxq-3.html'
# Headers that imitate a browser
headers = {
    'Cookie': '_Z3nY0d4C_=37XgPK9h; _DM_SID_=abfbcfb2fade7d35ee39c33b5eef7e13; screen=2543; pm_count=%7B%7D; dayCount=%5B%5D; cuid=Hd93N5CDQEk5bODgyK4cOrzXujbQHL84; JSESSIONID=370A8DC7AD014A912504354C3491C5F5; f39big=ip53; f9big=u87; _DM_S_=dc952385e06e9ac73264931ecd4bd0bc; Hm_lvt_5185a335802fb72073721d2bb161cd94=1659515619,1659592454,1659611492; fr_adv=bbs_huatan_ck; fr_adv_last=merry_thread_pc; _dm_userinfo=%7B%22uid%22%3A0%2C%22stage%22%3A%22%22%2C%22city%22%3A%22%E6%B9%96%E5%8D%97%3A%E9%95%BF%E6%B2%99%22%2C%22ip%22%3A%22175.0.62.249%22%2C%22sex%22%3A%221%22%2C%22frontdomain%22%3A%22www.19lou.com%22%2C%22category%22%3A%22%E6%83%85%E6%84%9F%2C%E5%A9%9A%E5%BA%86%2C%E6%97%B6%E5%B0%9A%22%7D; _dm_tagnames=%5B%7B%22k%22%3A%2219%E6%A5%BC%E5%A5%B3%E7%94%9F%E7%9B%B8%E4%BA%B2%22%2C%22c%22%3A29%7D%2C%7B%22k%22%3A%22%E5%A5%B3%E7%94%9F%E5%BE%81%E5%8F%8B%22%2C%22c%22%3A31%7D%2C%7B%22k%22%3A%22%E7%A1%95%E5%A3%AB%22%2C%22c%22%3A2%7D%2C%7B%22k%22%3A%22%E5%A4%A9%E7%A7%A4%E5%BA%A7%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E5%A5%B3%E7%94%9F%E5%BE%81%E5%8F%8B%22%2C%22c%22%3A21%7D%2C%7B%22k%22%3A%22%E7%9B%B8%E4%BA%B2%E8%AE%BA%E5%9D%9B%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E6%9D%AD%E5%B7%9E%E7%9B%B8%E4%BA%B2%E7%BD%91%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E6%9D%AD%E5%B7%9E%E5%BE%81%E5%A9%9A%E7%BD%91%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E5%A4%A9%E8%9D%8E%E5%BA%A7%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%221986%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E6%9C%AC%E7%A7%91%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E6%81%8B%E7%88%B1%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E5%BE%81%E5%8F%8B%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E7%A6%BB%E5%BC%82%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E6%81%8B%E7%88%B1%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E5%BE%81%E5%8F%8B%22%2C%22c%22%3A1%7D%5D; Hm_lpvt_5185a335802fb72073721d2bb161cd94=1659619705',
    'Host': 'www.19lou.com',
    'Referer': 'https://www.19lou.com/r/1/19lnsxq-4.html',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.0.0 Safari/537.36',
}
# Send the request
response_1 = requests.get(url=link, headers=headers)
# Get the data
print(response_1.text)
# Parse the data
selector_1 = parsel.Selector(response_1.text)
# Extract content with css: the post titles
title_list = selector_1.css('.item-hd h3::text').getall()
# Extract the detail-page links
href = selector_1.css('.item-bd .cont a::attr(href)').getall()
# Loop over each title together with its link
for title, index in zip(title_list, href):
    # Replace http with https
    url = index.replace('http:', 'https:')
    """
    1. Send the request: imitate a browser and send a request to the url
        - How does Python code imitate a browser?
            Through the request headers, a dict that we build as complete key-value pairs
        - How to convert copied headers quickly
            Ctrl + R opens the replace box; with regex enabled, find
                (.*?): (.*)
            and replace with
                '$1': '$2',
        - <Response [200]> means the request succeeded, but that does not guarantee you got the data...
        - response = requests.get(url=url, headers=headers)
            response: a variable name we chose ourselves
            requests.get(): calls the get method of the requests module
            url=url: the left url is the parameter name of get, the right url is the value we pass in
    """
    # Example of a detail-page url
    # url = 'https://www.19lou.com/forum-164-thread-83331619167048422-1-1.html'
    # Request headers that imitate a browser
    headers = {
        'Cookie': '_Z3nY0d4C_=37XgPK9h; _DM_SID_=abfbcfb2fade7d35ee39c33b5eef7e13; screen=2543; pm_count=%7B%7D; dayCount=%5B%5D; cuid=Hd93N5CDQEk5bODgyK4cOrzXujbQHL84; JSESSIONID=370A8DC7AD014A912504354C3491C5F5; f39big=ip53; f9big=u87; _DM_S_=dc952385e06e9ac73264931ecd4bd0bc; Hm_lvt_5185a335802fb72073721d2bb161cd94=1659515619,1659592454,1659611492; fr_adv=bbs_huatan_ck; _dm_tagnames=%5B%7B%22k%22%3A%22%E5%A5%B3%E7%94%9F%E5%BE%81%E5%8F%8B%22%2C%22c%22%3A30%7D%2C%7B%22k%22%3A%2219%E6%A5%BC%E5%A5%B3%E7%94%9F%E7%9B%B8%E4%BA%B2%22%2C%22c%22%3A27%7D%2C%7B%22k%22%3A%22%E7%A1%95%E5%A3%AB%22%2C%22c%22%3A2%7D%2C%7B%22k%22%3A%22%E5%A4%A9%E7%A7%A4%E5%BA%A7%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E5%A5%B3%E7%94%9F%E5%BE%81%E5%8F%8B%22%2C%22c%22%3A21%7D%2C%7B%22k%22%3A%22%E7%9B%B8%E4%BA%B2%E8%AE%BA%E5%9D%9B%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E6%9D%AD%E5%B7%9E%E7%9B%B8%E4%BA%B2%E7%BD%91%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E6%9D%AD%E5%B7%9E%E5%BE%81%E5%A9%9A%E7%BD%91%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E5%A4%A9%E8%9D%8E%E5%BA%A7%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%221986%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E6%9C%AC%E7%A7%91%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E6%81%8B%E7%88%B1%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E5%BE%81%E5%8F%8B%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E7%A6%BB%E5%BC%82%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E6%81%8B%E7%88%B1%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E5%BE%81%E5%8F%8B%22%2C%22c%22%3A1%7D%5D; _dm_userinfo=%7B%22uid%22%3A0%2C%22stage%22%3A%22%22%2C%22city%22%3A%22%E6%B9%96%E5%8D%97%3A%E9%95%BF%E6%B2%99%22%2C%22ip%22%3A%22175.0.62.249%22%2C%22sex%22%3A%221%22%2C%22frontdomain%22%3A%22www.19lou.com%22%2C%22category%22%3A%22%E6%83%85%E6%84%9F%2C%E5%A9%9A%E5%BA%86%2C%E6%97%B6%E5%B0%9A%22%7D; Hm_lpvt_5185a335802fb72073721d2bb161cd94=1659615006; fr_adv_last=merry_thread_pc',
        'Host': 'www.19lou.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.0.0 Safari/537.36',
    }
    # Send the request --> <Response [200]> means the request succeeded
    # requests.get sends a GET request to the url, carrying the headers as a disguise;
    # the returned data is stored in the variable response
    response = requests.get(url=url, headers=headers)
    # 2. Get the data: the response the server returns ----> corresponds to Response in the developer tools
    print(response.text)
    """
    3. Parse the data: extract the content we want (the basic profile information)
        bs4, lxml, parsel... are all parsing modules
        - Parsing methods: learn them all; none is the best, only the best fit for the job
            re: extracts directly from string data
            css: extracts content by tag attributes
            xpath: extracts content by tag nodes
        Today we use css selectors: extract content by tag attributes
        Whichever you pick, the data must first be converted into a parsable object,
        because response.text is a plain string
    """
    # Convert the data type: <Selector xpath=None data='<html>\n<head>\n <meta charset="gb23...'>
    selector = parsel.Selector(response.text)
    # Extract data with css
    # replace() substitutes part of a string
    love_num = selector.css('.love-blind-female .love-blind-info p::text').get()
    if love_num:
        love_num = love_num.replace('爱情幸运号:', '')
    # split() breaks a string into a list
    info_list = selector.css('.love-blind-female .love-blind-info .mt10::text').get().split(',')
    # Pick values by list index
    sex = info_list[0]  # sex
    age = info_list[1]  # age
    constellation = info_list[2]  # constellation
    money = info_list[3]  # annual salary
    edu = info_list[4]  # education
    height = info_list[5]  # height
    love_txt = selector.css('.love-blind-female .love-blind-info .love-blind-txt::text').get()
    img_url = selector.css('.view-cont .thread-cont img::attr(src)').get().replace('http:', 'https:')
    # Ctrl + D duplicates the current line in PyCharm, handy for building this dict
    dit = {
        '标题': title,
        '幸运号': love_num,
        '性别': sex,
        '年龄': age,
        '星座': constellation,
        '年薪': money,
        '学历': edu,
        '身高': height,
        '爱情宣言': love_txt,
        '照片': img_url,
        '详情页': url,
    }
    csv_writer.writerow(dit)
    print(img_url)
    # Download the image data
    img_content = requests.get(url=img_url).content
    # Use a separate name here so the csv file object f is not shadowed
    with open(os.path.join('img', title + '.jpg'), mode='wb') as img_file:
        img_file.write(img_content)
    print(dit)

# Close the csv file once every row has been written
f.close()
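The script above indexes straight into the split profile line and uses the post title as a file name, so a profile with a missing field or a title containing a character such as `/` or `?` would stop the run. Here is a hedged sketch of two small helpers that could guard those spots; `parse_info` and `safe_filename` are names invented for this example, the English field names are illustrative, and accepting both ASCII and full-width commas is an assumption about how the page separates the fields.

```python
import re

# Illustrative names for the six values, in the order the script reads them.
FIELD_NAMES = ["sex", "age", "constellation", "salary", "education", "height"]


def parse_info(text):
    """Split the comma-separated profile line into named fields.

    Missing or empty input gives empty strings instead of an IndexError.
    """
    if not text:
        parts = []
    else:
        # Assumption: the separator may be an ASCII or a full-width comma.
        parts = [part.strip() for part in re.split(r"[,,]", text)]
    # Pad short lists so every field name gets a value.
    parts += [""] * (len(FIELD_NAMES) - len(parts))
    return dict(zip(FIELD_NAMES, parts))


def safe_filename(title, ext=".jpg"):
    """Replace characters that Windows does not allow in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n]', "_", title).strip()
    return (cleaned or "untitled") + ext


# Made-up sample values, not data from the site.
info = parse_info("female,28,Libra,100k,Bachelor,165cm")
print(info)
print(safe_filename("a/b:c?"))
```

In the loop, `parse_info(selector.css(...).get())` could stand in for the direct `.get().split(',')` call, and `safe_filename(title)` for `title + '.jpg'`.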
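Closing words 💝
Thanks for reading my article~ This flight ends here 🛬
I hope this article was helpful 🎉 and that you picked up a little knowledge~
Even the stars in hiding 🍥 are working hard to shine, so keep going too (let's work hard together).
Finally, the blogger would like to ask for your triple (like, comment, favorite); it costs nothing, so why not~
If you don't know what to comment, even typing 6666 is encouragement for the blogger 💞 Thank you 💐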
Source: https://www.cnblogs.com/Qqun261823976/p/16562373.html