Python Web Scraping in Practice: Scraping Mobile Phone Product Data with the requests and openpyxl Modules (Source Code Included) - 百木园

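Introduction

Today's post shows how to scrape mobile phone product data with Python. I'm sharing the code for anyone who needs it, along with a few lessons learned.

First, before scraping, the crawler should look as much like a browser as possible so that it is not identified as a bot. The basic step is to add request headers. Plain-text data like this attracts many scrapers, though, so we also need to consider rotating proxy IPs and randomizing the request headers when scraping the phone data.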
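
The rotation idea above can be sketched as follows. This is a minimal illustration, not the author's code: the `USER_AGENTS` and `PROXIES` pools and the `build_request_kwargs` helper are hypothetical names, and the proxy addresses are placeholders that must be replaced with working ones.

```python
import random

# Hypothetical pools: swap in real browser strings and working proxies.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3947.100 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Safari/605.1.15",
]
PROXIES = [
    "http://127.0.0.1:8001",  # placeholder address
    "http://127.0.0.1:8002",  # placeholder address
]


def build_request_kwargs(rng=random):
    """Return keyword arguments for requests.get with a random UA and proxy."""
    proxy = rng.choice(PROXIES)
    return {
        "headers": {
            "user-agent": rng.choice(USER_AGENTS),
            "referer": "https://www.jd.com/",
        },
        # requests expects one proxy URL per scheme
        "proxies": {"http": proxy, "https": proxy},
        "timeout": 10,
    }


# Usage (requires the requests package):
#     res = requests.get(url, **build_request_kwargs())
kwargs = build_request_kwargs()
```

Calling the helper once per request means each request may go out with a different user-agent and proxy, which is the point of the rotation.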

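Before writing any crawler code, the first and most important step is always to analyze the target web page.

Our analysis showed that crawling was fairly slow, so we can also speed it up by disabling images, JavaScript, and so on in Google Chrome.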
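
Disabling images and JavaScript only matters when pages are loaded through a real browser, for example with Selenium. That is an assumption here: the code in this article uses requests, which never downloads images or runs JavaScript. Under that assumption, a minimal sketch of the Chrome preferences could look like this; `chrome_speed_prefs` is a hypothetical helper name.

```python
def chrome_speed_prefs(block_images=True, block_javascript=True):
    """Build a Chrome "prefs" dict; the value 2 means "block" for a content type."""
    prefs = {}
    if block_images:
        prefs["profile.managed_default_content_settings.images"] = 2
    if block_javascript:
        prefs["profile.managed_default_content_settings.javascript"] = 2
    return prefs


# Usage with Selenium (not used elsewhere in this article):
#     from selenium import webdriver
#     options = webdriver.ChromeOptions()
#     options.add_experimental_option("prefs", chrome_speed_prefs())
#     driver = webdriver.Chrome(options=options)
prefs = chrome_speed_prefs()
```

Blocking JavaScript can also remove data that the page renders on the client, so it is worth checking that the fields you need are still present.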

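Development Tools

Python version: 3.6

Required modules:

requests module

json module (part of the standard library)

lxml module

openpyxl module

Environment Setup

Install Python and add it to your PATH environment variable, then install the required third-party modules with pip.

The complete code and the Excel file are available on request: leave a comment.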

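Approach

Open the page we want to scrape in the browser.
Press F12 to open the developer tools and find where the phone product data we want is located.
Here, the page data is all we need.

Source code structure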

Code Implementation

Request headers to avoid anti-scraping measures

# Note: the full header set is optional; sending only user-agent is also enough to scrape the data
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3947.100 Safari/537.36',
    'cookie': 'your Cookie',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'zh-CN,zh;q=0.9',
    'upgrade-insecure-requests': '1',
    'referer': 'https://www.jd.com/',
}

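Get the product comment count

import json

import openpyxl
import requests
from lxml import etree

# Create the output workbook and insert a new sheet at the first position
outwb = openpyxl.Workbook()
outws = outwb.create_sheet(index=0)

# Header row
outws.cell(row=1, column=1, value="index")
outws.cell(row=1, column=2, value="title")
outws.cell(row=1, column=3, value="price")
outws.cell(row=1, column=4, value="CommentCount")

# Next row to write; row 1 holds the header
count = 2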
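
A workbook created this way lives only in memory until `save` is called, and the snippets in this article do not show that step. The following is a minimal sketch of the write-then-save cycle under stated assumptions: the file name `phones.xlsx`, the sample row, and the helper names `write_products` and `read_products` are all hypothetical.

```python
import os
import tempfile

import openpyxl

HEADER = ["index", "title", "price", "CommentCount"]


def write_products(path, rows):
    """Write the header plus one row per product, then save to disk."""
    wb = openpyxl.Workbook()
    ws = wb.create_sheet(index=0)
    ws.append(HEADER)
    for row in rows:
        ws.append(row)
    wb.save(path)  # nothing is written to disk before this call


def read_products(path):
    """Load the first sheet back as a list of rows."""
    ws = openpyxl.load_workbook(path).worksheets[0]
    return [list(row) for row in ws.iter_rows(values_only=True)]


# Demo in a temporary directory so no file is left behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "phones.xlsx")  # hypothetical file name
    write_products(path, [["1", "Example phone", "1999.00", "20000"]])
    saved = read_products(path)
```

In the article's own script the equivalent step would be a call to `outwb.save(...)` after the crawl finishes.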

Get the comment count by product id

def commentcount(product_id):
    url = "https://club.jd.com/comment/productCommentSummaries.action?referenceIds=" + str(product_id) + "&callback=jQuery8827474&_=1615298058081"
    res = requests.get(url, headers=headers)
    res.encoding = 'gbk'
    # The response is JSONP; strip the callback wrapper to get plain JSON
    text = res.text.replace("jQuery8827474(", "").replace(");", "")
    text = json.loads(text)
    comment_count = text['CommentsCount'][0]['CommentCountStr']

    comment_count = comment_count.replace("+", "")
    # Handle "万" (ten thousand); float() also accepts values such as "2.9"
    if "万" in comment_count:
        comment_count = comment_count.replace("万", "")
        comment_count = str(round(float(comment_count) * 10000))

    return comment_count
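
The two parsing steps in this function (unwrapping the JSONP callback and converting a count such as "2.9万+" to a number) can be tried without any network access. This is a small sketch; the `sample` response is made up for illustration and only mirrors the field names used above, and `strip_jsonp` and `parse_count` are hypothetical helper names.

```python
import json
import re


def strip_jsonp(text):
    """Return the parsed JSON payload inside a callback(...); wrapper."""
    match = re.search(r"^[^(]*\((.*)\)\s*;?\s*$", text, re.S)
    if not match:
        raise ValueError("not a JSONP response")
    return json.loads(match.group(1))


def parse_count(count_str):
    """Convert strings like '500+', '50万+' or '2.9万+' to an int."""
    cleaned = count_str.replace("+", "")
    if "万" in cleaned:
        # 万 means ten thousand; round() absorbs float noise such as 29000.000000000004
        return round(float(cleaned.replace("万", "")) * 10000)
    return int(cleaned)


# Made-up response body in the same shape as the fields read above.
sample = 'jQuery8827474({"CommentsCount":[{"CommentCountStr":"2.9万+"}]});'
payload = strip_jsonp(sample)
count = parse_count(payload["CommentsCount"][0]["CommentCountStr"])
```

Matching the wrapper with a regular expression avoids one weakness of chained `replace` calls, which would also delete any `);` that happened to appear inside the payload.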

Get the product data on each page

def getlist(url):
    global count
    # url = "https://search.jd.com/search?keyword=笔记本&wq=笔记本&ev=exbrand_联想%5E&page=9&s=241&click=1"
    res = requests.get(url, headers=headers)
    res.encoding = 'utf-8'
    text = res.text

    # Each <li> under #J_goodsList is one product
    selector = etree.HTML(text)
    items = selector.xpath('//*[@id="J_goodsList"]/ul/li')

    for item in items:
        title = item.xpath('.//div[@class="p-name p-name-type-2"]/a/em/text()')[0]
        price = item.xpath('.//div[@class="p-price"]/strong/i/text()')[0]
        # The element id has the form "J_comment_<product id>"
        product_id = item.xpath('.//div[@class="p-commit"]/strong/a/@id')[0].replace("J_comment_", "")

        comment_count = commentcount(product_id)
        # print(title)
        # print(price)
        # print(comment_count)

        # Write one row per product; the index column is the row number minus the header row
        outws.cell(row=count, column=1, value=str(count - 1))
        outws.cell(row=count, column=2, value=str(title))
        outws.cell(row=count, column=3, value=str(price))
        outws.cell(row=count, column=4, value=str(comment_count))

        count = count + 1
        # print("-----")
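
Each `xpath(...)[0]` above raises `IndexError` when a list item lacks the expected node, which would stop the whole crawl. One possible guard, shown here as a sketch with a hypothetical helper name `first`, returns a default instead. XPath calls in lxml return ordinary lists, so the helper can be tried with plain lists.

```python
def first(nodes, default=""):
    """Return the first XPath result with whitespace stripped, or default if empty."""
    return nodes[0].strip() if nodes else default


# Usage inside the loop above:
#     title = first(item.xpath('.//div[@class="p-name p-name-type-2"]/a/em/text()'))
#     price = first(item.xpath('.//div[@class="p-price"]/strong/i/text()'))
found = first(["  Example phone \n"])
missing = first([])
```

Rows with an empty title or price can then be skipped or logged rather than crashing the run.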

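Iterate over each page

def getpage():
    page = 1
    s = 1
    for i in range(1, 6):
        print("page=" + str(page) + ",s=" + str(s))
        # enc, pvid and page are assumed parameter names
        url = "https://search.jd.com/Search?keyword=手机&enc=utf-8&wq=手机&pvid=56b2bc7c47db4861986201bb72c1b281&page=" + str(page) + "&s=" + str(s) + "&click=1"
        getlist(url)
        # Each step advances page by 2 and the item offset s by 60
        page = page + 2
        s = s + 60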
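
The loop above yields the pairs (page, s) = (1, 1), (3, 61), (5, 121), (7, 181), (9, 241). As an alternative to string concatenation, the query string can be built with the standard library, which also percent-encodes the Chinese keyword. This is a sketch under assumptions: `search_urls` is a hypothetical helper, and it keeps only the `keyword`, `wq`, `page`, `s` and `click` parameters.

```python
from urllib.parse import urlencode


def search_urls(keyword, pages=5):
    """Build one search URL per result page, stepping page by 2 and s by 60."""
    base = "https://search.jd.com/Search"
    urls = []
    for n in range(pages):
        params = {
            "keyword": keyword,
            "wq": keyword,
            "page": 1 + 2 * n,
            "s": 1 + 60 * n,
            "click": 1,
        }
        # urlencode percent-encodes non-ASCII text as UTF-8
        urls.append(base + "?" + urlencode(params))
    return urls


urls = search_urls("手机")
```

Each URL in the list could then be passed to `getlist` in turn.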

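Results

Data results

Conclusion

That's all for today's post. If you're interested, give it a try.

If you have questions about this article or about Python in general, leave a comment or send me a private message.

If you found this post useful, feel free to follow me or give the article a like (/≧▽≦)/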

Source: https://www.cnblogs.com/guzichuan/p/16975647.html