Hands-On Python: Scraping JD Product Listings by Keyword Search

谢 2026-01-09 1366

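I. Implementation Approach

URL construction: JD's search URL has the form https://search.jd.com/Search?keyword=<keyword>&page=<page>. The page parameter takes odd values, so 1, 3, 5 correspond to result pages 1, 2, 3.

Anti-scraping handling: mimic browser requests (set User-Agent and Cookie) and throttle the request rate (add a delay between requests).

Page parsing: the core product-list information (title, price, link, shop) is embedded in the HTML and is parsed with BeautifulSoup.

Data storage: save the scraped records to a CSV file for later inspection.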
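The URL rule above can be sketched in a few lines. This is a minimal illustration that uses only the standard library; the function name build_search_url is hypothetical, and the odd-number page mapping is taken from the description above rather than from any official JD documentation.

```python
from urllib.parse import urlencode

BASE_URL = "https://search.jd.com/Search"  # search endpoint named in the text


def build_search_url(keyword, page):
    """Return the search URL for a 1-based result page number."""
    if page < 1:
        raise ValueError("page must be >= 1")
    jd_page = 2 * page - 1  # result pages 1, 2, 3 -> page=1, 3, 5
    # urlencode percent-encodes the keyword as UTF-8, so non-ASCII
    # keywords and characters such as "&" cannot break the query string.
    query = urlencode({"keyword": keyword, "page": jd_page})
    return f"{BASE_URL}?{query}"
```

For example, build_search_url("python", 2) returns https://search.jd.com/Search?keyword=python&page=3.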

II. Code Implementation

1. Install dependencies

First, run the following command in a terminal to install the required libraries:

bash

pip install requests beautifulsoup4 fake-useragent pandas

2. Complete scraper code

python

import random
import time
from urllib.parse import urlencode, urljoin

import pandas as pd
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent


class JdSpider:
    BASE_URL = "https://search.jd.com/Search"

    def __init__(self, keyword, page_num=3):
        self.keyword = keyword  # search keyword
        self.page_num = page_num  # number of result pages to scrape
        self.headers = self._get_headers()  # request headers
        self.data_list = []  # scraped product records

    def _get_headers(self):
        """Build request headers that mimic a browser."""
        # NOTE: replace this placeholder with your own Cookie
        # (copied from the browser's developer tools).
        cookie = "YOUR_JD_COOKIE"
        ua = UserAgent()
        headers = {
            "User-Agent": ua.random,  # randomly chosen User-Agent
            "Cookie": cookie,
            "Referer": "https://www.jd.com/",
            "Accept-Language": "zh-CN,zh;q=0.9",
            "Connection": "keep-alive",
        }
        return headers

    def _get_page_url(self, page):
        """Build the search URL for the given result page."""
        # JD's page parameter: page 1 -> 1, page 2 -> 3, page 3 -> 5, and so on.
        jd_page = page * 2 - 1
        # urlencode percent-encodes the keyword (spaces, "&", non-ASCII text).
        query = urlencode({"keyword": self.keyword, "page": jd_page, "enc": "utf8"})
        return f"{self.BASE_URL}?{query}"

    def _parse_page(self, html):
        """Parse one result page and extract product information."""
        soup = BeautifulSoup(html, "html.parser")
        # Each product is an <li class="gl-item"> element.
        items = soup.find_all("li", class_="gl-item")

        for item in items:
            try:
                # 1. Price: the <i> inside div.p-price
                price_elem = item.select_one("div.p-price i")
                price = price_elem.text.strip() if price_elem else "no price"

                # 2. Title: the <em> inside div.p-name
                title_elem = item.select_one("div.p-name em")
                title = title_elem.text.strip() if title_elem else "no title"

                # 3. Link: urljoin resolves protocol-relative hrefs (//item.jd.com/...)
                link_elem = item.select_one("a.J_ClickStat")
                href = link_elem.get("href") if link_elem else None
                link = urljoin(self.BASE_URL, href) if href else "no link"

                # 4. Shop name: the <a> inside div.p-shop
                shop_elem = item.select_one("div.p-shop a")
                shop = shop_elem.text.strip() if shop_elem else "no shop"

                self.data_list.append({
                    "title": title,
                    "price": price,
                    "link": link,
                    "shop": shop,
                })
            except Exception as e:
                # One malformed item must not abort the rest of the page.
                print(f"Failed to parse a product item: {e}")

    def run(self):
        """Main crawl loop."""
        print(f"Scraping JD products for keyword [{self.keyword}], {self.page_num} page(s)...")

        for page in range(1, self.page_num + 1):
            try:
                # 1. Build the URL
                url = self._get_page_url(page)
                # 2. Send the request; treat HTTP 4xx/5xx responses as failures
                response = requests.get(url, headers=self.headers, timeout=10)
                response.raise_for_status()
                response.encoding = "utf-8"
                # 3. Parse the page
                self._parse_page(response.text)
                # 4. Report progress
                print(f"Page {page} done, {len(self.data_list)} products collected so far")
            except Exception as e:
                print(f"Page {page} failed: {e}")

            # 5. Random delay (2-5 seconds) between pages, also after a failure,
            #    to avoid high-frequency requests that may get blocked
            if page < self.page_num:
                time.sleep(random.uniform(2, 5))

        # Save the data to CSV
        if self.data_list:
            filename = f"jd_{self.keyword}_products.csv"
            df = pd.DataFrame(self.data_list)
            # utf-8-sig writes a BOM so that Excel detects the encoding
            df.to_csv(filename, index=False, encoding="utf-8-sig")
            print(f"Data saved to {filename}")
        else:
            print("No product data was scraped!")


if __name__ == "__main__":
    # Example: scrape the first 3 pages for the keyword "Python编程" (Python programming)
    spider = JdSpider(keyword="Python编程", page_num=3)
    spider.run()

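III. Key Parts Explained

Obtaining the Cookie:

Open JD in a browser and log in, press F12 to open the developer tools → switch to the Network tab → refresh the search page → select the first request (Search?keyword=...) → copy the Cookie value from Request Headers and use it to replace the placeholder cookie string in the code.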
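Pasting the Cookie into the source file makes it easy to commit or share by accident. One possible alternative, sketched here as an assumption rather than as part of the original script, is to read it from an environment variable. The variable name JD_COOKIE and the function load_cookie are hypothetical.

```python
import os


def load_cookie(env_var="JD_COOKIE"):
    """Read the Cookie header value from an environment variable.

    JD_COOKIE is a hypothetical variable name chosen for this sketch.
    """
    cookie = os.environ.get(env_var, "").strip()
    if not cookie:
        raise RuntimeError(f"Set the {env_var} environment variable to your Cookie string")
    # Header values are sent as Latin-1, so a leftover non-ASCII
    # placeholder would only fail later, at request time.
    if not cookie.isascii():
        raise ValueError("Cookie contains non-ASCII characters; was the placeholder replaced?")
    return cookie
```

With this in place, the cookie assignment in the header-building method could become cookie = load_cookie().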

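Request header construction:

fake-useragent picks a random User-Agent (chosen once, when the spider object is created), so that a fixed UA is not flagged as a crawler;

fields such as Referer and Cookie are added to mimic the request characteristics of a real user.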
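If you would rather not depend on fake-useragent, the same idea can be sketched with a small hand-maintained pool. The User-Agent strings below are illustrative samples that should be replaced with current browser values, and build_headers is a hypothetical helper, not part of the original class.

```python
import random

# Illustrative sample values; replace with up-to-date browser User-Agent strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]


def build_headers(cookie, user_agents=USER_AGENTS):
    """Return browser-like request headers with a randomly chosen User-Agent."""
    return {
        "User-Agent": random.choice(user_agents),
        "Cookie": cookie,
        "Referer": "https://www.jd.com/",
        "Accept-Language": "zh-CN,zh;q=0.9",
    }
```

Calling build_headers once per request rotates the User-Agent between requests, whereas building the headers once per spider keeps a single value for the whole run. Which behaviour looks more natural to the server is a judgment call.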

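Page parsing logic:

BeautifulSoup locates each product item (li.gl-item), then extracts the price (div.p-price), title (div.p-name), link (a.J_ClickStat), and shop (div.p-shop);

exception handling ensures that a failure on one product does not abort parsing of the whole page.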
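Once the raw strings have been extracted, they usually need light cleanup before they are stored. The sketch below shows one way to do that with the standard library only. It assumes hrefs may be protocol-relative (for example //item.jd.com/...) and that prices are plain decimal strings; both helper names are hypothetical.

```python
from urllib.parse import urljoin

SEARCH_BASE = "https://search.jd.com/"


def normalize_link(href):
    """Turn a raw href into an absolute URL, or "" if it is missing."""
    if not href:
        return ""
    # urljoin fills in the scheme for protocol-relative hrefs and
    # leaves already-absolute URLs unchanged.
    return urljoin(SEARCH_BASE, href.strip())


def parse_price(text):
    """Convert a price string to float, or None if it is not numeric."""
    if not text:
        return None
    cleaned = text.replace("¥", "").replace("￥", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None
```

Returning "" or None instead of raising keeps a single odd value from discarding the whole product record.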

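Anti-scraping measures:

Random delay (2-5 seconds): avoids sending many requests in a short time;

Browser-like request headers: lower the chance of being flagged by the anti-scraping mechanism;

Paged scraping: cap the number of pages so that too much data is not fetched in one run.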
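The random delay can be extended so that the wait grows after consecutive failures. This is a common throttling pattern (exponential backoff with jitter) and not something the original script does. The function name polite_delay and the 60-second cap are assumptions made for this sketch.

```python
import random


def polite_delay(attempt=0, base=2.0, spread=3.0, cap=60.0):
    """Return the number of seconds to wait before the next request.

    attempt=0 gives the 2-5 second random delay described above;
    each further failed attempt doubles the wait, up to cap seconds.
    """
    if attempt < 0:
        raise ValueError("attempt must be >= 0")
    delay = (base + random.uniform(0.0, spread)) * (2 ** attempt)
    return min(delay, cap)
```

In the crawl loop this would be used as time.sleep(polite_delay(attempt)), with attempt reset to 0 after each successful page.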

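IV. Testing and Notes

Running the code: after replacing the Cookie, run the script. It writes a CSV file to the current directory containing each product's title, price, link, and shop.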
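To confirm that the output file is readable, or to write it without pandas, the standard csv module is enough. The sketch below is an alternative under that assumption. The helper names are hypothetical, and the column names are passed in so that they can match whatever keys the scraper uses.

```python
import csv


def save_rows(rows, path, fields):
    """Write a list of dicts to a CSV file and return the number of rows written."""
    # utf-8-sig prepends a BOM, which helps Excel detect UTF-8.
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


def load_rows(path):
    """Read the CSV back as a list of dicts (all values are strings)."""
    with open(path, newline="", encoding="utf-8-sig") as f:
        return list(csv.DictReader(f))
```

Reading the file back with load_rows right after a run is a quick check that the row count matches the number of products reported in the progress messages.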

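Important notes:

JD's anti-scraping mechanism is updated from time to time; if the HTML structure changes, inspect the page elements again and adjust the parsing rules;

Do not scrape too much data or scrape too quickly, or your IP or account may be blocked;

This scraper is for learning only. Do not use it for commercial purposes, and comply with JD's robots protocol (robots.txt).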
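Checking a URL against robots rules can be automated with the standard library's urllib.robotparser. The rules in this sketch are made up for illustration and are not JD's actual robots.txt; the helper names are hypothetical as well.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT any real site's robots.txt.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def make_checker(robots_text):
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def is_allowed(parser, url, user_agent="my-crawler"):
    """Return True if the rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)
```

Against a live site, the parser would instead be pointed at the site's robots.txt with set_url(...) followed by read(), and is_allowed would be called before each request.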

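Summary

The core of a JD product-list scraper is building the correct URL, mimicking browser requests (Cookie/UA), and parsing the HTML to extract the data;

the key to coping with anti-scraping is controlling the request rate and disguising the request characteristics, so that JD's anti-scraping system does not flag the traffic;

parsing should include exception handling to keep the scraper stable, and the data is finally saved as CSV for later use.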

Reviewed and edited by 黄宇
