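A Practical Handbook for Parsing Product Detail Pages on Five Major E-commerce Platforms: Core Field Extraction, Anti-Scraping Countermeasures, and Code Examples for Taobao, JD.com, Pinduoduo, 1688, and Vipshop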

邓林 2025-10-13

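In e-commerce data analysis, competitor monitoring, and product selection, the core data on a product detail page (price, SKUs, stock, supplier information) drives key decisions. Page structure, data-loading behaviour, and anti-scraping mechanisms differ considerably between platforms, and those differences directly affect how efficiently data can be collected. This article walks through the detail-page parsing logic for five major platforms (Taobao, JD.com, Pinduoduo, 1688, and Vipshop), with field-extraction code, platform-specific adaptations, and anti-scraping countermeasures. The original parsing logic is retained throughout and supplemented with practical details and technical notes to help developers avoid common pitfalls.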

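I. Taobao: Handling Dynamic Rendering and Font Obfuscation

1. Key page-structure characteristics (with practical pain points)

Heavy reliance on JS for dynamic rendering: SKU, stock, and real-time price data are not returned with the initial page load; they arrive through asynchronous front-end JS requests (typically to endpoints on tmall.com or taobao.com).

Font obfuscation is common: some price digits are rendered with a custom font file (e.g. in woff format), so scraped text comes out garbled and must be decoded with the font's character mapping.

Low CAPTCHA threshold: more than 5 requests from the same IP in a short period may trigger slider verification, so request frequency must be tightly controlled.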
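The request-frequency point above can be sketched as a small throttle that enforces a minimum gap between consecutive requests. The class name `MinIntervalThrottle`, the 3-second default, and the injectable `clock`/`sleep` parameters are illustrative assumptions; the real threshold that triggers verification is not published and has to be tuned empirically.

```python
import time
from typing import Callable, Optional


class MinIntervalThrottle:
    """Block until at least `min_interval` seconds have passed since the last call."""

    def __init__(self, min_interval: float = 3.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> float:
        """Sleep if needed; return the number of seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay  # time at which the request is allowed to go out
        return delay
```

Calling `throttle.wait()` immediately before each `requests.get(...)` keeps a single process below the chosen rate; it does nothing about several processes sharing one IP.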

2. Core field parsing (original code retained, with added comments and exception handling)

python

import requests
import json
from bs4 import BeautifulSoup
import re
from typing import Dict, List, Optional


def parse_taobao_item(url: str) -> Optional[Dict]:
    """
    Parse the core fields of a Taobao product detail page.
    :param url: product detail page URL (e.g. https://item.taobao.com/item.htm?id=xxx)
    :return: dict with title, price, SKUs, and shop info; None if the price tag is missing;
             a dict with parse_status='fail' if the request or parsing raises
    """
    # Request headers: mimic Chrome; the Referer should match the product's domain
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36',
        'Referer': 'https://www.taobao.com',
        'Cookie': ''  # Optional: a logged-in Cookie exposes more non-public data (e.g. member prices)
    }

    try:
        # 10-second timeout to tolerate slow responses from Taobao's servers
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # raise on HTTP errors (e.g. 403, 500)
        soup = BeautifulSoup(response.text, 'html.parser')

        # 1. Title: usually in an h1 tag carrying a data-spm attribute
        title_tag = soup.select_one('h1[data-spm="1000983"]')
        title = title_tag.text.strip() if title_tag else "Title not found"

        # 2. Price: basic approach only; obfuscated digits require parsing the font file
        price_tag = soup.select_one('.tm-price')
        if not price_tag:
            return None  # no price tag: anti-scraping may have been triggered
        price_text = price_tag.text.strip()
        price_match = re.search(r'(\d+(?:\.\d+)?)', price_text)  # e.g. 25.80 or 100
        price = float(price_match.group(1)) if price_match else 0.0

        # 3. SKUs: parsed from the skuMap object embedded in an inline script
        sku_info: List[Dict] = []
        sku_script = soup.find('script', string=re.compile('skuMap'))  # script containing skuMap
        if sku_script:
            # Find "skuMap: {" and let the JSON decoder read exactly one object from there;
            # a non-greedy regex up to the first "}" would cut nested objects short
            sku_start = re.search(r'skuMap"?\s*:\s*(?=\{)', sku_script.string)
            if sku_start:
                try:
                    sku_json, _ = json.JSONDecoder().raw_decode(sku_script.string, sku_start.end())
                    # Walk the SKUs and pull out spec, price, and stock
                    for sku_id, sku_detail in sku_json.items():
                        sku_info.append({
                            'sku_id': sku_id,
                            'properties': sku_detail.get('name', 'Unknown spec'),  # e.g. "颜色分类:红色"
                            'price': float(sku_detail.get('price', 0)),
                            'stock': int(sku_detail.get('stock', 0))  # 0 means out of stock
                        })
                except json.JSONDecodeError:
                    print("Failed to parse SKU JSON; the page structure may have changed")

        # 4. Shop info: the shop name is usually in the .slogo-shopname tag
        shop_name_tag = soup.select_one('.slogo-shopname')
        shop_name = shop_name_tag.text.strip() if shop_name_tag else "Shop name not found"

        return {
            'platform': 'Taobao',
            'url': url,
            'title': title,
            'price': price,
            'sku_info': sku_info,
            'shop_name': shop_name,
            'parse_status': 'success'
        }

    except requests.exceptions.RequestException as e:
        print(f"Request to Taobao product page failed: {e}")
        return {'parse_status': 'fail', 'error_msg': str(e)}
    except Exception as e:
        print(f"Error while parsing Taobao product page: {e}")
        return {'parse_status': 'fail', 'error_msg': str(e)}

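II. JD.com: Simplifying Extraction with Clean APIs

1. Key page-structure characteristics (with API advantages)

Standardised JSON endpoints: basic product information, price, and SKUs each have their own API (e.g. a price API and a SKU API), so deep HTML parsing is unnecessary.

Login state affects data scope: without logging in only public prices are available; after login, member prices, coupons, and other account-specific data become accessible.

Reviews load page by page: product reviews are fetched from a paginated endpoint on comment.jd.com, with at most 10 per request.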
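Because reviews come back at most 10 per page, collecting them means looping until a short or empty page appears. The sketch below shows only that loop: `fetch_page` is a hypothetical caller-supplied function, since the actual endpoint path and parameters on comment.jd.com are not given in this article and are not assumed here.

```python
from typing import Callable, Dict, List

PAGE_SIZE = 10  # per-request maximum stated above


def collect_comments(fetch_page: Callable[[int], List[Dict]],
                     max_pages: int = 50) -> List[Dict]:
    """Call fetch_page(0), fetch_page(1), ... until a page is short or empty."""
    comments: List[Dict] = []
    for page in range(max_pages):
        batch = fetch_page(page)
        comments.extend(batch)
        if len(batch) < PAGE_SIZE:  # a short page means the last page was reached
            break
    return comments
```

The `max_pages` cap is a safety limit so that a misbehaving endpoint cannot keep the loop running indefinitely.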

2. Core field parsing (original code retained, with API notes and login hints)

python

import requests
import json
from bs4 import BeautifulSoup
from typing import Dict, List, Optional


def parse_jd_item(item_id: str) -> Optional[Dict]:
    """
    Parse the core fields of a JD.com product (JSON endpoints plus page parsing).
    :param item_id: product SKU ID (e.g. 100012345678, taken from the product URL)
    :return: dict with title, prices, and SKUs; None if the price API returns no data;
             a dict with parse_status='fail' if a request or parsing step raises
    """
    # Basic configuration: product page URL and API endpoints
    base_url = f"https://item.jd.com/{item_id}.html"
    price_api_url = f"https://p.3.cn/prices/mgets?skuIds=J_{item_id}"  # price API (no login needed)
    sku_api_url = f"https://item-soa.jd.com/getWareBusiness?skuId={item_id}"  # SKU API

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36',
        'Referer': base_url,
        'Cookie': ''  # Recommended: a logged-in Cookie exposes member prices and stock details
    }

    try:
        # 1. Title: parsed from the product page HTML
        base_response = requests.get(base_url, headers=headers, timeout=8)
        base_response.raise_for_status()
        soup = BeautifulSoup(base_response.text, 'html.parser')
        title_tag = soup.select_one('.sku-name')  # JD's title tag is .sku-name
        title = title_tag.text.strip() if title_tag else "Title not found"

        # 2. Price: call the price endpoint (more stable than parsing the page)
        price_response = requests.get(price_api_url, headers=headers, timeout=8)
        price_response.raise_for_status()
        price_data = price_response.json()
        if not price_data:
            return None
        # The price API returns a list; p is the current price, m the market price
        current_price = float(price_data[0].get('p', 0))
        original_price = float(price_data[0].get('m', 0))

        # 3. SKUs: call the SKU endpoint (spec, price, stock)
        sku_info: List[Dict] = []
        sku_response = requests.get(sku_api_url, headers=headers, timeout=8)
        sku_response.raise_for_status()
        sku_data = sku_response.json()

        # Parse the SKU structure (the response format is fairly stable)
        ware_sku = sku_data.get('wareSku', {})
        if 'skus' in ware_sku:
            for sku in ware_sku['skus']:
                sku_info.append({
                    'sku_id': sku.get('skuId', ''),
                    'properties': sku.get('name', 'Unknown spec'),  # e.g. "颜色:黑色;容量:128G"
                    'price': float(sku.get('price', 0)),
                    'stock_state': sku.get('stockState', 0),  # 0 = out of stock, 3 = in stock, 4 = pre-sale
                    'stock_desc': 'in stock' if sku.get('stockState') == 3 else 'out of stock / pre-sale'
                })

        return {
            'platform': 'JD.com',
            'item_id': item_id,
            'title': title,
            'current_price': current_price,
            'original_price': original_price,
            'sku_info': sku_info,
            'parse_status': 'success'
        }

    except requests.exceptions.RequestException as e:
        print(f"Request to JD endpoint failed: {e}")
        return {'parse_status': 'fail', 'error_msg': str(e)}
    except Exception as e:
        print(f"Error while parsing JD product data: {e}")
        return {'parse_status': 'fail', 'error_msg': str(e)}

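III. Pinduoduo: Adapting to the Mobile API and Signed Requests

1. Key page-structure characteristics (with mobile adaptation notes)

The mobile API is central: the PC page shows only basic information; complete data (e.g. SKUs, sales volume) requires calling the mobile endpoint at apiv3.pinduoduo.com.

Request parameters are frequently signed: key parameters (such as sign) must be generated with Pinduoduo's algorithm; simply concatenating parameters returns 403.

Strict slider verification: new IPs or high-frequency requests (3 or more per minute) trigger the slider, so IP proxies need to be combined with device fingerprinting.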

2. Core field parsing (original code retained, with signing notes and a sales-volume explanation)

python

import requests
import json
import time
import random
from typing import Dict, Optional, List


def parse_pinduoduo_item(item_id: str) -> Optional[Dict]:
    """
    Parse the core fields of a Pinduoduo product (via the mobile API).
    :param item_id: product ID (e.g. 123456789, taken from the mobile URL: https://mobile.yangkeduo.com/goods.html?goods_id=xxx)
    :return: dict with title, prices, sales, and images; None if the API reports an error or
             returns no item; a dict with parse_status='fail' if the request or parsing raises
    """
    # Pinduoduo mobile API (note: real use requires reverse-engineering the sign parameter; this is a basic example)
    api_url = "https://apiv3.pinduoduo.com/api/item/get"

    # Request parameters: mimic a mobile request, with timestamp and random number
    params = {
        'item_id': item_id,
        'pdduid': int(time.time() * 1000),  # simulated unique user ID (may change per request)
        '_': int(time.time() * 1000),       # millisecond timestamp (cache busting)
        'random': round(random.random(), 16),  # random number with up to 16 decimals, for request uniqueness
        'sign': ''  # Critical: must be generated with Pinduoduo's sign algorithm (obtained by reversing the JS), otherwise the API returns 403
    }

    # Mobile headers: must mimic an iPhone/Android device, otherwise the request is refused
    headers = {
        'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 16_6 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5 Mobile/15E148 Safari/604.1',
        'Referer': f'https://mobile.yangkeduo.com/goods.html?goods_id={item_id}',
        'Origin': 'https://mobile.yangkeduo.com',
        'Content-Type': 'application/x-www-form-urlencoded',
        'Accept': 'application/json, text/plain, */*'
    }

    try:
        # The API responds quickly, so a 5-second timeout is enough
        response = requests.get(api_url, params=params, headers=headers, timeout=5)
        response.raise_for_status()
        data = response.json()

        # Top-level structure: {"item": {...}, "code": 0}
        if data.get('code') != 0:
            print(f"Pinduoduo API returned an error: {data.get('msg', 'unknown error')}")
            return None
        item_data = data.get('item', {})
        if not item_data:
            return None

        # Core fields: Pinduoduo prices are in fen, so divide by 100 to get yuan
        title = item_data.get('goods_name', 'Title not found')
        min_group_price = item_data.get('min_group_price', 0) / 100  # lowest group-buy price
        market_price = item_data.get('market_price', 0) / 100        # market price
        sales_tip = item_data.get('sales_tip', '0人已买')            # sales hint (e.g. "10万+人已买")
        gallery = item_data.get('gallery', [])                       # product image list
        images = [img.get('url', '') for img in gallery]             # image URLs
        goods_desc = item_data.get('goods_desc', 'No description')   # short product description

        # SKUs (if any)
        sku_info: List[Dict] = []
        sku_list = item_data.get('sku_list', [])
        for sku in sku_list:
            sku_info.append({
                'sku_id': sku.get('sku_id', ''),
                'properties': sku.get('spec', 'Unknown spec'),  # e.g. "颜色:白色;尺寸:M"
                'price': sku.get('price', 0) / 100,
                'stock': sku.get('stock', 0)
            })

        return {
            'platform': 'Pinduoduo',
            'item_id': item_id,
            'title': title,
            'current_price': min_group_price,
            'original_price': market_price,
            'sales_tip': sales_tip,
            'images': images,
            'description': goods_desc,
            'sku_info': sku_info,
            'parse_status': 'success'
        }

    except requests.exceptions.RequestException as e:
        print(f"Request to Pinduoduo API failed: {e}")
        return {'parse_status': 'fail', 'error_msg': str(e)}
    except Exception as e:
        print(f"Error while parsing Pinduoduo product data: {e}")
        return {'parse_status': 'fail', 'error_msg': str(e)}
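The parser above converts fen to yuan with float division. Where exact amounts matter (price comparisons, aggregation), a Decimal-based conversion avoids binary floating-point rounding. A minimal sketch; `fen_to_yuan` is a hypothetical helper and not part of any Pinduoduo SDK.

```python
from decimal import Decimal


def fen_to_yuan(fen: int) -> Decimal:
    """Convert an integer amount in fen to yuan with exactly two decimal places."""
    return (Decimal(fen) / Decimal(100)).quantize(Decimal('0.01'))
```

Keeping the value as a `Decimal` until it is displayed or stored also preserves trailing zeros, which matters if a downstream check expects a two-decimal format.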

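IV. 1688: Focusing on B2B Supplier and Wholesale Attributes

1. Key page-structure characteristics (with B2B data priorities)

Supplier information is prominent: the page includes company name, location, years in business, certifications, and other key B2B data.

SKUs support mixed-batch rules: some products are priced by minimum order quantity and mixed-batch discount, so wholesale attributes must be extracted alongside the SKU fields.

High API access threshold: enterprise-level data (e.g. supplier transaction rate) requires permissions from the 1688 Open Platform, which individual developers can rarely obtain.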
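Pricing by minimum order quantity usually takes the form of a quantity ladder. The sketch below looks up the unit price for a given order quantity; the tier layout (a list of `(min_quantity, unit_price)` pairs) and the function name are assumptions for illustration, not 1688's actual response format.

```python
from typing import List, Optional, Tuple


def unit_price_for(quantity: int, tiers: List[Tuple[int, float]]) -> Optional[float]:
    """Return the unit price of the highest tier whose minimum quantity is met.

    Returns None when the quantity is below the smallest minimum order.
    """
    price: Optional[float] = None
    for min_qty, tier_price in sorted(tiers):  # ascending by minimum quantity
        if quantity >= min_qty:
            price = tier_price
        else:
            break
    return price
```

For example, with tiers `[(10, 15.0), (100, 12.0), (500, 10.0)]`, an order of 99 units is priced at the first tier and an order of 100 units at the second.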

2. Core field parsing (original code retained, with wholesale attributes and supplier credentials)

python

import requests
import json
from bs4 import BeautifulSoup
import re
from typing import Dict, List, Optional


def parse_1688_item(item_id: str) -> Optional[Dict]:
    """
    Parse the core fields of a 1688 product detail page (B2B supplier info and wholesale attributes).
    :param item_id: product Offer ID (e.g. 688123456789, taken from the URL: https://detail.1688.com/offer/xxx.html)
    :return: dict with product and supplier info; a dict with parse_status='fail'
             if the request or parsing raises
    """
    item_url = f"https://detail.1688.com/offer/{item_id}.html"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36',
        'Referer': 'https://www.1688.com',
        'Cookie': ''  # After login, supplier contact details and transaction records become available
    }

    try:
        response = requests.get(item_url, headers=headers, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')

        # 1. Title: the 1688 title tag is .d-title
        title_tag = soup.select_one('.d-title')
        title = title_tag.text.strip() if title_tag else "Title not found"

        # 2. Price range: wholesale products are mostly priced as a range (e.g. "10.00-15.00元")
        price_range_tag = soup.select_one('.price-now')
        price_range = price_range_tag.text.strip() if price_range_tag else "Price not found"

        # 3. Core supplier info (the B2B focus)
        company_name_tag = soup.select_one('.company-name')
        company_name = company_name_tag.text.strip() if company_name_tag else "Company name not found"

        location_tag = soup.select_one('.location')
        location = location_tag.text.strip() if location_tag else "Location not found"

        # Extra: years in business (present on some pages; adjust the selector to the actual structure)
        operate_years_tag = soup.select_one('.year')
        operate_years = operate_years_tag.text.strip() if operate_years_tag else "Not disclosed"

        # 4. SKUs (with B2B attributes such as mixed batch and minimum order quantity)
        sku_info: List[Dict] = []
        sku_script = soup.find('script', string=re.compile('skuMap'))
        if sku_script:
            # Find "skuMap: {" and decode exactly one JSON object from that position;
            # a non-greedy regex up to the first "}" would cut nested objects short
            sku_start = re.search(r'skuMap"?\s*:\s*(?=\{)', sku_script.string)
            if sku_start:
                try:
                    sku_json, _ = json.JSONDecoder().raw_decode(sku_script.string, sku_start.end())
                    for sku_id, sku_detail in sku_json.items():
                        sku_info.append({
                            'sku_id': sku_id,
                            'properties': sku_detail.get('name', 'Unknown spec'),
                            'price': sku_detail.get('price', '0.00'),  # may be a range (e.g. "10-12")
                            'min_order': sku_detail.get('minOrderQuantity', 1),  # minimum order quantity
                            'available_quantity': sku_detail.get('availableQuantity', 0),  # quantity available
                            'mix_batch': sku_detail.get('supportMix', False)  # whether mixed batch is supported
                        })
                except json.JSONDecodeError:
                    print("Failed to parse 1688 SKU JSON")

        return {
            'platform': '1688',
            'item_id': item_id,
            'title': title,
            'price_range': price_range,
            'supplier_info': {
                'company_name': company_name,
                'location': location,
                'operate_years': operate_years
            },
            'sku_info': sku_info,
            'parse_status': 'success'
        }

    except requests.exceptions.RequestException as e:
        print(f"Request to 1688 product page failed: {e}")
        return {'parse_status': 'fail', 'error_msg': str(e)}
    except Exception as e:
        print(f"Error while parsing 1688 product data: {e}")
        return {'parse_status': 'fail', 'error_msg': str(e)}

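V. Vipshop: Adapting to Brand Flash Sales and Time-Limited Promotions

1. Key page-structure characteristics (with time-limited promotion notes)

Brand flash-sale data dominates: the page highlights brand name and discount level, and displayed prices are mostly post-discount prices.

Promotions are highly time-sensitive: stock and prices are updated hourly, so parsed data should be stamped with its fetch time.

Simplified PC site: it shows only basic information; complete SKUs and promotion rules require parsing the mobile page.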

2. Core field parsing (original code retained, with fetch-time stamping and discount calculation)

python

import requests
import json
import re
from typing import Dict, List, Optional
from datetime import datetime


def parse_vip_item(item_id: str) -> Optional[Dict]:
    """
    Parse the core fields of a Vipshop product detail page (brand flash sale and time-limited discount info).
    :param item_id: product ID (e.g. 1234567, taken from the mobile URL: https://m.vip.com/product-xxx.html)
    :return: dict with brand, prices, and discount; None if the embedded product JSON is missing;
             a dict with parse_status='fail' if the request or parsing raises
    """
    mobile_url = f"https://m.vip.com/product-{item_id}.html"
    headers = {
        'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 16_6 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5 Mobile/15E148 Safari/604.1',
        'Referer': 'https://m.vip.com/',
        'Accept': 'application/json, text/plain, */*'
    }

    try:
        response = requests.get(mobile_url, headers=headers, timeout=8)
        response.raise_for_status()
        html_content = response.text

        # Extract the product JSON embedded in the page: Vipshop stores it in window.productInfo
        product_info_match = re.search(r'window\.productInfo\s*=\s*(\{.*?\});', html_content, re.DOTALL)
        if not product_info_match:
            print("Vipshop product JSON not found; the page structure may have changed")
            return None

        # Parse the JSON data
        product_info = json.loads(product_info_match.group(1))
        product = product_info.get('product', {})
        if not product:
            return None

        # Core fields
        title = product.get('name', 'Title not found')
        brand_name = product.get('brandName', 'Brand not found')
        original_price = float(product.get('marketPrice', 0))  # market price
        current_price = float(product.get('salePrice', 0))      # post-discount price
        discount = product.get('discount', 'No discount')       # e.g. "3.5折"

        # Extra: compute the actual discount on the same 0-10 scale (to cross-check the displayed label)
        discount_rate = round((current_price / original_price) * 10, 1) if original_price != 0 else 0.0

        # Product images: detailImages is the list of detail images
        detail_images = product.get('detailImages', [])
        images = [img.get('url', '') for img in detail_images]

        # Colour options
        color_options = [color.get('name', '') for color in product.get('colors', [])]

        # Promotion window (key for time-limited sales)
        activity_start = product.get('startTime', '')
        activity_end = product.get('endTime', '')

        return {
            'platform': 'Vipshop',
            'item_id': item_id,
            'title': title,
            'brand': brand_name,
            'original_price': original_price,
            'current_price': current_price,
            'discount': discount,
            'discount_rate': discount_rate,  # computed discount (e.g. 3.5)
            'images': images,
            'color_options': color_options,
            'activity_time': {
                'start': activity_start,
                'end': activity_end
            },
            'data_fetch_time': datetime.now().strftime("%Y-%m-%d %H:%M:%S"),  # fetch time
            'parse_status': 'success'
        }

    except requests.exceptions.RequestException as e:
        print(f"Request to Vipshop product page failed: {e}")
        return {'parse_status': 'fail', 'error_msg': str(e)}
    except Exception as e:
        print(f"Error while parsing Vipshop product data: {e}")
        return {'parse_status': 'fail', 'error_msg': str(e)}
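The parser computes `discount_rate` but does not itself compare it with the displayed label. A minimal sketch of that comparison follows; it assumes the label has the form "3.5折" as in the code comment, and the helper name and the 0.1 tolerance are illustrative choices.

```python
import re
from typing import Optional


def discount_matches(label: str, computed_rate: float,
                     tolerance: float = 0.1) -> Optional[bool]:
    """Compare a label such as "3.5折" with the computed rate (same 0-10 scale).

    Returns None when the label contains no parsable number.
    """
    match = re.search(r'(\d+(?:\.\d+)?)\s*折', label)
    if not match:
        return None
    return abs(float(match.group(1)) - computed_rate) <= tolerance
```

A `False` result is a signal to re-fetch or flag the record, since either the price fields or the label may be stale.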

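VI. General Parsing Strategies and Caveats (with practical tools and approaches)

1. Handling dynamic content (with tool-selection advice)

JS rendering tools compared:

Selenium: well suited to Python developers and supports visual debugging; the drawback is high resource usage.

Puppeteer: a Node.js tool with fast rendering, suited to batch parsing.

Playwright: an open-source tool from Microsoft that drives multiple browser engines (Chromium/Firefox/WebKit), giving it broader browser coverage than the other two.

Prefer official APIs:

Taobao: the taobao.item_get interface (requires Open Platform permissions).

JD.com: the jd.union.open.goods.detail.query interface (available to JD Union accounts on application).

Advantages: the data comes straight from the platform, so it is authoritative, carries no anti-scraping risk, and is updated in step with the platform.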

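2. Advanced anti-scraping countermeasures (with details and tool recommendations)

Anti-scraping measure	Countermeasure	Recommended tools
User-Agent detection	Build a UA pool covering multiple devices (PC and mobile) and pick one at random per request, avoiding a fixed format	Open-source UA pool: the user_agent Python library
IP blocking	Use a residential proxy pool (IPs that look like real users) rather than data-centre IPs; keep at least 3 seconds between requests per IP	Bright Data (formerly Luminati; global residential IPs), Oxylabs
Cookie validation	Maintain a pool of logged-in Cookies and refresh them regularly (e.g. Taobao Cookies last about 7 days)	Automatic Cookie refresh tool: CookieCloud
Font obfuscation	Parse the font file's character mapping to turn garbled text back into normal characters, or use OCR to read the price as an image	Python libraries: fonttools, pytesseract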
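For the font-obfuscation row, once the mapping from obfuscated code points to real digits is known, decoding is a plain character translation. The sketch below assumes that mapping has already been recovered (for example by inspecting the font's character map with fonttools and identifying each glyph); the code points in `SAMPLE_MAP` and both names are made up for illustration.

```python
from typing import Dict


def decode_obfuscated(text: str, glyph_map: Dict[str, str]) -> str:
    """Replace each obfuscated character with its real character; leave others untouched."""
    return text.translate(str.maketrans(glyph_map))


# Hypothetical mapping: private-use code points standing in for digits
SAMPLE_MAP = {'\ue001': '2', '\ue002': '5', '\ue003': '8', '\ue004': '0'}
```

Sites that use this technique often rotate the font file, so the mapping generally has to be rebuilt whenever the font URL changes.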

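3. Data validation and quality assurance (with concrete cases)

Field format validation:

Price: use the regex r'^\d+\.\d{2}$' to require exactly two decimal places (e.g. 25.80; values such as 100 or 25.8 are rejected).

Product ID: Taobao IDs are 11-12 digits and JD.com IDs are 10-13 digits; anything else is flagged as an anomaly.

Null and anomaly handling:

If stock is negative, correct it to 0.

If price is 0, re-request or mark the record as "data error".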
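The checks above translate directly into small helpers. A sketch under the rules stated in this section; the function names, the platform keys, and the 'data_error' status label are illustrative.

```python
import re
from typing import Dict

PRICE_RE = re.compile(r'^\d+\.\d{2}$')
ID_LENGTHS = {'taobao': (11, 12), 'jd': (10, 13)}  # digit-count ranges quoted above


def is_valid_price(text: str) -> bool:
    """True only for strings with exactly two decimal places, such as '25.80'."""
    return PRICE_RE.fullmatch(text) is not None


def is_valid_item_id(platform: str, item_id: str) -> bool:
    """True when the ID is all ASCII digits and its length is in the platform's range."""
    low, high = ID_LENGTHS[platform]
    return item_id.isascii() and item_id.isdigit() and low <= len(item_id) <= high


def sanitize(record: Dict) -> Dict:
    """Clamp negative stock to 0 and flag a zero price as a data error."""
    cleaned = dict(record)  # work on a copy; the input is left unchanged
    cleaned['stock'] = max(0, int(cleaned.get('stock', 0)))
    if float(cleaned.get('price', 0)) == 0:
        cleaned['parse_status'] = 'data_error'
    return cleaned
```

Flagging rather than silently dropping zero-price records keeps them visible for a later re-request.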

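Timeliness stamping:

For platforms with time-limited promotions such as Vipshop and Pinduoduo, always record the fetch time to avoid acting on expired prices.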
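One way to act on the recorded fetch time is to reject records older than a chosen age before using their prices. The sketch assumes the `data_fetch_time` string format written by the Vipshop parser in this article; the one-hour default mirrors the hourly update cadence mentioned for Vipshop and is an assumption to tune, and `is_stale` is a hypothetical helper name.

```python
from datetime import datetime, timedelta
from typing import Dict, Optional

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def is_stale(record: Dict, max_age: timedelta = timedelta(hours=1),
             now: Optional[datetime] = None) -> bool:
    """Return True when the record has no fetch time or is older than max_age."""
    fetched = record.get('data_fetch_time')
    if not fetched:
        return True  # an unstamped record cannot be trusted to be current
    now = now or datetime.now()
    return now - datetime.strptime(fetched, TIME_FORMAT) > max_age
```

The `now` parameter exists so the check can be tested or replayed against a fixed reference time.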

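VII. Summary and Platform Recommendations

Parsing difficulty and the data worth focusing on differ considerably between platforms. Choose an approach according to the business scenario:

Consumer product selection / competitor monitoring: prioritise Taobao, JD.com, and Pinduoduo, focusing on price, sales volume, and SKU stock.

B2B supplier screening: focus on 1688 and extract company credentials, minimum order quantity, and mixed-batch rules.

Brand discount analysis: concentrate on Vipshop, tracking discount level, promotion period, and brand distribution.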
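Whichever platforms a scenario prioritises, most of the parsers in this article take an item ID extracted from the product URL. A sketch of that extraction follows, using only the URL shapes quoted in the parsers' docstrings; the function name and platform keys are illustrative, and real product URLs may take other forms not covered here.

```python
import re
from typing import Optional, Tuple
from urllib.parse import parse_qs, urlparse


def extract_item_id(url: str) -> Optional[Tuple[str, str]]:
    """Return (platform, item_id) for a recognised product URL, else None."""
    parts = urlparse(url)
    host, path, query = parts.netloc.lower(), parts.path, parse_qs(parts.query)

    # Platforms that carry the ID in the query string
    if host == 'item.taobao.com' and 'id' in query:
        return 'taobao', query['id'][0]
    if host == 'mobile.yangkeduo.com' and 'goods_id' in query:
        return 'pinduoduo', query['goods_id'][0]

    # Platforms that carry the ID in the path
    path_rules = {
        'item.jd.com': ('jd', r'^/(\d+)\.html$'),
        'detail.1688.com': ('1688', r'^/offer/(\d+)\.html$'),
        'm.vip.com': ('vip', r'^/product-(\d+)\.html$'),
    }
    if host in path_rules:
        platform, pattern = path_rules[host]
        match = re.match(pattern, path)
        if match:
            return platform, match.group(1)
    return None
```

The returned platform key can then be used to pick the matching parser function.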

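Note: all parsing must comply with each platform's robots.txt and China's Cybersecurity Law; avoid high-frequency scraping and do not collect sensitive data (such as users' personal information or non-public commercial data). Check each platform's page structure regularly (every 1-2 months) and update the parsing logic promptly as anti-scraping strategies change.

If you run into problems such as API parameter signing, font decoding failures, or getting past slider verification in practice, describe the scenario in the comments and the author will reply with practical solutions.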

Reviewed and edited by 黄宇
