Definition of a web crawler
**Types of crawlers**
The robots protocol (robots.txt)
Example:
Taobao's robots file is at https://www.taobao.com/robots.txt
* It clearly shows that Taobao disallows Baidu's crawler from accessing any directory on its site.
If all URLs were allowed, the rule would instead read: `Allow: /`
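Rules like these can be checked programmatically with Python's standard `urllib.robotparser`. A minimal sketch below feeds it a hand-written rule set in the same spirit as Taobao's Baiduspider rule (the live file may differ), rather than downloading it:

```python
from urllib.robotparser import RobotFileParser

# Hand-written rules mirroring the Baiduspider entry described above;
# the actual live robots.txt may contain more rules.
rules = [
    "User-agent: Baiduspider",
    "Disallow: /",
]

parser = RobotFileParser()
parser.parse(rules)

# Baiduspider is blocked from every path on the site
print(parser.can_fetch("Baiduspider", "https://www.taobao.com/market/"))
# A crawler not covered by any rule is allowed by default
print(parser.can_fetch("OtherBot", "https://www.taobao.com/market/"))
```

A polite crawler calls `can_fetch` before requesting each URL and skips anything the site disallows.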
The concepts of HTTP and HTTPS
HTTP
HTTPS
**HTTPS is more secure than HTTP, but performs worse (the TLS handshake and encryption add overhead)**
The structure of a URL
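A URL's components (scheme, host, port, path, query, fragment) can be inspected with the standard library. A small sketch using a made-up example URL:

```python
from urllib.parse import urlparse

# Hypothetical URL covering every component
url = "http://www.example.com:8080/path/page.html?wd=hello#section1"
parts = urlparse(url)

print(parts.scheme)    # http
print(parts.hostname)  # www.example.com
print(parts.port)      # 8080
print(parts.path)      # /path/page.html
print(parts.query)     # wd=hello
print(parts.fragment)  # section1
```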
Common HTTP request headers
Common HTTP response codes
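The standard phrases for common response codes are available offline in Python's `http` module, which is handy when logging crawl results. A quick sketch:

```python
from http import HTTPStatus

# Look up the standard reason phrase for a few common codes
for code in (200, 302, 404, 500):
    print(code, HTTPStatus(code).phrase)
# 200 OK
# 302 Found
# 404 Not Found
# 500 Internal Server Error
```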
The requests official site
**Example: download an image from the official site**
```python
import requests

url = "https://docs.python-requests.org/zh_CN/latest/_static/requests-sidebar.png"
response = requests.get(url)
# A 200 status code means the request succeeded
if response.status_code == 200:
    # Write the image to a file
    with open("aa.png", "wb") as f:
        f.write(response.content)
```
The difference between response.text and response.content
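`response.content` is the raw bytes of the body, while `response.text` is those bytes decoded into a `str` using the encoding requests detects (overridable via `response.encoding`). The relationship can be sketched offline with plain bytes, no request needed:

```python
# Simulate a response body containing Chinese text encoded as UTF-8
raw = "你好, requests".encode("utf-8")  # what response.content would hold (bytes)
text = raw.decode("utf-8")             # what response.text yields once the encoding is known

print(type(raw))   # <class 'bytes'>
print(type(text))  # <class 'str'>
```

Decoding with the wrong encoding garbles the text, which is why setting `response.encoding` correctly matters for non-ASCII pages.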
**Sending a requests request with headers**
Example
```python
import requests

response = requests.get("http://www.baidu.com")
print(response.request.headers)

# Set a header to mimic the Chrome browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36"
}
response2 = requests.get("http://www.baidu.com", headers=headers)
print(response2.request.headers)
```
Getting the User-Agent value
Sending a requests request with query parameters
```python
import requests

url = "http://www.baidu.com/s?"
# Query parameters
params = {
    "wd": "hello"
}
# Option 1: pass a params dict and let requests build the query string
response = requests.get(url, params=params)
# Option 2: build the query string by hand
response2 = requests.get("http://www.baidu.com/s?wd={}".format("hello"))

print(response.status_code)
print(response.request.url)
print(response2.status_code)
print(response2.request.url)
```
Sending a POST request with requests
```python
import requests

url = "https://www.jisilu.cn/data/cbnew/cb_list/?___jsl=LST___t=1617719326771"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36"
}
# Form data sent in the POST body
data = {
    "curr_iss_amt": 20
}
response = requests.post(url, data=data, headers=headers)
print(response.content.decode())
```
Using a proxy with requests
```python
import requests

url = "http://www.baidu.com"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36"
}
# Example proxy address; substitute a working proxy of your own
proxies = {
    "http": "175.42.158.211:9999"
}
response = requests.get(url, headers=headers, proxies=proxies)
print(response.status_code)
```
**The difference between cookies and sessions**
Handling cookies and sessions in a crawler
**Handling cookies and session requests with requests**
```python
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36"
}
data = {
    "email": "your username",
    "password": "your password"
}
session = requests.Session()
# Send the login POST through the session; the cookies it sets are stored in the session
session.post("http://www.renren.com/PLogin.do", data=data, headers=headers)
# Reuse the same session to request a page that requires login
r = session.get("http://www.renren.com/976564425", headers=headers)
print(r.content.decode())
```