Problems with chardet.detect when scraping web pages with requests

2020-05-11　wwm
Category: Python/Ruby

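chardet.detect often reports gb2312 for Chinese pages, and the page itself may also declare charset="gb2312".
In practice the content is usually GBK or GB18030. Both are supersets of GB2312, so decoding as gb2312 fails on any character outside the GB2312 set. Decode with one of the wider encodings instead: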

txt = c.content.decode("gbk")
# or
txt = c.content.decode("GB18030")
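
To see why the gb2312 label is a problem, here is a minimal sketch using only the standard library. The sample string is my own choice, not taken from any real page: "镕" is a character that exists in GBK and GB18030 but not in GB2312.

```python
# Sample text chosen for illustration: "镕" is in GBK/GB18030 but not in GB2312.
text = "朱镕基"
raw = text.encode("gbk")  # stands in for c.content from a page labelled gb2312

# Decoding with the declared/detected encoding fails on the GBK-only character.
try:
    raw.decode("gb2312")
    gb2312_ok = True
except UnicodeDecodeError:
    gb2312_ok = False

# The wider encodings decode the same bytes without trouble.
gbk_text = raw.decode("gbk")
gb18030_text = raw.decode("gb18030")

print(gb2312_ok, gbk_text == text, gb18030_text == text)
```

Since GB18030 covers everything in GBK and GB2312, it is the safest of the three to decode with.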

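Example
import chardet
import requests
from bs4 import BeautifulSoup

c = requests.get(url)  # url: the page to fetch
print(chardet.detect(c.content))  # often reports GB2312
# GB18030 is a superset of GB2312 and GBK, so it decodes all three
txt = c.content.decode("GB18030")
soup = BeautifulSoup(txt, 'lxml')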
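
If you would rather not hard-code GB18030 for every page, one option is to widen the detected name only when it belongs to the GB family. The sketch below is a rough idea, not a definitive implementation: safe_encoding and decode_page are hypothetical helper names, and the detected argument is assumed to be the value of chardet.detect(c.content)["encoding"].

```python
import codecs

# Encodings that GB18030 can stand in for (codec names as Python normalizes them).
GB_FAMILY = {"gb2312", "gbk"}


def safe_encoding(name, default="utf-8"):
    """Hypothetical helper: map a detected/declared name to a safer codec name."""
    if not name:
        return default
    try:
        canonical = codecs.lookup(name).name
    except LookupError:
        # Unknown to Python: fall back to the default.
        return default
    return "gb18030" if canonical in GB_FAMILY else canonical


def decode_page(raw, detected):
    """Hypothetical helper: decode page bytes, widening gb2312/gbk to gb18030."""
    return raw.decode(safe_encoding(detected), errors="replace")


# Bytes that are really GBK but were reported as GB2312.
sample = "朱镕基".encode("gbk")
print(decode_page(sample, "GB2312"))
```

With requests, the same idea can be applied by assigning c.encoding = safe_encoding(detected) and then reading c.text.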
