A few notes on problems encountered while crawling websites.
- `scrapy shell` fails with: twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost

```
scrapy shell https://www.xxx.com/
```
Fix 1: try removing the `www` from the URL.

Fix 2: mimic a browser by changing the request's User-Agent; edit `settings.py`:
```python
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
}
```
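For a one-off `scrapy shell` session, the same User-Agent override can also be passed on the command line via Scrapy's `-s` setting flag, so you don't have to touch `settings.py` (a sketch; the UA string above is reused as an example):

```shell
# Override the USER_AGENT setting just for this shell session
scrapy shell -s USER_AGENT="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36" https://www.xxx.com/
```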
- When outputting the crawl results, Chinese text is garbled and shows up as Unicode escapes:

```
scrapy crawl kanping99 -O kanping99.jsonl
```

Set the output encoding with `-s`:

```
scrapy crawl kanping99 -O kanping99.jsonl -s FEED_EXPORT_ENCODING=utf-8
```
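The `\uXXXX` output is ordinary JSON ASCII-escaping rather than data corruption: without an export encoding set, the feed exporter escapes non-ASCII characters, much like `json.dumps` does by default. A minimal standard-library illustration (the sample item is made up):

```python
import json

item = {"title": "中文标题"}  # hypothetical scraped item with Chinese text

# Default: non-ASCII characters are escaped to \uXXXX sequences
print(json.dumps(item))                      # {"title": "\u4e2d\u6587\u6807\u9898"}

# With ensure_ascii=False the characters are written as-is,
# which is what FEED_EXPORT_ENCODING=utf-8 achieves for the feed file
print(json.dumps(item, ensure_ascii=False))  # {"title": "中文标题"}
```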
- Connection refused:

```
2024-04-28 14:07:36 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-04-28 14:07:36 [scrapy.core.engine] INFO: Spider opened
2024-04-28 14:07:36 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.xxx.com/3-1.html> (failed 1 times): Connection was refused by other side: 61: Connection refused.
2024-04-28 14:07:37 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.xxx.com/3-1.html> (failed 2 times): Connection was refused by other side: 61: Connection refused.
2024-04-28 14:07:37 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://www.xxx.com/3-1.html> (failed 3 times): Connection was refused by other side: 61: Connection refused.
```
Try switching `https` to `http`, or dropping the `www`. In my case, switching `https` to `http` fixed it.
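Those two manual fallbacks can also be generated programmatically. Below is a small helper (not from the original post, just a sketch using only the standard library) that derives the http and no-`www` variants of a URL so a spider could retry them:

```python
from urllib.parse import urlsplit, urlunsplit

def fallback_variants(url):
    """Return alternate URLs to try when the original is refused:
    the same URL with http instead of https, and with a leading
    'www.' removed from the host."""
    parts = urlsplit(url)
    variants = []
    if parts.scheme == "https":
        # Same URL, downgraded to plain http
        variants.append(urlunsplit(("http",) + parts[1:]))
    if parts.netloc.startswith("www."):
        # Same URL with the 'www.' prefix stripped from the host
        variants.append(urlunsplit((parts.scheme, parts.netloc[4:]) + parts[2:]))
    return variants

print(fallback_variants("https://www.xxx.com/3-1.html"))
# ['http://www.xxx.com/3-1.html', 'https://xxx.com/3-1.html']
```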
Learn together, make progress together.