A few notes on problems encountered while crawling websites.
- `scrapy shell` fails with: twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost

```
scrapy shell https://www.xxx.com/
```
Fix 1: try removing the `www` from the URL.

Fix 2: mimic a browser by changing the request's User-Agent; edit `settings.py`:
```python
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
}
```
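For a one-off `scrapy shell` session, the same User-Agent override can also be passed on the command line via Scrapy's `-s` setting flag, so you don't have to touch `settings.py` (a sketch; the UA string above is reused as an example):

```shell
# Override the USER_AGENT setting just for this shell session
scrapy shell -s USER_AGENT="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36" https://www.xxx.com/
```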
- When outputting the crawl results, Chinese text is garbled and shows up as Unicode escapes:

```
scrapy crawl kanping99 -O kanping99.jsonl
```

Set the output encoding with `-s`:

```
scrapy crawl kanping99 -O kanping99.jsonl -s FEED_EXPORT_ENCODING=utf-8
```
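The `\uXXXX` output is ordinary JSON ASCII-escaping rather than data corruption: without an export encoding set, the feed exporter escapes non-ASCII characters, much like `json.dumps` does by default. A minimal standard-library illustration (the sample item is made up):

```python
import json

item = {"title": "中文标题"}  # hypothetical scraped item with Chinese text

# Default: non-ASCII characters are escaped to \uXXXX sequences
print(json.dumps(item))                      # {"title": "\u4e2d\u6587\u6807\u9898"}

# With ensure_ascii=False the characters are written as-is,
# which is what FEED_EXPORT_ENCODING=utf-8 achieves for the feed file
print(json.dumps(item, ensure_ascii=False))  # {"title": "中文标题"}
```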
- Connection refused:

```
2024-04-28 14:07:36 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-04-28 14:07:36 [scrapy.core.engine] INFO: Spider opened
2024-04-28 14:07:36 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.xxx.com/3-1.html> (failed 1 times): Connection was refused by other side: 61: Connection refused.
2024-04-28 14:07:37 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.xxx.com/3-1.html> (failed 2 times): Connection was refused by other side: 61: Connection refused.
2024-04-28 14:07:37 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://www.xxx.com/3-1.html> (failed 3 times): Connection was refused by other side: 61: Connection refused.
```
Try switching `https` to `http`, or dropping the `www`. In my case, switching `https` to `http` fixed it.
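Those two manual fallbacks can also be generated programmatically. Below is a small helper (not from the original post, just a sketch using only the standard library) that derives the http and no-`www` variants of a URL so a spider could retry them:

```python
from urllib.parse import urlsplit, urlunsplit

def fallback_variants(url):
    """Return alternate URLs to try when the original is refused:
    the same URL with http instead of https, and with a leading
    'www.' removed from the host."""
    parts = urlsplit(url)
    variants = []
    if parts.scheme == "https":
        # Same URL, downgraded to plain http
        variants.append(urlunsplit(("http",) + parts[1:]))
    if parts.netloc.startswith("www."):
        # Same URL with the 'www.' prefix stripped from the host
        variants.append(urlunsplit((parts.scheme, parts.netloc[4:]) + parts[2:]))
    return variants

print(fallback_variants("https://www.xxx.com/3-1.html"))
# ['http://www.xxx.com/3-1.html', 'https://xxx.com/3-1.html']
```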
Learn together, make progress together.