Tool: scrapy_splash

What is scrapy_splash?

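scrapy-splash relies on Splash to load data that is generated by JavaScript.
Splash is a JavaScript rendering service: a lightweight browser with an HTTP API. It is implemented in Python on top of modules such as Twisted and Qt, and it can be scripted in Lua.
The response you end up with from scrapy-splash is roughly the page source as it looks after a browser has finished rendering the page.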

Documentation (Chinese translation): https://splash-cn-doc.readthedocs.io/zh-cn/latest/Installation.html

What Splash is for:
scrapy-splash can load a page's JavaScript the way a browser would and return the data produced after the scripts have run.
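
To make "a browser with an HTTP API" a bit more concrete, here is a minimal sketch of how a request to Splash's render.html endpoint is addressed, using only the standard library. The helper name build_render_url and the example target URL are made up for illustration, and the Splash address assumes the default local setup described in the installation section.

```python
from urllib.parse import urlencode


def build_render_url(splash_url, target_url, endpoint="render.html", wait=0.5):
    """Build the GET URL that asks Splash to render target_url.

    build_render_url is a hypothetical helper, not part of Splash or scrapy-splash.
    """
    # 'url' is the page to render; 'wait' is how long Splash waits after load
    query = urlencode({"url": target_url, "wait": wait})
    return f"{splash_url.rstrip('/')}/{endpoint}?{query}"


if __name__ == "__main__":
    print(build_render_url("http://127.0.0.1:8050", "https://example.com"))
```

Opening the printed URL in a browser (with Splash running) should show the rendered HTML of the target page; scrapy-splash takes care of this addressing for you.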

Features

Here is what the official site says it can do:

Take screenshots of multiple pages
Wait for element
Scroll page
Preload jQuery
Preload functions
Load multiple pages
Count DIV tags
Call Later
Render PNG
Take a screenshot of a single element
Log requested URLs
Block CSS
Execute function with timeout
Submit a search input

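For ordinary users like us, loading JavaScript, scrolling the page, waiting for scripts to finish, and taking screenshots are probably the most useful.
Look up the usage of whichever ones you need.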
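
As a starting point for the scroll-and-wait case, Splash can run a small Lua script through its execute endpoint. The sketch below only builds the POST request and does not send it. The helper name build_execute_request and the script itself are illustrative assumptions, not taken from the official examples.

```python
import json
from urllib.request import Request

# Lua script executed inside Splash: load the page, wait, scroll to the
# bottom, wait again for lazy-loaded content, then return the HTML.
SCROLL_SCRIPT = """
function main(splash)
  assert(splash:go(splash.args.url))
  assert(splash:wait(splash.args.wait))
  splash:runjs("window.scrollTo(0, document.body.scrollHeight)")
  assert(splash:wait(splash.args.wait))
  return splash:html()
end
"""


def build_execute_request(splash_url, target_url, wait=1.0):
    """Build (but do not send) a POST request for Splash's execute endpoint.

    build_execute_request is a hypothetical helper for illustration.
    """
    payload = {"lua_source": SCROLL_SCRIPT, "url": target_url, "wait": wait}
    return Request(
        f"{splash_url.rstrip('/')}/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With Splash running, passing the request to urllib.request.urlopen should return the HTML. From a spider, the same script can be passed as SplashRequest(url, endpoint='execute', args={'lua_source': SCROLL_SCRIPT, 'wait': 1}).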

Installation

Linux + Docker

Install Docker

Pull the image

docker pull scrapinghub/splash

Start the container

docker run -p 8050:8050 -p 5023:5023 scrapinghub/splash

Splash is now listening on 0.0.0.0, bound to ports 8050 (HTTP) and 5023 (telnet).

Once Splash is running, open its web page to check that it started successfully:
http://127.0.0.1:8050
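
The same check can be scripted. This is a minimal sketch, assuming Splash is published on the address above; splash_is_up is a made-up helper name, and it simply treats an HTTP 200 from the web page as "running".

```python
from http.client import HTTPException
from urllib.request import urlopen


def splash_is_up(splash_url="http://127.0.0.1:8050", timeout=3.0):
    """Return True if the Splash web page answers with HTTP 200.

    splash_is_up is a hypothetical helper, not part of Splash.
    """
    try:
        with urlopen(splash_url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, HTTPException):
        # Connection refused, timeout, HTTP error status, malformed reply...
        return False


if __name__ == "__main__":
    print("Splash reachable:", splash_is_up())
```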

Using Splash from Scrapy

Install

pip install scrapy-splash

Create a new spider project:

scrapy startproject test_splash
cd test_splash
scrapy genspider splash_test baidu.com

Edit settings.py

# Add the Splash configuration to settings.py and change the robots.txt setting

# URL of the rendering service
SPLASH_URL = 'http://127.0.0.1:8050'
# Downloader middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
# Duplicate filter that takes Splash arguments into account
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
# Splash-aware HTTP cache storage (only matters if the Scrapy HTTP cache is enabled)
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

# Do not obey robots.txt rules
ROBOTSTXT_OBEY = False
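
The scrapy-splash README also suggests enabling one spider middleware alongside the settings above. It is not needed for the example in these notes, so treat this as an optional addition:

```python
# settings.py (optional addition)
# Spider middleware recommended by the scrapy-splash README; it avoids
# storing duplicate Splash arguments repeatedly in the request queue.
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
```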

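Edit the spider code

spiders/splash_test.py

# spiders/splash_test.py
import scrapy
from scrapy_splash import SplashRequest  # request class provided by the scrapy_splash package


class SplashTestSpider(scrapy.Spider):
    name = 'splash_test'
    allowed_domains = ['baidu.com']
    start_urls = ['https://www.baidu.com/s?wd=13161933309']

    def start_requests(self):
        # Send the request through Splash instead of fetching the page directly
        yield SplashRequest(self.start_urls[0],
                            callback=self.parse_splash,
                            args={'wait': 10},  # seconds to wait after the page loads so JS can finish (not a timeout)
                            endpoint='render.html')  # Splash endpoint that returns the rendered HTML

    def parse_splash(self, response):
        # Write UTF-8 explicitly so the output does not depend on the platform default
        with open('with_splash.html', 'w', encoding='utf-8') as f:
            f.write(response.text)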

The HTML as it looks after the JavaScript has run is now saved in with_splash.html.

If you want Scrapy to parse the rendered HTML, just work with the response directly.

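def parse_splash(self, response):
    # Text of the <title> element in the rendered page
    title = response.css("title::text").get()
    # Scrapy callbacks must yield items (e.g. dicts) or requests, not bare strings
    yield {"title": title}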

That's it for these notes.

The palest ink is better than the best memory.

Let's learn together and improve together.