Distributed web crawler admin platform for managing spiders, regardless of language or framework.
中文 | English
Installation | Run | Screenshot | Architecture | Integration | Compare | Community & Sponsorship | CHANGELOG | Disclaimer
A Golang-based distributed web crawler management platform, supporting multiple languages including Python, Node.js, Go, Java, and PHP, and multiple web crawler frameworks including Scrapy, Puppeteer, and Selenium.
Installation
Three methods:
- Docker (Recommended)
- Direct Deploy (to understand the internals)
- Kubernetes (Multi-Node Deployment)
Pre-requisite (Docker)
- Docker 18.03+
- Redis 5.x+
- MongoDB 3.6+
- Docker Compose 1.24+ (optional but recommended)
Pre-requisite (Direct Deploy)
- Go 1.12+
- Node 8.12+
- Redis 5.x+
- MongoDB 3.6+
Architecture
The architecture of Crawlab consists of a Master Node, multiple Worker Nodes, and the Redis and MongoDB databases, which are mainly used for node communication and data storage.
The frontend app sends requests to the Master Node, which assigns tasks and deploys spiders through MongoDB and Redis. When a Worker Node receives a task, it executes the crawling task and stores the results in MongoDB. The architecture is much more concise than in versions before v0.3.0: the unnecessary Flower module, which provided node monitoring services, has been removed, and its duties are now handled by Redis.
Master Node
The Master Node is the core of the Crawlab architecture. It is the center control system of Crawlab.
The Master Node offers the following services:
- Crawling Task Coordination;
- Worker Node Management and Communication;
- Spider Deployment;
- Frontend and API Services;
- Task Execution (the Master Node can also act as a Worker Node)
The Master Node communicates with the frontend app and sends crawling tasks to Worker Nodes. Meanwhile, it synchronizes (deploys) spiders to Worker Nodes via Redis and MongoDB GridFS.
Worker Node
The main functionality of the Worker Nodes is to execute crawling tasks, store results and logs, and communicate with the Master Node through Redis PubSub. By increasing the number of Worker Nodes, Crawlab can scale horizontally, and different crawling tasks can be assigned to different nodes for execution.
MongoDB
MongoDB is the operational database of Crawlab. It stores data of nodes, spiders, tasks, schedules, etc. The MongoDB GridFS file system is the medium for the Master Node to store spider files and synchronize to the Worker Nodes.
Redis
Redis is a very popular key-value database. It provides node communication services in Crawlab. For example, each node executes `HSET` to store its info in a hash named `nodes` in Redis, and the Master Node identifies online nodes according to that hash.
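As a rough sketch of this registry pattern (the payload fields and the heartbeat logic below are illustrative assumptions, not Crawlab's exact schema):

```python
import json
import time

def build_node_info(node_id: str, ip: str) -> str:
    # Serialize this node's heartbeat payload; field names are
    # illustrative, not Crawlab's exact schema.
    return json.dumps({
        "_id": node_id,
        "ip": ip,
        "update_ts": int(time.time()),  # used to decide whether the node is online
    })

payload = build_node_info("worker-1", "10.0.0.5")
# With a live Redis client (e.g. redis-py) a node would register itself with:
#   r.hset("nodes", "worker-1", payload)
# and the Master Node would scan HGETALL("nodes"), treating entries with a
# recent update_ts as online.
print(payload)
```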
Frontend
The frontend is a single-page application (SPA) based on Vue-Element-Admin. It reuses many Element-UI components for the corresponding views.
Integration with Other Frameworks
The Crawlab SDK provides helper methods to make it easier to integrate your spiders into Crawlab, e.g. saving results. Install the `crawlab-sdk` package using pip.
Scrapy
In `settings.py` of your Scrapy project, find the variable named `ITEM_PIPELINES` (a `dict`) and add the content below.
ITEM_PIPELINES = {
'crawlab.pipelines.CrawlabMongoPipeline': 888,
}
Then, start the Scrapy spider. After it's done, you should be able to see the scraped results in Task Detail -> Result.
General Python Spider
Add the content below to your spider files to save results.
# import result saving method
from crawlab import save_item
# this is a result record, must be dict type
result = {'name': 'crawlab'}
# call result saving method
save_item(result)
Then, start the spider. After it's done, you should be able to see the scraped results in Task Detail -> Result.
Other Frameworks / Languages
A crawling task is actually executed through a shell command. The Task ID is passed to the crawling task process via an environment variable named `CRAWLAB_TASK_ID`, which allows the scraped data to be related back to the task. Crawlab also passes another environment variable, `CRAWLAB_COLLECTION`, as the name of the collection in which to store the result data.
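For instance, a spider in plain Python (without the SDK) might read these variables as in the sketch below; the `task_id` field name and the MongoDB insert are assumptions for illustration.

```python
import os

# CRAWLAB_TASK_ID and CRAWLAB_COLLECTION are set by Crawlab when it launches
# the task process; the fallbacks below are only for running outside Crawlab.
TASK_ID = os.environ.get("CRAWLAB_TASK_ID", "local-test")
COLLECTION = os.environ.get("CRAWLAB_COLLECTION", "results")

def tag_result(result: dict) -> dict:
    # Attach the task ID so the record can be related back to its task;
    # the field name "task_id" is an assumption for illustration.
    return dict(result, task_id=TASK_ID)

record = tag_result({"name": "crawlab"})
# With a MongoDB client you would then write the record into the collection
# named by CRAWLAB_COLLECTION, e.g.: db[COLLECTION].insert_one(record)
```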
Comparison with Other Frameworks
There are existing spider management frameworks, so why use Crawlab?
The reason is that most existing platforms depend on Scrapyd, which limits the choice to Python and Scrapy. Scrapy is certainly a great web crawling framework, but it cannot do everything.
Crawlab is easy to use, and general enough to accommodate spiders in any language and any framework. It also has a beautiful frontend interface that lets users manage spiders much more easily.
Framework | Technology | Pros | Cons | Github Stats |
---|---|---|---|---|
Crawlab | Golang + Vue | Not limited to Scrapy; available for all programming languages and frameworks. Beautiful UI. Native support for distributed spiders. Supports spider management, task management, cron jobs, result export, analytics, notifications, configurable spiders, an online code editor, etc. | No spider versioning support yet | |
ScrapydWeb | Python Flask + Vue | Beautiful UI, built-in Scrapy log parser, stats and graphs for task execution; supports node management, cron jobs, mail notifications, mobile. A full-featured spider management platform. | Does not support spiders other than Scrapy. Limited performance due to the Python Flask backend. | |
Gerapy | Python Django + Vue | Built by web crawler guru Germey Cui. Simple installation and deployment. Beautiful UI. Supports node management, code editing, configurable crawl rules, etc. | Again, does not support spiders other than Scrapy. Many bugs in v1.0 based on user feedback; improvements expected in v2.0. | |
SpiderKeeper | Python Flask | Open-source Scrapyhub. Concise and simple UI. Supports cron jobs. | Perhaps too simplified: no pagination, no node management, and no support for spiders other than Scrapy. |
Direct Deploy
Direct Deploy was the deployment method used before Docker was available, and it is somewhat more tedious than Docker deployment. However, understanding direct deployment can give you a deeper understanding of how Docker builds the Crawlab image.
Recommended for:
- Developers who know how to install and use Node, Golang, MongoDB, Redis, and Nginx
- Developers who want to understand the Crawlab source code and how it works
- Developers who need to do secondary development on Crawlab
Recommended versions:
- Go: 1.12+
- Node: 8.x+
- MongoDB: 3.6+
- Redis: 5.x+
- Nginx: 1.10+
1. Pull the Code
First, pull the code from GitHub to your local machine.
git clone https://github.com/crawlab-team/crawlab
2. Install the Node Environment
We use nvm (Node Version Manager) to manage the Node environment. If you are already familiar with Node, you can skip this section.
Please refer to the nvm GitHub page to install nvm, or run the command below.
⚠️Note: Windows users should use nvm-windows instead.
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.35.2/install.sh | bash
After installation, run the commands below to initialize nvm. Mac or Linux users can add them to their .profile or .bashrc file.
export NVM_DIR="$([ -z "${XDG_CONFIG_HOME-}" ] && printf %s "${HOME}/.nvm" || printf %s "${XDG_CONFIG_HOME}/nvm")"
[ -s "$NVM_DIR/nvm.sh" ] && \. "$NVM_DIR/nvm.sh" # This loads nvm
Then you can install and use a specific Node version. Run the following commands to install and enable Node 8.12.
nvm install 8.12
nvm use 8.12
Downloading and installing the Node version may take a while; please be patient. Once installed, run the command below to check whether the installation succeeded.
node -v
If a version number is printed, the installation succeeded.
3. Install Frontend and Backend Dependencies
Install the libraries required by the frontend.
npm install -g yarn
cd frontend
yarn install
Next, install the libraries required by the backend.
Before running this step, if you are in mainland China, you need to configure the Go Module proxy by setting the environment variable GOPROXY to https://goproxy.cn. On Linux or Mac, run the following command.
export GOPROXY=https://goproxy.cn
Then, run the following command to install the backend.
cd ../backend
go install ./...
4. Build the Frontend
This step builds the frontend. Before building, we need to configure the frontend deployment environment variables.
Open ./frontend/.env.production, which contains the following.
NODE_ENV='production'
VUE_APP_BASE_URL=/api
VUE_APP_CRAWLAB_BASE_URL=https://api.crawlab.cn
VUE_APP_DOC_URL=http://docs.crawlab.cn
Here is what each environment variable does:
- NODE_ENV: the current environment (development / test / production); the default production is fine, no need to change it;
- VUE_APP_BASE_URL: the backend API address; change this to your API's public address, e.g. http://8.8.8.8:8000;
- VUE_APP_CRAWLAB_BASE_URL: the API address of the Crawlab remote service, currently used mainly for sending statistics; no need to change it;
- VUE_APP_DOC_URL: the documentation address; no need to change it.
After configuration, run the following commands.
cd ../frontend
npm run build:prod
When the build finishes, a dist folder containing the bundled static files will be created under the ./frontend directory.
5. Nginx
Install nginx. On Ubuntu 16.04, use the following command.
sudo apt-get install nginx
Create the file /etc/nginx/conf.d/crawlab.conf with the following content.
server {
listen 8080;
root /path/to/dist;
index index.html;
}
Here, root is the root directory of the static files, in this case the static files bundled by npm.
Now, simply reload the nginx service and the frontend is served.
sudo nginx -s reload
6. MongoDB & Redis
6.1 Install MongoDB
Please follow the MongoDB tutorial to complete the MongoDB installation.
6.2 Install Redis
Please follow the Redis installation guide to complete the Redis installation.
7. Configuration
Edit the configuration file crawlab/backend/conf/config.yaml, which is in yaml format. See Configure Crawlab for configuration details.
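As a rough illustration only (the key names below are assumptions; consult the Configure Crawlab documentation for the authoritative schema), the file groups connection settings for the databases and the server:

```yaml
# Hypothetical sketch of crawlab/backend/conf/config.yaml; key names are
# illustrative, see the official configuration docs for the real schema.
api:
  address: "localhost:8000"
mongo:
  host: localhost
  port: 27017
  db: crawlab
redis:
  address: localhost
  port: 6379
```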
8. Build the Backend
Run the following commands.
cd ../backend
go build
The go build command compiles the Golang code into an executable. Note that go build outputs the binary to the current directory, while the go install from step 3 places it in $GOPATH/bin by default.
9. Start the Service
This starts the backend service. Run the following command.
$GOPATH/bin/crawlab
Then open http://localhost:8080 in your browser and you should see the interface.
⚠️Note: when starting the backend, make sure your working directory is the ./backend path of the Crawlab project pulled from GitHub.
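If you want the backend to keep running after you log out, one option is to wrap it in a process manager. A hypothetical systemd unit might look like the sketch below (paths and names are assumptions; adjust them to your environment):

```ini
# /etc/systemd/system/crawlab.service -- hypothetical example
[Unit]
Description=Crawlab backend
After=network.target

[Service]
# The working directory must be the ./backend path of the Crawlab project
WorkingDirectory=/path/to/crawlab/backend
ExecStart=/path/to/gopath/bin/crawlab
Restart=on-failure

[Install]
WantedBy=multi-user.target
```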
10. Next Steps
Please refer to the Spiders section to learn in detail how to use Crawlab.