Friday 16 April 2021

Crawlab

Distributed web crawler admin platform for managing spiders, regardless of language or framework.


A Golang-based distributed web crawler management platform, supporting various languages including Python, NodeJS, Go, Java, and PHP, and various web crawler frameworks including Scrapy, Puppeteer, and Selenium.

Demo | Documentation

Installation

Three methods:

  1. Docker (Recommended)
  2. Direct Deploy (to understand Crawlab's internals)
  3. Kubernetes (Multi-Node Deployment)

Pre-requisite (Docker)

  • Docker 18.03+
  • Redis 5.x+
  • MongoDB 3.6+
  • Docker Compose 1.24+ (optional but recommended)

Pre-requisite (Direct Deploy)

  • Go 1.12+
  • Node 8.12+
  • Redis 5.x+
  • MongoDB 3.6+

Architecture

The architecture of Crawlab consists of a Master Node, multiple Worker Nodes, and the Redis and MongoDB databases, which are mainly responsible for node communication and data storage.

The frontend app makes requests to the Master Node, which assigns tasks and deploys spiders through MongoDB and Redis. When a Worker Node receives a task, it executes the crawling task and stores the results in MongoDB. The architecture is much leaner than in versions before v0.3.0: the unnecessary Flower module, which provided node monitoring services, has been removed, and node monitoring is now handled by Redis.

Master Node

The Master Node is the core of the Crawlab architecture. It is the central control system of Crawlab.

The Master Node provides the following services:

  1. Crawling Task Coordination;
  2. Worker Node Management and Communication;
  3. Spider Deployment;
  4. Frontend and API Services;
  5. Task Execution (the Master Node can also act as a Worker Node)

The Master Node communicates with the frontend app and sends crawling tasks to Worker Nodes. Meanwhile, it synchronizes (deploys) spiders to Worker Nodes via Redis and MongoDB GridFS.

Worker Node

The main job of the Worker Nodes is to execute crawling tasks, store results and logs, and communicate with the Master Node through Redis PubSub. By increasing the number of Worker Nodes, Crawlab can scale horizontally, and different crawling tasks can be assigned to different nodes for execution.
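
As a rough illustration of this PubSub-based messaging, here is a minimal Python sketch. It is conceptual only: Crawlab's real implementation is in Go, and the channel name and message format below are invented for illustration.

# Conceptual sketch of master/worker messaging over Redis PubSub.
# The channel name and message shape are illustrative, not Crawlab's actual protocol.
import json
import redis

r = redis.Redis(host='localhost', port=6379)

def dispatch_task(node_key, task):
    # Master side: publish a task message on the channel the worker listens to.
    r.publish('nodes:' + node_key, json.dumps(task))

def worker_loop(node_key):
    # Worker side: subscribe to its own channel and execute incoming tasks.
    pubsub = r.pubsub()
    pubsub.subscribe('nodes:' + node_key)
    for message in pubsub.listen():
        if message['type'] != 'message':
            continue
        task = json.loads(message['data'])
        print('executing task', task['id'])  # run the spider's shell command here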

MongoDB

MongoDB is the operational database of Crawlab. It stores data for nodes, spiders, tasks, schedules, etc. The MongoDB GridFS file system is the medium through which the Master Node stores spider files and synchronizes them to the Worker Nodes.
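
To make the GridFS-based synchronization concrete, here is a minimal Python sketch using pymongo's gridfs module. It is purely illustrative; the database name, file name, and zip packaging are assumptions, not Crawlab's exact scheme.

# Illustrative GridFS usage: store a packaged spider and fetch it on another node.
import gridfs
from pymongo import MongoClient

db = MongoClient('mongodb://localhost:27017')['crawlab_demo']  # demo database name (assumption)
fs = gridfs.GridFS(db)

# "Master" side: upload a packaged spider (e.g. a zip archive) into GridFS.
with open('my_spider.zip', 'rb') as f:
    file_id = fs.put(f, filename='my_spider.zip')

# "Worker" side: download the latest version of the package and write it locally.
data = fs.get_last_version('my_spider.zip').read()
with open('/tmp/my_spider.zip', 'wb') as f:
    f.write(data)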

Redis

Redis is a very popular key-value database. It provides the node communication services in Crawlab. For example, each node executes HSET to write its info into a Redis hash named nodes, and the Master Node identifies online nodes from that hash.
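
Below is a minimal Python sketch of this heartbeat mechanism. It is conceptual only; the exact fields Crawlab stores in the nodes hash and its timeout are assumed here for illustration.

# Worker side: periodically HSET this node's info into the "nodes" hash.
import json
import socket
import time
import redis

r = redis.Redis(host='localhost', port=6379)
node_key = socket.gethostname()
r.hset('nodes', node_key, json.dumps({'ip': '127.0.0.1', 'ts': time.time()}))

# Master side: read the hash and treat recently updated entries as online.
for key, value in r.hgetall('nodes').items():
    info = json.loads(value)
    online = time.time() - info['ts'] < 60  # assumed 60-second heartbeat timeout
    print(key.decode(), 'online' if online else 'offline')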

Frontend

The frontend is a SPA (single-page application) based on Vue-Element-Admin. It reuses many Element-UI components to build the corresponding views.

Integration with Other Frameworks

Crawlab SDK provides some helper methods to make it easier for you to integrate your spiders into Crawlab, e.g. saving results.

⚠️Note: make sure you have already installed crawlab-sdk using pip.

Scrapy

In settings.py in your Scrapy project, find the variable named ITEM_PIPELINES (a dict) and add the content below.

ITEM_PIPELINES = {
    'crawlab.pipelines.CrawlabMongoPipeline': 888,
}

Then, start the Scrapy spider. Once it finishes, you should be able to see the scraped results in Task Detail -> Result.

General Python Spider

Please add the content below to your spider files to save results.

# import result saving method
from crawlab import save_item

# this is a result record, must be dict type
result = {'name': 'crawlab'}

# call result saving method
save_item(result)

Then, start the spider. Once it finishes, you should be able to see the scraped results in Task Detail -> Result.

Other Frameworks / Languages

A crawling task is actually executed as a shell command. The Task ID is passed to the task process via an environment variable named CRAWLAB_TASK_ID, which allows the scraped data to be related to the task. In addition, Crawlab passes another environment variable, CRAWLAB_COLLECTION, as the name of the collection in which to store the results.
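
For example, a spider written in any language can read these environment variables and write its results itself. Below is a minimal Python sketch using pymongo; the MongoDB connection string, database name, and the task_id field name are assumptions for illustration (the Crawlab SDK normally takes care of this).

import os
from pymongo import MongoClient

task_id = os.environ.get('CRAWLAB_TASK_ID')
collection_name = os.environ.get('CRAWLAB_COLLECTION')

# Connection settings are assumed; point this at the MongoDB instance Crawlab uses.
col = MongoClient('mongodb://localhost:27017')['crawlab_db'][collection_name]

# Tag each result with the task id so it can be related back to the task.
result = {'name': 'crawlab', 'task_id': task_id}
col.insert_one(result)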

Comparison with Other Frameworks

There are existing spider management frameworks, so why use Crawlab?

The reason is that most existing platforms depend on Scrapyd, which limits the choice to Python and Scrapy. Scrapy is certainly a great web crawling framework, but it cannot do everything.

Crawlab is easy to use and general enough to run spiders written in any language and any framework. It also has a polished frontend interface that lets users manage spiders much more easily.

Below is a comparison of each framework's technology, pros, and cons.

  • Crawlab (Golang + Vue). Pros: not limited to Scrapy; works with any programming language and framework; beautiful UI; native support for distributed spiders; supports spider management, task management, cron jobs, result export, analytics, notifications, configurable spiders, an online code editor, and more. Cons: no spider versioning yet.
  • ScrapydWeb (Python Flask + Vue). Pros: beautiful UI; built-in Scrapy log parser; stats and graphs for task execution; supports node management, cron jobs, mail notifications, and mobile; a full-featured spider management platform. Cons: no support for spiders other than Scrapy; performance is limited by the Python Flask backend.
  • Gerapy (Python Django + Vue). Pros: built by web crawler expert Germey Cui; simple installation and deployment; beautiful UI; supports node management, code editing, configurable crawl rules, etc. Cons: again, no support for spiders other than Scrapy; many user-reported bugs in v1.0; improvements expected in v2.0.
  • SpiderKeeper (Python Flask). Pros: an open-source Scrapyhub; concise and simple UI; supports cron jobs. Cons: perhaps too simplified; no pagination, no node management, and no support for spiders other than Scrapy.
from https://github.com/crawlab-team/crawlab
-------------------------

Direct Deploy

Direct deployment is how Crawlab was deployed before Docker existed; it is somewhat more tedious than the Docker deployment. However, understanding direct deployment helps you understand more deeply how the Crawlab Docker image is built.

Recommended for:

  • Developers who know how to install and use Node, Golang, MongoDB, Redis, and Nginx
  • Developers who want to understand the Crawlab source code and how it works
  • Developers who need to do secondary development on Crawlab

Recommended versions:

  • Go: 1.12+
  • Node: 8.x+
  • MongoDB: 3.6+
  • Redis: 5.x+
  • Nginx: 1.10+

1. Pull the Code

First, pull the code from GitHub to your local machine.

git clone https://github.com/crawlab-team/crawlab

2. Install the Node Environment

We use nvm (Node Version Manager) to manage the Node environment. If you are already familiar with Node, you can skip this section.

Please follow the instructions on the nvm GitHub page to install nvm, or run the command below.

⚠️Note: Windows users should use nvm-windows instead.

curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.35.2/install.sh | bash

After installation, run the commands below to initialize nvm. Mac or Linux users can add them to their .profile or .bashrc file.

export NVM_DIR="$([ -z "${XDG_CONFIG_HOME-}" ] && printf %s "${HOME}/.nvm" || printf %s "${XDG_CONFIG_HOME}/nvm")"
[ -s "$NVM_DIR/nvm.sh" ] && \. "$NVM_DIR/nvm.sh" # This loads nvm

Then you can install and use a specific Node version. Run the following commands to install and enable Node 8.12.

nvm install 8.12
nvm use 8.12

Downloading and installing the corresponding Node version may take a while; please be patient. Once it is installed, run the command below to check whether the installation succeeded.

node -v

If a version number is printed, the installation was successful.

3. Install Frontend and Backend Dependencies

Install the libraries required by the frontend.

npm install -g yarn
cd frontend
yarn install

Next, install the libraries required by the backend.

Before this step, if you are located in mainland China, you should set the Go Module proxy by setting the environment variable GOPROXY to https://goproxy.cn. On Linux or Mac, run the following command.

export GOPROXY=https://goproxy.cn

Then run the following commands to install the backend.

cd ../backend
go install ./...

4. Build the Frontend

This step builds the frontend. Before building, we need to configure the frontend's deployment environment variables.

Open ./frontend/.env.production; its content is as follows.

NODE_ENV='production'
VUE_APP_BASE_URL=/api
VUE_APP_CRAWLAB_BASE_URL=https://api.crawlab.cn
VUE_APP_DOC_URL=http://docs.crawlab.cn

Here is what each environment variable does:

  • NODE_ENV: the current environment (development / test / production); the default production is fine, no change needed
  • VUE_APP_BASE_URL: the address of the backend API; change it to your API's public address, e.g. http://8.8.8.8:8000
  • VUE_APP_CRAWLAB_BASE_URL: the API address of the Crawlab remote service, currently used mainly for sending usage statistics; no change needed
  • VUE_APP_DOC_URL: the documentation URL; no change needed

After configuring, run the following commands.

cd ../frontend
npm run build:prod

When the build finishes, a dist folder is created under ./frontend, containing the bundled static files.

5. Nginx

Install nginx; on Ubuntu 16.04 the command is as follows.

sudo apt-get install nginx

Create the file /etc/nginx/conf.d/crawlab.conf with the following content.

server {
    listen    8080;
    root    /path/to/dist;
    index    index.html;
}

Here, root is the root directory of the static files, i.e. the static files produced by the npm build.

Now, you only need to reload nginx to serve the frontend.

sudo nginx -s reload

6. MongoDB & Redis

6.1 Install MongoDB

Please follow the MongoDB tutorial to install MongoDB.

6.2 Install Redis

Please follow the Redis installation guide to install Redis.

7. Configuration

Edit the configuration file crawlab/backend/conf/config.yaml, which is in YAML format. For configuration details, see the Configure Crawlab documentation.

8. Build the Backend

Run the following commands.

cd ../backend
go build

The go build command compiles the Golang code into a single executable. Note that go build writes the binary to the current directory; it is the go install ./... step above that places a crawlab binary in $GOPATH/bin.

9. Start the Services

This means starting the backend service. Run the following command.

$GOPATH/bin/crawlab

Then open http://localhost:8080 in your browser and you should see the UI.

⚠️Note: when starting the service, make sure your working directory is the ./backend directory of the Crawlab project pulled from GitHub.

10. Next Steps

Please refer to the Spider section of the documentation for details on how to use Crawlab.

11. References

from https://docs.crawlab.cn/zh/Installation/Direct.html
