Total Pageviews

Monday 13 June 2016

webmagic -一个基于java的爬虫框架

A scalable web crawler framework for Java.

Readme in Chinese
User Manual (Chinese)
Build Status
A scalable crawler framework. It covers the whole lifecycle of crawler: downloading, url management, content extraction and persistent. It can simplify the development of a specific crawler.

Features:

  • Simple core with high flexibility.
  • Simple API for html extracting.
  • Annotation with POJO to customize a crawler, no configuration.
  • Multi-thread and Distribution support.
  • Easy to be integrated.

Install:

Add dependencies to your pom.xml:
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.5.3</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.5.3</version>
</dependency>
WebMagic use slf4j with slf4j-log4j12 implementation. If you customized your slf4j implementation, please exclude slf4j-log4j12.
<exclusions>
    <exclusion>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
    </exclusion>
</exclusions>
 
from https://github.com/code4craft/webmagic 
https://github.com/webmagic-io/docs
https://github.com/webmagic-io/webmagic-io.github.io