A collection of awesome web crawlers, spiders, and related resources in different languages.
Python
- Scrapy - A fast high-level screen scraping and web crawling framework.
- django-dynamic-scraper - Creating Scrapy scrapers via the Django admin interface.
- Scrapy-Redis - Redis-based components for Scrapy.
- scrapy-cluster - Uses Redis and Kafka to create a distributed, on-demand scraping cluster.
- distribute_crawler - Uses Scrapy, Redis, MongoDB, and Graphite to create a distributed spider.
- pyspider - A powerful spider system.
- cola - A distributed crawling framework.
- Demiurge - PyQuery-based scraping micro-framework.
- Scrapely - A pure-Python HTML screen-scraping library.
- feedparser - Universal feed parser.
- you-get - Dumb downloader that scrapes the web.
- Grab - Site scraping framework.
- MechanicalSoup - A Python library for automating interaction with websites.
- portia - Visual scraping for Scrapy.
- crawley - Pythonic crawling/scraping framework based on non-blocking I/O operations.
- RoboBrowser - A simple, Pythonic library for browsing the web without a standalone web browser.
- MSpider - A simple, easy-to-use spider using gevent and JavaScript rendering.
- brownant - A lightweight web data extracting framework.
- PSpider - A simple spider framework in Python 3.
- Gain - Web crawling framework based on asyncio for everyone.
- sukhoi - Minimalist and powerful Web Crawler.
Java
- Apache Nutch - Highly extensible, highly scalable web crawler for production environments.
- anthelion - A plugin for Apache Nutch to crawl semantic annotations within HTML pages.
- Crawler4j - Simple and lightweight web crawler.
- JSoup - Scrapes, parses, manipulates and cleans HTML.
- websphinx - Website-Specific Processors for HTML information extraction.
- Open Search Server - A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything.
- Gecco - An easy-to-use, lightweight web crawler.
- WebCollector - Simple interfaces for crawling the Web; you can set up a multi-threaded web crawler in less than 5 minutes.
- Webmagic - A scalable crawler framework.
- Spiderman - A scalable, extensible, multi-threaded web crawler.
- Spiderman2 - A distributed web crawler framework that supports JavaScript rendering.
- Heritrix3 - Extensible, web-scale, archival-quality web crawler project.
- SeimiCrawler - An agile, distributed crawler framework.
- StormCrawler - An open source collection of resources for building low-latency, scalable web crawlers on Apache Storm.
- Spark-Crawler - Evolving Apache Nutch to run on Spark.
- webBee - A DFS web spider.
C#
- ccrawler - Built in C# 3.5. It contains a simple extension, a web content categorizer, which can separate web pages depending on their content.
- SimpleCrawler - Simple spider based on multithreading and regular expressions.
- DotnetSpider - A cross-platform, lightweight spider developed in C#.
- Abot - C# web crawler built for speed and flexibility.
- Hawk - Advanced Crawler and ETL tool written in C#/WPF.
- SkyScraper - An asynchronous web scraper / web crawler using async / await and Reactive Extensions.
JavaScript
- scraperjs - A complete and versatile web scraper.
- scrape-it - A Node.js scraper for humans.
- simplecrawler - Event driven web crawler.
- node-crawler - Node-crawler has a clean, simple API.
- js-crawler - Web crawler for Node.js; both HTTP and HTTPS are supported.
- x-ray - Web scraper with pagination and crawler support.
- node-osmosis - HTML/XML parser and web scraper for Node.js.
- web-scraper-chrome-extension - Web data extraction tool implemented as a Chrome extension.
- supercrawler - Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits.
PHP
- Goutte - A screen scraping and web crawling library for PHP.
- laravel-goutte - Laravel 5 Facade for Goutte.
- dom-crawler - The DomCrawler component eases DOM navigation for HTML and XML documents.
- pspider - Parallel web crawler written in PHP.
- php-spider - A configurable and extensible PHP web spider.
C++
- open-source-search-engine - A distributed open source search engine and spider/crawler written in C/C++.
C
- httrack - Copy websites to your computer.
Ruby
- upton - A batteries-included framework for easy web-scraping. Just add CSS (or do more).
- wombat - Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.
- RubyRetriever - RubyRetriever is a Web Crawler, Scraper & File Harvester.
- Spidr - Spider a site, multiple domains, certain links, or infinitely.
- Cobweb - Web crawler with very flexible crawling options, standalone or using Sidekiq.
- mechanize - Automated web interaction & crawling.
R
- rvest - Simple web scraping for R.
Erlang
- ebot - A scalable, distributed, and highly configurable web crawler.
Perl
- web-scraper - Web Scraping Toolkit using HTML and CSS Selectors or XPath expressions.
Go
- pholcus - A distributed, high concurrency and powerful web crawler.
- gocrawl - Polite, slim and concurrent web crawler.
- fetchbot - A simple and flexible web crawler that follows the robots.txt policies and crawl delays.
- go_spider - An awesome Go concurrent crawler (spider) framework.
- dht - BitTorrent DHT protocol and DHT spider.
- ants-go - An open source, distributed, RESTful crawler engine in Go.
- scrape - A simple, higher level interface for Go web scraping.
- creeper - The Next Generation Crawler Framework (Go).
Scala
- crawler - Scala DSL for web crawling.
- scrala - Scala crawler (spider) framework, inspired by Scrapy.
- ferrit - Ferrit is a web crawler service written in Scala using Akka, Spray and Cassandra.
from https://github.com/BruceDone/awesome-crawler