Node.js + Cheerio + Request - a Great Combo
As it happens, Node.js and associated technologies are a great fit for this purpose. You get to use a familiar query syntax, and there is plenty of tooling available.
My absolute favorite used to be Zombie.js. Although designed mainly for testing, it often works alright for scraping. node.io is another good alternative. In a certain case I had to use a combination of request, htmlparser and soupselect, as zombie just didn't bite there.
These days I like to use a combination of cheerio and request. Getting this combo to work on various environments is easier than with Zombie. In addition you get to operate with a familiar jQuery syntax, which is a big bonus as well.
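To give an idea of how little code the combo takes, here is a minimal sketch. The URL and the selector are made up for illustration; point them at whatever page and elements you actually care about.

```js
var request = require('request');
var cheerio = require('cheerio');

// fetch the page as plain HTML
request('http://example.com/lunch', function(err, res, body) {
    if(err) {
        return console.error(err);
    }

    // load the markup into cheerio and query it with jQuery-like selectors
    var $ = cheerio.load(body);

    $('.dish').each(function() {
        console.log($(this).text().trim());
    });
});
```

That is pretty much the whole happy path. The rest of the work goes into figuring out the right selectors and cleaning up the data.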
Basic Workflow
When it comes to scraping, the basic workflow is quite simple. During development it can be useful to stub out functionality and fill it in as you progress. Here is the rough approach I use (the last steps are sketched in code after the list):
- Figure out how the data is currently structured
- Come up with selectors to access it
- Map the data into some new structure
- Serialize the data into some machine readable format based on your needs
- Serve the data through a web interface if you so want
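Below is a rough sketch of those last steps: mapping the scraped rows into a new structure, serializing them as JSON and serving them over HTTP. The scrape helper and its fields are hypothetical placeholders for the request + cheerio logic shown earlier.

```js
var http = require('http');

// hypothetical helper wrapping the request + cheerio logic from above
function scrape(cb) {
    // ... fetch and parse here, then map into a structure of your own ...
    cb(null, [
        {name: 'Example dish', price: 8.5}
    ]);
}

http.createServer(function(req, res) {
    scrape(function(err, dishes) {
        if(err) {
            res.writeHead(500);
            return res.end();
        }

        res.writeHead(200, {'Content-Type': 'application/json'});
        // serialize into a machine readable format, JSON in this case
        res.end(JSON.stringify(dishes));
    });
}).listen(3000);
```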
It can be helpful to know how to use Chrome Developer Tools or Firebug effectively. The SelectorGadget bookmarklet may come in handy too. If you feel like it, play around with jQuery selectors in your browser. Being able to compose selectors effectively will be very useful.
Examples
sonaatti-scraper scrapes
some restaurant data. It uses node.io, comes with a small CLI tool and
makes it possible to serve the data through a web interface.
There is some room for improvement. For instance, it would be a good idea not to
scrape the data each time a query is made against the web API.
There should be a cache of some sort to avoid unnecessary
polling. It is a good starting point, though, given its simplicity.
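One possible way to implement such a cache is to keep the scraped data in memory and refresh it only once it is older than a given time-to-live. This is just a sketch; scrape refers to the same hypothetical helper as above.

```js
var cache = {data: null, fetchedAt: 0};
var TTL = 15 * 60 * 1000; // refresh at most every fifteen minutes

function getData(cb) {
    // serve from the cache while it is still fresh
    if(cache.data && Date.now() - cache.fetchedAt < TTL) {
        return cb(null, cache.data);
    }

    scrape(function(err, data) {
        if(err) {
            return cb(err);
        }

        cache = {data: data, fetchedAt: Date.now()};
        cb(null, data);
    });
}
```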
My other example, jklevents, is based on cheerio.
It is a lot more complex as it parses through a whole collection of
pages, not just one. It also performs tasks such as geocoding to further
improve the quality of the data.
In my third example, f500-scraper,
I had to use a combination of tools as zombie didn't quite work. The
issue had something to do with the way the pages were loaded using
JavaScript, so the DOM just wasn't ready when I needed to scrape it.
Instead I ended up capturing the page data the good old way and
applying some force to it. As it happens, it worked.
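For the curious, the request + htmlparser + soupselect combination mentioned earlier looks roughly like this. Again just a sketch, with a made-up URL and selector:

```js
var request = require('request');
var htmlparser = require('htmlparser');
var select = require('soupselect').select;

request('http://example.com/companies', function(err, res, body) {
    if(err) {
        return console.error(err);
    }

    // parse the raw HTML into a DOM tree
    var handler = new htmlparser.DefaultHandler(function(err, dom) {
        if(err) {
            return console.error(err);
        }

        // pick elements out of the tree with CSS-like selectors
        select(dom, 'td.company a').forEach(function(link) {
            console.log(link.attribs.href);
        });
    });

    new htmlparser.Parser(handler).parseComplete(body);
});
```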
lte-scraper uses cheerio and request. The implementation is somewhat short and may be worth investigating.
Other Considerations
When scraping, be polite. Sometimes the "targets" of scraping might
actually be happy that you are doing some of the work for them. In the case
of jkl-event-scraper I contacted the rights holder of the data and we
agreed on an attribution deal. So it is alright to use the data in a
commercial way, provided there is attribution.
This is just a point I wanted to make, as there are times when good
things can come out of this sort of arrangement. In the best case you might
even earn a client this way.
Conclusion
Node.js is an amazing platform for scraping. The tooling is mature
enough, and you get to use a familiar query syntax. It does not
get much better than that, for me at least. I believe it could be
interesting to try applying fuzzier approaches (think AI) to scraping.
For instance, in the case of restaurant data this might lead to a more
generic scraper you can then apply to many pages containing that type of
data. After all, there is a certain structure to it, although the way it
is laid out in the DOM will always vary somewhat.
Even the crude methods described here briefly are often quite enough. But you can definitely make scraping a more interesting problem if you want to.
from http://www.nixtu.info/2013/03/scraping-web-using-nodejs.html