I'm looking for a way to render arbitrary Web pages -- including CSS and JavaScript -- and access the resulting DOM tree programmatically, i.e., in an automated/headless fashion. I want to be able to ask the following questions of the resulting DOM tree:
- For a given element, what font family, size, and color is the text?
- How tall and wide (in pixels) is a given <div>, <table>, etc.?
- What are the x/y coordinates of a given element (from the upper-left corner of the page, or lower-left, or wherever)?
- For a given element, what is its text content?
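To make these questions concrete, here is a minimal sketch of the kind of per-element record I'm after. Everything in it is hypothetical: `RenderedElement`, its field names, and `lower_left_origin` are made up for illustration and do not come from Gecko, WebKit, or any existing library. It assumes coordinates are reported in pixels from the page's upper-left corner, with a helper to convert to a lower-left origin.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RenderedElement:
    """Hypothetical per-element record; not part of any existing library."""

    tag: str             # e.g. "div" or "table"
    text: str            # text content of the element
    font_family: str     # computed font family
    font_size_px: float  # computed font size, in pixels
    color: str           # computed text color, e.g. "rgb(0, 0, 0)"
    x: float             # left edge, pixels from the page's upper-left corner
    y: float             # top edge, pixels from the page's upper-left corner
    width: float         # rendered width, in pixels
    height: float        # rendered height, in pixels

    def lower_left_origin(self, page_height: float) -> Tuple[float, float]:
        """Return the element's bottom-left corner, measured from the
        page's lower-left corner (y axis pointing up)."""
        return (self.x, page_height - (self.y + self.height))


# Example with made-up values: a 300x50 box placed 20px below the top
# of a page that is 1000px tall.
box = RenderedElement(
    tag="div", text="Hello", font_family="Georgia", font_size_px=16.0,
    color="rgb(0, 0, 0)", x=10, y=20, width=300, height=50,
)
print(box.lower_left_origin(page_height=1000))  # (10, 930)
```

The point of the sketch is only the shape of the data: one flat record per element, with computed style, geometry, and text side by side.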
The rendering must be state-of-the-art, handling the same advanced CSS that Firefox, Safari, and IE handle. It should work on Linux. Bonus points if there's a Python API for this magical DOM tree.
This is all stuff that standard in-page JavaScript could accomplish, but the catch is that I need to be able to do it in a completely automated way, on arbitrary pages, on a headless server.
I know Gecko and WebKit provide this, but I'm not sure where to start with them. The docs and articles I've read seem to focus more on embedding the full browser window in a GUI application than on embedding the rendering engine itself and manipulating the resulting pages.
Help! If you have any clues, I'd be grateful if you left a comment or got in touch with me.
Comments
Posted by Andrew Sutherland on May 2, 2008, at 2:45 a.m.:
Posted by Rene Dudfield on May 2, 2008, at 3:19 a.m.:
Posted by Michael Twomey on May 2, 2008, at 4:46 a.m.:
Posted by Justin Mason on May 2, 2008, at 5:01 a.m.:
Posted by Gábor Farkas on May 2, 2008, at 5:10 a.m.:
Posted by Jason on May 2, 2008, at 7:04 a.m.:
Posted by anonymous on May 2, 2008, at 8:15 a.m.:
Posted by anonymous on May 2, 2008, at 8:50 a.m.:
Posted by anonymous on May 2, 2008, at 10:15 a.m.:
Posted by alan taylor on May 2, 2008, at 10:36 a.m.:
Posted by Matthew Marshall on May 2, 2008, at 10:42 a.m.:
Posted by Kumar McMillan on May 2, 2008, at 11:40 a.m.:
Posted by anonymous on May 2, 2008, at 12:02 p.m.:
Posted by Ryan Shaw on May 2, 2008, at 12:26 p.m.:
Posted by mikeal on May 2, 2008, at 1:49 p.m.:
Posted by Henning on May 2, 2008, at 2:29 p.m.:
Posted by anonymous on May 2, 2008, at 6:08 p.m.:
Posted by anonymous on May 2, 2008, at 6:12 p.m.:
Posted by Phil on May 2, 2008, at 7:31 p.m.:
Posted by Daniel on May 2, 2008, at 7:46 p.m.:
Posted by rex on May 3, 2008, at 8:44 a.m.:
Posted by anonymous on May 5, 2008, at 4:55 a.m.:
Posted by Eric Moritz on May 5, 2008, at 3:50 p.m.: