What's old (scraping) is new again (microformats)

You know, I think I fully realized why microformats seem so appealing and familiar to me:

The appealing part stems from the fact that I've been working on building web scrapers for years now, using a slew of languages (Perl, Java, Python, XSL, JavaScript, bash scripts) and approaches (HTML parsing, regexes, XSL, tidy-and-XPath). Anything to make that easier strikes a nice chord with me.

But, I just remembered that a few summers ago, I wrote a bit for the O'Reilly book Spidering Hacks that anticipates the notion of a microformat.

If you happen to have the book (which I highly recommend), or can read it on Safari (which I also highly recommend), take a look at Hack #96: Making Your Resources Scrapable with Regular Expressions.

Granted, my proto-solution published there suggested using regexes and consistently named HTML id attributes. So, I'm much more pleased with the microformat approach using CSS classes. And, I'm currently trying to write a general microformat parser using Python's HTMLParser class, which beats using regexes.
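
To give a flavor of the HTMLParser approach, here's a minimal sketch. It is not the general parser itself, just an illustration under a few assumptions: the names MicroformatClassParser and extract_classes are made up for this example, the sample markup is invented (using the hCard class names vcard, fn, and org), and it only gathers the text inside elements carrying the class names you ask for.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag, so they are not tracked as "open".
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class MicroformatClassParser(HTMLParser):
    """Hypothetical helper: collect the text of elements by class name."""

    def __init__(self, wanted):
        super().__init__(convert_charrefs=True)
        self.wanted = set(wanted)
        self.open_elements = []  # (tag, matched class names, text pieces)
        self.found = []          # (class name, text) in closing order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # A class attribute may hold several space-separated names.
        classes = (dict(attrs).get("class") or "").split()
        matched = [name for name in classes if name in self.wanted]
        self.open_elements.append((tag, matched, []))

    def handle_data(self, data):
        # Text belongs to every open element that matched a wanted class.
        for _tag, matched, pieces in self.open_elements:
            if matched:
                pieces.append(data)

    def handle_endtag(self, tag):
        # Close back to the nearest open element with this tag name,
        # which tolerates inner tags that were never closed.
        for i in range(len(self.open_elements) - 1, -1, -1):
            if self.open_elements[i][0] == tag:
                closed = self.open_elements[i:]
                del self.open_elements[i:]
                for _tag, matched, pieces in reversed(closed):
                    text = " ".join("".join(pieces).split())
                    for name in matched:
                        self.found.append((name, text))
                break


def extract_classes(html, wanted):
    parser = MicroformatClassParser(wanted)
    parser.feed(html)
    parser.close()
    return parser.found


# Invented sample data, for illustration only.
SAMPLE = """
<div class="vcard">
  <span class="fn">Jane Doe</span>,
  <span class="org">Example Corp</span>
</div>
"""

if __name__ == "__main__":
    for name, text in extract_classes(SAMPLE, {"fn", "org"}):
        print(name, "=", text)
```

The point of the sketch is that the class names do the work: nothing here depends on where the markup sits in the page, which is exactly what the regex-and-id approach struggled with.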

Just thought I'd share the revelation, and toot my own fledgling writing horn. :)

Archived Comments

  • I dunno... HTMLParser and screen scraping has always been an unsatisfying experience to me. Not very reliable, and fails in unexpected ways.
  • Well, scraping is most certainly nothing you want to really rely on without watching it, but I've got at least a dozen or two useful and active feeds running as a result of scraping for the past few years-- it's certainly better than nothing :)
  • ha, I never realised you were one of the authors of that book -- awesome! I wrote sitescooper, which was a proto-scraper for a wide variety of sites to transcode them into an offline-readable format for small-screen handheld devices. I really like the microformat idea, thanks for the pointer. one difficulty, however, of using it for scraping is that you'll have to use XPath and trust that the input XHTML is valid. regexps won't work, because the close tags don't include the "id" or "class" attributes, so nested close tags won't match correctly with simple regexps. but then, we're all told that scraping and regexps are kludges anyway, so I guess we shouldn't be using them any more ;)
  • Well, the thing about the microformats, if I recall, is that you have to at least start with well-formed XHTML. Otherwise, your microformatted content is broken. That said, though, I've been using Python's HTMLParser to lift data out of microformat content with a lot of success.
  • Leslie, that's exactly right. More and more blogging tools publish in well-formed XHTML by default and are becoming better and better at "tidying" ill-formed markup into well-formed markup. In addition, right now you can get started with a bit of a hybrid approach: you can use a regexp to find the *start* of a microformat such as hCard, hCalendar, hReview, XOXO etc. simply by looking for class attributes that contain the right value. Then at that point, you can hand the stream over to an XML parser to process the well-formed microformatted markup and handle the structured data as you wish! Tantek
  • I remember that presentation at WWDC last year or the year before in which the very nice Sara ? explains that scraping is for H4x0Rs, to which I commented to her afterward that it's actually for 3L337 H4x0Rs... I particularly enjoy the "Scraping is Fun" bent of this thread, so I thought I'd bang my own drum a little to show how much I enjoy it, too: http://www.metafy.com/products/anthracite/ It works great with Perl (or any other UNIX command) and AppleScript now, and even better later this week when we release the Automator actions for Anthracite (whoops, not supposed to say that yet until they're all done)... Among other nifty things you can do with it today are convert the results of a Google search into an RSS feed, and/or search those results via Spotlight. I hope it helps you enjoy scraping even more! Joe @ Metafy Boulder, Colorado USA
  • Here is a microformat I proposed a couple years ago: http://internetalchemy.org/2003/04/rssInXHTML Generally, I think the microformat concept is great. If an html author can at least use CSS nicely, it's trivial to create an RSS feed with tidy + XSL (or any other way). That's useful if their content management system is some sort of lame html editing system that inserts the author's text into a template (i.e. only one page can be created from a single source) rather than generating markup for several pages (html, rss) from a content source.
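
The hybrid approach suggested in the comments above (a regexp to find the start of a microformat by its class attribute, then an XML parser for the well-formed markup that follows) could look roughly like the sketch below. Everything named here is an assumption made for illustration: the function extract_fragment, the _Done exception, and the sample page are invented. It also assumes double-quoted class attributes and a fragment with no undefined entities such as &nbsp;. If the fragment is not well-formed, expat raises an error.

```python
import re
import xml.parsers.expat


class _Done(Exception):
    """Raised internally to stop parsing once the fragment's root closes."""


def extract_fragment(xhtml, class_name):
    """Hypothetical helper: return (class name, text) pairs for one fragment."""
    # Step 1: a regexp finds the opening tag whose class attribute
    # contains class_name as a whole, space-separated word.
    pattern = re.compile(
        r'<[A-Za-z][^<>]*\sclass\s*=\s*"(?:[^"]*\s)?'
        + re.escape(class_name)
        + r'(?:\s[^"]*)?"[^<>]*>'
    )
    match = pattern.search(xhtml)
    if match is None:
        return None

    found = []          # (class name, text) in closing order
    open_elements = []  # (class names, text pieces) per open element

    def start(tag, attrs):
        open_elements.append((attrs.get("class", "").split(), []))

    def chars(data):
        # Text counts toward every element that is currently open.
        for _classes, pieces in open_elements:
            pieces.append(data)

    def end(tag):
        classes, pieces = open_elements.pop()
        text = " ".join("".join(pieces).split())
        for name in classes:
            found.append((name, text))
        if not open_elements:
            raise _Done  # root element closed; ignore the rest of the page

    # Step 2: hand everything from the match onward to an XML parser.
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    try:
        parser.Parse(xhtml[match.start():], True)
    except _Done:
        pass
    return found


# Invented page: sloppy markup around a well-formed hCard fragment.
PAGE = (
    '<html><body><p>Intro & junk</p>'
    '<div class="vcard"><span class="fn">Jane Doe</span> '
    '<span class="org">Example Corp</span></div>'
    '<p>unclosed'
)

if __name__ == "__main__":
    print(extract_fragment(PAGE, "vcard"))
```

Only the fragment itself has to be well-formed here. The markup before and after it is never shown to the XML parser, since the regexp skips what comes before and parsing stops as soon as the fragment's root element closes.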