Structure Is Not Meaning
So, I announce ForgePlucker, and within a day I’ve got some guy from Y Combinator sneering at me for using regular expressions to parse HTML. Says it’s “crappy code”. The poor fool…he has fallen victim to a conceptual trap which I, fortunately, learned to avoid decades ago. I could spout a freshet of theory about it, but instead I’m just going to utter a maxim: Never confuse structure with meaning.
When you parse an XML or HTML document into a DOM, you’re not getting meaning. You’re only getting structure. The DOM structure. And by doing that, you add a kind of dependency to your code that it didn’t have before – you become vulnerable to changes in the document’s parse structure that don’t change the meaning and wouldn’t have bitten you if you had stuck to doing local pattern-matching.
That may be an acceptable or even necessary risk if the meaning you’re trying to extract is closely tied to the structure of the document. If what you’re looking at is pure high-church XML, that’s often the case. But HTML? It’s tag soup. The structure the DOM tries to capture may not be well-formed at all. Even if it is, some UI designer fiddling with presentation-level tags can easily mutate the structure in a way that points your beautiful theoretically-neat XPath queries off into la-la land.
Yes, sometimes lxml or BeautifulSoup is the right thing. I’ll probably use lxml when I get around to writing the parser for the XML state-dump format I’ve started to define. That will be appropriate, because the generators for that format can guarantee well-formedness and aren’t going to be changed casually by UI designers who innocently believe they’re doing no harm.
But that’s a rather different use case from HTML generated by someone else’s code. It still could be that DOM-based parsing is the smart thing, if the HTML is stable and the generator was designed by somebody fastidious. Best not to count on it, though. Here’s an example of why…
At one point in my code, I have to parse HTML that presents as a two-column table on a bug index page. First column is bug IDs, wrapped in anchor markup so they hotlink to bug detail pages. Second column is the bug summary. This table is embedded in huge amounts of gorp – page headers, page footers, a sidebar, various text annotations. All I care about is the bug IDs, though. I just want a list of them.
If I did what my unwise critic says is right, I’d take a parse tree of the entire page and then write a path query down to the table cell and the embedded link. Which would be cute, and I could do that, and I’m sure code like that got him an A in college – but would leave me shit out of luck the second after, say, anything in the page’s top-level structure changed.
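To make that failure mode concrete, here is a small Python sketch. Everything in it is an assumption for illustration: the sample markup, the /bugs/?NNN link form, and the path expression are invented, and it uses the standard library's ElementTree on a tidy, well-formed snippet rather than real tag soup. It shows a path query that works on one version of a page and silently returns nothing once a presentation-level wrapper is added around the table.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a bug index page.
BEFORE = """<html><body>
<table>
<tr><td><a href="/bugs/?101">#101</a></td><td>Crash on startup</td></tr>
</table>
</body></html>"""

# The same page after a designer wraps the table in a styling div.
# The bug list itself has not changed at all.
AFTER = BEFORE.replace("<table>", '<div class="content"><table>').replace(
    "</table>", "</table></div>"
)

# A path query tied to the page's top-level structure (hypothetical).
PATH = "./body/table/tr/td/a"


def ids_by_path(page):
    """Return the link text of every anchor found at PATH."""
    root = ET.fromstring(page)
    return [a.text for a in root.findall(PATH)]


if __name__ == "__main__":
    print(ids_by_path(BEFORE))  # finds the bug link
    print(ids_by_path(AFTER))   # finds nothing: the path no longer exists
```

A looser query (for example, searching for anchors at any depth) would survive this particular change, but then it is no longer really leaning on the parse structure; it is doing local matching by other means.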
So instead, I walk through the page looking for anything that looks like a hotlink wrapping literal text of the form #[0-9]*. Actually, that oversimplifies; I look for a hotlink whose URL matches a specific pattern that I know points into the site bugtracker.
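A minimal sketch of that kind of local match, in Python. This is not ForgePlucker's actual code: the /bugs/?NNN URL form, the sample page, and the names BUG_LINK and bug_ids are all hypothetical, chosen only to show the shape of the technique. The pattern keys on a link whose URL points into the bugtracker and whose text is the same bug number, and ignores everything around it.

```python
import re

# Hypothetical bug-link pattern: an anchor whose href ends in /bugs/?NNN
# and whose link text is "#NNN" with the same number. A real forge's URL
# scheme will differ; adjust the pattern to match it.
BUG_LINK = re.compile(
    r'<a\s[^>]*href="[^"]*/bugs/\?(\d+)"[^>]*>\s*#\1\s*</a>',
    re.IGNORECASE,
)


def bug_ids(page):
    """Return every bug ID linked from the page, in document order."""
    return [int(m.group(1)) for m in BUG_LINK.finditer(page)]


# Invented sample page: a sidebar, the bug table, and an unrelated link
# whose text merely looks like a bug number.
SAMPLE_PAGE = """
<html><body>
<div id="sidebar"><a href="/project/about">About</a></div>
<table>
<tr><td><a href="/bugs/?101">#101</a></td><td>Crash on startup</td></tr>
<tr><td><a class="new" href="/bugs/?102">#102</a></td><td>Typo in docs</td></tr>
</table>
<p>See also <a href="/news/?7">#7</a></p>
</body></html>
"""

if __name__ == "__main__":
    print(bug_ids(SAMPLE_PAGE))  # [101, 102]
```

Wrapping the table in extra divs, reordering the sidebar, or leaving a tag unclosed elsewhere on the page does not affect the result, because nothing in the pattern refers to where the link sits in the document. What would break it is a change to the bugtracker's URL scheme or to the anchor markup itself.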
Now, ask yourself: what’s more likely to remain stable: the tag soup on the page, or the path structure of the forge site? Which is the more reliable cue to the data I actually want to extract?
Never confuse structure with meaning.
Sometimes, the brute-force hack-at-it-with-regexps approach really is best. It looks stupid, but it gets the job done.
CORRECTION: The snotty kid (or snotty-kid soundalike) isn’t from Y Combinator. That’s actually a relief – I know Paul Graham, I like Paul Graham, and the crew around him usually has more sense. Anyway, this post was supposed to be about the critic’s mistake, not the critic.
UPDATE: I’ve had more to say on this topic in The Pragmatics of Webscraping.