For a while I’ve been trying to ensure that all user-generated links on a site I code for have the rel="nofollow" attribute, to prevent giving spammers our link juice.
It’s a tough problem to solve, though. Or so I thought. I ended up doing a global search-and-replace (gsub) on any user-generated text, replacing `<a href="` with `<a rel="nofollow" href="`, but this was broken for a few reasons. One is that, while apparently legal, it's bizarre to throw the rel attribute before the href attribute. Maybe that doesn't matter. The tricky part is that a link of the form `<a href ="http://www.example.com">` (note the space before the equals sign) is totally valid, so I couldn't just match on `<a href="` or I would miss links from crafty users. The more I thought about it, the more it turned into a regular expression from hell.

Plus, what if there was already a rel attribute, something like `<a href="http://www.example.com" rel="faked_you_out">`? I'd then put in a second rel attribute, which is bad. It was just spiraling out of control, and turning into a lot of code to handle a lot of weird possible cases.
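To make the failure modes concrete, here's a minimal sketch of that naive gsub approach (with hypothetical input) showing two of the cases that break it:

```ruby
# Hypothetical user content: one link with a space before the equals sign,
# and one that already carries a rel attribute.
content = '<a href ="http://example.com/a">one</a> ' \
          '<a href="http://example.com/b" rel="me">two</a>'

# The naive fix: bolt rel="nofollow" onto the front of every matching tag.
patched = content.gsub('<a href="', '<a rel="nofollow" href="')

# The first link isn't matched at all (the space defeats the pattern),
# and the second tag now carries two rel attributes.
puts patched
```

Handling each of these cases with more gsubs or a smarter regex just adds branches, which is exactly the spiral described above.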
A coworker nudged me in the direction of hpricot, an HTML parser. And suddenly, it was comically easy to do this flawlessly:
```ruby
require 'hpricot'

html = Hpricot.parse(user_content_here)
(html/'a').each do |link|
  link['rel'] = 'nofollow'
end
return html.to_s
```
For each 'a' element, called 'link', set its rel attribute to "nofollow". If there's already a rel attribute, it's replaced with "nofollow"; if there isn't one, it's added. hpricot handles all of the "special cases" that my code would have required. I don't care where the rel sits in the tag, or how the attributes are quoted. It just works.
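The parse-then-set-attribute idea isn't specific to hpricot, either. As a sketch with made-up input, the same thing works with Ruby's built-in REXML, provided the fragment is well-formed enough to parse:

```ruby
require 'rexml/document'

# Hypothetical user content, wrapped in a root element so REXML can parse it.
fragment = "<div><a href='http://example.com/a'>one</a> " \
           "<a href='http://example.com/b' rel='me'>two</a></div>"

doc = REXML::Document.new(fragment)
doc.elements.each('//a') do |link|
  # Overwrites an existing rel attribute, or adds one if it's missing.
  link.attributes['rel'] = 'nofollow'
end

puts doc.to_s
```

The trade-off is that REXML is strict about well-formedness, while hpricot was built to chew through the kind of sloppy markup users actually submit.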
How awesome is that? It seems to work flawlessly, and yet it's really basic code once you get past the somewhat unconventional format that hpricot uses.