I’m toying with building a small library that will take a URL, load and parse it (using Mechanize and Loofah), and spit out a title, description/summary, and the ‘main’ image. Sometimes it’s a piece of cake, like when they have Open Graph tags indicating all of that information in the headers. (I also plan to work on oEmbed support.)
But it’s a huge headache, because there are myriad standards, and most people don’t bother to follow any of them. I’ve learned about all sorts of ways of declaring “the image” for a page, like og:image or a rel="image_src" link, but most of the time I have to parse all the <img> tags and find the biggest. That’s really error-prone, because lots of people don’t bother setting height and width attributes on their images, so I end up picking a 16×16-pixel icon as the “biggest”. (I categorically refuse to actually download every image to extract its dimensions for this. It’s a huge waste of bandwidth.) On Lifehacker, my code typically returns a banner image as the largest image.
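Here is a rough sketch of that fallback order, not my actual code. It assumes the parsing step has already reduced each <img> to a plain Hash of its attributes, and the names `pick_main_image`, `declared_area`, and `MIN_AREA` are made up for illustration. The 100×100 cutoff is an arbitrary guess, meant to return nothing rather than crown a 16×16 icon when it happens to be the only image with declared dimensions.

```ruby
# Hypothetical cutoff: anything smaller is assumed to be an icon or spacer.
MIN_AREA = 100 * 100

# Returns width * height from the declared attributes, or nil when either
# one is missing or is not a plain positive integer (e.g. "100%").
def declared_area(img)
  width  = Integer(img["width"].to_s, 10, exception: false)
  height = Integer(img["height"].to_s, 10, exception: false)
  return nil unless width && height && width.positive? && height.positive?

  width * height
end

# Preference order: og:image, then rel="image_src", then the largest
# <img> that declares its own dimensions. Returns nil if nothing qualifies.
def pick_main_image(og_image: nil, image_src: nil, imgs: [])
  return og_image if og_image && !og_image.strip.empty?
  return image_src if image_src && !image_src.strip.empty?

  sized = imgs.filter_map do |img|
    area = declared_area(img)
    [img["src"], area] if img["src"] && area && area >= MIN_AREA
  end
  best = sized.max_by { |_src, area| area }
  best && best.first
end
```

This does nothing for pages where no image declares its size, and it would still happily pick a wide banner, which is exactly the Lifehacker problem.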
And when an og:image is defined in the <meta> headers, all bets are off for what it is. Maybe it’s the largest image on the page. (Yay!) Maybe it’s a 64×64-pixel thumbnail. (Boo!) Or maybe it’s the site’s logo. It might be an absolute path, but it might be relative, in which case I have to construct a new URL to it based on what I think the site’s domain is after accounting for any redirects I followed.
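For the relative-path case, Ruby’s standard `URI.join` does most of the work, as long as you hand it the URL you ended up on rather than the one you started with. A minimal sketch, assuming `final_url` is the post-redirect address (with Mechanize that should be the page’s `uri`); the helper name `absolute_image_url` is invented for this example:

```ruby
require "uri"

# Resolves an image reference against the URL the page was finally
# served from. Returns nil if either string cannot be parsed as a URI.
def absolute_image_url(final_url, image_path)
  URI.join(final_url, image_path.strip).to_s
rescue URI::InvalidURIError
  nil
end

absolute_image_url("http://example.com/articles/42", "/img/pic.png")
# => "http://example.com/img/pic.png"
absolute_image_url("http://example.com/articles/42", "https://cdn.example.net/a.jpg")
# => "https://cdn.example.net/a.jpg"
```

Already-absolute URLs pass through unchanged, so this can be applied to every og:image value without checking first. It ignores any `<base href>` the page might declare, which is yet another thing that could be wrong.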
And then there’s just insane stuff. If no og:title is set, I try to get the <title> tag. You wouldn’t believe how many times the <title> tag has HTML inside of it, like <h1> tags, or newlines. Often, multiple newlines, in the middle. On big, well-known sites. What the heck? Or the title is in ALL CAPS LIKE A CRAZY DERANGED PERSON IS WRITING for no apparent reason, so I’m inclined to titlecase it. Or perhaps the title is actually just the site’s name. Or perhaps it’s actually the text “Untitled,” which is frighteningly common. Or perhaps there just isn’t a title.
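The cleanup for those title cases looks roughly like the following. Treat it as a sketch: `clean_title` is a made-up name, the tag stripping is a crude regex rather than a real sanitizer, and the title-casing is deliberately naive.

```ruby
# Normalizes a raw <title> string. Returns nil when nothing usable is left.
def clean_title(raw)
  return nil if raw.nil?

  title = raw.gsub(/<[^>]*>/, " ")       # drop stray tags such as <h1>
  title = title.gsub(/\s+/, " ").strip   # collapse newlines and whitespace runs
  return nil if title.empty? || title.casecmp?("untitled")

  # Only re-case titles that contain letters and are entirely upper case.
  if title.match?(/[[:alpha:]]/) && title == title.upcase
    title = title.split(" ").map(&:capitalize).join(" ")
  end
  title
end

clean_title("<h1>BREAKING\n\nNEWS</h1>")  # => "Breaking News"
clean_title("Untitled")                    # => nil
```

The obvious casualty is acronyms: an all-caps title containing "NASA" comes out as "Nasa". It also can’t tell when the title is just the site’s name, which needs something to compare against.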
And don’t even get me started on character encoding.
Right now, the code to extract these few bits of information is 124 lines, and I don’t feel anywhere near ready to change its name away from “Parser::Crude” yet. And I’ve really only implemented basic functionality thus far. I still need to add oEmbed, which is a huge can of worms.