It's a blog.
In: Uncategorized | 30 Dec 2010
Here’s a challenge I’m facing right now for a little side project I’m dabbling with. Given an arbitrary webpage, return the ‘representative’ image that you would use if you were constructing a link to the page. For example, this Uncrate post should show the trailer, as it’s directly relevant. This news story should show one of the two maps used.
In some cases, we get lucky. Sites that use the Open Graph Protocol can define an og:image, and Uncrate uses <link rel="image_src"…>, which is easy to parse. When either is present, I use it. But 95% of sites don’t do either, so I’m left grabbing all of the <img> nodes on the page and trying to decide which is most representative.
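Here's a rough sketch of that lucky path in Python, using only the standard library's html.parser. The class and function names (DeclaredImageParser, declared_image) are made up for illustration, and preferring og:image over image_src when both exist is my own assumption, not something either convention requires.

```python
from html.parser import HTMLParser


class DeclaredImageParser(HTMLParser):
    """Records the first og:image and the first rel="image_src" link seen."""

    def __init__(self):
        super().__init__()
        self.og_image = None
        self.image_src = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("property") == "og:image":
            if self.og_image is None:
                self.og_image = a.get("content")
        elif tag == "link" and self.image_src is None:
            # rel can hold several space-separated tokens
            rel_tokens = (a.get("rel") or "").lower().split()
            if "image_src" in rel_tokens:
                self.image_src = a.get("href")


def declared_image(html):
    """Return the image the page declares for itself, or None."""
    parser = DeclaredImageParser()
    parser.feed(html)
    # Assumption: og:image wins if both are declared.
    return parser.og_image or parser.image_src
```

If declared_image returns None, the page falls into the other 95% and the <img> guessing game begins.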
Here are the possible solutions I’m weighing:
The concept of the largest image is still slightly funky: if you had a blog post with a small image but a huge banner ad or logo on your site, the largest image wouldn’t be the one relevant to the story. That would argue that trying to find the ‘main div’ is best, but I don’t see how that could ever work reliably. The first idea, using height and width attributes to find the largest, works maybe 50% of the time, but when it doesn’t work, it’s astonishingly bad. (The news story I linked to returns a one-pixel transparent GIF. Using it on Lifehacker returns banner ads.) Downloading every image on the page is a huge bother, as it would be slow and use a lot of bandwidth. (I might be able to get away with sending a HEAD request for each image and picking the one with the largest Content-Length, on the very foolish assumption that the biggest file is the most ‘important’ image on the page.)
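To make two of those ideas concrete, here's a minimal sketch in standard-library Python: one function picks the <img> with the largest declared width × height, the other ranks images by the Content-Length of a HEAD response. All of the names here are hypothetical. Treating a missing or non-numeric dimension (like "100%") as zero is my own assumption, and so is trusting servers to answer HEAD with an honest Content-Length. Neither function tries to fix the banner-ad problem.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen


def _to_int(value):
    """Parse a width/height attribute; anything non-numeric counts as 0."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return 0


class ImgCollector(HTMLParser):
    """Collects (src, width, height) for every <img> that has a src."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        a = dict(attrs)
        if a.get("src"):
            self.images.append(
                (a["src"], _to_int(a.get("width")), _to_int(a.get("height")))
            )


def collect_images(html):
    collector = ImgCollector()
    collector.feed(html)
    return collector.images


def largest_by_attributes(html):
    """Idea one: the src with the largest declared pixel area, or None."""
    images = collect_images(html)
    if not images:
        return None
    return max(images, key=lambda im: im[1] * im[2])[0]


def head_content_length(url, timeout=5):
    """Send a HEAD request and return Content-Length, or 0 if unavailable."""
    request = Request(url, method="HEAD")
    try:
        with urlopen(request, timeout=timeout) as response:
            return int(response.headers.get("Content-Length") or 0)
    except (OSError, ValueError):
        return 0


def largest_by_content_length(html, base_url, size_of=head_content_length):
    """HEAD idea: the absolute image URL with the biggest reported size."""
    urls = [urljoin(base_url, src) for src, _, _ in collect_images(html)]
    if not urls:
        return None
    return max(urls, key=size_of)
```

The size_of parameter is only there so the ranking can be tried with a fake lookup instead of real network calls. Note that largest_by_attributes returns the first image when every image lacks dimensions, which is one way to end up with a tracking pixel or a logo.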
Anyone have any better ideas?