Here’s a challenge I’m facing right now for a little side project I’m dabbling with. Given an arbitrary webpage, return the ‘representative’ image that you would use if you were constructing a link to the page. For example, this Uncrate post should show the trailer, as it’s directly relevant. This news story should show one of the two maps used.
In some cases, we get lucky. Sites that use the Open Graph Protocol can define an og:image, and Uncrate uses <link rel="image_src"…>, which is easy to parse. When either is present, I just use the declared image. But 95% of sites don’t do either, so I’m left grabbing all of the <img> nodes on the page and trying to decide which is most representative.
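The lucky case can be sketched roughly like this. It's a minimal sketch using only Python's standard-library html.parser; the names DeclaredImageParser and declared_image are made up for illustration, and preferring og:image over image_src is my own assumption, not a rule from either convention.

```python
from html.parser import HTMLParser


class DeclaredImageParser(HTMLParser):
    """Remembers the first og:image and the first rel="image_src" link seen."""

    def __init__(self):
        super().__init__()
        self.og_image = None
        self.image_src = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # attribute names arrive lowercased
        if tag == "meta" and attrs.get("property") == "og:image":
            self.og_image = self.og_image or attrs.get("content")
        elif tag == "link" and "image_src" in (attrs.get("rel") or "").split():
            self.image_src = self.image_src or attrs.get("href")


def declared_image(html):
    """Return the page's self-declared image URL, or None if it has none."""
    parser = DeclaredImageParser()
    parser.feed(html)
    # Assumption: og:image wins when both hints are present.
    return parser.og_image or parser.image_src
```

When this returns None, the page gave no hint and we're back to guessing among the <img> nodes.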
Here are the possible solutions I’m weighing:
- Look at the height and width attributes, multiply them, and return the largest, which is presumably the ‘main’ image. This is what I’m doing now. It turns out that the images used in blog posts and such (the things I want to return as the ‘main’ image) rarely have height and width attributes, so their size is effectively zero, leaving me returning static parts of the site layout as the ‘main’ image. Often, it’s small 16×16 icons, or the site’s logo, or banner ads (!).
- Try to intelligently find the ‘main content’ <div> and grab the first image in there. This would really only work if people used a handful of common names for the main content areas of their site, and then, only if they refrained from putting something like an author headshot or social media sharing icons first, which is not at all a safe assumption to make. But “not safe globally” is really the story of the Internet; it just has to be less-bad than the above method to be worth doing.
- Download every image on the page, load each into memory, and compute the total number of pixels; keep the largest.
The concept of the largest image is still slightly funky, in that if you had a blog post with a small image, but a huge banner ad or logo on your site, the largest image wouldn’t be the one relevant to the story. That would argue that trying to find the ‘main div’ is best, but I don’t see how that could ever work reliably. The first idea — using height and width attributes to find the largest — works maybe 50% of the time, but when it doesn’t work, it’s astonishingly bad. (The news story I linked to returns a one-pixel transparent GIF. Using it on Lifehacker returns banner ads.) Downloading every image on the page is a huge bother, as it would be slow and use a lot of bandwidth. (I might be able to get away with sending a HEAD and picking the image with the largest Content-Length, which I very foolishly assume is the most ‘important’ image on the page.)
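The HEAD idea could look something like the sketch below, using urllib from the standard library. The function names are made up for illustration, and it assumes the image URLs have already been made absolute. Servers aren't required to send Content-Length on a HEAD response, so a missing header is treated as size zero here. Keep in mind that Content-Length counts bytes, not pixels, so a heavily compressed photo can lose to a bloated banner.

```python
import urllib.request


def content_length(url, timeout=5):
    """Send a HEAD request; return Content-Length as an int, or 0 if unavailable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return int(response.headers.get("Content-Length") or 0)
    except (OSError, ValueError):
        # Network errors, HTTP errors, bad URLs, or an unparseable header.
        return 0


def largest_by_content_length(urls, get_length=content_length):
    """Return the URL reporting the most bytes, or None if no size is known."""
    sized = [(get_length(url), url) for url in urls]
    sized = [(size, url) for size, url in sized if size > 0]
    if not sized:
        return None
    return max(sized, key=lambda item: item[0])[1]
```

The get_length parameter is only there so the selection logic can be tried without touching the network, for example by passing a dict's get method that maps URLs to fake sizes.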
Anyone have any better ideas?