Finding the Largest Image on a Page

Here’s a challenge I’m facing right now for a little side project I’m dabbling with. Given an arbitrary webpage, return the ‘representative’ image that you would use if you were constructing a link to the page. For example, this Uncrate post should show the trailer, as it’s directly relevant. This news story should show one of the two maps used.

In some cases, we get lucky. Sites that use the Open Graph Protocol can define an og:image, and Uncrate uses <link rel="image_src"…>, which is easy to parse. In those cases, I just take what the page declares. But 95% of sites don’t do either of those things, so I’m left grabbing all of the <img> nodes on the page and trying to decide which is most representative.
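For what it’s worth, that ‘easy case’ check is only a few lines. Here’s a minimal sketch, assuming Nokogiri for the parsing; the method name is mine, not from any particular library:

    require 'nokogiri'

    # Returns the image a page explicitly declares, or nil if it doesn't declare one.
    def declared_image(html)
      doc = Nokogiri::HTML(html)

      og = doc.at_css('meta[property="og:image"]')
      return og['content'] if og && og['content']

      link = doc.at_css('link[rel="image_src"]')
      return link['href'] if link && link['href']

      nil  # fall back to scanning <img> tags
    end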

Here are the possible solutions I’m weighing:

  • Look at the height and width attributes, multiply them, and return the image with the largest product, which is presumably the ‘main’ image. This is what I’m doing now (there’s a sketch of it after this list). It turns out that the images used in blog posts and such — the things I want to return as the ‘main’ image — rarely have height and width attributes, so their size is effectively zero, leaving me returning static parts of the site layout as the ‘main’ image. Often it’s a small 16×16 icon, or the site’s logo, or a banner ad (!).
  • Try to intelligently find the ‘main content’ <div> and grab the first image in there. This would really only work if people used a handful of common names for the main content areas of their site, and then, only if they refrained from putting something like an author headshot or social media sharing icons first, which is not at all a safe assumption to make. But “not safe globally” is really the story of the Internet; it just has to be less-bad than the above method to be worth doing.
  • Download every image on the page, load each into memory, and compute the total number of pixels; keep the largest.
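Here’s the sketch I mentioned for the first approach, again assuming Nokogiri; the helper name is mine. Images without both attributes score zero, which is exactly the failure mode described above:

    require 'nokogiri'

    # Score every <img> by its declared width × height and keep the biggest.
    # Nothing is downloaded; this trusts whatever attributes the page sets.
    def largest_declared_image(html)
      doc  = Nokogiri::HTML(html)
      best = doc.css('img').max_by { |img| img['width'].to_i * img['height'].to_i }
      best && best['src']
    end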

The concept of the largest image is still slightly funky: if a blog post has a small image but the site has a huge banner ad or logo, the largest image won’t be the one relevant to the story. That would argue that trying to find the ‘main div’ is best, but I don’t see how that could ever work reliably. The first idea — using height and width attributes to find the largest — works maybe 50% of the time, but when it doesn’t work, it’s astonishingly bad. (The news story I linked to returns a one-pixel transparent GIF. Using it on Lifehacker returns banner ads.) Downloading every image on the page is a huge bother, as it would be slow and use a lot of bandwidth. (I might be able to get away with sending a HEAD request for each image and picking the one with the largest Content-Length, which very foolishly assumes the biggest file is the most ‘important’ image on the page. There’s a sketch of that below.)
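The HEAD-request idea would look something like this. It’s a sketch assuming Net::HTTP and absolute image URLs; the method names are mine, and it still rests on that shaky “biggest file wins” assumption:

    require 'net/http'
    require 'uri'

    # Ask the server how big an image is without downloading it.
    def content_length(image_url)
      uri = URI(image_url)
      Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
        http.head(uri.request_uri)['Content-Length'].to_i
      end
    rescue StandardError
      0  # unreachable or misbehaving hosts just lose the contest
    end

    def largest_by_content_length(image_urls)
      image_urls.max_by { |url| content_length(url) }
    end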

Anyone have any better ideas?

Standards: They’re Nice.

I’m toying with building a small library that will take a URL, load and parse it (using Mechanize and Loofah), and spit out a title, a description/summary, and the ‘main’ image. Sometimes it’s a piece of cake, like when a page has Open Graph tags in its <head> declaring all of that information. (I also plan to work on oEmbed support.)
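The overall shape is pretty simple; the standards soup is what makes it long. Here’s a rough skeleton, assuming Mechanize for fetching and Nokogiri for the parsing (Loofah wraps Nokogiri, so the selectors are the same); the PageSummary name and the fallback order are mine, not the real Parser::Crude API:

    require 'mechanize'
    require 'nokogiri'

    PageSummary = Struct.new(:title, :description, :image)

    # Fetch the first matching node and return one of its attributes, or nil.
    def attr_of(doc, selector, attribute)
      node = doc.at_css(selector)
      node && node[attribute]
    end

    def summarize(url)
      page = Mechanize.new.get(url)   # follows redirects for us
      doc  = Nokogiri::HTML(page.body)

      # Prefer Open Graph tags, then fall back to ordinary markup.
      title = attr_of(doc, 'meta[property="og:title"]', 'content') ||
              (doc.at_css('title') && doc.at_css('title').text)
      description = attr_of(doc, 'meta[property="og:description"]', 'content') ||
                    attr_of(doc, 'meta[name="description"]', 'content')
      image = attr_of(doc, 'meta[property="og:image"]', 'content') ||
              attr_of(doc, 'link[rel="image_src"]', 'href')

      PageSummary.new(title, description, image)
    end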

But it’s a huge headache, because there are myriad standards, and most people don’t bother following any of them. I’ve learned about all sorts of ways of declaring “the image” for a page, like og:image or a rel="image_src" link, but most of the time I have to parse all the <img> tags and find the biggest. That’s really error-prone, because lots of people don’t bother setting height and width attributes on their images, so I end up picking a 16×16-pixel icon as the “biggest”. (I categorically refuse to actually download every image to extract its dimensions for this. It’s a huge waste of bandwidth.) On Lifehacker, my code typically returns a banner image as the largest image.

And when an og:image is defined in the <meta> tags, all bets are off as to what it is. Maybe it’s the largest image on the page — yay! Maybe it’s a 64×64-pixel thumbnail. (Boo!) Or maybe it’s the site’s logo. It might be an absolute URL, but it might be relative, in which case I have to construct a new URL for it based on what I think the site’s domain is after accounting for having followed redirects.
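Resolving the relative case, at least, is mechanical once you know the URL you actually ended up at after redirects (Mechanize exposes that as page.uri). A sketch; the method and variable names are mine:

    require 'uri'

    # Make an og:image value absolute against the final, post-redirect page URL.
    # URI.join leaves already-absolute values alone.
    def absolute_image_url(og_image, final_page_url)
      URI.join(final_page_url, og_image).to_s
    end

    # absolute_image_url('/images/thumb.jpg', 'http://www.example.com/story/123')
    #   #=> "http://www.example.com/images/thumb.jpg"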

And then there’s just insane stuff. If no og:title is set, I try to get the <title> tag. You wouldn’t believe how many times the <title> tag has HTML inside of it, like <h1> tags, or newlines. Often, multiple newlines, in the middle. On big, well-known sites. What the heck? Or the title is in ALL CAPS LIKE A CRAZY DERANGED PERSON IS WRITING for no apparent reason, so I’m inclined to titlecase it. Or perhaps the title is actually just the site’s name. Or perhaps it’s actually the text “Untitled,” which is frighteningly common. Or perhaps there just isn’t a title.
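My title cleanup ends up looking something like this. It’s a sketch assuming Nokogiri for the tag-stripping; the method name and the all-caps test are mine:

    require 'nokogiri'

    def clean_title(raw_title)
      text = Nokogiri::HTML.fragment(raw_title).text  # drop embedded <h1> and friends
      text = text.gsub(/\s+/, ' ').strip              # collapse newlines and runs of spaces
      if text =~ /[A-Z]/ && text == text.upcase       # ALL CAPS? calm it down
        text = text.split(' ').map(&:capitalize).join(' ')
      end
      text
    end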

And don’t even get me started on character encoding.

Right now, the code to extract these few bits of information is 124 lines, and I don’t feel anywhere near ready to change its name away from “Parser::Crude” yet. And I’ve really only implemented basic functionality thus far. I still need to add oEmbed, which is a huge can of worms.

The Web, My Way

I used to work for an ad-supported company, so I’ve always felt kind of bad using AdBlock. For a while I went back and forth between keeping AdBlock off except on obnoxious sites, and keeping it on while whitelisting the sites I frequent that aren’t obnoxious. That ended up being too much of a hassle, so now I browse with AdBlock on full-time and a handful of sites whitelisted. I still feel bad doing so, but not bad enough to unblock.

Recently I’ve noticed a couple sites that are really slow. Firebug shows they’re doing tons of crap. Uncrate seems to have something that will go off and do an AJAX post any time I scroll to expose new content, and content from several other sites is pulled in. Slowly. The site locks up my browser for a moment. Some other site I went to the other day managed to peg one of my cores at 100% CPU usage. I don’t even know what it was doing.

I’ve declined to use NoScript for a long time, because in 2010, the Internet just doesn’t work without JavaScript. It would be like me trying to avoid surveillance cameras in public, or trying to avoid using WiFi because I didn’t trust it. But I’ve just had a mini-rebellion. There is so much crap on pages. A page pulls in JavaScript tags from a half-dozen sites, and each of those pulls in more scripts that all make calls all over the place. I have a pretty fast computer and keep Firefox locked down, so when a page is able to lock up Firefox for a moment, something is really wrong.

So I’m browsing the web on my terms now. NoScript blocks just about all JavaScript and I’m starting to selectively whitelist. I did permit Google Analytics.

The other great thing about NoScript is that it blocks Flash by default. I like Flash for things like videos. But Flash is also insidious in its use of persistent cookies that your browser doesn’t control. When I used Adobe’s site to view my saved Flash cookies, I found a ton, mostly from sites I’ve never even heard of. I deleted the contents of all of them, and I no longer permit Flash to run by default either.

The Internet looks pretty crappy now, but I prefer that to sites that are crappy.

Overclocking

When I put together my system, I picked the i7-930 because Microcenter offered it for less than the i7-920. Both are practically the same processor; the i7-920 is 2.66 GHz and mine is 2.8 GHz. Both are legendary for their ability to be overclocked. I didn’t necessarily run with the overclocking crowd, but I’ve always had in the back of my mind that it could be done. Last night I ran into some forum posts with people complaining that they “only” got to 4.1 GHz with my processor/cooler combination before they started hitting stability issues.

I don’t want to push things that hard, especially when I’m concerned about whether my CPU cooler is working at full efficiency. (I want to pick up some thermal grease and remount it.) But I was bit by the bug.

I can’t actually change the multiplier of my processor; it’s fixed at 21× the base clock frequency. But the base clock is trivial to change. I made sure all the thermal monitoring and auto-shutoffs were enabled and bumped it up from 133 MHz to 150 MHz. At a fixed 21× multiplier, that works out to 3.15 GHz instead of the stock 2.8 GHz. And presto! I’m running a bit hot now, but one simple change and:

I’m not certain why it’s reported as a Xeon; it’s not. But the clock speed is accurate. When I get around to trying to remount the cooler, I think I’m going to have a go at the 4 GHz barrier, though I don’t want to run too far out of spec.