Standards: They’re Nice.

I’m toying with building a small library that will take a URL, load and parse it (using Mechanize and Loofah), and spit out a title, description/summary, and the ‘main’ image. Sometimes it’s a piece of cake, like when they have Open Graph tags indicating all of that information in the headers. (I also plan to work on oEmbed support.)

But more often it’s a huge headache, because there are myriad standards, with most people not bothering to follow any of them. I’ve learned about all sorts of ways of declaring “the image” for a page, like og:image or a rel="image_src" link, but most of the time I have to parse all the <img> tags and find the biggest. That’s really error-prone, because lots of people don’t bother setting height and width attributes on their images, so I end up picking a 16×16-pixel icon as the “biggest”. (I categorically refuse to actually download every image to extract its dimensions for this. It’s a huge waste of bandwidth.) On Lifehacker, my code typically returns a banner image as the largest image.
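A minimal sketch of the "find the biggest by declared size" idea, assuming the image attributes have already been pulled out of the parsed page into plain hashes. The names `declared_area` and `largest_declared_image`, and the `min_area` cutoff, are hypothetical and not part of the actual library:

```ruby
# Hypothetical helper: area of an image from its declared width/height.
# Returns nil when either attribute is missing or not a plain positive
# integer (e.g. "100%"), so such images can be skipped.
def declared_area(img)
  width  = Integer(img["width"].to_s, 10, exception: false)
  height = Integer(img["height"].to_s, 10, exception: false)
  return nil unless width && height && width.positive? && height.positive?

  width * height
end

# Hypothetical helper: pick the image with the largest declared area,
# ignoring anything smaller than min_area (an assumed cutoff for icons).
def largest_declared_image(images, min_area: 0)
  sized = images.filter_map do |img|
    area = declared_area(img)
    [img, area] if area && area >= min_area
  end
  best = sized.max_by { |_, area| area }
  best && best.first
end

images = [
  { "src" => "/favicon.png", "width" => "16", "height" => "16" },
  { "src" => "/photo.jpg" }, # no declared size, so it is skipped
  { "src" => "/banner.png", "width" => "728", "height" => "90" }
]

best = largest_declared_image(images, min_area: 5_000)
puts best ? best["src"] : "no usable image"
```

With this sample data the banner wins, which is exactly the Lifehacker problem: a cutoff filters out the 16×16 icon, but nothing here can tell a wide banner from the real photo that never declared its size.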

And when an og:image is defined in the headers, all bets are off for what it is. Maybe it’s the largest image on the page (yay!). Maybe it’s a 64×64-pixel thumbnail. (Boo!) Or maybe it’s the site’s logo. It might be an absolute URL, but it might be relative, in which case I have to construct a new URL to it based on what I think the site’s domain is after accounting for any redirects I followed.
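The relative-versus-absolute case can be sketched with Ruby’s standard `uri` library, assuming the final URL after redirects is already known (with Mechanize, I believe the fetched page’s `uri` holds it). The helper name `absolute_image_url` and the example.com addresses are made up for illustration:

```ruby
require "uri"

# Hypothetical helper: resolve an og:image value against the URL the page
# was finally served from. Absolute values pass through unchanged; relative
# ones are resolved against the base. Returns nil for blank or unparseable
# input rather than raising.
def absolute_image_url(final_page_url, image_ref)
  ref = image_ref.to_s.strip
  return nil if ref.empty?

  URI.join(final_page_url, ref).to_s
rescue URI::Error
  nil
end

base = "http://www.example.com/articles/post.html"

puts absolute_image_url(base, "images/pic.jpg")
puts absolute_image_url(base, "/logo.png")
puts absolute_image_url(base, "http://cdn.example.net/a.jpg")
```

Resolving against the full final URL, rather than just a guessed domain, means path-relative values like `images/pic.jpg` end up under the right directory as well.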

And then there’s just insane stuff. If no og:title is set, I fall back to the <title> tag. You wouldn’t believe how many times it has HTML inside of it, like <h1> tags, or newlines. Often, <em>multiple</em> newlines, in the middle. On big, well-known sites. What the heck? Or the title is in ALL CAPS LIKE A CRAZY DERANGED PERSON IS WRITING for no apparent reason, so I’m inclined to <a href="http://apidock.com/rails/ActiveSupport/CoreExtensions/String/Inflections/titlecase">titlecase</a> it. Or perhaps the title is actually just the site’s name. Or perhaps it’s actually the text “Untitled,” which is frighteningly common. Or perhaps there just <em>isn’t</em> a title.</p> <p>And don’t even get me <em>started</em> on character encoding.</p> <p>Right now, the code to extract these few bits of information is 124 lines, and I don’t feel anywhere near ready to change its name away from “Parser::Crude” yet. And I’ve really only implemented basic functionality thus far. I still need to add oEmbed, which is a huge can of worms.</p> </div><!-- .entry-content --> <footer class="entry-meta"> This entry was posted in <a href="https://blogs.n1zyy.com/n1zyy/category/uncategorized/" rel="category tag">Uncategorized</a> by <a href="https://blogs.n1zyy.com/n1zyy/author/n1zyy/">n1zyy</a>. Bookmark the <a href="https://blogs.n1zyy.com/n1zyy/2010/12/19/standards-theyre-nice/" title="Permalink to Standards: They’re Nice." rel="bookmark">permalink</a>. 
</footer><!-- .entry-meta --> </article><!-- #post-3306 --> </div><!-- #content --> </div><!-- #primary --> </div><!-- #main --> </div><!-- #page --> </body> </html>