{"id":3306,"date":"2010-12-19T22:31:54","date_gmt":"2010-12-20T03:31:54","guid":{"rendered":"http:\/\/blogs.n1zyy.com\/n1zyy\/?p=3306"},"modified":"2010-12-19T22:31:54","modified_gmt":"2010-12-20T03:31:54","slug":"standards-theyre-nice","status":"publish","type":"post","link":"https:\/\/blogs.n1zyy.com\/n1zyy\/2010\/12\/19\/standards-theyre-nice\/","title":{"rendered":"Standards: They&#8217;re Nice."},"content":{"rendered":"<p>I&#8217;m toying with building a small library that will take a URL, load and parse it (using <a href=\"http:\/\/mechanize.rubyforge.org\/mechanize\/\">Mechanize<\/a> and <a href=\"https:\/\/github.com\/flavorjones\/loofah\">Loofah<\/a>), and spit out a title, description\/summary, and the &#8216;main&#8217; image. Sometimes it&#8217;s a piece of cake, like when they have <a href=\"http:\/\/ogp.me\/\">Open Graph<\/a> tags indicating all of that information in the headers. (I also plan to work on <a href=\"http:\/\/www.oembed.com\/\">oEmbed<\/a> support.)<\/p>\n<p>But it&#8217;s a huge headache, because there are myriad standards, with most people not bothering to follow <em>any<\/em> of them. I&#8217;ve learned about all sorts of ways of declaring &#8220;the image&#8221; for a page, like og:image or a rel=&#8221;image_src&#8221; link, but most of the time I have to try to parse all the &lt;img&gt; tags and find the biggest. But that&#8217;s really error-prone, because lots of people don&#8217;t bother setting height and width parameters on their images, so I end up picking a 16&#215;16-pixel icon as the &#8220;biggest&#8221;. (I categorically refuse to actually download every image to extract its dimensions for this. It&#8217;s a huge waste of bandwidth.) On Lifehacker, my code typically returns a banner image as the largest image.<\/p>\n<p>And when an og:image <em>is<\/em> defined in the &lt;meta&gt; headers, all bets are off for what it is. Maybe it&#8217;s the largest image on the page &#8212; yay! Maybe it&#8217;s a 64&#215;64-pixel thumbnail. (Boo!) 
Or maybe it&#8217;s the site&#8217;s logo. It might be an absolute path, but it might be relative, in which case I have to construct a new URL to it based on what I <em>think<\/em> the site&#8217;s domain is after accounting for having followed redirects.<\/p>\n<p>And then there&#8217;s just insane stuff. If no og:title is set, I try to get the &lt;title&gt; tag. You wouldn&#8217;t believe how many times the &lt;title&gt; tag has HTML inside of it, like &lt;h1&gt; tags, or newlines. Often, <em>multiple<\/em> newlines, in the middle. On big, well-known sites. What the heck? Or the title is in ALL CAPS LIKE A CRAZY DERANGED PERSON IS WRITING for no apparent reason, so I&#8217;m inclined to <a href=\"http:\/\/apidock.com\/rails\/ActiveSupport\/CoreExtensions\/String\/Inflections\/titlecase\">titlecase<\/a> it. Or perhaps the title is actually just the site&#8217;s name. Or perhaps it&#8217;s actually the text &#8220;Untitled,&#8221; which is frighteningly common. Or perhaps there just <em>isn&#8217;t<\/em> a title.<\/p>\n<p>And don&#8217;t even get me <em>started<\/em> on character encoding.<\/p>\n<p>Right now, the code to extract these few bits of information is 124 lines, and I don&#8217;t feel anywhere near ready to change its name away from &#8220;Parser::Crude&#8221; yet. And I&#8217;ve really only implemented basic functionality thus far. I still need to add oEmbed, which is a huge can of worms.<\/p>","protected":false},"excerpt":{"rendered":"<p>I&#8217;m toying with building a small library that will take a URL, load and parse it (using Mechanize and Loofah), and spit out a title, description\/summary, and the &#8216;main&#8217; image. 
Sometimes it&#8217;s a piece of cake, like when they have &hellip; <a href=\"https:\/\/blogs.n1zyy.com\/n1zyy\/2010\/12\/19\/standards-theyre-nice\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-3306","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/blogs.n1zyy.com\/n1zyy\/wp-json\/wp\/v2\/posts\/3306","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blogs.n1zyy.com\/n1zyy\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.n1zyy.com\/n1zyy\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.n1zyy.com\/n1zyy\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.n1zyy.com\/n1zyy\/wp-json\/wp\/v2\/comments?post=3306"}],"version-history":[{"count":0,"href":"https:\/\/blogs.n1zyy.com\/n1zyy\/wp-json\/wp\/v2\/posts\/3306\/revisions"}],"wp:attachment":[{"href":"https:\/\/blogs.n1zyy.com\/n1zyy\/wp-json\/wp\/v2\/media?parent=3306"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.n1zyy.com\/n1zyy\/wp-json\/wp\/v2\/categories?post=3306"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.n1zyy.com\/n1zyy\/wp-json\/wp\/v2\/tags?post=3306"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}