Subtly Bad Code

Alright, let’s have a little fun… I just added a new blog and went to include it on the main page, but my code failed, with the database throwing errors. The bug took me forever to find. I’m curious if others can find it.

I was further confused because the code worked fine until I added the new blog to the list of ones for it to use, and it was specifically built so that it wouldn’t matter how many blogs there were. It has a separate file that just lists blogs to include, and reads that file at runtime and builds a query to retrieve posts from all of them.

You need some background, first… All the posts are stored in a database, and each blog has its own table. I built this monster query, basically looking something like (Get most recent posts from blog 1) UNION (Get most recent posts from blog 2) UNION (…3…), and then tack an “ORDER BY…” onto the end. Credit for this idea goes to Andrew; I’d have never thought of it myself.
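To make the shape of that query concrete, here is a minimal sketch in Python (not the PHP from this post). The helper name, the field list, and the final ORDER BY column are assumptions for illustration only; the table naming and WHERE clause follow the code shown further down.

```python
def build_recent_posts_query(blog_ids, fields, count):
    """Build one UNION query over per-blog tables (hypothetical helper)."""
    selects = [
        f"(SELECT {fields} FROM wp_{int(blog_id)}_posts "
        f"WHERE post_status='publish' AND post_type='post' AND post_password='' "
        f"ORDER BY post_date DESC LIMIT {int(count)})"
        for blog_id in blog_ids
    ]
    # One parenthesised SELECT per blog, glued together with UNION.
    # The trailing ORDER BY column is an assumption; the post elides it.
    return "\nUNION\n".join(selects) + "\nORDER BY post_date DESC"


# Field names here are placeholders, not the contents of fields.inc.
query = build_recent_posts_query([2, 3, 4], "ID, post_title, post_date", 5)
print(query)
```

Each blog contributes one parenthesised SELECT, and the sort at the end applies to the combined result.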

What the list includes is blog IDs in the database. They ranged from 2 to 9, skipping 8 (which isn’t used). After a bout of spam registrations, the numbers got run up, so when I included the new one, it was numbered 51.

The code below (in PHP) calls some custom-rolled functions, but I’ll just say up front that the error does not depend on understanding how they work. Similarly, the answer does not have to do with caching in any way, so don’t get too hung up on the amount of code devoted to working with the cache. (And finally, I’m building one huge variable called $query the whole time, and then returning that variable… This isn’t a crucial thing to understand either, I just wanted to explain it since it’s somewhat of a bizarre practice. .= is PHP’s concatenating assignment operator: it appends the string on the right to the variable on the left.)

// $count is the number of blogs to pull out
function genRPQuery($count) {
  // Retrieve it from Memcache
  $query = getCachedObject("bigquery-$count");
  // It'll return NULL if it doesn't exist, so we check for that...
  if($query) return $query;

    // Since we're here, we didn't return, and
    // thus didn't get it out of the cache

    // Next two lines read in the files. blogList()
    // returns a list of the blogs -- it's little more than a
    // file read with caching enabled.
    $blogs = explode(',', rtrim(blogList(),"\n"));
    $fields = rtrim(cachedFile('./fields.inc',30), "\n");

    foreach ($blogs as $i) {
      // We have a loop for each blog
      // For unfamiliar eyes, .= is PHP's means of variable concatenation
      // We're building a ridiculously-long query, each one a SELECT, encased in
      // parens, and we UNION them all together...
      $query .= "(SELECT $fields FROM wp_" . $i . "_posts WHERE post_status='publish' AND post_type='post' AND post_password='' ORDER BY post_date DESC LIMIT $count)\n";
      // If we're not on the last one, insert a "UNION" in (see above)
      if($i

Remember, it worked fine when the list was blogs number "2,3,4,5,6,7,9" but the simple change to "2,3,4,5,6,7,9,51" causes it to blow up and try to run a query with invalid syntax. This made no sense to me, since the code was built to not care about things like that. I eventually found it, and feel like an idiot.

I've posted a hint in the comments... It's in the interest of fairness because I turned on some debugging and got the information I share. But it also really narrows your attention to a couple of lines, so I don't want to include it in the main post.

Spam

My e-mail setup right now for n1zyy.com and ttwagner.com consists of just forwarding all e-mail to GMail. It works fine, and the spam filters there have been pretty much 100% effective. However, it bothers me that I’m forwarding dozens, if not hundreds, of e-mails just to have them ignored. Some basic spam filtering should really take place on my server.

I made a few basic configuration changes to Postfix, the MTA I run. In a nutshell, I tell it to require stricter compliance with e-mail RFCs: e-mails with HELO addresses that don’t exist (or just don’t make sense), and clients sending multiple commands before the server replies to acknowledge them, for example, now get rejected. The default configuration errs very much on the side of ‘safety’ in accepting mail, but the trick is to tighten it down enough that you’ll reject mail that’s egregious spam, but not reject anything that could be from a legitimate mailserver. And that’s where I’m at.
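For anyone wanting to try something similar, a main.cf fragment along these lines covers the two checks described above. This is a sketch of standard Postfix restrictions, not a copy of my actual configuration, and the exact list is an assumption you should tune against your own mail logs.

```
# /etc/postfix/main.cf (illustrative fragment, not the author's real config)

# Refuse clients that skip HELO/EHLO entirely.
smtpd_helo_required = yes

# Reject HELO names that are malformed, not fully qualified,
# or that do not resolve in DNS.
smtpd_helo_restrictions =
    permit_mynetworks,
    reject_invalid_helo_hostname,
    reject_non_fqdn_helo_hostname,
    reject_unknown_helo_hostname

# Reject clients that send commands ahead of the server's replies.
smtpd_data_restrictions = reject_unauth_pipelining
```

Of these, reject_unknown_helo_hostname is the most likely to catch legitimate but misconfigured servers, so it is the first one to relax if real mail starts bouncing.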

I also installed SpamAssassin. I’m currently using it in conjunction with procmail, and wasn’t quite sure whether it was working. I set it up to make some changes to the headers, so that I can verify that it is. But I ran into a problem I never thought I’d have: I’m not getting enough spam. I’m sitting here eagerly awaiting some to see what happens. And it’s just not coming.
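One way to check the header changes without eyeballing every message is a small script that pulls out whatever X-Spam-* headers are present. This is a rough Python sketch; the sample message is made up, and the header names assume SpamAssassin’s default X-Spam-* naming rather than any custom header setup.

```python
from email import message_from_string


def spam_headers(raw_message):
    """Return the X-Spam-* headers found in a raw message, as a dict."""
    msg = message_from_string(raw_message)
    return {
        name: value
        for name, value in msg.items()
        if name.lower().startswith("x-spam-")
    }


# A made-up message, standing in for one that passed through the filter.
sample = (
    "From: someone@example.com\n"
    "Subject: hello\n"
    "X-Spam-Checker-Version: SpamAssassin (hypothetical sample)\n"
    "X-Spam-Status: No, score=0.1 required=5.0\n"
    "\n"
    "Body text.\n"
)

headers = spam_headers(sample)
print(headers.get("X-Spam-Status", "not scanned"))
```

If the dictionary comes back empty for mail that should have gone through procmail, the filter is not being invoked at all.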

The Definition of a Game

I’ve been playing Team Fortress 2 a bit lately. Not so much now that finals are here, but it’s a fun way to pass the time. When the mood strikes, I’ll sign on and play one of a few different classes of people… I’m usually either Pyro or Engineer. As Pyro, I go around with a flamethrower. It’s a good weapon, as it sets enemy forces on fire. It’s also the only sure-fire (no pun intended) way to see if someone who looks like a teammate is actually a spy disguised as a teammate: if they burn, they’re a spy. So I’ll run through my own team with the flamethrower periodically, since it’ll only damage enemies. As Engineer, I set up teleporters, so people that die and respawn can get to the front lines faster to keep up the attack. I also set up Dispensers, which dispense health points and ammunition. And, my favorite, the sentry guns, automated guns (which I usually upgrade to include rocket launchers), letting us cover areas without having to be there. They’re a good way to protect our assets.

The process of going through and doing all of this, to defend against enemies, is really fun. And, even though it’s just a game, there’s a certain sense of accomplishment when we capture the enemy bases.

Last night, I installed a new piece of software on my server, a web-based file manager. They have a wiki (powered by MediaWiki, the Wikipedia software), but it was overrun with spam. So I signed up and started reverting pages to spam-free versions. It’s a skill I picked up on Wikipedia, and it’s super-easy. In about 20 minutes’ time, I wiped out weeks, sometimes months, of spam. As time went on, I took out more and more. This morning they made me an admin, and I’m now deleting spam accounts, blocking persistent spam IPs, and deleting pages that are nothing but spam.

Is this a game? Volunteer work? Work? I’m not sure, but I’m getting at least as much enjoyment out of it as I get from setting enemies ablaze in Team Fortress 2. And I’m actually accomplishing something.

Ultimate Boot CD

Ultimate Boot CD saves the day again! This time, my 500 GB drive with lots of important stuff backed up to it randomly wasn’t being detected. Windows saw it as a raw, unformatted disk, and Linux wouldn’t mount it, citing disk problems.

Of course, I had some problems at first… It’s a 500 GB drive, which is greater than 137 GB. It’s also mounted over USB, thanks to this brilliant piece of technology. So DOS-based file tools were understandably a bit confused. I ended up throwing the disk in my old desktop machine, where it was used as a “real” IDE drive instead of a USB external drive. And it turns out that most of the programs can cope with it being 500GB.
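The 137 GB figure presumably refers to the old 28-bit LBA addressing limit, which is where that number usually comes from. The arithmetic, as a quick Python check:

```python
SECTOR_BYTES = 512        # traditional sector size
LBA28_SECTORS = 2 ** 28   # addressable sectors with 28-bit LBA

limit_bytes = LBA28_SECTORS * SECTOR_BYTES
print(limit_bytes)          # 137438953472
print(limit_bytes / 10**9)  # about 137.4 (decimal gigabytes)
```

A tool limited to 28-bit addressing can only reach the first 137 GB or so of a 500 GB disk.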

Of course, this is one of those classic problems where I have no idea what actually “fixed” it. I ran a bad block check (which takes forever on a 500GB disk!), and was actually somewhat irritated when it finished having found nary a bad block. But as I poked around looking at other options, I found that filesystem tools were showing me files on the drive. All my old data? Intact!

Seriously, burn yourself a copy of UBCD and keep it with your computers. It’ll save the day. Previously, I’ve used it to reset computer passwords for a professor, and to fix a broken (err, missing) bootloader.

The Time…

I’m pretty OCD and thus run an NTP server on this server. (It should respond to any hostname on this box.) Despite the server being in Texas, I keep the timezone set to EST.

So here’s a page displaying the time. Granted, having a clock that’s accurate down to a fraction of a second (synced to the atomic clock) is no longer that impressive. But tell me you’ve never wished for an easy way to find the correct time… Now you know.
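If you would rather ask an NTP server directly than load a page, a bare-bones SNTP query is short. The following Python sketch is an illustration, not how the page above is generated; the function names are made up, and the host argument is whatever hostname the server answers on.

```python
import socket
import struct

NTP_UNIX_OFFSET = 2208988800  # seconds between 1900-01-01 and 1970-01-01


def build_sntp_request():
    """48-byte client request: LI=0, version=3, mode=3 (client)."""
    return b"\x1b" + 47 * b"\0"


def parse_transmit_time(packet):
    """Return the server's transmit timestamp as Unix time (float seconds)."""
    seconds, fraction = struct.unpack("!II", packet[40:48])
    return seconds - NTP_UNIX_OFFSET + fraction / 2**32


def query_time(host, timeout=5.0):
    """Send one request to UDP port 123 on host and return its Unix time."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_sntp_request(), (host, 123))
        packet, _ = sock.recvfrom(48)
    return parse_transmit_time(packet)
```

This ignores network delay, so it is good to within the round-trip time rather than a true fraction of a second; a real NTP client corrects for that.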

memcached

In my continuing series of poking around at ways to improve performance…

I accidentally stumbled across something on memcached. The classic example is LiveJournal (which, incidentally, created memcached for their needs). It’s extraordinarily database-intensive, and spread across dozens of servers. For what they were doing, generating HTML pages didn’t make sense that often. So memcached does something creative: it creates a cache (in the form of a hash table) that works across a network. You might have 2GB of RAM to spare on your database server (actually, you shouldn’t?) and 1GB RAM you could use on each of 6 nodes. Voilà, 8 GB of cache. You modify your code to ask the cache for results, and, if you don’t get a result, then you go get it from the database (or whatever) as usual.

But what about situations like mine? I have one server. And I use MySQL query caching. But it turns out memcached is useful even then. (One argument for using it is that you can just run multiple clients on a single server to render moot any problems with using more than 4GB on a 32-bit system… But I’m not lucky enough to have problems with not being able to address my memory.)

MySQL’s query cache has one really irritating “gotcha”–it doesn’t cache TEXT and BLOB records, since they’re of variable length. Remembering that this is a blog, consisting of lots and lots of text, you’ll quickly see my problem: nearly every request is a cache miss. (This is actually an oversimplification: there are lots of less obvious queries benefiting, but I digress.) (WordPress complicates things by insisting on using the exact timestamp in each query, which also renders a query cache useless.) I just use SuperCache on most pages, to generate HTML caches, which brings a tremendous speedup.
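The timestamp problem is easy to see in miniature: a cache keyed on the query text only hits when the text is identical. In this Python sketch the query shape is invented for illustration and is not WordPress’s actual SQL; rounding to the minute is one possible workaround, not something WordPress does.

```python
from datetime import datetime


def posts_query(now):
    """Hypothetical query embedding an exact timestamp."""
    return f"SELECT * FROM wp_posts WHERE post_date <= '{now:%Y-%m-%d %H:%M:%S}'"


def posts_query_rounded(now):
    """Same query, with the timestamp truncated to the minute."""
    return posts_query(now.replace(second=0, microsecond=0))


a = datetime(2008, 5, 1, 12, 0, 1)
b = datetime(2008, 5, 1, 12, 0, 2)

print(posts_query(a) == posts_query(b))                  # False
print(posts_query_rounded(a) == posts_query_rounded(b))  # True
```

Two requests a second apart produce different query strings, so the second one can never reuse the first one’s cached result.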

But on the main page, I’m just hitting the database directly on each load. It holds up fine given the low traffic we have, but “no one uses it” isn’t a reason to have terrible performance. I’ve wanted to do some major revising anyway, so I think I’ll do a rewrite in my spare time and experiment with using memcached to improve performance.

Performance++

I guess I’ve become somewhat of a performance nut. Truthfully a lot of the time is spent doing things for nominal improvements: changing MySQL’s tmp directory to be in RAM has had no noticeable impact on performance, for example. Defragging log files doesn’t speed much up either.

I was reading a bit about LiteSpeed, though. It’s got a web GUI to control it, and is supposedly much faster than Apache. I’ve got it installed, but I’m having some permission issues right now. (The problem is that changing them will break Apache, so I’m going to have to try it with some insignificant pages first.) It’ll automatically build APC or eAccelerator in. It apparently has some improved security features, too, which is spiffy. And it’s compatible with Apache, so I don’t have to start from scratch.

The base version is free, too. (But not GPL.) The “Enterprise” edition is $349/year or $499 outright purchase. To me, it’s not worth it. But if I were a hosting company with many clients, I might be viewing it differently, especially if the performance is as good as they say.