Get a (Virtual) Life

Amid wrestling with getting Xen working (its kernel doesn’t play nicely with my video drivers… oh how I hate closed-source drivers), I downloaded VMware Player. It’s free.

I first downloaded a VMware image of Mailserver by Allard Consulting. Quick review: I’ve never used it in a ‘real’ environment to send or receive e-mail (and I screwed up VMware’s networking, making things worse), but it seems extremely impressive. The one thing I have realized is that my much-raved-about spamd is very irritating if you try to telnet to port 25 to ‘test’ the mailserver. If I had a colocated server hosting multiple VPSs (cough, cough), I think I’d buy the ‘real deal’ from them and use this as my mailserver.

But I think I’m going to get entirely distracted with virtual machines tonight. I’m running the latest and greatest version of Ubuntu, 7.10, code-named “Gutsy Gibbon.” But 8.04, code-named “Hardy Heron,” is in early testing, and you can grab an image of it. (You can also run it on your desktop; it’s in no way ‘proprietary.’ But a lot of us aren’t hardcore enough to want to run bleeding-edge alpha code as our main OS.)

I’ve mentioned before that I was somewhat interested in the $300 PCs that Walmart was selling. They came with Linux, apparently something Google partnered with them on, dubbing the desktop environment “gOS.” (The machine also draws insanely low power.) Lo and behold, it’s out there as a VMware image. (I was also able to play around with the One Laptop per Child (OLPC) image in VMware.)

Oh, and Solaris anyone?

Fanboy

I’d gone a while without ogling Apple products. Then today they came out with some new ones.

This is a neat idea. It’s their “Airport Extreme” wireless AP (with N-capability), but with a clever addition–a 500 GB or 1 TB disk for wireless backups. Sure, the real geeks already have their Linux server in the basement with a RAID array of 500 GB disks accessible over NFS and rsync, but Apple brings something cool into a nice little box, makes it work pretty seamlessly, and, get this–sells it at a fairly cheap price. $500 for an 802.11n AP with an integrated 1 TB backup fileserver?

Of course, I’d need a Mac to sync to it. But I’m already carrying so much stuff to class, I want something light! I guess I’d need the world’s thinnest laptop, the Apple MacBook Air. Not only is it ridiculously small, but it takes the awesome Multi-Touch technology from their iPod Touch / iPhone and applies it to the trackpad. 2 GB of RAM standard, and if you don’t like the sound of your hard drive spinning, you could always opt for the 64 GB solid-state one. (Apparently at a cost of $1,000, though… But that’s what you pay for 64 GB SSDs right now.)

And they relaunched the Apple TV, without the suck this time. You can also do the much-rumored movie rentals through iTunes.

Darn you, Apple! Today was supposed to be the day that I caught up on all the work I need to do!

Cool Stuff

  • FDC (FDCServers.net) has come a long way since I last dealt with them. (I remember back when they had a couple Cogent lines). They’ve now got 81 Gbps of connectivity.
  • Internap has long been the Internet provider when latency/speed matters. They basically buy lines from all the big providers, and peer with lots of the smaller ones, so that, unless your hosting company has their own private peering agreements, it’s basically impossible to find a shorter route. People hosting gameservers, or really just anything “high quality,” love Internap. I’ve seen prices in the $100-200 range for 1 Mbps. (This is purely for the transit: it’s all well and good to envision $100 for a 1 Mbps line to your house as good, but that’s not what it is. This is when you’re in a data center where they have a presence and run a line to them. The cost is just for them carrying your packets.)
  • FDC now has a 10 Gbps line to Internap. “Word on the street” is that Internap had some sort of odd promotion at $15/Mbps if you bought in bulk, and FDC wisely jumped, getting a 2 Gbps commit on a 10 Gbps line.
  • I’m working on getting Xen running on my laptop. It’s interested me for a long time–it’s a GPL’ed virtualization platform. You can use it on your desktop to experiment with various OSs inside VMs, but it’s also awesome on servers to run multiple virtual machines as virtual private servers.
  • Do you remember Cobalt RaQs? I distinctly remember ogling them and thinking they were the best things ever. (Of course, now we see them as 300 MHz machines…) It turns out that, when Cobalt went belly-up, they released a lot of the code under the GPL or similar. The BlueQuartz project is an active community-developed extension of that, and, combined with CentOS, it apparently runs well on “normal” computers now. (True, you don’t get the spiffy blue rackmount server or the spiffy LCD, but you do get to run it on something ten times as powerful.)
  • I’m still itching to host a TF2 server. I’ve found that they’re all either full or empty, with few in-betweens, and that a lot of them aren’t ‘adminned’ as tightly as I’d like: games like this seem to attract irritating people, and not enough servers kick/ban them.
  • cPanel seems to have come a distance since I last used that, too, and you can now license it for use just inside a VPS at $15/month.
  • Mailservers are hard to perfect. There are lots and lots of mediocre ones, but it’s rare to come across an excellent one, something that can deflect spam seamlessly, make it easy to add lots of addresses, and provide a nice web GUI. All of the technology’s out there, but for some reason, mailservers are among the hardest things in the world to configure. (Even my thermostat is easier to use!) Especially given my affinity for spamd, it’s no wonder that I’m so impressed with the Mailserver ‘appliance’ that Allard Consulting produces. It’s essentially all of the best things about mailservers (greylisting, whitelisting, SpamAssassin, Postfix with MySQL-based virtual domains, a spiffy web interface with graphs, Roundcube…), hosted on OpenBSD, coming as a pre-assembled ISO.
  • Computer hardware’s come a long way lately. I’d imagine it’d be fairly easy to assemble a machine with a good dual-core (or quad-core!) processor, 4 GB RAM, and a few 500 GB disks for around $1,000.
  • Colocation + 1,000 GB transfer on Internap at FDC is $169. (Or $199 for 5 Mbps unmetered, but that’s probably overkill.) Are you thinking what I’m thinking? (Hint: everything on this list indirectly leads to these last two points!)

Emulating spamd for HTTP

I won’t lie–I love OpenBSD’s spamd. In a nutshell, it’s a ‘fake’ mailserver. You set your firewall up so that obvious spammers talk to it instead of your real mailserver. It talks to them extremely slowly (1 byte/sec), which keeps them tied up for quite some time. (As an added bonus, it throws them an error at the end.)

One thing that really gets under my skin is bots (and malicious users) probing for URLs on the server that don’t exist. I get a lot of hits for /forum, /phpbb, /forums, /awstats… What they’re doing is probing for possible (very) outdated scripts that have holes allowing remote code execution.

It finally hit me: it’s really not that hard to build the same thing for HTTP. thttpd already supports throttling. (Note that its throttling was designed with a saner use in mind: limiting overall bandwidth to a specific URL, not messing with spammers and people pulling exploits. So it’s not exactly what we want, but it’ll do.)

Then you need a large file. I downloaded a lengthy novel from Project Gutenberg. It’s about 700 kB as uncompressed text. I could get much bigger files, yes. But 700 kB is plenty. More on this later.

It’s also helpful to use Apache and mod_rewrite on your ‘real’ server. You can work around it if you have to.

Set up your /etc/thttpd/throttle.conf:

**    16

Note that, for normal uses, this is terrible. This rule effectively says, “Limit the total server (**) to 16 (bytes per second).” By comparison, a 56K dialup line is about 7,000 bytes per second (or 56,000 bits per second).

Rudimentary tests show that having one client downloading a 700 kB file at 16B/sec places pretty much no load on the server (load average remained 0.00, and thttpd doesn’t even show up in the section of top that I can see), so I’m not concerned about overhead.
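To put that 16 B/sec figure in perspective, here’s the back-of-envelope math (a quick Python sketch; all the numbers come from above):

```python
# How long does the tarpit tie a client up, given the ~700 kB
# Project Gutenberg text and the 16 B/sec throttle above?
FILE_SIZE = 700 * 1024      # bytes (~700 kB)
RATE = 16                   # bytes per second (from throttle.conf)

seconds = FILE_SIZE / RATE
hours = seconds / 3600
print(f"{seconds:.0f} seconds, or about {hours:.1f} hours")
# 44800 seconds, or about 12.4 hours
```

So a bot that dutifully downloads the whole file stays on the hook for about half a day.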

You can also set up your thttpd.conf as needed. No specific requirements there. Start it up with something like thttpd -C /etc/thttpd/thttpd.conf -d /var/www/maintenance/htdocs/slow -t /etc/thttpd/throttle.conf (obviously, substituting your own directories and file names! Note that the /slow is just the directory I have it serving out of, not any specific naming convention.)

Now what we need to do is start getting some of our mischievous URL-probers into this. I use some mod_rewrite rules on my ‘real’ Apache server:

# Weed out some more evil-doers
RewriteRule ^forum(.*)$ http://ttwagner.com:8080/20417.txt [NC,L]
RewriteRule ^phpbb(.*)$ http://ttwagner.com:8080/20417.txt [NC,L]
RewriteRule ^badbots(.*)$ http://ttwagner.com:8080/20417.txt [NC,L]
RewriteRule ^awstats(.*)$ http://ttwagner.com:8080/20417.txt [NC,L]

In a nutshell, I redirect any requests starting with “forum,” “phpbb,” “badbots,” or “awstats” to an enormous text file. I’m not sure if escaping the colon is strictly necessary, but it has the added benefit of ‘breaking’ the link when pasted, say, here: I don’t want anyone getting caught up in this unless they’re triggering it. I end each with (.*), essentially matching everything. You may or may not see this as desirable. I like it, since /forum and /forums are both requested, and so forth. You could take that out if necessary. The [NC,L] flags make the match case-insensitive (NC) and stop processing at the first matching rule (L).
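If you want to sanity-check what those patterns will and won’t trap before deploying them, here’s a rough Python equivalent (re.IGNORECASE mirrors the [NC] flag; this is only an approximation of mod_rewrite’s matching, not the real thing):

```python
# Approximate the ^forum(.*)$ style rules with Python regexes.
import re

PATTERNS = [re.compile(p, re.IGNORECASE)
            for p in (r"^forum(.*)$", r"^phpbb(.*)$",
                      r"^badbots(.*)$", r"^awstats(.*)$")]

def is_trapped(path):
    """Would this request path (no leading slash) hit the tarpit?"""
    return any(p.match(path) for p in PATTERNS)

print(is_trapped("forums/index.php"))  # True  (the (.*) catches /forums too)
print(is_trapped("about"))             # False
```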

I want to watch and see whether anyone gets caught up in this. Since it’s technically passing the request to a different webserver (thttpd), it has to tell the client to connect to that, as opposed to seamlessly serving it up. I don’t know if the bots are smart (dumb?) enough to follow these redirects or not.

Note that /badbots doesn’t really exist. I inserted it into my robots.txt file, having heard that some ‘bad’ bots (looking for spam, etc.) crawl exactly the directories you tell them not to. I wondered if this was accurate.

The ending is quite anticlimactic: we wait not-so-patiently to see what ends up in the logfile.
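In the meantime, here’s a quick way to tally any victims. This assumes thttpd is logging requests in Common Log Format to a file you can read — the path and format here are assumptions, so adjust for your own syslog setup:

```python
# Count client IPs that requested the bait file, assuming Common Log
# Format lines in something like /var/log/thttpd.log (hypothetical path).
import re
from collections import Counter

BAIT = "20417.txt"  # the Gutenberg text the tarpit serves

def caught_clients(lines):
    """Return a Counter of client IPs that requested the bait file."""
    hits = Counter()
    for line in lines:
        m = re.match(r'(\S+) \S+ \S+ \[.*?\] "[A-Z]+ (\S+)', line)
        if m and BAIT in m.group(2):
            hits[m.group(1)] += 1
    return hits

sample = ['1.2.3.4 - - [20/Jan/2008:03:14:15 -0500] "GET /20417.txt HTTP/1.0" 200 716800']
print(caught_clients(sample))  # Counter({'1.2.3.4': 1})
```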

Spam

So my new policy is to keep spam ‘on file’ for three days. It’s filed away as spam so no one sees it, but it’s good for analysis and such, to protect against future spam. Several times a day, I run a little script to delete spam older than three days and optimize the tables, to keep things running fast.
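For the curious, the script doesn’t amount to much. A minimal sketch of the idea in Python, assuming a WordPress-style wp_comments table (where comment_approved = 'spam' marks spam) — your schema may well differ:

```python
# Generate the two maintenance queries: delete spam older than N days,
# then optimize the table. Table/column names assume WordPress.
from datetime import datetime, timedelta

def prune_queries(days=3, now=None):
    """Return the SQL statements for one pruning run."""
    cutoff = (now or datetime.now()) - timedelta(days=days)
    return [
        "DELETE FROM wp_comments WHERE comment_approved = 'spam' "
        f"AND comment_date < '{cutoff:%Y-%m-%d %H:%M:%S}'",
        "OPTIMIZE TABLE wp_comments",
    ]

for q in prune_queries():
    print(q)
```

Feed the output to mysql from cron a few times a day and you’re done.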

The numbers here are particularly telling of the spam problem. Akismet is catching just about all of it, so it’s not a big problem for me per se, but the fact remains that, with three days of spam and something like nine months of legitimate comments, spam accounts for right around two-thirds of all comments on my blog. Wow-a-wee-wow!

Geolocation

The concept of matching an IP to a country is known as IP geolocation, often just “IPGeo” or “GeoIP.” There are lots of reasons for using IP geolocation, ranging from the mundane (identifying countries in your webserver logfiles) to the questionable (banning countries from your server to cut down on spam) to the neat (doing it at firewall/router level and redirecting a user to the closest data center).

Most of the work is just done on a country level. You take an IP (72.36.178.234, my server) and look it up in a database, and get “UNITED STATES” as an answer. There do exist databases on finer levels, down to the city, but they’re expensive and often wrong. (I keep getting ads to find hot singles in Mashpee, more than 100 miles away and in a different state… Or maybe it’s Mattapan. Whatever the case, they’re not even close.)

It turns out that you can download a free database of IP-country mappings. It’s not infallible, but they say it’s 98% accurate. The raw database by itself won’t do you much good, though: it’s a compressed CSV (comma-separated values) file.

In the comments section here, there’s a snippet of PHP code to take the CSV and convert it to a huge series of SQL inserts, which you feed into a database. (Hint: for whatever reason, his preg_match is imperfect and leaves a few instances of the word “error” in the middle of the file. It’s probably a bad idea, but I just commented out the “echo error” line.) I end up with a 5.7 MB SQL query. You can also just download the thing directly here (warning: 5.7 MB SQL file). Note that, per the license terms, I disclose in the comments that it’s a derivative work of their CSV file.

The other important catch is that IPs are stored as long integers, not ‘normal’ dotted-quad IPs. You’ll presumably want to use PHP + MySQL to look up the country for an IP, so I’ll provide pseudocode in a minute. PHP provides an ip2long() function, but it only takes you halfway: on 32-bit systems it can return negative (signed) values. (Argh!) It’s an easy fix, though, and you want something like the following:

$long = sprintf("%u", ip2long($ip)); // force an unsigned representation
$query = "SELECT a2,a3,country FROM ip2c WHERE start <= $long AND end >= $long";

You then, of course, run $query and parse through it… You get 2- and 3-letter country codes, as well as the full country name. I use it, with good results, in seeing what country comment spam is coming from. (Most of it comes from the US.)
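For comparison, here’s the same unsigned conversion in Python (illustration only — the code above is PHP): the standard ipaddress module sidesteps the sign problem entirely.

```python
# Convert a dotted-quad IP to the unsigned long integer the ip2c
# database keys on. No sign games needed in Python.
import ipaddress

def ip_to_long(ip):
    return int(ipaddress.IPv4Address(ip))

print(ip_to_long("72.36.178.234"))  # 1210364650
```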

A MySQL query isn’t the proper way to do this: there exist binary files with the same data that allow much faster lookups. But this is the simplest way to start doing IP geolocation in ten minutes’ time, and, with the query cache enabled, there’s not a ton of overhead.
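If you ever outgrow the MySQL approach, the same idea works in memory: sort the ranges and binary-search them. A Python sketch, with made-up ranges for illustration:

```python
# Binary-search sorted (start, end, country) ranges instead of
# querying MySQL per lookup. The ranges below are invented examples.
import bisect

RANGES = [
    (16777216, 16777471, "AU"),
    (1210064896, 1210589183, "US"),
]
STARTS = [r[0] for r in RANGES]

def lookup(ip_long):
    """Return the country code for an integer IP, or None."""
    i = bisect.bisect_right(STARTS, ip_long) - 1
    if i >= 0 and RANGES[i][0] <= ip_long <= RANGES[i][1]:
        return RANGES[i][2]
    return None

print(lookup(1210364650))  # US
```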

I’m tempted to write some scripts to allow people to ‘browse’ the database, either looking up an IP, or to view it by country.

Update: Weird Silence has a binary implementation of this same database that’s supposedly much faster. The main page is here, the PHP one is here, and the C one is (t)here. (I’m wondering if it makes sense to write a PHP script to call the C version, and what the performance implications would be?)

Update 2: Get your country flags here.

Amazon S3

I really didn’t pay it much attention, or think about its full potential, at the time it was released. But Amazon’s Simple Storage Service (hence the “S3”) is really pretty neat. In a nutshell, it’s file hosting on Amazon’s proven network infrastructure. (When have you ever seen Amazon offline?) They provide HTTP and BitTorrent access to files.

Their charges do add up — it might cost a few hundred dollars a month to move a terabyte of data and store 80GB of content. But then again, the reliability (and scalability!) is probably much greater than what I can handle, and it’s apparently much cheaper than it would be to host it with a ‘real’ CDN service.
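The arithmetic behind that “few hundred dollars” guess, using ballpark per-GB rates (the rates below are assumptions for illustration — check Amazon’s current price sheet):

```python
# Rough monthly S3 bill: storage plus outbound transfer.
STORAGE_PER_GB = 0.15    # $/GB-month, assumed rate
TRANSFER_PER_GB = 0.18   # $/GB transferred out, assumed rate

def monthly_cost(storage_gb, transfer_gb):
    return storage_gb * STORAGE_PER_GB + transfer_gb * TRANSFER_PER_GB

# 80 GB stored, ~1 TB moved:
print(f"${monthly_cost(80, 1000):.2f}")  # $192.00
```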

Sadly, I can’t think of a good use for this service. I suppose the average person really doesn’t need to hire a company to provide mirrors of their files for download. (It would make an awesome mirror for Linux/BSD distributions, but I think the typical mirror is someone with a lot of spare bandwidth and an extra server, not someone paying hundreds a month to mirror files for other people… I wonder if there’s a market for a ‘premium’ mirror service? I doubt it, since the existing ones seem to work fine?)

Bad Bone Weights

I decided the other day that I ought to try to start my own Team Fortress 2 server. (Actually, I tried long ago; I’d hoped the problem had been fixed since, but it hasn’t.) I want to share the cause of the problem in the hopes of helping others, since Google usually picks these things up.

You spend forever downloading the Half-Life Dedicated Server (HLDS), and excitedly fire it up. It runs through some stuff and seems to be working, but then you get a whole bunch of bizarre errors scrolling by:

Bad data found in model "dispenser_toolbox.dmx" (bad bone weights)
Bad data found in model "dispenser_gib1.smd" (bad bone weights)
Bad data found in model "dispenser_gib2.smd" (bad bone weights)
Bad data found in model "dispenser_gib3.smd" (bad bone weights)
Bad data found in model "dispenser_gib4.smd" (bad bone weights)
Bad data found in model "dispenser_gib5.smd" (bad bone weights)

What’s been surmised is that it’s because your processor doesn’t support SSE2. Bah! There’s no fix, either, other than pleading with Steam to write a version that doesn’t require SSE2, or upgrading your CPU.

It’s clearly time to build a new server and colocate it. 😉

Datacenter Fiend

No matter what I do, I keep finding myself thinking about webhosting.

Netcraft does a monthly survey of hosts with the top uptime, and mentioned that DataPipe is usually on top. I’ve found that, at least for what I do, any “real” data center has just about 100% uptime. I have never not been able to reach my server. You’re either with a notoriously bad host (for example, when Web Host “Plus” bought out Dinix, they took the servers offline for a few days with no notice… that’s noticeable downtime), or you’re with a reputable host where downtime just doesn’t really happen.

So 0.00% downtime, as opposed to 0.01%, isn’t a huge deal for me. (That doesn’t mean it’s not impressive.) But what impressed me about DataPipe is that I clicked their link and their webpage just appeared. No loading in the slightest. I browsed their site, and there was never any waiting. I might as well have had the page cached on my computer, except I know it’s not cached anywhere.

Their data center is in New Jersey, but they clearly have excellent peering. I’m getting 20ms pings. They don’t (directly, at least) offer dedicated hosting, VPS hosting, or shared hosting.

One of my big concerns is that I wonder about long-term viability. The market’s full of hosts. A lot of them are “kiddie hosts,” inexperienced people just reselling space often with poor quality. That’s room for competition. But the problem is that there are hosts selling the moon: 200 GB of disk space and 3 terabytes of bandwidth for $5 a month? That’s ludicrous: that’s more than I get with my dedicated server! They can get away with it because no one uses that much, but it concerns more “honest” hosts–you’d have to charge ten times as much if everyone actually used it! But for hosts that offer, say, 1GB of space and 10 GB of transfer–a ‘realistic’ amount–they’re left vulnerable to people thinking they’re getting a better deal.

I realized the other day that, while a lot of people offer VPS (virtual private server: several people share a server, but software ‘partitions’ give each of them their own server software-wise, with root access and separation from other users), I’m really not aware of any good ones. It’s also hard to find any that offer significant amounts of disk space, or any that are particularly cheap.

YouTube

One of the many things I try to shy away from is making generalizations. They’re often harmful and downright inaccurate.

But one generalization I do feel comfortable making is that the comments on YouTube are among the worst I’ve ever seen. Even the few that are coherent tend to contain egregious grammar problems. I’m not talking about a missing comma. im talkin about like riteing like this i mean its so dummm why do they do this its like their never lurnd 2 right

Those are the good ones. The bad ones are offensive, pointless (“i like this video so much1!111”), or just downright bizarre. (In the video to one of my favorite songs, you barely see The Killers at all, yet someone left a comment that they love videos like this one where you can see the band playing the whole time.)

I want to know why this is the case. There are some sites (Digg, Slashdot) where there are some dumb comments. But YouTube is notoriously bad. Hilariously so. Except it’s gone way past hilarious, to the point of being irritating and kind of depressing. Is it a demographic thing? Is it swamped by 13-year-olds? (With apologies to 13-year-olds, who probably far exceed the average commenter on YouTube.) Is it a broken windows type thing, where people leave stupid comments because everyone else does?

YouTube recently implemented a rating system, where you can give a thumbs-up or thumbs-down to comments. Good idea. Except it really doesn’t work! For one, they made my classic mistake, but in reverse: they clearly never tested in Firefox (well, Flock or Firefox 3, but Flock is basically Firefox with some more addons and a fancy theme). But that’s not my point. A comment might be voted up or down a couple of points, but that’s all. There’s no suppression of comments, and the comments remain in chronological order, so comment moderation is pretty pointless.