Easy Backups on Linux

It’s good to keep backups, especially of servers in remote data centers using old hard drives.

rsync -vaEz --progress user@remote:/path /local/path

In my case, I’m doing it as root and just copying /, although, in hindsight, I think I should have used the –exclude=… option… It doesn’t make any sense to me to “back up” /proc or /dev, /tmp is iffy, and /mnt is usually not desired.

A few notes: I use –progress because otherwise it wants to just sit there, which is irritating.

-a is archive, which actually maps to a slew of options. -z enables compression. Note that this may or may not be desirable: on a fast link with a slower machine, this may do more harm than good. There’s also a –bwlimit argument that takes KB/sec as an argument. (–bwlimit=100 would be 100KB/sec, or 800kbps.)

Using rsync for backups is nothing new, but it’s still not used as widely as it could be. A seemingly-common option is to create a huge backup with tar, compress it, and then download the massive file. rsync saves you the overhead of making ludicrously-large backup files, and also lets you just download what’s changed, as opposed to downloading a complete image every time. It’s taking forever the first time, since I’m downloading about 40GB of content. But next time, it’ll be substantially quicker.

With backups this easy, everyone should be making backups frequently!

DNS Dork

The real geeks in the room already know what the root zone file is, but for those of you with lives… DNS (Domain Name Service) is the service that transforms names (blogs.n1zyy.com) into IPs (72.36.178.234). DNS is hierarchical: as a good analogy, think of there being a folder called “.com,” with entries for things like “amazon” and “n1zyy” (for amazon.com and n1zyy.com, two sites of very comparable importance.) Within the amazon ‘folder’ is a “www,” and within “n1zyy” is a “blogs,” for example. A domain name is really ‘backwards,’ then: if it were a folder on your hard drive, it would be something like C:.com.n1zyyblogs.

Of course, this is all spread out amongst many servers across the world. When you go to connect to blogs.n1zyy.com, you first need to find out how to query the .com nameservers. The root servers are the ones that give you this answer: they contain a mapping of what nameservers are responsible for each top-level domain (TLD), like .com, .org, and .uk.

So you get your answer for what nameservers process .com requests, and go to one of them, asking what nameserver is responsible for n1zyy.com. You get your answer and ask that nameserver who’s responsible for blogs.n1zyy.com, and finally get the IP your computer needs to connect to. And, for good measure, it probably gets cached, so that the next time you visit the site, you don’t have to go through the overhead of doing all those lookups again. (Of course, this all happens in the blink of an eye, behind the scenes.)

Anyway! The root zone file is the file that the root servers have, which spells out which nameservers handle which top-level domains.

Yours truly found the root zone file (it’s no big secret) and wrote a page displaying its contents, and a flag denoting the country of each of the nameservers. The one thing I don’t do is map each of the top-level domains to their respective country, since, in many cases, I don’t have the foggiest clue.

What’s interesting to note is that a lot of the data is just downright bizarre. Cuba has six nameservers for .cu. One is in Cuba, one in the Netherlands, and four are in the US. Fiji (.fj) has its first two nameservers… at berkeley.edu. American universities hosting foreign countries’ nameservers, however bizarre, isn’t new. .co (Colombia) has its first nameservers in Colombia (at a university there), but also has NYU and Columbia University (I think they did that just for the humor of Columbia hosting Colombia).

In other news, it turns out that there’s a list of country-to-ccTLD (Country-Code Top Level Domain) mappings. I’m going to work on incorporating this data… Maybe I can even pair it up with my IPGeo page with IP allocations per country…

Stomatron

I’ve been working on my resume as I seek to apply for a job that’s a neat blend of multiple interests–managing web projects (even in my preferred LAMP environment), politics, and even a management potential. And as I do it, I’m remembering all the stuff I did at FIRST, and reflecting on how much better it could be.

I was “fluent” in SQL at the time, but didn’t know some of the neater functions of MySQL. For example, when I wrote the web management interface to the Stomatron, I didn’t know that I could make MySQL calculate times. So I’d retrieve a sign-in and sign-out time and use some PHP code to calculate elapsed time. This wasn’t terrible, really, but it just meant that I did more work than was necessary.

More significantly, I didn’t know about the MySQL query cache. (Actually, I don’t know when it was introduced… This was five years ago.) Some of the queries were quite intense, and yet didn’t change all that often. This is exactly where the query cache is indicated.

Worse yet, I really didn’t do much with the idea of caching at all. Being the stats-freak that I am, I had a little info box showing some really neat stats, like the total number of “man hours” worked. As you can imagine, this is a computation that gets pretty intense pretty quickly, especially with 30+ people logging in and out every day, sometimes repeatedly. Query caching would have helped significantly, but some of this stuff could have been sped up in other ways, too, like keeping a persistent cache of this data. (Memcache is now my cache of choice, but APC, or even just an HTML file, would have worked well, too.)

And, 20/20 hindsight, I don’t recall ever backing up the Stomatron box. (I may well be wrong.) Especially since it and our backup server both ran Linux, it’d have been trivial to write a script to run at something like 3 a.m. (when none of us would be around to feel the potential slowdown) to have it do a database dump to our backup server. (MySQL replication would have been cool, but probably needless.) If I were doing it today, I’d also amend that script to employ our beloved dot-matrix logger, to print out some stats, such as cumulative hours per person, and maybe who worked that day. (Which would make recovery much easier in the event of a catastrophic data loss: we’d just take the previous night’s totals, and then replay (or, in this case, re-enter) the day’s login information.)

I’m not sure it was even mainstream back then, but our website could have used a lot of optimization, too. We were admittedly running up against a really slow architecture: I think it was a 300 MHz machine with 128MB RAM. With PostNuke, phpBB, and Gallery powering the site, every single pageload was being generated on the fly, and used a lot of database queries. APC or the like probably would have helped pretty well, but I have to wonder how things would have changed if we used MySQL query caching. Some queries (like WordPress’s insistence on using exact timestamps in every one) don’t benefit. I wonder if phpBB is like that. I have a feeling that at least the main page and such would have seen a speedup. We didn’t have a lot of memory to play with, but even 1MB of cache probably would have made a difference. As aged as the machine was, I think we could have squeezed more performance out of it.

I’m still proud of our scoring interface for our Lego League competition, though. I think Mr. I mentioned in passing a day or two before the competition that he wanted to throw something together in VB to show the score, but hadn’t had the time, or something of that sort. So Andy and I whipped up a PHP+MySQL solution after school that day, storing the score in MySQL and using PHP to retrieve results and calculate score, and then set up a laptop with IE to display the score on the projector. And since we hosted it on the main webserver, we could view it internally, but also permitted remote users to watch results. It was coded on such a short timeline that we ended up having to train the judges to use phpMyAdmin to put the scores in. And the “design requirements” we were given didn’t correctly state how the score was calculated, so we recoded the score section mid-competition.

I hope they ask me if I have experience working under deadlines.

Torrent Hosting

So I’m contemplating posting my BlueQuartz VMware image on VMware’s “Appliances” page, where it’d probably get a decent amount of downloads. I strongly doubt I’ll run into my bandwidth limit (it’d have to be downloaded about 3,000 times in a month), but I still don’t want to use bandwidth I don’t have to. When you’re distributing a big file to lots of people all of a sudden, BitTorrent is the perfect solution.

Unlike distributing, say, a bootleg movie, there’s an ‘official source’ for a lot of legitimate torrent hosting. This doesn’t mean anything in BitTorrent, but I think it should. The official source wants to ‘host’ it, but get people to help with bandwidth over BitTorrent.

There should be an easy way for them to host the file. Run a single command, pass it the file you want to distribute, and it’ll automatically create a .torrent file, register with some trackers (or host your own?), and begin seeding the file. In practice, this would probably take 10-15 minutes of work by hand. That’s pathetic.

There’s also a catch 22 at first: you want seeders (people who have the whole file and upload it to their peers), since, without them, no one can get the file. But you need a seeder before anyone can be a seeder. The obvious solution is to seed your own file, and this is how it’s done. But, as the ‘official’ distributor of a file, you don’t want to burn through bandwidth, so it makes sense that you’d want to throttle your available bandwidth: if there were lots of other seeders, you’d only use a small amount of bandwidth. By keeping the ‘server’ up as a permanent seeder, you alleviate the really annoying problem of no one having the full file, which, obviously, prevents anyone from ever getting it.  This is sort of a “long tail” problem: after the rush is over, you often end up with BitTorrent not being so awesome.  (And, if you set your throttled upload bandwidth to be inversely proportional to the number of seeders, when no one else is seeding it, there’s really no difference between someone downloading your file over BitTorrent and downloading it directly from your server.)

Of course, you’ll still have to distribute over FTP/HTTP, since not everyone can use BitTorrent. But, if you distribute it ‘normally’ over HTTP, you create an incentive for people to just download it from you, bypassing BitTorrent, which ruins the whole plan. So you also need to be able to throttle your bandwidth on those services, to make sure that it’s never faster than BitTorrent.

I really think there should be an all-in-one package to do this, so the host just runs a quick command on the server, and the file’s immediately being seeded on BitTorrent and available on HTTP/FTP. And for all of “us,” just think of situations that, say, Linux distributions must have with distributing large files.

This could even be a hosted service: a decent amount of people providing things like games have been smart enough to embrace BitTorrent. The market’s there. There’s just nowhere offering this.

Beat the Rush

In case anyone here is interested, I’m hosting a VMware Player image for BlueQuartz, the ‘modern’ GPL version of the old Cobalt RaQ software. A lot of people seem to want a VMware image. I was one of them, until I ended up just creating one on my own.

So grab it while it’s hot! (Read: grab it before I take the time to better throttle download speed.)

Fanboy

I’d gone  a while without ogling Apple products. So they came out today with some new products.

This is a neat idea. It’s their “Airport Extreme” wireless AP (with N-capability), but with a neat addition–a 500 or 1TB disk for wireless backups. Sure, the real geeks already have their Linux server in the basement with a RAID array of 500GB disks accessible over NFS and rsync, but Apple brings something cool into a nice little box, makes it work pretty seamlessly, and, get this–sells it at a fairly cheap price.  $500 for an 802.11N AP with an integrated 1TB backup fileserver?

Of course, I’d need a Mac machine to sync to it. But I’m already carrying so much stuff to class, I want something light! I guess I’d need the world’s thinnest laptop, the Apple MacBook Air.  Not only is it ridiculously small, but it takes the awesome MultiTouch technology from their iPod Touch / iPhone and applies it to the trackpad. 2GB standard, and if you don’t like the sound of your hard drive spinning, you could always opt for the 64GB solid state one. (Apparently at a cost of $1,000, though… But that’s what you pay for 64GB SSD drives right now.)

And they relaunched the AppleTV, without the suck this time. You can also do the much-rumoured movie rentals through iTunes.

Darn you, Apple! Today was supposed to be the day that I caught up on all the work I need to do!

Cool Stuff

  • FDC (FDCServers.net) has come a long way since I last dealt with them. (I remember back when they had a couple Cogent lines). They’ve now got 81 Gbps of connectivity.
  • Internap has long been the Internet provider when latency/speed matters. They basically buy lines from all the big providers, and peer with lots of the smaller ones, so that, unless your hosting company has their own private peering agreements, it’s basically impossible to find a shorter route. People hosting gameservers, or really just anything “high quality,” love Internap. I’ve seen prices in the $100-200 range for 1 Mbps. (This is purely for the transit: it’s all well and good to envision $100 for a 1 Mbps line to your house as good, but that’s not what it is. This is when you’re in a data center where they have a presence and run a line to them. The cost is just for them carrying your packets.)
  • FDC now has a 10 Gbps line to Internap. “Word on the street” is that Internap had some sort of odd promotion at $15/Mbps if you bought in bulk, and FDC wisely jumped, getting a 2 Gbps commit on a 10 Gbps line.
  • I’m working on getting Xen running on my laptop. It’s interested me for a long time–it’s a GPL’ed virtualization platform. You can use it on your desktop to experiment with various OSs inside VMs, but it’s also awesome on servers to run multiple virtual machines as virtual private servers.
  • Do you remember Cobalt RaQs? I distinctly remember ogling them and thinking they were the best things ever. (Of course, now we see them as 300 MHz machines…) It turns out that, when Cobalt went belly-up, they released a lot of the code under the GPL or similar. The BlueQuartz project is an active community-developed extension of that, and, combined with CentOS, it apparently runs well on “normal” computers now. (True, you don’t get the spiffy blue rackmount server or the spiffy LCD, but you do get to run it on something ten times as powerful.)
  • I’m still itching to host a TF2 server. I’ve found that they’re all either full or empty, with few in-betweens, and that a lot of them aren’t ‘adminned’ as tightly as I’d like: games like this seem to attract irritating people, and not enough servers kick/ban them.
  • cPanel seems to have come a distance since I last used that, too, and you can now license it for use just inside a VPS at $15/month.
  • Mailservers are hard to perfect. There are lots and lots of mediocre ones, but it’s rare to come across an excellent one, something that can deflect spam seamlessly, make it easy to add lots of addresses, and provide a nice web GUI. All of the technology’s out there, but for some reason, mailservers are among the hardest things in the world to configure. (Even my thermostat is easier to use!) Especially given my affinity for spamd, it’s no wonder that I’m so impressed with the Mailserver ‘appliance’ that Allard Consulting produces. It’s essentially all of the best things about mailservers (greylisting, whitelisting, SpamAssassin, Postfix with MySQL-based virtual domains, a spiffy web interface with graphs, Roundcube…), hosted on OpenBSD, coming as a pre-assembled ISO.
  • Computer hardware’s come a long way lately. I’d imagine it’d be fairly easy to assemble a machine with a good dual-core (or quad-core!) processor, 4 GB RAM, and a few 500 GB disks for around $1,000.
  • Colocation + 1,000 GB transfer on Internap at FDC is $169. (Or $199 for 5 Mbps unmetered, but that’s probably overkill.) Are you thinking what I’m thinking? (Hint: everything on this list indirectly leads to these last two point!)

Emulating spamd for HTTP

I won’t lie–I love OpenBSD’s spamd. In a nutshell, it’s a ‘fake’ mailserver. You set your firewall up to connect obvious spammers to talk to this instead of your real mailserver. It talks to them extremely slowly (1B/sec), which keeps them tied up for quite some time. (As an added bonus, it throws them an error at the end.)

One thing that really gets under my skin is bots (and malicious users) probing for URLs on the server that don’t exist. I get a lot of hits for /forum, /phpbb, /forums, /awstats… What they’re doing is probing for possible (very) outdated scripts that have holes allowing remote code execution.

It finally hit me: it’s really not that hard to build the same thing for HTTP. thttpd already supports throttling. (Note that its throttling had a more sane use in mind: limiting overall bandwidth to a specific URL, not messing with spammers and people pulling exploits, so it’s not exactly what we want, but it’ll do.)

Then you need a large file. I downloaded a lengthy novel from Project Gutenberg. It’s about 700 kB as uncompressed text. I could get much bigger files, yes. But 700 kB is plenty. More on this later.

It’s also helpful to use Apache and mod_rewrite on your ‘real’ server. You can work around it if you have to.

Set up your /etc/thttpd/throttle.conf:

**    16

Note that, for normal uses, this is terrible. This rule effectively says, “Limit the total server (**) to 16 (bytes per second).” By comparison, a 56K dialup line is about 7,000 bytes per second (or 56,000 bits per second).

Rudimentary tests show that having one client downloading a 700 kB file at 16B/sec places pretty much no load on the server (load average remained 0.00, and thttpd doesn’t even show up in the section of top that I can see), so I’m not concerned about overhead.

You can also set up your thttpd.conf as needed. No specific requirements there. Start it up with something like thttpd -C /etc/thttpd/thttpd.conf -d /var/www/maintenance/htdocs/slow -t /etc/thttpd/throttle.conf (obviously, substituting your own directories and file names! Note that the /slow is just the directory I have it serving out of, not any specific naming convention.)

Now what we need to do is start getting some of our mischievous URL-probers into this. I use some mod_rewrite rules on my ‘real’ Apache server:

# Weed out some more evil-doers
RewriteRule ^forum(.*)$ http://ttwagner.com:8080/20417.txt [NC,L]
RewriteRule ^phpbb(.*)$ http://ttwagner.com:8080/20417.txt [NC,L]
RewriteRule ^badbots(.*)$ http://ttwagner.com:8080/20417.txt [NC,L]
RewriteRule ^awstats(.*)$ http://ttwagner.com:8080/20417.txt [NC,L]

In a nutshell, I redirect any requests starting with “forum,” “phpbb,” “badbots,” or “awstats” to an enormous text file. I’m not sure if escaping the colon is strictly necessary, but it has the added benefit of ‘breaking’ the link when pasted, say, here: I don’t want anyone getting caught up in this unless they’re triggering it. I tend each with (.*), essentially matching everything. You may or may not see this as desirable. I like it, since /forum and /forums are both requested, and so forth. You could take that out if necessary. The [NC,L] is also useful in terms of, well, making anything work.

I want to watch and see whether anyone gets caught up in this. Since it’s technically passing the request to a different webserver (thttpd), it has to tell the client to connect to that, as opposed to seamlessly serving it up. I don’t know if the bots are smart (dumb?) enough to follow these redirects or not.

Note that /badbots doesn’t really exist. I inserted it into my robots.txt file, having heard that some ‘bad ‘bots (looking for spam, etc.) crawl any directory you tell them not to. I wondered if this was accurate.

The ending is quite anticlimactic: we wait not-so-patiently to see what ends up in the logfile.

Spam

So my new policy is to keep spam ‘on file’ for three days. It’s filed away as spam so no one sees it, but it’s good for analysis and such, to protect against future spam. Several times a day, I run a little script to delete spam older than three days and optimize the tables, to keep things running fast.

So this table is particularly telling of the spam problem. Akismet is catching just about all of it, so it’s not a big problem for me per se, but the fact remains that, with three days of spam and something like nine months of legitimate comments, spam accounts for right around two-thirds of all comments on my blog. Wow-a-wee-wow!

Geolocation

The concept of matching an IP to a country is known as IP geolocation, often just “IPGeo” or “GeoIP.” There are lots of reasons for using IP geolocation, ranging from the mundane (identifying countries in your webserver logfiles) to the questionable (banning countries from your server to cut down on spam) to the neat (doing it at firewall/router level and redirecting a user to the closest data center).

Most of the work is just done on a country level. You take an IP (72.36.178.234, my server) and look it up in a database, and get “UNITED STATES” as an answer. There do exist databases on finer levels, down to the city, but they’re expensive and often wrong. (I keep getting ads to find hot singles in Mashpee, more than 100 miles away and in a different state… Or maybe it’s Mattapan. Whatever the case, they’re not even close.)

It turns out that you can download a free database of IP-country mappings. It’s not infallible, but they say it’s 98% accurate. The database itself won’t do you any good. It’s a compressed CSV (comma-separated variable).

In the comments section here, there’s a snippet of PHP code to take the CSV and convert it to a huge series of SQL inserts, which you input into a database… (Hint: for whatever reason, his preg_match is imperfect and leaves a few instances of the word “error” in the middle of the file. It’s probably a bad idea, but I just commented out the “echo error” line. I end up with a 5.7MB SQL query. You can also just download the thing directly here (warning: 5.7 MB SQL file). Note that, per the license terms, I disclose in the comments that it’s a derivative work of their CSV file.

The other important catch is that IPs are stored as long integers, not ‘normal’ IPs. You’ll presumably want to use PHP + MySQL to get the country associate with PHP, so I’ll provide pseudocode in a minute. PHP provides an ip2long() function, but it only takes you halfway, but leaves you with sign problems. (Argh!) It’s an easy fix, though, and you want something like the following:

$long = sprintf("%u", ip2long($ip));
$query = "SELECT a2,a3,country FROM ip2c WHERE start <= $long AND end >= $long";

You then, of course, run $query and parse through it… You get 2- and 3-letter country codes, as well as the full country name. I use it, with good results, in seeing what country comment spam is coming from. (Most of it comes from the US.)

A MySQL query isn’t the proper way to do this: there exist binary files with the same data that result in faster lookups. But this is the simplest way to start doing IP geolocation in ten minutes time, and, with the query cache enabled, there’s not a ton of overhead.

I’m tempted to write some scripts to allow people to ‘browse’ the database, either looking up an IP, or to view it by country.

Update: Weird Silence has a binary implementation of this same database that’s supposedly much faster. The main page is here, the PHP one is here, and the C one is (t)here. (I’m wondering if it makes sense to write a PHP script to call the C version, and what the performance implications would be?)

Update 2: Get your country flags here.