Easy Backups on Linux

It’s good to keep backups, especially of servers in remote data centers using old hard drives.

rsync -vaEz --progress user@remote:/path /local/path

In my case, I’m doing it as root and just copying /, although, in hindsight, I think I should have used the --exclude=… option… It doesn’t make any sense to me to “back up” /proc or /dev, /tmp is iffy, and /mnt is usually not desired.

A few notes: I use --progress because otherwise it wants to just sit there, which is irritating.

-a is archive, which actually maps to a slew of options. -z enables compression. Note that this may or may not be desirable: on a fast link with a slower machine, this may do more harm than good. There’s also a --bwlimit option that takes a rate in KB/sec. (--bwlimit=100 would be 100KB/sec, or 800kbps.)
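Putting those pieces together, the command I probably should have run looks something like this (the local path and the rate cap are just examples):

rsync -vaEz --progress \
    --exclude=/proc --exclude=/dev --exclude=/tmp --exclude=/mnt \
    --bwlimit=100 \
    root@remote:/ /local/backup/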

Using rsync for backups is nothing new, but it’s still not used as widely as it could be. A seemingly common approach is to create a huge backup with tar, compress it, and then download the massive file. rsync saves you the overhead of making ludicrously large backup files, and also lets you download just what’s changed, as opposed to downloading a complete image every time. It’s taking forever the first time, since I’m downloading about 40GB of content. But next time, it’ll be substantially quicker.

With backups this easy, everyone should be making backups frequently!

Stomatron

I’ve been working on my resume as I apply for a job that’s a neat blend of multiple interests: managing web projects (even in my preferred LAMP environment), politics, and even some management potential. And as I do it, I’m remembering all the stuff I did at FIRST, and reflecting on how much better it could have been.

I was “fluent” in SQL at the time, but didn’t know some of the neater functions of MySQL. For example, when I wrote the web management interface to the Stomatron, I didn’t know that I could make MySQL calculate times. So I’d retrieve a sign-in and sign-out time and use some PHP code to calculate elapsed time. This wasn’t terrible, really, but it just meant that I did more work than was necessary.
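These days, I’d just let MySQL do the math with something like TIMESTAMPDIFF(). A sketch of what I mean, run through the command-line client; the table and column names here are made up:

mysql stomatron -e "
    SELECT person,
           sign_in,
           sign_out,
           TIMESTAMPDIFF(MINUTE, sign_in, sign_out) AS minutes_worked
    FROM   attendance
    WHERE  sign_out IS NOT NULL;"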

More significantly, I didn’t know about the MySQL query cache. (Actually, I don’t know when it was introduced… This was five years ago.) Some of the queries were quite intense, and yet their results didn’t change all that often. That’s exactly the situation the query cache is made for.
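For reference, turning it on is only a couple of statements. The 16MB figure is arbitrary, and this assumes a MySQL new enough to have the query cache at all (4.0-ish and later, I believe):

# Give the query cache some memory; with the default query_cache_type,
# just giving it a size turns it on, and identical SELECTs start getting
# served from the cache. The same setting can go in my.cnf to survive a restart.
mysql -u root -p -e "SET GLOBAL query_cache_size = 16777216;"

# Watch Qcache_hits climb as the same heavy queries come back around.
mysql -u root -p -e "SHOW STATUS LIKE 'Qcache%';"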

Worse yet, I really didn’t do much with the idea of caching at all. Being the stats-freak that I am, I had a little info box showing some really neat stats, like the total number of “man hours” worked. As you can imagine, this is a computation that gets pretty intense pretty quickly, especially with 30+ people logging in and out every day, sometimes repeatedly. Query caching would have helped significantly, but some of this stuff could have been sped up in other ways, too, like keeping a persistent cache of this data. (Memcache is now my cache of choice, but APC, or even just an HTML file, would have worked well, too.)
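Even without memcache, something as dumb as a cron job that regenerates a static snippet every few minutes would have done it. A rough sketch, with an invented schema and paths:

#!/bin/sh
# Recompute the expensive "total man hours" figure and dump it where the
# PHP pages can just include it, instead of running the heavy query on
# every single pageload. Run it from cron every 15 minutes or so.
TOTAL=$(mysql -N -B stomatron -e "
    SELECT ROUND(SUM(TIMESTAMPDIFF(MINUTE, sign_in, sign_out)) / 60, 1)
    FROM   attendance
    WHERE  sign_out IS NOT NULL;")

echo "<p>Total man hours: ${TOTAL}</p>" > /var/www/html/includes/man_hours.html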

And, in 20/20 hindsight, I don’t recall ever backing up the Stomatron box. (I may well be wrong.) Especially since it and our backup server both ran Linux, it’d have been trivial to write a script that runs at something like 3 a.m. (when none of us would be around to feel the potential slowdown) and does a database dump to our backup server. (MySQL replication would have been cool, but probably needless.) If I were doing it today, I’d also amend that script to employ our beloved dot-matrix logger, to print out some stats, such as cumulative hours per person, and maybe who worked that day. (Which would make recovery much easier in the event of a catastrophic data loss: we’d just take the previous night’s totals, and then replay (or, in this case, re-enter) the day’s login information.)
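Something along these lines is about all it would have taken; the hostnames, database name, and printer name here are all made up:

#!/bin/sh
# Nightly Stomatron backup, meant for a 3 a.m. cron job:
#   0 3 * * * /usr/local/bin/stomatron-backup.sh
# (assumes MySQL credentials are in ~/.my.cnf)
DATE=$(date +%Y%m%d)

# Dump and compress the whole database, then ship it to the backup server.
mysqldump --all-databases | gzip > /tmp/stomatron-$DATE.sql.gz
scp /tmp/stomatron-$DATE.sql.gz backup-server:/backups/stomatron/

# And the fun part: cumulative hours per person, out the dot-matrix logger.
mysql -N -B stomatron -e "
    SELECT person,
           ROUND(SUM(TIMESTAMPDIFF(MINUTE, sign_in, sign_out)) / 60, 1) AS hours
    FROM   attendance
    WHERE  sign_out IS NOT NULL
    GROUP  BY person;" | lpr -P dotmatrix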

I’m not sure this stuff was even mainstream back then, but our website could have used a lot of optimization, too. We were admittedly running up against a really slow architecture: I think it was a 300 MHz machine with 128MB RAM. With PostNuke, phpBB, and Gallery powering the site, every single pageload was being generated on the fly, and used a lot of database queries. APC or the like probably would have helped pretty well, but I have to wonder how things would have changed if we’d used MySQL query caching. Some queries don’t benefit (WordPress, for example, insists on embedding the exact current timestamp in every query, so no two queries are ever identical and the cache never gets a hit). I wonder if phpBB is like that. I have a feeling that at least the main page and such would have seen a speedup. We didn’t have a lot of memory to play with, but even 1MB of cache probably would have made a difference. As aged as the machine was, I think we could have squeezed more performance out of it.

I’m still proud of our scoring interface for our Lego League competition, though. I think Mr. I mentioned in passing a day or two before the competition that he wanted to throw something together in VB to show the score, but hadn’t had the time, or something of that sort. So Andy and I whipped up a PHP+MySQL solution after school that day, storing the score in MySQL and using PHP to retrieve results and calculate score, and then set up a laptop with IE to display the score on the projector. And since we hosted it on the main webserver, we could view it internally, but also permitted remote users to watch results. It was coded on such a short timeline that we ended up having to train the judges to use phpMyAdmin to put the scores in. And the “design requirements” we were given didn’t correctly state how the score was calculated, so we recoded the score section mid-competition.

I hope they ask me if I have experience working under deadlines.

Torrent Hosting

So I’m contemplating posting my BlueQuartz VMware image on VMware’s “Appliances” page, where it’d probably get a decent number of downloads. I strongly doubt I’ll run into my bandwidth limit (it’d have to be downloaded about 3,000 times in a month), but I still don’t want to use bandwidth I don’t have to. When you’re distributing a big file to lots of people all of a sudden, BitTorrent is the perfect solution.

Unlike, say, a bootleg movie, a lot of legitimate torrent hosting has an ‘official source.’ That doesn’t mean anything in BitTorrent itself, but I think it should. The official source wants to ‘host’ the file, but get people to help with the bandwidth over BitTorrent.

There should be an easy way for them to host the file. Run a single command, pass it the file you want to distribute, and it’ll automatically create a .torrent file, register with some trackers (or host your own?), and begin seeding the file. In practice, this would probably take 10-15 minutes of work by hand. That’s pathetic.

There’s also a catch-22 at first: you want seeders (people who have the whole file and upload it to their peers), since, without them, no one can get the file. But you need a seeder before anyone else can become one. The obvious solution is to seed your own file, and this is how it’s done. But, as the ‘official’ distributor of a file, you don’t want to burn through bandwidth, so it makes sense that you’d want to throttle your upload: if there were lots of other seeders, you’d only use a small amount of bandwidth. By keeping the ‘server’ up as a permanent seeder, you also alleviate the really annoying problem of no one having the full file, which, obviously, prevents anyone from ever getting it. This is sort of a “long tail” problem: after the rush is over, you often end up with BitTorrent not being so awesome. (And, if you set your throttled upload bandwidth to be inversely proportional to the number of other seeders, then when no one else is seeding, there’s really no difference between someone downloading your file over BitTorrent and downloading it directly from your server.)
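The pieces to do this by hand already exist; here’s roughly what that 10-15 minutes looks like, using mktorrent and transmission-cli. The tracker URL, filenames, and upload cap are placeholders, and the exact flags vary a bit between client versions:

# Create the .torrent, pointed at whatever tracker you register with.
mktorrent -a http://tracker.example.com/announce \
    -o bluequartz.torrent bluequartz-vmware.zip

# Seed it from the server, with the upload rate capped (in KB/s) so it
# never eats more bandwidth than you're willing to give.
transmission-cli -w . -u 100 bluequartz.torrent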

Of course, you’ll still have to distribute over FTP/HTTP, since not everyone can use BitTorrent. But, if you distribute it ‘normally’ over HTTP, you create an incentive for people to just download it from you, bypassing BitTorrent, which ruins the whole plan. So you also need to be able to throttle your bandwidth on those services, to make sure that it’s never faster than BitTorrent.

I really think there should be an all-in-one package to do this, so the host just runs a quick command on the server, and the file’s immediately being seeded on BitTorrent and available over HTTP/FTP. And it’s not just for people like “us”: think of the problem that, say, Linux distributions face every time they have to distribute large files.

This could even be a hosted service: a decent number of people providing things like games have been smart enough to embrace BitTorrent. The market’s there. There’s just no one offering this.

Cool Stuff

  • FDC (FDCServers.net) has come a long way since I last dealt with them. (I remember back when they had a couple Cogent lines). They’ve now got 81 Gbps of connectivity.
  • Internap has long been the Internet provider to use when latency/speed matters. They basically buy lines from all the big providers, and peer with lots of the smaller ones, so that, unless your hosting company has its own private peering agreements, it’s basically impossible to find a shorter route. People hosting gameservers, or really just anything “high quality,” love Internap. I’ve seen prices in the $100-200 range for 1 Mbps. (This is purely for the transit: it’s tempting to read that as $100 for a 1 Mbps line to your house, but that’s not what it is. This is when you’re already in a data center where they have a presence and you run a line to them. The cost is just for them carrying your packets.)
  • FDC now has a 10 Gbps line to Internap. “Word on the street” is that Internap had some sort of odd promotion at $15/Mbps if you bought in bulk, and FDC wisely jumped, getting a 2 Gbps commit on a 10 Gbps line.
  • I’m working on getting Xen running on my laptop. It’s interested me for a long time: it’s a GPL’ed virtualization platform. You can use it on your desktop to experiment with various OSs inside VMs, but it’s also awesome on servers to run multiple virtual machines as virtual private servers.
  • Do you remember Cobalt RaQs? I distinctly remember ogling them and thinking they were the best things ever. (Of course, now we see them as 300 MHz machines…) It turns out that, when Cobalt went belly-up, they released a lot of the code under the GPL or similar. The BlueQuartz project is an active community-developed extension of that, and, combined with CentOS, it apparently runs well on “normal” computers now. (True, you don’t get the spiffy blue rackmount server or the spiffy LCD, but you do get to run it on something ten times as powerful.)
  • I’m still itching to host a TF2 server. I’ve found that they’re all either full or empty, with few in-betweens, and that a lot of them aren’t ‘adminned’ as tightly as I’d like: games like this seem to attract irritating people, and not enough servers kick/ban them.
  • cPanel seems to have come a long way since I last used it, too, and you can now license it for use just inside a VPS at $15/month.
  • Mailservers are hard to perfect. There are lots and lots of mediocre ones, but it’s rare to come across an excellent one, something that can deflect spam seamlessly, make it easy to add lots of addresses, and provide a nice web GUI. All of the technology’s out there, but for some reason, mailservers are among the hardest things in the world to configure. (Even my thermostat is easier to use!) Especially given my affinity for spamd, it’s no wonder that I’m so impressed with the Mailserver ‘appliance’ that Allard Consulting produces. It’s essentially all of the best things about mailservers (greylisting, whitelisting, SpamAssassin, Postfix with MySQL-based virtual domains, a spiffy web interface with graphs, Roundcube…), hosted on OpenBSD, coming as a pre-assembled ISO.
  • Computer hardware’s come a long way lately. I’d imagine it’d be fairly easy to assemble a machine with a good dual-core (or quad-core!) processor, 4 GB RAM, and a few 500 GB disks for around $1,000.
  • Colocation + 1,000 GB transfer on Internap at FDC is $169. (Or $199 for 5 Mbps unmetered, but that’s probably overkill.) Are you thinking what I’m thinking? (Hint: everything on this list indirectly leads to these last two points!)

Amazon S3

I really didn’t pay it that much attention, or think about its full potential, at the time it was released. But Amazon’s Simple Storage Service (hence the “S3”) is really pretty neat. In a nutshell, it’s file hosting on Amazon’s proven network infrastructure. (When have you ever seen Amazon offline?) They provide HTTP and BitTorrent access to files.
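Playing with it is surprisingly low-friction, too. A sketch with something like s3cmd; the bucket name is hypothetical, and this assumes your access keys are already configured:

# Create a bucket and upload the file with public-read access.
s3cmd mb s3://example-bluequartz
s3cmd put --acl-public bluequartz-vmware.zip s3://example-bluequartz/

# Plain HTTP download:
#   http://example-bluequartz.s3.amazonaws.com/bluequartz-vmware.zip
# And appending ?torrent to the same URL gets you a .torrent, with Amazon
# acting as the permanent seeder:
#   http://example-bluequartz.s3.amazonaws.com/bluequartz-vmware.zip?torrent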

Their charges do add up — it might cost a few hundred dollars a month to move a terabyte of data and store 80GB of content. But then again, the reliability (and scalability!) is probably much greater than what I can handle, and it’s apparently much cheaper than it would be to host it with a ‘real’ CDN service.

Sadly, I can’t think of a good use for this service. I suppose the average person really doesn’t need to hire a company to provide mirrors of their files for download. (It would make an awesome mirror for Linux/BSD distributions, but I think the typical mirror is someone with a lot of spare bandwidth and an extra server, not someone paying hundreds a month to mirror files for other people… I wonder if there’s a market for a ‘premium’ mirror service? I doubt it, since the existing ones seem to work fine?)

Datacenter Fiend

No matter what I do, I keep finding myself thinking about webhosting.

Netcraft does a monthly survey of hosts with the top uptime, and mentioned that DataPipe is usually on top. I’ve found that, at least for what I do, any “real” data center has just about 100% uptime. I have never not been able to reach my server. You’re either with a notoriously bad host (for example, when Web Host “Plus” bought out Dinix, they took the servers offline for a few days with no notice… that’s noticeable downtime), or you’re with a reputable host where downtime just doesn’t really happen.

So 0.00% downtime, as opposed to 0.01%, isn’t a huge deal for me. (That doesn’t mean it’s not impressive.) But what impressed me about DataPipe is that I clicked their link and their webpage just appeared. No loading in the slightest. I browsed their site, and there was never any waiting. I might as well have had the page cached on my computer, except I know it’s not cached anywhere.

Their data center is in New Jersey, but they clearly have excellent peering. I’m getting 20ms pings. They don’t (directly, at least) offer dedicated hosting, VPS hosting, or shared hosting.

One of my big concerns is long-term viability. The market’s full of hosts. A lot of them are “kiddie hosts,” inexperienced people just reselling space, often with poor quality. That leaves room for competition. But the problem is that there are hosts selling the moon: 200 GB of disk space and 3 terabytes of bandwidth for $5 a month? That’s ludicrous: that’s more than I get with my dedicated server! They can get away with it because no one uses that much, but it hurts more “honest” hosts: you’d have to charge ten times as much if everyone actually used what they were sold. And hosts that offer, say, 1GB of space and 10 GB of transfer (a ‘realistic’ amount) are left vulnerable to people thinking the overselling competition is the better deal.

I realized the other day that, while a lot of people offer VPS (virtual private server: several people share a server, but software ‘partitions’ give each of them their own server software-wise, with root access and separation from other users), I’m really not aware of any good ones. It’s also hard to find any that offer significant amounts of disk space, or any that are particularly cheap.

Big Hosting

I tend to think of web hosting in terms of many sites to a server. And that’s how the majority of sites are hosted–there are multiple sites on this one server, and, if it were run by a hosting company and not owned by me, there’d probably be a couple hundred.

But the other end of the spectrum is a single site that takes up many servers. Most any big site is done this way. Google reportedly has tens of thousands of servers. Any busy site has several, if for nothing else than load balancing.

Lately I’ve become somewhat interested in the topic, and found some neat stuff about this realm of servers. A lot of things are done that I didn’t think were possible. While configuring my router, for example, I stumbled across stuff on CARP. I’d always thought of routers as a single point of failure: if your router goes down, everything behind it goes down. With CARP, a mission-critical setup can run two (or more) routers sharing a virtual address, so losing one doesn’t take everything behind it down.
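It turns out to be only a couple of commands on OpenBSD. A minimal sketch, with a made-up address, interface, and password (and the exact syntax varies a bit by version):

# On each of the two routers: create a carp interface sharing the virtual
# address 192.168.1.1. Whichever box is master answers for it; if it dies,
# the backup takes over.
ifconfig carp0 create
ifconfig carp0 vhid 1 pass sekrit carpdev em0 \
    192.168.1.1 netmask 255.255.255.0

# On the intended backup, add "advskew 100" so it defers to the master.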

One thing I wondered about was serving up something that had voluminous data. For example, suppose you have a terabyte of data on your website. One technique might be to put a terabyte of drives in every server and do load balancing from there. But putting a terabyte of drives in each machine is expensive, and, frankly, if you’re putting massive storage in one machine, it’s probably huge but slow drives. Another option would be some sort of ‘horizontal partitioning,’ where five (arbitrary) servers each house one-fifth of the data. This reduces the absurdity of trying to stuff a terabyte of storage into each of your servers, but it brings problems of its own. For one, you don’t have any redundancy: if the machine serving sites starting with A-G goes down, all of those sites go down. Plus, you have no idea of how ‘balanced’ it will be. Even if you tried some intricate means of honing which material went where, the optimal layout would be constantly changing.

Your best bet, really, is to have a bunch of web machines, give them minimal storage (e.g., a 36GB, 15,000 rpm SCSI drive!), and have a backend fileserver that holds the whole terabyte of data. Viewers would be assigned to any of the webservers (either in a round-robin fashion, or dynamically based on which server was the least busy), which would retrieve the requisite file from the fileserver and present it to the viewer. Of course, this places a huge load on the one fileserver. There’s an implicit assumption that you’re doing caching.

But how do you manage the caching? You’d need some complex code to first check your local cache, and then turn to the fileserver if needed. It’s not that hard to write, but it’s also a pain: rather than a straightforward, “Get the file, execute if it has CGI code, and then serve” process, you need the webserver to do some fancy footwork.
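Just to make the “fancy footwork” concrete, the logic is roughly this (the paths and fileserver name are invented), and it’s exactly the kind of plumbing you’d rather not maintain yourself:

#!/bin/sh
# Serve a file from the local cache, pulling it from the fileserver on a miss.
CACHE=/var/cache/www
ORIGIN=fileserver:/export/www
FILE="$1"

if [ ! -f "$CACHE/$FILE" ]; then
    # Cache miss: fetch the file from the backend fileserver first.
    mkdir -p "$(dirname "$CACHE/$FILE")"
    rsync -a "$ORIGIN/$FILE" "$CACHE/$FILE"
fi

cat "$CACHE/$FILE"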

Enter Coda. No, not the awesome web-design GUI, but the distributed filesystem. In a nutshell, you have a server (or multiple servers!), and each client mounts a partition called /coda that’s backed by the fileserver over the network. But it’ll cache files locally as needed. This is massively oversimplifying things: the actual use case is to allow you to, say, bring your laptop into the office, work on files on the fileserver, and then, at the end of the day, seamlessly take it home to keep working, without having to worry about where the files physically reside. So running it just for the caching is practically a walk in the park: you don’t have complicated revision conflicts or anything of the sort. Another awesome feature of Coda is that, by design, it’s pretty resilient: part of the goal with the caching was to handle the fileserver going offline fairly gracefully. So really, the more popular files would be cached by each node, with only cache misses hitting the fileserver.

I also read an awesome anecdote about people running multiple Coda servers. When a disk fails, they just throw in a blank one. You don’t need RAID, because the data’s redundant across the other servers. With the new disk in place, you simply have it rebuild the missing files from the other servers.

There’s also Lustre, which was apparently inspired by Coda. They focus on insane scalability, and it’s apparently used in some of the world’s biggest supercomputer clusters. I don’t yet know enough about it, really, but one thing that strikes me as awesome is the concept of “striping” the files you want across multiple nodes.

The Linux HA project is interesting, too. There’s a lot of stuff that you don’t think about. One is load balancer redundancy… Of course you’d want it, but if you simply switched over to a backup router, all existing connections would be dropped. So they keep a UDP data stream going, where the master keeps the spare(s) in the loop on connection state. Suddenly having a new router or load balancer appear can also be confusing to the rest of the network, so if the master goes down, the spare comes up and starts spoofing its MAC and IP to match the node that went down. There’s a tool called heartbeat, whereby standby servers ping the master to see if it’s up. It apparently has some complex workings, and they recommend a serial link between the nodes so you’re not dependent on the network. (Granted, if the network to the routers goes down, it really doesn’t matter, but having them quarreling over who’s master will only complicate attempts to bring things back up!)
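To give a flavor of how small the basic setup is, here’s a classic two-node heartbeat configuration, written as the commands you’d run to drop the files in place. The hostnames, address, and serial port are all assumptions, and you’d still need /etc/ha.d/authkeys:

# /etc/ha.d/ha.cf, identical on both nodes
cat > /etc/ha.d/ha.cf <<'EOF'
# heartbeat over the recommended serial link, plus the LAN
serial /dev/ttyS0
bcast eth0
# seconds between heartbeats, and how long before declaring the peer dead
keepalive 2
deadtime 30
node web1
node web2
EOF

# web1 normally owns the virtual IP and the web server; the standby takes
# them over (address included) if web1 stops answering.
cat > /etc/ha.d/haresources <<'EOF'
web1 192.168.0.100 httpd
EOF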

And there are lots of intricacies I hadn’t considered. It’s sometimes complicated to tell whether a node is down or not. But it turns out that a node in an ambiguous state is often a horrible state of affairs: if it’s down and not pulled out of the pool, lots of people will get errors. And if other nodes are detecting oddities but it’s not down, something is awry with the server. There’s a concept called fencing that I’d never heard of, whereby the ‘quirky’ server is essentially shut out by its peers to prevent it from screwing things up (not only might it run away with shared resources, but the last thing you want is a misbehaving node trying to modify your files). The ultimate example of this is STONITH, which sounds like a fancy technical term (and, by definition, now is a technical term, I suppose), but really stands for “Shoot the Other Node in the Head.” From what I gather from the (odd) description, the basic premise is that if members of a cluster suspect that one of their peers is down, they “make it so” by calling external triggers to pull the node out of the network (often, seemingly, by just rebooting the server).

I don’t think anyone is going to set up high-performance server clusters based on what someone borderline-delirious blogged at 1:40 in the morning because he couldn’t sleep, but I thought someone else might find this venture into what was, for me, new territory interesting.