Bulk-editing HTML files with images containing a # sign

A friend of mine recently pinged me with a problem he’s having. I wanted to share the problem and its solution, both because I hope it will help him and maybe others, and because my initial attempts at the problem were way more complicated than they needed to be.

The problem

My friend used some software to generate a whole bunch of HTML reports, and the reports contained images with names like “Project #1 – Blah/Image #1.jpg” or such. None of the images showed up.

He was savvy enough to recognize that the # (octothorp, pound sign, number sign, “hashtag symbol,” etc.) was the problem, but this was little comfort: with about 100 generated HTML files, each with six images, editing it all was perhaps an entire day’s work.

The problem is that the pound sign is a special character in URLs: it marks the start of the fragment identifier. (I admit, I had to look up what it’s actually called.)
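To see exactly why the images vanish, here’s a quick illustration in Python (the file name is a made-up stand-in for my friend’s): everything after the first # is treated as the fragment, not part of the path, and percent-encoding puts it back.

```python
from urllib.parse import urlsplit, quote

# The browser cuts the URL at the first '#': everything after it is the
# fragment identifier, not part of the file path.
parts = urlsplit("Project #1/Image #1.jpg")
print(parts.path)      # 'Project '
print(parts.fragment)  # '1/Image #1.jpg'

# Percent-encoding makes the whole thing part of the path again
# (quote() leaves '/' alone by default):
print(quote("Project #1/Image #1.jpg"))  # 'Project%20%231/Image%20%231.jpg'
```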

The over-engineered fix

What I struggled with was this: I could easily write a quick shell script to rename all the folders, and then mass-edit the HTML files. It would probably be 5-10 minutes of work.

What I couldn’t do easily, though, was write a Ruby (or whatnot) script without seeing the files and then get my friend to run it, on what I presume is a Windows computer without a Ruby interpreter. It’s one of those problems that I could very easily fix for myself, but communicating it or getting it to run for someone else is tremendously more complicated.

So I proposed that he just create a Zip file with the entire folder contents, including 600 images, and send it to me to fix.

This is a bad fix for two reasons. One, emailing 600 images is crazy. But two, there’s a much easier fix that hadn’t occurred to me.

Tech-savvy, non-programmer fix

Having suggested the above fix at first, I mulled over something else he said: this was one of those things he wasn’t sure how to even search for on Google, because he didn’t really know the terms. I think this is a problem we’ve all faced at one time. (For example, when I dropped a tiny little weird-shaped screw into my carpet and wanted to buy a replacement. If you have no idea what the thing is called, and I didn’t, finding it online will be even harder than finding it in your carpet.)

I wondered: what could you search for? I tried “how to bulk-edit HTML files” or something of the sort. It likely wouldn’t have gotten him the details he needed. But it helped me realize an easier fix!

There’s an easy way to bulk-edit text files, beyond writing a script to do it. And it’s something that programmers in particular should be familiar with: text editors.

And then something else occurred to me: you don’t actually need to rename the files! You just need to percent-encode the # in the URL!

So this suddenly becomes a much easier solution to relay.

  • Back up your work!
  • Download a free and reputable text editor. Notepad++ is highly-regarded on Windows, but I’m not a Windows user. I use Sublime Text on my Mac, and I still have a TextMate install as well.
  • Open the entire folder in the text editor.
  • Use the editor’s global search-and-replace / find-and-replace functionality. In Notepad++, it looks like it’s called Find in Files.
  • Make sure you really do have the project backed up; global search and replace on a project can be a real pain to undo if it goes wrong.
  • Search for # in all HTML files (*.html) and replace it with the percent-encoded representation, %23. This can look ugly: “Photos #1/pic.png” becomes “Photos %231/pic.png”. (The space should technically be %20, because a space is also a special character, but browsers are better about figuring that out.) Make sure “Replace all” or the equivalent is selected.
  • Ideally, test it in one file, save that file, and verify that the images now show up.
  • Let it roar through all of the files for you.

There’s one other risk worth noting, which is that the # sign could be used elsewhere in the file, particularly in its correct role as a fragment identifier (say, a link to a different section within the page). In that case, you would need to be slightly more selective in your search and replace, and this is where it gets gory. You’d need to find a search string that matches only the image URLs, like “Photos #”, and then replace it with the same thing with the “#” changed to “%23” — i.e., search for “Photos #” and replace it with “Photos %23”.
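For completeness, here’s roughly the script I’d have written if my friend could have run one: a sketch in Python, where the folder name is an assumption, and the blanket replace carries the same caveat about legitimate fragment identifiers (narrow the search string if your pages use # for real anchors).

```python
import pathlib

def percent_encode_hashes(root: pathlib.Path, pattern: str = "*.html") -> int:
    """Replace every '#' with '%23' in files matching pattern under root.

    Returns the number of files changed. Caveat: if '#' also appears as a
    real fragment identifier in links, replace a narrower string instead
    (e.g. 'Photos #' -> 'Photos %23').
    """
    changed = 0
    for path in root.rglob(pattern):
        text = path.read_text(encoding="utf-8")
        fixed = text.replace("#", "%23")
        if fixed != text:
            path.write_text(fixed, encoding="utf-8")
            changed += 1
    return changed

# percent_encode_hashes(pathlib.Path("reports"))  # back up first!
```

Like the editor approach, this edits only the HTML; the image files themselves keep their names.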

My Ghetto CDN

I’m not sure yet if this is a good idea or not, nor if I want to pay for it long-term, but I’m playing with an experiment that I think is kind of neat. (But maybe I’m biased.)


For a long time now, I’ve been using CloudFront to serve static assets (images, CSS, JS, etc.) on the blogs, and a few other sites I host. That content never changes, so I can offload it to a global Content Delivery Network. IMHO it’s win-win: visitors get a faster experience because most of those assets come from a server near them (CloudFront has a pretty extensive network), and my server is relieved of serving most images, so it’s only handling requests for the blog pages themselves.

Except, serving images is easy; they’re just static files, and I’ve got plenty of bandwidth. What’s hard is serving blog pages: they’re dynamically-generated, and involve database queries, parsing templates, and so forth… Easily 100ms or more, even when things go well. What I’d really like is to cache those files, and just remove them from cache when something changes. And I’ve actually had that in place for a while now, using varnish in front of the blogs. It’s worked very well; more than 90% of visits are served out of cache. (And a decent bit of the 10% that miss are things that can’t be cached, like form POSTs.) It alleviates backend load, and makes the site much faster when cache hits occur, which is most of the time.

But doing this requires a lot of control over the cache, because I need to be able to quickly invalidate the cache. CloudFront doesn’t make that easy, and they also don’t support IPv6. What I really wanted to do was run varnish on multiple servers around the world myself. But directing people to the closest server isn’t easy. Or, at least, that’s what I thought.

Amazon’s Latency-Based Routing

Amazon (more specifically, AWS) has supported latency-based routing for a while now. If you run instances in, say, Amazon’s Virginia/DC (us-east-1) region and in their Ireland (eu-west-1) region, you can set up LBR records for a domain pointing to both, and DNS queries will be answered with whichever IP address is closer to the user (well, to the user’s DNS server).

It turns out that, although the latency is in reference to AWS data centers, your IPs don’t actually have to point to data centers.

So I set up the following:

  • NYC (at Digital Ocean), mapped to us-east-1 (AWS’s DC/Virginia region)
  • Frankfurt, Germany (at Vultr), mapped to eu-central-1 (AWS’s Frankfurt region)
  • Los Angeles (at Vultr), mapped to us-west-1 (AWS’s “N. California” region)
  • Singapore (at Digital Ocean), mapped to ap-southeast-1 (AWS’s Singapore region)

The locations aren’t quite 1:1, but what I realized is that it doesn’t actually matter. Los Angeles isn’t exactly Northern California, but the latency difference is insignificant—and the alternative was all the traffic going to Boston, so it’s a major improvement.

Doing this in DNS isn’t perfect, either: if you’re in Japan but use a DNS server in Kansas, you’ll get records as if you were in Kansas. That’s a strange setup and you shouldn’t do it, but again, it doesn’t matter much. You’re generally going to get routed to the closest location, and when you don’t, it’s not a big deal: worst case, you see perhaps 300ms of latency.


It turns out that there’s a Multi-Varnish HTTP Purge plugin, which seems to work. The downside is that it’s slow: not because of anything wrong with the plugin, but because, whenever a page changes, WordPress now has to make connections to four servers across the planet.

I want to hack together a little API to accept purge requests and return immediately, and then execute them in the background, and in parallel. (And why not log the time it takes to return in statsd?)
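A sketch of what the core of that little purge API might look like (the edge hostnames are placeholders, and varnish would have to be configured to accept PURGE requests from this host):

```python
import concurrent.futures
import http.client

EDGES = ["nyc.example.com", "fra.example.com",
         "lax.example.com", "sgp.example.com"]  # hypothetical edge nodes

def purge_one(edge: str, url_path: str) -> int:
    """Send an HTTP PURGE for url_path to one varnish edge; return the status."""
    conn = http.client.HTTPConnection(edge, 80, timeout=5)
    conn.request("PURGE", url_path)
    status = conn.getresponse().status
    conn.close()
    return status

def purge_all(url_path: str, purge=purge_one) -> dict:
    """Fan the purge out to every edge in parallel, so total time is the
    slowest single node rather than the sum of all four round trips."""
    with concurrent.futures.ThreadPoolExecutor(len(EDGES)) as pool:
        futures = {pool.submit(purge, edge, url_path): edge for edge in EDGES}
        return {futures[f]: f.result()
                for f in concurrent.futures.as_completed(futures)}
```

A real version would enqueue the request and return to WordPress immediately, doing the fan-out in the background (and logging the timings to statsd).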

Debugging, and future ideas

I added a little bit of VCL so that requesting /EDGE_LOCATION returns a page showing which edge location served you.

I think I’m going to leave things this way for a while and see how things go. I don’t really need a global CDN in front of my site, but with Vultr and Digital Ocean both having instances in the $5-10 range, it’s fairly cheap to experiment with for a while.

Ideally, I’d like to do a few things:

  1. Enable HTTP/2 support, so everything can be handled over a single connection. Early reports on it are promising.
  2. Play around with keeping backend connections open from the edge nodes to my backend server, to speed up requests that miss cache.
  3. Play around with something like wanproxy to try to de-dupe/speed up traffic between edge nodes and my backend server.


I just tried Pingdom’s Full Page Test tool, after making sure the Germany edge node had this post in cache.

[Screenshot: Pingdom Full Page Test results, 2015-03-06]

The page loaded in 291ms, after fetching all images and stylesheets. I’ll take that anywhere! But for my site, loaded from Stockholm? I’m really pleased with that.

I Use This: iTerm

I ran into some trouble with the Mac’s Terminal not supporting vim color schemes properly. It turns out that Terminal’s support for a number of terminal features just isn’t very good. Someone recommended I try iTerm, so I did. I wish I’d done it sooner. And it implements the feature I’ve craved for the longest time: blurring the background behind a translucent terminal window.

There’s a little weirdness in that it doesn’t seem to update quite as fast as Terminal did, and that triple-clicking selects the line shown, not the whole line if it wraps. For the most part they’re petty differences. Overall, though, Terminal’s a thing of the past, and iTerm’s here to stay.

Update: I’d noticed that iTerm hadn’t been updated in a long time. Thanks to George for commenting below that he’s picked up development and rechristened the project iTerm 2. Even better!

Thinking Like an Engineer

Lately a lot of my work as a web developer has been way at the back-end, and, for whatever reason, it tends to focus heavily on third parties. I spent a while fixing a bizarre intermittent error with our credit card processor, moved on to connecting with Facebook, and am now working on a major rewrite of the API client we use to talk to our e-mail provider. Sometimes it starts to bleed over into my personal life.

This kind of turned into crazy-person babble, but I haven’t posted in a while, so here goes a perhaps-horrifying look into how my mind works:

  • Driving home one night, I went through the FastLane / EZPass lane, as I often do. Only this time, instead of thinking, “I hate that I have to slow down for this,” I started thinking about latency. Latency is one of the biggest enemies of people working with third parties. It was at the crux of our problems with the credit card processor — we’d store a card and immediately try to charge it, when sometimes we had to wait “a little bit” before the card was available throughout their system to be charged. So I had to introduce a retry loop with exponential backoff. The email API work has major hurdles around latency and timeouts. We’ve moved almost all of it into a background queue so that it doesn’t delay page load, but even then we have intermittent issues with timeouts. So driving through the FastLane lane today, I slowed to about 45, and thought how remarkable it was that, even at that speed, it was able to read the ID off my transponder, look it up in a remote database somewhere, and come back with a value on what to do. I’d have assumed that they’d just queue the requests to charge my account, but since a warning light comes on when my prepaid balance is low, it seems there’s actually a remote call. It’s got to happen in a split-second, though, and that’s pretty impressive. I wonder how they do it. I thought a lot about this, actually.
  • I work on the fourth floor of a building with one, slow elevator. A subsection of Murphy’s Law indicates that the elevator will always be on the exact opposite floor: when I’m on the first floor, it’s on the fourth, for example. So one day while waiting for the elevator, I started thinking that it needed an API. I could, from my desk, summon it to our floor to lessen my wait time. Likewise, I could build an iPhone app allowing me to call the elevator as I was walking towards it. The issue of people obnoxiously calling the elevator way too early seems like a problem, but I think it’s okay — if you call it too soon, it will arrive, and then someone else will call it and you’ll miss out entirely. It’s in everyone’s interest to call it “just right” or err on the side of a very slight wait.
  • While thinking more about the elevator API, I started thinking about how elevators aren’t really object-oriented. (I’m pretty sure that’s never been written before.) It seems an elevator is really pretty procedural, running something like goToFloor(4). The obvious object would be Floors, but that’s not really right. You’re not adding Floors to the building, or even changing properties of Floors. The object is really CallRequest, and it would take two attributes: an origin and a direction. “Come to floor two, I’m going up.” It made me think that there are some places where being object-oriented just doesn’t make a ton of sense.
  • You really want to add authentication, too. To get to our floor, you need to swipe your badge. The elevator API needs to account for the fact that some requests require validating a user’s credentials, to see if they’re authorized to make the request they’re making.
  • “Code an elevator” would actually be an interesting programming assignment. But I fear it’s too far removed from most normal coding. I started thinking that you’d want to sort CallRequests in some manner, apply some algorithms, and then iterate over CallRequests. I think you actually want to throw out that concept. You have a tri-state to control direction: Up, Down, and Idle. Then you have two arrays: UpwardCalls and DownwardCalls. They don’t even need to be sorted. As you near a floor, you see if UpwardCalls contains that floor. If so, you stop. If not, you continue. If you’ve reached the highest floor in UpwardCalls, you check to see if DownwardCalls has any elements. If so, you set your direction to Down and repeat the same procedure for DownwardCalls. If there are no DownwardCalls, you set your state to Idle. The problem is that this is really not how I’m used to thinking. I want to iterate over CallRequests as they come in, but that means the elevator goes all over the place. The person on the 4th floor wants to go to the 2nd, so we make that happen. But right as they put that request in, the person on the 3rd wants to go to the 1st. So you’d go 4 to 2 to 3 to 1. “Fair” queuing, but ridiculously inefficient. The better plan: on your way from the 4th to the 2nd, stop on the 3rd to pick the person up.
  • I wonder how things work when you have multiple elevators. In big buildings you’ll often have something like 8 elevators. I’m far too tired to try to figure out the ideal way to handle that. They need to be smart enough to have a common queue so that I don’t have to hit “Up” on all eight elevators and just take whatever comes first, but deciding what elevator can service my request first is interesting. I kind of think it’s another case of elevators not being the same as the programming I’m used to, and it’s just whatever elevator happens to pass my floor in its service going in the right direction. But what if there’s an idle elevator? Can it get to me first, or will an already-running elevator get there first? Do you start the idle elevator first and make it event-driven? What if the already-running elevator has someone request another floor between its present location and my floor? You’d need to recompute. You’re probably better off dispatching an idle elevator and just giving me whatever gets there first.
  • You then need to figure out what’s important. If you have an idle elevator that can get to me more expediently than an already-running elevator, but the wait time wouldn’t be that much longer, do you start up the idle elevator, or do you save power and have me wait? How do you define that wait? Is this something elevator-engineers actually tune?
  • I think you want to track the source of a request — whether it came from within the elevator or from the external button on a floor. If it’s within the elevator, you obviously need to stop, or the elevator has effectively “kidnapped” the person. But if it’s an external button, you might want to just skip it and let another elevator get to it, if you have a bunch of CallRequests you’re working through. Ideally, you’d also approximate the occupancy of the elevator based on the weight (from reading the load on the motors?), and when the elevator was at perhaps 75% capacity, stop processing new external requests.
  • Should the elevator controller try to be proactive? It might keep a running log of the most “popular” floors out of, say, the last 50 CallRequests, and, when it was done processing all CallRequests, go to whatever the most popular was and sit idle there? Or perhaps it should work its way towards the middle floor? If you had multiple elevators you could split them apart that way. Is it worth the power conservation?
  • The office thermostat should have an API, too. (I bet the good ones do. But we don’t have those.) Thermostats are a pain to use. readTemperature and setTemperature are the obvious API calls, though advanced thermostats would have a TemperaturePlan concept.
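The two-array sweep from the “code an elevator” bullet above can be sketched in a few lines (Python here, purely illustrative):

```python
def sweep(calls, direction):
    """Visit every requested floor in one pass in the given direction;
    the call set is never sorted as requests arrive, only walked here."""
    return sorted(calls, reverse=(direction == "down"))

def run(up_calls, down_calls):
    """The two-array scheme: work through DownwardCalls in one downward
    pass, then UpwardCalls in one upward pass. With both sets empty, the
    tri-state direction would flip to Idle."""
    stops = []
    if down_calls:
        stops.extend(sweep(down_calls, "down"))
    if up_calls:
        stops.extend(sweep(up_calls, "up"))
    return stops

# "Fair" FIFO queuing would serve 4→2, then 3→1, as two separate trips
# (4, 2, 3, 1). Sweeping merges all four downward stops into one pass:
print(run(set(), {4, 2, 3, 1}))  # [4, 3, 2, 1]
```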

Delete Old Files in Linux

Here’s a good one to have in your bag of tricks: all the time I wind up with a directory where I just want to delete anything older than a certain number of days. Here’s a pretty simple one:


DELETE_PATH=/path/to/files
DAYS_TO_KEEP=30

for i in `find $DELETE_PATH -ctime +$DAYS_TO_KEEP`; do
        ls -lh "$i"
        #rm -vf "$i"
done

The delete is commented out, so it will just list the files. Uncomment it when you’re convinced it does what you want. Obviously, change the top two lines to suit your needs, and play around with find (for example, there’s -atime instead of my -ctime). Note that the backtick loop will mangle filenames containing spaces; find’s -print0 piped to xargs -0 is safer. But it’s a handy little thing to have around and adapt as needed.

Removing Duplicates from MySQL

How do you remove duplicate rows from a SQL database? (MySQL, specifically.) I Googled that a while ago.

If you have a ‘simple’ duplicate setup, it’s pretty easy, and lots of other sites go into great detail. (And in that case, you might look into how to require that fields be unique so you don’t have that problem anymore…)

In our case, we ran into an obscure problem where we expected the combination of two fields to be unique. Without publicizing anything secret, we had a summary table of user_id and month. (Not that it’s actually anything secret, but I refer to the table I did this on as “tablename,” just since blogging about our production database schemas seems like a bad idea…) A brief bug introduced some duplicates. My task? Remove the duplicate rows.

So I went about it somewhat backwards, and started wondering how to do a SELECT DISTINCT across two columns. You can’t, really. Someone recommended something clever: CONCAT() across the two field names, and a DISTINCT on that: SELECT DISTINCT(CONCAT(user_id, month))… (One caveat: a plain CONCAT can collide, since user 11 in month 2 and user 1 in month 12 both produce “112”; a separator, as in CONCAT_WS('-', user_id, month), avoids that.) After toying with that a bit, I was able to create a temporary table housing a de-duped version of our data:

create temporary table nodupes
SELECT DISTINCT(CONCAT(user_id, month)) asdf, tablename.*
FROM tablename GROUP BY asdf;

But how does that help? The problem with temporary tables is that they’re, uhh, temporary. If you lived on the edge, you could probably “DROP TABLE original_table; CREATE TABLE original_table (SELECT * FROM temp_table)”, but this has a few horrible problems. One is that you’re throwing away all your disk-based data so you can populate it with a version that exists only in RAM. If anything went wrong, you’d lose the data. Furthermore, in our case, this is a live, busy table. Dropping it in production is awful, and it would take many minutes to repopulate it since it’s a huge table.

So then it was clear what I wanted to do: something like DELETE FROM original_table WHERE id NOT IN(SELECT * FROM temp_table_with_no_dupes), except not using an IN() on a table with hundreds of thousands of rows. And this is where it got tricky. The key is to use that temporary table to derive a new temporary table, housing only the duplicates. (Actually, speaking of “the key,” create one in your temporary table: CREATE INDEX gofast ON nodupes(id); — you’ll need this soon, as it’s the difference between the next query taking 10 seconds and me aborting it after 30 minutes.)

The temporary table of duplicate rows is basically derived by joining your ‘real’ table against the de-duped list, with a rather obscure trick: keeping only the rows where the join finds no match, i.e. where nd.id IS NULL. So here’s what I did:

create temporary table dupes
(SELECT tablename.*
FROM tablename
LEFT OUTER JOIN nodupes nd
ON nd.id=tablename.id WHERE nd.id IS NULL);

For bonus points, we’re selecting from the real table, so the silly CONCAT() field doesn’t come through into this temporary table.

So once again, I momentarily flashed a sly grin. This is going to be easy! DELETE FROM tablename WHERE id IN(SELECT id FROM dupes);. But then my sly grin faded. It turns out that there are two problems here. The first is that there are still a lot of rows. In my case, I was doing an IN() with about 6,000 rows. MySQL doesn’t like that. But there’s a bigger problem. (Thanks to Andrew for confirming that this is a real bug: #9090.) MySQL doesn’t use indexes when you use a subquery like that. It seems the bug is pretty narrow in scope, but this is precisely it. So that query is awful, basically doing 6,000 full table scans. And on a really big table.

So the solution comes from something I’ve never even thought about doing before. MySQL calls it the “multi-delete.” I call it the, “Holy smokes, this could go really, really wrong if I’m not careful” delete.

-- Don't run this yet
DELETE bb
FROM tablename AS bb
RIGHT JOIN dupes d
ON (d.id=bb.id);

That’s the command you want, but a few caveats. First, change the first line to “SELECT *” and run it to see what happens. Make sure it’s only grabbing the duplicate rows. You’re not able to use a LIMIT in a multi-delete query, so you have to be extremely careful here. Second, we tested this multiple times on various test databases before doing it against our master database, and I was still uneasy when it came time to hit enter. Keep backups before you do this. And finally, in many cases, MySQL will expect an explicit “USE database_name” before the multi-delete, where database_name is the database housing the table you’re deleting from. Keep that in mind if you get an error that it can’t find the table, even if you explicitly reference it (via database_name.table_name).

Viewing all cron jobs

Periodically I run into the situation where I’m trying to find a cron job on a particular machine, but I can’t remember which user owns it. At least on CentOS, it’s easy:

cat /var/spool/cron/* will show all crons. The crontab command doesn’t seem to support doing this. The downside is that that command just mashes them all into one list, which is only useful if you don’t care who the job runs as. Usually I do. Here’s a simple little script to format the output a little bit:

for i in `ls /var/spool/cron/`; do
        echo "Viewing crons for $i"
        echo "--------------------------------------"
        cat /var/spool/cron/$i
done

Location Error vs. Time Error

This post christens my newest category, Thinking Aloud. It’s meant to house random thoughts that pop into my head, versus fully fleshed-out ideas. Thus it’s meant more as an invitation for comments than something factual or informative, and is likely full of errors…

Aside from “time geeks,” those who deal with it professionally, and those intricately familiar with the technical details, most people are probably unaware that each of the GPS satellites carries an atomic clock on board. This is necessary because the system works, in a nutshell, by trilaterating your position from various satellites, where an integral detail is knowing precisely where the satellite is at a given time. More precise time means a more precise location, and there’s not much margin of error here. The GPS satellites are also synchronized daily to the “main” atomic clock (actually a bunch of atomic clocks based on a few different standards), so the net result is that the time from a GPS satellite is accurate down to the nanosecond level: they’re within a few billionths of a second of the true time. Of course, GPS units, since they don’t cost millions of dollars, rarely output time this accurately, so even the best units seem to have “only” microsecond accuracy, or time down to a millionth of a second. Still, that’s pretty darn precise.

Thus many (in fact, most) of the stratum 1 NTP servers in the world derive their time from GPS, since it’s now pretty affordable and incredibly accurate.

The problem is that GPS isn’t perfect. Anyone with a GPS probably knows this. It’s liable to be anywhere from a foot off to something like a hundred feet off. This server (I feel bad linking, having just seen what colocation prices out there are like) keeps a scatter plot of its coordinates as reported by GPS. This basically shows the random noise (some would call it jitter) of the signal: the small inaccuracies in GPS are what result in the fixed server seemingly moving around.

We know that an error in location will also cause (or, really, is caused by) an error in time, even if it’s minuscule.

So here’s the wondering aloud part: we know that the server is not moving. (Or at least, we can reasonably assume it’s not.) So suppose we define one position as “right,” and any deviation in that as inaccurate. We could do what they did with Differential GPS and “precision-survey” the location, which would be very expensive. But we could also go for the cheap way, and just take an average. It looks like the center of that scatter graph is around -26.01255, 28.11445. (Unless I’m being dense, that graph seems ‘sideways’ from how we typically view a map, but I digress. The latitude was also stripped of its sign, which put it in Egypt… But again, I digress.)

So suppose we just defined that as the “correct” location, as it’s a good median value. Could we not write code to take the difference in reported location and translate it into a shift in time? Say that six meters East is the same as running 2 microseconds fast? (Totally arbitrary example.) I think the complicating factor wouldn’t be whether it’s possible, but knowing what to use as ‘true time’: if you picked an inaccurate assumed-accurate location, you’d essentially be introducing error, albeit a constant one. The big question, though, is whether it’s worth it: GPS is quite accurate as it is. I’m a perfectionist, so there’s no such thing as “good enough” time, but I have to wonder whether the benefit would even show up.
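As a back-of-the-envelope sketch of the idea, in Python. The fixes are made-up numbers, and the conversion assumes a position error maps through the speed of light (a pure ranging error of d meters is roughly d/c seconds), which glosses over the actual satellite geometry:

```python
C = 299_792_458.0  # speed of light, m/s

def centroid(fixes):
    """Average a list of (lat, lon) fixes to define the 'correct' spot."""
    lats, lons = zip(*fixes)
    return sum(lats) / len(lats), sum(lons) / len(lons)

def time_error_ns(offset_m):
    """Equivalent timing error, in nanoseconds, for a ranging error in
    meters, assuming the error is purely light-travel-time."""
    return offset_m / C * 1e9

fixes = [(-26.01250, 28.11440), (-26.01260, 28.11450)]  # hypothetical scatter
lat, lon = centroid(fixes)
print(round(lat, 5), round(lon, 5))
print(round(time_error_ns(6.0), 1))  # a 6 m offset
```

Under that (rough) assumption, six meters works out to about 20 nanoseconds, which gives a feel for how small the correction would be.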

Building an Improvised CDN

From my “Random ideas I wish I had the resources to try out…” file…

The way the “pretty big” sites work is that they have a cluster of servers… A few are database servers, many are webservers, and a few are front-end caches. The theory is that the webservers do the ‘heavy lifting’ to generate a page… But many pages, such as the front page of a news site, Wikipedia, or even these blogs, don’t need to be generated every time. The main page only updates every now and then. So you have a caching server, which basically handles all of the connections. If the page is in cache (and still valid), it’s served right then and there. If the page isn’t in cache, the cache will get the page from the backend servers, serve it up, and then add it to the cache.
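The cache-server logic in that paragraph fits in a few lines. A minimal sketch in Python (the class and its parameters are illustrative, not any particular cache’s API):

```python
import time

class TtlCache:
    """Serve from cache while fresh; on a miss, hit the backend and store."""

    def __init__(self, fetch_backend, ttl_seconds=30.0, clock=time.monotonic):
        self.fetch = fetch_backend   # callable: path -> generated page body
        self.ttl = ttl_seconds
        self.clock = clock           # injectable clock, handy for testing
        self.store = {}              # path -> (expires_at, body)

    def get(self, path):
        now = self.clock()
        hit = self.store.get(path)
        if hit and hit[0] > now:     # in cache and still valid: serve it
            return hit[1]
        body = self.fetch(path)      # miss: regenerate from the backend
        self.store[path] = (now + self.ttl, body)
        return body
```

With the 30-second TTL I mention below, the backend sees at most two requests per minute for a given page, no matter how much traffic hits the cache.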

The way the “really big” sites work is that they have many data centers across the country and your browser hits the closest one. This improves load times and adds redundancy (data centers do periodically go offline: The Planet had an outage just last week when a transformer blew up and the fire marshals made them shut down all the generators). Depending on whether they’re filthy rich or not, they’ll either use GeoIP-based DNS or have elaborate routing going on. Many companies offer these services, by the way. It’s called a CDN, or Content Delivery Network. Akamai is the most obvious one, though you’ve probably used Limelight before, too, along with some other less-prominent ones.

I’ve been toying with SilverStripe a bit, which is very spiffy, but it has one fatal flaw in my mind: its out-of-the-box performance is atrocious. I was testing it on a VPS I hadn’t used before, so I don’t have a good frame of reference, but I got between 4 and 6 pages/second under benchmarking. That was after I turned on MySQL query caching and installed APC. Of course, I was using SilverStripe to build pages that would probably stay unchanged for weeks at a time. The 4-6 pages/second is similar to how WordPress behaved before I worked on optimizing it. For what it’s worth, static content (that is, stuff that doesn’t require talking to databases and running code) can handle 300-1000 pages/second on my server, as some benchmarks I did demonstrated.

There were two main ways to enhance SilverStripe’s performance that I thought of. (Well, a third option, too: realize that no one will visit my SilverStripe site and leave it as-is. But that’s no fun.) The first is to ‘fix’ Silverstripe itself. With WordPress, I tweaked MySQL and set up APC (which gave a bigger boost than with SilverStripe, but still not a huge gain). But then I ended up coding the main page from scratch, and it uses memcache to store the generated page in RAM for a period of time. Instantly, benchmarking showed that I could handle hundreds of pages a second on the meager hardware I’m hosted on. (Soon to change…)

The other option, and one that may actually be preferable, is to just run the software normally, but stick it behind a cache. This might not be an instant fix, as I’m guessing the generated pages are tagged to not allow caching, but that can be fixed. (Aside: people seem to love setting huge expiry times for cached data, like having it cached for an hour. The main page here caches data for 30 seconds, which means that, worst case, the backend would be handling two pages a minute. Although if there were a network involved, I might bump it up or add a way to selectively purge pages from the cache.) squid is the most commonly-used one, but I’ve also heard interesting things about varnish, which was tailor-made for this purpose and is supposed to be a lot more efficient. There’s also pound, which seems interesting, but doesn’t cache on its own. varnish doesn’t yet support gzip compression of pages, which I think would be a major boost in throughput. (Although at the cost of server resources, of course… Unless you could get it working with a hardware gzip card!)

But then I started thinking… That caching frontend doesn’t have to be local! Pick up a machine in another data center as a ‘reverse proxy’ for your site. Viewers hit that, and it will keep an updated page in its cache. Pick a server up when someone’s having a sale and set it up.

But then, you can take it one step further, and pick up boxes to act as your caches in multiple data centers. One on the East Coast, one in the South, one on the West Coast, and one in Europe. (Or whatever your needs call for.) Use PowerDNS with GeoIP to direct viewers to the closest cache. (Indeed, this is what Wikipedia does: they have servers in Florida, the Netherlands, and Korea… DNS hands out the closest server based on where your IP is registered.) You can also keep DNS records with a fairly short TTL, so if one of the cache servers goes offline, you can just pull it from the pool and it’ll stop receiving traffic. You can also use the cache nodes themselves as DNS servers, to help make sure DNS is highly redundant.

It seems to me that it’d be a fairly promising idea, although I think there are some potential kinks you’d have to work out. (Given that you’ll probably have 20-100ms latency in retrieving cache misses, do you set a longer cache duration? But then, do you have to wait an hour for your urgent change to get pushed out? Can you flush only one item from the cache? What about uncacheable content, such as when users have to log in? How do you monitor many nodes to make sure they’re serving the right data? Will ISPs obey your DNS’s TTL records? Most of these things have obvious solutions, really, but the point is that it’s not an off-the-shelf solution, but something you’d have to mold to fit your exact setup.)

Aside: I’d like to put nginx, lighttpd, and Apache in a face-off. I’m reading good things about nginx.

The State of Linux

I don’t really remember precisely when I started using Linux, but I do distinctly remember December 31, 1999, around 11:55pm, sitting in front of my computer and seeing what would happen. (Absolutely nothing out of the ordinary.) I was in KDE at the time, back when they had a HUGE digital clock that looked like crap even then.

I remember when USB thumb drives came into vogue, and I tried using mine in Linux. They worked! I just had to pull up a shell window, su to root, mkdir /mnt/usb, and then mount it there. And one day I forgot to umount before unplugging it, causing a kernel panic. Windows, meanwhile, let you plug the thumb drive in and seamlessly mapped it to a new drive. When you pulled it out, it unmounted the drive for you. (Although it still occasionally gripes at me with “Delayed Write Failed” even after I’ve closed everything using it and let it sit for quite some time. But I digress.)

Today, without thinking, I decided to plug my Logitech G15 into my Linux machine, running Ubuntu’s Hardy Heron release. It worked, but that’s not saying much: any old OS can see a USB keyboard. But what took me by surprise was what happened next. Without thinking, I used the volume wheel on it to turn down my music. It worked! On a whim, I hit the “Previous Track” button, and Rhythmbox started playing the previous song. I had to install drivers for this in Windows, but not in Linux. How’s that for a role reversal?

Of course, this isn’t a “Linux is superior” post. There are still some flaws on my system that drive me crazy (why do my graphics drivers keep suspend/hibernate from working?!), but I can say that about Windows too. The point is that Linux used to be laughably far behind Windows in terms of things “just working.” And now I occasionally find myself wishing Windows were as easy to use as Linux in some regards. That’s impressive progress!