The Curious Case of My Mac’s Clock

When I traveled to Europe and back, I crossed time zones repeatedly. I went from Boston to London; while I was in London, Boston switched off DST; then I moved from London to Brno, forward another hour; then back to London, and finally back to Boston.

Not wanting to lug my heavy Thinkpad around, I brought my MacBook Air on the trip. (And was appropriately ostracized by coworkers for bringing a Mac to a meeting at a company best known for its Linux distribution.) And it did something that kind of intrigued me — it automatically adjusted to the local timezone.

Now, this shouldn’t be possible. My clock is synchronized using NTP, which exchanges UTC time, deliberately ignorant of timezones or DST. And even if NTP did use local timezones (which would be a huge disaster), it still wouldn’t have worked, because I had hardcoded some local clocks.

Investigating a bit, I found this. This is seemingly a new feature, and it uses the operating system’s location service to find your timezone. My computer doesn’t have an onboard GPS, of course, so it must be doing one (or both) of two things: GeoIP, and looking up the access points it can see against a database.

This is really neat, but it’s also pretty damned creepy. Here is what I see when I visit that setting:

It knows exactly where I am.

I’m willing to bet it knows exactly where the closest tinfoil hat shop is, too.

I Use This: iTerm

I ran into some trouble with the Mac’s Terminal not supporting vim colors properly. It turns out that Terminal’s color support just isn’t very good. Someone recommended I try iTerm, so I did. I wish I’d done it sooner. And it implements the feature I’ve craved for the longest time: blurring the background if you have a translucent terminal window:

There’s a little weirdness in that it doesn’t seem to update quite as fast as Terminal did, and triple-clicking selects only the visible line, not the whole logical line when it wraps. For the most part these are petty differences. Overall, though, Terminal’s a thing of the past, and iTerm’s here to stay.

Update: I’d noticed that iTerm hadn’t been updated in a long time. Thanks to George for commenting below that he’s picked up development and rechristened the project iTerm 2. Even better!

Building a Macintosh

At work, I came to be a Mac fan. It’s based on BSD, but adds a much more polished GUI. It’s got a very user-friendly interface, and yet it’s trivial for me to pull up the command line and do “real” things there. So when I decided to build a new desktop, I decided I wanted to build something that would run OS X. Unfortunately, Apple’s hardware is ludicrously expensive, and I was on a budget. The good news is that you can build your own!

It used to be that you had to download a “cracked” version of OS X, which had a bunch of patches applied to make it run on non-Apple hardware. Now, though, the tweaks occur elsewhere — some minor changes to the bootloader, and a couple kernel modules to do the rest. The install process wasn’t too convoluted, and the result was well worth it:

It’s not actually a Xeon, but that’s okay.

Here’s what I used for hardware. I tried to stick to things that were fairly close to the hardware Apple uses, to ensure compatibility:

  • Motherboard: Gigabyte EX58-UD5. NewEgg hasn’t carried it in a long time. I got mine somewhere else. This is one of the few components that you have to be careful about, verifying them with hardware compatibility lists.
  • CPU: Intel i7-930, a quad-core 2.8 GHz processor with plenty of cache.
  • CPU cooler: I’m used to my laptop running hot, so I went overboard and bought a huge Zalman cooler.
  • Graphics: I went for an EVGA-branded GeForce 9800GT, because it’s easy to get working. This and the motherboard are the two main components to watch for OS X compatibility. A fairly simple “enabler” script brought me from crappy VESA display to dual displays at 1920×1080, with 3D acceleration. I was going to skimp on the video card, but Kyle convinced me that I’d regret it. Within a couple weeks of building the system, Steam launched a Macintosh client, so I was in business playing TF2. The card runs great.
  • RAM: 6x 2GB DIMMs. I went for these OCZ sticks. The RAM ended up costing me more than any other component, but I’ve spent pretty much my entire life wishing I had more RAM. It’s really great to have more than enough. I probably would have been okay with 6GB, but in my experience, 4GB is inadequate, and I wanted room to grow.
  • Hard drive: I should have bought an SSD, in hindsight. But I was trying to keep the cost down, and I was used to having inadequate disk space, so I sprung for a 2TB SATA disk. At 5400rpm, it’s a slow performer, but I don’t do anything too disk-intensive. Some day I may pick up an SSD for the OS, applications, and the portion of my home directory that isn’t 75GB of photographs or 10GB of music or 250GB of backups.
  • Bluetooth adapter: this little AZiO adapter fits in one of the myriad USB slots, protruding only a minuscule amount, powering my Bluetooth mouse. I was concerned about Mac support, and the reviews are full of people complaining that it doesn’t work on Windows 7. But it’s plug-and-play on the Mac.
  • Case: I got a nice big, roomy Zalman case. It’s kind of a ludicrous amount of money for a case, but I think it’s well worth it to have a really well-built one. Little details, like slide-in rails for the power supply, the fact that the power supply sits at the bottom rather than the top, and the tons of hot-swap bays on the front, make it a pleasure to work with. The USB ports on the top front are a nice touch, too.

I have an actually-legally-purchased copy of OS X. I used the digital_dreamer install scripts, which required having an already-functioning Mac. Luckily I had a Mac laptop. I connected the hard drive via USB (about the thousandth time that I realized my USB-to-IDE/SATA adapter was a great purchase) and formatted it properly, did a retail install of OS X to the drive, and then ran a script to patch in the requisite drivers. Then I plunked the drive into my new machine, booted it up, downloaded the latest OS updates, and ran a couple things like the “enabler” scripts for my Ethernet cards and the graphics. Now it runs superbly well; the only thing I lack is a sleep mode. I think that’s easy to fix, actually, but I haven’t bothered yet.

For fun, a few photos of the build.

This kind of amused me. The processor came in a huge package. In the course of removing the heatsink, this little silver square fell out. The little silver square, of course, is the processor:

Here’s the motherboard fitted with the processor and the default heatsink:

But I didn’t trust that fan. Here it is with the Zalman cooler installed. (Yes, the protective plastic is still on. This was a dry fit, before everything was mounted. Yes, I remembered to remove it when I installed it for real.)

Quite a bit larger. Good thing I got a roomy case.

I’ve been quite happy with the setup, and I’d highly recommend it.

Thinking Like an Engineer

Lately a lot of my work as a web developer has been way at the back-end, and, for whatever reason, it tends to focus heavily on third parties. I spent a while fixing a bizarre intermittent error with our credit card processor, moved on to connecting with Facebook, and am now working on a major rewrite of the API client we use to talk to our e-mail provider. Sometimes it starts to bleed over into my personal life.

This kind of turned into crazy-person babble, but I haven’t posted in a while, so here goes a perhaps-horrifying look into how my mind works:

  • Driving home one night, I went through the FastLane / EZPass lane, as I often do. Only this time, instead of thinking, “I hate that I have to slow down for this,” I started thinking about latency. Latency is one of the biggest enemies of people working with third parties. It was at the crux of our problems with the credit card processor — we’d store a card and immediately try to charge it, when sometimes we had to wait “a little bit” before the card was available throughout their system to be charged. So I had to introduce a retry loop with exponential backoff. The email API work has major hurdles around latency and timeouts. We’ve moved almost all of it into a background queue so that it doesn’t delay page load, but even then we have intermittent issues with timeouts. So driving through the FastLane lane today, I slowed to about 45, and thought how remarkable it was that, even at that speed, it was able to read the ID off my transponder, look it up in a remote database somewhere, and come back with a value on what to do. I’d have assumed that they’d just queue the requests to charge my account, but if my prepaid balance is low, I get a warning light shown. It seems that there’s actually a remote call. It’s got to happen in a split-second, though, and that’s pretty impressive. I wonder how they do it. I thought a lot about this, actually.
  • I work on the fourth floor of a building with one, slow elevator. A subsection of Murphy’s Law indicates that the elevator will always be on the exact opposite floor: when I’m on the first floor, it’s on the fourth, for example. So one day while waiting for the elevator, I started thinking that it needed an API. I could, from my desk, summon it to our floor to lessen my wait time. Likewise, I could build an iPhone app allowing me to call the elevator as I was walking towards it. The issue of people obnoxiously calling the elevator way too early seems like a problem, but I think it’s okay — if you call it too soon, it will arrive, and then someone else will call it and you’ll miss out entirely. It’s in everyone’s interest to call it “just right” or err on the side of a very slight wait.
  • While thinking more about the elevator API, I started thinking about how elevators aren’t really object-oriented. (I’m pretty sure that’s never been written before.) It seems an elevator is really pretty procedural, running something like goToFloor(4). The obvious object would be Floors, but that’s not really right. You’re not adding Floors to the building, or even changing properties of Floors. The object is really CallRequest, and it would take two attributes: an origin and a direction. “Come to floor two, I’m going up.” It made me think that there are some places that being object-oriented just doesn’t make a ton of sense.
  • You really want to add authentication. To get to our floor, you need to swipe your badge. The elevator API needs to account for the fact that some requests require validating a user’s credentials to see whether they’re authorized to make the request at all.
  • “Code an elevator” would actually be an interesting programming assignment, though I fear it’s too far removed from most normal coding. I started out thinking that you’d want to sort CallRequests in some manner, apply some algorithms, and then iterate over the CallRequests. I think you actually want to throw out that concept. You have a tri-state to control direction: Up, Down, and Idle. Then you have two arrays: UpwardCalls and DownwardCalls. They don’t even need to be sorted. As you near a floor, you see if UpwardCalls contains that floor. If so, you stop; if not, you continue. When you’ve reached the highest floor in UpwardCalls, you check whether DownwardCalls has any elements. If so, you set your direction to Down and repeat the same procedure for DownwardCalls. If there are no DownwardCalls, you set your state to Idle. The problem is that this is really not how I’m used to thinking. I want to iterate over CallRequests as they come in, but that means the elevator goes all over the place. The person on the 4th floor wants to go to the 2nd, so we make that happen. But right as they put that request in, the person on the 3rd wants to go to the 1st. So you’d go 4 to 2 to 3 to 1. “Fair” queuing, but ridiculously inefficient. What you want, of course, is to stop on the 3rd on your way from the 4th to the 2nd and pick the person up.
  • I wonder how things work when you have multiple elevators. In big buildings you’ll often have something like 8 elevators. I’m far too tired to try to figure out the ideal way to handle that. They need to be smart enough to share a common queue, so that I don’t have to hit “Up” on all eight elevators and take whatever comes first, but deciding which elevator can service my request first is interesting. I kind of think it’s another case of elevators not matching the programming I’m used to: it’s just whichever elevator happens to pass my floor going in the right direction. But what if there’s an idle elevator? Can it get to me first, or will an already-running elevator beat it? Do you start the idle elevator immediately and make it event-driven? What if the already-running elevator gets a request for another floor between its present location and mine? You’d need to recompute. You’re probably better off dispatching the idle elevator and just giving me whichever arrives first.
  • You then need to figure out what’s important. If you have an idle elevator that can get to me more expediently than an already-running elevator, but the wait time wouldn’t be that much longer, do you start up the idle elevator, or do you save power and have me wait? How do you define that wait? Is this something elevator-engineers actually tune?
  • I think you want to track the source of a request — whether it came from within the elevator or from the external button on a floor. If it’s within the elevator, you obviously need to stop, or the elevator has effectively “kidnapped” the person. But if it’s an external button, you might want to just skip it and let another elevator get to it, if you have a bunch of CallRequests you’re working through. Ideally, you’d also approximate the occupancy of the elevator based on the weight (from reading the load on the motors?), and when the elevator was at perhaps 75% capacity, stop processing new external requests.
  • Should the elevator controller try to be proactive? It might keep a running log of the most “popular” floors out of, say, the last 50 CallRequests, and, when it was done processing all CallRequests, go to whatever the most popular was and sit idle there? Or perhaps it should work its way towards the middle floor? If you had multiple elevators you could split them apart that way. Is it worth the power conservation?
  • The office thermostat should have an API, too. (I bet the good ones do. But we don’t have those.) Thermostats are a pain to use. readTemperature and setTemperature are the obvious API calls, though advanced thermostats would have a TemperaturePlan concept.
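The retry loop with exponential backoff mentioned above is easy to sketch in shell. This is a hypothetical toy, not our production code: charge_card stands in for the real credit-card processor call, rigged to fail twice so the backoff path actually runs.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: retry a flaky call with exponential backoff.
# charge_card is a stand-in for the real processor API; it's rigged to
# fail until the third attempt so the doubling delays actually run.

CALLS=0
charge_card() {
  (( ++CALLS >= 3 ))
}

retry_with_backoff() {
  local max=$1 cmd=$2 delay=1 attempt
  for (( attempt = 1; attempt <= max; attempt++ )); do
    "$cmd" && return 0    # success: stop retrying
    sleep "$delay"        # give their system "a little bit" to catch up...
    (( delay *= 2 ))      # ...doubling the wait each time: 1s, 2s, 4s, ...
  done
  return 1                # exhausted all attempts
}

retry_with_backoff 5 charge_card && echo "charged after $CALLS attempts"
```

Here the charge succeeds on the third try, after waiting one second and then two; the cap on attempts keeps a genuinely dead service from hanging you forever.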
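The two-array scheduler described above is simple enough to sketch. This bash toy (all names hypothetical) runs just the downward pass from the example: in-car and hall calls have landed floors 3, 2, and 1 in the downward queue, and the car is on 4. No sorting, no iterating over requests, just a membership test at each floor:

```shell
#!/usr/bin/env bash
# Toy model of the DownwardCalls pass described above. A real controller
# would also keep an upward queue and flip the Up/Down/Idle tri-state
# whenever the current queue is exhausted.

down_calls=(3 2 1)   # pending downward stops; order doesn't matter
floor=4              # car's current position, heading down
stops=()             # floors we actually stop at, in order

in_list() {          # in_list NEEDLE ITEM... : succeeds if NEEDLE is present
  local needle=$1 f
  shift
  for f in "$@"; do
    [[ $f -eq $needle ]] && return 0
  done
  return 1
}

# Descend one floor at a time; stop wherever a call is pending.
while (( floor >= 1 )); do
  in_list "$floor" "${down_calls[@]}" && stops+=("$floor")
  (( floor-- ))
done

echo "stopped at: ${stops[*]}"   # stopped at: 3 2 1
```

The car naturally picks up the 3rd-floor rider on its way from 4 to 2, which is exactly the behavior the fair-but-inefficient request queue couldn’t give you.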

Deal Roundup

My post this week will probably reflect a clear bias towards good deals on LCDs, since I’m itching to pick up another one. There are some other good ones in here, too, though:

  • Dell S2409W 24″ LCD, 1920×1080, for $199. (via FatWallet)
  • HP W2338H 23.3″ LCD, 1920×1080, $220 minus $50 = $170 after coupon, at Staples. (via FatWallet with coupon code)
  • Acer Aspire One netbook, $299.99 but there’s a $75 off coupon for purchases over $300. 1GB RAM, 160GB disk, 8.9″ LCD at 1024×600, WiFi, webcam, WinXP. Need to spend a penny more to get the deal. (via FatWallet) Possible free printer with purchase, too?
  • White Asus EEE, refurbished, 1.6 GHz Atom, 4GB SSD, 1GB RAM, 8.9″ LCD, Linux – $169.99.
  • Dell S2009W 20″ LCD, for someone looking for a more modest monitor, $109. 1600×900 (via FatWallet)
  • 6GB RAM + 5-disk hot-swappable SATA RAID enclosure, a bizarre combo, but only $220, a huge discount, at NewEgg. (via FatWallet)
  • Samsung 2343BWX 23″ LCD with ridiculous 2048×1152 resolution, $220 at NewEgg.
  • Acer X223Wbd, 22″ 1680×1050 LCD, $150 at NewEgg.
  • Acer H213H bmid, 21.5″ 1920×1080 LCD, $180 at NewEgg.
  • WD 1TB SATA “Green” disk (internal), $90 at NewEgg.
  • Choice of one of two 1TB external disks, $99.99 at NewEgg.
  • 1.5TB Barracuda, $130 at NewEgg, but beware the comments of it not working well in arrays.

Unapologetic

Apparently, a company wrote an application for the iPhone called Baby Shaker. It depicts a crying baby, and you vigorously shake the iPhone to make it stop, at which point its eyes are replaced by X’s.

Apple pulled the application from its store and apologized, saying, “This app is deeply offensive and should not have been approved for distribution on the App Store.”

The Sarah Jane Brain Foundation, however, has had enough, with a spokesperson calling it “the most cynical apology I have ever seen.” They plan to picket Apple stores, calling on them to “mitigate the harm they’ve now caused.”

What I find so interesting is how the Sarah Jane Brain Foundation has had “The PETA Effect” here, at least for me: so vehemently overstating your cause that you steer people to the other side. If I’d seen the application distributed, I’d surely have joined the Sarah Jane Brain Foundation in finding it horrifically offensive. It’s in bad taste and makes light of an abusive practice that kills many babies and leaves even more with permanent injuries and brain damage.

And yet, with them coming across as so overzealous, my “That’s really kind of funny” sense is triggered, just a tiny bit. I guess I find their position so outrageous since:

  • I don’t like Apple having sole control of what I can run on my iPhone. Apple pulled the app, which means that, unless I jailbreak my iPhone (voiding my warranty), I can’t have the application. I’m not sure I want this application, but it’s a sore spot with me. The fact that Apple pulled the app just drives home Apple’s exclusive control.
  • Apple promptly pulled the app. The “most cynical apology” actually seemed to be a pretty emphatic, “That application was horribly offensive. We screwed up big time in approving it!” from Apple. I’ve definitely heard much more cynical. (I’m sorry you feel that way…)
  • The application shows that shaking babies kills them. Sure, it demonstrates it in an awful way, but it’s like showing tapes of babies’ skulls being crushed to lobby against abortion. Isn’t this exactly the message the Sarah Jane Brain Foundation should want spread?
  • I think that, and then get the sneaking suspicion that they are loving this, because it’s giving them tremendous publicity. And calling for protests outside Apple stores, weeks after they pulled the application and apologized for it, only furthers that point.

What do you think? Was Apple’s apology (and prompt retraction) of the app good enough? Should Apple have left it up even though it was controversial?

If I Made Computers

I think you could say with relative accuracy that there are three main bottlenecks in a computer: CPU, memory, and disk. There are some outliers that people might try to lump in: video card performance, or network throughput if you’re tweaking interrupts on your 10GigE card. But the basic three are pretty universal.

To cut to the chase: I hit disk bottlenecks sometimes, CPU bottlenecks almost never, and RAM bottlenecks all the time. And sometimes high load that looks to be on the CPU is really just I/O wait cycles. But RAM is special: if you have enough RAM, disk throughput becomes less important. At least, redundant disk I/O, which seems to account for a lot of it.

What interests me, though, is that almost everything is RAM starved in my opinion. My laptop has 2GB and I get near the limit fairly often. I’m thinking of trying to take it to 4GB. The jury’s out on whether or not it’ll see more than 3GB, and others complain that 3GB causes you to lose out a bit on speed.

But here’s the thing. I maintain things like a MySQL server with 32GB RAM. It’s not RAM-bound per se: we could switch to a machine with 1GB RAM and MySQL would still run fine. The memory is overwhelmingly configured for various forms of cache. But it’s not enough: there’s still a steady stream of disk activity, and a non-negligible number of queries that have to write temporary tables to disk.

RAM is cheap. It’d cost me about $50 to buy 4GB of RAM for my laptop. The reason RAM stops being cheap is that most motherboards don’t give you enough slots. Both of my laptops can only take two DIMMs, which means I need dual 2GB sticks. They’re both based on older 32-bit chipsets, so I can’t exceed 4GB, but if I wanted to, I’d need dual 4GB sticks, and those are expensive. Even on decent servers, it’s hard to find many that give you more than 8 slots, making 32GB hard to exceed.

So what I’d really like to see someone bring to market is a 1U box with as many memory slots as it’s physically possible to fit in. 1U is still tall enough to have standard DIMMs standing up, and most 1U chassis are extremely deep. I bet you could fit 256 slots in. Then throw in a compact power supply, a standard LGA775 socket (allowing a quad-core chip), a good Gigabit NIC or four, and an optional FibreChannel card. No hard drives. Maybe a 4GB CompactFlash card if you really want it to have its own storage. Oh, and make sure the motherboard is pretty versatile in terms of RAM requirements and FSB. Oh, and don’t force me to go with ECC. If this were a single database server, it might be worth buying top-notch ECC RAM. But if this were just for caching things, I don’t care. Cache isn’t meant to be permanent, so an error is no big deal.

256 slots, and you could fill it with ultra-cheap 1GB DDR2 DIMMs. (Heck, at work, we have a bag of “useless” 1GB sticks that we pulled out.) You can get ’em for $10 a pop, meaning 256GB RAM would cost about $2,560. I suspect the system would command a high premium, but really, it’s just $2,560 worth of RAM and a $200 processor. A 2GB DIMM is about twice as much ($20/stick), but $5,000 for half a terabyte of RAM isn’t bad. Though 4GB DIMMs are still considerably more: they’re hard to find for under $100.

I think this would be a slam-dunk product. memcached is pretty popular, and it’s increasingly being used in previously unheard-of roles, like a level 2 cache for MySQL. There are also a lot of machines that just need gobs of RAM, whether they’re database servers, virtual machine hosts, or application servers. And tell me a file server (sitting in front of a FibreChannel array) with 256GB RAM for caches and buffers wouldn’t be amazing.

So, someone, hurry up and make the thing. The key is to keep it fairly cheap. Cheaper than buying 4GB DIMMs, at least.

Cool MySQL Projects

I think part of the thing I love about MySQL is the same thing I love about Firefox and Thunderbird. Even though I sometimes wonder if there are databases that would work better than MySQL or have fewer bugs, MySQL has an incredibly rich community that’s home to myriad fascinating projects.

I’ve looked into MySQL Proxy at work a bit, and found it pretty interesting, but it’s still considered alpha-level code, and we don’t really need a proxy in front of our database right now.

I’m not even going to pretend that the links that follow are all things I’ve just found on my own. They’re inspired by this Xapbr post. Xapbr is a blog run by one of the top guys at Percona, the well-known MySQL consulting firm perhaps best known for the O’Reilly High Performance MySQL book. (A must-read even for people who aren’t full-time DBAs!)

It turns out that MySQL Proxy has a competitor, Dormando’s Proxy for MySQL, better known as dpm. It’s a BSD-licensed proxy created by SixApart’s DBA. (SixApart used to own LiveJournal, which is well-known for releasing memcached to the community.) Given that I’ve worked with neither dpm nor MySQL Proxy, I’m not able to comment on the merits of either, but it’s a handy link to have.

I’m also interested in the Tungsten Replicator, an Open Source project providing (emphasis mine) “database-neutral master/slave replication,” which was interesting enough to keep me reading past the words “JDK” and “Ant,” which are technical terms meaning “bloat” and “memory leaks.” (I kid… Sort of.) It looks like the project is mostly MySQL-centric right now, but capable of replicating MySQL databases into Oracle. (But apparently not the other way, probably since Oracle binlogs aren’t as publicly-documented as MySQL’s.)

And then there’s WaffleGrid, a project using memcache as an L2 cache. That is, MySQL will first look in its own caches, but in the event of a cache miss, will check memcache before turning to disk. It looks like it gives stellar results, even though it’s one of those things that seems kind of wrong. (You’d think that, if you’re adding memory to better cache your database, you’d want to put it in the MySQL machine, and if you needed to scale to a cluster of memcache boxes, you might be better off spreading the load around several MySQL boxes… But at the same time, clustering isn’t something MySQL excels at right now.)

Drizzle is “a lightweight SQL database for [the] cloud and web,” which is forked from MySQL 6. (And for those of you scratching your heads: yes, I said version 6. Yes, 5.1 is the latest release branch.) There are some big names working on Drizzle, and besides “the cloud,” they talk a lot about “massive concurrency,” which is something that should be of interest even to people running more traditional databases.

And, of course, there’s Percona’s XtraDB engine for MySQL, a significant overhaul to InnoDB. It seems a bit of a young technology for production environments, and yet it seems like it’s production environments where it would have the best impact.

This is only tangentially related to MySQL, but I’ve seen Sphinx, a SQL-based fulltext search engine, mentioned in at least three distinct places today. I’d be curious to see how it compares with Lucene.

bash: Looping over a range

This falls into the category of things that are simple yet I’m always trying to remember. Sometimes in bash I find myself wanting a for loop, not of the usual foreach type (for i in `ls /etc/`; do ...), but a more classical one (a la C-style for var=0; var<100; var++ ... syntax). Here’s the bash way:


for i in {0..99}; do
    echo "$i"
done

The output is a list of numbers from 0 to 99. For a fun example of when you might need this, consider the case of noticing that Postfix isn’t running on a machine and starting it. Your mail client promptly crashes, and the flurry of output in the log indicates that it’s delivering mail as fast as it can, but complaining that some mail has been in the queue for 180 days. “Holy smokes,” you exclaim, stopping Postfix. “Just how much mail is there?” You’re not entirely sure because you aborted your ls after it sat for a good minute with no output. That’s never good. du -sh shows about a gig and a half.

“I’ve got this under control,” you think. rm -f /var/spool/postfix/*. Ta-da! Wait, what’s this “/bin/rm: Argument list too long” business? That error can’t possibly be good. rm very rarely complains. So you try a smaller delete size, thinking you can do postfix/1*, postfix/2*, etc., and just step through it. That, too, is too much for rm.

So it ended up taking an absurd for i in {0..99}; do rm -f /var/spool/postfix/$i; done to purge the mail. (And that didn’t catch all of it, either; I’m not sure off-hand how to do this in hex.) Each iteration deleted somewhere around 2,000 mail messages, making me think there were something like a half-million messages. (00-FF, times a little under 2,000.)
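As for doing that loop in hex: brace expansion only counts in decimal, but printf can render the counter as two-digit uppercase hex, covering 00 through FF. A sketch (echoing the command instead of running rm, so it’s safe to paste):

```shell
# Walk 0..255, formatting each value as a two-digit uppercase hex bucket.
for i in {0..255}; do
  printf -v bucket '%02X' "$i"   # 0 -> 00, 171 -> AB, 255 -> FF
  echo rm -f "/var/spool/postfix/$bucket*"
done
```

Whether the queue files really are bucketed by a two-character uppercase-hex prefix is my assumption here; spot-check one prefix with ls before swapping the echo for the real rm.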

Removing Duplicates from MySQL

I Googled that a while ago. How do you remove duplicate rows from a SQL database? (MySQL specifically.)

If you have a ‘simple’ duplicate setup, it’s pretty easy, and lots of other sites go into great detail. (And in that case, you might look into how to require that fields be unique so you don’t have that problem anymore…)

In our case, we ran into an obscure problem where we expected the combination of two fields to be unique. Without publicizing anything secret, we had a summary table of user_id and month. (Not that it’s actually anything secret, but I refer to the table I did this on as “tablename,” just since blogging about our production database schemas seems like a bad idea…) A brief bug introduced some duplicates. My task? Remove the duplicate rows.

So I went about it somewhat backwards, and started wondering how to do a SELECT DISTINCT across two columns. You can’t, really. Someone recommended something clever: CONCAT() across the two field names, and a DISTINCT on that: SELECT DISTINCT(CONCAT(user_id, month))… After toying with that a bit, I was able to create a temporary table housing a de-duped version of our data:


CREATE TEMPORARY TABLE nodupes
SELECT DISTINCT CONCAT(user_id, month) AS asdf, tablename.*
FROM tablename
GROUP BY asdf;

But how does that help? The problem with temporary tables is that they’re, uhh, temporary. If you lived on the edge, you could probably “DROP TABLE original_table; CREATE TABLE original_table (SELECT * FROM temp_table)”, but this has a few horrible problems. One is that you’re throwing away all your disk-based data so you can populate it with a version that exists only in RAM. If anything went wrong, you’d lose the data. Furthermore, in our case, this is a live, busy table. Dropping it in production is awful, and it would take many minutes to repopulate it since it’s a huge table.

So then it was clear what I wanted to do: something like DELETE FROM original_table WHERE id NOT IN(SELECT * FROM temp_table_with_no_dupes), except not using an IN() on a table with hundreds of thousands of rows. And this is where it got tricky. The key is to use that temporary table to derive a new temporary table, housing only the duplicates. (Actually, speaking of “the key,” create one in your temporary table: CREATE INDEX gofast ON nodupes(id); — you’ll need this soon, as it’s the difference between the next query taking 10 seconds and me aborting it after 30 minutes.)

The temporary table of duplicate rows is basically derived by joining your ‘real’ table against the de-duped table, with a rather obscure trick: a LEFT OUTER JOIN, keeping only the rows where the joined id IS NULL. So here’s what I did:


CREATE TEMPORARY TABLE dupes
(SELECT tablename.*
FROM tablename
LEFT OUTER JOIN nodupes nd
ON nd.id=tablename.id WHERE nd.id IS NULL);

For bonus points, we’re selecting from the real table, so the silly CONCAT() field doesn’t come through into this temporary table.

So once again, I momentarily flashed a sly grin. This is going to be easy! DELETE FROM tablename WHERE id IN(SELECT id FROM dupes);. But then my sly grin faded. It turns out that there are two problems here. The first is that there are still a lot of rows. In my case, I was doing an IN() with about 6,000 rows. MySQL doesn’t like that. But there’s a bigger problem. (Thanks to Andrew for confirming that this is a real bug: #9090.) MySQL doesn’t use indexes when you use a subquery like that. It seems the bug is pretty narrow in scope, but this is precisely it. So that query is awful, basically doing 6,000 full table scans. And on a really big table.

So the solution comes from something I’ve never even thought about doing before. MySQL calls it the “multi-delete.” I call it the, “Holy smokes, this could go really, really wrong if I’m not careful” delete.


-- Don't run this yet
DELETE bb
FROM tablename AS bb
RIGHT JOIN dupes d
ON (d.id=bb.id);

That’s the command you want, but a few caveats. First, change DELETE bb to SELECT * and run it to see what comes back, making sure it’s only grabbing the duplicate rows. You’re not able to use a LIMIT in a multi-delete query, so you have to be extremely careful here. Second, we tested this multiple times on various test databases before running it against our master database, and I was still uneasy when it came time to hit enter. Keep backups before you do this. And finally, in many cases, MySQL will expect an explicit “USE database_name” before the multi-delete, where database_name is the database housing the table you’re deleting from. Keep that in mind if you get an error that it can’t find the table, even if you explicitly reference it (via database_name.table_name).