A couple things I ran into today that I want to keep searchable here in case I run into them again, and that I figured might be useful to someone else someday:
Let’s say that you take down a MySQL server that’s a replicated slave to do a memory upgrade, and it takes a really long time to shut down, and then you find that the machine doesn’t like the “new” DIMMs, so you throw the old ones back in and power it up. Just hypothetically. You then restart MySQL and issue the START SLAVE command, but it dies with an error:
090127 14:53:17 [ERROR] Failed to open the relay log './mysqld-relay-bin.000023' (relay_log_pos 23726)
090127 14:53:17 [ERROR] Could not find target log during relay log initialization
The relay log and position were both wholly wrong. I poked around and found a lot of people who had run into this; it seems to be a data corruption issue, but it also happens occasionally on a reboot. There are a bunch of suggested fixes out there that don’t actually work. One thing that does work, though, is deleting the relay logs on the slave. (Any time someone on the Internet tells you to delete a file, you should, of course, think “move to another directory” instead, so you can undo it if need be.) Once I deleted the relay logs, the slave started right up.
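For what it's worth, here is a rough sketch of the "move, don't delete" approach in Python. The function name, the directory arguments and the default file pattern are all my own assumptions; the pattern is only guessed from the file name in the error message above, and it also catches the relay log index file, so adjust it to whatever your server actually uses.

```python
import shutil
from pathlib import Path


def move_relay_logs(data_dir, backup_dir, pattern="mysqld-relay-bin.*"):
    """Move files matching `pattern` out of data_dir into backup_dir.

    Hypothetical helper: the default pattern is an assumption based on the
    relay log name in the error message, not a MySQL-defined constant.
    Returns the sorted names of the files that were moved.
    """
    data_dir = Path(data_dir)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)

    moved = []
    for path in sorted(data_dir.glob(pattern)):
        if path.is_file():
            # Move rather than delete, so the files can be put back if needed.
            shutil.move(str(path), str(backup_dir / path.name))
            moved.append(path.name)
    return moved
```

You would call it with something like move_relay_logs("/var/lib/mysql", "/root/relay-log-backup"), where both paths are placeholders for your own setup, and presumably with MySQL stopped while the files are being moved.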
Lesson #2? Now you’re about 6,000 seconds behind the master, and the replication lag counter is going down at a rate of about 1 second per second. You can wait a couple hours. That seems pretty pathetic, though.
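The "couple hours" figure is just the lag divided by how fast the lag counter is shrinking. A tiny sketch of that arithmetic, with a made-up function name, using the numbers above:

```python
def catch_up_seconds(lag_seconds, lag_drop_per_second):
    """Estimate wall-clock seconds until the slave catches up.

    lag_drop_per_second is how fast the lag counter is shrinking,
    in seconds of lag per second of wall-clock time.
    """
    if lag_drop_per_second <= 0:
        raise ValueError("the slave is not catching up at all")
    return lag_seconds / lag_drop_per_second


eta = catch_up_seconds(6000, 1.0)
print(f"about {eta / 60:.0f} minutes")  # about 100 minutes
```

This assumes the catch-up rate stays constant, which it won't if the master is busy.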
If your to-do list reads, “1.) Get slave running, and then 2.) Fine-tune my.cnf, currently stolen from another machine,” there’s a chance that you have sync_binlog=1 set. This is bad for two reasons. The first is simply what it’s designed to do: flush the binlog to disk on every write, which is very safe but also very slow. The second is that it’s apparently especially slow on ext3, so it’s doubly important not to use this option, at least not when write performance matters.
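If you want to check a borrowed my.cnf for this setting without reading the whole thing, something like the following would do. It's a minimal sketch with a hypothetical function name: it only looks at the [mysqld] section, ignores include directives, and doesn't strip trailing comments from values.

```python
def find_sync_binlog(cnf_text):
    """Return the sync_binlog value set under [mysqld], or None if unset."""
    section = None
    value = None
    for raw in cnf_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith(("#", ";")):
            continue
        # Track which [section] we are in.
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip().lower()
            continue
        if section != "mysqld" or "=" not in line:
            continue
        key, _, val = line.partition("=")
        # Option files accept dashes and underscores interchangeably.
        if key.strip().lower().replace("-", "_") == "sync_binlog":
            value = val.strip()  # a later line overrides an earlier one
    return value


sample = """
[client]
port = 3306

[mysqld]
# copied from another machine
sync_binlog = 1
"""
print(find_sync_binlog(sample))  # 1
```

A result of None just means the option isn't set in that section of the text you passed in; it says nothing about what the running server is using.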