About Time

My old LayeredTech box that’s been taken offline was a stratum 2 NTP server, and I kept upping the load until it was set at 100 Mbps in the pool, which had it handling about 15-20 queries/second. NTP was very light on system resources, so for a machine that sits in a data center and is allowed 1,000 GB of traffic a month, I was able to handle a lot of traffic. It was maybe 1-2 GB of bandwidth/month, though an instantaneous look would show a crazy amount of incoming connections. (It’s 75-80 bytes of UDP.)

So when we set up the new server (shared between Andrew and me), I conned Andrew into allowing me to run NTP in Dom0 (the ‘root’ domain, as opposed to inside a virtual machine guest, since VMs, for probably-obvious reasons, aren’t allowed to manipulate the hardware clock) and put the server in the pool. (By some strange fluke, the geo-IP code saw the machine’s IP as being in Brazil… The server’s been added to the US zone, but also remains in the Brazil zone, as South America is very underserved: 17 servers in South America, versus 497 in the US alone.)

NTP is one of those things where the default configuration probably gets you 90% of the possible accuracy. Set up a machine and have it sync to pool.ntp.org and your clock will probably stay within 50ms of “true time.” (Assuming you use ‘real’ NTP, which polls up to every 1024 seconds, versus Windows’ conservative 7 day interval.)

However, you can squeeze more out of it. For one thing, when you’re becoming a member of the pool, you don’t want your server to sync to the pool itself, or you risk forming a feedback loop of sorts, in which your server might ultimately end up using itself as its reference. So you tend to hand-pick a nice array of servers.

NTP has a concept of stratums (“strata” is probably the correct plural). It essentially indicates the number of steps before you reach a “reference clock,” which is something definitively setting time, such as a GPS receiver (which can be accurate down to the nanosecond level), or even an actual atomic clock. When you sync to stratum 1s, you serve time as a stratum 2, and so on. A higher stratum doesn’t necessarily mean decreased accuracy, but that’s kind of like saying that the number of hops on a traceroute doesn’t necessarily mean increased latency: in practice, it does, though with NTP the difference is typically very small. NTP is very good, though, at evaluating its clocks.

So I recently redid the server lineup, and pulled out the entries for a couple stratum 2 servers, so that our server is consistently stratum 2. A decent number of stratum 1s are semi-private, but open to those running public servers, which means we can get away with syncing to really nice clocks.
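To see why dropping the stratum 2 entries keeps us at a steady stratum 2, here’s a tiny sketch. It assumes only the rule described above (you serve time at one stratum below whichever source you’re currently following); the function names and the “mixed” lineup are made up for illustration, not taken from our actual config.

```python
def served_stratum(selected_source_stratum: int) -> int:
    """Stratum we advertise: one more than the source we currently follow."""
    return selected_source_stratum + 1


def possible_strata(source_strata):
    """Every stratum we might advertise, depending on which source gets picked."""
    return sorted({served_stratum(s) for s in source_strata})


# Hypothetical lineups, for illustration only.
mixed_lineup = [1, 1, 1, 2, 2]    # stratum 1s plus a couple of stratum 2s
all_stratum_1 = [1, 1, 1, 1, 1, 1]  # six stratum 1s

print(possible_strata(mixed_lineup))   # we could end up at 2 or 3
print(possible_strata(all_stratum_1))  # we can only ever be 2
```

With a mixed lineup, our advertised stratum depends on which source NTP happens to prefer at the moment; with only stratum 1s upstream, it can’t wander.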

Another thing that makes a big difference is latency. Latency itself is actually not a big deal: NTP looks at how long packets take round-trip and adjusts, so that syncing to servers in Africa is really no different from syncing to local ones. At least, that’s the theory, though I’ve found it holds up fairly well in practice. But what matters is variable latency, especially uneven latency, such as if outgoing packets take a different route than incoming ones, which is increasingly common. (It usually makes no difference, as most people don’t need to calculate round-trip latency precisely…) This is where it helps to have more local clocks, as round-trip latency is small enough that differences between outgoing and incoming routing are minimized.
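The standard NTP arithmetic makes this concrete. A client records four timestamps per exchange (when it sent the request, when the server received it, when the server replied, and when the reply arrived) and computes an offset and a round-trip delay from them. The sketch below uses those standard formulas; the `simulate` helper and the example delays are my own invention for illustration, not measurements from our server.

```python
def ntp_offset_and_delay(t1, t2, t3, t4):
    """Standard NTP on-wire calculation.

    t1: client clock when the request left
    t2: server clock when the request arrived
    t3: server clock when the reply left
    t4: client clock when the reply arrived
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay


def simulate(true_offset, out_delay, back_delay, processing=0.0):
    """Hypothetical helper: fake one exchange with known one-way delays.

    true_offset is (server clock - client clock), in seconds.
    """
    t1 = 100.0                          # arbitrary client send time
    t2 = t1 + out_delay + true_offset   # arrival, read on the server's clock
    t3 = t2 + processing                # reply departure, server's clock
    t4 = t3 - true_offset + back_delay  # reply arrival, client's clock
    return ntp_offset_and_delay(t1, t2, t3, t4)


# Far-away but symmetric path: 150 ms each way, true offset 10 ms.
far_offset, far_delay = simulate(0.010, 0.150, 0.150)

# Nearby but lopsided path: 10 ms out, 20 ms back, same true offset.
near_offset, near_delay = simulate(0.010, 0.010, 0.020)

print(f"symmetric:  offset={far_offset:.4f}s delay={far_delay:.4f}s")
print(f"asymmetric: offset={near_offset:.4f}s delay={near_delay:.4f}s")
```

The symmetric case recovers the true offset no matter how long the path is. The lopsided case is off by half the difference between the two one-way delays, and since that difference can never exceed the round-trip time, a short round trip puts a hard cap on the error.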

A less-common worry, though one worth looking into IMHO, is diversity of time sources. GPS is very commonly used on stratum 1s, because it’s cheap and very accurate. However, if it were to ever go down, it’s so commonly used that some worry that NTP would become very degraded in accuracy. (Of course, if GPS were to go down, we’d probably have bigger problems than our clocks losing sync.)

So I’ve just redone our timeserver setup after a couple servers we used to use ended up being pretty crappy. (One was my old server, which is no longer online, and another inexplicably dropped to stratum 2.) We now sync to six different stratum 1 servers. All are geographically close, giving us about 30ms latency worst-case. A couple are set by CDMA (via cell towers, which get their time from GPS and tend to ‘filter’ it through a rubidium reference), the PPS one syncs to GPS as I understand it, and two get time from ACTS, by dialing into NIST and syncing time over the phone lines, which apparently gives superb accuracy. (As you get much more controlled latency.) Oh, and the last server… It’s synced to the Naval Observatory’s atomic “master clocks.” (Somewhat to my annoyance, NTP seems to love the UDel server to the exclusion of the USNO clock.)

I’ll monitor things for a few days to see how things go, but I expect very good results. (Then again, we’ve had very good results in the past, too.) I worry that we might hop around between sources a bit, because we no longer have one server that stands head-and-shoulders above the others. Four of the six seem to be extremely local (in terms of latency) and extremely accurate (in terms of their agreement with each other). So far, though, I haven’t seen our root dispersion (roughly, the difference between the smallest and largest offsets, summed across all the servers in our path) go above 20ms. I am seeing a ~5ms spread in offsets among multiple “good” hosts, but then again, a 5ms offset is very good…
