Emulating spamd for HTTP

I won’t lie–I love OpenBSD’s spamd. In a nutshell, it’s a ‘fake’ mailserver. You set your firewall up so that obvious spammers get connected to it instead of your real mailserver. It talks to them extremely slowly (1B/sec), which keeps them tied up for quite some time. (As an added bonus, it throws them an error at the end.)

One thing that really gets under my skin is bots (and malicious users) probing for URLs on the server that don’t exist. I get a lot of hits for /forum, /phpbb, /forums, /awstats… What they’re doing is probing for possible (very) outdated scripts that have holes allowing remote code execution.

It finally hit me: it’s really not that hard to build the same thing for HTTP. thttpd already supports throttling. (Note that its throttling had a more sane use in mind: limiting overall bandwidth to a specific URL, not messing with spammers and people pulling exploits, so it’s not exactly what we want, but it’ll do.)

Then you need a large file. I downloaded a lengthy novel from Project Gutenberg. It’s about 700 kB as uncompressed text. I could get much bigger files, yes. But 700 kB is plenty. More on this later.

It’s also helpful to use Apache and mod_rewrite on your ‘real’ server. You can work around it if you have to.

Set up your /etc/thttpd/throttle.conf:

**    16

Note that, for normal uses, this is terrible. This rule effectively says, “Limit the total server (**) to 16 (bytes per second).” By comparison, a 56K dialup line is about 7,000 bytes per second (or 56,000 bits per second).

Rudimentary tests show that having one client downloading a 700 kB file at 16B/sec places pretty much no load on the server (load average remained 0.00, and thttpd doesn’t even show up in the section of top that I can see), so I’m not concerned about overhead.
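To put the 16 B/sec figure in perspective, here is a quick back-of-the-envelope sketch in Python. The even split between simultaneous clients is my assumption for illustration; I haven't checked how thttpd actually divides a throttle among connections.

```python
FILE_SIZE_BYTES = 700_000  # roughly the 700 kB novel
THROTTLE_BPS = 16          # bytes per second, from throttle.conf


def download_seconds(size_bytes, rate_bps, clients=1):
    """Seconds for one client to finish, assuming `clients` connections
    share the total rate evenly (an assumption, not measured behaviour)."""
    return size_bytes * clients / rate_bps


for n in (1, 4):
    hours = download_seconds(FILE_SIZE_BYTES, THROTTLE_BPS, n) / 3600
    print(f"{n} client(s): about {hours:.1f} hours each")
```

A single client is tied up for 43,750 seconds, a bit over twelve hours, which is why 700 kB is plenty.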

You can also set up your thttpd.conf as needed. No specific requirements there. Start it up with something like thttpd -C /etc/thttpd/thttpd.conf -d /var/www/maintenance/htdocs/slow -t /etc/thttpd/throttle.conf (obviously, substituting your own directories and file names! Note that the /slow is just the directory I have it serving out of, not any specific naming convention.)

Now what we need to do is start getting some of our mischievous URL-probers into this. I use some mod_rewrite rules on my ‘real’ Apache server:

# Weed out some more evil-doers
RewriteRule ^forum(.*)$ http://ttwagner.com:8080/20417.txt [NC,L]
RewriteRule ^phpbb(.*)$ http://ttwagner.com:8080/20417.txt [NC,L]
RewriteRule ^badbots(.*)$ http://ttwagner.com:8080/20417.txt [NC,L]
RewriteRule ^awstats(.*)$ http://ttwagner.com:8080/20417.txt [NC,L]

In a nutshell, I redirect any requests starting with “forum,” “phpbb,” “badbots,” or “awstats” to an enormous text file. I’m not sure if escaping the colon is strictly necessary, but it has the added benefit of ‘breaking’ the link when pasted, say, here: I don’t want anyone getting caught up in this unless they’re triggering it. I end each with (.*), essentially matching everything. You may or may not see this as desirable. I like it, since /forum and /forums are both requested, and so forth. You could take that out if necessary. The [NC,L] flags matter, too: NC makes the match case-insensitive, and L stops processing further rules once one matches.
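If you want to see which paths those patterns would catch before touching Apache, the same regular expressions can be tried out in Python. This is only a sketch: it assumes the rules live in per-directory (.htaccess) context, where mod_rewrite strips the leading slash before matching, and the function name is made up for illustration.

```python
import re

# Python stand-ins for the four RewriteRule patterns.
# re.IGNORECASE plays the role of the [NC] flag.
PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (r"^forum(.*)$", r"^phpbb(.*)$", r"^badbots(.*)$", r"^awstats(.*)$")
]


def is_trapped(path):
    """True if a request path (leading slash already stripped) matches any rule."""
    return any(p.match(path) for p in PATTERNS)


for path in ("forum", "forums/index.php", "PHPBB2/", "awstats.pl", "blog/forum"):
    print(f"{path!r}: {'trapped' if is_trapped(path) else 'served normally'}")
```

Note that the patterns are anchored with ^, so only paths that begin with one of the words are caught; something like blog/forum is left alone.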

I want to watch and see whether anyone gets caught up in this. Since it’s technically passing the request to a different webserver (thttpd), it has to tell the client to connect to that, as opposed to seamlessly serving it up. I don’t know if the bots are smart (dumb?) enough to follow these redirects or not.

Note that /badbots doesn’t really exist. I inserted it into my robots.txt file, having heard that some ‘bad ‘bots (looking for spam, etc.) crawl any directory you tell them not to. I wondered if this was accurate.

The ending is quite anticlimactic: we wait not-so-patiently to see what ends up in the logfile.

Geolocation

The concept of matching an IP to a country is known as IP geolocation, often just “IPGeo” or “GeoIP.” There are lots of reasons for using IP geolocation, ranging from the mundane (identifying countries in your webserver logfiles) to the questionable (banning countries from your server to cut down on spam) to the neat (doing it at firewall/router level and redirecting a user to the closest data center).

Most of the work is just done on a country level. You take an IP (72.36.178.234, my server) and look it up in a database, and get “UNITED STATES” as an answer. There do exist databases on finer levels, down to the city, but they’re expensive and often wrong. (I keep getting ads to find hot singles in Mashpee, more than 100 miles away and in a different state… Or maybe it’s Mattapan. Whatever the case, they’re not even close.)

It turns out that you can download a free database of IP-country mappings. It’s not infallible, but they say it’s 98% accurate. On its own, the database won’t do you much good: it’s a compressed CSV (comma-separated values) file.

In the comments section here, there’s a snippet of PHP code to take the CSV and convert it to a huge series of SQL inserts, which you input into a database… (Hint: for whatever reason, his preg_match is imperfect and leaves a few instances of the word “error” in the middle of the file. It’s probably a bad idea, but I just commented out the “echo error” line.) I end up with a 5.7MB SQL query. You can also just download the thing directly here (warning: 5.7 MB SQL file). Note that, per the license terms, I disclose in the comments that it’s a derivative work of their CSV file.

The other important catch is that IPs are stored as long integers, not ‘normal’ dotted IPs. You’ll presumably want to use PHP + MySQL to get the country associated with an IP, so I’ll provide pseudocode in a minute. PHP provides an ip2long() function, but it only takes you halfway and leaves you with sign problems. (Argh!) It’s an easy fix, though, and you want something like the following:

// On 32-bit PHP, ip2long() can return a negative number; "%u" prints it as unsigned.
$long = sprintf("%u", ip2long($ip));
$query = "SELECT a2,a3,country FROM ip2c WHERE start <= $long AND end >= $long";

You then, of course, run $query and parse through it… You get 2- and 3-letter country codes, as well as the full country name. I use it, with good results, in seeing what country comment spam is coming from. (Most of it comes from the US.)
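For anyone who wants to sanity-check the conversion outside of PHP, here is a small Python sketch of the same idea. The function names are mine; the second one just mimics the signed result you would get from ip2long() on a 32-bit build.

```python
import ipaddress
import struct


def ip_to_long(ip):
    """Dotted-quad IPv4 string -> unsigned 32-bit integer,
    i.e. the value sprintf("%u", ip2long($ip)) is meant to produce."""
    return int(ipaddress.IPv4Address(ip))


def as_signed_32(value):
    """Reinterpret an unsigned 32-bit value as signed, to show the 'sign problem'."""
    return struct.unpack("!i", struct.pack("!I", value))[0]


print(ip_to_long("72.36.178.234"))            # the server IP mentioned above
print(as_signed_32(ip_to_long("200.0.0.1")))  # goes negative when read as signed
```

Any address whose first octet is 128 or higher has its top bit set, which is exactly where the negative numbers come from.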

A MySQL query isn’t the proper way to do this: there exist binary files with the same data that result in faster lookups. But this is the simplest way to start doing IP geolocation in ten minutes time, and, with the query cache enabled, there’s not a ton of overhead.
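As a rough idea of why a sorted in-memory table can beat a SQL query, here is a binary-search lookup sketched in Python. The three ranges are toy data built from the reserved documentation networks with placeholder codes, not real allocations, and this isn't meant to describe how any particular binary database format works.

```python
import bisect
import ipaddress


def _n(ip):
    return int(ipaddress.IPv4Address(ip))


# Toy table, NOT real allocation data: (start, end, code), sorted by start.
RANGES = [
    (_n("192.0.2.0"), _n("192.0.2.255"), "AA"),
    (_n("198.51.100.0"), _n("198.51.100.255"), "BB"),
    (_n("203.0.113.0"), _n("203.0.113.255"), "CC"),
]
STARTS = [start for start, _, _ in RANGES]


def lookup(ip):
    """Return the code of the range containing `ip`, or None if there is none."""
    n = _n(ip)
    i = bisect.bisect_right(STARTS, n) - 1  # last range starting at or before n
    if i >= 0 and RANGES[i][0] <= n <= RANGES[i][1]:
        return RANGES[i][2]
    return None


print(lookup("198.51.100.7"))
print(lookup("10.0.0.1"))
```

Each lookup takes a number of comparisons that grows with the logarithm of the table size, so even a table with a hundred thousand ranges needs fewer than twenty.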

I’m tempted to write some scripts to allow people to ‘browse’ the database, either looking up an IP, or to view it by country.

Update: Weird Silence has a binary implementation of this same database that’s supposedly much faster. The main page is here, the PHP one is here, and the C one is (t)here. (I’m wondering if it makes sense to write a PHP script to call the C version, and what the performance implications would be?)

Update 2: Get your country flags here.

Bad Bone Weights

I decided the other day that I ought to try to start my own Team Fortress 2 server. (Actually, I tried long ago, and hoped the problem had been fixed since. It hasn’t.) I want to share the cause of the problem in the hopes of helping others, since Google usually picks these things up.

You spend forever downloading the Half-Life Dedicated Server (HLDS), and excitedly fire it up. It runs through some stuff and seems to be working, but then you get a whole bunch of bizarre errors scrolling by:

Bad data found in model "dispenser_toolbox.dmx" (bad bone weights)
Bad data found in model "dispenser_gib1.smd" (bad bone weights)
Bad data found in model "dispenser_gib2.smd" (bad bone weights)
Bad data found in model "dispenser_gib3.smd" (bad bone weights)
Bad data found in model "dispenser_gib4.smd" (bad bone weights)
Bad data found in model "dispenser_gib5.smd" (bad bone weights)

What’s been surmised is that it’s because your processor doesn’t support SSE2. Bah! There’s no fix, either, other than pleading with Steam to write a version that doesn’t require SSE2, or upgrading your CPU.
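If you’d rather check before spending forever on the download, here is a small Python sketch that looks for the sse2 flag. It assumes a Linux box, where /proc/cpuinfo has a “flags” line listing CPU features; the function names are mine.

```python
def cpu_has_flag(cpuinfo_text, flag):
    """True if `flag` appears on the first 'flags' line of /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            # Split into whole words so that "sse" does not match "sse2".
            return flag in value.split()
    return False


def host_has_sse2(path="/proc/cpuinfo"):
    """Check the local machine (Linux only, since it reads /proc/cpuinfo)."""
    with open(path) as f:
        return cpu_has_flag(f.read(), "sse2")
```

Calling host_has_sse2() on the would-be server tells you whether the CPU advertises SSE2 at all.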

It’s clearly time to build a new server and colocate it. 😉

Big Hosting

I tend to think of web hosting in terms of many sites to a server. And that’s how the majority of sites are hosted–there are multiple sites on this one server, and, if it were run by a hosting company and not owned by me, there’d probably be a couple hundred.

But the other end of the spectrum is a single site that takes up many servers. Most any big site is done this way. Google reportedly has tens of thousands. Any busy site has several, if nothing else to do load-balancing.

Lately I’ve become somewhat interested in the topic, and found some neat stuff about this realm of servers. A lot of things are done that I didn’t think were possible. While configuring my router, for example, I stumbled across stuff on CARP. I always thought of routers as a single point of failure: if your router goes down, everything behind it goes down. With CARP, you can run two (or more) routers in mission-critical setups, and a backup takes over the shared address if the master goes down.

One thing I wondered about was serving up something that had voluminous data. For example, suppose you have a terabyte of data on your website. One technique might be to put a terabyte of drives in every server and do load balancing from there. But putting a terabyte of drives in each machine is expensive, and, frankly, if you’re putting massive storage in one machine, it’s probably huge but slow drives. Another option would be some sort of ‘horizontal partitioning,’ where five (arbitrary) servers each house one-fifth of the data. This reduces the absurdity of trying to stuff a terabyte of storage into each of your servers, but it brings problems of its own. For one, you don’t have any redundancy: if the machine serving sites starting with A-G goes down, all of those sites go down. Plus, you have no idea of how ‘balanced’ it will be. Even if you tried some intricate means of honing which material went where, the optimal layout would be constantly changing.
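The balance problem with alphabetical partitioning is easy to see in a few lines of Python. This is just a sketch with made-up site names; the letter-range scheme is the simplest one I could think of, not anything a real host is known to use.

```python
from collections import Counter


def shard_for(name, shards=5):
    """Alphabetical partitioning: map a-z onto `shards` roughly equal letter ranges."""
    first = name[0].lower()
    if not "a" <= first <= "z":
        return 0  # digits and anything else go to the first shard
    return (ord(first) - ord("a")) * shards // 26


# Hypothetical site names, chosen to show how lopsided things can get.
sites = ["smith", "stone", "stack", "sample", "xylophone"]
load = Counter(shard_for(s) for s in sites)
print(load)
```

Four of the five sites land on the same machine, and two of the five shards sit idle, which is the “you have no idea how balanced it will be” problem in miniature.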

Your best bet, really, is to have a bunch of web machines, give them minimal storage (e.g., a 36GB SCSI drive–a 15,000 rpm one!), and have a backend fileserver that has the whole terabyte of data. Viewers would be assigned to any of the webservers (either in a round-robin fashion, or dynamically based on which server was the least busy), which would retrieve the requisite file from the fileserver and present it to the viewer. Of course, this places a huge load on the one fileserver. There’s an implicit assumption that you’re doing caching.
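The two ways of assigning viewers to webservers can be sketched in Python like this. The class names and the web1/web2/web3 hostnames are placeholders of mine, and a real load balancer would track far more than a connection count.

```python
import itertools


class RoundRobin:
    """Hand out servers in a fixed rotation, ignoring how busy they are."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)


class LeastBusy:
    """Hand out whichever server currently has the fewest active requests."""

    def __init__(self, servers):
        self.active = {s: 0 for s in servers}

    def pick(self):
        server = min(self.active, key=self.active.get)  # ties go to the earliest listed
        self.active[server] += 1
        return server

    def done(self, server):
        self.active[server] -= 1


rr = RoundRobin(["web1", "web2", "web3"])
print([rr.pick() for _ in range(4)])
```

Round-robin is stateless and simple; the least-busy version needs to be told when a request finishes, which is the price of adapting to uneven load.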

But how do you manage the caching? You’d need some complex code to first check your local cache, and then turn to the fileserver if needed. It’s not that hard to write, but it’s also a pain: rather than a straightforward, “Get the file, execute if it has CGI code, and then serve” process, you need the webserver to do some fancy footwork.
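For what it’s worth, the “check the local cache, then turn to the fileserver” logic looks something like this in Python. It’s a minimal sketch: the fetch function stands in for whatever actually talks to the fileserver, and the least-recently-used eviction is my choice, not a requirement.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Serve from a small local cache; fall back to the fileserver on a miss."""

    def __init__(self, fetch, capacity=128):
        self.fetch = fetch          # hypothetical callable: path -> file contents
        self.capacity = capacity
        self.local = OrderedDict()  # oldest entries first
        self.misses = 0

    def get(self, path):
        if path in self.local:
            self.local.move_to_end(path)    # mark as recently used
            return self.local[path]
        self.misses += 1
        data = self.fetch(path)             # the expensive trip to the fileserver
        self.local[path] = data
        if len(self.local) > self.capacity:
            self.local.popitem(last=False)  # evict the least recently used file
        return data
```

Popular files stay in the local cache, so only misses ever reach the backend.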

Enter Coda. No, not the awesome web-design GUI, but the distributed filesystem. In a nutshell, you have a server (or multiple servers!) and they each mount a partition called /coda, which refers to the network. But, it’ll cache files as needed. This is massively oversimplifying things: the actual use is to allow you to, say, bring your laptop into the office, work on files on the fileserver, and then, at the end of the day, seamlessly take it home with you to work from home, without having to worry about where the files physically reside. So running it just for the caching is practically a walk in the park: you don’t have complicated revision conflicts or anything of the sort. Another awesome feature about Coda is that, by design, it’s pretty resilient: part of the goal with caching and all was to pretty gracefully handle the fileserver going offline. So really, the more popular files would be cached by each node, with only cache misses hitting the fileserver. I also read an awesome anecdote about people running multiple Coda servers. When a disk fails, they just throw in a blank. You don’t need RAID, because the data’s redundant across other servers. With the new disk, you simply have it rebuild the missing files from other servers.

There’s also Lustre, which was apparently inspired by Coda. They focus on insane scalability, and it’s apparently used in some of the world’s biggest supercomputer clusters. I don’t yet know enough about it, really, but one thing that strikes me as awesome is the concept of “striping” across multiple nodes with the files you want.

The Linux HA project is interesting, too. There’s a lot of stuff that you don’t think about. One is load balancer redundancy… Of course you’d want to do it, but if you switched over to your backup router, all existing connections would be dropped. So they keep a UDP data stream going, where the master keeps the spare(s) in the loop on connection states. Suddenly having a new router or load balancer can also be confusing on the network. So if the master goes down, the spare will come up and just start spoofing its MAC and IP to match the node that went down. There’s a tool called heartbeat, whereby standby servers ping the master to see if it’s up. It’s apparently actually got some complex workings, and they recommend a serial link between the nodes so you’re not dependent on the network. (Granted, if the network to the routers goes down, it really doesn’t matter, but having them quarreling over who’s master will only complicate attempts to bring things back up!)
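The basic heartbeat idea, where a standby takes over once the master has been quiet for too long, can be sketched in Python as follows. This is my own toy illustration of the concept and not how the Linux HA heartbeat tool is implemented; the timeout value and names are arbitrary.

```python
class StandbyMonitor:
    """A standby node that promotes itself if the master stops sending heartbeats."""

    def __init__(self, timeout=3.0):
        self.timeout = timeout   # seconds of silence tolerated (arbitrary choice)
        self.last_seen = None
        self.role = "standby"

    def heartbeat(self, now):
        """Record that the master was heard from at time `now`."""
        self.last_seen = now

    def tick(self, now):
        """Called periodically; take over once the master has been silent too long."""
        silent_too_long = (
            self.last_seen is not None and now - self.last_seen > self.timeout
        )
        if self.role == "standby" and silent_too_long:
            self.role = "master"  # this is where the MAC/IP takeover would happen
        return self.role
```

Timestamps are passed in explicitly so the logic can be exercised without waiting on a real clock. The sketch also shows why a dedicated serial link is recommended: if heartbeats are lost because of the network rather than a dead master, both nodes end up believing they’re in charge.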

And there are lots of intricacies I hadn’t considered. It’s sometimes complicated to tell whether a node is down or not. But it turns out that a node in an ambiguous state is often a horrible state of affairs: if it’s down and not pulled out of the pool, lots of people will get errors. And if other nodes are detecting oddities but it’s not down, something is awry with the server. There’s a concept called fencing that I’d never heard of, whereby the ‘quirky’ server is essentially shut out by its peers to prevent it from screwing things up (not only may it run away with shared resources, but the last thing you want is a service acting strangely to try to modify your files). The ultimate example of this is STONITH, which sounds like a fancy technical term (and, by definition, now is a technical term, I suppose), but really stands for “Shoot the Other Node in the Head.” From what I gather from the (odd) description, the basic premise is that if members of a cluster suspect that one of their peers is down, they “make it so” by calling external triggers to pull the node out of the network (often, seemingly, to just reboot the server).

I don’t think anyone is going to set up high-performance server clusters based on what someone borderline-delirious blogged at 1:40 in the morning because he couldn’t sleep, but I thought someone else might find this venture into what was, for me, new territory, to be interesting.

Custom LogFormat with Apache

Posting this in the hopes that it’ll help someone at some point….

Using Apache (Apache2 in my case, but I’m not sure it matters), you can customize the format for log files like access_log. Apache has a good page describing the variables you can use. But it doesn’t tell you everything you need to know!

The first question is where you put it… You can just specify it in httpd.conf. (I put it near the end, but I don’t think its placement matters terribly, as long as it’s not in the middle of a section.) It doesn’t need to go inside any other directive. You can also insert it inside a VirtualHost directive if you only want it to apply to that host. (Don’t put it inside a Directory directive!)

The second thing is something that’s not really specified anywhere: specifying a LogFormat without then specifying a CustomLog directive accomplishes nothing! I wanted to keep Apache logging in the default directory (/var/log/apache2/access_log on Gentoo), so I just set the LogFormat to something I wanted. And nothing happened.

You specify the format in CustomLog as well, so it’s handy to use LogFormat to assign a “nickname”:

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"" n1zyy
CustomLog /var/log/apache2/access_log n1zyy

The first line sets the “n1zyy” ‘nickname’ to refer to the format I specify. The next line sets a “custom” log file (in this case, it’s the same as the default, but I digress. It won’t work if I don’t specify it.) Then I tell it to use the format named “n1zyy.”
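Once the log is being written in this format, pulling the fields back out is straightforward. Here is a Python sketch of a parser for lines in that layout; the sample line is invented for illustration, and the regular expression will trip over requests that contain escaped quotes.

```python
import re

# One named group per field of the LogFormat above:
# %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"
LOG_LINE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse(line):
    """Return a dict of fields, or None if the line doesn't match the format."""
    m = LOG_LINE.match(line)
    return m.groupdict() if m else None


# Made-up sample line, not taken from a real log.
sample = ('192.0.2.10 - - [10/Oct/2007:13:55:36 -0400] '
          '"GET /forum HTTP/1.1" 302 - "-" "ExampleBot/1.0"')
print(parse(sample))
```

Note that %b logs a dash rather than 0 when no body bytes are sent, so the byte count needs care before converting it to an integer.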

Once this is set up, you want to reload Apache, since it won’t notice your changes until you do.

Geek

We’ve been having a lot of intermittent network problems at home. Periodically, our Internet cuts out. At first I assumed it was our ISP–it’s no longer Adelphia (run by pharmacists), though–but subsequent research indicated that it wasn’t our ISP’s fault: our router was going down.

My dad set it all up, so I wasn’t too sure how things went. I was pretty confident that we were just using a generic store-bought broadband router, though, so I found it strange that it would be drifting in and out. It turns out that I overlooked something about the router: it’s being held together with duct tape.

I’d already been intrigued by OpenBSD’s pf, so this seemed like a sign! I commissioned an old desktop system, loaded OpenBSD up on it, and went to work configuring it. OpenBSD was just more different from Linux than I expected. It asks you if you want to let OpenBSD use the whole hard drive. I said yes, and thought, “Wow, this is just as easy as Ubuntu!” But it turns out that this was just the first stage. After this, you have to set “disk labels,” which are sort of like partitions but ambiguously different. The syntax is obscure, the purpose is obscure, and so forth. Then I had to configure the network. NICs are named by the drivers they use, so instead of eth0 and eth1 (for Ethernet), I have rl0 (Realtek) and dc0 (who knows).

I was also extremely confused trying to set up routing. Long-term, it was going to be the router, but short-term, it needed to know about our existing router so that it could connect and download the requisite packages.

So I finally got it all set up. I also installed MySQL (unnecessarily, it turns out), Apache, and PFW, a web-based configuration tool for pf. I ended up not using PFW, because my understanding of pf is so bad that I’m basically relegated to copying-and-pasting rules from websites into the configuration file.

Even using pf is confusing. It’s called pf, but typing “pf” at the command line doesn’t do anything. It turns out that you control it with a tool called “pfctl.” You can do pfctl -e to enable pf, and pfctl -d to disable it.

As I tried to tweak the firewall/routing rules, I’d periodically “restart” pf by disabling and then re-enabling it. I wasn’t sure if it read the rules “live” or if a restart was needed. It turns out… neither! The rules are stored in memory, but restarting pf doesn’t flush the rules. You need to pass pfctl some more arguments to tell it to flush the old rules and read them anew from its configuration file (pfctl -f /etc/pf.conf reloads the ruleset from the file).

After a few more hours of work, I thought it was all set up. Both NICs were configured, the external one to get an IP over DHCP, and the internal one with a low fixed IP. I had a complex set of rules, doing NAT, filtering traffic, and using HFSC for prioritized queueing. (HFSC seems completely undocumented, by the way. I took my tips from random websites.) It seemed very impressive: I prioritized ACKs so that downloads wouldn’t suffer if our outbound link was saturated. (Aside: it really doesn’t make sense to do queueing on incoming traffic, since the bottleneck is our Internet link, not our 100 Mbps LAN.)  I also afforded DNS, ssh, and video game traffic high priorities, but allocated them a lower percentage of traffic. I even figured out the default BitTorrent ports and gave them exceptionally low priority: if our line is fully saturated, the last thing I care about is sharing unnecessary data with other people.

And there are other neat features. It “scrubs” incoming connections, reassembling fragmented packets and just eliminating crap that doesn’t make sense. It catches egregious “spoofing” attempts and discards them.

I hooked up the second LAN connection to test it out, rebooted, and… waited.

It never came up. Well, it did come up. The computer’s running fine. Both network cards show up with the switch. Doing an nmap probe of our LAN, I see one strange entry. It’s actually pretty mysterious: it has no open ports, and attempting to ssh into it just sits there: it doesn’t send a connection refused, but completely ignores the incoming packets, leaving my poor ssh client sitting there waiting for a reply, having no clue what’s going on.

In a nutshell, it seems that I just built a firewall/router that’s so secure that I can only find one of its two cards on the network, and I can’t even try to log into it. Let’s see you hack that! Of course, this does have some issues. For example, I can’t use it.

I haven’t lost hope yet: I have a keyboard and monitor so I can log in on the console and try to do some tweaking there. (You can’t firewall off the keyboard.) It’s just not very encouraging to think, “Alright, let’s reboot and make sure it works as flawlessly as I think it will” and then have the darned thing not even show up on the network.

Advice

I learned two valuable lessons today:

  • Don’t ever create a 500GB FAT partition. No matter how good of an idea it seems, don’t do it. (Not terribly different is the advice, “Don’t ever create one big 500GB partition.”)
  • Mounting a filesystem as “msdos” is not the same as mounting it as “vfat” in Linux. msdos is still constrained by the 8.3 naming system. vfat is not. Unless the disk was literally written with MS-DOS, don’t use msdos. It’ll work okay, but boy are you screwing yourself if you make backups with it mounted as msdos. (Fortunately, I realized this before wiping the drive.)
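To get a feel for how many of your files would be mangled under the 8.3 scheme, here is a rough Python check. It is deliberately simplified and only looks at lengths, dots, and spaces; the real rules also restrict which characters are allowed and ignore case.

```python
def fits_8_3(filename):
    """Rough test: at most 8 characters of name, at most 3 of extension,
    a single dot at most, and no spaces. (Simplified; not the full rule set.)"""
    base, dot, ext = filename.rpartition(".")
    if not dot:  # no extension at all
        base, ext = filename, ""
    return (
        1 <= len(base) <= 8
        and len(ext) <= 3
        and "." not in base
        and " " not in filename
    )


for name in ("README.TXT", "backup.tar.gz", "vacation photos.jpg", "notes"):
    print(f"{name!r}: {'fits' if fits_8_3(name) else 'would be mangled'}")
```

Anything that fails the check is a file whose name would not survive a backup made through an msdos mount.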

Coding Malpractice

I just wrote the following line of code. And it’s no mistake: it functions perfectly and does exactly what I wanted it to do:

$count += 0;

This is surely poor programming practice, essentially implicitly recasting a variable as an integer. But it’s simple and it works flawlessly. (The context: I run an SQL query saying SELECT SUM(votes)..., which makes the tabulation of all the entries MySQL’s problem, not mine. The one ‘flaw’ is that the sum of no votes isn’t 0, but NULL. This becomes a very important distinction when you’re trying to display a number: “0 votes” isn’t the same as “ votes.”)

Since we all know that, in PHP, NULL + 0 = 0 (and, of course, integer + 0 = integer), adding 0 works flawlessly. Could I just convert it to an integer? Probably. But I haven’t done that stuff in a while, and I was far too lazy to pull up the documentation. And incrementing a variable by 0 is way more fun.
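The less fun alternative is to make the zero the database’s problem as well, using COALESCE. Here is a sketch using Python’s built-in SQLite as a stand-in for MySQL (both return NULL for the SUM of no rows); the entries table and its columns are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table, standing in for wherever the votes actually live.
conn.execute("CREATE TABLE entries (id INTEGER, votes INTEGER)")

# SUM over zero rows is NULL, which arrives in Python as None.
raw = conn.execute("SELECT SUM(votes) FROM entries").fetchone()[0]
print(repr(raw))


def total_votes(connection):
    """Sum of all votes, with the empty case turned into 0 by the database."""
    row = connection.execute("SELECT COALESCE(SUM(votes), 0) FROM entries").fetchone()
    return row[0]


print(total_votes(conn))
```

Unlike PHP, Python refuses to add 0 to None, so the += 0 trick doesn’t carry over and the COALESCE version is the one that travels.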

Spam

My e-mail setup right now for n1zyy.com and ttwagner.com consists of just forwarding all e-mail to GMail. It works fine, and the spam filters there have been pretty much 100% effective. However, it bothers me that I’m forwarding dozens, if not hundreds, of e-mails just to have them ignored. Some basic spam filtering should really take place on my server.

I made a few basic configuration changes to Postfix, the MTA I run. In a nutshell, I tell it to require stricter compliance with e-mail RFCs: e-mails with HELO addresses that don’t exist (or just don’t make sense), and clients sending multiple commands before the server replies to acknowledge them, for example, now result in mail delivery failing. The default configuration errs very much on the side of ‘safety’ in accepting mail, but the trick is to tighten it down enough that you’ll reject mail that’s egregious spam, but not reject anything that could be from a legitimate mailserver. And that’s where I’m at.

I also installed SpamAssassin. I’m currently using it in conjunction with procmail, and therefore wasn’t quite sure if it works. I set it up to make some changes to the headers, so that I can verify whether it’s working. But I ran into a problem I never thought I’d have: I’m not getting enough spam. I’m sitting here eagerly awaiting some to see what happens. And it’s just not coming.