Engineer's Hell
Sunday, August 22, 2010
For the uninitiated, Engineer's Hell is that lovely place where everything is right, but nothing works. Imagine, for instance, a mechanic's frustration with a car that won't start, even though every test he tries (save actually starting it) says it should. Or, more simply, your own frustration with a door that won't close fully, even though there's nothing stopping it.
Naturally, this being the real world, there's always something wrong. Until you find it, however, it's a very frustrating experience. Worse if you're in a hurry. And Murphy's Law says you always will be.
My particular circle of Engineer's Hell bore the inscription "SSL". SSL is a very nice protocol that handles securing HTTPS connections (among others). In other words, it's the main reason you can feel fairly safe using your online banking or buying something from a reputable online merchant.
One of the major aspects of SSL that makes HTTPS possible is called a certificate chain. When you connect to a server, it sends you a document, called a certificate, which proves that it is the server you asked to talk to. A certificate is essentially a digital message that says, "This guy is who you think he is. -- (signed) Someone You Trust". But how do you trust that the signer is who he says he is? Because he hands you another certificate, signed by someone else. This chain of certificates ends when it reaches a signer who (in theory) should be trusted; that signer's certificate is called a root certificate.
Why do you trust a root certificate? Because someone handed it to you and said "You can trust this." No, really. Root certificates come bundled with operating systems and web browsers, and there's usually not any (easy) way to verify them. The theory is that if you can't trust the source you got your OS/browser from, you're already screwed, because they can cheat anyway.
By the time I post this, Armageddon City 0.1 (alpha) will be released. When you start it, you'll notice a button labeled "Update" on the main menu. When you click that button, AC will connect out to LDG's download server via HTTPS and request a summary file that tells it where to find any updated versions of AC and gives a sha256 hash of them, a small bit of text that allows it to verify that the update it downloads hasn't been corrupted in download or maliciously altered.
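For a rough idea of what that hash check involves, here is a minimal sketch in Python 3. The function names are made up for illustration and are not taken from AC's source; the only real pieces are the standard-library hashlib and hmac modules.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a lowercase hex string."""
    return hashlib.sha256(data).hexdigest()


def update_is_intact(data: bytes, expected_hex: str) -> bool:
    """Compare the downloaded bytes against the hash from the summary file.

    expected_hex is assumed to be the hex digest published in the summary;
    stray whitespace and uppercase letters are tolerated.
    """
    actual = sha256_hex(data)
    # compare_digest does a constant-time comparison of the two strings.
    return hmac.compare_digest(actual, expected_hex.strip().lower())
```

The check is only as trustworthy as the summary file itself, which is why the summary has to arrive over a connection whose certificate chain has actually been verified.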
That small step is the lynchpin of securely updating AC, and when I switched from the development server (which sits on my desk) to the public download server (hosted by Amazon CloudFront), it broke.
That was the moment that I entered Engineer's Hell. I knew that step worked; it updated from the development server without a hitch. But as soon as I directed it to the production server, it broke.
Under the hood, the summary file is downloaded by Python's urllib2, a nice little module which has one glaring issue: It doesn't check the certificate chain of HTTPS connections. Which makes HTTPS... not quite worthless, but deeply flawed. The connection is still secure, but you can't be sure that the other side is who they say they are, so it could be a hacker pretending to be the server.
This is called a man-in-the-middle attack, because the hacker can relay your messages to the server and its replies to you, eavesdropping on the connection even though both sides think it's secure, and free to make any changes he likes. For instance, instead of that nice, clean AC update you wanted, he can send you a virus, which you will run, because you think it came from LDG.
Needless to say, that was not acceptable. Using an insecure connection for the initial download is slightly risky, but fairly normal. Using one for automatic updates is just asking for trouble.
Fortunately, it's possible to convince urllib2 to check the certificate chain. It's not what I would call easy, but it's doable. In the end, however, it's still up to you to provide the root certificate for urllib2 to check against.
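For readers on a current Python, the same idea is much less painful than it was with urllib2. The sketch below is not the code AC uses; it assumes Python 3, where urllib.request replaces urllib2, and the function names, URL parameter, and certificate file path are placeholders you would supply yourself.

```python
import ssl
import urllib.request


def make_verifying_context(ca_file=None):
    """Build an SSL context that insists on a valid certificate chain.

    ca_file is an assumed path to a PEM file of root certificates.
    When it is None, the system's default trust store is used instead.
    """
    context = ssl.create_default_context(cafile=ca_file)
    # create_default_context already turns these on; they are repeated
    # here to make the intent obvious.
    context.check_hostname = True
    context.verify_mode = ssl.CERT_REQUIRED
    return context


def fetch_summary(url, ca_file=None, timeout=10):
    """Download url over HTTPS, failing if the chain does not validate."""
    context = make_verifying_context(ca_file)
    with urllib.request.urlopen(url, context=context, timeout=timeout) as response:
        return response.read()
```

When a certificate file is passed in, only the roots in that file are trusted, so the file has to contain whichever root the verifier actually ends up looking for.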
Converting from development to production update server took two changes. First, I had to change the summary URL. Second, I had to change the root certificate to match the production server. For the second step, I called up Firefox's page info, moved over to the security information, viewed the certificate details, and exported the root certificate. Easy, and the exact same process I had used for the development server.
Having made the necessary changes, I tested it, and the certificate didn't validate. So I made sure I had the right certificate and tried again. It didn't work. I tried using the next certificate down in the chain. I tried feeding it the entire chain. I tried the development server again, and that still worked. But the production server didn't, even though it should work exactly the same.
I repeated the process with the development certificate, and compared against the saved copy. Aha! I had exported the saved copy under Linux, so it had a slightly different file format (Unix line endings, not DOS). Okay, convert the certificate to the correct format and... it doesn't work. Well, crud.
I tried everything I could think of. Nothing worked. Every once in a while, I'd spot something that could be the problem, have a brief moment of hope, and get frustrated again as that fix didn't work. I wandered away and played cards for a bit. I came back and set to work figuring out how to recompile Python's SSL library so that I could watch the process and figure out how it was failing. Finally finished that, only to discover that the problem was a level deeper, in OpenSSL itself. So I set to work figuring out how to recompile that.
As part of that process, I went to find a way to extract the certificate with OpenSSL, so that I could check the chain bit-by-bit. I found a little command-line utility called (fittingly enough) "openssl". The openssl utility is basically a collection of little helper programs that you can run. Among those I found the boring-sounding "s_client", which is billed as "a generic SSL/TLS client".
I figured I'd try it, to see if it printed out the certificate. As it happens, it dumped out something even more useful: the certificate chain. The entire certificate chain.
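If you want to try the same thing from a script, something like the following should do it. This is a sketch, not the exact invocation used here: it assumes the openssl binary is on your PATH, the host name is whatever server you are curious about, and the helper names are invented. The -connect and -showcerts options are real s_client options; -showcerts asks for every certificate the server sends rather than just the first.

```python
import subprocess


def s_client_command(host, port=443):
    """Build the argument list for an `openssl s_client` chain dump."""
    return ["openssl", "s_client", "-connect", f"{host}:{port}", "-showcerts"]


def dump_chain(host, port=443, timeout=15):
    """Run s_client and return everything it prints to standard output.

    An empty string is fed to stdin so that s_client exits once the
    handshake is done instead of waiting for input.
    """
    result = subprocess.run(
        s_client_command(host, port),
        input="",
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout
```

Near the top of the output there is a "Certificate chain" section listing the subject and issuer of each certificate, which is where an unexpected extra issuer would show up.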
Oh, I saw all the pieces I was expecting. The server's certificate itself. DigiCert's signing certificate. DigiCert's root certificate, the one I was using. And underneath that, Entrust.net's root certificate.
Say what? When did Entrust.net get involved in this certificate chain?
It turns out that DigiCert is a relatively new certificate-signing company, established in 2003. Since there were many browsers and OSes out there that did not have their root certificate installed, some of which might never incorporate it, they did something slightly strange. They had Entrust.net (established 1994) sign their root certificate. Anyone who recognized the DigiCert root could ignore that signature, and anyone who didn't could fall back to the Entrust.net root.
Firefox happily trusted the DigiCert root certificate, and looked no farther. OpenSSL, and by extension urllib2, saw that the certificate was signed and continued looking for the true root. Which is more correct, I have no idea, but that disagreement was what had banished me to Engineer's Hell.
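The disagreement is easier to see in a toy model. The sketch below is emphatically not how Firefox or OpenSSL are implemented; it reduces each certificate to a name plus the name of its signer, and the two functions are hypothetical stand-ins for the two behaviours described above.

```python
# Toy chain: subject -> issuer. The labels are illustrative only.
ISSUERS = {
    "server": "DigiCert signing CA",
    "DigiCert signing CA": "DigiCert root",
    "DigiCert root": "Entrust.net root",     # the extra cross-signature
    "Entrust.net root": "Entrust.net root",  # self-signed, a true root
}


def stop_at_first_trusted(subject, issuers, trusted):
    """Accept as soon as any certificate in the chain is trusted."""
    seen = set()
    while subject is not None and subject not in seen:
        if subject in trusted:
            return True
        seen.add(subject)
        subject = issuers.get(subject)
    return False


def follow_to_self_signed(subject, issuers, trusted):
    """Walk to the self-signed end of the chain and require trust there."""
    seen = set()
    while subject is not None and subject not in seen:
        seen.add(subject)
        issuer = issuers.get(subject)
        if issuer == subject:
            # Reached a self-signed certificate: this is the root.
            return subject in trusted
        subject = issuer
    return False
```

With only "DigiCert root" in the trusted set, the first function accepts the chain and the second rejects it. Add "Entrust.net root" to the set and both accept.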
From there, the solution was easy. I added Entrust.net's root certificate to the certificate file, and it worked.
So if you peek into the data files that armageddon_city uses, you'll find a little text file named root_certs.pem. And if you look in there, you'll find two big blobs of incomprehensible text. One is DigiCert's root, the other is Entrust.net's. Together, they got me into and then back out of Engineer's Hell.
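If you are curious how many blobs a file like that holds, a few lines of Python will count them. This is a small sketch that assumes the usual PEM layout, where each certificate sits between BEGIN CERTIFICATE and END CERTIFICATE marker lines; the function name is made up.

```python
import re

# Matches one PEM certificate block, markers included.
PEM_CERT = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----",
    re.DOTALL,
)


def split_pem_bundle(text):
    """Return each certificate block found in a PEM bundle, in order."""
    # Normalise DOS line endings so the blocks look the same either way.
    return PEM_CERT.findall(text.replace("\r\n", "\n"))
```

Reading the file's text and passing it to split_pem_bundle should give a list whose length is the number of certificates in the bundle.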
Labels: armageddon_city, frustration, programming, security