Engineer's Hell

Sunday, August 22, 2010

I have spent the last five hours in Engineer's Hell.

For the uninitiated, Engineer's Hell is that lovely place where everything is right, but nothing works. Imagine, for instance, a mechanic's frustration with a car that won't start, even though every test he tries (save actually starting it) says it should. Or, more simply, your own frustration with a door that won't close fully, even though there's nothing stopping it.

Naturally, this being the real world, there's always something wrong. Until you find it, however, it's a very frustrating experience. Worse if you're in a hurry. And Murphy's Law says you always will be.

My particular circle of Engineer's Hell bore the inscription "SSL". SSL is a very nice protocol that handles securing HTTPS connections (among others). In other words, it's the main reason you can feel fairly safe using your online banking or buying something from a reputable online merchant.

One of the major aspects of SSL that makes HTTPS possible is called a certificate chain. When you connect to a server, it sends you a document, called a certificate, which proves that it is the server you asked to talk to. A certificate is essentially a digital message that says, "This guy is who you think he is. -- (signed) Someone You Trust". But how do you trust that the signer is who he says he is? Because he hands you another certificate, signed by someone else. This chain of certificates ends when it reaches someone who (in theory) should be trusted, whose certificate is called a root certificate.

Why do you trust a root certificate? Because someone handed it to you and said "You can trust this." No, really. Root certificates come bundled with operating systems and web browsers, and there's usually not any (easy) way to verify them. The theory is that if you can't trust the source you got your OS/browser from, you're already screwed, because they can cheat anyway.

By the time I post this, Armageddon City 0.1 (alpha) will be released. When you start it, you'll notice a button labeled "Update" on the main menu. When you click that button, AC will connect out to LDG's download server via HTTPS and request a summary file. That file tells AC where to find any updated versions and gives a SHA-256 hash of each one, a small bit of text that lets AC verify that the update it downloads hasn't been corrupted in transit or maliciously altered.
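The hash check can be sketched in a few lines of Python. This is only an illustration, not AC's actual update code: the function names and the sample bytes are made up, and the only real APIs used are the standard library's hashlib and hmac modules.

```python
import hashlib
import hmac


def sha256_hex(data):
    """Return the SHA-256 digest of data as a lowercase hex string."""
    return hashlib.sha256(data).hexdigest()


def update_is_intact(data, expected_hex):
    """True if the downloaded bytes match the hash listed in the summary file."""
    # Normalize the published hash, then compare without leaking where they differ.
    return hmac.compare_digest(sha256_hex(data), expected_hex.strip().lower())


# Hypothetical example: pretend these bytes are the downloaded update.
payload = b"pretend this is an update"
published = sha256_hex(payload)  # what the summary file would list

print(update_is_intact(payload, published))         # True
print(update_is_intact(payload + b"!", published))  # False
```

The check is only as trustworthy as the summary file itself, which is why the summary has to arrive over a connection you can actually verify.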

That small step is the lynchpin of securely updating AC, and when I switched from the development server (which sits on my desk) to the public download server (hosted by Amazon CloudFront), it broke.

That was the moment that I entered Engineer's Hell. I knew that step worked; it updated from the development server without a hitch. But as soon as I directed it to the production server, it broke.

Under the hood, the summary file is downloaded by Python's urllib2, a nice little module which has one glaring issue: It doesn't check the certificate chain of HTTPS connections. Which makes HTTPS... not quite worthless, but deeply flawed. The connection is still encrypted, but you can't be sure that the other side is who they say they are, so it could be a hacker pretending to be the server.

This is called a man-in-the-middle attack, because the hacker can relay your messages to the server and its replies to you, eavesdropping on the connection even though both sides think it's secure, and free to make any changes he likes. For instance, instead of that nice, clean AC update you wanted, he can send you a virus, which you will run, because you think it came from LDG.

Needless to say, that was not acceptable. Using an insecure connection for the initial download is slightly risky, but fairly normal. Using one for automatic updates is just asking for trouble.

Fortunately, it's possible to convince urllib2 to check the certificate chain. It's not what I would call easy, but it's doable. In the end, however, it's still up to you to provide the root certificate for urllib2 to check against.
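For readers on a current Python, here is a rough sketch of the same idea. It is not the urllib2 workaround used in AC: it swaps in Python 3's urllib.request and ssl modules, and the function names are placeholders of my own.

```python
import ssl
import urllib.request


def make_verifying_context(cafile=None):
    """Build an SSL context that checks the certificate chain and the hostname.

    cafile is a PEM file of root certificates to trust. When it is None,
    the system's default roots are used instead.
    """
    context = ssl.create_default_context(cafile=cafile)
    # create_default_context already sets these; repeated here to make the intent visible.
    context.check_hostname = True
    context.verify_mode = ssl.CERT_REQUIRED
    return context


def fetch(url, cafile=None):
    """Download url over HTTPS; raises an exception if the certificate does not validate."""
    context = make_verifying_context(cafile)
    with urllib.request.urlopen(url, context=context) as response:
        return response.read()
```

A call might look like fetch("https://example.com/summary.txt", cafile="root_certs.pem"), where both the URL and the file name are stand-ins. Either way, you still have to supply the right roots.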

Converting from development to production update server took two changes. First, I had to change the summary URL. Second, I had to change the root certificate to match the production server. For the second step, I called up Firefox's page info, moved over to the security information, viewed the certificate details, and exported the root certificate. Easy, and the exact same process I had used for the development server.

Having made the necessary changes, I tested it, and the certificate didn't validate. So I made sure I had the right certificate and tried again. It didn't work. I tried using the next certificate down in the chain. I tried feeding it the entire chain. I tried the development server again, and that still worked. But the production server didn't, even though it should work exactly the same.

I repeated the process with the development certificate, and compared against the saved copy. Aha! I had picked up that one under Linux, so it had slightly different line endings (Unix format, not DOS). Okay, convert the certificate to the correct format and... it doesn't work. Well, crud.

I tried everything I could think of. Nothing worked. Every once in a while, I'd spot something that could be the problem, have a brief moment of hope, and get frustrated again as that fix didn't work. I wandered away and played cards for a bit. I came back and set to work figuring out how to recompile Python's SSL library so that I could watch the process and figure out how it was failing. Finally finished that, only to discover that the problem was a level deeper, in OpenSSL itself. So I set to work figuring out how to recompile that.

As part of that process, I went to find a way to extract the certificate with OpenSSL, so that I could check the chain bit-by-bit. I found a little command-line utility called (fittingly enough) "openssl". The openssl utility is basically a collection of little helper programs that you can run. Among those I found the boring-sounding "s_client", which is billed as "a generic SSL/TLS client".

I figured I'd try it, to see if it printed out the certificate. As it happens, it dumped out something even more useful: the certificate chain. The entire certificate chain.
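Something like the following would reproduce that check from Python. It assumes the openssl utility is on your PATH, and the helper names are my own invention. The -connect, -servername, and -showcerts options are standard s_client flags.

```python
import subprocess

PEM_BEGIN = "-----BEGIN CERTIFICATE-----"


def s_client_command(host, port=443):
    """Command line asking openssl to print every certificate the server sends."""
    return ["openssl", "s_client", "-connect", f"{host}:{port}",
            "-servername", host, "-showcerts"]


def count_certificates(output):
    """Count the PEM certificate blocks in s_client's output."""
    return output.count(PEM_BEGIN)


def dump_chain(host, port=443):
    """Run s_client and return its text output (needs network access)."""
    result = subprocess.run(
        s_client_command(host, port),
        input="",             # empty stdin so s_client exits after the handshake
        capture_output=True,
        text=True,
        timeout=30,
    )
    return result.stdout
```

Calling count_certificates(dump_chain("example.com")) (the host is a placeholder) tells you how many certificates the server actually sent, which may be more than your browser shows you.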

Oh, I saw all the pieces I was expecting. The server's certificate itself. DigiCert's signing certificate. DigiCert's root certificate, the one I was using. And underneath that, Entrust.net's root certificate.

Say what? When did Entrust.net get involved in this certificate chain?

It turns out that DigiCert is a relatively new certificate-signing company, established in 2003. Since there were many browsers and OSes out there that did not have their root certificate installed, some of which might never incorporate it, they did something slightly strange. They had Entrust.net (established 1994) sign their root certificate. Anyone who recognized the DigiCert root could ignore that signature, and anyone who didn't could fall back to the Entrust.net root.

Firefox happily trusted the DigiCert root certificate, and looked no farther. OpenSSL, and by extension urllib2, saw that the certificate was signed and continued looking for the true root. Which is more correct, I have no idea, but that disagreement was what had banished me to Engineer's Hell.
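A toy model makes the disagreement easier to see. It only imitates the two behaviors as I observed them and is not how Firefox or OpenSSL are actually implemented; the certificate names and the dictionary-based "chain" are invented for illustration.

```python
# Each certificate maps to whoever signed it. A true root signs itself.
# The names are illustrative stand-ins for the chain described in this post.
SIGNED_BY = {
    "server": "DigiCert signing cert",
    "DigiCert signing cert": "DigiCert root",
    "DigiCert root": "Entrust.net root",     # the cross-signature
    "Entrust.net root": "Entrust.net root",  # self-signed
}


def stop_at_first_trusted(cert, trusted):
    """Accept as soon as any certificate in the chain is one we trust."""
    while True:
        if cert in trusted:
            return True
        issuer = SIGNED_BY[cert]
        if issuer == cert:  # reached a self-signed root we don't trust
            return False
        cert = issuer


def require_trusted_root(cert, trusted):
    """Follow signatures up to the self-signed root, then check only that root."""
    while SIGNED_BY[cert] != cert:
        cert = SIGNED_BY[cert]
    return cert in trusted


only_digicert = {"DigiCert root"}
both_roots = {"DigiCert root", "Entrust.net root"}

print(stop_at_first_trusted("server", only_digicert))  # True
print(require_trusted_root("server", only_digicert))   # False
print(require_trusted_root("server", both_roots))      # True
```

The first function stops at the DigiCert root; the second walks past it and fails unless it also knows the Entrust.net root.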

From there, the solution was easy. I added Entrust.net's root certificate to the certificate file, and it worked.
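Mechanically, a PEM certificate file is just text, so adding a root means appending its block to the file. A minimal sketch, with made-up helper names:

```python
def build_bundle(pem_texts):
    """Join PEM certificate texts into one bundle, one block after another."""
    blocks = [text.strip() for text in pem_texts if text.strip()]
    return "\n".join(blocks) + "\n"


def write_bundle(pem_paths, out_path):
    """Read each PEM file and write them all into a single bundle file."""
    texts = []
    for path in pem_paths:
        with open(path, "r", encoding="ascii") as handle:
            texts.append(handle.read())
    with open(out_path, "w", encoding="ascii", newline="\n") as handle:
        handle.write(build_bundle(texts))
```

For example, write_bundle(["digicert_root.pem", "entrust_root.pem"], "root_certs.pem") would combine two exported roots; the two input file names are hypothetical.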

So if you peek into the data files that armageddon_city uses, you'll find a little text file named root_certs.pem . And if you look in there, you'll find two big blobs of incomprehensible text. One is DigiCert's root, the other is Entrust.net's. Together, they got me into and then back out of Engineer's Hell.

Labels: armageddon_city, frustration, programming, security

LDG and Open Source

Thursday, October 29, 2009

LDG is performing a rather intricate dance where open source is concerned. On one hand, we like it a lot. We like giving things away for free, we like the freedoms it gives, and we really like the way it improves over time. On the other hand, we have some compelling reasons to avoid it in the short run, mostly revolving around money.

As I said last time, money is just as important for a non-profit as for a normal business. Money is power. With it, we can do a lot. Without it, we've got problems.

The core issue is one of risk. We've already put a lot of work into the games we're working on, and they're not finished yet. If we don't get the funding we need from them, it will be that much harder to start over with a new concept. Yes, it's possible to make money, even commercially, in open source. But once we go down that road, there's no turning back. If we try it and it doesn't work for us, we're stuck with it. If our games are still closed-source, we have a lot more flexibility to try other strategies.

In the long run, however, we strongly favor open source. Once we've gotten what money we can out of our work, we'll be happy to release it for everyone to enjoy, examine, and adapt as they see fit. Even while they are closed-source, our games are our gift to the community, and what better way to express that than to give the community control of them.

That "short run"/"long run" distinction is our way of compromising between the ideals of open source and the economics of LDG. We wrote it into our bylaws, because that was the strongest way we could see to state our commitment to open source.

If, at some point, we can afford to risk going open source immediately, we'll be happy to do it. We'll probably still develop in secret, though. It's more interesting if we can surprise you.

Labels: money, open source, philosophy, risk

LDG's Business Model

Thursday, October 29, 2009

The difference between a normal business and a non-profit seems, at first glance, to be a no-brainer. A normal business turns a profit by selling products or services for more than they cost it to produce. A non-profit uses donated time and money to serve its community. They're as different as day and night.

Under the surface, the distinction is much more subtle. You see, a non-profit also produces goods and services, and if they cost more to produce than it makes in donations, it's every bit as doomed as a normal business. The fact that they receive donations instead of selling what they produce doesn't change the fundamental economics of the situation.

The most important difference is that when a non-profit makes money, there's no owner or shareholders to siphon some off. All of the money that comes in the door gets used by the non-profit. This is what makes people willing to support them. Any money you donate is used to support the non-profit's mission.

The downside is that there's a lot of work involved to convince people (especially the IRS) that nobody is siphoning money off. And, as a result, there are a lot of restrictions on the kinds of things you can do.

Aside from those (admittedly important) differences, there's not a lot to separate the two. In particular, running a non-profit takes every bit as much planning as running a normal business. If you don't know where your money is going to come from, odds are good the answer is "It isn't."

LDG is using what I call the "refined NPR model". If you're not familiar with it, NPR is short for National Public Radio, a US-based public radio network. It produces some of the best and most-listened-to radio programs in the USA. Like LDG, NPR is a non-profit, largely supported by donations. Periodically, they run a pledge drive, essentially a period of time where they actively solicit donations on air. Like many non-profits, they give out labeled items like t-shirts or coffee mugs for specific levels of donations.

It works. The combination of repeated reminders and incentives gets people to donate. The problem is that it's also annoying. Even if you donate, they keep interrupting what you're listening to to ask for money, and it doesn't stop until the pledge drive runs its course. It's still an improvement over commercials, but that's not exactly a difficult achievement, now is it?

In software, the NPR model is known as "nagware", and people don't like it. I understand completely. It interrupts or delays what you're doing, and that sucks. At the same time, it's a necessary evil, and one that should be mitigated as much as possible.

The "refined" part of the refined NPR model refers to a feature that's all but universal in nagware: If you pay the toll, it stops nagging you. NPR, by its nature, can't do that. They can't put out a second radio signal without the donation drive, or everybody would just switch over to that one.

On the other hand, NPR has one thing figured out that most nagware doesn't. Frequency. If you solicit donations too frequently (the "nag" part of nagware), all you accomplish is annoying people. It's a delicate balance. If you don't ask frequently enough, people forget that you need support. If you ask too frequently, they don't want to support you at all!

Right now, our plans call for a two-factor approach to frequency. The first is absolute time: we won't bug you more than once a day. We don't want to nag anyone, just remind them every now and again. The second factor is play time. If you haven't played the game for at least four hours since the last reminder, we won't remind you again.

Both of those timers start at 0. Until you have played for several hours, and on more than one day, we haven't earned your attention yet.
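As a rough sketch, the two-factor rule could look like this. The class and method names are hypothetical and nothing here is final; only the thresholds (one day, four hours of play) come from the plan described above.

```python
from dataclasses import dataclass

MIN_SECONDS_BETWEEN_REMINDERS = 24 * 60 * 60  # at most one reminder a day
MIN_PLAY_SECONDS = 4 * 60 * 60                # at least four hours of play


@dataclass
class ReminderClock:
    """Tracks both timers; each starts at 0 on first launch."""
    last_reminder_at: float    # time of the last reminder, or of first launch
    played_since: float = 0.0  # seconds of play since then

    def add_play_time(self, seconds):
        """Record more play time toward the next reminder."""
        self.played_since += seconds

    def should_remind(self, now):
        """True only when both the time factor and the play factor are satisfied."""
        enough_time = now - self.last_reminder_at >= MIN_SECONDS_BETWEEN_REMINDERS
        enough_play = self.played_since >= MIN_PLAY_SECONDS
        return enough_time and enough_play

    def mark_reminded(self, now):
        """Reset both timers after a reminder has been shown."""
        self.last_reminder_at = now
        self.played_since = 0.0
```

Requiring both conditions means a marathon first session never triggers a reminder, and neither does leaving the game untouched for a week.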

There are other issues to consider, too. Seeing the same notice over and over is boring, so we'll put several versions in. They will be skippable, but we'll try to make them amusing and tied into the game's content, so that you want to read them. And if the notices show up during gameplay, we'll do our best not to interrupt you at a critical moment; it'd be counter-productive. In short, we're doing our best to take the "nag" out of nagware, because the best way to get donations is to make people happy.

Labels: fundraising, money, plans