Screw Referential Integrity

After a late night fixing problems caused by a number of things (all boiling down to binary incompatibility issues) I'm wondering why we bother with 30 odd years of database theory. What's the point?

Database theory, with fourth normal form and referential integrity, two phase commits, transactions and isolation levels aren't for everyone. For many applications (like photo sites, blogs and so on), these don't matter. For other (like your bank account) they apparently do. But I'm not in the banking business, and the same goes for a lot of other people, working on applications that today rely on a relational database for their storage needs. Developments such as map reduce, BigTable challenge the notion that you _must_ have a relational database in order to do serious work.

This post started out as a rant against binary formats. There's a myth out there that binary stuff is so much better than text based, self descriptive, redundant formats. And I so much disagree. I spent the night upgrading (trying to upgrade) from PostgreSQL 7.4 to 8, and then migrating the entire shebang from PostgreSQL to MySQL in order to get things working. Client code was of course to blame too, but only because JDBC drivers don't all speak the same dialect of SQL. What did work was the text file dump of all the data in the old postgres. It was a load of INSERT statements, and (with few extremely simple changes (s/"language"/language/)) it worked like a charm.

I started thinking about why the database is there in the first place.

Can anyone make a relational database that scales to internet proportions? I think not. At least if you're an average nerd. Perhaps if you're a computer scientist and you have 10 or so years theorizing about what comes after fourth normal then you might find a way that can handle referential integrity when you're talking millions of servers. But I doubt it.

Why do none of the really successful (internet scale) application technologies out there use or even consider relational integrity? DNS? NNTP? SMTP? SNMP? IMAP? POP? Or even (dare I say) the web itself? Wouldn't it be cool if we could have started over and the Internet as we know it were built by die hard RDBMS fans. No.

If it had been, the DNS root servers (RDBMS of course) might have contained records with foreign keys to all the top level domain servers (also RDBMS), and oh, that means that they also would live in the same database. Oh, so that means we'd have one extremely obese database that held all of our DNS records. How convenient. Oh, it's overloaded? I'll just wait, then.

If it had been, then you wouldn't be able to insert <a href>s so easily, since you would have to add a row in some table which was a foreign key to the target of the anchor. "Oh, I can't delete this web page because who is linking to us?"

Transactional integrity, atomicity are extremely useful at a very fine grained level. Like the balance of my bank account. If someone wants to transfer funds from one account to another, it better be atomic. But does that mean that every operation you will ever perform on all of your data have to be atomic? No! Can you "fix" a money transfer gone awry? Often not. Can you "fix" your e-mail address that had a typo? Yes! Can you "fix" that blog post that included an image that failed the upload? Yes!

Do bad things happen? Yes. Let's just assume they do and not pretend that we can theorize ourselves out of them.

What does all this have to do with text based formats? Well imagine again that the Internet was a huge relational database. I'm sure glad I won't be taking the database offline to perform the upgrade script when the database vendor decided to earn some more cash by releasing a new version of the "I am god" database.

#