Cowboy in the desert.

In Octopus 3.0, we're switching from RavenDB to SQL Server

Paul Stovell

Early beta versions of Octopus used SQL Server with Entity Framework. In 2012, just before 1.0, I switched to using RavenDB, and wrote a blog post about how we use the Embedded version of RavenDB.

For over two years we've been developing on top of RavenDB. In that time we've had over 10,000 installations of Octopus, which means we've been responsible for putting RavenDB in production over 10,000 times. And since most customers don't have in-house Raven experts, we've been the first (only) line of support when there are problems with Raven. We haven't just been kicking the tyres or "looking at" Raven, we bet the farm on it.

For Octopus 3.0, we are going to stop using RavenDB, and use SQL Server instead. Understandably many people are interested in "why". Here goes.

First, the good

RavenDB has a great development experience. Compared to SQL + EF or NHibernate, you can iterate extremely fast with RavenDB, and it generally "just works". If I was building a minimum viable product on a tight deadline, RavenDB would be my go-to database. We rewrote nearly all of Octopus in 6 months between 1.6 and 2.0, and I don't think we could have iterated that quickly on top of SQL + EF.

The bad

We handle most support via email/forums, but when there are big problems, we escalate them to a Skype/GoToMeeting call so we can help the customer. Usually that's very early in the morning, or very late at night, so minimizing the need to do them is critical to our sanity.

What's the cause of most of our support calls? Unfortunately, it's either Raven, or a mistake that we've made when using Raven. And it's really easy to make a mistake when using Raven. These problems generally fall in two categories: index/data corruption issues, or API/usage issues.

Above all else, a database needs to be rock-solid and perform reliably. Underneath Raven uses ESENT, and we've generally not lost any data from the transactional side of Raven. But indexes are based on Lucene.NET, and that's a different story. Indexes that have broken and need to be rebuilt are so common that for 1.6 we wrote a blog post explaining how people can reset their indexes. We sent this blog post to so many people that in 2.0 we built an entire feature in the UI to do it for them.

Repair RavenDB

When I said we'd never lost the transactional data, that's not quite right. It's really easy in RavenDB to add an index that causes big, big problems. Take this:

  Map = processes => from process in processes
                     from step in process.Steps
                     select {...}
  Reduce = results => from result in results
                      group result by ....

You can write this index, and it works fine for you, and you put it into production. And then you find a customer with 10,000 process documents, each of which have, say, 40 steps.

While Raven uses Lucene for indexing, it also writes index records into ESENT. I don't know the internals, but there are various tables inside the Raven ESENT database, and some are used for temporarily writing these map/reduce records. For every item being indexed, it will write a huge number of records to these tables. So we get a support issue from a customer: they start Octopus, and their database file grows at tens or hundreds of MB per second, until it fills up the disk. The database file becomes so large that they can't repair it. All they can do is restore from a backup. When we finally got a copy of one of these huge data files, and explored it using some UI tools for ESENT, these tables contained millions upon millions of records, just for 10,000 documents.

The RavenDB team realised this was a problem, because in 3.0 they added a new feature. If a map operation produces more than 15 output records, that document won't be indexed.

I mean, just read that paragraph again. You write some code, test it, and it works fine in development. You put it in production and it works fine there too, for everyone. And then you get a call from a customer: I just added a new process, and it's not appearing in the list. Only after many emails and a support call do you realise that it's because Raven decided that 15 is OK and 16 is not, and the item isn't being indexed. Your fault for not reading the documentation!

"Safe by default" is so painful in production

Raven has a "safe by default" philosophy, but the API makes it so easy to write "safe" code that breaks in production. For example:

session.Query<Project>().ToList();

Put this in production and you'll get a support call: "I just added my 129th project and it isn't showing up on screen". In order to protect you from the dreaded "unbounded result set" problem, Raven limits the number of items returned from any query. Be thankful it wasn't this:

DeleteExcept(session.Query<Project>().Where(p => !p.KeepForever).ToList())

Unbounded result sets are bad, sure. But code that works in dev, and in production, until it suddenly behaves differently when the number of records change, is much worse. If RavenDB believes in preventing unbounded result sets, they shouldn't let that query run at all - throw an exception when I do any query without calling .Take(). Make it a development problem, not a production problem.

You can only do 30 queries in a session. Result sets are bounded. Only 15 map results per item being mapped. When you work with Raven, keep these limits in your mind *every single time you interact with RavenDB, or you'll regret it.

These limits are clearly documented, but you'll forget about them. You only become aware of them when something strange happens in production and you go searching. Despite two years of production experience using Raven, these opinions still bite us. It frustrates me to see posts like this come out, advocating solutions that will actively break in production if anyone tries them.

Conclusion

RavenDB is great for development. Maybe the problems we're experiencing are our fault. All databases have their faults, and perhaps this is a case of the grass is always greener on the other side. Switching to SQL Server might seem like a step backwards, and might make development harder, but at this point I do feel like we will have less problems in production with SQL Server. It has been around for a long time, and the pitfalls are at least well known and predictable.

That's enough about why we're leaving RavenDB. Next week I'll share some details about how we plan to use SQL Server in Octopus 3.0.

(*) You can disable the unbounded result set protection thing by specifying unlimited items to be returned, if you know where to turn it off. But you still have to explicitly call .Take(int.MaxValue) every single time you write a query.


Tagged with: Architecture
Loading...