April 16, 2014

The worst database developers in the world?

If the makers of MMO RPGs (Massive Multi-Player Online Role-Playing Games) aren’t quite the worst database application developers in the world, they’re at least on the short list for consideration. The makers of Guild Wars didn’t even try to have decent database functionality. A decade later, when they introduced Guild Wars 2, the database-oriented functionality (auction house, real-money store, etc.) would crash for days at a time. Lord of the Rings Online evidently had multiple issues with database functionality. Now I’m playing Elder Scrolls Online, which on the whole is a great game, but which may have the most database screw-ups of all.

ESO has been live for less than 3 weeks, and in that time:

1. There’s been a major bug in which players’ “banks” shrank, losing items and so on. Days later, the data still hasn’t been recovered. After a patch, the problem if anything worsened.

2. Guild functionality has at times been taken down while the rest of the game functioned.

3. Those problems aside, bank and guild bank functionality are broken, via what might be considered performance bugs. Problems I repeatedly encounter include:

In general, it seems like that what should be a collection of database records is really just a list, parsed each time an update occurs, periodically flushed in its entirety to disk, with all the performance problems you’d expect from that kind of choice.

4. Even stupider are the in-game stores, where fictional items are sold for fictional money. They have an e-commerce interface that is literally 15+ years out of date — items are listed with VERY few filtering options, and there is no way to change the sort. But even that super-primitive interface doesn’t work; in particular, filter queries frequently return incorrect empty-set responses.

5. Much as in other games, over 10 minutes of state changes can be lost.

Except perhaps for #5, these are all functions that surely are only loosely coupled to the rest of the game. Hence the other difficulties of game scaling and performance should have no bearing on them. Hence there’s no excuse for doing such a terrible job of development on large portions of gameplay functionality.

Based on job listings, ESO developer Zenimax doesn’t see database functionality as a major area to fix. This makes me sad.

Comments

27 Responses to “The worst database developers in the world?”

  1. Joel Wittenmyer on April 16th, 2014 9:54 am

    I began life as a developer. However, back then we knew what we didn’t know. Since I became a DBA, and then an Architect, I can usually only laugh on the inside and wait for management to ask me to help when the developers tell me they don’t need a database. Who knows, maybe the ‘database’ behind ESOL is Hadoop (do I hear ‘in the cloud’?). I hear that’s going to replace all the RDBMSs… :)

  2. Evan on April 16th, 2014 11:24 am

    There’s a reason for this. Investment horizons in the videogame industry are short (1-3 years), and risk is very high.

    Economically, it doesn’t make sense for any individual studio or publisher to invest in their staff or their infrastructure, since it’s unlikely that they will get a return. So games that do succeed suddenly have lots of scale, an inexperienced staff, and a backend system in which every corner has been cut.

    In social media, for example, investment horizons are long (10 years or more), so the opposite occurs.

  3. Trevor on April 16th, 2014 1:21 pm

    I chalk any issues for now up to their “megaserver” technology.

    Who knows what they may have to do in the background to pull that off.

    I bet their data tier has to look and function differently compared to the MMOs before it.

  4. Curt Monash on April 16th, 2014 4:11 pm

    Evan,

    If you’re offering that explanation as part of the thought process for bad decisions, I’m not going to argue. If, however, you’re suggesting that the apparently bad decisions are good ones, I disagree.

    Doing it “right” from a database standpoint has massive impact on gamer satisfaction, right from the launch. And the way marketing works in the game industry, if a game ever is to be successful, it will have a lot of activity right at the launch.

  5. Curt Monash on April 16th, 2014 4:13 pm

    Trevor,

    I doubt tha ESO’s “megaserver” strategy, which is essentially just a more flexible way to shard across subsets of the player base, has much to do with it.

  6. Curt Monash on April 16th, 2014 4:44 pm

    A forum thread with more details on one of the bugs:

    http://tamrielfoundry.com/topic/where-do-things-stand-on-the-bank-bugs/#post-448026

    Apparently there was a known authentication bug in beta that the Zenimax neither fixed nor clearly warned players about, and which has dire consequences.

  7. Barney Finucane on April 16th, 2014 7:28 pm

    I used to work for a company that used Lotus Notes to write its invoices. The company is no more.

  8. Ted on April 16th, 2014 7:32 pm

    The comment about item stacking piqued my interest. Item stacks to my limited knowledge have been traditionally represented as a single item instance with a stack quantity attribute. Aggregation and splitting are an application level transaction with all it’s pitfalls and limitations. Are you suggesting it’s past prime and we should be using the built-in indexed queries and aggregators?

  9. Curt Monash on April 16th, 2014 7:56 pm

    Ted,

    My best theory for the item stack splitting is that combining stacks into one is something that’s done when throughput allows. It’s one of my numerous signals that what’s going on aren’t real database operations, but rather equivalents to database operations carried out at something like the application level …

    … which seems like a very unfortunate choice.

  10. Paul Brown on April 17th, 2014 4:16 pm

    Not *quite* Curt …

    http://jobs.zenimax.com/requisitions/view/157

    They’re building their DBMS story on MySQL / MariaDB.

  11. Curt Monash on April 17th, 2014 4:50 pm

    Good find, Paul. But I’m suspicious as to how they’re using it. Guild Wars just used SQL Server to store 200K BLOBs.

    I continue to be amazed at the frequency and variety of ESO’s incorrect results to “database” queries. In some cases, I’m surprised if there’s actually a correct response.

  12. Paul Brown on April 18th, 2014 8:50 am

    I doubt they’re using it for much. Certainly not transactions processing. Possibly just to hold rubble.

  13. Evan on April 18th, 2014 4:57 pm

    Curt,

    I’m not really passing judgment on the outcome. I’m suggesting that structurally it could not be otherwise due to the self-reinforcing combination of inexperience and excessive schedule pressure.

    As you mention, most games have a launch spike and usage declines from there, making the opportunity cost of faults extremely high.

  14. Curt Monash on April 19th, 2014 2:55 am

    The bug documented in http://www.youtube.com/watch?feature=player_embedded&v=JoZQW0_OEmw reveals yet another aspect of the problem.

  15. Curt Monash on April 19th, 2014 2:59 am
  16. Curt Monash on April 20th, 2014 11:26 am
  17. Paul Brown on April 23rd, 2014 11:08 am

    Ouch …

    It’s not too early (from some of the threads I’ve seen) to adjudge the whole ESO game launch a fiasco on a par with the Ultima Online launch in 1997. At that point, Origin Systems basically owned the genre. But the launch of UO was so bad–by the time I got on half my shard had dragons–that the franchise never recovered.

    And in another commentary on whether what we do can really be called “engineering”, ESO and UO seem to have *precisely* the same collection of issues; duping bugs.

    Pity. I thoroughly enjoyed Skyrim.

  18. Why an ACID DBMS Beats “File Systems” Any Day – An Online Gamers Lament | Paradigm4, Inc. on April 23rd, 2014 4:28 pm

    […] Monash, on his DBMS2 blog, last week burnt off what I am going to guess (reading between the lines) was an ample portion of […]

  19. Softfalcon on May 15th, 2014 6:21 pm

    I’m not saying this article is wrong, but it makes a lot of assumptions about the development of the game with little to no proof.

    You can blame it on the database management system for a bunch of this, except that the database could have been incredibly well designed but with a poor access layer written for it, or the networking is shoddy, or the mega-server architecture is spotty.

    As a developer, until you look at the code and determine the problem, the issue could literally be anything. Jumping to conclusions about it being bad database design is just misleading.

  20. Drew on May 15th, 2014 7:54 pm

    I was a DB Developer on a short-lived Australian MMOish game. You’re not wrong that bugs are bad, but most of what you’re talking about is the game server code, not necessarily the database.

    Game servers are designed to deal with interactions between hundreds of players simultaneously. One player gives damage and another takes it. One player trades an item away to another player.

    This can’t happen in 2-4 business days like banks do it. This can’t happen in an eventually-consistent sharded way like twitter does it. If facebook loses a comment for five minutes or so, so what? If you lose that stack of ehalth potions for five minutes, you’re dead and so on. All these changes always have to happen now before the next monster attack.

    That’s why most game servers are decoupled somewhat from the backing database store. They have to be.

    The servers are also never empty. There’s no nice 2am downtime window when nobody’s online and you can rebuild your indices. Australia’s peak time is just coming on line then.

    That’s not to say all these problems aren’t solvable, but typically the only time the developers can even approach the loads and behaviour of real players (not matter how hard they simulate) is in the first few days and weeks of having those players.

    All complex, unproven software is hard. Launching games is hard. Launching popular MMOs is somewhat harder again.

  21. Curt Monash on May 15th, 2014 8:08 pm

    Obviously, I wasn’t clear. Sorry about that!

    I’m not blaming any of this on the underlying database management system, if any. Rather, I’m blaming it on an obvious failure to seriously use an underlying database management system. There’s no way you should be able, irrespective of load, to try to increment a value from 72 to 74 and instead wind up with two values, one of 72 and one of 2. The increment should either succeed or fail; there should be no intermediate state like what I’ve seen. And by the way, that intermediate state occurs as often as not.

    There are a whole range of issues that are horrid, but the one I just described is the smoking gun which proves that serious database programming wasn’t even tried. (Or, if it was tried, it was thrown out and a whole different approach was taken for the version that finally shipped.)

  22. Mike Lewis on May 15th, 2014 11:12 pm

    I’ve worked in both heavily database-driven software, and, conveniently, in the MMORPG space.

    Your argument sounds good until you examine the core assumption, which is that an RDBMS is the best solution to the problems an MMO has to contend with. As long as one makes that assumption, you’d be correct in asserting that game developers should have tried it.

    My problem is that this assumption is blatantly false for anyone who’s actually written a realtime multiplayer game, let alone an MMO. Let me explain why.

    What almost all contemporary MMOs do for data storage is to run a game simulation and then periodically persist game state to a database as needed. This is the cause of the “10 minute rollback” phenomenon.

    So why don’t we use a DBMS? Well, let’s try and find one that does all the things we need!

    There are four aspects to consider.

    1 – Realtime performance AND guaranteed sequential operations. Stalling ANY game operation for more than a couple of milliseconds is absolutely out of the question, period. If you’re running in a distributed environment like an MMO server farm, you don’t have the luxury of waiting for a commit to go through. You absolutely *must* have all operations commit sequentially as well, or you get inconsistent simulation results. The fastest way to do this is to write dedicated code that handles the simulation, and persists state periodically by handing it off to another thread/machine. Waiting on persistence is an overhead, no matter how you slice it, so ANY sim operation that relies on a DBMS must never block for longer than it takes to do the base calculation involved, and must also be sequential. The need for this property eliminates a very large number of DBMS packages right off the bat. Oh, and deadlocks are not allowed, either, because the common approach of rerunning a “victim” transaction violates sequential requirements.

    2 – High scalability, i.e. possibly tens of thousands of players and NPCs on a single piece of hardware. Because it’s a game simulation, this means there are potentially millions of interactions going on per second. Let’s make the DBMS selection process even harder! Now we have to have a very high transaction throughput, in addition to operations being non-blocking of the simulation and sequential. Serializing all operations is one approach, but that’s going to murder throughput and waste a lot of CPU core time. It also breaks rule number 1, don’t block. So not only do we have difficult requirements on the properties of the DB, we have very high demands as well.

    3 – Zero tolerance: if I ever write corrupt, partial, or stale data, game over! Some DBs are not prone to this, which is great, but that safety comes at a performance cost and often at the expense of the properties we know we have to have. This rule also completely eliminates all data stores that allow for partial writes or stale views of data.

    4 – Cost distribution: running a beefy database machine for every game server is cost-prohibitive. It is far more economical to have a small number of high-end powerful DB machines and then use cheaper hardware for the game simulation itself. This inherently means that the servers running the game simulation are distributed as separate hardware from the DB hardware. As a result, the only way we can persist state is to talk over the network. But wait… network operations can fail, can block, and so on… so doesn’t that violate our rules?

    The trick is that we solve the network communication problem using periodic (or trigger-based) persistence instead of relying on a DBMS to do our simulation work.

    So please, don’t assume that we are ignorant of how database systems operate. In fact, we have some exceptionally sharp people with lots of database experience in the MMO industry, and we *have* considered all this stuff. The answer to your implicit question (“why don’t MMOs do this”) is not that we are unaware, it’s that your solutions that work fine for many applications are not suited to OUR application.

    If you have a DBMS that can do all that stuff in a bulletproof way, I’d love to learn about it. Otherwise, we’ll all have to learn to live with making tradeoffs – a fundamental part of engineering that I’m sure I don’t need to educate you on.

  23. Curt Monash on May 16th, 2014 5:03 am

    Mike,

    Thank you for your fascinating comment! However, I must dispute your characterization of my core assumption, which is that an RDBMS is necessarily the way to go. I’d actually guess that a NoSQL store would be more suitable, specifically one that’s good at counters.

    Nor am I necessarily calling for true ACID. These are dealings in fictional goods and fictional currencies, so full transactional integrity is not an absolute requirement.

    That said — in the game I wrote about, ESO, there are three kinds of inventory changes:

    1. Character-specific, in the usual way.
    2. Account-wide “bank slots”, in the usual way.
    3. Guild banks (guild membership capped at 500) and guild stores.

    #2 and #3 only happen when a character is physically standing in a bank, of which there are relatively few, each of them limited in how many characters it can accommodate at once. So I’m not seeing why the enormous concurrency requirements of an MMORPG necessarily apply to the specific case of those operations.

  24. Mike Lewis on May 16th, 2014 5:27 pm

    Apologies – I used the term RDBMS by reflexive habit but of course just meant the far more general DBMS, to include non-relational stores. In fact MMOs often do use a lot of non-relational systems for various operations, including banks and such.

    Transactional integrity is not just a matter of losing “Real” value. This is abundantly clear once you’ve been on the receiving end of a pissed off gamer who just had his character wiped by a DB failure ;-)

    The main problem isn’t preserving transactional semantics for their own sake, it’s about simulation consistency. Debugging games requires reproducible behavior, and if I get different simulation results between runs with the same inputs, it becomes incredibly painful to control the simulation’s output. Gamers will also get incredibly annoyed when the simulation starts doing unpredictable things – not good for your longevity as a business.

    As to your specific #2 and #3 cases – those are indeed amenable to using traditional database stores, and in fact MMOs often do exactly that. Where it gets problematic is that the game is *not* using a database to hold all of its state, so inherently we have to build a translation layer that moves data back and forth between the game simulation and the data store. The object-relational mapping problem shows up in a huge way here.

    ORM and distributed programming combined create a very hard problem. This is usually the origin of dupe bugs, for example, because that mapping failed – it has nothing to do with the data store or the game logic in most cases, but the layer that moves stuff between the two.

  25. Curt Monash on May 16th, 2014 7:12 pm

    ESO, like other games, has had bugs allowing duplication of items, and said bugs got exploited, and a fire drill ensued. However, that’s not high on my list of illustrations as to why they’ve botched database programming.

    Rather, for stackable items, it is ROUTINE that they not go into the same stack. This causes significant player inconvenience, as “bank space”, “inventory space” etc. are — as in most games — scarce and costly resources.

    And for an example solely on the application side — if you sell something in the game store, you get email giving you money saying that something has sold, but not saying which item in which store. That’s absurdly uninformative. Given the small number of store transactions that seem to be happening, there’s no way this should be a performance compromise. In general, “guild store” — i.e. “auction house” without auctions — functionality in ESO is far inferior to the equivalent in the first release of LOTRO, which is now quite a few years ago.

  26. Curt Monash on May 16th, 2014 8:16 pm

    http://tamrielfoundry.com/topic/eso-an-insiders-view/ hypothesizes other odd technical issues, notably huge difficulty editing the spoken and even written text in the game.

  27. Chris Ochs on November 19th, 2014 5:32 pm

    Mike does a good job illustrating the challenges mmo’s face, but I have to disagree with some of his conclusions.

    The core of the issue is that almost every mmo has a bad concurrency model, which leads to these approaches to handling persistence like he describes.

    If you want to write highly concurrent servers in C/C++ that’s fine, but if you don’t understand how to implement your own fork/join framework or otherwise learn how to do concurrency right, you fail. And that sums up the majority of what is wrong with mmo servers now days.

    And once you mess that up, the rest is kind of history.

    Mike is right that you can’t go directly to the database for most things. But a non naive approach doesn’t require batching or huge losses in durability, while still retaining all of the features needed. What it does require is a good concurrency model, and on top of that an algorithmic approach instead of a brute force one.

    For example, one approach I have used myself is to have a distributed memory cache that provides full consistency, and has a smart write behind cache to persistent storage. It is fully non blocking, serializes all updates, is highly scalable.

    There is a reason why many companies that are solving similar problems in other industries have dropped C++. It’s really hard to do concurrency well in that language. The ones that are using C++ have people that know what they are doing. They aren’t using thousands of threads per server and other broken models like you see in mmo’s.

    The way mmo’s handle persistence works, but it’s anything but a modern approach, and as much as they think their problems are unique to them, they really aren’t.

Leave a Reply




Feed: DBMS (database management system), DW (data warehousing), BI (business intelligence), and analytics technology Subscribe to the Monash Research feed via RSS or email:

Login

Search our blogs and white papers

Monash Research blogs

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.