Yesterday’s Data Persistence Error: An Explanation

May 30, 2014

by Keith V. Lucas



Adam Miller and Arseny Kapoulkine also contributed to this article.

What happened

This week we introduced an error into our legacy Data Persistence functionality that caused some data loss. A very small number of games that used this functionality lost user data for the people who played these games between Wednesday, May 28th at 10 p.m. Pacific Time and Thursday, May 29th at 2:40 p.m. Pacific Time. Fewer than one in 5,000 ROBLOX games played during that time were affected; this was strictly a ROBLOX-created issue and not the fault of game creators.

Data Persistence is an older method of storing game models/instances and other user data in games — things like a score or a model that you improve over multiple play sessions. A few months ago we introduced Data Stores, which are the highly recommended new method (and not affected by this problem). A few games still use the old system of persisting data.

Game client

Here’s the full set of conditions required to cause the data loss problem:

  1. The game uses Data Persistence.
  2. Player data had Instances in it (i.e., the game used Player.SaveInstance at some point for this player). Games that only used SaveNumber/SaveString/etc. were not affected.
  3. A player played this game between 10 p.m. PT on 5/28 and 2:40 p.m. PT on 5/29.

What happened in this case was that player persistence data was loaded as empty, and when the data was saved (when the player left the game), the empty data set overwrote the one stored on the servers, destroying the original data.
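The overwrite mechanics can be illustrated with a minimal sketch. The class and function names here are hypothetical stand-ins, not ROBLOX's actual code; the point is only how a failed fetch that is treated as "no saved data" ends up destroying the stored copy:

```python
class BlobStore:
    """Stands in for the server-side store: one data blob per player."""
    def __init__(self):
        self.blobs = {"player1": {"Score": 120, "House": "<model xml>"}}

    def fetch(self, player):
        raise IOError("fetch failed")  # the disabled feature broke all fetches

    def save(self, player, data):
        self.blobs[player] = data      # blindly overwrites the stored blob


def load_player_data(store, player):
    # The buggy client behavior: a failed fetch is reported as empty data,
    # with no error, instead of freezing persistence for this player.
    try:
        return store.fetch(player)
    except IOError:
        return {}


store = BlobStore()
data = load_player_data(store, "player1")  # {} despite stored data existing
store.save("player1", data)                # empty set overwrites the original
print(store.blobs["player1"])              # {} -- original data destroyed
```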

In the ROBLOX engineering team, we have a series of safeguards designed to keep problems like this from happening. Usually they work fairly well, but we’re trying to move fast to improve the platform, and occasionally we break things. Here’s how we missed this one:

  1. In an effort to clean up some old code, we disabled a feature we thought was unused, but in reality Data Persistence had a subtle dependency on it.
  2. The ROBLOX Client code mishandled Data Persistence failures. Our bug caused data fetches to fail. At this point, it should have frozen the user’s persistence data, but instead it returned a blank item as if the user had no data saved, with no error. The game would then overwrite the saved data on exit.
  3. We test our proposed releases on a series of staging environments to ensure the highest quality possible. Our first staging environment detected this problem, but we misjudged it as an issue with the staging environment itself rather than a bug in our code.
  4. When we moved to our second staging environment, which is one step away from production, we failed to reproduce the bug. We are not yet sure why this is the case, but we erroneously concluded that there was no bug in our code.
  5. We released late on Wednesday night. Though the bug was extremely damaging, it was also fairly subtle and rare. Wsly, one of the builders of the excellent Deathrun 2, posted about the problem in our developer forum, but due to a timezone difference we didn’t see the post until the next day. Fortunately, Wsly was smart and turned off his game, limiting the damage.
  6. It took us here at ROBLOX HQ approximately 15 hours to realize there was a problem. Once we understood it, we had a fix live in under 30 minutes.

Operations

Virtually all code that saves data is at risk of data corruption from an application bug. The standard way of mitigating this risk is to have data backups spread out over time. For example, a typical SQL database backup sequence looks like this:

  • 15-minute backup on local server
  • 1-hour backup on a secondary server
  • 1-day backup on a tertiary server
  • 1-week backup offsite

Because it takes time for data corruption to spread through the backups, the restore point can be chosen based on when the application bug is discovered. If the bug is discovered within the first 15 minutes, the data can be restored to an hour earlier. If it is discovered between a day and a week later, the data can be restored from the previous week's backup.
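One way to formalize that tier-selection logic is sketched below. The intervals mirror the example schedule above, but the conservative "step up one tier" rule (the finest matching tier's latest snapshot may already contain the corruption) is our assumption, not a statement of any particular backup policy:

```python
# Backup intervals in hours: 15 min local, 1 h secondary, 1 day tertiary,
# 1 week offsite -- the example schedule from the text.
BACKUP_TIERS_HOURS = [0.25, 1, 24, 168]

def restore_tier(discovery_delay_hours):
    """Return the backup interval (in hours) to restore from, or None if
    the corruption is older than the coarsest retained backup."""
    for i, interval in enumerate(BACKUP_TIERS_HOURS):
        if interval > discovery_delay_hours:
            # This tier's most recent snapshot may postdate the corruption,
            # so step up to the next coarser tier when one exists.
            return BACKUP_TIERS_HOURS[min(i + 1, len(BACKUP_TIERS_HOURS) - 1)]
    return None

print(restore_tier(0.1))  # 1   -> discovered within 15 min: hourly backup
print(restore_tier(72))   # 168 -> discovered after 3 days: weekly backup
```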

Unfortunately, data from the old Persistence service were not backed up in this way. This was an oversight. The data are stored on S3, which provides high availability and scalability, but does not guard against application bugs. We are therefore unable to restore old data.

What we are doing

While we urge game developers to transition to Data Stores, many existing games still use Data Persistence, so we will make it more resilient to failure. In particular, we'll audit the code that loads player data and change it to block the load/save functionality in the event of an unexpected error, so that data can't be lost. Additionally, we'll expand our testing to make sure we thoroughly exercise both Data Persistence and Data Stores for every build we plan to release to production.
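In sketch form, the block-on-error safeguard looks like the following. All names here are illustrative, not the actual client code; the essential behavior is that a failed load freezes persistence for that player rather than reporting empty data:

```python
class PlayerPersistence:
    """Illustrative client-side persistence wrapper that freezes on error."""
    def __init__(self, store, player):
        self.store = store
        self.player = player
        self.frozen = False
        self.data = {}

    def load(self):
        try:
            self.data = self.store.fetch(self.player)
        except Exception:
            # Freeze: refuse all further saves rather than treating the
            # failed fetch as "this player has no saved data".
            self.frozen = True

    def save(self):
        if self.frozen:
            return False  # never overwrite server data after a failed load
        self.store.save(self.player, self.data)
        return True


class FailingStore:
    """A store whose fetches always fail, to exercise the freeze path."""
    def fetch(self, player):
        raise IOError("fetch failed")

    def save(self, player, data):
        raise AssertionError("save must not be reached when frozen")


p = PlayerPersistence(FailingStore(), "player1")
p.load()          # fetch fails -> persistence freezes
print(p.save())   # False -- the server-side copy is left untouched
```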

Additionally, while we were able to fix the bug quickly once we realized it existed, it took us too long to discover the problem in the first place. We are establishing a communication channel that will let developers reach us quickly about issues like this, so that we can react to emergencies faster.

Finally, we enabled versioning on the Data Persistence S3 bucket this morning. This is a newer feature from Amazon that keeps multiple versions of each individual piece of content in the Data Persistence store, ensuring there is at least a previous day's copy of all changes. This will guard against ROBLOX application bugs; however, the feature is for internal use only and will not protect against developer-introduced game bugs.

Data stores are solid

Persistence vs. Data Stores

The old Data Persistence service was designed to store player-specific data within a game. For each player you can store a variety of primitive types (i.e., numbers and strings), or instances — say, parts, or models. To simplify the storage implementation, all player data was stored in one XML file (using a format derived from our XML model format) and uploaded to Amazon S3 when the player left the game.

Storing all key/value pairs in one data blob simplified the implementation at the cost of increased risk of data corruption – corrupting the value of one key can lead to loss of all key values, which is exactly what happened.
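A toy illustration of that blast radius (using JSON rather than ROBLOX's XML format, purely for brevity): corrupting one byte of a single-blob design makes every key unreadable, while a per-key design loses only the corrupted entry.

```python
import json

# Single-blob design: all of a player's keys serialized into one document.
player_data = {"Score": 120, "Coins": 38, "House": "model-xml"}
blob = json.dumps(player_data)
corrupt_blob = blob[:-5]          # corruption anywhere invalidates the blob

blob_keys_readable = True
try:
    json.loads(corrupt_blob)
except ValueError:
    blob_keys_readable = False    # every key was in the one blob: all lost

# Per-key design: each key stored as its own independent value.
per_key = {k: json.dumps(v) for k, v in player_data.items()}
per_key["House"] = per_key["House"][:-3]   # corrupt just one entry

readable = []
for key, raw in per_key.items():
    try:
        json.loads(raw)
        readable.append(key)      # undamaged entries still parse fine
    except ValueError:
        pass

print(blob_keys_readable)         # False -- the whole blob is unrecoverable
print(sorted(readable))           # ['Coins', 'Score'] -- only 'House' lost
```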

On top of that, storing data only when the player leaves the game, instead of continuously saving values as the developer sets them, drastically reduces the load of saving data, but it can lead to occasional data loss if the player-leaving sequence is not handled correctly (e.g., if the server shuts down before the player leaves the game).

Essentially, Data Persistence is an easy-to-use API built on top of a brittle storage design. The Data Store feature is a much more versatile system built on a general key-value store, allowing us to update individual key/value pairs much more often (we needed to implement an adaptive throttling mechanism to make sure we can handle the load) and satisfy more complicated usage scenarios (e.g., data that can be updated from multiple server instances concurrently).

Current safety mechanisms

Data Stores are backed by Amazon’s DynamoDB, a highly scalable and available data store. To guarantee recovery, all DynamoDB tables are backed up to Amazon S3. Our retention policy for these backups is one month, so we will have one month of daily backups of Data Store content. This is a low-level infrastructure mechanism, and thus internal only. It mitigates ROBLOX-introduced bugs, but is not meant to guard against developer-introduced bugs in game code.

Future development

We are currently exploring options to give developers access to the previous day's Data Store values. There are a few ways of doing this, and we'll share more as we home in on a strategy. Everyone at ROBLOX is passionate about supporting developers, so it is a no-brainer to provide this functionality.

Compensation

While we can’t restore the data we lost, we seek to compensate for the loss as best we can. This comes in two forms:

  1. For each player who played an affected game during the problem window, we will refund all Developer Products purchased by that player in that game.
  2. For each developer of an affected game, we will distribute an amount of ROBUX proportional to the number of affected players.

We recognize that in many games, players have built up significant progress in the game, such as experience points. ROBLOX cannot directly restore this data. We hope that in some cases, the game developer can themselves restore a player’s progress based on their experience with their user base, PMs received, and so on. The ROBUX we give out here is to compensate the developer for taking the time to help their players.

We are hard at work on these solutions and will have a more precise timeline soon.

Deprecating old persistence

As mentioned above, Data Stores are a vastly superior system to the old Data Persistence. We don’t want to break old games, but we do want people to move over to using Data Stores. This will help everyone use a more robust system, as well as allow us to concentrate our quality efforts on a single persistence layer. We’re working on a plan for this now, and will update the community shortly.

Final thoughts from Keith

I am extremely proud to work with the most passionate, talented, and hard working engineers on the planet. Our engineers care very deeply about supporting, empowering, and entertaining our players. It is very disappointing for us to learn that we have negatively impacted our devoted fans, and for that I apologize. Our culture is to move fast, and that means we sometimes break things. Our culture is also to move fast to fix things and to prevent the same thing from happening again. We have already mitigated the highest risks, and we will be continuing over the coming weeks to add additional checks and developer features. Thank you.