
Supporting Millions of Players at ROBLOX

The architecture and design behind any web stack is key to scalability, and at ROBLOX, we aim to provide simple solutions to complex problems. Today, Vice President of Web Engineering & Operations, Keith V. Lucas, and Technical Director of Web Architecture and Scaling, Matt Dusek, explain how ROBLOX has scaled out and what motivated those changes.

The ROBLOX web stack has evolved considerably since it began. We started with a traditional two-tiered architecture plus some back-end services, which looks like this:

The design was simple, and it supported very rapid development cycles. We discussed more sophisticated enterprise architectures at the time, but rejected them in favor of fast development. Sometimes we kick ourselves now, but only a little bit – ROBLOX has grown in part because of the speed and nimbleness of our engineering team.

The original design eventually broke due to increases in the following:

  • number of inbound requests
  • number of engineers (aka number of changesets / week)
  • number of things we want to know
  • number of client devices talking to back-end services

and to decreases in:

  • tolerance for site disruptions
  • tolerance for potential data loss risks
  • infrastructure costs per online user

In terms of scale, we are now operating at roughly 75,000 database requests per second, 30,000 web-farm requests per second, and over 1 PB per month of CDN traffic. Our content store is responsible for roughly 1% of all the objects stored in the Amazon S3 cloud. Currently, we have about 200 pieces of equipment at PEER1, and many game servers distributed throughout North America.
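To put the CDN figure in more familiar units, the short calculation below converts a monthly volume into a sustained average rate. It is only a back-of-the-envelope sketch: the decimal petabyte and the 30-day month are assumptions, and real traffic peaks well above its average.

```python
# Back-of-the-envelope conversion of monthly CDN volume to average throughput.
# Assumptions (not from the article): decimal petabyte, 30-day month.
BYTES_PER_PB = 10**15
SECONDS_PER_MONTH = 30 * 24 * 60 * 60


def average_gbit_per_s(pb_per_month: float) -> float:
    """Return the sustained average rate in gigabits per second."""
    bytes_per_second = pb_per_month * BYTES_PER_PB / SECONDS_PER_MONTH
    return bytes_per_second * 8 / 10**9


cdn_gbps = average_gbit_per_s(1.0)
print(f"{cdn_gbps:.2f} Gbit/s average")  # 3.09 Gbit/s average
```

Under those assumptions, 1 PB per month works out to a little over 3 Gbit/s around the clock.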

Here is what our system looks like today:

Here is a list of the scale-out that we’ve done, roughly in chronological order, along with the driving factors behind the changes.

Change | Driving Factors
Horizontally Scaled Out Web and Database Farms | Load
Monitoring & Alerts | Predict, avoid, or minimize site disruptions; increase our understanding of the entire system to identify bottlenecks and failure points, both before and after issues.
Memcached | Cost: shift load from expensive databases to cheaper memcached, then scale out memcached.
Background Worker Farm | Quality of Service + Cost: migrate all processing away from web servers that deliver real-time content to background processors, allowing web servers to operate more efficiently and consistently (i.e. no cyclical processing “hiccups”).
Provisioning Models | Load + Cost: have enough servers to keep up with growth, but not so many that costs climb; our provisioning models have been essential to both.
Split WWW Farm into Multiple Smaller Farms | Quality of Service + Cost: our servers fall into a few different performance patterns based on load; splitting them into dedicated farms not only allows them to perform more consistently (less variance in all key metrics), but also allows us to better predict how many servers we really need.
SOLR / Lucene | Load: our original full-text database search could not keep up with user load, content growth, or our feature wish list; SOLR/Lucene is a core element of our search & discovery.
API Service Farm | Multiple Client Devices + Engineering Team Scale: our API service farm is composed of individually developed, deployed, and run components, all accessible to the website, the ROBLOX client, and our iPhone App (more to come).
OLAP | Business Intelligence: “deep dive” queries didn’t last long on production servers.
Database Mirrors | Disaster Recovery: early on, we developed a low-cost system to automatically back up, ship, and restore all of our databases every few minutes; we have since moved to real-time mirroring.
Horizontally Scaled Out our Load Balancing | Load: we currently have an active-active load balancing mesh that distributes incoming load across multiple devices, giving us both high availability and horizontal scale-out.
RabbitMQ | Load + Quality of Service: as we start interacting with our newer cloud services at Amazon, we’re initially using RabbitMQ to protect our internal systems against the intermittent high latencies of our remote systems; we’ll soon be expanding this throughout our back-end to decouple our sub-systems more effectively.

Finally, as we’ve scaled out our infrastructure, we’ve organized around a few themes:

Build Narrowly-Scoped, Composable Components

We do not shy away from complex systems, especially when that’s what’s required to do the job. However, we strive for simple solutions to complex problems. Component design follows good coding practices: compose sophisticated behavior from simple, easy-to-understand constituent pieces. More and more of our user-facing features are the result of interactions between very simple and fast individual services. Our in-RAM datastore backing our presence service easily processes tens of thousands of requests per second with minimal CPU overhead (under 10%) — because that’s all it does.
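As a rough illustration of why a single-purpose in-RAM service can be so cheap per request, here is a minimal presence store sketch. It is not the actual ROBLOX service: the heartbeat model, the 60-second timeout, and the `PresenceStore` name are assumptions made for the example.

```python
import time
from typing import Callable, Dict


class PresenceStore:
    """In-memory map of user id -> time of last heartbeat."""

    def __init__(self, timeout_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout = timeout_seconds
        self._clock = clock
        self._last_seen: Dict[int, float] = {}

    def heartbeat(self, user_id: int) -> None:
        """Record that the user was just seen; a single dict write."""
        self._last_seen[user_id] = self._clock()

    def is_online(self, user_id: int) -> bool:
        """A user counts as online if a heartbeat arrived within the timeout."""
        seen = self._last_seen.get(user_id)
        return seen is not None and self._clock() - seen <= self._timeout


# Usage with a fake clock so the behaviour is deterministic.
now = [0.0]
store = PresenceStore(timeout_seconds=60.0, clock=lambda: now[0])
store.heartbeat(1001)
now[0] = 30.0
print(store.is_online(1001))  # True
now[0] = 100.0
print(store.is_online(1001))  # False
```

Each request is one dictionary read or write with no disk or network hop behind it, which is the kind of narrow scope the paragraph above describes.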

We Are a ROBUX Bank

When it comes to both real and virtual currency transactions, ROBLOX is a bank: transactions must be auditable, responses must be immediate, and failures must be handled deterministically and deliberately. When we work on these features, we behave a bit more like an old-school software shop in both process and design. Our billing systems have more failsafes, checks, and audited queues than any other system. They are also subject to more design reviews, more deliberation, more testing, and more sophisticated rollout strategies. Our virtual economy has an extensive audit trail that allows us to trace money and transactions with impressive fidelity; we can, for example, track the distribution of a single ROBUX grant (e.g. received through a currency purchase) as it travels via user-to-user transactions to hundreds and even thousands of users.
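The grant-tracing example can be sketched with an append-only ledger and a single ordered pass over it. This is a simplified illustration, not the real audit system: it finds every user a grant could have reached through later transfers, rather than accounting for individual units of currency, and the `Transfer` record and `downstream_users` function are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Transfer:
    """One immutable ledger entry; seq gives the global ordering."""
    seq: int
    sender: str
    receiver: str
    amount: int


def downstream_users(ledger: List[Transfer], grant_user: str,
                     grant_seq: int) -> Set[str]:
    """Users who could hold part of a grant made to grant_user at grant_seq.

    Walks the ledger in order and follows only transfers that happen
    after the funds could have arrived at the sender.
    """
    reached = {grant_user}
    for entry in sorted(ledger, key=lambda t: t.seq):
        if entry.seq > grant_seq and entry.sender in reached:
            reached.add(entry.receiver)
    return reached - {grant_user}


ledger = [
    Transfer(1, "bob", "carol", 5),    # before the grant: ignored
    Transfer(3, "alice", "bob", 40),
    Transfer(4, "bob", "dave", 10),
    Transfer(5, "erin", "frank", 7),   # unrelated users
]
print(sorted(downstream_users(ledger, "alice", grant_seq=2)))  # ['bob', 'dave']
```

Because entries are never modified and carry a global sequence number, the same pass can be replayed at any time and always yields the same answer, which is what makes a trail auditable.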

We Are Not a Bank

Other than real and virtual currency transactions, we are not a bank. We shun over-designed solutions, always opting for low-friction, incremental changes to our product and system. We rapidly iterate, most often succeeding, sometimes failing, and always rapidly correcting. We live by the principle that live code getting real-world feedback is infinitely better than code living on someone’s laptop. We’ve moved to daily web application releases, and our high-scalability services team releases component by component ahead of the final feature rollout.

Get the Interface “Right”, Iterate the Implementation

Despite our “not a bank” highly iterative approach, we do draw a (dotted but darkening) line when it comes to defining what a component is and how other components interact with it. That’s because changing the interface to components is much harder than changing what’s under the hood. Given a few days to deliver a new service, our priorities are: (a) surviving the load, and (b) getting the interface right. We’ll gladly deploy a new service that has an inefficient (but liveable) implementation, knowing we can iterate on that implementation without any other team or component knowing. Sometimes that is easier said than done.
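One way to picture this principle is a fixed interface with two interchangeable implementations. The sketch below is hypothetical and not taken from the ROBLOX codebase: `FriendCounter` and both classes are invented for illustration, and it assumes Python 3.8 or later for `typing.Protocol`.

```python
from typing import Dict, List, Protocol, Tuple


class FriendCounter(Protocol):
    """The contract other components depend on (hypothetical example)."""

    def count_friends(self, user_id: int) -> int: ...


class ScanningFriendCounter:
    """First version: correct but slow, scans every friendship on each call."""

    def __init__(self, friendships: List[Tuple[int, int]]) -> None:
        self._friendships = friendships

    def count_friends(self, user_id: int) -> int:
        return sum(1 for a, b in self._friendships if user_id in (a, b))


class IndexedFriendCounter:
    """Later version: precomputes counts, same interface."""

    def __init__(self, friendships: List[Tuple[int, int]]) -> None:
        self._counts: Dict[int, int] = {}
        for a, b in friendships:
            self._counts[a] = self._counts.get(a, 0) + 1
            self._counts[b] = self._counts.get(b, 0) + 1

    def count_friends(self, user_id: int) -> int:
        return self._counts.get(user_id, 0)


def profile_badge(counter: FriendCounter, user_id: int) -> str:
    """A caller that only knows the interface."""
    return f"{counter.count_friends(user_id)} friends"


pairs = [(1, 2), (1, 3), (2, 3), (1, 4)]
print(profile_badge(ScanningFriendCounter(pairs), 1))  # 3 friends
print(profile_badge(IndexedFriendCounter(pairs), 1))   # 3 friends
```

`profile_badge` never changes when the slow version is swapped for the fast one. Changing the signature of `count_friends`, by contrast, would touch every caller, which is why the interface deserves the early attention.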

Nonetheless, ROBLOX’s architecture allows for high scalability with fast development cycles, and our engineering team continues to develop a more sophisticated enterprise design as we scale out.

About Keith V. Lucas

SVP, Engineering & Operations at ROBLOX. @kvlucas on Twitter

97 thoughts on “Supporting Millions of Players at ROBLOX”

  1. LordValkrie

    I get it, It’s mainly audits of how many players, and the amount of robux spent, and how they track it down. I don’t see what the picture has to do with it though. Most of the ROBLOX percentage of audits are alternate accounts.

  2. FATAL OWNAG3

    I did not get ANY of that, only the strange picture that looked like something out of star wars intreeged me

  3. TheException

    My only question is about packet spamming. If you can push to the roblox server that you killed someone, would they die? Or is it controlled by multiple variables, where if it doesn’t match, it doesn’t do anything?

  4. Xxximboredxxx

    GUYS, IT IS NOT A PICTURE OF A TARDIS. It is from a nissiin called “S.S. Azura” in the game “Star Trek Online”

    1. Midnight517

      The TARDIS is this little phone booth that is actually a machine that lets you go through time and space. but in roblox terms, it might stand for somehting else.

      1. Jaref14

        the TARDIS is Actually a police box that the Dr. uses
        not Dr.Richtofen from black ops zombies

  5. jobro13

    We are a bank – We are not a bank

    I lol’ed, nice article guys. Roblox blog is really getting funny to read! Nice, posts every day. Good, very good (and nice marketing BTW)

  6. Coldie25

    Woah…Roblox sure has evolved since the early ’07 days… I have a rudimentary grasp on what they’re talking about, and it’s pretty amazing. 1% of the Google cloud? o.O
    Those people who keep criticizing member benefits ought to look at how much Roblox has to spend on these servers and all the related equipment, it’s mass of staff, etc.

  7. Hexagon100

    I think that picture looks very fururistic. And I was facinated on how much the ROBLOX web stack evolved.

  8. medum2

    ftw whats this about? why we talken bout codes,and why is thar a picture of a tardis?…FTW!!!!!

  9. ~spitfire1111111

    That is sad, most of you don’t even know what they are saying… I am Ten and I know what It means,
    It’s talking about the economics of Roblox. And how it works. Just if you didn’t know what that picture was… It is a TARDIS V2.1 Engine.

  10. einsteinK

    I understand it, the most then :D
    If this works, we’ll get less lagg, less chanche for website errors, less chanche to get/lose robux/tix without reasson, …

  11. ClutchRelease

    @Guy with an opinion, the code is to stop spam comments. You will see this on a lot of sites, such as the ‘Sign up!’ page of YouTube, and the ‘Sign up now!’ page of Weebly. Since internet coding has evolved, exploiters can create codes that work from their computer that rapidly thinks up names and comments. Computers cannot read the code, and therefore the comment will fail to post.

  12. tate with a cake

    Haha id rather forum than read this but is this a problem? i didnt read it but… ya know…

    1. Jordan3885

      Breaking down this… they are creating a system of security, and less lag for the users to make our experience better, instead of having to wait for the game to load.

    2. ChocCookieRaider

      I understood about half of it. So it’s not too hard to understand but then i’m only 12.

      1. Avery Walker

        I couldn’t Understand first, But like he said, Does it mean that the Games take less time to Load? Thats’ what I thought.

    1. yummydoglover

      :D :D :D!! Agreed I just pretended I understood and acted happy and danced about…. o3o cant they add a glossary so we can ACTUALLY UNDERSTAND half of the waffle and techy stuff that they typy upy…. ROBLOX would be a happier place! :D :D -:D FTW! UNICOOOOORN!!!!! -:D -:D

  13. Guy with an opinion

    Wow, I have no idea what you typed there.
    :D
    Oh, and, I noticed you put in a little code-type-thingy when posting a comment in this blog?
    What’s that about?

  14. nathanthe123

    hmm i usally dont read the blog i scroll through it really fast i think this might be the hardest to read blog ever

  15. boblovespizza

    this is really fascinating i didnt know that you put that much thought into security and etc its nice too know were protected. thanks!

  16. supergokurocks

    This is amazing! But it doesn’t really talk about any physics engine or wiring or even HOW many users and their storage they take up.

    1. ChocCookieRaider

      It could take up millions! Wait a sec they are already storing millions! Then a few hundred millionz!

  17. Kashudi

    Speaking of banks, we should be able to put money in a virtual bank account and earn interest on it…

    Just saying.
