Engineering ROBLOX for the iPad, Part 3 (Performance Optimization)

November 10, 2012

by Andrew Haak




If the first part of developing a well-performing ROBLOX experience for the iPad is ensuring stability through memory optimization, the second part is improving the frame rate until it’s as smooth as it is on a modern desktop or laptop computer. The process is a balancing act: push performance optimization to its limit without noticeably degrading the quality of the experience.

As mentioned in our previous Engineering ROBLOX for the iPad article, the Client Team has been neck-deep in ROBLOX’s source code, identifying inefficiencies and re-engineering them for quantifiable, positive impacts on performance. One of the best benchmarks for illustrating their collective progress is Crossroads, a classic level the team has been using as an iPad testing ground. When we first launched the ROBLOX code stack on an iPad, Crossroads with eight players ran at an unplayable five frames per second (FPS). Today, it runs at 30+ FPS.

In this article, we’ll describe how we improved ROBLOX’s iPad frame rate by making our performance-monitoring and quality-adjusting system more aggressive and by hunting for inefficiencies in ROBLOX’s code with a special internal tool.

FrameRateManager and debris culling

ROBLOX’s engine monitors each user’s computer to determine whether it’s striking the most efficient balance between visual fidelity and performance. The system that makes this happen is FrameRateManager, which ramps up and down effects, particles, ambient occlusion, shadows and draw distance based on trends in a user’s frame rate. This keeps users running in real time (i.e., around 30 FPS) across a range of hardware, according to Client Engine Lead Simon Kozlov.

FrameRateManager is a fixture on the desktop version of ROBLOX and one of the reasons why two machines with a 20x performance differential can play together in the same level. However, the iPad’s hardware falls short of even the weakest PC or Mac that runs ROBLOX well, which means it quickly stretches FrameRateManager past the point where it can make games run better. To address this problem, we first added flexibility to FrameRateManager so that, on the iPad, it ramps down effects sooner (for example, upon hitting 25 FPS rather than 15 FPS). We also introduced an iPad-specific technique that ROBLOX developers have coined “debris culling.”
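To make the threshold change concrete, here is a minimal Python sketch of a frame-rate-driven quality controller. It is not FrameRateManager's actual code: the class name, the 0 to 10 quality scale, the 30-frame averaging window, and the 40 FPS ramp-up threshold are all assumptions for illustration. Only the 15 FPS and 25 FPS ramp-down thresholds come from the text above.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class QualityController:
    """Hypothetical stand-in for a frame-rate-driven quality manager."""
    ramp_down_fps: float   # lower quality when the average FPS falls below this
    ramp_up_fps: float     # raise quality when the average FPS rises above this (assumed)
    max_level: int = 10    # assumed quality scale: 0 (lowest) to 10 (highest)
    window: int = 30       # assumed number of recent frames used to judge the trend
    level: int = 10
    _samples: deque = field(default_factory=deque)

    def record_frame(self, frame_seconds: float) -> int:
        """Record one frame's duration and return the (possibly updated) quality level."""
        self._samples.append(frame_seconds)
        if len(self._samples) > self.window:
            self._samples.popleft()
        if len(self._samples) < self.window:
            return self.level  # not enough data for a trend yet
        avg_fps = len(self._samples) / sum(self._samples)
        if avg_fps < self.ramp_down_fps and self.level > 0:
            self.level -= 1
            self._samples.clear()  # let the new level settle before judging it
        elif avg_fps > self.ramp_up_fps and self.level < self.max_level:
            self.level += 1
            self._samples.clear()
        return self.level


# Same controller, different ramp-down thresholds per platform.
desktop = QualityController(ramp_down_fps=15, ramp_up_fps=40)
ipad = QualityController(ramp_down_fps=25, ramp_up_fps=40)

for _ in range(30):  # thirty frames at a steady 20 FPS
    desktop.record_frame(1 / 20)
    ipad.record_frame(1 / 20)

print(desktop.level, ipad.level)
```

At a steady 20 FPS, the desktop configuration leaves quality alone, while the iPad configuration has already stepped down one level.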

One thing that causes significant slowdown on the iPad is rendering individual parts in the environment. To put it in perspective, it’s more resource-intensive to render one individual part – with its own unique properties and data – than it is to render that part when batched in a geometry buffer. With debris culling, we aggressively cull individual parts that are sitting idly in the environment.

In other words, iPad users will stop seeing individual, resting parts at a shorter distance than will users on more powerful hardware. While the difference in distance depends on what FrameRateManager decides is an appropriate quality level, Simon offers this example: if the current draw distance for parts is 450 studs or more, debris might be culled at 220 studs (assuming your device is running slow).
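A rough Python sketch of the idea, using the 450-stud and 220-stud figures from Simon's example. The `Part` fields and the `visible_parts` function are hypothetical names chosen for this illustration, and treating "debris" as "resting and not batched" is an assumption about how such a filter could be expressed, not a description of the engine's code.

```python
import math
from dataclasses import dataclass


@dataclass
class Part:
    name: str
    position: tuple   # (x, y, z) in studs
    is_resting: bool  # True if the part is sitting idle in the environment
    is_batched: bool  # True if it is rendered as part of a geometry buffer


def visible_parts(parts, camera, draw_distance, debris_distance):
    """Return the parts to render, culling loose resting parts at a shorter range."""
    visible = []
    for part in parts:
        dist = math.dist(part.position, camera)
        if dist > draw_distance:
            continue  # beyond the normal draw distance for everything
        is_debris = part.is_resting and not part.is_batched
        if is_debris and dist > debris_distance:
            continue  # debris culling: drop idle individual parts sooner
        visible.append(part)
    return visible


camera = (0.0, 0.0, 0.0)
parts = [
    Part("near brick", (100.0, 0.0, 0.0), is_resting=True, is_batched=False),
    Part("far brick", (300.0, 0.0, 0.0), is_resting=True, is_batched=False),
    Part("far wall", (300.0, 0.0, 0.0), is_resting=True, is_batched=True),
    Part("distant hill", (500.0, 0.0, 0.0), is_resting=True, is_batched=True),
]

shown = [p.name for p in visible_parts(parts, camera, draw_distance=450, debris_distance=220)]
print(shown)
```

The loose brick at 300 studs is skipped even though the batched wall at the same distance is still drawn.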

Natural Disaster Survival

This technique is of particular value in dynamic games with environmental destruction, such as Crossroads and Natural Disaster Survival. The objects still exist in the world; they’re just not rendered from certain distances on iPad players’ screens. Simon estimates debris culling increases ROBLOX’s frame rate from roughly 20 to 30 FPS, making it a valuable tradeoff.

Code profiling

All of the performance and memory optimizations discussed thus far relate to ongoing rendering – GUIs, textures, physical objects, etc. But the Client Team has also logged and analyzed ROBLOX’s performance on the iPad to hunt for “spikes”: one-off events that cause the frame-render time to take a sudden turn for the worse. For example, at 30 FPS, the frame-render time is about 33 milliseconds. But if one frame takes 100 milliseconds to render, there will be a visible “spike” in the rendering and the observer will perceive the resulting animation as jerky.
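As a small illustration, spikes can be picked out of a list of frame times by comparing each frame against the per-frame budget. The `find_spikes` function and the "twice the budget" cutoff below are assumptions for this sketch; the article does not say what threshold the team uses.

```python
def find_spikes(frame_times_ms, target_fps=30, factor=2.0):
    """Return (frame_index, duration_ms) for frames far over the per-frame budget.

    `factor` is an assumed cutoff: a frame counts as a spike when it takes
    more than `factor` times the budget implied by `target_fps`.
    """
    budget_ms = 1000 / target_fps  # about 33.3 ms at 30 FPS
    return [(i, t) for i, t in enumerate(frame_times_ms) if t > factor * budget_ms]


frames = [33, 34, 32, 100, 33, 35]  # made-up frame times in milliseconds
spikes = find_spikes(frames)
print(spikes)
```

Here only the 100-millisecond frame is flagged; the frames near 33 milliseconds are within budget.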

Senior Rendering Engineer Arseny Kapoulkine essentially built a game mode that records the actions of someone playing ROBLOX. It sends the recorded actions to a special tool that lets us analyze what happened in each render frame and drill into the code that caused spikes. This process is called code profiling, and this is what our internal tool looks like:

Code Profiling Tool

Here’s a screenshot of ROBLOX’s code-profiling tool.
  • The top-right section of the window shows the render frame durations; render frames are the most important because you notice render lag the most. Generally, performance was okay (green frames are approximately 30 FPS), but it dropped in a few frames.
  • The bottom-right section shows the individual tasks that ran during the currently selected time range. You can see that in the selected time range we logged 100,000 events across four seconds of gameplay, which equals roughly 780 events for each frame. While rendering took the most time, the delay in red frames is more likely attributable to ProcessPacketData and ResumeWaitingScripts, based on the “max” field. Changing the time range to include just the red (slow) frames confirms this.
  • The top-left section is a general picture, with each color representing a different thread of execution. We use several threads on iPad – we actually run most of our rendering code in parallel with most of our physics/networking code and occasionally some other tasks (e.g., content processing). The length of each line represents event activity. From this graph we can tell how the load is distributed between threads.
  • The bottom-left section either shows the actual log stream (all events in the time range with event data, such as part name for rendering, property value for part updates, etc.) or the Lua profiling information that is extracted from these events.
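The per-task view described above can be approximated with a simple aggregation over logged events. The sketch below assumes a hypothetical event format of (task name, start time, duration) and invented sample data; it is not the internal tool's data model. It shows why the "max" column matters: the task with the largest total time is not necessarily the one behind a slow frame.

```python
from collections import defaultdict


def summarize(events, start_ms, end_ms):
    """Aggregate (task, start_ms, duration_ms) events that start in [start_ms, end_ms)."""
    stats = defaultdict(lambda: {"count": 0, "total": 0.0, "max": 0.0})
    for task, t, duration in events:
        if start_ms <= t < end_ms:
            entry = stats[task]
            entry["count"] += 1
            entry["total"] += duration
            entry["max"] = max(entry["max"], duration)
    return dict(stats)


# Invented sample log: rendering is steady, one packet-processing event is slow.
events = [
    ("Render", 0, 12.0), ("ProcessPacketData", 5, 1.0),
    ("Render", 33, 12.0), ("Render", 66, 12.0),
    ("ProcessPacketData", 70, 40.0), ("ResumeWaitingScripts", 72, 25.0),
    ("Render", 166, 12.0), ("Render", 200, 12.0),
    ("ProcessPacketData", 205, 1.0), ("Render", 233, 12.0),
]

summary = summarize(events, 0, 300)
by_total = max(summary, key=lambda task: summary[task]["total"])
by_max = max(summary, key=lambda task: summary[task]["max"])
print(by_total, by_max)

# Narrowing the range to the slow frame isolates the tasks that ran in it.
slow_frame = summarize(events, 66, 166)
print(sorted(slow_frame))
```

In this made-up log, Render has the largest total, but ProcessPacketData has the largest single event, and narrowing the range to the slow frame shows which tasks ran inside it.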

In the screenshot above, you can see all the script executions in the selected time frame. We’re drilling down to identify a bottleneck — i.e., why one execution of the LoadoutScript takes 30 milliseconds.

The main culprit is inLoadout, which is executed six times for 15.95 milliseconds total. We can see that it traverses the children of a node, and by reading the code we can try to optimize it: in this case, by keeping all the tools that belong to the loadout in a Lua table and updating it whenever a tool is added or removed.
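The actual LoadoutScript is not shown here, so the following is only a Python analogue of the optimization, with hypothetical `Node` and `Loadout` classes. A Python set plays the role of the Lua table: it is kept in sync on every add and remove, so the membership check no longer has to walk the tree.

```python
class Node:
    """Hypothetical tree node standing in for an instance in the data model."""

    def __init__(self, name):
        self.name = name
        self.children = []


def in_loadout_slow(root, tool):
    """Traversal version: walks every descendant of `root` on each call."""
    stack = [root]
    while stack:
        node = stack.pop()
        if node is tool:
            return True
        stack.extend(node.children)
    return False


class Loadout:
    """Keeps a set of tools in sync with the tree so lookups avoid traversal."""

    def __init__(self):
        self.root = Node("Loadout")
        self._tools = set()  # plays the role of the Lua table

    def add_tool(self, tool):
        self.root.children.append(tool)
        self._tools.add(tool)

    def remove_tool(self, tool):
        self.root.children.remove(tool)
        self._tools.discard(tool)

    def in_loadout(self, tool):
        return tool in self._tools  # constant-time on average


loadout = Loadout()
sword = Node("Sword")
loadout.add_tool(sword)
found_fast = loadout.in_loadout(sword)
found_slow = in_loadout_slow(loadout.root, sword)
loadout.remove_tool(sword)
found_after_removal = loadout.in_loadout(sword)
print(found_fast, found_slow, found_after_removal)
```

The tradeoff is a little bookkeeping on every add and remove in exchange for a cheap check each time the question is asked.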

The code-profiling tool is barebones and intended for internal use only. We may ship a tool like this in the future to help ROBLOX game developers profile their scripts, but there is a lot of polish and interface work to be done.

Crossroads: Tower Down

Let’s see how the code-profiling tool works in a game situation. Say a player blows up the Tower in Crossroads, causing it to deconstruct into individually rendered objects. This is a resource-intensive process involving many events that causes the iPad to go into “shock” – it can’t process everything quickly enough. Using our code-profiling tool, we can see a log of the events that happen when the Tower blows up and break the problem into manageable pieces:

  1. First, we can optimize the code that renders the individual parts. To generate a part, we have to traverse its sub-tree in the data model to figure out whether it needs decals, etc. We now cache vital information for rendering (e.g., does the part have decals? Does the part belong to a humanoid? Is the part a file mesh?) and update it whenever the children of the part change or the part changes its ancestor. As a result, we can usually skip most of the traversal.
  2. Second, we can distribute the load over multiple frames, making the spike into a speed bump. This is something ROBLOX already does, but the load has been distributed to more frames on the iPad.
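The second step can be sketched as a work queue that only runs a limited number of jobs per frame. The `FrameBudgetQueue` class and the budget of 50 jobs are assumptions for illustration. This version budgets by job count to keep the example deterministic, whereas an engine could just as well budget by elapsed time.

```python
from collections import deque


class FrameBudgetQueue:
    """Hypothetical queue that spreads pending jobs across several frames."""

    def __init__(self, max_jobs_per_frame):
        self.max_jobs_per_frame = max_jobs_per_frame
        self._pending = deque()

    def push(self, job):
        """Queue a zero-argument callable to run in some later frame."""
        self._pending.append(job)

    def run_frame(self):
        """Run at most `max_jobs_per_frame` jobs and return how many ran."""
        done = 0
        while self._pending and done < self.max_jobs_per_frame:
            job = self._pending.popleft()
            job()
            done += 1
        return done


built = []
queue = FrameBudgetQueue(max_jobs_per_frame=50)

# The tower breaks into 120 loose parts, each needing its render data built.
for part_id in range(120):
    queue.push(lambda part_id=part_id: built.append(part_id))

jobs_per_frame = []
while True:
    ran = queue.run_frame()
    if ran == 0:
        break
    jobs_per_frame.append(ran)

print(jobs_per_frame)
```

Instead of one frame absorbing all 120 jobs, the work lands in three frames of 50, 50 and 20 jobs. Lowering the per-frame budget spreads the same work over more frames, which is the direction the text describes for the iPad.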

As was the case with memory optimization, some of the performance improvements we’re making for the iPad version of ROBLOX will apply to the PC and Mac. Plus, there’s more performance optimization than what this article covers; namely, featherweight parts – known colloquially as the 500k parts update. We’ll have more on that in the coming months.