
Optimizing level loading in Trans Neuronica
Trans Neuronica is a puzzle game I’ve been working on, on and off, for the past few years, and recently I started porting it from C# to JavaScript (and before that it was ported from Flash to C#, don’t ask). I’ve got a substantial chunk of the work done, and the most significant challenge so far has been switching from MonoGame’s freeform drawing to PixiJS’s display tree.
Today, while testing the transition effect when changing levels, I noticed a visible frame skip when the next level is loaded. Not a huge deal, but the game will only get more bloated, so while it’s almost acceptable right now, it will only get worse. My rule about code optimization is “optimize as late as possible” (I wrote about when to optimize in the past), and I think this is the last possible moment to optimize. Why not earlier? Because I didn’t notice the problem earlier. Why not later? Because the code is still simple enough that hooking optimizations into it is easy, yet I have enough of it ported that porting the rest won’t be an issue.
Confirming the problem
Now that I’ve noticed there is a problem I want to confirm it’s indeed related to level loading and not something that happens along with it. Knowing the architecture, it’s unlikely there is another explanation but checking it takes only a minute. Since TN is an HTML5 game, I opened Chrome’s dev tools by pressing F12
, navigated to the Performance
tab, clicked the record button and switched the level a couple of times. The result is:
Preparing a consistent test
Now that we know where the problem lies, we need a consistent way of measuring it so we can verify our optimizations do what we want them to. For this specific case, I made it so that pressing a button triggers ten level changes within a single frame, times them, repeats the whole thing five times, and prints each result plus the average. I don’t strictly need to see these times since I’ll keep using performance recordings to investigate further, but it speeds up iterating on a change and seeing whether I’m on the right track.
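As a rough sketch, the harness amounts to something like this, assuming a hypothetical switchLevel() that performs one full level change synchronously (the real game code differs):

```js
// Minimal benchmark sketch; switchLevel() is a hypothetical stand-in
// for whatever performs a full level change in your game.
function runLevelSwitchBenchmark(runs = 5, switchesPerRun = 10) {
  const times = [];
  for (let run = 0; run < runs; run++) {
    const start = performance.now();
    for (let i = 0; i < switchesPerRun; i++) {
      switchLevel();
    }
    times.push(performance.now() - start);
  }
  const average = times.reduce((a, b) => a + b, 0) / times.length;
  console.log(`runs: ${times.map(t => t.toFixed(1)).join('ms, ')}ms`);
  console.log(`average: ${average.toFixed(1)}ms`);
}
```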
Locating the bottleneck
Now it’s time to use the profiler with our test case and figure out where the root of the problem is. Again, I fired up dev tools and started a performance recording, this time running the test once:
Not shown on this screenshot is an in-depth investigation into the call tree, which strongly suggested that the process of creating a new LevelRenderer is our target right now. Let me briefly explain how the game is rendered so that you have the necessary context to understand what I did.
The stage on the screenshot below is 19×13 tiles, for a total of 247 squares. Each square has nine layers, which gives us 2,223 display objects. The squares (the non-display models) are stored in a jagged array (an array of arrays), while the display objects are added to nine containers. When this renderer was instantiated, the following things had to happen:
1. Instantiate 247 LevelSquares
2. Instantiate 9 PIXI.Containers (the layers themselves)
3. Instantiate 2,223 PIXI.Sprites
4. Call addChild 2,223 times (which adds each sprite to the display tree)
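To make that list concrete, here is a rough reconstruction of what such a constructor does. LevelSquare, the field names, and the loop structure are illustrative stand-ins, not the actual game code:

```js
// Simplified reconstruction of the work listed above; LevelSquare and the
// field names are illustrative, not the actual game code.
const LAYER_COUNT = 9;

class LevelSquare {
  constructor(x, y) { this.x = x; this.y = y; } // stub for illustration
}

class LevelRenderer {
  constructor(width, height) {
    // 1. Instantiate width × height LevelSquares in a jagged array.
    this.squares = [];
    for (let x = 0; x < width; x++) {
      this.squares[x] = [];
      for (let y = 0; y < height; y++) {
        this.squares[x][y] = new LevelSquare(x, y);
      }
    }

    // 2. Instantiate one PIXI.Container per layer.
    this.layers = [];
    for (let i = 0; i < LAYER_COUNT; i++) {
      this.layers.push(new PIXI.Container());
    }

    // 3 & 4. Instantiate one sprite per square per layer and add it to the
    // display tree: 2,223 of each for a 19×13 level.
    for (let i = 0; i < LAYER_COUNT; i++) {
      for (let x = 0; x < width; x++) {
        for (let y = 0; y < height; y++) {
          this.layers[i].addChild(new PIXI.Sprite());
        }
      }
    }
  }
}
```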
That’s quite a lot of work just to display something! Let’s assume, for simplicity, that points #3 and #4 took the majority of the time, split evenly between them. I can’t give you hard data because a) any data from the profiler is an approximation, and b) by this point I knew what I wanted to do, so I didn’t bother collecting it.
Planning & implementing the solution
There are two main paths to reducing the number of instantiations and addChild calls:
- Make that work obsolete by avoiding the display tree entirely. But that requires understanding and modifying Pixi’s internals, something I am not willing to do for a problem likely to have a simpler solution.
- Do the work beforehand, by pooling.
If I have a pool of a few LevelRenderers with their sprites already created, then it’s only a matter of adjusting the size slightly to the requirements of the level: trimming the excess and filling in the missing. I don’t discard the trimmed sprites either; I only remove them from the display tree temporarily and keep them around to be re-added as needed. And when a given LevelRenderer is no longer in use, I release it back to the pool instead of letting it be garbage collected. The result?
The visual bug was an easy fix that did not impact the speed. Still, we can do better!
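Before moving on, here is a rough sketch of the acquire-and-release flow described above. The function names and the resize method are assumptions made for the sake of the example, not the game’s actual API:

```js
// Illustrative acquire/release flow; the function names and resize() are
// hypothetical, not the game's actual API.
const rendererPool = [];

function acquireLevelRenderer(width, height) {
  // Reuse a renderer whose sprites already exist whenever possible.
  const renderer = rendererPool.pop() || new LevelRenderer(width, height);
  // Adjust to the new level: trim the excess, fill in the missing. Trimmed
  // sprites are only detached from the display tree, not discarded.
  renderer.resize(width, height);
  return renderer;
}

function releaseLevelRenderer(renderer) {
  // Keep the renderer for the next level instead of letting the garbage
  // collector take it.
  rendererPool.push(renderer);
}
```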
Further optimizations
I won’t bore you with all the details that are difficult to transcribe into text. Instead, I’ll quickly summarize the other steps I took to reduce the runtime of a single test to about 25ms:
Optimizing children adding/removal
After I cut the number of instantiations and calls to addChild by reusing LevelRenderers, removeChild became the biggest issue. It is called when resizing a level, but only on the squares that fall outside the new boundaries. For example, a level resized from 19×13 to 17×11 has 60 squares removed; multiplied by nine layers, that’s 540 removals. And if you take a look at the source code of this function, for each removed sprite it first has to find the index of the removed child in the array, and then it uses this library to remove the actual item, which shifts every child that occurs later in the array one slot to the left. It does this 540 times, yikes.
What I did was write an extension to PIXI.Container which buffers additions/removals and performs them all in one sweep. The source code, with explanations, is here, but the main points are (a simplified sketch follows this list):
- Use a faster array removal method (swap with the last element and reduce the length).
- Group the removed children into buckets and iterate over the children instead of over the removed elements. It’s faster because I remove many children at once.
- Don’t emit events. This one was not strictly needed since it only sped things up by a tiny fraction, but since I don’t need those events anyway…
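Here is a simplified sketch of what such an extension might look like. It combines the buffering and bucket ideas above into a single compaction pass (standing in for the swap-with-last trick); the actual extension linked earlier differs in the details, for instance it buffers additions as well:

```js
// Simplified sketch of a container that buffers removals and applies them
// in a single pass; the real extension also buffers additions.
class BufferedContainer extends PIXI.Container {
  constructor() {
    super();
    this.pendingRemovals = new Set(); // the "bucket" of children to remove
  }

  removeChildDeferred(child) {
    this.pendingRemovals.add(child); // just mark it; no array work yet
  }

  flushRemovals() {
    if (this.pendingRemovals.size === 0) return;
    const children = this.children;
    let kept = 0;
    // One pass over all children instead of one indexOf + shift-everything
    // per removed child; no 'removed' events are emitted either.
    for (let i = 0; i < children.length; i++) {
      const child = children[i];
      if (this.pendingRemovals.has(child)) {
        child.parent = null;
      } else {
        children[kept++] = child;
      }
    }
    children.length = kept; // shrink in place
    // Forgetting this line is exactly the bug described in "The cherry on
    // top" at the end of the article.
    this.pendingRemovals.clear();
  }
}
```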
Even more pooling
At this point another bottleneck appeared, again mostly related to instantiation and setup, so I fixed it by pooling everything related to the level’s data structures.
But then there were problems
Code optimized for CPU performance is harder to maintain, period. It’s also easier to introduce bugs into, and I had my fair share of those while working on the things presented in this article.
Forgot-to-release-to-the-pool
Pooling requires you to manage memory manually, unless the language you use somehow lets you intercept destruction/garbage collection and put the element back into the pool instead. Which would suck anyway: you would wait forever for things to be released, creating new instances in the meantime and effectively losing the benefits of pooling.
You need to be painfully aware of when each of your pooled objects stops being needed and release it accordingly, and you will forget to do that. I certainly forgot a few times, but since the code in TN is structured neatly enough, it was easy to find the places that were missing a release.
Pooling side effects
Remember that cool transition effect I linked at the start of the article? At some point it just stopped working, and it took me over an hour to figure out the problem. I was fairly confident it had something to do with the side effects of pooling, but it was a very sneaky bug. Just as LevelRenderer is built of SquareRenderers, Level is made of Squares. Before the optimizations, when a level was finished, the squares were left intact; now they are pooled and released. And when they are released, the foreground and background tiles are removed from them. And when those are removed, they clear the SquareRenderer that handles how they are displayed. What that means is that when the transition out starts to play, all of the squares are empty, so there is nothing to display. I solved it by preventing the SquareRenderers from being cleared if the removal happens during level switching.
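For illustration, the fix boiled down to something like this. The isSwitchingLevels flag and the surrounding structure are hypothetical stand-ins, not the actual game code:

```js
// Hypothetical sketch of the guard; the flag and names are stand-ins.
const game = { isSwitchingLevels: false };

class Square {
  constructor(squareRenderer) {
    this.renderer = squareRenderer;
    this.foregroundTile = null;
  }

  removeForegroundTile() {
    this.foregroundTile = null;
    // During a level switch, keep the old visuals around so the
    // transition-out effect still has something to display.
    if (!game.isSwitchingLevels) {
      this.renderer.clear();
    }
  }
}
```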
Another situation was where pits would, seemingly at random, have the wrong colors after the first time you change a level. This time it was both a side effect of pooling and of my inattentive porting of the original code, which only manifested thanks to the pooling.
If you are curious, here is the implementation of Pool I wrote for the game. It’s straightforward, which is all I needed it to be.
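For readers who don’t want to follow the link, a straightforward pool along those lines might look like this sketch (a reconstruction, not the actual implementation):

```js
// A sketch of what a straightforward pool can look like; the actual
// implementation linked above may differ.
class Pool {
  constructor(factory, reset) {
    this.factory = factory; // creates a fresh instance when the pool is empty
    this.reset = reset;     // returns a released instance to a clean state
    this.items = [];
  }

  acquire() {
    return this.items.length > 0 ? this.items.pop() : this.factory();
  }

  release(item) {
    this.reset(item);
    this.items.push(item);
  }
}

// Usage example: pool PIXI.Sprites, hiding them while they wait in the pool.
const spritePool = new Pool(
  () => new PIXI.Sprite(),
  sprite => { sprite.visible = false; }
);
```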
Arsenal of tools
The last thing I want to touch upon is the tools that make optimization easier.
- Your primary tools are your knowledge, experience, and perception, which let you spot and analyze choke points correctly. These grow with time, and the more mature a developer you are, the more you should trust your gut. And the numbers.
- Next come the profiler and other timing tools, both of which I used extensively throughout this article. They’re there to tell you whether you are making progress, so set up a test case that you can run repeatedly with consistent results, to prove that the improvements you make are truly impactful.
- Something I used a lot when implementing pooling was logging every instance creation and release. This alerted me to leaks (new instances being created even when they shouldn’t be). Unfortunately, it did not give me insight into why, how, or where things were not being released.
- Finally, I wrote a small tool for Pixi that logs the duration of each phase of frame handling as well as the number of times render is called (a sketch of the idea follows this list). This happened before this article was written, but it allowed me to spot a pretty huge issue where I accidentally rendered everything multiple times, wasting a huge amount of time (around 5ms per frame). So think about what else in your code could become a bottleneck, and make sure that if it does, you’ll know about it.
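The core of that idea can be sketched like this, assuming you have a Pixi renderer instance to wrap (the actual tool measures more phases than just render):

```js
// Sketch of the idea: wrap the renderer's render() to count calls and
// measure time spent per second. instrumentRenderer is a hypothetical helper.
function instrumentRenderer(renderer) {
  const originalRender = renderer.render.bind(renderer);
  let calls = 0;
  let totalTime = 0;

  renderer.render = (...args) => {
    calls++;
    const start = performance.now();
    originalRender(...args);
    totalTime += performance.now() - start;
  };

  // Log and reset once per second; an accidental extra render() per frame
  // shows up here immediately as a doubled call count.
  setInterval(() => {
    console.log(`render() calls: ${calls}, time: ${totalTime.toFixed(1)}ms`);
    calls = 0;
    totalTime = 0;
  }, 1000);
}
```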
The cherry on top
I had a bug in my code which invalidated all the numbers from Optimizing children adding/removal onwards, whoops! The bug was that when I was buffering children to be added/removed, I never cleared those buffers, so the same display objects were being added over and over again. Which impacted the time it took to generate the buckets. Which impacted the time it took to go through all the buckets. Which made each successive test take longer and longer.
From 80ms to 14ms, almost 6 times faster (17.5% of the original time) for ten level switches.
Thankfully, the bug I introduced did not in any way affect the validity of the techniques I described, but treat it as a cautionary tale: sometimes the bugs that appear when optimizing code only manifest after some time has passed.