Architecting For Dedicated Game Servers With Unreal, Part 1

In the age of cloud infrastructure, dedicated servers are being chosen over peer-to-peer with increasing frequency for multiplayer games. The pros and cons of the two models have been described extensively (e.g. here). What we discuss less frequently are the many traps developers can fall into when they've committed to integrating dedicated servers into their game's online platform.

I've fallen into many of these traps over the years, and in Part 1 I'll tell the story of how I first ran into them, in the hope of saving others some trouble. Many years ago, while working on The Maestros, an indie game built in Unreal Engine, my team was faced with incorporating dedicated servers into our back-end services ecosystem. The issues we faced were the same ones I would later see in games using Unity, Unreal, or custom engines.

Later, in Part 2, I'll discuss the major choices you'll face when running dedicated servers for your own game: datacenter vs cloud, bare metal vs VMs vs containers, etc.

Getting into the Game - The Flow

To give you a feel for what we're discussing, let me briefly illustrate how a player gets into a game in The Maestros.

1 - Create a lobby

2 - Players Join and Choose Characters

3 - Wait for a game server to start

4 - Join the game server

Phase 1 - Make it Work

We knew the flow we wanted, so we set out to build it with our tech stack: Node.js, Windows (required at the time for Unreal), and Microsoft Azure cloud VMs. First, the maestros.exe process on a player's machine made calls over HTTP to a Node.js web service called "Lobbies." These calls would create/join a lobby and choose characters. When all the players were connected and ready, the Lobbies service made an HTTP call to another Node.js service called the "Game Allocator." This would cause the Game Allocator service to start another process on the same VM for the Game Server. In Unreal, a game server is just another maestros.exe process with some special parameters like so: "maestros.exe /Game/Maps/MyMap -server"
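As a rough illustration of that last step, here is a minimal sketch of how a Node.js allocator might build and launch the server command. Only the executable name, the map path, and "-server" come from the story above. The helper names and the per-process port option are assumptions added for illustration, not our actual code.

```javascript
// Hypothetical helper: builds the command line for one Game Server process.
// Only "maestros.exe", the map path and "-server" come from the article;
// the "-port=" option is an assumption so several servers can share one VM.
function buildGameServerCommand(mapPath, port) {
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new RangeError(`invalid port: ${port}`);
  }
  return {
    command: "maestros.exe",
    args: [mapPath, "-server", `-port=${port}`],
  };
}

// Starts the process using an injected spawn function, for example
// require("node:child_process").spawn, which keeps the sketch easy to test.
function startGameServer(spawn, mapPath, port) {
  const { command, args } = buildGameServerCommand(mapPath, port);
  return spawn(command, args, { stdio: "ignore" });
}
```

Passing spawn in as a parameter is a design choice for the sketch: the command-building logic can be checked without launching a real Unreal process.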

Our Game Allocator then watched for the Game Server to complete startup by searching the Game Server's logs for a string like "Initializing Game Engine Completed." When it saw the startup string, the Game Allocator would send a message back to the Lobbies service which would then pass along the IP & port to players. Players, in turn, would connect to the Game Server, emulating the "open" you might type in the Unreal console.
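The log-watching approach can be sketched roughly as follows. This is an illustrative reconstruction, not our original code: the function names are hypothetical, and the marker is simply the string quoted above. The watcher is fed chunks of log text as they are read and fires once when the marker appears, including when the marker is split across two chunks.

```javascript
// Marker string quoted from the article; the exact text depends on the engine build.
const STARTUP_MARKER = "Initializing Game Engine Completed";

// Returns a function that accepts successive chunks of log text and calls
// onReady exactly once, the first time the marker has been seen.
function createStartupWatcher(onReady, marker = STARTUP_MARKER) {
  let tail = "";     // end of the previous text, in case the marker is split
  let ready = false;
  return function feed(chunk) {
    if (ready) return true;
    const text = tail + chunk;
    if (text.includes(marker)) {
      ready = true;
      onReady();
      return true;
    }
    // Keep just enough characters to catch a marker spanning two chunks.
    tail = marker.length > 1 ? text.slice(-(marker.length - 1)) : "";
    return false;
  };
}
```

Even in this tidy form, the allocator is inferring another process's state from its text output, which is the weakness the later phases come back to.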

Phase 2 - Scaling Up

At this point, we could play a game. With a couple more lines of JavaScript, our Game Allocator was also able to manage multiple Game Server processes simultaneously on its VM. Eventually, we would need to run more game server processes than one VM could handle, and we'd want the redundancy of multiple game server VMs as well. We created multiple VMs, each with a Game Allocator that reported its status to the Lobbies service periodically. The Lobbies code would then select the best Game Allocator to start a new game.
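To make the selection step concrete, here is a minimal sketch of how Lobbies might pick an allocator from periodic status reports. The report shape ({ id, runningGames, maxGames, reportedAt }), the ten-second freshness window, and the "most free slots wins" rule are all assumptions for illustration.

```javascript
// Hypothetical status report shape: { id, runningGames, maxGames, reportedAt }.
// Picks the allocator with the most free slots among reports that are
// recent enough, or returns null when none qualifies.
function pickAllocator(reports, nowMs, maxAgeMs = 10000) {
  let best = null;
  for (const report of reports) {
    const fresh = nowMs - report.reportedAt <= maxAgeMs;
    const free = report.maxGames - report.runningGames;
    if (!fresh || free <= 0) continue;
    if (best === null || free > best.maxGames - best.runningGames) {
      best = report;
    }
  }
  return best ? best.id : null;
}
```

Note that the decision is only as good as the most recent report, which may already be out of date by the time Lobbies acts on it.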

Phase 3 - Software Bug Fixing

This architecture worked and served us well through many years of development. It's similar to how many developers implement game server allocation on their first try too. Unfortunately, it's not without its problems. On The Maestros, we kept running into issues that required manual intervention. Despite our cleverest code, we dealt with Game Server processes that never exited, game server VMs getting overloaded, and games being assigned to VMs that were in a bad state or even shut down (Azure performs regular rolling maintenance). Our engineers would need to manually kill the game instances or restart the entire VM.

These headaches have been reported on many different games, so let's examine the root causes. The first problem is the inherently messy business of starting new processes. Starting Unreal requires a lot of slow loads from disk, and any process can fail or hang for many reasons (e.g. insufficient RAM, CPU, or disk). There's very little we can do structurally to fix this except test extensively and write the cleanest code we can.

Second, we kept trying to observe these processes from far away. In order to tell that a Game Server process had completed startup, we read its logs (yuck). To inspect the processes themselves, we used Node to parse the output of wmic commands. Even more problematic, Lobbies made the decisions about which game server VMs could handle a new game, even though it is a separate process running on a separate VM that is several milliseconds away (in the most ideal case). If your heart rate hasn't increased to a dangerous level by this point, then you haven't experienced networked race conditions before.

Even if the Game Allocator parsed the OS information on a Game Server process correctly, the Game Server's state could change before the Game Allocator acted upon it. What's more, even if the Game Server's state didn't change before the Game Allocator reported it to Lobbies, the game server VM could get shut down by Azure while Lobbies was trying to assign it a game. Scaling our Lobbies service and adding redundancy would be harder still, because multiple Lobbies instances could assign games to a single Game Allocator before they noticed each other's games.
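The multiple-Lobbies race is easy to reproduce in miniature. The following is a deliberately simplified, single-process simulation with made-up numbers: two Lobbies instances each take a snapshot of an allocator that has one free slot, and both act on it.

```javascript
// Each Lobbies instance decides from a copy of the allocator's state
// taken earlier, standing in for a status report sent over the network.
function takeSnapshot(allocator) {
  return { ...allocator };
}

function hasRoom(snapshot) {
  return snapshot.runningGames < snapshot.maxGames;
}

function simulateRace() {
  // An allocator with room for exactly one more game.
  const allocator = { runningGames: 3, maxGames: 4 };
  const seenByLobbiesA = takeSnapshot(allocator);
  const seenByLobbiesB = takeSnapshot(allocator);
  // Both instances see a free slot, so both assign a game.
  if (hasRoom(seenByLobbiesA)) allocator.runningGames += 1;
  if (hasRoom(seenByLobbiesB)) allocator.runningGames += 1;
  return allocator;
}
```

Neither instance did anything wrong given what it knew, yet the allocator ends up one game over its limit. The flaw is in where the decision is made, not in the code that makes it.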

We tried several fixes over the next few months, but the race conditions didn't disappear until we changed our thinking. The breakthroughs happened when we started putting the decision-making power in the hands of the process with the best information. When it came to game startup, the Game Server process had the best information about when it was done initializing. Therefore, we let the Game Server tell the Game Allocator when it was ready (via a local HTTP call), instead of snooping through its logs.
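On the allocator side, receiving that local call could look something like the sketch below. The route (/games/<id>/ready), the state names, and the port in the usage comment are hypothetical. The handler is written as a plain (req, res) function so it can be passed to Node's http.createServer.

```javascript
// games: Map of gameId -> { state: "starting" | "ready" } (hypothetical shape).
// Returns a request listener that marks a game ready when its Game Server
// calls POST /games/<id>/ready, and notifies onReady once per game.
function makeReadyHandler(games, onReady) {
  return function handle(req, res) {
    const match = /^\/games\/([^/]+)\/ready$/.exec(req.url);
    const game = match ? games.get(match[1]) : undefined;
    if (req.method !== "POST" || !game) {
      res.statusCode = 404;
      res.end();
      return;
    }
    if (game.state !== "ready") {
      game.state = "ready";
      onReady(match[1]); // e.g. tell Lobbies the IP and port for this game
    }
    res.statusCode = 204;
    res.end();
  };
}

// Possible wiring, bound to localhost so only processes on this VM can call it:
// require("node:http")
//   .createServer(makeReadyHandler(games, notifyLobbies))
//   .listen(8081, "127.0.0.1");
```

Repeated calls for the same game are treated as harmless, so a Game Server that retries its notification does not trigger a second hand-off to Lobbies.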

When it came to determining whether a game server VM was ready to accept new games, the Game Allocator process had the best information. So Lobbies placed a game-start task on a message queue (RabbitMQ), and a Game Allocator would pull a task off the queue when it had capacity, instead of being told what to do by another process working from outdated information. This let us add multiple Lobbies instances without worrying about race conditions. Manual intervention on game servers dropped from weekly to a couple of times a year.
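The shape of the pull model can be shown with a small in-memory stand-in. This is not RabbitMQ client code: the classes below are hypothetical, and a simple array plays the role of the broker. The point is only that each task is handed out once, and only to an allocator that asks for it while it has room.

```javascript
// In-memory stand-in for the message queue (an assumption for this sketch).
class TaskQueue {
  constructor() {
    this.tasks = [];
  }
  // Called by Lobbies: enqueue a game-start task.
  publish(task) {
    this.tasks.push(task);
  }
  // Hands out each task at most once; returns undefined when empty.
  pull() {
    return this.tasks.shift();
  }
}

class GameAllocator {
  constructor(id, maxGames) {
    this.id = id;
    this.maxGames = maxGames;
    this.games = [];
  }
  // Called by the allocator itself whenever it may have capacity.
  pullWork(queue) {
    while (this.games.length < this.maxGames) {
      const task = queue.pull();
      if (task === undefined) break;
      this.games.push(task);
    }
    return this.games.length;
  }
}
```

With this arrangement, any number of Lobbies instances can publish tasks without knowing anything about allocator state, and an allocator that is full, unhealthy, or shut down simply stops pulling.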

Phase 4 - Hardware Bug Fixing

The next problem we faced was a serious one. During our regular Monday night playtests, we saw great performance from our game servers. Units were responsive, and hitching was uncommon. However, hitching and latency were unacceptable when we playtested with alpha testers on weekends.

Our investigations showed that packets weren't reaching our clients even when they had strong connections. The Maestros requires a lot of bandwidth, but our Azure VMs should have been able to keep up with its CPU and bandwidth requirements. Even so, we optimized where we could, only to see the issue return in our next weekend playtest. The only thing that seemed to eliminate the issue completely was using huge VMs that promised 10x the bandwidth we needed, and those were vastly less cost-efficient on a per-game basis than a handful of small/medium instances.

Over time we started to become suspicious. What differed between our regular playtests and our external playtests wasn't location or hardware (devs participated in both tests); it was the timing. We played at off-peak times for development tests, but always scheduled our alpha tests around peak times to attract testers. More probing and prodding confirmed the correlation.

The hypothesis became that our VMs' network was under-performing the advertised values when traffic got heavy in the datacenter, probably because other tenants were saturating the network too. This is commonly known as the "noisy neighbors" problem and is frequently discussed. However, many argue it doesn't matter because you can dynamically allocate more servers. Even Microsoft Azure uses overprovisioning to address these issues. Unfortunately, this strategy doesn't work for our Unreal game servers, which are single processes with latency-sensitive network traffic that cannot be distributed across machines and certainly cannot be interrupted mid-game.

We had plenty of evidence, but not enough to confirm our suspicions, so we decided to run a test. We purchased unmanaged, bare-metal servers from a provider and ran them alongside our Azure VMs during peak-time playtests. The bare-metal servers had double the latency (80ms vs. 40ms), but the games on them ran smoothly, whereas the games on our Azure VMs suffered from near-unplayable lag.

Although the transition seemed inevitable, there were pros as well as cons. One con was that it took a full day to get new servers from our provider, which meant that if we went all in on bare metal, we'd lose the ability to scale up quickly to meet demand. On the other hand, bare metal saved us about 50% on a per-game cost basis. We decided to provision enough bare-metal servers to support our daily load, and to use the larger, more costly Azure VMs when we needed to scale up quickly.
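The split we settled on can be expressed as a tiny placement rule. This is only a sketch with hypothetical names and numbers: bareMetalSlots stands for however many games the leased servers can host, and anything beyond that overflows to cloud VMs.

```javascript
// Splits concurrent games between always-on bare metal and on-demand cloud VMs.
// bareMetalSlots is a hypothetical figure: games the leased servers can host.
function placeGames(concurrentGames, bareMetalSlots) {
  const onBareMetal = Math.min(concurrentGames, bareMetalSlots);
  return { onBareMetal, onCloud: concurrentGames - onBareMetal };
}

// Largest cloud overflow across a series of load samples, i.e. how many
// games' worth of burst capacity the cloud side must be ready to start.
function peakCloudGames(loadSamples, bareMetalSlots) {
  return loadSamples.reduce(
    (peak, games) => Math.max(peak, placeGames(games, bareMetalSlots).onCloud),
    0
  );
}
```

For example, with 40 bare-metal slots and load samples of 10, 38, 55, and 47 concurrent games, only the 55- and 47-game samples spill over to the cloud, and the peak overflow is 15 games.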

Conclusion and Future Topics

I hope our story helps you or other developers looking to use dedicated servers for your game. In Part 2, I'll discuss the trade-offs in cost, maintenance, and complexity of the major choices around dedicated game server architectures.

This includes datacenters vs cloud, containers vs VMs, and existing solutions like Google's new, container-based dedicated server solution, Agones.

Created: 28/08/2022 20:06:07