FleetCapacityExceededException - Unable to reserve a process on fleet

Hi there,

We’re using GameLift Realtime Servers, but after a fleet has been running for a while we’ve noticed that calls to CreateGameSession start to throw an exception:

FleetCapacityExceededException - Unable to reserve a process on fleet

This seems like a resource leak of some kind, but we’re unsure what the actual issue is. When the problem occurs, we bump the instance count and eventually just replace the fleet with a new one.

As far as the data on the GameLift Management Console is concerned, we have plenty of available game sessions and available player sessions. We run between one and three instances per fleet, each with 50 concurrent processes.

The dashboard also consistently reports 50, 100, or 150 active and healthy server processes, depending on whether the fleet has 1, 2, or 3 instances. That actually seems strange to me: shouldn’t this number fluctuate as server processes start and terminate?

Our script is based on the documentation (https://docs.aws.amazon.com/gamelift/latest/developerguide/realtime-script.html), so I believe we’re exiting correctly (otherwise I’d expect the issue to crop up much faster). We have the exact same tickLoop function that periodically checks for active players and exits if none are left:

async function tickLoop() {
    const elapsedTime = getTimeInS() - startTime;
    logger.info("Tick... " + elapsedTime + " activePlayers: " + activePlayers);

    // End the process once the session is empty and the grace period has passed.
    if (activePlayers === 0 && elapsedTime > minimumElapsedTime) {
        logger.info("All players disconnected. Ending game");
        const outcome = await session.processEnding();
        logger.info("Completed process ending with: " + outcome);
        process.exit(0);
    } else {
        setTimeout(tickLoop, tickTime);
    }
}
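As an aside, the guard condition in tickLoop is easy to factor into a pure helper that can be unit-tested without a live GameLift session. A sketch (the helper name and parameters are our own, not part of any SDK):

```javascript
// Sketch: the tickLoop guard as a pure helper, testable without a GameLift
// session. Name and signature are our own invention, not SDK API.
function shouldEndGame(activePlayers, elapsedTime, minimumElapsedTime) {
    // End only when the room is empty AND the grace period has passed, so a
    // fresh process isn't recycled before its first player manages to join.
    return activePlayers === 0 && elapsedTime > minimumElapsedTime;
}
```

tickLoop would then just call `shouldEndGame(activePlayers, elapsedTime, minimumElapsedTime)` in place of the inline condition.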

I was hoping a GameLift engineer might take a look at our fleet and illuminate what the problem is:

Region: us-east-1
FleetId: fleet-6118c87d-b0f0-4fe7-b269-979f27c28e2c

Cheers

“Unable to reserve a process on a fleet” generally means one of a few things:

  1. You have no spare processes left to host game sessions. You should be able to see this in your fleet metrics (https://docs.aws.amazon.com/gamelift/latest/developerguide/monitoring-cloudwatch.html); look for the AvailableGameSessions metric.

If you don’t see available game sessions, my guess is that you are not calling TerminateGameSession / ProcessEnding to clean up game sessions. Confirm your tick loop is running as expected and that you see calls to ProcessEnding being made.

  2. You are out of threads/memory/CPU, so your server cannot respond in time or successfully. Check your instance metrics to confirm.

Try larger instances or fewer servers per instance and ensure you are not CPU/memory bound.

Otherwise you’re hitting an SDK/GameLift communication bug, in which case we can get the GameLift service team to investigate, but please check 1 and 2 first.

Hi Pip – hope you’re well,

First, in the past week we’ve experienced this issue on:

  • May 24th at 10:30 AM PST
  • May 24th at 10:36 AM PST
  • May 27th at 6:09 PM PST

Now for your points, I’ll attach some graphs / logs here. So:

Available game sessions

It appears we have enough available game sessions

Are we calling TerminateGameSession / ProcessEnding correctly?

Here are some logs from a particular process ending:

28 May 2020 17:54:00,382 [INFO] (server.js) 1646: All players disconnected. Ending game
28 May 2020 17:54:00,397 [INFO] (gamelift.js) 103: Calling GameLiftServerAPI.ProcessEnding
28 May 2020 17:54:00,713 [INFO] (gamelift.js) 108: GameLiftServerAPI.ProcessEnding succeeded with optional result: undefined
28 May 2020 17:54:00,713 [INFO] (server.js) 601: Completed process ending with: true

Should we be concerned about the undefined optional result?
GameLiftServerAPI.ProcessEnding succeeded with optional result: undefined

CPU usage

Memory usage

The red, orange, & blue lines represent the first, second, & third instance in our fleet over time.

Threads – I’m not sure if/how to track this metric. Instead here are the number of established connections:

Again with red, orange, & blue for the first, second, & third instance in the fleet over time.


Based on these metrics, I don’t see anything glaringly obvious. Maybe our CPU usage is a little high? We only topped out at ~80%.

Thanks for the update. I will file an issue with the GameLift service team to take a look.

Hello Kip,

I have investigated the FleetCapacityExceededExceptions you are seeing and found that they are caused by server processes beginning registration with our service but not completing it for some time. This makes those server processes look available in the metrics without actually being available for game session placement.

From the logs, the cause is that the on-box process that registers server processes is starved for threads and doesn’t have the resources to complete registration. Looking at EC2 metrics, I also see that some of the instances in the fleet hit 100% CPU utilization at times. These FleetCapacityExceededExceptions can likely be resolved by reducing the number of concurrent server processes.

I’ll work on making the AvailableGameSessions metric more accurate. In the meantime, reducing the number of concurrent server processes should resolve the issue you’re seeing.

Hi @Nathan,

Thanks for taking a look at this! We’ll reduce the # of concurrent server processes and see how that fares.
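For anyone else following along: our understanding is that this can be done in place with update-runtime-configuration rather than replacing the fleet. A sketch, echoed rather than executed here (the launch path and the new count of 25 are placeholders; double-check the flag shapes against the AWS CLI docs):

```shell
# Sketch only (echoed, not run): reduce ConcurrentExecutions in place.
# LaunchPath and the count of 25 are placeholders for illustration.
cmd="aws gamelift update-runtime-configuration \
  --region us-east-1 \
  --fleet-id fleet-2b287ad2-1c38-456c-a834-e969eb4ac4b9 \
  --runtime-configuration ServerProcesses=[{LaunchPath=/local/game/realtime.js,ConcurrentExecutions=25}]"
echo "$cmd"
```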

About CPU usage, I’m taking a look at one of our fleet instances and notice that much of the CPU goes towards a java auxproxy process. I assume this is the “…on-box process that registers server-processes…”. At this particular moment, it doesn’t seem to dip below 50% CPU usage and is very often sitting between 75% to 100% CPU usage. This is the fleet in question:


Region: us-east-1
FleetId: fleet-2b287ad2-1c38-456c-a834-e969eb4ac4b9


Working with GameLift, we were looking to find a way to specify a log level for the base server process. We ran across this todo:

// TODO (GLIFT-10345): allow developers to specify log level in --devargs

We find the current default info level very verbose for our production environment. We’d ideally like to bump this up to at least warn.

So, without any knowledge of what the auxproxy process does or how it works, I’m curious how much CPU time is spent on logging. Taking a look at the one instance in the fleet above:

# pwd
/local/whitewater/Logs

# ls -lSh
total 1.6G
-rw-r--r-- 1 gl-auxproxy gl-auxproxy 945M Jun  2 04:51 auxproxy.log
-rw-r--r-- 1 gl-auxproxy gl-auxproxy 123M Jun  2 04:51 auxproxy-metrics.log

# cat auxproxy.log | grep "\[DEBUG\]" | wc -l | xargs printf "%'d\n"
3,932,761

# cat auxproxy.log | grep "\[INFO\]" | wc -l | xargs printf "%'d\n"
690,643

This fleet has been up for ~26 hours now and auxproxy has produced almost 4 million lines of debug logs. Between the logs auxproxy generates itself and the verbose application-level logs it ingests, processes, and uploads, does this seem correct/normal?
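For scale, here’s the back-of-the-envelope average rate (integer division, so it truncates):

```shell
# ~3.93M [DEBUG] lines over ~26 hours of uptime: average debug lines per second.
debug_lines=3932761
uptime_seconds=$((26 * 3600))          # 93600 seconds
echo $((debug_lines / uptime_seconds)) # prints 42
```

So auxproxy is averaging roughly 42 debug lines per second, around the clock.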

Once again, I don’t know much about this process, so please let me know if I’m missing something or whether this is completely normal. Otherwise, maybe a higher log level for our production environment might help us out?


P.S. We really would love a supported way to specify the log level (GLIFT-10345) :smile:

Cheers

Looking at the debug messages, they’re pretty much all related to log upload. Looking closer, the same files are being uploaded many times. The need to copy, compress, and upload all of these files is why you’re seeing the large log volume as well as the high CPU usage.

It looks like when one process terminates, auxproxy kicks off log uploads for all processes, not just the one that terminated. My guess is that when you called ProcessReady you passed in /local/game/logs instead of (or possibly in addition to) the process-specific directory. You need to ensure that only the process-specific log directory is specified for upload. If you correct that, I believe your problem should go away.

To clarify: I believe you are specifying the per-pid directory correctly in ProcessReady. The issue is that you also have the parent directory specified on the fleet itself. If you remove the parent directory from the fleet configuration, I think your issue should go away.
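To illustrate why that fleet-level parent directory is so costly: the upload set on process exit is effectively the union of the fleet-level LogPaths and the per-process paths from ProcessReady. Here’s a toy simulation (an illustration only, not actual auxproxy code) showing how a parent-directory prefix sweeps in every pid’s logs:

```javascript
// Toy model: which log files get swept up when one process exits?
// (Illustration of the behavior described above, not real auxproxy logic.)
const files = [
    "/local/game/logs/2952/server.log",
    "/local/game/logs/3422/server.log",
    "/local/game/logs/2875/server.log",
];

function filesToUpload(fleetLogPaths, processLogPaths) {
    const prefixes = [...fleetLogPaths, ...processLogPaths];
    return files.filter((f) => prefixes.some((p) => f.startsWith(p + "/")));
}

// With only the per-pid path registered, an exiting pid 2952 uploads one file:
const perPidOnly = filesToUpload([], ["/local/game/logs/2952"]);

// With the parent directory also on the fleet, the same exit sweeps in every pid:
const withParent = filesToUpload(["/local/game/logs"], ["/local/game/logs/2952"]);

console.log(perPidOnly.length, withParent.length); // 1 3
```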

Hi @BrianS-aws,

Just to clarify, we’re using the GameLift Realtime Servers product. I’m a bit confused by your comment; I don’t think this is something we control on our end?

Taking a look at the application logs:

[gl-user-remote@ip-10-22-80-142 ~]$ cd /local/game/logs
[gl-user-remote@ip-10-22-80-142 logs]$ grep -R "Realtime server started" | head -n 5
2952/server.log.2020-06-01-02:01 Jun 2020 02:19:45,742 [INFO] (gamelift.js) 223: Realtime server started! Calling GameLiftServerAPI.ProcessReady with processParameters: {"Port":1908,"LogParameters":{"LogPaths":["/local/game/logs/2952"]}}
3422/server.log.2020-06-01-02:01 Jun 2020 02:19:46,531 [INFO] (gamelift.js) 223: Realtime server started! Calling GameLiftServerAPI.ProcessReady with processParameters: {"Port":1922,"LogParameters":{"LogPaths":["/local/game/logs/3422"]}}
2875/server.log.2020-06-01-02:01 Jun 2020 02:19:46,197 [INFO] (gamelift.js) 223: Realtime server started! Calling GameLiftServerAPI.ProcessReady with processParameters: {"Port":1916,"LogParameters":{"LogPaths":["/local/game/logs/2875"]}}
3377/server.log.2020-06-01-02:01 Jun 2020 02:19:47,142 [INFO] (gamelift.js) 223: Realtime server started! Calling GameLiftServerAPI.ProcessReady with processParameters: {"Port":1938,"LogParameters":{"LogPaths":["/local/game/logs/3377"]}}
2961/server.log.2020-06-01-02:01 Jun 2020 02:19:46,793 [INFO] (gamelift.js) 223: Realtime server started! Calling GameLiftServerAPI.ProcessReady with processParameters: {"Port":1931,"LogParameters":{"LogPaths":["/local/game/logs/2961"]}}

Assuming LogPaths is the parameter we’re talking about, they do seem to be specified on a per-pid basis.

There’s a log paths parameter attached at the fleet level, which is combined with the paths you register for each process. It looks like it can’t be updated; it is only specified when you create the fleet. How did you create your RTS fleet: via the console, the CLI, or something else? If via the CLI, do you have the command you used (and where you got it from)? I’m wondering whether we have bad or misleading instructions out there somewhere, but I wasn’t able to find any with a quick look.

@BrianS-aws We created the RTS fleet via the console. I searched for a log paths parameter in the AWS CLI documentation and found this:

https://docs.aws.amazon.com/cli/latest/reference/gamelift/create-fleet.html

--log-paths (list)
This parameter is no longer used. Instead, to specify where Amazon GameLift should store log files once a server process shuts down, use the Amazon GameLift server API ProcessReady() and specify one or more directory paths in logParameters. See more information in the Server API Reference.


Edit: Looks like you’re right about the LogPaths parameter, though:

$ aws gamelift describe-fleet-attributes --region us-east-1 --fleet-id fleet-ecc0518a-6c5b-49c5-9669-405bc7fcf630
{
    "FleetAttributes": [
        {
            ...
            "LogPaths": [
                "/local/game/logs"
            ],
            ...
        }
    ]
}

So we should try creating a new fleet via the CLI and specifying an empty list for the --log-paths parameter? It still seems like a bug that this is the default value, though?

Yeah, that’s what it sounds like to me. I’ll see if I can track down where this log path is being assigned. If nothing else, it is definitely doing the wrong thing when customers run more than one server per host.

Let us know if creating the fleet via the CLI lets you work around this issue. describe-fleet-attributes is a good way to double-check yourself.
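Roughly, the workaround command would look like the following. Everything here is a placeholder sketch (echoed rather than executed), and for a Realtime fleet you’d pass your own script id and runtime configuration:

```shell
# Sketch only (echoed, not run): create the fleet via the CLI with an explicit
# empty --log-paths so the fleet-level LogPaths attribute stays empty.
# All ids and paths below are placeholders.
cmd="aws gamelift create-fleet \
  --region us-east-1 \
  --name realtime-fleet-no-logpaths \
  --script-id script-1111aaaa-22bb-33cc-44dd-5555eeee66ff \
  --ec2-instance-type c5.large \
  --log-paths [] \
  --runtime-configuration ServerProcesses=[{LaunchPath=/local/game/realtime.js,ConcurrentExecutions=50}]"
echo "$cmd"
```

Afterwards, running describe-fleet-attributes on the new fleet should show an empty LogPaths.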

@BrianS-aws,

Happy to say that starting a fleet via the CLI and emptying the --log-paths attribute worked like a charm. Thanks for your help debugging this issue, and thanks to @Nathan and @Pip for their help as well!

We’re back up to 50 concurrent processes per instance and our CPU usage looks much better. Here’s a graph over the past week. The blue line on the right is the new instance in the fleet without that pesky LogPaths attribute; the other lines are the various instances we ran over the week:

Cheers

The underlying issue that caused the log paths to be incorrect for multiple servers on a host should now be fixed. Please let us know if you experience any further issues!