FleetCapacityExceededException When Creating Game Sessions Despite Available Capacity

Hi,

I have been having issues with GameLift where I am receiving multiple exceptions when attempting to create game sessions. I have been investigating this but am unable to see anything wrong - as far as I can tell, my fleet has plenty of capacity, with healthy game sessions and processes.

As an example, I received the following errors over a couple of minutes from my GameLift client when attempting to create sessions:

[17:11:36] Failed to create game session with exception: Unable to reserve a process on fleet fleet-e7840144-2b69-4d8d-a215-ebce6fa9466b. Type = Unknown, Code = FleetCapacityExceededException

[17:11:36] Failed to create game session with exception: Unable to reserve a process on fleet fleet-e7840144-2b69-4d8d-a215-ebce6fa9466b. Type = Unknown, Code = FleetCapacityExceededException

[17:11:36] Failed to create game session with exception: Unable to reserve a process on fleet fleet-e7840144-2b69-4d8d-a215-ebce6fa9466b. Type = Unknown, Code = FleetCapacityExceededException

[17:11:37] Failed to create game session with exception: Unable to reserve a process on fleet fleet-e7840144-2b69-4d8d-a215-ebce6fa9466b. Type = Unknown, Code = FleetCapacityExceededException

[17:11:37] Failed to create game session with exception: Unable to reserve a process on fleet fleet-e7840144-2b69-4d8d-a215-ebce6fa9466b. Type = Unknown, Code = FleetCapacityExceededException

However, if I look at the metrics for my fleet on the GameLift dashboard, I see the following results at that time:

[Image: Instance Counts]

[Image: Game]

[Image: Server Processes]

[Image: Instance Performance]

[Image: Scaling Limits]

All of these reports suggest to me that there should be plenty of headroom for new game sessions, so I fail to see why I am receiving so many FleetCapacityExceededExceptions. The only thing that looks slightly abnormal at that timestamp is a spike in Network Out, but even that looks like it should be within the capabilities of the instance (and presumably that's the total across all instances?). Is there something that I am missing?

I would greatly appreciate any assistance in resolving this issue.

Thanks,

Tom

Looking at your metrics, the critical one is called Available Game Sessions (in game.png). You can see that it is at or near zero at the time in question.

A FleetCapacityExceededException is thrown when GameLift cannot find an active process that can host your game session. This can be because of capacity, or because your processes terminate unexpectedly or fail to complete game session activation.

Based on your metrics, it mostly looks like you do not have enough capacity to meet your incoming game session rate.

I would look at:

  • What scaling rules you have in place, so you can scale up faster or hold slightly more capacity in the region (see the sketch after this list).
  • How quickly your server can terminate and be ready for a new game session (i.e. can you recycle your processes faster, so that after one game session ends another is available)?
  • Can you pack more game sessions onto an instance by hosting more processes per instance?
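
As a rough sketch of the first point, a target-based scaling policy keeps a percentage of game session capacity free as a buffer. A boto3 call along these lines would do it (the policy name and 20% target here are illustrative, not a recommendation for your specific workload):

```python
import boto3

gamelift = boto3.client("gamelift")

# Keep ~20% of game session capacity free so bursts of CreateGameSession
# calls can land on processes that are already up and ready.
gamelift.put_scaling_policy(
    Name="available-sessions-buffer",  # hypothetical policy name
    FleetId="fleet-e7840144-2b69-4d8d-a215-ebce6fa9466b",  # from your logs
    PolicyType="TargetBased",
    MetricName="PercentAvailableGameSessions",
    TargetConfiguration={"TargetValue": 20.0},
)
```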

Hi @Pip,
No problem, I have made that mistake myself when looking at those charts.
From the logs that I have from that particular incident, I can see that I received 74 of those FleetCapacityExceededExceptions over a 2-minute period from 17:10 to 17:11 (by comparison, there were roughly 50 successful game session activations in that period). There were also a few throttling exceptions, but these could be related to game clients repeatedly requesting game sessions after their previous requests failed.
While looking at this further today, I did see another abnormal termination, with similar cascading FleetCapacityExceededExceptions in the period straight after. At that time the dashboard showed the following event for the fleet: SERVER_PROCESS_TERMINATED_UNHEALTHY. Looking over the fleet configuration, I noticed that Max concurrent game session activations was set to no limit, and I was concerned that too many simultaneous game session activations could be causing the process to take too long to respond to its health check and thus fail. I changed this to a maximum of 5 on a new fleet and have not seen an abnormal termination all afternoon since then. This could be coincidental, and I am still investigating, but does this sound like it could be related to you?
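
For reference, I made the change in the console, but I believe the equivalent via boto3 would be something like the following (the launch path is a placeholder for ours):

```python
import boto3

gamelift = boto3.client("gamelift")

# Cap simultaneous game session activations so a burst of activations
# can't starve a process's responses to GameLift's health checks.
gamelift.update_runtime_configuration(
    FleetId="fleet-e7840144-2b69-4d8d-a215-ebce6fa9466b",
    RuntimeConfiguration={
        "ServerProcesses": [{
            "LaunchPath": "/local/game/MyGameServer",  # placeholder
            "ConcurrentExecutions": 50,
        }],
        "MaxConcurrentGameSessionActivations": 5,
    },
)
```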

I have, however, still seen a few (though not as many) FleetCapacityExceededExceptions at times when I would expect there to be plenty of capacity in the fleet. I have attached some more images of the metrics for these below. I received 6 FleetCapacityExceededExceptions in the period from 18:03 to 18:07. The metrics show that there are plenty of available game sessions (we have scaling set to a very conservative 50% at the moment). There is an instance shutting down in this time period, but I would not expect that to have an impact unless the GameLift service is erroneously attempting to create game sessions on the instance that is shutting down? There are between 150-200 or so available game sessions during that period, so again I would not expect to see this exception.

[Image: Game]

[Image: Instance Counts]

[Image: Server Processes]

It is worth pointing out that these issues are occurring in our production environment, and our customers are failing to find matches, so it is vitally important to us that we resolve these issues as soon as possible.

Thanks for the assistance,
Tom

Apologies, I mis-identified Activating Game Sessions, as it is shown in a color very similar to Available Game Sessions on my laptop. So my advice above is not that relevant.

My guess, after a more careful review of your metrics, is that you may have hit an issue with a bad instance, especially as you had an abnormal termination just after this event. In this case the instance may have managed to register but is having stability problems, so it fails to complete game sessions.

The GameLift team will need to investigate what happened with the game sessions you identified.

By the way, do you have a count or percentage of impacted game sessions (even a rough ballpark)? Are you still seeing this issue, or was it confined to a narrow time range?

It has been 7 days since I posted this. My team has also attempted to contact Amazon via other means, only to be met with more silence. This is poor. These issues are ongoing and affecting our paying customers, and it is vitally important to us that they are resolved. I have provided detailed information about the issue, and it looks to be on GameLift's side. Other forum posts suggest that other people are experiencing these issues as well.

If we cannot resolve this issue soon, we will have to abandon GameLift as not fit for purpose and move over to another hosting solution that works.

It is disappointing that you can find time to update your forum design but not to actually answer your customers' questions and concerns, particularly when they have taken the time to provide detailed information.

Apologies for your frustrating experience here.

The move to the new forums (which was an Amazon GameTech/Lumberyard-wide change, not specific to GameLift) made everything read-only for a couple of days, so folks are still getting back up to speed.

I have reached out to the GameLift team on your behalf to see if someone can reply to your forum question asap.

As this seems to be an urgent issue for you, did you create an issue via AWS support?

Based on the information currently available, this seems like an issue where the fleet had a bad instance in it. When this happens, we do not consider the instance viable for placing game sessions, and after some time the instance gets replaced. We need some more information to dive deep into the issue.

What was the date when this issue was noticed?
Was a queue being used to place game sessions when this issue was noticed?

Hi Pip,

Ok that sounds like an unfortunate timing issue with the forum change then.

Thanks, I appreciate that. Unfortunately, I am told that my company's AWS account does not include the option to open a support ticket, although I would be happy to open one if that is not the case. We have also reached out to a contact at Amazon, who got back to us yesterday.

Hi Akshay,

Thanks for getting back to me. OK, that sounds understandable. I would expect the Available Game Sessions metric to report a lower number if that were the case, though, and that does not seem to be happening. Is that what you would expect to see? We also have a scaling policy set up (at 20%), so I would expect that to compensate for any bad instances by starting up new instances.

This issue has been ongoing for a couple of weeks, since we launched our game. Mostly, game sessions seem to be placed OK, but we have noticed these exceptions during particular surges of users, although, as mentioned, our metrics suggest that there should be game sessions available at those points.

We are not using a queue to place sessions at the moment, as we only have a single fleet in a single region. We are currently using the CreateGameSession method. Would there be a difference in outcome if we created a queue with CreateGameSessionQueue and placed sessions through it instead?
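
For context, our client-side placement currently looks roughly like this (the player count and retry policy are simplified placeholders, not our real code):

```python
import time

import boto3

gamelift = boto3.client("gamelift")

def create_session_with_backoff(fleet_id, max_attempts=5):
    """Call CreateGameSession, backing off on capacity and limit errors."""
    for attempt in range(max_attempts):
        try:
            response = gamelift.create_game_session(
                FleetId=fleet_id,
                MaximumPlayerSessionCount=10,  # placeholder player count
            )
            return response["GameSession"]
        except (gamelift.exceptions.FleetCapacityExceededException,
                gamelift.exceptions.LimitExceededException):
            # Back off so failed clients don't hammer the API and cause
            # the throttling we saw alongside the capacity exceptions.
            time.sleep(2 ** attempt)
    raise RuntimeError("no capacity after %d attempts" % max_attempts)
```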

Hey TJClifton,

Taking a look at the events emitted by your fleet, I see the following event being emitted many times:

SERVER_PROCESS_SDK_INITIALIZATION_TIMEOUT: Server process started correctly but did not call InitSDK() within 5 minutes

This indicates that the game server build has not correctly integrated the GameLift Server SDK. See the documentation below for more information:

https://docs.aws.amazon.com/gamelift/latest/developerguide/gamelift-sdk-server-api.html

“Add code to initialize an Amazon GameLift client and notify the Amazon GameLift service that the server is ready to host a game session. This code should run automatically before any Amazon GameLift-dependent code, such as on launch.”

You can see these events in the Events tab of the GameLift console for future debugging.
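
If it is useful, you can also pull these events programmatically rather than through the console. A minimal boto3 sketch (the fleet ID is a placeholder):

```python
from datetime import datetime, timedelta

import boto3

gamelift = boto3.client("gamelift")

# Fetch the last hour of fleet events and print any SDK init timeouts.
now = datetime.utcnow()
events = gamelift.describe_fleet_events(
    FleetId="fleet-e7840144-2b69-4d8d-a215-ebce6fa9466b",  # placeholder
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Limit=100,
)
for event in events["Events"]:
    if event["EventCode"] == "SERVER_PROCESS_SDK_INITIALIZATION_TIMEOUT":
        print(event["EventTime"], event["Message"])
```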

In addition to what we posted previously about the SDK integration, we have done some further investigation, and here is what we are seeing:

  • Your instances that are seeing issues are running near 100% CPU utilization.
  • You are running 50 processes per instance.
  • You are running on m3.mediums, which have 1 vCPU shared across all 50 processes.

Our recommendation would be to (1) run fewer processes per instance and (2) run on larger instances, in that order. This way, when CPU spikes on a particular process due to something in your game server code, it does not impact that instance's ability to start a game session.
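
As a sketch of what that could look like when creating a replacement fleet (the instance type, build ID, ports, and launch path below are illustrative placeholders, not specific recommendations):

```python
import boto3

gamelift = boto3.client("gamelift")

# Illustrative only: a larger instance type with fewer concurrent server
# processes, leaving CPU headroom so one busy process can't block the
# instance from starting new game sessions.
gamelift.create_fleet(
    Name="prod-fleet-v2",  # hypothetical fleet name
    BuildId="build-11111111-2222-3333-4444-555555555555",  # placeholder
    EC2InstanceType="c4.large",  # 2 vCPUs, vs the m3.medium's 1
    EC2InboundPermissions=[{
        "FromPort": 7777,
        "ToPort": 7786,
        "IpRange": "0.0.0.0/0",
        "Protocol": "UDP",
    }],
    RuntimeConfiguration={
        "ServerProcesses": [{
            "LaunchPath": "/local/game/MyGameServer",  # placeholder
            "ConcurrentExecutions": 10,  # down from 50
        }],
        "MaxConcurrentGameSessionActivations": 5,
        "GameSessionActivationTimeoutSeconds": 300,
    },
)
```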

Also, in order to make sure that you get immediate attention for player-impacting issues, please reach out to AWS Support and file a support ticket. The forums are not the appropriate mechanism to ensure a fast response for issues impacting live games.