GameLift Service Process (auxproxy) Terminates Healthy Game Processes (net45)

First, this post is written primarily for the developers at Amazon.

For end users of GameLift: game processes running on Amazon Linux instances in a healthy state get terminated at random (about 30-60 minutes after creation). Only the .NET 4.5 version of the SDK is affected. The problem needs to be resolved to keep the service usable.

Environment

  • Fleet OS: Amazon Linux (VERSION_ID="2017.09")
  • GameLift Server SDK: 3.2.1 for .NET 4.5
  • Game Engine: Unity 2018.1

Description

Two bugs are involved: one in the Java service process (auxproxy) and one in the GameLift Server SDK. As far as I can tell, two WebSocket connections (game -> auxproxy) are used to control each game process. The one used to report health status gets disconnected at random (at 1000-3600 second intervals) for reasons unknown. The SDK handles the disconnection event by re-establishing the lost WebSocket connection (EngineIoClientDotNet.Client.Socket.Options.Reconnection defaults to true). However, auxproxy terminates the associated game process three minutes after the event, even though it is receiving health reports from the game process over the recovered WebSocket connection. It seems that auxproxy does not handle the reconnection case; since the connections are loopback, the developers probably never expected them to drop.
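
For readers who are not familiar with where these health reports come from, here is a rough sketch of the game-side setup using the GameLift Server SDK for C#. This is written from memory as an illustration only: the exact ProcessParameters constructor order and the LogParameters usage vary between SDK versions, so check your SDK before copying anything.

// Minimal game-server bootstrap sketch (illustrative, not from the attached project).
using System.Collections.Generic;
using Aws.GameLift.Server;

public static class ServerBootstrap
{
    public static void Start(int listenPort)
    {
        // InitSDK() establishes the local WebSocket connections to auxproxy.
        var init = GameLiftServerAPI.InitSDK();
        if (!init.Success)
            return;

        var processParameters = new ProcessParameters(
            gameSession => GameLiftServerAPI.ActivateGameSession(), // onStartGameSession
            () => GameLiftServerAPI.ProcessEnding(),                // onProcessTerminate
            () => true,                                             // onHealthCheck, forwarded to auxproxy periodically
            listenPort,
            new LogParameters(new List<string> { "/local/game/logs/server.log" }));

        // ProcessReady() puts the process into the READY state seen in the auxproxy log below.
        GameLiftServerAPI.ProcessReady(processParameters);
    }
}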

Analysis

I tried to pinpoint the bug in the SDK but failed, because the SDK depends on an old, bloated, poorly organised WebSocket project called 'WebSocket4Net'. I also went through EngineIoClientDotNet and SocketIoClientDotNet, only to find that there is no problem in those libraries: they only close connections when the server disconnects or on a parse error. EngineIoClientDotNet uses an unspecified version of a socket library called 'SuperSocket'. The developer did not document which version is used, so the trail ends there. My best guess is that the connections are dropped because of a problem in the SuperSocket library.

The bug in the SDK is not specific to the Amazon Linux environment; the disconnection problem shows up with Windows binaries as well.

Packet Capture

The game process sends the server a normal "Close" WebSocket frame (code 1000, with no reason string in the payload). There is no sign of a malformed WebSocket frame. Note that, prior to the disconnection event, the server sends nothing that could trigger the library's error handler.
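
For context, "code 1000 without a reason string" means the Close frame carries exactly two bytes of payload: the status code in network byte order and nothing after it (RFC 6455, section 5.5.1). The helper below is a generic illustration of how that payload decodes, not code taken from the capture or the SDK.

// Decode the close code from a WebSocket Close frame payload (RFC 6455).
static ushort ReadCloseCode(byte[] closePayload)
{
    if (closePayload == null || closePayload.Length < 2)
        return 1005; // RFC 6455: "no status code was actually present"
    return (ushort)((closePayload[0] << 8) | closePayload[1]);
}

// ReadCloseCode(new byte[] { 0x03, 0xE8 }) == 1000 -> normal closure, as seen in the capture.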

Logs

It is evident from the log that auxproxy does not handle reconnection of WebSockets.


01 Jul 2018 08:48:13,818 [WARN] (nioEventLoopGroup-3-7) com.amazon.whitewater.auxproxy.pbuffer.SDKListenerImpl: SDK disconnected: /127.0.0.1:58110
01 Jul 2018 08:48:13,818 [INFO] (nioEventLoopGroup-3-7) com.amazon.whitewater.auxproxy.pbuffer.ProcessIdResolver: Resolved 23530 for client /127.0.0.1:58110
01 Jul 2018 08:48:13,818 [DEBUG] (nioEventLoopGroup-3-7) com.amazon.whitewater.auxproxy.pbuffer.SDKListenerImpl: Attempting to retrieve application runner by process ID.
01 Jul 2018 08:48:13,818 [INFO] (nioEventLoopGroup-3-7) com.amazon.whitewater.auxproxy.pbuffer.SDKListenerImpl: GameProcess of disconnected SDK is in state: READY and is not being terminated, force terminating process: 23530
01 Jul 2018 08:48:13,818 [INFO] (nioEventLoopGroup-3-7) com.amazon.whitewater.auxproxy.process.ApplicationRunner: Forcing termination of {processUUID=14a93cac-8e73-481d-b8c6-62e942da8363,processId=23530} with deadline instant 2018-07-01T08:52:43.818Z and reason PROCESS_UNHEALTHY
01 Jul 2018 08:48:13,818 [INFO] (nioEventLoopGroup-3-7) com.amazon.whitewater.auxproxy.api.AmazonWhitewaterAwsJsonClient: terminateHostProcess: applicationId = 76310f40-197b-413b-9e45-d2d924852fc9 , hostId = i-08445d3501599e67b, processId = 14a93cac-8e73-481d-b8c6-62e942da8363
...
01 Jul 2018 08:48:14,825 [DEBUG] (nioEventLoopGroup-3-12) com.corundumstudio.socketio.handler.AuthorizeHandler: Handshake authorized for sessionId: ce24cebc-a2fd-4bbc-8614-656db13daa99, query params: {pID=[23530], sdkVersion=[3.2.1], sdkLanguage=[CSharp], EIO=[3], transport=[websocket], t=[636660316948238360-2]} headers: {Origin=[ws://127.0.0.1:5757], Sec-WebSocket-Key=[MWM2NzFkNzAtNTBlZS00Nw==], content-length=[0], Connection=[Upgrade], Sec-WebSocket-Version=[13], Host=[127.0.0.1:5757], Upgrade=[WebSocket]}
01 Jul 2018 08:48:14,825 [DEBUG] (nioEventLoopGroup-3-12) io.netty.handler.codec.http.websocketx.WebSocketServerHandshaker: [id: 0xf94437a1, L:/127.0.0.1:5757 - R:/127.0.0.1:58838] WebSocket version V13 server handshake
01 Jul 2018 08:48:14,825 [DEBUG] (nioEventLoopGroup-3-12) io.netty.handler.codec.http.websocketx.WebSocketServerHandshaker: WebSocket version 13 server handshake key: MWM2NzFkNzAtNTBlZS00Nw==, response: 7q17cO0w3giRJ7GTrQIctLe9c+I=
01 Jul 2018 08:48:14,826 [DEBUG] (nioEventLoopGroup-3-12) com.corundumstudio.socketio.handler.ClientHead: binding channel: [id: 0xf94437a1, L:/127.0.0.1:5757 - R:/127.0.0.1:58838] to transport: WEBSOCKET
01 Jul 2018 08:48:14,826 [DEBUG] (nioEventLoopGroup-3-12) io.netty.handler.codec.http.websocketx.WebSocket08FrameEncoder: Encoding WebSocket Frame opCode=1 length=113
01 Jul 2018 08:48:14,826 [DEBUG] (nioEventLoopGroup-3-12) io.netty.handler.codec.http.websocketx.WebSocket08FrameEncoder: Encoding WebSocket Frame opCode=1 length=2
01 Jul 2018 08:48:14,826 [INFO] (nioEventLoopGroup-3-12) com.amazon.whitewater.auxproxy.pbuffer.SDKListenerImpl: SDK connected: /127.0.0.1:58838
01 Jul 2018 08:48:14,826 [INFO] (nioEventLoopGroup-3-12) com.amazon.whitewater.auxproxy.pbuffer.ProcessIdResolver: Resolved 23530 for client /127.0.0.1:58838
01 Jul 2018 08:48:14,826 [DEBUG] (nioEventLoopGroup-3-12) com.amazon.whitewater.auxproxy.pbuffer.SDKListenerImpl: Attempting to retrieve application runner by process ID.
01 Jul 2018 08:48:14,826 [INFO] (nioEventLoopGroup-3-12) com.amazon.whitewater.auxproxy.pbuffer.SDKVersionResolver: Resolved 3.2.1 for client /127.0.0.1:58838
01 Jul 2018 08:48:14,826 [INFO] (nioEventLoopGroup-3-12) com.amazon.whitewater.auxproxy.pbuffer.SDKVersionResolver: Resolved 3.2.1 for client /127.0.0.1:58838
01 Jul 2018 08:48:14,826 [INFO] (nioEventLoopGroup-3-12) com.amazon.whitewater.auxproxy.pbuffer.SDKVersionResolver: Resolved 3.2.1 for client /127.0.0.1:58838
01 Jul 2018 08:48:14,826 [INFO] (nioEventLoopGroup-3-12) com.amazon.whitewater.auxproxy.pbuffer.SDKVersionResolver: Resolved 3.2.1 for client /127.0.0.1:58838
01 Jul 2018 08:48:14,826 [INFO] (nioEventLoopGroup-3-12) com.amazon.whitewater.auxproxy.pbuffer.SDKVersionResolver: Resolved 3.2.1 for client /127.0.0.1:58838
01 Jul 2018 08:48:14,826 [INFO] (nioEventLoopGroup-3-12) com.amazon.whitewater.auxproxy.pbuffer.SDKVersionResolver: Resolved 3.2.1 for client /127.0.0.1:58838
01 Jul 2018 08:48:14,826 [INFO] (nioEventLoopGroup-3-12) com.amazon.whitewater.auxproxy.pbuffer.SDKVersionResolver: Resolved 3.2.1 for client /127.0.0.1:58838
01 Jul 2018 08:48:14,826 [DEBUG] (nioEventLoopGroup-3-12) com.corundumstudio.socketio.transport.WebSocketTransport: сlient ce24cebc-a2fd-4bbc-8614-656db13daa99 handshake completed
...
01 Jul 2018 08:52:18,962 [INFO] (ThreadPoolPacketHandler-3) com.amazon.whitewater.auxproxy.pbuffer.SDKListenerImpl: onReportHealth received from /127.0.0.1:58838 with data: [8, 1] and ackRequest requested? true
01 Jul 2018 08:52:18,962 [INFO] (ThreadPoolPacketHandler-3) com.amazon.whitewater.auxproxy.pbuffer.ProcessIdResolver: Resolved 23530 for client /127.0.0.1:58838
...
01 Jul 2018 08:52:44,461 [INFO] 0f1b00d3-047b-49c6-bcd2-6aa77692fc59 (pool-9-thread-1) com.amazon.whitewater.auxproxy.deadline.DeadlineManager: Reached deadline GenericDeadline(deadlineDueTime=2018-07-01T08:52:43.818Z, shouldEnforce=com.amazon.whitewater.auxproxy.process.deadline.ProcessDeadlineManager$$Lambda$84/897297607@7e7a852, enforcementAction=com.amazon.whitewater.auxproxy.process.tasks.ForceTerminationTimeoutTask@40bb4a57, friendlyName=FORCE_TERMINATION_DEADLINE_ENFORCER) and required enforcement, executing action.
01 Jul 2018 08:52:44,461 [INFO] f5200f48-acc1-4b0f-92fc-c867343cc9f0 (pool-11-thread-40) com.amazon.whitewater.auxproxy.process.tasks.ForceTerminationTimeoutTask: Force Termination deadline reached triggering onDisconnect for the process
01 Jul 2018 08:52:44,461 [INFO] f5200f48-acc1-4b0f-92fc-c867343cc9f0 (pool-11-thread-40) com.amazon.whitewater.auxproxy.process.ApplicationRunner: SDK disconnected.
01 Jul 2018 08:52:44,461 [INFO] f5200f48-acc1-4b0f-92fc-c867343cc9f0 (pool-11-thread-40) com.amazon.whitewater.auxproxy.process.event.GameProcessAdvertiser: Handling GameProcessTerminatedEvent: event = GameProcessTerminatedEvent(gameServer={processUUID=14a93cac-8e73-481d-b8c6-62e942da8363,processId=23530}), readyProcesses = [], advertisedProcesses = [{processUUID=2b588ac1-2a8c-403f-a2af-097ad3c7e64a,processId=24831}, {processUUID=1729555d-f5f4-4b4f-9676-60799f706091,processId=23403}, {processUUID=9a495ace-2a4b-4e7b-a5b6-04e90824d73b,processId=20058}, {processUUID=435878b8-486e-4d77-825e-6593e517b65c,processId=23182}, {processUUID=3b6926d3-66dd-4b71-b3e7-e1110863d7f0,processId=23484}, {processUUID=c66307a7-0bb6-42b6-8335-1a0658de14ac,processId=24647}, {processUUID=b67b71a0-9f69-4a07-97ab-cb6135afbed0,processId=21605}, {processUUID=6f8fafa7-9841-4648-af46-28fde0a55e75,processId=21857}, {processUUID=61bed66b-324f-46d6-901a-42ed6476fd98,processId=21544}]
01 Jul 2018 08:52:44,461 [INFO] f5200f48-acc1-4b0f-92fc-c867343cc9f0 (pool-11-thread-40) com.amazon.whitewater.auxproxy.process.event.GameSessionMonitorEvictor: Handling event: GameProcessTerminatedEvent(gameServer={processUUID=14a93cac-8e73-481d-b8c6-62e942da8363,processId=23530})

The connection drop is also reproducible with GameLiftLocal.

15:00:52,750 INFO || - [SDKListenerImpl] nioEventLoopGroup-3-3 - onReportHealth received from /127.0.0.1:55602 with health status: healthy
15:00:53,934 INFO || - [SDKListenerImpl] nioEventLoopGroup-3-4 - SDK disconnected: /127.0.0.1:55572
15:00:53,934 INFO || - [SDKListenerImpl] nioEventLoopGroup-3-4 - SDK pid is 15965
15:00:53,934 INFO || - [GameTerminationManager] nioEventLoopGroup-3-4 - Handling process disconnect event for PID: 15965
15:00:54,951 INFO || - [SDKListenerImpl] nioEventLoopGroup-3-5 - SDK connected: /127.0.0.1:55660
15:00:54,953 INFO || - [SDKListenerImpl] nioEventLoopGroup-3-5 - SDK pid is 15965, sdkVersion is 3.2.1 and sdkLanguage is CSharp
15:00:54,953 INFO || - [SDKListenerImpl] nioEventLoopGroup-3-5 - NOTE: Only SDK versions 3.1.5 and above are supported in GameLiftLocal!
15:01:48,829 INFO || - [SDKListenerImpl] nioEventLoopGroup-3-1 - onReportHealth received from /127.0.0.1:55566 with health status: healthy
15:01:49,030 INFO || - [SDKListenerImpl] nioEventLoopGroup-3-3 - onReportHealth received from /127.0.0.1:55570 with health status: healthy
15:01:49,031 INFO || - [HostProcessManager] nioEventLoopGroup-3-3 - Process (pID: 15965) health changed from false to true!
15:01:49,462 INFO || - [SDKListenerImpl] nioEventLoopGroup-3-5 - onReportHealth received from /127.0.0.1:55574 with health status: healthy

I programmed my Unity game to run the following code to see whether I could get any information from the SDK's logging system, but it showed nothing.


// log4net configuration used to capture EngineIoClientDotNet's internal logging.
using log4net;
using log4net.Appender;
using log4net.Config;
using log4net.Layout;
using log4net.Repository.Hierarchy;

var hierarchy = (Hierarchy)LogManager.GetRepository();
hierarchy.Root.RemoveAllAppenders(); /* Remove any other appenders */

// Append everything to a single file next to the game binary.
var fileAppender = new FileAppender();
fileAppender.AppendToFile = true;
fileAppender.LockingModel = new FileAppender.MinimalLock();
fileAppender.File = "EngineIoClientDotNet.log";

var pl = new PatternLayout();
pl.ConversionPattern = "%d [%2%t] %-5p [%-10c] %m%n";
pl.ActivateOptions();
fileAppender.Layout = pl;
fileAppender.ActivateOptions();
BasicConfigurator.Configure(fileAppender);

Extra bits

A piece of advice: no one is going to find what's wrong inside SuperSocket. To fix this properly, drop WebSocket4Net and switch to a decent, maintained WebSocket library. That would be the best thing for the SDK's future.

Attached

7111-gamelift-bug-analysis.zip (1020 KB)

  • Wireshark capture file
  • auxproxy log
  • SDK log (log4net)
  • Fleet event log (retrieved using awscli)
  • GameLift Local output

My ARNs

arn:aws:gamelift:ap-northeast-2::alias/alias-71a4547d-c04a-4a10-b7bd-bd23b656e36a

->

arn:aws:gamelift:ap-northeast-2:018576890487:fleet/fleet-76310f40-197b-413b-9e45-d2d924852fc9

We're getting the same random termination of our Windows UE servers; not sure if it's the same case… Nothing was wrong last week, but after last Friday everything just went haywire.

Can someone help verify this?

@DavidTimber

I have reported this to the development team. Thanks for spending the time to provide all of these details.

Hi @DavidTimber, first off, thank you very much for your detailed analysis of this error. This is excellent work. You are correct that there is a bug here. I'm investigating our options given the impact and the work currently in progress.

Thanks Again,

Ben

The issue seems to be caused by EngineIoClientDotNet pinging the connection while its ready state isn't set to OPEN (not sure why the connection wouldn't be ready). Chaos ensues and the connection closes.
The bug has been fixed in version 1.0.4 of the package. It didn't work out of the box with the game server SDK, so I back-ported the fix to 0.9.22 instead.

I haven't had any healthy processes terminated so far.
Here's the relevant code if anyone is interested in back-porting the fix like I did.

https://github.com/Quobject/EngineIoClientDotNet/blob/master/Src/EngineIoClientDotNet.mono/Client/Socket.cs#L580

You probably want to replace most of the SetPing() method; however, guarding the Ping() call with if (ReadyState == ReadyStateEnum.OPEN) should be enough to prevent the connection from dropping.
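
For reference, the guard would sit inside the timer callback that SetPing() arms in the 0.9.22 Socket class. The helper name below is made up for illustration, and the surrounding structure is paraphrased rather than copied from the library.

// Sketch of the guarded ping, assuming the 0.9.22 EngineIoClientDotNet.Client.Socket internals.
private void PingIfStillOpen()
{
    // Only ping while the engine.io socket is actually open; pinging a socket
    // that is already closing appeared to trigger the spurious disconnects.
    if (ReadyState == ReadyStateEnum.OPEN)
    {
        Ping();
    }
}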

EDIT: Oh well, it seems the issue has not been completely resolved by guarding the ReadyState. I decided to disable the timeout feature in EngineIoClientDotNet completely until this gets resolved. It hasn't crashed since then.

The SDK doesn't reference the version that you linked, and if it were really the timeout feature, the connection loss wouldn't be random.

I'm glad there's someone else who can look into this. If only the developers had documented the library versions they used… I know this would be hard to solve without refactoring the entire insides of the SDK (replacing the WebSocket library). I could do that myself and use my own build, but I haven't had the time for it yet.

If you are the maintainer assigned this issue, best of luck to you.

Hey @DavidTimber - Thank you for the very thorough deep dive here, appreciate it. We called out on this other thread that we acknowledge we need to do some work to better support Unity developers looking to use .NET 4.5 and GameLift together. We are looking at improving this connectivity to avoid the issue you are calling out - no firm dates yet.

We’ll update this thread once we have more details or news to share.

The code I linked is from EngineIoClientDotNet's master branch; you're right, this is not the version used in the SDK. The SDK uses version 0.9.22, which is about two years old. I linked master as an example of what the "fixed" method should look like. You shouldn't bother with it anyway, as it still times out randomly, which means either that master still has the bug or that I didn't back-port the fix entirely. Since I don't want to spend too much time on this, I decided to remove the timeout-disconnection code entirely from the library and compiled my own version to use with the GameLift SDK.

Here is the code used by the SDK if you want to dig deeper or compile your own hotfix like I did: https://github.com/Quobject/EngineIoClientDotNet/releases/tag/0.9.22

I am not the maintainer, by the way; I just happen to need GameLift to work on .NET 4.x.

I can confirm, however, that it IS a ping timeout bug. It becomes very obvious when the debug trace file is enabled in EngineIoClientDotNet: the client closes the connection even though the pong has been received. The issue is documented here: https://github.com/Quobject/EngineIoClientDotNet/issues/43 . The trace logs are very similar to what I experienced.

I have no idea if the bug is truly fixed in a later release, as upgrading the library causes a DLL hell I can't afford to spend time fixing. I decided to just "hack" my fix in for now by removing the timeout code. I figured auxproxy will kill the process anyway in case of a real timeout, so this shouldn't be an issue.

Best of luck!

UPDATE: Hotfix for the issue

Thanks to @David-342562, I found the code causing this problem and was able to write a patch.

Yes, there is something wrong with the timeout… but there is a bigger problem: the false timeouts are random because there is a race condition in the code. Clearly, the developer didn't know what they were doing: threads are used to implement the event system, and a lock is nowhere to be found.

Two classes are responsible for calling OnHeartbeat(), where PingTimeoutTimer is reset. The timer is reset when a pong packet is received from the server, and it is also reset every time any packet is received. "Pinging" or "heartbeat" is, by definition, checking on the connection at a regular interval, so in this case the timeout event gets delayed whenever any packet arrives, pushing the two timers (the other being PingIntervalTimer) out of sync. Even out of sync, the ping timer works fine as long as the server responds to ping packets. **The problem is that OnHeartbeat() can be run at the same time by multiple threads.** It is possible for the previous timeout timer to be left unstopped, which explains the random behaviour. EasyTimer is implemented using Task in the net45 version and a BackgroundWorker thread in net35.
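
As a minimal sketch of the kind of serialization that is missing (illustrative only; the field names and the EasyTimer/OnClose calls are paraphrased from the library and this is not the actual patch), OnHeartbeat() has to stop the previous timeout timer and arm the new one atomically:

// Illustrative fix: make "stop the old timer, start a new one" atomic so a stale
// PingTimeoutTimer can never survive and fire a false "ping timeout".
private readonly object _heartbeatLock = new object();
private EasyTimer _pingTimeoutTimer; // stands in for the library's PingTimeoutTimer field

private void OnHeartbeat(long timeout)
{
    lock (_heartbeatLock)
    {
        // Cancel whatever timer the previous heartbeat armed...
        if (_pingTimeoutTimer != null)
        {
            _pingTimeoutTimer.Stop();
        }

        // ...before arming a new one. Without the lock, two threads can each grab
        // the old timer, and one of the replaced timers keeps running until it
        // fires and closes a perfectly healthy connection.
        _pingTimeoutTimer = EasyTimer.SetTimeout(
            () => OnClose("ping timeout"),
            (int)timeout);
    }
}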

I was not joking when I said replacing the WebSocket library is an option to consider.

The patch

I pushed my patch to my GitHub repo. Open the solution file (Src/EngineIoClientDotNet.net45/EngineIoClientDotNet.net45.sln), build it, and replace the DLL (EngineIoClientDotNet.dll).

This fix only removes the case where OnHeartbeat() is called from different threads concurrently. Now that I've seen that the author of the library had no idea what they were doing, who knows where else dangerous things like Task are used.

This is a race problem, where the bug happens randomly. "Processes are not getting killed after applying your patch" isn't sufficient proof on its own. You can only solve race problems by inspecting and fully understanding the code, so don't expect too much from this.

Attached

7331-engineioclientdotnet.zip (34.1 KB)

It seems to be working well for now. I think I can finally put this behind me.

Thank you for your efforts!