Firstly, this post is for the developers working for Amazon to see.
To end users using GameLift: Game processes running on Amazon Linux instance in healthy state will get terminated randomly(about 30-60 after creation). Only .NET 4.5 version of SDK is affected. To improve service quality, the problem must be resolved.
Environment
- Fleet OS: Amazon Linux (VERSION_ID=“2017.09”)
- GameLift Server SDK: 3.2.1 for .NET 4.5
- Game Engine: Unity 2018.1
Description
There are two bugs, one in Java process(auxproxy) and one in GameServer SDK, involved. As far as I know, 2 WebSocket connections(game -> auxproxy) are used for each controll game process. The one that used to report health status is disconnected randomly(1000-3600 seconds interval) for reasons unknown. The SDK handles the disconnection event by reestablishing the lost WebSocket connection (EngineIoClientDotNet.Client.Socket.Options.Reconnection
defaults to true). However, the GameLift service process(auxproxy) terminates the associated game process after 3 minutes of the event while it’s getting health reports from the game process through the recovered WebSocket connection. It seems that the auxproxy does not handle the reconnection case since the connections are loopback, and the developers never thought that they would get disconnected.
Analysis
I tried to identify the bug in the SDK but failed to do so because the SDK depends on the bloated unorganised old WebSocket project called ‘WebSocket4Net’. I also searched EngineIoClientDotNet, SocketIoClientDotNet only to find out that there’s no problem in the libraries. They only close connections when the server disconnects and in case of parse error. EngineIoClientDotNet uses the unspecified version of socket implementation called ‘SuperSocket’. The developer did not specify which version is used, so I lost track there. My best guess is that connections are dropped due to the problem in the SuperSocket library.
The bug in the SDK is not specific to the Amazon Linux environment. The disconnection problem is also shown with Windows binaries, too.
Packet Capture
The game process sends the server a normal “Close” WebSocket frame (code 1000 w/o reason string payload). There’s no sign of malformed WebSocket frame. Note that the server sends nothing that could trigger the library’s error handler prior to the disconnection event.
Logs
It is evident from the log that auxproxy does not handle reconnection of WebSockets.
01 Jul 2018 08:48:13,818 [WARN] (nioEventLoopGroup-3-7) com.amazon.whitewater.auxproxy.pbuffer.SDKListenerImpl: SDK disconnected: /127.0.0.1:58110
01 Jul 2018 08:48:13,818 [INFO] (nioEventLoopGroup-3-7) com.amazon.whitewater.auxproxy.pbuffer.ProcessIdResolver: Resolved 23530 for client /127.0.0.1:58110
01 Jul 2018 08:48:13,818 [DEBUG] (nioEventLoopGroup-3-7) com.amazon.whitewater.auxproxy.pbuffer.SDKListenerImpl: Attempting to retrieve application runner by process ID.
01 Jul 2018 08:48:13,818 [INFO] (nioEventLoopGroup-3-7) com.amazon.whitewater.auxproxy.pbuffer.SDKListenerImpl: GameProcess of disconnected SDK is in state: READY and is not being terminated, force terminating process: 23530
01 Jul 2018 08:48:13,818 [INFO] (nioEventLoopGroup-3-7) com.amazon.whitewater.auxproxy.process.ApplicationRunner: Forcing termination of {processUUID=14a93cac-8e73-481d-b8c6-62e942da8363,processId=23530} with deadline instant 2018-07-01T08:52:43.818Z and reason PROCESS_UNHEALTHY
01 Jul 2018 08:48:13,818 [INFO] (nioEventLoopGroup-3-7) com.amazon.whitewater.auxproxy.api.AmazonWhitewaterAwsJsonClient: terminateHostProcess: applicationId = 76310f40-197b-413b-9e45-d2d924852fc9 , hostId = i-08445d3501599e67b, processId = 14a93cac-8e73-481d-b8c6-62e942da8363
...
01 Jul 2018 08:48:14,825 [DEBUG] (nioEventLoopGroup-3-12) com.corundumstudio.socketio.handler.AuthorizeHandler: Handshake authorized for sessionId: ce24cebc-a2fd-4bbc-8614-656db13daa99, query params: {pID=[23530], sdkVersion=[3.2.1], sdkLanguage=[CSharp], EIO=[3], transport=[websocket], t=[636660316948238360-2]} headers: {Origin=[ws://127.0.0.1:5757], Sec-WebSocket-Key=[MWM2NzFkNzAtNTBlZS00Nw==], content-length=[0], Connection=[Upgrade], Sec-WebSocket-Version=[13], Host=[127.0.0.1:5757], Upgrade=[WebSocket]}
01 Jul 2018 08:48:14,825 [DEBUG] (nioEventLoopGroup-3-12) io.netty.handler.codec.http.websocketx.WebSocketServerHandshaker: [id: 0xf94437a1, L:/127.0.0.1:5757 - R:/127.0.0.1:58838] WebSocket version V13 server handshake
01 Jul 2018 08:48:14,825 [DEBUG] (nioEventLoopGroup-3-12) io.netty.handler.codec.http.websocketx.WebSocketServerHandshaker: WebSocket version 13 server handshake key: MWM2NzFkNzAtNTBlZS00Nw==, response: 7q17cO0w3giRJ7GTrQIctLe9c+I=
01 Jul 2018 08:48:14,826 [DEBUG] (nioEventLoopGroup-3-12) com.corundumstudio.socketio.handler.ClientHead: binding channel: [id: 0xf94437a1, L:/127.0.0.1:5757 - R:/127.0.0.1:58838] to transport: WEBSOCKET
01 Jul 2018 08:48:14,826 [DEBUG] (nioEventLoopGroup-3-12) io.netty.handler.codec.http.websocketx.WebSocket08FrameEncoder: Encoding WebSocket Frame opCode=1 length=113
01 Jul 2018 08:48:14,826 [DEBUG] (nioEventLoopGroup-3-12) io.netty.handler.codec.http.websocketx.WebSocket08FrameEncoder: Encoding WebSocket Frame opCode=1 length=2
01 Jul 2018 08:48:14,826 [INFO] (nioEventLoopGroup-3-12) com.amazon.whitewater.auxproxy.pbuffer.SDKListenerImpl: SDK connected: /127.0.0.1:58838
01 Jul 2018 08:48:14,826 [INFO] (nioEventLoopGroup-3-12) com.amazon.whitewater.auxproxy.pbuffer.ProcessIdResolver: Resolved 23530 for client /127.0.0.1:58838
01 Jul 2018 08:48:14,826 [DEBUG] (nioEventLoopGroup-3-12) com.amazon.whitewater.auxproxy.pbuffer.SDKListenerImpl: Attempting to retrieve application runner by process ID.
01 Jul 2018 08:48:14,826 [INFO] (nioEventLoopGroup-3-12) com.amazon.whitewater.auxproxy.pbuffer.SDKVersionResolver: Resolved 3.2.1 for client /127.0.0.1:58838
01 Jul 2018 08:48:14,826 [INFO] (nioEventLoopGroup-3-12) com.amazon.whitewater.auxproxy.pbuffer.SDKVersionResolver: Resolved 3.2.1 for client /127.0.0.1:58838
01 Jul 2018 08:48:14,826 [INFO] (nioEventLoopGroup-3-12) com.amazon.whitewater.auxproxy.pbuffer.SDKVersionResolver: Resolved 3.2.1 for client /127.0.0.1:58838
01 Jul 2018 08:48:14,826 [INFO] (nioEventLoopGroup-3-12) com.amazon.whitewater.auxproxy.pbuffer.SDKVersionResolver: Resolved 3.2.1 for client /127.0.0.1:58838
01 Jul 2018 08:48:14,826 [INFO] (nioEventLoopGroup-3-12) com.amazon.whitewater.auxproxy.pbuffer.SDKVersionResolver: Resolved 3.2.1 for client /127.0.0.1:58838
01 Jul 2018 08:48:14,826 [INFO] (nioEventLoopGroup-3-12) com.amazon.whitewater.auxproxy.pbuffer.SDKVersionResolver: Resolved 3.2.1 for client /127.0.0.1:58838
01 Jul 2018 08:48:14,826 [INFO] (nioEventLoopGroup-3-12) com.amazon.whitewater.auxproxy.pbuffer.SDKVersionResolver: Resolved 3.2.1 for client /127.0.0.1:58838
01 Jul 2018 08:48:14,826 [DEBUG] (nioEventLoopGroup-3-12) com.corundumstudio.socketio.transport.WebSocketTransport: сlient ce24cebc-a2fd-4bbc-8614-656db13daa99 handshake completed
...
01 Jul 2018 08:52:18,962 [INFO] (ThreadPoolPacketHandler-3) com.amazon.whitewater.auxproxy.pbuffer.SDKListenerImpl: onReportHealth received from /127.0.0.1:58838 with data: [8, 1] and ackRequest requested? true
01 Jul 2018 08:52:18,962 [INFO] (ThreadPoolPacketHandler-3) com.amazon.whitewater.auxproxy.pbuffer.ProcessIdResolver: Resolved 23530 for client /127.0.0.1:58838
...
01 Jul 2018 08:52:44,461 [INFO] 0f1b00d3-047b-49c6-bcd2-6aa77692fc59 (pool-9-thread-1) com.amazon.whitewater.auxproxy.deadline.DeadlineManager: Reached deadline GenericDeadline(deadlineDueTime=2018-07-01T08:52:43.818Z, shouldEnforce=com.amazon.whitewater.auxproxy.process.deadline.ProcessDeadlineManager$$Lambda$84/897297607@7e7a852, enforcementAction=com.amazon.whitewater.auxproxy.process.tasks.ForceTerminationTimeoutTask@40bb4a57, friendlyName=FORCE_TERMINATION_DEADLINE_ENFORCER) and required enforcement, executing action.
01 Jul 2018 08:52:44,461 [INFO] f5200f48-acc1-4b0f-92fc-c867343cc9f0 (pool-11-thread-40) com.amazon.whitewater.auxproxy.process.tasks.ForceTerminationTimeoutTask: Force Termination deadline reached triggering onDisconnect for the process
01 Jul 2018 08:52:44,461 [INFO] f5200f48-acc1-4b0f-92fc-c867343cc9f0 (pool-11-thread-40) com.amazon.whitewater.auxproxy.process.ApplicationRunner: SDK disconnected.
01 Jul 2018 08:52:44,461 [INFO] f5200f48-acc1-4b0f-92fc-c867343cc9f0 (pool-11-thread-40) com.amazon.whitewater.auxproxy.process.event.GameProcessAdvertiser: Handling GameProcessTerminatedEvent: event = GameProcessTerminatedEvent(gameServer={processUUID=14a93cac-8e73-481d-b8c6-62e942da8363,processId=23530}), readyProcesses = [], advertisedProcesses = [{processUUID=2b588ac1-2a8c-403f-a2af-097ad3c7e64a,processId=24831}, {processUUID=1729555d-f5f4-4b4f-9676-60799f706091,processId=23403}, {processUUID=9a495ace-2a4b-4e7b-a5b6-04e90824d73b,processId=20058}, {processUUID=435878b8-486e-4d77-825e-6593e517b65c,processId=23182}, {processUUID=3b6926d3-66dd-4b71-b3e7-e1110863d7f0,processId=23484}, {processUUID=c66307a7-0bb6-42b6-8335-1a0658de14ac,processId=24647}, {processUUID=b67b71a0-9f69-4a07-97ab-cb6135afbed0,processId=21605}, {processUUID=6f8fafa7-9841-4648-af46-28fde0a55e75,processId=21857}, {processUUID=61bed66b-324f-46d6-901a-42ed6476fd98,processId=21544}]
01 Jul 2018 08:52:44,461 [INFO] f5200f48-acc1-4b0f-92fc-c867343cc9f0 (pool-11-thread-40) com.amazon.whitewater.auxproxy.process.event.GameSessionMonitorEvictor: Handling event: GameProcessTerminatedEvent(gameServer={processUUID=14a93cac-8e73-481d-b8c6-62e942da8363,processId=23530})
Connection drop is also reproducible using GameLiftLocal.
15:00:52,750 INFO || - [SDKListenerImpl] nioEventLoopGroup-3-3 - onReportHealth received from /127.0.0.1:55602 with health status: healthy
15:00:53,934 INFO || - [SDKListenerImpl] nioEventLoopGroup-3-4 - SDK disconnected: /127.0.0.1:55572
15:00:53,934 INFO || - [SDKListenerImpl] nioEventLoopGroup-3-4 - SDK pid is 15965
15:00:53,934 INFO || - [GameTerminationManager] nioEventLoopGroup-3-4 - Handling process disconnect event for PID: 15965
15:00:54,951 INFO || - [SDKListenerImpl] nioEventLoopGroup-3-5 - SDK connected: /127.0.0.1:55660
15:00:54,953 INFO || - [SDKListenerImpl] nioEventLoopGroup-3-5 - SDK pid is 15965, sdkVersion is 3.2.1 and sdkLanguage is CSharp
15:00:54,953 INFO || - [SDKListenerImpl] nioEventLoopGroup-3-5 - NOTE: Only SDK versions 3.1.5 and above are supported in GameLiftLocal!
15:01:48,829 INFO || - [SDKListenerImpl] nioEventLoopGroup-3-1 - onReportHealth received from /127.0.0.1:55566 with health status: healthy
15:01:49,030 INFO || - [SDKListenerImpl] nioEventLoopGroup-3-3 - onReportHealth received from /127.0.0.1:55570 with health status: healthy
15:01:49,031 INFO || - [HostProcessManager] nioEventLoopGroup-3-3 - Process (pID: 15965) health changed from false to true!
15:01:49,462 INFO || - [SDKListenerImpl] nioEventLoopGroup-3-5 - onReportHealth received from /127.0.0.1:55574 with health status: healthy
I programmed my Unity game to run following code to see if I can get any info from the SDK’s logging system but it showed nothing.
var hierarchy = (Hierarchy)LogManager.GetRepository();
hierarchy.Root.RemoveAllAppenders(); /*Remove any other appenders*/
var fileAppender = new FileAppender();
fileAppender.AppendToFile = true;
fileAppender.LockingModel = new FileAppender.MinimalLock();
fileAppender.File = "EngineIoClientDotNet.log";
var pl = new PatternLayout();
pl.ConversionPattern = "%d [%2%t] %-5p [%-10c] %m%n";
pl.ActivateOptions();
fileAppender.Layout = pl;
fileAppender.ActivateOptions();
BasicConfigurator.Configure(fileAppender);
Extra bits
A piece of advice: no one will find what’s wrong in SuperSocket. To fix this, just don’t use WebSocket4Net and get a new decent WebSocket library. That’ll be the best for the SDK’s future.
Attached
7111-gamelift-bug-analysis.zip (1020 KB)gamelift-bug-analysis.zip
- Wireshark capture file
- auxproxy log
- SDK log (log4net)
- Fleet event log (retrieved using awscli)
- GameLift Local output