MT Suggestions Unresponsive
Incident Report for Lilt
Postmortem

A proverbial Rube Goldberg machine was at work both when the issue was caused and while it was being fixed.

Of the 3 translate nodes that were running, 2 raised SQLAlchemy exceptions like the one below:

sqlalchemy.exc.TimeoutError: QueuePool limit of size 50 overflow 10 reached, connection timed out, timeout 30 (Background on this error at: http://sqlalche.me/e/3o7r)
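
For readers unfamiliar with this error: a pool of size 50 with overflow 10 allows at most 60 connections to be checked out at once, and the next checkout waits up to the timeout (30 seconds here) before failing. The sketch below is a toy simulation of that behavior using only the Python standard library. It is not SQLAlchemy's implementation, and the TinyPool class and its method names are hypothetical.

```python
import threading


class TinyPool:
    """Toy stand-in for a bounded connection pool (not SQLAlchemy itself)."""

    def __init__(self, size, overflow, timeout):
        # At most size + overflow connections may be checked out at once.
        self._slots = threading.BoundedSemaphore(size + overflow)
        self._timeout = timeout

    def checkout(self):
        # Wait up to `timeout` seconds for a free slot, then give up.
        if not self._slots.acquire(timeout=self._timeout):
            raise TimeoutError("pool limit reached, connection timed out")

    def release(self):
        self._slots.release()


# Same limits as in the error message; a short timeout keeps the demo fast.
pool = TinyPool(size=50, overflow=10, timeout=0.05)
for _ in range(60):
    pool.checkout()  # never released, like a connection that is held or leaked

try:
    pool.checkout()  # the 61st checkout has no slot to take
    exhausted = False
except TimeoutError:
    exhausted = True
```

In this simulation the pool recovers as soon as a single connection is released, which suggests looking for connections that were held for a long time or never returned.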

This exception was raised while the version_id of the model was being retrieved, which happens for every incoming translation request. The timeout left version_id set to None, which in turn caused errors in downstream code that performs numerical operations on version_id.
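
The failure chain described above can be sketched as follows. This is a minimal illustration, not the actual service code: the function names are hypothetical, and Python's built-in TimeoutError stands in for sqlalchemy.exc.TimeoutError. It shows how a swallowed timeout becomes None, and how an explicit check surfaces the real cause instead of a confusing arithmetic error further downstream.

```python
from typing import Callable, Optional


class VersionLookupError(RuntimeError):
    """Raised when the model's version_id could not be retrieved."""


def fetch_version_id(query: Callable[[], int]) -> Optional[int]:
    # Mimics the failure mode: a timeout during the lookup turns into None.
    try:
        return query()
    except TimeoutError:
        return None


def next_version(version_id: Optional[int]) -> int:
    # Without this check, None + 1 raises a TypeError far from the root cause.
    if version_id is None:
        raise VersionLookupError("version_id unavailable; database lookup failed")
    return version_id + 1


def timed_out() -> int:
    # Stand-in for a query that cannot get a connection from the pool.
    raise TimeoutError("QueuePool limit reached")
```

Failing fast at the point where version_id is first used keeps the database timeout visible in the logs.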

At some point, the translations were being served from the background models only.

The first node started showing these exceptions at about 10:24 AM PST, and the second at about 10:40 AM PST. Nothing obvious linked the incidents on the two nodes: the requests immediately preceding the failures were examined for commonality in model id, project id, request type, and so on, but none was found.

After the hardcoded production values were reverted in the code, a release was started. This release exposed another issue in the code, related to Healthchecks. A fix adding an import statement for the MqConfig class had been merged to the develop branch but not to master. The import was needed mainly because of changes made for the pubsub feature, which Judith is currently investigating. Because this change was not present in master, Healthchecks failed and the nodes never started, even after about 5 restarts.
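
One way to catch a missing import like this before a release is a small smoke check that tries to resolve every name the Healthchecks depend on. The sketch below is only an assumption about how such a check could look: the function name is hypothetical, and because this report does not say which module defines MqConfig, the usage example relies on a standard-library module as a stand-in.

```python
import importlib


def missing_names(requirements):
    """Return the (module, name) pairs that cannot be resolved."""
    missing = []
    for module_name, attr in requirements:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            # The module itself is absent on this branch or environment.
            missing.append((module_name, attr))
            continue
        if not hasattr(module, attr):
            # The module exists, but the expected class or function does not.
            missing.append((module_name, attr))
    return missing


# Stand-in usage: "json" exists and has "loads", but it has no "MqConfig".
report = missing_names([("json", "loads"), ("json", "MqConfig")])
```

Running a check of this kind against master in CI would flag the gap before the nodes are restarted.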

After a rollback, suggested by Chase, the nodes came back up as usual and started serving translation requests as expected.

There are 2 unresolved issues currently:
1) What was the actual reason for the database query failures? It may well have been the change in hardcoded production values, but there are some doubts.
2) The master branch is currently not in working condition because of the missing import statement for the MqConfig class. This has to be fixed.

Posted Feb 13, 2019 - 22:28 UTC

Resolved
This incident has been resolved.
Posted Feb 13, 2019 - 21:35 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Feb 13, 2019 - 21:08 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Feb 13, 2019 - 19:41 UTC
Investigating
Investigating MT suggestions failing.
Posted Feb 13, 2019 - 19:24 UTC
This incident affected: App.