Client application unavailable due to database migration

Incident Report for Lilt

Postmortem

On Thursday, 14 February 2019, Lilt’s database became unavailable. The downtime lasted a total of 30 minutes, from Thursday 23:45 to Friday 00:15 Pacific Time.

This incident resulted from a migration we ran on one of our production database tables to prepare for a backend update that enables a new feature in the client application. The migration ran on one of our largest and most frequently queried tables and saturated the database instance’s available memory. Because the migration was run online, to avoid scheduled downtime, the database continued to serve our application and backend services while it ran. The number of allowed connections was eventually reached. The Lilt engineer running the migration was monitoring it as it ran, but once the system became unresponsive, the engineer could no longer connect to the database to cancel the query causing the problem. This, in turn, caused some of our backend services to restart a number of times, making the system intermittently unavailable and unusable during the outage period. To resolve the issue, we restarted the database, which took approximately one minute.

We are taking this issue seriously and will put measures in place to prevent this type of problem in the future. We have already begun discussing how to prevent it when running migrations on the large and frequently queried tables in our database.
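
One general technique for this class of problem is to apply a change to a large table in small batches, so that each transaction stays short and the database can keep serving other queries between batches. The following is a minimal sketch of that idea only. It is not a description of Lilt's migration, database, or tooling: it uses Python's built-in sqlite3 module, and the table `segments` and the columns `id`, `text`, and `text_length` are hypothetical names chosen for illustration.

```python
import sqlite3


def backfill_in_batches(conn, batch_size=1000):
    """Fill a new column in small batches, committing after each batch.

    The table `segments` and its columns are hypothetical.
    Returns the number of batches that were executed.
    """
    batches = 0
    last_id = 0
    while True:
        # Fetch the next slice of primary keys, in order.
        rows = conn.execute(
            "SELECT id FROM segments WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            break
        first, last = rows[0][0], rows[-1][0]
        # Update only this slice, so the transaction stays short.
        conn.execute(
            "UPDATE segments SET text_length = length(text) "
            "WHERE id BETWEEN ? AND ?",
            (first, last),
        )
        conn.commit()
        last_id = last
        batches += 1
    return batches


# Small in-memory demonstration with made-up data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE segments ("
    "id INTEGER PRIMARY KEY, text TEXT, text_length INTEGER)"
)
conn.executemany(
    "INSERT INTO segments (text) VALUES (?)",
    [("x" * (i % 7),) for i in range(2500)],
)
conn.commit()

batches = backfill_in_batches(conn, batch_size=1000)
remaining = conn.execute(
    "SELECT COUNT(*) FROM segments WHERE text_length IS NULL"
).fetchone()[0]
```

In a production setting, a pause between batches and a per-statement timeout would typically be added as well. Both depend on the specific database in use and are left out of this sketch.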

We sincerely apologize to all of our customers for this downtime and for any inconvenience it caused.

Posted Feb 15, 2019 - 09:46 CET

Resolved

This incident has been resolved.

Posted Feb 15, 2019 - 09:17 CET

Update

The database unavailability has been resolved.

Posted Feb 15, 2019 - 09:16 CET

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Feb 15, 2019 - 09:15 CET

Update

We are continuing to work on a fix for this issue.

Posted Feb 15, 2019 - 08:54 CET

Identified

The issue has been identified and a fix is being implemented.

Posted Feb 15, 2019 - 08:49 CET

Investigating

We are currently investigating this issue.

Posted Feb 15, 2019 - 08:45 CET

This incident affected: Translate (Translate) and Manage.