On Monday, 11 February 2019, Lilt’s database became unavailable. The downtime lasted a total of 30 minutes, from Thursday 23:45 to Friday 00:15 Pacific Time.
This incident resulted from running a migration on one of our production database tables to prepare for a backend update that enables a new feature in the client application. The migration was run on one of our largest and most frequently queried tables, which saturated the database instance’s available memory. Moreover, because the migration was run online to avoid scheduled downtime, the database continued servicing our application and backend services while it ran, and the maximum number of allowed connections was eventually reached. The Lilt engineer running the migration was monitoring it as it ran, but once the system became unresponsive, the engineer could no longer connect to the database to cancel the query causing the problem. This, in turn, led some of our backend services to restart a number of times, making the system intermittently unavailable and unusable during the outage period. To resolve the issue, we restarted the database, which took approximately one minute.
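As an illustration only, the connection exhaustion described above can be sketched with a toy model. Nothing here reflects Lilt's actual database or configuration: `ToyDatabase`, `simulate`, the limit of 5 connections, and the `reserved_for_admin` parameter are all hypothetical names and values chosen for the example. The reserved operator slot is an assumption used to show the contrast, not a measure announced in this report.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDatabase:
    """Toy model of a server with a fixed connection limit (not a real driver)."""
    max_connections: int
    reserved_for_admin: int = 0  # hypothetical slots held back for operators
    open_connections: list = field(default_factory=list)

    def connect(self, who: str, admin: bool = False) -> bool:
        # Ordinary clients may only use the non-reserved part of the limit.
        limit = self.max_connections
        if not admin:
            limit -= self.reserved_for_admin
        if len(self.open_connections) >= limit:
            return False  # connection refused: limit reached
        self.open_connections.append(who)
        return True


def simulate(reserved_for_admin: int) -> bool:
    """Return True if an operator can still connect after clients pile up."""
    db = ToyDatabase(max_connections=5, reserved_for_admin=reserved_for_admin)
    db.connect("migration")
    # Queries stuck behind the migration never finish in this model,
    # so every service connection stays open and counts against the limit.
    for i in range(10):
        db.connect(f"service-{i}")
    # The operator now tries to connect in order to cancel the migration.
    return db.connect("operator", admin=True)


if __name__ == "__main__":
    print("no reserved slot:", simulate(0))
    print("one reserved slot:", simulate(1))
```

In this model, with no reserved slot the service connections consume the whole limit and the operator is refused, which mirrors the situation described above. With one slot held back, the operator can still connect and cancel the query.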
We are taking this issue seriously and will take measures to prevent this type of problem in the future. We have already begun discussing how to prevent it when running migrations on large, frequently queried tables in our database.
We sincerely apologize to all of our customers for allowing this downtime to occur and for all the inconvenience it caused.