Database Restore
Incident Report for CxAlloy
Resolved
This incident has been resolved.
Posted Mar 24, 2022 - 08:44 EDT
Monitoring
We have reopened the site after verifying that the data is fully restored and application functionality is working as expected. Performance may be slower for the next few hours while the database is "warmed up".

I wish to again express our sincere apologies for the disruption this downtime has caused. We will be posting a postmortem in the next few days giving greater detail on what happened and the steps we are taking to mitigate the causes.
Posted Mar 23, 2022 - 18:29 EDT
Update
The restore has finished, and our initial assessment indicates that it is complete. We are now verifying the data and application stability before reopening the site, and we anticipate service being restored shortly.
Posted Mar 23, 2022 - 17:52 EDT
Update
The data restoration is progressing and is more than two-thirds complete.

We know how disruptive it is to have CxAlloy unavailable, and we are deeply sorry that we've had any outage at all, and such a long one in particular.

We will be providing a postmortem once the incident is fully resolved, but we can address some common questions now:

- The outage is the result of a database table that was unexpectedly corrupted during a code release last night. That application update, like many of our updates, required modifications to the database structure. For reasons that we do not yet know, the update to the database structure failed and triggered an automatic restart of the database. After the restart, one of our largest tables was corrupted.

- Once we established that recovering the corrupted table was not possible, we pulled the data from our replica backup and initiated a restore. This required us to delete the corrupted table and re-import all the records from our backup. Because this table is so large (over 100 million rows), we were not able to avoid significant downtime while we brought the records in.
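
For readers curious what "delete the corrupted table and re-import all the records" can look like in general, here is a minimal illustrative sketch. It is not the tooling used in this incident: it uses Python's built-in `sqlite3` module as a stand-in database, and the `checklist_lines` schema (`id`, `body`), the function name, and the batch size are all hypothetical. It copies rows from a backup connection into an empty table in primary-key order, committing one batch at a time so progress can be tracked.

```python
import sqlite3


def restore_in_batches(src, dst, table, batch_size=1000):
    """Copy all rows of `table` from `src` to `dst`, one committed batch at a time.

    Assumes a hypothetical schema (id INTEGER PRIMARY KEY, body TEXT) and that
    `table` is a trusted identifier (it is interpolated into the SQL text).
    Returns (rows_copied, rows_in_source).
    """
    total = src.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    copied = 0
    last_id = 0  # keyset pagination: resume after the last id already copied
    while True:
        rows = src.execute(
            f"SELECT id, body FROM {table} WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            break
        with dst:  # commits the batch on success, rolls back on error
            dst.executemany(
                f"INSERT INTO {table} (id, body) VALUES (?, ?)", rows
            )
        copied += len(rows)
        last_id = rows[-1][0]
    return copied, total


def demo():
    """Build a small in-memory 'backup' and restore it into an empty table."""
    src = sqlite3.connect(":memory:")
    dst = sqlite3.connect(":memory:")
    for conn in (src, dst):
        conn.execute(
            "CREATE TABLE checklist_lines (id INTEGER PRIMARY KEY, body TEXT)"
        )
    src.executemany(
        "INSERT INTO checklist_lines (id, body) VALUES (?, ?)",
        [(i, f"line {i}") for i in range(1, 2501)],
    )
    src.commit()
    return restore_in_batches(src, dst, "checklist_lines", batch_size=1000)
```

Even in this toy form, the cost scales with the number of rows, which is why a table of this size could not be brought back quickly.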
Posted Mar 23, 2022 - 13:34 EDT
Update
The data restoration has passed the halfway point.
Posted Mar 23, 2022 - 12:23 EDT
Update
At the current pace, we estimate the restoration will take approximately 10 hours to complete. We realize this is a long time and are doing everything we can to expedite the process.

In this case, the affected table is our checklist lines table, one of the largest tables within the application. We have explored whether we could bring the site back to partial functionality; however, because checklist lines are referenced in many other areas (as issue sources, connected to files, and so forth), it is not feasible to do so.

Importantly, all data is present in our backups and we don't anticipate any data loss from this incident.
Posted Mar 23, 2022 - 09:11 EDT
Update
The table is being restored. Because it is a very large table, this will take some time. We will provide an estimated time to completion in an upcoming update.
Posted Mar 23, 2022 - 08:23 EDT
Update
The corrupted table is currently being restored. We will continue to provide updates as the restore proceeds.
Posted Mar 23, 2022 - 07:27 EDT
Update
After the database restarted, one of the database tables reported that it is corrupted and needs to be restored. We are working to restore it.
Posted Mar 23, 2022 - 06:26 EDT
Update
We are continuing to work on a fix for this issue.
Posted Mar 23, 2022 - 05:38 EDT
Identified
During a routine update, the database initiated a restart. Due to the size of the database, a restart can take upwards of an hour, and the site will be unavailable during that time. We apologize for the inconvenience and will provide further updates as needed.
Posted Mar 23, 2022 - 05:34 EDT
This incident affected: CxAlloy TQ (Application, Servers).