IncidentLog

This is a log of incidents that occurred and how they were remedied.

March 11, 2014

15:30-17:00 EDT

Summary. A problem was reported at http://soylentnews.org: logged-in users saw a comment count of zero (0) for the three newest stories on the main page, even though comments were visible when they opened the stories themselves. Separately, non-logged-in users did not see the three newest stories at all. In light of recent developments, credentials had been removed for many people, and the one person who had access to everything was unavailable.

What went well:

  • Generally followed procedures outlined in ICS (Incident Command System)
  • Used private channel on http://irc.sylnt.us/ to communicate in real time.
  • Incident Commander performed a .op and prefixed their nick with "cmdr_" to clearly identify the role.
  • The main site stayed up as we attempted to solve the problem.
  • Staff jumped in and offered their services.
  • Provided some updates to the community via IRC on channel: #Soylent
  • Focused on gathering data to identify problem and outline possible solutions.
  • Performed confirmation after it appeared the problem was fixed, to ensure it actually was fixed.
  • Requested feedback on lessons learned.
  • Incident Commander admitted mistakes.
  • There were some ruffled feathers, but all in all people worked well together.
  • Established a follow-up to find underlying cause of problem and to document solution.

What did not go well:

  • Did not use staff mailing list to inform all staff there was a problem.
  • Incident Commander failed to recognize that a technical lead role was needed and to delegate a task leader to fill it.
  • Key people who had domain knowledge and access were unavailable.
  • Other people with the know-how to diagnose and fix the problem lacked the credentials they needed to do so.
  • Lacked alternative means to contact key people. (e.g. phone numbers)

Takeaway.

There is a fine core of dedicated professionals who genuinely want to see the site succeed. They rose to the occasion and strove to work together. We successfully diagnosed and solved the problem without causing further damage. We learned what happens when the need for a task leader goes unrecognized and delegation does not take place. We successfully coordinated the efforts of people distributed across multiple locations and time zones.

After Action Report

Upon looking at the slashd log, Dev staff confirmed that slashd died at almost the same time that DNS was taken down in the previous issue. Dev staff believe that, because slash relies on the fully qualified domain name for many URLs, it did not handle the loss of DNS very well and slashd died because of it. Our recommendation is to add the FQDN to the hosts file and to add a watchdog to the system to make sure slashd and other services stay running. Given that this is an Ubuntu system, it should be possible to get Upstart to respawn slashd as a system service. It seems probable that slashd died at the hands of the OOM killer, but we are lacking the kernel logs that might tell us whether it was slashd or some other task that caused the OOM state.
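
As a rough sketch of those two recommendations (the IP address, install path, and slashd invocation below are placeholders rather than values from the production box; if slashd daemonizes itself, the Upstart job would also need an "expect fork" or "expect daemon" stanza):

    # /etc/hosts -- pin the site's FQDN locally so slash can still resolve it
    # if external DNS disappears again (203.0.113.10 is a placeholder address)
    127.0.0.1      localhost
    203.0.113.10   soylentnews.org   www.soylentnews.org

    # /etc/init/slashd.conf -- Upstart job that restarts slashd if it dies
    description "slashd (Slash daemon)"
    start on runlevel [2345]
    stop on runlevel [016]
    respawn
    respawn limit 10 60
    # path and argument are assumptions; adjust to the actual slash install
    exec /usr/local/slash/sbin/slashd soylentnews.org

A cron job or monit could provide the same watchdog function; Upstart's respawn is simply what ships with Ubuntu.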

11:30-13:30 EST

In migrating the service linode and the remaining credit from the other two unused linodes to NCommander's account, the original account had to be closed. After closing the account and transferring over the service linode, the DNS zone records were somehow lost and nearly everything went down.

We tried rebooting the production linode via root access. This led to 503 errors.

After much confusion and digging, we figured out what the problem was (the DNS zone records had gone missing with the account closure). I called Linode and had the zone file(s) transferred to NCommander's account and reinstated.
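
For future reference, once the zone is reinstated it can be confirmed from any machine by querying the nameservers directly. The sketch below assumes the zone is served from Linode's standard nameservers (ns1.linode.com and friends):

    # Non-empty answers mean the zone's records are being served again.
    dig +short soylentnews.org SOA @ns1.linode.com
    dig +short soylentnews.org A @ns1.linode.com
    dig +short soylentnews.org NS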

This caused problems with slashd (see above).

mrcoolbp (talk) 23:24, 11 March 2014 (UTC)