IncidentLog

From SoylentNews
Revision as of 22:34, 11 March 2014 by Martyb (talk | contribs) (Added date/time stamp of incident)

Report of an incident that occurred on March 11, 2014, from 15:30 to 17:00 EDT.

Summary. A problem was reported at http://soylentnews.org: logged-in users saw a comment count of zero (0) for the three newest stories on the main page, even though the comments were there when a user opened the story itself. Separately, users who were not logged in could not see the three most recent stories at all. In light of recent developments, credentials had been removed for many people, and the one person who had access to everything was unavailable.

What went well:

  • Generally followed procedures outlined in ICS (Incident Command System)
  • Used private channel on http://irc.sylnt.us/ to communicate in real time.
  • Incident Commander performed a .op and prefixed nick with "cmdr_" to clearly identify role.
  • The main site stayed up as we attempted to solve the problem.
  • Staff jumped in and offered their services.
  • Provided some updates to the community via IRC on the #Soylent channel.
  • Focused on gathering data to identify problem and outline possible solutions.
  • Verified the fix once the problem appeared to be resolved, to confirm that it actually was.
  • Requested feedback on lessons learned.
  • Incident Commander admitted mistakes.
  • There were some ruffled feathers, but all in all people worked well together.
  • Established a follow-up to find the underlying cause of the problem and to document the solution.

What did not go well:

  • Did not use staff mailing list to inform all staff there was a problem.
  • Incident Commander failed to recognize when a technical lead role was needed and to appoint a task leader for it.
  • Key people who had domain knowledge and access were unavailable.
  • Other people with the know-how to diagnose and fix the problem lacked the credentials they needed to do so.
  • Lacked alternative means (e.g., phone numbers) to contact key people.

Takeaway.

There is a fine core of dedicated professionals who genuinely want to see the site succeed. They rose to the occasion and strove to work together. We successfully diagnosed and solved the problem without causing further damage. We learned what happens when no one recognizes that a task leader is needed and delegates accordingly. We successfully coordinated the efforts of people spread across multiple locations and time zones.