IncidentLog

From SoylentNews
Revision as of 23:24, 11 March 2014 by Mrcoolbp (talk | contribs)
Jump to navigation Jump to search

This is a log of incidents that occurred and how they were remedied.

March 11, 2014

15:30-17:00 EDT

Summary. There was a report of a problem at http://soylentnews.org; logged-in users saw a comment count of zero (0) for the three newest stories on the main page. When a user went to the actual story, there were comments there. Separately, non-logged-in users failed to even see the most-recent 3 stories. In light of recent developments, credentials had been removed for many people; the one person who had access to everything was unavailable.

What went well:

  • Generally followed procedures outlined in ICS (Incident Command System)
  • Used private channel on http://irc.sylnt.us/ to communicate in real time.
  • Incident Commander performed a .op and prefixed nick with "cmdr_" to clearly identify role.
  • The main site stayed up as we attempted to solve the problem.
  • Staff jumped in and offered their services.
  • Provided some updates to the community via IRC on channel: #Soylent
  • Focused on gathering data to identify problem and outline possible solutions.
  • Performed confirmation after it appeared the problem was fixed, to ensure it actually was fixed.
  • Requested feedback on lessons learned.
  • Incident Commander admitted mistakes.
  • There were some ruffled feathers, but all-in-all people worked well together.
  • Established a follow-up to find underlying cause of problem and to document solution.

What did not go well:

  • Did not use staff mailing list to inform all staff there was a problem.
  • Incident Commander failed to recognize when a technical lead role was needed and to delegate a task leader for it.
  • Key people who had domain knowledge and access were unavailable.
  • Other people with the know-how to diagnose and fix the problem lacked the credentials they needed to do so.
  • Lacked alternative means to contact key people. (e.g. phone numbers)

Takeaway.

There is a fine core of dedicated professionals who genuinely want to see the site succeed. They rose to the occasion and strove to work together. We successfully diagnosed and solved the problem without causing further damage. We learned what happened when there was a failure to identify when a task leader was needed and to delegate appropriately. We successfully coordinated efforts from people who were distributed in multiple locations and time zones.


11:30-13:30 EST

In migrating the service linode and remaining credit from the other 2 unused linodes to NCommander's account, the original account had to be closed. After closing the account and transferring over the svc. linode, somehow the Zone file was lost and nearly everything went down.

After much confusion and digging, we figured out what the problem is, I called linode, had the file transferred to Ncommanders account and reinstated.

I'll post more details soon. mrcoolbp (talk) 23:24, 11 March 2014 (UTC)