IncidentLog

From SoylentNews

This is a log of incidents that occurred and how they were remedied.


March 20, 2014

00:34 GMT

It seems our wiki made a bit of a mess: it started a PHP script at every page load. Each script idled and lurked around, and after a while the server's memory filled up. As a result, the MySQL server got killed, which caused the wiki and mail server to stop functioning properly, and SpamAssassin was killed as well.

Note: because the MySQL server went down, most mail functions, including webmail, stopped working.

When we got alerted (by mrcoolbp) that the wiki wasn't working, I started looking into why. It soon became apparent that 251 PHP processes were active without a good reason. After killing those, the server's memory usage went down to 250 - 400 (can't remember precisely), but as people loaded the wiki after the database was restarted, PHP processes got started again. FunPika and I started to investigate, and he later found there was a bug in MediaWiki.
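A tally like the 251 PHP processes mentioned above can be taken by counting process names. The following is only a minimal sketch, assuming a Linux host with a procps-style `ps`; the function names and any threshold passed in are hypothetical and are not what was actually run during the incident.

```python
import subprocess
from collections import Counter
from typing import List


def count_by_name(ps_output: str) -> Counter:
    """Tally process names from `ps -eo comm=` style output (one name per line)."""
    names = (line.strip() for line in ps_output.splitlines())
    return Counter(name for name in names if name)


def runaway_names(counts: Counter, threshold: int) -> List[str]:
    """Return, sorted, the process names with more than `threshold` instances."""
    return sorted(name for name, n in counts.items() if n > threshold)


def snapshot() -> Counter:
    """Take a live tally; assumes a procps-style `ps` is on the PATH."""
    out = subprocess.run(
        ["ps", "-eo", "comm="], capture_output=True, text=True, check=True
    ).stdout
    return count_by_name(out)
```

For example, `runaway_names(snapshot(), 50)` would list every process name with more than 50 running instances; the value 50 is an arbitrary illustration.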

After changing a configuration setting in the wiki and killing all the remaining PHP processes, the problem went away. Naturally, the underlying bug still exists; it's just not causing a problem at this moment.

I'm glad to announce it's resolved, for now. We've also come up with some ideas that MechanicJay and I are going to look into to prevent this from happening in the future:

- cgroups: in short, control groups that will allow us to limit resources per user. For example, we'll have the wiki and all other things running under their own users, so that the wiki won't be able to eat up all the server's memory or spawn more processes than we define.

- httpd/mod_ruid2: this will force the webserver to start processes under a predefined user. For instance, all wiki things would be started under the "wiki" user. Combine that with cgroups, and we've got ourselves an initial solution for what happened earlier (i.e. the entire server and all services suffering because of one wiki instance).
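Neither cgroups nor mod_ruid2 is shown here. As a much smaller stand-in for the same idea, the sketch below uses per-process rlimits (`setrlimit`) from Python's standard `resource` module to cap the memory and process count of a single child process on Linux. The helper names and the limit values are hypothetical, and rlimits are weaker than cgroups: they apply per process (and, for RLIMIT_NPROC, per user) rather than to a whole group of services.

```python
import resource
import subprocess
from typing import Callable, List


def _cap(kind: int, value: int) -> None:
    """Lower one rlimit to `value`, never asking for more than the current hard limit."""
    _soft, hard = resource.getrlimit(kind)
    if hard != resource.RLIM_INFINITY:
        value = min(value, hard)
    resource.setrlimit(kind, (value, value))


def make_limiter(max_bytes: int, max_procs: int) -> Callable[[], None]:
    """Build a hook that runs in the child just before exec and applies both caps."""
    def apply_limits() -> None:
        _cap(resource.RLIMIT_AS, max_bytes)     # total address space, in bytes
        _cap(resource.RLIMIT_NPROC, max_procs)  # process count for this user
    return apply_limits


def run_limited(cmd: List[str], max_bytes: int, max_procs: int) -> subprocess.CompletedProcess:
    """Run `cmd` with both caps applied; output is captured as text."""
    return subprocess.run(
        cmd,
        preexec_fn=make_limiter(max_bytes, max_procs),
        capture_output=True,
        text=True,
    )
```

A child that tries to allocate more than `max_bytes` fails with an out-of-memory error of its own instead of pushing the whole server into the OOM killer.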

It should be noted that the solution isn't implemented yet. Currently, the only thing that has changed is the wiki setting, which prevents it from starting these processes.

I would like to thank both FunPika and MechanicJay for helping out.

- Incident Commander Xlefay

March 13, 2014

Begin 22:18 PST, end c. 23:50

martyp reports the same problem as before on the main site (comment counts of 0); audioguy agrees to have a look.

Slashd not running.

Attempted to start slashd; it started, but with complaints about not being able to connect to the database. Decided to restart Apache, due to the strong dependencies there with mod_perl.

Apache refused to start, citing lack of connection to database. Stopped slashd, in order to try bringing up Apache first, then slash. Apache still refused to start, unable to connect to the database.

At this point NCommander appeared, and audioguy handed all incident response authority to him.

NCommander had some difficulty connecting due to an extremely poor connection at his location (ssh being blocked), but was finally able to get in through the web interface at Linode. (Interesting but incidental fact: an attempt was made to get in through audioguy's account on slashcott, which is on an alternate port, but ssh access was thwarted even on this alternate port.)

NCommander determined that the problem was due to the SSL cert on the database having expired. For unknown reasons, it had been set up with a very short expiration time by someone who is no longer here.

NCommander was able to generate a new SSL cert and bring things back up.
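An expiry like this one can be caught ahead of time by comparing a certificate's notAfter timestamp with the current time. A minimal sketch, using `ssl.cert_time_to_seconds` from Python's standard library; the function names, the 30-day warning window, and the example timestamp in the note below are hypothetical and are not the values of the certificate involved here.

```python
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a certificate's notAfter timestamp (negative once expired).

    `not_after` must be in the form OpenSSL prints, e.g. "Jan  5 09:34:43 2018 GMT".
    `now` is seconds since the epoch; it defaults to the current time.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400.0


def needs_renewal(not_after: str, warn_days: float = 30.0,
                  now: Optional[float] = None) -> bool:
    """True when the certificate expires within `warn_days` or has already expired."""
    return days_until_expiry(not_after, now) <= warn_days
```

The notAfter string is the part after `notAfter=` in the output of `openssl x509 -noout -enddate`, for example `Jan  5 09:34:43 2018 GMT`. Running such a check from a scheduled job would give warning well before connections start failing.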

A short discussion with NCommander afterward brought out the following facts:

The database was confirmed to be on a different machine, to which audioguy had no access. Audioguy would not have been able to solve this problem, having access only to the main server, not the database server. The database uses SSL because of this remote connection.

The main server cannot be brought up by a simple reboot in an emergency. The Apache server must be started by hand. (It is located in /srv/soylentnews.org/apache, an unusual location.)

NCommander has no idea why Nginx is running on the slash machine.

A brief discussion about documenting system administration was held; this is to be a high priority once NCommander returns from vacation. NCommander received a reprieve from various possible flogging and keel-hauling due to his timely appearance to save the day.

After Action Recommendations:

  • The same people who have access to the main server should also have access to the database server, since the system cannot be brought up from a dead stop without this access.
  • The system will not start itself from a cold reboot at present. This is non-standard behavior, and is easily fixed. While I am not familiar with the specifics of how this is done on Ubuntu, every other unix style system I am aware of has a means to provide a local start and local stop function in the startup sequence as a way to automate special cases. In addition, it seems likely that Ubuntu would have a startup script for apache 1.3, perhaps an old one, that could be modified to work. Generally systems have some way to specify start order as well, which we need. This should be done.
  • Someone needs to have a look at nginx and see exactly what it is doing. If nothing useful, it should be shut down. It is unusual to have two separate httpd servers running on the same machine, particularly if that machine is a production machine designed to handle one application. Unless it has a clear use, it is simply an additional load and possible security risk at this point.
  • The start and stop procedures for slash need to be clearly documented somewhere. If no one objects, I will begin a new section on the wiki that has at least the very basics, at a level that someone not very familiar with the system can understand.
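The start order described in this entry (database reachable first, then Apache, then slashd) could be automated along these lines. This is a sketch only: the function names are hypothetical, the host, port, and step names in the usage note are placeholders, and the real start commands are deliberately left out because they are not documented here.

```python
import socket
from typing import Callable, List, Tuple


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def start_in_order(steps: List[Tuple[str, Callable[[], bool]]]) -> List[str]:
    """Run each step in sequence and stop at the first one that reports failure.

    Each step is a (name, callable) pair; the callable returns True on success.
    Returns the names of the steps that succeeded, in order.
    """
    started = []
    for name, step in steps:
        if not step():
            break
        started.append(name)
    return started
```

A caller might pass steps such as `("database", lambda: port_open(db_host, db_port))` followed by hypothetical `start_apache` and `start_slashd` callables, so that Apache is never started while the database is unreachable.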

- AudioGuy

March 11, 2014

15:30-17:00 EDT

Summary. There was a report of a problem at http://soylentnews.org; logged-in users saw a comment count of zero (0) for the three newest stories on the main page. When a user went to the actual story, there were comments there. Separately, non-logged-in users failed to even see the most-recent 3 stories. In light of recent developments, credentials had been removed for many people; the one person who had access to everything was unavailable.

What went well:

  • Generally followed procedures outlined in ICS (Incident Command System)
  • Used private channel on http://irc.sylnt.us/ to communicate in real time.
  • Incident Commander performed a .op and prefixed nick with "cmdr_" to clearly identify role.
  • The main site stayed up as we attempted to solve the problem.
  • Staff jumped in and offered their services.
  • Provided some updates to the community via IRC on channel: #Soylent
  • Focused on gathering data to identify problem and outline possible solutions.
  • Performed confirmation after it appeared the problem was fixed, to ensure it actually was fixed.
  • Requested feedback on lessons learned.
  • Incident Commander admitted mistakes.
  • There were some ruffled feathers, but all-in-all people worked well together.
  • Established a follow-up to find underlying cause of problem and to document solution.

What did not go well:

  • Did not use staff mailing list to inform all staff there was a problem.
  • Incident Commander failed to recognize when a technical lead role was needed and to delegate a task leader for it.
  • Key people who had domain knowledge and access were unavailable.
  • Other people with the know-how to diagnose and fix the problem lacked the credentials they needed to do so.
  • Lacked alternative means to contact key people. (e.g. phone numbers)

Takeaway.

There is a fine core of dedicated professionals who genuinely want to see the site succeed. They rose to the occasion and strove to work together. We successfully diagnosed and solved the problem without causing further damage. We learned what happened when there was a failure to identify when a task leader was needed and to delegate appropriately. We successfully coordinated efforts from people who were distributed in multiple locations and time zones.

After Action Report

Upon looking at the slashd log, Dev staff confirmed that slashd died at almost the same time that DNS was taken down in the previous issue. Dev staff believe that, because of Slash's reliance on the fully qualified domain name for many URLs, it did not handle the loss of DNS very well, and slashd died because of it. Our recommendation is to add the FQDN to the hosts file and to add some watchdog to the system to make sure slashd and other services stay running. Given that this is an Ubuntu system, it should be possible to get Upstart to respawn it as a system service. It seems probable that slashd died at the hands of the OOM killer, yet we lack the kernel logs for that, which might tell us whether it was slashd or some other task that caused the OOM state.
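The watchdog recommended above could be as simple as a polling loop. A minimal sketch, assuming `pgrep` is installed and that the daemon is visible under the process name passed in (whether slashd shows up as `slashd` or as `perl` has not been checked); the function names, the interval, and the restart hook are hypothetical. Upstart's own respawn support would be the more conventional route on Ubuntu.

```python
import subprocess
import time
from typing import Callable


def is_running(name: str) -> bool:
    """True if some process has exactly this name; assumes `pgrep` is available."""
    result = subprocess.run(["pgrep", "-x", name], stdout=subprocess.DEVNULL)
    return result.returncode == 0


def watchdog(check: Callable[[], bool], restart: Callable[[], None],
             rounds: int, interval: float = 30.0,
             sleep: Callable[[float], None] = time.sleep) -> int:
    """Poll `check` for `rounds` iterations and call `restart` whenever it fails.

    Returns the number of restarts performed. `sleep` is injectable so the
    loop can be exercised without real waiting.
    """
    restarts = 0
    for _ in range(rounds):
        if not check():
            restart()
            restarts += 1
        sleep(interval)
    return restarts
```

In use, `check` would be something like `lambda: is_running("slashd")` and `restart` a callable that runs whatever start command is eventually documented for slashd.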

11:30-13:30 EST

Summary: In migrating the service linode and remaining credit from the other 2 unused linodes to NCommander's account, the original account had to be closed. After closing the account and transferring over the svc. linode, somehow the DNS zone record was lost and nearly everything went down.

  • Mailed staff via mailing list to notify of site going down
  • Got in touch with dev team members via IRC
  • Discovered no one available had access to the Linode manager interface
  • Devs tried rebooting the production linode via root access
    • This led to 503 errors

After much confusion and digging, we figured out what the problem was (the DNS zone records going missing with the account closure). I called Linode and had the file(s) transferred to NCommander's account and reinstated.

This caused problems with slashd (see above).

Devs: feel free to add to this log what we learned and how to prevent this in the future.

mrcoolbp (talk) 23:24, 11 March 2014 (UTC)


March 16, 2014

~ 20:30 EST

mrcoolbp (talk) 04:58, 17 March 2014 (UTC) Today the story queue ran dry and no stories were posted for a few hours. The situation was remedied by paulej72 who stepped in and edited some stories, and was later joined by others.

No commander was presiding at the time when the incident occurred.

Below is an excerpt from IRC:

Sat 2014-03-16:
17:59 <@janrinok> Sorry guys, but I've got to go look after my other half.
17:59 <@janrinok> .deop
17:59 -!- mode/#staff [-o janrinok] by what-if-i
18:04 -!- kobach [~nope@Soylent/Staff/IRC/kobach] has left #staff [nope]
18:05 -!- janrinok [~janrinok@Soylent/Staff/Editor/janrinok] has quit [Quit: leaving]
18:41 <+FatPhil> dammit
18:41 <+FatPhil> .op
Sun 2014-03-17:
00:18 <@FatPhil> "21:52  * FatPhil has itchy feet" That was 2 and a half hours ago - I'm cracking open a beer now
00:23 <@FatPhil> .deop
00:43 -!- You have been marked as being away
...
01:27 -!- bytram [~a6b50356@Soylent/Staff/Developer/martyb] has joined #staff
01:27 -!- mode/#staff [+v bytram] by what-if-i
01:27 <+bytram> Hi gang!
01:27 <+FatPhil> you're cmdr
01:27 <+bytram> who says?
01:28 <+FatPhil> everyone who's not cmdr, which is everyone
01:28 <+bytram> I just popped in to see if there was an editor here?  Main page's newest story looks to be 4hrs old.
01:28 <+FatPhil> I'm in the pub, not reading the chan
01:28 <+bytram> k
01:29 <+FatPhil> (I know there's evidence to the contrary)
01:29 <+bytram> .op
01:29 -!- mode/#staff [+o bytram] by what-if-i
01:29 <+FatPhil> thank you
01:29 <@bytram> don't know how long I stay, but happy to help.