Friday, February 17, 2012

Wednesday's outage explained

Last Wednesday just after 9:30am Eastern time my SSH console to dev.eclipse.org became unresponsive. Both our primary AND secondary NFS servers were no longer responding, and as a result most of eclipse.org was off the air. Since the failed servers are physically elsewhere, it's not like we can easily walk over to the console to see what has happened.

Usually, when one server ceases to respond, the problem is with the server. When two servers on the same network segment cease to respond at the same time, it's anything but the server. But Matt and I took no chances and split up: I investigated the network side and the possibility of a kernel DoS/exploit, and he hopped in his car to go see what was happening on the server side. Fortunately, the servers are only 10 minutes away.

As it turns out, the Linux kernel crashed on both servers, within minutes of each other. Here's a sample of what we saw in the logs:

Feb 15 09:34:58 kernel: [18446743997.844366] WARNING: at [snip]/kernel/sched.c:3878 find_busiest_group+0xc79/0xce0()
Feb 15 09:34:58 kernel: [18446743997.844370] Hardware name: X8DT6
Feb 15 09:34:58 kernel: [18446743997.844417] Pid: 51, comm: events/0 Not tainted
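
One thing worth noticing in that sample is the bracketed kernel timestamp itself. As a rough sketch, assuming the bracketed value is seconds derived from a 64-bit nanosecond counter (an assumption on my part, not something the logs state), the Python below measures how far each timestamp sits below 2^64 nanoseconds. The names `WRAP_NS`, `bracket_timestamp_ns` and `seconds_below_wrap` are made up for this illustration.

```python
import re

# Log lines copied from the sample above.
LOG_LINES = [
    "Feb 15 09:34:58 kernel: [18446743997.844366] WARNING: at [snip]/kernel/sched.c:3878 find_busiest_group+0xc79/0xce0()",
    "Feb 15 09:34:58 kernel: [18446743997.844370] Hardware name: X8DT6",
    "Feb 15 09:34:58 kernel: [18446743997.844417] Pid: 51, comm: events/0 Not tainted",
]

# Assumption: the bracketed value comes from a 64-bit nanosecond counter,
# so 2**64 nanoseconds is the point where such a counter would wrap.
WRAP_NS = 2 ** 64


def bracket_timestamp_ns(line):
    """Return the bracketed seconds.microseconds timestamp in nanoseconds, or None."""
    match = re.search(r"\[(\d+)\.(\d{6})\]", line)
    if match is None:
        return None
    seconds, micros = match.groups()
    # Integer arithmetic avoids losing precision on such a large value.
    return int(seconds) * 1_000_000_000 + int(micros) * 1_000


def seconds_below_wrap(line):
    """Return how many seconds the timestamp sits below 2**64 ns, or None."""
    ts_ns = bracket_timestamp_ns(line)
    if ts_ns is None:
        return None
    return (WRAP_NS - ts_ns) / 1e9


for line in LOG_LINES:
    print(round(seconds_below_wrap(line), 3))
```

Each timestamp lands roughly 76 seconds short of 2^64 nanoseconds. Taken literally, a value that large would mean an uptime of more than 580 years, so it looks more like a wrapped or overflowed clock value than a real one. That is only a hint drawn from the arithmetic, not a diagnosis.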


Both servers are physically identical and were brought online at about the same time, so this whole thing smells like something I've heard of before. To make me feel even better, the kernel bug that closely matches what we've experienced is still open today:

https://bugzilla.kernel.org/show_bug.cgi?id=16991

After restarting both servers, we discovered that our rather large OpenLDAP server's database had some data corruption, and that some specific operations caused it to segfault. Those numerous LDAP crashes made it difficult for anyone to get anything done on Wednesday.

It's all fun :) Any bets on when this will happen again?

5 Comments:

Blogger Benjamin Cabé said...

> Any bets on when this will happen again?

I'd say 21st December 2012 :-)

10:22 AM  
Blogger K Matthias said...

Fun times! Let's hope no one runs whatever triggered it any time soon. ;-)

1:11 PM  
Blogger banbook said...

Lucky you :)
2 days before your crash - 15/02 I got the same freeze on 22 out of 24 production SUSE servers, that was funny.

4:35 AM  
Blogger Denis Roy said...

@banbook... Any insight to share? Did you simply reboot your servers, or did you investigate further?

This has only happened to two out of our 30-or-so SLES servers.

4:09 PM  
Blogger Denis Roy said...

http://www.novell.com/support/kb/doc.php?id=7009834

2:03 PM  
