We seem to have a number of problems.
1) Simply, we're out of RAM. We have been for a while, but now Linux is doing lots of silly things, like sending a full gig to swap and then panicking. Every site I run basically dies while it tries to swap what it needs back in but can't, because it really likes to keep about 10% free just in case.
I -think- I've temporarily solved that issue by restarting most of our server's services (and then, because of #3, rebooting the server anyway : /). But this is just going to get worse, and more frequent, until we get more RAM. I'll be posting an announcement about this here and on Blue Moon soon.
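A rough sketch of how the swap and free-memory figures described above could be checked, assuming a Linux box that exposes /proc/meminfo. The function names and the sample numbers are made up for illustration; "free" here is approximated as MemFree + Buffers + Cached, which is only an estimate.

```python
# Illustrative sample in /proc/meminfo format (invented numbers, not real data).
SAMPLE = """\
MemTotal:        8388608 kB
MemFree:          419430 kB
Buffers:          104857 kB
Cached:           314573 kB
SwapTotal:       2097152 kB
SwapFree:        1048576 kB
"""


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (mostly kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not a "Key: value" line
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def summarize(info):
    """Return swap in use (kB) and a rough fraction of RAM that is free or reclaimable."""
    swap_used = info.get("SwapTotal", 0) - info.get("SwapFree", 0)
    # Assumption: free + buffers + page cache is a fair stand-in for "available".
    reclaimable = info.get("MemFree", 0) + info.get("Buffers", 0) + info.get("Cached", 0)
    total = info.get("MemTotal", 0)
    return {
        "swap_used_kb": swap_used,
        "free_fraction": reclaimable / total if total else 0.0,
    }


if __name__ == "__main__":
    summary = summarize(parse_meminfo(SAMPLE))
    print("swap used: %d kB" % summary["swap_used_kb"])
    print("roughly free: %.1f%%" % (summary["free_fraction"] * 100))
```

On a real server the same functions could be fed the contents of /proc/meminfo instead of SAMPLE, for example from a cron job that logs the result every minute.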
2) Our I/O situation could be better on its own.
- I should have run the mailserver off the slave server from the beginning.
- I should never have used a RAID. At all.
Regrets : /
I could set up another server 'properly' fairly cheaply (a few hundred), though Elliquiy would go down a couple of times. An SSD would be a far more dramatic improvement, but they are expensive, and I don't want to go that route until Wheezy (the next version of the Linux distro I use) is stable.
3) Something else is causing random load spikes and I honestly have no idea what. Some of it, I know, comes from measures I took to help correct the first issue, but that isn't all of it. There are about a thousand connections open at any one time to E and the other sites I run, but the server can handle, and has handled, a hundred times that in load testing. File descriptors are a similar story. It's also possible that after 200 days of uptime other things built up and caused Linux to slowly degrade. 200 days is a long time to be serving millions of pageviews and handling tens of millions of database calls per day.
I just rebooted the server. If that doesn't solve #3 here (besides the part I know I caused while mitigating #1), we have Problems : /
- So far, so good, though. A few seemingly random mysterious issues no longer exist. Which is good, when it comes to computers >_> Rebooting still bad : / E-penis size is measured in server uptime.
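A minimal sketch of one way to pin down load spikes like the ones in #3, assuming load averages are sampled periodically from /proc/loadavg and open file handles are read from /proc/sys/fs/file-nr. The function names, the 3x-median threshold, and the sample values are all hypothetical choices for illustration.

```python
import statistics


def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg text."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def parse_file_nr(text):
    """Return (allocated, maximum) file handles from /proc/sys/fs/file-nr text."""
    allocated, _unused, maximum = (int(x) for x in text.split()[:3])
    return allocated, maximum


def find_spikes(samples, factor=3.0):
    """Return the indexes of samples above `factor` times the median load.

    The median is used as the baseline so a handful of spikes does not
    drag the baseline up with them.
    """
    if not samples:
        return []
    baseline = statistics.median(samples)
    if baseline <= 0:
        return []
    return [i for i, value in enumerate(samples) if value > factor * baseline]


if __name__ == "__main__":
    # Invented 1-minute load samples, one per polling interval.
    history = [0.8, 1.1, 0.9, 7.5, 1.0, 0.7, 9.2]
    print("spike at samples:", find_spikes(history))
    print("load:", parse_loadavg("0.52 0.61 0.58 2/813 24601"))
    print("file handles:", parse_file_nr("4608 0 811236"))
```

Logging a timestamp next to each sample would make it possible to line the spikes up against cron jobs, backups, or traffic, which is the part that actually narrows down a cause.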
4) We do occasionally top out CPU usage, but this is incidental and extremely momentary. With the AJAX chat off of E, we tend to peak at about 50% usage.
- Getting another CPU might allow us to turn the AJAX chat back on again, but with our RAM situation so fragile I'd like to get that taken care of first, or alongside it.
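For reference, a figure like "peak at about 50% usage" can be derived from two snapshots of /proc/stat taken a short time apart. This is a sketch under that assumption, not a description of the monitoring actually in use here; the function names and the snapshot values are invented.

```python
def cpu_times(stat_text):
    """Return (busy, total) ticks from the aggregate 'cpu' line of /proc/stat text."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            # Columns: user nice system idle iowait irq softirq steal.
            # Guest columns are already counted in user/nice, so stop at eight.
            ticks = [int(x) for x in fields[1:9]]
            idle = ticks[3] + (ticks[4] if len(ticks) > 4 else 0)
            total = sum(ticks)
            return total - idle, total
    raise ValueError("no aggregate cpu line found")


def cpu_usage(before, after):
    """Return the busy fraction (0.0 to 1.0) between two /proc/stat snapshots."""
    busy0, total0 = cpu_times(before)
    busy1, total1 = cpu_times(after)
    elapsed = total1 - total0
    return (busy1 - busy0) / elapsed if elapsed > 0 else 0.0


if __name__ == "__main__":
    # Invented snapshots; real ones would be read from /proc/stat a few seconds apart.
    before = "cpu  1000 0 500 8000 500 0 0 0 0 0"
    after = "cpu  1400 0 600 8400 600 0 0 0 0 0"
    print("cpu usage: %.0f%%" % (cpu_usage(before, after) * 100))
```

Time spent waiting on disk (iowait) is counted as idle here, so an I/O-bound server can look half idle on this measure while still feeling slow.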