Monitoring and Maintenance

On a production server, several things should be monitored automatically, both from inside and from outside, with automatic alarms that actually reach someone who feels responsible.

Local monitoring recommendations

We suggest monitoring:

Free hard disk space
Free physical and virtual RAM (note: virtual RAM is only a reserve for peak load, not a real resource)
CPU load (not only CPU utilization but also the overall system load, which reflects I/O waits and context switching; on Linux, consider monitoring /proc/loadavg)
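
The local checks listed above could be sketched in Python roughly as follows. This is a minimal sketch that assumes a Linux host with /proc/loadavg and /proc/meminfo; the function names and all thresholds are illustrative assumptions, not recommended values.

```python
import os
import shutil


def free_disk_fraction(path="/"):
    """Return the fraction of the filesystem containing `path` that is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def parse_loadavg(text):
    """Parse the contents of /proc/loadavg into (1 min, 5 min, 15 min) loads."""
    one, five, fifteen = (float(field) for field in text.split()[:3])
    return one, five, fifteen


def parse_meminfo(text):
    """Parse the contents of /proc/meminfo into a dict of integer values."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def local_warnings(min_disk_free=0.10, max_load_per_cpu=2.0,
                   min_mem_available_kb=200_000):
    """Collect warning strings; the default thresholds are assumptions."""
    warnings = []
    if free_disk_fraction("/") < min_disk_free:
        warnings.append("low disk space on /")
    with open("/proc/loadavg") as f:
        load5 = parse_loadavg(f.read())[1]
    if load5 / (os.cpu_count() or 1) > max_load_per_cpu:
        warnings.append(f"high 5-minute load: {load5}")
    with open("/proc/meminfo") as f:
        available = parse_meminfo(f.read()).get("MemAvailable")
    # MemAvailable is missing on very old kernels; skip the check in that case.
    if available is not None and available < min_mem_available_kb:
        warnings.append("low available memory")
    return warnings
```

A scheduled job could call local_warnings() every few minutes and pass any non-empty result to whatever alarm channel is in use.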

If you suspect overload, try very hard to distinguish between the different aspects of distributed computation, and work through the whole list of possible bottlenecks, down to network usage.

Remote monitoring recommendations

We suggest monitoring:

Basic network connectivity (ping with timing)
Application connectivity (HTTP(S) requests with minimal content checking and timing)
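
Both remote checks could be sketched in Python as shown below. Note the substitution: instead of an ICMP ping, which usually needs elevated privileges, the sketch times a plain TCP connect. The function names, the expected text and the timeouts are assumptions made for illustration.

```python
import http.client
import socket
import time
import urllib.request


def tcp_connect_time(host, port, timeout=5.0):
    """Return the seconds needed to open a TCP connection, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None


def http_check(url, expected_text, timeout=10.0):
    """Fetch `url` and return (ok, seconds).

    ok is True only if the status is 200 and the body contains expected_text.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            ok = response.status == 200 and expected_text in body
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and timeouts are all subclasses of OSError.
        ok = False
    return ok, time.monotonic() - start
```

Run from a machine outside the server's own network, a call such as http_check("https://example.org/", "Welcome") returns both the result and the elapsed time, so slow responses can raise an alarm as well as failed ones. The URL and the expected text here are placeholders.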

Maintenance

Somebody should watch the watchers.

Every now and then check:

Is the monitoring still running? If necessary, stop or interrupt something on purpose, at a time when it won't ruin someone's day.
Would alarms reach anyone? If in doubt, send test messages.
Is there activity at all? Idle servers may be idle because the clients can't connect.
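
One possible way to automate part of these checks is a staleness test on a file. The sketch below assumes that the monitoring job touches a heartbeat file on every run, and that client activity shows up as writes to an access log. Both conventions, and the function names, are assumptions for illustration.

```python
import os
import time


def seconds_since_modified(path, now=None):
    """Return the age of the last modification in seconds, or None if missing."""
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return None
    return (time.time() if now is None else now) - mtime


def is_stale(path, max_age_seconds, now=None):
    """True if `path` is missing or was not modified within max_age_seconds."""
    age = seconds_since_modified(path, now)
    return age is None or age > max_age_seconds
```

A stale heartbeat file suggests that the monitoring itself has stopped. A stale access log on a server that should be busy suggests that clients cannot connect.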