On a production server, several things should be monitored automatically, both from inside and from outside, with alarms that actually reach someone who feels responsible.
Local monitoring recommendations
We suggest monitoring:
Free hard disc space
Free physical and virtual RAM (but note: virtual RAM is a reserve for peak load, not a real resource)
CPU load (not only computational usage but also the overall load, including I/O and context switches; on Linux, consider monitoring /proc/loadavg)
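As an illustration only, the local checks listed above could be sketched in Python roughly as follows. This is a minimal sketch assuming a Linux host with /proc/meminfo (including the MemAvailable field) and /proc/loadavg. The threshold values and all function names are hypothetical and need tuning for your own server.

```python
import os
import shutil

# Hypothetical thresholds; tune them for your own server.
MIN_FREE_DISK_FRACTION = 0.10
MIN_AVAILABLE_RAM_FRACTION = 0.10
MAX_LOAD_PER_CPU = 2.0


def free_disk_fraction(path="/"):
    """Return the free fraction of the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def parse_meminfo(text):
    """Parse the content of /proc/meminfo into a dict of integer values (mostly kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def parse_loadavg(text):
    """Return the 1, 5 and 15 minute load averages from /proc/loadavg content."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def collect_alarms(disk_fraction, meminfo, load1, cpu_count):
    """Compare the measurements with the thresholds and return alarm messages."""
    alarms = []
    if disk_fraction < MIN_FREE_DISK_FRACTION:
        alarms.append(f"low disk space: {disk_fraction:.0%} free")
    total = meminfo.get("MemTotal", 0)
    available = meminfo.get("MemAvailable", 0)
    if total and available / total < MIN_AVAILABLE_RAM_FRACTION:
        alarms.append(f"low memory: {available} kB of {total} kB available")
    if load1 / cpu_count > MAX_LOAD_PER_CPU:
        alarms.append(f"high load: {load1:.2f} on {cpu_count} CPU(s)")
    return alarms


def read_text(path):
    """Read a small text file such as the ones under /proc."""
    with open(path) as handle:
        return handle.read()


def run_local_checks():
    """Gather all measurements on a Linux host and return the alarm list."""
    return collect_alarms(
        free_disk_fraction("/"),
        parse_meminfo(read_text("/proc/meminfo")),
        parse_loadavg(read_text("/proc/loadavg"))[0],
        os.cpu_count() or 1,
    )
```

A scheduled job could call run_local_checks() every few minutes and forward any returned messages to whatever alarm channel actually reaches a responsible person. The load average is divided by the CPU count because the raw number also counts processes waiting for I/O and is only meaningful relative to the number of cores.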
In case of presumed overload, try very hard to distinguish between the several aspects of distributed computation and the whole list of possible bottlenecks, down to network usage and disc I/O.
Remote monitoring recommendations
We suggest monitoring:
Basic network connectivity (ping with timing)
Application connectivity (HTTP(S) requests, checking response time and doing some minimal content checking)
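The application connectivity check could look roughly like the following sketch, which uses only the Python standard library. The function names, the expected text and the time limits are hypothetical placeholders, not part of any existing tool.

```python
import time
import urllib.error
import urllib.request


def evaluate_response(status, body, elapsed, expected_text, max_seconds):
    """Return a list of problems found in one HTTP response (empty means healthy)."""
    problems = []
    if status != 200:
        problems.append(f"unexpected status {status}")
    if expected_text not in body:
        problems.append(f"expected text {expected_text!r} not found")
    if elapsed > max_seconds:
        problems.append(f"slow response: {elapsed:.2f}s > {max_seconds:.2f}s")
    return problems


def check_url(url, expected_text, max_seconds=2.0, timeout=10.0):
    """Fetch `url` once, time the request, and return a list of problems."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            status = response.status
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status such as 404 or 500.
        return [f"unexpected status {error.code}"]
    except OSError as error:
        # Covers DNS failures, refused connections and timeouts.
        return [f"request failed: {error}"]
    elapsed = time.monotonic() - start
    return evaluate_response(status, body, elapsed, expected_text, max_seconds)
```

Run from a machine outside the server's own network, a call such as check_url("https://example.org/", "Welcome") would return an empty list when the page answers in time with the expected content, and a list of problem descriptions otherwise. Checking for a short piece of known content catches the case where the web server responds but the application behind it is broken.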
Maintenance
Somebody should watch the watchers.
...