Monitoring at Xandr
At Xandr, we monitor the following parts of our physical infrastructure and core internals:
- Physical Servers
- Switches / Routers
- Local Load Balancing
- Xandr URLs
We do not monitor customers' applications running within instances, but we do monitor discrepancies between the instance state recorded in our database and the actual state of the instance. For monitoring, we use Nagios, with AlertSite as an external monitoring tool. On each critical event, Nagios and AlertSite page the SysOps members on duty. Non-critical events (e.g., high load on a physical server for a minute) are reported by email.
Members of SysOps are on duty at all times to fill requests and monitor infrastructure.
We monitor all critical server hardware metrics. In the case of any HDD, memory, power supply, or similar issue, SysOps is paged immediately. After investigating the issue, they decide what further hardware maintenance is needed. In the case of an extremely critical issue, SysOps sends an appropriate notification to the customer, suggesting immediate migration to another server. Otherwise, regular maintenance (RMA) is scheduled, and we notify customers about it 7–10 days or more in advance.
On any critical service issue, SysOps receives alerts and starts an investigation immediately. Such issues include, but are not limited to:
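The routing rule described above (critical events page the on-duty SysOps, non-critical events go to email) can be sketched as follows. This is only an illustration: the `Event` and `Severity` types and the `route` function are hypothetical names, not part of Nagios, AlertSite, or any Xandr tooling.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    NON_CRITICAL = "non-critical"
    CRITICAL = "critical"


@dataclass
class Event:
    """Hypothetical monitoring event; field names are assumptions."""
    source: str        # e.g. "nagios" or "alertsite"
    host: str
    message: str
    severity: Severity


def route(event: Event) -> str:
    """Return the notification channel for an event."""
    # Critical events page the on-duty SysOps; everything else is emailed.
    if event.severity is Severity.CRITICAL:
        return "pager"
    return "email"
```

For example, `route(Event("nagios", "server-01", "HDD failure", Severity.CRITICAL))` returns `"pager"`, while a one-minute load spike tagged `Severity.NON_CRITICAL` returns `"email"`.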
- A server goes off-line
- A disk has failed in a storage unit
- A host is unavailable or flapping
- Load is critical on a server
- An instance stops responding to ping
- Critical disk or volume issues are detected
- Instances are failing to launch or are taking extreme amounts of time to launch
Xandr monitors the following URL resources:
- Nagios instances in each of our datacenters
- The Customer Portal at https://help.xandr.com
- Xandr and some customer CDN domains
If issues are detected, SysOps is alerted.
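A URL check of this kind might look roughly like the sketch below. It is not the check Xandr actually runs: the function names and the mapping of HTTP status codes to severities (4xx as a warning, 5xx or an unreachable host as critical) are assumptions chosen for illustration, using only the Python standard library.

```python
import urllib.error
import urllib.request

# Status values follow the usual Nagios convention.
OK, WARNING, CRITICAL = 0, 1, 2


def classify_http_status(status: int) -> int:
    """Map an HTTP status code to a check status (assumed thresholds)."""
    if 200 <= status < 400:
        return OK
    if 400 <= status < 500:
        return WARNING
    return CRITICAL


def check_url(url: str, timeout: float = 10.0) -> int:
    """Fetch a URL and return OK, WARNING, or CRITICAL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return classify_http_status(response.status)
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx or 5xx).
        return classify_http_status(err.code)
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return CRITICAL
```

Calling `check_url("https://help.xandr.com")` would return `OK` when the portal responds normally; any non-`OK` result is what would be forwarded to SysOps.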
We use Nagios to monitor the health and load status of all important Xandr infrastructure. This includes, but is not limited to:
- Our API
- Local Load Balancers
Pagers of the SysOps members on duty are triggered in case of problems with these components.
Nagios is an open-source, enterprise-class monitoring system. Nagios can perform checks for various services (SMTP, POP3, HTTP, NNTP, PING), as well as resource checks (CPU load, disk usage).
Checks are divided into active and passive checks. Active checks are performed in two ways:
1) On the Nagios box by different plugins (check_ping, check_dns, check_ssh, check_https, etc.),
2) On hosts using the NRPE daemon.
NRPE stands for Nagios Remote Plugin Executor. On the Xandr side, it runs tests such as check_nrpe_disk, check_nrpe_users, check_nrpe_load, check_nrpe_swap, check_nrpe_exp_memory, check_nrpe_lvm, and many others. When a check fails, an alarm message goes to SysOps.
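To make the idea of a plugin concrete, here is a minimal sketch of a disk-usage check written in the style of a Nagios plugin, which reports its result through its exit code (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN) and one line of output. It is not the actual check_nrpe_disk used at Xandr; the function names and the 80% / 90% thresholds are assumptions.

```python
import shutil

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate_disk(used_percent: float, warn: float = 80.0, crit: float = 90.0) -> int:
    """Compare disk usage against thresholds (assumed values)."""
    if used_percent >= crit:
        return CRITICAL
    if used_percent >= warn:
        return WARNING
    return OK


def main(path: str = "/") -> int:
    """Check one filesystem, print a status line, return the exit code."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as err:
        print(f"DISK UNKNOWN - {err}")
        return UNKNOWN
    used_percent = 100.0 * usage.used / usage.total
    status = evaluate_disk(used_percent)
    print(f"DISK {LABELS[status]} - {used_percent:.1f}% used on {path}")
    return status

# When saved as a standalone plugin script, end the file with:
#     raise SystemExit(main())
```

A script like this could be run directly on the Nagios box, or installed on a monitored host and executed remotely through the NRPE daemon; in both cases Nagios reads the exit code to decide whether to raise an alarm.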
Service checks that are performed and submitted to Nagios by external applications are called passive checks (more information on passive checks can be found here: http://nagios.sourceforge.net/docs/3_0/passivechecks.html). The snmptrapd daemon routes SNMP traps to Nagios using passive checks. Networking gear (F5, PDU, Core Switches) and NAS units are monitored via SNMP using passive checks.
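As an illustration of how an external application can submit a passive result, the sketch below formats a `PROCESS_SERVICE_CHECK_RESULT` line in the Nagios external command format and writes it to the Nagios command file. The helper names, the example host and service, and the command file location are assumptions; the real path is whatever the `command_file` setting in the Nagios configuration points to.

```python
import time
from typing import Optional


def format_passive_result(host: str, service: str, return_code: int,
                          output: str, timestamp: Optional[int] = None) -> str:
    """Build one external-command line for a passive service check."""
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return_code must be 0 (OK), 1, 2, or 3 (UNKNOWN)")
    ts = int(time.time()) if timestamp is None else int(timestamp)
    # Each command must fit on a single line, so flatten the output.
    clean_output = output.replace("\n", " ")
    return (f"[{ts}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{return_code};{clean_output}\n")


def submit(command_file: str, line: str) -> None:
    """Write the command to the Nagios command file (a named pipe)."""
    with open(command_file, "w") as pipe:
        pipe.write(line)
```

For instance, a trap handler might call `format_passive_result("core-switch-1", "SNMP Trap", 2, "linkDown received")` and pass the result to `submit` together with the configured command file path. The host and service names must match definitions that already exist in Nagios, and must not contain semicolons.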
More info can be found on the Nagios homepage: http://www.nagios.org/.