One of the main benefits of VCOPS is supposed to be that its algorithms can identify when a resource is having issues based on a number of metrics, rather than relying on the more common method of setting thresholds on single metrics. VCOPS lets you set alerts based on a single metric going out of baseline, or on the health score of a resource or application dipping below a certain point.

We use SMARTS as our Event Management tool, and it would be simple enough to turn off the static threshold alerts and have the alerts from VCOPS sent to the SMARTS console. From there, our NOC, which monitors the console 24x7, can receive the alerts and follow our Event and Incident Management processes.

We attempted this quite a while ago, but we were only sending alerts for specific metrics; for example, when CPU breached its normal baseline we'd get an alert. This caused a flood of alerts, because a resource would breach its baseline by going to 4% CPU utilization on a Saturday when normally it was closer to 2%. Obviously that's not something we want to create an incident for. Kit explained at VMworld last year that we really want to base our alerts on the Health score, and we're now in a place where we could do that. I'm just curious whether anyone else has made this leap and how successful it was.
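To make the difference concrete, here is a rough Python sketch of the two alerting styles. It is only an illustration: `toy_health_score` is a made-up stand-in (the share of metrics inside their baseline), not how VCOPS actually computes Health, and every name, baseline range, and threshold below is an assumption apart from the 2% vs 4% CPU case described above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MetricSample:
    """One metric reading with an assumed (hypothetical) dynamic baseline range."""
    name: str
    value: float
    baseline_low: float
    baseline_high: float

    def out_of_baseline(self) -> bool:
        return not (self.baseline_low <= self.value <= self.baseline_high)


def per_metric_alerts(samples: List[MetricSample]) -> List[str]:
    """Old approach: one alert for every metric outside its baseline."""
    return [s.name for s in samples if s.out_of_baseline()]


def toy_health_score(samples: List[MetricSample]) -> float:
    """Hypothetical composite score: percentage of metrics inside baseline.

    This is NOT the VCOPS algorithm, just a simple stand-in for illustration.
    """
    if not samples:
        return 100.0
    breaching = sum(1 for s in samples if s.out_of_baseline())
    return 100.0 * (1 - breaching / len(samples))


def health_alert(samples: List[MetricSample], threshold: float = 50.0) -> bool:
    """New approach: alert only when the composite score drops below a threshold."""
    return toy_health_score(samples) < threshold


# A quiet Saturday: CPU sits at 4% against a baseline around 2%,
# while the other (invented) metrics are behaving normally.
saturday = [
    MetricSample("cpu_pct", 4.0, 1.0, 3.0),
    MetricSample("mem_pct", 40.0, 30.0, 60.0),
    MetricSample("disk_latency_ms", 5.0, 2.0, 10.0),
    MetricSample("net_kbps", 120.0, 50.0, 300.0),
]

if __name__ == "__main__":
    print("per-metric alerts:", per_metric_alerts(saturday))
    print("toy health score:", toy_health_score(saturday))
    print("health alert fires:", health_alert(saturday))
```

With these assumed numbers, the per-metric approach raises a CPU alert for the NOC, while the score-based approach stays quiet because only one of four metrics is out of range. Where the threshold should sit for a real Health score is exactly the tuning question I'm asking about.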