SCOM 2012 R2 Agent – High Read IO Issue

Following a recent firmware upgrade on our Dell EqualLogic SAN, I was horrified to see a massive spike in read IO activity, growing from an average of 2-3k IOPS right up to our SAN's maximum of 10k IOPS.

The thing that struck me most was the imbalance between read and write traffic, with reads jumping to a massive 92% of the total. My first thought was that it was an issue with the firmware on either the Dell EqualLogic or the Dell N4032 10GbE switches, and I logged a call with Dell. However, the read activity just didn't add up.

Over the weekend I had maintenance time to take down all of our VMs for troubleshooting. When I started a single application server, I could see peaks of almost 1k IOPS from that VM alone. Investigating further with Process Explorer, I could clearly see that the HealthService service was by far the highest user of disk read activity. Measuring IO within the VM showed the impact very clearly, with IO dropping below 50 once the service was disabled.
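For anyone wanting to repeat that measurement from the command line, a rough sketch along the following lines should work. It is not what I ran at the time. It assumes the agent process appears as the HealthService instance in the Process performance counter set, and the sample interval and count are arbitrary values picked for illustration.

```powershell
# Assumption: the SCOM agent runs as HealthService.exe, so its counter instance is "HealthService".
$counter = '\Process(HealthService)\IO Read Operations/sec'

# Take 12 samples, 5 seconds apart (arbitrary values chosen for illustration).
$samples = Get-Counter -Counter $counter -SampleInterval 5 -MaxSamples 12

# Average the per-sample read rate to smooth out short bursts.
$average = $samples |
    ForEach-Object { $_.CounterSamples } |
    Measure-Object -Property CookedValue -Average

'Average read operations/sec: {0:N0}' -f $average.Average
```

Note that this counter covers all read IO issued by the process, not just disk reads, so treat the figure as an indication rather than an exact match for what the SAN reports. The agent also spawns MonitoringHost.exe worker processes, so it may be worth repeating the measurement with that instance name.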

There were obviously two issues here.

1. Dell EqualLogic / SAN HQ – Misreporting IO
Prior to the V7 firmware release, our Dell EqualLogic appeared to be misreporting the level of IO activity in the SAN HQ software. In the screenshot below you can clearly see the level before and after the firmware upgrade (the upgrade itself shows as the dip to 0 IO).
[Screenshot: Traffic2]

2. SCOM Agent Cache
Clearing the SCOM agent cache had a dramatic impact on the level of disk activity: once the service was started again with the cache cleared, it had no noticeable impact on disk IO. At this point I had already disabled the HealthService by GPO for testing purposes, so I removed the GPO entry and ran the following PowerShell script to rename the SCOM agent cache and restart the agent.


<#
.NOTES
===========================================================================
Created with:     SAPIEN Technologies, Inc., PowerShell Studio 2015 v4.2.81
Created on:       16/03/2015 10:30
Created by:       Maurice Daly
Filename:         ClearAgentCache.ps1
===========================================================================
.DESCRIPTION
Renames the SCOM agent cache and starts the agent.
#>

Import-Module ActiveDirectory

# Collect the names of all servers matching the naming pattern (replace with your own)
$Servers = Get-ADComputer -LDAPFilter "(name=YOUR-SERVER-NAMING*)" | Select-Object -ExpandProperty Name

foreach ($Server in $Servers)
{
    # Locate the SCOM agent service (HealthService) on the remote server
    $HealthService = Get-Service -Name "HealthService" -ComputerName $Server
    # Disable and stop the service so the cache folder is no longer in use
    $HealthService | Set-Service -StartupType 'Disabled'
    $HealthService | Stop-Service -Force
    # Rename the agent cache folder; -NewName takes a name only, not a full path
    Rename-Item -Path "\\$Server\C$\Program Files\Microsoft Monitoring Agent\Agent\Health Service State" -NewName "Health Service State.old"
    # Re-enable and start the service, which rebuilds the cache
    $HealthService | Set-Service -StartupType 'Automatic'
    $HealthService | Start-Service
}
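After the script has run, it may be worth spot-checking a server before cleaning up. The following is only a sketch under assumptions: the server name is a hypothetical placeholder, and it assumes the agent is installed in the same default path used above.

```powershell
# Hypothetical server name, replace with one of your own.
$Server = 'YOUR-SERVER-NAME'
$AgentPath = "\\$Server\C$\Program Files\Microsoft Monitoring Agent\Agent"

# Confirm the agent service came back up.
Get-Service -Name 'HealthService' -ComputerName $Server | Select-Object MachineName, Name, Status

# The agent should recreate the cache folder on start.
Test-Path -Path "$AgentPath\Health Service State"

# Once satisfied, the renamed copy can be removed (drop -WhatIf to actually delete).
Remove-Item -Path "$AgentPath\Health Service State.old" -Recurse -WhatIf
```

The -WhatIf switch is left on deliberately so that nothing is deleted until the new cache has been confirmed as healthy.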

The result: SAN traffic levels are now back to normal 🙂
[Screenshot: Traffic]