SCOM 2012 R2 Agent – High Read IO Issue


Following a recent firmware upgrade on our Dell EqualLogic SAN, I was horrified to see a massive spike in read IO activity, growing from an average of 2-3k IOPS right up to our SAN's maximum system limit of 10k.

The thing that struck me most was the imbalance between read and write traffic, as reads jumped to a massive 92% of the total. My first thought was that it was an issue with the firmware on either the Dell EqualLogic or the Dell N4032 10GbE switches, and I logged a call with Dell. However, the read activity just didn't add up.

Over the weekend I had maintenance time to take down all of our VMs for troubleshooting purposes. When I started a single application server, I could see peaks of almost 1k IOPS from that VM alone. Investigating further with Process Explorer, I could clearly see that the HealthService service was by far the highest user of disk read activity. Measuring IO within the VM showed very clearly the impact the service was having, with IO dropping to >50 once the service was disabled.
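For anyone wanting to repeat that measurement without Process Explorer, a rough sketch using the built-in performance counters might look like the following. It assumes it is run locally inside the VM on an English-language Windows install (counter names are localised), and the 5-second interval and 12 samples are arbitrary values chosen for illustration. Note that the Process counter covers all IO read operations issued by the process, not only disk reads.

```powershell
# Sample the HealthService process read rate next to the total disk read rate.
# Assumptions: run locally in the VM; English counter names; arbitrary interval/count.
$Counters = @(
    '\Process(HealthService)\IO Read Operations/sec',
    '\PhysicalDisk(_Total)\Disk Reads/sec'
)

Get-Counter -Counter $Counters -SampleInterval 5 -MaxSamples 12 |
    ForEach-Object {
        $Timestamp = $_.Timestamp
        foreach ($Sample in $_.CounterSamples)
        {
            # One row per counter per sample
            [pscustomobject]@{
                Time    = $Timestamp
                Counter = $Sample.Path
                Value   = [math]::Round($Sample.CookedValue, 1)
            }
        }
    } | Format-Table -AutoSize
```

Running it once with the service started and once with it stopped should give a before/after comparison similar to the one described above.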

There were obviously two issues here.

1. Dell EqualLogic / SAN HQ – Misreporting IO

Prior to the V7 firmware release, our Dell EqualLogic appeared to be misreporting the level of IO activity in the SAN HQ software. In the screenshot below you can clearly see the level before and after the firmware upgrade (the upgrade is shown by the dip to 0 IO).

[Screenshot: Traffic2]

2. SCOM Agent Cache

Clearing the SCOM agent cache had a dramatic impact on the level of disk activity, with no noticeable disk IO from the agent once the service was started again after the cache was cleared. At this point I had already disabled the HealthService by GPO for testing purposes, so I removed the GPO entry and ran the following PowerShell script to rename the SCOM agent cache and restart the agent.


<#
.NOTES
===========================================================================
Created with:     SAPIEN Technologies, Inc., PowerShell Studio 2015 v4.2.81
Created on:       16/03/2015 10:30
Created by:       Maurice Daly
Filename:         ClearAgentCache.ps1
===========================================================================
.DESCRIPTION
Renames the SCOM agent cache and starts the agent.
#>

Import-Module ActiveDirectory

# Collect the names of the servers to process (adjust the LDAP filter to your naming convention)
$Servers = Get-ADComputer -LDAPFilter "(name=YOUR-SERVER-NAMING*)" | Select-Object -ExpandProperty Name

foreach ($Server in $Servers)
{
    $HealthService = Get-Service -ComputerName $Server -Name "HealthService"

    # Disable and stop the agent so nothing restarts it while the cache is renamed
    $HealthService | Set-Service -StartupType 'Disabled'
    $HealthService | Stop-Service -Force

    # Rename the cache folder; -NewName expects a name only, not a full path
    Rename-Item -Path "\\$Server\C`$\Program Files\Microsoft Monitoring Agent\Agent\Health Service State" -NewName "Health Service State.old"

    # Re-enable and start the agent, which creates a fresh cache folder
    $HealthService | Set-Service -StartupType 'Automatic'
    $HealthService | Start-Service
}
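As a follow-up check, something along these lines could confirm that the agent came back up and recreated its cache folder on each server. This is only a sketch: the server names are hypothetical placeholders, the path is the default agent location used in the script above, and `Get-Service -ComputerName` assumes Windows PowerShell 5.1 or earlier (the parameter is not available in PowerShell 6 and later).

```powershell
# Hypothetical server names for illustration; substitute your own list
$Servers = @('APP01', 'APP02')

# Default agent cache location, expressed relative to the admin share
$StatePath = 'C$\Program Files\Microsoft Monitoring Agent\Agent\Health Service State'

foreach ($Server in $Servers)
{
    $Service = Get-Service -ComputerName $Server -Name 'HealthService'

    # Report the service state and whether the new and renamed cache folders exist
    [pscustomobject]@{
        Server         = $Server
        ServiceStatus  = $Service.Status
        NewCacheExists = Test-Path -Path "\\$Server\$StatePath"
        OldCacheExists = Test-Path -Path "\\$Server\$StatePath.old"
    }
}
```

Once the agents have been running cleanly for a while, the renamed `.old` folders can be deleted to reclaim disk space.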

The result: SAN traffic levels are now back to normal 🙂

[Screenshot: Traffic]

6 thoughts on “SCOM 2012 R2 Agent – High Read IO Issue”

  1. Patrick January 18, 2017 / 9:37 am

    Hi,
    Experiencing the same at a customer of mine. Have you figured out the root cause?

    As a workaround you can do the same right from the SCOM console (Flush agent cache task). I trigger it automatically once the IOs are high (recovery task).

    Thanks,
    Patrick

    • Patrick January 20, 2017 / 10:04 am

      Update: at another customer (no EqualLogic) I see the same high IOs (as I was afraid I would). On the systems I verified I can see a lot of events (>10/sec) in the Windows Security log. Is anyone else experiencing the same?

      Best,
      Patrick

  2. storagebuilder November 6, 2015 / 1:31 pm

    We are seeing the same issue but on a massive scale peaking at >100k read I/O. After cache flush, does this Read IO stay low or does it creep back up as the cache gets bigger and bigger?

    • modalyit November 6, 2015 / 1:40 pm

      Having initially cleared the cache in March of this year we have seen no IO creep coming back in from the SCOM agent, so far so good.

  3. Jason Jensen September 29, 2015 / 4:38 pm

    Thanks for this. I was having this exact issue and this completely resolved the issue.

    • modalyit October 1, 2015 / 2:05 am

      No problem Jason, glad that it helped.
