I've been trying to figure this issue out since late December. Tickets with VMware aren't getting far, so now I'm hoping the forums can point me in the right direction.
We've now had 5 instances since December where we leave for the day with all VMs fine (0 problem desktops) and come in the next morning to find all VMs on a host showing Agent Unreachable. The VMs are still running and responsive, can be logged into via the console, have valid IPs, etc. If I vMotion them to another host, they immediately come out of Agent Unreachable without needing a reboot, service restarts, etc. I can then vMotion them back onto the original host and they continue to work without issue.

This has only ever happened after hours, vCOps doesn't indicate any issues/alerts/anomalies, and the 5 occurrences have hit 4 different hosts. I don't see any odd metrics in the performance graphs for CPU, RAM, disk, or networking, and our SAN metrics don't show anything weird happening. I thought maybe vShield or MOVE were doing something, but those logs aren't indicating anything either. The VM logs don't have anything super obvious in them. I did see a couple of events where the WSNM service is restarting, but it restarts successfully each time. The fact that the VMs stay responsive throws me off the vShield line of thinking, where a similar situation happens but the VM itself is unresponsive.
Has anyone seen behavior like this and can point me in a "check out X" kind of direction?