Troubleshooting Common VMware ESX Host Server Problems
The purple screen displays details of the crash: the exception type, a register dump,
what was running on each CPU at the time of the crash, the back-trace, server uptime,
error messages and memory core dump info. The information won't mean much to you, but
VMware support can decipher it and help determine the cause of the crash.
Unfortunately, other than recording the information on the screen, your only option
when experiencing a PSOD is to power the server off and back on. Once the server
reboots you should find a vmkernel-zdump-* file in your server /root directory. This
file will be valuable for determining the cause. You can use the vmkdump utility to
extract the vmkernel log file from the dump (vmkdump -l &lt;zdump file&gt;) and examine it for
clues as to what caused the PSOD. VMware support will usually want this file as well. One
common cause of PSODs is defective server memory; the dump file will help identify
which memory module caused the problem so it can be replaced.
The RAM check utility runs in the VMkernel space and can be started by logging in to
the Service Console and typing service ramcheck start.
While RAM check is running it will log all activity and any errors to the
/var/log/vmware directory in files called ramcheck.log and ramcheck-err.log. One
drawback, however, is that it's hard to test all of your RAM with this utility if you
have virtual machines (VMs) running, as it will only test unused RAM in the ESX
system. A more thorough method of testing your server's RAM is to shut down ESX,
boot from a CD, and run Memtest86+.
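After the utility has run for a while, it is worth checking whether the error log actually recorded anything. The helper below is my own sketch, not a VMware tool; the log paths are the ones named above:

```shell
# ram_errors FILE — report any errors recorded in a ramcheck error log.
ram_errors() {
    logfile=$1
    if [ -s "$logfile" ]; then
        # Non-empty error log: show the recorded errors.
        echo "RAM errors detected in $logfile:"
        grep -i "error" "$logfile"
    else
        echo "no RAM errors logged"
    fi
}

# On an ESX host, check the path mentioned above:
# ram_errors /var/log/vmware/ramcheck-err.log
```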
You can verify the ESX version and build by typing vmware -v (which reports, for
example, build-32039), and list the installed patches by typing esxupdate -l query.
The vmware-vpxa
service is the management agent that handles all communication between ESX and its
clients. To restart it, log into the ESX host and type service vmware-vpxa restart. It
is important to note that restarting either of these services will not impact the
operation of your virtual machines (with the exception of the bug noted above).
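For reference, restarting both management agents from the Service Console can be sketched as below. The mgmt-vmware service name for hostd is an assumption here, since the earlier mention of the other service is not shown in this excerpt:

```shell
# Restart the ESX management agents from the Service Console (classic ESX).
# Neither restart affects running virtual machines.
service mgmt-vmware restart    # hostd, the host management agent (assumed name)
service vmware-vpxa restart    # vpxa, the VirtualCenter agent
```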
Conclusion
In this tip, I have addressed a few of the most common problems that can occur with
VMware ESX. In the next installment of this series, I will cover troubleshooting
VirtualCenter issues.
Usually when we experience a PSOD, we reboot the host (which is a must) and then gather
the logs and upload them to VMware support for analysis (where I spend a good amount of
time going through them).
Step 1:
I am going to simulate a PSOD on my ESXi host. You need to be logged in to the host over
SSH. The command is:
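The command itself is omitted in this excerpt. One commonly documented way to deliberately panic an ESXi host is the vsish "CrashMe" node, which is an assumption on my part, though it matches the CrashMe_VsiCommandSet frames in the back-trace later in this article:

```shell
# WARNING: this immediately purple-screens the host. Lab use only.
# Assumed command; shown here only as an illustration.
vsish -e set /reliability/crashMe/Panic 1
```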
Step 2:
Sometimes we might miss the screenshot of the PSOD. Well, that's alright! If we have a
core dump configured for the ESXi host, we can extract the dump files to gather the crash
logs. Reboot the host if it is stuck at the PSOD screen. Once the host is back up, log in
to the host over SSH/PuTTY and go to the core directory. The core directory is the
location where your PSOD logs go.
# cd /var/core
# ls -lh
Here you can see the vmkernel dump file, and the file is in the zdump format.
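If the dump is not where you expect it, a quick way to hunt for zdump files is a small helper like the one below (my own sketch; the locations vary by ESXi version):

```shell
# find_zdumps DIR... — list vmkernel zdump files under the given directories.
find_zdumps() {
    find "$@" -name 'vmkernel-zdump*' 2>/dev/null
}

# On an ESXi host, the usual suspects (assumed locations):
# find_zdumps /var/core /root
```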
Step 3:
How do we extract it?
Well, we have a nice extract script that does all the work, "vmkdump_extract". This
command must be executed against the zdump.1 file, which looks something like this:
# vmkdump_extract vmkernel-zdump.1
Open the extracted vmkernel log in a viewer such as less or vi and skip to the end of the
file by pressing Shift+G. Now let's slowly work back toward the top by pressing PageUp.
You will come across a line that says @BlueScreen: <event>
In my case, the dumps were:
2015-12-17T20:34:03.603Z cpu3:47209)@BlueScreen: CrashMe
2015-12-17T20:34:03.603Z cpu3:47209)Code start: 0x418021200000 VMK uptime: 0:01:14:16.524
2015-12-17T20:34:03.603Z cpu3:47209)0x412461a5dc10:[0x41802128d249]PanicvPanicInt@vmkernel#nover+0x575 stack: 0x726f632000000008
2015-12-17T20:34:03.603Z cpu3:47209)0x412461a5dc70:[0x41802128d48d]Panic_NoSave@vmkernel#nover+0x49 stack: 0x412461a5dcd0
2015-12-17T20:34:03.604Z cpu3:47209)0x412461a5dd60:[0x41802157a63b]CrashMeCurrentCore@vmkernel#nover+0x553 stack: 0x100000278
2015-12-17T20:34:03.604Z cpu3:47209)0x412461a5dda0:[0x41802157a8ca]CrashMe_VsiCommandSet@vmkernel#nover+0x13e stack: 0x0
2015-12-17T20:34:03.604Z cpu3:47209)0x412461a5de30:[0x41802160c3c7]VSI_SetInfo@vmkernel#nover+0x2fb stack: 0x41109d630330
2015-12-17T20:34:03.604Z cpu3:47209)0x412461a5dec0:[0x4180217bd7a7]UWVMKSyscallUnpackVSI_Set@<none>#<none>+0xef stack: 0x412461a67000
2015-12-17T20:34:03.604Z cpu3:47209)0x412461a5df00:[0x418021783a47]User_UWVMKSyscallHandler@<none>#<none>+0x243 stack: 0x412461a5df20
2015-12-17T20:34:03.604Z cpu3:47209)0x412461a5df10:[0x4180212aa90d]User_UWVMKSyscallHandler@vmkernel#nover+0x1d stack: 0xffbc0bb8
2015-12-17T20:34:03.604Z cpu3:47209)0x412461a5df20:[0x4180212f2064]gate_entry@vmkernel#nover+0x64 stack: 0x0
The first line, @BlueScreen, tells you the crash exception (for example, Exception 13 or
14); in my case it is CrashMe, which indicates a manually triggered crash.
The VMK uptime field tells you the kernel uptime before the crash.
The logging after that is the information we need to examine to find the cause of
the crash.
Now, the crash dump varies for every crash. The causes can range from hardware errors to
driver issues to problems with the ESXi build, and a lot more.
Each dump analysis will be different, but the basics are the same.
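As a starting point, the two lines worth pulling out of any extracted vmkernel log are the @BlueScreen exception and the VMK uptime. The helper below is my own sketch, not a VMware tool, and the example path is an assumption:

```shell
# summarize_psod FILE — print the crash exception and kernel uptime from a
# vmkernel log extracted with vmkdump_extract.
summarize_psod() {
    logfile=$1
    grep '@BlueScreen:' "$logfile"   # names the exception (e.g. CrashMe, Exception 14)
    grep 'VMK uptime' "$logfile"     # kernel uptime before the crash
}

# Usage on a host (assumed path of the extracted log):
# summarize_psod /var/core/vmkernel-log.1
```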
So, you can try analyzing the dumps by yourself. However, if you are entitled to VMware
support, I will do the job for you.
Cheers!