
Troubleshooting common VMware ESX host server problems

Get a grip on potential VMware ESX host server problems, including the purple screen of death, a frozen service console, and rebuilding your network configurations after they've been lost.
Panicking at the onset of a high-impact technical problem can lead to impulsive decisions that compound the problem. Before trying to troubleshoot any problem, pause and relax so you approach the task with a clear mind, then address each symptom, possible cause, and resolution in turn.
In this series, I offer solutions for many common problems that arise
with VMware ESX host servers, VirtualCenter, and virtual machines in
general. Let's begin by addressing common issues with VMware ESX host
servers.
Windows server administrators have long been familiar with the dreaded Blue Screen of Death (BSOD), which signifies a complete halt of the server. VMware ESX has a similar state called the purple screen of death (PSOD), which is typically caused by hardware problems or a bug in the VMware code.

Troubleshooting a purple screen of death


When a PSOD occurs, the first thing you want to do is note the information displayed on the screen. I suggest using a digital camera or cell phone to take a quick photo. The PSOD message consists of the ESX version and build, the exception type, a register dump, what was running on each CPU at the time of the crash, a back-trace, server up-time, error messages and memory core dump info. The information likely won't mean much to you, but VMware support can decipher it and help determine the cause of the crash.

Unfortunately, other than recording the information on the screen, your only option when experiencing a PSOD is to power the server off and back on. Once the server reboots you should find a vmkernel-zdump-* file in your server's /root directory. This file will be valuable for determining the cause. You can use the vmkdump utility to extract the vmkernel log file from it (vmkdump -l <dump file>) and examine it for clues as to what caused the PSOD. VMware support will usually want this file as well. One common cause of PSODs is defective server memory; the dump file will help identify which memory module caused the problem so it can be replaced.
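For example, a session on ESX 3.x might look like the following (the dump filename below is illustrative; yours will carry a different date and time stamp):

# cd /root
# ls vmkernel-zdump-*
vmkernel-zdump-073108.11.44
# vmkdump -l vmkernel-zdump-073108.11.44

This extracts a vmkernel log file from the dump, which you can then read with less or send along to VMware support.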

Checking your RAM for errors


If you suspect your system's RAM may be at fault, you can use a built-in utility to check your RAM in the background without affecting your running virtual machines.

The RAM check utility runs in the VMkernel space and can be started by logging into the Service Console and typing service ramcheck start.

While RAM check is running, it logs all activity and any errors to the /var/log/vmware directory in files called ramcheck.log and ramcheck-err.log. One drawback, however, is that it's hard to test all of your RAM with this utility if you have virtual machines (VMs) running, as it will only test unused RAM in the ESX system. A more thorough method of testing your server's RAM is to shut down ESX, boot from a CD, and run Memtest86+.
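A minimal session might look like this (assuming the ramcheck service script supports the usual stop action to end the test; the tail command just peeks at progress):

# service ramcheck start
# tail /var/log/vmware/ramcheck.log
# cat /var/log/vmware/ramcheck-err.log
# service ramcheck stop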

Using the vm-support utility


If you contact VMware support, they will usually ask you to run the vm-support utility, which packages all of the ESX server log and configuration files into a single file. To run this utility, simply log in to the service console with root access and type vm-support without any options. The utility will run and create a single tar file whose name begins with "esx-" followed by a date and time stamp, with a .tgz extension. You can send it via FTP to VMware support. Make sure you delete the tar file from the ESX server once you are done, to save disk space.
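For illustration, a run from the service console might look like this (the exact filename will include a date and time stamp):

# vm-support
# ls -lh esx-*.tgz

Copy the .tgz file off the host, for example via FTP, before deleting it.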
Alternatively, you can generate the same file by using the VMware Infrastructure Client (VI Client). Select Administration, then Export Diagnostic Data, then select your host (VirtualCenter data is optional) and a directory on your local PC in which to store the file that will be created.

Using log files for troubleshooting


Log files are generally your best tool for troubleshooting any type of problem. ESX has many log files; which ones you should check depends on the problem you are experiencing. Below is a list of the ESX log files you will commonly use to troubleshoot ESX server problems. The VMkernel and host agent (hostd) logs are usually the ones you will want to check first.

VMkernel - /var/log/vmkernel - Records activities related to the virtual machines and ESX server. Rotated with a numeric extension; the current log has no extension and the most recently rotated log has a ".1" extension.

VMkernel Warnings - /var/log/vmkwarning - Records activities with the virtual machines; a subset of the VMkernel log that uses the same rotation scheme.

VMkernel Summary - /var/log/vmksummary - Used to determine uptime and availability statistics for ESX Server; a human-readable summary is found in /var/log/vmksummary.txt.

ESX Server host agent log - /var/log/vmware/hostd.log - Contains information on the agent that manages and configures the ESX Server host and its virtual machines. (Search the file date/time stamps to find the log file it is currently writing to, or open hostd.log, which is linked to the current log file.)

ESX Firewall log - /var/log/vmware/esxcfg-firewall.log - Logs all firewall rule events.

ESX Update log - /var/log/vmware/esxupdate.log - Logs all updates done through the esxupdate tool.

Service Console - /var/log/messages - Contains all general log messages used to troubleshoot virtual machines or ESX Server.

Web Access - /var/log/vmware/webAccess - Records information on web-based access to ESX Server.

Authentication log - /var/log/secure - Contains records of connections that require authentication, such as VMware daemons and actions initiated by the xinetd daemon.

Vpxa log - /var/log/vmware/vpx - Contains information on the agent that communicates with VirtualCenter. Search the file date/time stamps to find the log file it is currently writing to, or open vpxa.log, which is linked to the current log file.
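When troubleshooting a live issue, it often helps to watch these logs as events occur; for example (the grep pattern below is only an illustration):

# tail -f /var/log/vmkernel
# grep -i scsi /var/log/vmkwarning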
As part of the troubleshooting process, you'll often need to find out the versions of various ESX components and which patches are applied. Below are some commands you can run from the service console to do this:

Type vmware -v to check the ESX Server version, i.e., VMware ESX Server 3.0.1 build-32039.

Type esxupdate -l query to see which patches are installed.

Type vpxa -v to check the ESX Server management agent version, i.e., VMware VirtualCenter Agent Daemon 2.0.1 build-40644.

Type rpm -qa | grep VMware-esx-tools to check the installed version of the ESX Server VMware Tools, i.e., VMware-esx-tools-3.0.1-32039.

If all else fails, restart the VMware host agent service
Many ESX problems can be resolved by simply restarting the VMware host agent service (vmware-hostd), which is responsible for managing most of the operations on the ESX host. To do this, log into the service console and type service mgmt-vmware restart.

NOTE: ESX 3.0.1 contained a bug that would restart all of your VMs if your ESX server was configured to use auto-startups for your VMs. This bug was fixed in a patch for 3.0.1 and also in 3.0.2, but appeared again in ESX 3.5, with another patch released to fix it. It's best to temporarily disable auto-startups before you run this command.
In some cases, restarting the vmware-vpxa service along with the host agent will fix problems that occur between ESX and both the VI Client and VirtualCenter. This service is the management agent that handles all communication between ESX and its clients. To restart it, log into the ESX host and type service vmware-vpxa restart. It is important to note that restarting either of these services will not impact the operation of your virtual machines (with the exception of the bug noted above).
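Putting the two restarts together, a typical session from the service console looks like this (remember to temporarily disable VM auto-startups first if your build is affected by the bug noted above):

# service mgmt-vmware restart
# service vmware-vpxa restart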

Fixing a frozen service console


Another problem that can occur is that your Service Console can hang and not allow you to log in locally. This can be caused by hardware lock-ups or a deadlock condition. Your VMs may continue to operate normally when this occurs, but rebooting ESX is usually the only way to recover. Before you do that, however, try shutting down your guest VMs and/or using VMotion to migrate them to another ESX host. To do this, use the VI Client, connect remotely via SSH, or use one of the alternate/emergency consoles, which you can access by pressing Alt-F2 through Alt-F6. You can also press Alt-F12 to display VMkernel messages on the console screen. If you are able to shut down or move your VMs, then you can try rebooting the server by issuing the reboot command through the VI Client or the alternate consoles. If not, cold-booting the server is your only option.

Lost network configurations


Another problem that can occur is losing part or all of your networking configuration. If this happens, you must rebuild your network from the local service console, since you will be unable to connect using the VI Client. VMware has published knowledgebase articles that detail how to rebuild your networking using the esxcfg-* service console commands and how to verify your network settings, as sketched below.
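As a rough sketch only (the vSwitch name, uplink NIC, port group label, and IP address below are illustrative; follow the knowledgebase articles for the exact procedure for your configuration):

# esxcfg-vswitch -a vSwitch0                    (create a virtual switch)
# esxcfg-vswitch -L vmnic0 vSwitch0             (link a physical NIC as its uplink)
# esxcfg-vswitch -A "Service Console" vSwitch0  (add the port group)
# esxcfg-vswif -a vswif0 -p "Service Console" -i 192.168.1.10 -n 255.255.255.0   (recreate the console interface)
# esxcfg-vswitch -l                             (verify the configuration)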

Conclusion
In this tip, I have addressed a few of the most common problems that can occur with
VMware ESX. In the next installment of this series, I will cover troubleshooting
VirtualCenter issues.

How To Analyze PSOD


The Purple Screen of Death, commonly known as a PSOD, is something most of us run into at some point when running an ESXi host.

Usually when we experience a PSOD, we reboot the host (which is a must) and then gather the logs and upload them to VMware support for analysis (where I spend a good amount of time going through them).

Why not take a look at the dumps by yourself?

Step 1:
I am going to simulate a PSOD on my ESXi host. You need to be logged in to the host over SSH. The command is

# vsish -e set /reliability/crashMe/Panic 1

When you then open a DCUI session to the ESXi host, you can see the PSOD.

Step 2:
Sometimes we might miss getting a screenshot of the PSOD. That's alright! If we have a core dump configured for the ESXi host, we can extract the dump files to gather the crash logs.

Reboot the host if it is sitting at the PSOD screen. Once the host is back up, log in to the host over SSH (e.g., with PuTTY) and go to the core directory, which is the location your PSOD dumps are written to:

# cd /var/core

Then list out the files here:

# ls -lh
Here you can see the vmkernel dump file, which is in the zdump format.

Step 3:
How do we extract it?

Well, we have a nice extraction script that does all the work: vmkdump_extract. This command must be executed against the zdump.1 file, which looks something like this:

# vmkdump_extract vmkernel-zdump.1

It creates four files:

a) vmkernel-log.1
b) vmkernel-core.1
c) visorFS.tar
d) vmkernel-pci

All we require for analysis is the vmkernel-log.1 file.
Step 4:
Open the vmkernel-log.1 file using the command below:

# less vmkernel-log.1

Skip to the end of the file by pressing Shift+G, then slowly work back toward the top by pressing PageUp. You will come across a line that says @BlueScreen: <event>
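If you prefer searching to paging, less can jump straight to the marker (? performs a backward search from your current position):

# less vmkernel-log.1
G                (jump to the end of the file)
?@BlueScreen     (search backward for the crash line, then press Enter)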
In my case, the dumps were:
2015-12-17T20:34:03.603Z cpu3:47209)@BlueScreen: CrashMe
2015-12-17T20:34:03.603Z cpu3:47209)Code start: 0x418021200000 VMK uptime: 0:01:14:16.524
2015-12-17T20:34:03.603Z cpu3:47209)0x412461a5dc10:[0x41802128d249]PanicvPanicInt@vmkernel#nover+0x575 stack: 0x726f632000000008
2015-12-17T20:34:03.603Z cpu3:47209)0x412461a5dc70:[0x41802128d48d]Panic_NoSave@vmkernel#nover+0x49 stack: 0x412461a5dcd0
2015-12-17T20:34:03.604Z cpu3:47209)0x412461a5dd60:[0x41802157a63b]CrashMeCurrentCore@vmkernel#nover+0x553 stack: 0x100000278
2015-12-17T20:34:03.604Z cpu3:47209)0x412461a5dda0:[0x41802157a8ca]CrashMe_VsiCommandSet@vmkernel#nover+0x13e stack: 0x0
2015-12-17T20:34:03.604Z cpu3:47209)0x412461a5de30:[0x41802160c3c7]VSI_SetInfo@vmkernel#nover+0x2fb stack: 0x41109d630330
2015-12-17T20:34:03.604Z cpu3:47209)0x412461a5dec0:[0x4180217bd7a7]UWVMKSyscallUnpackVSI_Set@<none>#<none>+0xef stack: 0x412461a67000
2015-12-17T20:34:03.604Z cpu3:47209)0x412461a5df00:[0x418021783a47]User_UWVMKSyscallHandler@<none>#<none>+0x243 stack: 0x412461a5df20
2015-12-17T20:34:03.604Z cpu3:47209)0x412461a5df10:[0x4180212aa90d]User_UWVMKSyscallHandler@vmkernel#nover+0x1d stack: 0xffbc0bb8
2015-12-17T20:34:03.604Z cpu3:47209)0x412461a5df20:[0x4180212f2064]gate_entry@vmkernel#nover+0x64 stack: 0x0

The first line, @BlueScreen:, tells you the crash exception, such as Exception 13 or 14; in my case it is CrashMe, which indicates a manual crash.
The VMK uptime field tells you the kernel up-time before the crash.
The logging after that is the information we need to look through for the cause of the crash.
Now, the crash dump varies for every crash. These issues can range from hardware errors to driver issues to issues with the ESXi build, and a lot more.
Each dump analysis is different, but the basics are the same.
So, you can try analyzing the dumps by yourself. However, if you are entitled to VMware support, I will do the job for you.

Cheers!
