Kernel Panics
Try to establish (as quickly as possible) the surface area of the problem:
Connectivity? (Telnet is good; if an error page comes back in the browser, something's obviously working - eliminate connectivity first.)
General App Pool failure, or specific to a content type? (Do .ASPX files fail while .HTM files work? Do you have canary files for each app and content type? A quick canary check is sketched after this list.)
Specific in-app failure, hang, or crash? (Most of this is for hangs and app failures; crashes dictate their own methodology: get a crash dump and debug it.)
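For a quick surface-area check, canary URLs for each content type can be probed from another machine - a rough sketch, assuming curl is available; the host and file names here are made up:

telnet webserver01 80
curl -I http://webserver01/canary.htm
curl -I http://webserver01/canary.aspx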
Collect Data
1. Grab whatever time-sensitive/timely data you will need to resolve the issue later. Don't
worry about persistent stuff - Event Logs and IIS logs stick around, unless you're a
compulsive clearer, in which case: stop it. (Those that don't have an Event Log of last week
are doomed to repeat it)
2. Determine the affected worker process
o APPCMD LIST WP can help with this, or the Worker Processes GUI at the Server level. (A sample is shown after this list.)
o If using the GUI, don't forget to look at the Current Requests by right-clicking the worker process - that'll show you which module the requests are jammed in.
3. Determine the scope (just one App Pool, multiple App Pools, two with dependencies - this depends on your app and website layout).
4. Grab a memory dump of the worker process - once you've identified which App Pool has the problem, identify the relevant Worker Process, and use Task Manager to create a memory dump by right-clicking that process. Note the filename for later.
On Task Manager: you need to use the same bitness of Task Manager as the Worker Process you're attacking with it - if you dump a 32-bit WP (w3wp*32) with 64-bit Task Manager, it's not going to be interpretable. If dumping a 32-bit process on 64-bit Windows, you need to exit Task Manager, run %WINDIR%\SYSWOW64\TaskMgr.exe to get the 32-bit version, then dump with the same bitness. (A ten-second detour, but you must do it at the time.)
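As referenced above, APPCMD LIST WP maps worker process IDs to App Pools; for example (output illustrative - the PIDs and App Pool names will differ on your server):

%windir%\system32\inetsrv\appcmd list wp
WP "4012" (applicationPool:DefaultAppPool)
WP "5288" (applicationPool:ContosoAppPool)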
Restore Service
1. Recycle the minimum number of Worker Processes in order to restore service.
o Don't bother stopping and starting Websites; you generally need the App Pool to be refreshed in order to get the site working again, and that's what a Recycle does.
o Note that recycling appears to happen on the next request to come in (even though the existing WP has been told to go away), so a worker process may not immediately reappear. That doesn't mean it hasn't worked, just that no requests are waiting.
o IISReset is usually a tool used by people that don't know better. Don't use it unless you need every website to terminate and restart all at once. (It's like trying to hammer a nail into a wall with a brick. It might work, but you look like an idiot, and there's going to be collateral damage.)
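If you prefer the command line, a recycle scoped to just the affected App Pool looks like this (the App Pool name is a placeholder):

%windir%\system32\inetsrv\appcmd recycle apppool /apppool.name:"ContosoAppPool"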
Determine Cause - i.e., look at and think about the data you've collected.
1. Take the logs and the memory dump, look for commonalities, engage the app developers,
debug the dump with DebugDiag 1.2, and so on.
Set up for next time
1. Don't assume it's the last occurrence - develop a plan for what you'll need to collect next
time, based on this time.
o For example, if the requests are all for the same URL, implement some additional
instrumentation or logging, or a Failed Request Tracing rule that'll help identify the
spot on the page that experiences a problem.
o Performance Monitor logs are helpful (if in doubt, get a perfmon log too).
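For example, a baseline Performance Monitor collector can be left running with logman so the data is already there next time; a sketch only - the collector name, counters, interval, and output path are examples to adapt:

logman create counter IIS-Baseline -c "\Process(w3wp)\% Processor Time" "\Process(w3wp)\Private Bytes" -si 00:00:15 -o C:\PerfLogs\IIS-Baseline
logman start IIS-Baseline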
Defective or incompatible RAM is often the cause of kernel panics. Despite being a highly reliable product, RAM can fail. Modern operating systems, like Mac OS X, are sensitive to RAM issues. Purchase additional RAM from either Apple or third parties who guarantee their RAM is compatible with Mac OS X, offer a liberal exchange policy, and provide a lifetime warranty should the RAM become defective or a later version of Mac OS X introduce incompatibilities.
Incompatible, obsolete, or corrupted kernel extensions. If a third-party kernel extension or one of its dependencies is incompatible or obsolete with respect to the version of Mac OS X you are using, kernel panics may occur when the kernel executes such extensions. Likewise, if a kernel extension or one of its dependencies is corrupted, such as from hard disk corruption, kernel panics are likely to occur when the kernel attempts to load or execute it.
Incompatible, obsolete, or corrupted drivers. Similar to kernel extensions, drivers for third-party hardware which are incompatible with the version of Mac OS X you are using, or which have become corrupted, will cause kernel panics.
Hard disk corruption, including bad sectors, directory corruption, and other hard-disk ills.
Incompatible hardware. While rare, this is generally the result of a third-party hardware vendor's product failing to respond to the kernel or a kernel extension in an expected way.
This is a continuation of our earlier kernel panic reference post (Redhat Enterprise Linux 6 Kernel Panic and System Crash Troubleshooting Quick Reference), where we discussed several types of kernel panic issues and their causes. In this post I will be talking about procedures and guidelines to diagnose and troubleshoot common kernel panic issues in Red Hat Linux.
By the way, please note that these are just guidelines for your knowledge; they don't guarantee a solution to your environment-specific issues. You need to take extreme care and precaution while troubleshooting issues which are very specific to your environment.
What is Kdump?
Starting in Red Hat Enterprise Linux 5, kernel crash dumps are captured using the kdump
mechanism. Kexec is used to start another complete copy of the Linux kernel in a reserved
area of memory. This secondary kernel takes over and copies the memory pages to the
crash dump location.
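On RHEL 5/6 the mechanism is provided by the kexec-tools package, and the kdump service needs to be enabled - a minimal sketch:

# yum install kexec-tools
# chkconfig kdump on
# service kdump start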
In brief, configuring kdump involves the following steps: reserve memory for the crash kernel with the crashkernel= boot parameter (the reservation scales with installed RAM - see the table, guidelines, and GRUB example below), then configure /etc/kdump.conf (Step 3 below).

| 0 - 2G   | 128M | 16 |
| 2G - 6G  | 256M | 24 |
| 6G - 8G  | 512M | 16 |
| 8G - 24G | 768M | 32 |
Example for RHEL 6:
title Red Hat Enterprise Linux Server (2.6.32-71.7.1.el6.x86_64)
root (hd0,0)
kernel /vmlinuz-2.6.32-71.7.1.el6.x86_64 ro
root=/dev/mapper/vg_example-lv_root rd_LVM_LV=vg_example/lv_root
rd_LVM_LV=vg_example/lv_swap rd_NO_LUKS rd_NO_MD rd_NO_DM
LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc
KEYTABLE=us crashkernel=128M rhgb quiet
initrd /initramfs-2.6.32-71.7.1.el6.x86_64.img
Guidelines for Crash Kernel Reserved Memory Settings:
crashkernel=0M-2G:128M,2G-6G:256M,6G-8G:512M,8G-:768M

Ram Size   CrashKernel
>0GB       128MB
>2GB       256MB
>6GB       512MB
>8GB       768MB
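After adding the crashkernel= parameter and rebooting, the reservation can be verified (the addresses and sizes below are only illustrative):

# dmesg | grep -i crashkernel
Reserving 128MB of memory at 48MB for crashkernel (System RAM: 2047MB)
# grep -i "crash kernel" /proc/iomem
  03000000-0affffff : Crash kernel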
Step 3. Configure /etc/kdump.conf
3A) Specify the destination for the output of kexec, i.e. the vmcore. The following destinations can be used:
raw device: raw /dev/sda4
file system: ext3 /dev/sda3 (this will dump the vmcore to /dev/sda3 under /var/crash)
NFS share: net nfs.example.com:/export/vmcores
another system via SSH: net [email protected]
3B) Configure the core collector, to discard unnecessary memory pages and compress only the pages that are needed. The dump level is the sum of the page types to discard (a sample core_collector line follows the table below):
Option   Discard
1        Zero pages
2        Cache pages
4        Cache private
8        User pages
16       Free pages
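As mentioned in 3B, a commonly used core collector setting in /etc/kdump.conf is the one below: -c compresses the dump, and -d 31 (1+2+4+8+16) discards all five page types listed above.

core_collector makedumpfile -c -d 31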
log - Displays the kernel ring buffer log. On a running system, dmesg also displays the kernel ring buffer log.
Often this can capture log messages that were not written to disk due to the crash.
crash> log
...snip...
SysRq : Trigger a crash
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff8130e126>] sysrq_handle_crash+0x16/0x20
PGD 7a602067 PUD 376ff067 PMD 0
Oops: 0002 [#1] SMP
crash> bt
PID: 6875   TASK: ffff88007a3aaa70   CPU: 0   COMMAND: bash
 #0 [ffff88005f0f5de8] sysrq_handle_crash at ffffffff8130e126
 #1 [ffff88005f0f5e20] __handle_sysrq at ffffffff8130e3e2
 #2 [ffff88005f0f5e70] write_sysrq_trigger at ffffffff8130e49e
 #3 [ffff88005f0f5ea0] proc_reg_write at ffffffff811cfdce
 #4 [ffff88005f0f5ef0] vfs_write at ffffffff8116d2e8
 #5 [ffff88005f0f5f30] sys_write at ffffffff8116dd21
#6 [ffff88005f0f5f80] system_call_fastpath at ffffffff81013172
RIP: 00000037702d4230 RSP: 00007fff85b95f40 RFLAGS: 00010206
RAX: 0000000000000001 RBX: ffffffff81013172 RCX: 0000000001066300
RDX: 0000000000000002 RSI: 00007f04ae8d2000 RDI: 0000000000000001
RBP: 00007f04ae8d2000 R8: 000000000000000a R9: 00007f04ae8c4700
R10: 00000000ffffffff R11: 0000000000000246 R12: 0000000000000002
R13: 0000003770579780 R14: 0000000000000002 R15: 0000003770579780
ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b
sys - Displays system data; the same information displayed when crash starts.
crash> sys
DUMPFILE: /tmp/vmcore [PARTIAL DUMP]
CPUS: 2
DATE: Thu May 5 14:32:50 2011
UPTIME: 00:01:15
LOAD AVERAGE: 1.19, 0.34, 0.12
TASKS: 252
NODENAME: rhel6-desktop
RELEASE: 2.6.32-220.23.1.el6.x86_64
VERSION: #1 SMP Mon Oct 29 19:45:17 EDT 2012
MACHINE: x86_64 (3214 Mhz)
MEMORY: 2 GB
PANIC: Oops: 0002 [#1] SMP (check log for details)
PID: 6875
COMMAND: bash
TASK: ffff88007a3aaa70 [THREAD_INFO: ffff88005f0f4000]
CPU: 0
STATE: TASK_RUNNING (PANIC)
crash>
...snip...
CPU 0: Machine Check Exception: 0000000000000004
Kernel panic - not syncing: Unable to continue

Redirect Crash Output to Regular Commands
Crash command output can be piped to regular shell commands such as fgrep or wc, or redirected to a file.
Example 1:
crash> ps | fgrep bash
Example 2:
crash> log > log.txt
crash> dmesg | wc -l
Sample Scenario 1:
System hangs or kernel panics with MCE (Machine Check Exception) in the /var/log/messages file.
The system was not responding. Checked the messages on the netdump server and found the following messages:
Kernel panic - not syncing: Machine check.
System crashes under load.
System crashed and rebooted.
Machine Check Exception panic
Observations:
Look for the phrase Machine Check Exception in the log just before the panic message. If this
message occurs, the rest of the panic message is of no interest.
Analyze vmcore
$ crash /path/to/2.6.18-128.1.6.el5/vmlinux vmcore
KERNEL: ./usr/lib/debug/lib/modules/2.6.18-128.1.6.el5/vmlinux
DUMPFILE: 563523_vmcore [PARTIAL DUMP]
CPUS: 4
DATE: Thu Feb 21 00:32:46 2011
UPTIME: 14 days, 17:46:38
LOAD AVERAGE: 1.14, 1.20, 1.18
TASKS: 220
NODENAME: gurkulnode1
RELEASE: 2.6.18-128.1.6.el5
VERSION: #1 SMP Tue Mar 24 12:05:57 EDT 2009
MACHINE: x86_64 (2599 Mhz)
MEMORY: 7.7 GB
PANIC: Kernel panic not syncing: Uncorrected machine check
PID: 0
COMMAND: swapper
TASK: ffffffff802eeae0 (1 of 4) [THREAD_INFO: ffffffff803dc000]
CPU: 0
STATE: TASK_RUNNING (PANIC)
crash> log
Machine Check Exceptions generally point to a hardware (frequently memory) fault, so testing the RAM with memtest86+ is a sensible next step. The memtest86+ tool is available as an RPM package from Red Hat Network (RHN) as well as a boot option from the Red Hat Enterprise Linux rescue disk.
To boot memtest86+ from the rescue disk, you will need to boot your system from CD 1 of the Red
Hat Enterprise Linux installation media, and type the following at the boot prompt (before the Linux
kernel is started):
boot: memtest86
If you would rather install memtest86+ on the system, here is an example of how to do it on a Red
Hat Enterprise Linux 5 machine registered to RHN:
# yum install memtest86+
For Red Hat Enterprise Linux 4, run the following command to install memtest86+. Make sure the system has been registered with RHN:
# up2date -i memtest86+
Then you will have to configure it to run on the next reboot:
# memtest-setup
After reboot, the GRUB menu will list memtest. Select this item and it will start testing the memory.
Please note that once memtest86+ is running it will never stop unless you interrupt it by pressing the
Esc key. It is usually a good idea to let it run for a few hours so it has time to test each block of
memory several times.
memtest86+ may not always find all memory problems. It is possible that the system memory can
have a fault that memtest86+ does not detect.
Sample Scenario 2
Console screen shows the messages below:
Northbridge Error, node 1, core: -1
K8 ECC error.
EDAC amd64 MC1: CE ERROR_ADDRESS= 0x101a793400
EDAC MC1: INTERNAL ERROR: row out of range (-22 >= 8)
EDAC MC1: CE no information available: INTERNAL ERROR
EDAC MC1: CE no information available: amd64_edacError Overflow
Observations:

Sample Scenario 4
Console shows the following error message:
NMI: IOCK error (debug interrupt?)
CPU 0
Modules linked in: ipt_MASQUERADE iptable_nat ip_nat xt_state
ip_conntrack nfnetlink ipt_REJECT xt_tcpudp iptable_filter
ip_tables x_tables bridge mptctl mptbase bonding be2iscsi ib_iser
rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i
cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i cxgb3 8021q libiscsi_tcp
libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi dm_round_robin
dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec
i2c_core dell_wmi wmi button battery asus_acpi acpi_memhotplug ac
parport_pc lp parport joydev sr_mod cdrom hpilo bnx2 serio_raw
shpchp pcspkr sg dm_raid45 dm_message dm_region_hash dm_mem_cache
Under RHEL6, the kernel.panic_on_io_nmi = 1 sysctl can be set to have the system panic when an
I/O NMI is received.
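For example (a sketch; persist the setting in /etc/sysctl.conf if it proves useful):

# sysctl -w kernel.panic_on_io_nmi=1
# echo "kernel.panic_on_io_nmi = 1" >> /etc/sysctl.conf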
Kernel Panic - A voluntary halt to all system activity when an abnormal situation is detected by the kernel. A kernel panic is an action taken by an operating system upon detecting an internal fatal error from which it cannot safely recover. In Linux, these kernel panics can be caused by different reasons, including the NMI-related sysctl settings discussed below:
o unknown_nmi_panic
o panic_on_unrecovered_nmi
o panic_on_io_nmi
Sample Scenario 1:
System hangs or kernel panics with MCE (Machine Check Exception) in /var/log/messages file.
Normally, EDAC errors are raised by a hardware mechanism that detects and reports memory chip and PCI transfer errors; they are reported in /sys/devices/system/edac/{mc/,pci} and logged by the kernel as:
EDAC MC0: CE page 0x283, offset 0xce0, grain 8,
syndrome 0x6ec3, row 0, channel 1 DIMM_B1:
amd76x_edac
All informational EDAC messages (such as a corrected ECC error) are printed to the system log, whereas critical EDAC messages (such as exceeding a hardware-defined temperature threshold) trigger a kernel panic.
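A quick way to check whether corrected/uncorrected errors are accumulating is to read the EDAC counters in sysfs (a sketch; the csrow layout varies by memory controller):

# grep . /sys/devices/system/edac/mc/mc*/csrow*/ce_count
# grep . /sys/devices/system/edac/mc/mc*/csrow*/ue_count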
Sample Scenario 2:
Console screen shows the messages below:
Northbridge Error, node 1, core: -1
K8 ECC error.
EDAC amd64 MC1: CE ERROR_ADDRESS= 0x101a793400
EDAC MC1: INTERNAL ERROR: row out of range (-22 >= 8)
EDAC MC1: CE no information available: INTERNAL ERROR
EDAC MC1: CE no information available: amd64_edacError Overflow
Troubleshooting procedure posted here: Redhat Enterprise Linux Troubleshooting Kernel Panic issues Part 2
A non-maskable interrupt (NMI) is an interrupt that cannot be ignored or masked out by standard operating system mechanisms. It is generally used only for critical hardware errors; however, recent changes in behavior have added additional functionality:

1) NMI button.
The NMI button can be used to signal the operating system when other standard input mechanisms (keyboard, ssh, network) have ceased to function. It can be used to create an intentional panic for additional debugging. It may not always be a physical button; it may be presented through an iLO or DRAC interface.
Unknown NMIs - The kernel has mechanisms to handle certain known NMIs appropriately; unknown ones typically result in kernel log warnings such as:
Uhhuh. NMI received.
Dazed and confused, but trying to continue
You probably have a hardware problem with your RAM chips
Uhhuh. NMI received for unknown reason 32.
Dazed and confused, but trying to continue.
Do you have a strange power saving mode enabled?
These unknown NMI messages can be produced by ECC and other hardware problems. The kernel can be configured to panic when these are received through this sysctl:
kernel.unknown_nmi_panic=1
This is generally only enabled for troubleshooting.
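For example, to enable it at runtime and confirm the setting (remove it again once the investigation is done):

# sysctl -w kernel.unknown_nmi_panic=1
# sysctl kernel.unknown_nmi_panic
kernel.unknown_nmi_panic = 1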
Sample Scenario 3:
2) Watchdog-like software on the system that monitors for perceived system hangs.
The NMI watchdog monitors system interrupts and sends an NMI if the system appears to have hung. On a normal system, hundreds of device and timer interrupts are received per second. If there are no interrupts in a 30 second interval, the NMI watchdog assumes that the system has hung and sends an NMI to the system to trigger a kernel panic or restart. The watchdog is enabled via a kernel boot parameter such as nmi_watchdog=2.
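A sketch of enabling and verifying it on RHEL 5/6 - append the parameter to the kernel line in /boot/grub/grub.conf (the kernel version and root device below are only examples), reboot, and check that the NMI counters in /proc/interrupts keep increasing:

kernel /vmlinuz-2.6.32-220.el6.x86_64 ro root=/dev/mapper/vg_example-lv_root crashkernel=128M nmi_watchdog=2

# grep NMI /proc/interrupts
NMI:       1244       1187   Non-maskable interrupts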
unknown_nmi_panic
A feature was introduced in kernel 2.6.9 which makes it easier to diagnose system hangs on specific hardware.
The feature utilizes the kernel's behavior when dealing with unknown NMI sources: rather than handling the unknown NMI source, the kernel is allowed to panic. This feature cannot be used on systems that also use the NMI watchdog or oprofile (and other tools that use performance-metric features), as both of these also make use of the undefined NMI interrupt. If unknown_nmi_panic is activated with one of these features present, it will not work.
Note that this is a user-initiated interrupt which is really most useful for helping to diagnose a
system that is experiencing system hangs for unknown reasons.
To enable this feature, set the following system control parameter in the /etc/sysctl.conf file as
follows:
kernel.unknown_nmi_panic = 1
To disable, set:
kernel.unknown_nmi_panic = 0
Once this change has taken effect, a panic can be forced by pushing the system's NMI switch. Systems that do not have an NMI switch can still use the NMI watchdog feature, which will automatically generate an NMI if a system hang is detected.
panic_on_unrecovered_nmi
Some systems may generate an NMI based on vendor configuration, such as power management, low battery, etc. It may be important to set this if your system is generating NMIs in a known-working environment.
To enable this feature, set the following system control parameter in the /etc/sysctl.conf file as
follows:
kernel.panic_on_unrecovered_nmi = 1
To disable, set:
kernel.panic_on_unrecovered_nmi = 0
panic_on_io_nmi
This setting is only available in Red Hat Enterprise Linux 6. When set, it will cause a kernel panic when the kernel receives an I/O NMI.
Sample Scenario 5:
NFS client kernel crash because async task already queued
hitting BUG_ON(RPC_IS_QUEUED(task)); in __rpc_execute
kernel BUG at net/sunrpc/sched.c:616!
invalid opcode: 0000 [#1] SMP
last sysfs file:
/sys/devices/system/cpu/cpu15/cache/index2/shared_cpu_map
CPU 8
Modules linked in: nfs lockd fscache nfs_acl auth_rpcgss
pcc_cpufreq sunrpc power_meter hpilo
hpwdt igb mlx4_ib(U) mlx4_en(U) raid0 mlx4_core(U) sg microcode
serio_raw iTCO_wdt
iTCO_vendor_support ioatdma dca shpchp ext4 mbcache jbd2 raid1
sd_mod crc_t10dif mpt2sas
scsi_transport_sas raid_class ahci dm_mirror dm_region_hash dm_log
dm_mod
[last unloaded: scsi_wait_scan]
Pid: 2256, comm: rpciod/8 Not tainted 2.6.32-220.el6.x86_64 #1 HP
ProLiant SL250s Gen8/
RIP: 0010:[<ffffffffa01fe458>] [<ffffffffa01fe458>]
__rpc_execute+0x278/0x2a0 [sunrpc]
Process rpciod/8 (pid: 2256, threadinfo ffff882016152000, task
ffff8820162e80c0)
Call Trace:
[<ffffffffa01fe4d0>] ? rpc_async_schedule+0x0/0x20 [sunrpc]
[<ffffffffa01fe4e5>] rpc_async_schedule+0x15/0x20 [sunrpc]
[<ffffffff8108b2b0>] worker_thread+0x170/0x2a0
[<ffffffff81090bf0>] ? autoremove_wake_function+0x0/0x40
[<ffffffff8108b140>] ? worker_thread+0x0/0x2a0
[<ffffffff81090886>] kthread+0x96/0xa0
[<ffffffff8100c14a>] child_rip+0xa/0x20
Code: db df 2e e1 f6 05 e0 26 02 00 40 0f 84 48 fe ff ff 0f b7 b3
d4 00 00 00 48 c7
c7 94 39 21 a0 31 c0 e8 b9 df 2e e1 e9 2e fe ff ff <0f> 0b eb fe 0f
b7 b7 d4 00 00 00
31 c0 48 c7 c7 60 63 21 a0 e8
RIP [<ffffffffa01fe458>] __rpc_execute+0x278/0x2a0 [sunrpc]
This kind of kernel panic typically indicates a programming error and normally appears as below:
NULL pointer dereference at 0x1122334455667788 ..
or
Unable to handle kernel paging request at virtual address 0x11223344
One of the most common reasons for this kind of error is memory corruption.
Sample Scenario 6:
NFS client kernel panics when doing an ls in the directory of a snapshot that has
already been removed.
NFS client kernel panics under certain conditions when connected to NFS server
Software: Pseudo-hangs
These are common situations we encounter where the system appears to be hung but some progress is being made. There are several reasons for this kind of behaviour:
Livelock - if running a realtime kernel, application load could be too high, leading the system into a state where it becomes effectively unresponsive in a livelock/busy-wait state. The system is not actually hung, but just moving so slowly that it appears to be hung.
Thrashing - continuous swapping with close to no useful processing done.
Lower zone starvation - on i386 the low memory has a special significance and the system may hang even when there's plenty of free memory.
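As a rough first check for the thrashing case, swap activity can be watched with vmstat; sustained non-zero si/so columns while the run queue stays busy suggest the machine is swapping heavily rather than truly hung:

# vmstat 5 5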
Normally, hangs which are not detected by the hardware are trickier to debug:
Sample Scenario 7:
The system frequently hangs and the following error messages are logged in the /var/log/messages file while performing IO operations on the /dev/cciss/xx devices:
INFO: task cmaperfd:5628 blocked for more than 120 seconds.
echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this
message.
cmaperfd D ffff810009025e20 0 5628 1 5655 5577 (NOTLB)
ffff81081bdc9d18 0000000000000082 0000000000000000 0000000000000000
0000000000000000 0000000000000007 ffff81082250f040 ffff81043e100040
0000d75ba65246a4 0000000001f4db40 ffff81082250f228 0000000828e5ac68
Call Trace:
[<ffffffff8803bccc>] :jbd2:start_this_handle+0x2ed/0x3b7
[<ffffffff800a3c28>] autoremove_wake_function+0x0/0x2e
[<ffffffff8002d0f4>] mntput_no_expire+0x19/0x89
[<ffffffff8803be39>] :jbd2:jbd2_journal_start+0xa3/0xda
[<ffffffff8805e7b0>] :ext4:ext4_dirty_inode+0x1a/0x46
[<ffffffff80013deb>] __mark_inode_dirty+0x29/0x16e
[<ffffffff80041bf5>] inode_setattr+0xfd/0x104
[<ffffffff8805e70c>] :ext4:ext4_setattr+0x2db/0x365
[<ffffffff88055abc>] :ext4:ext4_file_open+0x0/0xf5
[<ffffffff8002cf2b>] notify_change+0x145/0x2f5
[<ffffffff800e45fe>] sys_fchmod+0xb3/0xd7
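When tasks are blocked like this, a snapshot of all blocked (D-state) task stacks can be requested via SysRq on a system that is still partially responsive, which often shows where the I/O is stuck (assumes SysRq is enabled):

# echo 1 > /proc/sys/kernel/sysrq
# echo w > /proc/sysrq-trigger
# dmesg | tail -n 60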
Sample Scenario 8:
When the system panics, kdump starts, but kdump then hangs and does not output a vmcore. The following error messages are seen on the console:
Kernel panic - not syncing: Out of memory and no killable processes...
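This usually indicates that the memory reserved for the crash kernel is too small for the kdump environment and core collector to run. A common remedy is to increase the crashkernel= reservation in /boot/grub/grub.conf (following the sizing guidelines earlier in this post) and reboot; the kernel line and value below are only an example:

kernel /vmlinuz-2.6.32-220.el6.x86_64 ro root=/dev/mapper/vg_example-lv_root crashkernel=256M rhgb quiet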