Troubleshooting XenServer Deployments
Agenda
Case Study: Production down
Learn: XenServer crash
Case study: Single-pathing
Q&A
Production down
SR
SR PBD PBD
Broken storage
What is broken?
PBD = Physical Block Device
Volume Group Name: <Prefix>+SR UUID
PBD PBD
SCSI ID
XenServer_1 XenServer_2
SR
has UUID (unique ID)
# xe pbd-list currently-attached=false
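To see which SRs are affected, the unplugged PBD records can be reduced to their SR UUIDs. This is a minimal sketch, assuming the usual `name ( RO): value` layout of `xe` output; the function name `unplugged_sr_uuids` is made up for this example.

```shell
# Print the SR UUID of every PBD record read from stdin.
# Intended input: the output of `xe pbd-list currently-attached=false`.
# Assumption: each record has a line of the form "sr-uuid ( RO): <uuid>".
unplugged_sr_uuids() {
    awk -F': ' '/^[[:space:]]*sr-uuid / { print $2 }'
}
```

On a host this would be used as `xe pbd-list currently-attached=false | unplugged_sr_uuids`.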
Storage troubleshooting
Goal: Reproduce and analyse the logs
PBD unplugged
Plugging PBD manually
# grep PBD.plug xensource.log
Volume Group
What is VG?
HDD / LUN
Virtual Disk
Logical Volume (LV)
HDD / LUN
HDD / LUN
Storage Repository
SR
Volume Group
Matching the UUID
# vgs
VG                                                 #PV #LV #SN Attr   VSize   VFree
VG_XenStorage-090d4717-9f91-92de-83c3-5458274802e9   1  18   0 wz--n-  89.99G  19.48G
VG_XenStorage-5239de43-6a74-0365-f825-b799aa6de853   1   2   0 wz--n- 129.07G 129.05G
VG_XenStorage-70a029cf-7f35-c035-4af7-07eaf31e2e88   1  11   0 wz--n-  49.99G   2.84G
VG_XenStorage-9be18df5-3fd2-4835-b864-d0ffbccbaeb3   1   1   0 wz--n-   1.99G   1.98G
# vgs 'VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4'
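Matching the SR UUID against the volume group list can be scripted. The sketch below assumes the VG name is `VG_XenStorage-` followed by the SR UUID, as in the listing above; `sr_has_vg` is a hypothetical helper name.

```shell
# Succeed (exit 0) if a volume group for the given SR UUID appears in the
# `vgs` output read from stdin, fail (exit 1) otherwise.
# Assumption: the VG name is "VG_XenStorage-" followed by the SR UUID.
sr_has_vg() {
    sr_uuid=$1
    awk -v vg="VG_XenStorage-$sr_uuid" '
        $1 == vg { found = 1 }
        END      { exit !found }
    '
}
```

For example: `vgs --noheadings -o vg_name | sr_has_vg <SR UUID> || echo "VG missing"`.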
Examining HDD/LUN
Checking SCSI ID
PBD
SCSI ID
Examining HDD/LUN
Can Linux kernel see this block device? (SCSI device)
# hdparm -t /dev/disk/by-id/scsi-360a98045234t654...
Timing buffered disk reads: 138 MB in 3.02 seconds = 45.68 MB/sec (LUN readable!)
/dev/mapper/360a9800050334f4963345767656c546
Also check /dev/disk/by-path
Examining HDD/LUN
Is the LUN empty?
...
ID_FS_TYPE=LVM2_member
...
If this is an LVM member, why is there no VG on it?
Examining HDD/LUN
Is there a VG created on PV?
# pvs
PV                              VG                                     Fmt  Attr PSize   PFree
/dev/mapper/360a9800050334f4963 VG_XenStorage-090d4717-9f91-92de-83c3- lvm2 a-    89.99G  19.48G
/dev/mapper/360a9800050334f4963 VG_XenStorage-70a029cf-7f35-c035-4af7- lvm2 a-    49.99G   2.84G
/dev/mapper/360a9800050334f4963 VG_XenStorage-19856cba-830c-e298-79fa- lvm2 a-    14.99G   6.45G
/dev/mapper/360a9800050334f4965 VG_XenStorage-9be18df5-3fd2-4835-b864- lvm2 a-     1.99G   1.98G
/dev/sda3                       VG_XenStorage-5239de43-6a74-0365-f825- lvm2 a-   129.07G 129.05G
# pvs |grep 360a9800050334f496334595a32306431
/dev/mapper/360a9800050334f496334595a32306431 VG_XenStorage-332432-430d-3423-4332434-5485974 lvm2 a- 14.99G 14.99G
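To read off which volume group currently sits on a given LUN, the `pvs` output can be filtered by PV path. A minimal sketch; `vg_on_pv` is a hypothetical helper name, and it assumes the PV path is the first column and the VG name the second.

```shell
# Print the VG name that `pvs` reports for one PV.
# Reads `pvs` output on stdin; the first argument is the PV device path.
# Prints nothing if the PV is not listed.
# Assumption: column 1 is the PV path and column 2 the VG name.
vg_on_pv() {
    awk -v pv="$1" '$1 == pv { print $2 }'
}
```

For example: `pvs --noheadings -o pv_name,vg_name | vg_on_pv /dev/mapper/<SCSI ID>`. If the name printed does not end in the SR UUID, the VG on that LUN is not the one the SR expects.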
Volume Group
...has been recreated!
Volume Group
Looking for LVM metadata backup
/etc/lvm/backup/VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4
Check backup timestamp (within the file)
LVs in backup file
# cat /etc/lvm/backup/VG... | grep VHD
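Both checks on the backup file (timestamp and LV list) can be done in one pass. The sketch below assumes the file contains a top-level `creation_time = ...` line and that the LV names start with `VHD-`, as the grep above suggests; `backup_summary` is a hypothetical helper name.

```shell
# Summarise an LVM metadata backup file: when it was written and which
# VHD logical volumes it describes.
# Assumptions: a "creation_time = ..." line exists at the start of a line,
# and LV names have the form "VHD-<uuid>".
backup_summary() {
    file=$1
    grep '^creation_time' "$file"
    grep -o 'VHD-[0-9a-f-]*' "$file" | sort -u
}
```

For example: `backup_summary /etc/lvm/backup/VG_XenStorage-<SR UUID>`. A backup taken after the VG was recreated would list no VHD volumes and is not useful for the restore.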
LV LV LV
Volume Group
Removing new VG and PV
Volume Group
Recreating PV and VG from backup
# pvcreate --uuid <PV uuid from backup file> --restorefile /etc/lvm/backup/VG_XenStorage-<SR_UUID> /dev/mapper/<SCSI ID>
# vgcfgrestore VG_XenStorage-<SR_UUID> -f /etc/lvm/backup/VG_XenStorage-<SR_UUID>
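The PV UUID needed for `pvcreate --uuid` has to be read out of the backup file. A minimal sketch, assuming the usual backup layout in which the PV's `id = "..."` line sits inside the `physical_volumes` section (the VG's own `id` line comes before it and must be skipped); `pv_uuid_from_backup` is a hypothetical helper name.

```shell
# Print the UUID of the first physical volume recorded in an LVM metadata
# backup file, for use with `pvcreate --uuid`.
# Assumption: the first id = "..." line after "physical_volumes" belongs
# to the PV; only the first PV is reported.
pv_uuid_from_backup() {
    awk '
        /physical_volumes/ { in_pv = 1; next }
        in_pv && /^[[:space:]]*id[[:space:]]*=/ {
            gsub(/"/, "", $3)   # strip the quotes around the UUID
            print $3
            exit
        }
    ' "$1"
}
```

For example: `pv_uuid_from_backup /etc/lvm/backup/VG_XenStorage-<SR UUID>`. Compare the result against the file by eye before running `pvcreate`.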
Examining HDD/LUN
Confirm that VG name contains SR uuid...
14.99G
Volume Group
Checking Logical Volumes
# lvs
4.00M
Storage Repository
Plugging PBD again...
# xe pbd-plug uuid=
Success! But no VDIs shown...
# xe sr-scan uuid=
Error code: SR_BACKEND_FAILURE_46
Error parameters: , The VDI is not available [opterr=Error scanning VDI 7e5f83a7-b6c4-4fae-9899-1e6a2cdabd32]
# xe vdi-list uuid=<VDI UUID from the error above>
# lvremove /dev/VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4/VHD-7e5f83a7-b6c4-4fae-9899-1e6a2cdabd32
# xe sr-scan uuid=
Success! All VDIs shown...
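The VDI UUID used in the steps above comes straight from the scan error. A minimal sketch for pulling it out of the message, assuming the `Error scanning VDI <uuid>` wording shown in the error; `failing_vdi_uuid` is a hypothetical helper name.

```shell
# Print the VDI UUID named in an sr-scan error message read from stdin.
# Assumption: the message contains "Error scanning VDI <uuid>".
# Prints nothing if no such text is present.
failing_vdi_uuid() {
    sed -n 's/.*Error scanning VDI \([0-9a-f-]*\).*/\1/p'
}
```

The matching logical volume is then `/dev/VG_XenStorage-<SR UUID>/VHD-<that UUID>`. Confirm with `xe vdi-list` that the VDI is not known to the pool before removing anything.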
Well done!
Serial console
Boot the host to the console CTX120540 & reboot
Generate crashdump CTX120540 & reboot
No serial console
Review crashdump
Review crashdump
HA activity, page fault, driver, storage issues
CPU stack - to be analysed by Citrix Tech Support
Citrix Confidential - Do Not Distribute
Investigating crash.log
XenConsole ring
located at the bottom of the file
(XEN) Watchdog timer fired for domain 0
(XEN) Domain 0 shutdown: watchdog rebooting machine.
Why was the watchdog triggered? /var/log/xha.log (Network or Storage heartbeat failed)
Why did the heartbeat fail? /var/log/messages (DMP, kernel, drivers, I/O errors)
Review crashdump (cont)
Investigating crash.log
Page fault
Other examples:
(XEN) ****************************************
(XEN) Panic on CPU 6:
(XEN) FATAL TRAP: vector = 14 (page fault)
(XEN) [error_code=0000] , IN INTERRUPT CONTEXT
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...
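A first pass over crash.log can be automated by looking for the two signatures shown above. This is only a sketch: it checks for exactly these two patterns and nothing else, and `crash_signature` is a hypothetical helper name.

```shell
# Report which of the two crash signatures appear in a crash.log read
# from stdin: a watchdog reboot of domain 0 or a page fault panic.
# Prints "unknown" when neither pattern is found.
crash_signature() {
    awk '
        /Watchdog timer fired/    { watchdog = 1 }
        /FATAL TRAP: vector = 14/ { pagefault = 1 }
        END {
            if (watchdog)  print "watchdog"
            if (pagefault) print "page fault"
            if (!watchdog && !pagefault) print "unknown"
        }
    '
}
```

For example: `crash_signature < crash.log`. A "watchdog" result points to /var/log/xha.log as the next file to read.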
Single-Pathing
96 MB in 3.07 seconds =
Storage Performance
Checking multipath status
# mpathutil status
/dev/mapper/....
\_ round-robin 0 [prio=4][enabled]
\_ 3:0:0:2 sdk 8:160 [active][ready]
\_ 4:0:0:2 sdj 8:144 [active][ready]
/dev/
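Counting the paths reported as healthy is a quick sanity check. A minimal sketch, assuming healthy paths are marked `[active][ready]` as in the output above; `active_path_count` is a hypothetical helper name.

```shell
# Count paths reported as [active][ready] in multipath status output
# read from stdin. Counts occurrences, not lines.
active_path_count() {
    awk '{ n += gsub(/\[active\]\[ready\]/, "&") } END { print n + 0 }'
}
```

For example: `mpathutil status | active_path_count`. A result of 2 only says that both paths are up; it does not show that traffic is actually spread across them.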
Storage Performance
Determining current performance on domain0
Storage Performance
Determining usage of paths
Storage Performance
Checking if there are really 2 iSCSI sessions
# ls -alR /dev/disk/by-path/ | egrep "(sdk|sdj)"
ip-10.1.200.40:3260-iscsi-iqn.1992-08.com.netapp:MyNetapp.luns-lun-2 -> ../../sdk
ip-10.1.201.40:3260-iscsi-iqn.1992-08.com.netapp:MyNetapp.luns-lun-2 -> ../../sdj
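The portal addresses behind the two sessions can be listed from the by-path names. A minimal sketch, assuming names of the form `ip-<address>:<port>-iscsi-<iqn>-lun-<n>` as shown above; `iscsi_portals` is a hypothetical helper name.

```shell
# List the distinct iSCSI portal IP addresses found in
# /dev/disk/by-path style names read from stdin.
# Assumption: each name starts with "ip-<address>:<port>-iscsi-".
iscsi_portals() {
    sed -n 's/.*ip-\([0-9.]*\):[0-9]*-iscsi-.*/\1/p' | sort -u
}
```

For example: `ls /dev/disk/by-path/ | iscsi_portals`. Two different portal addresses show two sessions on the target side, but say nothing yet about which local interface each session leaves through.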
Storage Performance
Checking if different paths are really used
HWaddr 00:1D:09:70:88:2E
RX bytes:1801238
(166 MiB)
Storage Performance
Checking source IP addresses for iSCSI sessions
Storage Performance
Checking kernel routing table
# route
Destination     Gateway         Genmask         Iface
10.1.200.0      *               255.255.255.0   xenbr0
10.1.200.0      *               255.255.255.0   xenbr1
default         10.1.200.1      0.0.0.0         xenbr0
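The problem in this routing table is that the same destination network appears on two interfaces, so the kernel sends all traffic for it through one of them. A sketch for spotting this, assuming the destination is the first column and the interface the last; `duplicate_routes` is a hypothetical helper name.

```shell
# Flag destination networks that are listed on more than one interface.
# Reads `route` output on stdin; header lines and the default route are
# skipped because their first column does not start with a number.
duplicate_routes() {
    awk '
        $1 ~ /^[0-9]+\./ {
            if (($1 in iface) && iface[$1] != $NF)
                print $1 " via " iface[$1] " and " $NF
            else
                iface[$1] = $NF
        }
    '
}
```

For example: `route | duplicate_routes`. No output means every destination network maps to a single interface.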
Storage Performance
Configuration of management interfaces in XenCenter
Storage Performance
Determining current performance on domain0
# route
Destination     Gateway         Genmask         Iface
10.1.200.0      *               255.255.255.0   xenbr0
10.1.201.0      *               255.255.255.0   xenbr1
default         10.1.200.1      0.0.0.0         xenbr0
Storage Performance
Configuring kernel routing table
Storage Performance
Determining current performance on VM
LinuxVM:~# hdparm -t /dev/xvdb
/dev/xvdb: Timing buffered disk reads: 45 MB/sec
Well Done!
# hdparm -t
# iostat
# ifconfig, # tcpdump, # netstat, # route
# watch
Best practices for iSCSI storage
Questions
Resources
First aid kit
http://docs.xensource.com XenServer documentation
Download presentations starting Friday, 15 October, from your My Organiser Tool located in your My Synergy Microsite event account