PDF pt203 Sos Nutanix Troubleshooting

This document provides an overview of managing and troubleshooting Nutanix environments. It discusses monitoring Nutanix clusters, performing health checks, analyzing performance issues, isolating problems, common troubleshooting scenarios, engaging support best practices, and additional resources. The document emphasizes isolating problems by rapidly reducing failure domains, checking for recent changes, and using built-in reporting tools. It also covers typical troubleshooting workflows for issues like upgrades not progressing or storage being unavailable.

Uploaded by

AdiWidyanto

Agenda

I. MANAGING NUTANIX ENVIRONMENTS
• Cluster Monitoring
• NCC overview
• Prism Analysis (and Prism Central)

II. TROUBLESHOOTING NUTANIX ENVIRONMENTS
• General Troubleshooting
• Troubleshooting Scenarios
• Engaging support best practices
• Additional Resources

III. Q/A

CONFERENCE
Monitoring

• Pulse
• Email
• SNMP
• Syslog
• Prism Alerts

Prism Alerts and Pulse
• Insights
• Pulse: hourly cluster reports, deep analytics and inventory
• Automatic case generation
• Cluster phone home
• Prism Alerts: health alerts

Auto-case Generation
Example:
Description
Block Serial Number:
alert_time: Tue Mar 22 2016 18:54:51 GMT-0700 (PDT)
alert_type: PowerSupplyDown
alert_msg: A1046:Bottom power supply is down on block
cluster_id:
alert_body: No Alert Body Available

New Alerts Appended

Block Serial Number:
alert_time: Tue Mar 22 2016 21:46:25 GMT-0700 (PDT)
alert_type: PowerSupplyDown
alert_msg: A1046:Top power supply is down on block
cluster_id:
alert_body: No Alert Body Available
Resolution: Scheduled Maintenance. As advised by customer

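The case description in the example is a run of "key: value" lines, one group per alert. As a rough sketch of how such text could be split into records, the Python below parses a cleaned-up copy of the example. The function name `parse_alerts`, the underscore field spellings, and the rule that "Block Serial Number" starts a new record are assumptions made for illustration, not part of any Nutanix tool.

```python
from typing import Dict, List

# Cleaned-up copy of the example case text (serial number and cluster id are blank there too).
SAMPLE = """\
Block Serial Number:
alert_time: Tue Mar 22 2016 18:54:51 GMT-0700 (PDT)
alert_type: PowerSupplyDown
alert_msg: A1046:Bottom power supply is down on block
cluster_id:
alert_body: No Alert Body Available

New Alerts Appended

Block Serial Number:
alert_time: Tue Mar 22 2016 21:46:25 GMT-0700 (PDT)
alert_type: PowerSupplyDown
alert_msg: A1046:Top power supply is down on block
cluster_id:
alert_body: No Alert Body Available
"""


def parse_alerts(text: str) -> List[Dict[str, str]]:
    """Split a case description into one dict per alert (hypothetical helper)."""
    alerts: List[Dict[str, str]] = []
    current = None
    for line in text.splitlines():
        # Split on the first colon only, so timestamps and messages keep their colons.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank lines and separators such as "New Alerts Appended"
        key = key.strip()
        if key == "Block Serial Number":
            # Assumption: this field marks the start of a new alert record.
            current = {}
            alerts.append(current)
        if current is not None:
            current[key] = value.strip()
    return alerts


alerts = parse_alerts(SAMPLE)
for alert in alerts:
    print(alert["alert_time"], alert["alert_msg"])
```

Both alerts in the sample share the type PowerSupplyDown, which matches the "New Alerts Appended" behaviour shown in the case: the second alert is added to the existing case rather than opening a new one.
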
Auto-case Generation

THESE ALERTS WILL AUTO GENERATE SUPPORT CASES:


• Stargate process is down for more than 3 hours
(StargateTemporarilyDown)
• Curator scan fails (CuratorScanFailure)
• Running out of space on the cluster
• Running out of space on CVMs
• Hardware Clock Failure (HardwareClockFailure)
• Faulty RAM module (RAMFault)
• Power Supply failure (PowerSupplyDown)

For up-to-date information, see KB 1959 on the portal: http://portal.nutanix.com/kb/1959

For customers using our partners' hardware platforms, we generate software-based alerts, which trigger automatic support cases.

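As a small illustration of the list above, the sketch below holds the alert types that the slide names explicitly and checks an incoming alert against them. The names `AUTO_CASE_ALERT_TYPES` and `opens_support_case` are hypothetical. The two out-of-space alerts are left out because the slide gives no type name for them, and KB 1959 remains the authoritative list.

```python
# Alert type names exactly as given in the slide's list; nothing else is assumed.
AUTO_CASE_ALERT_TYPES = frozenset({
    "StargateTemporarilyDown",
    "CuratorScanFailure",
    "HardwareClockFailure",
    "RAMFault",
    "PowerSupplyDown",
})


def opens_support_case(alert_type: str) -> bool:
    """Return True if the slide lists this alert type as auto-generating a case."""
    return alert_type in AUTO_CASE_ALERT_TYPES


print(opens_support_case("PowerSupplyDown"))
print(opens_support_case("SomeOtherAlert"))
```
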
Working with Prism Alerts

Working with Prism Central Alerts Dashboard

NCC Health Checks
• CLI: ncc health_checks run_all
• Prism (AOS 5.x)
[Screenshot: summary of cluster checks executed, with passed and total counts]

Checks
NCC is a framework of scripts to automatically diagnose cluster health
• Default health checks are non-disruptive
• KB article for each NCC check
• Helps get a baseline
• NCC can be upgraded with no impact to cluster operation
• Troubleshooting with relevant information (including KB)

• PASS: The tested aspect of the cluster is healthy and no further action is required
• … cannot be evaluated as PASS/FAIL


Entity & Metric Charts

Troubleshooting Nutanix Environments: A Framework

• Problem Isolation

• Fixes and Mitigations

• Root Cause Analysis

• Product Improvement

Troubleshooting by Layers
APPLICATION
• SQL, VDI, Oracle RAC, etc.
CVM
• Stargate, Curator, Cassandra, etc.
HYPERVISOR
• AHV, ESXi, Hyper-V, XenServer
HARDWARE
• NVMe, SSD, HDD, Memory, NIC, Processor, etc.
NETWORK
• OVS, vSwitch, Physical Switch, etc.
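
One way to use the layer model while triaging is to map a suspect component to its layer, so the investigation starts in the right place. The sketch below encodes only the examples listed on the slide. The names `LAYERS` and `layer_of` are hypothetical, and the table is far from exhaustive.

```python
from typing import Dict, List, Optional

# Layers and example components as listed on the slide, top to bottom.
LAYERS: Dict[str, List[str]] = {
    "APPLICATION": ["SQL", "VDI", "Oracle RAC"],
    "CVM": ["Stargate", "Curator", "Cassandra"],
    "HYPERVISOR": ["AHV", "ESXi", "Hyper-V", "XenServer"],
    "HARDWARE": ["NVMe", "SSD", "HDD", "Memory", "NIC", "Processor"],
    "NETWORK": ["OVS", "vSwitch", "Physical Switch"],
}


def layer_of(component: str) -> Optional[str]:
    """Return the layer a component belongs to, or None if it is not in the table."""
    wanted = component.casefold()
    for layer, members in LAYERS.items():
        if any(wanted == member.casefold() for member in members):
            return layer
    return None


print(layer_of("stargate"))
print(layer_of("ESXi"))
```
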

Troubleshooting: Problem Isolation
• Rapidly reduce failure domain scope to achieve faster resolution.
• Any recent changes in the environment?

IMPACT
• Is storage available?
• Are there performance issues?
• Can you reach Prism?

USE BUILT-IN REPORTING
• Prism Alerts
• Cluster Health
• NCC
• Cluster logs
• User Reports

Troubleshooting: Problem Isolation - Cluster States

Helpful additional commands
• cluster status | grep -v UP: shows a condensed version
• genesis status: shows only local services/processes

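The `grep -v UP` filter simply drops every line that contains the string "UP", so only lines without it remain. A minimal Python equivalent is sketched below. The sample text is made up for illustration and is not real `cluster status` output, and `condensed` is a hypothetical helper name.

```python
from typing import List

# Invented sample text; real cluster status output has more detail per line.
SAMPLE = "\n".join([
    "CVM: 10.0.0.1 Up",
    "  Zeus UP",
    "  Stargate DOWN",
    "CVM: 10.0.0.2 Up",
    "  Zeus UP",
])


def condensed(status_output: str) -> List[str]:
    """Keep only lines that do not contain 'UP' (case-sensitive, like grep -v UP)."""
    return [line for line in status_output.splitlines() if "UP" not in line]


for line in condensed(SAMPLE):
    print(line)
```

Because the match is case-sensitive, a line ending in "Up" is kept while a line containing "UP" is dropped. In this sample, each CVM line survives alongside any service that is not UP.
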
Troubleshooting: Problem Isolation - allssh, hostssh, NCC, Logging
• allssh

• NCC

• /home/nutanix/data/logs and sysstats

• INFO, WARN, ERROR, FATAL
• allssh "ls -ltr data/logs/*.FATAL"
• If FATALs are actively occurring and you're experiencing issues, they may be related.
• hostssh "vmware -vl" instead of allssh 'ssh -l root 192.168.5.1 "vmware -vl"'
• If you're seeing an error, check the Nutanix Knowledge Base!

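The `ls -ltr data/logs/*.FATAL` check lists FATAL log files oldest first, so the most recent one ends up at the bottom. A minimal single-node sketch of the same idea in Python follows. The helper name `fatal_logs` is hypothetical and the directory is whatever path the caller passes in, so nothing here talks to a cluster.

```python
import glob
import os
from typing import List


def fatal_logs(log_dir: str) -> List[str]:
    """Return *.FATAL files in log_dir sorted by modification time, oldest first."""
    paths = glob.glob(os.path.join(log_dir, "*.FATAL"))
    # Same ordering as `ls -tr`: ascending mtime, newest last.
    return sorted(paths, key=os.path.getmtime)


if __name__ == "__main__":
    # Assumption: run from a directory that contains data/logs, as in the slide's command.
    for path in fatal_logs(os.path.join("data", "logs")):
        print(path)
```

If the last entries carry timestamps close to when the problem began, those FATALs are the first place to look.
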
Problem Isolation - Data Resiliency States

[Screenshot: data resiliency status, rebuild capacity available]

• ncli cluster get-domain-fault-tolerance-status

Typical Troubleshooting Scenarios
UPGRADE IS NOT PROGRESSING
• Logging: genesis.out, host_upgrade.out, firmware_upgrade.out
• upgrade_status
• host_upgrade_status
• firmware_upgrade_status

STORAGE UNAVAILABLE
• Do all CVMs have connectivity to each other and to the hypervisor?
• Recent stargate FATALs?
• Cassandra status?

REPLICATION, SNAPSHOTTING, AND METRO RELATED ISSUES
• Logging: Cerebro logs

NCC / HEALTH CHECKS FAILING
• Running NCC should indicate the nature of the issue and give a KB describing how to resolve the issue.

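The scenarios above amount to a lookup from symptom to first steps. The sketch below writes that lookup down as data, under stated assumptions. The scenario keys, the `PLAYBOOK` and `checklist` names, and the "Review log" / "Run" wording are invented for illustration. The log names, commands, and questions are taken from the slide.

```python
from typing import Dict, List

# Scenario keys are hypothetical labels; the contents come from the slide.
PLAYBOOK: Dict[str, Dict[str, List[str]]] = {
    "upgrade_not_progressing": {
        "logs": ["genesis.out", "host_upgrade.out", "firmware_upgrade.out"],
        "commands": ["upgrade_status", "host_upgrade_status", "firmware_upgrade_status"],
        "checks": [],
    },
    "storage_unavailable": {
        "logs": [],
        "commands": [],
        "checks": [
            "Do all CVMs have connectivity to each other and to the hypervisor?",
            "Recent stargate FATALs?",
            "Cassandra status?",
        ],
    },
    "replication_snapshot_metro": {
        "logs": ["Cerebro logs"],
        "commands": [],
        "checks": [],
    },
    "ncc_health_checks_failing": {
        "logs": [],
        "commands": [],
        "checks": ["Run NCC and follow the KB it references for the failing check."],
    },
}


def checklist(scenario: str) -> List[str]:
    """Flatten one scenario into an ordered list of first steps."""
    entry = PLAYBOOK[scenario]
    steps = [f"Review log: {name}" for name in entry["logs"]]
    steps += [f"Run: {command}" for command in entry["commands"]]
    steps += entry["checks"]
    return steps


for step in checklist("upgrade_not_progressing"):
    print(step)
```
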
Root Cause Analysis - Log Collection

Logs will be collected for all the nodes and components. Once the task completes, the bundle will be available for download.

[Screenshot: Collect Logs dialog and cluster health summary by check status (Passed: 39), with Run Checks and Log Collector actions]

Best Practices for Engaging Support
• Update your break/fix contact via My Nutanix Portal
• Upgrade to the latest NCC and start a health check
• Clear problem description
• What steps have you already taken?
• Keep components on the recommended version levels
• Press the Escalate Button in portal for immediate attention
• Provide feedback after case closure. Surveys matter!

Additional Resources
The Nutanix Bible - Architecture details
portal.nutanix.com - Nutanix Support Portal, KBs, Documentation, Software, etc.
portal.nutanix.com/ h/4530 — Additional troubleshooting details for Acropolis File Services

IF YOU LIKED THIS SESSION, YOU MAY ALSO LIKE:
• Nutanix Architecture Deep Dive and the Deep Dive Super Session
• Getting the Network Right (The First Time)
• Fail Fast and Never Again
• AHV — Virtualization You Always Wanted
