Troubleshooting Log Files in Domino
Introduction
This lab is intended to give IBM Lotus Domino administrators an idea of the data that can be used in examining server crash and performance issues. Needless to say, a complete analysis of NSD is not possible within the allotted time. This lab focuses on using Call Stack data and Memcheck data to perform a preliminary analysis and get you started in the right direction.
Contents
This lab contains the following topics:

Section I: NSD Basics (pages 2-6)
Section II: Call Stacks (pages 7-13)
Section III: Memcheck Shared Memory (pages 14-26)
Section IV: Memcheck Process Memory (pages 27-33)
Section V: Memcheck Resource Usage Summary (pages 34-36)
Section VI: Correlating NSD Output (pages 36-40)
Section VII: NSD Checklist (page 41)
Section VIII: Case Studies (pages 42-47)
Appendix: Student Answer Key (pages 48-68)
Within the ND7 timeframe, NSD has undergone many format and behavioral changes. Many of these changes have been backported to ND6.x (as of ND6.5.5). Within this lab, we refer to both formats: the original format applies to versions 6.5.4 and earlier, and the newer format applies to versions 6.5.5 and later, including ND7.x. Where appropriate, we indicate the different KEYWORDs that can be used for searching within an NSD in the Newer vs. Original Format. Please refer to page 3 for more discussion of the NSD backport for existing versions of ND6.x and ND7.x, referred to as the NSD Update Strategy.
Due to time constraints, this lab is focused on troubleshooting Domino Server crash and hang/performance issues. However, many of these techniques can also be applied to IBM Lotus Notes Client crashes and hangs. Please feel free to discuss these additional troubleshooting techniques with the lab instructors.
What is NSD?
NSD (Notes System Diagnostic) is one of the primary diagnostics used for the Lotus Domino product suite. It is used to troubleshoot crashes, hangs, and severe performance problems for such products as:

Domino Server & Notes Client
Quickplace, DomDoc, Domino Workflow
Sametime

In ND6 and ND7, NSD is used on all Notes Client and Domino Server platforms except for the Apple Macintosh platform.

On UNIX, NSD is a shell script (nsd.sh), with Memcheck compiled as a separate binary.
On W32, NSD is a compiled binary (nsd.exe), with Memcheck built into nsd.exe.
On iSeries, NSD in ND6 is a compiled binary. The source for NSD differs both in nature and in output from the other platforms. However, in ND7, NSD on iSeries has been modified to more closely match the output of NSD on the other platforms.
NSD Basics
NSD can be run under two contexts:

1). Manually - run from a command prompt to troubleshoot hangs and performance problems
2). Automatically - as part of Fault Recovery to troubleshoot crashes; this is enabled in the server document (enabled by default)

The NSD.EXE/NSD.SH file is located in the program directory. When running NSD manually, in order for NSD to run properly, you must run NSD from the desired data directory. With no switches, NSD will collect a full set of data, creating a log file in the IBM_TECHNICAL_SUPPORT directory with the following filename format:

ND6 - nsd_all_<Platform>_<Host>_MM_DD@HH_MM.log
ND7 - nsd_<Platform>_<ServerName>_YYYY_MM_DD@HH_MM_SS.log
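As an aside, once logs accumulate it can help to pick the ND7-style filename apart programmatically. A minimal sketch using plain shell parameter expansion; the filename, the platform token W32I, and the server name ACME01 are made-up examples, not values from a real NSD:

```shell
# ND7 format: nsd_<Platform>_<ServerName>_YYYY_MM_DD@HH_MM_SS.log
f="nsd_W32I_ACME01_2006_03_15@10_42_07.log"

name="${f#nsd_}"         # drop the leading "nsd_"
name="${name%.log}"      # drop the trailing ".log"
platform="${name%%_*}"   # first token: platform
rest="${name#*_}"
server="${rest%%_*}"     # second token: server name
stamp="${rest#*_}"       # remainder: timestamp
echo "$platform $server $stamp"   # W32I ACME01 2006_03_15@10_42_07
```

This assumes the server name itself contains no underscore; names with underscores would need a more careful split.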
Between ND6 and ND7, NSD underwent numerous important changes, including format changes to better represent the data, as well as behavioral changes to resolve previous issues and improve NSD's reliability as a diagnostic tool. In order to allow existing customers to take advantage of improvements to NSD in their current environments, IBM has made available an updated version of NSD for existing versions of Domino on Windows 32-bit, IBM AIX, SUN Solaris, and Linux. This updated version includes most of the features available in ND7.x.

The NSD Update Strategy means that IBM will periodically provide an NSD Update with the latest set of NSD fixes/enhancements for existing versions of Domino. This is done so that IBM Support and customers can leverage the latest fixes and enhancements for First Failure Data Collection. This will be done for existing versions of ND6.x as well as ND7.x. For more information regarding the NSD Update Strategy, including a list of fixes to NSD, refer to the following URL:

Technote #1233676 - http://www.ibm.com/support/docview.wss?uid=swg21233676

You may also refer to a recently published Knowledge Collection for NSD:

Technote #7007508 - http://www.ibm.com/support/docview.wss?uid=swg27007508
W32 NSD
When NSD is run manually on W32 with no switches, it attaches to all Domino processes, dumps all call stacks, dumps Memcheck data and system information, then displays an NSD prompt, where it remains attached to all Domino processes.

On Windows 2000, quitting NSD kills all Domino processes, due to a limitation of the operating system that does not allow a debugger to detach from a process without exiting that process as a result. As long as the NSD prompt remains displayed, the Domino server will continue operation. On Windows 2003 & XP, NSD displays a prompt, but you can manually detach from all processes with the command detach, allowing NSD to quit without affecting the Domino server.

When run as part of Fault Recovery, NSD is no longer run from the JIT interface on W32. As a result, NSD will only be invoked for crashes occurring in a Domino process (any process that makes use of the Domino API).
nsd detach (detaches from all processes without killing them - XP & 2003 only)
nsd stack (collects only call stacks, speeds execution) => nsd*.log
nsd info (collects only system info) => sysinfo*.log
nsd noinfo (collects all but info) => nsd*.log
nsd memcheck (collects only memcheck info) => memcheck*.log
nsd nomemcheck (collects all but memcheck) => nsd*.log
nsd perf (collects process memory usage) => perf*.log
nsd noperf (collects all but performance data) => nsd*.log
nsd handles (collects OS level handle info) => handles*.log
nsd nohandles (collects all but handle info) => nsd*.log
nsd kill (kills all Notes processes and associated memory) => nsd_kill*.log
nsd monitor (attaches and waits for exceptions) => nsd*.log
nsd p (runs against a specific process, call stacks only) => nsd*.log
At the NSD prompt, you can use all of these commands plus:

dump (dumps all call stacks)
quit f (forces a quit; will bring down all Domino processes)
UNIX NSD
On Unix, when run manually with no switches, NSD attaches to all Domino processes, dumps data, and detaches automatically (returning to a system prompt). NSD behaves differently on Unix than on W32 because the Unix debugging interface allows NSD to detach from all processes without causing any problems. Unix platforms include IBM AIX, Linux, SUN Solaris, and IBM zSeries.
nsd batch (runs nsd with no output to console)
nsd info (collects only system info) => nsd_sysinfo*.log
nsd noinfo (collects all but system info) => nsd_all*.log
nsd memcheck (collects only memcheck info) => nsd_memcheck*.log
nsd nomemcheck (collects all but memcheck) => nsd_all*.log
nsd kill (kills all Notes processes and associated memory) => nsd_kill*.log
nsd ps (lists running processes) => nsd_ps_*.log
nsd lsof (lists open Notes files) => nsd_lsof*.log
nsd nolsof (collects all but lsof) => nsd_all*.log
nsd user (runs nsd as a specific unix user)
Note: There is no NSD prompt on Unix, since NSD does not have the detach problems on Unix that it does on W32.

NSD on iSeries

In ND6, NSD on the IBM iSeries platform is a different tool from NSD on the other platforms. In ND6, NSD output includes:

Call stacks
Environment Info
Job Log of current jobs
Status of current jobs
Thread/Mutex info
Notes.ini
Last 2000 lines of Console log

Note: NSD takes no switches on iSeries and runs only as part of fault recovery. DumpSrvStacks is a separate utility available for dumping call stacks. In ND7, changes have been made to NSD on iSeries so that it functions more closely to NSD on the other platforms. Support is in the process of updating the Knowledge Base to reflect these new capabilities.
As you may or may not know, NSD output is rather verbose. In order to more easily visualize the contents of NSD, you can consider that there are four major sections within an NSD log, each major section being composed of subsections, referred to as minor sections. The four major sections within NSD are:

Process Information (Call Stacks)
Process Information is composed of the list of all running processes system wide, followed by a list of all Domino-specific processes. However, the most important aspect of this section is the list of the call stacks for all threads of each Domino process. These call stacks provide Support with insight into the code path involved in a particular problem. This section is arguably the most important section within NSD.

Memcheck (Domino Memory Objects)
The Memcheck section dumps information about Domino-specific structures. Memcheck steps through all Domino allocated memory pools (both private and shared). Memcheck output is composed of a series of minor sections which summarize memory usage and provide a list of open resources such as open databases, open views, open documents, connected users, and open files. Memcheck tells Support which resources are involved in a given problem. Hence, Memcheck is a runner-up for most important part of NSD.

System Information
This section provides information regarding the version of the OS, kernel configurations, patch information, disk information, network connections, and memory usage. While this section is important under a number of different cases, this lab does not focus on information within this section.

Environment Information
This section rounds out NSD output, providing the user environment, the notes.ini, and a list of the Domino executables and files located within the Domino data directory and sub-directories. Again, this lab will not focus on this section.
For most server or client crashes or hangs, the Call Stack section is the single most important section of NSD. But what exactly is a call stack? In brief, each thread has its own area of memory, referred to as its stack region, which is reserved for the execution of functions. This region of memory is used as a temporary scratch-pad to store CPU register values, arguments to each function, variables that are local to each function, and return addresses for each caller (i.e. where to resume activity). The summary of return addresses listed in a thread's stack region essentially provides an indication of the code path of execution for that thread at any given moment. This summary, also referred to as a stack trace summary, is constantly changing depending on thread activity. It is this summary that NSD dumps for each and every Domino thread, providing a snapshot of the code path of activity for each thread. This summary is invaluable in troubleshooting crashes and hangs.

Note: As important as call stacks are, they provide limited information. Namely, call stacks provide the current state of thread activity, not a cumulative history of thread activity. Stack frames show current conditions, but not necessarily a history of how those conditions were reached. With complex problems, it is often necessary to augment call stack data with other forms of debug to develop a clearer picture of the problem.
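The idea of a stack of callers can be sketched in a few lines of plain shell (a toy illustration only, nothing Domino-specific): each function records its name on entry, and dumping the list at the deepest point yields a trace with the most recent function first, the same top-down reading order NSD uses for W32 and most UNIX stacks:

```shell
stack=""
trace() {                         # print one frame per line, newest first
  for frame in $stack; do
    echo "$frame"
  done
}
inner()  { stack="inner $stack";  trace; }   # deepest call: dump the trace
middle() { stack="middle $stack"; inner; }
outer()  { stack="outer $stack";  middle; }
outer   # prints inner, middle, outer (one per line)
```

A real stack region holds return addresses, saved registers, and locals rather than names, but the ordering principle is the same.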
The level of information that NSD provides for call stacks differs for each platform. In general, platforms fall into two categories: W32 and UNIX.

W32
For the fatal thread, NSD makes 3 passes (2 passes in the original format):
1). Dumps the complete call stack (divided into "before" and "after" frames)
2). Granular breakdown of stack frames, showing arguments, return address, and basic register information
3). De-referenced pointer arguments for each function, meaning if a function is passed a pointer, NSD dumps out the contents of memory that the pointer references

UNIX
This includes Linux, AIX, Solaris, z/OS and i5/OS. NSD provides one pass for call stacks, and currently does not break down stack frames. Limited register data is provided for certain platforms (AIX, Linux & z/OS). On AIX, arguments may show as "???", meaning the module in question was not compiled with the necessary levels of debug to show argument information.
On W32, NSD provides 2 passes of information for the fatal thread, or 3 passes in ND7. Below are excerpts from an NSD for each pass. For non-fatal threads, only PASS ONE is provided.

PASS ONE
This pass provides a stack trace summary for the thread, divided into before and after frames. NSD first lists the after frames; these are all frames that indicate how the thread attempts to handle the fatal condition. These frames do NOT indicate the fatal condition itself. In essence, you can ignore the after frames.
############################################################ ### thread 67/135: [ nSERVER:0908: 2692] ### FP=0ae5ec70, PC=6018eb23, SP=0ae5e2f8 ### stkbase=0ae60000, total stksize=262144, used stksize=2424 ############################################################ [ 1] 0x77f83786 ntdll.ZwWaitForSingleObject+11 (560,36ee80,0,601a7c06) [ 2] 0x77e87837 KERNEL32.WaitForSingleObject+15 (7e4e5a0,77e8ae88,7e4ec0c,0) @[ 3] 0x601a7046 nnotes._OSFaultCleanup@12+342 (0,0,0,7e4ec0c) @[ 4] 0x601b07b1 nnotes._OSNTUnhandledExceptionFilter@4+145(7e4ec0c,7e4ec0c,6ef1ab5,7e4ec0c) [ 5] 0x1000e596 jvm._JVM_FindSignal@4+180 (7e4ec0c,77ea18a5,7e4ec14,0) [ 6] 0x77ea8e90 KERNEL32.CloseProfileUserMapping+161 (0,0,0,0)
The before frames are all thread activity leading up to and including the crash. The before frames are actually listed second, so don't be confused. These frames are the real meat of the crash; this is the part you want to examine:
############################################################ ### FATAL THREAD 67/135 [ nSERVER:0908: 2692] Process ID: Thread ID ### FP=0x0ae5ec70, PC=0x6018eb23, SP=0x0ae5e2f8 ### stkbase=0ae60000, total stksize=262144, used stksize=2424 ### EAX=0x010d088c, EBX=0x00000000, ECX=0x00ba0000, EDX=0x00ba0000 ### ESI=0x0ae5e904, EDI=0x00001d34, CS=0x0000001b, SS=0x00000023 ### DS=0x00000023, ES=0x00000023, FS=0x0000003b, GS=0x00000000 Flags=0x00010202 Exception code: c0000005 (ACCESS_VIOLATION) ############################################################ C++ Class Name @[ 1] 0x6018eb23 nnotes._Panic@4+483 (609d0013,0,ae5eeb8,6010ee14) (not mangled) @[ 2] 0x6018e8ec nnotes._Halt@4+28 (107,0,0,0) @[ 3] 0x6010ee14 nnotes._MemoryInit1@0+212 (0,0,5a4cc160,ae5ef5c) C++ Function Name @[ 4] 0x6010e1a4 nnotes._OSInitExt@8+52 (0,0,ae5f7c8,626412d2) @[ 5] 0x600ef05e nnotes._OSInit@4+14 (0,0,e9754f4,5a4cc160) (not mangled) @[ 6] 0x626412d2 nlsxbe._LsxMsgProc@12+130 (2,5a4c8288,5a4cc160,e953ff4) @[ 7] 0x6014fe93 nnotes.DLLNode::Register+35 (5a4cc160,e967674,e980000,e953ff4) @[ 8] 0x6014faf9 nnotes.LSIClassRegModule::AddLibrary+105 (ae5f8a0,ae5f894,ae5f884,6014f59d) @[ 9] 0x6014f5f3 nnotes.LSISession::RegisterClassLibrary+19 (e967674,ae5f8a0,e953ff4,ae5f894) @[10] 0x6014f59d nnotes.LSISession::RegisterClassLibrary+141 (e967674,ae5f8a0,321,e9641f8) @[11] 0x60935bdd nnotes._LSCreateScriptSession@20+125 (e9641f8,60ab10d0,60a2bae4,0) @[12] 0x60935c4a nnotes._LSLotusScriptInit@4+26 (e9641f8,0,e9548f4,e9540f4) @[13] 0x6093040e nnotes.CLSIDocument::Init+30 (e9641f4,e9641f4,605b08f0,1d78ffa0) @[14] 0x605b0379 nnotes._AgentRun@16+313 (2ea34ec,e9641f4,0,10)
Frames that are pre-fixed with the @ symbol mean these are functions that NSD was able to annotate using the Domino symbol files.
Rob Gearhart/Elliott Harden
PASS TWO
This pass provides a breakdown of stack frame contents, showing local variables, arguments, etc. The right-hand column is an ASCII representation of the stack frame contents. This ASCII column can provide insight such as an error message, a database name, or a user name.
############################################################ ### PASS 2 : FATAL THREAD with STACK FRAMES 67/135 [ nSERVER:0908: 2692] ### FP=0ae5ec70, PC=6018eb23, SP=0ae5e2f8, stksize=2424 Exception code: c0000005 (ACCESS_VIOLATION) ############################################################ # ---------- Top of the Stack ---------# 0ae5e2f8 00000001 00000107 7c573a9d 54200a0a |.........:W|.. T| # 0ae5e308 61657268 305b3d64 3a383039 35453130 |hread=[0908:01E5| # 0ae5e318 3841302d 530a5d34 6b636174 73616220 |-0A84].Stack bas| # 0ae5e328 78303d65 36454130 34393030 7453202c |e=0x0AE60094, St| # 0ae5e338 206b6361 657a6973 37203d20 20323935 |ack size = 7592 | # 0ae5e348 65747962 41500a73 3a43494e 736e4920 |bytes.PANIC: Ins| # 0ae5e358 69666675 6e656963 656d2074 79726f6d |ufficient memory| @[ 1] 0x6018eb23 nnotes._Panic@4+483 (609d0013,0,ae5eeb8,6010ee14) # 0ae5ec70 0ae5ec80 6018e8ec 609d0013 00000000 |.......`...`....|
@[ 2] 0x6018e8ec nnotes._Halt@4+28 (107,0,0,0)
# 0ae5ec80 0ae5eeb8 6010ee14 00000107 00000000 |.......`........|
# 0ae5ec90 00000000 00000000 53495249 4d454d24 |........IRIS$MEM|
# 0ae5eca0 314d4d24 64243439 746f4c2e 442e7375 |$MM194$d.Lotus.D|
# 0ae5ecb0 6e696d6f 61442e6f 73006174 6d6f445c |omino.Data.s\Dom|
# 0ae5ecc0 5c6f6e69 78736c6e 442e6562 00004c4c |ino\nlsxbe.DLL..|
@[ 3] 0x6010ee14 nnotes._MemoryInit1@0+212 (0,0,5a4cc160,ae5ef5c)
# 0ae5eeb8 0ae5f578 6010e1a4 00000000 00000000 |x......`........|
# 0ae5eec8 5a4cc160 0ae5ef5c 00000000 00081240 |`.LZ\.......@...|
# 0ae5eed8 ffffffff 0ae5ef6c 7c57685c 00070000 |....l...\hW|....|
# 0ae5eee8 00000000 000a4a10 00000001 0e9754f4 |.....J.......T..|
# 0ae5eef8 00000000 00000000 0ae5f29c 0ae5f2a5 |................|
@[ 5] 0x600ef05e nnotes._OSInit@4+14 (0,0,e9754f4,5a4cc160)
# 0ae5f588 0ae5f7c8 626412d2 00000000 00000000 |......db........|
# 0ae5f598 0e9754f4 5a4cc160 60003fbc 60c6b5e0 |.T..`.LZ.?.`...`|
# 0ae5f5a8 4b340016 00000000 0000c436 0000c000 |..4K....6.......|
# 0ae5f5b8 0000cdff 600011df 6000123f 010c16f2 |.......`?..`....|
Be Careful
While the ASCII section in PASS TWO is helpful, be careful not to jump to conclusions based on this information. The values you see are variables that are local to each function. While these variables are important to examine, they may not be directly relevant to the root cause of a given crash.
On Unix, NSD provides only the stack trace summary (PASS ONE). However, just as is the case with W32, the upper part of the stack shows thread activity that attempts to handle the fatal exception, and does not indicate the exception itself. Look at the portion of the call stack below the fatal_error, raise.raise, signal handler, abort or terminate line. The exact syntax for this line depends on the version of UNIX, as well as the nature of the problem itself. In addition, C++ function names are mangled for all UNIX platforms except zSeries (OS390), so they can be difficult to read. Below are some examples of call stacks for each platform.

Note: For iSeries NSDs, call stacks are read from bottom (most recent function) to top (previous callers), whereas all other platforms are read the opposite way, from top (most recent function) to bottom (previous callers).
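This rule lends itself to a quick filter with standard tools. A sketch (thread_stack.txt and its contents are a fabricated stand-in for one thread's stack pulled from an NSD log; adjust the marker pattern to whatever line your platform produces):

```shell
# Stand-in for one thread's stack from a UNIX NSD
cat > thread_stack.txt <<'EOF'
OSRunExternalScript(??) at 0xd3e2bcf8
OSFaultCleanup(??, ??, ??) at 0xd3e2cf54
fatal_error(??, ??, ??) at 0xd4d3f5cc
Panic(??) at 0xd3c621b8
LockHandle(??, ??, ??) at 0xd3c698dc
EOF

# Print nothing until the marker line, then everything after it;
# that is, only the frames that led up to the crash
awk 'p { print } /fatal_error/ { p = 1 }' thread_stack.txt
```

Here the filter keeps the Panic and LockHandle frames and drops the cleanup frames above the marker.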
Function Mangling
Function mangling is an artifact of the compile process for software. In order to uniquely identify one function from another, the compiler generates a function decoration, or mangles the function, which includes the function name, class (if any), and argument types. As a result, it takes a bit of work to extract the function name from the function decoration. Within NSD, the syntax for function decorations is very similar for AIX, Linux, and iSeries. Solaris has a different decoration syntax; zSeries function names are not mangled within NSD output.
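For the AIX/Linux/iSeries-style decorations (Name__<length><Class><argcodes>), the function and class names can be teased apart with ordinary shell string handling. A minimal sketch, using a decoration taken from the AIX stack excerpt below; it handles only this simple shape and leaves the trailing argument codes (Fv, FUsPcT2PCc, and so on) alone:

```shell
dec='DoProcessRequest__21CreateIMAPDelegationsFv'

fn="${dec%%__[0-9]*}"       # name before "__<digits>"
rest="${dec#"$fn"__}"       # "21CreateIMAPDelegationsFv"
len="${rest%%[!0-9]*}"      # leading digits: length of the class name
rest="${rest#"$len"}"
cls=$(printf '%s' "$rest" | cut -c1-"$len")
echo "${cls}::${fn}"        # CreateIMAPDelegations::DoProcessRequest
```

Real demanglers handle many more cases; this is only meant to show how the pieces of the decoration fit together.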
################################### ## thread 6/15 :: adminp pid=70336, k-id= 275121 , pthr-id=1286 ## stack :: k-state=wait, stk max-size=262144, cur-size=9484 ################################### ptrgl.$PTRGL() at 0xd01d7a10 PID/TID raise.nsleep(??, ??) at 0xd01e6490 raise.nsleep(??, ??) at 0xd01e6490 sleep(??) at 0xd02515b0 OSRunExternalScript(??) at 0xd3e2bcf8 OSFaultCleanup(??, ??, ??) at 0xd3e2cf54 fatal_error(??, ??, ??) at 0xd4d3f5cc pth_signal.pthread_kill(??, ??) at 0xd0199cf0 pth_signal._p_raise(??) at 0xd01992ec Read stack from here down raise.raise(??) at 0xd01e688c Panic(??) at 0xd3c621b8 LockHandle(??, ??, ??) at 0xd3c698dc OSLockObject(??) at 0xd3c6a358 ItemLookupByName(??, ??, ??, ??, ??) at 0xd3e928c4 NSFItemLookupByName(??, ??, ??, ??, ??) at 0xd3e92d6c NSFItemInfo(??, ??, ??, ??, ??, ??, ??) at 0xd3e9250c AdminpCompileResponseStatus(0x3b003b, 0x0, 0x3038d250, 0x3038d150, 0x100a2340) at 0x10006324 CompileResponseStatus__12AdminRequestFUsPcT2PCc(??, ??, ??, ??, ??) at 0x10054cb0 DoProcessRequest__21CreateIMAPDelegationsFv(??) at 0x1005372c AdminpProcessNewRequest(??, ??, ??, ??, ??, ??, ??, ??) at 0x1001ca0c AdminpRequestAndResponse(??, ??, ??, ??, ??, ??, ??, ??) at 0x10018ddc EntryThread(??) at 0x10002f80 ThreadWrapper(??) at 0xd3c5f4ac pth_pthread._pthread_body(??) at 0xd018a5a4
TID ----- Thread 12759 ----0x42174771: __nanosleep + 0x11 (1, 41f9ac54, 3150, 1, 4bc08608, 0) + 354 0x406a2421: OSRunExternalScript + 0x15d (0, 41f9ac54, 4bc08a1c, 4bc08a1c, 61660000, ffff) + 1c8 0x406a14e2: OSFaultCleanup + 0x4b2 (0, 0, 0, 8141f6c, 6, 4bc08cc8) + 20c 0x40682d35: fatal_error + 0x12d (6, 4bc08cc8, 4bc08d5c, 4bc08b68, 81422e8, 494fa4e0) 0x494af892: panicSignalHandler + 0xea (6, 4bc08cc8, 4bc08d5c, 2, 81422f8, 4bc08b68) + 8c 0x494ee80f: sysUnwindSignalCatchFrame + 0x77 (494ee844, 6, 4bc08d5c) 0x494ee8a1: sysSignalCatchHandler + 0x5d (6, 4bc08cc8, 4bc08d5c, 8142178, 4bc08d5c, 6) 0x494ef12c: userSignalHandler + 0x68 (6, 4bc08cc8, 4bc08d5c, 494ee844, 8142178, 4bc08d5c) + 28 0x494ef0ba: intrDispatch + 0xba (6, 4bc08cc8, 4bc08d5c, 8142178, 40043afc, 0) + 10 0x494ef31c: intrDispatchMD + 0x60 (6, 4bc08cc8, 4bc08d48, 4281db00, 31d7, 31d7) 0x4001cb13: pthread_sighandler_rt + 0x63 (6, 4bc08cc8, 4bc08d48, 6, 0, 0) + 37c 0x42100dc0: __libc_sigaction + 0x120 (64033, 6, 1, 0, 421effd4, 4001afe0) 0x4001ce2b: raise + 0x2b (6, 4bc09088, 0, fffffe64, 20, 0) + 110 0x42102549: abort + 0x199 (428e4d8c) Read stack from here down 0x42749c15: __default_terminate + 0x15 (428e4d8c) 0x4274a72a: terminate__Fv + 0x1a (428e4d8c, 4bc12070, 0, 0, 0, 0) + 8e58 0x4272e809: ShimmerCalPrint__5HaikuP5NNoteiPPvPUlPUiT4 + 0x2d899 (4bc121d4, 4bc13608) + 120 0x4252e686: GetHaikuDatum__5HaikuP5NNoteiPPvPUlPUiT4 + 0xa76 (4bc121d4, 4bc13608) + 10c8 0x42445c49: ExtensionProc__8NFormulaUsUsPUlPPvPUiT3 + 0xaed (4bc14c38, b9, 4bc13608) + 1c 0x42444bbd: INotesCompExtProc + 0x59 (7c0c5ff8, 4bc14c38, b9, 2, 4bc132e8, 4bc13478) + 35c 0x40d93f18: ExtensionProc__18CompGeneralContextR7ComputeUlPP9CompValueUl + 0x154 (7c0c2ff8) + 1a8 0x40d61a7a: Execute__13ExtensionProc + 0x12a (7c0c61cc, 41f9ac54, 7c0c61cc, 7c0c6120)
. . . . .
JOB: 001099/QNOTES/HTTP THREAD: 0x34 LE_Create_Thread2__FP12crtth_parm_t ThreadWrapper HTThreadBeginProc ThreadMain__14HTWorkerThreadFv PID/TID CheckForWork__14HTWorkerThreadFv StartRequest__9HTSessionFv ProcessRequest__9HTRequestFv ProcessRequest__21HTRequestExtContainerF19HTAppl ProcessRequest__15HTInotesRequestFv InotesHTTPProcessRequest InotesHTTPProcessRequestImpl__FP18_InotesHTTPreq Execute__3CmdFv Handler__10CmdHandlerFP3CmdPv PrivHandle__10CmdHandlerFP3Cmd PrivHandle__14CmdHandlerBaseFP3CmdT1 HandleOpenImageResourceCmd__10CmdHandlerFP20Open TryIfModifiedSinceWithDb__3CmdFP9NDatabasei GetTimeLastMod__9NDatabaseFR11tagTIMEDATET1 NSFDbModifiedTime InitDbContext InitDbContextExt Function/Class HANDLEDereferenceToNSFBLOCK Names HANDLEDereference Halt Panic fatal_error Read stack from here up OSFaultCleanup (on iSeries) OSFaultCleanupExt OSRunExternalScript
QLECRTTH QLESPI THREAD LIBNOTES HTTHREAD LIBHTTPSTA HTWRKTHR HTSESSON HTREQUST HTEXTCON HTINOTES INOTESIF LIBINOTES CMD CMDHAND CMDHANDB OPIMGHD CMD NDB NSFSEM2 LIBNOTES DBLOCK DBHANDLE OSPANIC BREAK CLEANUP
TID ################################### ###### thread 50/61 :: http, pid=11903, lwp=50, tid=50 ###### ################################### [1] ff2195ac nanosleep (de07b1b8, de07b1b0) [2] ff07e230 sleep (1, de07b220, 40, 100, de07b337, fc800000) + 58 [3] fd9fe978 OSRunExternalScript (0, 1393d8c, 11, 26c00, fed925ac, 26d40) + 15c [4] fd9fd68c OSFaultCleanup (0, 0, 125800, 2e7f, 1393a28, fed925ac) + 3e8 [5] fd9dac18 fatal_error (b, de07bd20, de07ba68, 0, 0, 0) + 1a4 [6] f83c7964 __1cCosHSolarisPchained_handler (1, fd9daa74, f853ce2c, de07ba68) + 9c [7] f820a7ac JVM_handle_solaris_signal (0, 2b0708, de07ba68, f84c8000, b, de07bd20) + 7e4 [8] ff085fec __sighndlr (b, de07bd20, de07ba68, f820a8cc, 0, 0) + c Read stack from [9] ff07fdd8 call_user_handler (b, de07bd20, de07ba68, 0, 0, 0) + 234 [10] ff07ff88 sigacthandler (b, de07bd20, de07ba68, 0, ab6c80, 13a6814) + 64 here down [11] --- called from signal handler with signal 11 (SIGSEGV) --[12] fd2fcc58 __1cJURLTargetJGetDbFile6M_pc_ (de07f27c, e, f9910900, 0, de07c0ac, 3) + 4 [13] fd2d9448 __1cFHaikuDCtxQGetFormsCachePtr6F_pnMHuFormsCache (de07bff4, 7e68, de07cc0c) + 4 [14] fd2e7f18 __1cFHaikuHGetForm6FrnHSafePtr4nGHuForm (de07cc08, dfff80, de07cc0c) + 70 [15] fd28c7a8 __1cOCustomResponseQAttemptToProcess(de07d028, d, fd639028, fd63900c) + 298 . . . Class Name Function Name
################################### ## thread 1/4 :: update pid=340, k-id=0x12850600, pthr-id=8 ## stack :: k-state=activ, stk max-size=0, cur-size=0 TID ################################### sleep() at 0x12370204 PID OSRunExternalScript() at 0x12a3255c OSFaultCleanup() at 0x12a35226 fatal_error() at 0x12a20eb8 __zerros() at 0x1238d164 Read stack from here down .() at 0x9e67426 CGtrPosWork::ReadNext(unsigned char)() at 0x9e67426 CGtrPosShort::InsertDocs(CGtrPosSh)() at 0x1650036a CGtrPosBrokerNormal::Externalize(KEY_R)() at 0x164ae9ec gtrMergeMerge() at 0x164aacc8 gtr_MergePatt() at 0x1648738a GTR__mergeIndex() at 0x1648f66e GTR_mergeIndex() at 0x16428f36 cGTRio::Merge()() at 0x163fb848 FTGMergeGTRIndexes(FTG_CTX*,int)() at 0x163f2210 FTGIndex() at 0x163df398 FTCallIndex() at 0x141b3910 . . Class Name Function Name .
Overview
Shared memory is one of the most important aspects of troubleshooting Domino issues. The Memcheck section of NSD dumps out verbose information regarding shared memory usage and shared memory structures. We touch on the more important aspects of Memcheck of which you as an administrator should be aware.

Lesson I - Summary of Shared Pools (page 15)
Lesson II - Top 10 Shared Memory Block Usage (page 17)
Lesson III - Shared OS Fields (MM/OS Structure) (page 19)
Lesson IV - Open Databases (page 20)
Lesson V - Open Documents (page 22)
Lesson VI - Open Views (page 25)
Realize that you will never be looking at NSD output in a vacuum. Usually, you will be dealing with NSD in the context of an issue, such as a crash or performance problem. You should use Memcheck in conjunction with other NSD components (namely call stacks) as well as other Domino diagnostics (such as console or semaphore output) or OS diagnostics (such as PerfMon on W32, or PerfPMR on AIX).
The summarization of shared (and private) memory usage in Memcheck includes ONLY memory allocated by Domino. Memory used by LotusScript, Java, and other third-party components is not included in this summary. In order to evaluate total memory usage for a process, OS diagnostics are typically needed.
Field / Description

Type - Indicates the various types of pools that are allocated. In almost all cases, all pools will be labeled as S-DPOOL (Static DPOOL), and will match the figures for Overall.
SIZE - Indicates the total amount of memory allocated by the Domino Memory Manager, not all of which may be in use. This total should be no higher than about 1.0 GB to 1.2 GB (see note below).
ALLOC - Indicates the amount of the SIZE that is actually in use (or sub-allocated) at any given time.
%used - Indicates the percentage of memory in use (ALLOC/SIZE). It's a good thing to have percentages in the 90% range, not a bad thing. The higher the %used, the better: we want to be making good use of what we have allocated.
Every server's configuration and usage are different; therefore the amount of memory usage for every server will also be different. However, as a rule of thumb, you should usually see around 1.0 to 1.2 GB of shared memory usage. A majority of this memory should be in the form of the UBM (0x82cd), at around 750 MB or less. While a higher number for shared memory usage may not indicate a server defect, it should be investigated to see if any configuration changes need to be made. Certain platforms such as Solaris can use more shared memory without detriment, upwards of 1.5 GB to 1.6 GB.
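These rules of thumb are easy to turn into a quick sanity check. A sketch with invented figures; the 1.2 GB ceiling and the 85% floor for %used are the guideline numbers used in this lab, not hard limits:

```shell
size_bytes=1610612736    # example total SIZE of all shared DPOOLs (1.5 GB)
alloc_bytes=1288490188   # example total ALLOC (just under 80% of SIZE)

pct=$(( alloc_bytes * 100 / size_bytes ))
if [ "$size_bytes" -gt $(( 1200 * 1024 * 1024 )) ]; then
  echo "shared memory above the ~1.2 GB rule of thumb: investigate"
fi
if [ "$pct" -lt 85 ]; then
  echo "low %used (${pct}%): possible fragmentation, collect memory dumps"
fi
```

With these example figures, both warnings fire, which mirrors the troubleshooting flow in the next section.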
Use this section of Memcheck when dealing with high shared memory usage, or some problem resulting from high shared memory usage. You may also use this section if you simply want to establish the amount of shared memory allocated by the Domino Server at any given time.
Upon examination of the Summary of Shared DPOOLs:

IF you find lots of DPOOLs (more than 500, depending on platform) OR you find high total shared memory usage (more than about 1.2 GB), THEN you may be dealing with high memory usage or a leak. You should check the Top 10 Shared Memory Block Usage: while the Summary gives you the total amount of memory allocated for all DPOOLs, it does NOT provide a breakdown of memory usage per block (for that, see the Top 10 section). Call into Support for assistance examining files.

IF you find a low %used figure (for instance, below 85%), THEN you may be dealing with pool fragmentation. You should collect memory dumps; memory dumps are needed to determine for sure whether fragmentation is occurring. Call into Support for assistance examining files.
Note: In the excerpt above, the server has 294 Shared DPOOLs allocated, with overall shared memory usage over 1.5 GB. This figure is above the threshold of 1.2 GB and should be investigated. The Top 10 Shared Memory Block Usage is usually the next place to go. Needless to say, you want to initiate a call with Support.
Top 10 Shared Memory Block Usage shows a list of the top 10 highest users of memory blocks, by total size and by number of blocks. Note: the above excerpt for the Newer Format has been truncated for clarity.

Field Label / Description
Type - The hexadecimal block type designation used by the component that allocated the block
TotalSize - Amount of shared memory allocated across all blocks of a given type
Handles - Number of blocks allocated for a given type (each block uses one handle)
This section augments the Shared Pool Summary by breaking down memory usage by block type. This output shows ONLY block usage, not pool usage. Hence, if a pool is not well utilized, you will not see the problem here. In those cases, you should collect a memory dump.

You should come here:
if you suspect that a shared block is leaking
if you get "Out of Shared Handles" errors
if you want to see shared block usage (i.e. who is using what)

Look for either a large amount of overall memory coming from one block type (exhausting the user address space) or a large number of one block type (causing "Out of shared handles" error messages).

Note: From the Top 10 excerpt above, you see the largest amount of shared memory comes from block 0x82cd (BLK_UBMBUFFER), which will typically be 500 MB to 750 MB. This usage is from the UBM, and is perfectly normal and expected. Memory usage from any other single block type should usually be an order of magnitude lower than the UBM (for instance, 75 MB).
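A mechanical version of that check: flag any non-UBM block type whose TotalSize rivals the UBM. The table below is fabricated for illustration (real Top 10 output carries header text and more columns, and block 0x8914 here is an invented type; only 0x82cd comes from the text above):

```shell
# Columns: block type, TotalSize (bytes), Handles
cat > top10.txt <<'EOF'
0x82cd 734003200 980
0x8914 120000000 25
0x0430 3000000 5
EOF

# Anything other than the UBM (0x82cd) above ~75 MB deserves a closer look
awk '$1 != "0x82cd" && $2 > 75000000 { print $1, "is unusually large:", $2 }' top10.txt
```

Against this sample table, only the invented 0x8914 row is flagged.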
This section gives you information about server start time, crash time, and any PANIC messages if present. It also provides the thread ID (both physical and virtual) that crashed the server, labeled "Thread" in ND6 and "StaticHang" in ND7. In theory, you can consult this section when you have multiple fatal threads or panics to determine which thread crashed first (*). Use this section if you want to establish specific crash information, although this information is available through other means, such as the NSD time and console output.
(*)CAUTION
For ND6.5.4 (and earlier) on W32, the Thread field actually reflects the last crashing thread ID when there are two or more fatal threads. This issue has been addressed as of ND6.5.5. As a workaround, you can determine the first fatal thread by examining the OS Process Table near the top of the NSD and looking for the process that invoked the NSD process; this is the process that caused the server to crash.
<@@ ------ Notes Memory Analyzer (memcheck) -> Open Databases (Time 11:45:21) ------ @@> D:\Lotus\Domino\Data\HR\projnav.nsf Version = 41.0 SizeLimit = 0, WarningThreshold = 0 ReplicaID = 862568fe:0019c2ad bContQueue = NSFPool [120: 7236] FDGHandle = 0xf01c0928, RefCnt = 3, Dirty = N DB Sem = (FRWSEM:0x0244) state=0, waiters=0, refcnt=0, nlrdrs=0 Writer=[] SemContQueue ( RWSEM:#0:0x029d) rdcnt=-1, refcnt=0 Writer=[] n=0, wcnt=-1, Users=-1, Owner=[] By: [ nSERVER:0768: 132] DBH= 63984, User=CN=Rhonda Smith/O=ACME/C=US By: [ nSERVER:0768: 132] DBH= 64127, User=CN=Rhonda Smith/O=ACME/C=US By: [ nSERVER:0768: 132] DBH= 63006, User=CN=Rhonda Smith/O=ACME/C=US D:\Lotus\Domino\Data\finance\Finance.nsf Version = 41.0 SizeLimit = 0, WarningThreshold = 0 ReplicaID = 86256c87:00664fb0 bContQueue = NSFPool [1935: 58788] FDGHandle = 0xf01c2a98, RefCnt = 1, Dirty = N DB Sem = (FRWSEM:0x0244) state=0, waiters=0, refcnt=0, nlrdrs=0 Writer=[] SemContQueue ( RWSEM:#0:0x029d) rdcnt=-1, refcnt=0 Writer=[] n=0, wcnt=-1, Users=-1, Owner=[] By: [ nhttp:0fe4: 53] DBH= 63808, User=Anonymous D:\Lotus\Domino\Data\manufacturing\Suppliers.nsf Version = 43.0 SizeLimit = 0, WarningThreshold = 0 ReplicaID = 86256df1:006d2839 bContQueue = NSFPool [2014: 52900] FDGHandle = 0xf01c0cf6, RefCnt = 1, Dirty = Y DB Sem = (FRWSEM:0x0244) state=0, waiters=0, refcnt=0, nlrdrs=0 Writer=[] SemContQueue ( RWSEM:#0:0x029d) rdcnt=-1, refcnt=0 Writer=[] n=0, wcnt=-1, Users=-1, Owner=[] By: [ nhttp:0fe4: 34] DBH= 63883, User=CN=Mary Peterson/O=ACME/C=US
Database name - Name of the database. When an absolute path is given, the database has been opened locally. If the database name contains !!CN=Server/O=ACME..., the database has been opened via the network, including local databases that are opened using the server name
Version - Version of the database's ODS
SizeLimit / WarningThreshold / ReplicaID - Self-explanatory
By - Indicates the process name, process ID, and virtual thread ID of the server thread that has opened the database
DBH - Database handle for each instance opened (can be correlated to Open Documents)
User - Name of the user that has the database open. Each user can have multiple instances of a database opened (multiple DBHs)
The Open Databases section is one of the more important parts of Memcheck. It indicates which databases are open and which users are accessing them. Use this section to establish the total number of open databases (you must estimate this count manually), or when you need specific information about a database. The main information you will find helpful is:
- Database Name
- ODS Version
- Replica ID
- List of Users
While this section is valuable, you will likely find that the Resource Usage Summary is better equipped to provide immediate answers as to which databases, views, or documents a crashing or hung thread is accessing. However, this section gives more verbose database information.
[Open Documents excerpt, garbled by extraction; the surviving fragments reference d:\notedata\MAIL3\jdoe.nsf and d:\notedata\MAIL2\dfish.nsf]
DBH - Database handle for each instance opened (can be correlated to Open Databases to determine the database name)
NOTEID - NOTEID in decimal for the opened note
CLASS - Class of the note; indicates whether a note is a form note, view note, agent note, or document note (see table below)
IsProf (Newer Format) - Indicates whether the document is a profile document
Pools (Newer Format) - Indicates how many POOLs the document is spread across in memory
Size (Newer Format) - Indicates the size of the document metadata held in the above-mentioned POOLs
Items (Newer Format) - Indicates how many items/fields the document contains
Database (Newer Format) - Indicates which database the document belongs to
Opened By (Newer Format) - Indicates which user opened the document
Note classes include: Data Note (document), Form Note, View Note, ACL Note, Agent Note, and Replication Formula Note.
You will probably find the Open Documents section one of the more helpful sections of NSD. It lists all open notes in memory on the server (or client), regardless of the type of note (design note, data note, etc.). Examine this section to determine NoteID information, or to establish which database a document belongs to. You are most interested in:
- DBH - database handle
- NOTEID - note ID
- CLASS - note class (e.g. NOTE_CLASS_DOCUMENT)
- FLAGS - note flags (e.g. NOTE_FLAG_UNREAD)
- Pools - how many POOLs the document is spread across in memory
- Items - how many items/fields the document contains
- Size - the size of the document metadata held in the above-mentioned POOLs
- Database - which database the document belongs to
- Opened By - which user opened the document
The note's CLASS tells you whether a note is a document, view note, agent, etc. This information is also held in the Resource Usage Summary, where it is better organized. However, the placement of the Open Documents section within Memcheck indicates whether the notes are open in private or shared memory, information that is not reflected in the Resource Usage Summary.
Within an NSD, you may encounter multiple Open Documents sections, because Memcheck dumps different types of open documents as separate lists: for instance, notes opened locally, notes opened remotely, or new notes. In addition, documents may be opened in shared or private memory depending on the needs of the component that opened them. You may therefore see as many as three open document lists in shared memory and three under each process. It is also quite possible that you will see no open document lists at all (if no notes were open at the time NSD was run).
Whenever a server opens a note on behalf of a client, it performs what is called a fast note open: the server opens the note just long enough to pass it back to the client, and then closes it. So while a client may have a document open for long periods of time, the server keeps the note open for only a fraction of a second. Hence, open notes listed within a server NSD are typically not open on behalf of clients, but rather are notes open for the server's own use (such as during an agent run or mail delivery).
<@@ ------ Notes Memory Analyzer (memcheck) -> NIF Collections (Time 12:48:35) ------ @@> CollectionVB ViewNoteID UNID OBJID RefCnt Flags Options Corrupt Deleted Temp NS Entries ViewTitle ------------ ---------- -------- ------ ------ ------ -------- ------- ------- ---- --- ------- -----------[ 0020e005] 1518 1356a8 358710 1 0x0000 00000008 NO NO NO NO 0 MyNotices CIDB = [ 0253cc05] CollSem (FRWSEM:0x030b) state=0, waiters=0, refcnt=0, nlrdrs=0 Writer=[ : 0000] NumCollations = 2 bCollationBlocks = [ 001e72e5] bCollation[0] = [ 00117005] bCollation[1] = [ 001a2205] CollIndex = [ 00012a09] Collation 0:BufferSize 26,Items 1,Flags 0 0: Ascending, by KEY, "StartDateTime", summary# 2 CollIndex = [ 00012c09] Collation 1:BufferSize 26,Items 1,Flags 0 0: Descending, by KEY, "StartDateTime", summary# 2 ResponseIndex [ 0010e4b6] NoteIDIndex [ 0010e385] UNIDIndex [ 0010e5e7] <@@ ------ Notes Memory Analyzer (memcheck) -> NIF Collection Users (hash) (Time 12:48:33) ------ @@> CollUserVB ... CollectionVB Remote OFlags ViewNoteID Data HDB/Full View HDB/Full ------------ ... ------------ ------ ------ ---------- ------------- ------------[ 00239805] ... [ 0023d005] NO 0x0082 786 1219/1874 1219/1874 CurrentCollation = 0 [ 0013a805] ... [ 00136005] NO 0x00c2 11122 886/785 886/785 CurrentCollation = 0 [ 0028d805] ... [ 0020e005] NO 0x00c2 1518 551/1432 551/1432 CurrentCollation = 0 ... Open By ... -------------... [ nserver: 09d8: ... [ nserver: ... [ nserver: 09d8: 09d8:
Note: The string .... indicates where output has been removed for clarity.
NIF Collections provides information about all collections open server-wide. Use this section when you want an idea of the total number of open views, or when you want detailed information about a view, such as its number of collations. NIF Collection Users provides information about the users of the views open on the server (which thread has a view open, what collation it is using, and so on). For a summary of the views being used by a specific thread, it is easier to refer to the Resource Usage Summary, but if you need to take a deeper dive, such as investigating which threads are waiting on a semaphore for a given view, you will refer to these two sections. The main reason to use the otherwise rather cryptic CollectionVB value is to cleanly correlate NIF Collection Users information back to NIF Collections. Sometimes you can use ViewNoteID to do this, but since view notes in different databases can share the same NoteID, this is not a definitive match, whereas CollectionVB is definitive (since it is the in-memory location of the collection info). Below is a short description of the more important fields, which appear in the excerpt above, that you will routinely investigate.
CollectionVB (Collections & Collection Users) - In-memory location of collection info; good for correlating data between Collections and Collection Users
ViewNoteID (Collections & Collection Users) - NoteID of the view design note; sometimes good for correlating data between Collections and Collection Users
RefCnt (NIF Collections) - Number of users that have the view open
ViewTitle (NIF Collections) - Name of the view/collection
CollSem (NIF Collections) - Info about the collection semaphore; if contention exists on a view, contains reader and waiter information (same as semdebug)
NumCollations (NIF Collections) - Number of collations in the view (hints at view complexity)
Ascending/Descending (NIF Collections) - Indicates how the collation is sorted, and by which field
View HDB/Full (NIF Collection Users) - Database handles (user & full server access) for the database that contains the view; good for correlating data
Open By (NIF Collection Users) - Indicates the process name, PID, and TID of the thread that has the view open
Introduction
Private memory can also be important in troubleshooting, particularly when high memory usage occurs in process-private memory. Important sections in Private Memory include:
- Lesson I - TLS Mapping
- Lesson II - Open Documents
- Lesson III - Process Heap Memory
- Lesson IV - Top 10 Memory Block Usage
- Search on KEYWORD memcheck until you come to the beginning of the Memcheck section
- Search on KEYWORD attach; this brings you to the beginning of the Private Memory section, where Memcheck attaches to the first process
- Subsequent searches on KEYWORD attach take you to the next process
- Search on KEYWORD detach to get to the end of a specific process's information
As stated before, the majority of memory usage for a Domino server should come from shared memory. In contrast, each server task should on average use considerably less private memory through the Domino Memory Manager, between 50 MB and 100 MB; there are a few exceptions to this rule, including the server, http, and router tasks, which may use upwards of 200-300 MB. While it may not indicate a server defect, if private memory usage for a task exceeds these thresholds, you should investigate the nature of this memory usage and whether any configuration changes are required. In these cases, call Support to assist in the investigation.
The TLS Mapping is crucial for mapping the virtual thread IDs used throughout Memcheck to the physical thread IDs used in other portions of the NSD, such as call stacks. Consult the TLS Mapping when you wish to determine which virtual thread ID a particular physical thread is running under (i.e. you have the physical thread ID, but not the virtual thread ID). You can also use this section if you know the virtual thread ID and need the physical thread ID, but that can also be found in the Resource Usage Summary, along with all the open resources for that virtual or physical thread.
Virtual thread IDs are numbers generated by Domino that allow for an additional abstraction layer between Domino and OS threads. This is done primarily to facilitate scalability for the server when dealing with network I/O. For many server tasks, a given physical thread will constantly be switching from one virtual thread ID to another during normal operations.
Formatted in the same manner as the Open Documents sections from shared memory, this section lists documents that are open in private memory. Use this section when determining how many notes a specific process has open. While the Resource Usage Summary also lists all these documents, it does not indicate the scope of each note; you can use this section to establish that scope if needed.
Pool type - Indicates the various types of pools that are allocated. For private memory, the overall total is currently calculated incorrectly, so you should examine the stats for S-DPOOL, or Static DPOOL (other pool types such as POOL and VPOOL are actually sub-allocated from S-DPOOL, so this stat reflects overall usage).
SIZE - Indicates the total amount of private memory allocated by the Domino Memory Manager, not all of which may be in use. This total should be no higher than about 100 MB (see the note below).
ALLOC - Indicates the amount of the SIZE that is actually in use (or sub-allocated) at any given time.
%used - Indicates the percentage of memory used (ALLOC/SIZE). Percentages in the 90% range are a good thing, not a bad thing: the higher the %used, the better use we are making of what we have allocated.
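The %used figure is simply ALLOC divided by SIZE. A tiny sketch of the arithmetic, with made-up numbers (read the real SIZE and ALLOC values from the Process Heap Memory section of your NSD):

```python
# Sketch: compute %used (ALLOC / SIZE) for an S-DPOOL line.
# The byte values below are illustrative, not from a real NSD.

def percent_used(size_bytes, alloc_bytes):
    """Return the percentage of allocated pool memory actually in use."""
    return 100.0 * alloc_bytes / size_bytes

size = 96 * 1024 * 1024    # SIZE: total private memory the MM has allocated
alloc = 88 * 1024 * 1024   # ALLOC: the portion currently sub-allocated
print(f"{percent_used(size, alloc):.1f}% used")  # a high %used is healthy
```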
Similar to the Shared Memory Summary, this section can be used to establish the total amount of private memory Domino has allocated for a process; it is augmented by the Top 10 [Process] Block Usage sections. The information here is identical in nature to the shared memory pool section, only this time the memory is private to each process. Use this section of Memcheck when dealing with high private memory usage, or a problem resulting from high private memory usage or high private handle counts. For instance:
- if a crash is due to high private memory usage
- if you suspect a leak in private memory
- if you want to evaluate the amount and efficiency of private memory usage
As mentioned earlier, each server task should use considerably less private memory (50-100 MB) than shared memory (1.0-1.2 GB); there are a few exceptions to this rule, including the server, http, and router tasks, which may use upwards of 200-300 MB. While it may not indicate a server defect, when private memory exceeds these thresholds, the nature of this memory usage should be investigated.
Type - The hexadecimal block type designation used by the component that allocated the block
TotalSize - Amount of private memory allocated across all blocks of a given type
Handles - Number of blocks allocated for a given type (each block uses one handle)
Use this section as a quick method of establishing process-private block usage broken down by block type. This shows ONLY block usage, not total pool usage; if a pool is poorly utilized, you will not see the problem here. In those cases, you should consult the Process Heap Memory summary and collect memory dumps (opening a ticket with Support). You should come here:
- if you suspect that a private block is leaking
- if you get "Out of Private Handles" errors
- if you want to see private block usage
Just as in shared memory, look for either a large amount of total memory usage for one block type, or a large count of one block type. Again, private memory usage over 100 MB should be a concern, with a few exceptions such as server, http, or router (which may use upwards of 200-300 MB).
Introduction
No doubt you have noticed that this documentation has repeatedly mentioned the importance of the Resource Usage Summary. As has already been alluded to, this section of Memcheck lists all the major resources broken out per thread ID (virtual and physical). The format of this section is cleanly organized, allowing for one-stop shopping. Next to memory usage, this is perhaps the most important section of Memcheck, and outside of memory analysis you will spend most of your time here. Important output is: VThread (Virtual Thread) To PThread (Physical Thread) Mapping
---- Resource Usage Summary ----
** Process [ nserver:289c]
.. SOBJ: addr=0x01130004, h=0xf0104002 t=8128 (BLK_PCB)
.. SOBJ: addr=0x022d2628, h=0xf010400f t=8a18 (BLK_NET_PROCESS)
.. SOBJ: addr=0x010ede50, h=0xf01c0043 t=1381 (BLK_SCT_MGR_CONTEXT)
.. SOBJ: addr=0x00000001, h=0x00000000 t=8310 (BLK_NIF_PROCESS)
.. SOBJ: addr=0x02670338, h=0xf01c0036 t=0a22 (BLK_LOCINFO)
.. SOBJ: addr=0x00000001, h=0x00000000 t=9508 (BLK_EVENT_PROCESS)
.. SOBJ: addr=0x022d2a84, h=0xf0104018 t=831b (BLK_DIRASSIST)
.. SOBJ: addr=0x022d2764, h=0xf0104013 t=8912 (BLK_SERVER_PROCESS)
.. SOBJ: addr=0x022d2894, h=0xf0104028 t=880f (BLK_CLIENT_PROCESS)
.. SOBJ: addr=0x010f0c3c, h=0xf01c004f t=030f (BLK_NIF)
.. SOBJ: addr=0x010eedd8, h=0xf01c004b t=0a04 (BLK_NET)
.. SOBJ: addr=0x010a17d4, h=0xf01c000d t=0901 (BLK_SERVER)
.. SOBJ: addr=0x01160814, h=0xf01c00c2 t=151d (BLK_EVENT_LIC_GLOBAL)
.. SOBJ: addr=0x0253e918, h=0xf01c00e2 t=0803 (BLK_CLIENT)
.. SOBJ: addr=0x033669f8, h=0xf01c0154 t=0f73 (BLK_DBMISC_POLICY_GLOBAL)
.. SOBJ: addr=0x0109c32c, h=0xf01c0009 t=0f6f (BLK_SERVER_ACL)
.. SOBJ: addr=0x010a6cb4, h=0xf01c000e t=024c (BLK_NSF)
** VThread [ nserver:289c: 77] .Mapped To: PThread [ nserver:289c:10684] .Description: Server for Rob Gearhart/SET on TCPIP .. using: Primal Thread [ nserver:289c: 58] .. SOBJ: addr=0x0345918c, h=0xf01040c4 t=c130 (BLK_TLA) .. SOBJ: addr=0x0231d4b0, h=0xf01040c6 t=c275 (BLK_NSFT) .. Task: TaskID=[ 70: 8392], PRThread: [ nserver:289c: 77] .... using: IOCP h=297271297, VThread [ nserver:289c: 10] .... TaskVar: id: [ 70: 8392], transID=5457, Func=59, st=6, by: CN=Rob Gearhart/O=SET .. Database: D:\Lotus\DominoR65\Data\mayhem.nsf .... DBH: 91, By: CN=Rob Gearhart/O=SET .... DBH: 84, By: CN=Rob Gearhart/O=SET ...... doc: HDB=84, ID=2310, H=6705, class=0001, flags=4000 ...... view: hCol=105, cg=N noteID=322, sessID: [ 12: 3732] All Documents|($All) .. file: fd: 1060, D:\Lotus\DominoR65\Data\mayhem.nsf
Use this section when you want to know which databases, views, notes, or files a physical thread has open. You can find:
- Database - establish the database name
- DBH - correlate the instance of the database user against other sections of Memcheck
- By - determine the user of the database
- view - establish any open views
- doc - establish a list of notes open by the thread (keep an eye on class)
- file - establish a list of any files the thread had open through Domino file management
Keep in mind that this list of resources is specific to the virtual thread ID to which the suspect physical thread happens to be mapped at that moment. Not all of these resources were opened by, or even in use by, this physical thread at the time of the crash. However, this should not affect your investigation. Often there will be multiple databases, views, etc. opened by the thread, only one of which may be responsible for the problem. This means you will need to make an educated guess as to which resource was actually involved (based in part on the arguments on the stack).
Be careful not to jump to conclusions about a problem database, view, or document. Just because a database, view, or document was open at the time of the crash doesn't mean the root cause of the crash is due to accessing that resource. Not every database involved in a crash is corrupt; in fact, most aren't. There are many cases where a crash or hang results from conditions completely unrelated to the resource being used. Always take an open resource with a grain of salt, and get an idea from the call stack about the nature of the crash.
Introduction
We have seen throughout Memcheck Anatomy, as well as in this unit, how Memcheck lists the same information in multiple contexts. Knowing how to correlate all this information to find what you need is very important. The Resource Usage Summary ties many of the important pieces together, but it is helpful to know how to correlate them directly. This section discusses how to tie certain pieces of information together:
- Lesson I - Finding Physical/Virtual Thread
- Lesson II - Correlating Database Handle with Open Documents
- Lesson III - Correlating View Information
PIDs/TIDs
When using a text editor, be sure to use process IDs and thread IDs to locate the process you are troubleshooting, since there can be multiple processes with the same name (AMGR, Replica, etc.). When using a thread ID, you will almost always start with a physical (native) thread ID and work your way back to the virtual thread ID using the TLS Mapping section for the process in question. For instance:
------ TLS Mapping ------
  NativeTID             VirtualTID            PrimalTID
[ nHTTP:113c: 1852]  [ nHTTP:113c:  5]  [ nHTTP:113c:  5]
[ nHTTP:113c: 2560]  [ nHTTP:113c:  6]  [ nHTTP:113c:  6]
[ nHTTP:113c: 2556]  [ nHTTP:113c:  7]  [ nHTTP:113c:  7]
[ nHTTP:113c: 4152]  [ nHTTP:113c:  8]  [ nHTTP:113c:  8]
[ nHTTP:113c: 1800]  [ nHTTP:113c:  9]  [ nHTTP:113c:  9]
[ nHTTP:113c: 2580]  [ nHTTP:113c: 10]  [ nHTTP:113c: 10]
[ nHTTP:113c: 4284]  [ nHTTP:113c: 11]  [ nHTTP:113c: 11]
[ nHTTP:113c:10840]  [ nHTTP:113c: 12]  [ nHTTP:113c: 12]
[ nHTTP:113c: 3500]  [ nHTTP:113c: 13]  [ nHTTP:113c: 13]
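Walking from native to virtual thread IDs, as in the nHTTP rows above, is mechanical enough to script. A minimal sketch, assuming each data row carries bracketed task:PID:TID triples in NativeTID, VirtualTID, PrimalTID order (adjust the pattern if your NSD formats the columns differently):

```python
import re

# Sketch: build a (pid, native tid) -> virtual tid map from TLS Mapping rows.
# The triple layout is an assumption based on the excerpt above.

TRIPLE = re.compile(r"\[\s*(\w+):([0-9a-fA-F]+):\s*(\d+)\]")

def tls_map(lines):
    """Map (pid, native_tid) -> virtual_tid for each TLS Mapping row."""
    mapping = {}
    for line in lines:
        ids = TRIPLE.findall(line)
        if len(ids) >= 2:  # need at least the Native and Virtual columns
            (_, pid, native), (_, _, virtual) = ids[0], ids[1]
            mapping[(pid, int(native))] = int(virtual)
    return mapping

rows = ["[ nHTTP:113c: 1852]  [ nHTTP:113c:  5]  [ nHTTP:113c:  5]"]
print(tls_map(rows))  # {('113c', 1852): 5}
```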
Each user's DBH is listed in several different sections within Memcheck. In the original NSD format, you can correlate an open note to its database handle using the following two sections:
- Open Databases (Database Handle & Database Name)
- Open Documents (Database Handle & NoteID)
------ Open Databases ------D:\Lotus\DominoR65\Data\mayhem.nsf Version = 43.0 SizeLimit = 0, WarningThreshold = 0 ReplicaID = 86256de1:0082e471 bContQueue = NSFPool [ 2: 40868] FDGHandle = 0xf01c01a8, RefCnt = 15, Dirty = N DB Sem = (FRWSEM:0x0244) state=0, waiters=0, refcnt=0, nlrdrs=0 Writer=[] SemContQueue ( RWSEM:#0:0x029d) rdcnt=-1, refcnt=0 Writer=[] n=0, wcnt=-1, Users=-1, Owner=[] By: [ nserver:0874: 110] DBH= 77, User=CN=Sithlord/O=SET By: [ nserver:0874: 110] DBH= 80, User=CN=Sithlord/O=SET By: [ nHTTP:113c: 9] DBH= 97, User=CN=Sithlord/O=SET By: [ nHTTP:113c: 9] DBH= 98, User=CN=Sithlord/O=SET By: [ nHTTP:113c: 10] DBH= 100, User=CN=Sithlord/O=SET By: [ nHTTP:113c: 10] DBH= 101, User=CN=Sithlord/O=SET By: [ nHTTP:113c: 11] DBH= 103, User=CN=Sithlord/O=SET By: [ nHTTP:113c: 11] DBH= 104, User=CN=Sithlord/O=SET ----------- Open Documents ---------
[Open Documents excerpt truncated by extraction; only the FirstItem column survived, e.g. [ 6706: 816] through [ 6711: 816]]
The Newer Format for NSD lists the database name directly in the Open Documents section, as well as the name of the user that has the document open:
<@@ ------ Notes Memory Analyzer (memcheck) -> Open Documents (BLK_OPENED_NOTE): ...------ @@> DBH 531 NOTEID HANDLE CLASS FLAGS IsProf #Pools #Items 7330 0x24ff 0x0001 0x0200 Yes 1 4 . Open By: CN=John Smith/O=ACME/C=US Flags2 = 0x0404 Flags3 = 0x0000 OrigHDB = 531 First Item = [ 9471: 836] Last Item = [ 9471: 1228] Non-pool size : 0 Member Pool handle=0x24ff, size=2984 Size Database 2984 d:\notedata\drmail\jsmith.nsf
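The DBH join described in this lesson amounts to a dictionary lookup. The handles, NoteIDs, and paths below are illustrative stand-ins loosely based on the excerpts; a real script would first parse them out of the two sections:

```python
# Sketch: resolve an open document's database name through its DBH, the way
# you would by hand with the Open Databases and Open Documents sections.
# All values here are stand-ins, not parsed from a real NSD.

open_databases = {            # DBH -> database path (from Open Databases)
    97: r"D:\Lotus\DominoR65\Data\mayhem.nsf",
    98: r"D:\Lotus\DominoR65\Data\mayhem.nsf",
}
open_documents = [            # (DBH, NOTEID) pairs (from Open Documents)
    (97, 2310),
    (98, 2462),
]

def resolve_documents(documents, databases):
    """Pair each (DBH, NOTEID) with the database path behind that DBH."""
    return [
        (note_id, databases.get(dbh, "<unknown DBH>"))
        for dbh, note_id in documents
    ]

for note_id, path in resolve_documents(open_documents, open_databases):
    print(f"NoteID {note_id} lives in {path}")
```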
Determining view information can be a bit challenging: the view name is listed in one place, the DBH for the database in another, and the name of the database in yet another. The sections that allow you to correlate view name and database name are:
- NIF Collections (ViewNoteID or CollectionVB)
- NIF Collection Users (ViewNoteID or CollectionVB, Database Handle, Thread ID)
- Open Databases (Database Handle, Database Name, Thread ID)
<@@ ------ Notes Memory Analyzer (memcheck) -> NIF Collections (Time 12:48:35) ------ @@> CollectionVB ViewNoteID UNID OBJID RefCnt Flags Options Corrupt Deleted Temp NS Entries ViewTitle ------------ ---------- -------- ------ ------ ------ -------- ------- ------- ---- --- ------- -----------[ 0020e005] 1518 1356a8 358710 1 0x0000 00000008 NO NO NO NO 0 MyNotices CIDB = [ 0253cc05] CollSem (FRWSEM:0x030b) state=0, waiters=0, refcnt=0, nlrdrs=0 Writer=[ : 0000] NumCollations = 2 bCollationBlocks = [ 001e72e5] bCollation[0] = [ 00117005] bCollation[1] = [ 001a2205] CollIndex = [ 00012a09] Collation 0:BufferSize 26,Items 1,Flags 0 0: Ascending, by KEY, "StartDateTime", summary# 2 CollIndex = [ 00012c09] Collation 1:BufferSize 26,Items 1,Flags 0 0: Descending, by KEY, "StartDateTime", summary# 2 ResponseIndex [ 0010e4b6] NoteIDIndex [ 0010e385] UNIDIndex [ 0010e5e7] <@@ ------ Notes Memory Analyzer (memcheck) -> NIF Collection Users (hash) (Time 12:48:33) ------ @@> CollUserVB ... CollectionVB Remote OFlags ViewNoteID Data HDB/Full View HDB/Full ------------ ... ------------ ------ ------ ---------- ------------- ------------[ 00239805] ... [ 0023d005] NO 0x0082 786 1219/1874 1219/1874 CurrentCollation = 0 [ 0013a805] ... [ 00136005] NO 0x00c2 11122 886/785 886/785 CurrentCollation = 0 [ 0028d805] ... [ 0020e005] NO 0x00c2 1518 551/1432 551/1432 CurrentCollation = 0 ... Open By ... -------------... [ nserver: 09d8: ... [ nserver: ... [ nserver: 09d8: 09d8:
<@@ ------ Notes Memory Analyzer (memcheck) -> Open Databases (Time 12:47:58) ------ @@> D:\Lotus\Domino\Data\HR\projnav.nsf Version = 41.0 SizeLimit = 0, WarningThreshold = 0 ReplicaID = 862568fe:0019c2ad bContQueue = NSFPool [120: 7236] FDGHandle = 0xf01c0928, RefCnt = 3, Dirty = N DB Sem = (FRWSEM:0x0244) state=0, waiters=0, refcnt=0, nlrdrs=0 Writer=[] SemContQueue ( RWSEM:#0:0x029d) rdcnt=-1, refcnt=0 Writer=[] n=0, wcnt=-1, Users=-1, Owner=[] By: [ nSERVER:09d8: 03b0] DBH= 551, User=CN=Rhonda Smith/O=ACME/C=US By: [ nSERVER:09d8: 132] DBH= 2023, User=CN=Rhonda Smith/O=ACME/C=US By: [ nSERVER:09d8: 02ae] DBH= 128, User=CN=Rhonda Smith/O=ACME/C=US
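The three-way correlation can be sketched as a pair of lookups. The values below are taken from the excerpts (CollectionVB 0020e005, view DBH 551); the dictionaries stand in for data you would read out of the NSD by hand:

```python
# Sketch: tie a view back to its database name via CollectionVB and DBH,
# mirroring the three sections above. The data structures are stand-ins;
# a real script would parse them from the NSD text.

collections = {               # CollectionVB -> view title (NIF Collections)
    "0020e005": "MyNotices",
}
collection_users = [          # (CollectionVB, view DBH) (NIF Collection Users)
    ("0020e005", 551),
]
open_databases = {            # DBH -> database path (Open Databases)
    551: r"D:\Lotus\Domino\Data\HR\projnav.nsf",
}

def views_with_databases():
    """Yield (view title, database path) for every collection user."""
    for coll_vb, dbh in collection_users:
        yield (collections.get(coll_vb, "<unknown view>"),
               open_databases.get(dbh, "<unknown db>"))

for title, db in views_with_databases():
    print(f"View '{title}' is open in {db}")
```

Note that the join key is CollectionVB rather than ViewNoteID, for the reason given earlier: CollectionVB is unique in memory, while NoteIDs can repeat across databases.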
Below is a suggested checklist that you as an administrator might put together for looking over an NSD. Needless to say, we cannot capture the entirety of NSD analysis here, but hopefully this checklist gives you a broad overview of what to review:
Call Stacks - locate the crashing process/thread
- KEYWORD fatal or panic
- What is the physical thread ID?
- What was the crash point?
Memcheck Shared Memory - determine Domino shared memory usage
- KEYWORD Shared Memory
- KEYWORD Top 10 Shared Block Usage
- KEYWORD Open Databases
- KEYWORD Open Documents
- KEYWORD NIF Collections & NIF Collection Users
- KEYWORD Shared OS or MM/OS to determine the Panic Message
Memcheck Private Memory - determine Domino private memory usage
- KEYWORD Process Heap Memory
- KEYWORD Top 10 Process Block Usage
- KEYWORD TLS Mapping
Resource Usage Summary - determine databases, views, documents opened by a thread
- KEYWORD Mapped To: PThread
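As a first pass, the keyword walk in the checklist can be scripted rather than done by hand in an editor. This is a hedged illustration: the keyword strings follow the checklist, but exact section titles differ across NSD versions, so treat them as adjustable:

```python
# Sketch: report the line where each checklist keyword first appears in an
# NSD, so you can jump straight to the right section. Keyword strings are
# taken from the checklist above and may need tuning for your NSD version.

KEYWORDS = [
    "fatal", "panic", "Shared Memory", "Top 10 Shared Block Usage",
    "Open Databases", "Open Documents", "NIF Collections",
    "Process Heap Memory", "TLS Mapping", "Mapped To: PThread",
]

def first_hits(lines):
    """Map each keyword to the 1-based line number of its first occurrence."""
    hits = {}
    for number, line in enumerate(lines, start=1):
        for kw in KEYWORDS:
            if kw not in hits and kw.lower() in line.lower():
                hits[kw] = number
    return hits

sample = ["<@@ ---- Open Databases ---- @@>", "PANIC: LookupHandle"]
print(first_hits(sample))
```

In practice you would pass `open("nsd.log")` instead of the sample list and use the reported line numbers as jump targets in your editor.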
What should you as an administrator remember about NSD? It's one of our most important Domino diagnostics: it tells you what a server or client was doing at the time of a problem, and which resources it was using. While NSD is not the proverbial magic bullet, it will get you moving solidly in the right direction. We have given you just a taste of what you can do with NSD. Used in conjunction with other diagnostic files such as CONSOLE.LOG, SEMDEBUG.TXT, and relevant OS diagnostics, NSD can give you surprising insight into the nature of a crash or hang, if not the root cause itself. Once you know how to use NSD to your advantage, you can use it to determine the next steps in troubleshooting a problem, reducing time to resolution while improving your grasp of the issue.
Case Studies
The following are a series of scenarios intended to improve your working knowledge of NSD. In each scenario, you are given a brief explanation of the problem along with the NSD from the server (and, in some cases, supporting files), and asked a series of questions intended to guide you through the files. We are not asking you to resolve a problem top to bottom; instead, we may only ask what the next step would be in a particular situation. This echoes real life, where you may not be able to definitively solve a problem from one NSD, but you can use what you see in the file to narrow down the problem and determine next steps. See the answer key at the end - but no peeking! The provided NSD files should remain in the lab, and should not be taken with you when you leave. In addition, we have a special surprise for you: we have provided an alpha version of NSDAnalyzer (part of a tool called ISEW), a tool used to view and analyze an NSD. This is an internal beta provided as a special courtesy for Lotusphere only. No, we won't give you a sneak copy to take home with you, so don't bother asking. However, you are getting a preview of a tool that we eventually hope to make available to customers for use in their environments; we do not have an ETA for when this may be available.
Laptop sign-in: Lotus, Password: password
ISEW sign-in: [email protected], Password: password
Scenario 1
Question 2: What is the note ID and database name for the crashing agent?
Question 4: How would you establish the name of the agent that crashes?
Question 5: Using the call stack for the fatal thread, can you find any documented TNs in the Knowledge Base, for instance, an SPR?
Scenario 2
Question 3: Using the Shared Memory Pools section of NSD, what is overall Shared Memory usage as allocated by Domino? What is %used? Is this normal?
Question 4: Using the Process Heap Memory section of NSD, what is the overall private memory usage for the process in question? What is the %used? Is this normal?
Question 5: Open the memory dump file, and examine LotusScript memory usage for the process in question. What is the overall private memory usage in bytes for LotusScript Usage? Is this normal?
Question 6: Adding the amount of private memory usage from question 4 to that of question 5, how many bytes of private memory is being used by the process in question (through both the Domino MM and LotusScript MM)? Is this normal?
Scenario 3
Question 3: Given the few clues you have from the call stack, as well as from the Resource Usage Summary for this thread, what do you think the thread might have been doing at the time of the crash?
(hint1 what is the one DLL name you can decipher in the call stack?) (hint2 under what cases would the http process be executing java?) (hint3 what is the user ID of the documents opened by this thread?)
Scenario 4
Files: scenario_4_nsd.log, scenario_4_semdebug.txt, scenario_4_console.log Problem: HTTP is non-responsive. Question 1: Examining the semdebug.txt, what semaphore is the point of contention? (Hint: look at the semdebug.txt file as a whole)
Question 3: Based on the ownership of this semaphore, does it look like the server is hung?
Question 4: What is the timeframe during which the call stacks in the NSD are dumped? (Hint: look for DBG() to determine the timestamps in NSD).
Question 5: During the window of time in question 4, which thread owns the semaphore in question? As far as you can tell, how long did this thread have the semaphore locked? What was this thread doing during that time?
Question 6: Using the Memcheck portion of NSD, what database and view was experiencing the Collection Semaphore contention?
Question 7: At the time of the NSD, how many readers and waiters were there for this view? How many waiters are waiting for a read versus a write? What are the PID and TID of the first writer waiting for the semaphore?
Question 8: If you had to guess, what would you say is the problem? What would be your next step?
46
Scenario 5 (BONUS)
Question 4: Who is the largest user of shared memory, and how much are they using? Is this normal?
Question 5: Why is there so much memory allocated for BLK_OPENED_NOTE? (Hint: Look in the Open Documents section)
Question 7: What do you think is going on here? Does this appear to be a server defect?
47
Scenario 1 Answers
File: scenario_1_nsd.log
Problem: AMgr has crashed on an agent.

Question 1: What is the process ID and thread ID of the crashing thread?
Answer: Process=AMgr, Process ID=0B04, Thread ID=1288

Question 2: What is the note ID and database name for the crashing agent?
Answer: Database=SomeDB.nsf, Agent NoteID=966

Question 3: What is the virtual thread ID for the crashing thread?
Answer: Virtual Thread ID=0002

Question 4: How would you establish the name of the agent that crashes?
Answer: Use the Admin Client to search the database by NoteID and pull up the agent design note; from there, you can read the $TITLE field to establish the agent name. When searching the database, you will need to convert the NoteID from decimal format (as reported in the NSD) to hexadecimal format (as entered into the Admin Client). In this case, within the Admin Client, you would search the database for NoteID 3C6 (that is, 966 expressed in hexadecimal).

Question 5: From the fatal call stack, can you match this against any known SPRs?
Answer: Yes: TN 1187989, SPR HRON65SL9C, using FormatFromVariant as our search criteria.
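The decimal-to-hexadecimal NoteID conversion in Question 4 can be sanity-checked with a few lines of Python (an illustrative snippet, not part of the lab tooling):

```python
# NSD prints NoteIDs in decimal; the Admin Client search dialog expects hex.
def noteid_to_hex(decimal_id):
    """Render a decimal NoteID (as printed in NSD) as hex for the Admin Client."""
    return format(decimal_id, "X")

def noteid_to_decimal(hex_id):
    """Render a hex NoteID (as typed into the Admin Client) as decimal."""
    return int(hex_id, 16)

print(noteid_to_hex(966))        # the NoteID from this scenario
print(noteid_to_decimal("3C6"))
```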
Scenario 1 Details
The note listed with H=35 (note handle), class=0200 (agent note), and opened by AGTSIGNER/AGT/ACME (the signer of the agent) is the agent note we are interested in. The note with H=16, class=0200, and opened by user name SERVER01/SVR/ACME is an instance of the same agent note, opened by the server in preparation to run the agent (either one gives you the correct note ID).
48
Virtual Thread ID
.Mapped To: PThread [ nAMgr:0b04:1288]
.. SOBJ: addr=0x014c8580, h=0xf0104013 t=c176 (BLK_SDKT)
.. SOBJ: addr=0x014deea4, h=0xf0104035 t=c275 (BLK_NSFT)
.. SOBJ: addr=0x15ce00b4, h=0xf0104037 t=c30a (BLK_LOOKUP_THREAD)
.. SOBJ: addr=0x01482b7c, h=0xf0104001 t=c130 (BLK_TLA)
.. SOBJ: addr=0x014c8884, h=0xf0104015 t=c436 (BLK_LSITLS)
.. Database: X:\Lotus\Domino\Data\Apps\SomeDB.nsf
.... DBH: 113, By: CN=SERVER01/OU=SVR/O=ACME
...... doc: HDB=113, ID=966, H=16, class=0200, flags=0000
.... DBH: 485, By: CN=AGTSIGNER/OU=AGT/O=ACME
.... DBH: 502, By: CN=AGTSIGNER/OU=AGT/O=ACME
...... doc: HDB=502, ID=966, H=35, class=0200, flags=0000
...... doc: HDB=502, ID=6370, H=40, class=0001, flags=0100
...... doc: HDB=502, ID=0, H=44, class=0001, flags=0000
.... DBH: 503, By: CN=AGTSIGNER/OU=AGT/O=ACME
...... view: hCol=505, cg=N, noteID=1194, Forecast
.... DBH: 504, By: CN=AGTSIGNER/OU=AGT/O=ACME
.. file: fd: 2252, X:\Lotus\Domino\Data\Apps\SomeDB.nsf
.. file: fd: 2780, F:\Lotus\Domino\Data\IBM_TECHNICAL_SUPPORT\console.log
Agent Note
49
Scenario 2 Answers
Files: scenario_2_nsd.log & scenario_2_memory.dmp
Problem: The Domino Server crashed.

Question 1: What is the process ID and thread ID of the crashing thread?
Answer: Process=nSERVER, Process ID=0908, Thread ID=2692

Question 2: Why did the server crash?
Answer: PANIC: Insufficient Memory (a low memory condition).

Question 3: Using the Shared Memory Pools section of NSD, what is overall Shared Memory usage as allocated by Domino? What is %used? Is this normal?
Answer: Domino shared memory usage is 988 MB (1,036,481,020 bytes), at 97% used. Yes, these numbers are both normal; shared memory is below the 1 GB threshold, and memory is well utilized at 97% (this is good). Our problem is not in shared memory.

Question 4: Using the Process Heap Memory section of NSD, what is the overall private memory usage for the process in question? What is the %used? Is this normal?
Answer: Domino private memory usage is 167 MB (175,636,480 bytes), and is 17% used. While the memory usage is not yet over the 200-300 MB threshold (the exception case for nserver), the pool utilization is VERY low, indicating a potential problem in private memory usage.

Question 5: Open the memory dump file, and examine LotusScript memory usage for the process in question. What is the overall private memory usage in bytes for LotusScript usage? Is this normal?
Answer: LotusScript total (private) memory usage is 184 MB (193,052,121 bytes). Taken by itself, this usage is still under the threshold of 200 MB. However, LotusScript memory usage usually averages much lower (in the 30 MB range), so this memory usage is abnormally high.

Question 6: Adding the amount of private memory usage from question 4 to that of question 5, how many bytes of private memory are being used by the process in question (through both the Domino MM and LotusScript MM)? Is this normal?
Answer: With both the 167 MB of Domino private memory usage and the 184 MB of LotusScript private memory usage, the Server process now totals over 351 MB of private memory usage, which is greater than the 200-300 MB range and should be investigated.
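The arithmetic behind this answer can be checked directly. A minimal sketch, using the byte counts quoted in the answers above and the informal 200-300 MB range for nserver:

```python
# Sum the two private-memory figures (Domino MM + LotusScript MM) and
# compare the total against the informal 300 MB upper bound for nserver.
domino_private_bytes = 175_636_480   # Process Heap Memory section of NSD
lotusscript_bytes    = 193_052_121   # LotusScript heap total from the memory dump

total_bytes = domino_private_bytes + lotusscript_bytes
total_mb = total_bytes / (1024 * 1024)

print(f"total private usage: {total_bytes:,} bytes (~{total_mb:.1f} MB)")
print("exceeds 300 MB threshold:", total_mb > 300)
```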
50
Scenario 2 Details
The nSERVER process crashed on insufficient memory availability. Based on the NSD (and memory dump), shared memory usage looks normal. Private memory usage, however, in the form of Domino private memory and LotusScript private memory, is too high. In addition, NSD indicates that the Domino private memory that has been allocated is not being well used (i.e., it is going to waste). Other evidence also exists within the NSD (not discussed in this documentation) that third-party components have allocated an additional 150 MB of private memory under the nSERVER process. Along with an increased number of server threads (120 threads) resulting in a larger memory footprint, the total amount of private memory usage exceeds 620 MB, which has clearly caused a problem. As we can see, no one component has caused the condition; rather, the combination of factors has. The next step would be to call such an issue into Support in order to employ additional diagnostics to determine what steps can be taken to address memory usage by each separate component (including examining server configuration and database design). For historical insight: for this issue, additional sets of data were collected (including memory dumps). This additional data showed that once the server load decreased after hours, LotusScript memory usage dropped back down to 10 MB. This indicates that the design of the agents caused the increased usage of LotusScript memory, rather than a server defect (such as a leak). Other components, such as relational database connectors, were also found to be responsible for high memory usage.
51
------ Shared OS Fields ------
Start Time      = 27/07/2004 16:19:26
Crash Time      = 02/08/2004 16:24:56
Error Message   = PANIC: Insufficient memory.   <-- Panic Message
SharedDPoolSize = 1000000
FaultRecovery   = 0x00010010
Thread [ nSERVER:0908: 485]/[ nSERVER:0908: 2692] (908/1e5/a84) caused Static Hang to be set
52
LotusScript Memory Usage for Process:
Heap 'MM Internal Heap': 143804461 bytes in use out of 176647641 bytes in 17564 allocations
Heap 'General Heap': 16 bytes in use out of 4096 bytes in 1 allocations
Heap 'LSKeyWords': 12720 bytes in use out of 16384 bytes in 390 allocations
Heap 'LSLitPool': 560 bytes in use out of 8192 bytes in 14 allocations
Heap 'LSIAdtClassTable': 10136 bytes in use out of 12288 bytes in 104 allocations
Heap 'ObjectManager': 128 bytes in use out of 4096 bytes in 2 allocations
Heap 'Dynamic Array Heap': 0 bytes in use out of 4096 bytes in 0 allocations
Heap 'Dynamic List Heap': 0 bytes in use out of 4096 bytes in 0 allocations
.
. <content removed>
.
Heap 'LSKeyWords': 12720 bytes in use out of 16384 bytes in 390 allocations
Heap 'LSLitPool': 560 bytes in use out of 8192 bytes in 14 allocations
Heap 'LSIAdtClassTable': 10136 bytes in use out of 12288 bytes in 104 allocations
Heap 'ObjectManager': 128 bytes in use out of 4096 bytes in 2 allocations
Heap 'Dynamic Array Heap': 0 bytes in use out of 4096 bytes in 0 allocations
Heap 'Dynamic List Heap': 0 bytes in use out of 4096 bytes in 0 allocations
Total Heap Usage: 151662629 bytes in use out of 193052121 bytes in 187756 allocations
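Heap lines in this format are easy to total up mechanically. A minimal sketch (the regex and two-line sample are assumptions based on the excerpt above; it simply sums "in use" and "out of" figures to compute utilization):

```python
import re

# Parse "Heap '<name>': <used> bytes in use out of <total> bytes in <n> allocations"
# lines from a LotusScript memory dump and report overall utilization.
HEAP_RE = re.compile(
    r"Heap '(?P<name>[^']+)': (?P<used>\d+) bytes in use out of "
    r"(?P<total>\d+) bytes in (?P<allocs>\d+) allocations")

def heap_utilization(dump_text):
    used = total = 0
    for m in HEAP_RE.finditer(dump_text):
        used += int(m.group("used"))
        total += int(m.group("total"))
    return used, total, (100.0 * used / total if total else 0.0)

sample = """\
Heap 'MM Internal Heap': 143804461 bytes in use out of 176647641 bytes in 17564 allocations
Heap 'LSKeyWords': 12720 bytes in use out of 16384 bytes in 390 allocations
"""
used, total, pct = heap_utilization(sample)
print(f"{used} bytes in use out of {total} bytes ({pct:.1f}% used)")
```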
53
Scenario 3 Answers
File: scenario_3_nsd.log
Problem: The Domino Server crashed.

Question 1: What is the process ID and thread ID of the crashing thread?
Answer: Process=nHTTP, Process ID=0c18, Thread ID=0bf0

Question 2: What is unusual about the call stack?
Answer: The call stack looks very short (i.e., it is truncated). This call stack provides very little information about what this thread was doing.

Question 3: Given what few clues you have from the call stack, as well as from the Resource Usage Summary for this thread, what do you think the thread might have been doing at the time of the crash?
Answer: The HTTP server crashed while running a Java agent.
(Hint 1: What is the one DLL name you can decipher in the call stack?) Answer: jvm.dll
(Hint 2: Under what circumstances would the HTTP process be executing Java?) Answer: While running a Java agent or servlet.
(Hint 3: What is the user ID associated with the documents opened by this thread?) Answer: AGTSIGNER/AGT/ACME
Scenario 3 Details
Based on the one DLL name in the stack trace summary (jvm), it appears that this thread was executing some type of Java code at the time of the crash. Since the crashing thread is in HTTP, chances are strong that this is either a Java agent or a servlet. By examining the Resource Usage for this thread, we can further narrow down the choices by noting that the user associated with open databases for this thread is AGTSIGNER/AGT/ACME. A Java servlet would open any databases with the server's ID rather than a user ID (ruling out a servlet). We can be nearly certain that this HTTP thread was executing an agent at the time of the crash (this can be confirmed with the use of other diagnostics, such as htthr.log files).
54
What next?
If this thread is executing an agent, why don't we see an open document with class=0200 under the thread ID in Resource Usage? Excellent question - for reasons too detailed to discuss here, any time a server task thread (such as HTTP) runs a Java agent, the agent note is opened on one thread, but the actual agent execution occurs on a newly spawned JVM thread. As a result, one of the other threads under the HTTP Resource Usage Summary will have the agent note open. Normally, one would need to employ additional INI settings to correlate which HTTP worker thread spawned the current Java agent, along with the name of the agent. However, if we are lucky, within a given NSD there are only a limited number of threads that have an agent note open, so we can build a list of possible candidates. As it turns out, this is exactly the case! Only two other HTTP threads have any agent notes open, and it just so happens that both of those threads have the same agent note open (NoteID=462, database=Inspired.nsf). Hence, we have luckily removed all the guesswork in determining which agent is involved in the crash. Even though we are lacking a clean call stack and have not yet isolated root cause, we have nonetheless been able to establish a surprising amount of information from this one NSD, once we know what to look for. The next steps are to examine the design of the agent, and to work on extracting a clean call stack for the crash. Clearly, these next steps require the assistance of IBM Support.
55
Scenario 3: Correlating the Data From the NSD - Fatal Call Stack:
############################################################
### FATAL THREAD 65/67 [ nHTTP:0c18:0bf0]
### FP=0x5b75fc8c, PC=0x746e656d, SP=0x5b75fc54, stksize=56
### EAX=0x2d5e1340, EBX=0x00000001, ECX=0x2d5e1340, EDX=0x6276b2e0
### ESI=0x5b75fd08, EDI=0x1abb3010, CS=0x0000001b, SS=0x00000023
### DS=0x00000023, ES=0x00000023, FS=0x00000038, GS=0x00000000 Flags=0x00010202
Exception code: c0000005 (ACCESS_VIOLATION)
############################################################
[ 1] 0x746e656d (1abb3010,5b75fd08,3f1,0)
[ 2] 0x71d0f342 jvm (1abb3010,626b1a00,5b75fd08,76cadc4)

(Annotation: the jvm DLL name indicates this thread is likely executing Java of some kind.)
56
Only two HTTP threads have an agent note open, and both have the same agent note open.
** VThread [ nHTTP:0c18:000d]
.Mapped To: PThread [ nHTTP:0c18:0e58]
.. SOBJ: addr=0x0f312278, h=0xf01042ad t=c275 (BLK_NSFT)
.. SOBJ: addr=0x104409f4, h=0xf01043f8 t=c30a (BLK_LOOKUP_THREAD)
.. SOBJ: addr=0x015de788, h=0xf010417c t=c130 (BLK_TLA)
.. SOBJ: addr=0x0f3903bc, h=0xf01044a4 t=c436 (BLK_LSITLS)
.. Database: F:\Lotus\Domino\Data\Inspired.nsf
.... DBH: 597, By: Anonymous
...... doc: HDB=597, ID=462, H=3861, class=0200, flags=0001
...... doc: HDB=597, ID=0, H=2390, class=0001, flags=0000
.... DBH: 1082, By: CN=AGTSIGNER/OU=AGT/O=ACME
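The candidate-building technique described in this scenario (scan Resource Usage entries for open documents of class 0200, i.e. design/agent notes) can be sketched as a simple text filter. The regex and two sample lines below are assumptions modeled on the excerpt above:

```python
import re

# Find open documents of class 0200 (agent/design notes) in Resource Usage
# output, collecting (database handle, NoteID) pairs as candidates.
DOC_RE = re.compile(r"doc: HDB=(\d+), ID=(\d+), H=(\d+), class=0200")

resource_usage = """\
...... doc: HDB=597, ID=462, H=3861, class=0200, flags=0001
...... doc: HDB=597, ID=0, H=2390, class=0001, flags=0000
"""

agent_notes = [(hdb, note_id) for hdb, note_id, _h in DOC_RE.findall(resource_usage)]
print(agent_notes)
```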
57
Scenario 4 Answers
Files: scenario_4_nsd.log, scenario_4_semdebug.txt, scenario_4_console.log
Problem: HTTP is non-responsive.

Question 1: Examining semdebug.txt, what semaphore is the point of contention? (Hint: look at the semdebug.txt file as a whole)
Answer: The FRWSEM 0x030B Collection semaphore.

Question 2: At what time does the semaphore contention begin?
Answer: 08/02/2005 11:27:22:09 AM EDT. The previous semaphore timeouts occur 15 minutes prior, and do not implicate the Collection semaphore.

Question 3: Based on the ownership of this semaphore, does it look like the server is hung?
Answer: No. The timeouts are not continuous (there are breaks of a few seconds to a few minutes between timeouts), and the owner of the Collection semaphore changes at least every 2 minutes. If the server were hung, the timeouts would be continuous over long periods of time (say 30-60 minutes) with no change in ownership of the semaphore.

Question 4: What is the timeframe during which the call stacks in the NSD are dumped? (Hint: look for DBG() to determine the timestamps in NSD)
Answer: The call stacks were dumped between 11:33:21 AM and 11:33:48 AM. This provides a 27-second window that can be compared to semdebug.txt.

Question 5: During the window of time in question 4, which thread owns the semaphore in question? As far as you can tell, how long did this thread have the semaphore locked? What was this thread doing during that time?
Answer: PID=0F9C, TID=0FE4. This thread had the Collection semaphore locked from 11:33:10:49 AM EDT to 11:34:04:15 AM EDT, or about 54 seconds. The thread was performing a DBLookup on the view, which means it locked the semaphore for a read.

Question 6: Using the Memcheck portion of NSD, what database and view was experiencing the Collection semaphore contention?
Answer: Database=profiledb.nsf, view=(UserID)|wUserID
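The "about 54 seconds" figure in Question 5 comes from subtracting the two semdebug timestamps. A minimal sketch of that computation (the timestamp layout "MM/DD/YYYY HH:MM:SS:hh AM/PM TZ", with hh as hundredths of a second, is an assumption based on the excerpts in this scenario):

```python
from datetime import datetime, timedelta

# Parse a semdebug-style timestamp and compute how long the semaphore was held.
def parse_semdebug_ts(ts):
    date_part, time_part, ampm, _tz = ts.split()
    h, m, s, hundredths = time_part.split(":")
    base = datetime.strptime(f"{date_part} {h}:{m}:{s} {ampm}",
                             "%m/%d/%Y %I:%M:%S %p")
    return base + timedelta(milliseconds=int(hundredths) * 10)

locked   = parse_semdebug_ts("08/02/2005 11:33:10:49 AM EDT")
released = parse_semdebug_ts("08/02/2005 11:34:04:15 AM EDT")
held = (released - locked).total_seconds()
print(f"semaphore held for ~{held:.1f} seconds")
```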
58
Question 7: At the time of the NSD, how many readers and waiters were there for this view? How many waiters are waiting for a read versus a write? What are the PID and TID of the first writer waiting for the semaphore?
Answer: 27 readers and 13 waiters. Of the waiters, 2 are writers and 11 are readers. The first thread waiting for a write is nHTTP:0f9c:108c.

Question 8: If you had to guess, what would you say is the problem? What would be your next step?
Answer: There are frequent updates to a view that is experiencing a heavy throughput of readers. The next step is to look at the database design (see the side note below).
Scenario 4 Details
Based on the pattern of the waiters in the NSD, it appears that writer activity is sprinkled in between heavy reader activity (20+ readers). Based on the call stacks for the various view readers, there are a large number of DBLookups being performed on the view in question. There is probably a design issue in the application, with too many DBLookups being performed on a large view intermixed with frequent updates. In order to confirm a consistent pattern to the reader and writer activity, you will need to collect multiple sets of data (NSD and semdebug) over different slowdowns. The next step is to examine the design of the view to determine how long it may take to update, and to investigate how pervasive the lookups are to this view. As a sanity check, you could investigate whether all activity on the Web Server is slow, or only pages that interact with this view (either directly or indirectly). You will probably find the latter. In this particular case, it turns out the customer overloaded the view with too many lookups, in addition to the fact that they were frequently adding documents to the view. The recommendation was to reduce the number of lookups in the application, split the lookups across multiple views to reduce contention, and batch the addition of documents on a more scheduled basis, say once every hour or so instead of once every two or three minutes. These types of performance issues could and should be identified with stress testing prior to a rollout into production.

Side Note: The review of the application is an investigation that needs to be led primarily by the creator and maintainer of the application (that's you), with direction and assistance provided by Support. In this case, based on the data so far, there is no evidence of a server defect.
59
Change in ownership
08/02/2005 11:33:10:49 AM EDT sq="00008BFA" THREAD [0F9C:000D-108C] WAITING FOR FRWSEM 0x030B Collection semaphore (@02079108) (R=30,W=0,WRITER=0000:0000,1STREADER=0F9C:0FE4) FOR 30000 ms
08/02/2005 11:33:10:52 AM EDT sq="00008BFB" THREAD [0F9C:023A-035C] WAITING FOR FRWSEM 0x030B Collection semaphore (@02079108) (R=30,W=0,WRITER=0000:0000,1STREADER=0F9C:0FE4) FOR 30000 ms
08/02/2005 11:33:10:52 AM EDT sq="00008BFE" THREAD [0F9C:000C-0A40] WAITING FOR FRWSEM 0x030B Collection semaphore (@02079108) (R=30,W=0,WRITER=0000:0000,1STREADER=0F9C:0FE4) FOR 30000 ms . . . 08/02/2005 11:33:24:25 AM EDT sq="00008C06" THREAD [0F9C:0023-0FAC] WAITING FOR FRWSEM 0x030B Collection semaphore (@02079108) (R=28,W=0,WRITER=0000:0000,1STREADER=0F9C:0FE4) FOR 30000 ms
08/02/2005 11:33:34:15 AM EDT sq="00008C0C" THREAD [0F9C:000E-0FF8] WAITING FOR FRWSEM 0x030B Collection semaphore (@02079108) (R=27,W=0,WRITER=0000:0000,1STREADER=0F9C:0FE4) FOR 30000 ms
08/02/2005 11:33:40:49 AM EDT sq="00008C2F" THREAD [0F9C:000D-108C] WAITING FOR FRWSEM 0x030B Collection semaphore (@02079108) (R=27,W=0,WRITER=0000:0000,1STREADER=0F9C:0FE4) FOR 30000 ms
Window of time during which NSD dumped call stacks (see next page)
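The reader/waiter counts used in this scenario are embedded in each semdebug wait line. A small sketch of pulling those fields out with a regular expression (the pattern is an assumption modeled on the lines above):

```python
import re

# One FRWSEM wait line from semdebug.txt, as excerpted in this scenario.
LINE = ('08/02/2005 11:33:10:49 AM EDT sq="00008BFA" THREAD [0F9C:000D-108C] '
        'WAITING FOR FRWSEM 0x030B Collection semaphore (@02079108) '
        '(R=30,W=0,WRITER=0000:0000,1STREADER=0F9C:0FE4) FOR 30000 ms')

# Extract the reader count, writer count, current writer, and first reader.
FIELDS_RE = re.compile(
    r"\(R=(?P<readers>\d+),W=(?P<writers>\d+),"
    r"WRITER=(?P<writer>[0-9A-F]+:[0-9A-F]+),"
    r"1STREADER=(?P<first_reader>[0-9A-F]+:[0-9A-F]+)\)")

m = FIELDS_RE.search(LINE)
print("readers:", m.group("readers"))
print("first reader (PID:TID):", m.group("first_reader"))
```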
60
From scenario_4_nsd.log:
DBG(134c) 11:33:21
@@@@@@@@@@@@@@@@@ Process Table @@@@@@@@@@@@@@@@@
INFO    PID   PPID  UID    STIME           COMMAND
        0000  0000  0      ???             []
        0008  0000  65535  ???             [       ?:0008]
        0148  0008  0      07/27 11:37:19  [    smss:0148]
        0164  0148  0      07/27 11:37:21  [   csrss:0164]
        0160  0148  0      07/27 11:37:22  [winlogon:0160]
        0198  0160  0      07/27 11:37:22  [services:0198]
*       0334  0198  0      07/27 11:37:26  [nservice:0334]
        044c  0334  0      07/27 11:37:28  [    java:044c]
        076c  044c  0      07/27 11:37:37  [    java:076c]
->      116c  044c  0      08/02 10:44:55  [ nserver:116c]
->      0750  116c  0      08/02 10:50:34  [  nevent:0750]
Window of time during which NSD dumped call stacks (used to correlate to semdebug.txt), as labeled in the Original Format
memcheck -k cur -p 0x116c -p 0x750 -p 0xf9c -p 0x14a4 -p 0x370 -p 0xecc -p 0x11bc -p 0x11b8 -p 0x11c8 -p 0xf60 -p 0x358 -p 0x120c -p 0x125c -p 0x1160 -p 0x12a4 -p 0x1134 -p 0x1100 -p 0x1170 -p 0x928 -p 0x118c -p 0x12e0 -p 0xfdc -p 0x858 -d err -o
61
62
First Waiter mode=W indicates the first waiter is waiting to lock the semaphore for a write. (Note the interleaved nature of the waiting readers and writers.)
63
Scenario 5 Answers
File: scenario_5_nsd.log
Problem: The Domino Server crashed.

Question 1: What is the crashing process and thread ID?
Answer: Process Name=nserver, Process ID=09d8, Thread ID=0d30

Question 2: What was this server doing when it crashed?
Answer: Allocating memory in an attempt to create a network buffer in order to service a client request. The key here is that we crashed trying to allocate memory. We don't see a PANIC message to this effect, so some investigation is required to see if it really is a low memory condition.

Question 3: Based on the answer to question 2, where would you go next?
Answer: Shared Memory Usage, Top 10, and if needed, Private Memory Usage for nserver.exe.

Question 4: Who is the largest user of shared memory, and how much are they using? Is this normal?
Answer: BLK_OPENED_NOTE is using over 600 MB, more than the UBM. This is not good, and is certainly not normal!

Question 5: Why is there so much memory allocated for BLK_OPENED_NOTE? (Hint: Look in the Open Documents section)
Answer: While there are not a large number of documents open (only around 60 notes), several dozen of these notes are VERY large, at 24 MB apiece, which is resulting in high memory usage. It only takes a few of these to kill the server. Notice that all of these documents are in various mail files.

Question 6: Which user has these problematic documents opened?
Answer: The BES (BlackBerry Enterprise Server) server - CN=BES/O=ACME/C=US
64
Question 7: What do you think is going on here? Does this appear to be a server defect?
Answer: It looks like a large document was sent out as an e-mail distribution. The BES server has opened these rather large documents to scan them, inflating memory usage. This is not a server defect.

Question 8: Can you recommend a course of action to alleviate this problem?
Answer: Yes. Use SERVER_MAX_NOTEOPEN_MEMORY_MB to restrict the memory usage that results from the server opening documents on behalf of clients.
Scenario 5 Details
In this case, it appears that a large document with several embedded images was sent out as a mass mailing, where the various embedded images give the document an effective size of 24 MB. The BES server, in its attempt to keep all the end users up to date on their e-mail, has opened these rather large documents to scan them, which is ballooning memory usage. If you open enough of these documents at one time, memory usage will hit critical mass, causing a crash. It's likely that, given normal user activity alone, memory usage would not have been an issue, since the transactions of opening the large documents would have been relatively short and randomized. However, the BES server is scanning all these notes at once, and since they are large, they remain open in memory a bit longer than normal (even though they are fast opens). The remote BES server compounds the problem. This behavior is not a server defect per se, since there is no leak or mismanagement of open documents. The same effect could be accomplished by a poorly written agent that opens many documents at once. However, this behavior does emphasize a scaling issue with Domino in a 32-bit environment. This scaling issue is not restricted to Domino; many enterprise applications have encountered limitations in the 32-bit address space, especially on platforms such as W32, which only provides a 2 GB user address space by default.
65
The ultimate solution is to move to a 64-bit model, which greatly expands the process address space. However, this move is not trivial; as it stands, the Domino 64-bit version is still under development. The short-term solution is to use the INI parameter designed (as a stop-gap measure) for just this problem. See DCF document #1198511, titled "Can the number of BLK_OPENED_NOTE blocks be limited?", which discusses the use of the INI parameter SERVER_MAX_NOTEOPEN_MEMORY_MB. In this case, the server crashes when BLK_OPENED_NOTE memory climbs above a few hundred MB, so we can set the value as follows:

SERVER_MAX_NOTEOPEN_MEMORY_MB=150

This will limit the number of these large documents (should they occur) to no more than about six open at a time. However, this may affect performance, since we force the other open-note transactions to wait until memory usage drops sufficiently.

Warning: As stated above, this parameter may affect performance. In addition, this parameter only addresses the case where the Server opens a note on behalf of a remote native Notes Client (NRPC). This INI parameter will not be effective for notes that are opened by the server for its own purposes (such as agents), and will not affect notes opened via other protocols, such as HTTP, IMAP, SMTP, etc. This is a fun one, eh?

See the following Knowledge Base article for more details: #1198511 "Can the number of BLK_OPENED_NOTE blocks be limited?"
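The "about six at a time" figure follows directly from the cap and the document size quoted in this scenario:

```python
# With a 150 MB BLK_OPENED_NOTE cap and 24 MB documents, only about six of
# these large documents can be open via NRPC at once before opens must wait.
cap_mb = 150   # SERVER_MAX_NOTEOPEN_MEMORY_MB setting from this scenario
doc_mb = 24    # effective size of each large mail document

max_docs = cap_mb // doc_mb
print(f"at most ~{max_docs} large documents open at a time")
```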
66
Utilization is good - Top 10 will show where all this memory is going
67
BLK_OPENED_NOTE is the largest user (larger than the UBM, which is very unusual)
Mail File
68