Solaris System Performance Management - SA-400 - Student Guide
Performance Management
SA-400
Student Guide
Sun Microsystems
500 Eldorado Boulevard
MS: BRM01-209
Broomfield, Colorado 80021
U.S.A.
Contents
About This Course................................................................................... xvii
Course Overview ........................................................................... xviii
Course Map........................................................................................ xix
Course Objectives........................................................................... xxvi
Skills Gained by Module.............................................................. xxvii
Guidelines for Module Pacing ................................................... xxviii
Topics Not Covered........................................................................ xxix
How Prepared Are You?................................................................ xxxi
Introductions .................................................................................. xxxii
How to Use the Course Materials.............................................. xxxiii
Course Icons and Typographical Conventions ......................... xxxv
Icons .........................................................................................xxxv
Typographical Conventions ............................................... xxxvi
Introduction to Performance Management...........................................1-1
Relevance............................................................................................ 1-2
What Is Tuning? ................................................................................ 1-4
Basic Tuning Procedure ................................................................... 1-5
Conceptual Model of Performance................................................. 1-6
Where to Tune First .......................................................................... 1-8
Where to Tune? ................................................................................. 1-9
Performance Measurement Terminology.................................... 1-10
Performance Graphs....................................................................... 1-12
The Tuning Cycle ............................................................................ 1-15
SunOS Components........................................................................ 1-16
Check Your Progress ...................................................................... 1-17
Think Beyond .................................................................................. 1-18
System Monitoring Tools.........................................................................2-1
Relevance............................................................................................ 2-2
How Data Is Collected...................................................................... 2-3
The kstat........................................................................................... 2-4
How the Tools Process the Data ..................................................... 2-5
Monitoring Tools Provided With the Solaris OS.......................... 2-6
sar....................................................................................................... 2-7
vmstat ................................................................................................ 2-8
iostat ................................................................................................ 2-9
mpstat .............................................................................................. 2-10
netstat ............................................................................................ 2-11
nfsstat ............................................................................................ 2-12
SyMON System Monitor................................................................ 2-13
Other Utilities .................................................................................. 2-14
memtool ............................................................................................ 2-15
/usr/proc/bin............................................................................... 2-17
The Process Manager...................................................................... 2-18
The SE Toolkit.................................................................................. 2-20
System Performance Monitoring With the SE Toolkit .............. 2-21
SE Toolkit Example Tools .............................................................. 2-22
System Accounting ......................................................................... 2-23
Viewing Tunable Parameters ........................................................ 2-24
Setting Tunable Parameters........................................................... 2-25
/etc/system ................................................................................... 2-26
Check Your Progress ...................................................................... 2-27
Think Beyond .................................................................................. 2-28
System Monitoring Tools...................................................................... L2-1
Tasks ................................................................................................ L2-2
Using the Tools....................................................................... L2-2
Enabling Accounting ............................................................. L2-3
Installing the SE Toolkit ........................................................ L2-3
Installing memtool................................................................ L2-4
Processes and Threads ..............................................................................3-1
Relevance............................................................................................ 3-2
Process Lifetime ................................................................................ 3-3
fork With exec Example................................................................. 3-4
Process Performance Issues ............................................................. 3-5
Process Lifetime Performance Data................................................ 3-6
Process-Related Tunable Parameters ............................................. 3-8
Multithreading ................................................................................ 3-10
Application Threads ....................................................................... 3-12
Process Execution............................................................................ 3-13
Thread Execution ............................................................................ 3-14
Process Thread Examples .............................................................. 3-15
Performance Issues ......................................................................... 3-16
Thread and Process Performance ................................................. 3-17
Locking ............................................................................................. 3-18
Locking Problems ........................................................................... 3-20
The lockstat Command .............................................................. 3-21
Interrupt Levels ............................................................................... 3-22
The clock Routine.......................................................................... 3-23
Course Goal
This course provides an introduction to performance tuning for larger
Sun™ environments. It covers most of the major areas of system
performance, concentrating on physical memory and input/output
(I/O) management.
Course Overview
The following course map enables you to see the general topics and
the modules for that topic area in reference to the course goal:
Overview
● Introduction to Performance Management
Tools
● System Monitoring Tools
CPU
● Processes and Threads
● CPU Scheduling
Memory
● System Caches
● Memory Tuning
I/O
● System Buses
● I/O Tuning
● UFS Tuning
Networks
● Network Tuning
Summary
● Performance Tuning Summary
This module shows how buses operate, what kinds of buses there
are in the system, and what bus characteristics are important to
tuning.
The tuning required for a network, both the physical network and
the network applications, is covered in this module. It discusses
tuning the Transmission Control Protocol/Internet Protocol
(TCP/IP) stack and NFS™.
● Appendix C – “Accounting”
While not a direct part of the course, the setting of the Interprocess
Communication (IPC) shared memory and semaphore parameters
is important. This appendix explains all of the IPC tunable
parameters.
The Skills Gained by Module matrix in the overhead maps each skill to
the modules (1 through 11) in which it is gained.
This course does not cover the topics shown on the above overhead.
Many of these topics are covered in other courses offered by Sun
Educational Services.
To be sure you are prepared to take this course, can you answer yes to
the questions shown on the above overhead?
Now that you have been introduced to the course, introduce yourselves
to one another and to the instructor, addressing the items shown on the
above overhead.
Icons
Additional resources – Indicates additional reference materials are
available.
Warning – Anything that poses personal danger or irreversible
damage to data or the operating system.
Courier bold is used for characters and numbers that you type. For
example:
system% su
Password:
Palatino italics is used for book titles, new words or terms, or words
that are emphasized.
Objectives
Upon completion of this module, you should be able to:
Relevance
Additional Resources
Additional resources – The following references can provide
additional details on the topics discussed in this module:
● Jain, Raj. 1991. The Art of Computer System Performance Analysis.
John Wiley & Sons, Inc.
● Ridge, Peter M., and David Deming. 1995. The Book of SCSI. No
Starch Press.
What Is Tuning?
Almost every tuning activity can be placed into one of these two
categories.
Note that you will never be finished tuning: there will always be a
bottleneck. If there were none, infinite work could be completed in
zero time. Tuning is a compromise between cost (the adding of extra
hardware) and performance (the ability to get the system’s work done
efficiently).
There are many factors that make a system perform the way it does.
These include:
● A random component to the work; that is, how many queries are
executed, how many records are looked at, how many users are
supported, how much email is generated, how big files are, what
the file locations are, and so on. The exact same amount and type
of work is rarely done repeatedly on a system.
Take a moment to prioritize the areas in the overhead image for tuning
purposes from 1 (most improvement possible) to 4 (least
improvement).
Where to Tune?
As shown on page 1-8, the closer your tuning is to where the work is
performed, the more effect it will have. As a system administrator, you
will find access to the first two levels difficult, so the tendency is to
concentrate on what is accessible.
● Bandwidth
▼ What you really get; that is, how much of the bandwidth is
actually used
● Response time
● Service time
● Utilization
Performance Graphs
The three graphs on this page show the most common appearance of
data obtained from performance measurements.
The first two graphs indicate a problem, but not its source. You can see
that the work performed, measured either as throughput or response
time, improves to a certain level, and then begins to drop off. The
drop-off may occur quickly or slowly, but the system is no longer able
to efficiently perform more work. A resource constraint has been
reached.
Most tuning tasks begin with the complaint, “The system’s too slow.”
Once you have confirmed the problem, you must gather relevant
system performance information. It may not be possible to quickly
identify the source of the problem. In this case, general information
must be gathered and analyzed to locate likely causes.
Once this has been done, various solutions must be evaluated and
chosen for implementation. Some may be as simple as changing a
tuning parameter, some may require recompiling the application, and
others may require the purchase of additional hardware.
Once you have identified the source of the problem, develop and
implement a plan to remove the cause. Make sure that you only fix
what is reasonable to fix; requests to software vendors to redesign
their product are usually futile.
Finally, after a change, make sure that you continue to monitor the
system. You need to ensure that performance is now “satisfactory,” as
well as check to see that a new bottleneck (and there will always be
one) is not causing new problems in another part of the system.
It has been said that tuning is like peeling an onion: there is always
another layer and more tears. As shown in the above overhead, the
tuning cycle is continuous.
For example, fixing a CPU problem may reveal (or create) an I/O
problem. Fixing that may reveal a new CPU problem, or perhaps a
memory problem. You always move from one bottleneck to another,
until the system is (for the moment) performing acceptably.
SunOS Components
Before continuing on to the next module, check that you are able to
accomplish or answer the following:
Objectives
Upon completion of this module, you should be able to:
Relevance
Additional Resources
Additional resources – The following references can provide
additional details on the topics discussed in this module:
● Man pages for the various commands (kstat, sysdef, adb, ndd)
and performance tools (sar, vmstat, iostat, mpstat, nfsstat,
netstat).
The first output line reported by the Solaris tools is calculated from the
counters’ accumulated values since boot. This first line should usually
be ignored: it is an average over the entire time since boot, not a
reliable picture of current behavior. Subsequent lines are accurate; each
is computed by subtracting the previous interval’s values.
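For example, if you run vmstat with an interval and a count (the values
here are arbitrary), you simply discard the first line yourself:
# vmstat 30 5
The first line printed is the average since boot; the remaining four lines
each report activity during one 30-second interval.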
The kstat
The kstats all have names, and can be read either by name or by
following a chain. There is a series of kstat manipulation library
functions (such as kstat_lookup); they are all documented in the man
pages.
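One quick way to see the raw kstat data from the command line is the
undocumented -k option of netstat (this is an assumption worth
verifying on your release; the option is a convenience, not a supported
interface):
# netstat -k | more
This dumps every named kstat and its counters; the statistics reported
by sar, vmstat, iostat, and mpstat are drawn from these structures.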
When processing this data, you can specify intervals that are multiples
of the tool’s internal recording interval. Be careful when specifying a
large interval: spikes and other unusual readings can be smoothed out
and lost, causing you to see no problems. When shorter intervals are
examined, such problems become visible.
When more than one tool reports similar data, a table comparing the
data produced by the different tools will be provided.
sar
The sar data is usually recorded to a file, then processed later due to
the large amount of available data.
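A typical pattern (the file name and sampling intervals below are
arbitrary) is to record binary sar data with the -o option and generate
reports from the file later with the -f option:
# sar -o /tmp/sar.data 60 120 &
# sar -u -f /tmp/sar.data
The first command collects 120 one-minute samples in the background;
the second reports CPU utilization from the recorded file.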
vmstat
As will be discussed later, paging I/O has several distinct uses. These
are reported by the memstat tool, whose function is merged into
vmstat (the -p option) in the Solaris 7 OS.
iostat
If you are using the Volume Manager, iostat may not show you
enough information about the volumes. You will need to use the
vxstat command, which is provided with Volume Manager.
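For example (the interval is arbitrary and datadg is a placeholder disk
group name):
# iostat -xn 30
# vxstat -g datadg -i 30
The -x option of iostat gives extended per-device statistics, and -n
(available in the Solaris 2.6 OS and later) reports devices by their
logical cNtNdN names; vxstat reports the equivalent per-volume detail.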
mpstat
The data from mpstat are not usually needed during routine tuning,
but can be very helpful if there is unusual behavior or the other
monitoring tools show no indication of the source of a problem.
Output from the mpstat report displays one line for each CPU in the
system.
netstat
The -s option shows extremely detailed counts for very low-level TCP,
User Datagram Protocol (UDP), and other protocol messages. This
allows for quick identification of problems or unusual usage patterns.
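For example, you can pull individual counters out of the -s output
with grep; the TCP retransmission counters are often a good first check:
# netstat -s | grep -i retrans
A steadily climbing tcpRetransSegs count usually points at a network
problem rather than a host problem.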
nfsstat
The nfsstat (NFS statistics) command has six options that provide
detailed information on a system’s NFS activity. It provides both server
and client information, which can be used to locate the source of a
problem. Because remote procedure call (RPC) information is provided,
it can also report on the network activity of some database
management systems (DBMSs).
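For example:
# nfsstat -s
# nfsstat -c
# nfsstat -m
The -s option reports server-side NFS and RPC statistics, -c reports the
client side, and -m reports per-mounted-file-system statistics, including
retransmission counts and smoothed round-trip times.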
Other Utilities
● memtool
● /usr/proc/bin
● Process Manager
● The SE Toolkit
● System accounting
memtool
/usr/proc/bin
● CPU usage
● I/O usage
● Memory usage
● Process credentials
● Scheduling parameters
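A few of the proc tools, run here against a hypothetical process ID of
4321 and a hypothetical program ./myprog, illustrate this data:
# /usr/proc/bin/ptree 4321
# /usr/proc/bin/pmap 4321
# /usr/proc/bin/pfiles 4321
# /usr/proc/bin/pcred 4321
# /usr/proc/bin/ptime ./myprog
ptree shows the process’s ancestry, pmap its address space (memory
usage), pfiles its open files, and pcred its credentials; ptime runs a
command and reports accurate user, system, and elapsed times using
microstate accounting.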
The SE Toolkit
Once installed, the tools can be configured to offer notices and repair
suggestions as they detect problems. If run by root, one of the utilities
(virtual_adrian.se) will actually change some of the system tuning
parameters as conditions require.
One of the SE programs is a GUI front end which displays the state of
your system. The program is called zoom.se; a typical display is shown
in the overhead image.
The SE Toolkit has several tools. Source code for each of the tools is
provided. Table 2-2 lists some of the example tools by function.
Table 2-2 SE Programs Grouped by Function
System Accounting
Viewing Tunable Parameters
There are several utilities that provide the ability to look at the current
values of system tuning parameters.
The sysdef command, near the end of its output, provides a list of
approximately 60 of the most commonly used tuning parameters.
The kernel debugger, adb, enables you to look at (and alter) the values
of the tuning parameters. While not very user friendly, it and the
similar crash utility allow the inspection of the running system.
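For example (the values shown are illustrative, not from a real system),
you can display a tunable such as maxusers on the running kernel:
# adb -k /dev/ksyms /dev/mem
physmem 7a97
maxusers/D
maxusers:
maxusers:       249
$q
The /D format prints the variable as a decimal integer. Writing to a
variable is also possible, but should be done only when you are certain
of the effect.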
The ndd utility enables you to look at and change the settings of
network device driver and protocol stack parameters. Most of these
parameters are undocumented, so caution should be used when
changing them. Changes to these parameters may also be made in
/etc/system.
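For example (the default value shown is illustrative), ndd can display
or set a TCP parameter on the running system; the change lasts only
until the next reboot:
# ndd /dev/tcp tcp_conn_req_max_q
128
# ndd -set /dev/tcp tcp_conn_req_max_q 256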
Changes may also be made to a running system using adb. Not all
parameters can be changed this way, however. If a parameter is used
at boot time to specify the size of a table, for instance, changing it after
the boot will have no effect. Without a knowledge of system internals,
it can be difficult to determine whether the change will be seen.
/etc/system
The /etc/system file can be used to set the values for kernel-resident
tuning and option parameters.
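A minimal sketch of /etc/system entries follows; the values are
examples only, not recommendations, and changes do not take effect
until the next reboot:
* Lines beginning with an asterisk are comments.
set maxusers=512
set pt_cnt=256
set shmsys:shminfo_shmmax=0x20000000
The last line shows the module:variable form used for loadable-module
parameters such as the System V IPC shared memory settings.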
Before continuing on to the next module, check that you are able to
accomplish or answer the following:
Objectives
Upon completion of this lab, you should be able to:
● Install memtool
Tasks
3. List which tool(s) you would use to look at the following data
elements:
4. Using the man pages for those tools, determine which option(s)
you would need for each data type.
Enabling Accounting
Complete these steps:
crontab -e sys
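The default sys crontab ships with the system activity data collection
entries commented out; uncommenting lines such as the following
enables regular sar data collection:
0 * * * 0-6 /usr/lib/sa/sa1
20,40 8-17 * * 1-5 /usr/lib/sa/sa1
5 18 * * 1-5 /usr/lib/sa/sa2 -s 8:00 -e 18:01 -i 1200 -A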
3. Add the three packages now found in the directory using the
following command. Respond y to all prompts about commands
to be started at boot time.
pkgadd -d .
/opt/RMCmem/drv/bunyipload
Objectives
Upon completion of this module, you should be able to:
Relevance
Additional Resources
Additional resources – The following references can provide
additional details on the topics discussed in this module:
Process Lifetime
All processes in the system, except for sched (process 0), are created
by the fork system call. The fork call makes an exact copy of the
calling process (the parent) in a new virtual address space; the copy is
the child.
The child then usually issues an exec system call, which replaces the
newly created process address space with a new executable file. The
child continues to run until it completes its job. It then issues an exit
system call, which returns the child’s resources to the system.
While the child is running, the parent either waits for the child to
complete, by issuing a wait system call, or continues to execute. If it
continues, it can check occasionally for the child’s completion, or be
notified by the system when the child exits.
The example in the above overhead shows how a shell would process
an input command. After processing the command, the shell issues a
fork to create a copy of itself. The copy then issues an exec call to run
the requested command.
The shell then either waits for the command to finish, or if background
execution was requested (with an &), issues the next prompt and waits
for another input command.
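You can watch this from any shell. In the hypothetical session below,
the shell forks a copy of itself, the child execs sleep, and the parent
shell then waits for the child to exit (the process ID is arbitrary):
$ sleep 60 &
2143
$ wait 2143
$
The second prompt does not return until the child has issued its exit
call.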
You can also save memory by remembering that executables which use
shared libraries take fewer system resources (memory, swap space, and
so on) to run. This is discussed in Module 5.
● fork/s – The number of fork system calls per second (about 0.5
per second on a four- to six-user system). This number will
increase if shell scripts are running.
The problem of scaling the size of several kernel tables and buffers is
addressed by creating a single sizing variable. The variable name,
maxusers, was related to the number of time-sharing terminal users
the system could support. Today, there is no direct relationship
between the number of users a system supports and the value of
maxusers. Increases in memory size and complexity of applications
require much larger kernel tables and buffers for the same number of
users. The maxusers setting in the Solaris 2.x OS is automatically set
via the physmem variable. The minimum limit is 8 and maximum is
1024, corresponding to systems with 1 Gbyte or more of random
access memory (RAM). maxusers can be set manually in
/etc/system, but manual setting is checked and limited to a
maximum of 2048.
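A minimal sketch (the value is illustrative, not a recommendation): add
the following line to /etc/system and reboot. The resulting table sizes
can then be checked with sysdef.
set maxusers=512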
Multithreading
Testing has shown that the creation of a process takes 33 times more
CPU time than creating a new thread.
When using threads, a user process still has only one executable. Each
thread has its own stack, but they share the same executable file.
Application Threads
Each thread has its own stack and some thread local storage. All of the
address space is visible to all of the threads. Unless programmed that
way, threads are not aware that there are other threads in the process.
Much of the state of the process is shared by all of the threads, such as
open files, the current directory, and signal handlers.
Process Execution
As CPUs become free, kernel threads are taken from the dispatch
queue and run.
Thread Execution
This model has two symmetrical levels: a dispatch queue and "CPUs"
for the application threads, and dispatch queues and real CPUs for the
kernel threads. Since the number of LWPs is under programmer
control, programmers can control the concurrency level of the threads
in their application.
The middle layer is in the kernel, which includes the LWP and the
kernel thread structures. The kernel thread is the entity which is
scheduled by the kernel.
The bottom layer represents the hardware; that is, the physical CPUs.
The kernel threads that are not attached to an LWP represent system
daemons such as the page daemon, the callout thread, and the clock
thread. These kernel threads are known as system threads.
Performance Issues
The table shows the creation times for unbound threads, bound
threads, and processes. It measures the time consumed to create a
thread using the default stack that is cached by the threads package. It
does not include the initial context switch to the thread.
The ratio column gives the ratio of the second and third rows with
respect to the first row (the creation time of unbound threads).
Locking
Locks must be used only when threads share writable data. Often there
is no need to lock data just to read it, which can improve performance,
although a lock is still required to write the data. There are several
different locking mechanisms available to handle the different locking
situations.
▼ Mutexes
▼ Condition variables
A bad locking design, for example, protecting too much data with one
lock, can cause performance problems by serializing the threads for
too long.
Locking Problems
However, any user can use the lockstat command to identify the
source of a run-time locking problem.
When run like truss, the lockstat command runs the specified
command and monitors it for lock usage. It monitors locks used by the
program within its own process (for example, with multithreading), as
well as locks obtained for the process in the kernel. These locks are
acquired during processing of a system call, for instance.
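For example (the command being monitored is arbitrary), the simplest
invocation gives a kernel lock profile for the duration of the command:
# lockstat sleep 10
Statistics for the adaptive mutexes, spin locks, and reader/writer locks
acquired during those 10 seconds are printed when sleep exits.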
Large counts (indv) combined with slow locking times (nsec), or very
large locking times, can identify the source of the problem. While
locking problems in the kernel must be fixed by the vendor, the lock
name is usually prefixed with the module name. With the module
name, known locking problems in the module can be searched for, and
perhaps a patch identified.
Interrupt Levels
Just as there are priorities for threads in the system to access the CPU
and for I/O devices, interrupts coming into the system also have
priorities.
The clock executes at interrupt level 10, a very high priority. There is
one system thread per system that manages the clock, which is
associated with a particular CPU.
Each time the clock routine executes is a tick. For SPARC processors
(except sun4c systems), there are 100 ticks per second, as shown by the
vmstat -i output below. This can be changed to 1000 ticks per second
by requesting the high-resolution rate, which is used primarily for
real-time processing. This is discussed in more detail in Module 4.
# vmstat -i
interrupt total rate
--------------------------------
clock 85912634 100
zsc0 3197273 3
zsc1 6013844 6
hmec0 5952964 6
fdc0 10 0
--------------------------------
Total 101076725 117
#
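If the high-resolution rate is required, it is requested in /etc/system
(this entry is a sketch; do not set it casually, because it increases
clock-interrupt overhead):
set hires_tick=1
After a reboot, vmstat -i should report a clock rate of 1000 instead
of 100.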
For every clock tick, the system performs numerous checks for timer
related work, some of which is shown in the overhead image above.
The system clock tends to drift slowly during operation. There are not
always exactly 100 clock interrupts every second, due to interference
from other system operations. Different systems drift at different rates.
To synchronize the system clocks with each other, and perhaps with an
exact external time source, you can use the Network Time Protocol
(NTP), provided with the Solaris 2.6 OS and later releases. See the man
page for xntpd(1M) for more information.
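A minimal client setup (the server name is a placeholder): create
/etc/inet/ntp.conf with a server line and start the daemon with its
run-control script.
# cat /etc/inet/ntp.conf
server timehost.example.com
# /etc/init.d/xntpd start
The supplied templates /etc/inet/ntp.client and /etc/inet/ntp.server
can be copied as starting points.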
Once you identify which processes are active, you can narrow it down
to the one(s) causing the delay and determine what kind of resource is
slowing down the process.
This summary immediately tells you who is running the most active
processes.
● RSS – Resident set size of the process. It is the basis for %MEM and is
the amount of RAM in use by the process.
▼ S – Sleeping
● TIME – Total amount of CPU time the process has used so far
Many processes live very short lives. You cannot see them with ps, but
they may be so frequent that they dominate the load on your system.
The only way to catch them is to ask the system to keep a record of
every process that it runs: who ran it, what it was, when it started and
ended, and how much resource it used. This is done by the system
accounting subsystem.
TOTALS 17014 1610658.50 100.21 704471.91 16072.67 0.01 0.00 1773230720 24821
● KCOREMIN – The product of the amount of CPU time used and the
amount of RAM used while the command was active. The output
is sorted by KCOREMIN.
● BLOCKS READ – Data read from block devices (basically local disk
file system reads and writes).
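A brief sketch of turning process accounting on and looking at recent
records (the output itself is not shown here):
# /etc/init.d/acct start
# acctcom | tail
Once accounting is running, every process that exits is logged to
/var/adm/pacct, and acctcom prints one line per completed command,
including its user, start and end times, CPU seconds, and mean memory
size.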
msacct.se
The msacct.se program uses microstate accounting to show the
detailed sleep time and causes for a process. This program must be
executed by root.
# /opt/RICHPse/bin/se /opt/RICHPse/examples/msacct.se
pea.se
The pea.se program also uses microstate accounting. This program
runs continuously and reports on the average data for each active
process in the measured interval (10 seconds by default). The data
includes all processes and shows their average data since the process
was created.
# /opt/RICHPse/bin/se /opt/RICHPse/examples/pea.se
tmp = pp.lasttime<9.29046e+08>
tmp_tm = localtime(&tmp<929045620>)
debug_off()
14:13:40 name lwmx pid ppid uid usr% sys% wait% chld% size rss pf
init 1 1 0 0 0.00 0.00 0.00 0.00 736 152 0.0
utmpd 1 219 1 0 0.00 0.00 0.00 0.00 992 536 0.0
ttymon 1 259 252 0 0.00 0.00 0.00 0.00 1696 960 0.0
sac 1 252 1 0 0.00 0.00 0.00 0.00 1624 936 0.0
rpcbind 1 108 1 0 0.00 0.00 0.00 0.00 2208 840 0.0
automountd 13 158 1 0 0.00 0.00 0.00 0.00 2800 1808 0.0
● lwmx – The number of LWPs for the process so you can see which
are multithreaded
● pid, ppid, uid – Process ID, parent process ID, and user ID
● rss – Resident set size of the process, the amount of RAM in use
by the process
● pf – Page fault per second rate for the process over the interval
The pea.se program also has a wide mode which displays additional
information. Use the -DWIDE option to the se command:
# /opt/RICHPse/bin/se -DWIDE /opt/RICHPse/examples/pea.se
The additional data includes:
ps-p.se
The ps-p.se program is a ps -p clone (using /usr/ucb/ps output
format). The program monitors only processes that you own.
# /opt/RICHPse/bin/se /opt/RICHPse/examples/ps-p.se 4234
PID TT S TIME COMMAND
4234 pts/3 S 4:12.73 /usr/dist/share/framemaker,v5.5.6/bin/sunxm.s5.sparc/ma
#
pwatch.se
The pwatch.se program watches a process as it consumes CPU time.
It defaults to process ID 3 to watch the system process fsflush. It
monitors only processes that you own.
# /opt/RICHPse/bin/se /opt/RICHPse/examples/pwatch.se
Before continuing on to the next module, check that you are able to
accomplish or answer the following:
● How many threads can a process have? Why is that the limit?
● Why are LWPs used in the thread model? Are they really needed?
Objectives
Upon completion of this lab, you should be able to:
Tasks
Using maxusers
The maxusers parameter is used to set the size of several system
tables. Some of the tables that are modified directly by maxusers are:
maxusers ____________
max_nprocs ____________
ufs_ninode ____________
maxuprc ____________
ncsize ____________
set maxusers=new_value
9. Restore the /etc/system file to its original form and reboot the
system.
Using lockstat
Complete the following step:
Remember, you will not see any output from lockstat until
HotJava exits.
Monitoring Processes
Complete the following steps:
__________________________________________
__________________________________________
__________________________________________
__________________________________________
__________________________________________
__________________________________________
__________________________________________
__________________________________________
__________________________________________
__________________________________________
__________________________________________
__________________________________________
__________________________________________
__________________________________________
__________________________________________
__________________________________________
__________________________________________
__________________________________________
__________________________________________
__________________________________________
▼ msacct.se
▼ pea.se
▼ ps-ax.se
▼ ps-p.se
▼ pwatch.se
Objectives
Upon completion of this module, you should be able to:
Relevance
Additional Resources
Additional resources – The following references can provide
additional details on the topics discussed in this module:
Scheduling States
Remember that application threads do not queue for the CPU directly,
although system threads do. They are attached to an LWP and kernel
thread pair. The kernel thread is then assigned to a dispatching queue
when there is work to do for the application thread.
Scheduling Classes
● SYS – The system class, also called the kernel priorities, is used for
system threads such as the page daemon and clock thread.
Like device drivers and other kernel modules, the scheduling class
modules are loaded when they are needed.
Class Characteristics
Timesharing/Interactive
System Class
● Not time sliced – Threads run until they finish their work or are
preempted by a higher priority thread.
Real-time Class
● Time sliced
● Fixed priorities
Dispatch Priorities
System threads are assigned a priority both within their class, and
based on that priority, a system-wide dispatching priority.
The process scheduler (or dispatcher) is the portion of the kernel that
controls allocation of the CPU to processes. The timesharing dispatch
parameter table consists of an array of entries, one for each priority.
The table is indexed by the thread’s priority, and provides the
scheduling parameters for the thread. Initially, the thread is assigned
the amount of CPU time specified in the entry’s quantum, in clock
ticks.
If a process is using a lot of CPU time, it exceeds its assigned CPU time
quantum (ts_quantum) and is given the new priority in ts_tqexp.
The new priority is usually 10 less than its previous priority, but in
return it gets a longer time slice.
CPU Starvation
If a thread does not get any CPU time in ts_maxwait seconds, its
priority is raised to ts_lwait. (A value of 0 for ts_maxwait means 1
second.)
When you change the size of the time quantum for a particular
priority, you change the behavior of the CPU that it is using. For
example, by increasing the size of the quantum, you help ensure that
data the thread is using remains in the CPU caches, providing better
performance. By decreasing the size of the quantum, you favor I/O-
bound work that gets on and off of the CPU quickly.
With very heavy CPU utilization, the CPU starvation fields can ensure
that all threads get some CPU time. Generally, if you are relying on
starvation processing, you need more CPUs for the system.
If your workload changes slowly between I/O bound and CPU bound,
you can alter the ts_tqexp and ts_slpret fields to raise and lower
priorities more slowly.
The table shown in the overhead image is from the SunOS 5.2
operating system, where the table favored batch work. Note that the
quanta are five times larger, and priorities drift up and down more
slowly as the quantum is or is not used up.
The table allows batch work to use the CPU caches, such as the page
descriptor cache (PDC) and level one and level two caches, more
effectively. It also forces the thread to "earn" its way to I/O-bound
status, as the thread works its way up over several dispatches. Using
the nice and priocntl commands may change the calculated
priorities for a thread.
A hybrid table could be built, using the upper half of the current table
and the lower half of this table, which allows I/O-bound work to
move up quickly, and CPU-bound work to use the hardware more
efficiently.
This command can provide a list of the active, loaded classes (-l), or
allow you to inspect or change the current table. For example, to
change the TS table, you would:
1. Write the current table to a file (for example, tfile) by running
dispadmin -c TS -g and redirecting the output to the file.
2. Edit the file to contain the new values.
3. Make the changed table active for the life of the boot by running:
dispadmin -c TS -s tfile
dispadmin Example
dispadmin -r 100 -c TS -g
If you change the table, be sure to set the quanta in units that
correspond to the RES value.
Note – See the ts_dptbl(4) man page for instructions on how to make
permanent changes to the table.
Real-Time Scheduling
Real-Time Issues
The above overhead lists the default values for the real-time dispatch
parameter table. When a thread is put into the real-time class by
priocntl(), the priority and the time quantum can be specified at that
time. No thread or process is put into the real-time class by default.
Since the real-time priorities are fixed, the time quantum and global
priority do not change for the life of the thread unless they are
modified using priocntl. This also means that the fields from the
timesharing dispatch parameter table that manage the priority changes
are not needed.
● Scheduling class
You must be the root user to place a process into the real-time class.
You can specify an infinite time quantum with the priocntl(2) system
call by using the special value RT_TQINF for the time quantum.
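For example (the program name is hypothetical), the root user can
start a command in the real-time class at a chosen real-time priority:
# priocntl -e -c RT -p 20 ./rt_app
Without the -t option, the thread receives the default time quantum for
that priority from the real-time dispatch parameter table.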
Old Thread
The switch request (pre-empt) routine is called either when a higher
priority thread is put on the dispatch queue or a thread has used up its
time slice. It has two responsibilities: remove the current thread from
the CPU, and assign a new thread in its place.
● If a thread has been out of execution for less than three ticks, it is
placed on the dispatch queue of the CPU it last executed on.
For a real-time thread, the dispatch queue of the CPU running the
lowest priority thread is selected.
New Thread
After the replaced thread has been placed on a dispatch queue, a new
thread must be chosen to run on the CPU. The priority is as follows (1
being the highest and 5 the lowest):
The chosen thread is then loaded onto the CPU (for example, its
registers are restored) and run. If the thread selected is the same thread
that was just taken off of the CPU, it is simply resumed, since it is
already loaded on the CPU.
Processor Sets
Work that is not assigned to a specific processor set must be run on the
CPUs that are not part of a processor set. You must ensure that you
have enough CPU capacity available for this non-associated work.
Only the root user may create, manage, and assign processes to these
processor sets.
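A brief sketch, using hypothetical CPU and process IDs and assuming
the new set is assigned ID 1:
# psrset -c 1 2
# psrset -b 1 4321
# psrset -i
The first command creates a processor set containing CPUs 1 and 2
(and prints the new set’s ID), the second binds process 4321 to that set,
and the third displays the current processor set assignments.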
The performance monitoring tools report the length of the CPU run
queue. The run queue is the number of kernel threads waiting to run
on all system dispatch queues. It is not scaled by the number of CPUs.
A depth of three to five is normally acceptable, but if heavily CPU-
bound work is constantly being run, this may be too high.
It is not possible to tell what work is being delayed nor how long it has
run. Remember, higher priority work can constantly preempt lower
priority work, subject to any CPU starvation processing done by the
thread’s scheduling class.
CPU Activity
● Wait I/O – All CPUs are idle, but at least one disk device is active.
This is the amount of time spent waiting for blocked I/O
operations to complete. If this is a significant portion of the total,
the system is I/O bound.
If the CPU is not spending a lot of time idle (less than 15 percent), then
threads which are runnable have a higher likelihood of being forced to
wait before being put into execution. In general, if the CPU is
spending more than 70 percent in user mode, the application load
might need some balancing. In most cases, 30 percent is a good high-
water mark for the amount of time that should be spent in system
mode.
Some reporting tools, such as vmstat, add Wait I/O and Idle time
together, usually reporting it as Idle time.
To locate the device(s) causing a high Wait I/O percentage, check the
%w column, reported by iostat -x. The amount of CPU time used by
an individual process can be determined by using /usr/ucb/ps -au
or system accounting.
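For example:
# iostat -x 30
# /usr/ucb/ps -aux | head
The first command reports per-device statistics every 30 seconds; look
at the %w (wait) and %b (busy) columns. The second lists processes
sorted by CPU usage, most active first.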
The table in the above overhead compares the data available from
vmstat and sar relevant to CPU activity. Sample outputs of the sar
command with CPU-related options are shown. The -u option reports
CPU utilization.
# sar -u
            %usr    %sys    %wio   %idle
Average        1       0       1      98
#
● %idle – Lists the percentage of time the processor is idle and is not
waiting for I/O.
The above overhead image shows the origin and meaning of the data
reported in the CPU monitoring reports.
There are a number of commands that provide CPU control and status
information:
It can also point to locking problems, which can be identified with the
lockstat command.
● CPU – Processor ID
● intr – Interrupts
mpstat Data
The above overhead image shows the data reported by mpstat and its
origin. mpstat also provides CPU utilization data for each CPU.
cpu_meter.se
The cpu_meter.se program is a basic meter which shows usr, sys,
wait, and idle CPU as separate bars.
# /opt/RICHPse/bin/se /opt/RICHPse/examples/cpu_meter.se
vmmonitor.se
The vmmonitor.se program is an old and oversimplified vmstat-
based monitor script that looks for low RAM and swap space.
mpvmstat.se
The mpvmstat.se program provides a modified vmstat-like display. It
prints one line per CPU.
Before continuing on to the next module, check that you are able to
accomplish or answer the following:
● Why are there four scheduling classes? What was the point of
adding the interactive scheduling class?
● When might you want to use the SunOS operating system’s real-
time capability?
Objectives
Upon completion of this lab, you should be able to:
Tasks
CPU Dispatching
Complete the following steps:
dispadmin -l
___________________________________________________________
dispadmin -c TS -g
___________________________________________________________
___________________________________________________________
dispadmin -c RT -g
___________________________________________________________
___________________________________________________________
● If a process (or thread) does not use its time quantum within
maxwait seconds (a maxwait of 0 equals 1 second), its priority is
raised to lwait, which is higher and assures that the thread gets
some CPU time.
___________________________________________________________
___________________________________________________________
___________________________________________________________
./io_bound
___________________________________________________________
___________________________________________________________
___________________________________________________________
5. Type:
/usr/proc/bin/ptime ./io_bound
___________________________________________________________
___________________________________________________________
7. Type:
/usr/proc/bin/ptime ./dhrystone
___________________________________________________________
___________________________________________________________
9. Type:
./test.sh
This shell script runs both dhrystone and io_bound and stores
the results in two files, io_bound.out and dhrystone.out. It then
waits for both of them to complete. Record the results.
▼ io_bound results
_______________________________________________________
_______________________________________________________
▼ dhrystone results
_______________________________________________________
_______________________________________________________
___________________________________________________________
___________________________________________________________
___________________________________________________________
___________________________________________________________
1. Type:
2. Type:
dispadmin -s ts_config.new -c TS
3. Type:
./test.sh
Note the results. Given the results, can you make an educated
guess as to the nature of the ts_config.new dispatch parameter
table?
___________________________________________________________
4. Type:
ps -cle
___________________________________________________________
___________________________________________________________
5. Type:
dispadmin -s ts_config -c TS
priocntl -e -c RT ./test.sh
___________________________________________________________
___________________________________________________________
Note – Running these processes as real time might cause your system
to temporarily “hang.”
▼ io_bound results
_______________________________________________________
_______________________________________________________
_______________________________________________________
▼ dhrystone results
_______________________________________________________
_______________________________________________________
_______________________________________________________
Objectives
Upon completion of this module, you should be able to:
Relevance
● What is a cache?
● What areas of the system might benefit from the use of a cache?
Additional Resources
Additional resources – The following references can provide
additional details on the topics discussed in this module:
What Is a Cache?
Caches are used to keep a (usually small) subset of active data near
where it is being used. Just as the reference books you use tend to
remain on your desk, the system tries to keep data being used near the
point of use. This provides the ability to not only access the data more
quickly, but avoids a time-consuming retrieval of the data from its
current location.
Caches are used by the hardware, in the CPU and I/O subsystem, and
the software, especially for file system information. The hardware and
software contain dozens of caches.
A cache usually can hold only a small subset of the data available,
sometimes less than .01 percent, so it is critical that this subset contain
as much of the data in use as possible.
There are many different ways to manage a cache, as you will see, but
there are only a few that are commonly used.
Hardware Caches
The most commonly known hardware caches are the CPU caches.
Modern CPUs contain at least two data caches: level one and level
two.
The level one cache is also called the on-chip or internal cache. It is
usually on the order of 32 Kbytes in size. The level two cache is also
called the external cache. It is the most commonly known cache.
In addition, the hardware I/O subsystem has various caches that hold
data moving between the system and the various peripheral devices.
Most caches are made from SRAM (static random access memory). The
system’s main storage is made from DRAM (dynamic RAM). Static
RAM uses four transistors per bit, while DRAM uses only one.
CPU Caches
The level one cache is usually from 4–40 Kbytes in size, and is
constructed as part of the CPU chip. Of the three million transistors on
the UltraSPARC-II™ CPU chip, two million are used for the cache.
The level two cache is usually from 0.5–8 Mbytes in size, and is located
near, but not on, the CPU.
The level one cache operates at CPU speeds since it needs to provide
data to the CPU as rapidly as possible. Data must be in the level one
cache before the CPU can use it.
Cache Operation
The system uses many techniques to improve the hit rate, which are
discussed in the following pages.
For efficiency reasons, caches are usually kept 100 percent full. This
way, as much data as possible is available for high-speed access.
When a cache miss occurs, new data must be put in the cache. This
process, called a move in, must replace data already in the cache.
This replacement process requires that a location for the incoming data
be found, at the expense of some data already present in the cache.
Depending on the design of the cache, there may be only one place
that the new data can go, or there may be many. If there is more than
one, a location least likely to hurt performance is chosen.
The algorithm used to find the new location almost always assumes
that the oldest unreferenced candidate location (the longest unused) is
the least valuable and will choose it as the location. This type of
algorithm is known as LRU (least recently used).
Cache Replacement
When a location for the incoming data is found, the system cannot just
copy the new data into this location; the incoming data could overlay
the only current copy of the existing data. In this case, the data must
be moved out, or flushed, and copied to the next level up. If the next
level up already has a current copy of the data, it can be abandoned,
and the new data copied directly in.
Depending on the hardware design, the move in can begin before the
move out has finished.
The amount of data that is moved is called a cache line. A cache line, in
the CPU caches, is usually 64 bytes, but sometimes only 16. In other
types of caches, it can be as small as 1 byte or as large as several
Kbytes or more.
1. If the data is present, the cache line (or portion of the line) is
returned to the requester.
2. If the data is not present, the requestor must wait until the data
has been received from the next level up. When the data is
available, a cache location is found for it, and the existing data in
that location is either abandoned or moved out, depending on its
state at the next level up. A full cache line of data will be
transferred.
3. If the next level cache cannot find the data, it requests the data
from its next level up, until the data is found and moved in or an
error is generated.
The cache hit rate is the percentage of memory accesses that result in
cache hits or finding the requested data in cache.
● The cache’s fetch rate – How much data is placed in the cache for
each memory access
Cache misses can be very expensive, due to the much higher cost of
getting data from the next level up.
Even a miss rate of just a few percent can make a noticeable difference
in a system’s performance.
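A worked example (the access times are assumptions chosen for
illustration): if a cache hit costs 1 cycle and a miss costs 50 cycles, a
98 percent hit rate gives an average access time of
(0.98 x 1) + (0.02 x 50) = 1.98 cycles
which is nearly twice as slow as a perfect cache, even though only
2 percent of the accesses miss.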
There are many ways to improve the hit rate; most involve the design
characteristics of the cache. Different characteristics can be used
depending on system requirements and cost considerations.
Cache Characteristics
There are several major cache design characteristics that can be chosen,
depending on the system’s requirements.
Once you identify a cache and its characteristics, you can determine
how to manage it for better performance.
In a set associative cache, a line can exist in the cache in several possible
locations. If there are n possible locations, the cache is known as n-way
set associative. While set-associative caches are more complex than
direct-mapped caches, they allow for a more selective choice of data to
be replaced.
Harvard Caches
A Harvard cache (which is named after the university where it was
invented) separates a cache into two parts. The first part, named the I-
cache (I$), contains only instructions. The second part, named the D-
cache (D$), contains only data. The two parts are usually four-way set
associative, and do not have to be the same size (or type).
The idea behind this divided cache is that instructions will not lose
their cache residence when large amounts of data are passed through
the cache. This allows better performance for the system as a whole,
although there may be slight performance degradation for a few
applications. Harvard caches are usually used in CPU level one caches.
Caches that do not divide the cache are called unified. Following these
naming conventions, the unified (non-Harvard) external cache is
named the E$.
A write-back cache writes only to cache. It writes to the next level only
when the line is flushed or another system component needs the data.
Write Cancellation
A write-back cache takes advantage of a system behavior called write
cancellation.
If a CPU modifies a cache line, it causes all other caches to discard that
block if it is present in them.
If the CPU tries to access a line not present in its cache, it needs to read
it from a higher level. The CPU requests (via the system bus) the most
recent cache line from any cache or from memory.
If the CPU modifies a line, it causes all other caches either to discard
the block (the write invalidate protocol) or to update their copy of the
cache line (the write update protocol). Which protocol is used depends
on the processor architecture.
Cache Thrashing
Also, this problem can occur on one type of system or CPU, but not on
another type. If you suspect a problem of this type and the program’s
source code is not available, move the program to a different type of
system to see if the problem disappears.
The further from the CPU the cache is, the larger and slower it is.
These components may have different behaviors, but they all operate
according to the basic cache principles.
The table in the above overhead shows the relative speeds of the
different levels of the cache hierarchy. Clearly, data should be at the
lowest possible locations when being used; there are significant
performance differences between the levels. The further away from the
user the data is, the higher the cost of a cache miss.
As you might imagine from looking at the relative access times in the
above table, the system tries very hard to avoid doing I/O operations.
(You will see how this is done in later modules.) Caches figure
prominently in these processes.
▼ Linked lists are bad for locality. Use arrays whenever possible.
Most of the things that you can tune to affect hardware cache
performance can only be done at the application level. Unlike the OS
software caches, there are no controls available to change the
configuration, size, or behavior of the hardware caches.
You can, however, alter the behavior of main memory, the cache for
your disks. This is discussed in the next module.
Before continuing on to the next module, check that you are able to
accomplish or answer the following:
● How could you prove that you were having hardware caching
problems?
● Can you find software equivalents for each of the hardware cache
types?
Objectives
Upon completion of this lab, you should be able to:
Tasks
1. What speed will 100 accesses run at if all are found in the cache
(100 percent hit rate)?
___________________________________________________
___________________________________________________
3. If a job was running three times longer than it should, and this
was due completely to cache miss cost, what percentage hit rate
does it achieve?
___________________________________________________
4. How could the routine in the program that was causing the
problem be located?
___________________________________________________
___________________________________________________
Objectives
Upon completion of this module, you should be able to:
Relevance
Additional Resources
Additional resources – The following references can provide
additional details on the topics discussed in this module:
Virtual Memory
The text segment is the executable instructions from the executable file,
which are generated by the compiler. The data segment contains the
global variables, constants, and static variables from the program. Its
initial form also comes from the executable file. The block started by
symbol (BSS) segment contains the same types of information that the
data segment does; the difference is that the BSS is initialized to zeros,
so it is not stored in the executable file.
The heap is memory that is used by the program for some types of
working storage; it is allocated with the malloc call.
Following the heap is the hole. The hole is that section of the address
space that is unallocated and not in use. For most processes, it is the
vast majority of the address space. It takes up no resources since it has
not been allocated.
Above the shared libraries, at the end of the address space area that
the program can address, is the stack. The stack is used by the program
for variables and storage while a specific routine is running. The stack
grows and contracts in size, depending on which routines have been
called and what kind of storage requirements each routine has. The
stack is normally 8 Mbytes in size.
Located above the stack is the kernel. The kernel can be as large as 0.5
Gbytes of virtual memory. While it is a part of the address space, the
kernel cannot be referenced by an application. When a program needs
access to kernel data, it must be able to run as root or make a system
call request. The actual size of the kernel changes, depending on many
factors such as peripheral device configuration and amount of system
real memory. Essentially the entire kernel is locked in memory: the
amount of real memory used is the same as the amount of virtual
memory allocated in the kernel.
Types of Segments
There are several different segment drivers; the majority are used to
manage kernel memory. The most common segment driver for user
processes is the segvn driver. The segvn driver maps files
(represented by vnodes) into the address space. Almost every segment
in the address space is backed by a file (an executable, swap, or data
file), and the segvn driver supports them all.
Since memory acts like a fully associative cache for a disk, a directory
mechanism must allow the system to determine whether a page is in
memory, and if so, where it is in real memory. This process is called
address translation.
The diagram above illustrates the process that occurs when a virtual
address must be translated.
The PDC is checked first, then a long translation is done. If the page is
invalid at the end of the translation, the hardware cannot proceed and
asks for help from the software by issuing a page fault interrupt. The
OS passes the request to the proper segment driver, which checks
memory to see if the page is available. If it is, it attaches the page to
the process. (For a shared page, there may already be hundreds of
users of the one page.) If the page is not in memory, it requests the
appropriate I/O operation to bring in the page (for example, from NFS
or UFS).
Main memory is a fully associative cache for the contents of the disks.
As such, it has the usual attributes of cache operation, such as move
ins and move outs. For main memory, move ins and move outs are
called paging.
The process for main memory is the same: a cache miss occurs (a page
fault). The OS determines that it must read in the page from disk. In
the hardware caches, a location is identified, a move out is done, and
then a move in (pagein) is done.
You must maintain enough free pages for the system’s needs.
Pages are not replaced immediately after they are used. Instead, the
page daemon wakes up every 1/4 second and checks to see if there are
enough free pages. If there are, it does nothing. If there are not, it
attempts to replenish the free page queue in a process called page
replacement. The moveouts are asynchronous in this case.
Since main memory is a cache, the algorithm used by the page daemon
to replenish the free queue is the usual LRU process. The page daemon
inspects, over time, all of the pages in memory and takes those that
have not been used.
The page daemon LRU process is called scanning. The more pages
needed, the faster the page daemon scans. The scan rate, how many
pages are being checked each second, is the best indicator of stress
(shortage) in the memory subsystem. If the page daemon cannot keep
up with the demand for free pages, the swapper may run, which will
temporarily move some memory-heavy work to disk.
The page daemon performs the same checks on every page eligible to
be taken. There are many types of pages that cannot be taken, such as
kernel pages, which are skipped over.
The page daemon checks each eligible page to see if it can be taken. To
take the page, it must be unreferenced, and if modified, the maxpgio
limit cannot have been reached.
If there are fewer than lotsfree pages left, the page daemon begins
scanning at the slowscan rate. As fewer and fewer pages remain on
the free queue, the page daemon scans faster. When a minimum
number (minfree) of pages is left, the page daemon scans as fast as possible,
at the fastscan rate. The scan rate is determined by the number of
pages available, as shown in the graph above.
The only special case occurs when too many pages are taken at once,
causing a big drop in the free queue. When this occurs, the system
temporarily adds to lotsfree a quantity (deficit) designed to
keep this from happening again. The deficit value decays to zero
over a few seconds.
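You can inspect the current values of these parameters on a running
system with adb, just as the lab at the end of this module does. The
values shown below are purely illustrative; the /D format prints each
value in decimal.
# adb -k /dev/ksyms /dev/mem
physmem 1e05
lotsfree/D
lotsfree:
lotsfree:       256
fastscan/D
fastscan:
fastscan:       8192
$q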
The table above shows the defaults for the page daemon parameters.
Prior to SunOS 5.6, these parameters were not scaled for multi-
processor systems, causing larger systems to default to too small a free
queue and unnecessarily high scan rates.
With SunOS 5.6 and higher, these parameters are properly scaled. With
deficit processing, it is no longer necessary to tune these
parameters. If you have values in /etc/system from an earlier release,
remove them when migrating to the Solaris 2.6 environment.
The drawback is that the page daemon could take pages that the
program is still using, causing the program to quickly take them back.
This causes memory thrashing, which uses CPU time and does not
replenish the free queue. Performance usually degrades rapidly at this
point, especially if the pages are modified and being written out.
maxpgio
If a large number of dirty pages were taken every quarter second, this
would cause significant I/O spikes, as well as large amounts of
"non-productive" I/O that are not related to the system’s execution.
If the page daemon has already taken maxpgio ÷ 4 modified pages in a
pass, any further pages it takes must be unmodified; if an unused but
modified page is found, it is ignored.
The Solaris method of caching file system data is known as the page
cache. The page cache is dynamically sized and can use all memory
that is not being used by applications. The Solaris page cache caches
file blocks rather than disk blocks. The key difference is that the page
cache is a virtual file cache, rather than a physical block cache. This
allows the Solaris OS to retrieve file data by simply looking up the file
reference and seek offset, rather than invoking the file system to look
up the physical disk block number.
The above overhead shows the Solaris page cache. When a Solaris
process reads a file the first time, file data is read from disk through the
file system into memory in page-size chunks and returned to the user.
The buffer cache is used only for file system data that is known
solely by physical block numbers. This data consists only of metadata
items – direct/indirect blocks and inodes. All file data is cached
through the page cache.
Note – A rule of thumb for the UFS buffer cache size is 300 bytes per
inode, plus 1 Mbyte per 2 Gbytes of files expected to be accessed
concurrently.
For example, if you have a database system with 100 files totaling
100 Gbytes of storage space and only 50 Gbytes of files are accessed at
the same time, then at most you would need 30 Kbytes
(100 x 300 bytes = 30 Kbytes) for the inodes, and about 25 Mbytes
([50 ÷ 2] x 1 Mbyte = 25 Mbytes) for the metadata. Rounding up, you
could then set the buffer cache limit (which is specified in Kbytes) in
the /etc/system file:
set bufhwm=30000
You can monitor the buffer cache hit rate using sar -b. The statistics
for the buffer cache show the number of logical reads and writes into
the buffer cache, the number of physical reads and writes out of the
buffer cache, and the read/write hit ratios.
Try to obtain a read cache hit ratio of 100 percent on systems with few,
but very large files, and a hit ratio of 90 percent or better for systems
with many files.
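For example, a report such as the following (the figures are illustrative
only) shows the read and write hit ratios in the %rcache and %wcache
columns:
# sar -b 5 3

18:04:20 bread/s lread/s %rcache bwrit/s lwrit/s %wcache pread/s pwrit/s
18:04:25       0      25      99       1       6      83       0       0
18:04:30       0      18     100       2       9      78       0       0
18:04:35       1      22      95       1       7      86       0       0

Average        0      22      98       1       7      82       0       0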
The rate at which the system pages and the rate at which the page
scanner runs are proportional to the rate at which the file system is
reading or writing pages to disk. On large systems, this means that
you would expect to see large paging values.
Priority Paging
Although it may be normal to have high paging and scan rates with
heavy file system usage, it is likely that the page scanner will be
putting too much pressure on your applications’ private process
memory.
Pages that have not been used in the last few seconds will be taken by
the page scanner when you are using the file system. This can have a
very negative effect on application performance. The effect can be poor
interactive response, thrashing of the swap disk, or low CPU utilization
due to heavy paging. This happens because the page cache is allowed
to grow to the point where it steals memory pages from important
applications. Priority paging addresses this problem.
set priority_paging=1
The page daemon now begins stealing pages when free memory
reaches cachefree, but will only steal file system pages until the
amount of free memory reaches lotsfree. File system pages do not
include those which are shared libraries and executables. If free
memory reaches lotsfree, the page daemon operates as before,
stealing any unused pages (subject to maxpgio).
Note – Make sure data files do not have the executable bit set. This can
fool the virtual memory system into treating them as executables, and
priority paging will then not steal the pages of these files first.
When using priority paging, remember that the page daemon begins
processing at cachefree, not lotsfree, so the free memory queue
might be larger than expected under normal conditions.
The table above shows the data most commonly used in tuning
the memory subsystem.
Note the different units reported by the tools. Make sure that you
know the memory page size for your system. (Use the pagesize
command if necessary.)
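For example, on an UltraSPARC system:
# pagesize
8192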
# sar -g
● ppgout/s – The actual number of pages that are paged out, per
second. (A single pageout request may involve paging out
multiple pages.)
● atch/s – The number of page faults, per second, that are satisfied
by reclaiming a page currently in memory (attaches per second).
Instances of this include reclaiming an invalid page from the free
list and sharing a page of text currently being used by another
process (for example, two or more processes accessing the same
program text).
● re – The number of pages reclaimed from the free list, per second.
The page had been stolen from a process but not yet reused by a
different process, so the original process can reclaim it, thus
avoiding a full page fault that would require I/O.
● free – Size of the free memory list in KBytes. It includes the pages
of RAM that are immediately ready to be used whenever a process
starts up or needs more memory.
The list above shows the origins of the data elements reported in the
memory tuning reports.
The new paging counters are visible with the memstat command
which can be downloaded from
ftp://playground.sun.com/pub/rmc/memstat. The output from
the memstat command is similar to that of vmstat, but with extra
fields to break down the different types of paging. In addition to the
regular paging counters (sr,po,pi,fr), the memstat command shows
the three types of paging broken out as executable, application, and
file. Sample output follows as well as field descriptions.
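Because the sample output is not reproduced here, the following sketch
shows the general form of a memstat report (the column grouping is
approximate and the figures are invented for illustration):
# memstat 5
memory -------paging------- -executable- -anonymous- --filesys-- ---cpu---
 free  re mf  pi  po  fr de sr epi epo epf api apo apf fpi fpo fpf us sy wt id
 2104   0 11  96 128 128  0 184  0   0   0   0   0   0  96 128 128  1  2 12 85
 1948   0  9 112 144 144  0 216  0   0   0   0   0   0 112 144 144  2  3 14 81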
With priority paging enabled, the scanner is only freeing file pages.
You can clearly see from the rows of zeros in the executable and
anonymous memory columns that the scanner chooses file pages first.
Swapping
If the page daemon cannot keep up with the demand for memory, then
the demand must be reduced. This is done by swapping out memory
users: writing them to disk. Obviously this does not get the system’s
work done, and so it is only done when there is no other choice. This
is usually called desperation swapping.
The system uses an algorithm to choose LWPs for swapping. When the
last LWP of a process is swapped out, the process itself is removed from
memory and all of its pages are freed. Before that point, only the pages
used exclusively by the swapped-out LWP are freed.
The swapper checks once every second to see if it must act. It will
swap out if memory has been low for a sustained period (avefree30)
and is still low (avefree).
avefree is the average amount of free memory over the last 5 seconds,
and avefree30 is the average amount of free memory over the last 30
seconds. These are used to ensure that the shortage has not been
temporary (avefree30), and is still ongoing (avefree).
Those that pass this screening are then ranked, and the LWP with the
highest ranking is swapped out.
Swapping Priorities
The chosen LWP is then swapped out. The swapping of the LWP itself
usually returns little memory. Only when the last LWP of a process is
swapped, and the memory used by the entire process is taken, does
the amount of memory freed become significant.
Swap In
As soon as an LWP has been swapped out, the system tries to bring it
back in. Once the pressure on the memory subsystem eases, the
system calculates priorities for the swapped-out LWPs and brings them
back in, one per second, for as long as memory pressure stays off, until
all have been brought back in.
Swap Space
A page of swap is not allocated until the page has to be written out.
This allows the system to group related pages, limiting the number of
writes and perhaps the number of reads necessary to bring the pages
back in. If no swap disk space is available for the page, the system
leaves the page as is in memory.
Since the system can reserve more swap space than the amount of disk
space used for swap, when a page needs to be written out, there may
not be any available disk space for the page. If this is the case, the page is
left in memory.
tmpfs
There are no monitors that report on the usage of tmpfs space. You
need to use the df -k command.
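For example (the figures are illustrative), the swap-backed tmpfs file
system appears as follows:
# df -k /tmp
Filesystem            kbytes    used   avail capacity  Mounted on
swap                  212008     424  211584     1%    /tmp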
The table above shows the data most commonly used in
monitoring the swap subsystem.
Remember, you do not tune swapping; you eliminate it. The presence
of any swap-outs at all indicates stress on the memory subsystem, since
it is so hard to induce swapping.
# vmstat -S 2
procs memory page disk faults cpu
r b w swap free si so pi po fr de sr f0 s0 s6 -- in sy cs us sy id
0 0 0 1640 3656 0 0 8 8 15 0 5 0 1 0 0 187 556 200 1 1 98
0 0 0 174328 1288 0 0 4 0 0 0 0 0 0 0 0 191 470 202 0 0 100
0 0 0 174328 1288 0 0 0 0 0 0 0 0 0 0 0 177 447 188 0 0 100
0 0 0 174328 1288 0 0 0 0 0 0 0 0 0 0 0 180 447 190 0 0 100
0 0 0 174328 1288 0 0 0 0 0 0 0 0 0 0 0 186 447 186 0 0 100
^C#
● w – The swap queue length; that is, the number of swapped-out LWPs
waiting for processing resources to finish.
The list above shows the origins of the data elements reported in the
swapping reports.
Shared Libraries
Paging Review
The diagram above shows where memory segments that are paged out
are written to or originate from. Unmodified pages, such as text
segment pages and unmodified data segment pages, are never written
out. A new copy is read in from the executable file when needed.
Memory Utilization
As you can see, the physical memory size of this system is 64 Mbytes.
The page cache uses available free memory to buffer files on the file
system. On most systems, the amount of free memory is almost zero as
a direct result of this.
Kernel Memory
Kernel memory is allocated to hold the initial kernel code at boot time,
then grows dynamically as new device drivers and kernel modules are
used. The dynamic kernel memory allocator (KMA) grabs memory in
large "slabs," and then allocates smaller blocks more efficiently. If there
is a severe memory shortage, the kernel unloads unused kernel
modules and devices, and frees unused slabs.
Totaling the small, large, and oversize allocations, the kernel has
10,703,688 bytes of memory and is actually using 10,100,736 at present.
The pmap command also shows that the csh process is using 152k of
private (non-shared) memory. Another instance of csh will only
consume 152k of memory (assuming its private memory requirements
are similar).
Free Memory
Free memory can be measured with the vmstat command. The first
line of output from vmstat is an average since boot, so the real free
memory figure is available on the second line. The output is in Kbytes.
# vmstat 2
procs memory page disk faults cpu
r b w swap free re mf pi po fr de sr f0 s0 s6 -- in sy cs us sy id
0 0 0 210528 2808 0 19 49 38 73 0 24 0 5 0 0 247 1049 239 4 2 94
0 0 0 199664 1904 0 4 12 0 0 0 0 0 2 0 0 224 936 261 7 2 92
0 0 0 199696 1056 7 0 0 628 620 0 20 0 0 0 0 401 1212 412 27 9 64
0 0 0 199736 1144 0 0 0 0 0 0 0 0 0 0 0 165 800 188 0 0 100
^C#
Use vmstat to see if the system is paging. If the system is not paging,
then there is no chance of a memory shortage. Excessive paging is
evident by constant nonzero values in the scanrate (sr) and pageout
(po) columns.
Look at the swap device for activity. If there is application paging, then
the swap device will have I/Os queued to it. This is an indicator of a
memory shortage.
There are three basic modes on the memtool GUI: VFS Memory (Buffer
Cache Memory), Process Memory, and a Process/Buffer cache
mapping matrix. The initial screen (Figure 6-1), shows the contents of
the Buffer Cache memory. Fields are described as follows:
● Inuse – The amount of physical memory that this file has mapped
into a process segment or SEGMAP. Generally the difference
between this and the resident figure is what is on the free list
associated with this file.
● Pageins – The number of minor and major pageins for this file.
The GUI will only display the largest 250 files. A status panel at the
top of the display shows the total number of files and the number that
have been displayed.
The second mode of the MemTool GUI is the Process Memory display.
Click on the Process Memory checkbox to select this mode.
The third mode shows the Process Matrix which is the relationship
between processes and mapped files. Click on the Process Matrix
checkbox to select this mode.
Each column of the matrix shows the amount of memory mapped into
that process for each file, with an extra row for the private memory
associated with that process.
The matrix can be used to show the total memory usage of a group of
processes. By default, the summary box at the top right corner shows
the memory used by all of the processes displayed.
The list in the overhead above contains the most commonly adjusted
tuning parameters for the memory subsystem. In addition, the list
provides some actions that you can take to improve the efficiency of
your memory system.
With priority paging enabled, the file system scan rate will be higher.
More pages need to be scanned to find file pages that it can steal
(process private memory and executable pages are skipped). High
scan rates are usually found on systems with heavy file system usage
and should not be used as a factor for determining memory shortage.
If you have the Solaris 7 OS (or the appropriate patches for 2.6 and
2.5.1), the memstat command will reveal whether you are paging to
the swap device; if you are, the system is short of memory.
If you have high file system activity, you will find that the scanner
parameters are insufficient and will limit file system performance. To
compensate for this, set some of the scanner parameters to allow the
scanner to scan at a high enough rate to keep up with the file system.
You need to increase fastscan so that the page scanner works faster.
A recommended setting is one quarter of memory with an upper limit
of 1 Gbyte per second. This translates to an upper limit of 131072 for
the fastscan parameter. The handspreadpages parameter should
also be increased with fastscan to the same value.
Another limiting factor for the file system is maxpgio. The maxpgio
parameter is the maximum number of pages per second that the page
scanner can push. It can also limit the number of file system pages that
are pushed.
This in turn limits the write performance of the file system. If your
system has sufficient memory, set maxpgio to something large, such
as 65536. On E10000 systems this is the default for maxpgio.
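For example, the corresponding /etc/system entries might look like the
following. The values come from the guidelines above and are only a
starting point; verify them against your own memory configuration
before applying them.
* Page scanner settings for heavy file system I/O (illustrative values)
set fastscan=131072
set handspreadpages=131072
set maxpgio=65536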
If you have memory problems, work on them first. You may eliminate
or reduce many other problems seen in the system that result from the
memory subsystem.
Before continuing on to the next module, check that you are able to
accomplish or answer the following:
● Why are tuning parameters other than the scan rate generally not
very useful?
Objectives
Upon completion of this lab, you should be able to:
Tasks
Process Memory
Complete the following steps:
___________________________________________________
2. Run:
# /usr/proc/bin/pmap -x Xsun-pid
___________________________________________________
___________________________________________________
___________________________________________________
Type:
./paging.sh
lotsfree_______________________ pages
desfree________________________ pages
minfree________________________ pages
slowscan_______________________ pages
fastscan_______________________ pages
physmem________________________ pages
pagesize_______________________ bytes
2. Convert lotsfree to bytes from pages. You will use these values
later in this lab.
_______________________________________________
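For example, if the script reported lotsfree as 256 pages and pagesize as
8192 bytes, lotsfree would be 256 x 8192 = 2,097,152 bytes (2 Mbytes).
These figures are only an example; use the values you recorded above.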
In this portion of the lab, you will put a load on the system by running
a program called mem_hog, which locks down a portion of physical
memory. Then you will run a shell script called load3.sh, which puts
a load on the system’s virtual memory subsystem. This script should
force the system to page and, perhaps, swap. You will run mem_hog
and load3.sh several times, each with a different set of paging
parameters to observe their effect.
vmstat -S 5
prtconf
3. Type:
./mem_hog physical_memory
Note – When mem_hog has locked down its memory, it prompts you to
press Return to exit. This frees up any memory that has been locked,
so do not press Return yet.
./load3.sh
▼ Scan rate and number of Kbytes being freed per second. Most
of this activity is due to the page daemon.
▼ Disk activity.
sar -A -o sar.norm 5 60
./load3.sh
adb -kw
lotsfree/W 0tnew_lotsfree
lotsfree/D
slowscan/D
Note – Type $q to exit adb. For this lab, you may want to keep this
window open.
___________________________________________________________
___________________________________________________________
___________________________________________________________
./mem_hog physical_mem
sar -A -o sar.1 5 60
./load3.sh
Did you see the changes you expected? Explain why or why not.
___________________________________________________________
___________________________________________________________
___________________________________________________________
___________________________________________________________
19. Restore lotsfree to its original value. In the adb window, type:
lotsfree/W 0toriginal_value_of_lotsfree
Scan Rates
To examine the scan rates:
slowscan/W 0t10
fastscan/W 0t100
What effect do you think this will have on the system’s behavior
with respect to:
___________________________________________________________
___________________________________________________________
___________________________________________________________
___________________________________________________________
mem_hog physical_mem
sar -A -o sar.2 5 60
load3.sh
6. Exit mem_hog.
Did you see the changes you expected? Explain why or why not.
___________________________________________________________
___________________________________________________________
___________________________________________________________
___________________________________________________________
___________________________________________________________
___________________________________________________________
___________________________________________________________
___________________________________________________________
slowscan/W 0toriginal_value
fastscan/W 0toriginal_value
Objectives
Upon completion of this module, you should be able to:
● Define a bus
Relevance
Additional Resources
Additional resources – The following references can provide
additional details on the topics discussed in this module:
What Is a Bus?
There are two major categories of buses: circuit switched and packet
switched.
Most buses today are packet switched for performance reasons even
though they are more complicated to implement.
System Buses
There are many different kinds of system bus, depending on cost and
performance requirements. Some systems have multiple layers of
buses, with slower buses connecting to faster buses, to meet
performance, flexibility, design, and cost requirements.
System buses change and evolve over the years; Sun has used many
different buses in its SPARC systems. Each new processor family
generally uses a new system bus.
The XDbus (Xerox databus) was used in Sun’s first generation, large,
datacenter systems, the sun4d architecture family. Unlike the general
purpose systems, the family members had different numbers of
parallel buses to provide additional bandwidth without the need for
faster or wider buses.
The Gigaplane bus was introduced with the Enterprise X000 family of
systems. In these servers, each system board has a UPA bus which
connects the board’s components to a Gigaplane backbone bus. The
Gigaplane bus has the ability to electrically and logically connect and
disconnect from the UPA buses while it is running, providing the
ability for the X000 and X500 servers to do dynamic reconfiguration
(the ability to add and remove system boards from a running system).
Gigaplane XB Bus
Just as with the Gigaplane bus, the Gigaplane XB bus provides the
ability to perform dynamic reconfiguration.
Bandwidth
The table above lists each system bus along with its speed and width,
and its burst (sustained) bandwidth.
Peripheral Buses
I/O devices generally run at a much lower bandwidth than the system
bus. Connecting them directly to the system bus would be a very
expensive proposition. Instead, many different peripheral buses
have evolved to connect I/O devices, usually through interface cards,
to the system bus.
The SBus is a 25-MHz circuit switched bus with a peak theoretical total
bandwidth of around 200 Mbytes/sec in its fastest implementations.
There are many SBus cards available today to perform most functions
required by a Sun system.
The PCI bus was originally seen in personal computer (PC) systems.
The PCI bus can run at 32- and 64-bit data widths, and at 33-, 66-, and
100-MHz clock rates. This provides a maximum theoretical total
bandwidth for a 64-bit, 66-MHz card of over 500 Mbytes/sec
(8 bytes x 66 MHz = 528 Mbytes/sec). With hundreds of existing PCI
cards available, supporting one in a Sun PCI system requires only a
device driver.
● Type 1 – Three SBus slots and two onboard SOC (serial optical
channel) FC/OMs (Fibre Channel optical modules)
● Type 2 – Two SBuses, one UPA, and two onboard SOC FC/OMs
▼ No SCSI interface
● Type 5 – Two SBuses, one UPA, and two onboard SOC+ GBICs
Once the peripheral devices have been attached to the system, the
system must be able to transfer data to and from them. Mediated by
the device driver, this is done over the system bus which is connected
to the peripheral bus.
The exact mechanisms of these transfers are beyond the scope of this
course, but the essential feature of the transfers is that the peripheral
card appears to be resident in physical memory; that is, writes to a
section of what looks like main memory are actually writes to the
interface card.
PIO (programmed I/O) is quite easy to use in a device driver, and is
normally used for slow devices and small transfers.
DVMA
DVMA (direct virtual memory access) allows the interface card, on
behalf of the device, to transfer
data directly to memory, bypassing the CPU. DVMA is capable of
using the full device bandwidth to provide much faster transfers.
DVMA operations, however, require a much more complicated setup
than PIO. The break-even point is at about 2000 bytes in a transfer.
Not all I/O cards support DVMA, and if they do, the driver may not
have implemented support for it.
prtdiag
The prtdiag utility, first provided with the SunOS 5.5.1 operating
system, provides a complete listing of a system’s hardware
configuration, up to the I/O interface cards.
You can get similar information from the prtconf command, but it is
much harder to read and interpret.
The first section of the prtdiag report describes the CPUs present in
the system.
This report, taken from an Enterprise 10000 system, shows eight CPUs
on two system boards. The CPUs are 250-MHz UltraSPARC-II processors
with 1 Mbyte of external cache. (The mask version is the revision level
of the processor chip.)
The next section of the prtdiag report shows the physical memory
configuration. This report shows two system boards, each with two
banks of 256 Mbytes each, for a total system memory size of 1 Gbyte.
There are four memory banks possible on the system board, according
to the report, but only two are filled. This usually allows what is called
two-way memory interleaving.
Memory Interleaving
When writing, say, 64 bytes to memory (the normal transfer or line
size), the system is not able to write it all at once. For example, a
memory bank may have the ability to accept only 8 bytes at a time.
This means that eight memory cycles or transfers will be required to
write the 64 bytes into the bank.
With two-way interleaving, consecutive lines go to alternating banks, so
while one bank is busy with its eight transfer cycles, the other bank can
accept the next line. While four 64-byte requests would take the same
amount of transfer time, interleaved or not, interleaving does provide
significant performance benefits.
Aside from the shorter elapsed time, requests are always spread across
multiple banks, providing a sort of memory RAID (redundant array of
independent disks) capability. This helps avoid hot spots (over
utilization) of a particular bank if a program is using data in the same
area consistently. For some types of memory-intensive applications,
increased speeds of 10 percent or more have been seen, just by using
interleaving.
Check your systems to ensure that the memory has been properly
installed for maximum interleave, and remember to take into account
the possibility of interleave when configuring a system.
The prtdiag I/O section shows, by name and model number, the
interface cards installed in each peripheral I/O slot.
Bus Limits
The problem is that as you add more traffic, the bus tends to get
congested and not operate as well. When the bus reaches capacity, no
more traffic can flow over it. Any requests to use the bus must wait
until there is room, delaying the requesting operation.
When this occurs, you do not get any specific messages; the system
does not measure or report on this condition.
Bus Bandwidth
There is an old joke that goes, “If I have checks, I must have money.”
The connection between a check and real money is lost. The same
situation can occur with bus bandwidth: “If I have slots, I must have
bandwidth.” Like checks and money, these two are not necessarily
related.
If you have installed more cards into a peripheral bus than you have
bandwidth for, some of the I/O requests will be delayed. The greater
the delay, the more problems will become evident.
Problems from this situation, such as bad I/O response times, dropped
network packets, and so on, occur intermittently but consistently,
almost invariably during the busiest part of the day.
Just because the slot is available does not mean that it can be used. If it
seems unreasonable to support certain interface cards in the same bus,
check the configuration. These problems are not widely known, and it
is possible for both the sales and support staff to overlook a problem.
When calculating throughput, realize that you will usually at best get
60 percent of the rated bandwidth of a peripheral device. It is not
usually necessary to use actual transfer rates when checking a
configuration; you will almost invariably get the same percentage of
capacity either way.
Unexpected and unexplained problems (all the patches are on, all the
hardware has been replaced) may be strong indications of
configuration problems.
Be prepared to leave empty slots in the system. You may not have
bandwidth for some of those slots.
In this case, assuming that all onboard devices (those shown) are in
use, the best choice would be to put the SOC+ cards in slots 1 and 2,
and leave slot 3 empty.
Tuning Reports
Before continuing on to the next module, check that you are able to
accomplish or answer the following:
❑ Define a bus
● Why doesn't every new system from Sun use the Gigaplane XB
bus?
Objectives
Upon completion of this lab, you should be able to:
Tasks
System Configuration
Complete this step:
Objectives
Upon completion of this module, you should be able to:
Relevance
● What parts of I/O tuning are related to concepts you have already
seen?
Additional Resources
Additional resources – The following references can provide
additional details on the topics discussed in this module:
● Ridge, Peter M., and David Deming. 1995. The Book of SCSI. No
Starch Press.
● Man pages for the prtconf command and performance tools (sar,
iostat).
The SCSI bus was designed in the early 1980s to provide a common
I/O interface for the many small computer systems then available.
Each required its own proprietary interface, which made development
of standard peripherals impossible. The SCSI bus helped solve this
problem by providing a standard I/O interface.
The SCSI bus was originally shipped in 1982, and its initial standard
(SCSI-1) was approved in 1986. The current level of standardization is
SCSI-3, with further work continuing.
The SCSI standard specifies support for many different devices, not
just disk and tape, but diskette drives, network devices, scanners,
media changers, and others.
SCSI Speeds
The SCSI bus supports four different speeds (the rate at which bytes
are transferred along the cables). All SCSI devices support the async
speed. Faster speeds are negotiated with the device by the system
device driver, and can usually be determined with the prtconf
command (discussed later in this module).
The same bus may have devices operating at different speeds; there is
no problem in doing this. Bus arbitration (the process of determining
which device may use the bus) always occurs at async rate.
However, it may not be a good idea to put slow devices on the bus
since they can slow the response of faster devices on the bus.
You can limit the speed at which the bus will transfer, and the use of
some SCSI capabilities, such as tagged queueing, by setting the
scsi_options parameter in the /etc/system file. Normally, this
would not be necessary, but for some devices it may be required. It can
also be set (through the OpenBoot™ PROM [OBP]) on an individual
SCSI controller basis.
SCSI Widths
The width of a SCSI bus specifies how many bytes may be transferred
in parallel down the bus. Multiplied by the speed, this gives the
bandwidth of the bus.
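For example, an Ultra (20-MHz) wide (2-byte) bus provides
20 MHz x 2 bytes = 40 Mbytes/sec, while the same speed on a narrow
(1-byte) bus provides 20 Mbytes/sec. (These are illustrative figures;
prtconf shows what was actually negotiated for a given device.)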
Just like speed, width is negotiated by the driver and device at boot
time, and can be mixed on the same bus. Again, prtconf will provide
the current setting for a device.
Each width uses a different cable. While quad devices are essentially
nonexistent, it is possible to use an adapter cable to mix narrow and
wide devices on a bus, regardless of whether the host adapter is
narrow or wide.
Keep in mind that a wide bus, once it has been connected to narrow
devices, cannot be made wide again.
SCSI Lengths
The SCSI bus is limited, in either case, to the distance that the signals
can travel before enough voltage is lost to make distinguishing a zero
from a one unreliable. The lengths given in the above table are
maximums; every device on the bus and every cable joint (butt) takes
an additional 18 inches (0.5 meter) out of the total.
Since you might have devices of different capabilities on the same bus,
it is not uncommon to find performance problems resulting from this.
A slower device may prevent a faster device from performing an
operation, sometimes for a significant period of time. This will be
discussed later in this module.
On a narrow SCSI bus, there are 57 possible devices; on a wide bus, 121.
It is commonly thought that there are only 8, which is an
understandable misconception.
Embedded SCSI devices support only one lun per target. Most SCSI
devices available today are embedded, causing the confusion about
the number of devices supported on the bus.
A narrow bus supports 8 targets, a wide bus 16. (This is due to the use
of cable data lines to arbitrate bus usage.) The host adapter has only
one lun.
The SCSI terminator does not take a bus target position. It is invisible
to the host adapter and devices, and is used to control the electrical
characteristics of the bus.
This is not the case on a SCSI bus. Each target has a priority. When a
device on a target wants to transmit, and the bus is free, an arbitration
process is performed. Lasting about 1 millisecond, the arbitration
phase requests all devices that need to transmit to identify themselves,
and then grants the use of the bus to the highest priority device
requesting it. Obviously, higher priority devices get better service.
Fibre Channel
Fibre Channel operates on both fiber cable and copper wire and can be
used for more than just disk I/O. The Fibre Channel specification
supports high-speed system and network interconnects using a wide
variety of popular protocols, including:
● IP – Internet Protocol
Fibre Channel has been adopted by the major computer systems and
storage manufacturers as the next technology for enterprise storage. It
eliminates distance, bandwidth, scalability, and reliability issues of
SCSI.
A queued request must also wait for any other requests ahead of it. A
large number of queued requests suggests that the device or bus is
overloaded. The queue can be reduced by several techniques: use of
tagged queueing, a faster bus, or moving slow devices to a different
bus.
# sar -d 1
Seek Time
Seek time is the amount of time required by the disk drive to position
the arm to the proper disk cylinder. It is usually the most costly part of
the disk I/O operation. Seek times quoted by disk manufacturers are
average seek times, or the time that the drive takes to seek across 1/2
of the (physical) cylinders. With tagged queueing active, this time is
not representative of the actual cost of a seek.
Rotation Time
Rotation time, or latency, is the amount of time that it takes for the
disk to spin under the read/write heads. It is usually quoted as 1/2 of
the time a complete rotation takes. It is not added into the seek time
when a manufacturer quotes disk speeds, although a figure for it is
provided in RPM (revolutions per minute).
A typical drive, for example, might have a 9-ms average seek time and
a 4.1-ms average rotation time.
Reconnection Time
Once the data has been transferred to or from cache, the request has to
be completed with the host adapter. An arbitration and request status
transfer must be performed, identical to that done at the start of the
request. The only difference is that the arbitration is done at the
device’s priority, not that of the host adapter.
If the request was a read, the data is transferred to the system as part
of this operation.
Timing is similar to host adapter arbitration, about 1.5 ms, plus any
data transfer time.
Interrupt Time
This is the amount of time that it takes for the completion interrupt for
the device to be handled. It includes interrupt queueing time (if higher
priority interrupts are waiting), time in the interrupt handler, and time
in the device driver interrupt handler.
This overhead image shows the calculation of the total cost of an I/O
operation.
Latency Timing
Table 8-2 Latency Timing Calculation
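The table itself is not reproduced here. As an illustration only, using the
figures quoted in this module, and assuming roughly 1 ms of data
transfer and 0.5 ms of interrupt handling (both assumed values), a single
random read might cost approximately:
1.0 ms (arbitration) + 9.0 ms (seek) + 4.1 ms (rotation) +
1.0 ms (transfer) + 1.5 ms (reconnection) + 0.5 ms (interrupt) = 17.1 ms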
The MZR drive is divided into zones, usually 12 to 14, each of which
has a constant density. As the zones move inward, the number of
sectors on each track decreases. The drive contains a table of which
sectors are in each zone. There might be 60 sectors in an inner zone
and 120 or more in an outer zone.
The drive then takes the requested sector number, translates it using
the zone table, and determines the actual physical location on the
drive to be accessed. The varying, true geometry of the drive is hidden
from the OS.
This puts denser cylinders on the outside of the drive: more data can
be accessed without moving the arm. This difference is noticeable.
When accessing data sequentially, the performance difference from the
outermost to the innermost cylinders can be as much as 33 percent.
MZR also improves the ITR for this area of the disk.
Disk drives, and most SCSI devices, today contain significant amounts
of data cache. When an I/O request is made, data is transferred in and
out of cache on the device before the actual device recording
mechanism is used. This allows the device to transfer data at the speed
of the bus rather than at its more limited ITR.
Fast Writes
Fast writes allow the device to signal completion of a write operation
when data has been received in the cache, before it is written to the
media. (The disk drive’s cache becomes write back instead of write
through.)
Tagged Queueing
With ordered seeks, sometimes called elevator seeks, the arm makes
passes back and forth over the drive. As the location for a request is
reached, the request is executed. While the requests are fulfilled out of
order, the arm motion is greatly reduced. The average execution time
for a request is significantly less than it would be otherwise.
To avoid long delays for requests at the opposite end of the drive,
requests that arrive after a pass has started are queued for the return
pass.
Mode Pages
Mode pages are blocks of data that are returned by a SCSI device in
response to the appropriate request. The SCSI standard defines several
specific mode pages that must be supported by every SCSI device, and
some additional ones for certain device types. Device manufacturers
usually add their own vendor-specific mode pages to the device.
Specific mode pages are identified by number.
This information can also be retrieved from the device mode pages.
9c40 (hex) = 40,000 Kbytes/sec = 40 Mbytes/sec
4e20 (hex) = 20,000 Kbytes/sec = 20 Mbytes/sec
2710 (hex) = 10,000 Kbytes/sec = 10 Mbytes/sec
1388 (hex) = 5,000 Kbytes/sec = 5 Mbytes/sec
Only Sun workstations with esp, isp, fas, ptisp, and glm SCSI
controllers (sun4c/4m/4e/4d/4u) running SunOS 4.1 or later are
supported. The isp, fas and glm controllers are only supported under
the SunOS 5.x environment.
When you are planning a disk configuration, you must design for the
type of load the system will create.
700 Mbytes per second ÷ 5.5 Mbytes per second per RSM drive = 128 drives
The best solution is to move the slow device(s) to another bus, or try to
split the load of highly active devices.
The array provides a single interface (the front end) into a series of SCSI
buses accessing the array drives (the back end). The front end is
configured as any individual storage device and assigned an
appropriate target number. (Customarily the array will be the only
device on the bus due to its high I/O rate.)
The back end is tuned as a series of SCSI buses. Use of the drives is
planned just as if they were drives configured on buses that are not in
an array. Target priorities apply as in a regular SCSI bus.
RAID
As with any technology, there are cases where the use of certain RAID
levels is appropriate and cases where it is not. The incorrect use of a
particular RAID level, especially RAID-5, can destroy the system’s I/O
performance.
RAID-0
When more I/O requests are submitted than the disk can handle, a
queue forms and requests must wait their turn. Striping spreads the
data evenly across all of the volume’s member disks in order to
provide a perfect mechanism for avoiding long queues.
When you initiate many physical I/O operations for each logical I/O
operation, you can easily saturate the controller or bus to which the
drives are attached. You can avoid this problem by distributing the
drives across multiple controllers. An even better solution would be to
put each drive on a private controller.
None of the drives on a single SCSI bus should be part of the same
RAID group. Adding more controllers and managing I/O distribution
across those controllers is critical to achieving optimal performance
from a RAID configuration.
RAID-1
When the failed drive is replaced, the data in the good drive must be
copied to the new drive. This operation is known as synching the
mirror, and can take some time, especially if the affected drives are
large. Again, data access is uninterrupted, but the I/O operations
needed to copy the data to the new drive steal bandwidth from user
data access, possibly decreasing overall performance.
RAID-0+1
Pure RAID-1 suffers from the same problems as pure RAID-0: data is
written sequentially across the volumes, potentially making one drive
busy while the others on that side of the mirror are idle. You can avoid
this problem by striping within your mirrored volumes, just like you
striped the volumes of your RAID-0 group.
RAID-3
RAID-5
A simple write operation results in two reads and two writes, a four-
fold increase in the I/O load. These multiple writes for a single I/O
operation introduce data integrity issues. To maintain data integrity,
most RAID software or firmware uses a commit protocol similar to that
used by database systems:
1. Member disks are read in parallel for the parity computation.
2. The new parity block is computed.
3. The modified data and new parity are written to a log.
4. The modified data and parity are written to the member disks in
parallel.
5. The parts of the log associated with the write operation are removed.
The best way to tune the I/O subsystem is to cut the number of I/O
operations that are done. This can be done in several ways: requesting
larger blocks (transfer time is a small part of the overall cost of an I/O
operation), tuning main memory properly, tuning the UFS caches
appropriately, and using write cancellation.
Remember, the I/O subsystem pays the price for memory problems,
since problems with main memory (improper cache management)
generate more I/O operations due to the need to move data in and out
of storage. Do not spend a lot of time tuning the I/O subsystem if you
have significant memory performance problems.
Spread the I/O activity across as many devices and buses as possible
to try to avoid hotspots or overloaded components. iostat can be
used to measure how well disk loads are spread across each disk on a
server system. sar -d can be used to capture information about
partition usage.
The overhead image above shows some of the most important I/O
subsystem data available from sar and iostat. The data, while
coming from the same origin in the kernel, may be reported differently
by the tools. Be sure you take this into account when comparing data
from the different tools.
# sar -d
# sar -u
● %wio – The percentage of time the processor is idle and waiting for
I/O completion
● svc_t – The average I/O service time. Normally, this is the time
taken to process the request once it has reached the front of the
queue, however, in this case, it is actually the response time of the
disk.
The overhead image shows the definition of the kernel fields reported
in the tuning reports.
The iostat service time is not what it appears to be. It is the sum of
the times taken by a number of items, only some of which are due to
disk activity. The iostat service time is measured from the moment
that the device driver receives the request until the moment it returns
the information. This includes driver queueing time as well as device
activity time (true service time).
If the drive does not support tagged queuing, multiple requests are not
reordered and each request is dealt with serially. Time spent by one
request waiting for any previous requests to complete will affect its
reported service time.
Request Number    Service Time (Each Request)    Response Time
1                 29.65 ms                       29.65 ms
2                 15.67 ms                       45.32 ms
3                 13.81 ms                       59.13 ms
4                 15.05 ms                       74.18 ms
5                 14.52 ms                       88.70 ms
The average iostat service time is the sum of all response times
divided by five: (29.65 + 45.32 + 59.13 + 74.18 + 88.70) ÷ 5 = 59.40 ms.
The iostat report would include in its measure the time request 5
spent waiting for requests 1 to 4 to clear. In this particular scenario, it
is possible to get from the reported figure back to an average service
time for each request using the following formula:
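The formula itself appears on an overhead that is not reproduced here.
One reasonable approximation (an assumption, not the original formula):
if n requests arrive together and are served one after another with
roughly equal true service times S, request i completes after i x S, so the
reported average is about S x (n + 1) ÷ 2. Solving for S gives
S ≈ 2 x 59.40 ÷ (5 + 1) ≈ 19.8 ms, close to the true average of
88.70 ÷ 5 = 17.74 ms; the difference comes from the slower first request.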
siostat.se
The siostat.se program is a version of iostat that shows the real
definition of service times, which is the time it takes for the first
command in line to be processed.
# /opt/RICHPse/bin/se /opt/RICHPse/examples/siostat.se
16:37:21 ------throughput------ -----wait queue----- ----active queue----
disk r/s w/s Kr/s Kw/s qlen res_t svc_t %ut qlen res_t svc_t %ut
c0t6d0 0.0 0.0 0.0 0.0 0.00 0.00 0.00 0 0.00 0.00 0.00 0
c0t0d0 7.0 0.0 49.0 0.0 0.00 0.00 0.00 0 0.05 7.49 7.49 5
c0t0d0s0 7.0 0.0 49.0 0.0 0.00 0.00 0.00 0 0.05 7.48 7.48 5
c0t0d0s1 0.0 0.0 0.0 0.0 0.00 0.00 0.00 0 0.00 0.00 0.00 0
c0t0d0s2 0.0 0.0 0.0 0.0 0.00 0.00 0.00 0 0.00 0.00 0.00 0
c0t0d0s3 0.0 0.0 0.0 0.0 0.00 0.00 0.00 0 0.00 0.00 0.00 0
fd0 0.0 0.0 0.0 0.0 0.00 0.00 0.00 0 0.00 0.00 0.00 0
^C#
xio.se
The xio.se program is a modified xiostat.se that tries to determine
whether accesses are mostly random or sequential by looking at the
transfer size and typical disk access time values. Based on the expected
service time for a single random access read of the average size, results
above 100 percent indicate random access, and those under 100
percent indicate sequential access.
# /opt/RICHPse/bin/se /opt/RICHPse/examples/xio.se
extended disk statistics
disk r/s w/s Kr/s Kw/s wtq actq wtres actres %w %b svc_t %rndsvc
c0t6d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.0 0.0
c0t0d0 9.0 7.0 72.0 56.0 0.0 0.4 0.0 25.5 0 16 10.1 65.4
c0t0d0s0 4.0 0.0 32.0 0.0 0.0 0.0 0.0 7.5 0 3 7.5 48.7
c0t0d0s1 5.0 7.0 40.0 56.0 0.0 0.4 0.0 31.4 0 13 10.9 70.7
c0t0d0s2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.0 0.0
c0t0d0s3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.0 0.0
fd0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.0 0.0
^C#
iomonitor.se
The iomonitor.se program is like iostat -x, but it does not print
anything unless it sees a slow disk. An additional column headed
delay prints out the total number of I/Os multiplied by the service
time for the interval. This is an aggregate measure of the amount of
time spent waiting for I/Os to the disk on a per-second basis. High
values could be caused by a few very slow I/Os or many fast ones.
iost.se
The iost.se program is iostat -x output that does not show
inactive disks. The default is to skip 0.0 percent busy.
# /opt/RICHPse/bin/se /opt/RICHPse/examples/iost.se
extended disk statistics
disk r/s w/s Kr/s Kw/s wait actv svc_t %w %b
c0t0d0 0.0 1.0 0.0 8.0 0.0 0.0 20.1 0 2
c0t0d0s0 4.0 1.0 32.0 8.0 0.0 0.0 8.2 0 4
TOTAL w/s 2.0 16.0
r/s 4.0 32.0
^C#
disks.se
The disks.se program prints out the disk instance to
controller/target mappings and partition usage.
# /opt/RICHPse/bin/se /opt/RICHPse/examples/disks.se
se thinks MAX_DISK is 8
kernel -> path_to_inst -> /dev/dsk part_count [fstype mount]
sd6 -> sd6 -> c0t6d0 0
sd0 -> sd0 -> c0t0d0 3
c0t0d0s0 ufs /
c0t0d0s1 swap swap
c0t0d0s3 ufs /cache
sd0,a -> sd0,a -> c0t0d0s0 0
sd0,b -> sd0,b -> c0t0d0s1 0
sd0,c -> sd0,c -> c0t0d0s2 0
sd0,d -> sd0,d -> c0t0d0s3 0
fd0 -> fd0 -> fd0 0
#
Before continuing on to the next module, check that you are able to
accomplish or answer the following:
● What would happen if the SCSI initiator was not the highest
priority device on the bus? How could you tell?
Objectives
Upon completion of this lab, you should be able to:
Tasks
Target ________
Speed ____________
Target ________
Speed ____________
Type ____________
Speed ____________
Physical ____________
Logical ____________
Arbitration ____________
Rotation ____________
Status ____________
Total ____________
5. Run an iostat report for the device. Do the figures (for a lightly
loaded system) seem similar?
Objectives
Upon completion of this module, you should be able to:
Relevance
● Does the use of storage arrays change the way a partition should
be tuned?
Additional Resources
Additional resources – The following references can provide
additional details on the topics discussed in this module:
The SunOS operating system has the ability to access SunOS file
systems on local disks, remote file systems via NFS, and many other
types of file systems, such as the High Sierra File System and MS-DOS
file systems. This is accomplished using vnodes.
The vnode provides a generic file interface at the kernel level, so the
calling component does not know what sort of file system it is dealing
with. Since it is transparent at the kernel level, it is also transparent to
the user.
The different file systems are manipulated through the vfs (virtual file
system) structure, which composes the mounted file system list. A vfs
is like a vnode for a partition.
All files, directories, named pipes, device files, doors, and so on are
manipulated using the vnode structure.
From the user level, any file is viewed as a stream of bytes. The
underlying hardware manages the file data as blocks of bytes. The
minimum size the hardware can write or read from disk is a sector of
512 bytes.
This diagram shows the order in which mkfs (called by newfs) lays out
the file system on disk.
Disk Label
The disk label is the first sector of a partition. It is only present for the
first partition on a drive. It contains the drive ID string, a description
of the drive geometry, and the partition table, which describes the start
and extent of the drive’s partitions or slices.
Boot Block
The remainder of the first 8-Kbyte block in the partition contains the
boot block if this is a bootable partition. Otherwise the space is
unused.
Cylinder Groups
The remainder of the partition is divided into cylinder groups (cg). A
cylinder group has a default size of 16 cylinders (in Solaris 2.6 and
later environments, it is calculated based on partition size), regardless
of the disk geometry. It can be made significantly larger.
Backup Superblock
The next part of the cylinder group contains a share of the partition’s
inodes, just after the cgb. The total number of inodes in the partition
is calculated based on the newfs -i operand and the size of the
partition. The resulting total number of inodes is then divided by the
number of cylinder groups, and that number of 128-byte inodes is
then created in each cylinder group.
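For example (an illustration only, not a recommendation), a 2-Gbyte
partition created with newfs -i 8192 would receive roughly
2 Gbytes ÷ 8192 bytes = 262,144 inodes; if the partition had 128 cylinder
groups, each group would hold about 2,048 of them.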
A bit map in the cgb tracks which inodes are allocated and free.
Data Blocks
The Directory
Any holes in the directory caused by deletion, and the leftover space at
the end of a directory block, are accounted for by having the previous
entry’s record length include the unused bytes. Deleted entries at the
beginning of a 512-byte block are recognized by an inode number of 0.
New entries are allocated on a first-fit basis, using the holes in the
directory structure.
To save time, each used directory entry is cached in the directory name
lookup cache. Prior to the Solaris 7 OS, the directory component name
(between the /'s) had to be 31 bytes or smaller to be cached (the actual
field is 32 characters long, but the last character is a termination
character); if it was not, the system had to go to disk every time the long
name was requested. The Solaris 7 OS has no length restriction.
The inode cache and the DNLC are closely related. For every entry in
the DNLC there will be an entry in the inode cache. This allows the
quick location of the inode entry, without needing to read the disk.
As with most any cache, the DNLC and inode caches are managed on
a least recently used (LRU) basis; new entries replace the old. But, just
as in main storage, entries that are not being currently used are left in
the cache on the chance that they might be used again.
If the cache is too small, it will grow to hold all of the open entries.
Otherwise, closed (unused) entries remain in the cache until they are
replaced. This is beneficial, for it allows any pages from the file to
remain available in memory. This allows files to be opened and read
many times, with no disk I/O occurring.
You can see inode cache statistics by using netstat -k to display the
raw kernel statistics information.
# netstat -k
kstat_types:
raw 0 name=value 1 interrupt 2 i/o 3 event_timer 4
segmap:
fault 2763666 faulta 0 getmap 4539376 get_use 973 get_reclaim 4023294 get_reuse 508075
...
...
inode_cache:
size 1922 maxsize 4236 hits 41507 misses 202147 kmem allocs 32258 kmem frees 30336
maxsize reached 5580 puts at frontlist 189314 puts at backlist 17404
queues to free 0 scans 139285773 thread idles 199953 lookup idles 0 vget idles 0
cache allocs 202147 cache frees 222194 pushes at close 0
...
...
buf_avail_cpu0 0
As mentioned earlier, the DNLC and inode caches should be the same
size since each DNLC entry has an inode cache entry.
Entries removed from the inode cache are also removed from the
DNLC. Since the inode cache caches only UFS data (inodes) and the
DNLC caches both NFS (rnodes) and UFS (inodes) data, the ratio of
NFS to UFS activity should be considered when modifying ncsize.
vmstat -s will display the DNLC hit rate and lookups that failed
because the name was too long. The hit rate should be 90 percent or
better. The sar options used with the DNLC and inode caches are
covered near the end of this module.
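For example (illustrative output):
# vmstat -s | grep 'name lookups'
  2297224 total name lookups (cache hits 96%)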
The inode
The diagram in the overhead image above shows most of the contents
of the on-disk inode structure.
The file size is an 8-byte integer (long) field. As of the SunOS 5.6
operating system, the largest file size supported by UFS is 1 Tbyte. To
use more than 2 Gbytes, programming changes are needed in most
applications. (See the man page for largefiles(5) for more
information.)
The number of sectors is kept as well as the file size due to UFS
support for sparse or "holey" files, where disk space has not been
allocated for the entire file. It can be seen with the ls -s command,
which also gives the actual amount of disk space used by the file.
The shadow inode pointer is used with access control lists (ACLs).
To turn off these updates, set the sticky bit for the file using, for
example, chmod u+t. The file’s execute permission bit must be off.
See the man page for sticky(4) for more information on the sticky
bit.
The inode contains two sets of block pointers, direct and indirect.
Each direct pointer points to one data block, usually 8 Kbytes. With 12
direct block pointers, this covers the first 96 Kbytes of the file.
If the file is larger than 96 Kbytes, the indirect block pointers are used.
The first indirect pointer points to an indirect block: an 8-Kbyte block
divided into 2048 pointers to data blocks. The use of this indirect block
covers an additional 16 Mbytes of data.
If the file is larger than 16 Mbytes, the second indirect pointer is used.
It points to a block of 2048 addresses, each of which points to another
indirect block. This allows for an additional file size of 32 Gbytes.
The third pointer points to a third level indirect block, which allows an
additional 64 Tbytes to be addressed. However, remember that the
maximum file size is limited to 1 Tbyte at this time.
The UFS buffer cache (also known as metadata cache) holds inodes,
cylinder groups, and indirect blocks only (no file data blocks).
The default may be too much on systems with very large amounts of
main memory. You can reduce it to a few Mbytes with an entry in the
/etc/system file.
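For example, to limit the cache to roughly 8 Mbytes, an entry such as
the following could be used (the value is in Kbytes and is only a
suggested starting point):
set bufhwm=8000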
You can use the sysdef command to check the size of the cache. Read
and write information is reported by sar -b, as lreads and lwrites.
The UFS buffer cache is sometimes confused with the segmap cache,
which is used for managing application data file requests.
The output of the sysdef command includes the size of the buffer
cache (bufhwm).
Hard Links
The inode number contained in the directory entry creates a hard link
between the file name and the inode.
A file always has a single inode, but multiple directory entries can
point to the same inode. The reference count in the inode structure is
the number of hard links referencing the inode. When the last link is
deleted (unlinked), the inode can be deallocated and the associated
file space freed.
There are no back pointers from the inode to any directory entries.
An entry in the DNLC is made for each hard link, but the same inode
cache entry is used for each.
Since the hard links use inode pointers, they can be used only within
a partition. To reference a file in another partition, symbolic or soft
links must be used.
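A brief illustration with hypothetical paths (the exact error text may
vary slightly between releases):
# ln /export/data/file1 /export/data/file1.lnk
# ln /export/data/file1 /var/tmp/file1.lnk
ln: cannot create /var/tmp/file1.lnk: cross-device link
# ln -s /export/data/file1 /var/tmp/file1.lnk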
Symbolic Links
A fast symbolic link is used when the path name that is being linked is
less than or equal to 56 characters. The 56 characters represent the
fourteen 4-byte unused direct and indirect block pointers, where the
file name referenced by the link is placed.
UFS tries to evenly allocate directories, inodes, and data blocks across
the cylinder groups. This provides a constant performance (discussed
on the next page) and provides a better chance for related information
to be kept close by.
Allocation Performance
Many non-UFS file systems allocate from the front of the partition to
the rear, regularly compressing the partition to move data to the front.
The benefit of the UFS approach is that the disk arm would move, on
average, half of the partition regardless of how full the partition is. In
a compacted file system, the arm will move further and further, taking
longer and longer, as the partition fills up. While still fast, it creates the
perception that the system is slowing as the file system fills.
The reason the last step invokes a panic is that the file system
allocation data must be incorrect. If the partition was known to be full,
the allocation request would have been failed immediately. Since it
was not, the file system indicated that there was available space. When
none was found, the file system was inconsistent. The panic will cause
the system to run fsck during the reboot, correcting the information.
Quadratic Rehash
When the "ideal" cylinder group is full, the quadratic algorithm is used
to try to locate a cylinder group with free space. The amount of free
space is not considered.
The sooner the quadratic rehash algorithm finds free space, the sooner
it stops; in a large, nearly full partition, the search can be very time
consuming. To keep the search short, the partition has a certain
amount of space set aside (as seen with the df command) that can be
allocated only by the root user.
Prior to the SunOS 5.6 operating system, this reserve was 10 percent
of the available space in the partition. In the Solaris 2.6 OS and higher,
the amount reserved depends on the size of the partition, running
from 10 percent in small partitions to 1 percent in large ones.
For example, starting from a full cylinder group, the search might
proceed through the following sequence of cylinder groups, doubling
the increment each time and wrapping around at the end of the
partition:
Cylinder Group    Next Increment
      13                 1
      14                 2
      16                 4
      20                 8
      28                16
      11                32
      10                64
      ...
Fragments
The number of fragments per block can be set from 1 to 8 (the default)
by using newfs.
The rules for fragment allocation are given in the overhead image
above.
The ability of the UFS to allocate space as required and support files of
practically unlimited size is the key to its flexibility.
Keeping all of these structures (inodes, cylinder groups, and free maps)
consistent on disk requires extra, carefully ordered writes. To avoid this
overhead, a logging file system can be used. The logging file system
writes a record of the allocation transaction to the file system log; the
updates to the on-disk structures can be done at a later time.
This also provides a significant benefit in recovery time: the entire file
system does not need to be checked. Only the log transactions need to
be checked and rewritten if necessary.
Application I/O
The file system pages in the segmap cache may be shared with other
processes’ segmap caches, as well as other access mechanisms such as
mmap.
Since the segmap cache is pageable, a block shown in the cache may
not be present in memory. It will be page faulted back in if it is
required again.
Access to the segmap cache is done by the read and write system
calls.
The system call checks the cache to see if the data is present in the
cache. If it is not, the segmap driver requests that the page be located
and allocated to the selected cache’s virtual address. If it is not already
in memory, it is page faulted into the system. (For a full block write,
the read is not necessary.) The specified data is then copied between
the user region and the cache.
Remember, the segmap cache is write back, not write through. The
program will be told that I/O is complete, even though it has not
started. When an application must wait until the data is on disk before
continuing, it can open the file with the O_SYNC or O_DSYNC flags.
sar options -a, -g, and -v provide data relevant to file system caching
efficiency and the cost of cache misses.
● iget/s – The number of requests made for inodes that were not in
the DNLC (reported by sar -a).
The one field in the sar -g output that is relevant to file system
caching is:
● %ufs_ipf – The percentage of UFS inodes taken off the free list by
iget that had reusable pages associated with them. These pages
are flushed and cannot be reclaimed by processes. Thus, this is the
percentage of iget calls with page flushes. A high value indicates
that the free list of inodes is page bound and the number of UFS
inodes may need to be increased.
# sar -v
● file-sz – The number of entries and the size of the open system
file table.
fsflush
Since main memory is treated as a write-back cache, and UFS treats the
segmap cache the same way, modified file data could otherwise remain
in memory indefinitely. The fsflush daemon prevents this by waking
up periodically and writing modified pages that have aged past a
threshold out to disk.
Direct I/O
Direct I/O is application I/O that does not pass through the segmap
cache. Because all application I/O passes through the segmap cache by
default, direct I/O must be requested explicitly. The direct I/O facility
is new with the Solaris 2.6 operating system.
Direct I/O can be requested in two ways: per file, by the application
using the directio library call, and as the default for an entire
partition using the forcedirectio mount option.
Using direct I/O on an entire partition allows the use of a UFS file
system for data that would otherwise require raw partitions, such as
database table spaces.
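For example, assuming a hypothetical device and mount point, an
entire partition can be mounted for direct I/O as follows:
# mount -F ufs -o forcedirectio /dev/dsk/c0t1d0s6 /db01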
Even when direct I/O has been requested, it is used only if certain
conditions are met, including:
● The specific read or write request is disk sector aligned (on a 512-
byte boundary), and the request is for an integer number of
sectors.
Using madvise
Cylinder Groups
For data-intensive file systems, the default number of cylinders per
cylinder group (16) may be too small. If the average file size is
15 Mbytes, consider increasing the number of cylinders per cylinder
group, based on the previous computation.
Remember, the file system tries to keep files in a single cylinder group,
so make sure that the cylinder groups are big enough.
# newfs -c 32 /dev/rdsk/xxx
Waiting for a disk I/O every sixteen 512-byte reads is still terribly
inefficient. Most file systems implement a read-ahead algorithm in
which the file system can initiate a read for the next few blocks as it
reads the current block. Because the access pattern is repeating, the file
system can predict that it is very likely the next block will be read,
given the sequential order of the reads.
The following criteria must be met to engage UFS read ahead:
● The last file system read must be sequential with the current one.
● There must be only one concurrent reader of the file; reads from
other processes break the sequential access pattern for the file.
● The blocks of the file being read must be sequentially laid out on
the disk.
● The file must be accessed through the read and write system calls;
memory-mapped files do not use UFS read ahead.
The UFS file system uses the cluster size (maxcontig parameter) to
determine the number of blocks that are read ahead. This defaults to
seven 8-Kbyte blocks (56 Kbytes) in Solaris OS versions prior to the
Solaris 2.6 OS. In the Solaris 2.6 OS, the default changed to the
maximum transfer size supported by the underlying device, which
defaults to sixteen 8-Kbyte blocks (128 Kbytes) on most storage devices.
For this file system, you can see that the cluster size (maxcontig) is 16
blocks of 8,192 bytes, or 128 Kbytes.
With seven separate disk devices in the RAID volume, you have the
ability to perform seven I/O operations in parallel. Ideally, you should
also have the file system issue a request that initiates I/Os on all seven
devices at once. To split the I/O into exactly seven components when
it is broken up, you must initiate an I/O the size of the entire stripe
width of the RAID volume. This requires you to issue I/Os that are
128 Kbytes X 7, or 896 Kbytes each. In order to do this, you must set
the cluster size to 896 Kbytes.
The cluster size can be set at the time the file system is created using
the newfs command. Cluster size can be modified after the file system
is created by using the tunefs command. To create a UFS file system
with an 896-Kbyte cluster size, issue the following command:
# newfs -C 112 /dev/rdsk/xxx
To modify the cluster size after the file system has been created using
the tunefs -a command, use:
# tunefs -a 112 /dev/rdsk/xxx
To maximize the amount of I/O to a file system, the file system needs
to be configured to write as large as possible I/Os out to disk. The
maxcontig parameter is also used to cluster blocks together for I/Os.
maxcontig should be set so that the cluster size is the same as the
stripe width in a RAID configuration.
The UFS file system uses the same cluster size parameter, maxcontig,
to control how many writes are grouped together before a physical
write is performed. The guidelines used for read ahead should also be
applied to write behind. Again, if a RAID device is used, care should
be taken to align the cluster size to the stripe size of the underlying
device.
The iostat command shows 39.8 write operations per second to the
disk ssd64, averaging 5,097.4 Kbytes/sec written to the disk. This
works out to an average transfer size of 128 Kbytes.
# iostat -x 5
Note – The UFS clustering algorithm will only work properly if one
process or thread writes to the file at a time. If more than one process
or thread writes to the same file concurrently, the delayed write
algorithm in UFS will begin breaking up the writes into random sizes.
The UFS file system limits the amount of outstanding I/O (waiting to
complete) to 384 Kbytes per file. If a process has a file that reaches the
limit, the entire process is suspended until the resume limit has been
reached. This prevents the file system from flooding the system
memory with outstanding I/O requests.
The default parameters for the UFS write throttle prevent you from
using the full sequential write performance of most disks and storage
systems. If you ever have trouble getting a disk, stripe, or RAID
controller to show up as 100 percent busy when writing sequentially,
the UFS write throttle is the likely cause.
To raise the throttle, the high- and low-water marks (in bytes) can be
increased with entries in /etc/system such as the following:
set ufs:ufs_HW=16777216
set ufs:ufs_LW=8388608
The lseek system call shows the offset within the file in hexadecimal.
In this example, the first two seeks are to offset 0x0D780000 and
0xA6D0000, respectively. These two addresses appear to be random,
and further inspection of the remaining offsets shows that the reads
are completely random.
Look at the argument to the read system call to see the size of each
read, which is the third argument. In this example, every read is
exactly 8,192 bytes, or 8 Kbytes. Thus you can see that the seek pattern
is completely random, and that the file is being read in 8-Kbyte blocks.
● Try to match the I/O size and the file system block size.
● Disable prefetch and read ahead, or limit read ahead to the size of
each I/O.
● Disable file system caching when the application does its own
caching (for example, in databases).
For workloads that include a large proportion of writes, try to make
the I/O size a multiple of the file system block size. A write that is not
a multiple of the block size requires the old block to be read, the new
contents to be merged in, and the whole block to be written out again.
Applications that do odd-sized writes should be modified to pad each
record to the nearest block size multiple to eliminate this
read-modify-write cycle. The -b option for the newfs command sets
the block size in bytes.
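For example, the following creates a file system with an 8-Kbyte block
size, which is the default on most systems (the device name is a
placeholder, as in the earlier examples):
# newfs -b 8192 /dev/rdsk/xxx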
For example, consider a database that does three reads from a storage
device to retrieve a customer record from disk. If the database uses
500 microseconds of CPU time to process the record and each disk
read takes about 5 milliseconds, retrieving the record takes roughly
15.5 milliseconds in total (0.5 + 3 x 5); about 97 percent of that time is
spent waiting for disk reads.
You can dramatically reduce the amount of time spent waiting for
I/Os by using memory to cache previously read disk blocks. If that
disk block is needed again, it is retrieved from memory. This
eliminates the need to go to the storage device again.
Setting the cluster size, maxcontig, to 1 will reduce the size of the
I/Os from the file system to disk. Use newfs -C 1
/dev/rdsk/xxx when creating the file system and tunefs -a 1
/dev/rdsk/xxx when modifying an existing file system.
Once a partition has been created, many of its characteristics are fixed
until it is re-created. There are several post-creation strategies that can
be used to affect the performance of the partition using tunefs.
Fragment Optimization
As discussed earlier, when a file system is optimized for time and a
file needs to grow but there is not enough room in its last block for
another fragment, the kernel selects an empty block and copies the
file’s existing fragments from the old block to the new one.
When a file system is optimized for space and a file needs to grow but
there is not enough room in its last block for another fragment, the
kernel looks for a block with a number of free fragments exactly equal
to what the file needs and copies the fragments to that block. In other
words, it uses a best-fit policy.
maxbpg
The UFS file allocation policy allows only a certain number of blocks
from a single file in each cylinder group. The default is 16 Mbytes,
which is often smaller than the size of a cylinder group. To ensure that
a large file can use an entire cylinder group (keeping the file more
contiguous), increase this limit. The limit is set with the tunefs -e
parameter and cannot be set with newfs. A recommended value that
should work for all cylinder group sizes is 4,000,000.
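For example, using the placeholder device name from the earlier
examples:
# tunefs -e 4000000 /dev/rdsk/xxx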
Do not change this value if performance of small files in the file system
is important.
Before continuing on to the next module, check that you are able to
accomplish or answer the following:
● What will the effect on UFS be if there is not enough main storage?
● Why does the UltraSPARC not support the 4-Kbyte UFS block
size?
● What are the most expensive UFS operations? Why can they not
be avoided?
Objectives
Upon completion of this lab, you should be able to:
Tasks
File Systems
Complete the following steps:
size__________________________
ncg___________________________
minfree_______________________
maxcontig_____________________
optim_________________________
nifree________________________
ipg___________________________
___________________________________________________________
Using the values for nifree and ipg, calculate how many
allocated or used inodes are present in the partition. Add 50
percent to this number as a reasonable approximation of how
many might be needed. Check your answer with df -Fufs -o i.
___________________________________________________________
At 128 bytes per inode, how much space is wasted with extra
inodes? What percentage of the partition space is this?
___________________________________________________________
___________________________________________________________
___________________________________________________________
./dnlc.sh
ncsize ____________________________________________________
ufs_ninode _______________________________________________
2. Type:
vmstat -s
toolong ___________________________________________________
Note – The vmstat toolong field will not be present if you are
running the Solaris 7 OS.
set ncsize = 50
set ufs_ninode = 50
vmstat -s
toolong ___________________________________________________
6. Type:
./tester_l 50
vmstat -s
Record the results. What has happened to the DNLC hit rate? Why?
(Hint – It could be one of two reasons: the cache was too small, or the
directory names being referenced were too long.) What can you
deduce just from the vmstat output?
toolong __________________________________________________
___________________________________________________________
___________________________________________________________
___________________________________________________________
./tester_s 50
vmstat -s
What changes do you note from the previous runs of tester? Did
the cache hit rate rise? Why or why not?
toolong __________________________________________________
___________________________________________________________
___________________________________________________________
10. Type:
sar -v 1
Note that the size of the inode table corresponds to the value you
set for ufs_ninode and that the total number of inodes exceeds
the size of the table. This is not a bug. ufs_ninode acts as a soft
limit. The actual number of inodes in memory may exceed this
value. However, the system attempts to keep the size below
ufs_ninode.
11. Edit /etc/system. Remove the lines that were added earlier.
./tester_s 50
vmstat -s
1. Use your favorite editor to view the source for the program,
map_file_f_b.c.
___________________________________________________________
___________________________________________________________
___________________________________________________________
___________________________________________________________
Why?
___________________________________________________________
___________________________________________________________
mf_f_b.sh f
This script runs map_file_f_b several times and prints out the
average execution time. The f argument tells map_file_f_b to
sequentially scan the file starting at the first page and finishing on
the last page.
4. Type:
mf_f_b.sh b
___________________________________________________________
___________________________________________________________
___________________________________________________________
What options does it accept and what effect do they have on the
program?
__________________________________________________________
__________________________________________________________
vmstat 5
5. Type:
file_advice n
___________________________________________________________
___________________________________________________________
___________________________________________________________
___________________________________________________________
7. Type:
file_advice s
___________________________________________________________
___________________________________________________________
___________________________________________________________
9. Type:
file_advice s
Note the number of I/O requests that were posted to the disk
test_file resides on (that is, s0).
___________________________________________________________
11. Type:
file_advice r
___________________________________________________________
___________________________________________________________
___________________________________________________________
Objectives
Upon completion of this module, you should be able to:
Relevance
Additional Resources
Additional resources – The following references can provide
additional details on the topics discussed in this module:
Networks
The network system has three areas to tune, just like the I/O
subsystem, although most people concentrate on just one.
A network operates like any other bus, with bandwidth and access
issues. Just like any bus, speed and capacity issues play an important
role in performance. This is generally not susceptible to any tuning at
the individual system level, as the network infrastructure is not
adjustable from OS options. Tuning the enterprise network is well
beyond the scope of this course.
Network Bandwidth
You can use the ndd command to check (and set) the operational
characteristics of a network interface.
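For example, on an hme interface, the following reads the negotiated
speed and duplex settings (the output values shown are illustrative; a
value of 1 means 100 Mbit/sec and full duplex, respectively):
# ndd /dev/hme link_speed
1
# ndd /dev/hme link_mode
1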
IP Trunking
TCP Connections
Also, each open TCP connection uses memory, since connection and
state information must be kept.
The overhead image above gives a number of tuning guidelines for the
TCP/IP stack. “Warn” means that this is a condition that requires
further investigation. The tuning parameters for these conditions will
be discussed on the next pages.
There are many other tuning parameters for the TCP stack; do not
adjust them. They are intended for development use or for
conformance with rarely used parts of the TCP specification (RFC
[request for comments] 793) and should not be used.
Make sure you have the latest patches for your TCP stack and network
drivers. This area of the system changes constantly, and new features,
especially performance features, are added regularly. Be sure to read
the README files for details of the changes.
The pending listen queue can have problems similar to those of the
incomplete connection queue.
Using ndd
The ndd command enables you to inspect and change the settings on
any driver involved in the TCP/IP stack. For example, you can look at
parameters from /dev/tcp, /dev/udp, and /dev/ip. To see what
parameters are available (some are read-only), use the
ndd /dev/ip \? command. (The ? is escaped to avoid shell
substitution.)
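For example, the following lists the TCP parameters, reads the
pending listen queue limit, and then raises it (the new value shown is
only an example):
# ndd /dev/tcp \?
# ndd /dev/tcp tcp_conn_req_max_q
128
# ndd -set /dev/tcp tcp_conn_req_max_q 1024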
The main issue with these buses is access to transmit requests and
responses. While SCSI and some network topologies (such as Token
Ring) have strict contention management protocols, others, such as
Ethernet, operate on a first come, first served basis.
NFS
Before tuning NFS specifically, be sure that the client and server are
properly tuned. Issues such as DNLC and inode caches that are too
small can result in increased network traffic as the information has to
be reread. Once the client and server are tuned, it becomes easier to
identify specific problems as being NFS or network related.
Also, NFS Version 3 has some network message efficiencies and uses
larger block sizes, which can add significant improvements to a very
busy NFS environment. The nfsstat -m command from the client can
identify which version of NFS you are using.
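For example (the server name and mount point are hypothetical, and
the output is abbreviated):
# nfsstat -m
/data from server1:/export/data
 Flags: vers=3,proto=tcp,sec=sys,hard,intr,rsize=32768,wsize=32768,retrans=5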
NFS Retransmissions
The NFS distributed file service uses an RPC facility which translates
local commands into requests for the remote host. The RPCs are
synchronous. That is, the client application is blocked or suspended
until the server has completed the call and returned the results. One of
the major factors affecting NFS performance is the retransmission rate.
Client rpc:
Connection oriented:
calls badcalls badxids timeouts newcreds badverfs
156274 4 0 0 0 0
timers cantconn nomem interrupts
0 4 0 0
Connectionless:
calls badcalls retrans badxids timeouts newcreds
9 1 0 0 0 0
badverfs timers nomem cantsend
0 4 0 0
Client nfs:
calls badcalls clgets cltoomany
153655 1 153655 0
Version 2: (6 calls)
null getattr setattr root lookup readlink
0 0% 5 83% 0 0% 0 0% 0 0% 0 0%
read wrcache write create remove rename
0 0% 0 0% 0 0% 0 0% 0 0% 0 0%
link symlink mkdir rmdir readdir statfs
0 0% 0 0% 0 0% 0 0% 0 0% 1 16%
Version 3: (151423 calls)
null getattr setattr lookup access readlink
0 0% 44259 29% 5554 3% 26372 17% 18802 12% 765 0%
read write create mkdir symlink mknod
20639 13% 24486 16% 3120 2% 190 0% 4 0% 0 0%
remove rmdir rename link readdir readdirplus
1946 1% 5 0% 1019 0% 17 0% 998 0% 764 0%
fsstat fsinfo pathconf commit
73 0% 33 0% 280 0% 2097 1%
Client nfs_acl:
Version 2: (1 calls)
null getacl setacl getattr access
0 0% 0 0% 0 0% 1 100% 0 0%
Version 3: (2225 calls)
null getacl setacl
0 0% 2225 100% 0 0%
#
Normal client I/O requests, however, are handled by NFS I/O system
threads. The NFS daemon, nfsd, services requests from the network.
Multiple nfsd daemons are started so that a number of outstanding
requests can be processed in parallel. Each nfsd takes one request off
the network and passes it to the I/O subsystem.
The NFS I/O system threads are created by the NFS server startup
script, /etc/init.d/nfs.server. The default number of I/O system
threads is 16. To increase the number of threads, change the nfsd line
found about halfway into the file.
/usr/lib/nfs/nfsd -a 16
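For example, to allow 64 requests to be serviced in parallel, change the
line to read:
/usr/lib/nfs/nfsd -a 64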
● spray – Tests the reliability of your packet sizes. It can tell you
whether packets are being delayed or dropped.
● snoop – Captures packets from the network and traces the calls
from each client to each server.
● Improper grounding
● Missing termination
● Signal reflection
With the -s option, ping sends one datagram per second to a host. It
then prints each response and the time it took for the round trip.
# ping -s proto198
PING proto198: 56 data bytes
64 bytes from proto198 (72.120.4.98): icmp_seq=0. time=0. ms
64 bytes from proto198 (72.120.4.98): icmp_seq=1. time=0. ms
64 bytes from proto198 (72.120.4.98): icmp_seq=2. time=0. ms
64 bytes from proto198 (72.120.4.98): icmp_seq=3. time=0. ms
^C
----proto198 PING Statistics----
5 packets transmitted, 4 packets received, 20% packet loss
round-trip (ms) min/avg/max = 0/0/0
#
The following example sends 100 packets to a host (-c 100); each
packet contains 2048 bytes (-l 2048). The packets are sent with a
delay time of 20 microseconds between each burst (-d 20).
# spray -c 100 -d 20 -l 2048 proto198
sending 100 packets of length 2048 to proto198 ...
no packets dropped by proto198
4433 packets/sec, 9078819 bytes/sec
#
# snoop -o packet.file        (capture packets to a file)
^C#
# snoop -i packet.file        (display packets from a capture file)
# snoop -s 120                (save only the first 120 bytes of each packet)
# snoop -c 1000               (stop after capturing 1000 packets)
# snoop -o /tmp/packet.file   (capture to a file in /tmp)
# snoop broadcast             (capture only broadcast packets)
Isolating Problems
Once the servers are operating correctly, and NFS activity has been
tuned (for example, with cachefs and the actimeo mount option),
look at physical network issues.
Just as system memory problems can cause I/O and CPU problems,
system problems can cause network problems. By fixing any system
problems first, you can get a much clearer look at the cause of the
network problems, if they still exist.
Remember, you can use ndd and ping, spray, and snoop as well as
netstat and nfsstat to get detailed information about your network
and the network traffic.
Tuning Reports
net.se
The net.se program output is like that of netstat -i plus collision
percentage.
# /opt/RICHPse/bin/se /opt/RICHPse/examples/net.se
Name Ipkts Ierrs Opkts Oerrs Colls Coll-Rate
hme0 1146294 0 1077416 0 0 0.00 %
#
netstatx.se
The netstatx.se program is like iostat -x in format. It shows
per-second rates, rather than the cumulative counts reported by
netstat.
# /opt/RICHPse/bin/se /opt/RICHPse/examples/netstatx.se
Current tcp minimum retransmit timeout is 200 Thu Jul 29 11:15:29 1999
Name Ipkt/s Ierr/s Opkt/s Oerr/s Coll/s Coll% tcpIn tcpOut DupAck %Retran
hme0 162.0 0.0 151.6 0.0 0.0 0.00 145695 128576 0 0.00
^C#
nx.se
The nx.se program prints out netstat and tcp class information.
NoCP means nocanput, a count of packets discarded on input because
no IP-level buffer was available; such a drop can cause a TCP
connection to time out and retransmit the packet from the other end.
Defr is defers, an Ethernet metric that counts the rate at which output
packets are delayed before transmission.
# /opt/RICHPse/bin/se /opt/RICHPse/examples/nx.se
Current tcp RtoMin is 200, interval 5, start Thu Jul 29 13:18:20 1999
13:18:40 Iseg/s Oseg/s InKB/s OuKB/s Rst/s Atf/s Ret% Icn/s Ocn/s
tcp 1.6 1.6 0.19 0.18 0.00 0.00 0.0 0.00 0.00
Name Ipkt/s Opkt/s InKB/s OuKB/s IErr/s OErr/s Coll% NoCP/s Defr/s
hme0 2.4 2.2 0.33 0.30 0.000 0.000 0.0 0.00 0.00
^C#
nfsmonitor.se
The nfsmonitor.se program is the same as nfsstat-m.se except
that it is built as a monitor. It prints out only slow NFS client mount
points, ignoring the ones that are working fine. Only root can run it.
For the Solaris 2.5 OS, nfsmonitor.se reliably reports data only
for NFS V2 (Version 2). For NFS V3 (Version 3), it always reports
zeroes (as does the regular nfsstat -m command).
tcp_monitor.se
The tcp_monitor.se program is a GUI that is used to monitor TCP
and has a set of sliders that enable you to tune some of the parameters
when run as root.
Before continuing on to the next module, check that you are able to
accomplish or answer the following:
● What other similarities are there between system buses, the SCSI
bus, and networks?
● Why is tuning the NFS client and server more likely to improve
performance than tuning the network?
● How can you be sure that you have a network problem and not a
system issue?
Objectives
Upon completion of this module, you should be able to:
● Identify the type of performance problem that you have, and what
tools are available to resolve it
Relevance
Additional Resources
Additional resources – The following references can provide
additional details on the topics discussed in this module:
The overhead image above gives some basic guidelines for tuning a
system.
Compiler Optimization
Performance Bottlenecks
The main monitors used to record performance data are sar, iostat,
and vmstat. These performance monitors collect and display
performance data at regular intervals. Selecting the sample interval
and duration to use is partially dependent on the area of performance
that is being analyzed.
Memory Bottlenecks
When the page daemon and the swapper need to execute, this will
decrease the amount of execution time the application receives. If the
page daemon is placing pages that have been modified on the page-
free list, these pages will be written out to disk. This will cause a
heavier load on the I/O system, up to the maxpgio limit.
memcntl (or related library calls such as madvise) can be used to more
effectively manage memory being used by an application. These
routines can be used to notify the kernel that the application will be
referencing a set of addresses sequentially or randomly. They can also
be used to release memory that is no longer needed by the application
back to the system.
I/O Bottlenecks
One of the goals in this area is to spread the load out evenly amongst
the disks on a system. This might not always be possible. But in
general, if one disk is much more active than other disks on the
system, there is a potential performance problem. Although the
threshold for an uneven disk load varies with the system and
expectations, 20 percent is normally used: if the number of operations
per disk varies by more than 20 percent, the system might benefit
from having the disk load balanced.
If several threads are blocked waiting for I/O to complete, the kernel
might not be able to service the requests fast enough to keep them
from piling up. The time a thread spends waiting for an I/O request to
complete is time that the thread could be in execution if the requested
data was available. There is no field that records this information in
sar. However, the b column in vmstat displays this information.
Remember, placing too many drives on a bus will overload that bus,
causing poor response times that are not caused by the drives
themselves.
mmap requires less overhead to do I/O than read and write, as does
direct I/O.
CPU Bottlenecks
If the CPU is not spending a lot of time idle (less than 15 percent), then
threads which are runnable have a higher likelihood of being forced to
wait before being put into execution. In general, if the CPU is
spending more than 70 percent in user mode, the application load
might need some balancing. In most cases, 30 percent is a good high-
water mark for the amount of time that should be spent in system
mode.
A slower response time from applications might also indicate that the
CPU is a bottleneck. These applications may be waiting for access to
the CPU for longer periods of time than normal. Consider locking and
cache effects if no other cause for the performance problem can be
found.
The CPU bottleneck might be the result of some other problem on the
system. For example, a memory shortfall will cause the page daemon
to execute more often. Other lower priority threads will not be able to
run while the page daemon is in execution. Checking for system
daemons which are using up an abnormally large amount of CPU time
may indicate where the real problem lies.
Lastly, if all else fails, adding more CPUs might alleviate some of the
CPU load.
Some of the most useful tuning reminders are given on these next two
pages.
This particular customer has three Sun Enterprise 6000 servers. Two of
the servers are running Oracle®, a relational database management
system, and the third server is running an SAS application. Each of the
three servers is connected to two different storage arrays. Each of the
disk arrays has 1.25 Tbytes of mirrored storage capacity.
Presently, 200 plus users have been created on the SAS server. The
users have remote access to the server over 100BaseT Ethernet. Each
user is connected to the server through a Fast Ethernet switch, which
means each user has a dedicated 100-Mbit/sec pipe into the local
area network.
Each of the 200 users has cron jobs that query the Oracle database for
data, perform some statistical analysis on the data, and then generate
reports. The system is quiet at night; the cron jobs start at around
5:30 each morning and run throughout the day.
Initially, 50 users were set up and performance was fine at that point.
System performance degraded as more users were added. Queries to
the Oracle database were still satisfactory, but performing the
statistical analysis on the SAS server was taking longer as more and
more users were added to the system. It was the SAS server that was
running slow.
The prtdiag utility, first provided with the SunOS 5.5.1 operating
system, provides a complete listing of a system’s hardware
configuration, up to the I/O interface cards.
Intrlv. Intrlv.
Brd Bank MB Status Condition Speed Factor With
--- ----- ---- ------- ---------- ----- ------- -------
0 0 1024 Active OK 60ns 4-way A
0 1 1024 Active OK 60ns 2-way B
2 0 1024 Active OK 60ns 4-way A
2 1 1024 Active OK 60ns 2-way B
4 0 1024 Active OK 60ns 4-way A
6 0 1024 Active OK 60ns 4-way A
...
...
#
Bus Freq
Brd Type MHz Slot Name Model
--- ---- ---- ---- -------------------------------- ----------------------
1 SBus 25 0 nf SUNW,595-3444,595-3445+
1 SBus 25 1 fca FC
1 SBus 25 2 fca FC
1 SBus 25 3 SUNW,hme
1 SBus 25 3 SUNW,fas/sd (block)
1 SBus 25 13 SUNW,socal/sf (scsi-3) 501-3060
3 SBus 25 0 QLGC,isp/sd (block) QLGC,ISP1000
3 SBus 25 1 fca FC
3 SBus 25 2 fca FC
3 SBus 25 3 SUNW,hme
3 SBus 25 3 SUNW,fas/sd (block)
3 SBus 25 13 SUNW,socal/sf (scsi-3) 501-3060
5 SBus 25 0 QLGC,isp/sd (block) QLGC,ISP1000
5 SBus 25 3 SUNW,hme
5 SBus 25 3 SUNW,fas/sd (block)
5 SBus 25 13 SUNW,socal/sf (scsi-3) 501-3060
7 SBus 25 0 fca FC
7 SBus 25 2 QLGC,isp/sd (block) QLGC,ISP1000
7 SBus 25 3 SUNW,hme
7 SBus 25 3 SUNW,fas/sd (block)
7 SBus 25 13 SUNW,socal/sf (scsi-3) 501-3060
On the Enterprise 6000, each board actually has two SBuses for an
aggregate bandwidth of 200 Mbytes/sec. The bandwidth of the
Gigaplane bus on the E6000 is 2.6 Gbytes/sec. With the four boards
(1, 3, 5, and 7) each having two SBuses, aggregate bandwidth of the
eight SBuses is 800 Mbytes/sec. This is well below the available
bandwidth of the 2.6-Gbyte backplane.
Since the SBuses on board 1 have the most devices, this board will be
evaluated for an overload condition. Following are the bandwidths for
the devices on board 1:
Once you identify which processes are active, you can narrow it down
to the one(s) causing any delays and determine which system
resources are causing bottlenecks in the process. The BSD version of
the ps command displays the most active processes in order.
# /usr/ucb/ps -aux
USER      PID %CPU %MEM    SZ   RSS TT     S  START      TIME COMMAND
user1    3639 11.1  0.1 20376  4280 pts/9  O 06:42:22  115:37 sas
user2   22396  8.2  0.1  6104  4208 ?      O 09:04:13   11:51 /usr/local/sas612/
user3   24306  5.0  0.5 30864 28976 ?      S 09:14:57    6:10 /usr/local/sas612/
user4   21821  4.9  0.1  7656  4608 ?      S 09:00:07    4:56 /usr/local/sas612/
user5   17093  2.7  0.1  8328  5080 ?      S 08:25:55    0:46 /usr/local/sas612/
user6   20156  2.7  0.2 18896  7480 pts/14 O 08:47:05    4:26 sas
user7   12982  2.2  0.1  5576  3680 ?      O 08:00:01   11:02 /usr/local/sas612/
user8   25108  2.0  0.1  5896  4152 ?      S 09:18:45    0:26 /usr/local/sas612/
user9   24315  1.7  0.4 33696 22920 ?      O 09:15:01    2:21 /usr/local/sas612/
#
You knew that SAS would be the most active process on the system,
however, the report still provides some useful information. The %CPU
column is the percentage of the CPU time used by the process. As you
can see, the first SAS process is using a large percentage of CPU time,
11.1 percent.
Note – User IDs were changed in the ps output to protect the privacy
of the actual users.
This is not the case on this system. The processes have a status of O,
indicating that they are on the CPU (running). A status of S indicates
that the process is sleeping.
The process identifier (PID) of the active processes is also useful. The
memory requirements of the process will be determined by using the
/usr/proc/bin/pmap -x PID command.
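For example, to examine the memory use of the busiest SAS process
shown in the ps output above:
# /usr/proc/bin/pmap -x 3639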
An I/O operation cannot begin until the request has been transferred
to the device. The request cannot be transferred until the initiator can
access the bus. As long as the bus is busy, the initiator must wait to
start the request. During this period, the request is queued by the
device driver. A read or write command is issued to the device driver
and sits in the wait queue until the SCSI bus and disk are both ready.
Queued requests must also wait for any other requests ahead of them.
A large number of queued requests suggests that the device or bus is
overloaded. The queue can be reduced by several techniques: use of
tagged queueing, a faster bus, or moving slow devices to a different
bus.
As you can see from this random sampling of I/O devices, none of the
devices report a high number of transactions waiting for service. The
percentage of time there are transactions waiting for service is also
quite acceptable. This suggests that the devices are distributed
properly and that the bus is not overloaded.
From the iostat report, you can see that average response times are
high for all disks. Also, sd0, sd2, sd4, and sd6 are busy more than 5
percent of the time. A better distribution of load across these disks
would help. Reducing the number of I/Os would also help.
There are two tuning parameters that control the fsflush daemon:
● tune_t_fsflushr – How often fsflush wakes up to do a portion
of its work. Default is 5 seconds.
● autoup – How long a complete cycle should take; that is, the
longest that modified data will remain unwritten. Default is
30 seconds.
The SAS application reads data from the Oracle database tables,
performs some statistical analysis, and produces reports. Therefore, it
is not necessary to activate the fsflush daemon as often. SAS requires
a working file system for each user to write the results of
computations to disk. But these writes are not that critical because the
original Oracle data can always be read again in the event of a system
crash.
set tune_t_fsflushr=10
set autoup=120
The output of the pmap command shows that the SAS process is using
9904 Kbytes of real memory. Of this, 6712 Kbytes is shared with other
processes on the system, via shared libraries and executables.
Memory requirements for the 200 users the customer has set up to run
SAS would therefore be about 638,400 Kbytes (200 x (9904 - 6712)
Kbytes of private memory per process, or roughly 638 Mbytes). This
system has 6 Gbytes of memory, which is more than enough to
support 200 simultaneous SAS users.
The DNLC caches the most recently referenced directory entry names
and the associated vnode for the directory entry. The number of name
lookups per second is reported as namei/s by the sar -a command.
# sar -a
If namei does not find a directory name in the DNLC, it calls iget to
get the inode for either a file or directory. Most iget calls are the result
of DNLC misses. iget/s reports the number of requests made for
inodes that were not in the DNLC.
For every entry in the DNLC there will be an entry in the inode cache.
This allows the quick location of the inode entry, without having to
read the disk.
inode cache size is set by the kernel parameter ufs_ninode. You can
see inode cache statistics by using netstat -k to dump out the raw
kernel statistics information. maxsize is the variable equal to
ufs_ninode. A maxsize_reached value higher than maxsize
indicates the number of active inodes has exceeded cache size.
# netstat -k
kstat_types:
raw 0 name=value 1 interrupt 2 i/o 3 event_timer 4
segmap:
fault 1646083218 faulta 0 getmap 1531102064 get_use 8868192 get_reclaim 534564036
get_reuse 988319526
...
...
inode_cache:
size 3607 maxsize 17498 hits 770740 misses 591785 kmem allocs 95672 kmem frees 92065
maxsize reached 21216 puts at frontlist 847788 puts at backlist 316742
queues to free 0 scans 8873146 thread idles 587283 lookup idles 0 vget idles 0
cache allocs 591785 cache frees 666538 pushes at close 0
...
...
#
Increasing the size of the DNLC cache, ncsize, and the size of the
inode cache, ufs_ninode, will improve the cache hit rate and help
reduce the number of disk I/Os. The inode cache should be the same
size as the DNLC. Recommended entries in /etc/system are:
set ufs_ninode=20000
set ncsize=20000
The UFS buffer cache (also known as metadata cache) holds inodes,
cylinder groups and indirect blocks only (no file data blocks). The
default buffer cache size allows up to 2 percent of main memory to be
used for caching the file system metadata. The default is too high for
systems with large amounts of memory.
Since only inodes and other metadata are stored in the buffer cache, it
does not need to be a very large buffer. In fact, you only need about
300 bytes per
inode, and about 1 Mbyte per 2 Gbytes of files that are expected to be
accessed concurrently. The size can be adjusted with the bufhwm
kernel parameter, specified in Kbytes, in the /etc/system file.
For example, if you have a database system with 100 files totaling
100 Gbytes of storage space and you estimate that only 50 Gbytes of
those files will be accessed at the same time, then at most you would
need 30 Kbytes (100 x 300 bytes = 30 Kbytes) for the inodes, and about
25 Mbytes (50 Gbytes ÷ 2 Gbytes x 1 Mbyte = 25 Mbytes) for the
metadata. Making
the following entry into /etc/system to reduce the size of the buffer
cache is recommended:
set bufhwm=28000
You can monitor the buffer cache hit rate using sar -b. The statistics
for the buffer cache show the number of logical reads and writes into
the buffer cache, the number of physical reads and writes out of the
buffer cache, and the read/write hit ratios.
Try to obtain a read cache hit ratio of 100 percent on systems with a
few, but very large files, and a hit ratio of 90 percent or better for
systems with many files.
# sar -b
...
         bread/s lread/s %rcache bwrit/s lwrit/s %wcache pread/s pwrit/s
Average        1      19      95       7      29      78       5       1
#
The best RAM shortage indicator is the scan rate (sr) output from
vmstat. A scan rate above 200 pages per second for long periods
(30 seconds) indicates memory shortage. Make sure you have the most
recent OS version or, at least, the most recent kernel patch installed.
# vmstat 2
procs memory page disk faults cpu
r b w swap free re mf pi po fr de sr s0 s1 s2 s3 in sy cs us sy id
1 1 34 121984 41312 13 2612 11218 2439 14303 0 1528 12 0 0 6 1425 2872 926 15 21 63
0 0 71 6866280 97728 37 4056 22808 11392 44380 0 4220 2 0 0 0 2391 5459 969 22 46 32
0 1 71 6866280 94832 49 5049 28232 12948 47268 0 4347 6 1 0 0 2345 5559 1076 24 53 22
0 2 71 6868128 92984 5 4935 31444 20432 55256 0 4467 13 0 0 0 2364 5530 1056 27 44 29
0 1 71 6868416 95480 5 4621 30732 20100 53460 0 4287 4 0 0 0 2637 5138 1175 24 44 32
^C#
Recall that the page daemon begins scanning when the number of free
pages drops below lotsfree. The free column of the vmstat
command reports the size of the free memory list in Kbytes. These are
pages of RAM that are immediately ready to be used whenever a
process starts up or needs more memory.
set lotsfree=0x10000000
The number of page-out requests per second (pgout/s) is also
displayed, as well as the actual number of pages paged out per second
(ppgout/s). The number of pages per second placed on the free list by
the page daemon (pgfree/s) is also displayed.
It is normal to have high paging and scan rates with heavy file system
usage as is the case in this customer’s environment. However, this can
have a very negative effect on application performance. The effect can
be noticed as poor interactive response, thrashing of the swap disk, and
low CPU utilization due to the heavy paging. This happens because
the page cache is allowed to grow to the point where it steals memory
pages from important applications. Priority paging addresses this
problem.
set priority_paging=1
# newfs -c 32 /dev/rdsk/xxx
Most file systems implement a read ahead algorithm in which the file
system can initiate a read for the next few blocks as it reads the current
block. Given the repeating nature of the sequential access patterns of a
DSS, the file system should be tuned to increase the number of blocks
that are read ahead.
The UFS file system uses the cluster size (maxcontig parameter) to
determine the number of blocks that are read ahead. This defaults to
seven 8-Kbyte blocks (56 Kbytes) in Solaris OS versions prior to the
Solaris 2.6 OS. In the Solaris 2.6 OS, the default changed to the
maximum size
transfer supported by the underlying device, which defaults to sixteen
8-Kbyte blocks (128 Kbytes) on most storage devices. The default
values for read ahead are often too low and should be set very large to
allow optimal read rates.
Cluster size should equal the number of stripe members per mirror
multiplied by the stripe size. For example, two stripe members with a
stripe size of 128 Kbytes result in an optimal cluster size of
256 Kbytes. The -C
option to newfs or the -a option to tunefs sets the cluster size.
Cluster sizes larger than 128 Kbytes also require the maxphys
parameter to be set in /etc/system. The following entry in
/etc/system provides the necessary configuration for larger cluster
sizes:
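A typical entry is shown below; the value is in bytes, must be at least
as large as the desired cluster size, and is a commonly used example
rather than a required setting:
set maxphys=1048576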
❑ Identify the type of performance problem that you have, and what
tools are available to resolve it
Objectives
Upon completion of this lab, you should be able to:
Tasks
vmstat Information
Use the lab11 directory for these exercises.
1. Type:
vmstat -S 5
_______________________________________________________
2. Type:
./load11.sh
▼ Interrupts _____________________________________________
sar Analysis I
Complete the following steps using Appendix F as a reference. In this
part of the lab, use sar to read files containing performance data. This
data is recorded by sar using the -o option. The data is recorded in
binary format. Each sar file was generated using a different
benchmark.
1. Type:
sar -u -f sar.file1
___________________________________________________________
___________________________________________________________
Look at the %wio field. What does this field represent? Is there a
pattern?
___________________________________________________________
___________________________________________________________
___________________________________________________________
sar -q -f sar.file1
___________________________________________________________
___________________________________________________________
___________________________________________________________
___________________________________________________________
___________________________________________________________
What is the average length of the swap queue? The size of the
swap queue indicates the number of LWPs that were swapped out
of main memory. What might the size of the swap queue tell you
with respect to the demand for physical memory? (If this field is
blank, then the swap queue was empty.)
___________________________________________________________
___________________________________________________________
___________________________________________________________
sar -d -f sar.file1
Was the disk busy during this benchmark? (Did the benchmark
load the I/O subsystem?) Use the %busy field to make an initial
determination.
___________________________________________________________
Was the disk busy with many requests? (See the avque field.) Did
the number of requests increase as the benchmark progressed?
___________________________________________________________
___________________________________________________________
___________________________________________________________
Was the I/O subsystem under a load? What was the average size
of a request?
___________________________________________________________
sar -r -f sar.file1
sar -g -f sar.file1
Was the paging activity steady? (Did the system page at a steady
rate?) Explain.
___________________________________________________________
___________________________________________________________
___________________________________________________________
___________________________________________________________
You have just checked three major areas in the SunOS operating
system that can affect performance (CPU, I/O, and memory
usage). You can use this approach to gather initial data on a
suspect performance problem. Given the data you have just
collected, what can you conclude with respect to the load on the
system?
___________________________________________________________
___________________________________________________________
1. Type:
sar -u -f sar.file2
___________________________________________________________
___________________________________________________________
Look at the %wio field. What does this field represent? Is there a
pattern?
___________________________________________________________
___________________________________________________________
___________________________________________________________
sar -q -f sar.file2
___________________________________________________________
___________________________________________________________
___________________________________________________________
___________________________________________________________
___________________________________________________________
What is the average length of the swap queue? The size of the
swap queue indicates the number of LWPs that were swapped out
of main memory. What might the size of the swap queue tell you
with respect to the demand for physical memory? (If this field is
blank, then the swap queue was empty.)
___________________________________________________________
___________________________________________________________
___________________________________________________________
sar -d -f sar.file2
Was the disk busy during this benchmark? (Did the benchmark
load the I/O subsystem?) Use the %busy field to make an initial
determination.
___________________________________________________________
Was the disk busy with many requests? (See the avque field.) Did
the number of requests increase as the benchmark progressed?
___________________________________________________________
___________________________________________________________
___________________________________________________________
Was the I/O subsystem under a load? What was the average size
of a request?
___________________________________________________________
sar -r -f sar.file2
sar -g -f sar.file2
Was the paging activity steady? (Did the system page at a steady
rate?) Explain.
___________________________________________________________
___________________________________________________________
___________________________________________________________
___________________________________________________________
What can you conclude about the load on the system? Which
area(s), if any, does this benchmark load?
___________________________________________________________
___________________________________________________________
___________________________________________________________
___________________________________________________________
___________________________________________________________
1. Type:
sar -u -f sar.file3
___________________________________________________________
___________________________________________________________
Look at the %wio field. What does this field represent? Is there a
pattern?
___________________________________________________________
___________________________________________________________
___________________________________________________________
sar -q -f sar.file3
___________________________________________________________
___________________________________________________________
___________________________________________________________
___________________________________________________________
___________________________________________________________
What is the average length of the swap queue? The size of the
swap queue indicates the number of LWPs that were swapped out
of main memory. What might the size of the swap queue tell you
with respect to the demand for physical memory? (If this field is
blank, then the swap queue was empty.)
___________________________________________________________
___________________________________________________________
___________________________________________________________
sar -d -f sar.file3
Was the disk busy during this benchmark? (Did the benchmark
load the I/O subsystem?) Use the %busy field to make an initial
determination.
___________________________________________________________
Was the disk busy with many requests? (See the avque field.) Did
the number of requests increase as the benchmark progressed?
___________________________________________________________
___________________________________________________________
___________________________________________________________
Was the I/O subsystem under a load? What was the average size
of a request?
___________________________________________________________
sar -r -f sar.file3
sar -g -f sar.file3
Was the paging activity steady? (Did the system page at a steady
rate?) Explain.
___________________________________________________________
___________________________________________________________
___________________________________________________________
___________________________________________________________
What can you conclude about the load on the system? Which
area(s), if any, does this benchmark load?
___________________________________________________________
___________________________________________________________
___________________________________________________________
___________________________________________________________
___________________________________________________________
Additional Resources
(The tables in this appendix list, for each platform, the bus burst size,
bus width, clock, and stream-mode read/write speed, and for each
board, a description, bus width in bits, and typical transfer
throughput.)
The SyMON 2.0 system monitor can be downloaded from the Sun web
site, http://www.sun.com/symon.
Additional Resources
● The User subsystem that provides the SyMON GUI and related
data, which typically runs on the user’s desktop system
Only one instance of the Server subsystem and one instance of the
Event Generator/Handler can operate per server. Any number of
SyMON GUIs can be installed and can operate on different systems to
monitor one server.
The SyMON system monitor uses Tcl scripts extensively to control the
layout of certain windows (including any SysMeter layouts you save).
These files are saved in your home directory in the .symon/lib/tcl
directory.
The 1.6 version of the Solstice SyMON system monitor supports the
following platforms:
● Ultra 2
If you are installing a new version of the SyMON system monitor over
a previously installed, older version, be sure to do the following:
The SUNWsyux packages contain the display images for the different
supported platforms. Install at least one of these packages on the
system that runs the User GUI subsystem. Choose the package(s) that
contain the images for the system(s) you want to be able to view.
Use the pkgadd command to load the appropriate packages into the
/opt/SUNWsymon directory.
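For example, assuming the packages have been copied to a staging
directory such as /var/tmp/symon (the path and the exact package list
are assumptions; they depend on which subsystems you are installing):
# pkgadd -d /var/tmp/symon SUNWsyc SUNWsyux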
Environment Variables
Use the following steps to set environment variables in the root user
.profile or .login file:
# sm_confsymon -s ehsystem
# sm_control start
% symon -t server
Note that the Server and Event Handler systems must already have
the SyMON system monitor installed and running for the user
interface to operate properly.
# sm_confsymon -e server
1. SUNW,SPARCserver-1000
2. SUNW,SPARCserver-1000E
3. SUNW,SPARCcenter-2000
4. SUNW,SPARCcenter-2000E
5. SUNW,Ultra-Enterprise
6. SUNW,Ultra-1 (only SUNW,Ultra-Enterprise-150 is supported)
7. SUNW,Ultra-2 (only SUNW,Ultra-Enterprise-2 is supported)
8. SUNW,Ultra-4 (only SUNW,Ultra-Enterprise-450 is supported)
# sm_control start
The appropriate SNMP mib files are provided with the SyMON
system monitor, in the SUNWsye package, and are installed in the
/opt/SUNWsymon/etc/snm directory.
/opt/SUNWsymon/etc/snm/symon.mib.oid
/opt/SUNWsymon/etc/snm/symon.mib.traps
/opt/SUNWsymon/etc/snm/symon.mib.schema
2. Add the files listed in step 1 to your SNM schema directory. With
SNM, copy these files to /opt/SUNWconn/snm/agents on the
SNM host.
For example, where the host running snm is wgsnm, use the following
line:
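One way to do this (a sketch that assumes rcp access from the SyMON
server to wgsnm) is:
# rcp /opt/SUNWsymon/etc/snm/symon.mib.* wgsnm:/opt/SUNWconn/snm/agents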
If you want to kill and deactivate all of the SyMON daemon processes
on the Server or Event Handler system, issue the following command:
# sm_control stop
If you want to remove the SyMON configuration files from your disk
as well, issue the following command:
# sm_confsymon -D
These commands do not remove the data and files installed by pkgadd
into /opt/SUNWsymon.
● Colormap limitations
The SyMON system monitor uses about half the colormap with
8-bit graphics. Portions of its graphics (such as icons, images, or
event highlighting) might not be displayed if it cannot allocate the
required colors. You may have to exit other color-intensive
applications (such as Netscape Navigator™) for the SyMON
system monitor to operate properly.
● SunVTS interface
To call SunVTS from the SyMON system monitor, you must install
these additional packages from the Sun Microsystems Computer
Corporation Server Supplements CD-ROM:
● SUNWodu
● SUNWsyc
● SUNWvts
● SUNWvtsmn
● The commands used to generate raw data and the files these
commands create
Overview
What Is Accounting?
In the terminal server context, the term accounting originally implied
the ability to bill users for system services. More recently, the term is
used when referring to the monitoring of system operations.
● Troubleshooting
Connection Accounting
Connection accounting includes all information associated with the
logging in and out of users. This element of the accounting mechanism
does not have to be installed for data to be collected. But if it is not
installed, no analysis can be done.1
Process Accounting
As each program terminates, the kernel (the exit() function) places
an entry in /var/adm/pacct. The entry contains the following
information:
1. However, with the last command you can print out the protocol file.
Disk Accounting
Disk accounting enables you to monitor the amount of data a user has
stored on disk. The dodisk command, which is provided for this
purpose, should be run once per day. The following information is
stored:
Charging
The charging mechanisms enable the system administrator to levy
charges for specific services (for example, restoring deleted files).
These entries are stored in /var/adm/fee and are displayed in the
accounting analysis report. The following information is stored:
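Fee entries are ASCII tacct records; they are normally added with the
chargefee script. For example, to charge a user 10 units (the user
name and the amount are hypothetical):
# /usr/lib/acct/chargefee otto 10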
Location of Files
The shell scripts and binaries are located in the /usr/lib/acct
directory, and the data and report analyses are stored in
/var/adm/acct.
Types of Files
Accounting maintains three types of files:
Programs That Are Run
Programs started by crontab summarize and analyze the collected
data and delete the /var/adm/pacct file after analysis.
● runacct – Summarizes the raw data collected over the day, deletes
the raw data file, and generates analyses as described in the
following paragraphs.
● lastlogin – Lists all known user names and the date on which
these users last logged in
2. To create the new file name, a digit is added to the old file name. The digit
increases; /var/adm/pacct becomes /var/adm/pacct1,
/var/adm/pacct2, and so on.
Starting and Stopping Accounting
# ln /etc/init.d/acct /etc/rc2.d/S22acct
# ln /etc/init.d/acct /etc/rc0.d/K22acct
4. Modify the crontabs for users adm and root in order to start the
programs dodisk, ckpacct, runacct, and monacct automatically.
# EDITOR=vi;export EDITOR
# crontab -e
30 22 * * 4 /usr/lib/acct/dodisk
# crontab -e adm
0 * * * * /usr/lib/acct/ckpacct
30 2 * * * /usr/lib/acct/runacct 2> \
/var/adm/acct/nite/fd2log
30 7 1 * * /usr/lib/acct/monacct
# /etc/init.d/acct start
# vi /etc/acct/holidays
* @(#)holidays 2.0 of 1/1/99
* Prime/Nonprime Table for UNIX Accounting System
*
* Curr Prime Non-Prime
* Year Start Start
*
1999 0800 1800
*
* only the first column (month/day) is significant.
*
* month/day Company
* Holiday
*
Generating the Data and Reports
Raw Data
Raw data is stored in four separate files:
● /var/adm/pacct
● /var/adm/wtmp
● /var/adm/acct/nite/disktacct
● /var/adm/fee
/var/adm/pacct File
/var/adm/wtmp File
● ttymon – The port monitor that monitors the serial port for server
requests such as login
● login – At login
/var/adm/acct/nite/disktacct File
Entries are generated once a day by dodisk. The format of the tacct
structure is described in the acct(4) man page. It is also illustrated
later in this appendix.
/var/adm/fee File
rprtMMDD Daily Report File
A daily report file, /var/adm/acct/sum/rprtMMDD, is generated
each night by runacct from the raw data that has been collected. MM
represents the month (two digits), and DD represents the day.
The Daily Report shows the use of the system interfaces and the basic
availability.
24 system boot
3 run-level 3
3 acctg on
2 run-level 6
2 acctg off
1 acctcon
● TOTAL DURATION – The time for which the system was available to
the user during this accounting period.
● # ON – The same number as # SESS, because it is no longer possible
to invoke login directly.
● CONNECT (MINS) – The actual time the user was logged in.
sendmail ...
grep       74     5.05    0.11    2.50   44.69   0.00   0.05   50447   117
...
As shown in this output, the following fields are displayed:
Last Login
00-00-00  uucp
00-00-00  adm
00-00-00  bin
00-00-00  sys
95-03-25  root
95-04-21  otto
95-04-22  dmpt
93-10-04  guest
92-12-15  tjm
92-07-28  casey
The runacct shell script takes care not to damage files if errors occur.
A series of protection mechanisms are used that attempt to recognize
an error, provide intelligent diagnostics, and complete processing in
such a way that runacct can be restarted with minimal intervention.
It records its progress by writing descriptive messages into the active
file. (Files used by runacct are assumed to be in the
/var/adm/acct/nite directory, unless otherwise noted.) All
diagnostic output during the execution of runacct is written into
fd2log.
When runacct is invoked, it creates the files lock and lock1. These
files are used to prevent simultaneous execution of runacct. The
runacct program prints an error message if these files exist when it is
invoked. The lastdate file contains the month and day runacct was
last invoked, and is used to prevent more than one execution per day.
If runacct detects an error, a message is written to the console, mail is
sent to root and adm, locks can be removed, diagnostic files are saved,
and execution is ended.
runacct States
SETUP
wtmpfix
The wtmpfix program checks the wtmp.MMDD file in the nite directory
for accuracy. Because some date changes will cause acctcon to fail,
wtmpfix attempts to adjust the time stamps in the wtmp file if a record
of a date change appears. It also deletes any corrupted entries from the
wtmp file. The fixed version of wtmp.MMDD is written to tmpwtmp.
PROCESS
MERGE
The MERGE program merges the process accounting records with the
connect accounting records to form daytacct.
FEES
The MERGE program merges ASCII tacct records from the fee file into
daytacct.
DISK
If the dodisk procedure has been run, producing the disktacct file,
the DISK program merges the file into daytacct and moves
disktacct to /tmp/disktacct.MMDD.
MERGETACCT
CMS
USEREXIT
CLEANUP
This program cleans up temporary files, runs prdaily and saves its
output in sum/rpt.MMDD, removes the locks, then exits. Remember,
when restarting runacct in the CLEANUP state, remove the last ptacct
file because it will not be complete.
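When a restart is needed, runacct takes the day and, optionally, the
state to resume from on its command line; for example (the date here
is hypothetical):
# /usr/lib/acct/runacct 0601 MERGE 2>> /var/adm/acct/nite/fd2log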
Periodic Reports
Monthly reports (generated by monacct) follow the same formats as
the daily reports. Further custom analysis is required to use these
reports for true accounting. Certain features should, however, be
noted:
● The report includes all activities since monacct was last run. There
is no interface report.
● The report is in /var/adm/acct/fiscal/fiscrptMM, where
MM represents the current month.
● Monthly summary files are held in /var/adm/acct/fiscal.
● Daily summary files are deleted following monthly reporting.
● Daily reports are deleted following monthly reporting.
● User
● Starting time
● Ending time
● Hog factor
● CPU factor
● Characters transferred
● Blocks read
acctcom Options
Some options that can be used with acctcom are:
● -C sec – Shows only processes with total CPU time (system plus
user) exceeding sec seconds.
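For example, to report only the processes in the current process
accounting file that used more than 10 seconds of CPU time (the
threshold is arbitrary):
# acctcom -C 10 /var/adm/pacct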
Generating Custom Analyses
The following pages show the structures that define the format of the
entries in the raw data files. Where custom analysis programs are
used, runacct and monacct should not be run.
struct utmp {
        char   ut_user[8];          /* User login name */
        char   ut_id[4];            /* /etc/inittab id (usually line #) */
        char   ut_line[12];         /* device name (console, lnxx) */
        short  ut_pid;              /* short for compat. - process id */
        short  ut_type;             /* type of entry */
        struct exit_status ut_exit; /* The exit status of a process
                                     * marked as DEAD_PROCESS. */
        time_t ut_time;             /* time entry was made */
};
/var/adm/pacct File
The /var/adm/pacct file is the process accounting file. The format of
the information contained in this file is defined in the <sys/acct.h>
file.
struct acct
{
        char   ac_flag;             /* Accounting flag */
        char   ac_stat;             /* Exit status */
        uid_t  ac_uid;              /* Accounting user ID */
        gid_t  ac_gid;              /* Accounting group ID */
        dev_t  ac_tty;              /* control typewriter */
        time_t ac_btime;            /* Beginning time */
        comp_t ac_utime;            /* acctng user time in clock ticks */
        comp_t ac_stime;            /* acctng system time in clock ticks */
        comp_t ac_etime;            /* acctng elapsed time in clock ticks */
        comp_t ac_mem;              /* memory usage */
        comp_t ac_io;               /* chars transferred */
        comp_t ac_rw;               /* blocks read or written */
        char   ac_comm[8];          /* command name */
};
/*
 * total accounting (for acct period), also for day
 */
struct tacct {
        uid_t          ta_uid;      /* userid */
        char           ta_name[8];  /* login name */
        float          ta_cpu[2];   /* cum. cpu time, p/np (mins) */
        float          ta_kcore[2]; /* cum kcore-minutes, p/np */
        float          ta_con[2];   /* cum. connect time, p/np, mins */
        float          ta_du;       /* cum. disk usage */
        long           ta_pc;       /* count of processes */
        unsigned short ta_sc;       /* count of login sessions */
        unsigned short ta_dc;       /* count of disk samples */
        unsigned short ta_fee;      /* fee for special services */
};
Summary of Accounting Programs and Files
Table C-1 summarizes the accounting scripts, programs, raw data files,
and report files discussed in this appendix.
Program or Script            File Created
init                         /var/adm/wtmp
startup/shutacct             /var/adm/wtmp
ttymon                       /var/adm/wtmp
login                        /var/adm/wtmp
kernel (exit() function)     /var/adm/pacct
dodisk                       /var/adm/acct/nite/disktacct
chargefee                    /var/adm/fee
runacct (a)                  /var/adm/acct/sum/rprtMMDD
monacct                      /var/adm/acct/fiscal/fiscrptMM

a. Actually, runacct calls the prdaily script to create the rprtMMDD file.
Objectives
Upon completion of this appendix, you should be able to:
● Define the following terms: back file system, front file system, cache
directory, and write-through cache
Introduction
Function
CacheFS works with NFS to allow the use of high-speed, high-
capacity, local disk drives to store frequently used data from a remote
file system. As a result, subsequent access to this data is much quicker
than having to access the remote file system each time. CacheFS works
similarly with High Sierra File System (HSFS)-based media (such as
CD-ROMs).
Characteristics
CacheFS has these characteristics:
Terminology
Some terms used in connection with CacheFS are:
● Back file system
The back file system is the file system being cached (for example,
an NFS or HSFS file system).
● Front file system
The front file system is the cache itself (where the UFS file system
cache data resides).
● Cache directory
This local UFS directory holds cached data. A file system may be
dedicated to the cache directory, or the cache directory may use a
pre-existing file system containing other directories.
● Write-through cache
CacheFS is a write-through cache: data written to a cached file
system is written to the back file system at the same time it is
written to the cache.
Benefits
The benefits of CacheFS are:
● Improved performance
● Non-volatile storage
The cached data survives system crashes and reboots (clients can
quickly resume work using cached data).
Required Software
CacheFS is part of the SUNWcsu and SUNWcsr packages installed on
every system – no additional software is necessary.
Creating a Cache
CacheFS caches are created with the cfsadmin command.
cfsadmin -c cachedirname
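For example, to create a cache in a hypothetical directory on a local
UFS file system:
# cfsadmin -c /cache/cache0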
Displaying Information
Use the -l option to cfsadmin to see the current cache information,
including the file system ID.
cfsadmin -l
2. Disconnect the cache from the file system, using the file system ID
obtained from the cfsadmin -l command.
cfsadmin -d ID directory
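Continuing the hypothetical example, after the cached file systems
have been unmounted, the ID reported by cfsadmin -l is passed to
cfsadmin -d (cache_ID stands for the ID string printed by -l):
# cfsadmin -l /cache/cache0
# cfsadmin -d cache_ID /cache/cache0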
To mount an NFS file system on a client using CacheFS, use the normal
NFS mount command format, specifying a file system type of cachefs
instead of nfs. The mount options specify the remainder of the
required CacheFS information.
mount -F cachefs -o backfstype=nfs,cachedir=cachedirname merlin:/docs /nfsdocs
Using /etc/vfstab
CacheFS mounts can be placed in /etc/vfstab like any other mount
request. The "device to fsck" is the cache itself. Remember to include
the proper mount options.
merlin:/doc /cache/doc /nfsdocs cachefs 2 yes rw,backfstype=nfs,cachedir=/cache/doc
In the master map, this provides defaults for all the individual mounts
done under the /home directory. The options can also be used on direct
maps.
While running, CacheFS checks the consistency and status of all files it
is caching on a regular basis. File attributes (last modification time) are
checked on every file access. This can be adjusted using the standard
actimeout parameter.
fsck
A cache file system uses fsck to check the consistency of the cache. It
is run at boot time, against the directory specified in /etc/vfstab. It
will automatically correct any errors it finds.
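To run the check manually, name the cache directory explicitly; the
path below is the hypothetical cache created earlier:
# fsck -F cachefs /cache/cache0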
If the files in the mount point will be used only by the system that they
are mounted to, the non_shared option prevents the deletion of an
updated file. Both the cache and NFS server will be updated.
Warning – Do not use the non_shared option with shared files. Data
corruption could occur.
cachefsstat
cachefsstat is used to display statistics about a particular local
cachefs mount point. This information includes cache hit rates,
writes, file creates and consistency checks.
cachefsstat [-z] [path ...]
If no path is given, statistics for all mount points are listed. The -z
option reinitializes the counters to zero for the specified (or all) paths.
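For example, to display and then reset the statistics for the
/nfsdocs mount point used in the earlier mount example:
# cachefsstat /nfsdocs
# cachefsstat -z /nfsdocs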
cachefslog
The cachefslog command specifies the location of the cachefs
statistics log for a given mount point and starts logging, or halts
logging altogether.
cachefslog [-f logfile] [-h] local_mount_point
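For example, to start logging the /nfsdocs mount point to the log
file used in the cachefswssize example that follows, and later to
halt logging for that mount point:
# cachefslog -f /var/tmp/cfslog /nfsdocs
# cachefslog -h /nfsdocs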
cachefswssize
Using the information from the cachefs log, which is managed by
cachefslog, cachefswssize calculates the amount of cache disk
space required for each mounted file system.
cachefswssize logfile
# cachefswssize /var/tmp/cfslog
/home/tlgbs
end size: 10688k
high water size: 10704k
/mswre
end size: 128k
high water size: 128k
/usr/dist
end size: 1472k
high water size: 1472k
cachefspack
The cachefspack command provides the ability to control and order
the files in the cachefs disk cache for better performance. The rules
for packing files are given in the packingrules(4) man page.
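For example, to pack a directory of frequently used files into the
cache, and to unpack it when it is no longer needed (the path is
hypothetical):
# cachefspack -p /nfsdocs/reference
# cachefspack -u /nfsdocs/reference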
mount -F cachefs -o
backfstype=hsfs,cachedir=cachedirname,ro, \
backpath=/cdrom/... /cdrom/.../doc
Exercise objective – In this lab you will complete the following tasks:
Preparation
For this lab you will need:
a. Log into the server and edit the /etc/dfs/dfstab file to share
a file system. For example:
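A typical entry in dfstab might look like this (the exported
directory is an assumption for this exercise):
share -F nfs -o ro /export/labdata
Then start the NFS server daemons if they are not already running: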
# /etc/init.d/nfs.server start
Tasks
Complete the following steps:
Does the snoop output reflect remote file access operations? Which
types of NFS operations occur?
____________________________________
____________________________________
____________________________________
____________________________________
____________________________________
____________________________________
Does the snoop output reflect remote file access operations? Which
types of NFS operations occur?
____________________________________
____________________________________
____________________________________
Is the snoop output updated with more NFS operations? Why not?
____________________________________
____________________________________
____________________________________
13. Edit the /etc/vfstab file to include an entry for the cached file
system.
server_name:/back_fs_path - /mnt cachefs 9 yes \
ro,backfstype=nfs,cachedir=/front_fs_path/cachedir
15. Use the mountall command to mount the cached file system
using the entry in /etc/vfstab.
# mountall -F cachefs /etc/vfstab
#
19. Use the cfsadmin command to obtain the cache ID for the cached
file system.
# cfsadmin -l /front_fs_path/cachedir
<Observe output>
#
23. Remove the entry in the /etc/vfstab file for the cached file
system
Additional Resources
● sar
● vmstat
● iostat
● mpstat
● netstat
● nfsstat
Additional Resources
● Man pages for the various performance tools (sar, vmstat, iostat,
mpstat, netstat, nfsstat).
● Solaris 7 AnswerBook.
08:32:43      7      2      0     91
fd0        0    0.0     0      0    0.0    0.0
nfs1       0    0.0     0      0    0.0    0.0
nfs2       4    0.1     2    160    0.3   26.0
nfs3       0    0.0     0      0    0.0    0.0
nfs5       0    0.0     0      0    0.0    0.0
sd0        0    0.0     0      0    0.0    0.0
sd0,a      0    0.0     0      0    0.0    0.0
sd0,b      0    0.0     0      0    0.0    0.0
sd0,c      0    0.0     0      0    0.0    0.0
sd0,d      0    0.0     0      0    0.0    0.0
sd6        0    0.0     0      0    0.0    0.0
Other Options
● -e time – Select data up to time (the default is 18:00).
● -f filename – Extract data from filename rather than from the
current daily data file.
● -i sec – Select records as close as possible to sec seconds apart.
● -o filename – Save the samples in filename in binary format.
● -s time – Select data starting at time (the default is 08:00).
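For example, a binary data file such as the sar.file3 used in the
exercises could have been collected and later replayed as follows
(the interval and count are arbitrary):
# sar -u -o sar.file3 5 60
# sar -u -f sar.file3
Because -o saves the complete set of activity counters, the same file
can later be examined with -u, -q, -d, -r, or -g.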
# mpstat 2
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0     0    0   35    0    2    1    0    96    0   0   0 100
  2    0   0    4   208    8   34    0    2    1    0    81    0   0   0 100
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    2   0    0     0    0   30    0    1    0    0    54    0   0   0 100
  2    0   0    4   212   11   21    0    1    0    0    25    0   0   0 100
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0     0    0   31    0    1    0    0    53    0   0   2  98
  2    0   0    4   248   48   28    0    1    0    0    35    0   0   0 100
^C#
Where:
● CPU – The processor ID.
● minf – The number of minor faults.
● mjf – The number of major faults.
● xcal – The number of inter-processor cross-calls.
These occur when one CPU wakes up another CPU
by interrupting it.
● intr – The number of interrupts.
● ithr – The number of interrupts as threads (not
counting clock interrupt).
● csw – The number of context switches.
● icsw – The number of involuntary context switches.
● migr – The number of thread migrations to another
processor.
● smtx – The number of spins on mutexes (mutual
exclusion locks); the number of times the CPU failed
to obtain a mutex on the first try.
● srw – The number of spins on reader/writer locks.
● syscl – The number of system calls.
● usr – The percentage of time spent in user mode.
● sys – The percentage of time spent in system mode.
● wt – The percentage of time spent waiting for I/O.
● idl – The percentage of time the CPU was idle.
UDP
Local Address Remote Address State
-------------------- -------------------- -------
*.sunrpc Idle
*.* Unbound
*.32771 Idle
*.32773 Idle
*.lockd Idle
...
...
TCP
   Local Address        Remote Address     Swind Send-Q Rwind Recv-Q  State
-------------------- -------------------- ----- ------ ----- ------ -------
      *.*                  *.*                 0      0     0      0 IDLE
      *.sunrpc             *.*                 0      0     0      0 LISTEN
      *.*                  *.*                 0      0     0      0 IDLE
      *.32771              *.*                 0      0     0      0 LISTEN
      *.32772              *.*                 0      0     0      0 LISTEN
      *.ftp                *.*                 0      0     0      0 LISTEN
      *.telnet             *.*                 0      0     0      0 LISTEN
Active UNIX domain sockets
Address Type Vnode Conn Local Addr Remote Addr
3000070d988 stream-ord 30000a67708 00000000 /var/tmp/aaa3daGZa
3000070db20 stream-ord 300000fe7b0 00000000 /tmp/.X11-unix/X0
3000070dcb8 stream-ord 00000000 00000000
#
segmap:
fault 494755 faulta 0 getmap 728460 get_use 154 get_reclaim 656037 get_reuse 65235
get_unused 0 get_nofree 0 rel_async 15821 rel_write 16486 rel_free 376
rel_abort 0 rel_dontneed 15819 release 711598 pagecreate 64294
...
...
lo0:
ipackets 20140 opackets 20140
hme0:
ipackets 873597 ierrors 0 opackets 831583 oerrors 0 collisions 0
defer 0 framing 0 crc 0 sqe 0 code_violations 0 len_errors 0
ifspeed 100 buff 0 oflo 0 uflo 0 missed 0 tx_late_collisions 0
retry_error 0 first_collisions 0 nocarrier 0 inits 8 nocanput 0
allocbfail 0 runt 0 jabber 0 babble 0 tmd_error 0 tx_late_error 0
rx_late_error 0 slv_parity_error 0 tx_parity_error 0 rx_parity_error 0
slv_error_ack 0 tx_error_ack 0 rx_error_ack 0 tx_tag_error 0
rx_tag_error 0 eop_error 0 no_tmds 0 no_tbufs 0 no_rbufs 0
rx_late_collisions 0 rbytes 543130524 obytes 637185145 multircv 34120 multixmt 0
brdcstrcv 47166 brdcstxmt 100 norcvbuf 0 noxmtbuf 0 phy_failures 0
...
...
lm_config:
buf_size 80 align 8 chunk_size 80 slab_size 8192 alloc 1 alloc_fail 0
free 0 depot_alloc 0 depot_free 0 depot_contention 0 global_alloc 1
global_free 0 buf_constructed 0 buf_avail 100 buf_inuse 1
buf_total 101 buf_max 101 slab_create 1 slab_destroy 0 memory_class 0
hash_size 0 hash_lookup_depth 0 hash_rescale 0 full_magazines 0
empty_magazines 0 magazine_size 7 alloc_from_cpu0 0 free_to_cpu0 0
buf_avail_cpu0 0
multicast routing:
0 hits - kernel forwarding cache hits
0 misses - kernel forwarding cache misses
0 packets potentially forwarded
0 packets actually sent out
0 upcalls - upcalls made to mrouted
0 packets not sent out due to lack of resources
0 datagrams with malformed tunnel options
0 datagrams with no room for tunnel options
0 datagrams arrived on wrong interface
0 datagrams dropped due to upcall Q overflow
0 datagrams cleaned up by the cache
0 datagrams dropped selectively by ratelimiter
0 datagrams dropped - bucket Q overflow
0 datagrams dropped - larger than bkt size
TCP
   Local Address        Remote Address     Swind Send-Q Rwind Recv-Q    State
-------------------- -------------------- ----- ------ ----- ------ -----------
129.147.11.213.1023  129.147.4.111.2049    8760      0  8760      0 ESTABLISHED
127.0.0.1.32805      127.0.0.1.32789      32768      0 32768      0 ESTABLISHED
127.0.0.1.32789      127.0.0.1.32805      32768      0 32768      0 ESTABLISHED
...
...
129.147.11.213.34574 129.147.4.36.45430    8760      0  8760      0 CLOSE_WAIT
129.147.11.213.974   129.147.4.43.2049     8760      0  8760      0 ESTABLISHED
Active UNIX domain sockets
Address Type Vnode Conn Local Addr Remote Addr
3000070d988 stream-ord 30000a67708 00000000 /var/tmp/aaa3daGZa
3000070db20 stream-ord 300000fe7b0 00000000 /tmp/.X11-unix/X0
3000070dcb8 stream-ord 00000000 00000000
#
Routing Table:
  Destination          Gateway            Flags  Ref   Use    Interface
-------------------- -------------------- ----- ----- ------ ---------
129.147.11.0         rafael                U      3     165   hme0
224.0.0.0            rafael                U      3       0   hme0
default              129.147.11.248        UG     0    6097
localhost            localhost             UH     0    9591   lo0
#
UDP
udpInDatagrams      = 30568     udpInErrors         = 0
udpOutDatagrams     = 30537
TCP
Local/Remote Address Swind Snext    Suna     Rwind Rnext    Rack     Rto   Mss   State
-------------------- ----- -------- -------- ----- -------- -------- ----- ----- -----------
rafael.1023
   ra.nfsd            8760 b6e2967a b6e2967a  8760 09c399a6 09c399a6   405  1460 ESTABLISHED
localhost.32805
   localhost.32789   32768 02adf2a5 02adf2a5 32768 02afd90f 02afd90f  4839  8192 ESTABLISHED
localhost.32789
   localhost.32805   32768 02afd90f 02afd90f 32768 02adf2a5 02adf2a5  4818  8192 ESTABLISHED
localhost.32808
   localhost.32802   32768 02b68d0b 02b68d0b 32768 02b71347 02b71347   433  8192 ESTABLISHED
localhost.32802
   localhost.32808   32768 02b71347 02b71347 32768 02b68d0b 02b68d0b   406  8192 ESTABLISHED
localhost.32811
   localhost.32810   32768 02b7d3ce 02b7d3ce 32768 02b86f76 02b86f76   462  8192 ESTABLISHED
localhost.32810
   localhost.32811   32768 02b86f76 02b86f76 32768 02b7d3ce 02b7d3ce  3375  8192 ESTABLISHED
localhost.32814
   localhost.32802   32768 02c1ae7e 02c1ae7e 32768 02c3aa4d 02c3aa4d   407  8192 ESTABLISHED
localhost.32802
Client rpc:
Connection oriented:
calls badcalls badxids timeouts newcreds badverfs
149404 4 0 0 0 0
timers cantconn nomem interrupts
0 4 0 0
Connectionless:
calls badcalls retrans badxids timeouts newcreds
9 1 0 0 0 0
badverfs timers nomem cantsend
0 4 0 0
Client nfs:
calls badcalls clgets cltoomany
146831 1 146831 0
Version 2: (6 calls)
null getattr setattr root lookup readlink
0 0% 5 83% 0 0% 0 0% 0 0% 0 0%
read wrcache write create remove rename
0 0% 0 0% 0 0% 0 0% 0 0% 0 0%
link symlink mkdir rmdir readdir statfs
0 0% 0 0% 0 0% 0 0% 0 0% 1 16%
Version 3: (144706 calls)
null getattr setattr lookup access readlink
0 0% 42835 29% 5460 3% 25355 17% 18010 12% 742 0%
read write create mkdir symlink mknod
18888 13% 23273 16% 2985 2% 188 0% 3 0% 0 0%
remove      rmdir       rename      link        readdir     readdirplus
1870 1%     5 0%        944 0%      17 0%       990 0%      744 0%
fsstat fsinfo pathconf commit
73 0% 32 0% 275 0% 2017 1%
Client nfs_acl:
Version 2: (1 calls)
null getacl setacl getattr access
0 0% 0 0% 0 0% 1 100% 0 0%
Version 3: (2118 calls)
null getacl setacl
0 0% 2118 100% 0 0%
#
Server nfs:
calls badcalls
0 0
Version 2: (0 calls)
null getattr setattr root lookup readlink
0 0% 0 0% 0 0% 0 0% 0 0% 0 0%
read wrcache write create remove rename
0 0% 0 0% 0 0% 0 0% 0 0% 0 0%
link symlink mkdir rmdir readdir statfs
0 0% 0 0% 0 0% 0 0% 0 0% 0 0%
Version 3: (0 calls)
null getattr setattr lookup access readlink
0 0% 0 0% 0 0% 0 0% 0 0% 0 0%
read write create mkdir symlink mknod
0 0% 0 0% 0 0% 0 0% 0 0% 0 0%
remove      rmdir       rename      link        readdir     readdirplus
0 0%        0 0%        0 0%        0 0%        0 0%        0 0%
fsstat fsinfo pathconf commit
0 0% 0 0% 0 0% 0 0%
Client nfs:
calls badcalls clgets cltoomany
159727 1 159727 0
Version 2: (6 calls)
null getattr setattr root lookup readlink
0 0% 5 83% 0 0% 0 0% 0 0% 0 0%
read wrcache write create remove rename
0 0% 0 0% 0 0% 0 0% 0 0% 0 0%
link symlink mkdir rmdir readdir statfs
0 0% 0 0% 0 0% 0 0% 0 0% 1 16%
Version 3: (157404 calls)
null getattr setattr lookup access readlink
0 0% 45862 29% 5622 3% 27178 17% 19391 12% 784 0%
read write create mkdir symlink mknod
21439 13% 26273 16% 3224 2% 190 0% 4 0% 0 0%
remove      rmdir       rename      link        readdir     readdirplus
2008 1%     5 0%        1073 0%     17 0%       1012 0%     771 0%
fsstat fsinfo pathconf commit
73 0% 35 0% 288 0% 2155 1%
#
Server rpc:
Connection oriented:
calls      badcalls   nullrecv   badlen     xdrcall    dupchecks  dupreqs
0          0          0          0          0          0          0
Connectionless:
calls      badcalls   nullrecv   badlen     xdrcall    dupchecks  dupreqs
2          0          0          0          0          0          0
Client rpc:
Connection oriented:
calls badcalls badxids timeouts newcreds badverfs
162588 4 0 0 0 0
timers cantconn nomem interrupts
0 4 0 0
Connectionless:
calls badcalls retrans badxids timeouts newcreds
9 1 0 0 0 0
badverfs timers nomem cantsend
0 4 0 0
#
Server rpc:
Connection oriented:
calls      badcalls   nullrecv   badlen     xdrcall    dupchecks  dupreqs
0          0          0          0          0          0          0
Connectionless:
calls      badcalls   nullrecv   badlen     xdrcall    dupchecks  dupreqs
2          0          0          0          0          0          0
Server nfs:
calls badcalls
0 0
Version 2: (0 calls)
null getattr setattr root lookup readlink
0 0% 0 0% 0 0% 0 0% 0 0% 0 0%
read wrcache write create remove rename
0 0% 0 0% 0 0% 0 0% 0 0% 0 0%
link symlink mkdir rmdir readdir statfs
0 0% 0 0% 0 0% 0 0% 0 0% 0 0%
Version 3: (0 calls)
null getattr setattr lookup access readlink
0 0% 0 0% 0 0% 0 0% 0 0% 0 0%
read write create mkdir symlink mknod
0 0% 0 0% 0 0% 0 0% 0 0% 0 0%
remove      rmdir       rename      link        readdir     readdirplus
0 0%        0 0%        0 0%        0 0%        0 0%        0 0%
fsstat fsinfo pathconf commit
0 0% 0 0% 0 0% 0 0%
Server nfs_acl:
Version 2: (0 calls)
null getacl setacl getattr access
0 0% 0 0% 0 0% 0 0% 0 0%
Version 3: (0 calls)
null getacl setacl
0 0% 0 0% 0 0%
#
Index
bottlenecks operation 5-7
CPU 11-17 PDC 6-8
I/O 11-13 performance 5-13
memory 11-10 performance issues 5-28
performance 11-8 physical address 5-16
BSD replacement 5-9
file system 9-5 segmap 9-30
bufhwm 9-19 using 9-33
bus set associative 5-17
characteristics 7-4 snooping 5-21
definition 7-3 thrashing 5-23
Fibre Channel 8-15 tuning 5-30
Gigaplane 7-7 UFS size 9-14
Gigaplane XB 7-8 unified 5-18
limits 7-21 virtual address 5-15, 6-8
MBus 7-6 write-back 5-19
overload 7-26 write-through 5-19
PCI 7-12 cachefs D-1
peripheral 7-11 benefits D-3
problems 7-23 CD-ROM D-11
SBus 7-12 characteristics D-2
SCSI 8-3 exercise D-12
tuning reports 7-29 limitations D-4
UPA 7-7 mounting D-7
XDbus 7-6 operation D-8
setting up D-5
C writes D-8
cachefslog D-9
cable cachefspack D-10
SCSI lengths 8-9 cachefsstat D-9
cache cachefswssize D-10
access times 5-27 caches
cachefs D-1 CPU 5-6
characteristics 5-14 cancellation
definition 5-3 write 5-19
direct mapped 5-17 capabilities
directory name 9-10 SyMON B-3
disk drive 8-27 CD-ROM
file system metadata 9-19 cachefs D-11
hardware 5-4 cfsadmin D-5
Harvard 5-18 characteristics
hierarchy 5-25 cache 5-14
hit rate 5-11 file system 9-5
inode 9-12 SBus A-3
miss 5-8 SCSI bus 8-5
NFS D-1 chargefee C-6
I/O performance file system
planning 8-36, 9-43 allocation 9-23
I/O time calculation 8-22 performance 9-24
I/O time components 8-18 block size 9-5
iostat service time 8-59 directory 9-9
latency timing 8-23 fragments 9-28
multiple zone recording 8-25 journaling 9-29
ordered seeks 8-30 layout 9-6
swap space 6-39 local 9-5
tagged queueing 8-29 logging 9-29
tuning statistics 8-55 statistics 9-34
disk accounting C-4 types 9-3
dispadmin 4-12 files
dispatch accounting C-5
kernel threads 4-21 fork 3-3
dispatch parameter table fragments 9-28
interactive/timesharing 4-7 optimization 9-63
real-time 4-18 free memory queue 6-11
dispatch priorities 4-6 fsck
DLAT 6-8 cachefs D-8
DNLC 9-10 fsfluch 9-37
cache hit rate 9-34 fstyp 9-7, 9-63
size 9-14 full duplex Ethernet 10-5
dodisk C-6
DRAM 5-5 G
DVMA 7-14
dynamic reconfiguration 7-8, Gigaplane bus 7-7
7-28 Gigaplane XB 7-8
processor sets 4-24 graphs
performance 1-12
guidelines 11-3
E
elevator seeks 8-30 H
enable_grp_ism E-6
Ethernet handspreadpages 6-16
full duplex 10-5 hard link 9-21
exec 3-3 hardware caches 5-4
exit 3-3 Harvard cache 5-18
hierarchy
cache 5-25
F hit rate
fast write cache 5-11
disk 8-27
fastscan 6-14 I
fencing
processors 4-23 I/O
Fibre Channel 8-15 application 9-30
maxuprc 3-9 npty 3-9
maxusers 9-14 NTP 3-24
MBus 7-6 nulladm C-6
memory
bottlenecks 11-10 O
free queue 6-11
interleaving 7-18 objectives
SRAM and DRAM 5-5 course xxvi
tuning summary 6-63 optimization
virtual 6-3 compiler 11-6
memstat 2-8 overload
message queues SBus 7-27
parameters E-11 overloaded bus 7-26
metadata cache 9-19 overview
minfree 6-14, 9-64 course xx, xxi, xxii
miss SCSI bus 8-3
cache 5-8
mmap 9-41 P
mode pages 8-31 pacct
modload E-3 acctcom C-21
modunload E-3 packet switched bus 7-4
monacct C-6 page daemon 6-12
mount processing 6-14
atime 9-17 page descriptor cache 6-8
cachefs D-7 page scanner 6-14
mpstat 2-10, 4-36 paging 6-11
multiple zone recording 8-25 clock algorithm 6-16
multithreading 3-10 default parameters 6-15
I/O 6-18
N priority 6-25
ncsize 3-9, 9-10 review 6-48
ndd 2-24, 10-5, 10-14 statistics 6-27
ndquot 3-9 parameters
netstat 2-11, 10-28 autoup 9-37
network bufhwm 9-19
bandwidth 10-4 default paging 6-15
hardware performance 10-16 desfree 6-14
tuning 10-3 dispatch parameter table
tuning reports 10-28 interactive/timesharing 4-
Network Time Protocol 3-24 7
newfs 9-6, 9-63 real-time 4-18
NFS 10-18 enable_grp_ism E-6
problems 10-27 fastscan 6-14
server daemon threads 10-21 file system 9-63
nfsstat 2-12, 10-28 handspreadpages 6-16
IPC E-1
real-time scheduling class 4-15 using 9-33
replacement segment
cache 5-9 types 6-6
resolution segments 6-4
clock 3-23 semaphores
response time parameters E-8
definition 1-11 server
run queue 4-27 SyMON supported B-4
runacct C-6, C-17 server I/O boards 7-13
states C-18 server threads
NFS 10-21
S servers
SBus capabilities A-3
sample tuning task 1-13 service time
SBus 7-12 definition 1-11
overload 7-27 iostat 8-59
SBus capabilities A-3 set associative cache 5-17
scheduling shared libraries 6-47
classes 4-4 shared memory
interactive class 4-14 parameters E-6
real-time class 4-15 sircuit switched bus 7-4
real-time issues 4-16 slice 9-5
states 4-3 layout 9-6
SCSI slowscan 6-14
storage arrays 8-39 SNMP
SCSI bus SyMON configuration B-10
addressing 8-11 snooping
characteristics 8-5 cache 5-21
device properties 8-32 soft link 9-22
lengths 8-9 Solaris Desktop Extensions 2-18
options 8-7 source code
overview 8-3 tuning 11-5
properties summary 8-10 SRAM 5-5
speeds 8-6 statistics
tape drives 8-38 application I/O 9-65
target priorities 8-14 CPU 4-30
terminators 8-9 file system 9-34
tuning statistics 8-55 I/O 8-55
widths 8-8 paging 6-27
SCSI device swapping 6-43
mode pages 8-31 sticky bit 9-17
scsi_options 8-7 storage array
scsiinfo 8-34 architecture 8-40
SE Toolkit 2-20 storage arrays 8-39
segmap cache 9-30 summary
bypassing 9-39 accounting C-28
NFS 10-18 Virtual Adrian 2-20
sample task 1-13 virtual memory 6-3
source code 11-5 /tmp 6-41
TCP 10-9 madvise 9-42
terminology 1-10 mmap 9-41
tips 11-21 segments 6-4
tuning guidelines 11-3 tuning summary 6-63
tuning reports vmstat 2-8
bus 7-29 vnode
mpstat 4-36 interface 9-3
netstat 10-28 vxstat 2-9
network 10-28
nfsstat 10-28 W
tuning tools
accounting 2-23 wait 3-3
iostat 2-9 wait I/O
memstat 2-8 CPU time 4-28
mpstat 2-10 write
netstat 2-11 cachefs D-8
nfsstat 2-12 cancellation 5-19
process manager 2-18 write system call 9-31
SE Toolkit 2-20 write throttle
SyMON 2-13 UFS 9-57
vmstat 2-8 write-back cache 5-19
write-through cache 5-19
U
X
UFS
write throttle 9-57 XDbus 7-6
ufs_HW 9-57
ufs_LW 9-57 Z
ufs_ninode 3-9, 9-12 zombie process 3-5
unified cache 5-18
UPA bus 7-7
use_ism E-6
user
CPU time 4-28
utilization
definition 1-11
V
virtual address
cache 5-15
caching 6-8
lookup 6-10
translation 6-7