How To Write Shared Libraries
Ulrich Drepper
Abstract
Today, shared libraries are ubiquitous. Developers use them for multiple reasons and create
them just as they would create application code. This is a problem, though, since on many
platforms some additional techniques must be applied even to generate decent code. Even more
knowledge is needed to generate optimized code. This paper introduces the required rules and
techniques. In addition, it introduces the concept of ABI (Application Binary Interface) stability
and shows how to manage it.
function stub from a special file (with the filename extension .sa). At run-time a file ending in .so.X.Y.Z was used and it had to correspond to the used .sa file. This in turn required that an allocated entry in the stub table always had to be used for the same function. The allocation of the table had to be carefully taken care of. Introducing a new interface meant appending to the table. It was never possible to retire a table entry. To avoid using an old shared library with a program linked with a newer version, some record had to be kept in the application: the X and Y parts of the .so.X.Y.Z suffix were recorded and the dynamic linker made sure minimum requirements were met.

The benefit of the scheme is that the resulting program runs very fast. Calling a function in such a shared library is very efficient even for the first call. It can be implemented with only two absolute jumps: the first from the user code to the stub, and the second from the stub to the actual code of the function. This is probably faster than any other shared library implementation, but its speed comes at too high a price:

be followed to get optimal results. Explaining these rules will be the topic of a large portion of this paper.

Not all uses of DSOs are for the purpose of saving resources. DSOs are today also often used as a way to structure programs. Different parts of the program are put into separate DSOs. This can be a very powerful tool, especially in the development phase. Instead of relinking the entire program it is only necessary to relink the DSO(s) which changed. This is often much faster.

Some projects decide to keep many separate DSOs even in the deployment phase even though the DSOs are not reused in other programs. In many situations it is certainly a useful thing to do: DSOs can be updated individually, reducing the amount of data which has to be transported. But the number of DSOs must be kept to a reasonable level. Not all programs do this, though, and we will see later on why this can be a problem.

Before we can start discussing all this some understanding of ELF and its implementation is needed.
address space content is replaced by the content of the file containing the program. This does not happen by simply mapping (using mmap) the content of the file. ELF files are structured and there are normally at least three different kinds of regions in the file:

• Code which is executed; this region is normally not writable;
• Data which is modified; this region is normally not executable;
• Data which is not used at run-time; since not needed it should not be loaded at startup.

Modern operating systems and processors can protect memory regions to allow and disallow reading, writing, and executing separately for each page of memory.¹ It is preferable to mark as many pages as possible not writable since this means that the pages can be shared between processes which use the same application or DSO the page is from. Write protection also helps to detect and prevent unintentional or malicious modifications of data or even code.

For the kernel to find the different regions, or segments in ELF-speak, and their access permissions, the ELF file format defines a table which contains just this information, among other things. The ELF Program Header table, as it is called, must be present in every executable and DSO. It is represented by the C types Elf32_Phdr and Elf64_Phdr which are defined as can be seen in figure 1.

To locate the program header data structure another data structure is needed, the ELF Header. The ELF header is the only data structure which has a fixed place in the file, starting at offset zero. Its C data structure can be seen in figure 2. The e_phoff field specifies where, counting from the beginning of the file, the program header table starts. The e_phnum field contains the number of entries in the program header table and the e_phentsize field contains the size of each entry. This last value is useful only as a run-time consistency check for the binary.

The different segments are represented by the program header entries with the PT_LOAD value in the p_type field. The p_offset and p_filesz fields specify where in the file the segment starts and how long it is. The p_vaddr and p_memsz fields specify where the segment is located in the process's virtual address space and how large the memory region is. The value of the p_vaddr field itself is not necessarily required to be the final load address. DSOs can be loaded at arbitrary addresses in the virtual address space. But the relative position of the segments is important. For pre-linked DSOs the actual value of the p_vaddr field is meaningful: it specifies the address for which the DSO was pre-linked. But even this does not mean the dynamic linker cannot ignore this information if necessary.

The size in the file can be smaller than the address space it takes up in memory. The first p_filesz bytes of the memory region are initialized from the data of the segment in the file; the difference is initialized with zero. This can be used to handle BSS sections,² sections for uninitialized variables which are, according to the C standard, initialized with zero. Handling uninitialized variables this way has the advantage that the file size can be reduced since no initialization value has to be stored, no data has to be copied from disc to memory, and the memory provided by the OS via the mmap interface is already initialized with zero.

The p_flags field finally tells the kernel what permissions to use for the memory pages. This field is a bitmap with the bits given in the following table being defined. The flags are directly mapped to the flags mmap understands.

¹ A memory page is the smallest entity the memory subsystem of the OS operates on. The size of a page can vary between different architectures and even within systems using the same architecture.

² A BSS section contains only NUL bytes. Therefore it does not have to be represented in the file on the storage medium. The loader just has to know the size so that it can allocate memory large enough and fill it with NUL.
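As a minimal sketch of the layout just described, the following C code walks the program header table of an ELF image held in memory and translates p_flags bits into the protection bits mmap expects. The function names prot_from_pflags and count_load_segments are made up for this illustration. The sketch assumes a 64-bit image in native byte order and the Linux <elf.h> header; it is not meant to show how any kernel actually implements loading.

```c
#include <elf.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

/* Translate the PF_* bits of p_flags into PROT_* bits for mmap. */
int prot_from_pflags(uint32_t p_flags)
{
    int prot = PROT_NONE;
    if (p_flags & PF_R) prot |= PROT_READ;
    if (p_flags & PF_W) prot |= PROT_WRITE;
    if (p_flags & PF_X) prot |= PROT_EXEC;
    return prot;
}

/* Count the PT_LOAD entries of a 64-bit ELF image and add up the bytes
   which exist only in memory (p_memsz - p_filesz, e.g. for BSS).
   Returns -1 if the image does not look like a usable ELF64 file. */
int count_load_segments(const unsigned char *image, size_t size,
                        uint64_t *zero_bytes)
{
    Elf64_Ehdr eh;
    if (size < sizeof eh)
        return -1;
    memcpy(&eh, image, sizeof eh);      /* the ELF header sits at offset 0 */
    if (memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0
        || eh.e_ident[EI_CLASS] != ELFCLASS64)
        return -1;
    if (eh.e_phentsize != sizeof(Elf64_Phdr))   /* consistency check only */
        return -1;
    if (eh.e_phoff > size
        || (size - eh.e_phoff) / sizeof(Elf64_Phdr) < eh.e_phnum)
        return -1;

    int count = 0;
    *zero_bytes = 0;
    for (unsigned i = 0; i < eh.e_phnum; i++) {
        Elf64_Phdr ph;
        memcpy(&ph, image + eh.e_phoff + (size_t) i * sizeof ph, sizeof ph);
        if (ph.p_type != PT_LOAD)
            continue;
        count++;
        if (ph.p_memsz > ph.p_filesz)   /* tail is zero-filled, not read */
            *zero_bytes += ph.p_memsz - ph.p_filesz;
    }
    return count;
}
```

A caller would read or map a file into a buffer and pass the buffer with its length to count_load_segments; prot_from_pflags(ph.p_flags) then yields the third argument for an mmap call covering that segment.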
p_flags   Value   mmap flag    Description
PF_X      1       PROT_EXEC    Execute Permission
PF_W      2       PROT_WRITE   Write Permission
PF_R      4       PROT_READ    Read Permission

After mapping all the PT_LOAD segments using the appropriate permissions and the specified address, or after freely allocating an address for dynamic objects which have no fixed load address, the next phase can start. The virtual address space of the dynamically linked executable itself is set up. But the binary is not complete. The kernel has to get the dynamic linker to do the rest and for this the dynamic linker has to be loaded in the same way as the executable itself (i.e., look for the loadable segments in the program header). The difference is that the dynamic linker itself must be complete and should be freely relocatable.

Which binary implements the dynamic linker is not hard-coded in the kernel. Instead the program header of the application contains an entry with the tag PT_INTERP. The p_offset field of this entry contains the offset of a NUL-terminated string which specifies the file name of this file. The only requirement on the named file is that its load address does not conflict with the load address of any possible executable it might be used with. In general this means that the dynamic linker has no fixed load address and can be loaded anywhere; this is just what dynamic binaries allow.

Once the dynamic linker has also been mapped into the memory of the to-be-started process we can start the dynamic linker. Note it is not the entry point of the application to which control is transferred. Only the dynamic linker is ready to run. Instead of calling the dynamic linker right away, one more step is performed. The dynamic linker somehow has to be told where the application can be found and where control has to be transferred to once the application is complete. For this a structured way exists. The kernel puts an array of tag-value pairs on the stack of the new process. This auxiliary vector contains, beside the two aforementioned values, several more values which allow the dynamic linker to avoid several system calls. The elf.h header file defines a number of constants with an AT_ prefix. These are the tags for the entries in the auxiliary vector.

After setting up the auxiliary vector the kernel is finally ready to transfer control to the dynamic linker in user mode. The entry point is defined in the e_entry field of the ELF header of the dynamic linker.

1.5 Startup in the Dynamic Linker

The second phase of the program startup happens in the dynamic linker. Its tasks include:

• Determine and load dependencies;
• Relocate the application and all dependencies;
• Initialize the application and dependencies in the correct order.

In the following we will discuss in more detail only the relocation handling. For the other two points the way for better performance is clear: have fewer dependencies. Each participating object is initialized exactly once but some topological sorting has to happen. The identify and load process also scales with the number of dependencies; in most (all?) implementations this does not scale linearly.
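The auxiliary vector that the kernel sets up before transferring control can be inspected from ordinary user code. The following is only a sketch: it assumes a Linux system whose C library provides getauxval in <sys/auxv.h> (glibc 2.16 or later), and the structure and function name are invented for the example.

```c
#include <elf.h>
#include <sys/auxv.h>

/* A few of the AT_* entries the kernel passes to a new process. */
struct startup_info {
    unsigned long phdr;    /* AT_PHDR: address of the application's program header table */
    unsigned long phent;   /* AT_PHENT: size of one program header entry */
    unsigned long phnum;   /* AT_PHNUM: number of program header entries */
    unsigned long entry;   /* AT_ENTRY: entry point of the application */
    unsigned long base;    /* AT_BASE: load address of the dynamic linker, 0 if none */
    unsigned long pagesz;  /* AT_PAGESZ: page size, saves a system call */
};

/* Collect the values from the auxiliary vector of the running process. */
struct startup_info read_startup_info(void)
{
    struct startup_info info;
    info.phdr   = getauxval(AT_PHDR);
    info.phent  = getauxval(AT_PHENT);
    info.phnum  = getauxval(AT_PHNUM);
    info.entry  = getauxval(AT_ENTRY);
    info.base   = getauxval(AT_BASE);
    info.pagesz = getauxval(AT_PAGESZ);
    return info;
}
```

AT_PHDR and AT_ENTRY correspond to the two pieces of information mentioned in the text: where the application can be found and where control has to be transferred once the dynamic linker is done.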
Histogram for bucket list length in section [ 2] ’.hash’ (total of 191 buckets):
Addr: 0x00000114 Offset: 0x000114 Link to section: [ 3] ’.dynsym’
Length Number % of total Coverage
0 103 53.9%
1 71 37.2% 67.0%
2 16 8.4% 97.2%
3 1 0.5% 100.0%
Average number of tests: successful lookup: 1.179245
unsuccessful lookup: 0.554974
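The two averages in the output above follow directly from the bucket counts. The sketch below recomputes them under the usual assumptions: an unsuccessful lookup walks the whole chain of the bucket it hits, and a successful lookup for the k-th entry of a chain needs k tests. The type and function names are made up for the illustration.

```c
#include <stddef.h>

/* One histogram row: `count` buckets have a chain of `length` entries. */
struct bucket_hist {
    unsigned length;
    unsigned count;
};

/* Average number of tests for a successful lookup: the k-th symbol of a
   chain costs k tests, averaged over all symbols in the table. */
double avg_tests_successful(const struct bucket_hist *h, size_t n)
{
    double tests = 0.0, symbols = 0.0;
    for (size_t i = 0; i < n; i++) {
        double len = h[i].length;
        tests   += h[i].count * len * (len + 1.0) / 2.0;  /* 1 + 2 + ... + len */
        symbols += h[i].count * len;
    }
    return symbols > 0.0 ? tests / symbols : 0.0;
}

/* Average number of tests for an unsuccessful lookup: the whole chain of
   the selected bucket is walked, averaged over all buckets. */
double avg_tests_unsuccessful(const struct bucket_hist *h, size_t n)
{
    double entries = 0.0, buckets = 0.0;
    for (size_t i = 0; i < n; i++) {
        entries += (double) h[i].count * h[i].length;
        buckets += h[i].count;
    }
    return buckets > 0.0 ? entries / buckets : 0.0;
}
```

For the histogram shown (103, 71, 16, and 1 buckets of length 0 to 3) this gives 125/106 and 106/191, matching the printed 1.179245 and 0.554974.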
1. Determine the hash value for the symbol name.

2. In the first/next object in the lookup scope:

   2.a Determine the hash bucket for the symbol using the hash value and the hash table size in the object.

   2.b Get the name offset of the symbol and use it as the NUL-terminated name.

   2.c Compare the symbol name with the relocation name.

   2.d If the names match, compare the version names as well. This only has to happen if both the reference and the definition are versioned. It requires a string comparison, too. If the version name matches or no such comparison is performed, we found the definition we are looking for.

   2.e If the definition does not match, retry with the next element in the chain for the hash bucket.

   2.f If the chain does not contain any further element there is no definition in the current object and we proceed with the next object in the lookup scope.

3. If there is no further object in the lookup scope the lookup failed.

Note that there is no problem if the scope contains more than one definition of the same symbol. The symbol lookup algorithm simply picks up the first definition it finds. Note that a definition in a DSO being weak has no effect. Weak definitions only play a role in static linking. Having multiple definitions has some perhaps surprising consequences. Assume DSO ‘A’ defines and references an interface and DSO ‘B’ defines the same interface. If now ‘B’ precedes ‘A’ in the scope, the reference in ‘A’ will be satisfied by the definition in ‘B’. It is said that the definition in ‘B’ interposes the definition in ‘A’. This concept is very powerful since it allows a more specialized implementation of an interface to be used without replacing the general code. One example for this mechanism is the use of the LD_PRELOAD functionality of the dynamic linker where additional DSOs which were not present at link-time are introduced at run-time. But interposition can also lead to severe problems in ill-designed code. More on this in section 1.5.4.

Looking at the algorithm it can be seen that the performance of each lookup depends, among other factors, on the length of the hash chains and the number of objects in the lookup scope. These are the two loops described above. The lengths of the hash chains depend on the number of symbols and the choice of the hash table size. Since the hash function used in the initial step of the algorithm must never change these are the only two remaining variables. Many linkers do not put special emphasis on
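The lookup steps listed above can be sketched in C on a toy data structure. The hash function is the classic System V ELF hash; everything prefixed with toy_, as well as scope_lookup, is a hypothetical simplification in which symbol names are stored directly instead of as string table offsets and in which the version comparison of step 2.d is left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Step 1: the classic System V ELF hash function. */
uint32_t elf_sysv_hash(const char *name)
{
    const unsigned char *p = (const unsigned char *) name;
    uint32_t h = 0, g;
    while (*p != '\0') {
        h = (h << 4) + *p++;
        g = h & 0xf0000000u;
        if (g != 0)
            h ^= g >> 24;
        h &= ~g;
    }
    return h;
}

#define TOY_NBUCKET 4
#define TOY_NSYMS   8   /* index 0 is reserved and means "end of chain" */

struct toy_object {
    uint32_t bucket[TOY_NBUCKET];   /* first symbol index of each chain */
    uint32_t chain[TOY_NSYMS];      /* next symbol index in the same chain */
    const char *name[TOY_NSYMS];
    uint32_t nsyms;
};

void toy_init(struct toy_object *o)
{
    memset(o, 0, sizeof *o);
    o->nsyms = 1;
}

/* Add a definition; returns -1 if the toy table is full. */
int toy_add(struct toy_object *o, const char *name)
{
    if (o->nsyms >= TOY_NSYMS)
        return -1;
    uint32_t idx = o->nsyms++;
    uint32_t b = elf_sysv_hash(name) % TOY_NBUCKET;
    o->name[idx] = name;
    o->chain[idx] = o->bucket[b];   /* prepend to the bucket's chain */
    o->bucket[b] = idx;
    return 0;
}

/* Steps 2.a to 2.f for one object; counts the string comparisons made. */
uint32_t toy_lookup(const struct toy_object *o, const char *name, unsigned *ncmp)
{
    uint32_t h = elf_sysv_hash(name);
    for (uint32_t i = o->bucket[h % TOY_NBUCKET]; i != 0; i = o->chain[i]) {
        ++*ncmp;
        if (strcmp(o->name[i], name) == 0)
            return i;
    }
    return 0;
}

/* Steps 2 and 3: the first object in the scope with a definition wins. */
const struct toy_object *scope_lookup(const struct toy_object *const *scope,
                                      size_t nobjs, const char *name,
                                      unsigned *ncmp)
{
    for (size_t i = 0; i < nobjs; i++)
        if (toy_lookup(scope[i], name, ncmp) != 0)
            return scope[i];
    return NULL;
}
```

With two objects that both define the same name, the one listed first in the scope array is returned, which is exactly the interposition effect described in the text.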
_ZN14some_namespace22some_longer_class_nameC1Ei
_ZN14some_namespace22some_longer_class_name19the_getter_functionEv
ada__calendar__delays___elabb
ada__calendar__delays__timed_delay_nt
ada__calendar__delays__to_duration
ters. One plus point for design, but minus 100 points for performance.

With the knowledge of the hashing function and the details of the string lookup let us look at a real-world example: OpenOffice.org. The package contains 144 separate DSOs. During startup about 20,000 relocations are performed. Many of the relocations are performed as the result of dlopen calls and therefore cannot be optimized away by using prelink [7]. The number of string comparisons needed during the symbol resolution can be used as a fair value for the startup overhead. We compute an approximation of this value now.

The average chain length for unsuccessful lookup in all DSOs of the OpenOffice.org 1.0 release on IA-32 is 1.1931. This means for each symbol lookup the dynamic linker has to perform on average 72 × 1.1931 = 85.9032 string comparisons. For 20,000 symbols the total is 1,718,064 string comparisons. The average length of an exported symbol defined in the DSOs of OpenOffice.org is 54.13. Even if we are assuming that only 20% of the string is searched before finding a mismatch (which is an optimistic guess since every symbol name is compared completely at least once to match itself) this would mean a total of more than 18.5 million characters have to be loaded from memory and compared. No wonder that the startup is so slow, especially since we ignored other costs.

To compute the number of lookups the dynamic linker performs one can use the help of the dynamic linker. If the environment variable LD_DEBUG is set to symbols one only has to count the number of lines which start with symbol=. It is best to redirect the dynamic linker’s output into a file with LD_DEBUG_OUTPUT. The number of string comparisons can then be estimated by multiplying the count with the average hash chain length. Since the collected output contains the name of the file which is looked at it would even be possible to get more accurate results by multiplying with the exact hash chain length for the object.

Changing any of the factors ‘number of exported symbols’, ‘length of the symbol strings’, ‘number and length of common prefixes’, ‘number of DSOs’, and ‘hash table size optimization’ can reduce the costs dramatically. In general the percentage spent on relocations of the time the dynamic linker uses during startup is around 50-70% if the binary is already in the file system cache, and about 20-30% if the file has to be loaded from disk.⁵ It is therefore worth spending time on these issues and in the remainder of the text we will introduce methods to do just that. So far to remember: pass -O1 to the linker to generate the final product.

1.5.3 The GNU-style Hash Table

All the optimizations proposed in the previous section still leave symbol lookup as a significant factor. A lot of data has to be examined and loading all this data in the CPU cache is expensive. As mentioned above, the original ELF hash table handling has no more flexibility so any solution would have to replace it. This is what the GNU-style hash table handling does. It can peacefully coexist with the old-style hash table handling by having its own dynamic section entry (DT_GNU_HASH). Updated dynamic linkers will use the new hash table instead of the old, therefore providing completely transparent backward compatibility support. The new hash table implementation, like the old, is self-contained in each executable and DSO so it is no problem to have binaries with the new and some with only the old format in the same process.

The main cost for the lookup, especially for certain bina-

⁵ These numbers assume pre-linking is not used.
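The OpenOffice.org estimate in section 1.5.2 can be reproduced with a few multiplications. The sketch below only restates that arithmetic; the structure and function name are invented, and reading the factor 72 as the average number of objects searched (half of the 144 DSOs) is an assumption, since the text does not spell it out.

```c
/* Rough cost model for symbol lookups during startup. */
struct lookup_cost {
    double comparisons_per_lookup;  /* string comparisons for one lookup */
    double total_comparisons;       /* for all lookups */
    double chars_compared;          /* characters loaded and compared */
};

struct lookup_cost estimate_lookup_cost(double objects_searched,
                                        double avg_chain_len,
                                        double lookups,
                                        double avg_name_len,
                                        double fraction_compared)
{
    struct lookup_cost c;
    c.comparisons_per_lookup = objects_searched * avg_chain_len;
    c.total_comparisons = c.comparisons_per_lookup * lookups;
    c.chars_compared = c.total_comparisons * avg_name_len * fraction_compared;
    return c;
}
```

Calling estimate_lookup_cost(72, 1.1931, 20000, 54.13, 0.20) yields about 85.9 comparisons per lookup, about 1.72 million comparisons in total, and roughly 18.6 million characters, in line with the "more than 18.5 million" stated in the text. The lookups argument can be replaced by the number of symbol= lines counted in LD_DEBUG output.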
   2.a The hash value is used to determine whether an entry with the given hash value is present at all. This is done with a 2-bit Bloom filter.⁶ If the filter indicates there is no such definition the next object in the lookup scope is searched.

   2.b Determine the hash bucket for the symbol using the hash value and the hash table size in the object. The value is a symbol index.

   2.c Get the entry from the chain array corresponding to the symbol index. Compare the value with the hash value of the symbol we are trying to locate. Ignore bit 0.

   2.d If the hash value matches, get the name offset of the symbol and use it as the NUL-terminated name.

   2.e Compare the symbol name with the relocation name.

   2.f If the names match, compare the version names as well. This only has to happen if both the reference and the definition are versioned. It requires a string comparison, too. If the version name matches or no such comparison is performed, we found the definition we are looking for.

   2.g If the definition does not match and the value loaded from the hash bucket does not have bit 0 set, continue with the next entry in the hash bucket array.

   2.h If bit 0 is set there is no further entry in the hash chain and we proceed with the next object in the lookup scope.

3. If there is no further object in the lookup scope the lookup failed.

The hash chain array is organized to have all entries for the same hash bucket follow each other. There is no linked list and therefore the cache utilization is much better.

Only if the Bloom filter and the hash function test succeed do we access the symbol table itself. All symbol table entries for a hash chain are consecutive, too, so in case we need to access more than one entry the CPU cache prefetching will help here, too.

One last change over the old format is that the hash table only contains a few, necessary records for undefined symbols. The majority of undefined symbols do not have to appear in the hash table. This in some cases significantly reduces the possibility of hash collisions and it certainly increases the Bloom filter success rate and reduces the average hash chain length. The result is significant speed-ups of 50% or more in code which cannot depend on pre-linking [7] (pre-linking is always faster).

This does not mean, though, that the optimization techniques described in the previous section are irrelevant. They still should very much be applied. Using the new hash table implementation just means that not optimizing the exported and referenced symbols will not have as big an effect on performance as it used to have.

The new hash table format was introduced in Fedora Core 6. The entire OS, with a few deliberate exceptions, is created without the compatibility hash table by using --hash-style=gnu. This means the binaries cannot be used on systems without support for the new hash table format in the dynamic linker. Since this is never a goal for any of the OS releases making this decision was a no-brainer. The result is that all binaries are smaller than with the second set of hash tables and in many cases even smaller than binaries using only the old format.

⁶ http://en.wikipedia.org/wiki/Bloom_filter
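Steps 2.a and 2.c to 2.h of the GNU-style lookup can be illustrated with a small C sketch. The hash function shown is the one commonly used for DT_GNU_HASH tables (multiply by 33, starting from 5381); it is not quoted in the text above. All toy_ names are hypothetical, the Bloom filter is reduced to a single 64-bit word, and the version check of step 2.f is omitted.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hash function used for GNU-style hash tables. */
uint32_t gnu_hash(const char *name)
{
    uint32_t h = 5381;
    for (const unsigned char *p = (const unsigned char *) name; *p != '\0'; p++)
        h = h * 33 + *p;
    return h;
}

/* Simplified Bloom filter: one word, two bits per symbol (step 2.a). */
struct toy_bloom {
    uint64_t word;
    unsigned shift;   /* must be smaller than 32 */
};

void toy_bloom_add(struct toy_bloom *b, uint32_t h)
{
    b->word |= (UINT64_C(1) << (h % 64))
             | (UINT64_C(1) << ((h >> b->shift) % 64));
}

/* Returns 0 if the symbol is definitely absent, 1 if it may be present. */
int toy_bloom_maybe(const struct toy_bloom *b, uint32_t h)
{
    return (int) ((b->word >> (h % 64))
                  & (b->word >> ((h >> b->shift) % 64)) & 1);
}

/* Chain entry: the hash value with bit 0 replaced by an end-of-chain flag. */
struct toy_gnu_entry {
    uint32_t hashval;
    const char *name;
};

struct toy_gnu_entry toy_gnu_entry_make(const char *name, int last)
{
    struct toy_gnu_entry e;
    e.hashval = (gnu_hash(name) & ~UINT32_C(1)) | (last ? 1u : 0u);
    e.name = name;
    return e;
}

/* Walk the consecutive entries of one chain (steps 2.c to 2.h).  The
   string comparison only happens after the hash values matched. */
const char *toy_gnu_chain_lookup(const struct toy_gnu_entry *chain,
                                 const char *name, unsigned *strcmps)
{
    uint32_t h = gnu_hash(name);
    for (;; chain++) {
        if (((chain->hashval ^ h) >> 1) == 0) {   /* compare, ignoring bit 0 */
            ++*strcmps;
            if (strcmp(chain->name, name) == 0)
                return chain->name;
        }
        if (chain->hashval & 1)                   /* end of this chain */
            return NULL;
    }
}
```

The point of the design is visible in the counter: an entry whose stored hash differs from the one searched for is skipped without touching the symbol name at all.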
A more complicated modification of the lookup scope happens when DSOs are loaded dynamically using dlopen. If a DSO is dynamically loaded it brings in its own set of dependencies which might have to be searched. These objects, starting with the one which was requested in the dlopen call, are appended to the lookup scope if the object with the reference is among those objects which have been loaded by dlopen. That means those objects are not added to the global lookup scope and they are not searched for normal lookups. This third part of the lookup scope, we will call it local lookup scope, is therefore dependent on the object which has the reference.

The behavior of dlopen can be changed, though. If the function gets passed the RTLD_GLOBAL flag, the loaded object and all the dependencies are added to the global scope. This is usually a very bad idea. The dynamically added objects can be removed and when this happens the lookups of all other objects are influenced. The

If now libtwo.so is loaded, the additional local scope could be like this:

    libdynamic.so → libtwo.so → libc.so

This local scope is searched after the global scope, possibly with the exception of libdynamic.so which is searched first for lookups in this very same DSO if the DF_SYMBOLIC flag is used. But what happens if the symbol duplicate is required in libdynamic.so? After all we said so far the result is always: the definition in libone.so is found since libtwo.so is only in the local scope which is searched after the global scope. If the two definitions are incompatible the program is in trouble.

This can be changed with a recent enough GNU C library by ORing RTLD_DEEPBIND to the flag word passed as the
• The change in the scope affects all symbols and all the DSOs which are loaded. Some symbols might have to be interposed by definitions in the global scope which now will not happen.

• Already loaded DSOs are not affected which could cause inconsistent results depending on whether the DSO is already loaded (it might be dynamically loaded, so there is even a race condition).

• LD_PRELOAD is ineffective for lookups in the dynamically loaded objects since the preloaded objects are part of the global scope, having been added right after the executable. Therefore they are looked at only after the local scope.

• Applications might expect that local definitions are always preferred over other definitions. This (and the previous point) is already partly a problem with the use of DF_SYMBOLIC but since this flag should not be used either, the arguments are still valid.

• If any of the implicitly loaded DSOs is loaded explicitly afterward, its lookup scope will change.

• Lastly, the flag is not portable.

    pushl foo
    call bar

This would encode the addresses of foo and bar as part of the instruction in the text segment. If the address is only known to the dynamic linker the text segment would have to be modified at run-time. According to what we learned above this must be avoided.

Therefore the code generated for DSOs, i.e., when using -fpic or -fPIC, looks like this:

    movl foo@GOT(%ebx), %eax
    pushl (%eax)
    call bar@PLT

The address of the variable foo is now not part of the instruction. Instead it is loaded from the GOT. The address of the location in the GOT relative to the PIC register value (%ebx) is known at link-time. Therefore the text segment does not have to be changed, only the GOT.⁷
normally makes sure this is done correctly. When creating data objects it is mostly up to the user to make sure it is placed in the correct segment. Ideally data is also read-only but this works only for constants. The second best choice is a zero-initialized variable which does not have to be initialized from file content. The rest has to go into the data segment.

In the following we will not cover the first two points given here. It is up to the developer of the DSO to decide about this. There are no small additional changes to make the DSO behave better; these are fundamental design decisions. We have voiced an opinion here, whether it has any effect remains to be seen.

1.7 Measuring ld.so Performance

To perform the optimizations it is useful to quantify the effect of the optimizations. Fortunately it is very easy to do this with glibc’s dynamic linker. Using the LD_DEBUG environment variable it can be instructed to dump information related to the startup performance. Figure 8 shows an example invocation, of the echo program in this case.

The output of the dynamic linker is divided in two parts. The part before the program’s output is printed right before the dynamic linker turns over control to the application after having performed all the work we described in this section. The second part, a summary, is printed after the application terminated (normally). The actual format might vary for different architectures. It includes the timing information only on architectures which provide easy access to a CPU cycle counter register (modern IA-32, IA-64, x86-64, Alpha at the moment). For other architectures these lines are simply missing.

The timing information provides absolute values for the total time spent during startup in the dynamic linker, the time needed to perform relocations, and the time spent in the kernel to load/map binaries. In this example the relocation processing dominates the startup costs with more than 50%. There is a lot of potential for optimizations here. The unit used to measure the time is CPU cycles. This means that the values cannot even be compared across different implementations of the same architecture. E.g., the measurement for a Pentium III and a Pentium 4 machine will be quite different. But the measurements are perfectly suitable to measure improvements on one machine which is what we are interested in here.

Since relocations play such a vital part of the startup performance some information on the number of relocations is printed. In the example a total of 133 relocations are performed, from the dynamic linker, the C library, and the executable itself. Of these 5 relocations could be served from the relocation cache. This is an optimization implemented in the dynamic linker to handle the case of multiple relocations against the same symbol more efficiently. After the program itself terminated the same information is printed again. The total number of relocations here is higher since the execution of the application code caused a number, 55 to be exact, of run-time relocations to be performed.

The number of relocations which are processed is stable across successive runs of the program. The time measurements are not. Even in a single-user mode with no other programs running there would be differences since the cache and main memory have to be accessed. It is therefore necessary to average the run-time over multiple runs.

It is obviously also possible to count the relocations without running the program. Running readelf -d on the binary shows the dynamic section in which the DT_RELSZ, DT_RELENT, DT_RELCOUNT, and DT_PLTRELSZ entries are interesting. They allow computing the number of normal and relative relocations as well as the number of PLT entries. If one does not want to do this by hand the relinfo script in appendix A can be used.
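The relocation counts mentioned in connection with readelf -d can be derived from the dynamic section with a few divisions. The following sketch is not the relinfo script from appendix A. It assumes an IA-32 style object which uses REL relocations throughout, so that DT_RELENT is also the size of a PLT relocation entry; the structure and function name are made up for the example.

```c
#include <elf.h>
#include <stddef.h>

struct reloc_counts {
    unsigned long relative;  /* relative relocations (DT_RELCOUNT) */
    unsigned long normal;    /* all other non-PLT relocations */
    unsigned long plt;       /* PLT entries */
};

/* Walk a DT_NULL-terminated dynamic section and compute the counts.
   Returns -1 if the entries are missing or inconsistent. */
int count_relocs(const Elf32_Dyn *dyn, struct reloc_counts *out)
{
    unsigned long relsz = 0, relent = 0, relcount = 0, pltrelsz = 0;

    for (; dyn->d_tag != DT_NULL; dyn++) {
        switch (dyn->d_tag) {
        case DT_RELSZ:    relsz = dyn->d_un.d_val;    break;
        case DT_RELENT:   relent = dyn->d_un.d_val;   break;
        case DT_RELCOUNT: relcount = dyn->d_un.d_val; break;
        case DT_PLTRELSZ: pltrelsz = dyn->d_un.d_val; break;
        default:          break;
        }
    }
    if (relent == 0)
        return -1;

    unsigned long total = relsz / relent;   /* entries in the REL table */
    if (relcount > total)
        return -1;
    out->relative = relcount;
    out->normal = total - relcount;
    out->plt = pltrelsz / relent;
    return 0;
}
```

For objects using RELA relocations the corresponding DT_RELASZ, DT_RELAENT, and DT_RELACOUNT entries would have to be used instead.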
Compiled in the same way as before we see that all the relocations introduced by our example code vanished. I.e., we are left with six relocations and three PLT entries. The code to access last now looks like this:

    movl last@GOTOFF(%ebx), %eax
    incl %eax
    movl %eax, last@GOTOFF(%ebx)

The next best thing to using static is to explicitly define the visibility of objects in the DSO. The generic ELF ABI defines visibility of symbols. The specification defines four classes of which here only two are of interest. STV_DEFAULT denotes the normal visibility. The symbol is exported and can be interposed. The other interesting class is denoted by STV_HIDDEN. Symbols marked like this are not exported from the DSO and therefore cannot be used from other objects. There are a number of different methods to define visibility.
Instead of changing the default visibility the programmer can choose to hide individual symbols. Or, if the default visibility is hidden, make specific symbols exportable by setting the visibility to default.

    int index (int scale) {
      return next () << scale;
    }

This defines the variable last and the function next as hidden. All the object files which make up the DSO which contains this definition can use these symbols. I.e., while static restricts the visibility of a symbol to the file it is defined in, the hidden attribute limits the visibility to the DSO the definition ends up in. In the example above the definitions are marked. This does not cause any harm but it is in any case necessary to mark the declaration. In fact it is more important that the declarations are marked appropriately since it is mainly the code generated for a reference that is influenced by the attribute.

Instead of adding a visibility attribute to each declaration or definition, it is possible to change the default temporarily for all definitions and declarations the compiler sees at this time. This is mainly useful in header files since it reduces changes to a minimum but can also be useful for definitions. This compiler feature was also introduced in gcc 4.0 and is implemented using a pragma:⁹

    int index (int scale) {
      return next () << scale;
    }

Beside telling the backend of the compiler to emit code to flag the symbol as hidden, changing the visibility has another purpose: it allows the compiler to assume the definition is local. This means the addressing of variables and functions can happen as if the definitions were locally defined in the file as static. Therefore the same code sequences we have seen in the previous section can be generated. Using the hidden visibility attribute is therefore almost completely equivalent to using static; the only difference is that the compiler cannot automatically inline the function since it need not see the definition.

We can now refine the rule for using static: merge source files and mark as many functions static as far as one feels comfortable. In any case merge the files which contain functions which potentially can be inlined. In all other cases mark functions (the declarations) which are not to be exported from the DSO as hidden.

Note that the linker will not add hidden symbols to the dynamic symbol table. I.e., even though the symbol tables of the object files contain hidden symbols they will disappear automatically. By maximizing the number of hidden declarations we therefore reduce the size of the symbol table to the minimum.
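Since only the tail of each example is visible above, here is a self-contained sketch of both methods, assuming gcc 4.0 or later (or a compatible compiler). The names last and next follow the text; index_scaled and reset_last are hypothetical names chosen for this illustration, the former to avoid a clash with the index function declared in <strings.h>.

```c
/* Method 1: mark individual declarations with the visibility attribute.
   last and next are usable from every object file of the DSO but are
   not exported from it. */
int last __attribute__ ((visibility ("hidden")));
int next (void) __attribute__ ((visibility ("hidden")));

int next (void)
{
    return ++last;
}

/* Method 2: change the default temporarily with a pragma.  Everything
   declared between push and pop gets hidden visibility. */
#pragma GCC visibility push(hidden)
void reset_last (void);
#pragma GCC visibility pop

void reset_last (void)
{
    last = 0;
}

/* Default visibility: this function is part of the exported interface.
   Because next is known to be hidden, the call can be generated as a
   local call without going through the PLT. */
int index_scaled (int scale)
{
    return next () << scale;
}
```

When such a file is compiled with -fPIC and linked into a DSO, only index_scaled is expected to show up in the dynamic symbol table; this can be checked with readelf --dyn-syms or nm -D.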
    class foo {
      static int u __attribute__
        ((visibility ("hidden")));

In this example code the static data member u and the

One sort of function which can safely be kept local and not exported are inline functions, either defined in the class definition or separately. Each compilation unit must have its own set of all the used inline functions. And all the functions from all the DSOs and the executable better be the same and are therefore interchangeable. It is possible to mark all inline functions explicitly as hidden but this is

    class __attribute ((visibility ("hidden")))
    foo {
      ...
    };

Just as with the pragma, all defined functions are defined
One nit still exists with the result in the last section: the string is modifiable. Very often the string will never be modified. In such a case the unsharable data segment is unnecessarily big.

    const char str[] = "some string";

    static const char *msgs[] = {
      [ERR1] = "message for err1",
      [ERR2] = "message for err2",
      [ERR3] = "message for err3"
    };

    const char *errstr (int nr) {
      return msgs[nr];
    }
The reason is that var is known to have type foo and not a derived type from which the virtual function table is used. If the class foo is instantiated in another DSO not only the virtual function table has to be exported by that DSO, but also the virtual function virfunc.

If a tiny runtime overhead is acceptable the virtual function and the externally usable function interface should be separated. Something like this:

    /* In the header. */
    struct foo {

2.5 Improving Generated Code

On most platforms the code generated for DSOs differs from code generated for applications. The code in DSOs needs to be relocatable while application code can usually assume a fixed load address. This inevitably means that the code in DSOs is slightly slower and probably larger than application code. Sometimes this additional overhead can be measured. Small, often called functions fall into this category. This section shows some problem cases of code in DSOs and ways to avoid them.

In the preceding text we have seen that for IA-32 a func-
  static int foo;
  int getfoo (void)
  { return foo; }

the compiler might end up creating code like this:

  getfoo:
          call 1f
  1:      popl %ecx
          addl _GLOBAL_OFFSET_TABLE_[.-1b],%ecx
          movl foo@GOTOFF(%ecx),%eax
          ret

The actual variable access is overshadowed by the overhead to do so. Loading
the GOT address into the %ecx register takes three instructions. What if this
function is called very often? Even worse: what if the function getfoo is
defined static or hidden and no pointers to it are ever available? In this
case the caller might already have computed the GOT address; at least on IA-32
the GOT address is the same for all functions in the DSO or executable. The
computation of the GOT address in getfoo would be unnecessary. The key word in
this scenario description is "might". The IA-32 ABI does not require that the
caller loads the PIC register. Only if a function call uses the PLT do we know
that %ebx contains the GOT address and in this case the call could come from
any other loaded DSO or the executable. I.e., we really always have to load
the GOT address.

On platforms with better-designed instruction sets the generated code is not
bad at all. For example, the x86-64 version could look like this:

The corresponding IA-64 code uses the gp register:

  getfoo:
          addl r14=@gprel(foo),gp;;
          ld4 r8=[r14]
          br.ret.sptk.many b0

If the caller knows that the called function uses the same gp value, it can
avoid the loading of gp. IA-32 is really a special case, but still a very
common one. So it is appropriate to look for a solution.

Any solution must avoid the PIC register entirely. We propose two possible
ways to improve the situation. First, do not use position-independent code.
This will generate code like

  getfoo:
          movl foo,%eax
          ret

The drawback is that the resulting binary will have text relocations. The page
on which this code resides will not be sharable, the memory subsystem is more
stressed because of this, a runtime relocation is needed, program startup is
slower because of both these points, and security of the program is impacted.
Overall, there is a measurable cost associated with not using PIC. This
possibility should be avoided whenever possible. If the DSO in question is
only used once at the same time (i.e., there are no additional copies of the
same program or other programs using the DSO) the overhead of the copied page
is not that bad. Only in case the page has to be evacuated from memory would
we see measurable deficits: since the page cannot simply be discarded, it must
be stored in the disk swap storage.
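Another way to keep the GOT address computation out of small accessor functions, assuming the sources can be restructured, is to gather the global variables in one structure and hand its address to internal functions explicitly. The following is only a sketch of what such code might look like. The names getfoo, intfoo, and getboth follow the surrounding text; struct globals, its member bar, and the initial values are hypothetical.

```c
/* All module-level variables gathered in one object (illustrative). */
static struct globals {
  int foo;
  int bar;
} globals = { 1, 2 };

/* Internal helper: it reaches the data through the pointer it is given
   and therefore does not need the GOT address itself. */
static int intfoo (struct globals *g)
{
  return g->foo;
}

/* Exported wrapper which preserves the original interface. */
int getfoo (void)
{
  return intfoo (&globals);
}

/* Uses globals directly and passes the same address on to intfoo. */
int getboth (void)
{
  return globals.bar + intfoo (&globals);
}
```

Whether this pays off depends on the architecture: where global data can be addressed without a PIC register the extra parameter is only overhead.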
The code generated for this example does not compute the GOT address twice for
each call to getboth. The function intfoo uses the provided pointer and does
not need the GOT address. To preserve the semantics of the first code this
additional function had to be introduced; it is now merely a wrapper around
intfoo. If it is possible to write the sources for a DSO to have all global
variables in a structure and pass the additional parameter to all internal
functions, then the benefit on IA-32 can be big.

But it must be kept in mind that the code generated for the changed example is
worse than what would be created for the original on most other architectures.
As can be seen, in the x86-64 case the extra parameter to intfoo would be pure
overhead since we can access global variables without the GOT value. On IA-64
marking getfoo as hidden would allow the PLT to be avoided and therefore the
gp register is not reloaded during the call to getfoo. Again, the parameter is
pure overhead. For this reason it is questionable whether this IA-32 specific
optimization should ever be performed. If IA-32 is by far the most important
platform it might be an option.

2.6 Increasing Security

The explanation of ELF so far has shown the critical importance of the GOT and
PLT data structures used at runtime by the dynamic linker. Since these data
structures are used to direct memory access and function calls they are also a
security liability. If a program error provides an attacker with the
possibility to overwrite a single word in the address space with a value of
his choosing, then targeting a specific GOT entry is a worthwhile goal. For
some architectures, a changed GOT value might redirect a call to a function,
which is called through the PLT, to an arbitrary other place. Similarly, if
the PLT is modified in relocations and therefore writable, that memory could
be modified.

An example with security relevance could be a call to

At this point we should remember how the dynamic linker works and how the GOT
is used. Each GOT entry belongs to a certain symbol and depending on how the
symbol is used, the dynamic linker will perform the relocation at startup time
or on demand when the symbol is used. Of interest here are the relocations of
the first group. We know exactly when all non-lazy relocations are performed.
So we could change the access permission of the part of the GOT which is
modified at startup time to forbid write access after the relocations are
done. Creating objects this way is enabled by the -z relro linker option. The
linker is instructed to move the sections which are only modified by
relocations onto separate memory pages and to emit a new program header entry
PT_GNU_RELRO to point the dynamic linker to these special pages. At runtime
the dynamic linker can then remove the write access to these pages after it is
done.

This is only a partial solution but already very useful. By using the -z now
linker option it is possible to disable all lazy relocation at the expense of
increased startup costs and make all relocations eligible for this special
treatment. For long-running applications which are security relevant this is
definitely attractive: the startup costs should not weigh in as much as the
gained security. Also, if the application and DSOs are written well, they
avoid relocations. For instance, the DSOs in the GNU C library are all built
with -z relro and -z now.

The GOT and PLT are not the only parts of the application which benefit from
this feature. In section 2.4.2 we have seen that adding as many const to a
data object definition as possible has benefits when accessing the data. But
there is more. Consider the following code:

  const char *msgs1[] = {
    "one", "two", "three"
  };
  const char *const msgs2[] = {
    "one", "two", "three"
  };
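To make the difference between msgs1 and msgs2 concrete, here is a small sketch. The helper replace_first is hypothetical and exists only to show which of the two arrays the language allows to be modified. The remark about section placement describes typical behavior of gcc with GNU ld for position-independent code and should be read as an assumption, not a guarantee.

```c
/* Pointers to constant strings; the pointers themselves may be
   reassigned, so the array has to stay in writable memory. */
const char *msgs1[] = { "one", "two", "three" };

/* The pointers are const as well.  In a DSO the array still needs
   relocations, so it cannot be in truly read-only memory at load time,
   but it never changes afterwards.  Such data is a candidate for the
   write protection applied after relocation (typically via a section
   like .data.rel.ro). */
const char *const msgs2[] = { "one", "two", "three" };

/* Hypothetical helper showing what the language permits. */
void replace_first (const char *s)
{
  msgs1[0] = s;          /* allowed: the element is a modifiable pointer */
  /* msgs2[0] = s; */    /* would be rejected by the compiler */
}
```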
  VERS_1.0 {
    global: index;
    local: *;
  };

  VERS_2.0 {
    global: index;
  } VERS_1.0;

We have two definitions of index and therefore the name must be mentioned by
the appropriate sections for the two versions.

It might also be worthwhile pointing out once again that the call to index1 in
index2 does not use the PLT and is instead a direct, usually PC-relative,
jump.

With a simple function definition, like the one in this example, it is no
problem at all if different parts of the program call different versions of
the index interface. The only requirement is that the interface the caller saw
at compile time is also the interface the linker finds when handling the
relocatable object file. Since the relocatable object file does not contain
the versioning information it is not possible to keep object files around and
hope the right interface is picked by the linker. Symbol versioning only works
for DSOs and executables. If it is necessary to reuse relocatable object files
later, it is necessary to recreate the link environment the linker would have
seen when the code was compiled. The header files (for C, and whatever other
interface specification exists for other languages) and the DSOs which are
used for linking together form the API. It is not possible to separate the two
steps, compiling and linking. For this reason packaging systems, which
distinguish between runtime and development packages, put the headers and
linkable DSOs in

The problem is that unless the DSO containing the definitions is used at link
time, the linker cannot add a version name to the undefined reference.
Following the rules for symbol versioning [4] this means the earliest version
available at runtime is used which usually is not the intended version. Going
back to the example in section 3.7, assume the program using the DSO would be
compiled expecting the new interface index2. Linking happens without the DSO
which contains the definition. The reference will be for index and not
index@@VERS_2.0. At runtime the dynamic linker will find an unversioned
reference and versioned definitions. It will then select the oldest definition
which happens to be index1. The result can be catastrophic.

It is therefore highly recommended to never depend on undefined symbols. The
linker can help to ensure this if -Wl,-z,defs is added to the compiler command
line. If it is really necessary to use undefined symbols the newly built DSO
should be examined to make sure that all references to symbols in versioned
DSOs are appropriately marked.

3.9 Inter-Object File Relations

Part of the ABI is also the relationship between the various participating
executables and DSOs which are created by undefined references. It must be
ensured that the dynamic linker can locate exactly the right DSOs at program
start time. The static linker uses the SONAME of a DSO in the record to
specify the interdependencies between two objects. The information is stored
in the DT_NEEDED entries in the dynamic section of the object with the
undefined references. Usually this is only a file name, without a complete
path. It is then the task of the dynamic linker to find the correct file at
startup time.
$ ldd -u -r \
/usr/lib/libgtk-x11-2.0.so.0.600.0
Unused direct dependencies:
/usr/lib/libpangox-1.0.so.0
/lib/libdl.so.2
The following script computes the number of normal and relative relocations as well as the number of PLT entries
present in a binary. If an appropriate readelf implementation is used it can also be used to look at all files in an
archive. If prelink [7] is available and used, the script also tries to provide information about how often the DSO is
used. This gives the user some idea how much “damage” an ill-written DSO causes.
#! /usr/bin/perl
eval "exec /usr/bin/perl -S $0 $*"
if 0;
# Copyright (C) 2000, 2001, 2002, 2003, 2004, 2005 Ulrich Drepper
# Written by Ulrich Drepper <[email protected]>, 2000.
  printf("%s: %d relocations, %d relative (%d%%), %d PLT entries, %d for local syms (%d%%)",
         $ARGV[$cnt], $relent == 0 ? 0 : $relsz / $relent, $relcount,
         $relent == 0 ? 0 : ($relcount * 100) / ($relsz / $relent),
         $relent == 0 ? 0 : $pltrelsz / $relent,
         $relent == 0 ? 0 : $pltrelsz / $relent - $extplt,
         $relent == 0 ? 0 : ((($pltrelsz / $relent - $extplt) * 100)
                             / ($pltrelsz / $relent)));
  if ($users >= 0) {
    printf(", %d users", $users);
  }
  printf("\n");
}
The method to handle arrays of string pointers presented in section 2.4.3 shows the principal method to construct data
structures which do not require relocations. But the construction is awkward and error-prone. Duplicating the strings
in multiple places in the sources always has the problem of keeping them in sync.
Bruno Haible suggested something like the following to automatically generate the tables. The programmer only has
to add the strings, appropriately marked, to a data file which is used in the compilation. The framework in the actual
sources looks like this:
#include <stddef.h>
The string data has to be provided in the file stringtab.h. For the example from section 2.4.3 the data would look
like this:
The macro S takes two parameters: the first is the index used to locate the string and the second is the string itself.
The order in which the strings are provided is not important. The value of the first parameter is used to place the offset
in the correct slot of the array. It is worthwhile running these sources through the preprocessor to see the results. This
way of handling string arrays has the clear advantage that strings have to be specified only in one place and that the
order they are specified in is not important. Both these issues can otherwise easily lead to very hard to find bugs.
The array msgidx in this case uses the type unsigned int which is in most cases 32 bits wide. This is usually far
too much to address all bytes in the string collection in msgstr. So, if size is an issue the type used could be uint16_t
or even uint8_t.
Note that both arrays are marked with const. They are therefore stored in the read-only data segment and shared
between processes, preserving precious data memory. Making the data read-only also prevents possible security
problems resulting from overwriting values which can trick the program into doing something harmful.
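The following is a self-contained sketch of this kind of table generation, not the original framework. To keep it in one file it assumes the string list is given by a hypothetical macro STRING_TABLE instead of an included stringtab.h. The error codes and the bounds check in errstr are likewise assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical error codes used as indices. */
enum { ERR1, ERR2, ERR3, NERR };

/* Stand-in for the data file: one S(index, string) entry per message.
   The order is deliberately not the index order. */
#define STRING_TABLE \
  S(ERR1, "message for err1") \
  S(ERR3, "message for err3") \
  S(ERR2, "message for err2")

/* First expansion: one char array member per string, so all strings
   are stored back to back in a single constant object. */
static const struct msgstr_t {
#define S(n, s) char str##n[sizeof (s)];
  STRING_TABLE
#undef S
} msgstr = {
#define S(n, s) s,
  STRING_TABLE
#undef S
};

/* Second expansion: each slot holds the offset of its string, an
   integer which needs no relocation. */
static const unsigned int msgidx[] = {
#define S(n, s) [n] = offsetof (struct msgstr_t, str##n),
  STRING_TABLE
#undef S
};

const char *errstr (int nr)
{
  if (nr < 0 || nr >= NERR)
    return NULL;
  return (const char *) &msgstr + msgidx[nr];
}
```

Running the file through the preprocessor (for instance with gcc -E) shows the two generated initializers and makes it easy to check that each offset ends up in the slot named by its index.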
[3] Ulrich Drepper, Red Hat, Inc., Good Practices in Library Design, Implementation, and Maintenance,
http://people.redhat.com/drepper/goodpractices.pdf, 2002.
[4] Ulrich Drepper, Red Hat, Inc., ELF Symbol Versioning, http://people.redhat.com/drepper/symbol-versioning, 1999.
[5] Sun Microsystems, Linker and Library Guide, http://docs.sun.com/db/doc/816-1386, 2002.
[6] TIS Committee, Executable and Linking Format (ELF) Specification, Version 1.2,
http://x86.ddj.com/ftp/manuals/tools/elf.pdf, 1995.
[7] Jakub Jelinek, Red Hat, Inc., Prelink, http://people.redhat.com/jakub/prelink.pdf, 2003.
[8] Security Enhanced Linux, http://www.nsa.gov/selinux/.
E Revision History
2003-2-27 Minor language cleanup. Describe using export maps with C++. Version 1.1.
2003-3-18 Some more linguistic changes. Version 1.2.
2003-4-4 Document how to write constructor/destructors. Version 1.3.
2004-8-4 Warn about aliases of static objects. Significant change to section 2.2 to introduce new features of gcc 4.0.
Version 1.99.
2004-8-27 Update code in appendices A and B. Version 2.0.
2004-9-23 Document RTLD_DEEPBIND and sprof. Version 2.2.