Thursday, February 26, 2009

Linux kernel glossary

2Q algorithm

MM algorithm based on two areas, one managed as a FIFO queue, and
one as an LRU list.
8259 PIC

Outdated interrupt controller present on Intel hardware.



Application Binary Interface, the interface of passed structures
between the user processes (and libraries) and the kernel. For
compatibility, it is important that these remain as static as possible
(i.e. making sure that variables and structure members have the same
bytesize as before, and in the same ordering). Occasionally breakage
is necessary, requiring re-compilation of the user-space sources (note
that this does not affect source-compatibility; that is a separate

Advanced Configuration and Power Interface, replacement for APM
that has the advantage of allowing O/S control of power management
facilities as well exporting the set of hardware currently present on
the system.

Address Generation Interlocking, on x86. When execution of an
instruction requires an address resulting from a non-completed
instruction, the CPU must wait - this is known as an AGI stall.

Accelerated Graphics Port, on x86 boxes.

Asynchronous IO, IO that is performed without the issuing process
blocking on IO completion.
Anticipatory Scheduler

A disk IO scheduler that leaves the disk idle after a read, in
anticipation of the next read.

Generally, used for something which doesn't have the usual
associated object. For example an anonymous address space is not
interested in user address space (that is, no process context). Some
common ones are :
Anonymous page

A page of memory that is not associated with a file on a file
system. This can come from expanding the process's data segment with
brk(), shared memory segments, or mmap() with a MAP_ANON or
MAP_PRIVATE flag. MAP_PRIVATE, although it maps in data from a file,
is considered anonymous because any changes do not get written back to
the file (any dirty pages have to be moved to swap if the page is
freed from main memory).
Anonymous buffer

The buffer cache contains buffers of data on their way to/from the
disk. An anonymous buffer is not associated with a file. One example
of this is data from a deleted file - it will not be written to any
file, but is kept around until it is flushed.

Advanced Linux Sound Architecture.

See local APIC and IO-APIC.

Advanced Power Management, power management standard superseded by
ACPI. APM and SMP just don't mix.

Address Resolution Protocol and this is how a network machine
associates an IP Address with a hardware address.

Abstract Syntax Notation, a protocol for structured data, used,
for example, in the Q.3 management protocol.

Professor Andrew S. Tanenbaum, author of MINIX and several
essential O/S books.

ATA Packet Interface, used by most CD-ROMs, and other devices.

Adaptive Quality of Service Architecture.



Technique used in the VM code, referring to balancing various
parameters such as the number of pages currently free, to avoid
thrashing and other bad memory capacity artefacts. See zones, kswapd

Base Address Registers, for PCI devices.

Binary-Coded Decimal - see a textbook.

See highmem.
big lock

kernel_lock, which locks the entire kernel from entry (no other
task may run in the kernel code). It is recursive per process and
dropped automatically when a process gives up the CPU, then regained
on wake-up, in contrast to other spinlocks.
bit error

Used colloquially to mean a single bit error in some memory
address. Often due to faulty memory (ECC memory can correct single bit
errors). Often results in fake oopsen, with addresses like 0x0008000.
Also seen are values some small offset from zero, plus a bit error,
which is where the value passed a NULL check due to the bit error, and
then the kernel tried to access a structure member by means of the
pointer, leading to the offset.
block bitmap

In UNIX-like filesystems, the usage of disks blocks is recorded in
the block bitmap, where each set bit indicates a specific allocated
bottom-half handler

A set of standard kernel threads that execute tasks on a queue
that have been registered with that type of bottom-half handler for
execution. The code is run on return to user space or at the end of a
hardware interrupt. In 2.3.43 a more general solution with softirqs
and tasklets was implemented. Sometimes abbreviated to "bh", which
should not be confused with buffer head, which is also abbreviated to
bounce buffer

An intermediate buffer. Used for example, in "faking" alignment to
a client from non-aligned resources.

Big-reader locks, used when there are many contending for read
access to a resource, and very few contending for writes (thus the
balance is towards very fast read locking, and very slow write

BootStrap Processor, or the CPU which enables the other CPUs in an
SMP system.

Block Storage Segment. This is the memory mapping section
containing the data allocated for a binary image at execution time.
Also known as "Block Started by Symbol" and "Bull-Shit Storage".

Branch Target Buffer, on x86 processors, the cache of recent
conditional jump results.
buddy allocator

The memory allocation scheme used in the kernel. A vector of lists
of free pages is kept, ordered by the size of the chunk (in powers of
two). When a chunk is allocated, it is removed from the relevant list.
When a chunk is freed back to the free pages pool, it is placed in the
relevant list, starting from the top. If it is physically contiguous
with a present chunk, they are merged and placed in the list above
(i.e. where the chunks are twice the size), and this operation
percolates up the vector. As regions are merged whenever possible,
this design helps to reduce memory fragmentation. FIXME
buffer cache

The buffer cache is a hash table of buffers, indexed by device and
block number. LRU lists are maintained for the buffers in the various
states, with separate lists for buffers of different sizes. With 2.3's
unification of the buffer and page caches, each buffer head points to
part or all of a page structure, through which the buffer's actual
contents are available. FIXME
buffer head

A structure containing information on I/O for some page in real
memory. A buffer can be locked during I/O, or in several other states
depending on its usage or whether it is free. Each buffer is
associated with one page, but every page may have several buffers
(consider the floppy on x86, where the I/O blocksize is 512 bytes, but
each page is commonly 4096 bytes).

Used in kernel code in tests for "impossible" conditions. Signify
a kernel bug or faulty hardware.
bus mastering

Giving a card on a bus (e.g. ISA,PCI) the ability to read/write
directly to main memory. This is how DMA is performed on PCI busses.
byte sex



cache affinity

Where the cache of a CPU represents the current memory set used by
a task, there is said to be cache affinity with that task. A good
thing if the task is regularly scheduled on that CPU. See processor
cache coherency

On an SMP system, ensuring that the local memory cache of each CPU
is consistent with respect to the values which may be stored in other
CPUs' caches, avoiding coherency problems such as the "lost update".
This is achieved by the hardware in concert with the operating system.
cache line

A section of the hardware cache, around 32 bytes large. Kernel
structures are often designed such that the commonly-accessed members
all fit into one cache-line, which reduces cache pollution. Structures
such as this are cache line aligned.
cache ping-pong

A hardware phenomenon in an SMP system, where two tasks on
different CPUs are both accessing the same physical memory in a cache
line. This means as each task runs, when it changes the memory, it
must invalidate the other CPU's relevant cache line (to ensure cache
coherency). Then, when the task on the other CPU runs, it must reload
the cache line (as it's set invalid), before changing it. Repeat ad
jocularum. A bad thing (TM). A common reason for putting a lock on a
different cache line than the data mutexed by the lock : then the
"other" task can grab and drop the lock without having to necessarily
invalidate the cache line on the first CPU. FIXME
cache pollution

Where during execution of a task, another task is scheduled onto
that CPU which disrupts useful lines of the current cache contents,
which will be used soon. That is, cache pollution is a non-optimal
situation where the other process would have been bettered scheduled
on a different CPU or at a different time. The aim is to minimise the
need to replace cache lines, obviously increasing efficiency.
call gate

x86 hardware support for mode switch to kernel (i.e. system call).
In Linux, int 0x80 will trigger the call gate.

These are defined names of capabilities for specific tasks
provided by the kernel, e.g. CAP_SYS_NICE.

Class Based Queueing, a hierarchical packet fair queueing qdisc.
CBQ Homepage

Completely Fair Scheduler

Completely Fair Queueing, an alternative to the Anticipatory IO
scheduler (and the default from 2.6.18 onwards) which allocates IO
priority equally between processes.
chroot jail

A process under the aegis of a chroot() syscall is in a chroot
jail, and cannot access the file system above its notion of root
directory /.

(also: filter or tcf) classifies a network packet by inspecting
it, used by QDiscs.

x86 assembler instructions for disabling and enabling interrupts,
respectively. There are CPU-local and global variants of these. Code
running with interrupts disabled must be fast, for obvious reasons
(this is called interrupt latency).

Eric Raymond's proposal for a replacement to the current kernel
build system. See
cold cache

A cache whose content is invalid or irrelevant with respect to
some task to be run.
completion ports

I/O interface used in O/S's such as Windows NT. Userspace notifies
the kernel of each file descriptor the program is interested. The O/S
uses a callback for each fd to indicate that I/O is ready.

Where two tasks each want an exclusive resource. You may hear talk
of, for example, spinlock contention, which is where one or more tasks
is commonly busy-waiting for a spinlock to become unlocked, as it is
being taken by other tasks.
Context switch

switching the CPU from running one thread to running another thread.


Refers to the changes necessary in the CPU when the scheduler
schedules a different process to run on the CPU. This involves
invalidating the TLB, loading the registers with the saved values,
etc. There is an associated cost with such a switch, so it is best to
avoid un-necessary context switch when possible. Note that the
division of kernel-mode and user-mode means a similar, but simpler,
operation is necessary when a syscall moves into kernel mode. However
this is not called a context switch, as the mode switch doesn't change
the current process. See lazy TLB. One good of feature of Linux is its
extremely low context and mode switch cost, compared to an operating
system like Solaris.


(also: COW) reuse and share existing objects and copy them not
until a modification is required.


Copy-On-Write, efficiency method where a page or other resource
is shared until an attempt to write is made. In that case a copy is
made, and the write is done to the copy.


Current Privilege Level
critical path

A vital code path which should be optimised for the common case.
Critical paths are executed frequently and form the important trunk
routes of various kernel operations. An example would be buffer head
manipulation during file I/O.

Code Storage Segment, aka text section. This is the memory mapping
containing the executable code (text) for a binary image.

a kernel variable which points to the task_struct structure of the
process currently running on this CPU.


Device Mapper

A technology for presenting arbitrary groupings of underlying
sectors on physical devices in a consistent logical fashion usable by
higher level algorithms. Heavily used by kernel technologies such as

Directed Acyclic Graph
dancing makefiles

An experimental new Makefile set up for configuring and compiling
the kernel, written by Michael Elizabeth Chastain.

The cache of dentry structures. Under UNIX an entry in a
particular directory must be searched for linearly, so even if the
disk block containing the directory entry list is in-core, there is an
associated cost. The dcache stores recent results of these searches
which in general speeds up these disk searches by a large factor.
Recent 2.3 work uses the dentries to allow multiple mounting, union
mount, and more.


The hardware data cache is usually referred to as the D-cache.


Any of a number of situations where two or more processes cannot
proceed because they are both waiting for the other to release some
resource. FIXME(give good references).
delayed write

See write behind.
demand zero

In demand paging, where the page is to be zeroed when actually
created (common case: bss segment of an executable image, which is
uninitialised heap data for the executable). Also called ZFOD.

Directory entry, in-core structure defining a file's details:
inode, parent dentry etc. Cached in a hash table indexed by hashed
filename (see dcache).

IP packet bit indicating it should not be fragmented. The remote
host will return ICMP notifications if the packet had to be split
anyway, and these are used in MTU discovery.
directory notification

Provides hooks for notifying tasks when the contents of a
directory has changed. Note "contents" can refer to dentries, the file
inodes, or even the file contents themselves (file notification).

Dial-On-Demand for net connections over POTS.
drop behind

In stream I/O conditions, data that has already been read and
processed is not needed again. The VM ideally should recognise this
and mark the used pages as un-needed, so they can be discarded first.
This technique is called "drop behind".

Data Storage Segment, aka data section. This is the memory mapping
containing the initialised data for a binary image.

Processors such as the Pentium Pro, that can decode and execute
two instructions simultaneously.

Abbrev. fr. duplication.

Debugging Information Format

Double word, i.e. 4 bytes on x86.



See extended attributes.
eager coalescing

What the buddy allocator currently does, i.e. merge adjacent
blocks as soon as possible.
edge-triggered interrupt

The interrupt is triggered by the rising or falling edge of the
interrupt line. This makes IRQ line sharing difficult, as an edge may
occur whilst an ISR is running, and it could be easily missed; to
allow sharing level-triggered interrupts are usually used.

Extended Instruction Pointer. This register contains the PC value
of a task, that is, it points to the next instruction to be fetched,
decoded etc.
elevator algorithm

This algorithm, often used in disk accesses, keeps an ordered list
of requests. When the current request on the disk (e.g. the disk
block) has been satisfied, the next strictly greater request on the
list is dealt with. When a new request arrives, it is inserted into
the ordered list in position (e.g. if the new requested block number
is less than the current handled request, it goes before it in the
list). When reaching the end of the list, the elevator changes
direction, and the situation is reversed.

Executable Linkable Format, a popular binary format, the default
for Linux on most architectures.

Extended Match, small classification helper attached to classifiers.

Explicitly-Parallel Instruction set Computing, an instruction set
architecture where every dependency for an instruction is encoded into
the instruction itself. This has the potential to be faster as the
compiler can encode the data dependencies in the instructions.
exponential back-off

A general algorithm for dealing with contention cases; for
example, collisions on a network bus, or contention for a spinlock.
extended attributes

Also known as multi-part or multi-stream files, files with
extended attributes deviate from the principle of files being a simple
single data stream. An example of extended attributes is the
Macintosh's "resource fork", which is associated with a specific file
(known as the "data fork").


fair scheduler

A scheduler which ensures fairness between users, such that a
user's process count and associated cost only impacts that user,
rather than the whole system as currently. Rik van Riel and Borislav
Deianov have both produced different patches to implement this.
false sharing

On SMP caches, when two parts of single block are accessed,
neither of which collide with the other, the cache coherency protocol
may not be able to detect this, and mark the block as "shared" even
when it isn't. This is known as false sharing.

The code path most commonly taken, often optimised heavily at the
expense of less frequently-taken blocks of code. This is the reason
you see so many gotos in core functions - it produces common-path code
far more efficient than an optimising compiler can manage.

file descriptor

The mapping of a file's contents into memory.

"guages" filesystem-based view of kernel objects

"knobs" filesystem-based manager of kernel objects, or config_items

repository for all things task related,

devices (with various exceptions, contradictions, confusions, and
hysterical raisins ...)
fixed mmap

A user-space request for a mmap starting at a fixed virtual
address. Generally not useful or guaranteed to work; a notable
exception is overlayed mmaps, where a mmaped area has further mmaps of
different types at fixed positions in the map.

Fully-Qualified Domain Name, e.g.



For AGP setups, Graphics Aperture Relocation Table.

GNOME's source documentation system (similar to javadoc).
Available by CVS from gnome. Kernel driver interface descriptions,
built from source using gdoc, are currently being written in 2.3.

Global Descriptor Table. Something to do with x86 memory
segmentation I think (FIXME). See ldt.

In the kernel, often means "get a reference to". This may be as
simple as incrementing a usage count, or it may imply attempting to
retrieve an object from a cache of some sort, or allocating kernel
memory. See put.

Generalised Kernel Hook Infrastructure, an IBM patch to implement
hooks into the kernel code.

I just had to point out that lkml is for Linux kernel development
discussions. Please please don't engage in any threads concerning
licensing issues, Microsoft, or Richard Stallman. Please.
group descriptor

On-disk filesystem structure, containing information for a block
group, such as the inode bitmap and block bitmap.

GRand Unified Bootloader, a popular bootloader for Linux, BSD, and
other OSes.

Global System Interrupt. Mainly used in the context of ACPI. Stupid acronym



high memory, or memory that is not permanently mapped into kernel
memory. Common on 32 bit x86 systems. See HighMemory.

High Precision Event Timer (HPET) is a replacement timer for the
8254 Programmable Interval Timer and the Real-time clock's (RTC)
periodic interrupt function. HPET is a successor to pmtimer, and is
far more efficient to read.


The HPET can produce periodic interrupts at a much higher
resolution than the RTC and is often used to synchronize multimedia
streams, providing smooth playback and reducing the need to use other
timestamp calculations such as an x86 cpu's RDTSC instruction. HPET
support in linux requires that the BIOS expose the HPET (via acpi).


Hierarchical Token Bucket, a qdisc based on TBF and CBQ. HTB Theory



IP Virtual Server, the kernel part of the LVS (Linux Virtual
Server) project. IPVS redirects incoming client requests to one of
several "real" servers, usually for the purpose of load balancing a

Interrupt Service Routine, the function in each device driver that
gets called when an interrupt happens.



An incrementing counter representing system "uptime" in ticks - or
the number of timer interrupts since boot. Ultimately the entire
original concept of a jiffy will likely vanish as systems use timer
events only when necessary and become "jiffyless".



a kernel thread that frees up memory by evicting data from caches
and paging out userspace memory, part of the virtual memory subsystem.



Logical Block Addressing. A way to address IDE disks without
Cylinder/Head/Sector (CHS) coordinates, using linear sector numbers
from the start of the disk. Allows for the use of very large IDE
Linux Device Drivers, 3rd Edition

online edition.

Linux Kernel Module. A (often dynamically loadable at system
runtime) kernel extension ("driver") to support, for example, some
kind of new hardware device or generic software abstraction.

Linux Kernel Mailing List. The primary virtual watering hole
(meeting ground) for kernel developers to share ideas and bounce
opinions off one another during the course of the kernel development
process. FAQ at

Linux Security Module. a security framework for providing
different security levels.

Logical Volume Management. A technology for providing an arbitrary
logical view of underlying data storage in a fashion supporting
resizing and restructuring of storage on the fly. Currently in version
2, originally written by Sistina (now Redhat).

a cross-reference tool that can be used to navigate the Linux
kernel source code, available at



A contiguous virtual array of struct pages representing the
entirety of physical memory pages available within a system.

Memory Management Unit, part of the CPU hardware that enforces
memory boundaries, and throw page faults, upon which the OS builds its
coherent protection. The MMU maps virtual memory to actual, where
protections allow.

MUTual EXclusion locks. This locking primitive is simpler and
semantically tighter than the others, and hence is easier to make
faster, and to prove correct. Some constraints are; lock has one owner
at a time, the locker, who must also be the unlocker. Read
Documentation/mutex-design.txt for much more.

Message Signaled Interrupts. A PCI mode where the interrupt
numbers are extended from 8 bits to 32. These also use the normal pci
data lanes not some magic all over the chipset; which means that a
device can basically have as many interrupts as it wants rather than 4
(1 in practice) for legacy PCI interrupts, and there are also no
interrupt sharing issues, since there are just so many numbers for
interrupts... For more, see



NAPI ("New API") is a modification to the device driver packet
processing framework, which is designed to improve the performance of
high-speed networking. See

Netfilter is a framework that provides a set of hooks within the
Linux kernel for intercepting and manipulating network packets. See and

Communication protocol between kernel and userspace




Page cache

a cache of file data and filesystem metadata, dynamically grown an
shrunk depending on other memory use.
Page table

data structure used by the MMU to translate virtual memory
addresses to physical memory addresses.

Per Processor Data Area is the x86 implementation of per-cpu memory.

Page Frame Number, index into the mem_map[] array which describes
physical memory pages.

Page Global Directory, the top level of the page table tree. The
page table hierarchy is pgd -> pud -> pmd -> pte.

Process IDentifier (POSIX thread identifier)

Page Mid-level Directory, note that pmds are folded into pgds on
systems with 2 level page tables.
Process descriptor

kernel data structure that describes/accounts process data related
to a single process.

Page Table Entry

Page Upper Directory, note that puds are folded into pmds, except
on systems with 4-levels page tables.



Queueing Discipline, queues packets before they are sent out to
the network device, enforces QoS requirements, provides traffic
shaping and prioritizing capabilities.

Quality of Service, method to define the importance/priority of
network services



Read Copy Update, a mechanism for SMPSynchronisation

resource limit, eg. "maximum amount of virtual memory" or "maximum
number of processes". Can be per process or per user.



a lock mechanism that works per process context, see SMPSynchronisation

the part of the kernel that chooses a suitable process to run on
the cpu, see the schedule() function.
Shared/Paged Socket Buffer

(also: pskb) Socket Buffer with uncontinuous data buffer, used for
zero copy, TSO and Scatter/Gather capable network cards.
Slab cache

a fast, SMP scalable kernel memory allocator.
Socket Buffer

(also: skb or sk_buff) data structure used to hold the data and
attributes of a network packet. See and for details.

kind of bottom half rarely used.
Spin lock

a simple SMP lock, see SMPSynchronisation
Swap token

a token to temporarily protect a process from pageout, an
alternative approach to memory scheduling, thrashing control. See the
Token Based Thrashing Control paper by Song Jiang and the Linux-MM
System call

(also: syscall) the way a program transitions from userspace into
kernel space, to call a kernel space function.

A pair of instructions on Pentium2+ that replace older INT
instruction based syscall mechanism. See

symbol table used by ksymoops to resolve numbers to function names
in Oops. Also used by ps and top for WCHAN field.




State of a task that is sleeping (not on the run queue). The
task will sleep until some event occurs that changes its state to
TASK_RUNNING. A task in this state can be awakened by signals.



State of a task that is on the run queue (but not necessarily running).



State of a task that has stopped and is not ready to run
(happens when a task receives SIGSTOP, SIGTSTP, SIGTTIN or SIGTTOU or
when any signal is received while the task is being debugged)



State of the task that is sleeping (not on the run queue) and
must be explicitly awakened. A task in this state can not be awakened
by signals.



State of a task that did called exit() but the parent task
didn't call wait4(). The task's descriptor is kept in memory and only
released when the parent task calls wait4()


Token Bucket Filter, a qdisc used for rate limiting

Task Group IDentifier (POSIX process identifier)



the page replacement algorithm used by the Linux 2.6 kernel, based
on the ideas behind the 2Q page replacement algorithm, also see the
AdvancedPageReplacement page.



Virtual Dynamically-linked Shared Object, a kernel-provided shared
library that helps userspace perform a few kernel actions without the
overhead of a system call, as well as automatically choosing the most
efficient syscall mechanism. Also called the "vsyscall page".

Virtual File System, an interface through which multiple
filesystems can be hooked into the kernel.
Virtual memory

every process in the system gets its own memory address space,
independent of the other processes.
Vsyscall page

see VDSO.





A paravirtualisation engine for Linux, an efficient way to run
multiple Linux OSes on one computer. Also runs BSD, Plan9 and other
OSes. (See website for more information.)

eXecute In Place, the ability to run an executable directly from
the filesystem (usually ROM or flash), instead of loading it into





A special networking code path where data is sent to the network
directly from userspace memory; this avoids unnecessary copying of
data and improves performance.

Monday, February 02, 2009

How to run a shared library on Linux

How to run a shared library on Linux

In my prevoius blog I have written how to run the shared libraries on

Shared object should have following entries to run:
1. +x permission that is by default is given by the static
linker(program linker) when creating a shared object.
2. Entry point at which the program/shared library is starts to run.
3. Interpreter(Run time linker) that is used to run any shared library
after loaded by kernel part exec().

Entry point at which the program/shared library is starts to run can be
given by passing -Wl,-e entry_point to the linker at command line:

To create .interp section by using GNU gcc, use the follwing line of
code on linux:
const char my_interp[] __attribute__((section(".interp"))) =

Where /lib/ is the path of interpreter(Run time linker)  in linux.

In open solaris we passed -Wl,-I,/usr/lib/ to the sun linker to
create this section.
I think in gnu linker this option is available but do other things.

Demo on Linux machine:
$ cat func.c
const char my_interp[] __attribute__((section(".interp"))) =
void bar();

exit (0);


$ gcc -fPIC -o -shared -Wl,-e,func func.c

You can see that have .interp section and interp program header.
# readelf -l
Elf file type is DYN (Shared object file)
Entry point 0x4dc
There are 7 program headers, starting at offset 52

Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
PHDR 0x000034 0x00000034 0x00000034 0x000e0 0x000e0 R E 0x4
INTERP 0x0005a3 0x000005a3 0x000005a3 0x00013 0x00013 R 0x1
[Requesting program interpreter: /lib/]
LOAD 0x000000 0x00000000 0x00000000 0x005bc 0x005bc R E 0x1000
LOAD 0x0005bc 0x000015bc 0x000015bc 0x00104 0x0010c RW 0x1000
DYNAMIC 0x0005d4 0x000015d4 0x000015d4 0x000c0 0x000c0 RW 0x4
NOTE 0x000114 0x00000114 0x00000114 0x00024 0x00024 R 0x4
GNU_STACK 0x000000 0x00000000 0x00000000 0x00000 0x00000 RW 0x4

Section to Segment mapping:
Segment Sections...
01 .interp
02 .gnu.hash .dynsym .dynstr .gnu.version
.gnu.version_r .rel.dyn .rel.plt .init .plt .text .fini .rodata
.interp .eh_frame
03 .ctors .dtors .jcr .dynamic .got .got.plt .bss
04 .dynamic

You can cleary see, have .interp section and INTERP program header.
Now try to run
$ ./