Tag Archives: C

Pthreads Dev – Common Programming Mistakes to Avoid

Disclaimer: Okay, let me straight away say this: most of these points below are from various books and online sources. One that stands out in my mind, an excellent tome, though quite old now, is “Multithreaded Programming with Pthreads”, Bil Lewis and Daniel J. Berg.

Common Programming Errors one (often?) makes when programming MT apps with Pthreads

  • Failure to check return values for errors
  • Using errno without actually checking that an error has occurred
WRONG:

    syscall_foo();
    if (errno) { ... }

Correct:

    if (syscall_foo() < 0) {
        if (errno) { ... }
    }

(Also, note that the pthread_* APIs do not set errno – they return the error number directly as the function’s return value)

  • Not joining on joinable threads
  • A critical one: Failure to verify that library calls are MT Safe
    Use the foo_r API version if it exists, over the foo.
    Use TLS, TSD, etc.
  • Falling off the bottom of main()
    You must call pthread_exit() – yes, in main() as well! Otherwise, when main() returns, the entire process (and all its threads) dies.
  • Forgetting to define the _POSIX_C_SOURCE feature-test macro (and to compile and link with -pthread)
  • Depending upon Scheduling Order
    Write your programs to depend upon synchronization. Don’t do :

          sleep(5); /* Enough time for manager to start */

Instead, wait until the event in question actually occurs; synchronize on it, perhaps using CVs (condition variables)

  • Not Recognizing Shared Data
    Especially true when manipulating complex data structures. Consider a list in which each element (node), as well as the list itself, has a separate mutex for protection: to search the list, a thread would have to acquire, then release, each node’s lock as it moved down the list. It would work, but it would be very expensive. Perhaps reconsider the design?
  • Assuming bit, byte or word stores are atomic
    Maybe, maybe not. Don’t assume – protect shared data
  • Not blocking signals when using sigwait(3)
    Any signals you wait upon with sigwait(3) must be blocked in every thread, so they are never delivered asynchronously to some other thread; block ’em, then wait on ’em
  • Passing pointers to data on the stack to another thread

(Case a) Simple – an integer value:

Correct (below):

 // thread creation loop
 for (i = 0; i < NUM_THREADS; i++)
     pthread_create(&thrd[i], &attr, worker, (void *)(long)i);
 // join loop...


The integer is passed to each thread by value (cast into the void * argument); no issues.

Wrong approach (below):

 // thread creation loop
 for (i = 0; i < NUM_THREADS; i++)
     pthread_create(&thrd[i], &attr, worker, &i);
 // join loop...


Passing the integer by address implies that some thread B could be reading it while the main thread is writing to it! A race, a bug.

(Case b) More complex – a data structure:

Correct (below):

my_struct *pstr;
// thread creation loop
for (i = 0; i < NUM_THREADS; i++) {
    pstr = malloc(...);
    pstr->data = <whatever>;
    pstr->... = ...; // and so on...
    pthread_create(&thrd[i], &attr, worker, pstr);
}
// in the join loop...


The per-iteration malloc ensures that each thread receives its own private copy of the structure. Thread safe.

Wrong! (below)

my_struct *pstr = malloc(...);

for (i = 0; i < NUM_THREADS; i++) {
    pstr->data = <whatever>;
    pstr->... = ...; // and so on...
    pthread_create(&thrd[i], &attr, worker, pstr);
}
// join


If you do this (the wrong one, above), then a single instance of the data structure is shared among all the threads without protection – threads will step on each other’s toes, corrupting the data and the app. Thread unsafe.

  • Avoid the above problems; use the TLS (Thread-Local Storage) – a simple and elegant approach to making your code thread safe.

Resource: GCC page on TLS.




Application Binary Interface (ABI) Docs and Their Meaning

Have you, the programmer, ever really thought about how it all actually works? I’m sure you have…

We write

printf("Hello, world! value = %d\n", 41+1);

and it works. But it’s ‘C’ code – the microprocessor cannot possibly understand it; all it “understands” is a stream of binary digits – machine language. So who, or what, transforms source code into this machine language?

The compiler of course! How? It just does (cheeky). So who wrote the compiler? How?
Ah. Compiler authors figure out how by reading a document provided by the microprocessor (cpu) folks – the ABI – Application Binary Interface.

People often ask “But what exactly is an ABI?”. I like the answer provided here by JesperE:

"... If you know assembly and how things work at the OS-level, you are conforming to a certain ABI. The ABI govern things like
how parameters are passed, where return values are placed. For many platforms there is only one ABI to choose from, and in those
cases the ABI is just "how things work".

However, the ABI also govern things like how classes/objects are laid out in C++. This is necessary if you want to be able to pass
object references across module boundaries or if you want to mix code compiled with different compilers. ..."

Another way to state it:
The ABI describes the underlying nuts and bolts of the mechanisms that systems software such as the compiler, linker and loader – IOW, the toolchain – needs to be aware of: data representation, function calling and return conventions, register usage conventions, stack construction, stack frame layout, argument passing – formal linkage, encoding of object files (e.g. ELF), etc.

Having a minimal understanding of :

  • a CPU’s ABI – which includes stuff like
    • its procedure calling convention
    • stack frame layout
    • ISA (Instruction Set Architecture)
    • registers and their internal usage, and,
  • bare minimal assembly language for that CPU,

helps to no end when debugging a complex situation at the level of the “metal”.

With this in mind, here are a few links to various CPU ABI documents, and other related tutorials:

However, especially for folks new to it, reading the ABI docs can be quite a daunting task! Below, I hope to provide some simplifications which help one gain the essentials without getting completely lost in details (that probably do not matter).

Often, when debugging, one finds that the issue lies with how exactly a function is being called – we need to examine the function parameters, locals, return value. This can even be done when all we have is a binary dump – like the well known core file (see man 5 core for details).

Intel x86 – the IA-32

On the IA-32, the stack is used for function calling, parameter passing, locals.

Stack Frame Layout on IA-32

[ ...                      <-- Bottom of frame; higher addresses
  RET addr
  [SFP]                    <-- SFP = saved frame pointer (the previous EBP) [optional]
  ... ]                    <-- ESP: Top of stack; in effect, the lowest stack address

Intel 64-bit – the x86_64

On this processor family, the situation is far more optimized: registers are used to pass the first six (integer or pointer) arguments to a function; the seventh argument onward is passed on the stack. The stack layout is very similar to that on IA-32.

Register Set


<Original image: from Intel manuals>

Actually, the above register-set image applies to all x86 processors – it’s an overlay model:

  • the 32-bit registers occupy the lower half of the 64-bit ones; the prefix changes from R to E (e.g. RAX → EAX)
  • the 16-bit registers occupy the lower half of the 32-bit ones; the E prefix is dropped (EAX → AX)
  • the 8-bit registers are the high and low bytes of the 16-bit ones (AX → AH, AL).

The first six integer or pointer arguments are passed in the following registers, in order: rdi, rsi, rdx, rcx, r8 and r9.


(By the way, looking up the registers is easy from within GDB: just use its info registers command).

An example from this excellent blog “Stack frame layout on x86-64” will help illustrate:

On the x86_64, call a function that receives 8 parameters – ‘a, b, c, d, e, f, g, h’. The first six (a..f) travel in registers; g and h go onto the stack. The situation looks like this now:


What is this “red zone” thing above? From the ABI doc:

The 128-byte area beyond the location pointed to by %rsp is considered to be reserved and shall not be modified by signal or interrupt handlers. Therefore, functions may use this area for temporary data that is not needed across function calls. In particular, leaf functions may use this area for their entire stack frame, rather than adjusting the stack pointer in the prologue and epilogue. This area is known as the red zone.

Basically it’s an optimization for the compiler folks: when a ‘leaf’ function is called (one that does not invoke any other functions), the compiler will generate code to use the 128 byte area as ‘scratch’ for the locals. This way we save two machine instructions to lower and raise the stack on function prologue (entry) and epilogue (return).

ARM-32 (Aarch32)

<Credits: some pics shown below are from here : ‘ARM University Program’, YouTube. Please see it for details>.

The Aarch32 processor family has seven modes of operation: of these, six of them are privileged and only one – ‘User’ – is the non-privileged mode, in which user application processes run.


When a process or thread makes a system call, the compiler emits the SWI machine instruction, which puts the CPU into Supervisor (SVC) mode.

The Aarch32 Register Set:


Register usage conventions are mentioned below.

Function Calling on the ARM-32

The Aarch32 ABI reveals that its registers are used as follows:

Register   APCS name   Purpose
R0         a1          Argument registers: used to pass values in;
R1         a2          need not be preserved across calls; results
R2         a3          are usually returned in R0
R3         a4
R4         v1          Variable registers: used internally by functions,
R5         v2          must be preserved if used. Essentially, r4 to r9
R6         v3          hold local variables as register variables.
R7         v4          (Also, in the case of the SWI machine instruction
R8         v5          (syscall), r7 holds the syscall #.)
R9         v6
R10        sl          Stack Limit / stack chunk handle
R11        fp          Frame Pointer: contains zero, or points to the stack backtrace structure
R12        ip          Procedure entry temporary workspace
R13        sp          Stack Pointer: fully descending stack, points to the lowest free word
R14        lr          Link Register: return address at function exit
R15        pc          Program Counter

(APCS = ARM Procedure Calling Standard)

When a function is called on the ARM-32 family, the compiler generates assembly code such that the first four integer or pointer arguments are placed in the registers r0, r1, r2 and r3. If the function is to receive more than four parameters, the fifth one onwards goes onto the stack. If enabled, the frame pointer (very useful for accurate stack unwinding/backtracing) is in r11. The last three registers are always used for special purposes:

  • r13: stack pointer register
  • r14: link register; in effect, return (text/code) address
  • r15: the program counter (the PC)


The PSR – Processor State Register – holds the system ‘state’; it is constructed like this:



<TODO: Aarch64>

Hope this helps!

A Header of Convenience

Over the years, we tend to collect little snippets of code and routines that we use, like, refine and reuse.

I’ve done so, for (mostly) user-space and kernel programming on the 2.6 / 3.x Linux kernel. Feel free to use it. Please do get back with any bugs you find, suggestions, etc.

License: GPL / LGPL

Click here to view the code!

There are macros / functions to:

  • make debug prints along with function name and line# info (via the usual printk() or trace_printk()) – only if DEBUG mode is On
    • [EDIT]: rate-limiting is turned Off by default (else we risk missing some prints); otherwise, rate-limited printk’s are preferred
  • dump the kernel-mode stack
  • print the current context (process or interrupt along with flags in the form that ftrace uses)
  • a simple assert() macro (!)
  • a cpu-intensive DELAY_LOOP (useful for test rigs that must spin on the processor)
  • an equivalent to usermode sleep functionality (DELAY_SEC()).

Whew 🙂

Edit: removed the header listing inline here; it’s far more convenient to just view it online here.

kmalloc and vmalloc : Linux kernel memory allocation API Limits

The Intent

To determine how large a memory allocation can be made from within the kernel, via the “usual suspects” – the kmalloc and vmalloc kernel memory allocation APIs, in a single call.

Let’s answer this question using two approaches: one, reading the source, and two, trying it out empirically on the system.
(Kernel source from kernel ver 3.0.2; tried out on kernel ver 2.6.35 on an x86 PC and on the (ARM) BeagleBoard).

Quick Summary

For the impatient:

The upper limit (number of bytes that can be allocated in a single kmalloc request), is a function of:

  • the processor – really, the page size – and
  • the number of buddy system freelists (MAX_ORDER).

On both x86 and ARM, with a standard page size of 4 KB and MAX_ORDER of 11, the kmalloc upper limit is 4 MB!

The vmalloc upper limit is, in theory, the amount of physical RAM on the system.
In practice, the kernel allocates an architecture (cpu) specific “range” of virtual memory for the purpose of vmalloc: from VMALLOC_START to VMALLOC_END.

In practice, it’s usually a lot less. A useful comment by ugoren points out that:
“in 32-bit systems, vmalloc is severely limited by its virtual memory area. For a 32-bit x86 machine, with 1GB RAM or more, vmalloc is limited to 128MB (for all allocations together, not just for one).”

[EDIT/UPDATE #3 : July ’17]
I wrote a simple kernel module (can download the source code, see the link at the end of this article), to test the kmalloc/vmalloc upper limits; the results are what we expect:
for kmalloc, 4 MB is the upper limit with a single call; for vmalloc, it depends on the vmalloc range.

Also, please realize, the actual amount you can acquire at runtime depends on the amount of physically contiguous RAM available at that moment in time; this can and does vary widely.

Finally, what if one require more than 4 MB of physically contiguous memory? That’s pretty much exactly the reason behind CMA – the Contiguous Memory Allocator! Details on CMA and using it are in this excellent LWN article here. Note that CMA was integrated into mainline Linux in v3.17 (05 Oct 2014). Also, the recommended API interface to use CMA is the ‘usual’ DMA [de]alloc APIs (kernel documentation here and here); don’t try and use them directly.

kmalloc Limit Tests

First, let’s check out the limits for kmalloc :

Continue reading kmalloc and vmalloc : Linux kernel memory allocation API Limits

SIGRTMIN or SIGRTMAX – who’s higher up (in signal delivery priority order)?

Linux supports realtime (RT) signals (part of the IEEE 1003.1b [POSIX.1b Realtime] standard). The only other GPOS (General Purpose OS) that does so is Solaris (TODO – verify. Right? Anyone??).

The Linux RT signals can be seen with the usual ‘kill’ command (‘kill -l’; that’s the letter lowercase L).

Continue reading SIGRTMIN or SIGRTMAX – who’s higher up (in signal delivery priority order)?