My book – Hands-On System Programming with Linux

Hello, pleased that my first book has been recently released (on 31 Oct 2018). The publisher is Packt (based in Birmingham, UK).

Quick Links

The book is available in various formats here:

The book has an open-to-public GitHub repository as well.

A chapter of the book is freely available online:  File IO Essentials. Do download and check it out.


A fairly detailed description of the book, and what it attempts to cover is given below. Obviously, am very grateful to my readers. A request: once you’ve gone through the book, please take five minutes to write a review for the book on Amazon.

Hands-On System Programming with Linux

The Linux OS has grown from being a one-person tiny hobby project to a fundamental and integral part of a stunning variety of software products and projects; it is found in tiny industrial robots, consumer electronic devices (smartphones, tablets, music players); it is the powerhouse behind enterprise-scale servers, network switches and data centres all over the Earth.

This book is about a fundamental and key aspect of Linux – systems programming at the library and system call layers of the Linux stack. It will cement for you, in a fast-paced yet deeply technical manner, both the conceptual “why” and with a hands-on approach the practical “how” of systems programming on the Linux OS. Real-world relevant and detailed code examples, tested on the latest 4.x kernel distros, are found throughout the book, with added emphasis on key aspects such as programming for security. Linux was never more relevant in industry; are you?

The book’s style is one of making complex topics and ideas easy to understand; we take great pains to introduce the reader to relevant theoretical concepts such that the fundamentals are crystal clear before moving on to APIs and code. Then, the book goes on to provide detailed, practical, hands-on code examples to solidify the concepts. All the code shown in the book (several examples for each chapter) is available, ready to build and try, on the book’s GitHub repository here: https://github.com/PacktPublishing/Hands-on-System-Programming-with-Linux. Not only that, each chapter has suggested assignments for the budding programmer to try; some of these assignments have solutions provided as well (find them in the GitHub repo).

This book covers a huge amount of ground within the Linux system programming domain; nevertheless, it does so in a unique way. Several features make this book useful and unique; some of them are enumerated below:

  • How many books on programming have you read that, besides the typical ‘Hello, world’ ‘C’ program, also include a couple of assembly language examples, describe what the CPU ABI is, and a whole lot more, including a detailed description of system architecture, system calls, etc., in the very first chapter?
  • While we do not delve into intricate details on their usage, this book uses, and briefly shows how to use, plenty of useful and relevant tools – from ltrace and strace to perf, to LTTng and Ftrace
  • The book is indeed detailed; for example, it does not shy away from even delving (to the appropriate degree) into kernel internals wherever required. The understanding and description of an important topic such as Virtual Memory in Chapter 2, serves as a good example. Fairly advanced concepts such as looking into the process stack (gstack, GDB) and what the process VM-split is, are covered as well
  • Ch 4 on Dynamic Memory Allocation goes well beyond the usual malloc API family; here, once we get that covered (in depth of course), the book moves into more advanced topics such as internal malloc behavior, demand paging, explicitly locking and protecting memory pages, using the alloca API, etc
  • Ch 5 on Memory Issues, leads with a detailed description (and actual code test cases) of common memory defects (bugs), that novice (and even experienced!) programmers often overlook. These include the OOB (Out Of Bounds) memory accesses (read/write overflows/underflows), UAF, UAR, leakage and double free
  • Ch 6 then continues logically forward with a detailed discussion on tools with which to detect the aforementioned memory defects; here, we focus on using the well-known Valgrind (memcheck) and on the newer, exciting Sanitizer toolset (compiler-based tools). We compare them point for point, and explain how the modern C/C++ developer must tackle memory defects
  • File I/O: a vast topic by itself; we divide the discussion into two parts. The reader is advised to first delve into Appendix A – File IO Essentials (available online) which covers the basic concepts. Our Ch 18 on Advanced File I/O goes deeper; most programmers are aware of and frequently use the “usual” read/write system calls. We show how to optimize performance in various ways- using the MT (multithreaded) optimal pread/pwrite APIs, using scatter-gather (SG-IO). An explanation (with detailed diagrams) on the kernel picture for the block IO code path, shows the reader how I/O actually works – the kernel page cache and related components are shown. Leveraging this knowledge with the posix_fadvise and the readahead APIs is described as well. Memory mapping as a powerful “zero-copy” technique for file IO is explained with sample code. More advanced areas such as DIO and AIO are briefly delved into as well, all with the idea that the developer can leverage the system for maximum I/O performance
  • Most programmers are (dimly) aware of the traditional Unix “permissions model”; we cover this in detail from a system programming perspective in Ch 7 – Process Credentials (we explain setuid/setgid/saved-set-ID programs). We emphasize though, that there is a superior modern approach – the POSIX Capabilities model – which is covered in Ch 8. Security concerns and how these get addressed form the meat of the discussion here
  • Being a book on system programming, we obviously cover Process Execution (Ch 9) and Process Creation (Ch 10) in depth; the fork system call and all its subtleties and intricacies. Towards this end, we encode “The rules of fork” – a set of seven rules that help the developer truly understand how this important system call actually behaves. As part of this chapter, the wait API and its variations, and the famous Unix “orphan” and “zombie” processes are covered as well
  • Again, as expected, we cover the topic of Signaling in depth over two whole chapters (11 and 12). Besides the basics, the book delves into practical details (with code, of course) in covering, for example, various techniques by which one can handle a very high  volume of signals continually bombarding a process. The notion of software reentrancy (and the related reentrant-safety concept) is covered. Ways to prevent the zombie, using an alternate signal stack, etc are covered as well. The next chapter on Signaling delves into the intricacies of handling fatal signals in a real-world Linux application – a must-do! The book provides an effective “template code” to conveniently fulfil this purpose. Using real-time signals for IPC, and synchronous APIs for signaling is covered too
  • The chapter (13) on Timers covers both the traditional and the more recent powerful POSIX timers; here, we explain concepts via two interesting mini-projects – building a “how fast can you react” (CLI) game and a small “run:walk” timer app
  • Multithreading with Pthreads is again a vast area; this book devotes three whole chapters (14, 15, 16) to this important topic. In the first of this trilogy, we delve into the thread concept, how exactly it differs from a process and importantly, why threading is useful (with code examples of course). The next chapter deals in depth with the key topics of concurrency and synchronization (within the Pthreads domain, covering mutex locks and CVs); here, we use the (old but) very interesting Mars Pathfinder mission as a quick case study on priority inversion and how it can be prevented. Next, the really important topics of thread safety, cancellation and cleanup, are covered. The chapter ends with a brief “Multi processing vs MT” discussion and some typical FAQs. (We even include a ‘pthreads_app_template.c’ program in the book’s GitHub tree)
  • Ch 17 delves into intricacies of CPU Scheduling on the Linux OS; key concepts – the Linux state machine, what real-time is, POSIX scheduling policies available, etc are covered. We demonstrate a MT app that switches two worker threads to become (soft) real-time
  • The book ends with a small chapter (19) devoted to Troubleshooting and Best Practices to follow – a small but really key chapter!

Scattered throughout the book are interesting examples and reader assignments (some of which have solutions provided). Also, we try not to be completely x86 specific; in a few examples we cross-compile for the popular ARM-32 CPU; we mention the SEALS project (allowing one to quickly prototype apps on a Qemu emulated ARM/Linux system).

Who this book will benefit:

  • The professional Linux application developer
  • Application architects, leads, technical managers, consultants
  • Linux QA professionals
  • Students desirous of gaining an industry-relevant edge
  • Anybody interested in Linux programming and crafting good software in general.

Advice to a Young Firmware Developer – by Jack Ganssle; and Assembly

I hope Jack Ganssle forgives my directly copying his content; the only reason I do so is that these thoughts of his are precious and I wish for more of us to read and appreciate them. (The discussion is definitely biased towards firmware/embedded developers that work primarily on a hardware platform using ‘C’ as the language. Just a small part of Jack’s excellent Embedded Muse newsletter is shown below; do check out the full article and subscribe to his newsletter). 

 

Directly copied from here: The Embedded Muse, Issue #362 by Jack Ganssle, 19 Nov 2018.

Advice to a Young Firmware Developer

“… Learn, in detail, how a computer works. You should be able to draw a detailed block diagram of one. Even if you have no interest in the hardware, it’s impossible to understand assembly language and other important aspects of creating firmware without understanding program counters, registers, busses and the like.

Learn an assembly language. Write real programs in it, and make them work. Absent a grounding in assembly much of the operation of a computer will be mysterious. In real life you’ll have to delve into the assembly at least occasionally, at least to work on the startup code, and to find some classes of bugs.

A recent article in IEEE Spectrum surveyed language use and C didn’t even make the cut. Java, JavaScript, HTML and Python were ranked as the most in-demand languages in the USA. Yet around 70% of firmware people work in C. C++ makes up another 20%. For better or worse, all of the other embedded languages are in the noise. Master C, pointers and all. (Rust is increasingly popular, yet, despite the hype, has under a 1% share in the embedded space).

But do learn some other languages. Python can be useful for scripting. Ada gives a discipline I wish more had.

Work in a cross-development environment with an embedded target board. It’s very different from using Visual Studio.

Get comfortable with a Linux shell. With sed, awk, and a hundred other tools you can do incredible things without writing any code.

Take the time to think through what you’re building. It’s tempting to start coding early. Design is critically important. Remember the old saying: “if you think good design is expensive, consider the cost of bad design.”

Monitor your bug rates. Forever. Skip this and you’ll never know two things: if you’re improving, and how you compare to the industry. We all think we’re great developers, but metrics can be a cold shower.

Always be on the prowl for tools. Some are free, others expensive, but great tools will make you more productive and less error-prone. These change all the time so figure on constantly refining your toolbox.

Did you know the average firmware person reads one technical book a year? Yet this field evolves at the pace of Moore’s Law. Constantly study software engineering. We do have a Body of Knowledge. Every year new ideas appear. Some are brilliant, others whacky, but they all make you think.

Learn about the hardware. At least get a general understanding. An engineer who can use an oscilloscope and logic analyzer for troubleshooting code is a valuable addition to a software team. Digital and analog hardware is cool and fascinating. …”


So, okay, that’s the part of the article I wanted to show.

Learning Assembly Language – Resources

How does one just learn assembly language then? Well, there are resources of course that help – books, online articles; here’s a few: 

Pthreads Dev – Common Programming Mistakes to Avoid

Disclaimer: Okay, let me straight away say this: most of these points below are from various books and online sources. One that stands out in my mind, an excellent tome, though quite old now, is “Multithreaded Programming with Pthreads”, Bil Lewis and Daniel J. Berg.

Common Programming Errors one (often?) makes when programming MT apps with Pthreads

  • Failure to check return values for errors
  • Using errno without actually checking that an error has occurred
WRONG:
    syscall_foo();
    if (errno) { ... }

Correct:
    if (syscall_foo() < 0) {
        if (errno) { ... }
    }

(Also, note that the Pthreads APIs generally do not set errno; they return the error number directly as the function's return value)

  • Not joining on joinable threads
  • A critical one: Failure to verify that library calls are MT Safe
    Use the foo_r API version if it exists, over the foo.
    Use TLS, TSD, etc.
  • Falling off the bottom of main()
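    Returning from main() terminates the whole process, and with it every other thread. If the other threads must keep running, main() must call pthread_exit(); yes, in main as well!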
  • Forgetting to define the _POSIX_C_SOURCE feature test macro
  • Depending upon Scheduling Order
    Write your programs to depend upon synchronization. Don’t do :

          sleep(5); /* Enough time for manager to start */

Instead, wait until the event in question actually occurs; synchronize on it, perhaps using CVs (condition variables)

  • Not Recognizing Shared Data
    Especially true when manipulating complex data structures – such as lists in which each element (or node) as well as the entire list has a separate mutex for protection; so to search the list, you would have to obtain, then release, each lock as the thread moved down the list. It would work, but be very expensive. Reconsider the design perhaps?
  • Assuming bit, byte or word stores are atomic
    Maybe, maybe not. Don’t assume – protect shared data
  • Not blocking signals when using sigwait(3)
    Any signals you’re blocking upon with the sigwait(3) should never be delivered asynchronously to another thread; block ’em and use ’em
  • Passing pointers to data on the stack to another thread

(Case a) Simple – an integer value:

Correct (below):

int main(void)
{
 ...
 // thread creation loop
 for (i=0; i<NUM_THREADS; i++) {
    // pass the integer's value itself, cast to the void * parameter type
    pthread_create(&thrd[i], &attr, worker, (void *)(long)i);
 }
 ...
 // join loop...

 pthread_exit(NULL);
}

The integer is passed to each thread as a ‘literal value’; no issues.

Wrong approach (below):

int main(void)
{
 ...
 // thread creation loop
 for (i=0; i<NUM_THREADS; i++) {
     // BUG: every thread receives the address of the same loop variable
     pthread_create(&thrd[i], &attr, worker, &i);
 }
 ...
 // join loop...

 pthread_exit(NULL);
}

Passing the integer by address implies that some thread B could be reading it while thread main is writing to it! A race, a bug.

(Case b) More complex – a data structure:

Correct (below):

int main(void)
{
my_struct *pstr[NUM_THREADS];  // one pointer per thread
...
// thread creation loop
for (i=0; i<NUM_THREADS; i++) {
   pstr[i] = malloc(sizeof(my_struct));
   pstr[i]->data = <whatever>;
   pstr[i]->... = ...; // and so on...
   pthread_create(&thrd[i], &attr, worker, pstr[i]);
}
...
// in the join loop, once thread i has been joined:
   free(pstr[i]);

pthread_exit(NULL);
}

The malloc inside the loop ensures that each thread is passed its own private instance of the data structure; no other thread touches it. Thread Safe.

Wrong! (below)

my_struct *pstr;   // global: a single instance shared by all threads

int main(void)
{
...
pstr = malloc(sizeof(my_struct));
for (i=0; i<NUM_THREADS; i++) {
   pstr->data = <whatever>;
   pstr->... = ...; // and so on...
   pthread_create(&thrd[i], &attr, worker, pstr);
}
// join

free(pstr);
pthread_exit(NULL);
}

If you do this (the wrong one, above), then the global pointer (one instance of the data structure only) is being passed around without protection – threads will step on “each other’s toes” corrupting the data and the app. Thread Unsafe.

  • Avoid the above problems; use the TLS (Thread-Local Storage) – a simple and elegant approach to making your code thread safe.

Resource: GCC page on TLS.

 

 

Application Binary Interface (ABI) Docs and Their Meaning

Have you, the programmer, ever really thought about how it all actually works? Am sure you have…

We write

printf("Hello, world! value = %d\n", 41+1);

and it works. But it’s ‘C’ code – the microprocessor cannot possibly understand it; all it  “understands” is a stream of binary digits – machine language. So, who or what transforms source code into this machine language?

The compiler of course! How? It just does (cheeky). So who wrote the compiler? How?
Ah. Compiler authors figure out how by reading a document provided by the microprocessor (CPU) folks – the ABI – Application Binary Interface.

People often ask “But what exactly is an ABI?”. I like the answer provided here by JesperE:

"... If you know assembly and how things work at the OS-level, you are conforming to a certain ABI. The ABI govern things like
how parameters are passed, where return values are placed. For many platforms there is only one ABI to choose from, and in those
cases the ABI is just "how things work".

However, the ABI also govern things like how classes/objects are laid out in C++. This is necessary if you want to be able to pass
object references across module boundaries or if you want to mix code compiled with different compilers. ..."

Another way to state it:
The ABI describes the underlying nuts and bolts of the mechanisms  that systems software such as the compiler, linker, loader – IOW, the toolchain – needs to be aware of: data representation, function calling and return conventions, register usage conventions, stack construction, stack frame layout, argument passing – formal linkage, encoding of object files (eg. ELF), etc.

Having a minimal understanding of :

  • a CPU’s ABI – which includes stuff like
    • its procedure calling convention
    • stack frame layout
    • ISA (Instruction Set Architecture)
    • registers and their internal usage, and,
  • bare minimal assembly language for that CPU,

helps to no end when debugging a complex situation at the level of the “metal”.

With this in mind, here are a few links to various CPU ABI documents, and other related tutorials:

However, especially for folks new to it, reading the ABI docs can be quite a daunting task! Below, I hope to provide some simplifications which help one gain the essentials without getting completely lost in details (that probably do not matter).

Often, when debugging, one finds that the issue lies with how exactly a function is being called – we need to examine the function parameters, locals, return value. This can even be done when all we have is a binary dump – like the well known core file (see man 5 core for details).

Intel x86 – the IA-32

On the IA-32, the stack is used for function calling, parameter passing, locals.

Stack Frame Layout on IA-32

[...                            <-- Bottom; higher addresses.
PARAMS 
...]              
RET addr 
[SFP]                      <-- SFP = pointer to previous stack frame [EBP] [optional]
[... 
LOCALS 
...]                           <-- ESP: Top of stack; in effect, lowest stack address


Intel 64-bit – the x86_64

On this processor family, the situation is far more optimized. Registers are used to pass along the first six integer or pointer arguments to a function; the seventh onwards is passed on the stack. The stack layout is very similar to that on IA-32.

Register Set

<Figure: the x86_64 register set>

<Original image: from Intel manuals>

Actually, the above register-set image applies to all x86 processors – it’s an overlay model:

  • the 32-bit registers are literally “half” the size and their prefix changes from R to E (RAX becomes EAX)
  • the 16-bit registers are half the size of the 32-bit ones and the E prefix is dropped (EAX becomes AX)
  • the 8-bit registers are half the size of the 16-bit ones; AX splits into its high and low bytes, AH and AL.

The first six integer or pointer arguments are passed in the following registers, in this order:

RDI, RSI, RDX, RCX, R8, R9

(By the way, looking up the registers is easy from within GDB: just use its info registers command).

An example from this excellent blog “Stack frame layout on x86-64” will help illustrate:

On the x86_64, call a function that receives 8 parameters – ‘a, b, c, d, e, f, g, h’. The situation looks like this now:

<Figure: x86_64 stack frame for a function receiving eight parameters, showing the red zone>

What is this “red zone” thing above? From the ABI doc:

The 128-byte area beyond the location pointed to by %rsp is considered to be reserved and shall not be modified by signal or interrupt handlers. Therefore, functions may use this area for temporary data that is not needed across function calls. In particular, leaf functions may use this area for their entire stack frame, rather than adjusting the stack pointer in the prologue and epilogue. This area is known as the red zone.

Basically it’s an optimization for the compiler folks: when a ‘leaf’ function is called (one that does not invoke any other functions), the compiler can generate code that uses this 128-byte area, just below the stack pointer, as ‘scratch’ space for the locals. This way we save the two machine instructions that lower and raise the stack pointer in the function prologue (entry) and epilogue (return).

ARM-32 (Aarch32)

<Credits: some pics shown below are from here : ‘ARM University Program’, YouTube. Please see it for details>.

The Aarch32 processor family has seven modes of operation: six of them are privileged and only one – ‘User’ – is non-privileged; it is the mode in which user application processes run.

<Figure: the Aarch32 processor modes>

When a process or thread makes a system call, the C library’s system call wrapper code issues the SWI machine instruction (called SVC in newer ARM documentation), which puts the CPU into Supervisor (SVC) mode.

The Aarch32 Register Set:

<Figure: the Aarch32 register set>

Register usage conventions are mentioned below.

Function Calling on the ARM-32

The Aarch32 ABI reveals that its registers are used as follows:

Register   APCS name   Purpose
R0-R3      a1-a4       Argument registers: used for passing values; don’t need to be preserved. Results are usually returned in R0.
R4-R9      v1-v6       Variable registers: used internally by functions; must be preserved if used. Essentially, r4 to r9 hold local variables as register variables. (Also, in the case of the SWI machine instruction (syscall), r7 holds the syscall #.)
R10        sl          Stack Limit / stack chunk handle
R11        fp          Frame Pointer; contains zero, or points to stack backtrace structure
R12        ip          Procedure entry temporary workspace
R13        sp          Stack Pointer; fully descending stack, points to lowest free word
R14        lr          Link Register; return address at function exit
R15        pc          Program Counter

(APCS = ARM Procedure Calling Standard)

When a function is called on the ARM-32 family, the compiler generates assembly code such that the first four integer or pointer arguments are placed in the registers r0, r1, r2 and r3. If the function is to receive more than four parameters, the fifth one onwards goes onto the stack. If enabled, the frame pointer (very useful for accurate stack unwinding/backtracing) is in r11. The last three registers are always used for special purposes:

  • r13: stack pointer register
  • r14: link register; in effect, return (text/code) address
  • r15: the program counter (the PC)

 

The PSR – Program Status Register – holds the processor ‘state’; it is constructed like this:

<Figure: layout of the CPSR (Current Program Status Register)>

 

<TODO: Aarch64>

Hope this helps!

Setting up Kdump and Crash for ARM-32 – an Ongoing Saga

Author: Kaiwan N Billimoria, kaiwanTECH
Date: 13 July 2017

DUT (Device Under Test):
Hardware platform: Qemu-virtualized Versatile Express Cortex-A9.
Software platform: mainline Linux kernel ver 4.9.1, kexec-tools, crash utility.

First, my attempt at setting up the Raspberry Pi 3 failed; mostly due to recurring issues with the bloody MMC card; probably a power issue! (see this link).

Anyway. I then switched to doing the same on the always-reliable Qemu virtualizer; I prefer to set up the Vexpress-CA9.

In fact, a supporting project I maintain on GitHub – the SEALS project – is proving extremely useful for building the ARM-32 hardware/software platform quickly and efficiently. (Fun fact: SEALS = Simple Embedded Arm Linux System).

So, I cloned the above-mentioned git repo for SEALS into a new working folder.

The way SEALS works is simple: edit a configuration file (build.config) to your satisfaction, to reflect the PATH to and versions of the cross-compiler, kernel, kernel command-line parameters, busybox, rootfs size, etc.

Set up the SEALS build.config file.

Screenshot: the build_SEALS.sh script’s initial screen, displaying the current build config

<<
Relevant Info reproduced below for clarity:

Toolchain prefix : arm-none-linux-gnueabi-
Toolchain version: (Sourcery CodeBench Lite 2014.05-29) 4.8.3 20140320 (prerelease)

Staging folder : <…>/SEALS_staging
ARM Platform : Versatile Express (A9)

Platform RAM : 512 MB
RootFS force rebuild : 0
RootFS size : 768 MB

Linux kernel to use : 4.9.1
Linux kernel codebase location : <…>/SEALS_staging/linux-4.9.1
Kernel command-line : “console=ttyAMA0 root=/dev/mmcblk0 init=/sbin/init crashkernel=32M”

Busybox to use : 1.26.2
Busybox codebase location : <…>/SEALS_staging/busybox-1.26.2

>>

Screenshot: the build_SEALS.sh second GUI screen, allowing the user to select actions to take

Upon clicking ‘OK’, the build process starts:

I Boot Kernel Setup

  • kernel config: the Linux kernel must be configured carefully. Please follow the kernel documentation in detail:
    https://www.kernel.org/doc/Documentation/kdump/kdump.txt [1]
    In brief, ensure these are set:
    CONFIG_KEXEC=y
    CONFIG_SYSFS=y << should be >>
    CONFIG_DEBUG_INFO=y
    CONFIG_CRASH_DUMP=y
    CONFIG_PROC_VMCORE=y

Dump-capture kernel config options (Arch Dependent, arm)
To use a relocatable kernel, Enable “AUTO_ZRELADDR” support under “Boot” options:

             AUTO_ZRELADDR=y

Also see this gist, which succinctly got it working: https://gist.github.com/Gnurou/7191098

  • Copy the ‘kexec’ binary into the root filesystem (staging tree) under its sbin/ folder
  • We build a relocatable kernel so that we can use the same ‘zImage’ 
    for the dump kernel as well as the primary boot kernel:
     “Or use the system kernel binary itself as dump-capture kernel and there is 
    no need to build a separate dump-capture kernel. 
    This is possible  only with the architectures which support a 
    relocatable kernel. As  of today, i386, x86_64, ppc64, ia64 and 
    arm architectures support relocatable kernel. ...”
    
  • the SEALS build system will proceed to build the kernel using the cross-compiler specified
  • the build went through just fine.

II Load dump-capture (or kdump) kernel into boot kernel’s RAM

Do read [1], but to cut a long story short:

  • Create a small shell script kx.sh - a wrapper over kexec – in the root filesystem:
     
     #!/bin/sh
    DUMPK_CMDLINE="console=ttyAMA0 root=/dev/mmcblk0 rootfstype=ext4 rootwait init=/sbin/init maxcpus=1 reset_devices"
    kexec --type zImage \
    -p ./zImage-4.9.1-crk \
    --dtb=./vexpress-v2p-ca9.dtb \
    --append="${DUMPK_CMDLINE}" 
    [ $? -ne 0 ] && { 
        echo "kexec failed." ; exit 1
    }
    echo "$0: kexec: success, dump kernel loaded."
    exit 0
    
  • Run it. It will only work (in my experience) when:
    • you’ve passed the kernel parameter ‘crashkernel=32M’
    • verified that indeed the boot kernel has reserved 32MB RAM for the dump-capture kernel/system:
RUN: Running qemu-system-arm now ...

qemu-system-arm -m 512 -M vexpress-a9 -kernel <...>/images/zImage \
-drive file=<...>/images/rfs.img,if=sd,format=raw \
-append "console=ttyAMA0 root=/dev/mmcblk0 init=/sbin/init crashkernel=32M" \
-nographic -no-reboot -dtb <...>/linux-4.9.1/arch/arm/boot/dts/vexpress-v2p-ca9.dtb

Booting Linux on physical CPU 0x0
Linux version 4.9.1-crk (hk@hk) (gcc version 4.8.3 20140320 (prerelease) (Sourcery CodeBench Lite 2014.05-29) ) #2 SMP Wed Jul 12 19:41:08 IST 2017
CPU: ARMv7 Processor [410fc090] revision 0 (ARMv7), cr=10c5387d
CPU: PIPT / VIPT nonaliasing data cache, VIPT nonaliasing instruction cache
OF: fdt:Machine model: V2P-CA9
...
ARM / $ dmesg |grep -i crash
Reserving 32MB of memory at 1920MB for crashkernel (System RAM: 512MB)
Kernel command line: console=ttyAMA0 root=/dev/mmcblk0 init=/sbin/init crashkernel=32M
ARM / $ id
uid=0 gid=0
ARM / $ ./kx.sh
./kx.sh: kexec: success, dump kernel loaded.
ARM / $ 

Ok, the dump-capture kernel has loaded up.
Now to test it!

III Test the soft boot into the dump-capture kernel

On the console of the (emulated) ARM-32:

ARM / $ echo c > /proc/sysrq-trigger 
sysrq: SysRq : Trigger a crash
Unhandled fault: page domain fault (0x81b) at 0x00000000
pgd = 9ee44000
[00000000] *pgd=7ee30831, *pte=00000000, *ppte=00000000
Internal error: : 81b [#1] SMP ARM
Modules linked in:
CPU: 0 PID: 724 Comm: sh Not tainted 4.9.1-crk #2
Hardware name: ARM-Versatile Express
task: 9f589600 task.stack: 9ee40000
PC is at sysrq_handle_crash+0x24/0x2c
LR is at arm_heavy_mb+0x1c/0x38
pc : [<804060d8>] lr : [<80114bd8>] psr: 60000013
sp : 9ee41eb8 ip : 00000000 fp : 00000000

...

[<804060d8>] (sysrq_handle_crash) from [<804065bc>] (__handle_sysrq+0xa8/0x170)
[<804065bc>] (__handle_sysrq) from [<80406ab8>] (write_sysrq_trigger+0x54/0x64)
[<80406ab8>] (write_sysrq_trigger) from [<80278588>] (proc_reg_write+0x58/0x90)
[<80278588>] (proc_reg_write) from [<802235c4>] (__vfs_write+0x28/0x10c)
[<802235c4>] (__vfs_write) from [<80224098>] (vfs_write+0xb4/0x15c)
[<80224098>] (vfs_write) from [<80224d30>] (SyS_write+0x40/0x80)
[<80224d30>] (SyS_write) from [<801074a0>] (ret_fast_syscall+0x0/0x3c)

Code: f57ff04e ebf43aba e3a03000 e3a02001 (e5c32000) 

Loading crashdump kernel...
Bye!
Booting Linux on physical CPU 0x0

Linux version 4.9.1-crk (hk@hk) (gcc version 4.8.3 20140320 (prerelease) (Sourcery CodeBench Lite 2014.05-29) ) #2 SMP Wed Jul 12 19:41:08 IST 2017
CPU: ARMv7 Processor [410fc090] revision 0 (ARMv7), cr=10c5387d
CPU: PIPT / VIPT nonaliasing data cache, VIPT nonaliasing instruction cache
OF: fdt:Machine model: V2P-CA9
OF: fdt:Ignoring memory range 0x60000000 - 0x78000000
Memory policy: Data cache writeback
CPU: All CPU(s) started in SVC mode.
percpu: Embedded 14 pages/cpu @81e76000 s27648 r8192 d21504 u57344
Built 1 zonelists in Zone order, mobility grouping on. Total pages: 7874
Kernel command line: console=ttyAMA0 root=/dev/mmcblk0 rootfstype=ext4 rootwait 
init=/sbin/init maxcpus=1 reset_devices elfcorehdr=0x79f00000 mem=31744K

...
ARM / $ ls -l /proc/vmcore            << the dump image (480 MB here) >>
-r-------- 1 0 0 503324672 Jul 13 12:22 /proc/vmcore
ARM / $ 

Copy the dump file (with cp or scp, whatever), 
get it to the host system.

cp /proc/vmcore <dump-file>
ARM / $ halt
ARM / $ EXT4-fs (mmcblk0): re-mounted. Opts: (null)
The system is going down NOW!
Sent SIGTERM to all processes
Sent SIGKILL to all processes
Requesting system halt
reboot: System halted
QEMU: Terminated
^A-X  << type Ctrl-a followed by x to exit qemu >>
... and done.

build_SEALS.sh: all done, exiting.
Thank you for using SEALS! We hope you like it.
There is much scope for improvement of course; would love to hear your feedback, ideas, and contribution!
Please visit : https://github.com/kaiwan/seals . 


IV Analyse the kdump image with the crash utility

CORE ANALYSIS SUITE

The core analysis suite is a self-contained tool that can be used to
investigate either live systems, kernel core dumps created from dump
creation facilities such as kdump, kvmdump, xendump, the netdump and
diskdump packages offered by Red Hat, the LKCD kernel patch, the mcore
kernel patch created by Mission Critical Linux, as well as other formats
created by manufacturer-specific firmware.

...

A whitepaper with complete documentation concerning the use of this utility
can be found here:
http://people.redhat.com/anderson/crash_whitepaper [3]
...

The crash binary can only be used on systems of the same architecture as
the host build system. There are a few optional manners of building the
crash binary:

o On an x86_64 host, a 32-bit x86 binary that can be used to analyze
32-bit x86 dumpfiles may be built by typing "make target=X86".
o On an x86 or x86_64 host, a 32-bit x86 binary that can be used to analyze
 32-bit arm dumpfiles may be built by typing "make target=ARM".
...

Ah. To paraphrase, Therein lies the devil, in the details.

[UPDATE : 14 July ’17
I do have it building successfully now. The trick apparently – on x86_64 Ubuntu 17.04 – was to install the 
lib32z1-dev package! Once I did, it built just fine. Many thanks to Dave Anderson (RedHat) who promptly replied to my query on the crash mailing list.]

Initially, though, when I cloned the ‘crash’ git repo and ran ‘make target=ARM’, the build failed with:

...
 ../readline/libreadline.a ../opcodes/libopcodes.a ../bfd/libbfd.a
../libiberty/libiberty.a ../libdecnumber/libdecnumber.a -ldl
-lncurses -lm ../libiberty/libiberty.a build-gnulib/import/libgnu.a
 -lz -ldl -rdynamic
/usr/bin/ld: cannot find -lz
collect2: error: ld returned 1 exit status
Makefile:1174: recipe for target 'gdb' failed
...

The linker could not find a 32-bit zlib; as noted in the update above, installing the lib32z1-dev package fixed this.

Btw, if you’re unsure, please see crash’s GitHub README on how to build it.
So, now, with a ‘crash’ binary that works, let’s get to work:

$ file crash
crash: ELF 32-bit LSB shared object, Intel 80386, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux.so.2, for GNU/Linux 2.6.32, …

$ ./crash

crash 7.1.9++
Copyright (C) 2002-2017 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
[…]

crash: compiled for the ARM architecture
$

To examine a kernel dump (kdump) file, invoke crash like so:

crash <path-to-vmlinux-with-debug-symbols> <path-to-kernel-dumpfile>

$ <...>/crash/crash \
  <...>/SEALS_staging/linux-4.9.1/vmlinux ./kdump.img

crash 7.1.9++
Copyright (C) 2002-2017 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
[...]
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
[...]
WARNING: cannot find NT_PRSTATUS note for cpu: 1
WARNING: cannot find NT_PRSTATUS note for cpu: 2
WARNING: cannot find NT_PRSTATUS note for cpu: 3

 KERNEL: <...>/SEALS_staging/linux-4.9.1/vmlinux
 DUMPFILE: ./kdump.img
 CPUS: 4 [OFFLINE: 3]
 DATE: Thu Jul 13 00:38:39 2017
 UPTIME: 00:00:42
LOAD AVERAGE: 0.00, 0.00, 0.00
 TASKS: 56
 NODENAME: (none)
 RELEASE: 4.9.1-crk
 VERSION: #2 SMP Wed Jul 12 19:41:08 IST 2017
 MACHINE: armv7l (unknown Mhz)
 MEMORY: 512 MB
 PANIC: "sysrq: SysRq : Trigger a crash"
 PID: 735
 COMMAND: "echo"
 TASK: 9f6af900 [THREAD_INFO: 9ee48000]
 CPU: 0
 STATE: TASK_RUNNING (SYSRQ)

crash> ps
 PID PPID CPU TASK ST %MEM VSZ RSS COMM
 0 0 0 80a05c00 RU 0.0 0 0 [swapper/0]
> 0 0 1 9f4ab700 RU 0.0 0 0 [swapper/1]
> 0 0 2 9f4abc80 RU 0.0 0 0 [swapper/2]
> 0 0 3 9f4ac200 RU 0.0 0 0 [swapper/3]
 1 0 0 9f4a8000 IN 0.1 3344 1500 init
[...]
722 2 0 9f6ac200 IN 0.0 0 0 [ext4-rsv-conver]
728 1 0 9f6ab180 IN 0.1 3348 1672 sh
> 735 728 0 9f6af900 RU 0.1 3344 1080 echo
crash> bt
PID: 735 TASK: 9f6af900 CPU: 0 COMMAND: "echo"
 #0 [<804060d8>] (sysrq_handle_crash) from [<804065bc>]
 #1 [<804065bc>] (__handle_sysrq) from [<80406ab8>]
 #2 [<80406ab8>] (write_sysrq_trigger) from [<80278588>]
 #3 [<80278588>] (proc_reg_write) from [<802235c4>]
 #4 [<802235c4>] (__vfs_write) from [<80224098>]
 #5 [<80224098>] (vfs_write) from [<80224d30>]
 #6 [<80224d30>] (sys_write) from [<801074a0>]
 pc : [<76e8d7ec>] lr : [<0000f9dc>] psr: 60000010
 sp : 7ebdcc7c ip : 00000000 fp : 00000000
 r10: 0010286c r9 : 7ebdce68 r8 : 00000020
 r7 : 00000004 r6 : 00103008 r5 : 00000001 r4 : 00102e2c
 r3 : 00000000 r2 : 00000002 r1 : 00103008 r0 : 00000001
 Flags: nZCv IRQs on FIQs on Mode USER_32 ISA ARM
crash>

And so on …

Another thing we can do is use gdb – to a limited extent – to analyse the dump file:

From [1]:

Before analyzing the dump image, you should reboot into a stable kernel.

You can do limited analysis using GDB on the dump file copied out of
/proc/vmcore. Use the debug vmlinux built with -g and run the following
command:
  gdb vmlinux <dump-file>

Stack trace for the task on processor 0, register display, and memory
display work fine.

Also, [3] is an excellent whitepaper on using crash. Do read it.

All right, hope that helps!

Low-Level Software Design

[Please note, this article isn’t about formal design methods (LLD), UML, Design Patterns, nor about object-oriented design, etc. It’s written with a view towards the kind of software project I typically get to work on – embedded / Linux OS related, with the primary programming language being ‘C’ and/or scripting (typically with bash).]

When one looks back, all said and done, it isn’t that hard to get a decent software design and architecture. Obviously, the larger your project, the more the thought and analysis that goes into building a robust system. (Certainly, the more the years of experience, the easier it seems).

However, I am of the view that certain fundamentals never change: get them right and many of the pieces auto-slot into place. Work on a project enough and one always comes away with a  “feel” for the architecture and codebase – it’s robust, will work, or it’s just not.

So what are these “fundamentals”? Well, here’s the interesting thing: you already know them! But in the heat and dust of release pressures (“I don’t care that you need another half-day, check it in now!!!”), deadlines, production, we tend to forget the basics. Sounds familiar? 🙂

The points below are definitely nothing new, but always worth reiterating:

Low-level Design and Software Architecture

  • Jot down the requirements: why are we doing this? what do we hope to achieve?
  • Draw an overall diagram of the project, the data structures, the code flow, as you visualise it. You don’t really need fancy software tools- pencil and paper will do, especially at first.
    Arrive, gently, at the software architecture.
  • Layering helps (but one can overdo it)
    • To paraphrase- “adding a layer can be used to solve any problem in computer science” 🙂 Of course, one can quite easily add new problems too; careful!
  • It evolves – don’t be afraid to iterate, to use trial and error
    • “Be ready to throw the first one away – you’re going to anyway” – paraphrased from that classic book “The Mythical Man-Month”
    • “There is no silver bullet” – again from the same book of wisdom. There is no one solution to all your problems – you’ll have to weigh options, make trade-offs. It’s like life y’know 😉
  • Design the code to be modular, structured
  • A function encapsulates an intention
    • Requirement-driven code: why is the function there?
  • Each function does exactly one thing
    • This is really important. If you can do this well, you will greatly reduce bugs, and thus, the need to debug.
  • Use configuration files (Edit: preferably in plain ASCII text format).

Coding

  • Insert function stubs – code it in detail later, get the overall low-level design and function interfacing correct first. What parameters, return value(s)? 
  • Avoid globals
    • use parameters, return values
    • in multithreaded / multiprocess environments, using any kind of global implies using a synchronization primitive of some sort (mutex, semaphore, spinlock, etc) to take care of concurrency concerns, races. Be aware – beware! – this is often a huge source of performance bottlenecks!
      Edit: When writing MT (multithreaded) software, use powerful techniques such as TLS (Thread-Local Storage) and TSD (Thread-Specific Data) to further avoid globals.
  • Keep it minimal, and clean: Careful! don’t end up using too many (nested functions) layers – leads to “lasagna / spaghetti code” that’s hard to follow and thus understand
  • If a function’s code exceeds a ‘page’, re-look, redesign.

Of course, a project is not a dead static thing – at least it shouldn’t be. It evolves over time. Expect requirements, and thus your low-level design and code, to change. The better thought out the overall architecture though, the more resilient it will be to constant flux.

For example: you’re writing a device driver and a “read” method is attempting to read data from the ‘device’ (whatever the heck it is), but there is no data available right now, what should we do? Abort, returning an error code? Wait for data? Retry the operation thrice and see?

The “correct” answer: follow the standard. Assuming we’re working on a POSIX-compliant OS (Unix/Linux), the standard says that blocking calls must do precisely that: block, wait for data until it becomes available. So just wait for data. “But I don’t want to wait forever!” cries the application! Okay, implement a non-blocking open in that case (there’s a reason for that O_NONBLOCK flag folks!). Or a timeout feature, if it makes sense.

Shouldn’t the driver method “retry” the operation if it does not succeed at first? Short answer: no. Follow the Unix design philosophy: “provide mechanism, not policy”. Let the application define the policy (should we retry and, if yes, how often; should we time out and, if yes, after how long; etc.). The mechanism part of it – your driver’s read method implementation – must work independent of, in fact ignore, such concerns. (But hey, it must be written to be concurrency-safe and reentrant. A post on that another day perhaps?)
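
To make the mechanism / policy split concrete, here is a minimal user-space sketch in Python. It is only an illustration under stated assumptions: a pipe stands in for the ‘device’, and read_with_policy() is a hypothetical application-side helper (not part of any driver or library API) that carries the retry and timeout policy, while the non-blocking read itself simply reports “no data right now”.

```python
import os
import select

def read_with_policy(fd, nbytes, timeout_s, retries):
    """Application-side policy (hypothetical helper): wait up to timeout_s
    per attempt, and give up after 'retries' attempts."""
    for attempt in range(retries):
        # select() returns once fd is readable or the timeout expires
        ready, _, _ = select.select([fd], [], [], timeout_s)
        if ready:
            try:
                return os.read(fd, nbytes)
            except BlockingIOError:
                continue  # no data after all; counts as a failed attempt
    return None  # policy decision: give up

# Assumption for illustration: a pipe plays the role of the device.
r, w = os.pipe()
os.set_blocking(r, False)  # same effect as opening with O_NONBLOCK

# Mechanism only: with no data available, a non-blocking read fails at once.
try:
    os.read(r, 16)
    got_eagain = False
except BlockingIOError:
    got_eagain = True

empty_result = read_with_policy(r, 16, timeout_s=0.05, retries=2)
os.write(w, b"data")
data_result = read_with_policy(r, 16, timeout_s=0.05, retries=2)
print(got_eagain, empty_result, data_result)
```

Note that the retry count and the timeout live entirely in the caller; the read path knows nothing about them.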

Configuration

Using the same example: let’s say we do want the “read” method of our device driver to time out after, say, 1 second. Where do we specify this value? Recollect, “provide mechanism, not policy”. So we’ll do so in the application, not the driver. But where in the app? Ah. I’d suggest we don’t hardcode the value; instead, keep it in a simple ASCII-text configuration file:

app_config
   read_timeout=1

Of course, with ‘C’ one would usually put stuff like this into a header file. Fair enough, but please keep a separate header – say, app_config.h .

Try and do some crystal-ball-gazing: at some remote (or not-so-remote) point in the future, what if the project requires a GUI front-end? Probably, as an example, we will want to let the end-user view and set configuration – change the timeout, etc – easily via the GUI. Then, you will see the sense of using a simple ASCII-text configuration file to hold config values – reading and updating the values now becomes simple and clean.
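
As a rough sketch of how little code such a file needs, the Python snippet below reads and rewrites key=value settings. The function names (parse_config, render_config), the ‘#’ comment convention and the default value are my own assumptions for illustration; only the read_timeout=1 line comes from the example above.

```python
def parse_config(text):
    """Parse 'key=value' lines; blank lines and '#' comments are skipped."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key=value line (e.g. a bare heading)
        settings[key.strip()] = value.strip()
    return settings

def render_config(settings):
    """Serialise the settings back to plain 'key=value' text."""
    return "".join(f"{key}={value}\n" for key, value in settings.items())

sample = "# app_config\nread_timeout=1\n"
cfg = parse_config(sample)
read_timeout = float(cfg.get("read_timeout", "1"))  # seconds; the default is an assumption

cfg["read_timeout"] = "2.5"  # e.g. a GUI front-end changes the value
updated_text = render_config(cfg)
print(read_timeout, updated_text, end="")
```

A GUI (or a shell script, or a human with an editor) can all work on the same file, which is the point of keeping it plain text.
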
Finally, nothing said above is sacred – we learn to take things on a case-by-case basis, use judgement.


A few Resources

Linux Kernel Version Timeline

I wanted to quickly look up Linux kernel release dates by version number.

All the info is on kernelnewbies.org . I’ve just copied it below…

Click on the version # links (below) to see details of that version (redirects to the kernelnewbies website).

Source: http://kernelnewbies.org/LinuxVersions
Last Updated: 01 Apr 2016

4.x

Linux 4.5 Released 13 March, 2016 (63 days)

Linux 4.4 Released 10 January, 2016 (70 days)

Linux 4.3 Released 1 November, 2015 (63 days)

Linux 4.2 Released 30 August, 2015 (70 days)

Linux 4.1 Released 21 June, 2015 (70 days)

Linux 4.0 Released 12 April, 2015 (63 days)

3.x

Linux 3.19 Released 8 February, 2015 (63 days)

Linux 3.18 Released 7 December, 2014 (63 days)

Linux 3.17 Released 5 October, 2014 (63 days)

Linux 3.16 Released 3 August, 2014 (56 days)

Linux 3.15 Released 8 June, 2014 (70 days)

Linux 3.14 Released 30 March, 2014 (70 days)

Linux 3.13 Released 19 January, 2014 (78 days)

Linux 3.12 Released 2 November, 2013 (61 days)

Linux 3.11 Released 2 September, 2013 (64 days)

Linux 3.10 Released 30 June, 2013 (63 days)

Linux 3.9 Released 28 April, 2013 (69 days)

Linux 3.8 Released 18 February, 2013 (70 days)

Linux 3.7 Released 10 December, 2012 (71 days)

Linux 3.6 Released 30 September, 2012 (71 days)

Linux 3.5 Released 21 July, 2012 (62 days)

Linux 3.4 Released 20 May, 2012 (63 days)

Linux 3.3 Released 18 March, 2012 (74 days)

Linux 3.2 Released 4 January, 2012 (72 days)

Linux 3.1 Released 24 October, 2011 (95 days)

Linux 3.0 Released 21 July, 2011 (64 days)

2.6.x

Linux 2.6.39 Released 18 May, 2011 (65 days)

Linux 2.6.38 Released 14 March, 2011 (69 days)

Linux 2.6.37 Released 4 January, 2011 (76 days)

Linux 2.6.36 Released 20 October, 2010 (80 days)

Linux 2.6.35 Released 1 August, 2010 (76 days)

Linux 2.6.34 Released 16 May, 2010 (81 days)

Linux 2.6.33 Released 24 February, 2010 (83 days)

Linux 2.6.32 Released 3 December, 2009 (84 days)

Linux 2.6.31 Released 9 September, 2009 (92 days)

Linux 2.6.30 Released 9 June, 2009 (77 days)

Linux 2.6.29 Released 24 March, 2009 (89 days)

Linux 2.6.28 Released 25 December, 2008 (77 days)

Linux 2.6.27 Released 9 October, 2008 (88 days)

Linux 2.6.26 Released 13 July, 2008 (87 days)

Linux 2.6.25 Released 17 April, 2008 (84 days)

Linux 2.6.24 Released 24 January, 2008 (107 days)

Linux 2.6.23 Released 9 October, 2007 (93 days)

Linux 2.6.22 Released 8 July, 2007 (73 days)

Linux 2.6.21 Released 26 April, 2007 (80 days)

Linux 2.6.20 Released 5 February, 2007 (68 days)

Linux 2.6.19 Released 29 November, 2006 (70 days)

Linux 2.6.18 Released 20 September, 2006 (95 days)

Linux 2.6.17 Released 17 June, 2006 (88 days)

Linux 2.6.16 Released 20 March, 2006 (76 days)

Linux 2.6.15 Released 3 January, 2006 (68 days)

Linux 2.6.14 Released 27 October, 2005 (59 days)

Linux 2.6.13 Released 29 August, 2005 (73 days)

Linux 2.6.12 Released 17 June, 2005 (107 days)

Linux 2.6.11 Released 2 March, 2005 (68 days)

Linux 2.6.10 Released 24 December, 2004 (66 days)

Linux 2.6.9 Released 19 October, 2004 (66 days)

Linux 2.6.8 Released 14 August, 2004 (59 days)

Linux 2.6.7 Released 16 June, 2004 (37 days)

Linux 2.6.6 Released 10 May, 2004 (36 days)

Linux 2.6.5 Released 4 April, 2004 (24 days)

Linux 2.6.4 Released 11 March, 2004 (22 days)

Linux 2.6.3 Released 18 February, 2004 (14 days)

Linux 2.6.2 Released 4 February, 2004 (26 days)

Linux 2.6.1 Released 9 January, 2004 (22 days)

Linux 2.6.0 Released 18 December, 2003
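
The figure in brackets after each release is the number of days since the previous release. A small Python sketch of that calculation, using three of the dates listed above (the function name release_intervals is just an illustrative choice):

```python
from datetime import date

# Release dates copied from the 4.x list above
releases = [
    ("4.3", date(2015, 11, 1)),
    ("4.4", date(2016, 1, 10)),
    ("4.5", date(2016, 3, 13)),
]

def release_intervals(rels):
    """Return {version: days since the previous release in the list}."""
    rels = sorted(rels, key=lambda r: r[1])
    return {ver: (day - prev_day).days
            for (_, prev_day), (ver, day) in zip(rels, rels[1:])}

intervals = release_intervals(releases)
print(intervals)
```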

LApTaC – Learn Appreciate Teach and Contribute

For a score and 5 years – my adult working life so far – I’ve had the good fortune to pursue things and areas that interest me deeply.

Systems and assembly programming on the DOS platform in the early ’90s, Kyokushin Karate on the side, UNIX, and that too in DEC (Digital Equipment Corp, now a part of HP) in the mid-’90s, consulting and corporate training on the Linux OS after that (still going strong), with long distance running on the side. Now, for the last couple of years, Linux training/consulting continues, along with product development (my own product, yeah!), managing a product team working on an analytics database project, and of course, distance running.

Yawn, yeah, I know, they interest me not you. Nevertheless, a good ride, and still riding!
Which brings me to the title:

LApTaC :
Learn
Appreciate
Teach, and
Contribute.

An oft heard complaint, the lament of the modern person- “I don’t feel fulfilled in my life”.
Perhaps one ought to step back, reassess goals and priorities (striving too hard for how our society defines “success”?), and LApTaC in your life!

Learn:
It’s true you know, it never stops. Bored? Learn something new; we now know that age is certainly no barrier. That stuff we used to hear, “your brain cells will die as you age and never regenerate”, turns out to be BS.

Appreciate:
Enough said; appreciate what you have and the new things you learn every day. I like Mark Manson’s practical and straightforward writing; read his essay “Stop Trying to be Happy”, among many others.

Also take a gander at this book: “18 Minutes: Find Your Focus, Master Distraction and Get the Right Things Done” by Peter Bregman.

Teach:
A wise man once said, “If you want to see if you truly understand something, teach it”. Over a decade of training experience shows me this truth; also, the simpler you can explain it, the better you understand it. Jargon rarely, if ever, works.
Volunteer to teach something you know: the “fulfilment” payback you get from voluntary work cannot be overestimated.

and, Contribute:

A passage from the book – “A Return to Love : Reflections on the Principles of a Course in Miracles” by Marianne Williamson – has become popular as an inspirational quote:

Our deepest fear is not that we are inadequate. Our deepest fear is that we are powerful beyond measure. It is our light, not our darkness, that most frightens us. We ask ourselves, who am I to be brilliant, gorgeous, talented, fabulous? Actually, who are you not to be? You are a child of God. Your playing small doesn’t serve the world. There’s nothing enlightened about shrinking so that other people won’t feel insecure around you. We are all meant to shine, as children do. We were born to make manifest the glory of God that is within us. It’s not just in some of us; it’s in everyone. And as we let our own light shine, we unconsciously give other people permission to do the same. As we’re liberated from our own fear, our presence automatically liberates others.
So go full steam ahead and LApTaC your beautiful life!

 

Interesting Numbers

This article delves into looking up Interesting Numbers within (as of now) the following sections:

  • Networking
    • Numbers (with sheet screenshot)
    • Mitigation / Solutions
    • Resources
  • SLOCs – Source Lines Of Code
    • Cars
    • OS’s
  • Powers of 2  [Edit: 09 Jun 2015]

Enjoy!

Networking

  • In general, one requires 1 MHz CPU power to drive 1 Mbps of data (or put another way, 1 CPU cycle per bit of data)
  • Given the (heavy legacy baggage) fact that the standard Ethernet MTU (Maximum Transmission Unit) size is typically 1500 bytes:
    • A 10 Gbps network link running at wire speed has to transfer over 800,000 packets per second
    • The table below enumerates the story for differing Ethernet packet sizes and wire rate to be maintained

Screenshot of a sheet showing the relationship between Ethernet frame size and the network line rate to maintain

Below Source: “Diving into Linux Networking Stack I”, MJ Schultz

Thus, at a rate of 10 Gbps, for MTU-size packets, we need to sustain a processing rate of approximately one packet per microsecond! (And that’s for one direction only; with full-duplex traffic, the time available per packet is effectively halved!)
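
The arithmetic behind numbers like these is simple enough to sketch in a few lines of Python. The assumption here is that each 1500-byte MTU frame costs 1538 bytes on the wire (preamble, Ethernet header, FCS and inter-frame gap included); per_packet_budget() is just an illustrative helper name.

```python
# Per-frame overhead in bytes: preamble + SFD (8) + Ethernet header (14)
# + FCS (4) + inter-frame gap (12)
WIRE_OVERHEAD = 38

def per_packet_budget(link_bps, mtu_bytes=1500):
    """Return (nanoseconds per frame, frames per second) at full line rate."""
    bits_on_wire = (mtu_bytes + WIRE_OVERHEAD) * 8
    pps = link_bps / bits_on_wire
    return 1e9 / pps, pps

for gbps in (10, 40, 100):
    ns, pps = per_packet_budget(gbps * 1e9)
    print(f"{gbps:>3} Gbps: {ns:7.1f} ns per packet, {pps / 1e6:5.2f} M pps")
```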

How is this possible?

Well, it’s not- not over sustained periods. For one, the interrupt load would be far too high for the processor to effectively handle (leading to what’s called “receive livelock”). For another, the IP (and above) protocol stack processing would also be hard put to sustain these rates.

The solution is two-fold:

  • hardware interrupt mitigation is achieved via the NAPI technique (which many modern drivers use as the default processing mode, switching to interrupt mode only when there are no or few packets left to process)
  • Modern hardware NICs and operating systems use high performance offloading techniques (TSO / LRO / GRO). These essentially offload work from the host processor to the hardware NIC, and effectively allow large packet sizes as well.

TSO effectively lets us offload 64 KB of data at a time to the hardware NIC for segmentation and processing. Had the host done the usual TCP processing itself, it would have had to work in MSS-sized segments of approximately (MTU - 40) = 1460 bytes.
Thus, with TSO, we get a roughly 40X+ saving (65536/1460 ≈ 44.89) in the number of segments the host CPU has to process!
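
A quick Python check of that ratio, assuming a 1500-byte MTU and 20-byte IPv4 plus 20-byte TCP headers with no options (the constant names are mine):

```python
import math

TSO_BUFFER = 64 * 1024   # bytes handed to the NIC in one go
MTU = 1500
MSS = MTU - 40           # minus IPv4 (20) and TCP (20) headers, no options

segments = math.ceil(TSO_BUFFER / MSS)   # wire segments cut from one buffer
ratio = TSO_BUFFER / MSS
print(f"MSS={MSS} bytes, {segments} segments per 64 KB buffer, ratio ~{ratio:.2f}x")
```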

Also:

NIC Adapter   Time available between MTU-size (1538 bytes) packets   Packets per second (pps)
10 Gbps       ~ 1,230 ns (1.23 us)                                     ~ 813,000 (~ 0.8M pps)
40 Gbps       ~ 308 ns                                                 ~ 3,250,000 (~ 3.3M pps)
100 Gbps      ~ 123 ns                                                 ~ 8,130,000 (~ 8.1M pps) !

[Update: 25 May 2016]:

See this presentation made by Jesper Dangaard Brouer, Principal Kernel Engineer, Red Hat, at DevConf, Feb 2016: Kernel network stack challenges at increasing speeds [ODP]

 

[Update: 09 Aug 2015]:

[Inputs below from this LWN article “Improving Linux Networking Performance”, Jan 2015]

Latency-sensitive workloads:
The budget gets far tighter with small packets: a minimum-size Ethernet frame occupies 84 bytes on the wire, so at 10 Gbps we’ve got only approx 67.2 ns to process each packet. Assuming we have a 3 GHz processor, it gives us:

  • ~ 200 cycles to process each packet
  • a cache miss will take about 32ns to resolve
  • an SKB on 64-bit requires around 4 cache lines, and they’re written to during packet handling
  • thus, more than 2 cache misses will wipe out the available time budget!
  • what makes it worse: critical code sections require locking
    • the Intel LOCK prefix instruction (used to implement atomic operations at the machine level via cmpxchg or similar) takes ~ 8.25 ns
    • thus a spin_lock/spin_unlock section will take at least 16 ns
  • System call
    • with SELinux and auditing support- ~ 75 ns
    • without SELinux and auditing support- ~ just under 42 ns
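
The budget arithmetic in the bullets above can be sketched in Python as below. As its starting assumption it takes a per-packet budget of 67.2 ns (a minimum-size frame, 84 bytes on the wire, at 10 Gbps), since that is the figure that works out to roughly 200 cycles at 3 GHz; the cache-miss and LOCK costs are the ones quoted in the list, and the variable names are my own.

```python
CPU_HZ = 3e9           # 3 GHz processor, as assumed in the text
BUDGET_NS = 67.2       # assumption: 84 bytes on the wire at 10 Gbps
CACHE_MISS_NS = 32.0   # cost of one cache miss
LOCK_NS = 8.25         # cost of one LOCK-prefixed operation

cycles = BUDGET_NS * 1e-9 * CPU_HZ                  # cycles available per packet
misses_that_fit = int(BUDGET_NS // CACHE_MISS_NS)   # whole cache misses within budget
lock_unlock_ns = 2 * LOCK_NS                        # one spin_lock + spin_unlock pair
left_after_locking = BUDGET_NS - lock_unlock_ns

print(f"{cycles:.0f} cycles per packet")
print(f"{misses_that_fit} cache misses fit; one more overruns the budget")
print(f"lock/unlock pair: {lock_unlock_ns} ns, leaving {left_after_locking:.1f} ns")
```

So a single lock/unlock pair alone eats about a quarter of the budget, before any real work is done.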

“The (Linux) kernel, today, can only forward something between 1M (M=million) and 2M packets per core every second, while some of the bypass alternatives approach a rate of 15M packets per core per second.” Source: “Improving Linux networking performance”, Jon Corbet, Jan 2015.

 

Resources

Presentation slides by Jesper Dangaard Brouer, Principal Kernel Engineer, Red Hat, at DevConf, Feb 2016: Kernel network stack challenges at increasing speeds [ODP]

Large Segmentation Offload (LSO) on Wikipedia

“Improving Linux networking performance”, LWN, Jon Corbet, Jan 2015

JLS2009: Generic receive offload

Linux and TCP Offload Engines [LWN]

Whitepaper: “Introduction to TCP Offload Engines” (Dell)

Whitepaper: “Boosting Data Transfer with TCP Offload Engine Technology” (Dell, Broadcom, MS; benchmarks displayed here)

“The Ethernet standard assumes it will take roughly 50 microseconds for a signal to reach its destination.” – Source: Basic-Networking-Tutorial


SLOCs – Source Lines Of Code

First, please view this brilliant infographic from the “informationisbeautiful” book (and website).
And, here’s the same numbers in a Google sheet!

Cars

Below snippet directly quoted from “This Car Runs on Code”

“The avionics system in the F-22 Raptor, the current U.S. Air Force frontline jet fighter, consists of about 1.7 million lines of software code. The F-35 Joint Strike Fighter, scheduled to become operational in 2010, will require about 5.7 million lines of code to operate its onboard systems. And Boeing’s new 787 Dreamliner, scheduled to be delivered to customers in 2010, requires about 6.5 million lines of software code to operate its avionics and onboard support systems.

These are impressive amounts of software, yet if you bought a premium-class automobile recently, “it probably contains close to 100 million lines of software code,” says Manfred Broy, a professor of informatics at Technical University, Munich, and a leading expert on software in cars. All that software executes on 70 to 100 microprocessor-based electronic control units (ECUs) networked throughout the body of your car.

…”

Edit: 04 Jan 2017
“Car Software: 100M Lines of Code and Counting”
– Article on LinkedIn.

 

Operating Systems

Source: Wikipedia article on SLOCs

… According to Vincent Maraia,[1] the SLOC values for various operating systems in Microsoft’s Windows NT product line are as follows:

Year   Operating System       SLOC (Million)
1993   Windows NT 3.1         4–5[1]
1994   Windows NT 3.5         7–8[1]
1996   Windows NT 4.0         11–12[1]
2000   Windows 2000           more than 29[1]
2001   Windows XP             45[2][3]
2003   Windows Server 2003    50[1]

David A. Wheeler studied the Red Hat distribution of the Linux operating system, and reported that Red Hat Linux version 7.1[4] (released April 2001) contained over 30 million physical SLOC. He also extrapolated that, had it been developed by conventional proprietary means, it would have required about 8,000 man-years of development effort and would have cost over $1 billion (in year 2000 U.S. dollars).

A similar study was later made of Debian GNU/Linux version 2.2 (also known as “Potato”); this operating system was originally released in August 2000. This study found that Debian GNU/Linux 2.2 included over 55 million SLOC, and if developed in a conventional proprietary way would have required 14,005 man-years and cost $1.9 billion USD to develop. Later runs of the tools used report that the following release of Debian had 104 million SLOC, and as of year 2005, the newest release is going to include over 213 million SLOC.

One can find figures of major operating systems (the various Windows versions have been presented in a table above).

Year   Operating System       SLOC (Million)
2000   Debian 2.2             55–59[5][6]
2002   Debian 3.0             104[6]
2005   Debian 3.1             215[6]
2007   Debian 4.0             283[6]
2009   Debian 5.0             324[6]
2012   Debian 7.0             419[7]
2009   OpenSolaris            9.7
-      FreeBSD                8.8
2005   Mac OS X 10.4          86[8][n 1]
2001   Linux kernel 2.4.2     2.4[4]
2003   Linux kernel 2.6.0     5.2
2009   Linux kernel 2.6.29    11.0
2009   Linux kernel 2.6.32    12.6[9]
2010   Linux kernel 2.6.35    13.5[10]
2012   Linux kernel 3.6       15.9[11]


Powers of 2

Often, especially for nerdy programmers, it’s a good idea to be familiar with powers of 2. I won’t bore you with the “usual” ones (do it yourself IOW 🙂 ).

^2 Quick Summary:

MULTIPLES OF BYTES

DECIMAL
VALUE     METRIC
1000      kB   kilobyte
1000^2    MB   megabyte
1000^3    GB   gigabyte
1000^4    TB   terabyte
1000^5    PB   petabyte
1000^6    EB   exabyte
1000^7    ZB   zettabyte
1000^8    YB   yottabyte

BINARY
VALUE     IEC               JEDEC
1024      KiB  kibibyte     KB  kilobyte
1024^2    MiB  mebibyte     MB  megabyte
1024^3    GiB  gibibyte     GB  gigabyte
1024^4    TiB  tebibyte
1024^5    PiB  pebibyte
1024^6    EiB  exbibyte
1024^7    ZiB  zebibyte
1024^8    YiB  yobibyte

For example, on an x86_64 running the Linux OS (kernel ver >= 2.6.x), the memory management layer divides the 64-bit process VAS (Virtual Address Space) into two regions:

  • a 128 TB region at the low end for Userland (this includes the text, data, library/memory mapping and stack segments)
  • a 128 TB region at the upper end for kernel VAS (the kernel segment)

How large is the entire VAS?
It’s 2^64 of course, which is 18,446,744,073,709,551,616 bytes !
Wow. What the heck’s that, you ask??
Ok easier: it’s 16 EiB (exbibytes, i.e. 16 x 1024^6 bytes; loosely, “16 exabytes”)    🙂
(see the Summary Table below too).
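
If you’d rather let the machine do the conversion, here’s a small Python sketch. It formats byte counts with IEC binary prefixes and shows how little of the 2^64 space the two 128 TB regions cover; human_binary() is a hypothetical helper written for this post, not a library function.

```python
IEC_UNITS = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB", "ZiB", "YiB"]

def human_binary(nbytes):
    """Format a byte count using IEC binary prefixes (1 KiB = 1024 bytes)."""
    value, unit = nbytes, 0
    while value >= 1024 and unit < len(IEC_UNITS) - 1:
        value /= 1024
        unit += 1
    return f"{value:g} {IEC_UNITS[unit]}"

full_vas = 2 ** 64              # the entire 64-bit virtual address space
user_region = 128 * 2 ** 40     # the 128 TB userland region, i.e. 2^47 bytes

print(human_binary(full_vas))
print(human_binary(user_region))
print(f"fraction of the VAS covered by both 128 TB regions: {2 * user_region / full_vas}")
```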

From the Wikipedia page on Powers of 2 :

The first 96 powers of two
(sequence A000079 in OEIS)

2^0 = 1   2^16 = 65,536   2^32 = 4,294,967,296   2^48 = 281,474,976,710,656   2^64 = 18,446,744,073,709,551,616   2^80 = 1,208,925,819,614,629,174,706,176
2^1 = 2   2^17 = 131,072   2^33 = 8,589,934,592   2^49 = 562,949,953,421,312   2^65 = 36,893,488,147,419,103,232   2^81 = 2,417,851,639,229,258,349,412,352
2^2 = 4   2^18 = 262,144   2^34 = 17,179,869,184   2^50 = 1,125,899,906,842,624   2^66 = 73,786,976,294,838,206,464   2^82 = 4,835,703,278,458,516,698,824,704
2^3 = 8   2^19 = 524,288   2^35 = 34,359,738,368   2^51 = 2,251,799,813,685,248   2^67 = 147,573,952,589,676,412,928   2^83 = 9,671,406,556,917,033,397,649,408
2^4 = 16   2^20 = 1,048,576   2^36 = 68,719,476,736   2^52 = 4,503,599,627,370,496   2^68 = 295,147,905,179,352,825,856   2^84 = 19,342,813,113,834,066,795,298,816
2^5 = 32   2^21 = 2,097,152   2^37 = 137,438,953,472   2^53 = 9,007,199,254,740,992   2^69 = 590,295,810,358,705,651,712   2^85 = 38,685,626,227,668,133,590,597,632
2^6 = 64   2^22 = 4,194,304   2^38 = 274,877,906,944   2^54 = 18,014,398,509,481,984   2^70 = 1,180,591,620,717,411,303,424   2^86 = 77,371,252,455,336,267,181,195,264
2^7 = 128   2^23 = 8,388,608   2^39 = 549,755,813,888   2^55 = 36,028,797,018,963,968   2^71 = 2,361,183,241,434,822,606,848   2^87 = 154,742,504,910,672,534,362,390,528
2^8 = 256   2^24 = 16,777,216   2^40 = 1,099,511,627,776   2^56 = 72,057,594,037,927,936   2^72 = 4,722,366,482,869,645,213,696   2^88 = 309,485,009,821,345,068,724,781,056
2^9 = 512   2^25 = 33,554,432   2^41 = 2,199,023,255,552   2^57 = 144,115,188,075,855,872   2^73 = 9,444,732,965,739,290,427,392   2^89 = 618,970,019,642,690,137,449,562,112
2^10 = 1,024   2^26 = 67,108,864   2^42 = 4,398,046,511,104   2^58 = 288,230,376,151,711,744   2^74 = 18,889,465,931,478,580,854,784   2^90 = 1,237,940,039,285,380,274,899,124,224
2^11 = 2,048   2^27 = 134,217,728   2^43 = 8,796,093,022,208   2^59 = 576,460,752,303,423,488   2^75 = 37,778,931,862,957,161,709,568   2^91 = 2,475,880,078,570,760,549,798,248,448
2^12 = 4,096   2^28 = 268,435,456   2^44 = 17,592,186,044,416   2^60 = 1,152,921,504,606,846,976   2^76 = 75,557,863,725,914,323,419,136   2^92 = 4,951,760,157,141,521,099,596,496,896
2^13 = 8,192   2^29 = 536,870,912   2^45 = 35,184,372,088,832   2^61 = 2,305,843,009,213,693,952   2^77 = 151,115,727,451,828,646,838,272   2^93 = 9,903,520,314,283,042,199,192,993,792
2^14 = 16,384   2^30 = 1,073,741,824   2^46 = 70,368,744,177,664   2^62 = 4,611,686,018,427,387,904   2^78 = 302,231,454,903,657,293,676,544   2^94 = 19,807,040,628,566,084,398,385,987,584
2^15 = 32,768   2^31 = 2,147,483,648   2^47 = 140,737,488,355,328   2^63 = 9,223,372,036,854,775,808   2^79 = 604,462,909,807,314,587,353,088   2^95 = 39,614,081,257,132,168,796,771,975,168

Some selected powers of two

28 = 256
The number of values represented by the 8 bits in a byte, more specifically termed as an octet. (The term byte is often defined as a collection of bits rather than the strict definition of an 8-bit quantity, as demonstrated by the term kilobyte.)
210 = 1,024
The binary approximation of the kilo-, or 1,000 multiplier, which causes a change of prefix. For example: 1,024 bytes = 1 kilobyte (or kibibyte).
This number has no special significance to computers, but is important to humans because we make use of powers of ten.
212 = 4,096
The hardware page size of Intel x86 processor.
216 = 65,536
The number of distinct values representable in a single word on a 16-bit processor, such as the original x86 processors.[4]
The maximum range of a short integer variable in the C#, and Java programming languages. The maximum range of a Word or Smallint variable in the Pascal programming language.
220 = 1,048,576
The binary approximation of the mega-, or 1,000,000 multiplier, which causes a change of prefix. For example: 1,048,576 bytes = 1 megabyte (or mibibyte).
This number has no special significance to computers, but is important to humans because we make use of powers of ten.
224 = 16,777,216
The number of unique colors that can be displayed in truecolor, which is used by common computer monitors.
This number is the result of using the three-channel RGB system, with 8 bits for each channel, or 24 bits in total.
230 = 1,073,741,824
The binary approximation of the giga-, or 1,000,000,000 multiplier, which causes a change of prefix. For example, 1,073,741,824 bytes = 1 gigabyte (or gibibyte).
This number has no special significance to computers, but is important to humans because we make use of powers of ten.
231 = 2,147,483,648
The number of non-negative values for a signed 32-bit integer. Since Unix time is measured in seconds since January 1, 1970, it will run out at 2,147,483,647 seconds or 03:14:07 UTC on Tuesday, 19 January 2038 on 32-bit computers running Unix, a problem known as the year 2038 problem.
232 = 4,294,967,296
The number of distinct values representable in a single word on a 32-bit processor. Or, the number of values representable in a doubleword on a 16-bit processor, such as the original x86 processors.[4]
The range of an int variable in the Java and C# programming languages.
The range of a Cardinal or Integer variable in the Pascal programming language.
The minimum range of a long integer variable in the C and C++ programming languages.
The total number of IP addresses under IPv4. Although this is a seemingly large number, IPv4 address exhaustion is imminent.
240 = 1,099,511,627,776
The binary approximation of the tera-, or 1,000,000,000,000 multiplier, which causes a change of prefix. For example, 1,099,511,627,776 bytes = 1 terabyte (or tebibyte).
This number has no special significance to computers, but is important to humans because we make use of powers of ten.
2^50 = 1,125,899,906,842,624
The binary approximation of the peta-, or 1,000,000,000,000,000 multiplier. 1,125,899,906,842,624 bytes = 1 petabyte (or pebibyte).
2^60 = 1,152,921,504,606,846,976
The binary approximation of the exa-, or 1,000,000,000,000,000,000 multiplier. 1,152,921,504,606,846,976 bytes = 1 exabyte (or exbibyte).
2^64 = 18,446,744,073,709,551,616
The number of distinct values representable in a single word on a 64-bit processor. Or, the number of values representable in a doubleword on a 32-bit processor. Or, the number of values representable in a quadword on a 16-bit processor, such as the original x86 processors.[4]
The range of a long variable in the Java and C# programming languages.
The range of an Int64 or QWord variable in the Pascal programming language.
The total number of IPv6 addresses generally given to a single LAN or subnet.
One more than the number of grains of rice on a chessboard, according to the old story, where the first square contains one grain of rice and each succeeding square twice as many as the previous square. For this reason the number 2^64 – 1 is known as the “chess number”.
2^70 = 1,180,591,620,717,411,303,424
The binary approximation of the zetta-, or 1,000,000,000,000,000,000,000 multiplier, which causes a change of prefix. For example, 1,180,591,620,717,411,303,424 bytes = 1 zettabyte (or zebibyte).
2^86 = 77,371,252,455,336,267,181,195,264
2^86 is conjectured to be the largest power of two not containing a zero.[5]
2^96 = 79,228,162,514,264,337,593,543,950,336
The total number of IPv6 addresses generally given to a local Internet registry. In CIDR notation, ISPs are given a /32, which means that 128-32=96 bits are available for addresses (as opposed to network designation). Thus, 2^96 addresses.
2^128 = 340,282,366,920,938,463,463,374,607,431,768,211,456
The total number of IP addresses available under IPv6. Also the number of distinct universally unique identifiers (UUIDs).
2^333 = 17,498,005,798,264,095,394,980,017,816,940,970,922,825,355,447,145,699,491,406,164,851,279,623,993,595,007,385,788,105,416,184,430,592
The smallest power of 2 which is greater than a googol (10^100).
2^1024 ≈ 1.7976931348E+308
The maximum number that can fit in an IEEE double-precision floating-point format, and hence the maximum number that can be represented by many programs, for example Microsoft Excel.
2^57,885,161 = 581,887,266,232,246,442,175,100,…,725,746,141,988,071,724,285,952
One more than the largest known prime number as of 2013. It has more than 17 million digits.[6]
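
Several of the figures in the list above are easy to check for yourself. Here is a minimal Python sketch that does so; the variable names are my own, and it assumes a platform whose float is an IEEE double (which is what sys.float_info reports on practically every system).

```python
import sys
from datetime import datetime, timedelta, timezone

# 2^24: three 8-bit RGB channels give 24 bits per pixel.
truecolor = 2 ** (3 * 8)

# 2^31: the last second a signed 32-bit counter can hold, counted from the Unix epoch.
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
last_32bit_second = epoch + timedelta(seconds=2**31 - 1)

# 2^64: grains on the chessboard, doubling across all 64 squares.
grains = sum(2**square for square in range(64))

# 2^96: an IPv6 /32 allocation leaves 128 - 32 bits for addresses.
addresses_in_slash_32 = 2 ** (128 - 32)

# 2^333: walk up the exponents until the power of two passes a googol.
googol = 10**100
exponent = 0
while 2**exponent <= googol:
    exponent += 1

# 2^1024: the largest finite double converts to an exact integer for comparison.
largest_double = int(sys.float_info.max)

print(truecolor)
print(last_32bit_second.isoformat())
print(grains == 2**64 - 1)
print(addresses_in_slash_32)
print(exponent)
print(largest_double < 2**1024)
```

Python integers are arbitrary precision, so none of these overflow. The last line shows that the largest finite double actually sits just below 2^1024 rather than exactly on it.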

Again, from the Wikipedia page on Terabyte:

–snip–

Illustrative usage examples

Examples of the use of terabyte to describe data sizes in different fields are:

  • Library data: The U.S. Library of Congress Web Capture team claims that as of March 2014 “the Library has collected about 525 terabytes of web archive data” and that it adds about 5 terabytes per month.[20]
  • Online databases: Ancestry.com claims approximately 600 TB of genealogical data with the inclusion of US Census data from 1790 to 1930.[21]
  • Computer hardware: Hitachi introduced the world’s first one terabyte hard disk drive in 2007.[22]
  • Historical Internet traffic: In 1993, total Internet traffic amounted to approximately 100 TB for the year.[23] As of June 2008, Cisco Systems estimated Internet traffic at 160 TB/s (which, assuming the rate stays statistically constant, comes to 5 zettabytes for the year).[24] In other words, the amount of Internet traffic per second in 2008 exceeded all of the Internet traffic in 1993.
  • Social networks: As of May 2009, Yahoo! Groups had “40 terabytes of data to index”.[25]
  • Video: Released in 2009, the 3D animated film Monsters vs. Aliens used 100 TB of storage during development.[26]
  • Usenet: In October 2000, the Deja News Usenet archive had stored over 500 million Usenet messages which used 1.5 TB of storage.[27]
  • Encyclopedia: In January 2010, the database of Wikipedia consists of a 5.87 terabyte SQL dataset.[28]
  • Climate science: In 2010, the German Climate Computing Centre (DKRZ) was generating 10000 TB of data per year, from a supercomputer with a 20 TB memory and 7000 TB disk space.[29]
  • Audio: One terabyte of audio recorded at CD quality contains approx. 2000 hours of audio. Additionally, one terabyte of compressed audio recorded at 128 kbit/s contains approx. 17,000 hours of audio.
  • The Hubble Space Telescope has collected more than 45 terabytes of data in its first 20 years of observations.[30]
  • The IBM computer Watson, against which Jeopardy! contestants competed in February 2011, has 16 terabytes of RAM.[31]

–snip–
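
Two of the estimates quoted above can be reproduced with a few lines of arithmetic. This is only a rough sketch under my own assumptions: a terabyte is taken as 10^12 bytes, a year as 365 days, and the compressed-audio rate as 128 kilobits (not kilobytes) per second, since that is the reading under which the 17,000-hour figure works out.

```python
TB = 10**12  # decimal terabyte, in bytes (assumption)
ZB = 10**21  # decimal zettabyte, in bytes

SECONDS_PER_YEAR = 365 * 24 * 3600

# Cisco's June 2008 estimate of 160 TB/s, held constant for a whole year.
yearly_traffic_zb = 160 * TB * SECONDS_PER_YEAR / ZB

# Audio at 128 kbit/s is 16,000 bytes per second.
compressed_bytes_per_second = 128_000 // 8
compressed_hours = TB / compressed_bytes_per_second / 3600

print(round(yearly_traffic_zb, 2))  # about 5 zettabytes
print(round(compressed_hours))      # about 17,000 hours
```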


Linux : 23 years on

Happy 23rd Linux !

Yes, 25 August 1991 is when Linus posted that (famous?) email; today marks exactly 23 years since then!

"Hello everybody out there using minix -
I'm doing a (free) operating system (just a hobby, won't be big and
professional like gnu) for 386(486) AT clones. This has been brewing
since april, and is starting to get ready. ..."

(See the above link for the full text).

Here’s another post by Linus on Linux’s History.

“Some people have told me they don’t think a fat penguin really embodies the grace of Linux, which just tells me they have never seen an angry penguin charging at them in excess of 100 mph …”

Today, like every day, it’s pretty much business-as-usual: the Linux OS gallops ahead on a smooth trajectory to “world domination” 🙂
Thank you Linus and Linux : May you live and prosper a Googol years!

Tech musings, hands-on, mostly Linux