Inline Assembly Lab

Part A – Volume Scale Factor w/ Inline Assembly

The first part of this lab is to apply the SQDMULH or SQRDMULH instruction, via inline assembly, to a previous volume scaling solution and to compare the performance of each version on an AArch64 system.

Inline assembly call

  • asm(...); or __asm__(...);
  • asm volatile (...); or __asm__ __volatile__ (...);
    volatile tells the compiler that the statement has side effects, so the optimizer must not delete it or hoist it out of a loop.
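To show how the pieces inside the parentheses fit together, here is a minimal sketch of GCC's extended asm form (template : outputs : inputs : clobbers). The helper name add_with_asm is hypothetical and not part of the lab code; the AArch64 branch is the one relevant to this lab, and the other branches are only there so the sketch also builds on other machines.

```c
#include <stdint.h>

// Hypothetical helper (not from the lab code): adds two 32-bit integers.
int32_t add_with_asm(int32_t a, int32_t b)
{
    int32_t result;
#if defined(__GNUC__) && defined(__aarch64__)
    // template : outputs : inputs (no clobbers needed here)
    __asm__ ("add %w0, %w1, %w2"
             : "=r"(result)          // %0: write-only output register
             : "r"(a), "r"(b));      // %1, %2: input registers
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    result = a;
    __asm__ ("addl %1, %0"
             : "+r"(result)          // %0: read-write operand
             : "r"(b)                // %1: input register
             : "cc");                // the add modifies the flags
#else
    result = a + b;                  // portable fallback without assembly
#endif
    return result;
}
```

Since the result is a real output operand here, plain asm is enough; volatile matters when the statement's effect is not visible through its outputs.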


We’ll take one solution from our previous program, the one that multiplies each sound sample by a volume scale factor, and replace the multiply operation inside the loop with inline assembler code.

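// Volume up using multiply by volume scale factor
void naiveVolumeUp(int16_t* sample_, int16_t* newSample_)
{
    int16_t *x = __builtin_assume_aligned(sample_, 16);
    int16_t *y = __builtin_assume_aligned(newSample_, 16);
    // Assumption: the scale factor is a Q15 fixed-point value (16384 = 0.5);
    // the value used by the previous program is not shown here.
    int16_t factor = 16384;

    // Assumes SAMPLESNUM is a multiple of 8 (8 samples per iteration)
    for (int i = 0; i < SAMPLESNUM; i += 8) {
        __asm__ __volatile__ (
            "LD1 {v0.8h}, [%0]            \n\t" // load 8 samples from x[i..i+7] into v0
            "DUP v1.8h, %w2               \n\t" // duplicate the scale factor into all 8 lanes of v1
            "SQDMULH v0.8h, v0.8h, v1.8h  \n\t" // signed saturating doubling multiply, high half
            "ST1 {v0.8h}, [%1]            \n\t" // store the 8 scaled samples to y[i..i+7]
            : // no outputs (results are written through memory)
            : "r"(&x[i]), "r"(&y[i]), "r"(factor) // inputs
            : "v0", "v1", "memory" // clobbers
        );
    }
}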

Note: solution is incomplete – will update once resolved
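To make it clearer what SQDMULH computes in each 16-bit lane, here is a scalar C model of the instruction: it doubles the product of two signed 16-bit values, keeps the high 16 bits, and saturates. The function names sqdmulh_model and volume_to_q15 are hypothetical helpers for illustration, and the Q15 conversion is an assumption about how a 0.0 to 1.0 volume factor would be represented.

```c
#include <stdint.h>

// Scalar model of one SQDMULH lane: saturate((2 * a * b) >> 16).
int16_t sqdmulh_model(int16_t a, int16_t b)
{
    int64_t doubled = 2 * (int64_t)a * (int64_t)b;
    // Keep the high half. Right-shifting a negative value is
    // implementation-defined in C; GCC and Clang shift arithmetically.
    int64_t high = doubled >> 16;
    // Saturation only triggers when a == b == -32768.
    if (high > INT16_MAX)
        high = INT16_MAX;
    return (int16_t)high;
}

// Hypothetical helper: convert a volume in [0.0, 1.0] to Q15 fixed point.
int16_t volume_to_q15(float volume)
{
    return (int16_t)(volume * 32767.0f);
}
```

With a Q15 factor of 16384 (0.5), sqdmulh_model(1000, 16384) gives 500, so the instruction behaves like a fractional multiply. This also means it can only scale a sample down, because a Q15 factor is always below 1.0.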

Part B – Open source package that uses Assembler

For the second part of the lab, we analyze an open source package that uses inline assembly. For this I chose the CLN package.

Clone CLN repo

I first cloned the repository (on a Fedora Linux machine), so that I could have a look at all the files and look for any assembler code that exists.
git clone git://

I then searched all files (recursively) within the project folder for the keyword ‘asm’.
grep -rnw './' -e "asm"

Two files were found in the results. One of them only contained a comment with “asm”. The other, the header file cl_DS_mul.nuss.h, contained many lines of assembler code.

Part of cl_DS_mul.nuss.h inline assembly:

#if defined(__GNUC__) && defined(__i386__)
    var uintD dummy;
    __asm__ __volatile__ (
        "movl %1,%0" "\n\t"
        "subl %2,%0" "\n\t"
        "movl %0,%3"
        : "=&q" (dummy)
        : "m" (a.ow0), "m" (b.ow0), "m" (r.ow0)
        : "cc"

The assembler code is written for the i386 platform, as we can see from the defined(__i386__) check in the preprocessor conditional that guards each block of assembly.

The author of this file included some comments at the top of the file explaining what these definitions are for:

  • NUSS_IN_EXTERNAL_LOOPS – Define this if you want the external loops instead of inline operation
  • NUSS_ASM_DIRECT – Define this if you want the external loops instead of inline operation
  • DEBUG_NUSS – Define this for (cheap) consistency checks.
  • DEBUG_NUSS_OPERATIONS – Define this for extensive consistency checks.

There were no other files containing any assembler code, so the inline assembly in this project is exclusive to the i386 platform.

Although assembler code is platform specific, it can be beneficial in cases where hand-written instructions can outperform the compiler on that specific platform. Inline assembly can address those cases while the compiler's optimizations handle every other platform.
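The pattern of guarding assembly with a platform check and leaving everything else to the compiler can be sketched as follows. This is not CLN code: sub_word is a hypothetical helper that mirrors the movl/subl sequence from the excerpt above, with a plain C fallback for every platform other than i386.

```c
#include <stdint.h>

// Hypothetical helper (not CLN code): returns a - b for one 32-bit word.
uint32_t sub_word(uint32_t a, uint32_t b)
{
#if defined(__GNUC__) && defined(__i386__)
    uint32_t result;
    __asm__ ("movl %1,%0" "\n\t"     // result = a
             "subl %2,%0"            // result = result - b
             : "=&r"(result)         // early-clobber output register
             : "m"(a), "m"(b)        // inputs read from memory
             : "cc");                // subl modifies the flags
    return result;
#else
    return a - b;                    // all other platforms: plain C
#endif
}
```

On anything other than a 32-bit x86 build the preprocessor drops the assembly entirely, so the function stays portable and the compiler is free to optimize the fallback.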

