Category Archives: C

glibc – finding functions optimized for x86_64 but not aarch64

One approach to finding a possible candidate function for optimization was suggested by our professor. This approach was to look for all functions optimized in x86_64 but not on aarch64, since these are the two architectures we have worked with throughout the course.

I wrote a one-liner bash script to achieve this (note: all commands below are run inside the src/glibc folder):

for f in `find ./sysdeps/x86_64/* -name "*.c" |
> sed -rn 's/\.\/sysdeps\/x86_64\/(.*\/)*(.*\.c)/\1\2/p'`; 
> do ls ./sysdeps/aarch64/$f 2>>ls_error.txt; 
> done

This command will look for all c files in the sysdeps/x86_64 directory (recursively) and search for each corresponding filename in sysdeps/aarch64 (with the same directory structure as x86_64). Files that are found in x86_64 but not in aarch64 will cause an error in the ls ./sysdeps/aarch64/$f command and we can redirect those errors to ls_error.txt (files that exist in both will simply redirect to standard output).

$ less ls_error.txt
ls: cannot access './sysdeps/aarch64/dl-procinfo.c': No such file or directory
ls: cannot access './sysdeps/aarch64/dl-runtime.c': No such file or directory
...
...

To get the filenames only from ls_error.txt, I run:

$ cat ls_error.txt | sed -rn "s/^ls.*'\.\/(.*\/)*(.*\.c)'.*$/\2/p" > functions-x86_64_not_aarch64.txt
$ less functions-x86_64_not_aarch64.txt
dl-procinfo.c
dl-runtime.c
...

Now I can go through this list and look for a function that can potentially be optimized for aarch64.

I had previously run pieces of the one-liner script above to write the functions found in x86_64 (to functions_x86_64.txt) and aarch64 (to functions-aarch64.txt) so I could count the number of functions in each architecture and get an idea of what I was dealing with. The functions-x86_64_aarch64.txt file includes functions that exist in both x86_64 and aarch64.

$ ls -l functions*
functions-aarch64.txt
functions-x86_64_aarch64.txt
functions-x86_64_not_aarch64.txt
functions-x86_64.txt
$ wc -l functions*
   54 functions-aarch64.txt
   24 functions-x86_64_aarch64.txt
  178 functions-x86_64_not_aarch64.txt
  199 functions-x86_64.txt

There are 178 files in x86_64 that are not in aarch64. Many of these however are test files, so I excluded the test and tst files:

$ cat functions-x86_64_not_aarch64.txt |
> grep -vP 'test|tst' > functions-x86_64_not_aarch64_notests.txt
$ wc -l functions*
93 functions-x86_64_not_aarch64_notests.txt

Now there are 93 possible functions I can choose from. I will follow up this post with some of my findings.

glibc difftime – no need for optimization

Upon further investigation, difftime can be left as is with no further optimization. Any optimization that can be done will have minimal effect in execution time. I will go over why that is.

double
__difftime (time_t time1, time_t time0)
{
  /* Convert to double and then subtract if no double-rounding error could
     result.  */

  if (TYPE_BITS (time_t) <= DBL_MANT_DIG
      || (TYPE_FLOATING (time_t) && sizeof (time_t) < sizeof (long double)))
    return (double) time1 - (double) time0;

  /* Likewise for long double.  */

  if (TYPE_BITS (time_t) <= LDBL_MANT_DIG || TYPE_FLOATING (time_t))
    return (long double) time1 - (long double) time0;

  /* Subtract the smaller integer from the larger, convert the difference to
     double, and then negate if needed.  */

  return time1 < time0 ? - subtract (time0, time1) : subtract (time1, time0);
}

For the first if condition, TYPE_BITS (time_t) and DBL_MANT_DIG are both constants, so the pre-processor will compare them at compile time and strip them from the executable altogether if they evaluate to true. The same applies to the second if condition. TYPES_BITS <= LDBL_MANT_DIG will be evaluated at compile time.

We can further validate this by compiling the code and looking at the assembly file:

I wrote a tester file that uses time.h's difftime.c:

// len_difftime_test.c
#include <stdio.h>
#include <time.h>
#include <limits.h>
#include <stdint.h>

int main(){
    // test time_t to uint_max conversion
    time_t time1 = time(NULL);
    time_t time0 = time(NULL) + 10;
    uintmax_t dt = (uintmax_t) time1 - (uintmax_t) time0;
    double delta = dt;
    printf("time1 = %d\ntime0 = %d\n", time1, time0);
    printf("(uintmax_t) time1 = %d\n", time1);
    printf("(uintmax_t) time0 = %d\n", time0);

    // test difftime function
    double result;
    result = difftime(time1, time0);
    printf("difftime(time1, time0) = %f\n", result);
    result = difftime(time0, time1);
    printf("difftime(time0, time1) = %f\n", result);

    return 0;
}

Compile:
gcc -g -o len_difftime_test len_difftime_test.c

I use gdb debugger to get to line 18 which makes the first call to difftime.
gdb len_difftime_test

Set a breakpoint at line 18 and run:

(gdb) b 18
Breakpoint 1 at 0x400638: file len_difftime_test.c, line 18.
(gdb) r
Starting program: /home/lisac/SourceCode/Seneca/spo600/project/src/glibc/time/len_difftime_test 
time1 = 1490051018
time0 = 1490051028
(uintmax_t) time1 = 1490051018
(uintmax_t) time0 = 1490051028

Breakpoint 1, main () at len_difftime_test.c:18
18      result = difftime(time1, time0);

Step into the difftime function:
__difftime (time1=1490051390, time0=1490051400) at difftime.c:103
103 {
(gdb) s
114     return (long double) time1 - (long double) time0;
(gdb) s
120 }

Short circuiting or test-reordering will not improve the executable since the pre-processor will rid of the comparison of constants when they evaluate to true. As we can see on line 17, there is no condition, only the returning subtract calculation.

Here is the pre-processor output:

cpp difftime.c

  if ((sizeof (time_t) * 8) <= 53 <-- removed
      || (((time_t) 0.5 == 0.5) && sizeof (time_t) < sizeof (long double))) <-- removed
    return (double) time1 - (double) time0;



  if ((sizeof (time_t) * 8) <= 64 || ((time_t) 0.5 == 0.5)) <-- removed
    return (long double) time1 - (long double) time0;

Now I will be looking into more functions that are better candidates for optimization.

glibc – proposed approach to optimize difftime/subtract

This is a continuation to my previous post on choosing a glibc function that could potentially be optimized. Now I’ll discuss my proposed approach for potential optimization.

difftime

difftime has a few handlers for calculating doubles and long doubles, but for any other types it will simply subtract the larger time value from the smaller one and return the result.

Let’s look at difftime first:

/* Return the difference between TIME1 and TIME0.  */
double
__difftime (time_t time1, time_t time0)
{
  /* Convert to double and then subtract if no double-rounding error could
     result.  */

  if (TYPE_BITS (time_t) <= DBL_MANT_DIG
      || (TYPE_FLOATING (time_t) && sizeof (time_t) < sizeof (long double)))
    return (double) time1 - (double) time0;

  /* Likewise for long double.  */

  if (TYPE_BITS (time_t) <= LDBL_MANT_DIG || TYPE_FLOATING (time_t))
    return (long double) time1 - (long double) time0;

  /* Subtract the smaller integer from the larger, convert the difference to
     double, and then negate if needed.  */

  return time1 < time0 ? - subtract (time0, time1) : subtract (time1, time0);
}

The IF condition for doubles does not contain any significantly expensive operations (i.e. multiply, divide), and since it doesn’t, it may not be necessary to change anything here, but, we know that if the first condition before the OR is not met, we won’t need to execute the second condition, so this could also be written as:

if (TYPE_BITS (time_t) <= DBL_MANT_DIG) {return (double) time1 - (double) time0;}
if (TYPE_FLOATING (time_t) && sizeof (time_t) < sizeof (long double))) {return (double) time1 - (double) time0;}

Since the first condition is the smaller of the two, we test it first and immediately return our result if the condition is met. If not, we can then check the next slightly larger condition.

We can apply a similar approach for the second condition:

if (TYPE_FLOATING (time_t)) {return (long double) time1 - (long double) time0;}
if (TYPE_BITS (time_t) <= LDBL_MANT_DIG) {return (long double) time1 - (long double) time0;}

Something else I noticed inside the __difftime function was the checks for double and long double were always returning time1 minus time0 regardless of which was the larger value. On my particular machine (x86_64), the second IF condition was true since TYPE_BITS(time_t) was lower than LDBL_MANT_DIG, so line 11 was being executed.

double 
__difftime (time_t time1, time_t time0)
{
  if (TYPE_BITS (time_t) <= DBL_MANT_DIG
      || (TYPE_FLOATING (time_t) && sizeof (time_t) < sizeof (long double))) {
    return (double) time1 - (double) time0;
  }

  if (TYPE_BITS (time_t) <= LDBL_MANT_DIG || TYPE_FLOATING (time_t)) {
    //return time1 < time0 ? (long double) time0 - (long double) time1 : (long double) 
    return (long double) time1 - (long double) time0;
  }

  return time1 < time0 ? - subtract (time0, time1) : subtract (time1, time0);
}

I wrote a small tester for this:

int main() {

    // test difftime function
    time_t time1 = time(NULL);
    time_t time0 = time(NULL) + 10;
    printf("time1 = %d\ntime0 = %d\n", time1, time0);
    double result;
    result = __difftime(time1, time0);
    printf("difftime(time1, time0) = %f\n", result);
    result = __difftime(time0, time1);
    printf("difftime(time0, time1) = %f\n", result);

    return 0;
}

Which outputs:

__difftime
time1 = 1489180977
time0 = 1489180987
difftime(time1, time0) = -10.000000
__difftime
time1 = 1489180987
time0 = 1489180977
difftime(time0, time1) = 10.000000

Both results should return 10, but we are missing the time1 < time0 comparison check for each of those conditions, so I included the ternary operators in both conditions:

double
__difftime (time_t time1, time_t time0)
{
  if (TYPE_BITS (time_t) <= DBL_MANT_DIG
      || (TYPE_FLOATING (time_t) && sizeof (time_t) < sizeof (long double))) {
    return time1 < time0 ? (double) time0 - (double) time1 : (double) time1 - (double)
  }

  if (TYPE_BITS (time_t) <= LDBL_MANT_DIG || TYPE_FLOATING (time_t)) {
    return time1 < time0 ? (long double) time0 - (long double) time1 : (long double) 
  }

  ...
}

New output:

__difftime
time1 = 1489181645
time0 = 1489181655
difftime(time1, time0) = 10.000000
__difftime
time1 = 1489181655
time0 = 1489181645
difftime(time0, time1) = 10.000000

subtract

This function is called for any number other than double or long double type. If the time_t type is not a signed type, then the function simply returns the result of time1 - time0. If time_t type is a signed type, handle optimization.

glibc – choosing potential function to enhance

Following the getting started instructions, I cloned the glibc repository to my Fedora Linux machine.

After cloning the repository, I went through the list of functions looking for one that could potentially be enhanced or optimized in some way. There were several I was interested in (ceil, bsearch, regexec), but I eventually chose the difftime function.

The function is defined inside the ./time/difftime.c file, with its own subtract() function I may look at for potential optimization as well.

Inline Assembly Lab

Part A – Volume Scale Factor w/ Inline Assembly

The first part of this lab is to use SQDMULH or SQRDMULH instructions via inline assembly on a previous volume scaling solution and compare each of the performances on an Aarch64 architecture.

Inline assembly call

  • asm(...); or __asm__(...);
  • asm volatile (...); or __asm__ __volatile (...);
    volatile is used if you want to explicitly prevent the compiler from moving code as a result of optimization.

Additional links for inline assembly:

Solution

We’ll take the one solution from our previous program which multiplies each sound sample by a volume scale factor, and replace the multiply operation inside the loop with inline assembler code.

// Volume up using multiply by volume scale factor
void naiveVolumeUp(int16_t* sample_, int16_t* newSample_)
{
    int16_t *x = __builtin_assume_aligned(sample_,16);
    int16_t *y = __builtin_assume_aligned(newSample_,16);

    for (int i = 0; i <= SAMPLESNUM; i+8)
    {
        __asm__ ("LD1 {v0.8h}, [%0];" // load multiple 1-element structure (stores [value] of %0 into v0.8h)
                "DUP v1.8h, %0;" // duplicate element (vector) from %0 to vector register 1
                "SQDMULH v0.8h, v0.8h, v1.8h;" // Signed integer saturating doubling multiply high half (vector)
                "ST1 {v0.8h}, [%0], #16;"
                :
                : "r"(y[i]), "r"(y[i]) // inputs
                //: // outputs
                //: // clobbers
                );
    }
}

Note: solution is incomplete – will update once resolved

Part B – Open source package that uses Assembler

For the second part of the lab, we analyze an open source package that uses inline assembly. For this I chose the CLN package.

Clone CLN repo

I first cloned the repository (on a Fedora Linux machine), so that I could have a look at all the files and look for any assembler code that exists.
git clone git://www.ginac.de/cln.git

Then searched all files (recursively) within the project folder that contain the keyword ‘asm’.
grep -rnw './' -e "asm"

Two files were found in the results:

./src/base/digitseq/cl_DS_mul_nuss.h:963
./src/base/digitseq/cl_DS_mul.cc:176

The cl_DS_mul.cc file only contained a comment with “asm”. The header file cl_DS_mul.nuss.h contained many lines of assembler code.

Part of cl_DS_mul.nuss.h inline assembly:

#if defined(__GNUC__) && defined(__i386__)
    var uintD dummy;
  #ifdef NUSS_ASM_DIRECT
    __asm__ __volatile__ (
        "movl %1,%0" "\n\t"
        "subl %2,%0" "\n\t"
        "movl %0,%3"
        : "=&q" (dummy)
        : "m" (a.ow0), "m" (b.ow0), "m" (r.ow0)
        : "cc"
        );
...
...

The assembler code is written for an i386 platform as we can see from the defined(__i386__) line surrounding the assembly code for each of these preprocessors.

The author of this file included some comments at the top of the file explaining what these definitions are for:

  • NUSS_IN_EXTERNAL_LOOPS – Define this is you want the external loops instead of inline operation
  • NUSS_ASM_DIRECT – Define this if you want the external loops instead of inline operation
  • DEBUG_NUSS – Define this for (cheap) consistency checks.
  • DEBUG_NUSS_OPERATIONS – Define this for extensive consistency checks.

There were no other files containing any assembler code, so the inline assembly in this project are inclusive to the i386 platform.

Although assembler code is platform specific, it can be beneficial for certain cases when optimization can be improved on that specific platform. Our inline assembler can address that while leaving other platforms to be handled by the compiler optimizations.

GCC auto-vectorization & SIMD on Aarch64 architecture

In this exercise, we will compile a short C program and analyze the disassembly file after enabling auto-vectorization. This will be done on an Aarch64 architecture. Later on, we will also refer to a previous post on scaling a sequence of sound samples and how this can be done using SIMD.

Test code

/*
 * vectorization.c
 * This program creates two 1000-element integer arrays and fills them 
 * with random numbers, then sums those two arrays to a third array, 
 * and finally sums the third array to a long int and prints the result
 */

#include <stdio.h>
#include <stdlib.h>

#define SIZE 1000

int main()
{   
    int a[SIZE];
    int b[SIZE];
    int sumArr[SIZE];
    long int totalSum;
    
    for (int i = 0; i < SIZE; i++)
    {   
        a[i] = rand();
        b[i] = rand();
        sumArr[i] = a[i] + b[i];
        totalSum += sumArr[i];
    }
    
    printf("Total long int sum = %d\n", totalSum );
    
    return 0;
}

Compile

c99 -o vectorization vectorization.c
./vectorization

Total long int sum = 310282548

Now we’ll disassemble our binary file.

Disassembly

In this article, it explains that vectorization is an approach to speeding up code which contains many loops such as our code written above.

We’ll compile our code with -O3 level optimization which enables auto-vectorization.

c99 -O3 -o vectorization vectorization.c
objdump -d vectorization

Disassembly output with annotation

00000000004004c0 <main>:
  4004c0:       a9bd7bfd        stp     x29, x30, [sp,#-48]!            

/* Init variables */
  4004c4:       910003fd        mov     x29, sp                         // stack pointer to integer array 'a' address (int a[SIZE])
  4004c8:       a90153f3        stp     x19, x20, [sp,#16]              // int array b[SIZE]
  4004cc:       f90013f5        str     x21, [sp,#32]                   // store register x21 to address pointed to by (sp + (32 * 8))

  4004d0:       52807d33        mov     w19, #0x3e8                     // #1000 - SIZE of array

/* Start of Loop 
 * Calculate sum of both random numbers from a[i] and b[i]
 *  store in sumArr[i]
 */
  4004d4:       97ffffeb        bl      400480 <rand@plt>               // generates random number
  4004d8:       2a0003f5        mov     w21, w0                         // store into a[i]
  4004dc:       97ffffe9        bl      400480 <rand@plt>               // generates random number
  4004e0:       0b0002a0        add     w0, w21, w0                     // add a[i] + b[i] store it in sumArr[i]
  4004e4:       71000673        subs    w19, w19, #0x1                  

/* Calculate Total Sum - store in totalSum
 * End of Loop
 */
  4004e8:       8b20c294        add     x20, x20, w0, sxtw              // add sumArr[i] to totalSum
  4004ec:       54ffff41        b.ne    4004d4 <main+0x14>              // check i <= SIZE
  4004f0:       90000000        adrp    x0, 400000 <_init-0x430>
  4004f4:       aa1403e1        mov     x1, x20
  4004f8:       911c6000        add     x0, x0, #0x718
  4004fc:       97ffffed        bl      4004b0 <printf@plt>
  400500:       2a1303e0        mov     w0, w19
  400504:       f94013f5        ldr     x21, [sp,#32]
  400508:       a94153f3        ldp     x19, x20, [sp,#16]
  40050c:       a8c37bfd        ldp     x29, x30, [sp],#48
  400510:       d65f03c0        ret
  400514:       00000000        .inst   0x00000000 ; undefined

SIMD

From a previous post on scaling an array of sound samples by a factor between 0.000-1.000, we will now try to write a solution to achieve this using SIMD (Single Instruction Multiple Data).

Original C code – no SIMD

const float scale = 0.5;

void naiveVolumeUp(int16_t* sample_, int16_t* newSample_)
{
    for (int i = 0; i <= SAMPLESNUM; i++)
    {
        newSample_[i] = sample_[i] * scale;
    }
}

int main()
{
    int16_t* sample = malloc(SAMPLESNUM*sizeof(int16_t));

    createAudioSample(sample);

    int16_t* newSample = malloc(SAMPLESNUM*sizeof(int16_t));
    naiveVolumeUp(sample, newSample);
	
	return 0;
}

Let’s take a look at our binary’s object dump file for the naiveVolumeUp function:

gcc -O3 -o lab5c lab5.c
objdump -d lab5

0000000000400b90 <naiveVolumeUp>:
  400b90:	48 8d 46 10          	lea    0x10(%rsi),%rax
  400b94:	48 39 c7             	cmp    %rax,%rdi
  400b97:	73 0d                	jae    400ba6 <naiveVolumeUp+0x16>
  400b99:	48 8d 47 10          	lea    0x10(%rdi),%rax
  400b9d:	48 39 c6             	cmp    %rax,%rsi
  400ba0:	0f 82 ba 02 00 00    	jb     400e60 <naiveVolumeUp+0x2d0>
  400ba6:	48 89 f8             	mov    %rdi,%rax
  400ba9:	55                   	push   %rbp
  400baa:	53                   	push   %rbx
  400bab:	48 d1 e8             	shr    %rax
  400bae:	48 f7 d8             	neg    %rax
  400bb1:	83 e0 07             	and    $0x7,%eax
  400bb4:	0f 84 e6 02 00 00    	je     400ea0 <naiveVolumeUp+0x310>
  400bba:	0f bf 17             	movswl (%rdi),%edx
  400bbd:	66 0f ef c0          	pxor   %xmm0,%xmm0
  400bc1:	f3 0f 10 0d eb 05 00 	movss  0x5eb(%rip),%xmm1        # 4011b4 <scale+0x4>
  400bc8:	00 
  400bc9:	83 f8 01             	cmp    $0x1,%eax
  400bcc:	f3 0f 2a c2          	cvtsi2ss %edx,%xmm0
  400bd0:	f3 0f 59 c1          	mulss  %xmm1,%xmm0
  400bd4:	f3 0f 2c d0          	cvttss2si %xmm0,%edx
  400bd8:	66 89 16             	mov    %dx,(%rsi)
  400bdb:	0f 84 f1 02 00 00    	je     400ed2 <naiveVolumeUp+0x342>
  400be1:	0f bf 57 02          	movswl 0x2(%rdi),%edx
  400be5:	66 0f ef c0          	pxor   %xmm0,%xmm0
  400be9:	83 f8 02             	cmp    $0x2,%eax
  400bec:	f3 0f 2a c2          	cvtsi2ss %edx,%xmm0
  400bf0:	f3 0f 59 c1          	mulss  %xmm1,%xmm0
  400bf4:	f3 0f 2c d0          	cvttss2si %xmm0,%edx
  400bf8:	66 89 56 02          	mov    %dx,0x2(%rsi)
  400bfc:	0f 84 e1 02 00 00    	je     400ee3 <naiveVolumeUp+0x353>
  400c02:	0f bf 57 04          	movswl 0x4(%rdi),%edx
  400c06:	66 0f ef c0          	pxor   %xmm0,%xmm0
  400c0a:	83 f8 03             	cmp    $0x3,%eax
  400c0d:	f3 0f 2a c2          	cvtsi2ss %edx,%xmm0
  400c11:	f3 0f 59 c1          	mulss  %xmm1,%xmm0
  400c15:	f3 0f 2c d0          	cvttss2si %xmm0,%edx
  400c19:	66 89 56 04          	mov    %dx,0x4(%rsi)
  400c1d:	0f 84 d1 02 00 00    	je     400ef4 <naiveVolumeUp+0x364>
  400c23:	0f bf 57 06          	movswl 0x6(%rdi),%edx
  400c27:	66 0f ef c0          	pxor   %xmm0,%xmm0
  400c2b:	83 f8 04             	cmp    $0x4,%eax
  400c2e:	f3 0f 2a c2          	cvtsi2ss %edx,%xmm0
  400c32:	f3 0f 59 c1          	mulss  %xmm1,%xmm0
  400c36:	f3 0f 2c d0          	cvttss2si %xmm0,%edx
  400c3a:	66 89 56 06          	mov    %dx,0x6(%rsi)
  400c3e:	0f 84 c1 02 00 00    	je     400f05 <naiveVolumeUp+0x375>
  400c44:	0f bf 57 08          	movswl 0x8(%rdi),%edx
  400c48:	66 0f ef c0          	pxor   %xmm0,%xmm0
  400c4c:	83 f8 05             	cmp    $0x5,%eax
  400c4f:	f3 0f 2a c2          	cvtsi2ss %edx,%xmm0
  400c53:	f3 0f 59 c1          	mulss  %xmm1,%xmm0
  400c57:	f3 0f 2c d0          	cvttss2si %xmm0,%edx
  400c5b:	66 89 56 08          	mov    %dx,0x8(%rsi)
  400c5f:	0f 84 b1 02 00 00    	je     400f16 <naiveVolumeUp+0x386>
  400c65:	0f bf 57 0a          	movswl 0xa(%rdi),%edx
  400c69:	66 0f ef c0          	pxor   %xmm0,%xmm0
  400c6d:	83 f8 07             	cmp    $0x7,%eax
  400c70:	f3 0f 2a c2          	cvtsi2ss %edx,%xmm0
  400c74:	f3 0f 59 c1          	mulss  %xmm1,%xmm0
  400c78:	f3 0f 2c d0          	cvttss2si %xmm0,%edx
  400c7c:	66 89 56 0a          	mov    %dx,0xa(%rsi)
  400c80:	0f 85 3b 02 00 00    	jne    400ec1 <naiveVolumeUp+0x331>
  400c86:	0f bf 57 0c          	movswl 0xc(%rdi),%edx
  400c8a:	66 0f ef c0          	pxor   %xmm0,%xmm0
  400c8e:	41 ba fa 64 cd 1d    	mov    $0x1dcd64fa,%r10d
  400c94:	41 b9 07 00 00 00    	mov    $0x7,%r9d
  400c9a:	f3 0f 2a c2          	cvtsi2ss %edx,%xmm0
  400c9e:	f3 0f 59 c1          	mulss  %xmm1,%xmm0
  400ca2:	f3 0f 2c d0          	cvttss2si %xmm0,%edx
  400ca6:	66 89 56 0c          	mov    %dx,0xc(%rsi)
  400caa:	b9 f9 64 cd 1d       	mov    $0x1dcd64f9,%ecx
  400caf:	41 bb 01 65 cd 1d    	mov    $0x1dcd6501,%r11d
  400cb5:	41 89 c0             	mov    %eax,%r8d
  400cb8:	29 c1                	sub    %eax,%ecx
  400cba:	41 29 c3             	sub    %eax,%r11d
  400cbd:	c1 e9 03             	shr    $0x3,%ecx
  400cc0:	83 c1 01             	add    $0x1,%ecx
  400cc3:	8d 2c cd 00 00 00 00 	lea    0x0(,%rcx,8),%ebp
  400cca:	66 0f ef e4          	pxor   %xmm4,%xmm4
  400cce:	4d 01 c0             	add    %r8,%r8
  400cd1:	31 c0                	xor    %eax,%eax
  400cd3:	0f 28 1d e6 04 00 00 	movaps 0x4e6(%rip),%xmm3        # 4011c0 <scale+0x10>
  400cda:	4a 8d 1c 07          	lea    (%rdi,%r8,1),%rbx
  400cde:	31 d2                	xor    %edx,%edx
  400ce0:	49 01 f0             	add    %rsi,%r8
  400ce3:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)
  400ce8:	66 0f 6f 0c 03       	movdqa (%rbx,%rax,1),%xmm1
  400ced:	66 0f 6f d4          	movdqa %xmm4,%xmm2
  400cf1:	83 c2 01             	add    $0x1,%edx
  400cf4:	66 0f 65 d1          	pcmpgtw %xmm1,%xmm2
  400cf8:	66 0f 6f c1          	movdqa %xmm1,%xmm0
  400cfc:	66 0f 69 ca          	punpckhwd %xmm2,%xmm1
  400d00:	66 0f 61 c2          	punpcklwd %xmm2,%xmm0
  400d04:	0f 5b c9             	cvtdq2ps %xmm1,%xmm1
  400d07:	0f 59 cb             	mulps  %xmm3,%xmm1
  400d0a:	0f 5b c0             	cvtdq2ps %xmm0,%xmm0
  400d0d:	0f 59 c3             	mulps  %xmm3,%xmm0
  400d10:	f3 0f 5b c9          	cvttps2dq %xmm1,%xmm1
  400d14:	f3 0f 5b c0          	cvttps2dq %xmm0,%xmm0
  400d18:	66 0f 6f d0          	movdqa %xmm0,%xmm2
  400d1c:	66 0f 61 c1          	punpcklwd %xmm1,%xmm0
  400d20:	66 0f 69 d1          	punpckhwd %xmm1,%xmm2
  400d24:	66 0f 6f c8          	movdqa %xmm0,%xmm1
  400d28:	66 0f 61 c2          	punpcklwd %xmm2,%xmm0
  400d2c:	66 0f 69 ca          	punpckhwd %xmm2,%xmm1
  400d30:	66 0f 61 c1          	punpcklwd %xmm1,%xmm0
  400d34:	41 0f 11 04 00       	movups %xmm0,(%r8,%rax,1)
  400d39:	48 83 c0 10          	add    $0x10,%rax
  400d3d:	39 ca                	cmp    %ecx,%edx
  400d3f:	72 a7                	jb     400ce8 <naiveVolumeUp+0x158>
  400d41:	41 01 e9             	add    %ebp,%r9d
  400d44:	41 29 ea             	sub    %ebp,%r10d
  400d47:	41 39 eb             	cmp    %ebp,%r11d
  400d4a:	0f 84 0d 01 00 00    	je     400e5d <naiveVolumeUp+0x2cd>
  400d50:	49 63 d1             	movslq %r9d,%rdx
  400d53:	66 0f ef c0          	pxor   %xmm0,%xmm0
  400d57:	0f bf 04 57          	movswl (%rdi,%rdx,2),%eax
  400d5b:	f3 0f 10 0d 51 04 00 	movss  0x451(%rip),%xmm1        # 4011b4 <scale+0x4>
  400d62:	00 
  400d63:	41 83 fa 01          	cmp    $0x1,%r10d
  400d67:	f3 0f 2a c0          	cvtsi2ss %eax,%xmm0
  400d6b:	f3 0f 59 c1          	mulss  %xmm1,%xmm0
  400d6f:	f3 0f 2c c0          	cvttss2si %xmm0,%eax
  400d73:	66 89 04 56          	mov    %ax,(%rsi,%rdx,2)
  400d77:	41 8d 51 01          	lea    0x1(%r9),%edx
  400d7b:	0f 84 dc 00 00 00    	je     400e5d <naiveVolumeUp+0x2cd>
  400d81:	48 63 d2             	movslq %edx,%rdx
  400d84:	66 0f ef c0          	pxor   %xmm0,%xmm0
  400d88:	0f bf 04 57          	movswl (%rdi,%rdx,2),%eax
  400d8c:	41 83 fa 02          	cmp    $0x2,%r10d
  400d90:	f3 0f 2a c0          	cvtsi2ss %eax,%xmm0
  400d94:	f3 0f 59 c1          	mulss  %xmm1,%xmm0
  400d98:	f3 0f 2c c0          	cvttss2si %xmm0,%eax
  400d9c:	66 89 04 56          	mov    %ax,(%rsi,%rdx,2)
  400da0:	41 8d 51 02          	lea    0x2(%r9),%edx
  400da4:	0f 84 b3 00 00 00    	je     400e5d <naiveVolumeUp+0x2cd>
  400daa:	48 63 d2             	movslq %edx,%rdx
  400dad:	66 0f ef c0          	pxor   %xmm0,%xmm0
  400db1:	0f bf 04 57          	movswl (%rdi,%rdx,2),%eax
  400db5:	41 83 fa 03          	cmp    $0x3,%r10d
  400db9:	f3 0f 2a c0          	cvtsi2ss %eax,%xmm0
  400dbd:	f3 0f 59 c1          	mulss  %xmm1,%xmm0
  400dc1:	f3 0f 2c c0          	cvttss2si %xmm0,%eax
  400dc5:	66 89 04 56          	mov    %ax,(%rsi,%rdx,2)
  400dc9:	41 8d 51 03          	lea    0x3(%r9),%edx
  400dcd:	0f 84 8a 00 00 00    	je     400e5d <naiveVolumeUp+0x2cd>
  400dd3:	48 63 d2             	movslq %edx,%rdx
  400dd6:	66 0f ef c0          	pxor   %xmm0,%xmm0
  400dda:	0f bf 04 57          	movswl (%rdi,%rdx,2),%eax
  400dde:	41 83 fa 04          	cmp    $0x4,%r10d
  400de2:	f3 0f 2a c0          	cvtsi2ss %eax,%xmm0
  400de6:	f3 0f 59 c1          	mulss  %xmm1,%xmm0
  400dea:	f3 0f 2c c0          	cvttss2si %xmm0,%eax
  400dee:	66 89 04 56          	mov    %ax,(%rsi,%rdx,2)
  400df2:	41 8d 51 04          	lea    0x4(%r9),%edx
  400df6:	74 65                	je     400e5d <naiveVolumeUp+0x2cd>
  400df8:	48 63 d2             	movslq %edx,%rdx
  400dfb:	66 0f ef c0          	pxor   %xmm0,%xmm0
  400dff:	0f bf 04 57          	movswl (%rdi,%rdx,2),%eax
  400e03:	41 83 fa 05          	cmp    $0x5,%r10d
  400e07:	f3 0f 2a c0          	cvtsi2ss %eax,%xmm0
  400e0b:	f3 0f 59 c1          	mulss  %xmm1,%xmm0
  400e0f:	f3 0f 2c c0          	cvttss2si %xmm0,%eax
  400e13:	66 89 04 56          	mov    %ax,(%rsi,%rdx,2)
  400e17:	41 8d 51 05          	lea    0x5(%r9),%edx
  400e1b:	74 40                	je     400e5d <naiveVolumeUp+0x2cd>
  400e1d:	48 63 d2             	movslq %edx,%rdx
  400e20:	66 0f ef c0          	pxor   %xmm0,%xmm0
  400e24:	0f bf 04 57          	movswl (%rdi,%rdx,2),%eax
  400e28:	41 83 c1 06          	add    $0x6,%r9d
  400e2c:	41 83 fa 06          	cmp    $0x6,%r10d
  400e30:	f3 0f 2a c0          	cvtsi2ss %eax,%xmm0
  400e34:	f3 0f 59 c1          	mulss  %xmm1,%xmm0
  400e38:	f3 0f 2c c0          	cvttss2si %xmm0,%eax
  400e3c:	66 89 04 56          	mov    %ax,(%rsi,%rdx,2)
  400e40:	74 1b                	je     400e5d <naiveVolumeUp+0x2cd>
  400e42:	49 63 c1             	movslq %r9d,%rax
  400e45:	66 0f ef c0          	pxor   %xmm0,%xmm0
  400e49:	0f bf 14 47          	movswl (%rdi,%rax,2),%edx
  400e4d:	f3 0f 2a c2          	cvtsi2ss %edx,%xmm0
  400e51:	f3 0f 59 c8          	mulss  %xmm0,%xmm1
  400e55:	f3 0f 2c d1          	cvttss2si %xmm1,%edx
  400e59:	66 89 14 46          	mov    %dx,(%rsi,%rax,2)
  400e5d:	5b                   	pop    %rbx
  400e5e:	5d                   	pop    %rbp
  400e5f:	c3                   	retq   
  400e60:	31 d2                	xor    %edx,%edx
  400e62:	f3 0f 10 0d 4a 03 00 	movss  0x34a(%rip),%xmm1        # 4011b4 <scale+0x4>
  400e69:	00 
  400e6a:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)
  400e70:	0f bf 04 17          	movswl (%rdi,%rdx,1),%eax
  400e74:	66 0f ef c0          	pxor   %xmm0,%xmm0
  400e78:	f3 0f 2a c0          	cvtsi2ss %eax,%xmm0
  400e7c:	f3 0f 59 c1          	mulss  %xmm1,%xmm0
  400e80:	f3 0f 2c c0          	cvttss2si %xmm0,%eax
  400e84:	66 89 04 16          	mov    %ax,(%rsi,%rdx,1)
  400e88:	48 83 c2 02          	add    $0x2,%rdx
  400e8c:	48 81 fa 02 ca 9a 3b 	cmp    $0x3b9aca02,%rdx
  400e93:	75 db                	jne    400e70 <naiveVolumeUp+0x2e0>
  400e95:	f3 c3                	repz retq 
  400e97:	66 0f 1f 84 00 00 00 	nopw   0x0(%rax,%rax,1)
  400e9e:	00 00 
  400ea0:	bd 00 65 cd 1d       	mov    $0x1dcd6500,%ebp
  400ea5:	b9 a0 ac b9 03       	mov    $0x3b9aca0,%ecx
  400eaa:	41 bb 01 65 cd 1d    	mov    $0x1dcd6501,%r11d
  400eb0:	45 31 c0             	xor    %r8d,%r8d
  400eb3:	41 ba 01 65 cd 1d    	mov    $0x1dcd6501,%r10d
  400eb9:	45 31 c9             	xor    %r9d,%r9d
  400ebc:	e9 09 fe ff ff       	jmpq   400cca <naiveVolumeUp+0x13a>
  400ec1:	41 ba fb 64 cd 1d    	mov    $0x1dcd64fb,%r10d
  400ec7:	41 b9 06 00 00 00    	mov    $0x6,%r9d
  400ecd:	e9 d8 fd ff ff       	jmpq   400caa <naiveVolumeUp+0x11a>
  400ed2:	41 ba 00 65 cd 1d    	mov    $0x1dcd6500,%r10d
  400ed8:	41 b9 01 00 00 00    	mov    $0x1,%r9d
  400ede:	e9 c7 fd ff ff       	jmpq   400caa <naiveVolumeUp+0x11a>
  400ee3:	41 ba ff 64 cd 1d    	mov    $0x1dcd64ff,%r10d
  400ee9:	41 b9 02 00 00 00    	mov    $0x2,%r9d
  400eef:	e9 b6 fd ff ff       	jmpq   400caa <naiveVolumeUp+0x11a>
  400ef4:	41 ba fe 64 cd 1d    	mov    $0x1dcd64fe,%r10d
  400efa:	41 b9 03 00 00 00    	mov    $0x3,%r9d
  400f00:	e9 a5 fd ff ff       	jmpq   400caa <naiveVolumeUp+0x11a>
  400f05:	41 ba fd 64 cd 1d    	mov    $0x1dcd64fd,%r10d
  400f0b:	41 b9 04 00 00 00    	mov    $0x4,%r9d
  400f11:	e9 94 fd ff ff       	jmpq   400caa <naiveVolumeUp+0x11a>
  400f16:	41 ba fc 64 cd 1d    	mov    $0x1dcd64fc,%r10d
  400f1c:	41 b9 05 00 00 00    	mov    $0x5,%r9d
  400f22:	e9 83 fd ff ff       	jmpq   400caa <naiveVolumeUp+0x11a>
  400f27:	66 0f 1f 84 00 00 00 	nopw   0x0(%rax,%rax,1)
  400f2e:	00 00 

For a function with one loop this seems like quite a long set of instructions even if it contains two large size integer arrays. Let’s try to use SIMD to help gcc condense this.

Loop Function using SIMD

The article above mentions that gcc does not know our arrays are aligned so it has to do extra work to account for possible cases in which they are not.

Let’s tell gcc explicitly that these arrays are aligned by calling the __builtin_assume_aligned function on the passed in int16_t* parameters and create two new pointers for our 16-bit aligned arrays.

lab5-simd.c

void naiveVolumeUp(int16_t* sample_, int16_t* newSample_)
{
    int16_t *x = __builtin_assume_aligned(sample_,16);
    int16_t *y = __builtin_assume_aligned(newSample_,16);
    
    for (int i = 0; i <= SAMPLESNUM; i++)
    {
        y[i] = x[i] * scale;
    }   
}

Compile our code again with -O3 level optimization to enable auto-vectorization.
gcc -O3 -o lab5-simd lab5-simd.c
objdump -d lab5-simd

00000000004008e0 <naiveVolumeUp>:
  4008e0:	48 8d 46 10          	lea    0x10(%rsi),%rax
  4008e4:	48 39 c7             	cmp    %rax,%rdi
  4008e7:	73 0d                	jae    4008f6 <naiveVolumeUp+0x16>
  4008e9:	48 8d 47 10          	lea    0x10(%rdi),%rax
  4008ed:	48 39 c6             	cmp    %rax,%rsi
  4008f0:	0f 82 92 00 00 00    	jb     400988 <naiveVolumeUp+0xa8>
  4008f6:	66 0f ef e4          	pxor   %xmm4,%xmm4
  4008fa:	0f 28 1d 4f 03 00 00 	movaps 0x34f(%rip),%xmm3        # 400c50 <scale+0x10>
  400901:	31 c0                	xor    %eax,%eax
  400903:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)
  400908:	66 0f 6f 0c 07       	movdqa (%rdi,%rax,1),%xmm1
  40090d:	66 0f 6f d4          	movdqa %xmm4,%xmm2
  400911:	66 0f 6f c1          	movdqa %xmm1,%xmm0
  400915:	66 0f 65 d1          	pcmpgtw %xmm1,%xmm2
  400919:	66 0f 61 c2          	punpcklwd %xmm2,%xmm0
  40091d:	66 0f 69 ca          	punpckhwd %xmm2,%xmm1
  400921:	0f 5b c0             	cvtdq2ps %xmm0,%xmm0
  400924:	0f 59 c3             	mulps  %xmm3,%xmm0
  400927:	0f 5b c9             	cvtdq2ps %xmm1,%xmm1
  40092a:	0f 59 cb             	mulps  %xmm3,%xmm1
  40092d:	f3 0f 5b c0          	cvttps2dq %xmm0,%xmm0
  400931:	66 0f 6f d0          	movdqa %xmm0,%xmm2
  400935:	f3 0f 5b c9          	cvttps2dq %xmm1,%xmm1
  400939:	66 0f 61 c1          	punpcklwd %xmm1,%xmm0
  40093d:	66 0f 69 d1          	punpckhwd %xmm1,%xmm2
  400941:	66 0f 6f c8          	movdqa %xmm0,%xmm1
  400945:	66 0f 61 c2          	punpcklwd %xmm2,%xmm0
  400949:	66 0f 69 ca          	punpckhwd %xmm2,%xmm1
  40094d:	66 0f 61 c1          	punpcklwd %xmm1,%xmm0
  400951:	0f 29 04 06          	movaps %xmm0,(%rsi,%rax,1)
  400955:	48 83 c0 10          	add    $0x10,%rax
  400959:	48 3d 00 ca 9a 3b    	cmp    $0x3b9aca00,%rax
  40095f:	75 a7                	jne    400908 <naiveVolumeUp+0x28>
  400961:	0f bf 87 00 ca 9a 3b 	movswl 0x3b9aca00(%rdi),%eax
  400968:	66 0f ef c0          	pxor   %xmm0,%xmm0
  40096c:	f3 0f 2a c0          	cvtsi2ss %eax,%xmm0
  400970:	f3 0f 59 05 28 03 00 	mulss  0x328(%rip),%xmm0        # 400ca0 <scale+0x60>
  400977:	00 
  400978:	f3 0f 2c c0          	cvttss2si %xmm0,%eax
  40097c:	66 89 86 00 ca 9a 3b 	mov    %ax,0x3b9aca00(%rsi)
  400983:	c3                   	retq   
  400984:	0f 1f 40 00          	nopl   0x0(%rax)
  400988:	31 d2                	xor    %edx,%edx
  40098a:	f3 0f 10 0d 0e 03 00 	movss  0x30e(%rip),%xmm1        # 400ca0 <scale+0x60>
  400991:	00 
  400992:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)
  400998:	0f bf 04 17          	movswl (%rdi,%rdx,1),%eax
  40099c:	66 0f ef c0          	pxor   %xmm0,%xmm0
  4009a0:	f3 0f 2a c0          	cvtsi2ss %eax,%xmm0
  4009a4:	f3 0f 59 c1          	mulss  %xmm1,%xmm0
  4009a8:	f3 0f 2c c0          	cvttss2si %xmm0,%eax
  4009ac:	66 89 04 16          	mov    %ax,(%rsi,%rdx,1)
  4009b0:	48 83 c2 02          	add    $0x2,%rdx
  4009b4:	48 81 fa 02 ca 9a 3b 	cmp    $0x3b9aca02,%rdx
  4009bb:	75 db                	jne    400998 <naiveVolumeUp+0xb8>
  4009bd:	f3 c3                	repz retq 
  4009bf:	90                   	nop

The instructions for our function has now been significantly condensed since we’ve explicitly told gcc that the arrays used inside our loop are 16-bit aligned.

Size of binaries – with and without auto-vectorization:

[lisac@localhost lab6-vectorization]$ size lab5
   text    data     bss     dec     hex filename
   4666     580       4    5250    1482 lab5
[lisac@localhost lab6-vectorization]$ size lab5-simd
   text    data     bss     dec     hex filename
   3222     580       4    3806     ede lab5-simd

Algorithm Selection – adjusting a sequence of sound samples by a volume scale factor

The purpose of this exercise is to compare the different algorithm approaches for adjusting a sequence of sound samples. We can analyze how the different gcc optimization levels increase or decrease the performance for each of them on both x86_64 and Aarch64 architectures.

Creating audio sample

First, we will create an audio sample to test with. Since we would like to be able to compare the runtimes between each algorithm, a substantial data set is required, so we will create a sequence of 500000000 sound samples. This will be coded in c, and we will store the sound samples inside a signed int16 array (-32768 to 32767 range).

#define SAMPLESNUM 500000000

void createAudioSample(int16_t* sample_)
{
    for(int i = 0; i <= SAMPLESNUM; i++)
    {
        sample_[i] = rand();
    }
}

int main()
{
    int16_t* sample = malloc(SAMPLESNUM*sizeof(int16_t));
    createAudioSample(sample);
}

Increase Volume

First algorithm – “Naive” approach

The first algorithm written will increase the volume by simply multiplying each sound sample by a volume scale factor then storing the result into a new array.

const float scale = 0.5; // volume scaling factor

void naiveVolumeUp(int16_t* sample_, int16_t* newSample_)
{
    for (int i = 0; i <= SAMPLESNUM; i++)
    {
        newSample_[i] = sample_[i] * scale;
    }
}

Second algorithm – Look up table approach

For this approach, we will create a lookup table with all possible values from -32768 to 32767 multiplied by the volume scale factor (in our case 0.5). We can then reference the lookup later on to adjust our volume by the scale factor.

#define MAXSIZE 65536 // maximum size for signed 16 bit integer
#define HALF 32767 // half of max size for signed 16 bit integer

...

void lookupTableVolumeUp(int16_t* sample_, int16_t* newSample_)
{
    // Create Lookup table
    int16_t lookupTable[MAXSIZE];
    for (int counter = 0; counter < MAXSIZE; counter++)
    {
        lookupTable[counter] = ((counter - HALF )*scale);
    }

    // Increase using lookupTable
    for (int i = 0; i <= MAXSIZE; i++)
    {
        newSample_[i] = lookupTable[sample_[i] + HALF];
    }
}

Here is a function to calculate our functions’ execution times:

void printExecTime(struct timeval t1, struct timeval t2)
{
    double elapsed;
 
    elapsed = (t2.tv_sec - t1.tv_sec) + 1e-6 * (t2.tv_usec - t1.tv_usec);

    printf("elapsed: %.8lf seconds\n", elapsed);
}

And our main:

int main()
{
   struct timeval t1, t2;
   int16_t* sample = malloc(SAMPLESNUM*sizeof(int16_t));

   createAudioSample(sample);
   printf("\nAudio sample\n============\n");
   printSpecifiedRange(sample, 0, 7);

   int16_t* newSample = malloc(SAMPLESNUM*sizeof(int16_t));
   printf("\nNaive volume up\n===============\n");
   gettimeofday(&t1, NULL); // starting time
   naiveVolumeUp(sample, newSample); // start naive test
   gettimeofday(&t2, NULL); // end time
   printExecTime(t1, t2);
   printSpecifiedRange(newSample, 0, 7);

   free(newSample);
   newSample = malloc(SAMPLESNUM*sizeof(int16_t));
   printf("\nLookup volume up\n================\n");
   gettimeofday(&t1, NULL); // starting time
   lookupTableVolumeUp(sample, newSample); // start lookup table approach 
   gettimeofday(&t2, NULL); // end time
   printExecTime(t1, t2);
   printSpecifiedRange(newSample, 0, 7);

    return 0;
}

Compile our code:
gcc -o lab5 lab5.c
time ./lab5

Audio sample
============
sample[0]=17767
sample[1]=9158
sample[2]=-26519
sample[3]=18547
sample[4]=-9135
sample[5]=23807
sample[6]=-27574
sample[7]=22764

Naive volume up
===============
elapsed: 6.56288600 seconds
sample[0]=8883
sample[1]=4579
sample[2]=-13259
sample[3]=9273
sample[4]=-4567
sample[5]=11903
sample[6]=-13787
sample[7]=11382

Lookup volume up
================
elapsed: 0.11074700 seconds
sample[0]=8883
sample[1]=4579
sample[2]=-13259
sample[3]=9273
sample[4]=-4567
sample[5]=11903
sample[6]=-13787
sample[7]=11382

real	0m16.653s
user	0m11.863s
sys	0m1.145s

We can see the lookup approach is signficantly faster. Now we’ll test the different optimization levels on both Xerxes(x86_64) and Betty(Aarch64) machines.

Runtime w/ optimization levels

I compiled the code with each optimization level from O0 to O3, i.e:
gcc -O0 -o lab5 lab5.c

gcc -O3 -o lab5 lab5.c
c99 -O0 -o lab5 lab5.c

c99 -O3 -o lab5 lab5.c

And recorded the results into a spreadsheet:

xerxes_betty

The Functions section of the report shows both functions’ execution times, and is displayed in seconds.
The first thing I noticed with O0 optimization was that the ratio between Naive and Lookup was significantly wider on Betty than on Xerxes. The Naive function on Xerxes comparing O0 to O1 improves from 3.84 to 1.18. On Betty, it improves from 8.94 to 1.79. The ratio between the two functions from O1 to O3 begin to even out and are relatively similar between the two architectures.

The time section of the report displays results from the time ./lab5 command. This simply displays the runtimes for the whole program execution in real, user and sys time.

The size section displays results from the size lab5 command.
We can see that all data sizes are equally larger on Betty than on Xerxes. One thing that stands out is the O3 level optimization on Xerxes is 4634 bytes, while on Betty it is 2934 bytes.