glibc’s nexttowardf – optimize for aarch64

Here I will detail my findings in trying to optimize glibc’s nexttowardf function for Aarch64 architecture by breaking down the C and assembly code. Then I will go over the tests and conclusions I came to in attempting to optimize this function.

My approach will include translating the GET_FLOAT_WORD and SET_FLOAT_WORD functions to Aarch64 inline assembly.

These are macros defined in the math_private.h files for their respective systems:

./sysdeps/x86_64/fpu/math_private.h
./sysdeps/microblaze/math_private.h
./sysdeps/nios2/math_private.h
./sysdeps/arm/math_private.h
./sysdeps/sparc/fpu/math_private.h
./sysdeps/aarch64/fpu/math_private.h
./sysdeps/tile/math_private.h
./sysdeps/sh/math_private.h
./sysdeps/alpha/fpu/math_private.h
./sysdeps/powerpc/fpu/math_private.h
./sysdeps/generic/math_private.h
./sysdeps/i386/fpu/math_private.h
./sysdeps/m68k/coldfire/fpu/math_private.h
./sysdeps/m68k/m680x0/fpu/math_private.h
./sysdeps/mips/math_private.h

math_private.h for Aarch64 is located in aarch64/fpu/math_private.h and does not include specific definitions so it will use the generic version of these functions (C code) defined in generic/math_private.h.

x86_64

Macros

#if defined __AVX__ || defined SSE2AVX
# define MOVD "vmovd"
# define MOVQ "vmovq"
#else
# define MOVD "movd"
# define MOVQ "movq"
#endif

/* Direct movement of float into integer register.  */
#define GET_FLOAT_WORD(i, d) \
  do {                                        \
    int i_;                                   \
    asm (MOVD " %1, %0" : "=rm" (i_) : "x" ((float) (d)));            \
    (i) = i_;                                     \
  } while (0)

/* And the reverse.  */
#define SET_FLOAT_WORD(f, i) \
  do {                                        \
    int i_ = i;                                   \
    float f__;                                    \
    asm (MOVD " %1, %0" : "=x" (f__) : "rm" (i_));                \
    f = f__;                                      \
  } while (0)

AVX stands for Advanced Vector Extensions. SSE2AVX is a Streaming SIMD Extension for AVX. If either of these are defined, the VMOV* assembly instructions will be used to handle floating point registers. Otherwise, MOV* instructions will be used.

As noted in the code comments above, GET_FLOAT_WORD directly moves the float into an integer register. SET_FLOAT_WORD does the reverse.

Looking at where the macro is first being used in s_nexttowardf.c:

float __nexttowardf(float x, long double y)
{
    int32_t hx,ix,iy;
    u_int32_t hy,ly,esy;

    GET_FLOAT_WORD(hx,x);

GET_FLOAT_WORD uses an uninitialized int32_t signed 32 bit integer type as its first parameter (hx, which will be initialized inside the macro definition later on), and a float value (x, used to store into an integer register).

Inline assembly

asm (
     MOVD " %1, %0" 
     : "=rm" (i_) 
     : "x" ((float) (d))
    );

"x" ((float) (d))
Input operand – input value from d is placed in a register.

"=rm" (i_)
Output operand – output register value is moved to i_.

MOVD " %1, %0"
Instruction – moves float value (%1) into integer register (%0).

#define GET_FLOAT_WORD(i, d) \
  do {                                        \
    int i_;                                   \
    asm (MOVD " %1, %0" : "=rm" (i_) : "x" ((float) (d)));            \
    (i) = i_;                                     \
  } while (0)

The while loop here will always be taken out of the compiled code, but the reason for it is to keep the scope of i_ within the scope of the function since this macro will be inserted in the c code itself.
For the parameters in GET_FLOAT_WORD(i, d): i will be the int32_t hx value for the macro’s first parameter – this will be set to the floating point value inside the macro definition. The surrounding parentheses here indicate to explicitly cast the value of i_ to a signed 32 bit int storing it in hx. d is the floating point value that will be passed in.

Testing process

First I tried to find any existing glibc files that were using the s_nexttowardf (alias nexttowardf) function.

grep -R "= nexttowardf"

math/bug-nexttoward.c:  fi = nexttowardf (m, fi);
math/bug-nexttoward.c:  fi = nexttowardf (-m, -fi);
math/bug-nexttoward.c:  m = nexttowardf (zero, inf);
math/bug-nexttoward.c:  m = nexttowardf (copysignf (zero, -1.0), -inf);
math/test-misc.c:    if (nexttowardf (0.0f, INFINITY) != nexttowardf (0.0f, 1.0f)
math/test-misc.c:        || nexttowardf (-0.0f, INFINITY) != nexttowardf (-0.0f, 1.0f)
math/test-misc.c:        || nexttowardf (0.0f, -INFINITY) != nexttowardf (0.0f, -1.0f)
math/test-misc.c:        || nexttowardf (-0.0f, -INFINITY) != nexttowardf (-0.0f, -1.0f))
math/test-misc.c:       printf ("nexttowardf (+-0, +-Inf) != nexttowardf (+-0, +-1)\n");

bug-nexttoward.c and test-misc.c both use the nexttowardf function. I chose to use test-misc.c for testing.

Compile issues

I ran into compile errors for both of the tester files due to missing header file errors as detailed below.

cpp math/test-misc.c

# 25 "math/test-misc.c" 2
math/test-misc.c:25:24: fatal error: math-tests.h: No such file or directory
 #include <math-tests.h>

find ./ -name "math-tests.h"

./sysdeps/nios2/math-tests.h
./sysdeps/arm/math-tests.h
./sysdeps/aarch64/math-tests.h
./sysdeps/tile/math-tests.h
./sysdeps/powerpc/math-tests.h
./sysdeps/generic/math-tests.h
./sysdeps/i386/fpu/math-tests.h
./sysdeps/mips/math-tests.h

I tried including required libraries directly in the same folder or including them statically (with option cpp -I or -include) but still would not compile. So what I ended up doing instead was since I was working on the GET_FLOAT_WORD function first, I created my own tester to test this exclusively.

lentest.c

#include <stdio.h>
#define X86_64
//#define GENERIC

typedef int int32_t;
typedef unsigned int u_int32_t;

typedef union
{
  float value;
  u_int32_t word;
} ieee_float_shape_type;

/* Direct movement of float into integer register.  */
#ifdef X86_64
#define GET_FLOAT_WORD(i, d) \
do {                                          \
  int i_;                                     \
  asm("movd %1, %0" : "=rm" (i_) : "x" ((float) (d)));              \
  (i) = i_;                                   \
} while (0)
#endif

#ifdef GENERIC
#define GET_FLOAT_WORD(i,d)                    \
do {                                \
  ieee_float_shape_type gf_u;                   \
  gf_u.value = (d);                     \
  (i) = gf_u.word;                      \
} while (0)
#endif

int main() {
    int32_t hx,hy,ix,iy;
    u_int32_t ly;

    float x=3.1;
    long double y;

    int32_t i = hx;
    float d = x;

    printf("PRE:\ni = %d\nd = %0.7f\n", i, d);

    GET_FLOAT_WORD(i, d);

    printf("POST:\ni = %d\nd = %0.7f\n", i, d);

    return 0;
}

Object code

Compile tester program (w/ no optimization levels):

gcc -g -o lentest.o lentest.c

objdump -d --source lentest.o

        do {
          int i_;
          asm("movd %1, %0" : "=rm" (i_) : "x" ((float) (d)));
  400567:   f3 0f 10 45 f4          movss  -0xc(%rbp),%xmm0
  40056c:   66 0f 7e c0             movd   %xmm0,%eax
  400570:   89 45 e8                mov    %eax,-0x18(%rbp)
          (i) = i_;
  400573:   8b 45 e8                mov    -0x18(%rbp),%eax
  400576:   89 45 f8                mov    %eax,-0x8(%rbp)

movss -0x10(%rbp),%xmm0 – moves 16th bit value of the register base pointer to xmm register (SSE used only a single data type for XMM registers: four 32-bit single-precision floating point numbers)

using gdb layout asm, output register info for %xmm0:

(gdb) info register xmm0
xmm0           {v4_float = {0x3, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0},
  v16_int8 = {0x64, 0x66, 0x66, 0x40, 0x0 <repeats 12 times>}, v8_int16 = {
    0x6664, 0x4066, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x40666664,
    0x0, 0x0, 0x0}, v2_int64 = {0x40666664, 0x0},
  uint128 = 0x00000000000000000000000040666664}

movd %xmm0,%eax – moves value of %xmm0 to %eax integer register.

We have a 32 bit size integer, so I can check what’s inside the v4_int32 by printing its address:

(gdb) print 0x40666664
$4 = 1080033278
(gdb) print i
$5 = 0
(gdb) info register eax
eax            0x405ffffe   1080033278
(gdb) s
(gdb) print i
$7 = 1080033278

And we can see our integer register holds the converted floating point number which will be stored in i.

Translate to Aarch64

Here is the x86_64 in assembly translated to Aarch64:

#ifdef AARCH64
#define SYSTEM "Aarch64"
#define GET_FLOAT_WORD(i, d) \
do {                                        \
    int i_;                                 \
    __asm__("MOV %w0, %w1" \
        : [result] "=r" (i_) \
        : [input_i] "r" ((float) (d)));          \
    (i) = i_;               \
} while(0)
#endif

I added it to the tester in the following section.

Test runtimes

Since the tester program has an execution time of 0.001 seconds, I modified the tester to loop a significant amount of times (defined in LOOPNUM) and increment the float value by 0.1 each time, so I could get more comparable run time results.

New Tester

#include <stdio.h>
#define LOOPNUM 100000000
//#define X86_64
//#define AARCH64
#define GENERIC

typedef int int32_t;
typedef unsigned int u_int32_t;

typedef union
{
  float value;
  u_int32_t word;
} ieee_float_shape_type;

/* Direct movement of float into integer register.  */
#ifdef X86_64
#define SYSTEM "x86_64"
#define GET_FLOAT_WORD(i, d) \
do {                                          \
  int i_;                                     \
  asm("movd %1, %0" : "=rm" (i_) : "x" ((float) (d)));              \
  (i) = i_;                                   \
} while (0)
#endif

#ifdef AARCH64
#define SYSTEM "Aarch64"
#define GET_FLOAT_WORD(i, d) \
do {                                        \
    int i_;                                 \
    __asm__("MOV %w0, %w1" \
        : [result] "=r" (i_) \
        : [input_i] "r" ((float) (d)));          \
    (i) = i_;               \
} while(0)
#endif

#ifdef GENERIC
#define SYSTEM "Generic"
#define GET_FLOAT_WORD(i,d)                    \
do {                                \
  ieee_float_shape_type gf_u;                   \
  gf_u.value = (d);                     \
  (i) = gf_u.word;                      \
} while (0)
#endif

int main() {

    printf("Testing %s\n", SYSTEM);

    int32_t hx,hy,ix,iy;
    u_int32_t ly;

    float x=3.1;
    long double y;

    int32_t i = hx;
    float d = x;

    printf("PRE:\ni = %d\nd = %0.7f\n", i, d);
    long int t = 0;

    for (int j=0; j < LOOPNUM; j++) {
        x += .1;
        d = x;

        GET_FLOAT_WORD(i, d);

        t += i;
    }

    printf("POST:\ni = %d\nd = %0.7f\n", i, d);
    printf("t = %d", t);

    return 0;
}

Testing on Xerxes (x86_64) and Betty (Aarch64)

Results for x86_64 versus Generic:

Xerxes

[lkisac@xerxes ~]$ time ./lentest.o 
Testing x86_64

PRE:
i = 4195888
d = 3.0999999
POST:
i = 1241513984
d = 2097152.0000000
t = 2071391079
real    0m0.812s
user    0m0.810s
sys 0m0.001s
[lkisac@xerxes ~]$ gcc -g lentest.c -o lentest.o
[lkisac@xerxes ~]$ time ./lentest.o 
Testing GENERIC

PRE:
i = 4195872
d = 3.0999999
POST:
i = 1241513984
d = 2097152.0000000
t = 2071391079
real    0m0.806s
user    0m0.805s
sys 0m0.001s

It looks like the generic version is actually slightly faster. I validated this by testing each run 10 times and getting the average run time – it was still slightly faster for the generic version.

Betty

Results for Aarch64 versus Generic:

[lkisac@betty ~]$ time ./lentest.o 
Testing GENERIC

PRE:
i = -608132512
d = 3.0999999
POST:
i = 1241513984
d = 2097152.0000000
t = 2071391079
real    0m3.352s
user    0m3.350s
sys 0m0.000s
[lkisac@betty ~]$ vi lentest.c 
[lkisac@betty ~]$ c99 -g lentest.c -o lentest.o 
[lkisac@betty ~]$ time ./lentest.o 
Testing AARCH64

PRE:
i = -558450000
d = 3.0999999
POST:
i = 1241513984
d = 2097152.0000000
t = 2071391079
real    0m3.352s
user    0m3.350s
sys 0m0.000s

The run times for both optimized (assembly code) and previous c code have the same run times. This was also tested multiple times and average calculated with the same results.

Conclusion

It appears that the IEEE C code had the same run time if not better than the inline assembly optimization, so the x86_64 version may be better off with the defined c code rather than the inline assembly.

Advertisements

glibc – nexttowardf candidate for optimization on aarch64

In my previous post, I went over the steps to finding a function that was optimized for x86_64 but not on aarch64.

After going through the list of functions, I came across nexttowardf.c:

[lisac@localhost glibc]$ find ./* -name "*nexttowardf*"
./math/s_nexttowardf.c
./sysdeps/x86_64/fpu/s_nexttowardf.c
./sysdeps/ia64/fpu/s_nexttowardf.S
./sysdeps/ieee754/ldbl-opt/nldbl-nexttowardf.c
./sysdeps/ieee754/ldbl-opt/s_nexttowardfd.c
./sysdeps/ieee754/ldbl-128/s_nexttowardf.c
./sysdeps/ieee754/ldbl-64-128/s_nexttowardf.c
./sysdeps/ieee754/ldbl-96/s_nexttowardf.c
./sysdeps/ieee754/ldbl-128ibm/s_nexttowardf.c
./sysdeps/i386/fpu/s_nexttowardf.c

From nexttowardf‘s man pages:

“The nextafter(), nextafterf(), and nextafterl() functions return the next representable floating-point value following x in the direction of y. If y is less than x, these functions will return the largest representable number less than x.

If x equals y, the functions return y.

The nexttoward(), nexttowardf(), and nexttowardl() functions do the same as the corresponding nextafter() functions, except that they have a long double second argument.”

Looking at the x86_64 optimization sysdeps/x86_64/fpu/s_nexttowardf.c:

#include <sysdeps/i386/fpu/s_nexttowardf.c>

This file contains only an include statement for the i386’s optimization. So I had a look at the sysdeps/i386/fpu/s_nexttowardf.c and it is essentially identical to the original math/s_nexttowardf.c version. One thing to note are the macros inside __nexttowardf: .

original

math/s_nexttowardf.c:

float __nexttowardf(float x, long double y)
{
    int32_t hx,hy,ix,iy;
    u_int32_t ly;

    GET_FLOAT_WORD(hx,x);
    EXTRACT_WORDS(hy,ly,y);
    ix = hx&0x7fffffff;     /* |x| */
    iy = hy&0x7fffffff;     /* |y| */
...
...
    SET_FLOAT_WORD(x,hx);

sysdeps/generic/math_private.h:

/* Get a 32 bit int from a float.  */
#ifndef GET_FLOAT_WORD
# define GET_FLOAT_WORD(i,d)                    \
do {                                \
  ieee_float_shape_type gf_u;                   \
  gf_u.value = (d);                     \
  (i) = gf_u.word;                      \
} while (0)
#endif
/* Set a float from a 32 bit int.  */
#ifndef SET_FLOAT_WORD
# define SET_FLOAT_WORD(d,i)                    \
do {                                \
  ieee_float_shape_type sf_u;                   \
  sf_u.word = (i);                      \
  (d) = sf_u.value;                     \
} while (0)
#endif
/* Get two 32 bit ints from a double.  */

#define EXTRACT_WORDS(ix0,ix1,d)                \
do {                                \
  ieee_double_shape_type ew_u;                  \
  ew_u.value = (d);                     \
  (ix0) = ew_u.parts.msw;                   \
  (ix1) = ew_u.parts.lsw;                   \
} while (0)

x86_64

sysdeps/i386/fpu/s_nexttowardf.c:

float __nexttowardf(float x, long double y)
{
    int32_t hx,ix,iy;
    u_int32_t hy,ly,esy;

    GET_FLOAT_WORD(hx,x);
    GET_LDOUBLE_WORDS(esy,hy,ly,y);
    ix = hx&0x7fffffff;     /* |x| */
    iy = esy&0x7fff;        /* |y| */
...
...
    SET_FLOAT_WORD(x,hx);

sysdeps/x86_64/fpu/math_private.h:

/* Direct movement of float into integer register.  */
#define GET_FLOAT_WORD(i, d) \
  do {                                        \
    int i_;                                   \
    asm (MOVD " %1, %0" : "=rm" (i_) : "x" ((float) (d)));            \
    (i) = i_;                                     \
  } while (0)
/* And the reverse.  */
#define SET_FLOAT_WORD(f, i) \
  do {                                        \
    int i_ = i;                                   \
    float f__;                                    \
    asm (MOVD " %1, %0" : "=x" (f__) : "rm" (i_));                \
    f = f__;                                      \
  } while (0)

sysdeps/x86_64/fpu/math_ldbl.h:

/* Get three 32 bit ints from a double.  */

#define GET_LDOUBLE_WORDS(exp,ix0,ix1,d)            \
do {                                \
  ieee_long_double_shape_type ew_u;             \
  ew_u.value = (d);                     \
  (exp) = ew_u.parts.sign_exponent;             \
  (ix0) = ew_u.parts.msw;                   \
  (ix1) = ew_u.parts.lsw;                   \
} while (0)

The x86_64 definition for GET_FLOAT_WORD and SET_FLOAT_WORD contains inline assembly. I will try a similar approach for the aarch64.

glibc – finding functions optimized for x86_64 but not aarch64

One approach to finding a possible candidate function for optimization was suggested by our professor. This approach was to look for all functions optimized in x86_64 but not on aarch64, since these are the two architectures we have worked with throughout the course.

I wrote a one-liner bash script to achieve this (note: all commands below are run inside the src/glibc folder):

for f in `find ./sysdeps/x86_64/* -name "*.c" |
> sed -rn 's/\.\/sysdeps\/x86_64\/(.*\/)*(.*\.c)/\1\2/p'`; 
> do ls ./sysdeps/aarch64/$f 2>>ls_error.txt; 
> done

This command will look for all c files in the sysdeps/x86_64 directory (recursively) and search for each corresponding filename in sysdeps/aarch64 (with the same directory structure as x86_64). Files that are found in x86_64 but not in aarch64 will cause an error in the ls ./sysdeps/aarch64/$f command and we can redirect those errors to ls_error.txt (files that exist in both will simply redirect to standard output).

$ less ls_error.txt
ls: cannot access './sysdeps/aarch64/dl-procinfo.c': No such file or directory
ls: cannot access './sysdeps/aarch64/dl-runtime.c': No such file or directory
...
...

To get the filenames only from ls_error.txt, I run:

$ cat ls_error.txt | sed -rn "s/^ls.*'\.\/(.*\/)*(.*\.c)'.*$/\2/p" > functions-x86_64_not_aarch64.txt
$ less functions-x86_64_not_aarch64.txt
dl-procinfo.c
dl-runtime.c
...

Now I can go through this list and look for a function that can potentially be optimized for aarch64.

I had previously run pieces of the one-liner script above to write the functions found in x86_64 (to functions_x86_64.txt) and aarch64 (to functions-aarch64.txt) so I could count the number of functions in each architecture and get an idea of what I was dealing with. The functions-x86_64_aarch64.txt file includes functions that exist in both x86_64 and aarch64.

$ ls -l functions*
functions-aarch64.txt
functions-x86_64_aarch64.txt
functions-x86_64_not_aarch64.txt
functions-x86_64.txt
$ wc -l functions*
   54 functions-aarch64.txt
   24 functions-x86_64_aarch64.txt
  178 functions-x86_64_not_aarch64.txt
  199 functions-x86_64.txt

There are 178 files in x86_64 that are not in aarch64. Many of these however are test files, so I excluded the test and tst files:

$ cat functions-x86_64_not_aarch64.txt |
> grep -vP 'test|tst' > functions-x86_64_not_aarch64_notests.txt
$ wc -l functions*
93 functions-x86_64_not_aarch64_notests.txt

Now there are 93 possible functions I can choose from. I will follow up this post with some of my findings.

Algorithm Selection – adjusting a sequence of sound samples by a volume scale factor

The purpose of this exercise is to compare the different algorithm approaches for adjusting a sequence of sound samples. We can analyze how the different gcc optimization levels increase or decrease the performance for each of them on both x86_64 and Aarch64 architectures.

Creating audio sample

First, we will create an audio sample to test with. Since we would like to be able to compare the runtimes between each algorithm, a substantial data set is required, so we will create a sequence of 500000000 sound samples. This will be coded in c, and we will store the sound samples inside a signed int16 array (-32768 to 32767 range).

#define SAMPLESNUM 500000000

void createAudioSample(int16_t* sample_)
{
    for(int i = 0; i <= SAMPLESNUM; i++)
    {
        sample_[i] = rand();
    }
}

int main()
{
    int16_t* sample = malloc(SAMPLESNUM*sizeof(int16_t));
    createAudioSample(sample);
}

Increase Volume

First algorithm – “Naive” approach

The first algorithm written will increase the volume by simply multiplying each sound sample by a volume scale factor then storing the result into a new array.

const float scale = 0.5; // volume scaling factor

void naiveVolumeUp(int16_t* sample_, int16_t* newSample_)
{
    for (int i = 0; i <= SAMPLESNUM; i++)
    {
        newSample_[i] = sample_[i] * scale;
    }
}

Second algorithm – Look up table approach

For this approach, we will create a lookup table with all possible values from -32768 to 32767 multiplied by the volume scale factor (in our case 0.5). We can then reference the lookup later on to adjust our volume by the scale factor.

#define MAXSIZE 65536 // maximum size for signed 16 bit integer
#define HALF 32767 // half of max size for signed 16 bit integer

...

void lookupTableVolumeUp(int16_t* sample_, int16_t* newSample_)
{
    // Create Lookup table
    int16_t lookupTable[MAXSIZE];
    for (int counter = 0; counter < MAXSIZE; counter++)
    {
        lookupTable[counter] = ((counter - HALF )*scale);
    }

    // Increase using lookupTable
    for (int i = 0; i <= MAXSIZE; i++)
    {
        newSample_[i] = lookupTable[sample_[i] + HALF];
    }
}

Here is a function to calculate our functions’ execution times:

void printExecTime(struct timeval t1, struct timeval t2)
{
    double elapsed;
 
    elapsed = (t2.tv_sec - t1.tv_sec) + 1e-6 * (t2.tv_usec - t1.tv_usec);

    printf("elapsed: %.8lf seconds\n", elapsed);
}

And our main:

int main()
{
   struct timeval t1, t2;
   int16_t* sample = malloc(SAMPLESNUM*sizeof(int16_t));

   createAudioSample(sample);
   printf("\nAudio sample\n============\n");
   printSpecifiedRange(sample, 0, 7);

   int16_t* newSample = malloc(SAMPLESNUM*sizeof(int16_t));
   printf("\nNaive volume up\n===============\n");
   gettimeofday(&t1, NULL); // starting time
   naiveVolumeUp(sample, newSample); // start naive test
   gettimeofday(&t2, NULL); // end time
   printExecTime(t1, t2);
   printSpecifiedRange(newSample, 0, 7);

   free(newSample);
   newSample = malloc(SAMPLESNUM*sizeof(int16_t));
   printf("\nLookup volume up\n================\n");
   gettimeofday(&t1, NULL); // starting time
   lookupTableVolumeUp(sample, newSample); // start lookup table approach 
   gettimeofday(&t2, NULL); // end time
   printExecTime(t1, t2);
   printSpecifiedRange(newSample, 0, 7);

    return 0;
}

Compile our code:
gcc -o lab5 lab5.c
time ./lab5

Audio sample
============
sample[0]=17767
sample[1]=9158
sample[2]=-26519
sample[3]=18547
sample[4]=-9135
sample[5]=23807
sample[6]=-27574
sample[7]=22764

Naive volume up
===============
elapsed: 6.56288600 seconds
sample[0]=8883
sample[1]=4579
sample[2]=-13259
sample[3]=9273
sample[4]=-4567
sample[5]=11903
sample[6]=-13787
sample[7]=11382

Lookup volume up
================
elapsed: 0.11074700 seconds
sample[0]=8883
sample[1]=4579
sample[2]=-13259
sample[3]=9273
sample[4]=-4567
sample[5]=11903
sample[6]=-13787
sample[7]=11382

real	0m16.653s
user	0m11.863s
sys	0m1.145s

We can see the lookup approach is signficantly faster. Now we’ll test the different optimization levels on both Xerxes(x86_64) and Betty(Aarch64) machines.

Runtime w/ optimization levels

I compiled the code with each optimization level from O0 to O3, i.e:
gcc -O0 -o lab5 lab5.c

gcc -O3 -o lab5 lab5.c
c99 -O0 -o lab5 lab5.c

c99 -O3 -o lab5 lab5.c

And recorded the results into a spreadsheet:

xerxes_betty

The Functions section of the report shows both functions’ execution times, and is displayed in seconds.
The first thing I noticed with O0 optimization was that the ratio between Naive and Lookup was significantly wider on Betty than on Xerxes. The Naive function on Xerxes comparing O0 to O1 improves from 3.84 to 1.18. On Betty, it improves from 8.94 to 1.79. The ratio between the two functions from O1 to O3 begin to even out and are relatively similar between the two architectures.

The time section of the report displays results from the time ./lab5 command. This simply displays the runtimes for the whole program execution in real, user and sys time.

The size section displays results from the size lab5 command.
We can see that all data sizes are equally larger on Betty than on Xerxes. One thing that stands out is the O3 level optimization on Xerxes is 4634 bytes, while on Betty it is 2934 bytes.