glibc’s nexttowardf – optimize for aarch64

This post will document my findings in trying to optimize glibc’s nexttowardf function for Aarch64 architecture. I will break down the c and assembly code and go over the process of testing and the conclusions I came to in the attempt to optimize this glibc function.

To optimize for aarch64, I will translate the GET_FLOAT_WORD and SET_FLOAT_WORD functions to Aarch64. These are macros defined in the math_private.h files for their respective systems:

./sysdeps/x86_64/fpu/math_private.h
./sysdeps/microblaze/math_private.h
./sysdeps/nios2/math_private.h
./sysdeps/arm/math_private.h
./sysdeps/sparc/fpu/math_private.h
./sysdeps/aarch64/fpu/math_private.h
./sysdeps/tile/math_private.h
./sysdeps/sh/math_private.h
./sysdeps/alpha/fpu/math_private.h
./sysdeps/powerpc/fpu/math_private.h
./sysdeps/generic/math_private.h
./sysdeps/i386/fpu/math_private.h
./sysdeps/m68k/coldfire/fpu/math_private.h
./sysdeps/m68k/m680x0/fpu/math_private.h
./sysdeps/mips/math_private.h

math_private.h for Aarch64, located in aarch64/fpu/math_private.h, does not include its own defined macros (inline assembly) for these functions. Any systems that do not have their own GET_FLOAT_WORD or SET_FLOAT_WORD will use the generic version of these functions (which is c code) defined in generic/math_private.h.

First I will explain the c and assembly code for the x86_64 version.

x86_64

Macros

#if defined __AVX__ || defined SSE2AVX
# define MOVD "vmovd"
# define MOVQ "vmovq"
#else
# define MOVD "movd"
# define MOVQ "movq"
#endif

/* Direct movement of float into integer register.  */
#define GET_FLOAT_WORD(i, d) \
  do {                                        \
    int i_;                                   \
    asm (MOVD " %1, %0" : "=rm" (i_) : "x" ((float) (d)));            \
    (i) = i_;                                     \
  } while (0)

/* And the reverse.  */
#define SET_FLOAT_WORD(f, i) \
  do {                                        \
    int i_ = i;                                   \
    float f__;                                    \
    asm (MOVD " %1, %0" : "=x" (f__) : "rm" (i_));                \
    f = f__;                                      \
  } while (0)

AVX stands for Advanced Vector Extensions. SSE2AVX is a Streaming SIMD Extension for AVX (introduced by Intel in 2008). If either of these are defined, the VMOV* assembly instructions will be used to handle floating point registers. Otherwise, MOV* instructions will be used.

As noted in the code comments above, GET_FLOAT_WORD directly moves the float into an integer register. SET_FLOAT_WORD does the reverse of that.

If I look at where the macro is first being used in s_nexttowardf.c:

float __nexttowardf(float x, long double y)
{
    int32_t hx,ix,iy;
    u_int32_t hy,ly,esy;

    GET_FLOAT_WORD(hx,x);

GET_FLOAT_WORD uses an uninitialized int32_t signed 32 bit integer type as its first parameter (hx, which will be initialized inside the macro definition later on), and a float value (x, used to store into an integer register).

Inline assembly

asm (
     MOVD " %1, %0" 
     : "=rm" (i_) 
     : "x" ((float) (d))
    );

"x" ((float) (d))
Input operand – input value from d is placed in a register.

"=rm" (i_)
Output operand – output register value is moved to i_.

MOVD " %1, %0"
Instruction – moves float value (%1) into integer register (%0).

#define GET_FLOAT_WORD(i, d) \
  do {                                        \
    int i_;                                   \
    asm (MOVD " %1, %0" : "=rm" (i_) : "x" ((float) (d)));            \
    (i) = i_;                                     \
  } while (0)

The while loop here will always be taken out of the compiled code, but the reason for it is to keep the scope of i_ within the scope of the function since this macro will be inserted in the c code itself.
For the parameters in GET_FLOAT_WORD(i, d): i will be the int32_t hx value for the macro’s first parameter – this will be set to the floating point value inside the macro definition. The surrounding parentheses here indicate to explicitly cast the value of i_ to a signed 32 bit int storing it in hx. d is the floating point value that will be passed in.

Testing process

First I tried to find any existing glibc files that were using the s_nexttowardf (alias nexttowardf) function.

grep -R "= nexttowardf"

math/bug-nexttoward.c:  fi = nexttowardf (m, fi);
math/bug-nexttoward.c:  fi = nexttowardf (-m, -fi);
math/bug-nexttoward.c:  m = nexttowardf (zero, inf);
math/bug-nexttoward.c:  m = nexttowardf (copysignf (zero, -1.0), -inf);
math/test-misc.c:    if (nexttowardf (0.0f, INFINITY) != nexttowardf (0.0f, 1.0f)
math/test-misc.c:        || nexttowardf (-0.0f, INFINITY) != nexttowardf (-0.0f, 1.0f)
math/test-misc.c:        || nexttowardf (0.0f, -INFINITY) != nexttowardf (0.0f, -1.0f)
math/test-misc.c:        || nexttowardf (-0.0f, -INFINITY) != nexttowardf (-0.0f, -1.0f))
math/test-misc.c:       printf ("nexttowardf (+-0, +-Inf) != nexttowardf (+-0, +-1)\n");

bug-nexttoward.c and test-misc.c both use the nexttowardf function. I chose to use test-misc.c for testing.

Compile issues

I ran into compile errors for both of the tester files due to missing header file errors as detailed below.

cpp math/test-misc.c

# 25 "math/test-misc.c" 2
math/test-misc.c:25:24: fatal error: math-tests.h: No such file or directory
 #include <math-tests.h>

find ./ -name "math-tests.h"

./sysdeps/nios2/math-tests.h
./sysdeps/arm/math-tests.h
./sysdeps/aarch64/math-tests.h
./sysdeps/tile/math-tests.h
./sysdeps/powerpc/math-tests.h
./sysdeps/generic/math-tests.h
./sysdeps/i386/fpu/math-tests.h
./sysdeps/mips/math-tests.h

I tried including required libraries directly in the same folder or including them statically (with option cpp -I or -include) but still would not compile. So what I ended up doing instead was since I was working on the GET_FLOAT_WORD function first, I created my own tester to test this exclusively.

lentest.c

#include <stdio.h>
#define X86_64
//#define GENERIC

typedef int int32_t;
typedef unsigned int u_int32_t;

typedef union
{
  float value;
  u_int32_t word;
} ieee_float_shape_type;

/* Direct movement of float into integer register.  */
#ifdef X86_64
#define GET_FLOAT_WORD(i, d) \
do {                                          \
  int i_;                                     \
  asm("movd %1, %0" : "=rm" (i_) : "x" ((float) (d)));              \
  (i) = i_;                                   \
} while (0)
#endif

#ifdef GENERIC
#define GET_FLOAT_WORD(i,d)                    \
do {                                \
  ieee_float_shape_type gf_u;                   \
  gf_u.value = (d);                     \
  (i) = gf_u.word;                      \
} while (0)
#endif

int main() {
    int32_t hx,hy,ix,iy;
    u_int32_t ly;

    float x=3.1;
    long double y;

    int32_t i = hx;
    float d = x;

    printf("PRE:\ni = %d\nd = %0.7f\n", i, d);

    GET_FLOAT_WORD(i, d);

    printf("POST:\ni = %d\nd = %0.7f\n", i, d);

    return 0;
}

Object code

Compile tester program (w/ no optimization levels):

gcc -g -o lentest.o lentest.c

objdump -d --source lentest.o

        do {
          int i_;
          asm("movd %1, %0" : "=rm" (i_) : "x" ((float) (d)));
  400567:   f3 0f 10 45 f4          movss  -0xc(%rbp),%xmm0
  40056c:   66 0f 7e c0             movd   %xmm0,%eax
  400570:   89 45 e8                mov    %eax,-0x18(%rbp)
          (i) = i_;
  400573:   8b 45 e8                mov    -0x18(%rbp),%eax
  400576:   89 45 f8                mov    %eax,-0x8(%rbp)

movss -0x10(%rbp),%xmm0 – moves 16th bit value of the register base pointer to xmm register (SSE used only a single data type for XMM registers: four 32-bit single-precision floating point numbers)

using gdb layout asm, output register info for %xmm0:

(gdb) info register xmm0
xmm0           {v4_float = {0x3, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0},
  v16_int8 = {0x64, 0x66, 0x66, 0x40, 0x0 <repeats 12 times>}, v8_int16 = {
    0x6664, 0x4066, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x40666664,
    0x0, 0x0, 0x0}, v2_int64 = {0x40666664, 0x0},
  uint128 = 0x00000000000000000000000040666664}

movd %xmm0,%eax – moves value of %xmm0 to %eax integer register.

We have a 32 bit size integer, so I can check what’s inside the v4_int32 by printing its address:

(gdb) print 0x40666664
$4 = 1080033278
(gdb) print i
$5 = 0
(gdb) info register eax
eax            0x405ffffe   1080033278
(gdb) s
(gdb) print i
$7 = 1080033278

And we can see our integer register holds the converted floating point number which will be stored in i.

Translate to Aarch64

Now to translate to Aarch64. I added a new #ifdef for the Aarch64 and translated the x86_64 assembly code to Aarch64.

#ifdef AARCH64
#define SYSTEM "Aarch64"
#define GET_FLOAT_WORD(i, d) \
do {                                        \
    int i_;                                 \
    __asm__("MOV %w0, %w1" \
        : [result] "=r" (i_) \
        : [input_i] "r" ((float) (d)));          \
    (i) = i_;               \
} while(0)
#endif

And added it to the tester in the following section.

Test runtimes

Since the tester has an execution time of 0.001 seconds, I modified the tester to loop a significant amount of times (defined in LOOPNUM) and increment the float value by 0.1 each time, so I could get more comparable run time results.

New Tester

#include <stdio.h>
#define LOOPNUM 100000000
//#define X86_64
//#define AARCH64
#define GENERIC

typedef int int32_t;
typedef unsigned int u_int32_t;

typedef union
{
  float value;
  u_int32_t word;
} ieee_float_shape_type;

/* Direct movement of float into integer register.  */
#ifdef X86_64
#define SYSTEM "x86_64"
#define GET_FLOAT_WORD(i, d) \
do {                                          \
  int i_;                                     \
  asm("movd %1, %0" : "=rm" (i_) : "x" ((float) (d)));              \
  (i) = i_;                                   \
} while (0)
#endif

#ifdef AARCH64
#define SYSTEM "Aarch64"
#define GET_FLOAT_WORD(i, d) \
do {                                        \
    int i_;                                 \
    __asm__("MOV %w0, %w1" \
        : [result] "=r" (i_) \
        : [input_i] "r" ((float) (d)));          \
    (i) = i_;               \
} while(0)
#endif

#ifdef GENERIC
#define SYSTEM "Generic"
#define GET_FLOAT_WORD(i,d)                    \
do {                                \
  ieee_float_shape_type gf_u;                   \
  gf_u.value = (d);                     \
  (i) = gf_u.word;                      \
} while (0)
#endif

int main() {

    printf("Testing %s\n", SYSTEM);

    int32_t hx,hy,ix,iy;
    u_int32_t ly;

    float x=3.1;
    long double y;

    int32_t i = hx;
    float d = x;

    printf("PRE:\ni = %d\nd = %0.7f\n", i, d);
    long int t = 0;

    for (int j=0; j < LOOPNUM; j++) {
        x += .1;
        d = x;

        GET_FLOAT_WORD(i, d);

        t += i;
    }

    printf("POST:\ni = %d\nd = %0.7f\n", i, d);
    printf("t = %d", t);

    return 0;
}

Testing on Xerxes (x86_64) and Betty (Aarch64)

Results for x86_64 versus Generic:

Xerxes

[lkisac@xerxes ~]$ time ./lentest.o 
Testing x86_64

PRE:
i = 4195888
d = 3.0999999
POST:
i = 1241513984
d = 2097152.0000000
t = 2071391079
real    0m0.812s
user    0m0.810s
sys 0m0.001s
[lkisac@xerxes ~]$ gcc -g lentest.c -o lentest.o
[lkisac@xerxes ~]$ time ./lentest.o 
Testing GENERIC

PRE:
i = 4195872
d = 3.0999999
POST:
i = 1241513984
d = 2097152.0000000
t = 2071391079
real    0m0.806s
user    0m0.805s
sys 0m0.001s

Interesting. It looks like the generic version is actually slightly faster. I validated this by testing each run 10 times and getting the average run time – it was still slightly faster for the generic version.

<h4>Betty</h4>

Results for <strong>Aarch64</strong> versus <strong>Generic</strong>:

[lkisac@betty ~]$ time ./lentest.o 
Testing GENERIC

PRE:
i = -608132512
d = 3.0999999
POST:
i = 1241513984
d = 2097152.0000000
t = 2071391079
real    0m3.352s
user    0m3.350s
sys 0m0.000s
[lkisac@betty ~]$ vi lentest.c 
[lkisac@betty ~]$ c99 -g lentest.c -o lentest.o 
[lkisac@betty ~]$ time ./lentest.o 
Testing AARCH64

PRE:
i = -558450000
d = 3.0999999
POST:
i = 1241513984
d = 2097152.0000000
t = 2071391079
real    0m3.352s
user    0m3.350s
sys 0m0.000s

Also interesting. The run times for both optimized (assembly code) and previous c code have the same run times. This was also tested multiple times and average calculated with the same results.

Conclusion

So it appears that the IEEE c code had the same run time if not better than the inline assembly optimization. This is something I could potentially mention to the community – maybe the x86_64 version would be better off with the defined c code rather than the inline assembly? There could be other reasons for the results I am getting which is something I would like to research further into. In going through the object code with my professor, it was noticed that there was a mov instruction in the object code being compiled that was unnecessary as well. A bug I could look into more to see if it can be filed or has potentially been taken care of already in a later release of gcc.

glibc – nexttowardf candidate for optimization on aarch64

In my previous post, I went over the steps to finding a function that was optimized for x86_64 but not on aarch64.

After going through the list of functions, I came across nexttowardf.c:

[lisac@localhost glibc]$ find ./* -name "*nexttowardf*"
./math/s_nexttowardf.c
./sysdeps/x86_64/fpu/s_nexttowardf.c
./sysdeps/ia64/fpu/s_nexttowardf.S
./sysdeps/ieee754/ldbl-opt/nldbl-nexttowardf.c
./sysdeps/ieee754/ldbl-opt/s_nexttowardfd.c
./sysdeps/ieee754/ldbl-128/s_nexttowardf.c
./sysdeps/ieee754/ldbl-64-128/s_nexttowardf.c
./sysdeps/ieee754/ldbl-96/s_nexttowardf.c
./sysdeps/ieee754/ldbl-128ibm/s_nexttowardf.c
./sysdeps/i386/fpu/s_nexttowardf.c

From nexttowardf‘s man pages:

“The nextafter(), nextafterf(), and nextafterl() functions return the next representable floating-point value following x in the direction of y. If y is less than x, these functions will return the largest representable number less than x.

If x equals y, the functions return y.

The nexttoward(), nexttowardf(), and nexttowardl() functions do the same as the corresponding nextafter() functions, except that they have a long double second argument.”

Looking at the x86_64 optimization sysdeps/x86_64/fpu/s_nexttowardf.c:

#include <sysdeps/i386/fpu/s_nexttowardf.c>

This file contains only an include statement for the i386’s optimization. So I had a look at the sysdeps/i386/fpu/s_nexttowardf.c and it is essentially identical to the original math/s_nexttowardf.c version. One thing to note are the macros inside __nexttowardf: .

original

math/s_nexttowardf.c:

float __nexttowardf(float x, long double y)
{
    int32_t hx,hy,ix,iy;
    u_int32_t ly;

    GET_FLOAT_WORD(hx,x);
    EXTRACT_WORDS(hy,ly,y);
    ix = hx&0x7fffffff;     /* |x| */
    iy = hy&0x7fffffff;     /* |y| */
...
...
    SET_FLOAT_WORD(x,hx);

sysdeps/generic/math_private.h:

/* Get a 32 bit int from a float.  */
#ifndef GET_FLOAT_WORD
# define GET_FLOAT_WORD(i,d)                    \
do {                                \
  ieee_float_shape_type gf_u;                   \
  gf_u.value = (d);                     \
  (i) = gf_u.word;                      \
} while (0)
#endif
/* Set a float from a 32 bit int.  */
#ifndef SET_FLOAT_WORD
# define SET_FLOAT_WORD(d,i)                    \
do {                                \
  ieee_float_shape_type sf_u;                   \
  sf_u.word = (i);                      \
  (d) = sf_u.value;                     \
} while (0)
#endif
/* Get two 32 bit ints from a double.  */

#define EXTRACT_WORDS(ix0,ix1,d)                \
do {                                \
  ieee_double_shape_type ew_u;                  \
  ew_u.value = (d);                     \
  (ix0) = ew_u.parts.msw;                   \
  (ix1) = ew_u.parts.lsw;                   \
} while (0)

x86_64

sysdeps/i386/fpu/s_nexttowardf.c:

float __nexttowardf(float x, long double y)
{
    int32_t hx,ix,iy;
    u_int32_t hy,ly,esy;

    GET_FLOAT_WORD(hx,x);
    GET_LDOUBLE_WORDS(esy,hy,ly,y);
    ix = hx&0x7fffffff;     /* |x| */
    iy = esy&0x7fff;        /* |y| */
...
...
    SET_FLOAT_WORD(x,hx);

sysdeps/x86_64/fpu/math_private.h:

/* Direct movement of float into integer register.  */
#define GET_FLOAT_WORD(i, d) \
  do {                                        \
    int i_;                                   \
    asm (MOVD " %1, %0" : "=rm" (i_) : "x" ((float) (d)));            \
    (i) = i_;                                     \
  } while (0)
/* And the reverse.  */
#define SET_FLOAT_WORD(f, i) \
  do {                                        \
    int i_ = i;                                   \
    float f__;                                    \
    asm (MOVD " %1, %0" : "=x" (f__) : "rm" (i_));                \
    f = f__;                                      \
  } while (0)

sysdeps/x86_64/fpu/math_ldbl.h:

/* Get three 32 bit ints from a double.  */

#define GET_LDOUBLE_WORDS(exp,ix0,ix1,d)            \
do {                                \
  ieee_long_double_shape_type ew_u;             \
  ew_u.value = (d);                     \
  (exp) = ew_u.parts.sign_exponent;             \
  (ix0) = ew_u.parts.msw;                   \
  (ix1) = ew_u.parts.lsw;                   \
} while (0)

The x86_64 definition for GET_FLOAT_WORD and SET_FLOAT_WORD contains inline assembly. I will try a similar approach for the aarch64.

glibc – finding functions optimized for x86_64 but not aarch64

One approach to finding a possible candidate function for optimization was suggested by our professor. This approach was to look for all functions optimized in x86_64 but not on aarch64, since these are the two architectures we have worked with throughout the course.

I wrote a one-liner bash script to achieve this (note: all commands below are run inside the src/glibc folder):

for f in `find ./sysdeps/x86_64/* -name "*.c" |
> sed -rn 's/\.\/sysdeps\/x86_64\/(.*\/)*(.*\.c)/\1\2/p'`; 
> do ls ./sysdeps/aarch64/$f 2>>ls_error.txt; 
> done

This command will look for all c files in the sysdeps/x86_64 directory (recursively) and search for each corresponding filename in sysdeps/aarch64 (with the same directory structure as x86_64). Files that are found in x86_64 but not in aarch64 will cause an error in the ls ./sysdeps/aarch64/$f command and we can redirect those errors to ls_error.txt (files that exist in both will simply redirect to standard output).

$ less ls_error.txt
ls: cannot access './sysdeps/aarch64/dl-procinfo.c': No such file or directory
ls: cannot access './sysdeps/aarch64/dl-runtime.c': No such file or directory
...
...

To get the filenames only from ls_error.txt, I run:

$ cat ls_error.txt | sed -rn "s/^ls.*'\.\/(.*\/)*(.*\.c)'.*$/\2/p" > functions-x86_64_not_aarch64.txt
$ less functions-x86_64_not_aarch64.txt
dl-procinfo.c
dl-runtime.c
...

Now I can go through this list and look for a function that can potentially be optimized for aarch64.

I had previously run pieces of the one-liner script above to write the functions found in x86_64 (to functions_x86_64.txt) and aarch64 (to functions-aarch64.txt) so I could count the number of functions in each architecture and get an idea of what I was dealing with. The functions-x86_64_aarch64.txt file includes functions that exist in both x86_64 and aarch64.

$ ls -l functions*
functions-aarch64.txt
functions-x86_64_aarch64.txt
functions-x86_64_not_aarch64.txt
functions-x86_64.txt
$ wc -l functions*
   54 functions-aarch64.txt
   24 functions-x86_64_aarch64.txt
  178 functions-x86_64_not_aarch64.txt
  199 functions-x86_64.txt

There are 178 files in x86_64 that are not in aarch64. Many of these however are test files, so I excluded the test and tst files:

$ cat functions-x86_64_not_aarch64.txt |
> grep -vP 'test|tst' > functions-x86_64_not_aarch64_notests.txt
$ wc -l functions*
93 functions-x86_64_not_aarch64_notests.txt

Now there are 93 possible functions I can choose from. I will follow up this post with some of my findings.

glibc difftime – no need for optimization

Upon further investigation, difftime can be left as is with no further optimization. Any optimization that can be done will have minimal effect in execution time. I will go over why that is.

double
__difftime (time_t time1, time_t time0)
{
  /* Convert to double and then subtract if no double-rounding error could
     result.  */

  if (TYPE_BITS (time_t) <= DBL_MANT_DIG
      || (TYPE_FLOATING (time_t) && sizeof (time_t) < sizeof (long double)))
    return (double) time1 - (double) time0;

  /* Likewise for long double.  */

  if (TYPE_BITS (time_t) <= LDBL_MANT_DIG || TYPE_FLOATING (time_t))
    return (long double) time1 - (long double) time0;

  /* Subtract the smaller integer from the larger, convert the difference to
     double, and then negate if needed.  */

  return time1 < time0 ? - subtract (time0, time1) : subtract (time1, time0);
}

For the first if condition, TYPE_BITS (time_t) and DBL_MANT_DIG are both constants, so the pre-processor will compare them at compile time and strip them from the executable altogether if they evaluate to true. The same applies to the second if condition. TYPES_BITS <= LDBL_MANT_DIG will be evaluated at compile time.

We can further validate this by compiling the code and looking at the assembly file:

I wrote a tester file that uses time.h's difftime.c:

// len_difftime_test.c
#include <stdio.h>
#include <time.h>
#include <limits.h>
#include <stdint.h>

int main(){
    // test time_t to uint_max conversion
    time_t time1 = time(NULL);
    time_t time0 = time(NULL) + 10;
    uintmax_t dt = (uintmax_t) time1 - (uintmax_t) time0;
    double delta = dt;
    printf("time1 = %d\ntime0 = %d\n", time1, time0);
    printf("(uintmax_t) time1 = %d\n", time1);
    printf("(uintmax_t) time0 = %d\n", time0);

    // test difftime function
    double result;
    result = difftime(time1, time0);
    printf("difftime(time1, time0) = %f\n", result);
    result = difftime(time0, time1);
    printf("difftime(time0, time1) = %f\n", result);

    return 0;
}

Compile:
gcc -g -o len_difftime_test len_difftime_test.c

I use gdb debugger to get to line 18 which makes the first call to difftime.
gdb len_difftime_test

Set a breakpoint at line 18 and run:

(gdb) b 18
Breakpoint 1 at 0x400638: file len_difftime_test.c, line 18.
(gdb) r
Starting program: /home/lisac/SourceCode/Seneca/spo600/project/src/glibc/time/len_difftime_test 
time1 = 1490051018
time0 = 1490051028
(uintmax_t) time1 = 1490051018
(uintmax_t) time0 = 1490051028

Breakpoint 1, main () at len_difftime_test.c:18
18      result = difftime(time1, time0);

Step into the difftime function:
__difftime (time1=1490051390, time0=1490051400) at difftime.c:103
103 {
(gdb) s
114     return (long double) time1 - (long double) time0;
(gdb) s
120 }

Short circuiting or test-reordering will not improve the executable since the pre-processor will rid of the comparison of constants when they evaluate to true. As we can see on line 17, there is no condition, only the returning subtract calculation.

Here is the pre-processor output:

cpp difftime.c

  if ((sizeof (time_t) * 8) <= 53 <-- removed
      || (((time_t) 0.5 == 0.5) && sizeof (time_t) < sizeof (long double))) <-- removed
    return (double) time1 - (double) time0;



  if ((sizeof (time_t) * 8) <= 64 || ((time_t) 0.5 == 0.5)) <-- removed
    return (long double) time1 - (long double) time0;

Now I will be looking into more functions that are better candidates for optimization.

glibc – proposed approach to optimize difftime/subtract

This is a continuation to my previous post on choosing a glibc function that could potentially be optimized. Now I’ll discuss my proposed approach for potential optimization.

difftime

difftime has a few handlers for calculating doubles and long doubles, but for any other types it will simply subtract the larger time value from the smaller one and return the result.

Let’s look at difftime first:

/* Return the difference between TIME1 and TIME0.  */
double
__difftime (time_t time1, time_t time0)
{
  /* Convert to double and then subtract if no double-rounding error could
     result.  */

  if (TYPE_BITS (time_t) <= DBL_MANT_DIG
      || (TYPE_FLOATING (time_t) && sizeof (time_t) < sizeof (long double)))
    return (double) time1 - (double) time0;

  /* Likewise for long double.  */

  if (TYPE_BITS (time_t) <= LDBL_MANT_DIG || TYPE_FLOATING (time_t))
    return (long double) time1 - (long double) time0;

  /* Subtract the smaller integer from the larger, convert the difference to
     double, and then negate if needed.  */

  return time1 < time0 ? - subtract (time0, time1) : subtract (time1, time0);
}

The IF condition for doubles does not contain any significantly expensive operations (i.e. multiply, divide), and since it doesn’t, it may not be necessary to change anything here, but, we know that if the first condition before the OR is not met, we won’t need to execute the second condition, so this could also be written as:

if (TYPE_BITS (time_t) <= DBL_MANT_DIG) {return (double) time1 - (double) time0;}
if (TYPE_FLOATING (time_t) && sizeof (time_t) < sizeof (long double))) {return (double) time1 - (double) time0;}

Since the first condition is the smaller of the two, we test it first and immediately return our result if the condition is met. If not, we can then check the next slightly larger condition.

We can apply a similar approach for the second condition:

if (TYPE_FLOATING (time_t)) {return (long double) time1 - (long double) time0;}
if (TYPE_BITS (time_t) <= LDBL_MANT_DIG) {return (long double) time1 - (long double) time0;}

Something else I noticed inside the __difftime function was the checks for double and long double were always returning time1 minus time0 regardless of which was the larger value. On my particular machine (x86_64), the second IF condition was true since TYPE_BITS(time_t) was lower than LDBL_MANT_DIG, so line 11 was being executed.

double 
__difftime (time_t time1, time_t time0)
{
  if (TYPE_BITS (time_t) <= DBL_MANT_DIG
      || (TYPE_FLOATING (time_t) && sizeof (time_t) < sizeof (long double))) {
    return (double) time1 - (double) time0;
  }

  if (TYPE_BITS (time_t) <= LDBL_MANT_DIG || TYPE_FLOATING (time_t)) {
    //return time1 < time0 ? (long double) time0 - (long double) time1 : (long double) 
    return (long double) time1 - (long double) time0;
  }

  return time1 < time0 ? - subtract (time0, time1) : subtract (time1, time0);
}

I wrote a small tester for this:

int main() {

    // test difftime function
    time_t time1 = time(NULL);
    time_t time0 = time(NULL) + 10;
    printf("time1 = %d\ntime0 = %d\n", time1, time0);
    double result;
    result = __difftime(time1, time0);
    printf("difftime(time1, time0) = %f\n", result);
    result = __difftime(time0, time1);
    printf("difftime(time0, time1) = %f\n", result);

    return 0;
}

Which outputs:

__difftime
time1 = 1489180977
time0 = 1489180987
difftime(time1, time0) = -10.000000
__difftime
time1 = 1489180987
time0 = 1489180977
difftime(time0, time1) = 10.000000

Both results should return 10, but we are missing the time1 < time0 comparison check for each of those conditions, so I included the ternary operators in both conditions:

double
__difftime (time_t time1, time_t time0)
{
  if (TYPE_BITS (time_t) <= DBL_MANT_DIG
      || (TYPE_FLOATING (time_t) && sizeof (time_t) < sizeof (long double))) {
    return time1 < time0 ? (double) time0 - (double) time1 : (double) time1 - (double)
  }

  if (TYPE_BITS (time_t) <= LDBL_MANT_DIG || TYPE_FLOATING (time_t)) {
    return time1 < time0 ? (long double) time0 - (long double) time1 : (long double) 
  }

  ...
}

New output:

__difftime
time1 = 1489181645
time0 = 1489181655
difftime(time1, time0) = 10.000000
__difftime
time1 = 1489181655
time0 = 1489181645
difftime(time0, time1) = 10.000000

subtract

This function is called for any number other than double or long double type. If the time_t type is not a signed type, then the function simply returns the result of time1 - time0. If time_t type is a signed type, handle optimization.

glibc – choosing potential function to enhance

Following the getting started instructions, I cloned the glibc repository to my Fedora Linux machine.

After cloning the repository, I went through the list of functions looking for one that could potentially be enhanced or optimized in some way. There were several I was interested in (ceil, bsearch, regexec), but I eventually chose the difftime function.

The function is defined inside the ./time/difftime.c file, with its own subtract() function I may look at for potential optimization as well.

Inline Assembly Lab

Part A – Volume Scale Factor w/ Inline Assembly

The first part of this lab is to use SQDMULH or SQRDMULH instructions via inline assembly on a previous volume scaling solution and compare each of the performances on an Aarch64 architecture.

Inline assembly call

  • asm(...); or __asm__(...);
  • asm volatile (...); or __asm__ __volatile (...);
    volatile is used if you want to explicitly prevent the compiler from moving code as a result of optimization.

Additional links for inline assembly:

Solution

We’ll take the one solution from our previous program which multiplies each sound sample by a volume scale factor, and replace the multiply operation inside the loop with inline assembler code.

// Volume up using multiply by volume scale factor
void naiveVolumeUp(int16_t* sample_, int16_t* newSample_)
{
    int16_t *x = __builtin_assume_aligned(sample_,16);
    int16_t *y = __builtin_assume_aligned(newSample_,16);

    for (int i = 0; i <= SAMPLESNUM; i+8)
    {
        __asm__ ("LD1 {v0.8h}, [%0];" // load multiple 1-element structure (stores [value] of %0 into v0.8h)
                "DUP v1.8h, %0;" // duplicate element (vector) from %0 to vector register 1
                "SQDMULH v0.8h, v0.8h, v1.8h;" // Signed integer saturating doubling multiply high half (vector)
                "ST1 {v0.8h}, [%0], #16;"
                :
                : "r"(y[i]), "r"(y[i]) // inputs
                //: // outputs
                //: // clobbers
                );
    }
}

Note: solution is incomplete – will update once resolved

Part B – Open source package that uses Assembler

For the second part of the lab, we analyze an open source package that uses inline assembly. For this I chose the CLN package.

Clone CLN repo

I first cloned the repository (on a Fedora Linux machine), so that I could have a look at all the files and look for any assembler code that exists.
git clone git://www.ginac.de/cln.git

Then searched all files (recursively) within the project folder that contain the keyword ‘asm’.
grep -rnw './' -e "asm"

Two files were found in the results:

./src/base/digitseq/cl_DS_mul_nuss.h:963
./src/base/digitseq/cl_DS_mul.cc:176

The cl_DS_mul.cc file only contained a comment with “asm”. The header file cl_DS_mul.nuss.h contained many lines of assembler code.

Part of cl_DS_mul.nuss.h inline assembly:

#if defined(__GNUC__) && defined(__i386__)
    var uintD dummy;
  #ifdef NUSS_ASM_DIRECT
    __asm__ __volatile__ (
        "movl %1,%0" "\n\t"
        "subl %2,%0" "\n\t"
        "movl %0,%3"
        : "=&q" (dummy)
        : "m" (a.ow0), "m" (b.ow0), "m" (r.ow0)
        : "cc"
        );
...
...

The assembler code is written for an i386 platform as we can see from the defined(__i386__) line surrounding the assembly code for each of these preprocessors.

The author of this file included some comments at the top of the file explaining what these definitions are for:

  • NUSS_IN_EXTERNAL_LOOPS – Define this is you want the external loops instead of inline operation
  • NUSS_ASM_DIRECT – Define this if you want the external loops instead of inline operation
  • DEBUG_NUSS – Define this for (cheap) consistency checks.
  • DEBUG_NUSS_OPERATIONS – Define this for extensive consistency checks.

There were no other files containing any assembler code, so the inline assembly in this project are inclusive to the i386 platform.

Although assembler code is platform specific, it can be beneficial for certain cases when optimization can be improved on that specific platform. Our inline assembler can address that while leaving other platforms to be handled by the compiler optimizations.