glibc’s nexttowardf – optimize for aarch64

Here I detail my findings from trying to optimize glibc’s nexttowardf function for the Aarch64 architecture by breaking down the C and assembly code. I then go over the tests I ran and the conclusions I reached.

My approach is to translate the GET_FLOAT_WORD and SET_FLOAT_WORD macros to Aarch64 inline assembly.

These macros are defined in the math_private.h files for their respective systems:

./sysdeps/x86_64/fpu/math_private.h
./sysdeps/microblaze/math_private.h
./sysdeps/nios2/math_private.h
./sysdeps/arm/math_private.h
./sysdeps/sparc/fpu/math_private.h
./sysdeps/aarch64/fpu/math_private.h
./sysdeps/tile/math_private.h
./sysdeps/sh/math_private.h
./sysdeps/alpha/fpu/math_private.h
./sysdeps/powerpc/fpu/math_private.h
./sysdeps/generic/math_private.h
./sysdeps/i386/fpu/math_private.h
./sysdeps/m68k/coldfire/fpu/math_private.h
./sysdeps/m68k/m680x0/fpu/math_private.h
./sysdeps/mips/math_private.h

math_private.h for Aarch64 is located in aarch64/fpu/math_private.h. It does not define its own versions of these macros, so Aarch64 uses the generic C versions defined in generic/math_private.h.

x86_64

Macros

#if defined __AVX__ || defined SSE2AVX
# define MOVD "vmovd"
# define MOVQ "vmovq"
#else
# define MOVD "movd"
# define MOVQ "movq"
#endif

/* Direct movement of float into integer register.  */
#define GET_FLOAT_WORD(i, d) \
  do {                                        \
    int i_;                                   \
    asm (MOVD " %1, %0" : "=rm" (i_) : "x" ((float) (d)));            \
    (i) = i_;                                     \
  } while (0)

/* And the reverse.  */
#define SET_FLOAT_WORD(f, i) \
  do {                                        \
    int i_ = i;                                   \
    float f__;                                    \
    asm (MOVD " %1, %0" : "=x" (f__) : "rm" (i_));                \
    f = f__;                                      \
  } while (0)

AVX stands for Advanced Vector Extensions. SSE2AVX is not an instruction set itself; it appears to be a macro that glibc defines when building AVX variants of SSE code. If either is defined, the vmovd/vmovq instructions are used to move data to and from the floating point (xmm) registers. Otherwise, movd/movq are used.

As noted in the code comments above, GET_FLOAT_WORD directly moves the float into an integer register. SET_FLOAT_WORD does the reverse.

Looking at where the macro is first being used in s_nexttowardf.c:

float __nexttowardf(float x, long double y)
{
    int32_t hx,ix,iy;
    u_int32_t hy,ly,esy;

    GET_FLOAT_WORD(hx,x);

GET_FLOAT_WORD takes an uninitialized int32_t (hx) as its first argument, which the macro assigns, and a float (x) whose bits are copied into an integer register.

Inline assembly

asm (
     MOVD " %1, %0" 
     : "=rm" (i_) 
     : "x" ((float) (d))
    );

"x" ((float) (d))
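Input operand – the value of d, cast to float, is placed in an SSE (xmm) register, which is what the "x" constraint asks for.

"=rm" (i_)
Output operand – the result is written to i_, which the compiler may keep in a general-purpose register ("r") or in memory ("m").

MOVD " %1, %0"
Instruction – copies the 32 bits of the float (%1) into the integer operand (%0) without converting the value.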

#define GET_FLOAT_WORD(i, d) \
  do {                                        \
    int i_;                                   \
    asm (MOVD " %1, %0" : "=rm" (i_) : "x" ((float) (d)));            \
    (i) = i_;                                     \
  } while (0)

The do { ... } while (0) wrapper never loops and adds nothing to the compiled code. It is there so that the macro expands to a single statement (safe to follow with a semicolon, even inside an if/else) and so that the temporary i_ is confined to its own block and cannot clash with names in the C code where the macro is inserted.
For the parameters in GET_FLOAT_WORD(i, d): i is the destination (here the int32_t hx), which receives the raw 32-bit pattern of the float. The parentheses around (i) are not a cast; they only protect the macro argument from operator-precedence surprises when it is expanded. d is the floating point value that is passed in.

Testing process

First I tried to find any existing glibc files that were using the s_nexttowardf (alias nexttowardf) function.

grep -R "= nexttowardf"

math/bug-nexttoward.c:  fi = nexttowardf (m, fi);
math/bug-nexttoward.c:  fi = nexttowardf (-m, -fi);
math/bug-nexttoward.c:  m = nexttowardf (zero, inf);
math/bug-nexttoward.c:  m = nexttowardf (copysignf (zero, -1.0), -inf);
math/test-misc.c:    if (nexttowardf (0.0f, INFINITY) != nexttowardf (0.0f, 1.0f)
math/test-misc.c:        || nexttowardf (-0.0f, INFINITY) != nexttowardf (-0.0f, 1.0f)
math/test-misc.c:        || nexttowardf (0.0f, -INFINITY) != nexttowardf (0.0f, -1.0f)
math/test-misc.c:        || nexttowardf (-0.0f, -INFINITY) != nexttowardf (-0.0f, -1.0f))
math/test-misc.c:       printf ("nexttowardf (+-0, +-Inf) != nexttowardf (+-0, +-1)\n");

bug-nexttoward.c and test-misc.c both use the nexttowardf function. I chose to use test-misc.c for testing.

Compile issues

Both tester files failed to compile because of missing header files, as detailed below.

cpp math/test-misc.c

# 25 "math/test-misc.c" 2
math/test-misc.c:25:24: fatal error: math-tests.h: No such file or directory
 #include <math-tests.h>

find ./ -name "math-tests.h"

./sysdeps/nios2/math-tests.h
./sysdeps/arm/math-tests.h
./sysdeps/aarch64/math-tests.h
./sysdeps/tile/math-tests.h
./sysdeps/powerpc/math-tests.h
./sysdeps/generic/math-tests.h
./sysdeps/i386/fpu/math-tests.h
./sysdeps/mips/math-tests.h

I tried copying the required headers into the same folder and adding include paths (with the cpp options -I or -include), but the file still would not compile. Since I was working on the GET_FLOAT_WORD macro first, I ended up writing my own tester to exercise it in isolation.

lentest.c

#include <stdio.h>
#define X86_64
//#define GENERIC

typedef int int32_t;
typedef unsigned int u_int32_t;

typedef union
{
  float value;
  u_int32_t word;
} ieee_float_shape_type;

/* Direct movement of float into integer register.  */
#ifdef X86_64
#define GET_FLOAT_WORD(i, d) \
do {                                          \
  int i_;                                     \
  asm("movd %1, %0" : "=rm" (i_) : "x" ((float) (d)));              \
  (i) = i_;                                   \
} while (0)
#endif

#ifdef GENERIC
#define GET_FLOAT_WORD(i,d)                    \
do {                                \
  ieee_float_shape_type gf_u;                   \
  gf_u.value = (d);                     \
  (i) = gf_u.word;                      \
} while (0)
#endif

int main() {
    int32_t hx,hy,ix,iy;
    u_int32_t ly;

    float x=3.1;
    long double y;

    int32_t i = hx;  /* hx is never initialized, so the PRE value of i is indeterminate */
    float d = x;

    printf("PRE:\ni = %d\nd = %0.7f\n", i, d);

    GET_FLOAT_WORD(i, d);

    printf("POST:\ni = %d\nd = %0.7f\n", i, d);

    return 0;
}

Object code

Compile the tester program (with no optimization):

gcc -g -o lentest.o lentest.c

objdump -d --source lentest.o

        do {
          int i_;
          asm("movd %1, %0" : "=rm" (i_) : "x" ((float) (d)));
  400567:   f3 0f 10 45 f4          movss  -0xc(%rbp),%xmm0
  40056c:   66 0f 7e c0             movd   %xmm0,%eax
  400570:   89 45 e8                mov    %eax,-0x18(%rbp)
          (i) = i_;
  400573:   8b 45 e8                mov    -0x18(%rbp),%eax
  400576:   89 45 f8                mov    %eax,-0x8(%rbp)

movss -0xc(%rbp),%xmm0 – loads the 32-bit float stored 12 bytes below the frame base pointer (a local variable on the stack) into the low lane of the xmm0 register. (An XMM register is 128 bits wide; the original SSE treated it as four 32-bit single-precision floating point numbers.)

Using gdb’s layout asm, I printed the register info for %xmm0:

(gdb) info register xmm0
xmm0           {v4_float = {0x3, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0},
  v16_int8 = {0x64, 0x66, 0x66, 0x40, 0x0 <repeats 12 times>}, v8_int16 = {
    0x6664, 0x4066, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x40666664,
    0x0, 0x0, 0x0}, v2_int64 = {0x40666664, 0x0},
  uint128 = 0x00000000000000000000000040666664}

movd %xmm0,%eax – copies the low 32 bits of %xmm0 into the %eax integer register.

The value is 32 bits wide, so the v4_int32 view is the relevant one. I inspected it, the destination register and i from gdb:

(gdb) print 0x40666664
$4 = 1080033278
(gdb) print i
$5 = 0
(gdb) info register eax
eax            0x405ffffe   1080033278
(gdb) s
(gdb) print i
$7 = 1080033278

And we can see that the integer register holds the float’s raw bit pattern (not a numerically converted value), which is then stored in i.

Translate to Aarch64

Here is my translation of the x86_64 inline assembly to Aarch64:

#ifdef AARCH64
#define SYSTEM "Aarch64"
#define GET_FLOAT_WORD(i, d) \
do {                                        \
    int i_;                                 \
    __asm__("MOV %w0, %w1" \
        : [result] "=r" (i_) \
        : [input_i] "r" ((float) (d)));          \
    (i) = i_;               \
} while(0)
#endif

I added this macro to the tester in the following section.

Test runtimes

Since a single run of the tester takes about 0.001 seconds, I modified it to loop a large number of times (LOOPNUM) and to increment the float value by 0.1 on each iteration, so that the run times would be large enough to compare.

New Tester

#include <stdio.h>
#define LOOPNUM 100000000
//#define X86_64
//#define AARCH64
#define GENERIC

typedef int int32_t;
typedef unsigned int u_int32_t;

typedef union
{
  float value;
  u_int32_t word;
} ieee_float_shape_type;

/* Direct movement of float into integer register.  */
#ifdef X86_64
#define SYSTEM "x86_64"
#define GET_FLOAT_WORD(i, d) \
do {                                          \
  int i_;                                     \
  asm("movd %1, %0" : "=rm" (i_) : "x" ((float) (d)));              \
  (i) = i_;                                   \
} while (0)
#endif

#ifdef AARCH64
#define SYSTEM "Aarch64"
#define GET_FLOAT_WORD(i, d) \
do {                                        \
    int i_;                                 \
    __asm__("MOV %w0, %w1" \
        : [result] "=r" (i_) \
        : [input_i] "r" ((float) (d)));          \
    (i) = i_;               \
} while(0)
#endif

#ifdef GENERIC
#define SYSTEM "Generic"
#define GET_FLOAT_WORD(i,d)                    \
do {                                \
  ieee_float_shape_type gf_u;                   \
  gf_u.value = (d);                     \
  (i) = gf_u.word;                      \
} while (0)
#endif

int main() {

    printf("Testing %s\n", SYSTEM);

    int32_t hx,hy,ix,iy;
    u_int32_t ly;

    float x=3.1;
    long double y;

    int32_t i = hx;  /* hx is never initialized, so the PRE value of i is indeterminate */
    float d = x;

    printf("PRE:\ni = %d\nd = %0.7f\n", i, d);
    long int t = 0;

    for (int j=0; j < LOOPNUM; j++) {
        x += .1;
        d = x;

        GET_FLOAT_WORD(i, d);

        t += i;
    }

    printf("POST:\ni = %d\nd = %0.7f\n", i, d);
    printf("t = %d", (int) t);  /* t is a long; print its low 32 bits explicitly rather than passing a long to %d */

    return 0;
}

Testing on Xerxes (x86_64) and Betty (Aarch64)

Results for x86_64 versus Generic:

Xerxes

[lkisac@xerxes ~]$ time ./lentest.o 
Testing x86_64

PRE:
i = 4195888
d = 3.0999999
POST:
i = 1241513984
d = 2097152.0000000
t = 2071391079
real    0m0.812s
user    0m0.810s
sys 0m0.001s
[lkisac@xerxes ~]$ gcc -g lentest.c -o lentest.o
[lkisac@xerxes ~]$ time ./lentest.o 
Testing GENERIC

PRE:
i = 4195872
d = 3.0999999
POST:
i = 1241513984
d = 2097152.0000000
t = 2071391079
real    0m0.806s
user    0m0.805s
sys 0m0.001s

The generic version came out marginally faster (0.806s versus 0.812s, under 1%). To check that this was not noise, I ran each version 10 times and averaged the run times; the generic version was still slightly faster.

Betty

Results for Aarch64 versus Generic:

[lkisac@betty ~]$ time ./lentest.o 
Testing GENERIC

PRE:
i = -608132512
d = 3.0999999
POST:
i = 1241513984
d = 2097152.0000000
t = 2071391079
real    0m3.352s
user    0m3.350s
sys 0m0.000s
[lkisac@betty ~]$ vi lentest.c 
[lkisac@betty ~]$ c99 -g lentest.c -o lentest.o 
[lkisac@betty ~]$ time ./lentest.o 
Testing AARCH64

PRE:
i = -558450000
d = 3.0999999
POST:
i = 1241513984
d = 2097152.0000000
t = 2071391079
real    0m3.352s
user    0m3.350s
sys 0m0.000s

The inline assembly version and the generic C version have identical run times. I also ran each one multiple times and averaged the results, with the same outcome.

Conclusion

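In these tests, which were built without optimization, the generic C code (the ieee_float_shape_type union) ran as fast as the inline assembly on Aarch64 and marginally faster on x86_64. The inline assembly therefore brought no measurable benefit for GET_FLOAT_WORD, and even the x86_64 version may be just as well off with the generic C code.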
