All posts by Len Isac

More Thimble bugs to work on (for Release 0.2)

The bugs I will be working on for my second Release (0.2) are both related to a previous issue I contributed to. I wrote a blog about it here.

The first one (Issue #1738) involves fixing the event’s callback parameters. The second (Issue #240) is to have the renamed project fully published and actually update on S3. (Amazon S3 is a simple storage service that Thimble stores its projects on).

I will continue with this post once I have finished working on these bugs.

GCC auto-vectorization & SIMD on Aarch64 architecture

In this exercise, we will compile a short C program and analyze the disassembly file after enabling auto-vectorization. This will be done on an Aarch64 architecture. Later on, we will also refer to a previous post on scaling a sequence of sound samples and how this can be done using SIMD.

Test code

/*
 * vectorization.c
 * This program creates two 1000-element integer arrays and fills them 
 * with random numbers, then sums those two arrays to a third array, 
 * and finally sums the third array to a long int and prints the result
 */

#include <stdio.h>
#include <stdlib.h>

#define SIZE 1000

int main()
{   
    int a[SIZE];
    int b[SIZE];
    int sumArr[SIZE];
    long int totalSum;
    
    for (int i = 0; i < SIZE; i++)
    {   
        a[i] = rand();
        b[i] = rand();
        sumArr[i] = a[i] + b[i];
        totalSum += sumArr[i];
    }
    
    printf("Total long int sum = %d\n", totalSum );
    
    return 0;
}

Compile

c99 -o vectorization vectorization.c
./vectorization

Total long int sum = 310282548

Now we’ll disassemble our binary file.

Disassembly

In this article, it explains that vectorization is an approach to speeding up code which contains many loops such as our code written above.

We’ll compile our code with -O3 level optimization which enables auto-vectorization.

c99 -O3 -o vectorization vectorization.c
objdump -d vectorization

Disassembly output with annotation

00000000004004c0 <main>:
  4004c0:       a9bd7bfd        stp     x29, x30, [sp,#-48]!            

/* Init variables */
  4004c4:       910003fd        mov     x29, sp                         // stack pointer to integer array 'a' address (int a[SIZE])
  4004c8:       a90153f3        stp     x19, x20, [sp,#16]              // int array b[SIZE]
  4004cc:       f90013f5        str     x21, [sp,#32]                   // store register x21 to address pointed to by (sp + (32 * 8))

  4004d0:       52807d33        mov     w19, #0x3e8                     // #1000 - SIZE of array

/* Start of Loop 
 * Calculate sum of both random numbers from a[i] and b[i]
 *  store in sumArr[i]
 */
  4004d4:       97ffffeb        bl      400480 <rand@plt>               // generates random number
  4004d8:       2a0003f5        mov     w21, w0                         // store into a[i]
  4004dc:       97ffffe9        bl      400480 <rand@plt>               // generates random number
  4004e0:       0b0002a0        add     w0, w21, w0                     // add a[i] + b[i] store it in sumArr[i]
  4004e4:       71000673        subs    w19, w19, #0x1                  

/* Calculate Total Sum - store in totalSum
 * End of Loop
 */
  4004e8:       8b20c294        add     x20, x20, w0, sxtw              // add sumArr[i] to totalSum
  4004ec:       54ffff41        b.ne    4004d4 <main+0x14>              // check i <= SIZE
  4004f0:       90000000        adrp    x0, 400000 <_init-0x430>
  4004f4:       aa1403e1        mov     x1, x20
  4004f8:       911c6000        add     x0, x0, #0x718
  4004fc:       97ffffed        bl      4004b0 <printf@plt>
  400500:       2a1303e0        mov     w0, w19
  400504:       f94013f5        ldr     x21, [sp,#32]
  400508:       a94153f3        ldp     x19, x20, [sp,#16]
  40050c:       a8c37bfd        ldp     x29, x30, [sp],#48
  400510:       d65f03c0        ret
  400514:       00000000        .inst   0x00000000 ; undefined

SIMD

From a previous post on scaling an array of sound samples by a factor between 0.000-1.000, we will now try to write a solution to achieve this using SIMD (Single Instruction Multiple Data).

Original C code – no SIMD

const float scale = 0.5;

void naiveVolumeUp(int16_t* sample_, int16_t* newSample_)
{
    for (int i = 0; i <= SAMPLESNUM; i++)
    {
        newSample_[i] = sample_[i] * scale;
    }
}

int main()
{
    int16_t* sample = malloc(SAMPLESNUM*sizeof(int16_t));

    createAudioSample(sample);

    int16_t* newSample = malloc(SAMPLESNUM*sizeof(int16_t));
    naiveVolumeUp(sample, newSample);
	
	return 0;
}

Let’s take a look at our binary’s object dump file for the naiveVolumeUp function:

gcc -O3 -o lab5c lab5.c
objdump -d lab5

0000000000400b90 <naiveVolumeUp>:
  400b90:	48 8d 46 10          	lea    0x10(%rsi),%rax
  400b94:	48 39 c7             	cmp    %rax,%rdi
  400b97:	73 0d                	jae    400ba6 <naiveVolumeUp+0x16>
  400b99:	48 8d 47 10          	lea    0x10(%rdi),%rax
  400b9d:	48 39 c6             	cmp    %rax,%rsi
  400ba0:	0f 82 ba 02 00 00    	jb     400e60 <naiveVolumeUp+0x2d0>
  400ba6:	48 89 f8             	mov    %rdi,%rax
  400ba9:	55                   	push   %rbp
  400baa:	53                   	push   %rbx
  400bab:	48 d1 e8             	shr    %rax
  400bae:	48 f7 d8             	neg    %rax
  400bb1:	83 e0 07             	and    $0x7,%eax
  400bb4:	0f 84 e6 02 00 00    	je     400ea0 <naiveVolumeUp+0x310>
  400bba:	0f bf 17             	movswl (%rdi),%edx
  400bbd:	66 0f ef c0          	pxor   %xmm0,%xmm0
  400bc1:	f3 0f 10 0d eb 05 00 	movss  0x5eb(%rip),%xmm1        # 4011b4 <scale+0x4>
  400bc8:	00 
  400bc9:	83 f8 01             	cmp    $0x1,%eax
  400bcc:	f3 0f 2a c2          	cvtsi2ss %edx,%xmm0
  400bd0:	f3 0f 59 c1          	mulss  %xmm1,%xmm0
  400bd4:	f3 0f 2c d0          	cvttss2si %xmm0,%edx
  400bd8:	66 89 16             	mov    %dx,(%rsi)
  400bdb:	0f 84 f1 02 00 00    	je     400ed2 <naiveVolumeUp+0x342>
  400be1:	0f bf 57 02          	movswl 0x2(%rdi),%edx
  400be5:	66 0f ef c0          	pxor   %xmm0,%xmm0
  400be9:	83 f8 02             	cmp    $0x2,%eax
  400bec:	f3 0f 2a c2          	cvtsi2ss %edx,%xmm0
  400bf0:	f3 0f 59 c1          	mulss  %xmm1,%xmm0
  400bf4:	f3 0f 2c d0          	cvttss2si %xmm0,%edx
  400bf8:	66 89 56 02          	mov    %dx,0x2(%rsi)
  400bfc:	0f 84 e1 02 00 00    	je     400ee3 <naiveVolumeUp+0x353>
  400c02:	0f bf 57 04          	movswl 0x4(%rdi),%edx
  400c06:	66 0f ef c0          	pxor   %xmm0,%xmm0
  400c0a:	83 f8 03             	cmp    $0x3,%eax
  400c0d:	f3 0f 2a c2          	cvtsi2ss %edx,%xmm0
  400c11:	f3 0f 59 c1          	mulss  %xmm1,%xmm0
  400c15:	f3 0f 2c d0          	cvttss2si %xmm0,%edx
  400c19:	66 89 56 04          	mov    %dx,0x4(%rsi)
  400c1d:	0f 84 d1 02 00 00    	je     400ef4 <naiveVolumeUp+0x364>
  400c23:	0f bf 57 06          	movswl 0x6(%rdi),%edx
  400c27:	66 0f ef c0          	pxor   %xmm0,%xmm0
  400c2b:	83 f8 04             	cmp    $0x4,%eax
  400c2e:	f3 0f 2a c2          	cvtsi2ss %edx,%xmm0
  400c32:	f3 0f 59 c1          	mulss  %xmm1,%xmm0
  400c36:	f3 0f 2c d0          	cvttss2si %xmm0,%edx
  400c3a:	66 89 56 06          	mov    %dx,0x6(%rsi)
  400c3e:	0f 84 c1 02 00 00    	je     400f05 <naiveVolumeUp+0x375>
  400c44:	0f bf 57 08          	movswl 0x8(%rdi),%edx
  400c48:	66 0f ef c0          	pxor   %xmm0,%xmm0
  400c4c:	83 f8 05             	cmp    $0x5,%eax
  400c4f:	f3 0f 2a c2          	cvtsi2ss %edx,%xmm0
  400c53:	f3 0f 59 c1          	mulss  %xmm1,%xmm0
  400c57:	f3 0f 2c d0          	cvttss2si %xmm0,%edx
  400c5b:	66 89 56 08          	mov    %dx,0x8(%rsi)
  400c5f:	0f 84 b1 02 00 00    	je     400f16 <naiveVolumeUp+0x386>
  400c65:	0f bf 57 0a          	movswl 0xa(%rdi),%edx
  400c69:	66 0f ef c0          	pxor   %xmm0,%xmm0
  400c6d:	83 f8 07             	cmp    $0x7,%eax
  400c70:	f3 0f 2a c2          	cvtsi2ss %edx,%xmm0
  400c74:	f3 0f 59 c1          	mulss  %xmm1,%xmm0
  400c78:	f3 0f 2c d0          	cvttss2si %xmm0,%edx
  400c7c:	66 89 56 0a          	mov    %dx,0xa(%rsi)
  400c80:	0f 85 3b 02 00 00    	jne    400ec1 <naiveVolumeUp+0x331>
  400c86:	0f bf 57 0c          	movswl 0xc(%rdi),%edx
  400c8a:	66 0f ef c0          	pxor   %xmm0,%xmm0
  400c8e:	41 ba fa 64 cd 1d    	mov    $0x1dcd64fa,%r10d
  400c94:	41 b9 07 00 00 00    	mov    $0x7,%r9d
  400c9a:	f3 0f 2a c2          	cvtsi2ss %edx,%xmm0
  400c9e:	f3 0f 59 c1          	mulss  %xmm1,%xmm0
  400ca2:	f3 0f 2c d0          	cvttss2si %xmm0,%edx
  400ca6:	66 89 56 0c          	mov    %dx,0xc(%rsi)
  400caa:	b9 f9 64 cd 1d       	mov    $0x1dcd64f9,%ecx
  400caf:	41 bb 01 65 cd 1d    	mov    $0x1dcd6501,%r11d
  400cb5:	41 89 c0             	mov    %eax,%r8d
  400cb8:	29 c1                	sub    %eax,%ecx
  400cba:	41 29 c3             	sub    %eax,%r11d
  400cbd:	c1 e9 03             	shr    $0x3,%ecx
  400cc0:	83 c1 01             	add    $0x1,%ecx
  400cc3:	8d 2c cd 00 00 00 00 	lea    0x0(,%rcx,8),%ebp
  400cca:	66 0f ef e4          	pxor   %xmm4,%xmm4
  400cce:	4d 01 c0             	add    %r8,%r8
  400cd1:	31 c0                	xor    %eax,%eax
  400cd3:	0f 28 1d e6 04 00 00 	movaps 0x4e6(%rip),%xmm3        # 4011c0 <scale+0x10>
  400cda:	4a 8d 1c 07          	lea    (%rdi,%r8,1),%rbx
  400cde:	31 d2                	xor    %edx,%edx
  400ce0:	49 01 f0             	add    %rsi,%r8
  400ce3:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)
  400ce8:	66 0f 6f 0c 03       	movdqa (%rbx,%rax,1),%xmm1
  400ced:	66 0f 6f d4          	movdqa %xmm4,%xmm2
  400cf1:	83 c2 01             	add    $0x1,%edx
  400cf4:	66 0f 65 d1          	pcmpgtw %xmm1,%xmm2
  400cf8:	66 0f 6f c1          	movdqa %xmm1,%xmm0
  400cfc:	66 0f 69 ca          	punpckhwd %xmm2,%xmm1
  400d00:	66 0f 61 c2          	punpcklwd %xmm2,%xmm0
  400d04:	0f 5b c9             	cvtdq2ps %xmm1,%xmm1
  400d07:	0f 59 cb             	mulps  %xmm3,%xmm1
  400d0a:	0f 5b c0             	cvtdq2ps %xmm0,%xmm0
  400d0d:	0f 59 c3             	mulps  %xmm3,%xmm0
  400d10:	f3 0f 5b c9          	cvttps2dq %xmm1,%xmm1
  400d14:	f3 0f 5b c0          	cvttps2dq %xmm0,%xmm0
  400d18:	66 0f 6f d0          	movdqa %xmm0,%xmm2
  400d1c:	66 0f 61 c1          	punpcklwd %xmm1,%xmm0
  400d20:	66 0f 69 d1          	punpckhwd %xmm1,%xmm2
  400d24:	66 0f 6f c8          	movdqa %xmm0,%xmm1
  400d28:	66 0f 61 c2          	punpcklwd %xmm2,%xmm0
  400d2c:	66 0f 69 ca          	punpckhwd %xmm2,%xmm1
  400d30:	66 0f 61 c1          	punpcklwd %xmm1,%xmm0
  400d34:	41 0f 11 04 00       	movups %xmm0,(%r8,%rax,1)
  400d39:	48 83 c0 10          	add    $0x10,%rax
  400d3d:	39 ca                	cmp    %ecx,%edx
  400d3f:	72 a7                	jb     400ce8 <naiveVolumeUp+0x158>
  400d41:	41 01 e9             	add    %ebp,%r9d
  400d44:	41 29 ea             	sub    %ebp,%r10d
  400d47:	41 39 eb             	cmp    %ebp,%r11d
  400d4a:	0f 84 0d 01 00 00    	je     400e5d <naiveVolumeUp+0x2cd>
  400d50:	49 63 d1             	movslq %r9d,%rdx
  400d53:	66 0f ef c0          	pxor   %xmm0,%xmm0
  400d57:	0f bf 04 57          	movswl (%rdi,%rdx,2),%eax
  400d5b:	f3 0f 10 0d 51 04 00 	movss  0x451(%rip),%xmm1        # 4011b4 <scale+0x4>
  400d62:	00 
  400d63:	41 83 fa 01          	cmp    $0x1,%r10d
  400d67:	f3 0f 2a c0          	cvtsi2ss %eax,%xmm0
  400d6b:	f3 0f 59 c1          	mulss  %xmm1,%xmm0
  400d6f:	f3 0f 2c c0          	cvttss2si %xmm0,%eax
  400d73:	66 89 04 56          	mov    %ax,(%rsi,%rdx,2)
  400d77:	41 8d 51 01          	lea    0x1(%r9),%edx
  400d7b:	0f 84 dc 00 00 00    	je     400e5d <naiveVolumeUp+0x2cd>
  400d81:	48 63 d2             	movslq %edx,%rdx
  400d84:	66 0f ef c0          	pxor   %xmm0,%xmm0
  400d88:	0f bf 04 57          	movswl (%rdi,%rdx,2),%eax
  400d8c:	41 83 fa 02          	cmp    $0x2,%r10d
  400d90:	f3 0f 2a c0          	cvtsi2ss %eax,%xmm0
  400d94:	f3 0f 59 c1          	mulss  %xmm1,%xmm0
  400d98:	f3 0f 2c c0          	cvttss2si %xmm0,%eax
  400d9c:	66 89 04 56          	mov    %ax,(%rsi,%rdx,2)
  400da0:	41 8d 51 02          	lea    0x2(%r9),%edx
  400da4:	0f 84 b3 00 00 00    	je     400e5d <naiveVolumeUp+0x2cd>
  400daa:	48 63 d2             	movslq %edx,%rdx
  400dad:	66 0f ef c0          	pxor   %xmm0,%xmm0
  400db1:	0f bf 04 57          	movswl (%rdi,%rdx,2),%eax
  400db5:	41 83 fa 03          	cmp    $0x3,%r10d
  400db9:	f3 0f 2a c0          	cvtsi2ss %eax,%xmm0
  400dbd:	f3 0f 59 c1          	mulss  %xmm1,%xmm0
  400dc1:	f3 0f 2c c0          	cvttss2si %xmm0,%eax
  400dc5:	66 89 04 56          	mov    %ax,(%rsi,%rdx,2)
  400dc9:	41 8d 51 03          	lea    0x3(%r9),%edx
  400dcd:	0f 84 8a 00 00 00    	je     400e5d <naiveVolumeUp+0x2cd>
  400dd3:	48 63 d2             	movslq %edx,%rdx
  400dd6:	66 0f ef c0          	pxor   %xmm0,%xmm0
  400dda:	0f bf 04 57          	movswl (%rdi,%rdx,2),%eax
  400dde:	41 83 fa 04          	cmp    $0x4,%r10d
  400de2:	f3 0f 2a c0          	cvtsi2ss %eax,%xmm0
  400de6:	f3 0f 59 c1          	mulss  %xmm1,%xmm0
  400dea:	f3 0f 2c c0          	cvttss2si %xmm0,%eax
  400dee:	66 89 04 56          	mov    %ax,(%rsi,%rdx,2)
  400df2:	41 8d 51 04          	lea    0x4(%r9),%edx
  400df6:	74 65                	je     400e5d <naiveVolumeUp+0x2cd>
  400df8:	48 63 d2             	movslq %edx,%rdx
  400dfb:	66 0f ef c0          	pxor   %xmm0,%xmm0
  400dff:	0f bf 04 57          	movswl (%rdi,%rdx,2),%eax
  400e03:	41 83 fa 05          	cmp    $0x5,%r10d
  400e07:	f3 0f 2a c0          	cvtsi2ss %eax,%xmm0
  400e0b:	f3 0f 59 c1          	mulss  %xmm1,%xmm0
  400e0f:	f3 0f 2c c0          	cvttss2si %xmm0,%eax
  400e13:	66 89 04 56          	mov    %ax,(%rsi,%rdx,2)
  400e17:	41 8d 51 05          	lea    0x5(%r9),%edx
  400e1b:	74 40                	je     400e5d <naiveVolumeUp+0x2cd>
  400e1d:	48 63 d2             	movslq %edx,%rdx
  400e20:	66 0f ef c0          	pxor   %xmm0,%xmm0
  400e24:	0f bf 04 57          	movswl (%rdi,%rdx,2),%eax
  400e28:	41 83 c1 06          	add    $0x6,%r9d
  400e2c:	41 83 fa 06          	cmp    $0x6,%r10d
  400e30:	f3 0f 2a c0          	cvtsi2ss %eax,%xmm0
  400e34:	f3 0f 59 c1          	mulss  %xmm1,%xmm0
  400e38:	f3 0f 2c c0          	cvttss2si %xmm0,%eax
  400e3c:	66 89 04 56          	mov    %ax,(%rsi,%rdx,2)
  400e40:	74 1b                	je     400e5d <naiveVolumeUp+0x2cd>
  400e42:	49 63 c1             	movslq %r9d,%rax
  400e45:	66 0f ef c0          	pxor   %xmm0,%xmm0
  400e49:	0f bf 14 47          	movswl (%rdi,%rax,2),%edx
  400e4d:	f3 0f 2a c2          	cvtsi2ss %edx,%xmm0
  400e51:	f3 0f 59 c8          	mulss  %xmm0,%xmm1
  400e55:	f3 0f 2c d1          	cvttss2si %xmm1,%edx
  400e59:	66 89 14 46          	mov    %dx,(%rsi,%rax,2)
  400e5d:	5b                   	pop    %rbx
  400e5e:	5d                   	pop    %rbp
  400e5f:	c3                   	retq   
  400e60:	31 d2                	xor    %edx,%edx
  400e62:	f3 0f 10 0d 4a 03 00 	movss  0x34a(%rip),%xmm1        # 4011b4 <scale+0x4>
  400e69:	00 
  400e6a:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)
  400e70:	0f bf 04 17          	movswl (%rdi,%rdx,1),%eax
  400e74:	66 0f ef c0          	pxor   %xmm0,%xmm0
  400e78:	f3 0f 2a c0          	cvtsi2ss %eax,%xmm0
  400e7c:	f3 0f 59 c1          	mulss  %xmm1,%xmm0
  400e80:	f3 0f 2c c0          	cvttss2si %xmm0,%eax
  400e84:	66 89 04 16          	mov    %ax,(%rsi,%rdx,1)
  400e88:	48 83 c2 02          	add    $0x2,%rdx
  400e8c:	48 81 fa 02 ca 9a 3b 	cmp    $0x3b9aca02,%rdx
  400e93:	75 db                	jne    400e70 <naiveVolumeUp+0x2e0>
  400e95:	f3 c3                	repz retq 
  400e97:	66 0f 1f 84 00 00 00 	nopw   0x0(%rax,%rax,1)
  400e9e:	00 00 
  400ea0:	bd 00 65 cd 1d       	mov    $0x1dcd6500,%ebp
  400ea5:	b9 a0 ac b9 03       	mov    $0x3b9aca0,%ecx
  400eaa:	41 bb 01 65 cd 1d    	mov    $0x1dcd6501,%r11d
  400eb0:	45 31 c0             	xor    %r8d,%r8d
  400eb3:	41 ba 01 65 cd 1d    	mov    $0x1dcd6501,%r10d
  400eb9:	45 31 c9             	xor    %r9d,%r9d
  400ebc:	e9 09 fe ff ff       	jmpq   400cca <naiveVolumeUp+0x13a>
  400ec1:	41 ba fb 64 cd 1d    	mov    $0x1dcd64fb,%r10d
  400ec7:	41 b9 06 00 00 00    	mov    $0x6,%r9d
  400ecd:	e9 d8 fd ff ff       	jmpq   400caa <naiveVolumeUp+0x11a>
  400ed2:	41 ba 00 65 cd 1d    	mov    $0x1dcd6500,%r10d
  400ed8:	41 b9 01 00 00 00    	mov    $0x1,%r9d
  400ede:	e9 c7 fd ff ff       	jmpq   400caa <naiveVolumeUp+0x11a>
  400ee3:	41 ba ff 64 cd 1d    	mov    $0x1dcd64ff,%r10d
  400ee9:	41 b9 02 00 00 00    	mov    $0x2,%r9d
  400eef:	e9 b6 fd ff ff       	jmpq   400caa <naiveVolumeUp+0x11a>
  400ef4:	41 ba fe 64 cd 1d    	mov    $0x1dcd64fe,%r10d
  400efa:	41 b9 03 00 00 00    	mov    $0x3,%r9d
  400f00:	e9 a5 fd ff ff       	jmpq   400caa <naiveVolumeUp+0x11a>
  400f05:	41 ba fd 64 cd 1d    	mov    $0x1dcd64fd,%r10d
  400f0b:	41 b9 04 00 00 00    	mov    $0x4,%r9d
  400f11:	e9 94 fd ff ff       	jmpq   400caa <naiveVolumeUp+0x11a>
  400f16:	41 ba fc 64 cd 1d    	mov    $0x1dcd64fc,%r10d
  400f1c:	41 b9 05 00 00 00    	mov    $0x5,%r9d
  400f22:	e9 83 fd ff ff       	jmpq   400caa <naiveVolumeUp+0x11a>
  400f27:	66 0f 1f 84 00 00 00 	nopw   0x0(%rax,%rax,1)
  400f2e:	00 00 

For a function with one loop this seems like quite a long set of instructions even if it contains two large size integer arrays. Let’s try to use SIMD to help gcc condense this.

Loop Function using SIMD

The article above mentions that gcc does not know our arrays are aligned so it has to do extra work to account for possible cases in which they are not.

Let’s tell gcc explicitly that these arrays are aligned by calling the __builtin_assume_aligned function on the passed in int16_t* parameters and create two new pointers for our 16-bit aligned arrays.

lab5-simd.c

void naiveVolumeUp(int16_t* sample_, int16_t* newSample_)
{
    int16_t *x = __builtin_assume_aligned(sample_,16);
    int16_t *y = __builtin_assume_aligned(newSample_,16);
    
    for (int i = 0; i <= SAMPLESNUM; i++)
    {
        y[i] = x[i] * scale;
    }   
}

Compile our code again with -O3 level optimization to enable auto-vectorization.
gcc -O3 -o lab5-simd lab5-simd.c
objdump -d lab5-simd

00000000004008e0 <naiveVolumeUp>:
  4008e0:	48 8d 46 10          	lea    0x10(%rsi),%rax
  4008e4:	48 39 c7             	cmp    %rax,%rdi
  4008e7:	73 0d                	jae    4008f6 <naiveVolumeUp+0x16>
  4008e9:	48 8d 47 10          	lea    0x10(%rdi),%rax
  4008ed:	48 39 c6             	cmp    %rax,%rsi
  4008f0:	0f 82 92 00 00 00    	jb     400988 <naiveVolumeUp+0xa8>
  4008f6:	66 0f ef e4          	pxor   %xmm4,%xmm4
  4008fa:	0f 28 1d 4f 03 00 00 	movaps 0x34f(%rip),%xmm3        # 400c50 <scale+0x10>
  400901:	31 c0                	xor    %eax,%eax
  400903:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)
  400908:	66 0f 6f 0c 07       	movdqa (%rdi,%rax,1),%xmm1
  40090d:	66 0f 6f d4          	movdqa %xmm4,%xmm2
  400911:	66 0f 6f c1          	movdqa %xmm1,%xmm0
  400915:	66 0f 65 d1          	pcmpgtw %xmm1,%xmm2
  400919:	66 0f 61 c2          	punpcklwd %xmm2,%xmm0
  40091d:	66 0f 69 ca          	punpckhwd %xmm2,%xmm1
  400921:	0f 5b c0             	cvtdq2ps %xmm0,%xmm0
  400924:	0f 59 c3             	mulps  %xmm3,%xmm0
  400927:	0f 5b c9             	cvtdq2ps %xmm1,%xmm1
  40092a:	0f 59 cb             	mulps  %xmm3,%xmm1
  40092d:	f3 0f 5b c0          	cvttps2dq %xmm0,%xmm0
  400931:	66 0f 6f d0          	movdqa %xmm0,%xmm2
  400935:	f3 0f 5b c9          	cvttps2dq %xmm1,%xmm1
  400939:	66 0f 61 c1          	punpcklwd %xmm1,%xmm0
  40093d:	66 0f 69 d1          	punpckhwd %xmm1,%xmm2
  400941:	66 0f 6f c8          	movdqa %xmm0,%xmm1
  400945:	66 0f 61 c2          	punpcklwd %xmm2,%xmm0
  400949:	66 0f 69 ca          	punpckhwd %xmm2,%xmm1
  40094d:	66 0f 61 c1          	punpcklwd %xmm1,%xmm0
  400951:	0f 29 04 06          	movaps %xmm0,(%rsi,%rax,1)
  400955:	48 83 c0 10          	add    $0x10,%rax
  400959:	48 3d 00 ca 9a 3b    	cmp    $0x3b9aca00,%rax
  40095f:	75 a7                	jne    400908 <naiveVolumeUp+0x28>
  400961:	0f bf 87 00 ca 9a 3b 	movswl 0x3b9aca00(%rdi),%eax
  400968:	66 0f ef c0          	pxor   %xmm0,%xmm0
  40096c:	f3 0f 2a c0          	cvtsi2ss %eax,%xmm0
  400970:	f3 0f 59 05 28 03 00 	mulss  0x328(%rip),%xmm0        # 400ca0 <scale+0x60>
  400977:	00 
  400978:	f3 0f 2c c0          	cvttss2si %xmm0,%eax
  40097c:	66 89 86 00 ca 9a 3b 	mov    %ax,0x3b9aca00(%rsi)
  400983:	c3                   	retq   
  400984:	0f 1f 40 00          	nopl   0x0(%rax)
  400988:	31 d2                	xor    %edx,%edx
  40098a:	f3 0f 10 0d 0e 03 00 	movss  0x30e(%rip),%xmm1        # 400ca0 <scale+0x60>
  400991:	00 
  400992:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)
  400998:	0f bf 04 17          	movswl (%rdi,%rdx,1),%eax
  40099c:	66 0f ef c0          	pxor   %xmm0,%xmm0
  4009a0:	f3 0f 2a c0          	cvtsi2ss %eax,%xmm0
  4009a4:	f3 0f 59 c1          	mulss  %xmm1,%xmm0
  4009a8:	f3 0f 2c c0          	cvttss2si %xmm0,%eax
  4009ac:	66 89 04 16          	mov    %ax,(%rsi,%rdx,1)
  4009b0:	48 83 c2 02          	add    $0x2,%rdx
  4009b4:	48 81 fa 02 ca 9a 3b 	cmp    $0x3b9aca02,%rdx
  4009bb:	75 db                	jne    400998 <naiveVolumeUp+0xb8>
  4009bd:	f3 c3                	repz retq 
  4009bf:	90                   	nop

The instructions for our function has now been significantly condensed since we’ve explicitly told gcc that the arrays used inside our loop are 16-bit aligned.

Size of binaries – with and without auto-vectorization:

[lisac@localhost lab6-vectorization]$ size lab5
   text    data     bss     dec     hex filename
   4666     580       4    5250    1482 lab5
[lisac@localhost lab6-vectorization]$ size lab5-simd
   text    data     bss     dec     hex filename
   3222     580       4    3806     ede lab5-simd

Algorithm Selection – adjusting a sequence of sound samples by a volume scale factor

The purpose of this exercise is to compare the different algorithm approaches for adjusting a sequence of sound samples. We can analyze how the different gcc optimization levels increase or decrease the performance for each of them on both x86_64 and Aarch64 architectures.

Creating audio sample

First, we will create an audio sample to test with. Since we would like to be able to compare the runtimes between each algorithm, a substantial data set is required, so we will create a sequence of 500000000 sound samples. This will be coded in c, and we will store the sound samples inside a signed int16 array (-32768 to 32767 range).

#define SAMPLESNUM 500000000

void createAudioSample(int16_t* sample_)
{
    for(int i = 0; i <= SAMPLESNUM; i++)
    {
        sample_[i] = rand();
    }
}

int main()
{
    int16_t* sample = malloc(SAMPLESNUM*sizeof(int16_t));
    createAudioSample(sample);
}

Increase Volume

First algorithm – “Naive” approach

The first algorithm written will increase the volume by simply multiplying each sound sample by a volume scale factor then storing the result into a new array.

const float scale = 0.5; // volume scaling factor

void naiveVolumeUp(int16_t* sample_, int16_t* newSample_)
{
    for (int i = 0; i <= SAMPLESNUM; i++)
    {
        newSample_[i] = sample_[i] * scale;
    }
}

Second algorithm – Look up table approach

For this approach, we will create a lookup table with all possible values from -32768 to 32767 multiplied by the volume scale factor (in our case 0.5). We can then reference the lookup later on to adjust our volume by the scale factor.

#define MAXSIZE 65536 // maximum size for signed 16 bit integer
#define HALF 32767 // half of max size for signed 16 bit integer

...

void lookupTableVolumeUp(int16_t* sample_, int16_t* newSample_)
{
    // Create Lookup table
    int16_t lookupTable[MAXSIZE];
    for (int counter = 0; counter < MAXSIZE; counter++)
    {
        lookupTable[counter] = ((counter - HALF )*scale);
    }

    // Increase using lookupTable
    for (int i = 0; i <= MAXSIZE; i++)
    {
        newSample_[i] = lookupTable[sample_[i] + HALF];
    }
}

Here is a function to calculate our functions’ execution times:

void printExecTime(struct timeval t1, struct timeval t2)
{
    double elapsed;
 
    elapsed = (t2.tv_sec - t1.tv_sec) + 1e-6 * (t2.tv_usec - t1.tv_usec);

    printf("elapsed: %.8lf seconds\n", elapsed);
}

And our main:

int main()
{
   struct timeval t1, t2;
   int16_t* sample = malloc(SAMPLESNUM*sizeof(int16_t));

   createAudioSample(sample);
   printf("\nAudio sample\n============\n");
   printSpecifiedRange(sample, 0, 7);

   int16_t* newSample = malloc(SAMPLESNUM*sizeof(int16_t));
   printf("\nNaive volume up\n===============\n");
   gettimeofday(&t1, NULL); // starting time
   naiveVolumeUp(sample, newSample); // start naive test
   gettimeofday(&t2, NULL); // end time
   printExecTime(t1, t2);
   printSpecifiedRange(newSample, 0, 7);

   free(newSample);
   newSample = malloc(SAMPLESNUM*sizeof(int16_t));
   printf("\nLookup volume up\n================\n");
   gettimeofday(&t1, NULL); // starting time
   lookupTableVolumeUp(sample, newSample); // start lookup table approach 
   gettimeofday(&t2, NULL); // end time
   printExecTime(t1, t2);
   printSpecifiedRange(newSample, 0, 7);

    return 0;
}

Compile our code:
gcc -o lab5 lab5.c
time ./lab5

Audio sample
============
sample[0]=17767
sample[1]=9158
sample[2]=-26519
sample[3]=18547
sample[4]=-9135
sample[5]=23807
sample[6]=-27574
sample[7]=22764

Naive volume up
===============
elapsed: 6.56288600 seconds
sample[0]=8883
sample[1]=4579
sample[2]=-13259
sample[3]=9273
sample[4]=-4567
sample[5]=11903
sample[6]=-13787
sample[7]=11382

Lookup volume up
================
elapsed: 0.11074700 seconds
sample[0]=8883
sample[1]=4579
sample[2]=-13259
sample[3]=9273
sample[4]=-4567
sample[5]=11903
sample[6]=-13787
sample[7]=11382

real	0m16.653s
user	0m11.863s
sys	0m1.145s

We can see the lookup approach is signficantly faster. Now we’ll test the different optimization levels on both Xerxes(x86_64) and Betty(Aarch64) machines.

Runtime w/ optimization levels

I compiled the code with each optimization level from O0 to O3, i.e:
gcc -O0 -o lab5 lab5.c

gcc -O3 -o lab5 lab5.c
c99 -O0 -o lab5 lab5.c

c99 -O3 -o lab5 lab5.c

And recorded the results into a spreadsheet:

xerxes_betty

The Functions section of the report shows both functions’ execution times, and is displayed in seconds.
The first thing I noticed with O0 optimization was that the ratio between Naive and Lookup was significantly wider on Betty than on Xerxes. The Naive function on Xerxes comparing O0 to O1 improves from 3.84 to 1.18. On Betty, it improves from 8.94 to 1.79. The ratio between the two functions from O1 to O3 begin to even out and are relatively similar between the two architectures.

The time section of the report displays results from the time ./lab5 command. This simply displays the runtimes for the whole program execution in real, user and sys time.

The size section displays results from the size lab5 command.
We can see that all data sizes are equally larger on Betty than on Xerxes. One thing that stands out is the O3 level optimization on Xerxes is 4634 bytes, while on Betty it is 2934 bytes.

Fixing bug in Thimble project using Persistence DevTools

This is a continuation to my previous blog on setting up Brackets, Thimble and choosing a bug to work on. Now that I have my local instance of Thimble installed, I can start working on the bug.

Bug to fix

The issue here is that renaming a project is not recognized as a change in an already published project, so the “Update published version” button is not available.

  • For example, an already published project named: ‘My Project’
  • thimble_savebutton

  • Rename it to ‘My Project renamed’ and click ‘Save’:
  • thimble_projectrename

  • Now if we click on the ‘Publish’ button, the ‘Update published version’ button is not available after renaming the project:
  • thimble_publishnobutton

Setting up workspace

In the previous post, I set up my local Thimble instance by Forking thimble.mozilla.org into my repo, then cloning the project from my repository. To start working on this bug, I created a new branch (lkisac_issue_719). This guide explains how to set up “Persistence with DevTools Workspaces”, so that I could edit the files on my local machine and have the changes applied server-side on a browser refresh. This is essentially done by mapping files/folders to your local computer.

Drag and drop your local project folder (thimble.mozilla.org) to the Source navigator pane in DevTools.

devtools_workspace

You can right click the filename in the Sources editor, then select ‘Reveal in navigator’ to make sure the file your working on is the one from your local folder.

Issues

Vagrant had to be reloaded each time in order to see the changes. This was fixed in this commit, so I pulled these latest changes to my repository.

I also had an issue with my vagrant commands returning this error:

There was an error while executing `VBoxManage`, a CLI used by Vagrant
for controlling VirtualBox. The command and stderr is shown below.
Command: ["showvminfo", "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00"]
Stderr:

Earlier while I was running ‘vagrant up’ I ran into the Windows BSOD (blue screen of death) which interrupted vagrant’s start up. After restarting my computer and running vagrant commands, I would get this error. I searched online for related issues but could not find a solution that worked (including doing a clean installation of vagrant and virtual box and following some suggestions from here). I finally came across this thread where the “.vagrant” file was mentioned. This is a metadata file that stores project info and must have been gotten corrupt when vagrant was interrupted. I tried moving this file outside of my project folder and was able to get vagrant up and running.

During development, one thing I ran into sometimes was after leaving thimble running in vagrant for a while then coming back to it, I would suddenly run into some 401 errors on my Thimble instance. A few ways I got around this was to:

  • Select ‘Disable cache’ under the Network tab in DevTools (this is a useful setting while doing web development for any cached data to not interfere while testing. It won’t affect your regular browser’s cache settings)
  • Sign out of Thimble account then sign back in

Code changes

One of the maintainers provided some guidance on which sections of code would need changing. There were 5 files I modified in the code.

publisher.js

  • Changed handleFileEvent to be on the Publisher prototype
  • Renamed handleFileEvent to showUnpublishedChangesPrompt

main.js

  • Moved ProjectRenameUtility.init call to bramble-ui-bridge.js
  • Added ProjectRenameUtility to BrambleEditor create options

bramble-ui-bridge.js

  • Now contains call to ProjectRenameUtility.init() after publisher init
  • init function receives csrfToken, appUrl and ProjectRenameUtility

bramble-editor.js

  • Pass csrfToken, appUrl and ProjectRenameUtility to BrambleUIBridge.init call.

project-rename.js

  • Call new showUnpublishedChangesPrompt from rename utility’s save function
  • Include publisher in the require, stored in var Publisher
  • Added publisher and bramble parameters in ProjectRenameUtility constructor and stored in this

Setting up breakpoints in DevTools was really useful for working through the code and testing functionality.

Example for setting a breakpoint at line 343

You can click on the line number in the online Sources editor to set the breakpoint.

devtools_breakpoints

Then step through the code with these commands (similar to Visual Studio):

  • F10 – step over
  • F11 – step in
  • Shift + F11 – step out of function

After much trial and error, I was finally able to see the ‘Update published version’ button after renaming the project.

thimble_publishbutton

Create Pull request

After I was able to get the expected results, I committed my changes to my lkisac_issue_719 branch and created my Pull Request.
When you commit code to open source projects, there will be one or more test builds that will likely need to be run to check if the code is valid with those tests. After pushing my changes to git, my first commit failed in the Travis CI build. It was noted that the bramble object in project-rename.js was undefined. So I fixed it in this commit, and the build passed successfully.
Additionally, the maintainer also provided some feedback so they could fully land this patch in Thimble. So I went back into my workspace, made the necessary changes, tested the functionality again, and committed the changes back to git.

Conclusion

It had been a while since I worked with much JavaScript code, so it took a little getting used to (prototypes, chaining, etc). Reading over mozilla’s documentation was helpful in understanding the function calling and how the objects were being passed so I could add my changes correctly.
The most difficult and time consuming part for me in this whole process was getting my workspace set up properly before even getting started on the code. Once I was finally set up and able to write and test code, it was a good experience working on the Thimble code and contributing to this open source project.

Analyzing C compiler options, ELF files & optimization on x86_64 architecture

Compiling a simple hello world C program and exploring some of the different compiler options and how they affect the ELF (Executable and Linkable Format) executable file. This will be done on an x86-64 architecture system.

hello.c

#include <stdio.h>

int main()
{
  printf("Hello world!\n");
}

gcc options:

-g           # enable debugging information
-O0          # do not optimize
-fno-builtin # do not use builtin function optimizations

objdump options:

-f       # display header information for the entire file
-s       # display per-section summary information
-d       # disassemble sections containing code
--source # (implies -d) show source code, if available, along with disassembly

To find out which section headers in the executable contain our code, we will use the -d option:

objdump -d hello

The 3 disassembled sections we see are .init, .plt and .text.

To find out which section contains our string to be printed, we use:

objdump -s hello

If we search for the string “Hello”, we will see that it is inside of the .rodata section.

Contents of section .rodata:

400590 01000200 00000000 00000000 00000000 ................
4005a0 48656c6c 6f20776f 726c6421 0a00     Hello world!..

Now we will compare our original ELF file with new ones using different compiler options:

gcc -g -O0 -fno-builtin hello.c -o hello

ls -l hello
Size of the executable is 11000 bytes.

objdump --source hello


00000000004004f6 :
#include

int main()
{
4004f6: 55             push %rbp
4004f7: 48 89 e5       mov %rsp,%rbp
printf("Hello world!\n");
4004fa: bf a0 05 40 00 mov $0x4005a0,%edi
4004ff: b8 00 00 00 00 mov $0x0,%eax
4004ff: e8 ec fe ff ff callq 4003f0 &lt;puts@plt&gt; 400504: b8 00 00 00 00 mov $0x0,%eax
}
400509: 5d             pop %rbp
40050a: c3             retq
40050b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)

Adding/Removing compile options

1. Adding the -static option

gcc -g -O0 -fno-builtin -static hello.c -o hello-1

Size of ELF file has grown to 916088 bytes since the dynamic libraries are added to the binary file. This is good for portability since the binary does not require dependencies at runtime in order to run.

2. Removing -fno-builtin option

gcc -g -O0 hello.c -o hello-2

objdump --source hello-2

00000000004004f6 :
#include

int main()
{
4004f6: 55             push %rbp
4004f7: 48 89 e5       mov %rsp,%rbp
printf("Hello world!\n");
4004fa: bf a0 05 40 00 mov $0x4005a0,%edi
<del datetime="2017-02-02T16:23:20+00:00">4004ff: b8 00 00 00 00 mov $0x0,%eax</del>
400504: e8 e7 fe ff ff callq 4003f0 &lt;printf@plt&gt; 400509: b8 00 00 00 00 mov $0x0,%eax
}
40050e: 5d             pop %rbp
40050f: c3             retq

Since built-in function optimization is not enabled, we can see one instruction has been removed from the ELF file – size is now 8528 bytes.

3. Removing -g option

gcc -O0 hello.c -o hello-3

00000000004004f6 :
4004f6: 55             push %rbp
4004f7: 48 89 e5       mov %rsp,%rbp
4004fa: bf a0 05 40 00 mov $0x4005a0,%edi
4004ff: e8 ec fe ff ff callq 4003f0 &lt;puts@plt&gt; 400504: b8 00 00 00 00 mov $0x0,%eax
400509: 5d             pop %rbp
40050a: c3             retq
40050b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)

This compiles without attaching debugging information, and we cannot see the inline source code – size is now 917680 bytes.

4. Adding additional arguments to printf() function

To analyze the differences in registers used for every argument added, 10 sequential integer arguments were added to the printf() function:

hello-4.c

#include

int main()
{
printf("Hello World!, %d%d%d%d%d%d%d%d%d%d\n", 1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
}

gcc -O0 -g hello.c -o hello-4

objdump --source lab4-4

00000000004004f6 <main>:
#include <stdio.h>

int main()
{
  4004f6:       55                      push   %rbp
  4004f7:       48 89 e5                mov    %rsp,%rbp
    printf("Hello world!, %d%d%d%d%d%d%d%d%d%d\n", 1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
  4004fa:       48 83 ec 08             sub    $0x8,%rsp
  4004fe:       6a 0a                   pushq  $0xa
  400500:       6a 09                   pushq  $0x9
  400502:       6a 08                   pushq  $0x8
  400504:       6a 07                   pushq  $0x7
  400506:       6a 06                   pushq  $0x6
  400508:       41 b9 05 00 00 00       mov    $0x5,%r9d
  40050e:       41 b8 04 00 00 00       mov    $0x4,%r8d
  400514:       b9 03 00 00 00          mov    $0x3,%ecx
  400519:       ba 02 00 00 00          mov    $0x2,%edx
  40051e:       be 01 00 00 00          mov    $0x1,%esi
  400523:       bf d0 05 40 00          mov    $0x4005d0,%edi
  400528:       b8 00 00 00 00          mov    $0x0,%eax
  40052d:       e8 be fe ff ff          callq  4003f0 <printf@plt>
  400532:       48 83 c4 30             add    $0x30,%rsp
  400536:       b8 00 00 00 00          mov    $0x0,%eax
}
  40053b:       c9                      leaveq 
  40053c:       c3                      retq   
  40053d:       0f 1f 00                nopl   (%rax)

From lines 21 to 15, you can see each argument being stored into registers with the mov) function up until it reaches the 6th argument where it uses pushq.

5. Move printf to new function and call function from main

hello-5.c

#include <stdio.h>

void output()
{
    printf("Hello world!\n");
}

int main()
{
    output();
 
    return 0;
}

g++ -g -fno-builtin -O0 hello-5.c -o hello5

We can see a new line for main, where we call the new output function:

00000000004005cc <main>:
...
4005d0:       e8 e1 ff ff ff          callq  4005b6 <_Z6outputv>
...

And the output function:

00000000004005b6 <_Z6outputv>:
#include <stdio.h>

void output()
{
  4005b6:       55                      push   %rbp
  4005b7:       48 89 e5                mov    %rsp,%rbp
    printf("Hello world!\n");
  4005ba:       bf 70 06 40 00          mov    $0x400670,%edi
  4005bf:       b8 00 00 00 00          mov    $0x0,%eax
  4005c4:       e8 e7 fe ff ff          callq  4004b0 <printf@plt>
}
  4005c9:       90                      nop
  4005ca:       5d                      pop    %rbp
  4005cb:       c3                      retq

6. Add optimization O3

We previously compiled our original hello world program with the -O0 optimization.
From mangcc:
-O0 Reduce compilation time and make debugging produce the expected results. This is the default.
-O3 Optimize yet more. -O3 turns on all optimizations specified by -O2 and also turns on the -finline-functions, -funswitch-loops, -fpredictive-commoning, -fgcse-after-reload, -ftree-loop-vectorize, -ftree-loop-distribute-patterns, -fsplit-paths -ftree-slp-vectorize, -fvect-cost-model, -ftree-partial-pre and -fipa-cp-clone options.

Set O3 optimization:
g++ -g -fno-builtin -O3 hello.c -o hello_O3

Our original binary file size was 11000 bytes, whereas hello_O3 is 11248.

Now we’ll compare the two ELF files and run a diff comparison:


readelf -h hello > hello_readelfh.txt
readelf -h hello6 > hello6_readelfh.txt
diff hello_readelfh.txt hello6_readelfh.txt

< hello_readelfh.txt
> hello6_readelfh.txt

11c11
<   Entry point address:               0x4004c0
---
>   Entry point address:               0x4004e0
13c13
<   Start of section headers:          8784 (bytes into file)
---
>   Start of section headers:          8944 (bytes into file)
19,20c19,20
<   Number of section headers:         35
<   Section header string table index: 32
---
>   Number of section headers:         36
>   Section header string table index: 33

The start of the section headers starts 160 bytes later than the O0 optimized binary, and there is one extra section header for O3.

Git branching, merging & rebasing (Spoon-Knife repo example)

This post will go over some examples for branching, merging and rebasing using GitHub’s Spoon-Knife repository.

Fork & Clone repository

  1. Fork the Spoon-Knife repository from GitHub
  2. Clone repository to local computer:
    git clone https://github.com/lkisac/Spoon-Knife.git
    Confirm that we are on the master branch:
    git branch
  3. Create a new file called name.txt with first name in it:
    vi name.txt

    Len Isac
    
  4. Add and commit the name.txt file to Git:
    git add name.txt
    git commit -m "added name.txt file with first name"
  5. Confirm that you have 1 new commit on the master branch:
    git log

    commit d87e3b4dc4a9497d68fcbbfe3247409225d4a959
    Author: Len Isac Seneca <len_isac@hotmail.com>
    Date:   Sat Feb 4 10:48:55 2017 -0500
    
        added name.txt file with first name
    
  6. Since we would like to do this work on a new branch instead, we’ll create a new “name” branch, and merge it back into master later:
    git checkout -b name
    Confirm we are on newly created “name” branch:
    git branch
    Check to see that the master and name branch both point to the same commit:
    git show name
    git show master

    Since we created the “name” branch from inside the “master” branch, git will create the new branch pointing to master’s commit (“added name.txt file…”).
  7. Move the master branch back one commit:
    git checkout -B master HEAD~1
  8. Check that both branches are pointing to different commits now:
    git show name
    git show master

Merging

  1. Create third branch named “fav-food” from the master branch.
    git checkout -b fav-food
  2. Create food.txt file with a list of favourite foods
    vi food.txt

    Spaghetti
    Toast
    Beef curry
    
  3. Add and commit
    git add food.txt
    git commit -m "added food.txt - list of favourite foods"
  4. Create fourth branch named “name-and-food” from the name branch commit. Since we are inside the fav-food branch but want it to point to the “name” branch’s commit, we need specify it in the second parameter:
    git checkout -b name-and-food name
    Confirm new merge commit “Merge branch ‘fav-food’ into name-and-food”
    git log --max-count=1
    git ls

    We can also see now both the food.txt and name.txt files exist in this branch.

Merge conflicts

To demonstrate merge conflicts, 3 branches that contain the same “animals.txt” file will be created.

  1. Create “animals” test branch from master:
    git checkout -b animals master
  2. Now we’ll create an “animals.txt” file with 3 farm animals.
    vi animals.txt

    cow
    lamb
    horse
    
  3. Add & commit:
    git add animals.txt
    git commit -m "added animals.txt"
  4. Create a second “water-animals” branch pointing to the animals branch commit:
    git checkout -b water-animals
  5. Edit the “animals.txt” file and add 3 water animals:
  6. vi animals.txt

    cow
    lamb
    horse
    dolphin
    whale
    seahorse
    
  7. Add & commit:
    git add animals.txt
    git commit -m "added three water animals to animals.txt
  8. Create a third branch named “jungle-animals pointing to the animals branch commit:
    git checkout -b jungle-animals animals

  9. Edit the “animals.txt” file and add 3 jungle animals:
  10. vi animals.txt

    cow
    lamb
    horse
    tiger
    lion
    bear
    
  11. Add & commit:
    git add animals.txt
    git commit -m "added three jungle animals to animals.txt
  12. Switch to animals branch and merge water-animals into it:
    git checkout animals
    git merge water-animals

    animals.txt now contains the three water animals added in the water-animals branch.

    Updating b11d23a..3fce797
    Fast-forward
     animals.txt | 3 +++
     1 file changed, 3 insertions(+)
    

    Since the branch was ahead by 1 commit, git performs a fast-forward, which aligns the current branch with the branch being merged. git log shows its commit message, not the merge commit message this time.

  13. Now we’ll try merging the “jungle-animals” branch into animals
  14. git merge jungle-animals

    Auto-merging animals.txt
    CONFLICT (content): Merge conflict in animals.txt
    Automatic merge failed; fix conflicts and then commit the result.
    

    git status shows us that we have unmerged paths and need to fix conflicts, then run “commit”.

  15. Fix merge conflicts
  16. vi animals.txt

    cow
    lamb
    horse
    <<<<<<< HEAD
    dolphin
    whale
    seahorse
    =======
    tiger
    lion
    bear
    >>>>>>> jungle-animals
    

    HEAD (animals) and jungle-animals both use the same lines but have different contents. We have to choose which one we want or if we want both, then remove the conflict markers. If I simply remove the conflict markers, I can keep the contents from both branches making it six lines instead of three.

    cow
    lamb
    horse
    dolphin
    whale
    seahorse
    tiger
    lion
    bear
    
  17. Add the fixed file and commit the merge
  18. git add animals.txt
    git commit

    Merge branch 'jungle-animals' into animals
    
    Conflicts:
        animals.txt
    #
    # It looks like you may be committing a merge.
    # If this is not correct, please remove the file
    #   .git/MERGE_HEAD
    # and try again.
    
    
    # Please enter the commit message for your changes. Lines starting
    # with '#' will be ignored, and an empty message aborts the commit.
    # On branch animals
    # All conflicts fixed but you are still merging.
    #
    # Changes to be committed:
    #   modified:   animals.txt
    

    We can remove the “Conflicts: …” lines and commented lines, then save & exit.

    Merge branch 'jungle-animals' into animals
    

Rebasing, Squashing

Rebasing is a great tool to clean up your commits before being pushed to a remote repository. Although it can be done after pushing, it is not recommended unless you know for sure your commits have not been or not being used by any other developers.

We can combine a series of commits into one. This is called “squashing”.

git rebase -i master

pick b11d23a added animals.txt
pick 3fce797 added three water animals to animals.txt
pick 75e509d added three jungle animals to animals.txt
  • I want to combine all three commits into one, so I can squash the two later ones into the first b11d23a:
    pick b11d23a added animals.txt
    squash 3fce797 added three water animals to animals.txt
    squash 75e509d added three jungle animals to animals.txt
    

    Save and exit (Shift + ZZ)
    We see this error message:

    error: could not apply 75e509d... added three jungle animals to animals.txt
    
    When you have resolved this problem, run "git rebase --continue".
    If you prefer to skip this patch, run "git rebase --skip" instead.
    To check out the original branch and stop rebasing, run "git rebase --abort".
    
    Could not apply 75e509dbf9a48559f6ac5d6335554c240c72c45a... added three jungle animals to animals.txt
    

    Git replays each commit one-by-one on top of master and runs into the same conflict we had earlier. So we can fix the conflicts again by removing the conflict markers, then run:
    git add animals.txt

  • We are still running git rebase – since we have fixed the conflict we can continue with:
    git rebase --continue

     # This is a combination of 3 commits.
     # The first commit's message is:
     added animals.txt - three farm animals
    
     # This is the 2nd commit message:
    
     added three water animals to animals.txt
    
     # This is the 3rd commit message:
    
     added three jungle animals to animals.txt
    
     # Please enter the commit message for your changes. Lines starting
     # with '#' will be ignored, and an empty message aborts the commit.
     # rebase in progress; onto d0dd1f6
     # You are currently rebasing branch 'animals' on 'd0dd1f6'.
     #
     # Changes to be committed:
     #   new file:   animals.txt
    
  • We’ll leave the commit message as is with all three commit messages.

  • Save changes Shift + ZZ.
  • detached HEAD d3f629e] added animals.txt
     1 file changed, 9 insertions(+)
     create mode 100644 animals.txt
    Successfully rebased and updated refs/heads/animals.
    
  • Display rebase commit message
  • git log

    commit d3f629e2a2cad4bd61ced309dd15a06737d7ab4d
    Author: Len Isac Seneca <len_isac@hotmail.com>
    Date:   Sat Feb 4 13:39:29 2017 -0500
    
        added animals.txt
    
        added three water animals to animals.txt
    
        added three jungle animals to animals.txt
    

    Three last commits were successfully squashed into one commit.

    Writing Assembler code for x86_64 & Aarch64 architectures

    Printing a loop message output on both x86_64 and Aarch64 architectures.

    Output:

    Loop: 00 
    Loop: 01 
    Loop: 02 
    Loop: 03 
    Loop: 04 
    Loop: 05 
    Loop: 06 
    Loop: 07 
    Loop: 08 
    Loop: 09 
    Loop: 10 
    Loop: 11 
    ...
    Loop: 30
    

    x86_64

    This is a reference guide to what our group followed in order to generate the output above on a x86_64 architecture.

    Source code

    .text                         /* or section.text - used for keeping the actual code */
    .globl    _start              /* tells kernel where program execution begins */
    
    start = 0                     /* starting value for the loop index; note that this is a symbol (constant), not a variable */
    max = 31                      /* loop exits when the index hits this number (loop condition is i<max) */
    zero = 48                     /* start of decimal notation for character 0 */
    
    _start:
        mov     $start,%r15       /* loop index start is in register 15 */
        mov     $10,%r13          /* put value 10 into register 13 */
    
    loop:
        mov     $zero,%r14        /* loop index */
        mov     %r15,%rax         /* move data from register 15 into accumulator register */
        mov     $0,%rdx           /* put value 0 into data register */
    
        div     %r13              /* divide rax by register 13 and put quotient into rax and remainder into rdx*/
    
        mov     $zero,%r14        /* loop index */
        add     %rax,%r14         /* add quotient(in rax) to register 14 */
        movb    %r14b,msg+6       /* add 6 bits to register 14 */
    
        mov     $zero,%r14        /* loop index */
        add     %rdx,%r14         /* add remainder(in rdx) to register 14 */
        movb    %r14b,msg+7       /* add 7 bits to register 14*/
    
        movq    $len,%rdx
        movq    $msg,%rsi
        movq    $1,%rax
        syscall
    
        inc     %r15              /* increment index */
        cmp     $max,%r15         /* see if we're done */
        jne     loop              /* loop if we're not */
    
        mov     $0,%rdi           /* exit status */
        mov     $60,%rax          /* syscall sys_exit */
        syscall
    
    .data                         /* for SEGFAULT error, stores as read/write */
    
    msg:    .ascii "Loop:    \n"
            len = . - msg
    

    Our idea here was to leave some blank space in our msg string “Loop:    \n”, and have the quotient and remainder of 10 divided by the loop index added to the address of the msg + 6 bytes for the first digit and msg + 7 bytes for the second digit to produce the 2-digit decimal number after “Loop: “.

    Layout asm

    gdb’s layout asm was a useful debugging tool to help me figure out exactly what was going on in the memory and registers.

    So for example, on the source code above(“lab3.s”), you could do:
    as -g -o lab3.o lab3.s

    • -o: output filename
    • -g: request that symbol information be attached to the binary for debugging purposes

    Run the linker:
    ld -o lab3 lab3.s

    Run GNU debugger:
    gdb lab3

    Some useful debugging commands:

    • Set a breakpoint: b 19
    • Start program: r
    • Switch to asm layout:layout asm
    • Step one line: s
    • Step over (don’t go into function call): n
    • Print value in register: info register 14 or i r 14
    • Print value in specific address: print "%s\n", 0x30
    • Add x number of bits to an address:

      print "%s\n", 0x30+4
      print "%s\n", 0x30+8
      etc.

    Aarch64

    Here is a reference to replicate the same output on the AArch64 architecture.

    I was able to replicate a similar solution to the one for the x86_64 architecture.

    .text
    .globl _start
    
    start = 0   /* loop index */
    max = 31    /* number to stop loop at */
    num = 48   /* starting ascii value for 0 number */
    /* index = %r10 - to store register 10 into index macro */
    
    _start:
        mov     x3,start
    
    loop:
        mov     x4,10           /* for calculating quotient/remainder */
    
        /* calculate quotient - store in x6 */
        udiv    x6,x3,x4      /* x6 = x3 / 10 */
    
        /* calculate remainder - store in x7 */
        msub    x7,x4,x6,x3   /* x7 = x3 - (10 * x6) */
    
        adr     x1, msg         /* msg location memory address */
        add     x6,x6,num       /* atoi */
        add     x7,x7,num       /* atoi */
        strb    w6,[x1,6]       /* store in msg + 6 bytes memory location */
        strb    w7,[x1,7]       /* store in msg + 7 bytes memory location */
    
        /* print */
        mov     x0,1            /* file descriptor: 1 is stdout */
        mov     x2,len
        mov     x8,64           /* write is syscall #64 */
        svc     0               /* invoke syscall */
    
        add     x3,x3,1         /* increment index */
        cmp     x3,max          /* check for end of loop */
        bne     loop            /* loop if compare returns false */
    
        mov     x0,0            /* status -> 0 */
        mov     x8,93           /* exit is syscall #93 */
        svc     0               /* invoke syscall */
     
    .data
    
    msg:    .ascii  "Loop:    \n"
            len = . - msg
    
    

    Differences between the x86_64 and Aarch64 architectures syntax:

    1. Register names:
      • x86_64: prefixed by %
      • Aarch64: not prefixed
    2. Immediate values:
      • x86_64: prefixed by $
      • Aarch64: not prefixed
    3. Functions:
      • x86_64:
      • div %r13 – calculates the quotient and remainder.  It takes one register argument (%r13), and divides value in rax from it, storing the quotient in rax, and remainder in rdx

      • Aarch64: two operator functions are required for this.
      • udiv x6,x3,x4 – divides x3 by x4 and stores quotient into x6
        msub x7,x4,x6,x3 – subtracts x3 by (x4 * x6) and stores remainder in x7

    4. Arguments:
      • x86_64: Data sources are given as first argument
      • Aarch64: Destination is given as first argument

    Conclusion

    Wrapping my head around how the different registers work, memory operations and overall syntax was difficult, but once I had a look at the disassembly, I was able to get a better view of what was going on in memory after analyzing each executed line of code.  After that, translating the syntax from one architecture to another was relatively simple, at least for what was required for generating this output.

    More useful links:
    ARM Instruction Set Overview