Dialectronics - the dialect of electronic communications








PowerPC

11/3/22: Patches to fix pcc stack frame calculations on powerpc

While reviewing how other RISC CPUs handle argument register overflow so that I could implement it correctly on RISCV, I realized that the powerpc arch was not putting overflow parameters in the correct location. The main problem was that arch/powerpc/code.c calculated the overflow location from r3, which is the first parameter register, not r11, which is the first register that will overflow. The second issue was that the calculated space between the stack pointer and the overflow location was 24 bytes instead of the 8 specified by the ELF ABI. Last, the stack frame calculation itself was overstating its needs. Along the way I fixed an issue where r0 and r2 were incorrectly labeled, leading r2 to be used as a volatile register and r0 as non-volatile (the reverse is true).

Previously, this
  printf("sequence %s (%d) found at %d %d %d sequence %s (%d) found at %d %d %d\n", sq, sq_l, x, y, x, sq, sq_l, x, y, x);        
would result in
        stw r4,64(r1)
        stw r5,60(r1)
        stw r4,56(r1)
        mr r10,r11       ; assign r11 to r10
        mr r9,r3       ; assign r3 to r9
        mr r8,r4       ; assign r4 to r8
        mr r7,r5       ; assign r5 to r7
        mr r6,r4       ; assign r4 to r6
        mr r5,r11       ; assign r11 to r5
        mr r4,r3       ; assign r3 to r4
        lis r3,L411@ha
        addi r3,r3,L411@l
        bl printf       ; call (args, no result) to scon
The 56(r1) offset is too high. The patches below correct this by placing the overflow parameters 8 bytes above the SP, consistent with the ELF ABI. Possibly Mach-O needs them elsewhere, in which case let me know. I also completely revamped the prolog and epilog code.
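As a rough illustration of where the overflow words land after the fix, here is a minimal C sketch. It is not taken from the patches: the function name and constants are hypothetical, and it assumes 32-bit integer or pointer arguments only, eight argument GPRs (r3 through r10), and overflow words starting 8 bytes above the stack pointer as described above. Floating point and 64-bit arguments follow additional rules that are not modeled here.

```c
/* Hypothetical helper, not part of pcc. Assumes 32-bit integer/pointer
 * arguments, eight argument GPRs (r3..r10), and overflow words that
 * start at 8(r1), as described in the text above. */
enum {
    NUM_ARG_GPRS  = 8,  /* r3..r10 carry the first eight integer args */
    OVERFLOW_BASE = 8,  /* first overflow word sits at 8(r1) */
    WORD_SIZE     = 4   /* each overflow slot is one 32-bit word */
};

/* Returns the byte offset from r1 for the zero-based integer argument
 * index, or -1 when that argument is passed in a register instead. */
int overflow_offset(int arg_index)
{
    if (arg_index < NUM_ARG_GPRS)
        return -1;
    return OVERFLOW_BASE + WORD_SIZE * (arg_index - NUM_ARG_GPRS);
}
```

Under these assumptions the ninth and tenth integer arguments (indices 8 and 9) map to 8(r1) and 12(r1), which matches the two stw instructions in the corrected listing.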

Correct code, including a floating point argument substituted at the fifth parameter, which does not count against the overflow of GPRs:
        mr r0,r30       ; preserve FPREG
        mr r30,r1       ; establish frame pointer
        stwu r1,-32(r1) ; move the stack pointer
        stw r0,-4(r30)  ; save FPREG relative to frame pointer
        stw r30,0(r1)   ; save previous stack pointer
        mflr r0
        stw r0,4(r1)
.L403:
        mr r6,r4       ; assign r4 to r6
        mr r7,r5       ; assign r5 to r7
.L407:
        lis r4,.L409@ha
        lfd f0,.L409@l(r4)
        lis r4,.L411@ha       ; load sname
        lfd f2,.L411@l(r4)
        fadd f1,f0,f2       ; (l)double add
        stw r6,12(r1)
        stw r7,8(r1)
        mr r10,r6       ; assign r6 to r10
        mr r9,r0       ; assign r0 to r9
        mr r8,r3       ; assign r3 to r8
        mr r5,r0       ; assign r0 to r5
        mr r4,r3       ; assign r3 to r4
        lis r3,.L413@ha
        addi r3,r3,.L413@l
        bl printf       ; call (args, no result) to scon
        li r3,0
        lwz r30,-4(r30) ; restore FPREG
        lwz r0,4(r1)    ; reload saved link register
        mtlr r0         ; restore link register
        lwz r1,0(r1)    ; restore stack pointer
        blr
macdefs.h.diff
table.c.diff
code.c.diff
local.c.diff
local2.c.diff

A long time ago: As noted in our Words area, at one time IBM's developerWorks Power Architecture Zone was going to carry a series written by tim on hand-rolling PowerPC assembly code. While that series never did get published, we decided to publish the first two installments. Part I is an overview of the PowerPC architecture registers and a quick introduction to the syntax of a few instructions. Part II digs into some disassembling of object code in order to discuss what a programmer might run into when debugging. Neither piece should be considered the final authority, and both may contain one or more factual errors. Part III does some simple math, including using Pascal's Summation to sum between two numbers (neither of which has to be zero) in twelve assembly instructions. (The included version that sums from zero to a number takes only four instructions, while the twelve-instruction version deals with upper and lower bounds.)
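For readers who want the arithmetic behind Part III without the assembly, here is a small C sketch. It assumes that the summation refers to the closed form n(n+1)/2, which is an assumption on our part; the function names are hypothetical and this is not a translation of the assembly in the article. It also assumes the inputs are small enough that n(n+1) does not overflow 64 bits.

```c
#include <stdint.h>

/* Sum of 0..n using the closed form n(n+1)/2.
 * Assumes n is small enough that n * (n + 1) fits in 64 bits. */
uint64_t sum_to(uint64_t n)
{
    return n * (n + 1) / 2;
}

/* Sum of lo..hi inclusive. Assumes lo <= hi.
 * Computed as (sum of 0..hi) minus (sum of 0..lo-1). */
uint64_t sum_between(uint64_t lo, uint64_t hi)
{
    uint64_t below = lo ? sum_to(lo - 1) : 0;
    return sum_to(hi) - below;
}
```

The bounded version needs a few more operations than the zero-based one because it evaluates the closed form twice and subtracts, which is in keeping with the difference in instruction counts mentioned above.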

Currently tim is working on a faster bcopy/memcpy using PowerPC handcoded assembly. The principle is simple: first copy any bytes from the total size that are not four-byte divisible and then copy chunks of four bytes. Since the penalty for using GPRs with unaligned bytes is negligible, it is not necessary to drop back to single-byte copies just because the size or address isn't four-byte divisible. This can be extended to use Altivec, but doing so introduces additional boundary conditions to handle misaligned data (i.e., data not on 16-byte boundaries, offsets, and so on).
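The principle can be sketched in portable C. This is only an illustration of the ordering (leftover bytes first, then four-byte chunks), not the assembly in bcopy.S; the function name is hypothetical. It uses memcpy for the four-byte moves so that unaligned addresses stay well defined in C, and like memcpy it assumes the source and destination do not overlap.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: copy the 0..3 leftover bytes first, then the
 * rest in four-byte chunks. Assumes non-overlapping buffers. */
void *copy_by_fours(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Bytes that keep the size from being four-byte divisible. */
    size_t leftover = n & 3;
    while (leftover--)
        *d++ = *s++;

    /* Remaining size is now a multiple of four. */
    size_t words = n >> 2;
    while (words--) {
        uint32_t w;
        memcpy(&w, s, sizeof w);  /* unaligned-safe four-byte load */
        memcpy(d, &w, sizeof w);  /* unaligned-safe four-byte store */
        s += 4;
        d += 4;
    }
    return dst;
}
```

A compiler will typically turn each small fixed-size memcpy into a single load or store on targets that tolerate unaligned word access, which is the property the paragraph above relies on.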

There is a bcopy.S version for incorporation into NetBSD, which is the version we have used for several months without problems. It contains both bcopy and memcpy replacements. For NetBSD, this file goes at /usr/src/lib/libc/arch/powerpc/string/bcopy.S

The four-byte PowerPC optimized code handles copies about 10% faster than stock NetBSD code for four-byte aligned data and about 80% faster for unaligned data.

Alternatively, fourbytes.s can be used in applications, which is how tim tested it. The code needs to be assembled with the as -mregnames flag to allow the register syntax.

tim has also begun an Altivec version that will copy 16 bytes at a time, but this requires handling cases where either or both source and destination are not aligned on 16 byte boundaries in addition to sizes not sixteen byte divisible. Don't look for it any time soon. Oh yeah, IBM might hold patents on some of the techniques needed to do this, if our research on this is correct, but we could never get a definitive answer from their legal department (no doubt waiting until we actually used the code that infringes on their patents).
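To make the boundary conditions concrete, here is a hedged C sketch that splits a copy into the bytes before the first 16-byte boundary of the destination, the whole 16-byte blocks, and the trailing bytes. The struct and function names are hypothetical and this is not the Altivec code itself. It only considers the destination address; the case where source and destination have different offsets within a 16-byte line needs extra handling that is not shown.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical plan for a 16-bytes-at-a-time copy. */
struct copy_plan {
    size_t head;    /* bytes before the first 16-byte boundary */
    size_t blocks;  /* number of whole 16-byte blocks */
    size_t tail;    /* bytes left after the last whole block */
};

/* Splits n bytes starting at dst_addr. Only the destination's
 * alignment is considered; source misalignment is not modeled. */
struct copy_plan plan_16(uintptr_t dst_addr, size_t n)
{
    struct copy_plan p;
    size_t head = (size_t)((16 - (dst_addr & 15)) & 15);

    if (head > n)
        head = n;  /* short copy that never reaches a boundary */
    p.head = head;
    p.blocks = (n - head) / 16;
    p.tail = (n - head) % 16;
    return p;
}
```

For example, a 40-byte copy to an address ending in 0x3 splits into 13 head bytes, one 16-byte block, and 11 tail bytes.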

Also, there is the L2 cache configuring code. It is a little out of date, so contact tim if you want a newer version. (You know the routine - gtkelly at this domain.)