Old World Macs
Currently tim is working on a faster bcopy/memcpy written in hand-coded PowerPC assembly. The principle is simple: first copy the leftover bytes (the remainder when the total size is divided by four) one at a time, then copy the rest in chunks of four bytes. Since the penalty for using GPRs on unaligned data is negligible, it is not necessary to drop back to single-byte copies just because the size or address isn't divisible by four. This can be extended to use Altivec, but doing so brings additional boundary conditions for misaligned data (i.e., addresses not on 16-byte boundaries, offsets, etc.).
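To make the principle concrete, here is a minimal sketch of it in portable C. It is not the assembly in bcopy.S. The name copy_words4 is made up for illustration, the two small memcpy calls merely stand in for the lwz/stw word load and store (plain C cannot dereference an unaligned word pointer safely), and the sketch assumes the two buffers do not overlap, which a real bcopy would have to handle.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical illustration of the "leftover bytes first, then
 * four-byte chunks" strategy. Assumes non-overlapping buffers. */
void *copy_words4(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Step 1: copy the len % 4 leftover bytes one at a time. */
    size_t odd = len & 3;
    len -= odd;
    while (odd--)
        *d++ = *s++;

    /* Step 2: copy the remainder four bytes at a time, whether or
     * not the addresses are four-byte aligned. */
    while (len) {
        uint32_t w;
        memcpy(&w, s, sizeof w);   /* stands in for a word load (lwz)  */
        memcpy(d, &w, sizeof w);   /* stands in for a word store (stw) */
        s += 4;
        d += 4;
        len -= 4;
    }
    return dst;
}
```

Note that nothing here checks the alignment of either pointer. That is the point of the approach: only the size decides how many single-byte copies are done, so at most three bytes ever go through the slow path.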
There is a bcopy.S version for incorporation into NetBSD, which is the version we have used for several months without problems. It contains both bcopy and memcpy replacements. For NetBSD, this file goes at /usr/src/lib/libc/arch/powerpc/string/bcopy.S
The four-byte PowerPC-optimized code handles copies about 10% faster than the stock NetBSD code for four-byte-aligned data and about 80% faster for unaligned data.
The alternative fourbytes.s can be used in applications, though, which is how tim tested it. The code needs to be assembled with the -mregnames flag to as, which allows the symbolic register syntax.
tim has also begun an Altivec version that will copy 16 bytes at a time, but this requires handling cases where the source, the destination, or both are not aligned on 16-byte boundaries, as well as sizes that are not a multiple of 16 bytes. Don't look for it any time soon. Oh yeah, IBM might hold patents on some of the techniques needed to do this, if our research is correct, but we could never get a definitive answer from their legal department (no doubt they are waiting until we actually use code that infringes on their patents).
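The boundary cases mentioned above can be pictured with a rough sketch in portable C. This is not Altivec code and not tim's work in progress. It only shows the usual head/body/tail split: copy single bytes until the destination reaches a 16-byte boundary, copy 16-byte blocks, then copy whatever is left. The name copy_blocks16 is hypothetical, the 16-byte memcpy calls stand in for vector loads and stores, and non-overlapping buffers are assumed. A real Altivec version would also have to realign a misaligned source inside the block loop, which this sketch leaves to memcpy.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical head/body/tail sketch for a 16-byte-at-a-time copy.
 * Assumes non-overlapping buffers. */
void *copy_blocks16(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Head: bytes needed to bring the destination up to a 16-byte
     * boundary (zero if it is already aligned). */
    size_t head = (size_t)(-(uintptr_t)d & 15);
    if (head > len)
        head = len;
    len -= head;
    while (head--)
        *d++ = *s++;

    /* Body: whole 16-byte blocks. The destination is aligned here;
     * the source may still be misaligned. */
    while (len >= 16) {
        unsigned char block[16];
        memcpy(block, s, sizeof block);  /* stands in for a vector load  */
        memcpy(d, block, sizeof block);  /* stands in for a vector store */
        s += 16;
        d += 16;
        len -= 16;
    }

    /* Tail: fewer than 16 bytes remain. */
    while (len--)
        *d++ = *s++;

    return dst;
}
```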
Also, there is the L2 cache configuring code. It is a little out of date, so contact tim if you want a newer version. (You know the routine - gtkelly at this domain.)