
nothingmuch at woobling
Jun 3, 2008, 7:17 PM
Post #1 of 6
More results from llvm-gcc
Hola,

A while ago Claes compiled Perl 5.6 with llvm-gcc and got some performance improvements. This is a continuation of that effort using the 5.10 source tree.

Executive summary: this gives even better results than just plain llvm-gcc, and theoretically opens up the way for much more. Results are PerlBench improvements of 15% over standard gcc compilation, using llvm's optimizing linker.

And now for the details:

llvm-gcc is basically gcc 4.2 with the backend switched to use llvm's native code generation instead of gcc's.

Normally running

    llvm-gcc -o foo.o foo.c

generates native code, which is then linked by the normal linker. However, if you run it as

    llvm-gcc -emit-llvm -o foo.o foo.c

then foo.o is not a real object file, but an llvm bytecode file. This file is then linkable with llvm-ld, allowing interprocedural optimizations.

The result of linking perl like this

    llvm-ld -native -O5 -o perl blah blah blah

is an executable that is on average 15-20% faster, as measured by PerlBench on my machine, than the perl I compiled with gcc 4.0 and use normally. I attribute this 10% improvement over plain llvm-gcc (linking native .o files rather than bytecode) to LLVM's extensive link time optimizations.

When linking without -native the perl executable is actually a shell script that runs lli on perl.bc. This has a very slow startup (about 3.5 seconds) but after that it's just as fast as, and sometimes faster than, the -native executable. Unfortunately it cannot be used with the -e command line option (filter_del emits an error about removing filters). I haven't debugged this yet.

In order for dynamic loading of modules to work llvm-ld has to be told to -disable-internalize (basically it needs to keep all the external symbols available for dynamic linking), and then the Perl test suite passes except for one error relating to sdbm (output below). Without this flag the results are faster, but of course the test suite fails when loading XS code.
I suppose whatever it takes to build a static perl could fix this, but I haven't actually tried.

Apple's iPhone SDK ships with an llvm-gcc that does the linking part automatically but exhibits some breakage. I filed a bug report, and once they fix it you could theoretically get the same speed improvements by using llvm-gcc -O4 and changing nothing else.

The steps to run this are replacing ld and cc with the attached script, and making sure that ar is llvm-ar. I couldn't get this to work consistently without editing config.sh myself (Configure didn't respect changing ar or ld; I don't know what the right solution is).

And now for the bad news:

    ext/SDBM_File/t/sdbm..........................................perl(83924) malloc: *** error for object 0x200f07: Non-aligned pointer being freed
    *** set a breakpoint in malloc_error_break to debug
    Use of uninitialized value $Dfile in stat at ../ext/SDBM_File/t/sdbm.t line 47.
    Use of uninitialized value $mode in bitwise and (&) at ../ext/SDBM_File/t/sdbm.t line 49.
    perl(83924) malloc: *** error for object 0x200f67: Non-aligned pointer being freed
    *** set a breakpoint in malloc_error_break to debug

I haven't looked into this yet. This repeats with several other DBM related tests, but other than that the whole test suite passes. The iPhone SDK llvm-gcc -O4 compilation exhibits a few other test failures.

And lastly, the future directions:

I hope to embed LLVM's bytecode loading and JIT support in the perl executable, and to patch XSLoader and DynaLoader to support loading of LLVM bytecode. This would allow LLVM based XS modules (which could be interesting for PAR-like efforts, not just for optimization). I also want to retain the bytecode output of compiling Perl itself so that it's available to the JIT as well.

When I have the opcode definitions (pp_*) available as llvm bytecode functions I want to try to emit very naive threaded bytecode from the optree on a per-subroutine basis, transforming these subroutines into XSUBs using the function pointer returned from the llvm JIT.

For example, the body of

    sub { $x + 3 }

would become similar to the definition of:

    /* PL_op == cv->START; the nextstate op */
    PL_op = pp_nextstate(aTHX);
    PL_op = pp_padsv(aTHX);
    PL_op = pp_const(aTHX);
    PL_op = pp_add(aTHX);
    return pp_leave(aTHX);

assuming that all the op->pp_addr == PL_pp_addr[op->type].

Hopefully LLVM will be able to perform interprocedural optimizations between the definitions of the various pp_*.

After that is in place the bytecode emitter can be extended by refactoring pp_* into smaller, non stack based functions that are not as reliant on the global environment, so that the above code can actually become more like:

    SV *tmp1 = opcode_padsv(aTHX_ pad_op);
    stack_push(opcode_add(tmp1, const_op_sv));
    free_tmp(tmp1); /* free if it has a PV */
    PL_op = next_op;

This would allow simple ops to avoid the overhead of pushing/popping data on the stack, mortalizing, etc.

Lastly, I hope to base this emitter on Runops::Trace's recently added features to get trace-caching-like compilation for just the hot path, to avoid unnecessary JIT optimization of seldom used optrees.

Cheers,
Yuval

P.S. my new favourite command is make clean -j50 (yes, fifty).

-- 
Yuval Kogman <nothingmuch[at]woobling.org>
http://nothingmuch.woobling.org 0xEBD27418
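
To make the "naive threaded bytecode" idea above concrete, here is a minimal, self-contained sketch in plain C. It is a toy model only: the OP struct, the int value stack, the pad array, and the pp_* functions below are simplified stand-ins invented for illustration and are not perl's real structures or opcode implementations. It assumes, as the post does, that every op's pp_addr is the default function for its type, so the calls can be emitted directly. run_interpreted() walks the ops through function pointers the way a run loop does; run_threaded() is the straight-line sequence of calls an emitter might produce for `$x + 3`.

```c
/* Toy model: NOT perl's real OP, stack, pad, or pp_* functions. */
#include <assert.h>
#include <stddef.h>

typedef struct op OP;
struct op {
    OP *(*pp_addr)(void); /* function implementing this op */
    OP *op_next;          /* next op in execution order, NULL at the end */
    int value;            /* toy field: constant value or pad slot index */
};

static OP *PL_op;         /* currently executing op (named after perl's PL_op) */
static int stack[16];     /* toy value stack holding plain ints */
static int sp;            /* number of values currently on the stack */
static int pad[4];        /* toy lexical pad; slot 0 plays the role of $x */

static void push(int v) { assert(sp < 16); stack[sp++] = v; }
static int  pop(void)   { assert(sp > 0);  return stack[--sp]; }

/* Each op reads its operands via PL_op and returns the next op to run. */
static OP *pp_padsv(void) { push(pad[PL_op->value]); return PL_op->op_next; }
static OP *pp_const(void) { push(PL_op->value);      return PL_op->op_next; }
static OP *pp_add(void)
{
    int right = pop();
    int left  = pop();
    push(left + right);
    return PL_op->op_next;
}

/* Toy optree for `$x + 3`, linked in execution order: padsv, const, add. */
static OP op_add   = { pp_add,   NULL,      0 };
static OP op_const = { pp_const, &op_add,   3 };
static OP op_padsv = { pp_padsv, &op_const, 0 };

/* Conventional run loop: one indirect call through pp_addr per op. */
int run_interpreted(int x)
{
    pad[0] = x;
    sp = 0;
    PL_op = &op_padsv;
    while ((PL_op = PL_op->pp_addr()) != NULL)
        ;
    return pop();
}

/* "Threaded" form: the same ops as direct calls in a fixed order, which
 * gives an optimizer the chance to inline across the pp_* boundaries. */
int run_threaded(int x)
{
    pad[0] = x;
    sp = 0;
    PL_op = &op_padsv;
    PL_op = pp_padsv();
    PL_op = pp_const();
    PL_op = pp_add();
    return pop();
}
```

Both functions compute the same result for any x; the difference is only that the second form exposes the call sequence statically instead of discovering it through op_next and pp_addr at run time.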