Re: [pcre-dev] Powerpc optimisation

Author: Zoltán Herczeg
Date:
To: Frederic Bonnard
CC: pcre-dev
Subject: Re: [pcre-dev] Powerpc optimisation

Hi,

If I understand correctly, stalled-cycles-backend means the current instruction is waiting on the result of another instruction. High stalling is not surprising for JIT code, because the most frequent instructions are loads and branches: optimizations eliminate most of the arithmetic part of a program. It seems about 25% of all instructions were branches. I don't see anything unusual here; IPC is low, but that is also usual for JIT code.
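The ratios mentioned above can be recomputed from the raw counters. Here is a minimal sketch in C, with the counter values copied from the perf stat run quoted below. The struct and function names are made up for illustration only.

```c
#include <stdio.h>

/* Raw counters copied from the perf stat run quoted below. */
struct perf_counters {
    unsigned long long cycles;
    unsigned long long instructions;
    unsigned long long branches;
    unsigned long long branch_misses;
    unsigned long long stalled_backend;
};

const struct perf_counters ppc_run = {
    21217505532ULL, /* cycles */
    28407626434ULL, /* instructions */
    7000623877ULL,  /* branches */
    181661003ULL,   /* branch-misses */
    12459150875ULL  /* stalled-cycles-backend */
};

/* Guarded division so a missing (zero) counter does not divide by zero. */
double ratio(unsigned long long num, unsigned long long den)
{
    return den ? (double)num / (double)den : 0.0;
}

/* Print the derived figures: IPC, branch share, miss rate, backend stalls. */
void print_derived(const struct perf_counters *c)
{
    printf("IPC:                 %.2f\n", ratio(c->instructions, c->cycles));
    printf("branch share:        %.1f%%\n",
           100.0 * ratio(c->branches, c->instructions));
    printf("branch miss rate:    %.2f%%\n",
           100.0 * ratio(c->branch_misses, c->branches));
    printf("backend stall share: %.2f%%\n",
           100.0 * ratio(c->stalled_backend, c->cycles));
}
```

Calling print_derived(&ppc_run) from a main() gives an IPC of about 1.34 and a branch share just under 25%, which matches what perf prints.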

Is it possible to collect cache hit/miss and memory (not cache) load stall statistics? It would be interesting to compare this to x86, but I am not sure we can draw any conclusions from that. Memory dependency and cache statistics would be more important.

Regards,
Zoltan

Frederic Bonnard <frediz@???> wrote:
>Hi Zoltán,
>I had a look with linux perf to see what statistical profiling reports,
>and it seems that branch prediction is not our issue (branch-misses seems low).
>Here is a simple run with pcre-jit on the slowest pattern:
>---
>ubuntu@vm10:~/bench/regex-test$ perf stat ./runtest
>'mark.txt' loaded. (Length: 20045118 bytes)
>-----------------
>Regex: '.{0,3}(Tom|Sawyer|Huckleberry|Finn)'
>[pcre-jit] time: 1268 ms (3015 matches)
>
> Performance counter stats for './runtest':
>
>       6410.784480 task-clock (msec)         #    1.000 CPUs utilized
>                24 context-switches          #    0.004 K/sec
>                 0 cpu-migrations            #    0.000 K/sec
>             5,409 page-faults               #    0.844 K/sec
>    21,217,505,532 cycles                    #    3.310 GHz                     [66.69%]
>       347,697,697 stalled-cycles-frontend   #    1.64% frontend cycles idle    [50.03%]
>    12,459,150,875 stalled-cycles-backend    #   58.72% backend cycles idle     [50.02%]
>    28,407,626,434 instructions              #    1.34  insns per cycle
>                                             #    0.44  stalled cycles per insn [66.68%]
>     7,000,623,877 branches                  # 1092.007 M/sec                   [49.97%]
>       181,661,003 branch-misses             #    2.59% of all branches         [50.00%]
>
>       6.411626272 seconds time elapsed
>---
>In my terminal, "58.72%" is in purple :) so maybe that is the source of slowness.
>What do you think of that?
>To use linux perf on jitted code, we would need to instrument the code as you mentioned previously,
>and I found this:
>https://github.com/torvalds/linux/blob/master/tools/perf/Documentation/jit-interface.txt
>
>I'm honestly stuck here: I guess each line corresponds to a jitted function with its start, length
>and symbol, but I don't know how to code that in pcre.
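As far as I can tell from jit-interface.txt, perf reads a text file named /tmp/perf-<pid>.map in which each line has the form "START SIZE symbolname", with START and SIZE written as hex numbers without a 0x prefix. Below is a minimal sketch in C of writing such lines. Both perf_map_default_path and perf_map_append are hypothetical helpers, not PCRE or sljit functions, and how PCRE would obtain the start address and size of the generated code is left open here.

```c
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: build the file name perf looks for,
   /tmp/perf-<pid>.map, for the current process. */
int perf_map_default_path(char *buf, size_t len)
{
    int n = snprintf(buf, len, "/tmp/perf-%ld.map", (long)getpid());
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Hypothetical helper: append one "START SIZE symbolname" line.
   START and SIZE are hex without "0x"; the symbol is the rest of the line. */
int perf_map_append(const char *map_path, const void *code_start,
                    size_t code_size, const char *symbol)
{
    FILE *f = fopen(map_path, "a");
    int ok;

    if (f == NULL)
        return -1;
    ok = fprintf(f, "%llx %llx %s\n",
                 (unsigned long long)(uintptr_t)code_start,
                 (unsigned long long)code_size, symbol) > 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

Presumably this would be called once per compiled pattern, right after the JIT code is generated, with a symbol name that identifies the pattern so that perf report can attribute samples to it.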
>
>Fred
>
>On Sat, 6 Jun 2015 18:33:28 +0200 (CEST), Zoltán Herczeg <hzmester@???> wrote:
>> Hi Frederic,
>>
>> I just realized that the results on that page are two years old, so I updated the engines to their most recent versions and uploaded new results. These results are better overall for all engines (partly because of a newer gcc). The JIT has also improved overall, e.g. the time for the third pattern from the end dropped from 190 ms to 27 ms.
>>
>> Regards,
>> Zoltan
>>
>> "Zoltán Herczeg" <hzmester@???> wrote:
>> >Hi Frederic,
>> >
>> >thank you for measuring PCRE on PPC. The results are quite interesting.
>> >
>> >It seems to me that the slower patterns are those that require heavy backtracking, i.e. those where fast-forward (skipping) algorithms cannot be used (or where they match too frequently). The /[a-zA-Z]+ing/ pattern is a good example. Backtracking engines (PCRE, Oniguruma) suffer much more on PPC than those that read the input once (TRE, RE2). I suspect branch prediction on x86 is better, but only a statistical profiler can prove that. Oprofile is available everywhere, and can profile JIT code. That part is developed by IBM :)
>> >
>> >http://oprofile.sourceforge.net/doc/devel/index.html
>> >
>> >It needs some extra coding though. If you are interested in working on that, I can help.
>> >
>> >Btw the Tom.{10,25}river|river.{10,25}Tom pattern is twice as fast on PPC with JIT if I understand the numbers correctly.
>> >
>> >Regards,
>> >Zoltan
>> >
>> >Frederic Bonnard <frediz@???> wrote:
>> >>Thanks Zoltan for the quick reply.
>> >>- Ok I think I got it for SSE2.
>> >>- For SIMD instructions, I fear I don't currently have the knowledge for that but
>> >>would be willing to learn/help.
>> >>- A good start would be that 3rd point, about current code and performance
>> >> status on PPC vs x86.
>> >> I reused http://sljit.sourceforge.net/regex_perf.html, I hope it is relevant.
>> >> pcre directory has been updated to use latest 8.37 instead of 8.32.
>> >> My VMs were :
>> >> * x86-64 4x2.3GHz 4G memory on a x86-64 host
>> >> * ppc64el 4x3GHz 4G memory on a P8 host
>> >> * ppc64 4x3GHz 4G memory on a P8 host
>> >> All were installed with Ubuntu 14.04 LTS.
>> >> Note that on Ubuntu for ppc64, the default is to run 32-bit binaries on a
>> >> 64-bit kernel, thus the binary 'runtest' is 32-bit. Maybe I need to try with
>> >> a 64-bit binary.
>> >> Attached are the results for those 3 environments. The goal is not to
>> >> find who's the best but rather to find any odd behaviour. Also let's focus on
>> >> pcre/pcre-jit.
>> >> Any comments from expert eyes are welcome.
>> >> On my side, I see very comparable results between ppc64 and ppc64el, so no major
>> >> issue on ppc64el. Now, between x86 and ppc64el, the results for the latter
>> >> seem weaker overall, all the more so since the x86 VM has a lower clock frequency.
>> >> The results may need more repetitions and percentages for comparison, but I
>> >> already see some results that are 2 to 3 times slower for pcre-jit:
>> >> .{0,3}(Tom|Sawyer|Huckleberry|Finn)
>> >> [a-zA-Z]+ing
>> >> ^[a-zA-Z]{0,4}ing[^a-zA-Z]
>> >> [a-zA-Z]+ing$
>> >> ^[a-zA-Z ]{5,}$
>> >> ^.{16,20}$
>> >> "[^"]{0,30}[?!\.]"
>> >> Tom.{10,25}river|river.{10,25}Tom
>> >>
>> >> Is there any special treatment for these that could make the code generated on Power weaker?
>> >>
>> >> Fred
>> >>
>> >>--
>> >>## List details at https://lists.exim.org/mailman/listinfo/pcre-dev