Re: [pcre-dev] Powerpc optimisation

Author: Frederic Bonnard
Date:
To: Zoltán Herczeg
CC: pcre-dev
Old-Topics: Re: [pcre-dev] Powerpc optimisation
Subject: Re: [pcre-dev] Powerpc optimisation

Hi Zoltán,
I had a look with Linux perf to see what statistical profiling reports,
and it seems that branch prediction is not our issue (branch-misses looks low, at 2.59%).
Here is a simple run with pcre-jit on the slowest pattern:
---
ubuntu@vm10:~/bench/regex-test$ perf stat ./runtest
'mark.txt' loaded. (Length: 20045118 bytes)
-----------------
Regex: '.{0,3}(Tom|Sawyer|Huckleberry|Finn)'
[pcre-jit] time: 1268 ms (3015 matches)

Performance counter stats for './runtest':

       6410.784480 task-clock (msec)         #    1.000 CPUs utilized          
                24 context-switches          #    0.004 K/sec                  
                 0 cpu-migrations            #    0.000 K/sec                  
             5,409 page-faults               #    0.844 K/sec                  
    21,217,505,532 cycles                    #    3.310 GHz                     [66.69%]
       347,697,697 stalled-cycles-frontend   #    1.64% frontend cycles idle    [50.03%]
    12,459,150,875 stalled-cycles-backend    #   58.72% backend  cycles idle    [50.02%]
    28,407,626,434 instructions              #    1.34  insns per cycle        
                                             #    0.44  stalled cycles per insn [66.68%]
     7,000,623,877 branches                  # 1092.007 M/sec                   [49.97%]
       181,661,003 branch-misses             #    2.59% of all branches         [50.00%]

       6.411626272 seconds time elapsed
---
In my terminal, "58.72%" (backend cycles idle) is shown in purple :) so maybe that is the source of the slowness.
What do you think?
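
For reference, the derived figures in the perf output above are plain ratios of the raw counters. A small sketch that recomputes them; the function names counter_percent and insns_per_cycle are only illustrative and are not part of perf or PCRE:

```c
#include <stdint.h>

/* Ratio of two raw counters, expressed as a percentage.
   Example: stalled-cycles-backend / cycles gives "backend cycles idle". */
double counter_percent(uint64_t part, uint64_t whole)
{
    return whole ? 100.0 * (double)part / (double)whole : 0.0;
}

/* Instructions retired per cycle ("insns per cycle" in perf stat). */
double insns_per_cycle(uint64_t instructions, uint64_t cycles)
{
    return cycles ? (double)instructions / (double)cycles : 0.0;
}
```

With the counters from this run, counter_percent(12459150875, 21217505532) gives about 58.72 and counter_percent(181661003, 7000623877) gives about 2.59, so the backend stalls are the large number here, not the branch misses.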
To use Linux perf on jitted code, we would need to instrument the code as you mentioned previously,
and I found this:
https://github.com/torvalds/linux/blob/master/tools/perf/Documentation/jit-interface.txt

I'm honestly stuck here: I guess each line of the map file corresponds to a jitted function,
with its start address, length and symbol name, but I don't know how to code that in PCRE.
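
Here is a minimal sketch of what I understand that document to ask for: perf looks for a file named /tmp/perf-<pid>.map, where each line is "START SIZE symbolname", with START and SIZE in hexadecimal and no 0x prefix. The helper names perf_map_format_entry and perf_map_register are hypothetical (they are not PCRE or sljit functions), and the sketch assumes the start address and size of the generated code are available at the point where the JIT compiler has finished a pattern.

```c
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper (not a PCRE/sljit API): format one perf map line,
   "START SIZE symbolname\n", with START and SIZE in hex without "0x".
   Returns the line length, or -1 if it does not fit in buf. */
int perf_map_format_entry(char *buf, size_t buflen,
                          const void *start, size_t size, const char *name)
{
    int n = snprintf(buf, buflen, "%lx %lx %s\n",
                     (unsigned long)(uintptr_t)start,
                     (unsigned long)size, name);
    return (n < 0 || (size_t)n >= buflen) ? -1 : n;
}

/* Hypothetical helper: append one entry to /tmp/perf-<pid>.map.
   Assumption: called once per compiled pattern, right after code generation. */
int perf_map_register(const void *start, size_t size, const char *name)
{
    char path[64];
    char line[256];
    FILE *f;
    int ok;

    if (perf_map_format_entry(line, sizeof(line), start, size, name) < 0)
        return -1;
    snprintf(path, sizeof(path), "/tmp/perf-%d.map", (int)getpid());
    f = fopen(path, "a");   /* append: one line per jitted function */
    if (f == NULL)
        return -1;
    ok = (fputs(line, f) != EOF);
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

The open question remains where in PCRE to call something like perf_map_register, since the code address and size live inside the JIT compiler and the symbol name would have to be made up per pattern.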

Fred

On Sat, 6 Jun 2015 18:33:28 +0200 (CEST), Zoltán Herczeg <hzmester@???> wrote:
> Hi Frederic,
>
> I just realized that the results on that page are two years old, so I updated the engines to their most recent versions and uploaded new results. These results are better overall for all engines (partly because of a newer gcc). The JIT has also improved overall, e.g. the third-from-last pattern dropped from 190 ms to 27 ms.
>
> Regards,
> Zoltan
>
> "Zoltán Herczeg" <hzmester@???> wrote:
> >Hi Frederic,
> >
> >Thank you for measuring PCRE on PPC. The results are quite interesting.
> >
> >It seems to me that the slower patterns are those which require heavy backtracking, i.e. those where fast-forward (skipping) algorithms cannot be used (or where they match too frequently). /[a-zA-Z]+ing/ is a good example of that. Backtracking engines (PCRE, Oniguruma) suffer much more on PPC than those that read the input once (TRE, RE2). I suspect branch prediction is better on x86, but only a statistical profiler can prove that. OProfile is available everywhere and can profile JIT code. That part is developed by IBM :)
> >
> >http://oprofile.sourceforge.net/doc/devel/index.html
> >
> >It needs some extra coding though. If you are interested in working on that, I can help.
> >
> >Btw the Tom.{10,25}river|river.{10,25}Tom pattern is twice as fast on PPC with JIT if I understand the numbers correctly.
> >
> >Regards,
> >Zoltan
> >
> >Frederic Bonnard <frediz@???> wrote:
> >>Thanks Zoltan for the quick reply.
> >>- OK, I think I got it for SSE2.
> >>- For SIMD instructions, I fear I don't currently have the knowledge for that, but
> >>would be willing to learn/help.
> >>- A good start would be that 3rd point, about the current code and performance
> >> status on PPC vs x86.
> >> I reused http://sljit.sourceforge.net/regex_perf.html, I hope it is relevant.
> >> The pcre directory has been updated to use the latest 8.37 instead of 8.32.
> >> My VMs were:
> >> * x86-64 4x2.3GHz 4G memory on a x86-64 host
> >> * ppc64el 4x3GHz 4G memory on a P8 host
> >> * ppc64 4x3GHz 4G memory on a P8 host
> >> All were installed with Ubuntu 14.04 LTS.
> >> Note that on Ubuntu for ppc64, the default is to run 32-bit binaries on a 64-bit
> >> kernel, thus the 'runtest' binary is 32-bit. Maybe I need to try with a 64-bit
> >> binary.
> >> Attached are the results for those 3 environments. The goal is not to
> >> find who's the best but rather to find any odd behaviour. Also, let's focus on
> >> pcre/pcre-jit.
> >> Any comments from expert eyes are welcome.
> >> On my side, I see very comparable results between ppc64 and ppc64el, so no major
> >> issue on ppc64el. Now, between x86 and ppc64el, the results for the latter
> >> seem weaker overall, all the more so since the x86 VM has a lower frequency.
> >> The results may need more repetitions, and percentages to compare, but I
> >> already see some results that are 2x or 3x slower for pcre-jit:
> >> .{0,3}(Tom|Sawyer|Huckleberry|Finn)
> >> [a-zA-Z]+ing
> >> ^[a-zA-Z]{0,4}ing[^a-zA-Z]
> >> [a-zA-Z]+ing$
> >> ^[a-zA-Z ]{5,}$
> >> ^.{16,20}$
> >> "[^"]{0,30}[?!\.]"
> >> Tom.{10,25}river|river.{10,25}Tom
> >>
> >> Is there any special treatment for these that could make the code generated on Power weaker?
> >>
> >> Fred
> >>
> >>--
> >>## List details at https://lists.exim.org/mailman/listinfo/pcre-dev