Re: [pcre-dev] [Bug 1437] Using PCRE-8.34 on x86-64 Linux wi…

Author: Ze'ev Atlas
Date:  
To: 1437@bugs.exim.org, pcre-dev@exim.org
Subject: Re: [pcre-dev] [Bug 1437] Using PCRE-8.34 on x86-64 Linux with --enable-jit and --enable-utf , grep -iP '^S' gets stuck on a binary file consuming a lot of CPU for many seconds
Why is it so hard to enforce that '^' starts where it is supposed to (i.e. at the absolute start of the supplied string)?
 
Ze'ev Atlas



________________________________
From: Zoltan Herczeg <hzmester@???>
To: pcre-dev@???
Sent: Friday, January 24, 2014 3:02 AM
Subject: [pcre-dev] [Bug 1437] Using PCRE-8.34 on x86-64 Linux with --enable-jit and --enable-utf , grep -iP '^S' gets stuck on a binary file consuming a lot of CPU for many seconds


------- You are receiving this mail because: -------
You are on the CC list for the bug.

http://bugs.exim.org/show_bug.cgi?id=1437




--- Comment #10 from Zoltan Herczeg <hzmester@???>  2014-01-24 08:02:32 ---
Thanks. Now I have the answer. Grep puts a \x0a before the start of the string,
and your string starts with an invalid UTF-8 byte, 0xbf. You can fully
reproduce this issue with the following input:

\x0a\xbf#

When ^ is matched at #, the JIT goes back to \x0a, since \xbf cannot be the
start of the previous character. ^ matches there, and the string pointer now
points to \xbf, which does not match the character S. As you can see, we are
now before our original start point, and the match continues from here.
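
The backward step described above can be modelled in a few lines. This is a
simplified sketch only, not PCRE's actual JIT code: the function names, the
buffer layout, and the offsets are assumptions chosen to mirror the
\x0a\xbf# example. It shows how skipping UTF-8 continuation bytes without a
bounds check lands before the start of the subject.

```python
def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes (bit pattern 10xxxxxx)."""
    return (byte & 0xC0) == 0x80


def step_back_unchecked(buf: bytes, pos: int) -> int:
    """Move to the start of the previous character, trusting the input.

    Hypothetical model: no check against the start of the subject.
    """
    pos -= 1
    while is_continuation(buf[pos]):
        pos -= 1
    return pos


def step_back_checked(buf: bytes, pos: int, start: int) -> int:
    """Same step, but never move before `start` (costs an extra compare)."""
    pos -= 1
    while pos > start and is_continuation(buf[pos]):
        pos -= 1
    return pos


# Assumed layout: grep's \x0a sits at offset 0, the subject starts at 1.
buffer = b"\x0a\xbf#"
subject_start = 1
hash_pos = 2  # offset of '#'

unchecked = step_back_unchecked(buffer, hash_pos)
checked = step_back_checked(buffer, hash_pos, subject_start)

print(unchecked < subject_start)  # the unchecked step escaped the subject
print(checked >= subject_start)   # the checked step stayed inside it
```

The checked variant stays inside the subject, at the price of one more
comparison on every backward step.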

I think there are several corner cases which have similar behaviour, and that
is the reason why we don't support them. Checks could be added for them (we
should not go before the start of the input, we should restore the position
after ^, etc.), but this would be a performance loss for those who run the
engine against valid UTF input. It is very costly to check for all invalid
UTF codes on the fly (buffer bounds checks, value checks, etc.).

It is a good question what to do here, and I would like to hear others'
opinions (especially Philip's). I think the best thing would be to switch back
to ASCII mode in grep if the input is not a valid UTF string.
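
A minimal sketch of that fallback idea, written in Python rather than in
grep's own code: validate the input once up front, and choose the matching
mode from the result. The helper names and the mode labels "utf" and "bytes"
are hypothetical and do not correspond to real grep or PCRE options.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Validate the whole buffer once, before any matching starts."""
    try:
        data.decode("utf-8")  # strict decoding rejects malformed sequences
    except UnicodeDecodeError:
        return False
    return True


def choose_mode(data: bytes) -> str:
    """Pick a matching mode (hypothetical labels, not real options)."""
    return "utf" if is_valid_utf8(data) else "bytes"


print(choose_mode(b"\xbf#"))                    # stray continuation byte
print(choose_mode("Straße".encode("utf-8")))    # well-formed UTF-8
```

A single validation pass keeps the per-character checks out of the matching
loop, which is the cost the comment above is worried about.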



--
Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email

--
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev