[pcre-dev] Newline feature request

Author: Sheri
Date:
To: pcre-dev
Old topics: Re: [pcre-dev] Problem with "any" newline
Subject: [pcre-dev] Newline feature request
Hi Philip,

Would you consider adding another newline type? One like "any" but which
omits the Unicode line endings, e.g.:

(?>\r\n|\n|\r)
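
To illustrate what such a newline type would and would not treat as a line end, here is a minimal sketch in Python rather than PCRE itself. The names `ANYCRLF` and `split_lines` are hypothetical, and a non-atomic group `(?:...)` stands in for `(?>...)` so the pattern also compiles on Python versions before 3.11; with CR LF listed first, the two should behave the same for this purpose.

```python
import re

# Hypothetical name: matches CR LF, LF, or CR, and nothing else.
# CR LF comes first so a Windows line end is consumed as one unit.
ANYCRLF = re.compile(r"(?:\r\n|\n|\r)")


def split_lines(text):
    """Split text on CR LF, LF, or CR only (illustrative helper)."""
    return ANYCRLF.split(text)


# VT (0x0B), 0x85 and U+2028 are deliberately left inside the last field.
sample = "one\r\ntwo\nthree\rfour\x85five\x0bsix\u2028seven"
for part in split_lines(sample):
    print(repr(part))
```

By contrast, Python's own `str.splitlines()` behaves more like the "any" setting, since it also breaks on VT, FF, U+0085, U+2028 and U+2029.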

Here is a relevant thread fragment:

=========================================================================
>
> > > I almost liked the "any" version; however, there could be
> > > a problem. Are the Unicode line endings supposed to
> > > work only with UTF-8? It seemed not. VT (0x0B), FF
> > > (0x0C), and 0x85 are treated as newlines also in ASCII
> > > mode. Maybe it's OK with VT and FF, but 0x85 as newline
> > > in ASCII mode could introduce a problem. 0x85 is a
> > > LeadByte of some characters in Korean, although they
> > > are rarely used ones somewhat fortunately. It
> > > corresponds to a character even in single-byte code
> > > pages. Is there no way to trigger them only in UTF-8
> > > mode?
> > >
> >
> > In UTF-8 mode, two more are recognized: LS and PS (line separator
> > and paragraph separator)
>
> Yes I know, however, they won't affect the non-UTF8 mode as their
> codepoints are greater than 255.
>
> There was a reason why I didn't write NEL (0x85) while writing VT
> (0x0B) and FF (0x0C). The meaning of 0x0B and 0x0C remains the same
> regardless of the (Windows) codepages: they always correspond to
> U+000B and U+000C. But, the meaning of 0x85 varies from codepage to
> codepage. In most single-byte Windows codepages, for example, 0x85
> corresponds to "..." (Horizontal Ellipsis) whose codepoint is actually
> U+2026, not U+0085. So, excluding 0x85 in non-UTF8 mode will exclude
> U+2026, not NEL (U+0085) in most single-byte Windows codepages. I got
> the feeling the author of PCRE has had little acquaintance with
> Windows. (0x85 isn't assigned to any character in ISO 8859 character
> sets, as far as I can check)
>
> This was sort of a caution. As I don't think 0x85 would appear
> frequently in practice, even in the Korean codepage, and there is an
> easy workaround in such cases, like (?:.|\x85), I still kind of
> prefer the "any" version even with this flaw.


=====================================================
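
The codepage point in the quoted fragment can be checked with a small sketch, again in Python and only as an illustration. It assumes Python's `cp1252`, `latin-1` and `cp949` codecs are acceptable stand-ins for the Windows codepages being discussed; the variable names are made up for this example.

```python
raw = b"\x85"

# Windows-1252: byte 0x85 is the horizontal ellipsis, U+2026.
as_cp1252 = raw.decode("cp1252")

# Python's "latin-1" codec maps each byte to the code point of the same
# value, so 0x85 becomes the C1 control U+0085 (NEL), not a printable
# character.
as_latin1 = raw.decode("latin-1")

# Korean codepage 949: 0x85 is expected to act as a lead byte, so it is
# decoded together with the byte that follows it. errors="replace" keeps
# this sketch from raising if the codec disagrees with that assumption.
as_cp949 = b"\x85\x41".decode("cp949", errors="replace")

for label, text in [("cp1252", as_cp1252),
                    ("latin-1", as_latin1),
                    ("cp949", as_cp949)]:
    print(label, [f"U+{ord(ch):04X}" for ch in text])
```

So a byte-oriented matcher that treats 0x85 as a newline in non-UTF-8 mode would be acting on an ellipsis in Windows-1252 text, or on the first half of a character in Korean text.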

Regards,
Sheri