[pcre-dev] [Bug 1049] Add support for UTF-16

Author: Gertjan Halkes
To: pcre-dev
Old-Topics: [pcre-dev] [Bug 1049] New: Add support for UTF-16
Subject: [pcre-dev] [Bug 1049] Add support for UTF-16
------- You are receiving this mail because: -------
You are on the CC list for the bug.

http://bugs.exim.org/show_bug.cgi?id=1049




--- Comment #31 from Gertjan Halkes <eximbugz@???> 2011-11-17 19:04:50 ---
(In reply to comment #30)
> This construct is not legal in C, as far as I understand it. It is an
> aliasing violation to store data in one field of a union and read it
> from another. The compiler can do crazy things that would make you
> sad.


Well, the aliasing rules on this are not quite clear. The standard says that
(among others) a stored value may be accessed through "an aggregate or union
type that includes one of the aforementioned types among its
members", but then does not specify it has to be accessed through that member.
However, the standard also does not specify (as far as I can tell) the expected
behaviour for accessing through a different union member, which means that it
is undefined. Relying on undefined behaviour is definitely playing with fire.

I guess I should have read up on this before I tried to do this kind of
detection. My original motivation for the union was when I tried to do some
detection of features of the double type, which did break the aliasing rules.
For the byte order of integer types the cast through a character pointer would
actually be valid and does not cause warnings (my memory was tainted by the
double issue). However, both of these methods assume sizeof(int) !=
sizeof(char). Using uint8_t won't work, because it is optional in C99 (and
if it exists there probably isn't a problem to begin with). Also, when
sizeof(long) == sizeof(char), there is basically no way to detect the endianness
through any method because then all integer types are one "byte" (in the C
standard definition of a byte) and there are no endianness issues as far as the
basic types are concerned.
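
For illustration, the character-pointer method described above might look
roughly like the following. This is only a sketch: the function name
probe_little_endian is made up for this example and is not taken from the
PCRE sources, and the -1 return for the sizeof(int) == sizeof(char) case is
an assumption about how a caller might want that case reported.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE.
 * Returns 1 if the lowest-addressed byte of an unsigned int holds the
 * least significant bits, 0 if it does not, and -1 if int and char have
 * the same size, in which case byte order cannot be observed at all. */
static int probe_little_endian(void)
{
    unsigned int probe = 1;
    /* Inspecting an object through an unsigned char pointer is allowed
     * by the aliasing rules, unlike reading a different union member. */
    const unsigned char *bytes = (const unsigned char *)&probe;

    if (sizeof(unsigned int) == sizeof(unsigned char))
        return -1;
    return bytes[0] == 1;
}
```

Because the probe value is a small positive number, the result does not
depend on how negative values are represented.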

> I can't guarantee all systems have a macro for this, though.


Well, this is the problem the other methods are trying to work around. I guess
the pointer casting method could be used as a last resort with a static
assertion that sizeof(int) != sizeof(char). Also, I'm not sure signed vs.
unsigned will make any difference, as you can stick to positive numbers, which
will be represented the same regardless.
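
A possible shape for this "macro first, pointer cast as last resort" approach
is sketched below. The names int_wider_than_char and host_is_little_endian
are invented for the example. __BYTE_ORDER__ and __ORDER_LITTLE_ENDIAN__ are
the macros predefined by GCC and Clang; they are used here only as one
example of a compiler-provided macro, and other compilers may offer something
different or nothing at all. The static assertion uses the negative-array-size
trick so that it does not depend on C11.

```c
/* Compile-time check that works in C89: the array size becomes -1, and
 * compilation fails, if int and char have the same size. */
typedef char int_wider_than_char[(sizeof(int) != sizeof(char)) ? 1 : -1];

/* Hypothetical helper: returns 1 on a little-endian host, 0 otherwise. */
static int host_is_little_endian(void)
{
#if defined(__BYTE_ORDER__) && defined(__ORDER_LITTLE_ENDIAN__)
    /* Preferred path: trust the compiler's predefined macros. */
    return __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__;
#else
    /* Last resort: look at the first byte of a known positive value. */
    unsigned int probe = 1;
    return *(const unsigned char *)&probe == 1;
#endif
}
```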

Gertjan


--
Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email