Cedric Blancher
2013-04-17 12:52:40 UTC
Glenn, can you take a look at the posting from freebsd-standards? AST
tr -C doesn't ignore unassigned code points as it should.
Ced
---------- Forwarded message ----------
From: Jilles Tjoelker <jilles at stack.nl>
Date: 7 April 2013 22:31
Subject: Re: Fwd: Where does FreeBSD tr -C differ from tr -c?
To: Cedric Blancher <cedric.blancher at googlemail.com>
Cc: freebsd-hackers at freebsd.org, freebsd-standards at freebsd.org
> The question remains open and I need help. tr -C is implemented by
> FreeBSD tr but I can't find examples (or a testcase) where tr -c
> and tr -C differ.
Reading the rationale of POSIX, here is an example of a difference:
% printf 'a\200'|LC_ALL=en_US.US-ASCII tr -cd '\000-\177'|hd
00000000 61 |a|
00000001
% printf 'a\200'|LC_ALL=en_US.US-ASCII tr -Cd '\000-\177'|hd
00000000 61 80 |a.|
00000002
Because the bytes 128..255 are not characters in us-ascii, they cannot
be removed with -Cd, only with -cd.
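For systems without hd(1), the -c half of this can be checked with od(1). This is only a sketch: hexbytes is a hypothetical helper, and LC_ALL=C is assumed in place of en_US.US-ASCII. The -C half is left out because its result depends on the implementation (GNU tr, for instance, documents -C as a synonym for -c).

```shell
# hexbytes: hypothetical helper, prints stdin as lowercase hex digits
# with no separators (a stand-in for hd).
hexbytes() {
    od -An -tx1 | tr -d ' \n'
}

# -c complements the set of byte values 0..127, so byte 0x80 is deleted
# and only 'a' (0x61) is left.
printf 'a\200' | LC_ALL=C tr -cd '\000-\177' | hexbytes
echo
```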
Here is another difference (using LC_CTYPE=en_US.UTF-8, rest C):
% echo $'\U0001a000'|tr -cd '\U0001a000'|hd
% echo $'\U0001a000'|tr -Cd '\U0001a000'|hd
00000000 f0 9a 80 80 |....|
00000004
The cause is that iswrune(3) returns false for the unassigned code point
U+0001A000.
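Not every shell understands $'\U0001a000'. As a sketch, the same four bytes can be produced with printf octal escapes, which are portable; u1a000 and hexbytes are hypothetical helper names, and the trailing newline that echo adds is left out here.

```shell
# u1a000: emit the UTF-8 encoding of U+1A000 (bytes f0 9a 80 80),
# written as octal escapes so no \U support is needed.
u1a000() {
    printf '\360\232\200\200'
}

# hexbytes: hypothetical helper, prints stdin as hex digits (stand-in for hd).
hexbytes() {
    od -An -tx1 | tr -d ' \n'
}

u1a000 | hexbytes
echo
```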
This may well contain bugs because Unicode adds new characters from time
to time and our tables seem to be updated very rarely.
POSIX also says things about collation order. You may not have detected
this because FreeBSD does not implement LC_COLLATE for multibyte locales
yet.
> PS: Who wrote tr -C and how can I contact the author?
You can read the Subversion logs, but the people involved may no longer be around.
--
Jilles Tjoelker
--
Cedric Blancher <cedric.blancher at googlemail.com>
Institute Pasteur