[ast-developers] Fwd: Fwd: Where does FreeBSD tr -C differ from tr -c?

Cedric Blancher

2013-04-17 12:52:40 UTC

Glenn, can you take a look at the posting from freebsd-standards? AST
tr -C doesn't ignore unassigned code points as it should.

Ced

---------- Forwarded message ----------
From: Jilles Tjoelker <jilles at stack.nl>
Date: 7 April 2013 22:31
Subject: Re: Fwd: Where does FreeBSD tr -C differ from tr -c?
To: Cedric Blancher <cedric.blancher at googlemail.com>
Cc: freebsd-hackers at freebsd.org, freebsd-standards at freebsd.org

The question remains open and I need help. tr -C is implemented by
FreeBSD tr, but I can't find examples (or a test case) where tr -c
and tr -C differ.

Reading the rationale of POSIX, here is an example of a difference:

% printf 'a\200'|LC_ALL=en_US.US-ASCII tr -cd '\000-\177'|hd
00000000 61 |a|
00000001
% printf 'a\200'|LC_ALL=en_US.US-ASCII tr -Cd '\000-\177'|hd
00000000 61 80 |a.|
00000002

Because the bytes 128..255 are not characters in us-ascii, they cannot
be removed with -Cd, only with -cd.

Here is another difference (using LC_CTYPE=en_US.UTF-8, rest C):

% echo $'\U0001a000'|tr -cd '\U0001a000'|hd
% echo $'\U0001a000'|tr -Cd '\U0001a000'|hd
00000000 f0 9a 80 80 |....|
00000004

The cause is that iswrune(3) returns false for the unassigned code point
U+0001A000.

This may well contain bugs because Unicode adds new characters from time
to time and our tables seem to be updated very rarely.

POSIX also says things about collation order. You may not have detected
this because FreeBSD does not implement LC_COLLATE for multibyte locales
yet.

PS: Who wrote tr -C and how can I contact the author?

You can read the Subversion logs but people may no longer be around.

--
Jilles Tjoelker

--
Cedric Blancher <cedric.blancher at googlemail.com>
Institute Pasteur

Roland Mainz

2013-04-18 11:38:43 UTC

On Wed, Apr 17, 2013 at 2:52 PM, Cedric Blancher

Post by Cedric Blancher
Glenn, can you take a look at the posting from freebsd-standards? AST
tr -C doesn't ignore unassigned code points as it should.

[snip]

Grumpf... I think you're right...
... the trouble is that not all platforms implement the |iswrune()|
function (see http://developer.apple.com/library/ios/#documentation/system/conceptual/manpages_iphoneos/man3/iswrune.3.html)
...

... AFAIK (based on some testing on a FreeBSD system vs. Solaris) the
following |iswrune()| emulation code should work (and we need an iffe
probe for |iswrune()| and a fall-back to the emulation):
-- snip --
#include <stdlib.h>
#include <stdio.h>
#include <locale.h>
#include <wctype.h>

/* emulate |iswrune()|: true if |c| is in any standard character class */
static
int iswrune_emu(wint_t c)
{
    /*
     * we test |iswprint()| first because it usually has
     * the largest number of members and
     * the fastest implementation
     */
    if (iswprint(c))
        return (1);
    if (iswalnum(c) ||
        iswcntrl(c) ||
        iswdigit(c) ||
        iswgraph(c) ||
        iswpunct(c) ||
        iswspace(c) ||
        iswxdigit(c) ||
        iswblank(c) ||
        iswlower(c) ||
        iswupper(c))
        return (1);

    return (0);
}

int main(void)
{
    wint_t i;

    /* use the locale from the environment, e.g. en_US.UTF-8 */
    setlocale(LC_ALL, "");

    puts("#start.");

    for (i=0x3000 ; i < 0x4000 ; i++)
    {
        if (!iswrune_emu(i))
        {
            /* |%lx| expects an |unsigned long| argument */
            printf("code point %lx not assigned.\n",
                (unsigned long)i);
        }
    }

    puts("#done");
    return (EXIT_SUCCESS);
}
-- snip --
(note that |iswprint()| is explicitly separated out to highlight the
performance optimisation)
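
The probe-plus-fallback arrangement mentioned above could look roughly like
the sketch below. HAVE_ISWRUNE is a hypothetical macro standing in for
whatever the iffe probe would define, and rune_check() is an illustrative
name, not part of libast. The sketch assumes that a platform which provides
|iswrune()| declares it in <wctype.h>.

```c
#include <wctype.h>

/*
 * HAVE_ISWRUNE is a hypothetical feature macro (an assumption, not a
 * real libast or iffe name). When it is defined the native |iswrune()|
 * is used, otherwise the standard classes are tested one by one.
 */
static int rune_check(wint_t c)
{
#ifdef HAVE_ISWRUNE
    return iswrune(c) != 0;
#else
    return iswprint(c) || iswcntrl(c) || iswspace(c) || iswblank(c) ||
           iswalnum(c) || iswdigit(c) || iswgraph(c) || iswpunct(c) ||
           iswxdigit(c) || iswlower(c) || iswupper(c);
#endif
}
```

Callers would use rune_check() everywhere and never test the macro
themselves, so the probe result stays in one place.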

Erm... Glenn... what do you think ?

----

Bye,
Roland

P.S.: If we use the emulation then AST regex should (IMO) still
support [:rune:] (through the emulation) ...

--
__ . . __
(o.\ \/ /.o) roland.mainz at nrubsig.org
\__\/\/__/ MPEG specialist, C&&JAVA&&Sun&&Unix programmer
/O /==\ O\ TEL +49 641 3992797
(;O/ \/ \O;)

Cedric Blancher

2013-06-05 00:30:56 UTC

Glenn, are you going to put this fix into AST tr for the next alpha?
IMO filtering unassigned code points is required for a standard
conforming tr -C implementation.

Ced

Glenn Fowler

2013-06-05 05:03:20 UTC

I had posed a question to the posix austin group related to this
and failed to report back to ast-developers

here is the relevant snippet, starting with a response from the group
and my comment

Maybe what you're confusing is the concept of unassigned Unicode
codepoints (a Unicode concept irrelevant to C/POSIX) and invalid
wchar_t values or illegal multibyte sequences (a C/POSIX concept). As
far as C/POSIX is concerned, a multibyte sequence is legal if and only
if it corresponds to a wchar_t value via mbrtowc, and conversely, a
wchar_t value is a valid character if and only if it corresponds to a
multibyte character via wcrtomb. These operations should be inverses;
in particular they should be defined on each other's ranges.

yes there is confusion started on some other threads which contained
references to
int iswrune(wchar_t)
which apparently tests for assigned codepoints
what you just pointed out is exactly what is needed for the POSIX tr
implementation -- basically that unassigned codepoints do not come into play

basically the only tools an application has to test for
a valid multibyte sequence: mbrtowc()
a valid wchar_t: wcrtomb()
iswrune() is a concept outside the scope of posix
any posix standard command that produces error messages inconsistent with
mbrtowc() or wcrtomb(), e.g., via iswrune(), is non-conforming
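
A minimal sketch of the two validity checks named above, assuming the
current locale has already been selected with setlocale(). The helper names
mb_is_valid() and wc_is_valid() are illustrative only and do not come from
libast or POSIX.

```c
#include <limits.h>
#include <string.h>
#include <wchar.h>

/* Return 1 if the first n bytes of s start with a complete, legal
 * multibyte character in the current locale, else 0. */
static int mb_is_valid(const char *s, size_t n)
{
    mbstate_t st;
    wchar_t wc;
    size_t r;

    memset(&st, 0, sizeof(st));     /* start in the initial shift state */
    r = mbrtowc(&wc, s, n, &st);
    /* (size_t)-1 is an illegal sequence (EILSEQ), (size_t)-2 is incomplete */
    return r != (size_t)-1 && r != (size_t)-2;
}

/* Return 1 if wc can be converted to a multibyte character in the
 * current locale, else 0. */
static int wc_is_valid(wchar_t wc)
{
    char buf[MB_LEN_MAX];
    mbstate_t st;

    memset(&st, 0, sizeof(st));
    return wcrtomb(buf, wc, &st) != (size_t)-1;
}
```

Neither helper consults any table of assigned code points, which is the
distinction being drawn here.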

Cedric Blancher

2013-06-05 10:51:18 UTC

Post by Glenn Fowler
I had posed a question to the posix austin group related to this
and failed to report back to ast-developers
here is the relevant snippet, starting with a response from the group
and my comment

Maybe what you're confusing is the concept of unassigned Unicode
codepoints (a Unicode concept irrelevant to C/POSIX) and invalid
wchar_t values or illegal multibyte sequences (a C/POSIX concept). As
far as C/POSIX is concerned, a multibyte sequence is legal if and only
if it corresponds to a wchar_t value via mbrtowc, and conversely, a
wchar_t value is a valid character if and only if it corresponds to a
multibyte character via wcrtomb. These operations should be inverses;
in particular they should be defined on each other's ranges.

yes there is confusion started on some other threads which contained
references to
int iswrune(wchar_t)
which apparently tests for assigned codepoints
what you just pointed out is exactly what is needed for the POSIX tr
implementation -- basically that unassigned codepoints do not come into play

valid multibyte sequence is mbrtowc()
valid wchar_t is wcrtomb()

What about libast's optimized UTF-8 versions of mbrtowc() and
wcrtomb()? They do not filter out unassigned code points, do they?
Aside from that almost all mbrtowc() and wcrtomb() implementations for
UTF-8 (and GBK/JIS too) are designed for speed and do NOT test whether
a codepoint is currently assigned in Unicode or not. They delegate the
problem to iswrune() if available or let the applications test whether
the resulting wchar_t matches at least one isw<class>() or not.

Post by Glenn Fowler
iswrune() is a concept outside the scope of posix

This is not correct. POSIX indirectly defines that a codepoint is only
assigned if one or more of the POSIX isw<class>() functions returns a
match. If none of the standard isw<class>() functions returns a match
then the codepoint is not assigned. iswrune() is only a shortcut, as
Roland's emulation code demonstrates.

PS: iswrune() is not specific to Unicode. It is used in the GBK and
JIS locales to distinguish GBK/JIS versions too.

Ced

Glenn Fowler

2013-06-05 13:52:29 UTC

Post by Cedric Blancher

What about libast's optimized UTF-8 versions of mbrtowc() and
wcrtomb()? They do not filter out unassigned code points, do they?
Aside from that almost all mbrtowc() and wcrtomb() implementations for
UTF-8 (and GBK/JIS too) are designed for speed and do NOT test whether
a codepoint is currently assigned in Unicode or not. They delegate the
problem to iswrune() if available or let the applications test whether
the resulting wchar_t matches at least one isw<class>() or not.

Post by Glenn Fowler
iswrune() is a concept outside the scope of posix

This is not correct. POSIX indirectly defines that a codepoint is only
assigned if one or more of the POSIX isw<class>() functions returns a
match. if none of the standard isw<class>() functions returns a match
then the codepoint is not assigned. iswrune() is only a shortcut, as
Roland's emulation code demonstrates.

nitpicking here
since posix allows an implementation to define extension isw*() classes
there is no portable way to define iswrune() from the outside of any implementation
by "outside the scope" I meant that, within the scope of posix and what it
demands for compliance, "invalid codepoint" is not mentioned
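
The point about extension classes can be illustrated with the standard
wctype()/iswctype() pair: a class can be tested if its name is known, but
standard C and POSIX offer no call that enumerates every class a locale
defines. The helper name in_named_class() below is illustrative only.

```c
#include <wctype.h>

/* Return 1 if wc is a member of the character class called name in the
 * current locale. Return 0 if it is not a member or if the locale has
 * no class with that name. */
static int in_named_class(wint_t wc, const char *name)
{
    wctype_t desc = wctype(name);   /* (wctype_t)0 means "unknown class" */

    if (desc == (wctype_t)0)
        return 0;
    return iswctype(wc, desc) != 0;
}
```

An emulation built this way can only cover the class names it was written
to ask about, which is why it cannot be complete from outside the
implementation.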

the only place "codepoint" is mentioned is in the rationale for pax describing
why they chose UTF-8 as the internal archive format codeset encoding - specifically
because a pax archive used for interchange must be "codepoint" agnostic and
encode all characters
(rationales are not part of the standard proper)

Post by Cedric Blancher
PS: iswrune() is not specific to Unicode. It is used in the GBK and
JIS locales to distinguish GBK/JIS versions too.

the "point" is that posix commands need only report "invalid character encoding"
(EILSEQ) via mbrtowc() and wcrtomb() or equivalent; there is no requirement for
any posix command that it report "invalid codepoint"

it's nice that some implementations provide iswrune() to make it possible
to portably determine "invalid codepoint", but that has no bearing on
any posix compliant command implementation -- if any posix command implementation
were to fail on "invalid codepoint" it would be non-compliant

a command implementation could be extended via options to include "codepoint"
diagnostics, but it would be an extension

Roland Mainz

2013-06-08 21:17:08 UTC

Post by Glenn Fowler

Post by Cedric Blancher

Post by Glenn Fowler
I had posed a question to the posix austin group related to this
and failed to report back to ast-developers

[snip]

Post by Glenn Fowler

Post by Cedric Blancher

Post by Glenn Fowler
iswrune() is a concept outside the scope of posix

This is not correct. POSIX indirectly defines that a codepoint is only
assigned if one or more of the POSIX isw<class>() functions returns a
match. if none of the standard isw<class>() functions returns a match
then the codepoint is not assigned. iswrune() is only a shortcut, as
Roland's emulation code demonstrates.

nitpicking here
since posix allows an implementation to define extension isw*() classes
there is no portable way to define iswrune() from the outside of any implementation

Erm... yes and no... "yes" ... |isw*()| is extensible... but all
extensions so far (at least those I'm aware of on Solaris, AIX, HP/UX,
Linux and FreeBSD) are "extra" (usually to provide extra language- or
culture-specific help) and the same characters have matches in the
|isw*()|-classes defined by POSIX, too... which means that emulating
|iswrune()| the way I did is at least valid on these platforms
(well... FreeBSD, MacOS X and the OpenSolaris-derived Illumos define
|iswrune()| themselves...).

Post by Glenn Fowler
by "outside the scope" I meant that, within the scope of posix and what it
demands for compliance, "invalid codepoint" is not mentioned

Erm... at least for Unicode and GB18030 the issue is not "invalid
codepoint" ... it's "unassigned codepoint". The codepoint itself may
be valid but has no assigned meaning... which also makes it
"unsortable" ... which was AFAIK the FreeBSD rationale behind
filtering unassigned codepoints out (the other issue is that "sorting"
Unicode characters via |strxfrm()| is tricky in this case since unless
the locale has defined a specific "sort order" the characters are
sorted using their numeric codepoint value... which sorts even
technically "unsortable" unassigned code points. Grrr...).

Post by Glenn Fowler
the only place "codepoint" is mentioned is in the rationale for pax describing
why they chose UTF-8 as the internal archive format codeset encoding - specifically
because a pax archive used for interchange must be "codepoint" agnostic and
encode all characters
(rationales are not part of the standard proper)

Post by Cedric Blancher
PS: iswrune() is not specific to Unicode. It is used in the GBK and
JIS locales to distinguish GBK/JIS versions too.

the "point" is that posix commands need only report "invalid character encoding"
(EILSEQ) via mbrtowc() and wcrtomb() or equivalent; there is no requirement for
any posix command that it report "invalid codepoint"

See above... s/invalid codepoint/unassigned codepoint/ ... |EILSEQ|
won't be returned unless the codepoint is beyond the numeric limit for
the matching Unicode standard...

Post by Glenn Fowler
it's nice that some implementations provide iswrune() to make it possible
to portably determine "invalid codepoint", but that has no bearing on
any posix compliant command implementation -- if any posix command implementation
were to fail on "invalid codepoint" it would be non-compliant
a command implementation could be extended via options to include "codepoint"
diagnostics, but it would be an extension

Erm... AFAIK we don't need a "diagnostic" ... AFAIK the wish here
seems to be to "filter out" (maybe using an extra "tr" option) any
characters which do not match either |iswrune()| (if available) or any
of the |isw*()| functions defined by POSIX (maybe we shouldn't name
this class [:rune:] in regex... maybe a better name is
[:_posix_anychar:] ... leading '_' because it is non-standard (for
now) and "posix_anychar" to describe that it should be true if it matches
any character class defined by POSIX).

----

Bye,
Roland

Glenn Fowler

2013-06-09 02:44:59 UTC

I knew I would get into semantic trouble here
I'm not complaining/deriding the efficacy of iswrune()
only that it has no bearing on any posix compliant utility

if anyone wants to start a discussion about new utility option(s)
that rely on iswrune() and what ast utilities should be affected, great

for systems that do not supply iswrune(), portability remains a big issue,
current practice notwithstanding -- it will always be an
iffe|config game of catchup vs. the isw*() collection du jour

Roland Mainz

2013-06-10 01:47:08 UTC

Post by Glenn Fowler
I knew I would get into semantic trouble here
I'm not complaining/deriding the efficacy of iswrune()
only that it has no bearing on any posix compliant utility

OK... here is the question which bothers me:
tr -C does require sorting characters, right? How do we sort
characters which do not have an assigned meaning?

Post by Glenn Fowler
if anyone wants to start a discussion about new utility option(s)
that rely on iswrune() and what ast utilities should be affected, great
for systems that do not supply iswrune() portability remains a big issue,
current practice notwithstanding -- it will always be an
iffe|config game of catchup vs. the isw*() collection du jour

BTW: re |iswrune()| emulation... perl has the regex match
\p{Unassigned} ... which should produce the same matches as this script
(assuming LC_ALL='en_US.UTF-8' and that the locale's Unicode version
matches perl's Unicode version):
-- snip --
set -o nounset

typeset -i16 i

# walk the whole Unicode range, including the last code point U+10FFFF
for (( i=0 ; i <= 0x10FFFF ; i++ )) ; do
    # strip the leading "16#" so that printf gets plain hex digits
    ch="${ printf "\u[${i/~(El)16#/}]" ; }"

    if [[ "$ch" != ~(Elr)[[:alpha:][:alnum:][:digit:][:print:][:cntrl:][:space:][:blank:][:punct:]] ]] ; then
        printf "# match found: %q\n" "${i}"
    fi
done

print '# done.'
-- snip --

|iswrune()| or not... IMO it would be nice to have something like
\p{Unassigned} in normal egrep/xgrep regex, e.g. something like a
[:_unassigned:] character class...

----

Bye,
Roland

Glenn Fowler

2013-06-10 01:50:13 UTC

Post by Roland Mainz

tr -C does require sorting characters, right? How do we sort
characters which do not have an assigned meaning?

strcoll()
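
A minimal sketch of what ordering by collation could look like for wide
characters, using wcscoll(), the wide-character counterpart of strcoll().
The helper names are illustrative only. How code points without an
assigned meaning are ordered is left entirely to the locale's collation
rules.

```c
#include <stdlib.h>
#include <wchar.h>

/* qsort() comparator: order two wide characters by the collation rules
 * of the current locale (LC_COLLATE). */
static int cmp_wc_coll(const void *a, const void *b)
{
    /* wcscoll() compares strings, so wrap each character in one */
    wchar_t sa[2] = { *(const wchar_t *)a, L'\0' };
    wchar_t sb[2] = { *(const wchar_t *)b, L'\0' };

    return wcscoll(sa, sb);
}

/* Sort n wide characters in place by collation order. */
static void sort_by_collation(wchar_t *chars, size_t n)
{
    qsort(chars, n, sizeof(chars[0]), cmp_wc_coll);
}
```

In the default "C" locale this falls back to ordering by code value; other
locales may order the same characters differently.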

Post by Roland Mainz

|iswrune()| or not... IMO it would be nice to have something like
\p{Unassigned} in normal egrep/xgrep regex, e.g. something like a
[:_unassigned:] character class...

[:rune:] would be a fine name for that class

Cedric Blancher

2013-07-22 10:10:32 UTC

Post by Glenn Fowler

[:rune:] would be a fine name for that class

There's still no [:rune:] emulation in libast :(

Ced

--
Cedric Blancher <cedric.blancher at gmail.com>
Institute Pasteur

Glenn Fowler

2013-07-22 14:28:00 UTC

Post by Cedric Blancher

There's still no [:rune:] emulation in libast :(

that looks simple enough
but I'm not convinced it's correct
what about system and user defined classes
(there are notes on the list about some for chinese characters -- I forget the details)
if those aren't handled then why provide a [:rune:] that might work maybe

Cedric Blancher

2013-08-05 19:35:48 UTC

Post by Glenn Fowler

Post by Cedric Blancher

Post by Glenn Fowler

Post by Roland Mainz

Post by Glenn Fowler
I knew I would get into semantic trouble here
I'm not complaining/deriding the efficacy of iswrune()
only that it has no bearing on any posix compliant utility

tr -C does require to sort characters, right ? How do we sort
characters which do not have an assigned meaning ?

strcoll()

Post by Roland Mainz

Post by Glenn Fowler
if anyone wants to start a discussion about new utility option(s)
that rely on iswrune() and what ast utilities should be affected, great
for systems that do not supply iswrune() portability remains a big issue,
current practice notwithstanding -- it will always be an
iffe|config game of catchup vs. the iw*() collection du jour

BTW: re |iswrune()| emulation... perl has the perl regex match
\p{Unassigned} ... which creates the same matches as this script
(assuming LC_ALL='en_US.UTF-8' and locales Unicode version matches the
-- snip --
set -o nounset
typeset -i16 i
for (( i=0 ; i < 0x10FFFF ; i++ )) ; do
    ch="${ printf "\u[${i/~(El)16#/}]" ; }"
    if [[ "$ch" != ~(Elr)[[:alpha:][:alnum:][:digit:][:print:][:cntrl:][:space:][:blank:][:punct:]] ]] ; then
        printf "# match found: %q\n" "${i}"
    fi
done
print '# done.'
-- snip --
|iswrune()| or not... IMO it would be nice to have something like
\p{Unassigned} in normal egrep/xgrep regex, e.g. something like a
[:_unassigned:] character class...

[:rune:] would be a fine name for that class

There's still no [:rune:] emulation in libast :(

that looks simple enough
but I'm not convinced it's correct
what about system and user defined classes
(there are notes on the list about some for chinese characters -- I forget the details)

Maybe Roland can elaborate. He's an expert for such locales.

Post by Glenn Fowler
if those aren't handled then why provide a [:rune:] that might work maybe

Chinese and Japanese locales have extra classes defined by the locale
data, but they are *always* "extra", i.e. the characters have matches
in the basic POSIX character classes but also match extra classes like
isphonogram() or isideogram().

Please, could we get [:rune:] and a --weed-out-non-runes option for
tr(1), please?

Ced

--
Cedric Blancher <cedric.blancher at gmail.com>
Institute Pasteur

Cedric Blancher

2013-09-18 22:19:11 UTC

Post by Cedric Blancher

Post by Glenn Fowler

Post by Cedric Blancher

Post by Glenn Fowler

Post by Roland Mainz

Post by Glenn Fowler
I knew I would get into semantic trouble here
I'm not complaining/deriding the efficacy of iswrune()
only that it has no bearing on any posix compliant utility

tr -C does require to sort characters, right ? How do we sort
characters which do not have an assigned meaning ?

strcoll()

Post by Roland Mainz

Post by Glenn Fowler
if anyone wants to start a discussion about new utility option(s)
that rely on iswrune() and what ast utilities should be affected, great
for systems that do not supply iswrune() portability remains a big issue,
current practice notwithstanding -- it will always be an
iffe|config game of catchup vs. the iw*() collection du jour

BTW: re |iswrune()| emulation... perl has the perl regex match
\p{Unassigned} ... which creates the same matches as this script
(assuming LC_ALL='en_US.UTF-8' and locales Unicode version matches the
-- snip --
set -o nounset
typeset -i16 i
for (( i=0 ; i < 0x10FFFF ; i++ )) ; do
    ch="${ printf "\u[${i/~(El)16#/}]" ; }"
    if [[ "$ch" != ~(Elr)[[:alpha:][:alnum:][:digit:][:print:][:cntrl:][:space:][:blank:][:punct:]] ]] ; then
        printf "# match found: %q\n" "${i}"
    fi
done
print '# done.'
-- snip --
|iswrune()| or not... IMO it would be nice to have something like
\p{Unassigned} in normal egrep/xgrep regex, e.g. something like a
[:_unassigned:] character class...

[:rune:] would be a fine name for that class

There's still no [:rune:] emulation in libast :(

that looks simple enough
but I'm not convinced it's correct
what about system and user defined classes
(there are notes on the list about some for chinese characters -- I forget the details)

Maybe Roland can elaborate. He's an expert for such locales.

Post by Glenn Fowler
if those aren't handled then why provide a [:rune:] that might work maybe

Chinese and Japanese locales have extra classes defined by the locale
data, but they are *always* "extra", i.e. the characters have matches
in the basic POSIX character classes but also match extra classes like
isphonogram() or isideogram().
Please, could we get [:rune:] and a --weed-out-non-runes option for
tr(1), please?

Please?

Ced

--
Cedric Blancher <cedric.blancher at gmail.com>
Institute Pasteur

Glenn Fowler

2013-09-18 23:42:20 UTC

Post by Cedric Blancher

Post by Cedric Blancher

Post by Glenn Fowler

Post by Cedric Blancher

Post by Glenn Fowler

Post by Roland Mainz

Post by Glenn Fowler
I knew I would get into semantic trouble here
I'm not complaining/deriding the efficacy of iswrune()
only that it has no bearing on any posix compliant utility

tr -C does require to sort characters, right ? How do we sort
characters which do not have an assigned meaning ?

strcoll()

Post by Roland Mainz

Post by Glenn Fowler
if anyone wants to start a discussion about new utility option(s)
that rely on iswrune() and what ast utilities should be affected, great
for systems that do not supply iswrune() portability remains a big issue,
current practice notwithstanding -- it will always be an
iffe|config game of catchup vs. the iw*() collection du jour

BTW: re |iswrune()| emulation... perl has the perl regex match
\p{Unassigned} ... which creates the same matches as this script
(assuming LC_ALL='en_US.UTF-8' and locales Unicode version matches the
-- snip --
set -o nounset
typeset -i16 i
for (( i=0 ; i < 0x10FFFF ; i++ )) ; do
    ch="${ printf "\u[${i/~(El)16#/}]" ; }"
    if [[ "$ch" != ~(Elr)[[:alpha:][:alnum:][:digit:][:print:][:cntrl:][:space:][:blank:][:punct:]] ]] ; then
        printf "# match found: %q\n" "${i}"
    fi
done
print '# done.'
-- snip --
|iswrune()| or not... IMO it would be nice to have something like
\p{Unassigned} in normal egrep/xgrep regex, e.g. something like a
[:_unassigned:] character class...

[:rune:] would be a fine name for that class

There's still no [:rune:] emulation in libast :(

that looks simple enough
but I'm not convinced it's correct
what about system and user defined classes
(there are notes on the list about some for chinese characters -- I forget the details)

Maybe Roland can elaborate. He's an expert for such locales.

Post by Glenn Fowler
if those aren't handled then why provide a [:rune:] that might work maybe

Chinese and Japanese locales have extra classes defined by the locale
data, but they are *always* "extra", i.e. the characters have matches
in the basic POSIX character classes but also match extra classes like
isphonogram() or isideogram().

ast regex already handles the extra classes via the posix wctype() and iswctype() apis
if posix adds a "rune" class then ast will just work
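
For illustration only, a minimal C sketch of the wctype()/iswctype()
lookup mentioned above. The helper name in_named_class() is hypothetical
and is not taken from libast. Whether a class such as "rune" exists
depends entirely on the platform's locale data, so the helper reports a
missing class instead of guessing.

```c
#include <wctype.h>

/*
 * Look up a character class by name in the current locale.
 * Returns  1 if the class exists and contains c,
 *          0 if the class exists and does not contain c,
 *         -1 if the locale defines no class with that name.
 * in_named_class() is a hypothetical helper, not a libast function.
 */
int in_named_class(const char *name, wint_t c)
{
    wctype_t t = wctype(name);

    if (t == (wctype_t)0)
        return -1;              /* wctype() returns 0 for unknown names */
    return iswctype(c, t) != 0;
}
```

Calling in_named_class("rune", c) would then start working, with no
further code changes, on any system whose locale defines that class.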

Post by Cedric Blancher

Post by Cedric Blancher
Please, could we get [:rune:] and a --weed-out-non-runes option for
tr(1), please?

Please?

I still don't know how the proposed rune interacts with codesets vs languages
note that all posix mb* and wc* apis deal with codesets independent of the language
where is the oracle that says "this is a rune" and what are its input parameters
and does it vary by language X codeset or just by codeset and how does one track
when the oracle changes its mind or a language changes its mind or when implementations
differ in which codepoints are represented

propose how to provide a wctype() and iswctype() like api for "rune" that ast could use
as an intercept in src/lib/libast/regex/regclass.c and then [[:rune:]] will
be visible everywhere in ast
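
One possible shape for such an intercept, as a rough sketch and not a
proposal for the actual regclass.c code. The names xclass_t,
xclass_lookup() and xclass_test() are hypothetical. The native class is
preferred when the C library knows "rune"; otherwise the sketch falls
back to a union of standard classes, a shortened form of the emulation
posted earlier in the thread. Whether that union is the right oracle is
exactly the open question above.

```c
#include <string.h>
#include <wctype.h>

/* Hypothetical class handle: either a native wctype_t or emulated "rune". */
typedef struct {
    int      is_rune;   /* nonzero: use the emulation below */
    wctype_t native;    /* otherwise: handle returned by wctype() */
} xclass_t;

/* ASSUMPTION: "rune" == member of any of these standard classes. */
static int rune_emulated(wint_t c)
{
    return iswprint(c) || iswgraph(c) || iswcntrl(c) ||
           iswspace(c) || iswblank(c);
}

/* wctype()-like lookup; returns 1 if the name is usable, else 0. */
int xclass_lookup(const char *name, xclass_t *out)
{
    out->is_rune = 0;
    out->native = wctype(name);
    if (out->native != (wctype_t)0)
        return 1;                       /* the C library knows this class */
    if (strcmp(name, "rune") == 0) {
        out->is_rune = 1;               /* fall back to the emulation */
        return 1;
    }
    return 0;
}

/* iswctype()-like test; returns 1 if c is in the class, else 0. */
int xclass_test(wint_t c, const xclass_t *cls)
{
    if (cls->is_rune)
        return rune_emulated(c) != 0;
    return iswctype(c, cls->native) != 0;
}
```

Because the fallback is only consulted when wctype() fails, a system
that later gains a native "rune" class would take over automatically.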

Cedric Blancher

2013-07-22 10:12:48 UTC

Post by Glenn Fowler
I had posed a question to the posix austin group related to this
and failed to report back to ast-developers
here is the relevant snippet, starting with a response from the group
and my comment

Maybe what you're confusing is the concept of unassigned Unicode
codepoints (a Unicode concept irrelevant to C/POSIX) and invalid
wchar_t values or illegal multibyte sequences (a C/POSIX concept). As
far as C/POSIX is concerned, a multibyte sequence is legal if and only
if it corresponds to a wchar_t value via mbrtowc, and conversely, a
wchar_t value is a valid character if and only if it corresponds to a
multibyte character via wcrtomb. These operations should be inverses;
in particular they should be defined on each other's ranges.

yes there is confusion started on some other threads which contained
references to
int iswrune(wchar_t)
which apparently tests for assigned codepoints
what you just pointed out is exactly what is needed for the POSIX tr
implementation -- basically that unassigned codepoints do not come into play

valid multibyte sequence is mbrtowc()
valid wchar_t is wcrtomb()
iswrune() is a concept outside the scope of posix
any posix standard command that produces error messages inconsistent with
mbrtowc() or wcrtomb(), e.g., via iswrune(), is non-conforming
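
The two validity rules above could be sketched in C roughly as follows.
The function names is_valid_wchar() and is_valid_mbchar() are
hypothetical, and the results depend on the locale that is active when
they are called. Note that neither function consults iswrune() or any
Unicode assignment table.

```c
#include <limits.h>
#include <string.h>
#include <wchar.h>

/* A wchar_t is valid if wcrtomb() can encode it in the current locale. */
int is_valid_wchar(wchar_t wc)
{
    char buf[MB_LEN_MAX];
    mbstate_t st;

    memset(&st, 0, sizeof st);          /* start in the initial shift state */
    return wcrtomb(buf, wc, &st) != (size_t)-1;
}

/* The n bytes at s are valid if mbrtowc() decodes them as one character. */
int is_valid_mbchar(const char *s, size_t n)
{
    wchar_t wc;
    mbstate_t st;
    size_t r;

    memset(&st, 0, sizeof st);
    r = mbrtowc(&wc, s, n, &st);
    if (r == (size_t)-1 || r == (size_t)-2)
        return 0;                       /* invalid or incomplete sequence */
    /* r == 0 means the null character, which occupies one byte */
    return (r == 0 ? 1 : r) == n;
}
```

A program would normally call setlocale(LC_ALL, "") first; without it
the checks run against the "C" locale.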

OK

How do we deal with unassigned code points then? I think FreeBSD tr
removing them from a range is valid since they are not characters. Or
doesn't that fit into POSIX?

Ced

--
Cedric Blancher <cedric.blancher at gmail.com>
Institute Pasteur

Glenn Fowler

2013-07-22 14:51:17 UTC

Post by Cedric Blancher

Post by Glenn Fowler
I had posed a question to the posix austin group related to this
and failed to report back to ast-developers
here is the relevant snippet, starting with a response from the group
and my comment

Maybe what you're confusing is the concept of unassigned Unicode
codepoints (a Unicode concept irrelevant to C/POSIX) and invalid
wchar_t values or illegal multibyte sequences (a C/POSIX concept). As
far as C/POSIX is concerned, a multibyte sequence is legal if and only
if it corresponds to a wchar_t value via mbrtowc, and conversely, a
wchar_t value is a valid character if and only if it corresponds to a
multibyte character via wcrtomb. These operations should be inverses;
in particular they should be defined on each other's ranges.

yes there is confusion started on some other threads which contained
references to
int iswrune(wchar_t)
which apparently tests for assigned codepoints
what you just pointed out is exactly what is needed for the POSIX tr
implementation -- basically that unassigned codepoints do not come into play

valid multibyte sequence is mbrtowc()
valid wchar_t is wcrtomb()
iswrune() is a concept outside the scope of posix
any posix standard command that produces error messages inconsistent with
mbrtowc() or wcrtomb(), e.g., via iswrune(), is non-conforming

OK
How do we deal with unassigned code points then? I think FreeBSD tr
removing them from a range is valid since they are not characters. Or
doesn't that fit into POSIX?

on bsd
compare mbrtowc() and wcrtomb() vs. what tr does with posix-only options
if they differ then tr is non-conforming
if bsd tr wants to somehow incorporate its iswrune() then it must do it with
(a) different non-posix option(s)

think about it
you have a script that uses the options in scope here
it works the same way on all posix conforming systems
*except for bsd because it uses iswrune*
is bsd tr conforming?
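
To make the comparison concrete, here is a simplified single-byte
sketch of the two complements, assuming (as in the POSIX tr
description) that -c complements values and -C complements characters,
and that "character" is decided by mbrtowc() alone. The function names
byte_is_char() and complement_bytes() are hypothetical, and multibyte
characters are deliberately left out to keep the sketch short.

```c
#include <string.h>
#include <wchar.h>

/* Return 1 if the single byte b is a complete character in this locale. */
static int byte_is_char(unsigned char b)
{
    char s[1];
    wchar_t wc;
    mbstate_t st;
    size_t r;

    s[0] = (char)b;
    memset(&st, 0, sizeof st);
    r = mbrtowc(&wc, s, 1, &st);
    return r == 0 || r == 1;    /* 0: the null byte, 1: one-byte character */
}

/*
 * Complement a set of byte values.
 * set[b] != 0 means b is in string1; out[b] becomes 1 for each member
 * of the complement. With chars_only == 0 every byte value is a
 * candidate (-c); otherwise only bytes that are characters are (-C).
 * Returns the number of members in the complement.
 */
size_t complement_bytes(const unsigned char set[256], int chars_only,
                        unsigned char out[256])
{
    size_t n = 0;
    int b;

    for (b = 0; b < 256; b++) {
        out[b] = !set[b] && (!chars_only || byte_is_char((unsigned char)b));
        n += out[b];
    }
    return n;
}
```

With string1 covering bytes 0 to 127, the -c complement always holds
the 128 high bytes, while the -C complement holds only those high bytes
that the locale accepts as characters. No assigned/unassigned test is
involved in either case.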

Glenn Fowler

2013-06-05 05:06:19 UTC

the problem with ast tr and [=e=] still remains

Post by Cedric Blancher

Post by Roland Mainz
On Wed, Apr 17, 2013 at 2:52 PM, Cedric Blancher

Post by Cedric Blancher
Glenn, can you take a look at the posting from freebsd-standards? AST
tr -C doesn't ignore unassigned code points as it should be.

[snip]
Grumpf... I think you're right...
... the trouble is that not all platforms implement the |iswrune()|
function (see http://developer.apple.com/library/ios/#documentation/system/conceptual/manpages_iphoneos/man3/iswrune.3.html)
...
... AFAIK (based on some testing on a FreeBSD system vs. Solaris) the
following |iswrune()| emulation code should work (and we need an iffe
-- snip --
#include <stdlib.h>
#include <stdio.h>
#include <locale.h>
#include <wctype.h>
static
int iswrune_emu(wint_t c)
{
    /*
     * we test |iswprint()| first because it has
     * usually the largest number of members and
     * the fastest implementation
     */
    if (iswprint(c))
        return (1);
    if (iswalnum(c) ||
        iswcntrl(c) ||
        iswdigit(c) ||
        iswgraph(c) ||
        iswpunct(c) ||
        iswspace(c) ||
        iswxdigit(c) ||
        iswblank(c) ||
        iswlower(c) ||
        iswupper(c))
        return (1);
    return (0);
}
int main(int ac, char *av[])
{
    wint_t i;
    setlocale(LC_ALL, "");
    puts("#start.");
    for (i=0x3000 ; i < 0x4000 ; i++)
    {
        if (!iswrune_emu(i))
        {
            printf("code point %lx not assigned.\n",
                (long)i);
        }
    }
    puts("#done");
    return (EXIT_SUCCESS);
}
-- snip --
(note that |iswprint()| is explicitly separated out to highlight the
performance optimisation)
Erm... Glenn... what do you think ?
----
Bye,
Roland
P.S.: If we use the emulation then AST regex should (IMO) still
support [:rune:] (through the emulation) ...

Glenn, are you going to put this fix into AST tr for the next alpha?
IMO filtering unassigned code points is required for a standard
conforming tr -C implementation.
Ced
--
Cedric Blancher <cedric.blancher at googlemail.com>
Institute Pasteur


