Discussion:
[ast-developers] Hang when counting invalid character byte sequence in GB18030...
Roland Mainz
2013-09-06 07:43:40 UTC
Hi!

----

The following testcase hangs ast-ksh.2013-09-04 in the zh_CN.GB18030
locale in a busy-loop (with and without my patches) ... but only on
Illumos/AMD64/64bit:
-- snip --
typeset -r utf8_euro_char2=$'\342\202\254'
(( ${#utf8_euro_char2} ))
-- snip --

$ bash -c 'LC_ALL=zh_CN.GB18030 ~/bin/ksh x.sh' # ...
... hang...

I have no clue why... ;-(

Sample stack trace looks like this:
-- snip --
dbx: warning: Interrupt ignored but forwarded to child.
signal INT (Interrupt) in _GB18030_mbrtowc at 0xfffffd7fff20eb48
0xfffffd7fff20eb48: _GB18030_mbrtowc+0x0178: movl $0x0000000000000058,(%rax)
Current function is charlen
2666 else while(mbchar(str))
(dbx) where
[1] _GB18030_mbrtowc(0xfffffd7fff2e9428, 0xfffffd80064ae489, 0xffffffd0, 0x0, 0x4, 0x4), at 0xfffffd7fff20eb48
[2] mbtowc(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff210a94
=>[3] charlen(string = 0xfffffd7fff0029f0 "\xe2\x82\xac", len = -1), line 2666 in "macro.c"
[4] varsub(mp = 0xfffffd7ffefc4820), line 1553 in "macro.c"
[5] copyto(mp = 0xfffffd7ffefc4820, endch = 0, newquote = 1), line 634 in "macro.c"
[6] sh_macexpand(shp = 0x6cc0c8, argp = 0xfffffd7ffeffe850, arghead = (nil), flag = 4352), line 245 in "macro.c"
[7] sh_macpat(shp = 0x6cc0c8, arg = 0xfffffd7ffeffe850, flags = 256), line 423 in "macro.c"
[8] sh_exec(shp = 0x6cc0c8, t = 0xfffffd7ffeffe880, flags = 4), line 2528 in "xec.c"
[9] exfile(shp = 0x6cc0c8, iop = 0xfffffd7ffefc24d0, fno = 11), line 603 in "main.c"
[10] sh_main(ac = 2, av = 0xfffffd7fffdffb38, userinit = (nil)), line 375 in "main.c"
[11] main(argc = 2, argv = 0xfffffd7fffdffb38), line 45 in "pmain.c"
(dbx) print str
str = 0xfffffd80064ae489 "<bad address 0x64ae489>"
-- snip --

I guess the code somehow doesn't cope well with the fact that
$'\342\202\254' isn't a valid character sequence in the GB18030
multibyte encoding... ;-(
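
For illustration only, here is a minimal sketch of a character-counting loop that cannot spin on an invalid sequence. It is not the ksh charlen() code and not a proposed patch; count_chars is a hypothetical name, and the policy of counting each undecodable byte as one character is an assumption. It only uses the standard mbrtowc() in the current locale:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: count the characters in buf[0..len) using the
 * current locale. Every undecodable byte counts as one character, so
 * the position advances on each iteration and the loop terminates. */
size_t count_chars(const char *buf, size_t len)
{
    mbstate_t st;
    size_t count = 0;
    size_t pos = 0;

    assert(buf != NULL);
    memset(&st, 0, sizeof st);
    while (pos < len) {
        size_t n = mbrtowc(NULL, buf + pos, len - pos, &st);

        if (n == (size_t)-1 || n == (size_t)-2) {
            /* Invalid or truncated sequence: the conversion state is
             * unspecified after an error, so reset it and skip one byte. */
            memset(&st, 0, sizeof st);
            n = 1;
        } else if (n == 0) {
            /* An embedded NUL decodes to L'\0' and occupies one byte. */
            n = 1;
        }
        pos += n;
        count++;
    }
    return count;
}
```

The point of the sketch is that both error returns of mbrtowc() are mapped to a one-byte advance, so the pointer never stays put and never runs past the end of the buffer.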

-- snip --
$ cat x.sh
builtin wc
printf '\342\202\254' | wc -m -X
$ bash -c 'LC_ALL=zh_CN.GB18030 ./arch/sol11.i386-64/bin/ksh x.sh'
wc: warning: 0xac: invalid multibyte character byte
1 1
-- snip --

----

Bye,
Roland
--
__ . . __
(o.\ \/ /.o) roland.mainz at nrubsig.org
\__\/\/__/ MPEG specialist, C&&JAVA&&Sun&&Unix programmer
/O /==\ O\ TEL +49 641 3992797
(;O/ \/ \O;)
Glenn Fowler
2013-09-07 14:48:40 UTC
found the solaris iconv problem in 5 min after sleeping on it
the following command sequences use native commands -- no ast involved

# u.dat is a UTF-32LE file containing <upper-case-u-umlaut><newline> #

$ od -tx1 u.dat
0000000 dc 00 00 00 0a 00 00 00
0000010

# on linux.i386-64
$ /usr/bin/iconv -f UTF-32LE -t US-ASCII < u.dat
/usr/bin/iconv: illegal input sequence at position 0
$ echo $?
1

# on sol11.i386
$ /bin/iconv -f UTF-32LE -t US-ASCII < u.dat
?
$ echo $?
0

solaris is *bad* in at least 3 ways
* it apparently detects a conversion error but does not issue a diagnostic
* it apparently detects a conversion error and substitutes '?' for "bad" bytes
* it apparently detects a conversion error but exits 0

who knows what liberties other implementations may take
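
A minimal sketch of how a caller could guard against this kind of silent substitution, assuming the platform's iconv(3) at least reports substitutions through its return value (POSIX describes the return value as the number of non-identical conversions; this thread does not establish whether Solaris does so). strict_utf32le_to_ascii is a hypothetical name and this is not the _ast_iconv change:

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert UTF-32LE bytes to US-ASCII.
 * Returns the number of bytes written, or -1 if any character could
 * not be converted exactly. */
long strict_utf32le_to_ascii(const char *in, size_t inlen,
                             char *out, size_t outlen)
{
    iconv_t cd;
    char *inp = (char *)in;  /* iconv() takes a non-const char ** */
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft = outlen;
    size_t r;

    assert(in != NULL && out != NULL);
    cd = iconv_open("US-ASCII", "UTF-32LE");
    if (cd == (iconv_t)-1)
        return -1;
    r = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);

    /* r == (size_t)-1: hard error (EILSEQ, EINVAL or E2BIG).
     * r > 0: the implementation made non-identical conversions,
     *        e.g. it substituted a replacement character.
     * inleft != 0: the input was not fully consumed.
     * All three are treated as failure here. */
    if (r != 0 || inleft != 0)
        return -1;
    return (long)(outlen - outleft);
}
```

With the u.dat bytes from above (dc 00 00 00 0a 00 00 00) this sketch is expected to fail on an implementation that rejects the input as well as on one that substitutes and reports it, which is the consistency being asked for.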

I wonder if ast, in the C/POSIX locale and MB_CUR_MAX==1, should have
strict and non-strict conformance modes

strict: US-ASCII: characters are 7 bit bytes, bytes with bit 0x80 set are invalid
non-strict: ISO-8859-1: characters are 8 bit bytes
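
The two modes can be sketched as a single predicate. The enum and function names below are hypothetical and only restate the definitions above; they are not existing ast interfaces:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mode names for the C/POSIX locale with MB_CUR_MAX==1. */
enum c_locale_mode { C_LOCALE_STRICT, C_LOCALE_NONSTRICT };

/* Returns true if the byte is a valid character in the given mode. */
bool c_locale_byte_is_char(unsigned char byte, enum c_locale_mode mode)
{
    assert(mode == C_LOCALE_STRICT || mode == C_LOCALE_NONSTRICT);
    if (mode == C_LOCALE_STRICT)
        return byte < 0x80;  /* US-ASCII: bit 0x80 set means invalid */
    return true;             /* ISO-8859-1: every 8 bit byte is a character */
}
```

Under this sketch the byte 0xac from the wc warning earlier in the thread is invalid in strict mode and an ordinary character in non-strict mode.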

non-strict would match linux C locale behavior
strict would match whose behavior?

I believe posix gives wiggle room here for the C locale to have chars with bit 0x80 set
ast in non-strict mode will simply apply that wiggle room consistently across
all of its os/arch implementations

I guess what I'm really saying is that ast *will* be consistent across all implementations

the question then is: in the C locale is the ast behavior always strict or is it tempered
by astconf("CONFORMANCE")?
ольга крыжановская
2013-09-07 14:55:02 UTC
Glenn, would it help to go through UTF-8 instead of UTF-32LE before
calling iconv, i.e. for \u[] do "integer to UTF-8" first and then
iconv(UTF-8 to local encoding)?
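
A minimal sketch of the "integer to UTF-8" step, for illustration; utf8_encode is a hypothetical name, not an existing ast function, and the result would still have to be passed to iconv() for the conversion to the local encoding:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
 * Returns the number of bytes stored in out (1..4), or 0 if cp is a
 * surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;  /* surrogates are not valid scalar values */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For U+20AC this yields the bytes e2 82 ac, i.e. the $'\342\202\254' sequence from the original testcase.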

Olga
--
, _ _ ,
{ \/`o;====- Olga Kryzhanovska -====;o`\/ }
.----'-/`-/ olga.kryzhanovska at gmail.com \-`\-'----.
`'-..-| / http://twitter.com/fleyta \ |-..-'`
/\/\ Solaris/BSD//C/C++ programmer /\/\
`--` `--`
Glenn Fowler
2013-09-08 05:25:06 UTC
the example code sequence is the exact sequence used by the new utf32s2wcs()

I don't know if doing the UTF-8 sequence makes solaris behave as expected
and even if it did, what does that mean about how utf32s2wcs() is coded?
how many implementations have yet more ingenious paths to do what
the current code should have done?

I'll tweak _ast_iconv to just do the right thing