[ast-developers] Changing the libast |*mb*()| functions to always take a |mbstate_t| ?
Roland Mainz
2013-09-16 16:39:38 UTC
Hi!

----

While doing some GB18030 testing I found a disturbing issue:
Many calls to the |*mb*()| functions are made without considering the
current shift state. The problem is that this state is a hidden global
variable and is easily overlooked. It is made worse by the fact that
UTF-8 can recover from invalid shift states, so UTF-8 locales don't
suffer from the problem and the bugs go unnoticed ... until they cause
problems for shift-state-dependent encodings like
GBK/GB18030/Shift_JIS.

My preferred solution would be to change the current libast mb API to
always take a |mbstate_t| argument. This would fix this issue (by
making the shift state explicit), fix issues with nested calls, e.g.
if we are in a specific shift state and then call a utility function
which operates on a different string ... and fix thread-safety
issues with the "hidden" global variable containing the current shift
state...
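
To make the proposal concrete, here is a rough sketch of what an explicit-state helper could look like. The names |xmb_next()| and |xmb_count()| are hypothetical and are not part of libast; only |mbrtowc()| and |mbstate_t| are standard C. It is meant as an illustration of the idea, not as the actual patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical explicit-state "next character" helper (not a libast
 * function). Decodes one character from [*sp, end) and advances *sp.
 * Returns 1 on success, 0 at end of input or at a NUL character,
 * -1 on an invalid or truncated sequence.
 */
static int xmb_next(const char **sp, const char *end, wchar_t *wc, mbstate_t *st)
{
    size_t r;

    assert(sp != NULL && *sp != NULL && st != NULL);
    if (*sp >= end)
        return 0;
    r = mbrtowc(wc, *sp, (size_t)(end - *sp), st);
    if (r == (size_t)-1 || r == (size_t)-2)
        return -1;
    if (r == 0)
        return 0;               /* decoded a NUL character */
    *sp += r;
    return 1;
}

/* The caller owns the state, so nothing is shared between calls or threads. */
static size_t xmb_count(const char *s, size_t n)
{
    mbstate_t st;
    const char *end = s + n;
    wchar_t wc;
    size_t count = 0;

    memset(&st, 0, sizeof st);  /* start in the initial conversion state */
    while (xmb_next(&s, end, &wc, &st) == 1)
        count++;
    return count;
}
```

Because the state lives in the caller's stack frame, two threads (or two nested callers) running |xmb_count()| at the same time cannot disturb each other.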

----

Bye,
Roland
--
__ . . __
(o.\ \/ /.o) roland.mainz at nrubsig.org
\__\/\/__/ MPEG specialist, C&&JAVA&&Sun&&Unix programmer
/O /==\ O\ TEL +49 641 3992797
(;O/ \/ \O;)
Wendy Lin
2013-09-16 17:26:54 UTC
Well, this may explain why ksh93 sometimes has lapses when it has to
process characters which are not encoded in UTF-8. bash 4 handles
this flawlessly, but only because they use |mbstate_t ps; memset(&ps, 0,
sizeof(mbstate_t)); wcrtomb()| everywhere. Even using a single mb
function without |mbstate_t| can render your whole application useless.
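
For readers unfamiliar with that idiom, a minimal sketch follows. The function name |wide_to_mb()| is made up for illustration and is not taken from bash or ksh93; the point is only that the state is a local variable which is zeroed once per string before |wcrtomb()| is called.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: convert a NUL-terminated wide string to a
 * multibyte string using a private, freshly zeroed mbstate_t.
 * Returns the number of bytes written (excluding the terminator),
 * or (size_t)-1 on a conversion error or if dst is too small.
 */
static size_t wide_to_mb(char *dst, size_t dstlen, const wchar_t *src)
{
    mbstate_t ps;
    char tmp[MB_LEN_MAX];
    size_t used = 0;

    assert(dst != NULL && src != NULL);
    if (dstlen == 0)
        return (size_t)-1;
    memset(&ps, 0, sizeof(mbstate_t));      /* initial conversion state */
    for (; *src != L'\0'; src++) {
        size_t r = wcrtomb(tmp, *src, &ps);

        /* keep one byte free for the terminating NUL */
        if (r == (size_t)-1 || used + r >= dstlen)
            return (size_t)-1;
        memcpy(dst + used, tmp, r);
        used += r;
    }
    dst[used] = '\0';
    return used;
}
```

The state is reset once per string rather than once per character, so an encoding with real shift sequences would still see one continuous conversion.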

Q: Why doesn't POSIX deprecate the mb functions which do not take an
|mbstate_t|? The mistake ksh93 makes is easy to make and very hard to
rectify.

Wendy
Roland Mainz
2013-09-16 17:59:53 UTC
The problem is *NOT* the choice of function... the problem is that we
use a global (or semi-global) state variable.
Technically every single utility function (this includes all
multibyte-aware functions in libshell and libcmd, too) should have its
own |mbstate_t|.
A major issue we found is that a multibyte character string is being
processed... and in the middle of that processing we call something
else which operates on a different multibyte character stream. As a
result of this nesting, or even of calling other (possibly buggy)
system library functions, the |mbstate_t| state used by the caller
gets corrupted. And that's causing trouble all over the place.
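
The nesting case can be sketched like this. Both function names are hypothetical; the outer loop walks one string and, in the middle of that, calls a helper which walks a different string. Since each function keeps its own |mbstate_t|, the helper cannot corrupt the caller's state.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Helper with its own private state: counts characters in s[0..n). */
static size_t label_length(const char *s, size_t n)
{
    mbstate_t st;
    size_t count = 0;

    assert(s != NULL);
    memset(&st, 0, sizeof st);
    while (n > 0) {
        size_t r = mbrlen(s, n, &st);

        if (r == (size_t)-1 || r == (size_t)-2 || r == 0)
            break;              /* invalid, truncated, or NUL: stop */
        s += r;
        n -= r;
        count++;
    }
    return count;
}

/*
 * Outer loop over `text`: for every character of `text`, the length
 * of `label` is added to the total. The nested call operates on a
 * different string but never touches `st`.
 */
static size_t labelled_width(const char *text, size_t n, const char *label)
{
    mbstate_t st;
    size_t total = 0;

    assert(text != NULL && label != NULL);
    memset(&st, 0, sizeof st);
    while (n > 0) {
        size_t r = mbrtowc(NULL, text, n, &st);

        if (r == (size_t)-1 || r == (size_t)-2 || r == 0)
            break;
        total += label_length(label, strlen(label));   /* nested decode */
        text += r;
        n -= r;
    }
    return total;
}
```

With a single hidden state, the call to the helper in the middle of the outer loop would be exactly the situation described above.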
There are two reasons this doesn't cause much trouble yet:
1. Most people use UTF-8-based locales, which have recovery built into
the encoding itself
2. Many system i18n multibyte handling functions automagically recover
from invalid states without returning an error. But not all of them can
(e.g. because the encoding isn't designed in such a way) or will
(to keep the code simple+easy+fast and to force correct programming).

For example some GBK/GB18030 implementations can do it (like on
Solaris, which uses IBM's OpenGroup i18n/multibyte implementation) but
not Illumos/OpenSolaris and FreeBSD, which use a different
i18n/multibyte implementation. As a result some stuff works on Solaris
but causes endless loops or data corruption on
Illumos/OpenSolaris/FreeBSD/etc. ... ;-(
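
One way to avoid the endless-loop variant of this, sketched under the assumption that skipping a single byte is an acceptable recovery policy: check both error returns of |mbrtowc()|, reset the state after an invalid sequence (the C standard leaves the state unspecified in that case), and always advance. The struct and function names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical result type: characters decoded vs. bytes/tails rejected. */
struct scan_result {
    size_t valid;
    size_t invalid;
};

static struct scan_result scan_bytes(const char *s, size_t n)
{
    struct scan_result res = { 0, 0 };
    mbstate_t st;

    assert(s != NULL || n == 0);
    memset(&st, 0, sizeof st);
    while (n > 0) {
        size_t r = mbrtowc(NULL, s, n, &st);

        if (r == (size_t)-2) {          /* truncated sequence at end of buffer */
            res.invalid++;
            break;
        }
        if (r == 0)                     /* NUL character: treat as end of data */
            break;
        if (r == (size_t)-1) {          /* invalid sequence */
            memset(&st, 0, sizeof st);  /* state is unspecified now, so reset it */
            res.invalid++;
            r = 1;                      /* skip one byte so the loop makes progress */
        } else {
            res.valid++;
        }
        s += r;
        n -= r;
    }
    return res;
}
```

Whether a given byte counts as valid depends on the current locale and on the platform's implementation, which is the portability problem described above; the loop itself terminates either way.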
Post by Wendy Lin
Q: Why doesn't POSIX deprecate mb functions which do not use a
mbstate_t? The mistake ksh93 does is easy to make and so hard to
rectify.
Erm... for simple utilities the global state _sounds_ like an easy
choice... but given the trouble you can end up in by simply ignoring the
fact that multibyte encodings can have a state, I wish these functions
had never been invented... ;-/

----

Bye,
Roland
Roland Mainz
2013-09-16 18:23:27 UTC
Glenn: Are the following functions *always* available, on all platforms
you can test, when multibyte support is enabled?
-- snip --
mbrlen()
mbrtowc()
wcrtomb()
wcsrtombs()
mbsrtowcs()
-- snip --

If that's true then most of the patch is just a simple switch-over:
add states and maybe add some new functions which accept a state
object (for cases where we start in the middle of a string and have to
restart over and over again)
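
As a rough idea of what such a state-accepting function might look like (the name |count_chunk()| is hypothetical), the caller passes in the |mbstate_t| and can therefore stop in the middle of the data and resume later with the same state object:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Count the characters completed within s[0..n), carrying the
 * conversion state in *st across calls. If a chunk ends in the middle
 * of a character, mbrtowc() returns (size_t)-2 and *st remembers the
 * partial bytes, so the next call can finish that character.
 */
static size_t count_chunk(const char *s, size_t n, mbstate_t *st)
{
    size_t count = 0;

    assert(s != NULL && st != NULL);
    while (n > 0) {
        size_t r = mbrtowc(NULL, s, n, st);

        if (r == (size_t)-2)    /* incomplete: all n bytes were consumed */
            break;
        if (r == (size_t)-1 || r == 0)
            break;              /* invalid sequence or NUL: stop here */
        s += r;
        n -= r;
        count++;
    }
    return count;
}
```

After the last chunk, |mbsinit(st)| tells the caller whether the data ended on a character boundary (non-zero) or in the middle of a sequence (zero).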

----

Bye,
Roland