Discussion:
[ast-developers] Name of LC_OPTIONS=unicode?
Wendy Lin
2013-09-19 08:49:47 UTC
I have a request about LC_OPTIONS=unicode. I believe the name
'unicode' is too generic and should better describe what it does.
The first patch from Roland Mainz I saw used set -o convunicode, for
"convert to unicode". I think this, or 'convunicodeliterals', would be
a more fitting and descriptive name.

Opinions?

Wendy

---------- Forwarded message ----------
From: Glenn Fowler <gsf at research.att.com>
Date: 14 September 2013 00:15
Subject: [ast-developers] AT&T Software Technology ast alpha software
download update
To: ast-developers at research.att.com



the AT&T Software Technology ast alpha 2013-09-13 source release
has been posted to the download site
http://www.research.att.com/sw/download/alpha/
the package names and md5 checksums are
INIT 327861e49e24dd51079c0a5316a4b2fe
ast-open dfb85d1dfb20acb8a1529bdf4b8cb89a
ast-ksh 746a556a2259aaa6d75468000e5bc36b
the md5 sums should match the ones listed on the download page

the change logs below are limited to ksh and libast
the libast changes involved a lot of meticulous multibyte code
that was hashed out off-list between { gsf roland olga }

there is a "news" link in the left side nav bar that will
be used to detail implemented and proposed ast features
as features mature the news info will migrate to the man pages

if your favorite bug/feature is not in the list below then it
hasn't been addressed yet and we don't know exactly when it will be

changes since 2013-08-29

:::::::: ksh93 ::::::::

13-09-13 --- Release ksh93v- ---
13-09-13 +The signal .sh.value variable is now a compound variable with the name
value.q corresponding to kill -q signed-integer and value.Q corresponding
to kill -Q unsigned-large-integer.
13-09-13 A bug in $(...) command substitution that corrupted a trailing
multibyte character in non-UTF-8 locales has been fixed.
13-09-13 Eliminated extraneous output to standard error when ksh is invoked
with the -v (verbose) option.
13-09-10 A bug in finding a function defined inside a type that was defined
in a namespace has been fixed.
13-09-10 A bug in the binding of function local variables inside arithmetic
expressions inside namespaces was fixed.
13-09-10 +A -Q option was added to kill to pass integers as large as pointers.
The -q option now only accepts integers as large as typeset -i.
13-09-09 A bug in command substitution has been fixed.
13-09-09 Qualified print format "%([no]unicode)q" added to prefer \u[...]
over \w[...] and override LC_OPTIONS=unicode.
13-09-04 +\w[hex] locale-specific code point literals have been added.
13-09-04 +The float(f) math function was added.
13-09-04 +The int(f) math function was fixed to return 0 for floating point
numbers larger than the maximum integer.
13-09-04 A bug in which assigning a compound indexed array a value of () did
not preserve the -C attribute has been fixed.
13-09-04 kill -q can now pass numbers as large as typeset -li and
.sh.sig.value is typeset -i rather than a compound variable.
13-09-04 kill -q yields the processor and returns 2 when sigqueue fails with
EAGAIN and yield.
13-09-03 A bug in which $((x.xxx)) where x is a floating point variable and
xxx is not one of the known extensions yields a random value has
been fixed. It now is unset which has value 0 when set -u is off.
13-09-03 A bug in overriding discipline functions for types defined in
namespaces has been fixed.
13-09-03 A bug which on some systems caused a core dump for large <<< here
documents has been fixed.

:::::::: libast ::::::::

13-09-12 misc/fgetcwd.c: fix stat corruption bug on systems without fdopendir()
13-09-12 path/pathcanon.c: fix bug that added extra / when fgetcwd() returned /
13-09-09 comp/setlocale.c,port/codeset.c: consistent handling of
US-ASCII + conformance(0,0) across all os's
13-09-09 string/utf8towc.c,string/wctoutf8.c: add { utf8toutf32()
utf8towc() wctoutf8() }
13-09-07 include/ast_std.c,comp/setlocale.c,string/stresc.c: add
ast.byte_max for single byte locales
13-09-06 comp/iconv.c: add sfclrerr() to iconv_move() if all input
chars not consumed
13-09-06 port/codeset.h,port/codeset.c: internal api for retrieving
locale codeset names
13-09-04 string/chresc.c,stresc.c: add \w[hex] support -- thanks Roland
13-09-04 string/utf32stowcs.c,string/wcstoutf32s.c: add -- thanks Roland
13-09-04 sfio/sfsetbuf.c: fix bug where SF_GETR mode was not cleared
causing subsequent memory corruption
13-09-04 vmalloc/vmopen.c,vmdcsystem.c,vmstat.c: temporarily set
vm->meth.meth=0 to disable vmstat() during init
13-09-04 port/intercept.c: include <ast_standards.h> to ensure
fdopendir() prototype if _lib_fdopendir
13-09-04 include/ast_std.h,comp/setlocale.c: add LC_OPTIONS=unicode
and AST_LC_unicode
13-09-01 path/pathcanon.c: O_* flags dev path:
/dev/file/flags@flag[,flag...]@[/]path
13-09-01 path/pathcanon.c: limit NAMED_XATTR paths to
/dev/file/xattr@canonical-path//@//[remainder]
13-08-29 cdt/dtstrhash.c: change sign-bit hitting fnv constants to hex
to silence unsigned warnings
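
Editorial illustration, not part of the release notes: the entries above mention \u[...] Unicode code point literals and new libast helpers such as wctoutf8(). As a rough sketch of what turning a Unicode code point into UTF-8 bytes involves, the POSIX shell function below encodes one code point given in hex. The name utf8_encode is hypothetical and this is not the libast implementation. It assumes a valid code point and does no range or surrogate checking.

```shell
# utf8_encode HEX: print the UTF-8 byte sequence for Unicode code point U+HEX.
# Hypothetical helper for illustration only; input is assumed to be valid hex.
utf8_encode() {
    code=$((0x$1))
    if [ "$code" -lt 128 ]; then
        # 1 byte: 0xxxxxxx
        set -- "$code"
    elif [ "$code" -lt 2048 ]; then
        # 2 bytes: 110xxxxx 10xxxxxx
        set -- $((192 | (code >> 6))) $((128 | (code & 63)))
    elif [ "$code" -lt 65536 ]; then
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        set -- $((224 | (code >> 12))) $((128 | ((code >> 6) & 63))) \
               $((128 | (code & 63)))
    else
        # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        set -- $((240 | (code >> 18))) $((128 | ((code >> 12) & 63))) \
               $((128 | ((code >> 6) & 63))) $((128 | (code & 63)))
    fi
    for byte in "$@"; do
        # %b with a \0ooo octal escape is the portable way to emit a raw byte.
        printf '%b' "\\0$(printf '%03o' "$byte")"
    done
}
```

For example, `utf8_encode e9 | od -An -tx1` should show the two bytes c3 a9 for U+00E9. A \w[hex] literal, by contrast, names a code point in the current locale's own codeset, so no such fixed mapping applies to it.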
Cedric Blancher
2013-09-19 13:46:53 UTC
Post by Wendy Lin
I have a request about LC_OPTIONS=unicode. I believe the name
'unicode' is too generic
+1
Post by Wendy Lin
and should better describe what it does.
The first patch from Roland Mainz I saw used set -o convunicode, for
"convert to unicode". I think this, or 'convunicodeliterals', would be
a more fitting and descriptive name.
"convunicodeliterals" is too long. Either "unicodeliterals" or
"convunicode" would do it nicely :)

Ced
--
Cedric Blancher <cedric.blancher at gmail.com>
Institute Pasteur
Irek Szczesniak
2013-09-25 15:05:31 UTC
Post by Cedric Blancher
Post by Wendy Lin
I have a request about LC_OPTIONS=unicode. I believe the name
'unicode' is too generic
+1
Post by Wendy Lin
and should better describe what it does.
The first patch from Roland Mainz I saw used set -o convunicode, for
"convert to unicode". I think this, or 'convunicodeliterals', would be
a more fitting and descriptive name.
"convunicodeliterals" is too long. Either "unicodeliterals" or
"convunicode" would do it nicely :)
I'd prefer unicodeliterals, but would accept convunicode, too. Just
unicode is too generic. But I am also concerned about Olga's comment
about print -C to print compound variables using such literals. How do
we do that without tinkering with LC_OPTIONS each time? Add -U/+U as
requested by Olga?

Irek
Glenn Fowler
2013-09-25 16:05:25 UTC
Post by Irek Szczesniak
I'd prefer unicodeliterals, but would accept convunicode, too. Just
unicode is too generic. But I am also concerned about Olga's comment
about print -C to print compound variables using such literals. How do
we do that without tinkering with LC_OPTIONS each time? Add -U/+U as
requested by Olga?
there are a few other ksh places where this may have an effect
typeset -p and maybe a few other places where ksh offers a -p option
to produce output that can be re-consumed by the shell
there's probably a connection with -x tracing too

"unicodeliterals" is a fine solution and I'll put that in right now
but I think it would be good to step back just a bit and list all
of the places where "unicodeliterals" should take effect, at first
*without proposing a solution*

for ksh we already have

print -C
typeset -p
set -x

any others, in or out of ksh?
Irek Szczesniak
2013-09-25 16:20:00 UTC
Post by Glenn Fowler
there are a few other ksh places where this may have an effect
typeset -p and maybe a few other places where ksh offers a -p option
to produce output that can be re-consumed by the shell
there's probably a connection with -x tracing too
"unicodeliterals" is a fine solution and I'll put that in right now
but I think it would be good to step back just a bit and list all
of the places where "unicodeliterals" should take effect, at first
*without proposing a solution*
for ksh we already have
print -C
typeset -p
set -x
any others, in or out of ksh?
print -v, and print %B for compound variables.

IMO a good point for -U/+U is: They are used in actual I/O to create
compound variable streams (one of the most undocumented and
undervalued features in ksh93, which has greatly helped us with our
scripts. Just to praise it here because it solved the problems of
parsing, data version control (just add more fields if you need them
without breaking backwards compatibility) and performance (compared to
streaming XML)).

typeset -p is IMO just used internally and set -x is for diagnostics,
right? Has anyone ever tried to parse that?

Irek
Glenn Fowler
2013-09-25 16:37:09 UTC
Post by Irek Szczesniak
print -v, and print %B for compound variables.
IMO a good point for -U/+U is: They are used in actual I/O to create
compound variable streams (one of the most undocumented and
undervalued features in ksh93, which has greatly helped us with our
scripts. Just to praise it here because it solved the problems of
parsing, data version control (just add more fields if you need them
without breaking backwards compatibility) and performance (compared to
streaming XML)).
typeset -p is IMO just used internally and set -x is for diagnostics,
right? Has anyone ever tried to parse that?
dgk can correct me on this
but the idea behind typeset -p is to be able to save portions of ksh context
to be consumed later, possibly in a different { locale system platform }
so the consumer for typeset -p is ksh so it better be as portable w.r.t unicodeliterals
Irek Szczesniak
2013-09-25 17:30:18 UTC
Post by Glenn Fowler
dgk can correct me on this
but the idea behind typeset -p is to be able to save portions of ksh context
to be consumed later, possibly in a different { locale system platform }
so the consumer for typeset -p is ksh so it better be as portable w.r.t unicodeliterals
... and sometimes it needs to be as fast as hell which rules out the
use of \u[] in non-unicode locales. But IMO the *common* usage is
printf %q for string literals and print -C and print -v for compound
variable trees or arrays of compound variables. print -U/+U are about
making it easier to access. Non-common usage is covered by
LC_OPTIONS=unicodeliterals

Irek
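
Editorial illustration: the "common usage" of printf %q described above is to produce a string literal that the shell itself can read back. A minimal sketch, assuming bash or ksh93 (%q is an extension and is not available in every POSIX sh), follows. The exact spelling of the quoted form differs between shells, versions and locales, which is what the \u[...] versus \w[...] discussion is about, so the sketch only checks that the value survives the round trip.

```shell
# Runs in bash or ksh93; %q is a shell extension, not POSIX printf.
original='two words, a "quote" and an é'

# Shell-escaped form of the value; its exact spelling depends on the
# shell and on the current locale.
quoted=$(printf '%q' "$original")

# Let the shell re-read its own escaped form.
eval "restored=$quoted"

if [ "$restored" = "$original" ]; then
    echo "round trip ok"
else
    echo "round trip failed"
fi
```

The same idea applies to print -C and print -v output for compound variables in ksh93, which is where the -U/+U request comes from.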
ольга крыжановская
2013-09-21 00:39:27 UTC
Wendy, I agree with you that 'unicode' is the wrong name. 'Unicode'
refers to the Unicode standard, but defines no action, or purpose.
IMHO it should be renamed, back to 'convunicode'.

There is still an issue with print -U/+U - that option was replaced
with %(unicode)q, but the replacement is of little use because it has
no effect on compound variables, i.e. print -C comvar or print -v
comvar. IMHO we are better off removing %(unicode)q, because it
only affects a single format and is therefore useless for everything
else. If no one objects, I will make a patch to remove printf
%(unicode)q and reintroduce print -U/+U instead.

Olga
--
, _ _ ,
{ \/`o;====- Olga Kryzhanovska -====;o`\/ }
.----'-/`-/ olga.kryzhanovska at gmail.com \-`\-'----.
`'-..-| / http://twitter.com/fleyta \ |-..-'`
/\/\ Solaris/BSD//C/C++ programmer /\/\
`--` `--`
David Korn
2013-09-25 20:57:33 UTC
Post by Glenn Fowler
dgk can correct me on this
but the idea behind typeset -p is to be able to save portions of ksh context
to be consumed later, possibly in a different { locale system platform }
so the consumer for typeset -p is ksh so it better be as portable w.r.t unicodeliterals
Yes, the purpose of typeset -p is to generate an environment that can be
reinput on any system with the . command.

Of course the type definitions have to be output first.

David Korn
dgk at research.att.com
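
Editorial illustration: the save-and-reinput use of typeset -p described above can be sketched as follows. This assumes bash or ksh93 (typeset is not part of plain POSIX sh) and uses mktemp for a scratch file. The variable names are made up for the example, and only simple string variables are shown. User-defined types would need their definitions written out first, as noted above.

```shell
# Runs in bash or ksh93; typeset -p is not available in a plain POSIX sh.
greeting='héllo wörld'
count=42

# Save the definitions in a form the shell can read back.
state=$(mktemp)
typeset -p greeting count > "$state"

# Simulate a fresh shell that does not know the variables yet.
unset greeting count

# Re-input the saved context with the . command.
. "$state"
rm -f "$state"

printf '%s %s\n' "$greeting" "$count"
```

How non-ASCII characters are spelled in the saved file is the part that the unicodeliterals option is meant to control, so that a file written in one locale can be read back in another.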
