[ast-developers] ksh93 double byte space handling

Discussion:

lijo george

2017-04-25 11:35:25 UTC

Hi,

The attached testscript has a leading double byte space separator before
the for loop closing "done" keyword. This fails with a syntax error while
parsing.

Is it a bug or is it expected behaviour?

I've tried it with the ksh93u+ and ksh93v- versions on a Solaris setup.
bash and zsh also fail, hence I'm thinking it might not be a bug, but
could someone please confirm this?

Here's a sample output.

***@S11_3_SRU:~# echo $LANG
ja_JP.UTF-8
***@S11_3_SRU:~# cat space.ksh
#!/bin/ksh
for i in 1 2
do
echo $i
done # leading double byte space character
***@S11_3_SRU:~# od -xc space.ksh
0000000 2321 2f62 696e 2f6b 7368 0a66 6f72 2069
# ! / b i n / k s h \n f o r i
0000020 2069 6e20 3120 320a 646f 0a65 6368 6f20
i n 1 2 \n d o \n e c h o
0000040 2469 0ae3 8080 646f 6e65 0a00
$ i \n 343 200 200 d o n e \n
0000053
***@S11_3_SRU:~# ksh --version
version sh (AT&T Research) 93u+ 2012-08-01
***@S11_3_SRU:~# ksh space.ksh
space.ksh: syntax error at line 6: `for' unmatched
***@S11_3_SRU:~# ./ksh-2014
***@S11_3_SRU:~# echo ${.sh.version}
Version AIJMP 93v- 2014-12-24
***@S11_3_SRU:~# ./space.ksh
./space.ksh: syntax error at line 6: `for' unmatched
***@S11_3_SRU:~#

Thanks,
Lijo

lijo george

2017-04-25 12:42:32 UTC

Thanks for the suggestion, Philippe.
But I'm a bit confused, though: isn't "0xe3 0x80 0x80" the UTF-8
representation of the space character?

Thanks,
Lijo
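
As a quick check of the bytes in the od dump, here is a minimal sketch that rebuilds them with printf and compares them to an ASCII space. The helper name hex_of is made up for illustration. Octal 343 200 200 is hex e3 80 80, which is the UTF-8 encoding of U+3000 IDEOGRAPHIC SPACE. The ASCII space is the single byte 0x20, so the two are different characters, and octal 200 (0x80) on its own is not a space either.

```shell
# hex_of is a hypothetical helper: print the bytes of its argument as hex.
hex_of() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

# The three bytes shown as 343 200 200 in the od dump (octal escapes).
ideographic_space=$(printf '\343\200\200')
ascii_space=' '

echo "bytes before done: $(hex_of "$ideographic_space")"   # e38080
echo "ASCII space:       $(hex_of "$ascii_space")"         # 20
```

Removing only the first of the three bytes would leave two stray 0x80 bytes in front of "done", so all three have to go (or be replaced by a plain space or tab).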

On Tue, Apr 25, 2017 at 5:49 PM, Philippe Bergheaud <

Post by lijo george
0000040 2469 0ae3 8080 646f 6e65 0a00
$ i \n 343 200 200 d o n e \n

You should remove the (invisible) character 0343 (0xe3), before the two
spaces.
Philippe

lijo george

2017-04-30 19:31:41 UTC

So I guess the observed behaviour is not a bug but intended behaviour.

It's interesting that this used to work in the old ksh88 version, which
might have been due to a less complicated parsing mechanism.

Thanks,
Lijo

I'm going to consider this _without_ looking at the ksh source, because
mortals will at most look at documentation (and because documentation
should be accurate enough that they shouldn't _have_ to look at source).
My very cursory reading of the man page* is a bit ambiguous about whether
that is allowed:
    A blank is a tab or a space. An identifier is a sequence of letters,
    digits, or underscores starting with a letter or underscore.
    Identifiers are used as components of variable names. A vname is a
    sequence of one or more identifiers separated by a . and optionally
    preceded by a .. Vnames are used as function and variable names. A
    word is a sequence of characters from the character set defined by
    the current locale, excluding non-quoted metacharacters.
"A blank is a tab or a space" is more restrictive than "A word is a
sequence of characters from the character set defined by the current
locale, excluding non-quoted metacharacters". And if I try a vertical
tab, formfeed, or carriage return (all plain ASCII characters classified as
white space by isspace(3)) before "done", I get the same error. So it
looks like the more restrictive interpretation holds: only tabs and the
basic space character are acceptable in the code as white space. Of
course, anything should be ok in a quoted string (except whatever closes
the quotes); or rather, anything except a null byte, which does NOT work**
(ksh isn't perl - the latter goes out of its way to tolerate just about
anything).
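
The "only tab and space count as blanks" reading can be checked without looking at the source. What follows is a minimal sketch, assuming a POSIX-style shell whose -n option parses a script without executing it. SHELL_UNDER_TEST, syntax_ok, and report are names invented here for illustration. The sketch rebuilds the loop from the original report with a chosen separator directly before "done" and asks the shell whether it parses.

```shell
# SHELL_UNDER_TEST is a hypothetical variable naming the shell to check;
# it defaults to sh. Set it to ksh, bash, or zsh to compare.
: "${SHELL_UNDER_TEST:=sh}"

# syntax_ok SEP: build the loop from the report with SEP placed directly
# before "done", then parse it with -n (no execution).
syntax_ok() {
    printf 'for i in 1 2\ndo\necho $i\n%sdone\n' "$1" |
        "$SHELL_UNDER_TEST" -n 2>/dev/null
}

# report LABEL SEP: print whether the generated script parses.
report() {
    if syntax_ok "$2"; then
        echo "$1: parses"
    else
        echo "$1: syntax error"
    fi
}

report "ASCII space"       ' '
report "tab"               "$(printf '\t')"
report "vertical tab"      "$(printf '\v')"
report "ideographic space" "$(printf '\343\200\200')"
```

With the shells mentioned in this thread, the space and tab cases should parse and the ideographic space case should be rejected, because the three bytes become part of the word in front of "done" and the keyword is never seen.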
However, I wouldn't do it even if it did work, because then the script
only works in an appropriate (UTF-8) locale; it would certainly be an error
in the C locale regardless. If it were me, I would confine anything that is
not sensible in the C locale to quoted string constants; one does NOT want
code that does nasty things depending on which locale is in use.
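
One way to act on that advice is to scan a script for bytes that are neither tab, newline, nor printable ASCII before running it. This is only a sketch: count_suspect_bytes and suspect_lines are hypothetical helper names, and the temporary file stands in for the space.ksh from the original report. LC_ALL=C is set so that tr and sed work on single bytes whatever the caller's locale is.

```shell
# count_suspect_bytes FILE: count bytes other than tab, newline, and
# printable ASCII (octal 040 through 176).
count_suspect_bytes() {
    LC_ALL=C tr -d '\11\12\40-\176' < "$1" | wc -c | tr -d ' '
}

# suspect_lines FILE: print the numbers of the lines containing such bytes.
suspect_lines() {
    tab=$(printf '\t')
    LC_ALL=C sed -n "/[^${tab} -~]/=" "$1"
}

# Usage sketch on a stand-in for the script from the report.
tmp=${TMPDIR:-/tmp}/space.$$.ksh
printf '#!/bin/ksh\nfor i in 1 2\ndo\necho $i\n\343\200\200done\n' > "$tmp"
count_suspect_bytes "$tmp"   # prints 3
suspect_lines "$tmp"         # prints 5
rm -f "$tmp"
```

The scan cannot tell code from quoted strings, so a hit inside a string constant is harmless and still needs a human to look at it.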
* ${.sh.version} on my Mac is Version AJM 93u+ 2012-08-01, which I gather
is reasonably current. :-)
** For example, with a null byte inside a quoted string:
0000000 # ! / b i n / k s h \n \n e c h
0000020 o " \0 t e s t i n g " \n
0000035
$ ./tryme.ksh
./tryme.ksh: syntax error at line 3: `zero byte' unexpected

_______________________________________________
ast-users mailing list
http://lists.research.att.com/mailman/listinfo/ast-users