[ast-developers] A tale of context switching and why its the way to hell (was: Re: AT&T Software Technology ast alpha software download update)

Irek Szczesniak

2013-07-19 20:06:56 UTC

the AT&T Software Technology ast alpha 2013-06-28 source release
has been posted to the download site
http://www.research.att.com/sw/download/alpha/
the package names and md5 checksums are
INIT eddbf89d061348519d86f2618b708a94
ast-base a745a7d4ce6f53c2e4134af4cc835ff7
ast-open fdb74839ff041e34c800c333188a050e
ast-ksh 8f22428cf30af7146bd210664c2fd166
the md5 sums should match the ones listed on the download page

The release is unusable. The new "API" - if it can be called that -
adds wrappers to all syscalls via #define, which breaks down on
OpenBSD and other platforms which already use #defines for security
wrappers. It is also undebuggable, because it adds yet another layer of
hidden complexity. I wouldn't mind if the code called _ast_open() and
friends directly, but hiding it via #define open _ast_open collides
with too many other things, including system libraries and the ability
of normal minds to grok it.
So this won't fly.

thanks for the feedback
unfortunately we don't have access to bsd machines anymore
bsd *never* did headers right
e.g., if posix says
#include <foo.h>
bsd takes it on itself to demand
#include <sys/hack.h>
#include <sys/hackier.h>
#include <foo.h>
so I'm not surprised that we hit macro clashes
send me offlist the files named by
bin/package results path
and if you did more than one build
bin/package results path old
we knew the intercepts would be controversial, especially the varargs ioctl()
but this is the best way we could think of to flush out EINTR problems
that arose from the recent signal/queue storm tests
as far as we can tell few system calls in all ast libraries and commands are
immune from EINTR error returns, including surprising ones like close() and stat()
there is no way we could do the edit to wrap syscalls with restart logic,
possibly just for debugging purposes, in a timely manner
so we did it by default for all ast code via macro black magic
knowing that we may run afoul of others doing similar black magic
as the problems arise we'll address them
for now the default is to always intercept
but there is a way to build with intercepts disabled
-D_AST_INTERCEPT=0
*but don't do this for ast code*
in the future the default could change
the intercept approach addresses many issues
* adding restart logic, macros or not, to every piece of ast code would be fugly
and I would not like editing, debugging or maintaining that code on a daily basis
so please don't submit patches to restartify ast code
* who's to say some other issues like EINTR won't arise tomorrow -- with intercepts
we may have a much easier pathway to address those issues
* any user code that expects to be used in ksh builtins or ast plugins must
do the restart logic -- it's much easier to instruct builtin developers to
"#include <ast.h>" than to "wrap each syscall with foomacro() barmacro()"
and the latter would have to be bullet proof -- not that easy to say months later
"oops, we should have said foomacro(special arg) barmacro(another special arg)";
if the user code doesn't do syscall macro intercepts it should go smoothly,
otherwise the users will have to "-D_AST_INTERCEPT=0" and examine the user
or 3rd party code for EINTR restartedness
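
For readers unfamiliar with the pattern, the restart logic under discussion usually amounts to a retry loop like the one below. This is a generic sketch, not the ast intercept code; read_restart is a hypothetical helper name.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: retry read() for as long as it fails with EINTR. */
ssize_t read_restart(int fd, void *buf, size_t len)
{
    ssize_t n;

    assert(buf != NULL || len == 0);
    do {
        n = read(fd, buf, len);
    } while (n == -1 && errno == EINTR);
    return n;
}
```

Writing such a loop around every call site is the per-call edit the message above wants to avoid. Note that close() is a poor fit for a blind retry: POSIX leaves the state of the descriptor unspecified when close() fails with EINTR.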

My point is: How is restart controlled? Is this going to be a global
option or thread-local? If it's going to be thread-local you will have
to do context switches at library boundaries, i.e. library a applies
its own ast restart settings and calls library b, which applies its own
ast restart settings. That means each call needs code to save and
restore the state.

This sounds simple, yes? Yes, it is simple. At the beginning. For a
small project like 'hello world'.
Unfortunately - for big projects - it isn't simple. Netscape 4 was
such an example, where the good intention of "making it easy, just use
save/restore" paved the way for going from 5 save/restore calls between
module boundaries up to over 400 in Netscape 4.5. This design is
nowadays taught in university programming schools as the "context switch
way to hell", a cautionary tale of what doomed the whole Netscape 4
project, and others as well.
Read again: PRIMARY cause of project failure. I would rather not relive
that experience.

The same cautionary tale applies to the contents of
src/lib/libast/misc/state.c. You have good intentions there. But
this is the way to a hell made out of context switches at the module
boundaries.

You'll find that opinion in line with the design of POSIX. Yes, they
have thread-local variables, but only optionally, and they are not
used in any POSIX API like openat() - it only has AT_FDCWD as global
cwd but no thread-local equivalent. Guess why?

The only exception - by accident and stupidity - has been uselocale()
- but even there, there was a huge fracas when it was introduced, and it
may now be deprecated again at the behest of the NetBSD community
because thread-local variables aren't portable, or are only portable
if you accept that some platforms can implement thread-local variables
only via a table lookup (which makes it very, very slow).

* although syscall restart on interrupt is part of posix, no 2 unix implementations
apply restart in the same way on the same set of syscalls -- e.g., the intersection
between the ast intercepts and any unix implementation != ast intercepts

Could you give details on this one? This may be a gap in either the
specification or conformance testing.

Irek

Glenn Fowler

2013-07-19 21:25:15 UTC

Post by Irek Szczesniak

restart is a global concept with 3 modes for all intercepted calls

(1) default -- fail on EINTR
(2) fail on EINTR unless process restart serial counter changed
in this mode a signal handler can increment the restart serial counter
based on the handler's own state
(3) do not fail on EINTR (i.e., restart) unless process restart serial counter changed
this is the mode ksh will use for itself and all builtins/plugins
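
One possible reading of the three modes, sketched in C. Everything here is an assumption made for illustration: the enum, the variable and function names are hypothetical, and the interpretation of "serial counter changed" (the counter moved while the call was in flight) is a guess, not taken from the ast sources.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/* Hypothetical names; the numbering follows modes (1)-(3) above. */
enum restart_mode {
    RESTART_FAIL = 1,
    RESTART_IF_SERIAL_CHANGED = 2,
    RESTART_UNLESS_SERIAL_CHANGED = 3
};

/* Process-wide settings: one mode, one counter, no per-library state. */
enum restart_mode current_mode = RESTART_FAIL;
volatile sig_atomic_t restart_serial;

/* A signal handler may call this to flip the decision for calls in flight. */
void restart_serial_bump(void)
{
    restart_serial++;
}

/* Decide what to do after an EINTR. */
int should_restart(enum restart_mode mode, int serial_changed)
{
    switch (mode) {
    case RESTART_IF_SERIAL_CHANGED:     return serial_changed;
    case RESTART_UNLESS_SERIAL_CHANGED: return !serial_changed;
    case RESTART_FAIL:
    default:                            return 0;
    }
}

ssize_t intercept_read(int fd, void *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    for (;;) {
        sig_atomic_t serial = restart_serial;  /* snapshot before the call */
        ssize_t n = read(fd, buf, len);

        if (n != -1 || errno != EINTR)
            return n;
        if (!should_restart(current_mode, restart_serial != serial))
            return -1;  /* errno is still EINTR */
    }
}
```

Because the mode and the counter are plain process globals in this sketch, nothing has to be saved or restored when one library calls into another.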

the plan is that 3rd party libraries, including the ast -last and -lcmd,
will not fiddle with the ast intercept restart settings they inherit

standalone commands will operate in mode (1)
ksh and its builtins/plugins will run in mode (3)

3rd party apps using ast libraries would decide on which mode to use
based on the nature of signals the process expects to receive and generate
a shell is a signal-full environment, especially SIGCLD, so mode (2) or (3)
would be appropriate

for the case of 3rd party ksh builtins/plugins that may use non-ast 3rd party libs
it is up to the 3rd party code to know that it will run with many signal deliveries
and that any system call subject to EINTR must be handled properly or that
builtin/plugin will suffer spurious syscall errors caused by EINTR

under this model there is no need to save/restore state between libs
either 3rd party libs compile against ast and play the ast game or
they don't and inject their own code to handle EINTR on any syscall they make
if that EINTR code is done in a bad way the code will behave in a bad way
especially as a ksh plugin

also under this model there are no coding changes (modulo main() => b_foo())
required to convert a standalone command to a builtin

Post by Irek Szczesniak
This sounds simple, yes? Yes, it is simple. At the beginning. For a
small project like 'hello world'.
Unfortunately - for big projects - it isn't simple. Netscape 4 was
such an example, where the good intention of "making it easy, just use
save/restore" paved the way for going from 5 save/restore calls between
module boundaries up to over 400 in Netscape 4.5. This design is
nowadays taught in university programming schools as the "context switch
way to hell", a cautionary tale of what doomed the whole Netscape 4
project, and others as well.
Read again: PRIMARY cause of project failure. I would rather not relive
that experience.
The same cautionary tale applies to the contents of
src/lib/libast/misc/state.c. You have good intentions there. But
this is the way to a hell made out of context switches at the module
boundaries.

LOCAL(cwd) in state.c/intercept.c is there for experimentation
right now it's AT_FDCWD

Post by Irek Szczesniak
You'll find that opinion in line with the design of POSIX. Yes, they
have thread-local variables, but only optionally, and they are not
used in any POSIX API like openat() - it only has AT_FDCWD as global
cwd but no thread-local equivalent. Guess why?
The only exception - by accident and stupidity - has been uselocale()
- but even there, there was a huge fracas when it was introduced, and it
may now be deprecated again at the behest of the NetBSD community
because thread-local variables aren't portable, or are only portable
if you accept that some platforms can implement thread-local variables
only via a table lookup (which makes it very, very slow).

Could you give details on this one? This may be a gap in either the
specification or conformance testing.

it's a gaping hole between the posix spec and implementations
posix says any call that may fail with EINTR shall be restarted for signals
with SA_RESTART set by sigaction()
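
For reference, this is roughly how a handler is installed with SA_RESTART through sigaction(). It is a minimal sketch; on_signal and install_restarting_handler are hypothetical names, and the handler body is deliberately empty.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Empty handler: its only job is to make the signal "caught". */
static void on_signal(int signo)
{
    (void)signo;
}

/* Install on_signal for signo and ask the kernel to restart interrupted calls. */
int install_restarting_handler(int signo)
{
    struct sigaction sa;

    assert(signo > 0);
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(signo, &sa, NULL);
}
```

Whether a given call is actually restarted under this flag is the implementation-dependent part the rest of this message complains about.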

implementations add EINTR at whim to random syscalls, possibly controlled
by underlying drivers, and possibly undocumented, but fail to make the
SA_RESTART connection

just look at man sigaction(2) signal(2) signal(7) for a few systems
near the "restart" descriptions and you'll see wishy-washy language like
"certain calls" "slow devices" "bsd semantics (but not on a bsd system)"
nothing to make a portable sw designer cozy, and no mention of stat() or close()
for which we already have evidence are EINTR-able *and* cause spurious errors in
ksh plugins in particular

Glenn Fowler

2013-07-19 21:26:10 UTC

btw, thanks for the comments
