Discussion:
[ast-developers] CHLD/${.sh.sig.code}=@(CONTINUED|STOPPED) traps reliable?
Cedric Blancher
2013-08-24 17:09:54 UTC
Are there any known issues where a SIGSTOP can trigger multiple
SIGCHLD trap calls with code=STOPPED for the same event?

We're experiencing trouble with this kind of problem, i.e. a lack of
SIGCHLD state change reports when a child changes from stopped to
running or from running to stopped, on a massive scale if the number of
children exceeds a few hundred processes or if the parent process is
stalled by paging/swapping.

I can't reproduce it with a simple testcase but while searching I once
had this failure:
ksh -x -c 'builtin pids ; integer numsigchld=0 ;
trap "print -v .sh.sig;((numsigchld++))" CHLD ;
{ while true ; do kill -s STOP $(pids -f "%(pid)d") ; done } & pid=$! ;
sleep 1 ; kill -CONT $pid ; /usr/bin/sleep 1; kill -KILL $pid ;
wait $pid ; print "$?,${numsigchld}"'
+ builtin pids
+ numsigchld=0
+ typeset -li numsigchld
+ trap 'print -v .sh.sig;((numsigchld++))' CHLD
+ pid=26972
+ sleep 1
+ true
+ pids -f '%(pid)d'
+ kill -s STOP 26972
+ print -v .sh.sig
(
typeset -r -l -i 16 addr=16#3e80000695c
typeset -r -l -i band=0
typeset -r code=STOPPED
typeset -r -i errno=0
typeset -r name=CHLD
typeset -r -i pid=26972
typeset -r -i signo=17
typeset -r -i status=19
typeset -r -i uid=231713
value=(
typeset -r -i int=19
typeset -r -l -i 16 ptr=16#13
)
)
+ ((numsigchld++))
+ kill -CONT 26972
+ true
+ pids -f '%(pid)d'
+ kill -s STOP 26972
+ print -v .sh.sig
(
typeset -r -l -i 16 addr=16#3e80000695c
typeset -r -l -i band=0
typeset -r code=CONTINUED
typeset -r -i errno=0
typeset -r name=CHLD
typeset -r -i pid=26972
typeset -r -i signo=17
typeset -r -i status=0
typeset -r -i uid=231713
value=(
typeset -r -i int=0
typeset -r -l -i 16 ptr=16#0
)
)
+ ((numsigchld++))
+ /usr/bin/sleep 1
./arch/linux.i386-64/bin/ksh: 26972: Stopped (SIGSTOP)
+ print -v .sh.sig
(
typeset -r -l -i 16 addr=16#3e80000695c
typeset -r -l -i band=0
typeset -r code=STOPPED
typeset -r -i errno=0
typeset -r name=CHLD
typeset -r -i pid=26972
typeset -r -i signo=17
typeset -r -i status=19
typeset -r -i uid=231713
value=(
typeset -r -i int=19
typeset -r -l -i 16 ptr=16#13
)
)
+ ((numsigchld++))
+ print -v .sh.sig
(
typeset -r -l -i 16 addr=16#3e80000695c
typeset -r -l -i band=0
typeset -r code=STOPPED
typeset -r -i errno=0
typeset -r name=CHLD
typeset -r -i pid=26972
typeset -r -i signo=17
typeset -r -i status=19
typeset -r -i uid=231713
value=(
typeset -r -i int=19
typeset -r -l -i 16 ptr=16#13
)
)
+ ((numsigchld++))
+ kill -KILL 26972
+ print -v .sh.sig
(
typeset -r -l -i 16 addr=16#3e80000695c
typeset -r -l -i band=0
typeset -r code=KILLED
typeset -r -i errno=0
typeset -r name=CHLD
typeset -r -i pid=26972
typeset -r -i signo=17
typeset -r -i status=9
typeset -r -i uid=231713
value=(
typeset -r -i int=9
typeset -r -l -i 16 ptr=16#9
)
)
+ ((numsigchld++))
+ wait 26972
+ print 265,5
265,5

SIGCHLD trap was called twice for the STOP signal and the total count
of signals is 5 (numsigchld=5) instead of 4.
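
For longer traces, the count per code can be checked mechanically. The following is only a sketch in plain POSIX shell and not part of the original test: `tally_codes` is a made-up helper name, and it assumes the trace was saved in the `print -v .sh.sig` format shown above.

```shell
# Hypothetical helper: count how often each code= value appears in a
# saved `print -v .sh.sig` trace read from standard input.
# Prints one "<count> <code>" line per distinct code, sorted by code.
tally_codes() {
    sed -n 's/^.*typeset -r code=\([A-Z]*\).*$/\1/p' |
        sort | uniq -c | awk '{ print $1, $2 }'
}

# The five code= lines from the trace above, in order of appearance.
tally_codes <<'EOF'
typeset -r code=STOPPED
typeset -r code=CONTINUED
typeset -r code=STOPPED
typeset -r code=STOPPED
typeset -r code=KILLED
EOF
# prints:
# 1 CONTINUED
# 1 KILLED
# 3 STOPPED
```

Three STOPPED reports against a single CONTINUED report is what exposes the duplicate.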

Ced
--
Cedric Blancher <cedric.blancher at gmail.com>
Institute Pasteur
Irek Szczesniak
2013-08-26 08:14:15 UTC
On Sat, Aug 24, 2013 at 7:09 PM, Cedric Blancher
Post by Cedric Blancher
Are there any known issues where a SIGSTOP can trigger multiple
SIGCHLD trap calls with code=STOPPED for the same event?
We're experiencing trouble with this kind of problem, i.e. a lack of
SIGCHLD state change reports when a child changes from stopped to
running or from running to stopped, on a massive scale if the number of
children exceeds a few hundred processes or if the parent process is
stalled by paging/swapping.
SIGCHLD trap was called twice for the STOP signal and the total count
of signals is 5 (numsigchld=5) instead of 4.

We can reproduce the problem, and I think I can explain it: ksh93
doesn't use the siginfo data created by the kernel to manage job
control. Instead it polls all jobs for their state and generates
artificial siginfo structures to use for .sh.sig. Of course that works
only if the process doing the job management is fast enough and the
managed children don't change state faster than the parent process can
poll.

IMO a fix would be to use the SIGCHLD siginfo data created by the
kernel for job management: the kernel creates the SIGCHLD siginfo
data, it gets queued in the sh_fault() trap handler like other siginfo
data, and it is then processed outside the trap handler to update the
internal jobs structure, call the .sh.sig machinery (if trap "..." CHLD
is active) and finally dispose of the data.
This would solve your problem and also solve the performance problem
which occurs if you poll thousands of child processes for state
changes.
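
The difference between the two approaches can be illustrated with a toy model. This is only a sketch in plain POSIX shell and not ksh93's actual job-control code: the function names `queued_reports` and `polled_reports` and the "lag" parameter are made up for illustration, and each argument stands for one real state change of a child.

```shell
# Toy model only; names and the "lag" parameter are hypothetical.

# Queue-based consumer: every state change is recorded when it happens
# and reported exactly once, in order.
queued_reports() {
    for ev in "$@"; do
        printf '%s\n' "$ev"
    done
}

# Polling consumer: it only gets to look at the child after every
# $1 state changes (a slow parent) and reports the state it samples then.
polled_reports() {
    lag=$1
    shift
    n=0
    for ev in "$@"; do
        n=$((n + 1))
        # Changes between two polls are never observed.
        if [ $((n % lag)) -eq 0 ]; then
            printf '%s\n' "$ev"
        fi
    done
}

queued_reports STOPPED CONTINUED STOPPED KILLED
# prints STOPPED, CONTINUED, STOPPED, KILLED (one per line)

polled_reports 2 STOPPED CONTINUED STOPPED KILLED
# prints CONTINUED, KILLED (one per line); both STOPPED changes are missed
```

With a lag of 1 the poller keeps up and both functions print the same four reports, which matches the observation that polling works only while the parent is fast enough.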

Irek
