Cedric Blancher
2013-08-24 17:09:54 UTC
Are there any known issues where a SIGSTOP can trigger multiple
SIGCHLD trap calls with code=STOPPED for the same event?
We're experiencing trouble with this kind of problem, i.e. missing
SIGCHLD state change reports when a child changes from stopped to
running or from running to stopped, on a massive scale if the number of
children exceeds a few hundred processes or if the parent process is
stalled by paging/swapping.
I can't reproduce it with a simple testcase but while searching I once
had this failure:
ksh -x -c 'builtin pids ; integer numsigchld=0
trap "print -v .sh.sig;((numsigchld++))" CHLD
{ while true ; do kill -s STOP $(pids -f "%(pid)d") ; done } & pid=$!
sleep 1 ; kill -CONT $pid
/usr/bin/sleep 1; kill -KILL $pid ; wait $pid
print "$?,${numsigchld}"'
+ builtin pids
+ numsigchld=0
+ typeset -li numsigchld
+ trap 'print -v .sh.sig;((numsigchld++))' CHLD
+ pid=26972
+ sleep 1
+ true
+ pids -f '%(pid)d'
+ kill -s STOP 26972
+ print -v .sh.sig
(
typeset -r -l -i 16 addr=16#3e80000695c
typeset -r -l -i band=0
typeset -r code=STOPPED
typeset -r -i errno=0
typeset -r name=CHLD
typeset -r -i pid=26972
typeset -r -i signo=17
typeset -r -i status=19
typeset -r -i uid=231713
value=(
typeset -r -i int=19
typeset -r -l -i 16 ptr=16#13
)
)
+ ((numsigchld++))
+ kill -CONT 26972
+ true
+ pids -f '%(pid)d'
+ kill -s STOP 26972
+ print -v .sh.sig
(
typeset -r -l -i 16 addr=16#3e80000695c
typeset -r -l -i band=0
typeset -r code=CONTINUED
typeset -r -i errno=0
typeset -r name=CHLD
typeset -r -i pid=26972
typeset -r -i signo=17
typeset -r -i status=0
typeset -r -i uid=231713
value=(
typeset -r -i int=0
typeset -r -l -i 16 ptr=16#0
)
)
+ ((numsigchld++))
+ /usr/bin/sleep 1
./arch/linux.i386-64/bin/ksh: 26972: Stopped (SIGSTOP)
+ print -v .sh.sig
(
typeset -r -l -i 16 addr=16#3e80000695c
typeset -r -l -i band=0
typeset -r code=STOPPED
typeset -r -i errno=0
typeset -r name=CHLD
typeset -r -i pid=26972
typeset -r -i signo=17
typeset -r -i status=19
typeset -r -i uid=231713
value=(
typeset -r -i int=19
typeset -r -l -i 16 ptr=16#13
)
)
+ ((numsigchld++))
+ print -v .sh.sig
(
typeset -r -l -i 16 addr=16#3e80000695c
typeset -r -l -i band=0
typeset -r code=STOPPED
typeset -r -i errno=0
typeset -r name=CHLD
typeset -r -i pid=26972
typeset -r -i signo=17
typeset -r -i status=19
typeset -r -i uid=231713
value=(
typeset -r -i int=19
typeset -r -l -i 16 ptr=16#13
)
)
+ ((numsigchld++))
+ kill -KILL 26972
+ print -v .sh.sig
(
typeset -r -l -i 16 addr=16#3e80000695c
typeset -r -l -i band=0
typeset -r code=KILLED
typeset -r -i errno=0
typeset -r name=CHLD
typeset -r -i pid=26972
typeset -r -i signo=17
typeset -r -i status=9
typeset -r -i uid=231713
value=(
typeset -r -i int=9
typeset -r -l -i 16 ptr=16#9
)
)
+ ((numsigchld++))
+ wait 26972
+ print 265,5
265,5
The SIGCHLD trap was called twice for the same STOP signal, so the total
count of signals is 5 (numsigchld=5) instead of 4.
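
For anyone comparing traces like the one above, a small post-processing sketch may help to spot the repeated report. It assumes the `ksh -x` output was saved to a file and that every trap call prints a `typeset -r code=...` line, as `print -v .sh.sig` does here. The function names `count_sigchld_codes` and `find_repeated_codes` are made up for illustration and are not part of ksh.

```shell
#!/bin/sh
# Hypothetical helpers (not part of ksh): both read a saved trace on stdin.

# Pull out the value of every "typeset -r code=..." line, one per trap call.
extract_codes() {
    sed -n 's/^.*typeset -r code=\([A-Z]*\).*$/\1/p'
}

# Print "CODE count" for each distinct code, sorted by code name.
count_sigchld_codes() {
    extract_codes | sort | uniq -c | awk '{ print $2, $1 }'
}

# Print a line for every report whose code equals the previous report's code.
find_repeated_codes() {
    extract_codes |
    awk 'NR > 1 && $0 == prev { print "report " NR " repeats " $0 } { prev = $0 }'
}

# Demo input: the code sequence from the trace above, in order.
sample='typeset -r code=STOPPED
typeset -r code=CONTINUED
typeset -r code=STOPPED
typeset -r code=STOPPED
typeset -r code=KILLED'

printf '%s\n' "$sample" | count_sigchld_codes
printf '%s\n' "$sample" | find_repeated_codes
```

On the sequence from this trace the second helper flags the fourth report as a repeat of STOPPED. This only detects back-to-back identical codes; it cannot tell whether two reports really belong to the same kernel event.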
Ced
--
Cedric Blancher <cedric.blancher at gmail.com>
Institute Pasteur