Discussion:
[ast-developers] [patch] vmalloc |mmap(MAP_ANON)| fixes for fragmentation issues (on Solaris) ... / was: Re: [ast-users] Severe performance regression between ksh 2010-03-05 and 2013-10-08
Roland Mainz
2013-12-09 21:53:05 UTC
On Thu, Dec 5, 2013 at 4:50 PM, Irek Szczesniak <iszczesniak at gmail.com>
On Wed, Dec 4, 2013 at 3:02 PM, Glenn Fowler <glenn.s.fowler at gmail.com>
On Sun, Dec 1, 2013 at 4:58 PM, Lionel Cons <lionelcons1972 at gmail.com>
On 1 December 2013 17:26, Glenn Fowler <glenn.s.fowler at gmail.com>
I believe this is related to vmalloc changes between 2013-05-31 and
2013-06-09
re-run the tests with
export VMALLOC_OPTIONS=getmem=safe
if that's the problem then it gives a clue on a general solution
details after confirmation
timex ~/bin/ksh -c 'function nanosort { typeset -A a ; integer k=0;
while read i ; do key="$i$((k++))" ; a["$key"]="$i" ; done ; printf
Version AIJMP 93v- 2013-10-08
real 34.60
user 33.27
sys 1.19
VMALLOC_OPTIONS=getmem=safe timex ~/bin/ksh -c 'function nanosort {
typeset -A a ; integer k=0; while read i ; do key="$i$((k++))" ;
"${.sh.version}" ; nanosort <xxx >yyy'
Version AIJMP 93v- 2013-10-08
real 15.34
user 14.67
sys 0.52
So your hunch that VMALLOC_OPTIONS=getmem=safe fixes the problem is
correct.
What does VMALLOC_OPTIONS=getmem=safe do?
vmalloc has an internal discipline/method for getting memory from the
system
several methods are available with varying degrees of thread safety etc.
see src/lib/libast/vmalloc/vmdcsystem.c for the code
and src/lib/libast/vmalloc/malloc.c for the latest VMALLOC_OPTIONS
description (vmalloc.3 update shortly)
** getmemory=f      enable f[,g] getmemory() functions if supported,
**                  all by default
**     anon:        mmap(MAP_ANON)
**     break|sbrk:  sbrk()
**     native:      native malloc()
**     safe:        safe sbrk() emulation via mmap(MAP_ANON)
**     zero:        mmap(/dev/zero)
i believe the performance regression with "anon" is that on linux
mmap(0....MAP_ANON|MAP_PRIVATE...), which lets the system decide the
address, returns adjacent (when possible) region addresses from highest
to lowest order
and the reverse order at minimum tends to fragment more memory
"zero" has the same hi=>lo characteristic
i suspect it adversely affects the vmalloc coalescing algorithm but have
not dug deeper
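
To check the hi=>lo ordering described above on a particular system, a minimal sketch along the following lines could be used. The helper names map_regions() and count_descending() are made up for illustration and are not part of vmalloc, and the result depends entirely on the kernel and its address-space layout.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/*
 * Map n anonymous regions of len bytes each and let the kernel choose
 * every address (first mmap() argument is NULL).
 * Stores the addresses in out[] and returns the number of successful
 * mappings. The regions are intentionally left mapped so that later
 * mappings cannot reuse the same address.
 */
int map_regions(void **out, int n, size_t len)
{
    int ok = 0;
    int i;

    for (i = 0; i < n; i++) {
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANON, -1, 0);
        if (p == MAP_FAILED)
            break;
        out[ok++] = p;
    }
    return ok;
}

/*
 * Count how many mappings landed at a lower address than the mapping
 * made immediately before them.
 */
int count_descending(void *const *addr, int n)
{
    int down = 0;
    int i;

    for (i = 1; i < n; i++)
        if ((uintptr_t)addr[i] < (uintptr_t)addr[i - 1])
            down++;
    return down;
}
```

A count close to n-1 for, say, eight 1MB regions would be consistent with the highest-to-lowest behaviour; a count near zero would suggest ascending addresses.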
for now the probe order in vmalloc/vmdcsystem.c was simply changed to
favor "safe"
MAP_FIXED should be avoided because it's only there for special
purposes like the runtime linker ld.so.1 or debuggers.
Using this for a general-purpose memory allocator causes serious
problems:
1. On some systems this is a privileged operation and only available
for users with root privileges
2. On a SPARC T4 with 256GB and Solaris 11.1 the use of 'safe' degraded
the performance from 9 seconds to almost 15 minutes because it utterly
destroys the system's concept of large pages. If two MAP_FIXED mappings
directly follow each other the system downgrades the page size to the
smallest possible size, even trying to break up larger pages, which in
turn must be done by a special daemon (vmtasks)
3. MAP_PRIVATE|MAP_FIXED|MAP_ANON may no longer be available in future
versions of Solaris
4. Using the 'safe' allocator on SmartOS (Solaris 11 clone) triggers a
mmap(0xFFFFCD800B482000, 1048576, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_ANON, 4294967295, 0) = 0xFFFFCD800B482000
sigaction(SIGSEGV, 0xFFFFFD7FFFDFDE50, 0xFFFFFD7FFFDFDED0) = 0
Incurred fault #6, FLTBOUNDS %pc = 0x0052FE06
siginfo: SIGSEGV SEGV_MAPERR addr=0xFFFFCD800B582000
Received signal #11, SIGSEGV [caught]
siginfo: SIGSEGV SEGV_MAPERR addr=0xFFFFCD800B582000
lwp_sigmask(SIG_SETMASK, 0x00000400, 0x00000000, 0x00000000,
0x00000000) = 0xFFBFFEFF [0xFFFFFFFF]
edit src/lib/libast/vmalloc/vmmaddress.c and change
#define VMCHKMEM 0
this affects vmalloc detecting overbooked memory but will disable the
MAP_FIXED codepath
Erm... Solaris (|__SunOS|) was once (pre-vmalloc-rewrite) "exempt"
from this functionality since it cannot overcommit memory (except if
someone uses |MAP_NORESERVE| or uses kernel debugging options in
/etc/system) ...
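
For readers unfamiliar with the mechanism: the comment next to VMCHKMEM mentions signal and sigsetjmp, and the truss output above shows a SIGSEGV being caught. A generic sketch of such an address probe could look as follows. This is not the vmmaddress.c code; probe_readable() and its helpers are hypothetical names, and the sketch assumes a single-threaded caller.

```c
#define _DEFAULT_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>   /* only needed by callers that set up mappings */

/* Jump target used by the fault handler (hypothetical, not from vmalloc). */
static sigjmp_buf probe_env;

/* On SIGSEGV/SIGBUS, abandon the faulting read and return to the probe. */
static void probe_fault(int sig)
{
    (void)sig;
    siglongjmp(probe_env, 1);
}

/*
 * Try to read one byte at addr.
 * Returns 1 if the byte is readable, 0 if the read faulted and
 * -1 if the signal handlers could not be installed.
 * Not thread-safe: handlers and jump buffer are process-wide.
 */
int probe_readable(const volatile char *addr)
{
    struct sigaction sa, old_segv, old_bus;
    volatile int readable = 0;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = probe_fault;
    sigemptyset(&sa.sa_mask);

    if (sigaction(SIGSEGV, &sa, &old_segv) != 0)
        return -1;
    if (sigaction(SIGBUS, &sa, &old_bus) != 0) {
        sigaction(SIGSEGV, &old_segv, NULL);
        return -1;
    }

    /* savemask=1: the signal mask is restored after siglongjmp() */
    if (sigsetjmp(probe_env, 1) == 0) {
        char c = *addr;     /* faults if the page is not readable */
        (void)c;
        readable = 1;
    }

    /* put the previous handlers back */
    sigaction(SIGBUS, &old_bus, NULL);
    sigaction(SIGSEGV, &old_segv, NULL);
    return readable;
}
```

A probe like this only works where a fault is reliably delivered as a catchable signal, which is presumably why VMCHKMEM can be switched off.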

... attached (as
"astksh20131010_vmalloc_sunos_fragmentation_fix001.diff.txt") is a
patch which...
1. ... restores this exception for Solaris

2. ... bumps the |mmap()| size to 4MB for 32bit processes and 16MB for
64bit processes since both values are more or less the points where
the fragmentation stops. Note that this does *not* mean it will use that
much memory... it only means that it reserves this amount of memory,
and the real allocation happens on the first read, write or execute
access of the matching MMU page. This also means there is no
performance difference between a 1MB |mmap(MAP_ANON)| and a 128MB
|mmap(MAP_ANON)| since it only reserves memory but does not
initialise/allocate it yet... this happens the first time it's
accessed. The other reasons for the 4MB/16MB size were: x86 has 2MB
large pages, allowing a ksh process to benefit from such pages;
additionally, most AST (including ksh93) applications consume a few MB
of memory... so there is a good chance that the "typical"
application/shell memory consumption completely fits into that 4MB
chunk. 64bit processes get four times as much memory since it's
expected that they may operate on much larger datasets (and see the
comment about fragmentation above)

Just to demonstrate "reservation" vs. "real usage" via Solaris pmap:
-- snip --
$ ksh -c 'print hello ; pmap -x $$ ; true' | egrep '16384.*anon'
FFFFFD7FFDA00000 16384 148 20 - rw--- [ anon ]
-- snip --
The test shows that of 16384k only 148k have really been touched...
the difference (16384-148) is reserved by the shell process but not
used.

3. Linux has /proc/sys/vm/overcommit_memory, which describes whether the
kernel permits overcommitment of memory: 0 (heuristic overcommit, the
default), 1 (always overcommit) or 2 (strict accounting, no overcommit).
AFAIK a simple function could be written which returns |-1| (does not
permit overcommitment), |0| (don't know) or |1| (does permit
overcommitment) ... and if the function returns |-1| vmalloc should do
the same as on Solaris

4. The patch removes one unnecessary |memset(p, 0, size)| which was
touching pages and therefore allocating them
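
The helper suggested in item 3 could be sketched roughly as below. This is only an illustration and not part of the attached patch: the names classify_overcommit() and overcommit_policy() are made up, and the mapping assumes the Linux kernel's documented values for the file (0 = heuristic overcommit, 1 = always overcommit, 2 = strict accounting).

```c
#include <stdio.h>

/*
 * Map a raw overcommit_memory value to the three-way answer:
 *   1  kernel permits overcommitment (modes 0 and 1)
 *  -1  kernel does not permit overcommitment (mode 2)
 *   0  don't know (any other value)
 */
int classify_overcommit(int mode)
{
    switch (mode) {
    case 0:     /* heuristic overcommit */
    case 1:     /* always overcommit */
        return 1;
    case 2:     /* strict accounting */
        return -1;
    default:
        return 0;
    }
}

/*
 * Read the policy from /proc; returns 0 ("don't know") when the file
 * is missing or unparsable, e.g. on non-Linux systems.
 */
int overcommit_policy(void)
{
    FILE *fp = fopen("/proc/sys/vm/overcommit_memory", "r");
    int mode = 0;
    int fields;

    if (fp == NULL)
        return 0;
    fields = fscanf(fp, "%d", &mode);
    fclose(fp);

    return fields == 1 ? classify_overcommit(mode) : 0;
}
```

With such a helper, a return value of -1 would select the same behaviour as on Solaris, while 0 and 1 would keep the overcommit check enabled.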

----

Bye,
Roland
--
__ . . __
(o.\ \/ /.o) roland.mainz at nrubsig.org
\__\/\/__/ MPEG specialist, C&&JAVA&&Sun&&Unix programmer
/O /==\ O\ TEL +49 641 3992797
(;O/ \/ \O;)
-------------- next part --------------
diff -r -u build_i386_64bit_opt/src/lib/libast/vmalloc/vmhdr.h build_i386_64bit_debug/src/lib/libast/vmalloc/vmhdr.h
--- src/lib/libast/vmalloc/vmhdr.h 2013-08-27 18:44:46.000000000 +0200
+++ src/lib/libast/vmalloc/vmhdr.h 2013-12-09 22:14:12.731227511 +0100
@@ -182,9 +182,9 @@

/* hint to regulate memory requests to discipline functions */
#if _ast_sizeof_size_t > 4 /* the address space is greater than 32-bit */
-#define VM_INCREMENT (1024*1024) /* lots of memory available here */
+#define VM_INCREMENT (16*1024*1024) /* lots of memory available here */
#else
-#define VM_INCREMENT (64*1024) /* perhaps more limited memory */
+#define VM_INCREMENT (4*1024*1024) /* perhaps more limited memory */
#endif

#define VM_PAGESIZE 8192 /* default assumed page size */
diff -r -u build_i386_64bit_opt/src/lib/libast/vmalloc/vmmaddress.c build_i386_64bit_debug/src/lib/libast/vmalloc/vmmaddress.c
--- src/lib/libast/vmalloc/vmmaddress.c 2013-06-09 06:13:49.000000000 +0200
+++ src/lib/libast/vmalloc/vmmaddress.c 2013-12-09 22:19:47.122281075 +0100
@@ -42,8 +42,16 @@
** Written by Kiem-Phong Vo, phongvo at gmail.com, 07/07/2012
*/

-/* see if a given range of address is available for mapping */
+/*
+ * see if a given range of address is available for mapping
+ * This is used for overcommit detection.
+ *
+ * Solaris (__SunOS) is explicitly excluded since it does
+ * not allow overcommitment of memory by default
+ */
+#ifndef __SunOS
#define VMCHKMEM 1 /* set this to zero if signal&sigsetjmp don't work */
+#endif

#if VMCHKMEM

diff -r -u build_i386_64bit_opt/src/lib/libast/vmalloc/vmopen.c build_i386_64bit_debug/src/lib/libast/vmalloc/vmopen.c
--- src/lib/libast/vmalloc/vmopen.c 2013-09-04 07:15:04.000000000 +0200
+++ src/lib/libast/vmalloc/vmopen.c 2013-12-06 09:40:41.344273508 +0100
@@ -130,7 +130,9 @@
write(9, "vmalloc: panic: heap initialization error #4\n", 45);
return NIL(Vmalloc_t*);
}
+#if 0
memset(base, 0, size);
+#endif

/* make sure memory is properly aligned */
if((algn = (ssize_t)(VMLONG(base)%ALIGN)) == 0 )
Glenn Fowler
2013-12-09 22:08:58 UTC
if that memset(0) is in vmopen() then i'm not sure it's unnecessary

run these tests to check your patch with different sizes and with/without
the memset(0)

bin/package use
cd builtin
nmake test
Roland Mainz
2013-12-09 22:21:34 UTC
On Mon, Dec 9, 2013 at 4:53 PM, Roland Mainz <roland.mainz at nrubsig.org>
[snip]
Post by Roland Mainz
4. The patch removes one unnecessary |memset(p, 0, size)| which was
touching pages and therefore allocating them
if that memset(0) is in vmopen() then im not sure its unnecessary
run these tests to check your patch with different sizes and with/without
the memset(0)
bin/package use
cd builtin
nmake test
Seems to be no problem... and neither valgrind nor Rational Purify
complained. I think the issue is that a memory page obtained via
|mmap(MAP_ANON)| is zero'ed by the system on the first
read/write/execute access.

This behaviour is AFAIK defined by some standard (POSIX) because Linux
has this extra |mmap()| flag:
-- snip --
MAP_UNINITIALIZED (since Linux 2.6.33)
    Don't clear anonymous pages. This flag is intended to improve
    performance on embedded devices. This flag is only honored if the
    kernel was configured with the CONFIG_MMAP_ALLOW_UNINITIALIZED
    option. Because of the security implications, that option is
    normally enabled only on embedded devices (i.e., devices where one
    has complete control of the contents of user memory).
-- snip --
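
A quick way to check the zero-fill behaviour on a given system is a sketch like the one below. The function name fresh_anon_is_zero() is hypothetical, and the check only covers a freshly created |mmap(MAP_ANON)| region; it says nothing about memory that an allocator recycles.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map len bytes of fresh anonymous memory and read every byte.
 * Returns 1 if all bytes are zero, 0 if a non-zero byte was found
 * and -1 if the mapping could not be created.
 */
int fresh_anon_is_zero(size_t len)
{
    unsigned char *p;
    size_t i;
    int zero = 1;

    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANON, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    /* the first read of each page is what makes the kernel supply it */
    for (i = 0; i < len; i++) {
        if (p[i] != 0) {
            zero = 0;
            break;
        }
    }

    munmap(p, len);
    return zero;
}
```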

----

Bye,
Roland
--
__ . . __
(o.\ \/ /.o) roland.mainz at nrubsig.org
\__\/\/__/ MPEG specialist, C&&JAVA&&Sun&&Unix programmer
/O /==\ O\ TEL +49 641 3992797
(;O/ \/ \O;)
Glenn Fowler
2013-12-10 08:21:45 UTC
it consistently chokes for getmem=safe
memset(0) is required but only the head part to cover the 2 small vmalloc
structs
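
The idea of clearing only the head part can be sketched as follows. The two struct types are hypothetical stand-ins for the small vmalloc bookkeeping structs mentioned above (the real definitions live in the vmalloc headers), and clear_region_head() is an illustrative name, not a function from vmopen.c.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the two small bookkeeping structs. */
typedef struct region_handle {
    void *data;
    int   flags;
} region_handle;

typedef struct region_data {
    size_t size;
    void  *free_list;
} region_data;

/*
 * Zero only the header area at the start of a freshly obtained region
 * instead of the whole region, so the remaining pages stay untouched.
 * Returns the number of bytes cleared, or 0 if the region is too small
 * to hold both structs.
 */
size_t clear_region_head(void *base, size_t size)
{
    size_t head = sizeof(region_handle) + sizeof(region_data);

    if (base == NULL || size < head)
        return 0;

    memset(base, 0, head);
    return head;
}
```

Compared with |memset(base, 0, size)| this writes to the first page or two only, which keeps the initialisation guarantee for the structs without touching the rest of a 4MB or 16MB region.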