Discussion:
[ast-developers] Avoid clearing memory from |mmap(MAP_ANON)| ? /
Roland Mainz
2013-12-05 23:41:59 UTC
On Sun, Dec 1, 2013 at 4:58 PM, Lionel Cons <lionelcons1972 at gmail.com> wrote:
I believe this is related to vmalloc changes between 2013-05-31 and
2013-06-09
re-run the tests with
export VMALLOC_OPTIONS=getmem=safe
if that's the problem then it gives a clue on a general solution
details after confirmation
timex ~/bin/ksh -c 'function nanosort { typeset -A a ; integer k=0;
while read i ; do key="$i$((k++))" ; a["$key"]="$i" ; done ; printf
"%s\n" "${a[@]}" ; } ; print "${.sh.version}" ; nanosort <xxx >yyy'
Version AIJMP 93v- 2013-10-08
real 34.60
user 33.27
sys 1.19
VMALLOC_OPTIONS=getmem=safe timex ~/bin/ksh -c 'function nanosort {
typeset -A a ; integer k=0; while read i ; do key="$i$((k++))" ;
a["$key"]="$i" ; done ; printf "%s\n" "${a[@]}" ; } ; print
"${.sh.version}" ; nanosort <xxx >yyy'
Version AIJMP 93v- 2013-10-08
real 15.34
user 14.67
sys 0.52
So your hunch that VMALLOC_OPTIONS=getmem=safe fixes the problem is
correct.
What does VMALLOC_OPTIONS=getmem=safe do?
vmalloc has an internal discipline/method for getting memory from the system
several methods are available with varying degrees of thread safety etc.
see src/lib/libast/vmalloc/vmdcsystem.c for the code
and src/lib/libast/vmalloc/malloc.c for the latest VMALLOC_OPTIONS
description (vmalloc.3 update shortly)
** getmemory=f   enable f[,g] getmemory() functions if supported, all by default
**   anon:        mmap(MAP_ANON)
**   break|sbrk:  sbrk()
**   native:      native malloc()
**   safe:        safe sbrk() emulation via mmap(MAP_ANON)
**   zero:        mmap(/dev/zero)
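To make that list concrete, here is a minimal sketch of what the three
mmap/sbrk-based disciplines boil down to at the syscall level; the
function names are illustrative, not the actual libast symbols:
-- snip --
/* Illustrative sketch only -- not the actual libast code.  The three
 * mmap/sbrk based getmemory disciplines reduce to these syscalls. */
#include <sys/mman.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

static void *getmem_anon(size_t size)   /* "anon": mmap(MAP_ANON) */
{
    void *addr = mmap(0, size, PROT_READ|PROT_WRITE,
                      MAP_ANON|MAP_PRIVATE, -1, 0);
    return addr == MAP_FAILED ? 0 : addr;
}

static void *getmem_zero(size_t size)   /* "zero": mmap(/dev/zero) */
{
    void *addr;
    int fd = open("/dev/zero", O_RDWR);
    if (fd < 0)
        return 0;
    addr = mmap(0, size, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping survives the close() */
    return addr == MAP_FAILED ? 0 : addr;
}

static void *getmem_break(size_t size)  /* "break|sbrk": sbrk() */
{
    void *addr = sbrk((intptr_t)size);
    return addr == (void*)-1 ? 0 : addr;
}
-- snip --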
i believe the performance regression with "anon" is that on linux
mmap(0....MAP_ANON|MAP_PRIVATE...),
which lets the system decide the address, returns adjacent (when possible)
region addresses from highest to lowest order
and that reverse order, at minimum, tends to fragment memory more
"zero" has the same hi=>lo characteristic
i suspect it adversely affects the vmalloc coalescing algorithm but have not
dug deeper
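That hi=>lo behavior is easy to observe directly; a throwaway probe
along these lines (my sketch, assuming Linux's default top-down mmap
layout) prints the kernel-chosen addresses of successive anonymous
mappings:
-- snip --
/* Probe: do successive kernel-chosen anonymous mappings descend?
 * On Linux with the default top-down mmap layout they usually do. */
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 1024 * 1024;   /* 1MB -- the old VM_INCREMENT */
    int i;
    for (i = 0; i < 8; i++) {
        void *p = mmap(0, len, PROT_READ|PROT_WRITE,
                       MAP_ANON|MAP_PRIVATE, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        printf("mapping %d at %p\n", i, p);
        /* deliberately not unmapped, so the kernel cannot reuse the hole */
    }
    return 0;
}
-- snip --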
for now the probe order in vmalloc/vmdcsystem.c was simply changed to favor
"safe"
Erm... since Irek prodded me by phone I looked at the issue...
... some observations first (on Solaris 11/Illumos):

1. /dev/zero allocator vs. |sbrk()| allocator on Solaris:
-- snip --
$ VMALLOC_OPTIONS=getmem=zero timex ~/bin/ksh -c 'function nanosort {
typeset -A a ; integer k=0; while read i ; do key="$i$((k++))" ;
a["$key"]="$i" ; done ; printf "%s\n" "${a[@]}" ; } ; print
"${.sh.version}" ; nanosort <xxx >yyy'
Version AIJMP 93v- 2013-10-08

real 32.98
user 32.55
sys 0.32

$ VMALLOC_OPTIONS=getmem=break timex ~/bin/ksh -c 'function nanosort {
typeset -A a ; integer k=0; while read i ; do key="$i$((k++))" ;
a["$key"]="$i" ; done ; printf "%s\n" "${a[@]}" ; } ; print
"${.sh.version}" ; nanosort <xxx >yyy'
Version AIJMP 93v- 2013-10-08

real 1:08.41
user 1:07.87
sys 0.38
-- snip --
... which means the |sbrk()| allocator is twice as slow as the
/dev/zero allocator.

2. The default block size used by the normal |mmap(MAP_ANON)| allocator
is 1MB. This is IMHO far too small: there is not enough room for the
coalescing algorithm to operate, and a *lot* of fragmentation occurs.
IMHO a _minimum_ block size of 4MB should be picked (as a side-effect
the shell would get 4MB or 2MB largepages on platforms like Solaris
automagically).
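(On Solaris the large pages come transparently for big enough aligned
mappings; on Linux the equivalent request has to be made explicitly. A
hedged sketch of the Linux side -- MADV_HUGEPAGE requires a kernel with
transparent huge pages enabled:)
-- snip --
/* Sketch: on Linux, explicitly ask for huge-page backing of a 4MB
 * anonymous chunk.  On Solaris, large pages are assigned automatically
 * for sufficiently large, aligned mappings. */
#include <sys/mman.h>
#include <stddef.h>

void *get_chunk_4m(void)
{
    size_t len = 4 * 1024 * 1024;
    void *p = mmap(0, len, PROT_READ|PROT_WRITE,
                   MAP_ANON|MAP_PRIVATE, -1, 0);
    if (p == MAP_FAILED)
        return 0;
#ifdef MADV_HUGEPAGE
    madvise(p, len, MADV_HUGEPAGE);  /* advisory only; failure is harmless */
#endif
    return p;
}
-- snip --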

3. After each |mmap(MAP_ANON)| allocation the libast allocator
"manually" clears the obtained memory chunk with zero bytes. This is
IMO a *major* waste of CPU time (>= ~30%-38% of an
|_ast_malloc(1024*1024)|) because each memory page is instantiated by
writing zeros to it. If the clearing could be avoided (it is
unnecessary anyway, since MAP_ANON memory is already zero-filled) we'd
easily win those ~30%-38% and would *not* instantiate pages which we
do not use yet.

Just to make it clear: allocating a 1MB chunk of memory via
|mmap(MAP_ANON)| and a 128MB chunk of memory via |mmap(MAP_ANON)| makes
*no* (visible) difference in performance until we touch the pages via
either read/execute or write accesses.
Currently the libast allocator code writes zeros into the whole chunk
of memory obtained via |mmap(MAP_ANON)|, which pretty much ruins
performance because *all* pages are created physically instead of just
being memory marked as "reserved". If libast stopped writing into
memory chunks directly after the |mmap(MAP_ANON)|, we could easily
bump the allocation size up to 32MB or more without any performance
penalty...
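This is easy to verify with a small standalone benchmark (my sketch, not
part of the thread's testcase): the mmap() itself returns almost
instantly even for 128MB, while the subsequent memset() pays for
faulting in every page:
-- snip --
/* Sketch: mmap(MAP_ANON) is cheap regardless of size; the memset()
 * is what physically instantiates every page. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/time.h>

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
    size_t len = 128 * 1024 * 1024;   /* 128MB */
    double t0, t1, t2;
    void *p;

    t0 = now();
    p = mmap(0, len, PROT_READ|PROT_WRITE, MAP_ANON|MAP_PRIVATE, -1, 0);
    t1 = now();
    if (p == MAP_FAILED)
        return 1;
    memset(p, 0, len);   /* redundant: MAP_ANON pages are already zero */
    t2 = now();
    printf("mmap:   %.6f sec\nmemset: %.6f sec\n", t1 - t0, t2 - t1);
    return 0;
}
-- snip --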

----

Bye,
Roland
--
__ . . __
(o.\ \/ /.o) roland.mainz at nrubsig.org
\__\/\/__/ MPEG specialist, C&&JAVA&&Sun&&Unix programmer
/O /==\ O\ TEL +49 641 3992797
(;O/ \/ \O;)
Roland Mainz
2013-12-06 00:25:21 UTC
Post by Roland Mainz
On Sun, Dec 1, 2013 at 4:58 PM, Lionel Cons <lionelcons1972 at gmail.com>
[snip]
Post by Roland Mainz
2. The default block size used by the normal |mmap(MAP_ANON)| allocator
is 1MB. This is IMHO far too small: there is not enough room for the
coalescing algorithm to operate, and a *lot* of fragmentation occurs.
IMHO a _minimum_ block size of 4MB should be picked (as a side-effect
the shell would get 4MB or 2MB largepages on platforms like Solaris
automagically).
[snip]
BTW: A quick fix for the original problem seems to be the following patch:
-- snip --
diff -r -u src/lib/libast/vmalloc/vmhdr.h src/lib/libast/vmalloc/vmhdr.h
--- src/lib/libast/vmalloc/vmhdr.h 2013-08-27 18:44:46.000000000 +0200
+++ src/lib/libast/vmalloc/vmhdr.h 2013-12-06 01:06:30.777622210 +0100
@@ -182,7 +182,7 @@

/* hint to regulate memory requests to discipline functions */
#if _ast_sizeof_size_t > 4 /* the address space is greater than 32-bit */
-#define VM_INCREMENT (1024*1024) /* lots of memory available here */
+#define VM_INCREMENT (32*1024*1024) /* lots of memory available here */
#else
#define VM_INCREMENT (64*1024) /* perhaps more limited memory */
#endif
-- snip --

It turns out that the issue is mainly fragmentation-related for the
"nanosort" testcase. After applying the patch above the runtime
improves *significantly* - it is even three seconds better than the old
ksh93 version:
-- snip --
$ VMALLOC_OPTIONS=getmem=anon timex ~/bin/ksh -c 'function nanosort {
typeset -A a ; integer k=0; while read i ; do key="$i$((k++))" ;
a["$key"]="$i" ; done ; printf "%s\n" "${a[@]}" ; } ; print
"${.sh.version}" ; nanosort <xxx >yyy'
Version AIJMP 93v- 2013-10-08

real 13.96
user 13.08
sys 0.44
-- snip --

----

Bye,
Roland
--
__ . . __
(o.\ \/ /.o) roland.mainz at nrubsig.org
\__\/\/__/ MPEG specialist, C&&JAVA&&Sun&&Unix programmer
/O /==\ O\ TEL +49 641 3992797
(;O/ \/ \O;)
Glenn Fowler
2013-12-06 15:18:07 UTC
On Thu, Dec 5, 2013 at 6:41 PM, Roland Mainz <roland.mainz at nrubsig.org> wrote:
[snip]
... which means the |sbrk()| allocator is twice as slow as the
/dev/zero allocator.
sbrk is different from safebreak -- look at the vmdcsystem.c code
the alpha will default to not probing just-mapped pages for overbooking;
this will result in spurious and for the most part untraceable core dumps
on systems running out of memory
Post by Roland Mainz
2. The default block size used by the normal |mmap(MAP_ANON)| allocator
is 1MB. This is IMHO far too small: there is not enough room for the
coalescing algorithm to operate, and a *lot* of fragmentation occurs.
IMHO a _minimum_ block size of 4MB should be picked (as a side-effect
the shell would get 4MB or 2MB largepages on platforms like Solaris
automagically).
default block size upped to 4Mi; pagesize=<n>[KMGP][i] in
VMALLOC_OPTIONS can override it for testing
Post by Roland Mainz
3. After each |mmap(MAP_ANON)| allocation the libast allocator
"manually" clears the obtained memory chunk with zero bytes. This is
IMO a *major* waste of CPU time (>= ~30%-38% of an
|_ast_malloc(1024*1024)|) because each memory page is instantiated by
writing zeros to it. If the clearing could be avoided (it is
unnecessary anyway, since MAP_ANON memory is already zero-filled) we'd
easily win those ~30%-38% and would *not* instantiate pages which we
do not use yet.
can you pinpoint the code that does this -- the only memset(0) i see are
due to explicit VM_RSZERO

[snip]
Phong Vo
2013-12-07 03:09:34 UTC
Vmalloc disciplines do not zero out memory. The only explicit zeroing of
memory occurs in vmresize(), and only with the flag VM_RSZERO, or in
the malloc-compatible calloc() call.
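Taken together with the observation above, the cheap win is that memory
coming straight from mmap(MAP_ANON) is already zero-filled by the
kernel, so the VM_RSZERO/calloc() paths only ever need to clear
recycled blocks. A hypothetical sketch of that distinction (the
BLK_FRESH flag is invented for illustration, not a vmalloc API):
-- snip --
/* Hypothetical sketch -- BLK_FRESH is invented, not a vmalloc flag.
 * Fresh MAP_ANON memory is guaranteed zero by the kernel, so a
 * zero-on-resize/calloc path only needs to clear recycled blocks. */
#include <string.h>
#include <stddef.h>

#define BLK_FRESH 1   /* block came straight from mmap(MAP_ANON) */

static void *zero_block(void *block, size_t size, int flags)
{
    if (block && !(flags & BLK_FRESH))
        memset(block, 0, size);   /* recycled block: must be cleared */
    return block;                 /* fresh block: already zero, untouched */
}
-- snip --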
Phong
Post by Glenn Fowler
[snip]
Roland Mainz
2013-12-09 21:59:04 UTC
On Fri, Dec 6, 2013 at 10:18 AM, Glenn Fowler <glenn.s.fowler at gmail.com> wrote:
On Thu, Dec 5, 2013 at 6:41 PM, Roland Mainz <roland.mainz at nrubsig.org> wrote:
Post by Roland Mainz
[snip]
Post by Phong Vo
Vmalloc disciplines do not zero out memory. The only explicit zeroing of
memory occurs in vmresize(), and only with the flag VM_RSZERO, or in
the malloc-compatible calloc() call.
Erm... see http://lists.research.att.com/pipermail/ast-developers/2013q4/003770.html
... there is one in src/lib/libast/vmalloc/vmopen.c ...
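Schematically, the pattern being pointed at would look like this (a
paraphrase for illustration, not the actual vmopen.c source; see the
linked post for the real lines):
-- snip --
/* Paraphrased pattern only -- not the actual vmopen.c code.  Zeroing
 * directly after mmap(MAP_ANON) faults in every page of a brand-new,
 * already-zero chunk. */
#include <string.h>
#include <sys/mman.h>

static void *open_region(size_t size)
{
    void *chunk = mmap(0, size, PROT_READ|PROT_WRITE,
                       MAP_ANON|MAP_PRIVATE, -1, 0);
    if (chunk == MAP_FAILED)
        return 0;
    memset(chunk, 0, size);   /* <-- the redundant clear in question */
    return chunk;
}
-- snip --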

----

Bye,
Roland
--
__ . . __
(o.\ \/ /.o) roland.mainz at nrubsig.org
\__\/\/__/ MPEG specialist, C&&JAVA&&Sun&&Unix programmer
/O /==\ O\ TEL +49 641 3992797
(;O/ \/ \O;)
Lionel Cons
2013-12-10 09:16:21 UTC
Post by Glenn Fowler
[snip]
default block size upped to 4Mi; pagesize=<n>[KMGP][i] in
VMALLOC_OPTIONS can override it for testing
A default block size of 4Mi is not sufficient on linux/x86-64 and
Solaris/x86-64 to prevent the excessive memory consumption. I've
played with a script to create a graph of where the excess stops,
and the turning point is at 8.4Mi.

So, if the default block size is bumped it should be bumped to at least 8Mi.

Lionel
Glenn Fowler
2013-12-10 09:40:00 UTC
I don't know how you are measuring "excessive" --
with the alpha just posted there are no timing or memory diffs on linux for
segsizes 64Ki, 4Mi, 8Mi, 64Mi;
add usage to VMALLOC_OPTIONS to get a summary of total region usage in the
last line:
all the segsizes show ~250Mi total usage

$ time VMALLOC_OPTIONS=segsize=64Ki $SHELL ./t01

real 0m14.61s
user 0m14.45s
sys 0m0.14s
$ time VMALLOC_OPTIONS=segsize=4Mi $SHELL ./t01

real 0m14.51s
user 0m14.36s
sys 0m0.14s
$ time VMALLOC_OPTIONS=segsize=8Mi $SHELL ./t01

real 0m14.68s
user 0m14.48s
sys 0m0.19s
$ time VMALLOC_OPTIONS=segsize=64Mi $SHELL ./t01

real 0m14.49s
user 0m14.33s
sys 0m0.14s
$ time VMALLOC_OPTIONS=segsize=64Mi,usage $SHELL ./t01
vmalloc: 0x25c2064e6000 67108864 init
vmalloc: 0x25c2064e6000 67108864 init
vmalloc: 0x25c2064e6000 67108864 region 0x007db200 size=268435456 segs=1
packs=1 busy=58% cache=115424/1801

real 0m15.03s
user 0m14.61s
sys 0m0.40s
Post by Lionel Cons
[snip]