Discussion:
[ast-developers] mmap() for command substitutions still not living up to its fullest potential?
Lionel Cons
2013-04-15 01:07:41 UTC
Permalink
Based on the recent discussion about using mmap() for reading the
results of command substitutions I did some testing and found that on
Solaris (Solaris 11, 64bit build) ksh93 still does not behave
optimally. The primary problem I see is that MANY mmap() calls with a
very small map size (524288 bytes) are executed instead of either
mapping the input file in one large chunk or at least using a chunk
size large enough that the system can use large pages (2M for x86,
4M/32M/256M for SPARC64) where possible. Using a chunk size of 524288
bytes is a joke.

Is there a specific reason why the code in sfrd.c only maps such
small chunks from a file (I'd expect that a 64bit process could easily
map 16GB each time), or is this a bug?

Here's a truss log which shows the problem: many mmap() calls with a
very small map chunk size instead of a single mmap() call that maps
the whole file:

cons at gog.dev.cern.ch$ seq 10000000 >tmpfile
cons at gog.dev.cern.ch$ ls -l tmpfile
-rw-r--r-- 1 cons cons 78888897 Apr 15 02:56 tmpfile
cons at gog.dev.cern.ch$ truss ./arch/sol11.i386-64/bin/ksh -c
'x=$(/bin/cat tmpfile) ; true' 2>&1 | egrep 'mmap\(0x00.+, MAP_PRIVATE,'
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3, 0) =
0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3, 524288)
= 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
1048576) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
1572864) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
2097152) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
2621440) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
3145728) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
3670016) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
4194304) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
4718592) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
5242880) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
5767168) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
6291456) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
6815744) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
7340032) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
7864320) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
8388608) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
8912896) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
9437184) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
9961472) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
10485760) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
11010048) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
11534336) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
12058624) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
12582912) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
13107200) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
13631488) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
14155776) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
14680064) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
15204352) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
15728640) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
16252928) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
16777216) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
17301504) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
17825792) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
18350080) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
18874368) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
19398656) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
19922944) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
20447232) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
20971520) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
21495808) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
22020096) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
22544384) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
23068672) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
23592960) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
24117248) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
24641536) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
25165824) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
25690112) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
26214400) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
26738688) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
27262976) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
27787264) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
28311552) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
28835840) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
29360128) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
29884416) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
30408704) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
30932992) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
31457280) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
31981568) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
32505856) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
33030144) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
33554432) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
34078720) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
34603008) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
35127296) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
35651584) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
36175872) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
36700160) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
37224448) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
37748736) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
38273024) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
38797312) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
39321600) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
39845888) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
40370176) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
40894464) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
41418752) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
41943040) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
42467328) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
42991616) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
43515904) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
44040192) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
44564480) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
45088768) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
45613056) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
46137344) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
46661632) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
47185920) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
47710208) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
48234496) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
48758784) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
49283072) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
49807360) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
50331648) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
50855936) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
51380224) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
51904512) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
52428800) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
52953088) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
53477376) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
54001664) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
54525952) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
55050240) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
55574528) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
56098816) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
56623104) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
57147392) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
57671680) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
58195968) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
58720256) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
59244544) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
59768832) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
60293120) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
60817408) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
61341696) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
61865984) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
62390272) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
62914560) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
63438848) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
63963136) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
64487424) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
65011712) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
65536000) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
66060288) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
66584576) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
67108864) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
67633152) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
68157440) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
68681728) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
69206016) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
69730304) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
70254592) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
70778880) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
71303168) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
71827456) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
72351744) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
72876032) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
73400320) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
73924608) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
74448896) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
74973184) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
75497472) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
76021760) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
76546048) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
77070336) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
77594624) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
78118912) = 0xFFFFFD7FFE49F000
mmap(0x00000000, 245697, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3,
78643200) = 0xFFFFFD7FFE4E3000

Lionel
Glenn Fowler
2013-04-15 05:13:03 UTC
Permalink
Post by Lionel Cons
Based on the recent discussion about using mmap() for reading the
results of command substitutions I did some testing and found that on
Solaris (Solaris 11 and a 64bit build) ksh93 still behaves not
optimal. The primary problem I see is that MANY mmap() calls with a
very small map size (524288 bytes) are executed instead of either
mapping the input file in one large chunk or at least uses a chunk
size large enough that the system can use largepages (2M for x86,
4M/32M/256M for SPARC64) if possible. Using a chunk size of 524288
bytes is a joke.
Is there a specific reason why the the code in sfrd.c only maps such
small chunks (I'd expect that a 64bit process could easily map 16GB
each time) from a file or is this a bug?
provide some iffe code that spits out the optimal mmap() page size
for the current os/arch/configuration and that can be rolled into sfio
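
not iffe syntax, just a minimal plain-C sketch of what such a probe could
report; sysconf(_SC_PAGESIZE) is POSIX and getpagesizes() is the Solaris
interface for enumerating the larger supported page sizes:
-- snip --
/* hedged sketch only: the C body such an iffe probe could wrap */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    long base = sysconf(_SC_PAGESIZE);

    printf("base page size: %ld\n", base);
#ifdef __sun
    {
        int n = getpagesizes(NULL, 0);    /* number of supported page sizes */
        if (n > 0)
        {
            size_t* sizes = malloc(n * sizeof(size_t));
            int     i;
            if (sizes && getpagesizes(sizes, n) == n)
                for (i = 0; i < n; i++)
                    printf("supported page size: %zu\n", sizes[i]);
            free(sizes);
        }
    }
#endif
    return 0;
}
-- snip --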
Roland Mainz
2013-04-15 12:15:05 UTC
Permalink
Post by Glenn Fowler
Post by Lionel Cons
Based on the recent discussion about using mmap() for reading the
results of command substitutions I did some testing and found that on
Solaris (Solaris 11 and a 64bit build) ksh93 still behaves not
optimal. The primary problem I see is that MANY mmap() calls with a
very small map size (524288 bytes) are executed instead of either
mapping the input file in one large chunk or at least uses a chunk
size large enough that the system can use largepages (2M for x86,
4M/32M/256M for SPARC64) if possible. Using a chunk size of 524288
bytes is a joke.
Is there a specific reason why the the code in sfrd.c only maps such
small chunks (I'd expect that a 64bit process could easily map 16GB
each time) from a file or is this a bug?
provide some iffe code that spits out the optimal mmap() page size
for the current os/arch/configuration and that can be rolled into sfio
Erm... the "page size" (=the size used for MMU pages) is IMHO the
wrong property because it (usually) has to be chosen by the kernel
based on { MMU type, supported page sizes, available continuous memory
(as backing store) ... and for I/O the IOMMU page size and preferred
page size for the matching I/O device }.

The issue here is that the "chunk size" which sfio uses to |mmap()|
parts of a large file is very very low and prevents in most cases the
use of large pages (at least on i386/AMD64 which only has 4096bytes
and 2M/4M pages (other platforms have more choices... for example
UltraSPARC supports page sizes like 8192, 64k, 512k, 4M, 32M, 256M, 2G
pages)).

I did some digging and found that the following patch fixes the issue
for 64bit builds:
-- snip --
--- original/src/lib/libast/sfio/sfrd.c 2012-09-24 20:11:06.000000000 +0200
+++ build_i386_64bit_debug/src/lib/libast/sfio/sfrd.c 2013-04-15 03:24:22.892159982 +0200
@@ -161,18 +161,20 @@

/* make sure current position is page aligned */
if((a = (size_t)(f->here%_Sfpage)) != 0)
{ f->here -= a;
r += a;
}

/* map minimal requirement */
+#if _ptr_bits < 64
if(r > (round = (1 + (n+a)/f->size)*f->size) )
r = round;
+#endif

if(f->data)
SFMUNMAP(f, f->data, f->endb-f->data);

for(;;)
{ f->data = (uchar*)
sysmmapf((caddr_t)0, (size_t)r,
(PROT_READ|PROT_WRITE),
MAP_PRIVATE,
-- snip --

... for 32bit builds the problem is not easily fixable because there
has to be a balance between available address space (4GB... but only
2GB are usually available for file mappings) and maximum number of
open files (e.g. the value returned by $ ulimit -n # ...if we use that
with nfiles==1024 we get a maximum chunk size of $(( (pow(2,32)/2) /
1024. ))==2097152 (which would be acceptable) but for nfiles==65536 we
get a chunk size of $(( (pow(2,32)/2) / 65536. )) == 32768 ... which
renders the advantage of using |mmap()| useless).

Based on that I'd suggest the following solution:
1. Take the patch above so that 64bit libast consumers get "unlimited"
chunk-size mapping. This will work in _any_ case because a) the 64bit
address space is vast and b) |sfrd()| will retry with half the chunk
size if the previous attempt to |mmap()| fails.
Using an "unlimited" chunk size allows the kernel to pick the best MMU
page size available (and reduces the syscall overhead to almost zero).

Optionally we could "clamp" the chunk size to 44 bits, which allows
65536 files to be open with 44bit chunks mapped simultaneously (while
still being able to use multiple 256G MMU pages for each file mapping)
and still leaves lots of free virtual address space for memory and
stack.

2. Optionally for 32bit processes we should add low and high "limits"
for the chunk size... it should *never* be below 4M and not be higher
than $(( (pow(2,32)/2) / nfiles )) (unless the requested size is lower
than 4M); a sketch of such a clamp follows below.
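
A minimal sketch of how such a 32bit clamp could look (illustration
only, not a patch against sfrd.c; |chunk32()| and the constant names
are made up here, and nfiles would come from something like
$ ulimit -n or sysconf(_SC_OPEN_MAX)):
-- snip --
#include <stdio.h>
#include <unistd.h>

#define CHUNK_MIN   (4UL * 1024 * 1024)  /* proposed floor: never map less than 4M */
#define MAP_BUDGET  (1UL << 31)          /* roughly half of a 32bit address space */

/* clamp a requested map size for a 32bit process, given the fd limit */
static size_t chunk32(size_t want, long nfiles)
{
    size_t ceiling = MAP_BUDGET / (nfiles > 0 ? (size_t)nfiles : 1);

    if (want <= CHUNK_MIN)   /* small requests are taken as-is */
        return want;
    if (want > ceiling)      /* leave address space for the other descriptors */
        want = ceiling;
    if (want < CHUNK_MIN)    /* ... but never drop below the 4M floor */
        want = CHUNK_MIN;
    return want;
}

int main(void)
{
    long nfiles = sysconf(_SC_OPEN_MAX);  /* stand-in for "ulimit -n" */

    printf("nfiles=%ld: map 1G as %zu, map 64k as %zu\n", nfiles,
        chunk32(1UL << 30, nfiles), chunk32(64 * 1024, nfiles));
    return 0;
}
-- snip --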

Does that sound reasonable ?

----

Bye,
Roland
--
__ . . __
(o.\ \/ /.o) roland.mainz at nrubsig.org
\__\/\/__/ MPEG specialist, C&&JAVA&&Sun&&Unix programmer
/O /==\ O\ TEL +49 641 3992797
(;O/ \/ \O;)
Irek Szczesniak
2013-04-15 14:49:20 UTC
Permalink
Post by Roland Mainz
Post by Glenn Fowler
Post by Lionel Cons
Based on the recent discussion about using mmap() for reading the
results of command substitutions I did some testing and found that on
Solaris (Solaris 11 and a 64bit build) ksh93 still behaves not
optimal. The primary problem I see is that MANY mmap() calls with a
very small map size (524288 bytes) are executed instead of either
mapping the input file in one large chunk or at least uses a chunk
size large enough that the system can use largepages (2M for x86,
4M/32M/256M for SPARC64) if possible. Using a chunk size of 524288
bytes is a joke.
Is there a specific reason why the the code in sfrd.c only maps such
small chunks (I'd expect that a 64bit process could easily map 16GB
each time) from a file or is this a bug?
provide some iffe code that spits out the optimal mmap() page size
for the current os/arch/configuration and that can be rolled into sfio
Erm... the "page size" (=the size used for MMU pages) is IMHO the
wrong property because it (usually) has to be chosen by the kernel
based on { MMU type, supported page sizes, available continuous memory
(as backing store) ... and for I/O the IOMMU page size and preferred
page size for the matching I/O device }.
The issue here is that the "chunk size" which sfio uses to |mmap()|
parts of a large file is very very low and prevents in most cases the
use of large pages (at least on i386/AMD64 which only has 4096bytes
and 2M/4M pages (other platforms have more choices... for example
UltraSPARC supports page sizes like 8192, 64k, 512k, 4M, 32M, 256M, 2G
pages)).
I did some digging and found that the following patch fixes the issue
-- snip --
--- original/src/lib/libast/sfio/sfrd.c 2012-09-24 20:11:06.000000000 +0200
+++ build_i386_64bit_debug/src/lib/libast/sfio/sfrd.c 2013-04-15
03:24:22.892159982 +0200
@@ -161,18 +161,20 @@
/* make sure current position is page aligned */
if((a = (size_t)(f->here%_Sfpage)) != 0)
{ f->here -= a;
r += a;
}
/* map minimal requirement */
+#if _ptr_bits < 64
if(r > (round = (1 + (n+a)/f->size)*f->size) )
r = round;
+#endif
if(f->data)
SFMUNMAP(f, f->data, f->endb-f->data);
for(;;)
{ f->data = (uchar*)
sysmmapf((caddr_t)0, (size_t)r,
(PROT_READ|PROT_WRITE),
MAP_PRIVATE,
-- snip --
... for 32bit builds the problem is not easily fixable because there
has to be a balance between available address space (4GB... but only
2GB are usually available for file mappings) and maximum number of
open files (e.g. the value returned by $ ulimit -n # ...if we use that
with nfiles==1024 we get a maximum chunk size of $(( (pow(2,32)/2) /
1024. ))==2097152 (which would be acceptable) but for nfiles==65536 we
get a chunk size of $(( (pow(2,32)/2) / 65536. )) == 32768 ... which
renders the advantage of using |mmap()| useless).
1. Take the patch above to allow 64bit libast consumers to allow
"unlimited" chunk size mapping. This will work in _any_ case because
a) 64bit address space is vast and b) |sfrd()| will retry with half
the chunk size if the previous attempt to |mmap()| fails.
Using an "unlimited" chunk size allows the kernel to pick the best MMU
page size available (and reduces the syscall overhead to almost zero).
Optionally we could "clamp" the chunk size to 44bits (which allows
65536 files opened with 44bit chunks open (while still being able to
use multiple 256G MMU pages for each file mapping) and still having
lots of free virtual address space for memory and stack)
2. Optionally for 32bit processes we should add low and high "limits"
for the chunk size... it should *never* be below 4M and not be higher
than $(( (pow(2,32)/2) / nfiles )) (unless size is lower than 4M).
Does that sound reasonable ?
Yes, I think the patch to exclude the rounding for 64bit platforms is
reasonable. I seriously doubt that 64bit platforms require extra
checks beyond what the code already does, because with today's
machines it is unlikely that any job with a reasonable processing time
runs into address-space limits.

The 32bit limits you're proposing in [2] may require some benchmarking
but in the long run I doubt that 32bit platforms require such work -
anyone using 65536 file descriptors will likely use a 64bit address
space anyway.

Irek
Lionel Cons
2013-04-16 18:25:59 UTC
Permalink
Post by Roland Mainz
Post by Glenn Fowler
Post by Lionel Cons
Based on the recent discussion about using mmap() for reading the
results of command substitutions I did some testing and found that on
Solaris (Solaris 11 and a 64bit build) ksh93 still behaves not
optimal. The primary problem I see is that MANY mmap() calls with a
very small map size (524288 bytes) are executed instead of either
mapping the input file in one large chunk or at least uses a chunk
size large enough that the system can use largepages (2M for x86,
4M/32M/256M for SPARC64) if possible. Using a chunk size of 524288
bytes is a joke.
Is there a specific reason why the the code in sfrd.c only maps such
small chunks (I'd expect that a 64bit process could easily map 16GB
each time) from a file or is this a bug?
provide some iffe code that spits out the optimal mmap() page size
for the current os/arch/configuration and that can be rolled into sfio
Erm... the "page size" (=the size used for MMU pages) is IMHO the
wrong property because it (usually) has to be chosen by the kernel
based on { MMU type, supported page sizes, available continuous memory
(as backing store) ... and for I/O the IOMMU page size and preferred
page size for the matching I/O device }.
The issue here is that the "chunk size" which sfio uses to |mmap()|
parts of a large file is very very low and prevents in most cases the
use of large pages (at least on i386/AMD64 which only has 4096bytes
and 2M/4M pages (other platforms have more choices... for example
UltraSPARC supports page sizes like 8192, 64k, 512k, 4M, 32M, 256M, 2G
pages)).
I did some digging and found that the following patch fixes the issue
-- snip --
--- original/src/lib/libast/sfio/sfrd.c 2012-09-24 20:11:06.000000000 +0200
+++ build_i386_64bit_debug/src/lib/libast/sfio/sfrd.c 2013-04-15
03:24:22.892159982 +0200
@@ -161,18 +161,20 @@
/* make sure current position is page aligned */
if((a = (size_t)(f->here%_Sfpage)) != 0)
{ f->here -= a;
r += a;
}
/* map minimal requirement */
+#if _ptr_bits < 64
if(r > (round = (1 + (n+a)/f->size)*f->size) )
r = round;
+#endif
if(f->data)
SFMUNMAP(f, f->data, f->endb-f->data);
for(;;)
{ f->data = (uchar*)
sysmmapf((caddr_t)0, (size_t)r,
(PROT_READ|PROT_WRITE),
MAP_PRIVATE,
-- snip --
We've tested the patch with Solaris 11.1 on an Oracle SPARC-T3 machine.
Below are the sample numbers, averaged over 2000 samples, for a builtin
grep (grep -F NoNumber tmpfile; true) over a GB-sized text file:

Without your patch:
real 5m30.956s
user 5m2.847s
sys 0m27.204s

With your patch:
real 5m8.956s
user 4m52.592s
sys 0m11.726s

Notice the significant reduction of time spent in sys!

Another noticeable (impressive!) benefit of the patch was that when
many applications using sfio for I/O were running in parallel on the
same file, Solaris automatically assigned 64k pages to hotspots in the
file mapping, increasing the throughput even further. For one
particular application, femtoslice, which runs a few thousand
iterations over the same file, we saw an astonishing 9% decrease in
run time when 100 processes run in parallel on the same file. The
explanation is simple: the file gets mapped into the processes as one
whole block, and Solaris 11.1 allows MMU data sharing between SPARC
processors. The sharing and the 64k page size together add up to the
9% performance benefit.

+1 for the patch

Lionel
Lionel Cons
2013-06-10 11:23:57 UTC
Permalink
Post by Lionel Cons
Post by Roland Mainz
Post by Glenn Fowler
Post by Lionel Cons
Based on the recent discussion about using mmap() for reading the
results of command substitutions I did some testing and found that on
Solaris (Solaris 11 and a 64bit build) ksh93 still behaves not
optimal. The primary problem I see is that MANY mmap() calls with a
very small map size (524288 bytes) are executed instead of either
mapping the input file in one large chunk or at least uses a chunk
size large enough that the system can use largepages (2M for x86,
4M/32M/256M for SPARC64) if possible. Using a chunk size of 524288
bytes is a joke.
Is there a specific reason why the the code in sfrd.c only maps such
small chunks (I'd expect that a 64bit process could easily map 16GB
each time) from a file or is this a bug?
provide some iffe code that spits out the optimal mmap() page size
for the current os/arch/configuration and that can be rolled into sfio
Erm... the "page size" (=the size used for MMU pages) is IMHO the
wrong property because it (usually) has to be chosen by the kernel
based on { MMU type, supported page sizes, available continuous memory
(as backing store) ... and for I/O the IOMMU page size and preferred
page size for the matching I/O device }.
The issue here is that the "chunk size" which sfio uses to |mmap()|
parts of a large file is very very low and prevents in most cases the
use of large pages (at least on i386/AMD64 which only has 4096bytes
and 2M/4M pages (other platforms have more choices... for example
UltraSPARC supports page sizes like 8192, 64k, 512k, 4M, 32M, 256M, 2G
pages)).
I did some digging and found that the following patch fixes the issue
-- snip --
--- original/src/lib/libast/sfio/sfrd.c 2012-09-24 20:11:06.000000000 +0200
+++ build_i386_64bit_debug/src/lib/libast/sfio/sfrd.c 2013-04-15
03:24:22.892159982 +0200
@@ -161,18 +161,20 @@
/* make sure current position is page aligned */
if((a = (size_t)(f->here%_Sfpage)) != 0)
{ f->here -= a;
r += a;
}
/* map minimal requirement */
+#if _ptr_bits < 64
if(r > (round = (1 + (n+a)/f->size)*f->size) )
r = round;
+#endif
if(f->data)
SFMUNMAP(f, f->data, f->endb-f->data);
for(;;)
{ f->data = (uchar*)
sysmmapf((caddr_t)0, (size_t)r,
(PROT_READ|PROT_WRITE),
MAP_PRIVATE,
-- snip --
We've tested the patch with Solaris 11.1 on a Oracle SPARC-T3 machine.
Below are the sample numbers, average over 2000 samples for a builtin
real 5m30.956s
user 5m2.847s
sys 0m27.204s
real 5m8.956s
user 4m52.592s
sys 0m11.726s
Notice the significant reduction of time spend in sys!
Another noticeable (impressive!) benefit we noticed with the patch was
that if many application using sfio for IO were running in parallel
but working on the same file Solaris automatically assigned 64k pages
to hotspots in the file mapping, even further increasing the
throughput. For one particular application, femtoslice, which runs in
a few thousand iterations on the same file, we noticed an astonishing
9% decrease in run time when 100 processes run in parallel on the same
file. The explanation is simple: The file gets mapped as whole block
into the processes and Solaris 11.1 allows MMU data sharing between
SPARC processors. This, the sharing and the 64k page size sum up to a
9% performance benefit.
+1 for the patch
This problem is still a MAJOR performance issue. IMO it would be good
if the patch could be unconditionally applied for the next alpha
release to see whether there are real problems or not. Our performance
lab staff say there are no issues with the patch; we have been using
it since April without regressions.

Lionel

Phong Vo
2013-04-15 14:40:26 UTC
Permalink
The default size of the mapped memory for a stream is driven by a #define
in a header file, so it isn't hard to change. I believe an application can
also use sfsetbuf(f, NULL, size) to set a desired size for the mapped buffer.
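
For example, a hedged sketch only ("bigfile" and the 64M size are
arbitrary, and the header/link details depend on the libast installation):
-- snip --
/* request a larger buffer/mapping for one stream via sfsetbuf() */
#include <sfio.h>

int main(void)
{
    Sfio_t* f;
    char*   s;

    if (!(f = sfopen(NULL, "bigfile", "r")))
        return 1;
    sfsetbuf(f, NULL, 64 * 1024 * 1024);  /* ask for a 64M buffer/mapping */
    while ((s = sfgetr(f, '\n', 0)))      /* stream through the file */
        /* process one record of sfvalue(f) bytes */;
    sfclose(f);
    return 0;
}
-- snip --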

Generally speaking, Sfio and our other core libraries like CDT and Vmalloc
have been around for a very long time and their default parameters tend to
stay as they are until someone notices a performance issue. Well, not so much
for Vmalloc anymore because it has been completely rewritten recently to deal
with concurrency, both multiple threads and multiple processes for shared memory.

Anyway, it is good that you bring up this issue with Sfio now.
What do people think is a reasonable size for the default mapped
size on a 64-bit machine? Keep in mind that there are apps with many
dozens of files open at the same time, along with other large memory
requirements.

Phong
From ast-developers-bounces at lists.research.att.com Sun Apr 14 21:07:56 2013
To: ast-developers at research.att.com
Subject: [ast-developers] mmap() for command substitutions still not living up to its fullest potential?
Based on the recent discussion about using mmap() for reading the
results of command substitutions I did some testing and found that on
Solaris (Solaris 11 and a 64bit build) ksh93 still behaves not
optimal. The primary problem I see is that MANY mmap() calls with a
very small map size (524288 bytes) are executed instead of either
mapping the input file in one large chunk or at least uses a chunk
size large enough that the system can use largepages (2M for x86,
4M/32M/256M for SPARC64) if possible. Using a chunk size of 524288
bytes is a joke.
Is there a specific reason why the the code in sfrd.c only maps such
small chunks (I'd expect that a 64bit process could easily map 16GB
each time) from a file or is this a bug?
Irek Szczesniak
2013-04-15 14:53:19 UTC
Permalink
Post by Phong Vo
The default size of the mapped memory for a stream is driven by a #define
in a header file so it isn't hard to change it. I believe an application can
also use sfsetbuf(f, NULL, size) to set a desired size for the mapped buffer.
Generally speaking, Sfio and our other core libraries like CDT and Vmalloc
have been around for a very long time and their default parameters tend to
stay as they are until someone notices a performance issue. Well, not so much
for Vmalloc anymore because it has been completely rewritten recently to deal
with concurrency, both multiple threads and multiple processes for shared memory.
Anyway, it is good that you bring up this issue with Sfio now.
What do people think is a reasonable size to set the default mapped
size to on a 64-bit machine? Keep in mind that there apps with many dozens
files opened at the same time along with other large requirements for memory.
I think Roland's patch from
http://lists.research.att.com/pipermail/ast-developers/2013q2/002431.html
is sufficient for now because long before we exhaust VA space we run
out of patience to wait for the jobs to complete :) If that's not
sufficient then I'll suggest the '44bit clamp' Roland proposed to
partition 64bit VA space into 65536 files with 44bit chunks mapped
simultaneously. That just leaves 16 times the same amount of memory
for anon memory pages (which is many times the amount of world-wide
installed main memory in 2012).

Irek
Lionel Cons
2013-04-15 16:32:17 UTC
Permalink
Post by Phong Vo
The default size of the mapped memory for a stream is driven by a #define
in a header file so it isn't hard to change it. I believe an application can
also use sfsetbuf(f, NULL, size) to set a desired size for the mapped buffer.
Generally speaking, Sfio and our other core libraries like CDT and Vmalloc
have been around for a very long time and their default parameters tend to
stay as they are until someone notices a performance issue. Well, not so much
for Vmalloc anymore because it has been completely rewritten recently to deal
with concurrency, both multiple threads and multiple processes for shared memory.
Anyway, it is good that you bring up this issue with Sfio now.
What do people think is a reasonable size to set the default mapped
size to on a 64-bit machine? Keep in mind that there apps with many dozens
files opened at the same time along with other large requirements for memory.
64-bit machines have plenty of address space. Theory aside, we've
tested the patch Roland Mainz submitted and found that it eliminated
all instances of excessive mmap() syscalls AND allows ksh93 to benefit
from large pages on Solaris 11.1.

A rudimentary test case on Solaris 11.1 with tmpfs as the filesystem
shows that 2M pages are used:
(for ((i=0 ; i < 20 ; i++)) ; do cat /usr/pub/UTF\-8 ; done) >tmp
./arch/sol11.i386-64/bin/ksh -c 'x=$(/bin/cat tmp) ; true'
pmap -s 601 | grep -F 2M
FFFFFD7FF8600000 43008K 2M rw--- dev:557,2 ino:443010149

The pmap output shows that the mmap() call for the results of the
command substitution uses 2M pages instead of the x86 default page
size of 4096 bytes!!!

Thumbs up for that patch!

Lionel
Phong Vo
2013-04-15 14:52:59 UTC
Permalink
We need to worry about applications that require a large amount of memory
outside of Sfio too. I know of a couple of local apps that would routinely
use 40-50GB of shared memory managed by Vmalloc and CDT. We wouldn't
want to thrash them unnecessarily.

Phong
From ast-developers-bounces at lists.research.att.com Mon Apr 15 08:15:15 2013
To: Glenn Fowler <gsf at research.att.com>
Cc: ast-developers at research.att.com
Subject: Re: [ast-developers] mmap() for command substitutions still not living up to its fullest potential?
Post by Glenn Fowler
Post by Lionel Cons
Based on the recent discussion about using mmap() for reading the
results of command substitutions I did some testing and found that on
Solaris (Solaris 11 and a 64bit build) ksh93 still behaves not
optimal. The primary problem I see is that MANY mmap() calls with a
very small map size (524288 bytes) are executed instead of either
mapping the input file in one large chunk or at least uses a chunk
size large enough that the system can use largepages (2M for x86,
4M/32M/256M for SPARC64) if possible. Using a chunk size of 524288
bytes is a joke.
Is there a specific reason why the the code in sfrd.c only maps such
small chunks (I'd expect that a 64bit process could easily map 16GB
each time) from a file or is this a bug?
provide some iffe code that spits out the optimal mmap() page size
for the current os/arch/configuration and that can be rolled into sfio
Erm... the "page size" (=the size used for MMU pages) is IMHO the
wrong property because it (usually) has to be chosen by the kernel
based on { MMU type, supported page sizes, available continuous memory
(as backing store) ... and for I/O the IOMMU page size and preferred
page size for the matching I/O device }.
The issue here is that the "chunk size" which sfio uses to |mmap()|
parts of a large file is very very low and prevents in most cases the
use of large pages (at least on i386/AMD64 which only has 4096bytes
and 2M/4M pages (other platforms have more choices... for example
UltraSPARC supports page sizes like 8192, 64k, 512k, 4M, 32M, 256M, 2G
pages)).
I did some digging and found that the following patch fixes the issue
-- snip --
--- original/src/lib/libast/sfio/sfrd.c 2012-09-24 20:11:06.000000000 +0200
+++ build_i386_64bit_debug/src/lib/libast/sfio/sfrd.c 2013-04-15
03:24:22.892159982 +0200
@@ -161,18 +161,20 @@
/* make sure current position is page aligned */
if((a = (size_t)(f->here%_Sfpage)) != 0)
{ f->here -= a;
r += a;
}
/* map minimal requirement */
+#if _ptr_bits < 64
if(r > (round = (1 + (n+a)/f->size)*f->size) )
r = round;
+#endif
if(f->data)
SFMUNMAP(f, f->data, f->endb-f->data);
for(;;)
{ f->data = (uchar*)
sysmmapf((caddr_t)0, (size_t)r,
(PROT_READ|PROT_WRITE),
MAP_PRIVATE,
-- snip --
... for 32bit builds the problem is not easily fixable because there
has to be a balance between available address space (4GB... but only
2GB are usually available for file mappings) and maximum number of
open files (e.g. the value returned by $ ulimit -n # ...if we use that
with nfiles==1024 we get a maximum chunk size of $(( (pow(2,32)/2) /
1024. ))==2097152 (which would be acceptable) but for nfiles==65536 we
get a chunk size of $(( (pow(2,32)/2) / 65536. )) == 32768 ... which
renders the advantage of using |mmap()| useless).
1. Take the patch above to allow 64bit libast consumers to allow
"unlimited" chunk size mapping. This will work in _any_ case because
a) 64bit address space is vast and b) |sfrd()| will retry with half
the chunk size if the previous attempt to |mmap()| fails.
Using an "unlimited" chunk size allows the kernel to pick the best MMU
page size available (and reduces the syscall overhead to almost zero).
Optionally we could "clamp" the chunk size to 44bits (which allows
65536 files opened with 44bit chunks open (while still being able to
use multiple 256G MMU pages for each file mapping) and still having
lots of free virtual address space for memory and stack)
2. Optionally for 32bit processes we should add low and high "limits"
for the chunk size... it should *never* be below 4M and not be higher
than $(( (pow(2,32)/2) / nfiles )) (unless size is lower than 4M).
Does that sound reasonable ?
----
Bye,
Roland
--
__ . . __
(o.\ \/ /.o) roland.mainz at nrubsig.org
\__\/\/__/ MPEG specialist, C&&JAVA&&Sun&&Unix programmer
/O /==\ O\ TEL +49 641 3992797
(;O/ \/ \O;)
Irek Szczesniak
2013-04-15 14:54:38 UTC
Permalink
Post by Phong Vo
We need to worry about applications that require a large amount of memory
outside of Sfio too. I know of a couple of local apps that would routinely
use 40-50Gbs for shared memory managed by Vmalloc and CDT. We wouldn't
want to thrash them unncessarily.
I think this is unlikely to happen but if you insist I'll quote from
my last email:

I think Roland's patch from
http://lists.research.att.com/pipermail/ast-developers/2013q2/002431.html
is sufficient for now because long before we exhaust VA space we run
out of patience to wait for the jobs to complete :) If that's not
sufficient then I'll suggest the '44bit clamp' Roland proposed to
partition 64bit VA space into 65536 files with 44bit chunks mapped
simultaneously. That just leaves 16 times the same amount of memory
for anon memory pages (which is many times the amount of world-wide
installed main memory in 2012).

Irek
Phong Vo
2013-04-15 18:37:57 UTC
Permalink
Exhausting VA space is not likely but keeping processes behaving nicely toward
one another should be a good thing. You need to think about cases when many
processes do large I/O at the same time and the physical memory available on
the machine is far less than what the VA space can accommodate.

It's little known, but Sfio does adaptive buffer filling to reduce read I/O,
especially when many seeks are done (in which case most read-ahead data
would be wasted). The same strategy could be adapted to mapped I/O. We'll
look into that.

Phong
From iszczesniak at gmail.com Mon Apr 15 10:53:27 2013
Subject: Re: [ast-developers] mmap() for command substitutions still not living up to its fullest potential?
To: Phong Vo <kpv at research.att.com>
Cc: ast-developers at research.att.com, lionelcons1972 at googlemail.com
Post by Phong Vo
The default size of the mapped memory for a stream is driven by a #define
in a header file so it isn't hard to change it. I believe an application can
also use sfsetbuf(f, NULL, size) to set a desired size for the mapped buffer.
Generally speaking, Sfio and our other core libraries like CDT and Vmalloc
have been around for a very long time and their default parameters tend to
stay as they are until someone notices a performance issue. Well, not so much
for Vmalloc anymore because it has been completely rewritten recently to deal
with concurrency, both multiple threads and multiple processes for shared memory.
Anyway, it is good that you bring up this issue with Sfio now.
What do people think is a reasonable size to set the default mapped
size to on a 64-bit machine? Keep in mind that there apps with many dozens
files opened at the same time along with other large requirements for memory.
I think Roland's patch from
http://lists.research.att.com/pipermail/ast-developers/2013q2/002431.html
is sufficient for now because long before we exhaust VA space we run
out of patience to wait for the jobs to complete :) If that's not
sufficient then I'll suggest the '44bit clamp' Roland proposed to
partition 64bit VA space into 65536 files with 44bit chunks mapped
simultaneously. That just leaves 16 times the same amount of memory
for anon memory pages (which is many times the amount of world-wide
installed main memory in 2012).
Irek
Roland Mainz
2013-04-15 19:09:49 UTC
Permalink
Post by Phong Vo
From iszczesniak at gmail.com Mon Apr 15 10:53:27 2013
Subject: Re: [ast-developers] mmap() for command substitutions still not living up to its fullest potential?
To: Phong Vo <kpv at research.att.com>
Cc: ast-developers at research.att.com, lionelcons1972 at googlemail.com
Post by Phong Vo
The default size of the mapped memory for a stream is driven by a #define
in a header file so it isn't hard to change it. I believe an application can
also use sfsetbuf(f, NULL, size) to set a desired size for the mapped buffer.
Generally speaking, Sfio and our other core libraries like CDT and Vmalloc
have been around for a very long time and their default parameters tend to
stay as they are until someone notices a performance issue. Well, not so much
for Vmalloc anymore because it has been completely rewritten recently to deal
with concurrency, both multiple threads and multiple processes for shared memory.
Anyway, it is good that you bring up this issue with Sfio now.
What do people think is a reasonable size to set the default mapped
size to on a 64-bit machine? Keep in mind that there apps with many dozens
files opened at the same time along with other large requirements for memory.
I think Roland's patch from
http://lists.research.att.com/pipermail/ast-developers/2013q2/002431.html
is sufficient for now because long before we exhaust VA space we run
out of patience to wait for the jobs to complete :) If that's not
sufficient then I'll suggest the '44bit clamp' Roland proposed to
partition 64bit VA space into 65536 files with 44bit chunks mapped
simultaneously. That just leaves 16 times the same amount of memory
for anon memory pages (which is many times the amount of world-wide
installed main memory in 2012).
Exhausting VA space is not likely but keeping processes behaving nicely toward
one another should be a good thing.
Erm... do you mean "Unix processes" in this case ? Note that the size
of the MMU entries doesn't matter in today's MMU designs... basically
each Unix process has its own "MMU context" and switching between
them is fast (regardless of size).
Post by Phong Vo
You need to think about cases when many
processes do large I/O at the same time and the physical memory available on
the machine is far less than what the VA space can accommodate.
Uhm... this is usually handled gracefully... however there are corner
cases when the machines do not have enough memory left for kernel
tasks and/or filesystem pages compete directly with application/code
pages (for example see
http://sysunconfig.net/unixtips/priority_paging.txt ... old Solaris 7
once invented "priority paging" to deal with that (later Solaris
releases solved the problem differently)) ... but at that point the
system is in trouble anyway and there is no easy way to fix that.
And AFAIK the "sliding window" approach currently used by sfio doesn't
prevent that... it only creates synchronisation points (e.g. the
|mmap()| and |munmap()| calls) at which the process will wait until
resources have been reclaimed by the kernel and made available
again... but the overall costs are much higher in terms of "waiting
time". This doesn't sound problematic on a machine with 4 CPUs... but
on machines like the SPARC-T4 with 256 CPUs this can quickly ramp up
to a devastating 15-20 seconds (!!) of extra overhead just to get the
window moved to the next position on a loaded system (compared to
mapping the whole file in one large chunk).
Post by Phong Vo
It's little known but Sfio does adaptive buffer filling to reduce read I/O,
esp. when many seeks are done (hence most read data are wasted). The same
strategy could be adapted to mapped I/O. We'll look into that.
Erm... what exactly does that mean ? Note that today's kernels assume
that file I/O via |mmap()| is done by mapping the whole file or large
chunks (e.g. 8GB etc.) and then MMU entries are filled in on the fly
when an access is made. Something like a "sliding window" which
creates lots of |mmap()| and |munmap()| calls is *extremely* expensive
and doesn't scale on machines with many CPUs.
In short: |mmap()| and |munmap()| are expensive calls, but accessing
the mapped file is a lot cheaper than |read()| or |write()|. That's
why the current sfio behaviour of "... mapping a tiny window of 512k,
doing I/O and then mapping the next window..." is very bad in terms
of scalability and system resource usage. If possible, files should be
mapped with the largest "chunk size" possible or you'll run into
conflict with what the kernel (or better: Solaris and Linux) expects
and is designed for.
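
For illustration, a minimal plain-POSIX sketch (not sfio code) of that
access pattern: one |mmap()| for the whole file, demand faulting while
scanning, one |munmap()| at the end (counting newlines just to touch
every page):
-- snip --
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char** argv)
{
    int         fd;
    struct stat st;
    char*       p;
    off_t       i;
    long        lines = 0;

    if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0 || fstat(fd, &st) < 0)
        return 1;
    p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED)
        return 1;
    for (i = 0; i < st.st_size; i++)    /* pages are faulted in as they are touched */
        if (p[i] == '\n')
            lines++;
    printf("%ld lines\n", lines);
    munmap(p, (size_t)st.st_size);
    close(fd);
    return 0;
}
-- snip --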

----

Bye,
Roland
--
__ . . __
(o.\ \/ /.o) roland.mainz at nrubsig.org
\__\/\/__/ MPEG specialist, C&&JAVA&&Sun&&Unix programmer
/O /==\ O\ TEL +49 641 3992797
(;O/ \/ \O;)
Wendy Lin
2013-04-16 09:27:44 UTC
Permalink
Post by Phong Vo
Exhausting VA space is not likely but keeping processes behaving nicely toward
one another should be a good thing. You need to think about cases when many
processes do large I/O at the same time and the physical memory available on
the machine is far less than what the VA space can accommodate.
It's little known but Sfio does adaptive buffer filling to reduce read I/O,
esp. when many seeks are done (hence most read data are wasted). The same
strategy could be adapted to mapped I/O. We'll look into that.
Commenting here since it affects the builtin grep

Phong, please keep the implementation KISS ("Keep it simple, stupid").
The VM systems in Linux and AIX work better with large mappings which
do not change than with a moving target that changes frequently.
In a 64bit application libast should map the whole file and let the
kernel worry about the rest.

Wendy
Glenn Fowler
2013-04-16 13:07:52 UTC
Permalink
Post by Wendy Lin
Post by Phong Vo
Exhausting VA space is not likely but keeping processes behaving nicely toward
one another should be a good thing. You need to think about cases when many
processes do large I/O at the same time and the physical memory available on
the machine is far less than what the VA space can accommodate.
It's little known but Sfio does adaptive buffer filling to reduce read I/O,
esp. when many seeks are done (hence most read data are wasted). The same
strategy could be adapted to mapped I/O. We'll look into that.
Commenting here since it affects the builtin grep
Phong, please keep the implementation KISS ("Keep it simple, stupid").
The VM system in Linux and AIX better work with large mappings which
do not change than a moving target which changes frequently.
In a 64bit application libast should map the whole file and let the
kernel worry about the rest.
we are concerned about ast/sfio being a good neighbor and not hogging all resources
it's fine for every app on a single user desktop to map in 16GiB files
how about each of 2^t threads doing that with 2^f different files of size 2^z
there has to be a point of diminishing returns
and that is what sfio concerns itself with
so that ast apps do not have to worry
Cedric Blancher
2013-04-16 14:45:21 UTC
Permalink
Post by Glenn Fowler
Post by Wendy Lin
Post by Phong Vo
Exhausting VA space is not likely but keeping processes behaving nicely toward
one another should be a good thing. You need to think about cases when many
processes do large I/O at the same time and the physical memory available on
the machine is far less than what the VA space can accommodate.
It's little known but Sfio does adaptive buffer filling to reduce read I/O,
esp. when many seeks are done (hence most read data are wasted). The same
strategy could be adapted to mapped I/O. We'll look into that.
Commenting here since it affects the builtin grep
Phong, please keep the implementation KISS ("Keep it simple, stupid").
The VM system in Linux and AIX better work with large mappings which
do not change than a moving target which changes frequently.
In a 64bit application libast should map the whole file and let the
kernel worry about the rest.
we are concerned about ast/sfio being a good neighbor and not hogging all resources
This should not be a problem. A 10TB file mapped into a process's
address space doesn't consume 10TB of main memory; only the pages
which have actually been accessed by the application are backed by
memory. If the number of pages for a specific mapping or the number of
file pages mapped exceeds certain limits, the kernel starts to reclaim
the less used pages automatically using a garbage-collection
algorithm. This is done automatically and usually in parallel with the
threads operating on the same file mapping, since today's machines
always have a spare CPU which can be used for such work. If the number
of mapped pages still grows faster than they can be reclaimed by the
garbage collector, then the accessing threads are throttled
priority-wise so that other processes or threads are not handled
unfairly. So the concern that too many mappings can hog all resources
is unfounded. Modern Unix and Linux kernels protect themselves from
such problems.
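
A hedged demonstration of the demand-paging part of this (plain POSIX,
not AST code): map a whole file, then use mincore() to count how many
of its pages are resident before and after reading the first megabyte.
Note the vec argument type differs slightly across systems (char* on
Solaris, unsigned char* on Linux).
-- snip --
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* count how many pages of [addr, addr+len) are resident in memory */
static size_t resident_pages(void* addr, size_t len, size_t pagesize)
{
    size_t npages = (len + pagesize - 1) / pagesize;
    size_t i, n = 0;
    char*  vec = malloc(npages);

    if (vec && mincore(addr, len, (void*)vec) == 0)
        for (i = 0; i < npages; i++)
            if (vec[i] & 1)
                n++;
    free(vec);
    return n;
}

int main(int argc, char** argv)
{
    int         fd;
    struct stat st;
    char*       p;
    size_t      pagesize = (size_t)sysconf(_SC_PAGESIZE);
    size_t      i, touch;
    long        sum = 0;

    if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0 || fstat(fd, &st) < 0)
        return 1;
    p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED)
        return 1;
    printf("resident before access: %zu pages\n",
        resident_pages(p, (size_t)st.st_size, pagesize));
    touch = st.st_size < 1024 * 1024 ? (size_t)st.st_size : 1024 * 1024;
    for (i = 0; i < touch; i++)    /* fault in the first megabyte */
        sum += p[i];
    printf("resident after access:  %zu pages (checksum %ld)\n",
        resident_pages(p, (size_t)st.st_size, pagesize), sum);
    munmap(p, (size_t)st.st_size);
    close(fd);
    return 0;
}
-- snip --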
Post by Glenn Fowler
its fine for every app on a single user desktop to map in 16Gib files
how about each or 2^t threads
65536 threads is the limit. AFAIK no modern system can have more
than 65535 file descriptors open.
Post by Glenn Fowler
doing that with 2^f different files of size 2^z
there has to be a point of diminishing returns
and that is what sfio concerns itself with
so that ast apps do not have to worry
Use the proposed limit of 44bits then?
Ced
--
Cedric Blancher <cedric.blancher at googlemail.com>
Institute Pasteur
Phong Vo
2013-04-16 15:29:35 UTC
Permalink
From cedric.blancher at gmail.com Tue Apr 16 10:45:28 2013
the concern that too many mappings can hog all resources is unfounded.
The modern Unix and Linux kernels protect themselves from such
problems.
It's good that somebody thought about algorithmic issues at this level.
The lesson here is that everyone who writes software that will sit at
the core of many applications should make sure that the algorithms
employed behave well both on their own and when they interact with
others. These issues are subtle, and the algorithmic problems behind
them tend to be hard from a theoretical complexity point of view, so
any implemented solution will often have gaps. Even if these gaps are
low-probability events, they can be harmful when hit. It's good that
kernel developers, over a long period of time, have continued to
evolve their work in the right way to close those gaps. But it likely
won't be hard to think of test cases that would bring a system to its
knees.

The core AST libs are at a layer above the kernel but they are in a similar situation.
Think about how different libs may be used together for different reasons,
e.g., Sfio to read large files sequentially or CDT to keep large on-line databases
in shared memory, etc. Then, also think about how very large applications may run
on servers shared by many users. We have systems here that run 24/7 on hundreds of
large servers distributed throughout the country to maintain our core network.
Our software must safeguard itself against extraordinary situations.
This is what we have tried to do with our core libraries.

Back to Sfio, mmap is just one aspect of the I/O subsystem, which also
includes things such as discipline functions to replace the normal
system calls. But we do think that we understand what needs to be done
to address the observed performance issue. There is no need to rush
into something that we may regret later. Please be patient.

Phong