Thursday, September 14, 2006

Debugging performance of PostgreSQL in Solaris Nevada build 46 using Sun Studio Tools


I may still be behind the latest Nevada build, but on my build 46 install of Solaris Nevada on my laptop, I decided to try out pgbench against the PostgreSQL binaries bundled with the OS.

My version of PostgreSQL is

 PostgreSQL 8.1.4 on i386-pc-solaris2.11, compiled by /opt/SUNWspro.40/SOS10/bin/cc -Xa
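
For reference, creating and bulk-loading the benchmark database with the bundled binaries goes roughly like this; a minimal sketch, where the scaling factor and the pgbench path are illustrative rather than what I actually used:

$ /usr/bin/createdb pgsample
$ /usr/bin/pgbench -i -s 100 pgsample     # -i creates and bulk-loads the pgbench tables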


I created pgsample and started loading the database using pgbench. zpool iostat reported write peaks of about 7 MB/sec, and visually it looked slow. So I opened the postgresql.conf file and added my favorite modifications to it:

wal_sync_method = fdatasync
wal_buffers = 128
checkpoint_segments = 128
bgwriter_lru_percent = 0
bgwriter_all_percent = 0
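
Most of these settings can be picked up with a reload, but wal_buffers only changes on a full server restart, so the quickest way to apply them all is a restart; a sketch, with a data directory path that is illustrative rather than where the bundled server actually keeps it:

$ pg_ctl -D /var/lib/pgsql/data restart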



With these settings the same load felt better, with peaks at about 10.5 MB/sec, though I would say it is still slow since neither the CPU nor the I/O is saturated. I used Brendan Gregg's hotuser utility, which uses DTrace to sample the process, to find out what's happening.
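
hotuser is part of the DTraceToolkit; pointed at the backend doing the load, the invocation is roughly the following (a sketch: 2690 is the backend's PID, and Ctrl-C stops the sampling and prints the counts):

# ./hotuser -p 2690
^C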


It reported the following (tail end):

libc.so.1`strtol                                          160   3.2%
postgres`CopyFrom                                         200   4.0%
postgres`PageAddItem                                      250   5.0%
libc.so.1`memcpy                                          270   5.4%
libc.so.1`memset                                          300   6.0%
postgres`tas                                              430   8.7%
postgres`XLogInsert                                      1031  20.7%


So it seems that if I improve the performance of XLogInsert, I might actually be able to load the data faster with the PostgreSQL build bundled in Solaris Nevada.

This is where the Sun Studio 11 tools come in handy, now that they are free!
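
The dbx session below attaches to PID 2690, the backend serving the pgbench connection; one way to spot it is simply to look for the busy postgres process, for example:

# ps -ef | grep postgres            # list the postgres processes
# prstat -c 1 1 | grep postgres     # the backend doing the load is the one burning CPU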

# PATH=$PATH:/opt/SUNWspro/bin
# umask 000
# dbx /usr/bin/postgres 2690
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.5' in your .dbxrc
Reading postgres
Reading ld.so.1
...
Attached to process 2690
stopped in _write at 0xd13464d5
0xd13464d5: _write+0x0015: jae _write+0x21 [ 0xd13464e1, .+0xc ]
(dbx) collector enable
(dbx) cont
Creating experiment database test.1.er ...
Reading libmp.so.2
Reading libmd.so.1
Reading libscf.so.1
Reading libuutil.so.1


execution completed, exit code is 0
(dbx) quit


Now that I have the experiment collected, it is time to review it using er_print:

# er_print test.1.er
(er_print) func
Functions sorted by metric: Exclusive User CPU Time

Excl.      Incl.      Name
User CPU   User CPU
 sec.       sec.
20.920     20.920
 3.360      6.830     comparetup_index
 2.170      8.690     tuplesort_heap_siftup
 1.610      2.110     XLogInsert
 1.290      2.410     myFunctionCall2
 1.210      3.630     inlineApplySortFunction



Since I had continued the experiment till the very end, it also collected information on the index builds at the end of the load, and those show up at the top. However, XLogInsert still shows up there and needs to be investigated.
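
If I wanted to keep the index-build phase out of the profile entirely, the collector could be stopped from inside dbx once the interesting phase is over; a sketch, assuming the collector pause subcommand behaves as described in the collector help:

(dbx) collector enable
(dbx) cont
^C                          # break in just before the index builds start
(dbx) collector pause       # stop recording; the load phase is already captured
(dbx) cont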


Had it been a debug build, I could have just used the lines command to point to the line numbers in the source files.

(er_print) lines
Objects sorted by metric: Exclusive User CPU Time

Excl.      Incl.      Name
User CPU   User CPU
 sec.       sec.
20.920     20.920     Total
 3.360      6.830     Function: comparetup_index, instructions from source file tuplesort.c
 2.170      8.690     Function: tuplesort_heap_siftup, instructions from source file tuplesort.c
 1.610      2.110     Function: XLogInsert, instructions without line numbers
 1.290      2.410     Function: myFunctionCall2, instructions from source file tuplesort.c
 1.210      3.630     Function: inlineApplySortFunction, instructions from source file tuplesort.c
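
To get that line-level data, PostgreSQL would have to be rebuilt from source with debugging information; with the Sun Studio compiler that looks roughly like this (a sketch; the flags and install prefix are illustrative):

$ CC=/opt/SUNWspro/bin/cc CFLAGS="-xO2 -g" ./configure --prefix=/opt/pg-debug
$ gmake && gmake install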



Hence I used the disassembly feature to look for quick hotspots. The Sun Studio tools mark hotspots with "##", which makes them easier to notice.

(er_print) dis XLogInsert
Source file: (unknown)
Object file: /usr/bin/postgres
Load Object: /usr/bin/postgres

Excl. Incl.
User CPU User CPU
sec. sec.

0.010 0.010 [?] 80a5ab0: pushl %ebp
0. 0. [?] 80a5ab1: movl %esp,%ebp
0. 0. [?] 80a5ab3: subl $0x1c0,%esp
0. 0. [?] 80a5ab9: andl $-0x10,%esp
.....

0. 0. [?] 80a5bb8: call XLogCheckBuffer [ 0x80a64a1, .+0x8e9 ]

.....

0. 0. [?] 80a5c3a: testl %eax,%eax
0. 0. [?] 80a5c3c: jbe .+0x65 [ 0x80a5ca1 ]
0.070 0.070 [?] 80a5c3e: movl %esi,%eax
0.030 0.030 [?] 80a5c40: shrl $0x18,%eax
0.060 0.060 [?] 80a5c43: movzbl (%ecx),%edi
0.030 0.030 [?] 80a5c46: xorl %edi,%eax
0.020 0.020 [?] 80a5c48: movzbl %al,%eax
0.210 0.210 [?] 80a5c4b: incl %ecx
0. 0. [?] 80a5c4c: shll $8,%esi
0. 0. [?] 80a5c4f: xorl 0x82499b0(,%eax,4),%esi
## 0.360 0.360 [?] 80a5c56: movl %edx,%eax
0. 0. [?] 80a5c58: decl %eax
0. 0. [?] 80a5c59: movl %edx,%edi
0.050 0.050 [?] 80a5c5b: movl %eax,%edx
0. 0. [?] 80a5c5d: testl %edi,%edi


.....
0. 0. [?] 80a5dfc: call elog_start [ 0x81f97ab, .+0x1539af ]
.....



0. 0. [?] 80a6499: popl %esi
0. 0. [?] 80a649a: popl %ebx
0. 0. [?] 80a649b: movl %ebp,%esp
0. 0. [?] 80a649d: popl %ebp
0. 0. [?] 80a649e: ret $4
(er_print)



So, based on the disassembly, it seems there is a bottleneck somewhere between the call to XLogCheckBuffer and the call to elog_start.


Well, maybe one of the PostgreSQL hackers can use this information to figure out what is going on and modify the code so that it is no longer a bottleneck. It looks like a loop and could use some loop optimizations.


Similarly, we can use it to find hotspots in other functions:

(er_print) dis comparetup_index
Source file: tuplesort.c
Object file: /usr/bin/postgres
Load Object: /usr/bin/postgres

Excl. Incl.
User CPU User CPU
sec. sec.
Function: comparetup_index
0.130 0.130 [?] 820ef9d: pushl %ebp
0. 0. [?] 820ef9e: movl %esp,%ebp
0.030 0.030 [?] 820efa0: subl $0x20,%esp
0. 0. [?] 820efa3: andl $-0x10,%esp
0. 0. [?] 820efa6: pushl %ebx


........

0. 0. [?] 820f15b: call nocache_index_getattr [ 0x8080ec8, .-0x18e293 ]
0. 0. [?] 820f160: addl $0x10,%esp
0.110 0.110 [?] 820f163: movsbl -4(%ebp),%ecx
0. 0. [?] 820f167: pushl %ecx
0. 0. [?] 820f168: pushl %eax
0.030 0.030 [?] 820f169: movsbl -8(%ebp),%eax
0. 0. [?] 820f16d: pushl %eax
0. 0. [?] 820f16e: movl -0x10(%ebp),%eax
0.040 0.040 [?] 820f171: pushl %eax
0. 0. [?] 820f172: pushl $2
0.030 0.030 [?] 820f174: leal 0xc(%edi),%eax
0. 0. [?] 820f177: pushl %eax
## 0. 3.490 [?] 820f178: call inlineApplySortFunction [ 0x820e9e0, .-0x798 ]
0.150 0.150 [?] 820f17d: addl $0x18,%esp
0. 0. [?] 820f180: testl %eax,%eax
0.110 0.110 [?] 820f182: jne .+0xe0 [ 0x820f262 ]
0. 0. [?] 820f188: cmpb $0,-8(%ebp)
0. 0. [?] 820f18c: setne %al
...



Looks like index building also has a bottleneck in inlineApplySortFunction.
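
The same dis trick can be pushed one level further down the sort path; as a sketch (output omitted here):

(er_print) dis inlineApplySortFunction
(er_print) dis myFunctionCall2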


