Here are my initial results with PostgreSQL 9.0.
This is still a work in progress in some ways, since this test has never been done before with PostgreSQL 9.0 (or at least I haven't seen anybody publish anything around it before).
Anyway, my real goal is not just to put the numbers here but to understand what is happening. At around 40-48 clients we seem to peak out and hit a wall as far as scaling goes. The statement is a pretty simple SELECT from sbtest by a randomly generated primary key.
This is where I miss DTrace and OpenSolaris. Now that I am working on Linux, I thought of trying it out with SystemTap. Setting up a working SystemTap was itself a big challenge and took me some time to get working on my existing kernel. (Boy, Linux should get its act straight on user-level tracing. It's not as omnipresent as DTrace is on Solaris.)
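(If you are setting it up yourself, a quick generic sanity check that the stap runtime and kernel debuginfo are in place is a trivial begin probe; nothing PostgreSQL-specific about it:)

# one-liner to confirm the SystemTap toolchain actually runs probes
stap -v -e 'probe begin { printf("systemtap is alive\n"); exit() }'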
I converted my old lockstat utility to work on SystemTap. The script seems to work, but SystemTap can act funny sometimes: don't exit from it abruptly, or it may send a "wrong" signal to the postgres backend, which then just commits suicide since it cannot figure out what to do with such a "trace/debug" signal.
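The aggregation part of the script looks roughly like the sketch below. This is a simplified reconstruction of the idea, not my exact script, and it assumes PostgreSQL was built with --enable-dtrace so the lwlock__wait__start / lwlock__wait__done userspace markers are available (the binary path is from my install; adjust as needed).

global wait_start, wait_time

# Record the timestamp when a backend starts waiting on an LWLock.
probe process("/usr/local/aurora-1.0/bin/postgres").mark("lwlock__wait__start")
{
    # $arg1 = LWLockId, $arg2 = LWLockMode
    wait_start[tid(), $arg1] = gettimeofday_us()
}

# When the wait ends, add the elapsed time to a per-lock aggregate.
probe process("/usr/local/aurora-1.0/bin/postgres").mark("lwlock__wait__done")
{
    t = wait_start[tid(), $arg1]
    if (t) {
        wait_time[$arg1, $arg2] <<< (gettimeofday_us() - t)
        delete wait_start[tid(), $arg1]
    }
}

# On exit, dump count/sum/max/avg wait times per lock id and mode.
probe end
{
    printf("%6s %5s %10s %14s %10s %12s\n", "LWID", "MODE", "COUNT", "SUM-TIME(us)", "MAX(us)", "AVG-TIME(us)")
    foreach ([id, mode] in wait_time) {
        printf("%6d %5d %10d %14d %10d %12d\n", id, mode,
               @count(wait_time[id, mode]), @sum(wait_time[id, mode]),
               @max(wait_time[id, mode]), @avg(wait_time[id, mode]))
    }
}

The top of the output from an actual run looked like this: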
LOCKNAME      LWID  MODE  W/A   COUNT  SUM-TIME(us)  MAX(us)  AVG-TIME(us)
LockMgrLock     45  Ex    W     85343     469682510    13152          5503
LockMgrLock     57  Ex    W     57547      30903727     8313           537
LockMgrLock     44  Ex    W       390         34061     1670            87
LockMgrLock     59  Ex    W       375         41570     2032           110
LockMgrLock     56  Ex    W       361         39685     1889           109
LockMgrLock     47  Ex    W       344         24548     1564            71
LockMgrLock     54  Ex    W       335         67770     2319           202
LockMgrLock     50  Ex    W       325         44213     1690           136
LockMgrLock     49  Ex    W       325         39280     1475           120
LockMgrLock     55  Ex    W       323         39448     1584           122
LockMgrLock     48  Ex    W       323         26982     1669            83
What you see above are the top waits on LWLocks by count, along with the total, maximum, and average wait times for each LWLock.
(Thinking it might be related to NUM_LOCK_PARTITIONS, I did some experiments with different partition counts, but since all these waits are for locks on the same table and its index, they keep landing on the same partitions, so it does not help here.)
Then I modified my lockstat script slightly to capture stack traces for those two locks; a rough sketch of the extra probe is shown below.
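Again this is a simplified sketch rather than the exact change: it reuses the same wait-start marker as above and just filters on the two lock ids and prints a userspace backtrace.

# Print the backend's user stack whenever it starts waiting on LWLock 45 or 57.
probe process("/usr/local/aurora-1.0/bin/postgres").mark("lwlock__wait__start")
{
    if ($arg1 == 45 || $arg1 == 57) {
        printf("Lock id:%d, LockMode:%d\n", $arg1, $arg2)
        print_ubacktrace()
    }
}

For the symbols to resolve, stap generally needs the unwind data for the postgres binary, so pass it with -d (and possibly --ldd for shared libraries) when running the script.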
The majority of those waits came from pretty much the same code paths:
Lock id:45, LockMode:0
0x000000000062412e : LWLockAcquire+0x25e/0x270 [/usr/local/aurora-1.0/bin/postgres]
0x00000000006228cc : LockAcquireExtended+0x2dc/0xa40 [/usr/local/aurora-1.0/bin/postgres]
0x0000000000620788 : LockRelationOid+0x48/0x60 [/usr/local/aurora-1.0/bin/postgres]
0x000000000064be85 : AcquireExecutorLocks+0xd5/0x190 [/usr/local/aurora-1.0/bin/postgres]
0x000000000064c96a : RevalidateCachedPlan+0x5a/0x3b0 [/usr/local/aurora-1.0/bin/postgres]
0x000000000062ee64 : exec_bind_message+0x604/0xab0 [/usr/local/aurora-1.0/bin/postgres]
0x0000000000630bbd : PostgresMain+0x82d/0x16e0 [/usr/local/aurora-1.0/bin/postgres]
0x00000000005f302e : ServerLoop+0x96e/0xcb0 [/usr/local/aurora-1.0/bin/postgres]
0x00000000005f3dec : PostmasterMain+0xa7c/0x1150 [/usr/local/aurora-1.0/bin/postgres]
0x000000000044e4f0 : main+0x370/0x430 [/usr/local/aurora-1.0/bin/postgres]
Lock id:57, LockMode:0
0x000000000062412e : LWLockAcquire+0x25e/0x270 [/usr/local/aurora-1.0/bin/postgres]
0x00000000006228cc : LockAcquireExtended+0x2dc/0xa40 [/usr/local/aurora-1.0/bin/postgres]
0x0000000000620788 : LockRelationOid+0x48/0x60 [/usr/local/aurora-1.0/bin/postgres]
0x0000000000477425 : relation_open+0x55/0x90 [/usr/local/aurora-1.0/bin/postgres]
0x000000000047e5e3 : index_open+0x13/0x90 [/usr/local/aurora-1.0/bin/postgres]
0x00000000005732a5 : ExecInitIndexScan+0x125/0x1e0 [/usr/local/aurora-1.0/bin/postgres]
0x0000000000560275 : ExecInitNode+0x135/0x290 [/usr/local/aurora-1.0/bin/postgres]
0x000000000055ed80 : standard_ExecutorStart+0x530/0xc70 [/usr/local/aurora-1.0/bin/postgres]
0x000000000063278b : PortalStart+0x1bb/0x380 [/usr/local/aurora-1.0/bin/postgres]
0x000000000062ee9c : exec_bind_message+0x63c/0xab0 [/usr/local/aurora-1.0/bin/postgres]
0x0000000000630bbd : PostgresMain+0x82d/0x16e0 [/usr/local/aurora-1.0/bin/postgres]
0x00000000005f302e : ServerLoop+0x96e/0xcb0 [/usr/local/aurora-1.0/bin/postgres]
0x00000000005f3dec : PostmasterMain+0xa7c/0x1150 [/usr/local/aurora-1.0/bin/postgres]
0x000000000044e4f0 : main+0x370/0x430 [/usr/local/aurora-1.0/bin/postgres]
So what I understand is that we have two problems here:
1. RevalidateCachedPlan re-taking relation locks via AcquireExecutorLocks (the major bottleneck)
2. The AccessShare lock entry made when the index is opened at executor startup (the second bottleneck)
Well, at least now I know the areas where scaling bottlenecks can limit even simple SELECT statements. The question now is what to do about them. Back to the mailing list.
1 comment:
Hi, Jignesh. Thanks for working through the systemtap setup pains and getting some useful data out. Please be aware that the trace/debug signal issue is one that we've seen occasionally, and recent mainline (git systemtap) code has some related changes. If you see the same problems, you might give that a try.