Background:
We were performance testing PostgreSQL 8.3beta1 on our latest Sun SPARC Enterprise T5220 server and on a Sun Fire X4200, both running Solaris 10 8/07. Generally, for performance benefits on Solaris, we mount the file systems with forcedirectio, which bypasses the file system cache and goes directly to the disks.
Problem:
During an OLTP benchmark run we observed huge reads on the file system holding $PGDATA and the database tables. Initially we thought our buffer pools were not big enough, but thanks to 64-bit builds we could use bigger buffer pools. However, even with extraordinarily large buffer pools we still saw lots of reads going to the disks.
DTrace to the Rescue:
I modified iosnoop.d to just snoop on reads. The modified rsnoop.d is as follows:
$ cat rsnoop.d
#!/usr/sbin/dtrace -s
/* Trace read() system calls made by postgres backends and
   print the file each one is reading. */
syscall::read:entry
/execname == "postgres"/
{
    printf("pid %d reading %s\n", pid, fds[arg0].fi_pathname);
}
Based on that, I found that most PostgreSQL processes were doing lots of reads on files in the pg_clog directory.
CLOG, or the commit log, keeps track of the status of transactions in flight. CLOG writes happen when a transaction commit (or abort) is recorded, or when XLOG is generated. I am not yet clear on the reads, but it seems every process constantly reads CLOG to check transaction status. CLOG data is not cached in the main PostgreSQL shared buffers (it gets only a small set of dedicated buffers), so it becomes a bottleneck: backends constantly have to go back to the file system to re-read CLOG pages.
# ./rsnoop.d
dtrace: script './rsnoop.d' matched 1 probe
CPU ID FUNCTION:NAME
0 49222 read:entry pid 8739 reading /export/home0/igen/pgdata/pg_clog/000C
0 49222 read:entry pid 9607 reading /export/home0/igen/pgdata/pg_clog/000C
0 49222 read:entry pid 9423 reading /export/home0/igen/pgdata/pg_clog/000C
0 49222 read:entry pid 8731 reading /export/home0/igen/pgdata/pg_clog/000C
0 49222 read:entry pid 8719 reading /export/home0/igen/pgdata/pg_clog/000C
0 49222 read:entry pid 9019 reading /export/home0/igen/pgdata/pg_clog/000C
1 49222 read:entry pid 9255 reading /export/home0/igen/pgdata/pg_clog/000C
1 49222 read:entry pid 8867 reading /export/home0/igen/pgdata/pg_clog/000C
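A small variation of the same probe can also aggregate read counts per file, which quantifies how dominant the pg_clog reads are. This is just a sketch along those lines (rsnoop_agg.d is a hypothetical name; DTrace prints the aggregation when the script exits):
$ cat rsnoop_agg.d
#!/usr/sbin/dtrace -s
/* Count read() calls per file for postgres backends;
   the per-file totals are printed when the script exits. */
syscall::read:entry
/execname == "postgres"/
{
    @reads[fds[arg0].fi_pathname] = count();
}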
Later, during another run, I added ustack() after the printf in the above script to also capture the calling functions:
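The modified probe body looked roughly like this (the same rsnoop.d as above, with ustack() appended):
#!/usr/sbin/dtrace -s
/* Print the file being read and the user stack of the caller. */
syscall::read:entry
/execname == "postgres"/
{
    printf("pid %d reading %s\n", pid, fds[arg0].fi_pathname);
    ustack();
}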
# ./rsnoop.d
dtrace: script './rsnoop.d' matched 1 probe
CPU ID FUNCTION:NAME
0 49222 read:entry pid 10956 reading /export/home0/igen/pgdata/pg_clog/0011
libc.so.1`_read+0xa
postgres`SimpleLruReadPage+0x3e6
postgres`SimpleLruReadPage_ReadOnly+0x9b
postgres`TransactionIdGetStatus+0x1f
postgres`TransactionIdDidCommit+0x42
postgres`HeapTupleSatisfiesVacuum+0x21a
postgres`heap_prune_chain+0x14b
postgres`heap_page_prune_opt+0x1e6
postgres`index_getnext+0x144
postgres`IndexNext+0xe1
postgres`ExecScan+0x189
postgres`ExecIndexScan+0x43
postgres`ExecProcNode+0x183
postgres`ExecutePlan+0x9e
postgres`ExecutorRun+0xab
postgres`PortalRunSelect+0x47a
postgres`PortalRun+0x262
postgres`exec_execute_message+0x565
postgres`PostgresMain+0xf45
postgres`BackendRun+0x3f9
0 49222 read:entry pid 10414 reading /export/home0/igen/pgdata/pg_clog/0011
libc.so.1`_read+0xa
postgres`SimpleLruReadPage+0x3e6
postgres`SimpleLruReadPage_ReadOnly+0x9b
postgres`TransactionIdGetStatus+0x1f
postgres`TransactionIdDidCommit+0x42
postgres`HeapTupleSatisfiesVacuum+0x21a
postgres`heap_prune_chain+0x14b
postgres`heap_page_prune_opt+0x1e6
postgres`index_getnext+0x144
postgres`IndexNext+0xe1
postgres`ExecScan+0x189
^C libc.so.1`_read+0xa
postgres`SimpleLruReadPage+0x3e6
postgres`SimpleLruReadPage_ReadOnly+0x9b
postgres`TransactionIdGetStatus+0x1f
postgres`TransactionIdDidCommit+0x42
postgres`HeapTupleSatisfiesMVCC+0x34f
postgres`index_getnext+0x29e
postgres`IndexNext+0xe1
postgres`ExecScan+0x189
postgres`ExecIndexScan+0x43
postgres`ExecProcNode+0x183
postgres`ExecutePlan+0x9e
postgres`ExecutorRun+0xab
postgres`PortalRunSelect+0x47a
postgres`PortalRun+0x262
postgres`exec_execute_message+0x565
postgres`PostgresMain+0xf45
postgres`BackendRun+0x3f9
postgres`BackendStartup+0x271
postgres`ServerLoop+0x259
So multiple processes are reading the same file. And since the file system is told not to cache files, every one of those read I/Os is sent to the disk to read the file again.
Workaround for the high reads on CLOG on Solaris:
Keep the cluster's $PGDATA on regular UFS (buffered, with logging enabled), and always create a separate tablespace for your database on a forcedirectio-mounted file system that bypasses the file system cache. This allows all the PostgreSQL CLOG files to be cached by UFS, greatly reducing stress on the underlying storage. For writes, to the best of my knowledge, PostgreSQL will still fsync to force the CLOG writes onto the disks so the data stays consistent, but the reads are spared from going to the disks and are returned from the cache.
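As a rough sketch, the layout can be set up along these lines (the device names, mount points, tablespace name and database name below are only placeholders, not the ones used in this benchmark):
# $PGDATA on a normally buffered UFS file system with logging
mount -F ufs -o logging /dev/dsk/c1t0d0s6 /pgdata

# a separate UFS file system for table data, mounted forcedirectio
mount -F ufs -o forcedirectio /dev/dsk/c1t1d0s6 /dbdata

# point the database at a tablespace on the forcedirectio file system
# (the directory must already exist and be owned by the postgres user)
psql -d postgres -c "CREATE TABLESPACE dbspace LOCATION '/dbdata/pgts'"
psql -d postgres -c "CREATE DATABASE mydb TABLESPACE = dbspace"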
Result:
With a rightly sized buffer pool, all database data can now sit in the PostgreSQL cache, so reads are spared from the tablespaces. As for the $PGDATA files, UFS caches the CLOG files and the like, sparing those reads from going to the disks as well. In the end we achieve a rightly sized buffer pool where no reads are required even in a heavy OLTP environment, and the disks are just busy doing the writes from updates and inserts.
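A quick way to confirm that at steady state the disks really are only doing writes is to watch the per-device read and write rates; r/s should stay near zero while w/s carries the load (the 5-second interval here is arbitrary):
# iostat -xnz 5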
Asynchronous Commit (New in 8.3):
Also, as requested by Josh Berkus, I tried out Asynchronous Commit in 8.3beta1.
I compared five scenarios on the internal disks (the prime target); the settings behind them are sketched after the list:
1. Default options (commit_delay off, synchronous_commit=true)
2. With Commit_delay on
3. With Asynchronous and Commit_delay on
4. With Asynchronous commit but Commit_delay off
5. With Fsync off
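These map to postgresql.conf settings roughly as follows (the commit_delay value is just an example, not necessarily the one used in these runs):
# (2) synchronous commit with commit_delay (baseline)
synchronous_commit = on
commit_delay = 10        # microseconds; example value only
fsync = on

# (3) asynchronous commit with commit_delay
synchronous_commit = off
commit_delay = 10
fsync = on

# (4) asynchronous commit, commit_delay off
synchronous_commit = off
commit_delay = 0
fsync = on

# (5) fsync off
fsync = off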
In 8.2 I had found that, compared to (1), (2) gave me a huge boost (about 2X), but fsync=off would eventually be even 2.8X faster than (2).
In 8.3, therefore, I did not even test the default option and took (2) as my baseline run. I found that (3), (4) and (5) all gave me a similar boost of about 2.55X over the baseline (2), since eventually I was CPU bound on my box and the I/O was handled well.
(Though I found that (5) in 8.2 was slightly better than (5) in 8.3beta1, since 8.3beta1 got CPU saturated slightly earlier.)
Additional CPU Consumption Findings:
In the lightweight OLTP testing performed with 8.3, with the above CLOG workaround in place, I reached a point where the system was out of CPU resources at about 1000 users. Doing a quick profile with the "hotuser" program from the DTraceToolkit, the top function was postgres`GetSnapshotData:
# ./hotuser -p 10559
....
postgres`hash_seq_term 1 2.1%
postgres`SearchCatCache 2 4.2%
postgres`hash_search_with_hash_value 4 8.3%
postgres`GetSnapshotData 6 12.5%
The lock waits during the 1000-user run were as follows:
# ./83_lwlock_wait.d 9231
Lock Id Mode Count
WALInsertLock Exclusive 1
ProcArrayLock Exclusive 19
Lock Id Combined Time (ns)
WALInsertLock 428507
ProcArrayLock 1009652776
# ./83_lwlock_wait.d 9153
Lock Id Mode Count
CLogControlLock Exclusive 1
WALInsertLock Exclusive 1
ProcArrayLock Exclusive 15
Lock Id Combined Time (ns)
CLogControlLock 25536
WALInsertLock 397265
ProcArrayLock 696894211
#
My Opinions:
My guess is that the ProcArrayLock contention is coming from the GetSnapshotData function, or is at least caused by it, but I will let the experts comment.
I am of the opinion that if we tune GetSnapshotData, we should be able to handle more users.
Overall I am excited about the 8.3beta1 performance, especially asynchronous_commit. However, getting the CPU consumption of GetSnapshotData() in line, and fixing the CLOG reading problem occurring in SimpleLruReadPage(), would greatly enhance the performance of 8.3 for OLTP benchmarks.