Wednesday, April 20, 2005

Tuning Solaris 10 x86 for PostgreSQL 8.0.2


Recently I ran a few tests against PostgreSQL on Solaris 10 x86. My main aim was to see how I could improve PostgreSQL performance when it is deployed on Solaris 10 x86. These are the changes that I made to improve performance on my setup.


First, since PostgreSQL relies on the file system cache for most of its buffering, I found that a default Solaris 10 x86 installation does not help it much, especially for repeat queries: it won't keep the database files in the file system cache, since they can be as big as 1GB each.


Hence I made the following changes in /etc/system to help improve the caching of the database files.


NOTE: These are the only tunables defined in my /etc/system

* Reserve 10GB (10 * 1024 * 1024 * 1024 bytes) of the 16GB of RAM for the file system cache
set segmapsize=10737418240
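* Disable UFS free-behind so pages from large sequential file I/O are not freed from the cache immediately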
set ufs:freebehind=0



In my case, since PostgreSQL was the only application running, setting segmap_percent=80 helps a lot. The default value is 12. Realistically it should be set to something more reasonable like 50% or so, assuming you have enough memory to spare, since it comes out of the freelist.
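For reference, the percentage-based form of that tunable in /etc/system looks like the line below (50 is just an illustrative value; on this x86 box I used the byte-based segmapsize shown above instead):

set segmap_percent=50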


This allows the 1GB data files of PostgreSQL to stay in cache and be reused on repeat usage. It improves IO performance quite a bit, especially if the next query uses the same data files (rows).



The next thing I found was that the WAL (Write Ahead Log) of PostgreSQL was becoming the bottleneck during writes, since it shared the same cached file system as the database files. Based on my talks with Josh Berkus, moving it to a different file system helped. What helped even more was putting it on a file system mounted with the "forcedirectio" option.


DISCLAIMER: I haven't cut-and-pasted the commands below, so there could be typos from typing them in (use at your own risk).



# stop the PostgreSQL server before relocating pg_xlog
mount -o forcedirectio /dev/dsk/cntndnsn /mypath/pg_xlog
tunefs -a 128 /mypath/pg_xlog
cp -R -p $PGDATA/pg_xlog/* /mypath/pg_xlog
rm -rf $PGDATA/pg_xlog
ln -s /mypath/pg_xlog $PGDATA/pg_xlog
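To make the forcedirectio mount persist across reboots, the matching /etc/vfstab entry would look something like this (the device names are placeholders for your actual slice):

/dev/dsk/cntndnsn  /dev/rdsk/cntndnsn  /mypath/pg_xlog  ufs  2  yes  forcedirectio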



Changing the maxcontig of the $PGDATA file system to 128 at the same time also helps.
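Assuming $PGDATA lives on its own UFS file system, that change is along the same lines (the device name is again a placeholder for the slice holding $PGDATA):

tunefs -a 128 /dev/rdsk/cntndnsn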


What we did above is put the database on a "cached" file system and the WAL on a "directio" file system, which helps improve loading data into PostgreSQL.


Try this and let me know how your performance changes.


 


8 comments:

Bill Hathaway said...

Hi, thanks for posting your tips. I noticed you didn't mention changing the internal shared memory allocation that Postgres uses for buffering (I'm not a Postgres or Oracle expert, but to me this seems analogous to the Oracle SGA). I think the Postgres defaults for this are fairly small for modern systems. I was curious whether you had tried experimenting with the shared_buffers values, or whether that was something you hadn't tested yet. Typically I thought it was better to cache inside the application if possible, but if that turns out not to be efficient, increasing segmap_percent makes a lot of sense.

Jignesh Shah said...

Thanks for your comment. Actually I had tried increasing/decreasing the shared memory within PostgreSQL. In fact I find that decreasing the shared memory to the minimum allows me to increase the free memory available for segmap, which helps my workload. Talking with a couple of other people more familiar with PostgreSQL than I am, I find that it is unlike Oracle: contrary to our instinct to use more shared memory, PostgreSQL relies heavily on file system caching rather than in-application caching.
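Just to illustrate (rough numbers for my 16GB box, not a tested recommendation), in postgresql.conf that approach translates to something like:

shared_buffers = 1000            # 8KB buffers; kept near the default
effective_cache_size = 1310720   # 8KB pages; roughly the 10GB file system cache

That way the planner still knows the OS is doing a lot of caching on its behalf.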

Adam from Oz said...

What postgres compilation options did you use? Anything special?

PatrickG said...

Great stuff! Did you turn off the fsync() after each write? That results in a huge performance boost for me.

Jignesh Shah said...

Hi Adam from OZ,


In my case I had just used pre-built binaries, configured by the standard 'configure' script. Another colleague of mine had used the Sun Studio 10 compilers and had seen a boost in performance too. I am not sure which options he used, but generally the best results come from the -xO4 -xtarget=native options with Sun Studio 10.
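As a rough sketch (the compiler path and configure invocation below are just the usual ones, not necessarily what he used), building with Sun Studio 10 would look something like:

CC=/opt/SUNWspro/bin/cc CFLAGS="-xO4 -xtarget=native" ./configure
make
make install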


Hi PatrickG,


Since in my case I am using forcedirectio for the WAL, turning fsync off for the WAL will not make much difference to the best of my knowledge, since the writes are already committed to disk. You would see that boost if you turned it off on a cached file system (say, in my case, the database files themselves). But then there are people who will never do that, since it leaves an opportunity for corruption in case of a power failure.
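(For reference, the setting in question is just the postgresql.conf line below; I leave it at the default.)

fsync = true        # setting this to false trades crash safety for speed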


Thanks for your feedback.


Michael Crozier said...

I'm a little confused by the parameters you listed, especially with the comment "NOTE: These are the only tunables defined in my /etc/system".
Can segmapsize be larger than the value indicated by segmap_percent? Does the value of either tunable limit the range of the other?
Thanks for the information.

Jignesh Shah said...

The reason I highlight that these are the only tunables defined in my /etc/system is to emphasize that many of the defaults in Solaris 10 are bumped up compared to Solaris 9, so you may no longer need to tune them in /etc/system. (The less clutter in /etc/system, the better.)
As for segmapsize, it depends on your total physical RAM and how much of it you want to keep free for shared memory, heaps, etc. for all the processes. By setting segmapsize, you are reserving that memory for the file system buffer cache. Of course you can go as high as your physical memory minus the minimum memory that the kernel needs to start up.
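As a worked example for my box (the 6GB reserve below is just my own split, not a rule):

16GB RAM - ~6GB for the kernel, shared memory and heaps = 10GB for segmap
10 * 1024 * 1024 * 1024 = 10737418240, hence set segmapsize=10737418240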

Mischa Sandberg said...

About PG shared_buffers: these aren't the same as Oracle's SGA. PG is happy to let the OS own and manage most of the (RAM) disk cache; you tell the query planner that with the "effective_cache_size = ..." config hint.
As of PG 8, the internal buffer cache manager in PG is smarter than most OSes (google or Wikipedia "adaptive replacement cache algorithm"), so bumping shared_buffers up to 50% of RAM might pay (ymmv). For my query load, it makes more sense to crank up "work_mem", which gives processes more private memory for in-RAM sorts and hashes.
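Purely illustrative numbers for a 16GB box (ymmv, I haven't benchmarked this exact split):

shared_buffers = 1048576       # 8KB pages, ~8GB = ~50% of RAM
effective_cache_size = 524288  # 8KB pages, ~4GB still left to the OS cache
work_mem = 65536               # KB; 64MB of private memory per sort/hash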
(Anecdotal) I once saw a Sol9 problem with a big shm: the VM system seemed to count recently-used shm pages multiple times, as part of EACH PROCESS' resident set, and it made poor swapping decisions based on those huge estimated process sizes. Anybody care to comment?