Steve (via comments) asked about guidelines for sizing PostgreSQL logs for X transactions. I thought for a while about how to answer that question. One approach was to start at the disk level and work my way up to the PostgreSQL logs; the other was to start from the PostgreSQL logs and work my way down to the disks. After some time, I took the latter approach. Why? Because by doing so I can also help people size their systems properly.
Most of the reasoning I will use in this post, especially on the disk/storage side, comes from my interpretation of what I have learned from experts within and outside Sun. So while it may not be entirely accurate, it sounds logical to me. Feel free to flame me if there is something illogical here.
Generally I have found that people approach their sizing in the following order: first they select their application, then their database, then either the operating system or the server itself, and they probably think about the storage last, mostly carving it from existing SAN infrastructure.
To keep this post simple, let's assume that you have identified the Sun Fire X4450 (it's new and hence in fashion) for your OLTP application running on PostgreSQL 8.3. I would say that server can support 3000-4000 OLTP transactions per second. For sizing purposes let's say we are targeting a max of 3000 tps on it. Now let's work our way through this.
If I understand PostgreSQL 8.3 right, then by default each transaction, when it commits, will result in a sync of the WAL record, which writes a page (typically 8K) to the log devices. (Defaults in PostgreSQL 8.3: fsync = on, commit_delay = 0, synchronous_commit = on, full_page_writes = on.)
So let's size the pressure on the logs:
For 1 transaction = at least 1 I/O operation of 8KB on the log filesystem
For 3000 tps = at least 3000 I/O operations per second (iops), totaling 24,000 KB/sec of writes (roughly 24MB/sec for simplicity)
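The arithmetic above can be sketched in a few lines (the 8KB-per-commit figure is the default WAL page write discussed earlier; the rest is just multiplication):

```python
# Back-of-the-envelope log sizing: one 8KB WAL page synced per commit.
WAL_WRITE_KB = 8        # WAL page written per committing transaction
TPS = 3000              # target transactions per second

iops = TPS                           # at least one I/O operation per commit
throughput_kb = iops * WAL_WRITE_KB  # 24,000 KB/sec of log writes

print(iops)                          # 3000 iops
print(throughput_kb / 1024)          # ~23.4 MB/sec (~24 MB/sec rounded)
```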
Now most modern drives do support 25-60MB/sec of writes, so the overall throughput should not be a problem.
However, by the same token, most modern drives (some call them spindles in this conversation) typically only average about 120 iops (assuming the write cache is turned off on the drive for safety).
Aha, so a single disk cannot handle the logs. The number of drives I would need just for the log filesystem is then 3000/120 = 25.
So we will require about 25 disks just to handle the log devices. Now each disk is generally about 73GB nowadays, which means I would have to dedicate about 1,825GB just for the logs. How much space does the log actually require? Even if you tune checkpoint_segments to a high value like 512 or 1024 (which is quite high, with each segment about 16MB), that will still be roughly less than 36GB (or about half of one disk).
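The spindle and space math can be checked quickly (a sketch; the 120 iops/disk and 73GB/disk figures are the assumptions from the text, and the peak WAL footprint is approximated as roughly 2 × checkpoint_segments + 1 segments, which is in the right ballpark for 8.3):

```python
import math

TPS = 3000
IOPS_PER_DISK = 120     # conservative per-spindle write iops (write cache off)
DISK_GB = 73            # typical disk size assumed in the text

spindles = math.ceil(TPS / IOPS_PER_DISK)   # 25 disks for iops alone
raw_capacity_gb = spindles * DISK_GB        # 1825 GB dedicated to logs

# Approximate peak WAL footprint with checkpoint_segments = 1024:
# on the order of (2 * 1024 + 1) segments of 16MB each.
wal_gb = (2 * 1024 + 1) * 16 / 1024         # ~32 GB, i.e. under 36 GB
```

The point of the sketch: the iops requirement forces 1,825GB of raw disk while the logs themselves need only a fraction of one disk.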
It's still not over. Generally you will need protection for the log devices. If you do RAID 1+0, you end up with 50 disks. If you do RAID 5, then first of all you will need to make sure that the hardware controller can do 3000 parity calculations per second without really increasing your service times. At this point most people probably think I am losing my mind, since most people don't really allocate 50 disks to their logs. How do they then support the same transaction rates with far fewer spindles? Various technologies come to the rescue.
Let's look at the PostgreSQL 8.3 technologies that help us reduce the spindle count for the logs (and the risks, if any, associated with each):
1. full_page_writes: This option, a tunable in postgresql.conf, is turned on by default. With it on, PostgreSQL writes the entire 8K page to the log the first time that page is modified after a checkpoint; turning it off means only the data that actually changed is written. Say, for example, this reduces your average write from 8K to 2K; then the maximum throughput required drops from 24MB/sec to about 6MB/sec. Note, however, that this by itself does not reduce the 3000 iops done to the log filesystem. It also introduces the risk of a partial write: if a page is only partially written to the actual table/index storage when power fails, the information stored in the log will not be enough to recreate the page when the PostgreSQL server tries to recover the database after power is restored.
2. commit_delay: This option, a tunable in postgresql.conf, is set to 0 by default. Setting it to a positive number (>0) changes quite a bit about how Postgres writes to the logs. Say there are more than 5 transactions in flight (the default commit_siblings = 5) and one of them wants to commit. The database server reckons that other transactions are also likely to complete soon, so it takes a short sleep (equal to commit_delay, in microseconds) and lets other transactions that also want to commit get into the WAL buffers. Once it wakes up (or some other transaction forces a sync first), it writes all committed transactions in one go. The drawback is that individual commits are delayed (increasing their response time by the sleep duration), but the real advantage is that all those committing transactions now result in only 1 iop. This is a huge win. For example, say that on average under heavy peak load it is doing three transactions per sync; that reduces the requirement from 3000 iops to 1000 iops (which immediately reduces the need from 25 spindles to about 9). Also there is no risk to the database, and the delay to each transaction is typically close to the minimum sleep resolution the operating system supports. The more transactions combined within a slight delay, the lower the iops load on the logs.
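The group-commit savings work out as follows (a sketch; the "three transactions per sync" figure is the example average from the text, not a guaranteed outcome):

```python
import math

TPS = 3000
TX_PER_SYNC = 3         # example average commits grouped into one WAL sync
IOPS_PER_DISK = 120

log_iops = TPS / TX_PER_SYNC                    # 1000 iops instead of 3000
spindles = math.ceil(log_iops / IOPS_PER_DISK)  # 9 spindles instead of 25
```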
(With both 1 & 2 on: if only commit_delay is on, then even though the sync is a single iop, the transactions involved could be (and should be) operating on different pages, so it still has to write three 8KB pages, or 24KB, together. With full_page_writes off as well, it is probably much less than 24KB.)
3. synchronous_commit: PostgreSQL 8.3 adds yet another technology to reduce the load on log devices. When this option is set to off, most commits avoid a synchronous sync of the WAL buffers. A separate background process syncs on a timed basis, which allows more and more transactions to be combined into a single I/O operation. This in fact allows a further reduction from the 9 disks of the earlier case to maybe 6 or so. However, there is a risk associated with synchronous_commit. Unlike commit_delay, which delays the transaction, in this case the transaction is reported committed even though the WAL buffers are not yet flushed/synced to the device. Hence there is a chance of losing transactions (within a window of roughly 3x wal_writer_delay) even though they were reported committed. Still, compared to the many people who resort to the unsafe fsync = off, this option leaves the database in a consistent state after a power failure.
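Putting the three tunables together, a postgresql.conf fragment for a log-iops-constrained system might look like the sketch below. These are illustrative values matching the discussion above, not recommendations; tune them against your own workload and risk tolerance:

```
# postgresql.conf (PostgreSQL 8.3) -- trading latency/durability for fewer log iops
fsync = on                  # keep on; fsync = off risks an inconsistent database
full_page_writes = off      # smaller WAL writes, but risks partial page writes
commit_delay = 10           # microseconds to wait for sibling commits (example value)
commit_siblings = 5         # group commit considered above this many active tx
synchronous_commit = off    # commits may be lost within ~3x wal_writer_delay
wal_writer_delay = 200ms    # how often the background WAL writer flushes
```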
We have seen how existing technologies in PostgreSQL 8.3 can be leveraged to reduce the spindles required for the log device from 25 to 9, or even lower, to say 6. Now let's look at hardware technologies that help reduce the number of spindles required.
1. Write cache on the disk itself: Most modern disks now have a write cache on them. Turning it on can significantly increase the write iops those drives can sustain. However, even though the drive tells the application the data has been written (to improve service time), the data could still be sitting only in the cache, and hence be lost during a power failure, which could leave the database unusable.
2. Battery-backed write cache: This is similar to the write cache on the disk, except it is generally external and backed by a battery that can sustain power failures (depending on the length of the failure). It is a more expensive setup, but the cost is offset by the savings in spindles, since you may end up needing only 1 spindle (plus 1 more for the mirror). (NOTE: we are just talking about the log devices here.) One thing to note, however, is that writes from the cache to the disk still happen at "disk" speed. If your load comes in peaks, the cache can help out a lot; but if you have a consistent inflow of writes, the cache may eventually fill up and then have to wait for the disk to free up cache space. So it might still be worthwhile to have more spindles, depending on the cache size and how long the peak writes last.
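How long a battery-backed cache can absorb a burst is easy to estimate (a sketch; the cache size and per-spindle destage rate below are hypothetical example values, not measurements):

```python
CACHE_MB = 512          # controller cache available to the log LUN (assumed)
INFLOW_MB_S = 24        # sustained WAL write rate from 3000 tps (from above)
DRAIN_MB_S = 10         # rate the backing spindle can destage (assumed)

if INFLOW_MB_S > DRAIN_MB_S:
    # cache fills at the difference between inflow and drain;
    # bursts longer than this fall back to raw disk speed
    seconds = CACHE_MB / (INFLOW_MB_S - DRAIN_MB_S)
```

With these example numbers the cache rides out roughly half a minute of peak load, which illustrates why sustained write rates, not just peaks, decide the spindle count.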
After this long post, we have still only sized the PostgreSQL 8.3 logs.