Saturday, May 08, 2010

Choosing stripsize in LUN or recordsize in ZFS for PostgreSQL

In every database deployment, you have to request storage for the database. Typically in enterprises or cloud deployments, the requests goes to Storage Administrator who will ask you basics of the level of protection and probably some needs of the database if you are lucky. Many times they just give you a LUN or a filesystem (either based on NFS or something else). You have no idea whether that storage will suit your needs for the database or not. In this blogpost, lets even ignore the protection requirements to get the best performance out of the LUN.. RAID-0 typically is the best performing level. Some say RAID-1 is better for reads but I still havent enough practical cases to say yes that's the case.

But even in RAID-0 I have seen charts all over the place. The Storage Administrator has no idea (unless you tell him/her) what the database will be doing on the LUN. Typically all RAID has what's known as the stripsize which in some ways is the blocksize that it uses to do a logical break and go to the next spindle. For example for a RAID-0 over 8 disks with a stripsize of 8K. Every write you do will be divided in terms of 8K and then will spread the writes over those 8 disks. Its is typically assumed that you can write every hard disk simultaneously so in this case for the same latency of writing to a single disk now you can do 8K x 8 or 64K simultaneously which typically will be less than writing 64K to the same spindle. (64K is also a typical stripsize available.) So why do we need to worry about the stripsize, we should select the smallest stripsize and just forget about it. Well for many arrays smallest strip size available is native block of the disk which is 512 Bytes to a typical max of 256KB or 512KB.

Two reasons to worry about it:

1. Modern disks technologies  including new SSDs or Flash technology basically are many times optimized for a block size (including flash technologies). In some flash based storage technologies, the write latency is same whether you write 512B or 4K. In such cases basically you are creating more work by writing smaller than that size. So obviously the smallest stripsize is not the most optimum.

2. Also there is a penalty of using  the small strip size when you consider the throughput case. What is the throughput case? Throughput case is that you are really trying to get large amoun of data. How's it limited? Well it is limited by IOPS. Every disk has a working max range of how many IO operations it can sustain per second. Irrespective of what manufacturers say, the real random limit is more like anywhere from 100 IOPS to 160 IOPS per second with SATA on the real slow side, SAS somewhere in between and the Fiber Channel based disks on the high side.

So lets say you can get about 150 IOPS per disk in your datacenter then your maximum sustainable throughput that can get from a stripsize of 4K is 4K x 150 x 8  which is like 4800KB/sec or abour 4.6 MB/sec.. Pretty slow. Now increasing the stripsize to 8K gives you double the througput and if you use the bigger stripsize like 128K you get about 150MB/sec and 512kb (typical max stripsize) allows you 600MB/sec. Ofcourse by this time you could already be limited by your bus technology. Both SAS/SATA have 3 Gig links which means realistically the max you can get is about 300-350MB/sec.
So in this case 256KB is a good choice for systems which needs to move large amount of data like Data Warehousing/DSS.

But again remember the throughput comes at a cost. What is that cost? Latency. Time taken to read write increases as you increase the size of the IO. Repeating again the disks or flash technologies are optimized at a blocksize. If you go beyond that size the latency could increase linearly or sometimes exponentially (if it is a bad design). So you have to know your disks properly and make the right selection. For OLTP which is very sensitive to read and write operations for every transactions, an increase in latency by 100% can cause a drop of transactions by 50%  (depending on the workload) and hence can cause dramatic impact. For such database deployments, everything to cut down latency helps.

So your storage can right now either do the best of OLTP layout or the best of throughput layout which is typically used by Data Warehousing or so-called Decision Support Systems. Many people sometimes do a trade-off to be somewhere in between and hence end up selecting stripsize somewher in between.  I have seen 64KB/128KB as common stripsize where they try to do a balance. So instead of making one group of people using the database unhappy, this tends to make all groups of people unhappy. But sometimes due to the nature of the usage of the database and cost,  that's probably the right way to go. Though I would recommend actually figuring out the various databases and have multiple LUNS of various stripsize available to the database so that the DBA can make righ decisions based on the usage.

The setting in ZFS called recordsize is similar to this stripsize except for the fact that its range is only from 8KB to 128KB. The default is 128KB for a reason that it can do a good tradeoff for all uses. However if you are going to do only OLTP type transactions, then changing that to 8KB is very helpful. The drawback is as soon you run any long running query which requires to scan through large amount of data and if you have 8KB recordsize you will be limited by the throughput as explained above. 

What is really needed is dynamic stripsize or recordsize so it adapts to the write being done. However again the reads will be then dependent on the write stripsize which may not be case that you want. Most DSS/DW systems reads from OLTP systems so those will still be impacted if you have dynamic stripsize for writes which will be small if done through OLTP system. Hence an engineering problem to solve to remove dependence of these various stripsizes and record size.. Hey even databases have block sizes maybe even these blocksizes need to be adaptive :-)

Of course questions for the reader is whether the blocksize of the database should be equal to strip size or equal to  N x stripsize where N is the number of "effective" disks in the array/LUN.
The real answer is it depends on the database technology.

1 comment:

Anonymous said...

I have definitely seen the 8KB recordsize on ZFS cripple performance, usually on DSS type systems. I think even on OLTP, with Postgres pushing towards 8KB recordsize can have negative effects, due to the lack of true covering indexes in Postgres (which leads to more large scans than you'd like).