Jignesh Shah's Blog

Thursday, November 18, 2010

sysbench, PostgreSQL 9.0 and OLTP complex read-write test

Continuing with my sysbench saga with PostgreSQL 9.0, I was generally not encountering any errors except recently. I found two differences on how I did execute this test in order to hit this problem. One is I used a relatively small number of rows (1 million) which forces the special distribution of sysbench clients to be smaller and a reasonably high number of threads (80) as follows

$ sysbench --db-driver=pgsql --oltp-dist-type=special --oltp-table-size=1000000 --oltp-read-only=off --oltp-test-mode=complex --max-requests=0 --max-time=300 --num-threads=64 --test=oltp run

sysbench 0.4.12: multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 80

Doing OLTP test.
Running mixed OLTP test
Using Special distribution (12 iterations, 1 pct of values are returned in 75 pct cases)
Using "BEGIN" for starting transactions
Using auto_inc on the id column
Threads started!
FATAL: query execution failed: 9490352
FATAL: database error, exiting...
Done.

And the test failed with error reported in pg_log:

ERROR: duplicate key value violates unique constraint "sbtest_pkey"
DETAIL: Key (id)=(500815) already exists.
STATEMENT: INSERT INTO sbtest values($1,0,' ','aaaaaaaaaaffffffffffrrrrrrrrrreeeeeeeeeeyyyyyyyyyy')

The way sysbench works for the complex read-write test transaction, after doing some bunch of selects and updates, it deletes a row and inserts the same row back. So in a transaction logic it should not hit this error since it just deleted the row. This took me some time to recreate it in a way that I could understand what is happening.

What's happening is the timing of the start of a query in transaction with respect to another transaction in flight
working with the same key-value row such that the other transaction just deleted the row and also inserted the same row back in the same transaction.

Lets consider two transactions A and B

Transaction A                      Transaction B

BEGIN;

                                   BEGIN;

DELETE FROM sbtest WHERE id=500815;

(returns DELETE 1)

INSERT INTO sbtest values(500815,0,'','aaaaaaaaaaffffffffffrrrrrrrrrreeeeeeeeeeyyyyyyyyyy');

(returns INSERT 0 1)

                                   DELETE FROM sbtest WHERE id=500815;< ------- hangs/waits

END;

(COMMIT)

                                   (returns DELETE 0 – returns success but doesn’t delete any rows . It doesn't roll back the transaction)

                                    INSERT INTO sbtest values(500815,0,'','aaaaaaaaaaffffffffffrrrrrrrrrreeeeeeeeeeyyyyyyyyyy');

                                    ERROR: duplicate key value violates unique constraint "sbtest_pkey"

                                    END;

                                    (ROLLBACK)

The way MySQL-InnoDB handles it is slightly different. It actually deletes the new row inserted and hence always can do the INSERT successfully and thats why sysbench with MySQL - InnoDB never showed that problem.

If you read the documentation on PostgreSQL for READ COMMITTED Isolation level, it says: "In effect, a SELECT query sees a snapshot of the database as of the instant the query begins to run". So had the DELETE actually started after the END of the first transaction it would delete the new INSERTed row.I am not sure how other databases (ORACLE, DB2, etc) behave in this scenario.

As one of my co-worker mentioned "Correctness is in the Implementation of Beholder", it is hard to say who is right or wrong..

Just another one of the minor differences between MySQL and PostgreSQL.

Wednesday, November 03, 2010

PostgreSQL in Virtual Environment - PGWest 2010

Yesterday I had my presentation at PgWest 2010 at San Francisco.

If you have questions on the presentation please feel free to leave comments.

Monday, October 04, 2010

Postgres 9 and sysbench 0.4.12

Ever since the release of Postgres 9 I was hoping for a chance to do some performance tests on it. I finally got around to try it out. This time I tried Postgres running in a virtual machine on top of VMware's vSphere Infrastructure 4.1.

Since there were plenty of sysbench numbers available for mysql, I thought it was time to try it with sysbench. The version I tried was sysbench 0.4.12. Compiling sysbench 0.4.12 for PostgreSQL itself was a huge challenge that it deserves a separate blogpost but for now lets start at the point where I got a working sysbench binary that I run it from a separate virtual machine against my Postgres database server.

The first step to do a sysbench oltp was to do the "prepare" step where the benchmark tool creates the table and loads rows into the table. I saw many mysql tests running with 1 million rows or some with 10 million rows.

So anyway I started with 1 million rows with similar options. The load started and went on and went on for a while.. (I should have taken a lunch break instead of waiting for it to finish.) It literally finished after some 20 odd minutes. I found that odd. So I downloaded MySQL 5.5 rc binaries and created a mysql database and compiled sysbench to run against MySQL and started the same "prepare" step with 1 million rows.

The "prepare" step finished for MySQL in an amazing 30 seconds or so. My first reaction was MySQL is damn fast than Postgres. As a Postgres lover I found that hard to believe. There's got to be some other problem. I added the --debug flag to sysbench and I saw the insert statements that sysbench did to mysql. When I used the same debug flag for Postgres version it did not show me any statements that it used.. That was no help.

Anyway the Postgres was running on a version of the "late" OpenSolaris and used the query_time_breakdown.d script from pgfoundry dtrace scripts to see that the query that it was using to insert rows. I noticed that it was doing multi-row inserts on MySQL and single-row inserts on Postgres.

Now came the hard part of reading through the sysbench code which took me some while to understand its layout and how it uses the database driver. To cut the long story short, I figured sysbench uses "drv_caps_t" structure to record the capabilities of the database drivers. The MySQL driver enables multi-row inserts while the Postgres driver had multi-row support disabled.

Checking the documentation of Postgres I saw multi-row inserts in a INSERT statement is allowed.

So I modified sysbench/drivers/pgsql/drv_pgsql.c and changed

/* PgSQL driver capabilities */
static drv_caps_t pgsql_drv_caps =
{
0,
1,
1,
0,
0,
1,
0,
NULL
};

to the following

/* PgSQL driver capabilities */
static drv_caps_t pgsql_drv_caps =
{
1,
1,
1,
0,
0,
1,
0,
NULL
};

The resulting difference that it now loads 1000 rows in each INSERT statement (like it does for MySQL)

As for the difference it makes on the "prepare" operation:
$ time ./sysbench --pgsql-user=pguser --pgsql-password=xxxxxx --pgsql-db=sbtest --pgsql-host=pgdb --db-driver=pgsql --oltp-dist-type=special --oltp-table-size=1000000 --oltp-read-only=off --oltp-test-mode=complex --max-requests=0 --init-rng --max-time=300 --num-threads=96 --test=oltp --debug prepare

sysbench 0.4.12: multi-threaded system evaluation benchmark
Creating table 'sbtest'...
NOTICE: CREATE TABLE will create implicit sequence "sbtest_id_seq" for serial column "sbtest.id"
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "sbtest_pkey" for table "sbtest"
Creating 1000000 records in table 'sbtest'...

real 17m47.695s
user 0m0.328s
sys 0m0.700s

to:

$ time ./sysbench --pgsql-user=pguser --pgsql-password=xxxxxx --pgsql-db=sbtest --pgsql-host=pgdb --db-driver=pgsql --oltp-dist-type=special --oltp-table-size=1000000 --oltp-read-only=off --oltp-test-mode=complex --max-requests=0 --init-rng --max-time=300 --num-threads=96 --test=oltp --debug prepare

sysbench 0.4.12: multi-threaded system evaluation benchmark
Creating table 'sbtest'...
NOTICE: CREATE TABLE will create implicit sequence "sbtest_id_seq" for serial column "sbtest.id"
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "sbtest_pkey" for table "sbtest"
Creating 1000000 records in table 'sbtest'...

real 0m36.237s
user 0m0.180s
sys 0m0.064s

This looks much better and acceptable. Now I can proceed to the real sysbench testing.
If somebody knows the maintainer of the sysbench, can you please point him to this blog article to get the drv_pgsql.c patched up to be similar to MySQL driver.

Sunday, June 20, 2010

Virtualization and Postgres

It will be soon couple of months for me at VMware. I have been learning a lot of VMware and Virtualization. Often time I try to view it from the point of view of a database like Postgres. To say the least it has been a learning experience.

In the most simplest form of understanding, virtualization (not to be confused with simulation or emulation) is a set of resources. The primary resources are CPU, Memory, Disk and Network. Then there are others but most others are not that interested for majority of the database users (or alteast for me). For a java programmer this is more like an interface which can be implemented by a class but retains its properties. One cannot use it directly but you use a class typically implementing that interface. Similarly for virtualization while you think only in terms of this interface resources it is actually implemented by the underlying physical resources of the system. Hence it is not to be confused with "emulation".

That said one would think that they need not think of it as a "special" case in their deployment. This is where expectations and reality start to diverge and we could have a case of finger pointing. To set the right expectations for virtualization one needs to understand the philosophy behind virtualization which I guess is what I have been doing recently. It is more like soul-searching in some sense. From what I get the philosophy behind virtualization is actually efficient and simplified utilization of resources. Again lets go back to our primary resources CPU, Memory, Disk, Network. What virtualization tries to achieve is efficient utilization of these resources beyond one operating system. I think everyone understands the efficient utilization part. The clause that becomes tricky is "beyond one operating system". This clause is like the last minute change included by some smart lawyer which no one will understand till some years letter when another legal review is done. :-)

This clause is what is causing many operating systems, applications like database to rethink their behavior of "My world is the only world here". The details of these behaviour are actually too long to list but to give an idea, an idle operating system throws enough tick interrupts to keep the clock up to date that few of these operating systems while doing nothing can keep a core busy just to do these time updates. Occassionally it can happen that the interrupt that does not happen in time which can cause time drifting in operating systems.

Similarly my experience with databases and shared memory is often the databases just pin the max memory we expect the database will use at peak loads. So even though it may be say 10-20% of the case that peak utilization will occur but memory has been fully allocated and pinned (marked "unusable" for others) which may not be the most efficient way of using that resource. (Primary true in my past life where the immediate thing we do is use Intimate Shared Memory on Solaris so nothing else can use it.) Other thing that can come to mind is spin locks where CPU cylces are forced to spin to do nothing but wait, network polling to see if there is more incoming data and many other operations.

Plus with the isolation provided a person within an operating system now has no idea what other things could be using the same resources that the OS is using which makes it more difficult for both the administrator and the Virtualization professional to reach some common understanding. :-)

Anyway this is probably my first post on this topic in relation with Postgres. As I learn more things about Virtualization, you can expect me to share my learning of the impact of Virtualization on Postgres and vice versa. In the mean while if you have other "bugging" questions related to Postgres and Virtualization do let me know and maybe I can test it out on my spare time.

Saturday, May 08, 2010

Choosing stripsize in LUN or recordsize in ZFS for PostgreSQL

In every database deployment, you have to request storage for the database. Typically in enterprises or cloud deployments, the requests goes to Storage Administrator who will ask you basics of the level of protection and probably some needs of the database if you are lucky. Many times they just give you a LUN or a filesystem (either based on NFS or something else). You have no idea whether that storage will suit your needs for the database or not. In this blogpost, lets even ignore the protection requirements to get the best performance out of the LUN.. RAID-0 typically is the best performing level. Some say RAID-1 is better for reads but I still havent enough practical cases to say yes that's the case.

But even in RAID-0 I have seen charts all over the place. The Storage Administrator has no idea (unless you tell him/her) what the database will be doing on the LUN. Typically all RAID has what's known as the stripsize which in some ways is the blocksize that it uses to do a logical break and go to the next spindle. For example for a RAID-0 over 8 disks with a stripsize of 8K. Every write you do will be divided in terms of 8K and then will spread the writes over those 8 disks. Its is typically assumed that you can write every hard disk simultaneously so in this case for the same latency of writing to a single disk now you can do 8K x 8 or 64K simultaneously which typically will be less than writing 64K to the same spindle. (64K is also a typical stripsize available.) So why do we need to worry about the stripsize, we should select the smallest stripsize and just forget about it. Well for many arrays smallest strip size available is native block of the disk which is 512 Bytes to a typical max of 256KB or 512KB.

Two reasons to worry about it:

1. Modern disks technologies including new SSDs or Flash technology basically are many times optimized for a block size (including flash technologies). In some flash based storage technologies, the write latency is same whether you write 512B or 4K. In such cases basically you are creating more work by writing smaller than that size. So obviously the smallest stripsize is not the most optimum.

2. Also there is a penalty of using the small strip size when you consider the throughput case. What is the throughput case? Throughput case is that you are really trying to get large amoun of data. How's it limited? Well it is limited by IOPS. Every disk has a working max range of how many IO operations it can sustain per second. Irrespective of what manufacturers say, the real random limit is more like anywhere from 100 IOPS to 160 IOPS per second with SATA on the real slow side, SAS somewhere in between and the Fiber Channel based disks on the high side.

So lets say you can get about 150 IOPS per disk in your datacenter then your maximum sustainable throughput that can get from a stripsize of 4K is 4K x 150 x 8 which is like 4800KB/sec or abour 4.6 MB/sec.. Pretty slow. Now increasing the stripsize to 8K gives you double the througput and if you use the bigger stripsize like 128K you get about 150MB/sec and 512kb (typical max stripsize) allows you 600MB/sec. Ofcourse by this time you could already be limited by your bus technology. Both SAS/SATA have 3 Gig links which means realistically the max you can get is about 300-350MB/sec.
So in this case 256KB is a good choice for systems which needs to move large amount of data like Data Warehousing/DSS.

But again remember the throughput comes at a cost. What is that cost? Latency. Time taken to read write increases as you increase the size of the IO. Repeating again the disks or flash technologies are optimized at a blocksize. If you go beyond that size the latency could increase linearly or sometimes exponentially (if it is a bad design). So you have to know your disks properly and make the right selection. For OLTP which is very sensitive to read and write operations for every transactions, an increase in latency by 100% can cause a drop of transactions by 50% (depending on the workload) and hence can cause dramatic impact. For such database deployments, everything to cut down latency helps.

So your storage can right now either do the best of OLTP layout or the best of throughput layout which is typically used by Data Warehousing or so-called Decision Support Systems. Many people sometimes do a trade-off to be somewhere in between and hence end up selecting stripsize somewher in between. I have seen 64KB/128KB as common stripsize where they try to do a balance. So instead of making one group of people using the database unhappy, this tends to make all groups of people unhappy. But sometimes due to the nature of the usage of the database and cost, that's probably the right way to go. Though I would recommend actually figuring out the various databases and have multiple LUNS of various stripsize available to the database so that the DBA can make righ decisions based on the usage.

The setting in ZFS called recordsize is similar to this stripsize except for the fact that its range is only from 8KB to 128KB. The default is 128KB for a reason that it can do a good tradeoff for all uses. However if you are going to do only OLTP type transactions, then changing that to 8KB is very helpful. The drawback is as soon you run any long running query which requires to scan through large amount of data and if you have 8KB recordsize you will be limited by the throughput as explained above.

What is really needed is dynamic stripsize or recordsize so it adapts to the write being done. However again the reads will be then dependent on the write stripsize which may not be case that you want. Most DSS/DW systems reads from OLTP systems so those will still be impacted if you have dynamic stripsize for writes which will be small if done through OLTP system. Hence an engineering problem to solve to remove dependence of these various stripsizes and record size.. Hey even databases have block sizes maybe even these blocksizes need to be adaptive :-)

Of course questions for the reader is whether the blocksize of the database should be equal to strip size or equal to N x stripsize where N is the number of "effective" disks in the array/LUN.
The real answer is it depends on the database technology.

Monday, May 03, 2010

Now at VMware Inc

It looks like I haven't blogged for a while. Thing have been busy. I am now working at VMware Inc. I will primarily be doing performance work. Hopefully I can still continue blogging here about Postgres and other Open Source technologies.

Thursday, April 15, 2010

My Last Day and Blog Post at Sun

I am getting close to finish my 10th year at Sun Microsystems, Inc now part of Oracle America, Inc. However before I finish the year, it is time for me to move on with my new ventures.

Today is my last day in Sun/Oracle.

My new coordinates http://jkshah.blogspot.com and my LinkedIn Profile.

Stay in touch.

Tuesday, March 02, 2010

9.0devel and PGMonitor

Using the latest CVS head, just upgraded my home setup to 9.0devel on OpenSolaris 2009.06

$ psql postgres

psql (9.0devel)

Type "help" for help.

postgres=#

Also I upgraded the PostgreSQL Monitor to use 8.4 JDBC4 driver instead of the 8.3 JDBC3 driver. Hopefully I will get some more time to enhance the tool itself.

Friday, February 05, 2010

Building latest PostgreSQL CVS Head on OpenSolaris

With the talk about PostgreSQL 9.0 alpha 5, I thought it is time for me to try out another CVS head build on OpenSolaris. Of course this time this was on my home desktop which runs OpenSolaris 2009.06
I wanted to download the CVS head. The instructions are there on the wiki page. However before following it I needed to install the CVS on my OpenSolaris instance.

#pkg install SUNWcvs

The using the instructions I created a copy of the cvs repository and created my own project workspace.

I already had the Sun Compilers on my setup. (If not or have an old copy then you can always install or upgrade it as

# pkg install sunstudio

)
So I started my task of creating the new binaries on OpenSolaris. I found it to be bit bumpy.
Here is my configure options with my standard options.

$ ./configure --prefix=$HOME/project CFLAGS="-xO3 -xarch=native -xspace -W0,-Lt -W2,-Rcond_elim -Xa  -xildoff -xc99=none -xCC" --without-readline

However then I hit my first problem to do make.
Need to do using GNU make which was not installed on my desktop.
Back to pkg manager

# pkg install SUNWgmake

Continuing again with make which proceeded and then eventually stopped again due to missing bison. (I wonder why the "configure" script did not catch that?)

# pkg install SUNWbison

Anyway I started to run make again now that I have installed bison. Strangely it failed again.
Figured it still did not find bison. I had to use ./configure statement again and tried gmake after that which allowed gmake to pick up bison.

However it failed again.

gmake[3]: Entering directory `/export/home/postgres/project/pgsql.project/src/backend/parser'
/usr/bin/bison -d -o gram.c gram.y
gmake[3]: *** [gram.c] Broken Pipe
gmake[3]: Leaving directory `/export/home/postgres/project/pgsql.project/src/backend/parser'
gmake[2]: *** [parser/gram.h] Error 2

This one was not easy to solve. I thought that probably the bison was buggy and was about to give up. Then I thought I will give it a shot using truss to figure out what is happening

$ truss -f /usr/bin/bison -d -o gram.c gram.y 2> /tmp/bisontruss.txt

Going through that file bisontruss.txt I found:

  fcntl(6, F_DUP2FD, 0x00000001)                  = 1

  close(6)                                        = 0

  execve("/usr/sfw/bin/gm4", 0x08047D00, 0x08047DA0) Err#2 ENOENT

I had no clue what gm4 does, but it is missing.
I used the search feature of OpenSolaris to see if there is a package associated with it.

# pkg search gm4

INDEX      ACTION    VALUE                     PACKAGE

.....
basename link usr/sfw/bin/gm4 pkg:/SUNWgm4@1.4.2-0.111

As they say on TV : Yep, there's a pkg for that

# pkg install SUNWgm4

And now back to gmake

Darn another package missing:

gmake[3]: Entering directory `/export/home/postgres/project/pgsql.project/src/backend/bootstrap'

***

ERROR: `flex' is missing on your system. It is needed to create the

file `bootscanner.c'. You can either get flex from a GNU mirror site

or download an official distribution of PostgreSQL, which contains

pre-packaged flex output.

***

# pkg search -r flex

INDEX      ACTION    VALUE                     PACKAGE

....

basename   link      usr/sfw/bin/flex          pkg:/SUNWflexlex@2.5.33-0.111

# pkg install SUNWflexlex

(Of course gmake wont use it immediately till you use the configure statement again)

Finally after a long time (it sure seemed to take a long time) the gmake seemed to hang on preproc.c compilation:

"preproc.y", line 13548: warning: line number in #line directive must be less than or equal to 32767

This is just not my day.
I did a cleanup of the make. At this point I found that I forgot to enable dtrace probes in my configure statement and also since I am using 64-bit kernel, decided to build 64-bit binaries. Retrying with the slightly modified configure still did not solve the loop problem in preproc.y which just runs 100% on the CPU. I left it running for a long while just to see if it finished (till my patience runs out) It did not finish.

I finally updated my sunstudio binaries and retired gmake. This time it succeeded in finishing the gmake. (Hence I changed my wordings above to reflect install/upgrade sunstudio).

After that a quick gmake install

$ gmake install

and the binaries are ready.
A quick check reflects

$ initdb

$ pg_ctl start -l server.log

$ psql postgres

psql (8.5devel)

Type "help" for help.

postgres=#

For a second I was expecting to see psql(9.0develop) but this is still acceptable. :-)

In Summary, you want to do the following before building the latest PostgreSQL source on OpenSolaris

# pkg install SUNWcvs sunstudio SUNWgmake SUNWbison SUNWgm4 SUNWflexlex

and the configure script

$ ./configure --prefix=$HOME/project CFLAGS="-m64 -xO3  -xarch=native -xspace -W0,-Lt -W2,-Rcond_elim -Xa  -xildoff -xc99=none -xCC" --without-readline --enable-dtrace DTRACEFLAGS="-64"

And with gmake I am ready to use the latest build of PostgreSQL on OpenSolaris.

Building latest PostgreSQL on OpenSolaris

I am moving my PostgreSQL on OpenSolaris realted entries to a new external blog. Since it is not part of my $dayjob anymore. Hope you update your bookmarks too.

Read "Building latest PostgreSQL CVS Head on OpenSolaris".

Friday, January 29, 2010

PostgreSQL related posts move here from now on

With changes in my employer's blogging policies I will have to move PostgreSQL related talks here. This gives me some freedom about my posts related to PostgreSQL.

However my old PostgreSQL posts are still available at http://blogs.sun.com/jkshah/

Wednesday, January 20, 2010

Sun rules in PeopleSoft NA Payroll Benchmarks

Sun previously had published record numbers with PeopleSoft NA Payroll 240K benchmark using Sun F5100 Flash Storage Array . However Giri just announced a follow-on benchmark which beats even that number using twice the number of streams using the Sun Enterprise M4000 server and Sun F5100 Flash Storage Array.

Wednesday, January 13, 2010

New Year, New Role and New Build

Happy New Year. Its a new year and I have started a new role in Applications Integration Engineering which is part of the Sun Storage 7000 - Unified Storage Systems group. AIE's main charter is to integrate ISV products with Sun Storage 7000 family. I hope to continue working with databases and other applications and specially how it interacts and integrates with the FISHworks based products. Years ago, interestingly, I don't think I would have recommended NFS to be used with any database application. But looks like it is now way more stabilized in its current form. Then there is also iSCSI. But there is yet another way to connect to these systems soon which I think is more attractive to me and maybe even other database folks at large. More about that when time is right.

Anyway with the new role, I thought it was time to update my existing OpenSolaris (b128a) to the latest OpenSolaris build 130. I must admit this has been the first OpenSolaris upgrade which was not as smooth as expected. First things first I got hit with bug with the naming of /dev repository. I first heard about the bug from George Drapeau but even though I worked around it I could still not upgrade to the latest build. Then I heard from Mandy about the problem that if I had ever installed from /contrib repository I could still not upgrade to the latest build with the changed /dev name. I uninstalled all the software from /contrib and crossing my fingers the pkg image-update command still failed. Of course I then realized I probably had couple of packages from the /pending repository and even the Sun /extra repository. Uninstalling all the extra software was not fun but still the darn thing did not upgrade. Finally gave up and read about this forced upgrade using -f as follows

# pkg image-update -f

and it worked. It started downloading the updates and finally created a new boot environment with the new build.

However the reboot to the new environment just stuck at the graphical boot with the orange bar going from left to right. After 5 minutes I killed the power and rebooted and this time used "e" on the grub menu and deleted the splashfile, foreground and background lines and changed the kernel boot line from console=graphics to console=text and pressed "b" to boot using the modified grub entry. I figured out that the X server refused to start. Cutting a long story short (it actually took me almost a day) to figure a simple solution, re-move my custom /etc/X11/xorg.conf (which I was forced to create few upgrades (b111a) ago) so the X server can use its new defaults to start without any problems.

Of course that worked till I got the login and when I entered my login information, I ended with a white screen. Arrg yet another bug. Reading the mailing list got the following solution

$ pfexec /usr/sbin/unlink /usr/lib/xorg/modules/extensions/GL
$ pfexec ln -s ../../../../../var/run/opengl/server /usr/lib/xorg/modules/extensions/GL

With the above changes, finally rebooting the desktop into fresh working build 130 of OpenSolaris and I was ready to try out the new Thunderbird 3.0 and Firefox 3.5. Of course AWN (the mac like dock) worked for most part but the dock preferences refused to start. I did file a bug and it seems that it will be fixed in b131 but the quick fix is to edit

/usr/bin/awn-manager and replace the first line

#!/usr/bin/python

#!/usr/bin/python2.6

and that should allow you to see your AWN dock preferences once again.

If you ask me was it worth all the pain to upgrade to this new version. My simple answer is yes

Few thing fixed for me:

The new login screen is much nicer (in last few builds I could hardly read what I was typing in the login name text field on a widescreen monitor.

On build 128a I saw that the screen saver unlock screen was taking a long time to respond which seems to have gone away with this build.

I like the full text search capabilities of Thunderbird 3.0

Of course your reasons may be different then mine to upgrade and who knows build 131 might be out soon in next week or two then it probably might be a smoother upgrade if you can wait for it. (I can't.)

Monday, October 12, 2009

Accelerate your Payroll Performance with F20 PCIe Card

I guess you already heard about Sun Storage F5100 Flash Array and its world record benchmarks.

But it's not F5100 that I am going to talk about but its smaller sibling called Sun Flash Accelerator F20 PCIe Card. The name is a mouthful like all Sun product names so I will just call it "The Accelerator Card" in the remainder of this blog entry. Of course the idea is not to start with the answer and find a problem with it. But I am going to narrate is how we saw a problem and then thought of using this answer to solve the problem.

Recently our group ISV-E was doing our standard thing of making applications run best on Sun. In this particular project with PeopleSoft Enterprise 9.0 on M5000 system using Sun Storage 6540, we encountered a problem that certain batch jobs where taking a long time to execute. Peoplesoft Enterprise 9.0 actually have ways to breakup jobs and run them in parallel so as to use the multi-core of the multi-processor system. But yet we could not really leverage the system enough to be satisfactory. In this project they were using Oracle Database 11g. I got to give it to Oracle, they do have good tools. We used Oracle Enterprise Manager and saw for the troubled batch process, it was showing lot of blue color in its output.

Also looking at the top Objects, the tool reported which tables and index were troublesome which was causing that amount of blue appear in the chart. This "Blue" problem is what led us to an idea to test out the Accelerator Card in the system and see if can help out here. What we did was created a few tablespaces and spread them out on the four Flash Modules on the Accelerator Card and moved the highly active (or "hot" ) tables and indices to the newly created tablespace. What we saw was simply huge reduction in the blue area and more green. That lead to the slogan in our team

"Go Green with the Accelerator Card !"

The Accelerator card not only reduced the time on this process but many other batch processes which had high IO components. Here is a relative comparison of how it helped (with additional slight boost from upgrading SPARC64 VII from 2.4Ghz to 2.53Ghz CPUs).

Of course the next question is what if you take the same thing to its bigger sibling, Sun Storage F5100 Flash Array, well that's exactly what we did and as they say the rest is history.(Hint: Read the world records link and search for PeopleSoft) For more information check out Vince's blog entry on PeopleSoft Enterprise Payroll 9.0 NA and also Why Sun Storage F5100 is good for PeopleSoft 9.0 NA Payroll application.

Truly if you use Oracle and use Oracle Enterprise Manager to monitor your application performance and are turning blue by seeing lot of Blue area in the chart then just remember

"Go Green with the Accelerator Card !"

Monday, September 14, 2009

Infobright Tuning on OpenSolaris/Solaris 10

Recently I was working on a project which used Infobright as the database. The version tested was 3.1.1 both on OpenSolaris as well as Solaris 10. Infobright is like a column-oriented database engine for MySQL primarily targeted towards data warehouse, data mining type of project deployments.

While everything was working as expected, one thing we did notice that as number of concurrent connections tried to query against the database we noticed that queries deteriorated fast in the sense that not much parallel benefits were being squeezed from the machine. Now this sucks! (apparently sucks is now a technical term). It sucks because the server has definitely many cores and typically each Infobright query still can at the max peg a core. So the expectation will be typically to atleast handle concurrent queries which is close to the number of cores (figuratively speaking though in reality it depends).

Anyway we started digging into this problem. First we noticed that CPU cycles were heavy so IO was probably not the culprit (in this case). Using plockstat we found

# plockstat -A -p 2039    (where 2039 is the PID of mysqld server running 4 simultaneous queries)

^C
Mutex hold

Count     nsec Lock                         Caller
-------------------------------------------------------------------------------
3634393     1122 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_Znwm+0x2b
3626645     1047 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_ZdlPv+0xe
2 536317885 0x177b878                    mysqld`_ZN7IBMutex6UnlockEv+0x12
12  6338626 mysqld`LOCK_open             mysqld`_Z10open_tableP3THDP13st_table_listP11st_mem_rootPbj+0x55a
9057     1275 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_Znwm+0x2b
8493     1051 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_ZdlPv+0xe
7928     1119 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_ZdlPv+0xe
5   326542 0x177b878                    mysqld`_ZN7IBMutex6UnlockEv+0x12
683     1189 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_Znwm+0x2b
564     1339 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_Znwm+0x2b
564     1274 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_Znwm+0x2b
564     1156 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_ZdlPv+0xe
17    36292 0x1777780                    mysqld`_ZN7IBMutex6UnlockEv+0x12
2   246377 mysqld`rccontrol+0x18        mysqld`_ZN7IBMutex6UnlockEv+0x12
57     8074 mysqld`_iob+0xa8             libstdc++.so.6.0.3`_ZNSo5flushEv+0x30
218     1479 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_Znwm+0x2b
4    78172 mysqld`rccontrol+0x18        mysqld`_ZN7IBMutex6UnlockEv+0x12
4    75161 mysqld`rccontrol+0x18        mysqld`_ZN7IBMutex6UnlockEv+0x12
….

R/W reader hold

Count     nsec Lock                         Caller
-------------------------------------------------------------------------------
44     1171 mysqld`THR_LOCK_plugin       mysqld`_Z24plugin_foreach_with_maskP3THDPFcS0_P13st_plugin_intPvEijS3_+0xa3
12     3144 mysqld`LOCK_grant            mysqld`_Z11check_grantP3THDmP13st_table_listjjb+0x38c
1    14125 0xf7aa18                     mysqld`_ZN11Query_cache21send_result_to_clientEP3THDPcj+0x536
1    12089 0xf762e8                     mysqld`_ZN11Query_cache21send_result_to_clientEP3THDPcj+0x536
2     1886 mysqld`LOCK_grant            mysqld`_Z11check_grantP3THDmP13st_table_listjjb+0x38c
2     1776 mysqld`LOCK_grant            mysqld`_Z11check_grantP3THDmP13st_table_listjjb+0x38c
1     3006 mysqld`LOCK_grant            mysqld`_Z11check_grantP3THDmP13st_table_listjjb+0x38c
1     2765 mysqld`LOCK_grant            mysqld`_Z11check_grantP3THDmP13st_table_listjjb+0x38c
1     1797 mysqld`LOCK_grant            mysqld`_Z11check_grantP3THDmP13st_table_listjjb+0x38c
1     1131 mysqld`THR_LOCK_plugin       mysqld`_Z24plugin_foreach_with_maskP3THDPFcS0_P13st_plugin_intPvEijS3_+0xa3

Mutex block

Count     nsec Lock                         Caller
-------------------------------------------------------------------------------
2175 11867793 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_ZdlPv+0xe
1931 12334706 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_Znwm+0x2b
3 93404485 libc.so.1`libc_malloc_lock   mysqld`my_malloc+0x32
1    11581 libc.so.1`libc_malloc_lock   mysqld`_ZN11Item_stringD0Ev+0x49
1     1769 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_ZnwmRKSt9nothrow_t+0x20
..

Now typically if you see libc_malloc_lock in a plockstat for a multi-threaded program then it is a sign that the default malloc/free routines in libc is the culprit since the default malloc is not scalable enough for a multi-threaded program. There are alternate implementations which are more scalable than the default. Two such options which are already part of OpenSolaris, Solaris 10 are libmtmalloc.so and libumem.so. They can be forced to be used instead of the default without recompiling the binaries by preloading anyone of them before the startup command.

In case of the 64-bit Infobright binaries we did that by modifying the startup script mysqld-ib and added the following line just before invocation of mysqld command.

LD_PRELOAD_64=/usr/lib/64/libmtmalloc.so;
export LD_PRELOAD_64

What we found was now the response times for each query was more in-line as it was being executed on its own. well not true entirely but you get the point. For a 4 concurrent queries we found that it had improved from like 1X to 2.5X reduction in total execution time.

Similary when we used libumem.so we found the reduction more like 3X when 4 queries were executing concurrently.

LD_PRELOAD_64=/usr/lib/64/libumem.so;
export LD_PRELOAD_64

Definitely something to use for all Infobright installations on OpenSolaris or Solaris 10.

In a following blog post we will see other ways to tune Infobright which are not as drastic as this one but still buys some percentage of improvements. Stay tuned!!