
Sunday, November 21, 2010

PostgreSQL 9.0 Simple Select Scaling using sysbench

While I still haven't found a working way to run the sysbench read-write test, I decided to continue sysbench testing with PostgreSQL 9.0. This time I selected oltp-test-mode=simple, which essentially does a SELECT lookup on a primary key. The basic idea is to see how far you can scale when you are only doing SELECT statements.
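For reference, a run in this mode looks something like the following; the table size and thread count here are only illustrative, since the number of threads was varied across runs:

$ sysbench --db-driver=pgsql --oltp-test-mode=simple --oltp-table-size=1000000 --max-requests=0 --max-time=300 --num-threads=48 --test=oltp run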

Here are my initial results with PostgreSQL 9.0



This is still a work in progress in some ways, since this test has never been done before with PostgreSQL 9.0 (or at least I haven't seen anybody publish anything around it before).

Anyway, my real goal is not just to put the numbers here but to understand what is happening. Around 40-48 clients we seem to peak out and hit a wall as far as scaling goes. The statement is a pretty simple select from sbtest on a primary key which is randomly generated.
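In other words, each client just keeps firing a point lookup roughly of this shape, with the key bound to a random value:

SELECT c FROM sbtest WHERE id = $1;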

This is where I miss DTrace and OpenSolaris. Now that I am working on Linux, I thought of trying it out with systemtap. Setting up a working systemtap was itself a big challenge and took me some time to get working on my existing kernel. (Boy, Linux should get its act straight on user-level tracing. It's not as omnipresent as DTrace is on Solaris.)

I converted my old lockstat utility to work on systemtap. The script seems to work, but systemtap can act funny sometimes: don't exit from it abruptly, or it may send a "wrong" signal to the postgres backend, which then just commits suicide since it cannot figure out what to do with such a trace/debug signal.
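A minimal sketch of the idea (not the actual lockstat script) looks roughly like this; it assumes postgres was built with --enable-dtrace so that the lwlock wait markers are available, and the binary path is simply wherever your postgres lives:

#! /usr/bin/env stap
# Sketch: aggregate LWLock wait times per (lock id, mode).
# Requires a postgres built with --enable-dtrace (userspace SDT markers).

global wait_start, wait_time

probe process("/usr/local/aurora-1.0/bin/postgres").mark("lwlock__wait__start")
{
    /* $arg1 = lock id, $arg2 = lock mode */
    wait_start[tid(), $arg1] = gettimeofday_us()
}

probe process("/usr/local/aurora-1.0/bin/postgres").mark("lwlock__wait__done")
{
    t = wait_start[tid(), $arg1]
    if (t) {
        wait_time[$arg1, $arg2] <<< gettimeofday_us() - t
        delete wait_start[tid(), $arg1]
    }
}

probe end
{
    printf("%8s %5s %8s %13s %8s %8s\n",
           "LWID", "MODE", "COUNT", "SUM-TIME(us)", "MAX(us)", "AVG(us)")
    foreach ([id, mode] in wait_time)
        printf("%8d %5d %8d %13d %8d %8d\n", id, mode,
               @count(wait_time[id, mode]), @sum(wait_time[id, mode]),
               @max(wait_time[id, mode]), @avg(wait_time[id, mode]))
}

Here is what the top waits from my script looked like against this workload: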

         LOCKNAME  LWID   M  W/A   COUNT  SUM-TIME(us)  MAX(us)  AVG-TIME(us)
      LockMgrLock    45  Ex    W   85343     469682510    13152          5503
      LockMgrLock    57  Ex    W   57547      30903727     8313           537
      LockMgrLock    44  Ex    W     390         34061     1670            87
      LockMgrLock    59  Ex    W     375         41570     2032           110
      LockMgrLock    56  Ex    W     361         39685     1889           109
      LockMgrLock    47  Ex    W     344         24548     1564            71
      LockMgrLock    54  Ex    W     335         67770     2319           202
      LockMgrLock    50  Ex    W     325         44213     1690           136
      LockMgrLock    49  Ex    W     325         39280     1475           120
      LockMgrLock    55  Ex    W     323         39448     1584           122
      LockMgrLock    48  Ex    W     323         26982     1669            83


What you see above are the top WAITs on lwlocks by count, along with the average and maximum wait time for each LWLock.

(Thinking it might be related to NUM_LOCK_PARTITIONS, I did some experiments with different sizes but since they are all related to the same table it does not help here.)
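If you want to try the same experiment: NUM_LOCK_PARTITIONS is a compile-time constant, so changing it means rebuilding the server. In the 9.0 sources it is derived from a log2 value in src/include/storage/lwlock.h, roughly:

/* Number of partitions the shared lock tables are divided into */
#define LOG2_NUM_LOCK_PARTITIONS  4
#define NUM_LOCK_PARTITIONS  (1 << LOG2_NUM_LOCK_PARTITIONS)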

Then I modified my lockstat script slightly to do stack traces for those two locks:

The majority of the waits on each of those two locks came from pretty much a single code path:

Lock id:45, LockMode:0
 0x000000000062412e : LWLockAcquire+0x25e/0x270 [/usr/local/aurora-1.0/bin/postgres]
 0x00000000006228cc : LockAcquireExtended+0x2dc/0xa40 [/usr/local/aurora-1.0/bin/postgres]
 0x0000000000620788 : LockRelationOid+0x48/0x60 [/usr/local/aurora-1.0/bin/postgres]
 0x000000000064be85 : AcquireExecutorLocks+0xd5/0x190 [/usr/local/aurora-1.0/bin/postgres]
 0x000000000064c96a : RevalidateCachedPlan+0x5a/0x3b0 [/usr/local/aurora-1.0/bin/postgres]
 0x000000000062ee64 : exec_bind_message+0x604/0xab0 [/usr/local/aurora-1.0/bin/postgres]
 0x0000000000630bbd : PostgresMain+0x82d/0x16e0 [/usr/local/aurora-1.0/bin/postgres]
 0x00000000005f302e : ServerLoop+0x96e/0xcb0 [/usr/local/aurora-1.0/bin/postgres]
 0x00000000005f3dec : PostmasterMain+0xa7c/0x1150 [/usr/local/aurora-1.0/bin/postgres]
 0x000000000044e4f0 : main+0x370/0x430 [/usr/local/aurora-1.0/bin/postgres]

Lock id:57, LockMode:0
 0x000000000062412e : LWLockAcquire+0x25e/0x270 [/usr/local/aurora-1.0/bin/postgres]
 0x00000000006228cc : LockAcquireExtended+0x2dc/0xa40 [/usr/local/aurora-1.0/bin/postgres]
 0x0000000000620788 : LockRelationOid+0x48/0x60 [/usr/local/aurora-1.0/bin/postgres]
 0x0000000000477425 : relation_open+0x55/0x90 [/usr/local/aurora-1.0/bin/postgres]
 0x000000000047e5e3 : index_open+0x13/0x90 [/usr/local/aurora-1.0/bin/postgres]
 0x00000000005732a5 : ExecInitIndexScan+0x125/0x1e0 [/usr/local/aurora-1.0/bin/postgres]
 0x0000000000560275 : ExecInitNode+0x135/0x290 [/usr/local/aurora-1.0/bin/postgres]
 0x000000000055ed80 : standard_ExecutorStart+0x530/0xc70 [/usr/local/aurora-1.0/bin/postgres]
 0x000000000063278b : PortalStart+0x1bb/0x380 [/usr/local/aurora-1.0/bin/postgres]
 0x000000000062ee9c : exec_bind_message+0x63c/0xab0 [/usr/local/aurora-1.0/bin/postgres]
 0x0000000000630bbd : PostgresMain+0x82d/0x16e0 [/usr/local/aurora-1.0/bin/postgres]
 0x00000000005f302e : ServerLoop+0x96e/0xcb0 [/usr/local/aurora-1.0/bin/postgres]
 0x00000000005f3dec : PostmasterMain+0xa7c/0x1150 [/usr/local/aurora-1.0/bin/postgres]
 0x000000000044e4f0 : main+0x370/0x430 [/usr/local/aurora-1.0/bin/postgres]


So what I understand is that we have two problems here:
1. RevalidateCachedPlan (the major bottleneck)
2. AccessShareLock acquisition (the second bottleneck, coming from index_open at executor start)

Well, at least now I know the areas where there seem to be scaling bottlenecks that can limit simple SELECT statement scaling. The question now is what to do about them. Back to the mailing list.

Thursday, November 18, 2010

sysbench, PostgreSQL 9.0 and OLTP complex read-write test

Continuing with my sysbench saga with PostgreSQL 9.0: I was generally not encountering any errors until recently. There are two things I did differently in this test that led me to hit this problem. One, I used a relatively small number of rows (1 million), which shrinks the hot set of rows that sysbench's special distribution hammers; and two, a reasonably high number of threads (80), as follows:

$ sysbench --db-driver=pgsql --oltp-dist-type=special --oltp-table-size=1000000 --oltp-read-only=off --oltp-test-mode=complex --max-requests=0 --max-time=300 --num-threads=64 --test=oltp run

sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 80

Doing OLTP test.
Running mixed OLTP test
Using Special distribution (12 iterations,  1 pct of values are returned in 75 pct cases)
Using "BEGIN" for starting transactions
Using auto_inc on the id column
Threads started!
FATAL: query execution failed: 9490352
FATAL: database error, exiting...
Done.


And the test failed with this error reported in pg_log:

ERROR:  duplicate key value violates unique constraint "sbtest_pkey"
DETAIL:  Key (id)=(500815) already exists.
STATEMENT:  INSERT INTO sbtest values($1,0,' ','aaaaaaaaaaffffffffffrrrrrrrrrreeeeeeeeeeyyyyyyyyyy')


The way sysbench's complex read-write transaction works is that, after doing a bunch of selects and updates, it deletes a row and inserts the same row back. So by the transaction's own logic it should never hit this error, since it just deleted the row. It took me some time to recreate this in a way that let me understand what is happening.
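To set the stage, each complex-mode transaction is roughly of the following shape (paraphrased, not the literal statements, with the selects and updates before the delete trimmed down to a placeholder):

BEGIN;
-- ... a bunch of point/range selects and updates on sbtest ...
DELETE FROM sbtest WHERE id = $1;
INSERT INTO sbtest VALUES ($1, 0, ' ', 'aaaaaaaaaaffffffffffrrrrrrrrrreeeeeeeeeeyyyyyyyyyy');
END;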

What's happening comes down to the timing of a query in one transaction relative to another in-flight transaction working on the same key: that other transaction has just deleted the row and inserted the same row back within the same transaction.

Let's consider two transactions, A and B:


Transaction A                      Transaction B
BEGIN;
                                   BEGIN;
DELETE FROM sbtest WHERE id=500815;
(returns DELETE 1)
INSERT INTO sbtest values(500815,0,'','aaaaaaaaaaffffffffffrrrrrrrrrreeeeeeeeeeyyyyyyyyyy');
(returns INSERT 0 1)
                                   DELETE FROM sbtest WHERE id=500815;< ------- hangs/waits
END;
(COMMIT)
                                   (returns DELETE 0: it reports success but doesn't delete any rows, and it doesn't roll back the transaction)
                                    INSERT INTO sbtest values(500815,0,'','aaaaaaaaaaffffffffffrrrrrrrrrreeeeeeeeeeyyyyyyyyyy');
                                    ERROR: duplicate key value violates unique constraint "sbtest_pkey"
                                    END;
                                    (ROLLBACK)

The way MySQL/InnoDB handles it is slightly different: the blocked DELETE actually deletes the newly inserted row, and hence the INSERT can always succeed, which is why sysbench with MySQL/InnoDB never showed this problem.

If you read the PostgreSQL documentation for the READ COMMITTED isolation level, it says: "In effect, a SELECT query sees a snapshot of the database as of the instant the query begins to run". So had the DELETE actually started after the END of the first transaction, it would have deleted the newly INSERTed row. I am not sure how other databases (Oracle, DB2, etc.) behave in this scenario.

As one of my co-workers put it: "Correctness is in the Implementation of Beholder". It is hard to say who is right or wrong.

Just another one of the minor differences between MySQL and PostgreSQL.

Wednesday, November 03, 2010

PostgreSQL in Virtual Environment - PGWest 2010

Yesterday I gave my presentation at PgWest 2010 in San Francisco.


If you have questions on the presentation please feel free to leave comments.

Monday, October 04, 2010

Postgres 9 and sysbench 0.4.12

Ever since the release of Postgres 9 I had been hoping for a chance to do some performance tests on it. I finally got around to trying it out. This time I tried Postgres running in a virtual machine on top of VMware's vSphere Infrastructure 4.1.


Since there were plenty of sysbench numbers available for MySQL, I thought it was time to try Postgres with sysbench. The version I tried was sysbench 0.4.12. Compiling sysbench 0.4.12 for PostgreSQL was itself such a challenge that it deserves a separate blog post, but for now let's start at the point where I had a working sysbench binary that I ran from a separate virtual machine against my Postgres database server.
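For the impatient, the starting point is a configure along these lines (a sketch only: --with-pgsql is the relevant switch, you will most likely have to point the build at your PostgreSQL headers and libpq yourself, and actually getting it to build cleanly took considerably more fiddling than this suggests):

$ ./configure --with-pgsql --without-mysql
$ make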

The first step of a sysbench OLTP run is the "prepare" step, where the benchmark tool creates the table and loads rows into it. I had seen many MySQL tests run with 1 million rows, and some with 10 million rows.

So anyway I started with 1 million rows and similar options. The load started and went on, and on, for a while (I should have taken a lunch break instead of waiting for it to finish). It finally finished after some 20-odd minutes. I found that odd. So I downloaded the MySQL 5.5 RC binaries, created a MySQL database, compiled sysbench to run against MySQL, and started the same "prepare" step with 1 million rows.

The "prepare" step finished for MySQL in an amazing 30 seconds or so. My first reaction was that MySQL is damn fast compared to Postgres. As a Postgres lover I found that hard to believe; there had to be some other explanation. I added the --debug flag to sysbench and saw the INSERT statements that sysbench sent to MySQL. When I used the same debug flag with the Postgres version, it did not show me any of the statements it used. That was no help.

Anyway, Postgres was running on a version of the "late" OpenSolaris, so I used the query_time_breakdown.d script from the pgfoundry DTrace scripts to see the queries being used to insert rows. I noticed that sysbench was doing multi-row inserts on MySQL and single-row inserts on Postgres.

Now came the hard part: reading through the sysbench code, which took me a while to understand, both its layout and how it uses the database drivers. To cut a long story short, I figured out that sysbench uses the "drv_caps_t" structure to record the capabilities of each database driver. The MySQL driver enables multi-row inserts, while the Postgres driver had multi-row support disabled.

Checking the Postgres documentation, I saw that multi-row VALUES lists in an INSERT statement are allowed.
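In other words, a single INSERT can carry many rows in one VALUES list; a contrived example (column names per the sbtest schema that sysbench creates):

INSERT INTO sbtest (k, c, pad) VALUES
  (0, 'row one', 'padding'),
  (0, 'row two', 'padding'),
  (0, 'row three', 'padding');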

So I modified sysbench/drivers/pgsql/drv_pgsql.c and changed

/* PgSQL driver capabilities */
static drv_caps_t pgsql_drv_caps =
{
  0,    /* multi-row INSERT support: disabled */
  1,
  1,
  0,
  0,
  1,
  0,
  NULL
};

to the following

/* PgSQL driver capabilities */
static drv_caps_t pgsql_drv_caps =
{
  1,    /* multi-row INSERT support: enabled */
  1,
  1,
  0,
  0,
  1,
  0,
  NULL
};


The resulting difference is that it now loads 1000 rows with each INSERT statement (like it does for MySQL).

As for the difference it makes to the "prepare" operation, the run time went from this:
$ time ./sysbench --pgsql-user=pguser --pgsql-password=xxxxxx --pgsql-db=sbtest --pgsql-host=pgdb --db-driver=pgsql --oltp-dist-type=special --oltp-table-size=1000000 --oltp-read-only=off --oltp-test-mode=complex --max-requests=0 --init-rng --max-time=300 --num-threads=96 --test=oltp --debug prepare

sysbench 0.4.12: multi-threaded system evaluation benchmark
Creating table 'sbtest'...
NOTICE: CREATE TABLE will create implicit sequence "sbtest_id_seq" for serial column "sbtest.id"
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "sbtest_pkey" for table "sbtest"
Creating 1000000 records in table 'sbtest'...


real 17m47.695s
user 0m0.328s
sys 0m0.700s

to:

$ time ./sysbench --pgsql-user=pguser --pgsql-password=xxxxxx --pgsql-db=sbtest --pgsql-host=pgdb --db-driver=pgsql --oltp-dist-type=special --oltp-table-size=1000000 --oltp-read-only=off --oltp-test-mode=complex --max-requests=0 --init-rng --max-time=300 --num-threads=96 --test=oltp --debug prepare

sysbench 0.4.12: multi-threaded system evaluation benchmark
Creating table 'sbtest'...
NOTICE: CREATE TABLE will create implicit sequence "sbtest_id_seq" for serial column "sbtest.id"
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "sbtest_pkey" for table "sbtest"
Creating 1000000 records in table 'sbtest'...

real 0m36.237s
user 0m0.180s
sys 0m0.064s


This looks much better and is acceptable. Now I can proceed to the real sysbench testing.
If somebody knows the maintainer of sysbench, can you please point him to this blog article so that drv_pgsql.c gets patched up to be similar to the MySQL driver?

Sunday, June 20, 2010

Virtualization and Postgres

It will soon be a couple of months for me at VMware. I have been learning a lot about VMware and virtualization. Often I try to view it from the point of view of a database like Postgres. To say the least, it has been a learning experience.

In its simplest form, virtualization (not to be confused with simulation or emulation) presents a set of resources. The primary resources are CPU, memory, disk and network. There are others, but most of them are not that interesting to the majority of database users (or at least to me). For a Java programmer this is much like an interface which can be implemented by a class while retaining its properties: you cannot use the interface directly, but you use a class that implements it. Similarly with virtualization, while you think only in terms of these interface resources, they are actually implemented by the underlying physical resources of the system. Hence it is not to be confused with "emulation".

That said, one would think they need not treat it as a "special" case in their deployment. This is where expectations and reality start to diverge and we can get a case of finger pointing. To set the right expectations for virtualization one needs to understand the philosophy behind it, which I guess is what I have been doing recently. It is more like soul-searching in some sense. From what I gather, the philosophy behind virtualization is actually efficient and simplified utilization of resources. Again, let's go back to our primary resources: CPU, memory, disk, network. What virtualization tries to achieve is efficient utilization of these resources beyond one operating system. I think everyone understands the efficient utilization part. The clause that becomes tricky is "beyond one operating system". This clause is like a last-minute change slipped in by some smart lawyer which no one will understand till some years later when another legal review is done. :-)

This clause is what is causing many operating systems, and applications like databases, to rethink their "my world is the only world here" behavior. The details of these behaviors are too long to list, but to give an idea: an idle operating system fires so many tick interrupts to keep its clock up to date that a few such operating systems, while doing nothing, can keep a core busy just doing these time updates. Occasionally an interrupt does not arrive in time, which can cause time drift in the operating system.

Similarly, my experience with databases and shared memory is that databases often just pin the maximum memory we expect the database to use at peak load. So even though peak utilization may occur, say, only 10-20% of the time, the memory has been fully allocated and pinned (marked "unusable" for others), which may not be the most efficient way of using that resource. (This was primarily true in my past life, where the immediate thing we would do is use Intimate Shared Memory on Solaris so nothing else could use it.) Other things that come to mind are spin locks, where CPU cycles are forced to spin doing nothing but wait, network polling to see if there is more incoming data, and many other operations.

Plus, with the isolation provided, a person within an operating system now has no idea what else could be using the same resources the OS is using, which makes it more difficult for the administrator and the virtualization professional to reach a common understanding. :-)

Anyway, this is probably my first post on this topic in relation to Postgres. As I learn more about virtualization, you can expect me to share what I learn about the impact of virtualization on Postgres and vice versa. In the meanwhile, if you have other "bugging" questions related to Postgres and virtualization, do let me know and maybe I can test them out in my spare time.