Thursday, September 19, 2019

PostgreSQL Extensions - A Deeper Look

My slides from my session "PostgreSQL Extensions - A Deeper Look" at PostgresOpen 2019 and PostgresConf SV 2019.

This blog represents my own view points and not of my employer, Amazon Web Services.

Sunday, September 08, 2019

Tuning DB Parameters for PostgreSQL 12 in Amazon RDS

In my last entry, we saw how to set up PostgreSQL 12 Beta 3 in Amazon RDS. I purposely left out how to change database parameters, as that topic deserves an entry (or more) of its own.

Using the AWS CLI you can create a new database parameter group as follows:

$ aws rds create-db-parameter-group  --db-parameter-group-name jkpg12pg    \
 --db-parameter-group-family postgres12 --description "My PostgreSQL 12 Parameter Group" \
 --region us-east-2 --endpoint  

We have just created a parameter group; it is not yet applied to any database. Before we apply it, let's look at the default values in the newly created group. You can run the following command to see the values set by default:

$ aws rds describe-db-parameters --db-parameter-group-name jkpg12pg  \
--region us-east-2 --endpoint \
--query 'Parameters[].[ParameterName,ParameterValue]' --output text

The output contains a list of parameters with values. Let's look at some of the values to see how to interpret them.

application_name None
autovacuum_max_workers GREATEST({DBInstanceClassMemory/64371566592},3)
autovacuum_vacuum_cost_limit GREATEST({log(DBInstanceClassMemory/21474836480)*600},200)
effective_cache_size {DBInstanceClassMemory/16384}
jit None
maintenance_work_mem GREATEST({DBInstanceClassMemory*1024/63963136},65536)
max_connections LEAST({DBInstanceClassMemory/9531392},5000)
shared_buffers {DBInstanceClassMemory/32768}
shared_preload_libraries pg_stat_statements
work_mem None
xmlbinary None
xmloption None

When you see None, the parameter is effectively not set in postgresql.conf, and PostgreSQL uses the default value of that engine version. In the example above, notice that jit is None, which means it takes the PostgreSQL 12 default of on, enabling JIT on the instance.

If you set a parameter to a specific value (of the supported type and unit for that parameter), that value is used instead of the PostgreSQL default. For example, shared_preload_libraries has a default value of pg_stat_statements, which means that when you deploy a PostgreSQL 12 instance, the pg_stat_statements library is already preloaded and the extension is available without requiring another restart.

Other interesting parameters are those whose values contain {} or the GREATEST and LEAST functions. These are macro expressions that derive the value from DBInstanceClassMemory (in bytes), i.e., from the memory of the instance class used by the database instance.

For example, shared_buffers is set to {DBInstanceClassMemory/32768}. In PostgreSQL, when no unit is given, shared_buffers is interpreted as a number of 8KB pages.
So this macro sets it to 25% (1/4th) of total RAM expressed in 8192-byte pages: (RAM/8192)/4 = RAM/32768.
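As a quick sanity check, here is a small shell sketch (assuming a hypothetical 64GB instance) of what the macro arithmetic works out to:

```shell
# Hypothetical instance with 64GB of RAM, expressed in bytes
RAM=$((64 * 1024 * 1024 * 1024))

# What the RDS macro {DBInstanceClassMemory/32768} evaluates to
pages=$((RAM / 32768))

# 25% of RAM expressed in 8KB pages: (RAM / 8192) / 4
quarter_pages=$(( (RAM / 8192) / 4 ))

# Both formulas yield the same number of 8KB pages
echo "shared_buffers = ${pages} pages ($((pages * 8192 * 100 / RAM))% of RAM)"
```

Running this prints 2097152 pages, i.e. 16GB, which is exactly 25% of the 64GB.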

Setting these values well is important to get optimum usage out of a PostgreSQL database. Let's look at how I think about setting them for an instance.

Let's consider an RDS instance with 64GB of RAM (roughly a db.m4.4xlarge). For simplicity I am not counting basic Linux kernel memory, RDS monitoring, and other OS processes, but I do count the file system cache, as it is a significant portion of the memory used by a PostgreSQL instance. The other major component is shared buffers, the common shared memory area used by PostgreSQL processes. The final component is the aggregate of the private memory of each connection to the PostgreSQL database.

TOTAL RAM = Filesystem Cache + Shared DB Buffers Cache + Sum of all  PostgreSQL connections

By default on RDS, shared buffers are set to 25% of RAM. It is fair to assume that file system cache usage can be equal to or greater than that, since all pages come through the file system cache; call it 25-30% of RAM. While shared buffers can be controlled via the shared_buffers parameter, the file system cache cannot be controlled directly, though the OS can free it in low-memory situations. Using our 64GB example, we have already accounted for 16GB of shared buffers plus 16-20GB of file system cache, leaving about 28-32GB for the private memory consumed by database connections. In the rest of the calculation we assume these two parts together take roughly 50% of RAM.
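In shell arithmetic, that rough budget (a sketch, using the 25% assumptions from the text on a hypothetical 64GB instance) looks like:

```shell
total_gb=64                                      # hypothetical instance RAM
shared_gb=$((total_gb / 4))                      # shared_buffers at the 25% RDS default
fs_cache_gb=$((total_gb / 4))                    # assume a similarly sized file system cache
conn_gb=$((total_gb - shared_gb - fs_cache_gb))  # remainder for connection private memory
echo "${conn_gb} GB left for PostgreSQL connections"
```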

The private memory of database connections is hard to measure: it is not the 'RSS' of a process but RSS minus the touched shared memory pages, and it depends on the number of connections and the chunks of work_mem consumed by each connection.

For a capacity calculation we can use something as simple as

      Average Memory per PostgreSQL connections * Concurrent Connections <= 50% of RAM 

where average memory per PostgreSQL connection can be simplified to n * work_mem + process overhead, where n varies with the type of queries. For example, a query joining two tables with an ORDER BY can end up using 2 work_mem chunks plus the process overhead. Putting numbers to that, with the default work_mem of 4MB and an approximate PostgreSQL process overhead of roughly 5MB (if Linux huge pages are disabled, this number may need to be bumped higher), each PostgreSQL connection is about 2x4 + 5 = 13MB. With 1,000 concurrent connections you may consume about 13GB, and with 2,000 connections that jumps to 26GB. Hence we should make sure that

work_mem <= ( 50% of RAM in KB / concurrent_connections - 5,000 KB ) / 2
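Plugging in the numbers from the example above (default 4MB work_mem, an assumed ~5MB process overhead, 2 work_mem chunks, and 2,000 connections on a hypothetical 64GB instance):

```shell
work_mem_kb=4096        # default work_mem (4MB)
overhead_kb=5000        # assumed per-process overhead (~5MB)
chunks=2                # e.g. a two-table join with an ORDER BY

# Approximate private memory per connection: n * work_mem + overhead
per_conn_kb=$((chunks * work_mem_kb + overhead_kb))

# Upper bound on work_mem for 2,000 connections within 50% of 64GB RAM
half_ram_kb=$((64 * 1024 * 1024 / 2))
conns=2000
max_work_mem_kb=$(( (half_ram_kb / conns - overhead_kb) / 2 ))

echo "per connection: ~${per_conn_kb} KB; work_mem ceiling: ${max_work_mem_kb} KB"
```

That ceiling comes out to about 5.7MB, so the 4MB default still fits for 2,000 connections in this sketch.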

Hence query tuning and lowering work_mem and max_connections can help control this component of memory. If your queries actually end up requiring more work_mem, then the default 25% for shared_buffers needs to be reduced to make more memory available for your work_mem and max_connections needs. It is perfectly reasonable to lower the shared_buffers percentage to fit the number of concurrent connections: a lower buffer pool hit ratio is better than ending up in swap.
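As a hypothetical illustration of that trade-off: if the connection count grows to 3,000 at the ~13MB-per-connection estimate from the text, keeping shared_buffers at 25% no longer fits, and its budget shrinks to roughly 15% of RAM:

```shell
ram_mb=$((64 * 1024))                  # hypothetical 64GB instance
conns=3000
per_conn_mb=13                         # per-connection estimate from the text
conn_mem_mb=$((conns * per_conn_mb))   # 39000 MB needed by connections
fs_cache_mb=$((ram_mb / 4))            # keep ~25% for the file system cache
shared_mb=$((ram_mb - fs_cache_mb - conn_mem_mb))
shared_pct=$((shared_mb * 100 / ram_mb))
echo "shared_buffers budget: ${shared_mb} MB (~${shared_pct}% of RAM)"
```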

The above is a simple way to look at it. I am ignoring things like in-memory temporary tables and other temporary memory, on the assumption that the queries are simple join-and-order-by queries. If you use temporary tables or run large analytical queries, you have to account for that memory in your average memory per PostgreSQL connection, and then perhaps reduce shared buffers so that total usage stays within total RAM rather than spilling into swap or causing a large flush of the file system cache.

If you want to lower shared buffers to, say, 20% instead of the default 25%, you would change the macro for the parameter to (RAM/8192)/5, i.e. {DBInstanceClassMemory/40960}.

To override a parameter in a particular group you can do as follows:

$ aws rds modify-db-parameter-group --db-parameter-group-name jkpg12pg \
 --region us-east-2 --endpoint  \
 --parameters "ParameterName=shared_buffers,ParameterValue=\"{DBInstanceClassMemory/40960}\",ApplyMethod=pending-reboot" 

When you list the parameters again, you will see the change in the parameter group. Notice that for this parameter the ApplyMethod is pending-reboot. Static parameters can only be applied on reboot, and shared_buffers is a static parameter. For dynamic parameters you can also use "immediate" as the ApplyMethod, which applies the change immediately to all database instances using the parameter group.

In our case we have not applied the group to any database yet, so it does not matter. Let's first apply it to our database.

$ aws rds modify-db-instance \
 --db-instance-identifier mypg12b3 --db-parameter-group-name jkpg12pg \
 --region us-east-2 --endpoint 

Note, however, that changing the group does not mean all the parameters are in effect. Since static changes are only applied after a reboot, we reboot the instance as follows:

$ aws rds reboot-db-instance  --db-instance-identifier mypg12b3  \
--region us-east-2 --endpoint  

With the reboot, we now have a database running with a custom parameter group whose parameters are tuned to the expected workload. You may not get them right on the first try, but now you know how to set them and apply them to the database using the CLI.


Thursday, August 29, 2019

Setting up PostgreSQL 12 Beta 3 for Testing in Amazon RDS

One of the amazing things about the PostgreSQL community is that it launches releases like clockwork. On 8/8/2019 the community not only released minor versions for PostgreSQL 11 and older major versions but also Beta 3 of the upcoming PostgreSQL 12.

On AWS, you can check versions of PostgreSQL available in your region as follows:


$ aws rds describe-db-engine-versions --engine postgres --query 'DBEngineVersions[*].EngineVersion'

You will not see any beta versions there. Pre-release versions of PostgreSQL in AWS are available in the Database Preview Environment within US East 2 (Ohio). If you are using the CLI, you have to add the region us-east-2 and the preview endpoint URL to your commands.


$ aws rds describe-db-engine-versions --engine postgres \
  --query 'DBEngineVersions[*].EngineVersion' \
  --region us-east-2 --endpoint

The versions displayed are a bit cryptic, but they denote the major version followed by the date when the build was synced for the preview release. The version description is friendlier to read than the version itself.


$ aws rds describe-db-engine-versions --engine postgres \
  --query 'DBEngineVersions[*].DBEngineVersionDescription' \
  --region us-east-2 --endpoint
    "PostgreSQL 12.20190617 (BETA2)", 
    "PostgreSQL 12.20190806 (BETA3)"
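As an aside, the version string splits cleanly in the shell, assuming the major-version-dot-sync-date convention described above:

```shell
# Split a preview engine version into its two parts
v=12.20190806
major=${v%%.*}       # PostgreSQL major version
sync_date=${v#*.}    # date the build was synced
echo "major=${major} synced=${sync_date}"
```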

Let's deploy an instance of PostgreSQL 12 Beta 3, aka version 12.20190806.


$ aws rds create-db-instance  \
--engine postgres  --engine-version 12.20190806 --db-instance-identifier mypg12b3 \
--allocated-storage 100 --db-instance-class db.t2.small     \
--db-name benchdb  --master-username pgadmin  --master-user-password SuperSecret \
--region us-east-2 --endpoint  


After a few minutes the endpoint will be available and can be queried as follows:


$ aws rds describe-db-instances  --db-instance-identifier mypg12b3 --query 'DBInstances[].Endpoint' \
--region us-east-2 --endpoint 
        "HostedZoneId": "ZZOC4A7DETW6VV", 
        "Port": 5432, 
        "Address": ""


If you have a default VPC security group defined in US East 2 (Ohio), you should be able to use the latest psql client to connect based on your default rules. If you do not, a new security group is created for you, and you have to add your client to it to reach the database instance. The security group will be among your US East 2 (Ohio) EC2 security groups for the preview environment.

Once you have added your client to the security group, you can connect to the database as follows:


$ psql -h -d benchdb -U pgadmin 
Password for user pgadmin: 
psql (10.4, server 12beta3)
WARNING: psql major version 10, server major version 12.
         Some psql features might not work.
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)
Type "help" for help.

benchdb=> select version();
 PostgreSQL 12beta3 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.3 20140911 (Red Hat 4.8.3-9), 64-bit
(1 row)


For a given major version, the supported extensions can be queried as follows:


benchdb=> show rds.extensions;
 address_standardizer, address_standardizer_data_us, amcheck, aws_commons, bloom, btree_gin, btree_gist, citext, cube, dblink, dict_int, dict_xsyn, earthdistance, fuzzystrmatch, hstore, hstore_plperl,
 intagg, intarray, ip4r, isn, jsonb_plperl, log_fdw, ltree, pageinspect, pg_buffercache, pg_freespacemap, pg_prewarm, pg_similarity, pg_stat_statements, pg_trgm, pg_visibility, pgcrypto, pgrouting,
 pgrowlocks, pgstattuple, pgtap, plperl, plpgsql, pltcl, postgis, postgis_tiger_geocoder, postgis_topology, postgres_fdw, prefix, sslinfo, tablefunc, test_parser, tsm_system_rows, tsm_system_time,
 unaccent, uuid-ossp
(1 row)

Extensions are created using your master username login as follows:

benchdb=> CREATE EXTENSION pg_stat_statements;
benchdb=> CREATE EXTENSION postgis;
benchdb=> CREATE EXTENSION postgis_topology;

To verify the versions of the extensions I created:

benchdb=> select * from pg_extension;
  oid  |      extname       | extowner | extnamespace | extrelocatable | extversion  |   extconfig   |          extcondition           
 14299 | plpgsql            |       10 |           11 | f              | 1.0         |               | 
 16402 | pg_stat_statements |       10 |         2200 | t              | 1.7         |               | 
 16410 | postgis            |       10 |         2200 | f              | 3.0.0alpha3 | {16712}       | {"WHERE NOT (                  +
       |                    |          |              |                |             |               | )"}
 17418 | postgis_topology   |       10 |        17417 | f              | 3.0.0alpha3 | {17421,17434} | {"",""}
(4 rows)

One of the recent enhancements in Amazon RDS, available since the PostgreSQL 11 release, is that the pg_stat_statements library is loaded by default unless explicitly disabled. This means I can use pg_stat_statements immediately after creating the extension.

select * from pg_stat_statements order by total_time desc limit 4;
 userid | dbid  |       queryid        |               query               | calls | total_time  |  min_time   |  max_time   |  mean_time  | stddev_time | rows | shared_blks_hit | shared_blks_read | s
hared_blks_dirtied | shared_blks_written | local_blks_hit | local_blks_read | local_blks_dirtied | local_blks_written | temp_blks_read | temp_blks_written | blk_read_time | blk_write_time 
     10 | 16384 | -6310040060866956228 | select pg_start_backup($1, $2)    |     1 | 4934.715563 | 4934.715563 | 4934.715563 | 4934.715563 |           0 |    1 |               0 |                0 |  
                 0 |                   0 |              0 |               0 |                  0 |                  0 |              0 |                 0 |             0 |              0
     10 | 16384 |  4124339773204179264 | select pg_stop_backup()           |     1 | 4656.211207 | 4656.211207 | 4656.211207 | 4656.211207 |           0 |    1 |               0 |                0 |  
                 0 |                   0 |              0 |               0 |                  0 |                  0 |              0 |                 0 |             0 |              0
  16394 | 16396 | -2030728853740878493 | CREATE EXTENSION postgis          |     1 |  691.652456 |  691.652456 |  691.652456 |  691.652456 |           0 |    0 |           71359 |              247 |  
               835 |                 707 |              0 |               0 |                  0 |                  0 |              0 |                 0 |             0 |              0
  16394 | 16396 | -2651058291501154175 | CREATE EXTENSION postgis_topology |     1 |   61.100119 |   61.100119 |   61.100119 |   61.100119 |           0 |    0 |            8539 |               26 |  
                57 |                  37 |              0 |               0 |                  0 |                  0 |              0 |                 0 |             0 |              0
(4 rows)


Note that instances in the preview environment are meant for 60 days of development and testing with your applications, to try out new features and optimize them for PostgreSQL 12!


Wednesday, February 06, 2019

PGConf.RU 2019: Slides from my sessions

It was my first visit to Moscow for PGConf.RU 2019. Enjoyed meeting the strong community of PostgreSQL in Russia!

Slides from my sessions:

1. Deep Dive into the RDS PostgreSQL Universe

2. Tips and Tricks for Amazon RDS for PostgreSQL


Thursday, October 18, 2018

Hello PostgreSQL 11 - Get ready to say goodbye to PostgreSQL 9.3

Earlier today (Oct 18, 2018), the PostgreSQL community announced the release of PostgreSQL 11. Having done multiple software releases myself, I appreciate the hard work by all contributors to deliver yet another major release on schedule. It is hard to do a major release every year, and the community has been doing it since PostgreSQL 8.4, making this the 10th release in the last decade.

Everybody has their favorite feature in PostgreSQL 11, and the one at the top of my list is transactional support in stored procedures. 2ndQuadrant first announced that feature at the end of last year, and it instantly became my favorite: I see it as a giant leap for PostgreSQL, because it lets people write long data routines like ETL jobs broken down into multiple transactions. Of course, many users will enjoy the improvements to the table partitioning system, query parallelism, and just-in-time (JIT) compilation for accelerating the execution of expressions in queries. But developers will certainly get more freedom from the stored procedure improvements.

With the release of PostgreSQL 11, there are now 6 supported major releases: PostgreSQL 9.3, 9.4, 9.5, 9.6, 10, and 11. It is definitely a good time to start thinking about upgrading your PostgreSQL 9.3 databases. As per the versioning policy, the final minor release for PostgreSQL 9.3 will be on November 8th, 2018. PostgreSQL 9.3 is the last major version that does not support logical replication, which was first introduced in PostgreSQL 9.4. Hence, I expect this will be the last painful upgrade, because from PostgreSQL 9.4 onwards you can always leverage logical replication to minimize downtime while switching to a new version. All is not lost for PostgreSQL 9.3: while the experience is not exactly the same, there are still tools using the older trigger-based replication that can help, or you can just bite the bullet and upgrade once with a small maintenance window, since later versions will give you more options for your next major version upgrade.

If you need tips and tricks for upgrading your PostgreSQL 9.3 instances,  let me know! :-)