
Thursday, September 19, 2019

PostgreSQL Extensions - A Deeper Look

My slides from my session "PostgreSQL Extensions - A Deeper Look" at PostgresOpen 2019 and PostgresConf SV 2019:

This blog represents my own view points and not of my employer, Amazon Web Services.

Sunday, September 08, 2019

Tuning DB Parameters for PostgreSQL 12 in Amazon RDS

In my last entry, we saw how to set up PostgreSQL 12 Beta 3 in Amazon RDS. I purposely left out how to change database parameters, as I realized it deserves an entry (or more) of its own.

Using the AWS CLI you can create a new database parameter group as follows:

$ aws rds create-db-parameter-group  --db-parameter-group-name jkpg12pg    \
 --db-parameter-group-family postgres12 --description "My PostgreSQL 12 Parameter Group" \
 --region us-east-2 --endpoint https://rds-preview.us-east-2.amazonaws.com  


We have just created a group; we have not applied its parameters to any database yet. Before we apply it, let's look at the default values in the newly created group. You can run the following command to see the values set by default in the group.

$ aws rds describe-db-parameters --db-parameter-group-name jkpg12pg  \
--region us-east-2 --endpoint https://rds-preview.us-east-2.amazonaws.com \
--query 'Parameters[].[ParameterName,ParameterValue]' --output text

The output contains a list of parameters with values. Let's look at some of the values to see how to interpret them.


application_name None
..
autovacuum_max_workers GREATEST({DBInstanceClassMemory/64371566592},3)
autovacuum_vacuum_cost_limit GREATEST({log(DBInstanceClassMemory/21474836480)*600},200)
..
effective_cache_size {DBInstanceClassMemory/16384}
..
jit None
..
maintenance_work_mem GREATEST({DBInstanceClassMemory*1024/63963136},65536)
max_connections LEAST({DBInstanceClassMemory/9531392},5000)
..
shared_buffers {DBInstanceClassMemory/32768}
shared_preload_libraries pg_stat_statements
..
work_mem None
xmlbinary None
xmloption None
       

When you see None, the parameter is effectively unset in postgresql.conf, and PostgreSQL uses the default value for that engine version. In the example above, jit is None, which means it takes the PostgreSQL 12 default of on, enabling JIT on the instance.
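You can confirm this from a psql session once the instance is up (a quick check; on PostgreSQL 12 the default is on):

benchdb=> show jit;
 jit
-----
 on
(1 row)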

If you set a parameter to a specific value (based on the supported type and unit for that parameter), that value is used instead of the PostgreSQL default. For example, shared_preload_libraries has a default value of pg_stat_statements, which means that when you deploy a PostgreSQL 12 instance, the pg_stat_statements library is preloaded and the extension is available without requiring another restart.

Other interesting parameters are the ones with values containing {} or the GREATEST or LEAST functions. These are macro functions that let the value be computed from DBInstanceClassMemory (in bytes), based on the instance class used by the database instance.

For example, shared_buffers is set to {DBInstanceClassMemory/32768}. In PostgreSQL, when no unit is given, shared_buffers is interpreted as a number of 8KB pages. So this macro sets it to 25% (1/4th) of total RAM expressed in 8192-byte pages: (RAM/8192)/4 = RAM/32768.
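A quick sanity check of the arithmetic for a 64GB instance (just shell arithmetic, assuming DBInstanceClassMemory is the full 64GB):

$ RAM=$((64 * 1024 * 1024 * 1024))   # 64GB in bytes
$ echo $((RAM / 32768))              # shared_buffers in 8KB pages
2097152
$ echo $((RAM / 32768 * 8192 / 1073741824))GB   # back to bytes: 25% of RAM
16GB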


Setting these values correctly is an important task to get optimum usage out of a PostgreSQL database. Let's look at how I think about setting them for an instance.

Let's consider the case of an RDS instance type of db.m4.4xlarge with 64GB of RAM. For simplicity, I am not counting basic Linux kernel memory, RDS monitoring, and other OS processes, but the file system cache will be considered, as it is a significant portion of the memory used by a PostgreSQL instance. The other major component is shared buffers, the common shared memory area used by PostgreSQL processes. The final component is the aggregate of the private memory of each connection to the PostgreSQL database.


TOTAL RAM = File system cache + Shared DB buffer cache + Sum of private memory of all PostgreSQL connections


By default on RDS, shared buffers are set to 25% of RAM. It is fair to assume that file system cache usage could be equal to or greater than that, since all pages come through the file system cache; call it 25-30% of RAM. While shared buffers can be controlled by the shared_buffers parameter, the file system cache cannot be controlled directly; however, it can be freed by the OS in low-memory situations. So in our example of 64GB total memory, we have already accounted for 16GB of shared buffers plus 16-20GB of file system cache, leaving about 28-32GB free for the private memory consumed by database connections. In the rest of the calculation, we assume these two parts together take roughly 50% of RAM.

Private memory of database connections is hard to measure precisely: it is not the RSS of a process but RSS minus the touched shared memory pages, and it depends on the number of connections and the number of work_mem chunks consumed by each connection.

For capacity calculation we could use something as simple as

      Average memory per PostgreSQL connection * Concurrent connections <= 50% of RAM

where average memory per PostgreSQL connection can be simplified to n * work_mem + process overhead, where n varies by the type of query. For example, a query with a JOIN of two tables and an ORDER BY can end up using 2 work_mem chunks on top of the process overhead. Putting that into numbers with the default work_mem of 4MB and an approximate process overhead of roughly 5MB per PostgreSQL process (if Linux huge pages are disabled, this number may need to be bumped higher), each PostgreSQL connection is about 2x4 + 5 = 13MB. If you have 1,000 concurrent connections you may end up consuming about 13GB, and for 2,000 connections that number can jump to 26GB. Hence we should make sure that

work_mem <= ((50% of RAM in KB / concurrent_connections) - 5,000 KB) / 2
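As a sketch of how this plays out, here is the same formula as shell arithmetic for the 64GB instance with 1,000 expected connections (5,000 KB is the rough per-connection overhead assumed above):

$ RAM_KB=$((64 * 1024 * 1024))   # 64GB in KB
$ CONNS=1000                     # expected concurrent connections
$ echo $(( (RAM_KB / 2 / CONNS - 5000) / 2 ))kB
14277kB

With these assumptions there is room to raise work_mem well above the 4MB default; at 2,000 connections the same math allows only about 5,888 kB (roughly 5.75MB).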

Hence query tuning, lowering work_mem, and lowering max_connections can all help control this component of memory. If your queries actually end up requiring more work_mem, then the default 25% for shared_buffers needs to be reduced to make more memory available for your work_mem and max_connections needs. It is perfectly reasonable to take the shared_buffers percentage down to fit the number of concurrent connections, as it is better to have a lower buffer pool hit ratio than to end up using swap space.

The above perspective is a simple way to look at it. I am ignoring other consumers such as temporary tables in memory and other temporary allocations, with the assumption that queries are simple join-and-order-by queries. If you are using temporary tables and have large analytical queries, you have to account for that memory in your average memory per PostgreSQL connection, and then perhaps reduce shared buffers to make sure the total usage stays within total RAM without using swap or causing a large flush of the file system cache.

If you want to lower your shared buffers to, say, 20% instead of the default 25%, you would change the macro for the parameter to (RAM/8192)/5, that is {DBInstanceClassMemory/40960}.

To override a parameter in a particular group you can do as follows:

$ aws rds modify-db-parameter-group --db-parameter-group-name jkpg12pg \
 --region us-east-2 --endpoint https://rds-preview.us-east-2.amazonaws.com  \
 --parameters "ParameterName=shared_buffers,ParameterValue=\"{DBInstanceClassMemory/40960}\",ApplyMethod=pending-reboot" 



When you list the parameters again you will see the change in the parameter group. Notice that for this parameter the ApplyMethod is pending-reboot. Static parameters can only be applied on reboot, and shared_buffers is a static parameter. For dynamic parameters you can also use immediate as the ApplyMethod, which applies the change immediately to all database instances using the parameter group.
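For example, to change a dynamic parameter like work_mem (value in KB) without a reboot, a command along these lines should work:

$ aws rds modify-db-parameter-group --db-parameter-group-name jkpg12pg \
 --region us-east-2 --endpoint https://rds-preview.us-east-2.amazonaws.com \
 --parameters "ParameterName=work_mem,ParameterValue=8192,ApplyMethod=immediate"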

In our case we have not applied the group to the database yet, so it does not matter. Let's first apply it to our database.

$ aws rds modify-db-instance \
 --db-instance-identifier mypg12b3 --db-parameter-group-name jkpg12pg \
 --region us-east-2 --endpoint  https://rds-preview.us-east-2.amazonaws.com 


Note, however, that changing the group does not mean all the parameters are in effect. Since static changes can only be applied after a reboot, we reboot the instance as follows:

$ aws rds reboot-db-instance  --db-instance-identifier mypg12b3  \
--region us-east-2 --endpoint  https://rds-preview.us-east-2.amazonaws.com  

With the reboot, we now have a database running with a custom parameter group whose parameters are tuned for the expected workload. You may not get them right on the first try, but now you know how to set them and apply them to the database using the CLI commands.



This blog represents my own view points and not of my employer, Amazon Web Services.


Thursday, August 29, 2019

Setting up PostgreSQL 12 Beta 3 for Testing in Amazon RDS

One of the amazing things about the PostgreSQL community is launching releases like clockwork. On 8/8/2019, the PostgreSQL community launched not only minor versions for PostgreSQL 11 and older major versions but also a new Beta 3 of the upcoming PostgreSQL 12.

On AWS, you can check versions of PostgreSQL available in your region as follows:

       

$ aws rds describe-db-engine-versions --engine postgres --query 'DBEngineVersions[*].EngineVersion'
[
    "9.3.12", 
...  
    "11.2", 
    "11.4"
]
       
 

You will not see any beta versions in that list. Pre-release versions of PostgreSQL in AWS are available in the Database Preview Environment within US East 2 (Ohio). If you are using the CLI, you have to add the region us-east-2 and the URL endpoint rds-preview.us-east-2.amazonaws.com to your CLI commands.


       

$ aws rds describe-db-engine-versions --engine postgres \
  --query 'DBEngineVersions[*].EngineVersion' \
  --region us-east-2 --endpoint https://rds-preview.us-east-2.amazonaws.com
[
    "12.20190617", 
    "12.20190806"
]
       
 

The versions displayed are a bit cryptic, but they denote the major version followed by the date when the build was synced for the preview release. The version description is friendlier to read than the version itself.


       

$ aws rds describe-db-engine-versions --engine postgres \
  --query 'DBEngineVersions[*].DBEngineVersionDescription' \
  --region us-east-2 --endpoint https://rds-preview.us-east-2.amazonaws.com
[
    "PostgreSQL 12.20190617 (BETA2)", 
    "PostgreSQL 12.20190806 (BETA3)"
]
       
 

Let's deploy an instance of PostgreSQL 12 Beta 3, a.k.a. version 12.20190806.

       

$ aws rds create-db-instance  \
--engine postgres  --engine-version 12.20190806 --db-instance-identifier mypg12b3 \
--allocated-storage 100 --db-instance-class db.t2.small     \
--db-name benchdb  --master-username pgadmin  --master-user-password SuperSecret \
--region us-east-2 --endpoint  https://rds-preview.us-east-2.amazonaws.com  

       
 

After a few minutes, the endpoint will be available and can be queried as follows:

       

$ aws rds describe-db-instances  --db-instance-identifier mypg12b3 --query 'DBInstances[].Endpoint' \
--region us-east-2 --endpoint  https://rds-preview.us-east-2.amazonaws.com 
[
    {
        "HostedZoneId": "ZZOC4A7DETW6VV", 
        "Port": 5432, 
        "Address": "mypg12b3.c9zz9zzzzzzz.us-east-2.rds-preview.amazonaws.com"
    }
]

       
 

If you have a default VPC security group defined in US East 2 (Ohio), you should be able to use the latest psql client to connect based on your default rules. If you do not have a default VPC security group, a new security group is created for you, and you have to add your client to it to reach the database instance. The security group will be in your US East 2 (Ohio) region EC2 security groups for the preview environment.
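If you prefer to open the port from the CLI, something along these lines should work (the security group ID and client CIDR below are placeholders):

$ aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
 --protocol tcp --port 5432 --cidr 203.0.113.10/32 --region us-east-2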

 Once you have your client added to the security group, your client will be able to connect to the database as follows:

       

$ psql -h mypg12b3.c9zz9zzzzzzz.us-east-2.rds-preview.amazonaws.com -d benchdb -U pgadmin 
Password for user pgadmin: 
psql (10.4, server 12beta3)
WARNING: psql major version 10, server major version 12.
         Some psql features might not work.
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)
Type "help" for help.

benchdb=> select version();
                                                  version                                                  
-----------------------------------------------------------------------------------------------------------
 PostgreSQL 12beta3 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.3 20140911 (Red Hat 4.8.3-9), 64-bit
(1 row)

benchdb=> 
       
 

For this given major version, the supported extensions can be queried as follows:

       

benchdb=> show rds.extensions;
                                                  rds.extensions
------------------------------------------------------------------------------------------------------------------
 address_standardizer, address_standardizer_data_us, amcheck, aws_commons, bloom, btree_gin, btree_gist, citext,
 cube, dblink, dict_int, dict_xsyn, earthdistance, fuzzystrmatch, hstore, hstore_plperl, intagg, intarray, ip4r,
 isn, jsonb_plperl, log_fdw, ltree, pageinspect, pg_buffercache, pg_freespacemap, pg_prewarm, pg_similarity,
 pg_stat_statements, pg_trgm, pg_visibility, pgcrypto, pgrouting, pgrowlocks, pgstattuple, pgtap, plperl, plpgsql,
 pltcl, postgis, postgis_tiger_geocoder, postgis_topology, postgres_fdw, prefix, sslinfo, tablefunc, test_parser,
 tsm_system_rows, tsm_system_time, unaccent, uuid-ossp
(1 row)
       
 


Extensions are created using your master username login as follows:
       

benchdb=> CREATE EXTENSION pg_stat_statements;
CREATE EXTENSION
benchdb=> CREATE EXTENSION postgis;
CREATE EXTENSION
benchdb=> CREATE EXTENSION postgis_topology;
CREATE EXTENSION
       
 

To verify the versions of the extensions that I have created:

       
benchdb=> select * from pg_extension;
  oid  |      extname       | extowner | extnamespace | extrelocatable | extversion  |   extconfig   |          extcondition           
-------+--------------------+----------+--------------+----------------+-------------+---------------+---------------------------------
 14299 | plpgsql            |       10 |           11 | f              | 1.0         |               | 
 16402 | pg_stat_statements |       10 |         2200 | t              | 1.7         |               | 
 16410 | postgis            |       10 |         2200 | f              | 3.0.0alpha3 | {16712}       | {"WHERE NOT (                  +
...
       |                    |          |              |                |             |               | )"}
 17418 | postgis_topology   |       10 |        17417 | f              | 3.0.0alpha3 | {17421,17434} | {"",""}
(4 rows)
       
 

One of the recent enhancements, available since the PostgreSQL 11 release in Amazon RDS, is that the pg_stat_statements library is now loaded by default unless explicitly disabled. This means I can use pg_stat_statements immediately after creating the extension.

       
benchdb=> select * from pg_stat_statements order by total_time desc limit 4;
 userid | dbid  |       queryid        |               query               | calls | total_time  |  min_time   |  max_time   |  mean_time  | stddev_time | rows | shared_blks_hit | shared_blks_read | s
hared_blks_dirtied | shared_blks_written | local_blks_hit | local_blks_read | local_blks_dirtied | local_blks_written | temp_blks_read | temp_blks_written | blk_read_time | blk_write_time 
--------+-------+----------------------+-----------------------------------+-------+-------------+-------------+-------------+-------------+-------------+------+-----------------+------------------+--
-------------------+---------------------+----------------+-----------------+--------------------+--------------------+----------------+-------------------+---------------+----------------
     10 | 16384 | -6310040060866956228 | select pg_start_backup($1, $2)    |     1 | 4934.715563 | 4934.715563 | 4934.715563 | 4934.715563 |           0 |    1 |               0 |                0 |  
                 0 |                   0 |              0 |               0 |                  0 |                  0 |              0 |                 0 |             0 |              0
     10 | 16384 |  4124339773204179264 | select pg_stop_backup()           |     1 | 4656.211207 | 4656.211207 | 4656.211207 | 4656.211207 |           0 |    1 |               0 |                0 |  
                 0 |                   0 |              0 |               0 |                  0 |                  0 |              0 |                 0 |             0 |              0
  16394 | 16396 | -2030728853740878493 | CREATE EXTENSION postgis          |     1 |  691.652456 |  691.652456 |  691.652456 |  691.652456 |           0 |    0 |           71359 |              247 |  
               835 |                 707 |              0 |               0 |                  0 |                  0 |              0 |                 0 |             0 |              0
  16394 | 16396 | -2651058291501154175 | CREATE EXTENSION postgis_topology |     1 |   61.100119 |   61.100119 |   61.100119 |   61.100119 |           0 |    0 |            8539 |               26 |  
                57 |                  37 |              0 |               0 |                  0 |                  0 |              0 |                 0 |             0 |              0
(4 rows)

       
 


Note that instances in the preview environment are meant for development and testing for 60 days, so you can try out new features with your applications and optimize them for PostgreSQL 12!



This blog represents my own view points and not of my employer, Amazon Web Services.

Wednesday, February 06, 2019

PGConf.RU 2019: Slides from my sessions

It was my first visit to Moscow for PGConf.RU 2019. I enjoyed meeting the strong PostgreSQL community in Russia!


Slides from my sessions:

1. Deep Dive into the RDS PostgreSQL Universe




2. Tips and Tricks for Amazon RDS for PostgreSQL



This blog represents my own view points and not of my employer, Amazon Web Services.


Thursday, October 18, 2018

Hello PostgreSQL 11 - Get ready to say goodbye to PostgreSQL 9.3

Earlier today (Oct 18, 2018), the PostgreSQL community announced the release of PostgreSQL 11. Having done multiple software releases myself, I appreciate the hard work by all contributors to get yet another major release out on schedule. It is hard to do a major release every year, and the community has been doing it since PostgreSQL 8.4, making this the 10th release in the last decade.

Everybody has their favorite feature in PostgreSQL 11, and the one at the top of my list is transactional support in stored procedures. 2ndQuadrant first announced that feature at the end of last year, and it instantly became my favorite: I see it as a giant leap for PostgreSQL, as it allows people to write long data routines like ETL jobs broken down into multiple transactions. Of course, many users will certainly enjoy the improvements in the table partitioning system, query parallelism, and just-in-time (JIT) compilation for accelerating the execution of expressions in queries. However, developers will get the most new freedom from the stored procedure improvements.
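To give a flavor of the feature, here is a minimal sketch of a procedure committing in batches (staging_data is a hypothetical single-column table; a COMMIT like this is not possible inside a plain function):

CREATE PROCEDURE load_in_batches() LANGUAGE plpgsql AS $$
BEGIN
  FOR i IN 1..1000 LOOP
    INSERT INTO staging_data VALUES (i);
    IF i % 100 = 0 THEN
      COMMIT;  -- each batch of 100 rows commits independently
    END IF;
  END LOOP;
END;
$$;

CALL load_in_batches();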

With the release of PostgreSQL 11, there are now six major releases supported: PostgreSQL 9.3, 9.4, 9.5, 9.6, 10, and 11. It is definitely a good time to start thinking about upgrading your PostgreSQL 9.3 databases. As per the versioning policy, the final minor release for PostgreSQL 9.3 will be on November 8th, 2018. PostgreSQL 9.3 is also the last major version without logical decoding, the foundation for logical replication, which was first introduced in PostgreSQL 9.4. Hence, I expect this will be the last painful upgrade: from PostgreSQL 9.4 onwards you can always leverage logical replication to minimize the downtime while switching to a new version. All is not lost for PostgreSQL 9.3 - while the experience is not exactly the same, there are still tools that use the older trigger-based replication to help, or you can just bite the bullet and upgrade once with a small maintenance window, since later versions will give you more options for your next major version upgrade.

If you need tips and tricks for upgrading your PostgreSQL 9.3 instances,  let me know! :-)

Wednesday, July 25, 2018

Loading data in PostgreSQL 11 Beta Using Native Logical Replication from PostgreSQL 10 in Amazon RDS

In the last blog entry, I talked about creating two instances of PostgreSQL 11 Beta in the Amazon RDS Database Preview Environment and setting up native logical replication. Today, Amazon RDS announced support for PostgreSQL 10.4 with native logical replication. Let's see how to use this new support to replicate data from PostgreSQL 10 in Amazon RDS into a PostgreSQL 11 Beta instance in the preview environment.

I started with a new PostgreSQL 10.4 instance in Amazon RDS and populated it with data from an older IMDB dataset.

benchdb-> \d
                          List of relations
 Schema |                 Name                  |   Type   |  Owner
--------+---------------------------------------+----------+---------
 public | acted_in                              | table    | pgadmin
 public | acted_in_idacted_in_seq               | sequence | pgadmin
 public | actors                                | table    | pgadmin
 public | aka_names                             | table    | pgadmin
 public | aka_names_idaka_names_seq             | sequence | pgadmin
 public | aka_titles                            | table    | pgadmin
 public | aka_titles_idaka_titles_seq           | sequence | pgadmin
 public | genres                                | table    | pgadmin
 public | keywords                              | table    | pgadmin
 public | movies                                | table    | pgadmin
 public | movies_genres                         | table    | pgadmin
 public | movies_genres_idmovies_genres_seq     | sequence | pgadmin
 public | movies_keywords                       | table    | pgadmin
 public | movies_keywords_idmovies_keywords_seq | sequence | pgadmin
 public | series                                | table    | pgadmin
(15 rows)

This "production" PostgreSQL 10 database also has data in it.

benchdb=> select count(*) from acted_in;
 count
--------
 618706
(1 row)

benchdb=> select count(*) from movies;
 count
--------
 183510
(1 row)

benchdb=> select count(*) from series;
 count
--------
 162498
(1 row)


In order to prepare the PostgreSQL 10 instance in Amazon RDS for logical replication, we need to verify that the rds.logical_replication database parameter is enabled. If it is not, you can create a custom parameter group with rds.logical_replication enabled and assign the parameter group to the database instance. In my case, I had already used a custom parameter group with logical replication enabled.

benchdb=> show rds.logical_replication;
 rds.logical_replication
-------------------------
 on
(1 row)
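If rds.logical_replication is not enabled in your parameter group, a CLI sketch along these lines should work (the group name is a placeholder; this is a static parameter, so it applies at reboot):

$ aws rds create-db-parameter-group --db-parameter-group-name pg10repl \
 --db-parameter-group-family postgres10 --description "PostgreSQL 10 with logical replication"
$ aws rds modify-db-parameter-group --db-parameter-group-name pg10repl \
 --parameters "ParameterName=rds.logical_replication,ParameterValue=1,ApplyMethod=pending-reboot"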

In order to use logical replication, a replication user needs to be created in the PostgreSQL 10 instance; the PostgreSQL 11 instance will use it to connect. In Amazon RDS, that is done by granting the rds_replication role to the user.

benchdb=> CREATE USER pg11repluser WITH password 'SECRET';
CREATE ROLE
benchdb=> GRANT rds_replication TO pg11repluser;
GRANT ROLE

For security purposes, it is better that the replication user has only SELECT permissions on the tables to be replicated.

benchdb=> GRANT SELECT ON ALL TABLES IN SCHEMA public TO pg11repluser;
GRANT

The final step inside the database is to create a publication, pgprod10, to indicate which tables need to be replicated. The easiest way to select all tables is as follows:

benchdb=> CREATE PUBLICATION pgprod10 FOR ALL TABLES;
CREATE PUBLICATION


One thing to note here: edit the inbound rules of the security group of the production instance to allow the PostgreSQL 11 Beta instance to connect.

On the PostgreSQL 11 Beta instance, the first thing to do is recreate the schema. We use pg_dump here only to copy the schema over from the PostgreSQL 10 instance.

$ pg_dump -s  -h pg104.XXXXX.us-east-2.rds.amazonaws.com -U pgadmin benchdb > movies_schema.sql


Load the schema into PostgreSQL 11 using the psql client

$ psql -h pg11from10.XXXXX.us-east-2.rds-preview.amazonaws.com -U pgadmin -d benchdb -f movies_schema.sql

Note that you might see errors for GRANT statements if the same users are not defined in the new instance. It is okay to ignore these messages.

ERROR:  role "pg11repluser" does not exist

We are now ready to create the subscription on PostgreSQL 11 Beta. We verify that there are no rows in this case and then confirm that we get all expected rows after the subscription is created.

benchdb=> select count(*) from acted_in;
 count
-------
     0
(1 row)

benchdb=> select count(*) from movies;
 count
-------
     0
(1 row)

benchdb=> select count(*) from series;
 count
-------
     0
(1 row)

benchdb=> CREATE SUBSCRIPTION pg11beta1 CONNECTION 'host=pg104.XXXXX.us-east-2.rds.amazonaws.com dbname=benchdb user=pg11repluser password=SECRET' PUBLICATION pgprod10;
NOTICE:  created replication slot "pg11beta1" on publisher
CREATE SUBSCRIPTION
benchdb=> select count(*) from acted_in;
 count
--------
 618706
(1 row)

benchdb=> select count(*) from movies;
 count
--------
 183510
(1 row)

benchdb=> select count(*) from series;
 count
--------
 162498
(1 row)


benchdb=>

With the new native logical replication support in PostgreSQL 10 in Amazon RDS, it is now easy to replicate data into a PostgreSQL 11 Beta instance in the Amazon RDS Database Preview Environment. It can also be used to replicate data to/from database instances deployed outside of Amazon RDS.

This blog represents my own view points and not of my employer, Amazon Web Services.


Friday, June 08, 2018

Setting up PostgreSQL 11 Beta 1 in Amazon RDS Database Preview Environment


PostgreSQL 11 Beta 1 has been out for more than a couple of weeks. The best way to experience it is to try out the new version and test drive it yourself.

Rather than building it directly from source, I take the easy way out and deploy it in the cloud. Fortunately, it is already available in Amazon RDS Database Preview Environment.



For this post I am going to use the AWS CLI, since the command line is easy to understand, copy, and paste, and easier to script for repetitive testing. To use the Database Preview Environment, the endpoint has to be changed to https://rds-preview.us-east-2.amazonaws.com/ instead of the default for the region.

Because there may be multiple PostgreSQL 11 beta releases, it is important to understand which build version is being deployed. I can always leave it at the default, which is typically the latest preferred version, but a lot of the time I want to be sure of the version I am deploying. The command to get all the versions of PostgreSQL 11 is describe-db-engine-versions.

$ aws rds describe-db-engine-versions --engine postgres --db-parameter-group-family postgres11 --endpoint-url  https://rds-preview.us-east-2.amazonaws.com/ 
{
    "DBEngineVersions": [
        {
            "Engine": "postgres", 
            "DBParameterGroupFamily": "postgres11", 
            "SupportsLogExportsToCloudwatchLogs": false, 
            "SupportsReadReplica": true, 
            "DBEngineDescription": "PostgreSQL", 
            "EngineVersion": "11.20180419", 
            "DBEngineVersionDescription": "PostgreSQL 11.20180419 (68c23cba)", 
            "ValidUpgradeTarget": [
                {
                    "Engine": "postgres", 
                    "IsMajorVersionUpgrade": false, 
                    "AutoUpgrade": false, 
                    "EngineVersion": "11.20180524"
                }
            ]
        }, 
        {
            "Engine": "postgres", 
            "DBParameterGroupFamily": "postgres11", 
            "SupportsLogExportsToCloudwatchLogs": false, 
            "SupportsReadReplica": true, 
            "DBEngineDescription": "PostgreSQL", 
            "EngineVersion": "11.20180524", 
            "DBEngineVersionDescription": "PostgreSQL 11.20180524 (BETA1)", 
            "ValidUpgradeTarget": []
        }
    ]

}

From the above, I see there are two versions: 11.20180419 and 11.20180524. The versions are based on a datestamp, with the description showing the tag information for the version. Since I am interested in the BETA1 version, I use 11.20180524.


$ aws rds create-db-instance --endpoint https://rds-preview.us-east-2.amazonaws.com \
 --allocated-storage 100 --db-instance-class db.t2.small \
 --db-name benchdb --master-username SECRET --master-user-password XXXXX \
 --engine postgres --engine-version 11.20180524 --db-instance-identifier pg11beta1

Once deployed, I can always get the endpoint of the instance as follows:

$ aws rds describe-db-instances --endpoint=https://rds-preview.us-east-2.amazonaws.com --db-instance-identifier pg11beta1 |grep Address
                "Address": "pg11beta1.XXXXXX.us-east-2.rds-preview.amazonaws.com"


In my account, I have already added my client to my default security group.


$ psql -h pg11beta1.XXXX.us-east-2.rds-preview.amazonaws.com -U pgadmin -d benchdb -c 'SELECT VERSION()'
                                                  version
-----------------------------------------------------------------------------------------------------------
 PostgreSQL 11beta1 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.3 20140911 (Red Hat 4.8.3-9), 64-bit


It is hard to test a database without any data. Normally I would just run pgbench directly against it to load data, but this time let me try a different way of loading data into this instance, similar to how people would typically do it from an existing production setup. For this purpose, I need to set up a production database before I proceed.

Before I create a production database instance, I first create a custom parameter group so that I can enable the typical settings I use in a production database. In the preview environment, I created a PostgreSQL 11 family parameter group and edited the group to change one of the parameters as follows:

rds.logical_replication = 1

and saved the group.
Next, I create my production instance using the newly created parameter group.


$ aws rds create-db-instance --endpoint https://rds-preview.us-east-2.amazonaws.com \
 --allocated-storage 100 --db-instance-class db.t2.small \
 --engine postgres --db-name benchdb \
 --master-username pgadmin --master-user-password SECRET \
 --db-instance-identifier pg11prod


It is still empty so I filled it up with my production data.

$ pgbench -i -s 100  -h pg11prod.XXX.us-east-2.rds-preview.amazonaws.com -U pgadmin benchdb


Now I have a typical setup with one production instance and another, empty test instance, and I have to figure out how to get the data into my test instance. I could always dump all the data using pg_dump and restore it on the new instance, but this time I am going to try logical replication.

To set up logical replication between two instances, I first need to recreate the schema on the other instance. pg_dump provides a flag, -s, to dump just the schema with no data. I dump the schema from the production setup


$ pg_dump -s -h pg11prod.XXXX.us-east-2.rds-preview.amazonaws.com -U pgadmin benchdb > schema.txt

and then load the schema into my test setup


$ psql -h pg11beta1.XXXX.us-east-2.rds-preview.amazonaws.com -U pgadmin -d benchdb -f schema.txt



Now I want to actually set up logical replication between the two instances. For this I need a replication user. I could use the master user, but that is too risky. So I create a new user with read-only privileges on the tables in the database and give it the replication rights that work in Amazon RDS.

$ psql -h pg11prod.XXXXX.us-east-2.rds-preview.amazonaws.com -U pgadmin benchdb



benchdb=> CREATE USER repluser WITH PASSWORD 'SECRET';
CREATE ROLE
benchdb=> GRANT rds_replication TO repluser;

GRANT ROLE
benchdb=> GRANT SELECT ON ALL TABLES IN SCHEMA public TO repluser;
GRANT


Next, I have to set up a publication for all tables in the production database.

benchdb=> CREATE PUBLICATION pgprod11 FOR ALL TABLES;
CREATE PUBLICATION

One more thing to add here is to change the inbound rules of the security group of the production instance to allow the test instance to connect.

On my test instance I need to create a subscription to subscribe to all changes happening on my production setup.

$ psql -h pg11beta1.XXXXX.us-east-2.rds-preview.amazonaws.com -U pgadmin benchdb

benchdb=> CREATE SUBSCRIPTION pg11beta1 CONNECTION 'host=pg11prod.XXXXXX.us-east-2.rds-preview.amazonaws.com dbname=benchdb user=repluser password=SECRET' PUBLICATION pgprod11;
NOTICE:  created replication slot "pg11beta1" on publisher
CREATE SUBSCRIPTION

Note: if the command itself takes a long time to execute, it typically means it cannot connect to the production instance. Check the security group to make sure the rule allowing your test instance to connect is set properly. If the connection is allowed, the command returns instantaneously; however, the actual data might still be loading behind the scenes.
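To check on progress, you can peek at the replication slot on the production side (the LSN value below is illustrative):

benchdb=> select slot_name, active, confirmed_flush_lsn from pg_replication_slots;
 slot_name | active | confirmed_flush_lsn
-----------+--------+---------------------
 pg11beta1 | t      | 1/AB0E2C78
(1 row)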

After some time, I can see that my test instance has all the initial data from the production setup.

benchdb=> select count(*) from pgbench_branches;
 count
-------
   100
(1 row)

benchdb=> select count(*) from pgbench_history;
 count
-------
     0
(1 row)

(The table pgbench_history is typically empty after a fresh setup of pgbench)

Now let's run an application workload on our production database, pg11prod.

$ pgbench -c 10 -T 300 -P 10 -h pg11prod.XXXXX.us-east-2.rds-preview.amazonaws.com -U pgadmin benchdb


As the load starts (after the initial vacuum), log into the test instance and check for changes. With pgbench default test, it is easy to verify changes by counting entries in pgbench_history.

$ psql -h pg11beta1.XXXXX.us-east-2.rds-preview.amazonaws.com -U pgadmin benchdb
psql (10.4 (Ubuntu 10.4-2.pgdg16.04+1), server 11beta1)
Type "help" for help.

benchdb=> select count(*) from pgbench_history;
 count
-------
  2211
(1 row)

benchdb=> select count(*) from pgbench_history;
 count
-------
 10484
(1 row)

This is a simple test to see that changes are being propagated from the production instance to the test instance.

I finally have logical replication using a read-only user between two PostgreSQL 11 instances in the Amazon RDS Database Preview Environment.




Pretty cool!


This blog represents my own view points and not of my employer, Amazon Web Services.

Sunday, September 10, 2017

Best Practices with Managed PostgreSQL in the Cloud - #pgopen2017

Best Practices with Managed PostgreSQL in the Cloud @ Postgres Open SV 2017 (#pgopen2017)



Best Practices with Managed PostgreSQL in the Cloud from Jignesh Shah


This blog represents my own view points and not of my employer, Amazon Web Services.

Wednesday, January 06, 2016

PostgreSQL and Linux Containers: #SouthbayPUG Presentation

It was great to talk about Linux containers tonight at the Southbay PostgreSQL User Group at Pivotal.
The slides are now posted online:



Wednesday, September 30, 2015

Mirror Mirror on the wall, Where's the data? In my Vol

When I first started working with docker last year, there was already a clear pattern out there: the docker image itself consists only of the application binary (and, depending on the philosophy, the entire set of OS libraries that are required), and all application data goes in a volume.

The concept called a "Data Container" also seemed to be somewhat popular at that time. Not everyone bought into that philosophy, and various other patterns were emerging for how people used volumes with their docker containers.

One of the emerging patterns was (or still is) "initialize the data if it does not exist" during container startup.
Let's face it: when we first start a docker container holding, say, a PostgreSQL 9.4 database, the volume is an empty file system. We then run initdb and set up a database so that it is ready to serve.
The simplest way is to check whether the data directory has data in it; if it does not, run initdb, apply the most common best practices for the database, and serve it up.

Where's the simplest place to do this? In the entrypoint script of the docker container, of course.

I made the same mistake in my jkshah/postgres:9.4 image too. In fact, I still see the same pattern in the official postgres docker image, where it looks for PG_VERSION and runs initdb if the file does not exist.

if [ ! -s "$PGDATA/PG_VERSION" ]; then
    # volume is empty: initialize a brand new database cluster
    gosu postgres initdb

    ...
fi

This certainly has advantages:
1. Very simple to code the script.
2. Great Out of the box experience - You start the container up - the container sets itself up and it is ready to use.

Let's look at what happens next in real-life enterprise usage.

We ran into scenarios where the applications using such databases were running but had lost all the data in them. Hmm, what's going wrong here? The application is working fine, the database is working fine, but the data looks like it was freshly deployed, not like something that had been running well for 3-5 months.

Let's look at the various activities an enterprise will typically perform on such a data volume - a file system on the host where the PostgreSQL containers are running.
1. The host location of the volume itself will be a mounted file system coming off a SAN or some storage device.
2. Enterprises will back up that file system at periodic intervals.
3. In some cases they will restore that file system when required.
4. Sometimes the backend storage may have hiccups. (No! That does not happen :-) )

In any of the above cases, where a mount fails, mounts the wrong file system, or a restore fails, you could end up with an empty file system at the volume path. (Not everyone has checks for this.)

Now when you start the PostgreSQL docker container on such a volume, you get a brand-new, fully initialized database. Most of the automations I have seen then work such that the application also fully initializes the database with its own schema and initial data, and the application moves on like nothing is wrong.

In the above case, the application may look healthy to all probes until a customer tries to log in to the setup and finds that they do not exist in the system.

For DBAs, the cardinal rule is that a "no data" error is better than wrong or lost data served out of a database (especially for PostgreSQL users). For this reason, this particular pattern of database initialization is becoming an anti-pattern in my view, especially for docker containers. A better approach is to have an entrypoint command specifically to do the setup (initialization) knowingly, and then have all subsequent starts use another entrypoint command that specifically fails if it does not find the data, as sketched below.
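A minimal sketch of what such an entrypoint could look like (assuming PGDATA points at the mounted volume):

#!/bin/sh
# init: knowingly create a brand new cluster; refuse if data already exists
# start: serve an existing cluster; fail loudly if the volume is empty
case "$1" in
  init)
    [ -s "$PGDATA/PG_VERSION" ] && { echo "data already present, refusing to init" >&2; exit 1; }
    gosu postgres initdb
    ;;
  start)
    [ -s "$PGDATA/PG_VERSION" ] || { echo "no database in $PGDATA, run init first" >&2; exit 1; }
    exec gosu postgres postgres
    ;;
  *)
    echo "usage: $0 {init|start}" >&2
    exit 1
    ;;
esac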

Of course again this is a philosophical view on how it should be handled. I would love to hear what people have to say about this.


Sunday, September 20, 2015

Is it a privilege to run a container in Docker?

Recently, while working with various applications in docker containers, we came across a few containers that will not run properly unless privileged mode is enabled. Privileged mode gives the container the same rights as the host, which means it can make changes on the host where the container runs. (A huge difference compared to a VM - imagine your VM making changes to the hypervisor directly.)

Of course privileged mode has its uses, and I am definitely glad it is available. However, it is not a general-purpose option to be used lightly. So imagine my surprise when Chef server, one of the most common tools used in many enterprises, also required privileged mode to run in a docker container. There are various versions of the image available, but they all required the mode.

While investigating Chef server to see why it requires the mode, I found it primarily needs it to set some ulimit parameters and a specific kernel parameter inside the container.

sysctl -w kernel.shmmax=17179869184

Now, before you say "aha, simple, let's change the value on the host itself and let the container pick it up from there" - let me say: been there, it ain't gonna work. The reason is how Linux namespaces work with CLONE_NEWIPC. The net result is that every time a container is created, a new System V IPC namespace is set up with the default shmmax of 32MB. The default will be changed to 4GB in a later Linux kernel, but of course, like most companies, there will not be the patience to wait for that kernel to show up, let alone a certified Linux distro for production setups.
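You can see the per-namespace default for yourself - the container reports the old 32MB default no matter what the host is set to:

$ sysctl kernel.shmmax
kernel.shmmax = 17179869184
$ docker run --rm busybox cat /proc/sys/kernel/shmmax
33554432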

There are a few hacks to work around it, as Jerome indicates in a mailing list, but none of them was suitable.

Now let's go back to the original command that needed to be executed. I have worked with such commands for years, always to increase shared memory for databases that use System V shared memory, like Oracle and PostgreSQL (well, till 9.2).

Guess what: doing a little digging, I found PostgreSQL used as an embedded database in $CHEF_SERVER_INSTALL/embedded/bin/postgres. Checking the version of the "postgres" binary confirmed it to be 9.2.

Checking the latest version of Chef server, I found it still using Postgres 9.2. I eventually ended up creating a custom image using Postgres 9.4 and, voila, got the container running without privileged mode. (PostgreSQL 9.3 switched from System V shared memory to mmap, so newer versions no longer need a large shmmax.) Thanks, Robert Haas.

It also means that as more and more PostgreSQL-based images are deployed in containers, it is better to move to the latest version of PostgreSQL for a better experience.



Friday, February 20, 2015

CentOS 7, Docker, Postgres and DVDStore kit

It's been a long time since I posted an entry. It has been a very busy year - more about that in a later post. I finally had some time to try out new versions of Linux and new OSS technologies.

I started by installing the latest version of CentOS 7. CentOS closely follows RHEL 7, and coming from SLES 11 and the older CentOS 6.5, I saw many interesting changes.

New commands to learn immediately as I started navigating:
systemctl
firewall-cmd

I admit I miss my favorite files in /etc/init.d, and getting used to the new location of /etc/systemd/system/multi-user.target.wants/ will take me a while.

firewall-cmd was actually more welcome, considering how hard I found it to remember the exact rule syntax of iptables.

There is the new GRUB 2, but honestly, lately I do not even worry about it (which is a good thing). Apart from that, XFS is the new default file system, LVM now has snapshot support for Ext4 and XFS, and much more.

However, the biggest draw for me was the support for Linux containers. As a Sun alumnus, I was always drawn into the battle of who did containers first (though I no longer worry about it), but as BSD jails progressed to Solaris Containers and now to the hottest technology, the Docker container, it sure has its appeal.

To install docker, however, you need the "Extras" CentOS 7 repository enabled. But docker is being updated faster than the "Extras" repository, which is getting old at 1.3 while the latest (as of last week) is Docker 1.5. To get Docker 1.5 you need to enable the "virt7-testing" repository on CentOS 7.

I took a shortcut and just created a file /etc/yum.repos.d/virt7-testing.repo with the following contents:

[virt7-testing]
name=virt7-testing
baseurl=http://cbs.centos.org/repos/virt7-testing/x86_64/os/
enabled=1
gpgcheck=0

Then I was ready to install docker as follows

# yum install docker

I found that installing does not start the daemon immediately, so using the new systemctl command I enabled and then started the daemon.

# systemctl enable docker
# systemctl start docker

We now have the setup ready. But what good is a setup unless you have something to demonstrate quickly? This is where I see Docker winning over other container technologies - probably its differentiator: there is an "app store" of container images available to download. You do need a login to access Docker Hub, as it is called, at http://hub.docker.com (which fortunately is free).

# docker login

Log in to the hub, and you are ready to get new images.
I have uploaded two images for today's demonstration:
1. A Standard Postgres 9.4 image
2. A DVDStore benchmark application image based on kit from http://linux.dell.com/dvdstore/

To download the images is as simple as pull
# docker pull jkshah/postgres:9.4
# docker pull jkshah/dvdstore

Now let's see how to deploy them.
Since PostgreSQL 9.4 is a database, it requires storage for persistent data, so first we make a location on the host that can be used for storing the data.

# mkdir /hostpath/pgdata

SELinux is enabled by default on CentOS 7, which means an additional step is required to make the location readable and writable from Linux containers.

# chcon -Rt svirt_sandbox_file_t /hostpath/pgdata

Now we create a container as a daemon, mapping the container port to host port 5432 and setting up a database with a username and password of our choosing. (Please do not use secret as the password :-) )
# docker run -d -p 5432:5432 --name postgres94 -v /hostpath/pgdata:/var/lib/postgresql/data -e POSTGRES_USER=postgres -e POSTGRES_PASSWORD=secret -t jkshah/postgres:9.4


Now if you check /hostpath/pgdata, you will see the database files on the host.
 
Now lets deploy an application using this database container.

# docker run -d -p 80:80 --name dvdstore2 --link postgres94:ds2db --env DS2DBINIT=1 jkshah/dvdstore

The above command starts another container based on the DVDStore image, which expects a database host alias ds2db; that is satisfied by using the link option to link to the database container created earlier. The application container also initializes the database, so it is ready to serve requests on port 80 of the host.
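A quick sanity check that both containers are up and the app answers (assuming the DVDStore web app responds on /):

# docker ps
# curl -s -o /dev/null -w "%{http_code}\n" http://localhost/
200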

This opens up new avenues to now benchmark your PostgreSQL hardware easily. (Wait the load test driver code is still on Windows  :-( )