Friday, 8 December 2017

Psql directly from Atom text editor.

I'm not a big fan of postgres clients like pgAdmin, not sure why, but I've always favoured psql over everything.
A friend of mine wrote an Atom package that brings psql into Atom, which is awesome. It's tested on Linux and (briefly) on OSX. On OSX I only had to change the path for psql because for some reason it didn't pick it up from the environment. Feel free to check it out and spam him with any issues you might have.

Link for the package can be found here.

Thanks for reading
-- Vasilis Ventirozos

Monday, 27 November 2017

Monitoring statements in pg10

Recently we migrated our first client to postgres 10.1; the database used to run on 9.2.
Everything went smoothly, with a total downtime of 12 seconds, but obviously monitoring broke, so I had to fix that and I thought I'd share the new statements to save people some trouble.
The statements that broke had to do with the change of xlog -> wal (and location -> lsn), which mostly affected replication monitoring statements. Below is what I used to have on PG 9 and what I use for PG 10.

Slot monitoring :
pg 9
SELECT slot_name, database, active, pg_xlog_location_diff(pg_current_xlog_insert_location(), restart_lsn) AS retained_bytes
FROM pg_replication_slots;

pg 10
SELECT slot_name, database, active, pg_wal_lsn_diff(pg_current_wal_insert_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots;

Replication Lag monitoring :
pg 9
SELECT pg_xlog_location_diff(pg_current_xlog_insert_location(), flush_location) AS lag_bytes, application_name
FROM pg_stat_replication;

pg 10
SELECT pg_wal_lsn_diff(pg_current_wal_flush_lsn(), flush_lsn) AS lag_bytes, application_name
FROM pg_stat_replication;
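
As a side note, the standby-side functions got the same renaming treatment; a quick sketch for measuring, on the standby itself, how many bytes of received wal are still waiting to be replayed:

pg 9
SELECT pg_xlog_location_diff(pg_last_xlog_receive_location(), pg_last_xlog_replay_location()) AS replay_lag_bytes;

pg 10
SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) AS replay_lag_bytes;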

Thanks for reading
- Vasilis Ventirozos

Wednesday, 30 August 2017

PostgreSQL on ZFS

What is ZFS? (a paste from Wikipedia)

ZFS is a combined file system and logical volume manager designed by Sun Microsystems. The features of ZFS include protection against data corruption, support for high storage capacities, efficient data compression, integration of the concepts of filesystem and volume management, snapshots and copy-on-write clones, continuous integrity checking and automatic repair, RAID-Z and native NFSv4 ACLs.

Advantages of ZFS.

Data integrity guaranteed.
ZFS has volume management at the filesystem level, which enables copy-on-write. In a traditional filesystem, a changed data block is overwritten in place, so a crash in the middle of a write can leave that data damaged. ZFS does not overwrite data in place; it writes the new data elsewhere and only switches over once the write is verified, which means that your data is safe during a crash. ZFS also uses checksums to ensure that all data is correct; every write is tested. ZFS doesn't just protect data with copy-on-write, it also offers additional protection with an extra RAID level, RAID-Z3, which allows up to three disk failures in a ZFS pool, while traditional RAID levels only allow two disk failures per volume. ZFS also offers mirrors, where each disk keeps its own full copy of the data; with a multi-disk mirror you can have multiple copies, which adds a level of data integrity not found in traditional RAID setups and is also great for reads.
Performance
ZFS also allows sending writes to individual disks instead of just the whole RAID volume. Because of this, ZFS can spread writes around, which speeds up write performance. It also has algorithms that ensure the most frequently used data blocks are kept in the fastest media, meaning that you can use just a few SSD drives as cache.
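
For example, adding an SSD as a cache (L2ARC) device to an existing pool is a one-liner; the pool and device names here are just illustrative:

> zpool add pg_zfs cache /dev/sdc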
Compression
Compression can be enabled per dataset; it's a transparent operation where ZFS compresses and uncompresses data on the fly. Because CPU operations are usually cheaper than I/O, and considering that compression algorithms are much smarter than they used to be, the CPU overhead is usually not even noticeable; combined with the space savings, this is an extremely useful feature for databases. The supported compression algorithms are LZJB, LZ4, ZLE, and gzip (1-9). Today we are going to play with LZ4.
Snapshots
A snapshot is a read-only copy of a file system or volume. Snapshots can be created almost instantly, and they initially consume no additional disk space within the pool. However, as data within the active dataset changes, the snapshot consumes disk space by continuing to reference the old data, thus preventing that disk space from being freed.

-----------

And after this rather long intro on ZFS (that may or may not have pasted descriptions from Oracle), it's time to put it to work for a PostgreSQL database. In my previous post I went through how to install postgres on an OpenZFS volume on Ubuntu, so in this post I will be using a VM that already has a ZFS volume.
Enabling compression (LZ4).
> zfs set compression=lz4 pg_zfs 
Where pg_zfs is the name of the volume.

And yeah, that's pretty much it.
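
If you want to double-check that the property took effect before loading any data:

> zfs get compression pg_zfs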

Now I will initdb a new cluster, create a table and insert some data:

create table test_compression (id int , name text default 'Evol' , surname text default 'Monkey');
insert into test_compression (id) select generate_series (1,5000000);
create unique index id_unique_idx on test_compression (id);
create index name_idx on test_compression (name);
checkpoint ;
select pg_size_pretty(pg_total_relation_size('test_compression'));
 pg_size_pretty
----------------
 426 MB

Now, as root:

root@lab:~# zfs get compressratio pg_zfs
NAME    PROPERTY       VALUE  SOURCE

pg_zfs  compressratio  3.38x  -

Which is pretty neat!

Using snapshots.

To get a snapshot of your database directory, something like this would work, as root:
psql -c "select pg_start_backup('zfs_snapshot');" -U vasilis postgres && zfs snapshot pg_zfs@blogpost && psql -c "select pg_stop_backup();" -U vasilis postgres 

zfs list -t snapshot
NAME              USED  AVAIL  REFER  MOUNTPOINT
pg_zfs@blogpost      0      -   279M  -

To restore, say that we delete the whole data directory :

rm -rf /pg_zfs/pgdata/
killall -9 postgres
zfs rollback pg_zfs@blogpost
/pg_zfs/pgsql/bin/pg_ctl -D /pg_zfs/pgdata/ start
psql -q -t -A -c "select count(*) from test_compression" monkey
5000000

All Done!
One thing to keep in mind is that all these operations are nearly instant.
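
And when a snapshot is no longer needed, dropping it is equally quick (using the snapshot taken above):

> zfs destroy pg_zfs@blogpost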

At OmniTI we use ZFS on OmniOS a lot, and for this reason we have developed some scripts that can help with automation, sending snapshots to remote servers, etc. These scripts can be found here.

The biggest con of ZFS is that it hasn't seen wide adoption on Linux; people report that OpenZFS is production ready, but personally I haven't used it on Linux in production. I've seen that Canonical wanted to promote it, but there were some legal issues (?). The thing is that now, with Red Hat ditching Btrfs, it's a good opportunity for ZFS to be used more. So if you have test / development environments, maybe you could give it a try and spread the word of ZFS awesomeness.


Thanks for reading
- Vasilis


Friday, 16 June 2017

PostgreSQL on ZFS with BPF tracing on top.

At OmniTI we love Solaris; my personal favourite features are ZFS and DTrace. Unfortunately not many people run postgres on Solaris, so I have decided to implement similar features on Linux. Instead of DTrace I'll install BPF, in-kernel bytecode that can be used for tracing, introduced in recent kernels (4.x).
This post will be the first of a three-part series. In this post we'll start with the setup, in part #2 we'll cover ZFS and how to use it for backups / snapshots, and in part #3 we'll dig into BPF a bit more.

Step 1 is to set up a new Ubuntu. I set up a VM using ubuntu-16.04.2-server-amd64.iso.

As root :
Add the repo for bcc :
> echo "deb [trusted=yes] https://repo.iovisor.org/apt/xenial xenial-nightly main" | sudo tee /etc/apt/sources.list.d/iovisor.list

sudo apt-get update

Install all necessary and some optional packages :

> apt-get install -y sudo wget apt-transport-https joe less build-essential libreadline-dev \
zlib1g-dev flex bison libxml2-dev libxslt-dev libssl-dev openssh-server screen git \
binutils bcc bcc-tools libbcc-examples python-bcc zfsutils-linux \
systemtap systemtap-sdt-dev

Make sure the kernel is compiled with BPF support:

> cat /boot/config-`uname -r` |grep BPF
CONFIG_BPF=y
CONFIG_BPF_SYSCALL=y
CONFIG_NETFILTER_XT_MATCH_BPF=m
CONFIG_NET_CLS_BPF=m
CONFIG_NET_ACT_BPF=m
CONFIG_BPF_JIT=y
CONFIG_HAVE_BPF_JIT=y
CONFIG_BPF_EVENTS=y
CONFIG_TEST_BPF=m

Test BCC (which stands for BPF Compiler Collection):

> python /usr/share/bcc/examples/tracing/bitehist.py
Tracing... Hit Ctrl-C to end.
^C
     kbytes              : count     distribution
         0 -> 1          : 7        |************                            |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 22       |****************************************|
         8 -> 15         : 19       |**********************************      |
        16 -> 31         : 8        |**************                          |
        32 -> 63         : 6        |**********                              |
        64 -> 127        : 1        |*                                       |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 1        |*                                       |

Now it's time to install postgres on a ZFS partition. In my case I had a disk (sdb) attached to my VM:

> fdisk /dev/sdb

Welcome to fdisk (util-linux 2.27.1).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.

Device does not contain a recognized partition table.
Created a new DOS disklabel with disk identifier 0x4226e0bf.

Command (m for help): n
Partition type
   p   primary (0 primary, 0 extended, 4 free)
   e   extended (container for logical partitions)
Select (default p): p
Partition number (1-4, default 1):
First sector (2048-41943039, default 2048):
Last sector, +sectors or +size{K,M,G,T,P} (2048-41943039, default 41943039):

Created a new partition 1 of type 'Linux' and of size 20 GiB.

Command (m for help): w
The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.

To create the pool :

> sudo zpool create pg_zfs /dev/sdb1

> zpool status
  pool: pg_zfs
 state: ONLINE
  scan: none requested
config:

NAME        STATE     READ WRITE CKSUM
pg_zfs      ONLINE       0     0     0
 sdb1      ONLINE       0     0     0

errors: No known data errors

> mount |grep pg_
pg_zfs on /pg_zfs type zfs (rw,relatime,xattr,noacl)

> ls /pg_zfs/ -l
total 0

> cd /pg_zfs/
> mkdir pgsql
> mkdir pgdata
> chown postgres:postgres pgsql/
> chown postgres:postgres pgdata/

Now, with everything ready, compile postgres from source:

> wget -c https://ftp.postgresql.org/pub/source/v10beta1/postgresql-10beta1.tar.gz

> tar zxfv postgresql-10beta1.tar.gz

> cd postgresql-10*
> ./configure --prefix=/pg_zfs/pgsql/ --enable-dtrace
> make -j 4 world
> make -j 4 install-world
> export PATH=$PATH:/pg_zfs/pgsql/bin
> export LD_LIBRARY_PATH=/pg_zfs/pgsql/lib
> export PGDATA=/pg_zfs/pgdata
> initdb
> pg_ctl start
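
A quick sanity check that the new cluster really lives on the ZFS mount (using the paths from above):

> psql -At -c "show data_directory;" postgres
/pg_zfs/pgdata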

At this point, postgres binaries and datafiles are on ZFS. Now, to check the probes we have available:

/usr/share/bcc/tools/tplist -l /pg_zfs/pgsql/bin/postgres  |awk {'print $2'}
postgresql:clog__checkpoint__start
postgresql:clog__checkpoint__done
postgresql:multixact__checkpoint__start
postgresql:multixact__checkpoint__done
postgresql:subtrans__checkpoint__start
postgresql:subtrans__checkpoint__done
postgresql:twophase__checkpoint__start
postgresql:twophase__checkpoint__done
postgresql:transaction__start
postgresql:transaction__commit
postgresql:transaction__abort
postgresql:wal__buffer__write__dirty__start
postgresql:wal__buffer__write__dirty__done
postgresql:wal__switch
postgresql:checkpoint__start
postgresql:checkpoint__done
postgresql:wal__insert
postgresql:statement__status
postgresql:buffer__flush__start
postgresql:buffer__flush__done
postgresql:buffer__read__start
postgresql:buffer__read__done
postgresql:buffer__write__dirty__start
postgresql:buffer__write__dirty__done
postgresql:buffer__sync__done
postgresql:buffer__sync__start
postgresql:buffer__sync__written
postgresql:buffer__checkpoint__start
postgresql:buffer__checkpoint__sync__start
postgresql:buffer__checkpoint__done
postgresql:lock__wait__start
postgresql:lock__wait__done
postgresql:deadlock__found
postgresql:lwlock__wait__start
postgresql:lwlock__wait__done
postgresql:lwlock__acquire
postgresql:lwlock__condacquire
postgresql:lwlock__condacquire__fail
postgresql:lwlock__acquire__or__wait
postgresql:lwlock__acquire__or__wait__fail
postgresql:lwlock__release
postgresql:smgr__md__read__start
postgresql:smgr__md__read__done
postgresql:smgr__md__write__start
postgresql:smgr__md__write__done
postgresql:query__parse__start
postgresql:query__parse__done
postgresql:query__rewrite__start
postgresql:query__rewrite__done
postgresql:query__plan__start
postgresql:query__plan__done
postgresql:query__start
postgresql:query__done
postgresql:query__execute__start
postgresql:query__execute__done
postgresql:sort__start
postgresql:sort__done

To make sure tracing works properly, run dbslower while executing a statement in a different terminal:

> sudo /usr/share/bcc/tools/dbslower postgres -p 1208
Tracing database queries for pids 1208 slower than 1 ms...
TIME(s)        PID          MS QUERY
2.729496       1208   2399.665 insert into test select * from generate_series (1,100000);
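
The individual probes listed earlier can also be hooked with bcc's trace tool. A rough sketch that prints each query as it starts; the assumption that the probe's first argument is the query string is mine, so double-check against the postgres documentation:

> sudo /usr/share/bcc/tools/trace -p 1208 'u:/pg_zfs/pgsql/bin/postgres:query__start "%s", arg1'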


To be continued ...
-- Vasilis Ventirozos

Saturday, 10 June 2017

An unusual upgrade

I have mentioned in previous posts that in my 4 years with OmniTI we've tackled a lot of migrations. Most of them follow the "typical" procedure; the methodology we use is more or less explained here. Last week we had a use case for a kind of "unusual" upgrade: a 9.2 compiled with
"--disable-integer-datetimes", meaning that all datetimes were represented as floating point internally, something that was the default up to 8.3. This changed in (I think) 8.4, where datetimes started being represented as int64, which offers more precision.
The requirement was to migrate the database to a new one that would use integer datetimes, with the minimum possible downtime. Obviously a direct upgrade wouldn't work and pg_dump / restore was not an option, so we decided to approach and test this scenario differently.

The general idea is the following :

Upgrade to a 9.6 that was also compiled with "--disable-integer-datetimes", and then use something like pglogical or mimeo to replicate to another 9.6 that uses integer datetimes. For this I used 2 containers and the pagila test database to make the simulation as realistic as I could. In this post I will describe the steps I followed.

Installed both 9.2 and 9.6 on the same box :

9.2.21 with the following options :
./configure --prefix=/home/postgres/pgsql92/ --disable-integer-datetimes
make -j 8 world
sudo make install-world

9.6.2 with the following options :
./configure --prefix=/home/postgres/pgsql96/ --disable-integer-datetimes
make -j 8 world
sudo make install-world

Initialized a new cluster, started 9.2, and loaded the pagila test db (schema and data).
From now on this will act as my production database.

Downloaded and installed pglogical 2.0.1 for 9.6:
wget -c http://packages.2ndquadrant.com/pglogical/tarballs/pglogical-2.0.1.tar.bz2
Uncompress:
tar jxfv pglogical-2.0.1.tar.bz2
Compile and install:
make USE_PGXS=1 clean all
sudo PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/home/postgres/pgsql96/bin make USE_PGXS=1 install

Note: I installed 9.6 before upgrading because pglogical requires some parameter changes (shown later) and a library to be loaded, so in order not to restart twice I had it installed before the upgrade.

Initialized a new 9.6 cluster so I could upgrade, stopped 9.2, and upgraded from 9.2 to 9.6:

Pre-upgrade check:
pgsql96/bin/pg_upgrade -b /home/postgres/pgsql92/bin/ -B /home/postgres/pgsql96/bin/ -c -d /home/postgres/pgdata92/ -D /home/postgres/pgdata96/ -v
Stopped 9.2:
-- outage starts --
pgsql92/bin/pg_ctl -D /home/postgres/pgdata92/ stop
Upgrade:
pgsql96/bin/pg_upgrade -b /home/postgres/pgsql92/bin/ -B /home/postgres/pgsql96/bin/  -d /home/postgres/pgdata92/ -D /home/postgres/pgdata96/ -v -k

Added the following to postgresql.conf:
wal_level = 'logical'
max_worker_processes = 10
max_replication_slots = 10
max_wal_senders = 10
shared_preload_libraries = 'pglogical'
track_commit_timestamp = on

On the master's pg_hba.conf (and the slave's, because why not) I added:
host    replication     postgres        10.0.0.1/16             trust  
(security was not a concern, so "trust" was OK)

Started 9.6.
-- outage stops --
Analyzed 9.6, cleaned up 9.2,
and issued "CREATE EXTENSION pglogical;" to postgres.

At this point I had my "production" db upgraded to 9.6 with pglogical installed and everything ready for logical replication.

The second box had postgres 9.6 compiled without the "--disable-integer-datetimes" flag. I installed pglogical in exactly the same way I did on the first box, and at this point I was ready to replicate:

On production (the provider) I created a new node and added a replication set with all tables in the public schema:
SELECT pglogical.create_node( node_name := 'provider1', dsn := 'host=10.0.0.2 port=5432 dbname=monkey' );
SELECT pglogical.replication_set_add_all_tables('default', ARRAY['public']);

Keep in mind :
ERROR:  table payment_p2007_01 cannot be added to replication set default
DETAIL:  table does not have PRIMARY KEY and given replication set is configured to replicate UPDATEs and/or DELETEs
HINT:  Add a PRIMARY KEY to the table

All tables that are going to be replicated need to have a primary key. For the pagila payment partitions that complained, something like the sketch below did the trick.
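
This assumes payment_id uniquely identifies rows in each partition:

ALTER TABLE payment_p2007_01 ADD PRIMARY KEY (payment_id);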
After adding a PK to the tables that didn't have one, I went to the slave and did the following:

Create a node for the subscriber :
SELECT pglogical.create_node(node_name := 'subscriber1',dsn := 'host=10.0.0.3 port=5432 dbname=monkey');

Started the replication process, which synced both schema and data:
SELECT pglogical.create_subscription( subscription_name := 'subscription1',provider_dsn := 'host=10.0.0.2 port=5432 dbname=monkey', synchronize_structure := true);
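
To keep an eye on the initial sync and the subscription state, pglogical provides a status function (check the pglogical docs for the exact output on your version):

SELECT * FROM pglogical.show_subscription_status();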

To verify that these 2 databases have different storage types :

postgres@old_server:~/pgdata96$ pg_controldata |grep Date
Date/time type storage:               floating-point numbers

postgres@new_server:~/pgdata$ pg_controldata |grep Date
Date/time type storage:               64-bit integers

The database was transferred and is from now on replicated.
At this point, if this were the real deal, it would be preferable to first transfer the schema and then start replication, just to be sure that all objects are transferred, but in my case I didn't really care about that.

Keep in mind that since postgres 10, floating-point datetimes are no longer supported. If you try to compile with the flag, you get:

postgres@a1bdb0750dc5:~/postgresql-10beta1$ ./configure --disable-integer-datetimes
checking build system type... x86_64-pc-linux-gnu
checking host system type... x86_64-pc-linux-gnu
checking which template to use... linux
configure: error: --disable-integer-datetimes is no longer supported 


This is a pretty specialized scenario, but from what I've seen there are some databases out there with their datetimes stored as floating point that could hopefully benefit from this migration procedure.


Thanks for reading
-- Vasilis Ventirozos

Friday, 9 June 2017

Tip for faster wal replay on a slave

I've been in situations where I need a slave db to replay a lot of wal files fast, and by a lot I mean tens of thousands. This could happen because of a reporting database refresh or simply because a slave was down for an extended period of time. It's known that lowering shared_buffers speeds up wal replay for obvious reasons, but by how much?

I did a benchmark on an old server and the results are interesting :

With 32GB of shared buffers and with 6390Mb of wals (1840 wal files)
it took 1408 seconds to complete the replay.

With 64MB of shared buffers and with 6510Mb of wals (1920 wal files)
it took 1132 seconds to complete the replay.

My test was done by stopping the slave, inserting 50 million rows into a test table, waiting for the wal transfer to complete, then stopping the master, starting the slave, and watching the OmniPITR logs.
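
For reference, the write step was nothing fancy; something along these lines generates a comparable amount of wal (the table name is made up for the example), with shared_buffers adjusted in the slave's postgresql.conf between runs:

create table replay_test (id int, payload text);
insert into replay_test select i, md5(i::text) from generate_series(1,50000000) i;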

The performance gain in wal replay was about 20% in postgres 10beta1 which doesn't sound bad, especially in times of need.


Thanks for reading
-- Vasilis

Friday, 19 May 2017

Reusing an old master (as slave) after a switchover

Today's blog post is old news, but it's not very well known, and it has to do with how to re-add an old master after a slave has been promoted. Fujii Masao explained the situation in his patch back in the day.

So in today's post I will be demonstrating a combination of replication slots for retaining the writes on a promoted slave and how to re-introduce an old master to the replication.

Say that we have a master-slave setup. Stop the master gracefully, then create and activate a replication slot on the slave:

monkey=# SELECT * FROM pg_create_physical_replication_slot('this_is_a_replica_slot');
       slot_name        | xlog_position
------------------------+---------------
 this_is_a_replica_slot |
(1 row)

postgres@bf9823730feb:~$ pg_receivexlog -D . -S this_is_a_replica_slot -v -h 10.0.0.3 -U repuser
pg_receivexlog: starting log streaming at 0/4000000 (timeline 1) ^C
pg_receivexlog: received interrupt signal, exiting
pg_receivexlog: not renaming "000000010000000000000004.partial", segment is not complete
pg_receivexlog: received interrupt signal, exiting


This way all the changes after the slave promotion will be retained. Now, let's promote the slave and check the replication slot status.

postgres@bf9823730feb:~$ pg_ctl promote
server promoting

postgres@bf9823730feb:~$ psql -c "SELECT slot_name, database,active,  pg_xlog_location_diff(pg_current_xlog_insert_location(), restart_lsn) AS retained_bytes FROM pg_replication_slots;" postgres
       slot_name        | database | active | retained_bytes
------------------------+----------+--------+----------------
 this_is_a_replica_slot |          | f      |           9056

As you can see, the new master is now retaining all writes. Now, on the old master, let's make a recovery.conf that looks like this:

standby_mode = 'on'
primary_conninfo = 'user=repuser host=10.0.0.3 port=5432 application_name=a_slave'
trigger_file = '/home/postgres/pgdata/finish.recovery'
primary_slot_name = 'this_is_a_replica_slot'
recovery_target_timeline = 'latest'

Where host is obviously the new master.
Start the old master and you should see something similar to:

LOG:  entering standby mode
LOG:  consistent recovery state reached at 0/5000098
LOG:  database system is ready to accept read only connections
LOG:  invalid record length at 0/5000098: wanted 24, got 0
LOG:  fetching timeline history file for timeline 2 from primary server
LOG:  started streaming WAL from primary at 0/5000000 on timeline 1
LOG:  replication terminated by primary server
DETAIL:  End of WAL reached on timeline 1 at 0/5000098.
LOG:  new target timeline is 2
LOG:  restarted WAL streaming at 0/5000000 on timeline 2
LOG:  redo starts at 0/5000098


This would also work if you had archiving enabled instead of replication slots. Given that you have plenty of available disk space on the new master, it should be fine to keep the old master down for any reasonable amount of time and re-introduce it to the replication without issues.

The only two things you have to make sure of are that you shut down postgres on the old master gracefully and that the slave has caught up before the promotion.
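
A simple way to check the second condition on the slave before promoting it (9.x function names, since this predates the wal renaming):

SELECT pg_last_xlog_receive_location() = pg_last_xlog_replay_location() AS caught_up;

This only confirms the standby has replayed everything it received, which, combined with the graceful shutdown of the old master, is what we need here.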


Thanks for reading.