Having Some Fun with Postgres 9.6

Postgres 9.6 and PostGIS 2.3 were officially released last week. A lot of great additions are included, but the most interesting to me are the changes allowing parallel execution of PostGIS functions, table scans, joins and aggregations, and full text search of phrases. (full changelogs can be found here and here). With these new features, we’ll be able to do large scans in a fraction of the time, and search text by groups of words and even be able to specify the distance between the words.

In my excitement to play with the new features, I ended up creating a table that could make use of all of them:

create table test (id serial primary key, words text, words_vector tsvector, geo geography(POINT,4326));

I needed to generate a bunch of random words and locations, so I wrote a quick Elixir script to generate some SQL that looks like this:

--sql.sql
INSERT INTO test (words, words_vector, geo) VALUES ('rinsing creature purple driving enormous', to_tsvector('rinsing creature purple driving enormous'), ST_GeomFromText('POINT(-87.8838489598126 43.05321645391486)', 4326));
INSERT INTO test (words, words_vector, geo) VALUES ('hurry sweet wrap electricity magical', to_tsvector('hurry sweet wrap electricity magical'), ST_GeomFromText('POINT(-87.89123394285978 43.069957512150665)', 4326));
INSERT INTO test (words, words_vector, geo) VALUES ('soda face class magical acoustics', to_tsvector('soda face class magical acoustics'), ST_GeomFromText('POINT(-87.88803538250723 43.04880785849348)', 4326));
INSERT INTO test (words, words_vector, geo) VALUES ('pie elderly driving guitar turkey', to_tsvector('pie elderly driving guitar turkey'), ST_GeomFromText('POINT(-87.88571839021715 43.06895217356573)', 4326));

If you’re interested in that, I did a small writeup on it here.

And then run it:

psql -d 96_testing --quiet -f sql.sql

When it finishes, we’ll have a table full of all kinds of data.

Parallel Query Execution

It’s important to note that by default, parallel execution is disabled. It can be enabled by setting max_parallel_workers_per_gather to a value greater than 1. If necessary, we also have the option of forcing the query planner’s plan with set force_parallel_mode=on, and it will execute queries in parallel even if it’s slower. These settings are per-database, and so should only need to be set once.

An aggregation of the entire table should be a good opportunity for parallelization:

set max_parallel_workers_per_gather=4;
EXPLAIN ANALYZE SELECT count(*) FROM test;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=7565.30..7565.31 rows=1 width=8) (actual time=47.368..47.368 rows=1 loops=1)
   ->  Gather  (cost=7565.08..7565.29 rows=2 width=8) (actual time=47.302..47.362 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=6565.08..6565.09 rows=1 width=8) (actual time=38.794..38.795 rows=1 loops=3)
               ->  Parallel Seq Scan on test  (cost=0.00..6304.67 rows=104167 width=0) (actual time=0.016..26.256 rows=83333 loops=3)
 Planning time: 0.064 ms
 Execution time: 49.679 ms
(8 rows)

And sure enough, the query was run in parallel! If we need to get all of the rows that have a point within 500m of our central point, we can use newly parallel-eligible PostGIS functions:

EXPLAIN ANALYZE SELECT * FROM "test" AS r0 WHERE ST_DWithin(r0."geo"::geography, ST_SetSRID(ST_MakePoint(-87.9079503, 43.0384303), 4326), 500);
                                                                                                                                                           QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..46897.60 rows=96 width=141) (actual time=0.394..138.373 rows=4967 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on test r0  (cost=0.00..45888.00 rows=40 width=141) (actual time=1.786..124.150 rows=1656 loops=3)
         Filter: ((geo && '0101000020E61000002D3993DB1BFA55C076D6B848EB844540'::geography) AND ('0101000020E61000002D3993DB1BFA55C076D6B848EB844540'::geography && _st_expand(geo, '500'::double precision)) AND _st_dwithin(geo, '0101000020E61000002D3993DB1BFA55C076D6B848EB844540'::geography, '500'::double precision, true))
         Rows Removed by Filter: 81678
 Planning time: 0.754 ms
 Execution time: 141.749 ms

This is all great, but despite this awesome new feature, it is still best in most cases to avoid sequential scans on common queries, and create indices where possible. Parallel queries may be an option to improve the speed of queries that are difficult or wasteful to index, but still require scanning, joining, or aggregating large numbers of rows.

Searching Text for Phrases

We also have our column with random words in it. On earlier versions of Postgres, we were limited to searching rows for text with a specific set of words happening in any order:

select id, words from test where words_vector @@ to_tsquery('magical & cheese & guitar & angle');
   id   |                 words
--------+---------------------------------------
 833384 | sweet magical cheese guitar angle
 835999 | angle magical guitar wrap cheese
 840577 | guitar cheese bells angle magical
 869806 | creature cheese angle magical guitar
 ...
(60 rows)

I’m interested in all of those things, but I’m really most interested in seeing rows where they appear in that exact order. This is solved with the new <-> and <N> operators, or the phraseto_tsquery function. The <N> operator allows us to specify the distance the words should be from each other, whereas the other options are used to search only for adjacent words.

To find all of the rows that have our exact phrase magical cheese guitar angle, we can do:

select id, words from test where words_vector @@ phraseto_tsquery('magical cheese guitar angle');
   id   |               words
--------+-----------------------------------
 833384 | sweet magical cheese guitar angle
(1 row)
 ```
I also happen to be interested in phrases that start have the word `magic` two words before the word `elderly`, but also have `pie` two words after that:
```sql
select id, words from test where words_vector @@ to_tsquery('magic <2> elderly <2> pie');
   id   |                 words
--------+---------------------------------------
 844092 | magical turkey elderly thumb pie
 857782 | magical cheese elderly angle pie
 821614 | magical electricity elderly soda pie
...
(10 rows)

It’s worth noting that the new text search functions are also capable of being parallelized. If we wanted to, we could combine our text search and location search where clauses, and have them run in parallel so we can get phrases that contain magical cheese guitar angles, and also have a location nearby:

EXPLAIN ANALYZE SELECT id, words FROM "test" AS r0 WHERE ST_DWithin(r0."geo"::geography, ST_SetSRID(ST_MakePoint(-87.9079503, 43.0384303), 4326), 800) AND words_vector @@ phraseto_tsquery('magical cheese guitar angle');
                QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=73190.19..73190.20 rows=1 width=32) (actual time=229.269..229.269 rows=1 loops=1)
   ->  Gather  (cost=1000.00..73190.18 rows=1 width=4) (actual time=229.261..229.261 rows=0 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Parallel Seq Scan on test r0  (cost=0.00..72190.08 rows=1 width=4) (actual time=216.342..216.342 rows=0 loops=3)
               Filter: ((geo && '0101000020E61000002D3993DB1BFA55C076D6B848EB844540'::geography) AND ('0101000020E61000002D3993DB1BFA55C076D6B848EB844540'::geography && _st_expand(geo, '800'::double precision)) AND _st_dwithin(geo, '0101000020E61000002D3993DB1BFA55C076D6B848EB844540'::geography, '800'::double precision, true) AND (words_vector @@ phraseto_tsquery('magical cheese guitar angle'::text)))
               Rows Removed by Filter: 83333
 Planning time: 0.846 ms
 Execution time: 231.874 ms

While this isn’t the strangest query, a large amount of Postgres and PostGIS system functions are parallel safe, and so even the most exotic queries may be parallelized. Realistically, a GiST or GIN index would be great for querying the columns in this manner.

After playing around with some of the new features in Postgres 9.6, I’m very excited to start applying them to production applications, and hope you are too!

Feel free to contact me on Twitter at @mitchellhenke with any questions or clarifications. Thanks for reading!