ok, I have some benchmark numbers. Two almost-identical databases, one in mongo, one in postgres, so I can actually compare apples to apples. In Mongo I have everything in one collection. In postgres, I have two tables: tweet_pgh, which is all the data, and tweet_pgh_small, which is just the data I need.
Both are pretty optimized: the Mongo collection has indices on user.screen_name, coordinates (geospatial), coordinates.coordinates.0, and coordinates.coordinates.1. The Postgres tables have btree indices on user_screen_name and GiST indices on the coordinates (and are clustered on those). Oh, and of course I've got PostGIS installed.
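For reference, the setup looks roughly like the sketch below. I'm assuming pymongo and psycopg2 here; the database/collection names, connection details, and index names are placeholders, but the table, column, and field names match the ones above.

    from pymongo import MongoClient, ASCENDING, GEOSPHERE
    import psycopg2

    # Mongo: one collection, with the indices described above
    coll = MongoClient().tweetdb.tweets   # placeholder db/collection names
    coll.create_index([("user.screen_name", ASCENDING)])
    coll.create_index([("coordinates", GEOSPHERE)])             # 2dsphere geo index
    coll.create_index([("coordinates.coordinates.0", ASCENDING)])
    coll.create_index([("coordinates.coordinates.1", ASCENDING)])

    # Postgres: btree on the screen name, GiST on the PostGIS geometry, then cluster
    # (you could just as easily run these in psql)
    conn = psycopg2.connect(dbname="tweets")   # placeholder connection info
    cur = conn.cursor()
    cur.execute("CREATE INDEX tweet_pgh_screen_name ON tweet_pgh (user_screen_name);")
    cur.execute("CREATE INDEX tweet_pgh_coords ON tweet_pgh USING GIST (coordinates);")
    cur.execute("CLUSTER tweet_pgh USING tweet_pgh_coords;")
    conn.commit()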
Every query here selects all the matching rows or documents, then iterates through them (just incrementing a counter) to simulate a semi-realistic use case.
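Concretely, the timing loop is something like this sketch (pymongo and psycopg2 again; the function names are mine):

    import time

    def time_mongo(coll, query):
        # find() everything matching, walk the cursor, count the documents
        start = time.time()
        n = 0
        for _ in coll.find(query):
            n += 1
        return n, time.time() - start

    def time_postgres(conn, sql):
        # run the SELECT, walk the rows, count them
        start = time.time()
        n = 0
        with conn.cursor() as cur:
            cur.execute(sql)
            for _ in cur:
                n += 1
        return n, time.time() - start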
Count in mongo: 3653683
Count in postgres: 3653278
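(Those counts are just the raw totals in each store, something like this, using the same coll/cur handles as in the setup sketch above:)

    mongo_count = coll.count_documents({})      # .count() on older pymongo versions

    cur.execute("SELECT COUNT(*) FROM tweet_pgh;")
    pg_count = cur.fetchone()[0]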
Searching for a single user's stuff. User A has 2942 items, User B has 1928, and User C has 3499. (I'm using A, B, and C here instead of their real Twitter handles, for anonymity.) The first three queries are Mongo, the rest are Postgres.
Mongo: {'user.screen_name': '(user A)'} 11 sec
Mongo: {'user.screen_name': '(user B)'} 6 sec
Mongo: {'user.screen_name': '(user C)'} 24 sec
Postgres: SELECT * FROM tweet_pgh WHERE user_screen_name = '(user A)'; 5 sec
Postgres: SELECT * FROM tweet_pgh WHERE user_screen_name = '(user B)'; 6 sec
Postgres: SELECT * FROM tweet_pgh WHERE user_screen_name = '(user C)'; 6 sec
Postgres: SELECT * FROM tweet_pgh_small WHERE user_screen_name = '(user A)'; 1.9 sec
Postgres: SELECT * FROM tweet_pgh_small WHERE user_screen_name = '(user B)'; 1.4 sec
Postgres: SELECT * FROM tweet_pgh_small WHERE user_screen_name = '(user C)'; 1.9 sec
Huh. So for these it looks like postgres takes either the same amount of time, half the time, or 1/4 the time. But the real win comes from just having a smaller table.
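For what it's worth, tweet_pgh_small is nothing fancy - just a copy of tweet_pgh with only the data I actually need, indexed and clustered the same way. Roughly like this sketch (the column list here is hypothetical):

    # Column list is hypothetical - keep whichever fields you actually query.
    cur.execute("""
        CREATE TABLE tweet_pgh_small AS
        SELECT id, user_screen_name, coordinates, text
        FROM tweet_pgh;
    """)
    cur.execute("CREATE INDEX tweet_small_screen_name ON tweet_pgh_small (user_screen_name);")
    cur.execute("CREATE INDEX tweet_small_coords ON tweet_pgh_small USING GIST (coordinates);")
    cur.execute("CLUSTER tweet_pgh_small USING tweet_small_coords;")
    conn.commit()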
Searching for geographic areas. And these are *with geospatial indices* on both tables. Here I'm searching for a 0.01-degree lat by 0.01-degree lng box, which has 69k things in it.
Mongo: {'coordinates': {'$geoWithin': {'$geometry': {'type': 'Polygon', 'coordinates': [[[-79.99, 40.44], [-79.99, 40.45], [-80, 40.45], [-80, 40.44], [-79.99, 40.44]]]}}}} 53 min
Postgres: SELECT * FROM tweet_pgh WHERE ST_MakeEnvelope(-80, 40.44, -79.99, 40.45, 4326) ~ coordinates; 33 sec
Postgres: SELECT * FROM tweet_pgh_small WHERE ST_MakeEnvelope(-80, 40.44, -79.99, 40.45, 4326) ~ coordinates; 3 sec
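(If you want to double-check that the GiST index is actually getting used for that envelope query, EXPLAIN will tell you. A quick sketch, same cursor as above:)

    cur.execute("""
        EXPLAIN ANALYZE
        SELECT * FROM tweet_pgh
        WHERE ST_MakeEnvelope(-80, 40.44, -79.99, 40.45, 4326) ~ coordinates;
    """)
    for (line,) in cur.fetchall():
        print(line)   # look for an Index Scan / Bitmap Index Scan on the coordinates index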
Then try the same thing with a smaller box (0.001x0.001 degree) with way fewer things (~ 5):
Mongo: {'coordinates': {'$geoWithin': {'$geometry': {'type': 'Polygon', 'coordinates': [[[-79.899, 40.44], [-79.899, 40.441], [-79.9, 40.441], [-79.9, 40.44], [-79.899, 40.44]]]}}}} 7 min
Postgres: SELECT * FROM tweet_pgh WHERE ST_MakeEnvelope(-79.9, 40.44, -79.899, 40.441, 4326) ~ coordinates; 0.16 sec
Postgres: SELECT * FROM tweet_pgh_small WHERE ST_MakeEnvelope(-79.9, 40.44, -79.899, 40.441, 4326) ~ coordinates; 0.07 sec
Wow! So, wait. So the same query that took 53 minutes in Mongo took 33 seconds in postgres. And three seconds in a reduced-size table.
It's a little unfair; in Mongo you can do a little better by using $gt and $lt on the raw coordinates instead of doing a $geoWithin - comparing plain numbers is cheaper than doing real geometry, if all you want is a box query. So we have the following:
{'coordinates.coordinates.1': {'$gt': 40.45, '$lt': 40.46}, 'coordinates.coordinates.0': {'$gt': -79.95, '$lt': -79.94}} 20 min
{'coordinates.coordinates.1': {'$gt': 40.46, '$lt': 40.461}, 'coordinates.coordinates.0': {'$gt': -79.97, '$lt': -79.969}} 2 min
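(Run through the same timing loop as before, that range-query version looks like this - using the time_mongo sketch from up top:)

    # Plain numeric range filters: .0 is longitude, .1 is latitude
    box_range = {'coordinates.coordinates.0': {'$gt': -79.95, '$lt': -79.94},
                 'coordinates.coordinates.1': {'$gt': 40.45, '$lt': 40.46}}
    print(time_mongo(coll, box_range))   # the ~20-minute query above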
But this is still nowhere near the postgres time! So:
- use PostgreSQL with PostGIS for geo queries, not MongoDB
- this is especially true if you can index and cluster your dbs
- maybe postgres is a little better for simpler queries too
- in postgres (and maybe in mongo too) there's a lot of benefit to be had from cutting your tables down to be as small as possible