I Lessen Data: 2015

Monday, November 9, 2015

Dump a PostgreSQL table to json

This seems like it should be easy but it isn't, and I've had to look it up a couple times. Here's the magic juice:

copy (select array_to_json(array_agg(row_to_json(t)))
from (
select * from tweet_pgh limit 5
) t
) to '/home/dantasse/data_dump/foo.json';

gets 5 rows from the tweet_pgh table and dumps them to that file. (the directory has to be world-writeable I think, because it'll execute as the postgres user.)

Edit:
COPY (SELECT ROW_TO_JSON(t) FROM (SELECT ST_AsGeoJSON(coordinates), * FROM tweet_austin) t) TO '/data/austin.json';
or just
COPY (SELECT ROW_TO_JSON(t) FROM (SELECT * FROM tweet_austin) t) TO '/data/austin.json';
seems to work better when you're memory limited.

Monday, May 4, 2015

"Let's Hadoop Something" on Amazon EMR

Edit: ugh, disregard this post maybe; still couldn't get the program to both run without errors and produce correct output. Started using Spark locally instead; it's better for my needs anyway. Leaving it up just in case this helps at all.

That's the goal I set out with: to run *something* on Hadoop. In this case it was simple: in each tweet, get the frequency of each emoji. So if a tweet is:
☃☃☃ it's cold ☃☃☃☹☹
I want to get "[6, 2]" because there's one emoji that happens 6x and one that is 2x.
Then I want to get the overall frequency of each frequency. Basically, I just want to know how many times people tweet an emoji once, and how many times they spam it 10x.

(the clever among you would recognize that this doesn't require any hadoops. indeed, it takes about 10 seconds to just loop through a 500mb text file with a bunch of tweets and spit out the answer. so in the real world, I should not be map-reducing anything here. but it's a learning experience; bear with me.)

Writing map-reduce code involves separating your code into mappers and reducers. For python, for example, this is easy: just make one python file for your mapper and one for your reducer, that both take in lines on sys.stdin and output lines to stdout. (this was unexpectedly simple for me. I like that they just use stdin and stdout, and one-data-point-per-line.)

Upload everything to s3: your mapper script, your reducer script, and all your input data (in the form of a text file).

Then create an EMR cluster (using all the defaults, but do add an SSH key and just take m1.medium instances if you're just testing, as they're the cheapest), and add a "step" that is a "streaming program". Terminate your cluster when you're done so you don't spend extra money.

You'll find your output in the output location that you chose, in the form of a few files like "part-00000", "part-00001". Not sure why it doesn't keep reducing until you're down to one final output, but I guess that is for you to do on your own?

Otherwise, good luck! This took me about a couple bucks to run, plus $15ish in debugging. It is frustrating when debugging costs real money. So it goes.

More caveats!
- as of May 5, 2015, Amazon EMR only supports Python 2.6.9. This is mildly frustrating. (here is a list of what it supports that will hopefully be kept up to date) Test your code on python2.6 before you upload it; it'll save you some headaches.
- make sure you set the "output location" to be a folder that doesn't exist yet. Otherwise it will just crash. You can figure this out from the logs, but each map-reduce job is ~20 min and probably a buck or two so better not to waste any.
- sometimes the logs are hard to find. The most valuable ones will be available through the EC2 console. You can find them here:

Click "view logs" to see logs related to the whole job, or go into "view jobs" then "view tasks" then "view attempts" to see logs for each individual mapper or reducer. Sometimes the logs will not show up. This is frustrating. Wait a few minutes and try again. If they're still not there, then I am not sure why. Sorry about it.

Friday, April 10, 2015

MongoDB vs PostgreSQL for geo queries: postgres is indeed pretty much across-the-board faster

ok, I have some benchmark numbers. Two almost-identical databases, one in mongo, one in postgres, so I can actually compare apples to apples. In Mongo I have everything in one collection. In postgres, I have two tables: tweet_pgh, which is all the data, and tweet_pgh_small, which is just the data I need.

These tables are pretty optimized: the mongo collection has indices on user.screen name, coordinates (geospatial), coordinates.coordinates.0 and coordinates.coordinates.1. The sql tables have btree indices on user_screen_name and gist indices on the coordinates (clustered). Oh, and of course I've got PostGIS installed.

Every query here selects everything in a table, then iterates through it (just incrementing a counter) to simulate a semi-realistic use case.

Count in mongo: 3653683

Count in postgres: 3653278

Searching for single user stuff. User A has 2942 items, User B has 1928 items, User C has 3499 items. (using A, B, and C here for their anonymity instead of their real twitter handles.) First three are mongo, the rest are postgres.

Mongo: {'user.screen_name': '(user A)'} 11 sec

Mongo: {'user.screen_name': '(user B)'} 6 sec

Mongo: {'user.screen_name': '(user C)'} 24 sec

Postgres: SELECT * FROM tweet_pgh WHERE user_screen_name = '(user A)'; 5 sec

Postgres: SELECT * FROM tweet_pgh WHERE user_screen_name = '(user B)'; 6 sec

Postgres: SELECT * FROM tweet_pgh WHERE user_screen_name = '(user C)'; 6 sec

Postgres: SELECT * FROM tweet_pgh_small WHERE user_screen_name = '(user A)'; 1.9 sec

Postgres: SELECT * FROM tweet_pgh_small WHERE user_screen_name = '(user B)'; 1.4 sec

Postgres: SELECT * FROM tweet_pgh_small WHERE user_screen_name = '(user C)'; 1.9 sec

Huh. So for these it looks like postgres takes either the same amount of time, half the time, or 1/4 the time. But the real win comes from just having a smaller table.

Searching for geographic areas. And these are *with geospatial indices* on both tables. Here I'm searching for a 0.01-degree lat by 0.01-degree lng box, which has 69k things in it.

Mongo: {'coordinates': {'$geoWithin': {'$geometry': {'type': 'Polygon', 'coordinates': [[[-79.99, 40.44], [-79.99, 40.45], [-80, 40.45], [-80, 40.44], [-79.99, 40.44]]]}}}} - 53 min

Postgres: SELECT * FROM tweet_pgh WHERE ST_MakeEnvelope(-80, 40.44, -79.99, 40.45, 4326) ~ coordinates; 33 sec

Postgres: SELECT * FROM tweet_pgh_small WHERE ST_MakeEnvelope(-80, 40.44, -79.99, 40.45, 4326) ~ coordinates; 3 sec

Then try the same thing with a smaller box (0.001x0.001 degree) with way fewer things (~ 5):

Mongo: {'coordinates': {'$geoWithin': {'$geometry': {'type': 'Polygon', 'coordinates': [[[-79.899, 40.44], [-79.899, 40.441], [-79.9, 40.441], [-79.9, 40.44], [-79.899, 40.44]]]}}}}; 7 min

Postgres: SELECT * FROM tweet_pgh WHERE ST_MakeEnvelope(-79.9, 40.44, -79.899, 40.441, 4326) ~ coordinates; 0.16 sec

Postgres: SELECT * FROM tweet_pgh_small WHERE ST_MakeEnvelope(-79.9, 40.44, -79.899, 40.441, 4326) ~ coordinates; 0.07 sec

(still waiting on the mongo query for this)

Wow! So, wait. So the same query that took 53 minutes in Mongo took 33 seconds in postgres. And three seconds in a reduced-size table.

It's a little unfair; in Mongo you can do a little better by using $gt and $lt on the coordinates instead of doing a $geoWithin - it's easier to compare numbers than coordinates, if you're just doing a box query. So we have the following:

{'coordinates.coordinates.1': {'$gt': 40.45, '$lt': 40.46}, 'coordinates.coordinates.0': {'$gt': -79.95, '$lt': -79.94}} 20 min

{'coordinates.coordinates.1': {'$gt': 40.46, '$lt': 40.461}, 'coordinates.coordinates.0': {'$gt': -79.97, '$lt': -79.969}} 2 min

But this is still nowhere near the postgres time! So:

- use PostgreSQL with PostGIS for geo queries, not MongoDB

- this is especially true if you can index and cluster your dbs

- maybe postgres is a little better for simpler queries too

- in postgres (maybe in mongo too) there is a lot of benefit to be had from cutting down the size of your tables as small as possible

Saturday, January 24, 2015

Dealing with images on the Pebble watch

My goal: a Pokemon watch, that shows a pokemon each minute instead of the numerical minute. I figured I'd get 144x144 images, and then just display one each minute. The challenges:

1. You need black and white (not grayscale) images. This is easily solved with a tool like HyperDither, which uses "dithering" to create shades of gray using only black and white pixels.
For example, changes this color Bulbasaur into this black and white one:

This is nice. HyperDither has a batch mode, which works, but it actually makes slightly smaller PNGs if you do it one at a time.

2. Saving space. You get 24kb RAM per app and 96kb storage for images/fonts per app. It's not easy to fit all the images you need in 96kb. I wanted 60 images, so I had to average ~1.5kb per image. Luckily, black/white 144x144 PNGs are not so big. Using HyperDither, I got pretty close to 96kb, but not quite: I'd have 90kb of PNGs locally, but when I uploaded them to the Pebble, it converted them to "pbi" files, which are almost double the size.

(also note that if you include too many pictures that are over 96kb, you'll get a cryptic error: it'll compile fine, but then will say "Installation failed. Check your phone for details." but no real way to check your phone for details, and no hints about why it fails. You'll get a similar cryptic error if you upload an image but never reference it in your code (stack overflow related question).

Spriting
So I figured there was some overhead per-image, so I tried spriting, or combining multiple images into one big image, like so: (created with InstantSprite, which is awesome and free)

and then you just take a subset of that image at a time. Pebble provides a call, gbitmap_create_as_sub_bitmap, to do exactly that. However, a few problems:
- if you make an image that is too big (like 10 pokemon at a time, or 1440x144), then gbitmap_create_as_sub_bitmap just crashes the app with no feedback. I found that 6 pokemon (864x144) worked, while 1440x144 didn't. Not sure what the actual limit is.
- it doesn't even make the images smaller! 10 PNGs with 6 pokemon each is not really any smaller than 60 PNGs with 1 pokemon each.

uPNG. Forget Spriting.
So, forget spriting. I then found Matthew Hungerford's port of uPNG, which is a library that lets you use raw PNG files instead of converting them to PBI first. Just include the .c and .h files, use gbitmap_create_with_png_resource instead of gbitmap_create_with_resource, and then edit your appinfo.json file to change the type of each image from "png" to "raw".
(editing your appinfo.json may require pushing your code from CloudPebble to Github, then cloning it locally, editing the file, committing and pushing your change. also, you might have to move your image files from resources/images/foo.png to resources/data/foo.png. anyway, this is nice, because you can then upload a bunch of images by editing a text file instead of uploading through the GUI a lot.)

He also provides scripts to transform images to B/W or dithered first, which ended up saving a few kb in my case.

In the end, thanks to uPNG, I had 80kb of images that actually stayed 80kb. (plus a little bit of overhead per image, but didn't matter.) Success! Here's my watch code on Github.

Other Tips
- Use logging, like: APP_LOG(APP_LOG_LEVEL_DEBUG, "hello"); - this took me too long to discover.