Edit: ugh, disregard this post maybe; still couldn't get the program to both run without errors and produce correct output. Started using Spark locally instead; it's better for my needs anyway. Leaving it up just in case this helps at all.
Here's the goal I set out with: to run *something* on Hadoop. In this case it was simple: for each tweet, get the frequency of each emoji. So if a tweet is:
☃☃☃ it's cold ☃☃☃☹☹
I want to get "[6, 2]", because one emoji appears 6x and another appears 2x.
Then I want to get the overall frequency of each frequency. Basically, I just want to know how many times people tweet an emoji once, and how many times they spam it 10x.
(the clever among you will recognize that this doesn't require any hadoops. indeed, it takes about 10 seconds to just loop through a 500MB text file with a bunch of tweets and spit out the answer, so in the real world I should not be map-reducing anything here. but it's a learning experience; bear with me.)
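For concreteness, here's roughly what that no-Hadoop loop looks like. This is a minimal sketch in Python 2: the `is_emoji` range check is a crude stand-in for real emoji detection, and `tweets.txt` is a made-up filename.

```python
# frequency-of-frequencies without Hadoop (Python 2 sketch)
from collections import defaultdict

def is_emoji(ch):
    # crude codepoint-range heuristic; on narrow Python builds, emoji above
    # U+FFFF arrive as surrogate pairs and slip past this check
    return 0x2600 <= ord(ch) <= 0x27BF or 0x1F300 <= ord(ch) <= 0x1F6FF

freq_of_freqs = defaultdict(int)  # freq_of_freqs[6] = how many (tweet, emoji) pairs had count 6
for line in open('tweets.txt'):
    tweet = line.decode('utf-8', 'ignore')
    counts = defaultdict(int)  # per-tweet count of each emoji
    for ch in tweet:
        if is_emoji(ch):
            counts[ch] += 1
    for freq in counts.values():
        freq_of_freqs[freq] += 1

for freq, times in sorted(freq_of_freqs.items()):
    print '%d\t%d' % (freq, times)
```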
Writing map-reduce code involves separating your code into mappers and reducers. For Python, this is easy: just make one file for your mapper and one for your reducer, each reading lines from sys.stdin and writing lines to stdout. (this was unexpectedly simple for me. I like that they just use stdin and stdout, one data point per line.)
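Here's a minimal sketch of the two scripts for the emoji task (Python 2 syntax, to match the EMR caveat below; `is_emoji` is the same crude heuristic as above). The mapper emits one line per emoji per tweet, with the per-tweet frequency as the key and 1 as the value:

```python
#!/usr/bin/env python
# mapper.py -- for each tweet on stdin, emit "per-tweet frequency \t 1"
import sys
from collections import defaultdict

def is_emoji(ch):
    # crude codepoint-range heuristic, not real emoji detection
    return 0x2600 <= ord(ch) <= 0x27BF or 0x1F300 <= ord(ch) <= 0x1F6FF

for line in sys.stdin:
    tweet = line.decode('utf-8', 'ignore')
    counts = defaultdict(int)
    for ch in tweet:
        if is_emoji(ch):
            counts[ch] += 1
    for freq in counts.values():
        print '%d\t1' % freq
```

Hadoop sorts the mapper output by key before it reaches the reducer, so identical keys arrive as one contiguous run and the reducer just has to sum them:

```python
#!/usr/bin/env python
# reducer.py -- sum the 1s over each contiguous run of identical keys
import sys

current_key, total = None, 0
for line in sys.stdin:
    key, value = line.strip().split('\t')
    if key != current_key:
        if current_key is not None:
            print '%s\t%d' % (current_key, total)
        current_key, total = key, 0
    total += int(value)
if current_key is not None:
    print '%s\t%d' % (current_key, total)
```

A nice side effect of the stdin/stdout convention: you can test the whole pipeline locally with `cat tweets.txt | python mapper.py | sort | python reducer.py`, no Hadoop required.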
Upload everything to S3: your mapper script, your reducer script, and all your input data (in the form of a text file).
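You can do this through the S3 console, or with a few lines of boto (the Python AWS library of the era); a sketch, with a made-up bucket name:

```python
# upload the scripts and input data to S3 with boto
import boto

conn = boto.connect_s3()  # reads credentials from your environment/config
bucket = conn.get_bucket('my-emr-test')  # hypothetical bucket; create it first
for fname in ('mapper.py', 'reducer.py', 'tweets.txt'):
    key = bucket.new_key(fname)
    key.set_contents_from_filename(fname)
```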
Then create an EMR cluster (using all the defaults, but do add an SSH key, and take m1.medium instances if you're just testing, as they're the cheapest), and add a "step" that is a "streaming program". Terminate your cluster when you're done so you don't spend extra money.
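I did this through the console, but for reference, roughly the same thing in boto looks like this (the names, paths, and region are all made up):

```python
# spin up an EMR cluster with one streaming step, via boto
import boto.emr
from boto.emr.step import StreamingStep

conn = boto.emr.connect_to_region('us-east-1')
step = StreamingStep(
    name='emoji frequency of frequencies',
    mapper='s3://my-emr-test/mapper.py',
    reducer='s3://my-emr-test/reducer.py',
    input='s3://my-emr-test/tweets.txt',
    output='s3://my-emr-test/output/',  # must not exist yet (see caveats below)
)
conn.run_jobflow(
    name='emoji test cluster',
    ec2_keyname='my-ssh-key',          # hypothetical EC2 key pair name
    master_instance_type='m1.medium',
    slave_instance_type='m1.medium',
    num_instances=3,
    steps=[step],
)
# with keep_alive left at its default of False, the cluster shuts itself
# down once the steps finish, so you don't keep paying for it
```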
You'll find your output in the output location you chose, as a few files named "part-00000", "part-00001", and so on: one per reducer. Hadoop doesn't merge them into one final output for you; if you want a single file, either run with one reducer or just concatenate the parts yourself, like below.
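A sketch of that concatenation step, assuming you've downloaded the part files to the current directory:

```python
# stitch the part-* files into one output file
import glob

with open('final_output.txt', 'w') as out:
    for part in sorted(glob.glob('part-*')):
        out.write(open(part).read())
```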
Otherwise, good luck! The successful run cost me a couple bucks, plus $15ish in debugging runs. It is frustrating when debugging costs real money. So it goes.
More caveats!
- as of May 5, 2015, Amazon EMR only supports Python 2.6.9. This is mildly frustrating. (Amazon publishes a list of what it supports, which will hopefully be kept up to date.) Test your code on Python 2.6 before you upload it; it'll save you some headaches. (a couple of common 2.7isms to watch for are in the snippet after this list.)
- make sure you set the "output location" to a folder that doesn't exist yet; Hadoop refuses to overwrite an existing output directory, so the job will just crash. You can figure this out from the logs, but each map-reduce job takes ~20 min and costs a buck or two, so better not to waste any.
- sometimes the logs are hard to find. The most valuable ones are available through the EMR console:
Click "view logs" to see logs related to the whole job, or go into "view jobs" then "view tasks" then "view attempts" to see logs for each individual mapper or reducer. Sometimes the logs will not show up. This is frustrating. Wait a few minutes and try again. If they're still not there, then I am not sure why. Sorry about it.