Monday, February 3, 2014

Python sentence segmentation, kind of quick and mostly legit

Sentence segmentation (splitting a big block of text into sentences) is not trivial. You can't just split on periods, for example, because you'd get tripped up by every Dr. and Ms. and etc. and so on! However, the problem is mostly solved and available in libraries, so here's a quick way to do it in Python.

NLTK is a pretty general-purpose natural language processing toolkit. You could install the whole thing via the instructions on their website, but that also installs a lot of other NLP tools. Many of these tools can be trained, which makes them more accurate if you have training data, but harder to get started with if you don't. To get a pre-trained model:

- download Punkt from NLTK Data (direct link to Punkt)
- unzip it and copy english.pickle into the same directory as your Python file. This is the trained model, which has been serialized out to a file. (Obviously, this assumes you're segmenting English text; if not, grab one of the other .pickle files.)
- in your python code, unpickle it like so:
import pickle
with open('english.pickle', 'rb') as segmenter_file:
    sentence_segmenter = pickle.load(segmenter_file)
- then call:
sentences = sentence_segmenter.tokenize(text)
(where "text" is a string containing all your text).
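The pickle part of the recipe is just Python's standard object serialization. Here's a minimal stdlib-only sketch of the save/load round trip, using a toy dict as a stand-in for the trained Punkt model (actually loading english.pickle requires nltk to be installed, since the pickle references NLTK's classes):

```python
import os
import pickle
import tempfile

# Stand-in for the trained segmenter; the real english.pickle
# holds an NLTK PunktSentenceTokenizer instead of a dict.
toy_model = {"abbreviations": {"dr", "ms", "etc"}}

# Serialize ("pickle") the object out to a file, as NLTK did
# when producing english.pickle.
path = os.path.join(tempfile.mkdtemp(), "toy.pickle")
with open(path, "wb") as f:
    pickle.dump(toy_model, f)

# Deserialize it back -- the same pattern used to load english.pickle.
with open(path, "rb") as f:
    loaded = pickle.load(f)
```

Note the binary mode ('wb' / 'rb'): pickle files are binary data, so text mode can corrupt them on some platforms.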

4 comments:

  1. This doesn't seem to work, the Unpickler throws an ImportError (ImportError: No module named nltk.tokenize.punkt).

    Any way to get around this without having the entire NLTK overhead?

    1. Guess you do have to install nltk. Looks like you can just do it with pip now though. At a terminal:
      pip install nltk
      Then use the code above.

  2. Thanks for the quick reply Dan. Was hoping to do this task without importing any external libraries.

    Any good tips for doing so (apart from a bunch of regex which drive me nuts)?

    1. Nah, sorry. Well, you could just split on periods and exclamation points if you don't have to be very accurate (this will be pretty bad, but it's the best dead-simple, no-library solution I can think of).

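A stdlib-only version of that fallback might look like the sketch below (the function name is made up for illustration). As warned at the top of the post, it will happily mis-split on abbreviations like "Dr.", which is exactly why the trained Punkt model exists:

```python
import re

def naive_split(text):
    # Split after ., !, or ? when followed by whitespace.
    # Crude: no abbreviation handling, so "Dr. Smith" becomes
    # two "sentences".
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

print(naive_split("It works. Mostly! Right?"))
# ['It works.', 'Mostly!', 'Right?']
```

For comparison, naive_split("Dr. Smith left.") wrongly returns two pieces, which is the failure mode the pre-trained segmenter avoids.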