Monday, February 3, 2014

Python sentence segmentation, kind of quick and mostly legit

Sentence segmentation (splitting a big block of text into sentences) is not trivial. You can't just split on periods, for example, because you'd get tripped up by every Dr. and Ms. and etc. and so on! However, the problem is mostly solved and available in libraries, so here's a quick way to do it in Python.

NLTK is a pretty general-purpose natural language processing toolkit. You could install the whole thing via the instructions on their website, but that also installs a lot of other NLP tools. Many of these tools can be trained, which makes them more accurate if you have training data, but harder to get started with if you don't. To get a pre-trained model:

- download Punkt from NLTK Data (direct link to Punkt)
- unzip it and copy english.pickle into the same directory as your Python file. This is the trained model, which has been serialized out to a file. (Obviously, this assumes you're segmenting English text; if not, grab one of the other .pickle files.)
- in your python code, unpickle it like so:
import pickle
with open('english.pickle', 'rb') as segmenter_file:
    sentence_segmenter = pickle.load(segmenter_file)
- then call:
sentences = sentence_segmenter.tokenize(text)
(where "text" is a string containing all your text).
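The pickle part of the recipe is just Python's standard object serialization. Here's a minimal stdlib-only sketch of the save/load round trip, using a toy dict as a stand-in for the trained Punkt model (actually loading english.pickle requires nltk to be installed, since the pickle references NLTK's classes):

```python
import os
import pickle
import tempfile

# Stand-in for the trained segmenter; the real english.pickle
# holds an NLTK PunktSentenceTokenizer instead of a dict.
toy_model = {"abbreviations": {"dr", "ms", "etc"}}

# Serialize ("pickle") the object out to a file, as NLTK did
# when producing english.pickle.
path = os.path.join(tempfile.mkdtemp(), "toy.pickle")
with open(path, "wb") as f:
    pickle.dump(toy_model, f)

# Deserialize it back -- the same pattern used to load english.pickle.
with open(path, "rb") as f:
    loaded = pickle.load(f)
```

Note the binary mode ('wb' / 'rb'): pickle files are binary data, so text mode can corrupt them on some platforms.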

4 comments:

  1. This doesn't seem to work, the Unpickler throws an ImportError (ImportError: No module named nltk.tokenize.punkt).

    Any way to get around this without having the entire NLTK overhead?

    1. Guess you do have to install nltk. Looks like you can just do it with pip now though. At a terminal:
      pip install nltk
      Then use the code above.

  2. Thanks for the quick reply Dan. Was hoping to do this task without importing any external libraries.

    Any good tips for doing so (apart from a bunch of regex which drive me nuts)?

    1. Nah, sorry. Well, you could just split on periods and exclamation points if you don't have to be very accurate (this will be pretty bad, but it's the best dead-simple, no-library solution I can think of).

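A stdlib-only version of that fallback might look like the sketch below (the function name is made up for illustration). As warned at the top of the post, it will happily mis-split on abbreviations like "Dr.", which is exactly why the trained Punkt model exists:

```python
import re

def naive_split(text):
    # Split after ., !, or ? when followed by whitespace.
    # Crude: no abbreviation handling, so "Dr. Smith" becomes
    # two "sentences".
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

print(naive_split("It works. Mostly! Right?"))
# ['It works.', 'Mostly!', 'Right?']
```

For comparison, naive_split("Dr. Smith left.") wrongly returns two pieces, which is the failure mode the pre-trained segmenter avoids.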