Wednesday, August 17, 2011

Double-UTF

I keep a bit of a snapshot of all the music I've liked by picking one song per album and putting it into a seasonal playlist in iTunes. So I've got "2006 Spring", for example, which has about a dozen songs I liked listening to in spring 2006.

But I've only stored these in iTunes. Not only does that mean they're locked within the Apple Empire, they're also vulnerable to me losing my hard drive. So I wanted to get them into real text files. Luckily, iTunes lets you export playlists. Unluckily, the export comes in some bizarre janky format, when I really just want the artist, title, and album for each song. Simple Python script to the rescue.

Ah, but even after deleting some of the crud, I was left with a file in a mash of encodings! See, I had pulled out artist, title, and album, then concatenated them with commas, then written that to a file. But I hadn't paid attention to encodings, so I had some UTF-16 characters, then some UTF-8 commas, then more UTF-16 characters. But Python has an easy answer: read the input file as UTF-16, specify that the output file is UTF-8, and within the script just deal with strings and don't worry about encodings.
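
Here is a tiny sketch of both the mix-up and the fix, assuming Python 3. The artist and title are made-up sample values, not lines from a real export, and the variable names are only for illustration.

```python
# Hypothetical sample values, not taken from a real playlist export.
artist = "Björk"
title = "Jóga"

# The mistake: UTF-16 text glued together with single-byte commas.
# The result is neither valid UTF-8 nor the UTF-16 text you meant.
mixed = artist.encode("utf-16-le") + b", " + title.encode("utf-16-le")

# The fix: decode once on the way in, work with text in the middle,
# and encode once on the way out.
raw = (artist + "\t" + title).encode("utf-16")  # stands in for the exported bytes
text = raw.decode("utf-16")                     # bytes -> str
fields = text.split("\t")                       # plain string handling
clean = ", ".join(fields).encode("utf-8")       # str -> bytes

print(clean.decode("utf-8"))
```

Opening the files with an explicit encoding, as the script in this post does, performs the decode and encode steps for you.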

Tim Bray explains UTF-8, UTF-16, and UTF-32 clearly; this is something I probably should have thoroughly understood a while ago.
Evan Jones has a nice overview of how to use unicode in Python.
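
For a quick feel of how the three encodings differ, this small sketch (Python 3; the helper name `encoded_sizes` is my own invention) counts the bytes each one needs for a single character. It uses the little-endian variants so that no byte order mark is added to the count.

```python
def encoded_sizes(ch):
    """Return the number of bytes `ch` takes in each encoding (no BOM)."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(ch.encode(enc)) for enc in encodings}


# An ASCII letter, an accented letter, a CJK character, and a
# character outside the Basic Multilingual Plane (U+1D11E).
for ch in ["a", "é", "日", "𝄞"]:
    print("U+%04X" % ord(ch), encoded_sizes(ch))
```

UTF-8 and UTF-16 are variable-width, which is why the counts change from character to character, while UTF-32 always uses four bytes.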

And here's my script:


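#!/usr/bin/env python
import codecs


# filenames.txt lists one exported playlist file per line.
with open("filenames.txt") as filenames: # next time I'll learn
                                         # syntax for "for filename
                                         # in current directory"
    for filename in filenames:
        filename = filename.strip()
        if not filename:
            continue  # skip blank lines in the list
        # The "output" directory must already exist.
        outfilename = "output/" + filename.replace(" ", "_")
        # "with" closes both files even if something goes wrong.
        with codecs.open(outfilename, "w", "utf-8") as outfile:
            with codecs.open(filename, "r", "utf-16") as infile:
                for bigline in infile:
                    # one line read may hold several rows separated by "\r"
                    for line in bigline.split("\r"):
                        parts = line.split("\t")
                        if len(parts) < 4:
                            continue  # not a full row
                        song = parts[0]
                        artist = parts[1]
                        album = parts[3]
                        linetowrite = "%s, %s, %s\n" % (artist, song, album)
                        outfile.write(linetowrite)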
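
As for the "for filename in current directory" loop that the comment in the script wishes for, one option is `os.listdir` from the standard library. This is only a sketch: the helper name `playlist_files` is hypothetical, and it assumes the exported playlists end in `.txt`.

```python
import os


def playlist_files(directory=".", suffix=".txt"):
    """Return the sorted names of regular files in `directory` ending in `suffix`."""
    names = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # Skip subdirectories (such as "output") and unrelated files.
        if os.path.isfile(path) and name.endswith(suffix):
            names.append(name)
    return names
```

With something like this, `for filename in playlist_files():` could replace the loop over `filenames.txt`.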

1 comment:

  1. I'd estimate that about 80-90% of programmers don't understand Unicode. It's an amazingly smart encoding. I didn't really understand it until this year, after reading http://www.joelonsoftware.com/articles/Unicode.html (I don't agree with everything Joel says, but he does write well).
