tpip

Text Processing in Python

Chapter 1

Higher order functions in text processing

Functional programming (FP) style is well suited to text processing in Python. It can be less ambiguous and avoids creating superfluous variables. Pitfalls include deeply nested map() or filter() calls; higher-order functions (HOFs) can help to tame that nesting. A simple library of combinatorial higher-order functions:

from operator import mul, add, truth
# return a list of results of calling each function in fns with arguments
# passed in args
apply_each = lambda fns, args=[]: map(apply, fns, [args]*len(fns))
# convert each value in lst to boolean
bools = lambda lst: map(truth, lst)
# two of the above combined
bool_each = lambda fns, args=[]: bools(apply_each(fns, args))
# call each function in fns with arguments in args, convert results to
# booleans and reduce to a single value
conjoin = lambda fns, args=[]: reduce(mul, bool_each(fns, args))
# combine several functions into one
all = lambda fns: lambda arg, fns=fns: conjoin(fns, (arg,))
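
As a rough illustration of how these combinators compose, here is a Python 3 sketch of the same idea. The notes above target Python 2 (apply() and the builtin reduce() are gone in Python 3), so the definitions are restated with def. The names all_of, is_short and has_digit are made up for this example and do not come from the book.

```python
from functools import reduce
from operator import mul, truth

def apply_each(fns, args=()):
    # call every function in fns with the same positional arguments
    return [fn(*args) for fn in fns]

def bools(lst):
    # convert each value to a boolean
    return [truth(x) for x in lst]

def bool_each(fns, args=()):
    # the two helpers above combined
    return bools(apply_each(fns, args))

def conjoin(fns, args=()):
    # multiplying booleans acts as a logical AND: any False gives 0
    return reduce(mul, bool_each(fns, args))

def all_of(fns):
    # hypothetical name; the notes call this "all", which would shadow
    # the builtin all() in later Python versions
    return lambda arg: conjoin(fns, (arg,))

# hypothetical predicates for the demonstration
is_short = lambda s: len(s) < 10
has_digit = lambda s: any(c.isdigit() for c in s)

short_with_digit = all_of([is_short, has_digit])
print(short_with_digit("abc1"))  # truthy: both predicates hold
print(short_with_digit("abc"))   # falsy: no digit
```

The combined predicate can then be passed to filter() in place of a nested chain of calls.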

Own questions:

  • What's the difference between bool() and truth()? Is truth() deprecated? -- According to the docs, operator.truth(obj) is equivalent to calling bool(obj), and it is not deprecated.

Python datatypes

Polymorphism in Python is achieved by duck typing: if it walks like a duck and quacks like a duck, it's a duck. The Pythonic way of doing things is not to check data types, as is commonly done in C, but rather to check what operations an object supports.

Syntactic constructs in Python work by calling magic methods on objects. Because of this, it is easy to create your own datatypes that mimic built-in types. Starting with Python 2.2, new-style classes allow you to inherit from all basic Python datatypes. A class that inherits from str is going to be faster than a class that inherits from UserString; a class that inherits from list is going to be faster than a class that inherits from UserList, and so on. While it may be desirable to extend one of the basic datatypes, it might be sufficient to inherit from object and implement only those methods that are needed.
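
A minimal sketch of the last point, assuming a made-up class called Words: it inherits from object and implements only the magic methods it needs, yet len(), indexing, "in" and iteration all work on it.

```python
class Words(object):
    """Hypothetical read-only sequence of the words in a string."""

    def __init__(self, text):
        self._words = text.split()

    def __len__(self):
        # called by len(obj)
        return len(self._words)

    def __getitem__(self, index):
        # called by obj[index]; iteration also falls back on this method
        return self._words[index]

    def __contains__(self, word):
        # called by "word in obj"
        return word in self._words

w = Words("the quick brown fox")
print(len(w), w[1], "fox" in w)
print([word.upper() for word in w])
```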

Basic datatypes include:

Files

File-like objects usually support reading and/or writing strings. Other methods, such as seek() or truncate(), are usually supported for files that are actually stored on the filesystem.

Int, Long

int and long are the two standard datatypes for representing integers. Ints are limited in range, usually from -2**31 to 2**31 - 1, depending on the system. Longs are unbounded in size. When an operation on an int exceeds that range, the result is automatically promoted to long. No operation short of int() can demote a long back to an int. A special capability of integers is support for bitwise operations.
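
A small sketch of both properties. It assumes Python 3, where int and long have been merged into a single unbounded int type, so the promotion described above happens invisibly.

```python
# A product that does not fit into 32 bits; no overflow occurs
big = 2**31 * 2**31
print(big)

# Bitwise operations on a small bit pattern (binary 0101 is 5)
flags = 0b0101
print(flags & 0b0100)   # AND
print(flags | 0b0010)   # OR
print(flags ^ 0b0001)   # XOR
print(flags << 2)       # shift left
print(flags >> 1)       # shift right
```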

Floating point & complex numbers

Floating point math is hard. Let's go shopping!

Dictionaries

Dictionaries in Python are mappings from hashable (typically immutable) objects to arbitrary Python objects.

Lists and tuples

Lists are mutable, preferably homogeneous sequences of objects. Tuples are immutable and preferably heterogeneous. The main implication of tuple immutability is that tuples can be used as dictionary keys. Note that dictionaries do not test mutability directly: a key is accepted if hash() succeeds on it, so a tuple that contains a list is still rejected.
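
A short sketch of the dictionary-key behaviour; the grid data is invented for the example.

```python
# Tuples are hashable, so they work as dictionary keys
grid = {(0, 0): "origin", (1, 2): "treasure"}
print(grid[(1, 2)])

# Lists are not hashable, so using one as a key raises TypeError
try:
    grid[[1, 2]] = "nope"
except TypeError as exc:
    list_error = str(exc)
print(list_error)

# A tuple holding a list is immutable itself, but hash() still fails on it
try:
    hash((1, [2]))
    nested_ok = True
except TypeError:
    nested_ok = False
print(nested_ok)
```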

Strings

Strings are similar to tuples in their functionality. Like tuples, strings are immutable sequences. Unlike tuples, they support non-magic methods for character manipulation. Strings also support two styles of interpolation: one similar to C's sprintf() function, the other dictionary-based.

The linecache module gives random access to individual lines of a file by line number and caches what it has read.
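
The two interpolation styles side by side, as a minimal sketch with made-up values:

```python
name, count = "spam", 3

# sprintf-like: values are taken from a tuple, in order
positional = "%s appears %d times" % (name, count)

# dictionary-based: values are looked up by key
by_name = "%(word)s appears %(n)d times" % {"word": name, "n": count}

print(positional)
print(by_name)
```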
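
A sketch of linecache usage, written for Python 3. The temporary file and its contents exist only for the demonstration.

```python
import linecache
import os
import tempfile

# Create a small throwaway file to read from
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("alpha\nbeta\ngamma\n")
    path = tmp.name

second = linecache.getline(path, 2)    # line numbers start at 1
missing = linecache.getline(path, 99)  # out of range gives "" instead of an error
print(repr(second), repr(missing))

linecache.clearcache()  # drop the cached lines
os.remove(path)
```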

Standard library modules

sys

Among other things, the sys module provides information about the interpreter that the current script runs in.

dircache

The dircache module provides a cached, and therefore often faster, version of the os.listdir() function.

struct

The struct module is used to create byte representations of basic datatypes and to read C structs.
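
A minimal sketch of packing and unpacking with struct, in Python 3. The format "<hI" (little-endian, one short followed by one unsigned int, no padding) is just an example layout.

```python
import struct

fmt = "<hI"  # little-endian: 2-byte short, then 4-byte unsigned int

packed = struct.pack(fmt, 1, 2)
print(len(packed), packed)

# unpack() returns a tuple with one entry per format character
values = struct.unpack(fmt, packed)
print(values)
```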

Chapter 2

Common task 1: Quickly sorting lines on custom criteria

The builtin sort() method is very fast when it sorts objects by their "natural order". This natural order might not be the order you need, so sort() accepts an optional comparison callback that gets two objects as arguments and has to return a negative number if the first object is smaller, 0 if the objects are equal and a positive number if the first object is larger. Sorting with a custom callback is much slower than sorting by natural order. The fastest way to sort is then to perform a Schwartzian transform: transform the list of objects so that it can be sorted using the natural order, sort it, and then transform it back.
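
A sketch of the transform on made-up "name age" lines, sorted by the numeric second field. In later Python versions the key= argument of sort() and sorted() does the decoration internally, which is shown at the end for comparison.

```python
lines = ["carol 30", "alice 25", "bob 35"]

# 1. Decorate: pair each line with the value to sort on
decorated = [(int(line.split()[1]), line) for line in lines]

# 2. Sort: tuples compare by their first element, using the natural order
decorated.sort()

# 3. Undecorate: throw the sort value away again
by_age = [line for (_, line) in decorated]
print(by_age)

# The same result with the key= shortcut
by_age_key = sorted(lines, key=lambda line: int(line.split()[1]))
print(by_age_key)
```

Note that when two sort values tie, the decorated version falls back to comparing the lines themselves.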

Common task 2: Reformatting paragraphs of text

Simple algorithm.

Common task 3: Column statistics for delimited or flat-record files

A simple CSV parser that can perform calculations on column values, such as sum, avg etc.
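
The notes do not reproduce the book's parser, so the following is only a sketch of the idea using the standard csv module in Python 3. The function name column_stats and the sample data are assumptions made for the example.

```python
import csv
import io

def column_stats(rows, column):
    # hypothetical helper: sum and average of one numeric column
    values = [float(row[column]) for row in rows]
    total = sum(values)
    return total, total / len(values)

# In-memory stand-in for a delimited file
data = io.StringIO("name,qty,price\napple,3,1.5\npear,5,2.0\n")
rows = list(csv.DictReader(data))

qty_total, qty_avg = column_stats(rows, "qty")
print(qty_total, qty_avg)
```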

Common task 4: Counting characters, words, lines, and paragraphs

wc = len(s.split('\n'))

wc -l on Linux will report one line fewer than the code above for a file that ends with a newline: wc -l counts newline characters, whereas split('\n') also returns the empty string that follows the final newline.
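
The off-by-one can be seen on a small string; counting newlines or using splitlines() gives the figure that wc -l would report.

```python
s = "one\ntwo\nthree\n"

split_count = len(s.split("\n"))      # includes the trailing empty string
newline_count = s.count("\n")         # what wc -l counts
splitlines_count = len(s.splitlines())

print(split_count, newline_count, splitlines_count)
```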

Common task 5: Transmitting binary data as ASCII

Wrapper around Python's builtin binascii encoders.

Common task 6: Creating word or letter histograms

Another simple algorithm, based on the uniqueness of keys in dictionaries.
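
A minimal sketch of such a histogram; the function name histogram is an assumption and the book's own version may differ.

```python
def histogram(items):
    # each distinct item becomes one key; the value is its count
    counts = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1
    return counts

words = histogram("the cat and the hat".split())
letters = histogram("hello")
print(words)
print(letters)
```

The same function works for words and letters because a string iterates over its characters.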

Common task 7: Reading a file backwards by record, line or paragraph

The important part is seek(offset[, whence]) usage. Optional argument whence defaults to 0 (offset from start of file, offset should be >= 0); other values are 1 (move relative to current position, positive or negative), and 2 (move relative to end of file, usually negative, although many platforms allow seeking beyond the end of a file).
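
A sketch of the three whence values on a throwaway file, in Python 3. The file is opened in binary mode because Python 3 text files do not allow nonzero seeks relative to the current position or the end.

```python
import os
import tempfile

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"line one\nline two\nline three\n")
    path = tmp.name

with open(path, "rb") as f:
    f.seek(5, 0)            # whence 0: 5 bytes from the start of the file
    from_start = f.read(3)

    f.seek(1, 1)            # whence 1: skip 1 byte from the current position
    from_current = f.read(4)

    f.seek(-6, 2)           # whence 2: 6 bytes before the end of the file
    from_end = f.read()

print(from_start, from_current, from_end)
os.remove(path)
```

Reading a file backwards builds on the whence=2 form: seek a block before the end, read it, split it into records, and repeat with a larger negative offset.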

String module

String module reference. Most functions in the string module should be used as methods on string objects instead.

Strings as files and files as strings

mmap module - memory-mapped files. Seems like an advanced function, see stackoverflow.com, for example. Maybe check out a Unix programming book for POSIX mmap details and usage? Interestingly enough TPiP itself does not mention POSIX or system calls.
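
For orientation, a small mmap sketch in Python 3, not taken from the book: the mapped file behaves like a byte string that supports slicing and find(). The temporary file is created only for the demonstration.

```python
import mmap
import os
import tempfile

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"first line\nsecond line\n")
    path = tmp.name

with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)   # length 0 maps the whole file
    try:
        pos = mm.find(b"second")    # search as if it were a string
        chunk = mm[pos:pos + 6]     # slice as if it were a string
    finally:
        mm.close()

print(pos, chunk)
os.remove(path)
```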

StringIO, cStringIO - file-like objects that have no connection to the filesystem. cStringIO is much faster, but cannot be subclassed.
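
A sketch of the idea in Python 3, where io.StringIO takes the place of both modules. The function write_report is made up; the point is that it only needs an object with a write() method, so an in-memory buffer works as well as a real file.

```python
import io

def write_report(out, rows):
    # hypothetical helper: works with any object that has write()
    for name, value in rows:
        out.write("%s=%s\n" % (name, value))

buf = io.StringIO()
write_report(buf, [("a", 1), ("b", 2)])

text = buf.getvalue()        # everything written so far, as one string
buf.seek(0)                  # rewind, then read it back like a file
first_line = buf.readline()
print(repr(text), repr(first_line))
```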

Converting between binary and ASCII

  • module base64 - convert to/from base64 encoding (RFC1521).

  • module binascii - convert between binary data and ASCII.

  • module binhex - encode and decode binhex4 files.

  • module quopri - convert to/from quoted printable encoding (RFC1521).

  • module uu - UUencode and UUdecode files.
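
A round-trip sketch with base64 and binascii in Python 3, using arbitrary sample bytes. Both encoders turn binary data into ASCII-only output that can be decoded back to the original.

```python
import base64
import binascii

raw = b"\x00\xffhello"   # includes bytes that are not printable ASCII

b64 = base64.b64encode(raw)
hexed = binascii.hexlify(raw)
print(b64, hexed)

# Decoding restores the original bytes
from_b64 = base64.b64decode(b64)
from_hex = binascii.unhexlify(hexed)
print(from_b64 == raw, from_hex == raw)
```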

Cryptography

crypt, rotor, md5 and sha modules.
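
In current Python the md5 and sha modules have been folded into hashlib, so a present-day sketch of the digest part looks like this. It does not cover crypt or rotor, and the digests shown are of the empty byte string.

```python
import hashlib

# Hex digests of empty input
md5_empty = hashlib.md5(b"").hexdigest()
sha1_empty = hashlib.sha1(b"").hexdigest()
print(md5_empty)
print(sha1_empty)

# Data can also be fed in pieces with update()
h = hashlib.sha1()
h.update(b"text ")
h.update(b"processing")
print(h.hexdigest() == hashlib.sha1(b"text processing").hexdigest())
```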

Compression

  • module zlib - compress and decompress with the zlib library. It is the underlying compression engine for the gzip and zipfile modules.

  • module gzip - read and write gzipped files.

  • module zipfile - read and write ZIP files.
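
A round-trip sketch with zlib and gzip in Python 3. The sample data is deliberately repetitive, and an in-memory buffer stands in for a .gz file on disk.

```python
import gzip
import io
import zlib

text = b"spam and eggs " * 100

# zlib works on byte strings directly
packed = zlib.compress(text)
restored = zlib.decompress(packed)
print(len(text), len(packed))

# gzip wraps the same compression in a file-like interface
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    gz.write(text)

with gzip.GzipFile(fileobj=io.BytesIO(buf.getvalue()), mode="rb") as gz:
    unzipped = gz.read()
```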

Rest of the chapter

Discussion of digital signatures and full-text searching and indexing.

Chapter 3: Regular Expressions

Introduction to regular expressions. Caret ("^") and dollar sign ("$") are zero-width: they match positions rather than characters. The non-greedy modifier is "?"; for example, "th.*?s" is a non-greedy pattern.
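
Both points can be checked with the re module. The sample strings are invented; the pattern "th.*?s" is the one from the notes.

```python
import re

# Zero-width matches: start and end of the span are the same position
caret_span = re.search(r"^", "abc").span()
dollar_span = re.search(r"$", "abc").span()
print(caret_span, dollar_span)

text = "this thus those"

# Greedy: .* grabs as much as possible, then backs up to the last "s"
greedy = re.findall(r"th.*s", text)

# Non-greedy: .*? stops at the first "s" it can
non_greedy = re.findall(r"th.*?s", text)

print(greedy)
print(non_greedy)
```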