About the Tokenize Module
As the name suggests, the tokenize module can be used to create “tokens” from a paragraph or a chunk of text. Each individual piece returned by the tokenization process is called a token. Once you have tokenized a text, you can implement your own logic in your Python program to process the tokens according to your use case. The tokenize module provides several useful methods for creating tokens, and their usage is best understood through examples. Some of them are explained below.
Tokenizing a Paragraph or Sentence
You can tokenize a paragraph or a sentence with space-separated words using the code sample explained below.
import tokenize
from io import BytesIO

text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
tokens = tokenize.tokenize(BytesIO(text.encode('utf-8')).readline)
for t in tokens:
    print(t)
The first two statements import the Python modules required for converting a piece of text into individual tokens. A variable called “text” contains an example string. Next, the “tokenize” method from the tokenize module is called. It takes a “readline” callable as a mandatory argument, and that callable must return bytes rather than a string for the tokenize method to work correctly. Since the text variable is of “str” type, passing it directly would throw an error. So, using the “BytesIO” class, the text is encoded and wrapped in an in-memory stream of bytes, and that stream's readline method is handed to tokenize.
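If you want to see what tokenize actually receives, here is a quick standalone sketch (not part of the original sample, and using a shortened string for brevity) that calls the readline callable directly:
from io import BytesIO

# A shortened example string, used only to show what readline returns.
readline = BytesIO("Lorem ipsum dolor sit amet.".encode('utf-8')).readline
print(readline())  # b'Lorem ipsum dolor sit amet.' -- one line of the stream, as bytes
print(readline())  # b'' -- an empty bytes object once the stream is exhausted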
The tokenize method yields “TokenInfo” named tuples with five fields: type (the type of the token), string (the text of the token), start (the starting position of the token), end (the ending position of the token), and line (the line that was used for creating the tokens). So after running the above code sample, you should get an output similar to this:
TokenInfo(type=1 (NAME), string='Lorem', start=(1, 0), end=(1, 5), line='Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.')
…
…
TokenInfo(type=54 (OP), string='.', start=(1, 122), end=(1, 123), line='Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.')
TokenInfo(type=4 (NEWLINE), string='', start=(1, 123), end=(1, 124), line='')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
As you can see in the output above, the tokenize method generates “TokenInfo” objects with the five fields mentioned above. If you want to access these fields individually, use dot notation (as shown in the code sample below).
import tokenize
from io import BytesIO

text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
tokens = tokenize.tokenize(BytesIO(text.encode('utf-8')).readline)
for t in tokens:
    print(t.string, t.start, t.end, t.type)
After running the above code sample, you should get the following output:
utf-8 (0, 0) (0, 0) 62
Lorem (1, 0) (1, 5) 1
ipsum (1, 6) (1, 11) 1
…
…
Note that “t.type” returns an integer constant for the token type. If you want a more human-readable token type, use the “token” module and the “tok_name” dictionary available in it.
import tokenize
import token
from io import BytesIO

text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
tokens = tokenize.tokenize(BytesIO(text.encode('utf-8')).readline)
for t in tokens:
    print(t.string, t.start, t.end, token.tok_name[t.type])
By using the “t.type” constant as a key into the “tok_name” dictionary, you can get a human-readable name for the token type. After running the above code sample, you should get the following output:
Lorem (1, 0) (1, 5) NAME
ipsum (1, 6) (1, 11) NAME
dolor (1, 12) (1, 17) NAME
…
…
A full list of all token types and their names is available in the documentation of the “token” module. Note that the first token produced by the tokenize method is always the encoding of the input stream, and its start and end positions are both (0, 0).
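If you prefer to inspect that mapping from a script, a small sketch like the one below prints every token type constant alongside its name, using only the standard token module:
import token

# tok_name maps each integer token type constant to its human-readable name,
# e.g. 1 -> 'NAME' and 4 -> 'NEWLINE' (the exact numbers can differ between
# Python versions).
for value, name in sorted(token.tok_name.items()):
    print(value, name)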
You can easily get a list of just the token strings using a for loop or a list comprehension, as shown in the code sample below.
import tokenize
from io import BytesIO

text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
tokens = tokenize.tokenize(BytesIO(text.encode('utf-8')).readline)
token_list = [t.string for t in tokens]
print(token_list)
After running the above code sample, you should get a list of all the token strings as output, starting with a “utf-8” entry for the encoding token and ending with empty strings for the NEWLINE and ENDMARKER tokens.
You can use the “generate_tokens” method available in the tokenize module if you want to tokenize a string without converting it to bytes. It still takes a callable readline method as the mandatory argument, but it expects readline to return strings rather than bytes (unlike the tokenize method explained above). The code sample below illustrates the usage of the generate_tokens method; the “StringIO” class is now used instead of the BytesIO class.
import tokenize
from io import StringIO

text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
tokens = tokenize.generate_tokens(StringIO(text).readline)
token_list = [t.string for t in tokens]
print(token_list)
After running the above code sample, you should get the same list of token strings as output, minus the “utf-8” entry, because generate_tokens does not emit an encoding token.
Tokenizing Contents of a File
You can use the “with open” statement in “rb” mode to read the contents of a file directly and then tokenize them. The “r” in “rb” stands for read-only mode, while the “b” stands for binary mode. The code sample below opens a “sample.txt” file and tokenizes its contents using the tokenize and readline methods.
import tokenize

with open("sample.txt", "rb") as f:
    tokens = tokenize.tokenize(f.readline)
    token_list = [t.string for t in tokens]
    print(token_list)
You can also use “open”, a convenience method available in the tokenize module that detects a file's encoding and opens it in text mode, and then call the generate_tokens and readline methods to create tokens from a file directly.
import tokenize

with tokenize.open("sample.txt") as f:
    tokens = tokenize.generate_tokens(f.readline)
    token_list = [t.string for t in tokens]
    print(token_list)
Assuming that the sample.txt file contains the same example string, both code samples should print the same list of token strings, except that the first sample's output begins with a “utf-8” entry for the encoding token, which generate_tokens does not emit.
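The samples above assume that sample.txt already exists. If you want to reproduce them locally, a minimal sketch for creating such a file (this helper is an assumption, not part of the original article) could look like this:
# Write the example string used throughout this article to sample.txt so
# that the two file-based code samples above have something to read.
# (Assumption: the original article does not show how sample.txt was created.)
text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
with open("sample.txt", "w", encoding="utf-8") as f:
    f.write(text)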
Conclusion
The tokenize module in Python provides a useful way to tokenize chunks of text containing space-separated words, and it also records the starting and ending position of every token. If you want to tokenize each and every word of a text, the tokenize method goes further than the “split” method: it separates punctuation characters and other symbols into their own tokens and also reports the type of each token.
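As a quick illustration of that difference, here is a small sketch (not part of the original article, using a shortened string) comparing str.split with the token strings produced by generate_tokens:
import tokenize
from io import StringIO

text = "Lorem ipsum dolor sit amet."
# str.split only breaks on whitespace, so the final period stays attached.
print(text.split())  # ['Lorem', 'ipsum', 'dolor', 'sit', 'amet.']
# generate_tokens separates the period into its own OP token; the empty
# NEWLINE/ENDMARKER strings are filtered out here for readability.
tokens = tokenize.generate_tokens(StringIO(text).readline)
print([t.string for t in tokens if t.string])  # ['Lorem', 'ipsum', 'dolor', 'sit', 'amet', '.']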