
How to Use the Tokenize Module in Python

This article is a guide on using the “Tokenize” module in Python. The tokenize module can be used to segment or divide text into small pieces in various ways. You can use these segments in Python applications that rely on machine learning, natural language processing, and artificial intelligence algorithms. All the code samples in this article were tested with Python 3.9.5 on Ubuntu 21.04.

About the Tokenize Module

As the name suggests, the tokenize module can be used to create “tokens” from a paragraph or a chunk of text. Each individual piece returned after the tokenization process is called a token. Once you tokenize a text, you can implement your own logic in your Python program to process the tokens according to your use case. The tokenize module provides some useful methods that can be used to create tokens. Their usage is best understood through examples; some of them are explained below.

Tokenizing a Paragraph or Sentence

You can tokenize a paragraph or a sentence with space-separated words using the code sample explained below.

import tokenize

from io import BytesIO

 

text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."

tokens = tokenize.tokenize(BytesIO(text.encode('utf-8')).readline)

for t in tokens:

        print (t)

The first two statements import the Python modules required for converting a piece of text into individual tokens. A variable called “text” contains an example string. Next, the “tokenize” method from the tokenize module is called. It takes a “readline” method as a mandatory argument. Since the text variable is of “str” type, using it directly will throw an error. The readline argument must be a callable that returns bytes instead of a string for the tokenize method to work correctly. So, using the “BytesIO” class, the text is converted into a stream of bytes by specifying an encoding type.
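
If you are curious what the error looks like when readline returns strings instead of bytes, the snippet below (a hypothetical illustration, not part of the original example) passes a StringIO-based readline to the tokenize method; on the Python versions tested here, this typically fails with a TypeError raised while the encoding is being detected.

import tokenize
from io import StringIO

text = "Lorem ipsum dolor sit amet."

try:
    # readline here returns str objects, which tokenize.tokenize cannot handle
    list(tokenize.tokenize(StringIO(text).readline))
except TypeError as error:
    # Typically raised during encoding detection when str is supplied instead of bytes
    print("tokenize.tokenize rejected the str input:", error)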

The tokenize method generates a named tuple containing five fields: type (the type of the token), string (the text of the token), start (the starting position of the token), end (the ending position of the token), and line (the line that was used for creating the tokens). So after running the above code sample, you should get an output similar to this:

TokenInfo(type=62 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')

TokenInfo(type=1 (NAME), string='Lorem', start=(1, 0), end=(1, 5), line='Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.')





TokenInfo(type=54 (OP), string='.', start=(1, 122), end=(1, 123), line='Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.')

TokenInfo(type=4 (NEWLINE), string='', start=(1, 123), end=(1, 124), line='')

TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')

As you can see in the output above, the tokenize method generates “TokenInfo” objects with the five fields mentioned above. If you want to access these fields individually, use dot notation (as shown in the code sample below).

import tokenize

from io import BytesIO

 

text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."

tokens = tokenize.tokenize(BytesIO(text.encode('utf-8')).readline)

for t in tokens:

        print (t.string, t.start, t.end, t.type)

After running the above code sample, you should get the following output:

 

utf-8 (0, 0) (0, 0) 62

Lorem (1, 0) (1, 5) 1

ipsum (1, 6) (1, 11) 1



Note that “t.type” returns a numeric constant for the token type. If you want a more human-readable token type, use the “token” module and the “tok_name” dictionary available in it.

import tokenize

from io import BytesIO

import token

 

text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."

tokens = tokenize.tokenize(BytesIO(text.encode('utf-8')).readline)

for t in tokens:

        print (t.string, t.start, t.end, token.tok_name[t.type])

By supplying the “t.type” constant to the “tok_name” dictionary, you can get a human-readable name for the token type. After running the above code sample, you should get the following output:

utf-8 (0, 0) (0, 0) ENCODING

Lorem (1, 0) (1, 5) NAME

ipsum (1, 6) (1, 11) NAME

dolor (1, 12) (1, 17) NAME



A full list of all token types and their names is available in the official documentation of the “token” module. Note that the first token is always the encoding type of the input stream, and its start and end positions are both (0, 0).
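
If you prefer to inspect this mapping from within Python instead of the documentation, you can print the “tok_name” dictionary directly, as in the small illustrative snippet below.

import token

# tok_name maps the numeric token type constants to human-readable names
print(token.tok_name[token.NAME])   # NAME
print(token.tok_name[token.OP])     # OP
print(token.tok_name)               # the full mapping of all token types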

You can easily get a list of just the token strings using for loop statements or list comprehensions, as shown in the code sample below.

import tokenize

from io import BytesIO

 

text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."

tokens = tokenize.tokenize(BytesIO(text.encode('utf-8')).readline)

token_list = [t.string for t in tokens]

print (token_list)

After running the above code sample, you should get the following output:

['utf-8', 'Lorem', 'ipsum', 'dolor', 'sit', 'amet', ',', 'consectetur', 'adipiscing', 'elit', ',', 'sed', 'do', 'eiusmod', 'tempor', 'incididunt', 'ut', 'labore', 'et', 'dolore', 'magna', 'aliqua', '.', '', '']
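
Note that this list also includes the bookkeeping tokens: the 'utf-8' encoding marker at the start and the empty strings for the NEWLINE and ENDMARKER tokens at the end. If you only want the words and punctuation, one option is to filter the tokens by their type names, as in the minimal sketch below (keeping only NAME and OP tokens is just one possible filter).

import tokenize
import token
from io import BytesIO

text = "Lorem ipsum dolor sit amet."

tokens = tokenize.tokenize(BytesIO(text.encode('utf-8')).readline)

# Keep only NAME and OP tokens, dropping ENCODING, NEWLINE and ENDMARKER
token_list = [t.string for t in tokens if token.tok_name[t.type] in ("NAME", "OP")]

print (token_list)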

You can use the “generate_tokens” method available in the tokenize module if you want to tokenize a string without converting it to bytes. It still takes a callable readline method as the mandatory argument, but it only handles strings returned by the readline method and not bytes (unlike the tokenize method explained above). The code sample below illustrates the usage of the generate_tokens method. Instead of the BytesIO class, the “StringIO” class is now used.

import tokenize

from io import StringIO

 

text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."

tokens = tokenize.generate_tokens(StringIO(text).readline)

token_list = [t.string for t in tokens]

print (token_list)

After running the above code sample, you should get the following output:

['Lorem', 'ipsum', 'dolor', 'sit', 'amet', ',', 'consectetur', 'adipiscing', 'elit', ',', 'sed', 'do', 'eiusmod', 'tempor', 'incididunt', 'ut', 'labore', 'et', 'dolore', 'magna', 'aliqua', '.', '', '']

Tokenizing Contents of a File

You can use the “with open” statement in “rb” mode to directly read the contents of a file and then tokenize it. The “r” in “rb” mode stands for read-only mode while “b” stands for binary mode. The code sample below opens a “sample.txt” file and tokenizes its contents using the tokenize and readline methods.

import tokenize

with open("sample.txt", "rb") as f:

        tokens = tokenize.tokenize(f.readline)

        token_list = [t.string for t in tokens]

        print (token_list)

You can also use the “open” convenience method available in the tokenize module and then call the generate_tokens and readline methods to create tokens from a file directly.

import tokenize

 

with tokenize.open("sample.txt") as f:

        tokens = tokenize.generate_tokens(f.readline)

        token_list = [t.string for t in tokens]

        print (token_list)

Assuming that the sample.txt file contains the same example string, you should get the following output after running the two code samples explained above.

['Lorem', 'ipsum', 'dolor', 'sit', 'amet', ',', 'consectetur', 'adipiscing', 'elit', ',', 'sed', 'do', 'eiusmod', 'tempor', 'incididunt', 'ut', 'labore', 'et', 'dolore', 'magna', 'aliqua', '.', '', '']

Conclusion

The tokenize module in Python provides a useful way to tokenize chunks of text containing space-separated words. It also creates a map of the starting and ending positions of tokens. If you want to tokenize each and every word of a text, the tokenize method is better than the “split” method, as it also takes care of punctuation characters and other symbols and infers the token type, as illustrated by the short comparison below.
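
The sketch below uses a hypothetical short string to make the difference concrete: str.split only breaks on whitespace, so punctuation stays attached to the words, while generate_tokens separates punctuation into its own OP tokens.

import tokenize
from io import StringIO

text = "Hello, world."

# str.split only breaks on whitespace, so punctuation stays attached to the words
print(text.split())

# generate_tokens emits punctuation as separate OP tokens
tokens = tokenize.generate_tokens(StringIO(text).readline)
print([t.string for t in tokens])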

About the author

Nitesh Kumar

I am a freelance software developer and content writer who loves Linux, open source software, and the free software community.