Discover the Patterns and Hidden Information in Your Data Using Apache UIMA in Linux

When a large dataset is captured across a broad set of parameters, finding the relations and patterns between its features can become a tiresome task. Although many pre-existing models are available in the data analytics space, using one to draw a meaningful inference from a large dataset can turn into a complex knowledge discovery task of its own. Such datasets tend to hold several different kinds of information all stockpiled together, and lightweight pattern-finding algorithms are therefore unable to find all of the relationships that they contain.

This is where Apache UIMA comes in. UIMA stands for Unstructured Information Management Architecture, and UIMA applications are built for exactly this purpose: to find the meaning in unstructured data that otherwise looks meaningless. It is usually used to sort unstructured data and to categorize the relationships between the different features that are present in a dataset. Apache UIMA helps users understand which features depend on each other, which relationships matter for which categories in a dataset, and how the instances in a dataset together push it in a certain direction.

UIMA is not limited to text-based data; it can also be used with signal-based data such as audio and video. This means that UIMA can analyze large datasets that contain audio or video samples as well as text, and generate meaning for the user based on a set of provided parameters. To summarize, Apache UIMA enables knowledge discovery through a multi-modal analytical approach that views the dataset from different perspectives to find the relationships contained within it.

Installation

To install Apache UIMA, we start by updating the local apt package index, which contains the package names and version information.

1. Run the following command in the terminal to update the local apt package index:

$ sudo apt-get update -y

2. We now install the Apache UIMA documentation package (uima-doc) from the distribution's repositories by running the following command in the terminal:

$ sudo apt-get install -y uima-doc

NOTE: The -y argument automatically answers “yes” to any prompt that the installation raises, so the installation runs without asking for confirmation.

3. We now download the preferred UIMA binary distribution package, either from the Apache UIMA downloads page in a browser or with the wget tool by running the following command in the terminal:

$ wget https://dlcdn.apache.org//uima//uimaj-3.3.1/uimaj-3.3.1-bin.tar.gz
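
Before extracting a downloaded archive, it is worth checking that it arrived intact. Apache projects generally publish a SHA-512 digest next to each release archive; the following is a minimal sketch that compares a file against such a digest. The function name verify_sha512 is made up for this guide, and the expected digest is something you copy from the download page yourself.

```shell
# Hypothetical helper (not part of UIMA): compare a file's SHA-512 digest
# with an expected hex digest. Returns 0 on a match, non-zero otherwise.
# Usage: verify_sha512 <file> <expected-hex-digest>
verify_sha512() (
    file="$1"
    # Published digests are sometimes upper-case, so normalize to lower-case.
    expected="$(printf '%s' "$2" | tr 'A-F' 'a-f')"
    # sha512sum prints "<digest>  <filename>"; keep only the digest.
    actual="$(sha512sum "$file" | awk '{print $1}')"
    [ -n "$actual" ] && [ "$actual" = "$expected" ]
)
```

For example, verify_sha512 uimaj-3.3.1-bin.tar.gz "<digest from the download page>" && echo "checksum OK" prints the message only when the two digests agree.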

4. Once the download is complete, we extract the downloaded file and cd into it.

Run the following command in the terminal:

$ tar xzf <name of the downloaded file>

Then, move into the extracted folder by running the following command:

$ cd apache-uima
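
If you are unsure which folder an archive extracts to, you can list its contents instead of guessing. The sketch below prints the top-level folder of a .tar.gz file without extracting it; the function name archive_top_dir is made up for this guide.

```shell
# Hypothetical helper: print the top-level directory of a .tar.gz archive.
# Usage: archive_top_dir <archive.tar.gz>
archive_top_dir() {
    # "t" lists the archive instead of extracting it. awk looks at the first
    # entry only and prints its first path component, skipping a leading "./".
    tar tzf "$1" | awk -F/ 'NR == 1 { print ($1 == "." ? $2 : $1) }'
}
```

Running archive_top_dir on the downloaded file shows the folder name to pass to cd.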

5. We now create the UIMA_HOME environment variable and set it to the path where the extracted folder resides.
Run the following command in the terminal:

$ export UIMA_HOME="<directory path to where your extracted folder resides>"
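
A variable that is set with export only lasts for the current terminal session, and a wrong path here makes the later steps fail. The following is a small sanity check, assuming the extracted folder contains bin/documentAnalyzer.sh as used later in this guide; the function name check_uima_home is made up for this sketch.

```shell
# Hypothetical helper: check that UIMA_HOME points at an extracted UIMA folder.
check_uima_home() {
    if [ -z "${UIMA_HOME:-}" ]; then
        echo "UIMA_HOME is not set" >&2
        return 1
    fi
    # The launcher script used later in this guide should exist under bin/.
    if [ ! -f "$UIMA_HOME/bin/documentAnalyzer.sh" ]; then
        echo "No bin/documentAnalyzer.sh under $UIMA_HOME" >&2
        return 1
    fi
    echo "UIMA_HOME looks valid: $UIMA_HOME"
}
```

If you are still inside the extracted folder, export UIMA_HOME="$(pwd)" followed by check_uima_home sets the variable and confirms it in one go.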

6. Run the following commands in the terminal. The first one adjusts the paths in the bundled examples to match UIMA_HOME, and the second one opens the Document Analyzer window:

$ $UIMA_HOME/bin/adjustExamplePaths.sh
$ $UIMA_HOME/bin/documentAnalyzer.sh
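
If no window appears, a missing Java runtime is a likely cause, since the UIMA Java SDK runs on a JVM. The sketch below looks for a java executable, first under JAVA_HOME and then on the PATH. That lookup order is an assumption made for this guide, not a description of what the UIMA scripts do internally, and the function name find_java is made up.

```shell
# Hypothetical helper: print the path of a usable java executable,
# or return non-zero if none is found.
find_java() {
    # Prefer an explicitly configured JAVA_HOME when it has an executable java.
    if [ -n "${JAVA_HOME:-}" ] && [ -x "$JAVA_HOME/bin/java" ]; then
        echo "$JAVA_HOME/bin/java"
    else
        # Fall back to whatever "java" resolves to on the PATH.
        command -v java
    fi
}
```

Calling find_java || echo "Java not found" gives a quick answer before you dig any deeper.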

User Guide

With Apache UIMA now ready to use, we start by selecting the location of the Analysis Engine XML Descriptor in the Document Analyzer window. For the purposes of this guide, we select one of the premade sample datasets to run the analysis on and find the patterns in it.
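
To see which descriptors are available to choose from, you can list the XML files in a folder from the terminal. This is a minimal sketch; the function name list_descriptors is made up, and the examples/descriptors/analysis_engine path mentioned below is an assumption about the layout of the extracted folder that you should verify on your system.

```shell
# Hypothetical helper: list all XML files below a folder, sorted by path.
# Usage: list_descriptors <folder>
list_descriptors() {
    find "$1" -type f -name '*.xml' | sort
}
```

For example, list_descriptors "$UIMA_HOME/examples/descriptors/analysis_engine" prints one descriptor path per line.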

We now run the model and examine the outputs it generates.

Let’s take a look at one of the generated outputs.

We can see that, out of an entire dataset of text passages that cover many different subject matters, UIMA is able to sort the passages into smaller groups, each of which contains the information about a certain topic.

By selecting PersonTitle in the list of available annotations, we can see that it highlights the personal titles, and with them the people, that are mentioned in the dataset.
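
To get a feel for what such an annotation marks, the idea can be imitated with a plain regular expression. This is an illustration with grep, not UIMA's own annotator; the list of titles and the function name find_person_titles are made up for this sketch.

```shell
# Illustration only: print every "<title>. <Name>" pair found on standard input.
# The title list is a small made-up sample, not the one UIMA uses.
find_person_titles() {
    # -o prints only the matching part, -E enables extended regular expressions.
    grep -oE '\b(Mr|Mrs|Ms|Dr|Prof)\. [A-Z][a-z]+'
}
```

Piping a text file into find_person_titles prints each match on its own line, which is roughly what the highlighted spans in the annotation viewer correspond to.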

Conclusion

Finding the meaning and inference in large unstructured datasets can be a difficult task. The number of different parameters to look out for makes the target space very large, and it becomes inefficient to analyze such a dataset with traditional algorithms. Apache UIMA helps solve this issue: it can analyze large datasets, generate inferences, find relationships, and discover patterns even in datasets that are compiled from a very broad set of input parameters. It handles not only text-based data but also audio and video data.

About the author

Zeeman Memon

Hi there! I'm a Software Engineer who loves to write about tech. You can reach out to me on LinkedIn.