UDC_#3 Intelligent Text Extraction with Pega

Stefan Kienzler

22.04.2024

Pega provides a powerful text extraction method for analyzing email content and categorizing text data. Text extraction allows you to identify named entities in text data and assign them to predefined categories, such as organizations, locations, people, quantities, or values. By leveraging the machine learning capabilities of the Pega platform, entity extraction models can be created that are capable of recognizing named entities.

By creating entity extraction models to recognize keywords and phrases, cases can be automatically created, forms filled, or orders routed. Each entity extraction model classifies keywords and phrases such as names of people, places, organizations, etc. into predefined categories called entity types.

Multiple recognition methods are combined to identify each entity type in unstructured text. Entity types can be used to create and manage complex entity extraction models, such as date or date-time. In addition, entity types support the management of nested entities. For example, an address can contain nested entity types such as country, state, province, postal code, street, and so on.

In addition to the keyword-based method and the machine learning method, RUTA scripts can also be used to identify entities. The Apache UIMA RUTA script is a rule-based scripting language used to recognize patterns in text. Annotations are used in conjunction with conditions to define the patterns. When a pattern matches, the appropriate action is performed. In addition, regular expressions can be used in the script to find the patterns.

Compared to the topic model, only the Conditional Random Fields (CRF) algorithm is available for text extraction. The choice between keyword lists and machine learning as well as the recognition of entities using RUTA scripts are possible options. In contrast to topic recognition, all entities are stored in a list and the F-score provides information about the performance of the model.

Blog

Intelligent Text Processing with Pega

Read our first article on text processing with Pega.

Conditional Random Fields (CRF)

CRFs are an important type of machine learning model, especially in natural language processing. They are used for text segmentation, labeling, and recognition of named entities such as people and organizations. Compared to simpler models such as Hidden Markov Models, CRFs can consider a wider range of features and contexts. The algorithm is described by the conditional probability P(y│x) using feature functions that model dependencies between input and output variables. CRFs are supervised learning models and can be customized.

Feature Functions

CRFs use feature functions to model dependencies between input and output variables. These functions use contextual information to make accurate predictions and enable the integration of domain-specific knowledge. The feature functions are critical to the scalability and efficiency of the CRF model. They are used to capture relevant contextual information and incorporate it into the modeling to achieve a better representation of the data.

Example of a feature function

A feature function f(x,i,y_i,y_(i-1)) can take the value 1 or 0 due to a certain condition. This allows the integration of different questions and increases the effectiveness and accuracy of analysis and prediction.

The flexibility of CRFs in model design, their adaptability to specific requirements and the consideration of contextual information make them a powerful tool for text extraction.

modal2_0

Quelle: https://www.pega.com/sites/default/files/styles/1920/public/media/images/2020-04/modal2_0.png?itok=Np51rlGl

Training

To get accurate predictions from Pega's machine learning models, effective preparation of the training data is essential. The text extraction models use CSV, XLS, or XLSX files that must meet specific requirements.

The text extraction model requires a file with two columns: "Content" and "Type". The "Content" column contains the email data, while the "Type" column indicates whether the data is training or test data. The data in the Content column is prepared according to a specific pattern, where entities are defined by markers such as <START:...> and <END>. These entities, such as "Bank details_change", are later mapped to variables in Pega and enable fields to be filled in automatically.

Text Extraction Pega

UDC_#3 Intelligent Text Extraction with Pega

Intelligent Text Processing with Pega

Conditional Random Fields (CRF)

Feature Functions

Example of a feature function

Training

More interesting articles

UDC_#2 Intelligent Text Categorization - Pega's 3 Models

UDC_#1 Intelligent Text Processing with Pega

UDC_How to Upgrade Pega in 9 Steps