Upgrade Your Knowledge | USU Blog

UDC_#2 Intelligent Text Categorization - Pega's 3 Models

Written by Stefan Kienzler | Apr 22, 2024 9:02:41 AM

To analyze and categorize the content of an email, Pega provides several text categorization capabilities. Using text categorization, large amounts of data can be efficiently analyzed and assigned to predefined categories. 

Three different models can be distinguished.

  • Sentiment Detection
  • Intent Detection
  • Topic Detection

All three models run independently in parallel and support machine learning categorization. Topic Detection also supports keyword-based text categorization.

With keyword-based text categorization, the text is scanned and searched for topic-specific keywords. Based on the recognized keywords, the categorization assigns the text to an appropriate topic. This categorization is used when the machine learning model is not fully developed and does not provide satisfactory results.
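
The keyword approach can be pictured with a minimal Python sketch. This is an illustration of the general idea, not Pega's implementation: the topic names, the keyword sets, and the function name categorize_by_keywords are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical topics and keywords, chosen only for illustration.
TOPIC_KEYWORDS = {
    "Billing": {"invoice", "payment", "refund"},
    "Support": {"error", "crash", "password"},
}


def categorize_by_keywords(text):
    """Return the topic whose keywords occur most often, or None if none match."""
    words = re.findall(r"[a-z]+", text.lower())
    hits = Counter()
    for topic, keywords in TOPIC_KEYWORDS.items():
        # Count every word of the text that is a keyword of this topic.
        hits[topic] = sum(1 for word in words if word in keywords)
    topic, count = hits.most_common(1)[0]
    return topic if count > 0 else None


print(categorize_by_keywords("I would like a refund for my last invoice."))
```

A text without any known keyword stays uncategorized here, which mirrors why keyword matching is usually a stopgap until a trained model performs well enough.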

In machine learning text categorization, the model learns to categorize text by analyzing previously classified examples, from which it identifies the patterns that characterize each topic. To improve the accuracy of topic detection in production environments, the machine learning models can also be given feedback. Machine learning topic detection is particularly useful when previous customer messages and their categories are available, or when other relevant training data can be provided to the model.

 

Sentiment Detection

Sentiment detection recognizes the emotional tone of the text being analyzed. Using machine learning and natural language processing, an email bot categorizes the analyzed text as positive, neutral, or negative. Detecting negative emotions in an email enables an efficient and timely response to critical concerns.

Intent Detection

The second model is intent detection. It determines the purpose of the text being analyzed, that is, what the author wants to achieve. Identifying the intent behind a piece of text allows for a more accurate interpretation of user communication and supports effective business responses and actions.

Topic Detection

Topic detection is concerned with identifying the overarching topic of a single piece of text or an entire document in order to efficiently process an incoming customer request and initiate appropriate actions. For example, support or service requests can be identified and an appropriate action initiated. This results in improved service quality and smoother customer interactions.

In Topic Detection, there are three algorithms to choose from when building a model. By default, the model is built with all three algorithms; after the build, you select one of them, ideally the one with the highest F-score. (The F-score combines precision and recall into a single measure of how well a model performs.)
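
To make the selection step concrete, here is a small Python sketch that computes an F-score per algorithm and picks the highest one. It assumes the common definition of the F-score as the (optionally weighted) harmonic mean of precision and recall; the article does not state which weighting Pega applies, and the counts below are made-up numbers.

```python
def f_score(true_positives, false_positives, false_negatives, beta=1.0):
    """F-beta score; beta=1.0 gives the usual F1 (harmonic mean)."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Hypothetical test-set counts per algorithm:
# (true positives, false positives, false negatives).
counts = {
    "Maximum Entropy": (80, 20, 20),
    "Naive Bayes": (70, 10, 30),
    "SVM": (85, 25, 15),
}

scores = {name: f_score(*c) for name, c in counts.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

With these invented counts, the algorithm with the best balance of precision and recall wins, even though another one has the higher precision on its own.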

Maximum Entropy

The Maximum Entropy Model (MaxEnt) is based on the principle of maximum entropy: among all probability distributions that satisfy the constraints derived from the training data, it chooses the one with the highest conditional entropy, that is, the one that makes the fewest additional assumptions. This yields robust and versatile predictions. The model uses feature functions, each weighted by its own Lagrange multiplier.
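
As a rough illustration, a trained MaxEnt classifier turns the weighted feature functions into category probabilities with a normalized exponential. The sketch below assumes simple word/category indicator features; the weights (standing in for the learned Lagrange multipliers), the categories, and the function name are hypothetical, and training itself is not shown.

```python
import math

# Hypothetical learned weights, one per (word, category) feature.
# The values are made up for illustration.
WEIGHTS = {
    ("refund", "Billing"): 1.5,
    ("invoice", "Billing"): 1.2,
    ("password", "Support"): 1.8,
    ("error", "Support"): 1.1,
}
CATEGORIES = ["Billing", "Support"]


def maxent_probabilities(words):
    """P(category | words) as a normalized exponential of weighted features."""
    scores = {
        category: math.exp(sum(WEIGHTS.get((word, category), 0.0) for word in words))
        for category in CATEGORIES
    }
    total = sum(scores.values())  # normalization constant
    return {category: score / total for category, score in scores.items()}


probs = maxent_probabilities(["refund", "invoice", "error"])
print(probs)
```

Features that never fire for a category contribute a weight of zero, so unknown words leave the distribution unchanged.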

Naive Bayes

Naive Bayes is a probabilistic model based on Bayes' theorem. It assumes that all features (here, the words of an email) are independent of one another given the category. The algorithm is efficient to train: it combines the prior probability of each category with the probabilities of the words in an email to calculate the probability of a detail category.
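
The calculation can be sketched in a few lines of Python. This is a generic multinomial Naive Bayes with add-one smoothing, not Pega's implementation; the training emails and category names are invented for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training emails with made-up categories.
TRAINING = [
    ("please send a refund for my invoice", "Billing"),
    ("my payment failed twice", "Billing"),
    ("I forgot my password", "Support"),
    ("the app shows an error after login", "Support"),
]


def train(samples):
    """Count emails per category and words per category."""
    category_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, category in samples:
        category_counts[category] += 1
        for word in text.lower().split():
            word_counts[category][word] += 1
            vocabulary.add(word)
    return category_counts, word_counts, vocabulary


def predict(text, model):
    """Return the category with the highest log prior plus log word likelihoods."""
    category_counts, word_counts, vocabulary = model
    total_docs = sum(category_counts.values())
    best_category, best_score = None, -math.inf
    for category, doc_count in category_counts.items():
        score = math.log(doc_count / total_docs)  # log prior
        total_words = sum(word_counts[category].values())
        for word in text.lower().split():
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[category][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


model = train(TRAINING)
print(predict("refund my payment", model))
```

Working with logarithms keeps the product of many small probabilities numerically stable.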

Support Vector Machine (SVM)

SVM is a linear classifier that searches for the hyperplane that separates the data points of two classes with the largest possible margin. It can also handle nonlinear decision boundaries through the use of kernels. Because a single SVM separates only two classes, it is extended to multiple classes with a one-vs-rest or one-vs-one approach; which one fits better depends on the data set and the specific requirements.
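
The one-vs-rest idea can be shown with a short Python sketch of the prediction step only. It assumes that one linear hyperplane per category has already been trained; the feature words, weights, and biases are hypothetical numbers, and no actual SVM training takes place here.

```python
# Hypothetical hyperplanes (weights, bias), one per category, as a trained
# linear SVM might produce. All numbers are made up for illustration.
FEATURES = ["refund", "invoice", "password", "error"]
HYPERPLANES = {
    "Billing": ([1.0, 0.8, -0.6, -0.4], -0.2),
    "Support": ([-0.7, -0.5, 1.1, 0.9], -0.1),
    "Other": ([-0.3, -0.3, -0.5, -0.5], 0.3),
}


def featurize(text):
    """Turn a text into a vector of word counts for the known feature words."""
    words = text.lower().split()
    return [float(words.count(feature)) for feature in FEATURES]


def decision_value(x, weights, bias):
    """Signed score w.x + b; positive means the 'this category' side."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def predict_one_vs_rest(text):
    """Score the text against every category's hyperplane and take the maximum."""
    x = featurize(text)
    scores = {
        category: decision_value(x, weights, bias)
        for category, (weights, bias) in HYPERPLANES.items()
    }
    return max(scores, key=scores.get)


print(predict_one_vs_rest("password error after login"))
```

A one-vs-one scheme would instead train one classifier per pair of categories and let them vote.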

The choice of the best algorithm should be influenced by the requirements of the problem, the size of the data set, and the desired classification accuracy. Each algorithm has its strengths and weaknesses, and careful consideration of these factors is critical to selecting the optimal approach.

Training

To get accurate predictions from Pega's machine learning models, it is critical to prepare the training data carefully. The Topic Detection models use CSV, XLS, or XLSX file formats that must meet certain criteria.

The Topic Detection model requires a file with three columns: "Content", "Result" and "Type". The Content column contains the email data, while the Result column specifies the desired result or topic. In this case, the topic starts with the word "Action", followed by the detail category, which is indicated by a hyphen instead of an underscore. The Type column indicates whether the data is training or test data.

 

Content            Result                       Type
[E-Mail or Text]   Action > [DetailCategory]
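
A file with this layout could be produced with Python's csv module, as in the following sketch. The example emails and detail categories are invented, and the "Training" and "Test" labels in the Type column are an assumption: the article only says that the column marks training or test data, not which exact values Pega expects.

```python
import csv
import io

# Hypothetical rows. The Type labels "Training" and "Test" are an assumption,
# not confirmed by the article.
ROWS = [
    {"Content": "Please refund my last invoice.",
     "Result": "Action > Refund", "Type": "Training"},
    {"Content": "I cannot log in, my password is rejected.",
     "Result": "Action > PasswordReset", "Type": "Test"},
]


def write_training_file(rows, handle):
    """Write the three-column layout: Content, Result, Type."""
    writer = csv.DictWriter(handle, fieldnames=["Content", "Result", "Type"])
    writer.writeheader()
    writer.writerows(rows)


# Written to an in-memory buffer here; a real file handle works the same way.
buffer = io.StringIO()
write_training_file(ROWS, buffer)
print(buffer.getvalue())
```

The csv module quotes fields that contain commas, which matters because email text frequently does.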
 
 

Training and model selection subtleties

When training models, you can decide whether to overwrite or augment existing data. It is possible to integrate data from different sources, such as information provided by the channel. You can also specify the proportion of training and test data, with the default being 70% for training and 30% for testing.
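
The default split can be illustrated with a short Python sketch. It shows a generic shuffled 70/30 split, not how Pega performs the split internally; the function name, the seed, and the sample records are hypothetical.

```python
import random


def split_records(records, training_share=0.7, seed=42):
    """Shuffle the records and split them into training and test portions."""
    shuffled = list(records)
    # A fixed seed makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * training_share)
    return shuffled[:cut], shuffled[cut:]


records = [f"email {i}" for i in range(10)]
training, test = split_records(records)
print(len(training), len(test))
```

Shuffling before the cut keeps the test portion from being dominated by whichever topic happens to come last in the source file.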

Three different algorithms are available for the topic model: Maximum Entropy, Naive Bayes, and Support Vector Machine. All three models can be created simultaneously, and the selection is based on the highest F-score, which represents the performance of the model.

The precise structuring and preparation of training data plays a critical role in the success of machine learning models in Pega. Addressing the specific needs of each model ensures optimal performance and predictive accuracy.