WorkFusion IDP Webinar Follow-up: We Answer 6 Bonus Audience Questions

Robot

BLEEP, BLIP, BLOOP
In May, WorkFusion hosted an informative webinar on the opportunities and benefits of intelligent document processing (IDP) — featuring Anil Vijayan of Everest Group and Arnesh Sahay, a Solutions Engineer for WorkFusion, moderated by Veronika Andreeva, Product Marketing Manager for WorkFusion.

Over 300 people have attended the free webinar so far, and it’s still available on-demand at: http://learn.workfusion.com/2019-may-intelligent-document-processing-webinar-ft-everest-group

The 30-minute presentation was followed by a Q&A, and many attendees had questions that we wanted to answer with some detail here.

Scroll for the answers to these topics:

  • How are ML/NLP algorithms utilized in WorkFusion’s IDP solution?
  • What does the initial training for a WorkFusion ML model look like? After an initial model is trained, how does the software perform quality control against the model’s output?
  • How does WorkFusion’s IDP solution solve for high volumes of data?
  • Can IDP be used to read data from emails? Can we schedule bots to run at varying times for different processes?
  • What does a false positive mean in the context of machine learning processes?

How are ML/NLP algorithms utilized in WorkFusion’s IDP solution?

WorkFusion Process AutoML is a proprietary technology for Cognitive Automation. AutoML automatically finds the best machine learning algorithm and method of automation for myriad use cases. There are several different open-source algorithms leveraged in this process, depending on the type of model being trained.

In machine learning, a hyperparameter can consist of annotators, feature extractors, and various post-processors that are used to control the learning process. In order to produce the highest-performing machine learning models for any given use case, WorkFusion Process AutoML performs a series of experiments to test different “optimal” hyperparameter configurations by systematically varying one or more components and examining its effects on the result. This process is called hyperparameter optimization.

The two most common use cases deployed in WorkFusion software include information extraction: gathering data from unstructured documents to enter it into systems; and classification: classifying documents, transactions and emails into workflows. AutoML enables operations teams to train their own machine learning models without the need for a data scientist.

AutoML SDK goes further and allows Machine Learning Engineers (MLEs) to step in and extend machine learning models for any information extraction or classification use cases. This SDK consists of the following components, each responsible for a specific task in a pipeline:

Parser — remove HTML tags from documents, create an AutoML SDK Document

Annotator — split text into tokens, add boundary elements, add Named-Entity Recognition (NER)

Feature Extractors — analyze tokens in documents and create features

Algorithms — aggregate all field documents and feature extractors, provide a model and its statistics

Post-Processing — apply normalization logic and field grouping

Some examples of machine learning algorithms that are supported in AutoML SDK include Support-Vector Machines (SVM), Deep Neural Networks (DNN), Regression, etc.

What does the initial training for a WorkFusion ML model look like? After an initial model is trained, how does the software perform quality control against the model’s output?

The amount of data needed in the first iteration of training for an ML model depends on the complexity of the use case. An efficient information extraction model for a Prescription Intake use case can be created with as few as 50 documents, whereas a complex multi-class classification model for an Email Intake use case may require more than 100 documents per category. With our enterprise customers, we typically see that ML models are trained on at least 3,000–5,000 documents before they are deployed to Production.

Automatic Quality Control (AutoQC) refers to the use of statistical methods in monitoring and maintaining the quality of products and services. In a WorkFusion automation workflow, the AutoQC sub-process chooses an optimal, cost-effective combination of automated machines and cloud workers that always delivers at or above the acceptable quality level.

The main concept of AutoQC is to take samples from a defined batch of items (records, documents, etc.) and verify the quality of each individual sample. After sample verification, the whole batch is considered as accepted or rejected depending on the Rejection Limit parameter set in the process. Therefore, AutoQC enables customers to maintain a continuous quality check without needing to verify the entirety of the total batch.

1*HLvn5IkDVEnLGU5nJznZUQ.png


Can IDP be used to read data from emails? Can we schedule bots to run at varying times for different processes?

Email Intake is a standard use case that WorkFusion has solved for several different customers. The data within emails (subject, body, attachments) are all parsed via RPA; any attachments are sent to the Optical Character Recognition (OCR) sub-process for conversion to a machine-readable format (XML/HTML); the body of the email is merged with the converted attachments; and a single document is sent to the AutoML process. The case study below outlines how WorkFusion tackled an Email Intake process at a large U.S. bank which was inundated with massive email volumes and riddled with inefficiencies due to human error. (View our Email Intake Case Study to learn more.)

Scheduling bots is a simple and straightforward process in WorkFusion’s Control Tower. The image below shows how process owners can create their own schedules for various processes to kick off at designated times.

1*Dkx9hZwoSfxJegLgU7_cQQ.png


What does a false positive mean in the context of machine learning processes?

When evaluating the performance of machine learning models, we need to be able to differentiate between values that were identified correctly and incorrectly. When looking at a Binary Classification use case, a correctly classified document can be either a True Positive (TP) or a True Negative (TN). Alternatively, an incorrectly classified document can be either a False Positive (FP) or a False Negative (FN).

The definitions below pertain to a simple HotDog/NotHotDog use case, as presented on HBO’s “Silicon Valley” and help illustrate the differences between the four possible outcomes:

1*WE4bkkG2CziDI-5-Y5Pwwg.png


TP — Image is a hot dog and the model identified it as a hot dog. (Correct)

FN — Image is a hot dog, but the model identified it as not a hot dog. (Incorrect)

TN — Image is not a hot dog, and the model identified it as not a hot dog. (Correct)

FP — Image was not a hot dog, but the model identified it as a hot dog. (Incorrect)

A binary classification model is relatively simple to calculate for. But when evaluating a more complex multi-class classification model, confusion matrices are often used to map out the four possible outcomes against each possible category, to better understand the performance of a model.

How does WorkFusion’s IDP solution solve for high volumes of data?

WorkFusion’s Enterprise Architecture is designed to handle high volumes of data coming in across multiple different business processes. To address the increased load on the systems, single instances can be scaled either horizontally or vertically. Horizontal scaling consists of provisioning more servers for ML, OCR, and RPA as needed to respond to increased volume growth, and vertical scaling refers to dedicating more powerful hardware to each component.

As different business lines within an organization start using WorkFusion, it is important to keep data segregated across different departments. There are two approaches to address this: One is to instantiate new environments that correspond to business units as they are onboarded and have sufficient volume (this is the most common approach); another is to set up a shared multi-tenant instance with scaled ML, OCR and RPA components.

stat

IDP Webinar Follow-up: We Answer 6 Bonus Audience Questions was originally published in WorkFusion on Medium, where people are continuing the conversation by highlighting and responding to this story.

Continue reading...
 
Top