Unsupervised Learning at Protenus

by Vicki Toy-Edens on June 15, 2020

Have you ever answered a CAPTCHA while trying to access a website? If so, you have likely helped train classification algorithms. Not only do those CAPTCHAs prove you’re a human, their answers also feed large labeled datasets. You may well be teaching the next generation of self-driving cars the difference between a pole and a pedestrian.

There’s a good reason why CAPTCHAs are used to collect training data for classifiers. When you train a model, you need a lot of labeled training data to keep the model from overfitting. For instance, if we had 100 labeled images of dogs but they were all brown Labrador retrievers, the classifier would be optimized to identify brown Labrador retrievers, not dogs in general. To be a good generalized detector of dogs, the classifier has to see many different examples of dogs (different sizes, colors, and angles) as well as examples of other objects.

This leads one to ask: how do you create a classifier if you don’t have a large labeled dataset?


Supervised vs. Unsupervised Learning

There are two major types of machine learning: supervised and unsupervised learning. Supervised learning is similar to how primary school tests work. The student (classifier) is given homework (training set) and is graded on how well their homework matches the answer key (labels). The student tries to get the best score they can on their homework, similar to how a classifier tries to optimize matching the labeled dataset. That type of machine learning algorithm requires a robust labeled dataset.

Unsupervised learning is pretty much the opposite case. The classifier has no answer key and must infer information based on patterns within a dataset.

At Protenus, our healthcare compliance analytics platform uses both supervised and unsupervised machine learning. In one of our previous blog posts we discussed how we use training labels to iterate on our supervised classifier, so in this post we will delve into unsupervised learning methods.

Anomaly, Anomaly

There are many reasons why someone may choose an unsupervised learning algorithm: for example, the problem may be extremely complex or the event may be extremely rare. In the case of drug diversion, both are true: it is rare for a healthcare worker to divert drugs, and those who do may divert in many different ways. This leaves very few true positive diversions for a labeled dataset, the proverbial needle in a haystack.

Despite its rarity, drug diversion can cause irreparable harm to both the diverter and patients caught in the crossfire of a diverter’s behavior, so we must find a way to detect diversion that does not rely on supervised learning. We must rely on discovering how drug diverters behave differently from non-diverters rather than trying to find similarities between drug diverters. The best way to approach this problem is to look for anomalous behavior, that is, cases where a diverter is a clear outlier compared with other employees. This can be done with many different unsupervised algorithms such as cluster analysis, anomaly detection, and neural networks.


To do this, Protenus uses input from our clients as well as our staff, who have many years of diversion experience, to identify or create key features from automated dispensing cabinet (ADC) and electronic health record (EHR) data that may contribute to or correspond with diversion behaviors. Together, our features capture both suspiciousness (whether the employee is acting as an outlier) and risk (the amount and type of drugs the user is handling). Protenus data scientists examine the performance of these features both separately and as a whole, taking into account the distribution of each feature and the number of outlying users.
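As a rough illustration of how a suspiciousness signal and a risk signal might be folded into a single ranking, the sketch below simply multiplies the two. The multiplicative form, the `DRUG_WEIGHT` values, and every name here are assumptions made for the example; they are not a description of how Protenus actually scores employees.

```python
# Hypothetical sketch: rank employees by suspiciousness x risk.
# The weights and the multiplicative form are illustrative assumptions.

# Assumed relative risk per drug class (made-up values).
DRUG_WEIGHT = {"opioid": 3.0, "benzodiazepine": 2.0, "other": 1.0}


def risk(doses_by_drug):
    """Risk grows with both the amount and the type of drugs handled."""
    return sum(DRUG_WEIGHT.get(drug, 1.0) * doses
               for drug, doses in doses_by_drug.items())


def priority(suspiciousness, doses_by_drug):
    """Combine an outlier score (0 = typical) with the risk of the drugs handled."""
    return suspiciousness * risk(doses_by_drug)


# (suspiciousness, doses handled per drug class) for three made-up employees.
employees = {
    "emp_a": (4.0, {"opioid": 20}),  # strong outlier, high-risk drugs
    "emp_b": (4.0, {"other": 20}),   # strong outlier, low-risk drugs
    "emp_c": (0.5, {"opioid": 20}),  # typical behavior, high-risk drugs
}

ranked = sorted(employees, key=lambda e: priority(*employees[e]), reverse=True)
print(ranked)
```

With these made-up numbers, an employee who is both an outlier and handling high-risk drugs ranks first, while typical behavior ranks last even when the drugs involved are high risk.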

For example, we use multiple features to indicate whether an employee is sloppily documenting their drug handling, and we identify outlier behavior based on how many times the employee performs the suspicious behavior. An employee sloppily documenting drug handling once is often not cause for concern, but when an employee does this repeatedly, it becomes a pattern that makes them diverge from the “normal” behavior seen among their peers. Sloppy documentation may be due to either an employee covering up drug diversion or the employee just not following normal procedures. In either case, this is often a set of events that our clients want to be alerted to in order to address policy violations or to identify drug diverters.
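The idea that one event is unremarkable but a repeated pattern stands out can be sketched by counting events per employee and comparing each count against the spread of counts among peers. The example below uses a Tukey fence (third quartile plus 1.5 times the interquartile range) as the cutoff; that rule, the event name `undocumented_waste`, and the data are assumptions chosen for illustration, not the rule Protenus uses.

```python
from collections import Counter
from statistics import quantiles


def repeat_outliers(events, employees, event_type):
    """Return employees whose count of `event_type` exceeds a Tukey fence.

    events:    iterable of (employee_id, event_type) pairs
    employees: every employee in the peer group, including those with no events
    """
    counts = Counter(emp for emp, kind in events if kind == event_type)
    per_employee = [counts[emp] for emp in employees]  # missing employees count as 0
    q1, _, q3 = quantiles(per_employee, n=4)           # quartiles of the peer counts
    fence = q3 + 1.5 * (q3 - q1)                       # upper Tukey fence
    return [emp for emp in employees if counts[emp] > fence]


E = "undocumented_waste"  # hypothetical event type
employees = list("abcdefgh")
events = (
    [("a", E), ("c", E), ("c", E), ("d", E), ("f", E), ("h", E)]
    + [("g", E)] * 9            # one employee repeats the behavior nine times
    + [("b", "late_return")]    # other event types are ignored
)

print(repeat_outliers(events, employees, E))
```

Here most employees have zero to two events, so the employee with nine is the only one above the fence; an employee with a single event is never flagged.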

It is important to note that our algorithm groups or “clusters” employees with similar job functions and only compares an employee to similar peers. Employees can be clustered using different components from ADC and EHR data (e.g., overlapping patients being treated, similar departments or roles, areas of ADC machines accessed). It is critical to compare employees with their peers because it is not useful to flag an employee for suspicious activity that can be easily explained by their job function. For instance, an employee who works in post-operative care may handle a large amount of controlled substances compared to all hospital employees, but only an average amount compared to other employees in post-operative care. We would create many false positive results if we flagged employees without accounting for the fact that different roles require different behaviors.
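A minimal sketch of peer-relative scoring follows. It assumes peer groups are already assigned (the clustering step itself is not shown) and scores each employee with a modified z-score based on the median and the median absolute deviation of their own group. The scoring rule, the 3.5 threshold, the group names, and the data are all assumptions for illustration.

```python
from collections import defaultdict
from statistics import median


def peer_group_outliers(records, threshold=3.5):
    """Flag employees who are high outliers within their own peer group.

    records: iterable of (employee_id, peer_group, value) tuples, where value
    is some per-employee measure such as controlled-substance doses handled.
    """
    by_group = defaultdict(list)
    for emp, group, value in records:
        by_group[group].append((emp, value))

    flagged = []
    for members in by_group.values():
        values = [v for _, v in members]
        med = median(values)
        mad = median([abs(v - med) for v in values])  # median absolute deviation
        if mad == 0:
            continue  # no spread in this group, so no meaningful score
        for emp, v in members:
            score = 0.6745 * (v - med) / mad          # modified z-score
            if score > threshold:
                flagged.append(emp)
    return flagged


records = [
    # Post-operative care: everyone handles a lot, so none stands out.
    ("p1", "postop", 40), ("p2", "postop", 42), ("p3", "postop", 45),
    ("p4", "postop", 43), ("p5", "postop", 41),
    # General ward: volumes are low, so 30 is extreme for this group.
    ("w1", "ward", 5), ("w2", "ward", 6), ("w3", "ward", 4),
    ("w4", "ward", 5), ("w5", "ward", 7), ("w6", "ward", 30),
]

print(peer_group_outliers(records))
```

In this made-up data, `w6` handles less than every post-operative employee yet is the only one flagged, because the comparison is against ward peers rather than the whole hospital.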

While the focus of identifying potential diversion is identifying outlier behavior, a crucial part of the analysis is to compare this behavior to past diverters’ behavior. An ideal feature would flag suspicious behavior (especially past drug diverters’ behavior) as extreme outliers and never flag non-diverter behavior, so that the false positive rate is low. However, this is often impossible to achieve because there are multiple ways to divert drugs. One feature may flag a past drug diverter as an extreme outlier while a different past drug diverter may appear to have normal behavior for that feature (e.g., this diverter may not have documented their drug handling sloppily). The result is that Protenus must develop and implement a multitude of features that cover the many forms diversion can take. This makes for a very difficult but interesting research problem as we are always expanding our research to incorporate the behaviors of different types of diverters.
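One way to picture this trade-off is to check each feature against the few known past diverters: what fraction of them it flags, and what fraction of everyone else it flags. The sketch below does that for two hypothetical features and for their union. All names and data are invented for the example, and it treats every employee who is not a known diverter as a non-diverter, which is itself an assumption since some diversion goes undetected.

```python
def evaluate_feature(flagged, known_diverters, all_employees):
    """Return (recall on known diverters, false positive rate on everyone else)."""
    flagged = set(flagged)
    known = set(known_diverters)
    others = set(all_employees) - known  # assumed non-diverters
    recall = len(flagged & known) / len(known) if known else None
    fpr = len(flagged & others) / len(others) if others else None
    return recall, fpr


staff = [f"e{i}" for i in range(10)]
known_diverters = {"e0", "e1"}  # made-up past cases

# Employees each hypothetical feature marks as extreme outliers.
flags = {
    "sloppy_documentation": {"e0", "e5"},  # catches e0, misses e1
    "high_volume": {"e1"},                 # catches e1, misses e0
}

for name, flagged in flags.items():
    print(name, evaluate_feature(flagged, known_diverters, staff))

# Surfacing anyone flagged by at least one feature covers both diverters.
combined = set().union(*flags.values())
print("combined", evaluate_feature(combined, known_diverters, staff))
```

Neither feature alone catches both past diverters in this toy data, but together they do, at the cost of carrying along whatever false positives each feature produces.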

At Protenus we are constantly improving our analytics to incorporate new knowledge uncovered by violations that customers describe as well as insights from healthcare experts. These improvements come in the form of labels and insights for both our supervised and unsupervised classifiers. As a compliance analytics company that uses machine learning to identify suspicious human behavior, we constantly improve and adapt our techniques to mirror the ever-changing nature of human decision making.

If you’d like to hear more about engineering at Protenus, check out my coworker’s articles on Scaling Infrastructure for Growth and Engineering Culture.
