Many modern algorithms detect and handle outliers robustly, but depending on the domain and the requirement, outlier analysis can still reveal key information about problem areas. Consider an atypical blood report. The abnormality may reflect characteristics specific to the patient that differ from those of a healthy person; in that case the observation, though abnormal, is informative and can help diagnose the underlying ailment. If, on the other hand, the abnormality arose because the patient consumed food or alcohol before the sample was collected, or because the sample was collected poorly, it points to human error, and the report may need to be discarded, corrected, or repeated. It is therefore crucial to consider the cause of an outlier before acting on it.
An outlier is a data point that is different from the rest of the data. Outliers are also often referred to as abnormalities, deviants or anomalies.
How are outliers different from noise?
Noise comes in two forms: class noise, where examples are mislabeled, and attribute noise, where attribute values contain errors. Noisy values lie very close to the true values but are not correct. An outlier, by contrast, can be an error but can also be a genuine deviant from the rest of the population: a true data point whose value differs greatly from all other data points.
Often, outliers contain very useful information about the underlying system and this information helps in fraud detection, intrusion detection systems, rare and critical disease diagnosis, customer segmentation, etc.
Outlier Detection and Novelty Detection
Outlier detection estimators try to fit the regions where the training data is the most concentrated, ignoring the deviant observations.
In some instances, the training data may not be polluted by any outliers and we may be interested in detecting whether a new observation could be an outlier. In this context, an outlier is called a novelty.
Outlier detection and novelty detection are both used for anomaly detection, where the goal is to detect abnormal or unusual observations. Outlier detection is a form of unsupervised anomaly detection, while novelty detection is a form of semi-supervised anomaly detection. In outlier detection, the outliers cannot form a dense cluster, because the available estimators assume that outliers lie in low-density regions. In novelty detection, by contrast, novelties can form a dense cluster as long as that cluster lies in a low-density region of the training data, which is treated as normal in this context.
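The difference between the two settings can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions, not how any particular library implements its estimators: the function names, the thresholds (3.5 and 3.0) and the sample values are all made up for the example. Outlier detection is approximated here with a median/MAD score on a contaminated sample, and novelty detection with a mean/standard deviation fit on clean training data.

```python
from statistics import mean, median, stdev


def mad_outliers(values, threshold=3.5):
    """Outlier detection: flag points inside a possibly contaminated sample.

    The median and the median absolute deviation (MAD) are used because
    they are barely affected by the outliers themselves.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return [False] * len(values)
    # 0.6745 rescales the MAD so the score is comparable to a z-score.
    return [0.6745 * abs(v - med) / mad > threshold for v in values]


def fit_novelty_detector(clean_values, threshold=3.0):
    """Novelty detection: learn 'normal' from clean data, then score new points."""
    mu = mean(clean_values)
    sigma = stdev(clean_values)

    def is_novelty(x):
        return abs(x - mu) / sigma > threshold

    return is_novelty


# Outlier detection: the sample itself contains the deviant value 25.0.
contaminated = [10.1, 9.8, 10.3, 9.9, 10.0, 25.0, 10.2]
flags = mad_outliers(contaminated)

# Novelty detection: the training data is assumed clean; new points are judged later.
clean = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7]
is_novelty = fit_novelty_detector(clean)
print(flags)
print(is_novelty(10.4), is_novelty(14.0))
```

In the first case the detector has to cope with the outlier being part of the data it learns from, which is why robust statistics are used. In the second case ordinary statistics are enough, because the training data is assumed to be free of outliers.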
Types of outliers
Broadly, outliers can be classified into three categories:
1. Global outlier (point anomaly)
A data point that deviates significantly from the rest of the data set.
2. Contextual outlier (conditional outlier)
A data point that deviates significantly within a selected context. Each attribute of an instance must be identified as either contextual (such as time and location) or behavioral (a characteristic of the data point, such as temperature).
3. Collective outliers
A subset of data points that collectively deviates significantly from the whole data set, even though the individual data points may not be outliers.
Outliers can be informative, so removing them is not always the right solution; it is vital to analyze them and handle them with an appropriate method. The choice of an outlier detection method depends on many factors, such as the type and size of the data, the availability of ground truth about the data, and the need for interpretability in the model. Some of these methods will be discussed in the next article, along with their practical applications.
1. Novelty detection — https://scikit-