Subspace Outlier Mining: Detection and Description of Outliers in High Dimensional

Dr. Emmanuel Müller
Karlsruhe Institute of Technology (KIT)

Date: 17th January 2012 (Tuesday)
Time: 14:00 - 15:00
Venue: North N104B

Outlier mining has become an important task for detecting inconsistent or suspicious objects in large databases. In recent applications, it is used for consistency checks of sensor network measurements, fraud detection in financial transactions, emergency detection in health surveillance, and many more. As measuring and storing data has become cheap, objects in all of these applications are described by many attributes in high-dimensional databases. However, for each object only a few relevant attributes provide meaningful information for outlier detection; the remaining attributes are irrelevant for this object. For example, in health surveillance, attributes such as "age" and "skin humidity" might be important for detecting the abnormal "dehydration" status of one patient. Other attributes such as "heart beat rate" are irrelevant for detecting this outlier, but are relevant for detecting abnormal patients with heart disease.

Traditional data mining techniques are well established for outlier detection using all available attributes (the full data space), but they miss outliers that are hidden in subsets of relevant attributes (subspace projections). In the full data space all objects appear to be alike, so traditional techniques cannot distinguish between outliers and regular objects. Thus, our general aim is to develop novel subspace outlier mining techniques based on object deviation in projections of the data. We focus on outlier ranking, a special research field of outlier mining, which sorts objects according to their local degree of deviation. In addition to this ranking, we provide descriptive information about the reasons why an object seems outlying. In our example, providing information about the high deviation in "age" and "skin humidity", while showing normal measurements in all other attributes, assists health professionals in verifying this automatically detected outlier.
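
To make the idea of ranking objects by their deviation in subspace projections concrete, the following is a minimal sketch and not the method presented in the talk. It assumes a hypothetical toy data set of patients, scores each object by its k-nearest-neighbour distance in every projection of up to two attributes, and reports the projection with the highest score as the description. The data values, the function names, and the choice of a normalised kNN distance as the deviation measure are all assumptions made for illustration.

```python
from itertools import combinations
from math import dist
from statistics import mean, pstdev

# Hypothetical toy data (assumption): (age, skin_humidity, heart_rate) per patient.
ATTRIBUTES = ("age", "skin_humidity", "heart_rate")
PATIENTS = [
    (25, 60, 62), (27, 62, 75), (30, 58, 68), (28, 61, 80), (26, 59, 71),
    (70, 35, 64), (72, 33, 78), (68, 36, 70), (71, 34, 66), (69, 37, 74),
    (26, 34, 70),  # young patient with low skin humidity
]


def standardize(rows):
    """Z-score every attribute so distances are comparable across attributes."""
    columns = list(zip(*rows))
    mus = [mean(c) for c in columns]
    sigmas = [pstdev(c) or 1.0 for c in columns]  # guard against constant columns
    return [tuple((v - m) / s for v, m, s in zip(row, mus, sigmas)) for row in rows]


def knn_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbour."""
    return sorted(dist(points[i], p) for j, p in enumerate(points) if j != i)[k - 1]


def rank_outliers(rows, names, max_dim=2, k=2):
    """Return (score, object index, best subspace) triples, highest score first.

    The score is an assumed deviation measure: the kNN distance of an object
    in a projection, divided by the mean kNN distance in that projection.
    """
    data = standardize(rows)
    best = [(0.0, ())] * len(rows)
    for size in range(1, max_dim + 1):
        for dims in combinations(range(len(names)), size):
            projected = [tuple(row[d] for d in dims) for row in data]
            dists = [knn_distance(projected, i, k) for i in range(len(rows))]
            scale = mean(dists) or 1.0  # normalise so projections are comparable
            subspace = tuple(names[d] for d in dims)
            for i, knn in enumerate(dists):
                score = knn / scale
                if score > best[i][0]:
                    best[i] = (score, subspace)
    return sorted(((s, i, sub) for i, (s, sub) in enumerate(best)), reverse=True)


ranking = rank_outliers(PATIENTS, ATTRIBUTES)
for score, index, subspace in ranking[:3]:
    print(f"patient {index}: score {score:.2f}, deviating in {subspace}")
```

In this toy data the last patient has an ordinary value in every single attribute, so no one-dimensional projection flags it; only the combination of "age" and "skin_humidity" exposes the deviation, and that subspace is what the ranking returns as the explanation. Enumerating all projections is exponential in the number of attributes, so the sketch caps the subspace size with the hypothetical `max_dim` parameter.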