Primer on Encodings

Why Encodings?

Most popular machine learning methods, such as deep learning, require numerical data as input, yet categorical data is common. In health care, for example, a person's vitals may be a combination of both: height and weight (numerical) alongside gender and race (categorical).

Quite simply, if an ML system operates on numerical data, categorical data must be converted to a number or a vector of numbers.
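
As a concrete illustration, here is a minimal sketch of the most common such conversion, one-hot encoding, in plain Python (the category names are purely illustrative):

    # Map each category to a one-hot vector; the categories are illustrative.
    categories = ["cat", "dog", "panda"]
    index = {c: i for i, c in enumerate(categories)}

    def one_hot(category):
        """Return a one-hot vector for the given category."""
        vec = [0.0] * len(categories)
        vec[index[category]] = 1.0
        return vec

    print(one_hot("panda"))  # [0.0, 0.0, 1.0]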

Encodings for Inputs

The appropriate type of encoding can depend on how it is used in a system. For example, if an encoding serves primarily as an input feature to an ML system, then decoding a noisy number or vector back to its category is unimportant. Even so, one should be cautious not to introduce implicit relationships between classes that are not there; an example with ordinal encoding is sketched below. In this regard, statistically based encodings such as contrast or Bayesian encoding may be appropriate, since statistical relationships between categories may be desirable to preserve.
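
To make the caution concrete, here is a small sketch (with purely illustrative categories) of how ordinal encoding smuggles in an ordering, and distances, that the original categories do not have:

    # Ordinal encoding assigns arbitrary integers to unordered categories.
    colors = ["red", "green", "blue"]
    ordinal = {c: i for i, c in enumerate(colors)}  # red=0, green=1, blue=2

    # The integers now imply "blue" is twice as far from "red" as "green" is,
    # a relationship absent from the original categorical data.
    print(abs(ordinal["blue"] - ordinal["red"]))   # 2
    print(abs(ordinal["green"] - ordinal["red"]))  # 1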

Decoding Encodings

In some applications, such as classification, decoding a noisy encoded number or vector back to a category is very important. For example, in a system that classifies animal images, it is important to know what a given output vector represents, especially when that vector does not precisely match any encoding. Does the vector (0.1, 0.2, 0, 0.9) decode to a panda when panda encodes to (0, 0, 0, 1)? Deep learning networks often apply a softmax function, which converts the output of the network into a probability distribution over the categories. For categorical outputs, a maximum-likelihood (cross-entropy) loss function, which requires a probability distribution, is often used rather than the more numerically oriented mean squared error.
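
A minimal sketch of this decoding step, using a softmax followed by an argmax (the class ordering is an assumption; the example vector comes from the discussion above):

    import math

    labels = ["dog", "cat", "bear", "panda"]  # assumed class order

    def softmax(xs):
        """Convert raw network outputs into a probability distribution."""
        exps = [math.exp(x) for x in xs]
        total = sum(exps)
        return [e / total for e in exps]

    output = [0.1, 0.2, 0.0, 0.9]  # noisy network output
    probs = softmax(output)
    decoded = labels[probs.index(max(probs))]  # most probable class
    print(decoded)  # panda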

Types of Encoding

Encoding methods can be categorized as classic, contrast, Bayesian, and word embeddings. Classic, contrast, and Bayesian encodings are given a good overview treatment in Hale's blog, with examples available in the scikit-learn-contrib category_encoders package. Both contrast encoding and Bayesian encoding use the statistics of the data to facilitate encoding. These two categories may be of use when more statistical analysis is required; however, they have not seen widespread adoption in machine learning. Word embeddings form a category of their own and are very important when machine learning is used for natural language processing, which is a different discussion altogether.
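
As a brief sketch of how these categories look in practice, the following uses the category_encoders package mentioned above (the toy data and column name are hypothetical), applying a contrast encoder and a Bayesian target encoder to the same column:

    import pandas as pd
    import category_encoders as ce

    # Toy data: one categorical feature and a binary target.
    X = pd.DataFrame({"color": ["red", "green", "blue", "green", "red"]})
    y = pd.Series([1, 0, 1, 0, 1])

    # Contrast encoding: backward difference coding compares adjacent levels.
    contrast = ce.BackwardDifferenceEncoder(cols=["color"])
    print(contrast.fit_transform(X, y))

    # Bayesian encoding: target encoding blends each category's mean target
    # with the global mean.
    bayesian = ce.TargetEncoder(cols=["color"])
    print(bayesian.fit_transform(X, y))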

Dimensionality Issues

As a general rule, the higher the dimension, the fewer geometric artifacts are induced by the encoding. Of course, this does not come without cost: higher-dimensional encodings generally increase the complexity of a neural network and the computational resources it requires. There are also geometric issues such as the "curse of dimensionality".

Also as a general rule, the higher the dimension of a machine learning model, the more samples are needed to train it; otherwise one runs into the high dimension, low sample size (HDLSS) situation, which is a problem for most machine learning methods.

For high-cardinality categories, there is a tradeoff among the intended use (input vs. output), the drawbacks of high dimensionality, and geometric artifacts. The proper choice depends on your situation. Quasiorthonormal encoding and spherical encoding may offer a happy medium and are the central focus of this poster, as sketched below.
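
As a hedged sketch of the underlying idea (not the poster's exact construction): quasiorthonormal encoding assigns each category a unit vector from a set of nearly orthogonal vectors in a space whose dimension is smaller than the number of categories, and decodes a noisy vector to the category with the largest dot product. Here the quasiorthonormal set is only approximated with random normalized vectors, and the category count and dimension are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative: 8 categories embedded in 5 dimensions. Random unit
    # vectors tend toward orthogonality as the dimension grows, loosely
    # approximating a quasiorthonormal set; the poster's construction
    # may differ.
    n_categories, dim = 8, 5
    codes = rng.normal(size=(n_categories, dim))
    codes /= np.linalg.norm(codes, axis=1, keepdims=True)

    def encode(i):
        return codes[i]

    def decode(vec):
        """Decode a noisy vector to the category with the largest dot product."""
        return int(np.argmax(codes @ vec))

    noisy = encode(3) + 0.1 * rng.normal(size=dim)
    print(decode(noisy))  # very likely 3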