West Health Data Science Blog

Applied Data Science to Lower Healthcare Costs for Successful Aging

Fri 21 February 2020

Categorical Encodings

Posted by Haw-minn Lu in Machine Learning   

Introduction

While most popular machine learning methods, such as deep learning, require numerical data as input, categorical data is extremely common in practice. For example, a person's vital statistics may combine both kinds: height and weight (numerical) alongside gender and race (categorical). The challenge is to convert the categorical data into a vector of some sort. A good overview of most encoding techniques can be found at this blog post, whose author further categorizes the coding techniques as classic, contrast, and Bayesian. Notably, word embeddings are not mentioned. Both contrast encoding and Bayesian encoding use the statistics of the data to facilitate encoding. These two categories may be of use when more statistical analysis is required; however, there has not been widespread adoption of these encoding techniques for machine learning. These encoding techniques can be found as part of the scikit-learn-compatible category encoders package.

Though we will confine our discussion to classic encodings and their modern derivatives, word embeddings deserve special mention. Word embeddings represent words, phrases, or even entire documents as vectors, with the goal that similar meanings or concepts map to vectors that are close together in the target vector space. They are also well suited to encoding a categorical feature with very high cardinality (i.e., words) into a relatively low-dimensional space.

Ordinal Encoding

To begin our overview of fundamental encoding methods, we start with ordinal (label) encoding. Ordinal encoding is the simplest and perhaps most naive approach to encoding a categorical feature --- one simply assigns a number to each member of the category. This is often how survey data are encoded into spreadsheets for easy storage and calculation of basic statistics. An associated data dictionary is used to convert the values back and forth between a number and a category. Take, for example, the case of gender: male could be encoded as 1 and female as 2, with a data dictionary as follows: {'male': 1, 'female': 2}.
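As a minimal sketch, ordinal encoding with an explicit data dictionary can be done directly in pandas; the column name and the tiny example data here are purely illustrative.

```python
import pandas as pd

# Illustrative data dictionary mapping each category to an integer code.
gender_dict = {'male': 1, 'female': 2}

df = pd.DataFrame({'gender': ['male', 'female', 'female', 'male']})

# Encode: map each category to its ordinal code.
df['gender_code'] = df['gender'].map(gender_dict)

# Decode: invert the dictionary to recover the original categories.
inverse_dict = {code: label for label, code in gender_dict.items()}
df['gender_decoded'] = df['gender_code'].map(inverse_dict)

print(df)
```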

Suppose we have three categories of ethnic groups: White, Black, and Asian. Under ordinal encoding, suppose White is encoded as 1, Black is encoded as 2, and Asian is encoded as 3. If a machine learning classifier is somehow confused between Asian and White, it may split the difference and report the in-between value (2), which encodes Black. The issue is that the arbitrary gradation between 1 and 3 introduces a natural interpolation (2) that may be nonsense. In other words, the natural ordering of the numbers imposes an ordered geometric relationship between the categories that does not actually apply.

Nonetheless, there are situations where ordinal encoding makes sense because the categories have a genuine order. For example, a 'rate your satisfaction' survey typically encodes five levels: (1) terrible, (2) mediocre, (3) acceptable, (4) good, (5) excellent.

One Hot Encoding

This is the most common encoding used in machine learning. One-hot encoding takes a category with cardinality \(N\) and encodes each categorical value as an \(N\)-dimensional vector with a single '1' and the remaining entries '0'. Take as an example encoding five makes of Japanese cars: Toyota, Honda, Subaru, Nissan, and Mitsubishi. The following table compares the ordinal and one-hot encodings:

Make        Ordinal  One-Hot
Toyota      1        (1,0,0,0,0)
Honda       2        (0,1,0,0,0)
Subaru      3        (0,0,1,0,0)
Nissan      4        (0,0,0,1,0)
Mitsubishi  5        (0,0,0,0,1)

The advantage is that one-hot encoding does not induce an implicit ordering or geometric relationship between categories. The primary disadvantage is that the dimensionality of the problem increases, with corresponding increases in complexity and computation (see the curse of dimensionality). This easily leads to the high dimensionality, low sample size (HDLSS) situation, which is a problem for most machine learning methods.
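As a sketch, the one-hot table above can be reproduced with pandas; the series name and example data are illustrative only.

```python
import pandas as pd

# The five car makes from the table above.
makes = pd.Series(['Toyota', 'Honda', 'Subaru', 'Nissan', 'Mitsubishi'], name='make')

# Each make becomes its own 0/1 indicator column (exactly one '1' per row).
# Note that pandas orders the indicator columns alphabetically.
one_hot = pd.get_dummies(makes, dtype=int)
print(one_hot)
```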

Binary Encoding, Hash Encoding, BaseN Encoding

Somewhere in between these two extremes lie binary encoding, hash encoding, and baseN encoding. Binary encoding simply labels each category with a unique binary code and converts that binary code to a vector. Using the previous example of the Japanese car makes:

Make        Ordinal  As Binary  Binary Code
Toyota      1        001        (0,0,1)
Honda       2        010        (0,1,0)
Subaru      3        011        (0,1,1)
Nissan      4        100        (1,0,0)
Mitsubishi  5        101        (1,0,1)
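A minimal sketch of binary encoding, using the ordinal codes from the table above; the helper name `to_binary_vector` and the fixed width of 3 bits are illustrative choices, not part of any particular library.

```python
def to_binary_vector(ordinal, width=3):
    """Convert an ordinal label to a tuple of its binary digits (most significant first)."""
    return tuple((ordinal >> shift) & 1 for shift in reversed(range(width)))

ordinals = {'Toyota': 1, 'Honda': 2, 'Subaru': 3, 'Nissan': 4, 'Mitsubishi': 5}
binary_codes = {make: to_binary_vector(code) for make, code in ordinals.items()}
print(binary_codes)   # e.g. Toyota -> (0, 0, 1), Nissan -> (1, 0, 0)
```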

Hash encoding assigns each category an ordinal value that is then converted into a binary hash value, which is encoded as an \(n\)-tuple in the same fashion as binary encoding. You can view hash encoding as binary encoding applied to the hashed ordinal value. Hash encoding has several advantages. First, it is open ended, so new categories can be added later without changing the encoding scheme. Second, the resultant dimensionality can be much lower than that of one-hot encoding. The chief disadvantage is that two categories can accidentally map to the same hash value. This is a hash collision and must be handled separately by a collision resolution mechanism. A good treatment of hash encoding can be found here.
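A rough sketch of the hashing idea follows, using Python's built-in `hashlib`. As a simplification, it hashes the category string directly rather than an intermediate ordinal value, and the 8-bit width and helper name are arbitrary illustrative choices; collisions remain possible and would still need separate resolution.

```python
import hashlib

def hash_encode(category, n_bits=8):
    """Hash a category string and keep the n_bits low-order bits as a 0/1 tuple."""
    digest = hashlib.md5(category.encode('utf-8')).hexdigest()
    value = int(digest, 16) % (2 ** n_bits)
    return tuple((value >> shift) & 1 for shift in reversed(range(n_bits)))

for make in ['Toyota', 'Honda', 'Subaru', 'Nissan', 'Mitsubishi', 'Mazda']:
    print(make, hash_encode(make))   # a new category (e.g. 'Mazda') needs no re-fit
```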

Finally, baseN encoding is a generalization of binary encoding that uses a number base other than 2 (binary). Below is an example for the Japanese car makes using base 3:

Make        Ordinal  As Ternary  Ternary Code  Balanced Ternary Code
Toyota      1        01          (0,1)         (0,1)
Honda       2        02          (0,2)         (0,-1)
Subaru      3        10          (1,0)         (1,0)
Nissan      4        11          (1,1)         (1,1)
Mitsubishi  5        12          (1,2)         (1,-1)
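The ternary codes in the table can be reproduced with a small base-N digit sketch, generalizing the binary helper above; the function name and fixed width are again illustrative.

```python
def to_base_n_vector(ordinal, base=3, width=2):
    """Return the base-`base` digits of `ordinal`, most significant first."""
    digits = []
    for _ in range(width):
        digits.append(ordinal % base)
        ordinal //= base
    return tuple(reversed(digits))

ordinals = {'Toyota': 1, 'Honda': 2, 'Subaru': 3, 'Nissan': 4, 'Mitsubishi': 5}
print({make: to_base_n_vector(code) for make, code in ordinals.items()})
# e.g. Honda -> (0, 2), Mitsubishi -> (1, 2); the balanced ternary column in the
# table simply remaps the digit 2 to -1.
```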

A disadvantage of all three of these techniques is that, while they reduce the dimension of the encoded feature, artificial geometric relationships may creep in between unrelated categories. For example, an output of (0.7,0.7) may represent confusion between Toyota and Honda or a weak Subaru result, although this is still better than ordinal encoding.

Quasiorthonormal Encoding

Quasiorthonormal encoding (QOE) is a generalization of one-hot encoding that exploits the fact that in high-dimensional vector spaces, two random vectors are generally almost orthogonal. The concept originated with Kůrková and Kainen almost 25 years ago. The principle is the same as in one-hot encoding if you view the encoded values as unit vectors, as shown below:

Make        Ordinal  One-Hot           QOE
Toyota      1        \(\mathbf{u}_1\)  \(\mathbf{q}_1\)
Honda       2        \(\mathbf{u}_2\)  \(\mathbf{q}_2\)
Subaru      3        \(\mathbf{u}_3\)  \(\mathbf{q}_3\)
Nissan      4        \(\mathbf{u}_4\)  \(\mathbf{q}_4\)
Mitsubishi  5        \(\mathbf{u}_5\)  \(\mathbf{q}_5\)

where the \(\mathbf{u}_i\) are unit vectors in \(\mathbb{R}^5\) and \(\{\mathbf{q}_i\}\) is a set of quasiorthonormal vectors in a space of dimension less than 5. The set \(\{\mathbf{q}_i\}\) is quasiorthonormal if each \(\mathbf{q}_i\) is normalized (\(\|\mathbf{q}_i\|=1\)) and \(|\mathbf{q}_i\cdot \mathbf{q}_j|<\epsilon\) for some small value of \(\epsilon\) and all \(i\ne j\). In other words, if \(i\) and \(j\) are distinct, then \(\mathbf{q}_i\) and \(\mathbf{q}_j\) are almost perpendicular.
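As a rough sketch of the underlying idea (not necessarily the construction used by Kůrková and Kainen, or the one we will use later), random unit vectors can serve as candidate quasiorthonormal codes, and the largest pairwise dot product plays the role of \(\epsilon\); the dimensions and function name below are illustrative.

```python
import numpy as np

def random_quasiorthonormal_vectors(n_categories, dim, seed=0):
    """Draw random unit vectors; in high dimensions they are nearly orthogonal."""
    rng = np.random.default_rng(seed)
    vectors = rng.standard_normal((n_categories, dim))
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# Encode 5 car makes with vectors in a space of dimension 3 (< 5);
# each category would then be represented by one row of q.
q = random_quasiorthonormal_vectors(n_categories=5, dim=3)

# The largest off-diagonal |q_i . q_j| is the effective epsilon;
# it shrinks as the dimension grows, giving a "more orthonormal" set.
gram = np.abs(q @ q.T) - np.eye(5)
print(gram.max())
```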

We will explore more about QOE in the next article.

Summary

Although most machine learning methods require numerical data, categorical data is very common in certain areas, such as medical data science and survey analysis. We reviewed the most popular encoding methods for making this kind of data tractable for machine learning, but each has drawbacks in computation or interpretation that can lead to confusing downstream results if not handled correctly at the outset.