Categorial Encodings
Posted by Haw-minn Lu in Machine Learning
Introduction
While most popular machine learning methods such as deep learning require
numerical data as input, categorical data is very common practice. For
example, a person's vitals could be a combination of both, they could include
height, weight (numerical) and gender, race (categorical). The challenge is to
convert the categorical data into a vector of some sort. There is a good
overview of most encoding techniques at this blog
post.
He further categorizes the coding techniques as classic, contrast, and
Bayesian. Notably, word embeddings are not mentioned. Both contrast
encoding and Bayesian encoding use the
statistics of the data to facilitate encoding. These two categories may be of
use when more statistical analysis is required, however there has not been widespread adoption of these emcoding techniques for machine learning. These encoding techiques can be found as
part of the scikit-learn
category encoding
package.
Though we will pursue a discussion of classic encodings and modern derivatives, word embeddings deserve special mention. Word embeddings are used to represent words, phrases or even entire documents as a vector. Their goal is that similar meaning or concepts get mapped to vectors that are close in the target vector space. Additionally, it is adapted for encoding a large categorical feature (i.e., words) into a relatively lower dimensional space.
Ordinal Encoding
To begin our overview of fundamental encoding methods, we start with Ordinal (Label)
Encoding. Ordinal encoding is the
simplest and perhaps most naive approach encoding a categorical feature --- one simply assigns a number to each member of a category. This is
often how data from surveys are encoded into spreadsheets for easy storage and
calculation of basic statistics. An associated data dictionary is used to
convert the values back and forth between a number and a category. Take for
example the case of gender, male could be encoded as 1 and female as 2, with a
data dictionary as follows: {'male': 1, 'female': 2}
Suppose we have three categories of ethnic groups: White, Black, and Asian. Under ordinal encoding, suppose What is encoded as 1, Black is encoded as 2 and Asian is encoded as 3. If a machine learning classification is somehow confused between Asian and White and decides to split the difference and report the in-between value (2) which encodes Black. The issue is that arbitrary gradation between 1 and 3 introduces a natural interpolation (2) that may be nonsense. Thus, the natural ordering of the numbers imposes an ordered geometrical relationship between the categories that does not apply.
Nonetheless there are situations where ordinal encoding makes sense. For example, a 'rate your satisfaction' survey typically encodes five levels (1) terrible, (2) acceptable (3) mediocre, (4) good, (5) excellent.
One Hot Encoding
This is the most common encoding used in machine learning. One hot encoding takes a category with cardinality \(N\) and encodes each categorical value with an \(N\)-dimensional vector with a single '1' and the remainder '0's. Take as an example encoding 5 makes of Japanese Cars: Toyota, Honda, Subaru, Nissan, Mitsubishi. The following table shows a comparison of coding between ordinal and one-hot:
Make | Ordinal | One-Hot |
---|---|---|
Toyota | 1 | (1,0,0,0,0) |
Honda | 2 | (0,1,0,0,0) |
Subaru | 3 | (0,0,1,0,0) |
Nissan | 4 | (0,0,0,1,0) |
Mitsubishi | 5 | (0,0,0,0,1) |
The advantage is that one hot encoding doesn't induce an implicit ordering or between categories. The primary disadvantage is that the dimensionality of the problem has increased with corresponding increases in complexity and computation (see curse of dimensionality) This easily leads to the high dimensionality low sample size (HDLSS) situation, which is a problem for most machine learning methods.
Binary Encoding, Hash Encoding, BaseN Encoding
Somewhere in between these two are binary encoding, hash encoding, and baseN encoding. Binary encoding simply labels each category with a unique binary code and converts the binary code to a vector. Using the previous example of the Japanese car makes:
Make | Ordinal | as Binary | Binary Code |
---|---|---|---|
Toyota | 1 | 001 | (0,0,1) |
Honda | 2 | 010 | (0,1,0) |
Subaru | 3 | 011 | (0,1,1) |
Nissan | 4 | 100 | (1,0,0) |
Mitsubishi | 5 | 101 | (1,0,1) |
Hash encoding assigns each category an ordinal value that is then converted into a binary hash value that is encoded as an \(n\)-tuple in the same fashion as the binary encoding. You can view hash encoding as binary encoding applied to the hashed ordinal value. Hash encoding has several advantages. First, it is open ended so new categories can be added later. Second, the resultant dimensionality can be much lower than one-hot encoding. The chief disadvantage is that categories can collide if two categories accidentally map into the same hash value. This is a hash collision and must be fixed separately using a resolution mechanism. A good treatment of hash coding can be found here
Finally, baseN encoding is a generalization of binary encoding that uses a number base other than 2 (binary). Below is an example of the Japanese car makes using base 3,
Make | Ordinal | as Ternary | Ternary Code | Balanced Ternary Code |
---|---|---|---|---|
Toyota | 1 | 01 | (0,1) | (0,1) |
Honda | 2 | 02 | (0,2) | (0,-1) |
Subaru | 3 | 10 | (1,0) | (1,0) |
Nissan | 4 | 11 | (1,1) | (1,1) |
Mitsubishi | 5 | 12 | (1,2) | (1,-1) |
A disadvantage of all three of these techniques is that while it does reduce the
dimension of the encoded feature, artificial geometric relationships
may creep in between unrelated categories. For example,
(0.7,0.7)
may be confusion between Toyota and Honda or a weak Subaru result,
although this is still better than ordinal encoding.
Quasiorthonormal Encoding
Quasiorthonormal encoding (QOE) is a generalization of the one-hot encoding and exploits the fact that in high-dimensional vector spaces, two random vectors are generally almost orthogonal. The concept originated with Kůrková and Kainen almost 25 years ago. The principle is the same as one-hot encoding when you view the encoded values as unit vectors as shown:
Make | Ordinal | One-Hot | QOE |
---|---|---|---|
Toyota | 1 | \(\mathbf{u}_1\) | \(\mathbf{q}_1\) |
Honda | 2 | \(\mathbf{u}_2\) | \(\mathbf{q}_2\) |
Subaru | 3 | \(\mathbf{u}_3\) | \(\mathbf{q}_3\) |
Nissan | 4 | \(\mathbf{u}_4\) | \(\mathbf{q}_4\) |
Mitsubishi | 5 | \(\mathbf{u}_5\) | \(\mathbf{q}_5\) |
where \(\mathbf{u}_i\) are unit vectors in \(\mathbb{R}^5\) and \({\mathbf{q}_i}\) are a set of quasiorthogonal vectors in a dimension less than 5. The set \({\mathbf{q}_i}\) is quasiorthonormal if each \(\mathbf{q}_i\) is normal (\(||\mathbf{q}_i||=1\)) and \(|\mathbf{q}_i\cdot \mathbf{q}_j|<\epsilon\) for some small \(\epsilon\) value and for \(i\ne j\). In otherwords, if \(i\) and \(j\) are distinct \(\mathbf{q}_i\) and \(\mathbf{q}_j\) are almost perpendicular.
We will explore more about QOE in the next article.
Summary
Although most machine learning methods require numerical data, categorical data is very common in certain areas, such as medical data science or survey analysis. We reviewed the most popular encoding methods to make this kind of data tractable for machine learning, but each of them has drawbacks in computation or interpretation that can lead to confusing downstream results, if not handled correctly at the outset.