Binary Encoding, Hash Encoding, BaseN Encoding

Our example with the Japanese car makes:

Make Ordinal as Binary Binary Code
Toyota 1 001 (0,0,1)
Honda 2 010 (0,1,0)
Subaru 3 011 (0,1,1)
Nissan 4 100 (1,0,0)
Mitsubishi 5 101 (1,0,1)

Between ordinal encoding and one-hot encoding Somewhere in between these two are binary encoding, hash encoding and baseN encoding. Binary encoding simply labels each category with a
unique binary code and converts the binary code to a vector. Using the previous example of the Japanese car makes, the above table shows an example of binary encoding. BaseN encoding is a generalization of binary encoding that uses a number base other than 2 (binary).

Hash encoding assigns each category an ordinal value that is then converted into a binary hash value that is encoded as an -tuple in the same fashion as the binary encoding. You can view hash encoding as binary encoding applied to the hashed ordinal value. Hash encoding has several advantages. First, it is open ended so new categories can be added later. Second, the resultant dimensionality can be much lower than one-hot encoding. The chief disadvantage is that categories can collide if two categories accidentally map into the same hash value. This is a hash collision and must be fixed separately using a resolution mechanism. Bernardi's blog provides a good treatment of hash coding.

A disadvantage of all three of these techniques is that while it does reduce the dimension of the encoded feature, artificial geometric relationships may creep in between unrelated categories. For example, (0.7,0.7) may be confusion between Toyota and Honda or a weak Subaru result, although the effect is not as pronounced as ordinal encoding.

Back to Classic Encodings