West Health Data Science Blog

Applied Data Science to Lower Healthcare Costs for Successful Aging

Fri 21 February 2020

Quasiorthonormal Encoding

Posted by Haw-minn Lu in Machine Learning   

Introduction

Quasiorthonormal encoding (QOE) is a generalization of one-hot encoding that exploits the fact that in high-dimensional vector spaces, two random vectors are almost always nearly orthogonal. The concept originated with Kůrková and Kainen almost 25 years ago. The principle is the same as one-hot encoding when you view the encoded values as unit vectors, as shown below:

Make         Ordinal   One-Hot            QOE
Toyota       1         \(\mathbf{u}_1\)   \(\mathbf{q}_1\)
Honda        2         \(\mathbf{u}_2\)   \(\mathbf{q}_2\)
Subaru       3         \(\mathbf{u}_3\)   \(\mathbf{q}_3\)
Nissan       4         \(\mathbf{u}_4\)   \(\mathbf{q}_4\)
Mitsubishi   5         \(\mathbf{u}_5\)   \(\mathbf{q}_5\)

where \(\mathbf{u}_i\) are unit vectors in \(\mathbb{R}^5\) and \(\{\mathbf{q}_i\}\) is a set of quasiorthogonal vectors in a space of dimension less than 5. The set \(\{\mathbf{q}_i\}\) is quasiorthonormal if each \(\mathbf{q}_i\) is normal (\(||\mathbf{q}_i||=1\)) and \(|\mathbf{q}_i\cdot \mathbf{q}_j|<\epsilon\) for some small value of \(\epsilon\) and all \(i\ne j\). In other words, if \(i\) and \(j\) are distinct, then \(\mathbf{q}_i\) and \(\mathbf{q}_j\) are almost perpendicular.
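
As a concrete illustration, here is a minimal numpy sketch (the vectors and the \(\epsilon\) threshold are made up for this example) that checks whether a small set of unit vectors is quasiorthonormal in this sense:

import numpy

# Hypothetical toy example: three unit vectors in R^3 and a tolerance epsilon.
q = numpy.array([[1.0, 0.0, 0.0],
                 [0.1, 0.99498744, 0.0],
                 [0.1, 0.0, 0.99498744]])
epsilon = 0.15

# Every vector should have unit norm ...
print(numpy.allclose(numpy.linalg.norm(q, axis=1), 1.0))   # True
# ... and each pair of distinct vectors should satisfy |q_i . q_j| < epsilon.
dots = numpy.abs(q @ q.T) - numpy.eye(len(q))
print(dots.max() < epsilon)                                # True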

By relaxing the condition that the encoded values be strictly orthogonal, many more codes can fit into a smaller number of dimensions. A similar principle is used in Code Division Multiple Access (CDMA), which underlies some mobile telephony systems. In lay terms, if you allow almost orthogonal vectors rather than strictly orthogonal ones, you can fit more vectors into the same vector space.

Some advantages of QOE include a reduction in dimensionality compared to one-hot encoding, limiting the effects of the "curse of dimensionality" or the problem of high dimension, low sample size (HDLSS). The advantage over other encodings such as binary, hash, etc., is that QOE does not induce artificial geometric relationships that can bias downstream results, because each label in a category remains nearly orthogonal to every other label.

While Kůrková and Kainen proposed this type of encoding almost 25 years ago, machine learning has advanced since then so we include a short recap of the relevant issues next.

One Hot Encoding Recap

In machine learning, one-hot encoding typically maps a variable with \(n\) categories to a set of unit vectors in an \(n\)-dimensional space: given the labels \(L=\{l_i\}\) for \(i=1,\ldots,n\), the one-hot encoding is the map \(\mathbb{1}_L:L \to \mathbb{R}^n\) given by \(l_i \mapsto \mathbf{u}_i\), where \(\{\mathbf{u}_i\}\) is an orthonormal basis of \(\mathbb{R}^n\). The simplest choice is \(\mathbf{u}_i = (0,0,\ldots, 1, 0,\ldots, 0)\), where the \(1\) is in the \(i\)th position; this is known as the standard basis for \(\mathbb{R}^n\).

Mapping a vector back to the original category uses the argmax function: for a vector \(\mathbf{z}\), \(\mathrm{argmax}(\mathbf{z}) = i\) where \(z_i>z_j\) for all \(j\ne i\), and \(\mathbf{z}\) decodes to \(l_{\mathrm{argmax}(\mathbf{z})}\). Of course, the argmax function is not differentiable, which presents problems for machine learning algorithms that require derivatives. To fix this, a smoothed version called the softargmax, or now simply the softmax, is used; it is defined as follows:

$$\mathrm{softmax}(\mathbf{z})_i={e^{z_i}\over \sum_{j=1}^n e^{z_j}}$$

for \(i=1,2,\ldots,n\) and \({\bf z}=(z_1, z_2,\ldots, z_n) \in \mathbb{R}^n\), where \(\mathbf{z}\) is the vector being decoded. The softmax function decodes a one-hot encoded vector into a probability distribution over the labels, which enables the use of negative log-likelihood or cross-entropy loss functions.
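
For example, a minimal numpy sketch of this encode/decode round trip (the labels and the network output \(\mathbf{z}\) are purely illustrative) might look like this:

import numpy

labels = ['Toyota', 'Honda', 'Subaru', 'Nissan', 'Mitsubishi']
onehot = numpy.eye(len(labels))               # row i is the standard basis vector u_i

def softmax(z):
    e = numpy.exp(z - z.max())                # shift for numerical stability
    return e / e.sum()

z = numpy.array([0.1, 2.3, 0.4, 0.0, -0.5])   # a made-up network output
print(labels[int(numpy.argmax(z))])           # hard decode -> 'Honda'
print(softmax(z))                             # soft decode -> a probability distribution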

We can generalize this one-hot encoding formulation to an arbitrary orthonormal basis \(\{\mathbf{u}_i\}\). To decode an encoded value in this framework, we take \(i = \mathrm{argmax}(\mathbf{z}\cdot\mathbf{u}_1,\mathbf{z}\cdot\mathbf{u}_2,\ldots,\mathbf{z}\cdot\mathbf{u}_n)\), which reduces to \(\mathrm{argmax}(\mathbf{z})\) for the standard basis. The softmax function can then be expressed as follows,

$$\mathrm{softmax}({\bf z})_i={e^{{\bf z}\cdot {\bf u}_i}\over \sum_{j=1}^n e^{{\bf z}\cdot {\bf u}_j}}.$$
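
A quick numpy check of this formulation, using an arbitrary orthonormal basis obtained from a QR decomposition (the basis and the noise level are chosen only for illustration):

import numpy

rng = numpy.random.default_rng(0)
U, _ = numpy.linalg.qr(rng.normal(size=(5, 5)))   # the columns of U form an orthonormal basis
u = U.T                                           # row u[i] is the basis vector u_i

z = u[1] + 0.05 * rng.normal(size=5)              # a noisy encoding of label index 1
scores = u @ z                                    # (z.u_1, z.u_2, ..., z.u_n)
print(numpy.argmax(scores))                       # 1, since the noise is small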

Quasiorthogonality

Given an \(\epsilon\), two vectors \({\bf x}\) and \({\bf y}\) are said to be quasiorthogonal if \({|{\bf x}\cdot {\bf y}|\over \|{\bf x}\| \|{\bf y}\|}<\epsilon\). This extends the notion of orthogonality by allowing the inner product to be merely close to zero rather than exactly zero. By extension, we can define a quasiorthonormal basis as a set of normal vectors \(\{{\bf q}_i\}\) for \(i=1,\ldots,K\) such that \(||{\bf q}_i||=1\) for all \(i\) and \(|{\bf q}_i\cdot {\bf q}_j|<\epsilon\) for all \(i\ne j\), where in principle, for large enough \(n\), \(K\gg n\). Kainen and Kůrková derived a lower bound for \(K\) as a function of \(\epsilon\) and \(n\). Namely,

$$K \ge e^{n\epsilon^2}.$$
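
To get a feel for how quickly this bound grows, a small back-of-the-envelope calculation (the particular values of \(n\) and \(\epsilon\) are arbitrary):

import math

# Lower bound on the number of quasiorthonormal vectors: K >= exp(n * eps^2).
for n, eps in [(10, 0.5), (20, 0.5), (30, 0.3)]:
    print(n, eps, math.exp(n * eps ** 2))
# For example, n=20 and eps=0.5 already admits on the order of 148 vectors.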

Quasiorthonormal Encoding

Briefly, quasiorthonormal encoding simply substitutes a quasiorthonormal basis \(\{{\bf q}_i\}\) for the orthonormal basis \(\{{\bf u}_i\}\) used above. More formally, given a quasiorthonormal basis, we can define a QOE for a set \(L=\{l_i\}\) by \({\mathbb q}(l_i)= {\bf q}_i\). Decoding a vector \(\mathbf{z}\) under QOE uses the following formula:

$$i = \mathrm{argmax}(\mathbf{z}\cdot\mathbf{q}_1,\mathbf{z}\cdot\mathbf{q}_2,\ldots,\mathbf{z}\cdot\mathbf{q}_K)$$

The analogous softmax function, let's call it qsoftmax, would be expressed as

$$\mathrm{qsoftmax}({\bf z})_i={e^{{\bf z}\cdot {\bf q}_i}\over \sum_{j=1}^K e^{{\bf z}\cdot {\bf q}_j}}$$

The only real difference in the formulation is that, while still operating in \({\mathbb R}^n\), we are encoding \(K>n\) labels.
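
Put together, the encoding and decoding steps might look like the following numpy sketch; the array q of quasiorthonormal vectors is a placeholder here, and in practice it would come from a spherical code as discussed below.

import numpy

# q is a hypothetical K x n array whose rows are the quasiorthonormal vectors (K > n).
def qoe_encode(label_index, q):
    return q[label_index]

def qoe_decode(z, q):
    return int(numpy.argmax(q @ z))          # argmax of (z.q_1, ..., z.q_K)

def qsoftmax_np(z, q):
    scores = q @ z
    e = numpy.exp(scores - scores.max())     # numerically stable softmax over the K scores
    return e / e.sum()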

Implementation

To take advantage of modern vectorized computation packages such as numpy and tensorflow, we define the following \(n\times K\) change-of-coordinates matrix

$$\mathbf{Q}= \left[\begin{matrix} \bigg| & \bigg| & &\bigg | \\ \mathbf{q}_1 & \mathbf{q}_2 & \cdots & \mathbf{q}_K \\ \bigg| & \bigg| & &\bigg | \end{matrix}\right].$$

This matrix transforms between the QOE space and the one-hot encoding space. Given an argmax or softmax function, we can then express the quasiorthonormal variants as follows

$$\mathrm{qargmax}(\mathbf{z}) = \mathrm{argmax}(\mathbf{Q}^{\mathsf{T}}\mathbf{z})$$

and

$$\mathrm{qsoftmax}(\mathbf{z}) = \mathrm{softmax}(\mathbf{Q}^{\mathsf{T}}\mathbf{z}).$$
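
As a quick consistency check, the following numpy sketch (with a randomly generated, merely illustrative \(\mathbf{Q}\)) confirms that the matrix form produces the same scores as computing the dot products one at a time:

import numpy

rng = numpy.random.default_rng(1)
n, K = 4, 10
Q = rng.normal(size=(n, K))
Q = Q / numpy.linalg.norm(Q, axis=0)               # normalize the columns q_1, ..., q_K

z = rng.normal(size=n)
one_at_a_time = numpy.array([z @ Q[:, i] for i in range(K)])
matrix_form = Q.T @ z
print(numpy.allclose(one_at_a_time, matrix_form))  # True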

This lets us use optimized functions such as softmax from libraries like tensorflow, and the matrix above makes it straightforward to implement QOE in these packages. In the examples below, we use tensorflow to test the effectiveness of QOE against one-hot encoding.

Construction of a Quasiorthonormal Set

It is difficult to find explicit constructions of quasiorthonormal sets in the literature. Several methods are mentioned by Kainen, but these constructions are somewhat theoretical and hard for the lay person to follow. There are a number of related combinatorial problems, such as spherical codes and Steiner triple systems, which strive to find optimal solutions. As a practical matter, we only need fast suboptimal solutions, so we can use spherical codes. Spherical codes try to find a set of points on the \(n\)-dimensional hypersphere such that the minimum distance between any two points is maximized. In most constructions of spherical codes, a given point's antipodal point is also in the code set, so to obtain a quasiorthogonal set, only one element of each pair of antipodal points is selected, as sketched below.
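
A sketch of that antipodal-pair selection, assuming the spherical code is supplied as an array of unit vectors called points (obtaining the code itself is outside the scope of this snippet):

import numpy

def quasiorthogonal_set(points, tol=1e-6):
    """Keep one vector from each antipodal pair of a spherical code."""
    kept = []
    for p in points:
        # Skip p if its antipode is already (numerically) in the kept set.
        if any(numpy.allclose(-p, k, atol=tol) for k in kept):
            continue
        kept.append(p)
    return numpy.array(kept)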

Experiment and Demonstration

As an initial experiment, we applied QOE to classification of the MNIST handwriting dataset, using the 60,000 training examples and 10,000 test examples. Since there are 10 categories, we needed quasiorthonormal sets with at least 10 elements. We took the spherical code for 24 points in 4 dimensions, giving us 12 quasiorthogonal vectors; the maximum pairwise dot product was 0.5, corresponding to an angle of 60\(^\circ\). We also took the spherical code for 56 points in 7 dimensions, giving 28 quasiorthogonal vectors; the maximum pairwise dot product was 0.33, corresponding to an angle of a little over 70\(^\circ\).

We used a hidden layer of 64 units with a ReLU activation function, followed by a 20% dropout layer to mitigate overfitting, and then an output layer whose width depends on the encoding used.

As a preliminary, we define a qsoftmax metafunction:

import tensorflow as tf

def qsoftmax(basis):
    # basis: a K x n array whose rows are the encoding vectors (the transpose of Q above).
    def func(x):
        # x has shape (batch, n); compute the K dot products z.q_i for each sample.
        qx = tf.matmul(tf.constant(basis), x, transpose_b=True)
        return tf.nn.softmax(tf.transpose(qx))
    return func

which takes a quasiorthogonal basis and returns the quasiorthogonal softmax function based on the basis. The various transpose operations are necessary to conform the inputs and outputs to the shape provided and required by the tensorflow model.
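
As a small usage example (assuming qsoftmax has been defined and tensorflow imported as above; the input values are arbitrary), passing the identity matrix as the basis should reproduce an ordinary softmax:

import numpy

x = tf.constant([[1.0, 2.0], [0.5, 0.5]])
print(qsoftmax(numpy.identity(2, dtype=numpy.float32))(x))   # matches tf.nn.softmax(x)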

For tensorflow and keras, the base network is the following,

normal_model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(64, activation=tf.nn.relu),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10),
  tf.keras.layers.Activation(tf.nn.softmax)
])

We used a separate Activation layer to make the parallel with the qsoftmax models below clearer.

As a sanity test, you can implement the following:

import numpy

sanity_model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(64, activation=tf.nn.relu),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10),
  tf.keras.layers.Lambda(qsoftmax(numpy.identity(10, dtype=numpy.float32)))
])

This should behave identically to the reference model, because applying the identity matrix in qsoftmax simply reduces it to the standard softmax function. It serves as a check that the qsoftmax function operates as expected (which it did for us) and is useful for troubleshooting if you run into difficulty.

For the two QOE experiments, we stored the two quasiorthonormal sets in the arrays basis4 and basis7 and took only the first 10 vectors of each. We used the following additional test models:

basis4_model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(64, activation=tf.nn.relu),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(4),
  tf.keras.layers.Lambda(qsoftmax(basis4))
])
basis7_model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(64, activation=tf.nn.relu),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(7),
  tf.keras.layers.Lambda(qsoftmax(basis7))
])
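
The post does not show the compilation and training calls; the sketch below is one plausible setup, with the optimizer, loss, and epoch count being our assumptions rather than the exact settings used in the experiments.

# Hypothetical training setup: the optimizer, loss, and epoch count here are
# illustrative assumptions, not details taken from the original experiments.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

for model in (normal_model, basis4_model, basis7_model):
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    model.fit(x_train, y_train, epochs=10,
              validation_data=(x_test, y_test))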

The following table reports the mean accuracy on the validation data over three training runs, with training accuracy in parentheses.

Number of Epochs   One-Hot Encoding   7-Dimensional QO   4-Dimensional QO
10                 97.53% (97.30%)    97.24% (96.94%)    95.65% (95.15%)
20                 97.68% (98.02%)    97.49% (97.75%)    95.94% (96.15%)

The seven-dimensional quasiorthogonal basis (at a little over 70\(^\circ\)) performs nearly as well as the one-hot encoded version, whereas the four-dimensional quasiorthogonal basis (at 60\(^\circ\)) does not perform as well. Though 95% vs. 98% may not seem like a large gap, for the MNIST handwriting dataset these relatively small differences in accuracy are how classification algorithms are judged.

Spherical Encoding

One key to the effectiveness of an encoding method is how well categorical values can be decoded in a noisy environment. Orthogonality provides excellent decodability, especially using the softmax function, but this comes at the cost of the number of dimensions used for the encoding. Quasiorthogonality with the QO softmax function decodes less effectively than orthogonality with the softmax, but reduces dimensionality.

A curious observation is that if you examine the effect of the softmax function on negative values along one of the basis vectors, the softmax function severely attenuates them. Therefore, one can construct an orthogonal code using both the 1 and -1 values: for the standard unit vectors, you could encode categorical values as \(\{\mathbf{u}_i, -\mathbf{u}_i\}\), reducing the dimensionality by half. Of course, nothing comes for free: if a prediction gets confused between two antipodal unit vectors, they can cancel out and allow the noise to dictate the resulting category. By contrast, with one-hot encoding the result would simply decode to one of the two possible values.

With this risk in mind, we can further extend the idea to a quasiorthogonal basis by adding the antipodal vector for each vector in the basis. The result not only doubles the number of vectors that can be used for encoding, but also reduces the problem of finding a basis to that of finding spherical codes.

Finally, we tested how effective this type of coding is by using an orthogonal basis in 5 dimensions and adding the antipodal unit vectors to produce a set of 10 vectors.
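
One way such a basis5 array might be constructed (this particular construction is our sketch; the post itself does not show it) is to stack the \(5\times 5\) identity with its negation, so that the rows are \(\{\mathbf{u}_i, -\mathbf{u}_i\}\):

import numpy

# Ten encoding vectors in R^5: the five standard basis vectors and their antipodes.
basis5 = numpy.concatenate([numpy.identity(5), -numpy.identity(5)]).astype(numpy.float32)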

The model tested becomes:

basis5_model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(64, activation=tf.nn.relu),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(5),
  tf.keras.layers.Lambda(qsoftmax(basis5))
])

In addition, we ran a test using the 3-dimensional, 10-vector spherical code provided here. The accuracy is shown in the following table, with training accuracy in parentheses.

Number of Epochs   One-Hot Encoding   5-Dimensional SC   3-Dimensional SC
10                 97.53% (97.30%)    96.51% (96.26%)    95.37% (94.83%)
20                 97.68% (98.02%)    96.82% (97.11%)    95.74% (95.83%)

In this case, the 5-dimensional spherical codes performed close to the one-hot encoding, though not as closely as the 7-dimensional QO codes. The 3-dimensional spherical codes performed on par with the 4-dimensional QO codes.

Conclusion

While the extreme dimensionality reductions, from 10 to 4 or from 10 to 3, did not yield performance comparable to one-hot encoding, more modest reductions such as 10 to 7 and 10 to 5 did. It is worth considering that quasiorthogonal and spherical codes are much harder to find in low dimensions. One should also note that, though we went from 10 to 7 dimensions, we did not fully exploit the space spanned by the quasiorthogonal vector set; we would likely have seen similar results if the categorical labels had a cardinality of 28 rather than 10. Furthermore, as the target encoded space approaches 20 or 30 dimensions, we gain the ability to encode 100,000 or even a million labels. So for high-cardinality categories, QO encoding and spherical encoding provide an efficient categorical encoding while controlling the curse of dimensionality.