Metamorphic Truth
Posted by Haw-minn Lu in Machine Learning
What is Metamorphic Truth?
Metamorphic truth is a method of training an Artificial Neural Network (ANN) in which the "truth" supplied during training is not fixed and may depend on the prediction made by the ANN. More specifically, during training, a truth metamorphism is defined as a function of the prediction, the ground truth, and perhaps additional metadata.
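As a minimal sketch (the names are illustrative, not from the paper), a truth metamorphism can be thought of as a function with the following shape:

```python
def truth_metamorphism(y_true, y_pred, metadata=None):
    """Return the 'truth' actually used by the loss for this sample.

    y_true:   the recorded ground truth
    y_pred:   the network's current prediction
    metadata: optional per-sample information (e.g., which attributes are missing)
    """
    return y_true  # the identity metamorphism recovers ordinary training
```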
History and Concept
The concept arose when building a denoising autoencoder with missing data. The obvious approach of filling in the missing data with dummy data may work well but could also ultimately distort the loss function. By viewing ANN training as a feedback system, it became clear that the loss function should somehow ignore the dummy filled-in data and thus remove the downstream bias from these terms. Then the mean square error (MSE) would look like

\[ E = \sum_{i \in S_{\mathrm{notmissing}}} \left( y_i - \tilde{y}_i \right)^2 \]

where \(S_{\mathrm{notmissing}}\) is the set of indexes for which \(y_i\) is not missing and \({\tilde{\mathbf{y}}}\) is the predicted value.
From the perspective of a feedback system, any prediction the autoencoder makes is acceptable, or, in lay terms, "your guess is as good as mine." Detailed discussion of this type of autoencoder can be found in our paper.
Unfortunately, there are several practical problems. In our missing data problem, the attributes \(\{y_i\}\) that are missing may change from sample to sample, so the loss function would be specific to which sample is being used to train the autoencoder. Most machine learning packages do not allow sample-by-sample customization of the loss function.
As a practical consideration, rather than dismantling the structure of established machine learning libraries or contriving a complex loss function, it was much simpler to alter the so-called "ground truth" that is fed back into the autoencoder during learning, using a standard MSE loss function:

\[ E = \frac{1}{N} \sum_{i=1}^{N} \left( \bar{y}_i - \tilde{y}_i \right)^2 \]

where \(N\) is the dimension of the output vector and \(\bar{y}_i\) is given by the following:

\[ \bar{y}_i = \begin{cases} y_i & \text{if } y_i \text{ is not missing} \\ \tilde{y}_i & \text{if } y_i \text{ is missing} \end{cases} \]

This mapping cancels the dummy \(y_i\) values in the loss function, thus removing their effects. This mapping is defined as a truth metamorphism.
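A quick illustrative sketch in plain NumPy (the variable names are hypothetical) shows that the metamorphism reproduces the masked loss while keeping the standard MSE form:

```python
import numpy as np

y_true = np.array([1.0, np.nan, 3.0])   # NaN marks a missing attribute
y_pred = np.array([0.9, 2.5, 2.8])
missing = np.isnan(y_true)

# Truth metamorphism: substitute the prediction wherever the truth is missing.
y_meta = np.where(missing, y_pred, y_true)

# Standard MSE against the metamorph; missing terms contribute exactly zero.
mse = np.mean((y_meta - y_pred) ** 2)

# The same quantity via the masked sum over non-missing indices (up to 1/N).
masked_sum = np.sum((y_true[~missing] - y_pred[~missing]) ** 2)
assert np.isclose(mse, masked_sum / len(y_true))
```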
Learning Implications
What started out as a "hack" for implementing a desired mathematical calculation has much broader implications for learning. From one perspective, rigid ground truths can be thought of as another form of overtraining. In practice, there may be ambiguous labels, and by enforcing a fixed ground truth for each output in training, you may be preventing the network from exploring other potential minima consistent with the label ambiguity.
For example, human learning with feedback tailored to an individual's current understanding usually works better. Learning from a book is like the rigid ground truth model, whereas learning from a live teacher is like the metamorphic truth model. When learning from the book, the "truth" doesn't change no matter how many times you read it, though your understanding might improve each time. When learning from a live teacher, one might get an indication that the answer is close enough to the answer the teacher originally wanted.
Formalizing the Concept
Despite the variety of architectures, ANNs that implement the multilayer perceptron have a common output layer that computes the following:

\[ \tilde{y}_i = \sum_{j} w_{i,j}\, x_j, \qquad i = 1, \ldots, L_{\mathrm{output}} \]

where \(w_{i,j}\) are the weights from the output layer, \(x_{j}\) are the outputs from the last hidden layer and \(L_\mathrm{output}\) is the number of units in the output layer. For the mean square error, the loss function is defined as:

\[ E = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \tilde{y}_i \right)^2 \]
Typical gradient descent algorithms start by calculating the derivative of the error with respect to the weights, applying it first to the output layer and then to the hidden layers via the chain rule,

\[ \frac{\partial E}{\partial w_{i,j}} = \frac{\partial E}{\partial \tilde{y}_i} \, \frac{\partial \tilde{y}_i}{\partial w_{i,j}} \]

Using the MSE this becomes

\[ \frac{\partial E}{\partial w_{i,j}} = -\frac{2}{N} \left( y_i - \tilde{y}_i \right) \frac{\partial \tilde{y}_i}{\partial w_{i,j}} \]
To train using metamorphic truth, we define a truth metamorphism \(\bar{\mathbf{y}}=\cal{M}(\mathbf{y}, \mathbf{\tilde {y}})\) and define the loss function in terms of \(\mathbf{\bar y}\) as opposed to the ground truth \(\mathbf{y}\). Treating the metamorphism merely as a function just gives a more complex loss function. What separates a metamorphism from simply another loss function is that we define the following,

\[ \frac{\partial E}{\partial w_{i,j}} = -\frac{2}{N} \left( \bar{y}_i - \tilde{y}_i \right) \frac{\partial \tilde{y}_i}{\partial w_{i,j}} \]

where we have extrapolated \({\partial \bar{y_i}}/{\partial w_{i,j}} = 0\) from \({\partial {y_i}}/{\partial w_{i,j}} = 0\). In non-mathematical terms, the truth metamorph is treated exactly like the ground truth. The result is that all the learning algorithms can proceed unchanged once the truth metamorph is substituted for the ground truth.
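In an autodiff framework, this rule amounts to stopping gradients through the metamorphism. A minimal TensorFlow sketch of a metamorphic MSE loss, using the imputation metamorphism from above as the example (the names are illustrative; tf.stop_gradient is the built-in way to treat a tensor as a constant, and the custom zero_derivative function described later is an equivalent alternative):

```python
import tensorflow as tf

def metamorphic_mse(y_true, y_pred, missing_mask):
    # Truth metamorphism: substitute the prediction wherever the truth is missing.
    y_meta = tf.where(missing_mask, y_pred, y_true)
    # Treat the metamorph exactly like ground truth: no gradient flows through it.
    y_meta = tf.stop_gradient(y_meta)
    return tf.reduce_mean(tf.square(y_meta - y_pred))
```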
Potential Mathematical Implications
No doubt many questions arise from this "hacking" of the mathematics. In essence, the error surface over which the gradient descent algorithm travels now changes during the learning process, making it possibly a moving target for the ANN model, which may result in oscillations that make the minima harder to find. Practically speaking, it is too early to tell whether this is an actual problem. However, should it prove to be a problem, a lesson from feedback systems can be applied: whenever a system is in danger of instability, a dampening agent of some sort is imposed. In fact, modern stochastic gradient descent already introduces elements of dampening (e.g., batch learning, momentum). Our one practical application of metamorphic truth, for data imputation, does not suffer from these issues.
Examples of Situations to Use Metamorphic Truth
The following are some examples of where metamorphic truth might be applicable. We have not conducted a detailed case study of any of these except the "Don't Care" example, which has been incorporated into an autoencoder for imputation as mentioned above.
Don't Care
In some models there are cases where the output is irrelevant. For example, certain states in logic design are designated "don't care" and are ignored. The imputation example is a situation where the predicted value for a missing data element is irrelevant.
As another example, consider a model that predicts medical conditions and gender. If the output is a gender-specific condition (such as pregnancy), one might choose to ignore the corresponding gender output as a postprocessing step. Hence, in these cases you don't care what the prediction is, and metamorphic truth allows you to train a model this way if desired.
In this example, the metamorphism is similar to the one given above:

\[ \bar{y}_i = \begin{cases} y_i & \text{if output } i \text{ is relevant} \\ \tilde{y}_i & \text{if output } i \text{ is a don't care} \end{cases} \]
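A sketch of this metamorphism in TensorFlow, assuming a boolean dont_care mask derived from per-sample metadata (the names are illustrative):

```python
import tensorflow as tf

def dont_care_metamorphism(y_true, y_pred, dont_care):
    # Where an output is a "don't care," any prediction is acceptable,
    # so the prediction itself is substituted for the truth.
    return tf.stop_gradient(tf.where(dont_care, y_pred, y_true))
```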
Multiple Answers
Consider the following digits from the MNIST database:

[Figure: three ambiguous handwritten MNIST digits]

These are three ambiguous numerals. Without disclosing the truth value in the MNIST database, one could probably make the case that the first number is a 2 or an incomplete 3. The second number could be a 2 or a 7, and the third a 7 or a 9.
Suppose the middle number was labeled as a 7. One could make the case that it was labeled incorrectly, or that the person labeling the samples was taught to write numbers a little differently, so that what seems like it should be a 2 in your eyes could be a 7 in the labeler's eyes. Perhaps the entire dataset was labeled by multiple people, with a lack of consistency in the way the numbers are labeled. Even if the person who labeled the digit is the actual person who wrote it, it may very well be that the way they wrote the number 7 is outside the norm of how the digit is drawn.

For all of the above reasons, it is possible that these cases could adversely bias an ANN's learning. For this reason, one might want to say that both 2 and 3 are acceptable answers for the first classification, 2 and 7 are acceptable answers for the second classification, and so on.
One can implement this kind of training by supplying multiple acceptable values for each \(\bf y\). The metamorphism takes the predicted values and returns the closest acceptable answer to the prediction, as sketched below.
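An illustrative TensorFlow sketch for a single sample, assuming one-hot class labels (the function and variable names are hypothetical):

```python
import tensorflow as tf

def closest_answer_metamorphism(acceptable, y_pred):
    # acceptable: (num_answers, num_classes) one-hot rows, e.g. the labels 2 and 3;
    # y_pred: (num_classes,) predicted class scores for one sample.
    dists = tf.reduce_sum(tf.square(acceptable - y_pred), axis=1)
    best = tf.argmin(dists)  # index of the acceptable answer nearest the prediction
    # The chosen answer is then treated exactly like ground truth.
    return tf.stop_gradient(tf.gather(acceptable, best))
```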
Confidence Labels
In some situations, there may be a small subset of impeccably labeled data and a larger set of less reliable data that you want to incorporate into learning. You could define a metamorphism

\[ \bar{y}_i = \alpha\, y_i + (1 - \alpha)\, \tilde{y}_i \]

where \(\alpha\) is a confidence value in the truth for that particular sample and attribute. This metamorphism gives more or less weight to a truth value based on the confidence and inversely weights the contribution of the prediction.
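A one-line TensorFlow sketch, assuming \(\alpha\) is supplied from per-sample metadata (the names are illustrative):

```python
import tensorflow as tf

def confidence_metamorphism(y_true, y_pred, alpha):
    # alpha = 1.0 recovers the rigid ground truth; alpha = 0.0 is a "don't care."
    return tf.stop_gradient(alpha * y_true + (1.0 - alpha) * y_pred)
```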
Implementational Issues
While originally developed for a multiple imputation autoencoder using tensorflow, it is unclear how to implement metamorphic truth with other popular machine learning packages. The following is a high-level description of how to implement it in tensorflow.

The metamorphism function takes the predicted output, the ground truth output, and a sample index as input. The sample index is used to access metadata when necessary. For example, in the imputation problem the attributes that are missing vary from sample to sample, so the index keeps track of which sample is transformed by the metamorphism.
The basic idea is to append the index of the training sample to the training output to retain the order if the samples are scrambled. This index is then stripped off in the loss function when the metamorphism is applied.
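A sketch of this bookkeeping step, with placeholder data and hypothetical names:

```python
import numpy as np

y_train = np.random.rand(1000, 8).astype(np.float32)  # placeholder training outputs

# Append each sample's index as an extra column so per-sample metadata can be
# looked up inside the loss function, even after the samples are shuffled.
indices = np.arange(len(y_train), dtype=np.float32).reshape(-1, 1)
y_train_with_index = np.hstack([y_train, indices])
```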
When you compile a Tensorflow model, you specify a customized loss function. The loss function will first split the index from the truth value (recall we appended the index to it), leaving the truth value without the index. The truth value, predicted value, and index are fed into the metamorphism.
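A sketch of such a loss function, continuing the hypothetical names above (metamorphism stands for whichever transform from the earlier examples applies):

```python
import tensorflow as tf

def metamorphic_loss(y_true_with_index, y_pred):
    # Split off the index column that was appended to each training output.
    y_true = y_true_with_index[:, :-1]
    index = tf.cast(y_true_with_index[:, -1], tf.int32)
    # Apply the truth metamorphism, then train against it with a standard MSE.
    y_meta = metamorphism(y_true, y_pred, index)  # e.g., a "don't care" transform
    return tf.reduce_mean(tf.square(y_meta - y_pred))
```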
For tensorflow, you must turn the output of the metamorphism into a constant with respect to the model, as tensorflow will build a computation graph and try to compute the gradient of the metamorphism (which may have complicated discontinuities). To compensate for this, we construct a custom function zero_derivative as follows:
```python
import tensorflow as tf

@tf.custom_gradient
def zero_derivative(x):
    # Identity in the forward pass, but the gradient is identically zero,
    # so autodiff treats x as a constant.
    def grad(dy):
        return tf.zeros_like(dy)
    return x, grad
```
This 'function' is the identity function with a zero derivative. In effect, it turns the operand x into a constant. Finally, when you run fit, you supply as the training output the original training output with the index appended to each sample.
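Putting the pieces together, a minimal end-to-end sketch under the same assumptions (the toy model and data shapes are placeholders):

```python
import numpy as np
import tensorflow as tf

x_train = np.random.rand(1000, 8).astype(np.float32)  # placeholder inputs

# A toy autoencoder; the model outputs 8 values while the training target
# carries those 8 values plus the appended index column.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(4, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(8),
])
model.compile(optimizer="adam", loss=metamorphic_loss)
model.fit(x_train, y_train_with_index, epochs=10, batch_size=32)
```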
Conclusion
Metamorphic truth expands the use cases for deep learning networks to situations with ambiguous labeling or with outputs that are irrelevant. Real applications of metamorphic truth are limited to date, and the theoretical and practical study of metamorphic truth is just beginning.