Textures, not shapes
Training a CNN for object recognition typically involves nothing more than showing the algorithm many examples of images that do or don't contain a target object. Humans also need to see many examples of various objects to get the basic idea. Humans, however, seem to have a bias towards recognition by shape, which CNNs in general lack.
Geirhos, Bethge and their colleagues created images that included two conflicting cues, with a shape taken from one object and a texture from another: the silhouette of a cat colored in with the cracked gray texture of elephant skin, for instance, or a bear made up of aluminum cans, or the outline of an airplane filled with overlapping clock faces. Presented with hundreds of these images, humans labeled them based on their shape — cat, bear, airplane — almost every time, as expected. Four different classification algorithms, however, leaned the other way, spitting out labels that reflected the textures of the objects: elephant, can, clock.
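One way to put a number on this tendency is to count how often a classifier sides with the shape cue when it picks either of the two conflicting cues. The sketch below is a minimal illustration, not the authors' code: the function name `shape_bias`, the `(shape, texture, predicted)` tuple layout, and the sample predictions are all hypothetical, and it assumes the shape and texture labels of each image differ.

```python
def shape_bias(trials):
    """Fraction of shape-consistent answers among all answers that
    matched either the shape label or the texture label.

    trials: iterable of (shape_label, texture_label, predicted_label).
    Assumes shape_label != texture_label for every trial.
    """
    trials = list(trials)
    shape_hits = sum(1 for shape, _, pred in trials if pred == shape)
    texture_hits = sum(1 for _, texture, pred in trials if pred == texture)
    decided = shape_hits + texture_hits
    if decided == 0:
        raise ValueError("no prediction matched either cue")
    return shape_hits / decided


# Hypothetical predictions for cue-conflict images like those described above.
model_trials = [
    ("cat", "elephant", "elephant"),   # texture wins
    ("bear", "can", "can"),            # texture wins
    ("airplane", "clock", "airplane"), # shape wins
    ("cat", "elephant", "dog"),        # matches neither cue, ignored
]

model_bias = shape_bias(model_trials)
print(f"shape bias: {model_bias:.2f}")
```

A value near 1 means the classifier mostly follows shape, as the human participants did; a value near 0 means it mostly follows texture.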
This is a problem worth solving, since adding even a small amount of noise can throw off CNN-based classifiers in cases where humans aren't fooled. "Adversarial examples" exploit this maliciously: the noise is not random but carefully crafted to push the classifier towards a wrong label. So how can this be fixed?
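To make the idea of crafted noise concrete, here is a toy sketch using a linear classifier rather than a CNN. It is not taken from the article: the weights, the input, and the helper names are made up for illustration, and the perturbation follows the general fast-gradient-sign idea of nudging every input value by the same small amount in the direction that hurts the current decision most.

```python
def sign(v):
    """Return -1, 0 or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def score(w, x, b):
    """Linear classifier score; positive means class A, negative class B."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def adversarial_perturb(x, w, b, epsilon):
    """Shift each input value by at most epsilon against the current decision.

    For a linear model the gradient of the score with respect to x is w,
    so stepping along -sign(score) * sign(w) lowers |score| fastest
    under a per-element budget of epsilon.
    """
    s = sign(score(w, x, b))
    return [xi - epsilon * s * sign(wi) for xi, wi in zip(x, w)]


# Hypothetical weights and input, chosen only for illustration.
w = [0.5, -1.0, 0.25, 2.0]
x = [1.0, 0.5, -2.0, 0.3]
b = 0.1

clean_score = score(w, x, b)

# Smallest per-element budget that reaches the decision boundary, plus 10%.
epsilon = 1.1 * abs(clean_score) / sum(abs(wi) for wi in w)

x_adv = adversarial_perturb(x, w, b, epsilon)
adv_score = score(w, x_adv, b)

print(f"clean score {clean_score:+.3f}, adversarial score {adv_score:+.3f}")
print(f"largest change to any input value: {epsilon:.3f}")
```

In this toy case no input value moves by more than about 0.06, yet the sign of the score, and therefore the predicted class, flips. Real attacks on CNNs use the gradient of the network's loss in place of `w`, but the principle is the same.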
Geirhos wanted to see what would happen when the team forced their models to ignore texture. The team took images traditionally used to train classification algorithms and “painted” them in different styles, essentially stripping them of useful texture information. When they retrained each of the deep learning models on the new images, the systems began relying on larger, more global patterns and exhibited a shape bias much more like that of humans.
There were many other insights in this relatively short article, and I commend it to you. It enriched my understanding of what's going on in neural networks, and how far we still need to go to reach parity with humans.