Why AI will never replace human picture descriptions

Yes, a bold statement, I know, but this piece by Dr. Elizabeth Fernandez made my conviction even stronger.

For some years now, there have been advancements in automated image recognition. That recognition nowadays goes far beyond optical character recognition. Faces, objects, and some scenes are things that software such as the Facebook algorithms, Microsoft’s Seeing AI, and Google’s image recognition can cope with. In the case of some celebrities, Microsoft’s offering will, for example, even put names to faces.

Google’s service now also ties into Chrome. In the case of a missing alternative text, users can right-click and request that the image be processed by Google’s artificial intelligence. The result will then be filled in so screen readers will pick it up. For the new Chromium-based Edge browser by Microsoft, that service is disabled, but I guess Microsoft will soon put something similar in place using the backend that Seeing AI also uses.

This browser integration in particular has led to fears that web developers will become lazy and describe their images less. I am convinced that this fear is unfounded. Some managers or other decision makers may try to cut that corner, but they’ll fail.

For one, validators will still flag missing alternative text. The service, in Google’s case, currently only kicks in if alt text is completely missing. Even on Twitter, an image without a description still gets an alternative text of “Image”, in which case Chrome doesn’t offer to process it.
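
To make that distinction concrete, here is a minimal sketch of the kind of check a validator might run, written in Python with only the standard library. It is not how any particular validator or Chrome works internally. The names AltTextAuditor, audit_alt_text, and PLACEHOLDER_ALTS are my own hypothetical choices, and treating an empty alt="" as a deliberate marker for a decorative image is an assumption based on common HTML practice.

```python
from html.parser import HTMLParser

# Hypothetical list of placeholder values; extend it for your own content.
PLACEHOLDER_ALTS = {"image", "photo", "picture", "img"}


class AltTextAuditor(HTMLParser):
    """Collects <img> tags whose alt text is missing or only a placeholder."""

    def __init__(self):
        super().__init__()
        self.missing = []      # images with no alt attribute at all
        self.placeholder = []  # images whose alt text says nothing useful

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src", "(no src)")
        if "alt" not in attributes:
            self.missing.append(src)
            return
        # A bare `alt` attribute without a value is parsed as None.
        alt = (attributes["alt"] or "").strip().lower()
        # Assumption: an empty alt marks a decorative image and is left alone.
        if alt in PLACEHOLDER_ALTS:
            self.placeholder.append(src)


def audit_alt_text(html):
    """Return (images without alt, images with placeholder alt) for a page."""
    auditor = AltTextAuditor()
    auditor.feed(html)
    auditor.close()
    return auditor.missing, auditor.placeholder


sample = """
<img src="team.jpg">
<img src="tweet.jpg" alt="Image">
<img src="divider.png" alt="">
<img src="launch.jpg" alt="The team cheers as the release goes live">
"""
missing, placeholder = audit_alt_text(sample)
print("No alt attribute:", missing)
print("Placeholder alt text:", placeholder)
```

In this sketch, only the first image would qualify for an automatically generated description, the second carries a placeholder that helps nobody, and neither check can tell you whether the fourth description is actually any good. That part still needs a human.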

And second, and this is the point that the above-mentioned article largely confirms, no matter how many objects, people, and possible scenes artificial intelligence and deep machine learning will learn to recognize, there will always be two factors missing. One is described by Dr. Fernandez in her article:

The first – DNNs are easy to fool. For example, imagine you have a picture of a banana. A neural network successfully classifies it as a banana. But it’s possible to create a generative adversarial network that can fool your DNN. By adding a slight amount of noise or another image besides the banana, your DNN might now think the picture of a banana is a toaster. A human could not be fooled by such a trick. Some argue that this is because DNNs can see things humans can’t, but Watson says, “This disconnect between biological and artificial neural networks suggests that the latter lack some crucial component essential to navigating the real world.”

And that leads straight to the second point: That missing ingredient is human interpretation. Yes, AI can by now tell you about the birds and the trees, and the flowers and the bees, in a picture or photograph, but the actual message is something only a human can get from it. Yes, it can recognize that there are several people standing around a table, and maybe even who they are. But why they are there, what the context is, and what is actually going on is, and I am convinced always will be, up to the sighted viewer to interpret.

Yes, people can also interpret imagery differently, depending on their social and cultural backgrounds, but within a certain context, the interpretation will always be meaningful. It’s the layer beyond the pure fact, beyond the “what or who”. That factor comes from personal experiences, which are not only factual, but also emotional. Seeing someone do something, or look at someone or something, along with the other pieces of the puzzle in a picture, will also provoke an emotional response in most humans. While AI might, through some pattern recognition, even be able to recognize facial expressions beyond “smiling” some day, the why will always be up to the spectator to deduce and convey to others.

So while I am convinced that technology can do a lot of good things for us, and I work in a field that actually helps with that, there are things which will probably be forever reserved for humans and other sentient species.

Sorry, not sorry, to all the web developers out there who will still have to come up with good alternative text for their images. It’ll be part of your jobs for years to come, as it will be part of my job for years to come to remind you to put it in. 😉