Novel Image Captioning

When does a machine “understand” an image? One definition is that it does so when it can generate a novel caption summarizing the salient content of the image. This content may include the objects present, their attributes, their actions, and their relations with each other. Determining what is salient requires not only knowing what the image contains, but also using commonsense knowledge to deduce which aspects of the scene are interesting or novel. This video demonstrates the quality of the latest image captioning model (after) compared with the old model (before).