Computer scientists at MIT have developed a system that can identify objects within an image based on a spoken description. The system highlights the parts of the image it finds relevant to the description, in real time.
It learns words from recorded speech clips and objects from raw images, then associates the two with one another.
The team modified a pre-existing image-processing neural network so that it splits each image into a grid of cells, while an audio network cuts the spoken caption into one- to two-second snippets. When an image is paired with the right caption, the training process scores the system on its performance. If this sounds a lot like teaching a child what objects are by pointing at and naming them, you're not far off.
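The pairing idea can be sketched as computing a similarity between every image cell and every audio snippet, then pooling that grid of similarities into one image-caption score. The sketch below is illustrative only: the grid size, embedding dimension, and pooling choices are assumptions for demonstration, not the researchers' exact architecture.

```python
import numpy as np

# Illustrative sizes (assumptions, not the paper's actual values).
EMBED_DIM = 512  # dimensionality of the shared embedding space

rng = np.random.default_rng(0)

# Stand-ins for network outputs: one embedding per cell of an 8x8 image
# grid, and one embedding per ~1-2 second audio snippet.
image_cells = rng.standard_normal((8, 8, EMBED_DIM))
audio_snippets = rng.standard_normal((5, EMBED_DIM))

def matchmap(image_cells, audio_snippets):
    """Dot-product similarity between every image cell and audio snippet.

    Returns an (8, 8, 5) array: how strongly each cell matches each snippet.
    """
    return np.einsum('hwd,td->hwt', image_cells, audio_snippets)

def similarity_score(mm):
    """Collapse the matchmap into a single image-caption score:
    take the best-matching cell for each snippet, then average over
    snippets (one common pooling choice, chosen here for illustration)."""
    return mm.max(axis=(0, 1)).mean()

mm = matchmap(image_cells, audio_snippets)
score = similarity_score(mm)

# The cell that best matches snippet 0 -- the kind of image region
# the system would highlight for that stretch of speech.
best_cell = np.unravel_index(mm[:, :, 0].argmax(), mm[:, :, 0].shape)
```

During training, a score like this would be pushed higher for correct image-caption pairs than for mismatched ones, so that words and image regions end up close together in the shared embedding space.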
“We wanted to do speech recognition in a way that’s more natural, leveraging additional signals and information that humans have the benefit of using, but that machine learning algorithms don’t typically have access to. We got the idea of training a model in a manner similar to walking a child through the world and narrating what you’re seeing,”
– David Harwath, researcher in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and the Spoken Language Systems Group
Florian Metze, an associate research professor at Carnegie Mellon University's Language Technologies Institute, says of the AI:
“It is exciting to see that neural methods are now also able to associate image elements with audio segments, without requiring text as an intermediary. This is not human-like learning; it’s based entirely on correlations, without any feedback, but it might help us understand how shared representations might be formed from audio and visual cues.”
The AI could be applied in numerous ways, but the MIT researchers have set their sights on improving translation.