Try this experiment. Find a picture on your Facebook wall that matches the description ‘A face like a wet weekend’. Too difficult? How about ‘A woman with brown eyes’?

Now go in the opposite direction. Describe a random picture of a face. What did you just do? More to the point, how did you manage? Does your description mention eyes, hair colour, ethnicity, emotional expression? The choices seem endless. No wonder Yorick Wilks, a veteran AI researcher, once compared the generation of language to the task of counting from infinity to one. It ought to be impossible. But since it clearly isn’t for humans, can we get machines to do it? There are at least two challenges involved.

Challenge number one is a classic computer vision problem: extract features (smiling? female? moustache?) from an image. Currently, the technique of choice involves ‘deep’ convolutional neural networks. Neural networks are designed to learn mappings from inputs (image pixels) to outputs (features). A convolutional network works by studying the image piece by piece, as it were, shifting a lens over the image a few pixels at a time. ‘Deep’ versions involve layers of such networks, allowing learning to proceed from low-level features (such as edges and colours) to high-level ones (such as women, smiles and moustaches).
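To make the 'shifting lens' idea concrete, here is a minimal sketch in plain Python. It is an illustration only, not the system described in this article: the function name `slide_filter`, the toy image and the hand-written filter are all assumptions, and a real network would learn its filter weights from data rather than have them typed in.

```python
def slide_filter(image, kernel, stride=1):
    """Slide `kernel` over `image` and return the grid of filter responses.

    As in most deep learning libraries, the kernel is not flipped,
    so strictly speaking this computes a cross-correlation.
    """
    k_rows, k_cols = len(kernel), len(kernel[0])
    feature_map = []
    for top in range(0, len(image) - k_rows + 1, stride):
        row = []
        for left in range(0, len(image[0]) - k_cols + 1, stride):
            # Weighted sum of the patch currently under the 'lens'.
            response = sum(
                kernel[i][j] * image[top + i][left + j]
                for i in range(k_rows)
                for j in range(k_cols)
            )
            row.append(response)
        feature_map.append(row)
    return feature_map


# Toy 4x4 greyscale image (made up): dark on the left, bright on the right.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]

# Hypothetical hand-written filter that responds to a dark-to-bright vertical edge.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

feature_map = slide_filter(image, vertical_edge)
print(feature_map)
```

The filter responds only where the lens sits across the boundary between the dark and bright halves, which is the kind of low-level edge feature that the first layers of a deep network tend to pick up.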

Challenge number two is a classic natural language processing problem: express the features using language. This time, we can use a recurrent neural network, which is trained on sequences fed to it in pieces of increasing length. Such a net should learn that the sequence ‘a face like a wet...’ will probably be followed by the word ‘weekend’, whereas ‘a face like a smacked...’ will be followed by... well, you get the idea.
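The task of guessing the next word can be sketched without a neural network at all. The example below swaps in a much simpler technique, counting which word follows each two-word context in a tiny made-up corpus. It is not a recurrent network, which learns a far more flexible summary of everything it has read so far, and the names `train` and `predict_next` are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict


def train(sentences, context_size=2):
    """Count which word follows each context of `context_size` words."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i in range(context_size, len(words)):
            context = tuple(words[i - context_size:i])
            counts[context][words[i]] += 1
    return counts


def predict_next(counts, prefix, context_size=2):
    """Return the most frequent follower of the prefix's last words, or None."""
    context = tuple(prefix.split()[-context_size:])
    followers = counts.get(context)
    if not followers:
        return None  # this context was never seen in training
    return followers.most_common(1)[0][0]


# Toy corpus, invented for this example.
corpus = [
    "a face like a wet weekend",
    "a face like a wet weekend",
    "a walk on a wet pavement",
    "a face like thunder",
]

model = train(corpus)
print(predict_next(model, "a face like a wet"))
```

A counting model like this fails as soon as it meets a context it has never seen, which is one reason learned models such as recurrent networks are preferred in practice.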

Crucially, however, these two challenges need to be addressed jointly. A recurrent net may know how the word ‘moustache’ is used in English sentences, but a face describer has to link it to the visual features of moustaches, as opposed to fringes, ogees or snoozing cats. The aim is thus to learn visually grounded linguistic expressions. That will get us at least to the ‘woman with brown eyes’ stage. Dealing with wet weekends needs even more work, as it involves metaphor and analogy.

Face descriptions are just one intriguing case of the central AI problem of linking language with perception, whose applications range from facilitating access to multimedia content for the visually impaired, to developing artificial systems that interact more naturally with their physical environment.

At the University of Malta, this work is being conducted by members of the RIVAL (Research In Vision And Language) group, a multidisciplinary team whose members are drawn from the Institute of Linguistics and Language Technology and the Departments of Artificial Intelligence, Communication and Computer Engineering, and Systems and Control Engineering. Originally, it was established as a result of Malta’s participation in an EU COST action studying integrated models for language and vision processing.

Now how about doing that experiment and giving us a hand with our data collection? Just click on the link below.

Dr Albert Gatt is the director of the Institute of Linguistics and Language Technology at the University of Malta.

Sound bites

• While it’s relatively straightforward for robots to ‘see’ objects with cameras and other sensors, interpreting what they see, from a single glimpse, is difficult. New technology enables robots to spot a new object and recognise what it is, whether it is right side up or upside down, without examining it from multiple angles. It can also fill in the blind spots in its field of vision and ‘imagine’ any parts that are hidden from view.

• A study by social psychologists shows that people can reliably tell if someone is richer or poorer than average just by looking at a neutral face without any expression. This is because the positions of facial muscles become etched in a person’s face by their late teens or early adulthood and remain visible. Emotional expressions mask these life-long habits of expression, such as frequent happiness, which is stereotypically associated with being wealthy and satisfied.

For more sound bites, listen to Radio Mocha on Radju Malta 2 every Monday at 1pm and Friday at 6pm.

Did you know?

• The first semi-automated facial recognition programs were created by Woody Bledsoe, Helen Chan Wolf and Charles Bisson in the 1960s. Their programs required the administrator to locate features such as the eyes, ears, nose and mouth on the photograph. The programs then calculated distances and ratios to a common reference point, which were compared to reference data.

• In 1988, Kirby and Sirovich applied principal component analysis, a standard linear algebra technique, to the face recognition problem. It was considered a milestone because it showed that fewer than 100 values were required to accurately code a suitably aligned and normalised face.

• Facial features, such as the distance between the eyes, the width of the nose and the length of the jaw line, are used to create an individual faceprint.

• Applications using face recognition have to deal with privacy concerns from users, as some people argue that such software is intruding on people’s lives.

• Facial recognition was used by a school in France to ensure that online students weren’t slacking off. Using the student’s webcam, the software would track eye movement and facial expressions to find out if the student was paying attention during the video lecture.
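The distances-and-ratios idea behind the early programs and the 'faceprint' mentioned above can be sketched in a few lines. Everything here is an assumption made for illustration: the landmark names, the coordinates, the choice of the gap between the eyes as the reference distance, and the use of a straight line between two jaw points as a stand-in for the length of the jaw line. Real systems use many more measurements.

```python
import math


def faceprint(landmarks):
    """Turn (x, y) landmark coordinates into ratios that ignore image size."""
    eye_gap = math.dist(landmarks["left_eye"], landmarks["right_eye"])
    nose_width = math.dist(landmarks["nose_left"], landmarks["nose_right"])
    jaw_length = math.dist(landmarks["jaw_left"], landmarks["jaw_right"])
    # Dividing by the eye gap makes the print independent of photo scale.
    return (nose_width / eye_gap, jaw_length / eye_gap)


def difference(print_a, print_b):
    """Euclidean distance between two faceprints: smaller means more alike."""
    return math.dist(print_a, print_b)


# Hypothetical landmark positions, in pixels.
face_a = {
    "left_eye": (30, 40), "right_eye": (70, 40),
    "nose_left": (42, 60), "nose_right": (58, 60),
    "jaw_left": (20, 90), "jaw_right": (80, 90),
}
# The same face photographed at twice the size.
face_a_zoomed = {name: (2 * x, 2 * y) for name, (x, y) in face_a.items()}
# A different face with a wider nose.
face_b = dict(face_a, nose_left=(38, 60), nose_right=(62, 60))

print(difference(faceprint(face_a), faceprint(face_a_zoomed)))
print(difference(faceprint(face_a), faceprint(face_b)))
```

Because the measurements are ratios, the zoomed photo produces the same print as the original, while the wider nose gives a print that is measurably different.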

