April 07, 2021

AI Is Already Beside Us [Part II]

These days, we see the word "AI" almost daily. The term has taken off across society, yet many of us may not know what AI actually is or have a firm understanding of it. In this article, we explain AI through an interview with a key figure in the field.

In the previous interview, IT journalist Hiromi Yuzuki interviewed Yoshitaka Ushiku of OMRON SINIC X Corporation*1 about the history of AI. In this round, Ms. Yuzuki asks about the potential of today's AI as seen through Dr. Ushiku's current research, "integration of visual perception and natural language through deep learning."

*1 A strategic base that creates OMRON's vision of "near-future design."

AI that Bridges Texts and Images

Hiromi Yuzuki (Yuzuki): Please tell me about the specific applications of your study of "integration of visual perception and natural language."

Yoshitaka Ushiku (Ushiku): Thus far, the primary function of AI in image recognition has been to make classifications, such as recognizing a person's face with a smartphone or judging the quality of products on a factory production line. In other words, it is basically a labeling operation. Such AI only has to judge whether or not a person or a car is in the image shown. However, when humans see something, even in a flash, they find deeper meaning in what they see and express it in natural language (text). For example, when humans see a painting by Claude Monet, they may talk about how the piece came into being. Humans can unconsciously express a wealth of information in words. With this in mind, since around 2010, I've been working to link "visual perception" with "the information that lies behind the data obtained from vision = information that humans can describe in text."

The first thing I worked on was how to get machines to automatically generate captions for images. Conventional image recognition has used machine learning only to detect where a person is in the image. My research goes beyond that to get AI to think about what kind of texts humans would use to describe the image.


Yuzuki: Take this image, for example. Does that mean AI can describe not only that there are people in the image but also that "a family is having a meal around the table" or that it's a "birthday party"?

Ushiku: Yes. I want to get AI to properly distinguish the more nuanced details. Today, AI's accuracy has improved further, and it can verbalize minute details. We are advancing it to a level at which AI can generate a novel from a movie by applying this technology not only to photographs and images but also to video. As for something closer to our daily lives, we are working with Kyoto University, the University of Tokyo, and Cookpad on a study for automatically generating recipes from photographs of cooking processes and videos of people cooking, as well as on research for collecting such data. This involves a model trained by a machine learning method to recognize the type of ingredients and how far the vegetables have been cooked, to automatically generate text, to predict the process for completing the dish according to the ingredients, and to automatically convert that data into a recipe.

By applying this AI, it is possible to generate procedural text from data obtained by observing a series of procedures involved in various everyday tasks, such as assembling parts in a factory or an artist composing music or creating art. Since this also enables the computer to see which process is being undertaken, it can advise the user of the appropriate next step if an incorrect step is taken. We hope to develop various applications by connecting tasks with the natural language associated with them.

On another note, I think we can apply AI to translation work. AI has been used in translation recently. Some services offer Japanese and English translations for industrial use, but there are times when the translations are still incorrect.

In such cases, I think the translation's accuracy could be improved if an image provides clues. When we translate the English word "seal" into Japanese, many of us picture a sticker if we see the text alone (a sticker is called a "seal" in Japanese), but the animal "seal" is spelled the same. Therefore, it would be easier to discern the word's meaning if it came with an image.
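As an editorial illustration of the "seal" example above, the following minimal sketch picks a Japanese translation for an ambiguous English word by using labels detected in an accompanying image. The sense table, the cue words, and the function name `disambiguate` are all hypothetical; a real system would presumably learn this association from data rather than rely on a hand-written table.

```python
# Toy sketch: use image labels as clues to pick a translation.
# SENSES and its cue words are hand-written assumptions for illustration only.
SENSES = {
    "seal": [
        {"ja": "シール", "gloss": "sticker",
         "cues": {"paper", "sticker", "envelope", "notebook"}},
        {"ja": "アザラシ", "gloss": "animal",
         "cues": {"sea", "ice", "animal", "beach"}},
    ]
}


def disambiguate(word, image_labels):
    """Return the sense whose cue words overlap most with the image labels.

    With no overlap at all, the first listed sense is returned, which mirrors
    the text-only default described in the interview (a sticker).
    """
    labels = set(image_labels)
    senses = SENSES[word]
    # max() returns the first maximal item, so ties keep the earlier sense.
    return max(senses, key=lambda sense: len(sense["cues"] & labels))


if __name__ == "__main__":
    print(disambiguate("seal", ["ice", "sea", "rock"])["ja"])
    print(disambiguate("seal", ["notebook", "desk"])["ja"])
```

The overlap count is only a stand-in for whatever score a trained multimodal translation model would produce; the point is that the image contributes evidence the text alone lacks.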

Yuzuki: It'll be of great help to YouTubers if captions can be added to photos and subtitles can be added to videos automatically.

Ushiku: I'm occasionally asked if we can make that happen. (laughs)

Yoshitaka Ushiku of OMRON SINIC X

A World Where Humans and AI Converse While Viewing Images or Videos


Ushiku: Here's an image of a dog chewing a vegetable. Can you see what the vegetable is, Ms. Yuzuki?

Yuzuki: Is it a carrot?

Ushiku: Yes. I'm also researching how AI can respond to questions after being shown images like this. When asked "What is the dog chewing?", which parts of the photo should the AI examine in order to find the answer and then answer "a carrot" on its own?

Yuzuki: The AI judges that it should look at the dog's mouth because of the word "chewing," correct?

Ushiku: Exactly. In doing this research, I'm often asked, "Isn't AI reflecting what it has learned from what humans look at in the image and their line of sight when they're asked the same question?" That kind of method can also be regarded as AI and is one of the studies now underway.

However, my research does not involve learning line-of-sight data but instead involves learning from a combination of three types of data: images, questions, and answers. Therefore, as you said before, AI learns only from the text and the image to judge that it should look at the dog's mouth based on the word "chewing."

A system for simply answering questions with natural language alone without images is called a Q&A system in technical parlance. This research area has existed since the dawn of the study of AI. In 2011, IBM's Watson supercomputer defeated a human champion on an American quiz show. This system required feeding massive amounts of data into the machine in order to enable it to search through the data to respond to questions.

However, a question can be composed not only of natural language but of images as well, so an answer must be generated from this combination. I believe my research on "visual Q&A" can be used in cases like that.

Further down the road, we can expect it to be applied to medical diagnoses, such as determining a disease's name from X-ray photographs and judging what is shown in the shadows. If we evolve AI further and give it a personality, we may even be able to "converse with AI" to exchange opinions with it while watching TV.
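As an editorial illustration of the visual Q&A idea described above, the following minimal sketch shows how a word in the question, such as "chewing," could steer a system toward one region of an image. It uses a simple attention-style weighting, which is an assumption and not the method described in the interview. The region names, feature vectors, and word vectors are invented by hand; a trained model would learn them from combinations of images, questions, and answers, and the per-region answer lookup here stands in for a learned answer generator.

```python
import math

# Toy sketch of question-guided attention over image regions.
# All regions, features, and word vectors below are invented for illustration.
REGIONS = {
    "dog_mouth": {"features": [0.9, 0.1, 0.0], "content": "a carrot"},
    "dog_tail": {"features": [0.1, 0.8, 0.1], "content": "a tail"},
    "background": {"features": [0.0, 0.1, 0.9], "content": "grass"},
}
WORD_VECTORS = {
    "chewing": [1.0, 0.0, 0.0],
    "wagging": [0.0, 1.0, 0.0],
}


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(question):
    """Return one attention weight per image region for the question."""
    # Sum the vectors of known words; unknown words contribute nothing.
    query = [0.0, 0.0, 0.0]
    for word in question.lower().strip("?").split():
        vec = WORD_VECTORS.get(word)
        if vec:
            query = [q + v for q, v in zip(query, vec)]
    names = list(REGIONS)
    # Score each region by the dot product of the query and its features.
    scores = [
        sum(q * f for q, f in zip(query, REGIONS[name]["features"]))
        for name in names
    ]
    return dict(zip(names, softmax(scores)))


def answer(question):
    """Answer with the content of the most strongly attended region."""
    weights = attend(question)
    best = max(weights, key=weights.get)
    return REGIONS[best]["content"]


if __name__ == "__main__":
    print(attend("What is the dog chewing?"))
    print(answer("What is the dog chewing?"))
```

No line-of-sight data appears anywhere in the sketch: the weighting comes only from the question text and the image features, which is the distinction drawn in the interview.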

Ushiku: Another example of dialog with AI, one that other researchers are working on, is dialog that involves images. For instance, suppose a customer is talking to a shoe store salesperson about the type of shoes they want. The salesperson shows the customer a variety of recommended shoes, and the customer says, "I want the last design I saw but with slightly lower heels." Then, the salesperson suggests new products to meet that request. The dialog moves along not only through the exchange of words but also according to the products' visual information. Suppose that AI could also participate in a dialog based on both the rich exchanges that humans naturally engage in and visual information. In that case, the AI could suggest several pairs of shoes like a concierge and recommend shoes to the customer's liking.

The idea is that there will be more things that AI can do if it can freely switch between dialog using images and dialog using natural language. That is the gist of the "integration of visual perception and natural language" that I am working on.

Yuzuki: It would be great if we could ask AI to do a daily fashion check for us.

Ushiku: A friend of mine is researching precisely that. With fashion in particular, trends vary by region, too. Trends are visualized from big data for AI to judge whether you are on-trend.

There has been news of the development of an AI in China that can virtually check whether clothes fit a person. I want to conduct research to enable AI not just to see whether a product fits a person but also to understand the latest trends.

Doraemon Is the Final Objective

Yuzuki: What kind of future awaits us if you succeed in your research?

Ushiku: Ultimately, what I'd like to develop is "Doraemon."*2 I don't want to say this too much because so many people say the same thing (laughs). It would be great if people and machines could share what they see, hear, eat, and touch, just as in the animated world of Doraemon, rather than just conversing with AI or having it label things.

Say a robot can sense and judge what a human sees and hears by some means. For example, the robot sees that a product is in a specific condition and assesses that the person will take out a wrench to do some work. In that case, we want the robot to decide to pick up the wrench. Humans do the things that require human judgment, and robots, as assistants, support humans and do the rest in response. That's what OMRON calls human-machine collaboration.

Yuzuki: It's like a master and apprentice relationship.

Ushiku: I want to build that kind of trust-based relationship. However, the master is also human, so the person may sometimes make a mistake. When that happens, it would be nice if the AI could tell us gently, "You've forgotten this process." That's what I'd like to achieve.

I expect that in the society of the future, people will use machines, namely robots as tools and AI as software, to play a more active role in creative fields, generating new ideas and spreading them throughout society. To the machine, the work of the machine, to man the thrill of further creation.

*2 Doraemon is a famous animated character in Japan. It is an imaginary blue robot cat that came from the 22nd century. It has a four-dimensional pocket that holds many fancy gadgets that can be used to help and improve our daily lives. These gadgets include a small helicopter that lets a person fly just by putting it on their head.


Dr. Ushiku aims for humans to engage in natural dialog with robots through research on the "mutual understanding of natural language and visual perception." As mentioned in the Doraemon example, a sci-fi-like scenario lies right before our eyes. It will be fascinating to see how AI evolves.


Yoshitaka Ushiku
Principal Investigator
OMRON SINIC X Corporation
Dr. Ushiku completed his doctorate at the Graduate School of Information Science and Technology of the University of Tokyo in 2014 and joined NTT Communication Science Laboratories. After working as a lecturer at his alma mater in 2016, Ushiku was appointed as a principal investigator at OMRON SINIC X Corporation in October 2018. He has been the Chief Research Officer of Ridge-i Inc. since 2019. Dr. Ushiku primarily studies cross-media understanding by machine learning, such as image caption generation.

Hiromi Yuzuki
IT journalist
Ms. Yuzuki pens Apple-related articles, including tips for using iPads for work, and produces video reports on overseas tech information. She has appeared on "The World Unknown to Matsuko" as an iPhone case expert. Her YouTube channel is called Gadgetouch.