Nice tech, but shame on Microsoft, Alibaba’s spinners
Journos fell over themselves to breathlessly report that, for instance, “ROBOTS CAN NOW READ BETTER THAN HUMANS, PUTTING MILLIONS OF JOBS AT RISK.” Real headline.
But don’t be fooled. It’s just more drivel pumped out by corporate spin doctors to boost share prices, attract customers, and claim bragging rights. Microsoft and Alibaba both announced on Monday that they had developed machine-learning code that can read and understand text well enough to pass a comprehension test at, apparently, the same level as humans.
Luo Si, chief scientist for natural language at Alibaba’s Institute of Data Science and Technologies, gushed: “It is our great honor to witness the milestone where machines surpass humans in reading comprehension.”
Microsoft was a little more modest. Ming Zhou, assistant managing director of Microsoft Research Asia, claimed its results were “an important milestone,” but he admitted that, overall, people are still “much better than machines at comprehending the complexity and nuance of language.”
Clearly, Redmond has learned from last year’s Ms Pac-Man overreach.
Looking past their boasts this week, neither Alibaba nor Microsoft revealed much in the way of detail about their text-reading AI models. However, at the heart of it all is the Stanford Question Answering Dataset (SQuAD). It contains more than 100,000 question-and-answer pairs based on text taken from more than 500 Wikipedia articles, making it one of the biggest datasets to test machine reading capabilities.
SQuAD is arranged like this: you have chunks of text taken from Wikipedia, then a set of five questions and answers for every text chunk. For example, from a page on southern California, a text chunk about the region’s surroundings has the question “what is the name of the border to the south?” and the correct answer “the Mexico–United States border.”
The idea is to train an AI system to read and understand those text chunks so it can correctly answer the associated questions. The goal is to develop software smart enough to answer the questions as well as, if not more accurately than, normal living folk.
A quick peek at the SQuAD leaderboard shows Microsoft and Alibaba are tied in first place with respective scores of 82.650 and 82.440, both higher than the human benchmark score of 82.304.
But read the description of SQuAD more closely. It consists of “questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage.”
Alarm bells should be ringing inside your head right now. The answer to every question is explicitly contained in the text. It’s not so much reading comprehension as text extraction. There is no real understanding of the prose by the machines; it’s a case of enhanced pattern matching. Human beings are smarter than this.
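To make that concrete, here is a minimal Python sketch of what one such record might look like. The field names are modelled on the dataset's published JSON layout, but treat them, the helper locate_answer, and above all the passage as assumptions: the passage is a shortened stand-in written for illustration, not the verbatim Wikipedia text.

```python
# Hypothetical, shortened stand-in for one SQuAD-style record.
# The passage below is illustrative, not the real dataset entry.
record = {
    "context": (
        "The region has deserts to the east. "
        "To the south is the Mexico–United States border."
    ),
    "qas": [
        {
            "question": "What is the name of the border to the south?",
            "answers": [{"text": "Mexico–United States border"}],
        }
    ],
}


def locate_answer(context, answer_text):
    """Return (start, end) character offsets of the answer span, or None."""
    start = context.find(answer_text)
    if start == -1:
        return None
    return start, start + len(answer_text)


for qa in record["qas"]:
    answer = qa["answers"][0]["text"]
    span = locate_answer(record["context"], answer)
    # The answer is literally a slice of the passage.
    print(qa["question"], "->", span)
```

The point of the sketch is the last step: the expected answer can always be recovered by slicing the passage, which is why the task looks more like lookup than comprehension.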
Let’s go back to the southern California example. The question is “what is the name of the border to the south?”…
…and the answer is “the Mexico–United States border.” But look at the associated text chunk. We’ve highlighted the relevant part by making it italic, not that, as a human reader, you need it:
The AI just has to work out that it needs to locate the words relevant to “border” and “south,” and pull out “Mexico–United States border.” Cool. Nice bit of programming. Hats off, really. But a breakthrough in machine intelligence? No. Way.
The computers have no idea what the words actually mean. The definition doesn’t matter. To the software, it’s all vectors and matrices of numbers linking similar strings of characters. Thus, it’s easy for algorithms to find the answer in the text by searching for related or matching words using these vectors.
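For a feel of how far dumb matching can get you, here is a toy Python sketch. It is emphatically not Microsoft's or Alibaba's model: it swaps the learned vectors for plain word overlap, and the names STOPWORDS, words and extract_answer, along with the two-sentence passage, are made up for illustration.

```python
import re

# Assumed, tiny stop-word list; a real system would learn weights instead.
STOPWORDS = {"what", "is", "the", "of", "to", "a", "an"}

# Illustrative stand-in passage, not the verbatim Wikipedia text.
PASSAGE = (
    "The region has deserts to the east. "
    "To the south is the Mexico–United States border."
)
QUESTION = "What is the name of the border to the south?"


def words(text):
    """Split text into word tokens, keeping hyphen- and dash-joined words whole."""
    return re.findall(r"[\w\-–]+", text)


def extract_answer(passage, question):
    """Pick the sentence sharing most content words with the question,
    then return its longest run of words that the question did not use."""
    q_all = {w.lower() for w in words(question)}
    q_content = q_all - STOPWORDS
    sentences = re.split(r"(?<=[.!?])\s+", passage.strip())
    best = max(
        sentences,
        key=lambda s: len(q_content & {w.lower() for w in words(s)}),
    )
    runs, current = [], []
    for w in words(best):
        if w.lower() in q_all:
            # A question word ends the current candidate run.
            if current:
                runs.append(current)
            current = []
        else:
            current.append(w)
    if current:
        runs.append(current)
    return " ".join(max(runs, key=len, default=[]))


print(extract_answer(PASSAGE, QUESTION))
```

On this stand-in passage the script prints "Mexico–United States", most of the expected answer, without any notion of what a border or a country is. The real systems are vastly more sophisticated, but the sketch shows why an answer that always sits in the passage makes the job so much easier.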
Yoav Goldberg, a senior lecturer teaching computer science at Bar-Ilan University in Israel, pointed out that the conditions in which humans are tested against the dataset, to set the goal line for the AI software, aren’t great. That part is carried out by people who sign up to Amazon’s Mechanical Turk, a service that pays serfs peanuts to complete jobs such as identifying objects in photos or videos, or transcribing audio to collect data.
Essentially, a bunch of people are each given the SQuAD text chunks and the associated questions, and get two minutes to answer all five conundrums. Each turk is then scored on how accurately they answered the queries, and earns a small amount of change. It’s pretty mundane, and given that these cyber-rubes get paid more if they work faster, they’re likely to slip up or not perform their best.
In other words, the benchmark human accuracy score may be set unreasonably low, allowing machines to appear smarter than the average bod.
It’s pretty clear that machines are not even close to reaching human parity in reading text, so don’t believe the awful headlines that this will put millions of jobs at risk. ®