This undermines the claim by GPT-2's creators that they withheld the full model because they feared it would be used to create malicious textual deep fakes: I suspect that such fakes would routinely exhibit significant "world modelling failures". (The usual caveats apply.)
And Seabrook's experience illustrates a crucial lesson that AI researchers learned 40 or 50 years ago: "evaluation by demonstration" is a recipe for what John Pierce called glamor and (self-) deceit ("Whither Speech Recognition?", JASA 1969). Why? Because we humans are prone to over-generalizing and anthropomorphizing the behavior of machines; and because someone who wants to show how good a system is will choose successful examples and discard failures. I'd be surprised if Seabrook didn't do a bit of this in creating and selecting his "Read Predicted Text" examples.
In general, anecdotal experiences are not a reliable basis for evaluating scientific or technological progress, and badly designed experiments are, if anything, worse.
Liberman recommends thinking about the Winograd Schema Challenge, which poses pronoun-resolution problems that people find trivially easy but that remain difficult for computational systems, GPT-2 included.
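For concreteness, here is a minimal sketch of one common way to probe a language model on a single Winograd schema: substitute each candidate referent for the ambiguous pronoun and see which version the model assigns higher probability, roughly in the spirit of Trinh & Le (2018). This is not Liberman's or OpenAI's evaluation code; it assumes the Hugging Face transformers package and the small, publicly released GPT-2 model.

# Hypothetical sketch: score one Winograd schema with GPT-2 by comparing
# the log-likelihood of the sentence with each candidate referent
# substituted for the pronoun. Assumes `pip install torch transformers`.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_logprob(text):
    """Total log-probability GPT-2 assigns to `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels == input_ids, the model returns the mean cross-entropy
        # over the predicted tokens; multiply by their count for a total.
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.size(1) - 1)

# A classic schema: the special word ("small" vs. "large") flips which
# noun the pronoun refers to, while the rest of the sentence stays fixed.
template = "The trophy doesn't fit into the brown suitcase because the {} is too {}."
for special_word, expected in [("small", "suitcase"), ("large", "trophy")]:
    scores = {cand: sentence_logprob(template.format(cand, special_word))
              for cand in ("trophy", "suitcase")}
    guess = max(scores, key=scores.get)
    print(f"too {special_word}: model prefers '{guess}' (expected '{expected}')")

Note that guessing at random gets 50% of such items right, so a system has to do substantially better than that, across many schemas, before the result says anything about "understanding"; likelihood comparison is used here only because GPT-2 has no way to be asked the question directly.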
