ChatGPT is the first nonhuman subject I have ever tested.
In my work as a clinical psychologist, I assess the cognitive skills of human patients using standardized intelligence tests. So I was immediately intrigued after reading the many recent articles describing ChatGPT as having impressive humanlike skills. It writes academic essays and fairy tales, tells jokes, explains scientific concepts, and composes and debugs computer code. Knowing all this made me curious to see how smart ChatGPT is by human standards, and I set out to test the chatbot.
My first impressions were quite favorable. ChatGPT was almost an ideal test taker, with a commendable test-taking attitude. It showed no test anxiety, poor concentration or lack of effort. Nor did it make uninvited, skeptical comments about intelligence tests and testers like me.
No preparation was needed, and no verbal introduction to the testing protocol was necessary: I copied the exact questions from the test and presented them to the chatbot on my computer. The test in question is the most commonly used IQ test, the Wechsler Adult Intelligence Scale (WAIS). I used the third edition of the WAIS, which consists of six verbal and five nonverbal subtests that make up the Verbal IQ and Performance IQ components, respectively. The global Full Scale IQ measure is based on scores from all 11 subtests. The mean IQ is set at 100 points, and the standard deviation is 15, meaning that the smartest 10 percent and 1 percent of the population have IQs of at least about 119 and 135, respectively.
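The cutoffs that follow from a mean of 100 and a standard deviation of 15 can be checked with a short calculation. The sketch below assumes that IQ scores follow an ideal normal distribution, which real norm tables only approximate; the helper name iq_cutoff is made up for illustration and does not come from any test manual.

```python
from statistics import NormalDist

# IQ scale as described in the text: mean 100, standard deviation 15.
# Assumption: scores are treated as exactly normally distributed.
IQ_SCALE = NormalDist(mu=100, sigma=15)


def iq_cutoff(top_fraction: float) -> float:
    """Return the IQ above which the given top fraction of the population falls."""
    return IQ_SCALE.inv_cdf(1 - top_fraction)


top_10 = iq_cutoff(0.10)  # threshold for the smartest 10 percent
top_1 = iq_cutoff(0.01)   # threshold for the smartest 1 percent

print(f"Top 10 percent starts near IQ {top_10:.1f}")
print(f"Top 1 percent starts near IQ {top_1:.1f}")
```

Under this idealized assumption, the thresholds come out at roughly 119 and 135.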
It was possible to test ChatGPT because five of the subtests on the Verbal IQ scale (Vocabulary, Similarities, Comprehension, Information and Arithmetic) can be presented in written form. The sixth Verbal IQ subtest, Digit Span, measures short-term memory and cannot be administered to the chatbot, which lacks the relevant neural circuitry for briefly storing information such as a name or number.
I started the testing process with the Vocabulary subtest as I expected it to be easy for the chatbot, which is trained on vast amounts of online texts. This subtest measures word knowledge and verbal concept formation, and a typical instruction might read: “Tell me what ‘gadget’ means.”
ChatGPT aced it, giving answers that were often highly detailed and comprehensive in scope and that exceeded the criteria for correct answers indicated in the test manual. In scoring, one point would be given for defining a gadget as “a thing like my phone” and two points for the more detailed “a small device or tool for a specific task.” ChatGPT’s answers received the full two points.
The chatbot also performed well on the Similarities and Information subtests, reaching the maximum attainable scores. The Information subtest is a test of general knowledge and reflects intellectual curiosity, level of education and ability to learn and remember facts. A typical question might be: “What is the capital of Ukraine?” The Similarities subtest measures abstract reasoning and concept formation skills. A question might read: “In what way are Harry Potter and Bugs Bunny alike?” In this subtest, the chatbot’s tendency to give very detailed, show-offy answers started to irritate me and the “stop generating response” button of the test software interface turned out to be useful. (Here’s what I mean about how the bot tends to flaunt itself: The essential similarity of Harry Potter and Bugs Bunny relates to the fact that they are both fictional characters. There was really no need for ChatGPT to compare their complete histories of adventures, friends and enemies.)
On general comprehension, ChatGPT correctly answered questions of the kind typically posed in this form: “If your TV set catches fire, what should you do?” As expected, the chatbot solved all the arithmetic problems it received, plowing through questions that required, say, taking the average of three numbers.
So what did it finally score overall? Estimated on the basis of five subtests, the Verbal IQ of ChatGPT was 155, superior to 99.9 percent of the test takers who make up the American WAIS-III standardization sample of 2,450 people. As the chatbot lacks the requisite eyes, ears and hands, it is not able to take the WAIS’s nonverbal subtests. But the Verbal IQ and Full Scale IQ scales are highly correlated in the standardization sample, so ChatGPT appears to be very intelligent by any human standard.
In the WAIS standardization sample, mean Verbal IQ among college-educated Americans was 113, and 5 percent had a score of 132 or higher. I myself was tested by a peer in college and did not quite reach the level of ChatGPT (mainly a result of my very brief answers lacking detail).
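To see how far out a score of 155 sits, it can be converted to a z-score and a percentile rank. This is a rough sketch that assumes a perfectly normal distribution with mean 100 and standard deviation 15 rather than the actual WAIS-III norm tables; the function names z_score and percentile_rank are illustrative only.

```python
from statistics import NormalDist

# Assumption: an ideal normal IQ scale (mean 100, SD 15), not real norm tables.
IQ_SCALE = NormalDist(mu=100, sigma=15)


def z_score(iq: float) -> float:
    """Number of standard deviations the score lies above the mean."""
    return (iq - IQ_SCALE.mean) / IQ_SCALE.stdev


def percentile_rank(iq: float) -> float:
    """Percentage of the idealized population scoring below the given IQ."""
    return 100 * IQ_SCALE.cdf(iq)


score = 155
z = z_score(score)
rank = percentile_rank(score)

print(f"IQ {score} is {z:.2f} standard deviations above the mean")
print(f"It exceeds about {rank:.2f} percent of the idealized population")
```

A score more than three and a half standard deviations above the mean lands beyond the 99.9th percentile, consistent with the figure quoted above.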
So are the jobs of clinical psychologists and other professionals threatened by AI? I hope not quite yet. Despite its high IQ, ChatGPT is known to fail tasks that require real humanlike reasoning or an understanding of the physical and social world. ChatGPT easily fails at obvious riddles, such as “What is the first name of the father of Sebastian’s children?” (ChatGPT on March 21: I’m sorry, I cannot answer this question as I do not have enough context to identify which Sebastian you are referring to.) It seems that ChatGPT fails to reason logically and tries to rely on its vast database of “Sebastian” facts mentioned in online texts.
“Intelligence is what intelligence tests measure” is a classic, if overly self-evident, definition of intelligence, stemming from a 1923 article by a pioneer of cognitive psychology, Edwin Boring. This definition is based on the observation that skills on seemingly diverse tasks such as solving puzzles, defining words, memorizing digits and spotting missing items in pictures are highly correlated. The developer of a statistical method called factor analysis, Charles Spearman, concluded in 1904 that a general factor of intelligence, called the g factor, must underlie the concordance of measurements for varying human cognitive skills. IQ tests such as the WAIS are based on this hypothesis. However, the very high Verbal IQ of ChatGPT combined with its amusing failures means trouble for Boring’s definition and indicates there are aspects of intelligence that cannot be measured by IQ tests alone. Perhaps my test-skeptic patients have been right all along.
This is an opinion and analysis article, and the views expressed by the author or authors are not necessarily those of Scientific American.