A historian and cultural commentator has been examining the reliability of AIs for historical research, with thoughts on the future of AI & us. She summarises what she's discovered below, including answers to such questions as:
- Which AIs got the highest scores overall?
- Which AIs got the highest scores by topic: scientific/technical, historical context, creativity, historical and legal?
- Unavoidable methodological issues with AIs
- Lessons on use of AIs for historical research
- Will AIs surpass and replace humans?
Dianne Durante has written several books and maintains a historical blog. What she has used AIs for in the past, "and will still use" them for, are very specific questions:
how to trouble-shoot the document feeder on an HP 8000-series printer/scanner, where to find Gaussian blur on the Adobe InDesign menu, what stretches to use for a tight IT band, how much time to allow for a visit to the Kingsley Plantation, or what the Leopards Eating People’s Faces Party is. An AI [she says] gives me answers much, much faster than I could get them by wading through Google search results. ...
As a historian [however], I tend to need answers to much more obscure and complex questions. When I started using Grok for such questions last summer, it gave me egregiously incorrect answers. (See Part 1 of this series.)
So I set out to discover:
Are AIs reliable for providing historical facts? Can I trust them to accurately deliver all the relevant details on matters such as Chladni figures and the Proclamation of 1763? Should I assume I always need to do further research? Should I avoid AIs altogether, and spend my research time looking for other sources?
Are AIs useful for going beyond facts to analysis? For example, are they good at providing interpretation, overviews, and/or inductive conclusions, such as a list of the most significant artworks of the 18th century, or of the major events of the 1790s?
Are some AIs better than others, in general or on specific topics?
Head to her many earlier posts (starting back in xxx 2025) to see her detailed methodology and results.
So, how did they all do? In summary, based on the average of the scores from all 7 of her questions:
Winner: Grok, with 70%. That’s better than the others, but if you were using Grok to write your answers on an exam consisting of my 7 questions, you’d barely scrape through with a C. [That caveat is important.]
Loser: Perplexity, with 38%.
Mid-range: ChatGPT (50%), Claude (48%), and Deepseek (56%).
There was no way to ask Britannica or Wikipedia several of the questions, so I didn’t give them an overall score.
For results by category, best for Scientific and Technical: Grok and Deepseek (100% and 95% respectively; average = 81%).
... best for Historical Context: Claude and Deepseek (60% and 58% respectively; average = 51%).
... best for Creativity: Perplexity (85%; average = 76%).
... best for Historical and Legal: Grok (70%; average = 52%).
Head to her post to see what specific questions she asked, and why. She has a few thoughts ("If you have limited time for research, don’t spend every minute of it with AIs"), and a reminder:
LLMs don’t think. All the AIs I looked at except Britannica’s Chatbot are large-language models, a.k.a. LLMs (see Part 3). An LLM is fed an enormous amount of data so it can generate human-like language by predicting what words will follow a particular word or phrase. An AI doesn’t receive your question, gather data, observe how it relates to what it already knows, analyze it according to scientific or philosophical principles, and then consider the most effective way to present the information to you. The AI just predicts what might come next. That’s why it can slide seamlessly from truth to hallucination. An AI will repeat any errors in the data fed into it, be it from major media, random posts on the internet, or Wikipedia. An AI is the ultimate in second-handedness.
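To make "just predicts what might come next" concrete, here is a toy sketch of next-word prediction: a crude bigram model in Python, written purely for illustration. No real LLM works at this scale (they use large neural networks over subword tokens), but the generation loop is the same idea: predict the next token, append it, repeat.

```python
import random
from collections import defaultdict

def train(corpus: str):
    """Count, for each word, how often each other word follows it."""
    words = corpus.lower().split()
    counts = defaultdict(lambda: defaultdict(int))
    for current, nxt in zip(words, words[1:]):
        counts[current][nxt] += 1
    return counts

def generate(counts, start: str, length: int = 10) -> str:
    """Generate text by repeatedly sampling a statistically likely next word."""
    word, output = start, [start]
    for _ in range(length):
        followers = counts.get(word)
        if not followers:
            break  # the model has never seen anything follow this word
        word = random.choices(list(followers), weights=list(followers.values()))[0]
        output.append(word)
    return " ".join(output)

# Tiny "training set"; a real model ingests terabytes of text.
corpus = ("the proclamation of 1763 forbade colonial settlement west of "
          "the appalachians and the proclamation angered many colonists")
model = train(corpus)
print(generate(model, "the"))
# Nothing here checks whether the output is true; the model only
# reflects the statistics of whatever text it was trained on.
```

Note that nothing in the model has any notion of truth. Feed it errors and it will fluently repeat them, which is exactly the "second-handedness" point above.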
So do not assume your answers are accurate, especially on a topic you don't know much about.
I like her conclusion:
Re AIs becoming indistinguishable from humans, and then making humans obsolete: if philosophers, biologists, psychologists, et al., can’t explain the mechanisms of free will, the procedure for induction, etc., then we cannot program a computer to do those things. Until and unless we can, AIs are not human-like in the ways that matter most, and cannot replace humans.
Head to her post to read it all.
1 comment:
But ALL humans are notoriously reliable and correct on historical facts. The problem with AI is the human-generated slop it is trained on. Tempted to quickly build an app to check Grokipedia, Wikipedia, and Britannica to run her tests. Easy with Antigravity.