Saturday, 7 February 2026

Just how reliable are AIs? A historian's examination.

A historian and cultural commentator has been examining the reliability of AIs for historical research, with thoughts on the future of AI & us. She summarises what she's discovered below, including answers to such questions as:

  • Which AIs got the highest scores overall?
  • Which AIs got the highest scores by topic: scientific/technical, historical context, creativity, historical and legal?
  • Unavoidable methodological issues with AIs
  • Lessons on use of AIs for historical research
  • Will AIs surpass and replace humans?

Dianne Durante has written several books and maintains a historical blog. What she has used AIs for in the past, "and will still use" them for, are very specific questions:

how to trouble-shoot the document feeder on an HP 8000-series printer/scanner, where to find Gaussian blur on the Adobe InDesign menu, what stretches to use for a tight IT band, how much time to allow for a visit to the Kingsley Plantation, or what the Leopards Eating People’s Faces Party is. An AI [she says] gives me answers much, much faster than I could get them by wading through Google search results. ... 
As a historian [however], I tend to need answers to much more obscure and complex questions. When I started using Grok for such questions last summer, it gave me egregiously incorrect answers. (See Part 1 of this series.)

So I set out to discover:
Are AIs reliable for providing historical facts? Can I trust them to accurately deliver all the relevant details on matters such as Chladni figures and the Proclamation of 1763? Should I assume I always need to do further research? Should I avoid AIs altogether, and spend my research time looking for other sources?

Are AIs useful for going beyond facts to analysis? For example, are they good at providing interpretation, overviews, and/or inductive conclusions, such as a list of the most significant artworks of the 18th century, or of the major events of the 1790s?

Are some AIs better than others, in general or on specific topics?

Head to her many earlier posts (starting back in 2025) to see her detailed methodology and results.

So, how did they all do?  In summary, based on the average of the scores from all 7 of her questions:

Winner: Grok, with 70%. That’s better than the others, but if you were using Grok to write your answers on an exam consisting of my 7 questions, you’d barely scrape through with a C. [That caveat is important.]

Loser: Perplexity, with 38%.

Mid-range: Deepseek (56%), ChatGPT (50%), and Claude (48%).

There was no way to ask Britannica or Wikipedia several of the questions, so I didn’t give them an overall score.

For results by category:

  • Best for Scientific and Technical: Grok and Deepseek (100% and 95% respectively; average = 81%)
  • Best for Historical Context: Claude and Deepseek (60% and 58%; average 51%)
  • Best for Creativity: Perplexity (85%; average 76%)
  • Best for Historical and Legal: Grok (70%; average 52%)
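
As an aside, here is a minimal sketch (mine, not hers) that simply replays the overall percentages above and maps them to exam-style letter grades. The cutoffs are the conventional US scale, which I'm assuming; her posts only mention that 70% barely scrapes a C.

    # Published overall scores (percent), from the summary above.
    overall = {"Grok": 70, "Deepseek": 56, "ChatGPT": 50, "Claude": 48, "Perplexity": 38}

    def letter_grade(pct):
        # Assumed cutoffs: conventional US scale (90/80/70/60). Her posts don't specify one.
        if pct >= 90:
            return "A"
        if pct >= 80:
            return "B"
        if pct >= 70:
            return "C"
        if pct >= 60:
            return "D"
        return "F"

    for ai, score in sorted(overall.items(), key=lambda kv: -kv[1]):
        print(f"{ai}: {score}% -> {letter_grade(score)}")
    # Grok's 70% just clears the C line; on this assumed scale, every other AI tested would fail.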

Head to her post to see what specific questions she asked, and why. She has a few thoughts ("If you have limited time for research, don’t spend every minute of it with AIs"), and a reminder:

    LLMs don’t think. All the AIs I looked at except Britannica’s Chatbot are large language models, a.k.a. LLMs (see Part 3). An LLM is fed an enormous amount of data so it can generate human-like language by predicting what words will follow a particular word or phrase. An AI doesn’t receive your question, gather data, observe how it relates to what it already knows, analyze it according to scientific or philosophical principles, and then consider the most effective way to present the information to you. The AI just predicts what might come next. That’s why it can slide seamlessly from truth to hallucination. An AI will repeat any errors in the data fed into it, be it from major media, random posts on the internet, or Wikipedia. An AI is the ultimate in second-handedness.

    So do not assume accuracy in your answers, especially if it's a topic you don't know much about.
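
To make her "predicts what might come next" point concrete, here is a toy sketch of my own (Python, not anything from her posts or from any actual AI): the crudest possible next-word predictor, a bigram counter. Real LLMs are neural networks trained on vastly more data, but the underlying idea she describes, predicting the next word rather than reasoning about the question, is the same.

    # Toy illustration only: a bigram "model" that predicts the next word as the
    # one most often seen after the current word in a tiny training text.
    from collections import Counter, defaultdict

    training_text = (
        "the proclamation of 1763 was issued by the british crown "
        "the proclamation of 1763 limited settlement west of the appalachians"
    )

    follows = defaultdict(Counter)
    words = training_text.split()
    for current_word, next_word in zip(words, words[1:]):
        follows[current_word][next_word] += 1

    def predict_next(word):
        # Return the most frequent follower; None if the word never appeared.
        if word not in follows:
            return None
        return follows[word].most_common(1)[0][0]

    print(predict_next("proclamation"))  # -> "of"
    print(predict_next("chladni"))       # -> None (no data; a real model would guess anyway)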

    I like her conclusion:

    Re AIs becoming indistinguishable from humans, and then making humans obsolete: if philosophers, biologists, psychologists, et al., can’t explain the mechanisms of free will, the procedure for induction, etc., then we cannot program a computer to do those things. Until and unless we can, AIs are not human-like in the ways that matter most, and cannot replace humans.

     Head to her post to read it all.

    4 comments:

    1. But ALL humans are notoriously reliable and correct on historical facts. The problem with AI is the human-generated slop it is trained on. Tempted to quickly build an app to check Grokipedia, Wikipedia, and Britannica to run her tests. Easy with Antigravity.

      1. It's worse than that, it's trained on inaccurate information, then it makes shit up.

    2. She asserts: "if philosophers, biologists, psychologists, et al., can’t explain the mechanisms of free will, the procedure for induction, etc., then we cannot program a computer to do those things."

      What if they are emergent? If they are, then it would be unnecessary to know how they operate. An obvious approach to try would be a massively parallel distributed system. Allow it to iterate and develop itself - free climbing.

      These are the early, very early developments the public is seeing. Proprietary stuff is vastly superior, and you have yet to be exposed to any of it; there is also some impressive research going on that you have yet to see. These systems are getting better with time and development. They're improving very fast and by a lot. They are already good enough to replace a significant fraction of white-collar jobs, and that fraction is going to increase fast.

      People like to argue that these systems do not have and will not have "free will" or consciousness. It doesn't matter. They are more effective and a lot more efficient for many, many, many tasks than are the present denizens of cubicles and office spaces and the like. They are only going to improve. Join the dots.

    We welcome thoughtful disagreement.
    But we do (ir)regularly moderate comments -- and we *will* delete any with insulting or abusive language. Or if they're just inane. It’s okay to disagree, but pretend you’re having a drink in the living room with the person you’re disagreeing with. This includes me.
    PS: Have the honesty and courage to use your real name. That gives added weight to any opinion.