AI coding helpers get FAILing grade

Purdue researchers expose the frequent errors made by generative AI tools like Copilot when asked basic development questions.

Richi Jennings, Independent industry analyst, editor, and content strategist.

An academic study says ChatGPT is wrong more than half the time when asked the sort of programming questions you’d find on Stack Overflow. The “comprehensive analysis” concludes that GitHub Copilot’s LLM engine will make many conceptual errors, couching its output in a wordy, confident and authoritative tone.

So, it’s hard to spot the errors, say the researchers. In this week’s Secure Software Blogwatch, we can’t say we’re totally surprised.

Your humble blogwatcher curated these bloggy bits for your entertainment. Not to mention: Turing’s original paper.

Fools rush in

What’s the craic? I don’t think Sabrina Ortiz is an AI, but you never can be sure — “ChatGPT answers more than half of software engineering questions incorrectly”:

“Incorrect ChatGPT-generated answers”
Despite the convenience, a new study finds that you may not want to use ChatGPT for software engineering prompts. … A new Purdue University study investigated … just how efficient ChatGPT is in answering software engineering prompts: … 52% of ChatGPT's answers were incorrect and only … 48% were correct.
…
To further analyze the quality of ChatGPT responses, the researchers asked 12 participants with different levels of programming expertise to give their insights on the answers. [They] failed to correctly identify incorrect ChatGPT-generated answers 39.34% of the time.

That’s really worrying. John Lopez sounds concerned — “ChatGPT Fools a Third of Users”:

“Conceptual errors”
The assessment spanned various criteria, including correctness, consistency, comprehensiveness, and conciseness. The results were both enlightening and concerning.
…
The study … delves deep into the quality and usability of ChatGPT's responses, uncovering some intriguing and, at times, problematic findings. … The Purdue team meticulously examined ChatGPT's answers to 517 questions sourced from Stack Overflow.
…
The model often provides solutions or code snippets without clearly understanding their implications. … The research highlighted a distinctive trait of ChatGPT's approach - a propensity for conceptual errors. The model seems to struggle to grasp the underlying context of questions, leading to a higher frequency of errors.

Horse’s mouth? Samia Kabir, David N. Udo-Imeh, Bonan Kou and Tianyi Zhang ask, “Who Answers It Better?”:

“Logical errors”
Software developers often resort to online resources for a variety of software engineering tasks, e.g., API learning, bug fixing, comprehension of code or concepts, etc. … The emergence of Large Language Models (LLMs) has demonstrated the potential to transform the web help-seeking patterns of software developers [e.g.,] GitHub Copilot.
…
Despite the increasing popularity of ChatGPT, concerns remain surrounding its nature as a generative model and the associated risks. … Many answers are incorrect due to ChatGPT’s incapability to understand the underlying context of the question. [They are] significantly lengthy, and not consistent with human answers half of the time. … Verifying the answers — no matter how correct or trustworthy they look — is imperative.
…
The majority of the code errors are due to applying wrong logic or implementing non-existing or wrong API, Library, or Functions. … Also, in many cases, ChatGPT makes discernible logical errors that no human expert can make. For example, setting a loop ending condition to be equal to something that is never true or never false — e.g., while (i<0 and i>10)
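To see why that example condition is a logical error, here is a minimal Python sketch. The function names and the sampled range are illustrative assumptions, not taken from the paper. It checks that no integer in the sample satisfies `i < 0 and i > 10`, so a loop guarded by that condition never runs:

```python
def impossible_condition(i: int) -> bool:
    """The paper's example guard: i cannot be below 0 and above 10 at once."""
    return i < 0 and i > 10


def count_iterations(start: int, limit: int = 1000) -> int:
    """Count how many times a loop guarded by the condition executes."""
    i = start
    iterations = 0
    # `limit` is a safety cap so the sketch always terminates.
    while impossible_condition(i) and iterations < limit:
        iterations += 1
        i += 1
    return iterations


# Brute-force check over a sample range (illustrative only, not a proof).
satisfying = [i for i in range(-1000, 1001) if impossible_condition(i)]
print(satisfying)            # values that satisfy the guard
print(count_iterations(-5))  # iterations when starting below 0
print(count_iterations(20))  # iterations when starting above 10
```

The likely intent was `i < 0 or i > 10`, but that is a guess. The paper gives only the faulty condition.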

Of course, it’s just a preprint. So it’s not undergone peer review — which is where we come in. vba616 follows the money:

The free version of ChatGPT … has a seemingly consistent type of answer whenever I ask it how to code something I don't know how to do: It outputs some boilerplate code, and inserts a method with comments saying "insert logic here" — entirely sidestepping the request.
…
If it was a human cow-orker, you would not keep them on the payroll. It is a cheat, like ELIZA, like SHRDLU, like a useless employee.
…
I asked ChatGPT-3.5 for the usual rejoinder to complainers: "If you're looking for an enhanced experience, you might want to consider trying out the paid version, called ChatGPT-4."

It also rings true for Justin Pasher:

I've typically gone to ChatGPT for some more obscure technical problems that I struggle finding meaningful answers to on Google. … If you don't understand the nuances of the concept you are dealing with, you probably can't figure out how to fix little things that are wrong. Your best bet is just asking again and seeing if it can fix it.
…
For example, I was asking questions about using some PowerShell commands to do something. It kept giving me commands (which were valid) with parameters that were not. I had to keep correcting it by saying, "Command X doesn't support the Y parameter." It would apologize, then continue to give answers that simply did not work.
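That kind of check can be done before running a suggested call. As a rough Python sketch, the standard library's `inspect.signature` can show whether a function actually declares a given parameter. The helper `supports_parameter` and the example function `resize` are hypothetical, and this covers Python callables rather than PowerShell cmdlets:

```python
import inspect


def supports_parameter(func, name: str) -> bool:
    """Return True if `func` declares a parameter called `name`."""
    params = inspect.signature(func).parameters
    if name in params:
        return True
    # A **kwargs catch-all accepts any keyword, so treat that as supported.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


def resize(image, width, height):
    """Hypothetical function standing in for 'Command X'."""
    return (image, width, height)


print(supports_parameter(resize, "width"))  # a declared parameter
print(supports_parameter(resize, "scale"))  # a parameter the function lacks
```

A signature check only confirms that the parameter exists. It says nothing about whether the suggested value or the overall logic is right.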

But a 50:50 chance ain’t bad — right? Wrong, thinks Anubis IV:

Close enough is fine in horseshoes and hand grenades, but not when providing authoritative answers for StackOverflow. Confidently wrong 50% of the time is unacceptable, and any human with that track record would know to shut up and listen.
…
Iterating with it is an exercise in frustration. It apologizes and then regresses, then apologizes and introduces syntax from another language, then apologizes and rewrites the code completely. A better way of thinking of it is that when it’s wrong it’ll likely never get it right eventually, or if it does the effort far outstrips any benefits.

And then there’s the problem of spotting the errors. vba616 digs deeper:

I have a recent example of Google getting something trivial wrong and ChatGPT getting it right. But oddly that doesn't change my opinion of ChatGPT: The only reason I know that ChatGPT was right is because I blindly followed the first Googled suggestion.

Two can play at that game. Richard 12 plays at that game, too:

In my experience, when prompted in subjects I know well, ChatGPT produces somewhat- to very-wrong answers most of the time. Whether right, inaccurate or completely wrong, it produces equally confident language. It is therefore extremely likely to produce confidently wrong answers in subjects that I do not know.
…
I cannot tell whether the text it is spewing is approximately correct, dangerously wrong, or merely somewhat inaccurate. Therefore it is clearly worse than useless.

Meanwhile, RecycledEle steps back for a wider perspective:

I can remember a time when a computer program getting 48% of coding questions right was science fiction.

And Finally:

Testing Turing


You have been reading Secure Software Blogwatch by Richi Jennings. Richi curates the best bloggy bits, finest forums, and weirdest websites … so you don’t have to. Hate mail may be directed to @RiCHi, @richij or ssbw@richi.uk. Ask your doctor before reading. Your mileage may vary. Past performance is no guarantee of future results. Do not stare into laser with remaining eye. E&OE. 30.

Image sauce: KOBU (via Unsplash; leveled and cropped)

