Dev & DevSecOps | August 15, 2023

AI coding helpers get FAILing grade

Purdue researchers expose the frequent errors that generative AI tools such as GitHub Copilot make when asked basic development questions.

Richi Jennings, independent industry analyst, editor, and content strategist

An academic study says ChatGPT is wrong more than half the time when asked the sort of programming questions you’d find on Stack Overflow. The “comprehensive analysis” concludes that GitHub Copilot’s LLM engine makes many conceptual errors, couching its output in a wordy, confident, authoritative tone.

So, it’s hard to spot the errors, say the researchers. In this week’s Secure Software Blogwatch, we can’t say we’re totally surprised.

Your humble blogwatcher curated these bloggy bits for your entertainment. Not to mention: Turing’s original paper.

See also: AI and the software supply chain: Application security just got a whole lot more complicated

Fools rush in

What’s the craic? I don’t think Sabrina Ortiz is an AI, but you never can be sure — “ChatGPT answers more than half of software engineering questions incorrectly”:

“Incorrect ChatGPT-generated answers”

Despite the convenience, a new study finds that you may not want to use ChatGPT for software engineering prompts. … A new Purdue University study investigated … just how efficient ChatGPT is in answering software engineering prompts: … 52% of ChatGPT's answers were incorrect and only … 48% were correct.

…

To further analyze the quality of ChatGPT responses, the researchers asked 12 participants with different levels of programming expertise to give their insights on the answers. [They] failed to correctly identify incorrect ChatGPT-generated answers 39.34% of the time.

That’s really worrying. John Lopez sounds concerned — “ChatGPT Fools a Third of Users”:

“Conceptual errors”

The assessment spanned various criteria, including correctness, consistency, comprehensiveness, and conciseness. The results were both enlightening and concerning.

…

The study … delves deep into the quality and usability of ChatGPT's responses, uncovering some intriguing and, at times, problematic findings. … The Purdue team meticulously examined ChatGPT's answers to 517 questions sourced from Stack Overflow.

…

The model often provides solutions or code snippets without clearly understanding their implications. … The research highlighted a distinctive trait of ChatGPT's approach - a propensity for conceptual errors. The model seems to struggle to grasp the underlying context of questions, leading to a higher frequency of errors.

Horse’s mouth? Samia Kabir, David N. Udo-Imeh, Bonan Kou and Tianyi Zhang ask, “Who Answers It Better?”:

“Logical errors”

Software developers often resort to online resources for a variety of software engineering tasks, e.g., API learning, bug fixing, comprehension of code or concepts, etc. … The emergence of Large Language Models (LLMs) has demonstrated the potential to transform the web help-seeking patterns of software developers [e.g.,] GitHub Copilot.

…

Despite the increasing popularity of ChatGPT, concerns remain surrounding its nature as a generative model and the associated risks. … Many answers are incorrect due to ChatGPT’s incapability to understand the underlying context of the question. [They are] significantly lengthy, and not consistent with human answers half of the time. … Verifying the answers — no matter how correct or trustworthy they look — is imperative.

…

The majority of the code errors are due to applying wrong logic or implementing non-existing or wrong API, Library, or Functions. … Also, in many cases, ChatGPT makes discernible logical errors that no human expert can make. For example, setting a loop ending condition to be equal to something that is never true or never false — e.g., while (i<0 and i>10)
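The impossible loop condition the researchers call out is easy to demonstrate. Here is a minimal Python sketch (the paper’s example is language-agnostic; the function name is mine) showing that no number can satisfy `i < 0 and i > 10`, so the loop body can never execute:

```python
def run_loop():
    """Demonstrate a loop guard that is never true for any value of i."""
    i = 5
    iterations = 0
    # No number is simultaneously negative and greater than 10,
    # so this condition is always False and the loop never runs.
    while i < 0 and i > 10:
        iterations += 1
        i -= 1
    return iterations

print(run_loop())  # 0 -- the body was never entered
```

A human reviewer spots this at a glance; the point of the study is that ChatGPT emits such conditions wrapped in confident, plausible-sounding explanation.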

Of course, it’s just a preprint. So it’s not undergone peer review — which is where we come in. vba616 follows the money:

The free version of ChatGPT … has a seemingly consistent type of answer whenever I ask it how to code something I don't know how to do: It outputs some boilerplate code, and inserts a method with comments saying "insert logic here" — entirely sidestepping the request.

…

If it was a human cow-orker, you would not keep them on the payroll. It is a cheat, like ELIZA, like SHRDLU, like a useless employee.

…

I asked ChatGPT-3.5 for the usual rejoinder to complainers: "If you're looking for an enhanced experience, you might want to consider trying out the paid version, called ChatGPT-4."

It also rings true for Justin Pasher:

I've typically gone to ChatGPT for some more obscure technical problems that I struggle finding meaningful answers to on Google. … If you don't understand the nuances of the concept you are dealing with, you probably can't figure out how to fix little things that are wrong. Your best bet is just asking again and seeing if it can fix it.

…

For example, I was asking questions about using some PowerShell commands to do something. It kept giving me commands (which were valid) with parameters that were not. I had to keep correcting it by saying, "Command X doesn't support the Y parameter." It would apologize, then continue to give answers that simply did not work.

But a 50:50 chance ain’t bad — right? Wrong, thinks Anubis IV:

Close enough is fine in horseshoes and hand grenades, but not when providing authoritative answers for StackOverflow. Confidently wrong 50% of the time is unacceptable, and any human with that track record would know to shut up and listen.

…

Iterating with it is an exercise in frustration. It apologizes and then regresses, then apologizes and introduces syntax from another language, then apologizes and rewrites the code completely. A better way of thinking of it is that when it’s wrong it’ll likely never get it right eventually, or if it does the effort far outstrips any benefits.

And then there’s the problem of spotting the errors. vba616 digs deeper:

I have a recent example of Google getting something trivial wrong and ChatGPT getting it right. But oddly that doesn't change my opinion of ChatGPT: The only reason I know that ChatGPT was right is because I blindly followed the first Googled suggestion.

Two can play at that game. Richard 12 plays at that game, too:

In my experience, when prompted in subjects I know well, ChatGPT produces somewhat- to very-wrong answers most of the time. Whether right, inaccurate or completely wrong, it produces equally confident language. It is therefore extremely likely to produce confidently wrong answers in subjects that I do not know.

…

I cannot tell whether the text it is spewing is approximately correct, dangerously wrong, or merely somewhat inaccurate. Therefore it is clearly worse than useless.

Meanwhile, RecycledEle steps back for a wider perspective:

I can remember a few times when a computer program getting 48% of coding questions right was science fiction.

And Finally:

Testing Turing



You have been reading Secure Software Blogwatch by Richi Jennings. Richi curates the best bloggy bits, finest forums, and weirdest websites … so you don’t have to. Hate mail may be directed to @RiCHi, @richij or ssbw@richi.uk. Ask your doctor before reading. Your mileage may vary. Past performance is no guarantee of future results. Do not stare into laser with remaining eye. E&OE. 30.

Image sauce: KOBU (via Unsplash; leveled and cropped)

Keep learning

  • Get up to speed on the state of software security with RL's Software Supply Chain Security Report 2026. Plus: See the webinar discussing the findings.
  • Learn why binary analysis is a must-have in the Gartner® CISO Playbook for Commercial Software Supply Chain Security.
  • Take action on securing AI/ML with our report: AI Is the Supply Chain. Plus: See RL's research on nullifAI and watch how RL discovered the novel threat.
  • Get the report: Go Beyond the SBOM. Plus: See the CycloneDX xBOM webinar.

Explore RL's Spectra suite: Spectra Assure for software supply chain security, Spectra Detect for scalable file analysis, Spectra Analyze for malware analysis and threat hunting, and Spectra Intelligence for reputation data and intelligence.

Tags: Dev & DevSecOps
