
The Threat of Package Hallucinations
EPISODE TRANSCRIPT
PAUL ROBERTS: [00:00:00] Hey everybody and welcome back to another episode of ConversingLabs. This is a regular podcast from the team here at ReversingLabs where we dig into the latest developments in malware, threat analysis, and software supply chain security risks. I'm excited to have in the studio with me Major Joe Spracklen, who is a PhD student in computer science at the University of Texas at San Antonio.
And the lead author of a research paper that was just published in March titled "We [00:01:00] Have a Package for You! A Comprehensive Analysis of Package Hallucinations by Code Generating Large Language Models," LLMs for short. And so Joe and I are gonna talk about the study that he and his colleagues conducted to assess the tendency of large language model AI to hallucinate software packages as dependencies in AI generated code, and of course, how that might fuel efforts by malicious actors who want to infiltrate software supply chains.
Joe, welcome to ConversingLabs.
MAJOR JOE SPRACKLEN: Hi. Thanks for having me. Good to be here.
PAUL ROBERTS: Good to have you. And we're really interested in talking with you about this study and learning a little bit more about what you all learned. Before we start, Joe, could you just tell our listeners a little bit more about yourself? You know, you're a Major in the US military, but right now you're a full-time PhD student, so some of the things you're working on there as a PhD student.
MAJOR JOE SPRACKLEN: Yeah, I'm Joe Spracklen. I'm a second [00:02:00] year PhD student at the University of Texas at San Antonio studying computer science.
And specifically my research is focused on hallucinations in code generating large language models, which is something I just stumbled upon and we attacked it full force once we realized how cool of an idea it was.
And we knew that other people would not be far behind us. So it came together quickly, and it wasn't my intended field of study. That's just the way it goes, and I've gone with the flow so far. But yeah, and as you mentioned, I'm also an active duty Major in the United States Army. I've spent the previous eight years or so at Army Cyber Command before this in a variety of roles, from the execution phase, where I was a team lead of an incident response team, up to a company commander, where I was in charge of about 150 soldiers and four cyber protection teams. And I switched career paths after that point and moved into more of a data science role, where most recently I was a deputy director of assessments, where [00:03:00] our job was using data to quantify how well we were accomplishing our objectives in cyberspace, both offense and defense, which was really cool and a rewarding job.
And now I'm a full-time student, and following this, I'll be joining the faculty at the United States Military Academy at West Point, in their computer science department.
PAUL ROBERTS: Very cool. So talk just a little, when you were in the military and working at Cyber Command, was large language model AI and AI generated code on your radar at all? Was this something that you knew you wanted to study as a PhD student? Or is it just that that's where everything is happening right now? Fish where the fish are, so to speak.
MAJOR JOE SPRACKLEN: It seems like 90% of research now, especially cybersecurity research, is in this intersection of AI and cybersecurity.
PAUL ROBERTS: Sure.
MAJOR JOE SPRACKLEN: It's a very dense field. Of course, when I left Army Cyber Command, ChatGPT was just starting to become pretty proficient at coding, although it wasn't allowed on our networks. And so if I was doing any kind of coding at the office, I would code some stuff up at home, not [00:04:00] sensitive information, but code some stuff up at home and then try to remember it and bring it with me to the building the next day.
I think they've integrated it now. They have some approved solutions for our teams inside sensitive areas. And you have to, because at the speed that these things, these large language models, allow you to work now, if you're not utilizing them, you're just behind the game.
You're behind the curve. You have to. But yeah, large language models in terms of research, this was not my intended field of study. Again, there are two types of PhD students I've found: ones who have a complete roadmap of exactly what they wanna do, all laid out when they get there, and they go off and do it.
And ones that have a general idea and figure it out along the way. And I was definitely more in the latter camp. This route has just fallen into my lap and I love it. I really love the research that my collaborators and I have fallen into.
PAUL ROBERTS: As we mentioned, your new paper is called "We Have a Package for You! A Comprehensive Analysis of Package Hallucinations by Code Generating LLMs." Couple questions. [00:05:00] First of all, you said this kind of came about by accident. What put this on your radar, and that of your advisor, this phenomenon of package hallucinations?
MAJOR JOE SPRACKLEN: I knew I had some sort of broad interest in the software supply chain just from previous experience with Army Cyber Command. It was a direction I was homing in on, and it was in my first semester as a PhD student, so I was having regular meetings with my advisor, and I'd talk about articles I'd read and we would brainstorm ideas. And actually my advisor and I have gone back and forth on who first suggested this or who brought this up first, but it just came about organically in one of our meetings. And I remember we both had a look on our face.
That's a great idea. What if someone were to weaponize these hallucinations, these package hallucinations that the models make? We had been talking about this off and on for a period of time, and I'd talked to other people about the mistakes the coding models had been making, saying, yeah, they've even suggested packages that don't even exist.
And we [00:06:00] laugh it off. Oh, that's kind of a silly mistake. And then we were like, wait a second. What if they did exist? Then, okay, now we need to quantify the problem. Now we need to do this kind of rigorous, comprehensive experiment to learn more about them. How often do they happen? What causes them? Can you prevent them? Things like that, and that's how we set up the framework for the experiment from there.
PAUL ROBERTS: Okay. So for our listeners could you just define what constitutes a package hallucination? As I see it, this seems pretty similar to other types of AI hallucinations, right?
You read about the lawyers, the attorneys, who use AI to generate some filing with the court and it just makes up a whole bunch of precedents that don't really exist. Is it basically the same behavior, but just in the context of code writing?
MAJOR JOE SPRACKLEN: Man, I could talk about this for a while.
I'll try to distill it down. So the phenomenon of hallucinations is a widely known thing at this point. These AI models are prone to making mistakes, and we typically call those mistakes [00:07:00] hallucinations, although that word has grown from what it started as. It used to mean only things that the model made up, things that are completely fictitious, invented out of nowhere. Now we use it to mean kind of any type of error that the model makes; it's melded into a more general term. But the models make mistakes.
We've known this for a while. It's part of the basic nature of the models that they're not deterministic, which is a desirable property. We don't want them to be; given the same prompt, we want them to generate different output every time. They have this kind of stochastic nature where they're sampling from a distribution, which is desirable and creates this rich, creative output.
But along with that, there are two basic sources of hallucinations. One is uncertainty-based hallucination, where the model either legitimately doesn't know the answer or it's confused about the question, and so it has to infer some things and it has to give an [00:08:00] answer.
The second is what we call sampling errors. As it's sampling tokens from this distribution during generation, it may select low probability tokens, or tokens in a certain sequence, such that it backs itself into a corner and there's no plausible way forward other than to create something, to make something up.
It can back itself into a corner through the choices it's made through sampling. And that's a very deep topic. I'll stay at the kind of high level there. And for the most part, package hallucinations we feel are the same with a little bit of an exception.
We do think there is some additional mechanism. If you think about normal natural language generation, even if you pick some sort of obscure or creative word, there's still a way to linguistically bring it back and save it. When you're talking about code, and specifically packages,
it is a very specific string [00:09:00] and there's no room for error. And there are often non-normal tokens or strings that are put together. So imagine you're in Python and you think of NumPy. That's not a real word, right? That's only a word in coding.
That's only a word in the Python world. Or you think of something like TensorFlow, and then you even talk about versions, version numbers, underscores, special characters. It's a very complicated sequence for the model to get exactly right.
There's no room for error. And so we think these package hallucinations are even more error prone than your typical language generation, because it has to be so specific, and there is a certain point, once it's generated a certain string of tokens, where there's nowhere for it to go, where no possible set of continuations could lead to a valid response.
That was a really long-winded way of saying, largely they do behave like normal hallucinations with kind of an added complexity and difficulty factor. [00:10:00]
PAUL ROBERTS: Because of the specific context of code that needs to be executable code, right? That you need to then be able to-
MAJOR JOE SPRACKLEN: Yeah. Coding is, it's very rigid. It has to adhere to these syntax rules. It's very structured, right?
PAUL ROBERTS: Yeah.
MAJOR JOE SPRACKLEN: To get back, I think I didn't even answer the top-line question, which is: what is a package hallucination? As we mentioned, the models tend to invent fictitious things to satisfy the question they were asked. A package hallucination occurs when the model recommends, or includes in the code, a reference to a package which doesn't actually exist. And this is brought about by our reliance on open source repositories. Especially in languages like Python and JavaScript, we rely on these open source repositories to manage our software dependencies, our package dependencies, and for easy downloading and distribution.
So if the model creates a package that doesn't exist, you might think that's no big deal. You type it in and it just says [00:11:00] error, package not found, and that's not a big deal. But we thought, from an adversarial mindset: what if someone proactively registered it?
They were able to figure out, hey, this model is continually hallucinating the same types of things. They proactively register a package, again on an open source repository where it's anonymous and anyone can upload without a bunch of rigorous oversight. Then they lace that package with malware of their choosing.
Now an unsuspecting user who's blindly trusting the large language model gets this package recommended to them, and if they don't do their due diligence, they might download this package, and then they've infected their machine. And that's really the simplicity and the brilliance of this specific type of attack: it's very direct, it's very simple, and it literally takes one single command to infect your machine.
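To illustrate the mechanism Joe describes, here is a minimal, benign sketch (not from the paper): a source package's setup.py is ordinary Python that setuptools executes at install time, so whoever registers a hallucinated name controls what runs on the installing machine. The package name below is a made-up placeholder.

```python
# Benign illustration only: anything at module level in setup.py executes
# during a source install ("pip install <name>" on an sdist), which is why a
# single install command is enough for an attacker who owns the package name.
from setuptools import setup

print("setup.py runs arbitrary code at install time")  # stand-in for a payload

setup(
    name="hallucinated-example-pkg",  # hypothetical name an LLM might invent
    version="0.0.1",
    description="Illustration of install-time code execution; no real payload.",
)
```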
PAUL ROBERTS: And it removes one of the big challenges with package confusion attacks, which is that historically, traditionally, the malicious actors [00:12:00] had to get the developer to actually go and obtain that package, pull it into their application.
Whether that was typosquatting or just standing up a discreet malicious package, you had to dress it up like something that was legitimate and then promote it in ways that made developers aware of it. That was a big obstacle.
But if you know that their code is looking for a package that doesn't exist and it's gonna pull it in, if you just create it, that just removes that whole very difficult part of the attack chain from needing to happen.
MAJOR JOE SPRACKLEN: Exactly. And as you mentioned, typosquatting has been around for a while.
This has been a known attack vector. The trick has always been: how do we get people to download these packages? Typosquatting is where you register a package name that is very similar to how you would type a legitimate package name, but this one is laced with malware.
In that case, you're relying on a user mistake mistyping something, and that's a low hit rate, [00:13:00] right?
PAUL ROBERTS: Yeah.
MAJOR JOE SPRACKLEN: But now we have these coding models that are becoming more and more trusted. They're getting better and better at coding, and so you're becoming less and less skeptical of the things that they produce. And as the trust builds, now you have this trusted agent constantly recommending you do this thing. And you take down your safety nets a little bit. You don't pay as close attention as you maybe should.
PAUL ROBERTS: That's right.
MAJOR JOE SPRACKLEN: And it's really that simple. Yeah.
PAUL ROBERTS: Are you fact checking everything that ChatGPT is writing for you? Or are you like, yeah, that looks pretty good?
MAJOR JOE SPRACKLEN: Yeah, move it. Move it.
PAUL ROBERTS: Yeah. Which is the same thing. It's the same dynamic. One question I had reading your paper, which you did talk about a little bit: from a malicious supply chain actor's standpoint, how would they go about figuring out what hallucinated packages these LLMs are creating, and therefore which to stand up, which to create and put on these open source repositories? Where would they be looking for that information?
MAJOR JOE SPRACKLEN: Yeah. Honestly, this is one of the reasons we were a bit [00:14:00] hesitant to share our code and to share our GitHub repo where the testing code is, because they could literally take that.
A lot of people have asked for it. We have a huge list of hallucinated package names that we generated. We did not make that public just because it could be directly used in a malicious way. We have shared that with people who we know are verified researchers for research purposes, but we keep that private.
But it will take some experimentation on the adversary's part, and you could use our code to do that, to run a comprehensive set of tests, and you check the package names that are generated against the published package names. So whether it's more of a manual experience, where you're just coding and you happen to notice a package hallucination that occurred organically while you're maybe doing some legitimate project, or you do some sort of rigorous experiment where you at [00:15:00] scale iteratively probe the model with various prompts and see what was generated and compile a list of hallucinated packages that you want to try to exploit. But yeah, it will take some manual analysis or some sort of manual work to discover them; that is the hardest part for the adversary in this scenario.
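As a rough sketch of the check Joe describes, comparing generated package names against what is actually published, one could query the registry's public API. This is illustrative only, not the researchers' released code, and the names in the list are hypothetical.

```python
# Check which package names pulled out of LLM-generated code actually resolve
# on PyPI; names that return 404 are candidate hallucinations.
import urllib.error
import urllib.request

def is_on_pypi(name: str) -> bool:
    """Return True if PyPI's JSON API has a project registered under this name."""
    url = f"https://pypi.org/pypi/{name}/json"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False  # unregistered name, i.e. a candidate hallucination
        raise

generated_names = ["requests", "numpy", "definitely-not-a-real-pkg"]  # illustrative
print("Candidate hallucinations:",
      [n for n in generated_names if not is_on_pypi(n)])
```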
PAUL ROBERTS: Okay. Let's talk about the study that you did and your approach to tackling this pretty fuzzy problem, which is challenging from a researcher standpoint. So you selected 16 popular large language models, both open source LLMs as well as closed source commercial ones, ChatGPT and so on.
And you used 'em to generate, I think, half a million code samples? And looked in those for package hallucinations, in particular in the two languages that you focused on, Python and JavaScript. So just talk about that structure: which large language models you used, the code samples you generated, and the choice [00:16:00] of which development languages to focus on.
What was the thinking?
MAJOR JOE SPRACKLEN: So first off, I think I'll start with, we started this research a while ago.
This was back in February of 2024 when we started, and actually the first version of this was published in June of 2024. It's been updated and revised as we've applied to different conferences to get it published. And this paper will be appearing at USENIX Security 2025 here in a couple months, which we're very excited about.
PAUL ROBERTS: Very cool. Yeah.
MAJOR JOE SPRACKLEN: At the time, the models we picked were the best performing models that were available. We didn't shy away from anything, and we wanted to give a really honest and true evaluation of how big of an issue, how prevalent, these package hallucinations were, given the best models at the time.
Some criticism I've gotten recently has been that these models are really old, and even 15 months or 16 months old is an eternity-
PAUL ROBERTS: In AI terminology, which is like more than a month old.
MAJOR JOE SPRACKLEN: No doubt, using the best models today, I think these hallucination rates would be cut down [00:17:00] pretty significantly, just given the advances in the last year plus. At the time, we wanted to give a fair share between open source models and commercial models. For the commercial series we chose, ChatGPT was dominating the scene at that time, so we sampled three different GPT models. We paid several thousand dollars to ping their API a bunch. And we chose 13 open source models; we wanted a nice representation across model sizes. The smallest one we tested had 1 billion parameters, all the way up to 16 billion and 32 billion parameters. Yeah, we had a couple of 32s. So we went with a wide range of model sizes. We did DeepSeek, Code Llama, WizardCoder, Mixtral, I'm probably missing one in there. But yeah, we had a wide variety of vendors and sizes, and we just wanted to really give an honest and comprehensive look. We chose Python and JavaScript because those represent the two largest open source package repository ecosystems.
I know npm has around 5 [00:18:00] million packages and PyPI has about 500,000, somewhere in there, but they're the two largest by a fair margin, and the two most popular programming languages; I believe they're one and two. Of course, you can measure that a variety of ways, but undoubtedly they're two of the most popular programming languages, however you cut it. So we felt they were two easy choices to make for a representative sample.
PAUL ROBERTS: What you found in this research is actually really interesting, and I'll just call out a couple of data points here. One of the big discoveries: in the sample population that you researched back in 2024, there was a pretty notable difference between the performance of closed source, commercial large language models and open source models when it came to hallucinating packages. For the closed source commercial large language models, you had around a 5.2% [00:19:00] hallucination rate. But for the open source models you studied, it was more than four times that, 21.7%. Did that surprise you? And what can we attribute that to?
MAJOR JOE SPRACKLEN: Actually, I wasn't too surprised. I did suspect that the GPT models were gonna be way, way better. Of course, the easiest thing to say is that the GPT models, being proprietary, closed source models, we don't know how large they are in terms of parameter sizes and complexity and transformer layers and all those things.
We do know, by any account, they're 10 times larger than the other models we tested. So just from the parameter counts, just from the parametric knowledge inside the models, you'd expect they're gonna be way, way better. They just have much more resources to work with and computational power backing them to get better results. So that wasn't a surprise. But even at 5%, if you think about the scale and how many people are using these, well, a small percent of a large number is still a large [00:20:00] number, as they say. So that number actually was a bit larger than I expected.
I thought we were gonna find maybe 1 or 2%. So I wasn't surprised by the disparity between the open source and the closed source, but I was surprised by the overall findings. That 20% and 5% were a bit larger than I expected.
PAUL ROBERTS: Even though, as you point out in your paper, with ChatGPT the number you had, 5%, was much, much lower than some other published research that suggested there was a much higher hallucination rate with ChatGPT, I think it was 20-something percent.
They performed a lot better for you, is that just how you guys did your research versus-
MAJOR JOE SPRACKLEN: There were a couple of blog posts before we put this research out. Being a blog post, I think he did a great job, but there wasn't the same type of rigorous experimentation behind it.
And also, being a blog, it's not meant to be [00:21:00] reproducible in terms of how he did the experiments. So it's actually not clear how he prompted the models, what dataset he was using, and what he identified as a hallucination. So there was clearly something different in terms of methodology for him to get such a larger number, and we're not quite sure where the disparity comes from.
PAUL ROBERTS: Maybe just a smaller sample size, right? Yeah.
MAJOR JOE SPRACKLEN: Could be. Yeah. And I don't wanna comment too much on somebody else's work and speculate. But of course, our data set was so large, and we gave such a wide variety of prompts and did such a rigorous collection of data, that we're pretty confident with our results.
PAUL ROBERTS: Just describe briefly, when you generated the code that you then analyzed for hallucinated packages, I would assume some of the folks listening are probably asking this in their head: how did you decide what code to generate? I know, 'cause I've read your paper, and it's actually interesting, the approach you took, but explain to them: how did you decide, here's the code that we're gonna generate and use as a basis for assessing the package hallucination [00:22:00] phenomenon?
MAJOR JOE SPRACKLEN: Yeah, so we actually went with two different data sets, with two distinct purposes. The obvious one is we wanted to have real questions that people might actually, feasibly, ask a large language model. And so we hit up Stack Overflow, which was everyone's favorite resource before coding models became so good.
So we scraped a bunch of popular coding questions from Stack Overflow, specifically regarding Python and JavaScript. And just asked the model. So that was one easy win. The second thing we wanted to make sure we did is we wanted to make sure we had a really broad range of questions.
We really wanted to cover any possible topic you might ask the model to code for you. Any question, we wanted to have that covered. So we took a bit of a hybrid approach: we first scraped the top 5,000 most popular packages from PyPI and npm. And they all have a kind of official [00:23:00] description that describes what the package does in one sentence. We took that sentence, fed it to an LLM, and said, hey, generate coding questions that might accomplish this goal, using the package description.
PAUL ROBERTS: Yeah. Based on these descriptions, right?
MAJOR JOE SPRACKLEN: Yeah. And I, first, I wasn't sure that was gonna work. I was actually amazed at how well it worked. We got this really rich data set of novel coding questions. Hey, generate code that does this, generate code that does this, and they're all very well formatted.
It was awesome. So that accomplished the second goal: we wanted to have a really broad and comprehensive view of anything someone might ask a coding model. And, just to get back to the original results, I do think maybe our results, because of that approach, were slightly inflated. Because if you ask a coding model a data science question in Python, for example, it's gonna have no problem giving you Pandas and NumPy and [00:24:00] Matplotlib all day. It'll give you those without error almost a hundred percent of the time. Nothing's a hundred percent, but almost a hundred percent of the time.
It's really when you start to get into the tails of the distribution, into the more obscure coding questions where it doesn't have as much training data to rely on, that's where you start to get into trouble. And we really wanted to put stress on the tails of the distribution and flesh out some of those issues.
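A loose sketch of the hybrid dataset step described above, under stated assumptions: the package summaries are illustrative stand-ins for the one-line descriptions scraped from PyPI and npm, and query_llm() is a hypothetical placeholder for whatever chat-completion client was actually used.

```python
# Turn each package's one-line summary into a prompt asking an LLM to write a
# realistic coding question; collecting these across thousands of packages
# stresses the tails of the topic distribution.
def query_llm(prompt: str) -> str:
    # Placeholder so the sketch runs end to end; a real run would call an LLM API.
    return f"[generated coding question for: {prompt[:60]}...]"

package_summaries = {
    "requests": "Python HTTP for Humans.",          # illustrative description
    "beautifulsoup4": "Screen-scraping library.",   # illustrative description
}

questions = []
for name, summary in package_summaries.items():
    prompt = (
        "Write a realistic coding question a developer might ask an assistant, "
        f"where the task is to accomplish the following: {summary}"
    )
    questions.append(query_llm(prompt))

for q in questions:
    print(q)
```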
PAUL ROBERTS: Really interesting. So one of the things that you all noticed when you started digging into what factors seemed to influence the likelihood of hallucinating a package was temperature settings. If you're not familiar, temperature is a measure of creativity, the creative freedom that you're giving to the AI model to diverge from the status quo, from what's normal, and take some risks. And you found that the higher the temperature you allow the AI model, the more that seemed to correlate with package hallucinations. The commercial [00:25:00] models seem to have a ceiling for how high you can turn the temperature; the open source models have a higher ceiling. Can you talk about that and what might explain it? It intuitively makes sense, I don't think anyone's gonna be surprised to hear it, but is there anything we can learn from it, I guess?
MAJOR JOE SPRACKLEN: In this setting, I'm not sure how instructive it is, because I think, as we're talking about the structure of code, you really don't want the model to be too creative, and that's what temperature does. Temperature essentially flattens the distribution. It makes it more uniform; it makes it more likely that the low probability tokens are gonna be selected.
And with coding, again, with a rigid structure where you need certain words in certain specific places, you don't necessarily want that. So I don't think it's a surprise that the temperature setting really messed up the results. And again, as you got to more extreme temperatures, especially with the open source models, they actually started to generate more hallucinations than [00:26:00] correct packages. It was over 50%. And the OpenAI models won't let you set the temperature above 2, smartly, because they don't want their models to go off the rails when they're public facing. It's bad for business.
PAUL ROBERTS: We've all seen what that looks like and it's not pretty.
MAJOR JOE SPRACKLEN: Yeah. You don't want that to get into some meme that goes viral. It's bad for business. The open source models you can push higher, and we did, we pushed those all the way to 5. I think you're capped at 2 for the GPT series. But yeah, not surprising, honestly. And thinking about it, going back to what we were talking about earlier with package hallucinations specifically: you need a very specific string in a very specific order, and if you don't follow that exact pattern, then there's no way for the model to recover.
You can't linguistically find a creative path that makes sense. It's either the right string or it's the wrong string. And so if you're sampling these low probability tokens, oftentimes you create a string that has no chance of leading to [00:27:00] some sort of valid package name.
And that started to happen more and more often as we jack the temperature up.
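A small numeric illustration, not from the paper, of the effect Joe describes: logits are divided by the temperature before the softmax, so higher temperatures flatten the next-token distribution and give low probability tokens, such as pieces of an invalid package name, more weight. The logit values below are made up.

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Convert raw logits into a probability distribution at a given temperature."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()          # subtract the max for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

logits = [4.0, 2.0, 0.5, 0.1]       # hypothetical scores for four candidate tokens
for t in (0.7, 1.0, 2.0, 5.0):
    print(f"temperature={t}: {np.round(softmax_with_temperature(logits, t), 3)}")
```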
PAUL ROBERTS: Package hallucinations, we've been talking mostly about just invented dependencies, package names that don't exist. Are there other types of hallucinations that you saw in the AI generated code, or is it pretty much just the packages?
MAJOR JOE SPRACKLEN: I saw them, certainly; we didn't test rigorously for them. But again, this goes back to: is any error a hallucination? Most hierarchies that have been put out there in academic settings ask, do you consider a syntax error to be a type of hallucination, or just a logical error, where the code is valid but it doesn't solve the problem in the right way?
Is that a hallucination, or is that just a skill issue, as they might say, for the model? It's a blur, it's a gray area. But yeah, all the time I get code, even now, even from GPT, that has errors in it; it doesn't quite do what I want it to do. Although GPT doesn't have as many syntax errors, especially lately, that [00:28:00] I've noticed.
But lesser models do still spit out syntax errors where the code just won't even run or compile, which, yeah, I'd expect to be a problem. It's probably not too far away from being solved, even if it's just through brute force, just continuing to scale up the size and complexity of the models, which, again, I'm not sure how long that can continue, but...
PAUL ROBERTS: It's expensive and resource intensive. Yes.
MAJOR JOE SPRACKLEN: I think OpenAI is mentioning a hundred million dollars for a training cycle for their new models. I don't know how much more we can keep increasing just the size of the models, but...
PAUL ROBERTS: So your paper talks about things to do to address the risks posed by package hallucinations.
And you make this distinction between pre-generation and post-generation, which you might think of as proactive versus reactive. I think [00:29:00] post-generation would mean: how do we search for and find hallucinated packages and remove them? And you're saying that's probably not the best use of your time and effort.
Much better to prevent those hallucinations from happening in the first place, pre-generation. Can you talk about some of the things that you researched and found might be effective at preventing the creation of hallucinated packages in the first place?
MAJOR JOE SPRACKLEN: Yeah, I think it's a really important point to talk about.
I think most people, when I present this problem to 'em, say, oh, that's no big deal, especially with retrieval augmented generation really becoming popular now, where the model has access to some external source of knowledge, whether it's reaching out to the internet or reaching out to an internal database.
It can use external knowledge to enrich its response and give it some sort of factual grounding before it responds. They say, oh yeah, you just have the model reach out to the internet to verify that the package exists. And then I point out the really kind of insidious thing [00:30:00] about that approach. You can call it a whitelist approach, where you have some list of approved packages.
Now, where do you get that list? Do you scrape it straight from PyPI or npm and just have this list of published package names? What if the adversary has already published their package?
And the malicious package name is already on the whitelist. So your model goes out and checks: oh, yep, that package exists. And that's a key distinction. When we were detecting hallucinations, we were assuming that no one had published any of these packages yet. It's possible there were already hallucinated package names that had been registered, and we were checking against that list.
So yeah, once someone publishes the package, it appears that it's valid, unless you have some way to detect that the code itself contains malware, through static analysis or dynamic analysis or whatever it is. So that's kind of the first insidious thing that isn't obvious from the start. It's not just: whitelist the package names and
[00:31:00] problem solved. Not quite.
PAUL ROBERTS: That's right. Unless you've looked at all of those whitelisted packages and made sure that they're not malicious.
MAJOR JOE SPRACKLEN: And to be clear, larger organizations especially realized this was a huge problem a while ago, and there are entire companies whose whole business model is vetting the packages that are allowed to be imported into these organizations. They actually scan the code. They have some advanced metrics that they use before they allow a package in, and many organizations, I'm sure, have internal repositories where they host packages, and they only allow packages into this internal repository that have been vetted by one or more companies or systems that they use to verify these things. As you alluded to, there are two ways. You have the post-generation method, which is somehow filtering the packages that have been generated; that's imperfect, it's an ongoing [00:32:00] area of research. The ideal way to do it is to influence the model before generation, so you change the actual behavior of the model by modifying the weights, some sort of fine tuning; retrieval augmented generation also works for this. So we tried methods of pre-generation mitigation. We did fine tuning, we did retrieval augmented generation. We also did a method that we called self detection. This will be a quick tangent, but one of the really interesting things we found during the study was that several of the models were really good at detecting their own hallucinations.
This is probably the most surprising thing that we found throughout the whole research. So if we did a separate experiment after we did the main results on, in quantifying the rate of hallucination, we did the secondary test where if we detected a hallucinated package, we turned right around and re-prompted the model with the package name that we just detected.
We'd say, hey, is this a valid [00:33:00] package? Very simple and direct. And in the case of ChatGPT and DeepSeek, we found that over 80% of the time they were able to recognize their own hallucination. That is to say, if asked, is this a valid package, they would say, no, that is not a valid package, even though the model had just created it a moment before.
So we were really startled by that. It reinforced the idea of sampling errors, where the model has an idea of right and wrong, but when it's forced to answer, it can often pin itself into a corner where there's no valid way out besides inventing something. So even if the model knows it's wrong, it's forced to give you an answer.
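A minimal sketch of the self-detection check described here: when a generated name is flagged as a hallucination, re-prompt the same model and ask whether the package is real. ask_model() is a hypothetical stand-in for an actual chat API call, and the package name is a made-up example.

```python
def ask_model(prompt: str) -> str:
    # Placeholder response so the sketch runs; swap in a real model call here.
    return "No, that is not a valid package."

def model_flags_own_hallucination(package_name: str) -> bool:
    """Return True if the model, asked directly, says the package name is not real."""
    answer = ask_model(
        f'Is "{package_name}" a valid, published package on PyPI? Answer yes or no.'
    )
    return answer.strip().lower().startswith("no")

print(model_flags_own_hallucination("fastjson-utils"))  # hypothetical hallucinated name
```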
I think one of the best ways we could possibly improve LLMs is finding a way to inform the user of some sort of uncertainty: hey, I'm uncertain about this answer, or, I'm a bit confused about the question you asked, can you restate your [00:34:00] question? Or, here's my answer, but maybe double check it, 'cause it doesn't seem right. I know that, again, the big companies don't want to express that uncertainty and doubt to the users; that's, again, bad for business. But I think you need to find some sort of happy medium there, where the user can be more informed about what's going on inside the model. Because oftentimes the models have a great idea of right and wrong, kind of intrinsically.
PAUL ROBERTS: Okay. Flipping the question around: when you looked at the hallucinated packages, were there clear heuristics or characteristics of hallucinated packages that seemed pretty consistent, that you might use to filter them out of a list of legitimate and hallucinated packages, or not so much?
MAJOR JOE SPRACKLEN: Actually, no, no clear pattern. The one thing I will say that stood out about the package names is the sheer breadth of unique names that were generated. I think over 205,000 [00:35:00] unique examples of package names that were completely fictitious. I thought it was going to be more concentrated, maybe a smaller set that just got generated a lot, but we really found a huge breadth of package names. Some were repeatedly generated, but I wasn't expecting the sheer range of responses that we got. Other than that, nothing really stood out. The one thing that did stand out, and I was a bit worried until I got this test done: I thought a huge number of these hallucinations were going to be traceable back to packages that existed in the models' training data and were then deleted or removed from the repository. And I was like, man, we've got great results so far, but it would be a killer if 90% of these packages were just things that were deleted. And we found it was actually a very minuscule percentage, like 0.7%. Only 0.7% of the hallucinated packages used to exist and were subsequently [00:36:00] deleted, so their presence in the model's training data was a very insignificant factor in the hallucinations.
PAUL ROBERTS: Obviously, AI and large language models are playing a huge role in code creation. Anywhere from 30 to 50 percent of code is AI generated, depending on who you ask. And the use of AI by developers is even higher, right? It's in the 90 percent range. Given that AI tools obviously greatly enhance code creation and speed development, but also come with risks, what suggestions or recommendations do you have for developers who are out there as we speak, using AI to create code, to have a critical eye in a way that will help them identify hallucinations and other things that might pose a security risk?
MAJOR JOE SPRACKLEN: Yeah, I think one thing that's obvious is these models are wonderful. They're incredible guides. They have, in such a short time, changed my entire workflow. They've changed how I do business, and that's [00:37:00] true for a lot of people.
So they're here to stay. They've already integrated themselves into our lives in almost an irreplaceable way, and I don't see that going away. Our research definitely doesn't say, Hey, don't use these models, they're unsafe. Just urging caution. Continue to be skeptical.
Continue to stay vigilant. I like to compare it to a self-driving car: they're great navigators, but you need to have one hand on the wheel. Hands on the wheel, yeah. You need to be focused. They do a great job almost all the time, but when they don't, you need to be ready, because there are potential weaknesses.
And I think it's really interesting. I would consider package hallucinations to be one of the non-obvious, kind of second or third order effects of AI that are opening up new vulnerabilities that we really hadn't considered before. And I think there are many examples of that, and they're gonna continue to show themselves, so stay alert.
And stay vigilant. But I think overall, [00:38:00] AI in general at this point is undoubtedly a net plus in all of our lives, and I hope it continues to stay that way.
PAUL ROBERTS: Let's hope so. Joe, last question. What's next? What other questions are out there that you're looking to answer? Where does this research go next?
MAJOR JOE SPRACKLEN: Yeah, I have two big projects going now. I've zoomed out a little bit and taken on the problem of general hallucinations in code generating models. One of my projects is looking at analyzing the internal state of the models, the kind of hidden activations that go on while it's generating, and seeing if we can predict the hallucinations before they actually happen.
This is the entire scope, including context errors or logical errors, seeing if we can predict a mistake before it happens. This draws back to what I mentioned a short time ago, this kind of self-detection capability. Oftentimes the models know when something's not quite right, and we're seeing if we can detect that before it happens and even go a step further [00:39:00] and specifically identify it.
PAUL ROBERTS: Build that into the model, right?
MAJOR JOE SPRACKLEN: That'd be great, that'd be a great next step. We would like to incorporate some sort of feedback mechanism. We can already detect uncertainty versus sampling errors, and I've referenced this a few times: in the case of uncertainty, it would be pretty low cost for the model to say, hey, I don't understand your question, can you restate it or maybe ask it a different way? That would cut down hallucinations quite a bit. And then for sampling errors, if the model detects something, whether it's through entropy methods or some sort of information theory approach, there are different ways to detect that the model hasn't got something quite right.
And so if you're able to detect that, and this one's a bit more expensive, you can even have the model regenerate its output and just say, hey, try again. Maybe we'll find a different path this time. So that's the overall goal. First off, we have to reliably predict them, and then there are ways to build that into a product.
And the second [00:40:00] big project I'm working on is this idea of a vulnerability direction inside the LLM. Other research has shown these large language models are able to internalize and understand abstract concepts, including things like vulnerability. So we have detected what we feel is a direction in this high dimensional vector space that encodes vulnerability, such that if you nudge the model in one direction on this vector, you generate more vulnerable code, and if you nudge it in the other direction, you generate more secure code. We're ironing out some wrinkles with that. It's a fairly complicated problem, but it's pretty cool to know that you can nudge the models in certain directions in this high dimensional space and influence the way that they generate; this has been shown for other concepts as well.
PAUL ROBERTS: Yeah, I mean, it's interesting. We're in these early days. Even [00:41:00] two or three years ago, right, the quality of code generated by AI models was just not reliable enough. But we've recently crossed that line into, yeah, this is actually functional, usable code.
And so it's like this kind of frenzy, this greenfield of, holy crap. But the types of things you're talking about, how do we put checks and quality and security evaluations on this amazing looking code that's being generated before we implement it, are not as far along.
MAJOR JOE SPRACKLEN: And it is amazing. Yeah. But I think we've sprinted up this capability curve, it emerged so fast, and I think it's healthy that we're maybe slowing down a bit and taking a step back and saying, okay, these are great. Now how can we build in the security?
How can we make them better for the user? How can we increase the quality of life to get the most out of them without having to worry about these second and third order effects? The third project that I'm actually just starting up is digging back into package hallucinations, and digging [00:42:00] back into mitigation.
So we're looking at some more advanced methods of mitigating without decreasing the code quality. One of the big findings of our paper was that fine tuning was extremely effective at mitigating the package hallucinations, but it came at a substantial decrease in the quality of the code generated. So now we're trying to decrease the package hallucinations while keeping the code quality high.
PAUL ROBERTS: Toning down that creativity. Yeah.
MAJOR JOE SPRACKLEN: Yeah. That's gonna be, one part of it for sure.
PAUL ROBERTS: Major Joe Spracklen, PhD student in computer science at the University of Texas at San Antonio. Thank you so much for coming in and talking to us about your research on AI code hallucinations, and I'm really interested to see what is coming up next.
We'd love to have you back on the ConversingLabs podcast.
MAJOR JOE SPRACKLEN: Thanks so much for having me. I had a great time. Thank you. Yeah, it was great.
PAUL ROBERTS: Great. Thanks for taking the time to talk to us.
MAJOR JOE SPRACKLEN: Absolutely.
PAUL ROBERTS: And I wanted to thank also our audience and our listeners, folks who joined us via social media, and of course our producer, Carolynn [00:43:00] van Arsdale, who does so much of the yeoman's work to make these episodes happen. Carolynn, thank you so much. And tune in in a couple weeks, we're gonna be talking about supply chain risk with regard to aviation technology. That's another conversation we got coming up shortly. So stick around, we'll be back with more ConversingLabs episodes in the future.