GitHub’s Copilot ML code-completion engine is violating copyright wholesale. So say several high-profile open source advocates.
It was predictable, really: Microsoft should have seen this coming. It’s ludicrous to blame license violation on the poor, stressed dev trusting GitHub.
The lesson for devs? Be extremely careful about the code fragments you import. In this week’s Secure Software Blogwatch, we go around.
Your humble blogwatcher curated these bloggy bits for your entertainment. Not to mention: Jet powered coffin.
Shut up and think of the deadline
What’s the craic? Tim Anderson reports — “GitHub Copilot under fire”:
“Wrongful use of copyright code”
Developer Tim Davis, a professor of Computer Science and Engineering at Texas A&M University, has claimed … that GitHub Copilot, an AI-based programming assistant, “emits large chunks of my copyrighted code, with no attribution, no LGPL license.” … The code Davis posted does seem very close.
One of the concerns in the open source community is that if chunks of open source code are regurgitated wholesale, without specifying any license, then it is breaking the purpose of the license. Another concern is that developers may inadvertently combine code with incompatible licenses into one project.
Part of the problem is that open source code, by design, is likely to appear in multiple projects by different people, so it will end up multiple times on GitHub and among multiple users of Copilot. With or without Copilot, developers can make wrongful use of copyright code.
Horse’s mouth? Tim Davis — @DocSparse — has more:
“I’m passionately committed to open source”
For example, the simple prompt "sparse matrix transpose, cs_" produces my cs_transpose in CSparse. … Same variable names, helper functions, comments. … Not OK. [And] there's no way to opt out of GitHub's use of my code by Copilot.
Somehow it knows how to complete the comment /* sparse matrix transpose in the style of Tim Davis*/ and then return … my LGPL code verbatim, with no license stated and no copyright. … So why not also keep the copyright and license intact? … I plan on asking GitHub to emit my copyright and license when it emits my code. … They're smart people — they can figure it out. … Also, academia rewards citations and use of work. If my name is stripped then I lose that way too.
My sparse C=A*B is faster than the one [previously] in MATLAB … and the Intel MKL sparse library. Why would I bother to take the time (years) to write such code if I can't benefit from copyright protection? … It is all humanity-advancing open source code. … I'm passionately committed to open source code. Redis uses my code. … The Julia language, SciPy, R, every Linux distro. The code can be found in many drones, robots … inside every Oculus / Meta headset [and] Google StreetView.
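For context, Davis's cs_transpose implements the standard counting-sort transpose over compressed sparse column (CSC) storage. The sketch below is a generic textbook version of that algorithm in Python, written for this article; it is not Davis's CSparse code, and the function and variable names are our own:

```python
def csc_transpose(m, n, Ap, Ai, Ax):
    """Transpose an m-by-n CSC matrix (Ap: column pointers, Ai: row
    indices, Ax: values); returns the CSC arrays of the n-by-m transpose."""
    nnz = Ap[n]
    # Count the entries in each row of A; each row becomes a column of A^T.
    count = [0] * m
    for p in range(nnz):
        count[Ai[p]] += 1
    # Column pointers of the transpose: cumulative sum of the row counts.
    Cp = [0] * (m + 1)
    for i in range(m):
        Cp[i + 1] = Cp[i] + count[i]
    # Scatter each entry of A into its slot in the transpose.
    next_slot = Cp[:m]          # next free position in each output column
    Ci = [0] * nnz
    Cx = [0.0] * nnz
    for j in range(n):
        for p in range(Ap[j], Ap[j + 1]):
            q = next_slot[Ai[p]]
            next_slot[Ai[p]] = q + 1
            Ci[q] = j           # the row index in A^T is the column of A
            Cx[q] = Ax[p]
    return Cp, Ci, Cx
```

The point of Davis's complaint is that a routine like this, prompted by little more than its name, can come back out of Copilot verbatim with the LGPL notice stripped.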
It’s not only Davis. Kip Kniskern knows — “Copilot apparently violating open source licensing”:
“Microsoft has been vague”
Writer, lawyer, and programmer Matthew Butterick has some issues with Microsoft's machine-learning-based code assistant, GitHub Copilot, and the way it is apparently mishandling open-source licenses. … It's the way the AI is trained, or more precisely where it's trained from, that is becoming a problem for developers like Butterick.
The problem here is that these public repos that GitHub is trained on are licensed, and require attribution. … Microsoft has been vague about its use of the code, calling it fair use. But … for programmers like Butterick, who contribute open source code out of a sense of community, stripping any attribution away from their work is a problem.
Giddyup. Matthew Butterick asks, “How will you feel if Copilot erases your open-source community?”:
“It is a parasite”
I’ve been professionally involved with open-source software since 1998, including two years at Red Hat. … In June 2022, I wrote about the legal problems with GitHub Copilot, in particular its mishandling of open-source licenses.
I’m currently working with the Joseph Saveri Law Firm to investigate a potential lawsuit against GitHub Copilot … for violating its legal duties to open-source authors and end users. … Once you accept a Copilot suggestion, all that becomes your problem. [But] how can Copilot users comply with the license if they don’t even know it exists? … To be fair, Microsoft doesn’t really dispute this. They just bury it in the fine print.
Obviously, open-source developers … don’t do it for the money. … But we don’t do it for nothing, either. A big benefit of releasing open-source software is the people: the community of users, testers, and contributors that coalesces around our work. Our communities help us make our software better in ways we couldn’t on our own.
Copilot is … poisonous to open source. … It is a parasite.
If I use Copilot, what’s the risk? entfe001 explains:
Closed source licenses will use copyright law to make sure you can't share, modify or reuse their code. Open source licenses will use copyright law to make sure you can share, modify or reuse their code—on their conditions.
Where this **** AI falls foul is that it might share, modify and reuse third-party code without granting whatever rights or obligations the original license "gave" to the training set. For starters, most … open source licenses require that a copy of the license itself be given along with the source code, no matter if the whole work or just a part is used.
For MIT-like licenses, not retaining authorship notices is a copyright license violation. For GPL-like licenses it is even worse, as none of the GPL-granted rights would be passed downstream, which is by itself a violation.
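To make the "authorship notices" point concrete, here is a minimal illustration (the project, author, and function are hypothetical, and the MIT text is truncated for space): under an MIT-style license, the header below must travel with the code wherever it is copied.

```python
# Copyright (c) 2022 Example Author <email@example.com>
#
# Permission is hereby granted, free of charge, to any person obtaining
# a copy of this software ... (full MIT license text must be retained)

def clamp(x, lo, hi):
    """Clamp x into the closed interval [lo, hi]."""
    return max(lo, min(x, hi))
```

A completion that reproduces `clamp` verbatim but drops the comment block above strips exactly the notice the license conditions its grant on.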
It gets worse: Copilot is spitting out closed source code, too. So says esskay:
I had something similar happen … a couple of days ago. I'm on friendly terms with the developer of a competing codebase and have confirmed the following with them (both codebases are closed source and hosted on GitHub).
Halfway through building something I was given a block of code by Copilot which contained a copyright line with my competitor's name, company number and email address. Those details have never, ever been published in a public repository. How did that happen?
It’s a legal compliance nightmare. Here’s u/Untgradd:
I do open source compliance activities for a software product, so I’m intimately aware of … the kind of licensing requirements typically found in open source software. This service seems to be actively causing a compliance nightmare.
Given my understanding, say some … dev autocompletes their way through a particularly productive sprint and the team inadvertently releases code containing copyleft code. [Now] you're legally obligated to release your source code. … I could imagine a scenario where some clever folks effectively grep for well-known copyleft snippets as a means of targeting closed source … software.
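That targeting idea needs nothing fancier than substring search over a source tree. A rough sketch, assuming you already have a list of distinctive lines from well-known copyleft sources (the fingerprint strings below are placeholders, not real snippets):

```python
import os

# Hypothetical fingerprints: short, distinctive lines one might lift from
# well-known copyleft code. Real tooling would use many, and hash them.
FINGERPRINTS = [
    "/* sparse matrix transpose",
    "you can redistribute it and/or modify",
]

def scan_tree(root):
    """Return (path, fingerprint) pairs for files containing a known snippet."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as f:
                    text = f.read()
            except OSError:
                continue  # unreadable file; skip it
            for fp in FINGERPRINTS:
                if fp in text:
                    hits.append((path, fp))
    return hits
```

Commercial license-compliance scanners work on the same principle, just with large fingerprint databases, which is why copyleft code smuggled in by autocomplete is findable.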
But surely Microsoft has a point? It’s all out there in public, so it’s fair use — right? b0llchit thinks a thought experiment:
By that standard you could take everything that is written about books and use the descriptions' text to train an ML system. Then when you use the system and it writes "Henry Flotter and the magical wanderer's gem," we'll see how long the fair-use defence will stand.
Meanwhile, what can be done about it? Jed Brown — @five9a2 — has this suggestion for Microsoft:
Replace with a Clippy: “It looks like you’re trying to implement a sparse matrix library. Have you considered calling a high quality library such as … ?”
You have been reading Secure Software Blogwatch by Richi Jennings. Richi curates the best bloggy bits, finest forums, and weirdest websites … so you don’t have to. Hate mail may be directed to @RiCHi or email@example.com. Ask your doctor before reading. Your mileage may vary. Past performance is no guarantee of future results. Do not stare into laser with remaining eye. E&OE. 30.