
RL Blog


The Hugging Face API token breach: 5 lessons learned

More than 1,500 tokens were exposed, leaving millions of AI models and datasets vulnerable. Here's what your security team can learn from the compromise.

John P. Mello Jr., Freelance technology writer.

Researchers from Lasso Security rattled the AI development world early in December when they discovered that more than 1,500 Hugging Face API tokens were exposed, leaving millions of users vulnerable.

Hugging Face is the GitHub for AI developers. Its open-source library hosts more than 500,000 AI models and 250,000 datasets, including pre-trained models from Meta-Llama, Bloom, and Pythia.

One of the most used features of the website is its API, which allows developers and organizations to integrate models and read, create, modify, and delete repositories or files within them. A compromise of its API could be catastrophic.

Bar Lanyado, a security researcher at Lasso Security, said in his team's analysis of the compromise that the Hugging Face API tokens are significant for organizations and that exploiting them could lead to major negative outcomes, including data breaches and the spread of malicious models that "could affect millions of users who rely on these foundational models for their applications."

"The gravity of the situation cannot be overstated. With control over an organization boasting millions of downloads, we now possess the capability to manipulate existing models, potentially turning them into malicious entities."
Bar Lanyado

After the disclosure, the repository and its affected users rushed to mitigate the problem, narrowly evading a debacle. Here are five lessons learned from the breach — and some best practices for reality-checking API security in your development environment.


1. Don't store login information in public repositories

Roger Grimes, a defense evangelist for KnowBe4, said shared logins are a huge ongoing problem. "After years of telling developers not to store logon information on public repositories, they continue to do so in large numbers," he said.

Grimes said the big takeaway was that technical defenses are now a requirement.

"Studies have shown that when logon information is stored in deposited code, it's only minutes before potential adversaries start to take advantage of it. While I'm a huge believer in the power of education to combat most cybersecurity problems, this is one that needs more technical defenses."
Roger Grimes

To help mitigate the problem, public repositories should proactively scan code when a developer uploads it and block the storing of logon information within that code, or at least warn the developer of the severe consequences, he said.
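Such proactive scanning can be sketched in a few lines. This is a minimal illustration, assuming Hugging Face user access tokens carry the documented `hf_` prefix; production scanners such as gitleaks or trufflehog maintain far larger, vetted rule sets than the two patterns shown here.

```python
import re

# Illustrative patterns only; real secret scanners use much broader rule sets.
SECRET_PATTERNS = {
    "huggingface_token": re.compile(r"\bhf_[A-Za-z0-9]{30,}\b"),
    "generic_api_key": re.compile(
        r"""(?i)api[_-]?key\s*[:=]\s*['"][A-Za-z0-9_\-]{16,}['"]"""
    ),
}

def scan_text(text: str) -> list[tuple[str, str]]:
    """Return (rule_name, matched_string) pairs for every suspected secret."""
    findings = []
    for name, pattern in SECRET_PATTERNS.items():
        for match in pattern.findall(text):
            findings.append((name, match))
    return findings
```

A pre-receive hook or CI step could run a check like this over every pushed file and reject the commit when `scan_text` returns any findings.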

With the severity of a potential breach, repositories such as GitHub have been rushing to two-factor and multifactor authentication (2FA and MFA) to protect accounts. However, 2FA and MFA are not panaceas, experts warn.

2. Use multiple API keys — and rotate them

Nick Rago, a field CTO with the API security firm Salt Security, said it is good security practice to use not just one API key with third-party providers but many, each scoped to a specific integration service to minimize the impact of an exposed token. Frequently rotating keys is also a best practice.

If a third-party provider only allows public API access with static tokens, it's good to use an API gateway as an intermediary between a developer and the third-party API, Rago said.

"That way, the organizations can enforce more robust API posture and authentication methods in their code, such as OAuth or MTLS."
Nick Rago
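The per-service key and rotation practice can be sketched as a small tracking structure. The `KeyRing` class below is hypothetical, not part of any provider SDK; a real deployment would back it with a secrets manager and the provider's own revocation API.

```python
from datetime import datetime, timedelta, timezone

class KeyRing:
    """Track one API key per integration service and flag keys due for rotation.

    Illustrative only: persistence, revocation calls, and new-key issuance
    are left to the secrets manager and provider APIs in a real system.
    """
    def __init__(self, max_age: timedelta = timedelta(days=30)):
        self.max_age = max_age
        self._keys = {}  # service name -> (key, issued_at)

    def set_key(self, service: str, key: str) -> None:
        self._keys[service] = (key, datetime.now(timezone.utc))

    def get_key(self, service: str) -> str:
        return self._keys[service][0]

    def due_for_rotation(self) -> list[str]:
        """Return the services whose keys have exceeded the maximum age."""
        now = datetime.now(timezone.utc)
        return [s for s, (_, issued) in self._keys.items()
                if now - issued > self.max_age]
```

Because each service holds its own key, revoking one exposed token disables a single integration rather than every workflow that touches the provider.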

3. Be aware of third-party API usage

Rago explained that API security is not just about securing APIs that are internally developed; ensuring safe consumption and usage of leveraged third-party APIs is also critical.

"Key business process today consists of API supply chain calls that consist of consumption of both internal and third-party APIs. Therefore, it is important that organizations have a good understanding of what third-party APIs are in use, their function, and the data associated with them to assess risk."
Nick Rago

Education is also important, because developers need to understand the ramifications of mishandling privileged API keys. And technologies should be in place to ensure that secrets such as static API tokens don't find their way into code and then into exposed repositories, Rago said.

4. Your AI tools need to take data handling seriously

Teresa Rothaar, a governance, risk, and compliance analyst at Keeper Security, said AI development demands the highest security protocols given the amount of sensitive data AI models need to be fed for training to generate accurate and appropriate results. That means AI data sets alone are valuable.

"In addition to the danger of data poisoning — a scenario where threat actors feed AI models inaccurate or inappropriate data — threat actors may seek to steal fully trained AI models that organizations have invested thousands of work hours and millions of dollars into. Why invest your own money and time into building an AI model if you can steal another organization’s work?"
Teresa Rothaar

5. AI providers need to foster trust in APIs and beyond

Karl Mattson, CISO of API security firm Noname Security, said that as large language models grow in use, they will become embedded into applications using APIs. Organizations are already using generative AI from a variety of vendors and various channels. This utilization is taking different forms, including integrating generative AI into in-house application development, incorporating it into third-party applications, or accessing it directly via API from providers such as OpenAI or Google's Bard, Mattson said.

"As API attacks continue to increase on AI, organizations integrating with generative AI technologies may face the same risks and consequences. The AI industry will need to work to maintain trust by building secure API implementations and protecting third-party transactions with good security hygiene.”
Karl Mattson

Best practices for securing your APIs

Tushar Kulkarni, a graduate student at Indiana University who was part of a recent RSA Conference webcast on API security, shared six measures organizations can take to secure their API implementations.

  • Don't use GUIDs/UUIDs that can be guessed by a threat actor in an intuitive way. GUIDs (globally unique identifiers) and UUIDs (universally unique identifiers) are used as identifiers for various resources or objects in APIs. During the web session, Kulkarni demonstrated how weak identifiers can be used to compromise an API.
  • Never rely on a client to filter sensitive data. It's always a good practice to allow a client to fetch only the data that is needed and nothing more.
  • Enforce a limit on how often a client can call the API endpoint. Without limits, a threat actor can carry out attacks, such as credential stuffing.
  • Lock down endpoints. Make sure all administrative endpoints validate a user's role and privileges before performing an action.
  • Avoid functions binding client-side data into code variables and later into objects in databases. Binding client-side data directly into code variables can expose the API to injection attacks, such as SQL injection or NoSQL injection. It can also expose the API to unintentional code injection vulnerabilities.
  • Enforce a strong CORS policy with custom, unguessable authorization headers. Enforcing a strong CORS policy ensures that only trusted domains are allowed to make requests to the API. That helps mitigate the risk of cross-site request forgery and other cross-origin attacks. "Enforcing a strong CORS policy is very important," Kulkarni said.
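The rate-limiting measure above lends itself to a short sketch. This token-bucket limiter is illustrative, not a drop-in gateway component; production systems would track a bucket per client identity and enforce it at the gateway or load balancer.

```python
import time

class TokenBucket:
    """Allow at most `capacity` calls per `refill_period` seconds."""
    def __init__(self, capacity: int, refill_period: float):
        self.capacity = capacity
        self.refill_rate = capacity / refill_period  # tokens per second
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Consume one token if available; otherwise reject the call."""
        now = time.monotonic()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A limiter like this caps credential-stuffing attempts at the bucket's capacity per refill period instead of letting an attacker hammer the login endpoint freely.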

Developers should treat all API inputs as dangerous, Kulkarni said.

"You should never assume end users won't fool around with the API on their own. You should always assume that every end user is an attacker."
Tushar Kulkarni
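That attacker's-eye mindset translates into strict, allowlist-based server-side validation. The fields and rules below are hypothetical examples, not from the webcast:

```python
import re
import uuid

ALLOWED_ROLES = {"viewer", "editor"}
USERNAME_RE = re.compile(r"[a-z0-9_]{3,32}")

def validate_request(params: dict) -> dict:
    """Return a field->error map; an empty map means the input passed."""
    errors = {}
    # Usernames: explicit allowlist of characters, bounded length.
    if not USERNAME_RE.fullmatch(str(params.get("username", ""))):
        errors["username"] = "must be 3-32 chars: a-z, 0-9, underscore"
    # Resource identifiers: must parse as a UUID, never interpolated raw.
    try:
        uuid.UUID(str(params.get("resource_id", "")))
    except ValueError:
        errors["resource_id"] = "must be a valid UUID"
    # Roles: reject anything outside the known set.
    if params.get("role") not in ALLOWED_ROLES:
        errors["role"] = "unknown role"
    return errors
```

Rejecting everything that is not explicitly allowed, rather than filtering out known-bad strings, is what makes the "every end user is an attacker" assumption practical.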

Keep learning

Explore RL's Spectra suite: Spectra Assure for software supply chain security, Spectra Detect for scalable file analysis, Spectra Analyze for malware analysis and threat hunting, and Spectra Intelligence for reputation data and intelligence.
