Microsoft AI researchers accidentally exposed tens of terabytes of sensitive data, including private keys and passwords, while publishing a storage bucket of open-source training data on GitHub.
In research shared with TechCrunch, cloud security startup Wiz said it discovered a GitHub repository belonging to Microsoft’s AI research division as part of its ongoing work into the accidental exposure of cloud-hosted data.
Readers of the GitHub repository, which provided open-source code and AI models for image recognition, were instructed to download the models from an Azure Storage URL. However, Wiz found that this URL was configured to grant permissions on the entire storage account, exposing additional private data by mistake.
The exposed data amounted to 38 terabytes of sensitive information, including backups of two Microsoft employees’ personal computers. It also contained other sensitive personal data, such as passwords to Microsoft services, secret keys, and over 30,000 internal Microsoft Teams messages from hundreds of Microsoft employees.
The URL, which had exposed this data since 2020, was also misconfigured to allow “full control” rather than “read-only” permissions, according to Wiz, which meant anyone who knew where to look could potentially delete, replace, and inject malicious content into the stored files.
Wiz notes that the storage account wasn’t directly exposed. Rather, the Microsoft AI developers included an overly permissive shared access signature (SAS) token in the URL. SAS tokens are a mechanism used by Azure that allows users to create shareable links granting access to an Azure Storage account’s data.
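To make the difference between a read-only link and an overly permissive one concrete, here is a minimal Python sketch of how a SAS URL could be checked before it is shared. It assumes the standard SAS query parameters documented by Azure (`sp` for permissions, `se` for expiry, `srt` for account-level tokens). The function name `audit_sas_url`, the seven-day lifetime threshold, and the example URL are hypothetical, and none of this reflects tooling used by Wiz or Microsoft.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlsplit

# SAS permission letters that allow changing or removing data.
MUTATING = {"w": "write", "d": "delete", "a": "add", "c": "create", "u": "update"}


def audit_sas_url(url, now=None, max_lifetime=timedelta(days=7)):
    """Return a list of warnings for a SAS URL (empty list means no findings).

    max_lifetime is an arbitrary illustrative threshold, not an Azure default.
    """
    now = now or datetime.now(timezone.utc)
    query = parse_qs(urlsplit(url).query)
    warnings = []

    # "sp" lists the granted permissions as single letters, e.g. "r" or "rwdl".
    perms = query.get("sp", [""])[0]
    risky = [name for letter, name in MUTATING.items() if letter in perms]
    if risky:
        warnings.append("grants more than read access: " + ", ".join(risky))

    # "srt" appears only on account SAS tokens, which are not limited
    # to a single blob or container.
    if "srt" in query:
        warnings.append("account-level token, not scoped to one resource")

    # "se" is the expiry time in ISO 8601 UTC form.
    expiry_raw = query.get("se", [""])[0]
    if expiry_raw:
        expiry = datetime.fromisoformat(expiry_raw.replace("Z", "+00:00"))
        if expiry.tzinfo is None:
            expiry = expiry.replace(tzinfo=timezone.utc)
        if expiry - now > max_lifetime:
            warnings.append("expires far in the future: " + expiry_raw)

    return warnings


if __name__ == "__main__":
    # Hypothetical URL with a placeholder signature; not a real token.
    example = (
        "https://exampleaccount.blob.core.windows.net/models"
        "?sv=2020-08-04&ss=b&srt=sco&sp=rwdl&se=2050-01-01T00:00:00Z&sig=PLACEHOLDER"
    )
    for warning in audit_sas_url(example):
        print(warning)
```

A check like this only reads the parameters in the URL; it cannot tell whether a token has already been revoked on the Azure side.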
“AI unlocks huge potential for tech companies,” Wiz co-founder and CTO Ami Luttwak told TechCrunch. “However, as data scientists and engineers race to bring new AI solutions to production, the massive amounts of data they handle require additional security checks and safeguards. With many development teams needing to manipulate massive amounts of data, share it with their peers or collaborate on public open-source projects, cases like Microsoft’s are increasingly hard to monitor and avoid.”
Wiz said it shared its findings with Microsoft on June 22, and Microsoft revoked the SAS token two days later on June 24. Microsoft said it completed its investigation into potential organizational impact on August 16.
In a blog post shared with TechCrunch before publication, Microsoft’s Security Response Center said that “no customer data was exposed, and no other internal services were put at risk because of this issue.”
Microsoft said that as a result of Wiz’s research, it has expanded GitHub’s secret scanning service, which monitors all public open-source code changes for plaintext exposure of credentials and other secrets, to include any SAS token that may have overly permissive expirations or privileges.
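As a rough illustration of what scanning text for such tokens involves, the Python sketch below searches a string for URLs that look like Azure Storage SAS links and reports their permission and expiry fields. It is an assumption-laden toy, not a description of how GitHub's service works: the regular expression, the function name `find_sas_urls`, and the sample text are all hypothetical, and it relies only on the documented `*.core.windows.net` storage hostnames and the `sig`, `sp`, and `se` query parameters.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Deliberately loose URL pattern: stop at whitespace, quotes, or angle brackets.
URL_RE = re.compile(r"https://[^\s\"'<>]+")


def find_sas_urls(text):
    """Return one dict per URL in `text` that looks like an Azure Storage SAS link."""
    hits = []
    for match in URL_RE.finditer(text):
        parts = urlsplit(match.group(0))
        query = parse_qs(parts.query)
        host = parts.hostname or ""
        # A SAS link carries its signature in the "sig" query parameter.
        if host.endswith(".core.windows.net") and "sig" in query:
            hits.append(
                {
                    "host": host,
                    "permissions": query.get("sp", [""])[0],
                    "expiry": query.get("se", [""])[0],
                }
            )
    return hits


if __name__ == "__main__":
    # Hypothetical README snippet with a placeholder signature.
    sample = (
        "Download the model from "
        "https://exampleaccount.blob.core.windows.net/models/m.ckpt"
        "?sp=rwdl&se=2050-01-01&sig=PLACEHOLDER and then run the demo."
    )
    for hit in find_sas_urls(sample):
        print(hit)
```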