GitHub hits CTRL-Z, decides it will train its AI with user data after all
As of April 24 you'll be feeding the Octocat unless you opt out
Microsoft's GitHub next month plans to begin using customer interaction data – "specifically inputs, outputs, code snippets, and associated context" – to train its AI models.
The code locker’s revised policy applies to Copilot Free, Pro, and Pro+ customers, as of April 24. Copilot Business and Copilot Enterprise users are exempt thanks to the terms of their contracts. Students and teachers who access Copilot will also be spared.
Those affected have the option to opt out in accordance with "established industry practices" – meaning according to US norms as opposed to European norms where opt-in is commonly required. To opt out, GitHub users should visit /settings/copilot/features and disable "Allow GitHub to use my data for AI model training" under the Privacy heading.
Mario Rodriguez, GitHub's chief product officer, would rather you didn't.
"By participating, you'll help our models better understand development workflows, deliver more accurate and secure code pattern suggestions, and improve their ability to help you catch potential bugs before they reach production," he wrote in a blog post.
To excuse its covetous behavior, GitHub in its FAQs notes that Anthropic, JetBrains, and corporate parent Microsoft operate similar opt-out data use policies.
The rationale for the change, according to Rodriguez, is that interaction data makes the company's AI models perform better. Adding interaction data from Microsoft employees has led to meaningful improvements, he claims, such as an increased acceptance rate for AI model suggestions.
The data GitHub wants includes:
- Model outputs that have been accepted or modified;
- Model inputs including code snippets shown;
- Code context surrounding your cursor position;
- Comments and documentation you've written;
- File names and repo structure;
- Interactions with Copilot features (e.g. chats); and
- Feedback (e.g. thumbs up/down ratings).
The policy shift does somewhat change the meaning of GitHub private repositories, which are notionally "only accessible to you, people you explicitly share access with, and, for organization repositories, certain organization members." These might be more accurately described as "GitHub private* repositories," with the asterisk to denote the limits of GitHub’s definition of the word "private."
As the FAQs explain: "If a Copilot user has their settings set to enable model training on their interaction data, code snippets from private repositories can be collected and used for model training while the user is actively engaged with Copilot while working in that repository."
Recent banter in the GitHub community doesn’t include much enthusiasm for the plan. To judge by emoji votes alone, users have offered 59 thumbs-down votes and just three rocket ships, which we understand signal some measure of excitement.
But among the 39 posts commenting on the change at the time this article was filed, no one other than Martin Woodward, GitHub VP of developer relations, has really endorsed the idea.
User indignation might be somewhat mitigated if GitHub users recognized that OpenAI's Codex – used in GitHub Copilot – is "a GPT language model fine-tuned on publicly available code from GitHub." That verbiage shows the data-gorged AI horse is already out of the barn, so to speak.
Shutting the doors at this point won't change the fact that the AI industry is built on data gathered without first seeking any strong indicator of enthusiastic consent. ®
