Data Hacks and the US-China AI Race
Trent Kannegieter is a JD candidate at Yale Law School. Previously, he was Chief of Staff at SparkAI, a machine learning operations and autonomy startup that was acquired by a Fortune 100 company.
Mercor isn’t top of mind in DC today, but the expert-data ecosystem it represents is a core determinant of AI capabilities growth and the U.S. frontier AI advantage. Last week’s hack and data breach at Mercor, and the broader security challenge it represents, may shape the future of tech competition. It could also hold the keys to understanding two rapidly vanishing U.S. assets in the field: the specialized-data moat and a differentiable frontier itself.
What is Mercor?
Mercor is a $10 billion startup that builds specialized human-expert datasets. Firms like Mercor, Surge AI, and Turing1 are key to AI capabilities growth, helping labs unlock high performance in new domains. These datasets are one large reason why model performance continues to improve even after foundation-model labs scraped most of the web many training runs ago.
Models are only as good as their data. But one of the greatest challenges in AI development today is finding a way to make new, less-documented fields legible to models. While today’s frontier models have achieved incredible capabilities, they still struggle with “generalized” reasoning when asked to function in domains where they lack robust training data.2 Thus, one of the key drivers of capabilities growth, especially for the type of white-collar workflow automation critical to helping models complete high-value tasks, is the collection and curation of specialized datasets in expert domains. For example, radiology models train on large corpora of X-rays.3
Mercor specifically conducts these operations through a hiring platform.4 As Mercor built this platform’s talent base, it accumulated an impressive set of specialties from biotech research and interventional radiology to corporate law and international business development. This data is incredibly valuable for the labs; when it helps models generate new insights, it unlocks whole new production domains.5
Demand for Mercor’s products has grown alongside the AI fundraising flywheel. As foundation labs raise more money to complete ever-larger training runs and reach ever-more users and ever-more-demanding benchmarks, demand and available capital for specialized data rise, too.6 Labs spend tremendous amounts on specialized data. An exclusive report in The Information suggested that Mercor’s annualized revenue had recently reached $1 billion.
The Most Important Hack DC’s Never Heard Of
But recently, Mercor was hacked. On Monday, March 30, a group known as Lapsus$ claimed to have stolen 4TB of Mercor’s data. The hack allegedly included everything from candidate profiles and personally identifiable information (PII) to video interviews with experts, source code, and other proprietary information and secrets.7
The Mercor hack suggests that expert-data companies could be a weak link, letting adversaries copy or steal a tremendous lab investment in data. Suddenly, critical data from a company built on proprietary data was available for purchase, allegedly at a price of $1 million for nonexclusive use. To put that price in perspective, Mercor pays its contractors over $1.5 million every day to build these assets.8 Mercor project lengths vary widely, but if the hack contains data from a substantial share of projects, the contractor costs alone may exceed the price of the leaked dataset many times over.
Mercor’s close partnership with the labs also raises the concern that the hack might have exposed secrets about how foundation-model labs manage their product development. Within days of the hack, Meta had paused its contracts with Mercor, reportedly due to such a concern.
We don’t know specifically what data was leaked in this hack. (For example, it is unclear how much annotated data or process secrets were exposed, as opposed to data about the experts or less important procedural details.) But the specific fallout from this hack might prove less significant than the demonstration that such hacks are achievable. The most consequential concern may be the prospect, and the threat, of future hacks of expert-data startups.
Such concerns are especially acute in the wake of Anthropic’s announcement this week that they would withhold release of their new Claude Mythos Preview model due to its immense potential to conduct cyberattacks. Even if Anthropic refrains from releasing its model today, other close followers might not be so judicious. This development raises concerns that advanced cyberattack capabilities are coming, fast.
Expect Fast-Followers to Pounce
Fast-follower foundation-model builders, especially in China, will surely try to access this incredibly valuable data, both from this hack and from any future attacks. Why? Beyond these firms’ sophistication, they’ve also conducted far more controversial operations recently. Take, for example, Anthropic’s recently publicized allegations of mass distillation of Claude models by Moonshot AI, Minimax, and DeepSeek. OpenAI raised similar concerns, including in a letter to the US House Select Committee on the CCP. Obviously, data curation strategies extend far beyond distillation attacks.9 But these incidents suggest the inventive methods Chinese firms are willing to employ to close the gap with leading U.S. labs. They’ve shown a willingness to use even nominally closed-source models to develop their own, often open-source alternatives. Alongside fighting Anthropic’s and OpenAI’s distillation defenses to build synthetic datasets, it is surely also worth their while to experiment with stolen expert datasets.
Two Key Takeaways
Increased Urgency of Strong Export Controls.
Threats to one moat increase the importance of protecting another. If one can distill models and steal critical expert data, then access to compute becomes an even more important long-term differentiator.
A Higher Premium on Security for National-Interest Expert Human Data Startups.
This incident strengthens the case for more active state involvement in bolstering cybersecurity for strategically significant AI companies. Cyberattacks like this could disincentivize innovation: Why invest in high-quality, bespoke datasets if hacks will let non-paying firms free ride? The need to protect a critical asset like national-interest AI startups that help power our labs justifies federal security assistance. (Put another way, perhaps such calls to secure the foundation-model labs need to be expanded to key partners like expert human data firms.)
Such an arrangement would allow these national-interest entities to leverage existing state infrastructure to provide, among other things, unique threat intelligence; testing; and incident response and support backed by the state’s unique authorities, scale, and visibility. Such services already have precedent. For example, the NSA offers cybersecurity services to private contractors working with the Department of Defense. Similarly, the FBI’s Business Alliance Initiative offers support to private firms in the form of counterintelligence vulnerability assessments, information on specific threats, and advice on a variety of infiltration scenarios.10
If export controls and enforcement11 are key to maintaining the U.S. AI advantage, then so is leveraging state security strengths to incentivize firms like Mercor to keep developing the datasets that advance the frontier in critical professional work, sustaining U.S. AI innovation and capabilities growth.
Other competitors in the space include Handshake, micro1, and certain parts of Scale AI (like its Expert Match). Better-known-in-DC data-labeling incumbents like Labelbox or the core business of Scale AI provide labeled data, but don’t necessarily focus on specialized domain-expert data like Mercor and others. Mercor’s core value proposition relies much more on (1) recruiting elite talent to the platform and (2) building workflows that, with those experts’ help, convert complex processes into data useful for model training. Part of Mercor’s success is attributable to the team’s ability to do both of these challenging tasks very well, to the satisfaction of frontier labs.
For a deeper dive on this claim, Song, Han, and Goodman (2026) provide a helpful survey of the research on LLM “reasoning failures.” Section 4.2 in particular covers many ways that LLM performance and reasoning struggle in contexts without robust training data.
See, for example, the CheXpert dataset of over 200,000 chest X-rays from Stanford Health Care exams.
Mercor refers to itself as an “AI recruiting platform.”
Attention to this ecosystem has occasionally broken out of tech circles. For instance, in the last year, Bloomberg finance columnist Matt Levine discussed the various initiatives, including one spearheaded by OpenAI, trying to hire ex-investment bankers to help train the models that will then automate investment banking. Levine had stumbled onto one of the most lucrative spaces in the AI ecosystem today.
Of course, it’s not enough to just sell this data. Other attributes of Mercor and other leading firms presumably allow them to succeed when others fail in the same space.
This announced hack by an independent cybercrime group might also not be the only breach of the company. Other, more sophisticated hackers might also be attempting to access this data. Of course, as an outsider, I can only speculate on this point.
This piece puts aside the privacy concerns around leaks of contractors’ PII (including Social Security numbers) and the class-action lawsuits already assembling in response. That choice keeps the focus on key details for U.S.-China tech competition; it does not mean these concerns aren’t also important to the people in Mercor’s ecosystem or to Mercor itself.
Among other things, China has its own firms dedicated to curating specialty datasets, including both divisions of top firms like SenseTime, Baidu, and Tencent and startups like Dataocean AI and Datatang.
Credit to Maggie Baughman for assistance with these specific authorities.
For example, building a network of informants and enforcers to prevent chip smuggling.