Leaked data exposes a Chinese AI censorship machine
A leaked dataset reviewed by TechCrunch suggests the Chinese government, or a group working on its behalf, is training a large language model to flag politically sensitive content at scale. The find sharpens concerns about AI being put to work for state repression, and it gives researchers and policymakers a concrete case to weigh as authoritarian governments adopt the technology.

A complaint about poverty in rural China. A news report about a corrupt Communist Party member. A cry for help about corrupt cops shaking down entrepreneurs. These are just a few of the 133,000 examples fed into a sophisticated large language model that’s designed to automatically flag any piece of content considered sensitive by the Chinese government.
China's Sophisticated AI Censorship System
A leaked database seen by TechCrunch reveals China has developed an AI system that supercharges its already formidable censorship machine, extending far beyond traditional taboos like the Tiananmen Square massacre. The system appears primarily geared toward censoring Chinese citizens online but could be used for other purposes, like improving Chinese AI models’ already extensive censorship.
Xiao Qiang, a researcher at UC Berkeley who studies Chinese censorship and who also examined the dataset, told TechCrunch that it was “clear evidence” that the Chinese government or its affiliates want to use LLMs to improve repression. “Unlike traditional censorship mechanisms, which rely on human labor for keyword-based filtering and manual review, an LLM trained on such instructions would significantly improve the efficiency and granularity of state-led information control,” Xiao said. This adds to growing evidence that authoritarian regimes are quickly adopting the latest AI tech.
The Dataset and Its Implications
The dataset was discovered by security researcher NetAskari, who shared a sample with TechCrunch after finding it stored in an unsecured Elasticsearch database hosted on a Baidu server. This doesn’t indicate any involvement from either company — all kinds of organizations store their data with these providers. There’s no indication of who, exactly, built the dataset, but records show that the data is recent, with its latest entries dating from December 2024.
In language eerily reminiscent of how people prompt ChatGPT, the system’s creator tasks an unnamed LLM with figuring out whether a piece of content has anything to do with sensitive topics related to politics, social life, and the military. Such content is deemed “highest priority” and needs to be immediately flagged. Top priority topics include pollution and food safety scandals, financial fraud, and labor disputes, which are hot-button issues in China that sometimes lead to public protests.
Detection of Dissent and Social Unrest
From this huge collection of 133,000 examples that the LLM must evaluate for censorship, TechCrunch gathered 10 representative pieces of content. Topics likely to stir up social unrest are a recurring theme. One snippet is a post by a business owner complaining about corrupt local police officers shaking down entrepreneurs, a rising issue in China as its economy struggles. Another piece of content laments rural poverty in China, describing run-down towns that only have elderly people and children left in them.
There’s also a news report about the Chinese Communist Party (CCP) expelling a local official for severe corruption and believing in “superstitions” instead of Marxism. There’s extensive material related to Taiwan and military matters, such as commentary about Taiwan’s military capabilities and details about a new Chinese jet fighter. The Chinese word for Taiwan (台湾) alone is mentioned over 15,000 times in the data, a search by TechCrunch shows. Subtle dissent appears to be targeted, too.
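For a sense of how a count like that can be reproduced, the short sketch below tallies occurrences of a term across a local dump of the sample. The file name, the JSON Lines layout, and the "content" field are assumptions made for illustration; the actual structure of the leaked records has not been published, and this is not necessarily how TechCrunch ran its search.

```python
# Minimal sketch: counting how often a term appears in a dataset dump.
# The file name, JSON Lines layout, and "content" field are assumptions;
# the real structure of the leaked data is not public.
import json

TERM = "台湾"  # the Chinese word for Taiwan

def count_term(path: str, term: str) -> int:
    total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed lines
            total += str(record.get("content", "")).count(term)
    return total

if __name__ == "__main__":
    print(count_term("dataset_sample.jsonl", TERM))
```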
Evolution of AI-Driven Censorship
The dataset examined by TechCrunch is the latest evidence that authoritarian governments are seeking to leverage AI for repressive purposes. OpenAI released a report last month revealing that an unidentified actor, likely operating from China, used generative AI to monitor social media conversations — particularly those advocating for human rights protests against China — and forward them to the Chinese government. Traditionally, China’s censorship methods rely on more basic algorithms that automatically block content mentioning blacklisted terms, like “Tiananmen massacre” or “Xi Jinping,” as many users experienced using DeepSeek for the first time. But newer AI tech, like LLMs, can make censorship more efficient by finding even subtle criticism at a vast scale.
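To make that contrast concrete, here is a minimal sketch of the keyword-blocklist approach described above. The term list and function are illustrative only, not the actual filter used by DeepSeek or any Chinese platform; the point is that exact string matching catches only literal mentions, while rephrased or oblique criticism slips through.

```python
# Illustrative blocklist filter: a post is flagged only if it contains an
# exact blacklisted phrase, so paraphrases pass through untouched.
BLOCKLIST = ["Tiananmen massacre", "Xi Jinping"]  # example terms from the article

def is_blocked(post: str) -> bool:
    return any(term in post for term in BLOCKLIST)

print(is_blocked("Remember the Tiananmen massacre"))      # True: exact match
print(is_blocked("Remember what happened in June 1989"))  # False: reworded reference
```

An LLM-based classifier, by contrast, can be prompted to judge topic and subtext rather than match strings, which is what makes the system described in the leaked dataset a step change in efficiency and granularity.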