Amazon paper exposes biases in unreliable-news datasets

The paper, which received honorable mention at EACL, presents guidelines for better analysis and construction of datasets.

At the 2021 Conference of the European Association for Computational Linguistics (EACL), we received honorable mention in the best-long-paper category for our paper “Hidden biases in unreliable news detection datasets”, coauthored with Xiang Zhou (while he was an Amazon intern) and Mohit Bansal from the University of North Carolina at Chapel Hill.

In this paper, we studied datasets used by the research community for developing models to automatically identify unreliable news. We found that the datasets had biases that are responsible for much of the accuracy in identifying unreliable news that previous papers reported. This suggests that models built on these datasets will not generalize well in a real-world setting. 

cloud_wrong.png
A word cloud of keywords and keyword phrases from the titles of news articles in one dataset that are correlated with incorrect prediction of article accuracy. The size of a word indicates the strength of the correlation. Models trained on the dataset are most prone to errors on the topics of politics and world news and most accurate on sports and entertainment (see below), an indication of bias in the dataset.

To give the research community a path forward, we followed up the analysis with a detailed study of the structure of the bias, guidelines for reducing the bias in existing datasets, and guidelines for developing higher-quality datasets in the future.

Data collection

We started our analysis by looking at the data collection strategies used for creating unreliable-news-article datasets. Creating such datasets requires collecting news articles and their corresponding labels (for instance, “reliable” or “unreliable”). 

As expected, collecting the labels is the most challenging task. Some fact-checking websites (e.g., PolitiFact, GossipCop) assign labels to individual articles. While this provides accurate labels, the process is both time consuming and expensive, resulting in comparatively small datasets. 

An approach that scales better is assigning a reliability (or bias) score to each news outlet (or site, such as cnn.com or nytimes.com). This is an easy way to create large-scale datasets, but it generates noisy labels. We studied biases in datasets that take both approaches: site- and article-level labeling.

Keyword correlations

As a representative example of a dataset that is annotated at the article level, we studied the popular FakeNewsNet dataset. We trained a simple (logistic-regression) model to predict the labels (“reliable” or “unreliable”) of news items in the dataset on the basis of keywords and found that its accuracy (78%) was almost as high as that of a state-of-the-art BERT-based model (81%). Examining the keywords that drove the model’s performance, we found that celebrity names (“Brad”, “Pitt”, “Jenner”, etc.) predicted the “unreliable” label, while neutral terms like “2018” or “season” predicted the “reliable” label. 

These results indicate that the ability to predict the labels of the articles in such datasets may depend on the presence of simple keywords that flag topics, such as celebrity news, rather than any deeper pattern. This implies that the dataset composition is biased, because it has strong correlations between topic words and the unreliable-news label. (It doesn’t mean that articles mentioning Brad Pitt or celebrities in general are intrinsically unreliable.)

This bias is partly due to the fact-checking sites’ article selections. Another source of bias is that in the process of constructing FakeNewsNet, the authors used a web search engine, with its own proprietary news-ranking and verification processes, to retrieve the full texts of the news articles (which are not provided by fact-checking sites). This sometimes results in mismatches, in which unreliable content is replaced with reliable content without an update to the label.
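To make the keyword baseline concrete, here is a minimal sketch of a bag-of-words logistic-regression classifier, written in plain Python so that it runs without extra libraries. It is not the code from the paper: the titles, labels, hyperparameters, and function names are all made up for illustration, and the training loop is ordinary stochastic gradient descent. The point is only that inspecting the learned weights shows which words drive the predictions.

```python
import math
from collections import Counter

# Toy, hand-written titles (hypothetical; NOT taken from FakeNewsNet).
# Label 1 = "unreliable", label 0 = "reliable".
TITLES = [
    "brad pitt secret wedding shock",
    "jenner feud explodes after party",
    "pitt and jenner spotted together",
    "season finale airs in 2018",
    "team wins title in 2018 season",
    "new season schedule announced",
]
LABELS = [1, 1, 1, 0, 0, 0]


def tokenize(text):
    """Lowercase and split on whitespace; each word is a binary feature."""
    return set(text.lower().split())


def train_keyword_model(texts, labels, epochs=200, lr=0.5):
    """Fit a bag-of-words logistic regression with plain SGD."""
    weights = Counter()  # word -> weight; unseen words read as 0
    bias = 0.0
    docs = [tokenize(t) for t in texts]
    for _ in range(epochs):
        for words, y in zip(docs, labels):
            z = bias + sum(weights[w] for w in words)
            p = 1.0 / (1.0 + math.exp(-z))  # predicted P(unreliable)
            grad = p - y                    # gradient of the log loss w.r.t. z
            bias -= lr * grad
            for w in words:
                weights[w] -= lr * grad
    return weights, bias


def predict(weights, bias, text):
    """Return 1 ("unreliable") if the linear score is positive, else 0."""
    z = bias + sum(weights[w] for w in tokenize(text))
    return 1 if z > 0 else 0


def top_keywords(weights, k):
    """The k words with the largest weights toward the "unreliable" label."""
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:k]


weights, bias = train_keyword_model(TITLES, LABELS)
print("pushes toward 'unreliable':", top_keywords(weights, 3))
print("pushes toward 'reliable':", sorted(weights.items(), key=lambda kv: kv[1])[:3])
```

If a baseline like this comes close to a much larger model on a real dataset, that suggests topic keywords, rather than any deeper signal, account for much of the reported accuracy.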

Site classification

We also studied the NELA dataset, which uses site-level labels. We found even more challenges with site-level labeling, mostly because the labels are weak: an article from a supposedly unreliable news source can be factual, and vice versa.

While the literature reports models that are highly accurate at labeling news articles from NELA and similar datasets as reliable or unreliable, we found that much of the accuracy is due to having articles from the same sites in both training and test data. This means that the model can ignore the task of identifying unreliable content and just learn that particular sites are reliable or unreliable. 

To demonstrate this point, we conducted a “random labels” experiment, where we randomly shuffled all the site-level labels such that they no longer represented the reliability of the site but were just an arbitrary feature of the site itself. We found that the models trained using random labels performed within 2% of the accuracy of models trained on the true labels. (These models are learning to identify sites, but that’s not practically useful, because the site name is included in any given article’s web address.)
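The label-shuffling step of this experiment can be sketched as follows. This is an illustrative sketch rather than the paper's code: the site names, labels, and function names are hypothetical, and the article list is a toy stand-in for a real site-labeled dataset.

```python
import random

# Hypothetical site-level labels (1 = "unreliable", 0 = "reliable").
SITE_LABELS = {
    "site-a.example": 0,
    "site-b.example": 0,
    "site-c.example": 1,
    "site-d.example": 1,
}

# Toy (text, site) pairs standing in for a real dataset.
ARTICLES = [
    ("budget vote delayed", "site-a.example"),
    ("storm closes airport", "site-a.example"),
    ("council approves park", "site-b.example"),
    ("miracle cure revealed", "site-c.example"),
    ("aliens run city hall", "site-d.example"),
]


def shuffle_site_labels(site_labels, seed=0):
    """Permute the labels across sites, keeping the overall label counts."""
    sites = sorted(site_labels)                # fixed order for reproducibility
    labels = [site_labels[s] for s in sites]
    random.Random(seed).shuffle(labels)
    return dict(zip(sites, labels))


def label_articles(articles, site_labels):
    """Each article inherits the label of the site it came from."""
    return [(text, site, site_labels[site]) for text, site in articles]


shuffled = shuffle_site_labels(SITE_LABELS, seed=0)
relabeled = label_articles(ARTICLES, shuffled)
for text, site, label in relabeled:
    print(site, label, text)
```

After shuffling, a label says nothing about reliability, yet every article from a given site still carries the same label. A model that does about as well on these labels as on the true ones is therefore recognizing sites, not unreliable content.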

We also showed that while using a clean train/test site split is necessary, it is not sufficient to measure a given model’s generalization power. We further tested different site splits and found that the performance varies depending on how similar the sites in the test and training sets are: higher accuracy on a test set is correlated with higher similarity between the sites in the training set and test set.
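A clean site split simply means that no site contributes articles to both the training and test sets. The sketch below shows one way to build such a split, under stated assumptions: the data and function names are hypothetical, and the vocabulary-overlap score is only a crude stand-in for train/test similarity, not necessarily the measure used in the paper.

```python
import random

# Toy (text, site) pairs; the site names are hypothetical.
ARTICLES = [
    ("senate passes budget bill", "site-a.example"),
    ("mayor opens new bridge", "site-a.example"),
    ("team wins cup final", "site-b.example"),
    ("striker signs new contract", "site-b.example"),
    ("secret cure hidden by doctors", "site-c.example"),
    ("senate hides budget secret", "site-c.example"),
    ("star spotted at cup final", "site-d.example"),
    ("actor signs film contract", "site-d.example"),
    ("storm closes bridge", "site-e.example"),
    ("new storm warning issued", "site-e.example"),
]


def split_by_site(articles, test_fraction=0.2, seed=0):
    """Hold out whole sites so train and test never share a site."""
    sites = sorted({site for _, site in articles})
    random.Random(seed).shuffle(sites)
    n_test = max(1, round(len(sites) * test_fraction))
    test_sites = set(sites[:n_test])
    train = [a for a in articles if a[1] not in test_sites]
    test = [a for a in articles if a[1] in test_sites]
    return train, test


def vocab_jaccard(train, test):
    """Crude similarity proxy: Jaccard overlap of train and test vocabularies."""
    train_vocab = {w for text, _ in train for w in text.lower().split()}
    test_vocab = {w for text, _ in test for w in text.lower().split()}
    union = train_vocab | test_vocab
    return len(train_vocab & test_vocab) / len(union) if union else 0.0


train, test = split_by_site(ARTICLES, test_fraction=0.2, seed=0)
print("held-out sites:", sorted({site for _, site in test}))
print("vocabulary overlap:", vocab_jaccard(train, test))
```

Reporting a similarity score of this kind alongside accuracy, and repeating the split with several seeds, makes it easier to see how much of a model's test performance depends on the test sites resembling the training sites.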

cloud_correct.png
A word cloud of keywords and keyword phrases from the titles of news articles in one dataset that are correlated with correct prediction of article accuracy. The size of a word indicates the strength of the correlation.

We then took properly split datasets — with low similarity between train and test sets — trained models on them, and examined what kinds of articles were most prone to be erroneously identified as reliable or unreliable. We discovered that the models are most prone to errors when the topics are politics and world news and most accurate on sports and entertainment. Reliability of news is important on any topic, but the finding that model performance is degraded on politics and world news topics underscores the importance of improving data for unreliable-news detection. 
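One simple way to run this kind of error analysis is to group held-out predictions by topic and compare accuracy across the groups. The sketch below assumes each test article already carries a topic tag, which is an assumption made for illustration; the records are invented and are not results from the paper.

```python
from collections import defaultdict

# Hypothetical (topic, gold_label, predicted_label) records for held-out articles.
RECORDS = [
    ("politics", 1, 0),
    ("politics", 0, 0),
    ("sports", 0, 0),
    ("sports", 1, 1),
]


def accuracy_by_topic(records):
    """Return {topic: fraction of records whose prediction matches the gold label}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for topic, gold, pred in records:
        total[topic] += 1
        correct[topic] += int(gold == pred)
    return {topic: correct[topic] / total[topic] for topic in total}


for topic, acc in sorted(accuracy_by_topic(RECORDS).items()):
    print(f"{topic}: {acc:.2f}")
```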

Recommendations

Our paper showed that, to ensure that improvements in model performance reflect real unreliable-news detection capabilities, the community needs to make several changes in data collection, dataset construction, and experimental design. To facilitate these changes, we provide a table of recommended best practices (see below). We hope that this paper will stimulate quality improvements in unreliable-news modeling, analysis, and data. All of our code is licensed under Apache 2 and is available on GitHub.

| Data collection | Dataset construction | Experiment design |
| --- | --- | --- |
| Collect from less biased or unbiased resources (e.g., original news outlets) | Examine the most salient words to check for biases in the datasets | Apply debiasing techniques when developing models on biased datasets |
| Collect from diverse resources (in terms of sources, topics, time, etc.) | Run simple bag-of-words baselines to check how severe the bias is | Check the performance on sources/dates not in your training set |
| Collect precise article-level labels, if possible | Provide train/dev/test splits with non-overlapping source/time | Check the performance on sources with limited examples |
| -- | -- | Test your model on multiple complementary datasets (e.g., with different domains, styles, etc.) |
