Agentic Learning AI Lab

Are LLMs Prescient? A Continuous Evaluation using Daily News as the Oracle

Amelia (Hui) Dai, Ryan Teehan, and Mengye Ren

New York University

The 42nd International Conference on Machine Learning (ICML 2025)

TL;DR: Our new benchmark, Daily Oracle, automatically generates question-answer (QA) pairs from daily news, challenging LLMs to predict "future" events based on pre-training data.

Abstract

Existing evaluation benchmarks for Large Language Models (LLMs) quickly become outdated due to model updates and an evolving information landscape. Moreover, they often cannot assess how model performance evolves over time, as they consist of static questions without a temporal dimension. To address these limitations, we propose using future event prediction as a continuous evaluation method to assess LLMs' temporal generalization and forecasting abilities. Our benchmark, Daily Oracle, automatically generates question-answer (QA) pairs from daily news, challenging LLMs to predict "future" events based on pre-training data. Our findings reveal that as pre-training data becomes outdated, LLM performance degrades over time. While Retrieval Augmented Generation (RAG) can enhance prediction accuracy, the degradation persists, highlighting the need for ongoing model updates.

[Figure: Model Performance Over Time (Closed-Book). Panels: TF Questions, MC Questions]

Daily Oracle Dataset

Dataset Overview

  • Daily Oracle is a continuous evaluation benchmark using automatically generated QA pairs from daily news to assess how the future prediction capabilities of LLMs evolve over time.
  • Although Daily Oracle is updated daily, our current analysis uses the subset covering January 2020 to December 2024 (~17.2 questions per day).
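
To give a rough sense of scale, the per-day average above can be turned into an approximate subset size. This is a back-of-the-envelope sketch only: 17.2 is the rounded figure quoted above, so the result is an estimate, not an official count.

```python
from datetime import date

# Analysis window stated in the overview: January 2020 through December 2024.
start, end = date(2020, 1, 1), date(2024, 12, 31)
num_days = (end - start).days + 1        # inclusive of both endpoints

# 17.2 is the rounded per-day average from the text, so this is approximate.
approx_total = round(17.2 * num_days)
print(num_days, approx_total)
```

The window spans 1,827 days, which puts the subset on the order of 31 thousand questions.
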

[Figures: Question size; Category breakdown]

[Figure: Example QA pairs]

QA Construction Pipeline

For each day, we collect news articles from the daily-updated Common Crawl News Dataset and scrape the articles using the Newspaper3k package. We then use LLMs to generate QA pairs from the articles via few-shot prompting.
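
As an illustration of the few-shot step, the sketch below assembles a prompt from a handful of demonstrations plus a new article. It assumes the article text has already been scraped. The class `FewShotExample`, the function `build_few_shot_prompt`, the instruction wording, and the demonstration are all hypothetical placeholders, not the prompts used to build Daily Oracle.

```python
from dataclasses import dataclass


@dataclass
class FewShotExample:
    """One hand-written demonstration: an article excerpt and a QA pair derived from it."""
    article: str
    question: str
    answer: str


def build_few_shot_prompt(examples, article_text, publish_date):
    """Assemble a few-shot prompt asking an LLM for a True/False QA pair.

    The instruction wording and layout are hypothetical placeholders.
    """
    parts = [
        "Write a True/False forecasting question that the article below answers. "
        "Phrase it as if asked before the article's publication date."
    ]
    for ex in examples:
        parts.append(f"Article: {ex.article}\nQuestion: {ex.question}\nAnswer: {ex.answer}")
    # The new article comes last, with the question slot left open for the model.
    parts.append(f"Article (published {publish_date}): {article_text}\nQuestion:")
    return "\n\n".join(parts)


# Invented demonstration, used only to show the prompt layout.
demo = [FewShotExample(
    article="The city council approved the new transit budget on Tuesday.",
    question="Will the city council approve the new transit budget by the end of the week?",
    answer="Yes",
)]
prompt = build_few_shot_prompt(demo, "Placeholder article text.", "2024-01-15")
print(prompt)
```

The resulting string would be sent to whichever LLM generates the QA pairs. Ending the prompt on an open "Question:" slot is one common way to have the model continue in the demonstrated format.
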

Evaluation

Closed-Book Setting

  • In the closed-book setting, we assess how accurately LLMs can answer forecasting questions using only the knowledge learned from their training data, without being given any additional information.
  • Performance degrades over time across all models. This indicates that while LLMs show some ability to understand real-world events and make predictions, they struggle to maintain it as their pre-training data becomes outdated.
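
One simple way to expose such a trend is to bucket questions by month and compute accuracy per bucket. The sketch below assumes each evaluated question is a tuple of (ISO date, model answer, gold answer). The record layout and the function name `monthly_accuracy` are assumptions for illustration, and the toy records are invented.

```python
from collections import defaultdict


def monthly_accuracy(records):
    """Return {"YYYY-MM": accuracy} from (iso_date, predicted, gold) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for iso_date, predicted, gold in records:
        month = iso_date[:7]  # "2021-03-14" -> "2021-03"
        total[month] += 1
        correct[month] += int(predicted == gold)
    return {m: correct[m] / total[m] for m in sorted(total)}


# Toy records, invented purely to exercise the function.
toy = [
    ("2021-03-02", "Yes", "Yes"),
    ("2021-03-20", "No", "Yes"),
    ("2021-04-05", "No", "No"),
]
acc = monthly_accuracy(toy)
print(acc)
```

Plotting these monthly values against time, for questions dated before and after a model's training cutoff, is what makes a degradation pattern visible.
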

Constrained Open-Book Setting

  • In the constrained open-book setting, we explore how access to news articles up to different time cutoffs influences LLM performance using RAG.
  • RAG cutoff: the latest accessible date for retrieving articles.
  • RAG has the potential to enhance prediction accuracy, but the performance degradation pattern persists, highlighting the need for continuous model updates.
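
The RAG cutoff can be pictured as a date filter applied before retrieval. In the sketch below, only articles published on or before the cutoff are eligible, and they are ranked by plain word overlap with the question. The overlap scoring is a deliberately simple stand-in for a real retriever, and the function name `retrieve_before_cutoff` and the toy corpus are hypothetical.

```python
from datetime import date


def retrieve_before_cutoff(question, articles, rag_cutoff, k=3):
    """Return the top-k articles published on or before rag_cutoff.

    articles: list of (publish_date, text) tuples.
    Ranking uses plain word overlap as a stand-in for a real retriever.
    """
    q_terms = set(question.lower().split())
    # Enforce the cutoff first so later articles can never be retrieved.
    eligible = [(d, t) for d, t in articles if d <= rag_cutoff]
    scored = sorted(
        eligible,
        key=lambda item: len(q_terms & set(item[1].lower().split())),
        reverse=True,
    )
    return scored[:k]


# Invented corpus for illustration only.
corpus = [
    (date(2023, 1, 10), "central bank holds interest rates steady"),
    (date(2023, 6, 2), "central bank signals interest rates cut"),
    (date(2023, 2, 1), "local team wins championship"),
]
hits = retrieve_before_cutoff(
    "Will the central bank cut interest rates?", corpus, date(2023, 3, 1), k=1
)
print(hits)
```

Here the June article is the closest match to the question, but it falls after the cutoff and is excluded, so the January article is returned instead.
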

Gold Article Setting

  • In the gold article setting, models are given direct access to the gold article from which each question was generated.
  • LLM accuracy improves significantly, to around 90%, demonstrating the answerability of Daily Oracle questions.
  • However, even when the questions are treated as reading comprehension rather than forecasting, most models still show declining trends.
  • This setting provides an "upper bound" for open-book retrieval. The remaining decline in model performance suggests that continuous pre-training of LLMs is still needed for news event forecasting, to address outdated representations.

BibTeX

@inproceedings{dai2025dailyoracle,
  title     = {Are LLMs Prescient? A Continuous Evaluation using Daily News as the Oracle},
  author    = {Dai, Hui and Teehan, Ryan and Ren, Mengye},
  booktitle = {International Conference on Machine Learning (ICML)},
  year      = {2025}
}