Are LLMs Prescient? A Continuous Evaluation using Daily News as the Oracle
Amelia (Hui) Dai, Ryan Teehan, and Mengye Ren
New York University
The 42nd International Conference on Machine Learning (ICML 2025)
TL;DR: Our new benchmark, Daily Oracle, automatically generates question-answer (QA) pairs from daily news, challenging LLMs to predict "future" events based on pre-training data.
Abstract
Existing evaluation benchmarks for Large Language Models (LLMs) quickly become outdated due to model updates and an evolving information landscape. Moreover, because they consist of static questions without a temporal dimension, they often cannot assess how model performance evolves over time. To address these limitations, we propose using future event prediction as a continuous evaluation method to assess LLMs' temporal generalization and forecasting abilities. Our benchmark, Daily Oracle, automatically generates question-answer (QA) pairs from daily news, challenging LLMs to predict "future" events based on pre-training data. Our findings reveal that as pre-training data becomes outdated, LLM performance degrades over time. While Retrieval-Augmented Generation (RAG) can enhance prediction accuracy, the degradation persists, highlighting the need for ongoing model updates.
Model Performance Over Time (Closed-Book)
Daily Oracle Dataset
Dataset Overview
- Daily Oracle is a continuous evaluation benchmark using automatically generated QA pairs from daily news to assess how the future prediction capabilities of LLMs evolve over time.
- Although Daily Oracle is updated daily, our current analysis uses the subset covering January 2020 to December 2024 (~17.2 questions per day).


Example QA pairs

QA Construction Pipeline
For each day, we collect news articles from the daily-updated Common Crawl News Dataset and scrape news using the Newspaper3k package. We use LLMs to generate QA pairs with the few-shot prompting technique.
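As a rough illustration of the few-shot prompting step, the sketch below assembles a prompt from a handful of worked examples plus a new article. The instruction wording, the `FewShotExample` fields, and the `build_qa_prompt` helper are all hypothetical; the actual Daily Oracle prompts are not reproduced here, and no LLM call is made.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FewShotExample:
    """One worked example shown to the LLM (hypothetical structure)."""
    article: str
    question: str
    answer: str


def build_qa_prompt(article_text: str, publish_date: str,
                    examples: List[FewShotExample]) -> str:
    """Assemble a few-shot prompt asking for a QA pair about one article.

    The instruction text is an assumption made for illustration only.
    """
    parts = [
        "Write a forecasting question whose answer is stated in the news "
        "text, phrased as if asked before the publication date."
    ]
    # Each example demonstrates the expected input/output format.
    for ex in examples:
        parts.append(
            f"News: {ex.article}\nQuestion: {ex.question}\nAnswer: {ex.answer}"
        )
    # The new article ends with an open "Question:" slot for the model.
    parts.append(f"News (published {publish_date}): {article_text}\nQuestion:")
    return "\n\n".join(parts)
```

The returned string would then be sent to whichever LLM generates the QA pairs; that call is left out because it depends on the provider.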

Evaluation
Closed-Book Setting
- In the closed-book setting, we assess how accurately LLMs can answer forecasting questions using only the knowledge learned from their training data, without being given any extra information.
- Performance degrades over time across all models. This indicates that while LLMs demonstrate some ability to understand real-world events and make predictions, they struggle to maintain that ability as their pre-training data becomes outdated.
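One simple way to make a trend over time visible is to bucket questions by month and compute accuracy per bucket. The sketch below assumes each result is a `(question_date, is_correct)` pair; that record layout and the `monthly_accuracy` name are illustrative assumptions, not the benchmark's actual evaluation code.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Tuple


def monthly_accuracy(
    records: Iterable[Tuple[date, bool]]
) -> Dict[Tuple[int, int], float]:
    """Return accuracy per (year, month), ordered chronologically.

    `records` holds (question_date, is_correct) pairs (assumed layout).
    """
    totals = defaultdict(lambda: [0, 0])  # (year, month) -> [correct, total]
    for question_date, is_correct in records:
        key = (question_date.year, question_date.month)
        totals[key][0] += int(is_correct)
        totals[key][1] += 1
    return {key: correct / total
            for key, (correct, total) in sorted(totals.items())}


# Toy records, made up purely to show the call shape.
toy = [
    (date(2020, 1, 3), True),
    (date(2020, 1, 9), False),
    (date(2020, 2, 1), True),
]
print(monthly_accuracy(toy))
```

Plotting the resulting values against the month keys gives a per-model curve of accuracy over time.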

Constrained Open-Book Setting
- In the constrained open-book setting, we explore how access to news articles up to different time cutoffs influences LLM performance using RAG.
- RAG cutoff: the latest accessible date for retrieving articles.
- RAG has the potential to enhance prediction accuracy, but the performance degradation pattern persists, highlighting the need for continuous model updates.
Figures: constrained open-book results for Mixtral-8x7B, Mistral-7B, Llama-3-8B, Qwen-2-7B, Gemma-2-2B, Claude-3.5-Sonnet, and GPT-4.
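The RAG cutoff described above can be sketched as a date filter applied before retrieval: only articles published on or before the cutoff are eligible. In the sketch below, the `Article` type, the inclusive comparison, and the term-overlap ranking are assumptions chosen to keep the example self-contained; the ranking is a stand-in, not the retriever used in the benchmark.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Article:
    """A news article with its publication date (hypothetical structure)."""
    published: date
    text: str


def retrieve_before_cutoff(question: str, articles: List[Article],
                           rag_cutoff: date, k: int = 5) -> List[Article]:
    """Return up to k articles published on or before rag_cutoff.

    Ranking uses naive word overlap with the question, a placeholder
    for a real retriever.
    """
    query_terms = set(question.lower().split())
    # Enforce the cutoff first so later articles can never leak in.
    eligible = [a for a in articles if a.published <= rag_cutoff]

    def overlap(article: Article) -> int:
        return len(query_terms & set(article.text.lower().split()))

    return sorted(eligible, key=overlap, reverse=True)[:k]
```

Filtering before ranking means an article that matches the question well but was published after the cutoff is never returned.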
Gold Article Setting
- In the gold article setting, models are provided direct access to the gold article from which the question is generated.
- LLM accuracy improves significantly, to around 90%, demonstrating the answerability of Daily Oracle questions.
- However, even when the questions are treated as reading comprehension rather than forecasting, most models still show declining trends.
- This setting provides an "upper bound" for open-book retrieval. The remaining decline in model performance suggests that continuous pre-training of LLMs is still needed for news event forecasting, in order to address outdated representations.

BibTeX
@inproceedings{dai2025dailyoracle,
title = {Are LLMs Prescient? A Continuous Evaluation using Daily News as the Oracle},
author = {Dai, Hui and Teehan, Ryan and Ren, Mengye},
booktitle = {International Conference on Machine Learning (ICML)},
year = {2025}
}