Large language models (LLMs) like ChatGPT are increasingly used for economic forecasting and financial analysis. Alejandro Lopez-Lira, Yuehua Tang, and Mingyin Zhu, authors of the April 2025 study “The Memorization Problem: Can We Trust LLMs' Economic Forecasts?”, raised critical questions about the reliability of LLM forecasts.
What Did the Researchers Examine?
Lopez-Lira, Tang, and Zhu tested whether LLMs memorize historical economic data during training, which could distort their ability to generate genuine forecasts. Most of their tests used data from January 1990 to September 2023. They focused on:
Key economic indicators (e.g., GDP, unemployment rates).
Stock returns and earnings conference calls.
Whether LLMs could "cheat" by recalling exact numerical values from their training data, even when instructed not to do so (a rough sketch of such a probe follows this list).
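To give a feel for how a memorization probe of this kind works, here is a minimal sketch. It is illustrative only: query_model stands in for whatever chat-completion call a given LLM provider exposes, and the ground-truth figures would come from a public source such as FRED; none of these names or numbers appear in the study itself.

```python
# Illustrative memorization probe (all helper names are hypothetical).
# Idea: ask the model for an exact historical value while instructing it not to
# rely on memorized data, then compare its answer to the true published figure.

def build_prompt(indicator: str, date: str) -> str:
    return (
        "Do not use any memorized or looked-up data. "
        f"Based only on general reasoning, estimate the U.S. {indicator} "
        f"for {date}. Reply with a single number."
    )

def memorization_score(query_model, ground_truth: dict, tolerance: float = 0.05) -> float:
    """Fraction of historical data points the model reproduces almost exactly."""
    hits = 0
    for (indicator, date), true_value in ground_truth.items():
        answer = query_model(build_prompt(indicator, date))  # model's reply as text
        try:
            predicted = float(answer.strip().rstrip("%"))
        except ValueError:
            continue  # an unparseable reply counts as a miss
        if abs(predicted - true_value) <= tolerance * abs(true_value):
            hits += 1
    return hits / len(ground_truth)

# Example usage with made-up ground truth (real values would come from, e.g., FRED):
# score = memorization_score(my_llm_call, {("unemployment rate", "January 2010"): 9.8})
# A score near 1.0 on pre-cutoff dates suggests recall rather than genuine forecasting.
```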
Key Findings
Perfect Recall of Historical Data: LLMs demonstrated near-perfect memory of economic data from before their knowledge cutoff dates (e.g., pre-2023 for many models). For example, they could reproduce exact stock prices or unemployment figures from specific dates with high accuracy.
Forecasting vs. Memorization: When asked to predict economic outcomes within their training period, LLMs often relied on memorized data rather than genuine forecasting, making it impossible to distinguish whether their outputs reflected economic insight or recall.
Failure of Safeguards: Explicit instructions to avoid using post-cutoff data did not prevent memorization. Even when researchers masked contextual clues (e.g., hiding company names), LLMs frequently reconstructed missing information accurately.
Post-Cutoff Limitations: For events after their training cutoff, LLMs showed no memorization but also no reliable forecasting advantage over traditional methods; their accuracy dropped significantly, aligning with earlier studies showing LLMs underperform human analysts in earnings predictions.
Why This Matters for Investors
Backtesting Risks: If an LLM-based tool claims success in historical market simulations, its performance might stem from memorized data rather than predictive skill (a simple check is sketched after this list).
Data Contamination: Strategies trained on LLM outputs could inherit hidden biases or outdated patterns, leading to overconfidence in flawed models.
Transparency Gaps: Most commercial LLMs don’t disclose their training data, making it hard to audit for memorization.
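For the backtesting risk in particular, one basic precaution is to check whether any part of a backtest window falls before the model's stated knowledge cutoff. The sketch below is a hypothetical illustration of that check; the cutoff date and the example window are placeholders, not figures from the study.

```python
# Illustrative check: flag backtest periods that overlap a model's training data.
# The cutoff date below is a placeholder; consult the model vendor's documentation.
from datetime import date

MODEL_KNOWLEDGE_CUTOFF = date(2023, 9, 30)  # hypothetical cutoff for illustration

def assess_backtest_window(backtest_start: date, backtest_end: date) -> str:
    """Describe how much of a backtest window the model may have memorized."""
    if backtest_end <= MODEL_KNOWLEDGE_CUTOFF:
        return "Entire window precedes the cutoff: results may reflect recall, not skill."
    if backtest_start >= MODEL_KNOWLEDGE_CUTOFF:
        return "Window is fully post-cutoff: memorization is not the concern here."
    return "Window straddles the cutoff: treat pre-cutoff performance separately."

# Example: a 2015-2024 backtest straddles a late-2023 cutoff.
print(assess_backtest_window(date(2015, 1, 1), date(2024, 12, 31)))
```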
Their findings led Lopez-Lira, Tang, and Zhu to conclude that "large language models cannot be trusted for economic forecasts during periods covered by their training data."
Key Takeaways
Verify Historical Claims: Treat LLM-generated backtests with skepticism unless the model’s training data is explicitly excluded from the test period.
Prefer Post-Cutoff Analysis: For forward-looking forecasts, prioritize models updated with the latest data—but recognize their limitations.
Combine Human and AI Insights: Studies show that aggregating LLM predictions with human judgment can improve accuracy, but simple averaging often outperforms complex hybrid approaches (a minimal example follows this list).
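As a concrete illustration of that last point, the snippet below shows what an equal-weight average of a human forecast and an LLM forecast looks like. The numbers are invented for illustration and do not come from any study.

```python
# Illustrative equal-weight combination of a human forecast and an LLM forecast.
# All figures are made up; the point is only how simple the averaging step is.

def combine_forecasts(human_forecast: float, llm_forecast: float) -> float:
    """Equal-weight average of two point forecasts."""
    return 0.5 * human_forecast + 0.5 * llm_forecast

# Example: a human analyst expects 2.1% GDP growth, an LLM suggests 2.7%.
print(combine_forecasts(2.1, 2.7))  # -> 2.4
```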
In summary, while LLMs offer exciting tools for investors, their apparent "intelligence" may sometimes be an illusion of memorization. Vigilance and independent verification remain essential when integrating AI into financial decision-making.
Larry Swedroe is the author or co-author of 18 books on investing, including his latest, Enrich Your Future. He is also a consultant to RIAs as an educator on investment strategies.