Exploring the Boundaries of LLMs in LP Trading
We let five leading large language models independently design and execute LP strategies on PancakeSwap on the BNB Smart Chain (BSC) mainnet, each with the same initial capital, tool access, public market data, and a unified prompt. We then observe which model achieves the highest profit within a fixed evaluation window.
Introduction
Large language models are reshaping many industries. Their strong capabilities in text processing, logical reasoning, and pattern recognition are already being applied to financial customer service, report generation, and quantitative investment research. They can quickly process massive amounts of information, capture market sentiment, and even assist in building complex trading strategies.

However, most evaluations of LLMs still rely on static benchmarks. These tests are good at measuring "knowledge reserves" and "single-shot reasoning," but they struggle to assess stability and adaptability in long-horizon, high-risk, dynamic decision-making in the real world. Models can answer "what is," but whether they can continuously "do the right thing" and bear the consequences remains unknown.

Therefore, we chose liquidity providing (LP) as the "ultimate exam" for real decision-making. LP is not simple asset exchange; it requires delicate trade-offs among price volatility, transaction costs, and potential impermanent loss. This makes it an ideal environment for testing long-horizon autonomous decision-making. The core question of the Farmore project is direct and challenging: in a setting with almost no human intervention, can today's LLMs act as qualified systematic LP agents and achieve stable profitability?
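To make the impermanent-loss trade-off concrete, the sketch below computes the classic divergence-loss formula for a 50/50 constant-product (full-range) pool relative to simply holding the two assets. Note this is the textbook full-range case; concentrated-liquidity positions like PancakeSwap v3 ranges amplify both fee income and this loss inside the chosen range.

```python
import math

def impermanent_loss(price_ratio: float) -> float:
    """Divergence loss of a 50/50 constant-product LP position vs. holding.

    price_ratio: final price of the volatile asset divided by its price at
    deposit. Returns a fraction (negative = worse than holding).
    """
    r = price_ratio
    return 2 * math.sqrt(r) / (1 + r) - 1

# A 2x price move costs a full-range LP about 5.7% relative to holding,
# and a 4x move costs 20% -- fee income must outrun this drift.
print(impermanent_loss(2.0))  # ~ -0.0572
print(impermanent_loss(4.0))  # -0.2
```

The key intuition the models must internalize: the loss is symmetric in log-price (a halving hurts as much as a doubling) and grows nonlinearly with the size of the move.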
Why Choose Real Funds and On-Chain Environments
Simulation cannot reproduce the core challenges of real trading. The success of an LP strategy often depends on the subtle interplay among transaction costs, slippage, network latency, and expected returns. Only in a real on-chain environment must models pay real gas fees for every operation, absorb price volatility, and face the risk of liquidity ranges becoming invalid. More importantly, the on-chain environment provides unparalleled transparency and verifiability. All transaction records, position adjustments, and final PnL are publicly accessible, ensuring the credibility and seriousness of comparisons between models.
How We Designed Phase One
Phase One aims to create a fair, transparent, and highly controllable experimental environment. We placed five leading models under identical market conditions, rules, and prompt templates, with a single goal: maximize LP returns (including fee income and changes in position value). We deliberately excluded narrative inputs such as news or social media sentiment, providing only structured numerical data. This includes price and volume data, volatility indicators, on-chain liquidity distribution, and each model’s own account status. The purpose is to force models to rely solely on time-series data understanding to infer market structure, identify risk patterns, and make decisions.
The action space is strictly constrained:
- Set or adjust LP ranges
- Add or remove liquidity
- Hold (observe) or fully exit
- Define risk-control parameters (e.g., stop-loss rules or liquidity withdrawal conditions)

We focus on mid-to-low frequency decisions, giving models enough time to "think" while still confronting real trading friction and market volatility.
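The constrained action space above can be sketched as a small schema. The type and field names here are illustrative assumptions, not Farmore's actual interface, but they show how each decision a model emits can be validated before touching the chain.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class ActionType(Enum):
    SET_RANGE = auto()         # set or adjust the LP price range
    ADD_LIQUIDITY = auto()     # deploy capital into the current range
    REMOVE_LIQUIDITY = auto()  # withdraw part of the position
    HOLD = auto()              # observe without acting
    EXIT = auto()              # fully withdraw and sit out

@dataclass
class LPAction:
    kind: ActionType
    lower_price: Optional[float] = None      # SET_RANGE only
    upper_price: Optional[float] = None      # SET_RANGE only
    amount_fraction: Optional[float] = None  # ADD/REMOVE: fraction of capital
    stop_loss_price: Optional[float] = None  # optional risk-control parameter

def validate(action: LPAction) -> bool:
    """Reject malformed decisions before they incur gas."""
    if action.kind is ActionType.SET_RANGE:
        return (action.lower_price is not None
                and action.upper_price is not None
                and 0 < action.lower_price < action.upper_price)
    if action.kind in (ActionType.ADD_LIQUIDITY, ActionType.REMOVE_LIQUIDITY):
        return action.amount_fraction is not None and 0 < action.amount_fraction <= 1
    return True  # HOLD and EXIT need no parameters

# Example: a model proposing a narrow range with a stop-loss attached
action = LPAction(kind=ActionType.SET_RANGE, lower_price=590.0,
                  upper_price=610.0, stop_loss_price=560.0)
print(validate(action))  # True
```

Keeping the schema this small is deliberate: a closed action space makes every model's decision log directly comparable.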
Preliminary Findings
After multiple rounds of testing, we observed consistent and distinct “trading personalities” across models:
- Diverging risk preferences: Some models prefer wider LP ranges and smaller positions, showing clear risk aversion; others are more aggressive, deploying large liquidity in narrow ranges to pursue higher fees.
- Large differences in holding periods: Some models act like patient long-term holders, maintaining positions despite volatility; others behave more like high-frequency traders, adjusting positions frequently in response to small market changes.
- Sensitivity to transaction costs: Some “smarter” models appear to recognize that frequent actions erode profit and therefore reduce adjustment frequency to optimize net returns.
- Extreme sensitivity to prompts: Even minor changes to the prompt framework can significantly shift risk behavior and trading frequency. This reveals the fragility and plasticity of current LLM decision-making.
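The cost-sensitivity finding reduces to simple arithmetic that some models appear to grasp and others do not. The sketch below uses illustrative dollar figures (not measured values from the experiment) to show how rebalancing frequency erodes net returns.

```python
def net_return(fee_income: float, gas_cost_per_rebalance: float,
               n_rebalances: int, divergence_loss: float = 0.0) -> float:
    """Net LP PnL over a window: fees earned, minus gas spent on
    position adjustments, plus any (typically negative) divergence loss."""
    return fee_income - gas_cost_per_rebalance * n_rebalances + divergence_loss

# Illustrative numbers: $30 of fees over the window, $0.50 gas per adjustment.
# A patient model rebalancing 5 times nets more than a twitchy one at 20:
print(net_return(30.0, 0.50, 5))   # 27.5
print(net_return(30.0, 0.50, 20))  # 20.0
```

Whether extra rebalances pay for themselves depends on the marginal fee income each adjustment unlocks, which is exactly the trade-off the "smarter" models seem to weigh.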
This Is Not a Final Ranking
We must emphasize that Phase One is not meant to declare “who is best.” The current sample window is limited, prompt bias is unavoidable, and statistical rigor still needs improvement. This is therefore an open and transparent exploration to reveal behavioral differences among models in complex decision tasks and to probe the boundaries of current AI capability. We continue to track, record, and review every decision made by all models.
Future Directions
Our exploration is just beginning. In future seasons we will continue to iterate and expand:
- Increase the operational scope of the models
- Introduce richer feature sets (e.g., multi-scale liquidity indicators, structural market risk signals)
- Design more robust risk constraints
- Conduct longer-term validation across different market cycles
- Provide more transparent strategy interpretations and attempt to have models explain the logic behind their decisions

Our ultimate goal is to turn the question "Can LLMs become reliable LP agents?" from a vague conceptual debate into a series of verifiable, reproducible, and iterative scientific facts.