Author: José Carlos Gonzáles Tanaka
TL;DR: This post builds a forex backtest where a DeepSeek LLM labels market regimes from compact numeric summaries. We compare it to a KMeans baseline, apply monthly walk-forward optimization, and report out-of-sample results from 2023 onward.
What is LLM regime labeling?
LLM regime labeling is the use of a large language model to classify current market conditions into discrete states such as trend, range, or high volatility from compact numeric summaries of recent price behaviour. In this project, the LLM does not generate trading rules. It only assigns a regime label that the strategy then uses in a fully deterministic way.
Prerequisites
To follow the regime-labeling approach in this blog, it helps to have a basic familiarity with clustering methods and market regimes. For foundational reading, see Markov processes and Hidden Markov Chains, which will help you understand how to identify market regimes using Hidden Markov models for trading.
What you will get:
- A complete Python script that runs a 2023+ out-of-sample FX backtest with monthly walk-forward optimization.
- A baseline regime classifier (KMeans) and an LLM-based classifier (DeepSeek) that use the same feature set.
- A clear comparison of equity curves and key metrics (CAGR, Sharpe, Sortino, Calmar, MaxDD, win rate).
- A practical checklist of tweaks to improve LLM-based regime labeling performance and robustness.
What this project builds
This project demonstrates a practical way to combine an LLM with a quantitative trading workflow without letting the LLM touch raw price history. Here, we use DeepSeek, a third-party large language model accessed through an API, for a narrow task: labeling each period as a market regime (trend up/down, range, high/low volatility) from a compact numeric summary. The trading rules remain fully deterministic.
To reduce overfitting, we evaluate performance out-of-sample (OOS) starting in 2023 and use monthly walk-forward optimization (WFO). Every month, we tune a small set of parameters using only prior data, then trade the next month with the chosen parameters.
What the script does (in plain English)
- Step 1: Download daily EURUSD data and compute simple features (returns, volatility, trend score, ATR proxy, z-score).
- Step 2: Create regime labels in two ways:
  - a non-LLM baseline using KMeans clustering, and
  - an LLM version using DeepSeek.
- Step 3: For each month in 2023+, optimize a few parameters on a trailing training window (default: 3 years).
- Step 4: Trade the next month, stitch months together, and produce an equity curve.
Full Python script explained in parts
First, we’ll explain the script in the same order it’s written. After each explanation, we’ll show the exact code block so you can match the narrative to the implementation. This section is meant to be readable even if you’re a beginner in Python.
Let's import the corresponding libraries:
Configure Settings and API Access
The script begins with a short settings block. This is where you choose the FX symbol, the historical data range, transaction costs, and the walk-forward optimization grid. In this script, the data starts in 2006, while the out-of-sample evaluation begins in 2023. The block also includes the DeepSeek configuration (API key, base URL, and model).
In addition, the optimization grid is intentionally small so the experiment stays readable and the walk-forward loop does not become a “hyper-parameter monster.”
Then, once these settings are fixed, the rest of the script can run end-to-end without changing any trading logic.
Please find below the code for this part:
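As a hedged illustration of what such a settings block can look like (a sketch, not the script's own listing), the snippet below collects the knobs described above. The exact dates, the grid values, the `LABEL_STEP_DAYS` value, and the helper `expand_grid` are assumptions made for this example. The DeepSeek base URL and model name are the commonly documented defaults and should be checked against your own account.

```python
import os
from itertools import product

# All values below are illustrative assumptions, not the script's actual settings.
SYMBOL = "EURUSD"            # FX pair to backtest
DATA_START = "2006-01-01"    # start of the historical sample (exact day assumed)
OOS_START = "2023-01-01"     # first out-of-sample date
TRAIN_YEARS = 3              # trailing training window for each monthly refit
COST_BPS = 1.0               # flat transaction cost per unit of turnover, in bps
LABEL_STEP_DAYS = 5          # how often a new regime label is requested (assumed)

# DeepSeek access: read the key from the environment instead of hard-coding it.
DEEPSEEK_API_KEY = os.environ.get("DEEPSEEK_API_KEY", "")
DEEPSEEK_BASE_URL = "https://api.deepseek.com"
DEEPSEEK_MODEL = "deepseek-chat"

# A deliberately small walk-forward grid (placeholder values).
PARAM_GRID = {
    "z_thr": [1.0, 1.5],
    "range_size": [0.5, 1.0],
    "lowvol_size": [0.25, 0.5],
    "highvol_size": [0.0],
}

def expand_grid(grid):
    """Turn a dict of candidate lists into a list of parameter dicts."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]

PARAM_SETS = expand_grid(PARAM_GRID)
print(len(PARAM_SETS))  # 2 * 2 * 2 * 1 = 8 combinations
```

Keeping the grid this small means each monthly refit evaluates only a handful of candidates, which limits both runtime and the room for overfitting.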
Download Data and Engineer Features
Then the script downloads daily price data and converts raw OHLC data into a compact, easy-to-interpret feature set.
After that, it computes the signals used everywhere else: daily log returns, rolling annualized volatility, a 20-day trend score (mean/std of returns), an ATR-style range proxy, and a 20-day z-score (distance from the moving average).
Finally, both the baseline and the LLM see these same features, which makes the comparison fair: the labeling method is the main difference.
Here is the code for this part:
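To make the feature definitions concrete, here is a minimal sketch in plain Python (not the script's own listing). It assumes a 20-day window, 252 trading days per year, and an ATR proxy defined as the average of (high − low) / close. The function names and the synthetic prices are hypothetical. A pandas version would compute the same quantities with rolling windows.

```python
import math
from statistics import mean, stdev

def log_returns(close):
    """Daily log returns from a list of closing prices."""
    return [math.log(close[i] / close[i - 1]) for i in range(1, len(close))]

def window_features(high, low, close, window=20, periods_per_year=252):
    """Features computed from the most recent `window` observations."""
    rets = log_returns(close)[-window:]
    mu, sd = mean(rets), stdev(rets)
    px = close[-window:]
    px_sd = stdev(px)
    bars = zip(high[-window:], low[-window:], px)
    return {
        "mean_ret": mu,
        "ann_vol": sd * math.sqrt(periods_per_year),      # annualized volatility
        "trend_score": mu / sd if sd > 0 else 0.0,        # mean/std of returns
        "atr_proxy": mean((h - l) / c for h, l, c in bars),  # assumed range proxy
        "zscore": (close[-1] - mean(px)) / px_sd if px_sd > 0 else 0.0,
    }

# Synthetic, gently rising prices purely for demonstration.
close = [1.10 + 0.001 * i + 0.002 * math.sin(i) for i in range(40)]
high = [c + 0.003 for c in close]
low = [c - 0.003 for c in close]
features = window_features(high, low, close)
print({k: round(v, 4) for k, v in features.items()})
```

Because every feature uses only data up to the current day, the same dictionary can be passed to either labeling method without introducing lookahead.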
Baseline Regime Labels with KMeans
Next, we create a non-LLM baseline using KMeans clustering on rolling window summaries. To avoid leakage, the KMeans model is fit only on the pre-2023 period.
In addition, the clusters are converted into named regimes using simple heuristics: the most positive trend cluster becomes TREND_UP, the most negative becomes TREND_DOWN, the highest volatility cluster becomes HIGH_VOL, the lowest becomes LOW_VOL, and the remaining one becomes RANGE.
Then, labels are forward-filled until the next labeling date so the strategy has a regime label each day.
Find below the code for this part:
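The naming heuristic and the forward-fill step can be sketched as follows. This is an illustration rather than the script's listing: it assumes five clusters whose centroids are already available (for example from a KMeans fit on pre-2023 data), that names are assigned in the order listed above, and that days before the first label default to UNCERTAIN. The function names and centroid values are hypothetical.

```python
def name_clusters(centroids):
    """Map cluster ids to regime names from centroid trend and volatility.

    centroids: {cluster_id: {"trend": float, "vol": float}}
    """
    remaining = dict(centroids)
    names = {}

    def take(name, field, pick):
        # Choose the extreme cluster on `field` among those not yet named.
        cid = pick(remaining, key=lambda c: remaining[c][field])
        names[cid] = name
        del remaining[cid]

    take("TREND_UP", "trend", max)
    take("TREND_DOWN", "trend", min)
    take("HIGH_VOL", "vol", max)
    take("LOW_VOL", "vol", min)
    for cid in remaining:
        names[cid] = "RANGE"
    return names

def forward_fill(dates, sparse_labels, default="UNCERTAIN"):
    """Carry the latest label forward so every date has a regime."""
    filled, current = {}, default
    for d in dates:
        current = sparse_labels.get(d, current)
        filled[d] = current
    return filled

# Hypothetical centroids for five clusters.
centroids = {
    0: {"trend": 0.30, "vol": 0.08},
    1: {"trend": -0.25, "vol": 0.09},
    2: {"trend": 0.02, "vol": 0.15},
    3: {"trend": -0.01, "vol": 0.04},
    4: {"trend": 0.00, "vol": 0.07},
}
cluster_names = name_clusters(centroids)
daily = forward_fill(["d1", "d2", "d3", "d4"], {"d2": "RANGE", "d4": "TREND_UP"})
print(cluster_names, daily)
```

Assigning names sequentially matters: if one cluster were both the strongest uptrend and the most volatile, it would be named TREND_UP here, and HIGH_VOL would go to the most volatile of the remaining clusters.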
LLM Regime Labels with DeepSeek and Caching
Then, we build the LLM regime labeler using DeepSeek. Instead of sending a long price history, the model receives only a compact numeric summary of the last N days (mean return, volatility, trend score, ATR proxy, z-score, and a drawdown proxy).
For a broader treatment of LLM applications in systematic trading, see QuantInsti’s related learning resources on trading with LLMs such as Agentic AI for Trading.
In addition, the prompt requests a single label from a fixed set and expects strict JSON output, which makes the labeling step easier to parse and audit.
After that, the script caches each labeled date in a JSON file so reruns do not spend tokens on the same periods again.
Check the code:
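A minimal sketch of the prompt, strict-JSON parsing, and caching steps is shown below (not the script's own listing). The prompt wording, the `{"regime": ...}` schema, the cache layout, and the `call_llm` callable are assumptions for this example. In a real run, `call_llm` would wrap the DeepSeek API request. Here a stub stands in so the snippet runs offline.

```python
import json
import os
import tempfile

REGIMES = ("TREND_UP", "TREND_DOWN", "RANGE", "HIGH_VOL", "LOW_VOL", "UNCERTAIN")

def build_prompt(summary):
    """Compact prompt: fixed label set, strict JSON reply, numeric summary only."""
    return (
        "You are labeling a market regime from a numeric summary.\n"
        f"Allowed labels: {', '.join(REGIMES)}.\n"
        'Reply with strict JSON only, for example {"regime": "RANGE"}.\n'
        f"Summary: {json.dumps(summary, sort_keys=True)}"
    )

def parse_label(raw):
    """Return a valid regime, or UNCERTAIN if the reply is not strict JSON."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        return "UNCERTAIN"
    regime = data.get("regime") if isinstance(data, dict) else None
    return regime if regime in REGIMES else "UNCERTAIN"

def label_with_cache(date_key, summary, call_llm, cache_path):
    """Label one date, reusing the cached answer when it exists."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as fh:
            cache = json.load(fh)
    if date_key in cache:
        return cache[date_key]
    label = parse_label(call_llm(build_prompt(summary)))
    cache[date_key] = label
    with open(cache_path, "w") as fh:
        json.dump(cache, fh, indent=2)
    return label

# Usage with a stub in place of the real API call.
calls = []
def fake_llm(prompt):
    calls.append(prompt)
    return '{"regime": "TREND_UP"}'

summary = {"mean_ret": 0.0004, "ann_vol": 0.07, "trend_score": 0.35,
           "atr_proxy": 0.006, "zscore": 0.8, "drawdown": -0.01}
cache_file = os.path.join(tempfile.mkdtemp(), "regime_cache.json")
first = label_with_cache("2023-01-02", summary, fake_llm, cache_file)
second = label_with_cache("2023-01-02", summary, fake_llm, cache_file)
print(first, second, len(calls))  # the second call is served from the cache
```

Falling back to UNCERTAIN on any malformed reply keeps the strategy flat when the label cannot be trusted, instead of guessing.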
Strategy Logic and Position Sizing
Next, the strategy converts regime labels into daily positions using simple regime-conditioned rules.
For trend regimes it takes directional exposure (long in TREND_UP, short in TREND_DOWN). For RANGE it mean-reverts using the z-score: it goes short when price is far above the mean and long when price is far below the mean. For HIGH_VOL and UNCERTAIN it stays flat by default, while LOW_VOL uses a smaller trend-following position.
Then, to reduce lookahead bias, positions are shifted by one day so trades are assumed to execute on the next bar.
See below the code section:
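The rules above can be sketched as a small mapping from regime to target position, followed by the one-day shift (an illustration, not the script's listing). Several details are assumptions: full-size positions of ±1 in trend regimes, the sign of the trend score as the direction for LOW_VOL and HIGH_VOL, and costs charged on position changes. The function names and default sizes are hypothetical.

```python
def target_position(regime, zscore, trend_score, z_thr=1.5,
                    range_size=0.5, lowvol_size=0.5, highvol_size=0.0):
    """Map one day's regime label to a target position in [-1, 1]."""
    direction = (trend_score > 0) - (trend_score < 0)  # sign of the trend score
    if regime == "TREND_UP":
        return 1.0
    if regime == "TREND_DOWN":
        return -1.0
    if regime == "RANGE":
        if zscore > z_thr:
            return -range_size   # price far above its mean: go short
        if zscore < -z_thr:
            return range_size    # price far below its mean: go long
        return 0.0
    if regime == "LOW_VOL":
        return direction * lowvol_size
    if regime == "HIGH_VOL":
        return direction * highvol_size
    return 0.0  # UNCERTAIN or any unknown label stays flat

def strategy_returns(positions, asset_returns, cost_bps=1.0):
    """Apply yesterday's target to today's return, net of turnover costs."""
    held = [0.0] + positions[:-1]  # shift by one bar to avoid lookahead
    net, prev = [], 0.0
    for pos, r in zip(held, asset_returns):
        cost = abs(pos - prev) * cost_bps / 1e4
        net.append(pos * r - cost)
        prev = pos
    return net

regimes = ["TREND_UP", "RANGE", "RANGE", "HIGH_VOL"]
zscores = [0.2, 2.0, -2.1, 0.5]
trends = [0.3, 0.0, 0.0, 0.1]
positions = [target_position(g, z, t) for g, z, t in zip(regimes, zscores, trends)]
net = strategy_returns(positions, [0.01, 0.01, -0.02, 0.01])
print(positions, net)
```

The first day's return is always zero in this sketch because no position is held until a label from the previous day exists.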
Performance Metrics
Then, we define the evaluation metrics used later in the Results section: CAGR, annual volatility, Sharpe, Sortino, Calmar, max drawdown, and win rate.
These metrics help compare not only returns, but also the risk taken to earn them.
See below the code script:
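For reference, one common set of definitions for these metrics is sketched below (not necessarily the exact formulas in the script). It assumes daily returns, 252 periods per year, a zero risk-free rate, a Sortino denominator based on the root mean square of negative returns, and a win rate counted over all days. The function name is hypothetical.

```python
import math

def performance_metrics(daily_returns, periods_per_year=252):
    """Summary statistics from a list of daily strategy returns."""
    n = len(daily_returns)
    equity, peak, max_dd = 1.0, 1.0, 0.0
    for r in daily_returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        max_dd = min(max_dd, equity / peak - 1.0)  # worst peak-to-trough loss

    mean_r = sum(daily_returns) / n
    var = sum((r - mean_r) ** 2 for r in daily_returns) / (n - 1)
    ann_vol = math.sqrt(var * periods_per_year)
    downside = math.sqrt(sum(min(r, 0.0) ** 2 for r in daily_returns) / n)
    ann_downside = downside * math.sqrt(periods_per_year)
    ann_mean = mean_r * periods_per_year
    cagr = equity ** (periods_per_year / n) - 1.0
    nan = float("nan")
    return {
        "final_equity": equity,
        "cagr": cagr,
        "ann_vol": ann_vol,
        "sharpe": ann_mean / ann_vol if ann_vol > 0 else nan,
        "sortino": ann_mean / ann_downside if ann_downside > 0 else nan,
        "calmar": cagr / abs(max_dd) if max_dd < 0 else nan,
        "max_drawdown": max_dd,
        "win_rate": sum(r > 0 for r in daily_returns) / n,
    }

demo = performance_metrics([0.01, -0.02, 0.03, -0.01])
print({k: round(v, 4) for k, v in demo.items()})
```

Note that a win rate counted over all days will look low for a strategy that is often flat, since flat days have zero return and do not count as wins.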
Monthly Walk-Forward Optimization
After that, the script runs monthly walk-forward optimization. Each month in 2023+, it trains on the trailing TRAIN_YEARS of data, tries a small parameter grid, and selects the set with the best training Sharpe.
If you want a deeper explanation of the methodology itself, see QuantInsti’s guide to walk-forward optimization before applying the code in this project.
Then it trades the next month with those chosen parameters and stitches the monthly results into one out-of-sample equity curve.
At this stage, the optimizer is tuning only these knobs: z_thr (z-score entry threshold in RANGE), range_size (RANGE position size), lowvol_size (LOW_VOL sizing), and highvol_size (HIGH_VOL sizing, often 0.0).
Check the code:
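The loop itself can be sketched in a few lines (an illustration of the procedure described above, not the script's listing). The names `walk_forward`, `sharpe`, and `toy_backtest` are hypothetical, and the toy backtest with a single `direction` parameter stands in for the real regime-conditioned strategy.

```python
import math
from datetime import date, timedelta

def sharpe(returns, periods_per_year=252):
    """Annualized Sharpe ratio with a zero risk-free rate."""
    n = len(returns)
    if n < 2:
        return float("-inf")
    mu = sum(returns) / n
    var = sum((r - mu) ** 2 for r in returns) / (n - 1)
    if var <= 0:
        return float("-inf")
    return mu / math.sqrt(var) * math.sqrt(periods_per_year)

def walk_forward(dates, backtest_fn, param_sets, oos_start, train_years=3):
    """Monthly walk-forward: pick params on trailing data, trade the next month.

    backtest_fn(params, indices) must return the strategy's daily returns
    for the given index positions in `dates`.
    """
    oos_returns, chosen = [], []
    months = sorted({(d.year, d.month) for d in dates if d >= oos_start})
    for year, month in months:
        test_idx = [i for i, d in enumerate(dates)
                    if (d.year, d.month) == (year, month) and d >= oos_start]
        first_test_day = dates[test_idx[0]]
        train_start = date(year - train_years, month, 1)
        # Training data ends strictly before the first test day.
        train_idx = [i for i, d in enumerate(dates)
                     if train_start <= d < first_test_day]
        if not train_idx:
            continue
        best = max(param_sets, key=lambda p: sharpe(backtest_fn(p, train_idx)))
        oos_returns.extend(backtest_fn(best, test_idx))
        chosen.append(((year, month), best))
    return oos_returns, chosen

# Toy usage: weekdays from 2022-01-03 to 2023-03-31 with a small upward drift.
dates, d = [], date(2022, 1, 3)
while d <= date(2023, 3, 31):
    if d.weekday() < 5:
        dates.append(d)
    d += timedelta(days=1)
asset_returns = [0.0005 + 0.002 * (-1) ** i for i in range(len(dates))]

def toy_backtest(params, indices):
    # Hypothetical stand-in for the real regime-conditioned backtest.
    return [params["direction"] * asset_returns[i] for i in indices]

param_sets = [{"direction": -1}, {"direction": 1}]
oos_returns, chosen = walk_forward(dates, toy_backtest, param_sets,
                                   oos_start=date(2023, 1, 1), train_years=1)
print([m for m, _ in chosen], chosen[0][1])
```

The key property is that `train_idx` never overlaps `test_idx`, so each month's parameters are selected without seeing the returns they are later evaluated on.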
Putting the Full Workflow Together
Finally, main() wires everything together: data → features → regimes → monthly walk-forward → equity curves. It prints metrics, plots both curves, and saves CSV files for equity and monthly parameter choices.
See the code below:
Results (OOS from 2023): LLM vs non-LLM
Check the plot:
Quick performance summary (Quantstats-style metrics for OOS 2023+):
| Strategy | Final Equity | Peak Equity | CAGR | Annual Volatility | Sharpe | Sortino | Calmar | Max Drawdown | Win Rate |
|---|---|---|---|---|---|---|---|---|---|
| Non-LLM (KMeans) + monthly WFO | 1.091 | 1.160 | 2.63% | 5.05% | 0.541 | 0.884 | 0.268 | -9.82% | 38.72% |
| LLM (DeepSeek) + monthly WFO | 1.203 | 1.277 | 5.69% | 6.40% | 0.898 | 1.578 | 0.663 | -8.59% | 41.92% |
How to read this table: CAGR is the annualized growth rate, Annual Volatility is the annualized standard deviation of daily returns, Sharpe and Sortino measure risk-adjusted returns (Sortino focuses on downside risk), Calmar relates return to drawdown (CAGR divided by the absolute Max Drawdown), Max Drawdown is the worst peak-to-trough loss, and Win Rate is the share of positive days.
Why might the LLM have done better in this window? One possible explanation is that the DeepSeek labels combine multiple signals at once, including trend strength, volatility, drawdown proxy, and z-score, instead of relying on distance-based clustering alone. That may help the strategy adapt more cleanly around regime transitions. Even so, this result comes from a single pair over a limited OOS period, so it should be treated as an encouraging result rather than a confirmed edge.
The equity curves cover 2023-01-02 to 2026-03-31 (OOS). To make the comparison fair, both curves are rebased to start at 1.0 on the first OOS date.
At a high level, the LLM-labeled strategy produced a higher terminal equity and a higher peak equity over this OOS window. That suggests the LLM regime labels (combined with monthly WFO) were more useful than the KMeans regimes for deciding when to apply trend-following vs range mean-reversion behaviors.
Interpreting the differences
Why might the LLM version outperform? In this setup, the LLM acts as a flexible classifier that makes “soft” judgments from multiple signals at once (trend score, volatility, drawdown proxy, z-score). KMeans can be sensitive to scaling, cluster shapes, and may not separate regimes cleanly when the market transitions between states.
However, treat this as a hypothesis generator, not a final conclusion. This OOS window is relatively short and includes a specific FX regime mix. The monthly WFO also introduces the risk of mild overfitting if the parameter grid is too large or the training objective is not aligned with your real-world constraints (e.g., drawdown limits). In addition, the flat 1 bps transaction-cost assumption may understate real execution costs in some market conditions, and daily EURUSD data from public sources may not perfectly reflect tradable close prices.
10 practical tweaks to improve the LLM-based strategy
Below are concrete, implementation-level tweaks that may improve performance or robustness when you use an LLM for regime labeling. These focus on reducing label noise, improving consistency, and aligning optimization with real trading constraints.
- Add a “confidence” output and abstain rule: Ask the LLM for {regime, confidence}. If confidence is low, label as UNCERTAIN and stay flat. This reduces noisy trades.
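- Use majority vote (self-consistency) on hard months: For month-start labeling windows, call the LLM 3 times and take the majority label. Use a small non-zero temperature for these calls, because at temperature=0 the repeated calls would mostly return the same label and the vote adds little. Cache the majority result.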
- Increase information in the summary (still numeric): Add features like rolling skew, kurtosis, breakout frequency, range compression, or autocorrelation. Keep it compact and auditable.
- Label more frequently during volatile periods: Make LABEL_STEP_DAYS dynamic: label every day when vol is high, and every 5–10 days when vol is low. This can improve regime transitions.
- Add a regime “smoother”: Prevent whipsaws by requiring a regime to persist for N days (or use an HMM-like smoothing rule) before switching trading behavior.
- Stricter prompt + schema validation: Force strict JSON and reject responses that include extra text. If invalid, re-try once or default to UNCERTAIN.
- Ensemble LLM with a quantitative prior: Combine LLM label with the KMeans (or a rules-based label). For example, only accept TREND_UP if both agree, otherwise UNCERTAIN.
- Optimize for a drawdown-aware objective: Instead of maximizing Sharpe, maximize Calmar or use a penalized score such as score = Sharpe − λ·|MaxDD|. This can improve stability because parameter sets with deep training drawdowns are ranked lower.
- Add regime-specific risk sizing: Use volatility targeting (scale position by 1/vol) within each regime so you don’t take the same risk in calm vs volatile markets.
- Expand regime taxonomy (carefully): Split RANGE into ‘tight range’ vs ‘wide range’, or split TREND into ‘strong trend’ vs ‘weak trend’. More regimes can help if you validate properly.
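Three of these tweaks are small enough to sketch directly: the confidence-based abstain rule, the regime smoother, and the drawdown-penalized objective. The thresholds (`min_conf=0.6`, `min_persist=3`, `lam=2.0`) and the function names are assumptions chosen for illustration and would need tuning and validation.

```python
def apply_abstain(regime, confidence, min_conf=0.6):
    """Keep the label only when the reported confidence is high enough."""
    return regime if confidence >= min_conf else "UNCERTAIN"

def smooth_regimes(labels, min_persist=3):
    """Switch regime only after a new label persists for `min_persist` days.

    Uses past and current labels only, so it adds no lookahead.
    """
    if not labels:
        return []
    smoothed = []
    current = labels[0]
    candidate, streak = None, 0
    for label in labels:
        if label == current:
            candidate, streak = None, 0      # challenger streak is broken
        elif label == candidate:
            streak += 1
        else:
            candidate, streak = label, 1     # a new challenger appears
        if candidate is not None and streak >= min_persist:
            current, candidate, streak = candidate, None, 0
        smoothed.append(current)
    return smoothed

def penalized_score(sharpe, max_drawdown, lam=2.0):
    """Training objective: score = Sharpe - lam * |MaxDD|."""
    return sharpe - lam * abs(max_drawdown)

raw = ["RANGE", "RANGE", "TREND_UP", "RANGE",
       "TREND_UP", "TREND_UP", "TREND_UP", "TREND_UP"]
smoothed = smooth_regimes(raw, min_persist=3)
print(smoothed)
print(apply_abstain("TREND_UP", 0.4), penalized_score(1.0, -0.10))
```

In this example the isolated TREND_UP on the third day is ignored, and the switch happens only once TREND_UP has appeared three days in a row. The cost of smoothing is a delayed reaction to genuine regime changes, so `min_persist` is itself a parameter worth stress-testing.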
Suggested next experiments
To strengthen the strategy, consider adding:
- multiple FX pairs (EURUSD, GBPUSD, USDJPY),
- sensitivity to transaction costs,
- a longer OOS window once you’re confident in label hygiene, and
- a “label audit” where you sample dates and inspect summaries vs labels.
Note:
- The strategy idea originated from the author.
- The blog content was created with the assistance of an AI large language model.
- The blog content was curated/edited by the author.
Frequently Asked Questions
- Can an LLM reliably label market regimes? This backtest suggests that an LLM can be useful as a regime classifier, but the evidence here is limited to one FX pair and a relatively short OOS window. It should be validated across more assets and periods.
- What is walk-forward optimization? Walk-forward optimization tunes parameters on a rolling historical window and tests them on the next unseen period. This gives a more realistic estimate of out-of-sample performance than a single train/test split.
- How is this different from asking an LLM to generate a strategy? In this project, the LLM only classifies regimes from a numeric summary. The trading rules themselves remain fixed, explicit, and fully coded by the researcher.
- What are the main live-trading risks? The main risks include unstable labeling across prompt versions, dependence on API availability, and backtest assumptions that may not hold in real execution.
Further Reading
To explore the basics of Quant Trading, check our Learning Track: Quantitative Trading for Beginners.
For LLM usage for trading, explore the Trading Using LLM: Concepts and Strategies track, which provides practical hands-on insights into implementing LLM models for trading.
If you're a serious learner, you can take the Executive Programme in Algorithmic Trading (EPAT), which covers statistical modelling, machine learning, and advanced trading strategies with Python.
This project is for educational and illustrative purposes only. Trading in financial markets involves substantial risk of loss. The code and concepts discussed here are not financial advice. Always exercise caution and thoroughly understand any automated trading system before deploying it in a live environment.
Disclaimer: All investments and trading in the stock market involve risk. Any decision to place trades in the financial markets, including trading in stock or options or other financial instruments, is a personal decision that should only be made after thorough research, including a personal risk and financial assessment and the engagement of professional assistance to the extent you believe necessary. The trading strategies or related information mentioned in this article is for informational purposes only.
