TL;DR: This post builds a forex backtest where a DeepSeek LLM labels market regimes from compact numeric summaries. We compare it to a KMeans baseline, apply monthly walk-forward optimization, and report out-of-sample results from 2023 onward.

Prerequisites

To follow the regime-labeling approach in this post, it helps to have basic familiarity with clustering methods and market regimes. For a conceptual introduction, read about Markov processes and Hidden Markov Chains, which provide the background for identifying market regimes with Hidden Markov Models and trading on them.

What you will get:

A complete Python script that runs a 2023+ out-of-sample FX backtest with monthly walk-forward optimization.
A baseline regime classifier (KMeans) and an LLM-based classifier (DeepSeek) that use the same feature set.
A clear comparison of equity curves and key metrics (CAGR, Sharpe, Sortino, Calmar, MaxDD, win rate).
A practical checklist of tweaks to improve LLM-based regime labeling performance and robustness.

What this project builds

This project demonstrates a practical way to combine an LLM with a quantitative trading workflow without letting the LLM touch raw price history. The LLM’s job is narrow: label each period as a market regime (trend up/down, range, high/low volatility) using a compact numeric summary. The trading rules remain fully deterministic.

To reduce overfitting, we evaluate performance out-of-sample (OOS) starting in 2023 and use monthly walk-forward optimization (WFO). Every month, we tune a small set of parameters using only prior data, then trade the next month with the chosen parameters.

What the script does (in plain English)

Step 1: Download daily EURUSD data and compute simple features (returns, volatility, trend score, ATR proxy, z-score).
Step 2: Create regime labels in two ways:
a non-LLM baseline using KMeans clustering, and
an LLM version using DeepSeek.
Step 3: For each month in 2023+, optimize a few parameters on a trailing training window (default: 3 years).
Step 4: Trade the next month, stitch months together, and produce an equity curve.

Full Python script explained in parts

First, we’ll explain the script in the same order it’s written. After each explanation, we’ll show the exact code block so you can match the narrative to the implementation. This section is meant to be readable even if you’re a beginner in Python.

Let's import the corresponding libraries:

SETTINGS (edit these)

Next, the script begins with a short settings block. This is where you choose the FX symbol, the date range (from 2023 up to today), transaction costs, and the walk-forward optimization grid. It also includes the DeepSeek configuration (API key, base URL, and model).

In addition, the optimization grid is intentionally small so the experiment stays readable and the walk-forward loop does not become a “hyper-parameter monster.”

Then, once these settings are fixed, the rest of the script can run end-to-end without changing any trading logic.
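As a rough illustration of what such a settings block can look like, here is a minimal sketch. It is not the script's actual listing: every constant name and value is an assumption made for this example, except TRAIN_YEARS, LABEL_STEP_DAYS, and the four grid parameter names, which appear elsewhere in this post. The base URL and model name follow DeepSeek's OpenAI-compatible API as documented at the time of writing, so check the current docs before relying on them.

```python
import os
from datetime import date
from itertools import product

# Illustrative values only; these are assumptions, not the script's real settings.
SYMBOL = "EURUSD=X"                      # Yahoo Finance-style ticker (assumed data source)
OOS_START = "2023-01-01"                 # first out-of-sample date
OOS_END = date.today().isoformat()       # "up to today"
TRAIN_YEARS = 3                          # trailing training window for each WFO step
LABEL_STEP_DAYS = 5                      # how often a new regime label is requested
COST_BPS = 1.0                           # assumed transaction cost per unit of turnover

# DeepSeek configuration; the key is read from the environment, never hard-coded.
DEEPSEEK_API_KEY = os.environ.get("DEEPSEEK_API_KEY", "")
DEEPSEEK_BASE_URL = "https://api.deepseek.com"
DEEPSEEK_MODEL = "deepseek-chat"

# A deliberately small walk-forward grid.
PARAM_GRID = {
    "z_thr": [1.0, 1.5, 2.0],
    "range_size": [0.5, 1.0],
    "lowvol_size": [0.25, 0.5],
    "highvol_size": [0.0, 0.25],
}

def expand_grid(grid: dict) -> list:
    """Turn a dict of lists into a list of parameter dicts (Cartesian product)."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*grid.values())]

GRID = expand_grid(PARAM_GRID)
print(len(GRID), "parameter combinations per walk-forward step")
```

With these example values the grid holds 3 × 2 × 2 × 2 = 24 combinations, which keeps each monthly optimization step cheap.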

Please find below the code for this part:

DATA + FEATURES

Then the script downloads daily price data and converts raw OHLC data into a compact, easy-to-interpret feature set.

After that, it computes the signals used everywhere else: daily log returns, rolling annualized volatility, a 20-day trend score (mean/std of returns), an ATR-style range proxy, and a 20-day z-score (distance from the moving average).

Finally, both the baseline and the LLM see these same features, which makes the comparison fair: the labeling method is the main difference.
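To make the feature definitions concrete, the following is a minimal sketch of how they could be computed with pandas. The function name, column names, and the synthetic price series are assumptions for illustration (the real script downloads EURUSD data), and the exact formulas in the original script may differ.

```python
import numpy as np
import pandas as pd

def compute_features(ohlc: pd.DataFrame, window: int = 20) -> pd.DataFrame:
    """Build the compact feature set from a frame with 'High', 'Low', 'Close' columns."""
    close = ohlc["Close"]
    out = pd.DataFrame(index=ohlc.index)
    out["ret"] = np.log(close).diff()                      # daily log return
    roll = out["ret"].rolling(window)
    out["vol"] = roll.std() * np.sqrt(252)                 # rolling annualized volatility
    out["trend"] = roll.mean() / roll.std()                # trend score: mean/std of returns
    # ATR-style proxy: average daily high-low range relative to the close
    out["atr_proxy"] = ((ohlc["High"] - ohlc["Low"]) / close).rolling(window).mean()
    ma = close.rolling(window).mean()
    out["z"] = (close - ma) / close.rolling(window).std()  # distance from the moving average
    return out.dropna()

# Synthetic prices so the example runs offline (not real EURUSD data).
rng = np.random.default_rng(0)
idx = pd.bdate_range("2022-01-03", periods=60)
close = pd.Series(1.10 * np.exp(np.cumsum(rng.normal(0, 0.005, len(idx)))), index=idx)
ohlc = pd.DataFrame({"High": close * 1.002, "Low": close * 0.998, "Close": close})

features = compute_features(ohlc)
print(features.tail(3).round(4))
```

Every column uses only data up to and including the current day, so the features themselves introduce no lookahead.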

Here is the code for this part:

REGIME LABELS (Non‑LLM: KMeans)

Next, we create a non-LLM baseline using KMeans clustering on rolling window summaries. To avoid leakage, the KMeans model is fit only on the pre-2023 period.

In addition, the clusters are converted into named regimes using simple heuristics: the most positive trend cluster becomes TREND_UP, the most negative becomes TREND_DOWN, the highest volatility cluster becomes HIGH_VOL, the lowest becomes LOW_VOL, and the remaining one becomes RANGE.

Then, labels are forward-filled until the next labeling date so the strategy has a regime label each day.
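The naming heuristic and the forward-fill step can be sketched as follows. This is an illustration under assumptions, not the script's code: the centroids are hand-written stand-ins for what a fitted scikit-learn KMeans model exposes as cluster_centers_ (fit on pre-2023 rows only), and the fixed order in which names are claimed is one possible way to resolve collisions, for example when the most positive trend cluster is also the most volatile.

```python
import numpy as np
import pandas as pd

def name_clusters(centroids: np.ndarray) -> dict:
    """Map cluster index -> regime name. Columns of `centroids`: [trend, vol]."""
    remaining = set(range(len(centroids)))
    names = {}
    # Each rule claims one cluster from those still unnamed, in this fixed order.
    rules = [("TREND_UP", 0, max), ("TREND_DOWN", 0, min),
             ("HIGH_VOL", 1, max), ("LOW_VOL", 1, min)]
    for name, col, pick in rules:
        best = pick(remaining, key=lambda k: centroids[k, col])
        names[best] = name
        remaining.remove(best)
    for k in remaining:                 # whatever is left becomes RANGE
        names[k] = "RANGE"
    return names

# Hypothetical centroids for five clusters: [trend score, annualized vol].
centroids = np.array([
    [0.40, 0.07],    # strong positive trend
    [-0.35, 0.08],   # strong negative trend
    [0.02, 0.14],    # volatile, no clear trend
    [0.05, 0.04],    # calm
    [-0.01, 0.08],   # middling on both axes
])
names = name_clusters(centroids)

# Labels exist only on labeling dates; forward-fill them to every trading day.
days = pd.bdate_range("2023-01-02", periods=10)
cluster_ids = pd.Series([2, 0], index=days[::5])   # cluster id on each labeling date
daily_regime = cluster_ids.map(names).reindex(days).ffill()
print(daily_regime)
```

Forward-filling only carries a label into the future, so each day's regime depends solely on information available at the most recent labeling date.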

Find below the code for this part:

REGIME LABELS (LLM: DeepSeek) + cache

Then, we build the LLM regime labeler using DeepSeek. Instead of sending a long price history, the model receives only a compact numeric summary of the last N days (mean return, volatility, trend score, ATR proxy, z-score, and a drawdown proxy).

In addition, the prompt requests a single label from a fixed set and expects strict JSON output, which makes the labeling step easier to parse and audit.

After that, the script caches each labeled date in a JSON file so reruns do not spend tokens on the same periods again.
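The prompt, strict parsing, and cache steps might fit together as in the sketch below. All function names, the prompt wording, and the summary keys are assumptions for illustration. The LLM call is passed in as a plain callable so the example runs offline with a stub; in a real run that callable would wrap a chat-completion request to DeepSeek.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable

ALLOWED = {"TREND_UP", "TREND_DOWN", "RANGE", "HIGH_VOL", "LOW_VOL", "UNCERTAIN"}

def build_prompt(summary: dict) -> str:
    """Compact numeric summary in, one label from a fixed set out."""
    return (
        "Classify the market regime from this numeric summary.\n"
        f"Allowed labels: {sorted(ALLOWED)}\n"
        'Reply with JSON only, for example {"regime": "RANGE"}.\n'
        f"Summary: {json.dumps(summary, sort_keys=True)}"
    )

def parse_label(raw: str) -> str:
    """Accept only strict JSON with a known label; anything else is UNCERTAIN."""
    try:
        label = json.loads(raw)["regime"]
        return label if label in ALLOWED else "UNCERTAIN"
    except (json.JSONDecodeError, KeyError, TypeError):
        return "UNCERTAIN"

def label_date(day: str, summary: dict, ask_llm: Callable[[str], str], cache_path: Path) -> str:
    """Return the cached label for `day`, calling the LLM only on a cache miss."""
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    if day not in cache:
        cache[day] = parse_label(ask_llm(build_prompt(summary)))
        cache_path.write_text(json.dumps(cache, indent=2))
    return cache[day]

# Stub standing in for the real API call (an assumption for this example).
calls = []
def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return '{"regime": "TREND_UP"}'

cache_file = Path(tempfile.mkdtemp()) / "llm_regime_cache.json"
summary = {"mean_ret": 0.0004, "vol": 0.07, "trend": 0.35, "atr_proxy": 0.005,
           "z": 1.1, "drawdown": -0.01}
first = label_date("2023-01-02", summary, fake_llm, cache_file)
second = label_date("2023-01-02", summary, fake_llm, cache_file)   # served from cache
print(first, second, "LLM calls:", len(calls))
```

One design choice to be aware of: this sketch also caches UNCERTAIN results from failed parses. If you would rather retry those dates on the next run, skip the cache write when parsing fails.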

Check the code:

STRATEGY LOGIC (UNCHANGED, but parameterized for WFO)

Next, the strategy converts regime labels into daily positions using simple regime-conditioned rules.

For trend regimes it takes directional exposure (long in TREND_UP, short in TREND_DOWN). For RANGE it mean-reverts using the z-score: it goes short when price is far above the moving average and long when price is far below it. For HIGH_VOL and UNCERTAIN it stays flat by default, while LOW_VOL uses a smaller trend-following position.

Then, to reduce lookahead bias, positions are shifted by one day so trades are assumed to execute on the next bar.
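The rules described in this section can be sketched as a single function. This is a simplified illustration rather than the script's implementation: the function name and default values are assumptions, and so is the use of the sign of the trend score to set direction in LOW_VOL and HIGH_VOL.

```python
import numpy as np
import pandas as pd

def positions_from_regimes(regime, z, trend, z_thr=1.5, range_size=0.5,
                           lowvol_size=0.5, highvol_size=0.0):
    """Translate daily regime labels into positions, lagged by one bar."""
    sign = np.sign(trend)
    pos = pd.Series(0.0, index=regime.index)     # UNCERTAIN (and unknown labels) stay flat
    pos[regime == "TREND_UP"] = 1.0
    pos[regime == "TREND_DOWN"] = -1.0
    in_range = regime == "RANGE"
    pos[in_range & (z > z_thr)] = -range_size    # far above the mean: go short
    pos[in_range & (z < -z_thr)] = range_size    # far below the mean: go long
    low, high = regime == "LOW_VOL", regime == "HIGH_VOL"
    pos[low] = lowvol_size * sign[low]           # smaller trend-following position
    pos[high] = highvol_size * sign[high]        # flat when highvol_size is 0.0
    # Today's signal is traded on the next bar, which avoids lookahead bias.
    return pos.shift(1).fillna(0.0)

idx = pd.bdate_range("2023-01-02", periods=6)
regime = pd.Series(["TREND_UP", "TREND_DOWN", "RANGE", "RANGE", "LOW_VOL", "HIGH_VOL"], index=idx)
z = pd.Series([0.0, 0.0, 2.0, -0.5, 0.0, 0.0], index=idx)
trend = pd.Series([0.3, -0.3, 0.0, 0.0, 0.2, -0.4], index=idx)

pos = positions_from_regimes(regime, z, trend)
print(pos)
```

Multiplying these lagged positions by the same day's return gives the strategy's daily return before costs.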

See below the code section:

METRICS

Then, we define the evaluation metrics used later in the Results section: CAGR, annual volatility, Sharpe, Sortino, Calmar, max drawdown, and win rate.

These metrics help compare not only returns, but also the risk taken to earn them.
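For reference, one common way to compute these metrics from a series of daily returns is sketched below. The conventions are assumptions (252 trading days per year, a zero risk-free rate, simple returns, and downside deviation measured against zero), and the original script or QuantStats may differ in the details.

```python
import numpy as np
import pandas as pd

def performance_metrics(daily_ret: pd.Series, periods: int = 252) -> dict:
    """Summary statistics for a series of simple daily strategy returns."""
    equity = (1.0 + daily_ret).cumprod()
    years = len(daily_ret) / periods
    cagr = equity.iloc[-1] ** (1.0 / years) - 1.0
    ann_vol = daily_ret.std() * np.sqrt(periods)
    sharpe = daily_ret.mean() / daily_ret.std() * np.sqrt(periods)
    # Downside deviation: root mean square of negative returns only
    downside = np.sqrt((daily_ret.clip(upper=0.0) ** 2).mean())
    sortino = daily_ret.mean() / downside * np.sqrt(periods)
    # Running peak includes the starting equity of 1.0
    peak = equity.cummax().clip(lower=1.0)
    max_dd = (equity / peak - 1.0).min()
    calmar = cagr / abs(max_dd)
    win_rate = (daily_ret > 0).mean()            # share of positive days
    return {"cagr": cagr, "ann_vol": ann_vol, "sharpe": sharpe, "sortino": sortino,
            "calmar": calmar, "max_dd": max_dd, "win_rate": win_rate}

# Tiny hand-checkable example: one losing day out of four.
daily_ret = pd.Series([0.01, -0.02, 0.01, 0.03])
metrics = performance_metrics(daily_ret)
print({k: round(float(v), 4) for k, v in metrics.items()})
```

The function assumes at least one losing day and nonzero volatility. A production version would guard the divisions for flat or all-positive return series.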

See the code below:

WALK-FORWARD OPTIMIZATION (MONTHLY)

After that, the script runs monthly walk-forward optimization. Each month in 2023+, it trains on the trailing TRAIN_YEARS of data, tries a small parameter grid, and selects the set with the best training Sharpe.

Then it trades the next month with those chosen parameters and stitches the monthly results into one out-of-sample equity curve.

At this stage, the optimizer is tuning only these knobs: z_thr (z-score entry threshold in RANGE), range_size (RANGE position size), lowvol_size (LOW_VOL sizing), and highvol_size (HIGH_VOL sizing, often 0.0).
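The monthly loop can be sketched as follows. To keep the example short, it tunes a single parameter (z_thr) on a toy mean-reversion rule over synthetic returns. The function names and the toy strategy are assumptions, and the real script optimizes all four knobs over the regime-based strategy.

```python
import numpy as np
import pandas as pd

def sharpe(r: pd.Series) -> float:
    sd = r.std()
    return 0.0 if not sd > 0 else float(r.mean() / sd * np.sqrt(252))

def walk_forward(returns_by_param: dict, oos_start: str, train_years: int = 3):
    """returns_by_param maps a parameter value to its full-history daily returns."""
    first = next(iter(returns_by_param.values()))
    months = first.loc[oos_start:].index.to_period("M").unique()
    pieces, chosen = [], {}
    for month in months:
        test_start = month.start_time
        train_start = test_start - pd.DateOffset(years=train_years)

        def train_sharpe(p):
            r = returns_by_param[p]
            # Training slice ends strictly before the test month begins
            return sharpe(r[(r.index >= train_start) & (r.index < test_start)])

        best = max(returns_by_param, key=train_sharpe)
        chosen[str(month)] = best
        r = returns_by_param[best]
        pieces.append(r[(r.index >= test_start) & (r.index <= month.end_time)])
    return pd.concat(pieces), chosen

# Synthetic daily returns and a toy z-score rule (not real EURUSD data).
rng = np.random.default_rng(1)
idx = pd.bdate_range("2020-01-01", "2023-03-31")
ret = pd.Series(rng.normal(0, 0.005, len(idx)), index=idx)
z = (ret.rolling(20).sum() / (ret.rolling(20).std() * np.sqrt(20))).fillna(0.0)

def toy_returns(z_thr: float) -> pd.Series:
    pos = (-np.sign(z)).where(z.abs() > z_thr, 0.0)   # fade large moves
    return pos.shift(1).fillna(0.0) * ret              # trade on the next bar

returns_by_param = {z_thr: toy_returns(z_thr) for z_thr in [1.0, 1.5, 2.0]}
oos_ret, chosen = walk_forward(returns_by_param, "2023-01-01")
print(chosen)
print("OOS equity:", round(float((1 + oos_ret).prod()), 4))
```

Precomputing each parameter's return series is valid here because the toy rule has no fitted state: a given day's return depends only on the parameter and on data up to that day. Note also that Sharpe is unchanged when a whole return series is scaled by a constant, so sizing parameters such as range_size can only influence a Sharpe objective through their weight relative to the other regimes' positions.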

Check the code:

MAIN

Finally, main() wires everything together: data → features → regimes → monthly walk-forward → equity curves. It prints metrics, plots both curves, and saves CSV files for equity and monthly parameter choices.

See the code below:

Results (OOS from 2023): LLM vs non-LLM

Check the plot:

Quick performance summary (QuantStats-style metrics for OOS 2023+):

Strategy	Final Equity	Peak Equity	CAGR	Annual Volatility	Sharpe	Sortino	Calmar	Max Drawdown	Win Rate
Non-LLM (KMeans) + monthly WFO	1.091	1.160	2.63%	5.05%	0.541	0.884	0.268	-9.82%	38.72%
LLM (DeepSeek) + monthly WFO	1.203	1.277	5.69%	6.40%	0.898	1.578	0.663	-8.59%	41.92%

How to read this table: CAGR is the annualized growth rate, Annual Volatility is the annualized standard deviation of daily returns, Sharpe and Sortino measure risk-adjusted returns (Sortino penalizes only downside risk), Calmar is CAGR divided by the absolute max drawdown, Max Drawdown is the worst peak-to-trough loss, and Win Rate is the share of positive days.

Why did the LLM do better here? One plausible explanation is that the DeepSeek labels combine multiple signals (trend strength, volatility, drawdown proxy, and z-score) and react more flexibly to regime transitions than KMeans clustering. That may reduce misclassification around turning points, so the strategy spends more time using the right behavior (trend-following vs mean reversion). In this OOS window, the difference shows up as higher CAGR (5.69% vs 2.63%) and materially better Sharpe (0.898 vs 0.541) and Sortino (1.578 vs 0.884), with a slightly smaller max drawdown (-8.59% vs -9.82%) and a modestly higher win rate (41.92% vs 38.72%).

The equity curves cover 2023-01-02 to 2026-03-31 (OOS). To make the comparison fair, both curves are rebased to start at 1.0 on the first OOS date.

At a high level, the LLM-labeled strategy produced a higher terminal equity (1.203 vs 1.091) and a higher peak equity (1.277 vs 1.160) over this OOS window. That suggests the LLM regime labels (combined with monthly WFO) were more useful than the KMeans regimes for deciding when to apply trend-following vs range mean-reversion behaviors.

Interpreting the differences

Why might the LLM version outperform? In this setup, the LLM acts as a flexible classifier that makes “soft” judgments from multiple signals at once (trend score, volatility, drawdown proxy, z-score). KMeans can be sensitive to feature scaling and cluster shape, and it may not separate regimes cleanly when the market transitions between states.

However, treat this as a hypothesis generator, not a final conclusion. This OOS window is relatively short and includes a specific FX regime mix. The monthly WFO also introduces the risk of mild overfitting if the parameter grid is too large or the training objective is not aligned with your real-world constraints (e.g., drawdown limits).

10 practical tweaks to improve the LLM-based strategy

Below are concrete, implementation-level tweaks that may improve performance or robustness when you use an LLM for regime labeling. These focus on reducing label noise, improving consistency, and aligning optimization with real trading constraints.

Add a “confidence” output and abstain rule: Ask the LLM for {regime, confidence}. If confidence is low, label as UNCERTAIN and stay flat. This reduces noisy trades.
Use majority vote (self-consistency) on hard months: For month-start labeling windows, call the LLM 3 times with a small nonzero temperature and take the majority label. At temperature=0 the calls return near-identical answers, so a vote adds little. Cache the majority result.
Increase information in the summary (still numeric): Add features like rolling skew, kurtosis, breakout frequency, range compression, or autocorrelation. Keep it compact and auditable.
Label more frequently during volatile periods: Make LABEL_STEP_DAYS dynamic: label every day when vol is high, and every 5–10 days when vol is low. This can improve regime transitions.
Add a regime “smoother”: Prevent whipsaws by requiring a regime to persist for N days (or use an HMM-like smoothing rule) before switching trading behavior.
Stricter prompt + schema validation: Force strict JSON and reject responses that include extra text. If invalid, re-try once or default to UNCERTAIN.
Ensemble LLM with a quantitative prior: Combine LLM label with the KMeans (or a rules-based label). For example, only accept TREND_UP if both agree, otherwise UNCERTAIN.
Optimize for drawdown-aware objective: Instead of maximizing Sharpe, maximize Calmar or use a penalty: score = Sharpe − λ·|MaxDD|. This often improves stability.
Add regime-specific risk sizing: Use volatility targeting (scale position by 1/vol) within each regime so you don’t take the same risk in calm vs volatile markets.
Expand regime taxonomy (carefully): Split RANGE into ‘tight range’ vs ‘wide range’, or split TREND into ‘strong trend’ vs ‘weak trend’. More regimes can help if you validate properly.
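
As one example from this list, the regime “smoother” could be sketched as below. The function name and the exact persistence rule are assumptions: here a new label must appear on min_days consecutive days before the trading regime switches, and the rule looks only at past and current labels, so it stays causal.

```python
import pandas as pd

def smooth_regimes(labels: pd.Series, min_days: int = 3) -> pd.Series:
    """Switch regime only after a new label has persisted for min_days in a row."""
    current = labels.iloc[0]
    candidate, streak = current, 0
    out = []
    for label in labels:
        if label == current:
            candidate, streak = current, 0      # back to the active regime: reset
        elif label == candidate:
            streak += 1                         # same challenger again
        else:
            candidate, streak = label, 1        # new challenger starts its streak
        if streak >= min_days:
            current, streak = candidate, 0      # challenger persisted: switch
        out.append(current)
    return pd.Series(out, index=labels.index)

idx = pd.bdate_range("2023-01-02", periods=8)
raw = pd.Series(["RANGE", "RANGE", "TREND_UP", "RANGE",
                 "TREND_UP", "TREND_UP", "TREND_UP", "TREND_UP"], index=idx)
smoothed = smooth_regimes(raw, min_days=3)
print(pd.DataFrame({"raw": raw, "smoothed": smoothed}))
```

In this example the one-day TREND_UP blip on the third day is ignored, and the switch happens only on the third consecutive TREND_UP day. The trade-off is a delay of min_days − 1 days on genuine regime changes.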

Suggested next experiments

To extend this experiment, consider adding:

multiple FX pairs (EURUSD, GBPUSD, USDJPY),
sensitivity to transaction costs,
a longer OOS window once you’re confident in label hygiene, and
a “label audit” where you sample dates and inspect summaries vs labels.

Note:

The strategy idea originated from the author.
The blog content was created with the assistance of an AI large language model.
The blog content was curated/edited by the author.

AI Forex Backtesting with LLM Regime Labels: DeepSeek vs KMeans in Python