The Decision Transformer reframes reinforcement learning as a sequence modeling problem, using a GPT-style Transformer to generate actions by conditioning on desired return-to-go, past states, and past actions—without traditional RL components like value functions or policy gradients.
Architecture Overview
Decision Transformer treats an RL trajectory as a sequence of (return-to-go, state, action) triplets. At each timestep, three tokens are fed into the model: the desired return-to-go (cumulative future reward), the current state, and the action taken. Each modality has its own linear embedding layer that projects to the Transformer's hidden dimension.
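The interleaving and per-modality projection can be sketched as follows. This is an illustrative sketch, not the reference implementation: the weight matrices stand in for learned linear layers, and the dimensions (`hidden_dim=128`, state/action sizes) are placeholder choices.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, state_dim, act_dim, K = 128, 11, 3, 20  # K = context length in timesteps

# Stand-ins for learned per-modality linear embedding layers.
W_rtg = rng.standard_normal((1, hidden_dim))
W_state = rng.standard_normal((state_dim, hidden_dim))
W_action = rng.standard_normal((act_dim, hidden_dim))

def embed_trajectory(rtg, states, actions):
    """Project each modality and interleave as (R_1, s_1, a_1, R_2, s_2, a_2, ...)."""
    e_r = rtg @ W_rtg          # (K, hidden_dim)
    e_s = states @ W_state     # (K, hidden_dim)
    e_a = actions @ W_action   # (K, hidden_dim)
    tokens = np.empty((3 * len(rtg), hidden_dim))
    tokens[0::3] = e_r         # return-to-go token leads each triplet
    tokens[1::3] = e_s
    tokens[2::3] = e_a
    return tokens

rtg = rng.standard_normal((K, 1))
states = rng.standard_normal((K, state_dim))
actions = rng.standard_normal((K, act_dim))
print(embed_trajectory(rtg, states, actions).shape)  # (60, 128)
```

With a context of 20 timesteps, the Transformer therefore sees a 60-token sequence, matching the configuration given later in this card.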
The model uses a causal GPT-style Transformer that processes the interleaved sequence of return-to-go, state, and action tokens. The Transformer attends to the history of these triplets and predicts the next action. At inference time, the user specifies the desired return-to-go (e.g., the maximum possible episode return), and the model generates actions likely to achieve that cumulative reward.
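The evaluation loop implied here can be sketched as below: condition on a target return, act, then subtract each observed reward from the target so the model is always told how much return remains to be earned. `ToyEnv` and `ToyModel` are hypothetical stubs standing in for a real environment and the trained Transformer.

```python
def rollout(model, env, target_return, max_steps=1000):
    """Generate one episode conditioned on a desired return-to-go."""
    state = env.reset()
    rtg = target_return
    states, actions, rtgs, total = [], [], [], 0.0
    for _ in range(max_steps):
        rtgs.append(rtg)
        states.append(state)
        action = model.predict_action(rtgs, states, actions)  # conditions on history
        state, reward, done = env.step(action)
        actions.append(action)
        rtg -= reward          # remaining return the model should still achieve
        total += reward
        if done:
            break
    return total

class ToyEnv:
    """Toy stand-in environment: reward 1.0 per step, episode length 5."""
    def reset(self):
        self.t = 0
        return 0.0
    def step(self, action):
        self.t += 1
        return 0.0, 1.0, self.t >= 5

class ToyModel:
    """Stand-in for the Transformer policy."""
    def predict_action(self, rtgs, states, actions):
        return 0

print(rollout(ToyModel(), ToyEnv(), target_return=5.0))  # 5.0
```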
Timestep embeddings (learned, one per timestep) are added to all three token types within each timestep, providing temporal context. The model is trained on offline datasets of trajectories using a standard autoregressive loss on action predictions.
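A minimal sketch of the timestep-embedding addition and the continuous-action training loss, under assumed shapes (the lookup table stands in for a learned embedding, and the predictions are random placeholders rather than model outputs):

```python
import numpy as np

rng = np.random.default_rng(1)
K, hidden_dim, max_ep_len = 20, 128, 1000
time_table = rng.standard_normal((max_ep_len, hidden_dim))  # learned per-timestep vectors

tokens = rng.standard_normal((3 * K, hidden_dim))   # interleaved (R, s, a) embeddings
timesteps = np.arange(K)                            # absolute timestep of each triplet
t_emb = time_table[timesteps]                       # (K, hidden_dim)
tokens = tokens + np.repeat(t_emb, 3, axis=0)       # same embedding for R, s, and a tokens

# Training objective for continuous control: mean-squared error between the
# model's predicted actions and the actions in the offline dataset.
pred_actions = rng.standard_normal((K, 3))          # placeholder model outputs
true_actions = rng.standard_normal((K, 3))          # placeholder dataset actions
mse = np.mean((pred_actions - true_actions) ** 2)
print(tokens.shape)  # (60, 128)
```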
Key Innovations
- RL as sequence modeling: Eliminates the need for temporal difference learning, value function estimation, or policy gradients—standard supervised learning on sequences suffices
- Return conditioning: By conditioning on a desired return at test time, the model can generate behaviors of varying quality on demand, from low-return to near-optimal
- Offline RL without pessimism: Unlike most offline RL methods, which rely on conservative value estimates to avoid out-of-distribution actions, Decision Transformer sidesteps the issue by modeling the trajectory distribution directly, so generated actions stay close to behaviors present in the data
- Long-horizon credit assignment: The Transformer's attention mechanism naturally handles long-range dependencies that are challenging for traditional RL
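The return-to-go targets used for conditioning are just suffix sums of the reward sequence, i.e. the total reward from each timestep to the end of the episode. A minimal computation:

```python
def returns_to_go(rewards):
    """Suffix sums: rtg[t] = rewards[t] + rewards[t+1] + ... + rewards[-1]."""
    rtg, running = [], 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    return rtg[::-1]

print(returns_to_go([1.0, 0.0, 2.0, 1.0]))  # [4.0, 3.0, 3.0, 1.0]
```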
Common Use Cases
Offline reinforcement learning, robotics control from demonstrations, game playing, autonomous navigation, and any RL task where a dataset of trajectories is available but online interaction is expensive or dangerous.
Notable Variants & Sizes
The original Decision Transformer uses a GPT-2-style architecture with 3-4 layers. Online Decision Transformer adds online fine-tuning. Multi-Game Decision Transformer trains a single model across multiple Atari games. Gato (DeepMind) extends this concept to a generalist agent across hundreds of tasks. Q-Transformer combines Q-learning with the Transformer architecture.
Technical Details
Standard configuration: 3-4 layers, 4 attention heads, 128-dim embeddings, context length of 20 timesteps (60 tokens). Trained on D4RL offline datasets (Gym locomotion tasks: HalfCheetah, Hopper, Walker2d) and Atari games. Training uses cross-entropy loss for discrete actions (Atari) or MSE for continuous actions (Gym). AdamW optimizer, learning rate 1e-4, weight decay 1e-4, trained for 100K gradient steps. At evaluation, return-to-go is set to the maximum trajectory return in the dataset or a target performance level.
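The configuration above, gathered into one place (values are taken from the text; the key names are illustrative, not the official repository's argument names):

```python
# Hypothetical config dict summarizing the hyperparameters listed above.
config = {
    "n_layers": 3,              # 3-4 in the original paper; 3 used here
    "n_heads": 4,
    "embed_dim": 128,
    "context_length": 20,       # timesteps; 3 tokens each => 60-token sequences
    "optimizer": "AdamW",
    "learning_rate": 1e-4,
    "weight_decay": 1e-4,
    "train_steps": 100_000,
    "loss": {"discrete": "cross_entropy", "continuous": "mse"},
}
print(config["context_length"] * 3)  # 60
```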