The Decision Transformer reframes reinforcement learning as a sequence modeling problem, using a GPT-style Transformer to generate actions by conditioning on desired return-to-go, past states, and past actions—without traditional RL components like value functions or policy gradients.
Architecture Overview
Decision Transformer treats an RL trajectory as a sequence of (return-to-go, state, action) triplets. At each timestep, three tokens are fed into the model: the desired return-to-go (cumulative future reward), the current state, and the action taken. Each modality has its own linear embedding layer that projects to the Transformer's hidden dimension.
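The interleaving and per-modality projection can be sketched as follows. This is an illustrative sketch, not the reference implementation: the weight matrices stand in for learned linear layers, and the dimensions (`hidden_dim=128`, state/action sizes) are placeholder choices.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, state_dim, act_dim, K = 128, 11, 3, 20  # K = context length in timesteps

# Stand-ins for learned per-modality linear embedding layers.
W_rtg = rng.standard_normal((1, hidden_dim))
W_state = rng.standard_normal((state_dim, hidden_dim))
W_action = rng.standard_normal((act_dim, hidden_dim))

def embed_trajectory(rtg, states, actions):
    """Project each modality and interleave as (R_1, s_1, a_1, R_2, s_2, a_2, ...)."""
    e_r = rtg @ W_rtg          # (K, hidden_dim)
    e_s = states @ W_state     # (K, hidden_dim)
    e_a = actions @ W_action   # (K, hidden_dim)
    tokens = np.empty((3 * len(rtg), hidden_dim))
    tokens[0::3] = e_r         # return-to-go token leads each triplet
    tokens[1::3] = e_s
    tokens[2::3] = e_a
    return tokens

rtg = rng.standard_normal((K, 1))
states = rng.standard_normal((K, state_dim))
actions = rng.standard_normal((K, act_dim))
print(embed_trajectory(rtg, states, actions).shape)  # (60, 128)
```

With a context of 20 timesteps, the Transformer therefore sees a 60-token sequence, matching the configuration given later in this card.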
The model uses a causal GPT-style Transformer that processes the interleaved sequence of return-to-go, state, and action tokens. The Transformer attends to the history of these triplets and predicts the next action. At inference time, the user specifies the desired return-to-go (e.g., the maximum possible episode return), and the model generates actions likely to achieve that cumulative reward.
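The evaluation loop implied here can be sketched as below: condition on a target return, act, then subtract each observed reward from the target so the model is always told how much return remains to be earned. `ToyEnv` and `ToyModel` are hypothetical stubs standing in for a real environment and the trained Transformer.

```python
def rollout(model, env, target_return, max_steps=1000):
    """Generate one episode conditioned on a desired return-to-go."""
    state = env.reset()
    rtg = target_return
    states, actions, rtgs, total = [], [], [], 0.0
    for _ in range(max_steps):
        rtgs.append(rtg)
        states.append(state)
        action = model.predict_action(rtgs, states, actions)  # conditions on history
        state, reward, done = env.step(action)
        actions.append(action)
        rtg -= reward          # remaining return the model should still achieve
        total += reward
        if done:
            break
    return total

class ToyEnv:
    """Toy stand-in environment: reward 1.0 per step, episode length 5."""
    def reset(self):
        self.t = 0
        return 0.0
    def step(self, action):
        self.t += 1
        return 0.0, 1.0, self.t >= 5

class ToyModel:
    """Stand-in for the Transformer policy."""
    def predict_action(self, rtgs, states, actions):
        return 0

print(rollout(ToyModel(), ToyEnv(), target_return=5.0))  # 5.0
```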
Timestep embeddings (learned, one per timestep) are added to all three token types within each timestep, providing temporal context. The model is trained on offline datasets of trajectories using a standard autoregressive loss on action predictions.
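A minimal sketch of the timestep-embedding addition and the continuous-action training loss, under assumed shapes (the lookup table stands in for a learned embedding, and the predictions are random placeholders rather than model outputs):

```python
import numpy as np

rng = np.random.default_rng(1)
K, hidden_dim, max_ep_len = 20, 128, 1000
time_table = rng.standard_normal((max_ep_len, hidden_dim))  # learned per-timestep vectors

tokens = rng.standard_normal((3 * K, hidden_dim))   # interleaved (R, s, a) embeddings
timesteps = np.arange(K)                            # absolute timestep of each triplet
t_emb = time_table[timesteps]                       # (K, hidden_dim)
tokens = tokens + np.repeat(t_emb, 3, axis=0)       # same embedding for R, s, and a tokens

# Training objective for continuous control: mean-squared error between the
# model's predicted actions and the actions in the offline dataset.
pred_actions = rng.standard_normal((K, 3))          # placeholder model outputs
true_actions = rng.standard_normal((K, 3))          # placeholder dataset actions
mse = np.mean((pred_actions - true_actions) ** 2)
print(tokens.shape)  # (60, 128)
```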
Key Innovations
- RL as sequence modeling: Eliminates the need for temporal difference learning, value function estimation, or policy gradients—standard supervised learning on sequences suffices
- Return conditioning: By conditioning on a desired return at test time, the model can generate behaviors of varying quality on demand, from low-return to near-optimal
- Offline RL without pessimism: Unlike most offline RL methods, which rely on conservative value estimates to avoid out-of-distribution actions, Decision Transformer sidesteps the issue by modeling the trajectory distribution directly, so generated actions stay close to behaviors present in the data
- Long-horizon credit assignment: The Transformer's attention mechanism naturally handles long-range dependencies that are challenging for traditional RL
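The return-to-go targets used for conditioning are just suffix sums of the reward sequence, i.e. the total reward from each timestep to the end of the episode. A minimal computation:

```python
def returns_to_go(rewards):
    """Suffix sums: rtg[t] = rewards[t] + rewards[t+1] + ... + rewards[-1]."""
    rtg, running = [], 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    return rtg[::-1]

print(returns_to_go([1.0, 0.0, 2.0, 1.0]))  # [4.0, 3.0, 3.0, 1.0]
```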
Common Use Cases
Offline reinforcement learning, robotics control from demonstrations, game playing, autonomous navigation, and any RL task where a dataset of trajectories is available but online interaction is expensive or dangerous.
Notable Variants & Sizes
The original Decision Transformer uses a GPT-2-style architecture with 3-4 layers. Online Decision Transformer adds online fine-tuning. Multi-Game Decision Transformer trains a single model across multiple Atari games. Gato (DeepMind) extends this concept to a generalist agent across hundreds of tasks. Q-Transformer combines Q-learning with the Transformer architecture.
Technical Details
Standard configuration: 3-4 layers, 4 attention heads, 128-dim embeddings, context length of 20 timesteps (60 tokens). Trained on D4RL offline datasets (Gym locomotion tasks: HalfCheetah, Hopper, Walker2d) and Atari games. Training uses cross-entropy loss for discrete actions (Atari) or MSE for continuous actions (Gym). AdamW optimizer, learning rate 1e-4, weight decay 1e-4, trained for 100K gradient steps. At evaluation, return-to-go is set to the maximum trajectory return in the dataset or a target performance level.
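The configuration above, gathered into one place (values are taken from the text; the key names are illustrative, not the official repository's argument names):

```python
# Hypothetical config dict summarizing the hyperparameters listed above.
config = {
    "n_layers": 3,              # 3-4 in the original paper; 3 used here
    "n_heads": 4,
    "embed_dim": 128,
    "context_length": 20,       # timesteps; 3 tokens each => 60-token sequences
    "optimizer": "AdamW",
    "learning_rate": 1e-4,
    "weight_decay": 1e-4,
    "train_steps": 100_000,
    "loss": {"discrete": "cross_entropy", "continuous": "mse"},
}
print(config["context_length"] * 3)  # 60
```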