
Demystifying Diffusion Policies: Action Memorization and Lookup Table Alternatives

Diffusion policies have demonstrated remarkable dexterity and robustness in intricate, high-dimensional robot manipulation tasks, while being trained from only a small number of demonstrations. However, the reason for this performance remains a mystery. In this paper, we offer a surprising hypothesis: diffusion policies essentially memorize an action lookup table---and this is beneficial. We posit that, at runtime, diffusion policies find the training image closest to the test image in a latent space and recall the associated training action sequence, offering reactivity without the need for action generalization. This is effective in the sparse data regime, where there is not enough data density for the model to learn action generalization. We support this claim with systematic empirical evidence: even when conditioned on wildly out-of-distribution (OOD) images of cats and dogs, the Diffusion Policy still outputs an action sequence from the training data. With this insight, we propose a simple policy, the Action Lookup Table (ALT), as a lightweight alternative to the Diffusion Policy. Our ALT policy uses a contrastive image encoder as a hash function to index the closest corresponding training action sequence, explicitly performing the computation that the Diffusion Policy implicitly learns. We show empirically that for relatively small datasets, ALT matches the performance of a diffusion model while requiring only 0.0034 of the inference time and 0.0085 of the memory footprint, allowing much faster closed-loop inference on resource-constrained robots. We also train our ALT policy to raise an explicit OOD flag when the runtime image is too far, in the latent space, from the training images, giving a simple but effective runtime monitor.
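To make the lookup-table idea concrete, below is a minimal Python sketch of the retrieval step described above: embed the runtime image with a contrastive encoder, find the nearest training embedding, return its stored action sequence, and raise an OOD flag when the nearest latent distance exceeds a threshold. The encoder interface, variable names, and threshold here are illustrative assumptions, not the released implementation.

import numpy as np

class ActionLookupTable:
    """Nearest-neighbor retrieval from image embeddings to training action sequences."""

    def __init__(self, encoder, train_images, train_actions, ood_threshold):
        self.encoder = encoder                                         # e.g., a contrastive image encoder
        self.keys = np.stack([encoder(img) for img in train_images])   # (N, d) latent keys
        self.actions = train_actions                                   # N stored training action sequences
        self.ood_threshold = ood_threshold                             # latent distance beyond which we flag OOD

    def __call__(self, image):
        z = self.encoder(image)                          # embed the runtime image
        dists = np.linalg.norm(self.keys - z, axis=1)    # distance to every training key
        idx = int(np.argmin(dists))                      # closest training example
        is_ood = bool(dists[idx] > self.ood_threshold)   # simple runtime monitor
        return self.actions[idx], is_ood

At runtime the table is queried once per control step, so the cost is a single encoder forward pass plus a nearest-neighbor search over the training set.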

Details

In this section we provide supplementary details and visualizations of the experimental results, organized into three parts: diffusion model analysis, diffusion policy analysis, and ALT results.

Diffusion Model Analysis

Training a generative model on 2D points uniformly distributed on 1D manifolds of different shapes. Each subplot shows a different training regime. First row: a low-capacity model (~400 parameters) trained on a small dataset (3k samples) gives erratic inferences. Second row: the low-capacity model trained on a large dataset (100k samples) generalizes to the wrong manifold. Third row: a high-capacity model (~9.5 million parameters) trained on a small dataset (the Diffusion Policy regime) approximately memorizes the dataset but does not generalize; all inference samples (blue) overlay the training data points (orange), essentially implementing a lookup table. Fourth row: a high-capacity model trained on a large dataset generalizes strongly to the correct data manifold (the regime of large-scale image diffusion models).
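For readers who want to reproduce the qualitative behavior in this figure, the following is a minimal sketch of the toy setup: a small epsilon-prediction diffusion model trained on 2D points sampled from a 1D manifold (a unit circle here, as a stand-in for the shapes above), followed by ancestral sampling. The network size, noise schedule, and dataset size are illustrative choices; sweeping the model capacity and the number of training points corresponds to the four regimes described above.

import torch
import torch.nn as nn

T = 100
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def sample_manifold(n):
    """2D points uniformly distributed on the unit circle (a 1D manifold)."""
    theta = 2 * torch.pi * torch.rand(n)
    return torch.stack([torch.cos(theta), torch.sin(theta)], dim=1)

class Denoiser(nn.Module):
    """Tiny MLP that predicts the noise added to a 2D point at step t."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),
        )

    def forward(self, x, t):
        t_feat = (t.float() / T).unsqueeze(1)            # crude timestep embedding
        return self.net(torch.cat([x, t_feat], dim=1))

model = Denoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
data = sample_manifold(3000)                             # "small dataset" regime

for step in range(2000):
    x0 = data[torch.randint(0, len(data), (256,))]
    t = torch.randint(0, T, (256,))
    noise = torch.randn_like(x0)
    ab = alpha_bars[t].unsqueeze(1)
    xt = ab.sqrt() * x0 + (1 - ab).sqrt() * noise        # forward (noising) process
    loss = ((model(xt, t) - noise) ** 2).mean()          # standard epsilon-prediction loss
    opt.zero_grad(); loss.backward(); opt.step()

# Ancestral sampling: start from Gaussian noise and denoise step by step.
x = torch.randn(500, 2)
for t in reversed(range(T)):
    with torch.no_grad():
        eps = model(x, torch.full((len(x),), t))
    a, ab = alphas[t], alpha_bars[t]
    x = (x - (1 - a) / (1 - ab).sqrt() * eps) / a.sqrt()
    if t > 0:
        x = x + betas[t].sqrt() * torch.randn_like(x)

Plotting the final samples over the training points makes the memorization versus generalization behavior directly visible for a given capacity/data pairing.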

Diffusion Policy Analysis

The core hypothesis of this paper is that the impressive smoothness and multimodality exhibited by diffusion policies stem not from explicit reasoning about the environment, but from their ability to hash from images to action sequences, memorized at training time through the high representational capacity of diffusion models.

(Figures: Policy Analysis 1, Policy Analysis 2, Policy Analysis 3.)

Similarity and distance statistics between inference and training trajectories. Each subplot shows the similarity scores (blue bars) and average distances (orange lines) between the Diffusion Policy inference and training trajectories. The large gap between the closest and second-closest neighbors indicates strong alignment with specific training examples. In the presence of distractors, almost all high-similarity matches are sharply concentrated on a single training trajectory, indicating a surprising OOD default behavior. The diffusion model seems to revert to one or two fallback action sequences when presented with OOD images. Even when the input is entirely unrelated to the task, for example, an image of a cat or a dog, the diffusion model still produces an action sequence that closely resembles one from the training set. We believe those results show that the Diffusion Policy's decision-making is largely governed by memory retrieval, rather than by generalized reasoning over the action space.
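The statistics in these plots can be computed with a straightforward nearest-neighbor analysis, sketched below: flatten each action sequence, measure the distance from an inferred rollout to every training trajectory, and inspect the gap between the closest and second-closest match. The array shapes and the similarity definition are illustrative assumptions; the exact metric used for the plots may differ.

import numpy as np

def trajectory_distances(inferred, training):
    """inferred: (H, A) action sequence; training: (N, H, A) training action sequences."""
    diffs = training - inferred[None]                         # broadcast over the N training trajectories
    return np.linalg.norm(diffs.reshape(len(training), -1), axis=1)

def memorization_stats(inferred, training):
    d = trajectory_distances(inferred, training)
    order = np.argsort(d)
    sim = 1.0 / (1.0 + d)                                     # one possible similarity score
    return {
        "nearest_index": int(order[0]),
        "nearest_distance": float(d[order[0]]),
        "second_distance": float(d[order[1]]),
        "gap": float(d[order[1]] - d[order[0]]),              # large gap => one dominant training match
        "nearest_similarity": float(sim[order[0]]),
    }

A large gap between the nearest and second-nearest distances, repeated across rollouts, is what we interpret as retrieval of a specific memorized training trajectory.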

ALT Results

In this section, we show the results of the ALT policy in the in-distribution (InD) case. In addition, we show the reactive behavior and multimodal behavior of ALT.

ALT Results: In-Distribution (InD)

ALT Results: Reactive Behavior

Here, we compare the reactivity of our ALT Policy to a standard Diffusion Policy. ALT’s inference is several orders of magnitude faster—allowing it to react quickly to changes in the visual scene. In this example, as the cup is moved, ALT adapts immediately and recomputes its action. Diffusion Policies, on the other hand, require slow and expensive sampling steps, making them sluggish in closed-loop control. ALT’s rapid response time results in smoother, more responsive robot behavior—especially in dynamic, real-time environments.
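The difference shows up directly in the control loop: a receding-horizon loop re-queries the policy every step, so per-query latency bounds how often the robot can replan. The sketch below uses stand-in policy and camera callables (assumptions, not our actual stack); only the loop structure is the point.

import time

def control_loop(policy, get_image, seconds=5.0):
    """Re-query the policy every iteration; lower inference latency means more replans per second."""
    deadline = time.perf_counter() + seconds
    replans = 0
    while time.perf_counter() < deadline:
        image = get_image()                  # latest camera frame
        action_sequence = policy(image)      # ALT: one encoder pass + lookup; diffusion: many denoising steps
        first_action = action_sequence[0]    # execute only the first action, then replan
        # (send first_action to the robot here)
        replans += 1
    return replans

With a lookup policy the loop can run at roughly camera rate, whereas iterative denoising makes the sampling steps dominate and sharply reduces the number of replans per second.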

ALT Results: Multimodality Behavior

The multimodal behavior of ALT is similar to that of the Diffusion Policy. In this example, we show that ALT can produce multiple distinct action sequences for the same input image, each drawn from the training dataset. This multimodality is a key feature of Diffusion Policies, enabling them to handle complex tasks with multiple valid solutions. Even without a generative model, the ALT policy can easily be adapted to behave multimodally, as shown in example cases 1 and 2; for each case we show three different trajectories for the same location. This shows that ALT can still model a multimodal distribution present in the training data.
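One simple way to recover multimodality without a generative model, sketched below, is to retrieve the k nearest training embeddings rather than only the single closest one, and to sample among them. The function names, temperature, and weighting rule are illustrative assumptions rather than the exact mechanism used in our experiments.

import numpy as np

def multimodal_lookup(z, keys, actions, k=3, temperature=0.1, rng=np.random.default_rng()):
    """z: query embedding; keys: (N, d) training embeddings; actions: N training action sequences."""
    dists = np.linalg.norm(keys - z, axis=1)
    topk = np.argsort(dists)[:k]                      # k closest training examples
    weights = np.exp(-dists[topk] / temperature)      # closer matches are more likely to be chosen
    weights /= weights.sum()
    choice = rng.choice(topk, p=weights)              # sample one mode per call
    return actions[choice]

Because each call samples among stored neighbors, repeated queries from the same scene can return different, but still in-distribution, action sequences, mirroring the mode-sampling behavior of a diffusion policy.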

Introduction Video

Bibtex

Coming soon!

Contact

If you have any questions, feel free to contact Chengyang He, Xu Liu and Gadiel Sznaier Camps.
