Our qualitative analysis reveals three key phenomena in Stable Diffusion U-Net representations: (1) positional information that is linearly extractable from token features, (2) increased similarity among corner tokens, and (3) high-norm anomalous tokens.
These phenomena may interfere with the use of diffusion model representations for downstream tasks that require robust features with high spatial fidelity, such as semantic correspondence or dense prediction tasks.
Our quantitative analysis confirms the three key phenomena observed in Stable Diffusion U-Net representations. The graphs above show results for three models (SD-1.5, SD-2.1, and SD-Turbo):

- Top row: accuracy of a linear probe predicting token positions, confirming that positional information is linearly extractable from the representations, especially in lower blocks.
- Middle row: similarity between corner tokens relative to non-corner tokens (log2 scale), revealing corner artifacts across multiple layers.
- Bottom row: norm of anomalous tokens relative to the mean norm of all tokens (log2 scale), highlighting the presence of high-norm anomalies, particularly in up-sampling blocks.

These findings are consistent across all evaluated models, suggesting that these phenomena are inherent to the U-Net architecture used in Stable Diffusion models.
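The three measurements can be sketched in a few lines of NumPy. This is a minimal, illustrative version only: the token features would in practice be extracted from a Stable Diffusion U-Net block, but here they are replaced by synthetic random features, and the grid size, feature dimension, and the least-squares probe are all assumptions, not the paper's exact setup.

```python
import numpy as np

# Synthetic stand-in for U-Net token features (assumed shape: H x W x D).
rng = np.random.default_rng(0)
H, W, D = 8, 8, 64
feats = rng.normal(size=(H, W, D))
X = feats.reshape(H * W, D)

# 1. Linear probe for token position: regress one-hot position targets onto
#    the features with least squares, then measure argmax accuracy.
Y = np.eye(H * W)                             # one-hot label per spatial position
W_probe, *_ = np.linalg.lstsq(X, Y, rcond=None)
probe_acc = np.mean(np.argmax(X @ W_probe, axis=1) == np.arange(H * W))

# 2. Corner-token similarity: mean cosine similarity among the four corner
#    tokens, which can be compared against non-corner token pairs.
def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

corners = [feats[0, 0], feats[0, -1], feats[-1, 0], feats[-1, -1]]
corner_sim = np.mean([cos(corners[i], corners[j])
                      for i in range(4) for j in range(i + 1, 4)])

# 3. Relative norm of anomalous tokens: max token norm over mean token norm;
#    log2 of this ratio matches the scale used in the plots above.
norms = np.linalg.norm(X, axis=1)
rel_norm = np.log2(norms.max() / norms.mean())
```

On real U-Net features, a high probe accuracy, elevated corner similarity, or a large relative norm would correspond to the three phenomena shown in the rows above; on the random features used here, all three statistics simply serve as a shape and API check.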
@inproceedings{loos2025latent,
  title     = {Latent Diffusion U-Net Representations Contain Positional Embeddings and Anomalies},
  author    = {Jonas Loos and Lorenz Linhardt},
  booktitle = {ICLR 2025 Workshop on Deep Generative Model in Machine Learning: Theory, Principle and Efficacy},
  year      = {2025},
  url       = {https://openreview.net/forum?id=BCFNrZcqEL}
}