A replication of "Just-in-time and distributed task representations in language models" on Gemma-3
Summary: We replicate key findings on task vector transferability across Gemma-3 models (270M–4B). Natural language tasks achieve 74–115% recovery through patching, while algorithmic tasks achieve only 15–28%. This confirms one of the paper's central claims: models encode task identity broadly but only package transferable task representations for certain task types.
The paper distinguishes two kinds of internal task signal: task identity representations (which encode what task is being performed) and transferable task representations (which can be extracted and injected to induce task behavior). The original paper found that task identity is ubiquitous, but transferable representations are sparse and task-dependent.
We test the transferability of task vectors by extracting hidden states from few-shot prompts and injecting them into zero-shot prompts, measuring how much task performance we recover.
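The mechanics of this extract-and-inject procedure can be sketched on a toy model. This is a minimal NumPy sketch, not the actual experiment code: the "layers" are hypothetical per-token maps standing in for Gemma-3 blocks, so patching the final-token residual stream at the middle layer transfers the few-shot state exactly; in a real transformer, where later layers attend across positions, the same patch only partially recovers few-shot behavior.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_LAYERS, MID = 8, 6, 3  # toy sizes; real runs use Gemma-3 with MID = n_layers // 2

# Hypothetical per-token "transformer layers" standing in for Gemma blocks.
weights = [rng.normal(scale=0.5, size=(D, D)) for _ in range(N_LAYERS)]

def run(tokens, inject=None):
    """Run the toy model; if `inject` is given, overwrite the final token's
    residual stream entering layer MID with it (the task vector)."""
    h = tokens.copy()
    pre = []  # residual stream entering each layer
    for i, W in enumerate(weights):
        if inject is not None and i == MID:
            h[-1] = inject          # patch: overwrite the residual stream
        pre.append(h.copy())
        h = np.tanh(h @ W.T)        # per-token layer update
    return h, pre

few_shot = rng.normal(size=(10, D))   # prompt with in-context examples
zero_shot = rng.normal(size=(3, D))   # bare query, no examples

# 1) Extract: hidden state at the final token position, middle layer.
_, pre_fs = run(few_shot)
task_vec = pre_fs[MID][-1]

# 2) Inject into the zero-shot run and compare final-token outputs.
out_patched, _ = run(zero_shot, inject=task_vec)
out_fs, _ = run(few_shot)

# In this toy (purely per-token layers) the patch transfers exactly.
assert np.allclose(out_patched[-1], out_fs[-1])
```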
Tasks split cleanly into two categories. Natural language tasks (antonyms, synonyms, translation, capitalization) achieve high recovery rates (~96% avg.), often matching or exceeding few-shot performance. Algorithmic tasks (list reversal, counting, extraction) show near-complete failure (~17% avg.).
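The recovery percentages above can be read as the fraction of the zero-shot-to-few-shot accuracy gap that patching closes; values above 100% mean the patched model beats the few-shot baseline. This is one common normalization and is our assumption here; the original paper's exact definition may differ.

```python
def recovery(zero_shot_acc, patched_acc, few_shot_acc):
    """Fraction of the zero-shot -> few-shot gap recovered by patching.
    Values above 1.0 mean patching beats the few-shot baseline."""
    gap = few_shot_acc - zero_shot_acc
    if gap <= 0:
        return float("nan")  # examples don't help; metric is undefined
    return (patched_acc - zero_shot_acc) / gap

# Hypothetical antonym-style task: 10% zero-shot, 70% patched, 75% few-shot.
print(f"{recovery(0.10, 0.70, 0.75):.0%}")  # → 92%
```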
The heatmap below shows performance across all task–model combinations.
Larger models achieve higher absolute patched accuracy on NL tasks (32% → 62% → 87%), consistent with the original paper. Algorithmic tasks remain resistant regardless of scale, with no clear improvement pattern.
The paper hypothesized that more in-context examples lead to better task representations for "easy" tasks. Our results broadly confirm this for both natural language and algorithmic tasks, though the trend is not uniform: for some tasks, recovery does not improve monotonically with the number of examples.
| Finding | Original (4B–27B) | Replication (270M–4B) | Status |
|---|---|---|---|
| NL task recovery | 80–90% | 74–115% | Confirmed |
| Algo task recovery | 35–45% | 15–28% | Lower than reported |
| Easy/hard task dichotomy | Clear separation | Clear separation | Confirmed |
| Monotonic scaling | Yes | NL yes; Algo inconsistent | Partial |
| Evidence condensation | Works for easy tasks | Works, but not always | Partial |
The lower algorithmic recovery rates may reflect our use of smaller models (270M–4B vs. 4B–27B) and fixed middle-layer extraction rather than per-task optimization. The qualitative findings, particularly the stark NL/Algo divide, replicate clearly.
Semantic transformations: antonyms, synonyms, translation, case changes. These tasks appear to have compact, localized representations that can be captured in a single vector at the final token position.
Algorithmic operations: list manipulation, counting, position extraction. These may instead require distributed computation across tokens and layers that cannot be condensed into a single injection point.
This replication supports the paper's core insight: language models carry rich task identity information throughout their computations, but only "package" that information into a transferable form for certain task types.
Methods: Experiments run on Gemma-3 (pretrained) models (270M, 1B, 4B) using hidden state extraction at the final token position of the prompt. Injection performed at the middle layer (layer n/2) by overwriting the residual stream. K values: 0, 1, 2, 4, 8. Twenty test examples per task.
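The few-shot prompts for each K can be built with a simple template. The `x -> y` pair format below is an assumption for illustration; the actual templates are in the linked repo.

```python
def build_prompt(pairs, query, k):
    """Build a K-shot prompt from (input, output) example pairs.
    Hypothetical format; the real templates live in the repo."""
    shots = "\n".join(f"{x} -> {y}" for x, y in pairs[:k])
    return (shots + "\n" if shots else "") + f"{query} ->"

antonyms = [("hot", "cold"), ("big", "small"), ("fast", "slow"), ("up", "down")]
for k in [0, 1, 2, 4]:  # K values used in the replication (plus K=8)
    print(repr(build_prompt(antonyms, "light", k)))
```

At K=0 this yields the bare zero-shot query (`'light ->'`), which is the prompt the task vector gets injected into.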
Limitations: The model size and sample counts are significantly smaller than in the original paper. We do not perform a hyperparameter search for the injection layer.
Code: Available on GitHub at JonasLoos/task_representations.