Replication Study · February 2026

Can one extract and inject task vectors in language models?

A replication of "Just-in-time and distributed task representations in language models" on Gemma-3

Summary: We replicate key findings on task vector transferability across Gemma-3 models (270M–4B). Natural language tasks achieve 74–115% recovery through patching, while algorithmic tasks achieve only 15–28%. This confirms one of the paper's central claims: models encode task identity broadly but only package transferable task representations for certain task types.

Background

One can distinguish between two types of internal task signals: task identity representations (which encode what task is being performed) and transferable task representations (which can be extracted and injected to induce task behavior). The original paper found that task identity is ubiquitous, but transferable representations are sparse and task-dependent.

We test the transferability of task vectors by extracting hidden states from few-shot prompts and injecting them into zero-shot prompts, measuring how much task performance we recover.
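The extract-and-inject procedure can be illustrated with a minimal toy sketch. The "model" below is a stack of random tanh layers standing in for a transformer's residual stream; all names are illustrative, and the real experiment uses forward hooks on Gemma-3 rather than this toy.

```python
import numpy as np

# Toy stand-in for a transformer's residual stream: each "layer" is a
# fixed random linear map plus a nonlinearity. Illustrative only.
rng = np.random.default_rng(0)
N_LAYERS, D = 8, 16
layers = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(N_LAYERS)]

def forward(x, patch_layer=None, patch_vector=None):
    """Run the toy model; optionally overwrite the residual stream
    after `patch_layer` with `patch_vector` (the injected task vector)."""
    h = x
    for i, w in enumerate(layers):
        h = np.tanh(w @ h)
        if patch_layer is not None and i == patch_layer:
            h = patch_vector  # overwrite, as in the replication
    return h

# 1) Extract: run the "few-shot prompt" and grab the middle-layer state.
few_shot_input = rng.standard_normal(D)
mid = N_LAYERS // 2
h = few_shot_input
states = []
for w in layers:
    h = np.tanh(w @ h)
    states.append(h.copy())
task_vector = states[mid]

# 2) Inject: run a "zero-shot prompt" with the task vector patched in.
zero_shot_input = rng.standard_normal(D)
patched_out = forward(zero_shot_input, patch_layer=mid, patch_vector=task_vector)

# Sanity check: once the residual stream is overwritten, downstream
# computation depends only on the injected vector, so any zero-shot
# input yields the same output.
other_out = forward(rng.standard_normal(D), patch_layer=mid, patch_vector=task_vector)
assert np.allclose(patched_out, other_out)
```

With full overwriting, the patched run discards everything the zero-shot prompt contributed up to the injection layer, which is why recovery hinges on the task vector alone carrying the task.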

Diagram: the patching pipeline. A task vector h is extracted at the final token of a few-shot prompt (Q: big → A: small; Q: hot → A: cold; Q: wet → A: ?) and injected into a zero-shot prompt (Q: fast → A: ?), inducing the answer (slow).

Results

The NL–Algorithmic divide

Tasks split cleanly into two categories. Natural language tasks (antonyms, synonyms, translation, capitalization) achieve high recovery rates (~96% avg.), often matching or exceeding few-shot performance. Algorithmic tasks (list reversal, counting, extraction) show near-complete failure (~17% avg.).

Figure 1: Recovery rate (patched / few-shot accuracy) at K=8. Values above 100% indicate patching outperforms few-shot.
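The recovery metric used throughout is simply the ratio of patched to few-shot accuracy. A one-line helper makes the definition explicit (the example numbers are illustrative, not measured values):

```python
def recovery_rate(patched_acc: float, few_shot_acc: float) -> float:
    """Recovery = patched accuracy as a fraction of few-shot accuracy.
    Values above 1.0 mean patching outperformed few-shot prompting."""
    return patched_acc / few_shot_acc

# Illustrative example: patching nearly matches few-shot performance.
print(round(recovery_rate(0.72, 0.75), 2))  # -> 0.96
```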
Full results matrix

The heatmap below shows performance across all task–model combinations.

Figure 2: Performance matrix across tasks and model sizes (K=8).

Scaling behavior

Larger models achieve higher absolute patched accuracy on NL tasks (32% → 62% → 87%), consistent with the original paper. Algorithmic tasks remain resistant regardless of scale, with no clear improvement pattern.

Figure 3: Few-shot vs patched accuracy across model sizes.

Evidence condensation

The paper hypothesized that more in-context examples lead to better task representations for "easy" tasks. Our results broadly confirm this: patched accuracy generally rises with the number of examples for both natural language and algorithmic tasks, though the trend is not uniform across every task.

Figure 4: Average patched accuracy by number of in-context examples.
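The aggregation behind this figure is straightforward: average patched accuracy over tasks, grouped by K. A short sketch with a hypothetical results layout (the accuracy numbers below are placeholders, not our measured values):

```python
from collections import defaultdict

# Hypothetical results: patched accuracy keyed by (task, K).
# These numbers are illustrative placeholders.
results = {
    ("antonyms", 1): 0.40, ("antonyms", 4): 0.70, ("antonyms", 8): 0.80,
    ("reversal", 1): 0.05, ("reversal", 4): 0.10, ("reversal", 8): 0.12,
}

def avg_by_k(results):
    """Average patched accuracy over tasks, for each K."""
    sums, counts = defaultdict(float), defaultdict(int)
    for (task, k), acc in results.items():
        sums[k] += acc
        counts[k] += 1
    return {k: sums[k] / counts[k] for k in sorted(sums)}

print(avg_by_k(results))  # accuracy generally rises with K
```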

Comparison with original paper

Finding                  | Original (4B–27B)    | Replication (270M–4B)     | Status
NL task recovery         | 80–90%               | 74–115%                   | Confirmed
Algo task recovery       | 35–45%               | 15–28%                    | Lower than reported
Easy/hard task dichotomy | Clear separation     | Clear separation          | Confirmed
Monotonic scaling        | Yes                  | NL yes; Algo inconsistent | Partial
Evidence condensation    | Works for easy tasks | Works, but not always     | Partial

The lower algorithmic recovery rates may reflect our use of smaller models (270M–4B vs 4B–27B) and fixed middle-layer extraction rather than per-task optimization. The qualitative findings, particularly the stark NL/Algo divide, replicate clearly.

Discussion

What transfers

Semantic transformations: antonyms, synonyms, translation, case changes. These tasks appear to have compact, localized representations that can be captured in a single vector at the final token position.

What doesn't

Algorithmic operations: list manipulation, counting, position extraction. These may instead require distributed computation across tokens and layers that cannot be condensed into a single injection point.

This replication supports the paper's core insight: language models carry rich task identity information throughout their computations, but only "package" that information into a transferable form for certain task types.

Methods: Experiments run on Gemma-3 (pretrained) models (270M, 1B, 4B) using hidden state extraction at the final token position of the prompt. Injection performed at the middle layer (layer n/2) by overwriting the residual stream. K values: 0, 1, 2, 4, 8. Twenty test examples per task.

Limitations: The model size and sample counts are significantly smaller than in the original paper. We do not perform a hyperparameter search for the injection layer.

Code: available on GitHub at JonasLoos/task_representations.