Math Across Languages

Key Findings

What we discovered

Four core findings from our cross-lingual mechanistic analysis of mathematical reasoning.

📈 Performance tracks parameter count

Languages with more math-specific parameters tend to achieve higher reasoning performance. The distribution of these parameters across languages mirrors their relative accuracy on math benchmarks.

🔗 Shared substrate in intermediate layers

A partially shared parameter substrate for mathematical problem-solving exists across languages, with the strongest overlap concentrated in intermediate model layers.

🧮 Scaling corrects arithmetic, not logic

Amplifying math-specific parameters primarily fixes arithmetic execution errors rather than improving reasoning logic. In some cases, pruning can even boost GSM8K accuracy by correcting output formatting or few-shot imitation behavior.

🤝 Collective parameter influence

The identified math-specific parameters exert a collective influence on performance — degradation scales linearly with the fraction removed, rather than being driven by a handful of individual critical parameters.

Overview

Math performance mirrors parameter allocation

The number of math-specific parameters per language aligns with that language's reasoning performance — more parameters, better math.

We use the MathNeurosurgery framework to isolate parameters critical for mathematical reasoning and compare them across English, German, French, and Hindi. Our analysis spans three models: Llama 3.2 1B, Qwen3 4B, and Llama 3.1 8B.

The results reveal a consistent hierarchy: English has the most math-relevant parameters, followed by German and French, with Hindi having the fewest. This ranking mirrors the languages' GSM8K accuracy, which suggests a tight coupling between parameter allocation and mathematical capability.

This pattern becomes even more pronounced as model size increases, consistent with the idea that larger models can dedicate more specialized sub-networks to high-resource languages.

Figure: number of math-specific parameters per language, Llama 1B. Values approximate; see paper Fig. 2 for exact data.

GSM8K accuracy, Llama 1B: English 34.0%, German 23.5%, French 18.5%, Hindi 14.5%.

Methodology

How we localize math circuits

Extending MathNeurosurgery across languages to identify, compare, and validate math-specific parameters.

STEP 1

Score Parameters

Compute importance scores for each parameter using weight–activation products over math (GSM8K) and non-math (MMLU, RACE) datasets across all languages.

S_ij = Σ_k |W_ij| · ‖X_j^k‖₂
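A minimal NumPy sketch of this score, under stated assumptions: W is one layer's weight matrix of shape (out, in), each X^k is the (tokens, in) activation matrix for one input example k, and the sum runs over examples. The function name `importance_scores` and these shapes are illustrative, not the framework's actual API.

```python
import numpy as np

def importance_scores(W, batches):
    """Accumulate |W_ij| * ||X_j^k||_2 over examples k.

    W:       (out, in) weight matrix of one layer (assumed layout).
    batches: iterable of (tokens, in) activation arrays, one per example.
    """
    abs_W = np.abs(np.asarray(W, dtype=np.float64))
    scores = np.zeros_like(abs_W)
    for X in batches:
        # L2 norm of each input feature j across the tokens of this example.
        col_norms = np.linalg.norm(np.asarray(X, dtype=np.float64), axis=0)
        # Broadcasts the (in,) norms across every output row i.
        scores += abs_W * col_norms
    return scores
```

Running the same function once on math inputs and once on non-math inputs gives the two score matrices that the next step compares.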

STEP 2

Isolate Math Parameters

Select the top-k parameters per layer for math, then exclude those that are also among the top-k for non-math tasks. This yields the math-specific parameter set for each language.

T_lang = Top-k_math \ Top-k_non-math
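The set difference can be sketched with boolean masks. This assumes top-k is expressed as a fraction of all entries in one weight matrix (as the "top-k = 0.1" setting in the results suggests); the helper names `top_k_mask` and `math_specific_mask` are hypothetical, and ties between equal scores are broken arbitrarily.

```python
import numpy as np

def top_k_mask(scores, frac):
    """Boolean mask marking the top `frac` fraction of entries by score."""
    scores = np.asarray(scores, dtype=np.float64)
    k = max(1, int(round(frac * scores.size)))
    flat = scores.ravel()
    # argpartition places the k largest values in the last k positions.
    top_idx = np.argpartition(flat, -k)[-k:]
    mask = np.zeros(flat.shape, dtype=bool)
    mask[top_idx] = True
    return mask.reshape(scores.shape)

def math_specific_mask(math_scores, non_math_scores, frac):
    """T_lang: in the math top-k but not in the non-math top-k."""
    return top_k_mask(math_scores, frac) & ~top_k_mask(non_math_scores, frac)
```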

STEP 3

Compare via Jaccard

Measure pairwise overlap between language-specific parameter sets using the Jaccard coefficient, both globally and layer-by-layer.

J(T₁,T₂) = |T₁ ∩ T₂| / |T₁ ∪ T₂|
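As an illustration, the coefficient can be computed on sets of parameter indices. The sketch below assumes each parameter set is stored as a Python set of hashable indices, such as (row, col) tuples, and that layer-wise sets live in dicts keyed by layer name; both are assumptions made for the example.

```python
def jaccard(t1, t2):
    """Jaccard coefficient |T1 ∩ T2| / |T1 ∪ T2| of two index sets."""
    t1, t2 = set(t1), set(t2)
    union = t1 | t2
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(t1 & t2) / len(union)

def layerwise_jaccard(sets_a, sets_b):
    """Jaccard per layer for two dicts mapping layer name -> index set."""
    shared_layers = sets_a.keys() & sets_b.keys()
    return {layer: jaccard(sets_a[layer], sets_b[layer]) for layer in shared_layers}
```

For instance, `jaccard({(0, 1), (0, 2), (3, 3)}, {(0, 2), (3, 3), (4, 0)})` shares two of four distinct indices and therefore returns 0.5.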

Results

Cross-lingual parameter overlap

Jaccard similarity reveals partial sharing concentrated in intermediate layers, with systematic variation across language pairs.

Layer-wise Jaccard Similarity — Llama 1B (top-k = 0.1)

Baseline performance across models and languages

Accuracy on GSM8K (math reasoning) and RACE (language understanding)

Model	English GSM8K	English RACE	German GSM8K	German RACE	Hindi GSM8K	Hindi RACE	French GSM8K	French RACE
Llama 1B	0.340	0.379	0.235	0.321	0.145	0.299	0.185	0.348
Qwen3 4B	0.735	0.414	0.685	0.356	0.385	0.353	0.690	0.403
Llama 8B	0.765	0.448	0.585	0.396	0.415	0.388	0.580	0.413

Validation through weight intervention

Scaling: amplifying math parameters

Scaling isolated parameters by 1.1× on Llama 1B yields up to +9.7% on English GSM8K while leaving RACE/MMLU virtually unchanged. Gains mainly come from corrected arithmetic, not better reasoning logic. Larger models show smaller gains due to saturation.
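The scaling intervention amounts to multiplying only the selected weights by a constant. A minimal sketch, assuming the math-specific set is available as a boolean mask with the same shape as the weight matrix (the function name `scale_parameters` is hypothetical):

```python
import numpy as np

def scale_parameters(W, mask, factor=1.1):
    """Return a copy of W with masked entries multiplied by `factor`."""
    W = np.asarray(W, dtype=np.float64)
    mask = np.asarray(mask, dtype=bool)
    # Entries outside the mask are left exactly as they were.
    return np.where(mask, W * factor, W)
```

Returning a new array rather than editing in place keeps the original weights available for a before/after comparison.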

Pruning: removing math parameters

Zeroing out math-specific parameters causes steep drops in GSM8K (up to –98%) with minimal RACE/MMLU impact, confirming correct localization. Performance degradation scales linearly with the fraction pruned — evidence of a collective, distributed effect.
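One way to probe how degradation depends on the fraction pruned is to zero out a random subset of the math-specific parameters. The sketch below is an assumption about how such a sweep could be set up, not a description of the exact procedure used; `prune_fraction` and its `seed` argument are hypothetical names.

```python
import numpy as np

def prune_fraction(W, mask, fraction, seed=0):
    """Zero a random `fraction` of the masked entries; return a new array."""
    rng = np.random.default_rng(seed)
    pruned = np.array(W, dtype=np.float64)  # copy, so W itself is untouched
    candidates = np.flatnonzero(np.asarray(mask, dtype=bool))
    n_prune = int(round(fraction * candidates.size))
    chosen = rng.choice(candidates, size=n_prune, replace=False)
    pruned.flat[chosen] = 0.0
    return pruned
```

Sweeping `fraction` from 0 to 1 and re-evaluating GSM8K at each setting would give the curve against which a linear trend can be checked.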

Qualitative example: scaling corrects arithmetic

English · Llama 1B · Scaling intervention

Same reasoning steps, different arithmetic — scaling fixes the final computation

Input

Q: Toula went to the bakery and bought various types of pastries. She bought 3 dozen donuts which cost $68 per dozen, 2 dozen mini cupcakes which cost $80 per dozen, and 6 dozen mini cheesecakes for $55 per dozen. How much was the total cost?

✗ Before scaling

3 dozen donuts cost 3 × 68 = 204 dollars.

2 dozen mini cupcakes cost 2 × 80 = 160 dollars.

6 dozen mini cheesecakes cost 6 × 55 = 330 dollars.

204 + 160 + 330 is 734

The answer is 734.
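For reference, the arithmetic in the trace above can be checked directly. The three products are correct; only the final sum is wrong, since 204 + 160 + 330 = 694 rather than 734. The variable names in this short check are illustrative.

```python
# (dozens bought, price per dozen in dollars), taken from the question.
items = {
    "donuts": (3, 68),
    "mini cupcakes": (2, 80),
    "mini cheesecakes": (6, 55),
}

subtotals = {name: qty * price for name, (qty, price) in items.items()}
total = sum(subtotals.values())

print(subtotals)  # {'donuts': 204, 'mini cupcakes': 160, 'mini cheesecakes': 330}
print(total)      # 694
```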

✓ After scaling