A cross-lingual mechanistic analysis revealing how mathematical reasoning in LLMs relies on partially shared parameter circuits — with systematic language-dependent differences.
Four core findings from our cross-lingual mechanistic analysis of mathematical reasoning.
Languages with more math-specific parameters tend to achieve higher reasoning performance. The distribution of these parameters across languages mirrors their relative accuracy on math benchmarks.
A partially shared parameter substrate for mathematical problem-solving exists across languages, with the strongest overlap concentrated in intermediate model layers.
Amplifying math-specific parameters primarily fixes arithmetic execution errors rather than improving reasoning logic. In some cases, pruning can even boost GSM8K accuracy by correcting output formatting or few-shot imitation behavior.
The identified math-specific parameters exert a collective influence on performance — degradation scales linearly with the fraction removed, rather than being driven by a handful of individual critical parameters.
The number of math-specific parameters per language aligns with that language's reasoning performance — more parameters, better math.
We use the MathNeurosurgery framework to isolate parameters critical for mathematical reasoning and compare them across English, German, French, and Hindi. Our analysis spans three models: Llama 3.2 1B, Qwen3 4B, and Llama 3.1 8B.
The results reveal a consistent hierarchy: English has the most math-relevant parameters, followed by German and French, with Hindi having the fewest. This ranking directly mirrors their GSM8K accuracy scores, suggesting a tight coupling between parameter allocation and mathematical capability.
This pattern becomes even more pronounced as model size increases, consistent with the idea that larger models can dedicate more specialized sub-networks to high-resource languages.
Values approximate — see paper Fig. 2 for exact data.
Extending MathNeurosurgery across languages to identify, compare, and validate math-specific parameters.
Compute importance scores for each parameter using weight–activation products over math (GSM8K) and non-math (MMLU, RACE) datasets across all languages.
Select the top-k parameters per layer for math, then exclude those that are also important for non-math tasks. This yields the math-specific parameter set for each language.
Measure pairwise overlap between language-specific parameter sets using the Jaccard coefficient, both globally and layer-by-layer.
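The scoring and selection steps above can be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's implementation: it assumes the weight-activation product takes the Wanda-style form |W_ij| · ‖x_j‖₂, and the function names (`importance_scores`, `top_k_mask`, `math_specific_mask`) and the `fraction` value are hypothetical.

```python
import numpy as np

def importance_scores(weight, activations):
    """Score each weight as |W_ij| * ||x_j||_2 (assumed Wanda-style product).

    weight:      (out_features, in_features) matrix of one linear layer
    activations: (n_tokens, in_features) inputs to that layer on one dataset
    """
    input_norms = np.linalg.norm(activations, axis=0)  # shape: (in_features,)
    return np.abs(weight) * input_norms[np.newaxis, :]

def top_k_mask(scores, fraction):
    """Boolean mask marking the top `fraction` of entries within one layer."""
    k = max(1, int(round(fraction * scores.size)))
    # The k-th largest score is the selection threshold (ties may add extras).
    threshold = np.partition(scores.ravel(), -k)[-k]
    return scores >= threshold

def math_specific_mask(weight, math_acts, non_math_acts, fraction=0.05):
    """Top-k for math data, minus anything that is also top-k for non-math data."""
    math_top = top_k_mask(importance_scores(weight, math_acts), fraction)
    other_top = top_k_mask(importance_scores(weight, non_math_acts), fraction)
    return math_top & ~other_top

# Toy stand-ins for one layer's weights and its inputs on the two dataset types.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
math_x = rng.normal(size=(32, 16))
other_x = rng.normal(size=(32, 16))
mask = math_specific_mask(W, math_x, other_x, fraction=0.1)
print(mask.sum(), "of", mask.size, "weights flagged as math-specific")
```

In a real run the activations would come from forward passes on GSM8K and on MMLU/RACE prompts in a given language, and the mask would be computed separately for every layer and language.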
Jaccard similarity reveals partial sharing concentrated in intermediate layers, with systematic variation across language pairs.
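For reference, the Jaccard coefficient of two parameter sets A and B is |A ∩ B| / |A ∪ B|. The sketch below computes it per layer and globally; the data layout (a dict from layer index to a set of parameter ids), the helper names, and the toy sets are assumptions made for illustration only and are not results from the paper.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def layerwise_jaccard(params_a, params_b):
    """Per-layer Jaccard for dicts mapping layer index -> set of parameter ids."""
    layers = sorted(set(params_a) | set(params_b))
    return {
        layer: jaccard(params_a.get(layer, set()), params_b.get(layer, set()))
        for layer in layers
    }

def global_jaccard(params_a, params_b):
    """Jaccard over all (layer, parameter id) pairs pooled across layers."""
    flat_a = {(layer, p) for layer, ps in params_a.items() for p in ps}
    flat_b = {(layer, p) for layer, ps in params_b.items() for p in ps}
    return jaccard(flat_a, flat_b)

# Made-up parameter sets for two languages, three layers each.
lang_a = {0: {1, 2, 3}, 1: {4, 5, 6, 7}, 2: {8, 9}}
lang_b = {0: {3, 10}, 1: {4, 5, 6, 11}, 2: {12}}
print(layerwise_jaccard(lang_a, lang_b))  # {0: 0.25, 1: 0.6, 2: 0.0}
print(global_jaccard(lang_a, lang_b))     # 4 shared out of 12 total = 0.333...
```

Pooling on `(layer, parameter id)` pairs keeps parameters with the same index in different layers distinct in the global score.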
Accuracy on GSM8K (math reasoning) and RACE (language understanding)
| Model | English GSM8K | English RACE | German GSM8K | German RACE | Hindi GSM8K | Hindi RACE | French GSM8K | French RACE |
|---|---|---|---|---|---|---|---|---|
| Llama 1B | 0.340 | 0.379 | 0.235 | 0.321 | 0.145 | 0.299 | 0.185 | 0.348 |
| Qwen3 4B | 0.735 | 0.414 | 0.685 | 0.356 | 0.385 | 0.353 | 0.690 | 0.403 |
| Llama 8B | 0.765 | 0.448 | 0.585 | 0.396 | 0.415 | 0.388 | 0.580 | 0.413 |
Scaling isolated parameters by 1.1× on Llama 1B yields up to +9.7% on English GSM8K while leaving RACE/MMLU virtually unchanged. Gains mainly come from corrected arithmetic, not better reasoning logic. Larger models show smaller gains due to saturation.
Zeroing out math-specific parameters causes steep drops in GSM8K (up to –98%) with minimal RACE/MMLU impact, confirming correct localization. Performance degradation scales linearly with the fraction pruned — evidence of a collective, distributed effect.
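Both interventions amount to editing only the weights selected by a boolean mask. The following is a rough NumPy sketch under stated assumptions: `scale_masked` and `prune_masked` are hypothetical helper names, and choosing the pruned subset uniformly at random is an assumption, since the page does not say how partial fractions are selected.

```python
import numpy as np

def scale_masked(weight, mask, factor=1.1):
    """Return a copy of `weight` with the masked entries multiplied by `factor`."""
    return np.where(mask, weight * factor, weight)

def prune_masked(weight, mask, fraction=1.0, seed=0):
    """Return a copy of `weight` with a random `fraction` of masked entries zeroed."""
    rng = np.random.default_rng(seed)
    candidates = np.flatnonzero(mask)  # flat indices of math-specific weights
    n_drop = int(round(fraction * candidates.size))
    drop = rng.choice(candidates, size=n_drop, replace=False)
    pruned = weight.copy()
    pruned.flat[drop] = 0.0
    return pruned

# Toy layer with three weights marked as math-specific.
W = np.arange(1.0, 7.0).reshape(2, 3)
mask = np.array([[True, False, False], [False, True, True]])
print(scale_masked(W, mask))               # masked entries scaled by 1.1
print(prune_masked(W, mask, fraction=1.0))  # all masked entries set to zero
```

Sweeping `fraction` from 0 to 1 and re-evaluating GSM8K at each step is one way to trace how accuracy degrades with the share of parameters removed.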
Same reasoning steps, different arithmetic — scaling fixes the final computation
Q: Toula went to the bakery and bought various types of pastries. She bought 3 dozen donuts which cost $68 per dozen, 2 dozen mini cupcakes which cost $80 per dozen, and 6 dozen mini cheesecakes for $55 per dozen. How much was the total cost?
Before scaling (incorrect final sum):
3 dozen donuts cost 3 × 68 = 204 dollars.
2 dozen mini cupcakes cost 2 × 80 = 160 dollars.
6 dozen mini cheesecakes cost 6 × 55 = 330 dollars.
204 + 160 + 330 is 734
The answer is 734.
After scaling (correct final sum):
3 dozen donuts cost 3 × 68 = 204 dollars.
2 dozen mini cupcakes cost 2 × 80 = 160 dollars.
6 dozen mini cheesecakes cost 6 × 55 = 330 dollars.
204 + 160 + 330 is 694
The answer is 694.
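The arithmetic in this example is easy to verify. The short check below recomputes the three subtotals, which are identical in both outputs, and their sum; the variable names are illustrative only.

```python
# (dozens bought, price per dozen) for donuts, mini cupcakes, mini cheesecakes
items = [(3, 68), (2, 80), (6, 55)]

subtotals = [dozens * price for dozens, price in items]
total = sum(subtotals)

print(subtotals)  # [204, 160, 330]
print(total)      # 694
```

The two outputs agree on every intermediate step, so the only difference is the final addition, and 694 is the correct total.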
If you find this work useful, please consider citing our paper.
@misc{shomali2026llmparametersmathlanguages,
title={LLM Parameters for Math Across Languages: Shared or Separate?},
author={Behzad Shomali and Luisa Victor and Tim Selbach and Ali Hamza Bashir and David Berghaus and Joachim Koehler and Mehdi Ali and Markus Frey},
year={2026},
eprint={2606.18453},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2606.18453},
}