Investigating equation-only reasoning in large language models
2026
While Large Language Models excel at mathematical reasoning with Chain-of-Thought prompting, their ability to perform systematic arithmetic reasoning without natural language scaffolding remains poorly understood. We investigate equation-only supervision, where LLMs map natural language problems directly to symbolic equation sequences without intermediate explanations. This approach separates reasoning structure generation from arithmetic computation, enabling compact equation storage and deterministic evaluation by external symbolic systems. We fine-tune LLaMA 3.1 Instruct 8B on GSM8K across three representations: numeric (16 − 3 − 4 = 9), symbolic (v0 − v1 − v2 = v3), and semantic variables (eggs laid per day − eggs eaten for breakfast = eggs sold at market). Numeric equations achieve 67.85% accuracy on GSM8K with strong generalization (63.68% on GSM-Symbolic), while semantic variables perform comparably (66.41%). Surprisingly, pure symbolic variables underperform significantly (52.46%), revealing that semantic grounding is crucial for learning equation structures. Our dual evaluation metrics show equation-calculated accuracy often matches or exceeds LLM-calculated accuracy, indicating that improving structure generation—not arithmetic computation—remains the primary challenge. This diagnostic study provides empirical insights into LLMs' structured mathematical reasoning capabilities with implications for building reliable systems leveraging symbolic computation.
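The "equation-calculated" metric described above can be sketched as follows: the model's output is treated as a chain of equations, each left-hand side is evaluated deterministically by an external system, and the stated right-hand side is checked against that computation. This is a minimal illustrative sketch, not the paper's actual evaluation code; the function name and equation format are assumptions based on the numeric representation shown in the abstract.

```python
def eval_equation_chain(equations):
    """Evaluate a chain of equations like '16 - 3 - 4 = 9' one by one.

    Each left-hand side is computed deterministically (the external
    symbolic system's role); the model's stated right-hand side is
    checked against it. Returns the final value, or None if any stated
    result disagrees with the computation.
    """
    final = None
    for eq in equations:
        lhs, rhs = eq.split("=")
        # Restrict eval to pure arithmetic: no builtins, no names.
        value = eval(lhs, {"__builtins__": {}})
        if abs(value - float(rhs)) > 1e-9:
            return None  # structure error: stated result is wrong
        final = value
    return final

# Numeric-representation example from the abstract:
print(eval_equation_chain(["16 - 3 - 4 = 9"]))  # -> 9
```

Under this scheme, a wrong final answer with a valid chain points to an arithmetic slip by the LLM, while a chain that fails verification points to a structure-generation error, which is the failure mode the abstract identifies as the primary challenge.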