TL;DR:
Multi-domain fine-tuning usually forces trade-offs (better code = worse math) because training samples conflict. This paper shows those conflicts evolve dynamically during training and can be measured via gradient interactions. EVIC periodically computes an “interaction matrix” to select only samples that currently help the whole dataset, boosting Mistral-7B performance by 4+ points while using up to 2× fewer training steps than standard mixing.
Motivation
Multi-domain fine-tuning of LLMs suffers from notorious capability trade-offs, where improving performance in one domain (e.g., coding) degrades performance in others (e.g., general instruction following). Existing approaches rely on empirical heuristics or domain-level curriculum strategies without understanding the fundamental interactions between individual training samples, leading to marginal improvements and high trial-and-error costs.
Problem
The core challenge is inter-sample conflict: training signals from different domains often conflict, hindering the effective use of high-quality data. Current methods assume that:
- Interactions between samples are static and determined solely by inherent semantic domain labels
- Domain-level data management is optimal
These assumptions fail because interactions evolve during training, conflicts exist within the same domain, and synergies can occur across domains. Simply mixing data (Multi-Task Learning, MTL) or using staged training (Dual-stage Mixed Fine-tuning) cannot resolve these dynamic conflicts.
Intuition
EVIC (EVolving Interaction-guided Curriculum) models sample-to-sample interactions as one sample's influence on another's loss, quantified via Adam gradients. Key insights:
- Interactions evolve: Gradient-based influence between sample pairs changes significantly during training (e.g., conflict → promotion), rather than being fixed by domain semantics
- Asymmetry: Sample $j$ helping sample $i$ does not imply the reverse; $\operatorname{sign}(\mathrm{Int}[j,i])$ can differ from $\operatorname{sign}(\mathrm{Int}[i,j])$ (see the first-order sketch after this list)
- Sample-level granularity: Conflicts and promotions occur both within and across domains, necessitating fine-grained selection over coarse domain grouping
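The paper quantifies this influence with Adam gradients; its exact estimator is not reproduced here, but a hedged first-order sketch (assuming $u_j(\theta_t)$ denotes the Adam-style update direction induced by sample $j$, i.e. its gradient rescaled by the optimizer's moment estimates, and $\eta$ the learning rate) makes the sign convention and the asymmetry concrete:

$$\mathrm{Int}(\theta_t)[j,i] \;\approx\; \mathcal{L}_i(\theta_t) - \mathcal{L}_i\!\big(\theta_t - \eta\, u_j(\theta_t)\big) \;\approx\; \eta\, \nabla \mathcal{L}_i(\theta_t)^{\top} u_j(\theta_t)$$

A positive entry means a step on $j$ is expected to reduce $i$'s loss (promotion); a negative entry means it raises it (conflict). Because $u_j$ is not simply $\nabla \mathcal{L}_j$, the inner product $\nabla \mathcal{L}_i^{\top} u_j$ generally differs from $\nabla \mathcal{L}_j^{\top} u_i$, so the matrix need not be symmetric.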
Method: Periodically compute an interaction matrix $\mathrm{Int}(\theta_t) \in \mathbb{R}^{N \times N}$ whose entry $[j,i]$ is sample $j$'s influence on sample $i$'s loss, then select the samples with non-negative net influence (row sums $\geq 0$) for training. A Johnson-Lindenstrauss random projection (to 8192 dimensions) keeps the cost of working with high-dimensional gradients manageable.
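A minimal sketch of this selection step, assuming per-sample loss gradients and Adam-style update directions are already available as flat vectors; `jl_project`, `interaction_matrix`, and `select_samples` are illustrative names rather than the authors' implementation, and the toy update directions below are a crude stand-in for real Adam moments:

```python
import torch

def jl_project(vecs: torch.Tensor, proj_dim: int, seed: int = 0) -> torch.Tensor:
    """Johnson-Lindenstrauss sketch: (N, D) -> (N, proj_dim), preserving inner products in expectation."""
    D = vecs.shape[1]
    gen = torch.Generator().manual_seed(seed)
    P = torch.randn(D, proj_dim, generator=gen) / proj_dim ** 0.5
    return vecs @ P

def interaction_matrix(update_dirs: torch.Tensor, loss_grads: torch.Tensor, lr: float) -> torch.Tensor:
    """Int[j, i] ~ lr * <u_j, g_i>: first-order change in sample i's loss after a step on sample j."""
    return lr * update_dirs @ loss_grads.T   # (N, N), generally asymmetric

def select_samples(Int: torch.Tensor) -> torch.Tensor:
    """Keep samples whose net influence on the whole set is non-negative (row sums >= 0)."""
    return (Int.sum(dim=1) >= 0).nonzero(as_tuple=True)[0]

# Toy usage: N samples, D flattened parameters. The paper projects to 8192 dimensions;
# a smaller proj_dim is used here only to keep the example light.
N, D, lr = 64, 10_000, 1e-5
g = torch.randn(N, D)                 # stand-in per-sample loss gradients
u = torch.sign(g)                     # crude stand-in for Adam update directions
g_p = jl_project(g, proj_dim=256)
u_p = jl_project(u, proj_dim=256)     # same seed -> same projection matrix
Int = interaction_matrix(u_p, g_p, lr)
keep = select_samples(Int)
print(f"selected {keep.numel()} / {N} samples")
```

Recomputing the matrix periodically on the current parameters, rather than once up front, is what makes the curriculum "evolving"; the ablations below indicate this iterative recomputation is what drives the gains.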
Results
Performance-to-sample ratio (Table 4):
- Mistral-7B: 1.33× more efficient than MTL (77.7% of MTL's training steps)
- Llama-3.1-8B: 1.29× more efficient
- Qwen2.5-14B: 2.11× more efficient (47.4% of MTL's training steps)
Coverage & Convergence:
- Sample coverage reaches ~90% after 2 iterations and >96% after 3 iterations (Table 3)
- Only 5% of the data is needed for warm-up to reach optimal performance (Figure 3)
Ablation Results:
- Non-iterative variants (static sample selection) significantly underperform: Mistral-7B achieves 33.1 vs. 37.4 (iterative); Llama-3.1-8B achieves 43.2 vs. 44.3 (iterative)
- Iterative interaction computation is necessary for optimal multi-domain performance