TL;DR:
Multi-domain fine-tuning usually forces trade-offs (better code = worse math) because training samples conflict. This paper shows those conflicts evolve dynamically during training and can be measured via gradient interactions. EVIC periodically computes an “interaction matrix” to select only samples that currently help the whole dataset, boosting Mistral-7B performance by 4+ points while using up to 2× fewer training steps than standard mixing.
Motivation
Multi-domain fine-tuning of LLMs suffers from notorious capability trade-offs, where improving performance in one domain (e.g., coding) degrades performance in others (e.g., general instruction following). Existing approaches rely on empirical heuristics or domain-level curriculum strategies without understanding the fundamental interactions between individual training samples, leading to marginal improvements and high trial-and-error costs.
Problem
The core challenge is inter-sample conflict: training signals from different domains often conflict, hindering the effective use of high-quality data. Current methods assume that:
- Interactions between samples are static and determined solely by inherent semantic domain labels
- Domain-level data management is optimal
These assumptions fail because interactions evolve during training, conflicts exist within the same domain, and synergies can occur across domains. Simply mixing data (Multi-Task Learning, MTL) or using staged training (Dual-stage Mixed Fine-tuning) cannot resolve these dynamic conflicts.
Intuition
EVIC (EVolving Interaction-guided Curriculum) models sample-to-sample interactions as one sample's influence on another's loss, quantified via Adam gradients. Key insights:
- Interactions evolve: Gradient-based influence between sample pairs changes significantly during training (e.g., conflict → promotion), rather than being fixed by domain semantics
- Asymmetry: Sample $j$ helping sample $i$ does not imply the reverse; $\operatorname{sign}(\mathrm{Int}[j,i])$ can differ from $\operatorname{sign}(\mathrm{Int}[i,j])$ (see the first-order sketch after this list)
- Sample-level granularity: Conflicts and promotions occur both within and across domains, necessitating fine-grained selection over coarse domain grouping
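The paper quantifies this influence with Adam gradients; its exact estimator is not reproduced here, but a hedged first-order sketch (assuming $u_j(\theta_t)$ denotes the Adam-style update direction induced by sample $j$, i.e. its gradient rescaled by the optimizer's moment estimates, and $\eta$ the learning rate) makes the sign convention and the asymmetry concrete:

$$\mathrm{Int}(\theta_t)[j,i] \;\approx\; \mathcal{L}_i(\theta_t) - \mathcal{L}_i\!\big(\theta_t - \eta\, u_j(\theta_t)\big) \;\approx\; \eta\, \nabla \mathcal{L}_i(\theta_t)^{\top} u_j(\theta_t)$$

A positive entry means a step on $j$ is expected to reduce $i$'s loss (promotion); a negative entry means it raises it (conflict). Because $u_j$ is not simply $\nabla \mathcal{L}_j$, the inner product $\nabla \mathcal{L}_i^{\top} u_j$ generally differs from $\nabla \mathcal{L}_j^{\top} u_i$, so the matrix need not be symmetric.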
Method: Periodically compute an interaction matrix $\mathrm{Int}(\theta_t) \in \mathbb{R}^{N \times N}$ whose entry $[j,i]$ is sample $j$'s influence on sample $i$'s loss, then select the samples with non-negative net influence (row sums $\geq 0$) for training. A Johnson-Lindenstrauss random projection (to 8192 dimensions) keeps the cost of working with high-dimensional gradients manageable.
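A minimal sketch of this selection step, assuming per-sample loss gradients and Adam-style update directions are already available as flat vectors; `jl_project`, `interaction_matrix`, and `select_samples` are illustrative names rather than the authors' implementation, and the toy update directions below are a crude stand-in for real Adam moments:

```python
import torch

def jl_project(vecs: torch.Tensor, proj_dim: int, seed: int = 0) -> torch.Tensor:
    """Johnson-Lindenstrauss sketch: (N, D) -> (N, proj_dim), preserving inner products in expectation."""
    D = vecs.shape[1]
    gen = torch.Generator().manual_seed(seed)
    P = torch.randn(D, proj_dim, generator=gen) / proj_dim ** 0.5
    return vecs @ P

def interaction_matrix(update_dirs: torch.Tensor, loss_grads: torch.Tensor, lr: float) -> torch.Tensor:
    """Int[j, i] ~ lr * <u_j, g_i>: first-order change in sample i's loss after a step on sample j."""
    return lr * update_dirs @ loss_grads.T   # (N, N), generally asymmetric

def select_samples(Int: torch.Tensor) -> torch.Tensor:
    """Keep samples whose net influence on the whole set is non-negative (row sums >= 0)."""
    return (Int.sum(dim=1) >= 0).nonzero(as_tuple=True)[0]

# Toy usage: N samples, D flattened parameters. The paper projects to 8192 dimensions;
# a smaller proj_dim is used here only to keep the example light.
N, D, lr = 64, 10_000, 1e-5
g = torch.randn(N, D)                 # stand-in per-sample loss gradients
u = torch.sign(g)                     # crude stand-in for Adam update directions
g_p = jl_project(g, proj_dim=256)
u_p = jl_project(u, proj_dim=256)     # same seed -> same projection matrix
Int = interaction_matrix(u_p, g_p, lr)
keep = select_samples(Int)
print(f"selected {keep.numel()} / {N} samples")
```

Recomputing the matrix periodically on the current parameters, rather than once up front, is what makes the curriculum "evolving"; the ablations below indicate this iterative recomputation is what drives the gains.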
Results
Performance-to-sample ratio (Table 4):
- Mistral-7B: 1.33× more efficient than MTL (77.7% of MTL's training steps)
- Llama-3.1-8B: 1.29× more efficient
- Qwen2.5-14B: 2.11× more efficient (47.4% of MTL's training steps)
Coverage & Convergence:
- Sample coverage reaches ~90% after 2 iterations and >96% after 3 iterations (Table 3)
- Only 5% of the data is needed for warm-up to reach optimal performance (Figure 3)
Ablation Results:
- Non-iterative variants (static sample selection) significantly underperform: Mistral-7B achieves 33.1 vs. 37.4 (iterative); Llama-3.1-8B achieves 43.2 vs. 44.3 (iterative)
- Iterative interaction computation is necessary for optimal multi-domain performance