Optimization objective

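The ground-truth token distribution (the supervision signal) is one-hot. In that case, minimizing the KL divergence between the supervision signal and the model output is the same as minimizing their cross-entropy, which is the same as minimizing the negative log-likelihood (NLL), which in turn is maximum likelihood estimation.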
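To see why these coincide, a short derivation may help. Here $q_i$ denotes the one-hot target at position $i$ (a symbol introduced only for this illustration), so $q_i(y_i)=1$ and $q_i(v)=0$ for every other vocabulary token $v$, with the usual convention $0\log 0=0$:

$$ \mathrm{KL}(q_i \,\|\, p_{\theta}) = \underbrace{-\sum_{v} q_i(v)\log p_{\theta}(v \mid x, y_{<i})}_{\text{cross-entropy}} - \underbrace{\Big(-\sum_{v} q_i(v)\log q_i(v)\Big)}_{\text{entropy of } q_i \,=\, 0} = -\log p_{\theta}(y_i \mid x, y_{<i}) $$

Summing over positions gives $-\log p_{\theta}(y \mid x)$, the NLL of the reference, so minimizing it maximizes the likelihood. Averaging over tokens and over training examples yields the objective: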

$$ \theta^{\ast}=\arg\max_{\theta}\;\frac{1}{|\mathcal{D}_{\mathrm{train}}|}\sum_{(x,y)\in\mathcal{D}_{\mathrm{train}}}\frac{1}{|y|}\log p_{\theta}(y \mid x), $$

where:

$$ p_{\theta}(y \mid x)=\prod_{i=1}^{|y|}p_{\theta}(y_i \mid x, y_{<i}) $$

Code
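As a minimal sketch of the quantity being optimized, the length-normalized log-likelihood $\frac{1}{|y|}\log p_{\theta}(y \mid x)$ can be computed from per-position logits as below. This is an illustration in plain Python, not the training code used here: the function names `log_softmax` and `avg_token_logprob` and the toy inputs are assumptions, and each logits row is assumed to already be conditioned on $x$ and $y_{<i}$.

```python
import math
from typing import List, Sequence


def log_softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized score vector."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def avg_token_logprob(
    logits: Sequence[Sequence[float]], target_ids: Sequence[int]
) -> float:
    """Return (1/|y|) * sum_i log p(y_i | x, y_<i).

    logits[i] holds the model scores for position i of the response
    (hypothetical input; assumed conditioned on x and y_<i).
    target_ids[i] is the reference token id y_i.
    """
    if len(logits) != len(target_ids) or len(target_ids) == 0:
        raise ValueError("logits and target_ids must be non-empty and aligned")
    total = 0.0
    for row, tok in zip(logits, target_ids):
        total += log_softmax(row)[tok]
    return total / len(target_ids)


# Toy example: uniform scores over a 4-token vocabulary, 3 response tokens.
toy_logits = [[0.0, 0.0, 0.0, 0.0]] * 3
toy_targets = [0, 1, 2]
toy_score = avg_token_logprob(toy_logits, toy_targets)
print(toy_score)  # approximately log(1/4), about -1.386
```

The SFT loss is the negative of this value, averaged over the training examples; in practice the same computation is done in batches on tensors, with prompt tokens masked out so that only response tokens contribute.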

Result

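After full-parameter SFT on Capybara-train, Qwen3-1.7B follows instructions on this dataset noticeably better than the base model, as measured by the likelihood it assigns to the reference answers. In other words, SFT systematically pushes the model's answers on this dataset toward the distribution of the reference answers $y$.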

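Specifically:

$$ \log p_{\mathrm{SFT}}(y \mid x) \gg \log p_{\mathrm{base}}(y \mid x) $$

That is, the fine-tuned model assigns a much higher probability to the reference answer than the untrained base model does.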

Win rate on the test set:

$$ \mathrm{WinRate}(\theta_{\mathrm{sft}}, \theta_{\mathrm{base}};\mathcal{D}_{\mathrm{test}})\;=\;\frac{1}{\left|\mathcal{D}_{\mathrm{test}}\right|}\sum_{(x,y)\in \mathcal{D}_{\mathrm{test}}}\mathbb{I}\!\left[\log p_{\mathrm{SFT}}(y \mid x)\;>\;\log p_{\mathrm{base}}(y \mid x)\right] $$

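Average margin (`avg_margin`):

$$ \frac{1}{|\mathcal{D}_{\mathrm{test}}|}\sum_{(x,y)\in\mathcal{D}_{\mathrm{test}}}\log \frac{p_{\mathrm{SFT}}(y \mid x)}{p_{\mathrm{base}}(y \mid x)} $$

The values reported for these metrics are in per-token units, meaning each log-probability is divided by the reference length $|y|$, as in the training objective.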
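The two metrics above can be sketched in a few lines, assuming the per-example log-probabilities of the reference under each model have already been computed. The function name `reference_preference` and the toy values are hypothetical and are not the evaluation data.

```python
from typing import Sequence, Tuple


def reference_preference(
    lp_sft: Sequence[float], lp_base: Sequence[float]
) -> Tuple[float, float]:
    """Compute (win_rate, avg_margin) from per-example log p(reference | prompt).

    lp_sft[i] and lp_base[i] are the scores of the same reference answer
    under the SFT and base models, in the same unit (e.g. per-token).
    """
    if len(lp_sft) != len(lp_base) or len(lp_sft) == 0:
        raise ValueError("score lists must be non-empty and aligned")
    # Margin is the log-ratio log(p_sft / p_base) = lp_sft - lp_base.
    margins = [s - b for s, b in zip(lp_sft, lp_base)]
    # A win is counted only when SFT is strictly better than base.
    win_rate = sum(1 for m in margins if m > 0) / len(margins)
    avg_margin = sum(margins) / len(margins)
    return win_rate, avg_margin


# Toy example with made-up scores for four test prompts.
toy_sft = [-0.5, -1.0, -2.0, -0.75]
toy_base = [-1.5, -2.0, -1.0, -1.75]
win_rate, avg_margin = reference_preference(toy_sft, toy_base)
print(f"win_rate={win_rate:.3f} avg_margin={avg_margin:+.4f}")
# prints: win_rate=0.750 avg_margin=+0.5000
```

Because the margin is a plain difference of log-probabilities, `avg_margin` always equals the mean SFT score minus the mean base score, which gives a quick sanity check on any reported numbers.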

# Reference preference evaluation (log p(reference | prompt))

Metric unit: per-token
n_used     : 200
win_rate   : 0.965   (fraction where SFT > BASE)
avg_margin : +1.2527  (SFT-BASE)
avg_lp_base: -2.0944  (avg logp_base)
avg_lp_sft : -0.8417  
avg_tokens : 286.3

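These results confirm the effect of SFT: on the 200 test examples used, the fine-tuned model scores the reference answer higher than the base model in 96.5% of cases, and the average per-token log-probability rises from -2.0944 to -0.8417, a margin of +1.2527.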