2026-05-05 Daily Digest

Theme: Three Scaling Law Papers and Domain Foundation Models in Practice

Tags: parameter-scaling · transformer · industrial · academic

📊 Stats: 9 papers · 5 deep reads · 🏢 Industry 3 · 🎓 Academia 6 · llm 3 · other 3 · discriminative-rec 3

Overview

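Today's batch has 9 papers: 3 on LLMs, 3 on discriminative recommendation, and 3 in other areas. Three come from industry (ByteDance, Meta, Tencent) and six from academia (including Cornell, Yale, and Tsinghua). The main thread is the refinement of scaling laws and the use of foundation models across domains. Among the three scaling law papers, ByteDance's InfoLaw recasts training as an information accumulation process and uses a quality density plus a log(K)-normalized exponential decay to collapse mixture × scale × repetition onto a single power law, extrapolating from 252M-1.2B models to 7B parameters and 425B tokens with a mean error of only 0.15%. Cornell's Prescriptive Scaling Law adds a single-parameter overfitting penalty to Chinchilla, lifting multi-epoch R² from 0.58 to 0.95, and reaches the counterintuitive conclusion that once compute exceeds a threshold you should scale the model rather than add epochs. Meta's Compute Optimal Tokenization uses 988 BLT models to generalize "20 token/param" into a tokenizer-invariant "60 byte/param". Yale's ReClaim trains a 1.7B Qwen3-style healthcare foundation model from scratch on 43.8B claims events from 200M enrollees, reaches 75.57% AUC on predicting 1,208 diseases, and brings embeddings into causal inference to cut EASE bias by 72%. Tencent's FEDIN introduces target-aware complex-valued spectral filtering with dual time and frequency branches for CTR prediction. Overall trend: scaling laws are entering a "prescriptive" stage where they can directly guide recipe selection, and the foundation model paradigm is spreading quickly into vertical domains such as healthcare.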

Featured Papers

InfoLaw: Information Scaling Laws for Large Language Models with Quality-Weighted Mixture Data and Repetition · ⭐ 9/10

🏢 ByteDance · LLM

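InfoLaw recasts LLM training as an information accumulation process. It introduces a quality density f_d = e^{-θd} and a log(K)-normalized exponential decay term 1 - e^{-λ(N)R/log(K)}, which together collapse the losses of different mixture × scale × repetition settings onto a single power law L = α·info^{-β}. Fitted on 252M-1.2B models with 3 mixtures, it extrapolates to 7B parameters and 425B tokens with 0.15% mean error and 0.96% max error, and it can pick the prescriptively optimal recipe out of 100k candidates (small models favor quality, large models favor diversity).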
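
A minimal Python sketch of the three expressions named in this summary, for readers who want to see their shapes. The digest does not say how the quality density and the repetition term are combined into `info`, so the sketch only evaluates each piece on its own. The function names, the use of the natural log, and treating λ(N) as a plain number `lam` at a fixed model size are all assumptions made for illustration.

```python
import math


def quality_density(d: float, theta: float) -> float:
    """f_d = exp(-theta * d): the weight decays exponentially in d."""
    return math.exp(-theta * d)


def repetition_term(R: float, K: float, lam: float) -> float:
    """1 - exp(-lam * R / log(K)).

    `lam` stands in for lambda(N) at one fixed model size (assumption).
    K must be greater than 1 so that log(K) is positive.
    """
    return 1.0 - math.exp(-lam * R / math.log(K))


def loss_from_info(info: float, alpha: float, beta: float) -> float:
    """Unified power law L = alpha * info ** (-beta)."""
    return alpha * info ** (-beta)
```

For fixed `lam` and `K`, `repetition_term` rises with `R` but stays below 1, which is consistent with diminishing returns from repeated data; `loss_from_info` falls as `info` grows.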

Prescriptive Scaling Laws for Data Constrained Training · ⭐ 8/10

🎓 Academia · LLM

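The paper adds a simple additive overfitting penalty P·R_D^δ·(N/U_D)^κ to the Chinchilla scaling law. With a single free parameter, multi-epoch R² jumps from 0.58 to 0.95. The fit yields a counterintuitive but empirically optimal allocation rule: once compute exceeds a threshold, scale the model rather than add epochs. It also isolates the cost of overfitting in a single coefficient P, which explains the observation that strong weight decay cuts P by 70% in data-constrained settings.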
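
The penalized form can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's fitted model: N is read as parameter count, U_D as unique training tokens, and R_D as the number of repeats beyond the first pass (epochs - 1). The default Chinchilla constants are the commonly quoted fit from the original Chinchilla paper and serve only as placeholders; any values passed for `P`, `delta`, and `kappa` are hypothetical.

```python
def chinchilla_loss(n_params, n_tokens, E=1.69, A=406.4, B=410.7,
                    alpha=0.34, beta=0.28):
    """Chinchilla-style parametric loss: E + A / N^alpha + B / D^beta."""
    return E + A / n_params ** alpha + B / n_tokens ** beta


def overfit_penalty(n_params, unique_tokens, repeats, P, delta, kappa):
    """Additive penalty P * R_D^delta * (N / U_D)^kappa from the summary."""
    return P * repeats ** delta * (n_params / unique_tokens) ** kappa


def penalized_loss(n_params, unique_tokens, epochs, P, delta, kappa):
    """Chinchilla loss plus the overfitting penalty for multi-epoch training."""
    # Assumption: D is the total number of tokens seen across all epochs,
    # and R_D counts repeats beyond the first pass over the data.
    total_tokens = unique_tokens * epochs
    repeats = epochs - 1
    return chinchilla_loss(n_params, total_tokens) + overfit_penalty(
        n_params, unique_tokens, repeats, P, delta, kappa
    )
```

With one epoch the penalty vanishes and the expression reduces to plain Chinchilla. The penalty grows with both the repeat count and the ratio N/U_D, so a large model on a small unique dataset pays more per extra epoch.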

Compute Optimal Tokenization · ⭐ 8/10

🏢 Meta · LLM

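Using 988 BLT models plus 320 subword models, this paper systematically studies how tokenizer compression rate affects scaling laws. It generalizes Chinchilla's '20 token/param' into '~60 byte/param, invariant across tokenizers', and finds that the optimal compression rate decreases with compute budget and increases with language parity.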
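
Converting between the byte rule and a per-tokenizer token rule is plain arithmetic, sketched below. The helper names are hypothetical, and `bytes_per_token` is assumed to be the tokenizer's average compression rate on the training corpus; the 60 figure is the approximate value quoted in this summary.

```python
BYTES_PER_PARAM = 60.0  # the ~60 byte/param figure quoted in the summary


def optimal_training_bytes(n_params: float) -> float:
    """Training bytes implied by the byte-per-parameter rule."""
    return BYTES_PER_PARAM * n_params


def optimal_tokens_per_param(bytes_per_token: float) -> float:
    """Tokens per parameter for a tokenizer with the given compression rate."""
    return BYTES_PER_PARAM / bytes_per_token
```

Under this reading, the byte rule coincides with '20 token/param' for a tokenizer averaging 3 bytes per token, and a tokenizer that compresses more (more bytes per token) needs fewer tokens per parameter to cover the same number of bytes.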

ReClaim · ⭐ 7/10

Foundation Models to Unlock Real-World Evidence from Nationwide Medical Claims

🎓 Academia · Other

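The Yale team trains ReClaim, a 1.7B Qwen3-style healthcare foundation model, from scratch on 43.8B claims events from 200M MarketScan enrollees. It reaches a mean AUC of 75.57% on onset prediction for 1,208 diseases, substantially outperforming LightGBM and Delphi. Instruct-token post-training on 100K samples yields a single-step gain of +13.76pp, and feeding the foundation model embeddings into propensity scores reduces EASE bias in RWE causal inference by 72%.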

FEDIN · ⭐ 7/10

FEDIN: Frequency-Enhanced Deep Interest Network for Click-Through Rate Prediction

🏢 Tencent · Discriminative Recommendation

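FEDIN finds empirically that, conditioned on the target item, the user interest spectrum shows a concentrated, low-entropy pattern. It proposes target-aware complex-valued MLP spectral filtering, a dual-branch design (a time-domain patch Transformer plus a frequency-domain branch), and Top-k Target Attention fusion, and it consistently outperforms baselines such as DIN, DIEN, SASRec, and DIFF on three public CTR datasets.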
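
The core operation behind complex-valued spectral filtering can be sketched with the standard library alone. This is a generic frequency-domain filter, not FEDIN's architecture: in the paper the per-frequency gains would come from an MLP conditioned on the target item, whereas here `gains` is a hypothetical list passed in by the caller, and a naive O(n²) DFT is used for readability.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(X):
    """Inverse of `dft`; returns complex values."""
    n = len(X)
    return [
        sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def spectral_filter(x, gains):
    """Scale each frequency bin of x by a complex gain, then return to the time domain.

    `gains` needs one complex number per bin. In a target-aware model these
    would be predicted from the target item (assumption, not shown here).
    """
    spectrum = dft(x)
    filtered = [value * gain for value, gain in zip(spectrum, gains)]
    return idft(filtered)
```

Gains of all ones leave the sequence unchanged, while keeping only the zero-frequency bin replaces every position with the sequence mean. A learned filter sits between these extremes and emphasizes whichever frequency bands matter for the current target.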

All Papers

Model	Title	Category	Affiliation	Abstract score	Deep-read score
-	InfoLaw: Information Scaling Laws for Large Language Models with Quality-Weighted Mixture Data and Repetition	LLM	🏢 ByteDance	8	9
-	Prescriptive Scaling Laws for Data Constrained Training	LLM	🎓 Academia	8	8
-	Compute Optimal Tokenization	LLM	🏢 Meta	8	8
ReClaim	Foundation Models to Unlock Real-World Evidence from Nationwide Medical Claims	Other	🎓 Academia	7	7
FEDIN	FEDIN: Frequency-Enhanced Deep Interest Network for Click-Through Rate Prediction	Discriminative	🏢 Tencent	7	7
BST-CDSR	Bridging Behavior and Semantics for Time-aware Cross-Domain Sequential Recommendation	Discriminative	🎓 Academia	6	-
PFA	Post-hoc Provider Fairness Adaptation via Hierarchical Exposure Alignment	Discriminative	🎓 Academia	5	-
GRAIL	GRAIL: A Deep-Granularity Hybrid Resonance Framework for Real-Time Agent Discovery via SLM-Enhanced Indexing	Other	🎓 Academia	5	-
Khala	Khala: Scaling Acoustic Token Language Models Toward High-Fidelity Music Generation	Other	🎓 Academia	4	-