2026-06-09 Daily Digest

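Theme: Two tracks: reinforcement-based preference alignment for industrial generative recommendation, and scaling of discriminative CTR models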

Tags: industrial · rl · semantic-id · parameter-scaling

📊 Stats: 15 papers · 6 deep reads · 🏢 Industry 4 · 🎓 Academic 11 · generative-rec 6 · discriminative-rec 3 · other 3 · llm 3

Overview

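Today: 15 papers, 6 read in depth; 6 on generative recommendation, 3 on discriminative recommendation, and 3 each on LLMs and other topics, with industry (JD, Netflix, Yandex, OPPO, Meta) leading. JD's AdaGRPO changes RL alignment for generative recommendation from "uniform reward application" to "selective admission": a sample-level clip lets gradients through only where the policy is uncertain and the reward is trustworthy, lifting offline HR@10 from 11.01% to 12.18% and online effective IPV by 0.43%. DeRes rebuilds the inter-layer connections of CTR Transformers with two paths, an identity residual and a block attention residual, so that 8 layers match a 16-layer OneTrans at under 5% extra FLOPs. Netflix's Mult-DPO generalizes DPO from pairwise to set-wise preferences with multiple positives, replacing the intractable Plackett-Luce likelihood with a multinomial surrogate and deriving a closed-form, tractable upper bound. Yandex's Gryphon adds an item-level scoring module on top of SID generative retrieval, decoupling item selection from miscalibrated beam likelihoods, and replaces 15+ retrievers as the sole candidate source. Meta's DUET pretrains two user embeddings by routing the click and conversion streams according to their statistical regimes. In terms of trends, preference alignment (GRPO/DPO/KTO) is becoming the main line of generative recommendation, semantic-ID retrieval is moving deeper into item-level scoring and prefix discriminability, and discriminative models continue to benefit from scaling.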

Featured Papers

AdaGRPO · ⭐ 8/10

Adaptive Loss Balancing for Noise-Robust GRPO in Generative Recommendation

🏢 JD · Generative recommendation

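JD.com's AdaGRPO changes RL alignment for generative recommendation from "uniform reward application" to "selective admission". It keeps the supervised NLL as a stationary anchor and combines two rank-based rollout diagnostics (policy-side difficulty × reward-side discriminability) by conjunction into a detached, binary sample-level clip. GRPO gradients pass only for samples where the policy is uncertain and the exposure-biased production ranker is locally trustworthy, which lifts PPO's clip from the ratio domain to the sample domain. Offline HR@10 improves from 11.01% to 12.18% (hallucination ≤0.22%), and the online A/B test shows significant gains, including effective IPV +0.43%.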
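
A minimal sketch of the sample-level gating idea described above, not JD's implementation. The two diagnostics used here (`target_rank` as a difficulty proxy, reward spread as a discriminability proxy), the thresholds, `rl_weight`, and the unclipped policy-gradient term are all assumptions made for illustration; only the group-normalized advantage follows standard GRPO.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages as in standard GRPO: (r - mean) / std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def sample_gate(target_rank, rewards, rank_threshold=1, spread_threshold=0.1):
    """Binary gate for one sample; it is a plain constant, so it carries no gradient.

    target_rank: 0-based rank of the ground-truth item among the rollouts when
        sorted by policy likelihood (hypothetical difficulty proxy).
    rewards: ranker rewards of the rollouts; their spread is a hypothetical
        proxy for whether the reward can tell the rollouts apart.
    """
    policy_uncertain = target_rank >= rank_threshold
    reward_informative = (max(rewards) - min(rewards)) >= spread_threshold
    return 1.0 if (policy_uncertain and reward_informative) else 0.0

def combined_loss(nll, log_probs, rewards, target_rank, rl_weight=0.5):
    """Supervised NLL anchor plus a gated policy-gradient term for one sample."""
    gate = sample_gate(target_rank, rewards)
    adv = group_advantages(rewards)
    pg = -mean([a * lp for a, lp in zip(adv, log_probs)])
    return nll + gate * rl_weight * pg
```

When either diagnostic fails, the gate is 0 and the sample contributes only its NLL term, which is the "selective admission" behavior in miniature.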

DeRes · ⭐ 8/10

DeRes: Decoupling Residual Stability and Adaptivity for Scalable CTR Prediction

🎓 Academic · Discriminative recommendation

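DeRes rebuilds the inter-layer connections of CTR Transformers with two paths, an identity residual (for stability) and a block attention residual (for adaptivity), plus a per-dimension vector gate. It also replaces Softmax with SiLU (Pointwise AttnRes) to support parallel multi-interest modeling and forgetting through negative weights. At under 5% extra FLOPs, 8 layers match a 16-layer OneTrans (scaling exponent γ = 0.118 vs. 0.071).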
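
To illustrate why swapping Softmax for SiLU matters, here is a small sketch under assumed shapes. The update rule in `dual_path_update` (identity path plus a gated mix of earlier layer outputs) is a guess at the general form, not the formulation from the DeRes paper, and all function and parameter names are hypothetical.

```python
import math

def silu(x):
    """SiLU(x) = x * sigmoid(x); slightly negative for negative inputs."""
    return x / (1.0 + math.exp(-x))

def softmax(xs):
    """Numerically stable softmax; weights are positive and sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def pointwise_weights(scores):
    """Each score is squashed independently, so weights do not compete."""
    return [silu(s) for s in scores]

def dual_path_update(h, f_out, earlier, scores, gate):
    """Identity path (h + f_out) plus a per-dimension gated mix of earlier outputs.

    h, f_out, gate: lists of length dim.
    earlier: list of earlier layer outputs, each of length dim.
    scores: one attention score per earlier output.
    """
    w = pointwise_weights(scores)
    dim = len(h)
    mixed = [sum(w[j] * earlier[j][d] for j in range(len(earlier)))
             for d in range(dim)]
    return [h[d] + f_out[d] + gate[d] * mixed[d] for d in range(dim)]
```

With Softmax, raising one weight must lower the others; with pointwise SiLU, several earlier outputs can receive large weights at once, and a negative score yields a small negative weight that subtracts that output. Setting `gate` to zeros recovers the plain identity residual.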

Mult-DPO · ⭐ 8/10

Mult-DPO: Multinomial Direct Preference Optimization for Recommender Systems

🏢 Netflix · Generative recommendation

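Mult-DPO generalizes DPO from pairwise preferences to the set-wise, multi-positive preferences of recommender systems. It replaces the intractable marginalized Plackett-Luce likelihood with a multinomial surrogate event over the same reward-weight space, derives a closed-form DPO objective, and proves that this objective is a tractable upper bound on the PL-DPO loss, with tightness characterized by the ratio of positive to negative cumulative weights. It further extends to multiple preference levels as Mult²-DPO.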
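
One way to picture a multinomial surrogate is sketched below. It assumes the surrogate event is "a single draw, with probability proportional to exp(implicit reward), lands in the positive set"; that choice, and the names `setwise_loss` and `implicit_reward`, are illustrative assumptions and may differ from the objective actually derived in the paper. The pairwise loss is the standard DPO logistic loss.

```python
import math

def implicit_reward(logp_policy, logp_ref, beta=0.1):
    """Standard DPO implicit reward: beta * log(pi / pi_ref)."""
    return beta * (logp_policy - logp_ref)

def logsumexp(xs):
    """Numerically stable log(sum(exp(x)))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def setwise_loss(pos_rewards, neg_rewards):
    """-log of the probability mass a softmax over all items puts on positives."""
    return logsumexp(pos_rewards + neg_rewards) - logsumexp(pos_rewards)

def pairwise_dpo_loss(r_w, r_l):
    """Standard DPO loss: -log sigmoid(r_w - r_l)."""
    return math.log1p(math.exp(-(r_w - r_l)))
```

With one positive and one negative, `setwise_loss([r_w], [r_l])` equals `pairwise_dpo_loss(r_w, r_l)`, so under this assumption the set-wise form contains pairwise DPO as a special case.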

Gryphon · ⭐ 7/10

Gryphon: A Unified Architecture for Semantic-ID Generation and Item-Level Scoring in Industrial Recommendations

🏢 Yandex · Generative recommendation

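Gryphon jointly trains an item-level scoring module (ILSM) on top of encoder-decoder semantic-ID generative retrieval. The module reuses the shared encoder's user representation to rescore the concrete items resolved from beam-generated SIDs, which decouples final item selection from miscalibrated beam likelihoods and resolves SID collisions. In an industrial music setting it achieves the highest item-level Recall@1000 (+3.7% over vanilla GR), and in a 7-day A/B test it replaces 15+ retrievers and the preranking stage as the sole candidate source with no significant change in listening time.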
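
The rescoring flow above can be sketched as follows. The dot-product scorer, the `sid_to_items` lookup, and every name in this block are assumptions made for illustration; the actual ILSM architecture is not described in this digest.

```python
def rescore(beam, sid_to_items, item_emb, user_vec, k):
    """Resolve beam SIDs to items and rank them by an item-level score.

    beam: list of (sid, beam_logprob) pairs from beam search.
    sid_to_items: maps a SID tuple to the items sharing it (collisions allowed).
    item_emb: maps an item id to its embedding (list of floats).
    user_vec: user representation from the shared encoder (list of floats).
    """
    candidates = []
    seen = set()
    for sid, _beam_logprob in beam:  # beam likelihood is not used for ranking
        for item in sid_to_items.get(sid, []):
            if item in seen:
                continue
            seen.add(item)
            score = sum(u * e for u, e in zip(user_vec, item_emb[item]))
            candidates.append((score, item))
    # Stable sort: ties keep the order in which the beam produced them.
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in candidates[:k]]
```

Because each item behind a colliding SID gets its own score, two items that share a SID are ranked separately, and an item from a low-likelihood beam can still come out on top.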

DUET · ⭐ 7/10

DUET -- Dual User Embedding Transformers for Offsite Conversion Prediction

🏢 Meta · Discriminative recommendation

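Meta's DUET splits upstream user-embedding pretraining for offsite conversion prediction by statistical regime: the dense click stream uses multi-layer self-attention (ClickAUE), and the sparse conversion stream uses cross-attention plus self-attention anchoring (ConvAUE). The two complementary embeddings are frozen and served asynchronously to the downstream ranker through event-triggered inference (ETI). Training NE drops by 0.38%, and online CVR improves by +0.66%/+0.15%.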

All Papers

Model	Title	Category	Company	Abstract score	Deep-read score
AdaGRPO	Adaptive Loss Balancing for Noise-Robust GRPO in Generative Recommendation	Generative	🏢 JD	8	8
DeRes	DeRes: Decoupling Residual Stability and Adaptivity for Scalable CTR Prediction	Discriminative	🎓 Academic	8	8
Mult-DPO	Mult-DPO: Multinomial Direct Preference Optimization for Recommender Systems	Generative	🏢 Netflix	7	8
Gryphon	Gryphon: A Unified Architecture for Semantic-ID Generation and Item-Level Scoring in Industrial Recommendations	Generative	🏢 Yandex	8	7
ToolRec	ToolRec: Calibrated Preference Alignment for Query Recommendation in On-Device Assistants	Generative	🎓 Academic	7	7
DUET	DUET -- Dual User Embedding Transformers for Offsite Conversion Prediction	Discriminative	🏢 Meta	7	7
PRO	Closing the Indexing-Decoding Gap in Multimodal Generative Retrieval via Prefix Retention Optimization	Generative	🎓 Academic	7	—
—	$τ$-Rec: A Verifiable Benchmark for Agentic Recommender Systems	LLM	🎓 Academic	6	—
REVEAL	Teach Multimodal Recommendation Model to See via Personalized Visual Extraction and Adaptive Learning	Discriminative	🎓 Academic	6	—
OneFeed	OneFeed: A Unified Generative Framework for Feed Content Enhancement and Query Generation	Generative	🎓 Academic	5	—
Hypnos	Next-Token Prediction Learns Generalisable Representations of Sleep Physiology	Other	🎓 Academic	5	—
—	Popcorn: A Configurable Benchmark for Visual Evidence in Multimodal Movie Recommendation	Other	🎓 Academic	5	—
—	Explaining Data Mixing Scaling Laws	LLM	🎓 Academic	5	—
MetaPlate	MetaPlate: Counterfactual-Guided RAG-LLM Tool for Personalized Food Recommendation and Hyperglycemia Prevention	Other	🎓 Academic	4	—
—	UXBench: Benchmarking User Experience in AI Assistants	LLM	🎓 Academic	4	—