2014-2020: PhD at CMU, advised by Tuomas Sandholm. (Gabriele Farina: CMU, 2016-2022.)
Background: the CFR+ algorithm solved two-player limit hold'em (Heads-up Limit Hold'em Poker is Solved, Science, 2015), but HUNL (Heads-Up No-Limit Texas Hold'em) has about 10^161 information sets, far beyond the roughly 10^14 of the limit game.
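CFR and CFR+ are built on regret matching. A minimal self-play sketch on rock-paper-scissors (a toy stand-in for a single poker information set, not the source's actual solver) shows how the *averaged* strategy approaches a Nash equilibrium:

```python
def get_strategy(regrets):
    """Regret matching: mix over actions in proportion to positive regret."""
    pos = [max(r, 0.0) for r in regrets]
    total = sum(pos)
    n = len(regrets)
    return [p / total for p in pos] if total > 0 else [1.0 / n] * n

# Rock-paper-scissors payoff matrix for player 0; player 1 receives the negation.
PAYOFF = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]

def solve_rps(iters=20000):
    # small initial asymmetry so play does not start (and stay) trivially uniform
    regrets = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    strat_sum = [[0.0] * 3 for _ in range(2)]
    for _ in range(iters):
        strats = [get_strategy(regrets[p]) for p in range(2)]
        for p in range(2):
            opp = strats[1 - p]
            # expected utility of each pure action against the opponent's mix
            if p == 0:
                util = [sum(PAYOFF[a][b] * opp[b] for b in range(3)) for a in range(3)]
            else:
                util = [sum(-PAYOFF[b][a] * opp[b] for b in range(3)) for a in range(3)]
            node_value = sum(s * u for s, u in zip(strats[p], util))
            for a in range(3):
                regrets[p][a] += util[a] - node_value
                strat_sum[p][a] += strats[p][a]
    # the *average* strategy, not the last iterate, converges to equilibrium
    return [[s / iters for s in strat_sum[p]] for p in range(2)]

avg = solve_rps()
# both averaged strategies approach the uniform Nash equilibrium (1/3, 1/3, 1/3)
```

The current-iterate strategies cycle; only the time average converges, which is exactly why CFR-style methods track an average strategy.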
Methods:
Re-Solving, concretely:
Achievements:
Methods:
Offline self-play:
Real-time solving:
Achievements:
Comments: Noam Brown, 2024-07-25:
5 years ago we revealed Pluribus, the first superhuman multiplayer poker AI. It cost only $150 to train. Why did poker take longer than Go? And how did it end up being so cheap? The answer is a cautionary tale on the danger of overoptimizing for benchmarks with relevance to LLMs today.
The Annual Computer Poker Competition (ACPC) was the premier poker AI benchmark. Every year starting in 2006, all the poker AI research labs would gather at the ACPC and play their bots against each other. Winning the ACPC was prestigious so researchers put a lot of effort into their submissions.
To keep costs low, the ACPC limited submissions to using only two CPU cores for inference and a time limit of a few seconds per hand. However, unlimited resources were allowed for pretraining.
These constraints influenced research directions: teams spent $10,000+ on pretraining but neglected planning algorithms that used a lot of test-time compute. It turns out those planning algorithms were critical to beating top humans.
Pluribus didn't qualify for the ACPC -- its planning algorithms used 28 CPU cores for 20+ seconds per hand. But it beat human experts.
The lesson I learned from this is to not overoptimize for intermediate benchmarks. Yes, benchmarks can give an indication of progress, but focusing on them too much might lead you down a path that takes you away from the ultimate goal.
I think about this often when I look at LLM benchmarks these days.
A game means each player's payoff depends on the other players' strategies, so no single player's optimal strategy can be solved in isolation; one can only solve for a Nash equilibrium, i.e., a strategy profile over all players.
In a two-player zero-sum symmetric game, the Nash equilibrium is the no-lose strategy for both sides, which is also the best strategy for an individual player.
A Nash-equilibrium strategy is conservative/defensive and does not necessarily maximize payoff; the alternative is to model the opponent's existing strategy (opponent modeling and exploitation), at the risk of being counter-exploited.
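To make the equilibrium-vs-exploitation trade-off concrete, a toy computation (rock-paper-scissors; the numbers are illustrative, not from the source): the best response earns nothing against the equilibrium mix, but profits against any biased opponent.

```python
def best_response_value(payoff, opp_mix):
    """Value of the row player's best pure response against a fixed
    (known) column-player mixed strategy -- the 'exploitation' payoff."""
    return max(sum(row[b] * opp_mix[b] for b in range(len(opp_mix)))
               for row in payoff)

# Rock-paper-scissors payoff matrix for the row player.
RPS = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]

best_response_value(RPS, [1/3, 1/3, 1/3])    # 0.0: equilibrium cannot be exploited
best_response_value(RPS, [0.5, 0.25, 0.25])  # 0.25: a rock-heavy player is exploitable
```

The same quantity measured against your *own* strategy is its exploitability: the counter-exploitation risk mentioned above.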
Improving long-horizon reasoning: one reason autoregressive models struggle to improve on mathematical reasoning is that they cannot revise an answer; relying solely on generative methods (the pre-training recipe) and larger parameter counts will not yield much.
Counterargument: Zeyuan Allen-Zhu's papers (the "Physics of Language Models" series).
The pre-training scaling law says training loss (lower = more capable model) follows an inverse power law in data volume D (token count) and parameter count N; in the Chinchilla form, L(N, D) = E + A/N^α + B/D^β.
Concretely, training compute C ≈ 6·N·D FLOPs. The backward pass costs about twice the forward pass, and the forward pass costs about 2·N·D (the factor of 2 counts one multiplication and one addition each as a floating-point operation: multiplying an m×n matrix by an n×p matrix takes n multiplications and n additions per output element, 2·m·n·p FLOPs in total).
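The standard C ≈ 6·N·D approximation as arithmetic (175B parameters / 300B tokens are GPT-3-scale numbers, chosen purely for illustration):

```python
def training_flops(n_params, n_tokens):
    # forward pass ≈ 2*N*D FLOPs; backward pass ≈ 2x forward; total ≈ 6*N*D
    return 6 * n_params * n_tokens

print(f"{training_flops(175e9, 300e9):.2e}")  # 3.15e+23
```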
The post-training scaling law should likewise relate training loss inversely to data volume (model parameters? those should already be fixed at this stage); but now the data quality itself depends on the capability of the already-trained model, and there is additionally the compute invested at inference time.
We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute).
Training Compute-Optimal Large Language Models
o1 official report: Learning to Reason with LLMs
Scaling Test-Time Compute Optimally Can be More Effective than Scaling LLM Parameters
Noam Brown: problems can all be classified as either easier to verify than to generate (Sudoku) or harder to verify than to generate (naming the capital of Bhutan). Is there a real difference between verifiers at these two extremes?
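The Sudoku side of that asymmetry in code: verifying a completed grid is a linear scan, while *solving* one is worst-case exponential search.

```python
def is_valid_sudoku(grid):
    """Check a completed 9x9 grid in O(81) time; solving from a partial
    grid, by contrast, is exponential search in the worst case."""
    units = []
    units += [row for row in grid]                                    # 9 rows
    units += [[grid[r][c] for r in range(9)] for c in range(9)]       # 9 columns
    units += [[grid[3*br + r][3*bc + c] for r in range(3) for c in range(3)]
              for br in range(3) for bc in range(3)]                  # 9 boxes
    return all(sorted(u) == list(range(1, 10)) for u in units)

# A valid grid built from a standard shift construction:
grid = [[(r * 3 + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]
print(is_valid_sudoku(grid))  # True
```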
A clean and scalable approach: just have the AI think for longer; it then develops abilities like backtracking and self-correction almost emergently.
Its emergence means large models are no longer bound by data volume (we have already used essentially all the data) or training compute (pre-training is far too expensive); AI will not hit a wall in the foreseeable future! This is a scaling law along a different dimension!
Bottleneck? Finding the inputs on which the model needs more compute.
Approaches:
Training Verifiers to Solve Math Word Problems
Let's Verify Step by Step
Generative Verifiers: Reward Modeling as Next-Token Prediction
CoT: generate a sequence of intermediate reasoning steps, step by step.
Large Language Models are Zero-Shot Reasoners
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Majority voting:
Self-Consistency Improves Chain of Thought Reasoning in Language Models
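Self-consistency reduces to sampling several CoT rollouts and majority-voting on the final answers. A minimal sketch, where `sample_fn` is a hypothetical stand-in for one stochastic CoT generation that returns a final answer:

```python
from collections import Counter

def self_consistency(sample_fn, n=16):
    """Sample n chain-of-thought rollouts and return the majority-vote
    final answer. `sample_fn` is a placeholder, not a real model call."""
    answers = [sample_fn() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Toy deterministic 'sampler' standing in for a stochastic model:
import itertools
fake = itertools.cycle(["18", "18", "20"]).__next__
print(self_consistency(fake, n=9))  # "18" wins 6 of 9 votes
```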
No prompting required:
MCTS: model tokens or sentences as nodes, then supply reward information.
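For reference, the node-selection rule usually used in MCTS is UCT (notation assumed here: Q̄ the mean value of action a at state s, N the visit counts, c an exploration constant):

```latex
a^{*} = \arg\max_{a}\left[\,\bar{Q}(s,a) + c\,\sqrt{\frac{\ln N(s)}{N(s,a)}}\,\right]
```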
STaR:
Quiet-STaR: STaR depends on few-shot reasoning exemplars and is restricted to specific structured tasks (e.g., question answering); Quiet-STaR turns the explicit-rationale reasoning process into implicit reasoning inside the model, removing the dependence on external exemplars.
Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
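The STaR bootstrap loop can be sketched in a few lines (a sketch only; `generate` and `finetune` are hypothetical placeholders, not any library's API):

```python
def star_round(generate, finetune, problems, answers):
    """One STaR-style bootstrapping round: sample a rationale + answer
    per problem, keep only rationales whose final answer is correct
    (the answer key acts as the verifier), then fine-tune on the kept
    (question, rationale, answer) triples."""
    kept = []
    for q, gold in zip(problems, answers):
        rationale, pred = generate(q)   # one sampled CoT and its final answer
        if pred == gold:
            kept.append((q, rationale, pred))
    return finetune(kept)
```

Iterating this loop lets the model teach itself from its own filtered rationales, which is the dependence on correctness labels (rather than rationale labels) that Quiet-STaR then relaxes.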
OpenAI o1? 1. Dynamically adjust the thinking-token budget (depth-limited search). 2. For complex tasks, how do you provide fine-grained rewards for the internal thinking process?
The key is to optimize the model's internal process of generating sound reasoning (implicit CoT): how do you construct the corresponding reward? 1. Tree search to generate internal rationales. 2. Process rewards to handle the challenge of long-horizon dependencies. 3. A critic model to handle complex problems for which the model cannot produce sound reasoning on its own.
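Points 1 and 2 combine naturally as step-level beam search scored by a process reward model. A heavily simplified sketch, where `propose(partial)` (candidate next steps) and `prm_score(partial)` (a learned step-level reward) are assumed placeholders:

```python
import heapq

def step_beam_search(propose, prm_score, depth=3, beam=4):
    """Tree search over rationale steps guided by a process reward model:
    at each depth, keep only the `beam` partial rationales the PRM scores
    highest, and finally return the best complete rationale."""
    frontier = [[]]  # partial rationales, each a list of steps
    for _ in range(depth):
        candidates = [p + [s] for p in frontier for s in propose(p)]
        frontier = heapq.nlargest(beam, candidates, key=prm_score)
    return max(frontier, key=prm_score)
```

Because the PRM scores every prefix, credit assignment happens per step rather than only at the final answer, which is the point of process rewards for long-horizon problems.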
But my sense is that repeated sampling would suffice. See the figure below.
Reinforced Self-Training (ReST) for Language Modeling
Self-Rewarding Language Models
Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
Improve Mathematical Reasoning in Language Models by Automated Process Supervision
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models
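The repeated-sampling results above reduce to a simple coverage curve; assuming i.i.d. samples with per-sample success rate p:

```python
def pass_at_k(p, k):
    """Probability that at least one of k independent samples is correct,
    given per-sample success probability p."""
    return 1 - (1 - p) ** k

pass_at_k(0.01, 1)     # ≈ 0.01
pass_at_k(0.01, 1000)  # ≈ 0.99996: tiny per-sample rates compound quickly
```

This is why weak models with a reliable verifier can still reach high solve rates under heavy inference compute.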
Summary of what we have learned during AMA hour with the OpenAI o1 team today
Finding GPT-4's mistakes with GPT-4 (introduces CriticGPT)
Generative Language Modeling for Automated Theorem Proving
Some of this material comes from the "Principles and Applications of Computational Game Theory" course slides by Prof. Junliang Xing (Tsinghua) and Prof. Kai Li (Institute of Automation, CAS).