1. **📘 Topic and Domain:** The paper addresses efficient inference-time scaling for large language models through architectural innovation in decoder-decoder Transformer designs.
2. **💡 Previous Research and New Ideas:** Building on the YOCO (You Only Cache Once) decoder-decoder architecture and Universal Transformer, the paper proposes combining YOCO with recursive computation via a Universal Self-Decoder that iterates efficient-attention layers.
3. **❓ Problem:** Standard Transformers and prior recursive approaches like Universal Transformer suffer from high computational overhead and linearly growing KV cache as depth increases, making efficient inference-time scaling difficult.
4. **🛠️ Methods:** YOCO-U replaces the static Self-Decoder with a Universal Self-Decoder that performs T iterations of parameter-shared efficient self-attention (e.g., sliding-window attention) to enhance representational depth while keeping the Cross-Decoder unchanged for constant global KV cache.
5. **📊 Results and Evaluation:** YOCO-U achieves 0.033 lower loss than YOCO at equal FLOPs, requires ~62% fewer training tokens for comparable performance, and maintains efficient inference with linear pre-filling and negligible KV cache overhead, while outperforming baselines on general and long-context benchmarks.
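The Methods item above can be sketched in toy form: one set of attention parameters created once and applied for T recursive iterations, with a causal sliding-window mask keeping each token's attention local. This is a hedged, hypothetical single-head NumPy sketch, not the paper's implementation; layer norms, FFN sub-layers, and the Cross-Decoder are omitted, and all shapes and the window size are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, window, T = 8, 4, 3  # toy model width, window size, iteration count

# Parameters are created ONCE and shared across all T iterations --
# the defining property of a Universal (recursive) Self-Decoder.
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sliding_window_attn(x):
    """Single-head causal sliding-window attention with the shared weights."""
    n = x.shape[0]
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    i = np.arange(n)
    # Causal window: token i attends only to positions [i - window + 1, i].
    mask = (i[None, :] <= i[:, None]) & (i[None, :] > i[:, None] - window)
    scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ v

def universal_self_decoder(x, iters=T):
    # The SAME attention function (same weights) iterated `iters` times,
    # with a residual connection each step: depth grows, parameters don't.
    for _ in range(iters):
        x = x + sliding_window_attn(x)
    return x

x = rng.standard_normal((16, d))
y = universal_self_decoder(x)
print(y.shape)  # (16, 8)
```

Because the windowed attention is local and the weights are reused, raising T deepens the computation without adding parameters, while the (unchanged) Cross-Decoder would consume the global KV cache exactly as in YOCO.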
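The KV-cache contrast in the Problem and Results items comes down to simple arithmetic: a standard decoder caches K/V in every layer, while a YOCO-style decoder-decoder caches the global K/V once and reuses it across the Cross-Decoder, so iterating the Self-Decoder adds no global cache. The sketch below is a back-of-envelope illustration with assumed (not the paper's) shapes.

```python
# Illustrative KV-cache comparison; all sizes below are assumptions,
# not configurations from the paper.

def kv_cache_bytes(n_layers_cached, seq_len, n_heads, head_dim, dtype_bytes=2):
    """Bytes held in the KV cache: 2 tensors (K and V) per cached layer."""
    return 2 * n_layers_cached * seq_len * n_heads * head_dim * dtype_bytes

seq_len, n_heads, head_dim = 32_768, 16, 128

# Standard decoder: each of (say) 24 layers caches its own K/V,
# so cache size grows linearly with depth.
standard = kv_cache_bytes(24, seq_len, n_heads, head_dim)

# YOCO-style: global K/V cached once; recursing the Self-Decoder
# T times leaves this figure unchanged.
yoco_like = kv_cache_bytes(1, seq_len, n_heads, head_dim)

print(standard // 2**20, "MiB vs", yoco_like // 2**20, "MiB")  # 6144 MiB vs 256 MiB
```

Under these assumed shapes the per-layer cache dominates memory at long context, which is why a constant global cache is what makes deep, recursive computation affordable at inference time.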