2026-01-19 Papers


Paper 1

The Poisoned Apple Effect: Strategic Manipulation of Mediated Markets via Technology Expansion of AI Agents

Published: 2026-01-16

Link: http://arxiv.org/pdf/2601.11496

1. 📘 Topic and Domain: Strategic manipulation of AI-mediated economic markets through technology expansion, focusing on game theory and market design.
2. 💡 Previous Research and New Ideas: Builds on classical game-theoretic models (bargaining, negotiation, persuasion); introduces the "Poisoned Apple Effect," in which releasing a technology that is never used can still manipulate market regulation.
3. ❓ Problem: Addresses how expanding available AI technologies in regulated markets can be exploited to manipulate regulatory outcomes and market equilibrium.
4. 🛠️ Methods: Used the GLEE dataset to simulate 580,000 strategic decisions across 13 LLMs in three game types (bargaining, negotiation, persuasion), analyzing meta-game equilibria before and after technology expansion.
5. 📊 Results and Evaluation: Found that in roughly 33% of cases, technology expansion produced opposite payoff changes for the two players even though the new technology was never used, demonstrating that strategic manipulation is possible; regulatory metrics worsened in 40% of cases when the market design remained static.
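As a quick sanity check on the numbers above, here is a minimal Python sketch of the paper's fairness metric, Fairness = 1 - 4(p - 0.5)², and of the payoff-reversal condition. The Alice/Bob payoffs are the illustrative figures from the paper's diagram, not newly derived data.

```python
# Sketch of the paper's fairness metric and payoff-reversal check.
# Fairness = 1 - 4*(p - 0.5)**2, where p is one player's share of the total.
# The Alice/Bob numbers mirror the paper's illustrative figures, not new data.

def fairness(p: float) -> float:
    """1.0 at an even split (p = 0.5), falling toward 0.0 as the split skews."""
    return 1.0 - 4.0 * (p - 0.5) ** 2

def payoff_reversal(before: tuple, after: tuple) -> bool:
    """True when the two players' payoffs move in opposite directions."""
    d_alice = after[0] - before[0]
    d_bob = after[1] - before[1]
    return d_alice * d_bob < 0

before = (0.49, 0.50)   # status-quo equilibrium (Market 4)
after = (0.52, 0.46)    # after the regulator switches to Market 8

print(round(fairness(before[0] / sum(before)), 3))   # ~1.0: near-even split
print(payoff_reversal(before, after))                # True: Alice up, Bob down
```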

Figure: "The Poisoned Apple Effect" method workflow

- Setup: GLEE dataset; 580K strategic decisions; 13 LLMs; 1,320 configurations; three game types (bargaining / resource division, negotiation / bilateral trade, persuasion / information transmission).
- Phase 1 (status quo): technologies A-D available; the regulator selects Market 4 at the Nash equilibrium (Alice: 0.49, Bob: 0.50; fairness 1.000).
- Technology release: Alice introduces Model E (the "poisoned apple"); Market 4 fairness drops to 0.976.
- Regulator response: re-optimizes the market design and switches to Market 8 (fairness 0.990).
- Final equilibrium: Model E is not used, yet payoffs reverse: Alice 0.52 (+0.03), Bob 0.46 (-0.04).
- Market parameters: information structure, communication form, game horizon; 8 market configurations.
- Statistical analysis (linear regression): 50,000+ simulated meta-games; 33% show zero-sum payoff shifts, with 1/3 of these occurring without adoption of the new technology; 40% of cases harmed under regulatory inertia; fairness vs. efficiency trade-offs examined.
- Key findings: technology availability ≠ adoption; strategic manipulation is possible; static regulation is vulnerable; dynamic market design is necessary; regulatory arbitrage is a threat.
- Policy implications: market design must be adaptive, monitor technology releases, account for strategic incentives, prevent regulatory capture, and remain robust to manipulation.
- Mathematical framework: Fairness = 1 - 4(p - 0.5)²; Efficiency = Σ discounted payoffs; Nash equilibria via the Lemke-Howson algorithm.
- Experimental design: baseline with N available technologies; expansion adds one new technology; measure changes in payoffs and regulatory metrics.
- Method flow: (1) establish baseline, (2) release technology, (3) regulator responds, (4) measure manipulation.
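The four-step method flow (baseline, release, regulator response, measurement) can be mimicked in a toy meta-game. All payoff tables below are invented for illustration and are not the paper's data: the point is only that when equilibrium payoffs depend on which technologies are available, a release can flip the regulator's choice without the new technology ever being played.

```python
# Toy meta-game illustrating the Poisoned Apple Effect. All payoff tables are
# invented for illustration; they are NOT the paper's data. Equilibrium payoffs
# depend on which technologies are *available* (availability shapes threat
# points), so releasing Model E can flip the regulator's market choice even
# though E is never actually played.

def fairness(alice: float, bob: float) -> float:
    p = alice / (alice + bob)
    return 1.0 - 4.0 * (p - 0.5) ** 2

# equilibrium payoffs (alice, bob) per market, keyed by "is Model E available?"
payoffs = {
    "market4": {False: (0.49, 0.50), True: (0.45, 0.53)},
    "market8": {False: (0.40, 0.55), True: (0.52, 0.46)},
}

def regulator_choice(e_available: bool) -> str:
    """The regulator picks the market with the fairest equilibrium."""
    return max(payoffs, key=lambda m: fairness(*payoffs[m][e_available]))

print(regulator_choice(False))   # market4: baseline choice
print(regulator_choice(True))    # market8: flipped by the unused release
```

With E unavailable the regulator stays in market4 (payoffs 0.49/0.50); once E exists, market8 becomes the fairness-optimal design and payoffs shift to 0.52/0.46, reproducing the reversal pattern from the figure.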
Q1
1. What is the primary mechanism of the 'Poisoned Apple Effect' described in the paper?
Releasing a new AI technology that outperforms all existing models
Releasing a new technology that remains unused but forces regulatory changes benefiting the releaser
Introducing a technology that creates perfect market equilibrium
Q2
2. In the paper's experimental framework, what percentage of cases showed payoff reversals where the new technology remained unused?
Approximately 10%
Approximately 33%
Approximately 50%
Q3
3. Which game type was NOT included in the GLEE dataset analysis?
Auction markets
Bargaining games
Persuasion games

Paper 2

Your Group-Relative Advantage Is Biased

Published: 2026-01-13

Link: http://arxiv.org/pdf/2601.08521

1. 📘 Topic and Domain: Group-based reinforcement learning for training large language models (LLMs) on reasoning tasks, specifically focusing on advantage estimation in Reinforcement Learning from Verifier Rewards (RLVR).
2. 💡 Previous Research and New Ideas: Builds on GRPO (Group Relative Policy Optimization) and its variants; presents the finding that group-relative advantage estimation is inherently biased.
3. ❓ Problem: Addresses the systematic bias in group-relative advantage estimation, which underestimates advantages for hard prompts and overestimates them for easy prompts, leading to imbalanced exploration and exploitation.
4. 🛠️ Methods: Introduced History-Aware Adaptive Difficulty Weighting (HA-DW), which adjusts advantage estimates based on an evolving difficulty anchor and training dynamics to correct biased estimation.
5. 📊 Results and Evaluation: HA-DW consistently improved performance when integrated into GRPO and its variants across five mathematical reasoning benchmarks, even outperforming GRPO run with larger numbers of rollouts.
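For context, here is a minimal sketch of the group-relative (GRPO-style) advantage the paper analyzes. The bias result concerns the expectation of this estimator over group sampling (the paper's Theorem 1), which these toy numbers do not reproduce; the sketch only shows the estimator itself, plus the degenerate all-fail case that is common on hard prompts.

```python
# Minimal sketch of a GRPO-style group-relative advantage on binary verifier
# rewards: A_i = (r_i - mean(r)) / std(r) within a group of G rollouts.
import statistics

def group_relative_advantages(rewards):
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    if sigma == 0:
        # all-correct or all-wrong group: the advantage signal vanishes,
        # a frequent outcome on very hard (all-fail) prompts
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]

hard = [1, 0, 0, 0, 0, 0, 0, 0]   # hard prompt: 1/8 rollouts succeed
easy = [1, 1, 1, 1, 1, 1, 1, 0]   # easy prompt: 7/8 rollouts succeed

print(round(group_relative_advantages(hard)[0], 3))   # 2.646
print(round(group_relative_advantages(easy)[0], 3))   # 0.378
print(group_relative_advantages([0, 0, 0, 0]))        # [0.0, 0.0, 0.0, 0.0]
```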

Figure: "Your Group-Relative Advantage Is Biased" method workflow

- Problem discovery: the group-relative advantage estimator is biased.
- Theoretical analysis (Theorem 1, bias characterization): advantages for hard prompts are underestimated; advantages for easy prompts are overestimated.
- Solution: History-Aware Adaptive Difficulty Weighting (HA-DW), a dynamic reweighting scheme.
- Phase 1 (evolving difficulty anchor): batch observation y_t = K_t / B_t; belief update C_t = (1 - η) C_{t-1} + η y_t; adaptive forgetting η_t = η σ_t.
- Phase 2 (adaptive difficulty reweighting): difficulty diff = p̂_t - C_t; direction D = -sgn(Â) sgn(diff); magnitude M = |diff|; weight Φ_{t,i} = λ_scale exp(D_{t,i} M_t).
- HA-DW enhanced objective: L_{HA-DW}(θ) = (1/G) Σ ψ(π_θ / π_{θ_old}) φ(Â_{t,i}) Φ_{t,i}; corrects the bias by boosting hard prompts and suppressing easy ones.
- Integrations: GRPO + HA-DW (clipped surrogate with reweighting); GSPO + HA-DW (sequence-level); DAPO + HA-DW (token-level).
- Experimental results: consistent improvements across five mathematical benchmarks; stronger gains on hard prompts (+3.4% on MATH Level 4-5); outperforms simply increasing rollouts while using fewer resources.
- Theoretical validation: Theorem 3 (bias mitigation); Lemma 1 (baseline rectification); extended to continuous reward distributions.
- Key innovation: cross-batch history combined with dynamic reweighting.
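The Phase 1 and Phase 2 equations translate almost directly into code. A minimal sketch follows; the hyperparameter values for eta and lam_scale are placeholders of my choosing, not the paper's.

```python
# Direct translation of the figure's Phase 1 / Phase 2 equations. The
# hyperparameters eta and lam_scale are placeholders, not the paper's values.
import math

def update_anchor(c_prev: float, k_t: int, b_t: int, eta: float = 0.1) -> float:
    """Phase 1: y_t = K_t / B_t, then C_t = (1 - eta)*C_{t-1} + eta*y_t."""
    y_t = k_t / b_t                       # observed batch success rate
    return (1 - eta) * c_prev + eta * y_t

def hadw_weight(adv: float, p_hat: float, anchor: float,
                lam_scale: float = 1.0) -> float:
    """Phase 2: Phi = lam_scale * exp(D * M), D = -sgn(adv)*sgn(diff), M = |diff|."""
    diff = p_hat - anchor                 # prompt difficulty relative to anchor
    d = -math.copysign(1.0, adv) * math.copysign(1.0, diff) if adv and diff else 0.0
    return lam_scale * math.exp(d * abs(diff))

anchor = update_anchor(c_prev=0.5, k_t=24, b_t=64)   # batch: 24/64 correct
# hard prompt (p_hat below anchor) with positive advantage: boosted (> 1)
print(hadw_weight(adv=1.0, p_hat=0.2, anchor=anchor) > 1.0)   # True
# easy prompt (p_hat above anchor) with positive advantage: suppressed (< 1)
print(hadw_weight(adv=1.0, p_hat=0.9, anchor=anchor) < 1.0)   # True
```

This is exactly the corrective direction the figure describes: positive advantages on harder-than-anchor prompts are amplified, while the same advantage on an easier-than-anchor prompt is damped.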
Q1
1. What is the fundamental issue discovered in group-based RL according to the paper?
Group-relative advantage estimator has high variance across different model architectures
Group-relative advantage estimator systematically exhibits bias based on prompt difficulty
Group-relative advantage estimator requires too many computational resources
Q2
2. How does HA-DW improve upon existing group-based RL methods?
By completely replacing the group-relative advantage estimation
By using a larger number of rollouts for each prompt
By dynamically adjusting advantage weights based on evolving difficulty anchors
Q3
3. Which of the following best describes how the bias manifests in group-relative advantage estimation?
It overestimates advantages for both easy and hard prompts
It underestimates advantages for hard prompts while overestimating for easy prompts
It provides unbiased estimates for hard prompts but biased estimates for easy ones

Paper 3

Controlled Self-Evolution for Algorithmic Code Optimization

Published: 2026-01-12

Link: http://arxiv.org/pdf/2601.07348

1. 📘 Topic and Domain: The paper focuses on algorithmic code optimization using controlled self-evolution in the domain of code generation with Large Language Models.
2. 💡 Previous Research and New Ideas: Based on previous self-evolution methods that use "generate-verify-refine" cycles, it introduces new ideas of diversified planning initialization, genetic evolution with feedback-guided mechanisms, and hierarchical evolution memory.
3. ❓ Problem: The paper addresses the low exploration efficiency of existing self-evolution methods that fail to discover solutions with superior complexity within limited computational budgets.
4. 🛠️ Methods: The paper implements Controlled Self-Evolution (CSE) with three key components: diversified planning initialization for broad solution space coverage, genetic evolution with feedback-guided mutation and crossover, and hierarchical evolution memory for capturing both inter-task and intra-task experiences.
5. 📊 Results and Evaluation: Testing on EffiBench-X demonstrated that CSE consistently outperformed all baselines across various LLM backbones, achieving higher efficiency from early generations and maintaining continuous improvement throughout evolution.
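To make the three components concrete, here is a schematic sketch of a CSE-style loop. Every operator here is a toy stand-in for the paper's LLM-driven generate-verify-refine machinery: solutions are integers, the verifier's reward is distance to a hidden optimum, and "feedback-guided mutation" nudges a parent in the direction the verifier suggests.

```python
# Schematic CSE-style loop. Every piece is a toy stand-in for the paper's
# LLM-driven machinery: "solutions" are integers, the verifier scores distance
# to a hidden optimum, and feedback-guided mutation nudges a parent in the
# direction the verifier suggests.
import random

OPTIMUM = 42                               # hidden target the verifier knows

def reward(y: int) -> int:                 # stand-in for the reward F(y, x)
    return -abs(y - OPTIMUM)

def feedback(y: int) -> int:               # verifier hint: which way to move
    return (OPTIMUM > y) - (OPTIMUM < y)

def mutate(parent: int) -> int:            # controlled, feedback-guided mutation
    return parent + feedback(parent)

def crossover(a: int, b: int) -> int:      # compositional crossover
    return (a + b) // 2

def evolve(pop_size: int = 8, generations: int = 100, seed: int = 0) -> int:
    random.seed(seed)
    population = [random.randint(0, 100) for _ in range(pop_size)]  # P0
    for _ in range(generations):
        population.sort(key=reward, reverse=True)
        parents = population[: pop_size // 2]              # parent selection
        children = [mutate(p) for p in parents]
        children += [crossover(*random.sample(parents, 2)) for _ in parents]
        # elitism: parents survive, so the best solution never regresses
        population = sorted(parents + children, key=reward, reverse=True)[:pop_size]
    return population[0]

print(evolve(), reward(evolve()))   # 42 0: the loop recovers the optimum
```

Because mutation always moves the elite parent one step toward the optimum and elitism preserves it, the best solution improves monotonically, a simplified version of the "higher efficiency from early generations, continuous improvement" behavior reported on EffiBench-X.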

Figure: Controlled Self-Evolution (CSE) workflow

- Input: problem specification x.
- Diversified planning initialization: generate strategy sketches Z and sample the initial population P₀ (Eq. 1: y₀ᵢ ~ A_θ(y | x, zᵢ)); multiple sketches give broad solution-space coverage, reduce initialization bias, enable parallel exploration, and help avoid local optima.
- Genetic evolution: parent selection, controlled (feedback-guided) mutation, and compositional crossover under reward F(y, x) (Eqs. 3-4, controlled operators); preserves good components and makes targeted, surgical improvements.
- Hierarchical evolution memory: local memory for intra-task experience and global memory for inter-task patterns (Eqs. 5-6, memory operations); stores success/failure lessons via reflect-and-store, guides search via retrieve-and-guide, avoids repeated mistakes, and accelerates convergence.
- Evolutionary loop (T iterations): select → evolve → update memory; the population evolves P₀ → P₁ → P₂ → ... → P_T, yielding the best solution y*.
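The hierarchical memory can likewise be sketched as a two-level store. The task-to-lesson data model below is a hypothetical illustration of the local/global split, not the paper's actual schema.

```python
# Two-level evolution memory in the spirit of CSE's hierarchical memory.
# The task -> lesson data model is a hypothetical illustration only.
from collections import defaultdict

class EvolutionMemory:
    def __init__(self):
        self.local = defaultdict(list)   # intra-task: lessons keyed by task id
        self.global_ = []                # inter-task: patterns reused across tasks

    def reflect(self, task_id: str, lesson: str, generalizes: bool = False):
        """Store a lesson from a verify/refine round; promote reusable ones."""
        self.local[task_id].append(lesson)
        if generalizes:
            self.global_.append(lesson)

    def retrieve(self, task_id: str, k: int = 3) -> list:
        """Guide the next generation: recent task lessons plus global patterns."""
        return self.local[task_id][-k:] + self.global_[-k:]

mem = EvolutionMemory()
mem.reflect("two-sum", "hash-map lookup beats nested loops", generalizes=True)
mem.reflect("two-sum", "watch for duplicate indices")
print(mem.retrieve("two-sum"))     # both local lessons plus the global pattern
print(mem.retrieve("three-sum"))   # a brand-new task still sees global patterns
```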
Q1
1. What is the main limitation that CSE aims to address in existing self-evolution methods?
High computational costs of training
Low exploration efficiency in finding optimal solutions
Inability to generate syntactically correct code
Q2
2. Which component of CSE helps prevent getting trapped in poor solution regions during initialization?
Hierarchical Evolution Memory
Genetic Evolution
Diversified Planning Initialization
Q3
3. When comparing CSE with baseline methods on EffiBench-X, what unique characteristic did CSE demonstrate?
It achieved perfect accuracy on all test cases
It required fewer computational resources
It showed continuous improvement throughout evolution while starting with higher efficiency