HWC

Generative World Models of Tasks | LLM-Driven Hierarchical Scaffolding for Embodied Agents

Paper | World Model | RL

Quick Summary

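The paper's central claim is that, for embodied agents, a world model should simulate task semantics as well as physical dynamics. The authors argue that long-horizon multi-agent tasks are hard to learn not only because rewards are sparse, but because successful strategies are themselves hierarchically structured; if the environment offers only a flat physics simulator and a single terminal objective, an agent has almost no chance of discovering the full strategy through blind exploration. The paper therefore proposes Hierarchical Task Environments (HTEs): task decomposition, subgoals, intrinsic rewards, and curriculum are built directly into the environment's world model, and an LLM then serves as a generative world model of tasks that produces these hierarchical scaffolds dynamically.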

Problem Setting

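Traditional end-to-end MARL often finds exploration close to infeasible in complex embodied tasks. The authors draw a distinction between reward sparsity and the more fundamental task sparsity.
In the multi-agent setting, if each agent has an action space of size $|\mathcal{A}|$ and there are $N$ agents, the joint action space at a single step has size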

|\mathcal{A}_{\mathrm{joint}}|=|\mathcal{A}|^N.

If the task horizon is $T$, the number of possible joint trajectories grows to

|\mathcal{T}_{\mathrm{traj}}|=|\mathcal{A}|^{NT}.

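This means that in a task like robotic soccer, which requires sustained cooperation, passing, positioning, and shooting, the density of successful trajectories within the full trajectory space is close to zero. In other words, the agent fails not merely because it "cannot obtain reward" but because it "can hardly stumble upon the correct task structure" at all.
The authors therefore argue that scaling up simulation, collecting more interaction data, or applying reward shaping alone is usually still not enough to teach an agent a hierarchical strategy.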
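
To get a feel for the scale, the two counts above can be evaluated directly. The sketch below is only arithmetic on the formulas; the concrete values (10 actions per agent, 3 agents, horizon 100) are illustrative assumptions and do not come from the paper.

```python
def joint_action_space(num_actions: int, num_agents: int) -> int:
    """Size of the one-step joint action space, |A|^N."""
    return num_actions ** num_agents


def joint_trajectory_space(num_actions: int, num_agents: int, horizon: int) -> int:
    """Number of joint action sequences over the horizon, |A|^(N*T)."""
    return num_actions ** (num_agents * horizon)


# Illustrative values only (assumed, not taken from the paper).
A, N, T = 10, 3, 100

n_joint = joint_action_space(A, N)
n_traj = joint_trajectory_space(A, N, T)

print(n_joint)           # 1000
print(len(str(n_traj)))  # 301 decimal digits, i.e. 10**300 trajectories
```

Even with these modest numbers, uniform random exploration would be sampling from roughly 10^300 action sequences, which is the sense in which successful trajectories have near-zero density.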

Core Idea

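The authors propose externalizing the task hierarchy into the environment. The environment no longer just executes physical transitions; it also maintains a task graph that explicitly defines high-level goals, subtasks, dependencies, completion conditions, and the corresponding intrinsic rewards.
A high-level task can be decomposed into an ordered set of subtasks: in soccer, for example, ScoreGoal can be decomposed into AcquireBall -> DribblePastOpponent -> PassToTeammate -> ShootAtGoal. The RL agent then does not have to discover the entire strategy chain from scratch and can first learn an executable policy for each subtask.
This design yields a curriculum automatically: leaf subtasks are learned first and are then combined into higher-level task compositions. The curriculum is thus no longer a hand-arranged sequence of 1v1, 2v2, 3v3 training stages but a natural product of the hierarchy itself.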
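
One way to read "the curriculum falls out of the hierarchy" is as a dependency ordering over the task graph: a task becomes trainable once all of its children have been learned. The sketch below uses Python's standard `graphlib` on the soccer decomposition; the function name `curriculum_stages` and the choice to treat the four subtasks as direct children of ScoreGoal are assumptions for illustration, not the paper's API.

```python
from graphlib import TopologicalSorter


def curriculum_stages(children: dict[str, tuple[str, ...]]) -> list[list[str]]:
    """Group tasks into stages so that every task appears after all its children.

    `children` maps a composite task to the subtasks it is built from.
    """
    sorter = TopologicalSorter(children)
    sorter.prepare()
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose children are all done
        stages.append(ready)
        sorter.done(*ready)
    return stages


# Assumed hierarchy: ScoreGoal is composed of the four leaf subtasks.
hierarchy = {
    "ScoreGoal": (
        "AcquireBall",
        "DribblePastOpponent",
        "PassToTeammate",
        "ShootAtGoal",
    ),
}

stages = curriculum_stages(hierarchy)
for i, stage in enumerate(stages):
    print(i, stage)
```

Stage 0 contains the leaf skills and stage 1 the composite task, so the training order is derived from the graph rather than written by hand.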
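The most ambitious part of the paper is placing an LLM inside this framework. The authors argue that an LLM can serve not only as the agent's planner but also as the environment's task planner: it converts a natural-language high-level goal (for example, "do a give-and-go on the right wing") into a sequence of symbolic subtasks, which are then passed to the HTE API to configure intrinsic rewards and success conditions.
The LLM thereby becomes a "world model of tasks" rather than just an action decoder. It generates the task scaffolding, the environment executes and verifies it, and the low-level policy learns to carry out skills on top of that scaffold.
The authors also stress that the scaffold must not stay in place forever and should be faded gradually. Only if the agent maintains its performance as the external task hierarchy fades out can it be said to have truly internalized hierarchical reasoning rather than merely leaning on training wheels.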
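
The summary above does not say how the scaffold is faded. One simple possibility, shown here purely as an assumption, is to anneal a weight on the intrinsic rewards linearly from its initial value to zero over a window of training steps. The name `fading_weight` and its parameters are hypothetical.

```python
def fading_weight(step: int, fade_start: int, fade_end: int, lam0: float = 1.0) -> float:
    """Linearly anneal the intrinsic-reward weight from lam0 to 0.

    Assumed schedule, not the paper's: full scaffold before `fade_start`,
    no scaffold after `fade_end`, linear interpolation in between.
    """
    if fade_end <= fade_start:
        raise ValueError("fade_end must be greater than fade_start")
    if step <= fade_start:
        return lam0
    if step >= fade_end:
        return 0.0
    frac = (step - fade_start) / (fade_end - fade_start)
    return lam0 * (1.0 - frac)


for step in (0, 3000, 6000):
    print(step, fading_weight(step, fade_start=1000, fade_end=5000))
```

Performance measured after `fade_end` then reflects what the agent can do without the external hierarchy.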

Key Formulas

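This is essentially a framework paper rather than one that proposes a single new differentiable objective, so its "key formulas" mainly formalize problem complexity, the task hierarchy, and the evaluation metrics.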

First, task sparsity can be described by the combinatorial explosion of the joint trajectory space:

|\mathcal{A}_{\mathrm{joint}}| = |\mathcal{A}|^N, \qquad |\mathcal{T}_{\mathrm{traj}}| = |\mathcal{A}|^{NT}.

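This expression shows that in a long-horizon task with $N$ agents and horizon $T$, the cost of searching for a successful trajectory grows exponentially in both the number of agents and the horizon.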

Second, the task scaffold proposed in the paper can be organized as a hierarchical task graph

\mathcal{H}=(\mathcal{V},\mathcal{E}),

where each node $\tau_k \in \mathcal{V}$ is a subtask comprising a name, a precondition, a termination condition, and an intrinsic reward:

\tau_k = \bigl(\text{name}_k,\; \mathrm{pre}_k,\; \mathrm{term}_k,\; r_k^{\mathrm{int}}\bigr).
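
The tuple above maps naturally onto a small record type. The sketch below is one possible encoding, not the paper's HTE API: it assumes the state is a plain dictionary and that the precondition and termination condition are boolean predicates over it. The `has_ball` key and the reward value are made up for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

State = Dict[str, Any]  # assumed state representation


@dataclass(frozen=True)
class Subtask:
    """One node tau_k = (name, pre, term, r_int) of the task graph."""
    name: str
    pre: Callable[[State], bool]   # may the subtask start in this state?
    term: Callable[[State], bool]  # is the subtask completed in this state?
    r_int: float                   # intrinsic reward granted on completion


# Hypothetical example node; the predicate logic and reward are illustrative.
acquire_ball = Subtask(
    name="AcquireBall",
    pre=lambda s: not s["has_ball"],
    term=lambda s: s["has_ball"],
    r_int=0.1,
)

print(acquire_ball.name, acquire_ball.term({"has_ball": True}))
```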

If the LLM is treated as a task planner, a high-level goal $g$ is turned into an ordered sequence of subtasks:

\pi_{\mathrm{task}}(g, s_t) = [\tau_1, \tau_2, \dots, \tau_K].

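In other words, given the current situation $s_t$ and the language goal $g$, the LLM produces an executable task-level plan, and the environment then activates the corresponding subtasks and condition checks according to that plan.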
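
A minimal sketch of this planning step, under stated assumptions: the LLM is abstracted as any callable from a prompt string to a reply string, the reply is expected in an `A -> B -> C` format, and every name is validated against a fixed vocabulary before the environment accepts it. The function `plan_tasks`, the prompt wording, and the `fake_llm` stand-in are all hypothetical; no real LLM API is called.

```python
from typing import Callable, List

# Assumed vocabulary of subtasks the environment knows how to check.
KNOWN_SUBTASKS = {"AcquireBall", "DribblePastOpponent", "PassToTeammate", "ShootAtGoal"}


def plan_tasks(goal: str, state: dict, llm: Callable[[str], str]) -> List[str]:
    """Turn a language goal into an ordered, validated list of subtask names."""
    prompt = (
        f"Goal: {goal}\nState: {state}\n"
        f"Answer as 'A -> B -> C' using only: {sorted(KNOWN_SUBTASKS)}"
    )
    raw = llm(prompt)
    plan = [name.strip() for name in raw.split("->") if name.strip()]
    unknown = [name for name in plan if name not in KNOWN_SUBTASKS]
    if unknown:
        # Reject plans the environment cannot verify instead of guessing.
        raise ValueError(f"planner produced unknown subtasks: {unknown}")
    return plan


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; always returns the same plan."""
    return "AcquireBall -> PassToTeammate -> ShootAtGoal"


plan = plan_tasks("give-and-go on the right wing", {"ball_owner": None}, fake_llm)
print(plan)
```

Validating the plan against a known vocabulary keeps the division of labor described above: the LLM proposes structure, while the environment remains the component that executes and verifies it.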

In an HTE, the agent's learning signal comes not only from the external task return but also from intrinsic rewards for subtask completion. The feedback structure can be summarized as:

r_t = r_t^{\mathrm{ext}} + \lambda \sum_{k=1}^{K} \mathbf{1}\bigl[\tau_k\ \text{completed at } t\bigr] \, r_k^{\mathrm{int}}.

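This expression is not a single training loss from the paper but a condensed statement of the authors' framework, with $\lambda$ scaling the intrinsic terms: the external final-goal reward and the intermediate learning signals generated by the task hierarchy jointly shape the agent's learning.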
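
As a worked instance of the reward expression, the sketch below computes $r_t$ for one time step. The function name `shaped_reward` and all numeric values are illustrative assumptions.

```python
from typing import Sequence


def shaped_reward(
    r_ext: float,
    completed: Sequence[bool],
    r_int: Sequence[float],
    lam: float,
) -> float:
    """r_t = r_ext + lam * sum_k 1[tau_k completed at t] * r_int[k]."""
    if len(completed) != len(r_int):
        raise ValueError("completed and r_int must have the same length")
    bonus = sum(r for done, r in zip(completed, r_int) if done)
    return r_ext + lam * bonus


# Subtasks 1 and 3 complete at this step; no extrinsic reward yet.
r_t = shaped_reward(
    r_ext=0.0,
    completed=[True, False, True],
    r_int=[0.5, 0.25, 1.0],
    lam=0.5,
)
print(r_t)  # 0.75
```

With $\lambda = 0$ the expression reduces to the flat, extrinsic-only reward.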

The paper also proposes new evaluation perspectives. For curriculum efficiency, the authors suggest comparing the sample complexity of a flat baseline and a scaffolded environment:

\mathrm{CE} = \frac{N_{\mathrm{flat}}}{N_{\mathrm{scaffolded}}}.

If $\mathrm{CE} > 1$, the scaffolded environment reaches the same target performance with fewer samples.
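
A sketch of how CE might be computed from two learning curves, assuming each curve is a list of (samples, performance) pairs sorted by sample count and that $N$ is taken as the first sample count at which the target is reached. The helper names and the curve values are made up for illustration; they are not results from the paper.

```python
from typing import List, Optional, Tuple

Curve = List[Tuple[int, float]]  # (samples seen, performance), sorted by samples


def samples_to_reach(curve: Curve, target: float) -> Optional[int]:
    """First sample count at which performance reaches the target, else None."""
    for samples, perf in curve:
        if perf >= target:
            return samples
    return None


def curriculum_efficiency(flat: Curve, scaffolded: Curve, target: float) -> Optional[float]:
    """CE = N_flat / N_scaffolded; None if either run never reaches the target."""
    n_flat = samples_to_reach(flat, target)
    n_scaffolded = samples_to_reach(scaffolded, target)
    if n_flat is None or n_scaffolded is None:
        return None
    return n_flat / n_scaffolded


# Hypothetical learning curves (not data from the paper).
flat_curve = [(100_000, 0.1), (400_000, 0.5), (800_000, 0.8)]
scaffolded_curve = [(100_000, 0.4), (200_000, 0.8)]

ce = curriculum_efficiency(flat_curve, scaffolded_curve, target=0.8)
print(ce)  # 4.0
```

Returning `None` when a run never reaches the target avoids reporting a ratio for a flat baseline that simply fails, which is a likely outcome in the task-sparse regime described earlier.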

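For scaffolding brittleness, the paper suggests comparing three performance levels, measured with the full scaffold, with the scaffold faded, and with an incorrect hierarchy: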

P_{\mathrm{full}}, \qquad P_{\mathrm{faded}}, \qquad P_{\mathrm{wrong}}.

A good agent should satisfy

P_{\mathrm{full}} \approx P_{\mathrm{faded}} \gg P_{\mathrm{wrong}}.

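That is, it should not depend solely on the full scaffold and should retain its ability as the scaffold is gradually removed; but if it is given a wrong hierarchy, performance should drop markedly, which is what indicates that it has actually learned the correct task structure.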
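
The criterion uses "approximately equal" and "much greater than" without thresholds, so any executable check has to pick them. The sketch below does so with two assumed parameters, `tol` and `margin`, whose default values are arbitrary placeholders rather than values from the paper.

```python
def passes_brittleness_check(
    p_full: float,
    p_faded: float,
    p_wrong: float,
    tol: float = 0.05,     # assumed: max gap that still counts as "approximately equal"
    margin: float = 0.2,   # assumed: min gap that counts as "much greater than"
) -> bool:
    """Check P_full ~ P_faded >> P_wrong under the chosen thresholds."""
    retains_skill = abs(p_full - p_faded) <= tol
    uses_structure = min(p_full, p_faded) - p_wrong >= margin
    return retains_skill and uses_structure


print(passes_brittleness_check(0.80, 0.78, 0.30))  # True: robust to fading
print(passes_brittleness_check(0.80, 0.40, 0.30))  # False: collapses when faded
print(passes_brittleness_check(0.80, 0.79, 0.75))  # False: ignores the hierarchy
```

The third case is the subtle one: an agent that performs equally well under a wrong hierarchy is not actually using the task structure it was given.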

Model Architecture

The paper does not invent a new neural architecture; it proposes a layered agent-environment stack, which can be viewed as five layers:

High-level goal layer
LLM task planner / generative world model of tasks
Hierarchical Task API / HTE layer
Low-level policy / MARL execution layer
Scaffold fading and evaluation layer
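
To show how the HTE layer might sit between a flat environment and the low-level policy, here is a toy wrapper that tracks the active subtask of a plan and adds its intrinsic reward on completion. Everything in it is an assumption for illustration: `ToyEnv`, `HTEWrapper`, the `(name, term, r_int)` plan format, and the single-agent step signature are not the paper's API.

```python
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]
PlanStep = Tuple[str, Callable[[State], bool], float]  # (name, term, r_int)


class ToyEnv:
    """Stand-in flat environment: the state is just a step counter."""

    def __init__(self) -> None:
        self.t = 0

    def step(self, action: int) -> Tuple[State, float]:
        self.t += 1
        return {"t": self.t}, 0.0  # sparse setting: no extrinsic reward here


class HTEWrapper:
    """Adds subtask tracking and intrinsic rewards on top of a flat env."""

    def __init__(self, env: ToyEnv, plan: List[PlanStep], lam: float = 1.0) -> None:
        self.env, self.plan, self.lam = env, plan, lam
        self.idx = 0  # index of the currently active subtask

    def step(self, action: int) -> Tuple[State, float]:
        state, reward = self.env.step(action)
        if self.idx < len(self.plan):
            _name, term, r_int = self.plan[self.idx]
            if term(state):  # environment verifies completion
                reward += self.lam * r_int
                self.idx += 1  # activate the next subtask
        return state, reward


plan: List[PlanStep] = [
    ("ReachT2", lambda s: s["t"] >= 2, 0.5),
    ("ReachT3", lambda s: s["t"] >= 3, 1.0),
]
env = HTEWrapper(ToyEnv(), plan)
rewards = [env.step(0)[1] for _ in range(3)]
print(rewards)  # [0.0, 0.5, 1.0]
```

In this layout the planner only produces `plan`, the wrapper owns verification and reward configuration, and the policy sees an ordinary `step` interface; setting `lam` to zero recovers the flat environment.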

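Overall, the paper's real contribution is not a new network block but a redefinition of where the task hierarchy should live. The authors argue that hierarchy should neither be hidden solely inside the agent policy nor be left entirely to hand-crafted curricula; it should be a first-class citizen of world model and environment design.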