👍 68
06/15 20:00
On-policy self-distillation (OPSD) has proven effective for post-training large language models (LLMs), yet its application to diffusion LLMs (dLLMs) remains unexplored. Existing OPSD methods are inherently autoregressive-centric. They inject privileged information via left-to-right prefix condition
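Summary: Proposes an on-policy self-distillation (OPSD) method for diffusion large language models (dLLMs). Existing OPSD methods are autoregressive-centric and inject privileged information through left-to-right prefix conditioning, so they do not carry over directly to dLLMs. The proposed method uses the withheld information to drive self-improvement and is reported to do especially well when handling masked content. The summary reports improved model performance and a more efficient inference process, with applications in agentic and reasoning tasks.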
👍 42
06/14 20:00
Vision-Language-Action (VLA) models benefit from large-scale and diverse embodied data, yet scaling robot trajectory collection is costly and labor-intensive. Recent advances show that large-scale egocentric human videos provide complementary real-world supervision in pretraining. However, joint tra
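Summary: Proposes a unified pretraining approach over human and robot data to improve Vision-Language-Action (VLA) models. Large-scale egocentric human video compensates for the cost of collecting robot trajectories. The summary reports clear gains in robot learning across multiple tasks, with lower cost and better training efficiency.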
👍 35
06/16 20:00
Deploying multimodal foundation models as closed-loop policies increasingly requires conditioning actions on observations that are no longer visible. However, existing benchmarks either expose the full state, conflate hidden-state reconstruction with other agent skills, or test recall only after an
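Summary: Evaluates multimodal large language models in controllable non-Markovian games. The proposed benchmark addresses a gap in existing benchmarks, which depend on visible state, and tests whether a model can decide and reason when the relevant observations are no longer visible. The summary presents this as important for training and deploying agents in complex environments.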
👍 22
06/15 20:00
Language models trained on large-scale vision-language data have demonstrated strong potential for embodied agents. Harnessing models through embodied tools use offers a promising alternative to end-to-end vision-language-action systems by combining high-level reasoning with external modules for per
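Summary: Proposes Guava, which combines high-level reasoning with external modules as a general approach to embodied manipulation. It builds on models trained on large-scale vision-language data and composes these modules on complex tasks, offering a direction for future embodied agents in robotics and interactive applications.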
👍 14
06/14 20:00
Existing image editing methods can be generally categorized into textual instruction-based and visual prompt-based ones. Textual instructions are semantically expressive, but are limited by the coarse granularity of spatial control of the editing results. In contrast, visual prompts such as drag and
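Summary: Proposes a text-visual co-instruction method for image editing that targets the limited spatial control of existing approaches. It combines the semantic expressiveness of textual instructions with visual prompts to give finer-grained control over edits. Experiments are reported to show better editing quality and a better interactive experience for users.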
👍 12
06/15 20:00
Spatial VLMs have made substantial progress in geometric perception, yet complex spatial reasoning requiring multi-step inference over depth, distance, and scene relations remains challenging. Moreover, different spatial queries call for fundamentally different strategies: some are best addressed th
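Summary: Proposes an enhanced dual-path reasoning method for complex spatial reasoning, aimed at improving spatial vision-language models. The method matches the reasoning strategy to the type of spatial query. The summary reports strong results on multi-step reasoning tasks and higher accuracy, with potential benefits for spatial intelligence and robot interaction.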
👍 12
06/10 20:00
Vision-language models (VLMs) are typically trained as passive answerers, while their ability to actively ask diverse, non-trivial, visual-centric and grounded questions remains underexplored. Existing visual questioners' performance is bottlenecked by the availability of high-quality training data
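Summary: Proposes a self-evolving visual questioner, a model that actively asks diverse and challenging visual questions. It targets the bottleneck that scarce high-quality training data places on existing visual questioners, and uses question generation to make the model more interactive. The summary presents this as a new direction for visual understanding and human-computer interaction.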
👍 12
06/15 20:00
The shift from video generation to interactive world modeling places new demands on data: beyond captioned videos, world models require temporally aligned video-action-language trajectories grounded in the actions, camera motion, states, and events that drive future scene changes. However, such data
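Summary: Introduces the EgoCS-400K dataset to support world-model research during the shift from video generation to interactive world modeling. The dataset contains temporally aligned video-action-language trajectories that supply the context needed to model scene changes. It offers a new data source for agents learning in real environments and is intended to strengthen the understanding and prediction abilities of world models.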
👍 11
06/16 20:00
Reinforcement learning (RL) has become a representative post-training paradigm for LLMs, enabling strong reasoning and agentic capabilities. However, rollout generation remains a dominant latency bottleneck because autoregressive sampling decodes responses sequentially and a small number of long-tai
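Summary: Proposes EfficientRollout, a self-speculative decoding system that targets the latency bottleneck of rollout generation in reinforcement learning (RL). Its system-aware design makes decoding more efficient and speeds up response generation. The summary reports potential for real-time inference and agentic tasks.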
👍 9
06/15 20:00
Reinforcement learning pipelines for Large Language Model (LLM) training often rely on manually redesigned environments between stages, requiring practitioners to heuristically infer which configuration will best improve the current policy. To automate this process, we propose the LLM-as-Environment
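Summary: Proposes the LLM-as-Environment framework, which automates the design of training environments for reinforcement learning. A large language model (LLM) guides how the environment is reconfigured between stages. The summary reports that this improves the current policy and training efficiency, and suggests applications in multi-agent reasoning and collaboration tasks.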
👍 9
06/16 20:00
Passive models for long video understanding typically rely on a "watch-it-all" paradigm, processing frames uniformly regardless of query difficulty, causing computational cost to grow with video duration. Although interactive frameworks have emerged, they often rely on global pre-scanning, and their
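Summary: Examines the role of active perception in multimodal understanding and proposes a model that processes long videos interactively. The framework makes query handling more cost-efficient and reduces the computational load of video processing, with relevance to visual understanding and intelligent surveillance.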
👍 8
06/16 20:00
Reinforcement Learning with Verifiable Rewards algorithms like GRPO have emerged as the dominant post-training paradigm for complex reasoning in LLMs, yet commonly suffer from policy entropy collapse during training. We conduct a first-order gradient analysis of token-level entropy dynamics under GR
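Summary: Proposes STARE, a surprisal-guided weighting method for stabilizing policy entropy in reinforcement learning. It is built on an analysis of token-level entropy dynamics. The summary reports improved results on complex reasoning and presents the work as a contribution to reinforcement learning and policy optimization for large language models.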
👍 8
06/11 20:00
Agentic search over large corpora relies on retriever-mediated interfaces (e.g., BM25 or ColBERT) for scalable candidate discovery. While effective at ranking relevant documents, these interfaces expose evidence only as ranked results or bounded document views, limiting agents' ability to reorganize
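Summary: Proposes Dr-DCI for search over large corpora, which uses dynamic workspace expansion to make direct corpus interaction more efficient. According to the summary, the method improves document retrieval and ranking and lets agents organize information more effectively, supporting more complex search tasks.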
👍 7
06/04 06:26
Multicultural multi-agent systems are increasingly deployed in globally diverse settings, where different agents are grounded in different cultural backgrounds. Existing cultural evaluation focuses on value alignment: how closely a single agent matches a target culture. Yet alignment is a per-agent
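Summary: Studies value diversity in multicultural multi-agent systems and proposes an approach that goes beyond single-agent value alignment to account for agents grounded in different cultural backgrounds. It highlights the limits of value alignment and offers a new perspective on designing and evaluating agents in multicultural settings.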
👍 7
06/16 20:00
Diffusion models have become a promising alternative to autoregressive models. Among these, uniform diffusion language models (UDLMs) permit any token to be updated at any step, in principle enabling more flexible generation. However, no UDLM has yet been pretrained from scratch at both large parame
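Summary: Introduces Sumi, a uniform diffusion language model pretrained from scratch. Uniform diffusion allows any token to be updated at any step, which is intended to make generation more flexible. The summary reports strong performance across tasks and suggests the model is well suited to writing and generation tasks.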
👍 6
06/16 20:00
Motion forecasting is central to visual intelligence: agents must anticipate how objects will move in order to plan actions, reason about physical interactions, and synthesize realistic futures. We argue that 3D points in world coordinates provide a general representation that is class-agnostic, vie
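Summary: Studies language-instructed forecasting of point trajectories in 3D space, stressing the importance of motion forecasting for visual intelligence. The model operates on 3D points in world coordinates, a general representation that is class-agnostic. The summary reports potential for modeling object interactions and for action planning.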
👍 6
06/15 20:00
Large language models perform increasingly well on standardized logical reasoning benchmarks, but whether this ability remains robust beyond English is unclear. We introduce ChLogic, an English--Chinese aligned benchmark that tests whether models preserve logical reasoning performance when the same
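Summary: Introduces ChLogic, an English-Chinese aligned benchmark that tests whether large language models keep their logical reasoning performance when the same problems are posed in Chinese. The summary stresses the importance of checking that reasoning remains reliable beyond English, with implications for multilingual understanding and cross-lingual learning.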
👍 5
06/15 20:00
As an increasing majority of global video content is consumed on social platforms for interactive social purposes, video generation models built for social worlds are important but largely overlooked by previous studies. In this work, we define the position of social world models and build a prototy
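Summary: Introduces MaineCoon, a real-time audio-visual social world model that explores video generation for social platforms. By building a prototype model, the work emphasizes interactive generation in social settings and offers a direction for future social agents, real-time interaction, and social media applications.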
👍 5
06/15 20:00
Scaling model size, specifically depth and width, has driven significant progress in transformer-based language models. However, most architectures maintain a constant width across all layers, allocating a fixed parameter and computation budget evenly despite different layers potentially playing dis
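Summary: Proposes a variable-width Transformer that drops the conventional constraint of a fixed width across all layers. By allocating the parameter and compute budget unevenly across layers, the model is reported to be more efficient. The summary points to gains in both performance and resource utilization.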
👍 5
06/14 20:00
Training computer-use agents (CUAs) -- models that interact with graphical desktops through screenshots and keyboard/mouse actions -- requires large-scale, diverse trajectory data collected in full desktop environments. The largest public resource, AgentNet (22.5K human trajectories), leads to negat
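Summary: The ProCUA-SFT technical report examines how to train computer-use agents (CUAs) and emphasizes rich, diverse trajectory data for improving their performance in graphical interfaces. It identifies limitations of the largest existing public resource and proposes an improved data collection strategy for training, intended to help agents operate in complex desktop environments.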