👍 107
06/26 20:00
LLM agents are expected to act over multiple turns, using search, browsing interfaces, and terminal tools to complete user goals. Yet not every goal is well specified or achievable in the available environment. In such cases, a reliable agent should recognize that further interaction is unlikely to
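Summary: This paper proposes a new mechanism for LLM agents that addresses whether an agent can recognize when to stop when a goal is underspecified or unachievable. By making effective decisions across multi-turn interactions, the agent avoids unnecessary execution and uses resources more efficiently. This capability matters for complex task management and for responding to user needs, and it bears on broader agent research such as improving agents' decision-making.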
👍 59
06/28 20:00
We introduce Agents-A1, a 35B Mixture-of-Experts Agentic Model that reaches trillion-parameter-level performance by scaling the agent horizon. We investigate agent-horizon scaling from two perspectives: scaling long-horizon trajectories and scaling heterogeneous agent abilities. To support this goal
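Summary: The paper introduces Agents-A1, a 35B Mixture-of-Experts Agentic Model that reaches trillion-parameter-level performance by scaling the agent horizon. It analyzes agent-horizon scaling from two angles: scaling long-horizon trajectories and scaling heterogeneous agent abilities. The result should help large models handle long tasks more efficiently, and it supports further gains in agent performance and multi-task applications.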
👍 38
06/25 20:00
As large language models and harness frameworks continue to advance, agents operating in terminals are increasingly capable of performing a broader range of general computer-use tasks beyond coding. However, existing benchmarks do not adequately evaluate general-purpose terminal computer-use agents
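Summary: This work proposes TUA-Bench, a benchmark for general-purpose terminal-use agents that evaluates how well agents handle a broader range of computer-use tasks. Existing benchmarks do not evaluate these agents comprehensively, which limits their demonstrated value beyond coding. By providing richer evaluation criteria, TUA-Bench aims to advance the use and development of general-purpose agents across diverse tasks.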
👍 33
06/28 20:00
Foundation models for predictive machine learning on tabular data have recently gained significant traction in academia and industry. Research communities across disciplines are increasingly evaluating tabular foundation models on diverse datasets and tasks. However, these task- and discipline-speci
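Summary: The paper examines the generalization ability of tabular foundation models for predictive machine learning and stresses the importance of evaluating them on diverse datasets and tasks. Although research is growing, evaluations that cut across tasks and disciplines remain relatively scarce. The study offers theoretical support for understanding the limits and potential of tabular foundation models, and it informs their use in multi-domain prediction tasks and the direction of model design.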
👍 31
06/22 20:00
Physical interactions follow a long-tailed distribution: a set of common and regular interactions dominates human experience and visual data, while a broad spectrum of rare and irregular interactions remains underrepresented. Although recent visual world models, including image and video generation
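Summary: The study observes that physical interactions follow a long-tailed distribution: common, regular interactions dominate visual data, while rare and irregular interactions are underrepresented. The proposed method aims to reduce the effect of this long tail on the evaluation of visual world models, so that model behavior in realistic settings can be assessed more accurately. It has practical value for improving agent performance in complex scenes.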
👍 23
06/25 20:00
Recent interest in multimodal large language models (MLLMs) raises a central question: can they reason over dynamic visual evidence rather than merely recognize objects or events in individual frames? This ability, which we refer to as video temporal-logical reasoning, requires models to maintain, u
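Summary: The paper studies the ability of multimodal large language models to perform video temporal-logical reasoning, asking how models can reason over dynamic visual evidence rather than merely recognize objects or events in individual frames. This ability requires maintaining, understanding, and logically reasoning over that evidence. The work lays a foundation for applying multimodal agents in complex audiovisual environments.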
👍 22
06/22 20:00
On-policy distillation (OPD) trains a student on its own rollouts guided by teacher feedback and is becoming increasingly important for large language model (LLM) post-training. Like reinforcement learning (RL), however, OPD faces an on-policy systems bottleneck, as rollouts can dominate training ti
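Summary: This study focuses on the efficiency of on-policy distillation (OPD) and its importance in large language model post-training. After analyzing the current systems bottleneck, it proposes a new method to reduce training time and improve efficiency. The work could improve the training efficiency and real-time performance of large models, and it advances the application of machine learning in dynamic environments.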
👍 21
06/17 20:00
Agentic navigation systems require a base navigation model whose observation strategy can be externally reconfigured at inference time, because instruction following, object search, target tracking, and autonomous driving share the same perception-planning backbone yet demand fundamentally different
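Summary: This paper introduces a scalable navigation model for agentic navigation systems whose observation strategy can be externally reconfigured at inference time. It integrates instruction following, object search, autonomous driving, and other functions to achieve efficient navigation. The work is important for the development of multi-task navigation systems and offers a practical model architecture for robot navigation in intelligent environments.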
👍 20
06/25 20:00
To reduce memory consumption during LLM inference, a handful of methods have been proposed for KV cache pruning. While these techniques can accomplish lossless memory reduction on many datasets, they often hinge on an under-emphasized condition: an input/domain-specific threshold for KV cache budget
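Summary: The study proposes ReFreeKV, a threshold-free KV cache compression method that aims to reduce memory consumption during LLM inference. Existing methods work on many datasets but still depend on an input- or domain-specific threshold. The proposed method is intended to apply more broadly because it does not depend on such a condition, which matters for LLM memory management and optimization.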
👍 15
06/28 20:00
Recent advances in 3D Gaussian Splatting have demonstrated unprecedented success in novel view synthesis. However, the substantial inference and storage overhead driven by high-order Spherical Harmonics (SH) are primary bottlenecks for mobile platforms. In this paper, we present Flux-GS, a real-time
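Summary: The paper proposes Flux-GS, a real-time 3D Gaussian Splatting method for mobile devices that addresses the storage and inference overhead caused by high-order Spherical Harmonics. The method performs well in novel view synthesis and markedly improves feasibility on mobile platforms. The work offers new ideas for improving computer vision capabilities and real-time image processing on mobile devices.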
👍 15
06/27 20:00
Video understanding is a fundamental capability for multimodal intelligence, and recent Multimodal Large Language Models (MLLMs) have achieved remarkable performance on Video Question Answering (VideoQA) benchmarks. However, existing benchmarks primarily evaluate whether models can perceive shallow
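Summary: This study discusses the importance of video understanding for multimodal agents and proposes connecting Video Question Answering (VideoQA) with video-guided agentic tasks through a general keyframe extraction architecture. Building on the performance of multimodal large language models, it reveals potential advantages in multi-task applications and points to new directions for future model development.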
👍 14
06/28 20:00
Agentic multimodal models perform diverse operations on an image via code and reason over the returned view, an effective paradigm for fine-grained visual question answering. However, code operations can be useful, redundant, or misleading. Outcome-only rewards cannot precisely distinguish these cas
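Summary: The paper proposes TACO, a tool-augmented credit optimization method that aims to improve how agentic multimodal models operate on images. Through careful design, it distinguishes useful, redundant, and misleading code operations more precisely, supporting fine-grained visual question answering. The method has broad prospects for helping agents use tools more efficiently and advances multimodal learning.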
👍 9
06/17 20:00
Robots operating in open-world environments must seamlessly integrate localization, spatial reasoning, navigation, and long-horizon planning. While specialist models excel at individual tasks, deploying a multi-model stack is computationally expensive and prone to cascading errors. We present Vesta,
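Summary: This work introduces Vesta, a general-purpose embodied reasoning model that seamlessly integrates localization, spatial reasoning, navigation, and long-horizon planning in open-world environments. By avoiding the computational cost and cascading errors of a multi-model stack, Vesta improves how efficiently agents act in complex environments. The result helps enable smarter robot decision-making and action planning in practical applications.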
👍 8
06/27 20:00
Existing computer-use benchmarks fail to capture the realism, complexity, and long-horizon demands of real-world computer use, limiting their ability to reveal the limitations of frontier agents. We introduce OSWorld 2.0, a benchmark of 108 long-horizon computer-use workflows across everyday and pro
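Summary: The paper proposes OSWorld 2.0, a benchmark of 108 long-horizon computer-use workflows designed to evaluate agents on realistic computer tasks. By focusing on real-world complexity and long-horizon demands, the benchmark reveals the limitations of current agents. The work influences evaluation standards and task design for computer-use agents and helps move the field forward.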
👍 7
06/28 20:00
We present DreamForge-World 0.1 Preview, a preview foundational world model for real-time interactive world simulation. The system adapts the LongLive 1 autoregressive video stack, itself derived from Wan2.1-T2V-1.3B, with a residual action pathway inspired by the Matrix-Game family. DreamForge-Worl
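Summary: This work introduces DreamForge-World 0.1, a low-compute, real-time controllable world model for real-time interactive world simulation. The system adapts the LongLive 1 autoregressive video stack, opening a new route to efficient simulation, with significant implications for interactivity and immersion in virtual environments.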
👍 7
06/28 20:00
Data, as the fundamental substrate of modern intelligence, has greatly driven the development of current foundation models. Naturally, researchers aim to extend this paradigm to the domain of GUI agents, hoping to build strong GUI agents through a similar paradigm. However, GUI agent data cannot be
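Summary: The paper proposes GUICrafter, a weakly supervised GUI agent trained on massive amounts of unlabeled screenshot data. The method aims to build strong GUI agents in a data-driven way and to advance existing GUI agent research. Although GUI data poses challenges, the work has practical value for low-cost learning and efficient data use.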
👍 6
05/28 20:00
For agents to learn continuously from interaction with the world at test time, they must be able to explore effectively, acquire new world knowledge and skills, retain relevant episodic experiences, and plan over long horizons. To evaluate these key abilities of test-time continual learning agents,
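Summary: The study proposes AgentOdyssey to evaluate agents that learn continuously from interaction with the world. This evaluation framework for test-time continual learning agents stresses effective exploration, acquisition of new knowledge, and long-horizon planning. It offers a new approach to evaluating continual-learning agents and supports better agent performance in dynamic environments.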
👍 5
06/27 20:00
LLM agents handle user requests on behalf of organizations through tool calls and must follow the company policies stated in their system prompts. Prior work approaches this as a safeguarding problem -- external checks that block non-compliant agent actions. We argue that policy adherence is a broad
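Summary: The paper proposes PolicyGuard, a dialogue-based sub-agent verifier that ensures LLM agents follow company policies when making tool calls. It regards earlier external-check approaches as too narrow and emphasizes the breadth of policy adherence. The results matter for improving how well LLM agents comply with company policies and advance the development of compliant agents.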
👍 4
06/25 20:00
Predicting human item difficulty is central to educational assessment, where reliable estimates support fairness and effective test construction. Existing methods often depend on costly human calibration or item-level textual representations, providing limited evidence about the cognitive processes
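Summary: This paper discusses the importance of predicting human item difficulty in educational assessment, where reliable estimates support fairness and effective test construction. The approach goes beyond traditional human calibration and emphasizes cognitive processes, offering a new perspective for educational measurement and helping improve intelligent assessment systems. It has both academic and practical value.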
👍 2
06/23 20:00
Mathematical knowledge is organized around statements and their dependencies, but this structure is exposed unevenly: informal papers cite mostly at the document level, while formal libraries record fine-grained dependencies over a much smaller body of mathematics. We introduce TheoremGraph, a unifi
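Summary: The study proposes TheoremGraph, a tool that connects formal and informal mathematics and aims to better organize mathematical knowledge and expose its internal structure. It compensates for the uneven way existing literature records citations, gives mathematical research a new perspective, and could promote exchange between formal and informal mathematics and the spread of knowledge in academia.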