👍 64
06/22 20:00
A world model predicts environment dynamics based on current observations and actions, serving as a core cognitive mechanism for reasoning and planning. In this work, we investigate how world modeling based on language models can further push the boundaries of general agents. (i) We first focus on b
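Summary: This paper explores how world modeling built on language models can advance general-purpose agents. It proposes a new approach that predicts environment dynamics from current observations and actions to strengthen reasoning and planning. According to the summary, experiments show strong performance on complex tasks and markedly improved reasoning. Significance: the work offers a new direction for agent development and widens the potential of agents in dynamic environments.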
👍 41
06/22 20:00
We introduce NatureBench, a cross-discipline benchmark of 90 tasks distilled from peer-reviewed Nature-family publications, designed to evaluate whether AI coding agents can move beyond reproduction toward discovery on real scientific problems. NatureBench is built on NatureGym, an automated pipelin
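Summary: This work introduces NatureBench, a cross-discipline benchmark of 90 tasks drawn from Nature-family publications, designed to evaluate whether AI coding agents can move beyond reproduction toward discovery on real scientific problems. It is built on NatureGym, an automated pipeline. Significance: the benchmark helps evaluate and improve how AI is applied to, and innovates in, scientific research.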
👍 30
06/17 20:00
MLLM-based mobile GUI agents have made substantial progress in UI understanding and action execution, but adapting them to real target apps remains costly because mobile apps are numerous, frequently updated, and hard to cover with human-written tasks, demonstrations, or reward labels. Existing anno
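Summary: Proposes MobileForge, an annotation-free adaptation method that improves how MLLM-based mobile GUI agents adapt to real apps. It uses hierarchical feedback-guided policy optimization to sidestep costly task annotation, reducing reliance on human-written tasks. According to the summary, this form of adaptation substantially lowers training cost. Significance: the method makes agents for mobile apps more efficient and has clear practical potential.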
👍 29
06/17 20:00
MLLM-based mobile GUI agents have made substantial progress on short-horizon tasks, yet remain unreliable on long-horizon tasks that require retaining intermediate facts across many steps and app transitions. We attribute this limitation to ReAct-style prompting, which passively accumulates per-step
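Summary: To address the unreliability of existing MLLM-based mobile GUI agents on long-horizon tasks, this work proposes MemGUI-Agent, which uses proactive context management to retain intermediate facts across steps and app transitions. According to the summary, experiments show a markedly higher success rate on long-horizon tasks. Significance: the work offers a new technical route for improving mobile agents' decision-making in complex environments.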
👍 24
06/21 20:00
AI agents are driving a new software paradigm, with the ability to autonomously call tools, extract information, manage memory, and complete tasks that span applications and data sources. Most existing end-user operating systems, however, are designed for application-centric workflows and offer litt
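Summary: This paper presents AOHP, an open-source operating-system-level agent framework that supports personalized, efficient, and secure interaction. By providing tool calling, information extraction, and task management, AOHP addresses the limits of existing application-centric operating systems and, according to the summary, noticeably improves the user experience. Significance: the framework bears on intelligent scheduling and multitasking and pushes personal computing environments toward agent-centric designs.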
👍 20
06/19 20:00
Autoregressive generation in large language models (LLMs) conventionally decodes from the final layer, assuming that deeper representations yield more reliable next-token predictions. We revisit this assumption by revealing a recurring Guess-Refine-Perturb dynamic: early layers form coarse guesses,
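Summary: This paper shows that decoding from the deepest layer of a large language model (LLM) does not always give better results, and proposes Confident Layer Decoding to mitigate the alignment tax. The study finds that coarse guesses formed in earlier layers can be more reliable on some tasks, and uses this to improve generation. Significance: the method offers a new angle on improving generation quality, particularly through better use of a model's layer stack.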
👍 19
06/19 20:00
We present BioMatrix, the first multimodal foundation model that natively integrates sequences, structures, and natural language for both molecules and proteins within a single decoder-only architecture. Existing biological foundation models pursue native multimodality and broad entity coverage sepa
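Summary: BioMatrix is the first multimodal foundation model to natively integrate sequences, structures, and natural language for both molecules and proteins. The work achieves broad entity coverage and native multimodality within a single decoder-only architecture, providing infrastructure for bioinformatics. Significance: the model sits at the frontier of biomedical research and may make biological data processing more capable.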
👍 18
06/10 20:00
Mental disorders are highly prevalent worldwide, but the shortage of psychiatrists and the inherent subjectivity of interview-based diagnosis create substantial barriers to timely and consistent mental-health assessment. Progress in AI-assisted psychiatric diagnosis is constrained by the absence of
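Summary: LingxiDiagBench is presented as a multi-agent framework for evaluating LLMs on Chinese psychiatric consultation and diagnosis. It fills a major gap in mental-health assessment and aims to make psychiatric diagnosis more consistent and efficient. Significance: the work advances the use of AI in medicine, and in mental-health assessment in particular.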
👍 18
06/01 20:00
Computer-Use Agents (CUAs) are increasingly deployed in dynamic interactive environments, creating a growing need for continual skill learning during interaction. Recent approaches address this challenge by learning reusable skills from successful trajectories. However, these skill learning methods
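Summary: SkillHarness is a framework for safe skill learning in Computer-Use Agents (CUAs) operating in dynamic interactive environments. It builds on the approach of learning reusable skills from successful trajectories and, according to the summary, improves both the efficiency and the safety of skill learning. Significance: the method supports continual adaptation and learning for agents on complex tasks.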
👍 12
06/21 20:00
Long-horizon tasks are common in real-world robotic deployments, yet failure detection for such tasks remains underexplored. Detecting failures in long-horizon robotic tasks is particularly challenging because failure onset is often ambiguous and dense temporal annotations are typically unavailable.
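Summary: Foresight proposes a failure detection method based on the latents of an action-conditioned world model, targeting failure identification in long-horizon robotic manipulation. According to the summary, the method substantially improves failure detection accuracy and addresses an underexplored problem. Significance: the work matters for making robots more reliable in uncertain environments.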
👍 11
06/22 20:00
Agentic language models dramatically expand the applications of AI yet little is publicly known about how to curate training data for broadly capable agents. Existing open efforts such as SWE-Smith, SERA, and Nemotron-Terminal typically target a single benchmark, leaving open the question of how to
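Summary: OpenThoughts-Agent tackles data curation for agentic models, proposing an open approach to optimizing training data in order to build more capable models. It addresses the lack of public knowledge about how to curate training data for broadly capable agents. Significance: the approach offers a new path toward stronger AI agents across multiple domains.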
👍 10
06/22 20:00
Generating explorable 3D scenes from a single image requires strong generative priors and accurate geometric representations suitable for downstream use. Current video diffusion models offer high-quality generation and implicitly encode multi-view geometric structure in latent space. However, existi
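Summary: FLAT proposes a feed-forward latent triangle-volume generation method for geometrically accurate scene generation, targeting the geometric representation problem in generating 3D scenes from a single image. According to the summary, the method performs well in both generation accuracy and efficiency. Significance: the technique matters for generative tasks in computer vision and has applications in virtual reality and film production.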
👍 9
06/09 20:00
Scientific discovery workflows usually contain and rely heavily on lab notes, where researchers record observations, interpret uncertain results, and plan follow-up experiments. Such informative lab notes preserve evolving scientific reasoning and author uncertainty, rather than polished final resul
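Summary: Notes2Skills turns lab notes into skills for scientific agents, highlighting the importance of such notes in the discovery process. The work stresses the key role that experimental records play in scientific reasoning and in handling uncertainty. Significance: the study offers a new perspective on AI-assisted decision-making in scientific discovery and on improving agent skills.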
👍 7
06/18 20:00
Open-weight Large Language Models (LLMs) enable scientific progress and broad deployment. However, they make it difficult to control access to sensitive capabilities. Current practice either suppresses dangerous capabilities before release or mediates access through closed services that use speciali
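Summary: To address the difficulty of controlling access to sensitive capabilities in open-weight large language models (LLMs), this work proposes separating public and private capabilities. It designs a safety mechanism that manages access to the sensitive capabilities, giving finer control over the model. Significance: the work supports fair and safe use of AI across domains.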
👍 6
06/22 20:00
Experience-driven self-evolution is critical for large language model (LLM) agents to improve through open-world interaction. However, existing experience learning methods mostly rely on single-agent loops, where the same agent executes tasks, summarizes outcomes, and determines memory content. This
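Summary: This paper proposes an Execute-Distill-Verify learning paradigm that helps agents evolve from their own experience. It notes that existing methods mostly rely on single-agent loops, in which the same agent executes tasks, summarizes outcomes, and decides what to store in memory, and argues that the proposed framework is better suited to open-world interaction. Significance: the method offers a new approach to autonomous agent learning and strengthens agents' ability to interact.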
👍 5
06/22 20:00
Text-to-image (T2I) generation models have achieved remarkable progress in producing visually realistic images from natural language prompts. Yet it remains unclear whether their success reflects genuine causal understanding or sophisticated pattern matching over visual-textual correlations. Inspire
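Summary: This paper examines how text-to-image generation models handle causal reasoning, proposing a counterfactual benchmark to test whether they have genuine causal understanding. According to the summary, current models rely more on pattern matching than on causal reasoning. Significance: the work pushes generative models toward causal understanding and supports their use in creative tasks.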
👍 5
06/21 20:00
Video diffusion models have enabled remarkable progress in video generation and editing. However, content preservation remains a core challenge: existing methods regenerate every pixel and often alter elements that should remain unchanged, such as characters or background scenes. We introduce Vera,
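Summary: Vera is a layered diffusion model focused on content preservation in video editing. Its generation strategy addresses the consistency problem in existing methods, which regenerate content that should remain unchanged. According to the summary, experiments show Vera performs well in both video quality and editing flexibility. Significance: the technique could have a lasting effect on video production and editing.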
👍 5
06/18 20:00
As agentic systems tackle increasingly complex multi-step tasks, evaluating their trajectories presents a major bottleneck - human annotation of a single trajectory on popular agentic benchmarks can take hours, making it difficult to scale evaluations for measuring performance or curating training d
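Summary: Counsel is a meta-evaluation dataset for assessing agentic systems on complex multi-step tasks, aimed at the slowness of existing evaluation, where human annotation of a single trajectory can take hours. According to the summary, the new dataset makes evaluation more efficient. Significance: the work lays groundwork for better evaluation methods for agentic systems and supports their wider application.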
👍 3
06/16 20:00
Long-context reasoning is an essential capability for large language models, particularly when they are deployed as autonomous agents that must reason over lengthy trajectories. Reinforcement learning (RL) has recently emerged as a dominant paradigm for improving this ability, yet existing work larg
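Summary: Beyond Reward Engineering proposes a reinforcement learning data recipe for strong long-context reasoning, aimed at improving large language models on complex tasks. According to the summary, the method markedly improves reasoning over long contexts. Significance: it has practical value for improving agents' long-horizon decision-making, especially in autonomous interactive settings.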
👍 3
06/19 20:00
It is tempting to assume any task solvable by a short program can be taught to a model as its chain-of-thought: write the steps out, fine-tune, and the model follows. This paper shows the assumption fails for an identifiable class of procedures. The testbed is nine reasoning tasks, each from a deter
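Summary: This paper examines the relationship between verifiable search and learnable chain-of-thought, showing that a task being solvable by a short program does not guarantee it can be taught to a model as chain-of-thought. It identifies a class of procedures for which learnability breaks down, an important pointer for follow-up work. Significance: the study deepens the theoretical understanding of chain-of-thought reasoning and advances research on reasoning tasks.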