👍 104
06/22 20:00
A world model predicts environment dynamics based on current observations and actions, serving as a core cognitive mechanism for reasoning and planning. In this work, we investigate how world modeling based on language models can further push the boundaries of general agents. (i) We first focus on b
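Summary: This paper proposes a world model built on language models, aiming to push the boundaries of general agents. The model predicts environment dynamics from current observations and actions, which strengthens reasoning and planning. The study reports that language-model-based world modeling can markedly improve an agent's performance and decision-making in complex environments, underscoring the role of language processing in autonomous agent behavior. Significance: the work offers a new perspective on agent reasoning and planning and has strong application prospects.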
👍 61
06/22 20:00
Memory for large language model (LLM) agents has rapidly evolved from simple retrieval-augmented mechanisms into a data management system that supports persistent information storage, retrieval, update, consolidation, and dynamic lifecycle governance throughout agent execution. Despite this evolutio
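Summary: This paper examines how memory systems for large language model (LLM) agents have evolved from simple retrieval mechanisms into systems that support persistent information management. The study shows that such a system can handle storage, retrieval, update, and dynamic lifecycle management of information, markedly improving agent performance on complex tasks. The authors stress that agents with efficient memory will drive further application of AI across many fields, particularly personalized intelligent assistants.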
👍 53
06/22 20:00
We introduce NatureBench, a cross-discipline benchmark of 90 tasks distilled from peer-reviewed Nature-family publications, designed to evaluate whether AI coding agents can move beyond reproduction toward discovery on real scientific problems. NatureBench is built on NatureGym, an automated pipelin
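Summary: NatureBench is a cross-discipline benchmark of 90 tasks drawn from Nature-family publications, designed to evaluate how well AI coding agents solve real scientific problems. The study reports that current AI coding agents have not yet reached state-of-the-art performance in this setting, while highlighting their potential for scientific discovery. NatureBench provides an evaluation standard for future research and is especially suited to scientific research and technology development.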
👍 35
06/17 20:00
MLLM-based mobile GUI agents have made substantial progress on short-horizon tasks, yet remain unreliable on long-horizon tasks that require retaining intermediate facts across many steps and app transitions. We attribute this limitation to ReAct-style prompting, which passively accumulates per-step
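Summary: MemGUI-Agent is an end-to-end mobile graphical user interface (GUI) agent that actively manages contextual information and is therefore more reliable on long-horizon tasks. The study argues that earlier ReAct-style prompting limits long-horizon task execution, whereas the new approach effectively retains intermediate facts across steps and app transitions. The work offers a new solution for human-computer interaction in long-horizon tasks and is significant for the design of agents in mobile apps.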
👍 34
06/23 20:00
Real-world photography requires capture-time guidance for both camera framing and subject pose. Yet existing aesthetic cropping benchmarks mainly evaluate post-hoc crop prediction and overlook subject-side recommendations, leaving the capture-time guidance capabilities of multimodal large language m
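Summary: ShutterMuse studies photography guidance based on multimodal large language models, focusing on capture-time guidance for both camera framing and subject pose. Existing aesthetic cropping benchmarks concentrate on post-processing and do not evaluate capture-time recommendations. With its new framework, the study improves real-time guidance during shooting, with implications for photo quality and for interactive learning in education.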
👍 34
06/22 20:00
We present Wan-Streamer, a native-streaming, end-to-end interactive foundation model designed from the ground up for real-time, low-latency, full-duplex audio-visual interaction. Wan-Streamer seamlessly models language, audio, and video as both input and output within a single Transformer, where the
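Summary: Wan-Streamer is an interactive foundation model designed for real-time, low-latency audio-visual interaction. It processes language, audio, and video within a single Transformer and supports full-duplex interaction. The study reports that Wan-Streamer performs well in real-time responsiveness and multimodal interaction, providing a technical foundation for future human-computer interaction and intelligent assistants.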
👍 26
06/15 20:00
While Large Language Models (LLMs) have substantially advanced text-to-code synthesis, many real programming tasks specify intent through visual artifacts such as screenshots, charts, vector drawings, videos, and interactive states. These tasks require models to connect visual perception to executab
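Summary: This study surveys the state of multimodal code intelligence. It notes that existing large language models have made progress in text-to-code generation but struggle with programming tasks that require recognizing visual features. By connecting visual perception to executable code, the paper lays a foundation for further research on multimodal tasks, with particular significance for programming automation and development efficiency.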
👍 26
06/21 20:00
AI agents are driving a new software paradigm, with the ability to autonomously call tools, extract information, manage memory, and complete tasks that span applications and data sources. Most existing end-user operating systems, however, are designed for application-centric workflows and offer litt
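Summary: AOHP proposes an open-source, operating-system-level agent framework aimed at personalized, efficient, and secure interaction. The framework lets agents autonomously call tools, extract information, and manage memory, moving beyond existing application-centric workflows. The study suggests that the framework helps integrate agents into everyday applications, improving user experience and efficiency.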
👍 21
06/23 20:00
Synthesizing a novel-view video from a monocular reference video along a target camera trajectory requires both geometric consistency and motion fidelity with respect to the reference video. Existing methods based on explicit 3D representations are limited by the accuracy of off-the-shelf reconstruc
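Summary: MVTrack4Gen studies a 4D video generation method based on multi-view tracking, emphasizing geometric consistency and motion fidelity. Working from a monocular reference video, the method overcomes the limitations of existing 3D reconstruction and markedly improves the quality and realism of generated video. The work lays groundwork for advances in video generation and has practical value for virtual reality and augmented reality applications.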
👍 21
06/10 20:00
Mental disorders are highly prevalent worldwide, but the shortage of psychiatrists and the inherent subjectivity of interview-based diagnosis create substantial barriers to timely and consistent mental-health assessment. Progress in AI-assisted psychiatric diagnosis is constrained by the absence of
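Summary: LingxiDiagBench is a multi-agent benchmark framework for Chinese-language psychiatric consultation and diagnosis, aimed at the staff shortages and subjectivity that affect mental disorder assessment. The study reports that AI-assisted diagnosis can improve the timeliness and consistency of assessment, providing important support for further research and application in mental health.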
👍 16
06/23 20:00
Fine-grained visual reasoning requires multimodal large language models (MLLMs) to identify task-relevant visual evidence and ground their reasoning in local image regions. Existing agentic methods typically rely on reinforcement learning with verifiable rewards or supervised fine-tuning on large-sc
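Summary: V-Zero proposes a label-free on-policy distillation method combined with contrastive evidence gating, aimed at improving fine-grained visual reasoning. The study reports that the method effectively identifies task-relevant visual evidence and reasons over local image regions, offering new ideas for reinforcement learning and multimodal reasoning. The work is significant for advancing fine-grained understanding and reasoning in AI.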
👍 16
06/21 20:00
What is an agent? What constitutes agency? With the rise of Large Language Model (LLM) systems marketed as "coding agents", "AI co-scientists", and other "agentic" tools that promise to drive up productivity, and at the same time, "existential" concerns such as AI escaping human control with d
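Summary: This paper examines the nature and definition of agents. With large language model (LLM) systems being marketed as "coding agents", "AI co-scientists", and similar tools, it offers a critical look at such labels. The authors stress that clearly defining what characterizes an agent helps in understanding and managing the future development of AI technology, which has far-reaching social impact.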
👍 15
06/22 20:00
Generating explorable 3D scenes from a single image requires strong generative priors and accurate geometric representations suitable for downstream use. Current video diffusion models offer high-quality generation and implicitly encode multi-view geometric structure in latent space. However, existi
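Summary: FLAT proposes a new feed-forward method for precise modeling with latent triangles, used to generate explorable 3D scenes from a single image. The study indicates that current video diffusion models lack sufficient geometric expressiveness during generation; FLAT improves results by optimizing geometric structure, offering a new approach to exploring and using 3D scenes.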
👍 14
06/18 20:00
Generating a coherent multi-shot video requires structured cross-shot memory. Subject appearance, scene context, and speaker identity must persist across cuts. Existing approaches either train end-to-end over fixed-length sequences and cannot scale, generate shot-by-shot with memory banks that grow
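Summary: UnityShots studies a memory-driven method for multi-shot audio-visual generation, emphasizing unified cross-shot memory. The method keeps subject appearance, scene context, and speaker identity consistent, markedly improving the coherence of multi-shot generation. The work offers a new perspective on cross-shot coordination in video generation and has significant application potential.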
👍 14
06/21 20:00
Modern text-to-image models excel in visual fidelity and prompt adherence. However, this strict adherence comes at the cost of diversity: generated samples tend to collapse into a single visual interpretation. Existing methods to improve diversity produce outputs driven by incidental variations rath
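Summary: This study analyzes the lack of visual diversity in modern text-to-image models and proposes a diversity control method to improve generated samples. It shows that strict prompt adherence in existing models tends to reduce output diversity, and it explores how to increase diversity while preserving visual fidelity, offering useful insights for image generation research.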
👍 13
06/22 20:00
Unified multi-modal large language models (MLLMs) have achieved strong text-to-image generation quality, but still struggle with structure-aware prompt following, where object counts, spatial relations, attribute bindings, and coarse layouts must be preserved. We attribute this limitation in part to
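Summary: IV-CoT proposes an implicit visual chain-of-thought method to address limitations in structure-aware text-to-image generation. The method strengthens the model's structural awareness of object counts, spatial relations, and similar constraints to improve image quality, and suits generation tasks that demand high structural consistency. The work offers useful insights for multimodal generation.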
👍 13
06/22 20:00
Experience-driven self-evolution is critical for large language model (LLM) agents to improve through open-world interaction. However, existing experience learning methods mostly rely on single-agent loops, where the same agent executes tasks, summarizes outcomes, and determines memory content. This
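Summary: This paper proposes a new execute-distill-verify paradigm to promote experience learning and self-evolution in large language model (LLM) agents. It notes that existing experience learning methods mostly rely on single-agent loops, which limits an agent's ability to learn autonomously. The work offers a new path for agents to improve themselves and respond flexibly to dynamic environments.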
👍 7
06/21 20:00
The Hitchhiker's Guide to Agentic AI is a comprehensive practitioner's reference for building autonomous AI systems. The book covers the full stack from first principles to production deployment, organized around a central thesis: building great agentic systems requires understanding every layer of
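Summary: The Hitchhiker's Guide to Agentic AI is a comprehensive reference for building autonomous AI systems, covering the full stack from first principles to production deployment. The authors stress that understanding how every layer connects is key to building great agentic systems. The book offers systematic guidance for practice and research, helping improve the efficiency and safety of AI systems.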
👍 6
06/23 20:00
While Video Virtual Try-on (VVT) has achieved remarkable progress in synthesizing realistic garment overlays on dynamic subjects, existing paradigms remain fundamentally constrained by a passive dependency on source camera trajectories, failing to accommodate the requisite interactive freedom for o
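Summary: TryOnCrafter studies the use of camera trajectories in video virtual try-on to increase interactive freedom in generation. It notes that current virtual try-on methods depend on fixed source camera trajectories, which limits flexibility. By introducing a renderable and manipulable 4D try-on proxy, the method markedly enhances realism, offering new ideas for applications in fashion and virtual reality.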
👍 6
06/20 20:00
Chain-of-Thought (CoT) has become a standard method for improving reasoning capabilities in large language models (LLMs) by eliciting step-by-step thinking, but its effectiveness in multimodal tasks remains unclear. In this paper, we aim to systematically investigate the key question: What can multi
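Summary: This paper studies chain-of-thought (CoT) in large language models (LLMs) and systematically investigates its effectiveness on multimodal tasks. The results indicate that although CoT has improved reasoning capabilities, it still has limitations when handling multimodal inputs. The work points to directions for integrating multimodal reasoning and advances intelligent decision-making in AI.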