👍 153
06/09 20:00
Many moments in the real world do not wait for a user to ask. A fire starts on a security monitor, an expression flickers across a video call, or a product a viewer wants flashes by in a livestream. Yet today's large models remain mostly turn-based by design: they answer only when addressed, and eve
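Summary: JoyAI-VL-Interaction proposes real-time vision-language interaction intelligence to address the fact that large models usually respond only when asked. By integrating vision and language models, it can respond proactively to important moments in the real world. Because it runs in real time, the system can react automatically in security monitoring, video calls, or livestreams, which makes human-computer interaction more practical.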
👍 96
06/08 20:00
Data tells stories that shape society; the data journalist's job is to turn raw information into stories non-experts can trust. A high-quality news feature takes a newsroom team weeks: hunting for context, running statistics, choosing an angle, and designing visuals. Recent agents handle individual
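Summary: Data Journalist Agent proposes a method for turning raw data into verifiable multimodal stories, aiming to improve the quality and efficiency of news reporting. The agent uses statistics and visual design to generate news features automatically, cutting the time cost of traditional news production. The approach is a meaningful step for data-driven journalism.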
👍 80
06/14 20:00
Generalist robot policies must follow user instructions while reasoning about how objects, cameras, and robot actions interact in the 3D physical world. Recent vision-language-action models (VLAs) and video world-action models (WAMs) inherit strong semantic or temporal priors from large-scale founda
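Summary: Geometric Action Model proposes a geometry-based approach to robot policy learning, aiming to improve autonomous decision-making in complex 3D environments. The method integrates vision-language-action models and draws on user instructions and information about interactions with the environment. Its key contribution is making robots more capable and adaptable when executing tasks.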
👍 77
06/14 20:00
DreamX-World 1.0 is a general-purpose interactive text/image-to-video world model for controllable long-horizon generation. It supports camera navigation, revisits to previously observed regions, and promptable events across photorealistic, game-style, and stylized domains. Our data engine combines
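Summary: DreamX-World 1.0 is a general-purpose interactive world model that supports controllable text/image-to-video generation. It handles several generation tasks, including camera navigation and promptable events, in photorealistic as well as game-style scenes. The model opens new possibilities for long-horizon generation in virtual environments and is relevant to research on virtual reality and game design.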
👍 71
06/11 20:00
Large Language Model (LLM) coding agents have achieved strong results on software engineering tasks, yet repository exploration remains a major bottleneck: locating relevant code consumes substantial token budget and pollutes the agent's context with irrelevant snippets. In most agents, the same mod
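Summary: FastContext proposes an efficient repository exploration technique to improve the performance of large language model (LLM) coding agents. By optimizing code localization, it reduces the impact of irrelevant code on the agent's context and substantially lowers inference cost. This helps make code search in software engineering more efficient and advances AI coding assistants.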
👍 52
06/14 20:00
This technical report introduces VibeThinker-3B, a compact dense model with 3B parameters developed to investigate how far verifiable reasoning can be pushed within a strictly small-model regime. Building upon the Spectrum-to-Signal post-training paradigm, we systematically enhance the model through
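Summary: VibeThinker-3B is a compact dense model with 3B (3 billion) parameters that focuses on pushing verifiable reasoning within the small-model regime. The model is systematically enhanced through the Spectrum-to-Signal post-training paradigm, seeking a balance between reasoning complexity and model scale. The work offers a new direction for small-model research and for efficient reasoning systems.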
👍 23
06/14 20:00
Masked Diffusion Language Models (MDLMs) have emerged as a distinct paradigm for sequence generation. As MDLMs become diverse in capabilities and knowledge coverage, an important question is how to combine their knowledge. Toward this, we first investigate the unique decoding dynamics of MDLMs. We f
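Summary: Who Should Lead Decoding Now? studies how to track reliable trajectories when combining the knowledge of multiple Masked Diffusion Language Models (MDLMs), and proposes a new analysis of their decoding dynamics. The study provides a theoretical basis for choosing an appropriate decoding strategy in multi-task generation and supports more efficient sequence generation.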
👍 21
06/14 20:00
Vision language models are serving as general-purpose interfaces for complex multimodal tasks. However, deployment still faces three gaps: VLMs typically incur high latency and cost when processing dense video frames and long prompts, the agent scaffold remains static after deployment, and standard
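Summary: VisualClaw proposes a personalized real-time agent that aims to make vision language models more responsive on complex multimodal tasks. By addressing problems such as high latency and a static scaffold, the agent shows marked improvement when processing video frames and long prompts. The work should make human-computer interaction more efficient in practice and matters for agents that operate in the physical world.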
👍 14
06/14 20:00
Multi-task learning (MTL) is essential in recommender systems to enable complementary learning among diverse user feedback. While modern industrial practices have shifted from DNNs to Transformer-centric architectures to strengthen sequence modeling and scaling capacity, they still decouple feature
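Summary: OneRank proposes a unified Transformer-native ranking architecture to improve learning efficiency in multi-task recommender systems. Built on recent Transformer architectures, the method couples feature learning more tightly with sequence modeling and improves recommendation quality. The contribution shows the potential for more efficient information fusion in multi-task learning and offers a new way to optimize recommender systems.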
👍 14
06/14 20:00
Visual world models (VWMs) synthesize interactive, action-conditioned rollouts from a single context image. However, it remains an open question how robust these models are to adversarial perturbations. Standard adversarial attacks fail to assess this vulnerability because attackers lack ground-trut
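Summary: BadWorld studies the vulnerability of visual world models (VWMs) to adversarial attacks and proposes an improved evaluation method for identifying how sensitive a model is to perturbations. The method exposes the limitations of current adversarial attacks and suggests new ways to improve robustness, with practical relevance for research on world model security.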
👍 13
06/14 20:00
We introduce Qwen-RobotWorld, a language-conditioned video world model for embodied intelligence. With natural language as a unified action interface, it predicts physically grounded future visual trajectories from current observations across robotic manipulation, autonomous driving, indoor navigati
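Summary: Qwen-RobotWorld introduces a language-conditioned video world model that uses natural language as a unified action interface to predict physically grounded future visual trajectories. The model applies to robotic manipulation, autonomous driving, and similar scenarios, offering an effective way to strengthen decision-making in robotics and marking notable progress in embodied AI.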
👍 13
06/14 20:00
As LLM agents are deployed in long-horizon sessions, context accumulation drives up inference costs. Existing approaches utilize text pruning or dynamic memory eviction to minimize token footprints; however, their unconstrained sequence mutations alter layouts, introducing prefix mismatches and cach
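Summary: TokenPilot proposes a cost-saving context management strategy that aims to lower the inference cost of large language model (LLM) agents in long interactive sessions. Through dynamic memory management, the method shrinks the token footprint and improves inference efficiency. The approach matters for optimizing how language models perform in real applications.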
👍 13
06/09 20:00
Modern Lean theorem provers achieve strong performance only with substantial training and inference compute, driven in part by scarce verified proof data and the long reasoning traces of formal proof search, making both supervised fine-tuning (SFT) and sampling expensive. We introduce Pythagoras-Pro
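Summary: Pythagoras-Prover proposes a strategy for efficient formal proving through enhanced Lean formalization, aiming to improve the performance of modern Lean theorem provers. The method eases the effects of scarce verified data and long reasoning traces, making supervised fine-tuning and sampling more efficient. The work advances both research and applications in formal verification.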
👍 11
06/12 20:00
Advanced agents are increasingly demonstrating the potential to operate as autonomous engineers, creating a growing demand for evaluation benchmarks that capture the complexity of real-world development. Such environments typically involve both complex code and large-scale data (i.e., file system).
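Summary: CODA-BENCH is an evaluation benchmark that tests how well coding agents handle data-intensive tasks, in response to the growing demand for automated engineering. The benchmark covers complex code together with large-scale data, raising the standard of evaluation in realistic development environments and supporting research on code generation and data processing.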
👍 10
04/07 20:00
Web agents act through long interaction sequences, yet existing benchmarks evaluate only terminal success, discarding all process information and offering little guidance on improvement. In this work, we conduct a process-level analysis of web agents. We introduce WebStep, a benchmark of 1,800 task
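Summary: Where Did It Go Wrong? conducts a process-level analysis of web agents and introduces the WebStep benchmark to address the shortcomings of existing evaluation methods. The benchmark provides 1,800 task instances, so evaluation covers process information as well as the final success rate, which gives new guidance on how to improve web agents.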
👍 9
06/12 20:00
Efficient and scalable agentic intelligence requires models that can deliver both low-latency responses and strong reasoning capabilities while remaining practical to train, serve, and deploy. In this report, we present Ling-2.6 and Ring-2.6, a family of models designed to address this challenge at
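Summary: Ling and Ring 2.6 introduce a family of efficient, low-latency agentic models designed to deliver both fast responses and strong reasoning at trillion-parameter scale. By optimizing the model architecture, the work strikes a new balance between trainability and efficiency and offers a practical solution for deploying large-scale agents.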
👍 9
06/14 20:00
As LLMs advance, post-training reinforcement learning (RL) increasingly relies on multi-dimensional rewards to cultivate comprehensive capabilities. This shift demands new algorithms capable of optimizing diverse and potentially competing objectives simultaneously. To address this, existing methods
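Summary: GD^2PO proposes a new algorithm for conflicts among multiple rewards, using a policy optimization method based on group dynamics and reward decoupling to address multi-objective training of LLMs. The study offers a new solution for reinforcement learning systems that must serve different task objectives and advances research on multi-objective optimization.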
👍 8
06/11 20:00
Phone agents are increasingly expected to complete real mobile workflows rather than merely predict the next screen action. However, much of the current mobile-agent literature still evaluates agents primarily as GUI controllers that observe a screen, emit taps and swipes, and are scored by target a
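Summary: PhoneHarness proposes making phone agents more practical by mixing graphical user interface, command-line, and tool operations, so that agents can complete real mobile workflows. The method overcomes the limitation of traditional agent models that act only as GUI controllers and broadens the potential for using agents in everyday life.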
👍 8
06/14 20:00
Multi-turn LLM serving accumulates dialogue history whose Key-Value (KV) cache grows with every turn and every user, quickly exceeding the model weights themselves and making memory -- not compute -- the binding constraint on throughput. Non-uniform KV compression, which allocates heterogeneous budg
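Summary: Tangram proposes an efficient non-uniform KV cache compression scheme to optimize multi-turn LLM serving. By allocating heterogeneous budgets, the technique relieves the memory constraint caused by growing dialogue history and markedly improves the efficiency of dialogue generation. The work matters for making dialogue-based systems more responsive.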
👍 8
06/14 20:00
Unified Multimodal Models (UMMs) have emerged as a critical direction for general-purpose multimodal intelligence, integrating understanding and generation into a single framework. However, existing UMMs face prominent challenges: (1) the inherent learning conflicts between visual understanding and
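Summary: UniDDT proposes a unified multimodal model that combines understanding and generation, aiming to resolve the learning conflicts between visual understanding and generation in existing multimodal models. The model uses a decoupled diffusion mechanism to combine information from different modalities, offering a new approach to general-purpose multimodal intelligence.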