IBM at NeurIPS 2025
- San Diego, California, US
About
Neural Information Processing Systems (NeurIPS) is a leading machine learning and computational neuroscience conference. IBM Research is excited to sponsor NeurIPS again this year as a Platinum sponsor.
We invite all attendees to visit us during the event at booth number 1109, from Tuesday, December 2 through Friday, December 5.
We look forward to meeting you and telling you more about our latest work and career opportunities at IBM Research. At our booth we’ll be demoing projects on a broad range of AI topics such as foundation models, trustworthy AI, natural language processing and understanding, knowledge and reasoning, AI automation, human-centered AI, and federated learning.
Presentation times of conference workshops, demos, papers, and tutorials can be found in the agenda section at the bottom of this page. Note: All times are displayed in your local time.
- IBM Presentations @ NeurIPS Agenda
- Booth demo and staff schedule (by time)
- Booth demo list (by title)
- IBM Research staff on site (by name)
Career opportunities
Visit us at the IBM Booth to meet with IBM researchers and recruiters to speak about future job opportunities or 2026 summer internships.
- Current IBM Research open roles
- Sign up to be notified of future openings by joining our Talent Network.
Keep up with emerging research and scientific developments from IBM Research. Subscribe to the Future Forward Newsletter.
Agenda
- Description:
Abstract: Training large language models for specialized disciplines such as advanced mathematics, molecular biology, or legal reasoning is limited by the scarcity of large, high-quality, domain-specific corpora. Most publicly available datasets are dominated by general-purpose web text. When available, specialized data are fragmented across diverse sources such as preprints, conference papers, forums, lecture notes, and digitized books. No single source offers comprehensive real-world coverage across scientific domains. Consequently, scaling up authentic domain data remains a bottleneck: collecting a subset of relevant tokens often requires downloading and filtering hundreds of terabytes of raw web material, a process that is both time-consuming and costly.
We introduce Data Scout, a modular, LLM-powered pipeline that turns a high-level user intent (e.g., “I need data for advanced mathematics”) into a vetted list of seed URLs in minutes. The system first expands the original intent using an LLM that generates a hierarchical subgraph of related concepts; this taxonomy drives a diversified set of search queries that systematically cover the target domain while respecting known licensing signals. Candidate URLs are then filtered by the same LLM using chain-of-thought prompting based on topical relevance, licensing clarity, and crawlability. Our results show that crawling the selected candidate URLs can yield a high percentage of relevant pages (40%+) related to the user’s intended topic or query, compared to less than 1 percent in general web-scale corpora. Data Scout is available with both CLI and GUI front ends. By democratizing domain-specific data acquisition, Data Scout enables researchers without dedicated crawling infrastructure to bootstrap large, high-fidelity corpora, accelerating the development of specialized LLMs across niche domains.
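As a rough illustration of the pipeline shape, a minimal sketch of the intent-to-seed-URL flow is below; the helper names and the `llm`/`search` callables are illustrative assumptions, not Data Scout's actual API.

```python
# Hypothetical sketch of the Data Scout flow described above; the helper
# names and the `llm`/`search` callables are illustrative, not the real API.

def expand_intent(llm, intent: str, depth: int = 2) -> list[str]:
    """Grow a small taxonomy of subtopics under the user's intent."""
    topics, frontier = [intent], [intent]
    for _ in range(depth):
        frontier = [
            s.strip()
            for topic in frontier
            for s in llm(f"List 5 subtopics of: {topic}").splitlines()
            if s.strip()
        ]
        topics.extend(frontier)
    return topics

def scout(llm, search, intent: str) -> list[str]:
    """Intent -> taxonomy -> diversified queries -> LLM-vetted seed URLs."""
    seeds = []
    for topic in expand_intent(llm, intent):
        for url in search(topic):
            verdict = llm(
                "Think step by step, then answer yes or no: is this URL "
                f"topically relevant to '{topic}', clearly licensed, and "
                f"crawlable? {url}"
            )
            if verdict.strip().lower().startswith("yes"):
                seeds.append(url)
    return seeds
```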
Speaker: Chirag Garg
- Description:
Over the past sixty years, the field of planning has made significant contributions to both the theory and practice of building planning software that can solve previously unaddressed planning problems. This was done through established practices of rigorous design and evaluation of planning systems. The experience and expertise of the planning community are not just important from a historical perspective; the lessons learned could play a crucial role in accelerating the development of LLM-based planners. The purpose of this tutorial is to share the knowledge with the wider AI community, with the aim of incorporating the insights, tools, and data from the automated planning community into the design and evaluation of LLM-based planners. We believe that exposing the NeurIPS community to the theory and practices from the planning community will contribute greatly to the progress in building LLM-based planners and to planning in general.
Website: https://planning-llm-era.github.io
Authors: Christian Muise (Non-IBM), and others
- Description:
The BeeAI Framework is an open-source project for building reliable AI agents that combine autonomy with control. Current agent frameworks focus primarily on prompting and orchestration, leaving critical questions of predictability and safety unaddressed. BeeAI fills this gap with a lightweight framework that enables developers to build agents whose reasoning abilities are preserved while execution is constrained by declarative, rule-based requirements. At the core of the framework is the RequirementAgent, a novel agent design that enforces deterministic, controlled behaviors across heterogeneous language models. With RequirementAgent, developers can ensure consistent and reliable execution patterns regardless of differences in model reasoning, tool-calling abilities, or stochastic variation. This approach provides practitioners with a unified abstraction layer that simplifies the deployment of complex AI systems into production settings. As an incubating Linux Foundation AI project, BeeAI is gaining adoption in open source and enterprise contexts as organizations seek robust ways to operationalize AI agents at scale. At NeurIPS EXPO, we will showcase BeeAI’s architecture, real-world use cases, and lessons learned from applying declarative control to agent autonomy.
Speakers: Sandi Besen (Artificial Intelligence Applied Research, Neudesic, an IBM Company)
- Description:
Current algorithms for aligning LLM behavior are often implemented for narrow settings, making it difficult for researchers and developers to understand their effectiveness across model architectures, datasets, and tasks. To help provide a more informed and principled approach to steering model behavior, we present the AI Steerability 360 (AISteer360) and In-Context Explainability 360 (ICX360) toolkits. Participants will first be guided through a conceptual overview for how model behavior can be influenced across four model control surfaces: input (prompting), structural (weights/architecture), state (activations/attentions), and output (decoding). After the conceptual overview, we will guide attendees through how to apply some recently developed explainability tools (from ICX360) for understanding why models produce given, potentially undesirable, outputs and how this information is used to design targeted steering inventions (via AISteer360). Closing the loop, we will evaluate if the baseline behavior (of the original, unsteered model) was successfully mitigated by the selected steering inventions and investigate if steering introduced any unintended behavioral side-effects. All of the experiments throughout the demonstration will be facilitated solely by the tools in the two toolkits, illustrating their power to design end-to-end steering workflows. Attendees will come away with a practical understanding of how to apply these toolkits to their own alignment challenges.
- Description:
Modern incident root-cause analysis (RCA) is constrained by partial observability, symptom-centric signals, and the overwhelming noise present in logs, traces, and metrics. Diagnosing production failures often depends on instrumentation quality and human expertise, while latent software defects, configuration errors, and zero-day failure modes remain difficult to pinpoint. To address these challenges, we demonstrate a multi-agent system for incident diagnostics that augments observability data with application source code and static analysis signals.
Our system introduces two cooperating agents: the Code Context Agent (COCOA), which builds a knowledge graph of program dependencies, control/data flows, and caller–callee relationships; and the Incident Diagnostics Agent (IDA), which performs agentic reasoning over an entity topology graph enriched with observability streams. Together, these agents extend topology-aware planning (TAP) to simultaneously operate on program dependency graphs and infrastructure entity graphs, thereby linking runtime symptoms with underlying code-level causes.
This demo showcases how multi-agent collaboration enables deeper, context-sensitive RCA. We walk through real-world inspired scenarios—including incidents where critical log lines are hidden in noisy observability streams or where latent defects emerge only after system updates—illustrating how the system surfaces root causes that would otherwise remain invisible. By bridging program analysis with runtime observability, our approach moves beyond symptom-driven diagnostics toward a more reliable, automated framework for incident management.
Speakers: Ramesh Kumar Kottapalli (IBM)
- Description:
Visit us at the IBM booth in the exhibit hall to talk to our researchers and recruiters. We'll also be doing demos of our work.
- Booth demo and staff schedule (by time)
- Booth demo list (by title)
- Description:
Abstract: This hands-on workshop introduces a proposal that treats LLMs as computing elements governed by established software development principles—particularly task decomposition and modularization—at both the programming-model level (Mellea) and the model level (LLM intrinsics).
LLM outputs are often unpredictable and incorrect. Agentic frameworks and prompt optimization libraries attempt to manage this by giving control to the LLM, but this leads to systems that are hard to debug, maintain, and scale. Mellea offers an alternative: a programming model that restores developer control through modular design, information hiding, and compositional contracts. This enables predictable fault models, better portability, and lower inference costs. Attendees will gain hands-on experience building applications using the Melleaic approach.
Extending these principles to the model level, the workshop introduces a modularization framework for LLMs using activated LoRAs. These produce components—LLM intrinsics—that match fine-tuned model accuracy for specific tasks but with significantly lower inference costs and latency, thanks to KV cache reuse. Participants will build applications using a pre-built library of RAG LLM intrinsics and learn how to train their own.
Presented by the creators of Mellea and the inventors of LLM intrinsics and aLoRA, this workshop equips attendees with foundational skills for scalable model/application co-design.
Speakers: Nathan Fulton (IBM), Hendrik Strobelt (IBM)
- Description:
The rapid rise of autonomous AI agents across enterprises is creating a new class of security and governance challenges that are not adequately addressed with today’s technology. Context Forge MCP Gateway is an open-source, security-focused middleware that provides fine-grained control and extensibility for agent operations. With over 2.6k GitHub stars and a rapidly growing user community, Context Forge addresses emerging threat classes including prompt injection, data leakage, and misuse of sensitive resources. At its core, Context Forge introduces a plugin architecture modeled after Linux Security Modules, embedding reusable security hooks at critical points in agent execution (e.g., prompt handling, tool invocation, data transformation). This modular foundation enables organizations to enforce contextual policies at scale—ranging from PII redaction and provenance tagging to prompt injection detection and policy-based access control. With 39 plugins already available, Context Forge is establishing a standards-aligned ecosystem for securing agent workflows in real-world enterprise deployments. By blending research-driven design with open-source adoption it creates a practical path for organizations to advance agent trustworthiness, safety, and compliance.
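The LSM-inspired hook idea can be sketched generically; the hook-point names and plugins below are illustrative assumptions, not Context Forge's actual plugin API.

```python
# Generic sketch of an LSM-style hook chain for agent middleware. Hook-point
# names and plugins are hypothetical, not Context Forge's actual API.
from typing import Callable

HookFn = Callable[[dict], dict]

class HookChain:
    """Runs registered plugins in order at a named point in agent execution."""

    def __init__(self) -> None:
        self.hooks: dict[str, list[HookFn]] = {}

    def register(self, point: str, fn: HookFn) -> None:
        self.hooks.setdefault(point, []).append(fn)

    def run(self, point: str, payload: dict) -> dict:
        for fn in self.hooks.get(point, []):
            payload = fn(payload)  # a plugin may rewrite the payload or raise to veto
        return payload

def redact_pii(p: dict) -> dict:
    # Stand-in for a real PII detector: mask anything that looks like an SSN.
    return {**p, "text": p["text"].replace("123-45-6789", "[REDACTED]")}

def deny_unlisted_tools(p: dict) -> dict:
    if p.get("tool") not in {"search", "calculator"}:
        raise PermissionError(f"tool {p.get('tool')!r} blocked by policy")
    return p

chain = HookChain()
chain.register("prompt_pre", redact_pii)          # hook: prompt handling
chain.register("tool_pre_invoke", deny_unlisted_tools)  # hook: tool invocation
print(chain.run("prompt_pre", {"text": "the SSN is 123-45-6789"}))
```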
- Description:
Modern enterprises depend on efficient data engineering pipelines to unlock value from diverse and large-scale datasets. Yet, current processes for workflow design, schema ingestion, and data quality validation remain complex, error-prone, and dependent on technical expertise. This creates barriers for non-expert users, slows down development, and introduces risks of data inconsistency.
We present a suite of LLM-powered frameworks that reimagine enterprise data engineering across three critical dimensions: (i) From Natural Language to Executable ETL Flows, enabling intuitive pipeline creation with natural language specifications and automatic operator/property inference, (ii) All You Can Ingest, an end-to-end schema mapping and transformation framework that unifies semantic alignment, code synthesis, and robust validation, and (iii) Quality Assessment of Tabular Data, a scalable approach for auto-generating interpretable quality rules and executable validators tailored to specific datasets.
Together, these innovations demonstrate how Large Language Models (LLMs), augmented with retrieval, code synthesis, reasoning, and guardrails, can transform the data engineering lifecycle into a more accessible, adaptive, and trustworthy process, reducing manual effort, accelerating time-to-value, and ensuring data fidelity at enterprise scale.
- Description:
Mellea is a generative AI library for writing robust application software. LLM outputs are intrinsically unpredictable and often wrong. Agentic frameworks and prompt optimization libraries address this unpredictability by putting the LLM in control, leading to systems that are often difficult to debug, maintain, evolve, and port. Mellea puts developers back in control by defining a programming model that encourages task decomposition, information hiding, and compositional contracts between isolated modules. Mellea’s programming model results in systems with manageable fault models, better portability, and lower inference-time costs.
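A minimal plain-Python sketch of that philosophy follows: decomposed steps, each guarded by a checkable contract. This is a hypothetical illustration, not Mellea's actual API.

```python
# Plain-Python sketch of the design philosophy described above; the function
# names and contract style are assumptions, not Mellea's actual API.

def summarize(llm, document: str, max_words: int = 50) -> str:
    """One isolated module: a generation step plus a verifiable contract."""
    draft = llm(f"Summarize in at most {max_words} words:\n{document}")
    if len(draft.split()) > max_words:  # contract check, enforced in code
        raise ValueError("contract violated: summary too long")
    return draft

def titled_summary(llm, document: str) -> str:
    """Compose modules; a failure is local and debuggable, not emergent."""
    summary = summarize(llm, document)
    title = llm(f"Write a five-word title for: {summary}")
    return f"{title}\n\n{summary}"
```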
- Description:
Long-range intra-molecular interactions are not well represented by existing molecular descriptors, which limits the accuracy of machine learning models for molecular property prediction. We introduce TDiMS, a descriptor that encodes topological distances between substructure pairs, enabling explicit representation of long-range effects while retaining chemical meaning. Applied to molecular datasets, TDiMS shows particular advantages for larger molecules, where long-range interactions strongly influence target properties. We further demonstrate that choosing appropriate substructure definitions, such as tailored fragments, enhances predictive performance. Beyond accuracy, TDiMS provides interpretable features essential for material discovery, offering insights into the structural motifs driving predictions. These results highlight distance-based, interpretable descriptors as a promising route for machine learning in materials discovery.
Authors: Akihiro Kishimoto (IBM), Kohei Miyaguchi (IBM), Masataka Hirose (Non-IBM), Junta Fuchiwaki (Non-IBM), and others
- Description:
Visit us at the IBM booth in the exhibit hall to talk to our researchers and recruiters. We'll also be doing demos of our work.
- Booth demo and staff schedule (by time)
- Booth demo list (by title)
- Description:
Transformer architectures can solve unseen tasks based on input-output pairs in a given prompt due to in-context learning (ICL). Existing theoretical studies on ICL have mainly focused on linear regression tasks, often with i.i.d. inputs. To understand how transformers express in-context learning when modeling dynamics-driven functions, we investigate Markovian function learning through a structured ICL setup, where we characterize the loss landscape to reveal underlying optimization behaviors. Specifically, we (1) provide the closed-form expression of the global minimizer (in an enlarged parameter space) for a single-layer linear self-attention (LSA) model; (2) prove that recovering transformer parameters that realize the optimal solution is NP-hard in general, revealing a fundamental limitation of one-layer LSA in representing structured dynamical functions; and (3) supply a novel interpretation of a multilayer LSA as performing preconditioned gradient descent to optimize multiple objectives beyond the square loss. These theoretical results are numerically validated using simplified transformers.
Authors: Yanna Ding (Non-IBM), Songtao Lu (Non-IBM), Tomasz Nowicki (IBM), Jianxi Gao (Non-IBM), and others
- Description:
We present the design and implementation of a new lifetime-aware tensor offloading framework for GPU memory expansion using low-cost PCIe-based solid-state drives (SSDs). Our framework, TeraX, is developed explicitly for large language model (LLM) training with multiple GPUs and multiple SSDs. Its design is driven by our observation that only a small fraction (<5%) of tensors in LLMs are active in each training iteration; many inactive tensors are large and will not be used for a long period of time, creating ample opportunities for offloading/prefetching them to/from slow SSDs without stalling the GPU training process. TeraX accurately estimates the lifetime of each tensor using the execution graph generated by PyTorch, from which it produces an optimized tensor offloading plan. TeraX has a runtime tensor migration engine that fulfills the offloading plan via GPUDirect Storage, which allows direct data transfer between GPUs and SSDs. In comparison with state-of-the-art systems such as ZeRO-Offload and ZeRO-Infinity, we demonstrate that TeraX improves the training performance of various LLMs and achieves near-ideal performance relative to a baseline with unlimited GPU memory.
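A toy sketch of lifetime-aware planning in this spirit is below; the thresholds, tuple format, and prefetch lead time are illustrative assumptions, not TeraX's actual planner.

```python
# Toy sketch of lifetime-aware offload planning as described above; the
# thresholds and schedule format are illustrative assumptions.

def plan_offload(tensors, now_step, min_idle_steps=100, min_bytes=1 << 20):
    """tensors: list of (name, size_bytes, last_use_step, next_use_step).
    Offload large tensors that stay idle long enough to hide SSD transfer
    latency, and schedule a prefetch shortly before their next use."""
    plan = []
    for name, size, last_use, next_use in tensors:
        idle = next_use - last_use
        if size >= min_bytes and idle >= min_idle_steps and last_use <= now_step:
            plan.append(("offload", name, last_use))
            plan.append(("prefetch", name, next_use - 10))  # lead time
    return sorted(plan, key=lambda op: op[2])

ops = plan_offload(
    [("embed.grad", 8 << 20, 3, 500), ("h0.act", 1 << 10, 4, 6)], now_step=5
)
print(ops)  # only the large, long-idle tensor is offloaded and prefetched
```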
Authors: Ziqi Yuan (Non-IBM), Haoyang Zhang (Non-IBM), Yirui Eric Zhou (Non-IBM), Apoorve Mohan (IBM), I-Hsin Chung (IBM), and others
- Description:
Low-Rank Adaptation (LoRA) has emerged as a highly efficient framework for finetuning the weights of large foundation models, and has become the go-to method for data-driven customization of LLMs. Despite the promise of highly customized behaviors and capabilities, switching between relevant LoRAs in a multiturn setting is inefficient, as the key-value (KV) cache of the entire turn history must be recomputed with the LoRA weights before generation can begin. To address this problem, we propose Activated LoRA (aLoRA), an adapter architecture which modifies the LoRA framework to only adapt weights for the tokens in the sequence after the aLoRA is invoked. This change crucially allows aLoRA to accept the base model's KV cache of the input string, meaning that aLoRA can be instantly activated whenever needed in a chain without recomputing the prior keys and values. This enables building what we call intrinsics, i.e. specialized models invoked to perform well-defined operations on portions of an input chain or conversation that otherwise uses the base model by default. We train a set of aLoRA-based intrinsics models, demonstrating competitive accuracy with standard LoRA while significantly improving inference efficiency. We contributed our Activated LoRA implementation to the Huggingface PEFT library here: https://github.com/huggingface/peft.
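Conceptually, the adapter's low-rank update is gated to positions at or after the invocation point, which is why the base model's KV cache for earlier tokens stays valid. A simplified single-projection numpy sketch (shapes and the single-matrix setup are simplifying assumptions):

```python
import numpy as np

# Conceptual sketch of the aLoRA idea described above: the low-rank update
# applies only to positions at or after the invocation point, so keys/values
# already computed by the base model remain valid.

d, r, T, t_invoke = 64, 8, 10, 6
rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))                               # frozen base projection
A, B = rng.normal(size=(d, r)), rng.normal(size=(r, d))   # LoRA factors
X = rng.normal(size=(T, d))                               # token hidden states

base = X @ W
delta = (X @ A) @ B
mask = (np.arange(T) >= t_invoke)[:, None]  # 1 only from the invocation on
out = base + mask * delta

# Positions before t_invoke are untouched, so the base KV cache is reusable.
assert np.allclose(out[:t_invoke], base[:t_invoke])
```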
Authors: Kristjan Greenewald (IBM), Luis Lastras (IBM), and others
- Description:
Reinforcement learning (RL) has recently emerged as a compelling approach for enhancing the reasoning capabilities of large language models (LLMs), where an LLM generator serves as a policy guided by a verifier (reward model). However, current RL post-training methods for LLMs typically use verifiers that are fixed (rule-based or frozen pretrained) or trained discriminatively via supervised fine-tuning (SFT). Such designs are susceptible to reward hacking and generalize poorly beyond their training distributions. To overcome these limitations, we propose Tango, a novel framework that uses RL to concurrently train both an LLM generator and a verifier in an interleaved manner. A central innovation of Tango is its generative, process-level LLM verifier, which is trained via RL and co-evolves with the generator. Importantly, the verifier is trained solely based on outcome-level verification correctness rewards without requiring explicit process-level annotations. This generative RL-trained verifier exhibits improved robustness and superior generalization compared to deterministic or SFT-trained verifiers, fostering effective mutual reinforcement with the generator. Extensive experiments demonstrate that both components of Tango achieve state-of-the-art results among 7B/8B-scale models: the generator attains best-in-class performance across five competition-level math benchmarks and four challenging out-of-domain reasoning tasks, while the verifier leads on the ProcessBench dataset. Remarkably, both components exhibit particularly substantial improvements on the most difficult mathematical reasoning problems.
Authors: Kaiwen Zha (Non-IBM), Zhengqi Gao (Non-IBM), Maohao Shen (Non-IBM), Zhang-Wei Hong (IBM), Duane Boning (Non-IBM), Dina Katabi (Non-IBM)
- Description:
The recent advent of reasoning models like OpenAI's o1 was met with excited speculation by the AI community about the mechanisms underlying these capabilities in closed models, followed by a rush of replication efforts, particularly from the open source community. These speculations were largely settled by the demonstration from DeepSeek-R1 that chains-of-thought and reinforcement learning (RL) can effectively replicate reasoning on top of base LLMs. However, it remains valuable to explore alternative methods for theoretically eliciting reasoning that could help elucidate the underlying mechanisms, as well as providing additional methods that may offer complementary benefits.
Here, we build on the long-standing literature in cognitive psychology and cognitive architectures, which postulates that reasoning arises from the orchestrated, sequential execution of a set of modular, predetermined cognitive operations. Crucially, we implement this key idea within a modern agentic tool-calling framework. In particular, we endow an LLM with a small set of "cognitive tools" encapsulating specific reasoning operations, each executed by the LLM itself. Surprisingly, this simple strategy results in considerable gains in performance on standard mathematical reasoning benchmarks compared to base LLMs, for both closed and open-weight models. For instance, providing our "cognitive tools" to GPT-4.1 increases its pass@1 performance on AIME2024 from 26.7% to 43.3%, bringing it very close to the performance of o1-preview.
In addition to its practical implications, this demonstration contributes to the debate regarding the role of post-training methods in eliciting reasoning in LLMs versus the role of inherent capabilities acquired during pre-training, and whether post-training merely uncovers these latent abilities.
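A hypothetical sketch of the setup: each cognitive tool is a fixed reasoning operation executed by the LLM itself, orchestrated via tool selection. The operation names and prompts below are illustrative, not the paper's exact set.

```python
# Hypothetical sketch: each "cognitive tool" is a fixed reasoning operation
# executed by the LLM itself; names and prompts are illustrative only.

COGNITIVE_TOOLS = {
    "understand": "Restate the problem and list knowns and unknowns:\n{q}",
    "recall": "Recall a similar solved problem and its method:\n{q}",
    "examine": "Check the current reasoning step by step for errors:\n{q}",
    "backtrack": "The last approach failed; propose a different one:\n{q}",
}

def solve(llm, question: str, max_calls: int = 8) -> str:
    state = question
    for _ in range(max_calls):
        choice = llm(
            f"Pick one tool from {sorted(COGNITIVE_TOOLS)} or FINISH. "
            f"Context:\n{state}"
        ).strip()
        if choice not in COGNITIVE_TOOLS:  # FINISH or anything unexpected
            break
        state += "\n" + llm(COGNITIVE_TOOLS[choice].format(q=state))
    return llm(f"Give the final answer:\n{state}")
```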
- Description:
Concept Bottleneck Models (CBMs) are interpretable machine learning models that ground their predictions on human-understandable concepts, allowing for targeted interventions in their decision-making process. However, when intervened on, CBMs assume the availability of humans that can identify the need to intervene and always provide correct interventions. Both assumptions are unrealistic and impractical, considering labor costs and human error-proneness. In contrast, Learning to Defer (L2D) extends supervised learning by allowing machine learning models to identify cases where a human is more likely to be correct than the model, thus leading to deferring systems with improved performance. In this work, we gain inspiration from L2D and propose Deferring CBMs (DCBMs), a novel framework that allows CBMs to learn when an intervention is needed. To this end, we model DCBMs as a composition of deferring systems and derive a consistent L2D loss to train them. Moreover, by relying on a CBM architecture, DCBMs can explain the reasons for deferring on the final task. Our results show that DCBMs can achieve high predictive performance and interpretability by deferring only when needed.
Authors: Andrea Pugnana (Non-IBM), Riccardo Massidda (Non-IBM), Francesco Giannini (Non-IBM), Mateo Espinosa Zarlenga (Non-IBM), Roberto Pellungrini (Non-IBM), and others
- Description:
We propose a novel energy function for Dense Associative Memory (DenseAM) networks, the log-sum-ReLU (LSR), inspired by optimal kernel density estimation. Unlike the common log-sum-exponential (LSE) function, LSR is based on the Epanechnikov kernel and enables exact memory retrieval with exponential capacity without requiring exponential separation functions. Uniquely, it introduces abundant additional emergent local minima while preserving perfect pattern recovery --- a characteristic previously unseen in DenseAM literature. Empirical results show that LSR energy has significantly more local minima (memories) that have comparable log-likelihood to LSE-based models. Analysis of LSR's emergent memories on image datasets reveals a degree of creativity and novelty, hinting at this method's potential for both large-scale memory storage and generative tasks.
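The contrast with the classical log-sum-exp energy can be made concrete numerically; the exact LSR parametrization in the paper may differ from this Epanechnikov-style form, which is an assumption for illustration.

```python
import numpy as np

# Numerical sketch contrasting the classical log-sum-exp (LSE) energy with a
# log-sum-ReLU (LSR) variant built on an Epanechnikov-style kernel, as
# described above. The paper's exact LSR form may differ.

def lse_energy(x, patterns, beta=4.0):
    d2 = ((patterns - x) ** 2).sum(axis=1)
    return -np.log(np.exp(-beta * d2).sum())

def lsr_energy(x, patterns, radius=1.0, eps=1e-12):
    d2 = ((patterns - x) ** 2).sum(axis=1)
    k = np.maximum(0.0, radius**2 - d2)  # compactly supported kernel
    return -np.log(k.sum() + eps)

patterns = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
for x in (np.array([0.05, 0.0]), np.array([1.5, 1.5])):
    print(x, lse_energy(x, patterns), lsr_energy(x, patterns))
# Far from all stored patterns the ReLU kernel is exactly zero, so memories
# are compactly supported instead of globally coupled as under LSE.
```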
Authors: Zhaoyang Shi (Non-IBM), Krishnakumar Balasubramanian (Non-IBM), and others
- Description:
Recent works have studied fair resource allocation in social settings, where fairness is judged by the impact of allocation decisions rather than more traditional minimum or maximum thresholds on the allocations themselves. Our work significantly adds to this literature by developing continuous resource allocation strategies that adhere to equality of impact, a generalization of equality of opportunity. We derive methods to maximize total welfare across groups subject to minimal violation of equality of impact, in settings where the outcomes of allocations are unknown but have a diminishing marginal effect. While focused on a two-group setting, our study addresses a broader class of welfare dynamics than explored in prior work. Our contributions are threefold. First, we introduce Equality of Impact (EoI), a fairness criterion defined via group-level impact functions. Second, we design an online algorithm for non-noisy settings that leverages the problem’s geometric structure and achieves constant cumulative fairness regret. Third, we extend this approach to noisy environments with a meta-algorithm and empirically demonstrate that our methods find fair allocations and perform competitively relative to representative baselines.
Authors: Blossom Metevier (Non-IBM), Philip Thomas (Non-IBM), and others
- Description:
When language model (LM) users aim to improve the quality of its generations, it is crucial to specify concrete behavioral attributes that the model should strive to reflect. However, curating such principles across many domains, even non-exhaustively, requires a labor-intensive annotation process. To automate this process, we propose eliciting these latent attributes that guide model reasoning toward human-preferred responses by explicitly modeling them in a self-correction setting. Our approach mines new principles from the LM itself and compresses the discovered elements to an interpretable set via clustering. Specifically, we employ a form of posterior-regularized Monte Carlo Expectation-Maximization to both identify a condensed set of the most effective latent principles and teach the LM to strategically invoke them in order to intrinsically refine its responses. We demonstrate that bootstrapping our algorithm over multiple iterations enables smaller language models (7-8B parameters) to self-improve, achieving +8-10% in AlpacaEval win-rate, an average of +0.3 on MT-Bench, and +19-23% in principle-following win-rate on IFEval. We also show that clustering the principles yields interpretable and diverse model-generated constitutions while retaining model performance. The gains that our method achieves highlight the potential of automated, principle-driven post-training recipes toward continual self-improvement.
- Description:
Recent work on large language models (LLMs) has increasingly focused on post-training and alignment with datasets curated to enhance instruction following, world knowledge, and specialized skills. However, most post-training datasets used in leading open- and closed-source LLMs remain inaccessible to the public, with limited information about their construction process. This lack of transparency has motivated the recent development of open-source post-training corpora. While training on these open alternatives can yield performance comparable to that of leading models, systematic comparisons remain challenging due to the significant computational cost of conducting them rigorously at scale, and are therefore largely absent. As a result, it remains unclear how specific samples, task types, or curation strategies influence downstream performance when assessing data quality. In this work, we conduct the first comprehensive side-by-side analysis of two prominent open post-training datasets: Tulu-3-SFT-Mix and SmolTalk. Using the Magpie framework, we annotate each sample with detailed quality metrics, including turn structure (single-turn vs. multi-turn), task category, input quality, and response quality, and we derive statistics that reveal structural and qualitative similarities and differences between the two datasets. Based on these insights, we design a principled curation recipe that produces a new data mixture, TuluTalk, which contains 14% fewer samples than either source dataset while matching or exceeding their performance on key benchmarks. Our findings offer actionable insights for constructing more effective post-training datasets that improve model performance within practical resource limits. To support future research, we publicly release both the annotated source datasets and our curated TuluTalk mixture.
Authors: Aladin Djuhera (Non-IBM), Syed Zawad (IBM), Farhan Ahmed (IBM), Holger Boche (Non-IBM), and others
- Description:
Training data attribution (TDA) is concerned with understanding model behavior in terms of the training data. This paper draws attention to the common setting where one has access only to the final trained model, and not the training algorithm or intermediate information from training. To serve as a gold standard for TDA in this "final-model-only" setting, we propose further training, with appropriate adjustment and averaging, to measure the sensitivity of the given model to training instances. We then unify existing gradient-based methods for TDA by showing that they all approximate the further training gold standard in different ways. We investigate empirically the quality of these gradient-based approximations to further training, for tabular, image, and text datasets and models. We find that the approximation quality of first-order methods is sometimes high but decays with the amount of further training. In contrast, the approximations given by influence function methods are more stable but surprisingly lower in quality.
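A minimal sketch of the further-training idea follows: continue training the final model with and without a given instance, average over seeds, and score the instance by the change in test loss. The `train_steps` and `loss` callables are placeholders for the actual training setup, and the retrain-style adjustment here is a simplifying assumption.

```python
import copy
import numpy as np

# Minimal sketch of a "further training" gold standard in the spirit
# described above; `train_steps(model, data, seed)` returns a further-trained
# model and `loss(model, z)` evaluates it on a test point.

def further_training_score(model, train_steps, loss, data, idx, z_test,
                           n_seeds=5):
    deltas = []
    for seed in range(n_seeds):
        with_i = train_steps(copy.deepcopy(model), data, seed=seed)
        without_i = train_steps(
            copy.deepcopy(model),
            [z for j, z in enumerate(data) if j != idx],
            seed=seed,
        )
        # Averaging over seeds smooths out training stochasticity.
        deltas.append(loss(without_i, z_test) - loss(with_i, z_test))
    return float(np.mean(deltas))  # > 0: the instance helped on z_test
```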
Authors: Soumya Ghosh (Non-IBM), and others
- Description:
Generative AI now enables many artifacts to be created with little human involvement. But delegating primary responsibility for artifact generation to AI may alter the creative process in undesirable ways. Here we consider an alternative approach, one in which AI provides encouragement, constructive feedback, and horizon-expanding reflection, only taking on significant content generation when requested. This role corresponds to that of a muse, supporting rather than replacing the human creator. After reviewing the roles human muses play, we discuss interactions with several generative AI models in three scenarios of use, identifying conversational behaviors that an AI Muse would need to exhibit.
Authors: John Richards (IBM), Jacquelyn Martino (IBM), Rachel Bellamy (IBM), Michael Muller (IBM)
- Description:
Compositional generalization—a key open challenge in modern machine learning—requires models to predict unknown combinations of known concepts. However, assessing compositional generalization remains a fundamental challenge due to the lack of standardized evaluation protocols and the limitations of current benchmarks, which often favor efficiency over rigor. At the same time, general-purpose vision architectures lack the necessary inductive biases, and existing approaches to endow them with such biases compromise scalability. As a remedy, this paper introduces: 1) a rigorous evaluation framework that unifies and extends previous approaches while reducing computational requirements from combinatorial to constant; 2) an extensive and modern evaluation of the status of compositional generalization in supervised vision backbones, training more than 5000 models; 3) Attribute Invariant Networks, a class of models establishing a new Pareto frontier in compositional generalization, achieving a 23.43% accuracy improvement over baselines while reducing parameter overhead from 600% to 16% compared to fully disentangled counterparts.
Authors: Roger Wattenhofer (Non-IBM), and others
- Description:
In this paper, we introduce LangSplatV2, which achieves high-dimensional feature splatting at 476.2 FPS and 3D open-vocabulary text querying at 384.6 FPS for high-resolution images, providing a 42× speedup and a 47× boost over LangSplat, respectively, along with improved query accuracy. LangSplat employs Gaussian Splatting to embed 2D CLIP language features into 3D, significantly enhancing speed and learning a precise 3D language field with SAM semantics. Such advancements in 3D language fields are crucial for applications that require language interaction within complex scenes. However, LangSplat does not yet achieve real-time performance (8.2 FPS), even with advanced A100 GPUs, severely limiting its broader application. In this paper, we first conduct a detailed time analysis of LangSplat, identifying the heavyweight decoder as the primary speed bottleneck. Our solution, LangSplatV2, assumes that each Gaussian acts as a sparse code within a global dictionary, leading to the learning of a 3D sparse coefficient field that entirely eliminates the need for a heavyweight decoder. By leveraging this sparsity, we further propose an efficient sparse coefficient splatting method with CUDA optimization, rendering high-dimensional feature maps at high quality while incurring only the time cost of splatting an ultra-low-dimensional feature. Our experimental results demonstrate that LangSplatV2 not only achieves better or competitive query accuracy but is also significantly faster.
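The sparse-coefficient idea can be sketched in a few lines: splat only low-dimensional sparse coefficients per Gaussian, then lift with one shared dictionary. Dimensions and the top-k sparsity rule below are illustrative assumptions.

```python
import numpy as np

# Sketch of the decoder-free idea described above: splat low-dimensional
# sparse coefficients, then lift them with one global dictionary.

rng = np.random.default_rng(0)
n_gauss, k_atoms, feat_dim, n_pix = 1000, 64, 512, 4
dictionary = rng.normal(size=(k_atoms, feat_dim))  # shared across the scene

coeffs = rng.normal(size=(n_gauss, k_atoms))
drop = np.argsort(np.abs(coeffs), axis=1)[:, :-4]  # keep 4 atoms/Gaussian
np.put_along_axis(coeffs, drop, 0.0, axis=1)       # sparse coefficient field

weights = rng.random(size=(n_pix, n_gauss))        # stand-in for alpha-blending
weights /= weights.sum(axis=1, keepdims=True)

pixel_coeffs = weights @ coeffs        # cheap: only k_atoms-dim splatting
pixel_feats = pixel_coeffs @ dictionary  # one matmul lifts to 512-D features
print(pixel_feats.shape)               # (4, 512)
```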
Authors: Wanhua Li (Non-IBM), Yujie Zhao (Non-IBM), Minghan Qin (Non-IBM), Yang Liu (Non-IBM), Yuanhao Cai (Non-IBM), Chuang Gan (IBM), and others
- Description:
Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks: even a few harmful examples can compromise safety alignment. A common mitigation strategy is to update the model more strongly on examples deemed safe, while downweighting or excluding those flagged as unsafe. However, because safety context can shift within a single example, updating the model equally on both harmful and harmless parts of a response is suboptimal, a coarse treatment we term static safety shaping. In contrast, we propose dynamic safety shaping (DSS), a framework that uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content. To enable such fine-grained control during finetuning, we introduce a key insight: guardrail models, traditionally used for filtering, can be repurposed to evaluate partial responses, tracking how safety risk evolves throughout the response, segment by segment. This leads to the Safety Trajectory Assessment of Response (STAR), a token-level signal that enables shaping to operate dynamically over the training sequence. Building on this, we present STAR-DSS, guided by STAR scores, that robustly mitigates finetuning risks and delivers substantial safety improvements across diverse threats, datasets, and model families, all without compromising capability on intended tasks. We encourage future safety research to build on dynamic shaping principles for stronger mitigation against evolving finetuning risks.
Authors: Shengyun Peng (Non-IBM), Pin-Yu Chen (Principal Research Scientist and Manager; Chief Scientist, RPI-IBM AI Research Collaboration, IBM), Jianfeng Chi (Non-IBM), Seongmin Lee (Non-IBM), Duen Horng Chau (Non-IBM)
- Description:
The attention mechanism is a core primitive in modern large language models (LLMs) and AI more broadly. Since attention by itself is permutation-invariant, position encoding is essential for modeling structured domains such as language. Rotary position encoding (RoPE) has emerged as the de facto standard approach for position encoding and is part of many modern LLMs. However, in RoPE the key/query transformation between two elements in a sequence is only a function of their relative position and otherwise independent of the actual input. This limits the expressivity of RoPE-based transformers. This paper describes PaTH, a flexible data-dependent position encoding scheme based on accumulated products of Householder-like transformations, where each transformation is data-dependent, i.e., a function of the input. We derive an efficient parallel algorithm for training through exploiting a compact representation of products of Householder matrices, and implement a FlashAttention-style blockwise algorithm that minimizes I/O cost. Across both targeted synthetic benchmarks and moderate-scale real-world language modeling experiments, we find that PaTH demonstrates superior performance compared to RoPE and other recent baselines.
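A toy sketch of the core construction is below: an explicit cumulative product of Householder-like maps. Real PaTH uses a compact product representation and a blockwise FlashAttention-style kernel rather than this explicit O(Td²) loop, and the fixed beta is an illustrative simplification.

```python
import numpy as np

# Toy sketch of a data-dependent position code as a cumulative product of
# Householder-like maps H_t = I - beta * v_t v_t^T, with v_t a function of
# the input. This explicit loop is for illustration only.

d, T = 16, 12
rng = np.random.default_rng(1)
X = rng.normal(size=(T, d))  # token inputs

def householder_like(v, beta=1.0):
    v = v / np.linalg.norm(v)
    return np.eye(len(v)) - beta * np.outer(v, v)

P = np.eye(d)
codes = []
for t in range(T):
    P = P @ householder_like(X[t])  # each factor depends on the input
    codes.append(P.copy())

# The relative transform between positions i < j is data-dependent here,
# unlike RoPE, where it depends only on the offset j - i.
q, k, i, j = rng.normal(size=d), rng.normal(size=d), 3, 9
print((codes[i] @ q) @ (codes[j] @ k))
```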
Authors: Songlin Yang (Non-IBM), Yikang Shen (IBM), Kaiyue Wen (Non-IBM), Mayank Mishra (Non-IBM), Liliang Ren (Non-IBM), and others
- Description:
At the core of the Transformer, the softmax normalizes the attention matrix to be right stochastic. Previous research has shown that this often de-stabilizes training and that enforcing the attention matrix to be doubly stochastic (through Sinkhorn's algorithm) consistently improves performance across different tasks, domains, and Transformer flavors. However, Sinkhorn's algorithm is iterative, approximative, non-parametric, and thus inflexible w.r.t. the obtained doubly stochastic matrix (DSM). Recently, it has been proven that DSMs can be obtained with a parametric quantum circuit, yielding a novel quantum inductive bias for DSMs with no known classical analogue. Motivated by this, we demonstrate the feasibility of a hybrid classical-quantum doubly stochastic Transformer (QDSFormer) that replaces the softmax in the self-attention layer with a variational quantum circuit. We study the expressive power of the circuit and find that it yields more diverse DSMs that better preserve information than classical operators. Across multiple small-scale object recognition tasks, we find that our QDSFormer consistently surpasses both a standard Vision Transformer and other doubly stochastic Transformers. Beyond the Sinkformer, this comparison includes a novel quantum-inspired doubly stochastic Transformer (based on QR decomposition) that can be of independent interest. The QDSFormer also shows improved training stability and lower performance variation, suggesting that it may mitigate the notoriously unstable training of ViTs on small-scale data.
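For context, the classical Sinkhorn baseline referenced above alternates row and column normalizations to approximate a doubly stochastic matrix; the QDSFormer instead produces one with a parametric quantum circuit.

```python
import numpy as np

# Classical baseline referenced above: Sinkhorn normalization turns attention
# logits into an (approximately) doubly stochastic matrix by alternating
# row and column normalizations.

def sinkhorn(logits, n_iters=20):
    K = np.exp(logits - logits.max())  # positive matrix, numerically stable
    for _ in range(n_iters):
        K /= K.sum(axis=1, keepdims=True)  # rows sum to 1
        K /= K.sum(axis=0, keepdims=True)  # columns sum to 1
    return K

A = sinkhorn(np.random.default_rng(0).normal(size=(5, 5)))
print(A.sum(axis=1), A.sum(axis=0))  # both close to 1: doubly stochastic
# Note the procedure is iterative and non-parametric, which is exactly the
# inflexibility the quantum-circuit parametrization aims to avoid.
```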
Authors: Jannis Born (IBM), Filip Skogh (IBM), Kahn Rhrissorrakrai (IBM), Filippo Utro (IBM), Nico Wagner (IBM), Aleksandros Sobczyk (IBM)
- Description:
Visit us at the IBM booth in the exhibit hall to talk to our researchers and recruiters. We'll also be doing demos of our work.
- Booth demo and staff schedule (by time)
- Booth demo list (by title)
- Description:
LLM routing aims to select the most appropriate model for each query, balancing competing performance metrics such as accuracy and cost across a pool of language models. Prior approaches typically adopt a decoupled strategy, where the metrics are first predicted and the model is then selected based on these estimates. This setup is prone to compounding errors and often relies on full-feedback data, where each query is evaluated by all candidate models, which is costly to obtain and maintain in practice. In contrast, we learn from observational data, which records only the outcome of the model actually deployed. We propose a causal end-to-end framework that learns routing policies by minimizing decision-making regret from observational data. To enable efficient optimization, we introduce two theoretically grounded surrogate objectives: a classification-based upper bound, and a softmax-weighted regret approximation shown to recover the optimal policy at convergence. We further extend our framework to handle heterogeneous cost preferences via an interval-conditioned architecture. Experiments on public benchmarks show that our method outperforms existing baselines, achieving state-of-the-art performance across different embedding models.
Authors: Asterios Tsiourvas (Non-IBM), Georgia Perakis (Non-IBM), and others
- Description:
We introduce CameraBench, a large-scale dataset and benchmark designed to assess and improve camera motion understanding. CameraBench consists of ~3,000 diverse internet videos, annotated by experts through a rigorous multi-stage quality control process. One of our core contributions is a taxonomy or "language" of camera motion primitives, designed in collaboration with cinematographers. We find, for example, that some motions like "follow" (or tracking) require understanding scene content like moving subjects. We conduct a large-scale human study to quantify human performance, revealing that domain expertise and tutorial-based training can significantly enhance accuracy. For example, a novice may confuse zoom-in (a change of intrinsics) with translating forward (a change of extrinsics), but can be trained to differentiate the two. Using CameraBench, we evaluate Structure-from-Motion (SfM) and Video-Language Models (VLMs), finding that SfM models struggle to capture semantic primitives that depend on scene content, while generative VLMs struggle to capture geometric primitives that require precise estimation of trajectories. We then fine-tune a generative VLM on CameraBench to achieve the best of both worlds and showcase its applications, including motion-augmented captioning, video question answering, and video-text retrieval. We hope our taxonomy, benchmark, and tutorials will drive future efforts towards the ultimate goal of understanding camera motions in any video.
Authors: Zhiqiu Lin (Non-IBM), Siyuan Cen (Non-IBM), Daniel Jiang (Non-IBM), Jay Karhade (Non-IBM), Hewei Wang (Non-IBM), Chancharik Mitra (Non-IBM), and others
- Description:
Although Large Language Models (LLMs) have demonstrated impressive capabilities across a wide range of tasks, growing concerns have emerged over the misuse of sensitive, copyrighted, or harmful data during training. To address these concerns, unlearning techniques have been developed to remove the influence of specific data without retraining from scratch. However, this paper reveals a critical vulnerability in fine-tuning-based unlearning: a malicious user can craft a manipulated forgetting request that stealthily degrades the model’s utility for benign users. We demonstrate this risk through a red-teaming Stealthy Attack (SA), which is inspired by two key limitations of existing unlearning—the inability to constrain the scope of unlearning effect and the failure to distinguish benign tokens from unlearning signals. Prior work has shown that unlearned models tend to memorize forgetting data as unlearning signals, and respond with hallucinations or feigned ignorance when unlearning signals appear in the input. By subtly increasing the presence of common benign tokens in the forgetting data, SA enhances the connection between benign tokens and unlearning signals. As a result, when normal users include such tokens in their prompts, the model exhibits unlearning behaviors, leading to unintended utility degradation. To address this vulnerability, we propose Scope-aware Unlearning (SU), a lightweight enhancement that introduces a scope term into the unlearning objective, encouraging the model to localize the forgetting effect. Our method requires no additional data processing, integrates seamlessly with existing fine-tuning frameworks, and significantly improves robustness against SA. Extensive experiments validate the effectiveness of both SA and SU.
Authors: Jie Ren (Non-IBM), Zhenwei Dai (Non-IBM), Xianfeng Tang (Non-IBM), Yue Xing (Non-IBM), Shenglai Zeng (Non-IBM), Hui Liu (Non-IBM), and others
- Description:
Open-weight large language model (LLM) zoos provide access to numerous high-quality models, but selecting the appropriate model for specific tasks remains challenging and requires technical expertise. Most users simply want factually correct, safe, and satisfying responses without concerning themselves with model technicalities, while inference service providers prioritize minimizing operating costs. These competing interests are typically mediated through service level agreements (SLAs) that guarantee minimum service quality. We introduce MESS+, a stochastic optimization algorithm for cost-optimal LLM request routing while providing rigorous SLA compliance guarantees. MESS+ learns request satisfaction probabilities of LLMs in real-time as users interact with the system, based on which model selection decisions are made by solving a per-request optimization problem. Our algorithm includes a novel combination of virtual queues and request satisfaction prediction, along with a theoretical analysis of cost optimality and constraint satisfaction. Across a wide range of state-of-the-art LLM benchmarks, MESS+ achieves an average of 2x cost savings compared to existing LLM routing techniques.
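The virtual-queue mechanic can be illustrated with a toy router; the exact MESS+ objective and update differ in detail, and all numbers below are made up for illustration.

```python
import numpy as np

# Toy illustration of virtual-queue routing in the spirit described above.
# Q grows while realized quality falls short of the SLA target, pushing
# later requests toward the better (costlier) model.

costs = np.array([1.0, 4.0])       # cheap model vs. strong model
p_satisfy = np.array([0.7, 0.95])  # learned satisfaction estimates
sla, V, Q = 0.9, 0.05, 0.0         # SLA target, cost weight, virtual queue

for step in range(8):
    m = int(np.argmin(V * costs - Q * p_satisfy))  # per-request decision
    satisfied = p_satisfy[m]        # stand-in for observed user feedback
    Q = max(0.0, Q + sla - satisfied)  # queue accumulates SLA debt
    print(step, m, round(Q, 3))
# The router starts cheap, accumulates SLA debt, then mixes in the strong
# model so that long-run quality tracks the 0.9 target.
```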
Authors: Herbert Woisetschläger (Non-IBM), Ryan Zhang (Non-IBM), Shiqiang Wang (IBM), Hans-Arno Jacobsen (Non-IBM)
- Description:
Large language models (LLMs) are powerful tools capable of handling diverse tasks. However, their evaluation remains challenging due to the vast and often confusing range of available benchmarks. This complexity not only increases the risk of benchmark misuse and misinterpretation but also demands substantial effort from LLM users, including researchers, practitioners, and non-AI companies, seeking the most suitable benchmarks for their specific needs. To address these issues, we introduce BenchmarkCards, an intuitive and validated documentation framework that systematically captures critical benchmark attributes such as objectives, methodologies, data sources, and limitations. Through user studies with benchmark creators and users, we show that BenchmarkCards can simplify benchmark selection and enhance transparency, facilitating more informed decision-making in evaluating LLMs.
Authors: Anna Sokol (Non-IBM), David Piorkowski (IBM), Xiangliang Zhang (Non-IBM), Nuno Moniz (Non-IBM), and others
- Description:
We introduce FailureSensorIQ, a novel Multi-Choice Question-Answering (MCQA) benchmarking system designed to assess the ability of Large Language Models (LLMs) to reason about and understand complex, domain-specific scenarios in Industry 4.0. Unlike traditional QA benchmarks, our system focuses on multiple aspects of reasoning through failure modes, sensor data, and the relationships between them across various industrial assets. Through this work, we envision a paradigm shift where modeling decisions are not only data-driven, using statistical tools like correlation analysis and significance tests, but also domain-driven, guided by specialized LLMs that can reason about the key contributors and useful patterns to capture with feature engineering. We evaluate the industrial knowledge of over a dozen LLMs, including GPT-4, Llama, and Mistral, on FailureSensorIQ through different lenses: perturbation-uncertainty-complexity analysis, an expert evaluation study, asset-specific knowledge-gap analysis, and a ReAct agent using external knowledge bases. Even though closed-source models with strong reasoning capabilities approach expert-level performance, the comprehensive benchmark reveals performance that is fragile to perturbations and distractions and exposes inherent knowledge gaps in the models. We also provide a real-world case study of how LLMs can drive modeling decisions on three failure prediction datasets related to various assets. We release: (a) expert-curated MCQA for various industrial assets, (b) the FailureSensorIQ benchmark and Hugging Face leaderboard based on MCQA built from non-textual data found in ISO documents, and (c) LLMFeatureSelector, an LLM-based feature selection scikit-learn pipeline. The software is available at https://github.com/IBM/FailureSensorIQ.
Authors: Christodoulos Constantinides (IBM), Dhaval Patel (IBM), Shuxin Lin (IBM), Claudio Humberto Guerrero Estevez (IBM), Sunil Dagajirao Patil (IBM), Jayant Kalagnanam (IBM)
- Description:
Causal discovery in the form of a directed acyclic graph (DAG) for dynamic time series data has been widely studied in various applications. In this work, we propose a dynamic DAG discovery algorithm based on online meta-learning, designed to learn dynamic DAG structures from potentially nonlinear and non-stationary time series datasets, accounting for changes in both parameters and graph structures. Unlike most existing work, which focuses on observational, offline, and/or stationary settings, our method explicitly treats data collected at different time points with distribution shifts as distinct domains, assumed to arise from external interventions. Moreover, it involves a new online meta-learning framework that takes advantage of the temporal transition among existing domains so that it can quickly adapt to new domains with few measurements. A first-order optimization approach is utilized to efficiently solve the meta-learning framework, and theoretical analysis establishes the identifiability conditions and the convergence of the learning process. We demonstrate the promising performance of the proposed meta-learning framework through better accuracy on benchmark datasets against state-of-the-art baselines.
Authors: Tian Gao (IBM), Songtao Lu (Non-IBM), Elliot Nelson (Non-IBM), Debarun Bhattacharjya (IBM), Yue Yu (Non-IBM), and others
- Description:
As the economic and environmental costs of training and deploying large vision or language models increase dramatically, analog in-memory computing (AIMC) emerges as a promising energy-efficient solution. However, the training perspective, especially the training dynamics, is underexplored. In AIMC hardware, the trainable weights are represented by the conductance of resistive elements and updated using consecutive electrical pulses. While the conductance ideally changes by a constant amount in response to each pulse, in reality the change is scaled by asymmetric and non-linear response functions, leading to a non-ideal training dynamic. This paper provides a theoretical foundation for gradient-based training on AIMC hardware with non-ideal response functions. We demonstrate that asymmetric response functions negatively impact Analog SGD by imposing an implicit penalty on the objective. To overcome the issue, we propose a residual learning algorithm, which provably converges exactly to a critical point by solving a bilevel optimization problem. We show that the proposed method can be extended to deal with other hardware imperfections such as limited response granularity. To our knowledge, this is the first paper to investigate the impact of a class of generic non-ideal response functions. The conclusion is supported by simulations validating our theoretical insights.
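A toy simulation of the non-ideality: with a linear asymmetric response, noisy Analog SGD settles below the true minimizer, consistent with the implicit-penalty view. The response shape and constants below are illustrative assumptions, not the paper's hardware model.

```python
import numpy as np

# Toy simulation of asymmetric pulse updates as described above: upward and
# downward conductance changes are scaled differently, biasing the solution.

def analog_update(w, grad, lr=0.05, tau=1.0):
    step = -lr * grad
    # Asymmetric response: up-pulses shrink near +tau, down-pulses near -tau.
    scale = (1 - w / tau) if step > 0 else (1 + w / tau)
    return w + step * scale

rng = np.random.default_rng(0)
w_target, w, tail = 0.8, 0.0, []
for t in range(20000):
    grad = (w - w_target) + rng.normal(scale=0.5)  # noisy gradient
    w = analog_update(w, grad)
    if t >= 10000:
        tail.append(w)
print(np.mean(tail))  # noticeably below 0.8: asymmetry acts as a penalty
```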
Authors: Zhaoxian Wu (Non-IBM), Quan Xiao (Non-IBM), Tayfun Gokmen (IBM), Omobayode Fagbohungbe (IBM), Tianyi Chen (Non-IBM)
- Description:
SDForger is a flexible and efficient framework for generating high-quality multivariate time series using LLMs. Leveraging a compact data representation, SDForger provides synthetic time series generation from a few samples and low-computation fine-tuning of any autoregressive LLM. Specifically, the framework transforms univariate and multivariate signals into tabular embeddings, which are then encoded into text and used to fine-tune the LLM. At inference, new textual embeddings are sampled and decoded into synthetic time series that retain the original data's statistical properties and temporal dynamics. Across a diverse range of datasets, SDForger outperforms existing generative models in many scenarios, both in similarity-based evaluations and downstream forecasting tasks. By enabling textual conditioning in the generation process, SDForger paves the way for multimodal modeling and the streamlined integration of time series with textual information. The model is open-sourced at https://github.com/IBM/fms-dgt/tree/main/fms_dgt/public/databuilders/time_series.
Authors: Giandomenico Cornacchia (IBM), and others
- Description:
Recent studies show that LLMs possess different skills and specialize in different tasks. In fact, we observe that their varied performance occurs at several levels of granularity. For example, in the code optimization task, code LLMs excel at different optimization categories, and no single model dominates the others. This observation prompts the question of how one leverages multiple LLM agents to solve a coding problem without knowing their complementary strengths a priori. We argue that a team of agents can learn from each other's successes and failures so as to improve their own performance. Thus, a lesson is the knowledge produced by an agent and passed on to other agents in the collective solution process. We propose a lesson-based collaboration framework, design the lesson solicitation, banking, and selection mechanism, and demonstrate that a team of small LLMs with lessons learned can outperform a much larger LLM and other multi-LLM collaboration methods.
Authors: Yuanzhe Liu (IBM), Ryan Deng (IBM), Tim Kaler (Non-IBM), Xuhao Chen (Non-IBM), Charles Leiserson (Non-IBM), Yao Ma (Non-IBM), and others
- Description:
Concept-based models are an emerging paradigm in deep learning that constrains the inference process to operate through human-interpretable variables, facilitating explainability and human interaction. However, these architectures, on par with popular opaque neural models, fail to account for the true causal mechanisms underlying the target phenomena represented in the data. This hampers their ability to support causal reasoning tasks, limits out-of-distribution generalization, and hinders the implementation of fairness constraints. To overcome these issues, we propose Causally reliable Concept Bottleneck Models (CBMs), a class of concept-based architectures that enforce reasoning through a bottleneck of concepts structured according to a model of the real-world causal mechanisms. We also introduce a pipeline to automatically learn this structure from observational data and unstructured background knowledge (e.g., scientific literature). Experimental evidence suggests that these CBMs are more interpretable, causally reliable, and improve responsiveness to interventions w.r.t. standard opaque and concept-based models, while maintaining their accuracy.
Authors: Giovanni De Felice (non-IBM), Arianna Casanova Flores (non-IBM), Francesco De Santis (non-IBM), Silvia Santini (non-IBM), Johannes Schneider (non-IBM), and 1 more - Description:
Various forms of sparse attention have been explored to mitigate the quadratic computational and memory cost of the attention mechanism in transformers. We study sparse transformers not through a lens of efficiency but rather in terms of learnability and generalization. Empirically studying a range of attention mechanisms, we find that input-dependent sparse attention models appear to converge faster and generalize better than standard attention models, while input-agnostic sparse attention models show no such benefits -- a phenomenon that is robust across architectural and optimization hyperparameter choices. This can be interpreted as demonstrating that concentrating a model's "semantic focus" with respect to the tokens currently being considered (in the form of input-dependent sparse attention) accelerates learning. We develop a theoretical characterization of the conditions that explain this behavior. We establish a connection between the stability of the standard softmax and the loss function's Lipschitz properties, then show how sparsity affects the stability of the softmax and the subsequent convergence and generalization guarantees resulting from the attention mechanism. This allows us to theoretically establish that input-agnostic sparse attention does not provide any benefits. We also characterize conditions when semantic focus (input-dependent sparse attention) can provide improved guarantees, and we validate that these conditions are in fact met in our empirical evaluations.
Authors: Shashanka Ubaru (IBM), Alexander Gray (non-IBM), and co-authors - Description:
Modern state-space models (SSMs) often utilize transition matrices which enable efficient computation but pose restrictions on the model's expressivity, as measured in terms of the ability to emulate finite-state automata (FSA). While unstructured transition matrices are optimal in terms of expressivity, they come at a prohibitively high compute and memory cost even for moderate state sizes. We propose a structured sparse parametrization of transition matrices in SSMs that enables FSA state tracking with optimal state size and depth, while keeping the computational cost of the recurrence comparable to that of diagonal SSMs. Our method, PD-SSM, parametrizes the transition matrix as the product of a column one-hot matrix (P) and a complex-valued diagonal matrix (D). Consequently, the computational cost of parallel scans scales linearly with the state size. Theoretically, the model is BIBO-stable and can emulate any N-state FSA with one layer of dimension N and a linear readout of size N \times N, significantly improving on all current structured SSM guarantees. Experimentally, the model significantly outperforms a wide collection of modern SSM variants on various FSA state tracking tasks. On multiclass time-series classification, the performance is comparable to that of neural controlled differential equations, a paradigm explicitly built for time-series analysis. Finally, we integrate PD-SSM into a hybrid Transformer-SSM architecture and demonstrate that the model can effectively track the states of a complex FSA in which transitions are encoded as a set of variable-length English sentences. The code is available at https://github.com/IBM/expressive-sparse-state-space-model
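As a rough illustration of the PD parametrization described above (our own construction from the abstract, not the released code linked in the entry; names and shapes are illustrative), one recurrence step might look like:

```python
import numpy as np

# Sketch: transition matrix T = P @ D, with P a column one-hot matrix and
# D a complex-valued diagonal matrix, as the abstract describes.
N = 4
rng = np.random.default_rng(0)

cols = rng.integers(0, N, size=N)                # row index of the 1 in each column
P = np.zeros((N, N))
P[cols, np.arange(N)] = 1.0                      # exactly one nonzero per column
D = np.exp(1j * rng.uniform(0.0, 2 * np.pi, N))  # unit-modulus diagonal entries

x = rng.standard_normal(N) + 0j                  # current hidden state
u = rng.standard_normal(N) + 0j                  # input term (placeholder)

# One recurrence step x' = P D x + u. Applying D is elementwise (O(N)) and
# applying P is a scatter of N values, so the step avoids a dense O(N^2) matmul.
x_next = P @ (D * x) + u
print(x_next)
```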
Authors: Nicolas Menet (IBM), Thomas Hofmann (non-IBM), and co-authors - Description:
Recent methods for aligning large language models (LLMs) with human feedback predominantly rely on a single reference model, which limits diversity, encourages overfitting, and underutilizes the wide range of available pre-trained models. Incorporating multiple reference models has the potential to address these limitations by broadening perspectives, reducing bias, and leveraging the strengths of diverse open-source LLMs. However, integrating multiple reference models into reinforcement learning with human feedback (RLHF) frameworks poses significant theoretical challenges, where achieving exact solutions has remained an open problem. This paper presents the first exact solution to the multiple reference model problem in reverse KL-regularized RLHF. We introduce a comprehensive theoretical framework that includes rigorous statistical analysis and provides sample complexity guarantees. Additionally, we extend our analysis to forward KL-regularized RLHF, offering new insights into sample complexity requirements in multiple reference scenarios. Our contributions lay the foundation for more advanced and adaptable LLM alignment techniques, enabling the effective use of multiple reference models. This work paves the way for developing alignment frameworks that are both theoretically sound and better suited to the challenges of modern AI ecosystems.
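For intuition, one natural way to write a reverse KL-regularized RLHF objective with $K$ reference policies is the weighted-sum form below; this is our notation and a plausible formulation, not necessarily the paper's exact one:

$$\max_{\pi}\; \mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi(\cdot\mid x)}\big[r(x,y)\big] \;-\; \beta \sum_{k=1}^{K} \alpha_k\, \mathrm{KL}\big(\pi(\cdot\mid x)\,\big\|\,\pi_{\mathrm{ref}}^{(k)}(\cdot\mid x)\big), \qquad \alpha_k \ge 0,\ \ \sum_k \alpha_k = 1.$$

Setting $K=1$ recovers standard single-reference RLHF.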
Authors: Gholamali Aminian (non-IBM), Amir R. Asadi (non-IBM), Idan Shenfeld (non-IBM), and co-authors - Description:
This talk will focus on designing and evaluating agentic benchmarks with a strong emphasis on in-domain evaluation and real-world task reliability. Drawing from the development of AssetOpsBench, we’ll discuss practical considerations for measuring agent behavior, task completion quality, and decision robustness. The session will highlight what works, what doesn’t, and what matters most when building benchmarks for agent-based systems.
Authors: Dhaval Patel (IBM) - Description:
Lightning talk by Dhaval Patel at social event (https://neurips.cc/virtual/2025/loc/san-diego/social/129335)
- Description:
Visit us at the IBM booth in the exhibit hall to talk to our researchers and recruiters. We'll also be doing demos of our work.
- Booth demo and staff schedule (by time)
- Booth demo list (by title)
- Description:
Large language models (LLMs) have achieved significant performance gains via scaling up model sizes and/or data. However, recent evidence suggests diminishing returns from such approaches, motivating a pivot to scaling test-time compute. Existing deterministic inference-time scaling methods, usually with reward models, cast the task as a search problem, but suffer from a key limitation: early pruning. Due to inherently imperfect reward models, promising trajectories may be discarded prematurely, leading to suboptimal performance. We propose a novel inference-time scaling approach by adapting particle-based Monte Carlo methods. Our method maintains a diverse set of candidates and robustly balances exploration and exploitation. Our empirical evaluation demonstrates that our particle filtering methods have a 4--16x better scaling rate than deterministic search counterparts on various challenging mathematical and more general reasoning tasks. Using our approach, we show that Qwen2.5-Math-1.5B-Instruct surpasses GPT-4o accuracy in only 4 rollouts, while Qwen2.5-Math-7B-Instruct scales to o1-level accuracy in only 32 rollouts. Our work not only presents an effective method for inference-time scaling, but also connects the rich literature in probabilistic inference with inference-time scaling of LLMs to develop more robust algorithms in future work.
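A minimal sketch of the particle-filtering idea, assuming caller-supplied expand (one LLM reasoning step) and reward (an imperfect process-reward score) stubs; this is our reading of the abstract, not the authors' implementation:

```python
import random

def particle_filter_decode(expand, reward, n_particles=8, n_steps=6):
    """Propagate, weight, and resample partial reasoning trajectories.
    `expand(traj)` appends one reasoning step; `reward(traj)` returns a
    positive (imperfect) process-reward score for a partial trajectory."""
    particles = [""] * n_particles
    for _ in range(n_steps):
        particles = [expand(p) for p in particles]            # propagate
        weights = [max(reward(p), 1e-6) for p in particles]   # weight by PRM
        # Resample with replacement: weak-but-promising trajectories can
        # survive, softening the early-pruning failure of beam-style search.
        particles = random.choices(particles, weights=weights, k=n_particles)
    return max(particles, key=reward)

# Toy usage with stubs standing in for an LLM and a PRM:
best = particle_filter_decode(expand=lambda t: t + "x",
                              reward=lambda t: len(t) or 1)
```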
Authors: Isha Puri (non-IBM), Guangxuan Xu (non-IBM), Kai Xu (non-IBM), and co-authors - Description:
We study the problem of estimating the mean reward of the best arm in a multi-armed bandit (MAB) setting. Specifically, given a target precision $\epsilon$ and confidence level $\delta$, the goal is to return an $\epsilon$-accurate estimate of the largest mean reward with probability at least $1 - \delta$, while minimizing the number of samples. We propose an algorithm that features a novel stopping condition based on a confidence ellipsoid and a two-phase sampling strategy. The algorithm is simple, nearly free of hyperparameters, and achieves the asymptotically optimal sample complexity of $2 R^2 f(\mu) \log(1/\delta)$, where $R$ is the parameter of the sub-Gaussian reward distributions and $f(\mu)$ is the value of an instance-specific optimization problem. In contrast to the widely used Track-and-Stop algorithm, which requires solving a non-convex optimization problem at every round, our approach solves a single convex optimization problem just once. We also establish a matching lower bound, proving that no algorithm can asymptotically outperform this rate. Our analysis introduces several new techniques to address the challenge posed by the non-convexity of the characterizing optimization problem. Experimental results support our theoretical guarantees and demonstrate the practical effectiveness of the proposed method.
Authors: - Description:
Machine learning (ML) algorithms deployed in real-world environments are often faced with the challenge of adapting models to concept drift, where the task data distributions are shifting over time. The problem becomes even more difficult when model performance must be maintained under adherence to strict resource constraints. Existing solutions often depend on drift-detection methods that produce high computational overhead for resource-constrained environments, and fail to provide strict guarantees on resource usage or theoretical performance assurances. To address these shortcomings, we propose RCCDA: a dynamic model update policy that optimizes ML training dynamics while ensuring strict compliance to predefined resource constraints, utilizing only past loss information and a tunable drift threshold. In developing our policy, we analytically characterize the evolution of model loss under concept drift with arbitrary training update decisions. Integrating these results into a Lyapunov drift-plus-penalty framework produces a lightweight policy based on a measurable accumulated loss threshold that provably limits update frequency and cost. Experimental results on three domain generalization datasets demonstrate that our policy outperforms baseline methods in inference accuracy while adhering to strict resource constraints under several schedules of concept drift, making our solution uniquely suited for real-time ML deployments.
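The update rule the abstract describes, triggering retraining only when accumulated loss crosses a tunable drift threshold, can be sketched as follows (parameter names are ours, not the paper's):

```python
class ThresholdUpdatePolicy:
    """Sketch of a drift-plus-penalty style update rule: accumulate observed
    loss and trigger a (costly) model update only past a tunable threshold,
    which bounds update frequency and hence resource usage."""

    def __init__(self, drift_threshold: float):
        self.drift_threshold = drift_threshold
        self.accumulated_loss = 0.0

    def step(self, observed_loss: float) -> bool:
        self.accumulated_loss += observed_loss
        if self.accumulated_loss > self.drift_threshold:
            self.accumulated_loss = 0.0   # reset after paying the update cost
            return True                   # run a training update this round
        return False                      # skip training, stay within budget
```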
Authors: Adam Piaseczny (non-IBM), Kamran Shisher (non-IBM), Shiqiang Wang (IBM), Christopher Brinton (non-IBM) - Description:
Federated Learning (FL) has emerged as a privacy-preserving framework for training models on data generated at the edge. However, the heterogeneity of data silos (e.g., label skew and domain shift) often leads to inconsistent learning objectives and suboptimal model performance. Inspired by the data-driven approach, we propose Flick, a novel data generation framework for heterogeneous Federated Learning with Commonsense Knowledge from Large Language Models (LLMs). In Flick, the client performs the local data summary to capture client-specific knowledge in textual form. The central server then distills task-relevant, high-quality knowledge from the out-of-the-box LLM -- guided by cross-client-specific insights -- to generate informative text prompts. These prompts direct a generative model in producing synthetic data, enabling global model fine-tuning and local data compensation. This process gradually aligns the label and feature distributions across clients. Extensive results demonstrate that Flick improves the global model accuracy by up to 11.43%, and accelerates convergence by up to 12.9x, validating its effectiveness in addressing data heterogeneity.
Authors: Ran Zhu (non-IBM), Mingkun Yang (non-IBM), Shiqiang Wang (IBM), Jie Yang (non-IBM), Qing Wang (non-IBM) - Description:
The need for training multilingual multi-task speech processing (MSP) models that perform both automatic speech recognition and speech-to-text translation is increasingly evident. However, a significant challenge arises from the conflicts among multiple objectives when using a single model. Multi-objective optimization can address this challenge by facilitating the optimization of multiple conflicting objectives and aligning the gradient updates in a common descent direction. While multi-objective optimization helps avoid conflicting gradient updates, a critical issue is that when there are many objectives, such as in MSP, it is often \emph{difficult to find} a common descent direction. This leads to an important question: Is it more effective to separate highly conflicting objectives into different optimization levels or to keep them in a single level? To address this question, this paper investigates three multi-objective MSP formulations, which we refer to as \textbf{objective soup recipes}. These formulations apply multi-objective optimization at different optimization levels to mitigate potential conflicts among all objectives. To keep computation and memory overhead low, we incorporate a lightweight layer-selection strategy that detects the most conflicting layers and uses only their gradients when computing the conflict-avoidance direction. We conduct an extensive investigation using the CoVoST v2 dataset for combined multilingual ASR and ST tasks, along with the LibriSpeech and AISHELL-1 datasets for multilingual ASR, to identify highly conflicting objectives and determine the most effective training recipe among the three proposed multi-objective optimization algorithms.
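For readers unfamiliar with conflict-avoidance directions, the sketch below shows one standard construction (a PCGrad-style pairwise projection); the paper's three objective-soup recipes and layer-selection strategy may differ substantially:

```python
import torch

def conflict_avoiding_direction(grads):
    """Compute a shared update direction from per-objective gradients.
    `grads` is a list of flattened 1-D gradient tensors, one per objective.
    Conflicting components (negative inner products) are projected out."""
    adjusted = []
    for i, g in enumerate(grads):
        g = g.clone()
        for j, h in enumerate(grads):
            if i == j:
                continue
            dot = torch.dot(g, h)
            if dot < 0:                          # conflicting pair of objectives
                g -= dot / h.norm().pow(2) * h   # remove the conflicting part
        adjusted.append(g)
    return torch.stack(adjusted).mean(dim=0)     # common descent direction
```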
Authors: A Saif (non-IBM), Lisha Chen (non-IBM), Songtao Lu (IBM), Tianyi Chen (non-IBM), and co-authors - Description:
Large Language Models (LLMs) often struggle to maintain their original performance when faced with semantically coherent but task-irrelevant contextual information. Although prior studies have explored this issue using fixed-template or retrieval-based distractions, such static methods show limited effectiveness against contemporary models. To address this problem, we propose a dynamic distraction generation framework based on tree search, where the generation process is guided by model behavior. Without modifying the original question or answer, the method efficiently produces challenging adaptive distractions across multiple datasets, enabling systematic stress testing of LLMs’ contextual robustness. Experiments on four benchmarks demonstrate that the generated distractions lead to an average performance drop of over 45% for mainstream models. Further comparisons of mitigation strategies show that prompt-based optimization methods yield limited gains, whereas post-training approaches (e.g., DPO) significantly enhance the model's contextual robustness. The results indicate that these issues do not stem from knowledge deficits in LLMs, but from a fundamental inability to maintain consistent reasoning under contextual distraction, posing a major challenge to the reliability of LLMs in real-world applications.
Authors: Yanbo Wang (non-IBM), Zixiang Xu (non-IBM), Yue Huang (non-IBM), Chujie Gao (non-IBM), Siyuan Wu (non-IBM), Jiayi Ye (non-IBM), and 3 more
- Description:
Recent advances in Large Language Models (LLMs) have spurred transformative applications in various domains, ranging from open-source to proprietary LLMs. However, jailbreak attacks, which aim to break safety alignment and user compliance by tricking the target LLMs into answering harmful and risky responses, are becoming an urgent concern. The practice of red-teaming for LLMs is to proactively explore potential risks and error-prone instances before the release of frontier AI technology. This paper proposes an agentic workflow to automate and scale the red-teaming process of LLMs through the Composition-of-Principles (CoP) framework, where human users provide a set of red-teaming principles as instructions to an AI agent to automatically orchestrate effective red-teaming strategies and generate jailbreak prompts. Distinct from existing red-teaming methods, our CoP framework provides a unified and extensible framework to encompass and orchestrate human-provided red-teaming principles to enable the automated discovery of new red-teaming strategies. When tested against leading LLMs, CoP reveals unprecedented safety risks by finding novel jailbreak prompts and improving the best-known single-turn attack success rate by up to 19.0 times.
Authors: Chen Xiong (non-IBM), Tsung-yi Ho (non-IBM), and IBM co-authors - Description:
Process reward models (PRMs) play a central role in guiding inference-time scaling algorithms for large language models (LLMs). However, we observe that even state-of-the-art PRMs can be poorly calibrated. Specifically, they tend to overestimate the probability that a partial reasoning step will lead to a correct final answer, particularly when smaller LLMs are used to complete the reasoning trajectory. To address this, we present a calibration approach, performed via quantile regression, that adjusts PRM outputs to better align with true success probabilities. Leveraging these calibrated success estimates and their associated confidence bounds, we introduce an instance-adaptive scaling (IAS) framework that dynamically adjusts the compute budget based on the estimated likelihood that a partial reasoning trajectory will yield a correct final answer. Unlike conventional methods that allocate a fixed number of reasoning trajectories per query, this approach adapts to each instance and reasoning step when using our calibrated PRMs. Experiments on mathematical reasoning benchmarks show that (i) our PRM calibration method achieves small calibration error, outperforming the baseline methods, (ii) calibration is crucial for enabling effective IAS, and (iii) the proposed IAS strategy reduces inference costs while maintaining final answer accuracy, utilizing less compute on more confident problems as desired.
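A minimal sketch of quantile-regression calibration plus an instance-adaptive budget, using scikit-learn as an illustrative stand-in for the paper's estimator (data and parameter names are placeholders):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Placeholder training data: X = raw PRM score of a partial trajectory,
# y = 1 if a completed rollout reached the correct final answer, else 0.
rng = np.random.default_rng(0)
X = rng.random((1000, 1))
y = (rng.random(1000) < X[:, 0] ** 2).astype(float)  # toy "success" labels

# Fit lower/upper conditional quantiles of the success probability.
lower = GradientBoostingRegressor(loss="quantile", alpha=0.1).fit(X, y)
upper = GradientBoostingRegressor(loss="quantile", alpha=0.9).fit(X, y)

def rollout_budget(prm_score: float, max_rollouts: int = 32) -> int:
    """Instance-adaptive scaling: spend fewer rollouts when the calibrated
    lower confidence bound on success probability is already high."""
    lo = float(lower.predict([[prm_score]])[0])
    lo = max(0.0, min(1.0, lo))
    return max(1, int(max_rollouts * (1.0 - lo)))
```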
Authors: Young Jin Park; Kristjan Greenewald (IBM); Kaveh Alimohammadi; Hao Wang; Navid Azizan
- Description:
The rise of parallel computing hardware has made it increasingly important to understand which nonlinear state space models can be efficiently parallelized. Recent advances have shown that evaluating a state space model can be recast as solving a parallelizable optimization problem, and sometimes this approach yields dramatic speed-ups in evaluation time. However, the factors that govern the difficulty of these optimization problems remain unclear, limiting broader adoption of the technique. In this work, we establish a precise relationship between the dynamics of a nonlinear system and the conditioning of its corresponding optimization formulation. We show that the predictability of a system, defined as the degree to which small perturbations in state influence future behavior, directly governs the number of optimization steps required for evaluation. In predictable systems, the state trajectory can be computed in $O(\log T)$ parallel time, where $T$ is the sequence length, a major improvement over the conventional sequential approach. In contrast, chaotic or unpredictable systems exhibit poor conditioning, with the consequence that parallel evaluation converges too slowly to be useful. Importantly, our theoretical analysis demonstrates that for predictable systems, the optimization problem is always well-conditioned, whereas for unpredictable systems, the conditioning degrades exponentially as a function of the sequence length. We validate our claims through extensive experiments, providing practical guidance on when nonlinear dynamical systems can be efficiently parallelized, and highlighting predictability as a key design principle for parallelizable models.
Authors: Xavier Gonzalez (non-IBM), Leo Kozachkov (IBM), David Zoltowski (non-IBM), Scott Linderman (non-IBM), and co-authors - Description:
In cloud-scale systems, failures are the norm. A distributed computing cluster exhibits hundreds of machine failures and thousands of disk failures; software bugs and misconfigurations are reported to be more frequent. The demand for autonomous, AI-driven reliability engineering continues to grow, as existing human-in-the-loop practices can hardly keep up with the scale of modern clouds. This paper presents STRATUS, an LLM-based multi-agent system for realizing autonomous Site Reliability Engineering (SRE) of cloud services. STRATUS consists of multiple specialized agents (e.g., for failure detection, diagnosis, mitigation), organized in a state machine to assist system-level safety reasoning and enforcement. We formalize a key safety specification of agentic SRE systems like STRATUS, termed Transactional No-Regression (TNR), which enables safe exploration and iteration. We show that TNR can effectively improve autonomous failure mitigation. STRATUS significantly outperforms state-of-the-art SRE agents in terms of success rate on failure mitigation problems in AIOpsLab and ITBench (two SRE benchmark suites), by at least 1.5 times across various models. STRATUS shows a promising path toward practical deployment of agentic systems for cloud reliability.
Authors: Yinfang Chen (non-IBM), Jiaqi Pan (non-IBM), Jackson Clark (non-IBM), Yiming Su (non-IBM), Noah Zheutlin (IBM), and 4 more - Description:
Analog in-memory computing (AIMC) is a promising compute paradigm to improve speed and power efficiency of neural network inference beyond the limits of conventional von Neumann-based architectures. However, AIMC introduces fundamental challenges such as noisy computations and strict constraints on input and output quantization. Because of these constraints and imprecisions, off-the-shelf LLMs are not able to achieve 4-bit-level performance when deployed on AIMC-based hardware. While researchers previously investigated recovering this accuracy gap on small, mostly vision-based models, a generic method applicable to LLMs pre-trained on trillions of tokens does not yet exist. In this work, we introduce a general and scalable method to robustly adapt LLMs for execution on noisy, low-precision analog hardware. Our approach enables state-of-the-art models — including Phi-3-mini-4k-instruct and Llama-3.2-1B-Instruct — to retain performance comparable to 4-bit weight, 8-bit activation baselines, despite the presence of analog noise and quantization constraints. Additionally, we show that as a byproduct of our training methodology, analog foundation models can be quantized for inference on low-precision digital hardware. Finally, we show that our models also benefit from test-time compute scaling, showing better scaling behavior than models trained with 4-bit weight and 8-bit static input quantization. Our work bridges the gap between high-capacity LLMs and efficient analog hardware, offering a path toward energy-efficient foundation models. Code is available at github.com/IBM/analog-foundation-models.
Authors: Iason Chalas (IBM), Giovanni Acampa (IBM), An Chen (IBM), Omobayode Fagbohungbe (IBM), and 4 more - Description:
Reinforcement learning (RL) has demonstrated significant promise in enhancing the reasoning capabilities of Text2SQL LLMs, especially with advanced algorithms such as GRPO and DAPO. However, the performance of these methods is highly sensitive to the design of reward functions. Inappropriate rewards can lead to "reward hacking", where models exploit loopholes in the reward structure to achieve high scores without genuinely solving the task. This work considers a constrained RL framework for Text2SQL that incorporates natural and interpretable reward and constraint signals, while dynamically balancing trade-offs among them during the training. We establish the theoretical guarantees of our constrained RL framework and our numerical experiments on the well-known Text2SQL datasets substantiate the improvement of our approach over the state-of-the-art RL-trained LLMs.
Authors: Weiqin Chen (IBM), Michael Glass (IBM), Long Vu (IBM), Shankar Subramaniam (IBM), and 1 more - Description:
Predicting chemical hazard indicators for substances of concern (SoCs), such as their persistence, bioaccumulation, and toxicity (PBT), is a critical task in environmental science and chemical regulatory compliance. Existing approaches rely heavily on molecular structural representations such as SMILES, which are often unavailable in early-stage assessments, in legacy documentation, or are inadequate for structurally representing the diversity of compounds encountered for regulation tasks. This paper addresses the challenge of estimating PBT properties from partial, noisy, and unstructured natural language descriptions of SoCs, such as their physical appearance, melting point, industrial use, and other general characteristics. We propose a new framework that leverages the generalization capabilities of Large Language Models (LLMs) to infer PBT profiles from these textual descriptions. Our key contributions include the development of the first dataset of natural language descriptions paired with PBT hazard categories and a fine-tuned LLM pipeline capable of generating hazard assessments. Experimental results show that our approach achieves competitive performance compared to structure-based models, enabling early hazard screening in low- or incomplete-data scenarios.
Authors: - Description:
Large Language Models (LLMs) demonstrate substantial accuracy gains when augmented with reasoning modes such as chain-of-thought and inference-time scaling. However, reasoning also incurs significant costs in inference latency and token usage, with environmental and financial impacts, which are unnecessary for many simple prompts. We present a semantic router that classifies queries based on their reasoning requirements and selectively applies reasoning only when beneficial. Our approach achieves a 10.2 percentage point improvement in accuracy on the MMLU-Pro benchmark while reducing response latency by 47.1% and token consumption by 48.5% compared to direct inference with vLLM. These results demonstrate that semantic routing offers an effective mechanism for striking a balance between accuracy and efficiency in open-source LLM serving systems.
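A toy sketch of the routing idea, assuming a simple trained classifier over query text; this illustrates the mechanism, not the project's actual router:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: 1 = route to reasoning mode, 0 = answer directly.
queries = ["what is the capital of france",
           "prove that the sum of two odd numbers is even",
           "convert 3 km to miles",
           "plan a 5-step strategy to debug a race condition"]
needs_reasoning = [0, 1, 0, 1]

router = make_pipeline(TfidfVectorizer(), LogisticRegression())
router.fit(queries, needs_reasoning)

def route(query: str) -> dict:
    """Selectively enable the expensive reasoning mode only when the
    classifier predicts it will help; field names are placeholders."""
    if router.predict([query])[0]:
        return {"mode": "reasoning", "max_tokens": 4096}
    return {"mode": "direct", "max_tokens": 512}
```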
Authors: Xunzhuo Liu (non-IBM), Yuhan Liu (non-IBM), Yue Zhu (IBM), Xiangxi Mo (non-IBM), Junchen Jiang (non-IBM), and 1 more - Description:
We study contextual bandits in high-dimensional combinatorial action spaces arising in structured constrained optimization problems, such as IT resource allocation and retail assortment pricing. Key quantities in the objective or constraints must be estimated from data during the trial sequence of actions. In [12], we propose a novel, practical, and transparent approach based on general-purpose regression oracles with Inverse Gap Weighting (IGW) for seamless integration within an optimization framework. IGW sampling is efficiently managed by: (a) a column-generation reformulation of the underlying Mixed Integer Programming (MIP) model which allows for flexible lower-level predictors, causal coherence, and efficient representation of large action spaces; (b) a diverse solution pool generation to balance the exploration-exploitation trade-off in large-action spaces. To address non-smooth rewards induced by constraints, we introduce a risk-averse phased learning strategy. Experiments on an IT auto-scaling task demonstrate substantial reductions in cumulative regret, with added gains from risk-averse methods that effectively manage constraint violations. This submission summarizes [12] and sketches our extensions underway as we seek a full theoretical regret analysis.
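For context, Inverse Gap Weighting in its usual textbook form (the paper embeds it in the column-generation MIP pipeline described above) samples actions as

$$p(a) \;=\; \frac{1}{A + \gamma\big(\hat f(x, \hat a) - \hat f(x, a)\big)} \quad \text{for } a \neq \hat a, \qquad p(\hat a) \;=\; 1 - \sum_{a \neq \hat a} p(a),$$

where $\hat a$ is the action with the highest predicted reward among the $A$ candidates, $\hat f$ is the regression oracle, and $\gamma$ controls the exploration-exploitation trade-off.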
Authors: - Description:
Learning from multi-variate time-series with heterogeneous channel configurations remains a fundamental challenge for deep neural networks, particularly in clinical domains such as intracranial electroencephalography (iEEG), where channel setups vary widely across subjects. In this work, we introduce multi-variate parallel attention (MVPA), a novel self-attention mechanism that disentangles content, temporal, and spatial attention, enabling flexible, generalizable, and efficient modeling of time-series data with varying channel counts and configurations. We use MVPA to build MVPFormer, a generative foundation model for human electrophysiology, trained to predict the evolution of iEEG signals across diverse subjects. To support this and future efforts by the community, we release the Long-term iEEG dataset, the largest publicly available iEEG dataset to date, comprising nearly 10,000 hours of recordings from heterogeneous clinical sources. MVPFormer leverages MVPA to achieve strong generalization across subjects, demonstrating expert-level performance in several iEEG tasks. MVPFormer surpasses state-of-the-art (SOTA) Transformer baselines in seizure detection across the Long-term, MAYO, and FNUSA datasets, while also achieving SOTA performance on four Brain TreeBank iEEG decoding tasks (volume, pitch, onset, and speech). Together, our contributions establish MVPFormer as the first open-source, open-weights, and open-data iEEG foundation model with SOTA clinical performance.
Authors: Francesco Carzaniga (IBM), Kaspar Schindler (non-IBM), and co-authors - Description:
Fine-tuning large language models (LLMs) on telecom datasets is a common practice to adapt general-purpose models to the telecom domain. However, little attention has been paid to how this process may compromise model safety. Recent research has shown that even benign fine-tuning can degrade the safety alignment of LLMs, causing them to respond to harmful or unethical user queries. In this paper, we investigate this issue for fine-tuning LLMs using three representative datasets featured by the GenAINet initiative, and show that safety degradation occurs even after fine-tuning with seemingly harmless telecom data. We further extend our analysis to publicly available TeleLLMs continually pre-trained on telecom corpora, revealing that safety alignment is often severely lacking, primarily due to the omission of safety-focused instruction tuning. To address these issues, we evaluate three safety realignment defenses (SafeInstruct, SafeLoRA, and SafeMERGE) using established red-teaming benchmarks. The results show that, across all settings, the proposed defenses can effectively restore safety without compromising downstream task performance, leading to Safe teleCOMMunication (SafeCOMM) models. Our work serves as a diagnostic study and practical guide for safety realignment in telecom-tuned LLMs, emphasizing the importance of safety-aware instruction and fine-tuning for real-world deployments of telecom LLMs.
Authors: Aladin Djuhera (non-IBM), Farhan Ahmed (IBM), Syed Zawad (IBM), Fernando Koch (non-IBM), Walid Saad (non-IBM), and 1 more - Description:
Vision-language models (VLMs) bring image and textual representations close together in a joint embedding space, which is useful for tagging and retrieval from content stores. However, such associations are not very stable: a synonymous textual query does not retrieve the same set of images, or even a set with a high degree of overlap. This is due to the absence of linkages between semantically related concepts in vision-language models. In contrast, the episodic memory store in the brain has linkages to the semantic conceptual memory subsystem, which helps in both the formation and recall of memories. In this paper, we exploit this paradigm to link a VLM to a semantic memory, thereby producing a new semantic vision-language model called SemCLIP. Specifically, we develop a semantic memory model for the language of object-naming nouns reflecting their semantic similarity. We then link a vision-language model to the semantic memory model through a semantic alignment transform. This leads to a richer and more stable understanding of the concepts by bringing synonymous visual concepts and their associated images closer. Both the semantic memory model and the alignment transform can be learned from word knowledge sources, thus avoiding large-scale retraining of VLMs from real-world image-text pairs. The resulting model is shown to outperform existing embedding models for semantic similarity and downstream retrieval tasks on multiple datasets.
Authors: Tanveer Syeda-Mahmood (non-IBM), Raziuddin Mahmood (non-IBM), Luyao Shi (IBM), Ashutosh Jadhav (IBM), and 2 more - Description:
Generative flow networks are able to sample, via sequential construction, high-reward, complex objects according to a reward function. However, such reward functions are often approximated from noisy data leading to epistemic uncertainty in the learnt policy. We present an approach to quantify this uncertainty by constructing a surrogate model composed of a polynomial chaos expansion, fit on a small ensemble of trained flow networks. This model learns the relationship between reward functions, parametrised in a low-dimensional space, and the probability distributions over actions at each step along a trajectory. The surrogate model can then be used for inexpensive Monte Carlo sampling to estimate the uncertainty in the policy given uncertain rewards. We illustrate the performance of our approach on a Bayesian structure learning task, and compare it to a basic multilayer perceptron.
Authors: Ramon Nartallo-Kaluarachchi (IBM), Shashanka Ubaru (IBM), Ben Huh (IBM), and co-authors - Description:
Transformer language models recently enabled molecular structure prediction directly from infrared (IR) spectra, yet have remained confined to pure compounds. We show that the same architecture learns the correlations embedded in binary mixture spectra and can retrieve the individual molecular components. Trained solely on gas-phase data, our model attains a Top–10 accuracy of 61.4% on balanced synthetic mixtures. When evaluated on 15 mixtures measured with Attenuated Total Reflectance (ATR) IR spectrometer, whose response differs markedly from the training domain, it still achieves 52.0% Top–10 accuracy, evidencing strong cross-instrument transferability. The ability to identify signals of individual molecules within complex spectra extends machine-learning-assisted spectroscopy from idealised samples to realistic laboratory scenarios. All code and pretrained weights are released to accelerate adoption and further development. This advance opens the door to automated structure elucidation using IR data in fields ranging from environmental monitoring to pharmaceutical quality control.
Authors: - Description:
We introduce I-RAVEN-X, a symbolic benchmark designed to evaluate generalization and robustness in analogical and mathematical reasoning for Large Language Models (LLMs) and Large Reasoning Models (LRMs). I-RAVEN-X extends I-RAVEN by increasing operand complexity, attribute range, and introducing perceptual uncertainty. Compared to LLMs, empirical results on I-RAVEN-X show that LRMs achieve improved productivity and systematicity on longer reasoning relations and wider attribute ranges, respectively. For instance, LRMs experience a significantly smaller degradation on arithmetic accuracy (80.5% → 63.0%) compared to LLMs (59.3% → 4.4%). However, LRMs are still significantly challenged by reasoning under uncertainty (−61.8% in task accuracy) and cannot effectively explore multiple probabilistic outcomes in superposition.
Authors: Roger Wattenhofer (non-IBM), and co-authors - Description:
Transcriptomic foundation models (TFMs) promise to act as virtual cell models, but it remains unclear whether they have internalized the biological rules of transcriptomic space. To address this question, we propose assessing the quality of pretrained TFMs by probing the coherence of their internal world model using the pretraining loss on synthetic samples. Our approach combines two complementary tests. First, as a stress test of plausibility, we compare pretraining loss on shuffled cells to that on real samples. Second, to probe the coherence of the internal world model, we evaluate interpolated samples both within and between cell types, quantifying whether the model identifies coherent clusters. Across multiple datasets, TFMs tend to distinguish real and shuffled cells, with the entropy of expression values strongly predicting the loss gap. Interpolations reveal "loss barriers" between distant cell types, while similar cell types tend not to have barriers. Interestingly, much of the structure of cell embeddings persists despite the shuffling of the values of expressed genes. This approach demonstrates that quantification of an internal world model is possible, even in a "zero resource" setting, without labeled data. We argue that this is a critical step toward identifying whether TFMs can truly function as virtual cell models, rather than stochastic parrots.
Authors: Noa Moriel (IBM), and co-authors - Description:
As AI becomes a native component of 6G network control, AI models must adapt to continuously changing conditions, including the introduction of new features and measurements driven by multi-vendor deployments, hardware upgrades, and evolving service requirements. To address this growing need for flexible learning in non-stationary environments, this vision paper highlights Adaptive Random Forests (ARFs) as a reliable solution for dynamic feature adaptation in communication network scenarios. We show that iterative training of ARFs can effectively lead to stable predictions, with accuracy improving over time as more features are added. In addition, we highlight the importance of explainability in AI-driven networks, proposing Drift-Aware Feature Importance (DAFI) as an efficient XAI feature importance (FI) method. DAFI uses a distributional drift detector to signal when to apply computationally intensive FI methods instead of lighter alternatives. Our tests on 3 different datasets indicate that our approach reduces runtime by up to 2 times, while producing more consistent feature importance values. Together, ARFs and DAFI provide a promising framework to build flexible AI methods adapted to 6G network use-cases.
Authors: Merim Dzaferagic (non-IBM), John D. Kelleher (non-IBM), and co-authors - Description:
With the rapid adoption of Large Language Models (LLMs), LLM-adapters have become increasingly common, providing lightweight specialization of large-scale models. Serving hundreds or thousands of these adapters on a single GPU allows request aggregation, increasing throughput, but may also cause request starvation if GPU memory limits are exceeded. To address this issue, this study focuses on determining the joint configuration of concurrent and parallel adapters that maximizes GPU throughput without inducing starvation, given heterogeneous adapter and traffic properties. We propose a data-driven ML approach leveraging interpretable models to tackle this caching problem and introduce the first Digital Twin capable of reproducing an LLM-adapter serving system, enabling efficient training data generation. Experiments with the vLLM framework and LoRA adapters show that the Digital Twin reproduces throughput within 5.1% of real results, while the ML approach predicts optimal numbers of concurrent and parallel adapters with an error of at most 7.2% under heterogeneous, real-world workloads.
Authors: Ferran Agullo Lopez (non-IBM), Joan Oliveras Torra (non-IBM), Alberto Gutierrez-Torre (non-IBM), and 2 more - Description:
Classical machine-learning auto-tuners for OS control struggle with semantic gaps, brittle rewards, and unsafe exploration. We introduce an online, LLM-driven agent that emulates expert reasoning for continuous OS optimization. When tuning the Linux Completely Fair Scheduler's hyperparameters, the agent outperforms Bayesian optimization by 5% in single-parameter tuning, 7.1% in two-parameter co-tuning, and a human expert by 2.98% overall, while converging faster and adapting more quickly to workload changes. When application counters are unavailable, system-level proxies (e.g., Instructions Per Cycle (IPC)) preserved tail latency in our setup. Putting this together, we propose adopting the Model Context Protocol (MCP) for tool/resource discovery and invocation and a logging channel; on top of that, we propose adding transactional apply--commit--revert, host-mediated approval gates, and policy controls in the OS-tuning server and host to ensure safe, auditable operation. Our results and reference design suggest a practical path toward safe, self-adapting OS control.
Authors: Georgios Liargkovas (non-IBM), Vahab Jabrayilov (non-IBM), Hubertus Franke (IBM), Kostis Kaffes (non-IBM) - Description:
In the ever more digitalized world of today, code vulnerabilities pose a critical threat to our privacy, economy, safety, and infrastructure. Existing automated code vulnerability detection methods suffer from high false positive rates, poor generalization, and an inability to adapt to changing vulnerability landscapes. To address these challenges, we propose SIVA, a self-improving LLM-based vulnerability detection agent that uses memory-guided meta-learning for dynamic prompt optimization. SIVA showed strong learning capabilities, improving its F1 score from 58% to 95% in 5 iterations, significantly outperforming previous state-of-the-art multi-agent systems (~53% F1) on real-life vulnerability datasets. Furthermore, SIVA generalized well across 7 programming languages (93% F1), successfully transferring learned vulnerability concepts between them.
Authors: - Description:
Recent advances in machine learning have transformed molecular property prediction, with large-scale representation models trained on diverse modalities such as SMILES, SELFIES, graph-based embeddings, etc. While multi-modal fusion offers richer insights than unimodal approaches, traditional fusion methods often assign static importance across modalities, leading to redundancy and poor robustness under missing-modality conditions. We introduce a Dynamic Multi-Modal Fusion framework, a self-supervised approach that adaptively integrates heterogeneous molecular embeddings. The framework employs intra-modal gating for feature selection, inter-modal attention for adaptive weighting, and cross-modal reconstruction to enforce information exchange across modalities. Training is guided by progressive modality masking, enabling the fused representation to remain informative even when some inputs are absent. Preliminary evaluations on the MoleculeNet benchmark demonstrate that our method improves reconstruction and modality alignment while achieving superior performance on downstream property prediction tasks compared to unimodal and naïve fusion baselines. These results highlight the importance of dynamic gating, entropy-regularized attention, and reconstruction-driven learning in building robust molecular fusion models.
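The gating-plus-attention fusion pattern described above can be sketched as follows (dimensions and module names are illustrative, not the authors' code):

```python
import torch
import torch.nn as nn

class DynamicFusion(nn.Module):
    """Sketch of the described fusion pattern: per-modality sigmoid gates for
    intra-modal feature selection, followed by softmax attention weights for
    inter-modal adaptive weighting."""

    def __init__(self, n_modalities: int, dim: int):
        super().__init__()
        self.gates = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_modalities)])
        self.score = nn.Linear(dim, 1)      # one attention score per modality

    def forward(self, embs):                # embs: list of (batch, dim) tensors
        gated = [torch.sigmoid(g(e)) * e for g, e in zip(self.gates, embs)]
        stack = torch.stack(gated, dim=1)   # (batch, n_modalities, dim)
        w = torch.softmax(self.score(stack), dim=1)  # inter-modal weights
        return (w * stack).sum(dim=1)       # fused (batch, dim) embedding
```

Progressive modality masking during training (zeroing out random modalities) would then encourage the fused embedding to stay informative when inputs are missing.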
Authors: - Description:
In materials discovery, descriptors that are both accurate and interpretable are essential for predicting molecular properties. However, existing descriptors, including neural network-based approaches, often struggle to capture long-range interactions between substructures. We analyze the previously proposed descriptor TDiMS, which models nonlocal structural relationships via average topological distances between substructure-pairs. While TDiMS has shown strong performance, its size dependence had not been systematically assessed. Our analysis reveals that TDiMS is particularly effective for larger molecules, where long-range interactions are critical and conventional descriptors underperform. SHAP-based analysis highlights that its predictive power derives from distant substructure-pair features. In addition to improved accuracy, TDiMS offers interpretable features that provide chemical insight, potentially accelerating molecular design and discovery.
Authors: Akihiro Kishimoto (IBM), Kohei Miyaguchi (IBM), Masataka Hirose (non-IBM), Junta Fuchiwaki (non-IBM), and 3 more - Description:
Molecular property prediction has greatly benefited from learned embeddings such as SMILES-based, SELFIES-based, and graph-derived representations. However, existing approaches often rely on a single modality or naïvely concatenating multiple modalities, limiting robustness and failing under missing-modality conditions. In this work, we propose a novel self-supervised fusion framework - dynamic fusion, that dynamically integrates multiple molecular embeddings. The proposed framework employs intra-modal gating for feature selection, inter-modal attention for adaptive weighting, and cross-modal reconstruction to ensure information exchange. Through progressive modality masking during training, the dynamic fusion approach learns to generate fused embeddings resilient to missing modalities. We conducted preliminary evaluations of the proposed approach on MoleculeNet benchmarks, and demonstrate a superior performance in reconstruction, modality alignment, and downstream property prediction tasks compared to unimodal baselines. Our findings highlight the importance of feature-level gating, entropy-regularized attention, and cross-modal reconstruction in achieving robust fusion.
Authors: - Description:
Instead of querying LLMs in a one-shot manner and hoping to get the right answer for a reasoning task, we propose a paradigm we call \emph{verbalized algorithms} (VAs), which leverage classical algorithms with established theoretical understanding. VAs decompose a task into elementary operations on natural language strings and limit the scope of LLMs to only those operations where they are absolutely necessary. For example, for sorting a series of natural language strings, \emph{verbalized sorting} uses an LLM as a binary comparison oracle in a known and well-analyzed sorting algorithm (e.g., a bitonic sorting network). We demonstrate the effectiveness of this approach on sorting and clustering tasks.
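A minimal sketch of verbalized sorting, using merge sort for brevity instead of the bitonic network mentioned above; llm_compare is a caller-supplied oracle that would wrap an LLM prompt:

```python
def verbalized_sort(items, llm_compare):
    """Standard merge sort in which the only LLM involvement is the pairwise
    comparison oracle. `llm_compare(a, b)` returns True when `a` should
    precede `b` (e.g., by prompting a model with the two strings)."""
    if len(items) <= 1:
        return items
    mid = len(items) // 2
    left = verbalized_sort(items[:mid], llm_compare)
    right = verbalized_sort(items[mid:], llm_compare)
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if llm_compare(left[i], right[j]):
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]

# Usage with a trivial stand-in oracle (an LLM call would replace the lambda):
print(verbalized_sort(["mild", "scalding", "cool"], lambda a, b: a < b))
```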
Authors: Supriya Lall (IBM), Christian Farrell (IBM), Hari Pathanjaly (IBM), Marko Pavic (IBM), Sarvesh Chezhian (IBM), and co-authors - Description:
This paper introduces an end-to-end, or joint prediction and optimization, framework for the class of two-stage contextual optimization problems with information-gathering. We showcase the approach on a dynamic electricity-scheduling problem on real data. We show that the adaptiveness of the end-to-end approach indeed provides benefits over other methods which train their forecasting method independently of the first information-gathering stage.
Authors: Rares Christian (non-IBM), Pavithra Harsha (IBM), Georgia Perakis (non-IBM), Brian Quanz (IBM) - Description:
Aligning large language models (LLMs) is a central objective of post-training, often achieved through reward modeling and reinforcement learning methods. Among these, direct preference optimization (DPO) has emerged as a widely adopted technique that fine-tunes LLMs on preferred completions over less favorable ones. While most frontier LLMs do not disclose their curated preference pairs, the broader LLM community has released several open-source DPO datasets, including TuluDPO, UltraFeedback, ORPO, HelpSteer, and Code-Preference-Pairs. However, the construction process behind these datasets often lacks valuable metadata, design rationales, and quality annotations. This missing context makes it difficult to understand how preferences were selected, what task types they span, and how well they reflect human judgement at a per-sample level. In this work, we present the first comprehensive, data-centric analysis of open-source DPO corpora. We leverage the Magpie framework to annotate each sample for task category, input quality, and preference reward, a reward-model-based signal that validates the preference order without relying on human annotations. This enables a scalable, fine-grained inspection of preference quality across datasets, revealing structural and qualitative discrepancies in reward margins. Building on these insights, we systematically curate a new DPO mixture, UltraMix, that draws selectively from all five corpora while removing noisy or redundant samples. UltraMix is 30% smaller than the best-performing individual dataset yet exceeds its performance across key benchmarks. We publicly release all annotations, metadata, and our curated mixture to facilitate future research in data-centric preference optimization.
Authors: Aladin Djuhera (non-IBM), Farhan Ahmed (IBM), Syed Zawad (IBM), Holger Boche (non-IBM), and co-authors - Description:
Attention-based models dominate sequence transduction, yet in medical time-series datasets they often misallocate focus to irrelevant regions while missing critical context. We present EvidenceMoE, a Mixture-of-Experts architecture that assigns experts based on prior physics knowledge and refines their outputs through an Evidential Dirichlet feedback mechanism providing per-expert reliability scores. In our work on fluorescence lifetime-guided cancer surgery, we assigned expert models to relevant time-series segments encoding tumor depth and microenvironment based tumor delineation knowledge from physics (i.e., the radiative transport equation for photon propagation in tissue), rather than learned only from data. Unlike other prior models that address either depth (e.g., Fluorescence LiDAR) or fluorescence decay (fluorescence lifetime or FLI for drug–target binding), EvidenceMoE jointly captures both within a unified framework, achieving errors as low as 0.030 NRMSE for depth and 0.074 NRMSE for FLI on simulated and experimental datasets, closely matching ground-truth measurements.
Authors: Ismail Erbas (non-IBM), Ferhat Demikiran (non-IBM), Navid Nizam (non-IBM), Vikas Pandey (non-IBM), and 5 more - Description:
Understanding how transformers can execute specific algorithmic and symbolic computations remains a challenge in artificial intelligence. Prior work has demonstrated that standard transformers have trouble generalizing algorithmic problems of arbitrary length (i.e., length generalization), such as arithmetic problems. Here we present an interpretable and modular framework for specifying exact algorithmic computations with universal transformers that enable these models to perfectly solve algorithmic problems of arbitrary depth (length), without any training. In particular, by formulating algorithmic problems as computable circuits, we exactly map circuit computations onto components of the universal transformer architecture. We showcase this ability by specifying universal transformers that perfectly solve two fundamental algorithmic problems: modular arithmetic and Boolean logic. Notably, these two models demonstrate how transformers can generalize to problems of any length using interpretable architectural modifications. This framework can be naturally adopted for any algorithmic problem that can be formulated as a circuit, illustrating exactly how transformers can implement arbitrary circuit algorithms. More broadly, this framework provides an existence proof of transformer models capable of implementing exact algorithms, providing avenues of opportunity for exploring their learnability in future work.
Authors: Taku Ito (IBM), and IBM co-authors - Description:
Evaluator-driven discovery systems (e.g., FunSearch) succeed when the target admits a clear fitness metric (e.g., “find the largest cap set”), but many central mathematical objects—Vitali sets, the Banach–Tarski paradox, Hamel bases, ultrafilters, etc.—lack such metrics and often rely on specific nonconstructive axioms, such as the axiom of choice (AC). We propose a FunSearch variant with a theorem proposer and a Lean-verified, axiom-aware evaluator that scores candidates by (i) proof progress, (ii) property coverage, and (iii) an axiom footprint that audits reliance on Choice (AC), Zorn’s Lemma, the axiom of dependent choice (DC), the law of excluded middle (EM), and others. A minimal prototype reconstructs proofs of the existence of a right inverse for an arbitrary surjection (via AC). We claim no new theorems, but provide early evidence that axiom-aware evaluation broadens evaluator-driven discovery beyond purely executable code.
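A minimal Lean 4 sketch (core Lean, no Mathlib needed) of the prototype's reconstructed fact, with the axiom footprint auditable mechanically, as an axiom-aware evaluator would do:

```lean
-- Every surjection has a right inverse; the axiom of choice enters
-- through `Classical.choose`.
theorem surj_has_right_inverse {α β : Type} (f : α → β)
    (hf : ∀ b, ∃ a, f a = b) :            -- `hf` spells out surjectivity
    ∃ g : β → α, ∀ b, f (g b) = b :=
  ⟨fun b => Classical.choose (hf b),      -- pick a preimage for each `b` (AC)
   fun b => Classical.choose_spec (hf b)⟩ -- and certify it maps back to `b`

#print axioms surj_has_right_inverse      -- audits reliance on Classical.choice
```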
Authors: Besart Shyti (non-IBM), and co-authors - Description:
The safety of AI agents in multi-turn interaction is a growing concern, particularly as agent behavior may vary over time due to the dynamic nature of both the agent and its environment. We introduce the concept of "state-induced risk amplification", hypothesizing that extended AI-environment interaction can lead to agent actions that transition the system into risky states, and that such transitions may increase the likelihood of risky actions by the agent. We provide a formal characterization of these effects using the Markov decision process framework. To empirically test our hypotheses, we introduce a novel experimental setup inspired by traffic monitoring applications. Our results demonstrate the practical occurrence of state-induced risk amplification, highlighting an emerging safety risk for current multi-turn agents and calling for safety evaluation methods that account for state-dependent dynamics. We discuss implications for designing adaptive risk mitigation strategies.
Authors: Rebecka Nordenlow (IBM), Rachel Bellamy (IBM), and co-authors
- Description:
Structure elucidation is crucial for identifying unknown chemical compounds, yet traditional spectroscopic analysis remains labour-intensive and challenging, particularly when applied to a large number of spectra. Although machine learning models have successfully predicted chemical structures from individual spectroscopic modalities, they typically fail to integrate multiple modalities concurrently, as expert chemists usually do. Here, we introduce a multimodal multitask transformer model capable of accurately predicting molecular structures from integrated spectroscopic data, including Nuclear Magnetic Resonance (NMR) and Infrared (IR) spectroscopy. Trained initially on extensive simulated datasets and subsequently finetuned on experimental spectra, our model achieves Top–1 prediction accuracies up to 96%. We demonstrate the model’s capability to leverage synergistic information from different spectroscopic techniques and show that it performs on par with expert human chemists, significantly outperforming traditional computational methods. Our model represents a major advancement toward fully automated chemical analysis, offering substantial improvements in efficiency and accuracy for chemical research and discovery.
Authors: - Description:
Learning molecular representations that are robust to 3D rotations typically requires architectures with built-in symmetry priors or extensive data augmentation. In this work, we investigate whether contrastive multimodal pretraining alone can induce SO(3) invariance in molecular embeddings. We jointly train a continuous 3D-field encoder, based on a vector-quantized generative adversarial network (VQGAN), and a SMILES-based transformer encoder on a dataset of 855,000 molecules, each represented by a DFT-computed electron density grid and a corresponding canonical SMILES string. Both CLIP-style and SigLIP contrastive objectives are used to align representations across modalities. Because SMILES embeddings are invariant to molecular orientation, the contrastive loss implicitly encourages the 3D encoder to produce rotation-consistent representations by aligning different poses of the same molecule to a fixed symbolic anchor. To evaluate geometric generalization, we construct a benchmark comprising 1,000 molecules with five unseen random SO(3) rotations each. The CLIP-based model retrieves at least one rotated variant among its top-10 results for 77% of queries, compared to 9.8% for a unimodal VQGAN baseline, and retrieves three or more variants for 45% of queries (versus 0% baseline). Functional group-wise Recall@10 exceeds 98% for most chemical classes, and clustering by HOMO energy yields a Davies–Bouldin index of 2.35 (versus 34.46 for the baseline), indicating strong chemical organization in the latent space. Additionally, fine-tuning with rotated samples reveals a trade-off between retrieval precision and pose diversity. These results suggest that contrastive multimodal pretraining can yield symmetry-aware molecular representations, even in the absence of explicit equivariant design.
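The CLIP-style objective named above can be sketched generically as a symmetric contrastive loss between paired field and SMILES embeddings (encoder details omitted; this is the generic objective class, not the authors' code):

```python
import torch
import torch.nn.functional as F

def clip_loss(field_emb, smiles_emb, temperature=0.07):
    """Symmetric InfoNCE between paired embeddings; row i of each tensor is
    the same molecule. Because a molecule's SMILES embedding is identical for
    every 3D pose, pulling each pose toward the same text anchor implicitly
    rewards rotation-consistent field embeddings."""
    f = F.normalize(field_emb, dim=-1)
    s = F.normalize(smiles_emb, dim=-1)
    logits = f @ s.T / temperature
    labels = torch.arange(len(f))
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))
```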
Authors: Enzo Oliveira (IBM), Caio Gama (IBM), and co-authors - Description:
Recent advancements in Machine Learning (ML) have substantially accelerated the material discovery field, yet the utilization of Large Language Models (LLMs) in the Metal-Organic Frameworks (MOFs) research has received limited attention. This work leverages LLMs to build a new set of models that accelerate MOF material discovery. Our strategy relies on pre-training the Granite model using a single H100 GPU on a combination of selective chemical journals and structural data from the PubChem database. Our evaluation demonstrates that this pre-training strategy significantly enhances the performance of LLMs in predicting MOF properties, especially in limited-resource task scenarios. We hope this work can motivate future research to explore the potential of LLMs in enhancing material discovery to build robust and efficient Metal-Organic Frameworks models.
Authors: Sultan Alrowili (IBM), Mathan Kumar Eswaran (IBM), and co-authors - Description:
Recent years have seen the rapid emergence and adoption of chemical foundation models in computational materials science for property-prediction and generation tasks focused mostly on small molecules or crystals. Despite these paradigm shifts, integrating newly discovered materials into real-world devices remains a challenge due to design problems: a new candidate material must be optimized to achieve compatibility with the other components in the system and deliver the target performance. Benchmarks for chemical foundation models must therefore evaluate their ability to predict macro-scale outcomes that result from chemical interactions in a multivariate design space. This study evaluates the performance of chemical foundation models, pre-trained primarily on SMILES of small molecules, in extrapolating from molecules to material design challenges across multiple length scales in batteries. Ten prediction models are trained, covering molecular properties, formulation performance, and battery device measurements. Material representations from several foundation models are compared, and their performance is benchmarked against conventional molecular representations such as Morgan fingerprints. The study further examines their capacity to generalize to out-of-distribution cases by quantifying prediction errors for novel material designs that differ substantially from the training data. Finally, the interpretability of the trained predictors is assessed by correlating actual outcomes and predictions with the chemical moieties in the datasets, with the aim of enabling researchers to extract design rules in regions of chemical space where the model has high confidence.
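As a point of reference for the conventional baseline mentioned above, a Morgan-fingerprint featurizer can be sketched in a few lines of RDKit; the radius and bit width below are common defaults, not values from the study:

```python
# Conventional Morgan-fingerprint features for benchmarking (illustrative).
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def morgan_features(smiles: str, radius: int = 2, n_bits: int = 2048) -> np.ndarray:
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.array(fp, dtype=np.float32)

# These vectors can feed any conventional regressor for the molecular,
# formulation, and device-level prediction tasks described above.
X = np.stack([morgan_features(s) for s in ["CCO", "c1ccccc1", "CC(=O)O"]])
```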
Authors: Murtaza Zohair (IBM), et al.
- Description:
Accurate UV spectral prediction is challenging for machine learning. UV spectra exhibit broad absorption bands characterized by peak positions, band shapes, and curvature profiles, yet current models fail to capture these characteristics. We present Peak Position Awareness (PPA), Curriculum Learning for Interpolated Abstracted Spectra (CLIAS), and Spectrum Curvature Limitation (SCL) to address them, showing consistent improvements across diverse models.
Authors: Akihiro Kishimoto (IBM), Hiroshi Kajino (IBM), et al.
- Description:
Recent advancements in Language Models (LMs) have catalyzed the creation of multiple benchmarks. A crucial task, however, is assessing the validity of the benchmarks themselves. This is most commonly done via Benchmark Agreement Testing (BAT), where new benchmarks are validated against established ones using some agreement metric (e.g., Spearman correlation). Despite the crucial role of BAT for benchmark builders and consumers, there are no standardized procedures for such agreement testing, which can lead to invalid conclusions and mistrust. By analyzing over 40 prominent benchmarks, we show how some overlooked methodological choices can significantly influence BAT results. To address these inconsistencies, we propose a set of best practices and demonstrate their impact on robustness and validity. To foster adoption and facilitate future research, we introduce BenchBench (links in the App), a Python package and leaderboard for BAT.
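For illustration, the core BAT computation is a rank correlation between the model scores two benchmarks assign; the numbers below are invented:

```python
# Toy Benchmark Agreement Testing: correlate model rankings across benchmarks.
from scipy.stats import spearmanr

bench_new = [71.2, 64.5, 80.1, 58.3]   # hypothetical scores of four models
bench_ref = [69.0, 66.1, 78.4, 55.9]   # same models on a reference benchmark

rho, p_value = spearmanr(bench_new, bench_ref)
print(f"Spearman agreement: {rho:.2f} (p = {p_value:.3f})")
# As the paper argues, conclusions drawn from rho are sensitive to overlooked
# choices such as which models and which reference benchmark are included.
```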
Authors: Ariel Gera (IBM), Ofir Arviv (IBM), Asaf Yehudai (IBM), Elron Bandel (IBM), et al.
- Description:
Alignment techniques are essential for making Large Language Models (LLMs) usable and useful for real-world applications, and diverse approaches have been developed, each with distinct advantages and limitations. However, the lack of unified evaluation frameworks makes it difficult to systematically compare these paradigms and guide deployment decisions. This paper introduces MEAL (Multi-dimensional Evaluation of ALignment techniques), a comprehensive and systematic evaluation framework for alignment techniques. It focuses on four key dimensions: alignment detection, alignment quality, computational efficiency, and robustness. Through experiments on models trained with different alignment strategies, we demonstrate the utility of our framework in identifying their strengths and limitations, providing valuable insights for future research directions.
Authors: Muneeza Azmat (IBM), Momin Abbas (IBM), Marcelo Carpinette Grave (IBM), Tiago Machado (IBM), et al.
- Description:
Modern computer vision models commonly rely on passive sensing and process images in their entirety, all at once. Lacking the ability to zoom in on task-relevant regions for detailed analysis, this approach becomes limited for high-resolution, cluttered scenes where only a small area is relevant to the task at hand. A particularly challenging problem in this context is instance detection, which involves localizing specific object instances given a few visual examples. We introduce an active sensing system that uses a brain-inspired coarse-to-fine strategy to glimpse over the image by steering a retina-like sensor. The sensor uses a log-polar pixel layout that facilitates precise localization of task-relevant regions.
Our system can be integrated with various state-of-the-art instance detectors. It improves their performance by up to 90%, making even small models developed for edge devices perform on par with or, in difficult cases, even better than their large counterparts. In light of these performance gains, our model can become a complementary part of sensor hardware, enabling active, task-driven sensing.
Authors: Oleh Kolner (IBM), et al.
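To make the log-polar idea concrete, here is a small, self-contained sampling sketch (not the paper's implementation); the grid sizes are arbitrary:

```python
# Log-polar "retina" sampling: dense near the fixation point, coarse in the
# periphery, since ring radii grow exponentially with eccentricity.
import numpy as np

def log_polar_glimpse(image, cx, cy, n_rings=32, n_wedges=64):
    h, w = image.shape[:2]
    radii = np.exp(np.linspace(0.0, np.log(min(h, w) / 2), n_rings))
    angles = np.linspace(0.0, 2 * np.pi, n_wedges, endpoint=False)
    rs, ts = np.meshgrid(radii, angles, indexing="ij")
    xs = np.clip((cx + rs * np.cos(ts)).astype(int), 0, w - 1)
    ys = np.clip((cy + rs * np.sin(ts)).astype(int), 0, h - 1)
    return image[ys, xs]   # (n_rings, n_wedges) foveated glimpse

glimpse = log_polar_glimpse(np.random.rand(480, 640), cx=320, cy=240)
```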
- Description:
As urbanization and climate change progress, urban heat island effects are becoming more frequent and severe. To formulate effective mitigation plans, cities require detailed air temperature data, yet conventional machine learning models with limited data often produce inaccurate predictions, particularly in underserved areas. Geospatial foundation models trained on global unstructured data offer a promising alternative, demonstrating strong generalization while requiring only minimal fine-tuning. In this study, an empirical ground truth of urban heat patterns is established by quantifying cooling effects from green spaces and benchmarking them against model predictions to evaluate the model's accuracy. The foundation model is subsequently fine-tuned to predict land surface temperatures under future climate scenarios, and its practical value is demonstrated through a simulated in-painting exercise that highlights its role in mitigation support. The results indicate that foundation models offer a powerful way to evaluate urban heat island mitigation strategies in data-scarce regions and to support more climate-resilient cities.
Authors: Jannis Fleckenstein (IBM), David Kreismann (IBM), et al.
- Description:
Most large-scale chemical language models are trained on a single textual molecular representation using self-supervised learning over large unlabeled corpora. These models excel in tasks such as property prediction and molecule generation by learning contextualized representations of input tokens. However, relying solely on one representation may discard structural or semantic information captured by alternative formats and may limit the model's ability to generalize across diverse molecular encodings. To address this limitation, we incorporate multiple textual molecular representations (including SMILES, SELFIES, molecular formula, IUPAC name, International Chemical Identifier (InChI), serialized polymer graph (SPG), and electrolyte formulations) into a unified vocabulary to harness the unique strengths of each format. Here, we introduce a large encoder-decoder chemical foundation model based on the Bamba architecture, a hybrid of Transformer and Mamba-2 layers, designed to support multi-representational inputs. The model is pre-trained BERT-style on 588 million samples, a corpus of approximately 29 billion molecular tokens. These models serve as a foundation for chemical language research, supporting complex tasks including molecular property prediction, classification, and molecular translation. Furthermore, extensive studies of the multimodal molecular latent space indicate cross-representation alignment and reveal how different textual encodings of the same molecule can converge toward a unified semantic representation. This shared space may facilitate deeper insights into molecular structure, enhance generalization, and support a broad range of downstream applications.
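The unified-vocabulary idea can be illustrated by generating several textual views of one molecule; this is a sketch using rdkit and selfies, not the paper's training pipeline:

```python
# Multiple textual representations of the same molecule (aspirin).
from rdkit import Chem
import selfies

smiles = "CC(=O)Oc1ccccc1C(=O)O"
mol = Chem.MolFromSmiles(smiles)

views = {
    "smiles":  Chem.MolToSmiles(mol),                       # canonical SMILES
    "selfies": selfies.encoder(smiles),                     # robust string encoding
    "inchi":   Chem.MolToInchi(mol),                        # InChI identifier
    "formula": Chem.rdMolDescriptors.CalcMolFormula(mol),   # molecular formula
}
# Tokenizing all views with one shared vocabulary lets a single model read the
# same structure through several textual lenses, as described above.
```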
- Description:
Geospatial chain-of-thought (CoT) reasoning is essential for advancing Visual Question Answering (VQA) on satellite imagery, particularly in climate-related applications such as disaster monitoring, infrastructure risk assessment, urban resilience planning, and policy support. Existing VQA models enable scalable interpretation of remote sensing data but often lack the structured reasoning required for complex geospatial queries. We propose a VQA framework that integrates CoT reasoning with Direct Preference Optimization (DPO) to improve interpretability, robustness, and accuracy. By generating intermediate rationales, the model better handles tasks involving detection, classification, spatial relations, and comparative analysis, which are critical for reliable decision support in high-stakes climate domains. Experiments show that CoT supervision improves accuracy by 34.9% over direct baselines, while DPO yields additional gains in accuracy and reasoning quality. The resulting system advances VQA for multispectral Earth observation by enabling richer geospatial reasoning and more effective climate use cases.
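The DPO component optimizes the standard preference objective; a minimal sketch (placeholder log-probabilities, not the trained VQA model) looks like this:

```python
# Direct Preference Optimization loss over chosen vs. rejected rationales.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Log-probability ratios of the policy against a frozen reference model.
    chosen = logp_chosen - ref_logp_chosen
    rejected = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen - rejected)).mean()

loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.7]),
                torch.tensor([-13.0]), torch.tensor([-14.9]))
```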
Authors: Shambhavi Shanker (IBM), et al.
- Description:
Sensitivity analysis is a cornerstone of climate science, essential for understanding phenomena ranging from storm intensity to long-term climate feedbacks. However, computing these sensitivities using traditional physical models is often prohibitively expensive in terms of both computation and development time. While modern AI-based generative models are orders of magnitude faster to evaluate, computing sensitivities with them remains a significant bottleneck. This work addresses this challenge by applying the adjoint state method for calculating gradients in generative flow models. We apply this method to the cBottle generative model, trained on ERA5 and ICON data, to perform sensitivity analysis of any atmospheric variable with respect to sea surface temperatures. We quantitatively validate the computed sensitivities against the model’s own outputs. Our results provide initial evidence that this approach can produce reliable gradients, reducing the computational cost of sensitivity analysis from weeks on a supercomputer with a physical model to hours on a GPU, thereby simplifying a critical workflow in climate science.
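In reverse-mode autodiff terms, the adjoint state method amounts to propagating a vector-Jacobian product backward through the model; the toy sketch below uses a placeholder network, not cBottle:

```python
# Sensitivity of a scalar diagnostic with respect to an input field, obtained
# by a backward (adjoint) pass through a differentiable stand-in model.
import torch

sst = torch.randn(64, 128, requires_grad=True)   # toy sea-surface-temperature field
weights = torch.randn(128, 128)

def surrogate(x):                                # placeholder for a generative model
    return torch.tanh(x @ weights).sum(dim=-1)

diagnostic = surrogate(sst).mean()               # scalar quantity of interest
diagnostic.backward()                            # adjoint (backward) pass
sensitivity = sst.grad                           # d(diagnostic)/d(SST) per grid cell
```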
Authors: Nicolae Dobra (non-IBM), Jakiw Pidstrigach (non-IBM), Tim Reichelt (non-IBM), et al.
- Description:
AI-driven weather forecasting models, particularly foundation models, have achieved significant advancements in both speed and accuracy. However, accurately forecasting rare, high-impact extreme events, such as storms and heatwaves, remains a critical challenge. These models often underestimate event intensity and frequency, limiting their reliability in operational and risk-sensitive contexts. In this study, we investigate uncertainty-aware extreme event forecasting using the recently introduced time-series foundation model, Tiny Time Mixers (TTM). We develop and compare two uncertainty quantification approaches, hyperparameter ensembling and Monte Carlo (MC) dropout, and evaluate their ability to improve classification of extreme events. Our results show that incorporating predictive uncertainty significantly enhances performance compared to zero-shot TTM, and that the choice of uncertainty method and threshold critically affects model behavior. We find that the hyperparameter ensemble yields more stable and accurate predictions, particularly for rare storm events, highlighting the value of lightweight ensemble models for uncertainty-calibrated forecasting.
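As an illustration of the MC dropout variant (the network below is a stand-in, not the TTM architecture):

```python
# Monte Carlo dropout: keep dropout active at inference and use the spread of
# stochastic forward passes as a predictive uncertainty estimate.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(24, 64), nn.ReLU(), nn.Dropout(0.1), nn.Linear(64, 1))
model.train()                                   # leaves dropout stochastic

x = torch.randn(1, 24)                          # 24-step context (illustrative)
samples = torch.stack([model(x) for _ in range(100)])
mean, std = samples.mean(dim=0), samples.std(dim=0)
# Classify an extreme event from the uncertainty band; 1.5 is an example threshold.
is_extreme = (mean + 2 * std) > 1.5
```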
- Description:
Development of transformational AI models for polymers has been greatly hindered by the lack of large, comprehensive, multimodal datasets licensed for research and commercial use. The primary aim of this proposal is to address this unmet need through the creation of carbon-m1, a massive, multimodal synthetic dataset for polymers and polymer-containing materials, to be released under an Apache 2.0 license. Carbon-m1 will seek to capture the critical structural, sequence, and stochastic features of polymers as well as their characterization data, two crucial elements missing from existing efforts to tackle data challenges in polymer AI development.
- Description:
Designing stable crystal structures is central to accelerating the discovery of new materials, yet most generative approaches remain limited to reproducing known patterns rather than exploring novel possibilities. We present a method that trains large language models with reinforcement learning guided by verifiable energy-based rewards, optimizing toward physically grounded stability objectives. Compared to supervised finetuning and base models, our reinforcement learning–trained model generates crystals with higher predicted stability and a greater proportion of previously unreported structures. These results suggest that combining verifiable energy rewards and reinforcement learning provides a powerful path toward automated discovery of novel, stable materials.
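The verifiable-reward idea can be sketched as follows; `predict_energy_above_hull` is a stub standing in for whatever energy model supplies the signal, and all constants are illustrative:

```python
# Hypothetical stability reward for RL over generated crystals.
def predict_energy_above_hull(structure) -> float:
    """Stub for an assumed verifier, e.g. an ML interatomic potential or DFT surrogate."""
    return 0.03                                  # placeholder value in eV/atom

def stability_reward(structure) -> float:
    e_hull = predict_energy_above_hull(structure)
    # Lower energy above the convex hull means greater predicted stability;
    # the 0.1 eV/atom tolerance here is an arbitrary illustrative choice.
    return max(0.0, 1.0 - e_hull / 0.1)

print(stability_reward(structure=None))          # -> 0.7
```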
Authors: Zhang-Wei Hong (IBM), Nofit Segal (non-IBM), Aviv Netanyahu (non-IBM), Rafael Gómez-Bombarelli (non-IBM), Pulkit Agrawal (non-IBM)
- Description:
Learning molecular representations robust to 3D rotations typically relies on symmetry-aware architectures or extensive augmentation. Here, we show that contrastive multimodal pretraining alone can induce SO(3) invariance in molecular embeddings. We jointly train a 3D electron density encoder, based on a VQGAN, and a SMILES-based transformer encoder on 855k molecules, using CLIP-style and SigLIP objectives to align volumetric and symbolic modalities. Because SMILES embeddings are rotation-invariant, the contrastive loss implicitly enforces rotation-consistency in the 3D encoder. To assess geometric generalization, we introduce a benchmark of 1,000 molecules with five random SO(3) rotations each. Our model retrieves rotated variants with 77% Recall@10 (vs. 9.8% for a unimodal baseline) and organizes latent space by chemical properties, achieving functional group-wise Recall@10 above 98% and a Davies–Bouldin index of 2.35 (vs. 34.46 baseline). Fine-tuning with rotated data reveals a trade-off between retrieval precision and pose diversity. These results demonstrate that contrastive multimodal pretraining can yield symmetry-aware molecular representations without explicit equivariant design.
Authors: Enzo Oliveira (IBM), Caio Gama (IBM), et al.
- Description:
This paper introduces the Shepherd Test, a new conceptual test for assessing the moral and relational dimensions of superintelligent artificial agents. The test is inspired by human interactions with animals, where ethical considerations about care, manipulation, and consumption arise in contexts of asymmetric power and self-preservation. We argue that AI crosses an important, and potentially dangerous, threshold of intelligence when it exhibits the ability to manipulate, nurture, and instrumentally use less intelligent agents, while also managing its own survival and expansion goals. This includes the ability to weigh moral trade-offs between self-interest and the well-being of subordinate agents. The Shepherd Test thus challenges traditional AI evaluation paradigms by emphasizing moral agency, hierarchical behavior, and complex decision-making under existential stakes. We argue that this shift is critical for advancing AI governance, particularly as AI systems become increasingly integrated into multi-agent environments. We conclude by identifying key research directions, including the development of simulation environments for testing moral behavior in AI, and the mathematical formalization of ethical manipulation within multi-agent systems.
- Description:
Large language models (LLMs) have demonstrated excellent capabilities in generating structured diagrams from natural language descriptions. In particular, they have shown great promise in generating sequence diagrams for software engineering, typically represented in a text-based syntax such as Mermaid. However, systematic evaluations in this space remain underdeveloped, as there are no existing benchmarks for assessing an LLM's correctness on this task. To address this shortcoming, we introduce MermaidSeqBench, a human-verified and LLM-synthetically-extended benchmark for assessing an LLM's capabilities in generating Mermaid sequence diagrams from textual prompts. The benchmark consists of a core set of 132 samples, starting from a small set of manually crafted and verified flows. These were expanded via a hybrid methodology combining human annotation, in-context LLM prompting, and rule-based variation generation. Our benchmark uses an LLM-as-a-judge model to assess Mermaid sequence diagram generation across fine-grained metrics, including syntax correctness, activation handling, error handling, and practical usability. We perform initial evaluations on numerous state-of-the-art LLMs and use multiple LLM judge models to demonstrate the effectiveness and flexibility of our benchmark. Our results reveal significant capability gaps across models and evaluation modes. The proposed benchmark provides a foundation for advancing research in structured diagram generation and for developing more rigorous, fine-grained evaluation methodologies.
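For concreteness, a benchmark sample of the kind described might pair a prompt with a reference diagram; the field names below are illustrative, not the benchmark's actual schema:

```python
# Hypothetical shape of a MermaidSeqBench-style sample.
sample = {
    "prompt": "A user logs in; the server validates credentials and replies.",
    "reference": """\
sequenceDiagram
    participant U as User
    participant S as Server
    U->>S: login(credentials)
    activate S
    S-->>U: session token
    deactivate S
""",
    "judge_metrics": ["syntax", "activation handling", "error handling", "usability"],
}
```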
- Description:
Providing human-understandable insights into the inner workings of neural networks is an important step toward achieving more explainable and trustworthy AI. Analyzing representations across neural layers has become a widely used approach for this purpose in various applications. In this work, we take a step toward a more holistic understanding of neural layers by investigating the existence of distinct layer groupings within them. Specifically, we explore using representation similarity within neural networks to identify clusters of similar layers, revealing potential layer groupings. We achieve this by proposing, for the first time to our knowledge, the use of the Gromov-Wasserstein distance, which overcomes challenges posed by varying distributions and dimensionalities across intermediate representations, issues that complicate direct layer-to-layer comparisons. On algebraic, language, and vision tasks, we observe the emergence of layer groups that correspond to functional abstractions within networks. These results reveal implicit layer structure patterns and suggest that network computations may exhibit abrupt shifts rather than smooth transitions. Through downstream applications to model compression and fine-tuning, we validate our measure and further show that the proposed approach offers meaningful insights into the internal behavior of neural networks.
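Using the POT library, the pairwise layer comparison can be sketched as below; sizes and data are placeholders:

```python
# Gromov-Wasserstein distance between two layers' representations. GW compares
# intra-layer distance structures, so differing layer widths are not a problem.
import numpy as np
import ot

n = 200
layer_a = np.random.rand(n, 768)     # stand-in activations, layer A
layer_b = np.random.rand(n, 1024)    # stand-in activations, layer B (wider)

C1 = ot.dist(layer_a, layer_a)       # pairwise distances within layer A
C2 = ot.dist(layer_b, layer_b)       # pairwise distances within layer B
p = q = np.full(n, 1.0 / n)          # uniform sample weights

gw = ot.gromov.gromov_wasserstein2(C1, C2, p, q, "square_loss")
# Clustering the matrix of all layer-pair GW values exposes the layer
# groupings discussed above.
```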
Authors: Tian Gao (IBM), et al.
- Description:
The HyperCube model is a promising tensor factorization framework for discovering latent group structures in data. A foundational conjecture posits that its global minima correspond to unitary group representations, but a proof has remained elusive. We make significant theoretical progress by decomposing the HyperCube objective into a base term dependent on matrix norms and a misalignment term. We introduce the Perfect Alignment Conjecture, which states that this misalignment must vanish at any stationary point for the optimization to capture a true group. Under this condition, we prove that all local minima are in fact global and are unitarily equivalent to the group's regular representation, thus conditionally resolving the original conjecture. Our analysis reveals HyperCube’s unique inductive bias for full-rank, unitary solutions, distinguishing it from the typical low-rank bias in deep learning models.
Authors: Ben Huh (IBM), Halyun Jeong (non-IBM)
- Description:
We present 3DGrid-LLM, a multimodal foundation model designed to integrate natural language with three-dimensional electron density grids for applications in molecular and materials science. The architecture extends a large decoder-only language model by incorporating discrete volumetric representations obtained through a 3D VQGAN, enabling joint token-level processing of spatial and textual modalities within a unified framework. Pre-trained on a diverse corpus of molecular and materials datasets, 3DGrid-LLM supports bidirectional text–grid generation, multimodal question answering, and retrieval-augmented 3D reconstruction. Comprehensive evaluations demonstrate consistent improvements over baseline methods in multimodal VQA, chemically informed text generation, and property-aligned retrieval tasks, yielding outputs that are both accurate and physically consistent.
Authors: Henrique Porto (IBM), Enzo Oliveira (IBM), Caio Gama (IBM), et al.
- Description:
The rapid shift from stateless large language models (LLMs) to autonomous, goal-driven agents raises a central question: When is agentic AI truly necessary? While agents enable multi-step reasoning, persistent memory, and tool orchestration, deploying them indiscriminately leads to higher cost, complexity, and risk.
We present STRIDE (Systematic Task Reasoning Intelligence Deployment Evaluator), a framework that provides principled recommendations for selecting between three modalities: (i) direct LLM calls, (ii) guided AI assistants, and (iii) fully autonomous agentic AI. STRIDE integrates structured task decomposition, dynamism attribution, and self-reflection requirement analysis to produce an Agentic Suitability Score, ensuring that full agentic autonomy is reserved for tasks with inherent dynamism or evolving context.
Evaluated across 30 real-world tasks spanning SRE, compliance, and enterprise automation, STRIDE achieved 92% accuracy in modality selection, reduced unnecessary agent deployments by 45%, and cut resource costs by 37%. Expert validation over six months in SRE and compliance domains confirmed its practical utility, with domain specialists agreeing that STRIDE effectively distinguishes between tasks requiring simple LLM calls, guided assistants, or full agentic autonomy. This work reframes agent adoption as a necessity-driven design decision, ensuring autonomy is applied only when its benefits justify the costs.
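A toy scorer in the spirit of STRIDE might look like the following; the factors, weights, and thresholds are invented for illustration and are not the paper's:

```python
# Hypothetical agentic-suitability scoring: route each task to the lightest
# modality that can handle it.
def recommend_modality(dynamism: float, multi_step: float, self_reflection: float) -> str:
    """Each factor is a [0, 1] rating produced by task decomposition."""
    score = 0.5 * dynamism + 0.3 * multi_step + 0.2 * self_reflection
    if score < 0.33:
        return "direct LLM call"
    if score < 0.66:
        return "guided AI assistant"
    return "autonomous agent"

print(recommend_modality(dynamism=0.9, multi_step=0.8, self_reflection=0.6))
```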
Authors: Bing Zhang (IBM), Hima Patel (IBM), Chad Deluca (IBM), et al.
- Description:
Designing effective drug molecules is a multi-objective challenge that requires the simultaneous optimization of ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) properties. Existing generative frameworks face two major limitations: (1) a reliance on molecular descriptors that fail to capture pharmacologically meaningful endpoints, and (2) the use of linear reward scalarization, which collapses multiple objectives into a single score and obscures trade-offs. To address these challenges, we propose a Pareto-guided reinforcement learning framework for predictor-driven ADMET optimization (RL-Pareto). Our framework enables simultaneous optimization of multiple objectives and flexibly scales to user-defined objective sets without retraining. Predictor models trained on ADMET datasets provide direct feedback on drug-relevant properties, while Pareto dominance defines reward signals that preserve trade-off diversity during chemical space exploration. In benchmarking, our framework achieved a 99% success rate, with 100% validity, 87% uniqueness, and 100% novelty, alongside improved hypervolume coverage compared to strong baselines. These results demonstrate the potential of Pareto-based reinforcement learning to generate molecules that effectively balance competing properties while maintaining diversity and novelty.
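The Pareto-dominance test at the heart of such a reward signal can be sketched directly; the objective tuples below are invented ADMET-style scores, all maximized:

```python
# Pareto dominance and front extraction over multi-objective candidates.
def dominates(a, b):
    """a dominates b if it is no worse everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(candidates):
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates if other != c)]

# Tuples: (absorption, metabolic stability, 1 - toxicity), example values only.
print(pareto_front([(0.9, 0.4, 0.7), (0.6, 0.8, 0.6), (0.5, 0.3, 0.5)]))
```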
Authors: Hoang-my Nguyen (non-IBM), Nguyet Hang Vu (non-IBM), Hoang D. Nguyen (non-IBM), et al.
- Description:
Recent advances in large language models (LLMs) have enabled the development of intelligent agents with reasoning and planning capabilities. However, there are two key limitations: the lack of realistic domain-specific models that capture the causal system dynamics in which these agents operate, and the absence of representative simulation environments combining LLM agents with reinforcement learning (RL) for rigorous evaluation. The cloud autoscaling problem is a compelling use case for benchmarking AI systems: it admits a causal system model while requiring agents to solve a constrained optimisation problem, minimising resource costs while meeting strict service level objectives (SLOs), with minimal intervention and interpretable actions. We use these characteristics to develop a microservice simulation environment that models the causal relations between CPU usage, memory usage, resource limits, and latency in applications of any scale and topology. It can also introduce realistic system failures.
Our simulation engine gives agents the 'licence to scale' without doing any harm in real deployments. Furthermore, it provides a realistic and controlled environment for RL agents and is compatible with standard RL baselines. Our work provides a benchmark environment for integrating LLMs, agents, causal models, and RL for adaptive decision-making in dynamic, resource-constrained environments.
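A Gym-style interface for such an environment could look like the sketch below; the class, its toy causal latency model, and all constants are illustrative, not the benchmark's API:

```python
# Minimal autoscaling environment: the agent adjusts replicas, the simulator
# returns a reward trading off resource cost against SLO violations.
import random

class AutoscaleEnv:
    def __init__(self, slo_latency_ms=200.0, cost_per_replica=1.0):
        self.slo, self.cost = slo_latency_ms, cost_per_replica
        self.replicas = 1

    def step(self, action: int):
        self.replicas = max(1, self.replicas + action)    # scale down/hold/up
        load = random.uniform(50, 500)                    # requests/s
        latency = load / (2.0 * self.replicas)            # toy causal model
        reward = -self.cost * self.replicas - (10.0 if latency > self.slo else 0.0)
        return {"latency_ms": latency, "replicas": self.replicas}, reward

env = AutoscaleEnv()
obs, reward = env.step(+1)   # one scaling decision
```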
Authors: Adrian Selk (IBM), Jonas Wahl (non-IBM), Marco Ruffini (non-IBM), et al.
- Description:
Robust workflow composition is critical for effective agent performance, yet progress in Large Language Model (LLM) planning and reasoning is hindered by a scarcity of scalable evaluation data. This work introduces NL2Flow, a fully automated pipeline for generating and evaluating workflow planning problems. NL2Flow generates problems parametrically in a structured intermediate representation, translating them into both natural language and formal PDDL. I evaluate several open-source, instruct-tuned LLMs on a dataset of 2,296 low-difficulty problems generated by NL2Flow. Results show that the best-performing model achieved 86% success in generating valid plans and 69% in generating optimal plans (for solvable problems). Regression analysis shows that the influence of problem characteristics on plan generation is contingent on both model and prompt design. Importantly, translating natural language problems into a structured JSON representation prior to symbolic planning significantly improved success rates, suggesting a benefit from neuro-symbolic integration. As LLM reasoning scales to increasingly complex problems, understanding the shifting bottlenecks and sources of error within these systems will be crucial.
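The structured intermediate representation might take a JSON-like form such as the sketch below (field names are hypothetical); its value is that plan validity and optimality become mechanically checkable:

```python
# Illustrative structured representation of a workflow planning problem.
problem = {
    "goal": ["report_generated"],
    "initial_state": ["raw_data_available"],
    "actions": [
        {"name": "clean_data", "requires": ["raw_data_available"],
         "produces": ["clean_data_ready"]},
        {"name": "generate_report", "requires": ["clean_data_ready"],
         "produces": ["report_generated"]},
    ],
}
# A valid plan here is [clean_data, generate_report]; unlike free-form text,
# this can be verified (and scored for optimality) by a symbolic planner.
```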
Authors: Jung koo Kang (IBM)
- Description:
Large Language Model (LLM) agents are increasingly deployed for complex, multi-step software engineering (SWE) tasks. However, their trajectories often contain costly inefficiencies, such as redundant exploration, looping, and failure to terminate once a solution is reached. Prior work has largely treated these errors in a post-hoc manner, diagnosing failures only after execution. In this paper, we introduce SWE-PRM, an inference-time Process Reward Model (PRM) that intervenes during execution to detect and course-correct trajectory-level errors. Our PRM design leverages a taxonomy of common inefficiencies and delivers lightweight, interpretable feedback without modifying the underlying policy. On SWE-bench Verified, closed-source PRMs improve resolution from 40.0% to 50.6% (+10.6 p.p.), with the largest gains on medium and hard tasks. Among feedback strategies, taxonomy-guided PRMs outperform unguided and explicitly action-prescriptive variants, increasing success rate while reducing trajectory length. These benefits come at an acceptable added inference cost of as little as $0.20, making PRMs a practical and scalable mechanism for improving the reliability and efficiency of SWE agents.
Authors: Shubham Gandhi (IBM), Jason Tsay (IBM), et al.
- Description:
Reasoning models are increasingly used to perform complex tasks in open-ended environments. A key challenge facing such efforts is domain-specific tuning, which often requires large quantities of data and verifiable outcomes. We construct a high-performance agentic reasoning workflow for chemistry that is (a) verifiable and (b) extensible through the use of tools. We further show that distilling the outputs of the resulting workflow into smaller models yields lighter workflows that remain performant.
Authors: Gabrielle Gaudeau (non-IBM), Defne Circi (IBM), Ian Kennedy (non-IBM), et al.
- Description:
Foundation models (FMs) have transformed natural language processing (NLP), but their successes have not yet translated to the time series domain. Existing time series foundation models (TSFMs) struggle with generalization across varying context and target lengths, lack adaptability to different sampling rates, and are computationally inefficient. We introduce FlowState, a novel TSFM architecture that addresses these challenges through two key innovations: a state space model (SSM) based encoder and a functional basis decoder. This design enables continuous-time modeling, adjustment to various sampling rates, and flexible forecasting horizons without retraining. We further propose a parallel training strategy that enhances robustness and accelerates training. Despite being the smallest model, FlowState achieves state-of-the-art results on the GIFT-ZS and the Chronos-ZS benchmarks, while demonstrating superior adaptability to unseen sampling rates.
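The functional-basis-decoder idea can be illustrated with a toy harmonic basis: the model outputs coefficients once, and the forecast can then be evaluated on any time grid. All sizes below are arbitrary, and the basis is a stand-in, not FlowState's:

```python
# Forecast as a continuous function: coefficients times basis functions,
# evaluable at any sampling rate without retraining.
import numpy as np

def basis(t, n_harmonics=4):
    cols = [np.ones_like(t), t]
    for k in range(1, n_harmonics + 1):
        cols += [np.sin(2 * np.pi * k * t), np.cos(2 * np.pi * k * t)]
    return np.stack(cols, axis=-1)        # (len(t), 2 + 2 * n_harmonics)

coeffs = np.random.randn(10)              # stand-in for the encoder's output
forecast_hourly = basis(np.linspace(0, 1, 24)) @ coeffs
forecast_15min = basis(np.linspace(0, 1, 96)) @ coeffs   # same coefficients
```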
- Description:
Utilising pre-trained biomedical foundation models (BMFMs) for inference on multi-omic data from small cohorts represents a promising and practical route to demonstrating the advantage of this technology for real-world drug discovery tasks. Here, we show this via an innovative BMFM inference workflow in which BMFMs provide a discernible advantage for predicting patient drug response from omics data. We utilise open-source, fine-tuned, multi-omics BMFMs for inference to enable downstream feature selection and engineering: first, predicting drug-target binding affinity (BA) to rank and prioritise gene targets and associated SNPs; and second, using patient SNPs to mutate reference proteins and assess their impact on prednisolone BA. BMFM-derived features were composed and used alongside non-BMFM features to predict patient-specific prednisolone response using an explainable ML approach. We demonstrate the superior predictive power of BMFM-derived feature sets, and downstream explainability distinguished the SNPs most influential for personalised drug response prediction.
Authors: Stephen Checkley (non-IBM), Karen Bingham (non-IBM), Graeme Macluskie (non-IBM), David Bunton (non-IBM), et al.
Upcoming events
IBM at AGU 2025
- New Orleans, LA, USA
IBM at SEMICON Japan 2025
- Tokyo, Japan


