IBM at NeurIPS 2025

  • San Diego, California, US
This event has ended.

About

Neural Information Processing Systems (NeurIPS) is a leading machine learning and computational neuroscience conference. IBM Research is excited to return as a Platinum sponsor of NeurIPS this year. We invite all attendees to visit us during the event at booth number 1109, from Tuesday, December 2 through Friday, December 5.

We look forward to meeting you and telling you more about our latest work and career opportunities at IBM Research. At our booth we’ll be demoing projects on a broad range of AI topics such as foundation models, trustworthy AI, natural language processing and understanding, knowledge and reasoning, AI automation, human-centered AI, and federated learning.

Presentation times of conference workshops, demos, papers, and tutorials are listed in the agenda section at the bottom of this page. Note: All times are displayed in your local time.

Career opportunities

Visit the IBM booth to speak with IBM researchers and recruiters about future job opportunities and 2026 summer internships.

Agenda

  • Description:

    Abstract: Training large language models for specialized disciplines such as advanced mathematics, molecular biology, or legal reasoning is limited by the scarcity of large, high-quality, domain-specific corpora. Most publicly available datasets are dominated by general-purpose web text. When available, specialized data are fragmented across diverse sources such as preprints, conference papers, forums, lecture notes, and digitized books. No single source offers comprehensive real-world coverage across scientific domains. Consequently, scaling up authentic domain data remains a bottleneck: collecting a subset of relevant tokens often requires downloading and filtering hundreds of terabytes of raw web material, a process that is both time consuming and costly.

    We introduce Data Scout, a modular, LLM-powered pipeline that turns a high-level user intent (e.g., "I need data for advanced mathematics") into a vetted list of seed URLs in minutes. The system first expands the original intent using an LLM that generates a hierarchical subgraph of related concepts; this taxonomy drives a diversified set of search queries that systematically cover the target domain while respecting known licensing signals. Candidate URLs are then filtered by the same LLM using chain-of-thought prompting based on topical relevance, licensing clarity, and crawlability. Our results show that crawling the selected candidate URLs yields a high percentage of relevant pages (over 40%) related to the user's intended topic or query, compared to less than 1% in general web-scale corpora. Data Scout is available with both CLI and GUI front ends. By democratizing domain-specific data acquisition, Data Scout enables researchers without dedicated crawling infrastructure to bootstrap large, high-fidelity corpora, accelerating the development of specialized LLMs across a variety of niche domains.

    Speaker: Chirag Garg

    https://neurips.cc/virtual/2025/loc/san-diego/128656

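    To make the pipeline concrete, here is a minimal sketch of the flow described above: taxonomy expansion, query generation, and chain-of-thought URL filtering. The llm and search callables and the prompts are illustrative assumptions, not the actual Data Scout API.

    def expand_intent(llm, intent: str) -> list[str]:
        """Ask the LLM for subtopics covering the intent (the taxonomy step)."""
        reply = llm(f"List subtopics of '{intent}', one concept per line.")
        return [line.strip() for line in reply.splitlines() if line.strip()]

    def scout(llm, search, intent: str) -> list[str]:
        queries = [f"{c} dataset OR lecture notes"
                   for c in expand_intent(llm, intent)]
        seen, seeds = set(), []
        for query in queries:
            for url in search(query):  # any web-search backend
                if url in seen:
                    continue
                seen.add(url)
                # Chain-of-thought filter: relevance, licensing, crawlability.
                verdict = llm(
                    f"Think step by step: is {url} on-topic for '{intent}', "
                    "clearly licensed, and crawlable? End with YES or NO."
                )
                if verdict.strip().upper().endswith("YES"):
                    seeds.append(url)
        return seeds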
  • Description:

    Over the past sixty years, the field of planning has made significant contributions to both the theory and practice of building planning software that can solve previously unaddressed planning problems. This was done through established practices of rigorous design and evaluation of planning systems. The experience and expertise of the planning community are not just important from a historical perspective; the lessons learned could play a crucial role in accelerating the development of LLM-based planners. The purpose of this tutorial is to share this knowledge with the wider AI community, with the aim of incorporating the insights, tools, and data from the automated planning community into the design and evaluation of LLM-based planners. We believe that exposing the NeurIPS community to the theory and practices from the planning community will contribute greatly to the progress in building LLM-based planners and to planning in general.

    Website: https://planning-llm-era.github.io

    Authors:
    Christian Muise (non-IBM)
  • Description:

    The BeeAI Framework is an open-source project for building reliable AI agents that combine autonomy with control. Current agent frameworks focus primarily on prompting and orchestration, leaving critical questions of predictability and safety unaddressed. BeeAI fills this gap with a lightweight framework that enables developers to build agents whose reasoning abilities are preserved while execution is constrained by declarative, rule-based requirements. At the core of the framework is the RequirementAgent, a novel agent design that enforces deterministic, controlled behaviors across heterogeneous language models. With RequirementAgent, developers can ensure consistent and reliable execution patterns regardless of differences in model reasoning, tool-calling abilities, or stochastic variation. This approach provides practitioners with a unified abstraction layer that simplifies the deployment of complex AI systems into production settings. As an incubating Linux Foundation AI project, BeeAI is gaining adoption in open source and enterprise contexts as organizations seek robust ways to operationalize AI agents at scale. At NeurIPS EXPO, we will showcase BeeAI’s architecture, real-world use cases, and lessons learned from applying declarative control to agent autonomy.

    Speakers:
    Sandi Besen
    Artificial Intelligence Applied Research
    Neudesic, an IBM Company
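    As a rough illustration of the declarative-control idea, the sketch below lets a model plan freely while rule checks deterministically gate each step. The Rule type, max_calls helper, and plan_step callable are hypothetical stand-ins, not BeeAI's actual RequirementAgent API.

    from typing import Callable

    Rule = Callable[[str, dict], str | None]  # returns a violation message or None

    def max_calls(tool: str, limit: int) -> Rule:
        """Declarative requirement: a given tool may run at most `limit` times."""
        counts: dict[str, int] = {}
        def check(name: str, args: dict) -> str | None:
            counts[name] = counts.get(name, 0) + 1
            if name == tool and counts[name] > limit:
                return f"'{tool}' may run at most {limit} times"
            return None
        return check

    def run_agent(plan_step, rules: list[Rule], query: str) -> str:
        """The model proposes each step; rules veto steps that violate policy."""
        while True:
            step = plan_step(query)  # model returns a dict describing its next action
            if step["type"] == "final":
                return step["answer"]
            errors = [e for r in rules if (e := r(step["tool"], step["args"]))]
            if errors:
                query += f"\n[blocked: {'; '.join(errors)}]"  # feed back and retry
            else:
                query += f"\n[observation: {step['tool']} ran]"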
  • Description:

    Current algorithms for aligning LLM behavior are often implemented for narrow settings, making it difficult for researchers and developers to understand their effectiveness across model architectures, datasets, and tasks. To help provide a more informed and principled approach to steering model behavior, we present the AI Steerability 360 (AISteer360) and In-Context Explainability 360 (ICX360) toolkits. Participants will first be guided through a conceptual overview of how model behavior can be influenced across four model control surfaces: input (prompting), structural (weights/architecture), state (activations/attentions), and output (decoding). After the conceptual overview, we will guide attendees through how to apply some recently developed explainability tools (from ICX360) for understanding why models produce given, potentially undesirable, outputs and how this information is used to design targeted steering interventions (via AISteer360). Closing the loop, we will evaluate whether the baseline behavior (of the original, unsteered model) was successfully mitigated by the selected steering interventions and investigate whether steering introduced any unintended behavioral side effects. All of the experiments throughout the demonstration will be facilitated solely by the tools in the two toolkits, illustrating their power to design end-to-end steering workflows. Attendees will come away with a practical understanding of how to apply these toolkits to their own alignment challenges.

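    A hypothetical outline of the explain-then-steer loop the demo walks through; the explain, apply_steering, and score callables are stand-ins, not the actual ICX360/AISteer360 APIs.

    def steering_workflow(model, prompt, holdout_prompts,
                          explain, apply_steering, score):
        baseline = model(prompt)
        # 1. Explainability (ICX360-style): why did the model produce this output?
        attribution = explain(model, prompt, baseline)
        # 2. Steering (AISteer360-style): intervene on one of the four control
        #    surfaces: input, structural, state, or output.
        steered_model = apply_steering(model, attribution, surface="state")
        # 3. Close the loop: was the behavior mitigated, and at what side cost?
        mitigated = score(steered_model(prompt)) > score(baseline)
        regressions = [p for p in holdout_prompts
                       if score(steered_model(p)) < score(model(p))]
        return mitigated, regressions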
  • Description:

    Modern incident root-cause analysis (RCA) is constrained by partial observability, symptom-centric signals, and the overwhelming noise present in logs, traces, and metrics. Diagnosing production failures often depends on instrumentation quality and human expertise, while latent software defects, configuration errors, and zero-day failure modes remain difficult to pinpoint. To address these challenges, we demonstrate a multi-agent system for incident diagnostics that augments observability data with application source code and static analysis signals.

    Our system introduces two cooperating agents: the Code Context Agent (COCOA), which builds a knowledge graph of program dependencies, control/data flows, and caller–callee relationships; and the Incident Diagnostics Agent (IDA), which performs agentic reasoning over an entity topology graph enriched with observability streams. Together, these agents extend topology-aware planning (TAP) to simultaneously operate on program dependency graphs and infrastructure entity graphs, thereby linking runtime symptoms with underlying code-level causes.

    This demo showcases how multi-agent collaboration enables deeper, context-sensitive RCA. We walk through real-world inspired scenarios—including incidents where critical log lines are hidden in noisy observability streams or where latent defects emerge only after system updates—illustrating how the system surfaces root causes that would otherwise remain invisible. By bridging program analysis with runtime observability, our approach moves beyond symptom-driven diagnostics toward a more reliable, automated framework for incident management.

    Speakers:
    Ramesh Kumar Kottapalli
    IBM
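    A toy sketch of the code-to-topology linkage described above, with networkx standing in for the real COCOA knowledge graph and IDA entity topology.

    import networkx as nx

    # COCOA-style program graph: edges are caller -> callee dependencies.
    code = nx.DiGraph()
    code.add_edges_from([
        ("api.checkout", "billing.charge"),
        ("billing.charge", "db.write_txn"),
        ("api.checkout", "inventory.reserve"),
    ])

    # IDA-style observation: functions whose log lines show error symptoms.
    symptoms = {"db.write_txn"}

    def root_cause_candidates(code, symptoms):
        """Walk upstream from symptomatic nodes: every caller on a path into a
        failing function is a candidate code-level cause."""
        candidates = set()
        for s in symptoms:
            candidates |= nx.ancestors(code, s) | {s}
        return candidates

    print(root_cause_candidates(code, symptoms))
    # e.g. {'api.checkout', 'billing.charge', 'db.write_txn'}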
  • Description:

    Visit us at the IBM booth in the exhibit hall to talk to our researchers and recruiters. We'll also be doing demos of our work.

  • Description:

    Abstract: This hands-on workshop introduces a proposal that treats LLMs as computing elements governed by established software development principles—particularly task decomposition and modularization—at both the programming model (Mellea) and model level (LLM intrinsics).

    LLM outputs are often unpredictable and incorrect. Agentic frameworks and prompt optimization libraries attempt to manage this by giving control to the LLM, but this leads to systems that are hard to debug, maintain, and scale. Mellea offers an alternative: a programming model that restores developer control through modular design, information hiding, and compositional contracts. This enables predictable fault models, better portability, and lower inference costs. Attendees will gain hands-on experience building applications using the Melleaic approach.

    Extending these principles to the model level, the workshop introduces a modularization framework for LLMs using activated LoRAs. These produce components—LLM intrinsics—that match fine-tuned model accuracy for specific tasks but with significantly lower inference costs and latency, thanks to KV cache reuse. Participants will build applications using a pre-built library of RAG LLM intrinsics and learn how to train their own.

    Presented by the creators of Mellea and the inventors of LLM intrinsics and aLoRA, this workshop equips attendees with foundational skills for scalable model/application co-design.

    Speakers: Nathan Fulton, Hendrik Strobelt

    https://neurips.cc/virtual/2025/loc/san-diego/128678

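    A short example in the spirit of Mellea's published quickstart; treat the exact keyword arguments and the default model backend as version-dependent assumptions.

    import mellea

    m = mellea.start_session()  # default backend (a local Granite model in recent releases)
    summary = m.instruct(
        "Summarize this incident report in one sentence: {{report}}",
        requirements=["mention the affected service", "under 25 words"],
        user_variables={"report": "Checkout latency spiked after the 2.3 deploy."},
    )
    print(summary)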
  • Description:

    The rapid rise of autonomous AI agents across enterprises is creating a new class of security and governance challenges that are not adequately addressed with today’s technology. Context Forge MCP Gateway is an open-source, security-focused middleware that provides fine-grained control and extensibility for agent operations. With over 2.6k GitHub stars and a rapidly growing user community, Context Forge addresses emerging threat classes including prompt injection, data leakage, and misuse of sensitive resources. At its core, Context Forge introduces a plugin architecture modeled after Linux Security Modules, embedding reusable security hooks at critical points in agent execution (e.g., prompt handling, tool invocation, data transformation). This modular foundation enables organizations to enforce contextual policies at scale—ranging from PII redaction and provenance tagging to prompt injection detection and policy-based access control. With 39 plugins already available, Context Forge is establishing a standards-aligned ecosystem for securing agent workflows in real-world enterprise deployments. By blending research-driven design with open-source adoption, it creates a practical path for organizations to advance agent trustworthiness, safety, and compliance.

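    A hypothetical plugin shape illustrating the Linux-Security-Modules-style hooks described above; the class and method names are illustrative, not Context Forge's actual plugin API.

    import re

    class PIIRedactionPlugin:
        """Prompt-handling hook: scrub emails before the model sees them."""
        EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

        def on_prompt(self, prompt: str) -> str:
            return self.EMAIL.sub("[REDACTED]", prompt)

        def on_tool_invoke(self, tool: str, args: dict) -> dict:
            # Policy-based access control: deny sensitive tools outright.
            if tool == "delete_records":
                raise PermissionError(f"tool '{tool}' denied by policy")
            return args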
  • Description:

    Modern enterprises depend on efficient data engineering pipelines to unlock value from diverse and large-scale datasets. Yet, current processes for workflow design, schema ingestion, and data quality validation remain complex, error-prone, and dependent on technical expertise. This creates barriers for non-expert users, slows down development, and introduces risks of data inconsistency.

    We present a suite of LLM-powered frameworks that reimagine enterprise data engineering across three critical dimensions: (i) From Natural Language to Executable ETL Flows, enabling intuitive pipeline creation with natural language specifications and automatic operator/property inference, (ii) All You Can Ingest, an end-to-end schema mapping and transformation framework that unifies semantic alignment, code synthesis, and robust validation, and (iii) Quality Assessment of Tabular Data, a scalable approach for auto-generating interpretable quality rules and executable validators tailored to specific datasets.

    Together, these innovations demonstrate how Large Language Models (LLMs), augmented with retrieval, code synthesis, reasoning, and guardrails, can transform the data engineering lifecycle into a more accessible, adaptive, and trustworthy process, reducing manual effort, accelerating time-to-value, and ensuring data fidelity at enterprise scale.

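    A minimal sketch of "interpretable quality rules as executable validators"; in the system described above, an LLM would propose rules like these from the dataset itself.

    import pandas as pd

    df = pd.DataFrame({"order_id": [1, 2, 2], "amount": [9.99, -5.0, 12.5]})

    # Each rule pairs an interpretable name with a vectorized check.
    rules = {
        "order_id is unique": lambda d: ~d["order_id"].duplicated(keep=False),
        "amount is non-negative": lambda d: d["amount"] >= 0,
    }

    for name, check in rules.items():
        bad = df[~check(df)]
        print(f"{name}: {'OK' if bad.empty else f'{len(bad)} violating rows'}")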
  • Description:

    Mellea is a generative AI library for writing robust application software. LLM outputs are intrinsically unpredictable and often wrong. Agentic frameworks and prompt optimization libraries address this unpredictability by putting the LLM in control, leading to systems that are often difficult to debug, maintain, evolve, and port. Mellea puts developers back in control by defining a programming model that encourages task decomposition, information hiding, and compositional contracts between isolated modules. Mellea’s programming model results in systems with manageable fault models, better portability, and lower inference-time costs.

    https://neurips.cc/virtual/2025/loc/san-diego/talk/127761

    Speakers:
    David Cox
    VP, AI Models at IBM Research; IBM Director, MIT-IBM Watson AI Lab
    IBM Research
  • Description:

    Long-range intra-molecular interactions are not well represented by existing molecular descriptors, which limits the accuracy of machine learning models for molecular property prediction. We introduce TDiMS, a descriptor that encodes topological distances between substructure pairs, enabling explicit representation of long-range effects while retaining chemical meaning. Applied to molecular datasets, TDiMS shows particular advantages for larger molecules, where long-range interactions strongly influence target properties. We further demonstrate that choosing appropriate substructure definitions, such as tailored fragments, enhances predictive performance. Beyond accuracy, TDiMS provides interpretable features essential for material discovery, offering insights into structural motifs driving predictions. These results highlight distance-based, interpretable descriptors as a promising route for machine learning in materials discovery.

    Authors:
    Akihiro Kishimoto (IBM)
    Kohei Miyaguchi (IBM)
    Masataka Hirose (non-IBM)
    Junta Fuchiwaki (non-IBM)
    +3 more
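    As a rough illustration of the descriptor idea, the sketch below computes topological (bond-path) distances between substructure pairs with RDKit; the molecule and fragment choices are arbitrary examples, not the settings used in the paper.

    from rdkit import Chem

    mol = Chem.MolFromSmiles("Oc1ccc(cc1)CCCCC(=O)O")  # phenol plus an acid tail
    dist = Chem.GetDistanceMatrix(mol)  # pairwise bond-count distances

    phenol_oh = Chem.MolFromSmarts("[OX2H]c")  # hydroxyl on an aromatic carbon
    carbonyl = Chem.MolFromSmarts("C=O")

    # Shortest topological distance between any atom of one fragment and any
    # atom of the other: one long-range substructure-pair feature.
    pairs = [(a, b)
             for (a, *_) in mol.GetSubstructMatches(phenol_oh)
             for (b, *_) in mol.GetSubstructMatches(carbonyl)]
    print(min(dist[a][b] for a, b in pairs))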
