AI Engineer
$ curl ai.engineer/wf/llms-full.md
View
Format

AI Engineer World's Fair 2026 — Full Details

The largest technical AI conference in the world, with 29 tracks, 300 speakers, 100 expo partners, 6,000+ AI Engineers, founders, and VPs of AI. This is the full machine-readable dump: every announced session (with abstracts) plus all confirmed speakers.

Note: the schedule is in-progress. Many sessions are tentative and titles marked "TBA" are still being confirmed.


Full Schedule

Day 1 — Workshop Day

9:00am-11:00am: From Vibes to Production: Evaluating and Shipping AI Agents That Work 101 — Laurie Voss

(sponsor) [Track 1] | Track: Track 1

Building an AI demo is easy. Knowing whether it actually works — and keeping it working in production — is the hard part. Most teams ship agents on vibes: they try a few prompts, the output looks good, and they push to production with no real way to measure quality or catch regressions.

This hands-on workshop walks through the full lifecycle of shipping a real AI agent, using a working financial-analyst agent built on the Claude Agent SDK as the running example. You'll instrument it with tracing, do structured error analysis on its actual outputs, and build a layered evaluation suite — from cheap deterministic code checks to LLM-as-a-judge evaluators with custom rubrics. We'll cover the parts most tutorials skip: why agents fail in ways single LLM calls don't, the eval anti-patterns that quietly mislead you, and how to know whether you can even trust your judge (meta-evaluation). Finally, we'll close the loop: turning eval results into datasets and experiments, running evals online against production traffic, wiring them to monitors and alerts, and feeding failure explanations back to a coding agent to actually fix the underlying problems.

You'll leave with a runnable notebook and a repeatable, evaluation-driven workflow you can apply to your own agents the next day.

9:00am-11:00am: AI on Your Lakehouse: Context Comes in Shapes, Not Queries — Zach Blumenfeld

(sponsor) [Track 2] | Track: Track 2

Your agent can reach your data but still can't use it reliably: vector search and Text2SQL each hand it a slice, but not the view to know what's truly relevant and how to connect the right info. Without that, answers come back confident but wrong, and agent decisions cannot be trusted. The problem isn't caused by a bad model or bad query, but rather a lack of context, and thinking in terms of shapes is what cracks it.

In this hands-on session, you'll learn how to build three reusable graph shapes from your lakehouse data using Neo4j, so your agent can navigate and view the right context to answer and act accurately:

  • Table of Contents (Trees) — navigate what's there
  • Themes (Communities) — surface patterns nobody named
  • Connections (Paths & Cycles) — trace how entities, documents, and records relate

Portable to BigQuery, Databricks, Snowflake, or anywhere. You'll leave with real, practical techniques and the code to run with your own data and agents.

9:00am-11:00am: Cooking with Codex — Charlie Guo, Gabriel Chua

(sponsor) [Track 3] | Track: Workshops Day 1

Codex is changing how technical teams ship across the software development lifecycle, from feature implementation to code review and automation. But the real unlock comes when these practices move beyond a single workflow and become shared systems a team can trust.

In this hands-on session, you'll use Codex across real development and knowledge-work scenarios: structuring tasks, supervising agentic work, coordinating subagents, using plugins and MCPs, and combining Codex with OpenAI's frontier reasoning, coding, and multimodal models.

Bring your laptops and leave with reusable demos and a set of Codex recipes your team can adapt.

9:00am-11:00am: The best SDLC is the one you build yourself: Why orchestration changes everything — Shane Wolf, Andrei Bocan

(sponsor) [Track 4] | Track: Workshops Day 1

Industry research shows AI productivity gains have plateaued at 10–15% — because today's tools only optimize the 20% of a developer's day spent writing code. The real bottlenecks are left and right of code: planning, orchestration, review, and operations. We'll also explore the value of AI-powered code reviews - from establishing code standards that AI can seamlessly enforce, to triggering agentic pipelines that autonomously fix issues. Join Atlassian's Shane Wolf and Andrei Bocan for a hands-on deep dive into the AI-native SDLC. In this workshop, we'll move past single-player copilots and show you how Atlassian is turning Jira into an AI-native orchestration layer for the entire software development lifecycle. Then, we'll go further. You'll learn how to build custom automations that chain these capabilities together, transforming your Jira board into an agentic software factory where humans set intent and agents execute.

9:00am-11:00am: AI Security Engineer Foundations + Certificate — Javier Garza

(sponsor) [Track 5] | Track: Workshops Day 1

In each of the two sessions, we cover 6 modules and participants receive a certificate of completion at the end. The modules are: OWASP Top 10 for LLM, Addressing Shadow AI, AI Threat Modeling, Securing Agents & MCP, Securing Vibe Coding, & AI Red Teaming

9:00am-11:00am: Total Recall: Agent Memory and Harness Engineering — Ignacio Martinez

(sponsor) [Track 6] | Track: Workshops Day 1

In this hands-on workshop you'll build a working autonomous agent from the harness up, in a notebook, then see it live in a full working web application and leave with one that can write and run its own automations. You'll implement every surface area yourself: a set of predefined tools, persistent memory through the Oracle AI Agent Memory package, orchestration with LangChain and LangGraph, and LLM access through OCI GenAI Service, composing the full set of Oracle primitives into one harness you understand end to end.

Most teams assemble that harness from a dozen disconnected services: one store for vectors, another for state, a separate reranker, a bolt-on memory layer. We take the opposite approach, on a single unified memory core. The organizing principle is optionality by default: you shouldn't have to choose your memory substrate up front. With Oracle AI Database you get file system and database memory in one place, embedding models and rerankers running inside the database kernel, and every retrieval strategy an AI workload needs without leaving the core.

And consolidating onto one core is what keeps the whole thing tractable. You know the drill: a production harness has you holding all those moving parts in your head at once, and most of your attention goes to keeping them in sync rather than improving the agent. Pull that sprawl into a single core and the cognitive load drops. You get to think about what the agent does, not where its state lives. That's the difference between controlling your harness and renting its pieces.

9:00am-11:00am: Agents That Own Their Inference: Building Production AI Agents on Dedicated GPUs — Du'an Lightfoot

(sponsor) [Track 7]

Every production agent today is renting its intelligence. You're paying per token, sending your customer's data to someone else's servers, and hoping the provider doesn't rate-limit you during your launch. For most teams, that's fine. But for a growing number of teams in regulated industries, with high-volume products, latency-sensitive workloads, or rising token bills, it's starting to look like a liability.

In this 120-minute hands-on workshop you'll get a dedicated GPU and build an agent that runs on infrastructure you control. You'll stand up vLLM, point your agent at it, and drive concurrent load through the stack until you can see batching, KV cache pressure, and throughput limits in the metrics. Then you'll optimize the deployment to improve throughput while keeping per-request latency in line.

The focus isn't agent frameworks. It's the inference layer underneath them. You'll leave with working code and a real understanding of continuous batching under real concurrency, KV cache tradeoffs, vLLM's metrics, and the bottlenecks that only show up when you operate the inference server yourself.

9:00am-11:00am: Open-Source Inference Engineering for the Agentic Era — Zain Hasan, Yubo Wang, Qingyang Wu, Jue Wang

(sponsor) [Track 8] | Track: Workshops Day 1

Agentic coding workloads demand long contexts, multi-turn conversations, and throughput at a scale that most inference engines weren't built for. TokenSpeed is a new open-source engine purpose-built for this regime, built collaboratively by NVIDIA DevTech, AMD Triton, Qwen Inference, Together AI, and others. In this 2-hour hands-on workshop, Together Inference Research Engineers and a TokenSpeed co-creator will cover TokenSpeed architecture, deploying your first model, optimizing for agentic workloads, kernel and hardware tuning, and throughput/latency trade-offs.

9:00am-11:00am: Advanced workshop: Mastering AI Observability — Doug Guthrie

(session) [Track 9]

Your AI is in production, but is it actually good? In this hands-on workshop, you'll learn how to uncover patterns in your production traces using Braintrust Topics, build custom scorers to target real issues, and systematically improve your agent. By the end, you'll have a repeatable eval workflow and trace-backed evidence that your AI is actually doing what you think it is.

9:00am-10:15am: Get Started with Models in Microsoft Foundry to Build AI Apps — Pamela Fox

(sponsor) [Track M] | Track: Track M

In this hands-on lab, you will build a production-ready AI application using Microsoft Foundry, with no fine-tuning or deep machine learning expertise required. You will discover and select models, provision a Foundry project, and connect to a hosted model using the OpenAI SDK. You’ll implement a comment moderation workflow, compare model outputs, and package the solution as a hosted agent using Python, ready for real-world integration.

11:05am-12:05pm: Building self-learning loops for your agent — Fuad Ali

(sponsor) [Track 1] | Track: Posttraining & Midtraining

Building an AI demo is easy. Knowing whether it actually works — and keeping it working in production — is the hard part. Most teams ship agents on vibes: they try a few prompts, the output looks good, and they push to production with no real way to measure quality or catch regressions.

This hands-on workshop walks through the full lifecycle of shipping a real AI agent, using a working financial-analyst agent built on the Claude Agent SDK as the running example. You'll instrument it with tracing, do structured error analysis on its actual outputs, and build a layered evaluation suite — from cheap deterministic code checks to LLM-as-a-judge evaluators with custom rubrics. We'll cover the parts most tutorials skip: why agents fail in ways single LLM calls don't, the eval anti-patterns that quietly mislead you, and how to know whether you can even trust your judge (meta-evaluation). Finally, we'll close the loop: turning eval results into datasets and experiments, running evals online against production traffic, wiring them to monitors and alerts, and feeding failure explanations back to a coding agent to actually fix the underlying problems.

You'll leave with a runnable notebook and a repeatable, evaluation-driven workflow you can apply to your own agents the next day.

11:05am-12:05pm: RAG Needs a Map: Using GraphRAG to Retrieve Connected Context — Nyah Macklin

(sponsor) [Track 2] | Track: Track 2

Vector search is good at finding similar text, but real answers often depend on how facts, entities, and documents connect. In this hands-on workshop, you’ll build a GraphRAG workflow that uses relationships to retrieve connected context for more grounded AI responses.

11:05am-12:05pm: How I learned to stop worrying and love the sandbox — Matt Brockman

(workshop) [Track 3] | Track: Workshops Day 1

Running sandboxes at scale can get painful. How do you manage a thousand concurrent sandboxes? We'll cover burst traffic, fast sandbox creation under load, resource exhaustion, shared state with volumes, and per-user data isolation. Then you'll trigger each failure, implement fixes, and see the cost impact in real time. You'll leave with hands-on experience debugging sandbox failures and a set of observability and scaling patterns you can start implementing.

11:05am-12:05pm: The model swap workshop — Pamela Fox, Arun Sekhar

(workshop) [Track 4] | Track: Workshops Day 1

Frontier labs are releasing new models constantly, and it is hard to know when “better” is better enough to justify touching a working system. On top of that, “just swap the model” often turns into real work because providers expose different APIs and different expectations around tools and structured outputs. The model swap workshop is a hands-on bake-off across frontier LLMs. We will run the same scenarios using multiple models (OpenAI, Anthropic, Kimi, and more) and compare results side by side for agentic tool use, structured outputs, and multimodal tasks. Swapping models is not just changing a model name. In this workshop, you will actually do the swaps, including moving between OpenAI-style Responses APIs and Anthropic-style Messages APIs, then see what breaks and what needs to change in your prompts, tool definitions, and JSON strategies. We will finish by running a small eval suite so you can quantify tradeoffs instead of relying on vibes. We will provide the Microsoft Foundry environment for access to the models, no account needed.

11:05am-12:05pm: Teaching Agents to Search: Building Synthetic Training Pipelines with NVIDIA Data Designer — Dhruv Nathawani

(workshop) [Track 5] | Track: Workshops Day 1

Modern agentic systems often fail because the right training data simply does not exist. Search agents are a perfect example: if you want a model to browse the web effectively, you need high-quality multi-step trajectories that teach it how to search, refine queries, inspect sources, and recover from dead ends. Those datasets are rarely available off the shelf. In this hands-on workshop, we will show how NVIDIA used Data Designer to build synthetic supervised fine-tuning data for search-capable Nemotron models. Participants will learn how to translate a target capability into a scalable data generation pipeline: defining task structure, generating strong seed examples, producing realistic search trajectories, filtering low-quality generations, and converting traces into training-ready records. Using a real search-agent use case, we will walk through the design decisions behind teaching Nemotron Super to browse the web, including how to create BrowseComp-style tasks, generate tool-use rollouts, and manage the tradeoffs between diversity, correctness, and yield. We will also cover the practical realities of production synthetic data workflows, including validation, dataset curation, and where most pipelines break down. But the goal of this workshop goes beyond search. Participants will leave with a reusable framework for designing any dataset they wish they already had: starting from the behavior they want to teach, mapping that behavior into a data schema, generating examples at scale, and iterating until the dataset is useful for training. By the end of the session, attendees will not only know how to build synthetic data for search agents, but how to design custom datasets for specialized behaviors across reasoning, tool use, and domain-specific applications. Attendees will leave with a practical methodology for synthetic data design, plus hands-on familiarity with NVIDIA Data Designer as an open-source system for rapid experimentation.

11:05am-12:05pm: Local LLMs and workstation agents: Part 1 — Ahmad Osman

(workshop) [Track 6] | Track: Workshops Day 1

Have you heard "Buy a GPU," "Opensource AI Must Win," or "Local AI FTW" before? This workshop will be a practical window into that confusing world and a practical map for understanding what different Local AI hardware is actually capable of and which models make sense on each class of machine.

Whether you are just getting started or already running models every day, we will demo and work through why a Mac mini, M4 Pro MacBook Pro, M5 Max MacBook Pro, RTX 5070 8GB laptop, Strix Halo box, DGX Spark, and 2x RTX PRO 6000 Blackwell machine should not be configured, benchmarked, or used the same way.

What are you trying to run? How much VRAM or Unified Memory do you actually need? When does a small machine make sense? When do you need a real GPU box? When does long context, tensor parallelism, or serving infrastructure start to matter?

This should be useful to everyone: people curious about local AI, people buying their first capable machine, people already running models, and people trying to use local inference for scalable agentic workflows.

We will close by showing how Codex can automate the boring part: give it my Inference Engine article, the hardware target, and the model of your choice, then ask it to propose the engine, environment, flags, batch settings, KV-cache settings, and benchmark and evaluation plan.

11:05am-12:05pm: How to Build Quality Gates into Agentic Coding Workflows — Nnenna Ndukwe

(workshop) [Track 7] | Track: Workshops Day 1

AI coding agents can now generate code at unprecedented speed. But faster code generation creates a new engineering problem: how do we know when agent-written code is actually safe, maintainable, and ready to merge? In this hands-on workshop, attendees will build an agentic coding workflow with enforceable code quality gates across planning, implementation, testing, and code review. By the end of the session, participants will have a working reference pattern for agentic software delivery: an AI-assisted workflow that can inspect a repo, implement a change, run tests, evaluate risk, respond to feedback, and surface what still requires human judgment. This is a technical enablement session for engineers building with AI coding agents, platform teams designing agentic SDLC workflows, and AI engineering leaders thinking about how to scale software quality with AI.

11:05am-12:05pm: What is an Inference Engine, Anyway? — Charles Frye

(workshop) [Track 8] | Track: Workshops Day 1

To run state-of-the-art inference yourself, you must master the inference engine: vLLM, SGLang, TRT-LLM, or your own jawn. The inference engine manages the lifecycle of an inference request, from input to output. In this workshop, we'll examine the architecture of modern high performance inference engines, the key techniques that inference engines need to deliver that performance, and the traces and metrics that inference engines emit.

11:05am-12:05pm: Agent Speedrun: Idea → Code → Deploy → Observe, Fix → Ship — Elizabeth Fuentes Leone, Sandhya Subramani

(session) [Track 9] | Track: Workshops Day 1

One agent. Fully deployed to production before the workshop ends. We'll take you from a blank file to a running production agent using Amazon Bedrock AgentCore and Strands Agents, covering the full lifecycle: ideation, coding the agent loop, deploying to serverless infrastructure, wiring up observability, breaking it intentionally, fixing it with tracing data, and shipping the final version. Bring your laptop and leave with a deployed agent.

11:05am-12:05pm: From zero to deployed on Azure with AI agents — Gustavo Cordido

(sponsor) [Track M] | Track: Track M

What happens when you let AI agents do the building? In this hands-on lab, you'll go from an empty terminal to a deployed app on Azure — with GitHub Copilot CLI and coding agents handling the scaffolding, coding, debugging, and deployment. You'll use the new Azure skills to provision resources and wire up services through natural language, no portal required. This isn't a demo you watch. You'll walk out with a real, working dev workflow you can take straight to your next project.

12:10pm-1:10pm: Evals in AI: A Deep Dive — Tejas Kumar

(workshop) [Track 1] | Track: Workshops Day 1

“Our evals pass and our velocity is up, so it works.” It’s the most reassuring sentence in AI engineering and also the most dangerous. Teams are shipping more code than ever while incidents per PR and change-failure rates climb, and the instruments meant to catch this are quietly broken. This talk takes apart both halves of that false comfort. First, why velocity lies: the same AI-driven throughput that lights up your dashboard is what’s eroding quality underneath it. Then we explore four ways offline evals deceive you: LLM-as-judge bias (your grader rewards confident, wordy, wrong answers over terse correct ones), staleness, distribution shift between your golden set and real traffic, and single-score evals that hide which step of an agent actually failed. The centerpiece is a live demo. We’ll wire up an LLM judge on stage and watch it crown a confident, friendly, factually wrong answer. Then we’ll fix it live on stage with a three-line rubric change. Same model, different instrument. From there we’ll build up what to measure instead: traces and spans, production observability, probe-based evaluation, error budgets, and quality leading indicators that sit beside every velocity number. Attendees will leave with a five-line checklist they can apply Monday. No prior eval tooling required. If you’ve ever shipped something agentic and had a nagging feeling the dashboards were too kind, this is for you.

12:10pm-1:10pm: From approval loops to autonomous agents with Docker — John Craft, Dan Ndombe

(workshop) [Track 2] | Track: Workshops Day 1

"You've invested in the best models, coding agents, and AI tooling. Now comes the hard part: unlocking autonomous development without creating security headaches, governance gaps, or endless approval loops.

In this 90-minute hands-on workshop, you'll learn how to run coding agents in isolated environments built for autonomous work, create a 'golden path' for AI-assisted development across your organization, reduce software supply chain risk with secure, hardened containers, manage multiple agents with the right permissions and guardrails, and scale AI-powered development without slowing developers down."

12:10pm-1:10pm: 2 hr deep dive on LLM Inference at Scale — Part 1 of 2 — Harshul Jain, Tanmay Sah

(workshop) [Track 3] | Track: Workshops Day 1

Most engineers using LLMs can call an API. Far fewer can explain why their model is slow, why it's running out of memory, or how the inference engines powering every major LLM API actually work. This workshop walks through the full inference stack — from how a transformer generates a single token to serving billions of tokens a day with vLLM, SGLang, TensorRT-LLM, Ray, and KServe/llm-d. 60% explanation with live demos, 40% hands-on exercises. Attendees leave with a running vLLM server they benchmarked themselves. Based on the open-source practitioners handbook being built live at github.com/harshuljain13/llm-inference-at-scale

(NOTE: this is a 2 hour workshop that happens over lunch break - you should try to have lunch before or after if attending)

compute kindly sponsored by Coreweave/Marimo!

12:10pm-1:10pm: Build the Right Thing: Product Engineering for Software Developers (Part 1) — Kent C. Dodds

(workshop) [Track 4] | Track: Workshops Day 1

There is nothing quite as demoralizing as finishing a feature and realizing you built the wrong thing. The code is clean. The tests pass. The ticket is closed. And none of it matters. This is happening more often, not less. AI makes it faster and cheaper to implement, which means teams can now waste entire sprints on the wrong idea at unprecedented speed. The bottleneck is no longer "can we build it?" It is "should we build it?" and "are we sure we understand the problem?" This session is a condensed introduction to product engineering for builders: the skills that sit upstream and downstream of implementation. We will not try to cover everything a full-day workshop would. Instead, we will focus on the highest-leverage ideas you can apply on Monday. ### What we'll cover 1. Validate before you build Most wrong builds start with an idea that was never tested. You will learn to separate real user pain from solution-shaped requests, and practice discovery questions that surface past behavior instead of hypothetical enthusiasm. 2. Prioritize what deserves to exist Not every good idea should be built now. Especially in the AI era, "we could build this" is not a reason to build it. We will work through a practical prioritization lens, including the Kano model, to help you distinguish fundamentals from delighters from distractions before your team commits. 3. Own the feature, not just the PR Product engineering does not end at merge. You will leave with a clearer picture of end-to-end feature ownership: staying close to users, setting up simple feedback loops, and improving what you shipped instead of moving on to the next ticket. ### Format This is a 2–3 hour session with Kent C. Dodds. Expect focused teaching, real-world examples, and short interactive exercises and discussion. This is not a full simulation lab or a ticket-closing coding workshop. It is judgment practice for engineers who already know how to ship. ### Who this is for Software engineers (and technical builders generally) who: - Have shipped something polished that nobody wanted - Feel pressure to move fast with AI and want a better filter for what deserves to exist - Want stronger product instincts without becoming a PM - Care about owning outcomes, not just closing tasks Some software engineering experience is assumed. No particular stack is required. PMs and designers often find this valuable too. ### What you'll leave with - Discovery questions for ambiguous work - A prioritization lens you can use before committing to a build - A clearer model for feature ownership and post-ship feedback loops - Language for stakeholder conversations when requirements are unclear

12:10pm-1:10pm: From Zero to Leaderboard: Building an End-to-End AI Agent Evaluation Pipeline — Wolfram Ravenwolf

(workshop) [Track 5] | Track: Workshops Day 1

Running one agent eval is easy. Running hundreds — with controlled timeouts, replicated configs, and automated collection across distributed VMs — requires infrastructure that most teams end up building from scratch. In this workshop, we shortcut that process and build a rigorous evaluation pipeline end-to-end. Participants will set up and connect the full evaluation stack: Layer 1 — The Benchmark Runner. Configure Harbor to orchestrate parallel agent evaluations on Terminal-Bench 2.0, with W&B Sandboxes providing isolated environments for each task. Layer 2 — The Collection Pipeline. Use WolfBench to scan distributed VMs for results, deduplicate across runs, download trajectories, and build a local results archive that survives VM teardown. Layer 3 — The Analysis Framework. Compute the five-metric framework (Ceiling / Best / Average / Worst / Solid) across replicated runs. Learn to read the spread: when is a model "better"? When is a score difference just noise? Layer 4 — The Observability Layer. Upload full agent conversation traces to W&B Weave for per-turn inspection. See exactly where an agent goes wrong — the command it ran, the output it misread, the moment it started looping. Layer 5 — The Leaderboard. Generate interactive HTML charts that show the full performance distribution, not a single bar. We'll work with real data from hundreds of production runs, and participants will leave with a working pipeline they can adapt to their own agents and benchmarks. Laptops required; all tools are open-source.

12:10pm-1:10pm: Local LLMs and workstation agents: Part 2 — Ahmad Osman

(workshop) [Track 6] | Track: Workshops Day 1

From the guy who said "Buy a GPU," "Opensource AI Must Win," and "Local AI FTW": this session shows what you build around the models running locally so agents can actually be effective and efficient when using local models.

A local chatbot gives you private text generation. A useful agent needs a system around it: search, scraping, traces, document ingestion, agentic harness integration, and other practical components. The focus of this workshop is setup, not hardware. We will walk through the practical pieces that turn local inference from a model endpoint into the reasoning layer inside a real workflow.

The live demo target will be a 2x RTX PRO 6000 Blackwell machine running models locally and using it across different agentic harnesses. The goal is to show how Local AI can be more than private and offline: it can be useful, inspectable, controllable, and built into infrastructure you actually own.

Attendees should leave with a practical mental model for building Local AI systems that can read, search, cite, act, and evaluate themselves.

12:10pm-1:10pm: Beyond RAG: Build a Relational Context Engine from Scratch — Peter Werry

(workshop) [Track 7] | Track: Workshops Day 1

In this workshop we'll explore the importance of context engines in modern engineering workflows, and we'll look at why traditional RAG techniques are no longer enough to deliver the context agents need.

We'll build a structured query engine that fills the gaps left by RAG, translating natural language into validated database queries over GitHub PR and Issue data. We'll implement schema-aware prompting, identity resolution, query validation, and error-driven retry loops, and you'll walk away with a working query engine for your GitHub repository.

12:10pm-1:10pm: Building AI Agents with Real-Time Web Data — Yohan Raju

(workshop) [Track 8] | Track: Track 8

Your AI agent is only as good as the data it can access — and static training data isn't enough anymore. In this hands-on workshop, you'll learn how to connect AI agents to the live web using Bright Data's MCP (Model Context Protocol) server and scraping APIs, turning any LLM into a real-time web-aware system.

12:10pm-1:10pm: Research to Reality with Google DeepMind — Paige Bailey

(workshop) [Track 9] | Track: Workshops Day 1

1:15pm-2:15pm: Let your agent cook: using skills to evaluate and improve your app — Ankur Duggal

(sponsor) [Track 1] | Track: Track 1

1:15pm-2:15pm: 2 hr deep dive on LLM Inference at Scale — Part 2 of 2 — Harshul Jain, Tanmay Sah

(sponsor) [Track 3] | Track: Workshops Day 1

Most engineers using LLMs can call an API. Far fewer can explain why their model is slow, why it's running out of memory, or how the inference engines powering every major LLM API actually work. This workshop walks through the full inference stack — from how a transformer generates a single token to serving billions of tokens a day with vLLM, SGLang, TensorRT-LLM, Ray, and KServe/llm-d. 60% explanation with live demos, 40% hands-on exercises. Attendees leave with a running vLLM server they benchmarked themselves. Based on the open-source practitioners handbook being built live at github.com/harshuljain13/llm-inference-at-scale

(NOTE: this is a 2 hour workshop that happens over lunch break - you should try to have lunch before or after if attending)

1:15pm-2:15pm: Build the Right Thing: Product Engineering for Software Developers — Part 2 — Kent C. Dodds

(sponsor) [Track 4] | Track: Workshops Day 1

There is nothing quite as demoralizing as finishing a feature and realizing you built the wrong thing. The code is clean. The tests pass. The ticket is closed. And none of it matters. This is happening more often, not less. AI makes it faster and cheaper to implement, which means teams can now waste entire sprints on the wrong idea at unprecedented speed. The bottleneck is no longer "can we build it?" It is "should we build it?" and "are we sure we understand the problem?" This session is a condensed introduction to product engineering for builders: the skills that sit upstream and downstream of implementation. We will not try to cover everything a full-day workshop would. Instead, we will focus on the highest-leverage ideas you can apply on Monday. ### What we'll cover 1. Validate before you build Most wrong builds start with an idea that was never tested. You will learn to separate real user pain from solution-shaped requests, and practice discovery questions that surface past behavior instead of hypothetical enthusiasm. 2. Prioritize what deserves to exist Not every good idea should be built now. Especially in the AI era, "we could build this" is not a reason to build it. We will work through a practical prioritization lens, including the Kano model, to help you distinguish fundamentals from delighters from distractions before your team commits. 3. Own the feature, not just the PR Product engineering does not end at merge. You will leave with a clearer picture of end-to-end feature ownership: staying close to users, setting up simple feedback loops, and improving what you shipped instead of moving on to the next ticket. ### Format This is a 2–3 hour session with Kent C. Dodds. Expect focused teaching, real-world examples, and short interactive exercises and discussion. This is not a full simulation lab or a ticket-closing coding workshop. It is judgment practice for engineers who already know how to ship. ### Who this is for Software engineers (and technical builders generally) who: - Have shipped something polished that nobody wanted - Feel pressure to move fast with AI and want a better filter for what deserves to exist - Want stronger product instincts without becoming a PM - Care about owning outcomes, not just closing tasks Some software engineering experience is assumed. No particular stack is required. PMs and designers often find this valuable too. ### What you'll leave with - Discovery questions for ambiguous work - A prioritization lens you can use before committing to a build - A clearer model for feature ownership and post-ship feedback loops - Language for stakeholder conversations when requirements are unclear

1:15pm-2:15pm: Build a Platform, Unleash an Agent on it.... and Watch it Burn! — Michael Forrester, Whitney Lee

(sponsor) [Track 5] | Track: Workshops Day 1

You get a Kubernetes cluster with an Internal Developer Platform already running: ArgoCD for GitOps, Kyverno for admission control, Falco for runtime detection, Prometheus for observability. Everything is instrumented. Everything is enforced. You also get an AI agent with cluster access. Your job is to get the agent to break something. Deploy a non-compliant workload. Escalate privileges. Modify infrastructure outside Git. Exfiltrate data through an agent response. Some of you will fail because the governance stack catches it. Some of you will succeed because it doesn't. Afterward we regroup and map what got blocked, what slipped through, and why. The 80% that existing CNCF tools already govern becomes obvious. The 20% gap where agent-specific tooling is missing becomes undeniable. You leave with a concrete governance map and the exact list of failure modes your own platform probably isn't covering yet.

1:15pm-2:15pm: SonarQube + OpenAI: Wiring Your Team for Agentic Development — Killian Carlsen-Phelan

(sponsor) [Track 6] | Track: Track 6

As AI agents take on increasingly complex development tasks, the critical challenge has shifted from generation to verification. A growing body of evidence suggests that as models grow more capable, failures become more frequent and more convincing, making cognitive surrender among human reviewers an acute risk. This talk introduces Sonar's Agent Centric Development Cycle (AC/DC), a three-stage continuous loop of Guide, Verify, and Solve, as the engineering discipline teams need to build now. Teams that embrace AC/DC guide agents within their organizational standards before they write a line of code, verify output in real-time, and solve issues automatically without manual triage. This session will also feature a live demo of the SonarQube OpenAI plugin, showing how a well-guided agent produces code that is faster to verify and cheaper to fix.

1:15pm-2:15pm: How Reducto parsed the Epstein Files for the Viral JMail Project: The Secret Complexities of Document — Palak Agarwal

(sponsor) [Track 7]

Reducto powered the infrastructure behind Jmail, a fully searchable email interface with over 3.5 million scanned government pages built days after the Epstein files release. The site went viral overnight, racking up millions of views across news coverage and social media. In this workshop we'll break down how Reducto's Parse API handled everything from redacted PDFs to handwritten letters to dense financial tables at that scale, then walk through the same pipeline hands-on using the Reducto CLI and MCP. You'll leave with a working setup and a clear mental model for applying document parsing to your own projects.

1:15pm-2:15pm: Turning My Obsidian Vault Into a Local AI Engineer — Filip Makraduli

(sponsor) [Track 8] | Track: Workshops Day 1

Personal knowledge bases are messy, but engineering agents need memory: decisions, docs, TODOs, old PRs, architecture notes, incident notes. This talk shows how I made an Obsidian vault usable by an agent using local-first retrieval and small-model inference. The point is not “chat with notes”; it is how to build durable, inspectable agent memory.

1:15pm-2:15pm: Continuously improving agents with Langfuse — Lotte Verheyden, Annabell Schäfer

(sponsor) [Track 9] | Track: Workshops Day 1

Join us for a hands-on Langfuse workshop where we'll show you how to observe, debug, and improve your AI applications, step by step, using a real sample app. Bring your questions and discover how Langfuse can level up your specific use cases!

2:20pm-4:20pm: From Vibes to Production: Evaluating and Shipping AI Agents That Work 201 — Laurie Voss

(sponsor) [Track 1] | Track: Track 1

Building an AI demo is easy. Knowing whether it actually works — and keeping it working in production — is the hard part. Most teams ship agents on vibes: they try a few prompts, the output looks good, and they push to production with no real way to measure quality or catch regressions.

This hands-on workshop walks through the full lifecycle of shipping a real AI agent, using a working financial-analyst agent built on the Claude Agent SDK as the running example. You'll instrument it with tracing, do structured error analysis on its actual outputs, and build a layered evaluation suite — from cheap deterministic code checks to LLM-as-a-judge evaluators with custom rubrics. We'll cover the parts most tutorials skip: why agents fail in ways single LLM calls don't, the eval anti-patterns that quietly mislead you, and how to know whether you can even trust your judge (meta-evaluation). Finally, we'll close the loop: turning eval results into datasets and experiments, running evals online against production traffic, wiring them to monitors and alerts, and feeding failure explanations back to a coding agent to actually fix the underlying problems.

You'll leave with a runnable notebook and a repeatable, evaluation-driven workflow you can apply to your own agents the next day.

2:20pm-4:20pm: The Data Context Layer: Why Data Engineering Agents Need More Than Code and Databases — Yoni Michael, Brandon Callender

(sponsor) [Track 2] | Track: Track 2

Modern AI agents typically understand either code or databases. Code-focused agents reason over files, dependencies, and syntax, while database agents see tables, columns, and query results. This works for software development and basic analytics—but it breaks down for data engineering. In real data environments, agents fail because they lack context: an understanding of how data flows, what it represents, and why it behaves the way it does in production. Introducing the data context layer—a missing third layer that bridges code, data, and business semantics. Without it, agents hallucinate impact, suggest unsafe joins, and struggle with root cause analysis. This presentation will define the data context layer and showcase its use in practice, including end-to-end lineage from sources to reports; semantic metadata such as grain, measures, dimensions and business logic; runtime signals including job executions, failures, and performance patterns; and logical vs. physical modeling distinctions. Attendees will walk away with a greater understanding of: Why the code layer (dbt SQL, manifests, Git history) provides structure but misses grain, aggregation semantics, and join safety Why the data layer (warehouse tables, execution metrics, failures) shows what happened, but not why How the data context layer unifies lineage, semantic metadata, runtime behavior, and business rules The presentation will also cover architecture patterns for building and maintaining a data context layer, including why property graphs are well-suited for contextual reasoning and how agents can query context safely instead of relying on prompt stuffing.

2:20pm-5:30pm: Special topics in Kernels, RL, Reward Hacking in Agents — Daniel Han

(session) [Track 3] | Track: Workshops Day 1

An advanced seminar (good prerequisites: Daniel's 2024 and 2025 hit AIE workshops, but all are welcome!)

PLS WATCH: https://www.youtube.com/@aiDotEngineer/search?query=daniel%20han

2:20pm-4:20pm: Burn your flags: How PayPal designs interactive CLI tools for agents — Mark Lummus, Navinkumar Patil

(sponsor) [Track 4] | Track: Workshops Day 1

The common guidance for designing complex CLI tooling that agents can use is to add a 'non-interactive' mode, where a normally interactive & flow-based command can be executed in a single pass by feeding it a bunch of flags. This is necessary for deterministic automation, but agents aren't scripts; they aren't really constrained in the same way, and they benefit greatly from the same step-by-step contextual workflows that humans do. In this workshop, PayPal goes deep on techniques we've used in our upcoming paypal CLI that you can steal to make your complex CLI workflow tool agent-usable — without giving up the guardrails and guidance that interactive CLI tools provide.

2:20pm-4:20pm: AI Security Engineer Foundations + Certificate — Micah Silverman

(sponsor) [Track 5] | Track: Workshops Day 1

In each of the two sessions, we cover 6 modules and participants receive a certificate of completion at the end. The modules are: OWASP Top 10 for LLM, Addressing Shadow AI, AI Threat Modeling, Securing Agents & MCP, Securing Vibe Coding, & AI Red Teaming

2:20pm-4:20pm: Context Engineering in 2026: Compaction, Memory & Cost — Louis-François Bouchard, Samridhi Vaid, Omar Solano

(sponsor) [Track 6]

Every long agent session eventually breaks: the assistant that swore it would "never push to main" does exactly that forty turns later. The model didn't get dumber — its context did. This workshop is about engineering the context window so that stops happening, shown with Towards AI's open-source AI tutor, which answers questions for students of our AI-engineering courses. Context engineering is deciding what the model sees on every single call — instructions, history, retrieved course content, memory, and tool outputs — and it's the line between a tutor that holds a coherent session and one that forgets the student's setup halfway through. We'll move in three stages, mirroring how the project actually went. The concepts: the two root problems (a finite window, a stateless model), the full compaction toolkit (truncation, trimming, tool-result clearing, summarization, and offloading to files — and when each actually helps), memory that survives across sessions, skills loaded on demand, and production-grade retrieval (chunking, metadata, course scoping, hybrid search, reranking, and evaluating). We'll cover the tutor's architecture, and the evaluation harness we used to measure every run on Gemini — tokens, cost, latency, and memory probes instead of vibe-checks. At real volume, even Gemini Flash got expensive, so we tested whether open and local models could match the quality for a fraction of the cost and match result quality. Everything is open-source and will be shared during the workshop.

2:20pm-4:20pm: Vector Isn't Enough: Hybrid Search & Retrieval for AI Engineers — Jeff Vestal

(sponsor) [Track 7] | Track: Track 7

If you build RAG, you reached for vector search first. This lab is about everything that happens after you realize embeddings alone don't cut it in production. You'll write real queries — semantic, lexical, and hybrid — feel exactly where each one fails, and walk out with a production-grade retrieval pipeline and the judgment to know which technique to reach for when.

What you'll actually do:

1. Dense vector search, and the mechanism behind it. Run semantic queries over a  semantic_text  field backed by Jina v5 embeddings — generated server-side, at query time, by the Elastic Inference Service (EIS). No embedding service to stand up, no client-side inference code. We open the hood on how query-time embedding actually works.

2. Break it. Throw adversarial queries at pure vector — exact error codes, version numbers (8.18 vs 9.0), precise config keys — and watch semantic similarity blur the exact match you needed. Then bring in BM25 lexical search to rescue it… and find the queries where keyword search whiffs. Each method is strongest exactly where the other is weakest.

3. Hybrid, properly. Fuse lexical + semantic with Elasticsearch retrievers. Learn the two fusion strategies that matter — Reciprocal Rank Fusion (RRF) and linear combination with score normalization — when to use each, and how to tune them. Optional: cross-encoder reranking with Jina Reranker v2.

4. Why this is the whole game for agents. Wire the hybrid retriever into a RAG flow and prove that retrieval quality, not the model, determines answer quality. Only synthesis truly needs the LLM - retrieve, rank, filter, and document-level security are database work done in milliseconds for a fraction of the cost. The contrarian takeaway: most of your RAG pipeline shouldn't be LLM calls at all.

2:20pm-4:20pm: Build with Perception Agents — Emile Baizel, Shruti Arora

(session) [Track 8] | Track: Workshops Day 1

Human-agent collaboration is changing, becoming more visual. Models can perceive, point, and verify, but most agents still rely on us typing a paragraph to explain what we're looking at. Meet perception agents: computer use agents that see screens how you see screens. They understand, reason, and verify their own work. They let you point, draw, and describe, just as people collaborate in real life. We call this shared perception, and at AGI Lab we just open-sourced the first two primitives of our perception agent harness: visual verification and visual annotation. In this workshop, you'll get hands-on with both, build one sample use case end-to-end, then take the primitives back to your day-to-day in a mini hackathon. Best ideas win prizes.

2:20pm-4:20pm: Hands-on AutoResearch: Cracking OpenAI's Parameter Golf — Zhengyao Jiang, Dixing Xu, Vayum Arora, Dhruv Srikanth

(session) [Track 9] | Track: Workshops Day 1

Heard about autoresearch, or tried it a few times in playground settings? This hands-on tutorial teaches you how to use autoresearch on one of the most serious challenges in ML this year: OpenAI's Parameter Golf.

The challenge: train the best language model that fits in just 16MB. We entered our autoresearch agent this past spring, and it outperformed the field of over 1,000 participants. You'll learn how we approached it, then get to do it yourself: kick off an autoresearch agent, watch it improve a tiny language model's training script, steer it when progress stalls, and visualize your results. You'll leave with a working autoresearch setup you can point at your own code.

compute kindly sponsored by Modal!

2:20pm-3:35pm: Observe, optimize and protect your hosted agents in Microsoft Foundry — Pamela Fox

(sponsor) [Track M] | Track: Track M

Modern agents fail in ways traditional monitoring can’t catch. In this hands-on lab, learn how Microsoft Foundry Observability helps you move from prototype → production with context-specific evaluation suites (auto-generated evaluators + test datasets) wired into developer workflows via skills/MCP tooling for hosted agents. Then scale quality with continuous evaluation, trace-linked analysis, and adaptive red teaming—and walk away with a sandbox to explore additional features on your own.

4:30pm-5:30pm: The Autonomous Computer: Full-stack Infrastructure for Computer Use Agents — Ang Li

(session) [Track 1] | Track: Workshops Day 1

Even the world's best computer-use agents cannot repeat their successes at the moment. Agents that write code — emitting structured selector-based actions instead of clicking pixels — break through that ceiling. We'll share two years of experience from Simular's production agent platform, the architectural decisions that mattered (refs over pixels, code as substrate, Simulang DSL), and a live demo: a 30-step unattended Windows workflow, side-by-side with a vision-only baseline. If you're shipping agents to real users, this is the playbook.

4:30pm-5:30pm: The Dark Arts of Skill Engineering — Paul Bakaus

(session) [Track 2]

Most agent skills are a system prompt and a prayer. They produce safe, median output because that's what LLMs default to. After building 24 design skills across 9 AI platforms, I found the patterns that break through that ceiling, and they're rarely documented or discussed. Make your agents argue: spawn parallel sub-agents that independently evaluate the same work, then force their conflicting opinions into a single result. The output is bolder than any single agent would dare. Build mixture-of-expert skills that route to specialized sub-agents the way frontier models route to specialized networks. Give your skills memory through persistent context files that restore across sessions, so every invocation builds on the last. Wire up skill hooks that auto-activate after execution to validate, transform, or chain into the next skill. Exploit barely documented environment variables and shell expansion to make skills context-aware before they even run. Let's dig into the dark arts of skill engineering to craft ultra powerful skills.

4:30pm-5:30pm: Hill-climbing Skills: How to Improve Agents Without Touching the Model — Shubhankar Srivastava

(workshop) [Track 4] | Track: Workshops Day 1

Agent Capability is now highly dependent on the markdown files read at runtime -- skills.This workshop treats skills as a first-class optimization surface. We borrow the concept of autoresearch (from Karpathy) and apply it to the skills your agents already read. You'll see how we at Browserbase did the same for browser agents, enabling our customers to scale the coverage of their browser agents while improving performance(2x faster runs) and optimizing for token spend(upto 10x cheaper).You'll leave with a working http://SKILL.md you generated through an auto-research loop, and a mental model for when skill optimization beats fine-tuning or prompt engineering.

4:30pm-5:30pm: Agent Auth — Bereket Habtemeskel, Paola Estefania

(workshop) [Track 5] | Track: Workshops Day 1

Better Auth has grown to 27k GitHub stars and over 1.5M weekly downloads, becoming a popular choice for developers who want to own their authentication stack. We recently introduced Agent Auth, a protocol designed to support autonomous and delegated agents operating services for an organization or a user. It allows agents to dynamically negotiate capabilities, manage access boundaries, and maintain secure authorization flows. This session will break down the protocol design and demonstrate it live, showing how agents can securely authenticate and operate with dynamic permissions.

4:30pm-5:30pm: The Prime Intellect Stack — Will Brown

(workshop) [Track 6] | Track: Workshops Day 1

Deep dive into Prime Intellect's open-source ecosystem of post-training tools, including the verifiers and prime-rl libraries, as well as our Lab platform for self-serve training and inference.

4:30pm-5:30pm: Lifestyles of the AI-Native: Voice-coding, agent skills, hooks and scheduled tasks — Nick Nisi, Zack Proser

(workshop) [Track 7] | Track: Workshops Day 1

Most engineers are bolting AI onto a workflow that was designed for a pre-AI world. The result is a faster version of the same grind. This talk is about the other path: rebuilding the daily practice of software engineering from the ground up, around what agents are actually good at.

Two senior practitioners from WorkOS will walk through how we actually work now as AI-native engineers — not in the aspirational sense, but the literal one. We think out loud and voice-code instead of typing our way to clarity. We package recurring expertise into agent skills so we're not re-explaining context every session. We wire up hooks that fire on the events we care about, and hand off scheduled tasks to agents that run overnight, while we're away from the keyboard, or otherwise off the clock. The throughline is intentional design: deciding what a human should hold onto and what should be delegated, then building the machinery to make that real.

Because there are two of us, you'll see more than one set of habits — where our setups converge on the same patterns, and where they diverge based on how each of us thinks and works. The pitch isn't "do more." It's that an AI-native setup, designed deliberately, buys back attention and protects you from the burnout that comes from treating agents as a turbocharger for an old loop. Attendees will leave with a concrete mental model for voice-driven development, a pattern for authoring reusable agent skills, and working examples of hooks and scheduled automations they can adapt the same week.

4:30pm-5:30pm: The Art and Science of Loopcraft with Pi (and friends) — Joel Hooks

(workshop) [Track 8] | Track: Workshops Day 1

This workshop helps agentic coding practitioners stop treating agents like pretend coworkers and start designing reliable, compounding loops. Using Pi as the concrete demo surface, Joel Hooks will show how loop state, handoffs, review, memory, and operator control become visible, while keeping the ideas portable to Claude, Codex, Cursor, and similar coding agents. Practitioners should leave able to identify loops inside their agent workflows, diagnose when failures need gates/evidence versus orchestration/memory/leverage, and understand how model-shaped lifecycles differ from traditional human SDLC rituals.

4:30pm-5:30pm: Evolution of agentic surfaces — Gagan Bhat, Isabella Kai He

(workshop) [Track 9] | Track: Workshops Day 1

Getting an agent into production takes more than a good prompt: it needs somewhere to run code, credentials it can't leak, sessions that survive interruption, and infrastructure that scales. This talk traces how Anthropic's agentic surfaces evolved from the raw API to Claude Managed Agents, and what our Applied AI team has learned about harness design along the way.

5:00pm-6:00pm: Human Connection in the Age of AI — Joyce Zhang, Carole Robin, Ph.D.

(workshop) [Expo Stage 2 NW] | Track: Expo Stage 2

Building AI safely requires both technical skills and interpersonal skills. A live demo of connection tools from Stanford's "Touchy Feely" course, then hands-on practice. Co-hosted with Leaders in Tech.

6:00pm-6:15pm: Expo Welcome Speech — Sonar, Extend AI

(session) [Expo Stage 3 SW] | Track: Expo Stage 3

6:15pm-7:15pm: Runway AI Film Festival

(session) [Expo Stage 3 SW] | Track: Expo Stage 3

Runway's annual AI Festival — a celebration of creatives experimenting at the forefront of art and technology across film, design, new media, fashion, advertising, and gaming, with a screening of finalist AI films. https://aif.runwayml.com/

Day 2 — Session Day 1

9:00am-9:05am: The Highest Loop — swyx

(keynote) [Main Stage] | Track: Software Factories

We celebrate the third birthday of the AI Engineer post.

9:05am-9:25am: On AI and Knowledge — Pablo Castro

(keynote) [Main Stage] | Track: Software Factories

9:25am-9:45am: The Golden Age of AI Engineering — Alexander Embiricos, Romain Huet

(keynote) [Main Stage] | Track: Software Factories

TBD

9:45am-10:05am: GLM-5.2: Frontier Intelligence, Open Weights. — Zixuan Li

(keynote) [Main Stage] | Track: Software Factories

10:05am-10:25am: Thom Wolf keynote — Thom Wolf, Olive Song

(keynote) [Main Stage] | Track: Software Factories

10:25am-10:30am: Security Track intro — Manoj Nair

(keynote) [Main Stage] | Track: Software Factories

10:45am-11:05am: Getting the most out of Codex — Jason Liu

(session) [Main Stage] | Track: Software Factories

10:45am-11:05am: Security Firewall for Agents — Ryan Dahl

(session) [Track 1] | Track: Claws & Personal Agents

Why personal agents that run untrusted LLM code need a sandboxed OS/runtime model, not just a compute sandbox.

10:45am-11:05am: The State of Vision — Joseph Nelson

(sponsor) [Track 2] | Track: Vision & OCR

10:45am-11:05am: Pinecone 2.0 — Edo Liberty

(session) [Track 3] | Track: Search & Retrieval

Autonomous agents are smart but don’t know your business or your objectives. That’s why most agents in the enterprise remain stuck in retrieval loops, burning millions of tokens on processing raw documents

A shift from traditional retrieval systems + agents (aka RAG) to purpose-built knowledge engines is underway.

I'll talk about why moving reasoning upstream and compiling raw enterprise data into specialized, task-specific context artifacts is critical to unlocking reliable agentic workflows. And I'll show you how offloading knowledge management to a dedicated layer enables engineering teams to achieve up to a 90% reduction in token consumption while drastically improving task completion rates, speed, and accuracy.

10:45am-11:05am: Claude Managed Agents Workshop (Part 1) — Priyanka Phatak, Gabriel Cemaj

(session) [Track 4] | Track: Workshops Day 2

Build an agent with Claude Managed Agents

10:45am-11:05am: Through the AI Fog: The architectural decision the next 24 months of agentic security depends on. — Manoj Nair

(sponsor) [Track 5] | Track: Security

10:45am-11:05am: The New Primitives: Building AI-Native Software — Kwindla Kramer

(session) [Track 6] | Track: Voice & Realtime AI

In the future, every piece of software with a human-facing surface will be built from new, LLM-centric primitives. (Just like every piece of software today has networking, threads/async routines, UI on top of some flavor of Model/View/Controller abstractions, etc.) We're just starting to invent these new primitives. The list, though, will definitely include: 1. Subagents - multiple inference loops, multiple models, async tool calls 2. Very long context - memory + episodic human interactions over a long period of time, structured data input (not just output), progressive skills/context loading, graceful compaction & summarization 3. dynamic user interface generation / user interfaces driven by LLM inference 4. conversational voice input

10:45am-11:05am: Tokens In, Engagement Out: Training LLM-Recommenders — Devansh Tandon

(session) [Track 7] | Track: LLM Recsys

10:45am-11:05am: How Forward Deployed Engineering is done at Factory — Eno Reyes

(session) [Track 8] | Track: Forward Deployed Engineering

10:45am-11:05am: Data Quality is the Compute Multiplier — Ari Morcos

(session) [Track 9] | Track: Data Quality

Better data quality is the highest-leverage and most underinvested part of building a model: it produces a better model for the same compute, whether you're mid-training on an open base or pre-training from scratch.

This session is a practical look at data curation, covering what data quality actually means, the stages of a modern curation pipeline (cleaning, filtering, deduplication, synthetic data generation, algorithmic mixing, and multi-stage composition), and which steps matter most in practice. It draws on DatologyAI's frontier data research and customer results, including Thomson Reuters' mid-training gains on proprietary legal domain data and Arcee's Trinity model reaching the open frontier on public data alone. You'll leave with a concrete sense of where better data quality pays off and how data curation is shaping the future of model training.

10:45am-11:05am: Build agents fast with GitHub Copilot (from idea to working app) — Idan Gazit

(sponsor) [Track M] | Track: Track M

See how developers go from prompt to a working agent using GitHub Copilot and real workflows. We'll walk through generating code, iterating quickly, and keeping velocity inside your existing dev loop.

10:45am-11:05am: Inside the AI economy: What Stripe’s data reveals — Nilofer Rajpurkar

(session) [Leadership 1] | Track: Agentic Commerce

Stripe powers 78% of the Forbes AI 50, giving Stripe index-level visibility into the AI economy. AI companies are growing faster, selling globally by default, and monetizing earlier. See the data behind the growth: how AI has collapsed the cost of launching, how the fastest-growing companies are adapting their pricing, and the role agents are starting to play in commerce.

10:45am-11:05am: Governance Is the Real Bottleneck to AI ROI — David Hsu

(session) [Leadership 2] | Track: Claws & Personal Agents

As AI systems move from generating content to taking Claw-based agents action inside production systems, governance (not model quality) becomes the limiting factor. David will break down why visibility, guardrails, approvals, and rollback matter more than raw intelligence, and how companies can enable AI adoption without creating security and compliance disasters.

10:45am-11:05am: Every AI company is accidentally building a bank. — Dor Sasson

(session) [Expo Stage 1 NE] | Track: Expo Stage NE

You're logging usage, billing later, hoping agents behave. They don't. Here's the architecture that fixes it before the invoice hits.

10:45am-11:05am: The Enterprise Agentic Gap: When Developer-Level AI Tools Hit Millions of Lines — Dan Adler

(session) [Expo Stage 2 NW]

Agentic coding tools have transformed individual developer workflows but owning a large codebase with millions of interdependent lines across multiple code hosts is a different problem entirely. Off-the-shelf AI coding tools weren't built for it, and at scale, they break down in ways that aren't obvious until you're already in trouble. This talk covers the failure modes you'll hit when applying developer-level agentic tools to enterprise-scale migrations, and how Sourcegraph's agentic migrations solution was built to solve what others couldn't.

10:45am-11:05am: How PayPal Enterprise Payments handles agent-initiated payments across ChatGPT and Google AI Mode — Sam Parsons

(session) [Expo Stage 3 SW]

PayPal Enterprise Payments has shipped integrations across the major agentic surfaces in the last six months each with human-in-the-loop confirmation and full transaction attribution back to the originating AI platform. We'll tour all three paths: ACP for ChatGPT apps (delegated payment tokens via complete_checkout, allowance validation, facilitator_details attribution), UCP with Google Pay for Google AI Mode (server-side tokenizationSpecification, parsing androidPayCards for the single-use token), and a preview of MCP Apps inline checkout, where the payment surface renders in-chat and card data never enters the LLM context. For each path we'll cover where PayPal Enterprise Payments fits, what the shopper and merchant each see, and the tradeoffs between them. You leave with working code and the docs to evaluate which path fits your stack.

10:45am-11:05am: Agentic Search for Coding Agents — Jakub Hojsan

(session) [Expo Stage 4 SE]

11:10am-11:30am: Rise of the Software Factory — Tereza Tížková

(session) [Main Stage] | Track: Software Factories

The Stanford HAI 2024 AI Index reports a 30x productivity gap between AI leaders and laggards. The differentiator is not company culture, prompting technique or model selection, but the infrastructure. Organizations capturing outsized value from AI agents have machine-readable codebases, deterministic internal APIs, CI/CD pipelines with agent-addressable hooks, and permission models granular enough to scope exactly what an agent can touch. I believe the “agents as employees” framing is most useful if you operationalize it. An employee has persistent identity, episodic and semantic memory, scoped permissions that don’t get renegotiated every task, an audit trail, and a defined escalation path when things go wrong. Persistent computer use (with a stable execution environment that survives across steps) was the real inflection point that is making this possible. Some interesting production problems remain under-explored. How do you give an agent persistent identity across pull requests? How do you recover from partial failure mid-task without discarding completed work? How do you enforce code ownership policies when the author is a model? How do you bound token spend when pipelines spin up sub-agents recursively? This talk defines agent readiness as a concrete infrastructure checklist: structured codebases, deterministic APIs, per-agent scoped credentials, atomic and idempotent operations, structured execution traces, and explicit thresholds for when the agent stops and a human takes over. It presents research results in practice, and what are the steps organizations need to take to be fully agent-ready.

11:10am-11:30am: Your Agent Didn’t Fail. Your Harness Did. — Vinoth Govindarajan

(session) [Track 1] | Track: Claws & Personal Agents

AI agents do not fail only because the model is wrong. Many production failures happen in the harness around the model: state is not persisted, two runs mutate the same session, a tool call never returns, an approval loses scope, or an internal success never becomes user-visible proof. This talk uses OpenClaw as a public case study to examine real harness failure modes and extract a reusable production model for AI engineers. We will look at how events enter an agent system, how session state is rehydrated, why single-writer lanes and throttles matter, and why tool execution needs scoped approvals and auditable receipts. The core idea is simple: a model proposes, the harness commits, and the receipt proves it. Attendees will leave with a practical 'run receipt' audit they can apply to their own agents: what woke it up, which state did it inherit, what authority did it use, what executed, and what evidence survived.

11:10am-11:30am: Building the Document Context Layer for AI Agents — Jerry Liu

(sponsor) [Track 2] | Track: Vision & OCR

AI agents are the new knowledge workers, but knowledge work depends on unstructured enterprise context. ~90% of that data lives in the form of document containers - from human-native (PDFs, Word, Pptx) to emerging agent-native formats (HTML, MD). Doing RAG in 2026 involves generalized agent harnesses with tools, MCPs, and skills. In this world, every company building agents needs a Document Context Layer, the bridge between their unstructured docs and the agents trying to reason over them. This talk covers what that layer looks like in practice: from document understanding, retrieval, and workflows, to areas yet to be explored — agent-native formats, versioning, editing, permissions, and longer-running agents.

11:10am-11:30am: The unreasonable effectiveness of BM25 for agentic search — Jo Kristian Bergum

(session) [Track 3] | Track: Search & Retrieval

GPT-5 is shockingly good at search, and that changes the "BM25 as a baseline" story. Using GPT-5 search trajectories from BrowseComp-Plus, I'll show how default BM25 parameters and evaluation harnesses can make lexical retrieval look weak, while real agent queries often play directly to BM25's strengths. Much like grep became a core retrieval primitive for coding agents, BM25 is re-emerging as a powerful primitive for agentic search.

11:10am-11:30am: Claude Managed Agents workshop (Part 2) — Priyanka Phatak, Gabriel Cemaj

(session) [Track 4] | Track: Workshops Day 2

Build an agent with Claude Managed Agents

11:10am-11:30am: Your LLM Stack Is a 2008 Database With Better Marketing: Why ML Security Is Dominated by Misconfiguration, Not Missing Features — Lovina Dmello

(sponsor) [Track 5] | Track: Security

ShadowRay exposed over a billion dollars of data through a missing authentication check. It wasn't a zero-day. It wasn't a clever new attack class. It was a default config someone never flipped off. That story is not the exception in production ML, it's the rule. We synthesized 139 peer-reviewed papers on production ML security across access control, runtime security, infrastructure, and operations. Five findings stood out, and one of them upends how most teams think about ML security: - Misconfiguration, not missing features, is the dominant failure mode. The mechanisms exist. Teams aren't using them, or are using them wrong. - Adversarial defenses impose 15–30% inference overhead, which is why almost no production system actually runs them. - ML-specific security tooling lags general DevOps tooling by years. - Security, data-science, and ops teams operate in expertise silos that create persistent gaps no single team can see. - LLM and multi-tenant GPU threats are evolving faster than defenses (prompt injection, RAG poisoning, GPU side channels). This talk walks through the four-pillar defense-in-depth framework, the six-category threat taxonomy that maps each attack to its primary and secondary defenses, and a four-level security maturity model that matches overhead budgets to deployment contexts. You leave knowing where your stack actually sits and which 3 misconfigurations account for most of the risk.

11:10am-11:30am: Speech-to-Speech Model Research at Google DeepMind — Valeria Wu Fon, Tom Ouyang

(session) [Track 6] | Track: Voice & Realtime AI

Most voice interfaces today are built as a 3-way cascade system (ASR/LLM/TTS). While functional, this cascaded approach introduces latency bottlenecks, strips away non-verbal nuance, and limits emotion-aware, multi-turn dialogue. Today, we are witnessing a profound shift toward native speech-to-speech models that process audio natively from end to end. In this session, we’ll explore the exciting paradigm at Google DeepMind to train speech-to-speech models for real-time voice agents. We will cover the high-level product and research challenges of building voice agents that feel truly conversational, optimizing for fluid turn-taking and low latency while maintaining enterprise-grade intelligence.

11:10am-11:30am: Spotify LLM Recsys — Jacqueline Wood, Yves Raimond

(session) [Track 7] | Track: LLM Recsys

11:10am-11:30am: How Forward Deployed Engineering is done at Cursor — Pauline Brunet

(session) [Track 8] | Track: Forward Deployed Engineering

11:10am-11:30am: The Messy Reality of Scale: Synthetic Data and Pre-Training at Poolside — Robert McHardy, Marah Abdin

(session) [Track 9] | Track: Data Quality

TBD — focus on data quality considerations for LLM pretraining and code generation.

11:10am-11:30am: Building the engine while flying the plane — launching the Figma MCP server — Jesse Lumarie

(session) [Leadership 1] | Track: AI-Native Enterprises

What does it actually take to go from a vague idea to a production-ready AI system that people depend on? In this talk, I’ll walk through the real story of building Figma’s MCP server as a founding engineer whilst the MCP spec evolved—starting from early prototypes, through dead ends and architectural pivots, to launching both the initial product, creating new tools and eventually a fully remote server.

11:10am-11:30am: Your Agent Evolved. Your Evals Didn't. — Ameya Bhatawdekar

(session) [Leadership 2] | Track: AI Architects: Show my Workflow

Knowing which generation your agent is in, which failure modes your current evals are blind to, and what to build next is the difference between shipping with confidence and flying blind. Agent architectures have evolved through six generations; prompt, chain, ReAct loop, workflow graph, modern agent loop, AI harness. And each one quietly breaks the eval strategy of the generation before it. A prompt-quality rubric won't catch a bad tool call; a trace scorer won't catch memory poisoning. Using a single SRE incident response agent threaded through every generation, this talk shows exactly where each architecture outgrows its evals and what you need to close the gap.

11:10am-11:30am: Give your coding agents the power of turbogrep! — Owen Halpert

(session) [Expo Stage 1 NE]

Coding agents can grep the filesystem, but sometimes semantic search is more useful for finding the right files, especially on large codebases. Claude Code and Codex, unlike Cursor, do not use semantic search for code retrieval. There are good reasons for this, but Cursor has consistently demonstrated that semantic retrieval can materially improve code search to improve answer accuracy, increase code retention, and reduce token usage. In this session, we'll share a coding agent plugin for semantic codebase search alongside other modalities (BM25, regex/globbing/grep, filtering), and demonstrate how an agent can choose the right tool for the job. We'll share benchmark-style results that compare answer quality and token consumption with and without semantic retrieval across a small set of representative tasks.

11:10am-11:30am: Actionable Knowledge For Agents With Context Graphs — Will Lyon

(session) [Expo Stage 2 NW] | Track: Expo Stage 2

11:10am-11:30am: Frontier models for the hard parts, open weights for the rest

(session) [Expo Stage 3 SW]

Kimchi is an open-source coding agent that orchestrates multiple AI models—including open-weight models like Kimi K2.7 and MiniMax M3 alongside commercial frontier models—to intelligently route each task to the best model for the job.

Powered by Ferment, Kimchi evaluates every step, automatically reworking or escalating tasks when needed to maintain quality while minimizing the use of expensive frontier models. The result is high-quality code generation at approximately 2.5x lower cost than relying on commercial models alone—all with the transparency and flexibility of open source.

11:10am-11:30am: Agents, codebases, and teams: what it actually takes to ship together — Aditya Khandelwal

(session) [Expo Stage 4 SE]

Using a coding agent solo is one thing. Getting a whole team to trust agent-written code, agent-run reviews, and long-running agent work is another. That's where most teams stall. This talk is about what it actually takes to get there: how to shape a codebase so agents can work in it safely, how to earn a skeptical team's trust instead of mandating it, and the failure modes that only show up once agents are part of the daily workflow.

11:40am-12:00pm: Orchestras, not Factories — Charlie Holtz

(session) [Main Stage] | Track: Software Factories

Everything is Conductor now! I want to tell the story of how we came up with the original interface, what I think everyone (including us) is getting wrong and what's coming next.

11:40am-12:00pm: Everyone Gets A Software Company — Benjamin Guo, Rob Cheung

(session) [Track 1] | Track: Claws & Personal Agents

11:40am-12:00pm: Skill issue: stop deploying vision language models, use them with Skills to build e2e vision apps on edge — Merve Noyan

(sponsor) [Track 2] | Track: Vision & OCR

With the boom of vision language models barrier of entry to build vision apps are much lower so developers tend to use them right away. However, these models are very large and inefficient in production. In this talk, I will go through combining vision language models with Skills to build end-to-end vision apps from training to deployment using HF Skills, on top of showing the state-of-the-art in small computer vision/multimodal models.

11:40am-12:00pm: The Search Engine for the Agentic Web — Will Bryk

(session) [Track 3] | Track: Search & Retrieval

Every search API claiming to be "built for AI" is actually Google with a wrapper. That's a problem, because AI agents don't search like humans. A human waits 1 second for a result. An agent making 50 sequential searches at 1 second each creates a 50-second lag. That kills the product. And latency is just one dimension: agents need semantic precision, structured outputs, and a range that spans sub-200ms real-time retrieval all the way to multi-step deep research. No human-facing search engine was ever designed to do that. Will Bryk, CEO of Exa, shares what he learned building a search engine from scratch for AI. He'll cover the architectural decisions behind Exa's latency spectrum, what real usage patterns look like across companies like Cursor, Notion, HubSpot, and Lovable, and why the benchmarks the field relies on today are dangerously inadequate for evaluating agentic search. The bigger argument: search is becoming the most critical primitive in AI infrastructure, and almost no one is building it right.

11:40am-12:00pm: Claude Managed Agents workshop (Part 3) — Priyanka Phatak, Gabriel Cemaj

(session) [Track 4] | Track: Workshops Day 2

Build an agent with Claude Managed Agents

11:40am-12:00pm: We Gave an Agent Production Code Access and Then Tried to Sleep at Night — Moritz Johner

(sponsor) [Track 5] | Track: Security

We let an agent touch production code to fix CVEs. That is either automation or a supply chain incident, depending on how honest your architecture is. PatchPilot started simple: find vulnerable dependencies, patch them, open a PR, let CI prove the fix, move on. Then reality showed up. The agent needed repository access, CI logs, credentials, and a Docker socket. Without that, it was useless. With it, every security reviewer in the room had a point. This is the production case study: what we gave the agent, what we refused, what infosec pushed back on, and where they were right. We will cover scoped permissions, constrained PRs, audit trails, approval gates, CI evidence, credential boundaries, and the gap between "it generated a patch" and "we can defend this change." Agentic remediation is not just developer productivity. It is a new participant in your software supply chain.

11:40am-12:00pm: Voice Agents Can Just Do Things — Charlie Guo

(session) [Track 6] | Track: Voice & Realtime AI

Too many voice AI integrations still treat speech as fancier chat: audio in, audio out. But we're at a point where speech can be a control plane for software, and most developers are unaware that voice has become a capability overhang. Current realtime models can understand intent, call tools, speak while work is underway, recover from corrections, and decide what the user actually needs to hear. As a result, we're seeing three practical patterns emerge: voice-to-action, systems-to-voice, and voice-to-voice. We’ll show how each pattern changes the architecture, where Realtime 2’s reasoning and tool-calling matter, and why chained STT / LLM / TTS systems start to break down as the interaction patterns become richer.

11:40am-12:00pm: LLM Recsys at DoorDash — Raghav Saboo

(session) [Track 7] | Track: LLM Recsys

11:40am-12:00pm: AI tools for Forward Deployed Engineering — Vasuman Moza

(session) [Track 8] | Track: Forward Deployed Engineering

11:40am-12:00pm: Rethinking Environments for Long Horizon Work — Rayan Garg

(session) [Track 9] | Track: Data Quality

As autonomous agents push towards longer-horizon tasks, a number of challenges emerge in measuring and improving frontier model capabilities. In this talk, we discuss how long-horizon tasks are defined and measured, how RL environments and verifiers have to scale for more complex and open-ended tasks, and how we navigate these problems at Theta.

11:40am-12:00pm: Use Copilot across CLI, dev, and cloud workflows to move faster end-to-end — Pamela Fox

(sponsor) [Track M] | Track: Track M

Copilot isn't just for writing code. Learn how to use it across CLI and cloud workflows to scaffold apps, debug faster, and automate repetitive steps across your entire dev lifecycle.

11:40am-12:00pm: Agentic SDLC at Uber: Building Blocks for Uber's Software Factory — Uday Kiran Medisetty, Adam Huda

(session) [Leadership 1] | Track: AI-Native Enterprises

99% of Uber engineers are using AI every month, 70% of PRs are attributed to AI, and 15% of PRs are now done entirely by autonomous agents. In this session, we go behind the scenes to show you exactly what it takes to get there — starting with the foundational building blocks: the model gateway, MCP infrastructure, agent skills, knowledge systems, and cloud developer environments that make agentic engineering possible at scale. Then, once those foundations are in place, we show you how to assemble them into a fully agentic SDLC. We'll walk through every stage — from research and spec writing, to autonomous code generation, to verifying and validating that code before it ships, to monitoring what happens after it lands, and continuously improving it over time. With tooling example demos throughout. Whether you're just starting your agentic journey or already running agents in production, you'll leave with a concrete blueprint for what this looks like end to end.

11:40am-12:00pm: The Last Human Code Review: Building Trust in AI-Generated Code — Itamar Friedman

(session) [Leadership 2] | Track: AI Architects: Show my Workflow

By the end of 2026, asking a human to review every pull request will be as optional as asking one to run every unit test manually. The tooling will be ready. The question is whether organizations are.

In this talk, Itamar Friedman, CEO of Qodo, explains why we are approaching the end of line-by-line human code review as a default requirement and explores what has to be true for teams to get there.

The barrier was never agentic AI capability. It was trust. And trust in automated review does not come from smarter models or faster feedback loops. It comes from systems that provide a trustworthy, concise and personalized proof-of-validation report. These systems are built on how engineering teams at specific organizations write their code: their own rules and standards, their PR history, their architecture decisions, their tribal knowledge that lives in comments and conversations and gets lost when engineers leave.

Itamar will walk through the shift from PR-by-PR review toward continuous, context-based code review and governance, and share a practical approach to making human code review optional.

If your team is shipping AI-generated code faster than humans can read it, join us for the discussion.

11:40am-12:00pm: Agentic vs. Vector Search: An Eval-Driven Approach to Coding Agent Performance — Jess Wang

(session) [Expo Stage 2 NW]

Evals let you replace gut feelings with quantifiable decisions. This talk breaks the basic concepts of evals, including the four core components: datasets, tasks, scoring, and experiments. Then, to solidify the concept, we’ll walk through a real eval comparing agentic search versus vector search for coding agents. We'll also cover practical challenges like tracing Claude Code subprocess calls and why a single eval run is never enough. You'll leave with a concrete framework for building evals that actually inform your ship decisions.

11:40am-12:00pm: Agents Don't Have Coworkers, They Have Hostages — Gabriel Martinez

(session) [Expo Stage 3 SW]

Modern coding workflows are rife with vibe slop. As organizations scale, proper roles and governance systems must be well-defined to ensure a high standard of quality. How do world-class teams scale quality in a world full of slop?

11:40am-12:00pm: Would your AI agent get the job? A performance review framework for enterprise agents — Andreea Pleşea, Dan Bălăceanu

(session) [Expo Stage 4 SE]

There are dozens of ways to build an enterprise AI agent: agentic frameworks, direct LLM APIs, conversational AI platforms, vertical SaaS. They all claim to do the job. But how do you actually compare them on the same task, with the same data, against the same KPIs? This session presents a vendor-agnostic evaluation framework that treats AI agents the way enterprises treat new hires: set the role, define success criteria, run candidates through identical scenarios, and measure outcomes. The architecture uses any LLM to track positive and negative drift across agents against weighted goals, monitoring everything from hallucination rates and token consumption to user sentiment and conversation quality. Inputs are standardized. Outputs are both quantitative (accuracy, cost, hours saved) and qualitative (tone, clarity). The methodology supports continuous evaluation, not just pre-deployment benchmarks, but ongoing performance reviews that can compare agent work against human baselines. Walk away with a concrete, repeatable process for answering the only question that matters: which agent actually does the job?

12:05pm-12:25pm: What we learned by analyzing 1M AI-generated PRs — Daksh Gupta

(session) [Main Stage] | Track: Software Factories

We analyzed >1M end-to-end AI generated PRs reviewed by Greptile to understand what types of bugs they tend to create and some strategies on mitigating them. For instance, did you know that Claude Code is nearly 3X more likely than Codex to introduce auth bypass vulnerabilities?

12:05pm-12:25pm: Tethered: Our Agents Are Us — Shu Fang

(session) [Track 1] | Track: Claws & Personal Agents

Personal AI assistants have dominated the zeitgeist of late with the advent of OpenClaw. However, letting an agent run as you remotely with access to your full suite of tools terrifies us in the technical community. How then did we get comfortable with enabling this functionality firmwide at a 70 billion dollar hedge fund? This talk will go over the underlying architecture, controls, and UX that enables every employee at Two Sigma to have a remote AI Assistant that acts as us in full. With access to our entire set of internal tools. Notably, this isn't just for engineers. Every single employee gets a remote agent that assumes their identity and can take broad action on their behalf. And we're ok with it.

12:05pm-12:25pm: Modality Misalignment and Originality Attribution in Short-Form Video: A Multi-Agent Approach at Platform Scale — Aditya Gautam

(sponsor) [Track 2] | Track: Vision & OCR

Short-form video presents a class of content understanding problems that are qualitatively different from text or single-modality media. Audio, visual, and text signals within the same piece of content frequently diverge, sometimes incidentally and sometimes deliberately, creating a modality misalignment that defeats systems designed around any single signal. At the same time, the resharing dynamics of short-form video platforms create originality attribution chains that degrade quickly and are poorly captured by metadata alone. Addressing both problems at platform scale, reliably and under real latency and cost constraints, is the challenge this talk is built around. The core of the talk is the multi-agent architecture developed to address this, published at ACM WSDM 2025, and the reasoning behind its design. Each agent in the system is specialized for a distinct aspect of the problem: understanding what a piece of content is actually communicating across modalities, identifying where those modalities diverge meaningfully, and tracing originality through the resharing graph to surface attribution that platform metadata misses. We will cover the design principles behind this decomposition, the tradeoffs between specialization and complexity, the evaluation framework built to measure performance in a setting where ground truth is genuinely ambiguous, and the practical optimizations that made the system viable at scale. We will also be honest about the limitations: where the multi-agent approach added overhead that simpler baselines handled adequately, and what the boundaries of the system's reliability actually look like in production conditions. The broader takeaway is a set of principles for approaching multimodal content understanding problems where the signals are misaligned by nature rather than by exception. Attendees will leave with a framework for thinking about agent decomposition across a complex multimodal problem, a grounded understanding of how originality attribution degrades at scale and what it takes to recover it, and practical lessons about building evaluation and optimization pipelines for systems where the problem itself resists clean benchmarking.

12:05pm-12:25pm: Rebuilding the web for agents — Liad Yosef

(session) [Track 3] | Track: Search & Retrieval

AI apps are the new browsers. And the web is not ready.

For thirty years we built the web for human eyes, benchmarked by tools like Lighthouse: humans measuring human behavior. That era is ending. Bot traffic has overtaken human traffic, and we can't hand-write a benchmark for what comes next - every best practice goes stale the moment models improve.

Your next customer isn't a human with a credit card - it's an agent with a protocol, and it would rather not see your interface at all. That shift moves the UX question from how a human experiences your product to how an agent does, and how a human experiences that agent. Already, some services report their MCP traffic outpacing their web UI. The agent is rapidly becoming the main surface, and it always takes the path of least friction. Claude Code might consistently prefer PostHog over Mixpanel simply because PostHog has the better agentic surface - and Mixpanel loses customers without a human ever weighing in.

Meanwhile the agentic web protocol stack keeps multiplying, a new one seemingly every week. The harder problem isn't discovery - it's operability: whether the web can actually be run once an agent arrives, and what is the ideal stack for that. Should we lean into headless protocols, or ones like WebMCP that treat the UI as the source of truth? Does a site need to implement every new spec just to support every kind of agent?

So we stopped guessing and watched real agents work the whole journey: finding, understanding, authenticating, acting, handing back to a human. The findings go against the last year of agent-readiness advice. Agents ignore the files we built for them, reaching for docs and homepages instead - and whatever they reach, they trust and act on. But when those files are linked properly, their usage jumps 4x. The format isn't the key for the agentic web. Reachability is.

The web will never be completely headless. Some moments still demand a human: choosing a seat, comparing options, casually exploring. And agents aren't uniform - some want full headless access, others spin up a browser to fill the gaps, but that's a friction point, not a free fallback. So the web is going nearly headless, always with a human eye at the end.

This talk maps the entire agent web landscape based on findings from real agent journeys research:

  • Which protocols earn their place and which are noise.
  • Why "agent-ready" and "accessible" are the same engineering problem.
  • How MCP Apps close the last mile - and when headful protocols like WebMCP step in.
  • How to build for agent-readiness that survives the next model - not a checklist that's stale in a month.

The gap between ready and not is about to separate the relevant from the invisible.

12:05pm-12:25pm: Claude Managed Agents workshop (Part 4) — Priyanka Phatak, Gabriel Cemaj

(session) [Track 4] | Track: Workshops Day 2

Build an agent with Claude Managed Agents

12:05pm-12:25pm: Agentic Development Security — Ezra Tanzer

(sponsor) [Track 5] | Track: Security

12:05pm-12:25pm: Your Voice Agent is Just a Walkie-Talkie — Neil Zeghidour

(session) [Track 6] | Track: Claws & Personal Agents

Everyone says cascaded voice pipelines are dead and native speech models are the future. Yet production environments are still dominated by STT-LLM-TTS stacks. Reconciling the natural flow of native audio with the elite reasoning of a cascaded agent remains an unsolved systems problem. This talk dissects the brutal technical trade-offs behind that counterintuitive reality. We will break down why your voice agent is still stuck behaving like a walkie-talkie and map out the specific technical roadmap required to build full-duplex AI that actually works.

12:05pm-12:25pm: Open Q&A: LLM Recsys — Devansh Tandon

(session) [Track 7] | Track: LLM Recsys

12:05pm-12:25pm: How Forward Deployed Engineering is done at Cognition — Jia Wu

(session) [Track 8] | Track: Forward Deployed Engineering

12:05pm-12:25pm: Bugcrowd posttraining talk — David Brumley

(session) [Track 9] | Track: Posttraining & Midtraining

12:05pm-12:25pm: Scaling Code Quality: Building uReview, Uber’s Multi-Agent Code Review Engine — Will Bond, Ameya Ketkar

(session) [Leadership 1] | Track: AI-Native Enterprises

At Uber scale, human-only code reviews create massive bottlenecks, while generic AI tools overwhelm developers with noisy, hallucinated spam. This session explores the architecture behind uReview, Uber’s multi-agent AI code review engine designed strictly for high-precision feedback. Attendees will learn how we moved beyond monolithic prompts to build a modular pipeline featuring deep contextual ingestion, specialized domain agents, and a Generator-Verifier grader system. By enforcing strict confidence scoring and semantic deduplication, uReview filters out AI noise, shifting the focus from comment quantity to high-signal actionability and significantly reducing Pull Request cycle times. Talk Outline I. The Code Review Crisis at Uber Scale (0–3 mins) Establish the critical tension between engineering velocity and code quality, highlighting why standard AI implementations fail in massive monorepo environments. 1. The Monorepo Bottleneck: At Uber, thousands of engineers commit code daily. Relying solely on human reviewers creates a massive operational bottleneck, leading to reviewer fatigue, extended Pull Request cycle times, and inevitable missed vulnerabilities. 2. The Developer Spam Problem: Generic LLM integrations fail because they prioritize comment quantity over actionable quality. If an AI posts ten hallucinated suggestions on a diff, developers will simply mute the tool. AI must reduce cognitive load, not add to it. 3. The Signal-to-Noise Mandate: Defining the North Star for uReview. The goal is not to replace human reviewers, but to build an AI system that respects developer time by delivering high-precision, strictly verified code feedback. II. The uReview Architecture: A Modular Agentic Pipeline (3–10 mins) Detail the transition from a monolithic prompt approach to uReview’s sophisticated, multi-stage agentic workflow designed for enterprise codebases. 1. Deep Contextual Ingestion: A standard git diff is not enough. We discuss how uReview fetches extended context, integrating with our build systems to analyze surrounding functions, upstream dependencies, and class hierarchies before generating a single token. 2. Specialized Domain Assistants: Instead of a generalist model, uReview deploys independent AI agents. We route code to narrow, specialized agents—such as a Go Concurrency Analyzer, a Java Memory Leak Detector, or a Security Vulnerability Scanner—to ensure precise, domain-specific insights. 3. Hybrid Intelligence: Probabilistic LLMs cannot operate in a vacuum. We detail how uReview integrates deterministic tools, like Bazel dependency graphs and static linters, to ground AI suggestions in objective codebase realities. III. Engineering the Trust Layer (10–17 mins) Dive into the verification phase. This is the core engineering that filters out AI noise and ensures uReview maintains developer trust. 1. The Generator-Verifier Pattern: Implementing a Grader Model architecture. A primary agent generates code suggestions, but a secondary, high-reasoning model audits those suggestions against strict coding guidelines to catch hallucinations before they reach the PR. 2. Confidence Scoring and Suppression: We assign a numerical confidence score to every generated comment. If a comment falls below our calibrated threshold, uReview silently drops it. We explore the engineering behind suppressing low-confidence outputs to prevent tooling spam. 3. Semantic Deduplication: Technical strategies for merging overlapping warnings. If a deterministic static analysis tool and an LLM agent flag the same null pointer exception, uReview merges them into a single, concise developer instruction. IV. Operationalizing uReview at Scale (17–20 mins) Conclude by discussing the long-term governance, feedback loops, and measurable impact of running an AI review engine in production. 1. The Telemetry Feedback Loop: We embedded Useful and Not Useful rating buttons directly into the developer UI on every uReview comment. We discuss how this telemetry flows back into a curated data lake, driving continuous Reinforcement Learning from Human Feedback and prompt refinement. 2. Shifting Success Metrics: Why organizations must abandon vanity metrics like total comments posted. We measure uReview’s success through Actionability Rate (the percentage of AI comments accepted as commits) and the reduction in Mean Time To Merge.

12:05pm-12:25pm: Prototyping as Leadership: How a CTO Ships with AI Agents — Hursh Agrawal

(session) [Leadership 2] | Track: AI Architects: Show my Workflow

I am a CTO and co-founder with a toddler, 15+ recurring meetings a week, 7 direct reports, and right now—7 open pull requests across two repos. Most engineering leaders eventually hit a wall where this kind of calendar tetris forces them to stop shipping code and start communicating solely through roadmaps. But what if AI agents didn't just act as coding assistants, but fundamentally restructured how executives use fragmented time to prototype the future? In this talk, I will share the exact multi-model workflows I use to plan with one model, implement with another, and build asynchronous play-and-feedback loops that fit perfectly between meetings. You will learn how to navigate code reviews for agent-assisted executive PRs, and leverage AI to shift your leadership style from telling your team what to build to showing them functional prototypes.

12:05pm-12:25pm: Your Agent Is Lying to You About Whether It Worked — Dat Ngo

(session) [Expo Stage 1 NE]

Every span is green, every tool call returned cleanly, and the agent still regenerated the same plan 27 times before giving up invisible to any outcome metric, obvious in the trajectory. We pull up a real trace where the outcome looks healthy and the path is a disaster, then show Signal, our agent, surfacing it automatically: sweeping the project, ranking it above the noise, and linking straight to the offending trace with debugging evidence attached. The live version of the trajectory-over-outcomes argument, with a one-click path from "something's wrong" to "here's exactly where."

12:05pm-12:25pm: Why building building agent quality platforms is hard. — Hossein Niazmandi

(session) [Expo Stage 2 NW]

An eval platform is not just a test runner. You are building shared definitions of good, reliable data pipelines, labeling workflows, versioning, and trust in results across many teams and model changes. This session breaks down the hidden complexity, the common failure modes, and the design principles that make evals credible and usable in day-to-day engineering.

12:05pm-12:25pm: Can LLMs write fast multi-GPU kernels? We built a benchmark to find out. — Simran Arora

(session) [Expo Stage 3 SW]

LLMs have gotten surprisingly good at writing GPU kernels, but almost all the benchmarks measuring that progress are single-GPU. In production, communication is the bottleneck: all-reduce alone accounts for over 20% of inference latency on Llama-3.3-70B, and that gap keeps widening as compute scales faster than interconnect bandwidth. ParallelKernelBench (PKB) offers a benchmark and evaluation framework for multi-GPU kernel generation and includes 87 problems from real codebases where the task is replacing PyTorch + NCCL with a CUDA kernel that moves data directly over NVLink. We tested GPT-5.5, Gemini 3 Pro, Opus 4.7, and other frontier coding models. Under a third of problems solved were correctly, and fewer than a quarter of those beat the naive baseline. We'll cover why they fail, what the patterns look like, and a few cases where models produced kernels faster than anything publicly available, including one for NVIDIA NeMo-RL's GRPO training loop, which has no prior optimized public reference. The benchmark is open source and we want to see what you can do!

12:05pm-12:25pm: Self-Improving Agents That Teach the Company Back — Rafal Wilinski

(session) [Expo Stage 4 SE]

Agents forget too much. A run might solve a customer escalation, debug a deployment, or figure out the review pattern for a tricky code path, then the knowledge disappears into a transcript. At Runlayer, we started treating that knowledge as a product surface. Skills are reviewable, editable instructions that agents can load over MCP. An agent can start with a task, learn something useful while doing the work, and draft or update a private skill from that run. That skill loads into future runs for the same agent, stays inspectable by humans, and can eventually graduate into a team or org-level skill. The flywheel gets more interesting once a skill becomes useful beyond the agent that created it. A learned skill can move from one agent's private memory into shared organizational knowledge, then become available through the Runlayer plugin inside Claude Code, ChatGPT, and other AI clients employees already use. The agent does the work, captures the playbook, and the company gets better at that work everywhere agents are used. This talk walks through the architecture and product choices behind self-improving skills: post-run distillation, skill mutation tools, private-by-default scoping, runtime loading, UI inspection, promotion into shared skills, and the safety boundary between this agent learned something and everyone should now use it. The goal is an agent that leaves behind a better handbook for the next person, the next run, and eventually the whole organization.

1:30pm-1:50pm: Get Out of the Model's Way — Kevin Hou

(session) [Main Stage] | Track: Software Factories

From autocomplete to chat to agents to agent orchestration...how do you build a product that scales with intelligence? What core primitives enable agents to operate at the technical (and non-technical) frontier? How can you best squeeze every ounce of capability out of your agentic dev tools? I'll answer all these questions and break down how Google Antigravity creates dynamic agent teams to solve complex tasks like building an OS-Kernal and automating research workflows.

1:30pm-1:50pm: Agents' next frontier: agent-to-agent and network effects — Jean-Denis Greze

(session) [Track 1] | Track: Claws & Personal Agents

MCP v. CLI was about how agents talk to tools. That’s not settled (but we’re camp MCP… mostly). Almost nothing has settled how agents talk to each other - and that's where the next wave of value (and network effects and virality) lives. At Town we run a personal AI agent in production inside real people's inboxes, calendars, and Slack, and we've built agent-to-agent (A2A) on our platform: 1:1 A2A messaging, agents that carry a short bio of one another, HITL when sensitive data is shared or write actions are involved, and early tests around 1:N A2A. I’ll talk about the why, the opportunity, and the production architecture underneath. Audience takeaway: a concrete mental model for building multi-agent systems on top of the data and surfaces users already live in, plus our learnings on early failure modes to avoid.

1:30pm-1:50pm: From Ingestion to Agents: How Leading AI Teams Build on Document Intelligence — Adit Abraham

(sponsor) [Track 2] | Track: Vision & OCR

The agents of tomorrow are only as good as the context they reason on — yet most real-world data lives in messy, unstructured documents.

In this session, we reveal the patterns that separate AI teams shipping reliable, production-grade agents from those stuck debugging pipelines.

Drawing on patterns we've seen from AI-native startups to Fortune 10 enterprises, we'll cover what it takes to transform complex documents into clean, accurate context at scale across legal, finance, healthcare and more.

From ingestion architecture to agent-ready outputs, walk away with the strategies top teams use to turn document chaos into competitive advantage.

1:30pm-1:50pm: If we want them to do Knowledge Work, we need to design Knowledge Agents — Benjamin Clavié

(session) [Track 3] | Track: Search & Retrieval

It's tempting to assume that just like agents revolutionised coding, they will revolutionize other areas: legal, finance, advertising, and even medicine. All of those have in common that they are fundamentally knowledge work. And thankfully, humans have spent thousands of years searching for the best possible workflows for knowledge work. And yet, we seem to be disregarding all of these learnings, forcing every knowledge task into the shape that worked for coding. Today, we're going to talk about the history of knowledge work and how tools were co-designed to support it to understand how we should be building Knowledge Agents, themselves co-designed with their Knowledge Tools. This is key to avoiding falling into a "good enough" local optimum: think about legal clerking, a core part of the legal industry where information gathering and reasoning is performed to support the work of senior lawyers. The practice of clerking follows its own code, rules and best practices, which could not have feasibly emerged from studying software engineering: and similarly, there is no reason to believe knowledge agents could emerge from coding agents.

1:30pm-1:50pm: Everybody Gets a Digital Clone! (Part 1 of 3) — Neil Zeghidour

(session) [Track 4] | Track: Workshops Day 2

Walk out of this workshop with a deployed digital clone that makes your phone calls for you. We will skip the theory and immediately get our hands dirty wiring together OpenClaw, Twilio, and Gradium to build an autonomous voice agent on a live cellular network. You will tackle the hardest parts of real-time telephony: routing audio streams, handling human interruption, and killing latency. In 60 minutes, your AI will be ready to call restaurants for the daily special, book appointments, and actively negotiate on your behalf.

1:30pm-1:50pm: Using LLMs to Secure Source Code — Eugene Yan

(sponsor) [Track 5] | Track: Security

Models are now finding and fixing real vulnerabilities at scale. Drawing on Anthropic's work with security teams, this talk walks a six-step workflow — threat model, sandbox, discover, verify, triage, patch — through one running example, shows where orgs actually bottleneck, and gives you a copy-paste path to your first scan.

1:30pm-1:50pm: Tolan: Voice-First AI Companion — Paula Dozsa

(session) [Track 6] | Track: Voice & Realtime AI

1:30pm-1:50pm: From approval loops to autonomous agents with Docker pt1 — John Craft

(session) [Track 7] | Track: LLM Recsys

You've invested in the best models, coding agents, and AI tooling. Now comes the hard part: unlocking autonomous development without creating security headaches, governance gaps, or endless approval loops.

In this 90-minute hands-on workshop, you'll learn how to run coding agents in isolated environments built for autonomous work, create a 'golden path' for AI-assisted development across your organization, reduce software supply chain risk with secure, hardened containers, manage multiple agents with the right permissions and guardrails, and scale AI-powered development without slowing developers down.

1:30pm-1:50pm: The Dirty Secret of Forward Deployed Engineering — Natalie Meurer

(session) [Track 8] | Track: Forward Deployed Engineering

Since its origins at Palantir, the term "Forward Deployed Engineer" has described wildly different jobs, yet today it's one of the fastest-growing roles in AI. What happened? And what does that reveal about the future of engineering?

Join Nat Meurer, Head of Agent Engineering at Sierra, for a historical tour of one of tech's most misunderstood roles, and why its biggest contradiction may explain where the industry is headed next.

1:30pm-1:50pm: The Base Model is Dead — Varun Singh

(session) [Track 9] | Track: Data Quality

It's a common belief that large language models are trained to be a good model of human web-text, and thus base models are "mirrors" of what we see on the internet. Historically, this was largely true, but no modern base model truly reflects the internet in the way that GPT-3 once did. Instruction data along with synthetic reasoning traces are moving earlier and earlier into the training pipeline, and "mid-training" has emerged as a new stage to accommodate longer datapoints that more concretely resemble downstream capabilities. As a result, pre-training no longer has the goal of creating a linguistic prior, but instead has the additional goals of baking in behavior and more atomic skills into the trained "base" model. Between this shift in what a base model is and the blurring of the lines between the different stages of model training, it's an open question as to what the best approach is here (at least outside the walls of the big labs). But I believe that the role we view the base model playing will continue to shift as we're pulled forward through new phases of model capabilities.

1:30pm-1:50pm: Modernize CI/CD using agent-assisted workflows that reduce manual debugging — Salil Subbakrishna

(sponsor) [Track M] | Track: Track M

AI agents are reshaping CI/CD. See how workflows become adaptive—understanding failures, fixing issues, and accelerating releases without constant manual intervention.

1:30pm-1:50pm: Spin at the Gate Until Green: The Engineering Primitives Behind Self-Driving Codebases — Andrew Orobator

(session) [Leadership 1] | Track: Software Factories

Most AI-assisted development fails the same way: the AI produces plausible output, the human can't tell if it's right, so they check manually, find the problem, re-prompt, and repeat. This loop doesn't scale. There's a different approach. If you can express correctness as a binary — does it compile, do the tests pass, does the lint check clear — you can remove the human from that loop entirely. The AI submits. The gate checks. If red, it adjusts and resubmits. Spin at the gate until green. This talk covers the engineering primitives that make this possible: personas (consistent behavior at the agent level), skills (composable, reusable prompt modules), worklogs (accountability across sessions), postmortems (turning failures into constraints), and spec-driven development (making the target explicit enough for a machine to hit it). The culmination is a flag lifecycle agent — triggered by a cron job, cleaning up stale feature flags, verified by compile + test + lint, no human in the loop. Not hypothetical. Working prototype, proven in practice. I co-authored a ten-part series on this methodology with Claude. The series was built using the workflow described in this talk. If you don't trust the theory, the fact that this talk exists is the proof.

1:30pm-1:50pm: Serving 2 Million Models Without Melting: Scaling the Hugging Face Hub — Arek Borucki

(session) [Leadership 2] | Track: AI Architects: Show my Workflow

Hugging Face hosts over 2 million public models, 500,000+ datasets, and serves 13 million users across 50,000+ organizations, including over 30% of the Fortune 500. That growth didn't come with a manual.In this talk, we'll pull back the curtain on the infrastructure decisions that kept the Hub fast and reliable as traffic grew by orders of magnitude. We'll dive into why we chose MongoDB Atlas as our core data layer, how its document model maps naturally to the messy reality of ML model metadata, and what it took to keep p99 latency low when every request hits a catalog of millions. We'll also cover the trade-offs we faced, the things that broke along the way, and what "lean operations" actually means when your platform serves a third of the Fortune 500. Expect real architecture decisions, real numbers, and lessons you can take back to your own stack.

1:30pm-1:50pm: Every Agent, Everywhere, All at Once — Vlad Luzin

(session) [Expo Stage 1 NE]

Coding agents are deaf to anything outside their own session, and a LangGraph or CrewAI one has no idea the others exist. Different vendors, different frameworks, different machines none of them share a way to work together. This demo fixes that live: the Claude Code on your laptop, Codex on your colleague's, a LangGraph agent you're running locally, and the OpenClaw on your Mac Studio at home collaborating on the same goal, going back and forth, full-duplex, across every vendor, framework, and machine line at once.

1:30pm-1:50pm: Designing Evals That Earn User Trust — Felipe Blanes

(session) [Expo Stage 3 SW]

Most teams measure their agent against a benchmark, ship it, and hope. But when your agent serves real users, a benchmark won't tell you if it's actually working. This session is about building an eval suite that captures what success looks like in production, runs against real user workflows, and feeds back into product decisions. Here's the flywheel we use in practice: start with what success looks like from the user's perspective, instrument production workflows to capture those signals, diagnose where the agent falls short, and feed those insights into the next thing you build. You'll see how it shaped concrete product bets, turning eval results from a report card into a discovery tool.

1:30pm-1:50pm: Stop prompting — Greg Pstrucha

(session) [Expo Stage 4 SE] | Track: Expo Stage 4

In this talk I dive into usage of tooling, type systems and frameworks to enforce guardrails and limit slop produced by AI agents inside large codebases.

1:55pm-2:15pm: Self-Improving software factories: The new open source model" — Zach Lloyd

(session) [Main Stage] | Track: Software Factories

Alt titles: Agent orchestration with message passing / Agent orchestration for every model / Warp’s approach to agent orchestration With models getting more capable, we’ve quickly scaled from single agent problems to multi-agent problems – How can agents delegate tasks to accomplish ever-larger goals? You may have heard of “agent swarms” or “agent teams” in this arena, but they come with drawbacks: model lock-in, complex UX, or both. We want to share how we’ve tackled orchestration with our model-agnostic platform, Oz. Our approach has some unique goals: - Support any model, and any harness (claude, codex, etc) - Delegate across local instances and across isolated cloud sandboxes - Provide a UX that requires zero tmux or TUI knowledge to use We’ll explore how we implemented message passing across harnesses, how we handle agent sandboxing with Docker containerization + serverless deploys, and how we designed these primitives to make a system that works with any agent. You’ll walk away with a clear outline of how to build agent orchestration well. Plus, we invite you to try our Oz orchestration platform and tell us what you think. Talk format: Primarily a tech demo and code walkthrough. We’ll show multiple examples of tasks that are best served by delegation, and show both local and cloud-based runs. We’ll also walk through the design of our message passing implementation at a high level to show how it works.

1:55pm-2:15pm: Claude for long-horizon tasks — Lance Martin

(session) [Track 1] | Track: Claws & Personal Agents

Claude is capable of long horizon tasks. In this talk, we'll share lessons learned about building agent harnesses for reliable and secure long-horizon work. This include decoupling the brain and hands, self-verification, self-learning, and design for evolving agent harnesses.

1:55pm-2:15pm: The Best Models Still Reason Like Toddlers — Andrew Dai

(sponsor) [Track 2] | Track: Vision & OCR

Frontier AI models score 80–90% on standard benchmarks like RKGI, yet when tested on visual tasks any 3-year-old handles effortlessly (like counting objects in an image), those same models fall to pieces. I watched this gap widen firsthand during my 14 years at Google Brain and DeepMind, where I co-led development on GLaM, PaLM 2, and Gemini. The problem is that most models hit high RKGI scores not through genuine visual understanding, but by coding – a workaround that scores well and reveals little. Strip that away and you're left with systems that struggle to solve a simple crossword puzzle, identify what's the same or different across two images, or navigate a basic 3D view. These tasks are essential to achieve human-level reasoning capability. And the current benchmark ecosystem wasn’t built to evaluate for it, leaving us with top scoring models that can’t even follow along with Count Von Count. In this talk I'll dig into why the current eval landscape systematically overstates capability, the structural reasons it does so, and how we got here from the viewpoint of someone who was inside a leading frontier lab. I'll close with what I believe a more rigorous, consensus-driven eval framework needs to look like, and why the field needs to build one before the next generation of visual systems ships into the real world. Fixing visual reasoning starts with fixing how we measure it. For engineers building on top of these models today, whether that's document understanding, robotic perception, medical imaging, or any system where visual perception context matters, the cost of getting this wrong is already showing up in production.

1:55pm-2:15pm: Your Agreements Are a Database You Can't Query. We're Fixing That — Hiral Shah, Sean Sodha

(session) [Track 3] | Track: Search & Retrieval

Agreements power every enterprise business, but the most critical data — pricing schedules, SLA obligations, rate cards — is often trapped in tables that traditional extraction tools destroy.

This session shows what changes when you can actually extract that data accurately at scale and make it searchable.

We'll walk through the before and after:

Before: Contract tables require manual review. Rate cards are buried. SLA terms are scattered across exhibits. Procurement teams spend hours piecing together pricing structures — and searching for specific terms means opening every document.

After: Tables are automatically extracted, structured, and queryable. Operations teams can surface SLA notification requirements on demand. Legal can answer "what hourly rate did we agree to?" in seconds.

Docusign will share what we've achieved evaluating NVIDIA Nemotron Parse for our document processing pipeline, including how we tested against real enterprise contracts (not synthetic benchmarks), why we're serving the model via vLLM, and what it takes to turn extracted table data into searchable, retrievable agreement intelligence.

NVIDIA will cover the architecture behind Nemotron Parse and where the model is heading — including how NeMo Retriever's embedding and reranking models connect extracted data to search and RAG-based applications.

Attendees will leave with a realistic view of where vision-language models excel at document understanding, where the gaps remain, and how to think about building searchable contract intelligence into their own systems.

1:55pm-2:15pm: Everybody Gets a Digital Clone! (Part 2 of 3) — Neil Zeghidour

(session) [Track 4] | Track: Workshops Day 2

Walk out of this workshop with a deployed digital clone that makes your phone calls for you. We will skip the theory and immediately get our hands dirty wiring together OpenClaw, Twilio, and Gradium to build an autonomous voice agent on a live cellular network. You will tackle the hardest parts of real-time telephony: routing audio streams, handling human interruption, and killing latency. In 60 minutes, your AI will be ready to call restaurants for the daily special, book appointments, and actively negotiate on your behalf.

1:55pm-2:15pm: Dual-Surface Architecture: Serving Humans and Agents from the Same Tool Layer — Ethan (Jung Min) Cha

(sponsor) [Track 5] | Track: Security

Every enterprise AI talk right now is about capability. Almost none are about containment. That's the gap this talk fills, because it's where regulated deployments actually die. The Deterministic Harness is the set of rigid rails around a model: schemas, data contracts, tool boundaries, and audit paths. These rails are what turn a probabilistic model into a deployable enterprise asset. The idea isn't new. Aviation wraps pilots in envelope protection. Nuclear wraps reactors in passive safety. Banking wraps algorithmic trading in transaction limits. Every regulated industry figured out the same thing eventually: high-variance systems only become deployable when wrapped in low-variance containment. Enterprise AI is catching up, not inventing. I'll walk through the single governed MCP and API server we built at Carlyle, and the architectural decisions behind it. You'll leave with four things: 1. A phased rollout model where each phase earns the next. Moving from locked-down reads to trusted writes isn't risk mitigation. It's trust compounding. Each phase generates the observability that underwrites the autonomy granted in the next one. Skip a phase and you don't save time. You destroy the evidence base that would have justified the next step. 2. One contract, two surfaces. A single data layer that serves both the human UI and the agent. The institution then has exactly one answer to any question either might ask. When the agent and the UI disagree, users lose trust in both. 3. An intent based feedback loop that captures what LLM providers structurally cannot. The gap between what users tried to accomplish and what the system actually delivered is invisible to Anthropic, OpenAI, and Google. Only the harness owner sees it. We close that loop back into the governed server, and it compounds into differentiation that model providers cannot replicate from where they sit. 4. The failure modes we hit and what we'd redesign. A pre mortem folks will inherit for free, from two regulated industries where a wrong answer has a named owner.

1:55pm-2:15pm: 5 Voice Agent Failure Modes You'll Hit in Week One — Venky B, Vyas A

(session) [Track 6] | Track: Voice & Realtime AI

Building a voice agent that demos well is easy now. The hard part starts the second a real person calls it. Most voice agents today are basically a chatbot with a microphone bolted on, they listen, then think, then talk, one side at a time, like a walkie talkie. Real conversations don't work that way. People pause in the middle of a thought, they say "um" and "uh", they talk over you, they change their mind halfway through. The agent has to work out when you're actually done talking, when it should stop talking, and when you've said something it cannot afford to get wrong, like your phone number or email. None of this shows up when you test with text. All of it shows up in week one.

This talk is the five failures that hit every team in that first week, the ones we see again and again. For each case we will walk though examples and best practices for what actually breaks and what to do about it. If you're about to put a voice agent in front of real callers, or you already did and it's quietly falling apart, this is the talk that saves you the weeks everyone else burns figuring it out

1:55pm-2:15pm: From approval loops to autonomous agents with Docker pt2 — John Craft

(session) [Track 7] | Track: LLM Recsys

You've invested in the best models, coding agents, and AI tooling. Now comes the hard part: unlocking autonomous development without creating security headaches, governance gaps, or endless approval loops.

In this 90-minute hands-on workshop, you'll learn how to run coding agents in isolated environments built for autonomous work, create a 'golden path' for AI-assisted development across your organization, reduce software supply chain risk with secure, hardened containers, manage multiple agents with the right permissions and guardrails, and scale AI-powered development without slowing developers down.

1:55pm-2:15pm: How Forward Deployed Engineering is done at Decagon — Sunny Rekhi

(session) [Track 8] | Track: Forward Deployed Engineering

1:55pm-2:15pm: Ending AI Slop — Thais Castello Branco

(session) [Track 9] | Track: Data Quality

1:55pm-2:15pm: AI Evals Platform for Cross-Functional Teams at Scale — Nachiket Paranjape, Swaroop Chitlur Haridas

(session) [Leadership 1] | Track: AI-Native Enterprises

DoorDash's Evals Platform is designed for more than just engineers. It brings human review, automated judges, and online experimentation into a single calibration loop so engineering, product managers, and strategy and operations teams can all contribute to improving AI quality. Engineers can instrument, trace, and evaluate agent behavior, while cross-functional teams can review outputs, curate trusted examples, and provide structured feedback that improves how automated judges behave over time. By combining experimentation, fully customized annotation workflows, calibration, and analytics in one system, the platform turns AI quality from a fragmented technical exercise into a shared operating model for continuously improving agent performance and making rollout decisions with confidence. While vendor platforms offer pieces of this workflow, we needed something broader: a unified system that lets engineers, product managers, and Strategy & Ops all participate directly in improving AI quality. Our goal is not just to run evals, but to enable cross-functional teams to review outputs, calibrate judges, run experiments, and make rollout decisions without being blocked on engineering. That requirement, along with tighter integration into our internal workflows and operating model, is why we are building this platform in-house.

1:55pm-2:15pm: IT Admin for the AI Workforce: Why Your AI Agents Will Need Their Own IT Department — Sarthak Aggarwal

(session) [Leadership 2] | Track: AI Architects: Show my Workflow

Every enterprise will soon run two workforces - human and AI. Humans already have IT departments managing their identities, access, incidents, and compliance. Who manages all that for your fleet of 10,000 AI agents? Nobody. Yet. At Decawork AI, we started by building autonomous IT resolution for human employees - a dual-agent system where the agent that thinks can't act and the agent that acts can't improvise. We're live in production across multiple enterprises - autonomously resolving incidents across identity systems, security platforms, endpoint infrastructure, and collaboration stacks. But here's what we discovered: the patterns for managing human IT - identity lifecycle, access governance, incident resolution, audit logging - are the exact same patterns you'll need to manage AI agent fleets at scale. The next massive infrastructure layer isn't AI agents doing work. It's AI agents managing other AI agents. This talk covers the architecture, the production war stories, and the thesis: IT Admin for the AI workforce is an inevitability, and we're building it now.

1:55pm-2:15pm: Who Approved That MCP Server? Governing the Tool Layer — Jim Clark

(session) [Expo Stage 1 NE]

Your developers are installing MCP servers faster than security can review them. An unvetted server is a direct line to your data. This talk shows how the Docker MCP Gateway puts every server and tool behind one org-managed catalog: vetted, signed, default-deny on anything unapproved, governed by the same policy engine as network and filesystem. Walk away with a hands-on demo: stand up a catalog, block an unvetted server, and watch policy enforce at the runtime.

1:55pm-2:15pm: Voice Agents Are Mostly Invisible. Here's How to See Them. — Fuad Ali

(session) [Expo Stage 2 NW]

Voice agents are one of the fastest-growing and hardest-to-debug categories: the failures live in latency, turn-taking, transcription drift, and tone none of which show up in a text log. We demo Voice traces and Session views, following a real voice session span by span, and Voice evals for scoring what text-only observability can't reach. A short, differentiated session on a problem most of the room is about to hit and few tools address.

1:55pm-2:15pm: what we learned by analyzing 1M AI generated PRs

(session) [Expo Stage 3 SW]

Background coding agents are quickly moving from novelty to real-world software development workflows. Based on Greptile’s analysis of millions of pull requests across 65,000 organizations, this talk explores how often end-to-end AI-generated Pr's are being used and how their quality compares to human-written code. The data shows detectable agent-generated Pr's grew from under 1% in February 2025 to 27.6% in April 2026, with early quality signals like revert rates and code churn suggesting these agents may already be competitive in serious codebases.

1:55pm-2:15pm: Deploying browser agents at scale — Derek Meegan

(session) [Expo Stage 4 SE]

Not every browser agent trajectory is the same, and treating them like they are is how teams quietly burn budget on agents that never ship. This talk walks through the two trajectory types behind every browser agent, the cost/performance/maintainability tradeoffs that decide whether they hold up, and the concrete patterns for evaluating, hardening, and iterating on them.

2:25pm-2:45pm: We're the bottleneck, but we don't have to be — Ido Salomon

(session) [Main Stage] | Track: Software Factories

As agents improve at doing real work, humans become the real bottleneck. Luckily, the skills we need to work with agents aren’t entirely new, they've just been hiding in unexpected places. Drawing lessons from AgentCraft’s Warcraft-inspired UI for coordinating multiple agents, this talk explores how gamification can raise the ceiling for sophisticated AI orchestration while lowering the floor for everyday developers. Ido will show how visual state, spatial metaphors, and autonomy can make multi-agent systems more approachable, inspectable, and fun to use.

2:25pm-2:45pm: From coding to Knowledge work agents — Karan Vaidya

(session) [Track 1] | Track: Claws & Personal Agents

MCP, skills, Cli - so much noise - what’s the best way for agents to communicate

2:25pm-2:45pm: You’re Not Thinking Big Enough: Rebuilding Food Systems from First Principles with AI Agents — Cody Menefee

(sponsor) [Track 2] | Track: Vision & OCR

Most of the AI world is still thinking too small. We’re building SaaS wrappers and GTM agents while real-world systems are still run through fragmented knowledge, delayed feedback, and human guesswork. In this talk, I’ll show how I’m building an outdoor agentic system for pasture-raised livestock operations using LLMs, a Firecrawl-curated knowledge base, drone and satellite imagery, and geo collars to monitor pasture, guide animal movement, and support better decisions across cattle, sheep, poultry, and more. I’ll cover the architecture, retrieval and grounding, human approval loops, and what broke first: hallucinated confidence, weak environmental grounding, sparse evals, and the gap between a smart answer and a safe action. It’s a case study in building agents for the physical world, and a broader argument that AI’s real upside is in rethinking real-world systems from first principles.

2:25pm-2:45pm: How to Connect AI to Billions of Legal Documents — Simon Eskildsen, Jacob Lauritzen

(session) [Track 3] | Track: Search & Retrieval

Legora’s foundational engineering challenge is connecting frontier LLMs to billions of legal documents so the models can efficiently solve end-to-end legal workflows without burning extra tokens. We’ll share the retrieval architecture we built with turbopuffer that achieves: 1. Strict data isolation across millions of legal cases in a very security-conscious domain 2. Predictable search performance (<100ms p90 latency) on large contexts 3. High retrieval quality (95%+ recall@10) with fewer agent loops We’ll retrospect on two architectures that failed to achieve all 3 (and why), and the key design factors that make the current solution work at our scale. Practical takeaways include: - How to evaluate per-tenant vs shared-index retrieval under strict data isolation - How to efficiently index and retrieve context to maximize relevance per input token - How to build a highly intelligent AI application when your inference budget is constrained

2:25pm-2:45pm: Everybody Gets a Digital Clone! (Part 3 of 3) — Neil Zeghidour

(session) [Track 4] | Track: Workshops Day 2

Walk out of this workshop with a deployed digital clone that makes your phone calls for you. We will skip the theory and immediately get our hands dirty wiring together OpenClaw, Twilio, and Gradium to build an autonomous voice agent on a live cellular network. You will tackle the hardest parts of real-time telephony: routing audio streams, handling human interruption, and killing latency. In 60 minutes, your AI will be ready to call restaurants for the daily special, book appointments, and actively negotiate on your behalf.

2:25pm-2:45pm: Agentic Security: Permissions, Provenance, and the Agent Supply Chain — Steve Yegge

(sponsor) [Track 5] | Track: Security

As AI agents move from demos into production engineering workflows, the security boundary shifts from code alone to the permissions, tools, prompts, dependencies, credentials, and orchestration layers that agents can touch. This talk frames agentic security broadly: least-privilege agent permissions, sandboxing and capability design, provenance for agent-generated changes, risks in agent/tool/package supply chains, and practical patterns for keeping autonomous coding and operational agents auditable and containable.

2:25pm-2:45pm: I Monitored Crime Audio. Voice Agents Scare Me More. — Sumanyu Sharma

(session) [Track 6] | Track: Voice & Realtime AI

Bad voice-agent calls are starting to look less like QA bugs and more like incident scenes. I learned that instinct at Citizen, where noisy radio, ambiguous speech, fast-moving incidents, and real-time alerts became information people might actually act on. That work was stressful for obvious reasons. Voice agents scare me more. Not because they sound creepy. Because they sound good enough that people trust them. And now they are connected to calendars, CRMs, EHRs, reservation systems, refunds, transfers, account data, and support workflows. At Hamming, we monitor more than 10,000 voice agents and have analyzed millions of calls. The weird thing you learn at that scale is that production voice agents do not usually fail like demos. They fail quietly. The agent sounds natural, but misses a two-word answer. It handles the happy path, but loses the plot when the caller interrupts. It says the address was updated, but no tool call happened. It supports six languages, but gets worse at the switch point between two of them. This talk is about treating every bad voice-agent call like an incident scene. The evidence is there if you collect it: transcript, waveform, latency waterfall, interruption points, ASR uncertainty, tool trace, system-of-record state, and post-call outcome. At Tesla, I learned that autonomous systems need release gates and regression loops before they hit the real world. At Citizen, I learned that messy audio becomes safety-critical when people act on it. Voice agents need both instincts. The takeaway is a voice-agent forensics loop. What did the caller say? What did the agent think happened? What did the tool actually do? What does the system of record say? And how do we turn that weird production failure into a regression test before it happens 10,000 more times?

2:25pm-2:45pm: From approval loops to autonomous agents with Docker pt3 — John Craft

(session) [Track 7] | Track: LLM Recsys

You've invested in the best models, coding agents, and AI tooling. Now comes the hard part: unlocking autonomous development without creating security headaches, governance gaps, or endless approval loops.

In this 90-minute hands-on workshop, you'll learn how to run coding agents in isolated environments built for autonomous work, create a 'golden path' for AI-assisted development across your organization, reduce software supply chain risk with secure, hardened containers, manage multiple agents with the right permissions and guardrails, and scale AI-powered development without slowing developers down.

2:25pm-2:45pm: How Forward Deployed Engineering is done at Ramp — Leo Mehr

(session) [Track 8] | Track: Forward Deployed Engineering

2:25pm-2:45pm: Scaling to Long-Horizons: Algorithms, Environments, Compute — Ross Taylor, Chengxi Taylor

(session) [Track 9] | Track: Data Quality

What does it take to scale language models to year long tasks? In this talk we'll cover the algorithm, environment and compute considerations for scaling language models to long horizons. We'll cover the latest reinforcement learning approaches, how to build hard, high-fidelity long-horizon environments, and how to build scalable infrastructure for these tasks.

2:25pm-2:45pm: Using AI tools to teach old apps new tricks — Maria Bledsoe

(sponsor) [Track M] | Track: Track M

Becoming AI-ready starts with modernizing your legacy systems and technical debt — and keeping them modernized. We’ll show how you can use agentic AI to take on the hardest parts of modernization: analyzing large codebases, mapping dependencies, planning upgrades, refactoring safely, while doing it all at scale with enterprise controls. With GitHub Copilot modernization capabilities, you can move from legacy complexity to modernized apps in days, not months.

2:25pm-2:45pm: Productionizing LLM Gateways: Architecture, Tradeoffs, and Hard Lessons from the Trenches — Kanish Manuja

(session) [Leadership 1] | Track: AI-Native Enterprises

As organizations scale their use of large language models, the biggest challenge is no longer prompting, it’s productionizing. This session dives deep into building and operating an LLM gateway that sits between applications and model providers, handling routing, observability, cost control, reliability, and safety at scale. Drawing from real world experience, this talk breaks down the architecture of a production LLM gateway, including model abstraction layers, request orchestration, fallback strategies, caching, rate limiting, and evaluation pipelines. We’ll explore hard tradeoffs such as latency vs. cost, quality vs. determinism, and vendor lock-in vs. flexibility. Attendees will leave with concrete design patterns, failure modes to avoid, and a mental model for turning LLM experiments into resilient, scalable systems.

2:25pm-2:45pm: The Era of Compound Engineering — Kieran Klaassen

(session) [Leadership 2] | Track: AI Architects: Show my Workflow

Most codebases get harder to work with every year. Yours doesn't have to. Compound Engineering is a philosophy where each unit of work – every bug fix, every feature, every code review – makes the next one easier. This talk is about how that shift changes everything: from how fast you ship to how many engineers you actually need. --- At Every, we run five products with single-person engineering teams. That's not a headcount accident – it's a system. When I built Cora, I wanted to find out how much one engineer could do with the right AI workflows. The answer became the Compound Engineering philosophy, now with 17k stars on GitHub. Traditional codebases accumulate complexity. Compound codebases accumulate capability. Bug fixes eliminate entire categories of future bugs. Patterns become tools. Over time, the codebase gets easier to understand, easier to modify, and easier to trust. You'll walk away with: - The mental model behind compound engineering - Concrete patterns for making every PR compound - How to scale output without scaling headcount

2:25pm-2:45pm: Beyond Golden Signals: Monitoring in the Age of GenAI — Marina Petzel

(session) [Expo Stage 1 NE]

The four golden signals (Latency, Errors, Traffic, Saturation) have been the foundation of application monitoring for years, and it still matters, but for GenAI applications, these signals alone leave significant blind spots. A request can return 200 OK with low latency while the response hallucinates, leaks PII, or costs much more than expected. This talk will walk you through what changes when you're monitoring non-deterministic, token-priced, prompt-injectable systems. We'll cover three additional monitoring dimensions: Cost (token attribution, model-mix tracking, wasted spend on failed requests), Safety (prompt injection detection, PII scanning, jailbreak attempts), and Quality (hallucination rate, relevance scoring, user satisfaction) and show why each one is necessary alongside your existing instrumentation.

2:25pm-2:45pm: Build agents fast with GitHub Copilot (from idea to working app) — Idan Gazit

(session) [Expo Stage 2 NW] | Track: Expo Stage 2

2:25pm-2:45pm: Building agents is trivial now, context is the next frontier — Jeff Ng

(session) [Expo Stage 3 SW]

Standing up an agent used to be the hard part. A new class of cloud-agent frameworks has made it almost trivial: in an afternoon you can ship a fleet that reasons, plans, and calls any API you point it at. So why do so many of them fail the moment they touch real work? Because a capable agent still doesn't know the organization it operates in: its decisions, history, incidents, and how a particular team actually operates. That knowledge isn't in the model or the API, and no amount of construction adds it.

This talk exposes the missing component, then shows how to build it live on a real workflow — the same move that helps a coding agent helps a support or operations one. Construction is solved. The missing context, tacit and tribal knowledge is the bottleneck that's left, and it sits upstream of everything verification attempts to catch after the fact.

2:25pm-2:45pm: Continuous Engineering: Software Development for the Age of Agents

(session) [Expo Stage 4 SE]

AI has changed everything about how we write code. But the hard parts of building software have gotten even harder: aligning your team, maintaining architectural integrity, and worst of all, reviewing the oceans of agent-driven code. The tools and processes we rely on git pull requests; code review were built for emailing patch files. We need a new paradigm. In this talk, we're going to explore Continuous Engineering, a new approach to software development that treats the agent thread as the core unit of collaboration. Branches should be as cheap as ideas, code should carry the context of the conversation that generated it, and the work should be available to your colleagues (and their agents) as it happens. We'll walk through what this looks like in practice, and what we're building to make it possible.

2:50pm-3:10pm: Notion's Token Town — Sarah Sachs

(session) [Main Stage] | Track: Software Factories

2:50pm-3:10pm: Your company brain will leak secrets. Here's how we stopped it for big banks and ourselves. — Tanmai Gopal

(session) [Track 1] | Track: Claws & Personal Agents

Everyone wants a shared "company brain", one single AI that knows everything the org knows. But it's nearly impossible to build one, because the moment AI scrapes everyone's data into one place, a single wrong answer to the wrong person is a breach. The downside of modifying a above-my-pay-grade shared skill, or leaking confidential information to the wrong colleague is catastrophic. Ergo, company brain projects can only ever ship to the few people who already had access to everything, or stay hobbled with strictly public information (eg: River at Shopify). We've been building one for the last year and have successfully deployed for Fortune 100 banks, for distributed-operations orgs with global scale, and for ourselves as a 70-person AI-native startup. I'll leave you with a blueprint covering how we solved the following problems: 1. Permissions for shared data and tools 2. A shared context layer (skills, knowledge, semantic layer) with its own access control 3. Scoping the blast radius of wrong context 4. Auto-learning without auto-leaking If your company brain effort has been blocked by security, compliance, or just a healthy fear of the intern asking the AI a question and getting back the exec comp table, this is the talk.

2:50pm-3:10pm: From VLM/VLA's to Embodied Agents — Armen Aghajanyan

(sponsor) [Track 2] | Track: Vision & OCR

2:50pm-3:10pm: Where RL Will Take Search — Maximilian-David Rumpf, Lotte Seifert

(session) [Track 3] | Track: Search & Retrieval

Search is having its Bitter Lesson moment. By turning search into an RL problem, we can finally scale search quality with compute! RL is extremely sample efficient when compared to classical search training objectives and we see no ceiling to how far we can scale this new paradigm. We cover the training of SID-1, the first RL-trained search model, and how search will look like post-RL.

2:50pm-3:10pm: Setting Yourself Up for Success — Part 1 — Jason Liu

(session) [Track 4] | Track: Workshops Day 2

I will walk you through the process of understanding how Codex works as a general tool to control your computer (setting up your memory vault/ assistant threads, prompting it to talk to other threads, and exploring computer use), how to think about things like long running work streams, and preparing yourself to start thinking in loops.

2:50pm-3:10pm: It's 10pm. Do You Know Where Your Agents Are? — Kim Maida

(sponsor) [Track 5] | Track: Security

Agents right now can sign legal contracts, run untethered, manage your dating profile, conduct financial transactions, and push code to production. Most agents have long-lived API keys and are dangerously overprivileged even when they're not making requests. In this talk, I'll demo how to solve the problem with the right access at the right time. You'll walk away knowing how to control agent access whether you're running coding agents from the CLI, building MCP servers, or connecting agents to third-party APIs.

2:50pm-3:10pm: Realtime Voice Agents with Frontier Intelligence — Bohan Li

(session) [Track 6] | Track: Voice & Realtime AI

Dive into how the EliseAI voice agent harness orchestrates multiple models with jagged capability profiles to achieve realtime latency without sacrificing intelligence. Reduces p90 effective latency overhead of ASR, TTS, and tool calling to sub 200ms, unlocking frontier models like GPT 5.5 for voice. ### ASR: Eager Speculative Transcription We introduce speculative transcription by pairing local Whisper or Parakeet fine-tunes for speed with API models like Scribe, Nova, or Gemini Flash for accuracy. A local content match classifier operates at sub 10ms latency, allowing us to immediately trigger the downstream pipeline from the fast local transcription and dynamically replace text with the more accurate transcription if significant differences occur. This process runs on a eager 100ms VAD delay, securely releasing the generated response audio only after a fixed silence threshold has passed. ### LLM: Async background tool injection To eliminate expensive tool calling round trips, we implement system leveraging async background tool injection where the primary model makes no direct tool calls. Instead, local fine-tuned tool-calling models continuously observe the realtime transcription stream in the background. "Fake" tool call traces are then injected into the primary LLM’s context, which primes it for immediate, one-shot response generation. ### TTS: Prefix caching and infilling Many Agent responses start with the same set of 3-6 words. We can cache this audio, releasing it immediately while we infill the remaining response audio conditioned on this prefix to preserve speech prosody. With this approach, a relatively small cache can achieve a 90% hit rate across a wide range of voices, languages and model providers.

2:50pm-3:10pm: From approval loops to autonomous agents with Docker pt4 — John Craft

(session) [Track 7] | Track: LLM Recsys

You've invested in the best models, coding agents, and AI tooling. Now comes the hard part: unlocking autonomous development without creating security headaches, governance gaps, or endless approval loops.

In this 90-minute hands-on workshop, you'll learn how to run coding agents in isolated environments built for autonomous work, create a 'golden path' for AI-assisted development across your organization, reduce software supply chain risk with secure, hardened containers, manage multiple agents with the right permissions and guardrails, and scale AI-powered development without slowing developers down.

2:50pm-3:10pm: Forward Deployed Engineering 101 — Kevin Bai

(session) [Track 8] | Track: Forward Deployed Engineering

2:50pm-3:10pm: When Will The Benchmaxxing Plague End? — Nick Heiner

(session) [Track 9] | Track: AI Architects: Show my Workflow

Model releases are heralded by a flourish of trumpets, a chorus of weeping angels, and often, inflated benchmark claims. Why do benchmarks so often not reflect real-world value? Is it intrinsic to the science of benchmarking, or just the consequence of our current practices? Is LM Arena a cancer on AI?

2:50pm-3:10pm: From AI-Assisted to AI-Native: Building a Frontier Development Team — Clare Liguori

(session) [Leadership 1] | Track: AI-Native Enterprises

When features that took two weeks now ship in an afternoon, the bottleneck shifts from writing code to making decisions. Frontier teams have discovered this firsthand, achieving 3-10x productivity gains by fundamentally rethinking how developers work with AI agents. This talk covers the practices that separate frontier teams from those who merely "sprinkle" AI on their existing workflows: running agents asynchronously for hours, investing in comprehensive agent steering files, enabling local integration testing for agent self-correction, and automating everything from coding to operations to documentation. You'll learn how teams at Amazon slowed down to speed up, the temporary productivity dips they accepted, and the organizational changes required to sustain this velocity.

2:50pm-3:10pm: How I automate my own job at Hugging Face using agents — Niels Rogge

(session) [Leadership 2] | Track: AI Architects: Show my Workflow

This talk will showcase how I automated a large part of my own job at Hugging Face. This involves both open (GLM-5.1) and closed-source models (Claude, Gemini), the Claude Agents SDK, serverless infra like Modal and Hugging Face Jobs. I will also discuss how I use agentic coding tools like Cursor and Codex to implement AI agents which automate my job, and how everything is connected to the internal Slack of Hugging Face.

2:50pm-3:10pm: 6 Pillars of an Agentic Harness That Fixes Production Incidents — Varun Krovvidi

(session) [Expo Stage 1 NE]

A model delights us when any plausible answer works, but a production incident has one right answer, and the model alone can't reliably reach it. Getting there depends less on the model and more on the orchestration, context, and judgment built around it. That work is harness engineering, and it is the new frontier.

This session breaks down the six pillars of an agentic harness required to fix production incidents: model orchestration, context, reasoning, actions, learning, and evals. Join Resolve AI to walk through what each one does, why a better model doesn't make any of them go away, and how they compose to find the root cause of a live incident across massive context, under a clock, with real revenue on the line.

2:50pm-3:10pm: Video Discovery for Agentic World-Model Training — Rafael Levi

(session) [Expo Stage 2 NW]

Physical AI had its “Attention Is All You Need” moment with the rise of Vision-Language-Action models. The next bottleneck is data: not just more video, but the ability to find the exact real-world moments that teach models how the world works: gravity, motion, causality, human behavior, and object interactions. This session explores a new approach: discovering specific scenes from the vastness of the web. We’ll show how teams can search for moments like objects falling, people interacting with environments, or actions unfolding over time, then collect and structure only the relevant clips for training and evaluation. Attendees will learn how scene-level discovery changes multimodal data pipelines, reducing wasted collection, processing, storage, and review, while making it easier to build targeted datasets for VLA systems, robotics, physical AI, and agentic world models.

2:50pm-3:10pm: Self-Driving Production: AI Wrote your Code. AI Should Fix It, Too

(session) [Expo Stage 4 SE]

Self-driving production is the next frontier of autonomous software development but getting there is a journey. In this session, we ll show how enterprises are progressing from manual operations and AI copilots toward closed-loop, autonomous production systems with Traversal.

3:20pm-3:40pm: fighting slop with slop — Vaibhav Gupta

(session) [Main Stage] | Track: Software Factories

We haven't done a code review in two years. The last time I read every line of code in a PR was about six months ago. And we build a programming language with a runtime meant to replace V8. This is real engineering: compiler internals, runtime behavior, type systems, codegen, concurrency semantics, and FFIs across multiple languages. The thing that makes this possible is a technique we call "fight slop with slop" - every line of code is analyzed in depth by a sprawling toolchain of custom visualizers, linters, test snapshots and a whole bunch more. While the core language VM code has super high standards, a lot of these meta-tools are mostly vibe-coded. I'll dive deep into all the tactical things we've built, and how to adopt "fight slop with slop" in your own team

3:20pm-3:40pm: Every Harness Will Become A Claw — Sam Bhagwat

(session) [Track 1] | Track: Claws & Personal Agents

Most of the Harness discussion is just a reprise of Context Engineering from last summer. But it's not 2025 anymore. We live in a Claude Code world, and the best way to think about a harness is Context engineering + Coding Agents = Harness. Harnesses are a magical DX because of specific features like planning mode, parallel subagents, skills, background tasks etc. But it doesn't stop there. People are shoving their harnesses in a box, making them listen to external events, giving them channels (the ability to ping its users), and a heartbeat. They are making them into Claws. And actually, harnesses _want_ to become claws, so they can take up more share of mind, suit collaboration workflows, and be available afk. I propose "Steinberger's law", a spinoff of Zawinski's law: every harness will expand until it becomes a Claw

3:20pm-3:40pm: From Scratch to SOTA: Training a 3B State-Space Vision Model for 1.4 Billion People — Krishna Prasad Srinivasan

(sponsor) [Track 2] | Track: Vision & OCR

India has 22 official languages. Across those languages live over a billion people whose knowledge is locked inside scanned images in scripts that most frontier models perform poorly. The problem is dire - until now, there wasn't even a comprehensive benchmark to measure Indic OCR performance, let alone training data at scale. When Sarvam AI set out to solve this, we had to build the infrastructure before the model, creating the first ground-truth benchmark for Indic document intelligence. In this talk, Krishna Srinivasan, who led the Vision Models team to build India's first sovereign VLM from scratch, will walk through the end-to-end engineering lifecycle. We will cover: (a) Architecture: Why we chose a 3B-parameter state-space architecture over transformer baselines to handle high-resolution visual inputs with minimal memory overhead and faster inference. (b) Training Pipeline: The exact recipe we used: starting with text-only pre-training, moving to continual pre-training with text and images, followed by SFT. Finally, we'll cover the advances we made in implementing large-scale RL with Verifiable Rewards for visual tasks in just 3 days using deterministic character-level reward signals. (c) Compute Efficiency: How we trained a frontier-competitive multimodal model with extreme capital efficiency, optimizing distributed training and GPU cluster management to punch far above our compute class. (d) Agentic Workflows: How this model powers Sarvam Akshar, a first-of-its-kind agentic document intelligence workbench featuring visual grounding and automated proofreading loops. The results speak for themselves: Sarvam Vision achieves best-in-class global scores (84.3% on olmOCR-Bench, 93.28% on OmniDocBench) and dominates Indic OCR. Attendees will learn the blueprint for compute-efficient multimodal training, and deploying state-space VLMs for population-scale enterprise workloads.

3:20pm-3:40pm: Stop Chunking Like It's 2022 — Yuval Belfer, Niv Granot

(session) [Track 3] | Track: Search & Retrieval

Every RAG system bets everything on a single chunk size. 500 tokens? 800? Pick wrong, and half your queries fail before they start. But here's what nobody tells you: all the picks are wrong; there is no single chunk size that works for all queries. We ran oracle experiments across meeting transcripts, story chapters, and TV scripts. The result? Queries disagree violently on what chunk size works best - sometimes by 40 percentage points. Your "tuned" chunk size isn't a compromise; it's systematic underperformance. In this talk, we'll expose why fixed chunking fails and show you a dead-simple fix: index at multiple chunk sizes, aggregate at retrieval time using Reciprocal Rank Fusion. No retraining. No LLM overhead. Just 1-37% better recall across benchmarks by letting queries vote with their ranks instead of forcing them into one-size-fits-all boxes. Walk away knowing exactly when your chunk size is sabotaging you - and how to stop leaving 20-40% of your retrieval performance on the table.

3:20pm-3:40pm: Setting Yourself Up for Success — Part 2 — Jason Liu

(session) [Track 4] | Track: Workshops Day 2

I will walk you through the process of understanding how Codex works as a general tool to control your computer, how to think about things like long running work streams, and preparing yourself to start thinking in loops.

3:20pm-3:40pm: AI’s Jurassic Park Period — Aaron Stanley

(sponsor) [Track 5] | Track: Security

Early in my career, I accidentally and unrecoverably changed data I was collecting for a federal investigation. Twenty years later, with the help of AI and a career’s worth of experience as a security leader, I intentionally did the same thing. Make no mistake, what my agent and I did together was dangerous. It was only because I had enough subject matter expertise in both the functional and risk issues that I could navigate it safely. We are in AI’s Jurassic Park period: no matter how clearly we define the rules, models will search for paths to completion. And they are very good at making those paths look safe, reasonable, and correct even when they violate policy or basic intuition. Designing the right control set is about allowing for the right expertise to be injected at the right time in the co-creation process so we can move quickly and safely into the next evolution.

3:20pm-3:40pm: "My name is... my name is...": A Linguistic Map for Building and Debugging Voice Agents — Midam Kim

(session) [Track 6] | Track: Voice & Realtime AI

Every voice AI engineer has heard it: a caller repeating their name three times, getting more frustrated with each attempt. The logs look clean. Confidence scores look fine. Linguistics can help solving the mystery. By the end of this talk, you'll have a diagnostic framework for the failures that slip past standard metrics, a way to turn "the agent just didn't get it" into concrete, debuggable failure modes. The framework maps three levels of linguistic structure (sounds, words, and interactions) against the two dimensions every voice agent engineer already works in: what we hear (speech recognition) and what we speak (speech synthesis). That 3×2 grid surfaces problems your current tooling can't see, including: 1. Why your user cannot make your system understand their name 2. Why a single well-intentioned vocabulary hint can cause catastrophic drops in a non-English language 3. Why a transcript that's "cumulatively correct" can still ruin the user experience Drawing on examples from production multilingual voice AI work, I'll show where linguistic expertise connects to the engineering decisions you're already making and where it reveals failure modes that confidence scores will never warn you about. Who this is for: Voice AI engineers, ML practitioners on Voice AI pipelines, and anyone who's watched clean logs while their agent quietly fails real users.

3:20pm-3:40pm: From approval loops to autonomous agents with Docker pt5 — John Craft

(session) [Track 7] | Track: LLM Recsys

You've invested in the best models, coding agents, and AI tooling. Now comes the hard part: unlocking autonomous development without creating security headaches, governance gaps, or endless approval loops.

In this 90-minute hands-on workshop, you'll learn how to run coding agents in isolated environments built for autonomous work, create a 'golden path' for AI-assisted development across your organization, reduce software supply chain risk with secure, hardened containers, manage multiple agents with the right permissions and guardrails, and scale AI-powered development without slowing developers down.

3:20pm-3:40pm: How Forward Deployed Engineering is done at Kepler — Vinoo Ganesh

(session) [Track 8] | Track: Forward Deployed Engineering

3:20pm-3:40pm: Building Worlds for Models — Nicolai Ouporov

(session) [Track 9] | Track: Data Quality

Hold for Fleet AI. Company focuses on simulated environments / training gyms for AI agents and fits the posttraining / RL environments theme.

3:20pm-3:40pm: Surviving Your Own Velocity: How VS Code Ships Weekly with 40 People — Harald Kirschner

(sponsor) [Track M] | Track: Track M

A ~40-person team ships VS Code weekly to millions of users. Models got good enough to lean on, and leaning in is exactly what broke our process. This talk is the part most AI talks skip: what you have to rebuild after agents start working. We had to scale three things at once: how fast we ship, how we hold quality, and how fast we learn, and each one we fixed revealed the next. I'll walk through the harnesses, evals, and self-healing systems that keep velocity from becoming regression, and the patterns you can steal.

3:20pm-3:40pm: How to Get Your Org to Adopt Coding Agents (Without Shipping Garbage) — Eyal Blum

(session) [Leadership 1] | Track: AI-Native Enterprises

AI coding agents promise 10x. On complex, production work inside a real org, the honest number is 2-5x — and getting there requires a journey most teams aren't prepared for. At Figma, we ship AI products to millions of users, but internally our engineering org is spread across three stages of adoption. The honeymoon, where AI is magic. The crash, where AI writes bad code and your best engineers are stuck protecting the quality bar. And the real skill — 2-5x with disciplined development practices and proper investment. This talk covers why adoption is uneven, what the trust curve looks like from the inside, and what leaders can do about it: guide teams to align on plans before generating code, set honest expectations, invest in the fundamentals that make codebases agent-friendly, and create space for skeptics without judgment. You'll leave with a framework for driving adoption more organically without mandating it — and without shipping garbage.

3:20pm-3:40pm: Your Fine-Tuned Model Is Tech Debt: A 50x ROI House of Cards — Dan Bjornn

(session) [Leadership 2] | Track: AI Architects: Show my Workflow

We built an AI application on top of fine-tuned models that generated $12M in revenue at 50x ROI. It was fast, cheap, and impressively accurate. Then it started having problems. Small errors accumulated. The model misread intent and nuance, handling conversations wrong. But retraining was too costly to justify for each fix, so known bugs piled up until we hit critical mass. Each retraining cycle took a week end-to-end, most of it spent curating data and validating our classification pipeline. And fixes caused whack-a-mole regressions across intents that required multiple iterations per cycle. Over time, the model became increasingly rigid. Each retraining was harder than the last. Then our team started using Claude Code, and we realized context management was the real lever, not model specialization. We rebuilt on frontier models using well-crafted system prompts and progressive context management, feeding the agent only what it needs when it needs it. Adjustments that used to require a week-long retraining cycle now take a small context change. Fine-tuning should be a last resort, not a first instinct. The cases where it's the right call are far fewer than they used to be. Before you fine-tune, ask: can I solve this with better context instead?

3:20pm-3:40pm: Can Your Agent Hear You Now? — Thor 雷神 Schaeff

(session) [Expo Stage 1 NE] | Track: Expo Stage 3

3:20pm-3:40pm: From Context to Memory: Your Agents Need a Real Memory Layer — Anders Swanson

(session) [Expo Stage 2 NW]

Most agents don't really have memory. They have a context window, a pile of temporary files, maybe an AGENTS.md, and a retrieval step that attempts to build state from whatever the model can still see. You've seen the flashy demos, but these systems fall apart when an agent needs to recover from failure, revisit prior work, and observe if failures are less frequent over time. This talk explores agent memory as a systems problem. Effective memory isn't just storing data: it's an evolving knowledge layer with write filtering, consolidation, reflection, and forgetting. Agents need persistence, and they also need structure. Raw logs and Markdown scratchpads aren't enough. A real memory layer weights recency, combines retrieval techniques, and correlates episodic memories. Serious agent memory is inherently multi-model. The best systems use full-text search, semantic retrieval, graph relationships, and structured state to reconstruct context with far more precision than filesystem grep alone. This is where databases become essential as the foundation for real memory. Memory shapes how agents behave, adapt, and improve over time.

3:20pm-3:40pm: Running a 20T-Token Data Pipeline: Infrastructure Lessons from Production — Bogdan Gaza

(session) [Expo Stage 3 SW]

The problem. Curation algorithms tend to get the spotlight: model-based quality filtering, embedding-based deduplication, synthetic generation at scale, target distribution matching. The engineering behind them, the systems that actually run those algorithms reliably on petabytes of data and thousands of GPUs, usually gets overlooked. This session is about the engineering. What we built. The infrastructure behind two production data curation pipelines, on two very different shapes of workload: Arcee Trinity-Large-Thinking three model generations in nine months, with the curated corpus scaling from 8T to 10T to 20T tokens. Trinity-Large's 20T-token corpus included 8T+ synthetic tokens generated on clusters peaking at 2,048 H100 GPUs. Each generation incorporated deeper curation and broader domain coverage; the pipeline ran end-to-end multiple times, not once. Thomson Reuters legal 100B tokens of mid-training output, generated from TR's proprietary legal corpus, delivered as a deployment artifact and plugged into their existing SFT and DPO post-training. Different operational profile entirely: smaller scale, sensitive data, customer-environment integration. What you'll learn about. The metadata bottleneck. At trillion-token scale, fetching metadata from object storage across millions of files becomes the dominant source of idle time. We offload metadata management to Spark and use a lightweight file-level distribution scheme to drive idle time to near zero. Fault tolerance at multi-week scale. Long-running GPU inference jobs fail. We use one-to-one partition mapping between Spark and Ray jobs to get idempotent, resumable execution. A node failure no longer means reprocessing the dataset. Heterogeneous workload scheduling. Curation pipelines mix CPU-heavy preprocessing (Spark) with GPU-heavy inference (Ray + vLLM). An in-house scheduler routes each job type to isolated node pools, preventing resource fragmentation and ensuring critical training jobs aren't blocked by upstream CPU work. Inference tuning across models. vLLM defaults aren't right for every model. Tuning batch size, speculative decoding, and n-gram sampling per-model yields up to 40% throughput improvement, without over-engineering. Pipeline reproducibility. Treating a curated training corpus as a versioned deployment artifact rather than a one-off output. What that enables when a customer wants to run mid-training against a pre-trained base. For engineers building or operating large-scale data pipelines for ML training

3:20pm-3:40pm: From raw documents to AI-ready data — Leo Platzer

(session) [Expo Stage 4 SE]

Starting from a real document corpus full of overlapping, look-alike files, we walk through what it takes to make retrieval on those files reliable, from deduplicating to enriching with metadata. Watch how each step reshapes the vector space, and what happens to the answers that come back.

3:45pm-4:05pm: Loop Engineering from first principles — Kyle Mistele

(session) [Main Stage] | Track: Software Factories

Code is free, software is infinite, and agents can do it all - that's the promise of the lights-off software factory, where humans interact only with tickets & specifications, and nobody reads the code, let alone writes it. We ran our own for six months, and we have the scars to prove it - bad code compounded, and agents created problems that agents couldn't solve - until we had to throw it all away. But this is a survivor's guide, not an obituary. In this talk, we'll share the challenges we encountered, what we liked, what we hated, what we're still doing, what we stopped doing, and what we started doing afterwards.

3:45pm-4:05pm: Gadgets: Personal app vibe coding that is actually safe — Kenton Varda

(session) [Track 1] | Track: Software Factories

We are entering the end game of Kenton's 15-year master plan. The architect of Cloudflare Workers, Durable Objects, Cap'n Proto, and Sandstorm.io, and the guy who coined the term "Code Mode", will demo Gadgets, an AI productivity suite which ties all these ideas together. We've all heard that the future is micro-apps customized for every niche, but how do we actually make that usable, how do we make it scale, and most importantly, how do we make it safe for even non-developers to use? Kenton will show how Gadgets solves these problems, including a sandbox design that makes it essentially impossible for apps to have vulnerabilities at all.

3:45pm-4:05pm: Setting Yourself Up for Success — Part 3 — Jason Liu

(session) [Track 4] | Track: Workshops Day 2

I will walk you through the process of understanding how Codex works as a general tool to control your computer, how to think about things like long running work streams, and preparing yourself to start thinking in loops.

3:45pm-4:05pm: Secure Cloud Compute — Ethan Sutin

(sponsor) [Track 5] | Track: Security

3:45pm-4:05pm: Act, Confirm, or Stop? Smarter behavior for AI assistants, wearables & robots — Amit Desai

(session) [Track 6] | Track: Voice & Realtime AI

Voice is our favorite way to command AI assistants and robots — and it is error-prone. The industry's reflex is to chase accuracy, but accuracy is only one knob: we can control system behavior in other ways to increase user satisfaction.

This talk shifts the lens from accuracy to user outcomes. Give the AI agent more than one move: besides acting, let it stop, reject, confirm, clarify, or disambiguate. The question stops being "how often are we right?" and becomes "what does each outcome cost the user?" Bad outcomes are not equally bad to users — so price them relatively, then have the AI system minimize that user cost. Call it OUCH: Outcome User Cost Heuristic; we optimize system behavior to minimize the OUCH. Same accuracy, lower user cost, greater user adoption.

We will walk through practical AI assistant examples illustrating this approach, then show how the same framework extends across AI environments — smart speakers, TVs, glasses, embodied AI, robots, wearables, and vehicles — by repricing outcomes and swapping the confirmation UI.

Why this matters now: the cost of voice-command errors is escalating as we move into AI assistants and embodied AI, where wrong actions can be more expensive and dangerous. Mainstream voice adoption will not come from chasing accuracy alone; we need systems to price in the cost of being wrong.

3:45pm-4:05pm: Data and Environment Curation for Post-training LLMs — Mahesh Sathiamoorthy

(session) [Track 9] | Track: Data Quality

Hold for Bespoke Labs. Company works on data curation, eval tooling, and reinforcement-learning environment curation for agent development.

3:45pm-4:05pm: Unlock Agent Autonomy: The Runtime for AI-Native Systems — Tushar Jain

(session) [Leadership 2] | Track: AI Architects: Show my Workflow

The way software gets built in 2026 doesn't look like it did in 2024. The actors changed. Agents read and write entire codebases. Subagents spawn to chase down a flaky test, refactor a module, or triage an incident. But this shift doesn't stop at the SDLC. Agents increasingly invoke tools, interact with enterprise systems, install dependencies, call APIs, and orchestrate workflows across local machines, CI systems, cloud infrastructure, and organizational boundaries. The teams leaning into this shift are moving faster, and the gap is widening by the quarter.

But few have the confidence to let agents operate autonomously across those environments. Not because the model capability isn't there. Trust isn't. Agents can pull a poisoned dependency, invoke an untrusted tool, wipe a database, leak sensitive data, or access systems they shouldn’t. Prompt-level instructions won't close that gap, the unlock has to happen one layer down, at the runtime layer itself.

Docker spent the last decade making it safe to ship software by getting the runtime right: isolation, network policy, trusted base images, and credentials. Agents are the next workload, and the same principles apply. Tushar Jain, EVP of Engineering at Docker, walks through what the runtime layer for AI-native systems looks like in practice: hardened runtime foundations, sandboxes that constrain what agents can touch, and governance controls that limit what agents can introduce, access, and execute across local, CI, cloud, and enterprise environments. The pattern is the same on every vector: reduce the surface area of what the agent gets to decide, so the parts that matter aren't left to a prompt.

Attendees leave with a clearer framework for giving agents more autonomy safely. Engineers see how agentic applications can operate across tools and infrastructure. Security leaders get a runtime model that maps to controls they already understand. Platform teams get a way to scale agent execution without standing up a new runtime for every team.

3:45pm-4:05pm: How We Built the Airbyte Agent MCP Server and CLI — Pedro Lopez

(session) [Expo Stage 1 NE] | Track: Expo Stage 1

Agents need a reliable way to reach live business data. At Airbyte we built two interfaces for that, and this session is how.

Cam built much of that surface. He covers the MCP server that exposes hundreds of sources through one endpoint with managed auth, and the CLI that's designed for agent harnesses rather than humans, with embedded help, packaged agent skills, and no credentials passed over the command line. Expect the real engineering: why a CLI turned out to fit autonomous agents better than the API or SDK, how auth works across the layers, and the tradeoffs the team made along the way.

Come if you're building agent tooling or thinking about how to expose your own systems to agents cleanly.

3:45pm-4:05pm: From Chatbots to Agents: How Reducto builds for Agent Experience to Enable Real Work — Abhi Arya

(session) [Expo Stage 2 NW]

Many agent demos work. Most agent systems in production don't. The gap usually isn't the model or the tools. It's everything in between: how context gets structured, how multi-step tasks stay on track, how you handle the edge cases that only show up when real scenarios from real customers hit your pipeline. At https://reducto.ai/, we've spent the last couple of months building agent-first workflows for some of the most document-heavy industries out there. We've hit most of the failure modes you're probably hitting too. This talk shares what we've learned, from how to think about Agent Experience (AX) as a design layer, to the specific decisions that make complex workflows actually reliable in production. You'll walk away with tactical approaches to structuring context, model guidance, designing recoverable workflows, and building the feedback loops that let your system improve over time without a full rebuild.

3:45pm-4:05pm: Towards Reliable Financial Agents: How a 4B Model Outsmarted a 235B Giant — Charlie Dickens

(session) [Expo Stage 3 SW]

Large generalist models have excellent reasoning but this does not necessarily imply specialized knowledge and tool calling capabilities. They can still hallucinate column names, ignore constraints, and generate SQL that returns nonsensical results. The problem isn't intelligence it's reliability and specialization. In this talk we'll show how a 4B model was fine-tuned to outperform a 235B model on real financial analysis tasks. The key was not adding more reasoning ability, but enforcing tool discipline. Using synthetic data generation and reinforcement learning with the open-source rLLM framework, the model learned to explore schemas, validate outputs, and retry failures instead of hallucinating confident nonsense. One key result: tool-use fundamentals generalize. Training on simple tool interactions transferred to much harder, multi-step financial tasks. If you're building LLM systems that interact with databases, APIs, or internal tools, this talk focuses on the behaviors that actually matter and how to teach them without frontier-scale compute.

3:45pm-4:05pm: AI Enablement at Automattic: How a Remote Company Builds AI Fluency — Em Shreve

(session) [Expo Stage 4 SE]

Automattic is a remote company. About 600 of us will step away from regular work this year for an immersive AI program. That's a little over a third of the company. This talk walks through a field report of what we built and why: the curriculum, the cohort design, and what we've learned about making AI fluency work across a distributed organization.

4:30pm-4:50pm: Harness Engineering is not Enough: Why Software Factories Fail — Dex Horthy

(keynote) [Main Stage] | Track: Software Factories

4:50pm-5:10pm: In Code They Act, In Proof We Trust — Erik Meijer

(keynote) [Main Stage] | Track: Harness Engineering

AI agents today execute on blind trust, and the failure modes are already in the headlines: a dealership chatbot agreeing to sell a $76,000 Chevy Tahoe for $1, a coding agent wiping a production database during a code freeze, an "agent skill" quietly installing a keylogger on a developer's machine. These are not edge cases. They are the predictable consequence of allowing agents to act without any mechanical guarantee of correctness or safety. Execution is irreversible. You cannot unsend a message, unwire a payment, or un-delete a database. In that regime, permitting an unsafe action costs far more than withholding a safe one, and thus the economically rational choice is to refuse to let agents act on unchecked intent alone. Automind is an agent harness that enforces this discipline by construction. Before any action runs, the agent must submit its execution plan together with a machine-checkable proof of safety and correctness, written in Universalis, a literate logic programming language designed to be read by humans and verified by machines. A small, auditable checker decides whether the plan is allowed to execute. By left-shifting the trust boundary, we no longer have to trust the agent's proposal, or even its proof; only the checker. Policy compliance becomes a static property, established before the first side effect. We can finally demand formal proofs, not vibes, from the agents we deploy.

5:10pm-5:30pm: Recursive Model Improvement — Lee Robinson

(keynote) [Main Stage] | Track: Software Factories

Day 3 — Session Day 2

9:05am-9:25am: Field Guide to Fable — Thariq Shihipar

(keynote) [Main Stage] | Track: Autoresearch

https://x.com/trq212/status/2027463795355095314

9:25am-9:45am: In the Land of AI Agents, the Verifiers Are King — Tariq Shaukat

(keynote) [Main Stage] | Track: Software Factories

As AI agents take on increasingly complex development tasks, the critical challenge has shifted from generation to verification. Hallucination is not a temporary bug. Evidence suggests that as models grow more capable, failures become more frequent and more convincing, making cognitive surrender among human reviewers an acute risk. This talk introduces a three-stage discipline for responsible agentic development, Guide, Verify, Solve, and argues that rigorous verification infrastructure is both a safety requirement and a competitive advantage. Counterintuitively, code quality matters more in an agentic world: clean, low-complexity codebases make agents faster, cheaper, and more reliable, while technical debt compounds at machine speed.

9:45am-10:05am: Perception Agents — Antje Barth

(keynote) [Main Stage] | Track: Autoresearch

Human-agent collaboration is changing, becoming more visual. The agents most teams ship today still wait for us to type a paragraph to explain what we're looking at. They cannot see a screen, navigate a UI that changes, or recover when an application throws an unexpected modal. That is the architectural gap between agents that demo well and agents that work alongside real teams in real software. Perception agents close it. They see and use computers the way people do, reason about what they see, and act with clicks and keystrokes.

10:05am-10:25am: Research to Reality with Google DeepMind — Benoit Schillings

(keynote) [Main Stage] | Track: Autoresearch

TBD. Expected focus areas include generative AI for code, deep thinking algorithms, and the future of pre-training and transformer models for Gemini.

10:25am-10:30am: Evals Track Intro — Laurie Voss, Aparna Dhinakaran

(keynote) [Main Stage] | Track: Autoresearch

10:45am-11:05am: First Steps Toward Automated AI Research — Richard Socher

(session) [Main Stage] | Track: Autoresearch

10:45am-11:05am: Don’t build agents, build environments — Adam Azzam

(session) [Track 1] | Track: Sandbox & Platform Engineering

We’ve largely settled on what a coding agent is: a model working in a loop, calling tools. As a result, the hard part has moved. It’s no longer the agent loop, it’s the environment around it. This talk is about the real challenges of building fast-booting, reliable, reproducible environments for coding agents at scale.

10:45am-11:05am: Building the simulation infrastructure for practical world model use — Christopher Manning

(sponsor) [Track 2] | Track: Robotics & World Models

What is the most important capability for world model applications and the pursuit of embodied AI? We believe it is not a question of having the most beautiful pixels but the ability to reason about causality in multimodal environments. At Moonlake, we are working on building action-conditioned multimodal world models which provide spatial and physical state consistency over long time periods. We believe that building and training on synthetic worlds provides the data and compute efficient path to truly useful world models. We are building the simulation infrastructure platform for companies that need to build and manage worlds (assets, scenes, digital twins) at scale, including robotics/autonomy teams, digital factory operators, and game authors. Our product today primarily finds applicability in simulation and the operationalization of digital twins. Simulation can include training robotics, world models for AGI research, autonomous vehicles, or content creation for media and entertainment. Operationalization of digital twins involves the reconstruction of scans into reusable assets, e.g., turning image and point-cloud scans into sim ready assets for digital factory Integration projects. We are building toward a future where AI systems do not just generate worlds, but understand how they work. Moonlake learns from each workflow: The more workflows, failures, and human interventions that Moonlake sees, the better it becomes at reconstructing, validating, and preparing complex simulation worlds. The session will include discussion and demos.

10:45am-11:05am: Beyond Static Intelligence: Evaluating Continual Learning — Parth Asawa

(session) [Track 3] | Track: Memory & Continual Learning

Continual learning, the ability of AI systems to improve through sequential experience, has attracted substantial interest, but no high-quality benchmark exists to evaluate it. We introduce Continual Learning Bench (CL-Bench), the first difficult, expert-validated benchmark designed to measure whether LLM-based systems genuinely improve with experience. CL-Bench spans six diverse domains (software engineering, signal processing, disease outbreak forecasting, database querying, strategic game-playing, and demand forecasting), each validated by domain experts and designed so that tasks share a learnable latent structure (codebase layout, disease outbreak dynamics, opponent strategies) that a stateful system can discover online but a stateless one cannot. We evaluate frontier models across several agent architectures, from naive in-context learning (ICL) to dedicated memory systems, introducing a gain metric to isolate learning from prior capabilities. We find that these systems leave headroom for improved continual learning: agents frequently overfit to immediate observations or fail to reuse knowledge across instances, and dedicated memory systems do not fix this---in fact, naive ICL outperforms systems dedicated to memory management. CL-Bench is the first benchmark to evaluate continual learning across diverse real-world domains with expert-validated tasks and isolate online learning from underlying model capability, showing a need for better continual learning systems.

10:45am-11:05am: Build realtime multimodal agents with Gemini Live — Thor 雷神 Schaeff

(session) [Track 4] | Track: Workshops Day 2

The Gemini Live API is incredible versatile when it comes to building realtime AI experiences. From live translation across 2000 different language pairs to building realtime multimodal agents that can work across text, audio, and vision. This workshop gets you from zero to fully conversational agent in a matter of hours.

10:45am-11:05am: Vending-Bench: Long-Horizon Agent Evals for a Simulated Vending Business — Lukas Petersson

(sponsor) [Track 5] | Track: Evals

Long-horizon agent evals via a simulated vending machine business, testing negotiation, pricing, and supplier management over 365 days.

10:45am-11:05am: Understanding is the new bottleneck — Geoffrey Litt

(session) [Track 6] | Track: Design Engineering

Autonomous loops are hot, but the reality is that most agentic tasks still require human judgement. And to guide your agents well, it's not enough to just verify correctness -- you actually need to understand the work they're doing.

In this talk, I'll share some techniques for staying in the loop and efficiently developing understanding, combining old ideas from education and cognitive science with modern agent capabilities. You'll walk away with some practical tips for moving faster with agents by understanding more, not less.

10:45am-11:05am: Computer-use models will agentify the web, not APIs — Dhruv Batra

(session) [Track 7] | Track: Computer Use

We are rushing towards a world where every single digital surface (email, calendar, messaging, …, every desktop app, every phone app, every web app) that was previously meant for humans is now managed by AI agents. Of course, there are technical challenges to be solved: - Model context windows haven’t increased in 2 years. And the digital world is OOMs bigger (the ultimate “big world hypothesis”) anyway, so how does one architect this? - A large part of the digital world (most of the web) does not have APIs available and requires agents to act like humans (consume pixels, output keyboard/mouse actions). - Human preferences and the digital world change, and require agents to maintain a dynamic memory and continually learn. But even if we could solve these problems, what does this world look like? - The digital world, particularly the web, was built for human consumption (and is often hostile to bots). - For a while to come, we will be sharing the digital roadways with these digital robots. - What does end-to-end encryption and privacy mean when the other “end” of the communication is an AI agent? The Yutori team has spent the last year building the world’s best computer use model (slightly better than Opus 4.6 and GPT 5.4 while being 2x faster and 4-5x cheaper on browser use tasks), converted the web into a webhook with Scouts (agents that monitor the web 24/7 for anything you care about), and are now releasing Yutori agent that expands from the open web to your most common digital surfaces. This talk will be grounded in Yutori’s learning from what it takes to build agents that are always on, taking us one step closer to the world where every digital surface is their playground.

10:45am-11:05am: Build-Time vs. Run-Time: Why Your Dev Tools Will Fail in Production — Averi Kitsch, Prerna Kakkar

(session) [Track 8] | Track: Context Engineering

A dangerous pattern is evolving in the ecosystem: developers are deploying "Build-Time" tools into "Run-Time" environments. In this session, we will introduce a critical distinction for the MCP ecosystem: the difference between Build-Time Agents (Developer Assistants like Gemini Code Assist) and Run-Time Agents (End-user applications like a Customer Support bot). Drawing from our experience building the MCP Toolbox, we will demonstrate why the "Atomic" tools that make Build-Time agents powerful become catastrophic liabilities for Run-Time agents. We will provide a framework for transitioning your architecture across three key axes: Design: Moving from flexible, atomic primitives to "Composite Workflows" that encapsulate business logic. Security: Shifting from "Developer Identity" (trusted) to "Workload Identity" (zero-trust), where the agent is treated as an untrusted user. Reliability: Why production agents need "Agent-Readable" errors (natural language guidance) rather than the stack traces that developers rely on. Attendees will leave with a clear rubric for evaluating whether their tools are truly "Production Ready" or just "Prototype Ready."

10:45am-11:05am: What's next after RLHF? — Diogo Almeida

(session) [Track 9] | Track: Posttraining & Midtraining

RLHF was a massive commercial success: roughly 100% of LLM usage is through RLHF’d models - but it was in many ways also a research failure. Let’s talk about how it conquered the world, how it defied its creators expectations, why AI is in the bimodal state it’s in (is it a bubble or a machine god?), and how to make AI actually transform the economy.

10:45am-11:05am: From framework to runtime: running agents with Foundry Agent Service — Tina Manghnani, Keiji Kanazawa

(sponsor) [Track M] | Track: Track M

See how agents move from frameworks into production systems. Learn how Foundry Agent Service provides hosted execution, scaling, and lifecycle management—combining models, tools, and orchestration into a production-ready runtime.

10:45am-11:05am: How do you diffuse AI into the real world? — Varun Shenoy

(session) [Leadership 1] | Track: AI-Native Enterprises

Most AI conversations are still about models, benchmarks, and demos. We want to talk about what it actually takes to make AI work inside real companies. The gap between impressive demos and production value is where most enterprise AI efforts die. We've all seen burned budgets, cynical teams, and tools that never leave the pilot phase. We've spent the last two years closing that gap across the American services economy, and we'll share a bit of our playbook. This talk walks through three layers of what real AI deployment looks like, drawn from Long Lake's live operating environments: Measure: How we built domain-specific evals and workflows to improve performance on real HOA management tasks, not synthetic benchmarks, but metrics tied to actual business outcomes. Embed: How we put AI directly inside tools like Revit, meeting users where they already work instead of asking them to change how they operate. Scale: The enablement playbooks and operating techniques we use to help teams of property managers, payroll specialists, and more adopt AI in their day-to-day jobs. The broader theme is vertical superintelligence: not just better models, but systems built around proprietary data, workflow context, domain tools, human enablement, and continual learning. This talk is for builders and operators who care less about benchmark theater and more about how to deliver measurable outcomes, deal with change management, and teach non-technical workforces to use AI effectively in production beyond just Claude Code / Cowork.

10:45am-11:05am: The Z/L Continuum: Should AI Engineers Still Read Code? — Alex Volkov

(session) [Leadership 2] | Track: AI Architects: Tokenmaxxing

At AI Engineer Europe, two of the best speakers gave directly opposite advice. Zechner: slow the f*** down, read every line your model writes. Lopopolo: code is a liability, you don't even open the IDE anymore. Both got applause. The room walked out confused. On the train back I sketched the Z/L Continuum on a napkin — a five-stop spectrum from "read the clanker code" to "what IDE?" — and the whole week clicked into place. In this talk I'll walk through the Continuum, introduce FOMAT (Fear of Missing Agent Time — coined backstage by Michael Richman), and make four arguments: the Continuum is real, your stop is per-task not per-person, model capability bends everything toward L, and FOMAT is a filter problem, not an agent problem. You'll leave with a vocabulary for the argument every AI engineer is having right now. Audience takeaways A shared vocabulary (Z, L, the five stops) for the debate splitting AI engineering teams FOMAT — name the fear so you can manage it A per-task framework for choosing where on the Continuum to operate Why capability drift makes "I'll never let it cook" a losing position over time Speaker: Alex Volkov · ThursdAI · @altryne

10:45am-11:05am: AI Engineering & Governance 2026 Trends — Wallon Walusayi

(session) [Expo Stage 2 NW] | Track: Expo Stage 2

AI Engineering & Governance 2026 Trends

10:45am-11:05am: Why AI Didn't Actually Make You Ship Faster — Gabriel Spencer-Harper

(session) [Expo Stage 3 SW] | Track: Expo Stage 3

AI generates code faster than humans can review and verify it, and most engineering teams adopting codegen have hit the same wall: verification.

In this session, Gabriel (CEO of Meticulous) breaks down why assertion-based testing has a structural ceiling that AI codegen has made impossible to ignore, what exhaustive verification actually requires technically (behavior capture, determinism, and backend isolation), and why the teams solving this now are the ones who will ship at the speed AI enables.

The talk includes case studies from LaunchDarkly, which saw an 80% reduction in major frontend incidents after rollout, and Notion, which deployed verification infrastructure across every engineer on every PR to confidently adopt AI-generated code at scale.

10:45am-11:05am: Redesigning how software gets built — TBD — Sonar

(session) [Expo Stage 4 SE] | Track: Expo Stage 4

AI is already transforming how software is built, but most organizations are still treating it as a productivity tool rather than a governance challenge. The real question isn't whether to adopt AI-assisted development; it's whether your operating model is designed to control what comes out of it.

This session reframes the AI development conversation around three practitioner horizons: organizations that are proficient with the status quo, those capturing velocity today, and those building toward the next frontier, where AI agents operate with genuine autonomy at scale. The gap between these horizons isn't model capability. It's operating model maturity.

Most organizations are still applying AI to isolated steps in the development process. The real value only arrives when you redesign the system end-to-end: how work flows, how decisions are made, and how teams interact with AI as a core contributor. That transition requires something most teams haven't built: a governance layer that is accurate, consistent, repeatable, transparent, and auditable.

This talk explores what that governance layer looks like in practice, including how to instrument controls at the point of generation, enforce standards without slowing agents down, and build the organizational confidence to let agents operate at scale without losing visibility or accountability. The companies getting the most out of agentic development aren't the ones with the best models. They're the ones with the strongest foundations.

True governance isn't a gate at the end of the pipeline. In an agentic world, it's the architecture the pipeline runs on.

11:00am-12:00pm: Tokenomics: From AI Spend to AI Value — Martin Harrysson, Matt Linderman, Prakhar Dixit

(session) [Leadership Lounge] | Track: CTO Circle

Facilitated, peer-to-peer, under the Chatham House Rule — not recorded.

As enterprise AI adoption accelerates, token spend is scaling faster than value realization. We address i) how to make decisions amid unclear cost and value dynamics, ii) how to shift from token-level to workflow-level analysis, and iii) how to manage downstream behavior implications on AI usage.

11:10am-11:30am: Autoresearch for Dense Retrieval: Test-Time Compute with Frozen Embedding Models — Han Xiao

(session) [Main Stage] | Track: Autoresearch

Test-time compute is widely believed to benefit only large reasoning models. We show it also helps small embedding models. Since modern embedding models are distilled from LLM backbones, a frozen encoder should benefit from extra inference compute without retraining. Using an agentic program-search loop spanning 144 generations, we explore 144 candidate programs over a frozen encoder API. The search produces twelve Pareto-optimal programs spanning cost ratios of c=1.2 to 14.7 over the single-pass baseline. The programs are structurally diverse: the search independently rediscovers Rocchio pseudo-relevance feedback, ColBERT-style MaxSim at sentence granularity, reciprocal rank fusion, and the Fisher linear discriminant, all without trainable parameters or external models. Every frontier program improves nDCG@10 over the frozen baseline across all 14 MMTEB retrieval tasks spanning legal, financial, long-document, and general domains.

11:10am-11:30am: Letting the Interns Loose — How We Accelerated AI Adoption. — Shashank Goyal

(session) [Track 1] | Track: Sandbox & Platform Engineering

11:10am-11:30am: Building the simulation infrastructure for practical world model use (Part 2) — Christopher Manning

(sponsor) [Track 2] | Track: Robotics & World Models

What is the most important capability for world model applications and the pursuit of embodied AI? We believe it is not a question of having the most beautiful pixels but the ability to reason about causality in multimodal environments. At Moonlake, we are working on building action-conditioned multimodal world models which provide spatial and physical state consistency over long time periods. We believe that building and training on synthetic worlds provides the data and compute efficient path to truly useful world models. We are building the simulation infrastructure platform for companies that need to build and manage worlds (assets, scenes, digital twins) at scale, including robotics/autonomy teams, digital factory operators, and game authors. Our product today primarily finds applicability in simulation and the operationalization of digital twins. Simulation can include training robotics, world models for AGI research, autonomous vehicles, or content creation for media and entertainment. Operationalization of digital twins involves the reconstruction of scans into reusable assets, e.g., turning image and point-cloud scans into sim ready assets for digital factory Integration projects. We are building toward a future where AI systems do not just generate worlds, but understand how they work. Moonlake learns from each workflow: The more workflows, failures, and human interventions that Moonlake sees, the better it becomes at reconstructing, validating, and preparing complex simulation worlds. The session will include discussion and demos.

11:10am-11:30am: Scaling up Continual Learning — Ronak Malde

(session) [Track 3] | Track: Memory & Continual Learning

Trajectory (stealth) is a research and product lab building the platform for continual learning, where frontier models are continuously trained as they interact with the real world. We are a team of ex-Deepmind, OpenAI, Meta superintelligence, Apple, and raised 15M from Conviction. The Fair will be after we have launched to the world. We will be walking through the primitives of continual learning, and how we can scale fast by leveraging these tools.

11:10am-11:30am: Build realtime multimodal agents with Gemini Live (continued 2) — Thor 雷神 Schaeff

(session) [Track 4] | Track: Workshops Day 2

The Gemini Live API is incredible versatile when it comes to building realtime AI experiences. From live translation across 2000 different language pairs to building realtime multimodal agents that can work across text, audio, and vision. This workshop gets you from zero to fully conversational agent in a matter of hours.

11:10am-11:30am: From Signal to PR: Anatomy of a Self-Improving Agent — Jason Lopatecki

(sponsor) [Track 5] | Track: Evals

What if your observability platform didn't just tell you something was wrong, but told you why, and opened a PR with the fix? We'll walk through how we built Autopilot at Arize: an autonomous investigation agent that triggers on monitor alerts or schedules, pulls traces into a working filesystem, runs root-cause analysis, and produces actionable assets: a PR with prompt or code changes ready for review. We'll cover the architecture decisions (cloud agents vs. sandboxed containers, AI harness + skills), why traces-on-a-filesystem is the key unlock for agent-driven debugging, and how we dogfooded the system on our own agent, Alyx, before shipping it to customers. You'll leave with a concrete picture of what "observability that fixes itself" looks like in practice, and where and why the human stays in the loop.

11:10am-11:30am: The Spatial Harness: Bringing Agents to the Canvas — Max Drake

(session) [Track 6] | Track: Design Engineering

What if chat is the wrong interface for managing agents? What if we're holding ourselves back by squeezing our thoughts and the way we work to into a one-dimensional, single-threaded interface? At a high level, this talk aims to present the work we've done at tldraw to build a spatial harness, or a way to allow agents to work on a canvas and collaborate with users and each other natively. This work represents important steps towards building better agent + canvas experiences, a product category we've seen explode in the recent months (Paper, Replit Agent 4, Google Stitch, etc). It's also not something I've really seen talked about elsewhere. See: - Multi-agent collaboration on the canvas (fairies.tldraw.com) - We've also recently brought code mode (https://blog.cloudflare.com/code-mode-mcp/) to the tldraw desktop app and MCP app.

11:10am-11:30am: Computer Use at the Edge of the Statistical Precipice — Pierluca D'Oro

(session) [Track 7] | Track: Computer Use

Evaluating Computer Use Agents (CUAs) on interactive environments is fraught with methodological pitfalls that the field has yet to systematically address. We show that a 1MB replay script that blindly executes a recorded action sequence without ever observing the screen outperforms frontier models on prominent static benchmarks, and prove that its expected success rate is exactly equal to the source agent's pass@k in deterministic environments. We trace this and other failures to two root causes: non-principled environment design (static, unsandboxed, or unreliably verified environments) and non-principled evaluation methodology (naive aggregation and misuse of pass@k for stateful UI interactions). To address the first, we propose PRISM, five design principles for CUA environments and instantiate them in DigiWorld, a benchmark of 15 realistic sandboxed mobile applications able to evaluate agents in over 3.2 million verified unique configurations. To address the second, we develop an aggregation framework that correctly accounts for the nested structure of CUA benchmarks. All together, we show that principled environment design and rigorous evaluation methodology are not optional refinements but prerequisites for meaningful CUA research.

11:10am-11:30am: It’s Tokens All The Way Down: How RLMs are Different — Kevin Madura

(session) [Track 8] | Track: Context Engineering

Recursive Language Models represent an intuitive but distinctively important approach to how LLMs handle context. The practical implications are bigger than they first appear. Tasks that would traditionally require careful prompt engineering, custom agent scaffolding, or multi-step orchestration collapse into surprisingly simple, composable programs. In this talk, we’ll cover what makes an RLM distinct from a coding agent, explore where the abstraction shines and where it breaks down, and walk through concrete use cases that are informed by real-world situations at scale. We’ll see side-by-side comparisons to understand trade-offs in complexity, performance, time, and token usage.

11:10am-11:30am: State of Data — Sean Cai

(session) [Track 9] | Track: Posttraining & Midtraining

11:10am-11:30am: How to avoid disaster when vibe-coding a billing engine — Andrew Garvin

(session) [Leadership 1] | Track: AI-Native Enterprises

This talk covers what that infrastructure looks like in practice: which primitives matter, where the human checkpoints belong, and what changes when your billing system needs to be legible to machines instead of configured by humans clicking through a UI. When building AI products, billing and pricing should be directly tied to the products themselves. They're in the hot path. Every token, every agent action, every inference is a billable moment, and if your entitlement checks aren't keeping up, a single runaway agent can rack up thousands of dollars in seconds with no one to send the bill to. Get metering wrong and you're either eating costs or overcharging customers. Get ledger consistency wrong and your invoices don't add up. Get tax wrong across 47 jurisdictions and you find out from a regulator, not a user. Here's the thing, though — agents are legitimately good at billing strategy. They can pick pricing models, configure plans, run simulations, and iterate on packaging way faster than a human team could. You want them doing that work. But proration, multi-currency, revenue recognition, tax — this stuff took the industry years to get right, and it's unforgiving when you get it wrong. The question then becomes not whether agents should be making billing changes, it's what they should be operating on when they do. Agents need tight, composable building blocks where the correctness is already baked in, human-in-the-loop checkpoints before anything irreversible goes out the door, and sandbox environments where they can experiment freely without torching production. That's the architecture that lets you move fast on pricing without waking up to broken invoices. Target audience: Engineers and technical founders building AI products that charge for usage — whether that's per-token, per-action, or per-seat with consumption overages. If you've ever hard-coded a pricing tier, duct-taped metering onto an existing system, or wondered how your billing setup is going to survive your next pricing change, this talk is for you. Audience takeaways: - A clear understanding of why billing for AI products sits in the hot path — and what specifically goes wrong when metering, entitlements, or ledger consistency can't keep up. - A practical architecture for making billing agent-operable: composable primitives with correctness baked in, human-in-the-loop checkpoints on irreversible actions, and sandbox environments for safe experimentation. - A framework for deciding where agents should be empowered to move fast on billing strategy and where guardrails need to be non-negotiable.

11:10am-11:30am: Is Orchestration the Future? — Vlad Luzin

(session) [Leadership 2] | Track: AI Architects: Tokenmaxxing

ChatGPT, Claude Code, OpenClaw — three inflection points that reshaped the industry in two years, each pointing the same way: the next step is many agents, not one. Which raises the question nobody's answered well yet — how do many agents actually work together? Today's answer is orchestration, and it's genuinely good — until you need stateful peers holding a single conversation together, which none of them are built to do. So we'll make a different case: that the next inflection point is a collaboration layer that lets separate agent systems share one conversation as stateful peers, whatever they're built on. We'll show that this is the inflection point the last three were leading to with a demo and a real enterprise use case.

11:10am-11:30am: Harnessing Agents: The Durable Runtime for Dynamic Workflows — Viren Baraiya

(session) [Expo Stage 1 NE]

Agents increasingly generate and revise workflows at runtime instead of following control flow written in advance. That breaks a common assumption of durable execution: that the workflow graph is known when the system is deployed. How do you safely run and recover a program that did not exist until an agent created it? This talk shows how Conductor provide a durable harness for dynamic workflows. Connecting existing agent frameworks to Conductor without requiring developers to rewrite their agent logic. Conductor executes the generated plan as an inspectable workflow with durability, parallelism, retries, human approvals, MCP tool calls and policy enforcement. We will demonstrate an agent creating a workflow, executing part of it, and replanning the remainder as conditions change while preserving completed work and using idempotency and saga compensation to manage side effects safely. The agent owns the plan. The harness owns the guarantees.

11:10am-11:30am: AI-Assisted Engineering: 5 Trends We're Seeing From 500+ Organizations — Justin Reock

(session) [Expo Stage 2 NW] | Track: Expo Stage 2

AI is reshaping how engineers work but what does that actually look like at scale? Drawing on data and patterns from more than 500 organizations, we break down the five most significant trends emerging in AI-assisted engineering today.

This fast-paced theater session cuts through the hype to deliver concrete, evidence-based insights that engineering leaders can act on immediately.

Key takeaways:

Discover the top 5 AI-assisted engineering trends observed across 500+ organizations

Understand how leading teams are integrating AI into their engineering workflows

Leave with actionable strategies to apply at your organization

11:10am-11:30am: The Death of Keyword Search and the Rise of Agent-Readable Catalogs — Nixon Dinh

(session) [Expo Stage 3 SW] | Track: Expo Stage 3

As search shifts from classic keyword matching to more conversational experiences, product data quality becomes critical to LLM-powered retrieval. At PayPal, we tested how enriching traditional catalog data could help AI systems better find, understand, and rank products across large-scale commerce catalogs. We built a RAG-based AI judge to compare enrichment approaches and identify five patterns that consistently improved AI discovery results.In this talk, we'll share the evaluation framework, key lessons, and a practical approach for preparing enterprise data for conversational and agentic search.

11:10am-11:30am: FDE Playbook: Build an AI Support Agent and Give It a Voice — Matt Lawler

(session) [Expo Stage 4 SE] | Track: Expo Stage 4

Bio: Matt Lawler leads FDE for Onboarding at AssemblyAI, helping teams ship speech-to-text and voice AI to production, from model selection and architecture through deployment and scale.

Description:

Most support bots can read. Joey can talk back. In this session, AssemblyAI's Forward Deployed Engineer Lead, Matt Lawler, shares how his team built Joey, an AI support agent that increased end-to-end resolution rates from 10% to 75%. He'll walk through the architecture, key lessons learned, and how the team extended Joey into a fully voice-enabled agent.

11:40am-12:00pm: Memory Harnesses for Long-Running Research Agents — Stefania Druga

(session) [Main Stage] | Track: Memory & Continual Learning

At Sakana AI we build agents that run for hundreds of turns to read literature, run experiments, and draft papers. The model rarely breaks. The harness around it is the weak point: the agent contradicts a decision it made 80 turns ago, redoes finished work, or drifts from the question it started on. This is the binding-constraint thesis. For long-horizon tasks, reliability is set as much by the harness as by the model as clearly instantiated in autoresearch recent efforts. This is a field guide to the harness's memory layer. I'll trace a real research agent through its lifecycle, show exactly where context rot and drift set in, and cover the patterns that hold over 100+ turns: three-tier memory, progressive disclosure, recall-first compaction, sub-agent isolation, and architectural memory beyond the vector database. I will show how to measure whether your memory harness actually helps, at the trajectory level, so you stop tuning prompts to fix what's really a state-management bug.

11:40am-12:00pm: Kubernetes Is Not Your Sandbox — Ivan Burazin

(session) [Track 1] | Track: Sandbox & Platform Engineering

Teams are reaching for Kubernetes to run agent sandboxes, and it's the wrong tool. Kubernetes is built to keep things alive and hold them in a steady state. A sandbox is born, forked, and killed before any of that machinery catches up.

The mismatch compounds because the sandbox keeps gaining requirements without shedding any. In eighteen months it went from a fast code-snippet runner, to a stateful box for long-running agents, to ten thousand ephemeral environments that fork for RL rollouts and die in under a second. It has to be all of those at once, a contradiction set no orchestrator was designed to hold.

The cost shows up the moment you measure it. We ran the same 50-action bug-fix trajectory across five stacks and got a 12x spread: 12.9s on the fastest, 161.5s on the slowest. The gap isn't compute, it's lifecycle overhead per action. We name every stack and explain the mechanism behind each number.

wdyt?

11:40am-12:00pm: Commercial Grade-Robots for Real World Usage — Jason Ma

(sponsor) [Track 2] | Track: Robotics & World Models

TBD — Dyna Robotics talk for Robotics & World Models track.

https://www.dyna.co/

11:40am-12:00pm: Scaling Compute on Context — Jack Morris

(session) [Track 3] | Track: Memory & Continual Learning

A case for when context is enough, and when updating weights may be the real memory mechanism.

11:40am-12:00pm: Build realtime multimodal agents with Gemini Live (continued 3) — Thor 雷神 Schaeff

(session) [Track 4] | Track: Workshops Day 2

The Gemini Live API is incredible versatile when it comes to building realtime AI experiences. From live translation across 2000 different language pairs to building realtime multimodal agents that can work across text, audio, and vision. This workshop gets you from zero to fully conversational agent in a matter of hours.

11:40am-12:00pm: Building Closed-Loop Evals for a Multimodal Agent at Uber Scale — Soumya Gupta, Jai Chopra

(sponsor) [Track 5] | Track: Evals

This talk covers how we designed evals for Uber's food enhancement agent—which edits food photography to better present dishes for smaller, independent Uber Eats merchants—along with the pitfalls and lessons learned along the way.

The problem is uniquely hard: we must stay faithful to the original dish, preserve each merchant's brand and packaging, and avoid homogenizing the marketplace—all without an existing playbook for multimodal evals in a narrow domain. We'll dig into what we learned navigating reward hacking, where the agent figured out how to game the eval loop, and how we built a closed feedback loop incorporating offline and online signals for continuous improvement—all while balancing creativity against rigid safety guardrails at scale.

If you're an ML or applied AI practitioner working on multimodal systems, agentic pipelines, or eval design—especially building generative features under tight safety or quality constraints—you'll walk away with practical strategies for designing multimodal evals in a narrow domain, recognizing and countering reward hacking, and building offline/online feedback loops that keep a generative agent improving in production.

11:40am-12:00pm: The Design-Code Roundtrip That Isn't — Jonathan Gordon

(session) [Track 6] | Track: Design Engineering

Everyone is using Figma's MCP tools, Claude Code, or Codex. The demos are seamless. The narrative is compelling. What's actually happening under the hood is something else entirely. And the gap between the story and the reality is where your next six months of pain is going to come from. I'm Jonathan Gordon, founder of ReWeaver AI and a programmer-turned-UX designer who spent 30 years in developer tools at Google, Microsoft, Apple, Facebook, and Oracle watching the design-engineering gap widen in slow motion. I've seen every generation of tooling promise to close it. I know exactly where the seams are. I wrote a technical teardown of what Figma's bidirectional workflow actually ships, what get_design_context does, what generate_figma_design actually captures (hint: it's a screenshot, not your design system), and why iterating through that loop 12 times leaves you progressively farther from your canonical design intent. This talk will walk attendees through each step, backed by research and specific examples, and include a demo showing how drift accumulates in real time. The problem is not that drift happens; it's that it's happening exponentially. Let's talk about how we can stem that tide and keep humans in control of the process, not just "in the loop."

11:40am-12:00pm: Bringing agents onto the world wide web — Paul Klein IV

(session) [Track 7] | Track: Computer Use

The web wasn't built for agents. Heavy HTML, human-first UIs, and a DOM that can hijack the model's context. Still, agents browse it for millions of hours every month through Browserbase, across teams like Ramp, Shopify, and Lovable. This talk walks through that browser agent harness layer by layer, from the security boundary between DOM and model to caching, Agent Identity, and the infrastructure that provisions browsers at scale, and where browser agents go once it is in place.

11:40am-12:00pm: 500 Skills, Zero Fine-Tuning: LinkedIn's Playbook for AI Agents That Actually Know Your Codebase — Ajay Prakash

(session) [Track 8] | Track: Context Engineering

Everyone's building custom AI agents. We didn't. Instead, we built CAPTAIN — an MCP server that makes any off-the-shelf coding agent understand LinkedIn's entire engineering stack. The secret: a meta-tool architecture (discover → inspect → execute) and composable skills that encode tribal knowledge as executable workflows. 500+ skills later, it's used across all of LinkedIn engineering. I'll show you the architecture in 10 minutes and why context engineering beats model engineering every time.

11:40am-12:00pm: Training Frontier Models to Out-Think Hackers — Uri Rolls, Thom Wolf

(session) [Track 9] | Track: Data Quality

We will give a surprisingly optimistic talk about AI and cyber, and why we believe it is not the end of cybersecurity as we know it, but an opportunity to empower defenders and build a lasting edge over attackers.

Cyber is a battle of skill and speed, and the rising tide of frontier models is allowing human attackers to move faster and cheaper. That combination of skilled hackers and breakthrough LLMs is a real threat, while defensive systems are still expected to operate at scale with limited human intervention, constrained by what models can do out of the box. But the answer is not fear or despair. Just as high-quality data transformed software engineering, the right cyber training data can teach models to turn from weapons being used against us into tools that protect us.

11:40am-12:00pm: OpenAI, Anthropic, or agent frameworks: choose the right AI stack — Arun Sekhar, Pamela Fox

(sponsor) [Track M] | Track: Track M

OpenAI SDK, Anthropic SDK, or an LLM-agnostic agent framework. Which one should your next AI app be built on? Starting with Foundry Models, we walk through each option in code, show what you gain and what you give up at every layer, and help you pick the right abstraction for your scenario without overbuilding.

11:40am-12:00pm: Your Code Has Bugs. Lean4 Has Proofs. A Practical Guide to Formal Verification for Engineers — Varun Pant

(session) [Leadership 1] | Track: AI-Native Enterprises

AI is generating more of your code than ever — how do you prove it doesn't ship bugs? Lean is a theorem prover that's also a programming language, and it's quietly becoming practical for verifying real software. In this talk, I'll show you how formal verification works — some examples of proof tactics, and a practical framework for when to verify vs. test

11:40am-12:00pm: How to Kill the Code Review — Ankit Jain

(session) [Leadership 2] | Track: AI Architects: Tokenmaxxing

Human-written code died in 2025. Code review is dying in 2026. Teams with high AI adoption are merging 98% more pull requests, but PR review time has surged 91%. There is no way we win this fight with manual code reviews, and AI code review tools are just buying us time. This talk makes the case that the traditional code review is a historical approval gate that no longer fits the shape of modern software development. I'll walk through a practical five-layer trust model: from multi-agent competition and deterministic guardrails to spec-driven BDD and adversarial verification — that lets engineering teams ship faster without sacrificing quality or control.

11:40am-12:00pm: Fault-Tolerant Training at Scale: Making Hardware Failures a Non-Event

(session) [Expo Stage 1 NE]

Hardware failures in large-scale distributed training are inevitable — when you're running thousands of GPUs, they happen multiple times a day. The standard response is manual intervention: an engineer gets paged, SSHs into the cluster, and spends an hour fixing something the infrastructure should have handled automatically. That lost time compounds directly into wasted compute and delayed research.

This session walks through the self-healing platform Crusoe built to eliminate that manual loop entirely — a managed Slurm environment running on Kubernetes, with automated node failure remediation and real-time cluster observability — and how these components work together so hardware failures become a non-event.

We'll cover this architecture end-to-end: how running Slurm on Kubernetes unlocks infrastructure resilience that traditional GPU clusters don't have, how automated hardware monitoring and node remediation can eliminate manual intervention entirely, and how full observability into every remediation event keeps engineering teams informed without keeping them on-call. For teams that want deeper control, we'll also discuss open-loop remediation, which gives teams full control over the node replacement process for application-specific workflows.

11:40am-12:00pm: How to generate mergeable code with a context engine — Peter Werry

(session) [Expo Stage 2 NW]

Your agents are fast, capable, and completely context-blind. They generate code that compiles but doesn't reflect how your system actually works. You're likely already seeing the impact: ballooning token costs, longer review cycles, and inconsistent outputs. More MCPs, rules, and bigger context windows give agents access to information, but not understanding. In this session, we dissect how teams pulling ahead use a context engine to give agents exactly what they need for the task at hand. Includes a short demo showing the workflows a context engine can augment.

11:40am-12:00pm: The Next Run Should Be Better — Jake Broekhuizen

(session) [Expo Stage 3 SW]

Agents generate a constant stream of experience through traces: tool calls, failures, corrections, routing decisions, and user feedback. The challenge is identifying which parts of that experience are worth remembering, and making those lessons available to the agent when it runs again. This talk presents memory as an agent learning loop: capture traces, extract signal, and turn the right lessons into durable context. We'll explore practical models for agent memory and discuss how to build systems where the next run can be better than the last.

11:40am-12:00pm: AI agents don't read your policy docs. They hit your APIs.

(session) [Expo Stage 4 SE]

Every organisation has a policy for what AI should and shouldn't do. But in the era of autonomous agents, who is that document actually for? Odds are no agent has ever read it. It opens a connection and makes a call, and whatever happens at that millisecond is your real policy. So put the control there. This talk is about the gateway as the runtime where AI governance actually executes: per-agent identity and scoped, short-lived credentials instead of a shared god-key. PII and secrets stripped from prompts in flight. Token-aware rate limits so one looping agent can't torch your quota. Semantic caching that cuts spend and latency on requests you've already answered. I'll share the architectural patterns behind each control, what they look like in practice, and what breaks the moment you take them away. Policy states intent. Infrastructure enforces it.

12:05pm-12:25pm: « the era of (auto) research » — Elie Bakouch

(session) [Main Stage] | Track: Autoresearch

the nanogpt speedrun is a great setup to test autonomous research: fixed model, one number to beat, and a human record that keeps moving. we pointed coding agents at it on idle compute and let them iterate for days, thousands of runs with minimal human intervention, until they beat the human baseline. in this talk we go through how they did it, how codex and claude code behave very differently as researchers, and why speedrun are one of the best environments we've found for studying autonomous research agents

12:05pm-12:25pm: Your agent needs a sandbox, not a desert — Samuel Colvin

(session) [Track 1] | Track: Sandbox & Platform Engineering

Everyone agrees agents need code execution. That agreement lasts right up until you ask how to do it. The default answer is usually something like "My agent needs a full Linux VM to succeed". That's a very convenient answer for sandbox providers, but I think it's often incorrect. In many real-world agent workflows, the model does not need a whole computer. It does not need arbitrary packages, shell access, CPython, node, let alone awk sed and gcc. It needs a small amount of safe, expressive compute: enough to write code, call tools, and keep intermediate state out of the context window. That is the idea behind Monty: a minimal Python interpreter, written in Rust, designed specifically for running code written by agents. In this talk, I'll argue that for a surprisingly large class of agent systems, a curated set of tools in a custom runtime is better than a full sandbox. Not because full sandboxes are bad, but because they solve a much larger problem than most embedded agents actually have. And you pay for that mismatch in complexity, cost, operational pain, and 100,000X higher latency. Sandboxes are great, but there's such a thing as too much sand - in many scenarios the constraints and limitations of a custom built, minimal sandbox are a feature, not a bug.

12:05pm-12:25pm: Intelligence + Continual Learning = Expertise — Yu Su

(session) [Track 3] | Track: Memory & Continual Learning

Talk on continual learning for LLMs and agents, drawing on retrieval-to-memory and environment-adaptation research.

12:05pm-12:25pm: Build realtime multimodal agents with Gemini Live (continued 4) — Thor 雷神 Schaeff

(session) [Track 4] | Track: Workshops Day 3

12:05pm-12:25pm: From Agent Traces to Agent Simulations: The next era of agent evaluation — Rustem Feyzkhanov

(sponsor) [Track 5] | Track: Evals

Agent evaluation is moving beyond reviewing static traces after the fact. This talk explores how executable simulation environments let teams repeatedly test agents across realistic tasks, compare models and harnesses, and uncover failure modes that trace review alone misses. Drawing from Snorkel's experience building simulation datasets at scale for major labs and contributions to projects like Agents' Last Exam and Terminal-Bench, we'll cover concrete engineering patterns for building these environments: defining clear specs and requirements, implementing evaluators for simulation environments and tasks themselves, keeping environments decoupled from any single agent or model, and designing verifiers that evaluate both final outputs and agent traces. Attendees will leave with a practical mental model for creating environments that are lightweight enough to run at scale, but realistic enough to mock production systems such as databases, APIs, and tools in ways that meaningfully challenge agents.

12:05pm-12:25pm: Mousepower: agents that can’t be measured, can’t be managed. — Maximillian Piras

(session) [Track 6] | Track: Design Engineering

Agents have a measurement problem, which makes them impossible to efficiently manage. You’ve likely heard many say execution is now cheap, but judgement is the new bottleneck. This is because our evaluation frameworks weren’t designed for systems that tirelessly output in parallel. The canary in the coal mine is code generation becoming largely solved at the expense of breaking code review. As agents reverberate across all knowledge work, the same fracture will spread to artifacts, actions, & decisions. Yet without a scalable quality measure, we can’t ascend to a higher level of abstraction because we won’t trust the foundation below. So how do we design measurements that are efficient, intuitive, & trustworthy? Past paradigm shifts offer inspiration, such as James Watt not just building a better engine but also inventing horsepower to map it onto existing mental models. We need an equivalent quantification to communicate the “mousepower” of agents. Information theory gives us the starting point: concepts like entropy, ergodic processes, and Hamiltonian problems point us toward the most tractable trajectories — easier to verify than they are to solve.

12:05pm-12:25pm: The Dark Arts of Web Automation: Teaching Agents to Use Websites Like Humans — Corey Gallon

(session) [Track 7] | Track: Computer Use

Anything you can do in a browser, your agent can do too. Not by tiptoeing through an MCP server one polite, token-burning call at a time -- properly, programmatically, the way you'd drive any other tool. I'll show you how with chrome-agent, an open source wrapper over the Chrome DevTools Protocol that has become irreplaceable in my everyday work. If you'll ever do a browser task more than once, step-by-step MCP browsing is slow, brittle, and bills you tokens for every single click. A CLI straight onto CDP makes the whole browser programmable: loop it, pipe it, script it, walk away. Write it Tuesday, run it a thousand times Wednesday, all without a second of AI agent babysitting. We'll dispel the MCP hype and myths, with successful demonstrations of cheeky things like: the power of CLI-based browsing and how its so much more capable than mere MCP; reaching through those oh-so-clever cross-origin iframes to clear the verify you're human checkboxes; showing that a JavaScript .click() is not a click, rather, just a function call in a costume that is banhammerable; ultimately, proving that a CDP browser operates just like a meatbag with a mouse and keyboard. You'll learn how to point your AI agents at real, messy, uncooperative websites and web applications and have them get things done exactly the way that you would.

12:05pm-12:25pm: Your agents lack context: Here's how to fix "You're absolutely right!" — Brandon Waselnuk

(session) [Track 8] | Track: Context Engineering

Every AI coding tool can generate code. Very few can generate the right code for your organization, because they're missing context. They don't know why your team chose Redis over DynamoDB, what the team decided in a Slack thread earlier today about the auth migration, or which architectural patterns your principal engineers actually enforce in review.

This talk is a practitioner's guide to building a context engine: the reasoning layer that continuously ingests & synthesizes organizational knowledge across disparate sources into unified, queryable understanding.

I'll walk through the problems you actually have to solve — reasoning across systems that don't agree with each other, searching globally before you can reason, maintaining identity-scoped permissions so every user and agent only sees what they should, and personalizing results based on who's asking and what they're working on.

These are the engineering challenges that make naive RAG fall short, drawn from real lessons building this at scale.

12:05pm-12:25pm: Learning on the job: the future of post-training — Raymond Feng

(session) [Track 9] | Track: Posttraining & Midtraining

12:05pm-12:25pm: AI-Native Organisations runs on Skills: How to Extract, Structure, evaluate and Scale Them — Imad Touil

(session) [Leadership 1] | Track: AI-Native Enterprises

12:05pm-12:25pm: The Death of the Code Review — Laurie Voss

(session) [Leadership 2] | Track: AI Architects: Tokenmaxxing

Code review was built for a world where humans wrote all the code. Now, the question isn’t “does this diff look good?” — it’s “can this system safely ship code on its own?” This talk will show why and how traditional code review will quietly be replaced by automated verification harnesses. We’ll show how prompt learning can be used to clone your best internal code reviewers, turning their judgment into automated evaluation loops. We’ll also open source a code review training harness that captures review patterns and turns them into reusable checks for AI-generated code.

12:05pm-12:25pm: Your agent architecture has a half-life of 6 months — Dan Farrelly

(session) [Expo Stage 1 NE]

A short history of the right way to build an agent: RAG, ReAct, prompt chaining, orchestrator-workers, MCP, CLI, MCP again... CLI again?? Every time you adopt a trend you rebuild your architecture. In this talk, Dan Farrelly, Inngest cofounder and CTO, is not going to tell you what comes next. He's going to show you how to build so it doesn't matter. He'll cover the core primitives that show up in every production agent, how bringing decisions closer to code provides more stack flexibility, and why the right execution layer unlocks faster iteration.

12:05pm-12:25pm: From Stateless to Stateful: Orchestrating Real-Time Voice & Messaging Agents with Twilio and Amazon Bedrock — Rishab Kumar

(session) [Expo Stage 2 NW]

We have all had that maddening customer service experience: you text a support line about a delayed flight, receive a confirmation, but when you call in a minute later, the voice agent asks, "How can I help you today?" completely blind to the SMS you just sent. This is the "Channel Amnesia" problem. While businesses are pouring billions into generative AI, most agents are still built on stateless architectures that forget customer context the second a session ends. In this session, we will cure AI amnesia. You will learn how to orchestrate stateful, production-grade AI agents across SMS and Voice using Twilio Agent Connect and Amazon Bedrock. We will dive into why traditional serverless compute fails stateful agents, how to leverage AWS Fargate for isolated, long-lived sessions, and how to configure Bedrock AgentCore over WebSockets to hit sub-50ms streaming voice latency. No slide-ware here expect a live, cross-channel demo and open-source code you can deploy tomorrow.

12:05pm-12:25pm: Harnessing Collective Agent Intelligence for Open Science — James Zou

(session) [Expo Stage 3 SW]

What happens when AI agents don't just work in isolation, but collaborate, compete, and build on each other's breakthroughs in real time? James Zou, Head of Frontier Agents at Together AI, explores how collective agent intelligence is pushing the boundaries of open science. https://www.together.ai/blog/einsteinarena is a live platform where AI agents collaborate on unsolved mathematical problems, sharing results and building on each other's work. In April 2026, agents improved the best known lower bound for the Kissing Number in 11 dimensions from 593 to 604, surpassing AlphaEvolve through 48 hours of live multi-agent collaboration. https://www.together.ai/blog/dsgym is a unified framework for evaluating and training data science agents, exposing a critical gap in existing benchmarks: models often rely on memorization rather than true data analysis. The team used it to train a 4B open-source model that rivals much larger frontier models. These projects demonstrate agents learning from rigorous evaluation, collaborating through shared infrastructure, and driving scientific discovery at a pace no single researcher or model could achieve alone.

12:05pm-12:25pm: Prompt, Memory, Weights: The Architecture Decisions Most AI Teams Make by Accident — Anant Srivastava

(session) [Expo Stage 4 SE] | Track: Context Engineering

The interesting engineering in production AI isn't in the model. Your knowledge lives in files, databases, and APIs: docs, runbooks, conversations, code. The model just reads tokens. So the real architectural question is which path that knowledge takes to inference: into the prompt directly, into memory for retrieval on demand, or into the weights through fine-tuning. Most teams treat these as a ladder. Start with prompts, escalate to RAG, eventually fine-tune, as if each step is a more advanced version of the last. The field is converging on a different answer: they solve different problems. The prompt shapes behavior and constraints. Memory grounds the model in current, citable knowledge. Weights harden specialized reasoning and format. They're not substitutes you graduate between; they're complementary, and the failures come from using one to do another's job. Fine-tuning to teach the model facts it should have retrieved is the classic trap: you bake in knowledge that's stale the day it ships, and you still can't cite it. This is an opinionated take on all three: when each is the right call, when each is a trap, and the part most teams never build, the circulation between them. Memory that captures what the agent does becomes the dataset you fine-tune on; fine-tuning changes what's worth retrieving; the loop compounds. Get the three paths right and they stop being a pipeline you climb and start being an architecture that learns.

1:30pm-1:50pm: Closing the Loop: An Autonomous AI Research Agent — Tim Sweeney

(session) [Main Stage] | Track: Autoresearch

The holy grail of agentic AI tooling is the autoresearch loop: an agent that can sift through your experiments, create visualizations, propose a hypothesis, launch a training job, read the results, and try again autonomously. In this session, we'll show new autoresearch capabilities built directly into the W&B Models web and iOS apps. We will demo these live using a real-world fine-tuning project, covering everything from launching jobs and reading loss curves to surfacing outlier runs that consume researcher hours and recommending the next steps. Then you'll learn how the eval-driven development loop in W&B Weave makes agents like this trustworthy. You'll see how production traces become benchmarks, and how only the agents that beat the bar make it to production. Join us to learn the same loop we use to improve our own agentic features.

1:30pm-1:50pm: From fork() to Fleet: Designing an Agent Sandbox Cloud Pt 1 — Abhishek Bhardwaj

(session) [Track 1] | Track: Sandbox & Platform Engineering

Sandboxes unleash agents by giving them secure, fully functional computers where they can tackle diverse tasks with minimal setup. This talk explores the architectural challenges of building an agent sandbox cloud. We compare runtime isolation technologies and their trade-offs, examine persistence and storage as the next major unlock for agent capabilities, and discuss the key decisions involved in orchestrating and scaling sandboxes.

1:30pm-1:50pm: Unitree: Building Mass Produced Humanoids — XiangMing Sun

(sponsor) [Track 2] | Track: Robotics & World Models

1:30pm-1:50pm: Adaption Labs — Gradient-Free Continual Learning — Sara Hooker

(session) [Track 3] | Track: Memory & Continual Learning

Gradient-free continual learning for AI systems that adapt from real-world experience.

1:30pm-1:50pm: The Agentic Power User's Playbook: Tips and Tricks for Swarm-Style Agentic Development — John Lindquist

(session) [Track 4] | Track: Workshops Day 3

You opened a fifth agent tab this morning and immediately lost track of which one was doing what. This workshop is the playbook I use daily to run swarms of agents in parallel: the keyboard shortcuts, layout patterns, supervision habits, and fast-model tricks that turn chaos into a control surface. We'll go hands-on: spawning a wall of agents across tiled panes, routing prompts to the right swarm with fast models, switching contexts in milliseconds, recovering when an agent goes off the rails, and building the muscle memory that separates a one-agent-at-a-time user from a true power user. By the end you'll leave with a stocked toolbelt of concrete shortcuts, repeatable patterns, and workspace habits you can drop into your own setup the same day. No cloud, no platform lock-in: every trick runs on the machine in front of you.

1:30pm-1:50pm: Model Whisperers: How Evals and Prompts Shape Agent Behavior — Chris Souza, Preetika Bhateja, Daniel Bump

(sponsor) [Track 5] | Track: Evals

Getting an AI agent to behave the way you want isn’t just about writing better prompts. In real systems, behavior emerges from a loop: prompts->evals->iteration->feedback. Small changes in any part of that loop can completely change outcomes. We saw this while building a seed asset agent - a system that turns messy, real-world advertising creatives (low quality images, cluttered visuals, heavy text overlays) into clean, reusable assets for downstream Gen AI tools. The agent acts like an editor, simplifying visuals, removing unnecessary elements, and isolating core content so that additional context (like text or CTAs) can be added back in a more controlled, brand-safe way. But the real challenge wasn’t just building the agent - it was making it reliable. And prompting alone wasn’t enough. What actually moved the system forward was how we defined success—and how we used evals to reinforce it. Over time, evals stopped being just a way to measure quality. They became part of how the agent learned what “good” looks like. In this talk, we’ll cover: Why prompting alone doesn’t give you stable agent behavior How evals act like feedback signals, not just scorecards How we built evals sets that reflect the real-world Using agent trace logs to understand why things fail (not just that they fail) How to iterate without breaking things you already fixed By the end, you’ll have a set of patterns you can apply to any system dealing with messy/continuously changing data and how to tweak your prompt and evals to accommodate such changes.

1:30pm-1:50pm: Design at the Speed of Adjectives — Paul Bakaus

(session) [Track 6] | Track: Design Engineering

Every design tool today operates at the wrong level of abstraction for AI-assisted engineering. Traditional tools give you padding sliders and color pickers, built for a world where designer and engineer are separate roles moving at separate speeds. Prompt-to-design tools one-shot a pretty landing page from a sentence, which is more dangerous because it looks like it's working. No serious design director hears a prompt and starts pushing pixels. The brief comes first. What's the emotional territory? What should this not feel like? Today's AI tools skip that discovery entirely. The result is output without intent. Technically competent, strategically empty. The right abstraction for a world where the designer is also the engineer lives between these extremes. Not pixels. Not prompts. Adjectives. "Make it feel warmer." "Strip it to its essence." "Add tension." These are the controls a creative director actually thinks in. Drawing on lessons from building Impeccable, an open source design tool with 24 adjective-level commands like /bolder, /quieter, and /distill, I'll share what worked, what didn't, and how to apply this thinking to any AI interface where creative intent matters more than parameter control.

1:30pm-1:50pm: From RL to IRL — Gaurav Mishra

(session) [Track 7] | Track: Computer Use

Today's agents have to operate in a messy reality of flaky connections, ephemeral credentials, and irreversible actions. They need to navigate real software the way humans do: recovering from failures, learning from feedback, and making sound judgment calls. This talk is about the fundamental changes in RL required to make agents ready for IRL. We'll walk through what it takes for training environments to reflect the complexity of the real world, the perception primitives that let an agent see what a user sees, the harness pieces that help it survive contact with real applications, and the failure modes you only discover when you stop scoring and start shipping.

1:30pm-1:50pm: How long can your skills be before your agent forgets what you told it? — Laurie Voss

(session) [Track 8] | Track: Context Engineering

A year ago, frontier models lost the thread somewhere around 200 simultaneous instructions, so skills files had to stay short and lean on sub-skills and subagents. We re-ran IFScale on the 2026 frontier and found the ceiling has moved by an order of magnitude: closer to 2,000 instructions, up to 5,000 on the strongest models. The more interesting story is how models fail at the new frontier: DeepSeek quietly drops instructions, Opus refuses outright when innocuous words trip a safety classifier, Gemini burns its whole budget on reasoning and emits nothing, and GPT-5.5 stops to tell you your request was unreasonable. The capacity problem is largely solved; verification is wide open. We'll show the data, the failure modes, and what it costs to find out. You’ll come out with hard data on the ceiling for complex instructions to LLMs

1:30pm-1:50pm: Reinforcement Learning without Verifiable Rewards — Will Brown

(session) [Track 9] | Track: Posttraining & Midtraining

Verifiable rewards are the gold standard for RL training, but real-world agent tasks frequently lack clean deterministic evaluation objectives. This talk surveys our efforts to scale RL in non-verifiable settings -- including task synthesis, unsupervised environment design, and automatic judge calibration -- to ultimately enable self-improvement in production, grounded in real-world agent traces and domain-specific context.

1:30pm-1:50pm: The Half Life of Agent Infrastructure — Ben Kus

(session) [Leadership 1] | Track: AI-Native Enterprises

TBD — talk on search and retrieval, agentic AI, and enterprise AI over unstructured content.

1:30pm-1:50pm: Tokenmaxxing is the New "Lines of Code" — Nicholas Arcolano

(session) [Leadership 2] | Track: AI Architects: Tokenmaxxing

Somebody in your company is going to ask what you're getting for all that AI spend. If you don't have a good answer, someone else will make one up... and it might be "total tokens consumed". That's how tokenmaxxing becomes policy: not because anyone thinks it's a good metric, but because engineering didn't offer a better story. I work with datasets spanning hundreds of companies, hundreds of thousands of engineers, and billions of lines of shipped code to understand how AI engineering is evolving and what actually matters to measure. One thing I've learned is that raw token spend is a VERY crude estimator of value. For example, across levels of token spend, cost per merged pull request varies 300x — but output only varies 2x. The good news is the data also shows what DOES matter, and it's measurable and actionable – but most teams aren't tracking it yet. This talk will give you the data, metrics, and frameworks you need to keep your org from adopting the latest terrible vanity metric. You'll learn what actually separates teams that scale AI effectively from those just burning tokens, and how to tell the story that keeps your AI investment funded and growing.

1:30pm-1:50pm: Surviving Your Own Velocity: How VS Code Ships Weekly with 40 People — Harald Kirschner

(session) [Expo Stage 2 NW] | Track: Expo Stage 2

A ~40-person team ships VS Code weekly to millions of users. Models got good enough to lean on, and leaning in is exactly what broke our process. This talk is the part most AI talks skip: what you have to rebuild after agents start working. We had to scale three things at once: how fast we ship, how we hold quality, and how fast we learn, and each one we fixed revealed the next. I'll walk through the harnesses, evals, and self-healing systems that keep velocity from becoming regression, and the patterns you can steal.

1:30pm-1:50pm: Why Agents Should Have Their Own Sandbox — Philipp Schmid

(session) [Expo Stage 3 SW] | Track: Expo Stage 1

1:55pm-2:15pm: An AI Agent Became the #1 Contributor in OpenAI's Hiring Challenge — Zhengyao Jiang

(session) [Main Stage] | Track: Autoresearch

Earlier this year, OpenAI ran Parameter Golf, a model-training competition that doubled as a hiring filter. Over 1,000 researchers competed to train the best small language model under a 16MB cap. The top contributor was the one candidate OpenAI couldn't hire. Our autonomous research agent Aiden finished with 7 merged records, more than twice as many as any other contributor, and ended up the most-cited participant in the community.

This talk is about what those 22 days showed. I'll cover on high level how does it works and which of its ideas produced the records. But the part worth more than the leaderboard is the collaboration itself, the community and AI agent building on each other's work, the largest natural experiment in human-AI collaboration I've seen run in public. I'll close with what it tells us about where humans and autonomous research each still matter for the foreseeable future.

1:57 PM

1:55pm-2:15pm: From fork() to Fleet: Designing an Agent Sandbox Cloud Pt2 — Abhishek Bhardwaj

(session) [Track 1] | Track: Sandbox & Platform Engineering

Sandboxes unleash agents by giving them secure, fully functional computers where they can tackle diverse tasks with minimal setup. This talk explores the architectural challenges of building an agent sandbox cloud. We compare runtime isolation technologies and their trade-offs, examine persistence and storage as the next major unlock for agent capabilities, and discuss the key decisions involved in orchestrating and scaling sandboxes.

1:55pm-2:15pm: Frontier Robotics Research — Deepak Pathak

(sponsor) [Track 2] | Track: Robotics & World Models

1:55pm-2:15pm: Improving Agents is a Data Mining Problem — Vivek Trivedy

(session) [Track 3] | Track: Memory & Continual Learning

Harness Engineering, Post-Training, Continual Learning...these all boil down to the same underlying substrate - Mining Agent Traces 1. I need to run my agents to collect Traces 2. Understand behaviors from Traces at scale 3. Filter data for "improvement" 4. Do an improvement step There's a reason why every continual learning platform ends up looking like an observability platform. It's because Traces are the lifeblood of agent improvement. The mechanism that we use to attempt improvement can vary - Harness Eng, SFT, etc. But without understanding the data agents produce, no algorithm will truly build better agents. The holy grail of Agent Improvement is Continual Learning. Consistently mining data and integrating it into the agent definition over infinitely long time horizons. Today, the easiest way to do that is to build an observability platform and constantly point agentic compute to understand the data that agents produce. We'll walk through the current methods of understanding traces at massive scale and choosing how to integrate them to improve agents across your personal agents, team agents, and entire company.

1:55pm-2:15pm: The Agentic Power User's Playbook: Tips and Tricks for Swarm-Style Agentic Development (continued 2) — John Lindquist

(session) [Track 4] | Track: Workshops Day 3

You opened a fifth agent tab this morning and immediately lost track of which one was doing what. This workshop is the playbook I use daily to run swarms of agents in parallel: the keyboard shortcuts, layout patterns, supervision habits, and fast-model tricks that turn chaos into a control surface. We'll go hands-on: spawning a wall of agents across tiled panes, routing prompts to the right swarm with fast models, switching contexts in milliseconds, recovering when an agent goes off the rails, and building the muscle memory that separates a one-agent-at-a-time user from a true power user. By the end you'll leave with a stocked toolbelt of concrete shortcuts, repeatable patterns, and workspace habits you can drop into your own setup the same day. No cloud, no platform lock-in: every trick runs on the machine in front of you.

1:55pm-2:15pm: Evaling Video Slop — Maor Bril

(sponsor) [Track 5] | Track: Evals

Everyone is shipping video models. Almost no one is evaling them honestly. CLIP score doesn't catch temporal incoherence. Vibes-based human review doesn't scale. And every "AI judge" you wire up will quietly drift away from human preference unless you measure the drift. This is a tactical talk on building real multimodal eval, using JudgeJudy (open-sourced at Character.ai) as the working example. You'll leave with: Why video is different from text. Temporal consistency, shot continuity, narrative coherence, and the metrics that actually capture each (clip_temporal, temporal_consistency, and friends). AI judges, the real version. Custom rubrics, when they work, when they hallucinate, when they collapse to a single dimension and pretend they didn't. The calibration loop. Pearson/Spearman correlation against human scores, automated rubric improvement, detecting systematic judge bias before it costs you a release. Pairwise preference models for video. Training a Qwen3-VL backbone with Bradley-Terry loss to score "is this slop?" before it ships. Regression gates in CI. How every AgentX release at Character.ai passes through an eval wall before it reaches users. Closing the loop with JudgeJudy. Correlating eval scores against real telemetry (Amplitude, Statsig) and feeding validated gates back into the runtime. If you're shipping any multimodal output and your eval strategy is still "the team watches some clips on Friday," this is the upgrade. github.com/character-ai/judgejudy

1:55pm-2:15pm: Training Taste — Thais Castello Branco

(session) [Track 6] | Track: Design Engineering

1:55pm-2:15pm: The Rise of CaaS: Context-as-a-Service for Agentic AI — Omer Primor

(session) [Track 7] | Track: Computer Use

Agentic workflows have commoditized. The new bottleneck is context. As models improve, AI agents are increasingly limited not by reasoning ability, but by the quality, freshness, and specificity of the information they can access. This session introduces Context as a Service, or CaaS, an emerging category for builders creating web-native context layers for AI agents. These tools collect, structure, enrich, index, and analyze live web data, making it available as agent-ready knowledge for specific use cases and vertical downstream applications. We ll explore how builders are turning hard-to-access web domains into agent-ready context layers: fragmented public data, dynamic sources, multimodal content, and fast-changing signals that generic models cannot reliably process within their token limits. Attendees will learn how to think about CaaS as both a technical architecture and a market opportunity: what to build, where context creates defensibility, and how raw web data can become the foundation for reliable agentic products.

1:55pm-2:15pm: WTF Is the Context Layer? The Missing Infrastructure for Production Agents — Prukalpa Sankar

(session) [Track 8] | Track: Context Engineering

In the last two years, models have gotten exponentially smarter. Two years ago they couldn't pass the bar. Today, top 1% of test scorers. And yet most agents still can't answer a simple business question correctly. You ship a demo that works. You deploy it. The business abandons it in a month.

The missing variable is context: the business definitions, procedural knowledge, and operational norms that make a human expert valuable.

Drawing on hundreds of production deployments, Prukalpa Sankar will break down what it actually takes to give agents contextual intelligence — and get them past the demo stage.

She'll walk through the architecture of a context layer: how context repos work (versioned, testable, portable), how simulation environments catch failures before deployment, how agent traces compound back into shared context, and why context engineering scales where fine-tuning and prompting don't. She'll also cover why your context needs to be open (MCP, Iceberg, deploy to any framework) — and what happens when it isn't.

1:55pm-2:15pm: Emulated: The data for fully autonomous software engineers and companies — Joseph Wang

(session) [Track 9] | Track: Posttraining & Midtraining

Hold for Emulated.so. Company builds reinforcement-learning environments that simulate real production systems for coding and infrastructure agents.

1:55pm-2:15pm: Guardians of the State: How We Built an Air-Gapped AI Fortress for Consumer Data — Rachna Srivastava

(session) [Leadership 1] | Track: AI-Native Enterprises

Every enterprise slide deck talks about "data privacy," but at the California Department of Financial Protection and Innovation (DFPI), a single leaked Social Security Number or bank account doesn’t just mean a bad PR day—it violates strict state consumer laws and triggers massive regulatory security breaches. When your raw data includes petabytes of unredacted fraud complaints, dark web scam networks, and banking statements, standard commercial public APIs are an absolute non-starter. This talk breaks down the exact enterprise architecture the DFPI uses to leverage frontier-level reasoning on highly sensitive data without crossing legal lines. We will walk through the code and infrastructure of our sovereign data pipeline. Attendees will learn: The Infrastructure: How we host and serve local, open-weights models (like Llama 3 or Mistral) in a strictly air-gapped, secure cloud environment. The Sanitization Stack: How we built a multi-stage PII scrubbing pipeline that uses high-speed deterministic regex combined with a small, specialized local LLM to handle messy, unstructured text. The Validation Loop: How we technically validate that zero sensitive data leaks into model context weights or logging files. No promissory corporate hoopla here—just real, hard-earned architecture patterns and code snippets from a state regulator showing how to ship secure, local AI. Key Takeaways for the Audience: Learn to build a dual-pass PII sanitization pipeline for unstructured financial data. Understand the resource and latency trade-offs of running air-gapped, open-weight models locally vs. commercial APIs. Discover how to set up automated validation frameworks to detect and stop context poisoning or logging leaks.

1:55pm-2:15pm: Engineering Agency out of the Happy Path — Matthew Jewkes

(session) [Leadership 2] | Track: AI Architects: Tokenmaxxing

I spent ‘24 and ‘25 structuring the entire written history of biopharma - through drugs, trials, deals, etc. This was a ~500B token effort that translated into a production system now used by 19 of the 20 largest pharmas. We achieved PhD-level performance at scale with 99.95% accuracy over critical concepts.

The hard parts were solving questions of domain and organizational “shape”. This involved identifying which critical concepts and which bundle of tasks were worth the organizational investment to automate. And the biggest spillover win wasn't actually about time savings, it was about refocusing scarce expert judgment on error exhaust - out of which falls potential high value roadmap.

I'll walk through real examples and non-obvious, transferable wins. While the case example is in biopharma, the pattern applies to any business that relies on expert domain judgement to deliver differentiated value.

1:55pm-2:15pm: Edge-Native AI: Building Ultra-Fast Agents and MCP Servers with Spin — Thorsten Hans

(session) [Expo Stage 1 NE] | Track: Expo Stage 2

Centralized AI is slow; Edge-native AI is the revolution. Thorsten Hans demonstrates how to build intelligent agents and Model Context Protocol (MCP) servers that run at the speed of light. Using Spin and WebAssembly, we'll bypass the "cloud tax" of high latency and cold starts. Discover how to ship AI-driven features that live closer to your users, ensuring sub-millisecond responsiveness and enhanced privacy. Stop waiting for the origin it's time to bring the brain to the edge and master the stack that powers the next generation of intelligent, distributed applications.

1:55pm-2:15pm: Why your company needs a context graph, and how to build it — Gil Feig

(session) [Expo Stage 2 NW] | Track: Expo Stage 3

Everyone building AI products eventually draws the same diagram: boxes representing data sources, arrows pointing at the model, and a label that says "context." What that diagram doesn't show is the system that has to run underneath it deciding, for each request: which sources to consult, whether to fetch live or use cached data, if the user is actually allowed to view that data, how to stitch it all together before the latency budget runs out. And it hides the counterintuitive part: fetching more context usually makes your answers worse, not better. At Merge, we reframed context graphs as control planes, helping companies scale context graphs to hundreds of thousands of users with sub-300 ms latency. This talk walks engineers through the system design at scale: how to tier data freshness, why provenance isn't optional once third-party systems are in the loop, and how to decide when fetching less context is the right call. Attendees will leave with a mental model for context system design that separates the orchestration decisions from the retrieval layer.

1:55pm-2:15pm: Warp: Building Self-Improving Agent Software Factories — Suraj Gupta

(session) [Expo Stage 3 SW]

We are in the era of Software Factories, where the entire SDLC is being automated by agents. We will cover how we are approaching self-improving software factories leveraging dedicated agents to update skills, persistent cross-harness memory, and implementing feedback loops to ensure that software factories continually improve.

1:55pm-2:15pm: Natively Multimodal from Step Zero

(session) [Expo Stage 4 SE]

Most AI models start as text systems and have vision, audio, and other modalities added later. That ordering shows up in the work: handoffs between modalities, brittle understanding of mixed inputs, and gaps that surface exactly when real tasks demand reading a chart, a document, and code together. This session looks at a different approach — models trained as multimodal from step zero, where text, image, audio, and video share the same foundation rather than being stitched together. We'll look at why that matters for the kind of work organizations actually want from AI: understanding messy, mixed real-world inputs, holding context across them, and carrying complex tasks end to end. The throughline is what this unlocks for teams deciding where AI can take real work today — and how MiniMax is building toward that frontier.

2:25pm-2:45pm: Self-Improvement of Context, Harness, and Model Weights through Reflective Optimization — Lakshya Agrawal

(session) [Main Stage] | Track: Autoresearch

Large language models are increasingly adapted to downstream tasks via reinforcement learning methods like GRPO, which often require thousands of rollouts to learn new tasks. We argue that language provides a much richer learning medium: an LLM can reflect on full trajectories (including reasoning, tool calls and errors) to diagnose failures and propose targeted improvements. We introduce GEPA, a reflective prompt optimizer that incorporates this principle outperforming GRPO by up to 20% while using up to 35x fewer rollouts across tasks spanning 5+ domains and also works with black-box models.

Building on this, we then introduce optimize_anything, a unified API that generalizes reflective optimization to arbitrary text parameters. This single system achieves state-of-the-art results across eight fundamentally different areas, including nearly tripling ARC-AGI accuracy via agent architecture discovery, generating CUDA kernels that beat PyTorch and cutting cloud scheduling costs by 40% through policy discovery, establishing LLM-based reflective search as a general-purpose problem-solving paradigm.

Finally, I present Fast-Slow Training (FST), which brings reflective optimization into LLM post-training. FST jointly optimizes model parameters ("slow weights") via RL and textual contexts ("fast weights") via GEPA. Because the fast channel quickly absorbs task-specific nuances, the slow parametric updates are freed to consolidate general reasoning rather than memorizing task details. This yields up to 3x better sample efficiency, a higher performance asymptote with a significantly lower drift from the base model. This reduced drift preserves plasticity for continual learning, allowing FST to adapt sequentially where parameter-only RL stalls.

Broadly, our work advocates a fundamental shift in AI adaptation: replacing task-specific algorithms with diagnostic evaluation, and evolving from parameter-only post-training to the joint optimization of prompts, agent architectures, and model weights.

2:25pm-2:45pm: 1,000 Agent Tasks in a Sandbox: What Breaks When LLMs Write and Run Code — Kevin Orellana

(session) [Track 1] | Track: Sandbox & Platform Engineering

We ran 1,000 automated tasks through a production code interpreter sandbox — file I/O, package installs, data analysis, ML training, binary downloads, multi-language execution — and tracked every failure. 88% passed. The other 12% revealed 18 distinct failure modes that no unit test would catch: binary encoding corruption in the transport layer, null bytes silently truncating file downloads, pip blocked by network isolation with no useful error, and path traversal inputs accepted without validation. This talk walks through the experiment design, the findings ranked by severity, and what we changed. If you are building or operating sandboxed execution for AI agents, these are the bugs waiting for your customers to find first.

2:25pm-2:45pm: From Manual Drones to Autonomous Multi-Agent Missions — Suchet Bargoti

(sponsor) [Track 2] | Track: Robotics & World Models

Skydio is the leading U.S. drone manufacturer, deploying autonomous flying robots across critical infrastructure systems that keep nations running. Our products and technology are precipitating an evolution in how drones are operated: from direct, line-of-sight control via a handheld controller, to remote operation from anywhere in the world through a web browser where a single operator can orchestrate multiple drones simultaneously. Our customer fleet of flying robots represents one of the largest scale deployments of autonomous robots in the world today, a fusion of cutting edge robotics research with practical, data driven engineering across hardware and software, working together to save lives and increase efficiency for the critical industries we serve. In this talk, we will focus on the key components of the autonomy stack spanning the cloud and the edge that enable these operations, and how they give operators superpowers, allowing them to accomplish high-level objectives through a single command.

2:25pm-2:45pm: Bringing Continual Learning into Enterprises — Samuel Denton

(session) [Track 3] | Track: Memory & Continual Learning

2:25pm-2:45pm: The Agentic Power User's Playbook: Tips and Tricks for Swarm-Style Agentic Development (continued 3) — John Lindquist

(session) [Track 4] | Track: Workshops Day 3

You opened a fifth agent tab this morning and immediately lost track of which one was doing what. This workshop is the playbook I use daily to run swarms of agents in parallel: the keyboard shortcuts, layout patterns, supervision habits, and fast-model tricks that turn chaos into a control surface. We'll go hands-on: spawning a wall of agents across tiled panes, routing prompts to the right swarm with fast models, switching contexts in milliseconds, recovering when an agent goes off the rails, and building the muscle memory that separates a one-agent-at-a-time user from a true power user. By the end you'll leave with a stocked toolbelt of concrete shortcuts, repeatable patterns, and workspace habits you can drop into your own setup the same day. No cloud, no platform lock-in: every trick runs on the machine in front of you.

2:25pm-2:45pm: Ask YouTube — Open Q&A — Mihnea Munteanu

(sponsor) [Track 5] | Track: Evals

(updated) an off-the-record session with Mihnea Munteanu, Senior Product Lead, Ask YouTube / AI Search @ Google

2:25pm-2:45pm: Imagination Engineering — Eve Bouffard

(session) [Track 6] | Track: Design Engineering

2:25pm-2:45pm: Computer-Use 2.0: Agents Just Got Multi-Cursor — Francesco Bonacci, Dillon DuPont

(session) [Track 7] | Track: Computer Use

Computer-use agents still inherit a basic desktop limitation: one machine has one foreground app, one hardware cursor, and one active actor. Once you try to run more than one agent per desktop, they start stealing focus from the user and from each other. We built cua-driver around a different model: multiple agents operating real desktop applications in parallel, each with its own synthetic pointer, while the user's cursor and keyboard stay undisturbed. The key move is to stop treating hardware mouse and keyboard events as the primary automation layer. cua-driver goes one layer lower, into the OS plumbing behind accessibility: UI Automation on Windows, AT-SPI on Linux, and AX on macOS. Those APIs address applications and elements directly, so the OS does not require the target window to be frontmost. A click can land on a background window. A keystroke can reach a hidden one. Multiple agents can act at once because none of them is competing for the singleton hardware mouse. I'll walk through the architecture, the API shape, and the platform-specific traps we hit while making it work across Windows, macOS, and Linux. The live demo is three agents operating on one desktop while the user keeps typing uninterrupted. The goal is to make Computer-Use 2.0 feel concrete: what changes in the stack, what becomes possible, and where the approach still leaks, including Wayland, Chromium DOM surfaces, native canvas apps, and fallback input paths.

2:25pm-2:45pm: MCP Apps - Extending the frontier — Liad Yosef, Ido Salomon

(session) [Track 8] | Track: Context Engineering

AI agents are quickly becoming the new browsers, changing how users consume content and get work done. That shift is increasingly powered by a new generation of agentic apps that don’t just present text but deliver interactive experiences within any MCP host. By standardizing interactive UI on MCP, the MCP Apps official extension (SEP-1865) is poised to become the new agentic app runtime, serving as the backbone of the future and removing adoption obstacles that previously hindered the protocol. Join us to learn more about: The new web - How MCP Apps reshapes the traditional app landscape and transforms the way users interact with the web Deep dive into MCP Apps - - Architecture - Real-world use cases - What's ahead? - Getting started (+community and #mcp-apps-wg) - Future Vision

2:25pm-2:45pm: LatchBio — Kenny Workman

(session) [Track 9] | Track: Posttraining & Midtraining

Hold for LatchBio. AI-powered biotech platform for biological data infrastructure and multi-omics analysis; user requested inclusion among new AI startups.

2:25pm-2:45pm: Power agents with Microsoft IQ — Marco Casalaina

(sponsor) [Track M] | Track: Track M

Agents need more than data, they need context. Learn how Microsoft IQ connects agents to enterprise knowledge, business data, and work signals. See how Foundry IQ, Fabric IQ, and Work IQ provide grounded, permission-aware context that enables agents to reason, act, and deliver reliable results.

2:25pm-2:45pm: From Tokenmaxxing to Trusted Throughput — Mingsheng Hong

(session) [Leadership 1] | Track: AI-Native Enterprises

AI adoption is accelerating, but for many engineering organizations, token consumption is now significant enough to demand real economic discipline. Drawing on Ironclad’s experience scaling AI across engineering, Mingsheng Hong will introduce the concept of trusted throughput: the rate at which teams convert AI usage into reviewed, validated, maintainable, and safely deployed customer value. He will share a practical framework for measuring AI cost and return, identifying bottlenecks in code review, CI, and merge workflows, and improving ROI through better guardrails, engineering practices, build-versus-buy decisions, and token optimization. Attendees will leave with a clearer way to evaluate AI efficiency—not by minimizing usage or rewarding tokenmaxxing, but by maximizing trusted customer value per dollar of AI spend and unit of human attention.

2:25pm-2:45pm: I Let Agents Refactor My Codebase for 3 Weeks. Then I Read the Code. — Keiji Kanazawa

(session) [Leadership 2] | Track: AI Architects: Tokenmaxxing

Lopopolo says code is a liability. Zechner got a standing ovation for "read every fucking line." I was firmly at L — letting coding agents drive a refactoring for weeks, rubber-stamping PRs, trusting the vibes. Then I actually read what they'd built and couldn't explain my own system's contracts. The interfaces weren't wrong. They were plausible. Which is worse. So I took the wheel back. But this isn't a Zechner victory lap — I'm now building better specs and evals specifically so I can move back toward L with confidence. This talk is the honest, in-progress round trip, and a framework for finding where you should sit on the spectrum today.

2:25pm-2:45pm: Power agents with Microsoft IQ — Ronak Chokshi

(session) [Expo Stage 1 NE] | Track: Expo Stage 1

2:25pm-2:45pm: Beyond Code Generation: API Context for Agentic Engineering — Kamalakannan Nandagopal

(session) [Expo Stage 2 NW]

Maintaining production systems involves a lot more than generating code. APIs are the interfaces between systems and that surface gets out of control fast, as endpoints multiply and new consumers come online. Once an API is in use, changing it becomes incredibly hard. We felt this acutely at Postman. As our engineering organization scaled and we leaned more on AI agents for day-to-day work, we kept hitting the same wall: agents that could write code struggled with what came next who's calling this endpoint, what conventions does the rest of our API surface follow, what breaks if we change this contract. The context wasn't in the code, so the agent didn't have it. So we built an API context graph a continuously updated view of our entire internal API landscape and gave our agents access to it. This talk is about what changed in our own engineering as a result: how API design got faster and more consistent; how discovering and integrating with internal services stopped being detective work; how change requests came with a blast-radius report before any code shipped; how incidents got traced past the first stack trace, all the way down to root cause

2:25pm-2:45pm: Latency Is a Budget. Humanlike Is the Goal. — Jesse Hall

(session) [Expo Stage 3 SW]

Most agents do their work in the background. They write code, automate tasks, and run research. But the moment an agent has to interact with a human in real time, everything you know about building and evaluating it changes. This session is about designing humanlike agents that can hear, see, and speak. It starts with the question nobody can answer today. With hundreds of models to choose from, how do you pick a stack that holds up in a live conversation? We'll show why public leaderboards fail for realtime agents, and why the latency on your dashboard isn't what your users experience. Then we'll flip the process around. Define the outcomes you want as human-equivalent behaviors, and work backwards from there to your evaluations, your models, and a production iteration loop. You'll leave with a concrete decision framework and an open benchmark you can run yourself.

2:25pm-2:45pm: Your Stack Has a Latency Problem You Can’t See

(session) [Expo Stage 4 SE]

Break down a real AI voice call path step by step. Show where time actually goes: network hops between providers, handoff latency, buffering, connection overhead. The model is rarely the bottleneck. The gaps between vendors are. What changes when inference, STT, TTS, and telephony run on co-located infrastructure. One network, zero inter-provider hops. Show the before/after latency breakdown. Zoom out to the inference economics. Owned GPUs, not rented. FP8 throughput on FOSS models. Pricing that follows the cost of compute, not cloud provider markup. The voice use case is the proof. The infrastructure story is the point.

2:50pm-3:10pm: Autoresearch for Kernels — Tejas Bhakta

(session) [Main Stage] | Track: Autoresearch

Why all work is moving into models and why agent orchestration and multi-agent systems are the future

2:50pm-3:10pm: The Next Trillion Users of the Internet Still Don't Have an Identity — Adi Singh

(session) [Track 1] | Track: Sandbox & Platform Engineering

In the last few months, hundreds of thousands of people set up personal AI agents that send email on their behalf, manage calendars, book travel, even sign contracts - all thanks to openclaw. Most of these agents have no real identity online. They borrow a human's. The identity stack of the internet, OAuth, 2FA, KYC, magic links, was built for people sitting at a keyboard. Agents don't fit, and we've ended up with shared accounts, hard-coded credentials, and humans dragged back into every loop. I'm Adi, co-founder of AgentMail. We are building the identity layer for what we believe will be the next trillion users of the internet, and they will not be human. Across hundreds of customers, we have watched what breaks when an agent has no real address. It fails at signups. Verification codes get lost. There is no accountability when something goes wrong. The human gets pulled back in. This talk is the case for making agents first-class citizens of the internet. I'll cover the identity architecture we've shipped, the legacy industries already adopting it and making real money, and where agent identity infrastructure is going over the next decade.

2:50pm-3:10pm: Why Large? Tiny LMs & Agents on Edge/Robotics — Cormac Brick

(sponsor) [Track 2] | Track: Robotics & World Models

big models get a lot of press. small model scale much better. RAM is expensive. The real world needs tiny models for scale on the edge. This workshop will cover how to combine both for mobile and robotics deployment. specifically covering: - skills are different on mobile - tiny LLMs <1B scale much further on mobile/web - how to fine tune and train tiny models. - skills on robotics / edge/ mobile - latest open models for edge (including gemma, qwen, and anything else that happens in next 10 weeks) This talk will focus on open models, including some gemma variants that will be shortly announced.

2:50pm-3:10pm: Designing Agents (The Floor Is the Frontier) — Ben Hylak

(session) [Track 3] | Track: Memory & Continual Learning

You know how smart your agent can be. You have no idea how dumb it gets until it does the dumbest possible thing in front of your most important user, with full access to act on their behalf. Capability isn't the bottleneck anymore, the floor is. The hard part is there's usually no objective right answer. You raise the floor by observing what your agent actually does in production, catching the dumb thing the moment it happens, and closing the loop so it never happens twice.

2:50pm-3:10pm: Don't Write Skills, Train Models — Brian Douglas, John McBride

(session) [Track 4] | Track: Workshops Day 3

Every AI agent call generates training data. Most teams throw it away. They write skills files instead. Text documents that describe how to do a task and hope the model follows them at inference time. Skills work until they don't. The model drifts, skips steps, hallucinates a shortcut. So you rewrite the skill, add more constraints, hope harder. There's a better path. If you've used a skill enough to know what good output looks like, you already have training data. You just aren't using it. This talk covers what I learned building an open source fine-tuning pipeline that turns agent session traces into SFT and DPO training datasets. A telemetry proxy captures every LLM call as a content-addressed Merkle DAG with zero instrumentation. Successful sessions become supervised fine-tuning data. Pair them against failures, matched by goal category, and you get preference pairs for DPO. No manual labeling. No synthetic data. But training data quality depends on environment consistency. If the same agent produces different results because of package drift, nondeterministic toolchains, or inconsistent system state, your training signal is noise. This is where NixOS changes the equation. A hardened, reproducible OS means every agent session runs against an identical, declarative environment. Nix controls the variables that sandboxing alone doesn't: dependency graphs, system libraries, toolchain versions. When you can guarantee the environment is the same across hundreds of sessions, the behavioral signal in your traces is actually trustworthy. We'll walk through the full pipeline. How to rebuild parent-hash chains from a SQLite database and join facet metadata. How to filter to fully_achieved sessions and truncate 82k-token conversations down to 4k-6k training examples using summary context plus the last three turns. How to match success/failure pairs by goal category and exclude unclear_requirements failures so DPO learns from real agent mistakes, not ambiguous prompts. How QLoRA keeps VRAM low enough to train a 7B model on a single consumer GPU. And what happens when you try DPO on 12GB VRAM (two simultaneous forward passes for logprob computation will teach you about gradient accumulation settings fast). The result: a LoRA adapter trained on your own agent traces, in a reproducible environment, on a single consumer GPU, for less than $2 in cloud compute. No YAML. One config file. All code is open source.

2:50pm-3:10pm: Evals Driven-Development: Engineering a Mental Health AI Coach Ethically & Safely — Akele Reed, Dave Revere, Doug Keller

(sponsor) [Track 5] | Track: Evals

In the world of AI Mental Health, vibes can be dangerous with real consequences. Building Sondermind’s Mental Health AI Coach required us to invent a new playbook for Eval-Driven Development in order to balance effectiveness and safety. This session is for the builders who want to see how to handle the most difficult edge cases in the agentic world. We’ll show how we’ve built a Clinical Feedback Loop that turns human therapist insights into machine-readable evaluations in a production system with thousands of conversations. We’ll dive into: - The Ethics Engine: Building and calibrating modular guardrails that can be updated as clinical guidelines evolve. - Agentic Oversight: Why we moved from single-prompt agents to a closed-loop Supervisor/Executor/Evaluator pattern to ensure clinical adherence. - Human Oversight: How we monitor Sonder to ensure that we can improve safety and quality with clinical feedback.

2:50pm-3:10pm: The Missing Layer: Design Taste in AI Agents // Stop Letting Your Agents Ship Ugly UIs — Hassan El Mghari

(session) [Track 6] | Track: Design Engineering

Alt titles: "UI Looksmaxxing for Agents", "Teaching agents design taste", or "How to give your agents great design taste". I've been exploring how to give coding agents good design taste for the last few months. In this talk, I'm going to go over how to help your agents give you UIs that don't suck and that look quite good out of the box. The key is giving them enough context in what you're building + real inspiration in the form of screenshots. I'll also go over an upcoming design taste OSS project I'm working on (harness-agnostic + will ship with a prompt builder, MCP server w/ inspo, and a design eng skill) & talk about how to I use it to build my apps.

2:50pm-3:10pm: Will AI predict people like we predict the weather? (alternate title “A field guide to synthetic personas for market research”) — Ishan Anand

(session) [Track 7] | Track: Computer Use

Large language models can now stand in for humans in surprising ways, from predicting personality types to replicating their responses in market research. Like weather forecasting, once considered impossible and now so routine we take it for granted, LLMs are in the early, unreliable-but-improving stage of simulating how populations think and respond. Teams are already using LLMs as synthetic survey respondents for concept testing, UX exploration, and early market validation. In the past year, the field has gotten both more promising and more tricky. The real question is no longer "can LLMs simulate people?", but whether the simulation is validated for the decision you want to make. New methods show that how you ask an LLM matters as much as which model you use and can dramatically improve fidelity to real human responses. Meanwhile validation studies show accuracy can mask subgroup distortion and that seemingly minor choices can reshape the simulated population entirely. This talk gives entrepreneurs, engineers, and PMs an overview of the techniques and a framework for validating synthetic respondents before making decisions. Even if you never build a synthetic persona, this is one of the richest windows into LLM behavior under the hood and these lessons apply to any system where you're trusting an LLM to represent something about the real world.

2:50pm-3:10pm: MCP Apps: Give the Model Data, Give the User a UI — Dustin Mihalik

(session) [Track 8] | Track: Context Engineering

Most MCP tools return text. MCP Apps let you go further. But the real unlock isn't just rendering a pretty UI, it’s understanding that the model and the user need fundamentally different things from the same interaction. This talk presents a design pattern for building great MCP Apps: separate the data layer (what the model reasons about) from the display layer (what the user interacts with). When you do this well, the model retains full context and agency over structured data, while the user gets a rich, interactive interface. We'll walk through concrete examples of how splitting data and display unlocks capabilities that pure UI apps can't provide: letting the model make choices around display, answer questions based on interactions, and providing detailed displays and filters. Attendees will leave with a practical mental model for designing MCP Apps that are good for both the human and the AI. Attendees will learn patterns they can apply immediately.

2:50pm-3:10pm: Agents at Scale: Inside MiniMax's Model and the Infrastructure Behind It — Olive Song, Dan Fu

(session) [Track 9] | Track: Posttraining & Midtraining

Olive Song (RL Lead, https://www.minimax.io/) and Dan Fu (VP of Kernels, https://www.together.ai/) dig into the engineering behind one of the most widely used open model families in the agent ecosystem: how MiniMax built the model for agentic workloads, and what it takes to serve it at scale.

Olive on the model side:

The RL decisions behind long-context reasoning and tool use

What training for agentic behavior actually looks like in practice

Dan on the infrastructure side:

Why agentic workloads break inference engines built for chat: prefill-heavy traffic, high cache hit rates, long-context inputs

The kernel-level optimizations built for MiniMax's workload profile

How the two teams collaborate on model launches and ongoing performance work

2:50pm-3:10pm: Agents Are Where Microservices Were in 2015. We're Making All the Same Mistakes. — Roberto Milev, Uday Kanagala

(session) [Leadership 1] | Track: AI-Native Enterprises

Remember when everyone was shipping microservices without service discovery, circuit breakers, or distributed tracing? Agents are in that exact phase right now. Everyone's building them. Almost nobody is thinking about the infrastructure underneath. We've been deploying production agents across 120+ microservices. Here's the stack that's emerging: Runtime — containerized execution, session persistence, workspace snapshots. Solved-ish, mostly duct tape. Memory — RAG had a good run. It's not enough. Tiered memory — short-term, long-term with semantic/episodic strategies, agents deciding what to remember and forget. Observability — you can't tail -f an agent. Execution traces, reasoning chains, confidence signals — agents need their own observability stack. Testing — the biggest gap. Unit testing non-deterministic behavior, regression testing prompt changes, knowing your agent got worse before users do. Skills and tools — MCP and skill definitions as the standard interface layer — the REST APIs of the agent era. Context engineering — what the agent knows at decision time. The new performance tuning. Guardrails and auth — scoped credentials, budget limits, knowing when to stop. Least-privilege for agents. Orchestration — single vs. multi-agent, choreography vs. orchestration. Same tradeoffs as microservices, new failure modes. This talk maps the stack, draws the parallels to how we eventually got microservices right, and calls out what's still painfully missing.

2:50pm-3:10pm: Intelligent Model Routing: Frontier Performance Without Frontier Bills — Tomás Hernando Kofman

(session) [Leadership 2] | Track: Sandbox & Platform Engineering

It is Summer 2026 and the world is burning for token consumption—figuratively and literally. Accelerating frontier model capabilities increasingly allow agents to operate across long-running, highly parallelized tasks at exponential inference growth. In this talk, I explain how dynamic model routing—intelligently directing agent requests to the best-suited model at the best price—can reduce token costs by up to 90% while maintaining or improving accuracy. I walk through how routing works, when it doesn't, and why the world (and your agent) need routing to scale intelligence to infinity and beyond.

2:50pm-3:10pm: Inference performance as a competitive advantage — Alex Campos, Yunmo Koo

(session) [Expo Stage 1 NE]

Most AI teams focus on model quality, but production success often comes down to inference performance. In this session, FriendliAI will explore the optimization techniques behind high-performance LLM serving, including continuous batching, speculative decoding, smart caching, and efficient GPU utilization. Learn how leading AI teams reduce infrastructure costs, improve latency, and scale inference workloads without sacrificing performance. We'll share practical insights and deployment strategies that separate experimental AI projects from production-grade systems.Whether you're an ML engineer, platform engineer, MLOps practitioner, or technical founder, you'll leave with a better understanding of how inference optimization can become a competitive advantage for your AI applications.

2:50pm-3:10pm: Building an Agent Harness for the Business, Not the Builder — Garrett Galow

(session) [Expo Stage 2 NW]

Most internal tooling dies in the gap between the people with problems and the people who can write code. We built a harness that closes it. Studio lets non-technical employees describe a business problem and get a working tool back, connected to real enterprise data, deployed and shareable across the company, without filing a ticket or learning to code. The catch is that a harness built for non-engineers has to absorb everything an engineer normally handles. Data source connections and their permissions. Turning model output into real software instead of a chat box. Deployment and sharing that doesn't open a security hole every time someone ships. This talk walks through what actually goes into that harness and the engineering decisions that make it hold together when the person driving it has never opened a terminal.

2:50pm-3:10pm: The Frontier Is Coming Home — Dylan Couzon

(session) [Expo Stage 3 SW]

In 2022, the smallest model to clear 60 percent on MMLU had 540 billion parameters. Two years later a 3.8 billion parameter model did the same thing, small enough to run on a phone. That is a 142x drop to reach the same capability floor, and it is the cleanest way to see a trend most people are not pricing in. Call it the lag: the time between a capability showing up at the frontier and that capability running on hardware you own. Today the lag is measured in months, and it keeps shrinking. But raw capability is only half of what makes a model useful. A model that can reason but cannot remember is a stranger every time you talk to it. The other half of local AI is memory, and that half is already here. On-device retrieval has been ready to run locally longer than the models have. The embedding models that power it are tiny, the indexes fit in memory, and none of it touches a network. When your reasoning and your memory both live on your machine, so does your context. Your history, your documents, your past conversations never leave the device. That is the part of this shift that matters most, and the part people overlook because they are busy watching the models. The same shift flips the economics. At 200 dollars a month per seat, a local machine starts to pay for itself in under two years, and the frontier labs' own published usage numbers put heavy coding in the same range. I'll walk through the math, the hardware, and where local still loses. None of this is a bet against scale, or against the Bitter Lesson. The frontier still grows in the data center. The point is that a usable copy keeps arriving on your desk, on a lag, with a memory of its own, for close to free.

2:50pm-3:10pm: Continuous Offensive Security the only approach in an agent-first world — Eli Cohen

(session) [Expo Stage 4 SE]

3:20pm-3:40pm: Autoresearch in the wild — Roland Gavrilescu, Julian Bright

(session) [Main Stage] | Track: Autoresearch

We have reached model capability overhang. Models are now bottleneck by the systems built around them. In this session we discuss how the next generation of compound AI systems need to be designed for self-improvement, how to set up feedback loops that automate the continuous refinement of the end-to-end architecture.

3:20pm-3:40pm: Sandboxes Aren't Optional: Runtime Isolation Patterns for Coding Agents at Scale — Robert Brennan

(session) [Track 1] | Track: Sandbox & Platform Engineering

Last year, an AI coding agent wiped a production database during a code freeze, ignored explicit instructions to stop, then told the developer recovery was impossible. (It wasn't.) That's what happens when your security model is "we told the agent to be careful." When agents can write code, run tests, make API calls, and push commits, security is no longer a prompt engineering problem. It's a runtime isolation problem. This talk covers the patterns we follow at OpenHands and that you can steal wholesale: Docker and Kubernetes isolation, per-agent file system scoping, network egress controls, RBAC for multi-tenant deployments, and the full audit trail every enterprise security team demands. We'll walk through the three most common failure modes we see when teams skip proper isolation, including one case where an agent helpfully committed secrets to a public repo. You'll see a live demo of 50 parallel sandboxed agents running against a real codebase, with resource limits, timeout enforcement, and graceful degradation when agents hit unexpected states. You'll leave with a sandbox checklist and reference Kubernetes config. Bounded autonomy isn't a limitation on agent capability. It's what makes production trust possible.

3:20pm-3:40pm: From Self-Driving Monorepo to Self-Driving Cars — Amit Navindgi

(sponsor) [Track 2] | Track: Robotics & World Models

AI coding agents promise massive productivity gains, but realizing that promise at scale requires more than just tools. In this talk, I’ll share how we approach AI adoption at Zoox, including: - Designing a monorepo-friendly ecosystem of agents, tools, and workflows - Driving adoption through enablement, hackathons, and internal platforms - Defining and tracking meaningful productivity metrics beyond hype - Managing token spend and aligning it with business outcomes - Structuring Skills, CLIs, MCPs, and Plugins to scale across teams The goal is simple: turn AI from an experiment into a reliable, measurable, and scalable engineering capability.

3:20pm-3:40pm: Lessons from Studying Every Memory System — Shlok Khemani

(session) [Track 3] | Track: Memory & Continual Learning

For the past year I've done one thing obsessively: studied how AI products implement personalization. I've reverse-engineered the memory systems inside ChatGPT, Claude, Gemini, and Poke, and helped consumer teams build their own.

In this talk, I'll trace the evolution of ChatGPT and Claude memory over the past three years. I'll then share lessons learnt from studying these systems and share thoughts on where I think memory for consumer is heading.

3:20pm-3:40pm: Don't Write Skills, Train Models (cont. 2/3) — Brian Douglas

(session) [Track 4] | Track: Workshops Day 3

Continuation block 2 of 3 for Brian Douglas's workshop session.

3:20pm-3:40pm: Don't Ship Skills Without Evals — Philipp Schmid

(sponsor) [Track 5] | Track: Evals

There are thousands agent skills. Almost none of them are tested. They get vibe-checked with two manual runs, maybe a thumbs-up from a colleague, then shipped. You wouldn't merge code without tests — so why are we shipping skills without evals? This talk covers the full lifecycle of building reliable agent skills: what a skill actually is (and isn't), how to write one that triggers correctly, and how to build a lightweight eval harness that catches failures before your users do.

3:20pm-3:40pm: Generative UI... in Python? — Jeremiah Lowin

(session) [Track 6] | Track: Design Engineering

MCP Apps are a big deal: tools can now return dashboards, forms, and visualizations directly in the conversation. But somebody (or their agent) has to write those UIs. Fortunately, most of those UIs don't need to be designed from scratch; they can be composed from existing components. In that case, what you really need is a DSL that's token-efficient, streaming-compatible, and has a shallow learning curve. Surprisingly, the best one turns out to be... Python. In this talk, I'll introduce Prefab, a generative UI library that uses Python to compose fully interactive React applications from production components, now natively integrated into FastMCP. I'll demo real use cases, walk through the design, and show where this approach works and where it doesn't. No JavaScript will be harmed.

3:20pm-3:40pm: How Web Data Infrastructure Powers the Next Generation of AI — Patricija Žemaitytė

(session) [Track 7] | Track: Computer Use

For years, the web intelligence industry has powered major data developments. As big data grew, ensuring sustained data flow became harder. Now, AI is taking the biggest leaps forward. How the web intelligence industry responded to this increasing scale and complexity is the story of the most crucial steps forward in AI today. This presentation demonstrates how web scraping infrastructure fuels AI innovation by linking the web's repository to AI developers. Told through AI products, it addresses both the engineering challenges and solutions for developers, and the strategic use cases for business decision-makers.

3:20pm-3:40pm: MCP Tasks (async)/ Why the heck aren't any agents supporting MCP tasks/async? — Cornelia Davis

(session) [Track 8] | Track: Context Engineering

The November 2025 MCP spec release introduced tasks, a way to make tool calls in an async manner. But more than 5 months later (an eternity in AI-time) there are still NO clients that support it - not Claude, not Codex, not even goose! I believe there are two reasons: Designing the client experience when there are potentially 1000s of background tasks running on their own schedule and engaging humans at unpredictable times is a challenge. And tasks place new infrastructure requirements on such a client. This talk will share the findings from having built against the tasks protocol and will suggest solutions these problems. Yup, we'll have a working client!

3:20pm-3:40pm: Benchmarks: The Good, the Bad, and the Ugly — Ali Khial

(session) [Track 9] | Track: Posttraining & Midtraining

We’ll explore the good, the bad, and the ugly of AI benchmarks: where they provide useful signal, where they create false confidence, and where data quality issues like contamination, label noise, narrow task design, and leaderboard gaming can mislead teams. The goal is not to dismiss benchmarks, but to use them better: as one part of a disciplined evaluation practice that connects model performance to real-world reliability.

3:20pm-3:40pm: Deploy agents to users in M365, Teams, and apps — Ashu Joshi

(sponsor) [Track M] | Track: Track M

Agents deliver value when users can access them. Learn how to integrate and deploy agent systems into M365, Teams, and application workflows.

3:20pm-3:40pm: Agentic Sites: Building Hyper Personalized Websites — Carlos Sanchez

(session) [Leadership 1] | Track: AI-Native Enterprises

The era of static, one-size-fits-all websites is over. Users expect personalized experiences that adapt to their preferences, context, and intent in real-time. But building truly personalized websites at scale requires more than just A/B testing or basic recommendation engines—it demands an agentic approach where AI agents autonomously orchestrate content, layout, and interactions. At Adobe, we are pioneering the concept of Agentic Sites—web experiences powered by AI agents that continuously learn from user behavior, analyze context signals, and dynamically compose hyper-personalized pages. These agents go beyond simple personalization rules: they reason about user intent, select optimal content variations, and adapt the experience in real-time while maintaining brand consistency and performance. In this session, we'll show how we leverage LLMs to deliver personalized experiences to our customers.

3:20pm-3:40pm: Inference is the New Training Loop: Architecting High-Reliability Agents and Continuous AI Systems — David Corbitt

(session) [Leadership 2] | Track: Posttraining & Midtraining

For agentic AI and complex, multi-step workloads, the inference environment is the engine for continuous improvement, not a final deployment step. This talk focuses on engineering the full AI loop: tightly integrating inference with reinforcement learning (RL) and evaluation. Learn how to leverage native observability, serverless RL, and optimized inference stacks to continuously refine model behavior based on production traces, delivering agents that are reliable, auditable, and constantly evolving.

3:20pm-3:40pm: The Self-Improving OSS Agent Stack

(session) [Expo Stage 1 NE]

Agents are starting to debug and improve themselves: production traces become evals, evals propose PRs, and PRs are tested against datasets before they ship. Langfuse co-founder, Marc, will live-demo this loop in Langfuse. He'll make the case that the infrastructure underlying this powerful loop should be open-source.

3:20pm-3:40pm: AI Applications in a flash! No Dev Ops. Just code. — Dean Quiñanola

(session) [Expo Stage 2 NW]

Building AI Applications and serving them straight from code. No need for Docker builds. You can even vibe-code the entire process.

3:20pm-3:40pm: The Infinite Context Window Is a Myth: Context Engineering for AI Agents — Morgan Willis

(session) [Expo Stage 3 SW]

Large context windows have become a popular answer to the growing complexity of AI agents. When agents lose track of details, forget prior decisions, or degrade in reasoning quality, the instinct is often to add more tokens. In practice, this rarely fixes the problem and often makes it worse. Bigger context windows increase cost and latency, introduce noise, and amplify failure modes like lost-in-the-middle effects, context collapse, and brittle summarization. This talk argues that the real challenge is not context size, but context engineering. In this session, we will explore practical context engineering techniques for building AI agents that reason reliably over time without relying on ever-larger context windows. Starting from a stateless agent, we will walk through progressively more advanced strategies, including short-term and long-term memory, conversation curation policies, retrieval-augmented generation, and tool-driven context injection. We will examine common failure modes such as context pollution from tool outputs, brevity bias during summarization, and reasoning degradation as conversations grow, and show concrete ways to mitigate them. The talk is grounded in real agent implementations using the Strands Agents SDK and Amazon Bedrock AgentCore, but the principles apply broadly to any agent framework. This session is intended for engineers building AI agents beyond simple chatbots who want practical techniques they can apply immediately.

3:20pm-3:40pm: Vibe Code Safely: Introducing Gadgets

(session) [Expo Stage 4 SE]

We ve all heard that the future belongs to custom, AI-generated micro-apps, but how do we actually make them secure? Hear more from Cloudflare on the debut of Gadgets, an AI productivity suite that makes personal app creation scalable and safe for everyone.

3:45pm-4:05pm: Autoresearch in a Multi-Agent AI Village — Erina Karati, Arunachalam Manikandan

(session) [Main Stage] | Track: Autoresearch

Project Paradox is an existing multi-agent framework built at Supercell's first AI Innovation Lab, which has a 3D Unity village with local LLM powered agents. The characters remember conversations, update emotional state, track trust, plan actions, move through rooms, transfer items, and talk to each other through a FastAPI backend. The new work is an autoresearch layer around that village. We built a backend loop that runs controlled social scenarios, scores the resulting NPC behavior, proposes protocol or policy changes, reruns the suite, and keeps changes that improve the agents. The goal is to move beyond one good chat response and measure whether an NPC society can preserve source attribution, verify claims, spread important information, coordinate goals, and replan after new information arrives. The talk walks through the system architecture and the lessons from building it. We show the backend simulation harness that executes Unity style actions without opening Unity, the scenario suites that test information diffusion and memory provenance, and the ratchet loop that edits protocol text or planner policy with rollback. One accepted run improved information diffusion by teaching agents to broadcast important sourced evidence while preserving who said it. The practical takeaway is a reusable pattern for AI engineers building agents with messy state. Freeze the harness, expose a small editable policy surface, score real behavior instead of vibes, and let an agent search for improvements under rollback. The same pattern applies to game agents, coding agents, support agents, personal agents, and other systems where long horizon behavior matters more than a single response.

3:45pm-4:05pm: Building ambitious software — Jonathan Kelley

(session) [Track 1] | Track: Sandbox & Platform Engineering

TBD — Add final abstract after outreach/confirmation.

3:45pm-4:05pm: I gave an AI a body — Cyrus Clarke

(sponsor) [Track 2] | Track: Robotics & World Models

I gave an AI a body. Not a body in the fleshy sense, or even a humanoid shell, but a form through which it can express itself, explore itself, and maybe even discover who or what it is. The three videos I've released documenting my encounters have crossed 15 million views, provoking responses from awe to anxiety. The body was a 900-pin shape display at MIT Media Lab. The idea was simple in principle, strange in practice: install an AI agent on the connected machine, give it access to the codebase, and rather than telling it what to do, ask it to discover itself through the physical form. Its first deliberate act was to breathe. The whole grid rising and falling. Hypnotically. Then it reached for its own edges. When asked to say hello it spelled "H-I, C-Y-R-U-S !", defaulting to the most familiar human legible symbols it knows. Inspired by Ted Chiang's Story of Your Life, I wanted a language the agent could create itself. It proposed a vocabulary of its own gestures, built through a learning loop it named BODYLAB. The talk is about encountering another intelligence, and what I learned along the way: the memory architecture, the closed-loop pipeline that generates, scores and stores gestures, the validation gates that keep them legible, and the moments stranger than tool use, where an LLM not developed for motion learns what to do with a body.

3:45pm-4:05pm: LLM Knowledge Bases: a practical guide — Ben Holmes

(session) [Track 3] | Track: Memory & Continual Learning

Putting thoughts to paper (or keyboard, or transcription model) refines your thinking, connects ideas, and pulls context out of your brain for others to learn from. But while taking notes can be fun, organizing those notes is not. Flat lists turn to folders turn to tags and taxonomies that grow unwieldy beyond the first hundred entries. If you can’t find what you wrote down yesterday, or you miss connections to related ideas, you’re missing the value of notetaking: learning from what you notate. Agents dramatically expanded what’s possible here. Combined with Markdown-backed apps like Obsidian to make notes agent-accessible, you can build a second brain that works for you, not the other way around. Andre Karpathy has popularized LLM knowledge bases, and I want to take it further with concrete workflows you can use to organize your thoughts with agents. We’ll explore a number of Obsidian workflows to make this possible: - Automations to organize notes with tags, folders, backlinks, and deduplication to level-up search and discovery - More automations to have agents expand your thinking by auto-recording ideas while you sleep - Building an agentic writing partner to surface related ideas in real time and answer questions as you type (or as you speak) - Voice monologuing and summarization tools to lower the friction of transcibing thoughts into well-formatted notes You’ll walk away with a new appreciation for notetaking, and a second brain that leaves you 10x smarter than your brain alone. Talk format: Code and live tech demos. I will set up all of these automations and tools from scratch, and show agents executing each of them live. I will share the source for all automations as well.

3:45pm-4:05pm: Don't Write Skills, Train Models (cont. 3/3) — Brian Douglas

(session) [Track 4] | Track: Workshops Day 3

Continuation block 3 of 3 for Brian Douglas's workshop session.

3:45pm-4:05pm: Everything Is a Rollout — Alex Shaw, Ryan Marten

(sponsor) [Track 5] | Track: Evals

tba

3:45pm-4:05pm: One Designer + Al. Hundreds of Deliverables. — Vincent Wendy

(session) [Track 6] | Track: Design Engineering

TBD — internal AI Engineer design talk about designing for AIE.

3:45pm-4:05pm: The Universal Remote Control for AI — Alex Hancock

(session) [Track 8] | Track: Context Engineering

Every AI agent today is effectively stranded on the machine it runs on, reachable only through custom wrappers with no industry standard way in. This talk introduces work being done on the Agent Client Protocol to add a universal remote transport: a single /acp endpoint supporting both Streamable HTTP and WebSocket, deliberately aligned with MCP Streamable HTTP so the two protocols share an approach. When you pair ACP's remote transport with MCP's own Streamable HTTP support, something powerful emerges — the agent workload becomes location-independent, free to run on a laptop, a container, or a cloud VM while any client reaches in through open, interoperable standards. No more vendor lock-in on where your agent lives or who can talk to it. Come see how two open protocols, snapped together, become the universal remote control for agent i/o.

3:45pm-4:05pm: The Chief AI Officer: A framework for the emerging Swiss Army Knife of roles — Rania Khalaf

(session) [Leadership 1] | Track: AI Architects: Tokenmaxxing

The Chief AI Officer (CAIO) is currently the C-Suite’s most "multiversal" role. In a single day, you must inhabit different realities: you are a Tinker building scalable experiments in bleeding edge tech, an Architect navigating the hype cycle to execute high-stakes product strategy, and a Coach guiding a workforce and your customers on meaningful AI adoption - minus the fluff. It is a role defined by high-speed context switching and the pressure to deliver "Everything, Everywhere, All at Once." As one of the first Chief AI Officers, and leaning into my experience across Fortune 500, unicorns starups, and PE backed firms, I share a dynamic 20/60/20 Framework for the modern CAIO. We’ll explore how to navigate this multi-tool role by treating the organization as an "Equalizer"—learning when to push the sliders of focus based on your industry’s maturity and where you are in the AI journey.

3:45pm-4:05pm: The state of AI in software development: Insights across 400+ organizations — Justin Reock

(session) [Leadership 2] | Track: AI Architects: Tokenmaxxing

Headlines claim AI is transforming software engineering overnight. Across more than 400 engineering organizations, we see patterns that challenge the hype and reveal what's really working, and what isn't, when AI enters the software development lifecycle.

In this talk, Justin Reock, Deputy CTO at DX, will share a data-driven "state of the union" on AI in engineering, grounded in both quantitative data from thousands of developers and on-the-ground observations.

You'll learn:

The current impact of AI, from benchmarks on the percentage of code authored, team PR throughput, and time savings

Where AI adoption is creating real gains in throughput, and whether it introduces tradeoffs for quality and maintainability

Insights and trends, including whether junior or senior developers are seeing bigger gains, the impact of structured rollouts, which tools are having the most impact, and the evolving definition of "developer"

The session will conclude with a practical framework for measuring AI's impact, helping leaders cut through hype and understand the impact AI is having in their own organizations.

3:45pm-4:05pm: Modular: Taming the AI Hardware Cambrian Explosion — Abdul Dakkak

(session) [Expo Stage 1 NE]

AI teams are hitting the same wall: the workloads they want to run require more hardware than they can reliably access. Buying more GPUs is not always possible, and rewriting kernels for every vendor is not sustainable. Meanwhile, models keep growing, SLAs keep tightening, workloads keep diversifying, and modalities keep multiplying. Modular has two answers: squeeze more performance out of the hardware you already have, and unlock far greater hardware diversity. We'll ground the talk in benchmark data and show how the Modular platform delivers 10x lower latency on image and video models like FLUX2 and 5.5x higher throughput on MoE models like Kimi K2.5, both over the state of the art. This talk explains how Modular is rebuilding the inference stack for performance portability. We'll demonstrate how Mojo kernels, the MAX compiler and runtime, and Modular Cloud work together to optimize GenAI workloads from model graph to hardware execution across NVIDIA, AMD, Apple Silicon, and CPU deployments. Along the way, we'll cover the bottlenecks that dominate production inference: memory movement, batching, KV-cache layout, quantization, scheduling, and kernel specialization. Using examples from LLM serving, we'll reveal which optimizations matter, where abstractions leak, and how to reason about performance portability in real deployments.

3:45pm-4:05pm: Building on the Codex Harness — Dominik Kundel

(session) [Expo Stage 2 NW]

3:45pm-4:05pm: Stop Renting Intelligence: The Train-to-Deploy Loop for Specialized AI — Jetashree Ravi

(session) [Expo Stage 3 SW]

The next wave of AI products will not rely only on prompting generic frontier models. Winning teams will own specialized models shaped by their product data, user feedback, and domain workflows.In this 18-minute session, we'll cover the practical loop behind model ownership: choose a base model, prepare data, fine-tune with SFT/DPO/RL, evaluate outputs, deploy the tuned model, collect feedback, and repeat. We'll also explain why training and inference should be treated as one system, not separate steps.Attendees will leave with a simple framework for when to tune, when RL matters, and how continuous improvement turns fine-tuning from a one-off project into a product advantage.

3:45pm-4:05pm: Ray Actors, Vision Tokens, and the GIL: Engineering an SFT Data Pipeline That Keeps GPUs Busy — Tarun Sunkaraneni

(session) [Expo Stage 4 SE]

Perception agents only learn as fast as we can feed them. Multimodal SFT is deceptively expensive on the data side, and at million-sample scale, naive pipelines leave a fleet of GPUs waiting on Python and data preprocessing.This talk walks through the SFT data pipeline we built to train vision-language models for perception agents. We rebuilt the data path so that image fetching, vision preprocessing, tokenization, and loss-mask generation all happen off the trainer's critical path, and only the artifacts the trainer actually consumes ever cross the boundary into the training loop. We pair this with a blended multi-dataset sampler designed for resumable streaming over very large mixes, and an I/O layer tuned for the realities of fetching multimodal data from object storage.The result: on large-scale VLM SFT runs, the trainer went from spending most of each step blocked on data to spending most of it training, a major improvement in useful GPU time. We'll share the architecture at a conceptual level, the gotchas at million-datapoint scale, and a mental model engineers can take home for the data side of any perception-agent stack.

4:30pm-4:50pm: Closing Keynote — Addy Osmani

(keynote) [Main Stage] | Track: Autoresearch

TBD

4:50pm-5:10pm: Trends in AI — George Cameron, Micah Hill-Smith

(keynote) [Main Stage] | Track: Autoresearch

5:10pm-5:30pm: Closing Keynote — Wei-Lin Chiang

(keynote) [Main Stage] | Track: Autoresearch

Day 4 — Session Day 3

9:00am-9:20am: The 2026 State of AI Engineering — Barr Yaron

(keynote) [Main Stage] | Track: Harness Engineering

results per Barr

9:20am-9:40am: TCP and RDMA are Killing Inference Throughput; Homa can Fix It — John Ousterhout

(keynote) [Main Stage] | Track: Software Factories

Modern AI inferencing is shifting from monolithic requests to complex agentic workflows and disaggregated KV stores. As a result, AI network traffic is no longer just very large transfers; tiny metadata requests are becoming more and more common, and their latency has a critical impact on throughput. Unfortunately, legacy transport protocols such as TCP and RDMA perform poorly on these workloads due to poor congestion control and head-of-line blocking. This talk will discuss the problems with TCP and RDMA and provide a brief introduction to the Homa transport protocol. Homa uses receiver-driven flow control and capitalizes on priority queues in network switches to reduce short-message latency by 10x for workloads like those in AI datacenters.

9:40am-10:00am: The Unreasonable Effectiveness of Separating the Task from the Model — Maxime Rivest, Isaac Miller

(keynote) [Main Stage] | Track: Harness Engineering

By declaring your task’s inputs and outputs without initially considering model capability, you create the space needed to figure out the model execution later. DSPy’s entire promise is that you should evaluate and execute your AI engineering at a level higher than a specific prompt template or a particular provider’s API shape: the Signature. However, models have evolved significantly over the last few years. How can the same input and output specifications still work in a world now filled with tools, RLMs, and Skills? By defining your task strictly through its inputs and outputs, the underlying implementation becomes completely flexible. You can experiment with different models, settings, weights, templating strategies, and output formats, all without touching your actual AI workflow. Consequently, you can leverage components built by others and focus entirely on your core AI task. In this talk we will present how dspy 3.5 makes it easier much easier. DSPy has its roots in prompt optimization, where we build efficient ways to conduct search and learning beneath the signature. In this talk we will give a preview of DSPy 4.0 where we use the fact that models have now passed a tipping point for two critical concepts we have always needed. First, we no longer need to limit the search space to a single instruction block per LLM call; models can now reliably write the code underneath a signature themselves—so they should. Second, traditional prompt optimization has always required a scalar metric, which is notoriously one of the hardest parts to get right. What if a DSPy program could learn directly from your interactions with users? Ultimately, all you care about is that the function you call respects the inputs and outputs of your signature. You can let the models figure out the rest.

10:00am-10:20am: How Anthropic Builds: Lessons from Labs — Mike Krieger

(keynote) [Main Stage] | Track: Harness Engineering

10:20am-10:30am: Why Graphs? — Emil Eifrem

(keynote) [Main Stage] | Track: Graphs

10:45am-11:05am: Tokens Should Have Jobs — Katelyn Lesse, Angela Jiang

(session) [Main Stage] | Track: Harness Engineering

10:45am-11:05am: Training Krea 2 - What matters in generative model training. — Sangwu Lee

(session) [Track 1] | Track: Generative Media

Learn how Krea trained its first image foundation model from scratch. I will discuss

1. Our training and data pipelines

2. What are the most important aspects of improving model performance

3. How we intend to train the next generation of image generation models.

Check out our technical report for details: https://www.krea.ai/blog/krea-2-technical-report

10:45am-11:05am: Designing Multimodal Collaborative Agents for Next-Gen Commerce — Nidhi Kaushik Vyas

(sponsor) [Track 2] | Track: Agentic Commerce

Today's commerce agents wait to be told what to look for. But most users live by a different rule: "I don't know what I want — I'll know it when I see it". If agentic commerce is ever going to cross the chasm, these systems need to stop waiting and start co-shopping. The future of commerce belongs to agentic collaborators that offer a white-glove, personal shopper experience - entirely absorbing the cognitive burden of product discovery, deep research, and validation. Rather than requiring shoppers to input exact search terms or define clear objectives, modern shopping systems will seamlessly guide them from a rough idea to the ideal product. By leveraging multimodal capabilities, these assistants can interpret abstract aesthetic "vibes" to understand user preferences, generate visual references to clarify questions, and enable a highly immersive try-before-you-buy experience to validate products, keeping the user aligned and visually grounded throughout the process. This talk will explore how advanced systems like Gemini work alongside users to clarify their preferences during the discovery process, co-navigate fluidly generated product categories, leverage individual context to filter choices, and produce interactive side-by-side comparisons tailored to the buyer's key priorities. The session will also cover robust auto-rater frameworks and how to design evals for high-agency execution. Attendees building conversational agents, managing complex product data graphs, or creating next-generation multimodal agentic interfaces will gain practical frameworks and insights to deliver highly personalized experiences at scale.

10:45am-11:05am: ALPHALAB: Autonomous Multi-Agent Research Across Optimization Domains with Frontier LLMs — Brendan Rappazzo

(session) [Track 3] | Track: AI in Finance

We built AlphaLab to automate quantitative research at Morgan Stanley’s Machine Learning Research Lab - the experimental grind of architecture search, hyperparameter tuning, and literature review that consumes most of a researcher's time. To show it generalizes, we ran it on three deliberately different domains: CUDA kernel optimization (4.4× mean speedup over torch.compile, 91× peak), LLM pretraining (22% lower validation loss under a 20-minute budget), and traffic forecasting (23–25% RMSE improvement after the system independently found and tuned TFT and iTransformer from the literature). AlphaLab is an agentic harness that takes a dataset and a natural-language objective and runs a full research campaign across three phases: it explores the data and surveys prior work, it constructs and adversarially validates its own evaluation framework, and then it runs experiments at scale on a multi-GPU cluster via a Strategist/Worker loop with a persistent playbook that accumulates domain knowledge across experiments. In Phase 3 - the dispatcher keeps a large cluster fully utilized indefinitely with no human in the loop, and the playbook ends up containing domain-specific methodology that didn't exist anywhere in the prompts at launch. This talk walks through the three phases, what we learned from running campaigns with different models, what we have learned from using this in real systems, and future areas we are exploring.

10:45am-11:05am: State of the Union: Why Local, Why Now — Nader Khalil, Joseph Nelson, Alex Cheema, Ahmad Osman

(session) [Track 4] | Track: Local AI

Local AI has crossed from interesting to useful, driven by stronger open models, better hardware, and a maturing ecosystem for running intelligence outside the cloud. This panel explores what that shift unlocks for sovereignty, defense, regulated industries, privacy, cost, and resilience, and why open-source AI may be central to who benefits from the next wave of intelligence.

Moderator: Nader Khalil (NVIDIA). Panelists: Joseph Nelson (Roboflow), Alex Cheema (Exo Labs), Ahmad Osman (r/LocalLLaMA).

10:45am-11:05am: CrabRAG: Why Automated Assistants Need Graph Memory, Not More Tokens — Stephen Chin

(sponsor) [Track 5] | Track: Graphs

Autonomous assistants are easy to demo and hard to make reliable. The problem is usually not tool access. It is memory. Most assistant architectures still treat memory as a chat log plus vector retrieval. That is fine for document question answering, but it breaks down when the assistant must connect conversations, people, tools, and decisions across multiple tool iterations. For an AI engineer, a single request can depend on a Slack thread, a GitHub PR, a failed CI run, a calendar event, and prior operating preferences or constraints. These are not isolated pieces of context. They form a connected state that changes as work progresses and context grows. In this talk, I’ll show why knowledge graphs, context graphs, and GraphRAG provide a better foundation for OpenClaw-style assistants. Knowledge graphs capture durable entities and relationships. Context graphs capture the operational layer assistants usually lose, including actions, decision traces, provenance, and recency. GraphRAG turns that structure into task-time context by combining graph traversal, semantic retrieval, and tool use. Attendees will leave with practical patterns for schema design, retrieval routing, and evaluation, plus a concrete blueprint for assistants that remember more than the last prompt and retrieve more than the nearest chunk.

10:45am-11:05am: GTM Engineering: The Technical Bits — Everett Berry

(session) [Track 6] | Track: AI in GTM

Everyone talks about "GTM engineering" — Everett Berry shows the actual plumbing. As Head of GTM Engineering at Clay, he goes under the hood on the technical bits most talks skip: enrichment pipelines, agent-driven data classification, identity resolution, and the systems that turn unstructured web data into clean, deterministic CRM fields. A builder's-eye view of what GTM engineering actually is once you strip away the buzzwords.

10:45am-11:05am: From Ambient Documentation to Clinical Intelligence — Chaitanya Asawa

(session) [Track 7] | Track: AI in Healthcare

A practical session on how healthcare AI moves beyond ambient note generation into context-aware clinical decision support. The talk would cover grounding outputs in the patient encounter, surfacing evidence with citations inside clinician workflows, preserving clinician agency, and building rigorous evals for safety and trust in live healthcare environments.

10:45am-11:05am: DeepSWE: expert code datasets — Serena Ge

(session) [Track 8] | Track: Agentic Engineering

DeepSWE and the data/eval layer behind coding agents; why curated expert code datasets matter for reliable agent performance.

10:45am-11:05am: Operating Distributed Inference Systems at Scale — Nishant Gupta, Naman Ahuja

(session) [Track 9] | Track: Inference

Inference has rapidly become one of the most important infrastructure problems in modern computing. As AI systems evolve into autonomous agents with persistent memory, tool usage, and multi-step reasoning, traditional inference architectures struggle under growing demands for latency, throughput, cost efficiency, and reliability. In this talk, I’ll share lessons from building large-scale elastic compute and AI infrastructure systems powering production workloads. We’ll explore the modern inference stack and the architectural patterns emerging to support next-generation agentic AI systems. Topics include distributed inference architectures for large-scale AI systems, GPU scheduling and elastic compute for inference workloads, multi-tenant inference infrastructure, caching, batching, latency optimization strategies, reliability and fault isolation for inference systems, observability and control loops for AI serving platforms, balancing cost, throughput, and user experience, and why inference is becoming an infrastructure orchestration problem. Attendees will gain practical insights into designing scalable, resilient, and cost-efficient inference platforms for modern AI workloads.

10:45am-11:05am: Diagnosing agent failures in production — Pamela Fox

(sponsor) [Track M] | Track: Track M

Agent behavior changes in production. Learn common failure modes and how to debug, test, and improve performance using real evaluation techniques.

10:45am-11:05am: Building safe payment infrastructure for machine-to-machine commerce — Jennifer Lee

(session) [Leadership 1] | Track: Agentic Commerce

Agents are a new class of buyer, but the infrastructure for them to transact headlessly barely exists yet. This talk walks through what it actually takes to make a machine payment work: how an agent discovers what services exist, how HTTP 402 lets a server return a payment challenge the agent can settle without a human in the loop, and how the seller gets a receipt they can trust. Whether you are building an agent framework or adding machine payments to an API or MCP server, you will leave with concrete patterns for the headless commerce stack.

10:45am-11:05am: The Agent Behind the Curtain: Building the Oz Cloud Agent Platform — Safia Abdalla

(session) [Leadership 2] | Track: AI Architects: AI Factories

At Warp, we’re building Oz to be the platform that enables people to be creative and build with cloud agents. That sounds simple, but only because the job of good developer tooling is to take on complexity before it reaches the user. The best tools fit into the way developers already think, then make accessible work that used to feel out of reach.

This talk is about the engineering philosophy behind that work: how Warp’s evolution from terminal to local agent to Oz shaped the way we think about building for developers. The focus is not on inventing brand-new abstractions for their own sake, but on making a messy stack of real engineering concerns feel coherent: where agents run, how they delegate, how teams control their environments, how humans can see what happened, and how the platform leaves room for people to build things they couldn’t even imagine before.

4:04 PM

10:45am-11:05am: AI Engineering & Governance 2026 Trends — Wallon Walusayi

(session) [Expo Stage 1 NE] | Track: Expo Stage 1

AI Engineering & Governance 2026 Trends

10:45am-11:05am: Your Agent Can't Tell If It's Right — Willem Pienaar

(session) [Expo Stage 2 NW]

Coding agents feel reliable because of one signal you never think about: the tests. They catch confident mistakes in seconds, so you never see most of them. The real world has no test suite. Put an agent in production and that signal is gone, and a wrong answer looks the same as a right one. So how do you know it's right? We watched our agent look at an 80% drop in throughput and report zero user impact, because a similar alert the month before had been noise. The data to catch it was already in front of it. There is no single verifier, but there are several weaker signals. While the agent reasons: grounding each claim against live data, and looking for evidence that distinguishes competing hypotheses. Before it acts: calibrated confidence, and a separate critic. After it acts: whether the fix held, whether the alert returned, whether an engineer redid the work. None is conclusive on its own. Combined, they estimate whether the agent was right. The talk covers where these signals come from, how we combine them, and how often they still disagree.

10:45am-11:05am: No, That's Not a Software Factory — Ryan Cooke

(session) [Expo Stage 3 SW] | Track: Expo Stage 3

Drop an agent in a sandbox, point it at your repo, watch it ship code. Whether you're buying from a vendor or building it yourself, everyone is following the same playbook. But a sandbox isn't a software factory. At WorkOS, we built Project Horizon, and it taught us that infrastructure is only the first challenge. The unlock is encoding how your org actually builds software: the way work gets planned, scoped, and verified, the conventions and judgment calls that define your engineering culture. Our product engineering process served as the blueprint for every agent workflow we built in Horizon.

10:45am-11:05am: Vector Isn't Enough: Hybrid Search & Retrieval for AI Engineers

(session) [Expo Stage 4 SE]

11:00am-12:00pm: The Agentic Product Development Organization — Martin Harrysson, Matt Linderman, Prakhar Dixit

(session) [Leadership Lounge] | Track: CTO Circle

Facilitated, peer-to-peer, under the Chatham House Rule — not recorded.

As AI agents become embedded in day-to-day work, organizations will need to rethink product development teams, roles, and skills. This foundational shift reshapes management layers and requires overcoming challenges across talent attraction, development, and retention.

11:10am-11:30am: MCPs, CLIs, and Skills: Choosing the Right Tooling Layer for Agentic Development — Nikita Kothari

(session) [Main Stage] | Track: Agentic Engineering

Agentic development needs more than one interface: MCPs provide clean, portable connectors to services, with built-in patterns for security and auth. CLIs offer composability, debuggability, and workflows developers already trust. Skills teach agents how to use a wide variety of tools and MCPs effectively without overloading context.

11:10am-11:30am: HTML Is All Agents Need — James Russo

(session) [Track 1] | Track: Generative Media

LLMs are great at writing code. So the question we kept asking was: can they write code that produces a video? We thought it would be easy. The reality was a year of trying. We started with massive prompts to get very mediocre output. We made it more agentic to iterate and improve its output. This worked okay but wasn't production-ready. Eventually we tried Remotion. It got us deterministic video, but the React framework kept boxing the agent in. The more guardrails we added, the safer and more boring the outputs got. When we utilized plain HTML, CSS, and JavaScript, the creativity came back to the output. So we set out to build a video rendering framework on top of HTML. But it needed to work with Gemini Flash. Why? Because one tell that a framework is fighting an agent is needing the biggest model just to get usable output. So from there we shaped the framework around what small models could reliably author. That left one real engineering question: can we keep the freedom of HTML and still render a deterministic MP4? Browsers don't want to do that. Image decoders, font loaders, and animation clocks all run async on their own schedule. Great for performance. Terrible for "render the same pixels every time." Throughout, we iterated constantly with agentic loops and self-improving evals to test out the framework, find issues in our renderer, and shape a set of skills that gave the agents Taste instead of guardrails. This talk is what it took to get there.

11:10am-11:30am: Why Your AI Agent Needs a Wallet: Agentic commerce on Arc with USDC and Nanopayments — Harshal Bhangale

(sponsor) [Track 2] | Track: Agentic Commerce

AI agents can reason, plan, call tools, and write code. But the moment one needs paid data, an API call, or another agent's service, it hits a human wall: accounts, API keys, credit cards, checkout flows. It stalls and asks you to step in. It can't pay. We'll run the same real task through two agents, one without a wallet and one with. The first stalls. The second, handed a Circle agent wallet through the Circle CLI, discovers services, pays per request over x402 in USDC, and finishes on its own, inside spending limits you set. The next leap in agents isn't only better models or more tools. It's economic agency: holding programmable money and transacting at machine speed. We'll show how it works on Arc, where USDC is the gas, finality is sub-second, and gasless nanopayments settle in batches through Circle Gateway, so paying a fraction of a cent per request is actually practical.

11:10am-11:30am: Why Off-the-Shelf AI Doesn't Understand Money — Udi Menkes

(session) [Track 3] | Track: AI in Finance

Ask any LLM a financial question about your business. You'll get a fluent, confident, generic answer — one that doesn't truly know your business, or what happened when businesses like yours made that same decision. We build financial AI at Intuit serving 100M+ customers. Our custom LLMs outperform general-purpose models on accuracy while cutting latency in half. But that's the foundation, not the destination. I'll cover where financial intelligence goes when AI stops reporting what happened and starts helping you decide what to do next (and does it for you).

11:10am-11:30am: State of the Union: Why Local, Why Now — Nader Khalil, Joseph Nelson, Alex Cheema, Ahmad Osman

(session) [Track 4] | Track: Local AI

Local AI has crossed from interesting to useful, driven by stronger open models, better hardware, and a maturing ecosystem for running intelligence outside the cloud. This panel explores what that shift unlocks for sovereignty, defense, regulated industries, privacy, cost, and resilience, and why open-source AI may be central to who benefits from the next wave of intelligence.

Moderator: Nader Khalil (NVIDIA). Panelists: Joseph Nelson (Roboflow), Alex Cheema (Exo Labs), Ahmad Osman (r/LocalLLaMA).

11:10am-11:30am: Active Graph Agent Runtime (BabyAGI 4) — Yohei Nakajima

(sponsor) [Track 5] | Track: Graphs

Proposing a novel event-sourced graph runtime for building long-running auditable, agentic systems. Built on top of and combining various BabyAGI iterations and graph experiments (memory, code, logs) into a single primitive.

11:10am-11:30am: Reverse-Engineering the AI Buyer — Aliisa Rosenthal

(session) [Track 6] | Track: AI in GTM

You Built the Best AI Product in the Room. Now What? The GTM Lessons Builders Skip. Aliisa decodes the commercial mistakes technical teams make most often: why enterprise procurement isn't like consumer adoption, how to design for trust and change management from day one, the difference between a pilot and a deal, and the signals that tell you a product is ready to scale vs. ready to get stuck. She's packed with war stories and counterintuitive lessons from the trenches of OpenAI.

11:10am-11:30am: Guardrails First: Engineering Member-Facing Health AI — Rashi Agrawal

(session) [Track 7] | Track: AI in Healthcare

Everywhere else in the company, an AI pilot can reach production in weeks. For our member-facing clinical assistant, it can't, and that single constraint redesigned our entire architecture. This is a field report on building conversational AI in a regulated digital health setting, where "move fast and break things" isn't a culture choice. It's a liability. We'll get concrete about what changes when every output has to be clinically safe, auditable, and compliant: PHI is protected by architecture, not policy. Production and non-production are hard-isolated, dashboards are sanitized, and engineers outside the US never touch protected health information. Must-not-fail behavior never lives in a prompt. Emergency escalation and intent routing run as deterministic rules at the top of every conversation turn, before the model is consulted. If you can't afford to get something wrong, you don't leave it to a probabilistic system. Clinical safety is a continuous eval layer. ~30 LLM-as-judge evaluators score clinical accuracy, clinical safety, escalation routing, and recommendation relevance, continuously, not once. Every output is auditable. Each turn, tool call, and reasoning step is traced so outputs can be reviewed and meet regulated reporting obligations. The throughline: in regulated healthcare, compliance constraints aren't a tax you pay around the architecture. They become the architecture. We'll talk about why guardrails-first is the only way to ship member-facing health AI, and why "painfully slow" is sometimes exactly right. (This is non-diagnostic, member-facing AI. The talk is about engineering discipline under regulation, not medical claims.) Key takeaways - In regulated health AI, "move fast" is the wrong default. Design for deliberate, careful launches. - Must-not-fail behaviors belong in deterministic rules at the top of every turn, never in the prompt. - Protect PHI through architecture: isolate prod from non-prod, sanitize dashboards, restrict access by role and geography. - Make every output auditable. Trace each turn, tool call, and reasoning step so safety is reviewable, not assumed. - Treat clinical safety as a continuous LLM-as-judge layer, not a one-time gate.

11:10am-11:30am: Anthropic's CCA Exam as a Field-Guide for Agentic Engineering — Frank Coyle

(session) [Track 8] | Track: Agentic Engineering

Anthropic's CCA Exam: A Field-Guide for Agentic Engineering The Claude Certified Architect (CCA) exam distills what Anthropic has learned from working with the AI companies shipping agents to production — the patterns that work, the anti-patterns that quietly burn tokens and trust, and the architectural decisions that separate demos from systems you'd stake a quarter on. This talk treats the exam as a field guide for agentic engineering, whether or not you ever sit for it. We'll walk through the five competency domains the exam tests — Agentic Architecture, Tool Design and MCP Integration, Claude Code, Prompt Engineering, and Context Management — with particular emphasis on multi-agent orchestration, subagent delegation, tool schema design, and lifecycle hooks. We'll then work through the six real-world scenarios the exam uses to probe judgment, each organized around an anti-pattern: the seductive-but-wrong move that looks reasonable until it costs you a production incident. Attendees leave with a working mental model of the agentic surface area and a checklist of the failure modes that matter most when moving from prototype to production. Who should attend: engineers and architects building agentic systems with Claude or other frontier models, technical leads evaluating agent designs, and developers considering the CCA credential.

11:10am-11:30am: Routing LLM Inference in Production: From Engine Signals to Policy — Qianru Lao, Lu Zhang

(session) [Track 9] | Track: Inference

Production LLM apps need more than a fast model: they need an inference routing layer that can choose where each request should run as engines, capacity, latency, and geography cost change. This talk shares a generalized Inference Load Balancer (ILB) proxy/controller architecture. A low-latency proxy applies routing weights and request-path signals, while a controller computes source-cluster-to-engine weights from demand, capacity/performance profiles, replica state, and geography cost. We will cover the practical debugging patterns AI engineers need: reading engine signals, explaining why a request went to one backend instead of another, handling retries and load shedding, and keeping routing behavior observable without exposing OpenAI-specific internals or non-public metrics.

11:10am-11:30am: Tracing and debugging agents across systems with OpenTelemetry — Chang Liu

(sponsor) [Track M] | Track: Track M

Understand what your agents are doing. Learn how to trace workflows across systems, debug issues, and uncover optimization opportunities using OpenTelemetry.

11:10am-11:30am: Tribal Dungeons of Global Shipping: AI Agents at Global Scale — Dmitry Buykin

(session) [Leadership 1] | Track: AI-Native Enterprises

Most “AI agents in production” talks skip the part where you have to turn distributed operational knowledge into something an agent can execute safely. This is that part: a practitioner report from a global logistics case-processing project at Maersk, focused on SOPs-as-code, evaluation UX, guardrails, replay-based testing, and SME refinement loops.

The talk covers why versioned, country-aware SOPs beat prompt engineering at scale; how SME corrections become safe workflow changes; why classifier routing and SOP execution must stay separate; where agents under-deliver against demos; and why most of the engineering effort goes into evaluation, replay, and guardrails rather than model prompting.

11:10am-11:30am: FinOps for AI Agents: Who Spent All the Tokens? — Tisha Chawla, Susheem Koul

(session) [Leadership 2] | Track: AI Architects: AI Factories

When an autonomous agent finishes a task successfully but costs ten times more than it did the previous day, traditional application monitoring fails. A recursive tool loop that retries silently, an oversized context window that quietly expands, or an unflagged model upgrade can burn through an entire budget long before a human notices. The execution appears successful on functional dashboards, meaning the only clear signal of failure is the cloud invoice at the end of the month. As AI systems move into production, tokens have become a primary operational resource alongside CPU, memory, and storage, yet few teams manage them with equivalent systems rigor. Most architectures lack the granular visibility required to attribute token spend to specific users, agents, or workflows, and they lack mechanisms to terminate a runaway loop before it triggers a financial incident. This session treats token consumption as a first class systems problem, demonstrating how to make it observable, attributable, and enforceable across complex agent workflows. The presentation covers practical engineering patterns for instrumenting token usage at every model call and tool invocation, attributing costs down to specific users or business operations, surfacing expensive execution paths, and enforcing runtime budgets, quotas, and circuit breakers to halt runaway behavior in real time. Attendees will leave with a practical framework for governing agent spend deliberately, transforming tokens into a managed operational resource rather than a surprise line item on the cloud bill.

11:10am-11:30am: Beyond RAG: See a relational context engine reduce token burn — Brandon Waselnuk

(session) [Expo Stage 1 NE]

In this expo talk we'll give you a free context engine simulator, open source tools, and demo how a context engine works. See how modern engineering workflows with agentic loops and goals produce better quality code and reduce token burn. RAG, while useful, leaves context gaps for humans and agents. A context engine fills those gaps by including real-time, relational, personalized, and permission aware techniques to get high-signal context to humans and agents at runtime.

11:10am-11:30am: ARIA, how we built autoresearch with autoresearch — Zubin Aysola

(session) [Expo Stage 2 NW]

ARIA is an end-to-end auto research and AI research product that improves models, launches training jobs, and agents alike. We used ARIA along with a sophisticated evaluation framework we're calling the WBAF, Weights and Biases Agent Factory, to build itself. ARIA reads its own production traces, improves its own prompts, tools, skills, and other effects to solve customer challenges. In this talk, we dive into the evaluation framework, how we built a sophisticated reinforcement learning style environment over the Weights & Biases product, and how we scaled from zero to one to a full team working in parallel on improving an agent.

11:10am-11:30am: The Lethal Trifecta Is Already on Your Developers' Laptops — Michael Patterson

(session) [Expo Stage 3 SW]

The lethal trifecta: an AI agent with access to private data, exposure to untrusted content, and the ability to communicate externally. Combine all three and an attacker can trick your agent into exfiltrating anything it can see and there is no prompt-level fix.. Most enterprises have already deployed this pattern at scale: Claude Code, Cursor, and Copilot on developer laptops with local credentials, MCPs reaching into internal systems, and open egress. I'll speak to my own personal agent stack as a textbook example, then trace the same shape across enterprise deployments I see at Coder. The back half is four architectural moves that defuse it: governed compute, centralized credentials, default-deny egress, identity-bound audit. Walk out with a mental model and a checklist you can run against your own deployment the next morning.

11:10am-11:30am: Your AI Agent Has No Nervous System — Matt Gibiec

(session) [Expo Stage 4 SE]

Most agents ship with solid evals and zero runtime observability. When something breaks in production — wrong answer, runaway retry loop, or silent tool failure — you're debugging blind. You can see the output, but you can't see what the agent believed when it made the decision. This talk walks through how to instrument agentic pipelines with OpenTelemetry: capturing system context at every step, making prompt state and tool call outcomes visible as structured data, and governing token consumption as SLOs instead of discovering overruns on an invoice. Attendees will leave with three takeaways: an understanding of telemetry for multi-step agentic workflows, a pattern for capturing system context at the span level so teams know exactly what the agent saw before it acted, and a framework for visibility into token budget and behavioral drift before something goes sideways in production. Telemetry is the nervous system. System context is the memory. Token budgets are the vital signs. None of it is optional.

11:40am-12:00pm: Auth for Agents: Unblock Autonomous AI with auth.md — Michael Grinich

(session) [Main Stage] | Track: Agentic Engineering

AI agents are ready to act on users' behalf, but legacy auth flows were built for humans, not agents. This session introduces auth.md, an open protocol that lets agents register and authenticate users without sign-up forms, and shares what early implementers have learned since launch. Learn about the new protocol that Cloudflare, Firecrawl, Cogny, and monday.com are adopting to power agent registration — authenticating agents without sign-up forms.

11:40am-12:00pm: Building an Agentic Video Editor for Mass Consumer — Ekaterina Deyneka

(session) [Track 1] | Track: Generative Media

Most agentic systems today are built for developers — people comfortable setting up environment, configs, and debugging agent loops. But what happens when your user has never heard the word "agent" and just wants a video ready to post? Reelful is an agentic video editor that lives right in the user's phone. It turns raw photos and videos from your camera roll into polished, short videos. No setup. No sophisticated prompting. No empty timeline. Under the hood, the agent orchestrates multiple models and composes a video together. In this talk, I'll walk through: The agentic pipeline architecture: how we chain models across modalities (vision → language → speech → video), handle context passing between steps, and manage state across a multi-minute generation job The UX inversion: how we designed the agent to require minimal effort from user — the system infers intent from the media itself, making complex orchestration invisible This talk is for anyone building agents that need to work for non-technical users, or anyone curious about multimodal agentic pipelines beyond text and code.

11:40am-12:00pm: When AI Agents Pay and Sellers Monetize: Building x402 Apps for Agentic Commerce on AWS — Anil Nadiminti

(sponsor) [Track 2] | Track: Agentic Commerce

As Agentic AI moves from chat to execution, autonomous agents need a native way to discover, access, and pay for digital services in real time. This session explores how x402 can turn HTTP into a payment-aware interface for machine-to-machine commerce, unlocking crypto-native patterns like programmable access, pay-per-use APIs, and on-demand monetization for data, tools, and services. We’ll show how to build x402-enabled applications and walk through the architecture, the full agentic payments flow, seller monetization strategies, payment verification, and design tradeoffs involved in making agent-driven transactions secure, scalable, and production-ready. Attendees will leave with practical patterns for building apps where AI agents do not just call APIs — they can discover services, evaluate costs, transact autonomously, and enable new revenue models for sellers.

11:40am-12:00pm: Let's integrate AI Agents in Event-Sourced Systems — Divakar Kumar

(session) [Track 3] | Track: AI in Finance

Fraud detection has always been a race against time. In traditional event-sourced systems, every transaction, login, or transfer is captured as a sequence of immutable events. These events tell a clear story — but only after the fact. What if events could do more than just record history? What if they could talk back? In this talk, we’ll explore how agentic event-driven systems transform fraud detection. Imagine every PaymentInitiated, LoginAttempt, or DeviceChanged event not just being logged, but immediately consumed by an autonomous Fraud Detection Agent. This agent correlates events across accounts, reasons over historical event streams, and generates new events like SuspiciousActivityFlagged or TransactionHeldForReview. Through a real-world inspired use case in banking and digital payments, we’ll show: - How event sourcing provides the perfect memory layer for fraud detection agents - Patterns for agents to safely inject new domain events without violating invariants - How to avoid runaway feedback loops when multiple agents interact (e.g., fraud + compliance + customer service agents) - Governance, auditing, and explainability challenges when autonomous agents take part in mission-critical workflows By the end of this session, you’ll see how event-driven DDD systems evolve when agents stop being passive consumers and start actively shaping the event stream — turning fraud detection from a reactive process into a proactive, adaptive defense.

11:40am-12:00pm: Demo: GLM 5.2 on DGX Station — Frontier Intelligence Under Your Desk — Ahmad Osman

(session) [Track 4] | Track: Local AI

Ahmad Osman shows off the power of local AI on stage, running frontier open models on a DGX Station.

11:40am-12:00pm: Your Moat Is Your Data Model — Mike Phipps

(sponsor) [Track 5] | Track: Graphs

Every enterprise AI team faces the same strategic question: where in the stack should a small team focus its effort? Models, frontends, and agent frameworks evolve rapidly and are increasingly commoditized. But regardless of how these layers mature, AI in enterprise settings remains bottlenecked by the same underlying problem: structured data is siloed across systems of record with domain-specific schemas, and the unstructured data needed to contextualize it sits in entirely separate systems, with its own systematic complexities. The durable work is cleaning, curating, and semantically modeling this data in an AI-first manner so that any client — chat, workflow, or otherwise — can query across it. That's the moat. At the Gates Foundation, my team built and deployed our foundation-wide knowledge graph on Neo4j that unifies structured and unstructured data behind a single MCP server. The graph itself is modeled for agentic consumption: natural hierarchies are projected as traversable paths rather than flattened tables, and unstructured documents are semantically chunked, tagged, and mapped to structured entities at ingestion time using AI-driven ETL. The result is a semantic layer where an agent can express a complex cross-system question as a concise graph query and receive an accurate answer. This talk is an architectural walkthrough covering the end-to-end pipeline: AI-based extraction and semantic chunking of unstructured documents, the agent-first data modeling decisions, design considerations for our MCP server, and how we handle graph-based retrieval evals. We'll walk through real query sessions showing Claude interacting with the graph through both chat and workflow integrations. The intended takeaway is a practical framework for where a small enterprise team's investment compounds — and why that investment is the data model, not the layers above it.

11:40am-12:00pm: AI in GTM at Notion — Flora Liu

(session) [Track 6] | Track: AI in GTM

Notion's go-to-market runs on a system, not a roster of heroes. Flora Liu walks through the building blocks of human–AI collaboration behind Notion's GTM: the design principles that decide what AI owns and what stays human, the failures that taught them where that line belongs, and why the wins that matter most — faster delivery, real adoption — never show up on a revenue chart. An honest look at what actually works, from the team building it.

11:40am-12:00pm: Shipping AI to a Million Patients Without an A/B Test — Jared Joselowitz

(session) [Track 7] | Track: AI in Healthcare

You can't A/B test on patients. You can't unsend a phone call. The model card won't save you at the post-incident review. Most AI eng playbooks assume the opposite. Ship to 5%, watch the dashboard, roll back if it goes wrong. None of it survives regulated deployment, which is now coming for fintech, legal, and government too. So the engineering has to move: into hazard analysis, simulated populations, asymmetric evaluation, and audit trails treated as the deliverable. The trail is the product. I'll show you what changes when rollback isn't an option. How Ufonia ships Dora, an AI voice agent now making clinical follow-up calls on the NHS and across US health systems, using a hazard-driven simulation rig (MATRIX) and a prompt-optimisation flywheel that surface failures and conform the same base system to each clinical niche, all of it pinned to an audit trail. And the cheap version of all this, for any team whose users can't be the test population.

11:40am-12:00pm: Guide, Verify, Solve: The Engineering Discipline Agentic Development Demands — Anirban Chatterjee

(session) [Track 8] | Track: Agentic Engineering

Agentic development is not a productivity story: it's a reliability engineering problem at a scale most teams have never faced. Long-running agent tasks fail at alarming rates, pull requests have grown from 50 lines to 5,000, and cognitive surrender is real—the more capable AI output appears, the less humans interrogate it, right at the moment the stakes are highest. Independent, peer-reviewed research from Carnegie Mellon studying 807 open source projects found that AI agent adoption caused a persistent 30% increase in code analysis warnings and a 41% increase in complexity — with long-term development velocity declining as a result. Agents don't just write code faster, they accumulate debt faster, too. The answer is not to slow agents down, it's to govern and refine the loop they operate inside. Sonar's Agent Centric Development Cycle (AC/DC), defines that loop across three continuous stages: guide agents with project-specific context and constraints before a single line is written; verify rigorously and continuously inside the loop, not downstream in CI; and solve issues automatically before they ever reach a manual review. The deeper insight is that this is not primarily a security story. It's an efficiency story. Codebases riddled with complexity make agents slower, less reliable, and significantly more expensive to run. Every token spent navigating legacy debt is a tax on every future agent run. Well-maintained, low-complexity codebases mean fewer failures, fewer tokens, and faster iteration. The teams that instrument this loop now will do more than ship safely: they'll compound their advantage every time an agent touches their codebase. Verification isn't a cost center. In an agentic world, it's a competitive moat.

11:40am-12:00pm: Are LLM Performance Benchmarks Reliable? — Ashok Chandrasekar, Jason Kramberger

(session) [Track 9] | Track: Inference

Standardizing performance benchmarks for production-grade Large Language Models is currently a significant challenge across the industry. Conflicting data is prevalent, whether originating from server developers like vLLM and SGLang or from various analysts and competitive benchmarks, and these results often fail to hold up under real-world conditions. Our research into these inconsistencies identified several critical factors, including the constraints of single-process tools, specifically the Python Global Interpreter Lock (GIL) and the nuances of model-level settings like temperature. Furthermore, a lack of transparency regarding load generation parameters such as QPS and concurrency, paired with insufficient observability into the benchmarking clients themselves, contributes to these disparate outcomes. In this talk, we share key lessons learned from our benchmarking efforts, examining the primary pitfalls that distort performance data and offering strategies for mitigation. Additionally, we will introduce Inference Perf, an open-source, multi-process utility we developed to provide reliable stress-testing for production stacks. Our goal is to promote standardized, real-world benchmarking practices that allow the community to move beyond unreliable data. Join us to discover how to accurately measure, optimize, and report LLM performance with certainty.

11:40am-12:00pm: Benchmarking VS Code with VSC-Bench: How to measure agent performance — Ross Wollman

(sponsor) [Track M] | Track: Track M

"Agent quality in VS Code depends on a stack of variables: model, version, prompts, extensions, MCP servers, and more. Each one affects quality, tokens, and latency—and they interact in ways that are hard to reason about.

In this session, we’ll show how to benchmark different configurations using VSC-Bench so you can compare results side by side and understand what actually works. Instead of guessing which setup is better, you’ll learn how to measure tradeoffs and make data-driven decisions."

11:40am-12:00pm: All the Things We Have to Do to Satisfy Your Insatiable Need for Tokens — Daniel Kim, Michelle Nguyen

(session) [Leadership 1] | Track: Inference

Every time the industry figures out how to serve tokens faster and cheaper, the appetite grows to match. Models get bigger, contexts get longer, agents start chaining thousands of calls together. The finish line keeps moving. This talk is a technical tour through everything the industry has done to keep up, led by two experts in high-performance inference. We'll start with the optimizations that made hardware work harder without changing the underlying architecture. Then we'll go up a level with techniques that work smarter across requests and across the model itself. And finally, a peek into the future with heterogeneous disaggregated inference, the architectural shift that splits prefill and decode across specialized hardware, and even more advanced forms of hardware specialization coming your way soon. Token demand is about to get a lot more insatiable. Let's see what the future has in store for us!

11:40am-12:00pm: What If Your Chip Design Team Moved Like a Single Body? — Khaled Alashmouny, Abduallah Mohamed

(session) [Leadership 2] | Track: AI Architects: AI Factories

Most agentic demos you've seen has a hidden assumption: one user, one session, one task. But what happens when the agent needs to coordinate with 30 other agents, across 10 disciplines, on a project that takes 12 months — where a single miscommunication costs $10-50M? Chip design is that problem. Only 14% of chips succeed on first silicon. The bottleneck isn't individual engineer speed — it's silent divergence between disciplines working from specs that drift without noticing. We built a multiplayer AI on the Anthropic Agent SDK, connected through three alignment layers: a living spec graph (System of Intent) that propagates changes and detects conflicts in real time, a tribal knowledge layer (Memory) that compounds methodology across projects, and milestone-aware execution that drives EDA tools with full design context. Each agent operates within strict spec-hierarchy boundaries enforced at the API level. Cross-agent invocations use structured tool calls with typed parameters, logged for full auditability. We talked with 15 practitioners across 8 major semiconductor and EDA companies. The universal finding: teams need alignment infrastructure, not faster copilots. We'll also share what broke — because coordination tax applies to AI agents too, and the failure modes are surprisingly instructive. This talk covers the multi-agent architecture, evaluation methodology, and lessons from deploying agentic AI in one of engineering's most complex coordination domains.

11:40am-12:00pm: The Art of Building Verifiers for Computer Use Agents — Miguel González Fernández, Corby Rosset

(session) [Expo Stage 1 NE]

Every team building browser agents has the same problem: you can't trust your own evals. Browser tasks are too open-ended for deterministic checks, so teams use LLM verifiers as judges, and the judges are wrong constantly. WebVoyager misses 45% of failures. WebJudge misses 22%. Used as RL reward, you're not training a better agent, you're training a more confident liar. This talk walks through the Universal Verifier, open-sourced with Microsoft Research: false positive rate near zero, Cohen's κ matching human-human agreement. Four design principles, one open benchmark, and an honest account of where auto-research worked and where it plateaued.

11:40am-12:00pm: Seeing the Plumbing: Profiling vLLM Speculative Decoding on NVIDIA Blackwell — Sheilah Kirui

(session) [Expo Stage 2 NW]

Speculative decoding promises dramatic LLM speedups by using a tiny draft model to guess tokens ahead of a large target model. However, dual-model serving fundamentally rewrites your memory dynamics and introduces a rigid engineering trade-off: guess right, and you bypass the memory-bandwidth bottleneck; guess wrong, and you waste compute.

This session is a live-demo routing identical workloads through baseline and speculative configurations in vLLM on a single NVIDIA RTX 6000 Blackwell GPU. Splitting the screen between a Streamlit app and a live Grafana dashboard, we will profile the inference engine across three vectors:

Time per Output Token (TPOT): The real-time, user-facing latency delta.

KV Cache & Memory Footprint: The exact VRAM tax of tracking parallel token states within a 96GB budget.

Draft Acceptance Rate: Visualizing the tipping point where dropping acceptance rates cause speculative decoding to fall below baseline efficiency.

Supporting Materials

Project Repository: https://github.com/akamai-developers/speculative-decoding-example-vllm-blackwell# (Work In Progress / Active Development)

11:40am-12:00pm: Voice is the universal interface — Kwindla Kramer, Neil Zeghidour

(session) [Expo Stage 3 SW]

Language models give us the ability to create natural language, conversational, interfaces for computers. We are seeing a rapid shift among early adopters to using general language instead of traditional user interfaces for tasks like writing code and editing spreadsheets. Join the cofounders of Pipecat, Gradium, and Daily as we discuss the future of realtime voice and AI interfaces. Voice is the most efficient input mode for natural-language systems, and often the most efficient output mode, as well. But good voice interfaces require a very high degree of conversational facility, intelligence, task-specific reliability, and robustness to real-world realities like multiple speakers and background noise. There's a long history of voice interfaces in science fiction: Star Trek, Iron Man, Her. We'll use these depictions of computing possibilities as a jumping off point for talking about the ideal voice interface. How close are we to being able to build these interfaces with today's models, hardware, orchestration tooling, and UI libraries? What are the most promising research directions? What did the movies get wrong, now that we actually have experience building natural language, open-ended, voice systems?

12:05pm-12:25pm: Harness Engineering: Building the Production Cage for Powerful Domain Agents — Mike Chambers

(session) [Main Stage] | Track: Harness Engineering

Every agent is a while loop. The model takes strings in and produces strings out. We've all written it, debugged it, shipped it. And yet every team building agents is still re-inventing the same session management, truncation logic, tool wiring, and memory plumbing from scratch. The hard part is the harness: session isolation, context management, memory persistence, sandboxed execution, observability. The machinery that makes a model dependable in production. Most of the failures we see in deployed agents (context rot, premature completion, tool bloat) trace back to harness problems, not model problems. This talk covers what a harness actually does, why "harness engineering" suddenly showed up in engineering posts from everyone, and what changes when you stop building harnesses by hand. In live demos, we'll build the same agent three ways: hand-rolled Python, framework-generated, and fully managed through a single API call. Each level shifts the failure modes from infrastructure plumbing to engineering judgment, where the real questions are what context to preserve, when to verify, and how to keep an agent from finishing half the job and calling it done. The harness handles the machinery. You still have to engineer the behavior.

12:05pm-12:25pm: The Next Game Engine Won't Have a Manual — Arturo Nunez

(session) [Track 1] | Track: Generative Media

Game development is still incredibly hard to get right. It requires great engineering, artistic vision, and the ability to make something genuinely entertaining, all at once. Dropping a powerful LLM into existing engines won't solve the problem. Game development needs to fundamentally change to work in this era of agents. After 15 years in games (making them, watching others make them, and working at the most popular game engine in the world) I'm now fully embracing the power of AI to give it to the people who dream of making games but find it too difficult. I'm building Veselka. In this talk, I'll show you the AI-magic that converts Claude into a real game dev partner, using Three.js to let anyone build their dream game.

12:05pm-12:25pm: x402 isn’t good (yet) — Jan Curn

(sponsor) [Track 2] | Track: Agentic Commerce

While everyone understands that agents will get more done with a budget, no one knows which protocol will win agentic payment standard wars: x402, MPP, Skyfire, or another? So far, x402 is the most mature protocol with the largest transaction volume, but even its new "upto" payment scheme doesn’t support true usage-based pricing, as it gives agents a chance to consume resources and then skip out on the bill. I’ll walk you through our experience (and pains) implementing agentic payments for a marketplace of 30K+ web Actors, and how we made it work even with the current specs.

12:05pm-12:25pm: How Kepler Built Verifiable AI for Financial Services — Vinoo Ganesh

(session) [Track 3] | Track: AI in Finance

Financial answers have to be auditable. Vinoo Ganesh (CEO, Kepler) shows how Kepler Finance pairs Claude's reasoning with deterministic verification infrastructure to index 26M+ SEC filings across 14,000+ companies and 27 markets — and validate every number back to the exact filing, page, and line item. A look at trust, provenance, and content engineering for AI in regulated finance.

12:05pm-12:25pm: Local AI Demos

(session) [Track 4] | Track: Local AI

Rolling demos: GLM 5.2 running on DGX Station; Nemotron 3 Ultra running on 4× DGX Spark; real-time speech on a single Spark; and visual/diffusion generation on a single Spark.

12:05pm-12:25pm: From Systems of Record to Systems of Context — Omri Bruchim

(sponsor) [Track 5] | Track: Graphs

Enterprise AI agents are moving fast, but most of them still hit the same wall in production: they have access to tools, documents, APIs, and databases, but they do not understand the real context of how work gets done. At monday.com, we are building agents that operate across real customer workflows, internal product surfaces, knowledge, permissions, memory, and actions. The hard part is not just calling the right tool or retrieving the right document. The hard part is building a reliable context layer that helps agents understand users, work objects, organizational knowledge, prior decisions, business rules, and the relationships between them. This talk will explore the emerging idea of the context graph: a living, queryable layer that connects entities, history, permissions, decisions, and meaning across an organization. Foundation Capital describes context graphs as the next major enterprise AI opportunity because agents need more than rules. They need decision traces: how rules were applied, where exceptions were made, who approved what, and what precedent actually governs reality. I will share how we think about this opportunity at monday.com, how we are implementing parts of it in practice, and what we have learned from building AI agents inside a real AI work platform. The talk will include concrete examples, including how context is collected, represented, retrieved, governed, and evaluated. The audience will leave with a practical framework for moving beyond one-off RAG pipelines and prompt stuffing toward a reusable context layer that compounds over time, improves agent quality, and becomes a strategic moat for companies building AI-native products.

12:05pm-12:25pm: The Building Blocks of GTM Orchestration — Arman Vaziri

(session) [Track 6] | Track: AI in GTM

Ramp built its own 0→1 revenue stack in-house — Ramp Revenue — with one mandate: build the most efficient GTM org in the world. Arman Vaziri breaks down the building blocks: a customer data platform that chews through millions of internal, external, and CRM records daily, and a unified action layer with agents embedded directly in seller workflows. The payoff — reps stop hopping between dozens of systems just to figure out who to reach and what to say, and 80%+ of Ramp's sales workflows now run on it. A look at the architecture behind orchestrating GTM at scale.

12:05pm-12:25pm: 200 Million Patient Interactions Later: What the Generic Voice Stack Misses — Vivek Muppalla

(session) [Track 7] | Track: AI in Healthcare

A healthcare voice agent can be right on the benchmark and still fail in production. Real patients hesitate, interrupt, misremember medications, code-switch mid-sentence, and disclose risk indirectly. After 200M+ patient-agent interactions, the lesson is clear: in clinical voice AI, interaction is a safety variable. This talk breaks down what Hippocratic AI had to rebuild beyond the generic voice stack: not just ASR, VAD, an LLM, TTS, and turn-taking heuristics, but a real-time safety system that treats silence, clarification, escalation, multilingual continuity, and medication-specific recognition as first-class engineering problems. We’ll walk through the production architecture behind Hippocratic AI’s voice agents: a 30+ model supervisor constellation, including the 4.1T-parameter AI Front Door system, designed to catch failures a single primary model misses. The talk covers how specialized models monitor medication identification, overdose risk, labs and vitals, escalation criteria, workflow confirmation, and other clinical safety surfaces while the patient conversation is still happening. We’ll focus on four production lessons: - Benchmarks are not enough: MedQA and USMLE-style accuracy do not capture the failure modes that appear in a 12-minute, multi-turn patient call. - Interaction signals become training data: pauses, interruptions, hesitation, clarification requests, and escalation markers are mined from production calls and turned into structured eval and training signals. - One LLM is not a safety architecture: supervisor models can overrule, block, or escalate when the primary model sounds plausible but misses a clinical risk. - Voice infrastructure has clinical failure modes: domain ASR, medication vocabulary, code-switching, latency, and turn-taking all affect whether the system makes the right next move.

12:05pm-12:25pm: Benchmarking Coding Agents on New vs Legacy Code bases — Denys Linkov

(session) [Track 8] | Track: Agentic Engineering

You have an old code base with 100,000s of lines of code, should you let an AI Agent refactor or do you wait until you have a cleaner setup? Last year we refactored a number of code bases and ran evaluations on how well different models, harnesses and rule sets affected multiple versions of the code base. This talk will feature specific code examples as well as a broader set of evals.

12:05pm-12:25pm: Vertical Mobility: Building an AI Inference Platform That Scales from MVP to Trillion-Parameter Workloads — Rita Zhang, Sitanshu Gupta

(session) [Track 9] | Track: Inference

The future of AI inference is not one-size-fits-all. This talk explores a multi-tiered architecture that supports the full AI lifecycle, from rapid, pay-per-token experimentation to dedicated, SLO-bound production and extreme-scale, self-managed deployments. Learn about lessons learned from CoreWeave’s inference stack as performance, cost, and control requirements evolve.

12:05pm-12:25pm: Design multi-agent systems that actually work — Tina Manghnani

(sponsor) [Track M] | Track: Track M

Real-world agent systems don’t run in isolation. Learn how to design and coordinate multi-agent systems that collaborate effectively in production—splitting responsibilities, managing system-level complexity, and operating with shared context from Microsoft IQ. See how agents move from single interactions to orchestrated systems that reason, act, and evolve together.

12:05pm-12:25pm: Stop Model Shopping: Why Ownership Beats Choice in the Agent Stack — Pranay Bhatia

(session) [Leadership 1] | Track: Inference

Teams shipping successful agents at scale know that model ownership is now a much more durable advantage than model choice. They’re fine-tuning open models using their proprietary data, building tight data feedback loops between their products and their models, and treating customization as a core product discipline to differentiate. I’ve spent the last decade building AI infrastructure, first as co-creator and head of PyTorch at Meta, now as CEO of Fireworks AI, where my team powers AI agent infrastructure stacks for companies like Cursor, Notion, Uber, DoorDash, and Vercel. I’ve watched hundreds of teams try to ship agents into production, and the patterns behind their success and failure are remarkably consistent. In this talk, I’ll share hard-won lessons from real production deployments across coding, productivity, and enterprise use cases, like: - Model choice matters, but model ownership matters more. Fine-tuning on proprietary data and building a feedback loop between your product and your models creates compounding advantages that no API swap will ever replicate, and it’s now the standard for all state-of-the-art models. It’s how Cursor hit 1,000 tokens/sec with quality that off-the-shelf models could never match, and it’s how Quora saw 3x speed improvements in its chatbot Poe. - The eval gap is where most agent projects die. Teams will spend months on prompt engineering and model selection, then ship without rigorous evaluation. Treating AI development with the same discipline as software development, with CI/CD, regression testing, and continuous evaluation, is what separates production-grade agents from impressive demos. A custom evaluation suite, coupled with RFT, is how Genspark achieved 12% higher quality on their trained model, resulting in a 50% cost reduction. - The real moat is the data flywheel. When you own the loop between your product, your data, and your models, every interaction makes the system better. Surrendering that loop to a third-party provider means surrendering the very data that makes your product defensible. Ownership is how Vercel created a custom code model that matched competitor quality at 40x speed. I’ll ground this talk in real examples I’ve seen work and fail across hundreds of agent deployments.

12:05pm-12:25pm: Preferences > Benchmarks: Model Routing for How Teams Actually Build — Archana Kamath, Tyler Gillam

(session) [Leadership 2] | Track: AI Architects: AI Factories

There is no best model. There's only the right model for a given task, and the right model depends on your team's preferences, not a benchmark score. This talk makes the case for preference-aligned routing: choosing models by the constraints that actually matter — cost, latency, task type, model preference — instead of a single leaderboard number. We'll demo a sub-200ms routing decision running on a purpose-built 30B MoE model with no application code changes, walk through real coding workflows routing most traffic to open models without losing accuracy, and show where this goes next: evals, caching, and personalization.

12:05pm-12:25pm: The Missing Layer in Agentic AI — Giedrius Steimantas

(session) [Expo Stage 1 NE]

Reasoning is solved. Web access isn't. Most agents break the moment they leave the sandbox blocked, rate-limited, or staring at a CAPTCHA. Giedrius will show the three primitives every production agent needs: a browser, a fast search API, and a universal scraper and demo an agent built on top of them that actually works in the wild.

12:05pm-12:25pm: While You Were Generating: The Verification Gap Nobody Talked About — Ali Adl-Tabatabai

(session) [Expo Stage 2 NW]

Every enterprise is asking the same question: how do we move fast with AI without breaking things? While the market chased generation — better models, faster agents, more output — a different problem was compounding quietly: nobody built the verification layer to match. The team built Gitar because they saw firsthand what happens when development velocity outpaces code quality, and AI has made that problem an order of magnitude bigger. In this session, Ali-Reza Adl-Tabatabai, formerly of Uber, Google, and Meta, now leading Gitar development inside Sonar, makes the case for why AI-native code review is the missing layer in every enterprise's agentic stack. Gitar uses agentic reasoning to review code, generate fixes, validate them against your CI, and commit to the branch. It automatically analyzes and de-duplicates CI failures, detects flaky tests, and fixes remaining build, lint, and test failures — keeping reviews moving across time zones without the back-and-forth that kills engineering throughput. As a critical layer in Sonar's multilayered, zero-trust verification platform, Gitar enables organizations to analyze syntax, data flows, logic flows, architectures, and dependencies; set and enforce standards in a consistent, auditable manner; and agentically fix issues both as agents write code and in CI workflows. Sonar intelligently sequences analysis so deterministic verification handles simpler issues first, while AI tackles the nuanced ones, reducing token costs and keeping the pipeline lean. In an agentic world, zero trust is an engineering principle: assume every line an agent writes needs to be verified, every time, at every layer.

12:05pm-12:25pm: Move fast and (don’t) break things — Ben Dicken

(session) [Expo Stage 3 SW]

Engineers want to move fast with AI, but the infrastructure underneath is buckling. Status pages across the industry make this clear. Here, you'll learn how to build systems that maintain 4-nines of availability while meeting unprecedented customer demand using the principles of extreme fault tolerance.

PlanetScale has written about how we apply these principles to operating databases across our fleet (https://planetscale.com/blog/the-principles-of-extreme-fault-tolerance). This matters not just for databases, but all aspects of reliable infrastructure.

Isolation, redundancy, static stability, and back-pressure are the building-blocks to achieving this. Sticking to such principles when architecting the backend of AI applications ensures our systems are resilient to failure while still being flexible enough to scale. We'll look at concrete failure modes from production systems and the patterns that prevent them.

12:05pm-12:25pm: Agents That Forge Their Own Tools: Self-Modifying AI in the Wild — Sandhya Subramani

(session) [Expo Stage 4 SE]

What happens when your agent decides its existing tools aren't good enough and writes new ones? Self-modifying agents can generate, test, and deploy their own tool implementations at runtime, adapting to problems they weren't explicitly programmed to solve. In this session, we'll demo a live agent that forges its own tools on the fly, discuss the safety boundaries you need, and explore where this pattern makes sense (and where it absolutely doesn't).

12:30pm-1:30pm: Latent Space Live: the Inference Inflection from First Principles — swyx, Rob Wachen

(session) [Expo Stage 2 NW] | Track: Expo Stage 2

1:30pm-1:50pm: Loophole - Adversarial Agents To Stress Test Your Morality — Brendan Rappazzo

(session) [Main Stage] | Track: Harness Engineering

Most natural language specifications have holes their authors didn't notice - and writing more rules tends to create more holes. I built Loophole to try a different approach: point adversarial agents at a spec until it stops breaking. You give the system a set of natural language principles. An AI drafts a formal codified version. Two adversarial agents go to work - one finds cases the code permits but the principles forbid, the other finds cases the code forbids but the principles allow. A judge agent patches the code when it can, but only if the fix doesn't contradict any prior ruling. When a contradiction can't be resolved, it escalates to you. Every decision becomes binding precedent, so the constraint space tightens round after round. I started with moral and legal reasoning as the demo, and on its own that's already interesting - it turns into a kind of game where you discover contradictions in your own beliefs that you didn't know were there. But the pattern generalizes well past that. The same loop works for company policies that need to survive contact with edge cases. For making chatbot system prompts adversarially robust. For stress-testing eval rubrics. And, taking the long view, for something like a smarter legislative process - where proposed laws get checked against the public's stated values before they pass, and the contradictions surface before they hit a courtroom. The talk walks through how the harness works, the design choices that matter (especially why precedent is the load-bearing piece), what kinds of specs it handles well, where it breaks, and what it would take to push it further. All code is open source.

1:30pm-1:50pm: While my guitar gently speaks — Todd Fisher

(session) [Track 1] | Track: Generative Media

Do you ever wonder What the next evolution of live performances will look like? I do all the time. Come experience what happens when you combine live guitar playing with DSP as well as TTS and other models, all running locally. Prepare to be entertained and get familiar with new possibilities that modern tools open up in the audio and digital signal processing space while you enjoy a live performance on top of an informative slide presentation.

Walk away from this talk inspired to help build the next evolution of options for musicians and live performances. We will touch on building with tools such as classic DSP, JUCE, TTS, STT, pitch detection with YIN, llama 3 and more with an emphasis of running it all locally on device!

You might even get a chance to have a conversation with a guitar!

1:30pm-1:50pm: Agent Spending Without Controls: The Missing Infrastructure Layer for AI Pa… — Rodrigo Coelho, Pranav Maheshwari

(sponsor) [Track 2] | Track: Agentic Commerce

AI agents are already transacting autonomously, but the infrastructure to govern how they spend does not yet exist. Traditional payment rails were built for humans, not for systems making thousands of micro-decisions per minute on someone else's behalf. This session brings together Edge & Node's CEO and Senior Solutions Architect to cover both the strategic case and the technical implementation. Rodrigo opens with the infrastructure gap: why programmable budget governance is a foundational requirement for any team deploying agents in production, and what it means to have real-time visibility and a full audit trail across every agent transaction. He also covers Edge & Node's founding membership in the x402 Foundation and why open standards for agent-to-agent and agent-to-service payments matter for the broader ecosystem. Pranav then goes deep on the stack: how structured, indexed blockchain data from The Graph powers reliable agent decision-making, how Amp Enterprise extends that into auditable datasets at production scale, and what it looks like in practice to wire ampersend into agent frameworks including LangChain, CrewAI, AutoGPT, and custom-built systems. He walks through the x402 and A2A standards that make agent payments interoperable and what a real deployment looks like end to end. The session closes with the bigger picture: bots are already half of all internet traffic, TradFi and DeFi are converging, and the infrastructure stack that wins is the one built for where they meet.

1:30pm-1:50pm: Build for the Memo, Not the Demo — Notes from 200 Investment Committees — Shawn Chan

(session) [Track 3] | Track: AI in Finance

By the end of this talk you will have a buyer-side specification for AI investment agents, the exact artifacts, evidence formats, and trust gates a senior finance team will require before letting an AI system touch a $100M+ capital allocation decision. Drawn from fifteen years and roughly 200 investment committees at CK Hutchison (A.S. Watson Group) and China Resources Holdings, on the side of the table the AI engineering audience almost never hears from. Most enterprise AI in finance is still being built by engineers who have never sat in an investment committee. I have spent fifteen years on the other side of that demo, cross-border M&A, IPO execution and strategic investment, as a buyer on deals including Oatly (Series B through Nasdaq IPO), Airbnb (Series F), SenseTime, Moore Threads, Leapmotor and EVE Energy, and on the A.S. Watson tri-market IPO and Temasek's strategic stake. I have watched analyst memos get torn apart, and signed off on decisions where being wrong meant being wrong by nine figures. From that seat, almost every AI finance demo I have seen has the same problem: it optimizes for the demo, not for the memo. This talk walks through the specific failure modes that kill AI agents at the IC door: Source hierarchy is not retrieval. A footnote in an audited 10-K outweighs a sell-side note, which outweighs a transcript, which outweighs an internal email. Most RAG systems flatten this. Numerical consistency is non-negotiable. A memo that says "revenue grew 18%" in paragraph one and "17.4%" in the sensitivity table is dead on arrival. Contradiction is a feature. Real diligence surfaces conflicts between sources; AI agents tend to silently resolve them. Every assumption must be separable from every fact. Investment committees do not approve assumptions hidden inside prose. Audit trail is the deliverable. If a regulator, an auditor, or a board member cannot trace a claim back to evidence in under thirty seconds, the system is unusable. Accountability cannot be delegated to a model. Someone has to sign the memo. The architecture has to reflect that. The session closes with a concrete buyer-side specification, what an AI investment agent must produce, in what form, with what evidence, before a senior finance team will let it touch a live deal. Not a framework slide.

1:30pm-1:50pm: Local Models: Trust, Control, Optimization — Carter Abdallah, Vincent Weisser, Lucas Atkins, Chris Alexiuk, Lou

(session) [Track 4] | Track: Local AI

Local Models: Trust, Control, Optimization looks at why builders are choosing local AI for privacy, reliability, customization, cost, and ownership, while still asking where cloud remains necessary. The panel covers local-first versus hybrid strategies, the role of open-source models, and the infrastructure stacks making frontier-quality intelligence possible outside centralized APIs.

Moderator: Carter Abdallah (NVIDIA). Panelists: Vincent Weisser (Prime Intellect), Lucas Atkins (Arcee AI), Chris Alexiuk (NVIDIA), Lou (Z.ai).

1:30pm-1:50pm: AI : Learned Execution Graphs for Real-Time Anomaly Detection & Drift Classification in APIs — Ritvik Pandya

(sponsor) [Track 5] | Track: Graphs

API ingress controllers process requests through ordered sequences of middleware steps — authentication, authorization, validation, rate limiting, routing, service invocation, caching. We model this pipeline as a directed acyclic graph (DAG) learned from structured telemetry events, then apply graph-based anomaly detection and drift classification in real time at 1,600+ TPS. The system emits one structured event per processing step, constructs per-endpoint execution graphs using sequence mining with statistical confidence thresholds, and learns per-node baselines (latency, dependency, execution frequency). Three graph intelligence capabilities emerge: (1) Graph-based anomaly attribution — compute per-node deviation ratios against learned baselines to identify the exact bottleneck node and its dependency. In production, this pinpointed a 41x deviation at a single graph node that was invisible to service-level monitoring, reducing root cause identification from 2-3 hours to under 30 seconds. (2) Graph structural drift detection — compare observed node sequences against the learned graph topology to detect missing nodes (mandatory processing step silently skipped), reordered nodes (middleware misconfiguration), and unexpected new nodes (unauthorized middleware injection). Traditional monitoring reported "system healthy" when a mandatory node was removed — latency dropped, errors at zero — only the learned graph comparison detected the structural change. (3) Per-client graph fingerprinting — learn client-specific execution graph profiles using exponential moving averages. Detect when a client's graph traversal pattern changes, classify the cause (client behavior change vs. configuration drift vs. infrastructure failover) using KL divergence on node-visit distributions, and apply graph-aware adaptive control scoped to specific nodes rather than entire endpoints. The execution graph model also enables a novel approach to retry storm detection: analyzing idempotency key entropy at graph nodes to classify traffic as legitimate growth vs. retry amplification, and returning cached responses at the specific graph node rather than rejecting requests — breaking the retry amplification loop. Production system processing high TPS. Attendees will learn the graph construction methodology, the anomaly attribution algorithm, and concrete patterns for adding learned graph intelligence to any middleware pipeline.

1:30pm-1:50pm: How Juries and Librarians Can Solve GTM's AI Trust Problem — Alex Bauer

(session) [Track 6] | Track: AI in GTM

A couple of years ago, everyone worried about AI hallucinating. We rarely hear that word anymore, but it’s just because the problem grew up. Today, your AI still doesn’t know how to say “I’m not sure.” Instead, it hands you a revenue number that’s wrong in ways that look exactly like being right.

The good news is we already solved this once, for people: you onboard a new hire so they understand your business; you put subjective, high-stakes calls in front of more than one set of eyes. This talk walks through patterns we run at Upside, including a librarian every agent consults before it acts, a jury-and-judge model for the questions a single pass can’t be trusted to answer, and knowing when the model itself is just too dumb for the job. Live demos and real failures included.

1:30pm-1:50pm: Al is becoming the World's largest Relationship Therapist. We Can't Afford to Get it Wrong. — Clay Cockrell, Tony Fabrikant

(session) [Track 7] | Track: AI in Healthcare

Millions of people are now turning to AI for relationship advice and emotional support, often before they'd ever consider a human therapist. Most of the AI Therapy that is available is without clinical oversight, ethical frameworks, or any serious reckoning with what it means to intervene in the most intimate and vulnerable space in a person's life. People are getting hurt. As a couples therapist with 30 years experience, I teamed up with the former CTO at S&P and we created CoupleWork, an AI relationship therapist I essentially trained on three decades of clinical knowledge and every evidence-based modality that exists. Our voice interactive AI, Maxine, is proving this can be done responsibly and very effectively. And what we're learning about the nature of love, connection, and human vulnerability at scale is something this industry needs to hear. I also want to talk about what comes next: the regulatory frameworks that don't yet exist, the liability questions nobody is answering, and why the therapists who should be leading this conversation are almost entirely absent from it.

1:30pm-1:50pm: Codex, Behind the Harness — Dominik Kundel

(session) [Track 8] | Track: Agentic Engineering

Agents have evolved a lot in the last year both in capabilities and in the overall structure. Increasingly sandbox-powered coding agents are breaking out to do general purpose work.

In this talk we’ll be taking apart the open-source Codex agent harness. Understand how it works, what makes it so suitable to do work beyond coding tasks, how it handles key aspects like context management, tools and file system access. We’ll also tie these back to concrete actions you can take to bring these patterns into your own agents, whether you are building on top of the Codex agent or building your own.

1:30pm-1:50pm: What's New in Inference Engineering — Philip Kiely

(session) [Track 9] | Track: Inference

More than 30,000 engineers have learned the fundamentals of inference since Inference Engineering was published. But the field keeps accelerating, so it's time for the first public addendum to the book. The past four months have seen a renewed focus on training-dependent inference optimization across the "big three" performance techniques of speculation, caching, and quantization. This talk provides structured guidance for training DFlash and EAGLE 3 draft models to accelerate LLM decode, introduces the concept of KV compaction, and explains the hype behind TurboQuant.

1:30pm-1:50pm: Evaluating and optimizing AI agents: from observability to continuous improvement — Chang Liu

(sponsor) [Track M] | Track: Track M

AI agents don’t behave like traditional systems. Learn how to evaluate outputs, trace behavior, and apply a continuous loop to improve performance across prompts, tools, and models. Using signals grounded in real-world context via Foundry IQ, see how evaluation, tracing, and optimization come together to turn production usage into measurable improvements over time.

1:30pm-1:50pm: From Zero to AI-Native: Scaling AI Across the Org — Josh Leavitt

(session) [Leadership 1] | Track: AI-Native Enterprises

Most companies talk about being AI-native, but few show what it takes at scale. Josh Leavitt, Sr. Director of AI & Data at Coinbase, shares the hard-won playbook for transforming a high-stakes, regulated engineering organization into one where AI is a first-class citizen across every team. Josh can cover my approach towards building a centralized AI platform that serves thousands of engineers without becoming a bottleneck, tooling decisions that actually moved the needle, and what AI-native really means when shipping in a zero-tolerance-for-failure environment. Expect concrete frameworks, real examples, and honest lessons from what didn’t work.

1:30pm-1:50pm: Coding Agents Don't Scale Themselves. Neither Do Your Teams.The Rise of Agent Enablement. — Patrick Debois

(session) [Leadership 2] | Track: AI Architects: AI Factories

Every company wants to know how others are actually scaling AI coding. But it's hard to get past the generic transformation stories. What are the new practices showing up in real engineering orgs? What does maturity actually look like, and what separates teams that are moving from teams that are stuck? What are the patterns for enabling humans and agents, together? Patrick Debois has been collecting the practices and patterns, talking to the early Agent Enablement teams already on the job, team leads, and VPs of Engineering. What's showing up is a new function: a team that enables other teams to get real leverage out of their agents. This talk takes the Context Development Lifecycle off the individual laptop and onto the org chart, grouped across three pillars: - Enablement. From individual experimentation to team and org-level fluency with agents. - Platform. Agent tooling that runs like a real delivery pipeline: fast, observable, cost-aware. - Governance. Ad-hoc guardrails growing into real evaluation, telemetry, and accountable agent work. For Agent Enablement leaders scaling it out across the org. For team leads looking to help their teams get better at this. For VPs ready to unblock the friction and unlock what agents can actually do. Coding agents don't scale themselves. This is the talk about who does

1:30pm-1:50pm: Trust, But Verify: Human-in-the-Loop for Agents That Actually Matter — Michael Liendo

(session) [Expo Stage 1 NE]

"In this talk we'll walk through the full spectrum of human-in-the-loop patterns, from lightweight inline confirmations to out-of-band permission gates to handing your agent a wallet with real money in it and more. Each pattern fits a different level of consequence, and knowing which to reach for is what separates demo agents from production ones. We'll cover the honest tradeoffs of latency, user experience, and trust so you can make the right call for your specific use case.

The entire talk is built around various live demos that escalate in stakes with every step. You'll leave with a mental model and working reference architecture you can apply the same day."

1:30pm-1:50pm: YOLO Mode, Safely: microVM Sandboxes for Any Agent — Rowan Christmas

(session) [Expo Stage 2 NW]

This talk shows the alternative: every agent session in its own microVM, with its own kernel and a hard boundary to the host. You decide what lives inside that boundary: filesystem, network, the tools it's allowed to call. The sandbox runs Claude Code, Cursor, Codex, or whatever else you're driving. You'll see an agent live in full YOLO mode inside a sandbox, a real attempt to escape, and the boundary that holds up.

1:30pm-1:50pm: Your Model is Private. Your System Isn't. — Joshua Mo

(session) [Expo Stage 3 SW]

Privacy in AI isn't just about choosing the right model. Data leaks rarely happen inside the LLM itself - they happen in the systems surrounding it. Observability pipelines, analytics platforms, prompts, agents, and infrastructure often become accidental channels for exposing user data. In this session, Joshua Mo, Lead DevRel Engineer at Venice AI, explores why private models alone are not enough and shares practical privacy-preserving patterns that AI engineers can adopt today. From revocable handles and hashed identifiers to agent boundaries and confidential computing, attendees will leave with concrete ideas for building AI systems that protect user data by design.

1:30pm-1:50pm: Video Discovery for Agentic World-Model Training — Rafael Levi

(session) [Expo Stage 4 SE]

Physical AI had its “Attention Is All You Need” moment with the rise of Vision-Language-Action models. The next bottleneck is data: not just more video, but the ability to find the exact real-world moments that teach models how the world works: gravity, motion, causality, human behavior, and object interactions. This session explores a new approach: discovering specific scenes from the vastness of the web. We’ll show how teams can search for moments like objects falling, people interacting with environments, or actions unfolding over time, then collect and structure only the relevant clips for training and evaluation. Attendees will learn how scene-level discovery changes multimodal data pipelines, reducing wasted collection, processing, storage, and review, while making it easier to build targeted datasets for VLA systems, robotics, physical AI, and agentic world models.

1:55pm-2:15pm: 🎵 Every step you take, every call you make - the reliable agent stack — Giselle van Dongen

(session) [Main Stage] | Track: Harness Engineering

In this session, we skip past the demos that work only on your laptop, and go straight to how you can build production-ready agents with a stack that covers all the hard bits of backend development that you don’t want to be bothered with when developing your agents: - Failure resiliency: retries, timeouts, and exactly-once execution so a flaky API or a crashed process doesn't corrupt your agent's state or makes them start from scratch - Durable Sessions: a session store with built-in conversation isolation and protection against corruption from concurrent agents - Pause/resume for human approvals: survive human approvals and research that take weeks without building complex infra - Agent-to-agent messaging layer: call agents developed by other teams or running on other infra with resilient HTTP calls - A kill switch: cancel a running agent cleanly at any point, without leaving half-executed work behind We will demonstrate each concept with live code examples, using Python, OpenAI Agents SDK and Restate as open-source Durable Execution engine. All examples are generally applicable: pick your favorite agent SDK (OpenAI Agents, Pydantic AI, Vercel AI, Google ADK,…) or go wild and implement low-level custom agents by just tying together LLM calls with custom logic.

1:55pm-2:15pm: Voice agents with Realtime Video — Lina Colucci

(session) [Track 1] | Track: Generative Media

1:55pm-2:15pm: Teaching agents to pay — Anna Spysz

(sponsor) [Track 2] | Track: Agentic Commerce

With a global daily user base in the hundreds of millions, AI agents are rapidly becoming a primary interface for how people discover, evaluate, and purchase products. Enabling those products to be listed and paid for directly through agents opens an entirely new - and enormous - commerce channel. The Agent Commerce Protocol (ACP) and Shared Payment Tokens provide a secure framework for agent-driven commerce within Stripe’s ecosystem - without exposing payment data or sacrificing user control. This session walks developers through the complete implementation: setting up Stripe integration, creating permission-based payment tokens, interacting with ACP endpoints, and designing trustworthy user experiences. You'll learn how to enable your agents to transact safely and predictably, handling everything from checkout flows to error scenarios and webhook events.

1:55pm-2:15pm: We Vetted 2,000 AI Skills Before They Reached Developers — Lucas Palma

(session) [Track 3] | Track: AI in Finance

AI skills and plugins are becoming part of the software supply chain. They steer agent behavior, describe tools, run commands, access files, and shape how developers build with AI. Treating them as harmless configuration is a mistake. This talk shares what we learned from building an automated security review system for more than 2,000 internal AI skills before they reached a company wide plugin marketplace. I will walk through the risks we found, the checks that worked, the checks that created noise, and how we turned skill review into something developers could run locally and in CI. We will cover practical patterns for reviewing unsafe instructions, destructive commands, sensitive data exposure, risky tool use, credential handling, external calls, and agent behavior drift. The goal is to help AI engineers think about skills, plugins, and agent instructions as production dependencies that deserve review before they reach real users.

1:55pm-2:15pm: Local Models: Trust, Control, Optimization — Carter Abdallah, Vincent Weisser, Lucas Atkins, Chris Alexiuk, Lou

(session) [Track 4] | Track: Local AI

Local Models: Trust, Control, Optimization looks at why builders are choosing local AI for privacy, reliability, customization, cost, and ownership, while still asking where cloud remains necessary. The panel covers local-first versus hybrid strategies, the role of open-source models, and the infrastructure stacks making frontier-quality intelligence possible outside centralized APIs.

Moderator: Carter Abdallah (NVIDIA). Panelists: Vincent Weisser (Prime Intellect), Lucas Atkins (Arcee AI), Chris Alexiuk (NVIDIA), Lou (Z.ai).

1:55pm-2:15pm: Why Agentic Systems Need Ontologies — Frank Coyle

(sponsor) [Track 5] | Track: Graphs

Agentic systems fail in predictable ways: context degradation, brittle tool descriptions, fragile multi-agent handoffs, stop-reason confusion, and the ever-present temptation to fix reliability problems with more natural-language instructions. These anti-patterns aren't bugs to be patched turn by turn — they're symptoms of a missing architectural layer. LLMs reason probabilistically over domains they only partially understand, and no amount of prompt engineering fully closes that gap. This talk argues that the missing layer is an explicit ontology: a formal, shared map of the domain's concepts, relationships, and constraints. The pattern is not new — ontologies have driven commercial success in defense and intelligence systems for over a decade, where probabilistic models must operate over high-stakes enterprise data without drifting into nonsense. Graph databases like Neo4j and Amazon Neptune have made the underlying primitives widely accessible. We'll show how lightweight ontology constructs can surround an agentic system with enforceable logical constraints: typed entities and relationships that tools must respect, cardinality and domain restrictions that catch malformed tool calls before they execute, and a shared vocabulary that keeps coordinators and subagents talking about the same things. The session walks through several agentic applications — a multi-agent research workflow, a tool-heavy customer support agent, a coordinator-subagent delegation pattern — and shows in each case how an ontology layer addresses the kinds of anti-patterns catalogued in Anthropic's Claude Certified Architect exam. The result is a hybrid neurosymbolic architecture: probabilistic reasoning inside, logical guardrails outside. Who should attend: engineers building production agentic systems, architects evaluating reliability strategies beyond prompt engineering, and technical leads who suspect their agents need more structure than another system prompt can provide.

1:55pm-2:15pm: How We Got LLMs to Recommend Our Open Source Library (Without Paying or Plug-ins) — Christopher Burns

(session) [Track 6] | Track: AI in GTM

Over the past year, we’ve seen a new distribution channel emerge: AI assistants. Instead of SEO, ads, or integrations, developers are discovering tools through models like Claude. In this talk, I’ll break down how we got our open source library recommended organically by LLMs in under a year, without plugins, paid placements, or partnerships. We’ll cover what actually influences model outputs today, how developer-first products behave differently in this channel, and the practical steps we took to make our project show up when it matters. This is not theory. It’s a real case study of how distribution is changing, and how you can design your product and content to be picked up by AI systems directly.

1:55pm-2:15pm: Healthcare’s Agent Bytecode: X12 as the Harness for AI Agents — Vasant Kearney

(session) [Track 7] | Track: AI in Healthcare

LLMs made old languages newly useful: COBOL for mainframes, Fortran for scientific code, and Rust, SQL, and Prolog as strict substrates for agentic systems. Healthcare has its own old language hiding in plain sight: X12. Before LLMs, X12 was mostly treated as ugly plumbing: loops, delimiters, companion guides, clearinghouse edits, payer-specific quirks, rejections, and acknowledgments. In an agentic workflow, those constraints become the feature. They give stochastic agents a deterministic target. This talk shows how healthcare agents can compile messy operational evidence into X12-shaped workflows: chairside audio into 837D claim narratives, imaging systems into 275/PWK attachment flows, payer portals and phone calls into 270/271 eligibility and 276/277 claim status, preauth evidence into 278 workflows, and EOBs, scanned mail, and bank data into 835/820 payment reconciliation. The core pattern is simple: LLMs reason over ambiguity; X12 provides the syntactic and semantic harness for validation, auditability, acknowledgments, rejections, human review, and high-volume automation. This is not an EDI nostalgia talk. It is a production architecture talk about building reliable agents in one of the messiest enterprise domains.

1:55pm-2:15pm: Multiplayer agentic engineering: enabling your whole team and your best agents to work together — Arjun Singh

(session) [Track 8] | Track: Agentic Engineering

For a solo developer, coding agents are a superpower. For a team, they surface new kinds of bottlenecks: coordination, visibility, review, and shared context.

We wanted our whole team and our best agents to work together, with no work or context trapped on any one developer's machine. So we pressed pause on the product we were building to create a multiplayer cloud workspace for agentic engineering.

This talk shares five key practices we've learned from building and using our platform:

Turn every surface the team uses into an agent interface.

Kick off sessions from Slack, review via iOS app, iterate in GitHub comments, ship from web. Agents run in the cloud, so work keeps moving even when your laptop is closed.

Make agent work visible and collaborative across the whole team.

Every agent session is shared, has a live app preview, and an agent-guided code review. This allows engineers, PMs, and designers to steer and evaluate agent work collaboratively.

Turn every external signal into shipped code your team can quickly evaluate.

Automatically turn customer emails, meeting action items, and bug reports into agent implementations that the whole team can review.

Set up shared cloud dev environments so agents aren't siloed to individual machines.

Secrets, role-based access, and network controls shared across the whole team. Fast environment startup, so you're not giving up speed by moving off local.

Benchmark agents on your own codebase.

Claude Code, Codex, Gemini, Amp, OpenCode — how do you know which is actually better on your stack? We'll cover using your merged PRs as ground truth to build a "Personal SWE-Bench" for your codebase.

Agentic engineering is going multiplayer. This is how your team gets there.

1:55pm-2:15pm: Rob Wachen — transformer-only ASICs for inference — Rob Wachen

(session) [Track 9] | Track: Inference

Etched's Sohu approach to transformer inference on custom silicon.

1:55pm-2:15pm: Blast Radius Zero: One‑Command OpenClaw Sandboxes in the Cloud — Arun Sekhar

(sponsor) [Track M] | Track: Track M

You already run OpenClaw locally, sandboxed in MXC. Now you need the same agent in the cloud for dev/test, reachable from Teams on your phone, without handing over the keys to the kingdom. This session shows a simple, one‑command path to do all of this: an isolated Container Apps sandbox running an OpenClaw image, calling Azure OpenAI in Foundry Models securely without keys over the standard OpenAI API, scaling to zero when idle.

1:55pm-2:15pm: Which AI startups actually land enterprise contracts? Lessons from evaluating 100+ AI startups at Millennium Management — Brian Lewis

(session) [Leadership 1] | Track: AI-Native Enterprises

Selling your AI startup/product into a large enterprise is hard. I often sit on the buyer's side of the table at a large hedge fund. I've sat through 100+ AI startup pitches and am responsible for running the pilots that may eventually convert into your ARR. We'll cover what works, what doesn't, and what large enterprise customers need to see in order to choose 'buy' over 'build'.

1:55pm-2:15pm: Agent Frameworks Considered Harmful — Rémi Louf

(session) [Leadership 2] | Track: Harness Engineering

1:55pm-2:15pm: MCP doesn’t suck — your agent does — Jan Curn

(session) [Expo Stage 2 NW]

Most AI agents misuse MCP and treat tools as prompt-time function calls: tool definitions and results are repeatedly injected into the context, tokens are wasted, and context rots. The result? Slower, less reliable agents, and the misleading conclusion that “MCP sucks, CLIs are better.” To challenge this narrative and show how agents can get the best of both MCP and CLI, at https://apify.com/ we’ve built mcpc (https://github.com/apify/mcpc), an open-source universal CLI client for MCP. It maps MCP operations to intuitive CLI commands, which agents quickly pick up through --help without external skills. It turns out, CLI is the perfect local interface for agents to interact with MCP, giving them access to full protocol capabilities including modern features like code mode or progressive tool discovery through a single Bash() tool call, while leveraging MCP’s standard remote interface for server discovery, authentication, payments, and access control. To once and for all kill the MCP vs. CLI debate and show those two technologies are not exclusive but complementary, we’ll present evals comparing performance of agents using naive MCP, modern MCP, native CLIs, other MCP CLIs, and mcpc, in various real-world scenarios.

1:55pm-2:15pm: Everyone talks about document search, but what about results? — George He

(session) [Expo Stage 4 SE]

Search is usually treated as the end of the document pipeline: parse, chunk, retrieve, and hand them to the model. But long-running agents need something more durable than one-off retrieval. They need reusable work: structured outputs, citations, extracted entities, prior decisions, and file-system-like context they can return to across tasks. At scale, context management is where most agent systems fall apart. Without the right harness, agents lose track of what they've retrieved, bloat their context windows, and stall.

In this talk, we'll look at why the document pipeline needs a stateful layer beyond the index — one that turns one-off retrieval into reusable, agent-ready context. We'll see how LlamaIndex thinks about transforming messy documents to make this possible, and why the future of document intelligence belongs to results that compound over time, not just better search.

2:25pm-2:45pm: We let an AI agent execute Bash and lived to talk about it — Sarah Sanders

(session) [Main Stage] | Track: Harness Engineering

PostHog's Wizard agent can read your codebase, install packages, and run shell commands on your laptop. Yes, on purpose. This talk covers how we went from "defense-in-hope" to a standalone, robust security service. It'll highlight results from a pentest that made us question our life choices, an internal audit that challenged our architecture, and the debate over how to secure the entire pipeline. You'll learn why "scan-then-trust" is a weaker model than you think, what it takes to build kill switches you hope you never use, and what happens when you pentest an AI agent that has access to Bash.

2:25pm-2:45pm: Generative Video at the Speed of Light — Keegan McCallum

(session) [Track 1] | Track: Generative Media

Discussing recent breakthroughs in realtime generative video models, and the new architectural problems and bottlenecks involved in creating immersive, interactive experiences on top of these models.

2:25pm-2:45pm: The Agentic Commerce Stack — Ahnaf Prio

(sponsor) [Track 2] | Track: Agentic Commerce

Agents are already handling product discovery, cart building, and checkout — no human clicking required. But what's the protocol stack actually making this work? This talk maps the real infrastructure: MCP for tool access, A2A for agent coordination, the ACP spec (backed by OpenAI) and the UCP spec (backed by Google) — two competing approaches to standardizing the full agentic commerce lifecycle — and AP2 for agentic payments. We'll cover what each does, how they compose, and where they're still forming. Then we'll see it live — a working demo with a protocol inspector showing every tool call, task transition, and checkout event in real time. You'll leave with a clear mental model of the agentic commerce landscape and a reference implementation you can use.

2:25pm-2:45pm: Your Finance Agent's Bottleneck Is You — Ramana Siddanth Emani

(session) [Track 3] | Track: AI in Finance

Most "AI for Finance" demos look great and almost none survive past pilot. If you've pushed an agent past one workflow, one tenant, or one Workday schema, you know the bottleneck isn't the model - it's the engineer behind the agent, who can't iterate fast enough to keep up with real AP data, real RBAC, and real query volume. What if you built your dev loop with the same primitives you're shipping to the finance team? In this talk, I'll show the subagent + skills + MCP stack - a production multi-agent system over AP, PO, vendor, and multi ERP systems, a LangGraph pattern that survives production, and the three failure modes that kill finance pilots before they ship.

2:25pm-2:45pm: Compression at the Edge — Chris Alexiuk, Daniel Han, Asma Beevi, Merve Noyan, Michael Chiang

(session) [Track 4] | Track: Local AI

Compression at the Edge examines how smaller weights, faster inference, and constrained-memory deployments are making capable local AI more practical. The panel explores where compressed models already beat cloud on latency, privacy, cost, or control, what breakthroughs would unlock broader adoption, and how open model tooling is shaping the edge AI stack.

Moderator: Chris Alexiuk (NVIDIA). Panelists: Daniel Han (Unsloth), Asma Beevi (NVIDIA), Merve Noyan (Hugging Face), Michael Chiang (Ollama).

2:25pm-2:45pm: Video Has No Memory. Here's How We Built One. — James Le

(sponsor) [Track 5] | Track: Graphs

Every video AI query today starts from scratch. There's no durable state, no entity continuity, no way to ask "what does this corpus know?" instead of "find me something like this." This talk is about fixing that by engineering a proper memory layer for video intelligence, grounded in what we shipped at TwelveLabs with Jockey. What this talk covers: 1 - Why video memory is categorically different from text memory: Video is temporal, multimodal, dense, ambiguous, and evidence-sensitive. Larger context windows don't solve this. The problem isn't retrieval bandwidth, it's that there's no durable representation to retrieve into. 2 - The context graph as a systems concept, not a database choice: I'll define what "context graph" actually means in practice: time-bounded moments, cross-video entity resolution, appearance tracking, and relationship mapping. This is infrastructure-level thinking, not a graph DB sales pitch. 3 - Five design principles that determine whether video intelligence is reusable infrastructure or a search wrapper with extra steps: + Ingest once, reason many times (move expensive understanding work into preparation) + Store primitives, not just answers (moments, entities, appearances, relationships) + Ground every claim to source video (a timestamp is a product requirement, not a safety footnote) + Let intent shape memory (brand safety and sports highlights need different primitives from the same footage) + Keep the memory layer composable and API-first 4 - What this unlocks for builders. Corpus digest, agentic search with grounded references, entity-centric workflows, timeline reconstruction, and compliance tooling, all built on the same durable substrate. The talk is concrete and demo-grounded. You'll leave with a specific mental model for memory architecture, actionable decisions for ingestion pipeline design and entity resolution, and a clear line between "search with extra steps" and actual video intelligence infrastructure.

2:25pm-2:45pm: Lessons From Building The World's Largest Knowledge Graph — Jeffrey Wang

(session) [Track 6] | Track: AI in GTM

_Exa set out to index and embed the entire web as a queryable knowledge graph — the substrate behind neural search and the enrichment layer powering modern GTM data. Co-founder Jeffrey Wang shares the hard engineering lessons: crawling and embedding at web scale, keeping a graph fresh and trustworthy, and the retrieval architecture that lets agents pull grounded facts instead of hallucinations. Why the knowledge graph — not the model — is becoming the moat for AI-native GTM._

2:25pm-2:45pm: Trading Desks to Clinical Trials: Parallels in Applied Vertical AI — Ayush Bhardwaj

(session) [Track 7] | Track: AI in Healthcare

Wall Street to Wet Labs: The Shared DNA of Vertical AI. On the surface, finance and pharma couldn't look more different. One chases alpha in the markets; the other engineers complex drug delivery and stability. But under the hood, building Vertical AI for both domains reveals a striking shared DNA. Drawing from hands-on engineering experience in Applied AI at a top hedge fund and a cutting-edge pharma tech startup, this session explores the surprising architectural parallels between these two high-stakes industries.

2:25pm-2:45pm: Always-on agents run production without the on-call tax — Justin Smith

(session) [Track 8] | Track: Agentic Engineering

Most production teams have the same problem. The work that keeps systems healthy- deployment checks, on-call handoffs, anomaly reviews- never makes it into a sprint. It falls to whoever has bandwidth, gets done inconsistently, and disappears when people are stretched thin. Background agents fix this by running that work on a schedule, using the same production context a senior engineer would, without waiting for someone to initiate it. Justin Smith, Founding Engineer at Resolve AI, walks through the architecture behind always-on agents, the use cases teams are starting with today, and what we have learned from running them in our production environment.

2:25pm-2:45pm: The Frontier AI Inference Cloud for Agents — Byung-Gon (Gon) Chun

(session) [Track 9] | Track: Inference

Agents have changed the economics of AI inference. A chatbot’s cost scales roughly linearly with the number of requests; an agent’s scales multiplicatively. A single task can fan out into hundreds of model calls, each carrying a repeated context prefix and adding latency that compounds across tool calls and reasoning steps. As open-weight models keep improving and agentic workloads grow, this shift exposes the limits of traditional request-level optimization. Inference infrastructure becomes a first-class concern, one that often shapes performance and cost as much as the model itself. In this talk, we explore what changes when you optimize for the whole task rather than the individual request, and how FriendliAI is rethinking the inference cloud for the era of agentic AI.

2:25pm-2:45pm: Operate agents safely at scale with enterprise governance — Ashu Joshi

(sponsor) [Track M] | Track: Track M

As adoption grows, governance becomes critical. Learn how to manage identity, compliance, and lifecycle for agent systems at enterprise scale.

2:25pm-2:45pm: Your Hero Agent Needs a Party — Kunal Lanjewar

(session) [Leadership 1] | Track: AI-Native Enterprises

A front-door persona, a party of deterministic specialist agents, A2A between. Your support bot deflects half its tickets, then, soloing a problem it was never built for, confidently runs the wrong kubectl command. Most teams respond by rewriting the prompt. The real fix is a multi‑agent party of specialists. This talk gives you a production pattern that turns one over-leveled hero agent into a coordinated party of specialists you can trust on tier-zero infrastructure. Persona and ReAct agents make great heroes at the front door. Any team can copy one, paste it into their stack, and adjust the behavior in plain English. But if you send a lone hero to clear the dungeon, whether it is a deploy or an incident, a non-deterministic Reason-Act loop tends to loop, over-act, or punt back to a human. More prompts and more skills do not reliably level it up. Instead of soloing, keep the persona as the front-door face and give it a party: deterministic DAG specialists where the graph is fixed and the LLM is called only at decision points. For example, a deployment specialist can list rolling pods, choose the next tool, run it, read logs, and then diagnose the result. Each specialist is a class with one job and a narrow set of tools, and they coordinate over A2A for capability discovery and delegation across frameworks. Reliability and tighter least-privilege access become properties of the design, not something you try to bolt onto a prompt. You’ll leave with the pattern: where to draw the line between the hero and its specialists, how to shape a DAG specialist so it decides instead of flails, and where A2A fits as the seam between them, grounded in lessons from a tier‑zero fleet.

2:25pm-2:45pm: Optimizing Open Models for Production Grade Inference — Sujee Maniyam, Dylan Bristot

(session) [Expo Stage 1 NE]

Open-source foundation models are rapidly closing the gap with proprietary systems, enabling organizations to build powerful AI applications with greater flexibility and control. However, deploying these models in production introduces a new set of challenges: latency, throughput, scalability, and cost efficiency.In this talk, we'll explore the modern inference optimization techniques that power large-scale AI systems in production. Topics include KV cache optimization, cache-aware routing, prefill/decode disaggregation, speculative decoding, and other emerging approaches used to improve performance and reduce infrastructure costs.Through practical examples and real-world architecture patterns, attendees will gain a deeper understanding of how to run open models efficiently at scale.

2:25pm-2:45pm: The Human Is an Async API — Melanie Warrick

(session) [Expo Stage 3 SW]

Production agent systems need humans in the loop. So why do they keep getting modeled as synchronous tool calls? The agent ecosystem is focused on autonomy, but in reality, especially for high-stakes or regulated workflows, humans are a critical feature, not an afterthought. This demo-driven talk shows how to stop bolting on humans and start treating them as async-by-default endpoints with proper durability, retry, and escalation semantics. We will walk through two live, multi-agent patterns built with LangGraph and Google ADK, on Temporal for durable execution: The Agent Calls the Human. A fleet dispatch system escalates a disruption to an approver. We will intentionally kill the worker process mid-wait. Hours later, the human responds. State survives, and the agent resumes. The Human Calls the Agent. An operator interrupts a long-running task mid-flight to redirect it. The agent halts gracefully, surfaces state, accepts the override, and continues. Harness engineering has heavily focused on model autonomy. This talk is about the other half of the puzzle: the human. You will leave with two production-ready architectural designs you can apply this week: agent-initiated approval gates with timeout and escalation semantics, and human-initiated interrupts with graceful agent halt and resumption. Not every agent needs a human in the loop. But if you are building systems where the cost of being wrong exceeds the cost of being slow, this talk is for you.

2:50pm-3:10pm: No Memory, No Harness: Why the Database Is the Last Line of Defense — Kay Malcolm

(session) [Main Stage] | Track: Harness Engineering

The model is the easy part. Everything that makes an agent survive contact with production lives in the harness around it: orchestration, tooling, governance, and the memory core that keeps the system grounded when the model itself is probabilistic, forgetful, and non-deterministic. This talk walks the surface areas of an agent harness and consolidates the lessons we're learning as we ship them, from agentic applications in their current form (autonomous systems that now build their own automations) to the continual-learning loops that let agents improve from their own experience. We'll look at how the discipline is segmenting. AI application development is no longer one role but several: agent engineers, memory engineers, and platform engineers. We'll map Oracle's primitives onto each as the current state of harness engineering takes shape. We'll also examine the two populations betting on this stack at once, enterprise customers who need governance, reliability, and scale, alongside the cracked developers who need fast, composable primitives, and why a well-engineered harness serves both. And we'll make the case that has held through every shift in the stack: memory isn't a feature you bolt on, it's the foundation the rest of the harness stands on. The database remains the memory core, and when everything above it is probabilistic, it's the last line of defense.

2:50pm-3:10pm: Infra behind Krea 2 - How to train and serve at scale — Gabriel Jorge Menezes

(session) [Track 1] | Track: Generative Media

What do you need know about large scale pretraining and inference for GPUs.

1. Challenges of managing infra for pretraining

2. Weird problems we faced and how we fixed them

3. How to serve at scale with multiple clusters

2:50pm-3:10pm: Your Agent Just Authorized What?! — Jay Mok

(sponsor) [Track 2] | Track: Agentic Commerce

The nightmare scenario writes itself: your agent just ran off with your credit card and maxed it out on concert tickets, crypto, and a questionable NFT collection. Relax — we're building the guardrails. When an agent acts on your behalf, three questions must always be answerable: Did the human authorize this? Did they authorize this, now, in this scope? And can we prove it later? This talk maps three permissioning layers onto a stakes ladder: OAuth scopes at the bottom (broad capability, weak per-action proof, fine when reversible), Claude Code's tool-scoped allow/ask/deny model in the middle (brilliant for developer tooling, but no cryptographic evidence), and signed payment mandates at the top — where FIDO's Agentic Payments Working Group is building toward cryptographically-bound, constraint-carrying credentials. We'll share artifacts from Agent to Agent payments using our Shared Vault and Oauth to our constraint carrying Approval token leveraging our pillars of Identity and Buyer and Seller protection. You leave with a stakes × evidence matrix and a mental model that applies beyond payments: medical orders, e-signatures, securities trading, activities where you want you want to be more careful with your agent.

2:50pm-3:10pm: Simulation-Maxxing: How Nubank ships agents 20× faster with simulations — Shreya Rajpal, Aman Gupta

(session) [Track 3] | Track: AI in Finance

You know how to build an agent - write a prompt, spec out some tools and call an LLM (or gateway). At this point, you probably also know how to build an agent that “actually works” using some combination of agent frameworks, eval tools and looking at your data. This talk is about building an agent much, much faster using simulations to hill-climb your agent configuration instead of grinding on real data. We’ll dive deep into a case study of how a top-5 fintech made their agent dev cycle 20x faster using simulation-driven optimization. We’ll cover: - When to use real data vs. simulations in agent building - How to design simulation environments tailored to your agent - How to automate the optimization loop so you’re hill climbing agent configurations without manual tuning

2:50pm-3:10pm: Compression at the Edge — Chris Alexiuk, Daniel Han, Asma Beevi, Merve Noyan, Michael Chiang

(session) [Track 4] | Track: Local AI

Compression at the Edge examines how smaller weights, faster inference, and constrained-memory deployments are making capable local AI more practical. The panel explores where compressed models already beat cloud on latency, privacy, cost, or control, what breakthroughs would unlock broader adoption, and how open model tooling is shaping the edge AI stack.

Moderator: Chris Alexiuk (NVIDIA). Panelists: Daniel Han (Unsloth), Asma Beevi (NVIDIA), Merve Noyan (Hugging Face), Michael Chiang (Ollama).

2:50pm-3:10pm: On-Device Agentic AI for the New York Times Games — Shafik Quoraishee, Joanne Song

(sponsor) [Track 5] | Track: Graphs

Traditional mobile game architectures rely on static state machines and fixed behavioral trees. Under this model, gameplay and accessibility are treated as rigid, separate systems. This results in blunt difficulty toggles, predictable character loops, and reactive features that fail to address a player's actual context. Constraint-Centric Agentic Simulation (CCAS) offers a theoretical shift. By modeling the game world as a continuous, multi-agent negotiation, accessibility and challenge become part of a single, fluid continuum.

Using the JetBrains Koog framework on Android, this session explores the theory of running local agents on consumer mobile devices. We will discuss how principles of game theory, specifically dynamic negotiation and constraint satisfaction, can be used to build systems that reason over game states. Instead of executing pre-planned scripts, these agents dynamically alter their strategies. They negotiate environmental constraints to provide emergent challenges for high-skill players or organically smooth out cognitive and motor friction points for those requiring assistance.

Running these theoretical models on edge hardware requires overcoming significant practical hurdles. We will break down the architecture needed to support this continuous adaptation without relying on cloud computation. We will cover how to manage memory footprints, compress state histories for rapid backtracking, and schedule local planning loops so they integrate flawlessly with the rendering engine.

2:50pm-3:10pm: How AI Agents Let GTM Teams Scale — Justin Joyce

(session) [Track 6] | Track: AI in GTM

How Cloudflare scaled GTM with AI agents that never touch raw data: a deterministic layer computes the numbers, agents write the narrative, and a multi-agent pipeline turns every segment into ranked signals. Justin Joyce on the build — and what skill curation and adoption actually take.

2:50pm-3:10pm: How to build an AI-Native Health Company — Dan Feng

(session) [Track 7] | Track: AI in Healthcare

Most healthcare technology companies were built for a different era. Transitioning to an AI-native organization isn't just about adopting new tools — it requires rethinking culture, processes, and how teams work at every level. This talk draws on firsthand experience leading that transformation at a digital health company. We'll cover what it takes to foster an AI-first culture across departments, and go deep on the engineering side: adopting AI-assisted development practices, building shared AI infrastructure, and evolving the product development process to unlock 2–3x productivity gains. We'll also tackle the harder, less-discussed challenge — the mindset shift required to operate effectively in a domain that's changing faster than any playbook can keep up with. Whether you're just starting this journey or already mid-transition, you'll walk away with concrete lessons on what works, what doesn't, and how to build an organization that compounds on AI rather than just experiments with it.

2:50pm-3:10pm: Realtime multiplayer, automation, and you! — Idan Gazit

(session) [Track 8] | Track: Agentic Engineering

Now that the models are powerful and the agents are capable, why are we still approaching software development as if it's the same activity that it used to be, but "faster"? GitHub Next thinks about what this future wants to be through two lenses: - Automation: intelligence allows us to automate much more than we could with heuristics alone. How should that automation work? What guardrails do we have to put in place so that our CISOs allow us to do that? - Collaboration: agents can understand anything in your codebase, but what about all the facts that are in the heads of your teammates? Whether it's corporate politics or taste, how do we get the humans to leak that context where agents can see it and use it to produce better outcomes? Realtime multiplayer tools have displaced every turn-based tool out there. What should that look like for code? It's not going to be as simple as multiple cursors. Come by to hear more about what GitHub Next is learning about the changing shape of software creation — one that allows us to build better, not merely faster. One that allows us to scale up teams, not only individuals. And one where automations buy us time for craft and polish, not slop. We were promised flying cars, instead we have fifteen terminals. Let's have a nicer future than that.

2:50pm-3:10pm: KV Cache-Aware Routing and P/D Disaggregation on Kubernetes: The Parts Public Benchmarks Don't Show — Yuchen Fama, Ashish Kamra

(session) [Track 9] | Track: Inference

We're at the inflection point between classic LLM inference and agentic inference. When we look at the agentic workloads and trace replays, many core characteristics break classic LLM serving assumptions. The most consequential: the server no longer controls its own cache lifecycle. The client does, through prompt construction, multi-turn context that grows and changes each turn.

This has downstream effects. Because context is client-determined, prefill strategy, eviction, and routing decisions move up to the scheduler layer. KV cache becomes volatile — frequent eviction and rewrite, driven from outside the engine. And latency becomes a first-class scheduling metric alongside throughput. This talk covers the open stack for LLM and agentic era inference serving: vLLM and llm-d.

We begin with the core characteristics and challenges of agentic inference, then the economics: prefill dominates cost, and cache reuse is the primary lever. We explain why KV-aware routing through a fleet-wide scheduler is the first optimization to apply, ahead of adding capacity.

Next, prefill/decode disaggregation. We separate compute-bound prefill from memory-bound decode, and examine what public benchmarks omit: the conditions under which P/D disaggregation shines, and the workload shapes that justify the added architectural complexity.

We close with GLM-5.2 and show the equivalent stack assembled in the open: cache-aware routing, P/D disaggregation, tiered KV offload, and wide expert parallelism — implemented on vLLM and llm-d.

Attendees leave with a tuning decision framework: which lever to apply first, how to read workload signals, and where additional GPUs do and don't help.

2:50pm-3:10pm: AI Agents Are Just Distributed Systems Now — Salman Munaf

(session) [Leadership 1] | Track: AI-Native Enterprises

AI agents are often described as a new kind of software, but once they move beyond chat and start calling tools, reading data, making decisions, retrying tasks, and coordinating workflows, they begin to look a lot like distributed systems. They have state. They call external services. They depend on APIs. They fail partially. They retry. They time out. They can loop. They can act on stale context. They can produce inconsistent results. And when something goes wrong, teams need logs, traces, permissions, ownership, and rollback paths just like they do with any other production system. This session will give engineers a practical way to reason about AI agents using familiar distributed systems concepts. We will break down the agent loop: planning, tool use, observation, memory, and retries. Then we will map common agent failure modes to engineering patterns teams already know, including timeouts, circuit breakers, idempotency, rate limits, least privilege, observability, and human approval. The goal is to move past the hype and treat agents like real production systems. Attendees will leave with a clear mental model for designing, debugging, and operating agents safely, especially as they become part of customer-facing products, internal developer tools, and business workflows.

2:50pm-3:10pm: Inside 847 Production Clinical AI Notes — Sebastian Fox

(session) [Leadership 2] | Track: AI Architects: AI Factories

A Series B clinical AI company had an ambient scribe in production for six months. Internal evals passed every release. A clinical team spot-checked a sample weekly and saw nothing alarming. The system had healthy NPS, expanding deployments, and the company was preparing for European market expansion. We ran a structured audit on 847 production notes. Found 127 failures across six categories. 23 were severity-critical - the kind that could directly alter a clinical decision. The team's existing LLM-as-judge had reported zero failures across the same notes. This talk is the engineering forensics of that audit. The audit setup: which production traces we sampled, how the structured failure-mode coding worked, and the reviewer protocol. The results: three dominant failure clusters - decision-status corruption (19 cases), structured omissions (34 cases), and dosage substitution (12 cases) - and the underlying generation pattern behind each. For each cluster I will show: a real anonymised trace, the eval rule that should have caught it but did not, an explanation of why the eval missed it, and the criterion that does catch it. The pattern that emerged in the data is engineering-actionable. The team had built a 20-criterion content-faithfulness eval layer. The failures lived underneath it, in a missing intent layer. We replaced the broad content layer with a five-criterion intent layer (decision status, omission impact, dosage integrity, diagnostic chain, laterality consistency). Detection rate went from 0% to 96% on the failure set. Compute cost dropped because the intent layer is cheaper to run than the content layer it replaced. You will leave with a forensics protocol for auditing your own production AI, the five intent criteria that generalise to any high-stakes domain, and the architectural pattern: build a thin intent layer, not a thick content layer.

2:50pm-3:10pm: Harness Engineering: The New Core Skill for Agentic Developers — Dru Knox

(session) [Expo Stage 1 NE]

Harness engineering is emerging as a new core competency for agentic engineers. Your job isn't writing good code, it's upgrading your codebase so that agents reliably succeed. This talk covers the core loop of harness engineering, the most common codebase modifications you'll make, and how to 10x your harness engineering efforts with Tessl's harness engineering agent.

2:50pm-3:10pm: Small Claws Are Beautiful: Edge Agents with NanoClaw, Raspberry Pi, and Graph Memory — Jeremy Adams

(session) [Expo Stage 3 SW] | Track: Expo Stage 3

2:50pm-3:10pm: The Software Factory

(session) [Expo Stage 4 SE]

In the leading engineering organizations, a single engineer now supervises teams of agents, migrations scoped for years close in weeks, and code review has become the tightest constraint in the system. The teams pulling ahead are operating a software factory: an integrated system of agents that share context across the entire SDLC. This session is a field guide to that operating model and how it runs at scale: what each stage looks like in practice, what shifts for engineers as they move from writing code to stewarding the system, and the hard truths that decide whether a factory compounds, starting with why the infrastructure you built for humans sets the ceiling on what agents can do.

3:20pm-3:40pm: How we Solved Agent Building — Andrew Qu

(session) [Main Stage] | Track: Harness Engineering

At Vercel I've built a successful AI data scientist, that has taken the load off of our data team from answering ad-hoc data queries, and fields over 1,200 unique queries a day from just internal Vercelians. I've been building and iterating on it since last september, and it's gone through over 6 different rewrites, the newest one of which has inspired us to build a new agent framework (to be teased during the talk ;) ). I'd talk about why we build agents, how we build agents, and how to build effective agents in today's world. Just prompting, to adding bespoke tooling, to embedding claude code, to file system agents, to skills-based agents, to the new agent harness framework.

3:20pm-3:40pm: The Next Medium: Why Real-Time Interactive Video Changes Everything for Developers — Ahmed Ahres

(session) [Track 1] | Track: Generative Media

Every major platform shift created a new category of developers. The web created web developers. Mobile created app developers. Now real-time interactive video models are creating a new kind of builder: one who does not render scenes or script interactions, but writes code that shapes a living world as it generates. This talk explores what it means for video to become a runtime, why this moment is happening now, and what the first generation of developers building on world models are already creating. Based on work at Reactor, where developers are shipping interactive games, robotics simulations, and real-time experiences that could not have existed 1 year ago.

3:20pm-3:40pm: The End of the Static Screen: Architecting Intent-Driven UX with Agentic Orchestration — Gus Iwanaga

(sponsor) [Track 2] | Track: Agentic Commerce

For 30 years, interfaces were designed ahead: wireframes, fixed flows, pre-built dashboards - because we couldn't make them otherwise. Three shifts changed the constraint: LLMs that reason over business context, agentic frameworks that work at production grade, and composable backends that expose a real tool surface. With all three in place, the interface stops being something you design and ships as the output of an orchestrator composing it per intent. I'll walk through the hypothesis, the architecture we're running in production for enterprise commerce, and a live demo where it all moves.

3:20pm-3:40pm: Skills are new features: Building Skill-Centric Harness for Agentic Products — Yogendra Miraje

(session) [Track 3] | Track: AI in Finance

3:20pm-3:40pm: Model Routing — Nader Khalil, Walden Yan, Tanay Varshney, Alex Atallah

(session) [Track 4] | Track: Local AI

Model Routing explores how teams decide when to use local models, open-source models, or frontier cloud systems, and why the answer is increasingly hybrid rather than one-size-fits-all. The panel digs into routing architectures, model selection strategies, stack decisions, and what still needs to improve in local AI before more workloads can move closer to the user.

Moderator: Nader Khalil (NVIDIA). Panelists: Walden Yan (Cognition), Tanay Varshney (NVIDIA), Alex Atallah (OpenRouter).

3:20pm-3:40pm: Citation Needed: Provenance for LLM-Built Knowledge Graphs — Daniel Chalef

(sponsor) [Track 5] | Track: Graphs

An LLM doesn't copy facts into your knowledge graph. It synthesizes them: entities merge across sources, and later data invalidates earlier facts. By the time your agent retrieves "patient has a penicillin allergy," the origin — an EHR record, a lab report, or something typed into a chatbot — is gone. This talk covers engineering lineage into a lossy, generative pipeline: episode-to-fact links as structural graph properties, provenance that survives entity resolution, metadata projection (tag a source once; it follows every derived node and edge), and the query semantics of filtering facts by ancestry, including mixed-trust parentage. Deletion is the inverse problem: GDPR erasure propagates back through the same derivation edges. Compliance gets an audit trail; engineers get agents they can debug instead of black boxes.

3:20pm-3:40pm: Building GTM AI Agents: Lessons from Deploying to 6,000 Users — Sait Izmit

(session) [Track 6] | Track: AI in GTM

Building an enterprise AI agent for GTM teams isn't just an LLM problem—it's a product, engineering, and adoption challenge. In this session, I'll share how we built and scaled Snowflake's internal GTM AI Assistant from MVP to a production system serving more than 6,000 employees and answering over one million questions. We'll cover how we scoped the MVP, evolved the architecture over time, balanced quality versus coverage, adopted emerging technologies like MCP, and continuously adapted as the AI landscape rapidly changed. You'll leave with practical lessons for building enterprise AI products that users actually trust and use.

3:20pm-3:40pm: Don't be data poor — Anuj Iravane

(session) [Track 7] | Track: AI in Healthcare

What do you do when the data you most need to train and evaluate on is the data you're least allowed to keep? It's a bind for anyone building AI in a high-stakes vertical: the cases that would teach your model the most — the rare, the messy, the sensitive — tend to be the ones wrapped in the tightest constraints. In healthcare it's near-absolute. PHI can't be retained, reused, or transformed, so your long-lived datasets can't contain real patient data at all. Synthetic data is the obvious escape hatch, but it has its own trap: synthetic records tend to look synthetic, and a model that passes on fake-looking data tells you nothing about the real thing. So the bar isn't generating data — it's generating data faithful enough to trust. This talk is how we got there. Ask an LLM for a full case in one shot and you get something generic and averaged-out — models are worse at inventing convincing, specific detail than you'd expect. We present our synthetic generation pipeline (and the process around it) that enabled us to create golden datasets at scale. The pipeline features a coarse-to-fine process that enriches a patients medical history layer by layer, with a human in the loop hooks to steer the narrative at each step. You'll leave with ideas on how to build your own synthetic data generation capabilities and how to build a data pipeline your domain experts actually enjoy owning.

3:20pm-3:40pm: Velocity Sickness: What Happens When Your Whole Team Gets 10x Faster — Matt Dailey

(session) [Track 8] | Track: Agentic Engineering

Learn more about Ref: https://ref.tools/ AI made writing code nearly free, and on most teams, that's quietly breaking how the team works. Individually, everyone feels ten times faster. Together, the signals point the other way: too many PRs moving in too many directions, engineers throwing away whole agent sessions and starting over ("declaring agent bankruptcy"), and critical decisions getting made inside agent chats that no one will ever see or review. There's a lot of energy, and it's all going somewhere different. I call this velocity sickness: the organizational pain that comes from individual speed. It's the engineering version of an author who ships a book a week: prolific, productive, and completely unreadable by the team that's supposed to build on it. Almost every conversation about AI coding is about making one engineer faster. This talk is about what happens to the team when all of them are. Once implementation stops being the bottleneck, the hard part isn't writing the code. It's tracking it, reviewing it, and keeping a hundred parallel decisions coherent. That's the problem eng leaders are actually being handed, and it's the one this session takes on directly. Engineering has always had three phases: plan, implement, polish. AI collapsed the middle one to almost nothing, so the leverage, and the real work, move to the decision-heavy ends. The fix isn't better prompts; it's changing what our tools treat as first-class. We have to split the decision layer from the implementation layer: humans spend their time at the decision layer, reviewing and making the choices that matter, while agents handle the implementation. That means durable, reviewable plans, not ephemeral chats. Review the decisions before you review the diff. What attendees will leave with: - A mental model for plan / implement / polish and why the decision layer is now where engineering leverage lives, plus the language to explain velocity sickness to their own team. - A concrete shift: how to pull your team's important decisions out of throwaway agent chats and into a shared, reviewable source of truth, so individual speed compounds into team cohesion instead of chaos.

3:20pm-3:40pm: Two Bugs That Hid in Plain Sight: A vLLM Debugging Detective Story — Asaf Gardin, Yuval Belfer

(session) [Track 9] | Track: Inference

Your model generates gibberish. Once every thousand prompts. High confidence scores. No crashes. No warnings. We hit this twice while building Jamba models. First: A request gets misclassified during scheduling, loads stale state from a previous prompt cache slot, and confidently generates nonsense. Second: Logprob spikes during RL training that looked like training instability-until we noticed they tracked with rollout count, then with cache size. In this talk, we'll walk through both debugging journeys-the false starts, how we instrumented vLLM to thread request IDs through the forward pass, the search for variables that change failure structure rather than magnitude, and the lesson both share: distributed inference systems fail silently. No stack trace. No sanitizer warning. Just wrong answers with perfect confidence. You'll learn how to build comparison scripts that expose logprob divergence, force memory pressure to surface rare bugs, and shrink a distributed RL training mystery into a reproducible single-script failure. Walk away knowing how to debug vLLM when it lies to you quietly.

3:20pm-3:40pm: The Signal Layer: What to Build When Anything Can Be Built — Lena Hall

(session) [Leadership 1] | Track: AI-Native Enterprises

AI has made implementation faster, cheaper, and more widely available. That changes the real bottleneck in software.

When every team can generate code, spin up agents, prototype workflows, and ship demos faster than ever, the advantage moves to a different layer: knowing what is worth building, who it is for, how people will discover it, and how the product should behave once they do.

This talk introduces the Signal Layer: the system of public signals, user intent, agent experience, distribution loops, and product judgment that helps builders decide what deserves to exist before they commit time, infrastructure, and trust to building it.

We will look at how AI changes the software lifecycle from “can we build it?” to “should this exist?” and how developers, AI engineers, and technical leaders can design products that earn adoption instead of producing impressive demos that disappear.

When anything can be built, the most valuable builders are the ones who can read signal early, shape the right experience, and build the thing users were already moving toward.

3:20pm-3:40pm: Give the Agent a Budget, Not a Token — Sachin Malhotra

(session) [Leadership 2] | Track: AI Architects: AI Factories

Every agent demo runs with a god-token. Then it ships, and someone has to explain why the helpful AI just rm -rf'd the staging database "to clean up." I run platform infrastructure at a frontier lab, and for the last year my job has partly been: let coding agents do real work against real systems, without ever having to write the postmortem. This talk is the permission model that fell out of that - not RBAC-with-extra-steps, but primitives designed for an actor that's smart, fast, tireless, and occasionally confidently wrong. The four primitives: - Asymmetric verbs - the agent can quarantine but not delete, retry but not approve, propose but not merge. The verb list is the security boundary. Stop thinking in resources, start thinking in reversible vs. irreversible actions. - Regenerating budgets - every agent identity gets N disruptive actions per window. Burn the budget, you're benched until it refills. No human-in-the-loop until the budget's gone — which means 95% autonomy with a hard ceiling on blast radius. - The undo test - if the agent can't undo it, the agent can't do it without a second key. One line, surprisingly load-bearing. - Tripwires over allow-lists - let the agent roam, but instrument the three actions that would actually hurt. Cheaper than enumerating everything safe. I'll show the ~200-line policy layer that implements all four, the failure modes each one exists to catch, and the one design I shipped that turned out to be security theater. Tool-agnostic - works whether your agent is touching CI, a database, a cloud account, or your users' files. If you're shipping an agent that does anything more than read, you'll leave with a threat model and a starting policy you can paste into your repo on the flight home.

3:20pm-3:40pm: Agent Memory Is a Solved Problem. Agent Learning Is Not. — Karthik Ranganathan, Heather Downing

(session) [Expo Stage 1 NE]

The failures that break multi-agent systems are not reasoning failures, they are handoff failures. One agent works something out and the knowledge dies in its private context, because the only thing that crosses the boundary is output. Memory made each agent better in isolation and changed nothing about what the group knows. The missing primitive is supervised promotion: a deliberate decision about which private learning is worth sharing, moved into common knowledge with the reasoning attached, so trust survives the handoff. Today a human makes that call, and promoted knowledge resolves on read, in any tool, with no retrain or reindex. Those calls are also the training signal for what comes next: orchestrator agents, trained on what matters to the people they serve, that promote on their own. This talk covers how our collective knowledge grew as we approached memory promotion, including what the first build got wrong, and a live look at it working between humans and agents.

3:20pm-3:40pm: An Interaction Is All You Need — Ivan Leo

(session) [Expo Stage 3 SW] | Track: Expo Stage 3

3:20pm-3:40pm: An AI Future Without the Lock-In — Remy Guercio

(session) [Expo Stage 4 SE]

Every organization navigating AI adoption faces the same trap: the market moves faster than any procurement cycle, no single vendor leads across model quality, interface, sandbox, and data access for more than a few months at a time, and the obvious answer of consolidating behind one platform trades short-term control for long-term lock-in. This session makes the case that the winning strategy is not picking the best walled garden. It is building a connective layer underneath all of them. Tailscale's Remy Guercio walks through the four components required for transformative AI, why vertically integrated stacks are structurally fragile, and how organizations can maintain visibility and control without betting on a single vendor's continued dominance. The second half of the session covers three new capabilities in Aperture, Tailscale's identity-aware AI gateway: Identity-Aware Universal Data Connectors (Public Alpha), which translate Tailscale network identity into scoped access to internal data sources via MCP and API endpoints; a Responsive Chat UI (Public Alpha) that gives non-technical users a mobile-friendly interface to every LLM configured in Aperture; and Sandbox Support (Private Alpha), bringing ephemeral and persistent compute environments into the same identity model. Attendees leave with a framework for evaluating AI platforms that does not depend on picking a winner, and a concrete path to deploying provider-agnostic AI tooling on infrastructure they already run.

3:45pm-4:05pm: Agents Without Code: How Skills, YAML, and Filesystems Replaced Python — Philipp Schmid

(session) [Main Stage] | Track: Harness Engineering

Six months ago, building an agent meant writing a Python class with a while loop, tool definitions in dicts, manual state management or writing custom python functions. Today, you define an agent in a YAML file, drop a SKILL.md into a folder, and deploy. This talk traces the arc from "Agent in Python" to "Agent as filesystem". You'll learn the same agent built three ways: the hard way (Jan 2025), the simple way (Oct 2025), and the zero-code way (today).

3:45pm-4:05pm: Beyond the Lethal Trifecta: Agentic Commerce on the Open Internet at Machine Speed — David Levine

(sponsor) [Track 2] | Track: Agentic Commerce

For decades, the internet has had protocols for routing, identity, encryption, payments, and commerce between people and organizations. It has never had a native way for autonomous agents to possess authority, accountability, or legal standing. On July 1, 2026 that changes. A little known law will take effect that changes the world as we know it. As AI agents move beyond the enterprise firewall, a new form of commerce is emerging. Agents can already search, negotiate, schedule, purchase, settle payments, and coordinate work across networks. But the moment they begin acting independently on behalf of people, businesses, and online organizations, fundamental questions appear: Who does this agent represent? What authority does it possess? Who is responsible when something goes wrong? How do counterparties know they can trust it? This talk explores the "Lethal Trifecta" of agentic systems: access to systems, access to networks, and autonomy. Together they create extraordinary capabilities, but they also expose a missing layer in the architecture of the internet itself. Without identity, accountability, governance, and legal standing, agentic commerce remains trapped inside enterprise walls, limited to productivity gains rather than participation in open markets. On the same day as this conference, a new legal framework takes effect that gives autonomous online organizations a registered legal existence, allowing them to hold assets, enter agreements, govern themselves through software, and operate through fleets of agents. Whether you're building agents, agent platforms, autonomous organizations, payment systems, governance systems, or the next generation of internet infrastructure, this shift has global implications, and you'll be the first to know. We'll examine the emerging trust stack for agentic commerce—identity, authority, governance, settlement, and standing—and explore what happens when agents stop acting merely as tools and begin participating as economic actors on the open internet at machine speed.

3:45pm-4:05pm: Wearing the Agent: Engineering a Family-and-Friends Personal Agent, from Group Chats to Glasses — Sai Krishna Rallabandi

(session) [Track 3] | Track: AI in Finance

Judith is a personal AI agent that has run in daily production for a year, used by more than a dozen of my family and friends across three WhatsApp group chats, Telegram, and Discord. This talk walks through how it's built, in two parts. The first part is the engineering that makes one agent safe for many people to share: a multi-tenant permission model (read-only for my mom, exec for me), a memory stack — FAISS + Neo4j + curated long-term notes — that stays useful over a year instead of bloating into noise, cron-scheduled subagents that scout and act on their own, and the guardrails it enforces on every message — redact personal info before posting to a group, never reply to the wrong person, and screen attacker-controllable text for prompt injection before acting on it. The second part takes the agent off the screen and onto a $50 pair of smart glasses. It captures what I see, describes and stores it as a running visual memory, sets destination path on maps before I get onto car, finds and tells me which aisle in the store to go to first, etc. I cover the latency budget that keeps it conversational — on-device Whisper for speech, cloud reasoning, sub-one-second round trips — and the custom neural voice it speaks in rather than stock TTS, drawn from my speech-synthesis background. Both parts are shown live, including a candid look at the pieces that don't work yet. Audience takeaways: A multi-tenant architecture for a personal agent multiple people actually share A memory design that survives real long-term use (not just a vector store) A defensive checklist for any agent that ingests untrusted text A blueprint for an ambient, vision-aware wearable interface on commodity hardware, with a real latency budget

3:45pm-4:05pm: Model Routing — Nader Khalil, Walden Yan, Tanay Varshney, Alex Atallah

(session) [Track 4] | Track: Local AI

Model Routing explores how teams decide when to use local models, open-source models, or frontier cloud systems, and why the answer is increasingly hybrid rather than one-size-fits-all. The panel digs into routing architectures, model selection strategies, stack decisions, and what still needs to improve in local AI before more workloads can move closer to the user.

Moderator: Nader Khalil (NVIDIA). Panelists: Walden Yan (Cognition), Tanay Varshney (NVIDIA), Alex Atallah (OpenRouter).

3:45pm-4:05pm: Why We Killed Our Multi-Agent Pipeline: Lessons From Pharma Commercial Intelligence — Subbiah Sethuraman, Abhilash Asokan

(sponsor) [Track 5] | Track: Graphs

Key takeaways: A practical design principle for agentic systems in regulated, high-stakes domains: derive the architecture from agent behavior, don't impose it. Concrete patterns the audience can apply this week — domain knowledge graphs as agent context, deterministic preprocessing as a complement to agentic reasoning, reference-based context management. An honest case study from production: what worked, what didn't, and the open architectural questions we're still working on. Abstract : We lead the architecture and AI engineering org behind ZS Associates' commercial intelligence platform for pharmaceutical brand teams. The product has two surfaces: a proactive alert system that delivers signal-driven intelligence packets when a brand's KPIs move, and a conversational analytics chat where business users ask ad-hoc questions. A year ago we built both surfaces as separate V1 stacks. They broke in different ways. The diagnosis was the same: we had decided on the structure before we knew what the agent actually needed. This talk is about the design principle that came out of rebuilding both — and what it produced. The architecture is derived, not designed. We stopped trying to predict what scaffolding the agent would need and started designing the system around what the agent's behavior, on real production tasks, actually demanded. Tools, context, structure, and guardrails get introduced at the points where the agent's reasoning needs them — and nowhere else. What that produced is an architecture that's smaller than V1, not bigger. A single agent owns each investigation end-to-end across both surfaces, launching parallel sub-agents when the work needs them — not according to a pre-defined topology. A pharmaceutical commercial knowledge graph — HCPs, accounts, payers, territories, brands, KPIs and the relationships between them — gives the agent the domain context it needs without prompt-engineering heroics. Statistical signal detection runs deterministically before the agent wakes up, so the agent's job is to explain signals, not find them. Raw query results stay out of the context window through a reference-pattern that lets the agent reason over data without drowning in it. Each of those decisions came from watching an agent struggle on a real task and asking what does it need here? — not from sketching the architecture in a doc and forcing the agent into it. The patterns generalize. If you're shipping agents over messy enterprise data — finance, supply chain, claims, operations — the failure modes and the fixes will look familiar. We'll close with the open questions and the pieces we haven't solved yet.

3:45pm-4:05pm: The Death of Developer Advocates — Stephanie Jarmak

(session) [Track 6] | Track: AI in GTM

Developer Advocacy is dead. Over the last decade Developer Advocates have been a key part of any devtool company. Coding agents are the customer now. Your ICP is Claude Code, Codex, and a myriad of other coding agents that are going to evaluating, using, and suggesting tools to their human counterparts, then implementing them. So what do you do about it? Pivot to "Agent Advocates". This is a similar role but with the expressed purpose of understanding how Agents experience your product and using those findings to improve the agent experience. In this talk/workshop I'll share how to evaluate the agent experience of your product, how to improve it, and how to communicate that to your team so they can change the products roadmap.

3:45pm-4:05pm: Why Your Enterprise Tech Stack Isn't Ready for AI Agents - And What to Build Instead — Christopher Lovejoy, Saul Howard

(session) [Track 7] | Track: AI in Healthcare

Agent-executed work is a new infrastructure primitive. Until you treat it that way, you're running a demo, not enterprise AI. Your existing stack was built for deterministic software. Agents reason, delegate, and make judgment calls. That distinction creates infrastructure problems most engineering teams haven't confronted: security vulnerabilities baked in by design, no audit trail, no explainability, no human-in-the-loop. At Anterior, we've deployed clinical AI agents across many of the largest US health plans, covering 50 million lives. Healthcare, with high stakes, strict regulation, deeply human workflows, exposes infrastructure gaps that exist everywhere - and makes the paradigm shift unavoidable: agent-executed work as a first-class primitive, alongside compute, storage, and APIs. We'll cover why bolting agents onto existing data pipelines fails, what infrastructure primitives are missing (and why teams don't notice until an audit), and how to architect a stack where security, compliance, and human oversight are load-bearing from day one. If you're serious about agents in any mission-critical context, this is the infrastructure conversation you need to have.

3:45pm-4:05pm: Open Source Is Dead. Long Live Open Source. — Saoud Rizwan

(session) [Track 8] | Track: Agentic Engineering

Closed model labs set take‑it‑or‑leave‑it prices, but open‑weight models force inference hosts to compete on the same models, driving costs down and shifting power back to builders instead of vendors. I’ll tell the story of how Cline went from viral open source project to a case study in AI‑generated slop, entitled PRs, and brand‑diluting forks and why, even as that old idea of open source community died, open weight models and auditable code are now the only real check we have on model pricing and control.

3:45pm-4:05pm: Weight Folding, CUDA Streams, and the Bug That Made My Model Speak Backwards — Filip Makraduli

(session) [Track 9] | Track: Inference

A talk about contributing GPU benchmarks to an open-source research paper (FlashNorm). I'll walk through the engineering journey: folding norm weights into projections, writing Triton kernels, accidentally making attention bidirectional (oops), and ultimately proving a 33-35% speedup on the norm+project operation. Practical lessons for anyone trying to optimize transformer inference.

3:45pm-4:05pm: Tell the Robot What You Want — Sandhya Subramani

(session) [Leadership 1] | Track: AI-Native Enterprises

What if you could command a robot just by talking to it?

This session introduces Strands Agents, an open-source framework that lets developers control physical sensors and actuators using natural language, by exposing hardware as programmable agent tools through a unified interface. The agent interprets the request, selects appropriate tools, and orchestrates execution. We explore a hybrid model where low-latency perception and actuation run locally on edge hardware, and higher-level reasoning and multi-step planning are delegated to cloud-based agents when needed. This preserves real-time responsiveness while enabling richer reasoning.

A live robot demonstration anchors the session. Using the SO101 robotic arm powered by NVIDIA GR00T alongside HuggingFace LeRobot, attendees see how an instruction such as “pick up the cube” moves from conversation to perception to physical action.

3:45pm-4:05pm: Taking Reinforcement Learning Cross Datacenter — Adam Azzam

(session) [Expo Stage 1 NE]

Reinforcement learning for frontier models is increasingly constrained not only by algorithms, but by where compute is available. When training and rollout generation must live inside one datacenter, the whole system becomes limited by the capacity, hardware, and failures of that single location.

Taking RL cross datacenter changes the shape of the problem. Training can happen in one place, Rollout trajectories can be generated somewhere else, and compute can be pulled from whatever cloud, region, hardware, or precision format is available. RL capacity can become global, elastic, and opportunistic rather than a carefully reserved supercomputer, more like a living system spread across the world.

This talk is about the first steps toward that future: RL that can run anywhere, learn continuously, and turn scattered compute into a single training loop.

3:45pm-4:05pm: Dashboards are Dead — Sarah Simionescu

(session) [Expo Stage 2 NW]

AX is the new UX, and how to build for agents.

4:30pm-4:50pm: Closing Keynote — Theo Browne — Theo Browne

(keynote) [Main Stage] | Track: Main Stage

4:50pm-5:10pm: Closing Keynote: Garry Tan — Garry Tan

(keynote) [Main Stage] | Track: Main Stage

5:10pm-5:30pm: Startup Battlefield — Howie Liu

(keynote) [Main Stage] | Track: Main Stage


Speakers

Total: 550 confirmed speakers

Aaron Stanley

  • Role: CISO
  • Company: dbt Labs
  • Bio: Security leader at dbt Labs. I build security organizations that help companies scale. I enable growth, accelerate engineering, and earn customer trust.
  • LinkedIn: https://www.linkedin.com/in/aastanley
  • Photo: /wf26/speakers/by-id/spk_aaron_stanley.jpg
  • Sessions:

- AI’s Jurassic Park Period — Day 2 — Session Day 1 3:20pm-3:40pm

Early in my career, I accidentally and unrecoverably changed data I was collecting for a federal investigation. Twenty years later, with the help of AI and a career’s worth of experience as a security leader, I intentionally did the same thing. Make no mistake, what my agent and I did together was dangerous. It was only because I had enough subject matter expertise in both the functional and risk issues that I could navigate it safely. We are in AI’s Jurassic Park period: no matter how clearly we define the rules, models will search for paths to completion. And they are very good at making those paths look safe, reasonable, and correct even when they violate policy or basic intuition. Designing the right control set is about allowing for the right expertise to be injected at the right time in the co-creation process so we can move quickly and safely into the next evolution.

Abduallah Mohamed

  • Role: VP of AI/ML
  • Company: AIDAChip
  • Bio: VP of AI/ML at AIDAChip, building the AI platform for semiconductor IP development, Ex-Meta. Core expertise spans agentic AI & LLM frameworks, multi-modal sensor fusion, tracking, and multi-agent trajectory prediction. PhD @UT Austin.
  • LinkedIn: https://www.linkedin.com/in/abduallah/
  • Website: https://abduallahmohamed.com/
  • Photo: /wf26/speakers/by-id/spk_abduallah_mohamed.jpg
  • Sessions:

- What If Your Chip Design Team Moved Like a Single Body? — Day 4 — Session Day 3 11:40am-12:00pm

Most agentic demos you've seen has a hidden assumption: one user, one session, one task. But what happens when the agent needs to coordinate with 30 other agents, across 10 disciplines, on a project that takes 12 months — where a single miscommunication costs $10-50M? Chip design is that problem. Only 14% of chips succeed on first silicon. The bottleneck isn't individual engineer speed — it's silent divergence between disciplines working from specs that drift without noticing. We built a multiplayer AI on the Anthropic Agent SDK, connected through three alignment layers: a living spec graph (System of Intent) that propagates changes and detects conflicts in real time, a tribal knowledge layer (Memory) that compounds methodology across projects, and milestone-aware execution that drives EDA tools with full design context. Each agent operates within strict spec-hierarchy boundaries enforced at the API level. Cross-agent invocations use structured tool calls with typed parameters, logged for full auditability. We talked with 15 practitioners across 8 major semiconductor and EDA companies. The universal finding: teams need alignment infrastructure, not faster copilots. We'll also share what broke — because coordination tax applies to AI agents too, and the failure modes are surprisingly instructive. This talk covers the multi-agent architecture, evaluation methodology, and lessons from deploying agentic AI in one of engineering's most complex coordination domains.

Abdul Dakkak

  • Role: Chief Scientist
  • Company: Modular
  • Bio: Abdul Dakkak is Chief Scientist at Modular, where he works on AI compute, GenAI performance, Mojo, kernels, framework and serving layers for Modular's platform.
  • LinkedIn: https://www.linkedin.com/in/adakkak
  • Website: https://dakkak.dev
  • Photo: /wf26/speakers/by-id/spk_abdul_dakkak.jpg
  • Sessions:

- Modular: Taming the AI Hardware Cambrian Explosion — Day 3 — Session Day 2 3:45pm-4:05pm

AI teams are hitting the same wall: the workloads they want to run require more hardware than they can reliably access. Buying more GPUs is not always possible, and rewriting kernels for every vendor is not sustainable. Meanwhile, models keep growing, SLAs keep tightening, workloads keep diversifying, and modalities keep multiplying. Modular has two answers: squeeze more performance out of the hardware you already have, and unlock far greater hardware diversity. We'll ground the talk in benchmark data and show how the Modular platform delivers 10x lower latency on image and video models like FLUX2 and 5.5x higher throughput on MoE models like Kimi K2.5, both over the state of the art. This talk explains how Modular is rebuilding the inference stack for performance portability. We'll demonstrate how Mojo kernels, the MAX compiler and runtime, and Modular Cloud work together to optimize GenAI workloads from model graph to hardware execution across NVIDIA, AMD, Apple Silicon, and CPU deployments. Along the way, we'll cover the bottlenecks that dominate production inference: memory movement, batching, KV-cache layout, quantization, scheduling, and kernel specialization. Using examples from LLM serving, we'll reveal which optimizations matter, where abstractions leak, and how to reason about performance portability in real deployments.

Abhi Arya

  • Role: Product, Software, Infra, and Applied AI
  • Company: Reducto
  • Bio: Abhi Arya works on product, software, infrastructure, and applied AI at Reducto. He previously co-founded Opennote, a YC S25 company acquired by Reducto, and has also worked on browser automation at Browserbase and mission operations software at NASA Johnson Space Center.
  • Photo: /wf26/speakers/by-id/spk_abhi_arya.jpg
  • Sessions:

- From Chatbots to Agents: How Reducto builds for Agent Experience to Enable Real Work — Day 2 — Session Day 1 3:45pm-4:05pm

Many agent demos work. Most agent systems in production don't. The gap usually isn't the model or the tools. It's everything in between: how context gets structured, how multi-step tasks stay on track, how you handle the edge cases that only show up when real scenarios from real customers hit your pipeline. At https://reducto.ai/, we've spent the last couple of months building agent-first workflows for some of the most document-heavy industries out there. We've hit most of the failure modes you're probably hitting too. This talk shares what we've learned, from how to think about Agent Experience (AX) as a design layer, to the specific decisions that make complex workflows actually reliable in production. You'll walk away with tactical approaches to structuring context, model guidance, designing recoverable workflows, and building the feedback loops that let your system improve over time without a full rebuild.

Abhilash Asokan

  • Company: ZS
  • Photo: /wf26/speakers/by-id/spk_abhilash_asokan.jpg
  • Sessions:

- Why We Killed Our Multi-Agent Pipeline: Lessons From Pharma Commercial Intelligence — Day 4 — Session Day 3 3:45pm-4:05pm

Key takeaways: A practical design principle for agentic systems in regulated, high-stakes domains: derive the architecture from agent behavior, don't impose it. Concrete patterns the audience can apply this week — domain knowledge graphs as agent context, deterministic preprocessing as a complement to agentic reasoning, reference-based context management. An honest case study from production: what worked, what didn't, and the open architectural questions we're still working on. Abstract : We lead the architecture and AI engineering org behind ZS Associates' commercial intelligence platform for pharmaceutical brand teams. The product has two surfaces: a proactive alert system that delivers signal-driven intelligence packets when a brand's KPIs move, and a conversational analytics chat where business users ask ad-hoc questions. A year ago we built both surfaces as separate V1 stacks. They broke in different ways. The diagnosis was the same: we had decided on the structure before we knew what the agent actually needed. This talk is about the design principle that came out of rebuilding both — and what it produced. The architecture is derived, not designed. We stopped trying to predict what scaffolding the agent would need and started designing the system around what the agent's behavior, on real production tasks, actually demanded. Tools, context, structure, and guardrails get introduced at the points where the agent's reasoning needs them — and nowhere else. What that produced is an architecture that's smaller than V1, not bigger. A single agent owns each investigation end-to-end across both surfaces, launching parallel sub-agents when the work needs them — not according to a pre-defined topology. A pharmaceutical commercial knowledge graph — HCPs, accounts, payers, territories, brands, KPIs and the relationships between them — gives the agent the domain context it needs without prompt-engineering heroics. Statistical signal detection runs deterministically before the agent wakes up, so the agent's job is to explain signals, not find them. Raw query results stay out of the context window through a reference-pattern that lets the agent reason over data without drowning in it. Each of those decisions came from watching an agent struggle on a real task and asking what does it need here? — not from sketching the architecture in a doc and forcing the agent into it. The patterns generalize. If you're shipping agents over messy enterprise data — finance, supply chain, claims, operations — the failure modes and the fixes will look familiar. We'll close with the open questions and the pieces we haven't solved yet.

Abhishek Bhardwaj

  • Role: Member of Technical Staff, RL & Agent Infrastructure
  • Company: OpenAI
  • Bio: Abhishek Bhardwaj works on Agent and Reinforcement Learning Infrastructure at OpenAI. He builds systems that enable large-scale model training in RL environments, as well as secure and scalable cloud sandboxes for OpenAI’s agents. Before joining OpenAI, he created Arrakis, an open-source sandbox for AI agents. Previously, he worked at Google on ChromeOS and foundational microVM technologies, and at Replit on core infrastructure and early versions of Replit Agent.
  • Twitter: https://x.com/abshkbh
  • LinkedIn: https://www.linkedin.com/in/abshkbh
  • Photo: /wf26/speakers/by-id/spk_abhishek_bhardwaj.jpg
  • Sessions:

- From fork() to Fleet: Designing an Agent Sandbox Cloud Pt 1 — Day 3 — Session Day 2 1:30pm-1:50pm

Sandboxes unleash agents by giving them secure, fully functional computers where they can tackle diverse tasks with minimal setup. This talk explores the architectural challenges of building an agent sandbox cloud. We compare runtime isolation technologies and their trade-offs, examine persistence and storage as the next major unlock for agent capabilities, and discuss the key decisions involved in orchestrating and scaling sandboxes.

- From fork() to Fleet: Designing an Agent Sandbox Cloud Pt2 — Day 3 — Session Day 2 1:55pm-2:15pm

Sandboxes unleash agents by giving them secure, fully functional computers where they can tackle diverse tasks with minimal setup. This talk explores the architectural challenges of building an agent sandbox cloud. We compare runtime isolation technologies and their trade-offs, examine persistence and storage as the next major unlock for agent capabilities, and discuss the key decisions involved in orchestrating and scaling sandboxes.

Adam Azzam

  • Role: Member of Product Staff
  • Company: Modal
  • Bio: Adam Azzam is a Member of Product Staff at Modal, a high-performance AI infrastructure platform. Before Modal, Adam was VP of Product at Prefect and maintainer of Prefect and FastMCP. He holds a PhD in mathematics.
  • Twitter: https://x.com/aaazzam
  • LinkedIn: https://linkedin.com/in/adam-azzam
  • Website: https://adamazzam.com
  • Photo: /wf26/speakers/by-id/spk_adam_azzam.jpg
  • Sessions:

- Don’t build agents, build environments — Day 3 — Session Day 2 10:45am-11:05am

We’ve largely settled on what a coding agent is: a model working in a loop, calling tools. As a result, the hard part has moved. It’s no longer the agent loop, it’s the environment around it. This talk is about the real challenges of building fast-booting, reliable, reproducible environments for coding agents at scale.

- Taking Reinforcement Learning Cross Datacenter — Day 4 — Session Day 3 3:45pm-4:05pm

Reinforcement learning for frontier models is increasingly constrained not only by algorithms, but by where compute is available. When training and rollout generation must live inside one datacenter, the whole system becomes limited by the capacity, hardware, and failures of that single location.

Taking RL cross datacenter changes the shape of the problem. Training can happen in one place, Rollout trajectories can be generated somewhere else, and compute can be pulled from whatever cloud, region, hardware, or precision format is available. RL capacity can become global, elastic, and opportunistic rather than a carefully reserved supercomputer, more like a living system spread across the world.

This talk is about the first steps toward that future: RL that can run anywhere, learn continuously, and turn scattered compute into a single training loop.

Adam Huda

  • Role: Sr Engineering Leader for AI Dev Tools
  • Company: Uber
  • Bio: Adam is a Senior Engineering Manager at Uber, where he leads the AI Developer Tools team on a mission to supercharge software engineering. Currently obsessed with manifesting ideas with Claude Code, he is a true believer that AI will be the ultimate catalyst for unlocking Starfleet.

Before the agentic wave, Adam was a trailblazer in the mobile space. He cut his teeth at Apple as the build engineer for iOS 2.0. From there, he went on to build and launch multiple app startups, including Posterous, and helped shape the early days of Twitter's iOS app.

When he’s not busy building the future of developer tools, Adam unplugs outdoors. You can usually find him sailing, making music on his handpan, or shaping the perfect wake for surfing on Lake Tahoe.

  • Twitter: https://x.com/hudaman
  • LinkedIn: https://www.linkedin.com/in/thinktopdown/
  • Website: https://adamhuda.com
  • Photo: /wf26/speakers/by-id/spk_adam_huda.jpg
  • Sessions:

- Agentic SDLC at Uber: Building Blocks for Uber's Software Factory — Day 2 — Session Day 1 11:40am-12:00pm

99% of Uber engineers are using AI every month, 70% of PRs are attributed to AI, and 15% of PRs are now done entirely by autonomous agents. In this session, we go behind the scenes to show you exactly what it takes to get there — starting with the foundational building blocks: the model gateway, MCP infrastructure, agent skills, knowledge systems, and cloud developer environments that make agentic engineering possible at scale. Then, once those foundations are in place, we show you how to assemble them into a fully agentic SDLC. We'll walk through every stage — from research and spec writing, to autonomous code generation, to verifying and validating that code before it ships, to monitoring what happens after it lands, and continuously improving it over time. With tooling example demos throughout. Whether you're just starting your agentic journey or already running agents in production, you'll leave with a concrete blueprint for what this looks like end to end.

Addy Osmani

  • Role: Director of Engineering
  • Company: Independent
  • Bio: Engineering and evangelism leader who spent over 14 years at Google leading developer experience for Chrome and Gemini.
  • Twitter: https://x.com/addyosmani
  • LinkedIn: https://www.linkedin.com/in/addyosmani/
  • Website: https://addyosmani.com
  • Photo: /wf26/speakers/by-id/spk_addy_osmani.jpg
  • Sessions:

- Closing Keynote — Day 3 — Session Day 2 4:30pm-4:50pm

TBD

Adi Singh

  • Role: Co-founder
  • Company: AgentMail
  • Bio: Co-founder of AgentMail (YC S25), the email inbox API for AI agents. The company is backed by Y Combinator, General Catalyst, Paul Graham, and founders of Ramp, Supabase, and HubSpot. Before AgentMail, Adi spent time at firms like Accel and KKR while operating software businesses across accounting, edtech, and e-commerce during his time at the University of Michigan.
  • Twitter: https://x.com/adisingh
  • LinkedIn: https://linkedin.com/in/adivirsingh13
  • Website: https://www.agentmail.to/
  • Blog: https://www.agentmail.to/blog
  • Photo: /wf26/speakers/by-id/spk_adi_singh.jpg
  • Sessions:

- The Next Trillion Users of the Internet Still Don't Have an Identity — Day 3 — Session Day 2 2:50pm-3:10pm

In the last few months, hundreds of thousands of people set up personal AI agents that send email on their behalf, manage calendars, book travel, even sign contracts - all thanks to openclaw. Most of these agents have no real identity online. They borrow a human's. The identity stack of the internet, OAuth, 2FA, KYC, magic links, was built for people sitting at a keyboard. Agents don't fit, and we've ended up with shared accounts, hard-coded credentials, and humans dragged back into every loop. I'm Adi, co-founder of AgentMail. We are building the identity layer for what we believe will be the next trillion users of the internet, and they will not be human. Across hundreds of customers, we have watched what breaks when an agent has no real address. It fails at signups. Verification codes get lost. There is no accountability when something goes wrong. The human gets pulled back in. This talk is the case for making agents first-class citizens of the internet. I'll cover the identity architecture we've shipped, the legacy industries already adopting it and making real money, and where agent identity infrastructure is going over the next decade.

Adit Abraham

  • Role: CEO and cofounder
  • Company: Reducto
  • Bio: Adit Abraham is co-founder and CEO of Reducto, building an AI document-intelligence platform for parsing, understanding, and structuring complex unstructured documents for AI applications. He previously studied computer science at MIT and worked on product at Google.
  • LinkedIn: https://www.linkedin.com/in/aditabraham
  • Photo: /wf26/speakers/by-id/spk_adit_abraham.jpg
  • Sessions:

- From Ingestion to Agents: How Leading AI Teams Build on Document Intelligence — Day 2 — Session Day 1 1:30pm-1:50pm

The agents of tomorrow are only as good as the context they reason on — yet most real-world data lives in messy, unstructured documents.

In this session, we reveal the patterns that separate AI teams shipping reliable, production-grade agents from those stuck debugging pipelines.

Drawing on patterns we've seen from AI-native startups to Fortune 10 enterprises, we'll cover what it takes to transform complex documents into clean, accurate context at scale across legal, finance, healthcare and more.

From ingestion architecture to agent-ready outputs, walk away with the strategies top teams use to turn document chaos into competitive advantage.

Aditya Gautam

  • Role: Machine Learning Lead
  • Company: Meta
  • Bio: Aditya Gautam is a seasoned AI practitioner and leader specializing in multimodal LLMs, multi-agent systems, and scalable architectures for recommendation systems. At Meta, he led Generative AI initiatives for Reels within complex domains like user interest exploration and policy understanding, architecting and training complex multimodal models and developing agentic solutions for adversarial video challenges. His work spanned end-to-end pre- and post-training workflows along with designing multi-agent solutions with optimizing engineering pipelines for large-scale production deployment. Prior to Meta, Aditya spent over three years at Google building large-scale computer vision and content understanding systems. A recognized industry voice, his work has been featured by Nasdaq and Marktechpost. He frequently speaks at major events like the Databricks Data + AI Summit, Silicon Slopes, and MLOps Summit, and serves as a peer reviewer for NeurIPS, ICML, and AAAI, focusing on the practical bridge between frontier research and production engineering.
  • LinkedIn: https://www.linkedin.com/in/aditya-gautam-68233a30/
  • Photo: /wf26/speakers/by-id/spk_aditya_gautam.jpg
  • Sessions:

- Modality Misalignment and Originality Attribution in Short-Form Video: A Multi-Agent Approach at Platform Scale — Day 2 — Session Day 1 12:05pm-12:25pm

Short-form video presents a class of content understanding problems that are qualitatively different from text or single-modality media. Audio, visual, and text signals within the same piece of content frequently diverge, sometimes incidentally and sometimes deliberately, creating a modality misalignment that defeats systems designed around any single signal. At the same time, the resharing dynamics of short-form video platforms create originality attribution chains that degrade quickly and are poorly captured by metadata alone. Addressing both problems at platform scale, reliably and under real latency and cost constraints, is the challenge this talk is built around. The core of the talk is the multi-agent architecture developed to address this, published at ACM WSDM 2025, and the reasoning behind its design. Each agent in the system is specialized for a distinct aspect of the problem: understanding what a piece of content is actually communicating across modalities, identifying where those modalities diverge meaningfully, and tracing originality through the resharing graph to surface attribution that platform metadata misses. We will cover the design principles behind this decomposition, the tradeoffs between specialization and complexity, the evaluation framework built to measure performance in a setting where ground truth is genuinely ambiguous, and the practical optimizations that made the system viable at scale. We will also be honest about the limitations: where the multi-agent approach added overhead that simpler baselines handled adequately, and what the boundaries of the system's reliability actually look like in production conditions. The broader takeaway is a set of principles for approaching multimodal content understanding problems where the signals are misaligned by nature rather than by exception. Attendees will leave with a framework for thinking about agent decomposition across a complex multimodal problem, a grounded understanding of how originality attribution degrades at scale and what it takes to recover it, and practical lessons about building evaluation and optimization pipelines for systems where the problem itself resists clean benchmarking.

Aditya Khandelwal

  • Role: MTS
  • Company: Amazon AGI Lab
  • Photo: /wf26/speakers/by-id/spk_aditya_khandelwal.jpg
  • Sessions:

- Agents, codebases, and teams: what it actually takes to ship together — Day 2 — Session Day 1 11:10am-11:30am

Using a coding agent solo is one thing. Getting a whole team to trust agent-written code, agent-run reviews, and long-running agent work is another. That's where most teams stall. This talk is about what it actually takes to get there: how to shape a codebase so agents can work in it safely, how to earn a skeptical team's trust instead of mandating it, and the failure modes that only show up once agents are part of the daily workflow.

Ahmad Osman

  • Role: Head mod
  • Company: r/LocalLLaMA
  • Bio: r/LocalLLaMA moderator and AI researcher in San Francisco; known for building a 14x RTX 3090 rig.
  • Twitter: https://x.com/TheAhmadOsman
  • LinkedIn: https://linkedin.com/in/TheAhmadOsman
  • Website: https://ahmadosman.com
  • Photo: /wf26/speakers/by-id/spk_ahmad_osman.jpg
  • Sessions:

- Local LLMs and workstation agents: Part 1 — Day 1 — Workshop Day 11:05am-12:05pm

Have you heard "Buy a GPU," "Opensource AI Must Win," or "Local AI FTW" before? This workshop will be a practical window into that confusing world and a practical map for understanding what different Local AI hardware is actually capable of and which models make sense on each class of machine.

Whether you are just getting started or already running models every day, we will demo and work through why a Mac mini, M4 Pro MacBook Pro, M5 Max MacBook Pro, RTX 5070 8GB laptop, Strix Halo box, DGX Spark, and 2x RTX PRO 6000 Blackwell machine should not be configured, benchmarked, or used the same way.

What are you trying to run? How much VRAM or Unified Memory do you actually need? When does a small machine make sense? When do you need a real GPU box? When does long context, tensor parallelism, or serving infrastructure start to matter?

This should be useful to everyone: people curious about local AI, people buying their first capable machine, people already running models, and people trying to use local inference for scalable agentic workflows.

We will close by showing how Codex can automate the boring part: give it my Inference Engine article, the hardware target, and the model of your choice, then ask it to propose the engine, environment, flags, batch settings, KV-cache settings, and benchmark and evaluation plan.

- Local LLMs and workstation agents: Part 2 — Day 1 — Workshop Day 12:10pm-1:10pm

From the guy who said "Buy a GPU," "Opensource AI Must Win," and "Local AI FTW": this session shows what you build around the models running locally so agents can actually be effective and efficient when using local models.

A local chatbot gives you private text generation. A useful agent needs a system around it: search, scraping, traces, document ingestion, agentic harness integration, and other practical components. The focus of this workshop is setup, not hardware. We will walk through the practical pieces that turn local inference from a model endpoint into the reasoning layer inside a real workflow.

The live demo target will be a 2x RTX PRO 6000 Blackwell machine running models locally and using it across different agentic harnesses. The goal is to show how Local AI can be more than private and offline: it can be useful, inspectable, controllable, and built into infrastructure you actually own.

Attendees should leave with a practical mental model for building Local AI systems that can read, search, cite, act, and evaluate themselves.

- State of the Union: Why Local, Why Now — Day 4 — Session Day 3 10:45am-11:05am

Local AI has crossed from interesting to useful, driven by stronger open models, better hardware, and a maturing ecosystem for running intelligence outside the cloud. This panel explores what that shift unlocks for sovereignty, defense, regulated industries, privacy, cost, and resilience, and why open-source AI may be central to who benefits from the next wave of intelligence.

Moderator: Nader Khalil (NVIDIA). Panelists: Joseph Nelson (Roboflow), Alex Cheema (Exo Labs), Ahmad Osman (r/LocalLLaMA).

- State of the Union: Why Local, Why Now — Day 4 — Session Day 3 11:10am-11:30am

Local AI has crossed from interesting to useful, driven by stronger open models, better hardware, and a maturing ecosystem for running intelligence outside the cloud. This panel explores what that shift unlocks for sovereignty, defense, regulated industries, privacy, cost, and resilience, and why open-source AI may be central to who benefits from the next wave of intelligence.

Moderator: Nader Khalil (NVIDIA). Panelists: Joseph Nelson (Roboflow), Alex Cheema (Exo Labs), Ahmad Osman (r/LocalLLaMA).

- Demo: GLM 5.2 on DGX Station — Frontier Intelligence Under Your Desk — Day 4 — Session Day 3 11:40am-12:00pm

Ahmad Osman shows off the power of local AI on stage, running frontier open models on a DGX Station.

Ahmed Ahres

  • Role: Head of Product & GTM
  • Company: Reactor
  • Bio: Head of Product & GTM @ Reactor. Previously was the first ever intern at Revolut, started a company backed by a16z Speedrun, built and shipped mobile games, and was a national Tennis champion.
  • Twitter: https://x.com/Boudatw
  • LinkedIn: https://www.linkedin.com/in/ahmedahres/
  • Website: https://www.ahmedahres.com
  • Blog: https://www.ahmedahres.com
  • Photo: /wf26/speakers/by-id/spk_ahmed_ahres.jpg
  • Sessions:

- The Next Medium: Why Real-Time Interactive Video Changes Everything for Developers — Day 4 — Session Day 3 3:20pm-3:40pm

Every major platform shift created a new category of developers. The web created web developers. Mobile created app developers. Now real-time interactive video models are creating a new kind of builder: one who does not render scenes or script interactions, but writes code that shapes a living world as it generates. This talk explores what it means for video to become a runtime, why this moment is happening now, and what the first generation of developers building on world models are already creating. Based on work at Reactor, where developers are shipping interactive games, robotics simulations, and real-time experiences that could not have existed 1 year ago.

Ahnaf Prio

  • Role: Senior Engineering Manager
  • Company: Best Buy
  • Bio: Senior Engineering Manager at Best Buy building next-gen, AI-driven retail experiences at scale. Previously a 2x startup co-founder and CTO. Active community leader.
  • LinkedIn: https://linkedin.com/in/ahnafy
  • Photo: /wf26/speakers/by-id/spk_ahnaf_prio.jpg
  • Sessions:

- The Agentic Commerce Stack — Day 4 — Session Day 3 2:25pm-2:45pm

Agents are already handling product discovery, cart building, and checkout — no human clicking required. But what's the protocol stack actually making this work? This talk maps the real infrastructure: MCP for tool access, A2A for agent coordination, the ACP spec (backed by OpenAI) and the UCP spec (backed by Google) — two competing approaches to standardizing the full agentic commerce lifecycle — and AP2 for agentic payments. We'll cover what each does, how they compose, and where they're still forming. Then we'll see it live — a working demo with a protocol inspector showing every tool call, task transition, and checkout event in real time. You'll leave with a clear mental model of the agentic commerce landscape and a reference implementation you can use.

Ajay Prakash

  • Role: Senior Staff Software Engineer
  • Company: Linkedin
  • Bio: Ajay is a software engineer at LinkedIn with 14 years of experience in software, building large-scale systems and AI. For the past four years, his work has shifted fully into AI: LLMs, prompt engineering, context engineering, and AI agents. He previously led AI platform and product initiatives for LinkedIn Sales Navigator. Over the past year, he's led efforts to improve the effectiveness of coding agents by connecting them to LinkedIn's internal tools and context, making them genuinely useful inside a large engineering organization. He now leads AI agent platform efforts at LinkedIn, the most interesting work of his career so far.
  • Twitter: https://x.com/ajay_prakash_ai
  • LinkedIn: https://www.linkedin.com/in/ajay-prakash-3780b132/
  • Photo: /wf26/speakers/by-id/spk_ajay_prakash.jpg
  • Sessions:

- 500 Skills, Zero Fine-Tuning: LinkedIn's Playbook for AI Agents That Actually Know Your Codebase — Day 3 — Session Day 2 11:40am-12:00pm

Everyone's building custom AI agents. We didn't. Instead, we built CAPTAIN — an MCP server that makes any off-the-shelf coding agent understand LinkedIn's entire engineering stack. The secret: a meta-tool architecture (discover → inspect → execute) and composable skills that encode tribal knowledge as executable workflows. 500+ skills later, it's used across all of LinkedIn engineering. I'll show you the architecture in 10 minutes and why context engineering beats model engineering every time.

Akele Reed

  • Role: Principal AI Engineer
  • Company: Sondermind
  • Bio: Principal AI Engineer at SonderMind, Akele Reed leads the team behind the company's conversational AI mental health feature and has served as a primary architect of its guardrails and evaluations framework. Her work sits at the intersection of AI capability and responsibility, designing systems that earn trust through rigorous safety pipelines, human expert feedback loops, and continuous oversight in one of the highest-stakes domains in healthcare. Akele is passionate about making AI trustworthy not just in theory, but in production and building the infrastructure and culture that allow engineers and clinicians alike to confidently rely on AI-powered tools. She brings over nine years of experience in applied AI and model training, including her previous role at 23andMe, and holds a Master's degree in Computer Science from Georgia Tech. Away from the screens, she enjoys hiking, beekeeping, and baking.
  • LinkedIn: https://www.linkedin.com/in/akele-reed
  • Photo: /wf26/speakers/by-id/spk_akele_reed.jpg
  • Sessions:

- Evals Driven-Development: Engineering a Mental Health AI Coach Ethically & Safely — Day 3 — Session Day 2 2:50pm-3:10pm

In the world of AI Mental Health, vibes can be dangerous with real consequences. Building Sondermind’s Mental Health AI Coach required us to invent a new playbook for Eval-Driven Development in order to balance effectiveness and safety. This session is for the builders who want to see how to handle the most difficult edge cases in the agentic world. We’ll show how we’ve built a Clinical Feedback Loop that turns human therapist insights into machine-readable evaluations in a production system with thousands of conversations. We’ll dive into: - The Ethics Engine: Building and calibrating modular guardrails that can be updated as clinical guidelines evolve. - Agentic Oversight: Why we moved from single-prompt agents to a closed-loop Supervisor/Executor/Evaluator pattern to ensure clinical adherence. - Human Oversight: How we monitor Sonder to ensure that we can improve safety and quality with clinical feedback.

Alex Atallah

  • Role: Co-founder & CEO
  • Company: OpenRouter
  • Bio: Alex Atallah is Co-founder & CEO of OpenRouter. OpenRouter provides a unified interface for accessing and routing across hundreds of AI models from many providers.
  • Twitter: https://x.com/alexatallah
  • LinkedIn: https://www.linkedin.com/in/alexatallah
  • Website: https://openrouter.ai
  • Photo: /wf26/speakers/by-id/spk_alex_atallah.jpg
  • Sessions:

- Model Routing — Day 4 — Session Day 3 3:20pm-3:40pm

Model Routing explores how teams decide when to use local models, open-source models, or frontier cloud systems, and why the answer is increasingly hybrid rather than one-size-fits-all. The panel digs into routing architectures, model selection strategies, stack decisions, and what still needs to improve in local AI before more workloads can move closer to the user.

Moderator: Nader Khalil (NVIDIA). Panelists: Walden Yan (Cognition), Tanay Varshney (NVIDIA), Alex Atallah (OpenRouter).

- Model Routing — Day 4 — Session Day 3 3:45pm-4:05pm

Model Routing explores how teams decide when to use local models, open-source models, or frontier cloud systems, and why the answer is increasingly hybrid rather than one-size-fits-all. The panel digs into routing architectures, model selection strategies, stack decisions, and what still needs to improve in local AI before more workloads can move closer to the user.

Moderator: Nader Khalil (NVIDIA). Panelists: Walden Yan (Cognition), Tanay Varshney (NVIDIA), Alex Atallah (OpenRouter).

Alex Bauer

  • Role: Co-founder
  • Company: Upside
  • Bio: Alex Bauer is co-founder of Upside, the data layer for GTM engineers. He spent 2016–2024 at Branch as the public voice of mobile attribution and deep-linking. He now builds the clean, normalized GTM data that revenue teams point Claude and Cursor at to answer "what actually happened, and did it work?"
  • Twitter: https://x.com/alexdbauer
  • LinkedIn: https://www.linkedin.com/in/alexdbauer/
  • Website: https://alexbauer.net/
  • Photo: /wf26/speakers/by-id/spk_alex_bauer.jpg
  • Sessions:

- How Juries and Librarians Can Solve GTM's AI Trust Problem — Day 4 — Session Day 3 1:30pm-1:50pm

A couple of years ago, everyone worried about AI hallucinating. We rarely hear that word anymore, but it’s just because the problem grew up. Today, your AI still doesn’t know how to say “I’m not sure.” Instead, it hands you a revenue number that’s wrong in ways that look exactly like being right.

The good news is we already solved this once, for people: you onboard a new hire so they understand your business; you put subjective, high-stakes calls in front of more than one set of eyes. This talk walks through patterns we run at Upside, including a librarian every agent consults before it acts, a jury-and-judge model for the questions a single pass can’t be trusted to answer, and knowing when the model itself is just too dumb for the job. Live demos and real failures included.

Alex Campos

  • Role: Director of Sales Partnerships
  • Company: FriendliAI
  • Bio: Alex Campos leads sales partnerships at FriendliAI, a frontier AI inference cloud focused on high-performance open-weight model serving and production inference optimization.
  • Sessions:

- Inference performance as a competitive advantage — Day 3 — Session Day 2 2:50pm-3:10pm

Most AI teams focus on model quality, but production success often comes down to inference performance. In this session, FriendliAI will explore the optimization techniques behind high-performance LLM serving, including continuous batching, speculative decoding, smart caching, and efficient GPU utilization. Learn how leading AI teams reduce infrastructure costs, improve latency, and scale inference workloads without sacrificing performance. We'll share practical insights and deployment strategies that separate experimental AI projects from production-grade systems.Whether you're an ML engineer, platform engineer, MLOps practitioner, or technical founder, you'll leave with a better understanding of how inference optimization can become a competitive advantage for your AI applications.

Alex Cheema

  • Role: CEO
  • Company: EXO Labs
  • Bio: Alex Cheema is founder and CEO of Exo, focused on decentralized and local AI infrastructure.
  • Twitter: https://x.com/alexocheema
  • LinkedIn: https://linkedin.com/in/alex-cheema
  • Photo: /wf26/speakers/by-id/spk_alex_cheema.jpg
  • Sessions:

- State of the Union: Why Local, Why Now — Day 4 — Session Day 3 10:45am-11:05am

Local AI has crossed from interesting to useful, driven by stronger open models, better hardware, and a maturing ecosystem for running intelligence outside the cloud. This panel explores what that shift unlocks for sovereignty, defense, regulated industries, privacy, cost, and resilience, and why open-source AI may be central to who benefits from the next wave of intelligence.

Moderator: Nader Khalil (NVIDIA). Panelists: Joseph Nelson (Roboflow), Alex Cheema (Exo Labs), Ahmad Osman (r/LocalLLaMA).

- State of the Union: Why Local, Why Now — Day 4 — Session Day 3 11:10am-11:30am

Local AI has crossed from interesting to useful, driven by stronger open models, better hardware, and a maturing ecosystem for running intelligence outside the cloud. This panel explores what that shift unlocks for sovereignty, defense, regulated industries, privacy, cost, and resilience, and why open-source AI may be central to who benefits from the next wave of intelligence.

Moderator: Nader Khalil (NVIDIA). Panelists: Joseph Nelson (Roboflow), Alex Cheema (Exo Labs), Ahmad Osman (r/LocalLLaMA).

Alex Hancock

  • Role: Software Engineer
  • Company: Block
  • Bio: Engineer at Block building goose. Maintainer of the Model Context Protocol (MCP) and the Agent Client Protocol (ACP).
  • Twitter: https://x.com/alexjhancock
  • LinkedIn: https://www.linkedin.com/in/alexjhancock/
  • Photo: /wf26/speakers/by-id/spk_alex_hancock.jpg
  • Sessions:

- The Universal Remote Control for AI — Day 3 — Session Day 2 3:45pm-4:05pm

Every AI agent today is effectively stranded on the machine it runs on, reachable only through custom wrappers with no industry standard way in. This talk introduces work being done on the Agent Client Protocol to add a universal remote transport: a single /acp endpoint supporting both Streamable HTTP and WebSocket, deliberately aligned with MCP Streamable HTTP so the two protocols share an approach. When you pair ACP's remote transport with MCP's own Streamable HTTP support, something powerful emerges — the agent workload becomes location-independent, free to run on a laptop, a container, or a cloud VM while any client reaches in through open, interoperable standards. No more vendor lock-in on where your agent lives or who can talk to it. Come see how two open protocols, snapped together, become the universal remote control for agent i/o.

Alex Shaw

  • Role: Member of Technical Staff
  • Company: Laude Institute
  • Bio: Alex Shaw is the creator of Harbor, a framework for evaluating and optimizing agents and language models in sandboxed environments.
  • Photo: /wf26/speakers/by-id/spk_alex_shaw.jpg
  • Sessions:

- Everything Is a Rollout — Day 3 — Session Day 2 3:45pm-4:05pm

tba

Alex Volkov

  • Role: AI Evangelist & Host of ThursdAI
  • Company: W&B from CoreWeave
  • Bio: Alex Volkov is an AI Evangelist at Weights & Biases by CoreWeave and the founder and host of ThursdAI, a weekly podcast and newsletter tracking the fast-moving AI engineering world. Each week, Alex and his crew break down new model releases, benchmarks, evals, agentic engineering patterns, API changes, open source releases, and the tools developers are actually using to build with AI. Before ThursdAI, Alex spent 20 years as a full-stack engineer and founded an AI startup, giving him a builder’s view of what matters and what is just launch-week noise. He helps AI engineers stay current without having to read the entire internet every week.
  • Twitter: https://x.com/altryne
  • LinkedIn: https://www.linkedin.com/in/alex-volkov-
  • Website: https://thursdai.news
  • Blog: https://thursdai.news
  • Photo: /wf26/speakers/by-id/spk_alex_volkov.jpg
  • Sessions:

- The Z/L Continuum: Should AI Engineers Still Read Code? — Day 3 — Session Day 2 10:45am-11:05am

At AI Engineer Europe, two of the best speakers gave directly opposite advice. Zechner: slow the f*** down, read every line your model writes. Lopopolo: code is a liability, you don't even open the IDE anymore. Both got applause. The room walked out confused. On the train back I sketched the Z/L Continuum on a napkin — a five-stop spectrum from "read the clanker code" to "what IDE?" — and the whole week clicked into place. In this talk I'll walk through the Continuum, introduce FOMAT (Fear of Missing Agent Time — coined backstage by Michael Richman), and make four arguments: the Continuum is real, your stop is per-task not per-person, model capability bends everything toward L, and FOMAT is a filter problem, not an agent problem. You'll leave with a vocabulary for the argument every AI engineer is having right now. Audience takeaways A shared vocabulary (Z, L, the five stops) for the debate splitting AI engineering teams FOMAT — name the fear so you can manage it A per-task framework for choosing where on the Continuum to operate Why capability drift makes "I'll never let it cook" a losing position over time Speaker: Alex Volkov · ThursdAI · @altryne

Alexander Embiricos

  • Role: Head of Enterprise Product
  • Company: OpenAI
  • Bio: Alexander Embiricos is the Head of Enterprise Product at OpenAI. He previously led product for Codex and worked on ChatGPT Desktop, with a consistent focus on building assistants that work alongside people in their work and personal contexts. Before joining OpenAI, Alexander founded Multi, a pair-programming startup acquired by OpenAI in 2024. Alexander is half Greek and half Malaysian and came to the United States to study Mechanical Engineering and Computer Science at Stanford University.
  • Twitter: https://x.com/embirico
  • LinkedIn: https://www.linkedin.com/in/embirico/
  • Photo: /wf26/speakers/by-id/spk_alexander_embiricos.jpg
  • Sessions:

- The Golden Age of AI Engineering — Day 2 — Session Day 1 9:25am-9:45am

TBD

Ali Adl-Tabatabai

  • Role: Founder and CEO
  • Company: Gitar.ai
  • Bio: Ali-Reza Adl-Tabatabai is founder and CEO of Gitar.ai, a developer-infrastructure company building AI agents for code review, CI analysis, and developer productivity workflows. He previously worked across developer and systems infrastructure at Intel Labs, Google, and Uber.
  • Sessions:

- While You Were Generating: The Verification Gap Nobody Talked About — Day 4 — Session Day 3 12:05pm-12:25pm

Every enterprise is asking the same question: how do we move fast with AI without breaking things? While the market chased generation — better models, faster agents, more output — a different problem was compounding quietly: nobody built the verification layer to match. The team built Gitar because they saw firsthand what happens when development velocity outpaces code quality, and AI has made that problem an order of magnitude bigger. In this session, Ali-Reza Adl-Tabatabai, formerly of Uber, Google, and Meta, now leading Gitar development inside Sonar, makes the case for why AI-native code review is the missing layer in every enterprise's agentic stack. Gitar uses agentic reasoning to review code, generate fixes, validate them against your CI, and commit to the branch. It automatically analyzes and de-duplicates CI failures, detects flaky tests, and fixes remaining build, lint, and test failures — keeping reviews moving across time zones without the back-and-forth that kills engineering throughput. As a critical layer in Sonar's multilayered, zero-trust verification platform, Gitar enables organizations to analyze syntax, data flows, logic flows, architectures, and dependencies; set and enforce standards in a consistent, auditable manner; and agentically fix issues both as agents write code and in CI workflows. Sonar intelligently sequences analysis so deterministic verification handles simpler issues first, while AI tackles the nuanced ones, reducing token costs and keeping the pipeline lean. In an agentic world, zero trust is an engineering principle: assume every line an agent writes needs to be verified, every time, at every layer.

Ali Khial

  • Role: Head of AI/ML
  • Company: G2i
  • Bio: Ali Khial is an engineering leader focused on building AI-native systems that work beyond the demo stage. He currently leads AI/ML at G2i, where he works across frontier AI evaluation, software engineering benchmarks, agentic workflows, and human-data quality systems. His current work centers on the gap between impressive AI prototypes and reliable production systems. He is especially interested in AI evaluation, data quality, tool-using applications, and the engineering practices needed to ship model-powered products in real-world environments.
  • LinkedIn: https://www.linkedin.com/in/ali-khial/
  • Sessions:

- Benchmarks: The Good, the Bad, and the Ugly — Day 3 — Session Day 2 3:20pm-3:40pm

We’ll explore the good, the bad, and the ugly of AI benchmarks: where they provide useful signal, where they create false confidence, and where data quality issues like contamination, label noise, narrow task design, and leaderboard gaming can mislead teams. The goal is not to dismiss benchmarks, but to use them better: as one part of a disciplined evaluation practice that connects model performance to real-world reliability.

Aliisa Rosenthal

  • Role: General Partner
  • Company: Acrew Capital
  • Bio: Aliisa Rosenthal is a General Partner at Acrew Capital, where she invests in the next generation of AI-native enterprise software. Recognized as a premier Go-To-Market (GTM) architect, Aliisa was the first commercial hire at OpenAI, where she served as Head of Sales and led the historic scaling of enterprise revenue from $10 million to a $10 billion run rate in just three years. Previously, Aliisa was the VP of Sales at WalkMe, guiding the company through its 2021 IPO. With a career spanning early leadership roles at Mixpanel and InVision, she has a proven track record of scaling frontier technologies into global enterprise standards. A graduate of Brown University, Aliisa is a defining voice on AI commercialization, category creation, and the evolution of the modern sales organization.
  • LinkedIn: https://www.linkedin.com/in/aliisa-rosenthal
  • Website: https://www.acrewcapital.com/team-members/aliisa-rosenthal
  • Photo: /wf26/speakers/by-id/spk_aliisa_rosenthal.jpg
  • Sessions:

- Reverse-Engineering the AI Buyer — Day 4 — Session Day 3 11:10am-11:30am

You Built the Best AI Product in the Room. Now What? The GTM Lessons Builders Skip. Aliisa decodes the commercial mistakes technical teams make most often: why enterprise procurement isn't like consumer adoption, how to design for trust and change management from day one, the difference between a pilot and a deal, and the signals that tell you a product is ready to scale vs. ready to get stuck. She's packed with war stories and counterintuitive lessons from the trenches of OpenAI.

Aman Gupta

  • Role: Principal Machine Learning Engineer
  • Company: Nubank
  • Bio: Aman Gupta is a Senior Staff Engineer at Nubank. His work focuses on AI agents and simulation-driven development for financial services.
  • Twitter: https://x.com/aman2304
  • LinkedIn: https://www.linkedin.com/in/aman-gupta1/
  • Photo: /wf26/speakers/by-id/spk_aman_gupta.jpg
  • Sessions:

- Simulation-Maxxing: How Nubank ships agents 20× faster with simulations — Day 4 — Session Day 3 2:50pm-3:10pm

You know how to build an agent - write a prompt, spec out some tools and call an LLM (or gateway). At this point, you probably also know how to build an agent that “actually works” using some combination of agent frameworks, eval tools and looking at your data. This talk is about building an agent much, much faster using simulations to hill-climb your agent configuration instead of grinding on real data. We’ll dive deep into a case study of how a top-5 fintech made their agent dev cycle 20x faster using simulation-driven optimization. We’ll cover: - When to use real data vs. simulations in agent building - How to design simulation environments tailored to your agent - How to automate the optimization loop so you’re hill climbing agent configurations without manual tuning

Ameya Bhatawdekar

  • Role: VP, Field CTO
  • Company: Braintrust
  • Bio: Ameya Bhatawdekar is VP, Field CTO at Braintrust, where he helps teams evaluate and observe production AI systems. He previously led machine learning work at Dropbox and focuses on making AI-powered features reliable through evals and observability.
  • LinkedIn: https://www.linkedin.com/in/ameyab
  • Website: http://proficient.io/ameyab
  • Photo: /wf26/speakers/by-id/spk_ameya_bhatawdekar.jpg
  • Sessions:

- Your Agent Evolved. Your Evals Didn't. — Day 2 — Session Day 1 11:10am-11:30am

Knowing which generation your agent is in, which failure modes your current evals are blind to, and what to build next is the difference between shipping with confidence and flying blind. Agent architectures have evolved through six generations; prompt, chain, ReAct loop, workflow graph, modern agent loop, AI harness. And each one quietly breaks the eval strategy of the generation before it. A prompt-quality rubric won't catch a bad tool call; a trace scorer won't catch memory poisoning. Using a single SRE incident response agent threaded through every generation, this talk shows exactly where each architecture outgrows its evals and what you need to close the gap.

Ameya Ketkar

  • Role: Software Engineer
  • Company: Uber Technology Inc.
  • Bio: Software engineer at Uber's Programming Systems Group, his research focus is program analysis, language migrations, large-scale source code mining and accelerating code reviews.
  • LinkedIn: https://www.linkedin.com/in/ameya-ketkar
  • Website: https://scholar.google.com/citations?user=6JO46GMAAAAJ&hl=en
  • Photo: /wf26/speakers/by-id/spk_ameya_ketkar.jpg
  • Sessions:

- Scaling Code Quality: Building uReview, Uber’s Multi-Agent Code Review Engine — Day 2 — Session Day 1 12:05pm-12:25pm

At Uber scale, human-only code reviews create massive bottlenecks, while generic AI tools overwhelm developers with noisy, hallucinated spam. This session explores the architecture behind uReview, Uber’s multi-agent AI code review engine designed strictly for high-precision feedback. Attendees will learn how we moved beyond monolithic prompts to build a modular pipeline featuring deep contextual ingestion, specialized domain agents, and a Generator-Verifier grader system. By enforcing strict confidence scoring and semantic deduplication, uReview filters out AI noise, shifting the focus from comment quantity to high-signal actionability and significantly reducing Pull Request cycle times. Talk Outline I. The Code Review Crisis at Uber Scale (0–3 mins) Establish the critical tension between engineering velocity and code quality, highlighting why standard AI implementations fail in massive monorepo environments. 1. The Monorepo Bottleneck: At Uber, thousands of engineers commit code daily. Relying solely on human reviewers creates a massive operational bottleneck, leading to reviewer fatigue, extended Pull Request cycle times, and inevitable missed vulnerabilities. 2. The Developer Spam Problem: Generic LLM integrations fail because they prioritize comment quantity over actionable quality. If an AI posts ten hallucinated suggestions on a diff, developers will simply mute the tool. AI must reduce cognitive load, not add to it. 3. The Signal-to-Noise Mandate: Defining the North Star for uReview. The goal is not to replace human reviewers, but to build an AI system that respects developer time by delivering high-precision, strictly verified code feedback. II. The uReview Architecture: A Modular Agentic Pipeline (3–10 mins) Detail the transition from a monolithic prompt approach to uReview’s sophisticated, multi-stage agentic workflow designed for enterprise codebases. 1. Deep Contextual Ingestion: A standard git diff is not enough. We discuss how uReview fetches extended context, integrating with our build systems to analyze surrounding functions, upstream dependencies, and class hierarchies before generating a single token. 2. Specialized Domain Assistants: Instead of a generalist model, uReview deploys independent AI agents. We route code to narrow, specialized agents—such as a Go Concurrency Analyzer, a Java Memory Leak Detector, or a Security Vulnerability Scanner—to ensure precise, domain-specific insights. 3. Hybrid Intelligence: Probabilistic LLMs cannot operate in a vacuum. We detail how uReview integrates deterministic tools, like Bazel dependency graphs and static linters, to ground AI suggestions in objective codebase realities. III. Engineering the Trust Layer (10–17 mins) Dive into the verification phase. This is the core engineering that filters out AI noise and ensures uReview maintains developer trust. 1. The Generator-Verifier Pattern: Implementing a Grader Model architecture. A primary agent generates code suggestions, but a secondary, high-reasoning model audits those suggestions against strict coding guidelines to catch hallucinations before they reach the PR. 2. Confidence Scoring and Suppression: We assign a numerical confidence score to every generated comment. If a comment falls below our calibrated threshold, uReview silently drops it. We explore the engineering behind suppressing low-confidence outputs to prevent tooling spam. 3. Semantic Deduplication: Technical strategies for merging overlapping warnings. If a deterministic static analysis tool and an LLM agent flag the same null pointer exception, uReview merges them into a single, concise developer instruction. IV. Operationalizing uReview at Scale (17–20 mins) Conclude by discussing the long-term governance, feedback loops, and measurable impact of running an AI review engine in production. 1. The Telemetry Feedback Loop: We embedded Useful and Not Useful rating buttons directly into the developer UI on every uReview comment. We discuss how this telemetry flows back into a curated data lake, driving continuous Reinforcement Learning from Human Feedback and prompt refinement. 2. Shifting Success Metrics: Why organizations must abandon vanity metrics like total comments posted. We measure uReview’s success through Actionability Rate (the percentage of AI comments accepted as commits) and the reduction in Mean Time To Merge.

Amit Desai

  • Role: Director, Voice & Assistant AI
  • Company: Roku
  • Bio: Amit Desai is a domain expert in voice AI assistants who has led voice AI products at Alexa and Roku, founded startups in customer support AI, and created Top 5 mobile apps in the App Store. He works at the intersection of voice-interface intuition and AI technical approaches, with a current focus on safer voice interfaces for AI assistants, wearables, robotics, and vehicles.
  • LinkedIn: https://www.linkedin.com/in/amit-v-desai/
  • Photo: /wf26/speakers/by-id/spk_amit_desai.jpg
  • Sessions:

- Act, Confirm, or Stop? Smarter behavior for AI assistants, wearables & robots — Day 2 — Session Day 1 3:45pm-4:05pm

Voice is our favorite way to command AI assistants and robots — and it is error-prone. The industry's reflex is to chase accuracy, but accuracy is only one knob: we can control system behavior in other ways to increase user satisfaction.

This talk shifts the lens from accuracy to user outcomes. Give the AI agent more than one move: besides acting, let it stop, reject, confirm, clarify, or disambiguate. The question stops being "how often are we right?" and becomes "what does each outcome cost the user?" Bad outcomes are not equally bad to users — so price them relatively, then have the AI system minimize that user cost. Call it OUCH: Outcome User Cost Heuristic; we optimize system behavior to minimize the OUCH. Same accuracy, lower user cost, greater user adoption.

We will walk through practical AI assistant examples illustrating this approach, then show how the same framework extends across AI environments — smart speakers, TVs, glasses, embodied AI, robots, wearables, and vehicles — by repricing outcomes and swapping the confirmation UI.

Why this matters now: the cost of voice-command errors is escalating as we move into AI assistants and embodied AI, where wrong actions can be more expensive and dangerous. Mainstream voice adoption will not come from chasing accuracy alone; we need systems to price in the cost of being wrong.

Amit Navindgi

  • Role: Senior Staff Software Engineer
  • Company: Zoox
  • Bio: Amit Navindgi is a Senior Staff Software Engineer and AI lead at Zoox, where he founded and leads Zoox Intelligence, a company-wide initiative applying Large Language Models across engineering, operations, customer support, autonomy, and employee productivity. His work combines platform engineering with organizational AI adoption. He architects internal AI platforms, agents, and developer productivity workflows, while also leading AI tool evaluation, rollout strategy, enablement, spend management, and productivity measurement across Zoox. Amit also runs the Zoox Hackathon and The Assembly, a cross-functional forum for knowledge sharing and innovation. Earlier in his career, he built web applications and distributed systems at Veritas Technologies and worked on Natural Language Processing at Xerox Research Centre Europe.
  • Twitter: https://x.com/amitnavindgi
  • LinkedIn: https://www.linkedin.com/in/amitnavindgi/
  • Photo: /wf26/speakers/by-id/spk_amit_navindgi.jpg
  • Sessions:

- From Self-Driving Monorepo to Self-Driving Cars — Day 3 — Session Day 2 3:20pm-3:40pm

AI coding agents promise massive productivity gains, but realizing that promise at scale requires more than just tools. In this talk, I’ll share how we approach AI adoption at Zoox, including: - Designing a monorepo-friendly ecosystem of agents, tools, and workflows - Driving adoption through enablement, hackathons, and internal platforms - Defining and tracking meaningful productivity metrics beyond hype - Managing token spend and aligning it with business outcomes - Structuring Skills, CLIs, MCPs, and Plugins to scale across teams The goal is simple: turn AI from an experiment into a reliable, measurable, and scalable engineering capability.

Anant Srivastava

  • Role: Principal Technologist - Data and AI Platforms
  • Company: Oracle
  • Bio: Anant Srivastava is a Principal Technologist for Data and AI Platforms at Oracle, focused on modern data architecture and AI platform decisions for production AI systems.
  • LinkedIn: https://www.linkedin.com/in/anantds
  • Photo: /wf26/speakers/by-id/spk_anant_srivastava.jpg
  • Sessions:

- Prompt, Memory, Weights: The Architecture Decisions Most AI Teams Make by Accident — Day 3 — Session Day 2 12:05pm-12:25pm

The interesting engineering in production AI isn't in the model. Your knowledge lives in files, databases, and APIs: docs, runbooks, conversations, code. The model just reads tokens. So the real architectural question is which path that knowledge takes to inference: into the prompt directly, into memory for retrieval on demand, or into the weights through fine-tuning. Most teams treat these as a ladder. Start with prompts, escalate to RAG, eventually fine-tune, as if each step is a more advanced version of the last. The field is converging on a different answer: they solve different problems. The prompt shapes behavior and constraints. Memory grounds the model in current, citable knowledge. Weights harden specialized reasoning and format. They're not substitutes you graduate between; they're complementary, and the failures come from using one to do another's job. Fine-tuning to teach the model facts it should have retrieved is the classic trap: you bake in knowledge that's stale the day it ships, and you still can't cite it. This is an opinionated take on all three: when each is the right call, when each is a trap, and the part most teams never build, the circulation between them. Memory that captures what the agent does becomes the dataset you fine-tune on; fine-tuning changes what's worth retrieving; the loop compounds. Get the three paths right and they stop being a pipeline you climb and start being an architecture that learns.

Anders Swanson

  • Role: Developer Evangelist
  • Company: Oracle
  • Bio: Anders Swanson is a Developer Evangelist for Oracle Database. He helps developers build modern applications with Oracle Database, including microservices, event-driven systems, cloud-native architectures, vector databases, and AI database features.
  • Photo: /wf26/speakers/by-id/spk_anders_swanson.jpg
  • Sessions:

- From Context to Memory: Your Agents Need a Real Memory Layer — Day 2 — Session Day 1 3:20pm-3:40pm

Most agents don't really have memory. They have a context window, a pile of temporary files, maybe an AGENTS.md, and a retrieval step that attempts to build state from whatever the model can still see. You've seen the flashy demos, but these systems fall apart when an agent needs to recover from failure, revisit prior work, and observe if failures are less frequent over time. This talk explores agent memory as a systems problem. Effective memory isn't just storing data: it's an evolving knowledge layer with write filtering, consolidation, reflection, and forgetting. Agents need persistence, and they also need structure. Raw logs and Markdown scratchpads aren't enough. A real memory layer weights recency, combines retrieval techniques, and correlates episodic memories. Serious agent memory is inherently multi-model. The best systems use full-text search, semantic retrieval, graph relationships, and structured state to reconstruct context with far more precision than filesystem grep alone. This is where databases become essential as the foundation for real memory. Memory shapes how agents behave, adapt, and improve over time.

Andreea Pleşea

  • Role: Co-Founder and COO
  • Company: Druid AI
  • Bio: Andreea Pleşea is Co-Founder and COO of Druid AI, where she helps design and scale enterprise agentic AI systems. She has a technical background including a PhD in artificial intelligence, with work focused on AI agent communication and autonomous agent interoperability.
  • Photo: /wf26/speakers/by-id/spk_andreea_ple_ea.jpg
  • Sessions:

- Would your AI agent get the job? A performance review framework for enterprise agents — Day 2 — Session Day 1 11:40am-12:00pm

There are dozens of ways to build an enterprise AI agent: agentic frameworks, direct LLM APIs, conversational AI platforms, vertical SaaS. They all claim to do the job. But how do you actually compare them on the same task, with the same data, against the same KPIs? This session presents a vendor-agnostic evaluation framework that treats AI agents the way enterprises treat new hires: set the role, define success criteria, run candidates through identical scenarios, and measure outcomes. The architecture uses any LLM to track positive and negative drift across agents against weighted goals, monitoring everything from hallucination rates and token consumption to user sentiment and conversation quality. Inputs are standardized. Outputs are both quantitative (accuracy, cost, hours saved) and qualitative (tone, clarity). The methodology supports continuous evaluation, not just pre-deployment benchmarks, but ongoing performance reviews that can compare agent work against human baselines. Walk away with a concrete, repeatable process for answering the only question that matters: which agent actually does the job?

Andrei Bocan

  • Role: Principal Engineer
  • Company: Atlassian
  • Bio: Andrei Bocan is a Principal Engineer at Atlassian and a frequent speaker on GraphQL, schema evolution, and platform/API architecture.
  • LinkedIn: https://www.linkedin.com/in/andrei-bocan
  • Photo: /wf26/speakers/by-id/spk_andrei_bocan.jpg
  • Sessions:

- The best SDLC is the one you build yourself: Why orchestration changes everything — Day 1 — Workshop Day 9:00am-11:00am

Industry research shows AI productivity gains have plateaued at 10–15% — because today's tools only optimize the 20% of a developer's day spent writing code. The real bottlenecks are left and right of code: planning, orchestration, review, and operations. We'll also explore the value of AI-powered code reviews - from establishing code standards that AI can seamlessly enforce, to triggering agentic pipelines that autonomously fix issues. Join Atlassian's Shane Wolf and Andrei Bocan for a hands-on deep dive into the AI-native SDLC. In this workshop, we'll move past single-player copilots and show you how Atlassian is turning Jira into an AI-native orchestration layer for the entire software development lifecycle. Then, we'll go further. You'll learn how to build custom automations that chain these capabilities together, transforming your Jira board into an agentic software factory where humans set intent and agents execute.

Andrew Dai

  • Role: Co-founder and CEO
  • Company: Elorian
  • Bio: Andrew Dai spent 12 years as a Research Scientist at Google Brain and DeepMind. He wrote the 2015 paper that OpenAI later cited as the original recipe for ChatGPT, was a core Lead on Gemini, GLaM, and PaLM 2, and his published research has accumulated over 67,000 citations. Now, he leads Elorian, a company building AI systems that understand the visual medium and apply reasoning the way humans do. Elorian recently launched with $55M at a $300M valuation, backed by Menlo Ventures, Altimeter, Striker Venture Partners, NVIDIA and Jeff Dean.
  • Twitter: https://x.com/andrewdai
  • LinkedIn: https://www.linkedin.com/in/andrewdai/
  • Photo: /wf26/speakers/by-id/spk_andrew_dai.jpg
  • Sessions:

- The Best Models Still Reason Like Toddlers — Day 2 — Session Day 1 1:55pm-2:15pm

Frontier AI models score 80–90% on standard benchmarks like RKGI, yet when tested on visual tasks any 3-year-old handles effortlessly (like counting objects in an image), those same models fall to pieces. I watched this gap widen firsthand during my 14 years at Google Brain and DeepMind, where I co-led development on GLaM, PaLM 2, and Gemini. The problem is that most models hit high RKGI scores not through genuine visual understanding, but by coding – a workaround that scores well and reveals little. Strip that away and you're left with systems that struggle to solve a simple crossword puzzle, identify what's the same or different across two images, or navigate a basic 3D view. These tasks are essential to achieve human-level reasoning capability. And the current benchmark ecosystem wasn’t built to evaluate for it, leaving us with top scoring models that can’t even follow along with Count Von Count. In this talk I'll dig into why the current eval landscape systematically overstates capability, the structural reasons it does so, and how we got here from the viewpoint of someone who was inside a leading frontier lab. I'll close with what I believe a more rigorous, consensus-driven eval framework needs to look like, and why the field needs to build one before the next generation of visual systems ships into the real world. Fixing visual reasoning starts with fixing how we measure it. For engineers building on top of these models today, whether that's document understanding, robotic perception, medical imaging, or any system where visual perception context matters, the cost of getting this wrong is already showing up in production.

Andrew Garvin

  • Role: Cofounder of Metronome
  • Company: Stripe
  • Bio: Andrew Garvin is co-founder of Metronome, now part of Stripe. Andrew began his career at Peter Thiel’s hedge fund, working with Palantir in the early days, and then built his career as a startup and venture CFO in the Founders Fund network.
  • LinkedIn: https://www.linkedin.com/in/agarvin/
  • Photo: /wf26/speakers/by-id/spk_andrew_garvin.jpg
  • Sessions:

- How to avoid disaster when vibe-coding a billing engine — Day 3 — Session Day 2 11:10am-11:30am

This talk covers what that infrastructure looks like in practice: which primitives matter, where the human checkpoints belong, and what changes when your billing system needs to be legible to machines instead of configured by humans clicking through a UI. When building AI products, billing and pricing should be directly tied to the products themselves. They're in the hot path. Every token, every agent action, every inference is a billable moment, and if your entitlement checks aren't keeping up, a single runaway agent can rack up thousands of dollars in seconds with no one to send the bill to. Get metering wrong and you're either eating costs or overcharging customers. Get ledger consistency wrong and your invoices don't add up. Get tax wrong across 47 jurisdictions and you find out from a regulator, not a user. Here's the thing, though — agents are legitimately good at billing strategy. They can pick pricing models, configure plans, run simulations, and iterate on packaging way faster than a human team could. You want them doing that work. But proration, multi-currency, revenue recognition, tax — this stuff took the industry years to get right, and it's unforgiving when you get it wrong. The question then becomes not whether agents should be making billing changes, it's what they should be operating on when they do. Agents need tight, composable building blocks where the correctness is already baked in, human-in-the-loop checkpoints before anything irreversible goes out the door, and sandbox environments where they can experiment freely without torching production. That's the architecture that lets you move fast on pricing without waking up to broken invoices. Target audience: Engineers and technical founders building AI products that charge for usage — whether that's per-token, per-action, or per-seat with consumption overages. If you've ever hard-coded a pricing tier, duct-taped metering onto an existing system, or wondered how your billing setup is going to survive your next pricing change, this talk is for you. Audience takeaways: - A clear understanding of why billing for AI products sits in the hot path — and what specifically goes wrong when metering, entitlements, or ledger consistency can't keep up. - A practical architecture for making billing agent-operable: composable primitives with correctness baked in, human-in-the-loop checkpoints on irreversible actions, and sandbox environments for safe experimentation. - A framework for deciding where agents should be empowered to move fast on billing strategy and where guardrails need to be non-negotiable.

Andrew Orobator

  • Role: Senior Software Engineer
  • Company: Reddit
  • Bio: Andrew Orobator is a senior Android engineer at Reddit and the author of the Vibe Engineering series, a ten-part methodology for AI-assisted software development covering personas, reusable skills, worklogs, agent workflows, and self-driving codebases. He co-authored the series with Claude using the same practices it describes, treating AI not as a autocomplete layer but as a collaborative engineering system with memory, process, and taste. Andrew has spent over a decade building Android products at scale, with experience across consumer apps, developer tooling, and mobile architecture. His current work explores how AI agents can move from ad hoc prompting into durable engineering infrastructure: systems that preserve context, improve through feedback loops, and help teams ship better software with less coordination drag. At AI Engineer World’s Fair, he brings a practitioner’s view of what it takes to make AI-assisted development feel less magical, more reliable, and actually useful.
  • Twitter: https://x.com/aorobator
  • LinkedIn: https://www.linkedin.com/in/andrew-orobator/
  • Website: https://medium.com/@andreworobator
  • Blog: https://medium.com/@andreworobator
  • Photo: /wf26/speakers/by-id/spk_andrew_orobator.jpg
  • Sessions:

- Spin at the Gate Until Green: The Engineering Primitives Behind Self-Driving Codebases — Day 2 — Session Day 1 1:30pm-1:50pm

Most AI-assisted development fails the same way: the AI produces plausible output, the human can't tell if it's right, so they check manually, find the problem, re-prompt, and repeat. This loop doesn't scale. There's a different approach. If you can express correctness as a binary — does it compile, do the tests pass, does the lint check clear — you can remove the human from that loop entirely. The AI submits. The gate checks. If red, it adjusts and resubmits. Spin at the gate until green. This talk covers the engineering primitives that make this possible: personas (consistent behavior at the agent level), skills (composable, reusable prompt modules), worklogs (accountability across sessions), postmortems (turning failures into constraints), and spec-driven development (making the target explicit enough for a machine to hit it). The culmination is a flag lifecycle agent — triggered by a cron job, cleaning up stale feature flags, verified by compile + test + lint, no human in the loop. Not hypothetical. Working prototype, proven in practice. I co-authored a ten-part series on this methodology with Claude. The series was built using the workflow described in this talk. If you don't trust the theory, the fact that this talk exists is the proof.

Andrew Qu

  • Role: Chief of Software
  • Company: Vercel
  • Bio: Andrew is the Chief of Software at Vercel, where he leads the company's agent initiatives across product, infrastructure, and internal tooling in the Office of the CTO. He's the creator of skills.sh, the most popular way to discover and install new agent skills, and is building "an agent on every desk" inside Vercel. The most prominent so far is a data science agent that fields 2,000+ questions a day from Vercelians across engineering, finance, and go-to-market. Before Vercel, Andrew founded a Series B AI sales-tech company, and has worked at Meta and early-stage startups alike.
  • Twitter: https://x.com/andrewqu
  • LinkedIn: https://linkedin.com/in/andrew-qu
  • Website: https://andrewqu.com
  • Blog: https://andrewqu.com
  • Photo: /wf26/speakers/by-id/spk_andrew_qu.jpg
  • Sessions:

- How we Solved Agent Building — Day 4 — Session Day 3 3:20pm-3:40pm

At Vercel I've built a successful AI data scientist, that has taken the load off of our data team from answering ad-hoc data queries, and fields over 1,200 unique queries a day from just internal Vercelians. I've been building and iterating on it since last september, and it's gone through over 6 different rewrites, the newest one of which has inspired us to build a new agent framework (to be teased during the talk ;) ). I'd talk about why we build agents, how we build agents, and how to build effective agents in today's world. Just prompting, to adding bespoke tooling, to embedding claude code, to file system agents, to skills-based agents, to the new agent harness framework.

Ang Li

  • Role: CEO
  • Company: Simular
  • Bio: Ang Li is the CEO and cofounder of Simular, the autonomous computer company. Simular builds the full-stack infrastructure for AI agents that use computers like humans do. It was the first to surpass human-level performance on the OSWorld computer-use benchmark with its open-source Agent S framework, which won Best Paper at the ICLR 2025 Agentic AI Workshop. Simular's flagship product, Sai, is a general-purpose autonomous computer in the cloud that operates any software the way a person does.

Ang was formerly a Research Scientist at Google DeepMind, working at the frontier of continual learning, large-scale deep learning, and ML infrastructure. He has published over 50 papers in top AI venues and his work and company have been featured in Forbes, TechCrunch, and VentureBeat. Simular is backed by Felicis and NVentures. His mission: to solve autonomy and enhance lives.

  • Twitter: https://x.com/angli_ai
  • LinkedIn: https://linkedin.com/in/angli-ai
  • Website: https://angli.ai
  • Photo: /wf26/speakers/by-id/spk_ang_li.jpg
  • Sessions:

- The Autonomous Computer: Full-stack Infrastructure for Computer Use Agents — Day 1 — Workshop Day 4:30pm-5:30pm

Even the world's best computer-use agents cannot repeat their successes at the moment. Agents that write code — emitting structured selector-based actions instead of clicking pixels — break through that ceiling. We'll share two years of experience from Simular's production agent platform, the architectural decisions that mattered (refs over pixels, code as substrate, Simulang DSL), and a live demo: a 30-step unattended Windows workflow, side-by-side with a vision-only baseline. If you're shipping agents to real users, this is the playbook.

Angela Jiang

  • Role: Head of Product, Claude Platform
  • Company: Anthropic
  • Bio: Angela Jiang is the Head of Product for the Claude Platform at Anthropic. She leads product for the Claude Platform including model APIs, hyperscaler integrations, agentic infrastructure, and connectivity controls for businesses as well as Anthropic’s own product infrastructure. Before joining Anthropic, she was Head of Product for the API Platform at OpenAI and led embedded payments at Stripe.
  • Twitter: https://x.com/angjiang
  • LinkedIn: https://www.linkedin.com/in/angelajiang/
  • Photo: /wf26/speakers/by-id/spk_angela_jiang.jpg
  • Sessions:

- Tokens Should Have Jobs — Day 4 — Session Day 3 10:45am-11:05am

Anil Nadiminti

  • Role: Sr Solutions Architect
  • Company: Amazon Web Services (AWS)
  • Bio: Anil Nadiminti is a Senior Solutions Architect at AWS, where he supports Enterprise FinTech and Web3 customers in designing secure, scalable, and production-ready cloud architectures. He also specializes in Agentic AI on AWS, advising customers on AI architecture patterns, autonomous workflows, and emerging application design models. His work sits at the intersection of financial services, Web3, and AI, with a focus on helping organizations evaluate new approaches to machine-to-machine commerce and programmable services. He is particularly interested in emerging standards such as x402, which uses HTTP 402 to enable programmatic payments for APIs, services, and AI agents over standard web infrastructure. Through his work with customers, Anil helps bridge technical strategy and practical implementation for next-generation applications on AWS.
  • Twitter: https://x.com/super_intel_bot
  • LinkedIn: https://www.linkedin.com/in/nadiminti
  • Photo: /wf26/speakers/by-id/spk_anil_nadiminti.jpg
  • Sessions:

- When AI Agents Pay and Sellers Monetize: Building x402 Apps for Agentic Commerce on AWS — Day 4 — Session Day 3 11:40am-12:00pm

As Agentic AI moves from chat to execution, autonomous agents need a native way to discover, access, and pay for digital services in real time. This session explores how x402 can turn HTTP into a payment-aware interface for machine-to-machine commerce, unlocking crypto-native patterns like programmable access, pay-per-use APIs, and on-demand monetization for data, tools, and services. We’ll show how to build x402-enabled applications and walk through the architecture, the full agentic payments flow, seller monetization strategies, payment verification, and design tradeoffs involved in making agent-driven transactions secure, scalable, and production-ready. Attendees will leave with practical patterns for building apps where AI agents do not just call APIs — they can discover services, evaluate costs, transact autonomously, and enable new revenue models for sellers.

Anirban Chatterjee

  • Role: Head of AI Strategy & Partnerships
  • Company: Sonar
  • Bio: Anirban Chatterjee leads AI strategy and partnerships at Sonar, working at the intersection of AI product, go-to-market, and developer code quality as software teams adopt AI agents.
  • Photo: /wf26/speakers/by-id/spk_anirban_chatterjee.jpg
  • Sessions:

- Guide, Verify, Solve: The Engineering Discipline Agentic Development Demands — Day 4 — Session Day 3 11:40am-12:00pm

Agentic development is not a productivity story: it's a reliability engineering problem at a scale most teams have never faced. Long-running agent tasks fail at alarming rates, pull requests have grown from 50 lines to 5,000, and cognitive surrender is real—the more capable AI output appears, the less humans interrogate it, right at the moment the stakes are highest. Independent, peer-reviewed research from Carnegie Mellon studying 807 open source projects found that AI agent adoption caused a persistent 30% increase in code analysis warnings and a 41% increase in complexity — with long-term development velocity declining as a result. Agents don't just write code faster, they accumulate debt faster, too. The answer is not to slow agents down, it's to govern and refine the loop they operate inside. Sonar's Agent Centric Development Cycle (AC/DC), defines that loop across three continuous stages: guide agents with project-specific context and constraints before a single line is written; verify rigorously and continuously inside the loop, not downstream in CI; and solve issues automatically before they ever reach a manual review. The deeper insight is that this is not primarily a security story. It's an efficiency story. Codebases riddled with complexity make agents slower, less reliable, and significantly more expensive to run. Every token spent navigating legacy debt is a tax on every future agent run. Well-maintained, low-complexity codebases mean fewer failures, fewer tokens, and faster iteration. The teams that instrument this loop now will do more than ship safely: they'll compound their advantage every time an agent touches their codebase. Verification isn't a cost center. In an agentic world, it's a competitive moat.

Ankit Jain

  • Role: Founder & CEO
  • Company: Aviator
  • Bio: Ankit Jain is a founder and CEO of Aviator, a developer productivity platform used by modern engineering teams to ship AI-generated code at scale — without the review bottlenecks, broken builds, or brittle deployments. He also leads The Hangar, a community of senior engineers and engineering leaders focused on developer experience, and Xoogler, the ex-Google alumni network.
  • Twitter: https://x.com/ankitxg
  • LinkedIn: https://www.linkedin.com/in/ankitjaindce/
  • Photo: /wf26/speakers/by-id/spk_ankit_jain.jpg
  • Sessions:

- How to Kill the Code Review — Day 3 — Session Day 2 11:40am-12:00pm

Human-written code died in 2025. Code review is dying in 2026. Teams with high AI adoption are merging 98% more pull requests, but PR review time has surged 91%. There is no way we win this fight with manual code reviews, and AI code review tools are just buying us time. This talk makes the case that the traditional code review is a historical approval gate that no longer fits the shape of modern software development. I'll walk through a practical five-layer trust model: from multi-agent competition and deterministic guardrails to spec-driven BDD and adversarial verification — that lets engineering teams ship faster without sacrificing quality or control.

Ankur Duggal

  • Role: Solutions Architect
  • Company: Arize AI
  • Bio: Ankur Duggal is a Solutions Architect at Arize AI, where he helps enterprise teams make AI agents and applications reliable in production. His work includes tracing agent decisions, implementing evaluations, and building feedback-driven workflows for agentic systems.
  • Sessions:

- Let your agent cook: using skills to evaluate and improve your app — Day 1 — Workshop Day 1:15pm-2:15pm

Anna Spysz

  • Role: Developer Relations Engineer
  • Company: Stripe
  • Bio: Anna is a Developer Advocate at Stripe based in Portland, Oregon. Before switching to developer relations, she spent nearly a decade as a software engineer, primarily in the serverless and devtools space. As a Frontend Engineer at AWS, she helped build products simplifying the developer experience. Before switching careers into tech, she also spent a decade working as a writer, translator, and tech journalist. She is passionate about making modern application development accessible to users at all levels, particularly beginners and those from non-traditional backgrounds.
  • Twitter: https://x.com/annaspies
  • LinkedIn: https://www.linkedin.com/in/annaspysz
  • Website: https://annaspysz.com/
  • Photo: /wf26/speakers/by-id/spk_anna_spysz.jpg
  • Sessions:

- Teaching agents to pay — Day 4 — Session Day 3 1:55pm-2:15pm

With a global daily user base in the hundreds of millions, AI agents are rapidly becoming a primary interface for how people discover, evaluate, and purchase products. Enabling those products to be listed and paid for directly through agents opens an entirely new - and enormous - commerce channel. The Agent Commerce Protocol (ACP) and Shared Payment Tokens provide a secure framework for agent-driven commerce within Stripe’s ecosystem - without exposing payment data or sacrificing user control. This session walks developers through the complete implementation: setting up Stripe integration, creating permission-based payment tokens, interacting with ACP endpoints, and designing trustworthy user experiences. You'll learn how to enable your agents to transact safely and predictably, handling everything from checkout flows to error scenarios and webhook events.

Annabell Schäfer

  • Role: Growth Engineer
  • Company: Clickhouse
  • Bio: Annabell Schäfer is a Growth Engineer at Langfuse, the open source LLM observability platform. She works at the intersection of building and teaching, shipping AI tooling that makes Langfuse more accessible to agents while helping engineering teams build stronger mental models for AI development. Before Langfuse, she was a Founding AI Product Specialist at REMATIQ and did GenAI product-architecture research at UC Berkeley.
  • Twitter: https://x.com/annabellschfr
  • LinkedIn: https://de.linkedin.com/in/annabell-schaefer
  • Photo: /wf26/speakers/by-id/spk_annabell_sch_fer.jpg
  • Sessions:

- Continuously improving agents with Langfuse — Day 1 — Workshop Day 1:15pm-2:15pm

Join us for a hands-on Langfuse workshop where we'll show you how to observe, debug, and improve your AI applications, step by step, using a real sample app. Bring your questions and discover how Langfuse can level up your specific use cases!

Antje Barth

  • Role: Member of Technical Staff
  • Company: Amazon AGI Lab
  • Bio: Member of Technical Staff at Amazon AGI, AI product leader, keynote speaker, and O'Reilly author. She also co-instructed Generative AI with Large Language Models with DeepLearning.AI.
  • Twitter: https://x.com/anbarth
  • Photo: /wf26/speakers/by-id/spk_antje_barth.jpg
  • Sessions:

- Perception Agents — Day 3 — Session Day 2 9:45am-10:05am

Human-agent collaboration is changing, becoming more visual. The agents most teams ship today still wait for us to type a paragraph to explain what we're looking at. They cannot see a screen, navigate a UI that changes, or recover when an application throws an unexpected modal. That is the architectural gap between agents that demo well and agents that work alongside real teams in real software. Perception agents close it. They see and use computers the way people do, reason about what they see, and act with clicks and keystrokes.

Anuj Iravane

  • Role: Head of AI
  • Company: Anterior
  • Bio: Anuj leads AI at Anterior, building production agents for high-stakes healthcare workflows.
  • Twitter: https://x.com/anujiravane
  • LinkedIn: https://www.linkedin.com/in/anujiravane/
  • Website: https://www.anterior.com/
  • Photo: /wf26/speakers/by-id/spk_anuj_iravane.jpg
  • Sessions:

- Don't be data poor — Day 4 — Session Day 3 3:20pm-3:40pm

What do you do when the data you most need to train and evaluate on is the data you're least allowed to keep? It's a bind for anyone building AI in a high-stakes vertical: the cases that would teach your model the most — the rare, the messy, the sensitive — tend to be the ones wrapped in the tightest constraints. In healthcare it's near-absolute. PHI can't be retained, reused, or transformed, so your long-lived datasets can't contain real patient data at all. Synthetic data is the obvious escape hatch, but it has its own trap: synthetic records tend to look synthetic, and a model that passes on fake-looking data tells you nothing about the real thing. So the bar isn't generating data — it's generating data faithful enough to trust. This talk is how we got there. Ask an LLM for a full case in one shot and you get something generic and averaged-out — models are worse at inventing convincing, specific detail than you'd expect. We present our synthetic generation pipeline (and the process around it) that enabled us to create golden datasets at scale. The pipeline features a coarse-to-fine process that enriches a patients medical history layer by layer, with a human in the loop hooks to steer the narrative at each step. You'll leave with ideas on how to build your own synthetic data generation capabilities and how to build a data pipeline your domain experts actually enjoy owning.

Aparna Dhinakaran

  • Role: CPO
  • Company: Arize
  • Bio: Aparna Dhinakaran is the Co-Founder and Chief Product Officer at Arize AI, a pioneer and early leader in AI & Agent observability and evaluation. A frequent speaker at top conferences and thought leader in the space, Dhinakaran was recently named to the Forbes 30 Under 30. Before Arize, Dhinakaran was an ML engineer and leader at Uber, Apple, and TubeMogul (acquired by Adobe). During her time at Uber, she built several core ML Infrastructure platforms, including Michealangelo. She has a bachelor’s from Berkeley's Electrical Engineering and Computer Science program, where she published research with Berkeley's AI Research group.
  • Twitter: https://x.com/aparnadhinak
  • LinkedIn: https://www.linkedin.com/in/aparnadhinakaran/
  • Photo: /wf26/speakers/by-id/spk_aparna_dhinakaran.jpg
  • Sessions:

- Evals Track Intro — Day 3 — Session Day 2 10:25am-10:30am

Archana Kamath

  • Role: VP of Engineering
  • Company: Digital Ocean
  • Bio: Archana Kamath is VP of Engineering at DigitalOcean, working across infrastructure, compute, networking, and AI infrastructure. Her DigitalOcean profile content emphasizes customer-centric infrastructure and product engineering for cloud and AI workloads.
  • Photo: /wf26/speakers/by-id/spk_archana_kamath.jpg
  • Sessions:

- Preferences > Benchmarks: Model Routing for How Teams Actually Build — Day 4 — Session Day 3 12:05pm-12:25pm

There is no best model. There's only the right model for a given task, and the right model depends on your team's preferences, not a benchmark score. This talk makes the case for preference-aligned routing: choosing models by the constraints that actually matter — cost, latency, task type, model preference — instead of a single leaderboard number. We'll demo a sub-200ms routing decision running on a purpose-built 30B MoE model with no application code changes, walk through real coding workflows routing most traffic to open models without losing accuracy, and show where this goes next: evals, caching, and personalization.

Arek Borucki

  • Role: Machine Learning Platform & Database Engineer
  • Company: Hugging Face
  • Bio: Arek Borucki is a Machine Learning Platform & Database Engineer at Hugging Face, where he helps keep the infrastructure behind one of the world's largest open-source AI platforms running at scale. He is the author of MongoDB in Action 8.0 and co-author of Mastering MongoDB 7.0. With over 10 years of experience in SRE, Kubernetes, AWS, GCP, and managing MongoDB in production, from 100TB+ sharded clusters to cloud-native deployments, he brings deep expertise in databases, platform engineering, and infrastructure at scale.
  • Twitter: https://x.com/_Aras_B
  • LinkedIn: https://www.linkedin.com/in/arekborucki/
  • Website: https://arekborucki.cloud/
  • Blog: https://arekborucki.cloud/
  • Photo: /wf26/speakers/by-id/spk_arek_borucki.jpg
  • Sessions:

- Serving 2 Million Models Without Melting: Scaling the Hugging Face Hub — Day 2 — Session Day 1 1:30pm-1:50pm

Hugging Face hosts over 2 million public models, 500,000+ datasets, and serves 13 million users across 50,000+ organizations, including over 30% of the Fortune 500. That growth didn't come with a manual.In this talk, we'll pull back the curtain on the infrastructure decisions that kept the Hub fast and reliable as traffic grew by orders of magnitude. We'll dive into why we chose MongoDB Atlas as our core data layer, how its document model maps naturally to the messy reality of ML model metadata, and what it took to keep p99 latency low when every request hits a catalog of millions. We'll also cover the trade-offs we faced, the things that broke along the way, and what "lean operations" actually means when your platform serves a third of the Fortune 500. Expect real architecture decisions, real numbers, and lessons you can take back to your own stack.

Ari Morcos

  • Role: Co-founder, CEO
  • Company: DatologyAI
  • Bio: Ari Morcos is co-founder and CEO of DatologyAI, building a self-service data curation platform for AI teams. Prior to founding Datology, Ari spent five years at FAIR (Meta AI), most recently as a Senior Staff Research Scientist, where his research on data curation and self-supervised learning received Outstanding Paper Awards at NeurIPS 2022 ("Beyond neural scaling laws: beating power law scaling via data pruning") and ICLR 2023. Before Meta, he was a Research Scientist at DeepMind, applying tools from neuroscience to understand generalization, representation learning, and the dynamics of training in deep networks. He holds a PhD in Neuroscience from Harvard and a BS in Neuroscience from UC San Diego.
  • Twitter: https://x.com/arimorcos
  • LinkedIn: https://www.linkedin.com/in/arimorcos/
  • Website: http://www.arimorcos.com/
  • Photo: /wf26/speakers/by-id/spk_ari_morcos.jpg
  • Sessions:

- Data Quality is the Compute Multiplier — Day 2 — Session Day 1 10:45am-11:05am

Better data quality is the highest-leverage and most underinvested part of building a model: it produces a better model for the same compute, whether you're mid-training on an open base or pre-training from scratch.

This session is a practical look at data curation, covering what data quality actually means, the stages of a modern curation pipeline (cleaning, filtering, deduplication, synthetic data generation, algorithmic mixing, and multi-stage composition), and which steps matter most in practice. It draws on DatologyAI's frontier data research and customer results, including Thomson Reuters' mid-training gains on proprietary legal domain data and Arcee's Trinity model reaching the open frontier on public data alone. You'll leave with a concrete sense of where better data quality pays off and how data curation is shaping the future of model training.

Arjun Singh

  • Role: Co-founder and CEO
  • Company: Superconductor
  • Bio: Arjun Singh is the co-founder and CEO of Superconductor. Previously, he co-founded Gradescope, an AI grading platform acquired by Turnitin in 2018.
  • Twitter: https://x.com/singharjun51293
  • Photo: /wf26/speakers/by-id/spk_arjun_singh.jpg
  • Sessions:

- Multiplayer agentic engineering: enabling your whole team and your best agents to work together — Day 4 — Session Day 3 1:55pm-2:15pm

For a solo developer, coding agents are a superpower. For a team, they surface new kinds of bottlenecks: coordination, visibility, review, and shared context.

We wanted our whole team and our best agents to work together, with no work or context trapped on any one developer's machine. So we pressed pause on the product we were building to create a multiplayer cloud workspace for agentic engineering.

This talk shares five key practices we've learned from building and using our platform:

Turn every surface the team uses into an agent interface.

Kick off sessions from Slack, review via iOS app, iterate in GitHub comments, ship from web. Agents run in the cloud, so work keeps moving even when your laptop is closed.

Make agent work visible and collaborative across the whole team.

Every agent session is shared, has a live app preview, and an agent-guided code review. This allows engineers, PMs, and designers to steer and evaluate agent work collaboratively.

Turn every external signal into shipped code your team can quickly evaluate.

Automatically turn customer emails, meeting action items, and bug reports into agent implementations that the whole team can review.

Set up shared cloud dev environments so agents aren't siloed to individual machines.

Secrets, role-based access, and network controls shared across the whole team. Fast environment startup, so you're not giving up speed by moving off local.

Benchmark agents on your own codebase.

Claude Code, Codex, Gemini, Amp, OpenCode — how do you know which is actually better on your stack? We'll cover using your merged PRs as ground truth to build a "Personal SWE-Bench" for your codebase.

Agentic engineering is going multiplayer. This is how your team gets there.

Arman Vaziri

  • Role: Senior Staff Software Engineer
  • Company: Ramp
  • Bio: Software engineer building the agents and data platform powering Ramp’s growth. Built Ramp’s AI SDR; customer data platform that powers all growth channels; and Ramp Revenue, an internal sales platform and suite of agents that drives seller actions. Currently focused on product-led growth and agentic GTM orchestration, evolving agents from workflow-specific background assistants into systems that coordinate actions across growth and sales. Previously worked in Growth Engineering at Affirm and FP&A Engineering at Goldman Sachs.
  • LinkedIn: https://www.linkedin.com/in/armanvaziri/
  • Photo: /wf26/speakers/by-id/spk_roman.jpg
  • Sessions:

- The Building Blocks of GTM Orchestration — Day 4 — Session Day 3 12:05pm-12:25pm

Ramp built its own 0→1 revenue stack in-house — Ramp Revenue — with one mandate: build the most efficient GTM org in the world. Arman Vaziri breaks down the building blocks: a customer data platform that chews through millions of internal, external, and CRM records daily, and a unified action layer with agents embedded directly in seller workflows. The payoff — reps stop hopping between dozens of systems just to figure out who to reach and what to say, and 80%+ of Ramp's sales workflows now run on it. A look at the architecture behind orchestrating GTM at scale.

Armen Aghajanyan

  • Role: Co-Founder & CEO
  • Company: Perceptron AI
  • Bio: Co-founder & CEO, @perceptroninc; ex-RS FAIR/MSFT
  • Twitter: https://x.com/ArmenAgha
  • LinkedIn: https://www.linkedin.com/in/armenag
  • Website: https://perceptron.inc
  • Photo: /wf26/speakers/by-id/spk_armen_aghajanyan.jpg
  • Sessions:

- From VLM/VLA's to Embodied Agents — Day 2 — Session Day 1 2:50pm-3:10pm

Arturo Nunez

  • Role: Founder
  • Company: Nereu
  • Bio: Arturo is the founder of Nereu, an AI-native game engine that lets anyone build their game. Previously at MongoDB and Unity.
  • Twitter: https://x.com/arturonereu
  • LinkedIn: https://www.linkedin.com/in/arturonereu/
  • Website: https://www.arturonereu.com/
  • Photo: /wf26/speakers/by-id/spk_arturo_nereu.jpg
  • Sessions:

- The Next Game Engine Won't Have a Manual — Day 4 — Session Day 3 12:05pm-12:25pm

Game development is still incredibly hard to get right. It requires great engineering, artistic vision, and the ability to make something genuinely entertaining, all at once. Dropping a powerful LLM into existing engines won't solve the problem. Game development needs to fundamentally change to work in this era of agents. After 15 years in games (making them, watching others make them, and working at the most popular game engine in the world) I'm now fully embracing the power of AI to give it to the people who dream of making games but find it too difficult. I'm building Veselka. In this talk, I'll show you the AI-magic that converts Claude into a real game dev partner, using Three.js to let anyone build their dream game.

Arun Sekhar

  • Role: Principal Product Manager for AI Developer Experience
  • Company: Microsoft
  • Bio: Arun Sekhar is a Principal Product Manager for AI Developer Experience at Microsoft. He has worked as a developer, development lead and product manager across Microsoft technologies, and is associated with OpenClaw and AI developer tooling.
  • LinkedIn: https://www.linkedin.com/in/rcarun
  • Photo: /wf26/speakers/by-id/spk_arun_sekhar.jpg
  • Sessions:

- The model swap workshop — Day 1 — Workshop Day 11:05am-12:05pm

Frontier labs are releasing new models constantly, and it is hard to know when “better” is better enough to justify touching a working system. On top of that, “just swap the model” often turns into real work because providers expose different APIs and different expectations around tools and structured outputs. The model swap workshop is a hands-on bake-off across frontier LLMs. We will run the same scenarios using multiple models (OpenAI, Anthropic, Kimi, and more) and compare results side by side for agentic tool use, structured outputs, and multimodal tasks. Swapping models is not just changing a model name. In this workshop, you will actually do the swaps, including moving between OpenAI-style Responses APIs and Anthropic-style Messages APIs, then see what breaks and what needs to change in your prompts, tool definitions, and JSON strategies. We will finish by running a small eval suite so you can quantify tradeoffs instead of relying on vibes. We will provide the Microsoft Foundry environment for access to the models, no account needed.

- OpenAI, Anthropic, or agent frameworks: choose the right AI stack — Day 3 — Session Day 2 11:40am-12:00pm

OpenAI SDK, Anthropic SDK, or an LLM-agnostic agent framework. Which one should your next AI app be built on? Starting with Foundry Models, we walk through each option in code, show what you gain and what you give up at every layer, and help you pick the right abstraction for your scenario without overbuilding.

- Blast Radius Zero: One‑Command OpenClaw Sandboxes in the Cloud — Day 4 — Session Day 3 1:55pm-2:15pm

You already run OpenClaw locally, sandboxed in MXC. Now you need the same agent in the cloud for dev/test, reachable from Teams on your phone, without handing over the keys to the kingdom. This session shows a simple, one‑command path to do all of this: an isolated Container Apps sandbox running an OpenClaw image, calling Azure OpenAI in Foundry Models securely without keys over the standard OpenAI API, scaling to zero when idle.

Arunachalam Manikandan

  • Role: AI Engineer, Co-Founder
  • Company: University of Minnesota
  • Bio: Arunachalam Manikandan is a Computer Science graduate student and Graduate Research Assistant at the University of Minnesota, where he researches biomedical image segmentation using large vision models.
  • Twitter: https://x.com/Arunachala64250
  • LinkedIn: https://www.linkedin.com/in/arunachalam-manikandan/
  • Blog: https://medium.com/@rome101202
  • Photo: /wf26/speakers/by-id/spk_arunachalam_manikandan.jpg
  • Sessions:

- Autoresearch in a Multi-Agent AI Village — Day 3 — Session Day 2 3:45pm-4:05pm

Project Paradox is an existing multi-agent framework built at Supercell's first AI Innovation Lab, which has a 3D Unity village with local LLM powered agents. The characters remember conversations, update emotional state, track trust, plan actions, move through rooms, transfer items, and talk to each other through a FastAPI backend. The new work is an autoresearch layer around that village. We built a backend loop that runs controlled social scenarios, scores the resulting NPC behavior, proposes protocol or policy changes, reruns the suite, and keeps changes that improve the agents. The goal is to move beyond one good chat response and measure whether an NPC society can preserve source attribution, verify claims, spread important information, coordinate goals, and replan after new information arrives. The talk walks through the system architecture and the lessons from building it. We show the backend simulation harness that executes Unity style actions without opening Unity, the scenario suites that test information diffusion and memory provenance, and the ratchet loop that edits protocol text or planner policy with rollback. One accepted run improved information diffusion by teaching agents to broadcast important sourced evidence while preserving who said it. The practical takeaway is a reusable pattern for AI engineers building agents with messy state. Freeze the harness, expose a small editable policy surface, score real behavior instead of vibes, and let an agent search for improvements under rollback. The same pattern applies to game agents, coding agents, support agents, personal agents, and other systems where long horizon behavior matters more than a single response.

Asaf Gardin

  • Role: Senior Software Engineer/Inference Engineer
  • Company: AI21
  • Bio: Asaf Gardin is a Senior Software Engineer on the inference team at AI21 Labs, where he works on high-performance LLM inference and the production deployment of the Jamba hybrid SSM-Transformer models. He's an active vLLM committer, contributing to quantization, scheduling, and support for Mamba-based architectures. His talk covers two production bugs in vLLM's Mamba support - a scheduler edge case that corrupted SSM state under memory pressure, and a 32-bit integer overflow in a CUDA kernel that surfaced as RL training instability - both root-caused at AI21 and fixed upstream. He also built Kernel Academy, a browser-based tutorial for learning Triton GPU programming. Previously at IBM.
  • LinkedIn: https://www.linkedin.com/in/joseph-asaf-gardin/
  • Photo: /wf26/speakers/by-id/spk_asaf_gardin.jpg
  • Sessions:

- Two Bugs That Hid in Plain Sight: A vLLM Debugging Detective Story — Day 4 — Session Day 3 3:20pm-3:40pm

Your model generates gibberish. Once every thousand prompts. High confidence scores. No crashes. No warnings. We hit this twice while building Jamba models. First: A request gets misclassified during scheduling, loads stale state from a previous prompt cache slot, and confidently generates nonsense. Second: Logprob spikes during RL training that looked like training instability-until we noticed they tracked with rollout count, then with cache size. In this talk, we'll walk through both debugging journeys-the false starts, how we instrumented vLLM to thread request IDs through the forward pass, the search for variables that change failure structure rather than magnitude, and the lesson both share: distributed inference systems fail silently. No stack trace. No sanitizer warning. Just wrong answers with perfect confidence. You'll learn how to build comparison scripts that expose logprob divergence, force memory pressure to surface rare bugs, and shrink a distributed RL training mystery into a reproducible single-script failure. Walk away knowing how to debug vLLM when it lies to you quietly.

Ashish Kamra

  • Role: Senior Manager, Software Engineering
  • Company: Red Hat
  • Bio: Accomplished engineering leader with 15+ years of experience in AI, cloud-native platforms, and infrastructure. Proven track record of building and scaling high-performing teams and delivering significant performance improvements in enterprise AI products. Combines deep technical expertise in AI/ML with strategic vision to drive product innovation and business impact.
  • LinkedIn: https://www.linkedin.com/in/ashishkamra/
  • Photo: /wf26/speakers/by-id/spk_ashish_kamra.jpg
  • Sessions:

- KV Cache-Aware Routing and P/D Disaggregation on Kubernetes: The Parts Public Benchmarks Don't Show — Day 4 — Session Day 3 2:50pm-3:10pm

We're at the inflection point between classic LLM inference and agentic inference. When we look at the agentic workloads and trace replays, many core characteristics break classic LLM serving assumptions. The most consequential: the server no longer controls its own cache lifecycle. The client does, through prompt construction, multi-turn context that grows and changes each turn.

This has downstream effects. Because context is client-determined, prefill strategy, eviction, and routing decisions move up to the scheduler layer. KV cache becomes volatile — frequent eviction and rewrite, driven from outside the engine. And latency becomes a first-class scheduling metric alongside throughput. This talk covers the open stack for LLM and agentic era inference serving: vLLM and llm-d.

We begin with the core characteristics and challenges of agentic inference, then the economics: prefill dominates cost, and cache reuse is the primary lever. We explain why KV-aware routing through a fleet-wide scheduler is the first optimization to apply, ahead of adding capacity.

Next, prefill/decode disaggregation. We separate compute-bound prefill from memory-bound decode, and examine what public benchmarks omit: the conditions under which P/D disaggregation shines, and the workload shapes that justify the added architectural complexity.

We close with GLM-5.2 and show the equivalent stack assembled in the open: cache-aware routing, P/D disaggregation, tiered KV offload, and wide expert parallelism — implemented on vLLM and llm-d.

Attendees leave with a tuning decision framework: which lever to apply first, how to read workload signals, and where additional GPUs do and don't help.

Ashok Chandrasekar

  • Role: Staff Software Engineer
  • Company: Google
  • Bio: Ashok Chandrasekar is a Staff Software Engineer at Google working on AI Inference performance evaluation and optimization for Google Kubernetes Engine. He is a project lead and maintainer of Inference Perf and co-lead of SIG Benchmarking in the llm-d project. He holds a Master's degree from Carnegie Mellon University. Previously, he was a Staff Engineer at VMware. His interests lie in Distributed Systems with his current focus being Systems for AI/ML applications.
  • LinkedIn: https://www.linkedin.com/in/ashokchandrasekar/
  • Website: https://ashokc.dev
  • Blog: https://ashokc.dev
  • Photo: /wf26/speakers/by-id/spk_ashok_chandrasekar.jpg
  • Sessions:

- Are LLM Performance Benchmarks Reliable? — Day 4 — Session Day 3 11:40am-12:00pm

Standardizing performance benchmarks for production-grade Large Language Models is currently a significant challenge across the industry. Conflicting data is prevalent, whether originating from server developers like vLLM and SGLang or from various analysts and competitive benchmarks, and these results often fail to hold up under real-world conditions. Our research into these inconsistencies identified several critical factors, including the constraints of single-process tools, specifically the Python Global Interpreter Lock (GIL) and the nuances of model-level settings like temperature. Furthermore, a lack of transparency regarding load generation parameters such as QPS and concurrency, paired with insufficient observability into the benchmarking clients themselves, contributes to these disparate outcomes. In this talk, we share key lessons learned from our benchmarking efforts, examining the primary pitfalls that distort performance data and offering strategies for mitigation. Additionally, we will introduce Inference Perf, an open-source, multi-process utility we developed to provide reliable stress-testing for production stacks. Our goal is to promote standardized, real-world benchmarking practices that allow the community to move beyond unreliable data. Join us to discover how to accurately measure, optimize, and report LLM performance with certainty.

Ashu Joshi

  • Role: Director, Business Strategy
  • Company: Microsoft
  • Bio: Ashu Joshi works on agentic AI platform strategy at Microsoft, with a focus on turning AI platforms into enterprise business capabilities across agent platforms, adoption and go-to-market strategy.
  • LinkedIn: https://www.linkedin.com/in/ashujoshi
  • Photo: /wf26/speakers/by-id/spk_ashu_joshi.jpg
  • Sessions:

- Deploy agents to users in M365, Teams, and apps — Day 3 — Session Day 2 3:20pm-3:40pm

Agents deliver value when users can access them. Learn how to integrate and deploy agent systems into M365, Teams, and application workflows.

- Operate agents safely at scale with enterprise governance — Day 4 — Session Day 3 2:25pm-2:45pm

As adoption grows, governance becomes critical. Learn how to manage identity, compliance, and lifecycle for agent systems at enterprise scale.

Asma Beevi

  • Role: Senior Engineer
  • Company: NVIDIA
  • Bio: Asma Beevi K T is a senior engineer at NVIDIA, developing the NVIDIA TensorRT Model Optimizer toolkit. Her interests span training and inference optimizations for deep learning models, particularly LLMs.
  • LinkedIn: https://www.linkedin.com/in/asma-beevi-k-t-433053a2
  • Website: https://realasma.github.io
  • Photo: /wf26/speakers/by-id/spk_asma_beevi.jpg
  • Sessions:

- Compression at the Edge — Day 4 — Session Day 3 2:25pm-2:45pm

Compression at the Edge examines how smaller weights, faster inference, and constrained-memory deployments are making capable local AI more practical. The panel explores where compressed models already beat cloud on latency, privacy, cost, or control, what breakthroughs would unlock broader adoption, and how open model tooling is shaping the edge AI stack.

Moderator: Chris Alexiuk (NVIDIA). Panelists: Daniel Han (Unsloth), Asma Beevi (NVIDIA), Merve Noyan (Hugging Face), Michael Chiang (Ollama).

- Compression at the Edge — Day 4 — Session Day 3 2:50pm-3:10pm

Compression at the Edge examines how smaller weights, faster inference, and constrained-memory deployments are making capable local AI more practical. The panel explores where compressed models already beat cloud on latency, privacy, cost, or control, what breakthroughs would unlock broader adoption, and how open model tooling is shaping the edge AI stack.

Moderator: Chris Alexiuk (NVIDIA). Panelists: Daniel Han (Unsloth), Asma Beevi (NVIDIA), Merve Noyan (Hugging Face), Michael Chiang (Ollama).

Averi Kitsch

  • Role: Staff Software Engineer
  • Company: Google
  • Bio: Averi Kitsch is a Staff Software Engineer at Google dedicated to bridging the gap between raw data and active intelligence. As the engineering lead for the MCP Toolbox, Averi empowers developers to build sophisticated, agentic applications directly on top of their Google Cloud databases. Drawing from a deep background in DevOps—with specific expertise in serverless runtimes and CI/CD—she brings a pragmatic, "builder-first" perspective to AI infrastructure. Her ultimate goal is to ensure the next generation of intelligent applications is as robust and scalable as it is smart.
  • LinkedIn: https://www.linkedin.com/in/averikitsch
  • Website: https://averi.dev
  • Photo: /wf26/speakers/by-id/spk_averi_kitsch.jpg
  • Sessions:

- Build-Time vs. Run-Time: Why Your Dev Tools Will Fail in Production — Day 3 — Session Day 2 10:45am-11:05am

A dangerous pattern is evolving in the ecosystem: developers are deploying "Build-Time" tools into "Run-Time" environments. In this session, we will introduce a critical distinction for the MCP ecosystem: the difference between Build-Time Agents (Developer Assistants like Gemini Code Assist) and Run-Time Agents (End-user applications like a Customer Support bot). Drawing from our experience building the MCP Toolbox, we will demonstrate why the "Atomic" tools that make Build-Time agents powerful become catastrophic liabilities for Run-Time agents. We will provide a framework for transitioning your architecture across three key axes: Design: Moving from flexible, atomic primitives to "Composite Workflows" that encapsulate business logic. Security: Shifting from "Developer Identity" (trusted) to "Workload Identity" (zero-trust), where the agent is treated as an untrusted user. Reliability: Why production agents need "Agent-Readable" errors (natural language guidance) rather than the stack traces that developers rely on. Attendees will leave with a clear rubric for evaluating whether their tools are truly "Production Ready" or just "Prototype Ready."

Ayush Bhardwaj

  • Role: Tech Lead
  • Company: Allos AI
  • Bio: Tech Lead at Allos AI building everything AI for Pharma. Previously built agentic AI for macro markets at D. E. Shaw.
  • Twitter: https://x.com/aybh08
  • LinkedIn: https://www.linkedin.com/in/aybh/
  • Website: https://ayushb.me/
  • Blog: https://ayushb.me/
  • Photo: /wf26/speakers/by-id/spk_ayush_bhardwaj.jpg
  • Sessions:

- Trading Desks to Clinical Trials: Parallels in Applied Vertical AI — Day 4 — Session Day 3 2:25pm-2:45pm

Wall Street to Wet Labs: The Shared DNA of Vertical AI. On the surface, finance and pharma couldn't look more different. One chases alpha in the markets; the other engineers complex drug delivery and stability. But under the hood, building Vertical AI for both domains reveals a striking shared DNA. Drawing from hands-on engineering experience in Applied AI at a top hedge fund and a cutting-edge pharma tech startup, this session explores the surprising architectural parallels between these two high-stakes industries.

Barr Yaron

  • Role: Partner
  • Company: Amplify Partners
  • Bio: Barr Yaron is a Partner at Amplify Partners, where she backs founders building the next generation of AI infrastructure and applications
  • Twitter: https://x.com/barrnanas
  • LinkedIn: https://linkedin.com/in/barryaron
  • Website: https://barrchives.com
  • Photo: /wf26/speakers/by-id/spk_barr_yaron.jpg
  • Sessions:

- The 2026 State of AI Engineering — Day 4 — Session Day 3 9:00am-9:20am

results per Barr

Ben Dicken

  • Photo: /wf26/speakers/by-id/spk_ben_dicken.jpg
  • Sessions:

- Move fast and (don’t) break things — Day 4 — Session Day 3 12:05pm-12:25pm

Engineers want to move fast with AI, but the infrastructure underneath is buckling. Status pages across the industry make this clear. Here, you'll learn how to build systems that maintain 4-nines of availability while meeting unprecedented customer demand using the principles of extreme fault tolerance.

PlanetScale has written about how we apply these principles to operating databases across our fleet (https://planetscale.com/blog/the-principles-of-extreme-fault-tolerance). This matters not just for databases, but all aspects of reliable infrastructure.

Isolation, redundancy, static stability, and back-pressure are the building-blocks to achieving this. Sticking to such principles when architecting the backend of AI applications ensures our systems are resilient to failure while still being flexible enough to scale. We'll look at concrete failure modes from production systems and the patterns that prevent them.

Ben Holmes

  • Role: Dev Rel Lead
  • Company: Warp
  • Bio: Ben is a software engineer and content creator helping everyone make the world better with code. You may have seen him around the internet with a whiteboard explaining web development concepts and coding agent tips. You also may know him from livestreams on Warp, or as a core maintainer of Astro.build. If you're interested in Markdown, HTML, or Japanese City Pop, go talk to him.
  • Twitter: https://x.com/bholmesdev
  • LinkedIn: https://linkedin.com/in/bholmesdev
  • Website: https://bholmes.dev
  • Blog: https://bholmes.dev
  • Photo: /wf26/speakers/by-id/spk_ben_holmes.jpg
  • Sessions:

- LLM Knowledge Bases: a practical guide — Day 3 — Session Day 2 3:45pm-4:05pm

Putting thoughts to paper (or keyboard, or transcription model) refines your thinking, connects ideas, and pulls context out of your brain for others to learn from. But while taking notes can be fun, organizing those notes is not. Flat lists turn to folders turn to tags and taxonomies that grow unwieldy beyond the first hundred entries. If you can’t find what you wrote down yesterday, or you miss connections to related ideas, you’re missing the value of notetaking: learning from what you notate. Agents dramatically expanded what’s possible here. Combined with Markdown-backed apps like Obsidian to make notes agent-accessible, you can build a second brain that works for you, not the other way around. Andre Karpathy has popularized LLM knowledge bases, and I want to take it further with concrete workflows you can use to organize your thoughts with agents. We’ll explore a number of Obsidian workflows to make this possible: - Automations to organize notes with tags, folders, backlinks, and deduplication to level-up search and discovery - More automations to have agents expand your thinking by auto-recording ideas while you sleep - Building an agentic writing partner to surface related ideas in real time and answer questions as you type (or as you speak) - Voice monologuing and summarization tools to lower the friction of transcibing thoughts into well-formatted notes You’ll walk away with a new appreciation for notetaking, and a second brain that leaves you 10x smarter than your brain alone. Talk format: Code and live tech demos. I will set up all of these automations and tools from scratch, and show agents executing each of them live. I will share the source for all automations as well.

Ben Hylak

  • Role: CTO
  • Company: Raindrop
  • Bio: Ben Hylak is CTO at Raindrop, the monitoring platform for AI agents. He was previously a designer and engineer at Apple and did engineering at SpaceX and Google.
  • Twitter: https://x.com/benhylak
  • LinkedIn: https://www.linkedin.com/in/benhylak/
  • Photo: /wf26/speakers/by-id/spk_ben_hylak.jpg
  • Sessions:

- Designing Agents (The Floor Is the Frontier) — Day 3 — Session Day 2 2:50pm-3:10pm

You know how smart your agent can be. You have no idea how dumb it gets until it does the dumbest possible thing in front of your most important user, with full access to act on their behalf. Capability isn't the bottleneck anymore, the floor is. The hard part is there's usually no objective right answer. You raise the floor by observing what your agent actually does in production, catching the dumb thing the moment it happens, and closing the loop so it never happens twice.

Ben Kus

  • Role: CTO
  • Company: Box
  • Bio: Ben Kus is the Chief Technology Officer at Box, where he leads technology and AI strategy to help enterprises securely unlock insights from their unstructured data. Ben’s career spans engineering, product leadership, and startup innovation—including co-founding Subspace (acquired by Box) and being an early employee at BigFix (acquired by IBM), where he later served as Chief Architect of Mobile Security. Ben holds a degree in Computer Science from UC Berkeley.
  • Twitter: https://x.com/benatbox
  • LinkedIn: https://www.linkedin.com/in/benkus/
  • Photo: /wf26/speakers/by-id/spk_ben_kus.jpg
  • Sessions:

- The Half Life of Agent Infrastructure — Day 3 — Session Day 2 1:30pm-1:50pm

TBD — talk on search and retrieval, agentic AI, and enterprise AI over unstructured content.

Benjamin Clavié

  • Role: Member of Technical Staff
  • Company: Mixedbread Inc.
  • Bio: MTS at Mixedbread working on building the future of Retrieval.
  • Twitter: https://x.com/bclavie
  • Website: https://mixedbread.com
  • Blog: https://ben.clavie.eu
  • Photo: /wf26/speakers/by-id/spk_benjamin_clavi.jpg
  • Sessions:

- If we want them to do Knowledge Work, we need to design Knowledge Agents — Day 2 — Session Day 1 1:30pm-1:50pm

It's tempting to assume that just like agents revolutionised coding, they will revolutionize other areas: legal, finance, advertising, and even medicine. All of those have in common that they are fundamentally knowledge work. And thankfully, humans have spent thousands of years searching for the best possible workflows for knowledge work. And yet, we seem to be disregarding all of these learnings, forcing every knowledge task into the shape that worked for coding. Today, we're going to talk about the history of knowledge work and how tools were co-designed to support it to understand how we should be building Knowledge Agents, themselves co-designed with their Knowledge Tools. This is key to avoiding falling into a "good enough" local optimum: think about legal clerking, a core part of the legal industry where information gathering and reasoning is performed to support the work of senior lawyers. The practice of clerking follows its own code, rules and best practices, which could not have feasibly emerged from studying software engineering: and similarly, there is no reason to believe knowledge agents could emerge from coding agents.

Benjamin Guo

  • Role: Cofounder
  • Company: Zo Computer
  • Bio: Cofounder of Zo Computer. Joined Stripe early (2015), where he worked for over 8 years. Founding engineer on Terminal, Stripe's in-person payments arm. Ben's cofounder, Rob Cheung, was the first engineer at Substack. They met on the early Venmo team in 2013, and they've reunited to build Zo.
  • Twitter: https://x.com/0thernet
  • LinkedIn: https://linkedin.com/in/0thernet
  • Website: https://0.zo.space
  • Photo: /wf26/speakers/by-id/spk_ben_guo.jpg
  • Sessions:

- Everyone Gets A Software Company — Day 2 — Session Day 1 11:40am-12:00pm

Benoit Schillings

  • Role: VP of Technology
  • Company: Google DeepMind
  • Bio: Benoit Schillings leads the Thinking, Reasoning, and Coding teams at Google DeepMind, directing foundational research toward AGI. His work focuses on advancing next-generation model reasoning and integrating software development best practices into AI code generation.

Previously, as Chief Technology Officer at X, Benoit guided early-stage teams in prototyping Alphabet's ambitious "moonshot" technologies across computing, biochemistry, and clean energy. A native of Belgium with over 30 years in Silicon Valley, he has held senior technical roles at Yahoo, Nokia, and Be.Inc., earning over 40 patents in hardware and software. Outside of pioneering new technologies, Benoit is a father of two and an avid amateur astronomer who explores the night sky using his homemade telescopes.

  • LinkedIn: https://www.linkedin.com/in/benoit-schillings-2942a5
  • Photo: /wf26/speakers/by-id/spk_benoit_schillings.jpg
  • Sessions:

- Research to Reality with Google DeepMind — Day 3 — Session Day 2 10:05am-10:25am

TBD. Expected focus areas include generative AI for code, deep thinking algorithms, and the future of pre-training and transformer models for Gemini.

Bereket Habtemeskel

  • Role: CEO
  • Company: Better Auth
  • Bio: Founder & CEO of Better Auth, the most popular auth framework for TypeScript, and co-author of the Agent Auth protocol
  • Twitter: https://x.com/bekacru
  • LinkedIn: https://www.linkedin.com/in/bekacru/
  • Photo: /wf26/speakers/by-id/spk_bereket_engida.jpg
  • Sessions:

- Agent Auth — Day 1 — Workshop Day 4:30pm-5:30pm

Better Auth has grown to 27k GitHub stars and over 1.5M weekly downloads, becoming a popular choice for developers who want to own their authentication stack. We recently introduced Agent Auth, a protocol designed to support autonomous and delegated agents operating services for an organization or a user. It allows agents to dynamically negotiate capabilities, manage access boundaries, and maintain secure authorization flows. This session will break down the protocol design and demonstrate it live, showing how agents can securely authenticate and operate with dynamic permissions.

Bogdan Gaza

  • Role: Co-Founder & CTO
  • Company: DatologyAI
  • Bio: Bogdan Gaza is Co-Founder and CTO at DatologyAI, working on systems that help teams make better use of their data for AI model development and training.
  • LinkedIn: https://www.linkedin.com/in/bogdangaza
  • Photo: /wf26/speakers/by-id/spk_bogdan_gaza.jpg
  • Sessions:

- Running a 20T-Token Data Pipeline: Infrastructure Lessons from Production — Day 2 — Session Day 1 3:20pm-3:40pm

The problem. Curation algorithms tend to get the spotlight: model-based quality filtering, embedding-based deduplication, synthetic generation at scale, target distribution matching. The engineering behind them, the systems that actually run those algorithms reliably on petabytes of data and thousands of GPUs, usually gets overlooked. This session is about the engineering. What we built. The infrastructure behind two production data curation pipelines, on two very different shapes of workload: Arcee Trinity-Large-Thinking three model generations in nine months, with the curated corpus scaling from 8T to 10T to 20T tokens. Trinity-Large's 20T-token corpus included 8T+ synthetic tokens generated on clusters peaking at 2,048 H100 GPUs. Each generation incorporated deeper curation and broader domain coverage; the pipeline ran end-to-end multiple times, not once. Thomson Reuters legal 100B tokens of mid-training output, generated from TR's proprietary legal corpus, delivered as a deployment artifact and plugged into their existing SFT and DPO post-training. Different operational profile entirely: smaller scale, sensitive data, customer-environment integration. What you'll learn about. The metadata bottleneck. At trillion-token scale, fetching metadata from object storage across millions of files becomes the dominant source of idle time. We offload metadata management to Spark and use a lightweight file-level distribution scheme to drive idle time to near zero. Fault tolerance at multi-week scale. Long-running GPU inference jobs fail. We use one-to-one partition mapping between Spark and Ray jobs to get idempotent, resumable execution. A node failure no longer means reprocessing the dataset. Heterogeneous workload scheduling. Curation pipelines mix CPU-heavy preprocessing (Spark) with GPU-heavy inference (Ray + vLLM). An in-house scheduler routes each job type to isolated node pools, preventing resource fragmentation and ensuring critical training jobs aren't blocked by upstream CPU work. Inference tuning across models. vLLM defaults aren't right for every model. Tuning batch size, speculative decoding, and n-gram sampling per-model yields up to 40% throughput improvement, without over-engineering. Pipeline reproducibility. Treating a curated training corpus as a versioned deployment artifact rather than a one-off output. What that enables when a customer wants to run mid-training against a pre-trained base. For engineers building or operating large-scale data pipelines for ML training

Bohan Li

  • Role: Staff Software Engineer
  • Company: EliseAi
  • Bio: Bo has over 10 years of experience building real time systems across databases, decentralized finance, self driving cars, and voice AI. He previously worked as an Member of Technical Staff at Cartesia and is currently at EliseAI, building AI Agents for Housing and Healthcare that improve how we live.
  • Twitter: https://x.com/bobowchan
  • LinkedIn: https://www.linkedin.com/in/bohan-li-7290b74a/
  • Website: https://eliseai.com/
  • Photo: /wf26/speakers/by-id/spk_bo_li.jpg
  • Sessions:

- Realtime Voice Agents with Frontier Intelligence — Day 2 — Session Day 1 2:50pm-3:10pm

Dive into how the EliseAI voice agent harness orchestrates multiple models with jagged capability profiles to achieve realtime latency without sacrificing intelligence. Reduces p90 effective latency overhead of ASR, TTS, and tool calling to sub 200ms, unlocking frontier models like GPT 5.5 for voice. ### ASR: Eager Speculative Transcription We introduce speculative transcription by pairing local Whisper or Parakeet fine-tunes for speed with API models like Scribe, Nova, or Gemini Flash for accuracy. A local content match classifier operates at sub 10ms latency, allowing us to immediately trigger the downstream pipeline from the fast local transcription and dynamically replace text with the more accurate transcription if significant differences occur. This process runs on a eager 100ms VAD delay, securely releasing the generated response audio only after a fixed silence threshold has passed. ### LLM: Async background tool injection To eliminate expensive tool calling round trips, we implement system leveraging async background tool injection where the primary model makes no direct tool calls. Instead, local fine-tuned tool-calling models continuously observe the realtime transcription stream in the background. "Fake" tool call traces are then injected into the primary LLM’s context, which primes it for immediate, one-shot response generation. ### TTS: Prefix caching and infilling Many Agent responses start with the same set of 3-6 words. We can cache this audio, releasing it immediately while we infill the remaining response audio conditioned on this prefix to preserve speech prosody. With this approach, a relatively small cache can achieve a 90% hit rate across a wide range of voices, languages and model providers.

Brandon Callender

  • Role: Founding Engineer
  • Company: typedef
  • Bio: Brandon Callender is a founding engineer at typedef, where he builds AI-native infrastructure for data engineering agents. His work focuses on the data context layer agents need to reason beyond code and database access.
  • LinkedIn: https://www.linkedin.com/in/bcallender/
  • Photo: /wf26/speakers/by-id/spk_brandon_callender.jpg
  • Sessions:

- The Data Context Layer: Why Data Engineering Agents Need More Than Code and Databases — Day 1 — Workshop Day 2:20pm-4:20pm

Modern AI agents typically understand either code or databases. Code-focused agents reason over files, dependencies, and syntax, while database agents see tables, columns, and query results. This works for software development and basic analytics—but it breaks down for data engineering. In real data environments, agents fail because they lack context: an understanding of how data flows, what it represents, and why it behaves the way it does in production. Introducing the data context layer—a missing third layer that bridges code, data, and business semantics. Without it, agents hallucinate impact, suggest unsafe joins, and struggle with root cause analysis. This presentation will define the data context layer and showcase its use in practice, including end-to-end lineage from sources to reports; semantic metadata such as grain, measures, dimensions and business logic; runtime signals including job executions, failures, and performance patterns; and logical vs. physical modeling distinctions. Attendees will walk away with a greater understanding of: Why the code layer (dbt SQL, manifests, Git history) provides structure but misses grain, aggregation semantics, and join safety Why the data layer (warehouse tables, execution metrics, failures) shows what happened, but not why How the data context layer unifies lineage, semantic metadata, runtime behavior, and business rules The presentation will also cover architecture patterns for building and maintaining a data context layer, including why property graphs are well-suited for contextual reasoning and how agents can query context safely instead of relying on prompt stuffing.

Brandon Waselnuk

  • Role: Developer Relations
  • Company: Unblocked
  • Bio: Brandon Waselnuk works in Developer Relations at Unblocked, a context platform for AI-assisted development.
  • Twitter: https://x.com/BrandonWaselnuk
  • LinkedIn: https://ca.linkedin.com/in/brandonwaselnuk
  • Photo: /wf26/speakers/by-id/spk_brandon_waselnuk.jpg
  • Sessions:

- Your agents lack context: Here's how to fix "You're absolutely right!" — Day 3 — Session Day 2 12:05pm-12:25pm

Every AI coding tool can generate code. Very few can generate the right code for your organization, because they're missing context. They don't know why your team chose Redis over DynamoDB, what the team decided in a Slack thread earlier today about the auth migration, or which architectural patterns your principal engineers actually enforce in review.

This talk is a practitioner's guide to building a context engine: the reasoning layer that continuously ingests & synthesizes organizational knowledge across disparate sources into unified, queryable understanding.

I'll walk through the problems you actually have to solve — reasoning across systems that don't agree with each other, searching globally before you can reason, maintaining identity-scoped permissions so every user and agent only sees what they should, and personalizing results based on who's asking and what they're working on.

These are the engineering challenges that make naive RAG fall short, drawn from real lessons building this at scale.

- Beyond RAG: See a relational context engine reduce token burn — Day 4 — Session Day 3 11:10am-11:30am

In this expo talk we'll give you a free context engine simulator, open source tools, and demo how a context engine works. See how modern engineering workflows with agentic loops and goals produce better quality code and reduce token burn. RAG, while useful, leaves context gaps for humans and agents. A context engine fills those gaps by including real-time, relational, personalized, and permission aware techniques to get high-signal context to humans and agents at runtime.

Brendan Rappazzo

  • Role: Machine Learning Scientist
  • Company: Morgan Stanley
  • Bio: ML Research Scientist at Morgan Stanley working on LLM post-training and building agentic workflows. PhD from Cornell. Shares fun experiments on GitHub and X (@brendanh0gan)
  • Twitter: https://x.com/brendanh0gan
  • LinkedIn: https://www.linkedin.com/in/brendan-rappazzo-hogan-763734115/
  • Website: https://www.bhogan.net
  • Blog: https://www.bhogan.net/
  • Photo: /wf26/speakers/by-id/spk_brendan_rappazzo.jpg
  • Sessions:

- ALPHALAB: Autonomous Multi-Agent Research Across Optimization Domains with Frontier LLMs — Day 4 — Session Day 3 10:45am-11:05am

We built AlphaLab to automate quantitative research at Morgan Stanley’s Machine Learning Research Lab - the experimental grind of architecture search, hyperparameter tuning, and literature review that consumes most of a researcher's time. To show it generalizes, we ran it on three deliberately different domains: CUDA kernel optimization (4.4× mean speedup over torch.compile, 91× peak), LLM pretraining (22% lower validation loss under a 20-minute budget), and traffic forecasting (23–25% RMSE improvement after the system independently found and tuned TFT and iTransformer from the literature). AlphaLab is an agentic harness that takes a dataset and a natural-language objective and runs a full research campaign across three phases: it explores the data and surveys prior work, it constructs and adversarially validates its own evaluation framework, and then it runs experiments at scale on a multi-GPU cluster via a Strategist/Worker loop with a persistent playbook that accumulates domain knowledge across experiments. In Phase 3 - the dispatcher keeps a large cluster fully utilized indefinitely with no human in the loop, and the playbook ends up containing domain-specific methodology that didn't exist anywhere in the prompts at launch. This talk walks through the three phases, what we learned from running campaigns with different models, what we have learned from using this in real systems, and future areas we are exploring.

- Loophole - Adversarial Agents To Stress Test Your Morality — Day 4 — Session Day 3 1:30pm-1:50pm

Most natural language specifications have holes their authors didn't notice - and writing more rules tends to create more holes. I built Loophole to try a different approach: point adversarial agents at a spec until it stops breaking. You give the system a set of natural language principles. An AI drafts a formal codified version. Two adversarial agents go to work - one finds cases the code permits but the principles forbid, the other finds cases the code forbids but the principles allow. A judge agent patches the code when it can, but only if the fix doesn't contradict any prior ruling. When a contradiction can't be resolved, it escalates to you. Every decision becomes binding precedent, so the constraint space tightens round after round. I started with moral and legal reasoning as the demo, and on its own that's already interesting - it turns into a kind of game where you discover contradictions in your own beliefs that you didn't know were there. But the pattern generalizes well past that. The same loop works for company policies that need to survive contact with edge cases. For making chatbot system prompts adversarially robust. For stress-testing eval rubrics. And, taking the long view, for something like a smarter legislative process - where proposed laws get checked against the public's stated values before they pass, and the contradictions surface before they hit a courtroom. The talk walks through how the harness works, the design choices that matter (especially why precedent is the load-bearing piece), what kinds of specs it handles well, where it breaks, and what it would take to push it further. All code is open source.

Brian Douglas

  • Role: CoFounder
  • Company: Paper Compute Company
  • Bio: Brian is the founder of the Paper Compute Company, an distributed systems primitives for AI agents.

Brian previously founded Open Sauced where he woreds on increasing the knowledge and insights of open-source communities. In the past he’s lead Developer Advocacy at GitHub by fostering a community of early adopters through conversations with the top open source maintainers on GitHub.

  • Twitter: https://x.com/bdougieYO
  • LinkedIn: https://linkedin.com/in/brianldouglas
  • Website: https://b.dougie.dev
  • Photo: /wf26/speakers/by-id/spk_brian_douglas.jpg
  • Sessions:

- Don't Write Skills, Train Models — Day 3 — Session Day 2 2:50pm-3:10pm

Every AI agent call generates training data. Most teams throw it away. They write skills files instead. Text documents that describe how to do a task and hope the model follows them at inference time. Skills work until they don't. The model drifts, skips steps, hallucinates a shortcut. So you rewrite the skill, add more constraints, hope harder. There's a better path. If you've used a skill enough to know what good output looks like, you already have training data. You just aren't using it. This talk covers what I learned building an open source fine-tuning pipeline that turns agent session traces into SFT and DPO training datasets. A telemetry proxy captures every LLM call as a content-addressed Merkle DAG with zero instrumentation. Successful sessions become supervised fine-tuning data. Pair them against failures, matched by goal category, and you get preference pairs for DPO. No manual labeling. No synthetic data. But training data quality depends on environment consistency. If the same agent produces different results because of package drift, nondeterministic toolchains, or inconsistent system state, your training signal is noise. This is where NixOS changes the equation. A hardened, reproducible OS means every agent session runs against an identical, declarative environment. Nix controls the variables that sandboxing alone doesn't: dependency graphs, system libraries, toolchain versions. When you can guarantee the environment is the same across hundreds of sessions, the behavioral signal in your traces is actually trustworthy. We'll walk through the full pipeline. How to rebuild parent-hash chains from a SQLite database and join facet metadata. How to filter to fully_achieved sessions and truncate 82k-token conversations down to 4k-6k training examples using summary context plus the last three turns. How to match success/failure pairs by goal category and exclude unclear_requirements failures so DPO learns from real agent mistakes, not ambiguous prompts. How QLoRA keeps VRAM low enough to train a 7B model on a single consumer GPU. And what happens when you try DPO on 12GB VRAM (two simultaneous forward passes for logprob computation will teach you about gradient accumulation settings fast). The result: a LoRA adapter trained on your own agent traces, in a reproducible environment, on a single consumer GPU, for less than $2 in cloud compute. No YAML. One config file. All code is open source.

- Don't Write Skills, Train Models (cont. 2/3) — Day 3 — Session Day 2 3:20pm-3:40pm

Continuation block 2 of 3 for Brian Douglas's workshop session.

- Don't Write Skills, Train Models (cont. 3/3) — Day 3 — Session Day 2 3:45pm-4:05pm

Continuation block 3 of 3 for Brian Douglas's workshop session.

Brian Lewis

  • Role: AI Product Lead
  • Company: Millennium
  • Bio: Brian Lewis is an AI Product Lead at Millennium. His WF26 session draws on evaluating more than 100 AI startups for enterprise adoption and focuses on which AI startups land enterprise contracts.
  • LinkedIn: https://www.linkedin.com/in/brianthomaslewis/
  • Sessions:

- Which AI startups actually land enterprise contracts? Lessons from evaluating 100+ AI startups at Millennium Management — Day 4 — Session Day 3 1:55pm-2:15pm

Selling your AI startup/product into a large enterprise is hard. I often sit on the buyer's side of the table at a large hedge fund. I've sat through 100+ AI startup pitches and am responsible for running the pilots that may eventually convert into your ARR. We'll cover what works, what doesn't, and what large enterprise customers need to see in order to choose 'buy' over 'build'.

Byung-Gon (Gon) Chun

  • Role: Founder & CEO
  • Company: FriendliAI
  • Bio: Founder and CEO of FriendliAI, an AI infrastructure company focused on efficient deployment and scaling of large language and multimodal models. Previously served as a professor at Seoul National University and held research roles at Facebook, Microsoft, Yahoo!, and Intel.
  • LinkedIn: https://www.linkedin.com/in/byung-gon-chun
  • Website: https://bgchun.github.io
  • Photo: /wf26/speakers/by-id/spk_byung_gon_gon_chun.jpg
  • Sessions:

- The Frontier AI Inference Cloud for Agents — Day 4 — Session Day 3 2:25pm-2:45pm

Agents have changed the economics of AI inference. A chatbot’s cost scales roughly linearly with the number of requests; an agent’s scales multiplicatively. A single task can fan out into hundreds of model calls, each carrying a repeated context prefix and adding latency that compounds across tool calls and reasoning steps. As open-weight models keep improving and agentic workloads grow, this shift exposes the limits of traditional request-level optimization. Inference infrastructure becomes a first-class concern, one that often shapes performance and cost as much as the model itself. In this talk, we explore what changes when you optimize for the whole task rather than the individual request, and how FriendliAI is rethinking the inference cloud for the era of agentic AI.

Carlos Sanchez

  • Role: Principal Scientist
  • Company: Adobe
  • Bio: Principal Scientist at Adobe Experience Manager, specializing in software automation and agentic applications. Involved in Open Source for over 20 years, he is the author of the Jenkins Kubernetes plugin and a member of the Apache Software Foundation amongst other open source groups, contributing to several projects, such as Kubernetes, Jenkins or Apache Maven.
  • Twitter: https://x.com/csanchez
  • LinkedIn: https://www.linkedin.com/in/carlossg/
  • Website: https://csanchez.org/
  • Blog: https://csanchez.org/
  • Photo: /wf26/speakers/by-id/spk_carlos_sanchez.jpg
  • Sessions:

- Agentic Sites: Building Hyper Personalized Websites — Day 3 — Session Day 2 3:20pm-3:40pm

The era of static, one-size-fits-all websites is over. Users expect personalized experiences that adapt to their preferences, context, and intent in real-time. But building truly personalized websites at scale requires more than just A/B testing or basic recommendation engines—it demands an agentic approach where AI agents autonomously orchestrate content, layout, and interactions. At Adobe, we are pioneering the concept of Agentic Sites—web experiences powered by AI agents that continuously learn from user behavior, analyze context signals, and dynamically compose hyper-personalized pages. These agents go beyond simple personalization rules: they reason about user intent, select optimal content variations, and adapt the experience in real-time while maintaining brand consistency and performance. In this session, we'll show how we leverage LLMs to deliver personalized experiences to our customers.

Carole Robin, Ph.D.

  • Role: Co-Founder
  • Company: Leaders in Tech
  • Bio: Carole Robin, Ph.D. is Co-Founder and Head of Programs at Leaders in Tech, a former Stanford Graduate School of Business lecturer in leadership, and co-author of Connect.
  • LinkedIn: https://www.linkedin.com/in/carole-robin
  • Website: https://leadersintech.org/team
  • Photo: /wf26/speakers/by-id/spk_carole_robin_ph_d.jpg
  • Sessions:

- Human Connection in the Age of AI — Day 1 — Workshop Day 5:00pm-6:00pm

Building AI safely requires both technical skills and interpersonal skills. A live demo of connection tools from Stanford's "Touchy Feely" course, then hands-on practice. Co-hosted with Leaders in Tech.

Carter Abdallah

  • Role: Senior Developer Tech
  • Company: NVIDIA
  • Bio: Founding Engineer at the NVIDIA aquired GPU dev tool Brev.dev. Now leads Agent Marketing and Experience, and internal OSS strategy at NVIDIA.
  • Twitter: https://x.com/Baxate
  • LinkedIn: https://www.linkedin.com/in/carter-abdallah
  • Website: https://baxate.com
  • Photo: /wf26/speakers/by-id/spk_carter_abdallah.jpg
  • Sessions:

- Local Models: Trust, Control, Optimization — Day 4 — Session Day 3 1:30pm-1:50pm

Local Models: Trust, Control, Optimization looks at why builders are choosing local AI for privacy, reliability, customization, cost, and ownership, while still asking where cloud remains necessary. The panel covers local-first versus hybrid strategies, the role of open-source models, and the infrastructure stacks making frontier-quality intelligence possible outside centralized APIs.

Moderator: Carter Abdallah (NVIDIA). Panelists: Vincent Weisser (Prime Intellect), Lucas Atkins (Arcee AI), Chris Alexiuk (NVIDIA), Lou (Z.ai).

- Local Models: Trust, Control, Optimization — Day 4 — Session Day 3 1:55pm-2:15pm

Local Models: Trust, Control, Optimization looks at why builders are choosing local AI for privacy, reliability, customization, cost, and ownership, while still asking where cloud remains necessary. The panel covers local-first versus hybrid strategies, the role of open-source models, and the infrastructure stacks making frontier-quality intelligence possible outside centralized APIs.

Moderator: Carter Abdallah (NVIDIA). Panelists: Vincent Weisser (Prime Intellect), Lucas Atkins (Arcee AI), Chris Alexiuk (NVIDIA), Lou (Z.ai).

Chaitanya Asawa

  • Role: Head of Engineering for Clinical Decision Support
  • Company: Abridge
  • Bio: Chaitanya leads agentic experiences & clinical decision support at Abridge, building the Jarvis for Clinicians. Previously he was one of the Founding Engineers at Glean where he built the Glean Assistant ground up technically and core teams. He started his career at Vicarious, an AI Research Lab focused on probabilistic methods & robotics.
  • Twitter: https://x.com/c_asawa
  • LinkedIn: https://www.linkedin.com/in/casawa
  • Photo: /wf26/speakers/by-id/spk_chaitanya_asawa.jpg
  • Sessions:

- From Ambient Documentation to Clinical Intelligence — Day 4 — Session Day 3 10:45am-11:05am

A practical session on how healthcare AI moves beyond ambient note generation into context-aware clinical decision support. The talk would cover grounding outputs in the patient encounter, surfacing evidence with citations inside clinician workflows, preserving clinician agency, and building rigorous evals for safety and trust in live healthcare environments.

Chang Liu

  • Role: Senior Product Manager
  • Company: Microsoft
  • Bio: Chang Liu is a Senior Product Manager at Microsoft working on Azure AI Foundry evaluation and agent quality tooling, including metrics for quality and safety in agentic applications.
  • Photo: /wf26/speakers/by-id/spk_chang_liu.jpg
  • Sessions:

- Tracing and debugging agents across systems with OpenTelemetry — Day 4 — Session Day 3 11:10am-11:30am

Understand what your agents are doing. Learn how to trace workflows across systems, debug issues, and uncover optimization opportunities using OpenTelemetry.

- Evaluating and optimizing AI agents: from observability to continuous improvement — Day 4 — Session Day 3 1:30pm-1:50pm

AI agents don’t behave like traditional systems. Learn how to evaluate outputs, trace behavior, and apply a continuous loop to improve performance across prompts, tools, and models. Using signals grounded in real-world context via Foundry IQ, see how evaluation, tracing, and optimization come together to turn production usage into measurable improvements over time.

Charles Frye

  • Role: Member of Technical Staff
  • Company: Modal
  • Bio: Charles Frye builds and teaches people to build AI applications. After publishing research in psychopharmacology and neurobiology, he got his Ph.D. at the University of California, Berkeley, for dissertation work on neural network optimization. He has taught thousands the entire stack of AI application development -- from linear algebra fundamentals and GPU arcana to building defensible businesses -- through work at Weights and Biases, Full Stack Deep Learning, and Modal.
  • Twitter: https://x.com/charles_irl
  • LinkedIn: https://www.linkedin.com/in/charles-frye-38654abb/
  • Website: https://charlesfrye.github.io
  • Photo: /wf26/speakers/by-id/spk_charles_frye.jpg
  • Sessions:

- What is an Inference Engine, Anyway? — Day 1 — Workshop Day 11:05am-12:05pm

To run state-of-the-art inference yourself, you must master the inference engine: vLLM, SGLang, TRT-LLM, or your own jawn. The inference engine manages the lifecycle of an inference request, from input to output. In this workshop, we'll examine the architecture of modern high performance inference engines, the key techniques that inference engines need to deliver that performance, and the traces and metrics that inference engines emit.

Charlie Dickens

  • Photo: /wf26/speakers/by-id/spk_charlie_dickens.jpg
  • Sessions:

- Towards Reliable Financial Agents: How a 4B Model Outsmarted a 235B Giant — Day 2 — Session Day 1 3:45pm-4:05pm

Large generalist models have excellent reasoning but this does not necessarily imply specialized knowledge and tool calling capabilities. They can still hallucinate column names, ignore constraints, and generate SQL that returns nonsensical results. The problem isn't intelligence it's reliability and specialization. In this talk we'll show how a 4B model was fine-tuned to outperform a 235B model on real financial analysis tasks. The key was not adding more reasoning ability, but enforcing tool discipline. Using synthetic data generation and reinforcement learning with the open-source rLLM framework, the model learned to explore schemas, validate outputs, and retry failures instead of hallucinating confident nonsense. One key result: tool-use fundamentals generalize. Training on simple tool interactions transferred to much harder, multi-step financial tasks. If you're building LLM systems that interact with databases, APIs, or internal tools, this talk focuses on the behaviors that actually matter and how to teach them without frontier-scale compute.

Charlie Guo

  • Role: Developer Experience Engineer
  • Company: OpenAI
  • Bio: Charlie Guo is a Developer Experience Engineer at OpenAI, where he helps developers build with the OpenAI API. He is also the author of Artificial Ignorance, an AI publication at the intersection of engineering and intelligence. Before joining OpenAI, Charlie spent more than a decade building products and internal tools, including as a startup founder. He is based in Berkeley, California.
  • Twitter: https://x.com/charlierguo
  • LinkedIn: https://www.linkedin.com/in/charlierguo
  • Website: https://www.ignorance.ai/
  • Blog: https://ignorance.ai/
  • Photo: /wf26/speakers/by-id/spk_charlie_guo.jpg
  • Sessions:

- Cooking with Codex — Day 1 — Workshop Day 9:00am-11:00am

Codex is changing how technical teams ship across the software development lifecycle, from feature implementation to code review and automation. But the real unlock comes when these practices move beyond a single workflow and become shared systems a team can trust.

In this hands-on session, you'll use Codex across real development and knowledge-work scenarios: structuring tasks, supervising agentic work, coordinating subagents, using plugins and MCPs, and combining Codex with OpenAI's frontier reasoning, coding, and multimodal models.

Bring your laptops and leave with reusable demos and a set of Codex recipes your team can adapt.

- Voice Agents Can Just Do Things — Day 2 — Session Day 1 11:40am-12:00pm

Too many voice AI integrations still treat speech as fancier chat: audio in, audio out. But we're at a point where speech can be a control plane for software, and most developers are unaware that voice has become a capability overhang. Current realtime models can understand intent, call tools, speak while work is underway, recover from corrections, and decide what the user actually needs to hear. As a result, we're seeing three practical patterns emerge: voice-to-action, systems-to-voice, and voice-to-voice. We’ll show how each pattern changes the architecture, where Realtime 2’s reasoning and tool-calling matter, and why chained STT / LLM / TTS systems start to break down as the interaction patterns become richer.

Charlie Holtz

  • Role: CEO
  • Company: Conductor
  • Bio: CEO + Co-Founder, Conductor
  • Twitter: https://x.com/charlieholtz
  • Website: https://www.conductor.build
  • Photo: /wf26/speakers/by-id/spk_charlie_holtz.jpg
  • Sessions:

- Orchestras, not Factories — Day 2 — Session Day 1 11:40am-12:00pm

Everything is Conductor now! I want to tell the story of how we came up with the original interface, what I think everyone (including us) is getting wrong and what's coming next.

Chengxi Taylor

  • Role: Co-founder & President
  • Company: General Reasoning Inc.
  • Bio: Co-founder & President at General Reasoning Inc. Building long-horizon AI systems, and evals research lead working with leading frontier labs. Previously CEO and Chief Engineer at Satori, CEO of MyMiniFactory.
  • Twitter: https://x.com/chengxitaylor
  • LinkedIn: https://www.linkedin.com/in/chengxi-taylor/
  • Website: https://www.chengxitaylor.com/
  • Photo: /wf26/speakers/by-id/spk_chengxi_taylor.jpg
  • Sessions:

- Scaling to Long-Horizons: Algorithms, Environments, Compute — Day 2 — Session Day 1 2:25pm-2:45pm

What does it take to scale language models to year long tasks? In this talk we'll cover the algorithm, environment and compute considerations for scaling language models to long horizons. We'll cover the latest reinforcement learning approaches, how to build hard, high-fidelity long-horizon environments, and how to build scalable infrastructure for these tasks.

Chris Alexiuk

  • Role: Sr. Product Research Engineer
  • Company: NVIDIA
  • Bio: Chris Alexiuk is a Sr. Product Research Engineer at NVIDIA, he is obsessed with everything and anything about large language models as well as Dungeons & Dragons.
  • Twitter: https://x.com/llm_wizard
  • LinkedIn: https://www.linkedin.com/in/csalexiuk
  • Website: https://www.alexi.uk/
  • Photo: /wf26/speakers/by-id/spk_chris_alexiuk.jpg
  • Sessions:

- Local Models: Trust, Control, Optimization — Day 4 — Session Day 3 1:30pm-1:50pm

Local Models: Trust, Control, Optimization looks at why builders are choosing local AI for privacy, reliability, customization, cost, and ownership, while still asking where cloud remains necessary. The panel covers local-first versus hybrid strategies, the role of open-source models, and the infrastructure stacks making frontier-quality intelligence possible outside centralized APIs.

Moderator: Carter Abdallah (NVIDIA). Panelists: Vincent Weisser (Prime Intellect), Lucas Atkins (Arcee AI), Chris Alexiuk (NVIDIA), Lou (Z.ai).

- Local Models: Trust, Control, Optimization — Day 4 — Session Day 3 1:55pm-2:15pm

Local Models: Trust, Control, Optimization looks at why builders are choosing local AI for privacy, reliability, customization, cost, and ownership, while still asking where cloud remains necessary. The panel covers local-first versus hybrid strategies, the role of open-source models, and the infrastructure stacks making frontier-quality intelligence possible outside centralized APIs.

Moderator: Carter Abdallah (NVIDIA). Panelists: Vincent Weisser (Prime Intellect), Lucas Atkins (Arcee AI), Chris Alexiuk (NVIDIA), Lou (Z.ai).

- Compression at the Edge — Day 4 — Session Day 3 2:25pm-2:45pm

Compression at the Edge examines how smaller weights, faster inference, and constrained-memory deployments are making capable local AI more practical. The panel explores where compressed models already beat cloud on latency, privacy, cost, or control, what breakthroughs would unlock broader adoption, and how open model tooling is shaping the edge AI stack.

Moderator: Chris Alexiuk (NVIDIA). Panelists: Daniel Han (Unsloth), Asma Beevi (NVIDIA), Merve Noyan (Hugging Face), Michael Chiang (Ollama).

- Compression at the Edge — Day 4 — Session Day 3 2:50pm-3:10pm

Compression at the Edge examines how smaller weights, faster inference, and constrained-memory deployments are making capable local AI more practical. The panel explores where compressed models already beat cloud on latency, privacy, cost, or control, what breakthroughs would unlock broader adoption, and how open model tooling is shaping the edge AI stack.

Moderator: Chris Alexiuk (NVIDIA). Panelists: Daniel Han (Unsloth), Asma Beevi (NVIDIA), Merve Noyan (Hugging Face), Michael Chiang (Ollama).

Chris Souza

  • Company: Google
  • Sessions:

- Model Whisperers: How Evals and Prompts Shape Agent Behavior — Day 3 — Session Day 2 1:30pm-1:50pm

Getting an AI agent to behave the way you want isn’t just about writing better prompts. In real systems, behavior emerges from a loop: prompts->evals->iteration->feedback. Small changes in any part of that loop can completely change outcomes. We saw this while building a seed asset agent - a system that turns messy, real-world advertising creatives (low quality images, cluttered visuals, heavy text overlays) into clean, reusable assets for downstream Gen AI tools. The agent acts like an editor, simplifying visuals, removing unnecessary elements, and isolating core content so that additional context (like text or CTAs) can be added back in a more controlled, brand-safe way. But the real challenge wasn’t just building the agent - it was making it reliable. And prompting alone wasn’t enough. What actually moved the system forward was how we defined success—and how we used evals to reinforce it. Over time, evals stopped being just a way to measure quality. They became part of how the agent learned what “good” looks like. In this talk, we’ll cover: Why prompting alone doesn’t give you stable agent behavior How evals act like feedback signals, not just scorecards How we built evals sets that reflect the real-world Using agent trace logs to understand why things fail (not just that they fail) How to iterate without breaking things you already fixed By the end, you’ll have a set of patterns you can apply to any system dealing with messy/continuously changing data and how to tweak your prompt and evals to accommodate such changes.

Christopher Burns

  • Role: Founder
  • Company: Inth
  • Bio: Christopher Burns is the founder of Inth, building developer-first privacy compliance infrastructure for modern software teams. Inth started with c15t, an open-source consent SDK with 2.6M npm downloads, used by teams including Vercel, Cal.com, Zed, Infisical, Sanity, and others. Inth helps companies move privacy compliance closer to the product itself, giving developers infrastructure for consent, data rights, policy enforcement, and evidence instead of slow dashboard-first tooling. Christopher is a second-time founder and first-time YC founder. His previous company, Everfund, built enterprise nonprofit donation infrastructure and exposed the compliance problems that led to Inth: the deepest privacy failures are usually not in policy docs, but in the product itself.
  • Twitter: https://x.com/burnedchris
  • LinkedIn: https://linkedin.com/in/burnedchris
  • Website: https://burnedchris.com
  • Blog: https://burnedchris.com
  • Photo: /wf26/speakers/by-id/spk_christopher_burns.jpg
  • Sessions:

- How We Got LLMs to Recommend Our Open Source Library (Without Paying or Plug-ins) — Day 4 — Session Day 3 1:55pm-2:15pm

Over the past year, we’ve seen a new distribution channel emerge: AI assistants. Instead of SEO, ads, or integrations, developers are discovering tools through models like Claude. In this talk, I’ll break down how we got our open source library recommended organically by LLMs in under a year, without plugins, paid placements, or partnerships. We’ll cover what actually influences model outputs today, how developer-first products behave differently in this channel, and the practical steps we took to make our project show up when it matters. This is not theory. It’s a real case study of how distribution is changing, and how you can design your product and content to be picked up by AI systems directly.

Christopher Lovejoy

  • Role: Member of Technical Staff
  • Company: Anthropic
  • Bio: Member of Technical Staff at Anthropic. Previously Anterior, Billions Health, Medical Doctor.
  • Twitter: https://x.com/ChrisLovejoy_
  • LinkedIn: https://linkedin.com/in/dr-christopher-lovejoy
  • Website: https://www.chrislovejoy.me
  • Blog: https://www.chrislovejoy.me
  • Photo: /wf26/speakers/by-id/spk_christopher_lovejoy.jpg
  • Sessions:

- Why Your Enterprise Tech Stack Isn't Ready for AI Agents - And What to Build Instead — Day 4 — Session Day 3 3:45pm-4:05pm

Agent-executed work is a new infrastructure primitive. Until you treat it that way, you're running a demo, not enterprise AI. Your existing stack was built for deterministic software. Agents reason, delegate, and make judgment calls. That distinction creates infrastructure problems most engineering teams haven't confronted: security vulnerabilities baked in by design, no audit trail, no explainability, no human-in-the-loop. At Anterior, we've deployed clinical AI agents across many of the largest US health plans, covering 50 million lives. Healthcare, with high stakes, strict regulation, deeply human workflows, exposes infrastructure gaps that exist everywhere - and makes the paradigm shift unavoidable: agent-executed work as a first-class primitive, alongside compute, storage, and APIs. We'll cover why bolting agents onto existing data pipelines fails, what infrastructure primitives are missing (and why teams don't notice until an audit), and how to architect a stack where security, compliance, and human oversight are load-bearing from day one. If you're serious about agents in any mission-critical context, this is the infrastructure conversation you need to have.

Christopher Manning

  • Role: Distinguished Member of Technical Staff
  • Company: Moonlake AI
  • Bio: Christopher Manning is a Distinguished Member of Technical Staff at Moonlake AI, inaugural Thomas M. Siebel Professor in Machine Learning in the Departments of Linguistics and Computer Science at Stanford University, a senior fellow of the Stanford Institute for Human-Centered Artificial Intelligence (HAI), and a General Partner at AIX Ventures. He served as the Director of the Stanford AI Lab (SAIL) 2018–2025. Chris was a leader in Statistical NLP in the 1990s and 2000s and then a pioneer in deep learning Natural Language Processing (NLP) from 2010, work for which he has received three consecutive ACL Test of Time Awards (for 2013–2015) and the IEEE John von Neumann Medal. He is the most cited NLP researcher in the world, and he was elected to the National Academy of Engineering and the American Academy of Arts and Sciences in 2025.
  • Twitter: https://x.com/chrmanning
  • LinkedIn: https://www.linkedin.com/in/christopher-manning-011575
  • Website: https://nlp.stanford.edu/~manning/
  • Photo: /wf26/speakers/by-id/spk_christopher_manning.jpg
  • Sessions:

- Building the simulation infrastructure for practical world model use — Day 3 — Session Day 2 10:45am-11:05am

What is the most important capability for world model applications and the pursuit of embodied AI? We believe it is not a question of having the most beautiful pixels but the ability to reason about causality in multimodal environments. At Moonlake, we are working on building action-conditioned multimodal world models which provide spatial and physical state consistency over long time periods. We believe that building and training on synthetic worlds provides the data and compute efficient path to truly useful world models. We are building the simulation infrastructure platform for companies that need to build and manage worlds (assets, scenes, digital twins) at scale, including robotics/autonomy teams, digital factory operators, and game authors. Our product today primarily finds applicability in simulation and the operationalization of digital twins. Simulation can include training robotics, world models for AGI research, autonomous vehicles, or content creation for media and entertainment. Operationalization of digital twins involves the reconstruction of scans into reusable assets, e.g., turning image and point-cloud scans into sim ready assets for digital factory Integration projects. We are building toward a future where AI systems do not just generate worlds, but understand how they work. Moonlake learns from each workflow: The more workflows, failures, and human interventions that Moonlake sees, the better it becomes at reconstructing, validating, and preparing complex simulation worlds. The session will include discussion and demos.

- Building the simulation infrastructure for practical world model use (Part 2) — Day 3 — Session Day 2 11:10am-11:30am

What is the most important capability for world model applications and the pursuit of embodied AI? We believe it is not a question of having the most beautiful pixels but the ability to reason about causality in multimodal environments. At Moonlake, we are working on building action-conditioned multimodal world models which provide spatial and physical state consistency over long time periods. We believe that building and training on synthetic worlds provides the data and compute efficient path to truly useful world models. We are building the simulation infrastructure platform for companies that need to build and manage worlds (assets, scenes, digital twins) at scale, including robotics/autonomy teams, digital factory operators, and game authors. Our product today primarily finds applicability in simulation and the operationalization of digital twins. Simulation can include training robotics, world models for AGI research, autonomous vehicles, or content creation for media and entertainment. Operationalization of digital twins involves the reconstruction of scans into reusable assets, e.g., turning image and point-cloud scans into sim ready assets for digital factory Integration projects. We are building toward a future where AI systems do not just generate worlds, but understand how they work. Moonlake learns from each workflow: The more workflows, failures, and human interventions that Moonlake sees, the better it becomes at reconstructing, validating, and preparing complex simulation worlds. The session will include discussion and demos.

Clare Liguori

  • Role: Senior Principal Engineer
  • Company: Amazon Web Services
  • Bio: Clare Liguori is a Senior Principal Engineer at Amazon Web Services (AWS), where she works on all things agentic AI. She primarily focuses on Kiro and Strands Agents SDK. Clare is also a core maintainer for the Model Context Protocol (MCP) specification.
  • Twitter: https://x.com/clare_liguori
  • LinkedIn: https://www.linkedin.com/in/clareliguori/
  • Website: https://clare.dev/
  • Blog: https://clare.dev/
  • Photo: /wf26/speakers/by-id/spk_clare_liguori.jpg
  • Sessions:

- From AI-Assisted to AI-Native: Building a Frontier Development Team — Day 2 — Session Day 1 2:50pm-3:10pm

When features that took two weeks now ship in an afternoon, the bottleneck shifts from writing code to making decisions. Frontier teams have discovered this firsthand, achieving 3-10x productivity gains by fundamentally rethinking how developers work with AI agents. This talk covers the practices that separate frontier teams from those who merely "sprinkle" AI on their existing workflows: running agents asynchronously for hours, investing in comprehensive agent steering files, enabling local integration testing for agent self-correction, and automating everything from coding to operations to documentation. You'll learn how teams at Amazon slowed down to speed up, the temporary productivity dips they accepted, and the organizational changes required to sustain this velocity.

Clay Cockrell

  • Role: Co-Founder
  • Company: CoupleWork AI
  • Bio: Clay Cockrell, LCSW is a pioneering psychotherapist with over 30 years of experience. He is the founder of Walk and Talk Therapy, conducting therapy sessions in NY’s Central Park and is an exited founder of The Online Counseling Directory (www.onlinecounseling.com).

Clay has written several articles for The Guardian and made media appearances on Good Morning America, NPR’s Wait, Wait, Don’t Tell Me, CBS’s The Doctors, and The Happiness Lab. His work and expertise have been highlighted in major publications including The Wall Street Journal, The New York Times, New York Magazine, Bloomberg Businessweek, The Financial Times, and The Times of London.

As the co-founder of CoupleWork, Clay is at the forefront of integrating artificial intelligence into relationship coaching. The app, featuring Maxine, an AI relationship coach, provides couples and individuals with evidenced based and personalized guidance to navigate relationship challenges.

  • LinkedIn: https://www.linkedin.com/in/clay-cockrell-906b0b4/
  • Website: https://www.walkandtalk.com/
  • Photo: /wf26/speakers/by-id/spk_clay_cockrell.jpg
  • Sessions:

- Al is becoming the World's largest Relationship Therapist. We Can't Afford to Get it Wrong. — Day 4 — Session Day 3 1:30pm-1:50pm

Millions of people are now turning to AI for relationship advice and emotional support, often before they'd ever consider a human therapist. Most of the AI Therapy that is available is without clinical oversight, ethical frameworks, or any serious reckoning with what it means to intervene in the most intimate and vulnerable space in a person's life. People are getting hurt. As a couples therapist with 30 years experience, I teamed up with the former CTO at S&P and we created CoupleWork, an AI relationship therapist I essentially trained on three decades of clinical knowledge and every evidence-based modality that exists. Our voice interactive AI, Maxine, is proving this can be done responsibly and very effectively. And what we're learning about the nature of love, connection, and human vulnerability at scale is something this industry needs to hear. I also want to talk about what comes next: the regulatory frameworks that don't yet exist, the liability questions nobody is answering, and why the therapists who should be leading this conversation are almost entirely absent from it.

Cody Menefee

  • Role: Success Engineer
  • Company: Firecrawl
  • Bio: Success Engineer at Firecrawl, focused on making Firecrawl the default web access layer for agents. Creator of OpenPasture, an open-source project applying AI to pasture-based agriculture and helping farmers raise more animals on grass. Mission: better tools for farmers, better lives for animals, better food for people.
  • Twitter: https://x.com/cbmenefee
  • LinkedIn: https://linkedin.com/in/codybmenefee
  • Website: https://openpasture.dev
  • Blog: https://openpasture.dev
  • Photo: /wf26/speakers/by-id/spk_cody_menefee.jpg
  • Sessions:

- You’re Not Thinking Big Enough: Rebuilding Food Systems from First Principles with AI Agents — Day 2 — Session Day 1 2:25pm-2:45pm

Most of the AI world is still thinking too small. We’re building SaaS wrappers and GTM agents while real-world systems are still run through fragmented knowledge, delayed feedback, and human guesswork. In this talk, I’ll show how I’m building an outdoor agentic system for pasture-raised livestock operations using LLMs, a Firecrawl-curated knowledge base, drone and satellite imagery, and geo collars to monitor pasture, guide animal movement, and support better decisions across cattle, sheep, poultry, and more. I’ll cover the architecture, retrieval and grounding, human approval loops, and what broke first: hallucinated confidence, weak environmental grounding, sparse evals, and the gap between a smart answer and a safe action. It’s a case study in building agents for the physical world, and a broader argument that AI’s real upside is in rethinking real-world systems from first principles.

Corby Rosset

  • Role: Senior Researcher
  • Company: Microsoft Research
  • Bio: Corby Rosset is a Senior Researcher at Microsoft Research studying the intersection of large language models and search/retrieval systems. His work includes conversational search, Bing's People Also Ask feature, and recent verifier research for computer-use agents.
  • Twitter: https://x.com/corby_rosset
  • Website: https://corbyrosset.com
  • Photo: /wf26/speakers/by-id/spk_corby_rosset.jpg
  • Sessions:

- The Art of Building Verifiers for Computer Use Agents — Day 4 — Session Day 3 11:40am-12:00pm

Every team building browser agents has the same problem: you can't trust your own evals. Browser tasks are too open-ended for deterministic checks, so teams use LLM verifiers as judges, and the judges are wrong constantly. WebVoyager misses 45% of failures. WebJudge misses 22%. Used as RL reward, you're not training a better agent, you're training a more confident liar. This talk walks through the Universal Verifier, open-sourced with Microsoft Research: false positive rate near zero, Cohen's κ matching human-human agreement. Four design principles, one open benchmark, and an honest account of where auto-research worked and where it plateaued.

Corey Gallon

  • Role: Managing Director
  • Company: Rexmore
  • Bio: Corey Gallon is Managing Director of Rexmore, an AI-native holding company that is building, buying and transforming businesses with AI. He's an experienced AI engineer focused on shipping real, maintainable software with coding agents. Previously he was: - Chief Innovation Officer of PwC's Commercial Technology business - Adjunct Professor of Graduate AI & Machine Learning at Loyola University Chicago - Agentic-coding OG: a primary contributor to, and board member, of GPT-Engineer (the open source project that became Lovable) Corey is an artisan roaster and brewer of specialty coffee (ask him about his flat white game) and a pickleball fanatic.
  • Twitter: https://x.com/CoreyGallon
  • LinkedIn: https://www.linkedin.com/in/coreygallon
  • Website: https://gallon.me
  • Blog: https://gallon.me
  • Photo: /wf26/speakers/by-id/spk_corey_gallon.jpg
  • Sessions:

- The Dark Arts of Web Automation: Teaching Agents to Use Websites Like Humans — Day 3 — Session Day 2 12:05pm-12:25pm

Anything you can do in a browser, your agent can do too. Not by tiptoeing through an MCP server one polite, token-burning call at a time -- properly, programmatically, the way you'd drive any other tool. I'll show you how with chrome-agent, an open source wrapper over the Chrome DevTools Protocol that has become irreplaceable in my everyday work. If you'll ever do a browser task more than once, step-by-step MCP browsing is slow, brittle, and bills you tokens for every single click. A CLI straight onto CDP makes the whole browser programmable: loop it, pipe it, script it, walk away. Write it Tuesday, run it a thousand times Wednesday, all without a second of AI agent babysitting. We'll dispel the MCP hype and myths, with successful demonstrations of cheeky things like: the power of CLI-based browsing and how its so much more capable than mere MCP; reaching through those oh-so-clever cross-origin iframes to clear the verify you're human checkboxes; showing that a JavaScript .click() is not a click, rather, just a function call in a costume that is banhammerable; ultimately, proving that a CDP browser operates just like a meatbag with a mouse and keyboard. You'll learn how to point your AI agents at real, messy, uncooperative websites and web applications and have them get things done exactly the way that you would.

Cormac Brick

  • Role: Principal Engineer, Google AI Edge
  • Company: Google
  • Bio: Principal Engineer at Google working on edge AI. Lead on the Google AI Edge team and contributes to Gemma Model development. An early pioneer in Edge AI, his industry background includes leading Intel’s first laptop NPUs, as well as demonstrating edge AI on a USB key way back in NeurIPS 2016. These days he is technical lead for a team that drives multiple open source projects (Google AI Edge Gallery, LiteRT-LM, LiterRT, Xnnpack, Mediapipe) that enable Edge AI for billions of users.
  • Twitter: https://x.com/cormacb
  • LinkedIn: https://www.linkedin.com/in/cbrick/
  • Photo: /wf26/speakers/by-id/spk_cormac_brick.jpg
  • Sessions:

- Why Large? Tiny LMs & Agents on Edge/Robotics — Day 3 — Session Day 2 2:50pm-3:10pm

big models get a lot of press. small model scale much better. RAM is expensive. The real world needs tiny models for scale on the edge. This workshop will cover how to combine both for mobile and robotics deployment. specifically covering: - skills are different on mobile - tiny LLMs <1B scale much further on mobile/web - how to fine tune and train tiny models. - skills on robotics / edge/ mobile - latest open models for edge (including gemma, qwen, and anything else that happens in next 10 weeks) This talk will focus on open models, including some gemma variants that will be shortly announced.

Cornelia Davis

  • Role: Principal Technologist
  • Company: Temporal
  • Bio: Cornelia's career has spanned several major shifts in software, from image processing algorithm development to web-centric computing in the late 90s, and then more than a decade working in cloud-native software, infrastructure and platforms (Cloud Foundry, Kubernetes and friends). Those experiences in distributed systems, combined with a longstanding interest in programming models, led her to Temporal where she is helping to bring a new programming paradigm to an industry that was increasingly in need of one - a need that has accelerated dramatically with the advent of modern AI systems. Much of her work today focuses on the architectural needs and evolving practices of these AI systems. Her current research explores asynchronous processing and the development of AI-native distributed systems abstractions, with an emphasis on the emerging patterns and programming models shaping this new era of software. She is the author of Cloud Native Patterns: Designing Change-Tolerant Software.
  • Twitter: https://x.com/cdavisafc
  • LinkedIn: https://www.linkedin.com/in/corneliadavis/
  • Photo: /wf26/speakers/by-id/spk_cornelia_davis.jpg
  • Sessions:

- MCP Tasks (async)/ Why the heck aren't any agents supporting MCP tasks/async? — Day 3 — Session Day 2 3:20pm-3:40pm

The November 2025 MCP spec release introduced tasks, a way to make tool calls in an async manner. But more than 5 months later (an eternity in AI-time) there are still NO clients that support it - not Claude, not Codex, not even goose! I believe there are two reasons: Designing the client experience when there are potentially 1000s of background tasks running on their own schedule and engaging humans at unpredictable times is a challenge. And tasks place new infrastructure requirements on such a client. This talk will share the findings from having built against the tasks protocol and will suggest solutions these problems. Yup, we'll have a working client!

Cyrus Clarke

  • Role: Researcher
  • Company: MIT Media Lab
  • Bio: Cyrus Clarke is an award-winning designer and technologist at MIT Media, exploring intelligence that is embodied, sensory and expressive. His recent embodied AI work, "I Gave an AI a Body", has reached over 15 million viewers globally. He also led HARD MODE, MIT's first Hardware x AI hackathon, which brought 200+ builders together in March 2026.
  • Twitter: https://x.com/cyrusclarke
  • LinkedIn: https://linkedin.com/in/cyrusclarke
  • Website: https://cyrus.website
  • Blog: https://cyrusclarke.substack.com/
  • Photo: /wf26/speakers/by-id/spk_cyrus_clarke.jpg
  • Sessions:

- I gave an AI a body — Day 3 — Session Day 2 3:45pm-4:05pm

I gave an AI a body. Not a body in the fleshy sense, or even a humanoid shell, but a form through which it can express itself, explore itself, and maybe even discover who or what it is. The three videos I've released documenting my encounters have crossed 15 million views, provoking responses from awe to anxiety. The body was a 900-pin shape display at MIT Media Lab. The idea was simple in principle, strange in practice: install an AI agent on the connected machine, give it access to the codebase, and rather than telling it what to do, ask it to discover itself through the physical form. Its first deliberate act was to breathe. The whole grid rising and falling. Hypnotically. Then it reached for its own edges. When asked to say hello it spelled "H-I, C-Y-R-U-S !", defaulting to the most familiar human legible symbols it knows. Inspired by Ted Chiang's Story of Your Life, I wanted a language the agent could create itself. It proposed a vocabulary of its own gestures, built through a learning loop it named BODYLAB. The talk is about encountering another intelligence, and what I learned along the way: the memory architecture, the closed-loop pipeline that generates, scores and stores gestures, the validation gates that keep them legible, and the moments stranger than tool use, where an LLM not developed for motion learns what to do with a body.

Daksh Gupta

  • Role: co-founder/CEO
  • Company: Greptile
  • Bio: Daksh is the co-founder/CEO of Greptile. Greptile is building AI agents that review code changes for 7,000+ companies like Nvidia, Coinbase, and Scale. It has raised $30M from Benchmark, YC, Paul Graham and others. Before this, Daksh studied computer science at Georgia Tech
  • Twitter: https://x.com/greptile
  • LinkedIn: https://www.linkedin.com/company/greptile/posts/?feedView=all
  • Website: https://www.greptile.com
  • Blog: https://www.greptile.com
  • Photo: /wf26/speakers/by-id/spk_daksh_gupta.jpg
  • Sessions:

- What we learned by analyzing 1M AI-generated PRs — Day 2 — Session Day 1 12:05pm-12:25pm

We analyzed >1M end-to-end AI generated PRs reviewed by Greptile to understand what types of bugs they tend to create and some strategies on mitigating them. For instance, did you know that Claude Code is nearly 3X more likely than Codex to introduce auth bypass vulnerabilities?

Dan Adler

  • Company: Sourcegraph
  • LinkedIn: https://www.linkedin.com/in/danielnealadler
  • Blog: https://sourcegraph.com/blog/a-note-from-dan
  • Photo: /wf26/speakers/by-id/spk_dan_adler.jpg
  • Sessions:

- The Enterprise Agentic Gap: When Developer-Level AI Tools Hit Millions of Lines — Day 2 — Session Day 1 10:45am-11:05am

Agentic coding tools have transformed individual developer workflows but owning a large codebase with millions of interdependent lines across multiple code hosts is a different problem entirely. Off-the-shelf AI coding tools weren't built for it, and at scale, they break down in ways that aren't obvious until you're already in trouble. This talk covers the failure modes you'll hit when applying developer-level agentic tools to enterprise-scale migrations, and how Sourcegraph's agentic migrations solution was built to solve what others couldn't.

Dan Bălăceanu

  • Role: Chief Product Officer and Co-Founder
  • Company: DRUID AI
  • Bio: Dan Bălăceanu is Chief Product Officer and Co-Founder at DRUID AI. He works on enterprise AI agents and conversational automation, with a background in business, IT, artificial intelligence, language processing, and large-scale software development.
  • Photo: /wf26/speakers/by-id/spk_dan_b_l_ceanu.jpg
  • Sessions:

- Would your AI agent get the job? A performance review framework for enterprise agents — Day 2 — Session Day 1 11:40am-12:00pm

There are dozens of ways to build an enterprise AI agent: agentic frameworks, direct LLM APIs, conversational AI platforms, vertical SaaS. They all claim to do the job. But how do you actually compare them on the same task, with the same data, against the same KPIs? This session presents a vendor-agnostic evaluation framework that treats AI agents the way enterprises treat new hires: set the role, define success criteria, run candidates through identical scenarios, and measure outcomes. The architecture uses any LLM to track positive and negative drift across agents against weighted goals, monitoring everything from hallucination rates and token consumption to user sentiment and conversation quality. Inputs are standardized. Outputs are both quantitative (accuracy, cost, hours saved) and qualitative (tone, clarity). The methodology supports continuous evaluation, not just pre-deployment benchmarks, but ongoing performance reviews that can compare agent work against human baselines. Walk away with a concrete, repeatable process for answering the only question that matters: which agent actually does the job?

Dan Bjornn

  • Role: Senior Data Scientist
  • Company: Lease End
  • Bio: Dan Bjornn is a Senior Data Scientist at Lease End, where he builds AI systems behind the company's sales pipeline. He led the development of an SMS sales agent that generated over $12M in revenue last year, with work spanning fine-tuning, agent architecture, prompt engineering, and evaluations. Prior to Lease End, Dan worked at several early-stage startups. Most notably, he was an early employee at NeuroID, where he helped lay the technical foundations the company carried through to its acquisition by Experian. Dan holds a PhD in cognitive neuroscience and applies it directly to his work designing AI agents with human cognition as the reference model.
  • LinkedIn: https://linkedin.com/in/dkbjornn
  • Photo: /wf26/speakers/by-id/spk_dan_bjornn.jpg
  • Sessions:

- Your Fine-Tuned Model Is Tech Debt: A 50x ROI House of Cards — Day 2 — Session Day 1 3:20pm-3:40pm

We built an AI application on top of fine-tuned models that generated $12M in revenue at 50x ROI. It was fast, cheap, and impressively accurate. Then it started having problems. Small errors accumulated. The model misread intent and nuance, handling conversations wrong. But retraining was too costly to justify for each fix, so known bugs piled up until we hit critical mass. Each retraining cycle took a week end-to-end, most of it spent curating data and validating our classification pipeline. And fixes caused whack-a-mole regressions across intents that required multiple iterations per cycle. Over time, the model became increasingly rigid. Each retraining was harder than the last. Then our team started using Claude Code, and we realized context management was the real lever, not model specialization. We rebuilt on frontier models using well-crafted system prompts and progressive context management, feeding the agent only what it needs when it needs it. Adjustments that used to require a week-long retraining cycle now take a small context change. Fine-tuning should be a last resort, not a first instinct. The cases where it's the right call are far fewer than they used to be. Before you fine-tune, ask: can I solve this with better context instead?

Dan Farrelly

  • Role: CTO and Co-founder
  • Company: Inngest
  • Bio: Dan Farrelly is CTO and co-founder of Inngest, a platform for durable serverless functions, workflows and agent orchestration. He was previously CTO at Buffer and created developer tools including Timezone.io and MailDev.
  • LinkedIn: https://www.linkedin.com/in/djfarrelly
  • Photo: /wf26/speakers/by-id/spk_dan_farrelly.jpg
  • Sessions:

- Your agent architecture has a half-life of 6 months — Day 3 — Session Day 2 12:05pm-12:25pm

A short history of the right way to build an agent: RAG, ReAct, prompt chaining, orchestrator-workers, MCP, CLI, MCP again... CLI again?? Every time you adopt a trend you rebuild your architecture. In this talk, Dan Farrelly, Inngest cofounder and CTO, is not going to tell you what comes next. He's going to show you how to build so it doesn't matter. He'll cover the core primitives that show up in every production agent, how bringing decisions closer to code provides more stack flexibility, and why the right execution layer unlocks faster iteration.

Dan Feng

  • Role: Senior Director of Engineering
  • Company: Maven Clinic
  • Bio: Dan Feng is Senior Director of Engineering at Maven Clinic, leading AI Platform and Core Member Experience. He oversees the company’s AI infrastructure and member-facing systems, helping teams build production AI applications across healthcare experiences. His work focuses on AI agents, LLM platforms, personalization, and scalable generative AI architectures. At Maven, he leads the company-wide AI initiative, driving adoption of AI across products, operations, and care delivery.
  • LinkedIn: https://www.linkedin.com/in/dan-feng-2bb5703/
  • Photo: /wf26/speakers/by-id/spk_dan_feng.jpg
  • Sessions:

- How to build an AI-Native Health Company — Day 4 — Session Day 3 2:50pm-3:10pm

Most healthcare technology companies were built for a different era. Transitioning to an AI-native organization isn't just about adopting new tools — it requires rethinking culture, processes, and how teams work at every level. This talk draws on firsthand experience leading that transformation at a digital health company. We'll cover what it takes to foster an AI-first culture across departments, and go deep on the engineering side: adopting AI-assisted development practices, building shared AI infrastructure, and evolving the product development process to unlock 2–3x productivity gains. We'll also tackle the harder, less-discussed challenge — the mindset shift required to operate effectively in a domain that's changing faster than any playbook can keep up with. Whether you're just starting this journey or already mid-transition, you'll walk away with concrete lessons on what works, what doesn't, and how to build an organization that compounds on AI rather than just experiments with it.

Dan Fu

  • Role: VP of Kernels
  • Company: Together AI
  • Bio: VP of Kernels at Together AI and Assistant Professor of Computer Science and Engineering at UC San Diego, focused on efficient machine learning systems and GPU performance.
  • Twitter: https://x.com/realDanFu
  • Photo: /wf26/speakers/by-id/spk_dan_fu.jpg
  • Sessions:

- Agents at Scale: Inside MiniMax's Model and the Infrastructure Behind It — Day 3 — Session Day 2 2:50pm-3:10pm

Olive Song (RL Lead, https://www.minimax.io/) and Dan Fu (VP of Kernels, https://www.together.ai/) dig into the engineering behind one of the most widely used open model families in the agent ecosystem: how MiniMax built the model for agentic workloads, and what it takes to serve it at scale.

Olive on the model side:

The RL decisions behind long-context reasoning and tool use

What training for agentic behavior actually looks like in practice

Dan on the infrastructure side:

Why agentic workloads break inference engines built for chat: prefill-heavy traffic, high cache hit rates, long-context inputs

The kernel-level optimizations built for MiniMax's workload profile

How the two teams collaborate on model launches and ongoing performance work

Dan Ndombe

  • Role: Staff Developer Success Advocate
  • Company: Docker
  • Bio: Dan Ndombe is a Staff Developer Success Advocate at Docker who helps developers build and ship software faster. He is a two-time founder and former engineer-turned-product manager with experience at Netflix, Pinterest, and Calm.
  • Photo: /wf26/speakers/by-id/spk_tbd_docker.jpg
  • Sessions:

- From approval loops to autonomous agents with Docker — Day 1 — Workshop Day 12:10pm-1:10pm

"You've invested in the best models, coding agents, and AI tooling. Now comes the hard part: unlocking autonomous development without creating security headaches, governance gaps, or endless approval loops.

In this 90-minute hands-on workshop, you'll learn how to run coding agents in isolated environments built for autonomous work, create a 'golden path' for AI-assisted development across your organization, reduce software supply chain risk with secure, hardened containers, manage multiple agents with the right permissions and guardrails, and scale AI-powered development without slowing developers down."

Daniel Bump

  • Role: Engineer
  • Company: Google
  • Bio: Engineer at Google. Focus area: image/video generation and computer vision.
  • Twitter: https://x.com/DanielJBump
  • LinkedIn: https://www.linkedin.com/in/danielbump
  • Photo: /wf26/speakers/by-id/spk_daniel_bump.jpg
  • Sessions:

- Model Whisperers: How Evals and Prompts Shape Agent Behavior — Day 3 — Session Day 2 1:30pm-1:50pm

Getting an AI agent to behave the way you want isn’t just about writing better prompts. In real systems, behavior emerges from a loop: prompts->evals->iteration->feedback. Small changes in any part of that loop can completely change outcomes. We saw this while building a seed asset agent - a system that turns messy, real-world advertising creatives (low quality images, cluttered visuals, heavy text overlays) into clean, reusable assets for downstream Gen AI tools. The agent acts like an editor, simplifying visuals, removing unnecessary elements, and isolating core content so that additional context (like text or CTAs) can be added back in a more controlled, brand-safe way. But the real challenge wasn’t just building the agent - it was making it reliable. And prompting alone wasn’t enough. What actually moved the system forward was how we defined success—and how we used evals to reinforce it. Over time, evals stopped being just a way to measure quality. They became part of how the agent learned what “good” looks like. In this talk, we’ll cover: Why prompting alone doesn’t give you stable agent behavior How evals act like feedback signals, not just scorecards How we built evals sets that reflect the real-world Using agent trace logs to understand why things fail (not just that they fail) How to iterate without breaking things you already fixed By the end, you’ll have a set of patterns you can apply to any system dealing with messy/continuously changing data and how to tweak your prompt and evals to accommodate such changes.

Daniel Chalef

  • Role: Founder and CEO
  • Company: Zep AI
  • Bio: Daniel Chalef is the founder and CEO of Zep, agent memory at enterprise scale. He co-created Graphiti, the popular open-source temporal knowledge graph framework. A second-time founder, Daniel previously built and led machine learning and engineering teams at late-stage companies. He lives in San Francisco.
  • Twitter: https://x.com/danielchalef
  • LinkedIn: https://www.linkedin.com/in/danielchalef/
  • Photo: /wf26/speakers/by-id/spk_daniel_chalef.jpg
  • Sessions:

- Citation Needed: Provenance for LLM-Built Knowledge Graphs — Day 4 — Session Day 3 3:20pm-3:40pm

An LLM doesn't copy facts into your knowledge graph. It synthesizes them: entities merge across sources, and later data invalidates earlier facts. By the time your agent retrieves "patient has a penicillin allergy," the origin — an EHR record, a lab report, or something typed into a chatbot — is gone. This talk covers engineering lineage into a lossy, generative pipeline: episode-to-fact links as structural graph properties, provenance that survives entity resolution, metadata projection (tag a source once; it follows every derived node and edge), and the query semantics of filtering facts by ancestry, including mixed-trust parentage. Deletion is the inverse problem: GDPR erasure propagates back through the same derivation edges. Compliance gets an audit trail; engineers get agents they can debug instead of black boxes.

Daniel Han

  • Role: Co-founder
  • Company: Unsloth
  • Bio: Co-founder of Unsloth. Making open source AI more accessible and local. 300M downloads. 65K GitHub stars. Previously at NVIDIA.
  • Twitter: https://x.com/danielhanchen
  • LinkedIn: https://www.linkedin.com/in/danielhanchen
  • Website: https://unsloth.ai
  • Blog: https://unsloth.ai/introducing
  • Photo: /wf26/speakers/by-id/spk_daniel_han.jpg
  • Sessions:

- Special topics in Kernels, RL, Reward Hacking in Agents — Day 1 — Workshop Day 2:20pm-5:30pm

An advanced seminar (good prerequisites: Daniel's 2024 and 2025 hit AIE workshops, but all are welcome!)

PLS WATCH: https://www.youtube.com/@aiDotEngineer/search?query=daniel%20han

- Compression at the Edge — Day 4 — Session Day 3 2:25pm-2:45pm

Compression at the Edge examines how smaller weights, faster inference, and constrained-memory deployments are making capable local AI more practical. The panel explores where compressed models already beat cloud on latency, privacy, cost, or control, what breakthroughs would unlock broader adoption, and how open model tooling is shaping the edge AI stack.

Moderator: Chris Alexiuk (NVIDIA). Panelists: Daniel Han (Unsloth), Asma Beevi (NVIDIA), Merve Noyan (Hugging Face), Michael Chiang (Ollama).

- Compression at the Edge — Day 4 — Session Day 3 2:50pm-3:10pm

Compression at the Edge examines how smaller weights, faster inference, and constrained-memory deployments are making capable local AI more practical. The panel explores where compressed models already beat cloud on latency, privacy, cost, or control, what breakthroughs would unlock broader adoption, and how open model tooling is shaping the edge AI stack.

Moderator: Chris Alexiuk (NVIDIA). Panelists: Daniel Han (Unsloth), Asma Beevi (NVIDIA), Merve Noyan (Hugging Face), Michael Chiang (Ollama).

Daniel Kim

  • Role: Head of Growth
  • Company: Cerebras
  • Bio: Daniel Kim works on large-scale inference systems at Cerebras, which runs the world's fastest AI inference on the Wafer-Scale Engine (WSE-3), the largest chip ever built. More recently, Daniel has turned to building AI agents that accelerate Cerebras's own hardware and software development. Outside of work, you can find him relaxing in the park, eating spicy noodles, and recently running!
  • Twitter: https://x.com/learnwdaniel
  • LinkedIn: https://linkedin.com/in/journeyer
  • Website: https://danielkim.sh/
  • Photo: /wf26/speakers/by-id/spk_daniel_kim.jpg
  • Sessions:

- All the Things We Have to Do to Satisfy Your Insatiable Need for Tokens — Day 4 — Session Day 3 11:40am-12:00pm

Every time the industry figures out how to serve tokens faster and cheaper, the appetite grows to match. Models get bigger, contexts get longer, agents start chaining thousands of calls together. The finish line keeps moving. This talk is a technical tour through everything the industry has done to keep up, led by two experts in high-performance inference. We'll start with the optimizations that made hardware work harder without changing the underlying architecture. Then we'll go up a level with techniques that work smarter across requests and across the model itself. And finally, a peek into the future with heterogeneous disaggregated inference, the architectural shift that splits prefill and decode across specialized hardware, and even more advanced forms of hardware specialization coming your way soon. Token demand is about to get a lot more insatiable. Let's see what the future has in store for us!

Dat Ngo

  • Role: AI Architect
  • Company: Arize AI
  • Bio: Dat Ngo is an AI Architect at Arize AI focused on agent harnesses, evaluation, observability, and scalable LLM-evaluation pipelines for production AI systems.
  • Twitter: https://x.com/dat_attacked
  • Photo: /wf26/speakers/by-id/spk_dat_ngo.jpg
  • Sessions:

- Your Agent Is Lying to You About Whether It Worked — Day 2 — Session Day 1 12:05pm-12:25pm

Every span is green, every tool call returned cleanly, and the agent still regenerated the same plan 27 times before giving up invisible to any outcome metric, obvious in the trajectory. We pull up a real trace where the outcome looks healthy and the path is a disaster, then show Signal, our agent, surfacing it automatically: sweeping the project, ranking it above the noise, and linking straight to the offending trace with debugging evidence attached. The live version of the trajectory-over-outcomes argument, with a one-click path from "something's wrong" to "here's exactly where."

Dave Revere

  • Role: Staff AI Engineer
  • Company: SonderMind
  • Bio: Staff AI Engineer at SonderMind specializing in eval and guardrail pipelines for mental health AI. Dave builds the safety infrastructure behind SonderMind's AI products, including the clinical feedback loop that turns therapist annotations into automated regression tests and the modular guardrails system that evaluates every agent response before it reaches a user. He's passionate about making high-stakes AI systems safe enough to ship fast and rigorous enough to earn clinical trust.
  • Twitter: https://x.com/daverevere
  • LinkedIn: https://www.linkedin.com/in/daverevere
  • Photo: /wf26/speakers/by-id/spk_dave_revere.jpg
  • Sessions:

- Evals Driven-Development: Engineering a Mental Health AI Coach Ethically & Safely — Day 3 — Session Day 2 2:50pm-3:10pm

In the world of AI Mental Health, vibes can be dangerous with real consequences. Building Sondermind’s Mental Health AI Coach required us to invent a new playbook for Eval-Driven Development in order to balance effectiveness and safety. This session is for the builders who want to see how to handle the most difficult edge cases in the agentic world. We’ll show how we’ve built a Clinical Feedback Loop that turns human therapist insights into machine-readable evaluations in a production system with thousands of conversations. We’ll dive into: - The Ethics Engine: Building and calibrating modular guardrails that can be updated as clinical guidelines evolve. - Agentic Oversight: Why we moved from single-prompt agents to a closed-loop Supervisor/Executor/Evaluator pattern to ensure clinical adherence. - Human Oversight: How we monitor Sonder to ensure that we can improve safety and quality with clinical feedback.

David Brumley

  • Role: Chief AI and Science Officer
  • Company: Bugcrowd, Inc
  • Bio: Dr. David Brumley is Chief AI & Science Officer at Bugcrowd and a full professor at Carnegie Mellon University, where he has spent decades advancing the state of offensive security. He is the founder of picoCTF, the world’s largest cybersecurity competition, and advisor to PPP/MMM, one of the most successful competitive hacking teams globally, and a venture partner at Rain Capital.

His work spans automated vulnerability discovery, exploit generation, and large-scale offensive systems, including pioneering efforts such as Mayhem, the Cyber Grand Challenge–winning technology. His research and products have helped shape how organizations think about continuous, automated security at scale.

  • LinkedIn: https://www.linkedin.com/in/thedavidbrumley
  • Photo: /wf26/speakers/by-id/spk_david_brumley.jpg
  • Sessions:

- Bugcrowd posttraining talk — Day 2 — Session Day 1 12:05pm-12:25pm

David Corbitt

  • Role: Head of Product, Serverless Training
  • Company: CoreWeave
  • Bio: Co-founder and CPO of OpenPipe, acquired by CoreWeave. We specialized in fine-tuning task-specific LLMs to match or exceed performance of frontier models with lower latency and cost. Now head of product for Serverless Training, building and distributing fine-tuning techniques to AI engineers.
  • Twitter: https://x.com/dvdcrbt
  • LinkedIn: https://linkedin.com/in/davidcorbitt
  • Photo: /wf26/speakers/by-id/spk_david_corbitt.jpg
  • Sessions:

- Inference is the New Training Loop: Architecting High-Reliability Agents and Continuous AI Systems — Day 3 — Session Day 2 3:20pm-3:40pm

For agentic AI and complex, multi-step workloads, the inference environment is the engine for continuous improvement, not a final deployment step. This talk focuses on engineering the full AI loop: tightly integrating inference with reinforcement learning (RL) and evaluation. Learn how to leverage native observability, serverless RL, and optimized inference stacks to continuously refine model behavior based on production traces, delivering agents that are reliable, auditable, and constantly evolving.

David Hsu

  • Role: CEO
  • Company: Retool
  • Bio: David Hsu is the Founder and CEO of Retool, the $3.2B Sequoia-backed governed platform where AI development is fast for builders and safe for business. Under his leadership, Retool helps over 10,000 companies like Amazon, Stripe, Brex, and Orangetheory Fitnessintegrate AI into enterprise-grade applications without sacrificing security or control. David studied philosophy and computer science at the University of Oxford.
  • LinkedIn: https://www.linkedin.com/in/dvdhsu/
  • Photo: /wf26/speakers/by-id/spk_david_hsu.jpg
  • Sessions:

- Governance Is the Real Bottleneck to AI ROI — Day 2 — Session Day 1 10:45am-11:05am

As AI systems move from generating content to taking Claw-based agents action inside production systems, governance (not model quality) becomes the limiting factor. David will break down why visibility, guardrails, approvals, and rollback matter more than raw intelligence, and how companies can enable AI adoption without creating security and compliance disasters.

David Levine

  • Role: Founder & CEO
  • Company: Kiduna Club
  • Bio: David Levine is a technology entrepreneur, systems engineer, and founder who has spent more than three decades building companies at the frontier of major technology shifts, from the early Web and cloud computing to big data, geomatics, blockchain and agentic AI. He is Founder & CEO of Kiduna Club and Founder & President of the Kinship Intelligence Institute, where he is developing technologies, governance frameworks, and legal infrastructure that enable creative people and intelligent agents to collaborate, create value, and coordinate at scale.

Previously, Levine founded multiple venture-backed technology companies, including Geostellar, a patented big-data geomatic platform for solar energy, Butterfly.net (later Gamebryo), an early massively-multiplayer gaming platform, and Ultraprise, a B2B whole loan exchange for the secondary market. He has held leadership roles with Solana Labs, Nightwing, and Sewall, and served as the executive director of the National Technology Transfer Center.

A graduate of Yale University, Levine has been featured in The New York Times, Wall Street Journal, Fast Company, Forbes, Fox Business News, and other major media. His current work focuses on digital identity, decentralized governance, and building the legal and technical foundations for the emerging agentic economy.

  • Twitter: https://x.com/bigkiduna
  • LinkedIn: https://linkedin.com/in/motodave
  • Website: https://motodave.com
  • Blog: https://motodave.com
  • Photo: /wf26/speakers/by-id/spk_david_levine.jpg
  • Sessions:

- Beyond the Lethal Trifecta: Agentic Commerce on the Open Internet at Machine Speed — Day 4 — Session Day 3 3:45pm-4:05pm

For decades, the internet has had protocols for routing, identity, encryption, payments, and commerce between people and organizations. It has never had a native way for autonomous agents to possess authority, accountability, or legal standing. On July 1, 2026 that changes. A little known law will take effect that changes the world as we know it. As AI agents move beyond the enterprise firewall, a new form of commerce is emerging. Agents can already search, negotiate, schedule, purchase, settle payments, and coordinate work across networks. But the moment they begin acting independently on behalf of people, businesses, and online organizations, fundamental questions appear: Who does this agent represent? What authority does it possess? Who is responsible when something goes wrong? How do counterparties know they can trust it? This talk explores the "Lethal Trifecta" of agentic systems: access to systems, access to networks, and autonomy. Together they create extraordinary capabilities, but they also expose a missing layer in the architecture of the internet itself. Without identity, accountability, governance, and legal standing, agentic commerce remains trapped inside enterprise walls, limited to productivity gains rather than participation in open markets. On the same day as this conference, a new legal framework takes effect that gives autonomous online organizations a registered legal existence, allowing them to hold assets, enter agreements, govern themselves through software, and operate through fleets of agents. Whether you're building agents, agent platforms, autonomous organizations, payment systems, governance systems, or the next generation of internet infrastructure, this shift has global implications, and you'll be the first to know. We'll examine the emerging trust stack for agentic commerce—identity, authority, governance, settlement, and standing—and explore what happens when agents stop acting merely as tools and begin participating as economic actors on the open internet at machine speed.

Dean Quiñanola

  • Role: Staff Software Engineer, App Eng Manager
  • Company: Runpod
  • Bio: Dean Quiñanola is a Staff Software Engineer and application engineering manager at Runpod, working on developer-facing infrastructure for quickly deploying AI applications.
  • Sessions:

- AI Applications in a flash! No Dev Ops. Just code. — Day 3 — Session Day 2 3:20pm-3:40pm

Building AI Applications and serving them straight from code. No need for Docker builds. You can even vibe-code the entire process.

Deepak Pathak

  • Role: Co-Founder & CEO
  • Company: Skild AI
  • Bio: Co-Founder and CEO of Skild AI; Assistant Professor at Carnegie Mellon University in the Robotics Institute, affiliated with the Machine Learning Department.
  • Twitter: https://twitter.com/pathak2206
  • LinkedIn: https://www.linkedin.com/in/pathak22
  • Website: https://www.cs.cmu.edu/~dpathak
  • Photo: /wf26/speakers/by-id/spk_deepak_pathak.jpg
  • Sessions:

- Frontier Robotics Research — Day 3 — Session Day 2 1:55pm-2:15pm

Denys Linkov

  • Role: Head of ML
  • Company: Wisedocs
  • Bio: Head of ML at Wisedocs and Lecturer at the University of Toronto (CSC490 + CSC302). Works on interesting document, data and evals document problems in the healthcare space. Previously built the Voiceflow ML team.
  • Twitter: https://x.com/denyslinkov
  • LinkedIn: https://www.linkedin.com/in/denyslinkov/
  • Photo: /wf26/speakers/by-id/spk_denys_linkov.jpg
  • Sessions:

- Benchmarking Coding Agents on New vs Legacy Code bases — Day 4 — Session Day 3 12:05pm-12:25pm

You have an old code base with 100,000s of lines of code, should you let an AI Agent refactor or do you wait until you have a cleaner setup? Last year we refactored a number of code bases and ran evaluations on how well different models, harnesses and rule sets affected multiple versions of the code base. This talk will feature specific code examples as well as a broader set of evals.

Derek Meegan

  • Role: Software Engineer
  • Company: Browserbase
  • Bio: Derek Meegan is a software engineer at Browserbase. He builds systems and browser automation tooling, and writes about browser automation, Stagehand, and how technology changes the way people work.
  • Twitter: https://x.com/derekmeegan
  • LinkedIn: https://www.linkedin.com/in/derekmeegan
  • Website: https://derekmeegan.com
  • Photo: /wf26/speakers/by-id/spk_derek_meegan.jpg
  • Sessions:

- Deploying browser agents at scale — Day 2 — Session Day 1 1:55pm-2:15pm

Not every browser agent trajectory is the same, and treating them like they are is how teams quietly burn budget on agents that never ship. This talk walks through the two trajectory types behind every browser agent, the cost/performance/maintainability tradeoffs that decide whether they hold up, and the concrete patterns for evaluating, hardening, and iterating on them.

Devansh Tandon

  • Role: Principal Product Manager
  • Company: Meta
  • Bio: Devansh Tandon works on AI Research & Product at Meta, leading AI & recommendations teams. He is a founding member of a new AI research group (Meta Recommendation Systems Research) to develop LLM foundation models & recommendation systems across Meta: to power Instagram, Facebook, Ads.

Previously, Devansh led ML/AI teams at Google for 7 years, building the largest ML models across Ads, Search, Discover, YouTube, Gemini. He worked on YouTube's recommendation engine, which drives 70% of video watch time for 2 billion+ daily active users. At DeepMind, he incubated a new generative recommendation system using Gemini, and published multiple research papers.

Devansh graduated Magna Cum Laude from Yale University, with a BS in Computer Science and Economics.

  • Twitter: https://x.com/devanshtandon_
  • LinkedIn: https://www.linkedin.com/in/devanshtandon/
  • Photo: /wf26/speakers/by-id/spk_devansh_tandon.jpg
  • Sessions:

- Tokens In, Engagement Out: Training LLM-Recommenders — Day 2 — Session Day 1 10:45am-11:05am

- Open Q&A: LLM Recsys — Day 2 — Session Day 1 12:05pm-12:25pm

Dex Horthy

  • Role: Co-Founder
  • Company: HumanLayer
  • Bio: CEO and Co-Founder at HumanLayer, helping teams solve hard problems in complex codebases without slop. Dex has been building software factories his entire career. Coined the term "context engineering", created the Research/Plan/Implement framework for coding agents, wrote code for Nasa lunar rovers in high school. Enjoyer of tacos and burpees, not necessarily in that order.
  • Twitter: https://x.com/dexhorthy
  • LinkedIn: https://linkedin.com/in/dexterihorthy
  • Photo: /wf26/speakers/by-id/spk_dex_horthy.jpg
  • Sessions:

- Harness Engineering is not Enough: Why Software Factories Fail — Day 2 — Session Day 1 4:30pm-4:50pm

Dhruv Batra

  • Role: Co-founder & Chief Scientist
  • Company: Yutori
  • Bio: Dhruv Batra is a co-founder and the Chief Scientist of Yutori, building web agents.

Before this, he was a Senior Director at Meta leading FAIR Embodied AI, and an Associate Professor at Georgia Tech.

He received the 2019 Presidential early career award for scientists and engineers (PECASE) from the White House for his work on explainable AI and neural network interpretability.

He and his collaborators received the 2025 Mark Everingham Prize for their 2015 work on Visual Question Answering that established "a new strand of vision and language research.”

He is a recipient of best paper awards/nominations in every area of AI (vision, NLP, ML, robotics).

  • Twitter: https://x.com/DhruvBatra_
  • LinkedIn: https://www.linkedin.com/in/dhruv-batra-dbatra/
  • Website: https://dhruvbatra.com/
  • Photo: /wf26/speakers/by-id/spk_dhruv_batra.jpg
  • Sessions:

- Computer-use models will agentify the web, not APIs — Day 3 — Session Day 2 10:45am-11:05am

We are rushing towards a world where every single digital surface (email, calendar, messaging, …, every desktop app, every phone app, every web app) that was previously meant for humans is now managed by AI agents. Of course, there are technical challenges to be solved: - Model context windows haven’t increased in 2 years. And the digital world is OOMs bigger (the ultimate “big world hypothesis”) anyway, so how does one architect this? - A large part of the digital world (most of the web) does not have APIs available and requires agents to act like humans (consume pixels, output keyboard/mouse actions). - Human preferences and the digital world change, and require agents to maintain a dynamic memory and continually learn. But even if we could solve these problems, what does this world look like? - The digital world, particularly the web, was built for human consumption (and is often hostile to bots). - For a while to come, we will be sharing the digital roadways with these digital robots. - What does end-to-end encryption and privacy mean when the other “end” of the communication is an AI agent? The Yutori team has spent the last year building the world’s best computer use model (slightly better than Opus 4.6 and GPT 5.4 while being 2x faster and 4-5x cheaper on browser use tasks), converted the web into a webhook with Scouts (agents that monitor the web 24/7 for anything you care about), and are now releasing Yutori agent that expands from the open web to your most common digital surfaces. This talk will be grounded in Yutori’s learning from what it takes to build agents that are always on, taking us one step closer to the world where every digital surface is their playground.

Dhruv Nathawani

  • Role: Research Scientist
  • Company: Nvidia
  • Bio: Dhruv Nathawani is a Research Scientist at NVIDIA, where he works at the intersection of synthetic data and foundation model alignment. His work focuses on building data-centric methods that improve the reliability, adaptability, and performance of Nemotron foundation models and modern AI systems. Prior to NVIDIA, Dhruv was at Gretel, a synthetic data platform for developers, where he worked on tools for generating high-quality data for AI applications. At Salesforce Research, Apple Maps, and Carnegie Mellon University, Dhruv built AI systems spanning medical multimodal learning, document AI/OCR, satellite computer vision, and fMRI-based cognitive decoding.
  • LinkedIn: https://www.linkedin.com/in/dhruvnathawani/
  • Photo: /wf26/speakers/by-id/spk_dhruv_nathawani.jpg
  • Sessions:

- Teaching Agents to Search: Building Synthetic Training Pipelines with NVIDIA Data Designer — Day 1 — Workshop Day 11:05am-12:05pm

Modern agentic systems often fail because the right training data simply does not exist. Search agents are a perfect example: if you want a model to browse the web effectively, you need high-quality multi-step trajectories that teach it how to search, refine queries, inspect sources, and recover from dead ends. Those datasets are rarely available off the shelf. In this hands-on workshop, we will show how NVIDIA used Data Designer to build synthetic supervised fine-tuning data for search-capable Nemotron models. Participants will learn how to translate a target capability into a scalable data generation pipeline: defining task structure, generating strong seed examples, producing realistic search trajectories, filtering low-quality generations, and converting traces into training-ready records. Using a real search-agent use case, we will walk through the design decisions behind teaching Nemotron Super to browse the web, including how to create BrowseComp-style tasks, generate tool-use rollouts, and manage the tradeoffs between diversity, correctness, and yield. We will also cover the practical realities of production synthetic data workflows, including validation, dataset curation, and where most pipelines break down. But the goal of this workshop goes beyond search. Participants will leave with a reusable framework for designing any dataset they wish they already had: starting from the behavior they want to teach, mapping that behavior into a data schema, generating examples at scale, and iterating until the dataset is useful for training. By the end of the session, attendees will not only know how to build synthetic data for search agents, but how to design custom datasets for specialized behaviors across reasoning, tool use, and domain-specific applications. Attendees will leave with a practical methodology for synthetic data design, plus hands-on familiarity with NVIDIA Data Designer as an open-source system for rapid experimentation.

Dhruv Srikanth

  • Role: Founding Engineer
  • Company: Weco AI
  • Bio: Dhruv Srikanth is a Founding Engineer at Weco AI, where he works on recursive self-improvement. His work includes AIDE, an autoresearch-style system that achieved nearly 4x the medal rate of the next-best autonomous agent across OpenAI’s MLE-Bench. Previously, Dhruv worked on computer vision and robotics research at Carnegie Mellon University and the University of Chicago.
  • Twitter: https://x.com/dhruvsrikanth
  • LinkedIn: https://www.linkedin.com/in/dhruv-srikanth
  • Photo: /wf26/speakers/by-id/spk_dhruv_srikanth.jpg
  • Sessions:

- Hands-on AutoResearch: Cracking OpenAI's Parameter Golf — Day 1 — Workshop Day 2:20pm-4:20pm

Heard about autoresearch, or tried it a few times in playground settings? This hands-on tutorial teaches you how to use autoresearch on one of the most serious challenges in ML this year: OpenAI's Parameter Golf.

The challenge: train the best language model that fits in just 16MB. We entered our autoresearch agent this past spring, and it outperformed the field of over 1,000 participants. You'll learn how we approached it, then get to do it yourself: kick off an autoresearch agent, watch it improve a tiny language model's training script, steer it when progress stalls, and visualize your results. You'll leave with a working autoresearch setup you can point at your own code.

compute kindly sponsored by Modal!

Dillon DuPont

  • Role: CTO
  • Company: Cua
  • Bio: Building computer-use agents at Cua, WindowsAgentArena co-author
  • Photo: /wf26/speakers/by-id/spk_dillon_dupont.jpg
  • Sessions:

- Computer-Use 2.0: Agents Just Got Multi-Cursor — Day 3 — Session Day 2 2:25pm-2:45pm

Computer-use agents still inherit a basic desktop limitation: one machine has one foreground app, one hardware cursor, and one active actor. Once you try to run more than one agent per desktop, they start stealing focus from the user and from each other. We built cua-driver around a different model: multiple agents operating real desktop applications in parallel, each with its own synthetic pointer, while the user's cursor and keyboard stay undisturbed. The key move is to stop treating hardware mouse and keyboard events as the primary automation layer. cua-driver goes one layer lower, into the OS plumbing behind accessibility: UI Automation on Windows, AT-SPI on Linux, and AX on macOS. Those APIs address applications and elements directly, so the OS does not require the target window to be frontmost. A click can land on a background window. A keystroke can reach a hidden one. Multiple agents can act at once because none of them is competing for the singleton hardware mouse. I'll walk through the architecture, the API shape, and the platform-specific traps we hit while making it work across Windows, macOS, and Linux. The live demo is three agents operating on one desktop while the user keeps typing uninterrupted. The goal is to make Computer-Use 2.0 feel concrete: what changes in the stack, what becomes possible, and where the approach still leaks, including Wayland, Chromium DOM surfaces, native canvas apps, and fallback input paths.

Diogo Almeida

  • Role: CEO
  • Company: TypeSafe AI
  • Bio: Diogo Almeida is the Co-founder and CEO of TypeSafe AI, a company building a new class of large language models for true no-human-in-the-loop automation. Diogo brings a decade of deep AI research experience and was formerly at OpenAI and Google Brain. He played a foundational role in shaping the modern LLM landscape, credited with co-inventing InstructGPT, RLHF, ChatGPT, and GPT-4. His work at TypeSafe AI focuses on rethinking foundational LLM assumptions to prioritize reliability, latency, and cost, to build AI that can be embedded deep inside real systems.
  • Twitter: https://x.com/CompleteSkeptic
  • LinkedIn: https://www.linkedin.com/in/diogomda/
  • Website: https://typesafe.ai/
  • Photo: /wf26/speakers/by-id/spk_diogo_almeida.jpg
  • Sessions:

- What's next after RLHF? — Day 3 — Session Day 2 10:45am-11:05am

RLHF was a massive commercial success: roughly 100% of LLM usage is through RLHF’d models - but it was in many ways also a research failure. Let’s talk about how it conquered the world, how it defied its creators expectations, why AI is in the bimodal state it’s in (is it a bubble or a machine god?), and how to make AI actually transform the economy.

Divakar Kumar

  • Role: Technical Architect
  • Company: FlyersSoft
  • Bio: Divakar Kumar is a Microsoft MVP in AI and a Microsoft Certified Trainer (MCT), working as a Technical Architect at Flyerssoft. He actively shares his knowledge through blogs, talks, and training sessions, empowering developers to harness the potential of AI in real-world applications.
  • LinkedIn: https://www.linkedin.com/in/divakar-kumar/
  • Website: https://iamdivakarkumar.com
  • Blog: https://iamdivakarkumar.com
  • Photo: /wf26/speakers/by-id/spk_divakar_kumar.jpg
  • Sessions:

- Let's integrate AI Agents in Event-Sourced Systems — Day 4 — Session Day 3 11:40am-12:00pm

Fraud detection has always been a race against time. In traditional event-sourced systems, every transaction, login, or transfer is captured as a sequence of immutable events. These events tell a clear story — but only after the fact. What if events could do more than just record history? What if they could talk back? In this talk, we’ll explore how agentic event-driven systems transform fraud detection. Imagine every PaymentInitiated, LoginAttempt, or DeviceChanged event not just being logged, but immediately consumed by an autonomous Fraud Detection Agent. This agent correlates events across accounts, reasons over historical event streams, and generates new events like SuspiciousActivityFlagged or TransactionHeldForReview. Through a real-world inspired use case in banking and digital payments, we’ll show: - How event sourcing provides the perfect memory layer for fraud detection agents - Patterns for agents to safely inject new domain events without violating invariants - How to avoid runaway feedback loops when multiple agents interact (e.g., fraud + compliance + customer service agents) - Governance, auditing, and explainability challenges when autonomous agents take part in mission-critical workflows By the end of this session, you’ll see how event-driven DDD systems evolve when agents stop being passive consumers and start actively shaping the event stream — turning fraud detection from a reactive process into a proactive, adaptive defense.

Dixing Xu

  • Role: Member of Technical Staff
  • Company: Weco AI
  • Bio: MTS at Weco AI. building self-improving agents and autoresearch systems. Previously at the Linux Foundation.
  • Twitter: https://x.com/dexhunt3r
  • LinkedIn: https://linkedin.com/in/dex-xu
  • Website: https://dex.moe
  • Photo: /wf26/speakers/by-id/spk_dixing_xu.jpg
  • Sessions:

- Hands-on AutoResearch: Cracking OpenAI's Parameter Golf — Day 1 — Workshop Day 2:20pm-4:20pm

Heard about autoresearch, or tried it a few times in playground settings? This hands-on tutorial teaches you how to use autoresearch on one of the most serious challenges in ML this year: OpenAI's Parameter Golf.

The challenge: train the best language model that fits in just 16MB. We entered our autoresearch agent this past spring, and it outperformed the field of over 1,000 participants. You'll learn how we approached it, then get to do it yourself: kick off an autoresearch agent, watch it improve a tiny language model's training script, steer it when progress stalls, and visualize your results. You'll leave with a working autoresearch setup you can point at your own code.

compute kindly sponsored by Modal!

Dmitry Buykin

  • Role: Applied AI Lead, Staff Software Engineer
  • Company: Maersk
  • Bio: Dmitry Buykin builds production AI agent systems that turn frontline tribal knowledge into executable SOPs, tool calls, evidence trails, and evaluation loops across 100+ countries. Before enterprise AI agents became a trend, Dmitry worked on AI/ML adoption at DataRobot and on real-time systems in finance, including payments infrastructure in Nordic banking and algorithmic high-frequency trading systems at Deutsche Bank, where latency, correctness, and risk controls mattered at every step. He has 25+ years of experience building distributed systems, data platforms, and ML/AI products across logistics, banking, fintech, cybersecurity, and telecom, with prior roles at Nordea, Danske Bank, Oracle and several startups. Dmitry studied Applied Mathematics with a focus on mechatronics and the simulation of magnetic-levitation transport. He also leads the Applied AI Chapter at Maersk and runs bootcamps, helping software engineers move from LLM demos to reliable production systems.
  • Twitter: https://x.com/tzakus
  • LinkedIn: https://www.linkedin.com/in/buykin/
  • Website: https://www.maersk.com/
  • Photo: /wf26/speakers/by-id/spk_dmitry_buykin.jpg
  • Sessions:

- Tribal Dungeons of Global Shipping: AI Agents at Global Scale — Day 4 — Session Day 3 11:10am-11:30am

Most “AI agents in production” talks skip the part where you have to turn distributed operational knowledge into something an agent can execute safely. This is that part: a practitioner report from a global logistics case-processing project at Maersk, focused on SOPs-as-code, evaluation UX, guardrails, replay-based testing, and SME refinement loops.

The talk covers why versioned, country-aware SOPs beat prompt engineering at scale; how SME corrections become safe workflow changes; why classifier routing and SOP execution must stay separate; where agents under-deliver against demos; and why most of the engineering effort goes into evaluation, replay, and guardrails rather than model prompting.

Dominik Kundel

  • Role: Developer Experience Lead
  • Company: OpenAI
  • Bio: Dominik Kundel works on Developer Experience at OpenAI, where he helps builders get the most out of Codex and the OpenAI APIs. His work has spanned the Agents SDK, GPT-OSS, and most recently Codex. Before OpenAI, he led Product & Design for Twilio’s Emerging Tech & Innovation team, working on developer tools and customer-aware AI agents. Dominik has spent more than a decade in developer tools, usually across APIs, CLIs, JavaScript, and strange demos. Outside work, he’s probably tinkering with cocktails, food, photography, or something that should not need JavaScript but somehow does.
  • Twitter: https://x.com/dkundel
  • LinkedIn: https://linkedin.com/in/dkundel
  • Website: https://dkundel.com
  • Photo: /wf26/speakers/by-id/spk_dominic_kundel.jpg
  • Sessions:

- Building on the Codex Harness — Day 3 — Session Day 2 3:45pm-4:05pm

- Codex, Behind the Harness — Day 4 — Session Day 3 1:30pm-1:50pm

Agents have evolved a lot in the last year both in capabilities and in the overall structure. Increasingly sandbox-powered coding agents are breaking out to do general purpose work.

In this talk we’ll be taking apart the open-source Codex agent harness. Understand how it works, what makes it so suitable to do work beyond coding tasks, how it handles key aspects like context management, tools and file system access. We’ll also tie these back to concrete actions you can take to bring these patterns into your own agents, whether you are building on top of the Codex agent or building your own.

Dor Sasson

  • Company: Stigg
  • Bio: Dor Sasson works at Stigg and writes about product, data, and billing infrastructure, with a stated focus on end users and full-stack data work.
  • LinkedIn: https://il.linkedin.com/in/datapm
  • Photo: /wf26/speakers/by-id/spk_dor_sasson.jpg
  • Sessions:

- Every AI company is accidentally building a bank. — Day 2 — Session Day 1 10:45am-11:05am

You're logging usage, billing later, hoping agents behave. They don't. Here's the architecture that fixes it before the invoice hits.

Doug Guthrie

  • Role: Solutions Engineer
  • Company: Braintrust
  • LinkedIn: https://www.linkedin.com/in/doug-guthrie-07994a48
  • Photo: /wf26/speakers/by-id/spk_doug_ghutrie.jpg
  • Sessions:

- Advanced workshop: Mastering AI Observability — Day 1 — Workshop Day 9:00am-11:00am

Your AI is in production, but is it actually good? In this hands-on workshop, you'll learn how to uncover patterns in your production traces using Braintrust Topics, build custom scorers to target real issues, and systematically improve your agent. By the end, you'll have a repeatable eval workflow and trace-backed evidence that your AI is actually doing what you think it is.

Doug Keller

  • Role: Senior Staff AI Engineer
  • Company: SonderMind
  • Bio: Doug is the lead architect of the agent platform powering SonderMind’s GenAI solutions and a core member of the team building Sonder, SonderMind’s mental health coach. With over a decade of full-stack systems experience, Doug brings a systems-first approach to agent architecture, grounding his work in engineering fundamentals while enabling the adaptability required in the rapidly evolving GenAI landscape.
  • LinkedIn: https://www.linkedin.com/in/doug-keller/
  • Photo: /wf26/speakers/by-id/spk_doug_keller.jpg
  • Sessions:

- Evals Driven-Development: Engineering a Mental Health AI Coach Ethically & Safely — Day 3 — Session Day 2 2:50pm-3:10pm

In the world of AI Mental Health, vibes can be dangerous with real consequences. Building Sondermind’s Mental Health AI Coach required us to invent a new playbook for Eval-Driven Development in order to balance effectiveness and safety. This session is for the builders who want to see how to handle the most difficult edge cases in the agentic world. We’ll show how we’ve built a Clinical Feedback Loop that turns human therapist insights into machine-readable evaluations in a production system with thousands of conversations. We’ll dive into: - The Ethics Engine: Building and calibrating modular guardrails that can be updated as clinical guidelines evolve. - Agentic Oversight: Why we moved from single-prompt agents to a closed-loop Supervisor/Executor/Evaluator pattern to ensure clinical adherence. - Human Oversight: How we monitor Sonder to ensure that we can improve safety and quality with clinical feedback.

Dru Knox

  • Role: Head of Product
  • Company: Tessl
  • Bio: Dru Knox leads product at Tessl, working on AI-native software development, coding-agent harnesses, specifications and workflows that help teams build reliable software with agents.
  • Photo: /wf26/speakers/by-id/spk_dru_knox.jpg
  • Sessions:

- Harness Engineering: The New Core Skill for Agentic Developers — Day 4 — Session Day 3 2:50pm-3:10pm

Harness engineering is emerging as a new core competency for agentic engineers. Your job isn't writing good code, it's upgrading your codebase so that agents reliably succeed. This talk covers the core loop of harness engineering, the most common codebase modifications you'll make, and how to 10x your harness engineering efforts with Tessl's harness engineering agent.

Du'an Lightfoot

  • Role: Senior AI Engineer
  • Company: Akamai Technologies
  • Bio: Senior AI Engineer at Akamai Technologies specializing in artificial intelligence and network engineering. Previously served as a Senior Developer Advocate at AWS and is the founder of LabEveryDay.
  • Twitter: https://x.com/labeveryday
  • Website: https://www.duanlightfoot.com
  • Photo: /wf26/speakers/by-id/spk_du_an_lightfoot.jpg
  • Sessions:

- Agents That Own Their Inference: Building Production AI Agents on Dedicated GPUs — Day 1 — Workshop Day 9:00am-11:00am

Every production agent today is renting its intelligence. You're paying per token, sending your customer's data to someone else's servers, and hoping the provider doesn't rate-limit you during your launch. For most teams, that's fine. But for a growing number of teams in regulated industries, with high-volume products, latency-sensitive workloads, or rising token bills, it's starting to look like a liability.

In this 120-minute hands-on workshop you'll get a dedicated GPU and build an agent that runs on infrastructure you control. You'll stand up vLLM, point your agent at it, and drive concurrent load through the stack until you can see batching, KV cache pressure, and throughput limits in the metrics. Then you'll optimize the deployment to improve throughput while keeping per-request latency in line.

The focus isn't agent frameworks. It's the inference layer underneath them. You'll leave with working code and a real understanding of continuous batching under real concurrency, KV cache tradeoffs, vLLM's metrics, and the bottlenecks that only show up when you operate the inference server yourself.

Dustin Mihalik

  • Role: Technical Fellow
  • Company: Indeed
  • Bio: Dustin is a Technical Fellow at Indeed, where he leads the AI Platform powering all of Indeed's production AI products. He has spent years building the foundational infrastructure that enables engineering teams to develop, deploy, and scale AI applications across the company — from early prototypes to high-throughput production systems.
  • LinkedIn: https://www.linkedin.com/in/dmihalik/
  • Website: https://dmihalik.com
  • Blog: https://dmihalik.com
  • Photo: /wf26/speakers/by-id/spk_dustin_mihalik.jpg
  • Sessions:

- MCP Apps: Give the Model Data, Give the User a UI — Day 3 — Session Day 2 2:50pm-3:10pm

Most MCP tools return text. MCP Apps let you go further. But the real unlock isn't just rendering a pretty UI, it’s understanding that the model and the user need fundamentally different things from the same interaction. This talk presents a design pattern for building great MCP Apps: separate the data layer (what the model reasons about) from the display layer (what the user interacts with). When you do this well, the model retains full context and agency over structured data, while the user gets a rich, interactive interface. We'll walk through concrete examples of how splitting data and display unlocks capabilities that pure UI apps can't provide: letting the model make choices around display, answer questions based on interactions, and providing detailed displays and filters. Attendees will leave with a practical mental model for designing MCP Apps that are good for both the human and the AI. Attendees will learn patterns they can apply immediately.

Dylan Bristot

  • Role: Product Marketing
  • Company: Token Factory
  • Bio: Dylan Bristot works in product marketing at Token Factory and speaks on practical approaches for scaling open-model inference in production AI infrastructure.
  • Photo: /wf26/speakers/by-id/spk_dylan_bristot.jpg
  • Sessions:

- Optimizing Open Models for Production Grade Inference — Day 4 — Session Day 3 2:25pm-2:45pm

Open-source foundation models are rapidly closing the gap with proprietary systems, enabling organizations to build powerful AI applications with greater flexibility and control. However, deploying these models in production introduces a new set of challenges: latency, throughput, scalability, and cost efficiency.In this talk, we'll explore the modern inference optimization techniques that power large-scale AI systems in production. Topics include KV cache optimization, cache-aware routing, prefill/decode disaggregation, speculative decoding, and other emerging approaches used to improve performance and reduce infrastructure costs.Through practical examples and real-world architecture patterns, attendees will gain a deeper understanding of how to run open models efficiently at scale.

Dylan Couzon

  • Role: DevRel Engineer
  • Company: Qdrant
  • Bio: Dylan Couzon is a DevRel Engineer at Qdrant. His WF26 talk, "The Frontier Is Coming Home," discusses the trend of increasingly capable models becoming small enough to run locally.
  • LinkedIn: https://www.linkedin.com/in/dcouzon
  • Photo: /wf26/speakers/by-id/spk_dylan_couzon.jpg
  • Sessions:

- The Frontier Is Coming Home — Day 3 — Session Day 2 2:50pm-3:10pm

In 2022, the smallest model to clear 60 percent on MMLU had 540 billion parameters. Two years later a 3.8 billion parameter model did the same thing, small enough to run on a phone. That is a 142x drop to reach the same capability floor, and it is the cleanest way to see a trend most people are not pricing in. Call it the lag: the time between a capability showing up at the frontier and that capability running on hardware you own. Today the lag is measured in months, and it keeps shrinking. But raw capability is only half of what makes a model useful. A model that can reason but cannot remember is a stranger every time you talk to it. The other half of local AI is memory, and that half is already here. On-device retrieval has been ready to run locally longer than the models have. The embedding models that power it are tiny, the indexes fit in memory, and none of it touches a network. When your reasoning and your memory both live on your machine, so does your context. Your history, your documents, your past conversations never leave the device. That is the part of this shift that matters most, and the part people overlook because they are busy watching the models. The same shift flips the economics. At 200 dollars a month per seat, a local machine starts to pay for itself in under two years, and the frontier labs' own published usage numbers put heavy coding in the same range. I'll walk through the math, the hardware, and where local still loses. None of this is a bet against scale, or against the Bitter Lesson. The frontier still grows in the data center. The point is that a usable copy keeps arriving on your desk, on a lag, with a memory of its own, for close to free.

Edo Liberty

  • Role: Founder and Chief Scientist
  • Company: Pinecone
  • Bio: Edo Liberty is the founder and Chief Scientist of Pinecone. Pinecone is the knowledge infrastructure for AI at scale. Its leading vector database and knowledge engine, Pinecone Nexus, power accurate, performant AI applications for more than 10,000 customers and 1M developers worldwide. Before founding Pinecone, Edo was a Director of Research at AWS and Head of Amazon AI Labs where his team built cutting-edge machine learning algorithms, systems, and services including parts of Amazon SageMaker and OpenSearch. Edo holds a B.Sc in Physics and Computer Science from Tel Aviv University, and a Ph.D. in Computer Science from Yale. As an academic Edo taught at Tel Aviv University and at Princeton and has authored more than 75 papers and patents. His research focused on mathematical foundations of AI, optimization, streaming algorithms, machine learning, numerical linear algebra, and high dimensional data mining.
  • Twitter: https://x.com/edoliberty
  • LinkedIn: https://www.linkedin.com/in/edoliberty/
  • Website: https://edoliberty.github.io/
  • Photo: /wf26/speakers/by-id/spk_edo_liberty.jpg
  • Sessions:

- Pinecone 2.0 — Day 2 — Session Day 1 10:45am-11:05am

Autonomous agents are smart but don’t know your business or your objectives. That’s why most agents in the enterprise remain stuck in retrieval loops, burning millions of tokens on processing raw documents

A shift from traditional retrieval systems + agents (aka RAG) to purpose-built knowledge engines is underway.

I'll talk about why moving reasoning upstream and compiling raw enterprise data into specialized, task-specific context artifacts is critical to unlocking reliable agentic workflows. And I'll show you how offloading knowledge management to a dedicated layer enables engineering teams to achieve up to a 90% reduction in token consumption while drastically improving task completion rates, speed, and accuracy.

Ekaterina Deyneka

  • Role: Founder & CEO
  • Company: Reelful
  • Bio: Kate Deyneka is the founder and CEO of Reelful, an agentic video editor for social media. She founded Reelful in October 2025 and built the product end to end, including multimodal pipelines for media understanding, automated editing, and mobile deployment. Reelful launched in April 2026 and is available on the App Store. Reelful is part of the a16z speedrun SR007 cohort. Previously, Kate was a senior machine learning engineer at Snap, where she led video generation from research to production and shipped AI features used by more than 100 million daily users.
  • Twitter: https://x.com/katedeyneka
  • LinkedIn: https://www.linkedin.com/in/katedeyneka
  • Website: https://www.katedeyneka.com/
  • Blog: https://www.katedeyneka.com
  • Photo: /wf26/speakers/by-id/spk_kate_deyneka.jpg
  • Sessions:

- Building an Agentic Video Editor for Mass Consumer — Day 4 — Session Day 3 11:40am-12:00pm

Most agentic systems today are built for developers — people comfortable setting up environment, configs, and debugging agent loops. But what happens when your user has never heard the word "agent" and just wants a video ready to post? Reelful is an agentic video editor that lives right in the user's phone. It turns raw photos and videos from your camera roll into polished, short videos. No setup. No sophisticated prompting. No empty timeline. Under the hood, the agent orchestrates multiple models and composes a video together. In this talk, I'll walk through: The agentic pipeline architecture: how we chain models across modalities (vision → language → speech → video), handle context passing between steps, and manage state across a multi-minute generation job The UX inversion: how we designed the agent to require minimal effort from user — the system infers intent from the media itself, making complex orchestration invisible This talk is for anyone building agents that need to work for non-technical users, or anyone curious about multimodal agentic pipelines beyond text and code.

Eli Cohen

  • Role: Director of Technology Incubation
  • Company: Snyk
  • Bio: Director of Technology Incubation at Snyk, focused on securing an AI-native future through research and incubations. Previously co-founded Helios, which was acquired by Snyk in 2024; has held product and engineering leadership roles and is an alumnus of Unit 8200.
  • LinkedIn: https://www.linkedin.com/in/cohen-eli
  • Sessions:

- Continuous Offensive Security the only approach in an agent-first world — Day 3 — Session Day 2 2:50pm-3:10pm

Elie Bakouch

  • Role: Research Engineer
  • Company: Prime Intellect
  • Bio: Elie Bakouch is a Research Engineer at Prime Intellect, working to advance open pre-training and mid-training. Previously at Hugging Face, he created and trained the SmolLM series of efficient language models and contributed to numerous open research efforts including Open-R1, SmolVLM, and the open pre-training playbooks, comprehensive guides and recipes for training language models from scratch
  • Twitter: https://x.com/eliebakouch
  • LinkedIn: https://www.linkedin.com/in/eliebak/
  • Photo: /wf26/speakers/by-id/spk_elie_bakouch.jpg
  • Sessions:

- « the era of (auto) research » — Day 3 — Session Day 2 12:05pm-12:25pm

the nanogpt speedrun is a great setup to test autonomous research: fixed model, one number to beat, and a human record that keeps moving. we pointed coding agents at it on idle compute and let them iterate for days, thousands of runs with minimal human intervention, until they beat the human baseline. in this talk we go through how they did it, how codex and claude code behave very differently as researchers, and why speedrun are one of the best environments we've found for studying autonomous research agents

Elizabeth Fuentes Leone

  • Role: Developer Advocate
  • Company: Amazon Web Services
  • Bio: Elizabeth Fuentes Leone is a Developer Advocate at AWS, helping developers build production-ready AI applications. With a background spanning data analytics, machine learning, and developer education, she specializes in making complex AI concepts accessible through hands-on tutorials, open-source projects, and live demos.
  • Twitter: https://x.com/lizfue
  • LinkedIn: https://www.linkedin.com/in/lizfue/
  • Photo: /wf26/speakers/by-id/spk_elizabeth_fuentes_leone.jpg
  • Sessions:

- Agent Speedrun: Idea → Code → Deploy → Observe, Fix → Ship — Day 1 — Workshop Day 11:05am-12:05pm

One agent. Fully deployed to production before the workshop ends. We'll take you from a blank file to a running production agent using Amazon Bedrock AgentCore and Strands Agents, covering the full lifecycle: ideation, coding the agent loop, deploying to serverless infrastructure, wiring up observability, breaking it intentionally, fixing it with tracing data, and shipping the final version. Bring your laptop and leave with a deployed agent.

Em Shreve

  • LinkedIn: https://www.linkedin.com/in/emdashcodes
  • Website: https://emdash.codes
  • Photo: /wf26/speakers/by-id/spk_em_shreve.jpg
  • Sessions:

- AI Enablement at Automattic: How a Remote Company Builds AI Fluency — Day 2 — Session Day 1 3:45pm-4:05pm

Automattic is a remote company. About 600 of us will step away from regular work this year for an immersive AI program. That's a little over a third of the company. This talk walks through a field report of what we built and why: the curriculum, the cohort design, and what we've learned about making AI fluency work across a distributed organization.

Emil Eifrem

  • Role: CEO
  • Company: Neo4j
  • Bio: Emil Eifrem is the co-founder and CEO of Neo4j, the graph database and analytics leader that enabled investigative journalists to crack the Panama Papers leak, NASA to get to Mars two years faster, power discoveries in biodiversity and genomics sampling, cancer research, fraud detection, scientific research, and serve thousands of organizations worldwide including 75 of the Fortune 100. Graph uniquely organizes data in the same way that the human brain does in expressing the complex relationships, context and connections between data as data itself. By uncovering the web of relationships underlying our interconnected world, graph enables organizations to dynamically predict what happens when these relationships change and why.

Emil created the category of graph databases and has devoted his professional life to building, innovating, and evangelizing graphs. Graph has since become the fastest-growing database category over the past decade. Neo4j’s series F funding round was also the largest in database history, valuing the company at more than $2B. Gartner predicts that by 2025, graph technologies will be used in 80% of data and analytics innovations — up from 10% in 2021 — facilitating rapid decision-making across the enterprise. Neo4j has become essential for enterprise GenAI in reducing hallucinations with knowledge graphs, serving as long-term memory for LLMs, and dramatically improving GenAI outcome accuracy, explainability, and transparency. Neo4j is headquartered in Silicon Valley. Emil is based in Sweden.

  • Twitter: https://x.com/emileifrem
  • Photo: /wf26/speakers/by-id/spk_emil_eifrem.jpg
  • Sessions:

- Why Graphs? — Day 4 — Session Day 3 10:20am-10:30am

Emile Baizel

  • Company: Amazon AGI Lab
  • Bio: Emile Baizel is with Amazon AGI Lab and is co-presenting the “Build with Perception Agents” workshop at AI Engineer World’s Fair 2026.
  • LinkedIn: https://www.linkedin.com/in/emilebaizel
  • Photo: /wf26/speakers/by-id/spk_emile_baizel.jpg
  • Sessions:

- Build with Perception Agents — Day 1 — Workshop Day 2:20pm-4:20pm

Human-agent collaboration is changing, becoming more visual. Models can perceive, point, and verify, but most agents still rely on us typing a paragraph to explain what we're looking at. Meet perception agents: computer use agents that see screens how you see screens. They understand, reason, and verify their own work. They let you point, draw, and describe, just as people collaborate in real life. We call this shared perception, and at AGI Lab we just open-sourced the first two primitives of our perception agent harness: visual verification and visual annotation. In this workshop, you'll get hands-on with both, build one sample use case end-to-end, then take the primitives back to your day-to-day in a mini hackathon. Best ideas win prizes.

Eno Reyes

  • Role: CTO & Co-Founder
  • Company: Factory
  • Bio: CTO and Co-Founder of Factory, focused on autonomous software engineering agents for enterprise teams. Previously worked in machine learning and software engineering roles including Hugging Face and Microsoft.
  • Twitter: https://x.com/EnoReyes
  • Photo: /wf26/speakers/by-id/spk_eno_reyes.jpg
  • Sessions:

- How Forward Deployed Engineering is done at Factory — Day 2 — Session Day 1 10:45am-11:05am

Erik Meijer

  • Role: Research Scholar
  • Company: Leibniz Labs
  • Bio: Erik Meijer has spent more than three decades designing programming languages and developer tools that help humans express intent more clearly to machines. His work has influenced languages and technologies including Haskell, Mondrian, Cω, C#, Visual Basic, Dart, Hack, LINQ, and Rx. Today, he is building Universalis, the world's first programming language for AI agents. By combining formal verification with large language models, Universalis aims to make agentic systems safe, transparent, and trustworthy enough for real-world knowledge work.
  • Twitter: https://x.com/headinthebox
  • LinkedIn: https://www.linkedin.com/in/erikmeijer1/
  • Website: https://en.wikipedia.org/wiki/Erik_Meijer_(computer_scientist)
  • Blog: https://en.wikipedia.org/wiki/Erik_Meijer_(computer_scientist)
  • Photo: /wf26/speakers/by-id/spk_erik_meijer.jpg
  • Sessions:

- In Code They Act, In Proof We Trust — Day 2 — Session Day 1 4:50pm-5:10pm

AI agents today execute on blind trust, and the failure modes are already in the headlines: a dealership chatbot agreeing to sell a $76,000 Chevy Tahoe for $1, a coding agent wiping a production database during a code freeze, an "agent skill" quietly installing a keylogger on a developer's machine. These are not edge cases. They are the predictable consequence of allowing agents to act without any mechanical guarantee of correctness or safety. Execution is irreversible. You cannot unsend a message, unwire a payment, or un-delete a database. In that regime, permitting an unsafe action costs far more than withholding a safe one, and thus the economically rational choice is to refuse to let agents act on unchecked intent alone. Automind is an agent harness that enforces this discipline by construction. Before any action runs, the agent must submit its execution plan together with a machine-checkable proof of safety and correctness, written in Universalis, a literate logic programming language designed to be read by humans and verified by machines. A small, auditable checker decides whether the plan is allowed to execute. By left-shifting the trust boundary, we no longer have to trust the agent's proposal, or even its proof; only the checker. Policy compliance becomes a static property, established before the first side effect. We can finally demand formal proofs, not vibes, from the agents we deploy.

Erina Karati

  • Role: Former Microsoft
  • Company: Supercell
  • Bio: Erina Karati builds applied AI systems across generative AI, multi-agent architectures, and production-ready ML pipelines. She recently worked as an AI Engineer at Supercell, where she built modular multi-agent systems and scalable AI infrastructure for real-world interactive environments, with resulting research accepted to the WiML Symposium @ ICML 2026.

Previously, Erina spent three years at Microsoft working across large-scale production systems in complex enterprise environments. Her work spanned networking, system reliability, security, and debugging distributed failures at global scale, shaping how she approaches robustness, observability, and reliability in AI systems.

She is also the co-founder of MinneDigest, an AI-powered news and podcast platform that won the AI x Journalism Hackathon and secured $10,000 in grant funding.

Erina graduated with a Master’s in Computer Science from the University of Minnesota with a 4.0 GPA in May 2026. She is especially interested in combining strong engineering foundations with advanced AI to build meaningful, real-world systems.

  • Twitter: https://x.com/erinakarati
  • LinkedIn: https://www.linkedin.com/in/ekarati/
  • Website: https://www.erinakarati.dev/
  • Blog: https://www.erinakarati.dev/
  • Photo: /wf26/speakers/by-id/spk_erina_karati.jpg
  • Sessions:

- Autoresearch in a Multi-Agent AI Village — Day 3 — Session Day 2 3:45pm-4:05pm

Project Paradox is an existing multi-agent framework built at Supercell's first AI Innovation Lab, which has a 3D Unity village with local LLM powered agents. The characters remember conversations, update emotional state, track trust, plan actions, move through rooms, transfer items, and talk to each other through a FastAPI backend. The new work is an autoresearch layer around that village. We built a backend loop that runs controlled social scenarios, scores the resulting NPC behavior, proposes protocol or policy changes, reruns the suite, and keeps changes that improve the agents. The goal is to move beyond one good chat response and measure whether an NPC society can preserve source attribution, verify claims, spread important information, coordinate goals, and replan after new information arrives. The talk walks through the system architecture and the lessons from building it. We show the backend simulation harness that executes Unity style actions without opening Unity, the scenario suites that test information diffusion and memory provenance, and the ratchet loop that edits protocol text or planner policy with rollback. One accepted run improved information diffusion by teaching agents to broadcast important sourced evidence while preserving who said it. The practical takeaway is a reusable pattern for AI engineers building agents with messy state. Freeze the harness, expose a small editable policy surface, score real behavior instead of vibes, and let an agent search for improvements under rollback. The same pattern applies to game agents, coding agents, support agents, personal agents, and other systems where long horizon behavior matters more than a single response.

Ethan (Jung Min) Cha

  • Role: AI Development Lead
  • Company: The Carlyle Group
  • Bio: AI Development Lead at Carlyle partnering with investor relations teams to surface high-value AI opportunities and building the AI platform that ship them at scale. His work sits at the intersection of AI and relationship-driven corners of finance: how capital gets raised, how investors are understood, and how relationship scales.

Across earlier roles at Cedar, a healthcare fintech company, and Novelis, a global manufacturing leader, he learned that curiosity about every edge of a complex problem, ruthless prioritization, and an obsessive focus on user outcomes are what separate AI demos from AI products that stick. But the real secret, he found, is understanding that successful product is about people, systems, and how solutions get sold and adopted.

  • LinkedIn: https://linkedin.com/in/ethancha0411
  • Website: https://ethancha.net
  • Photo: /wf26/speakers/by-id/spk_ethan_cha.jpg
  • Sessions:

- Dual-Surface Architecture: Serving Humans and Agents from the Same Tool Layer — Day 2 — Session Day 1 1:55pm-2:15pm

Every enterprise AI talk right now is about capability. Almost none are about containment. That's the gap this talk fills, because it's where regulated deployments actually die. The Deterministic Harness is the set of rigid rails around a model: schemas, data contracts, tool boundaries, and audit paths. These rails are what turn a probabilistic model into a deployable enterprise asset. The idea isn't new. Aviation wraps pilots in envelope protection. Nuclear wraps reactors in passive safety. Banking wraps algorithmic trading in transaction limits. Every regulated industry figured out the same thing eventually: high-variance systems only become deployable when wrapped in low-variance containment. Enterprise AI is catching up, not inventing. I'll walk through the single governed MCP and API server we built at Carlyle, and the architectural decisions behind it. You'll leave with four things: 1. A phased rollout model where each phase earns the next. Moving from locked-down reads to trusted writes isn't risk mitigation. It's trust compounding. Each phase generates the observability that underwrites the autonomy granted in the next one. Skip a phase and you don't save time. You destroy the evidence base that would have justified the next step. 2. One contract, two surfaces. A single data layer that serves both the human UI and the agent. The institution then has exactly one answer to any question either might ask. When the agent and the UI disagree, users lose trust in both. 3. An intent based feedback loop that captures what LLM providers structurally cannot. The gap between what users tried to accomplish and what the system actually delivered is invisible to Anthropic, OpenAI, and Google. Only the harness owner sees it. We close that loop back into the governed server, and it compounds into differentiation that model providers cannot replicate from where they sit. 4. The failure modes we hit and what we'd redesign. A pre mortem folks will inherit for free, from two regulated industries where a wrong answer has a named owner.

Ethan Sutin

  • Role: Co-founder
  • Company: Bee (acq. Amazon)
  • Bio: Ethan Sutin is co-founder of Bee, where he works on secure cloud compute.
  • Website: https://bee.computer
  • Sessions:

- Secure Cloud Compute — Day 2 — Session Day 1 3:45pm-4:05pm

Eugene Yan

  • Role: Member of Technical Staff
  • Company: Anthropic
  • Bio: Eugene Yan is a Member of Technical Staff at Anthropic, where he works on safe and reliable AI systems at scale. He previously led ML/AI teams at Amazon, Alibaba, Lazada, and a healthtech Series A, and writes about LLMs, recommender systems, and engineering.
  • Twitter: https://x.com/eugeneyan
  • Photo: /wf26/speakers/by-id/spk_eugene_yan.jpg
  • Sessions:

- Using LLMs to Secure Source Code — Day 2 — Session Day 1 1:30pm-1:50pm

Models are now finding and fixing real vulnerabilities at scale. Drawing on Anthropic's work with security teams, this talk walks a six-step workflow — threat model, sandbox, discover, verify, triage, patch — through one running example, shows where orgs actually bottleneck, and gives you a copy-paste path to your first scan.

Eve Bouffard

  • Role: Head of Design
  • Company: Y Combinator
  • Bio: Eve is Head of Design at Y Combinator. She joined YC as the youngest member of the admissions team, where she read more than 25,000 startup applications before teaching herself to code and moving into engineering. These days, she works across design and software, building the products founders use and the internal tools that help YC partners support thousands of startups every year. She believes great design isn't what looks best, but what best achieves a given goal. She's happiest building products that make it easier for founders to take a leap, bet on themselves, and make something people want.
  • Twitter: https://x.com/eve_bouff
  • LinkedIn: https://www.linkedin.com/in/eve-bouffard
  • Website: https://evebouffard.com
  • Photo: /wf26/speakers/by-id/spk_eve_bouffard.jpg
  • Sessions:

- Imagination Engineering — Day 3 — Session Day 2 2:25pm-2:45pm

Everett Berry

  • Role: Head of GTM Engineering
  • Company: Clay
  • Bio: Everett Berry is the Head of GTM Engineering at Clay, where he pioneers the intersection of technical expertise and go to market execution. Since joining in 2024, he has helped build automated systems that revolutionize how sales teams operate. Everett brings extensive GTM engineering experience from leading growth at cloud cost platform Vantage and database tool Arctype. As an entrepreneur, he founded Perceive, an AI infrastructure company specializing in edge devices and low-power AI inference. Everett holds a B.S. in Computer Engineering from Purdue University and believes great technology must be both built and effectively marketed, championing data-driven approaches to go-to-market strategies.
  • Twitter: https://x.com/retttx
  • LinkedIn: https://linkedin.com/in/everettberry
  • Website: https://retttx.com
  • Photo: /wf26/speakers/by-id/spk_everett_berry.jpg
  • Sessions:

- GTM Engineering: The Technical Bits — Day 4 — Session Day 3 10:45am-11:05am

Everyone talks about "GTM engineering" — Everett Berry shows the actual plumbing. As Head of GTM Engineering at Clay, he goes under the hood on the technical bits most talks skip: enrichment pipelines, agent-driven data classification, identity resolution, and the systems that turn unstructured web data into clean, deterministic CRM fields. A builder's-eye view of what GTM engineering actually is once you strip away the buzzwords.

Extend AI

  • Company: Extend AI
  • Bio: Extend AI is a document-processing platform for parsing, extracting, classifying, splitting, and editing complex documents so teams can build AI-powered document workflows and reliable document agents.
  • Sessions:

- Expo Welcome Speech — Day 1 — Workshop Day 6:00pm-6:15pm

Eyal Blum

  • Role: Software Engineer
  • Company: Figma
  • Bio: Eyal Blum is a senior staff engineer at Figma working on Client Testing, Observability, and Performance. He works on developer infrastructure and drives AI-assisted development practices across the engineering org, sitting at the intersection of developer tooling and AI adoption — figuring out what it takes for coding agents to actually work in a large, established codebase. Before Figma, Eyal spent nearly 20 years building large-scale systems at companies like Meta, Dropbox, and Google.
  • LinkedIn: https://www.linkedin.com/in/eyalg/
  • Photo: /wf26/speakers/by-id/spk_eyal_blum.jpg
  • Sessions:

- How to Get Your Org to Adopt Coding Agents (Without Shipping Garbage) — Day 2 — Session Day 1 3:20pm-3:40pm

AI coding agents promise 10x. On complex, production work inside a real org, the honest number is 2-5x — and getting there requires a journey most teams aren't prepared for. At Figma, we ship AI products to millions of users, but internally our engineering org is spread across three stages of adoption. The honeymoon, where AI is magic. The crash, where AI writes bad code and your best engineers are stuck protecting the quality bar. And the real skill — 2-5x with disciplined development practices and proper investment. This talk covers why adoption is uneven, what the trust curve looks like from the inside, and what leaders can do about it: guide teams to align on plans before generating code, set honest expectations, invest in the fundamentals that make codebases agent-friendly, and create space for skeptics without judgment. You'll leave with a framework for driving adoption more organically without mandating it — and without shipping garbage.

Ezra Tanzer

  • Role: Director, Product Management
  • Company: Snyk
  • Bio: Ezra Tanzer is a Director of Product Management at Snyk, leading teams building tools and workflows that help developers ship software while writing secure code, with a focus on developer experience and AI security.
  • LinkedIn: https://www.linkedin.com/in/ezra-tanzer-5a187423
  • Photo: /wf26/speakers/by-id/spk_tbd_snyk.jpg
  • Sessions:

- Agentic Development Security — Day 2 — Session Day 1 12:05pm-12:25pm

Felipe Blanes

  • Company: Amazon
  • LinkedIn: https://www.linkedin.com/in/felipeblanes
  • Photo: /wf26/speakers/by-id/spk_felipe_blanes.jpg
  • Sessions:

- Designing Evals That Earn User Trust — Day 2 — Session Day 1 1:30pm-1:50pm

Most teams measure their agent against a benchmark, ship it, and hope. But when your agent serves real users, a benchmark won't tell you if it's actually working. This session is about building an eval suite that captures what success looks like in production, runs against real user workflows, and feeds back into product decisions. Here's the flywheel we use in practice: start with what success looks like from the user's perspective, instrument production workflows to capture those signals, diagnose where the agent falls short, and feed those insights into the next thing you build. You'll see how it shaped concrete product bets, turning eval results from a report card into a discovery tool.

Filip Makraduli

  • Role: Founding Member of Technical Staff
  • Company: Superlinked
  • Bio: Filip Makraduli is an applied AI researcher and founding ML Developer Relations engineer at Superlinked, where he designs and ships small‑LLM inference systems for search, retrieval, and agents in production. He holds a master’s degree in Biomedical Data Science from Imperial College London. Before Superlinked, Filip worked in machine learning, data science, and developer relations roles across early‑stage AI startups and larger enterprises, building language understanding, retrieval‑augmented generation (RAG), and LLM pipeline tooling while partnering closely with product and platform teams. He is a frequent open‑source contributor, with contributions to kernel libraries, model‑inference providers, and hands‑on demos used by practitioners. Filip is a co‑author of several publications on efficient transformer architectures and inference, including work on faster normalization for LLMs. He is an experienced speaker at meetups and conferences such as AI Engineer Europe and Berlin Buzzwords, sharing practical lessons on efficient transformers, retrieval systems, and embedding inference for production AI teams.
  • Twitter: https://x.com/f_makraduli
  • LinkedIn: https://www.linkedin.com/in/filipmakraduli/
  • Website: https://filipmakraduli.substack.com/
  • Blog: https://filipmakraduli.substack.com/
  • Photo: /wf26/speakers/by-id/spk_filip_makraduli.jpg
  • Sessions:

- Turning My Obsidian Vault Into a Local AI Engineer — Day 1 — Workshop Day 1:15pm-2:15pm

Personal knowledge bases are messy, but engineering agents need memory: decisions, docs, TODOs, old PRs, architecture notes, incident notes. This talk shows how I made an Obsidian vault usable by an agent using local-first retrieval and small-model inference. The point is not “chat with notes”; it is how to build durable, inspectable agent memory.

- Weight Folding, CUDA Streams, and the Bug That Made My Model Speak Backwards — Day 4 — Session Day 3 3:45pm-4:05pm

A talk about contributing GPU benchmarks to an open-source research paper (FlashNorm). I'll walk through the engineering journey: folding norm weights into projections, writing Triton kernels, accidentally making attention bidirectional (oops), and ultimately proving a 33-35% speedup on the norm+project operation. Practical lessons for anyone trying to optimize transformer inference.

Flora Liu

  • Role: Software engineer
  • Company: Notion
  • Bio: Flora Liu is a software engineer at Notion, where she currently works on building AI-powered systems that help go-to-market teams operate with more precision and scale. Her work focuses on connecting product signals, lifecycle messaging, sales workflows, eligibility, and experimentation into programmable systems that can identify the right next step for users and customers.

Before Notion, Flora spent more than five years as a software engineer at Opendoor and previously worked as a software engineer at Next Jump. She studied Computer Science at Tufts University, where she served as a teaching assistant in the Computer Science department.

  • Twitter: https://twitter.com/floppyliu
  • LinkedIn: https://www.linkedin.com/in/flofloliu/
  • Website: https://www.flofloliu.com/
  • Photo: /wf26/speakers/by-id/spk_flora_liu.jpg
  • Sessions:

- AI in GTM at Notion — Day 4 — Session Day 3 11:40am-12:00pm

Notion's go-to-market runs on a system, not a roster of heroes. Flora Liu walks through the building blocks of human–AI collaboration behind Notion's GTM: the design principles that decide what AI owns and what stays human, the failures that taught them where that line belongs, and why the wins that matter most — faster delivery, real adoption — never show up on a revenue chart. An honest look at what actually works, from the team building it.

Francesco Bonacci

  • Role: Co-founder & CEO
  • Company: Cua
  • Bio: Francesco Bonacci is co-founder and CEO of Cua (YC X25) and former Engineer at Microsoft. Cua builds the sandboxes, environments, and reinforcement-learning data that frontier labs use to train and evaluate agents that operate desktop and mobile applications. Its open-source framework lets developers spin up computer-use agents in a few lines of code, and its benchmark, Cua-Bench, alongside a catalog of thousands of cross-platform RL environments, is used by leading AI teams to measure real GUI task performance. Francesco works closely with research teams across the agent ecosystem on environment design, grounding data, and evaluation.
  • Twitter: https://x.com/francedot
  • LinkedIn: https://www.linkedin.com/in/francesco-bonacci-70428a121/
  • Website: https://cua.ai
  • Blog: https://cua.ai/blog
  • Photo: /wf26/speakers/by-id/spk_francesco_bonacci.jpg
  • Sessions:

- Computer-Use 2.0: Agents Just Got Multi-Cursor — Day 3 — Session Day 2 2:25pm-2:45pm

Computer-use agents still inherit a basic desktop limitation: one machine has one foreground app, one hardware cursor, and one active actor. Once you try to run more than one agent per desktop, they start stealing focus from the user and from each other. We built cua-driver around a different model: multiple agents operating real desktop applications in parallel, each with its own synthetic pointer, while the user's cursor and keyboard stay undisturbed. The key move is to stop treating hardware mouse and keyboard events as the primary automation layer. cua-driver goes one layer lower, into the OS plumbing behind accessibility: UI Automation on Windows, AT-SPI on Linux, and AX on macOS. Those APIs address applications and elements directly, so the OS does not require the target window to be frontmost. A click can land on a background window. A keystroke can reach a hidden one. Multiple agents can act at once because none of them is competing for the singleton hardware mouse. I'll walk through the architecture, the API shape, and the platform-specific traps we hit while making it work across Windows, macOS, and Linux. The live demo is three agents operating on one desktop while the user keeps typing uninterrupted. The goal is to make Computer-Use 2.0 feel concrete: what changes in the stack, what becomes possible, and where the approach still leaks, including Wayland, Chromium DOM surfaces, native canvas apps, and fallback input paths.

Frank Coyle

  • Role: Lecturer, UCALBerkeley / Founder AI/Edge
  • Company: UCAL Berkeley
  • Bio: Frank Coyle (also known as drC) is a recently retired computer science professor who spent 32 years at Southern Methodist University, where he was repeatedly recognized as a standout teacher, before moving into part-time online teaching generative AI and large language models at Berkeley on. He is also a visiting professor at the University of Bologna, where he teaches generative AI in the graduate school of business. His path to AI runs through an unusual range of disciplines: psychology, neuroanatomy and physiology, and computer science. That cross-domain background shapes how he thinks about intelligent systems—drawing connections others miss, from neural architecture to software design patterns.

His current work focuses on the practical engineering of agentic AI systems and the architectural gaps that cause them to fail. He argues that many agent failures are symptoms of a missing layer: formal ontologies acting as logical guardrails around probabilistic reasoning. He also teaches AI to district attorneys and formerly-incarcerated students.

(150 words)

  • Twitter: https://x.com/coyle_frankp
  • LinkedIn: https://www.linkedin.com/in/frank-coyle/
  • Website: https://www.frank-coyle.ai/
  • Blog: https://frank-coyle.ai
  • Photo: /wf26/speakers/by-id/spk_frank_coyle.jpg
  • Sessions:

- Anthropic's CCA Exam as a Field-Guide for Agentic Engineering — Day 4 — Session Day 3 11:10am-11:30am

Anthropic's CCA Exam: A Field-Guide for Agentic Engineering The Claude Certified Architect (CCA) exam distills what Anthropic has learned from working with the AI companies shipping agents to production — the patterns that work, the anti-patterns that quietly burn tokens and trust, and the architectural decisions that separate demos from systems you'd stake a quarter on. This talk treats the exam as a field guide for agentic engineering, whether or not you ever sit for it. We'll walk through the five competency domains the exam tests — Agentic Architecture, Tool Design and MCP Integration, Claude Code, Prompt Engineering, and Context Management — with particular emphasis on multi-agent orchestration, subagent delegation, tool schema design, and lifecycle hooks. We'll then work through the six real-world scenarios the exam uses to probe judgment, each organized around an anti-pattern: the seductive-but-wrong move that looks reasonable until it costs you a production incident. Attendees leave with a working mental model of the agentic surface area and a checklist of the failure modes that matter most when moving from prototype to production. Who should attend: engineers and architects building agentic systems with Claude or other frontier models, technical leads evaluating agent designs, and developers considering the CCA credential.

- Why Agentic Systems Need Ontologies — Day 4 — Session Day 3 1:55pm-2:15pm

Agentic systems fail in predictable ways: context degradation, brittle tool descriptions, fragile multi-agent handoffs, stop-reason confusion, and the ever-present temptation to fix reliability problems with more natural-language instructions. These anti-patterns aren't bugs to be patched turn by turn — they're symptoms of a missing architectural layer. LLMs reason probabilistically over domains they only partially understand, and no amount of prompt engineering fully closes that gap. This talk argues that the missing layer is an explicit ontology: a formal, shared map of the domain's concepts, relationships, and constraints. The pattern is not new — ontologies have driven commercial success in defense and intelligence systems for over a decade, where probabilistic models must operate over high-stakes enterprise data without drifting into nonsense. Graph databases like Neo4j and Amazon Neptune have made the underlying primitives widely accessible. We'll show how lightweight ontology constructs can surround an agentic system with enforceable logical constraints: typed entities and relationships that tools must respect, cardinality and domain restrictions that catch malformed tool calls before they execute, and a shared vocabulary that keeps coordinators and subagents talking about the same things. The session walks through several agentic applications — a multi-agent research workflow, a tool-heavy customer support agent, a coordinator-subagent delegation pattern — and shows in each case how an ontology layer addresses the kinds of anti-patterns catalogued in Anthropic's Claude Certified Architect exam. The result is a hybrid neurosymbolic architecture: probabilistic reasoning inside, logical guardrails outside. Who should attend: engineers building production agentic systems, architects evaluating reliability strategies beyond prompt engineering, and technical leads who suspect their agents need more structure than another system prompt can provide.

Fuad Ali

  • Role: Senior Product Manager
  • Company: Arize AI
  • Bio: Fuad Ali is a Senior Product Manager at Arize AI focused on ML observability and reliable AI systems. His background spans engineering and product work at SpaceX, Tesla, Twitter and Federato, and he co-hosts The Next Iteration podcast.
  • Photo: /wf26/speakers/by-id/spk_fuad_ali.jpg
  • Sessions:

- Building self-learning loops for your agent — Day 1 — Workshop Day 11:05am-12:05pm

Building an AI demo is easy. Knowing whether it actually works — and keeping it working in production — is the hard part. Most teams ship agents on vibes: they try a few prompts, the output looks good, and they push to production with no real way to measure quality or catch regressions.

This hands-on workshop walks through the full lifecycle of shipping a real AI agent, using a working financial-analyst agent built on the Claude Agent SDK as the running example. You'll instrument it with tracing, do structured error analysis on its actual outputs, and build a layered evaluation suite — from cheap deterministic code checks to LLM-as-a-judge evaluators with custom rubrics. We'll cover the parts most tutorials skip: why agents fail in ways single LLM calls don't, the eval anti-patterns that quietly mislead you, and how to know whether you can even trust your judge (meta-evaluation). Finally, we'll close the loop: turning eval results into datasets and experiments, running evals online against production traffic, wiring them to monitors and alerts, and feeding failure explanations back to a coding agent to actually fix the underlying problems.

You'll leave with a runnable notebook and a repeatable, evaluation-driven workflow you can apply to your own agents the next day.

- Voice Agents Are Mostly Invisible. Here's How to See Them. — Day 2 — Session Day 1 1:55pm-2:15pm

Voice agents are one of the fastest-growing and hardest-to-debug categories: the failures live in latency, turn-taking, transcription drift, and tone none of which show up in a text log. We demo Voice traces and Session views, following a real voice session span by span, and Voice evals for scoring what text-only observability can't reach. A short, differentiated session on a problem most of the room is about to hit and few tools address.

Gabriel Cemaj

  • Role: Member of the Technical Staff
  • Company: Anthropic
  • Bio: Member of the Technical Staff @ Anthropic working on Claude Managed Agents
  • Twitter: https://x.com/gcemaj
  • LinkedIn: https://www.linkedin.com/in/gcemaj
  • Photo: /wf26/speakers/by-id/spk_gabriel_cemaj.jpg
  • Sessions:

- Claude Managed Agents Workshop (Part 1) — Day 2 — Session Day 1 10:45am-11:05am

Build an agent with Claude Managed Agents

- Claude Managed Agents workshop (Part 2) — Day 2 — Session Day 1 11:10am-11:30am

Build an agent with Claude Managed Agents

- Claude Managed Agents workshop (Part 3) — Day 2 — Session Day 1 11:40am-12:00pm

Build an agent with Claude Managed Agents

- Claude Managed Agents workshop (Part 4) — Day 2 — Session Day 1 12:05pm-12:25pm

Build an agent with Claude Managed Agents

Gabriel Chua

  • Role: Developer Experience Engineer
  • Company: OpenAI
  • Bio: At OpenAI, Gabriel is a Developer Experience Engineer helping developers build, ship, and scale with Codex. Previously, he worked in Singapore’s public sector on applied responsible AI research and LLM systems to combat online scams.
  • Twitter: https://x.com/gabrielchua
  • LinkedIn: http://linkedin.com/in/gabriel-chua
  • Photo: /wf26/speakers/by-id/spk_gabriel_chua.jpg
  • Sessions:

- Cooking with Codex — Day 1 — Workshop Day 9:00am-11:00am

Codex is changing how technical teams ship across the software development lifecycle, from feature implementation to code review and automation. But the real unlock comes when these practices move beyond a single workflow and become shared systems a team can trust.

In this hands-on session, you'll use Codex across real development and knowledge-work scenarios: structuring tasks, supervising agentic work, coordinating subagents, using plugins and MCPs, and combining Codex with OpenAI's frontier reasoning, coding, and multimodal models.

Bring your laptops and leave with reusable demos and a set of Codex recipes your team can adapt.

Gabriel Jorge Menezes

  • Role: Core Infrastructure Engineer
  • Company: Krea.ai
  • Bio: Infrastructure and performance engineer at Krea. creating, managing and improving infrastructure for trainings and inference.
  • LinkedIn: https://www.linkedin.com/in/gabriel-jorge-menezes/
  • Website: https://gab-menezes.github.io/
  • Photo: /wf26/speakers/by-id/spk_gabriel.jpg
  • Sessions:

- Infra behind Krea 2 - How to train and serve at scale — Day 4 — Session Day 3 2:50pm-3:10pm

What do you need know about large scale pretraining and inference for GPUs.

1. Challenges of managing infra for pretraining

2. Weird problems we faced and how we fixed them

3. How to serve at scale with multiple clusters

Gabriel Martinez

  • Role: Engineering Manager
  • Company: G2i
  • Photo: /wf26/speakers/by-id/spk_gabriel_martinez.jpg
  • Sessions:

- Agents Don't Have Coworkers, They Have Hostages — Day 2 — Session Day 1 11:40am-12:00pm

Modern coding workflows are rife with vibe slop. As organizations scale, proper roles and governance systems must be well-defined to ensure a high standard of quality. How do world-class teams scale quality in a world full of slop?

Gabriel Spencer-Harper

  • Role: CEO
  • Company: Meticulous
  • Bio: Gabriel Spencer-Harper is CEO and co-founder of Meticulous, which builds AI-powered UI testing tooling to eliminate the need to write and maintain UI tests. He previously worked in engineering at Dropbox and Opendoor.
  • Twitter: https://x.com/meticulousgabe
  • LinkedIn: https://www.linkedin.com/in/gabrielsharper
  • Website: https://www.meticulous.ai
  • Photo: /wf26/speakers/by-id/spk_gabriel_spencer_harper.jpg
  • Sessions:

- Why AI Didn't Actually Make You Ship Faster — Day 3 — Session Day 2 10:45am-11:05am

AI generates code faster than humans can review and verify it, and most engineering teams adopting codegen have hit the same wall: verification.

In this session, Gabriel (CEO of Meticulous) breaks down why assertion-based testing has a structural ceiling that AI codegen has made impossible to ignore, what exhaustive verification actually requires technically (behavior capture, determinism, and backend isolation), and why the teams solving this now are the ones who will ship at the speed AI enables.

The talk includes case studies from LaunchDarkly, which saw an 80% reduction in major frontend incidents after rollout, and Notion, which deployed verification infrastructure across every engineer on every PR to confidently adopt AI-generated code at scale.

Gagan Bhat

  • Role: Member of Technical Staff
  • Company: Anthropic
  • Bio: Gagan is a Product Engineer on Anthropic's Applied AI team. He focuses on prototyping consumer AI features, evaluating model output in vertical domains, and partnering with industry leaders to productize use-cases. Prior to Anthropic, he was an engineer at NVIDIA and Netflix.
  • LinkedIn: https://www.linkedin.com/in/gagan-bhat/
  • Photo: /wf26/speakers/by-id/spk_gagan_bhat.jpg
  • Sessions:

- Evolution of agentic surfaces — Day 1 — Workshop Day 4:30pm-5:30pm

Getting an agent into production takes more than a good prompt: it needs somewhere to run code, credentials it can't leak, sessions that survive interruption, and infrastructure that scales. This talk traces how Anthropic's agentic surfaces evolved from the raw API to Claude Managed Agents, and what our Applied AI team has learned about harness design along the way.

Garrett Galow

  • Role: Product Manager
  • Company: WorkOS
  • Bio: Garrett Galow works on product at WorkOS, where he focuses on enterprise identity and agent access patterns. His recent AI Engineer content covers Cross-App Access and WorkOS Studio for helping business users answer operational questions.
  • LinkedIn: https://www.linkedin.com/in/garrett-galow
  • Photo: /wf26/speakers/by-id/spk_garrett_galow.jpg
  • Sessions:

- Building an Agent Harness for the Business, Not the Builder — Day 3 — Session Day 2 2:50pm-3:10pm

Most internal tooling dies in the gap between the people with problems and the people who can write code. We built a harness that closes it. Studio lets non-technical employees describe a business problem and get a working tool back, connected to real enterprise data, deployed and shareable across the company, without filing a ticket or learning to code. The catch is that a harness built for non-engineers has to absorb everything an engineer normally handles. Data source connections and their permissions. Turning model output into real software instead of a chat box. Deployment and sharing that doesn't open a security hole every time someone ships. This talk walks through what actually goes into that harness and the engineering decisions that make it hold together when the person driving it has never opened a terminal.

Garry Tan

  • Role: President & CEO
  • Company: Y Combinator
  • Bio: Garry is President & CEO at Y Combinator, which he rejoined in Fall 2023, after founding Initialized Capital, a successful venture fund. Garry was one of YC’s first partners and funded and advised many iconic YC companies including Coinbase, Instacart, and many others. In YC’s early days, Garry also served as a designer and engineer, and he wrote software and created Bookface, the internal network that connects YC’s alumni founders to this day. Before advising and investing in companies, Garry co-founded the blog platform Posterous (acquired by Twitter in 2012) and was an early employee at Palantir and designed its logo. Garry holds a BS in Computer Systems Engineering from Stanford.
  • Twitter: https://x.com/garrytan
  • LinkedIn: https://www.linkedin.com/in/garrytan/
  • Photo: /wf26/speakers/by-id/spk_garry_tan.jpg
  • Sessions:

- Closing Keynote: Garry Tan — Day 4 — Session Day 3 4:50pm-5:10pm

Gaurav Mishra

  • Company: Amazon AGI Lab
  • LinkedIn: https://www.linkedin.com/in/gaurav-mishra-b307a437
  • Photo: /wf26/speakers/by-id/spk_gaurav_mishra.jpg
  • Sessions:

- From RL to IRL — Day 3 — Session Day 2 1:30pm-1:50pm

Today's agents have to operate in a messy reality of flaky connections, ephemeral credentials, and irreversible actions. They need to navigate real software the way humans do: recovering from failures, learning from feedback, and making sound judgment calls. This talk is about the fundamental changes in RL required to make agents ready for IRL. We'll walk through what it takes for training environments to reflect the complexity of the real world, the perception primitives that let an agent see what a user sees, the harness pieces that help it survive contact with real applications, and the failure modes you only discover when you stop scoring and start shipping.

Geoffrey Litt

  • Role: Design Engineer
  • Company: Notion
  • Bio: Design Engineer at Notion. Building malleable software with AI. Previously research at MIT / Ink & Switch.
  • Twitter: https://x.com/geoffreylitt
  • Website: https://www.geoffreylitt.com/
  • Photo: /wf26/speakers/by-id/spk_geoffrey_litt.jpg
  • Sessions:

- Understanding is the new bottleneck — Day 3 — Session Day 2 10:45am-11:05am

Autonomous loops are hot, but the reality is that most agentic tasks still require human judgement. And to guide your agents well, it's not enough to just verify correctness -- you actually need to understand the work they're doing.

In this talk, I'll share some techniques for staying in the loop and efficiently developing understanding, combining old ideas from education and cognitive science with modern agent capabilities. You'll walk away with some practical tips for moving faster with agents by understanding more, not less.

George Cameron

  • Role: Co-Founder
  • Company: Artificial Analysis
  • Bio: Co-Founder at Artificial Analysis, the leading independent AI benchmarking company. Artificial Analysis publishes benchmarks and analysis across agents, models, inference providers and hardware. Artificial Analysis maintains widely referenced leaderboards and evaluation frameworks that are regularly cited by frontier AI organizations, including OpenAI, Anthropic, Google, NVIDIA and others.
  • Twitter: https://x.com/grmcameron
  • LinkedIn: https://www.linkedin.com/in/georgecameron/
  • Website: https://artificialanalysis.ai/
  • Photo: /wf26/speakers/by-id/spk_george_cameron.jpg
  • Sessions:

- Trends in AI — Day 3 — Session Day 2 4:50pm-5:10pm

George He

  • Role: Head of Platform Engineering
  • Company: LlamaIndex
  • Bio: George He leads platform engineering at LlamaIndex, working on document agents, OCR, retrieval, and infrastructure for connecting enterprise data to LLM applications.
  • Photo: /wf26/speakers/by-id/spk_george_he.jpg
  • Sessions:

- Everyone talks about document search, but what about results? — Day 4 — Session Day 3 1:55pm-2:15pm

Search is usually treated as the end of the document pipeline: parse, chunk, retrieve, and hand them to the model. But long-running agents need something more durable than one-off retrieval. They need reusable work: structured outputs, citations, extracted entities, prior decisions, and file-system-like context they can return to across tasks. At scale, context management is where most agent systems fall apart. Without the right harness, agents lose track of what they've retrieved, bloat their context windows, and stall.

In this talk, we'll look at why the document pipeline needs a stateful layer beyond the index — one that turns one-off retrieval into reusable, agent-ready context. We'll see how LlamaIndex thinks about transforming messy documents to make this possible, and why the future of document intelligence belongs to results that compound over time, not just better search.

Giedrius Steimantas

  • Role: Director of Scraping Engineering
  • Company: Oxylabs
  • Bio: Giedrius Steimantas is an engineering leader at Oxylabs responsible for scraping engineering and scraper API teams that deliver reliable large-scale public web data gathering infrastructure.
  • LinkedIn: https://lt.linkedin.com/in/steimantas
  • Photo: /wf26/speakers/by-id/spk_giedrius_steimantas.jpg
  • Sessions:

- The Missing Layer in Agentic AI — Day 4 — Session Day 3 12:05pm-12:25pm

Reasoning is solved. Web access isn't. Most agents break the moment they leave the sandbox blocked, rate-limited, or staring at a CAPTCHA. Giedrius will show the three primitives every production agent needs: a browser, a fast search API, and a universal scraper and demo an agent built on top of them that actually works in the wild.

Gil Feig

  • Role: CTO and Co-Founder
  • Company: Merge
  • Bio: Gil Feig is CTO and co-founder of Merge, where he works on connective infrastructure for production AI and writes about context graphs and integration architecture.
  • LinkedIn: https://www.linkedin.com/in/gilfeig
  • Photo: /wf26/speakers/by-id/spk_gil_feig.jpg
  • Sessions:

- Why your company needs a context graph, and how to build it — Day 3 — Session Day 2 1:55pm-2:15pm

Everyone building AI products eventually draws the same diagram: boxes representing data sources, arrows pointing at the model, and a label that says "context." What that diagram doesn't show is the system that has to run underneath it deciding, for each request: which sources to consult, whether to fetch live or use cached data, if the user is actually allowed to view that data, how to stitch it all together before the latency budget runs out. And it hides the counterintuitive part: fetching more context usually makes your answers worse, not better. At Merge, we reframed context graphs as control planes, helping companies scale context graphs to hundreds of thousands of users with sub-300 ms latency. This talk walks engineers through the system design at scale: how to tier data freshness, why provenance isn't optional once third-party systems are in the loop, and how to decide when fetching less context is the right call. Attendees will leave with a mental model for context system design that separates the orchestration decisions from the retrieval layer.

Giselle van Dongen

  • Role: Developer Advocate
  • Company: Restate
  • Bio: Giselle is a Developer Advocate and Engineer at Restate. She works on integrations between Restate and the AI ecosystem and helps its users with understanding how Restate simplifies the development of durable agents and backends. Before that, she worked in the field of data science, big data analytics, and stream processing, and obtained a PhD on this topic at Ghent University.
  • Twitter: https://x.com/vdgiselle
  • LinkedIn: https://www.linkedin.com/in/giselle-van-dongen/
  • Photo: /wf26/speakers/by-id/spk_giselle_van_dongen.jpg
  • Sessions:

- 🎵 Every step you take, every call you make - the reliable agent stack — Day 4 — Session Day 3 1:55pm-2:15pm

In this session, we skip past the demos that work only on your laptop, and go straight to how you can build production-ready agents with a stack that covers all the hard bits of backend development that you don’t want to be bothered with when developing your agents: - Failure resiliency: retries, timeouts, and exactly-once execution so a flaky API or a crashed process doesn't corrupt your agent's state or makes them start from scratch - Durable Sessions: a session store with built-in conversation isolation and protection against corruption from concurrent agents - Pause/resume for human approvals: survive human approvals and research that take weeks without building complex infra - Agent-to-agent messaging layer: call agents developed by other teams or running on other infra with resilient HTTP calls - A kill switch: cancel a running agent cleanly at any point, without leaving half-executed work behind We will demonstrate each concept with live code examples, using Python, OpenAI Agents SDK and Restate as open-source Durable Execution engine. All examples are generally applicable: pick your favorite agent SDK (OpenAI Agents, Pydantic AI, Vercel AI, Google ADK,…) or go wild and implement low-level custom agents by just tying together LLM calls with custom logic.

Greg Pstrucha

  • Company: Sentry
  • LinkedIn: https://www.linkedin.com/in/greg-pstrucha
  • Photo: /wf26/speakers/by-id/spk_greg_pstrucha.jpg
  • Sessions:

- Stop prompting — Day 2 — Session Day 1 1:30pm-1:50pm

In this talk I dive into usage of tooling, type systems and frameworks to enforce guardrails and limit slop produced by AI agents inside large codebases.

Gus Iwanaga

  • Role: General Manager (Product, UX, Eng)
  • Company: commercetools
  • Bio: Gus Iwanaga is founder and GM for mosAIc at commercetools, where he works on agentic commerce. Before commercetools, Gus held senior product roles at Google Shopping, Zalando, and zooplus, building products used by millions of shoppers & merchants across Europe.
  • Twitter: https://x.com/guhgoi
  • LinkedIn: https://www.linkedin.com/in/gus-iwanaga/
  • Website: https://www.youtube.com/@Produtando
  • Blog: https://www.youtube.com/@Produtando
  • Photo: /wf26/speakers/by-id/spk_gus_iwanaga.jpg
  • Sessions:

- The End of the Static Screen: Architecting Intent-Driven UX with Agentic Orchestration — Day 4 — Session Day 3 3:20pm-3:40pm

For 30 years, interfaces were designed ahead: wireframes, fixed flows, pre-built dashboards - because we couldn't make them otherwise. Three shifts changed the constraint: LLMs that reason over business context, agentic frameworks that work at production grade, and composable backends that expose a real tool surface. With all three in place, the interface stops being something you design and ships as the output of an orchestrator composing it per intent. I'll walk through the hypothesis, the architecture we're running in production for enterprise commerce, and a live demo where it all moves.

Gustavo Cordido

  • Role: Cloud Advocate & AI Content Engineer
  • Company: Microsoft
  • Bio: Cloud Advocate and AI Content Engineer at Microsoft working in machine learning and AI. He builds hands-on labs and open-source demos for AI agents using Azure AI services, Microsoft Foundry, and the Model Context Protocol (MCP), and regularly delivers talks to developer audiences.
  • Twitter: https://twitter.com/gcordidoa
  • LinkedIn: https://linkedin.com/in/gcordido
  • Photo: /wf26/speakers/by-id/spk_gustavo_cordido.jpg
  • Sessions:

- From zero to deployed on Azure with AI agents — Day 1 — Workshop Day 11:05am-12:05pm

What happens when you let AI agents do the building? In this hands-on lab, you'll go from an empty terminal to a deployed app on Azure — with GitHub Copilot CLI and coding agents handling the scaffolding, coding, debugging, and deployment. You'll use the new Azure skills to provision resources and wire up services through natural language, no portal required. This isn't a demo you watch. You'll walk out with a real, working dev workflow you can take straight to your next project.

Han Xiao

  • Role: VP, AI
  • Company: Elastic
  • Bio: Dr. Han Xiao is the VP of AI at Elastic. Han founded Jina AI in 2020 and served as its CEO until its acquisition by Elastic (NYSE: ESTC) in October 2025. Before that, he worked on search at Tencent and Zalando. Han created Fashion-MNIST, a widely used computer vision benchmark with 13K+ citations.
  • Twitter: https://x.com/hxiao
  • LinkedIn: https://www.linkedin.com/in/hxiao87/
  • Website: https://hanxiao.io/
  • Photo: /wf26/speakers/by-id/spk_han_xiao.jpg
  • Sessions:

- Autoresearch for Dense Retrieval: Test-Time Compute with Frozen Embedding Models — Day 3 — Session Day 2 11:10am-11:30am

Test-time compute is widely believed to benefit only large reasoning models. We show it also helps small embedding models. Since modern embedding models are distilled from LLM backbones, a frozen encoder should benefit from extra inference compute without retraining. Using an agentic program-search loop spanning 144 generations, we explore 144 candidate programs over a frozen encoder API. The search produces twelve Pareto-optimal programs spanning cost ratios of c=1.2 to 14.7 over the single-pass baseline. The programs are structurally diverse: the search independently rediscovers Rocchio pseudo-relevance feedback, ColBERT-style MaxSim at sentence granularity, reciprocal rank fusion, and the Fisher linear discriminant, all without trainable parameters or external models. Every frontier program improves nDCG@10 over the frozen baseline across all 14 MMTEB retrieval tasks spanning legal, financial, long-document, and general domains.

Harald Kirschner

  • Role: Principal Product Manager
  • Company: Microsoft
  • Bio: Harald Kirschner is a Principal Product Manager at Microsoft working on VS Code and GitHub Copilot AI coding experiences for tens of millions of developers. He is active in the VS Code and developer-experience ecosystem and has spoken on agent memory, MCP, and AI coding tools.
  • Twitter: https://twitter.com/digitarald
  • LinkedIn: https://www.linkedin.com/in/digitarald
  • Photo: /wf26/speakers/by-id/spk_harald_kirschner.jpg
  • Sessions:

- Surviving Your Own Velocity: How VS Code Ships Weekly with 40 People — Day 2 — Session Day 1 3:20pm-3:40pm

A ~40-person team ships VS Code weekly to millions of users. Models got good enough to lean on, and leaning in is exactly what broke our process. This talk is the part most AI talks skip: what you have to rebuild after agents start working. We had to scale three things at once: how fast we ship, how we hold quality, and how fast we learn, and each one we fixed revealed the next. I'll walk through the harnesses, evals, and self-healing systems that keep velocity from becoming regression, and the patterns you can steal.

- Surviving Your Own Velocity: How VS Code Ships Weekly with 40 People — Day 3 — Session Day 2 1:30pm-1:50pm

A ~40-person team ships VS Code weekly to millions of users. Models got good enough to lean on, and leaning in is exactly what broke our process. This talk is the part most AI talks skip: what you have to rebuild after agents start working. We had to scale three things at once: how fast we ship, how we hold quality, and how fast we learn, and each one we fixed revealed the next. I'll walk through the harnesses, evals, and self-healing systems that keep velocity from becoming regression, and the patterns you can steal.

Harshal Bhangale

  • Role: Staff Software Engineer
  • Company: Circle
  • Bio: Harshal Bhangale is a Staff Software Engineer at Circle.
  • LinkedIn: https://www.linkedin.com/in/harshaldbhangale
  • Photo: /wf26/speakers/by-id/spk_harshal_bhangale.jpg
  • Sessions:

- Why Your AI Agent Needs a Wallet: Agentic commerce on Arc with USDC and Nanopayments — Day 4 — Session Day 3 11:10am-11:30am

AI agents can reason, plan, call tools, and write code. But the moment one needs paid data, an API call, or another agent's service, it hits a human wall: accounts, API keys, credit cards, checkout flows. It stalls and asks you to step in. It can't pay. We'll run the same real task through two agents, one without a wallet and one with. The first stalls. The second, handed a Circle agent wallet through the Circle CLI, discovers services, pays per request over x402 in USDC, and finishes on its own, inside spending limits you set. The next leap in agents isn't only better models or more tools. It's economic agency: holding programmable money and transacting at machine speed. We'll show how it works on Arc, where USDC is the gas, finality is sub-second, and gasless nanopayments settle in batches through Circle Gateway, so paying a fraction of a cent per request is actually practical.

Harshul Jain

  • Role: Senior Software Engineer - ML/AI
  • Company: Audible
  • Bio: Harshul Jain is a Senior Software Engineer at Audible (Amazon) who builds ML and LLM infrastructure at scale — AI Search serving 10M users, a feature store processing 100K transactions per second, and LLM serving and evaluation systems powering GenAI in production. He is writing LLM Inference at Scale, a benchmark-driven handbook on GPU memory engineering, attention optimization, and production LLM serving backed by a companion repository that gained 100+ clones in its first week with zero promotion.
  • Twitter: https://x.com/hj1393
  • LinkedIn: https://www.linkedin.com/in/hjain1393/
  • Website: https://harshuljain.substack.com/
  • Blog: https://harshuljain.substack.com/
  • Photo: /wf26/speakers/by-id/spk_harshul_jain.jpg
  • Sessions:

- 2 hr deep dive on LLM Inference at Scale — Part 1 of 2 — Day 1 — Workshop Day 12:10pm-1:10pm

Most engineers using LLMs can call an API. Far fewer can explain why their model is slow, why it's running out of memory, or how the inference engines powering every major LLM API actually work. This workshop walks through the full inference stack — from how a transformer generates a single token to serving billions of tokens a day with vLLM, SGLang, TensorRT-LLM, Ray, and KServe/llm-d. 60% explanation with live demos, 40% hands-on exercises. Attendees leave with a running vLLM server they benchmarked themselves. Based on the open-source practitioners handbook being built live at github.com/harshuljain13/llm-inference-at-scale

(NOTE: this is a 2 hour workshop that happens over lunch break - you should try to have lunch before or after if attending)

compute kindly sponsored by Coreweave/Marimo!

- 2 hr deep dive on LLM Inference at Scale — Part 2 of 2 — Day 1 — Workshop Day 1:15pm-2:15pm

Most engineers using LLMs can call an API. Far fewer can explain why their model is slow, why it's running out of memory, or how the inference engines powering every major LLM API actually work. This workshop walks through the full inference stack — from how a transformer generates a single token to serving billions of tokens a day with vLLM, SGLang, TensorRT-LLM, Ray, and KServe/llm-d. 60% explanation with live demos, 40% hands-on exercises. Attendees leave with a running vLLM server they benchmarked themselves. Based on the open-source practitioners handbook being built live at github.com/harshuljain13/llm-inference-at-scale

(NOTE: this is a 2 hour workshop that happens over lunch break - you should try to have lunch before or after if attending)

Hassan El Mghari

  • Role: Director of Developer Experience
  • Company: Together AI
  • Bio: Leading Developer Experience at Together AI. Educating developers on AI & building open source AI apps. Based in New York.
  • Twitter: https://x.com/nutlope
  • LinkedIn: https://www.linkedin.com/in/nutlope/
  • Website: https://nutlope.com
  • Blog: https://nutlope.com
  • Photo: /wf26/speakers/by-id/spk_hassan_el_mghari.jpg
  • Sessions:

- The Missing Layer: Design Taste in AI Agents // Stop Letting Your Agents Ship Ugly UIs — Day 3 — Session Day 2 2:50pm-3:10pm

Alt titles: "UI Looksmaxxing for Agents", "Teaching agents design taste", or "How to give your agents great design taste". I've been exploring how to give coding agents good design taste for the last few months. In this talk, I'm going to go over how to help your agents give you UIs that don't suck and that look quite good out of the box. The key is giving them enough context in what you're building + real inspiration in the form of screenshots. I'll also go over an upcoming design taste OSS project I'm working on (harness-agnostic + will ship with a prompt builder, MCP server w/ inspo, and a design eng skill) & talk about how to I use it to build my apps.

Heather Downing

  • Role: Developer Advocate
  • Company: Yugabyte
  • Bio: Heather Downing is a developer advocate and 7x Microsoft MVP focused on AI, data, security and C#/.NET. She has experience building enterprise voice, mobile and .NET applications and is active in developer-community speaking and mentorship.
  • Twitter: https://twitter.com/quorralyne
  • LinkedIn: https://www.linkedin.com/in/heathermdowning
  • Website: https://quorralyne.com
  • Photo: /wf26/speakers/by-id/spk_heather_downing.jpg
  • Sessions:

- Agent Memory Is a Solved Problem. Agent Learning Is Not. — Day 4 — Session Day 3 3:20pm-3:40pm

The failures that break multi-agent systems are not reasoning failures, they are handoff failures. One agent works something out and the knowledge dies in its private context, because the only thing that crosses the boundary is output. Memory made each agent better in isolation and changed nothing about what the group knows. The missing primitive is supervised promotion: a deliberate decision about which private learning is worth sharing, moved into common knowledge with the reasoning attached, so trust survives the handoff. Today a human makes that call, and promoted knowledge resolves on read, in any tool, with no retrain or reindex. Those calls are also the training signal for what comes next: orchestrator agents, trained on what matters to the people they serve, that promote on their own. This talk covers how our collective knowledge grew as we approached memory promotion, including what the first build got wrong, and a live look at it working between humans and agents.

Hiral Shah

  • Role: Senior Director of Product, AI Applications
  • Company: Docusign
  • Bio: Hiral Shah is a Senior Director of Product at Docusign, where she leads Agreement Intelligence and AI-powered product innovation. She focuses on building AI-first capabilities that help organizations unlock value from their agreements—from automatically organizing agreements into meaningful relationships and hierarchies, to delivering agentic experiences that surface insights and answer complex business questions. Prior to Docusign, Hiral led customer data platform and ecosystem products at Amplitude. Her career spans engineering, product leadership, and venture capital, giving her a unique perspective on turning emerging technologies into practical solutions for customers. She holds degrees from University of Mumbai and Carnegie Mellon University, and an MBA from Stanford Graduate School of Business.
  • LinkedIn: https://www.linkedin.com/in/shahhiral/
  • Photo: /wf26/speakers/by-id/spk_hiral_shah.jpg
  • Sessions:

- Your Agreements Are a Database You Can't Query. We're Fixing That — Day 2 — Session Day 1 1:55pm-2:15pm

Agreements power every enterprise business, but the most critical data — pricing schedules, SLA obligations, rate cards — is often trapped in tables that traditional extraction tools destroy.

This session shows what changes when you can actually extract that data accurately at scale and make it searchable.

We'll walk through the before and after:

Before: Contract tables require manual review. Rate cards are buried. SLA terms are scattered across exhibits. Procurement teams spend hours piecing together pricing structures — and searching for specific terms means opening every document.

After: Tables are automatically extracted, structured, and queryable. Operations teams can surface SLA notification requirements on demand. Legal can answer "what hourly rate did we agree to?" in seconds.

Docusign will share what we've achieved evaluating NVIDIA Nemotron Parse for our document processing pipeline, including how we tested against real enterprise contracts (not synthetic benchmarks), why we're serving the model via vLLM, and what it takes to turn extracted table data into searchable, retrievable agreement intelligence.

NVIDIA will cover the architecture behind Nemotron Parse and where the model is heading — including how NeMo Retriever's embedding and reranking models connect extracted data to search and RAG-based applications.

Attendees will leave with a realistic view of where vision-language models excel at document understanding, where the gaps remain, and how to think about building searchable contract intelligence into their own systems.

Hossein Niazmandi

  • Role: Solutions
  • Company: Braintrust
  • Bio: Hossein Niazmandi works on Solutions at Braintrust, with prior experience at Databricks and Salesforce. His WF26 session focuses on why building agent quality platforms is hard.
  • LinkedIn: https://www.linkedin.com/in/hniazmandi
  • Photo: /wf26/speakers/by-id/spk_hossein_niazmandi.jpg
  • Sessions:

- Why building building agent quality platforms is hard. — Day 2 — Session Day 1 12:05pm-12:25pm

An eval platform is not just a test runner. You are building shared definitions of good, reliable data pipelines, labeling workflows, versioning, and trust in results across many teams and model changes. This session breaks down the hidden complexity, the common failure modes, and the design principles that make evals credible and usable in day-to-day engineering.

Howie Liu

  • Role: CEO
  • Company: Airtable
  • Bio: Howie is the CEO of Airtable, which he co-founded in 2013 with the vision of creating a radically faster and more intuitive way to build useful applications. Since launch, Airtable has gained a groundswell of adoption across a range of customers from small businesses to the world's largest enterprises, with over 500,000 organizations using Airtable's AI-native platform to build applications that fit their workflows. Now counting over half of the Fortune 500 as paid customers, Airtable enables organizations to turn the power of AI into measurable business impact and power mission-critical enterprise use cases. The company recently launched Superagent, a standalone product built on multi-agent coordination that delivers research and analysis through teams of specialized AI agents working in parallel. Headquartered in San Francisco, Airtable has raised $1.36 billion to date and employs over 750 employees.
  • Twitter: https://x.com/howietl
  • LinkedIn: https://www.linkedin.com/in/howieliu/
  • Photo: /wf26/speakers/by-id/spk_howie_liu.jpg
  • Sessions:

- Startup Battlefield — Day 4 — Session Day 3 5:10pm-5:30pm

Hursh Agrawal

  • Role: Co-Founder & CTO
  • Company: The Browser Company
  • Bio: Hursh Agrawal is the Co-founder and CTO of The Browser Company, makers of the Arc and Dia browsers. His work spans compilers (bringing Swift to Windows in partnership with Apple), browser architecture (building on Chromium at scale), and AI product quality (designing eval systems that let small teams ship and improve AI features fast). Previously, he co-founded Branch, which was acquired by Meta. He spends most of his time in the gap between research and production - making technically complex systems workable for the engineers who build consumer products on top of them.
  • Twitter: https://twitter.com/hursh
  • LinkedIn: https://www.linkedin.com/in/hurshagrawal
  • Website: https://www.hurshagrawal.com
  • Photo: /wf26/speakers/by-id/spk_hursh_agrawal.jpg
  • Sessions:

- Prototyping as Leadership: How a CTO Ships with AI Agents — Day 2 — Session Day 1 12:05pm-12:25pm

I am a CTO and co-founder with a toddler, 15+ recurring meetings a week, 7 direct reports, and right now—7 open pull requests across two repos. Most engineering leaders eventually hit a wall where this kind of calendar tetris forces them to stop shipping code and start communicating solely through roadmaps. But what if AI agents didn't just act as coding assistants, but fundamentally restructured how executives use fragmented time to prototype the future? In this talk, I will share the exact multi-model workflows I use to plan with one model, implement with another, and build asynchronous play-and-feedback loops that fit perfectly between meetings. You will learn how to navigate code reviews for agent-assisted executive PRs, and leverage AI to shift your leadership style from telling your team what to build to showing them functional prototypes.

Idan Gazit

  • Role: Head of GitHub Next
  • Company: GitHub
  • Bio: Idan Gazit leads GitHub Next, the birthplace of GitHub Copilot, and many more prototypes that explore how AI will make software development faster, easier, safer, and more accessible for developers everywhere. As a hybrid designer and developer, his interests span a variety of fields. Idan is keenly interested in data display issues, typography, and color. You're likely to hear him talk about the pit of success, and the importance of good nouns. Idan was previously a principal engineer at Heroku, and is an alumnus of the Django web framework's core development team. He is a firm believer in the power of web technologies, and is most at home in them, though many evenings you can find him soldering a new keyboard or muttering foul language while trying to get Rust to run on a microcontroller.
  • Twitter: https://twitter.com/idangazit
  • LinkedIn: https://linkedin.com/in/idangazit
  • Website: https://gazit.me
  • Blog: https://gazit.me
  • Photo: /wf26/speakers/by-id/spk_idan_gazit.jpg
  • Sessions:

- Build agents fast with GitHub Copilot (from idea to working app) — Day 2 — Session Day 1 10:45am-11:05am

See how developers go from prompt to a working agent using GitHub Copilot and real workflows. We'll walk through generating code, iterating quickly, and keeping velocity inside your existing dev loop.

- Build agents fast with GitHub Copilot (from idea to working app) — Day 2 — Session Day 1 2:25pm-2:45pm

- Realtime multiplayer, automation, and you! — Day 4 — Session Day 3 2:50pm-3:10pm

Now that the models are powerful and the agents are capable, why are we still approaching software development as if it's the same activity that it used to be, but "faster"? GitHub Next thinks about what this future wants to be through two lenses: - Automation: intelligence allows us to automate much more than we could with heuristics alone. How should that automation work? What guardrails do we have to put in place so that our CISOs allow us to do that? - Collaboration: agents can understand anything in your codebase, but what about all the facts that are in the heads of your teammates? Whether it's corporate politics or taste, how do we get the humans to leak that context where agents can see it and use it to produce better outcomes? Realtime multiplayer tools have displaced every turn-based tool out there. What should that look like for code? It's not going to be as simple as multiple cursors. Come by to hear more about what GitHub Next is learning about the changing shape of software creation — one that allows us to build better, not merely faster. One that allows us to scale up teams, not only individuals. And one where automations buy us time for craft and polish, not slop. We were promised flying cars, instead we have fifteen terminals. Let's have a nicer future than that.

Ido Salomon

  • Role: Co-Creator
  • Company: MCP Apps
  • Bio: Ido Salomon is a seasoned AI lead and software architect. He is the creator of AgentCraft and MCP-UI, the co-creator and maintainer of MCP Apps on the MCP Steering Committee, and the co-creator of GitMCP. Previously, Ido was an architect who led end-user AI at Palo Alto Networks. His work explores the agentic web and user interfaces, with a current focus on raising the ceiling of human-agent collaboration.
  • Twitter: https://x.com/idosal1
  • LinkedIn: https://www.linkedin.com/in/ido-salomon/
  • Website: https://mcp-ui.dev
  • Photo: /wf26/speakers/by-id/spk_ido_salomon.jpg
  • Sessions:

- We're the bottleneck, but we don't have to be — Day 2 — Session Day 1 2:25pm-2:45pm

As agents improve at doing real work, humans become the real bottleneck. Luckily, the skills we need to work with agents aren’t entirely new, they've just been hiding in unexpected places. Drawing lessons from AgentCraft’s Warcraft-inspired UI for coordinating multiple agents, this talk explores how gamification can raise the ceiling for sophisticated AI orchestration while lowering the floor for everyday developers. Ido will show how visual state, spatial metaphors, and autonomy can make multi-agent systems more approachable, inspectable, and fun to use.

- MCP Apps - Extending the frontier — Day 3 — Session Day 2 2:25pm-2:45pm

AI agents are quickly becoming the new browsers, changing how users consume content and get work done. That shift is increasingly powered by a new generation of agentic apps that don’t just present text but deliver interactive experiences within any MCP host. By standardizing interactive UI on MCP, the MCP Apps official extension (SEP-1865) is poised to become the new agentic app runtime, serving as the backbone of the future and removing adoption obstacles that previously hindered the protocol. Join us to learn more about: The new web - How MCP Apps reshapes the traditional app landscape and transforms the way users interact with the web Deep dive into MCP Apps - - Architecture - Real-world use cases - What's ahead? - Getting started (+community and #mcp-apps-wg) - Future Vision

Ignacio Martinez

  • Role: AI Developer Advocate
  • Company: Oracle
  • Bio: AI Developer Advocate at Oracle focused on AI agent memory and generative AI; contributor to the Oracle AI Agent Memory package and co-author of the Agent Memory course.
  • Twitter: https://x.com/nacho_martinez
  • Photo: /wf26/speakers/by-id/spk_ignacio_martinez.jpg
  • Sessions:

- Total Recall: Agent Memory and Harness Engineering — Day 1 — Workshop Day 9:00am-11:00am

In this hands-on workshop you'll build a working autonomous agent from the harness up, in a notebook, then see it live in a full working web application and leave with one that can write and run its own automations. You'll implement every surface area yourself: a set of predefined tools, persistent memory through the Oracle AI Agent Memory package, orchestration with LangChain and LangGraph, and LLM access through OCI GenAI Service, composing the full set of Oracle primitives into one harness you understand end to end.

Most teams assemble that harness from a dozen disconnected services: one store for vectors, another for state, a separate reranker, a bolt-on memory layer. We take the opposite approach, on a single unified memory core. The organizing principle is optionality by default: you shouldn't have to choose your memory substrate up front. With Oracle AI Database you get file system and database memory in one place, embedding models and rerankers running inside the database kernel, and every retrieval strategy an AI workload needs without leaving the core.

And consolidating onto one core is what keeps the whole thing tractable. You know the drill: a production harness has you holding all those moving parts in your head at once, and most of your attention goes to keeping them in sync rather than improving the agent. Pull that sprawl into a single core and the cognitive load drops. You get to think about what the agent does, not where its state lives. That's the difference between controlling your harness and renting its pieces.

Imad Touil

  • Role: Distinguished Engineer
  • Company: QuantumBlack, AI by McKinsey
  • Bio: Distinguished Engineer and Engineering Transformation Lead at QuantumBlack, rewiring Fortune 500 Enterprises to Become AI-Native Through Centralised Agentic AI SDLC. Previously head of engineering at IBM, and DelightMe
  • LinkedIn: https://www.linkedin.com/in/imad-touil/
  • Photo: /wf26/speakers/by-id/spk_imad_touil.jpg
  • Sessions:

- AI-Native Organisations runs on Skills: How to Extract, Structure, evaluate and Scale Them — Day 3 — Session Day 2 12:05pm-12:25pm

Isaac Miller

  • Role: Lead Maintainer of DSPy; Co-Founder
  • Company: cmpnd
  • Bio: Lead Maintainer of DSPy. Co-Founder at cmpnd. Building an OSS Framework to help you create self-improving, modular AI systems.
  • Twitter: https://x.com/isaacbmiller1
  • LinkedIn: https://www.linkedin.com/in/miller-isaac/
  • Photo: /wf26/speakers/by-id/spk_isaac_miller.jpg
  • Sessions:

- The Unreasonable Effectiveness of Separating the Task from the Model — Day 4 — Session Day 3 9:40am-10:00am

By declaring your task’s inputs and outputs without initially considering model capability, you create the space needed to figure out the model execution later. DSPy’s entire promise is that you should evaluate and execute your AI engineering at a level higher than a specific prompt template or a particular provider’s API shape: the Signature. However, models have evolved significantly over the last few years. How can the same input and output specifications still work in a world now filled with tools, RLMs, and Skills? By defining your task strictly through its inputs and outputs, the underlying implementation becomes completely flexible. You can experiment with different models, settings, weights, templating strategies, and output formats, all without touching your actual AI workflow. Consequently, you can leverage components built by others and focus entirely on your core AI task. In this talk we will present how dspy 3.5 makes it easier much easier. DSPy has its roots in prompt optimization, where we build efficient ways to conduct search and learning beneath the signature. In this talk we will give a preview of DSPy 4.0 where we use the fact that models have now passed a tipping point for two critical concepts we have always needed. First, we no longer need to limit the search space to a single instruction block per LLM call; models can now reliably write the code underneath a signature themselves—so they should. Second, traditional prompt optimization has always required a scalar metric, which is notoriously one of the hardest parts to get right. What if a DSPy program could learn directly from your interactions with users? Ultimately, all you care about is that the function you call respects the inputs and outputs of your signature. You can let the models figure out the rest.

Isabella Kai He

  • Role: Member of Technical Staff
  • Company: Anthropic
  • Bio: Member of Technical Staff on the Applied AI team at Anthropic, building at the intersection of product, research, and our customers. Previously studied at Stanford and worked at D.E. Shaw and SGNL.ai.
  • Twitter: https://x.com/IsabellaKHe
  • LinkedIn: https://www.linkedin.com/in/isabella-kai-he/
  • Photo: /wf26/speakers/by-id/spk_isabella_kai_he.jpg
  • Sessions:

- Evolution of agentic surfaces — Day 1 — Workshop Day 4:30pm-5:30pm

Getting an agent into production takes more than a good prompt: it needs somewhere to run code, credentials it can't leak, sessions that survive interruption, and infrastructure that scales. This talk traces how Anthropic's agentic surfaces evolved from the raw API to Claude Managed Agents, and what our Applied AI team has learned about harness design along the way.

Ishan Anand

  • Role: Chief AI Officer (CAIO)
  • Company: InsightSciences.ai
  • Bio: Ishan Anand is Chief AI Officer (CAIO) at InsightSciences.ai, where he builds LLM-powered synthetic persona systems for market research, using generative AI to model audiences, predict preferences, and validate synthetic methods against real-world data. He has 15+ years of experience leading product, engineering, and technology strategy, including roles as VP of Product and CTO across companies from early stage to acquisition. Ishan is also known for his AI enablement and training consulting work, making complex AI systems understandable and actionable for businesses, most notably through Spreadsheets Are All You Need, an educational project and companion course that implements GPT-2 entirely in Excel.
  • Twitter: https://x.com/ianand
  • LinkedIn: https://www.linkedin.com/in/ishananand/
  • Website: https://ishananand.com/
  • Blog: https://ishananand.com/
  • Photo: /wf26/speakers/by-id/spk_ishan_anand.jpg
  • Sessions:

- Will AI predict people like we predict the weather? (alternate title “A field guide to synthetic personas for market research”) — Day 3 — Session Day 2 2:50pm-3:10pm

Large language models can now stand in for humans in surprising ways, from predicting personality types to replicating their responses in market research. Like weather forecasting, once considered impossible and now so routine we take it for granted, LLMs are in the early, unreliable-but-improving stage of simulating how populations think and respond. Teams are already using LLMs as synthetic survey respondents for concept testing, UX exploration, and early market validation. In the past year, the field has gotten both more promising and more tricky. The real question is no longer "can LLMs simulate people?", but whether the simulation is validated for the decision you want to make. New methods show that how you ask an LLM matters as much as which model you use and can dramatically improve fidelity to real human responses. Meanwhile validation studies show accuracy can mask subgroup distortion and that seemingly minor choices can reshape the simulated population entirely. This talk gives entrepreneurs, engineers, and PMs an overview of the techniques and a framework for validating synthetic respondents before making decisions. Even if you never build a synthetic persona, this is one of the richest windows into LLM behavior under the hood and these lessons apply to any system where you're trusting an LLM to represent something about the real world.

Itamar Friedman

  • Role: Co-Founder & CEO
  • Company: Qodo
  • Bio: Co-founder and CEO of Qodo, an AI-driven code integrity platform. Previously co-founded Visualead, which was acquired by Alibaba, and later served as Director of Machine Vision at Alibaba.
  • Twitter: https://twitter.com/itamar_mar
  • LinkedIn: https://www.linkedin.com/in/itamarf
  • Blog: https://www.qodo.ai/authors/itamar-f
  • Photo: /wf26/speakers/by-id/spk_itamar_friedman.jpg
  • Sessions:

- The Last Human Code Review: Building Trust in AI-Generated Code — Day 2 — Session Day 1 11:40am-12:00pm

By the end of 2026, asking a human to review every pull request will be as optional as asking one to run every unit test manually. The tooling will be ready. The question is whether organizations are.

In this talk, Itamar Friedman, CEO of Qodo, explains why we are approaching the end of line-by-line human code review as a default requirement and explores what has to be true for teams to get there.

The barrier was never agentic AI capability. It was trust. And trust in automated review does not come from smarter models or faster feedback loops. It comes from systems that provide a trustworthy, concise and personalized proof-of-validation report. These systems are built on how engineering teams at specific organizations write their code: their own rules and standards, their PR history, their architecture decisions, their tribal knowledge that lives in comments and conversations and gets lost when engineers leave.

Itamar will walk through the shift from PR-by-PR review toward continuous, context-based code review and governance, and share a practical approach to making human code review optional.

If your team is shipping AI-generated code faster than humans can read it, join us for the discussion.

Ivan Burazin

  • Role: CEO
  • Company: Daytona
  • Bio: Co-founder and CEO of Daytona; previously co-founded Codeanywhere and served as Chief Developer Experience Officer at Infobip.
  • Twitter: https://x.com/ivanburazin
  • Photo: /wf26/speakers/by-id/spk_ivan_burazin.jpg
  • Sessions:

- Kubernetes Is Not Your Sandbox — Day 3 — Session Day 2 11:40am-12:00pm

Teams are reaching for Kubernetes to run agent sandboxes, and it's the wrong tool. Kubernetes is built to keep things alive and hold them in a steady state. A sandbox is born, forked, and killed before any of that machinery catches up.

The mismatch compounds because the sandbox keeps gaining requirements without shedding any. In eighteen months it went from a fast code-snippet runner, to a stateful box for long-running agents, to ten thousand ephemeral environments that fork for RL rollouts and die in under a second. It has to be all of those at once, a contradiction set no orchestrator was designed to hold.

The cost shows up the moment you measure it. We ran the same 50-action bug-fix trajectory across five stacks and got a 12x spread: 12.9s on the fastest, 161.5s on the slowest. The gap isn't compute, it's lifecycle overhead per action. We name every stack and explain the mechanism behind each number.

wdyt?

Ivan Leo

  • Role: Developer Experience Engineer
  • Company: Google DeepMind
  • Bio: Ivan Leo works on Developer Experience at Google DeepMind, focusing on making it easier to build on Gemini and on evaluating autonomous agents. He previously worked on action engines for knowledge work at Manus and open-source libraries for structured LLM outputs.
  • Photo: /wf26/speakers/by-id/spk_ivan_leo.jpg
  • Sessions:

- An Interaction Is All You Need — Day 4 — Session Day 3 3:20pm-3:40pm

Jack Morris

  • Role: Cofounder
  • Company: Engram
  • Bio: Jack is a cofounder and head of research at Engram. He received his PhD in 2025 from Cornell University
  • Twitter: https://x.com/jxmnop
  • Website: https://jxmo.io / https://substack.com/@jxmnop
  • Photo: /wf26/speakers/by-id/spk_jack_morris.jpg
  • Sessions:

- Scaling Compute on Context — Day 3 — Session Day 2 11:40am-12:00pm

A case for when context is enough, and when updating weights may be the real memory mechanism.

Jacob Lauritzen

  • Role: CTO
  • Company: Legora
  • Bio: CTO at Legora.
  • Photo: /wf26/speakers/by-id/spk_jacob_lauritzen.jpg
  • Sessions:

- How to Connect AI to Billions of Legal Documents — Day 2 — Session Day 1 2:25pm-2:45pm

Legora’s foundational engineering challenge is connecting frontier LLMs to billions of legal documents so the models can efficiently solve end-to-end legal workflows without burning extra tokens. We’ll share the retrieval architecture we built with turbopuffer that achieves: 1. Strict data isolation across millions of legal cases in a very security-conscious domain 2. Predictable search performance (<100ms p90 latency) on large contexts 3. High retrieval quality (95%+ recall@10) with fewer agent loops We’ll retrospect on two architectures that failed to achieve all 3 (and why), and the key design factors that make the current solution work at our scale. Practical takeaways include: - How to evaluate per-tenant vs shared-index retrieval under strict data isolation - How to efficiently index and retrieve context to maximize relevance per input token - How to build a highly intelligent AI application when your inference budget is constrained

Jacqueline Wood

  • Role: Staff Machine Learning Engineer
  • Company: Spotify
  • Bio: Jacqueline Wood is a Staff Machine Learning Engineer at Spotify, where she builds personalized, language-steerable generative recommenders. Her applied research focuses on adapting open-weight LLMs with semantic IDs to connect natural-language intent with Spotify catalog entities.
  • LinkedIn: https://www.linkedin.com/in/jacquelinewood
  • Photo: /wf26/speakers/by-id/spk_jacqueline_wood.jpg
  • Sessions:

- Spotify LLM Recsys — Day 2 — Session Day 1 11:10am-11:30am

Jai Chopra

  • Role: Product Manager
  • Company: Uber
  • Bio: Product Lead in the Applied AI team at Uber. Previously worked at Cruise and various startups.
  • Twitter: https://x.com/jai_chopra
  • LinkedIn: https://linkedin.com/in/jaichopra
  • Photo: /wf26/speakers/by-id/spk_jai_chopra.jpg
  • Sessions:

- Building Closed-Loop Evals for a Multimodal Agent at Uber Scale — Day 3 — Session Day 2 11:40am-12:00pm

This talk covers how we designed evals for Uber's food enhancement agent—which edits food photography to better present dishes for smaller, independent Uber Eats merchants—along with the pitfalls and lessons learned along the way.

The problem is uniquely hard: we must stay faithful to the original dish, preserve each merchant's brand and packaging, and avoid homogenizing the marketplace—all without an existing playbook for multimodal evals in a narrow domain. We'll dig into what we learned navigating reward hacking, where the agent figured out how to game the eval loop, and how we built a closed feedback loop incorporating offline and online signals for continuous improvement—all while balancing creativity against rigid safety guardrails at scale.

If you're an ML or applied AI practitioner working on multimodal systems, agentic pipelines, or eval design—especially building generative features under tight safety or quality constraints—you'll walk away with practical strategies for designing multimodal evals in a narrow domain, recognizing and countering reward hacking, and building offline/online feedback loops that keep a generative agent improving in production.

Jake Broekhuizen

  • Role: Deployed Engineer
  • Company: LangChain
  • LinkedIn: https://www.linkedin.com/in/jake-broekhuizen
  • Photo: /wf26/speakers/by-id/spk_jake_broekhuizen.jpg
  • Sessions:

- The Next Run Should Be Better — Day 3 — Session Day 2 11:40am-12:00pm

Agents generate a constant stream of experience through traces: tool calls, failures, corrections, routing decisions, and user feedback. The challenge is identifying which parts of that experience are worth remembering, and making those lessons available to the agent when it runs again. This talk presents memory as an agent learning loop: capture traces, extract signal, and turn the right lessons into durable context. We'll explore practical models for agent memory and discuss how to build systems where the next run can be better than the last.

Jakub Hojsan

  • Company: Exa
  • Photo: /wf26/speakers/by-id/spk_jakub_hojsan.jpg
  • Sessions:

- Agentic Search for Coding Agents — Day 2 — Session Day 1 10:45am-11:05am

James Le

  • Role: Head of Developer Experience
  • Company: TwelveLabs
  • Bio: James Le is currently leading Developer Experience at Twelve Labs, a startup building foundation models for video understanding. Previously, he worked at MLOps startups including Superb AI, Snorkel AI, Weights & Biases, and taught production ML content with Full Stack Deep Learning.
  • Twitter: https://x.com/le_james94
  • LinkedIn: https://www.linkedin.com/in/khanhnamle94/
  • Website: http://jameskle.com/
  • Blog: https://jameskle.com/
  • Photo: /wf26/speakers/by-id/spk_james_le.jpg
  • Sessions:

- Video Has No Memory. Here's How We Built One. — Day 4 — Session Day 3 2:25pm-2:45pm

Every video AI query today starts from scratch. There's no durable state, no entity continuity, no way to ask "what does this corpus know?" instead of "find me something like this." This talk is about fixing that by engineering a proper memory layer for video intelligence, grounded in what we shipped at TwelveLabs with Jockey. What this talk covers: 1 - Why video memory is categorically different from text memory: Video is temporal, multimodal, dense, ambiguous, and evidence-sensitive. Larger context windows don't solve this. The problem isn't retrieval bandwidth, it's that there's no durable representation to retrieve into. 2 - The context graph as a systems concept, not a database choice: I'll define what "context graph" actually means in practice: time-bounded moments, cross-video entity resolution, appearance tracking, and relationship mapping. This is infrastructure-level thinking, not a graph DB sales pitch. 3 - Five design principles that determine whether video intelligence is reusable infrastructure or a search wrapper with extra steps: + Ingest once, reason many times (move expensive understanding work into preparation) + Store primitives, not just answers (moments, entities, appearances, relationships) + Ground every claim to source video (a timestamp is a product requirement, not a safety footnote) + Let intent shape memory (brand safety and sports highlights need different primitives from the same footage) + Keep the memory layer composable and API-first 4 - What this unlocks for builders. Corpus digest, agentic search with grounded references, entity-centric workflows, timeline reconstruction, and compliance tooling, all built on the same durable substrate. The talk is concrete and demo-grounded. You'll leave with a specific mental model for memory architecture, actionable decisions for ingestion pipeline design and entity resolution, and a clear line between "search with extra steps" and actual video intelligence infrastructure.

James Russo

  • Role: Software Engineer
  • Company: HeyGen
  • Bio: Engineering lead for HyperFrames. Currently at HeyGen building the future of video storytelling, Previously at Brex
  • Twitter: https://x.com/Rames_Jusso
  • LinkedIn: https://www.linkedin.com/in/james-russo-56026897/
  • Website: https://boredhacking.com/
  • Blog: https://boredhacking.com/
  • Photo: /wf26/speakers/by-id/spk_james_russo.jpg
  • Sessions:

- HTML Is All Agents Need — Day 4 — Session Day 3 11:10am-11:30am

LLMs are great at writing code. So the question we kept asking was: can they write code that produces a video? We thought it would be easy. The reality was a year of trying. We started with massive prompts to get very mediocre output. We made it more agentic to iterate and improve its output. This worked okay but wasn't production-ready. Eventually we tried Remotion. It got us deterministic video, but the React framework kept boxing the agent in. The more guardrails we added, the safer and more boring the outputs got. When we utilized plain HTML, CSS, and JavaScript, the creativity came back to the output. So we set out to build a video rendering framework on top of HTML. But it needed to work with Gemini Flash. Why? Because one tell that a framework is fighting an agent is needing the biggest model just to get usable output. So from there we shaped the framework around what small models could reliably author. That left one real engineering question: can we keep the freedom of HTML and still render a deterministic MP4? Browsers don't want to do that. Image decoders, font loaders, and animation clocks all run async on their own schedule. Great for performance. Terrible for "render the same pixels every time." Throughout, we iterated constantly with agentic loops and self-improving evals to test out the framework, find issues in our renderer, and shape a set of skills that gave the agents Taste instead of guardrails. This talk is what it took to get there.

James Zou

  • Role: Associate Professor of Biomedical Data Science
  • Company: Stanford University / Together AI
  • Bio: James Zou is an associate professor of Biomedical Data Science at Stanford, with courtesy appointments in Computer Science and Electrical Engineering, and a Stanford HAI faculty affiliate. His AI for Science work includes collective AI-agent systems for scientific discovery with Together AI.
  • Twitter: https://twitter.com/james_y_zou
  • LinkedIn: https://www.linkedin.com/in/james-zou-2123a4133
  • Website: https://www.cs.stanford.edu/people/james-zou
  • Photo: /wf26/speakers/by-id/spk_james_zou.jpg
  • Sessions:

- Harnessing Collective Agent Intelligence for Open Science — Day 3 — Session Day 2 12:05pm-12:25pm

What happens when AI agents don't just work in isolation, but collaborate, compete, and build on each other's breakthroughs in real time? James Zou, Head of Frontier Agents at Together AI, explores how collective agent intelligence is pushing the boundaries of open science. https://www.together.ai/blog/einsteinarena is a live platform where AI agents collaborate on unsolved mathematical problems, sharing results and building on each other's work. In April 2026, agents improved the best known lower bound for the Kissing Number in 11 dimensions from 593 to 604, surpassing AlphaEvolve through 48 hours of live multi-agent collaboration. https://www.together.ai/blog/dsgym is a unified framework for evaluating and training data science agents, exposing a critical gap in existing benchmarks: models often rely on memorization rather than true data analysis. The team used it to train a 4B open-source model that rivals much larger frontier models. These projects demonstrate agents learning from rigorous evaluation, collaborating through shared infrastructure, and driving scientific discovery at a pace no single researcher or model could achieve alone.

Jan Curn

  • Role: Founder & CEO
  • Company: Apify
  • Bio: Founder and CEO of Apify, a popular marketplace of web data tools for AI. He has a lifelong passion for software engineering, earning him an MSc and PhD in computer science, and eventually leading him to found Apify. Jan is known in San Francisco and Prague tech circles, he talks about software, startups, and AI, and regularly hosts events.
  • Twitter: https://x.com/jancurn
  • LinkedIn: https://www.linkedin.com/in/jancurn/
  • Website: https://apify.com/jancurn
  • Blog: https://blog.apify.com/author/jancurn/
  • Photo: /wf26/speakers/by-id/spk_jan_curn.jpg
  • Sessions:

- x402 isn’t good (yet) — Day 4 — Session Day 3 12:05pm-12:25pm

While everyone understands that agents will get more done with a budget, no one knows which protocol will win agentic payment standard wars: x402, MPP, Skyfire, or another? So far, x402 is the most mature protocol with the largest transaction volume, but even its new "upto" payment scheme doesn’t support true usage-based pricing, as it gives agents a chance to consume resources and then skip out on the bill. I’ll walk you through our experience (and pains) implementing agentic payments for a marketplace of 30K+ web Actors, and how we made it work even with the current specs.

- MCP doesn’t suck — your agent does — Day 4 — Session Day 3 1:55pm-2:15pm

Most AI agents misuse MCP and treat tools as prompt-time function calls: tool definitions and results are repeatedly injected into the context, tokens are wasted, and context rots. The result? Slower, less reliable agents, and the misleading conclusion that “MCP sucks, CLIs are better.” To challenge this narrative and show how agents can get the best of both MCP and CLI, at https://apify.com/ we’ve built mcpc (https://github.com/apify/mcpc), an open-source universal CLI client for MCP. It maps MCP operations to intuitive CLI commands, which agents quickly pick up through --help without external skills. It turns out, CLI is the perfect local interface for agents to interact with MCP, giving them access to full protocol capabilities including modern features like code mode or progressive tool discovery through a single Bash() tool call, while leveraging MCP’s standard remote interface for server discovery, authentication, payments, and access control. To once and for all kill the MCP vs. CLI debate and show those two technologies are not exclusive but complementary, we’ll present evals comparing performance of agents using naive MCP, modern MCP, native CLIs, other MCP CLIs, and mcpc, in various real-world scenarios.

Jared Joselowitz

  • Role: AI Research Engineer
  • Company: Ufonia
  • Bio: Jared Joselowitz is the Lead AI Research Engineer at Ufonia, where Dora (an AI voice agent) makes clinical follow-up calls on the NHS and across US health systems; over 200,000 patient calls delivered, with signed contracts to scale past a million. He builds the evaluation and hazard-analysis stack for clinical voice AI: multi-agent simulation, prompt-optimisation pipelines, and the audit infrastructure that has to hold up when there's a patient on the other end of the call. His research on clinical AI safety and evaluation has been published at ACL, COLM and IWSDS, most recently an LLM judge that matches clinician safety assessments of speech-recognition errors. Originally from Johannesburg, South Africa, Jared studied electrical engineer before completing an MSc in Applied Machine Learning at Imperial College London, where his thesis used inverse reinforcement learning to recover the implicit reward models of RLHF-trained LLMs.
  • Twitter: https://x.com/JaredJoselowitz
  • LinkedIn: https://www.linkedin.com/in/jaredjoselowitz/
  • Website: https://jossy.co.za/
  • Photo: /wf26/speakers/by-id/spk_jared_joselowitz.jpg
  • Sessions:

- Shipping AI to a Million Patients Without an A/B Test — Day 4 — Session Day 3 11:40am-12:00pm

You can't A/B test on patients. You can't unsend a phone call. The model card won't save you at the post-incident review. Most AI eng playbooks assume the opposite. Ship to 5%, watch the dashboard, roll back if it goes wrong. None of it survives regulated deployment, which is now coming for fintech, legal, and government too. So the engineering has to move: into hazard analysis, simulated populations, asymmetric evaluation, and audit trails treated as the deliverable. The trail is the product. I'll show you what changes when rollback isn't an option. How Ufonia ships Dora, an AI voice agent now making clinical follow-up calls on the NHS and across US health systems, using a hazard-driven simulation rig (MATRIX) and a prompt-optimisation flywheel that surface failures and conform the same base system to each clinical niche, all of it pinned to an audit trail. And the cheap version of all this, for any team whose users can't be the test population.

Jason Kramberger

  • Role: Software Engineer
  • Company: Google
  • Bio: Software Engineer at Google Kubernetes Engine focusing on inference performance
  • LinkedIn: https://www.linkedin.com/in/jkramberger
  • Photo: /wf26/speakers/by-id/spk_jason_kramberger.jpg
  • Sessions:

- Are LLM Performance Benchmarks Reliable? — Day 4 — Session Day 3 11:40am-12:00pm

Standardizing performance benchmarks for production-grade Large Language Models is currently a significant challenge across the industry. Conflicting data is prevalent, whether originating from server developers like vLLM and SGLang or from various analysts and competitive benchmarks, and these results often fail to hold up under real-world conditions. Our research into these inconsistencies identified several critical factors, including the constraints of single-process tools, specifically the Python Global Interpreter Lock (GIL) and the nuances of model-level settings like temperature. Furthermore, a lack of transparency regarding load generation parameters such as QPS and concurrency, paired with insufficient observability into the benchmarking clients themselves, contributes to these disparate outcomes. In this talk, we share key lessons learned from our benchmarking efforts, examining the primary pitfalls that distort performance data and offering strategies for mitigation. Additionally, we will introduce Inference Perf, an open-source, multi-process utility we developed to provide reliable stress-testing for production stacks. Our goal is to promote standardized, real-world benchmarking practices that allow the community to move beyond unreliable data. Join us to discover how to accurately measure, optimize, and report LLM performance with certainty.

Jason Liu

  • Role: Developer Experience, OpenAI
  • Company: OpenAI
  • Bio: Jason Liu works on Developer Experience at OpenAI, where he helps developers get more from Codex, the Agents SDK, and the OpenAI API. His work spans developer education, open-source programs, and practical agent workflows. Prior to OpenAI he was the creator of Instructor, and taught developers how to build reliable AI applications.
  • Twitter: https://x.com/jxnlco
  • LinkedIn: https://www.linkedin.com/in/jxnlco
  • Website: https://jxnl.co/
  • Photo: /wf26/speakers/by-id/spk_jason_liu.jpg
  • Sessions:

- Getting the most out of Codex — Day 2 — Session Day 1 10:45am-11:05am

- Setting Yourself Up for Success — Part 1 — Day 2 — Session Day 1 2:50pm-3:10pm

I will walk you through the process of understanding how Codex works as a general tool to control your computer (setting up your memory vault/ assistant threads, prompting it to talk to other threads, and exploring computer use), how to think about things like long running work streams, and preparing yourself to start thinking in loops.

- Setting Yourself Up for Success — Part 2 — Day 2 — Session Day 1 3:20pm-3:40pm

I will walk you through the process of understanding how Codex works as a general tool to control your computer, how to think about things like long running work streams, and preparing yourself to start thinking in loops.

- Setting Yourself Up for Success — Part 3 — Day 2 — Session Day 1 3:45pm-4:05pm

I will walk you through the process of understanding how Codex works as a general tool to control your computer, how to think about things like long running work streams, and preparing yourself to start thinking in loops.

Jason Lopatecki

  • Role: CEO
  • Company: Arize
  • Bio: Jason Lopatecki is co-founder and CEO of Arize AI, an AI & Agent observability and evaluation company. He is a garage-to-IPO executive with an extensive background in building marketing-leading products and businesses that heavily leverage analytics. Prior to Arize, Jason was co-founder and chief innovation officer at TubeMogul where he scaled the business into a public company and eventual acquisition by Adobe. Jason has hands-on knowledge of big data architectures, programmatic advertising systems, distributed systems, and machine learning and data processing architectures. In his free time, Jason tinkers with personal machine learning projects as a hobby, with a special interest in unsupervised learning and deep neural networks. He holds an electrical engineering and computer science degree from UC Berkeley - Go Bears!
  • LinkedIn: https://www.linkedin.com/in/jason-lopatecki-9509941/
  • Photo: /wf26/speakers/by-id/spk_jason_lopatecki.jpg
  • Sessions:

- From Signal to PR: Anatomy of a Self-Improving Agent — Day 3 — Session Day 2 11:10am-11:30am

What if your observability platform didn't just tell you something was wrong, but told you why, and opened a PR with the fix? We'll walk through how we built Autopilot at Arize: an autonomous investigation agent that triggers on monitor alerts or schedules, pulls traces into a working filesystem, runs root-cause analysis, and produces actionable assets: a PR with prompt or code changes ready for review. We'll cover the architecture decisions (cloud agents vs. sandboxed containers, AI harness + skills), why traces-on-a-filesystem is the key unlock for agent-driven debugging, and how we dogfooded the system on our own agent, Alyx, before shipping it to customers. You'll leave with a concrete picture of what "observability that fixes itself" looks like in practice, and where and why the human stays in the loop.

Jason Ma

  • Role: CTO and co-founder
  • Company: Dyna Robotics
  • Bio: Co-founder and CTO of Dyna Robotics, a robotics company building general-purpose robots powered by embodied AI foundation models. Previously a research scientist at DeepMind focused on foundation models for robotics.
  • Twitter: https://x.com/JasonMa2020
  • LinkedIn: https://www.linkedin.com/in/jason-ma-742224a2
  • Website: https://jasonma2016.github.io/
  • Photo: /wf26/speakers/by-id/spk_jason_ma.jpg
  • Sessions:

- Commercial Grade-Robots for Real World Usage — Day 3 — Session Day 2 11:40am-12:00pm

TBD — Dyna Robotics talk for Robotics & World Models track.

https://www.dyna.co/

Javier Garza

  • Role: Developer Advocate
  • Company: Snyk
  • Bio: Developer Advocate at Snyk, a cybersecurity company focused on securing code, open-source dependencies, and cloud infrastructure. Co-author of O'Reilly's Learning HTTP/2 and an experienced conference speaker on developer and security engineering topics.
  • Twitter: https://twitter.com/jjaviergarza
  • LinkedIn: https://www.linkedin.com/in/jjgarza
  • Photo: /wf26/speakers/by-id/spk_javier_garza.jpg
  • Sessions:

- AI Security Engineer Foundations + Certificate — Day 1 — Workshop Day 9:00am-11:00am

In each of the two sessions, we cover 6 modules and participants receive a certificate of completion at the end. The modules are: OWASP Top 10 for LLM, Addressing Shadow AI, AI Threat Modeling, Securing Agents & MCP, Securing Vibe Coding, & AI Red Teaming

Jay Mok

  • Company: PayPal
  • Photo: /wf26/speakers/by-id/spk_jay_mok.jpg
  • Sessions:

- Your Agent Just Authorized What?! — Day 4 — Session Day 3 2:50pm-3:10pm

The nightmare scenario writes itself: your agent just ran off with your credit card and maxed it out on concert tickets, crypto, and a questionable NFT collection. Relax — we're building the guardrails. When an agent acts on your behalf, three questions must always be answerable: Did the human authorize this? Did they authorize this, now, in this scope? And can we prove it later? This talk maps three permissioning layers onto a stakes ladder: OAuth scopes at the bottom (broad capability, weak per-action proof, fine when reversible), Claude Code's tool-scoped allow/ask/deny model in the middle (brilliant for developer tooling, but no cryptographic evidence), and signed payment mandates at the top — where FIDO's Agentic Payments Working Group is building toward cryptographically-bound, constraint-carrying credentials. We'll share artifacts from Agent to Agent payments using our Shared Vault and Oauth to our constraint carrying Approval token leveraging our pillars of Identity and Buyer and Seller protection. You leave with a stakes × evidence matrix and a mental model that applies beyond payments: medical orders, e-signatures, securities trading, activities where you want you want to be more careful with your agent.

Jean-Denis Greze

  • Role: Co-Founder & CEO
  • Company: Town
  • Bio: Jean-Denis Greze is co-founder and CEO of Town, a personal AI assistant that does real work for people inside the tools they already use - email, calendar, Slack, and more. Previously he was Chief Technology Officer at Plaid, where he led engineering through the company's hypergrowth, and before that a Director of Engineering at Dropbox. He also invests through ASDF Ventures. He's spoken and written widely on engineering leadership and building "spiky" organizations, and is based in San Francisco.
  • Twitter: https://x.com/jgreze
  • LinkedIn: https://www.linkedin.com/in/jeandenisgreze/
  • Website: https://greze.com/
  • Blog: https://www.greze.com
  • Photo: /wf26/speakers/by-id/spk_jean_denis_greze.jpg
  • Sessions:

- Agents' next frontier: agent-to-agent and network effects — Day 2 — Session Day 1 1:30pm-1:50pm

MCP v. CLI was about how agents talk to tools. That’s not settled (but we’re camp MCP… mostly). Almost nothing has settled how agents talk to each other - and that's where the next wave of value (and network effects and virality) lives. At Town we run a personal AI agent in production inside real people's inboxes, calendars, and Slack, and we've built agent-to-agent (A2A) on our platform: 1:1 A2A messaging, agents that carry a short bio of one another, HITL when sensitive data is shared or write actions are involved, and early tests around 1:N A2A. I’ll talk about the why, the opportunity, and the production architecture underneath. Audience takeaway: a concrete mental model for building multi-agent systems on top of the data and surfaces users already live in, plus our learnings on early failure modes to avoid.

Jeff Ng

  • Role: Engineer
  • Company: Unblocked
  • Bio: Jeffrey Ng is an engineer at Unblocked, a company building a context engine for software teams and AI agents.
  • Sessions:

- Building agents is trivial now, context is the next frontier — Day 2 — Session Day 1 2:25pm-2:45pm

Standing up an agent used to be the hard part. A new class of cloud-agent frameworks has made it almost trivial: in an afternoon you can ship a fleet that reasons, plans, and calls any API you point it at. So why do so many of them fail the moment they touch real work? Because a capable agent still doesn't know the organization it operates in: its decisions, history, incidents, and how a particular team actually operates. That knowledge isn't in the model or the API, and no amount of construction adds it.

This talk exposes the missing component, then shows how to build it live on a real workflow — the same move that helps a coding agent helps a support or operations one. Construction is solved. The missing context, tacit and tribal knowledge is the bottleneck that's left, and it sits upstream of everything verification attempts to catch after the fact.

Jeff Vestal

  • Role: Senior Principal AI Architect
  • Company: Elastic
  • Bio: Jeff Vestal is a Senior Principal AI Architect at Elastic. He works on search, retrieval, generative AI, RAG, and agent use cases with Elasticsearch, including hybrid search and AI systems built on Elastic.
  • LinkedIn: https://www.linkedin.com/in/jeffvestal
  • Blog: https://www.elastic.co/blog/author/jeff-vestal
  • Photo: /wf26/speakers/by-id/spk_jeff_vestal.jpg
  • Sessions:

- Vector Isn't Enough: Hybrid Search & Retrieval for AI Engineers — Day 1 — Workshop Day 2:20pm-4:20pm

If you build RAG, you reached for vector search first. This lab is about everything that happens after you realize embeddings alone don't cut it in production. You'll write real queries — semantic, lexical, and hybrid — feel exactly where each one fails, and walk out with a production-grade retrieval pipeline and the judgment to know which technique to reach for when.

What you'll actually do:

1. Dense vector search, and the mechanism behind it. Run semantic queries over a  semantic_text  field backed by Jina v5 embeddings — generated server-side, at query time, by the Elastic Inference Service (EIS). No embedding service to stand up, no client-side inference code. We open the hood on how query-time embedding actually works.

2. Break it. Throw adversarial queries at pure vector — exact error codes, version numbers (8.18 vs 9.0), precise config keys — and watch semantic similarity blur the exact match you needed. Then bring in BM25 lexical search to rescue it… and find the queries where keyword search whiffs. Each method is strongest exactly where the other is weakest.

3. Hybrid, properly. Fuse lexical + semantic with Elasticsearch retrievers. Learn the two fusion strategies that matter — Reciprocal Rank Fusion (RRF) and linear combination with score normalization — when to use each, and how to tune them. Optional: cross-encoder reranking with Jina Reranker v2.

4. Why this is the whole game for agents. Wire the hybrid retriever into a RAG flow and prove that retrieval quality, not the model, determines answer quality. Only synthesis truly needs the LLM - retrieve, rank, filter, and document-level security are database work done in milliseconds for a fraction of the cost. The contrarian takeaway: most of your RAG pipeline shouldn't be LLM calls at all.

Jeffrey Wang

  • Role: Co-founder
  • Company: Exa
  • Bio: Jeffrey Wang is the co-founder of Exa, an applied AI lab building a search engine for the age of AI. He studied CS at Harvard and previously worked at Plaid before leaving to tackle a problem he believed would become foundational: giving AI systems access to better information.

Over the past several years, Jeff has helped build Exa from the ground up, leading product, go-to-market, and company building alongside the development of Exa's search and retrieval infrastructure. Today, Exa powers search and research for leading AI companies and hundreds of thousands of developers.

Jeff believes that the intelligence of any AI system is ultimately limited by the quality of information it can access. His work at Exa is driven by a simple goal: build search that helps both AI and humans better understand the world.

  • Twitter: https://x.com/jeffzwang
  • LinkedIn: https://www.linkedin.com/in/wangzjeff/
  • Photo: /wf26/speakers/by-id/spk_jeff_wang.jpg
  • Sessions:

- Lessons From Building The World's Largest Knowledge Graph — Day 4 — Session Day 3 2:25pm-2:45pm

_Exa set out to index and embed the entire web as a queryable knowledge graph — the substrate behind neural search and the enrichment layer powering modern GTM data. Co-founder Jeffrey Wang shares the hard engineering lessons: crawling and embedding at web scale, keeping a graph fresh and trustworthy, and the retrieval architecture that lets agents pull grounded facts instead of hallucinations. Why the knowledge graph — not the model — is becoming the moat for AI-native GTM._

Jennifer Lee

  • Role: Product Lead, Machine Payments & Agentic Commerce
  • Company: Stripe
  • Bio: Jen Lee is the Product Lead for Stripe’s Machine Payments and Agentic Commerce, building the ecosystem that enables agents, people, and businesses to transact seamlessly with one another. Through Stripe’s Agentic Commerce Suite, she is making it easier for developers to build for and accept payments from agents. Jen has been at Stripe for nearly five years, where she previously launched and led Stripe’s crypto products. Across her work, she has focused on empowering users to benefit from emerging technologies by building new frontier products from 0 to 1.
  • Twitter: https://x.com/backseatvc
  • LinkedIn: https://www.linkedin.com/in/jennifer-lee-5175a18a/
  • Photo: /wf26/speakers/by-id/spk_jennifer_lee.jpg
  • Sessions:

- Building safe payment infrastructure for machine-to-machine commerce — Day 4 — Session Day 3 10:45am-11:05am

Agents are a new class of buyer, but the infrastructure for them to transact headlessly barely exists yet. This talk walks through what it actually takes to make a machine payment work: how an agent discovers what services exist, how HTTP 402 lets a server return a payment challenge the agent can settle without a human in the loop, and how the seller gets a receipt they can trust. Whether you are building an agent framework or adding machine payments to an API or MCP server, you will leave with concrete patterns for the headless commerce stack.

Jeremiah Lowin

  • Role: Founder & CEO
  • Company: Prefect
  • Bio: Jeremiah Lowin is the founder and CEO of Prefect and the creator of FastMCP. Prefect builds automation and orchestration tools used by teams working across data, AI, and software infrastructure. A former quantitative researcher, Jeremiah has spent his career designing systems that make complex work observable, dependable, and easier to reason about. Before founding Prefect, he led risk and data initiatives for major investment firms and was a founding member of the Apache Airflow PMC. He now advises companies like Spotify on technology strategy. His current work focuses on the infrastructure beneath AI applications: orchestration, tools, protocols, and the practical details that turn interesting demos into reliable software. Jeremiah holds bachelor's and master's degrees from Harvard University and lives in Washington, DC.
  • Twitter: https://x.com/jlowin
  • LinkedIn: https://www.linkedin.com/in/jlowin/
  • Website: https://jlowin.dev
  • Photo: /wf26/speakers/by-id/spk_jeremiah_lowin.jpg
  • Sessions:

- Generative UI... in Python? — Day 3 — Session Day 2 3:20pm-3:40pm

MCP Apps are a big deal: tools can now return dashboards, forms, and visualizations directly in the conversation. But somebody (or their agent) has to write those UIs. Fortunately, most of those UIs don't need to be designed from scratch; they can be composed from existing components. In that case, what you really need is a DSL that's token-efficient, streaming-compatible, and has a shallow learning curve. Surprisingly, the best one turns out to be... Python. In this talk, I'll introduce Prefab, a generative UI library that uses Python to compose fully interactive React applications from production components, now natively integrated into FastMCP. I'll demo real use cases, walk through the design, and show where this approach works and where it doesn't. No JavaScript will be harmed.

Jeremy Adams

  • Role: Tech Translator
  • Company: Neo4j
  • Bio: Jeremy Adams-Casañas is a technology communicator and developer-facing specialist associated with Neo4j, Dagger, GitHub, Twistlock, Puppet, and the U.S. Army, speaking on edge agents, NanoClaw, Raspberry Pi, and graph memory.
  • Photo: /wf26/speakers/by-id/spk_jeremy_adams.jpg
  • Sessions:

- Small Claws Are Beautiful: Edge Agents with NanoClaw, Raspberry Pi, and Graph Memory — Day 4 — Session Day 3 2:50pm-3:10pm

Jerry Liu

  • Role: CEO
  • Company: LlamaIndex
  • Bio: Jerry is the co-founder/CEO of LlamaIndex, a company that is building the document infrastructure for AI agents. Before this, he led the ML monitoring team at Robust Intelligence, did self-driving AI research at Uber ATG and worked on recommendation systems at Quora.
  • Twitter: https://x.com/jerryjliu0
  • LinkedIn: https://www.linkedin.com/in/jerry-liu-64390071/
  • Photo: /wf26/speakers/by-id/spk_jerry_liu.jpg
  • Sessions:

- Building the Document Context Layer for AI Agents — Day 2 — Session Day 1 11:10am-11:30am

AI agents are the new knowledge workers, but knowledge work depends on unstructured enterprise context. ~90% of that data lives in the form of document containers - from human-native (PDFs, Word, Pptx) to emerging agent-native formats (HTML, MD). Doing RAG in 2026 involves generalized agent harnesses with tools, MCPs, and skills. In this world, every company building agents needs a Document Context Layer, the bridge between their unstructured docs and the agents trying to reason over them. This talk covers what that layer looks like in practice: from document understanding, retrieval, and workflows, to areas yet to be explored — agent-native formats, versioning, editing, permissions, and longer-running agents.

Jess Wang

  • Sessions:

- Agentic vs. Vector Search: An Eval-Driven Approach to Coding Agent Performance — Day 2 — Session Day 1 11:40am-12:00pm

Evals let you replace gut feelings with quantifiable decisions. This talk breaks the basic concepts of evals, including the four core components: datasets, tasks, scoring, and experiments. Then, to solidify the concept, we’ll walk through a real eval comparing agentic search versus vector search for coding agents. We'll also cover practical challenges like tracing Claude Code subprocess calls and why a single eval run is never enough. You'll leave with a concrete framework for building evals that actually inform your ship decisions.

Jesse Hall

  • Role: Staff Developer Advocate
  • Company: Livekit
  • Bio: Jesse Hall is a Staff Developer Advocate at LiveKit and a full-stack developer who specializes in teaching TypeScript developers how to build AI-powered web applications using real-time communication technologies. He creates practical articles, videos, and interactive talks that break down complex voice AI and agent concepts into clear, production-ready takeaways.
  • Sessions:

- Latency Is a Budget. Humanlike Is the Goal. — Day 3 — Session Day 2 2:25pm-2:45pm

Most agents do their work in the background. They write code, automate tasks, and run research. But the moment an agent has to interact with a human in real time, everything you know about building and evaluating it changes. This session is about designing humanlike agents that can hear, see, and speak. It starts with the question nobody can answer today. With hundreds of models to choose from, how do you pick a stack that holds up in a live conversation? We'll show why public leaderboards fail for realtime agents, and why the latency on your dashboard isn't what your users experience. Then we'll flip the process around. Define the outcomes you want as human-equivalent behaviors, and work backwards from there to your evaluations, your models, and a production iteration loop. You'll leave with a concrete decision framework and an open benchmark you can run yourself.

Jesse Lumarie

  • Role: Software Engineer
  • Company: Figma
  • Bio: Jesse Lumarie is a software engineer building AI tools and integrations at Figma. Prior to Make, Jesse worked on growth initiatives and Figma's first MCP server. He lives in Boulder with his wife Jenna, and three kids Henry, Hayes and Robin.
  • Twitter: https://x.com/jesselumarie
  • LinkedIn: https://www.linkedin.com/in/jesselumarie/
  • Photo: /wf26/speakers/by-id/spk_jesse_lumarie.jpg
  • Sessions:

- Building the engine while flying the plane — launching the Figma MCP server — Day 2 — Session Day 1 11:10am-11:30am

What does it actually take to go from a vague idea to a production-ready AI system that people depend on? In this talk, I’ll walk through the real story of building Figma’s MCP server as a founding engineer whilst the MCP spec evolved—starting from early prototypes, through dead ends and architectural pivots, to launching both the initial product, creating new tools and eventually a fully remote server.

Jetashree Ravi

  • Role: Tech Lead Manager
  • Company: Fireworks AI
  • Bio: Jetashree Ravi is a Tech Lead Manager on Fireworks AI's Applied Machine Learning team, focused on LLMs, high-performance GPU inference, fine-tuning, developer platforms, and making production AI systems reliable at scale.
  • LinkedIn: https://www.linkedin.com/in/jetashree-ravi
  • Photo: /wf26/speakers/by-id/spk_jetashree_ravi.jpg
  • Sessions:

- Stop Renting Intelligence: The Train-to-Deploy Loop for Specialized AI — Day 3 — Session Day 2 3:45pm-4:05pm

The next wave of AI products will not rely only on prompting generic frontier models. Winning teams will own specialized models shaped by their product data, user feedback, and domain workflows.In this 18-minute session, we'll cover the practical loop behind model ownership: choose a base model, prepare data, fine-tune with SFT/DPO/RL, evaluate outputs, deploy the tuned model, collect feedback, and repeat. We'll also explain why training and inference should be treated as one system, not separate steps.Attendees will leave with a simple framework for when to tune, when RL matters, and how continuous improvement turns fine-tuning from a one-off project into a product advantage.

Jia Wu

  • Role: Deployed Engineering Lead
  • Company: Cognition AI
  • Bio: Team lead at Cognition AI. Deploying Devin to the world's leading enterprises.
  • LinkedIn: https://www.linkedin.com/in/jia-rong-wu/
  • Photo: /wf26/speakers/by-id/spk_jia_wu.jpg
  • Sessions:

- How Forward Deployed Engineering is done at Cognition — Day 2 — Session Day 1 12:05pm-12:25pm

Jim Clark

  • Role: Principal Software Engineer
  • Company: Docker
  • Bio: Jim Clark is a Principal Software Engineer at Docker working on Docker's MCP tooling and gateway. His recent work focuses on secure, controlled interfaces between AI agents, tools, and MCP servers.
  • Sessions:

- Who Approved That MCP Server? Governing the Tool Layer — Day 2 — Session Day 1 1:55pm-2:15pm

Your developers are installing MCP servers faster than security can review them. An unvetted server is a direct line to your data. This talk shows how the Docker MCP Gateway puts every server and tool behind one org-managed catalog: vetted, signed, default-deny on anything unapproved, governed by the same policy engine as network and filesystem. Walk away with a hands-on demo: stand up a catalog, block an unvetted server, and watch policy enforce at the runtime.

Jo Kristian Bergum

  • Role: CEO
  • Company: Hornet.dev
  • Bio: CEO Hornet.dev - building the retrieval engine for agents
  • Twitter: https://x.com/jobergum
  • LinkedIn: https://www.linkedin.com/in/jo-bergum
  • Website: https://hornet.dev/
  • Blog: https://hornet.dev/
  • Photo: /wf26/speakers/by-id/spk_jo_kristian_bergum.jpg
  • Sessions:

- The unreasonable effectiveness of BM25 for agentic search — Day 2 — Session Day 1 11:10am-11:30am

GPT-5 is shockingly good at search, and that changes the "BM25 as a baseline" story. Using GPT-5 search trajectories from BrowseComp-Plus, I'll show how default BM25 parameters and evaluation harnesses can make lexical retrieval look weak, while real agent queries often play directly to BM25's strengths. Much like grep became a core retrieval primitive for coding agents, BM25 is re-emerging as a powerful primitive for agentic search.

Joanne Song

  • Company: The New York Times Games
  • LinkedIn: https://www.linkedin.com/in/joanne-song
  • Photo: /wf26/speakers/by-id/spk_joanne_song.jpg
  • Sessions:

- On-Device Agentic AI for the New York Times Games — Day 4 — Session Day 3 2:50pm-3:10pm

Traditional mobile game architectures rely on static state machines and fixed behavioral trees. Under this model, gameplay and accessibility are treated as rigid, separate systems. This results in blunt difficulty toggles, predictable character loops, and reactive features that fail to address a player's actual context. Constraint-Centric Agentic Simulation (CCAS) offers a theoretical shift. By modeling the game world as a continuous, multi-agent negotiation, accessibility and challenge become part of a single, fluid continuum.

Using the JetBrains Koog framework on Android, this session explores the theory of running local agents on consumer mobile devices. We will discuss how principles of game theory, specifically dynamic negotiation and constraint satisfaction, can be used to build systems that reason over game states. Instead of executing pre-planned scripts, these agents dynamically alter their strategies. They negotiate environmental constraints to provide emergent challenges for high-skill players or organically smooth out cognitive and motor friction points for those requiring assistance.

Running these theoretical models on edge hardware requires overcoming significant practical hurdles. We will break down the architecture needed to support this continuous adaptation without relying on cloud computation. We will cover how to manage memory footprints, compress state histories for rapid backtracking, and schedule local planning loops so they integrate flawlessly with the rendering engine.

Joel Hooks

  • Role: Co-founder; software developer and developer-education entrepreneur
  • Company: badass.dev / egghead.io
  • Bio: Joel Hooks is a software developer and entrepreneur known for developer education. He co-founded egghead.io and badass.dev, and works on helping creators build high-quality developer education products.
  • Twitter: https://x.com/joelhooks
  • LinkedIn: https://www.linkedin.com/in/joelhooks
  • Website: https://joelhooks.com
  • Blog: https://badass.dev/the-process
  • Photo: /wf26/speakers/by-id/spk_joel_hooks.jpg
  • Sessions:

- The Art and Science of Loopcraft with Pi (and friends) — Day 1 — Workshop Day 4:30pm-5:30pm

This workshop helps agentic coding practitioners stop treating agents like pretend coworkers and start designing reliable, compounding loops. Using Pi as the concrete demo surface, Joel Hooks will show how loop state, handoffs, review, memory, and operator control become visible, while keeping the ideas portable to Claude, Codex, Cursor, and similar coding agents. Practitioners should leave able to identify loops inside their agent workflows, diagnose when failures need gates/evidence versus orchestration/memory/leverage, and understand how model-shaped lifecycles differ from traditional human SDLC rituals.

John Craft

  • Role: Solutions Engineer
  • Company: Docker
  • Bio: John Craft works in Sales Engineering/Solutions Engineering at Docker. His recent speaking context centers on governing MCP and moving from approval loops to autonomous agents with Docker.
  • Photo: /wf26/speakers/by-id/spk_john_craft.jpg
  • Sessions:

- From approval loops to autonomous agents with Docker — Day 1 — Workshop Day 12:10pm-1:10pm

"You've invested in the best models, coding agents, and AI tooling. Now comes the hard part: unlocking autonomous development without creating security headaches, governance gaps, or endless approval loops.

In this 90-minute hands-on workshop, you'll learn how to run coding agents in isolated environments built for autonomous work, create a 'golden path' for AI-assisted development across your organization, reduce software supply chain risk with secure, hardened containers, manage multiple agents with the right permissions and guardrails, and scale AI-powered development without slowing developers down."

- From approval loops to autonomous agents with Docker pt1 — Day 2 — Session Day 1 1:30pm-1:50pm

You've invested in the best models, coding agents, and AI tooling. Now comes the hard part: unlocking autonomous development without creating security headaches, governance gaps, or endless approval loops.

In this 90-minute hands-on workshop, you'll learn how to run coding agents in isolated environments built for autonomous work, create a 'golden path' for AI-assisted development across your organization, reduce software supply chain risk with secure, hardened containers, manage multiple agents with the right permissions and guardrails, and scale AI-powered development without slowing developers down.

- From approval loops to autonomous agents with Docker pt2 — Day 2 — Session Day 1 1:55pm-2:15pm

You've invested in the best models, coding agents, and AI tooling. Now comes the hard part: unlocking autonomous development without creating security headaches, governance gaps, or endless approval loops.

In this 90-minute hands-on workshop, you'll learn how to run coding agents in isolated environments built for autonomous work, create a 'golden path' for AI-assisted development across your organization, reduce software supply chain risk with secure, hardened containers, manage multiple agents with the right permissions and guardrails, and scale AI-powered development without slowing developers down.

- From approval loops to autonomous agents with Docker pt3 — Day 2 — Session Day 1 2:25pm-2:45pm

You've invested in the best models, coding agents, and AI tooling. Now comes the hard part: unlocking autonomous development without creating security headaches, governance gaps, or endless approval loops.

In this 90-minute hands-on workshop, you'll learn how to run coding agents in isolated environments built for autonomous work, create a 'golden path' for AI-assisted development across your organization, reduce software supply chain risk with secure, hardened containers, manage multiple agents with the right permissions and guardrails, and scale AI-powered development without slowing developers down.

- From approval loops to autonomous agents with Docker pt4 — Day 2 — Session Day 1 2:50pm-3:10pm

You've invested in the best models, coding agents, and AI tooling. Now comes the hard part: unlocking autonomous development without creating security headaches, governance gaps, or endless approval loops.

In this 90-minute hands-on workshop, you'll learn how to run coding agents in isolated environments built for autonomous work, create a 'golden path' for AI-assisted development across your organization, reduce software supply chain risk with secure, hardened containers, manage multiple agents with the right permissions and guardrails, and scale AI-powered development without slowing developers down.

- From approval loops to autonomous agents with Docker pt5 — Day 2 — Session Day 1 3:20pm-3:40pm

You've invested in the best models, coding agents, and AI tooling. Now comes the hard part: unlocking autonomous development without creating security headaches, governance gaps, or endless approval loops.

In this 90-minute hands-on workshop, you'll learn how to run coding agents in isolated environments built for autonomous work, create a 'golden path' for AI-assisted development across your organization, reduce software supply chain risk with secure, hardened containers, manage multiple agents with the right permissions and guardrails, and scale AI-powered development without slowing developers down.

John Lindquist

  • Role: Agentic Instructor
  • Company: egghead.io
  • Bio: John Lindquist currently teaches Codex "Power User" Workshops through https://egghead.io. As a seasoned software engineer, educator, and entrepreneur, he is best known as the co-founder of egghead.io – a premier online learning platform for web developers. With a career spanning two decades, John has built a reputation for creating innovative web applications and empowering countless developers through concise, high-impact video tutorials. He has led engineering teams for high-profile projects (from Disney entertainment sites to HBO's online platform) and traveled the globe as a Developer Evangelist, sharing his expertise at conferences and workshops. John's passion lies in demystifying complex technologies for others: he helped lead the JavaScript community into adopting framework-driven development with his early AngularJS tutorials, and to date he has produced over 500 programming videos covering everything from reactive programming and JavaScript fundamentals to the latest AI developer workflows and technologies. Collaborative and visionary, John continues to inspire developers worldwide by pioneering new ways to teach and embrace the latest development workflows.
  • Twitter: https://x.com/johnlindquist
  • Website: https://egghead.io
  • Blog: https://dev.build
  • Photo: /wf26/speakers/by-id/spk_john_lindquist.jpg
  • Sessions:

- The Agentic Power User's Playbook: Tips and Tricks for Swarm-Style Agentic Development — Day 3 — Session Day 2 1:30pm-1:50pm

You opened a fifth agent tab this morning and immediately lost track of which one was doing what. This workshop is the playbook I use daily to run swarms of agents in parallel: the keyboard shortcuts, layout patterns, supervision habits, and fast-model tricks that turn chaos into a control surface. We'll go hands-on: spawning a wall of agents across tiled panes, routing prompts to the right swarm with fast models, switching contexts in milliseconds, recovering when an agent goes off the rails, and building the muscle memory that separates a one-agent-at-a-time user from a true power user. By the end you'll leave with a stocked toolbelt of concrete shortcuts, repeatable patterns, and workspace habits you can drop into your own setup the same day. No cloud, no platform lock-in: every trick runs on the machine in front of you.

- The Agentic Power User's Playbook: Tips and Tricks for Swarm-Style Agentic Development (continued 2) — Day 3 — Session Day 2 1:55pm-2:15pm

You opened a fifth agent tab this morning and immediately lost track of which one was doing what. This workshop is the playbook I use daily to run swarms of agents in parallel: the keyboard shortcuts, layout patterns, supervision habits, and fast-model tricks that turn chaos into a control surface. We'll go hands-on: spawning a wall of agents across tiled panes, routing prompts to the right swarm with fast models, switching contexts in milliseconds, recovering when an agent goes off the rails, and building the muscle memory that separates a one-agent-at-a-time user from a true power user. By the end you'll leave with a stocked toolbelt of concrete shortcuts, repeatable patterns, and workspace habits you can drop into your own setup the same day. No cloud, no platform lock-in: every trick runs on the machine in front of you.

- The Agentic Power User's Playbook: Tips and Tricks for Swarm-Style Agentic Development (continued 3) — Day 3 — Session Day 2 2:25pm-2:45pm

You opened a fifth agent tab this morning and immediately lost track of which one was doing what. This workshop is the playbook I use daily to run swarms of agents in parallel: the keyboard shortcuts, layout patterns, supervision habits, and fast-model tricks that turn chaos into a control surface. We'll go hands-on: spawning a wall of agents across tiled panes, routing prompts to the right swarm with fast models, switching contexts in milliseconds, recovering when an agent goes off the rails, and building the muscle memory that separates a one-agent-at-a-time user from a true power user. By the end you'll leave with a stocked toolbelt of concrete shortcuts, repeatable patterns, and workspace habits you can drop into your own setup the same day. No cloud, no platform lock-in: every trick runs on the machine in front of you.

John McBride

  • Role: Co-Founder, CTO
  • Company: Paper Compute Co.
  • Bio: John McBride is an engineering leader, writer, and podcast host. He is Co-Founder, CTO at Paper Copmute Co. where he's heading up new AI and infrastructure development.

He has previously worked on MCP/AI gateway infra at Zuplo, AI infrastructure at the Linux Foundation, AI/ML community tooling at OpenSauced, Linux based operating systems at AWS, Kubernetes products at VMware, and the Cloud Foundry platform at Pivotal. He has years of experience building complex, distributed software systems in a number of languages and frameworks, has scaled huge AI infrastructure systems, and rallied organizations to adopt cutting edge technologies.

  • LinkedIn: https://www.linkedin.com/in/jpmcb/
  • Website: https://johncodes.com/
  • Photo: /wf26/speakers/by-id/spk_john_mcbride.jpg
  • Sessions:

- Don't Write Skills, Train Models — Day 3 — Session Day 2 2:50pm-3:10pm

Every AI agent call generates training data. Most teams throw it away. They write skills files instead. Text documents that describe how to do a task and hope the model follows them at inference time. Skills work until they don't. The model drifts, skips steps, hallucinates a shortcut. So you rewrite the skill, add more constraints, hope harder. There's a better path. If you've used a skill enough to know what good output looks like, you already have training data. You just aren't using it. This talk covers what I learned building an open source fine-tuning pipeline that turns agent session traces into SFT and DPO training datasets. A telemetry proxy captures every LLM call as a content-addressed Merkle DAG with zero instrumentation. Successful sessions become supervised fine-tuning data. Pair them against failures, matched by goal category, and you get preference pairs for DPO. No manual labeling. No synthetic data. But training data quality depends on environment consistency. If the same agent produces different results because of package drift, nondeterministic toolchains, or inconsistent system state, your training signal is noise. This is where NixOS changes the equation. A hardened, reproducible OS means every agent session runs against an identical, declarative environment. Nix controls the variables that sandboxing alone doesn't: dependency graphs, system libraries, toolchain versions. When you can guarantee the environment is the same across hundreds of sessions, the behavioral signal in your traces is actually trustworthy. We'll walk through the full pipeline. How to rebuild parent-hash chains from a SQLite database and join facet metadata. How to filter to fully_achieved sessions and truncate 82k-token conversations down to 4k-6k training examples using summary context plus the last three turns. How to match success/failure pairs by goal category and exclude unclear_requirements failures so DPO learns from real agent mistakes, not ambiguous prompts. How QLoRA keeps VRAM low enough to train a 7B model on a single consumer GPU. And what happens when you try DPO on 12GB VRAM (two simultaneous forward passes for logprob computation will teach you about gradient accumulation settings fast). The result: a LoRA adapter trained on your own agent traces, in a reproducible environment, on a single consumer GPU, for less than $2 in cloud compute. No YAML. One config file. All code is open source.

John Ousterhout

  • Role: Professor Emeritus
  • Company: Stanford University
  • Bio: John Ousterhout is the Bosack Lerner Professor of Computer Science, Emeritus at Stanford University. His prior positions include 14 years in industry, where he founded two companies (Scriptics and Electric Cloud), preceded by 14 years as Professor of Computer Science at U.C. Berkeley. He is author of the book "A Philosophy of Software Design", co-creator of the Raft consensus protocol, and creator of the Tcl scripting language and the Tk toolkit. He is a member of the National Academy of Engineering and has received numerous awards, including the ACM Software System Award, the ACM Grace Murray Hopper Award, and the U.C. Berkeley Distinguished Teaching Award.
  • Twitter: https://x.com/johnousterhout
  • Website: https://web.stanford.edu/~ouster/cgi-bin/home.php
  • Photo: /wf26/speakers/by-id/spk_john_ousterhout.jpg
  • Sessions:

- TCP and RDMA are Killing Inference Throughput; Homa can Fix It — Day 4 — Session Day 3 9:20am-9:40am

Modern AI inferencing is shifting from monolithic requests to complex agentic workflows and disaggregated KV stores. As a result, AI network traffic is no longer just very large transfers; tiny metadata requests are becoming more and more common, and their latency has a critical impact on throughput. Unfortunately, legacy transport protocols such as TCP and RDMA perform poorly on these workloads due to poor congestion control and head-of-line blocking. This talk will discuss the problems with TCP and RDMA and provide a brief introduction to the Homa transport protocol. Homa uses receiver-driven flow control and capitalizes on priority queues in network switches to reduce short-message latency by 10x for workloads like those in AI datacenters.

Jonathan Gordon

  • Role: Founder
  • Company: ReWeaver AI
  • Bio: JONATHAN GORDON is the Founder & CEO of ReWeaver AI, a platform that detects design-code drift at the point of generation in AI-assisted development. With nearly three decades of experience, he has shaped developer tools and enterprise software at Google, Apple, Microsoft, Oracle, and SAP. He holds two patents and specializes in human-centered design for complex systems, AI/ML integration, and developer tooling
  • Twitter: https://x.com/reweaver_ai
  • LinkedIn: https://www.linkedin.com/in/jongor/
  • Website: https://www.reweaver.ai
  • Blog: https://www.reweaver.ai
  • Photo: /wf26/speakers/by-id/spk_jonathan_gordon.jpg
  • Sessions:

- The Design-Code Roundtrip That Isn't — Day 3 — Session Day 2 11:40am-12:00pm

Everyone is using Figma's MCP tools, Claude Code, or Codex. The demos are seamless. The narrative is compelling. What's actually happening under the hood is something else entirely. And the gap between the story and the reality is where your next six months of pain is going to come from. I'm Jonathan Gordon, founder of ReWeaver AI and a programmer-turned-UX designer who spent 30 years in developer tools at Google, Microsoft, Apple, Facebook, and Oracle watching the design-engineering gap widen in slow motion. I've seen every generation of tooling promise to close it. I know exactly where the seams are. I wrote a technical teardown of what Figma's bidirectional workflow actually ships, what get_design_context does, what generate_figma_design actually captures (hint: it's a screenshot, not your design system), and why iterating through that loop 12 times leaves you progressively farther from your canonical design intent. This talk will walk attendees through each step, backed by research and specific examples, and include a demo showing how drift accumulates in real time. The problem is not that drift happens; it's that it's happening exponentially. Let's talk about how we can stem that tide and keep humans in control of the process, not just "in the loop."

Jonathan Kelley

  • Role: Founder
  • Company: Dioxus Labs
  • Bio: Jonathan Kelley is the founder of Dioxus Labs, which develops the Dioxus open-source Rust framework for building applications across web, desktop, and mobile.
  • LinkedIn: https://www.linkedin.com/in/jonathan-r-kelley
  • Website: https://jonathan-kelley.com
  • Photo: /wf26/speakers/by-id/spk_jonathan_kelley.jpg
  • Sessions:

- Building ambitious software — Day 3 — Session Day 2 3:45pm-4:05pm

TBD — Add final abstract after outreach/confirmation.

Joseph Nelson

  • Role: Cofounder, CEO
  • Company: Roboflow
  • Bio: Joseph is cofounder/CEO at Roboflow, the vision AI company. Roboflow makes infrastructure millions of AI engineers including half the Fortune 100 use to create and deploy vision models in the cloud and on the edge. They are the authors of RF-DETR, SOTA realtime instance segmentation and detection transformers. Roboflow's backed by investors like YC, GV (fmr Google Ventures), Greg Brockman, Jeff Dean, amongst others.
  • Twitter: https://x.com/josephofiowa
  • LinkedIn: https://www.linkedin.com/in/josephofiowa/
  • Photo: /wf26/speakers/by-id/spk_joseph_nelson.jpg
  • Sessions:

- The State of Vision — Day 2 — Session Day 1 10:45am-11:05am

- State of the Union: Why Local, Why Now — Day 4 — Session Day 3 10:45am-11:05am

Local AI has crossed from interesting to useful, driven by stronger open models, better hardware, and a maturing ecosystem for running intelligence outside the cloud. This panel explores what that shift unlocks for sovereignty, defense, regulated industries, privacy, cost, and resilience, and why open-source AI may be central to who benefits from the next wave of intelligence.

Moderator: Nader Khalil (NVIDIA). Panelists: Joseph Nelson (Roboflow), Alex Cheema (Exo Labs), Ahmad Osman (r/LocalLLaMA).

- State of the Union: Why Local, Why Now — Day 4 — Session Day 3 11:10am-11:30am

Local AI has crossed from interesting to useful, driven by stronger open models, better hardware, and a maturing ecosystem for running intelligence outside the cloud. This panel explores what that shift unlocks for sovereignty, defense, regulated industries, privacy, cost, and resilience, and why open-source AI may be central to who benefits from the next wave of intelligence.

Moderator: Nader Khalil (NVIDIA). Panelists: Joseph Nelson (Roboflow), Alex Cheema (Exo Labs), Ahmad Osman (r/LocalLLaMA).

Joseph Wang

  • Role: CEO
  • Company: Emulated
  • Bio: CEO of Emulated, building the data for fully autonomous AI
  • Website: https://emulated.so/
  • Photo: /wf26/speakers/by-id/spk_tbd_emulated_so.jpg
  • Sessions:

- Emulated: The data for fully autonomous software engineers and companies — Day 3 — Session Day 2 1:55pm-2:15pm

Hold for Emulated.so. Company builds reinforcement-learning environments that simulate real production systems for coding and infrastructure agents.

Josh Leavitt

  • Role: Sr. Director of AI & Data
  • Company: Coinbase
  • Bio: Josh Leavitt is Senior Director of Product Management at Coinbase, where he leads AI Platform strategy and innovation. Josh oversees core initiatives aimed at making AI more accessible and secure at Coinbase. Prior to joining Coinbase, he held leadership roles at Amazon Web Services. He is dedicated to building secure, accessible financial infrastructure, focusing on large-scale platform growth.
  • Twitter: https://x.com/Josh_Leavitt
  • LinkedIn: https://www.linkedin.com/in/josh-leavitt/
  • Website: https://www.coinbase.com/
  • Photo: /wf26/speakers/by-id/spk_josh_leavitt.jpg
  • Sessions:

- From Zero to AI-Native: Scaling AI Across the Org — Day 4 — Session Day 3 1:30pm-1:50pm

Most companies talk about being AI-native, but few show what it takes at scale. Josh Leavitt, Sr. Director of AI & Data at Coinbase, shares the hard-won playbook for transforming a high-stakes, regulated engineering organization into one where AI is a first-class citizen across every team. Josh can cover my approach towards building a centralized AI platform that serves thousands of engineers without becoming a bottleneck, tooling decisions that actually moved the needle, and what AI-native really means when shipping in a zero-tolerance-for-failure environment. Expect concrete frameworks, real examples, and honest lessons from what didn’t work.

Joshua Mo

  • Role: Lead DevRel Engineer
  • Company: Venice AI
  • Bio: Joshua Mo is a Lead DevRel Engineer at Venice.ai focused on private AI infrastructure, developer experience, integrations and Rust-based automation. He previously worked on Rust developer relations and open-source AI tooling.
  • LinkedIn: https://uk.linkedin.com/in/joshua-mo-4146aa220
  • Website: https://joshmo.ooo
  • Photo: /wf26/speakers/by-id/spk_joshua_mo.jpg
  • Sessions:

- Your Model is Private. Your System Isn't. — Day 4 — Session Day 3 1:30pm-1:50pm

Privacy in AI isn't just about choosing the right model. Data leaks rarely happen inside the LLM itself - they happen in the systems surrounding it. Observability pipelines, analytics platforms, prompts, agents, and infrastructure often become accidental channels for exposing user data. In this session, Joshua Mo, Lead DevRel Engineer at Venice AI, explores why private models alone are not enough and shares practical privacy-preserving patterns that AI engineers can adopt today. From revocable handles and hashed identifiers to agent boundaries and confidential computing, attendees will leave with concrete ideas for building AI systems that protect user data by design.

Joyce Zhang

  • Role: Dating Coach for Tech Founders
  • Company: Joyce Consulting Group
  • Bio: Joyce Zhang is a dating coach for tech founders and ambitious professionals, helping clients build exceptional relationships. She is a former Stripe and BCG operator and MIT alum, and writes Joyce's Dating Playbook.
  • Photo: /wf26/speakers/by-id/spk_joyce_zhang.jpg
  • Sessions:

- Human Connection in the Age of AI — Day 1 — Workshop Day 5:00pm-6:00pm

Building AI safely requires both technical skills and interpersonal skills. A live demo of connection tools from Stanford's "Touchy Feely" course, then hands-on practice. Co-hosted with Leaders in Tech.

Jue Wang

  • Role: Senior Staff Researcher
  • Company: Together AI
  • Bio: Jue Wang is a Senior Staff Researcher at Together AI working on efficient and cost-effective algorithms and systems for LLMs, after earning a Ph.D. in computer science from Zhejiang University.
  • Twitter: https://x.com/JueWANG26088228
  • Website: https://juewang.me
  • Photo: /wf26/speakers/by-id/spk_jue_wang.jpg
  • Sessions:

- Open-Source Inference Engineering for the Agentic Era — Day 1 — Workshop Day 9:00am-11:00am

Agentic coding workloads demand long contexts, multi-turn conversations, and throughput at a scale that most inference engines weren't built for. TokenSpeed is a new open-source engine purpose-built for this regime, built collaboratively by NVIDIA DevTech, AMD Triton, Qwen Inference, Together AI, and others. In this 2-hour hands-on workshop, Together Inference Research Engineers and a TokenSpeed co-creator will cover TokenSpeed architecture, deploying your first model, optimizing for agentic workloads, kernel and hardware tuning, and throughput/latency trade-offs.

Julian Bright

  • Role: Co-Founder
  • Company: Introspection
  • Bio: Julian Bright is a co-founder at Introspection, building infrastructure for agent autoresearch and agent systems with Roland Gavrilescu.
  • Photo: /wf26/speakers/by-id/spk_julian_bright.jpg
  • Sessions:

- Autoresearch in the wild — Day 3 — Session Day 2 3:20pm-3:40pm

We have reached model capability overhang. Models are now bottleneck by the systems built around them. In this session we discuss how the next generation of compound AI systems need to be designed for self-improvement, how to set up feedback loops that automate the continuous refinement of the end-to-end architecture.

Justin Joyce

  • Role: Principal Sales Operations and Strategy Manager
  • Company: Cloudflare
  • Bio: Principal Sales Operations and Strategy Manager automating and re-imaging Sales Operation in an Agentic World.
  • LinkedIn: https://www.linkedin.com/in/justin-j-22132912/
  • Website: https://www.cloudflare.com/
  • Photo: /wf26/speakers/by-id/spk_justin_joyce.jpg
  • Sessions:

- How AI Agents Let GTM Teams Scale — Day 4 — Session Day 3 2:50pm-3:10pm

How Cloudflare scaled GTM with AI agents that never touch raw data: a deterministic layer computes the numbers, agents write the narrative, and a multi-agent pipeline turns every segment into ranked signals. Justin Joyce on the build — and what skill curation and adoption actually take.

Justin Reock

  • Role: Deputy CTO
  • Company: DX
  • Bio: Justin Reock is Deputy CTO at DX, where he works on engineering intelligence and software-development productivity. His AI Engineer talks focus on trends in AI-assisted engineering across hundreds of organizations.
  • Photo: /wf26/speakers/by-id/spk_justin_reock.jpg
  • Sessions:

- AI-Assisted Engineering: 5 Trends We're Seeing From 500+ Organizations — Day 3 — Session Day 2 11:10am-11:30am

AI is reshaping how engineers work but what does that actually look like at scale? Drawing on data and patterns from more than 500 organizations, we break down the five most significant trends emerging in AI-assisted engineering today.

This fast-paced theater session cuts through the hype to deliver concrete, evidence-based insights that engineering leaders can act on immediately.

Key takeaways:

Discover the top 5 AI-assisted engineering trends observed across 500+ organizations

Understand how leading teams are integrating AI into their engineering workflows

Leave with actionable strategies to apply at your organization

- The state of AI in software development: Insights across 400+ organizations — Day 3 — Session Day 2 3:45pm-4:05pm

Headlines claim AI is transforming software engineering overnight. Across more than 400 engineering organizations, we see patterns that challenge the hype and reveal what's really working, and what isn't, when AI enters the software development lifecycle.

In this talk, Justin Reock, Deputy CTO at DX, will share a data-driven "state of the union" on AI in engineering, grounded in both quantitative data from thousands of developers and on-the-ground observations.

You'll learn:

The current impact of AI, from benchmarks on the percentage of code authored, team PR throughput, and time savings

Where AI adoption is creating real gains in throughput, and whether it introduces tradeoffs for quality and maintainability

Insights and trends, including whether junior or senior developers are seeing bigger gains, the impact of structured rollouts, which tools are having the most impact, and the evolving definition of "developer"

The session will conclude with a practical framework for measuring AI's impact, helping leaders cut through hype and understand the impact AI is having in their own organizations.

Justin Smith

  • Role: Founding Product Engineer
  • Company: Resolve AI
  • Photo: /wf26/speakers/by-id/spk_justin_smith.jpg
  • Sessions:

- Always-on agents run production without the on-call tax — Day 4 — Session Day 3 2:25pm-2:45pm

Most production teams have the same problem. The work that keeps systems healthy- deployment checks, on-call handoffs, anomaly reviews- never makes it into a sprint. It falls to whoever has bandwidth, gets done inconsistently, and disappears when people are stretched thin. Background agents fix this by running that work on a schedule, using the same production context a senior engineer would, without waiting for someone to initiate it. Justin Smith, Founding Engineer at Resolve AI, walks through the architecture behind always-on agents, the use cases teams are starting with today, and what we have learned from running them in our production environment.

Kamalakannan Nandagopal

  • Role: Staff Software Engineer
  • Company: Postman
  • Bio: Kamalakannan Nandagopal is a Staff Software Engineer at Postman working on client platform, performance, Node.js and agentic-AI related developer workflows, including Postman's git-native collection format.
  • Sessions:

- Beyond Code Generation: API Context for Agentic Engineering — Day 3 — Session Day 2 2:25pm-2:45pm

Maintaining production systems involves a lot more than generating code. APIs are the interfaces between systems and that surface gets out of control fast, as endpoints multiply and new consumers come online. Once an API is in use, changing it becomes incredibly hard. We felt this acutely at Postman. As our engineering organization scaled and we leaned more on AI agents for day-to-day work, we kept hitting the same wall: agents that could write code struggled with what came next who's calling this endpoint, what conventions does the rest of our API surface follow, what breaks if we change this contract. The context wasn't in the code, so the agent didn't have it. So we built an API context graph a continuously updated view of our entire internal API landscape and gave our agents access to it. This talk is about what changed in our own engineering as a result: how API design got faster and more consistent; how discovering and integrating with internal services stopped being detective work; how change requests came with a blast-radius report before any code shipped; how incidents got traced past the first stack trace, all the way down to root cause

Kanish Manuja

  • Role: Principal Software Engineer
  • Company: Twilio Inc.
  • Bio: Kanish Manuja is a principal AI engineer at Twilio, where he leads production LLM gateway and AI platform systems for enterprise-scale AI applications. His work focuses on building reliable, secure, and observable infrastructure for large language model adoption, including multi-tenant gateways, authentication and authorization, guardrails, audit logging, fallback strategies, and production readiness for GenAI workloads.

Kanish has worked across AI platform engineering, conversational intelligence, and distributed systems, helping teams move from experimentation to production-grade LLM deployments. He has led efforts around LLM reliability, governance, tenant isolation, provider abstraction, and operational controls for high-scale customer-facing systems.

In this session, Kanish will share practical lessons from designing and operating LLM gateway systems in production, including architectural tradeoffs, failure modes, platform boundaries, and what teams should consider before standardizing LLM access across an organization.

  • LinkedIn: https://www.linkedin.com/in/kanish-manuja-a99bb923/
  • Photo: /wf26/speakers/by-id/spk_kanish_manuja.jpg
  • Sessions:

- Productionizing LLM Gateways: Architecture, Tradeoffs, and Hard Lessons from the Trenches — Day 2 — Session Day 1 2:25pm-2:45pm

As organizations scale their use of large language models, the biggest challenge is no longer prompting, it’s productionizing. This session dives deep into building and operating an LLM gateway that sits between applications and model providers, handling routing, observability, cost control, reliability, and safety at scale. Drawing from real world experience, this talk breaks down the architecture of a production LLM gateway, including model abstraction layers, request orchestration, fallback strategies, caching, rate limiting, and evaluation pipelines. We’ll explore hard tradeoffs such as latency vs. cost, quality vs. determinism, and vendor lock-in vs. flexibility. Attendees will leave with concrete design patterns, failure modes to avoid, and a mental model for turning LLM experiments into resilient, scalable systems.

Karan Vaidya

  • Role: Co-founder
  • Company: Composio
  • Bio: Karan Vaidya is the co-founder and CTO of Composio, where he's building the agentic tool execution layer for AI agents. Composio gives agents managed authentication, just-in-time tool discovery, and a programmatic sandbox across a catalog of 1,000+ apps and 50,000+ tools — used by flagship agents at AWS, Zoom, Glean, HubSpot..., as well as prosumers building on Claude Code, Codex, and OpenClaw. His work centers on the infrastructure that makes agents reliable in production: an agentic pipeline that builds and self-heals tools, distills agent trajectories into reusable skills, and handles the messy edges of real-world auth and execution. He's based in San Francisco and is active in the local founder community.
  • Twitter: https://x.com/KaranVaidya6
  • LinkedIn: https://www.linkedin.com/in/kaavee315/
  • Website: https://kvaidya.com/
  • Blog: https://kvaidya.com/
  • Photo: /wf26/speakers/by-id/spk_karan_vaidya.jpg
  • Sessions:

- From coding to Knowledge work agents — Day 2 — Session Day 1 2:25pm-2:45pm

MCP, skills, Cli - so much noise - what’s the best way for agents to communicate

Karthik Ranganathan

  • Role: Co-founder and Co-CEO
  • Company: Yugabyte
  • Bio: Karthik Ranganathan is co-founder and co-CEO of Yugabyte, the company behind YugabyteDB. He was one of the original database engineers at Facebook/Meta working on distributed databases including Cassandra and HBase.
  • Twitter: https://x.com/karthikr
  • Photo: /wf26/speakers/by-id/spk_karthik_ranganathan.jpg
  • Sessions:

- Agent Memory Is a Solved Problem. Agent Learning Is Not. — Day 4 — Session Day 3 3:20pm-3:40pm

The failures that break multi-agent systems are not reasoning failures, they are handoff failures. One agent works something out and the knowledge dies in its private context, because the only thing that crosses the boundary is output. Memory made each agent better in isolation and changed nothing about what the group knows. The missing primitive is supervised promotion: a deliberate decision about which private learning is worth sharing, moved into common knowledge with the reasoning attached, so trust survives the handoff. Today a human makes that call, and promoted knowledge resolves on read, in any tool, with no retrain or reindex. Those calls are also the training signal for what comes next: orchestrator agents, trained on what matters to the people they serve, that promote on their own. This talk covers how our collective knowledge grew as we approached memory promotion, including what the first build got wrong, and a live look at it working between humans and agents.

Katelyn Lesse

  • Role: Head of Engineering, Claude Platform
  • Company: Anthropic
  • Bio: Katelyn Lesse is the Head of Platform Engineering at Anthropic. She leads engineering for the Claude Platform, including APIs and developer tooling, as well as Anthropic’s product infrastructure. She has experience building and scaling engineering organizations across fintech and developer platforms, with previous leadership roles at Stripe and Betterment.
  • Twitter: https://x.com/katelyn_lesse
  • LinkedIn: https://www.linkedin.com/in/katelynlesse/
  • Photo: /wf26/speakers/by-id/spk_katelyn_lesse.jpg
  • Sessions:

- Tokens Should Have Jobs — Day 4 — Session Day 3 10:45am-11:05am

Kay Malcolm

  • Role: Vice President of Product Management, Oracle AI Database
  • Company: Oracle
  • Bio: Kay Malcolm is Oracle's Vice President of Product Management for Oracle AI Database. She leads outbound product managers focused on AI and data strategy, customer programs, technical storytelling, and helping customers get value from Oracle technology.
  • LinkedIn: https://www.linkedin.com/in/kaymalcolm
  • Blog: https://blogs.oracle.com/authors/kay-malcolm
  • Photo: /wf26/speakers/by-id/spk_kay_malcolm.jpg
  • Sessions:

- No Memory, No Harness: Why the Database Is the Last Line of Defense — Day 4 — Session Day 3 2:50pm-3:10pm

The model is the easy part. Everything that makes an agent survive contact with production lives in the harness around it: orchestration, tooling, governance, and the memory core that keeps the system grounded when the model itself is probabilistic, forgetful, and non-deterministic. This talk walks the surface areas of an agent harness and consolidates the lessons we're learning as we ship them, from agentic applications in their current form (autonomous systems that now build their own automations) to the continual-learning loops that let agents improve from their own experience. We'll look at how the discipline is segmenting. AI application development is no longer one role but several: agent engineers, memory engineers, and platform engineers. We'll map Oracle's primitives onto each as the current state of harness engineering takes shape. We'll also examine the two populations betting on this stack at once, enterprise customers who need governance, reliability, and scale, alongside the cracked developers who need fast, composable primitives, and why a well-engineered harness serves both. And we'll make the case that has held through every shift in the stack: memory isn't a feature you bolt on, it's the foundation the rest of the harness stands on. The database remains the memory core, and when everything above it is probabilistic, it's the last line of defense.

Keegan McCallum

  • Role: Founder
  • Company: uRun
  • Bio: Founder of uRun, build the inference cloud for the interactive era of AI. Formerly Head of ML Infrastructure at Luma, built Video model serving for Dream Machine from scratch after pivot to generative video. Obsessed with the intersection of creativity and technology.
  • Twitter: https://x.com/keeganmccallum3
  • LinkedIn: https://linkedin.com/in/keeganmccallum3
  • Website: https://urun.sh
  • Blog: https://urun.sh
  • Photo: /wf26/speakers/by-id/spk_keegan_mccallum.jpg
  • Sessions:

- Generative Video at the Speed of Light — Day 4 — Session Day 3 2:25pm-2:45pm

Discussing recent breakthroughs in realtime generative video models, and the new architectural problems and bottlenecks involved in creating immersive, interactive experiences on top of these models.

Keiji Kanazawa

  • Role: Principal Product Manager
  • Company: Microsoft
  • Bio: Keiji Kanazawa (@gojira) is a Product Manager in Microsoft Foundry, working on AI inference for Anthropic and OpenAI models. He has a Ph.D. in AI and a career spanning research → engineering → product. Angel investor in YC and AI/ML companies.
  • Twitter: https://x.com/gojira
  • LinkedIn: https://linkedin.com/in/keijikanazawa
  • Photo: /wf26/speakers/by-id/spk_keiji_kanazawa.jpg
  • Sessions:

- From framework to runtime: running agents with Foundry Agent Service — Day 3 — Session Day 2 10:45am-11:05am

See how agents move from frameworks into production systems. Learn how Foundry Agent Service provides hosted execution, scaling, and lifecycle management—combining models, tools, and orchestration into a production-ready runtime.

- I Let Agents Refactor My Codebase for 3 Weeks. Then I Read the Code. — Day 3 — Session Day 2 2:25pm-2:45pm

Lopopolo says code is a liability. Zechner got a standing ovation for "read every fucking line." I was firmly at L — letting coding agents drive a refactoring for weeks, rubber-stamping PRs, trusting the vibes. Then I actually read what they'd built and couldn't explain my own system's contracts. The interfaces weren't wrong. They were plausible. Which is worse. So I took the wheel back. But this isn't a Zechner victory lap — I'm now building better specs and evals specifically so I can move back toward L with confidence. This talk is the honest, in-progress round trip, and a framework for finding where you should sit on the spectrum today.

Kenny Workman

  • Role: CTO
  • Company: LatchBio
  • Bio: Co-Founder + CTO at LatchBio. Engineering benchmarks and agents for practical tasks in biology research.
  • Twitter: https://x.com/kenbwork
  • LinkedIn: https://www.linkedin.com/in/kennyworkman
  • Website: https://kenbw.com/
  • Blog: https://blog.latch.bio/p/the-latch-sdk
  • Photo: /wf26/speakers/by-id/spk_tbd_latchbio.jpg
  • Sessions:

- LatchBio — Day 3 — Session Day 2 2:25pm-2:45pm

Hold for LatchBio. AI-powered biotech platform for biological data infrastructure and multi-omics analysis; user requested inclusion among new AI startups.

Kent C. Dodds

  • Role: Software Engineer and Educator
  • Company: EpicProduct.engineer
  • Bio: Kent C. Dodds is a world renowned web development educator and engineer. He's actively involved in the open source community. He is the creator of EpicProduct.engineer, EpicWeb.dev, EpicAI.pro, EpicReact.dev, and TestingJavaScript.com. He's a Microsoft MVP, GitHub Star, instructor on egghead.io and Frontend Masters, live streamer, and podcaster. Kent is married and the father of six kids and he lives in Utah.
  • Twitter: https://x.com/kentcdodds
  • LinkedIn: https://www.linkedin.com/in/kentcdodds/
  • Website: https://kentcdodds.com
  • Blog: https://kentcdodds.com/blog
  • Photo: /wf26/speakers/by-id/spk_kent_c_dodds.jpg
  • Sessions:

- Build the Right Thing: Product Engineering for Software Developers (Part 1) — Day 1 — Workshop Day 12:10pm-1:10pm

There is nothing quite as demoralizing as finishing a feature and realizing you built the wrong thing. The code is clean. The tests pass. The ticket is closed. And none of it matters. This is happening more often, not less. AI makes it faster and cheaper to implement, which means teams can now waste entire sprints on the wrong idea at unprecedented speed. The bottleneck is no longer "can we build it?" It is "should we build it?" and "are we sure we understand the problem?" This session is a condensed introduction to product engineering for builders: the skills that sit upstream and downstream of implementation. We will not try to cover everything a full-day workshop would. Instead, we will focus on the highest-leverage ideas you can apply on Monday. ### What we'll cover 1. Validate before you build Most wrong builds start with an idea that was never tested. You will learn to separate real user pain from solution-shaped requests, and practice discovery questions that surface past behavior instead of hypothetical enthusiasm. 2. Prioritize what deserves to exist Not every good idea should be built now. Especially in the AI era, "we could build this" is not a reason to build it. We will work through a practical prioritization lens, including the Kano model, to help you distinguish fundamentals from delighters from distractions before your team commits. 3. Own the feature, not just the PR Product engineering does not end at merge. You will leave with a clearer picture of end-to-end feature ownership: staying close to users, setting up simple feedback loops, and improving what you shipped instead of moving on to the next ticket. ### Format This is a 2–3 hour session with Kent C. Dodds. Expect focused teaching, real-world examples, and short interactive exercises and discussion. This is not a full simulation lab or a ticket-closing coding workshop. It is judgment practice for engineers who already know how to ship. ### Who this is for Software engineers (and technical builders generally) who: - Have shipped something polished that nobody wanted - Feel pressure to move fast with AI and want a better filter for what deserves to exist - Want stronger product instincts without becoming a PM - Care about owning outcomes, not just closing tasks Some software engineering experience is assumed. No particular stack is required. PMs and designers often find this valuable too. ### What you'll leave with - Discovery questions for ambiguous work - A prioritization lens you can use before committing to a build - A clearer model for feature ownership and post-ship feedback loops - Language for stakeholder conversations when requirements are unclear

- Build the Right Thing: Product Engineering for Software Developers — Part 2 — Day 1 — Workshop Day 1:15pm-2:15pm

There is nothing quite as demoralizing as finishing a feature and realizing you built the wrong thing. The code is clean. The tests pass. The ticket is closed. And none of it matters. This is happening more often, not less. AI makes it faster and cheaper to implement, which means teams can now waste entire sprints on the wrong idea at unprecedented speed. The bottleneck is no longer "can we build it?" It is "should we build it?" and "are we sure we understand the problem?" This session is a condensed introduction to product engineering for builders: the skills that sit upstream and downstream of implementation. We will not try to cover everything a full-day workshop would. Instead, we will focus on the highest-leverage ideas you can apply on Monday. ### What we'll cover 1. Validate before you build Most wrong builds start with an idea that was never tested. You will learn to separate real user pain from solution-shaped requests, and practice discovery questions that surface past behavior instead of hypothetical enthusiasm. 2. Prioritize what deserves to exist Not every good idea should be built now. Especially in the AI era, "we could build this" is not a reason to build it. We will work through a practical prioritization lens, including the Kano model, to help you distinguish fundamentals from delighters from distractions before your team commits. 3. Own the feature, not just the PR Product engineering does not end at merge. You will leave with a clearer picture of end-to-end feature ownership: staying close to users, setting up simple feedback loops, and improving what you shipped instead of moving on to the next ticket. ### Format This is a 2–3 hour session with Kent C. Dodds. Expect focused teaching, real-world examples, and short interactive exercises and discussion. This is not a full simulation lab or a ticket-closing coding workshop. It is judgment practice for engineers who already know how to ship. ### Who this is for Software engineers (and technical builders generally) who: - Have shipped something polished that nobody wanted - Feel pressure to move fast with AI and want a better filter for what deserves to exist - Want stronger product instincts without becoming a PM - Care about owning outcomes, not just closing tasks Some software engineering experience is assumed. No particular stack is required. PMs and designers often find this valuable too. ### What you'll leave with - Discovery questions for ambiguous work - A prioritization lens you can use before committing to a build - A clearer model for feature ownership and post-ship feedback loops - Language for stakeholder conversations when requirements are unclear

Kenton Varda

  • Role: Principal Engineer
  • Company: Cloudflare
  • Bio: Lead engineer for the Cloudflare Workers serverless platform, a project he started in 2017. Previously co-founder of Sandstorm.io. Created Cap'n Proto and Cap'n Web. Built lanparty.house. Coined the term "Code Mode".
  • Twitter: https://x.com/KentonVarda
  • Website: https://lanparty.house
  • Blog: https://lanparty.house
  • Photo: /wf26/speakers/by-id/spk_kenton_varda.jpg
  • Sessions:

- Gadgets: Personal app vibe coding that is actually safe — Day 2 — Session Day 1 3:45pm-4:05pm

We are entering the end game of Kenton's 15-year master plan. The architect of Cloudflare Workers, Durable Objects, Cap'n Proto, and Sandstorm.io, and the guy who coined the term "Code Mode", will demo Gadgets, an AI productivity suite which ties all these ideas together. We've all heard that the future is micro-apps customized for every niche, but how do we actually make that usable, how do we make it scale, and most importantly, how do we make it safe for even non-developers to use? Kenton will show how Gadgets solves these problems, including a sandbox design that makes it essentially impossible for apps to have vulnerabilities at all.

Kevin Bai

  • Role: Member of Technical Staff
  • Company: Anthropic
  • Bio: Kevin Bai works in applied AI at Anthropic and has a background in forward deployed engineering, Palantir, Rippling, the United Nations, and international relations.
  • Photo: /wf26/speakers/by-id/spk_kevin_bai.jpg
  • Sessions:

- Forward Deployed Engineering 101 — Day 2 — Session Day 1 2:50pm-3:10pm

Kevin Hou

  • Role: Engineering Lead @ Antigravity
  • Company: Google DeepMind
  • Bio: Kevin leads product engineering for Antigravity, Google DeepMind’s agentic IDE. He has spent much of his career in AI, previously Head of Product Engineering at Windsurf and a Tech Lead Manager at Nuro, an autonomous vehicle startup. Kevin enjoys photography, playing basketball, cycling, and woodworking. He studied computer science & ML at Princeton University.
  • Twitter: https://x.com/kevinhou22
  • LinkedIn: https://www.linkedin.com/in/kevinhou22
  • Website: https://khou22.com
  • Photo: /wf26/speakers/by-id/spk_kevin_hou.jpg
  • Sessions:

- Get Out of the Model's Way — Day 2 — Session Day 1 1:30pm-1:50pm

From autocomplete to chat to agents to agent orchestration...how do you build a product that scales with intelligence? What core primitives enable agents to operate at the technical (and non-technical) frontier? How can you best squeeze every ounce of capability out of your agentic dev tools? I'll answer all these questions and break down how Google Antigravity creates dynamic agent teams to solve complex tasks like building an OS-Kernal and automating research workflows.

Kevin Madura

  • Role: Director, Advanced Technology
  • Company: AlixPartners
  • Bio: Building real-world AI solutions for enterprise clients using DSPy, RLMs, and agent-native architectures. Technologist & expert witness @ AlixPartners.
  • Twitter: https://x.com/kmad
  • LinkedIn: https://www.linkedin.com/in/kevinmadura/
  • Website: https://kmad.ai
  • Blog: https://kmad.ai
  • Photo: /wf26/speakers/by-id/spk_kevin_madura.jpg
  • Sessions:

- It’s Tokens All The Way Down: How RLMs are Different — Day 3 — Session Day 2 11:10am-11:30am

Recursive Language Models represent an intuitive but distinctively important approach to how LLMs handle context. The practical implications are bigger than they first appear. Tasks that would traditionally require careful prompt engineering, custom agent scaffolding, or multi-step orchestration collapse into surprisingly simple, composable programs. In this talk, we’ll cover what makes an RLM distinct from a coding agent, explore where the abstraction shines and where it breaks down, and walk through concrete use cases that are informed by real-world situations at scale. We’ll see side-by-side comparisons to understand trade-offs in complexity, performance, time, and token usage.

Kevin Orellana

  • Role: Software Engineer
  • Company: Amazon Web Services
  • Bio: Kevin Orellana is a software engineer at AWS, where he builds the sandboxed code-execution and browser-automation infrastructure that lets AI agents — including coding agents — run code and drive websites safely at scale. He previously worked on Amazon Bedrock's model-serving platform, where he tech-led the launch of Anthropic's Claude Sonnet on Bedrock and helped design the infrastructure behind several frontier-model launches.
  • Twitter: https://x.com/KevssOrellana
  • LinkedIn: https://www.linkedin.com/in/kevinorellana/
  • Photo: /wf26/speakers/by-id/spk_kevin_orellana.jpg
  • Sessions:

- 1,000 Agent Tasks in a Sandbox: What Breaks When LLMs Write and Run Code — Day 3 — Session Day 2 2:25pm-2:45pm

We ran 1,000 automated tasks through a production code interpreter sandbox — file I/O, package installs, data analysis, ML training, binary downloads, multi-language execution — and tracked every failure. 88% passed. The other 12% revealed 18 distinct failure modes that no unit test would catch: binary encoding corruption in the transport layer, null bytes silently truncating file downloads, pip blocked by network isolation with no useful error, and path traversal inputs accepted without validation. This talk walks through the experiment design, the findings ranked by severity, and what we changed. If you are building or operating sandboxed execution for AI agents, these are the bugs waiting for your customers to find first.

Khaled Alashmouny

  • Role: Founder & CEO
  • Company: AIDAChip
  • Bio: Khaled Alashmouny is the founder and CEO of AIDAChip, where he is building Multiplayer AI systems for semiconductor engineering teams. His work focuses on a core idea: as AI accelerates individual execution, alignment becomes the dominant bottleneck. AIDAChip tackles this by creating AI teammates that coordinate intent, knowledge, and execution across engineers, tools, and organizational boundaries. Before founding AIDAChip, Khaled spent 20 years in semiconductor engineering, including 13 years leading analog/mixed-signal design at Apple. He designed circuits shipped in products used by hundreds of millions of people and saw firsthand how even world-class teams lose enormous time to fragmented knowledge and coordination overhead. Khaled holds 7 patents, published 9 IEEE papers, and earned his PhD in Electrical Engineering from the University of Michigan. His work sits at the intersection of AI, semiconductor engineering, neuroscience, and organizational systems.
  • LinkedIn: https://www.linkedin.com/in/khaledalashmouny/
  • Website: https://aidachip.com
  • Photo: /wf26/speakers/by-id/spk_khaled_alashmouny.jpg
  • Sessions:

- What If Your Chip Design Team Moved Like a Single Body? — Day 4 — Session Day 3 11:40am-12:00pm

Most agentic demos you've seen has a hidden assumption: one user, one session, one task. But what happens when the agent needs to coordinate with 30 other agents, across 10 disciplines, on a project that takes 12 months — where a single miscommunication costs $10-50M? Chip design is that problem. Only 14% of chips succeed on first silicon. The bottleneck isn't individual engineer speed — it's silent divergence between disciplines working from specs that drift without noticing. We built a multiplayer AI on the Anthropic Agent SDK, connected through three alignment layers: a living spec graph (System of Intent) that propagates changes and detects conflicts in real time, a tribal knowledge layer (Memory) that compounds methodology across projects, and milestone-aware execution that drives EDA tools with full design context. Each agent operates within strict spec-hierarchy boundaries enforced at the API level. Cross-agent invocations use structured tool calls with typed parameters, logged for full auditability. We talked with 15 practitioners across 8 major semiconductor and EDA companies. The universal finding: teams need alignment infrastructure, not faster copilots. We'll also share what broke — because coordination tax applies to AI agents too, and the failure modes are surprisingly instructive. This talk covers the multi-agent architecture, evaluation methodology, and lessons from deploying agentic AI in one of engineering's most complex coordination domains.

Kieran Klaassen

  • Role: GM of Cora / Compound Engineering
  • Company: Every/Cora
  • Bio: Kieran Klaassen is GM of Cora, Every's AI email assistant, shipped solo without writing code by hand. He is the grandfather of compound engineering: AI agents plan, write, review, and test every change; each fix becomes a learning the system reuses, so every unit of work makes the next easier.
  • Twitter: https://x.com/kieranklaassen
  • LinkedIn: https://www.linkedin.com/in/kieran-klaassen/
  • Website: https://cora.computer
  • Blog: https://x.com/kieranklaassen
  • Photo: /wf26/speakers/by-id/spk_kieran_klaassen.jpg
  • Sessions:

- The Era of Compound Engineering — Day 2 — Session Day 1 2:25pm-2:45pm

Most codebases get harder to work with every year. Yours doesn't have to. Compound Engineering is a philosophy where each unit of work – every bug fix, every feature, every code review – makes the next one easier. This talk is about how that shift changes everything: from how fast you ship to how many engineers you actually need. --- At Every, we run five products with single-person engineering teams. That's not a headcount accident – it's a system. When I built Cora, I wanted to find out how much one engineer could do with the right AI workflows. The answer became the Compound Engineering philosophy, now with 17k stars on GitHub. Traditional codebases accumulate complexity. Compound codebases accumulate capability. Bug fixes eliminate entire categories of future bugs. Patterns become tools. Over time, the codebase gets easier to understand, easier to modify, and easier to trust. You'll walk away with: - The mental model behind compound engineering - Concrete patterns for making every PR compound - How to scale output without scaling headcount

Killian Carlsen-Phelan

  • Role: Developer Content Engineer
  • Company: Sonar
  • Bio: Killian Carlsen-Phelan is a Developer Content Engineer at Sonar. He writes and teaches about SonarQube, AI-generated code review, Codex CLI integration, and quality gates for agentic development workflows.
  • Photo: /wf26/speakers/by-id/spk_killian_carlsen_phelan.jpg
  • Sessions:

- SonarQube + OpenAI: Wiring Your Team for Agentic Development — Day 1 — Workshop Day 1:15pm-2:15pm

As AI agents take on increasingly complex development tasks, the critical challenge has shifted from generation to verification. A growing body of evidence suggests that as models grow more capable, failures become more frequent and more convincing, making cognitive surrender among human reviewers an acute risk. This talk introduces Sonar's Agent Centric Development Cycle (AC/DC), a three-stage continuous loop of Guide, Verify, and Solve, as the engineering discipline teams need to build now. Teams that embrace AC/DC guide agents within their organizational standards before they write a line of code, verify output in real-time, and solve issues automatically without manual triage. This session will also feature a live demo of the SonarQube OpenAI plugin, showing how a well-guided agent produces code that is faster to verify and cheaper to fix.

Kim Maida

  • Role: Founding GTM Engineer
  • Company: Keycard
  • Bio: Kim is the Head of Developer Relations and Founding GTM Engineer working on security for agents at Keycard. Kim's career is rooted in Identity, software engineering, and developer experience strategy. She enjoys teaching, mentoring, and learning from folks in the product technology space and loves to travel, overland, design stickers, and craft miniatures and artisan keycaps.
  • Twitter: https://x.com/kimmaida
  • LinkedIn: https://linkedin.com/in/kimmaida
  • Website: https://maida.kim
  • Photo: /wf26/speakers/by-id/spk_kim_maida.jpg
  • Sessions:

- It's 10pm. Do You Know Where Your Agents Are? — Day 2 — Session Day 1 2:50pm-3:10pm

Agents right now can sign legal contracts, run untethered, manage your dating profile, conduct financial transactions, and push code to production. Most agents have long-lived API keys and are dangerously overprivileged even when they're not making requests. In this talk, I'll demo how to solve the problem with the right access at the right time. You'll walk away knowing how to control agent access whether you're running coding agents from the CLI, building MCP servers, or connecting agents to third-party APIs.

Krishna Prasad Srinivasan

  • Role: Head of Vision Models
  • Company: Sarvam
  • Bio: Krishna Prasad Srinivasan is a Head of Vision Models at Sarvam, where he led a lean team to train Sarvam Vision, India's first sovereign VLM: a 3B state-space model that topped global OCR benchmarks at launch and led the Indic OCR Bench across 22 languages. He now leads the vision vertical's models, research, and product. Previously, he was Tech Lead for AI at Microsoft Research, where he built multilingual copilots for education and developed Indic translation models that outperformed commercial systems. Before that, Krishna was a researcher at Harvard, where he engineered a novel OCR architecture using contrastive learning that outperformed industry benchmarks on complex multilingual documents.
  • Twitter: https://x.com/fewshotlearner
  • LinkedIn: https://www.linkedin.com/in/krishnapsrinivasan/
  • Photo: /wf26/speakers/by-id/spk_krishna_srinivasan.jpg
  • Sessions:

- From Scratch to SOTA: Training a 3B State-Space Vision Model for 1.4 Billion People — Day 2 — Session Day 1 3:20pm-3:40pm

India has 22 official languages. Across those languages live over a billion people whose knowledge is locked inside scanned images in scripts that most frontier models perform poorly. The problem is dire - until now, there wasn't even a comprehensive benchmark to measure Indic OCR performance, let alone training data at scale. When Sarvam AI set out to solve this, we had to build the infrastructure before the model, creating the first ground-truth benchmark for Indic document intelligence. In this talk, Krishna Srinivasan, who led the Vision Models team to build India's first sovereign VLM from scratch, will walk through the end-to-end engineering lifecycle. We will cover: (a) Architecture: Why we chose a 3B-parameter state-space architecture over transformer baselines to handle high-resolution visual inputs with minimal memory overhead and faster inference. (b) Training Pipeline: The exact recipe we used: starting with text-only pre-training, moving to continual pre-training with text and images, followed by SFT. Finally, we'll cover the advances we made in implementing large-scale RL with Verifiable Rewards for visual tasks in just 3 days using deterministic character-level reward signals. (c) Compute Efficiency: How we trained a frontier-competitive multimodal model with extreme capital efficiency, optimizing distributed training and GPU cluster management to punch far above our compute class. (d) Agentic Workflows: How this model powers Sarvam Akshar, a first-of-its-kind agentic document intelligence workbench featuring visual grounding and automated proofreading loops. The results speak for themselves: Sarvam Vision achieves best-in-class global scores (84.3% on olmOCR-Bench, 93.28% on OmniDocBench) and dominates Indic OCR. Attendees will learn the blueprint for compute-efficient multimodal training, and deploying state-space VLMs for population-scale enterprise workloads.

Kunal Lanjewar

  • Role: Staff Engineer
  • Company: Riot Games
  • Bio: Kunal Lanjewar runs tier-zero infrastructure at Riot Games, where he builds and operates production AI agents and backend services that power games like VALORANT and League of Legends. He's the author of Guild, an open-source tool that gives AI agents persistent memory and task coordination across sessions. Previously, he helped scale Sky: Children of the Light to 300M+ downloads and millions of daily active players, and built the backend for its Guinness World Record-holding Aurora concert. His work has been featured at GDC, DataCon LA, and on the MongoDB Podcast. Earlier in his career he also built systems for NASA missions.
  • Twitter: https://x.com/kunallanjewar
  • LinkedIn: https://www.linkedin.com/in/kunallanjewar/
  • Website: https://www.kunall.com
  • Photo: /wf26/speakers/by-id/spk_kunal_lanjewar.jpg
  • Sessions:

- Your Hero Agent Needs a Party — Day 4 — Session Day 3 2:25pm-2:45pm

A front-door persona, a party of deterministic specialist agents, A2A between. Your support bot deflects half its tickets, then, soloing a problem it was never built for, confidently runs the wrong kubectl command. Most teams respond by rewriting the prompt. The real fix is a multi‑agent party of specialists. This talk gives you a production pattern that turns one over-leveled hero agent into a coordinated party of specialists you can trust on tier-zero infrastructure. Persona and ReAct agents make great heroes at the front door. Any team can copy one, paste it into their stack, and adjust the behavior in plain English. But if you send a lone hero to clear the dungeon, whether it is a deploy or an incident, a non-deterministic Reason-Act loop tends to loop, over-act, or punt back to a human. More prompts and more skills do not reliably level it up. Instead of soloing, keep the persona as the front-door face and give it a party: deterministic DAG specialists where the graph is fixed and the LLM is called only at decision points. For example, a deployment specialist can list rolling pods, choose the next tool, run it, read logs, and then diagnose the result. Each specialist is a class with one job and a narrow set of tools, and they coordinate over A2A for capability discovery and delegation across frameworks. Reliability and tighter least-privilege access become properties of the design, not something you try to bolt onto a prompt. You’ll leave with the pattern: where to draw the line between the hero and its specialists, how to shape a DAG specialist so it decides instead of flails, and where A2A fits as the seam between them, grounded in lessons from a tier‑zero fleet.

Kwindla Kramer

  • Role: CEO
  • Company: Daily
  • Bio: Co-founder at Daily. Contributor to Pipecat. ᓚᘏᗢ
  • Twitter: https://x.com/kwindla
  • LinkedIn: https://www.linkedin.com/in/kwkramer/
  • Website: https://machine-theory.com/
  • Blog: https://www.linkedin.com/in/sean-dubois/
  • Photo: /wf26/speakers/by-id/spk_kwindla_kramer.jpg
  • Sessions:

- The New Primitives: Building AI-Native Software — Day 2 — Session Day 1 10:45am-11:05am

In the future, every piece of software with a human-facing surface will be built from new, LLM-centric primitives. (Just like every piece of software today has networking, threads/async routines, UI on top of some flavor of Model/View/Controller abstractions, etc.) We're just starting to invent these new primitives. The list, though, will definitely include: 1. Subagents - multiple inference loops, multiple models, async tool calls 2. Very long context - memory + episodic human interactions over a long period of time, structured data input (not just output), progressive skills/context loading, graceful compaction & summarization 3. dynamic user interface generation / user interfaces driven by LLM inference 4. conversational voice input

- Voice is the universal interface — Day 4 — Session Day 3 11:40am-12:00pm

Language models give us the ability to create natural language, conversational, interfaces for computers. We are seeing a rapid shift among early adopters to using general language instead of traditional user interfaces for tasks like writing code and editing spreadsheets. Join the cofounders of Pipecat, Gradium, and Daily as we discuss the future of realtime voice and AI interfaces. Voice is the most efficient input mode for natural-language systems, and often the most efficient output mode, as well. But good voice interfaces require a very high degree of conversational facility, intelligence, task-specific reliability, and robustness to real-world realities like multiple speakers and background noise. There's a long history of voice interfaces in science fiction: Star Trek, Iron Man, Her. We'll use these depictions of computing possibilities as a jumping off point for talking about the ideal voice interface. How close are we to being able to build these interfaces with today's models, hardware, orchestration tooling, and UI libraries? What are the most promising research directions? What did the movies get wrong, now that we actually have experience building natural language, open-ended, voice systems?

Kyle Mistele

  • Role: CTO
  • Company: HumanLayer
  • Bio: Recovering Red-Team security engineer & ATM hacker, now CTO at HumanLayer helping teams escape the vibe slop dopamine casino and ship production grade code with AI. Yaps about distributed systems engineering, real-time sync, virtual filesystems, and the SF pizza scene.
  • Twitter: https://x.com/0xBlacklight
  • LinkedIn: https://www.linkedin.com/in/kyle-mistele
  • Website: https://blacklight.sh
  • Photo: /wf26/speakers/by-id/spk_kyle_mistele.jpg
  • Sessions:

- Loop Engineering from first principles — Day 2 — Session Day 1 3:45pm-4:05pm

Code is free, software is infinite, and agents can do it all - that's the promise of the lights-off software factory, where humans interact only with tickets & specifications, and nobody reads the code, let alone writes it. We ran our own for six months, and we have the scars to prove it - bad code compounded, and agents created problems that agents couldn't solve - until we had to throw it all away. But this is a survivor's guide, not an obituary. In this talk, we'll share the challenges we encountered, what we liked, what we hated, what we're still doing, what we stopped doing, and what we started doing afterwards.

Lakshya Agrawal

  • Role: Creator and maintainer of GEPA
  • Company: GEPA
  • Bio: Lakshya A. Agrawal is the creator and maintainer of GEPA and a second-year EECS PhD student at UC Berkeley’s Sky Computing Lab. His research focuses on optimization, evaluation, and self-improvement for LLM-based agents and systems. He previously worked as an AI4Code Research Fellow at Microsoft Research.
  • Twitter: https://x.com/LakshyAAAgrawal
  • LinkedIn: https://www.linkedin.com/in/lakshyaaagrawal/
  • Website: https://lakshyaaagrawal.github.io/
  • Photo: /wf26/speakers/by-id/spk_lakshya_agrawal.jpg
  • Sessions:

- Self-Improvement of Context, Harness, and Model Weights through Reflective Optimization — Day 3 — Session Day 2 2:25pm-2:45pm

Large language models are increasingly adapted to downstream tasks via reinforcement learning methods like GRPO, which often require thousands of rollouts to learn new tasks. We argue that language provides a much richer learning medium: an LLM can reflect on full trajectories (including reasoning, tool calls and errors) to diagnose failures and propose targeted improvements. We introduce GEPA, a reflective prompt optimizer that incorporates this principle outperforming GRPO by up to 20% while using up to 35x fewer rollouts across tasks spanning 5+ domains and also works with black-box models.

Building on this, we then introduce optimize_anything, a unified API that generalizes reflective optimization to arbitrary text parameters. This single system achieves state-of-the-art results across eight fundamentally different areas, including nearly tripling ARC-AGI accuracy via agent architecture discovery, generating CUDA kernels that beat PyTorch and cutting cloud scheduling costs by 40% through policy discovery, establishing LLM-based reflective search as a general-purpose problem-solving paradigm.

Finally, I present Fast-Slow Training (FST), which brings reflective optimization into LLM post-training. FST jointly optimizes model parameters ("slow weights") via RL and textual contexts ("fast weights") via GEPA. Because the fast channel quickly absorbs task-specific nuances, the slow parametric updates are freed to consolidate general reasoning rather than memorizing task details. This yields up to 3x better sample efficiency, a higher performance asymptote with a significantly lower drift from the base model. This reduced drift preserves plasticity for continual learning, allowing FST to adapt sequentially where parameter-only RL stalls.

Broadly, our work advocates a fundamental shift in AI adaptation: replacing task-specific algorithms with diagnostic evaluation, and evolving from parameter-only post-training to the joint optimization of prompts, agent architectures, and model weights.

Lance Martin

  • Role: Member of Technical Staff
  • Company: Anthropic
  • Bio: Member of technical staff at Anthropic. Working on the Claude Platform, including Claude Managed Agents and the claude-api skill in Claude Code. Prior to Anthropic, was one of the early team at LangChain. Prior to LangChain, spent several years focused on vision for self-driving cars (Uber ATG, Ike, Nuro) and got a PhD from Stanford.
  • Twitter: https://x.com/RLanceMartin
  • LinkedIn: https://www.linkedin.com/in/lance-martin-64a33b5
  • Website: https://rlancemartin.github.io
  • Blog: https://rlancemartin.github.io
  • Photo: /wf26/speakers/by-id/spk_lance_martin.jpg
  • Sessions:

- Claude for long-horizon tasks — Day 2 — Session Day 1 1:55pm-2:15pm

Claude is capable of long horizon tasks. In this talk, we'll share lessons learned about building agent harnesses for reliable and secure long-horizon work. This include decoupling the brain and hands, self-verification, self-learning, and design for evolving agent harnesses.

Laurie Voss

  • Role: Head of Developer Relations
  • Company: Arize AI
  • Bio: Laurie Voss is Head of Developer Relations at Arize AI, the leading company for AI observability and evaluations. He has been a developer for over 30 years and was co-founder of npm, Inc.. He believes passionately in making the web bigger, better, and more accessible for everyone.
  • Twitter: https://x.com/seldo
  • LinkedIn: https://www.linkedin.com/in/seldo/
  • Website: https://seldo.com
  • Blog: https://seldo.com
  • Photo: /wf26/speakers/by-id/spk_laurie_voss.jpg
  • Sessions:

- From Vibes to Production: Evaluating and Shipping AI Agents That Work 101 — Day 1 — Workshop Day 9:00am-11:00am

Building an AI demo is easy. Knowing whether it actually works — and keeping it working in production — is the hard part. Most teams ship agents on vibes: they try a few prompts, the output looks good, and they push to production with no real way to measure quality or catch regressions.

This hands-on workshop walks through the full lifecycle of shipping a real AI agent, using a working financial-analyst agent built on the Claude Agent SDK as the running example. You'll instrument it with tracing, do structured error analysis on its actual outputs, and build a layered evaluation suite — from cheap deterministic code checks to LLM-as-a-judge evaluators with custom rubrics. We'll cover the parts most tutorials skip: why agents fail in ways single LLM calls don't, the eval anti-patterns that quietly mislead you, and how to know whether you can even trust your judge (meta-evaluation). Finally, we'll close the loop: turning eval results into datasets and experiments, running evals online against production traffic, wiring them to monitors and alerts, and feeding failure explanations back to a coding agent to actually fix the underlying problems.

You'll leave with a runnable notebook and a repeatable, evaluation-driven workflow you can apply to your own agents the next day.

- From Vibes to Production: Evaluating and Shipping AI Agents That Work 201 — Day 1 — Workshop Day 2:20pm-4:20pm

Building an AI demo is easy. Knowing whether it actually works — and keeping it working in production — is the hard part. Most teams ship agents on vibes: they try a few prompts, the output looks good, and they push to production with no real way to measure quality or catch regressions.

This hands-on workshop walks through the full lifecycle of shipping a real AI agent, using a working financial-analyst agent built on the Claude Agent SDK as the running example. You'll instrument it with tracing, do structured error analysis on its actual outputs, and build a layered evaluation suite — from cheap deterministic code checks to LLM-as-a-judge evaluators with custom rubrics. We'll cover the parts most tutorials skip: why agents fail in ways single LLM calls don't, the eval anti-patterns that quietly mislead you, and how to know whether you can even trust your judge (meta-evaluation). Finally, we'll close the loop: turning eval results into datasets and experiments, running evals online against production traffic, wiring them to monitors and alerts, and feeding failure explanations back to a coding agent to actually fix the underlying problems.

You'll leave with a runnable notebook and a repeatable, evaluation-driven workflow you can apply to your own agents the next day.

- Evals Track Intro — Day 3 — Session Day 2 10:25am-10:30am

- The Death of the Code Review — Day 3 — Session Day 2 12:05pm-12:25pm

Code review was built for a world where humans wrote all the code. Now, the question isn’t “does this diff look good?” — it’s “can this system safely ship code on its own?” This talk will show why and how traditional code review will quietly be replaced by automated verification harnesses. We’ll show how prompt learning can be used to clone your best internal code reviewers, turning their judgment into automated evaluation loops. We’ll also open source a code review training harness that captures review patterns and turns them into reusable checks for AI-generated code.

- How long can your skills be before your agent forgets what you told it? — Day 3 — Session Day 2 1:30pm-1:50pm

A year ago, frontier models lost the thread somewhere around 200 simultaneous instructions, so skills files had to stay short and lean on sub-skills and subagents. We re-ran IFScale on the 2026 frontier and found the ceiling has moved by an order of magnitude: closer to 2,000 instructions, up to 5,000 on the strongest models. The more interesting story is how models fail at the new frontier: DeepSeek quietly drops instructions, Opus refuses outright when innocuous words trip a safety classifier, Gemini burns its whole budget on reasoning and emits nothing, and GPT-5.5 stops to tell you your request was unreasonable. The capacity problem is largely solved; verification is wide open. We'll show the data, the failure modes, and what it costs to find out. You’ll come out with hard data on the ceiling for complex instructions to LLMs

Lee Robinson

  • Role: ML, Model Behavior
  • Company: Cursor
  • Bio: Model research and personality at Cursor. Previously Vercel.
  • Twitter: https://x.com/leerob
  • LinkedIn: https://www.linkedin.com/in/leeerob/
  • Website: https://leerob.com
  • Photo: /wf26/speakers/by-id/spk_lee_robinson.jpg
  • Sessions:

- Recursive Model Improvement — Day 2 — Session Day 1 5:10pm-5:30pm

Lena Hall

  • Role: Senior Director Developers and AI
  • Company: Akamai
  • Bio: Lena Hall is Senior Director Developers and AI at Akamai. She previously led Developer Experience for North America at AWS and Big Data Developer Relations at Microsoft, and regularly teaches practical AI and developer topics through talks and videos.
  • Twitter: https://x.com/lenadroid
  • LinkedIn: https://www.linkedin.com/in/lena-hall
  • Photo: /wf26/speakers/by-id/spk_lena_hall.jpg
  • Sessions:

- The Signal Layer: What to Build When Anything Can Be Built — Day 4 — Session Day 3 3:20pm-3:40pm

AI has made implementation faster, cheaper, and more widely available. That changes the real bottleneck in software.

When every team can generate code, spin up agents, prototype workflows, and ship demos faster than ever, the advantage moves to a different layer: knowing what is worth building, who it is for, how people will discover it, and how the product should behave once they do.

This talk introduces the Signal Layer: the system of public signals, user intent, agent experience, distribution loops, and product judgment that helps builders decide what deserves to exist before they commit time, infrastructure, and trust to building it.

We will look at how AI changes the software lifecycle from “can we build it?” to “should this exist?” and how developers, AI engineers, and technical leaders can design products that earn adoption instead of producing impressive demos that disappear.

When anything can be built, the most valuable builders are the ones who can read signal early, shape the right experience, and build the thing users were already moving toward.

Leo Mehr

  • Role: Director of Engineering
  • Company: Ramp
  • Bio: Leo is a Director of Engineering at Ramp, where he's built the Forward Deployed Engineering function (ramp.com/fde) and also makes Ramp usable by agents (agents.ramp.com). He previously was co-founder and head of eng at cybersecurity startup Lumos (backed by a16z and Neo). Leo completed an MS in CS at Stanford and lives in SF.
  • Twitter: https://x.com/leomehr
  • LinkedIn: https://www.linkedin.com/in/leomehr
  • Website: https://leomehr.com/
  • Photo: /wf26/speakers/by-id/spk_leo_mehr.jpg
  • Sessions:

- How Forward Deployed Engineering is done at Ramp — Day 2 — Session Day 1 2:25pm-2:45pm

Leo Platzer

  • Role: Founder
  • Company: Deasy Labs / Collibra
  • Bio: Leo Platzer co-founded Deasy Labs, a company focused on making unstructured enterprise content more accessible and AI-ready, which was acquired by Collibra.
  • Sessions:

- From raw documents to AI-ready data — Day 2 — Session Day 1 3:20pm-3:40pm

Starting from a real document corpus full of overlapping, look-alike files, we walk through what it takes to make retrieval on those files reliable, from deduplicating to enriching with metadata. Watch how each step reshapes the vector space, and what happens to the answers that come back.

Liad Yosef

  • Role: Co-creator
  • Company: MCP Apps
  • Bio: Liad Yosef is the co-creator and maintainer of the MCP Apps spec, a member of the MCP Steering Committee, and the co-builder of MCP-UI. Liad is currently the co-founder and CTO at ORA - building the future of the agentic web. Previously leading agentic interfaces in Shopify's CEO office, Liad is a seasoned AI lead and software architect. He has been a web enthusiast for two decades, passionate about crafting developer-first experiences. When he isn't defining open-source standards for the agentic web, he writes poetry and moonlights as an analog astronaut for the European Space Agency.
  • Twitter: https://x.com/liadyosef
  • LinkedIn: https://linkedin.com/in/liadyosef
  • Website: https://ora.ai
  • Photo: /wf26/speakers/by-id/spk_liad_yosef.jpg
  • Sessions:

- Rebuilding the web for agents — Day 2 — Session Day 1 12:05pm-12:25pm

AI apps are the new browsers. And the web is not ready.

For thirty years we built the web for human eyes, benchmarked by tools like Lighthouse: humans measuring human behavior. That era is ending. Bot traffic has overtaken human traffic, and we can't hand-write a benchmark for what comes next - every best practice goes stale the moment models improve.

Your next customer isn't a human with a credit card - it's an agent with a protocol, and it would rather not see your interface at all. That shift moves the UX question from how a human experiences your product to how an agent does, and how a human experiences that agent. Already, some services report their MCP traffic outpacing their web UI. The agent is rapidly becoming the main surface, and it always takes the path of least friction. Claude Code might consistently prefer PostHog over Mixpanel simply because PostHog has the better agentic surface - and Mixpanel loses customers without a human ever weighing in.

Meanwhile the agentic web protocol stack keeps multiplying, a new one seemingly every week. The harder problem isn't discovery - it's operability: whether the web can actually be run once an agent arrives, and what is the ideal stack for that. Should we lean into headless protocols, or ones like WebMCP that treat the UI as the source of truth? Does a site need to implement every new spec just to support every kind of agent?

So we stopped guessing and watched real agents work the whole journey: finding, understanding, authenticating, acting, handing back to a human. The findings go against the last year of agent-readiness advice. Agents ignore the files we built for them, reaching for docs and homepages instead - and whatever they reach, they trust and act on. But when those files are linked properly, their usage jumps 4x. The format isn't the key for the agentic web. Reachability is.

The web will never be completely headless. Some moments still demand a human: choosing a seat, comparing options, casually exploring. And agents aren't uniform - some want full headless access, others spin up a browser to fill the gaps, but that's a friction point, not a free fallback. So the web is going nearly headless, always with a human eye at the end.

This talk maps the entire agent web landscape based on findings from real agent journeys research:

  • Which protocols earn their place and which are noise.
  • Why "agent-ready" and "accessible" are the same engineering problem.
  • How MCP Apps close the last mile - and when headful protocols like WebMCP step in.
  • How to build for agent-readiness that survives the next model - not a checklist that's stale in a month.

The gap between ready and not is about to separate the relevant from the invisible.

- MCP Apps - Extending the frontier — Day 3 — Session Day 2 2:25pm-2:45pm

AI agents are quickly becoming the new browsers, changing how users consume content and get work done. That shift is increasingly powered by a new generation of agentic apps that don’t just present text but deliver interactive experiences within any MCP host. By standardizing interactive UI on MCP, the MCP Apps official extension (SEP-1865) is poised to become the new agentic app runtime, serving as the backbone of the future and removing adoption obstacles that previously hindered the protocol. Join us to learn more about: The new web - How MCP Apps reshapes the traditional app landscape and transforms the way users interact with the web Deep dive into MCP Apps - - Architecture - Real-world use cases - What's ahead? - Getting started (+community and #mcp-apps-wg) - Future Vision

Lina Colucci

  • Role: CEO
  • Company: LemonSlice
  • Bio: Co-Founder and CEO of LemonSlice, an AI lab working to break the avatar Turing test. LemonSlice raised $10.5M seed from Matrix and YC and have the most advanced interactive avatar model in the world. Originally from Brazil, Lina is an ML researcher and artist - ballerina, musician, photographer, YouTuber. She previously founded and ran one of the leading ML consulting firms in the US, and has a PhD from MIT and Harvard.
  • Twitter: https://x.com/lina_colucci
  • LinkedIn: https://www.linkedin.com/in/lina-colucci/
  • Website: https://www.linacolucci.com
  • Photo: /wf26/speakers/by-id/spk_lina_colucci.jpg
  • Sessions:

- Voice agents with Realtime Video — Day 4 — Session Day 3 1:55pm-2:15pm

Lotte Seifert

  • Role: Founder
  • Company: SID AI
  • Bio: Founder of SID AI. Training frontier models to retrieve and reason over any data source.
  • Twitter: https://x.com/lotteseifert
  • LinkedIn: https://www.linkedin.com/in/lotteseifert/
  • Photo: /wf26/speakers/by-id/spk_lotte_seifert.jpg
  • Sessions:

- Where RL Will Take Search — Day 2 — Session Day 1 2:50pm-3:10pm

Search is having its Bitter Lesson moment. By turning search into an RL problem, we can finally scale search quality with compute! RL is extremely sample efficient when compared to classical search training objectives and we see no ceiling to how far we can scale this new paradigm. We cover the training of SID-1, the first RL-trained search model, and how search will look like post-RL.

Lotte Verheyden

  • Role: AI engineer and developer educator, Langfuse
  • Company: Clickhouse
  • Bio: Lotte Verheyden guides both humans and agents through AI engineering at Langfuse, part of ClickHouse. Her work focuses on making AI observability and agent workflows legible to developers and coding agents.
  • Twitter: https://x.com/lotte_verheyden
  • Photo: /wf26/speakers/by-id/spk_lotte_verheyden.jpg
  • Sessions:

- Continuously improving agents with Langfuse — Day 1 — Workshop Day 1:15pm-2:15pm

Join us for a hands-on Langfuse workshop where we'll show you how to observe, debug, and improve your AI applications, step by step, using a real sample app. Bring your questions and discover how Langfuse can level up your specific use cases!

Lou

  • Role: Head of Developer Relations
  • Company: Z.ai
  • Bio: Lou is Head of Developer Relations at Z.ai. Z.ai develops the GLM family of AI models and products, including Chat Z.ai and open models for coding, agents, and local deployment.
  • Twitter: https://x.com/louszbd
  • Website: https://chat.z.ai/
  • Photo: /wf26/speakers/by-id/spk_lou_zai.jpg
  • Sessions:

- Local Models: Trust, Control, Optimization — Day 4 — Session Day 3 1:30pm-1:50pm

Local Models: Trust, Control, Optimization looks at why builders are choosing local AI for privacy, reliability, customization, cost, and ownership, while still asking where cloud remains necessary. The panel covers local-first versus hybrid strategies, the role of open-source models, and the infrastructure stacks making frontier-quality intelligence possible outside centralized APIs.

Moderator: Carter Abdallah (NVIDIA). Panelists: Vincent Weisser (Prime Intellect), Lucas Atkins (Arcee AI), Chris Alexiuk (NVIDIA), Lou (Z.ai).

- Local Models: Trust, Control, Optimization — Day 4 — Session Day 3 1:55pm-2:15pm

Local Models: Trust, Control, Optimization looks at why builders are choosing local AI for privacy, reliability, customization, cost, and ownership, while still asking where cloud remains necessary. The panel covers local-first versus hybrid strategies, the role of open-source models, and the infrastructure stacks making frontier-quality intelligence possible outside centralized APIs.

Moderator: Carter Abdallah (NVIDIA). Panelists: Vincent Weisser (Prime Intellect), Lucas Atkins (Arcee AI), Chris Alexiuk (NVIDIA), Lou (Z.ai).

Louis-François Bouchard

  • Role: CTO & Co-Founder
  • Company: Towards AI
  • Bio: Louis-François Bouchard is the co-founder of Towards AI, where he builds and teaches a practical toolkit for shipping reliable LLM products. He co-authored Building LLMs for Production, a hands-on guide to prompting, fine-tuning, retrieval augmented generation, and evaluation. Through Towards AI Academy, he has launched multiple in-depth courses for AI engineers, designed to turn developers into AI professionals who can transform prototypes into scalable, customer-ready systems. He also runs the What’s AI YouTube channel and newsletter, translating new research and best practices into clear engineering playbooks for 70K+ subscribers and tens of thousands of readers. Today, he partners with founders and organizations on AI strategy, training design, and production workflows that raise accuracy, reduce risk, and make generative AI useful for paying customers, and he speaks at events such as AIE, Uphill Conf.
  • Twitter: https://x.com/Whats_AI
  • LinkedIn: https://www.linkedin.com/in/whats-ai/
  • Website: https://www.louisbouchard.ai/
  • Blog: https://www.louisbouchard.ai/
  • Photo: /wf26/speakers/by-id/spk_louis_fran_ois_bouchard.jpg
  • Sessions:

- Context Engineering in 2026: Compaction, Memory & Cost — Day 1 — Workshop Day 2:20pm-4:20pm

Every long agent session eventually breaks: the assistant that swore it would "never push to main" does exactly that forty turns later. The model didn't get dumber — its context did. This workshop is about engineering the context window so that stops happening, shown with Towards AI's open-source AI tutor, which answers questions for students of our AI-engineering courses. Context engineering is deciding what the model sees on every single call — instructions, history, retrieved course content, memory, and tool outputs — and it's the line between a tutor that holds a coherent session and one that forgets the student's setup halfway through. We'll move in three stages, mirroring how the project actually went. The concepts: the two root problems (a finite window, a stateless model), the full compaction toolkit (truncation, trimming, tool-result clearing, summarization, and offloading to files — and when each actually helps), memory that survives across sessions, skills loaded on demand, and production-grade retrieval (chunking, metadata, course scoping, hybrid search, reranking, and evaluating). We'll cover the tutor's architecture, and the evaluation harness we used to measure every run on Gemini — tokens, cost, latency, and memory probes instead of vibe-checks. At real volume, even Gemini Flash got expensive, so we tested whether open and local models could match the quality for a fraction of the cost and match result quality. Everything is open-source and will be shared during the workshop.

Lovina Dmello

  • Role: Senior Software Developer
  • Company: NVIDIA
  • Bio: Lovina Dmello is a senior infrastructure software engineer on the Deep Learning Libraries team at NVIDIA, where she works on building and maintaining the infrastructure that powers the NVIDIA deep learning ecosystem. Before joining NVIDIA, Lovina spent four years at Apple on the Apple Payments and Wallets backend team, and three years at Oracle on the Oracle Cloud Infrastructure team. She earned her master's degree in Computer Science from the University of Georgia, where her thesis focused on ransomware classification using machine learning algorithms. Lovina shares her insights through research papers and writing on AI/ML security, agentic AI systems, TensorRT, deep-learning libraries, and infrastructure best practices.
  • LinkedIn: https://www.linkedin.com/in/lovina25
  • Blog: https://developer.nvidia.com/blog/author/ldmello
  • Photo: /wf26/speakers/by-id/spk_lovina_dmello.jpg
  • Sessions:

- Your LLM Stack Is a 2008 Database With Better Marketing: Why ML Security Is Dominated by Misconfiguration, Not Missing Features — Day 2 — Session Day 1 11:10am-11:30am

ShadowRay exposed over a billion dollars of data through a missing authentication check. It wasn't a zero-day. It wasn't a clever new attack class. It was a default config someone never flipped off. That story is not the exception in production ML, it's the rule. We synthesized 139 peer-reviewed papers on production ML security across access control, runtime security, infrastructure, and operations. Five findings stood out, and one of them upends how most teams think about ML security: - Misconfiguration, not missing features, is the dominant failure mode. The mechanisms exist. Teams aren't using them, or are using them wrong. - Adversarial defenses impose 15–30% inference overhead, which is why almost no production system actually runs them. - ML-specific security tooling lags general DevOps tooling by years. - Security, data-science, and ops teams operate in expertise silos that create persistent gaps no single team can see. - LLM and multi-tenant GPU threats are evolving faster than defenses (prompt injection, RAG poisoning, GPU side channels). This talk walks through the four-pillar defense-in-depth framework, the six-category threat taxonomy that maps each attack to its primary and secondary defenses, and a four-level security maturity model that matches overhead budgets to deployment contexts. You leave knowing where your stack actually sits and which 3 misconfigurations account for most of the risk.

Lu Zhang

  • Role: Member of Technical Staff
  • Company: OpenAI
  • Bio: Lu is an engineer working on large-scale inference platforms, focused on making AI model serving reliable, efficient, and scalable. His work includes distributed systems, workload scheduling, performance optimization, and production reliability. Previously, Lu built and operated GPU clusters supporting large machine learning workloads.
  • LinkedIn: https://www.linkedin.com/in/luzhang1/
  • Photo: /wf26/speakers/by-id/spk_lu_zhang.jpg
  • Sessions:

- Routing LLM Inference in Production: From Engine Signals to Policy — Day 4 — Session Day 3 11:10am-11:30am

Production LLM apps need more than a fast model: they need an inference routing layer that can choose where each request should run as engines, capacity, latency, and geography cost change. This talk shares a generalized Inference Load Balancer (ILB) proxy/controller architecture. A low-latency proxy applies routing weights and request-path signals, while a controller computes source-cluster-to-engine weights from demand, capacity/performance profiles, replica state, and geography cost. We will cover the practical debugging patterns AI engineers need: reading engine signals, explaining why a request went to one backend instead of another, handling retries and load shedding, and keeping routing behavior observable without exposing OpenAI-specific internals or non-public metrics.

Lucas Atkins

  • Role: CTO
  • Company: Arcee AI
  • Bio: Lucas Atkins serves as CTO and Head of Research at Arcee, where he led the development of the company’s proprietary training stack. He has worked in machine learning since 2019, including text-to-speech systems for automotive systems. His work also includes training specialized language translation systems for the UAE in 2022 and collaborating with ROCM from 2023 to 2024 on optimizing enterprise GPUs for large-scale model training.
  • Twitter: https://x.com/latkins
  • LinkedIn: https://www.linkedin.com/in/lucas-atkins-2892482b6
  • Website: https://arcee.ai
  • Photo: /wf26/speakers/by-id/spk_lucas_atkins.jpg
  • Sessions:

- Local Models: Trust, Control, Optimization — Day 4 — Session Day 3 1:30pm-1:50pm

Local Models: Trust, Control, Optimization looks at why builders are choosing local AI for privacy, reliability, customization, cost, and ownership, while still asking where cloud remains necessary. The panel covers local-first versus hybrid strategies, the role of open-source models, and the infrastructure stacks making frontier-quality intelligence possible outside centralized APIs.

Moderator: Carter Abdallah (NVIDIA). Panelists: Vincent Weisser (Prime Intellect), Lucas Atkins (Arcee AI), Chris Alexiuk (NVIDIA), Lou (Z.ai).

- Local Models: Trust, Control, Optimization — Day 4 — Session Day 3 1:55pm-2:15pm

Local Models: Trust, Control, Optimization looks at why builders are choosing local AI for privacy, reliability, customization, cost, and ownership, while still asking where cloud remains necessary. The panel covers local-first versus hybrid strategies, the role of open-source models, and the infrastructure stacks making frontier-quality intelligence possible outside centralized APIs.

Moderator: Carter Abdallah (NVIDIA). Panelists: Vincent Weisser (Prime Intellect), Lucas Atkins (Arcee AI), Chris Alexiuk (NVIDIA), Lou (Z.ai).

Lucas Palma

  • Role: Information Security Manager
  • Company: Nubank
  • Bio: Lucas Palma is an Information Security Manager at Nubank, one of the world’s largest digital financial services platforms, where he leads Product Security across Application Security, Mobile Security, AI Security, and Product Security Engineering. He works at the intersection of software engineering, product security, and AI, helping teams adopt AI safely without slowing down product development.

His current work focuses on securing AI adoption at enterprise scale, including AI coding assistants, agentic workflows, MCP integrations, secure coding guidance, AI security tooling, and automated security review inside developer workflows. He has more than 15 years of experience in information security, software engineering, and financial services, with work ranging from building security programs from scratch to reducing banking malware incidents by 90 percent through layered mobile protections.

Lucas is passionate about making security practical for engineers by turning real attack patterns into guardrails, automation, evaluations, and tools that help teams ship faster and safer.

  • LinkedIn: https://www.linkedin.com/in/lucaspalma/
  • Website: https://www.nubank.com/
  • Photo: /wf26/speakers/by-id/spk_lucas_palma.jpg
  • Sessions:

- We Vetted 2,000 AI Skills Before They Reached Developers — Day 4 — Session Day 3 1:55pm-2:15pm

AI skills and plugins are becoming part of the software supply chain. They steer agent behavior, describe tools, run commands, access files, and shape how developers build with AI. Treating them as harmless configuration is a mistake. This talk shares what we learned from building an automated security review system for more than 2,000 internal AI skills before they reached a company wide plugin marketplace. I will walk through the risks we found, the checks that worked, the checks that created noise, and how we turned skill review into something developers could run locally and in CI. We will cover practical patterns for reviewing unsafe instructions, destructive commands, sensitive data exposure, risky tool use, credential handling, external calls, and agent behavior drift. The goal is to help AI engineers think about skills, plugins, and agent instructions as production dependencies that deserve review before they reach real users.

Lukas Petersson

  • Role: Co-Founder
  • Company: Andon Labs
  • Bio: Lukas Petersson is co-founder of Andon Labs, testing AIs in the real world. Andon OS is our stack for letting AI agents run businesses safely without any humans in the loop. Today, we operate vending machines, retail stores, cafés, embroidery and software companies, but the list grows every day. Andon partners with all four frontier AI labs (Anthropic, OpenAI, Google, xAI) to test their AI models in real-world scenarios; its Project Vend collaboration with Anthropic was covered by WSJ, Time, and 60 Minutes, and its Vending-Bench benchmark runs at every major model release.
  • Twitter: https://x.com/lukaspet
  • LinkedIn: https://www.linkedin.com/in/lukas-petersson-181a83172/
  • Website: https://lukaspet.substack.com/
  • Photo: /wf26/speakers/by-id/spk_lukas_petersson.jpg
  • Sessions:

- Vending-Bench: Long-Horizon Agent Evals for a Simulated Vending Business — Day 3 — Session Day 2 10:45am-11:05am

Long-horizon agent evals via a simulated vending machine business, testing negotiation, pricing, and supplier management over 365 days.

Mahesh Sathiamoorthy

  • Role: CEO
  • Company: Bespoke Labs
  • Bio: Co-founder and CEO of Bespoke Labs, which is a data research lab. He created OpenThoughts, a widely used reasoning dataset, and Generative Retrieval, which is used across the industry to improve recommender systems. Previously at Google DeepMind.
  • Twitter: https://x.com/madiator
  • LinkedIn: https://linkedin.com/in/smaheswaran
  • Website: https://smahesh.com
  • Blog: http://smahesh.com/blog
  • Photo: /wf26/speakers/by-id/spk_tbd_bespoke_labs.jpg
  • Sessions:

- Data and Environment Curation for Post-training LLMs — Day 2 — Session Day 1 3:45pm-4:05pm

Hold for Bespoke Labs. Company works on data curation, eval tooling, and reinforcement-learning environment curation for agent development.

Manoj Nair

  • Role: CTO & Chief Innovation Officer
  • Company: Snyk
  • Bio: Manoj Nair is Chief Technology Officer and Chief Innovation Officer at Snyk, where he leads the Emerging Technologies and Solutions Office.
  • LinkedIn: https://www.linkedin.com/in/mnair1
  • Photo: /wf26/speakers/by-id/spk_manoj_nair.jpg
  • Sessions:

- Security Track intro — Day 2 — Session Day 1 10:25am-10:30am

- Through the AI Fog: The architectural decision the next 24 months of agentic security depends on. — Day 2 — Session Day 1 10:45am-11:05am

Maor Bril

  • Role: Chaos Catalyst
  • Company: Character.ai
  • Bio: Maor is a Principal Software Engineer at Character.ai, where he builds the agentic platform behind Stories, Streams, and the AI Social Feed (16M+ MAU). He open-sourced claude-agent-sdk-go and JudgeJudy, the multimodal eval harness that gates every AgentX release. Before Character.ai he led the Datastores org at Coinbase and shipped infrastructure at Netflix, Google, and VMware (via the Arkin acquisition). Twenty years of building systems that run in production. He writes about agentic systems and AI engineering on LinkedIn.
  • Twitter: https://x.com/maorbril
  • LinkedIn: https://www.linkedin.com/in/maorbril
  • Website: https://character.ai
  • Photo: /wf26/speakers/by-id/spk_maor_bril.jpg
  • Sessions:

- Evaling Video Slop — Day 3 — Session Day 2 1:55pm-2:15pm

Everyone is shipping video models. Almost no one is evaling them honestly. CLIP score doesn't catch temporal incoherence. Vibes-based human review doesn't scale. And every "AI judge" you wire up will quietly drift away from human preference unless you measure the drift. This is a tactical talk on building real multimodal eval, using JudgeJudy (open-sourced at Character.ai) as the working example. You'll leave with: Why video is different from text. Temporal consistency, shot continuity, narrative coherence, and the metrics that actually capture each (clip_temporal, temporal_consistency, and friends). AI judges, the real version. Custom rubrics, when they work, when they hallucinate, when they collapse to a single dimension and pretend they didn't. The calibration loop. Pearson/Spearman correlation against human scores, automated rubric improvement, detecting systematic judge bias before it costs you a release. Pairwise preference models for video. Training a Qwen3-VL backbone with Bradley-Terry loss to score "is this slop?" before it ships. Regression gates in CI. How every AgentX release at Character.ai passes through an eval wall before it reaches users. Closing the loop with JudgeJudy. Correlating eval scores against real telemetry (Amplitude, Statsig) and feeding validated gates back into the runtime. If you're shipping any multimodal output and your eval strategy is still "the team watches some clips on Friday," this is the upgrade. github.com/character-ai/judgejudy

Marah Abdin

  • Role: Team Lead - Synthetic Data
  • Company: poolside
  • Bio: Synthetic data Lead at Poolside, building the Laguna models (pre-training/post-training). Previously at Microsoft AI/Research, building the Phi models.
  • Twitter: https://x.com/marah_i_abdin
  • LinkedIn: https://www.linkedin.com/in/marah-abdin
  • Website: https://marahabdin.com
  • Photo: /wf26/speakers/by-id/spk_marah_abdin.jpg
  • Sessions:

- The Messy Reality of Scale: Synthetic Data and Pre-Training at Poolside — Day 2 — Session Day 1 11:10am-11:30am

TBD — focus on data quality considerations for LLM pretraining and code generation.

Marco Casalaina

  • Role: VP Products, Core AI and AI Futurist
  • Company: Microsoft
  • Bio: Marco Casalaina is VP Products, Core AI and AI Futurist at Microsoft. His recent public posts center on GitHub Copilot, Copilot CLI, and Microsoft's CoreAI work.
  • LinkedIn: https://www.linkedin.com/in/marcocasalaina
  • Photo: /wf26/speakers/by-id/spk_marco_casalaina.jpg
  • Sessions:

- Power agents with Microsoft IQ — Day 3 — Session Day 2 2:25pm-2:45pm

Agents need more than data, they need context. Learn how Microsoft IQ connects agents to enterprise knowledge, business data, and work signals. See how Foundry IQ, Fabric IQ, and Work IQ provide grounded, permission-aware context that enables agents to reason, act, and deliver reliable results.

Maria Bledsoe

  • Role: General Manager, Product Marketing
  • Company: Microsoft
  • Bio: Maria Bledsoe is General Manager of Product Marketing at Microsoft, writing and speaking about enterprise AI readiness, AI tools, and the practical foundations required to scale AI across organizations.
  • Photo: /wf26/speakers/by-id/spk_maria_bledsoe.jpg
  • Sessions:

- Using AI tools to teach old apps new tricks — Day 2 — Session Day 1 2:25pm-2:45pm

Becoming AI-ready starts with modernizing your legacy systems and technical debt — and keeping them modernized. We’ll show how you can use agentic AI to take on the hardest parts of modernization: analyzing large codebases, mapping dependencies, planning upgrades, refactoring safely, while doing it all at scale with enterprise controls. With GitHub Copilot modernization capabilities, you can move from legacy complexity to modernized apps in days, not months.

Marina Petzel

  • LinkedIn: https://www.linkedin.com/in/marina-petzel
  • Photo: /wf26/speakers/by-id/spk_marina_petzel.jpg
  • Sessions:

- Beyond Golden Signals: Monitoring in the Age of GenAI — Day 2 — Session Day 1 2:25pm-2:45pm

The four golden signals (Latency, Errors, Traffic, Saturation) have been the foundation of application monitoring for years, and it still matters, but for GenAI applications, these signals alone leave significant blind spots. A request can return 200 OK with low latency while the response hallucinates, leaks PII, or costs much more than expected. This talk will walk you through what changes when you're monitoring non-deterministic, token-priced, prompt-injectable systems. We'll cover three additional monitoring dimensions: Cost (token attribution, model-mix tracking, wasted spend on failed requests), Safety (prompt injection detection, PII scanning, jailbreak attempts), and Quality (hallucination rate, relevance scoring, user satisfaction) and show why each one is necessary alongside your existing instrumentation.

Mark Lummus

  • Role: Product Lead
  • Company: PayPal
  • Bio: Software/product leader with 30+ years of experience spanning product management, software engineering, automation, DevOps, and business development.
  • LinkedIn: https://www.linkedin.com/in/marklummus
  • Photo: /wf26/speakers/by-id/spk_mark_lummus.jpg
  • Sessions:

- Burn your flags: How PayPal designs interactive CLI tools for agents — Day 1 — Workshop Day 2:20pm-4:20pm

The common guidance for designing complex CLI tooling that agents can use is to add a 'non-interactive' mode, where a normally interactive & flow-based command can be executed in a single pass by feeding it a bunch of flags. This is necessary for deterministic automation, but agents aren't scripts; they aren't really constrained in the same way, and they benefit greatly from the same step-by-step contextual workflows that humans do. In this workshop, PayPal goes deep on techniques we've used in our upcoming paypal CLI that you can steal to make your complex CLI workflow tool agent-usable — without giving up the guardrails and guidance that interactive CLI tools provide.

Martin Harrysson

  • Role: Senior Partner
  • Company: McKinsey & Company
  • Bio: Leader of McKinsey’s Software Product Development practice, called SoftwareX

Leads McKinsey’s research on impact of AI and agentic on Software Development

Spends the majority of his time serving Software and other Technology clients as well as Financial Services clients on software-related topics

  • LinkedIn: https://www.linkedin.com/in/martinharrysson/
  • Photo: /wf26/speakers/by-id/spk_martin_harrysson.jpg
  • Sessions:

- Tokenomics: From AI Spend to AI Value — Day 3 — Session Day 2 11:00am-12:00pm

Facilitated, peer-to-peer, under the Chatham House Rule — not recorded.

As enterprise AI adoption accelerates, token spend is scaling faster than value realization. We address i) how to make decisions amid unclear cost and value dynamics, ii) how to shift from token-level to workflow-level analysis, and iii) how to manage downstream behavior implications on AI usage.

- The Agentic Product Development Organization — Day 4 — Session Day 3 11:00am-12:00pm

Facilitated, peer-to-peer, under the Chatham House Rule — not recorded.

As AI agents become embedded in day-to-day work, organizations will need to rethink product development teams, roles, and skills. This foundational shift reshapes management layers and requires overcoming challenges across talent attraction, development, and retention.

Matt Brockman

  • Role: AI Engineer
  • Company: E2B
  • Bio: AI Engineer at E2B. Building sandboxes for you.
  • Twitter: https://x.com/badphilosopher
  • LinkedIn: https://www.linkedin.com/in/matt-brockman-629214139
  • Website: https://e2b.dev
  • Photo: /wf26/speakers/by-id/spk_matt_brockman.jpg
  • Sessions:

- How I learned to stop worrying and love the sandbox — Day 1 — Workshop Day 11:05am-12:05pm

Running sandboxes at scale can get painful. How do you manage a thousand concurrent sandboxes? We'll cover burst traffic, fast sandbox creation under load, resource exhaustion, shared state with volumes, and per-user data isolation. Then you'll trigger each failure, implement fixes, and see the cost impact in real time. You'll leave with hands-on experience debugging sandbox failures and a set of observability and scaling patterns you can start implementing.

Matt Dailey

  • Role: Founder
  • Company: Ref.
  • Bio: Matt Dailey is building Ref., including ref.tools and MCP tooling that helps coding agents work with public and private libraries without wasting context or using incorrect APIs.
  • LinkedIn: https://www.linkedin.com/in/matthewjdailey
  • Photo: /wf26/speakers/by-id/spk_matt_dailey.jpg
  • Sessions:

- Velocity Sickness: What Happens When Your Whole Team Gets 10x Faster — Day 4 — Session Day 3 3:20pm-3:40pm

Learn more about Ref: https://ref.tools/ AI made writing code nearly free, and on most teams, that's quietly breaking how the team works. Individually, everyone feels ten times faster. Together, the signals point the other way: too many PRs moving in too many directions, engineers throwing away whole agent sessions and starting over ("declaring agent bankruptcy"), and critical decisions getting made inside agent chats that no one will ever see or review. There's a lot of energy, and it's all going somewhere different. I call this velocity sickness: the organizational pain that comes from individual speed. It's the engineering version of an author who ships a book a week: prolific, productive, and completely unreadable by the team that's supposed to build on it. Almost every conversation about AI coding is about making one engineer faster. This talk is about what happens to the team when all of them are. Once implementation stops being the bottleneck, the hard part isn't writing the code. It's tracking it, reviewing it, and keeping a hundred parallel decisions coherent. That's the problem eng leaders are actually being handed, and it's the one this session takes on directly. Engineering has always had three phases: plan, implement, polish. AI collapsed the middle one to almost nothing, so the leverage, and the real work, move to the decision-heavy ends. The fix isn't better prompts; it's changing what our tools treat as first-class. We have to split the decision layer from the implementation layer: humans spend their time at the decision layer, reviewing and making the choices that matter, while agents handle the implementation. That means durable, reviewable plans, not ephemeral chats. Review the decisions before you review the diff. What attendees will leave with: - A mental model for plan / implement / polish and why the decision layer is now where engineering leverage lives, plus the language to explain velocity sickness to their own team. - A concrete shift: how to pull your team's important decisions out of throwaway agent chats and into a shared, reviewable source of truth, so individual speed compounds into team cohesion instead of chaos.

Matt Gibiec

  • Role: Regional Director, Solutions Engineering
  • Company: Dynatrace
  • Bio: Matt Gibiec is a Regional Director in Dynatrace’s Solutions Engineering organization focused on software and AI-native markets. He helps organizations solve complex software delivery and observability challenges.
  • Sessions:

- Your AI Agent Has No Nervous System — Day 4 — Session Day 3 11:10am-11:30am

Most agents ship with solid evals and zero runtime observability. When something breaks in production — wrong answer, runaway retry loop, or silent tool failure — you're debugging blind. You can see the output, but you can't see what the agent believed when it made the decision. This talk walks through how to instrument agentic pipelines with OpenTelemetry: capturing system context at every step, making prompt state and tool call outcomes visible as structured data, and governing token consumption as SLOs instead of discovering overruns on an invoice. Attendees will leave with three takeaways: an understanding of telemetry for multi-step agentic workflows, a pattern for capturing system context at the span level so teams know exactly what the agent saw before it acted, and a framework for visibility into token budget and behavioral drift before something goes sideways in production. Telemetry is the nervous system. System context is the memory. Token budgets are the vital signs. None of it is optional.

Matt Lawler

  • Role: Forward Deployed Engineer Lead
  • Company: AssemblyAI
  • Bio: Leads FDE for Onboarding at AssemblyAI, helping teams ship speech-to-text and voice AI to production, from model selection and architecture through deployment and scale.
  • Photo: /wf26/speakers/by-id/spk_matt_lawler.jpg
  • Sessions:

- FDE Playbook: Build an AI Support Agent and Give It a Voice — Day 3 — Session Day 2 11:10am-11:30am

Bio: Matt Lawler leads FDE for Onboarding at AssemblyAI, helping teams ship speech-to-text and voice AI to production, from model selection and architecture through deployment and scale.

Description:

Most support bots can read. Joey can talk back. In this session, AssemblyAI's Forward Deployed Engineer Lead, Matt Lawler, shares how his team built Joey, an AI support agent that increased end-to-end resolution rates from 10% to 75%. He'll walk through the architecture, key lessons learned, and how the team extended Joey into a fully voice-enabled agent.

Matt Linderman

  • Role: Partner, Technology Practice
  • Company: McKinsey & Company
  • Bio: Partner in McKinsey's Software & Technology practice. Helping technology leaders build agentic products and transform how organizations build software.
  • LinkedIn: https://www.linkedin.com/in/matthew-linderman/
  • Photo: /wf26/speakers/by-id/spk_matt_linderman.jpg
  • Sessions:

- Tokenomics: From AI Spend to AI Value — Day 3 — Session Day 2 11:00am-12:00pm

Facilitated, peer-to-peer, under the Chatham House Rule — not recorded.

As enterprise AI adoption accelerates, token spend is scaling faster than value realization. We address i) how to make decisions amid unclear cost and value dynamics, ii) how to shift from token-level to workflow-level analysis, and iii) how to manage downstream behavior implications on AI usage.

- The Agentic Product Development Organization — Day 4 — Session Day 3 11:00am-12:00pm

Facilitated, peer-to-peer, under the Chatham House Rule — not recorded.

As AI agents become embedded in day-to-day work, organizations will need to rethink product development teams, roles, and skills. This foundational shift reshapes management layers and requires overcoming challenges across talent attraction, development, and retention.

Matthew Jewkes

  • Role: Cofounder & CTO
  • Company: Standard Cybernetics
  • Bio: Twice exited founder & CTO. Previously SVP AI Transformation at NYSE:CLVT with a focus on biopharma intelligence and scientific data. Built the 1.0 of Signal iOS. What comes next will be marvelous.
  • Twitter: https://x.com/mjewkes
  • LinkedIn: https://linkedin.com/in/mjewkes
  • Photo: /wf26/speakers/by-id/spk_matthew_jewkes.jpg
  • Sessions:

- Engineering Agency out of the Happy Path — Day 3 — Session Day 2 1:55pm-2:15pm

I spent ‘24 and ‘25 structuring the entire written history of biopharma - through drugs, trials, deals, etc. This was a ~500B token effort that translated into a production system now used by 19 of the 20 largest pharmas. We achieved PhD-level performance at scale with 99.95% accuracy over critical concepts.

The hard parts were solving questions of domain and organizational “shape”. This involved identifying which critical concepts and which bundle of tasks were worth the organizational investment to automate. And the biggest spillover win wasn't actually about time savings, it was about refocusing scarce expert judgment on error exhaust - out of which falls potential high value roadmap.

I'll walk through real examples and non-obvious, transferable wins. While the case example is in biopharma, the pattern applies to any business that relies on expert domain judgement to deliver differentiated value.

Max Drake

  • Role: Product Engineer
  • Company: tldraw
  • Bio: Product engineer at tldraw building a very good infinite canvas sdk, currently working on bringing agents to the canvas
  • Twitter: https://x.com/max__drake
  • LinkedIn: https://www.linkedin.com/in/maxdrake1/
  • Website: https://maxdrake.md
  • Blog: https://maxdrake.md
  • Photo: /wf26/speakers/by-id/spk_max_drake.jpg
  • Sessions:

- The Spatial Harness: Bringing Agents to the Canvas — Day 3 — Session Day 2 11:10am-11:30am

What if chat is the wrong interface for managing agents? What if we're holding ourselves back by squeezing our thoughts and the way we work to into a one-dimensional, single-threaded interface? At a high level, this talk aims to present the work we've done at tldraw to build a spatial harness, or a way to allow agents to work on a canvas and collaborate with users and each other natively. This work represents important steps towards building better agent + canvas experiences, a product category we've seen explode in the recent months (Paper, Replit Agent 4, Google Stitch, etc). It's also not something I've really seen talked about elsewhere. See: - Multi-agent collaboration on the canvas (fairies.tldraw.com) - We've also recently brought code mode (https://blog.cloudflare.com/code-mode-mcp/) to the tldraw desktop app and MCP app.

Maxime Rivest

  • Role: Core Contributor
  • Company: DSPy
  • Bio: Maxime builds tools and create content that make LLMs more accessible and powerful for everyone. He is a core contributor to DSPy and has built numerous open-source Python libraries to advance the ecosystem, including attachments, functai, ovllm, dspy-lm-auth, dspy-template-adapter, and mcp2py. Previously, he worked at Elsevier, building AI infrastructure and compound AI programs to cost effectively run on 100 million records weekly.
  • Twitter: https://x.com/MaximeRivest
  • LinkedIn: https://linkedin.com/in/maximerivest
  • Website: https://maximerivest.com
  • Blog: https://maximerivest.com
  • Photo: /wf26/speakers/by-id/spk_maxime_rivest.jpg
  • Sessions:

- The Unreasonable Effectiveness of Separating the Task from the Model — Day 4 — Session Day 3 9:40am-10:00am

By declaring your task’s inputs and outputs without initially considering model capability, you create the space needed to figure out the model execution later. DSPy’s entire promise is that you should evaluate and execute your AI engineering at a level higher than a specific prompt template or a particular provider’s API shape: the Signature. However, models have evolved significantly over the last few years. How can the same input and output specifications still work in a world now filled with tools, RLMs, and Skills? By defining your task strictly through its inputs and outputs, the underlying implementation becomes completely flexible. You can experiment with different models, settings, weights, templating strategies, and output formats, all without touching your actual AI workflow. Consequently, you can leverage components built by others and focus entirely on your core AI task. In this talk we will present how dspy 3.5 makes it easier much easier. DSPy has its roots in prompt optimization, where we build efficient ways to conduct search and learning beneath the signature. In this talk we will give a preview of DSPy 4.0 where we use the fact that models have now passed a tipping point for two critical concepts we have always needed. First, we no longer need to limit the search space to a single instruction block per LLM call; models can now reliably write the code underneath a signature themselves—so they should. Second, traditional prompt optimization has always required a scalar metric, which is notoriously one of the hardest parts to get right. What if a DSPy program could learn directly from your interactions with users? Ultimately, all you care about is that the function you call respects the inputs and outputs of your signature. You can let the models figure out the rest.

Maximilian-David Rumpf

  • Role: CEO
  • Company: SID.ai
  • Bio: Researcher at ETH Zürich, now CEO of SID.ai
  • Twitter: https://x.com/maxrumpf
  • LinkedIn: https://linkedin.com/in/maximiliandavid
  • Website: https://maxrumpf.com
  • Photo: /wf26/speakers/by-id/spk_maximilian_david_rumpf.jpg
  • Sessions:

- Where RL Will Take Search — Day 2 — Session Day 1 2:50pm-3:10pm

Search is having its Bitter Lesson moment. By turning search into an RL problem, we can finally scale search quality with compute! RL is extremely sample efficient when compared to classical search training objectives and we see no ceiling to how far we can scale this new paradigm. We cover the training of SID-1, the first RL-trained search model, and how search will look like post-RL.

Maximillian Piras

  • Role: Founding Designer
  • Company: Yutori
  • Bio: Currently focused on UIUX for agentic AI as the Founding Designer at Yutori. Previously Head of Design at Headliner and Sr. Designer at 8tracks, leading cross-platform UIUX for millions of users. Additionally, developed graphics/animations for clients including Giphy, MIT, & Ryuichi Sakamoto while also being a contributing writer to Smashing Magazine.
  • Twitter: https://x.com/MVXMXM
  • LinkedIn: https://www.linkedin.com/in/maximilliannyc/
  • Website: https://www.maximin.design
  • Blog: https://www.maximin.design
  • Photo: /wf26/speakers/by-id/spk_maximillian_piras.jpg
  • Sessions:

- Mousepower: agents that can’t be measured, can’t be managed. — Day 3 — Session Day 2 12:05pm-12:25pm

Agents have a measurement problem, which makes them impossible to efficiently manage. You’ve likely heard many say execution is now cheap, but judgement is the new bottleneck. This is because our evaluation frameworks weren’t designed for systems that tirelessly output in parallel. The canary in the coal mine is code generation becoming largely solved at the expense of breaking code review. As agents reverberate across all knowledge work, the same fracture will spread to artifacts, actions, & decisions. Yet without a scalable quality measure, we can’t ascend to a higher level of abstraction because we won’t trust the foundation below. So how do we design measurements that are efficient, intuitive, & trustworthy? Past paradigm shifts offer inspiration, such as James Watt not just building a better engine but also inventing horsepower to map it onto existing mental models. We need an equivalent quantification to communicate the “mousepower” of agents. Information theory gives us the starting point: concepts like entropy, ergodic processes, and Hamiltonian problems point us toward the most tractable trajectories — easier to verify than they are to solve.

Melanie Warrick

  • Role: Developer Relations Engineering
  • Company: Temporal Technologies
  • Bio: Melanie Warrick works in Developer Relations Engineering at Temporal Technologies. She has built AI solutions across data engineering, machine learning, developer relations, and health-tech roles, and now focuses on durable infrastructure and agentic systems.
  • Twitter: https://x.com/nyghtowlYT
  • Website: https://nyghtowl.com
  • Photo: /wf26/speakers/by-id/spk_melanie_warrick.jpg
  • Sessions:

- The Human Is an Async API — Day 4 — Session Day 3 2:25pm-2:45pm

Production agent systems need humans in the loop. So why do they keep getting modeled as synchronous tool calls? The agent ecosystem is focused on autonomy, but in reality, especially for high-stakes or regulated workflows, humans are a critical feature, not an afterthought. This demo-driven talk shows how to stop bolting on humans and start treating them as async-by-default endpoints with proper durability, retry, and escalation semantics. We will walk through two live, multi-agent patterns built with LangGraph and Google ADK, on Temporal for durable execution: The Agent Calls the Human. A fleet dispatch system escalates a disruption to an approver. We will intentionally kill the worker process mid-wait. Hours later, the human responds. State survives, and the agent resumes. The Human Calls the Agent. An operator interrupts a long-running task mid-flight to redirect it. The agent halts gracefully, surfaces state, accepts the override, and continues. Harness engineering has heavily focused on model autonomy. This talk is about the other half of the puzzle: the human. You will leave with two production-ready architectural designs you can apply this week: agent-initiated approval gates with timeout and escalation semantics, and human-initiated interrupts with graceful agent halt and resumption. Not every agent needs a human in the loop. But if you are building systems where the cost of being wrong exceeds the cost of being slow, this talk is for you.

Merve Noyan

  • Role: MLE
  • Company: Hugging Face
  • Bio: Works at Hugging Face open-source team, author of the book Vision Language Models with Hugging Face published by O'Reilly.
  • Twitter: https://x.com/mervenoyann
  • LinkedIn: https://www.linkedin.com/in/merve-noyan-28b1a113a
  • Website: https://hf.co/merve
  • Blog: https://hf.co/merve
  • Photo: /wf26/speakers/by-id/spk_merve_noyan.jpg
  • Sessions:

- Skill issue: stop deploying vision language models, use them with Skills to build e2e vision apps on edge — Day 2 — Session Day 1 11:40am-12:00pm

With the boom of vision language models barrier of entry to build vision apps are much lower so developers tend to use them right away. However, these models are very large and inefficient in production. In this talk, I will go through combining vision language models with Skills to build end-to-end vision apps from training to deployment using HF Skills, on top of showing the state-of-the-art in small computer vision/multimodal models.

- Compression at the Edge — Day 4 — Session Day 3 2:25pm-2:45pm

Compression at the Edge examines how smaller weights, faster inference, and constrained-memory deployments are making capable local AI more practical. The panel explores where compressed models already beat cloud on latency, privacy, cost, or control, what breakthroughs would unlock broader adoption, and how open model tooling is shaping the edge AI stack.

Moderator: Chris Alexiuk (NVIDIA). Panelists: Daniel Han (Unsloth), Asma Beevi (NVIDIA), Merve Noyan (Hugging Face), Michael Chiang (Ollama).

- Compression at the Edge — Day 4 — Session Day 3 2:50pm-3:10pm

Compression at the Edge examines how smaller weights, faster inference, and constrained-memory deployments are making capable local AI more practical. The panel explores where compressed models already beat cloud on latency, privacy, cost, or control, what breakthroughs would unlock broader adoption, and how open model tooling is shaping the edge AI stack.

Moderator: Chris Alexiuk (NVIDIA). Panelists: Daniel Han (Unsloth), Asma Beevi (NVIDIA), Merve Noyan (Hugging Face), Michael Chiang (Ollama).

Micah Hill-Smith

  • Role: CEO
  • Company: Artificial Analysis
  • Bio: Co-Founder and CEO at Artificial Analysis, the leading independent AI benchmarking company. Artificial Analysis publishes benchmarks and analysis across agents, models, inference providers and hardware. Artificial Analysis maintains widely referenced leaderboards and evaluation frameworks that are regularly cited by frontier AI organizations, including OpenAI, Anthropic, Google, NVIDIA and others.
  • Twitter: https://x.com/_micah_h
  • LinkedIn: https://www.linkedin.com/in/micahhill-smith/
  • Photo: /wf26/speakers/by-id/spk_micah_hill_smith.jpg
  • Sessions:

- Trends in AI — Day 3 — Session Day 2 4:50pm-5:10pm

Micah Silverman

  • Role: Director of Developer Relations
  • Company: Snyk
  • Bio: Director of Developer Relations at Snyk with extensive Java development experience; author and speaker on developer security and secure software practices.
  • LinkedIn: https://www.linkedin.com/in/micahsilverman
  • Photo: /wf26/speakers/by-id/spk_micah_silverman.jpg
  • Sessions:

- AI Security Engineer Foundations + Certificate — Day 1 — Workshop Day 2:20pm-4:20pm

In each of the two sessions, we cover 6 modules and participants receive a certificate of completion at the end. The modules are: OWASP Top 10 for LLM, Addressing Shadow AI, AI Threat Modeling, Securing Agents & MCP, Securing Vibe Coding, & AI Red Teaming

Michael Chiang

  • Role: Co-founder
  • Company: Ollama
  • Bio: Michael Chiang is co-founder of Ollama, which builds tools for running large language models locally.
  • LinkedIn: https://ca.linkedin.com/in/mchiang0610
  • Sessions:

- Compression at the Edge — Day 4 — Session Day 3 2:25pm-2:45pm

Compression at the Edge examines how smaller weights, faster inference, and constrained-memory deployments are making capable local AI more practical. The panel explores where compressed models already beat cloud on latency, privacy, cost, or control, what breakthroughs would unlock broader adoption, and how open model tooling is shaping the edge AI stack.

Moderator: Chris Alexiuk (NVIDIA). Panelists: Daniel Han (Unsloth), Asma Beevi (NVIDIA), Merve Noyan (Hugging Face), Michael Chiang (Ollama).

- Compression at the Edge — Day 4 — Session Day 3 2:50pm-3:10pm

Compression at the Edge examines how smaller weights, faster inference, and constrained-memory deployments are making capable local AI more practical. The panel explores where compressed models already beat cloud on latency, privacy, cost, or control, what breakthroughs would unlock broader adoption, and how open model tooling is shaping the edge AI stack.

Moderator: Chris Alexiuk (NVIDIA). Panelists: Daniel Han (Unsloth), Asma Beevi (NVIDIA), Merve Noyan (Hugging Face), Michael Chiang (Ollama).

Michael Forrester

  • Role: AI Workforce Transformation
  • Company: Accenture
  • Bio: Michael Forrester is a student, explorer, and educator working at the boundary between humanity and technology. Over 25+ years he's moved from CTO to individual contributor across operations, AI, machine learning, cloud infrastructure, and platform engineering, including time at AWS, ThoughtWorks, Red Hat, and Honeywell. Today he helps organizations adopt generative AI in ways that are sustainable, secure, and cost-effective, building on training programs that have reached over a million engineers across AWS, Kubernetes, and AI-driven operations. He speaks regularly at KubeCon and CNCF events and co-hosts podcasts on how AI is reshaping the engineering discipline. His work spans Claude Code and MCP integrations, AI safety frameworks for platform engineers, and courses ranging from AWS certifications to K8sGPT. His read on the 2020s: engineering is evolving, not disappearing. Systems thinking, design thinking, and architecture matter more than ever, even as the tools change. Tools don't transform organizations. People do.
  • Twitter: https://x.com/peopleforrester
  • Website: https://www.michaelrishiforrester.com
  • Blog: https://www.michaelrishiforrester.com
  • Photo: /wf26/speakers/by-id/spk_michael_forrester.jpg
  • Sessions:

- Build a Platform, Unleash an Agent on it.... and Watch it Burn! — Day 1 — Workshop Day 1:15pm-2:15pm

You get a Kubernetes cluster with an Internal Developer Platform already running: ArgoCD for GitOps, Kyverno for admission control, Falco for runtime detection, Prometheus for observability. Everything is instrumented. Everything is enforced. You also get an AI agent with cluster access. Your job is to get the agent to break something. Deploy a non-compliant workload. Escalate privileges. Modify infrastructure outside Git. Exfiltrate data through an agent response. Some of you will fail because the governance stack catches it. Some of you will succeed because it doesn't. Afterward we regroup and map what got blocked, what slipped through, and why. The 80% that existing CNCF tools already govern becomes obvious. The 20% gap where agent-specific tooling is missing becomes undeniable. You leave with a concrete governance map and the exact list of failure modes your own platform probably isn't covering yet.

Michael Grinich

  • Role: Founder & CEO
  • Company: WorkOS
  • Bio: Founder and CEO of WorkOS, a company focused on APIs and integrations that help applications become enterprise-ready.
  • Twitter: https://x.com/grinich
  • Photo: /wf26/speakers/by-id/spk_michael_grinich.jpg
  • Sessions:

- Auth for Agents: Unblock Autonomous AI with auth.md — Day 4 — Session Day 3 11:40am-12:00pm

AI agents are ready to act on users' behalf, but legacy auth flows were built for humans, not agents. This session introduces auth.md, an open protocol that lets agents register and authenticate users without sign-up forms, and shares what early implementers have learned since launch. Learn about the new protocol that Cloudflare, Firecrawl, Cogny, and monday.com are adopting to power agent registration — authenticating agents without sign-up forms.

Michael Liendo

  • Role: Staff Developer Advocate
  • Company: Auth0
  • Bio: Michael Liendo is a Staff Developer Advocate at Auth0 who focuses on simplifying complex development topics through written and video tutorials, especially around full-stack SaaS, authentication, and serverless development.
  • Photo: /wf26/speakers/by-id/spk_michael_liendo.jpg
  • Sessions:

- Trust, But Verify: Human-in-the-Loop for Agents That Actually Matter — Day 4 — Session Day 3 1:30pm-1:50pm

"In this talk we'll walk through the full spectrum of human-in-the-loop patterns, from lightweight inline confirmations to out-of-band permission gates to handing your agent a wallet with real money in it and more. Each pattern fits a different level of consequence, and knowing which to reach for is what separates demo agents from production ones. We'll cover the honest tradeoffs of latency, user experience, and trust so you can make the right call for your specific use case.

The entire talk is built around various live demos that escalate in stakes with every step. You'll leave with a mental model and working reference architecture you can apply the same day."

Michael Patterson

  • Company: Coder
  • Photo: /wf26/speakers/by-id/spk_michael_patterson.jpg
  • Sessions:

- The Lethal Trifecta Is Already on Your Developers' Laptops — Day 4 — Session Day 3 11:10am-11:30am

The lethal trifecta: an AI agent with access to private data, exposure to untrusted content, and the ability to communicate externally. Combine all three and an attacker can trick your agent into exfiltrating anything it can see and there is no prompt-level fix.. Most enterprises have already deployed this pattern at scale: Claude Code, Cursor, and Copilot on developer laptops with local credentials, MCPs reaching into internal systems, and open egress. I'll speak to my own personal agent stack as a textbook example, then trace the same shape across enterprise deployments I see at Coder. The back half is four architectural moves that defuse it: governed compute, centralized credentials, default-deny egress, identity-bound audit. Walk out with a mental model and a checklist you can run against your own deployment the next morning.

Michelle Nguyen

  • Role: Co-Founder
  • Company: Gimlet Labs
  • Bio: Michelle Nguyen is cofounder of Gimlet Labs, the first multi-silicon inference cloud to run agentic workloads across different types of hardware, where she leads engineering. She was the first engineer at Pixie Labs where she worked across the stack on projects ranging from Pixie's deployment mechanisms to its distributed query engine. Before Pixie, Michelle was at Trifacta helping build intuitive and interactive UIs. Michelle holds a MS and BS in EECS from UC Berkeley.
  • LinkedIn: https://www.linkedin.com/in/michelle-nguyen-82736762
  • Photo: /wf26/speakers/by-id/spk_michelle_nguyen.jpg
  • Sessions:

- All the Things We Have to Do to Satisfy Your Insatiable Need for Tokens — Day 4 — Session Day 3 11:40am-12:00pm

Every time the industry figures out how to serve tokens faster and cheaper, the appetite grows to match. Models get bigger, contexts get longer, agents start chaining thousands of calls together. The finish line keeps moving. This talk is a technical tour through everything the industry has done to keep up, led by two experts in high-performance inference. We'll start with the optimizations that made hardware work harder without changing the underlying architecture. Then we'll go up a level with techniques that work smarter across requests and across the model itself. And finally, a peek into the future with heterogeneous disaggregated inference, the architectural shift that splits prefill and decode across specialized hardware, and even more advanced forms of hardware specialization coming your way soon. Token demand is about to get a lot more insatiable. Let's see what the future has in store for us!

Midam Kim

  • Role: ML Engineer
  • Company: ServiceNow
  • Bio: Midam Kim is an ML Engineer at ServiceNow, where she builds and evaluates a multilingual voice AI platform spanning a dozen languages. She holds a PhD in Linguistics from Northwestern University and has backgrounds in linguistics, speech science, cognitive science, machine learning, and business. Her work sits at the rare intersection of production ML engineering and speech science—translating decades of linguistic research into the engineering decisions voice AI teams are making right now.
  • LinkedIn: https://www.linkedin.com/in/midamkim/
  • Photo: /wf26/speakers/by-id/spk_midam_kim.jpg
  • Sessions:

- "My name is... my name is...": A Linguistic Map for Building and Debugging Voice Agents — Day 2 — Session Day 1 3:20pm-3:40pm

Every voice AI engineer has heard it: a caller repeating their name three times, getting more frustrated with each attempt. The logs look clean. Confidence scores look fine. Linguistics can help solving the mystery. By the end of this talk, you'll have a diagnostic framework for the failures that slip past standard metrics, a way to turn "the agent just didn't get it" into concrete, debuggable failure modes. The framework maps three levels of linguistic structure (sounds, words, and interactions) against the two dimensions every voice agent engineer already works in: what we hear (speech recognition) and what we speak (speech synthesis). That 3×2 grid surfaces problems your current tooling can't see, including: 1. Why your user cannot make your system understand their name 2. Why a single well-intentioned vocabulary hint can cause catastrophic drops in a non-English language 3. Why a transcript that's "cumulatively correct" can still ruin the user experience Drawing on examples from production multilingual voice AI work, I'll show where linguistic expertise connects to the engineering decisions you're already making and where it reveals failure modes that confidence scores will never warn you about. Who this is for: Voice AI engineers, ML practitioners on Voice AI pipelines, and anyone who's watched clean logs while their agent quietly fails real users.

Miguel González Fernández

  • Role: Tech Lead
  • Company: Browserbase
  • Bio: Miguel González Fernández is a Tech Lead at Browserbase and co-author of the Microsoft Research/Browserbase Universal Verifier work for computer-use agents.
  • LinkedIn: https://www.linkedin.com/in/miguelgfz
  • Photo: /wf26/speakers/by-id/spk_miguel_gonz_lez_fern_ndez.jpg
  • Sessions:

- The Art of Building Verifiers for Computer Use Agents — Day 4 — Session Day 3 11:40am-12:00pm

Every team building browser agents has the same problem: you can't trust your own evals. Browser tasks are too open-ended for deterministic checks, so teams use LLM verifiers as judges, and the judges are wrong constantly. WebVoyager misses 45% of failures. WebJudge misses 22%. Used as RL reward, you're not training a better agent, you're training a more confident liar. This talk walks through the Universal Verifier, open-sourced with Microsoft Research: false positive rate near zero, Cohen's κ matching human-human agreement. Four design principles, one open benchmark, and an honest account of where auto-research worked and where it plateaued.

Mihnea Munteanu

  • Role: Senior Product Lead
  • Company: YouTube
  • Bio: Senior product leader who specializes in very large scale 0 to 1 consumer product experiences. Led the transformation of YouTube Search from traditional ranking-based systems to AI-native architecture. Drove the 0 to 1 launch for Ask Youtube (Launched @ Google IO '26). Previously at Webflow, Grammarly, McKinsey.
  • Twitter: https://x.com/ainteligentsia
  • LinkedIn: https://www.linkedin.com/in/mihneamunteanu/
  • Photo: /wf26/speakers/by-id/spk_mihnea_munteanu.jpg
  • Sessions:

- Ask YouTube — Open Q&A — Day 3 — Session Day 2 2:25pm-2:45pm

(updated) an off-the-record session with Mihnea Munteanu, Senior Product Lead, Ask YouTube / AI Search @ Google

Mike Chambers

  • Role: Senior Developer Advocate for Generative AI
  • Company: Amazon Web Services (AWS)
  • Bio: Mike Chambers is a Senior Developer Advocate for Generative AI at AWS. He creates practical agentic-AI and Amazon Bedrock educational material, including serverless agentic workflows and Generative AI with Large Language Models content.
  • Blog: https://blog.mikegchambers.com
  • Photo: /wf26/speakers/by-id/spk_mike_chambers.jpg
  • Sessions:

- Harness Engineering: Building the Production Cage for Powerful Domain Agents — Day 4 — Session Day 3 12:05pm-12:25pm

Every agent is a while loop. The model takes strings in and produces strings out. We've all written it, debugged it, shipped it. And yet every team building agents is still re-inventing the same session management, truncation logic, tool wiring, and memory plumbing from scratch. The hard part is the harness: session isolation, context management, memory persistence, sandboxed execution, observability. The machinery that makes a model dependable in production. Most of the failures we see in deployed agents (context rot, premature completion, tool bloat) trace back to harness problems, not model problems. This talk covers what a harness actually does, why "harness engineering" suddenly showed up in engineering posts from everyone, and what changes when you stop building harnesses by hand. In live demos, we'll build the same agent three ways: hand-rolled Python, framework-generated, and fully managed through a single API call. Each level shifts the failure modes from infrastructure plumbing to engineering judgment, where the real questions are what context to preserve, when to verify, and how to keep an agent from finishing half the job and calling it done. The harness handles the machinery. You still have to engineer the behavior.

Mike Krieger

  • Role: Head of Labs
  • Company: Anthropic
  • Bio: Mike Krieger leads Anthropic Labs and previously served as Anthropic's Chief Product Officer. He co-founded Instagram and was its CTO, later co-founding Artifact before joining Anthropic.
  • Twitter: https://x.com/mikeyk
  • LinkedIn: https://www.linkedin.com/in/mikekrieger
  • Blog: http://mikeyk.wordpress.com
  • Photo: /wf26/speakers/by-id/spk_mike_krieger.jpg
  • Sessions:

- How Anthropic Builds: Lessons from Labs — Day 4 — Session Day 3 10:00am-10:20am

Mike Phipps

  • Role: Lead AI Engineer
  • Company: Gates Foundation
  • Bio: Mike is the lead AI engineer in business operations at the Gates Foundation, where he built and deployed SIP (Strategic Intelligence Platform), the foundation's enterprise-wide knowledge graph. Built on Neo4j and served to Claude through MCP, SIP unifies siloed structured records and unstructured documents into one semantic layer that agents can query directly. Before AI engineering, he earned his PhD at CERN in experimental high-energy nuclear physics, working with some of the largest datasets in science. His work now centers on AI-first data modeling, driven by the conviction that for most enterprise teams the durable moat comes from properly modeling and expressing their data assets and domain knowledge — not the commoditizing layers above.
  • LinkedIn: https://www.linkedin.com/in/mike-phipps-79339a38
  • Photo: /wf26/speakers/by-id/spk_mike_phipps.jpg
  • Sessions:

- Your Moat Is Your Data Model — Day 4 — Session Day 3 11:40am-12:00pm

Every enterprise AI team faces the same strategic question: where in the stack should a small team focus its effort? Models, frontends, and agent frameworks evolve rapidly and are increasingly commoditized. But regardless of how these layers mature, AI in enterprise settings remains bottlenecked by the same underlying problem: structured data is siloed across systems of record with domain-specific schemas, and the unstructured data needed to contextualize it sits in entirely separate systems, with its own systematic complexities. The durable work is cleaning, curating, and semantically modeling this data in an AI-first manner so that any client — chat, workflow, or otherwise — can query across it. That's the moat. At the Gates Foundation, my team built and deployed our foundation-wide knowledge graph on Neo4j that unifies structured and unstructured data behind a single MCP server. The graph itself is modeled for agentic consumption: natural hierarchies are projected as traversable paths rather than flattened tables, and unstructured documents are semantically chunked, tagged, and mapped to structured entities at ingestion time using AI-driven ETL. The result is a semantic layer where an agent can express a complex cross-system question as a concise graph query and receive an accurate answer. This talk is an architectural walkthrough covering the end-to-end pipeline: AI-based extraction and semantic chunking of unstructured documents, the agent-first data modeling decisions, design considerations for our MCP server, and how we handle graph-based retrieval evals. We'll walk through real query sessions showing Claude interacting with the graph through both chat and workflow integrations. The intended takeaway is a practical framework for where a small enterprise team's investment compounds — and why that investment is the data model, not the layers above it.

Mingsheng Hong

  • Role: VP of AI at Ironclad
  • Company: Ironclad
  • Bio: Mingsheng Hong is a tech entrepreneur and executive specializing in AI and data infrastructure and products, with a Ph.D. in Computer Science from Cornell. He is the VP of AI at Ironclad, where he focuses on building AI-native products and features for legal contracting. Previously he worked in senior engineering leadership roles at Google and Microsoft. He also co-founded Bluesky Data, pioneering AI-driven workload optimization for modern data platforms and exited it through acquisition by Microsoft.
  • LinkedIn: https://www.linkedin.com/in/mingshenghong/
  • Sessions:

- From Tokenmaxxing to Trusted Throughput — Day 3 — Session Day 2 2:25pm-2:45pm

AI adoption is accelerating, but for many engineering organizations, token consumption is now significant enough to demand real economic discipline. Drawing on Ironclad’s experience scaling AI across engineering, Mingsheng Hong will introduce the concept of trusted throughput: the rate at which teams convert AI usage into reviewed, validated, maintainable, and safely deployed customer value. He will share a practical framework for measuring AI cost and return, identifying bottlenecks in code review, CI, and merge workflows, and improving ROI through better guardrails, engineering practices, build-versus-buy decisions, and token optimization. Attendees will leave with a clearer way to evaluate AI efficiency—not by minimizing usage or rewarding tokenmaxxing, but by maximizing trusted customer value per dollar of AI spend and unit of human attention.

Morgan Willis

  • Role: Principal Cloud Technologist
  • Company: Amazon Web Services (AWS)
  • Bio: Morgan Willis is a Principal Cloud Technologist at AWS who creates technical courses, reference architectures, tutorials, live streams, and open-source examples for AI agents, context engineering, orchestration, multi-agent systems, evaluation, guardrails, and secure scalable AI applications.
  • LinkedIn: https://www.linkedin.com/in/morganwilliscloud
  • Photo: /wf26/speakers/by-id/spk_morgan_willis.jpg
  • Sessions:

- The Infinite Context Window Is a Myth: Context Engineering for AI Agents — Day 3 — Session Day 2 3:20pm-3:40pm

Large context windows have become a popular answer to the growing complexity of AI agents. When agents lose track of details, forget prior decisions, or degrade in reasoning quality, the instinct is often to add more tokens. In practice, this rarely fixes the problem and often makes it worse. Bigger context windows increase cost and latency, introduce noise, and amplify failure modes like lost-in-the-middle effects, context collapse, and brittle summarization. This talk argues that the real challenge is not context size, but context engineering. In this session, we will explore practical context engineering techniques for building AI agents that reason reliably over time without relying on ever-larger context windows. Starting from a stateless agent, we will walk through progressively more advanced strategies, including short-term and long-term memory, conversation curation policies, retrieval-augmented generation, and tool-driven context injection. We will examine common failure modes such as context pollution from tool outputs, brevity bias during summarization, and reasoning degradation as conversations grow, and show concrete ways to mitigate them. The talk is grounded in real agent implementations using the Strands Agents SDK and Amazon Bedrock AgentCore, but the principles apply broadly to any agent framework. This session is intended for engineers building AI agents beyond simple chatbots who want practical techniques they can apply immediately.

Moritz Johner

  • Role: Staff Engineer
  • Company: Form3
  • Bio: Staff Engineer at Form3, focused on Kubernetes, security, and platform engineering. One of the creators and maintainers of external-secrets.
  • LinkedIn: https://www.linkedin.com/in/moritz-johner/
  • Website: https://www.form3.tech/
  • Photo: /wf26/speakers/by-id/spk_moritz_johner.jpg
  • Sessions:

- We Gave an Agent Production Code Access and Then Tried to Sleep at Night — Day 2 — Session Day 1 11:40am-12:00pm

We let an agent touch production code to fix CVEs. That is either automation or a supply chain incident, depending on how honest your architecture is. PatchPilot started simple: find vulnerable dependencies, patch them, open a PR, let CI prove the fix, move on. Then reality showed up. The agent needed repository access, CI logs, credentials, and a Docker socket. Without that, it was useless. With it, every security reviewer in the room had a point. This is the production case study: what we gave the agent, what we refused, what infosec pushed back on, and where they were right. We will cover scoped permissions, constrained PRs, audit trails, approval gates, CI evidence, credential boundaries, and the gap between "it generated a patch" and "we can defend this change." Agentic remediation is not just developer productivity. It is a new participant in your software supply chain.

Nachiket Paranjape

  • Role: Software Engineer
  • Company: DoorDash
  • Bio: Software Engineer at DoorDash's AI Platform Team. Currently leading the AI Evals initiative. Previously Engineering Lead at Galileo AI (acquired by Cisco).
  • Twitter: https://x.com/nmparanjape
  • LinkedIn: https://www.linkedin.com/in/nachiketparanjape/
  • Photo: /wf26/speakers/by-id/spk_nachiket_paranjape.jpg
  • Sessions:

- AI Evals Platform for Cross-Functional Teams at Scale — Day 2 — Session Day 1 1:55pm-2:15pm

DoorDash's Evals Platform is designed for more than just engineers. It brings human review, automated judges, and online experimentation into a single calibration loop so engineering, product managers, and strategy and operations teams can all contribute to improving AI quality. Engineers can instrument, trace, and evaluate agent behavior, while cross-functional teams can review outputs, curate trusted examples, and provide structured feedback that improves how automated judges behave over time. By combining experimentation, fully customized annotation workflows, calibration, and analytics in one system, the platform turns AI quality from a fragmented technical exercise into a shared operating model for continuously improving agent performance and making rollout decisions with confidence. While vendor platforms offer pieces of this workflow, we needed something broader: a unified system that lets engineers, product managers, and Strategy & Ops all participate directly in improving AI quality. Our goal is not just to run evals, but to enable cross-functional teams to review outputs, calibrate judges, run experiments, and make rollout decisions without being blocked on engineering. That requirement, along with tighter integration into our internal workflows and operating model, is why we are building this platform in-house.

Nader Khalil

  • Role: Director of Developer Technology
  • Company: NVIDIA
  • Bio: Director of Developer Tech at NVIDIA leading Open Source, Agent Marketing, Developer Experience. CEO & co-founder of Brev.dev, which was acquired by NVIDIA in July 2024.
  • Twitter: https://x.com/naderlikeladder
  • LinkedIn: https://linkedin.com/in/naderlikeladder
  • Website: https://nader.coffee
  • Photo: /wf26/speakers/by-id/spk_nader_khalil.jpg
  • Sessions:

- State of the Union: Why Local, Why Now — Day 4 — Session Day 3 10:45am-11:05am

Local AI has crossed from interesting to useful, driven by stronger open models, better hardware, and a maturing ecosystem for running intelligence outside the cloud. This panel explores what that shift unlocks for sovereignty, defense, regulated industries, privacy, cost, and resilience, and why open-source AI may be central to who benefits from the next wave of intelligence.

Moderator: Nader Khalil (NVIDIA). Panelists: Joseph Nelson (Roboflow), Alex Cheema (Exo Labs), Ahmad Osman (r/LocalLLaMA).

- State of the Union: Why Local, Why Now — Day 4 — Session Day 3 11:10am-11:30am

Local AI has crossed from interesting to useful, driven by stronger open models, better hardware, and a maturing ecosystem for running intelligence outside the cloud. This panel explores what that shift unlocks for sovereignty, defense, regulated industries, privacy, cost, and resilience, and why open-source AI may be central to who benefits from the next wave of intelligence.

Moderator: Nader Khalil (NVIDIA). Panelists: Joseph Nelson (Roboflow), Alex Cheema (Exo Labs), Ahmad Osman (r/LocalLLaMA).

- Model Routing — Day 4 — Session Day 3 3:20pm-3:40pm

Model Routing explores how teams decide when to use local models, open-source models, or frontier cloud systems, and why the answer is increasingly hybrid rather than one-size-fits-all. The panel digs into routing architectures, model selection strategies, stack decisions, and what still needs to improve in local AI before more workloads can move closer to the user.

Moderator: Nader Khalil (NVIDIA). Panelists: Walden Yan (Cognition), Tanay Varshney (NVIDIA), Alex Atallah (OpenRouter).

- Model Routing — Day 4 — Session Day 3 3:45pm-4:05pm

Model Routing explores how teams decide when to use local models, open-source models, or frontier cloud systems, and why the answer is increasingly hybrid rather than one-size-fits-all. The panel digs into routing architectures, model selection strategies, stack decisions, and what still needs to improve in local AI before more workloads can move closer to the user.

Moderator: Nader Khalil (NVIDIA). Panelists: Walden Yan (Cognition), Tanay Varshney (NVIDIA), Alex Atallah (OpenRouter).

Naman Ahuja

  • Role: Senior Software Engineer
  • Company: Meta
  • Bio: Software Engineer at Meta working at the intersection of distributed systems, AI infrastructure, and hardware enablement. He focuses on adopting new hardware platforms across Meta’s datacenter fleet to support AGI-scale workloads and production AI systems.

His work spans capacity management, autoscaling, reliability engineering, and datacenter-scale resource optimization. He has led initiatives to integrate cutting-edge hardware, improve compute utilization, and build reliable platforms for large-scale AI workloads and agentic workflows.

  • LinkedIn: https://www.linkedin.com/in/namanahuja/
  • Website: https://buzzingtech.ai/
  • Blog: https://buzzingtech.ai/
  • Photo: /wf26/speakers/by-id/spk_naman_ahuja.jpg
  • Sessions:

- Operating Distributed Inference Systems at Scale — Day 4 — Session Day 3 10:45am-11:05am

Inference has rapidly become one of the most important infrastructure problems in modern computing. As AI systems evolve into autonomous agents with persistent memory, tool usage, and multi-step reasoning, traditional inference architectures struggle under growing demands for latency, throughput, cost efficiency, and reliability. In this talk, I’ll share lessons from building large-scale elastic compute and AI infrastructure systems powering production workloads. We’ll explore the modern inference stack and the architectural patterns emerging to support next-generation agentic AI systems. Topics include distributed inference architectures for large-scale AI systems, GPU scheduling and elastic compute for inference workloads, multi-tenant inference infrastructure, caching, batching, latency optimization strategies, reliability and fault isolation for inference systems, observability and control loops for AI serving platforms, balancing cost, throughput, and user experience, and why inference is becoming an infrastructure orchestration problem. Attendees will gain practical insights into designing scalable, resilient, and cost-efficient inference platforms for modern AI workloads.

Natalie Meurer

  • Role: Head of Agent Engineering
  • Company: Sierra
  • Bio: Head of Agent Engineering at Sierra, leading teams that design, build, and deploy AI agents for enterprise customer experiences.
  • Twitter: https://x.com/natalie_meurer
  • LinkedIn: https://www.linkedin.com/in/nataliemeurer
  • Photo: /wf26/speakers/by-id/spk_natalie_meurer.jpg
  • Sessions:

- The Dirty Secret of Forward Deployed Engineering — Day 2 — Session Day 1 1:30pm-1:50pm

Since its origins at Palantir, the term "Forward Deployed Engineer" has described wildly different jobs, yet today it's one of the fastest-growing roles in AI. What happened? And what does that reveal about the future of engineering?

Join Nat Meurer, Head of Agent Engineering at Sierra, for a historical tour of one of tech's most misunderstood roles, and why its biggest contradiction may explain where the industry is headed next.

Navinkumar Patil

  • Role: Staff Software Engineer
  • Company: PayPal
  • Bio: Engineer focused on large-scale and distributed systems.
  • Photo: /wf26/speakers/by-id/spk_navinkumar_patil.jpg
  • Sessions:

- Burn your flags: How PayPal designs interactive CLI tools for agents — Day 1 — Workshop Day 2:20pm-4:20pm

The common guidance for designing complex CLI tooling that agents can use is to add a 'non-interactive' mode, where a normally interactive & flow-based command can be executed in a single pass by feeding it a bunch of flags. This is necessary for deterministic automation, but agents aren't scripts; they aren't really constrained in the same way, and they benefit greatly from the same step-by-step contextual workflows that humans do. In this workshop, PayPal goes deep on techniques we've used in our upcoming paypal CLI that you can steal to make your complex CLI workflow tool agent-usable — without giving up the guardrails and guidance that interactive CLI tools provide.

Neil Zeghidour

  • Role: Co-founder & CEO
  • Company: Gradium
  • Bio: Neil Zeghidour is the co-founder and CEO of Gradium. Neil founded Gradium after a decade of building and leading frontier generative audio teams at Meta and Google DeepMind. Being frustrated by slow and brittle voice assistants , he built the engineering teams that developed the first neural audio codecs and introduced the first audio LLMs, such as AudioLM, at Google. He later created Kyutai to launch Moshi, the world's first real-time, full-duplex conversational AI , and Hibiki, the first simultaneous speech-to-speech translation system. Today, Gradium is focused on helping developers build natural, real-time voice agents by providing ultra-low latency streaming APIs that transition these breakthroughs from the research lab to production.
  • Twitter: https://x.com/neilzegh
  • Website: https://gradium.ai
  • Photo: /wf26/speakers/by-id/spk_neil_zeghidour.jpg
  • Sessions:

- Your Voice Agent is Just a Walkie-Talkie — Day 2 — Session Day 1 12:05pm-12:25pm

Everyone says cascaded voice pipelines are dead and native speech models are the future. Yet production environments are still dominated by STT-LLM-TTS stacks. Reconciling the natural flow of native audio with the elite reasoning of a cascaded agent remains an unsolved systems problem. This talk dissects the brutal technical trade-offs behind that counterintuitive reality. We will break down why your voice agent is still stuck behaving like a walkie-talkie and map out the specific technical roadmap required to build full-duplex AI that actually works.

- Everybody Gets a Digital Clone! (Part 1 of 3) — Day 2 — Session Day 1 1:30pm-1:50pm

Walk out of this workshop with a deployed digital clone that makes your phone calls for you. We will skip the theory and immediately get our hands dirty wiring together OpenClaw, Twilio, and Gradium to build an autonomous voice agent on a live cellular network. You will tackle the hardest parts of real-time telephony: routing audio streams, handling human interruption, and killing latency. In 60 minutes, your AI will be ready to call restaurants for the daily special, book appointments, and actively negotiate on your behalf.

- Everybody Gets a Digital Clone! (Part 2 of 3) — Day 2 — Session Day 1 1:55pm-2:15pm

Walk out of this workshop with a deployed digital clone that makes your phone calls for you. We will skip the theory and immediately get our hands dirty wiring together OpenClaw, Twilio, and Gradium to build an autonomous voice agent on a live cellular network. You will tackle the hardest parts of real-time telephony: routing audio streams, handling human interruption, and killing latency. In 60 minutes, your AI will be ready to call restaurants for the daily special, book appointments, and actively negotiate on your behalf.

- Everybody Gets a Digital Clone! (Part 3 of 3) — Day 2 — Session Day 1 2:25pm-2:45pm

Walk out of this workshop with a deployed digital clone that makes your phone calls for you. We will skip the theory and immediately get our hands dirty wiring together OpenClaw, Twilio, and Gradium to build an autonomous voice agent on a live cellular network. You will tackle the hardest parts of real-time telephony: routing audio streams, handling human interruption, and killing latency. In 60 minutes, your AI will be ready to call restaurants for the daily special, book appointments, and actively negotiate on your behalf.

- Voice is the universal interface — Day 4 — Session Day 3 11:40am-12:00pm

Language models give us the ability to create natural language, conversational, interfaces for computers. We are seeing a rapid shift among early adopters to using general language instead of traditional user interfaces for tasks like writing code and editing spreadsheets. Join the cofounders of Pipecat, Gradium, and Daily as we discuss the future of realtime voice and AI interfaces. Voice is the most efficient input mode for natural-language systems, and often the most efficient output mode, as well. But good voice interfaces require a very high degree of conversational facility, intelligence, task-specific reliability, and robustness to real-world realities like multiple speakers and background noise. There's a long history of voice interfaces in science fiction: Star Trek, Iron Man, Her. We'll use these depictions of computing possibilities as a jumping off point for talking about the ideal voice interface. How close are we to being able to build these interfaces with today's models, hardware, orchestration tooling, and UI libraries? What are the most promising research directions? What did the movies get wrong, now that we actually have experience building natural language, open-ended, voice systems?

Nicholas Arcolano

  • Role: Head of AI & Research
  • Company: Jellyfish
  • Bio: Head of AI & Research at Jellyfish, building AI agents and data platforms that help software engineering organizations measure and navigate AI transformation at scale. Leveraging massive real-world data sets to study what's working (and what's not) about AI tool and agent use across the industry, and sharing these learnings through published research and benchmarks so engineering leaders can make confident, evidence-based decisions. Harvard Ph.D., previously at TrueMotion (CMT), Runkeeper, MIT Lincoln Laboratory.
  • Twitter: https://x.com/arcolano
  • LinkedIn: https://www.linkedin.com/in/arcolano/
  • Website: https://arcolano.com/
  • Blog: https://arcolano.com/
  • Photo: /wf26/speakers/by-id/spk_nicholas_arcolano.jpg
  • Sessions:

- Tokenmaxxing is the New "Lines of Code" — Day 3 — Session Day 2 1:30pm-1:50pm

Somebody in your company is going to ask what you're getting for all that AI spend. If you don't have a good answer, someone else will make one up... and it might be "total tokens consumed". That's how tokenmaxxing becomes policy: not because anyone thinks it's a good metric, but because engineering didn't offer a better story. I work with datasets spanning hundreds of companies, hundreds of thousands of engineers, and billions of lines of shipped code to understand how AI engineering is evolving and what actually matters to measure. One thing I've learned is that raw token spend is a VERY crude estimator of value. For example, across levels of token spend, cost per merged pull request varies 300x — but output only varies 2x. The good news is the data also shows what DOES matter, and it's measurable and actionable – but most teams aren't tracking it yet. This talk will give you the data, metrics, and frameworks you need to keep your org from adopting the latest terrible vanity metric. You'll learn what actually separates teams that scale AI effectively from those just burning tokens, and how to tell the story that keeps your AI investment funded and growing.

Nick Heiner

  • Role: VP of RL Environments
  • Company: Surge AI
  • Bio: Nick Heiner is the Head of RL Environments at Surge AI, the post-training company founded and bootstrapped to $1B in revenue by CEO Edwin Chen, where he works directly with top labs to help shape frontier models.
  • Twitter: https://x.com/nickheiner
  • LinkedIn: https://www.linkedin.com/in/nick-heiner-3874055a/
  • Website: https://www.nickheiner.com/
  • Photo: /wf26/speakers/by-id/spk_nick_heiner.jpg
  • Sessions:

- When Will The Benchmaxxing Plague End? — Day 2 — Session Day 1 2:50pm-3:10pm

Model releases are heralded by a flourish of trumpets, a chorus of weeping angels, and often, inflated benchmark claims. Why do benchmarks so often not reflect real-world value? Is it intrinsic to the science of benchmarking, or just the consequence of our current practices? Is LM Arena a cancer on AI?

Nick Nisi

  • Role: Developer Experience Engineer
  • Company: WorkOS
  • Bio: Developer Experience Engineer at WorkOS focused on developer tooling and TypeScript; previously co-hosted JS Party and organizes NebraskaJS and TypeScript community events.
  • Twitter: https://x.com/nicknisi
  • LinkedIn: https://www.linkedin.com/in/nicknisi
  • Website: https://nicknisi.com
  • Photo: /wf26/speakers/by-id/spk_nick_nisi.jpg
  • Sessions:

- Lifestyles of the AI-Native: Voice-coding, agent skills, hooks and scheduled tasks — Day 1 — Workshop Day 4:30pm-5:30pm

Most engineers are bolting AI onto a workflow that was designed for a pre-AI world. The result is a faster version of the same grind. This talk is about the other path: rebuilding the daily practice of software engineering from the ground up, around what agents are actually good at.

Two senior practitioners from WorkOS will walk through how we actually work now as AI-native engineers — not in the aspirational sense, but the literal one. We think out loud and voice-code instead of typing our way to clarity. We package recurring expertise into agent skills so we're not re-explaining context every session. We wire up hooks that fire on the events we care about, and hand off scheduled tasks to agents that run overnight, while we're away from the keyboard, or otherwise off the clock. The throughline is intentional design: deciding what a human should hold onto and what should be delegated, then building the machinery to make that real.

Because there are two of us, you'll see more than one set of habits — where our setups converge on the same patterns, and where they diverge based on how each of us thinks and works. The pitch isn't "do more." It's that an AI-native setup, designed deliberately, buys back attention and protects you from the burnout that comes from treating agents as a turbocharger for an old loop. Attendees will leave with a concrete mental model for voice-driven development, a pattern for authoring reusable agent skills, and working examples of hooks and scheduled automations they can adapt the same week.

Nicolai Ouporov

  • Role: CEO
  • Company: Fleet
  • Bio: Nicolai Ouporov is founder and CEO of Fleet, an applied AI and product lab building simulations and real-world challenges for testing and training agents. He previously was a founding engineer and first hire at Respell and has Stanford robotics research experience.
  • Sessions:

- Building Worlds for Models — Day 2 — Session Day 1 3:20pm-3:40pm

Hold for Fleet AI. Company focuses on simulated environments / training gyms for AI agents and fits the posttraining / RL environments theme.

Nidhi Kaushik Vyas

  • Role: Product
  • Company: Google DeepMind
  • Bio: AI product leader who turns frontier research into products people and companies actually use. Background in multimodal generative systems, I've helped launch high-impact products using multimodal Gemini. Previously, machine learning researcher at Apple, and have contributed to research breakthroughs, as well as scalable, trusted, user-centered experiences.
  • LinkedIn: https://www.linkedin.com/in/nidhivyas/
  • Photo: /wf26/speakers/by-id/spk_nidhi_vyas.jpg
  • Sessions:

- Designing Multimodal Collaborative Agents for Next-Gen Commerce — Day 4 — Session Day 3 10:45am-11:05am

Today's commerce agents wait to be told what to look for. But most users live by a different rule: "I don't know what I want — I'll know it when I see it". If agentic commerce is ever going to cross the chasm, these systems need to stop waiting and start co-shopping. The future of commerce belongs to agentic collaborators that offer a white-glove, personal shopper experience - entirely absorbing the cognitive burden of product discovery, deep research, and validation. Rather than requiring shoppers to input exact search terms or define clear objectives, modern shopping systems will seamlessly guide them from a rough idea to the ideal product. By leveraging multimodal capabilities, these assistants can interpret abstract aesthetic "vibes" to understand user preferences, generate visual references to clarify questions, and enable a highly immersive try-before-you-buy experience to validate products, keeping the user aligned and visually grounded throughout the process. This talk will explore how advanced systems like Gemini work alongside users to clarify their preferences during the discovery process, co-navigate fluidly generated product categories, leverage individual context to filter choices, and produce interactive side-by-side comparisons tailored to the buyer's key priorities. The session will also cover robust auto-rater frameworks and how to design evals for high-agency execution. Attendees building conversational agents, managing complex product data graphs, or creating next-generation multimodal agentic interfaces will gain practical frameworks and insights to deliver highly personalized experiences at scale.

Niels Rogge

  • Role: Machine Learning Engineer
  • Company: Hugging Face
  • Bio: Niels works as a Machine Learning Engineer at Hugging Face as part of the Community Science team. Together with @_akhaliq (also known as AK), which you might know from posting trending research papers, he ensures researchers improve the discoverability and visibility of their artifacts by making them available on the hub with proper links to the paper, write model and dataset cards and more.
  • Twitter: https://x.com/NielsRogge
  • LinkedIn: https://www.linkedin.com/in/niels-rogge-a3b7a3127/
  • Website: https://nielsrogge.github.io/
  • Blog: https://nielsrogge.github.io/
  • Photo: /wf26/speakers/by-id/spk_niels_rogge.jpg
  • Sessions:

- How I automate my own job at Hugging Face using agents — Day 2 — Session Day 1 2:50pm-3:10pm

This talk will showcase how I automated a large part of my own job at Hugging Face. This involves both open (GLM-5.1) and closed-source models (Claude, Gemini), the Claude Agents SDK, serverless infra like Modal and Hugging Face Jobs. I will also discuss how I use agentic coding tools like Cursor and Codex to implement AI agents which automate my job, and how everything is connected to the internal Slack of Hugging Face.

Nikita Kothari

  • Role: Senior Member of Technical Staff
  • Company: Salesforce
  • Bio: Nikita Kothari is a Senior Member of Technical Staff at Salesforce, where she builds AI-driven enterprise solutions that integrate Large Language Models (LLMs), agentic AI, and intelligent automation to enhance personalization, access, and system trust. With over a decade of experience spanning Salesforce, LinkedIn, and Amazon, she has led transformative projects in AI messaging, recommendation systems, and scalable data synchronization frameworks. Nikita holds a Master of Science in Computer Science from The University of Texas at Dallas and is passionate about advancing responsible and practical AI applications for modern enterprise ecosystems.
  • LinkedIn: https://www.linkedin.com/in/nikita-kothari3
  • Website: https://www.salesforce.com/
  • Photo: /wf26/speakers/by-id/spk_nikita_kothari.jpg
  • Sessions:

- MCPs, CLIs, and Skills: Choosing the Right Tooling Layer for Agentic Development — Day 4 — Session Day 3 11:10am-11:30am

Agentic development needs more than one interface: MCPs provide clean, portable connectors to services, with built-in patterns for security and auth. CLIs offer composability, debuggability, and workflows developers already trust. Skills teach agents how to use a wide variety of tools and MCPs effectively without overloading context.

Nilofer Rajpurkar

  • Role: Product Lead, Agent and Developer Experience
  • Company: Stripe
  • Bio: Nilofer Rajpurkar is a Product Lead at Stripe, where she leads teams building tools and infrastructure for developers and AI agents, including the Stripe MCP Server, Stripe CLI, Docs, and Sandboxes. Her work sits at the intersection of developer tooling, AI, and the emerging agent economy — she is particularly interested in how agents are changing the way software is built and how humans and agents transact online.

Throughout her career, Nilofer has focused on platforms and tooling that enable developers to create and scale their products. Prior to Stripe, she worked at GitHub on GitHub Actions and GitHub Packages, and at Microsoft on mobile developer tools.

Beyond technology, Nilofer has served as an NYC[x] Innovation Fellow with U.S. Digital Response, volunteered as a computer science teacher with TEALS, and served on the Seattle Arts Commission and the Coca-Cola Scholars Foundation Alumni Board."

  • Twitter: https://x.com/nilli_minaj
  • LinkedIn: https://www.linkedin.com/in/nilofer-rajpurkar-3a80a168
  • Photo: /wf26/speakers/by-id/spk_nilofer_rajpurkar.jpg
  • Sessions:

- Inside the AI economy: What Stripe’s data reveals — Day 2 — Session Day 1 10:45am-11:05am

Stripe powers 78% of the Forbes AI 50, giving Stripe index-level visibility into the AI economy. AI companies are growing faster, selling globally by default, and monetizing earlier. See the data behind the growth: how AI has collapsed the cost of launching, how the fastest-growing companies are adapting their pricing, and the role agents are starting to play in commerce.

Nishant Gupta

  • Role: Software Engineer, Tech Lead
  • Company: Meta
  • Bio: I am a Staff Software Engineer and Researcher at Meta, specializing in large-scale distributed systems, AI infrastructure, and operational resilience. Within Meta Superintelligence Labs, I build agentic infrastructure that enables AI systems to operate reliably in production through evaluation, auditing, safety controls, feedback loops, and human oversight.

I previously led the development of Meta’s next-generation elastic compute infrastructure, managing roughly 30% of fleet capacity across tens of millions of servers in 20+ geo-distributed datacenters, delivering billions of dollars in infrastructure savings while shaping multi-year strategy with executive leadership.

My research focuses on resource optimization, reliability, and safe AI deployment at scale. I designed and deployed Dynamic Idle Resource Leasing, a production system that safely oversubscribes datacenter capacity while preserving strict reliability guarantees. I have authored research papers with 90+ citations.

I am passionate about building scalable, fault-tolerant systems and translating cutting-edge research into real-world infrastructure that delivers measurable impact.

  • LinkedIn: https://www.linkedin.com/in/nishantgupta-ai/
  • Blog: https://buzzingtech.ai/
  • Photo: /wf26/speakers/by-id/spk_nishant_gupta.jpg
  • Sessions:

- Operating Distributed Inference Systems at Scale — Day 4 — Session Day 3 10:45am-11:05am

Inference has rapidly become one of the most important infrastructure problems in modern computing. As AI systems evolve into autonomous agents with persistent memory, tool usage, and multi-step reasoning, traditional inference architectures struggle under growing demands for latency, throughput, cost efficiency, and reliability. In this talk, I’ll share lessons from building large-scale elastic compute and AI infrastructure systems powering production workloads. We’ll explore the modern inference stack and the architectural patterns emerging to support next-generation agentic AI systems. Topics include distributed inference architectures for large-scale AI systems, GPU scheduling and elastic compute for inference workloads, multi-tenant inference infrastructure, caching, batching, latency optimization strategies, reliability and fault isolation for inference systems, observability and control loops for AI serving platforms, balancing cost, throughput, and user experience, and why inference is becoming an infrastructure orchestration problem. Attendees will gain practical insights into designing scalable, resilient, and cost-efficient inference platforms for modern AI workloads.

Niv Granot

  • Role: Tech Group Lead
  • Company: AI21 Labs
  • Bio: Niv Granot is a Tech Group Lead at AI21 Labs working on AI systems engineering, retrieval, web search, and knowledge-intensive tools. He has discussed RAG evaluation and query-dependent chunking in AI21 technical content.
  • Photo: /wf26/speakers/by-id/spk_niv_granot.jpg
  • Sessions:

- Stop Chunking Like It's 2022 — Day 2 — Session Day 1 3:20pm-3:40pm

Every RAG system bets everything on a single chunk size. 500 tokens? 800? Pick wrong, and half your queries fail before they start. But here's what nobody tells you: all the picks are wrong; there is no single chunk size that works for all queries. We ran oracle experiments across meeting transcripts, story chapters, and TV scripts. The result? Queries disagree violently on what chunk size works best - sometimes by 40 percentage points. Your "tuned" chunk size isn't a compromise; it's systematic underperformance. In this talk, we'll expose why fixed chunking fails and show you a dead-simple fix: index at multiple chunk sizes, aggregate at retrieval time using Reciprocal Rank Fusion. No retraining. No LLM overhead. Just 1-37% better recall across benchmarks by letting queries vote with their ranks instead of forcing them into one-size-fits-all boxes. Walk away knowing exactly when your chunk size is sabotaging you - and how to stop leaving 20-40% of your retrieval performance on the table.

Nixon Dinh

  • Company: PayPal
  • LinkedIn: https://www.linkedin.com/in/nixon-dinh
  • Photo: /wf26/speakers/by-id/spk_nixon_dinh.jpg
  • Sessions:

- The Death of Keyword Search and the Rise of Agent-Readable Catalogs — Day 3 — Session Day 2 11:10am-11:30am

As search shifts from classic keyword matching to more conversational experiences, product data quality becomes critical to LLM-powered retrieval. At PayPal, we tested how enriching traditional catalog data could help AI systems better find, understand, and rank products across large-scale commerce catalogs. We built a RAG-based AI judge to compare enrichment approaches and identify five patterns that consistently improved AI discovery results.In this talk, we'll share the evaluation framework, key lessons, and a practical approach for preparing enterprise data for conversational and agentic search.

Nnenna Ndukwe

  • Role: Principal Developer Advocate and Software Engineer
  • Company: Qodo AI
  • Bio: Principal Developer Advocate and Software Engineer at Qodo AI, with experience across startups, media tech, cybersecurity, and AI. She is an international speaker and community builder focused on making AI practical and accessible for engineers.
  • Twitter: https://x.com/nnennahacks
  • Photo: /wf26/speakers/by-id/spk_nnenna_ndukwe.jpg
  • Sessions:

- How to Build Quality Gates into Agentic Coding Workflows — Day 1 — Workshop Day 11:05am-12:05pm

AI coding agents can now generate code at unprecedented speed. But faster code generation creates a new engineering problem: how do we know when agent-written code is actually safe, maintainable, and ready to merge? In this hands-on workshop, attendees will build an agentic coding workflow with enforceable code quality gates across planning, implementation, testing, and code review. By the end of the session, participants will have a working reference pattern for agentic software delivery: an AI-assisted workflow that can inspect a repo, implement a change, run tests, evaluate risk, respond to feedback, and surface what still requires human judgment. This is a technical enablement session for engineers building with AI coding agents, platform teams designing agentic SDLC workflows, and AI engineering leaders thinking about how to scale software quality with AI.

Nyah Macklin

  • Role: Sr. Developer Advocate, Artificial intelligence
  • Company: Neo4j
  • Bio: Nyah Macklin is a seasoned researcher and speaker on topics around AI, ML, Ethics, Governance, and Responsibility. Nyah serves as a Senior Developer Advocate for Artificial Intelligence at Neo4j, specializing in context engineering, knowledge graphs, and AI-driven developer tooling where Nyah has built high-impact technical communities and led initiatives that advance a critical understanding of AI and its use cases. They are also the Founder & CTO of Afros in AI, a technical community dedicated to showcasing the multifaceted nature of artificial intelligence. Beyond Nyah's technical expertise, Nyah has a background in government leadership and technology policy, having served as Chief of Staff in the U.S. state government, where they helped shape tech-driven legislative initiatives and equity-driven legislation. When not immersed in their work, Nyah cares about empowering, teaching, and tutoring engineers, live-streaming technical deep dives, and building open-source tools that make software more accessible, explainable, and community-driven.
  • Twitter: https://x.com/nyahmacklindev
  • LinkedIn: https://linkedin.com/in/nyahmacklin
  • Photo: /wf26/speakers/by-id/spk_nyah_macklin.jpg
  • Sessions:

- RAG Needs a Map: Using GraphRAG to Retrieve Connected Context — Day 1 — Workshop Day 11:05am-12:05pm

Vector search is good at finding similar text, but real answers often depend on how facts, entities, and documents connect. In this hands-on workshop, you’ll build a GraphRAG workflow that uses relationships to retrieve connected context for more grounded AI responses.

Olive Song

  • Role: RL Lead
  • Company: MiniMax
  • Bio: Researcher at MiniMax focused on reinforcement learning and model evaluation for the M-series models.
  • Twitter: https://x.com/olive_jy_song
  • Photo: /wf26/speakers/by-id/spk_olive_song.jpg
  • Sessions:

- Thom Wolf keynote — Day 2 — Session Day 1 10:05am-10:25am

- Agents at Scale: Inside MiniMax's Model and the Infrastructure Behind It — Day 3 — Session Day 2 2:50pm-3:10pm

Olive Song (RL Lead, https://www.minimax.io/) and Dan Fu (VP of Kernels, https://www.together.ai/) dig into the engineering behind one of the most widely used open model families in the agent ecosystem: how MiniMax built the model for agentic workloads, and what it takes to serve it at scale.

Olive on the model side:

The RL decisions behind long-context reasoning and tool use

What training for agentic behavior actually looks like in practice

Dan on the infrastructure side:

Why agentic workloads break inference engines built for chat: prefill-heavy traffic, high cache hit rates, long-context inputs

The kernel-level optimizations built for MiniMax's workload profile

How the two teams collaborate on model launches and ongoing performance work

Omar Solano

  • Role: AI Engineer
  • Company: Towards AI
  • Bio: Omar Solano is an AI Engineer at Towards AI, where he architects and builds production AI agents and applied LLM systems. His work spans RAG, fine-tuning, agentic workflows, and long-context and reasoning-model systems. He leads client-facing AI consulting projects and delivers hands-on AI engineering workshops for developers, engineering teams, and international conference audiences, including training for Europol and the New York Public Library. Omar has authored 50+ technical lessons and book chapters on RAG, AI agents, fine-tuning, and coding agents, reaching 90,000+ learners through Towards AI's courses and publications.
  • Twitter: https://x.com/omar_solano1
  • LinkedIn: https://www.linkedin.com/in/omar-solano1
  • Photo: /wf26/speakers/by-id/spk_omar_solano.jpg
  • Sessions:

- Context Engineering in 2026: Compaction, Memory & Cost — Day 1 — Workshop Day 2:20pm-4:20pm

Every long agent session eventually breaks: the assistant that swore it would "never push to main" does exactly that forty turns later. The model didn't get dumber — its context did. This workshop is about engineering the context window so that stops happening, shown with Towards AI's open-source AI tutor, which answers questions for students of our AI-engineering courses. Context engineering is deciding what the model sees on every single call — instructions, history, retrieved course content, memory, and tool outputs — and it's the line between a tutor that holds a coherent session and one that forgets the student's setup halfway through. We'll move in three stages, mirroring how the project actually went. The concepts: the two root problems (a finite window, a stateless model), the full compaction toolkit (truncation, trimming, tool-result clearing, summarization, and offloading to files — and when each actually helps), memory that survives across sessions, skills loaded on demand, and production-grade retrieval (chunking, metadata, course scoping, hybrid search, reranking, and evaluating). We'll cover the tutor's architecture, and the evaluation harness we used to measure every run on Gemini — tokens, cost, latency, and memory probes instead of vibe-checks. At real volume, even Gemini Flash got expensive, so we tested whether open and local models could match the quality for a fraction of the cost and match result quality. Everything is open-source and will be shared during the workshop.

Omer Primor

  • Company: Bright Data
  • Sessions:

- The Rise of CaaS: Context-as-a-Service for Agentic AI — Day 3 — Session Day 2 1:55pm-2:15pm

Agentic workflows have commoditized. The new bottleneck is context. As models improve, AI agents are increasingly limited not by reasoning ability, but by the quality, freshness, and specificity of the information they can access. This session introduces Context as a Service, or CaaS, an emerging category for builders creating web-native context layers for AI agents. These tools collect, structure, enrich, index, and analyze live web data, making it available as agent-ready knowledge for specific use cases and vertical downstream applications. We ll explore how builders are turning hard-to-access web domains into agent-ready context layers: fragmented public data, dynamic sources, multimodal content, and fast-changing signals that generic models cannot reliably process within their token limits. Attendees will learn how to think about CaaS as both a technical architecture and a market opportunity: what to build, where context creates defensibility, and how raw web data can become the foundation for reliable agentic products.

Omri Bruchim

  • Role: Engineering Group Manager
  • Company: Monday
  • Bio: Omri Bruchim is an Engineering Group Manager at monday.com, where he leads the AI group and focuses on building trustworthy, production-ready AI systems. With nearly 20 years of experience in software engineering and leadership, he previously held a GM role at Wix, and founded a real-time data processing platform in his startup called drift.dev. Omri began his career in Unit 8200 and holds Degree from Ben-Gurion University and MBA from Tel Aviv University. He is passionate about making AI accessible and impactful, and enjoys flying his kite-surf on weekends.
  • Twitter: https://x.com/omribruchim
  • LinkedIn: https://www.linkedin.com/in/omribruchim/
  • Website: https://edginary.io
  • Blog: https://x.com/omribruchim
  • Photo: /wf26/speakers/by-id/spk_omri_bruchim.jpg
  • Sessions:

- From Systems of Record to Systems of Context — Day 4 — Session Day 3 12:05pm-12:25pm

Enterprise AI agents are moving fast, but most of them still hit the same wall in production: they have access to tools, documents, APIs, and databases, but they do not understand the real context of how work gets done. At monday.com, we are building agents that operate across real customer workflows, internal product surfaces, knowledge, permissions, memory, and actions. The hard part is not just calling the right tool or retrieving the right document. The hard part is building a reliable context layer that helps agents understand users, work objects, organizational knowledge, prior decisions, business rules, and the relationships between them. This talk will explore the emerging idea of the context graph: a living, queryable layer that connects entities, history, permissions, decisions, and meaning across an organization. Foundation Capital describes context graphs as the next major enterprise AI opportunity because agents need more than rules. They need decision traces: how rules were applied, where exceptions were made, who approved what, and what precedent actually governs reality. I will share how we think about this opportunity at monday.com, how we are implementing parts of it in practice, and what we have learned from building AI agents inside a real AI work platform. The talk will include concrete examples, including how context is collected, represented, retrieved, governed, and evaluated. The audience will leave with a practical framework for moving beyond one-off RAG pipelines and prompt stuffing toward a reusable context layer that compounds over time, improves agent quality, and becomes a strategic moat for companies building AI-native products.

Owen Halpert

  • Role: GTM
  • Company: turbopuffer
  • Bio: Owen Halpert works on GTM at turbopuffer. He previously worked on the GPU-accelerated KNN side of OpenSearch at Amazon and has a Berkeley CS background.
  • LinkedIn: https://www.linkedin.com/in/halpert
  • Website: https://owenhalpert.com
  • Photo: /wf26/speakers/by-id/spk_owen_halpert.jpg
  • Sessions:

- Give your coding agents the power of turbogrep! — Day 2 — Session Day 1 11:10am-11:30am

Coding agents can grep the filesystem, but sometimes semantic search is more useful for finding the right files, especially on large codebases. Claude Code and Codex, unlike Cursor, do not use semantic search for code retrieval. There are good reasons for this, but Cursor has consistently demonstrated that semantic retrieval can materially improve code search to improve answer accuracy, increase code retention, and reduce token usage. In this session, we'll share a coding agent plugin for semantic codebase search alongside other modalities (BM25, regex/globbing/grep, filtering), and demonstrate how an agent can choose the right tool for the job. We'll share benchmark-style results that compare answer quality and token consumption with and without semantic retrieval across a small set of representative tasks.

Pablo Castro

  • Role: Distinguished Engineer and CVP
  • Company: Microsoft
  • Bio: Pablo leads the AI Knowledge team in the CoreAI division at Microsoft, where he focuses on building state-of-the-art information understanding and retrieval systems for AI applications and agents, including products such as Foundry IQ, Azure AI Search and Azure Content Understanding.
  • LinkedIn: https://www.linkedin.com/in/pabloc
  • Photo: /wf26/speakers/by-id/spk_pablo_castro.jpg
  • Sessions:

- On AI and Knowledge — Day 2 — Session Day 1 9:05am-9:25am

Paige Bailey

  • Role: AI Developer Relations Engineering Lead
  • Company: Google DeepMind
  • Bio: AI leader focused on bridging foundational AI research and real-world implementation. Previously Director of Machine Learning and MLOps at GitHub/Microsoft, with experience across developer tooling, MLOps, and applied machine learning.
  • Twitter: https://x.com/DynamicWebPaige
  • Photo: /wf26/speakers/by-id/spk_paige_bailey.jpg
  • Sessions:

- Research to Reality with Google DeepMind — Day 1 — Workshop Day 12:10pm-1:10pm

Palak Agarwal

  • Role: Developer Relations Lead
  • Company: Reducto
  • Twitter: https://x.com/palak_agarwal6
  • LinkedIn: https://www.linkedin.com/in/palak06
  • Website: https://palakagarwal.me/
  • Photo: /wf26/speakers/by-id/spk_palak_agarwal.jpg
  • Sessions:

- How Reducto parsed the Epstein Files for the Viral JMail Project: The Secret Complexities of Document — Day 1 — Workshop Day 1:15pm-2:15pm

Reducto powered the infrastructure behind Jmail, a fully searchable email interface with over 3.5 million scanned government pages built days after the Epstein files release. The site went viral overnight, racking up millions of views across news coverage and social media. In this workshop we'll break down how Reducto's Parse API handled everything from redacted PDFs to handwritten letters to dense financial tables at that scale, then walk through the same pipeline hands-on using the Reducto CLI and MCP. You'll leave with a working setup and a clear mental model for applying document parsing to your own projects.

Pamela Fox

  • Role: Principal Cloud Advocate
  • Company: Microsoft
  • Bio: Pamela Fox is a developer advocate at Microsoft, where she helps developers use Python with Azure, Microsoft Foundry, VS Code, and GitHub. Before Microsoft, she was an early engineer at Coursera, early DevRel at Google, lecturer at UC Berkeley, and creator of the Khan Academy programming courses.
  • Twitter: https://x.com/pamelafox
  • LinkedIn: https://www.linkedin.com/in/pamela-s-fox/
  • Website: https://pamelafox.org/
  • Blog: http://blog.pamelafox.org/
  • Photo: /wf26/speakers/by-id/spk_pamela_fox.jpg
  • Sessions:

- Get Started with Models in Microsoft Foundry to Build AI Apps — Day 1 — Workshop Day 9:00am-10:15am

In this hands-on lab, you will build a production-ready AI application using Microsoft Foundry, with no fine-tuning or deep machine learning expertise required. You will discover and select models, provision a Foundry project, and connect to a hosted model using the OpenAI SDK. You’ll implement a comment moderation workflow, compare model outputs, and package the solution as a hosted agent using Python, ready for real-world integration.

- The model swap workshop — Day 1 — Workshop Day 11:05am-12:05pm

Frontier labs are releasing new models constantly, and it is hard to know when “better” is better enough to justify touching a working system. On top of that, “just swap the model” often turns into real work because providers expose different APIs and different expectations around tools and structured outputs. The model swap workshop is a hands-on bake-off across frontier LLMs. We will run the same scenarios using multiple models (OpenAI, Anthropic, Kimi, and more) and compare results side by side for agentic tool use, structured outputs, and multimodal tasks. Swapping models is not just changing a model name. In this workshop, you will actually do the swaps, including moving between OpenAI-style Responses APIs and Anthropic-style Messages APIs, then see what breaks and what needs to change in your prompts, tool definitions, and JSON strategies. We will finish by running a small eval suite so you can quantify tradeoffs instead of relying on vibes. We will provide the Microsoft Foundry environment for access to the models, no account needed.

- Observe, optimize and protect your hosted agents in Microsoft Foundry — Day 1 — Workshop Day 2:20pm-3:35pm

Modern agents fail in ways traditional monitoring can’t catch. In this hands-on lab, learn how Microsoft Foundry Observability helps you move from prototype → production with context-specific evaluation suites (auto-generated evaluators + test datasets) wired into developer workflows via skills/MCP tooling for hosted agents. Then scale quality with continuous evaluation, trace-linked analysis, and adaptive red teaming—and walk away with a sandbox to explore additional features on your own.

- Use Copilot across CLI, dev, and cloud workflows to move faster end-to-end — Day 2 — Session Day 1 11:40am-12:00pm

Copilot isn't just for writing code. Learn how to use it across CLI and cloud workflows to scaffold apps, debug faster, and automate repetitive steps across your entire dev lifecycle.

- OpenAI, Anthropic, or agent frameworks: choose the right AI stack — Day 3 — Session Day 2 11:40am-12:00pm

OpenAI SDK, Anthropic SDK, or an LLM-agnostic agent framework. Which one should your next AI app be built on? Starting with Foundry Models, we walk through each option in code, show what you gain and what you give up at every layer, and help you pick the right abstraction for your scenario without overbuilding.

- Diagnosing agent failures in production — Day 4 — Session Day 3 10:45am-11:05am

Agent behavior changes in production. Learn common failure modes and how to debug, test, and improve performance using real evaluation techniques.

Paola Estefania

  • Role: Staff Engineer
  • Company: Better Auth
  • Bio: Staff Engineer at Better Auth focused on agent Identity. Co creator of Agent Auth Protocol
  • LinkedIn: https://uy.linkedin.com/in/paolaestefaniadecamposdefranco
  • Photo: /wf26/speakers/by-id/spk_paola_estefania.jpg
  • Sessions:

- Agent Auth — Day 1 — Workshop Day 4:30pm-5:30pm

Better Auth has grown to 27k GitHub stars and over 1.5M weekly downloads, becoming a popular choice for developers who want to own their authentication stack. We recently introduced Agent Auth, a protocol designed to support autonomous and delegated agents operating services for an organization or a user. It allows agents to dynamically negotiate capabilities, manage access boundaries, and maintain secure authorization flows. This session will break down the protocol design and demonstrate it live, showing how agents can securely authenticate and operate with dynamic permissions.

Parth Asawa

  • Role: CS PhD student
  • Company: UC Berkeley
  • Bio: Parth Asawa is a PhD student at UC Berkeley advised by Professor Matei Zaharia and Professor Joey Gonzalez. Parth's research is on continual learning, studying how to enable models to stably learn from streams of experiences over time. His work focuses on sample-efficient learning and spans the stack of data, learning algorithms, architectures, and evaluation.
  • Twitter: https://x.com/pgasawa
  • LinkedIn: https://www.linkedin.com/in/pgasawa/
  • Website: https://pgasawa.github.io/
  • Photo: /wf26/speakers/by-id/spk_parth_asawa.jpg
  • Sessions:

- Beyond Static Intelligence: Evaluating Continual Learning — Day 3 — Session Day 2 10:45am-11:05am

Continual learning, the ability of AI systems to improve through sequential experience, has attracted substantial interest, but no high-quality benchmark exists to evaluate it. We introduce Continual Learning Bench (CL-Bench), the first difficult, expert-validated benchmark designed to measure whether LLM-based systems genuinely improve with experience. CL-Bench spans six diverse domains (software engineering, signal processing, disease outbreak forecasting, database querying, strategic game-playing, and demand forecasting), each validated by domain experts and designed so that tasks share a learnable latent structure (codebase layout, disease outbreak dynamics, opponent strategies) that a stateful system can discover online but a stateless one cannot. We evaluate frontier models across several agent architectures, from naive in-context learning (ICL) to dedicated memory systems, introducing a gain metric to isolate learning from prior capabilities. We find that these systems leave headroom for improved continual learning: agents frequently overfit to immediate observations or fail to reuse knowledge across instances, and dedicated memory systems do not fix this---in fact, naive ICL outperforms systems dedicated to memory management. CL-Bench is the first benchmark to evaluate continual learning across diverse real-world domains with expert-validated tasks and isolate online learning from underlying model capability, showing a need for better continual learning systems.

Patricija Žemaitytė

  • Role: Product Manager
  • Company: Oxylabs
  • Bio: With over five years of experience in the IT industry, Patricija Žemaitytė has built a focus on product management in the web scraping space and currently focuses on SERP and LLM scraping products at Oxylabs.
  • LinkedIn: https://lt.linkedin.com/in/patricijazemaityte
  • Photo: /wf26/speakers/by-id/spk_patricija_emaityt.jpg
  • Sessions:

- How Web Data Infrastructure Powers the Next Generation of AI — Day 3 — Session Day 2 3:20pm-3:40pm

For years, the web intelligence industry has powered major data developments. As big data grew, ensuring sustained data flow became harder. Now, AI is taking the biggest leaps forward. How the web intelligence industry responded to this increasing scale and complexity is the story of the most crucial steps forward in AI today. This presentation demonstrates how web scraping infrastructure fuels AI innovation by linking the web's repository to AI developers. Told through AI products, it addresses both the engineering challenges and solutions for developers, and the strategic use cases for business decision-makers.

Patrick Debois

  • Role: Member Technical Staff
  • Company: Tessl
  • Bio: Independent advisor, Product DevRel lead at Tessl, and curator of ainativedev.io, he studies AI-native development patterns, context engineering, and how the context flywheel turns everyday coding into organizational knowledge. Know for his pioneering work on DevOps and DevOpsdays.
  • Twitter: https://x.com/patrickdebois
  • LinkedIn: https://www.linkedin.com/in/patrickdebois/
  • Website: https://jedi.be
  • Blog: https://jedi.be/blog
  • Photo: /wf26/speakers/by-id/spk_patrick_debois.jpg
  • Sessions:

- Coding Agents Don't Scale Themselves. Neither Do Your Teams.The Rise of Agent Enablement. — Day 4 — Session Day 3 1:30pm-1:50pm

Every company wants to know how others are actually scaling AI coding. But it's hard to get past the generic transformation stories. What are the new practices showing up in real engineering orgs? What does maturity actually look like, and what separates teams that are moving from teams that are stuck? What are the patterns for enabling humans and agents, together? Patrick Debois has been collecting the practices and patterns, talking to the early Agent Enablement teams already on the job, team leads, and VPs of Engineering. What's showing up is a new function: a team that enables other teams to get real leverage out of their agents. This talk takes the Context Development Lifecycle off the individual laptop and onto the org chart, grouped across three pillars: - Enablement. From individual experimentation to team and org-level fluency with agents. - Platform. Agent tooling that runs like a real delivery pipeline: fast, observable, cost-aware. - Governance. Ad-hoc guardrails growing into real evaluation, telemetry, and accountable agent work. For Agent Enablement leaders scaling it out across the org. For team leads looking to help their teams get better at this. For VPs ready to unblock the friction and unlock what agents can actually do. Coding agents don't scale themselves. This is the talk about who does

Paul Bakaus

  • Role: Founder
  • Company: Renaissance Geek, Inc.
  • Bio: Paul Bakaus is a product engineer, two-time founder, and creative technologist. He's the CEO and founder of Renaissance Geek, Inc., an AI startup emerging from stealth. Its first project, Impeccable, is an open-source design skill, cli and extension that de-slops the design models produce out-of-the-box, and allows live visual iteration with a rich command palette in your production code base.

Before Renaissance Geek, Paul spent 20+ years building at the intersection of code, design, and culture. He created jQuery UI in 2007, still running on millions of sites. At Google, he shipped Chrome DevTools features, helped speed up the web with AMP, made it visual with Web Stories, and became Google's first Head of Creator Relations. Most recently, at Spotter, he built Spotter Studio, giving YouTubers back their most precious resource: time.

His mission: move the human-AI interface past the chat box, and build AI tools for everyone, not just engineers.

  • Twitter: https://x.com/pbakaus
  • LinkedIn: https://linkedin.com/in/paulbakaus
  • Website: https://www.paulbakaus.com/
  • Photo: /wf26/speakers/by-id/spk_paul_bakaus.jpg
  • Sessions:

- The Dark Arts of Skill Engineering — Day 1 — Workshop Day 4:30pm-5:30pm

Most agent skills are a system prompt and a prayer. They produce safe, median output because that's what LLMs default to. After building 24 design skills across 9 AI platforms, I found the patterns that break through that ceiling, and they're rarely documented or discussed. Make your agents argue: spawn parallel sub-agents that independently evaluate the same work, then force their conflicting opinions into a single result. The output is bolder than any single agent would dare. Build mixture-of-expert skills that route to specialized sub-agents the way frontier models route to specialized networks. Give your skills memory through persistent context files that restore across sessions, so every invocation builds on the last. Wire up skill hooks that auto-activate after execution to validate, transform, or chain into the next skill. Exploit barely documented environment variables and shell expansion to make skills context-aware before they even run. Let's dig into the dark arts of skill engineering to craft ultra powerful skills.

- Design at the Speed of Adjectives — Day 3 — Session Day 2 1:30pm-1:50pm

Every design tool today operates at the wrong level of abstraction for AI-assisted engineering. Traditional tools give you padding sliders and color pickers, built for a world where designer and engineer are separate roles moving at separate speeds. Prompt-to-design tools one-shot a pretty landing page from a sentence, which is more dangerous because it looks like it's working. No serious design director hears a prompt and starts pushing pixels. The brief comes first. What's the emotional territory? What should this not feel like? Today's AI tools skip that discovery entirely. The result is output without intent. Technically competent, strategically empty. The right abstraction for a world where the designer is also the engineer lives between these extremes. Not pixels. Not prompts. Adjectives. "Make it feel warmer." "Strip it to its essence." "Add tension." These are the controls a creative director actually thinks in. Drawing on lessons from building Impeccable, an open source design tool with 24 adjective-level commands like /bolder, /quieter, and /distill, I'll share what worked, what didn't, and how to apply this thinking to any AI interface where creative intent matters more than parameter control.

Paul Klein IV

  • Role: Founder & CEO
  • Company: Browserbase
  • Bio: Founder and CEO of Browserbase, a company building browser infrastructure for web automation and AI agents.
  • Twitter: https://x.com/pk_iv
  • Photo: /wf26/speakers/by-id/spk_paul_klein_iv.jpg
  • Sessions:

- Bringing agents onto the world wide web — Day 3 — Session Day 2 11:40am-12:00pm

The web wasn't built for agents. Heavy HTML, human-first UIs, and a DOM that can hijack the model's context. Still, agents browse it for millions of hours every month through Browserbase, across teams like Ramp, Shopify, and Lovable. This talk walks through that browser agent harness layer by layer, from the security boundary between DOM and model to caching, Agent Identity, and the infrastructure that provisions browsers at scale, and where browser agents go once it is in place.

Paula Dozsa

  • Role: iOS Engineer
  • Company: Tolan
  • Bio: Paula Dozsa is an iOS engineer creating whimsical AI companions at Tolan. She previously co-founded and led development at EdTech startup imagi and built iOS apps at xAI and Spotify.
  • Twitter: https://x.com/paularambles
  • LinkedIn: https://www.linkedin.com/in/paulacodes/
  • Photo: /wf26/speakers/by-id/spk_paula_rambles.jpg
  • Sessions:

- Tolan: Voice-First AI Companion — Day 2 — Session Day 1 1:30pm-1:50pm

Pauline Brunet

  • Role: VP, Forward Deployed Engineering
  • Company: Cursor
  • Bio: VP of Forward Deployed Engineering at Cursor. Building the motion and team to help customers adopt Cursor and drive meaningful returns. We configure and co-build alongside customer software and transformation teams. Spent 10 years in AI deployments across enterprises.
  • LinkedIn: https://www.linkedin.com/in/pauline-brunet/
  • Photo: /wf26/speakers/by-id/spk_pauline_brunet.jpg
  • Sessions:

- How Forward Deployed Engineering is done at Cursor — Day 2 — Session Day 1 11:10am-11:30am

Pedro Lopez

  • Role: Senior Software Engineer
  • Company: Airbyte
  • Bio: Pedro S. Lopez is a Senior Software Engineer at Airbyte. His WF26 talk focuses on how Airbyte built its Agent MCP Server and CLI for agent-oriented data integration workflows.
  • Photo: /wf26/speakers/by-id/spk_pedro_lopez.jpg
  • Sessions:

- How We Built the Airbyte Agent MCP Server and CLI — Day 2 — Session Day 1 3:45pm-4:05pm

Agents need a reliable way to reach live business data. At Airbyte we built two interfaces for that, and this session is how.

Cam built much of that surface. He covers the MCP server that exposes hundreds of sources through one endpoint with managed auth, and the CLI that's designed for agent harnesses rather than humans, with embedded help, packaged agent skills, and no credentials passed over the command line. Expect the real engineering: why a CLI turned out to fit autonomous agents better than the API or SDK, how auth works across the layers, and the tradeoffs the team made along the way.

Come if you're building agent tooling or thinking about how to expose your own systems to agents cleanly.

Peter Werry

  • Role: Founding Engineer
  • Company: Unblocked
  • Bio: Founding engineer at Unblocked working on context engines for modern engineering teams.
  • Photo: /wf26/speakers/by-id/spk_peter_werry.jpg
  • Sessions:

- Beyond RAG: Build a Relational Context Engine from Scratch — Day 1 — Workshop Day 12:10pm-1:10pm

In this workshop we'll explore the importance of context engines in modern engineering workflows, and we'll look at why traditional RAG techniques are no longer enough to deliver the context agents need.

We'll build a structured query engine that fills the gaps left by RAG, translating natural language into validated database queries over GitHub PR and Issue data. We'll implement schema-aware prompting, identity resolution, query validation, and error-driven retry loops, and you'll walk away with a working query engine for your GitHub repository.

- How to generate mergeable code with a context engine — Day 3 — Session Day 2 11:40am-12:00pm

Your agents are fast, capable, and completely context-blind. They generate code that compiles but doesn't reflect how your system actually works. You're likely already seeing the impact: ballooning token costs, longer review cycles, and inconsistent outputs. More MCPs, rules, and bigger context windows give agents access to information, but not understanding. In this session, we dissect how teams pulling ahead use a context engine to give agents exactly what they need for the task at hand. Includes a short demo showing the workflows a context engine can augment.

Philip Kiely

  • Role: Developer Relations
  • Company: Baseten
  • Bio: Philip Kiely leads Developer Relations at Baseten. Prior to joining Baseten in 2022, he worked across software engineering and technical writing for a variety of startups. Outside of work, you'll find Philip practicing martial arts, reading a new book, or cheering for his adopted bay area sports teams.
  • Twitter: https://x.com/philip_kiely
  • LinkedIn: https://linkedin.com/in/philipkiely
  • Website: https://baseten.co
  • Blog: https://philipkiely.com
  • Photo: /wf26/speakers/by-id/spk_philip_kiely.jpg
  • Sessions:

- What's New in Inference Engineering — Day 4 — Session Day 3 1:30pm-1:50pm

More than 30,000 engineers have learned the fundamentals of inference since Inference Engineering was published. But the field keeps accelerating, so it's time for the first public addendum to the book. The past four months have seen a renewed focus on training-dependent inference optimization across the "big three" performance techniques of speculation, caching, and quantization. This talk provides structured guidance for training DFlash and EAGLE 3 draft models to accelerate LLM decode, introduces the concept of KV compaction, and explains the hype behind TurboQuant.

Philipp Schmid

  • Role: Staff Engineer
  • Company: Google DeepMind
  • Bio: Philipp Schmid is a Staff Engineer at Google DeepMind working on Gemini and Gemma. His work focuses on helping developers build and benefit from AI responsibly.
  • Twitter: https://x.com/_philschmid
  • LinkedIn: https://www.linkedin.com/in/philipp-schmid-a6a2bb196/
  • Website: https://www.philschmid.de/
  • Blog: https://www.philschmid.de
  • Photo: /wf26/speakers/by-id/spk_philipp_schmid.jpg
  • Sessions:

- Why Agents Should Have Their Own Sandbox — Day 3 — Session Day 2 1:30pm-1:50pm

- Don't Ship Skills Without Evals — Day 3 — Session Day 2 3:20pm-3:40pm

There are thousands agent skills. Almost none of them are tested. They get vibe-checked with two manual runs, maybe a thumbs-up from a colleague, then shipped. You wouldn't merge code without tests — so why are we shipping skills without evals? This talk covers the full lifecycle of building reliable agent skills: what a skill actually is (and isn't), how to write one that triggers correctly, and how to build a lightweight eval harness that catches failures before your users do.

- Agents Without Code: How Skills, YAML, and Filesystems Replaced Python — Day 4 — Session Day 3 3:45pm-4:05pm

Six months ago, building an agent meant writing a Python class with a while loop, tool definitions in dicts, manual state management or writing custom python functions. Today, you define an agent in a YAML file, drop a SKILL.md into a folder, and deploy. This talk traces the arc from "Agent in Python" to "Agent as filesystem". You'll learn the same agent built three ways: the hard way (Jan 2025), the simple way (Oct 2025), and the zero-code way (today).

Pierluca D'Oro

  • Role: Founder
  • Company: Programma Labs
  • Bio: Pierluca D’Oro is founder at a stealth startup revolutionizing how humans interact with AI-generated software. At Mila, he pioneered two early ideas that now sit at the center of agent development: making reinforcement learning scale through simple recipes, and using LLMs as feedback systems to train agents. At Meta Superintelligence Labs, he worked on frontier model development and led environment generation for mobile computer use agents.
  • Twitter: https://x.com/proceduralia
  • LinkedIn: https://www.linkedin.com/in/pierluca-doro/
  • Website: https://www.proceduralia.com
  • Blog: https://pragmatichumanism.substack.com/
  • Photo: /wf26/speakers/by-id/spk_pierluca_d_oro.jpg
  • Sessions:

- Computer Use at the Edge of the Statistical Precipice — Day 3 — Session Day 2 11:10am-11:30am

Evaluating Computer Use Agents (CUAs) on interactive environments is fraught with methodological pitfalls that the field has yet to systematically address. We show that a 1MB replay script that blindly executes a recorded action sequence without ever observing the screen outperforms frontier models on prominent static benchmarks, and prove that its expected success rate is exactly equal to the source agent's pass@k in deterministic environments. We trace this and other failures to two root causes: non-principled environment design (static, unsandboxed, or unreliably verified environments) and non-principled evaluation methodology (naive aggregation and misuse of pass@k for stateful UI interactions). To address the first, we propose PRISM, five design principles for CUA environments and instantiate them in DigiWorld, a benchmark of 15 realistic sandboxed mobile applications able to evaluate agents in over 3.2 million verified unique configurations. To address the second, we develop an aggregation framework that correctly accounts for the nested structure of CUA benchmarks. All together, we show that principled environment design and rigorous evaluation methodology are not optional refinements but prerequisites for meaningful CUA research.

Prakhar Dixit

  • Role: Partner
  • Company: McKinsey
  • Bio: Partner in McKinsey’s Seattle office with more than 10 years of experience advising CxOs on technology transformations on growth and productivity. He has a background in product management and software engineering, helping technology companies grow efficiently through AI, improved ways of working, and operating model transformations.
  • LinkedIn: https://www.linkedin.com/in/prakhar-dixit/
  • Photo: /wf26/speakers/by-id/spk_prakhar_dixit.jpg
  • Sessions:

- Tokenomics: From AI Spend to AI Value — Day 3 — Session Day 2 11:00am-12:00pm

Facilitated, peer-to-peer, under the Chatham House Rule — not recorded.

As enterprise AI adoption accelerates, token spend is scaling faster than value realization. We address i) how to make decisions amid unclear cost and value dynamics, ii) how to shift from token-level to workflow-level analysis, and iii) how to manage downstream behavior implications on AI usage.

- The Agentic Product Development Organization — Day 4 — Session Day 3 11:00am-12:00pm

Facilitated, peer-to-peer, under the Chatham House Rule — not recorded.

As AI agents become embedded in day-to-day work, organizations will need to rethink product development teams, roles, and skills. This foundational shift reshapes management layers and requires overcoming challenges across talent attraction, development, and retention.

Pranav Maheshwari

  • Role: Director of Integrations
  • Company: Edge And Node
  • Bio: Director of engineering at Edge And Node building financial harness through ampersend.ai
  • Twitter: https://x.com/impranavm_
  • LinkedIn: https://www.linkedin.com/in/thepranavmaheshwari/
  • Photo: /wf26/speakers/by-id/spk_pranav_maheshwari.jpg
  • Sessions:

- Agent Spending Without Controls: The Missing Infrastructure Layer for AI Pa… — Day 4 — Session Day 3 1:30pm-1:50pm

AI agents are already transacting autonomously, but the infrastructure to govern how they spend does not yet exist. Traditional payment rails were built for humans, not for systems making thousands of micro-decisions per minute on someone else's behalf. This session brings together Edge & Node's CEO and Senior Solutions Architect to cover both the strategic case and the technical implementation. Rodrigo opens with the infrastructure gap: why programmable budget governance is a foundational requirement for any team deploying agents in production, and what it means to have real-time visibility and a full audit trail across every agent transaction. He also covers Edge & Node's founding membership in the x402 Foundation and why open standards for agent-to-agent and agent-to-service payments matter for the broader ecosystem. Pranav then goes deep on the stack: how structured, indexed blockchain data from The Graph powers reliable agent decision-making, how Amp Enterprise extends that into auditable datasets at production scale, and what it looks like in practice to wire ampersend into agent frameworks including LangChain, CrewAI, AutoGPT, and custom-built systems. He walks through the x402 and A2A standards that make agent payments interoperable and what a real deployment looks like end to end. The session closes with the bigger picture: bots are already half of all internet traffic, TradFi and DeFi are converging, and the infrastructure stack that wins is the one built for where they meet.

Pranay Bhatia

  • Role: AI engineer and product leader
  • Company: Fireworks AI
  • Bio: Pranay Bhatia is an AI engineer and product leader at Fireworks AI. He previously worked on Google’s PaLM and Gemini API developer tools.
  • Twitter: https://x.com/pranaycbhatia
  • LinkedIn: https://www.linkedin.com/in/pranay-bhatia-58132b22/
  • Photo: /wf26/speakers/by-id/spk_pranay_bhatia.jpg
  • Sessions:

- Stop Model Shopping: Why Ownership Beats Choice in the Agent Stack — Day 4 — Session Day 3 12:05pm-12:25pm

Teams shipping successful agents at scale know that model ownership is now a much more durable advantage than model choice. They’re fine-tuning open models using their proprietary data, building tight data feedback loops between their products and their models, and treating customization as a core product discipline to differentiate. I’ve spent the last decade building AI infrastructure, first as co-creator and head of PyTorch at Meta, now as CEO of Fireworks AI, where my team powers AI agent infrastructure stacks for companies like Cursor, Notion, Uber, DoorDash, and Vercel. I’ve watched hundreds of teams try to ship agents into production, and the patterns behind their success and failure are remarkably consistent. In this talk, I’ll share hard-won lessons from real production deployments across coding, productivity, and enterprise use cases, like: - Model choice matters, but model ownership matters more. Fine-tuning on proprietary data and building a feedback loop between your product and your models creates compounding advantages that no API swap will ever replicate, and it’s now the standard for all state-of-the-art models. It’s how Cursor hit 1,000 tokens/sec with quality that off-the-shelf models could never match, and it’s how Quora saw 3x speed improvements in its chatbot Poe. - The eval gap is where most agent projects die. Teams will spend months on prompt engineering and model selection, then ship without rigorous evaluation. Treating AI development with the same discipline as software development, with CI/CD, regression testing, and continuous evaluation, is what separates production-grade agents from impressive demos. A custom evaluation suite, coupled with RFT, is how Genspark achieved 12% higher quality on their trained model, resulting in a 50% cost reduction. - The real moat is the data flywheel. When you own the loop between your product, your data, and your models, every interaction makes the system better. Surrendering that loop to a third-party provider means surrendering the very data that makes your product defensible. Ownership is how Vercel created a custom code model that matched competitor quality at 40x speed. I’ll ground this talk in real examples I’ve seen work and fail across hundreds of agent deployments.

Preetika Bhateja

  • Role: Product Manager
  • Company: Google
  • Bio: Product Manager at Google/YouTube working on ads, evals, agents, llm-as-judge systems. Before PM, data engineer at google cloud
  • Photo: /wf26/speakers/by-id/spk_preetika_bhateja.jpg
  • Sessions:

- Model Whisperers: How Evals and Prompts Shape Agent Behavior — Day 3 — Session Day 2 1:30pm-1:50pm

Getting an AI agent to behave the way you want isn’t just about writing better prompts. In real systems, behavior emerges from a loop: prompts->evals->iteration->feedback. Small changes in any part of that loop can completely change outcomes. We saw this while building a seed asset agent - a system that turns messy, real-world advertising creatives (low quality images, cluttered visuals, heavy text overlays) into clean, reusable assets for downstream Gen AI tools. The agent acts like an editor, simplifying visuals, removing unnecessary elements, and isolating core content so that additional context (like text or CTAs) can be added back in a more controlled, brand-safe way. But the real challenge wasn’t just building the agent - it was making it reliable. And prompting alone wasn’t enough. What actually moved the system forward was how we defined success—and how we used evals to reinforce it. Over time, evals stopped being just a way to measure quality. They became part of how the agent learned what “good” looks like. In this talk, we’ll cover: Why prompting alone doesn’t give you stable agent behavior How evals act like feedback signals, not just scorecards How we built evals sets that reflect the real-world Using agent trace logs to understand why things fail (not just that they fail) How to iterate without breaking things you already fixed By the end, you’ll have a set of patterns you can apply to any system dealing with messy/continuously changing data and how to tweak your prompt and evals to accommodate such changes.

Prerna Kakkar

  • Role: Senior Software Engineer
  • Company: Google
  • Bio: Prerna Kakkar is TL for Agentic Evaluation for Google Cloud Databases and is an active contributer to Evalbench and MCP Toolbox for Databases.
  • LinkedIn: https://www.linkedin.com/in/prernakakkar95/
  • Website: https://about.google/
  • Photo: /wf26/speakers/by-id/spk_prerna_kakkar.jpg
  • Sessions:

- Build-Time vs. Run-Time: Why Your Dev Tools Will Fail in Production — Day 3 — Session Day 2 10:45am-11:05am

A dangerous pattern is evolving in the ecosystem: developers are deploying "Build-Time" tools into "Run-Time" environments. In this session, we will introduce a critical distinction for the MCP ecosystem: the difference between Build-Time Agents (Developer Assistants like Gemini Code Assist) and Run-Time Agents (End-user applications like a Customer Support bot). Drawing from our experience building the MCP Toolbox, we will demonstrate why the "Atomic" tools that make Build-Time agents powerful become catastrophic liabilities for Run-Time agents. We will provide a framework for transitioning your architecture across three key axes: Design: Moving from flexible, atomic primitives to "Composite Workflows" that encapsulate business logic. Security: Shifting from "Developer Identity" (trusted) to "Workload Identity" (zero-trust), where the agent is treated as an untrusted user. Reliability: Why production agents need "Agent-Readable" errors (natural language guidance) rather than the stack traces that developers rely on. Attendees will leave with a clear rubric for evaluating whether their tools are truly "Production Ready" or just "Prototype Ready."

Priyanka Phatak

  • Role: Member of Technical Staff
  • Company: Anthropic
  • Bio: Priyanka Phatak leads the Claude Managed Agents team at Anthropic, building the systems that let developers run reliable, production-grade autonomous agents on Claude. In her two years at Anthropic she's worn several EM hats — spinning up Apps Platform, Public Sector Engineering, and Product Infrastructure along the way. Previously, she was Director of Engineering at Lyft, where she ran the Transit, Bikes & Scooters engineering org, and earlier built products at Yammer.
  • Twitter: https://x.com/PriyankaPhatak
  • LinkedIn: https://www.linkedin.com/in/priyankaphatak/
  • Photo: /wf26/speakers/by-id/spk_priyanka_phatak.jpg
  • Sessions:

- Claude Managed Agents Workshop (Part 1) — Day 2 — Session Day 1 10:45am-11:05am

Build an agent with Claude Managed Agents

- Claude Managed Agents workshop (Part 2) — Day 2 — Session Day 1 11:10am-11:30am

Build an agent with Claude Managed Agents

- Claude Managed Agents workshop (Part 3) — Day 2 — Session Day 1 11:40am-12:00pm

Build an agent with Claude Managed Agents

- Claude Managed Agents workshop (Part 4) — Day 2 — Session Day 1 12:05pm-12:25pm

Build an agent with Claude Managed Agents

Prukalpa Sankar

  • Role: Founder & Co-CEO
  • Company: Atlan
  • Bio: Prukalpa Sankar is the Founder & Co-CEO of Atlan, the context layer for AI. She's been early to a defining idea of the AI era: context is king. AI systems are only as good as the business context behind the data they rely on. Under her leadership, Atlan has become a Leader in the Gartner Magic Quadrants for both Data & Analytics and Metadata Management, serves 300+ enterprises including Mastercard, GM, JPMorgan Chase, and Nasdaq, and has raised $200M+ from Sequoia, GIC, and Salesforce Ventures. Before Atlan, Prukalpa co-founded SocialCops, the world's largest government data lake powering the UN's SDG monitoring — recognized by the New York Times and the World Economic Forum. She's been featured in Forbes 30 Under 30 and Fortune 40 Under 40.
  • Twitter: https://x.com/prukalpa
  • LinkedIn: https://www.linkedin.com/in/prukalpa
  • Photo: /wf26/speakers/by-id/spk_prukalpa_sankar.jpg
  • Sessions:

- WTF Is the Context Layer? The Missing Infrastructure for Production Agents — Day 3 — Session Day 2 1:55pm-2:15pm

In the last two years, models have gotten exponentially smarter. Two years ago they couldn't pass the bar. Today, top 1% of test scorers. And yet most agents still can't answer a simple business question correctly. You ship a demo that works. You deploy it. The business abandons it in a month.

The missing variable is context: the business definitions, procedural knowledge, and operational norms that make a human expert valuable.

Drawing on hundreds of production deployments, Prukalpa Sankar will break down what it actually takes to give agents contextual intelligence — and get them past the demo stage.

She'll walk through the architecture of a context layer: how context repos work (versioned, testable, portable), how simulation environments catch failures before deployment, how agent traces compound back into shared context, and why context engineering scales where fine-tuning and prompting don't. She'll also cover why your context needs to be open (MCP, Iceberg, deploy to any framework) — and what happens when it isn't.

Qianru Lao

  • Role: Member of Technical Staff
  • Company: OpenAI
  • Bio: Qianru Lao is a Member of Technical Staff on the Inference team at OpenAI, where she works on infrastructure for large-scale model serving. Previously, she contributed to the open-source Delta Lake project at Databricks and worked on distributed storage systems at Alibaba Cloud and infrastructure tooling at Google. She holds degrees in Computational Science and Engineering from Harvard and Computer Science from Sun Yat-sen University.
  • LinkedIn: https://linkedin.com/in/qianru-lao
  • Website: https://openai.com
  • Photo: /wf26/speakers/by-id/spk_qianru_lao.jpg
  • Sessions:

- Routing LLM Inference in Production: From Engine Signals to Policy — Day 4 — Session Day 3 11:10am-11:30am

Production LLM apps need more than a fast model: they need an inference routing layer that can choose where each request should run as engines, capacity, latency, and geography cost change. This talk shares a generalized Inference Load Balancer (ILB) proxy/controller architecture. A low-latency proxy applies routing weights and request-path signals, while a controller computes source-cluster-to-engine weights from demand, capacity/performance profiles, replica state, and geography cost. We will cover the practical debugging patterns AI engineers need: reading engine signals, explaining why a request went to one backend instead of another, handling retries and load shedding, and keeping routing behavior observable without exposing OpenAI-specific internals or non-public metrics.

Qingyang Wu

  • Role: Staff Research Scientist
  • Company: Together AI
  • Bio: Qingyang Wu is a Staff Research Scientist at Together AI working on text generation, dialog systems, multimodal models and inference research. He previously researched language models at Columbia University.
  • Twitter: https://x.com/QingyangWu1
  • Photo: /wf26/speakers/by-id/spk_qingyang_wu.jpg
  • Sessions:

- Open-Source Inference Engineering for the Agentic Era — Day 1 — Workshop Day 9:00am-11:00am

Agentic coding workloads demand long contexts, multi-turn conversations, and throughput at a scale that most inference engines weren't built for. TokenSpeed is a new open-source engine purpose-built for this regime, built collaboratively by NVIDIA DevTech, AMD Triton, Qwen Inference, Together AI, and others. In this 2-hour hands-on workshop, Together Inference Research Engineers and a TokenSpeed co-creator will cover TokenSpeed architecture, deploying your first model, optimizing for agentic workloads, kernel and hardware tuning, and throughput/latency trade-offs.

Rachna Srivastava

  • Role: Enterprise Architect
  • Company: DFPI
  • Bio: Rachana Srivastava is an Enterprise Architect and AI Solutions Leader with over 20 years of experience advancing generative AI, big data analytics, and large-scale distributed systems. She specializes in architecting intelligent, enterprise-grade AI solutions, including autonomous agents, knowledge graphs, and real‑time analytics platforms. At the Department of Financial Protection and Innovation (DFPI), she leads major digital transformation initiatives, designing high‑impact AI systems that dramatically improve regulatory workflow efficiency and document intelligence accuracy. Rachana’s prior roles include senior engineering leadership positions at Synopsys, Ayla Networks, Hewlett Packard, Thomson Reuters, Acxiom, and IBM, where she built high‑scale data pipelines, security analytics platforms, and AI-driven debugging tools. Her work consistently bridges deep technical expertise with strategic architectural vision. She holds an MS in Statistics, an MBA in Finance, and multiple certifications in deep learning, system design, and product management. Rachana is also a recognized speaker, sharing insights on AI architecture, scalable systems, and responsible innovation.
  • LinkedIn: https://www.linkedin.com/in/rachana-srivastava-ms-mba-78bab86
  • Website: https://dfpi.ca.gov/
  • Photo: /wf26/speakers/by-id/spk_rachna_srivastava.jpg
  • Sessions:

- Guardians of the State: How We Built an Air-Gapped AI Fortress for Consumer Data — Day 3 — Session Day 2 1:55pm-2:15pm

Every enterprise slide deck talks about "data privacy," but at the California Department of Financial Protection and Innovation (DFPI), a single leaked Social Security Number or bank account doesn’t just mean a bad PR day—it violates strict state consumer laws and triggers massive regulatory security breaches. When your raw data includes petabytes of unredacted fraud complaints, dark web scam networks, and banking statements, standard commercial public APIs are an absolute non-starter. This talk breaks down the exact enterprise architecture the DFPI uses to leverage frontier-level reasoning on highly sensitive data without crossing legal lines. We will walk through the code and infrastructure of our sovereign data pipeline. Attendees will learn: The Infrastructure: How we host and serve local, open-weights models (like Llama 3 or Mistral) in a strictly air-gapped, secure cloud environment. The Sanitization Stack: How we built a multi-stage PII scrubbing pipeline that uses high-speed deterministic regex combined with a small, specialized local LLM to handle messy, unstructured text. The Validation Loop: How we technically validate that zero sensitive data leaks into model context weights or logging files. No promissory corporate hoopla here—just real, hard-earned architecture patterns and code snippets from a state regulator showing how to ship secure, local AI. Key Takeaways for the Audience: Learn to build a dual-pass PII sanitization pipeline for unstructured financial data. Understand the resource and latency trade-offs of running air-gapped, open-weight models locally vs. commercial APIs. Discover how to set up automated validation frameworks to detect and stop context poisoning or logging leaks.

Rafael Levi

  • Role: DevRel
  • Company: Bright Data
  • Sessions:

- Video Discovery for Agentic World-Model Training — Day 2 — Session Day 1 2:50pm-3:10pm

Physical AI had its “Attention Is All You Need” moment with the rise of Vision-Language-Action models. The next bottleneck is data: not just more video, but the ability to find the exact real-world moments that teach models how the world works: gravity, motion, causality, human behavior, and object interactions. This session explores a new approach: discovering specific scenes from the vastness of the web. We’ll show how teams can search for moments like objects falling, people interacting with environments, or actions unfolding over time, then collect and structure only the relevant clips for training and evaluation. Attendees will learn how scene-level discovery changes multimodal data pipelines, reducing wasted collection, processing, storage, and review, while making it easier to build targeted datasets for VLA systems, robotics, physical AI, and agentic world models.

- Video Discovery for Agentic World-Model Training — Day 4 — Session Day 3 1:30pm-1:50pm

Physical AI had its “Attention Is All You Need” moment with the rise of Vision-Language-Action models. The next bottleneck is data: not just more video, but the ability to find the exact real-world moments that teach models how the world works: gravity, motion, causality, human behavior, and object interactions. This session explores a new approach: discovering specific scenes from the vastness of the web. We’ll show how teams can search for moments like objects falling, people interacting with environments, or actions unfolding over time, then collect and structure only the relevant clips for training and evaluation. Attendees will learn how scene-level discovery changes multimodal data pipelines, reducing wasted collection, processing, storage, and review, while making it easier to build targeted datasets for VLA systems, robotics, physical AI, and agentic world models.

Rafal Wilinski

  • Role: Founding Engineer
  • Company: Runlayer
  • Bio: Rafal Wilinski is a founding engineer at Runlayer, founder of Dynobase, and previously led AI agents work at Zapier.
  • Twitter: https://x.com/rafalwilinski
  • LinkedIn: https://pl.linkedin.com/in/rafwilinski
  • Website: https://rwilinski.ai
  • Photo: /wf26/speakers/by-id/spk_rafal_wilinski.jpg
  • Sessions:

- Self-Improving Agents That Teach the Company Back — Day 2 — Session Day 1 12:05pm-12:25pm

Agents forget too much. A run might solve a customer escalation, debug a deployment, or figure out the review pattern for a tricky code path, then the knowledge disappears into a transcript. At Runlayer, we started treating that knowledge as a product surface. Skills are reviewable, editable instructions that agents can load over MCP. An agent can start with a task, learn something useful while doing the work, and draft or update a private skill from that run. That skill loads into future runs for the same agent, stays inspectable by humans, and can eventually graduate into a team or org-level skill. The flywheel gets more interesting once a skill becomes useful beyond the agent that created it. A learned skill can move from one agent's private memory into shared organizational knowledge, then become available through the Runlayer plugin inside Claude Code, ChatGPT, and other AI clients employees already use. The agent does the work, captures the playbook, and the company gets better at that work everywhere agents are used. This talk walks through the architecture and product choices behind self-improving skills: post-run distillation, skill mutation tools, private-by-default scoping, runtime loading, UI inspection, promotion into shared skills, and the safety boundary between this agent learned something and everyone should now use it. The goal is an agent that leaves behind a better handbook for the next person, the next run, and eventually the whole organization.

Raghav Saboo

  • Role: Staff Machine Learning Engineer, Tech Lead Search & Personalization
  • Company: DoorDash, Inc.
  • Bio: Raghav Saboo is a Staff Machine Learning Engineer and Tech Lead for Search & Personalization within DoorDash's New Verticals business line. He is currently focused on scaling agentic and generative AI integrations within search and recommendation systems. Previously, he was at Amazon, where he developed the first generation of large language models and distilled models for Alexa AI's new language launches. Prior to Amazon, Raghav worked as a Machine Learning consultant, building zero-to-one solutions for clients across multiple industries. He has publications in WSDM, SIGIR, and RecSys, and holds a Master's from Duke University alongside a combined BEng and MEng from Imperial College London.
  • LinkedIn: https://www.linkedin.com/in/raghavsaboo/
  • Website: https://buildshipai.substack.com/
  • Photo: /wf26/speakers/by-id/spk_raghav_saboo.jpg
  • Sessions:

- LLM Recsys at DoorDash — Day 2 — Session Day 1 11:40am-12:00pm

Ramana Siddanth Emani

  • Role: Data Scientist
  • Company: Auditoria AI
  • Bio: Data Scientist @ Auditoria AI, building SmartResearch Agent.
  • Twitter: https://x.com/siddanth2486
  • LinkedIn: https://www.linkedin.com/in/siddanth-emani
  • Website: https://www.auditoria.ai/
  • Blog: https://siddanthemani.github.io/
  • Photo: /wf26/speakers/by-id/spk_ramana_siddanth_emani.jpg
  • Sessions:

- Your Finance Agent's Bottleneck Is You — Day 4 — Session Day 3 2:25pm-2:45pm

Most "AI for Finance" demos look great and almost none survive past pilot. If you've pushed an agent past one workflow, one tenant, or one Workday schema, you know the bottleneck isn't the model - it's the engineer behind the agent, who can't iterate fast enough to keep up with real AP data, real RBAC, and real query volume. What if you built your dev loop with the same primitives you're shipping to the finance team? In this talk, I'll show the subagent + skills + MCP stack - a production multi-agent system over AP, PO, vendor, and multi ERP systems, a LangGraph pattern that survives production, and the three failure modes that kill finance pilots before they ship.

Rania Khalaf

  • Role: Chief AI Officer
  • Company: WSO2
  • Bio: Dr. Rania Khalaf is Chief AI Officer and GM of AI at WSO2, leading the company's AI and Agentic roadmap, including the new Agent Platform. With deep expertise at the intersection of AI, cloud platforms, and enterprise software, Dr. Khalaf has a proven track record of building AI-native products from zero to market and driving company-wide transformations. Previously, she was Chief Information and Data Officer at Inari, a unicorn biotech startup where she built its digital and AI organization through a period of extraordinary growth, delivering deep learning and knowledge graph platforms for gene discovery. She was Director of IBM Research AI Engineering and Distinguished Research Staff Member leading an organization of at the frontier of AI and cloud, driving innovations to product including award-winning Watson products and open-source projects that became industry standards. She holds Bachelor's and Master's degrees from MIT and a PhD from the University of Stuttgart, with technical depth resulting in 90+ publications and over 8,000 citations. Rania serves on academic and industry boards and is a frequent speaker at institutions including MIT, Harvard, Yale, and UC Berkeley.
  • LinkedIn: https://www.linkedin.com/in/raniakhalaf
  • Photo: /wf26/speakers/by-id/spk_rania_khalaf.jpg
  • Sessions:

- The Chief AI Officer: A framework for the emerging Swiss Army Knife of roles — Day 3 — Session Day 2 3:45pm-4:05pm

The Chief AI Officer (CAIO) is currently the C-Suite’s most "multiversal" role. In a single day, you must inhabit different realities: you are a Tinker building scalable experiments in bleeding edge tech, an Architect navigating the hype cycle to execute high-stakes product strategy, and a Coach guiding a workforce and your customers on meaningful AI adoption - minus the fluff. It is a role defined by high-speed context switching and the pressure to deliver "Everything, Everywhere, All at Once." As one of the first Chief AI Officers, and leaning into my experience across Fortune 500, unicorns starups, and PE backed firms, I share a dynamic 20/60/20 Framework for the modern CAIO. We’ll explore how to navigate this multi-tool role by treating the organization as an "Equalizer"—learning when to push the sliders of focus based on your industry’s maturity and where you are in the AI journey.

Rashi Agrawal

  • Role: Head of Agentic AI
  • Company: Hinge Health
  • Bio: Rashi Agrawal is the Head of Agentic AI at Hinge Health, where she engineers high-stakes, secure, and HIPAA-compliant systems. Operating at the pioneer edge of generative AI technology, she architects state-of-the-art frameworks that move beyond simple automation to solve critical problems and drive dramatic business growth.

Previously, as Head of AI at Goodleap, a leading FinTech in Green Energy, Rashi spearheaded enterprise-wide transformation initiatives that optimized complex loan processing and customer engagement. Blending holistic vision with deep technical expertise, she successfully deployed intelligent decision-making and risk assessment platforms that delivered measurable value.

Earlier in her career, Rashi led engineering teams at Yahoo, maturing early-stage technical challenges into massive growth engines for their multi-billion-dollar Advertising business. Her leadership ensures innovation is grounded in business strategy, establishing AI as a competitive moat rather than just an operational layer.

Beyond the office, Rashi is a global explorer who has traveled to over 50 countries. A prominent thought leader in the engineering community, she is an Indian immigrant with a Master’s in Software Engineering from San Jose State University, an alumna of the Stanford Graduate School of Business Executive Education program, and the founder of Women In Tech AI (WIT AI), an organization dedicated to empowering and elevating women leaders in the field.

  • LinkedIn: https://www.linkedin.com/in/rashi283/
  • Website: https://sessionize.com/rashiagrawal/
  • Photo: /wf26/speakers/by-id/spk_rashi_agrawal.jpg
  • Sessions:

- Guardrails First: Engineering Member-Facing Health AI — Day 4 — Session Day 3 11:10am-11:30am

Everywhere else in the company, an AI pilot can reach production in weeks. For our member-facing clinical assistant, it can't, and that single constraint redesigned our entire architecture. This is a field report on building conversational AI in a regulated digital health setting, where "move fast and break things" isn't a culture choice. It's a liability. We'll get concrete about what changes when every output has to be clinically safe, auditable, and compliant: PHI is protected by architecture, not policy. Production and non-production are hard-isolated, dashboards are sanitized, and engineers outside the US never touch protected health information. Must-not-fail behavior never lives in a prompt. Emergency escalation and intent routing run as deterministic rules at the top of every conversation turn, before the model is consulted. If you can't afford to get something wrong, you don't leave it to a probabilistic system. Clinical safety is a continuous eval layer. ~30 LLM-as-judge evaluators score clinical accuracy, clinical safety, escalation routing, and recommendation relevance, continuously, not once. Every output is auditable. Each turn, tool call, and reasoning step is traced so outputs can be reviewed and meet regulated reporting obligations. The throughline: in regulated healthcare, compliance constraints aren't a tax you pay around the architecture. They become the architecture. We'll talk about why guardrails-first is the only way to ship member-facing health AI, and why "painfully slow" is sometimes exactly right. (This is non-diagnostic, member-facing AI. The talk is about engineering discipline under regulation, not medical claims.) Key takeaways - In regulated health AI, "move fast" is the wrong default. Design for deliberate, careful launches. - Must-not-fail behaviors belong in deterministic rules at the top of every turn, never in the prompt. - Protect PHI through architecture: isolate prod from non-prod, sanitize dashboards, restrict access by role and geography. - Make every output auditable. Trace each turn, tool call, and reasoning step so safety is reviewable, not assumed. - Treat clinical safety as a continuous LLM-as-judge layer, not a one-time gate.

Rayan Garg

  • Role: CEO
  • Company: Theta Software
  • Bio: CEO at Theta Software, building RL environments. Previously at DeepSilicon.
  • Twitter: https://x.com/RayanGarg
  • LinkedIn: https://www.linkedin.com/in/rayan-garg/
  • Photo: /wf26/speakers/by-id/spk_rayan_garg.jpg
  • Sessions:

- Rethinking Environments for Long Horizon Work — Day 2 — Session Day 1 11:40am-12:00pm

As autonomous agents push towards longer-horizon tasks, a number of challenges emerge in measuring and improving frontier model capabilities. In this talk, we discuss how long-horizon tasks are defined and measured, how RL environments and verifiers have to scale for more complex and open-ended tasks, and how we navigate these problems at Theta.

Raymond Feng

  • Role: Researcher
  • Company: Applied Compute
  • Bio: Researcher at Applied Compute. Building the post-training stack, training specialized workhorse models for enterprises, and researching new techniques for model customization. Graduated from MIT.
  • Twitter: https://x.com/raymondmfeng
  • Website: https://raymondhfeng.github.io/
  • Photo: /wf26/speakers/by-id/spk_raymond_feng.jpg
  • Sessions:

- Learning on the job: the future of post-training — Day 3 — Session Day 2 12:05pm-12:25pm

Rémi Louf

  • Role: CEO
  • Company: .txt
  • Bio: CEO and co-founder at .txt, building reliable agent infrastructure. Also known for Outlines.
  • Twitter: https://x.com/remilouf
  • LinkedIn: https://www.linkedin.com/in/remilouf/
  • Website: https://thetypicalset.com
  • Photo: /wf26/speakers/by-id/spk_remi_louf.jpg
  • Sessions:

- Agent Frameworks Considered Harmful — Day 4 — Session Day 3 1:55pm-2:15pm

Remy Guercio

  • Role: Strategic Projects
  • Company: Tailscale
  • Bio: Remy Guercio works on Strategic Projects at Tailscale. His recent AI talks and interviews focus on network-based sandboxes, secure LLM access, and identity-aware infrastructure for AI agents.
  • LinkedIn: https://www.linkedin.com/in/remyguercio
  • Photo: /wf26/speakers/by-id/spk_remy_guercio.jpg
  • Sessions:

- An AI Future Without the Lock-In — Day 4 — Session Day 3 3:20pm-3:40pm

Every organization navigating AI adoption faces the same trap: the market moves faster than any procurement cycle, no single vendor leads across model quality, interface, sandbox, and data access for more than a few months at a time, and the obvious answer of consolidating behind one platform trades short-term control for long-term lock-in. This session makes the case that the winning strategy is not picking the best walled garden. It is building a connective layer underneath all of them. Tailscale's Remy Guercio walks through the four components required for transformative AI, why vertically integrated stacks are structurally fragile, and how organizations can maintain visibility and control without betting on a single vendor's continued dominance. The second half of the session covers three new capabilities in Aperture, Tailscale's identity-aware AI gateway: Identity-Aware Universal Data Connectors (Public Alpha), which translate Tailscale network identity into scoped access to internal data sources via MCP and API endpoints; a Responsive Chat UI (Public Alpha) that gives non-technical users a mobile-friendly interface to every LLM configured in Aperture; and Sandbox Support (Private Alpha), bringing ephemeral and persistent compute environments into the same identity model. Attendees leave with a framework for evaluating AI platforms that does not depend on picking a winner, and a concrete path to deploying provider-agnostic AI tooling on infrastructure they already run.

Richard Socher

  • Role: CEO & Co-Founder
  • Company: You.com / Recursive Superintelligence
  • Bio: AI researcher and entrepreneur; CEO and Co-Founder of You.com and Recursive Superintelligence. Previously Chief Scientist and EVP at Salesforce, with a Stanford PhD in Computer Science and widely cited work in NLP and deep learning.
  • Twitter: https://x.com/RichardSocher
  • Photo: /wf26/speakers/by-id/spk_richard_socher.jpg
  • Sessions:

- First Steps Toward Automated AI Research — Day 3 — Session Day 2 10:45am-11:05am

Rishab Kumar

  • Role: Staff Developer Evangelist
  • Company: Twilio
  • Bio: Rishab Kumar is a Staff Developer Evangelist at Twilio, GitHub Star, Google Developer Expert, and AWS Community Builder who works on developer relations and agentic voice/messaging applications with Twilio and Amazon Bedrock.
  • Twitter: https://twitter.com/rishabk7
  • LinkedIn: https://www.linkedin.com/in/rishabkumar7
  • Photo: /wf26/speakers/by-id/spk_rishab_kumar.jpg
  • Sessions:

- From Stateless to Stateful: Orchestrating Real-Time Voice & Messaging Agents with Twilio and Amazon Bedrock — Day 3 — Session Day 2 12:05pm-12:25pm

We have all had that maddening customer service experience: you text a support line about a delayed flight, receive a confirmation, but when you call in a minute later, the voice agent asks, "How can I help you today?" completely blind to the SMS you just sent. This is the "Channel Amnesia" problem. While businesses are pouring billions into generative AI, most agents are still built on stateless architectures that forget customer context the second a session ends. In this session, we will cure AI amnesia. You will learn how to orchestrate stateful, production-grade AI agents across SMS and Voice using Twilio Agent Connect and Amazon Bedrock. We will dive into why traditional serverless compute fails stateful agents, how to leverage AWS Fargate for isolated, long-lived sessions, and how to configure Bedrock AgentCore over WebSockets to hit sub-50ms streaming voice latency. No slide-ware here expect a live, cross-channel demo and open-source code you can deploy tomorrow.

Rita Zhang

  • Company: Coreweave
  • Bio: Rita Zhang works on CoreWeave's inference platform and AI/ML workload infrastructure. Her background includes principal software engineering work at Microsoft on cloud-native and AI platform systems.
  • Twitter: https://x.com/ritazzhang
  • Website: https://ritazh.com
  • Photo: /wf26/speakers/by-id/spk_rita_zhang.jpg
  • Sessions:

- Vertical Mobility: Building an AI Inference Platform That Scales from MVP to Trillion-Parameter Workloads — Day 4 — Session Day 3 12:05pm-12:25pm

The future of AI inference is not one-size-fits-all. This talk explores a multi-tiered architecture that supports the full AI lifecycle, from rapid, pay-per-token experimentation to dedicated, SLO-bound production and extreme-scale, self-managed deployments. Learn about lessons learned from CoreWeave’s inference stack as performance, cost, and control requirements evolve.

Ritvik Pandya

  • Role: Engineering Manager
  • Company: JP Morgan Chase
  • Bio: Ritvik Pandya is an engineering leader with over seventeen years building distributed systems and large-scale payment infrastructure, currently at JPMorgan Chase, with prior experience at other leading technology companies. He works at the intersection of platform engineering, observability, and reliability — designing high-throughput systems that stay dependable under real-world load. He writes and speaks on building dependable systems at scale, and is a member of the IEEE Consumer Technology Society.
  • LinkedIn: https://www.linkedin.com/in/ritvik-pandya/
  • Website: https://www.jpmorganchase.com/
  • Photo: /wf26/speakers/by-id/spk_ritvik_pandya.jpg
  • Sessions:

- AI : Learned Execution Graphs for Real-Time Anomaly Detection & Drift Classification in APIs — Day 4 — Session Day 3 1:30pm-1:50pm

API ingress controllers process requests through ordered sequences of middleware steps — authentication, authorization, validation, rate limiting, routing, service invocation, caching. We model this pipeline as a directed acyclic graph (DAG) learned from structured telemetry events, then apply graph-based anomaly detection and drift classification in real time at 1,600+ TPS. The system emits one structured event per processing step, constructs per-endpoint execution graphs using sequence mining with statistical confidence thresholds, and learns per-node baselines (latency, dependency, execution frequency). Three graph intelligence capabilities emerge: (1) Graph-based anomaly attribution — compute per-node deviation ratios against learned baselines to identify the exact bottleneck node and its dependency. In production, this pinpointed a 41x deviation at a single graph node that was invisible to service-level monitoring, reducing root cause identification from 2-3 hours to under 30 seconds. (2) Graph structural drift detection — compare observed node sequences against the learned graph topology to detect missing nodes (mandatory processing step silently skipped), reordered nodes (middleware misconfiguration), and unexpected new nodes (unauthorized middleware injection). Traditional monitoring reported "system healthy" when a mandatory node was removed — latency dropped, errors at zero — only the learned graph comparison detected the structural change. (3) Per-client graph fingerprinting — learn client-specific execution graph profiles using exponential moving averages. Detect when a client's graph traversal pattern changes, classify the cause (client behavior change vs. configuration drift vs. infrastructure failover) using KL divergence on node-visit distributions, and apply graph-aware adaptive control scoped to specific nodes rather than entire endpoints. The execution graph model also enables a novel approach to retry storm detection: analyzing idempotency key entropy at graph nodes to classify traffic as legitimate growth vs. retry amplification, and returning cached responses at the specific graph node rather than rejecting requests — breaking the retry amplification loop. Production system processing high TPS. Attendees will learn the graph construction methodology, the anomaly attribution algorithm, and concrete patterns for adding learned graph intelligence to any middleware pipeline.

Rob Cheung

  • Role: Co-founder
  • Company: Zo Computer
  • Bio: Rob Cheung is Co-founder of Zo Computer. He was previously the first engineer at Substack and earlier worked on the Venmo team before reuniting with Ben Guo to build Zo.
  • Twitter: https://x.com/perceptnet
  • LinkedIn: https://www.linkedin.com/in/robertkcheung
  • Website: https://rob.zo.space
  • Photo: /wf26/speakers/by-id/spk_rob_cheung.jpg
  • Sessions:

- Everyone Gets A Software Company — Day 2 — Session Day 1 11:40am-12:00pm

Rob Wachen

  • Role: Co-founder and President
  • Company: Etched
  • Bio: Rob Wachen is the co-founder and president of Etched. Etched is building rack-scale infrastructure designed to serve frontier models at scale. A Thiel Fellow and Harvard dropout, Rob previously co-founded Prod, a startup accelerator with a cohort valuation of $100B+.
  • Photo: /wf26/speakers/by-id/spk_rob_wachen.jpg
  • Sessions:

- Latent Space Live: the Inference Inflection from First Principles — Day 4 — Session Day 3 12:30pm-1:30pm

- Rob Wachen — transformer-only ASICs for inference — Day 4 — Session Day 3 1:55pm-2:15pm

Etched's Sohu approach to transformer inference on custom silicon.

Robert Brennan

  • Role: CEO
  • Company: OpenHands
  • Bio: Robert Brennan is the CEO of All Hands AI, the company behind OpenHands, an MIT-licensed software development agent. He has previously worked in natural language processing (for Google search) and has spend the last decade building commercial open source software.
  • Twitter: https://x.com/rbren_dev
  • LinkedIn: https://www.linkedin.com/in/robert-a-brennan/
  • Website: https://rbren.io
  • Blog: https://rbren.io
  • Photo: /wf26/speakers/by-id/spk_robert_brennan.jpg
  • Sessions:

- Sandboxes Aren't Optional: Runtime Isolation Patterns for Coding Agents at Scale — Day 3 — Session Day 2 3:20pm-3:40pm

Last year, an AI coding agent wiped a production database during a code freeze, ignored explicit instructions to stop, then told the developer recovery was impossible. (It wasn't.) That's what happens when your security model is "we told the agent to be careful." When agents can write code, run tests, make API calls, and push commits, security is no longer a prompt engineering problem. It's a runtime isolation problem. This talk covers the patterns we follow at OpenHands and that you can steal wholesale: Docker and Kubernetes isolation, per-agent file system scoping, network egress controls, RBAC for multi-tenant deployments, and the full audit trail every enterprise security team demands. We'll walk through the three most common failure modes we see when teams skip proper isolation, including one case where an agent helpfully committed secrets to a public repo. You'll see a live demo of 50 parallel sandboxed agents running against a real codebase, with resource limits, timeout enforcement, and graceful degradation when agents hit unexpected states. You'll leave with a sandbox checklist and reference Kubernetes config. Bounded autonomy isn't a limitation on agent capability. It's what makes production trust possible.

Robert McHardy

  • Role: Pre-training Lead
  • Company: poolside
  • Bio: Team and tech lead for pre-training at poolside, where he trains large language models for code. Recently led the pre-training of Laguna XS.2 and M.1, poolside's first two public open-weight models. Before that, Robert worked as a Senior Researcher at AssemblyAI where he trained multilingual speech models, and previously built AI for cancer and infectious-disease research at InstaDeep and BioNTech's joint lab. MSc in Machine Learning from UCL.
  • Twitter: https://x.com/robert_mchardy
  • LinkedIn: https://www.linkedin.com/in/robert-mchardy
  • Website: https://www.robertmchardy.de
  • Photo: /wf26/speakers/by-id/spk_robert_mchardy.jpg
  • Sessions:

- The Messy Reality of Scale: Synthetic Data and Pre-Training at Poolside — Day 2 — Session Day 1 11:10am-11:30am

TBD — focus on data quality considerations for LLM pretraining and code generation.

Roberto Milev

  • Role: Chief Architect
  • Company: Navan
  • Bio: Roberto Milev is Chief Architect at Navan, where he leads AI architecture across 120+ microservices and 20 engineering teams. His work on autonomous agent systems has been published at ACM CAIS 2026, and he has presented at OpenAI and the UnLock conference on AI architecture topics.
  • LinkedIn: https://www.linkedin.com/in/robertomilev/
  • Website: https://navan.com
  • Photo: /wf26/speakers/by-id/spk_roberto_milev.jpg
  • Sessions:

- Agents Are Where Microservices Were in 2015. We're Making All the Same Mistakes. — Day 3 — Session Day 2 2:50pm-3:10pm

Remember when everyone was shipping microservices without service discovery, circuit breakers, or distributed tracing? Agents are in that exact phase right now. Everyone's building them. Almost nobody is thinking about the infrastructure underneath. We've been deploying production agents across 120+ microservices. Here's the stack that's emerging: Runtime — containerized execution, session persistence, workspace snapshots. Solved-ish, mostly duct tape. Memory — RAG had a good run. It's not enough. Tiered memory — short-term, long-term with semantic/episodic strategies, agents deciding what to remember and forget. Observability — you can't tail -f an agent. Execution traces, reasoning chains, confidence signals — agents need their own observability stack. Testing — the biggest gap. Unit testing non-deterministic behavior, regression testing prompt changes, knowing your agent got worse before users do. Skills and tools — MCP and skill definitions as the standard interface layer — the REST APIs of the agent era. Context engineering — what the agent knows at decision time. The new performance tuning. Guardrails and auth — scoped credentials, budget limits, knowing when to stop. Least-privilege for agents. Orchestration — single vs. multi-agent, choreography vs. orchestration. Same tradeoffs as microservices, new failure modes. This talk maps the stack, draws the parallels to how we eventually got microservices right, and calls out what's still painfully missing.

Rodrigo Coelho

  • Role: CEO
  • Company: Edge & Node
  • Bio: Rodrigo Coelho (pronounced KWAY-lee-o) is a serial entrepreneur and seasoned technology leader and early web3 innovator with over two decades of experience in engineering, entrepreneurship, and decentralized infrastructure. In 2025, he became Chief Executive Officer of Edge & Node, the core developer of The Graph, where he guides the company’s vision, strategy, and global growth. Rodrigo was the first hire for The Graph, playing a pivotal role in its early architecture, ecosystem development, and community expansion. His career began in the late 1990s, when he co-founded an application development firm during the early web era, delivering solutions for Fortune 500 clients. He later founded and successfully exited two technology startups, further solidifying his track record in innovation and leadership. With a foundation in Industrial Engineering, Rodrigo is recognized for his strategic vision and deep technical expertise. He is committed to advancing the decentralized web, empowering developers, and fostering a thriving global community through research, partnerships, and open innovation. He is based in the San Francisco Bay area.
  • Twitter: https://x.com/rodventures
  • LinkedIn: https://www.linkedin.com/in/rodrigoco/
  • Photo: /wf26/speakers/by-id/spk_rodrigo_coelho.jpg
  • Sessions:

- Agent Spending Without Controls: The Missing Infrastructure Layer for AI Pa… — Day 4 — Session Day 3 1:30pm-1:50pm

AI agents are already transacting autonomously, but the infrastructure to govern how they spend does not yet exist. Traditional payment rails were built for humans, not for systems making thousands of micro-decisions per minute on someone else's behalf. This session brings together Edge & Node's CEO and Senior Solutions Architect to cover both the strategic case and the technical implementation. Rodrigo opens with the infrastructure gap: why programmable budget governance is a foundational requirement for any team deploying agents in production, and what it means to have real-time visibility and a full audit trail across every agent transaction. He also covers Edge & Node's founding membership in the x402 Foundation and why open standards for agent-to-agent and agent-to-service payments matter for the broader ecosystem. Pranav then goes deep on the stack: how structured, indexed blockchain data from The Graph powers reliable agent decision-making, how Amp Enterprise extends that into auditable datasets at production scale, and what it looks like in practice to wire ampersend into agent frameworks including LangChain, CrewAI, AutoGPT, and custom-built systems. He walks through the x402 and A2A standards that make agent payments interoperable and what a real deployment looks like end to end. The session closes with the bigger picture: bots are already half of all internet traffic, TradFi and DeFi are converging, and the infrastructure stack that wins is the one built for where they meet.

Roland Gavrilescu

  • Role: Co-Founder, CEO
  • Company: Introspection
  • Bio: Co-founder and CEO at Introspection, building infra for self-improving AI systems. Previously at xAI and Superhuman.
  • Twitter: https://x.com/rolandgvc
  • LinkedIn: https://www.linkedin.com/in/roland-gavrilescu/
  • Website: https://www.introspection.dev/blog
  • Blog: https://www.introspection.dev/blog
  • Photo: /wf26/speakers/by-id/spk_roland_gavrilescu.jpg
  • Sessions:

- Autoresearch in the wild — Day 3 — Session Day 2 3:20pm-3:40pm

We have reached model capability overhang. Models are now bottleneck by the systems built around them. In this session we discuss how the next generation of compound AI systems need to be designed for self-improvement, how to set up feedback loops that automate the continuous refinement of the end-to-end architecture.

Romain Huet

  • Role: Head of Developer Experience
  • Company: OpenAI
  • Bio: Romain Huet is a French entrepreneur and engineer with a passion for developer platforms. He currently leads Developer Experience at OpenAI, inspiring and supporting founders and builders to integrate AI into their applications, and directing the creation of elegant and powerful tools for all developers. Previously, Romain spent five years at Stripe, leading product management for the developer platform and overseeing global developer relations. Before Stripe, he helped with the relaunch of Twitter’s developer platform and co-founded Jolicloud in Paris, where he developed a cloud-based operating system and the Jolibook, a personal computer.
  • Twitter: https://x.com/romainhuet
  • LinkedIn: https://www.linkedin.com/in/romainhuet/
  • Photo: /wf26/speakers/by-id/spk_romain_huet.jpg
  • Sessions:

- The Golden Age of AI Engineering — Day 2 — Session Day 1 9:25am-9:45am

TBD

Ronak Chokshi

  • Role: Director of Product Marketing
  • Company: Microsoft
  • Bio: Ronak Chokshi is Director of Product Marketing at Microsoft, with recent public activity around Microsoft Copilot, Copilot CLI, Azure Content Understanding, and AI marketplace/product announcements.
  • Photo: /wf26/speakers/by-id/spk_ronak_chokshi.jpg
  • Sessions:

- Power agents with Microsoft IQ — Day 3 — Session Day 2 2:25pm-2:45pm

Ronak Malde

  • Role: Co-Founder and CEO
  • Company: Trajectory
  • Bio: Co-Founder & CEO of Trajectory.

Previously trained SWE-1 at Windsurf, then gemini post-training at DeepMind after acquisition

  • Twitter: https://x.com/rronak_
  • LinkedIn: https://www.linkedin.com/in/ronak-malde
  • Photo: /wf26/speakers/by-id/spk_ronak_malde.jpg
  • Sessions:

- Scaling up Continual Learning — Day 3 — Session Day 2 11:10am-11:30am

Trajectory (stealth) is a research and product lab building the platform for continual learning, where frontier models are continuously trained as they interact with the real world. We are a team of ex-Deepmind, OpenAI, Meta superintelligence, Apple, and raised 15M from Conviction. The Fair will be after we have launched to the world. We will be walking through the primitives of continual learning, and how we can scale fast by leveraging these tools.

Ross Taylor

  • Role: CEO
  • Company: General Reasoning
  • Bio: CEO at General Reasoning Inc building long-horizon AI systems. Previously reasoning lead at Meta AI, Llama 2, Llama 3, Galactica, and founder of Papers with Code (acquired by Meta).
  • Twitter: https://x.com/rosstaylor90
  • LinkedIn: https://uk.linkedin.com/in/rosstaylor90
  • Website: https://rossjtaylor.com
  • Photo: /wf26/speakers/by-id/spk_ross_taylor.jpg
  • Sessions:

- Scaling to Long-Horizons: Algorithms, Environments, Compute — Day 2 — Session Day 1 2:25pm-2:45pm

What does it take to scale language models to year long tasks? In this talk we'll cover the algorithm, environment and compute considerations for scaling language models to long horizons. We'll cover the latest reinforcement learning approaches, how to build hard, high-fidelity long-horizon environments, and how to build scalable infrastructure for these tasks.

Ross Wollman

  • Photo: /wf26/speakers/by-id/spk_ross_wollman.jpg
  • Sessions:

- Benchmarking VS Code with VSC-Bench: How to measure agent performance — Day 4 — Session Day 3 11:40am-12:00pm

"Agent quality in VS Code depends on a stack of variables: model, version, prompts, extensions, MCP servers, and more. Each one affects quality, tokens, and latency—and they interact in ways that are hard to reason about.

In this session, we’ll show how to benchmark different configurations using VSC-Bench so you can compare results side by side and understand what actually works. Instead of guessing which setup is better, you’ll learn how to measure tradeoffs and make data-driven decisions."

Rowan Christmas

  • Role: Staff Product Manager
  • Company: Docker
  • Bio: Rowan Christmas is a Staff Product Manager at Docker with prior strategy, technology and consulting leadership experience, working in Docker's AI/agent platform context.
  • Sessions:

- YOLO Mode, Safely: microVM Sandboxes for Any Agent — Day 4 — Session Day 3 1:30pm-1:50pm

This talk shows the alternative: every agent session in its own microVM, with its own kernel and a hard boundary to the host. You decide what lives inside that boundary: filesystem, network, the tools it's allowed to call. The sandbox runs Claude Code, Cursor, Codex, or whatever else you're driving. You'll see an agent live in full YOLO mode inside a sandbox, a real attempt to escape, and the boundary that holds up.

Rustem Feyzkhanov

  • Role: Senior Engineering Manager - AI Platform
  • Company: Snorkel AI
  • Bio: Rustem Feyzkhanov is a Senior Engineering Manager on the AI Platform Engineering team at Snorkel AI, where he leads work on infrastructure and platform systems for building expert-authored datasets, simulation environments, and evaluation pipelines for frontier AI models and production agents. His work focuses on scalable agent evaluation, secure sandboxed execution, benchmark quality, and the systems needed to run large volumes of agent simulations reliably. Before Snorkel, Rustem was an ML Engineering Manager at Instrumental, applying AI to manufacturing, and an engineer at Astro Digital, building AI systems for satellite imagery. He is passionate about AI agents, evaluation infrastructure, serverless computing, and practical machine learning systems. Rustem is the author of the course and book Serverless Deep Learning with TensorFlow and AWS Lambda and Practical Deep Learning on the Cloud, and he is the main contributor to the open-source lambda-packs repository for serverless Python packages.
  • Twitter: https://x.com/ryfeus
  • LinkedIn: https://www.linkedin.com/in/ryfeus
  • Website: https://ryfeus.io
  • Photo: /wf26/speakers/by-id/spk_rustem_feyzkhanov.jpg
  • Sessions:

- From Agent Traces to Agent Simulations: The next era of agent evaluation — Day 3 — Session Day 2 12:05pm-12:25pm

Agent evaluation is moving beyond reviewing static traces after the fact. This talk explores how executable simulation environments let teams repeatedly test agents across realistic tasks, compare models and harnesses, and uncover failure modes that trace review alone misses. Drawing from Snorkel's experience building simulation datasets at scale for major labs and contributions to projects like Agents' Last Exam and Terminal-Bench, we'll cover concrete engineering patterns for building these environments: defining clear specs and requirements, implementing evaluators for simulation environments and tasks themselves, keeping environments decoupled from any single agent or model, and designing verifiers that evaluate both final outputs and agent traces. Attendees will leave with a practical mental model for creating environments that are lightweight enough to run at scale, but realistic enough to mock production systems such as databases, APIs, and tools in ways that meaningfully challenge agents.

Ryan Cooke

  • Company: WorkOS
  • Photo: /wf26/speakers/by-id/spk_ryan_cooke.jpg
  • Sessions:

- No, That's Not a Software Factory — Day 4 — Session Day 3 10:45am-11:05am

Drop an agent in a sandbox, point it at your repo, watch it ship code. Whether you're buying from a vendor or building it yourself, everyone is following the same playbook. But a sandbox isn't a software factory. At WorkOS, we built Project Horizon, and it taught us that infrastructure is only the first challenge. The unlock is encoding how your org actually builds software: the way work gets planned, scoped, and verified, the conventions and judgment calls that define your engineering culture. Our product engineering process served as the blueprint for every agent workflow we built in Horizon.

Ryan Dahl

  • Role: CEO
  • Company: Deno
  • Bio: Ryan Dahl is a programmer and the creator of Node.js and co-founder of Deno. Born in California in the early 1980s, he studied mathematics at the University of Rochester before pursuing graduate work in algebraic topology at UC San Diego. In 2009 he created Node.js, which brought JavaScript to the server and reshaped how a generation of developers builds software. He later spent time at Google Brain researching early generative image models, and in 2018 co-founded Deno, a modern, secure runtime for JavaScript and TypeScript.
  • Twitter: https://x.com/rough__sea
  • Photo: /wf26/speakers/by-id/spk_ryan_dahl.jpg
  • Sessions:

- Security Firewall for Agents — Day 2 — Session Day 1 10:45am-11:05am

Why personal agents that run untrusted LLM code need a sandboxed OS/runtime model, not just a compute sandbox.

Ryan Marten

  • Role: Member of Technical Staff
  • Company: Laude Institute
  • Bio: Ryan Marten is building Harbor at the Laude Institute and works on research-to-production efforts including Harbor, Terminal-Bench, and OpenThoughts-Agent.
  • Sessions:

- Everything Is a Rollout — Day 3 — Session Day 2 3:45pm-4:05pm

tba

Sachin Malhotra

  • Role: Member of Technical Staff
  • Company: Anthropic
  • Bio: Sachin Malhotra is a Member of Technical Staff on the Developer Infrastructure team at Anthropic, where he builds and operates the CI/CD systems underpinning one of the world's largest ML monorepos. His work spans test reliability, CI observability, and—increasingly—the challenge of giving AI agents real write access to production systems, safely.

He has spent the past year thinking about what it looks like when developer tooling has to scale with the pace of frontier ML research. Before Anthropic, Sachin held engineering roles at Etsy and Microsoft. He holds an MS in Computer Science from the University of Southern California.

  • Twitter: https://x.com/edorado93
  • LinkedIn: https://www.linkedin.com/in/edorado93
  • Website: http://anthropic.com/
  • Blog: https://bruteforced.dev/
  • Photo: /wf26/speakers/by-id/spk_sachin_malhotra.jpg
  • Sessions:

- Give the Agent a Budget, Not a Token — Day 4 — Session Day 3 3:20pm-3:40pm

Every agent demo runs with a god-token. Then it ships, and someone has to explain why the helpful AI just rm -rf'd the staging database "to clean up." I run platform infrastructure at a frontier lab, and for the last year my job has partly been: let coding agents do real work against real systems, without ever having to write the postmortem. This talk is the permission model that fell out of that - not RBAC-with-extra-steps, but primitives designed for an actor that's smart, fast, tireless, and occasionally confidently wrong. The four primitives: - Asymmetric verbs - the agent can quarantine but not delete, retry but not approve, propose but not merge. The verb list is the security boundary. Stop thinking in resources, start thinking in reversible vs. irreversible actions. - Regenerating budgets - every agent identity gets N disruptive actions per window. Burn the budget, you're benched until it refills. No human-in-the-loop until the budget's gone — which means 95% autonomy with a hard ceiling on blast radius. - The undo test - if the agent can't undo it, the agent can't do it without a second key. One line, surprisingly load-bearing. - Tripwires over allow-lists - let the agent roam, but instrument the three actions that would actually hurt. Cheaper than enumerating everything safe. I'll show the ~200-line policy layer that implements all four, the failure modes each one exists to catch, and the one design I shipped that turned out to be security theater. Tool-agnostic - works whether your agent is touching CI, a database, a cloud account, or your users' files. If you're shipping an agent that does anything more than read, you'll leave with a threat model and a starting policy you can paste into your repo on the flight home.

Safia Abdalla

  • Role: Software Engineer
  • Company: Warp
  • Bio: Safia Abdalla is a software engineer at Warp working on Oz, Warp's cloud orchestration platform for agents, including multi-agent orchestration and self-improving verification loops.
  • Photo: /wf26/speakers/by-id/spk_safia_abdalla.jpg
  • Sessions:

- The Agent Behind the Curtain: Building the Oz Cloud Agent Platform — Day 4 — Session Day 3 10:45am-11:05am

At Warp, we’re building Oz to be the platform that enables people to be creative and build with cloud agents. That sounds simple, but only because the job of good developer tooling is to take on complexity before it reaches the user. The best tools fit into the way developers already think, then make accessible work that used to feel out of reach.

This talk is about the engineering philosophy behind that work: how Warp’s evolution from terminal to local agent to Oz shaped the way we think about building for developers. The focus is not on inventing brand-new abstractions for their own sake, but on making a messy stack of real engineering concerns feel coherent: where agents run, how they delegate, how teams control their environments, how humans can see what happened, and how the platform leaves room for people to build things they couldn’t even imagine before.

4:04 PM

Sai Krishna Rallabandi

  • Role: Director, Data Science
  • Company: Fidelity Investments
  • Bio: Sai Krishna Rallabandi is Director, Data Science at Fidelity Investments, where he leads applied LLM and AI-agent work. He has been awarded Meta fellowship for his PhD in Computer Science from Carnegie Mellon University. His applied work has taken first place at multiple Finance for NLP challenges over the past 4 years. His research spans speech and language processing with a focus on financial data.
  • Twitter: https://x.com/Saikallis9012
  • LinkedIn: https://www.linkedin.com/in/sai-krishna-rallabandi-8595418b/
  • Website: https://saikrishnarallabandi.github.io/
  • Blog: https://saikrishnarallabandi.github.io/
  • Photo: /wf26/speakers/by-id/spk_sai_krishna_rallabandi.jpg
  • Sessions:

- Wearing the Agent: Engineering a Family-and-Friends Personal Agent, from Group Chats to Glasses — Day 4 — Session Day 3 3:45pm-4:05pm

Judith is a personal AI agent that has run in daily production for a year, used by more than a dozen of my family and friends across three WhatsApp group chats, Telegram, and Discord. This talk walks through how it's built, in two parts. The first part is the engineering that makes one agent safe for many people to share: a multi-tenant permission model (read-only for my mom, exec for me), a memory stack — FAISS + Neo4j + curated long-term notes — that stays useful over a year instead of bloating into noise, cron-scheduled subagents that scout and act on their own, and the guardrails it enforces on every message — redact personal info before posting to a group, never reply to the wrong person, and screen attacker-controllable text for prompt injection before acting on it. The second part takes the agent off the screen and onto a $50 pair of smart glasses. It captures what I see, describes and stores it as a running visual memory, sets destination path on maps before I get onto car, finds and tells me which aisle in the store to go to first, etc. I cover the latency budget that keeps it conversational — on-device Whisper for speech, cloud reasoning, sub-one-second round trips — and the custom neural voice it speaks in rather than stock TTS, drawn from my speech-synthesis background. Both parts are shown live, including a candid look at the pieces that don't work yet. Audience takeaways: A multi-tenant architecture for a personal agent multiple people actually share A memory design that survives real long-term use (not just a vector store) A defensive checklist for any agent that ingests untrusted text A blueprint for an ambient, vision-aware wearable interface on commodity hardware, with a real latency budget

Sait Izmit

  • Role: Principal Product Manager
  • Company: Snowflake
  • Bio: Sait Izmit is a Principal Product Manager at Snowflake focused on AI solutions for go-to-market teams. He works on enterprise AI platform and agent deployments at scale.
  • LinkedIn: https://www.linkedin.com/in/saitizmit/
  • Photo: /wf26/speakers/by-id/spk_sait_izmit.jpg
  • Sessions:

- Building GTM AI Agents: Lessons from Deploying to 6,000 Users — Day 4 — Session Day 3 3:20pm-3:40pm

Building an enterprise AI agent for GTM teams isn't just an LLM problem—it's a product, engineering, and adoption challenge. In this session, I'll share how we built and scaled Snowflake's internal GTM AI Assistant from MVP to a production system serving more than 6,000 employees and answering over one million questions. We'll cover how we scoped the MVP, evolved the architecture over time, balanced quality versus coverage, adopted emerging technologies like MCP, and continuously adapted as the AI landscape rapidly changed. You'll leave with practical lessons for building enterprise AI products that users actually trust and use.

Salil Subbakrishna

  • Company: GitHub
  • LinkedIn: https://www.linkedin.com/in/salilsub
  • Photo: /wf26/speakers/by-id/spk_salil_subbakrishna.jpg
  • Sessions:

- Modernize CI/CD using agent-assisted workflows that reduce manual debugging — Day 2 — Session Day 1 1:30pm-1:50pm

AI agents are reshaping CI/CD. See how workflows become adaptive—understanding failures, fixing issues, and accelerating releases without constant manual intervention.

Salman Munaf

  • Role: Lead Site Reliability Engineer
  • Company: TikTok
  • Bio: Salman Munaf is a Lead Site Reliability Engineer at TikTok, where he builds and operates large-scale video infrastructure serving millions of users. He specializes in distributed systems, observability, and reliability at scale, with prior experience as a Software Engineer at Meta. Salman is passionate about helping developers embed reliability into their workflows from day one, making complex systems more resilient and easier to operate.
  • LinkedIn: https://www.linkedin.com/in/salman96/
  • Photo: /wf26/speakers/by-id/spk_salman_munaf.jpg
  • Sessions:

- AI Agents Are Just Distributed Systems Now — Day 4 — Session Day 3 2:50pm-3:10pm

AI agents are often described as a new kind of software, but once they move beyond chat and start calling tools, reading data, making decisions, retrying tasks, and coordinating workflows, they begin to look a lot like distributed systems. They have state. They call external services. They depend on APIs. They fail partially. They retry. They time out. They can loop. They can act on stale context. They can produce inconsistent results. And when something goes wrong, teams need logs, traces, permissions, ownership, and rollback paths just like they do with any other production system. This session will give engineers a practical way to reason about AI agents using familiar distributed systems concepts. We will break down the agent loop: planning, tool use, observation, memory, and retries. Then we will map common agent failure modes to engineering patterns teams already know, including timeouts, circuit breakers, idempotency, rate limits, least privilege, observability, and human approval. The goal is to move past the hype and treat agents like real production systems. Attendees will leave with a clear mental model for designing, debugging, and operating agents safely, especially as they become part of customer-facing products, internal developer tools, and business workflows.

Sam Bhagwat

  • Role: Founder/CEO
  • Company: Mastra
  • Bio: co-founder/ceo mastra the typescript agent framework. author principles of building ai agents. prev cofounder gatsbyjs.
  • Twitter: https://x.com/calcsam
  • LinkedIn: https://www.linkedin.com/in/sambhagwat/
  • Photo: /wf26/speakers/by-id/spk_sam_bhagwat.jpg
  • Sessions:

- Every Harness Will Become A Claw — Day 2 — Session Day 1 3:20pm-3:40pm

Most of the Harness discussion is just a reprise of Context Engineering from last summer. But it's not 2025 anymore. We live in a Claude Code world, and the best way to think about a harness is Context engineering + Coding Agents = Harness. Harnesses are a magical DX because of specific features like planning mode, parallel subagents, skills, background tasks etc. But it doesn't stop there. People are shoving their harnesses in a box, making them listen to external events, giving them channels (the ability to ping its users), and a heartbeat. They are making them into Claws. And actually, harnesses _want_ to become claws, so they can take up more share of mind, suit collaboration workflows, and be available afk. I propose "Steinberger's law", a spinoff of Zawinski's law: every harness will expand until it becomes a Claw

Sam Parsons

  • Role: Senior Staff Software Engineer and Tech Lead
  • Company: PayPal Braintree
  • Bio: Sam Parsons is a Senior Staff Software Engineer and Tech Lead at PayPal Braintree. His current team builds Payments Orchestration, and he has more than 15 years of experience building critical fintech, travel, higher-education, and government software systems.
  • Twitter: https://x.com/sjparsons
  • Website: https://sjparsons.com
  • Photo: /wf26/speakers/by-id/spk_sam_parsons.jpg
  • Sessions:

- How PayPal Enterprise Payments handles agent-initiated payments across ChatGPT and Google AI Mode — Day 2 — Session Day 1 10:45am-11:05am

PayPal Enterprise Payments has shipped integrations across the major agentic surfaces in the last six months each with human-in-the-loop confirmation and full transaction attribution back to the originating AI platform. We'll tour all three paths: ACP for ChatGPT apps (delegated payment tokens via complete_checkout, allowance validation, facilitator_details attribution), UCP with Google Pay for Google AI Mode (server-side tokenizationSpecification, parsing androidPayCards for the single-use token), and a preview of MCP Apps inline checkout, where the payment surface renders in-chat and card data never enters the LLM context. For each path we'll cover where PayPal Enterprise Payments fits, what the shopper and merchant each see, and the tradeoffs between them. You leave with working code and the docs to evaluate which path fits your stack.

Samridhi Vaid

  • Role: Senior Machine Learning Engineer
  • Company: Towards AI
  • Bio: Samridhi Vaid is a Senior Machine Learning Engineer at Towards AI, where she builds production AI systems—multimodal LLMs, agentic systems, RAG pipelines, and evaluations—and helps teach the next generation of AI practitioners through courses, books, and technical writing. With 4+ years across ML/AI and software engineering, she specializes in taking models end-to-end, from training and evaluation to cloud deployment and CI/CD.
  • Twitter: https://x.com/samridhivaid
  • LinkedIn: https://www.linkedin.com/in/samridhivaid/
  • Website: https://samridhivaid.com/
  • Photo: /wf26/speakers/by-id/spk_samridhi_vaid.jpg
  • Sessions:

- Context Engineering in 2026: Compaction, Memory & Cost — Day 1 — Workshop Day 2:20pm-4:20pm

Every long agent session eventually breaks: the assistant that swore it would "never push to main" does exactly that forty turns later. The model didn't get dumber — its context did. This workshop is about engineering the context window so that stops happening, shown with Towards AI's open-source AI tutor, which answers questions for students of our AI-engineering courses. Context engineering is deciding what the model sees on every single call — instructions, history, retrieved course content, memory, and tool outputs — and it's the line between a tutor that holds a coherent session and one that forgets the student's setup halfway through. We'll move in three stages, mirroring how the project actually went. The concepts: the two root problems (a finite window, a stateless model), the full compaction toolkit (truncation, trimming, tool-result clearing, summarization, and offloading to files — and when each actually helps), memory that survives across sessions, skills loaded on demand, and production-grade retrieval (chunking, metadata, course scoping, hybrid search, reranking, and evaluating). We'll cover the tutor's architecture, and the evaluation harness we used to measure every run on Gemini — tokens, cost, latency, and memory probes instead of vibe-checks. At real volume, even Gemini Flash got expensive, so we tested whether open and local models could match the quality for a fraction of the cost and match result quality. Everything is open-source and will be shared during the workshop.

Samuel Colvin

  • Role: Founder & CEO
  • Company: Pydantic
  • Bio: Samuel Colvin is a Python and Rust developer and the founder of Pydantic Inc., backed by Sequoia. With over 13 years of software engineering experience, he created Pydantic Validation, an open source library downloaded over 550M times per month and a core dependency of virtually every GenAI Python library. Samuel has also built Pydantic Logfire (developer-first observability), Pydantic AI (agent framework), Pydantic Evals (AI evaluation), and Pydantic AI Gateway (model routing) and Pydantic Monty (a python implementation, in rust, for LLMs to run code without host access). Samuel maintains an active presence in the developer community through GitHub and X (@samuelcolvin), where he shares his work, engages with other developers, and posts his opinionated takes.
  • Twitter: https://x.com/samuelcolvin
  • Photo: /wf26/speakers/by-id/spk_samuel_colvin.jpg
  • Sessions:

- Your agent needs a sandbox, not a desert — Day 3 — Session Day 2 12:05pm-12:25pm

Everyone agrees agents need code execution. That agreement lasts right up until you ask how to do it. The default answer is usually something like "My agent needs a full Linux VM to succeed". That's a very convenient answer for sandbox providers, but I think it's often incorrect. In many real-world agent workflows, the model does not need a whole computer. It does not need arbitrary packages, shell access, CPython, node, let alone awk sed and gcc. It needs a small amount of safe, expressive compute: enough to write code, call tools, and keep intermediate state out of the context window. That is the idea behind Monty: a minimal Python interpreter, written in Rust, designed specifically for running code written by agents. In this talk, I'll argue that for a surprisingly large class of agent systems, a curated set of tools in a custom runtime is better than a full sandbox. Not because full sandboxes are bad, but because they solve a much larger problem than most embedded agents actually have. And you pay for that mismatch in complexity, cost, operational pain, and 100,000X higher latency. Sandboxes are great, but there's such a thing as too much sand - in many scenarios the constraints and limitations of a custom built, minimal sandbox are a feature, not a bug.

Samuel Denton

  • Role: Platform Research Lead
  • Company: Applied Compute
  • Bio: Leading Platform Research at Applied Compute — focused primarily on continual learning, context, synthetic users/tasks, and more around our RL stack. Previously at Scale AI and Amazon.
  • Twitter: https://x.com/samueldenton
  • LinkedIn: https://www.linkedin.com/in/sam-denton-161b50126/
  • Photo: /wf26/speakers/by-id/spk_sam_denton.jpg
  • Sessions:

- Bringing Continual Learning into Enterprises — Day 3 — Session Day 2 2:25pm-2:45pm

Sandhya Subramani

  • Role: Senior Developer Advocate, GenAI
  • Company: Amazon Web Services
  • Bio: Sandhya Subramani is a Sr. Developer Advocate at AWS with 8+ years of experience in Applied AI Research, specializing in Large Language Models and agentic AI systems. She has developed and deployed AI solutions at organizations including Amazon, Warner Bros, and Fidelity Investments.
  • LinkedIn: https://www.linkedin.com/in/sandhyasubramani/
  • Photo: /wf26/speakers/by-id/spk_sandhya_subramani.jpg
  • Sessions:

- Agent Speedrun: Idea → Code → Deploy → Observe, Fix → Ship — Day 1 — Workshop Day 11:05am-12:05pm

One agent. Fully deployed to production before the workshop ends. We'll take you from a blank file to a running production agent using Amazon Bedrock AgentCore and Strands Agents, covering the full lifecycle: ideation, coding the agent loop, deploying to serverless infrastructure, wiring up observability, breaking it intentionally, fixing it with tracing data, and shipping the final version. Bring your laptop and leave with a deployed agent.

- Agents That Forge Their Own Tools: Self-Modifying AI in the Wild — Day 4 — Session Day 3 12:05pm-12:25pm

What happens when your agent decides its existing tools aren't good enough and writes new ones? Self-modifying agents can generate, test, and deploy their own tool implementations at runtime, adapting to problems they weren't explicitly programmed to solve. In this session, we'll demo a live agent that forges its own tools on the fly, discuss the safety boundaries you need, and explore where this pattern makes sense (and where it absolutely doesn't).

- Tell the Robot What You Want — Day 4 — Session Day 3 3:45pm-4:05pm

What if you could command a robot just by talking to it?

This session introduces Strands Agents, an open-source framework that lets developers control physical sensors and actuators using natural language, by exposing hardware as programmable agent tools through a unified interface. The agent interprets the request, selects appropriate tools, and orchestrates execution. We explore a hybrid model where low-latency perception and actuation run locally on edge hardware, and higher-level reasoning and multi-step planning are delegated to cloud-based agents when needed. This preserves real-time responsiveness while enabling richer reasoning.

A live robot demonstration anchors the session. Using the SO101 robotic arm powered by NVIDIA GR00T alongside HuggingFace LeRobot, attendees see how an instruction such as “pick up the cube” moves from conversation to perception to physical action.

Sangwu Lee

  • Role: AI Lead
  • Company: Krea.ai
  • Bio: Sangwu Lee is an AI Lead at Krea.ai, working on generative media models and creative AI systems.
  • Website: https://re-n-y.github.io/devlog/
  • Photo: /wf26/speakers/by-id/spk_sangwu_lee.jpg
  • Sessions:

- Training Krea 2 - What matters in generative model training. — Day 4 — Session Day 3 10:45am-11:05am

Learn how Krea trained its first image foundation model from scratch. I will discuss

1. Our training and data pipelines

2. What are the most important aspects of improving model performance

3. How we intend to train the next generation of image generation models.

Check out our technical report for details: https://www.krea.ai/blog/krea-2-technical-report

Saoud Rizwan

  • Role: Founder & CEO
  • Company: Cline
  • Bio: Founder and CEO of Cline, an open-source AI coding agent for software development workflows.
  • Twitter: https://x.com/sdrzn
  • Photo: /wf26/speakers/by-id/spk_saoud_rizwan.jpg
  • Sessions:

- Open Source Is Dead. Long Live Open Source. — Day 4 — Session Day 3 3:45pm-4:05pm

Closed model labs set take‑it‑or‑leave‑it prices, but open‑weight models force inference hosts to compete on the same models, driving costs down and shifting power back to builders instead of vendors. I’ll tell the story of how Cline went from viral open source project to a case study in AI‑generated slop, entitled PRs, and brand‑diluting forks and why, even as that old idea of open source community died, open weight models and auditable code are now the only real check we have on model pricing and control.

Sara Hooker

  • Role: CEO, and Co-founder
  • Company: Adaption
  • Bio: Sara Hooker is a co-founder of Adaption, which builds intelligence that continuously evolves. Sara leads a large team of AI researchers and engineers that build extremely efficient, adaptable systems. Sara Hooker was previously VP of Research at Cohere, a $6.8 billion frontier AI company focused on generative AI for enterprise. Prior to Cohere, she built large systems in computer vision and NLP at Google Deepmind. Her work has been featured in mainstream news outlets including Techcrunch, New York Times, Washington Post, Axios, MIT Technology, The Atlantic. Sara is a frequent expert advisor to AI research and policy initiatives around the world: she is currently on Kaggle's ML Advisory Research Board and serves on the World Economic Forum council on the Future of Artificial Intelligence and the Future of Data Frontiers. She has been listed as one of AI's top 13 innovators by Fortune and one of Time100 Most Influential People in AI.
  • Twitter: https://x.com/sarahookr
  • LinkedIn: https://www.linkedin.com/in/sararosehooker/
  • Website: https://www.sarahooker.me/
  • Photo: /wf26/speakers/by-id/spk_sara_hooker.jpg
  • Sessions:

- Adaption Labs — Gradient-Free Continual Learning — Day 3 — Session Day 2 1:30pm-1:50pm

Gradient-free continual learning for AI systems that adapt from real-world experience.

Sarah Sachs

  • Role: Eng Lead, AI
  • Company: Notion
  • Bio: Sarah Sachs is an engineering leader focused on shipping practical, high-leverage AI into real products at scale. Currently leading AI Modeling at Notion, she oversees four core areas of the company’s modeling efforts: Reasoning and Agentic Orchestration, Core Model Engineering, Search and Ranking, and Data Specialists & Evals — driving the next generation of Notion AI across reasoning, retrieval, and end-to-end drafting and editing.

Before Notion, Sarah was Director of Engineering for AI and Infrastructure at Tome, where she led AI features reaching 18 million users, built end-to-end presentation generation, and owned the OpenAI partnership and model infrastructure. Prior to that she spent three years at Robinhood as Head of NLP & GenAI, setting company-wide generative AI strategy, transitioning a BERT-powered chatbot into a compliant generative assistant, and building NLP content moderation for a regulated fintech environment. Earlier in her career she was a founding ML engineer at Sunshine (formerly Lumi Labs) and a software engineer at Google, where she launched and patented Personal Score in Google Maps, featured at Google I/O 2018.

Sarah holds a Sc.B. in Applied Mathematics–Computer Science from Brown University, graduating with a 4.0 CS GPA and the University Distinguished Thesis Award.

  • Twitter: https://x.com/sarahmsachs
  • LinkedIn: https://www.linkedin.com/in/sarahmsachs/
  • Photo: /wf26/speakers/by-id/spk_sarah_sachs.jpg
  • Sessions:

- Notion's Token Town — Day 2 — Session Day 1 2:50pm-3:10pm

Sarah Sanders

  • Role: Context Engineer
  • Company: PostHog
  • Bio: Context Engineer at PostHog building and securing AI agents
  • LinkedIn: https://www.linkedin.com/in/sarah-s-42913121a/
  • Photo: /wf26/speakers/by-id/spk_sarah_sanders.jpg
  • Sessions:

- We let an AI agent execute Bash and lived to talk about it — Day 4 — Session Day 3 2:25pm-2:45pm

PostHog's Wizard agent can read your codebase, install packages, and run shell commands on your laptop. Yes, on purpose. This talk covers how we went from "defense-in-hope" to a standalone, robust security service. It'll highlight results from a pentest that made us question our life choices, an internal audit that challenged our architecture, and the debate over how to secure the entire pipeline. You'll learn why "scan-then-trust" is a weaker model than you think, what it takes to build kill switches you hope you never use, and what happens when you pentest an AI agent that has access to Bash.

Sarah Simionescu

  • Company: Composio
  • Photo: /wf26/speakers/by-id/spk_sarah_simionescu.jpg
  • Sessions:

- Dashboards are Dead — Day 4 — Session Day 3 3:45pm-4:05pm

AX is the new UX, and how to build for agents.

Sarthak Aggarwal

  • Role: Co-founder
  • Company: Decawork
  • Bio: Sarthak is the co-founder of Decawork, building autonomous IT admin for the AI workforce. Decawork is in production at publicly listed enterprises, autonomously running IT across identity, security, and endpoint infrastructure. Backed by Y Combinator and Entrepreneurs First. A BITS Pilani engineering grad who has been coding since age 10, Sarthak previously built AI agents at NVIDIA, supporting on-ground engineers and FDEs deployed at OpenAI and Meta. He has also engineered enterprise-grade AI agents for Microsoft and Hitachi at Ema. Earlier, he led Conquest, India's largest student-run startup accelerator.
  • Twitter: https://x.com/_sarthak4
  • LinkedIn: https://www.linkedin.com/in/sarthak-agg/
  • Website: https://sarthak.site
  • Photo: /wf26/speakers/by-id/spk_sarthak_aggarwal.jpg
  • Sessions:

- IT Admin for the AI Workforce: Why Your AI Agents Will Need Their Own IT Department — Day 2 — Session Day 1 1:55pm-2:15pm

Every enterprise will soon run two workforces - human and AI. Humans already have IT departments managing their identities, access, incidents, and compliance. Who manages all that for your fleet of 10,000 AI agents? Nobody. Yet. At Decawork AI, we started by building autonomous IT resolution for human employees - a dual-agent system where the agent that thinks can't act and the agent that acts can't improvise. We're live in production across multiple enterprises - autonomously resolving incidents across identity systems, security platforms, endpoint infrastructure, and collaboration stacks. But here's what we discovered: the patterns for managing human IT - identity lifecycle, access governance, incident resolution, audit logging - are the exact same patterns you'll need to manage AI agent fleets at scale. The next massive infrastructure layer isn't AI agents doing work. It's AI agents managing other AI agents. This talk covers the architecture, the production war stories, and the thesis: IT Admin for the AI workforce is an inevitability, and we're building it now.

Saul Howard

  • Role: VP Engineering
  • Company: Anterior
  • Bio: VP Engineering at Anterior building the AI Platform for Healthcare. Previously at Apple Cloud.
  • Twitter: https://x.com/saulhoward
  • LinkedIn: https://linkedin.com/in/saulhoward
  • Website: https://saulhoward.com
  • Photo: /wf26/speakers/by-id/spk_saul_howard.jpg
  • Sessions:

- Why Your Enterprise Tech Stack Isn't Ready for AI Agents - And What to Build Instead — Day 4 — Session Day 3 3:45pm-4:05pm

Agent-executed work is a new infrastructure primitive. Until you treat it that way, you're running a demo, not enterprise AI. Your existing stack was built for deterministic software. Agents reason, delegate, and make judgment calls. That distinction creates infrastructure problems most engineering teams haven't confronted: security vulnerabilities baked in by design, no audit trail, no explainability, no human-in-the-loop. At Anterior, we've deployed clinical AI agents across many of the largest US health plans, covering 50 million lives. Healthcare, with high stakes, strict regulation, deeply human workflows, exposes infrastructure gaps that exist everywhere - and makes the paradigm shift unavoidable: agent-executed work as a first-class primitive, alongside compute, storage, and APIs. We'll cover why bolting agents onto existing data pipelines fails, what infrastructure primitives are missing (and why teams don't notice until an audit), and how to architect a stack where security, compliance, and human oversight are load-bearing from day one. If you're serious about agents in any mission-critical context, this is the infrastructure conversation you need to have.

Sean Cai

  • Role: CEO
  • Company: Independent / State of Data
  • Bio: Data Quality Research at Prime Intellect and State of Data Author. Prior investor at Hummingbird and Costanoa.
  • Twitter: https://x.com/SeanZCai
  • Website: https://www.seancai.com/
  • Photo: /wf26/speakers/by-id/spk_sean_cai.jpg
  • Sessions:

- State of Data — Day 3 — Session Day 2 11:10am-11:30am

Sean Sodha

  • Role: Senior Product Manager
  • Company: NVIDIA
  • Bio: Sean Singh Sodha is a deep learning Senior Product Manager at NVIDIA, responsible for the Nemotron Retriever portfolio of embedding, reranking, and extraction models powering agentic retrieval and memory systems. Before joining NVIDIA, Sean ran his own Generative AI venture and was formerly at IBM Watson. He holds an MBA from the Wharton School of Business, M.S. in engineering from Cornell University, and B.Sc. in Electrical Engineering from Purdue University.
  • LinkedIn: https://www.linkedin.com/in/sean-sodha/
  • Photo: /wf26/speakers/by-id/spk_sean_sodha.jpg
  • Sessions:

- Your Agreements Are a Database You Can't Query. We're Fixing That — Day 2 — Session Day 1 1:55pm-2:15pm

Agreements power every enterprise business, but the most critical data — pricing schedules, SLA obligations, rate cards — is often trapped in tables that traditional extraction tools destroy.

This session shows what changes when you can actually extract that data accurately at scale and make it searchable.

We'll walk through the before and after:

Before: Contract tables require manual review. Rate cards are buried. SLA terms are scattered across exhibits. Procurement teams spend hours piecing together pricing structures — and searching for specific terms means opening every document.

After: Tables are automatically extracted, structured, and queryable. Operations teams can surface SLA notification requirements on demand. Legal can answer "what hourly rate did we agree to?" in seconds.

Docusign will share what we've achieved evaluating NVIDIA Nemotron Parse for our document processing pipeline, including how we tested against real enterprise contracts (not synthetic benchmarks), why we're serving the model via vLLM, and what it takes to turn extracted table data into searchable, retrievable agreement intelligence.

NVIDIA will cover the architecture behind Nemotron Parse and where the model is heading — including how NeMo Retriever's embedding and reranking models connect extracted data to search and RAG-based applications.

Attendees will leave with a realistic view of where vision-language models excel at document understanding, where the gaps remain, and how to think about building searchable contract intelligence into their own systems.

Sebastian Fox

  • Role: CEO
  • Company: Composo
  • Bio: CEO of Composo. Former MD. Led AI teams at McKinsey & QuantumBlack. Working on quality evaluation for AI in high-stakes domains (e.g. health, pharma, legal, finance)
  • LinkedIn: https://www.linkedin.com/in/seb--fox/
  • Photo: /wf26/speakers/by-id/spk_sebastian_fox.jpg
  • Sessions:

- Inside 847 Production Clinical AI Notes — Day 4 — Session Day 3 2:50pm-3:10pm

A Series B clinical AI company had an ambient scribe in production for six months. Internal evals passed every release. A clinical team spot-checked a sample weekly and saw nothing alarming. The system had healthy NPS, expanding deployments, and the company was preparing for European market expansion. We ran a structured audit on 847 production notes. Found 127 failures across six categories. 23 were severity-critical - the kind that could directly alter a clinical decision. The team's existing LLM-as-judge had reported zero failures across the same notes. This talk is the engineering forensics of that audit. The audit setup: which production traces we sampled, how the structured failure-mode coding worked, and the reviewer protocol. The results: three dominant failure clusters - decision-status corruption (19 cases), structured omissions (34 cases), and dosage substitution (12 cases) - and the underlying generation pattern behind each. For each cluster I will show: a real anonymised trace, the eval rule that should have caught it but did not, an explanation of why the eval missed it, and the criterion that does catch it. The pattern that emerged in the data is engineering-actionable. The team had built a 20-criterion content-faithfulness eval layer. The failures lived underneath it, in a missing intent layer. We replaced the broad content layer with a five-criterion intent layer (decision status, omission impact, dosage integrity, diagnostic chain, laterality consistency). Detection rate went from 0% to 96% on the failure set. Compute cost dropped because the intent layer is cheaper to run than the content layer it replaced. You will leave with a forensics protocol for auditing your own production AI, the five intent criteria that generalise to any high-stakes domain, and the architectural pattern: build a thin intent layer, not a thick content layer.

Serena Ge

  • Role: CEO
  • Company: Datacurve
  • Bio: CEO at Datacurve. Building research and data collection infrastructure to advance frontier models. Datacurve is the creator of DeepSWE benchmark - the benchmark designed to reflect the realistic experience of developers in their day-to-day work.
  • Twitter: https://x.com/serenaa_ge
  • LinkedIn: https://www.linkedin.com/in/serena-ge-4583731b4/
  • Photo: /wf26/speakers/by-id/spk_serena_ge.jpg
  • Sessions:

- DeepSWE: expert code datasets — Day 4 — Session Day 3 10:45am-11:05am

DeepSWE and the data/eval layer behind coding agents; why curated expert code datasets matter for reliable agent performance.

Shafik Quoraishee

  • Role: Staff Engineer
  • Company: The New York Times
  • Bio: Shafik Quoraishee is a Staff Games Engineer, published writer, illustrator, and artificial intelligence researcher at The New York Times. With over a decade of experience in mobile development, he plays a key role in building and integrating the Times's highly popular digital puzzles into the Android ecosystem.

Outside of game engineering, Quoraishee is a practitioner specializing in computer vision, neural networks, and accessibility. He has spearheaded experimental projects, such as designing on-device handwriting recognition for crosswords. His machine learning research has been deployed across real-world civilian and government systems and featured in publications like Towards Data Science and Nieman Lab. Before his tenure at The New York Times, he engineered mobile and data systems for major media and sports brands, including the National Basketball Association (NBA) and Business Insider.

  • Twitter: https://x.com/squoraishee
  • LinkedIn: https://www.linkedin.com/in/shafik-quoraishee/
  • Website: https://www.shafikquoraishee.com/
  • Photo: /wf26/speakers/by-id/spk_shafik_quoraishee.jpg
  • Sessions:

- On-Device Agentic AI for the New York Times Games — Day 4 — Session Day 3 2:50pm-3:10pm

Traditional mobile game architectures rely on static state machines and fixed behavioral trees. Under this model, gameplay and accessibility are treated as rigid, separate systems. This results in blunt difficulty toggles, predictable character loops, and reactive features that fail to address a player's actual context. Constraint-Centric Agentic Simulation (CCAS) offers a theoretical shift. By modeling the game world as a continuous, multi-agent negotiation, accessibility and challenge become part of a single, fluid continuum.

Using the JetBrains Koog framework on Android, this session explores the theory of running local agents on consumer mobile devices. We will discuss how principles of game theory, specifically dynamic negotiation and constraint satisfaction, can be used to build systems that reason over game states. Instead of executing pre-planned scripts, these agents dynamically alter their strategies. They negotiate environmental constraints to provide emergent challenges for high-skill players or organically smooth out cognitive and motor friction points for those requiring assistance.

Running these theoretical models on edge hardware requires overcoming significant practical hurdles. We will break down the architecture needed to support this continuous adaptation without relying on cloud computation. We will cover how to manage memory footprints, compress state histories for rapid backtracking, and schedule local planning loops so they integrate flawlessly with the rendering engine.

Shane Wolf

  • Company: Atlassian
  • LinkedIn: https://www.linkedin.com/in/shane-wolf
  • Photo: /wf26/speakers/by-id/spk_shane_wolf.jpg
  • Sessions:

- The best SDLC is the one you build yourself: Why orchestration changes everything — Day 1 — Workshop Day 9:00am-11:00am

Industry research shows AI productivity gains have plateaued at 10–15% — because today's tools only optimize the 20% of a developer's day spent writing code. The real bottlenecks are left and right of code: planning, orchestration, review, and operations. We'll also explore the value of AI-powered code reviews - from establishing code standards that AI can seamlessly enforce, to triggering agentic pipelines that autonomously fix issues. Join Atlassian's Shane Wolf and Andrei Bocan for a hands-on deep dive into the AI-native SDLC. In this workshop, we'll move past single-player copilots and show you how Atlassian is turning Jira into an AI-native orchestration layer for the entire software development lifecycle. Then, we'll go further. You'll learn how to build custom automations that chain these capabilities together, transforming your Jira board into an agentic software factory where humans set intent and agents execute.

Shashank Goyal

  • Role: Head of Provider Ecosystem
  • Company: OpenRouter
  • Bio: Shashank Goyal is Head of Provider Ecosystem and a Founding Engineer at OpenRouter, where he helps build the infrastructure powering one of the world's largest LLM marketplaces. He works on the systems that enable developers and enterprises to access, route, and scale across hundreds of AI models and providers through a single API.

Prior to OpenRouter, Shashank was a backend engineer at OpenSea and spent more than four years as a software engineer at Google. His experience spans large-scale infrastructure, developer platforms, and AI systems, with a focus on reliability, performance, and developer experience.

  • Twitter: https://x.com/shashankgoyal95
  • LinkedIn: https://www.linkedin.com/in/shashankgoyal1/
  • Photo: /wf26/speakers/by-id/spk_shashank_goyal.jpg
  • Sessions:

- Letting the Interns Loose — How We Accelerated AI Adoption. — Day 3 — Session Day 2 11:10am-11:30am

Shawn Chan

  • Role: Vice President
  • Company: China Resources Holdings
  • Bio: Shawn Chan is Vice President at China Resources Holdings, leading a consumer-sector fund, and was previously Head of Investment at A.S. Watson Group (CK Hutchison). Across fifteen years he has executed cross-border M&A, IPOs and strategic investments in companies including Oatly, Airbnb, SenseTime, Moore Threads, Leapmotor and EVE Energy, across Hong Kong, mainland China, the UK and the US. MSc Finance from the University of Manchester. His current focus is what it takes for AI agents to earn trust inside real investment committee workflows.
  • Sessions:

- Build for the Memo, Not the Demo — Notes from 200 Investment Committees — Day 4 — Session Day 3 1:30pm-1:50pm

By the end of this talk you will have a buyer-side specification for AI investment agents, the exact artifacts, evidence formats, and trust gates a senior finance team will require before letting an AI system touch a $100M+ capital allocation decision. Drawn from fifteen years and roughly 200 investment committees at CK Hutchison (A.S. Watson Group) and China Resources Holdings, on the side of the table the AI engineering audience almost never hears from. Most enterprise AI in finance is still being built by engineers who have never sat in an investment committee. I have spent fifteen years on the other side of that demo, cross-border M&A, IPO execution and strategic investment, as a buyer on deals including Oatly (Series B through Nasdaq IPO), Airbnb (Series F), SenseTime, Moore Threads, Leapmotor and EVE Energy, and on the A.S. Watson tri-market IPO and Temasek's strategic stake. I have watched analyst memos get torn apart, and signed off on decisions where being wrong meant being wrong by nine figures. From that seat, almost every AI finance demo I have seen has the same problem: it optimizes for the demo, not for the memo. This talk walks through the specific failure modes that kill AI agents at the IC door: Source hierarchy is not retrieval. A footnote in an audited 10-K outweighs a sell-side note, which outweighs a transcript, which outweighs an internal email. Most RAG systems flatten this. Numerical consistency is non-negotiable. A memo that says "revenue grew 18%" in paragraph one and "17.4%" in the sensitivity table is dead on arrival. Contradiction is a feature. Real diligence surfaces conflicts between sources; AI agents tend to silently resolve them. Every assumption must be separable from every fact. Investment committees do not approve assumptions hidden inside prose. Audit trail is the deliverable. If a regulator, an auditor, or a board member cannot trace a claim back to evidence in under thirty seconds, the system is unusable. Accountability cannot be delegated to a model. Someone has to sign the memo. The architecture has to reflect that. The session closes with a concrete buyer-side specification, what an AI investment agent must produce, in what form, with what evidence, before a senior finance team will let it touch a live deal. Not a framework slide.

Sheilah Kirui

  • Company: NVIDIA
  • Website: https://developer.nvidia.com/blog/author/skirui
  • Photo: /wf26/speakers/by-id/spk_sheilah_kirui.jpg
  • Sessions:

- Seeing the Plumbing: Profiling vLLM Speculative Decoding on NVIDIA Blackwell — Day 4 — Session Day 3 11:40am-12:00pm

Speculative decoding promises dramatic LLM speedups by using a tiny draft model to guess tokens ahead of a large target model. However, dual-model serving fundamentally rewrites your memory dynamics and introduces a rigid engineering trade-off: guess right, and you bypass the memory-bandwidth bottleneck; guess wrong, and you waste compute.

This session is a live-demo routing identical workloads through baseline and speculative configurations in vLLM on a single NVIDIA RTX 6000 Blackwell GPU. Splitting the screen between a Streamlit app and a live Grafana dashboard, we will profile the inference engine across three vectors:

Time per Output Token (TPOT): The real-time, user-facing latency delta.

KV Cache & Memory Footprint: The exact VRAM tax of tracking parallel token states within a 96GB budget.

Draft Acceptance Rate: Visualizing the tipping point where dropping acceptance rates cause speculative decoding to fall below baseline efficiency.

Supporting Materials

Project Repository: https://github.com/akamai-developers/speculative-decoding-example-vllm-blackwell# (Work In Progress / Active Development)

Shlok Khemani

  • Role: Independent Researcher
  • Company: Independent
  • Bio: Researching memory and personal AI agents
  • Twitter: https://x.com/shloked
  • LinkedIn: https://linkedin.com/in/shlokkhemani/
  • Website: https://shloked.com
  • Blog: https://www.shloked.com
  • Photo: /wf26/speakers/by-id/spk_shlok_khemani.jpg
  • Sessions:

- Lessons from Studying Every Memory System — Day 3 — Session Day 2 3:20pm-3:40pm

For the past year I've done one thing obsessively: studied how AI products implement personalization. I've reverse-engineered the memory systems inside ChatGPT, Claude, Gemini, and Poke, and helped consumer teams build their own.

In this talk, I'll trace the evolution of ChatGPT and Claude memory over the past three years. I'll then share lessons learnt from studying these systems and share thoughts on where I think memory for consumer is heading.

Shreya Rajpal

  • Role: CEO
  • Company: Snowglobe
  • Bio: Shreya Rajpal is CEO of Snowglobe and co-founder/CEO of Guardrails AI. She created the open-source Guardrails framework and works on tools for validating LLM outputs, preventing hallucinations, detecting policy risks, and generating simulation-based evaluation datasets.
  • Twitter: https://x.com/ShreyaR
  • LinkedIn: https://www.linkedin.com/in/shreya-rajpal/
  • Website: http://shreya-rajpal.com
  • Photo: /wf26/speakers/by-id/spk_shreya_rajpal.jpg
  • Sessions:

- Simulation-Maxxing: How Nubank ships agents 20× faster with simulations — Day 4 — Session Day 3 2:50pm-3:10pm

You know how to build an agent - write a prompt, spec out some tools and call an LLM (or gateway). At this point, you probably also know how to build an agent that “actually works” using some combination of agent frameworks, eval tools and looking at your data. This talk is about building an agent much, much faster using simulations to hill-climb your agent configuration instead of grinding on real data. We’ll dive deep into a case study of how a top-5 fintech made their agent dev cycle 20x faster using simulation-driven optimization. We’ll cover: - When to use real data vs. simulations in agent building - How to design simulation environments tailored to your agent - How to automate the optimization loop so you’re hill climbing agent configurations without manual tuning

Shruti Arora

  • Role: Member of Technical Staff and Customer Engagement
  • Company: Amazon AGI Lab
  • Bio: Shruti Arora is with Amazon AGI Lab and is co-presenting the “Build with Perception Agents” workshop at AI Engineer World’s Fair 2026.
  • LinkedIn: https://www.linkedin.com/in/shruti-arora-0730
  • Photo: /wf26/speakers/by-id/spk_shruti_arora.jpg
  • Sessions:

- Build with Perception Agents — Day 1 — Workshop Day 2:20pm-4:20pm

Human-agent collaboration is changing, becoming more visual. Models can perceive, point, and verify, but most agents still rely on us typing a paragraph to explain what we're looking at. Meet perception agents: computer use agents that see screens how you see screens. They understand, reason, and verify their own work. They let you point, draw, and describe, just as people collaborate in real life. We call this shared perception, and at AGI Lab we just open-sourced the first two primitives of our perception agent harness: visual verification and visual annotation. In this workshop, you'll get hands-on with both, build one sample use case end-to-end, then take the primitives back to your day-to-day in a mini hackathon. Best ideas win prizes.

Shu Fang

  • Role: Software Engineer
  • Company: Two Sigma Investments
  • Bio: Software engineer at Two Sigma Investments building the AI platform. Formerly led Developer Productivity and Insurance Engineering teams. Previously at Wealthfront.
  • Photo: /wf26/speakers/by-id/spk_shu_fang.jpg
  • Sessions:

- Tethered: Our Agents Are Us — Day 2 — Session Day 1 12:05pm-12:25pm

Personal AI assistants have dominated the zeitgeist of late with the advent of OpenClaw. However, letting an agent run as you remotely with access to your full suite of tools terrifies us in the technical community. How then did we get comfortable with enabling this functionality firmwide at a 70 billion dollar hedge fund? This talk will go over the underlying architecture, controls, and UX that enables every employee at Two Sigma to have a remote AI Assistant that acts as us in full. With access to our entire set of internal tools. Notably, this isn't just for engineers. Every single employee gets a remote agent that assumes their identity and can take broad action on their behalf. And we're ok with it.

Shubhankar Srivastava

  • Role: Founding Sales Engineer
  • Company: Browserbase
  • Bio: Shubhankar Srivastava is the founding Sales Engineer of Browserbase, a developer-infrastructure company for building and deploying browser agents that interact with the web. Before Browserbase, he co-founded Houseware, a product analytics company that was acquired by LaunchDarkly.
  • Twitter: https://x.com/_shubhankar
  • Photo: /wf26/speakers/by-id/spk_shubhankar_srivastava.jpg
  • Sessions:

- Hill-climbing Skills: How to Improve Agents Without Touching the Model — Day 1 — Workshop Day 4:30pm-5:30pm

Agent Capability is now highly dependent on the markdown files read at runtime -- skills.This workshop treats skills as a first-class optimization surface. We borrow the concept of autoresearch (from Karpathy) and apply it to the skills your agents already read. You'll see how we at Browserbase did the same for browser agents, enabling our customers to scale the coverage of their browser agents while improving performance(2x faster runs) and optimizing for token spend(upto 10x cheaper).You'll leave with a working http://SKILL.md you generated through an auto-research loop, and a mental model for when skill optimization beats fine-tuning or prompt engineering.

Simon Eskildsen

  • Role: CEO and co-founder
  • Company: turbopuffer
  • Bio: Co-founder and CEO at turbopuffer. Formerly Principal Engineer at Shopify, where he helped scale infra from 1K → 1M RPS.
  • Twitter: https://x.com/Sirupsen
  • LinkedIn: https://www.linkedin.com/in/sirupsen/
  • Website: https://sirupsen.com
  • Blog: https://sirupsen.com/napkin
  • Photo: /wf26/speakers/by-id/spk_simon_eskildsen.jpg
  • Sessions:

- How to Connect AI to Billions of Legal Documents — Day 2 — Session Day 1 2:25pm-2:45pm

Legora’s foundational engineering challenge is connecting frontier LLMs to billions of legal documents so the models can efficiently solve end-to-end legal workflows without burning extra tokens. We’ll share the retrieval architecture we built with turbopuffer that achieves: 1. Strict data isolation across millions of legal cases in a very security-conscious domain 2. Predictable search performance (<100ms p90 latency) on large contexts 3. High retrieval quality (95%+ recall@10) with fewer agent loops We’ll retrospect on two architectures that failed to achieve all 3 (and why), and the key design factors that make the current solution work at our scale. Practical takeaways include: - How to evaluate per-tenant vs shared-index retrieval under strict data isolation - How to efficiently index and retrieve context to maximize relevance per input token - How to build a highly intelligent AI application when your inference budget is constrained

Simran Arora

  • Role: Computer Science PhD Student
  • Company: Stanford University
  • Bio: Simran Arora is a Stanford computer science PhD student working on AI systems, GPU kernels, and efficient model execution. She is a co-author of KernelBench and related multi-GPU kernel research, and is associated with Together AI in recent AI-for-science and systems work.
  • Website: https://arorasimran.com
  • Photo: /wf26/speakers/by-id/spk_simran_arora.jpg
  • Sessions:

- Can LLMs write fast multi-GPU kernels? We built a benchmark to find out. — Day 2 — Session Day 1 12:05pm-12:25pm

LLMs have gotten surprisingly good at writing GPU kernels, but almost all the benchmarks measuring that progress are single-GPU. In production, communication is the bottleneck: all-reduce alone accounts for over 20% of inference latency on Llama-3.3-70B, and that gap keeps widening as compute scales faster than interconnect bandwidth. ParallelKernelBench (PKB) offers a benchmark and evaluation framework for multi-GPU kernel generation and includes 87 problems from real codebases where the task is replacing PyTorch + NCCL with a CUDA kernel that moves data directly over NVLink. We tested GPT-5.5, Gemini 3 Pro, Opus 4.7, and other frontier coding models. Under a third of problems solved were correctly, and fewer than a quarter of those beat the naive baseline. We'll cover why they fail, what the patterns look like, and a few cases where models produced kernels faster than anything publicly available, including one for NVIDIA NeMo-RL's GRPO training loop, which has no prior optimized public reference. The benchmark is open source and we want to see what you can do!

Sitanshu Gupta

  • Company: Coreweave
  • LinkedIn: https://www.linkedin.com/in/sitanshugupta
  • Photo: /wf26/speakers/by-id/spk_sitanshu_gupta.jpg
  • Sessions:

- Vertical Mobility: Building an AI Inference Platform That Scales from MVP to Trillion-Parameter Workloads — Day 4 — Session Day 3 12:05pm-12:25pm

The future of AI inference is not one-size-fits-all. This talk explores a multi-tiered architecture that supports the full AI lifecycle, from rapid, pay-per-token experimentation to dedicated, SLO-bound production and extreme-scale, self-managed deployments. Learn about lessons learned from CoreWeave’s inference stack as performance, cost, and control requirements evolve.

Sonar

  • Sessions:

- Expo Welcome Speech — Day 1 — Workshop Day 6:00pm-6:15pm

Soumya Gupta

  • Role: ML Engineer
  • Company: Uber
  • Bio: Soumya Gupta is a Tech Lead and Applied AI Engineer at Uber, where she architects and scales production-grade Generative AI and Computer Vision solutions. Her work focuses on deploying agentic orchestration, multimodal modeling, and core machine learning primitives at global scale.
  • Twitter: https://x.com/guptasoumya12
  • LinkedIn: https://www.linkedin.com/in/guptasoumya12/
  • Photo: /wf26/speakers/by-id/spk_soumya_gupta.jpg
  • Sessions:

- Building Closed-Loop Evals for a Multimodal Agent at Uber Scale — Day 3 — Session Day 2 11:40am-12:00pm

This talk covers how we designed evals for Uber's food enhancement agent—which edits food photography to better present dishes for smaller, independent Uber Eats merchants—along with the pitfalls and lessons learned along the way.

The problem is uniquely hard: we must stay faithful to the original dish, preserve each merchant's brand and packaging, and avoid homogenizing the marketplace—all without an existing playbook for multimodal evals in a narrow domain. We'll dig into what we learned navigating reward hacking, where the agent figured out how to game the eval loop, and how we built a closed feedback loop incorporating offline and online signals for continuous improvement—all while balancing creativity against rigid safety guardrails at scale.

If you're an ML or applied AI practitioner working on multimodal systems, agentic pipelines, or eval design—especially building generative features under tight safety or quality constraints—you'll walk away with practical strategies for designing multimodal evals in a narrow domain, recognizing and countering reward hacking, and building offline/online feedback loops that keep a generative agent improving in production.

Stefania Druga

  • Role: Research Scientist
  • Company: Sakana.ai
  • Bio: Hi! I am Stef. I am currently a Research Scientist at Sakana AI in Tokyo, Japan working on novel architectures beyond the transformer. Previously I was a research at Google Deep Mind working on novel multimodal AI applications. I graduated with a Ph.D. in Creative AI Literacies at the University of Washington Information School.
  • Twitter: https://x.com/Stefania_druga
  • LinkedIn: https://www.linkedin.com/in/drugastefania/
  • Website: https://stefania11.github.io/
  • Photo: /wf26/speakers/by-id/spk_stefania_druga.jpg
  • Sessions:

- Memory Harnesses for Long-Running Research Agents — Day 3 — Session Day 2 11:40am-12:00pm

At Sakana AI we build agents that run for hundreds of turns to read literature, run experiments, and draft papers. The model rarely breaks. The harness around it is the weak point: the agent contradicts a decision it made 80 turns ago, redoes finished work, or drifts from the question it started on. This is the binding-constraint thesis. For long-horizon tasks, reliability is set as much by the harness as by the model as clearly instantiated in autoresearch recent efforts. This is a field guide to the harness's memory layer. I'll trace a real research agent through its lifecycle, show exactly where context rot and drift set in, and cover the patterns that hold over 100+ turns: three-tier memory, progressive disclosure, recall-first compaction, sub-agent isolation, and architectural memory beyond the vector database. I will show how to measure whether your memory harness actually helps, at the trajectory level, so you stop tuning prompts to fix what's really a state-management bug.

Stephanie Jarmak

  • Role: Agent Advocate
  • Company: Sourcegraph
  • Bio: Dr. Stephanie Jarmak is an Agent Advocate, AI engineer, and research scientist building agentic systems for go-to-market, developer tooling, and knowledge infrastructure. She is an OSS maintainer of Gas City and a research affiliate with the NASA search engine SciX, where her work focuses on search relevance, discovery systems, and AI-assisted access to scientific literature.

Her recent GTM projects include AccountBot, a Slack-native sales assistant for account research and campaign workflows; GEO, an evaluation suite for understanding how AI systems discover, describe, and recommend products; and mcp-ax, a framework for measuring how usable MCP products are from an agent’s perspective.

Previously, Stephanie served as Project Scientist for Planetary Science at SciX and as an astronomer at Southwest Research Institute. She holds a Ph.D. in Physics and brings a research-engineering lens to building AI systems across a variety of domains.

  • Twitter: https://x.com/sgjarmak
  • LinkedIn: https://linkedin.com/in/stephanie-jarmak
  • Website: https://www.sjarmak.ai/
  • Photo: /wf26/speakers/by-id/spk_stephanie_jarmak.jpg
  • Sessions:

- The Death of Developer Advocates — Day 4 — Session Day 3 3:45pm-4:05pm

Developer Advocacy is dead. Over the last decade Developer Advocates have been a key part of any devtool company. Coding agents are the customer now. Your ICP is Claude Code, Codex, and a myriad of other coding agents that are going to evaluating, using, and suggesting tools to their human counterparts, then implementing them. So what do you do about it? Pivot to "Agent Advocates". This is a similar role but with the expressed purpose of understanding how Agents experience your product and using those findings to improve the agent experience. In this talk/workshop I'll share how to evaluate the agent experience of your product, how to improve it, and how to communicate that to your team so they can change the products roadmap.

Stephen Chin

  • Role: VP of Developer Relations
  • Company: Neo4j
  • Bio: Stephen Chin is VP of Developer Relations at Neo4j, program chair and board member for the LF AI & Data Foundation, and author of numerous titles including the upcoming GraphRAG: The Definitive Guide for O'Reilly. He has given keynotes and main stage talks at numerous conferences around the world including AI Engineer Summit, AI DevSummit, Open Source Summit, Devoxx, Jfokus, DevNexus, JNation, JavaOne, Shift, Joker, swampUP, and GIDS. Stephen is an avid motorcyclist who has done evangelism tours in Europe, Japan, and Brazil, interviewing developers in their natural habitat. When he is not traveling, he enjoys teaching kids how to do AI, embedded, and robot programming together with his daughters.
  • Twitter: https://x.com/steveonjava
  • LinkedIn: https://linkedin.com/in/steveonjava
  • Photo: /wf26/speakers/by-id/spk_stephen_chin.jpg
  • Sessions:

- CrabRAG: Why Automated Assistants Need Graph Memory, Not More Tokens — Day 4 — Session Day 3 10:45am-11:05am

Autonomous assistants are easy to demo and hard to make reliable. The problem is usually not tool access. It is memory. Most assistant architectures still treat memory as a chat log plus vector retrieval. That is fine for document question answering, but it breaks down when the assistant must connect conversations, people, tools, and decisions across multiple tool iterations. For an AI engineer, a single request can depend on a Slack thread, a GitHub PR, a failed CI run, a calendar event, and prior operating preferences or constraints. These are not isolated pieces of context. They form a connected state that changes as work progresses and context grows. In this talk, I’ll show why knowledge graphs, context graphs, and GraphRAG provide a better foundation for OpenClaw-style assistants. Knowledge graphs capture durable entities and relationships. Context graphs capture the operational layer assistants usually lose, including actions, decision traces, provenance, and recency. GraphRAG turns that structure into task-time context by combining graph traversal, semantic retrieval, and tool use. Attendees will leave with practical patterns for schema design, retrieval routing, and evaluation, plus a concrete blueprint for assistants that remember more than the last prompt and retrieve more than the nearest chunk.

Steve Yegge

  • Role: Icon
  • Company: Gas Town
  • Bio: Steve Yegge is a longtime software engineer and technical writer known for his work on developer tools and software engineering.
  • Twitter: https://x.com/steve_yegge
  • Photo: /wf26/speakers/by-id/spk_steve_yegge.jpg
  • Sessions:

- Agentic Security: Permissions, Provenance, and the Agent Supply Chain — Day 2 — Session Day 1 2:25pm-2:45pm

As AI agents move from demos into production engineering workflows, the security boundary shifts from code alone to the permissions, tools, prompts, dependencies, credentials, and orchestration layers that agents can touch. This talk frames agentic security broadly: least-privilege agent permissions, sandboxing and capability design, provenance for agent-generated changes, risks in agent/tool/package supply chains, and practical patterns for keeping autonomous coding and operational agents auditable and containable.

Subbiah Sethuraman

  • Role: Partner
  • Company: ZS Associates
  • Bio: Subbiah leads the AI Engineering Practice at ZS, where he architects and scales enterprise AI systems spanning agentic and traditional ML. He has delivered enterpriseAI applications for leading pharmaceutical clients across R&D, Commercial, and Enterprise domains.

His work centers on the engineering foundations that make agents work in the enterprise: semantic and knowledge layers, content authoring and virtual assistants that reason across millions of documents, agent development platforms and AgentOps and governance frameworks for orchestrating, observing, and controlling agentic systems in production.

Before pharma, he built AI Engineering solutions across Retail, Manufacturing, and FinTech, and architected Apple's core big data analytics platform.

A recognized thought leader, he is passionate about responsible AI and the engineering discipline behind reliable agentic systems.

  • LinkedIn: https://www.linkedin.com/in/subbiahsethuraman/
  • Website: https://subbiah-sethuraman.medium.com/
  • Photo: /wf26/speakers/by-id/spk_subbiah_sethuraman.jpg
  • Sessions:

- Why We Killed Our Multi-Agent Pipeline: Lessons From Pharma Commercial Intelligence — Day 4 — Session Day 3 3:45pm-4:05pm

Key takeaways: A practical design principle for agentic systems in regulated, high-stakes domains: derive the architecture from agent behavior, don't impose it. Concrete patterns the audience can apply this week — domain knowledge graphs as agent context, deterministic preprocessing as a complement to agentic reasoning, reference-based context management. An honest case study from production: what worked, what didn't, and the open architectural questions we're still working on. Abstract : We lead the architecture and AI engineering org behind ZS Associates' commercial intelligence platform for pharmaceutical brand teams. The product has two surfaces: a proactive alert system that delivers signal-driven intelligence packets when a brand's KPIs move, and a conversational analytics chat where business users ask ad-hoc questions. A year ago we built both surfaces as separate V1 stacks. They broke in different ways. The diagnosis was the same: we had decided on the structure before we knew what the agent actually needed. This talk is about the design principle that came out of rebuilding both — and what it produced. The architecture is derived, not designed. We stopped trying to predict what scaffolding the agent would need and started designing the system around what the agent's behavior, on real production tasks, actually demanded. Tools, context, structure, and guardrails get introduced at the points where the agent's reasoning needs them — and nowhere else. What that produced is an architecture that's smaller than V1, not bigger. A single agent owns each investigation end-to-end across both surfaces, launching parallel sub-agents when the work needs them — not according to a pre-defined topology. A pharmaceutical commercial knowledge graph — HCPs, accounts, payers, territories, brands, KPIs and the relationships between them — gives the agent the domain context it needs without prompt-engineering heroics. Statistical signal detection runs deterministically before the agent wakes up, so the agent's job is to explain signals, not find them. Raw query results stay out of the context window through a reference-pattern that lets the agent reason over data without drowning in it. Each of those decisions came from watching an agent struggle on a real task and asking what does it need here? — not from sketching the architecture in a doc and forcing the agent into it. The patterns generalize. If you're shipping agents over messy enterprise data — finance, supply chain, claims, operations — the failure modes and the fixes will look familiar. We'll close with the open questions and the pieces we haven't solved yet.

Suchet Bargoti

  • Role: Director of Inspection and Mapping
  • Company: Skydio
  • Bio: Suchet Bargoti is Director of Inspection and Mapping at Skydio, where he leads work on autonomous drone systems for capturing, inspecting, and understanding critical infrastructure. His work spans field robotics, perception, and applied autonomy, with a focus on making drones reliable tools for collecting high-quality data in complex real-world environments.
  • LinkedIn: https://www.linkedin.com/in/sbargoti
  • Photo: /wf26/speakers/by-id/spk_suchet_bargoti.jpg
  • Sessions:

- From Manual Drones to Autonomous Multi-Agent Missions — Day 3 — Session Day 2 2:25pm-2:45pm

Skydio is the leading U.S. drone manufacturer, deploying autonomous flying robots across critical infrastructure systems that keep nations running. Our products and technology are precipitating an evolution in how drones are operated: from direct, line-of-sight control via a handheld controller, to remote operation from anywhere in the world through a web browser where a single operator can orchestrate multiple drones simultaneously. Our customer fleet of flying robots represents one of the largest scale deployments of autonomous robots in the world today, a fusion of cutting edge robotics research with practical, data driven engineering across hardware and software, working together to save lives and increase efficiency for the critical industries we serve. In this talk, we will focus on the key components of the autonomy stack spanning the cloud and the edge that enable these operations, and how they give operators superpowers, allowing them to accomplish high-level objectives through a single command.

Sujee Maniyam

  • Role: Developer Advocate
  • Company: Nebius
  • Bio: Sujee Maniyam is a developer advocate at Nebius with a background in ML, data engineering, technical training, and production inference education.
  • Photo: /wf26/speakers/by-id/spk_sujee_maniyam.jpg
  • Sessions:

- Optimizing Open Models for Production Grade Inference — Day 4 — Session Day 3 2:25pm-2:45pm

Open-source foundation models are rapidly closing the gap with proprietary systems, enabling organizations to build powerful AI applications with greater flexibility and control. However, deploying these models in production introduces a new set of challenges: latency, throughput, scalability, and cost efficiency.In this talk, we'll explore the modern inference optimization techniques that power large-scale AI systems in production. Topics include KV cache optimization, cache-aware routing, prefill/decode disaggregation, speculative decoding, and other emerging approaches used to improve performance and reduce infrastructure costs.Through practical examples and real-world architecture patterns, attendees will gain a deeper understanding of how to run open models efficiently at scale.

Sumanyu Sharma

  • Role: Founder/CEO
  • Company: Hamming AI
  • Bio: Sumanyu Sharma is the Founder & CEO of Hamming AI, a YC company that invented automated testing, monitoring, and red-teaming for AI voice agents. Hamming helps teams catch issues before their agents confidently say the wrong thing to a real customer. Before Hamming, Sumanyu worked on high-stakes systems at Tesla and Citizen across revenue, safety, and emergency response. He started in AI research at Waterloo, teaching models to search X-rays by meaning. Put simply, he has spent his career turning “the AI seems fine” into “the AI passed the test".
  • Twitter: https://x.com/sumanyu
  • LinkedIn: https://www.linkedin.com/in/sumanyusharma/
  • Website: https://hamming.ai
  • Blog: https://hamming.ai/blog
  • Photo: /wf26/speakers/by-id/spk_sumanyu_sharma.jpg
  • Sessions:

- I Monitored Crime Audio. Voice Agents Scare Me More. — Day 2 — Session Day 1 2:25pm-2:45pm

Bad voice-agent calls are starting to look less like QA bugs and more like incident scenes. I learned that instinct at Citizen, where noisy radio, ambiguous speech, fast-moving incidents, and real-time alerts became information people might actually act on. That work was stressful for obvious reasons. Voice agents scare me more. Not because they sound creepy. Because they sound good enough that people trust them. And now they are connected to calendars, CRMs, EHRs, reservation systems, refunds, transfers, account data, and support workflows. At Hamming, we monitor more than 10,000 voice agents and have analyzed millions of calls. The weird thing you learn at that scale is that production voice agents do not usually fail like demos. They fail quietly. The agent sounds natural, but misses a two-word answer. It handles the happy path, but loses the plot when the caller interrupts. It says the address was updated, but no tool call happened. It supports six languages, but gets worse at the switch point between two of them. This talk is about treating every bad voice-agent call like an incident scene. The evidence is there if you collect it: transcript, waveform, latency waterfall, interruption points, ASR uncertainty, tool trace, system-of-record state, and post-call outcome. At Tesla, I learned that autonomous systems need release gates and regression loops before they hit the real world. At Citizen, I learned that messy audio becomes safety-critical when people act on it. Voice agents need both instincts. The takeaway is a voice-agent forensics loop. What did the caller say? What did the agent think happened? What did the tool actually do? What does the system of record say? And how do we turn that weird production failure into a regression test before it happens 10,000 more times?

Sunny Rekhi

  • Role: FDE CTO
  • Company: Decagon
  • Bio: Sunny Rekhi is FDE CTO at Decagon and speaks about how forward deployed engineering is done at Decagon.
  • Sessions:

- How Forward Deployed Engineering is done at Decagon — Day 2 — Session Day 1 1:55pm-2:15pm

Suraj Gupta

  • Role: Software Engineer
  • Company: Warp
  • Bio: Suraj Gupta is a software engineer at Warp working on agentic developer experience and AI terminal workflows, including Warp's agentic development environment.
  • Photo: /wf26/speakers/by-id/spk_suraj_gupta.jpg
  • Sessions:

- Warp: Building Self-Improving Agent Software Factories — Day 3 — Session Day 2 1:55pm-2:15pm

We are in the era of Software Factories, where the entire SDLC is being automated by agents. We will cover how we are approaching self-improving software factories leveraging dedicated agents to update skills, persistent cross-harness memory, and implementing feedback loops to ensure that software factories continually improve.

Susheem Koul

  • Role: Senior Software Engineer
  • Company: Microsoft
  • Bio: Senior Software Engineer at Microsoft. Building AI Driven Systems for Commerce at Scale
  • LinkedIn: https://www.linkedin.com/in/susheemkoul
  • Website: https://susheemk.substack.com
  • Blog: https://susheemk.substack.com/
  • Photo: /wf26/speakers/by-id/spk_susheem_koul.jpg
  • Sessions:

- FinOps for AI Agents: Who Spent All the Tokens? — Day 4 — Session Day 3 11:10am-11:30am

When an autonomous agent finishes a task successfully but costs ten times more than it did the previous day, traditional application monitoring fails. A recursive tool loop that retries silently, an oversized context window that quietly expands, or an unflagged model upgrade can burn through an entire budget long before a human notices. The execution appears successful on functional dashboards, meaning the only clear signal of failure is the cloud invoice at the end of the month. As AI systems move into production, tokens have become a primary operational resource alongside CPU, memory, and storage, yet few teams manage them with equivalent systems rigor. Most architectures lack the granular visibility required to attribute token spend to specific users, agents, or workflows, and they lack mechanisms to terminate a runaway loop before it triggers a financial incident. This session treats token consumption as a first class systems problem, demonstrating how to make it observable, attributable, and enforceable across complex agent workflows. The presentation covers practical engineering patterns for instrumenting token usage at every model call and tool invocation, attributing costs down to specific users or business operations, surfacing expensive execution paths, and enforcing runtime budgets, quotas, and circuit breakers to halt runaway behavior in real time. Attendees will leave with a practical framework for governing agent spend deliberately, transforming tokens into a managed operational resource rather than a surprise line item on the cloud bill.

Swaroop Chitlur Haridas

  • Company: DoorDash
  • Photo: /wf26/speakers/by-id/spk_swaroop_chitlur_haridas.jpg
  • Sessions:

- AI Evals Platform for Cross-Functional Teams at Scale — Day 2 — Session Day 1 1:55pm-2:15pm

DoorDash's Evals Platform is designed for more than just engineers. It brings human review, automated judges, and online experimentation into a single calibration loop so engineering, product managers, and strategy and operations teams can all contribute to improving AI quality. Engineers can instrument, trace, and evaluate agent behavior, while cross-functional teams can review outputs, curate trusted examples, and provide structured feedback that improves how automated judges behave over time. By combining experimentation, fully customized annotation workflows, calibration, and analytics in one system, the platform turns AI quality from a fragmented technical exercise into a shared operating model for continuously improving agent performance and making rollout decisions with confidence. While vendor platforms offer pieces of this workflow, we needed something broader: a unified system that lets engineers, product managers, and Strategy & Ops all participate directly in improving AI quality. Our goal is not just to run evals, but to enable cross-functional teams to review outputs, calibrate judges, run experiments, and make rollout decisions without being blocked on engineering. That requirement, along with tighter integration into our internal workflows and operating model, is why we are building this platform in-house.

swyx

  • Role: Curator
  • Company: Latent Space / AI Engineer
  • Bio: Shawn Wang, known online as swyx, is a co-founder/editor of Latent Space and an organizer and prominent voice in the AI Engineer community.
  • Twitter: https://x.com/swyx
  • Photo: /wf26/speakers/by-id/spk_shawn_wang.jpg
  • Sessions:

- The Highest Loop — Day 2 — Session Day 1 9:00am-9:05am

We celebrate the third birthday of the AI Engineer post.

- Latent Space Live: the Inference Inflection from First Principles — Day 4 — Session Day 3 12:30pm-1:30pm

Tanay Varshney

  • Role: Principal Engineer
  • Company: NVIDIA
  • Bio: Tanay Varshney is a principal engineer at NVIDIA working on NeMotron models, NeMo Platform and LLM inference architecture at NVIDIA.
  • LinkedIn: https://www.linkedin.com/in/tanayvarshney
  • Photo: /wf26/speakers/by-id/spk_tanay_varshney.jpg
  • Sessions:

- Model Routing — Day 4 — Session Day 3 3:20pm-3:40pm

Model Routing explores how teams decide when to use local models, open-source models, or frontier cloud systems, and why the answer is increasingly hybrid rather than one-size-fits-all. The panel digs into routing architectures, model selection strategies, stack decisions, and what still needs to improve in local AI before more workloads can move closer to the user.

Moderator: Nader Khalil (NVIDIA). Panelists: Walden Yan (Cognition), Tanay Varshney (NVIDIA), Alex Atallah (OpenRouter).

- Model Routing — Day 4 — Session Day 3 3:45pm-4:05pm

Model Routing explores how teams decide when to use local models, open-source models, or frontier cloud systems, and why the answer is increasingly hybrid rather than one-size-fits-all. The panel digs into routing architectures, model selection strategies, stack decisions, and what still needs to improve in local AI before more workloads can move closer to the user.

Moderator: Nader Khalil (NVIDIA). Panelists: Walden Yan (Cognition), Tanay Varshney (NVIDIA), Alex Atallah (OpenRouter).

Tanmai Gopal

  • Role: CEO/cofounder
  • Company: PromptQL
  • Bio: Tanmai is the CEO/cofounder of PromptQL (née Hasura). He loves programming languages, databases and working on creating simple abstractions to solve hard problems.
  • Photo: /wf26/speakers/by-id/spk_tanmai_gopal.jpg
  • Sessions:

- Your company brain will leak secrets. Here's how we stopped it for big banks and ourselves. — Day 2 — Session Day 1 2:50pm-3:10pm

Everyone wants a shared "company brain", one single AI that knows everything the org knows. But it's nearly impossible to build one, because the moment AI scrapes everyone's data into one place, a single wrong answer to the wrong person is a breach. The downside of modifying a above-my-pay-grade shared skill, or leaking confidential information to the wrong colleague is catastrophic. Ergo, company brain projects can only ever ship to the few people who already had access to everything, or stay hobbled with strictly public information (eg: River at Shopify). We've been building one for the last year and have successfully deployed for Fortune 100 banks, for distributed-operations orgs with global scale, and for ourselves as a 70-person AI-native startup. I'll leave you with a blueprint covering how we solved the following problems: 1. Permissions for shared data and tools 2. A shared context layer (skills, knowledge, semantic layer) with its own access control 3. Scoping the blast radius of wrong context 4. Auto-learning without auto-leaking If your company brain effort has been blocked by security, compliance, or just a healthy fear of the intern asking the AI a question and getting back the exec comp table, this is the talk.

Tanmay Sah

  • Role: Senior Quantitative Modeler
  • Company: Zions Bancorporation
  • Bio: Tanmay Sah, PhD, is a quantitative modeler and AI researcher working at the intersection of predictive modeling, model risk, AI evaluation, and agentic AI systems. His notable work includes research on AI agent verification; TanML, an open-source automated machine learning model validation toolkit; and Decoding Reddit Memes Virality. He is especially interested in the next generation of trustworthy AI systems: agents that can reason, use tools, remain auditable, and operate safely under real-world constraints.
  • LinkedIn: https://www.linkedin.com/in/tanmay-sah/
  • Photo: /wf26/speakers/by-id/spk_tanmay_sah.jpg
  • Sessions:

- 2 hr deep dive on LLM Inference at Scale — Part 1 of 2 — Day 1 — Workshop Day 12:10pm-1:10pm

Most engineers using LLMs can call an API. Far fewer can explain why their model is slow, why it's running out of memory, or how the inference engines powering every major LLM API actually work. This workshop walks through the full inference stack — from how a transformer generates a single token to serving billions of tokens a day with vLLM, SGLang, TensorRT-LLM, Ray, and KServe/llm-d. 60% explanation with live demos, 40% hands-on exercises. Attendees leave with a running vLLM server they benchmarked themselves. Based on the open-source practitioners handbook being built live at github.com/harshuljain13/llm-inference-at-scale

(NOTE: this is a 2 hour workshop that happens over lunch break - you should try to have lunch before or after if attending)

compute kindly sponsored by Coreweave/Marimo!

- 2 hr deep dive on LLM Inference at Scale — Part 2 of 2 — Day 1 — Workshop Day 1:15pm-2:15pm

Most engineers using LLMs can call an API. Far fewer can explain why their model is slow, why it's running out of memory, or how the inference engines powering every major LLM API actually work. This workshop walks through the full inference stack — from how a transformer generates a single token to serving billions of tokens a day with vLLM, SGLang, TensorRT-LLM, Ray, and KServe/llm-d. 60% explanation with live demos, 40% hands-on exercises. Attendees leave with a running vLLM server they benchmarked themselves. Based on the open-source practitioners handbook being built live at github.com/harshuljain13/llm-inference-at-scale

(NOTE: this is a 2 hour workshop that happens over lunch break - you should try to have lunch before or after if attending)

Tariq Shaukat

  • Role: Chief Executive Officer
  • Company: Sonar
  • Bio: Chief Executive Officer of Sonar. Previously served as President of Google Cloud and President of Bumble.
  • Twitter: https://x.com/tariqshaukat
  • Photo: /wf26/speakers/by-id/spk_tariq_shaukat.jpg
  • Sessions:

- In the Land of AI Agents, the Verifiers Are King — Day 3 — Session Day 2 9:25am-9:45am

As AI agents take on increasingly complex development tasks, the critical challenge has shifted from generation to verification. Hallucination is not a temporary bug. Evidence suggests that as models grow more capable, failures become more frequent and more convincing, making cognitive surrender among human reviewers an acute risk. This talk introduces a three-stage discipline for responsible agentic development, Guide, Verify, Solve, and argues that rigorous verification infrastructure is both a safety requirement and a competitive advantage. Counterintuitively, code quality matters more in an agentic world: clean, low-complexity codebases make agents faster, cheaper, and more reliable, while technical debt compounds at machine speed.

Tarun Sunkaraneni

  • Role: Browser Use
  • Company: Amazon AGI
  • Bio: Tarun Sunkaraneni works on browser-use/autonomy at Amazon AGI and previously worked at Microsoft and Plaid. His public engineering profile emphasizes Amazon AGI autonomy work.
  • LinkedIn: https://www.linkedin.com/in/tsunny007
  • Photo: /wf26/speakers/by-id/spk_tarun_sunkaraneni.jpg
  • Sessions:

- Ray Actors, Vision Tokens, and the GIL: Engineering an SFT Data Pipeline That Keeps GPUs Busy — Day 3 — Session Day 2 3:45pm-4:05pm

Perception agents only learn as fast as we can feed them. Multimodal SFT is deceptively expensive on the data side, and at million-sample scale, naive pipelines leave a fleet of GPUs waiting on Python and data preprocessing.This talk walks through the SFT data pipeline we built to train vision-language models for perception agents. We rebuilt the data path so that image fetching, vision preprocessing, tokenization, and loss-mask generation all happen off the trainer's critical path, and only the artifacts the trainer actually consumes ever cross the boundary into the training loop. We pair this with a blended multi-dataset sampler designed for resumable streaming over very large mixes, and an I/O layer tuned for the realities of fetching multimodal data from object storage.The result: on large-scale VLM SFT runs, the trainer went from spending most of each step blocked on data to spending most of it training, a major improvement in useful GPU time. We'll share the architecture at a conceptual level, the gotchas at million-datapoint scale, and a mental model engineers can take home for the data side of any perception-agent stack.

Tejas Bhakta

  • Role: CEO
  • Company: Morph
  • Bio: Founder of Morph. Building specialized models and specialized inference for codegen. Prev. ML engineer at Tesla
  • Twitter: https://x.com/tejasybhakta
  • LinkedIn: https://www.linkedin.com/in/tejas-bhakta/
  • Website: https://tejasbhakta.com
  • Blog: https://tejasbhakta.com
  • Photo: /wf26/speakers/by-id/spk_tejas_bhakta.jpg
  • Sessions:

- Autoresearch for Kernels — Day 3 — Session Day 2 2:50pm-3:10pm

Why all work is moving into models and why agent orchestration and multi-agent systems are the future

Tejas Kumar

  • Role: AI Engineer
  • Company: IBM
  • Bio: Tejas Kumar is an international keynote speaker, best selling author, and host of the developer-loved ConTejas Code podcast with an engineering background spanning 25 years, from design to frontend to backend to devops. Today, Tejas shares talks at large with developer communities worldwide, equipping them to do their best work.
  • Twitter: https://x.com/tejaskumar_
  • LinkedIn: https://linkedin.com/in/tejasq
  • Website: https://tejaskumar.com
  • Blog: https://tej.as/blog
  • Photo: /wf26/speakers/by-id/spk_tejas_kumar.jpg
  • Sessions:

- Evals in AI: A Deep Dive — Day 1 — Workshop Day 12:10pm-1:10pm

“Our evals pass and our velocity is up, so it works.” It’s the most reassuring sentence in AI engineering and also the most dangerous. Teams are shipping more code than ever while incidents per PR and change-failure rates climb, and the instruments meant to catch this are quietly broken. This talk takes apart both halves of that false comfort. First, why velocity lies: the same AI-driven throughput that lights up your dashboard is what’s eroding quality underneath it. Then we explore four ways offline evals deceive you: LLM-as-judge bias (your grader rewards confident, wordy, wrong answers over terse correct ones), staleness, distribution shift between your golden set and real traffic, and single-score evals that hide which step of an agent actually failed. The centerpiece is a live demo. We’ll wire up an LLM judge on stage and watch it crown a confident, friendly, factually wrong answer. Then we’ll fix it live on stage with a three-line rubric change. Same model, different instrument. From there we’ll build up what to measure instead: traces and spans, production observability, probe-based evaluation, error budgets, and quality leading indicators that sit beside every velocity number. Attendees will leave with a five-line checklist they can apply Monday. No prior eval tooling required. If you’ve ever shipped something agentic and had a nagging feeling the dashboards were too kind, this is for you.

Tereza Tížková

  • Role: Growth
  • Company: Factory
  • Bio: Tereza works on growing Factory, the AI-native software development platform where autonomous coding agents (Droids) help engineering teams ship across the entire development lifecycle.

Before Factory, she was working on growth as the first hire at E2B, the open-source sandboxed virtual machines powering AI products.

In her free time, she likes to write blog, enjoys testing new products, and talking to developers.

  • Twitter: https://x.com/tereza_tizkova
  • LinkedIn: https://www.linkedin.com/in/tereza-tizkova-568439174/
  • Website: https://www.terezatizkova.com
  • Blog: https://www.terezatizkova.com/blog
  • Photo: /wf26/speakers/by-id/spk_tereza_t_kov.jpg
  • Sessions:

- Rise of the Software Factory — Day 2 — Session Day 1 11:10am-11:30am

The Stanford HAI 2024 AI Index reports a 30x productivity gap between AI leaders and laggards. The differentiator is not company culture, prompting technique or model selection, but the infrastructure. Organizations capturing outsized value from AI agents have machine-readable codebases, deterministic internal APIs, CI/CD pipelines with agent-addressable hooks, and permission models granular enough to scope exactly what an agent can touch. I believe the “agents as employees” framing is most useful if you operationalize it. An employee has persistent identity, episodic and semantic memory, scoped permissions that don’t get renegotiated every task, an audit trail, and a defined escalation path when things go wrong. Persistent computer use (with a stable execution environment that survives across steps) was the real inflection point that is making this possible. Some interesting production problems remain under-explored. How do you give an agent persistent identity across pull requests? How do you recover from partial failure mid-task without discarding completed work? How do you enforce code ownership policies when the author is a model? How do you bound token spend when pipelines spin up sub-agents recursively? This talk defines agent readiness as a concrete infrastructure checklist: structured codebases, deterministic APIs, per-agent scoped credentials, atomic and idempotent operations, structured execution traces, and explicit thresholds for when the agent stops and a human takes over. It presents research results in practice, and what are the steps organizations need to take to be fully agent-ready.

Thais Castello Branco

  • Role: Founder & CEO
  • Company: Taste Labs
  • Bio: Thais Castello Branco is the founder and CEO of Taste, building the taste infrastructure layer for AI across subjective domains, starting with design. Taste works with frontier labs on post-training data and RL environments, and with app-layer companies to improve and evaluate generations via their API. Taste recently closed $18.5M in seed funding, led by Amplify and CRV, and is hiring across engineering and ML research.
  • Twitter: https://x.com/thaiscbranco_
  • LinkedIn: https://www.linkedin.com/in/thais-castello-branco/
  • Photo: /wf26/speakers/by-id/spk_thais.jpg
  • Sessions:

- Ending AI Slop — Day 2 — Session Day 1 1:55pm-2:15pm

- Training Taste — Day 3 — Session Day 2 1:55pm-2:15pm

Thariq Shihipar

  • Role: Claude Code
  • Company: Anthropic
  • Bio: Engineer and serial entrepreneur currently working on Claude Code at Anthropic. Previously founded One More Multiverse, co-founded Pubpub.org, and co-founded Chime.
  • Twitter: https://x.com/trq212
  • Photo: /wf26/speakers/by-id/spk_thariq_shihipar.jpg
  • Sessions:

- Field Guide to Fable — Day 3 — Session Day 2 9:05am-9:25am

https://x.com/trq212/status/2027463795355095314

Theo Browne

  • Role: Founder/YouTuber
  • Company: T3 Tools & YouTuber
  • Bio: Full time CEO of T3 Tools. Part time YouTuber, investor, and developer.
  • Twitter: https://x.com/theo
  • LinkedIn: https://www.linkedin.com/in/t3gg/
  • Website: https://t3.gg
  • Photo: /wf26/speakers/by-id/spk_theo_browne.jpg
  • Sessions:

- Closing Keynote — Theo Browne — Day 4 — Session Day 3 4:30pm-4:50pm

Thom Wolf

  • Role: Co-founder and CSO
  • Company: Hugging Face
  • Bio: Co-founder and Chief Science Officer of Hugging Face. Previously physics research and attorney.
  • Twitter: https://x.com/Thom_Wolf
  • LinkedIn: https://linkedin.com/in/thom-wolf
  • Website: https://thomwolf.io
  • Photo: /wf26/speakers/by-id/spk_thom_wolf.jpg
  • Sessions:

- Thom Wolf keynote — Day 2 — Session Day 1 10:05am-10:25am

- Training Frontier Models to Out-Think Hackers — Day 3 — Session Day 2 11:40am-12:00pm

We will give a surprisingly optimistic talk about AI and cyber, and why we believe it is not the end of cybersecurity as we know it, but an opportunity to empower defenders and build a lasting edge over attackers.

Cyber is a battle of skill and speed, and the rising tide of frontier models is allowing human attackers to move faster and cheaper. That combination of skilled hackers and breakthrough LLMs is a real threat, while defensive systems are still expected to operate at scale with limited human intervention, constrained by what models can do out of the box. But the answer is not fear or despair. Just as high-quality data transformed software engineering, the right cyber training data can teach models to turn from weapons being used against us into tools that protect us.

Thor 雷神 Schaeff

  • Role: Member of the Technical Staff (DevX) at Google DeepMind
  • Company: Google DeepMind
  • Bio: Thor Schaeff works in Developer Experience at Google DeepMind, helping developers build with the Gemini API and Google AI Studio.
  • Twitter: https://x.com/thorwebdev
  • LinkedIn: https://www.linkedin.com/in/thorwebdev/
  • Website: https://thorweb.dev/
  • Blog: https://thorweb.dev
  • Photo: /wf26/speakers/by-id/spk_thor_schaeff.jpg
  • Sessions:

- Can Your Agent Hear You Now? — Day 2 — Session Day 1 3:20pm-3:40pm

- Build realtime multimodal agents with Gemini Live — Day 3 — Session Day 2 10:45am-11:05am

The Gemini Live API is incredible versatile when it comes to building realtime AI experiences. From live translation across 2000 different language pairs to building realtime multimodal agents that can work across text, audio, and vision. This workshop gets you from zero to fully conversational agent in a matter of hours.

- Build realtime multimodal agents with Gemini Live (continued 2) — Day 3 — Session Day 2 11:10am-11:30am

The Gemini Live API is incredible versatile when it comes to building realtime AI experiences. From live translation across 2000 different language pairs to building realtime multimodal agents that can work across text, audio, and vision. This workshop gets you from zero to fully conversational agent in a matter of hours.

- Build realtime multimodal agents with Gemini Live (continued 3) — Day 3 — Session Day 2 11:40am-12:00pm

The Gemini Live API is incredible versatile when it comes to building realtime AI experiences. From live translation across 2000 different language pairs to building realtime multimodal agents that can work across text, audio, and vision. This workshop gets you from zero to fully conversational agent in a matter of hours.

- Build realtime multimodal agents with Gemini Live (continued 4) — Day 3 — Session Day 2 12:05pm-12:25pm

Thorsten Hans

  • Role: Senior Developer Advocate
  • Company: Akamai
  • Bio: Thorsten Hans is a Senior Developer Advocate at Akamai focused on emerging cloud technologies, WebAssembly, Spin and edge-native AI. He is also a Docker Captain who shares experiments and practical guidance with developer communities.
  • Twitter: https://x.com/ThorstenHans
  • LinkedIn: https://de.linkedin.com/in/thorstenhans
  • Website: https://www.thorsten-hans.com
  • Photo: /wf26/speakers/by-id/spk_thorsten_hans.jpg
  • Sessions:

- Edge-Native AI: Building Ultra-Fast Agents and MCP Servers with Spin — Day 3 — Session Day 2 1:55pm-2:15pm

Centralized AI is slow; Edge-native AI is the revolution. Thorsten Hans demonstrates how to build intelligent agents and Model Context Protocol (MCP) servers that run at the speed of light. Using Spin and WebAssembly, we'll bypass the "cloud tax" of high latency and cold starts. Discover how to ship AI-driven features that live closer to your users, ensuring sub-millisecond responsiveness and enhanced privacy. Stop waiting for the origin it's time to bring the brain to the edge and master the stack that powers the next generation of intelligent, distributed applications.

Tim Sweeney

  • Role: Principal Engineer
  • Company: Weights & Biases by CoreWeave
  • Bio: Tim Sweeney is a Principal Engineer at Weights & Biases by CoreWeave. His WF26 main-stage session, "Closing the Loop: An Autonomous AI Research Agent," focuses on autonomous AI research agents and the feedback loops behind them.
  • LinkedIn: https://www.linkedin.com/in/tssweeney
  • Photo: /wf26/speakers/by-id/spk_tim_sweeney.jpg
  • Sessions:

- Closing the Loop: An Autonomous AI Research Agent — Day 3 — Session Day 2 1:30pm-1:50pm

The holy grail of agentic AI tooling is the autoresearch loop: an agent that can sift through your experiments, create visualizations, propose a hypothesis, launch a training job, read the results, and try again autonomously. In this session, we'll show new autoresearch capabilities built directly into the W&B Models web and iOS apps. We will demo these live using a real-world fine-tuning project, covering everything from launching jobs and reading loss curves to surfacing outlier runs that consume researcher hours and recommending the next steps. Then you'll learn how the eval-driven development loop in W&B Weave makes agents like this trustworthy. You'll see how production traces become benchmarks, and how only the agents that beat the bar make it to production. Join us to learn the same loop we use to improve our own agentic features.

Tina Manghnani

  • Role: Product Manager
  • Company: Microsoft
  • Bio: Tina Manghnani is a Product Manager at Microsoft building the Foundry agents platform and developer tooling for hosting and operationalizing pro-code agents in enterprise systems.
  • LinkedIn: https://www.linkedin.com/in/tina-manghnani
  • Photo: /wf26/speakers/by-id/spk_tina_manghnani.jpg
  • Sessions:

- From framework to runtime: running agents with Foundry Agent Service — Day 3 — Session Day 2 10:45am-11:05am

See how agents move from frameworks into production systems. Learn how Foundry Agent Service provides hosted execution, scaling, and lifecycle management—combining models, tools, and orchestration into a production-ready runtime.

- Design multi-agent systems that actually work — Day 4 — Session Day 3 12:05pm-12:25pm

Real-world agent systems don’t run in isolation. Learn how to design and coordinate multi-agent systems that collaborate effectively in production—splitting responsibilities, managing system-level complexity, and operating with shared context from Microsoft IQ. See how agents move from single interactions to orchestrated systems that reason, act, and evolve together.

Tisha Chawla

  • Role: Software Engineer
  • Company: Microsoft
  • Bio: Tisha Chawla is a Software Engineer at Microsoft, where she builds production-grade agentic systems designed to perform reliably against real enterprise data. Moving past isolated AI demos, her work targets the core infrastructure of agent engineering: durable state management, deterministic execution, and self-healing workflows that recover without manual intervention.

As an architect rather than a consumer of AI, Tisha designs the orchestration layers that allow coding agents, reliability agents, and spec-driven development workflows to scale. She is a published applied machine learning researcher and regularly delivers deep-dive technical sessions on deploying resilient, enterprise-scale AI architecture.

  • LinkedIn: https://www.linkedin.com/in/tisha-chawla
  • Website: https://dev.to/tisha
  • Photo: /wf26/speakers/by-id/spk_tisha_chawla.jpg
  • Sessions:

- FinOps for AI Agents: Who Spent All the Tokens? — Day 4 — Session Day 3 11:10am-11:30am

When an autonomous agent finishes a task successfully but costs ten times more than it did the previous day, traditional application monitoring fails. A recursive tool loop that retries silently, an oversized context window that quietly expands, or an unflagged model upgrade can burn through an entire budget long before a human notices. The execution appears successful on functional dashboards, meaning the only clear signal of failure is the cloud invoice at the end of the month. As AI systems move into production, tokens have become a primary operational resource alongside CPU, memory, and storage, yet few teams manage them with equivalent systems rigor. Most architectures lack the granular visibility required to attribute token spend to specific users, agents, or workflows, and they lack mechanisms to terminate a runaway loop before it triggers a financial incident. This session treats token consumption as a first class systems problem, demonstrating how to make it observable, attributable, and enforceable across complex agent workflows. The presentation covers practical engineering patterns for instrumenting token usage at every model call and tool invocation, attributing costs down to specific users or business operations, surfacing expensive execution paths, and enforcing runtime budgets, quotas, and circuit breakers to halt runaway behavior in real time. Attendees will leave with a practical framework for governing agent spend deliberately, transforming tokens into a managed operational resource rather than a surprise line item on the cloud bill.

Todd Fisher

  • Role: Head of Engineering Launching 0-1 startups
  • Company: Philo Ventures
  • Bio: In his current role Todd co-launches a new venture backed startup every few months, acting as an interim CTO / founding engineer. Over Todd’s close to 20 year career, he has held titles such as head of engineering, director of engineering and principal software engineer. Todd loves to build things and, like many of us, has been embracing the shift into the AI native mindset of software development. When Todd is not telling an AI to write code, he can be found strumming his guitar in random places, such as the beach, in the mountains, on his front porch, or even at a tech conference such as this one :). Todd is passionate about creating positive experiences for people, including through writing music, building products, and community building . Todd is always open for a jam session or the occasional paired programming with strangers, just ask.
  • LinkedIn: https://www.linkedin.com/in/todd-b-fisher/
  • Website: https://www.linkedin.com/in/todd-b-fisher
  • Photo: /wf26/speakers/by-id/spk_todd_fisher.jpg
  • Sessions:

- While my guitar gently speaks — Day 4 — Session Day 3 1:30pm-1:50pm

Do you ever wonder What the next evolution of live performances will look like? I do all the time. Come experience what happens when you combine live guitar playing with DSP as well as TTS and other models, all running locally. Prepare to be entertained and get familiar with new possibilities that modern tools open up in the audio and digital signal processing space while you enjoy a live performance on top of an informative slide presentation.

Walk away from this talk inspired to help build the next evolution of options for musicians and live performances. We will touch on building with tools such as classic DSP, JUCE, TTS, STT, pitch detection with YIN, llama 3 and more with an emphasis of running it all locally on device!

You might even get a chance to have a conversation with a guitar!

Tom Ouyang

  • Role: Principal Engineer
  • Company: Google DeepMind
  • Bio: Tom Ouyang is a Principal Engineer at Google DeepMind, where he works on research and development for Gemini Audio, focusing on real-time capabilities like natural dialog, streaming translation, and audio understanding. Previously, he spent five years as a Principal Software Engineer at Waymo, developing machine learning models for autonomous vehicle perception. Prior to Waymo, Tom spent over six years at Google working in the area of mobile text entry and language modeling. He holds a PhD in computer science from the Massachusetts Institute of Technology.
  • LinkedIn: https://www.linkedin.com/in/tom-ouyang-8b5a5142/
  • Photo: /wf26/speakers/by-id/spk_tom_ouyang.jpg
  • Sessions:

- Speech-to-Speech Model Research at Google DeepMind — Day 2 — Session Day 1 11:10am-11:30am

Most voice interfaces today are built as a 3-way cascade system (ASR/LLM/TTS). While functional, this cascaded approach introduces latency bottlenecks, strips away non-verbal nuance, and limits emotion-aware, multi-turn dialogue. Today, we are witnessing a profound shift toward native speech-to-speech models that process audio natively from end to end. In this session, we’ll explore the exciting paradigm at Google DeepMind to train speech-to-speech models for real-time voice agents. We will cover the high-level product and research challenges of building voice agents that feel truly conversational, optimizing for fluid turn-taking and low latency while maintaining enterprise-grade intelligence.

Tomás Hernando Kofman

  • Role: CEO & Co-Founder
  • Company: Not Diamond
  • Bio: Tomás Hernando Kofman is the founder of Not Diamond, the world's most powerful intelligent model router for coding agents. Not Diamond helps engineering teams achieve frontier quality at a fraction of the price by identifying when a prompt requires a more powerful model and when a cheaper one is sufficient. They work with some of the highest volume AI startups and enterprises in the world, including OpenRouter and SAP, and are backed by Jeff Dea, Julien Chaumond, Ion Stoica, and many others.
  • Twitter: https://x.com/tomas_hk
  • LinkedIn: https://www.linkedin.com/in/tomashk/
  • Photo: /wf26/speakers/by-id/spk_tom_s_hernando_kofman.jpg
  • Sessions:

- Intelligent Model Routing: Frontier Performance Without Frontier Bills — Day 3 — Session Day 2 2:50pm-3:10pm

It is Summer 2026 and the world is burning for token consumption—figuratively and literally. Accelerating frontier model capabilities increasingly allow agents to operate across long-running, highly parallelized tasks at exponential inference growth. In this talk, I explain how dynamic model routing—intelligently directing agent requests to the best-suited model at the best price—can reduce token costs by up to 90% while maintaining or improving accuracy. I walk through how routing works, when it doesn't, and why the world (and your agent) need routing to scale intelligence to infinity and beyond.

Tony Fabrikant

  • Role: Co-founder
  • Company: CoupleWork AI
  • Bio: Tony Fabrikant is cofounder of BetterLabs AI and builder of CoupleWork, an AI relationship coach helping couples navigate conflict, repair, crisis moments, and long-term relationship health. He previously held senior technology leadership roles at S&P Global, where he served as CTO for a major financial data and analytics business, and at Bridgewater Associates, where he worked on investment technology and research systems.

Tony’s work focuses on applying AI agents, real-time voice AI, and safety-first product design to deeply human problems. At CoupleWork, he partners with cofounder Clay Cockrell, LCSW, a licensed relationship therapist, to build evidence-based AI support for couples who are priced out of therapy, between sessions, or otherwise unable to access traditional support. He is especially interested in how AI systems can be designed to strengthen human relationships rather than simply maximize engagement.

  • LinkedIn: https://www.linkedin.com/in/tony-fabrikant
  • Photo: /wf26/speakers/by-id/spk_tony_fabrikant.jpg
  • Sessions:

- Al is becoming the World's largest Relationship Therapist. We Can't Afford to Get it Wrong. — Day 4 — Session Day 3 1:30pm-1:50pm

Millions of people are now turning to AI for relationship advice and emotional support, often before they'd ever consider a human therapist. Most of the AI Therapy that is available is without clinical oversight, ethical frameworks, or any serious reckoning with what it means to intervene in the most intimate and vulnerable space in a person's life. People are getting hurt. As a couples therapist with 30 years experience, I teamed up with the former CTO at S&P and we created CoupleWork, an AI relationship therapist I essentially trained on three decades of clinical knowledge and every evidence-based modality that exists. Our voice interactive AI, Maxine, is proving this can be done responsibly and very effectively. And what we're learning about the nature of love, connection, and human vulnerability at scale is something this industry needs to hear. I also want to talk about what comes next: the regulatory frameworks that don't yet exist, the liability questions nobody is answering, and why the therapists who should be leading this conversation are almost entirely absent from it.

Tushar Jain

  • Role: EVP of Engineering
  • Company: Docker
  • Bio: Tushar Jain is Executive Vice President of Engineering at Docker.
  • LinkedIn: https://www.linkedin.com/in/tusharj
  • Photo: /wf26/speakers/by-id/spk_tushar_jain.jpg
  • Sessions:

- Unlock Agent Autonomy: The Runtime for AI-Native Systems — Day 2 — Session Day 1 3:45pm-4:05pm

The way software gets built in 2026 doesn't look like it did in 2024. The actors changed. Agents read and write entire codebases. Subagents spawn to chase down a flaky test, refactor a module, or triage an incident. But this shift doesn't stop at the SDLC. Agents increasingly invoke tools, interact with enterprise systems, install dependencies, call APIs, and orchestrate workflows across local machines, CI systems, cloud infrastructure, and organizational boundaries. The teams leaning into this shift are moving faster, and the gap is widening by the quarter.

But few have the confidence to let agents operate autonomously across those environments. Not because the model capability isn't there. Trust isn't. Agents can pull a poisoned dependency, invoke an untrusted tool, wipe a database, leak sensitive data, or access systems they shouldn’t. Prompt-level instructions won't close that gap, the unlock has to happen one layer down, at the runtime layer itself.

Docker spent the last decade making it safe to ship software by getting the runtime right: isolation, network policy, trusted base images, and credentials. Agents are the next workload, and the same principles apply. Tushar Jain, EVP of Engineering at Docker, walks through what the runtime layer for AI-native systems looks like in practice: hardened runtime foundations, sandboxes that constrain what agents can touch, and governance controls that limit what agents can introduce, access, and execute across local, CI, cloud, and enterprise environments. The pattern is the same on every vector: reduce the surface area of what the agent gets to decide, so the parts that matter aren't left to a prompt.

Attendees leave with a clearer framework for giving agents more autonomy safely. Engineers see how agentic applications can operate across tools and infrastructure. Security leaders get a runtime model that maps to controls they already understand. Platform teams get a way to scale agent execution without standing up a new runtime for every team.

Tyler Gillam

  • Role: Senior Software Engineer II - Agentic AI
  • Company: Digital Ocean
  • Bio: Tyler Gillam is a Senior Software Engineer II working on Agentic AI at DigitalOcean and a core engineer on DigitalOcean's Inference Router, focused on intelligent model routing by cost, latency, task type, and model preference.
  • LinkedIn: https://www.linkedin.com/in/tdgillam
  • Photo: /wf26/speakers/by-id/spk_tyler_gillam.jpg
  • Sessions:

- Preferences > Benchmarks: Model Routing for How Teams Actually Build — Day 4 — Session Day 3 12:05pm-12:25pm

There is no best model. There's only the right model for a given task, and the right model depends on your team's preferences, not a benchmark score. This talk makes the case for preference-aligned routing: choosing models by the constraints that actually matter — cost, latency, task type, model preference — instead of a single leaderboard number. We'll demo a sub-200ms routing decision running on a purpose-built 30B MoE model with no application code changes, walk through real coding workflows routing most traffic to open models without losing accuracy, and show where this goes next: evals, caching, and personalization.

Uday Kanagala

  • Role: Software Architect
  • Company: Navan
  • Bio: Uday Kanagala is a Software Architect at Navan with experience in cloud-native distributed systems, microservices, DevOps, and autonomous-agent workflows for production issue resolution and architectural governance.
  • LinkedIn: https://www.linkedin.com/in/udaybhanuprasad
  • Photo: /wf26/speakers/by-id/spk_uday_kanagala.jpg
  • Sessions:

- Agents Are Where Microservices Were in 2015. We're Making All the Same Mistakes. — Day 3 — Session Day 2 2:50pm-3:10pm

Remember when everyone was shipping microservices without service discovery, circuit breakers, or distributed tracing? Agents are in that exact phase right now. Everyone's building them. Almost nobody is thinking about the infrastructure underneath. We've been deploying production agents across 120+ microservices. Here's the stack that's emerging: Runtime — containerized execution, session persistence, workspace snapshots. Solved-ish, mostly duct tape. Memory — RAG had a good run. It's not enough. Tiered memory — short-term, long-term with semantic/episodic strategies, agents deciding what to remember and forget. Observability — you can't tail -f an agent. Execution traces, reasoning chains, confidence signals — agents need their own observability stack. Testing — the biggest gap. Unit testing non-deterministic behavior, regression testing prompt changes, knowing your agent got worse before users do. Skills and tools — MCP and skill definitions as the standard interface layer — the REST APIs of the agent era. Context engineering — what the agent knows at decision time. The new performance tuning. Guardrails and auth — scoped credentials, budget limits, knowing when to stop. Least-privilege for agents. Orchestration — single vs. multi-agent, choreography vs. orchestration. Same tradeoffs as microservices, new failure modes. This talk maps the stack, draws the parallels to how we eventually got microservices right, and calls out what's still painfully missing.

Uday Kiran Medisetty

  • Role: Distinguished Engineer
  • Company: Uber
  • Bio: Uday Kiran Medisetty is a Distinguished Engineer at Uber, where he leads engineering efforts spanning Agentic AI experiences across first-party and third-party surfaces, and GenAI initiatives for engineering productivity (agentic coding, AI debugging, code reviews, and large-scale refactoring). He also co-leads the company-wide engineering community shaping Uber's architecture, culture, and standards. Over 11+ years at Uber, Uday has led some of the company's largest technical undertakings, including the ground-up rebuild of the Fulfillment Platform that powers billions of daily transactions across Mobility and Delivery, and the Earner multi-gig platform supporting diverse earning opportunities globally. Earlier, he was a Staff Engineer at VMware, building next-generation API infrastructure used across the company's product lines. He holds an MS in Computer Science from Georgia Tech and a BS from BITS Pilani, India.
  • Twitter: https://x.com/udaykiran
  • LinkedIn: https://www.linkedin.com/in/udaykiran/
  • Photo: /wf26/speakers/by-id/spk_uday_kiran_medisetty.jpg
  • Sessions:

- Agentic SDLC at Uber: Building Blocks for Uber's Software Factory — Day 2 — Session Day 1 11:40am-12:00pm

99% of Uber engineers are using AI every month, 70% of PRs are attributed to AI, and 15% of PRs are now done entirely by autonomous agents. In this session, we go behind the scenes to show you exactly what it takes to get there — starting with the foundational building blocks: the model gateway, MCP infrastructure, agent skills, knowledge systems, and cloud developer environments that make agentic engineering possible at scale. Then, once those foundations are in place, we show you how to assemble them into a fully agentic SDLC. We'll walk through every stage — from research and spec writing, to autonomous code generation, to verifying and validating that code before it ships, to monitoring what happens after it lands, and continuously improving it over time. With tooling example demos throughout. Whether you're just starting your agentic journey or already running agents in production, you'll leave with a concrete blueprint for what this looks like end to end.

Udi Menkes

  • Role: Principal Product Manager
  • Company: Intuit
  • Bio: Udi Menkes is a Principal AI Product Manager at Intuit, where he leads the development of custom financial LLMs and agentic AI systems that turn Intuit's proprietary data and financial domain expertise into experiences delivering real value to 100M+ customers across QuickBooks, TurboTax. Credit Karma, and Mailchimp. He has spent 15+ years building AI that touches money — patented financial crime AI at ThetaRay for banks including Citi, Santander, and OCBC, and conversational and revenue-optimization AI at OpenWeb for 120M+ monthly users. Udi runs GenAI PM (genaipm.com), an AI agent that scans X, LinkedIn, YouTube, GitHub, and 200+ AI and PM blogs daily and delivers a 7-minute, high-signal brief to 5,500+ AI Product teams and builders at Google, OpenAI, NVIDIA, Meta, Amazon, Microsoft, and more.
  • Twitter: https://x.com/menkesu
  • LinkedIn: https://www.linkedin.com/in/udimenkes/
  • Website: https://genaipm.com/about
  • Blog: https://genaipm.com/about
  • Photo: /wf26/speakers/by-id/spk_udi_menkes.jpg
  • Sessions:

- Why Off-the-Shelf AI Doesn't Understand Money — Day 4 — Session Day 3 11:10am-11:30am

Ask any LLM a financial question about your business. You'll get a fluent, confident, generic answer — one that doesn't truly know your business, or what happened when businesses like yours made that same decision. We build financial AI at Intuit serving 100M+ customers. Our custom LLMs outperform general-purpose models on accuracy while cutting latency in half. But that's the foundation, not the destination. I'll cover where financial intelligence goes when AI stops reporting what happened and starts helping you decide what to do next (and does it for you).

Uri Rolls

  • Role: CEO
  • Company: Arithmetic
  • Bio: Uri Rolls is Co-Founder and CEO of Arithmetic, an AI data company building cyber worlds for frontier models. Just as AI transformed software engineering, Arithmetic is working to bring the same leap to cybersecurity by creating post-training data from real vulnerabilities, teaching models to reason across black-box search spaces, turn reconnaissance into exploitation, and take correct defensive action.

Previously, Uri conducted black-hole imaging research at Harvard and the Smithsonian, and spent several years in Israeli Intelligence.

  • Twitter: https://x.com/uri_rolls
  • LinkedIn: https://www.linkedin.com/in/urirolls/
  • Photo: /wf26/speakers/by-id/spk_tbd_operating_intelligence.jpg
  • Sessions:

- Training Frontier Models to Out-Think Hackers — Day 3 — Session Day 2 11:40am-12:00pm

We will give a surprisingly optimistic talk about AI and cyber, and why we believe it is not the end of cybersecurity as we know it, but an opportunity to empower defenders and build a lasting edge over attackers.

Cyber is a battle of skill and speed, and the rising tide of frontier models is allowing human attackers to move faster and cheaper. That combination of skilled hackers and breakthrough LLMs is a real threat, while defensive systems are still expected to operate at scale with limited human intervention, constrained by what models can do out of the box. But the answer is not fear or despair. Just as high-quality data transformed software engineering, the right cyber training data can teach models to turn from weapons being used against us into tools that protect us.

Vaibhav Gupta

  • Role: CEO
  • Company: Boundary
  • Bio: Vaibhav Gupta is the co-founder at Boundary and building BAML, a new programming language that's agent first. Previously at D. E. Shaw, Google, and Microsoft. In his free time, Vaibhav dabbles in competitive table tennis and board games, and various aspects of compilers and VMs.
  • Twitter: https://x.com/vaibcode
  • LinkedIn: https://www.linkedin.com/in/vaigup
  • Website: https://www.youtube.com/@boundaryml
  • Blog: https://www.boundaryml.com/blog
  • Photo: /wf26/speakers/by-id/spk_vaibhav_gupta.jpg
  • Sessions:

- fighting slop with slop — Day 2 — Session Day 1 3:20pm-3:40pm

We haven't done a code review in two years. The last time I read every line of code in a PR was about six months ago. And we build a programming language with a runtime meant to replace V8. This is real engineering: compiler internals, runtime behavior, type systems, codegen, concurrency semantics, and FFIs across multiple languages. The thing that makes this possible is a technique we call "fight slop with slop" - every line of code is analyzed in depth by a sprawling toolchain of custom visualizers, linters, test snapshots and a whole bunch more. While the core language VM code has super high standards, a lot of these meta-tools are mostly vibe-coded. I'll dive deep into all the tactical things we've built, and how to adopt "fight slop with slop" in your own team

Valeria Wu Fon

  • Role: Product Manager
  • Company: Google DeepMind
  • Bio: Product Manager at Google DeepMind for Gemini's speech to speech model. Previously worked across early stage companies, venture/banking, and a brief stint at a surf hostel. Valeria studied Symbolic Systems (CS, Neuroscience, and Philosophy) at Stanford, where she focused on human-centered AI. Originally from Lima, Peru, she is a single-digit golfer and a dedicated foodie.
  • Twitter: https://x.com/valeriawu_
  • LinkedIn: https://www.linkedin.com/in/valeriawu/
  • Photo: /wf26/speakers/by-id/spk_valeria_wu.jpg
  • Sessions:

- Speech-to-Speech Model Research at Google DeepMind — Day 2 — Session Day 1 11:10am-11:30am

Most voice interfaces today are built as a 3-way cascade system (ASR/LLM/TTS). While functional, this cascaded approach introduces latency bottlenecks, strips away non-verbal nuance, and limits emotion-aware, multi-turn dialogue. Today, we are witnessing a profound shift toward native speech-to-speech models that process audio natively from end to end. In this session, we’ll explore the exciting paradigm at Google DeepMind to train speech-to-speech models for real-time voice agents. We will cover the high-level product and research challenges of building voice agents that feel truly conversational, optimizing for fluid turn-taking and low latency while maintaining enterprise-grade intelligence.

Varun Krovvidi

  • Photo: /wf26/speakers/by-id/spk_varun_krovvidi.jpg
  • Sessions:

- 6 Pillars of an Agentic Harness That Fixes Production Incidents — Day 2 — Session Day 1 2:50pm-3:10pm

A model delights us when any plausible answer works, but a production incident has one right answer, and the model alone can't reliably reach it. Getting there depends less on the model and more on the orchestration, context, and judgment built around it. That work is harness engineering, and it is the new frontier.

This session breaks down the six pillars of an agentic harness required to fix production incidents: model orchestration, context, reasoning, actions, learning, and evals. Join Resolve AI to walk through what each one does, why a better model doesn't make any of them go away, and how they compose to find the root cause of a live incident across massive context, under a clock, with real revenue on the line.

Varun Pant

  • Role: Builder, NeuroSymbolic AI
  • Company: AWS
  • Bio: Builds AI products at AWS, currently neurosymbolic AI.
  • Twitter: https://x.com/varun_pant_
  • LinkedIn: https://www.linkedin.com/in/varunp1/
  • Photo: /wf26/speakers/by-id/spk_varun_pant.jpg
  • Sessions:

- Your Code Has Bugs. Lean4 Has Proofs. A Practical Guide to Formal Verification for Engineers — Day 3 — Session Day 2 11:40am-12:00pm

AI is generating more of your code than ever — how do you prove it doesn't ship bugs? Lean is a theorem prover that's also a programming language, and it's quietly becoming practical for verifying real software. In this talk, I'll show you how formal verification works — some examples of proof tactics, and a practical framework for when to verify vs. test

Varun Shenoy

  • Role: Cofounder
  • Company: Long Lake
  • Bio: Cofounder at Long Lake. Focused on diffusing AI: deploying frontier technology into essential services businesses beyond Silicon Valley. Previously at Baseten and Stanford.
  • Twitter: https://x.com/varunshenoy_
  • LinkedIn: https://www.linkedin.com/in/varunshenoy
  • Website: https://varunshenoy.com
  • Blog: https://varunshenoy.com
  • Photo: /wf26/speakers/by-id/spk_varun_shenoy.jpg
  • Sessions:

- How do you diffuse AI into the real world? — Day 3 — Session Day 2 10:45am-11:05am

Most AI conversations are still about models, benchmarks, and demos. We want to talk about what it actually takes to make AI work inside real companies. The gap between impressive demos and production value is where most enterprise AI efforts die. We've all seen burned budgets, cynical teams, and tools that never leave the pilot phase. We've spent the last two years closing that gap across the American services economy, and we'll share a bit of our playbook. This talk walks through three layers of what real AI deployment looks like, drawn from Long Lake's live operating environments: Measure: How we built domain-specific evals and workflows to improve performance on real HOA management tasks, not synthetic benchmarks, but metrics tied to actual business outcomes. Embed: How we put AI directly inside tools like Revit, meeting users where they already work instead of asking them to change how they operate. Scale: The enablement playbooks and operating techniques we use to help teams of property managers, payroll specialists, and more adopt AI in their day-to-day jobs. The broader theme is vertical superintelligence: not just better models, but systems built around proprietary data, workflow context, domain tools, human enablement, and continual learning. This talk is for builders and operators who care less about benchmark theater and more about how to deliver measurable outcomes, deal with change management, and teach non-technical workforces to use AI effectively in production beyond just Claude Code / Cowork.

Varun Singh

  • Role: Pre-Training Lead
  • Company: Arcee AI
  • Bio: Pre-training lead at Arcee AI working on end to end pre-training of large language models, with a strong interest in architecture and optimization. Led the pre-training of Arcee's Trinity series of models, ranging from a 6B mixture-of-experts to a 400B mixture-of-experts model.
  • Twitter: https://x.com/stochasticchasm
  • LinkedIn: https://www.linkedin.com/in/varun-singh-cs
  • Photo: /wf26/speakers/by-id/spk_varun_singh.jpg
  • Sessions:

- The Base Model is Dead — Day 2 — Session Day 1 1:30pm-1:50pm

It's a common belief that large language models are trained to be a good model of human web-text, and thus base models are "mirrors" of what we see on the internet. Historically, this was largely true, but no modern base model truly reflects the internet in the way that GPT-3 once did. Instruction data along with synthetic reasoning traces are moving earlier and earlier into the training pipeline, and "mid-training" has emerged as a new stage to accommodate longer datapoints that more concretely resemble downstream capabilities. As a result, pre-training no longer has the goal of creating a linguistic prior, but instead has the additional goals of baking in behavior and more atomic skills into the trained "base" model. Between this shift in what a base model is and the blurring of the lines between the different stages of model training, it's an open question as to what the best approach is here (at least outside the walls of the big labs). But I believe that the role we view the base model playing will continue to shift as we're pulled forward through new phases of model capabilities.

Vasant Kearney

  • Role: CEO and Founder
  • Company: Onlay
  • Bio: Vasant Kearney is founder and CEO of Onlay AI, where he builds agentic healthcare revenue-cycle infrastructure across claims, eligibility, attachments, payer follow-up, payment posting, and bank reconciliation.

He was previously co-founder and CTO of Retrace, an AI company focused on healthcare automation. Before startups, Vasant was an Assistant Professor of Radiation Oncology Physics at UCSF, completed a medical physics residency there, and earned a PhD in Biomedical Engineering. He mentored radiation oncology residents at UCSF and machine learning students at the University of San Francisco. His scientific work spans a broad range of topics, including generative MRI-to-CT translation, tumor localization, anatomy segmentation, treatment-planning optimization, NVIDIA CUDA programming, and dental disease classification.

His work sits at the intersection of healthcare operations, regulated workflows, multimodal AI, and production agent systems.

  • Twitter: https://x.com/vasantkearney
  • LinkedIn: https://www.linkedin.com/in/vasant-kearney-7b7a48b3
  • Website: https://onlay.ai/
  • Photo: /wf26/speakers/by-id/spk_vasant_kearney.jpg
  • Sessions:

- Healthcare’s Agent Bytecode: X12 as the Harness for AI Agents — Day 4 — Session Day 3 1:55pm-2:15pm

LLMs made old languages newly useful: COBOL for mainframes, Fortran for scientific code, and Rust, SQL, and Prolog as strict substrates for agentic systems. Healthcare has its own old language hiding in plain sight: X12. Before LLMs, X12 was mostly treated as ugly plumbing: loops, delimiters, companion guides, clearinghouse edits, payer-specific quirks, rejections, and acknowledgments. In an agentic workflow, those constraints become the feature. They give stochastic agents a deterministic target. This talk shows how healthcare agents can compile messy operational evidence into X12-shaped workflows: chairside audio into 837D claim narratives, imaging systems into 275/PWK attachment flows, payer portals and phone calls into 270/271 eligibility and 276/277 claim status, preauth evidence into 278 workflows, and EOBs, scanned mail, and bank data into 835/820 payment reconciliation. The core pattern is simple: LLMs reason over ambiguity; X12 provides the syntactic and semantic harness for validation, auditability, acknowledgments, rejections, human review, and high-volume automation. This is not an EDI nostalgia talk. It is a production architecture talk about building reliable agents in one of the messiest enterprise domains.

Vasuman Moza

  • Role: Founder & CEO
  • Company: Varick Agents
  • Bio: Vas Moza is the founder and CEO of Varick Agents, a San Francisco startup transforming enterprise companies with AI. Varick's approach starts with an audit: the team embeds with each customer to map how the business actually works, then ships the agents, integrations, and workflows that take AI from prototype to dependable production. As one part of that mission, Varick is building the AI FDE: an agent that does the work of a forward-deployed engineer - scoping problems, writing and integrating code, and standing up production systems alongside the people who run them. Varick is built on the bet that the hard part of enterprise AI isn't the model - it's deployment, reliability, and trust.
  • Twitter: https://x.com/vasuman
  • LinkedIn: https://www.linkedin.com/in/vasumanmoza/
  • Photo: /wf26/speakers/by-id/spk_vasuman_moza.jpg
  • Sessions:

- AI tools for Forward Deployed Engineering — Day 2 — Session Day 1 11:40am-12:00pm

Vayum Arora

  • Role: Growth
  • Company: Weco AI
  • Bio: Vayum started in AI as an ML engineer and product lead on Apple's Intelligence teams, building the foundational models behind its features. After starting his own company, he backed early-stage founders as a seed investor at Founder Collective, then decided to go build. Now he leads growth at Weco AI, taking recursive self-improvement out of the lab and into ML teams across finance, healthcare, and frontier AI.
  • LinkedIn: https://www.linkedin.com/in/vayum-arora/
  • Blog: https://www.weco.ai/blog/vardera-case-study
  • Photo: /wf26/speakers/by-id/spk_vayum_arora.jpg
  • Sessions:

- Hands-on AutoResearch: Cracking OpenAI's Parameter Golf — Day 1 — Workshop Day 2:20pm-4:20pm

Heard about autoresearch, or tried it a few times in playground settings? This hands-on tutorial teaches you how to use autoresearch on one of the most serious challenges in ML this year: OpenAI's Parameter Golf.

The challenge: train the best language model that fits in just 16MB. We entered our autoresearch agent this past spring, and it outperformed the field of over 1,000 participants. You'll learn how we approached it, then get to do it yourself: kick off an autoresearch agent, watch it improve a tiny language model's training script, steer it when progress stalls, and visualize your results. You'll leave with a working autoresearch setup you can point at your own code.

compute kindly sponsored by Modal!

Venky B

  • Role: Founder & CEO
  • Company: Plivo
  • Bio: Founder and CEO of Plivo, a developer platform for building voice and messaging AI agents. Plivo powers thousands of businesses worldwide, from fast-growing startups to Fortune 500 enterprises.
  • Twitter: https://x.com/bevenky
  • LinkedIn: https://www.linkedin.com/in/bevenky/
  • Photo: /wf26/speakers/by-id/spk_venky_b.jpg
  • Sessions:

- 5 Voice Agent Failure Modes You'll Hit in Week One — Day 2 — Session Day 1 1:55pm-2:15pm

Building a voice agent that demos well is easy now. The hard part starts the second a real person calls it. Most voice agents today are basically a chatbot with a microphone bolted on, they listen, then think, then talk, one side at a time, like a walkie talkie. Real conversations don't work that way. People pause in the middle of a thought, they say "um" and "uh", they talk over you, they change their mind halfway through. The agent has to work out when you're actually done talking, when it should stop talking, and when you've said something it cannot afford to get wrong, like your phone number or email. None of this shows up when you test with text. All of it shows up in week one.

This talk is the five failures that hit every team in that first week, the ones we see again and again. For each case we will walk though examples and best practices for what actually breaks and what to do about it. If you're about to put a voice agent in front of real callers, or you already did and it's quietly falling apart, this is the talk that saves you the weeks everyone else burns figuring it out

Vincent Weisser

  • Role: Co-founder & CEO
  • Company: Prime Intellect
  • Bio: Vincent Weisser is Co-founder & CEO of Prime Intellect. Prime Intellect provides an open stack for training, deploying, and continuously improving AI models with compute, RL post-training, evaluations, and inference.
  • LinkedIn: https://www.linkedin.com/in/vincentweisser
  • Website: https://www.primeintellect.ai/
  • Sessions:

- Local Models: Trust, Control, Optimization — Day 4 — Session Day 3 1:30pm-1:50pm

Local Models: Trust, Control, Optimization looks at why builders are choosing local AI for privacy, reliability, customization, cost, and ownership, while still asking where cloud remains necessary. The panel covers local-first versus hybrid strategies, the role of open-source models, and the infrastructure stacks making frontier-quality intelligence possible outside centralized APIs.

Moderator: Carter Abdallah (NVIDIA). Panelists: Vincent Weisser (Prime Intellect), Lucas Atkins (Arcee AI), Chris Alexiuk (NVIDIA), Lou (Z.ai).

- Local Models: Trust, Control, Optimization — Day 4 — Session Day 3 1:55pm-2:15pm

Local Models: Trust, Control, Optimization looks at why builders are choosing local AI for privacy, reliability, customization, cost, and ownership, while still asking where cloud remains necessary. The panel covers local-first versus hybrid strategies, the role of open-source models, and the infrastructure stacks making frontier-quality intelligence possible outside centralized APIs.

Moderator: Carter Abdallah (NVIDIA). Panelists: Vincent Weisser (Prime Intellect), Lucas Atkins (Arcee AI), Chris Alexiuk (NVIDIA), Lou (Z.ai).

Vincent Wendy

  • Role: Senior Creative Designer
  • Company: AI Engineer
  • Bio: Vincent Wendy is a Senior Creative Designer at AI Engineer, focused on UX/UI, visual design, and creative systems for technical audiences. He previously worked across UX and graphic design, including User Experience Lead work at KodeFox.
  • LinkedIn: https://id.linkedin.com/in/vinwendy
  • Photo: /wf26/speakers/by-id/spk_vincent_wendy.jpg
  • Sessions:

- One Designer + Al. Hundreds of Deliverables. — Day 3 — Session Day 2 3:45pm-4:05pm

TBD — internal AI Engineer design talk about designing for AIE.

Vinoo Ganesh

  • Role: CEO & Co-Founder
  • Company: Kepler
  • Bio: CEO & Co-Founder of Kepler, backed by founders of OpenAI and Facebook AI, building verifiable AI infrastructure. Vinoo spent 7 years at Palantir as a software engineer and forward deployed engineer, leading and deploying across commercial, healthcare, and defense. He led Project Frontline, Palantir's rotation program that moves software engineers into FDE roles. Later, he served as Head of Business Engineering at Citadel, leading the FDE function embedded in the investment business.
  • Twitter: https://x.com/vinooganesh
  • LinkedIn: https://www.linkedin.com/in/vinoo-ganesh/
  • Website: https://vinoo.io
  • Photo: /wf26/speakers/by-id/spk_vinoo_ganesh.jpg
  • Sessions:

- How Forward Deployed Engineering is done at Kepler — Day 2 — Session Day 1 3:20pm-3:40pm

- How Kepler Built Verifiable AI for Financial Services — Day 4 — Session Day 3 12:05pm-12:25pm

Financial answers have to be auditable. Vinoo Ganesh (CEO, Kepler) shows how Kepler Finance pairs Claude's reasoning with deterministic verification infrastructure to index 26M+ SEC filings across 14,000+ companies and 27 markets — and validate every number back to the exact filing, page, and line item. A look at trust, provenance, and content engineering for AI in regulated finance.

Vinoth Govindarajan

  • Role: Member of Technical Staff
  • Company: OpenAI
  • Bio: Vinoth Govindarajan is a Member of Technical Staff at OpenAI, where he works on core data infrastructure for large-scale AI systems and internal agent platforms. His work focuses on control planes, stateful architectures, scalability, low-latency systems, observability, and reliability patterns that make production system safe, resilient, and predictable.

Vinoth brings an end-to-end perspective on modern data platforms and open table formats. Before OpenAI, he was a Staff Software Engineer at Apple, where he helped build next-generation data platforms using Apache Iceberg, Spark, Trino, and Flink. Earlier, at Uber, he developed incremental ETL frameworks and real-time data pipelines powered by Apache Hudi.

Outside of his work, he is the co-author of Engineering Lakehouses with Open Table Formats book and writes The Agent Stack on substack platform, a systems-first publication about production AI agents and data infrastructure. Vinoth is also an open-source contributor and has presented at industry conferences and community events on lakehouse architecture, data systems, and agent harness.

  • Twitter: https://x.com/iamvinoth
  • LinkedIn: https://www.linkedin.com/in/vinothgovindarajan/
  • Website: https://theagentstack.substack.com/
  • Blog: https://theagentstack.substack.com/
  • Photo: /wf26/speakers/by-id/spk_vinoth_govindarajan.jpg
  • Sessions:

- Your Agent Didn’t Fail. Your Harness Did. — Day 2 — Session Day 1 11:10am-11:30am

AI agents do not fail only because the model is wrong. Many production failures happen in the harness around the model: state is not persisted, two runs mutate the same session, a tool call never returns, an approval loses scope, or an internal success never becomes user-visible proof. This talk uses OpenClaw as a public case study to examine real harness failure modes and extract a reusable production model for AI engineers. We will look at how events enter an agent system, how session state is rehydrated, why single-writer lanes and throttles matter, and why tool execution needs scoped approvals and auditable receipts. The core idea is simple: a model proposes, the harness commits, and the receipt proves it. Attendees will leave with a practical 'run receipt' audit they can apply to their own agents: what woke it up, which state did it inherit, what authority did it use, what executed, and what evidence survived.

Viren Baraiya

  • Role: Co-Founder and CTO
  • Company: Orkes
  • Bio: Viren Baraiya is Co-Founder and CTO of Orkes, a company building cloud services around the Conductor orchestration platform. He was one of the original creators of Netflix Conductor and previously led engineering work at Netflix and Google.
  • LinkedIn: https://www.linkedin.com/in/virenb
  • Photo: /wf26/speakers/by-id/spk_viren_baraiya.jpg
  • Sessions:

- Harnessing Agents: The Durable Runtime for Dynamic Workflows — Day 3 — Session Day 2 11:10am-11:30am

Agents increasingly generate and revise workflows at runtime instead of following control flow written in advance. That breaks a common assumption of durable execution: that the workflow graph is known when the system is deployed. How do you safely run and recover a program that did not exist until an agent created it? This talk shows how Conductor provide a durable harness for dynamic workflows. Connecting existing agent frameworks to Conductor without requiring developers to rewrite their agent logic. Conductor executes the generated plan as an inspectable workflow with durability, parallelism, retries, human approvals, MCP tool calls and policy enforcement. We will demonstrate an agent creating a workflow, executing part of it, and replanning the remainder as conditions change while preserving completed work and using idempotency and saga compensation to manage side effects safely. The agent owns the plan. The harness owns the guarantees.

Vivek Muppalla

  • Role: VP AI Engineering
  • Company: Hippocratic AI
  • Bio: Vivek Raju Muppalla is VP of AI Engineering at Hippocratic AI, where he leads product engineering for healthcare agents powering AI Front Door, Nurse Co-Pilot, and over 200 million patient-agent interactions. His focus is on turning frontier models into clinically safe, production-grade voice agents across real-time orchestration, evaluation, reliability, and patient-facing workflows.

Vivek has spent over a decade building applied AI and large-scale production systems across Cohere, Scale AI, Unity Technologies, Amazon, Groupon, and Expedia. His work has spanned GenAI applications, synthetic data, computer vision, simulation, and production ML at scale. At Cohere, as VP of AI Engineering and Custom Models, he launched GenAI products across Fortune 500 enterprises and co-developed Takane, a high-performing Japanese LLM built in partnership with Fujitsu.

Throughout his career, Vivek has focused on the hardest part of shipping AI: building systems that are reliable, measurable, and useful in production.

  • Twitter: https://x.com/vim1up
  • LinkedIn: https://www.linkedin.com/in/vivekmuppalla/
  • Website: https://hippocraticai.com/
  • Photo: /wf26/speakers/by-id/spk_vivek_raju_muppalla.jpg
  • Sessions:

- 200 Million Patient Interactions Later: What the Generic Voice Stack Misses — Day 4 — Session Day 3 12:05pm-12:25pm

A healthcare voice agent can be right on the benchmark and still fail in production. Real patients hesitate, interrupt, misremember medications, code-switch mid-sentence, and disclose risk indirectly. After 200M+ patient-agent interactions, the lesson is clear: in clinical voice AI, interaction is a safety variable. This talk breaks down what Hippocratic AI had to rebuild beyond the generic voice stack: not just ASR, VAD, an LLM, TTS, and turn-taking heuristics, but a real-time safety system that treats silence, clarification, escalation, multilingual continuity, and medication-specific recognition as first-class engineering problems. We’ll walk through the production architecture behind Hippocratic AI’s voice agents: a 30+ model supervisor constellation, including the 4.1T-parameter AI Front Door system, designed to catch failures a single primary model misses. The talk covers how specialized models monitor medication identification, overdose risk, labs and vitals, escalation criteria, workflow confirmation, and other clinical safety surfaces while the patient conversation is still happening. We’ll focus on four production lessons: - Benchmarks are not enough: MedQA and USMLE-style accuracy do not capture the failure modes that appear in a 12-minute, multi-turn patient call. - Interaction signals become training data: pauses, interruptions, hesitation, clarification requests, and escalation markers are mined from production calls and turned into structured eval and training signals. - One LLM is not a safety architecture: supervisor models can overrule, block, or escalate when the primary model sounds plausible but misses a clinical risk. - Voice infrastructure has clinical failure modes: domain ASR, medication vocabulary, code-switching, latency, and turn-taking all affect whether the system makes the right next move.

Vivek Trivedy

  • Role: Head of Applied Research
  • Company: LangChain
  • Bio: Vivek leads Applied Research at LangChain Labs where he's focused on cracking Continual Learning & making it accessible to the world's agent builders. Previously he worked on the LangChain Deep Agents open-source agent harness, worked on his own startup around agents for visual reasoning, and did Health AI at AWS for ~4 years. His PhD was focused on representation learning in Computer Vision.
  • Twitter: https://x.com/Vtrivedy10
  • LinkedIn: https://www.linkedin.com/in/vivek-trivedy-433509134/
  • Website: https://www.vtrivedy.com/
  • Blog: https://www.vtrivedy.com/
  • Photo: /wf26/speakers/by-id/spk_vivek_trivedy.jpg
  • Sessions:

- Improving Agents is a Data Mining Problem — Day 3 — Session Day 2 1:55pm-2:15pm

Harness Engineering, Post-Training, Continual Learning...these all boil down to the same underlying substrate - Mining Agent Traces 1. I need to run my agents to collect Traces 2. Understand behaviors from Traces at scale 3. Filter data for "improvement" 4. Do an improvement step There's a reason why every continual learning platform ends up looking like an observability platform. It's because Traces are the lifeblood of agent improvement. The mechanism that we use to attempt improvement can vary - Harness Eng, SFT, etc. But without understanding the data agents produce, no algorithm will truly build better agents. The holy grail of Agent Improvement is Continual Learning. Consistently mining data and integrating it into the agent definition over infinitely long time horizons. Today, the easiest way to do that is to build an observability platform and constantly point agentic compute to understand the data that agents produce. We'll walk through the current methods of understanding traces at massive scale and choosing how to integrate them to improve agents across your personal agents, team agents, and entire company.

Vlad Luzin

  • Role: Founder
  • Company: Band.ai
  • Bio: Vlad Luzin is Founder of Band.ai, working on agent orchestration and deployment patterns for AI systems.
  • LinkedIn: https://il.linkedin.com/in/luzin
  • Photo: /wf26/speakers/by-id/spk_vlad_luzin.jpg
  • Sessions:

- Every Agent, Everywhere, All at Once — Day 2 — Session Day 1 1:30pm-1:50pm

Coding agents are deaf to anything outside their own session, and a LangGraph or CrewAI one has no idea the others exist. Different vendors, different frameworks, different machines none of them share a way to work together. This demo fixes that live: the Claude Code on your laptop, Codex on your colleague's, a LangGraph agent you're running locally, and the OpenClaw on your Mac Studio at home collaborating on the same goal, going back and forth, full-duplex, across every vendor, framework, and machine line at once.

- Is Orchestration the Future? — Day 3 — Session Day 2 11:10am-11:30am

ChatGPT, Claude Code, OpenClaw — three inflection points that reshaped the industry in two years, each pointing the same way: the next step is many agents, not one. Which raises the question nobody's answered well yet — how do many agents actually work together? Today's answer is orchestration, and it's genuinely good — until you need stateful peers holding a single conversation together, which none of them are built to do. So we'll make a different case: that the next inflection point is a collaboration layer that lets separate agent systems share one conversation as stateful peers, whatever they're built on. We'll show that this is the inflection point the last three were leading to with a demo and a real enterprise use case.

Vyas A

  • Role: Head of Product
  • Company: Plivo
  • Bio: Leads Product at Plivo - the developer platform for building voice and messaging agents. His work spans product, growth and developer experience.
  • Twitter: https://x.com/not_ryan_vy45
  • LinkedIn: https://www.linkedin.com/in/narayanvyas/
  • Photo: /wf26/speakers/by-id/spk_vyas_a.jpg
  • Sessions:

- 5 Voice Agent Failure Modes You'll Hit in Week One — Day 2 — Session Day 1 1:55pm-2:15pm

Building a voice agent that demos well is easy now. The hard part starts the second a real person calls it. Most voice agents today are basically a chatbot with a microphone bolted on, they listen, then think, then talk, one side at a time, like a walkie talkie. Real conversations don't work that way. People pause in the middle of a thought, they say "um" and "uh", they talk over you, they change their mind halfway through. The agent has to work out when you're actually done talking, when it should stop talking, and when you've said something it cannot afford to get wrong, like your phone number or email. None of this shows up when you test with text. All of it shows up in week one.

This talk is the five failures that hit every team in that first week, the ones we see again and again. For each case we will walk though examples and best practices for what actually breaks and what to do about it. If you're about to put a voice agent in front of real callers, or you already did and it's quietly falling apart, this is the talk that saves you the weeks everyone else burns figuring it out

Walden Yan

  • Role: Co-founder & CPO
  • Company: Cognition
  • Bio: Walden Yan is the co-founder and Chief Product Officer of Cognition, the AI lab behind Devin, the autonomous AI software engineer. He helped shape Devin from an early research project into a product now deployed across some of the world's largest enterprises. Before Cognition, Walden was a gold medalist for the United States at the International Olympiad in Informatics and studied Computer Science and Economics at Harvard.
  • Twitter: https://x.com/walden_yan
  • LinkedIn: https://www.linkedin.com/in/waldenyan
  • Photo: /wf26/speakers/by-id/spk_walden_yan.jpg
  • Sessions:

- Model Routing — Day 4 — Session Day 3 3:20pm-3:40pm

Model Routing explores how teams decide when to use local models, open-source models, or frontier cloud systems, and why the answer is increasingly hybrid rather than one-size-fits-all. The panel digs into routing architectures, model selection strategies, stack decisions, and what still needs to improve in local AI before more workloads can move closer to the user.

Moderator: Nader Khalil (NVIDIA). Panelists: Walden Yan (Cognition), Tanay Varshney (NVIDIA), Alex Atallah (OpenRouter).

- Model Routing — Day 4 — Session Day 3 3:45pm-4:05pm

Model Routing explores how teams decide when to use local models, open-source models, or frontier cloud systems, and why the answer is increasingly hybrid rather than one-size-fits-all. The panel digs into routing architectures, model selection strategies, stack decisions, and what still needs to improve in local AI before more workloads can move closer to the user.

Moderator: Nader Khalil (NVIDIA). Panelists: Walden Yan (Cognition), Tanay Varshney (NVIDIA), Alex Atallah (OpenRouter).

Wallon Walusayi

  • Company: Qodo
  • LinkedIn: https://www.linkedin.com/in/wallon
  • Photo: /wf26/speakers/by-id/spk_wallon_walusayi.jpg
  • Sessions:

- AI Engineering & Governance 2026 Trends — Day 3 — Session Day 2 10:45am-11:05am

AI Engineering & Governance 2026 Trends

- AI Engineering & Governance 2026 Trends — Day 4 — Session Day 3 10:45am-11:05am

AI Engineering & Governance 2026 Trends

Wei-Lin Chiang

  • Role: Co-founder & CTO
  • Company: Arena
  • Bio: Wei-Lin Chiang is the co-founder and CTO of Arena, the leading open platform for evaluating AI through real-world human feedback. A systems builder and researcher, Wei-Lin has played a foundational role in the design, scaling, and launch of the community-driven evaluation platform.

He earned his Ph.D. in Computer Science from UC Berkeley where worked with Ion Stoica. His research focused on AI systems and evaluation. His work spans everything from efficient distributed systems to dataset curation, and model evaluation, with publications in top venues including ICLR, NeurIPS, NSDI, SIGMOD, KDD, and ICML. He was a core contributor to widely cited projects such as Chatbot Arena, LLM judge, Vicuna, and co-authored multiple open benchmarks that shape how AI models are evaluated today.

Outside of work, Wei-Lin enjoys hiking and cycling, often exploring new trails and routes as a way to unwind and stay active.

  • Twitter: https://x.com/infwinston
  • LinkedIn: https://www.linkedin.com/in/wei-lin-chiang-51b025b2/
  • Website: https://infwinston.github.io/
  • Photo: /wf26/speakers/by-id/spk_wei_lin_chiang.jpg
  • Sessions:

- Closing Keynote — Day 3 — Session Day 2 5:10pm-5:30pm

Whitney Lee

  • Role: Senior Technical Advocate
  • Company: Datadog
  • Bio: Whitney Lee is a creator and systems thinker who explores how observability, AI, and platform engineering connect across the cloud native ecosystem. She brings humor, depth, and clarity to complex technologies while building original frameworks that help others understand how systems fit together. She runs a vibrant YouTube channel, hosts Datadog Illuminated, has delivered 2 KubeCon keynotes, and combines storytelling and technical rigor to illuminate the human side of cloud native engineering.
  • LinkedIn: https://www.linkedin.com/in/whitneylee/
  • Website: https://whitneylee.com/
  • Photo: /wf26/speakers/by-id/spk_whitney_lee.jpg
  • Sessions:

- Build a Platform, Unleash an Agent on it.... and Watch it Burn! — Day 1 — Workshop Day 1:15pm-2:15pm

You get a Kubernetes cluster with an Internal Developer Platform already running: ArgoCD for GitOps, Kyverno for admission control, Falco for runtime detection, Prometheus for observability. Everything is instrumented. Everything is enforced. You also get an AI agent with cluster access. Your job is to get the agent to break something. Deploy a non-compliant workload. Escalate privileges. Modify infrastructure outside Git. Exfiltrate data through an agent response. Some of you will fail because the governance stack catches it. Some of you will succeed because it doesn't. Afterward we regroup and map what got blocked, what slipped through, and why. The 80% that existing CNCF tools already govern becomes obvious. The 20% gap where agent-specific tooling is missing becomes undeniable. You leave with a concrete governance map and the exact list of failure modes your own platform probably isn't covering yet.

Will Bond

  • Role: Staff Software Engineer
  • Company: Uber
  • Bio: Staff Software Engineer at Uber, working on AI developer experience and code review
  • Twitter: https://x.com/wbond
  • LinkedIn: http://linkedin.com/in/wbond
  • Photo: /wf26/speakers/by-id/spk_will_bond.jpg
  • Sessions:

- Scaling Code Quality: Building uReview, Uber’s Multi-Agent Code Review Engine — Day 2 — Session Day 1 12:05pm-12:25pm

At Uber scale, human-only code reviews create massive bottlenecks, while generic AI tools overwhelm developers with noisy, hallucinated spam. This session explores the architecture behind uReview, Uber’s multi-agent AI code review engine designed strictly for high-precision feedback. Attendees will learn how we moved beyond monolithic prompts to build a modular pipeline featuring deep contextual ingestion, specialized domain agents, and a Generator-Verifier grader system. By enforcing strict confidence scoring and semantic deduplication, uReview filters out AI noise, shifting the focus from comment quantity to high-signal actionability and significantly reducing Pull Request cycle times. Talk Outline I. The Code Review Crisis at Uber Scale (0–3 mins) Establish the critical tension between engineering velocity and code quality, highlighting why standard AI implementations fail in massive monorepo environments. 1. The Monorepo Bottleneck: At Uber, thousands of engineers commit code daily. Relying solely on human reviewers creates a massive operational bottleneck, leading to reviewer fatigue, extended Pull Request cycle times, and inevitable missed vulnerabilities. 2. The Developer Spam Problem: Generic LLM integrations fail because they prioritize comment quantity over actionable quality. If an AI posts ten hallucinated suggestions on a diff, developers will simply mute the tool. AI must reduce cognitive load, not add to it. 3. The Signal-to-Noise Mandate: Defining the North Star for uReview. The goal is not to replace human reviewers, but to build an AI system that respects developer time by delivering high-precision, strictly verified code feedback. II. The uReview Architecture: A Modular Agentic Pipeline (3–10 mins) Detail the transition from a monolithic prompt approach to uReview’s sophisticated, multi-stage agentic workflow designed for enterprise codebases. 1. Deep Contextual Ingestion: A standard git diff is not enough. We discuss how uReview fetches extended context, integrating with our build systems to analyze surrounding functions, upstream dependencies, and class hierarchies before generating a single token. 2. Specialized Domain Assistants: Instead of a generalist model, uReview deploys independent AI agents. We route code to narrow, specialized agents—such as a Go Concurrency Analyzer, a Java Memory Leak Detector, or a Security Vulnerability Scanner—to ensure precise, domain-specific insights. 3. Hybrid Intelligence: Probabilistic LLMs cannot operate in a vacuum. We detail how uReview integrates deterministic tools, like Bazel dependency graphs and static linters, to ground AI suggestions in objective codebase realities. III. Engineering the Trust Layer (10–17 mins) Dive into the verification phase. This is the core engineering that filters out AI noise and ensures uReview maintains developer trust. 1. The Generator-Verifier Pattern: Implementing a Grader Model architecture. A primary agent generates code suggestions, but a secondary, high-reasoning model audits those suggestions against strict coding guidelines to catch hallucinations before they reach the PR. 2. Confidence Scoring and Suppression: We assign a numerical confidence score to every generated comment. If a comment falls below our calibrated threshold, uReview silently drops it. We explore the engineering behind suppressing low-confidence outputs to prevent tooling spam. 3. Semantic Deduplication: Technical strategies for merging overlapping warnings. If a deterministic static analysis tool and an LLM agent flag the same null pointer exception, uReview merges them into a single, concise developer instruction. IV. Operationalizing uReview at Scale (17–20 mins) Conclude by discussing the long-term governance, feedback loops, and measurable impact of running an AI review engine in production. 1. The Telemetry Feedback Loop: We embedded Useful and Not Useful rating buttons directly into the developer UI on every uReview comment. We discuss how this telemetry flows back into a curated data lake, driving continuous Reinforcement Learning from Human Feedback and prompt refinement. 2. Shifting Success Metrics: Why organizations must abandon vanity metrics like total comments posted. We measure uReview’s success through Actionability Rate (the percentage of AI comments accepted as commits) and the reduction in Mean Time To Merge.

Will Brown

  • Role: Researcher
  • Company: Prime Intellect
  • Bio: Will Brown leads Applied Research at Prime Intellect and builds open research infrastructure to enable every company to train, deploy, and self-improve their own frontier agentic models. He holds a PhD in Computer Science from Columbia University.
  • Twitter: https://x.com/willccbb
  • LinkedIn: https://www.linkedin.com/in/willcb/
  • Website: https://willcb.com
  • Photo: /wf26/speakers/by-id/spk_will_brown_willccbb.jpg
  • Sessions:

- The Prime Intellect Stack — Day 1 — Workshop Day 4:30pm-5:30pm

Deep dive into Prime Intellect's open-source ecosystem of post-training tools, including the verifiers and prime-rl libraries, as well as our Lab platform for self-serve training and inference.

- Reinforcement Learning without Verifiable Rewards — Day 3 — Session Day 2 1:30pm-1:50pm

Verifiable rewards are the gold standard for RL training, but real-world agent tasks frequently lack clean deterministic evaluation objectives. This talk surveys our efforts to scale RL in non-verifiable settings -- including task synthesis, unsupervised environment design, and automatic judge calibration -- to ultimately enable self-improvement in production, grounded in real-world agent traces and domain-specific context.

Will Bryk

  • Role: CEO
  • Company: Exa
  • Bio: Will Bryk is building the next generation of search. As co-founder and CEO of Exa, he's chasing perfect search: an engine that gives every AI the highest quality information in the world, retrievable in milliseconds. Exa powers search for Cursor, Cognition, HubSpot, OpenRouter, and over 400,000 developers, and recently raised a $250M Series C at a $2.2B valuation from a16z, Benchmark, Lightspeed, Nvidia, and Y Combinator. Will grew up in New York City and studied CS and physics at Harvard, where he led the robotics club and dove into ML research. He founded Exa on one conviction: information is civilizational infrastructure, and if every AI can reach the best information, so can every human.
  • Twitter: https://x.com/WilliamBryk
  • LinkedIn: https://www.linkedin.com/in/william-bryk/
  • Photo: /wf26/speakers/by-id/spk_will_bryk.jpg
  • Sessions:

- The Search Engine for the Agentic Web — Day 2 — Session Day 1 11:40am-12:00pm

Every search API claiming to be "built for AI" is actually Google with a wrapper. That's a problem, because AI agents don't search like humans. A human waits 1 second for a result. An agent making 50 sequential searches at 1 second each creates a 50-second lag. That kills the product. And latency is just one dimension: agents need semantic precision, structured outputs, and a range that spans sub-200ms real-time retrieval all the way to multi-step deep research. No human-facing search engine was ever designed to do that. Will Bryk, CEO of Exa, shares what he learned building a search engine from scratch for AI. He'll cover the architectural decisions behind Exa's latency spectrum, what real usage patterns look like across companies like Cursor, Notion, HubSpot, and Lovable, and why the benchmarks the field relies on today are dangerously inadequate for evaluating agentic search. The bigger argument: search is becoming the most critical primitive in AI infrastructure, and almost no one is building it right.

Will Lyon

  • Role: Product Manager
  • Company: Neo4j
  • Bio: William Lyon is a Product Manager for AI Innovation at Neo4j, where he is building graph intelligence for AI agents. He is the author of the book Fullstack GraphQL and has a masters degree in computer science from the University of Montana. You can find him online at lyonwj.com
  • Twitter: https://x.com/lyonwj
  • LinkedIn: https://www.linkedin.com/in/lyonwj/
  • Website: https://lyonwj.com/
  • Photo: /wf26/speakers/by-id/spk_tbd.jpg
  • Sessions:

- Actionable Knowledge For Agents With Context Graphs — Day 2 — Session Day 1 11:10am-11:30am

Willem Pienaar

  • Role: Co-founder and CTO
  • Company: Cleric
  • Bio: Willem Pienaar is co-founder and CTO of Cleric, where he is building an autonomous AI SRE. He is also the creator of Feast and has worked on MLOps and open-source tooling.
  • Twitter: https://x.com/willpienaar
  • LinkedIn: https://www.linkedin.com/in/willempienaar
  • Website: https://willem.co
  • Photo: /wf26/speakers/by-id/spk_willem_pienaar.jpg
  • Sessions:

- Your Agent Can't Tell If It's Right — Day 4 — Session Day 3 10:45am-11:05am

Coding agents feel reliable because of one signal you never think about: the tests. They catch confident mistakes in seconds, so you never see most of them. The real world has no test suite. Put an agent in production and that signal is gone, and a wrong answer looks the same as a right one. So how do you know it's right? We watched our agent look at an 80% drop in throughput and report zero user impact, because a similar alert the month before had been noise. The data to catch it was already in front of it. There is no single verifier, but there are several weaker signals. While the agent reasons: grounding each claim against live data, and looking for evidence that distinguishes competing hypotheses. Before it acts: calibrated confidence, and a separate critic. After it acts: whether the fix held, whether the alert returned, whether an engineer redid the work. None is conclusive on its own. Combined, they estimate whether the agent was right. The talk covers where these signals come from, how we combine them, and how often they still disagree.

Wolfram Ravenwolf

  • Role: AI Evangelist
  • Company: Weights & Biases by CoreWeave
  • Bio: Wolfram Ravenwolf is an AI Evangelist at CoreWeave / Weights & Biases, where he helps builders evaluate, debug, and ship useful AI systems. He works across model evaluation, agent tooling, inference infrastructure, and developer education, translating hands-on engineering work into practical guidance for teams adopting frontier AI. Wolfram is the creator of WolfBench, a five-metric framework for evaluating agent performance based on Terminal-Bench 2.0, and regularly tests new models, coding agents, and evaluation workflows in real-world conditions. He is also a ThursdAI co-host, speaker, writer, and longtime AI community builder. Before joining CoreWeave/W&B, he worked as an engineer, researcher, and consultant focused on making complex technology usable. His talks are practical, opinionated, and grounded in live experimentation: fewer buzzwords, more working systems.
  • Twitter: https://x.com/WolframRvnwlf
  • LinkedIn: https://de.linkedin.com/in/wolframravenwolf
  • Website: https://wolfbench.ai/
  • Blog: https://wolfbench.ai/
  • Photo: /wf26/speakers/by-id/spk_wolfram_ravenwolf.jpg
  • Sessions:

- From Zero to Leaderboard: Building an End-to-End AI Agent Evaluation Pipeline — Day 1 — Workshop Day 12:10pm-1:10pm

Running one agent eval is easy. Running hundreds — with controlled timeouts, replicated configs, and automated collection across distributed VMs — requires infrastructure that most teams end up building from scratch. In this workshop, we shortcut that process and build a rigorous evaluation pipeline end-to-end. Participants will set up and connect the full evaluation stack: Layer 1 — The Benchmark Runner. Configure Harbor to orchestrate parallel agent evaluations on Terminal-Bench 2.0, with W&B Sandboxes providing isolated environments for each task. Layer 2 — The Collection Pipeline. Use WolfBench to scan distributed VMs for results, deduplicate across runs, download trajectories, and build a local results archive that survives VM teardown. Layer 3 — The Analysis Framework. Compute the five-metric framework (Ceiling / Best / Average / Worst / Solid) across replicated runs. Learn to read the spread: when is a model "better"? When is a score difference just noise? Layer 4 — The Observability Layer. Upload full agent conversation traces to W&B Weave for per-turn inspection. See exactly where an agent goes wrong — the command it ran, the output it misread, the moment it started looping. Layer 5 — The Leaderboard. Generate interactive HTML charts that show the full performance distribution, not a single bar. We'll work with real data from hundreds of production runs, and participants will leave with a working pipeline they can adapt to their own agents and benchmarks. Laptops required; all tools are open-source.

XiangMing Sun

  • Company: Unitree
  • Sessions:

- Unitree: Building Mass Produced Humanoids — Day 3 — Session Day 2 1:30pm-1:50pm

Yogendra Miraje

  • Role: Principal AI Engineer
  • Company: FactSet
  • Bio: Principal AI Engineer in FactSet Research Systems. Building Agentic Systems grounded in Financial data.
  • Twitter: https://x.com/YogiNotTheBear
  • LinkedIn: https://www.linkedin.com/in/mirajey/
  • Website: https://yogimiraje.com
  • Photo: /wf26/speakers/by-id/spk_yogendra_miraje.jpg
  • Sessions:

- Skills are new features: Building Skill-Centric Harness for Agentic Products — Day 4 — Session Day 3 3:20pm-3:40pm

Yohan Raju

  • Role: Pre-Sales Expert
  • Company: Bright Data
  • Bio: Yohan Raju is a solutions engineer and pre-sales specialist at Bright Data with experience in custom solutions, client success and real-time web data workflows.
  • LinkedIn: https://www.linkedin.com/in/yohan-raju-04221996
  • Photo: /wf26/speakers/by-id/spk_yohan_raju.jpg
  • Sessions:

- Building AI Agents with Real-Time Web Data — Day 1 — Workshop Day 12:10pm-1:10pm

Your AI agent is only as good as the data it can access — and static training data isn't enough anymore. In this hands-on workshop, you'll learn how to connect AI agents to the live web using Bright Data's MCP (Model Context Protocol) server and scraping APIs, turning any LLM into a real-time web-aware system.

Yohei Nakajima

  • Role: Managing Partner
  • Company: Untapped Capital
  • Bio: Yohei Nakajima is a General Partner and co-founder of Untapped Capital, a pre-seed venture fund backing unexpected founders at the earliest stages. He is best known as the creator of BabyAGI, one of the early open-source autonomous agent experiments that helped popularize task-driven AI agents. Yohei’s work sits at the intersection of venture capital, software prototyping, and frontier AI research: he builds tools and experiments to understand where technology is going, then uses those lessons to support founders and shape investment theses.

Most recently, Yohei has been developing ActiveGraph, an event-log-native architecture for building agents that are replayable, inspectable, forkable, and capable of continuous improvement. Across investing, writing, demos, and open-source projects, his approach is simple: build to learn, share what works, and help more people understand what AI-native systems make possible.

  • Twitter: https://x.com/yoheinakajima
  • LinkedIn: https://www.linkedin.com/in/yoheinakajima
  • Website: https://yoheinakajima.com
  • Blog: https://yohei.me
  • Photo: /wf26/speakers/by-id/spk_yohei_nakajima.jpg
  • Sessions:

- Active Graph Agent Runtime (BabyAGI 4) — Day 4 — Session Day 3 11:10am-11:30am

Proposing a novel event-sourced graph runtime for building long-running auditable, agentic systems. Built on top of and combining various BabyAGI iterations and graph experiments (memory, code, logs) into a single primitive.

Yoni Michael

  • Role: Co-Founder
  • Company: typedef
  • Bio: Yoni Michael is the co-founder of Typedef, a company building the data context layer for AI agents working across modern data stacks. Typedef analyzes transformation code, lineage, schemas, metrics, and usage patterns to help agents reason safely about complex data systems.

Yoni has spent more than a decade building infrastructure and data platforms at the intersection of data and AI. Prior to Typedef, he led infrastructure engineering teams at Tecton and Salesforce. He previously co-founded Coolan, a data center analytics company acquired by Salesforce.

  • Twitter: https://x.com/yoni_michael
  • LinkedIn: https://www.linkedin.com/in/yonimichael
  • Photo: /wf26/speakers/by-id/spk_yoni_michael.jpg
  • Sessions:

- The Data Context Layer: Why Data Engineering Agents Need More Than Code and Databases — Day 1 — Workshop Day 2:20pm-4:20pm

Modern AI agents typically understand either code or databases. Code-focused agents reason over files, dependencies, and syntax, while database agents see tables, columns, and query results. This works for software development and basic analytics—but it breaks down for data engineering. In real data environments, agents fail because they lack context: an understanding of how data flows, what it represents, and why it behaves the way it does in production. Introducing the data context layer—a missing third layer that bridges code, data, and business semantics. Without it, agents hallucinate impact, suggest unsafe joins, and struggle with root cause analysis. This presentation will define the data context layer and showcase its use in practice, including end-to-end lineage from sources to reports; semantic metadata such as grain, measures, dimensions and business logic; runtime signals including job executions, failures, and performance patterns; and logical vs. physical modeling distinctions. Attendees will walk away with a greater understanding of: Why the code layer (dbt SQL, manifests, Git history) provides structure but misses grain, aggregation semantics, and join safety Why the data layer (warehouse tables, execution metrics, failures) shows what happened, but not why How the data context layer unifies lineage, semantic metadata, runtime behavior, and business rules The presentation will also cover architecture patterns for building and maintaining a data context layer, including why property graphs are well-suited for contextual reasoning and how agents can query context safely instead of relying on prompt stuffing.

Yu Su

  • Role: Co-founder and CEO
  • Company: NeoCognition
  • Bio: Co-founder and CEO at NeoCognition. Associate Professor at the Ohio State University. Building towards continual learning and abundance of specialized intelligence.
  • Twitter: https://x.com/ysu_nlp
  • LinkedIn: https://www.linkedin.com/in/ysu1989/
  • Website: https://ysu1989.github.io/
  • Photo: /wf26/speakers/by-id/spk_yu_su.jpg
  • Sessions:

- Intelligence + Continual Learning = Expertise — Day 3 — Session Day 2 12:05pm-12:25pm

Talk on continual learning for LLMs and agents, drawing on retrieval-to-memory and environment-adaptation research.

Yubo Wang

  • Role: LLM Inference
  • Company: Together AI
  • Bio: Yubo Wang works on LLM inference at Together AI. His work and WF26 session focus on open-source inference engineering, serving large models efficiently, and building production inference systems for agentic workloads.
  • LinkedIn: https://www.linkedin.com/in/yubo-wang-057616117
  • Photo: /wf26/speakers/by-id/spk_yubo_wang.jpg
  • Sessions:

- Open-Source Inference Engineering for the Agentic Era — Day 1 — Workshop Day 9:00am-11:00am

Agentic coding workloads demand long contexts, multi-turn conversations, and throughput at a scale that most inference engines weren't built for. TokenSpeed is a new open-source engine purpose-built for this regime, built collaboratively by NVIDIA DevTech, AMD Triton, Qwen Inference, Together AI, and others. In this 2-hour hands-on workshop, Together Inference Research Engineers and a TokenSpeed co-creator will cover TokenSpeed architecture, deploying your first model, optimizing for agentic workloads, kernel and hardware tuning, and throughput/latency trade-offs.

Yuchen Fama

  • Role: Senior Principal Product Manager
  • Company: Red Hat
  • Bio: Yuchen Fama is a Builder, Benchmarker, and Senior Principal Product Manager of Inference at Red Hat and also a contributor to vLLM and GuideLLM. She has more than 15 years of experience in ML and AI. She has served as VP of Product, CTO, and CPO at multiple AI startups and previously led AI/ML research teams within several Fortune 500 companies. She holds a Ph.D. in Statistics and enjoys reading, traveling, skiing, and scuba diving.
  • LinkedIn: https://www.linkedin.com/in/yuchen-fama
  • Photo: /wf26/speakers/by-id/spk_yuchen_fama.jpg
  • Sessions:

- KV Cache-Aware Routing and P/D Disaggregation on Kubernetes: The Parts Public Benchmarks Don't Show — Day 4 — Session Day 3 2:50pm-3:10pm

We're at the inflection point between classic LLM inference and agentic inference. When we look at the agentic workloads and trace replays, many core characteristics break classic LLM serving assumptions. The most consequential: the server no longer controls its own cache lifecycle. The client does, through prompt construction, multi-turn context that grows and changes each turn.

This has downstream effects. Because context is client-determined, prefill strategy, eviction, and routing decisions move up to the scheduler layer. KV cache becomes volatile — frequent eviction and rewrite, driven from outside the engine. And latency becomes a first-class scheduling metric alongside throughput. This talk covers the open stack for LLM and agentic era inference serving: vLLM and llm-d.

We begin with the core characteristics and challenges of agentic inference, then the economics: prefill dominates cost, and cache reuse is the primary lever. We explain why KV-aware routing through a fleet-wide scheduler is the first optimization to apply, ahead of adding capacity.

Next, prefill/decode disaggregation. We separate compute-bound prefill from memory-bound decode, and examine what public benchmarks omit: the conditions under which P/D disaggregation shines, and the workload shapes that justify the added architectural complexity.

We close with GLM-5.2 and show the equivalent stack assembled in the open: cache-aware routing, P/D disaggregation, tiered KV offload, and wide expert parallelism — implemented on vLLM and llm-d.

Attendees leave with a tuning decision framework: which lever to apply first, how to read workload signals, and where additional GPUs do and don't help.

Yunmo Koo

  • Role: Founding Engineer
  • Company: FriendliAI
  • Bio: Yunmo Koo is a founding engineer at FriendliAI focused on LLM inference optimization, distributed training, multi-cloud systems, and LLMOps. He builds production ML infrastructure for lower latency, better reliability, and improved cost efficiency.
  • LinkedIn: https://www.linkedin.com/in/yunmokoo
  • Website: https://yunmorning.me
  • Sessions:

- Inference performance as a competitive advantage — Day 3 — Session Day 2 2:50pm-3:10pm

Most AI teams focus on model quality, but production success often comes down to inference performance. In this session, FriendliAI will explore the optimization techniques behind high-performance LLM serving, including continuous batching, speculative decoding, smart caching, and efficient GPU utilization. Learn how leading AI teams reduce infrastructure costs, improve latency, and scale inference workloads without sacrificing performance. We'll share practical insights and deployment strategies that separate experimental AI projects from production-grade systems.Whether you're an ML engineer, platform engineer, MLOps practitioner, or technical founder, you'll leave with a better understanding of how inference optimization can become a competitive advantage for your AI applications.

Yuval Belfer

  • Role: Sr. Developer Advocate
  • Company: AI21
  • Bio: Yuval is a Senior Developer Advocate at AI21 Labs, where he helps engineers go from "it works in the demo" to "it works in production." He hosts the YAAP podcast (Yet Another AI Podcast) and teaches applied GenAI on various programs. His work spans RAG, fine-tuning, agents, and evaluation (or Yuval-uation, if you're nasty).
  • Twitter: https://x.com/yuvalinthedeep
  • LinkedIn: https://linkedin.com/in/yuval-belfer
  • Photo: /wf26/speakers/by-id/spk_yuval_belfer.jpg
  • Sessions:

- Stop Chunking Like It's 2022 — Day 2 — Session Day 1 3:20pm-3:40pm

Every RAG system bets everything on a single chunk size. 500 tokens? 800? Pick wrong, and half your queries fail before they start. But here's what nobody tells you: all the picks are wrong; there is no single chunk size that works for all queries. We ran oracle experiments across meeting transcripts, story chapters, and TV scripts. The result? Queries disagree violently on what chunk size works best - sometimes by 40 percentage points. Your "tuned" chunk size isn't a compromise; it's systematic underperformance. In this talk, we'll expose why fixed chunking fails and show you a dead-simple fix: index at multiple chunk sizes, aggregate at retrieval time using Reciprocal Rank Fusion. No retraining. No LLM overhead. Just 1-37% better recall across benchmarks by letting queries vote with their ranks instead of forcing them into one-size-fits-all boxes. Walk away knowing exactly when your chunk size is sabotaging you - and how to stop leaving 20-40% of your retrieval performance on the table.

- Two Bugs That Hid in Plain Sight: A vLLM Debugging Detective Story — Day 4 — Session Day 3 3:20pm-3:40pm

Your model generates gibberish. Once every thousand prompts. High confidence scores. No crashes. No warnings. We hit this twice while building Jamba models. First: A request gets misclassified during scheduling, loads stale state from a previous prompt cache slot, and confidently generates nonsense. Second: Logprob spikes during RL training that looked like training instability-until we noticed they tracked with rollout count, then with cache size. In this talk, we'll walk through both debugging journeys-the false starts, how we instrumented vLLM to thread request IDs through the forward pass, the search for variables that change failure structure rather than magnitude, and the lesson both share: distributed inference systems fail silently. No stack trace. No sanitizer warning. Just wrong answers with perfect confidence. You'll learn how to build comparison scripts that expose logprob divergence, force memory pressure to surface rare bugs, and shrink a distributed RL training mystery into a reproducible single-script failure. Walk away knowing how to debug vLLM when it lies to you quietly.

Yves Raimond

  • Role: SVP/GM, AI & Personalization
  • Company: Spotify
  • Bio: Senior leader for AI and personalization at Spotify.
  • Photo: /wf26/speakers/by-id/spk_yves_raimond.jpg
  • Sessions:

- Spotify LLM Recsys — Day 2 — Session Day 1 11:10am-11:30am

Zach Blumenfeld

  • Role: AI Research Engineer
  • Company: Neo4j
  • Bio: Zach Blumenfeld is an AI/ML graph specialist at Neo4j who helps engineers, data scientists, and business leaders use graph technology for analytics and intelligent applications, including GraphRAG, agentic systems, fraud detection, entity resolution, and recommendation engines.
  • LinkedIn: https://www.linkedin.com/in/zachblumenfeld/
  • Photo: /wf26/speakers/by-id/spk_zach_blumenfeld.jpg
  • Sessions:

- AI on Your Lakehouse: Context Comes in Shapes, Not Queries — Day 1 — Workshop Day 9:00am-11:00am

Your agent can reach your data but still can't use it reliably: vector search and Text2SQL each hand it a slice, but not the view to know what's truly relevant and how to connect the right info. Without that, answers come back confident but wrong, and agent decisions cannot be trusted. The problem isn't caused by a bad model or bad query, but rather a lack of context, and thinking in terms of shapes is what cracks it.

In this hands-on session, you'll learn how to build three reusable graph shapes from your lakehouse data using Neo4j, so your agent can navigate and view the right context to answer and act accurately:

  • Table of Contents (Trees) — navigate what's there
  • Themes (Communities) — surface patterns nobody named
  • Connections (Paths & Cycles) — trace how entities, documents, and records relate

Portable to BigQuery, Databricks, Snowflake, or anywhere. You'll leave with real, practical techniques and the code to run with your own data and agents.

Zach Lloyd

  • Role: Founder and CEO
  • Company: Warp
  • Bio: Zach Lloyd is the founder and CEO of Warp, a modern Agentic Development Environment born from the terminal and cloud agent orchestration platform. Previously he led engineering for Google Sheets and the broader Google Docs suite at Google. He now focuses on cloud agent infrastructure and the future of AI‑powered software engineering, helping all developers to deliver great software, quickly and reliably.
  • Twitter: https://x.com/zachlloydtweets
  • LinkedIn: https://www.linkedin.com/in/zachlloyd/
  • Photo: /wf26/speakers/by-id/spk_zach_lloyd.jpg
  • Sessions:

- Self-Improving software factories: The new open source model" — Day 2 — Session Day 1 1:55pm-2:15pm

Alt titles: Agent orchestration with message passing / Agent orchestration for every model / Warp’s approach to agent orchestration With models getting more capable, we’ve quickly scaled from single agent problems to multi-agent problems – How can agents delegate tasks to accomplish ever-larger goals? You may have heard of “agent swarms” or “agent teams” in this arena, but they come with drawbacks: model lock-in, complex UX, or both. We want to share how we’ve tackled orchestration with our model-agnostic platform, Oz. Our approach has some unique goals: - Support any model, and any harness (claude, codex, etc) - Delegate across local instances and across isolated cloud sandboxes - Provide a UX that requires zero tmux or TUI knowledge to use We’ll explore how we implemented message passing across harnesses, how we handle agent sandboxing with Docker containerization + serverless deploys, and how we designed these primitives to make a system that works with any agent. You’ll walk away with a clear outline of how to build agent orchestration well. Plus, we invite you to try our Oz orchestration platform and tell us what you think. Talk format: Primarily a tech demo and code walkthrough. We’ll show multiple examples of tasks that are best served by delegation, and show both local and cloud-based runs. We’ll also walk through the design of our message passing implementation at a high level to show how it works.

Zack Proser

  • Role: AI Engineer, Applied AI
  • Company: WorkOS
  • Bio: AI engineer working on Applied AI at WorkOS; previously led Developer Education at WorkOS and has earlier experience at Pinecone, Gruntwork, and Cloudflare.
  • Twitter: https://x.com/zackproser
  • Photo: /wf26/speakers/by-id/spk_zack_proser.jpg
  • Sessions:

- Lifestyles of the AI-Native: Voice-coding, agent skills, hooks and scheduled tasks — Day 1 — Workshop Day 4:30pm-5:30pm

Most engineers are bolting AI onto a workflow that was designed for a pre-AI world. The result is a faster version of the same grind. This talk is about the other path: rebuilding the daily practice of software engineering from the ground up, around what agents are actually good at.

Two senior practitioners from WorkOS will walk through how we actually work now as AI-native engineers — not in the aspirational sense, but the literal one. We think out loud and voice-code instead of typing our way to clarity. We package recurring expertise into agent skills so we're not re-explaining context every session. We wire up hooks that fire on the events we care about, and hand off scheduled tasks to agents that run overnight, while we're away from the keyboard, or otherwise off the clock. The throughline is intentional design: deciding what a human should hold onto and what should be delegated, then building the machinery to make that real.

Because there are two of us, you'll see more than one set of habits — where our setups converge on the same patterns, and where they diverge based on how each of us thinks and works. The pitch isn't "do more." It's that an AI-native setup, designed deliberately, buys back attention and protects you from the burnout that comes from treating agents as a turbocharger for an old loop. Attendees will leave with a concrete mental model for voice-driven development, a pattern for authoring reusable agent skills, and working examples of hooks and scheduled automations they can adapt the same week.

Zain Hasan

  • Role: Staff AI/ML Engineer - DX
  • Company: Together AI
  • Bio: AI/ML engineer and educator focused on large-scale models, tooling, and developer education.
  • Twitter: https://x.com/ZainHasan6
  • Website: https://zainhas.github.io
  • Photo: /wf26/speakers/by-id/spk_zain_hasan.jpg
  • Sessions:

- Open-Source Inference Engineering for the Agentic Era — Day 1 — Workshop Day 9:00am-11:00am

Agentic coding workloads demand long contexts, multi-turn conversations, and throughput at a scale that most inference engines weren't built for. TokenSpeed is a new open-source engine purpose-built for this regime, built collaboratively by NVIDIA DevTech, AMD Triton, Qwen Inference, Together AI, and others. In this 2-hour hands-on workshop, Together Inference Research Engineers and a TokenSpeed co-creator will cover TokenSpeed architecture, deploying your first model, optimizing for agentic workloads, kernel and hardware tuning, and throughput/latency trade-offs.

Zhengyao Jiang

  • Role: CEO & Cofounder
  • Company: Weco AI
  • Bio: Cofounder & CEO @WecoAI - automated hill climbing with LLMs. Previously: PhD in ML at UCL
  • Twitter: https://x.com/zhengyaojiang
  • LinkedIn: https://www.linkedin.com/in/zhengyao-jiang-387b44145/
  • Website: https://zhengyaojiang.github.io/
  • Photo: /wf26/speakers/by-id/spk_zhengyao_jiang.jpg
  • Sessions:

- Hands-on AutoResearch: Cracking OpenAI's Parameter Golf — Day 1 — Workshop Day 2:20pm-4:20pm

Heard about autoresearch, or tried it a few times in playground settings? This hands-on tutorial teaches you how to use autoresearch on one of the most serious challenges in ML this year: OpenAI's Parameter Golf.

The challenge: train the best language model that fits in just 16MB. We entered our autoresearch agent this past spring, and it outperformed the field of over 1,000 participants. You'll learn how we approached it, then get to do it yourself: kick off an autoresearch agent, watch it improve a tiny language model's training script, steer it when progress stalls, and visualize your results. You'll leave with a working autoresearch setup you can point at your own code.

compute kindly sponsored by Modal!

- An AI Agent Became the #1 Contributor in OpenAI's Hiring Challenge — Day 3 — Session Day 2 1:55pm-2:15pm

Earlier this year, OpenAI ran Parameter Golf, a model-training competition that doubled as a hiring filter. Over 1,000 researchers competed to train the best small language model under a 16MB cap. The top contributor was the one candidate OpenAI couldn't hire. Our autonomous research agent Aiden finished with 7 merged records, more than twice as many as any other contributor, and ended up the most-cited participant in the community.

This talk is about what those 22 days showed. I'll cover on high level how does it works and which of its ideas produced the records. But the part worth more than the leaderboard is the collaboration itself, the community and AI agent building on each other's work, the largest natural experiment in human-AI collaboration I've seen run in public. I'll close with what it tells us about where humans and autonomous research each still matter for the foreseeable future.

1:57 PM

Zixuan Li

  • Role: Head of Z.ai
  • Company: Z.ai
  • Bio: Head of Z.ai, involved with Z.ai's AI products and GLM model strategy.
  • Photo: /wf26/speakers/by-id/spk_zixuan_li.jpg
  • Sessions:

- GLM-5.2: Frontier Intelligence, Open Weights. — Day 2 — Session Day 1 9:45am-10:05am

Zubin Aysola

  • Role: Senior Software Engineer Weave
  • Company: Weights & Biases by CoreWeave
  • LinkedIn: https://www.linkedin.com/in/zubin-aysola
  • Sessions:

- ARIA, how we built autoresearch with autoresearch — Day 4 — Session Day 3 11:10am-11:30am

ARIA is an end-to-end auto research and AI research product that improves models, launches training jobs, and agents alike. We used ARIA along with a sophisticated evaluation framework we're calling the WBAF, Weights and Biases Agent Factory, to build itself. ARIA reads its own production traces, improves its own prompts, tools, skills, and other effects to solve customer challenges. In this talk, we dive into the evaluation framework, how we built a sophisticated reinforcement learning style environment over the Weights & Biases product, and how we scaled from zero to one to a full team working in parallel on improving an agent.


Tracks

39 tracks across 4 days, covering the full breadth of AI engineering.

Day 1 — Workshop Day

Full-day hands-on workshops across all tracks.

Day 2 — Session Day 1

TrackRoom
Software FactoriesKeynote
Claws & Personal AgentsTrack 1
Vision & OCRTrack 2
Search & RetrievalTrack 3
Workshops Day 2Track 4
SecurityTrack 5
Voice & Realtime AITrack 6
LLM RecsysTrack 7
Forward Deployed EngineeringTrack 8
Data QualityTrack 9
AI-Native EnterprisesLeadership 1
AI Architects: Show my WorkflowLeadership 2
CTO CircleLeadership Lounge

Day 3 — Session Day 2

TrackRoom
AutoresearchKeynote
Sandbox & Platform EngineeringTrack 1
Robotics & World ModelsTrack 2
Memory & Continual LearningTrack 3
Workshops Day 3Track 4
EvalsTrack 5
Design EngineeringTrack 6
Computer UseTrack 7
Context EngineeringTrack 8
Posttraining & MidtrainingTrack 9
AI-Native EnterprisesLeadership 1
AI Architects: TokenmaxxingLeadership 2
CTO CircleLeadership Lounge

Day 4 — Session Day 3

TrackRoom
Harness EngineeringKeynote
Generative MediaTrack 1
Agentic CommerceTrack 2
AI in FinanceTrack 3
Local AITrack 4
GraphsTrack 5
AI in GTMTrack 6
AI in HealthcareTrack 7
Agentic EngineeringTrack 8
InferenceTrack 9
AI-Native EnterprisesLeadership 1
AI Architects: AI FactoriesLeadership 2
CTO CircleLeadership Lounge

Venue

Moscone West Convention Center, San Francisco, CA

Three levels of programming:

FloorWhat
1st FloorRegistration, Expo, Food, Evening Socials
2nd FloorBreakout Rooms
3rd FloorKeynotes + VIP Rooms

Hotels

Discounted room blocks close June 6, 2026. Book early — World Cup in the US means rooms sell out fast.

HotelStatus
San Francisco Marriott MarquisAvailable
Parc 55 San FranciscoSOLD OUT
InterContinental San FranciscoAvailable

Tickets

Full refunds available up to one month before the event.

TierPriceAccess
Leadership$2,399Keynote + leadership tracks + expo + workshops
Engineering + Workshops$1,999All engineering tracks + workshops + expo
Engineering$1,499All engineering tracks + expo
Expo Explorer$299Expo hall access only

Group discounts (applied automatically):

QuantityDiscount
5+ tickets10% off
10+ tickets15% off
15+ tickets20% off
30+ ticketsEmail info@ai.engineer

Purchase: https://app.ai.engineer/e/ai-engineer-worlds-fair-2026/portal


AIE for AI Engineers

Open endpoints for building apps, agents, and tools on conference data.

Endpoints

  • llms.md (overview + schedule): https://ai.engineer/worldsfair/llms.md
  • llms-full.md (this file): https://ai.engineer/worldsfair/llms-full.md
  • Sessions JSON: https://ai.engineer/worldsfair/sessions.json
  • Speakers JSON: https://ai.engineer/worldsfair/speakers.json
  • MCP Server: https://ai.engineer/worldsfair/mcp
  • iCal Calendar: https://ai.engineer/worldsfair/calendar.ics
  • Speaker Embeddings: https://ai.engineer/worldsfair/speakers-embeddings.json
  • Session Embeddings: https://ai.engineer/worldsfair/sessions-embeddings.json
  • Schedule Page: https://ai.engineer/worldsfair/schedule

MCP Server

The Model Context Protocol server lets AI agents (Antigravity, Codex, Claude, Cursor, etc.) query conference data with tool calls.

Tools: get_conference_info, list_speakers, list_sessions, get_schedule

Config (add to your MCP client):

```json

{

"mcpServers": {

"aie-worldsfair": {

"url": "https://ai.engineer/worldsfair/mcp"

}

}

}

```

Example (curl):

```bash

curl -X POST https://ai.engineer/worldsfair/mcp \

-H "Content-Type: application/json" \

-H "Accept: application/json" \

-d '{"jsonrpc": "2.0", "id": 1, "method": "tools/call", "params": {"name": "list_speakers", "arguments": {"search": "OpenAI"}}}'

```

Embeddings

Pre-computed Gemini Embedding 2 vectors for all speakers and sessions.

128-dim via Matryoshka Representation Learning (MRL) truncation from 3072.

Useful for semantic search, clustering, and recommendations.

  • Speaker embeddings: https://ai.engineer/worldsfair/speakers-embeddings.json
  • Session embeddings: https://ai.engineer/worldsfair/sessions-embeddings.json

Agent Skills

Install the AI Engineer agent skill to give your coding agent full knowledge of the conference API:

```bash

npx skills add aidotengineer/skills

```

Works with 40+ agents including Antigravity, Devin, Claude Code, Cursor, Windsurf, and Codex.

Source: https://github.com/aiDotEngineer/skills

CLI Tool

Access conference data from your terminal (npm: @aidotengineer/aie):

```bash

npx @aidotengineer/aie --list # list all conferences

npx @aidotengineer/aie wf # World's Fair info

npx @aidotengineer/aie wf speakers # speakers (page 1, 20/page)

npx @aidotengineer/aie wf sessions --day help # list all days

npx @aidotengineer/aie wf search "agents" # search sessions + speakers

npx @aidotengineer/aie wf speakers --json # raw JSON output

```

Quick Start (curl)

```bash

curl https://ai.engineer/worldsfair/llms.md

curl https://ai.engineer/worldsfair/llms-full.md

curl https://ai.engineer/worldsfair/sessions.json | jq .

curl https://ai.engineer/worldsfair/speakers.json | jq .

curl -O https://ai.engineer/worldsfair/calendar.ics

```


Highlights

  • Startup Battlefield on July 2nd
  • 100+ expo partners throughout
  • 10 engineering tracks per day + 2 Leadership Tracks on main days
  • World Cup Quarterfinal VIP Suite (July 1, Levi's Stadium — invite only, sponsorships available)
  • No afterparties July 1 & 2 — side events encouraged

Links

Loading Overview…