Hands‑On Lab: Using Quantum Circuits to Improve Agentic Decision Models

2026-03-07

A hands-on lab showing how tiny parameterized quantum circuits (Cirq + PennyLane) can be embedded into agent utility functions to shape exploration.

Hook: Why agentic decision models need new stochastic primitives

Developers and IT teams building hybrid AI systems in 2026 face a familiar friction: classical agents (from bandits to policy-gradient learners) often need better, lightweight ways to inject structured randomness into utility calculations so exploration is principled and diverse. Quantum devices are no longer an esoteric curiosity — small parameterized quantum circuits (PQCs) running on simulators or cloud QPUs can act as compact stochastic components that generate correlated, tunable noise patterns. In this lab you'll build, run, and evaluate a hybrid agent that embeds a PQC into its utility function, then compare agent behavior across classical and quantum-driven stochastic models.

The idea in one sentence

Use small parameterized quantum circuits as stochastic utility generators inside an agent's decision function to produce structured exploration behaviors — implemented here with Cirq and PennyLane and evaluated on a multi-armed bandit problem.

Why this matters in 2026

  • QPU access and noise-aware simulators: Cloud quantum backends matured by late 2025 — providers offer microscale QPUs and better noise models, which makes realistic experiments practical for prototyping.
  • QML tooling integration: Libraries like PennyLane and Cirq provide stable interop layers, letting you embed PQCs into differentiable pipelines or sample from hardware effortlessly.
  • Smaller, smarter projects win: Teams prefer targeted, fast-iteration experiments rather than large, risky efforts; inserting a compact PQC as a stochastic primitive is low-risk but potentially high-reward.

Lab overview — what you'll build

  1. Implement a classical multi-armed bandit agent baseline (epsilon-greedy and Thompson-style sampled noise).
  2. Implement a quantum-stochastic bandit: each arm's utility is perturbed by a sample from a small PQC.
  3. Compare performance (regret, exploration entropy, compute cost) across simulators and a cloud QPU if available.
  4. Run experiments and interpret results, including suggestions to tune circuits and integrate into RL policies.

Prerequisites

  • Python 3.10+
  • Pip-installed packages: cirq, pennylane, pennylane-cirq, numpy, matplotlib
  • Optional: cloud QPU credentials (e.g., IonQ/Quantinuum/Google) for hardware runs

Step 0 — Environment quick-start

Install required packages (local simulator):

pip install cirq pennylane pennylane-cirq numpy matplotlib

In the examples below we default to PennyLane's default.qubit simulator and Cirq for explicit circuit construction; swap in a cloud device via PennyLane plugins for hardware runs.

Step 1 — Minimal classical bandit baseline

We'll use a stationary multi-armed bandit with fixed Bernoulli rewards for each arm. The baseline agent will be epsilon-greedy.

import numpy as np

class EpsilonGreedy:
    def __init__(self, k, eps=0.1):
        self.k = k
        self.eps = eps
        self.counts = np.zeros(k)
        self.values = np.zeros(k)

    def select(self):
        if np.random.rand() < self.eps:
            return np.random.randint(self.k)
        return int(np.argmax(self.values))

    def update(self, arm, reward):
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n

Step 2 — Quantum-stochastic utility sampler concept

We want a modular randomizer: given a classical utility score u_c for an arm, the sampler returns u_c + noise, where noise is drawn from a PQC parameterized by a small vector θ and optionally by a classical seed. There are two patterns:

  • Measurement-sample pattern: Sample a bitstring from the PQC and map it to a numeric perturbation.
  • Expectation-value pattern (differentiable): Use expectation values (e.g., Z) as a bounded stochastic signal that can be used in differentiable pipelines.

Design choices

  • Keep circuits tiny (1–3 qubits) to ensure cheap runs and clear interpretability.
  • Use parameterized rotations to control the distribution shape (θ acts like a latent seed).
  • For experiments, compare classical Gaussian noise, measurement-sampled PQC noise, and expectation-based PQC noise.
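For the classical-Gaussian comparison point, a sampler with the same `theta -> scalar` interface as the PQC samplers is a one-liner wrapped in a factory. A minimal sketch (`make_gaussian_sampler` is our own helper, not part of any library):

```python
import numpy as np

def make_gaussian_sampler(scale=0.25, seed=0):
    # Classical baseline: ignores theta and returns zero-mean Gaussian noise
    # scaled to roughly match the [-0.5, 0.5] range of the PQC samplers
    rng = np.random.default_rng(seed)
    def sampler(theta):
        return float(rng.normal(0.0, scale))
    return sampler

sampler = make_gaussian_sampler()
draws = [sampler(None) for _ in range(2000)]
```

Because it takes the same `theta` argument (and ignores it), this drops straight into the agent wiring used later in the lab, making the classical-vs-quantum comparison a one-line swap.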

Step 3 — Implementing PQC samplers (Cirq + PennyLane)

Below are two implementations: one returns a scalar from bitstring samples, the other returns an expectation value.

3a: Measurement-sample PQC (bitstring mapping)

import cirq
import numpy as np
import pennylane as qml
from pennylane import numpy as pnp

# Explicit Cirq construction, for reference; the PennyLane QNode below mirrors it
qubits = [cirq.LineQubit(i) for i in range(2)]

def build_cirq_circuit(theta):
    c = cirq.Circuit()
    c.append(cirq.rx(theta[0])(qubits[0]))
    c.append(cirq.ry(theta[1])(qubits[1]))
    c.append(cirq.CNOT(qubits[0], qubits[1]))
    c.append(cirq.rz(theta[2])(qubits[1]))
    return c

# PennyLane device wrapping the Cirq simulator (shots controls sampling)
dev = qml.device("cirq.simulator", wires=2, shots=100)

@qml.qnode(dev)
def sample_bitstring(theta):
    # Mirror the Cirq gates above using PennyLane ops
    qml.RX(theta[0], wires=0)
    qml.RY(theta[1], wires=1)
    qml.CNOT(wires=[0, 1])
    qml.RZ(theta[2], wires=1)
    return qml.sample(qml.PauliZ(0)), qml.sample(qml.PauliZ(1))

def bitstring_to_noise(samples):
    # Map measured +1/-1 eigenvalues to a scalar noise in [-0.5, 0.5]
    # by averaging each wire's samples and rescaling
    s0 = np.mean(samples[0])
    s1 = np.mean(samples[1])
    noise = 0.25 * (s0 + s1)  # tunable scaling
    return float(noise)

# Example usage
theta = pnp.array([0.3, 1.2, 0.5])
samples = sample_bitstring(theta)
noise = bitstring_to_noise(samples)
print('sample noise', noise)

3b: Expectation-based PQC (differentiable)

dev2 = qml.device("default.qubit", wires=1)

@qml.qnode(dev2)
def expectation_noise(theta):
    qml.RY(theta[0], wires=0)
    qml.RZ(theta[1], wires=0)
    return qml.expval(qml.PauliZ(0))

# Expectation in [-1, 1]; rescale to use as noise. Note: on an analytic
# simulator this value is deterministic for fixed theta — stochasticity
# comes from varying theta (or from finite shots on hardware).
theta_e = pnp.array([0.4, 0.2])
exp = expectation_noise(theta_e)
noise_e = 0.5 * exp  # scale
print('expectation noise', noise_e)
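For this one-qubit circuit the rescaled expectation has a closed form: RZ commutes with the Z measurement, so ⟨Z⟩ = cos(θ₀). That makes it easy to sanity-check the sampler without any simulator — a minimal numpy check (the function name is ours, for illustration):

```python
import numpy as np

def expectation_noise_analytic(theta):
    # RY(t0)|0> gives <Z> = cos(t0); the trailing RZ(t1) only adds a phase
    # and leaves <Z> unchanged, so the rescaled noise is 0.5 * cos(t0)
    return 0.5 * np.cos(theta[0])

print(expectation_noise_analytic([0.4, 0.2]))  # matches 0.5 * <Z> from the QNode
```

Comparing this closed form against the QNode output is a cheap regression test before moving to shot-based or hardware runs.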

Step 4 — Quantum bandit agent

Wrap the PQC samplers into a bandit agent that perturbs each arm's estimated value before selection.

class QuantumStochasticAgent:
    def __init__(self, k, sampler, theta_factory):
        self.k = k
        self.sampler = sampler
        self.theta_factory = theta_factory
        self.counts = np.zeros(k)
        self.values = np.zeros(k)

    def select(self):
        # For each arm compute a perturbed score
        scores = np.zeros(self.k)
        for a in range(self.k):
            theta = self.theta_factory(a)
            noise = self.sampler(theta)
            scores[a] = self.values[a] + noise
        return int(np.argmax(scores))

    def update(self, arm, reward):
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n

Where sampler is one of the PQC samplers above (wrapped to return scalar noise), and theta_factory maps an arm index to a parameter vector (could be learned or randomized).

Step 5 — Run experiments

We'll compare three agents on a 4-armed Bernoulli bandit: (1) classical epsilon-greedy; (2) quantum-measurement sampler agent; (3) expectation-based PQC agent. Track cumulative regret and exploration entropy.

def run_experiment(true_probs, agent, steps=1000):
    k = len(true_probs)
    cum_reward = 0.0
    regrets = []
    for t in range(steps):
        arm = agent.select()
        reward = np.random.rand() < true_probs[arm]
        agent.update(arm, float(reward))
        cum_reward += reward
        best = max(true_probs)
        regrets.append(best * (t+1) - cum_reward)
    return regrets

# Setup
true_probs = [0.2, 0.5, 0.6, 0.4]

# Classical baseline
baseline = EpsilonGreedy(k=4, eps=0.1)

# Quantum measurement-based agent wrapper
def measurement_sampler(theta):
    samples = sample_bitstring(theta)
    return bitstring_to_noise(samples)

q_agent = QuantumStochasticAgent(k=4, sampler=measurement_sampler,
                                 theta_factory=lambda a: pnp.array([0.2 + 0.1*a, 0.5, 0.1*a]))

# Expectation-based agent
def expectation_sampler(theta):
    return float(0.5 * expectation_noise(theta))

q_agent_e = QuantumStochasticAgent(k=4, sampler=expectation_sampler,
                                   theta_factory=lambda a: pnp.array([0.3 + 0.05*a, 0.2]))

# Run
reg_baseline = run_experiment(true_probs, baseline, steps=1500)
reg_q = run_experiment(true_probs, q_agent, steps=1500)
reg_qe = run_experiment(true_probs, q_agent_e, steps=1500)

# Visualization (left as an exercise to plot with matplotlib)
print('Final regrets:', reg_baseline[-1], reg_q[-1], reg_qe[-1])
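For the plotting exercise, a minimal matplotlib sketch is enough — here with dummy regret curves standing in for the experiment output, and the Agg backend so it runs headless:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts/CI
import matplotlib.pyplot as plt

def plot_regrets(curves, labels, path="regret.png"):
    # curves: per-step cumulative-regret lists, as returned by run_experiment
    for regrets, label in zip(curves, labels):
        plt.plot(regrets, label=label)
    plt.xlabel("step")
    plt.ylabel("cumulative regret")
    plt.legend()
    plt.savefig(path)
    plt.close()
    return path

# illustrative dummy curves in place of reg_baseline / reg_q / reg_qe
t = np.arange(1500)
out = plot_regrets([0.05 * t, 0.04 * t], ["epsilon-greedy", "PQC sampler"])
```

Swap the dummy curves for `reg_baseline`, `reg_q`, and `reg_qe` from the run above.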

Step 6 — Interpreting results and tuning

  • Regret reduction: Look for lower cumulative regret for quantum agents when PQC noise encourages balanced exploration among near-optimal arms. A PQC that produces correlated noise across arms can both help and hamper exploration — tune circuit entanglement accordingly.
  • Exploration entropy: Measure Shannon entropy of arm selection over time. Expect PQC samplers that are near Haar-random (with parameters sampled) to yield higher early entropy.
  • Compute trade-offs: Tiny circuits (1–2 qubits) add marginal time cost on simulators; hardware runs add queue and noise overhead — use simulators for iteration and short hardware bursts for validity checks.
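The entropy metric from the second bullet can be computed directly from the selection log — a small helper, assuming `arms` is the list of chosen arm indices:

```python
import numpy as np

def selection_entropy(arms, k):
    # Shannon entropy (in bits) of the empirical arm-selection distribution
    counts = np.bincount(arms, minlength=k).astype(float)
    p = counts / counts.sum()
    p = p[p > 0]  # drop unvisited arms; lim p->0 of p*log(p) is 0
    return float(-(p * np.log2(p)).sum())

# uniform exploration over 4 arms -> maximal entropy log2(4) = 2 bits
print(selection_entropy([0, 1, 2, 3] * 50, 4))  # -> 2.0
```

Computing this over a sliding window of recent selections shows how exploration decays as the agent commits to good arms.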

Advanced strategies and research directions (2026)

By 2026 the community is exploring several promising extensions:

  • Conditional PQCs: Condition the PQC parameters on context vectors (tabular features) and integrate into contextual bandits and policy networks.
  • Differentiable integration with policy gradients: Use expectation-value PQCs so the stochastic component is backprop-able and tune θ via gradient descent jointly with policy parameters.
  • Noise-aware design: Use noise models derived from backends to purposefully design circuits that exploit hardware noise for richer stochasticity (late-2025 research showed this can beat naïve simulators in some tasks).
  • Hybrid ensembles: Combine PQC samplers with classical pseudo-random generators, switching based on computational budget or risk sensitivity.

Practical tips for production prototyping

  1. Start small: Use 1–2 qubit circuits to reduce iteration time. Small circuits are easier to reason about and often sufficient to produce richer stochasticity than standard Gaussian noise.
  2. Simulate then hardware-validate: Iterate on simulators (PennyLane default.qubit, Cirq simulator) and then run micro-batches on QPU to validate distributional differences.
  3. Instrument carefully: Log selection entropy, per-arm empirical distribution, and latency. Quantum samplers add variability — quantify it.
  4. Version circuit parameters: Treat θ as configuration. Try A/B testing where θ is fixed vs. learned to understand impact on agent behavior.
  5. Cost control: For cloud runs use small shot numbers for sampling experiments to control billable time; use expectation-value mode for gradient-based training to reduce shot needs.

Rule of thumb: Treat PQCs as domain-specific stochastic primitives — like a new activation function — and evaluate them first for behavior diversity, then for performance gains.

Example extensions — from bandits to RL policies

You can embed the PQC sampler inside more complex agents:

  • Contextual bandits: Make θ a function of context embedding to produce context-dependent exploration.
  • Actor-Critic: Use expectation-based PQC output as an additive exploration term in the actor's logits before softmax, enabling differentiable training.
  • Meta-learning: Learn a θ meta-parameter that adapts across tasks to tune exploration style.
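The Actor-Critic idea can be sketched in a few lines of numpy — here `noise_fn` is a hypothetical stand-in for an expectation-based PQC sampler (we use a cosine, matching the one-qubit circuit's analytic form), added to the logits before softmax:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def perturbed_policy(logits, noise_fn, thetas, beta=1.0):
    # Add a per-action exploration term (any callable theta -> scalar,
    # e.g. a PQC expectation value) to the actor's logits
    noise = np.array([noise_fn(th) for th in thetas])
    return softmax(np.asarray(logits) + beta * noise)

# stand-in for the expectation-based PQC sampler: 0.5 * cos(theta0)
probs = perturbed_policy([1.0, 0.5, 0.2],
                         noise_fn=lambda th: 0.5 * np.cos(th[0]),
                         thetas=[[0.3], [0.6], [0.9]])
```

Because the expectation-value pattern is differentiable, gradients can flow through `noise_fn` into θ during policy-gradient training.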

Common pitfalls

  • Overfitting circuit parameters: If you let θ be fully learned on a single environment without regularization, the PQC may collapse to deterministic behavior, losing its exploration benefits.
  • Mis-scaling noise: Always check the scale of PQC outputs relative to your utility values. Rescale expectation values to match reward magnitudes.
  • Hardware noise confusion: Differences between simulator and hardware runs may be due to noise, not behavior change — use noise models or calibration data to reason about this.

Experiment ideas to try next

  1. Compare entangled vs. separable circuits: Does entanglement help coordinated exploration across correlated arms?
  2. Use PQC-driven Thompson sampling: use a PQC to sample a posterior-like perturbation for each arm.
  3. Measure long-tail behavior: Do PQC agents escape local optima more often on non-stationary bandits?
  4. Hybrid LLM-agent integration: Let a small LLM propose θ candidates and run the PQC sampler to diversify agent actions — a 2026 trend is combining small LLMs with quantum primitives for richer decision heuristics.

Final notes on reproducibility and ethics

Keep experiments reproducible by seeding classical RNGs and logging PQC parameter vectors. Quantum randomness can produce subtle biases — audit agent outcomes across populations and scenarios. The trend in 2026 emphasizes responsible hybrid systems engineering: small novel primitives are powerful but must be validated.

Actionable takeaways

  • Embed tiny PQCs as stochastic components to enrich agent exploration with low engineering overhead.
  • Use expectation-based PQCs for differentiable integration; use measurement sampling to get discrete, richer noise patterns.
  • Start on simulators, validate on hardware, log metrics: regret, entropy, latency.
  • Iterate with small, focused experiments — 2026 tooling and cloud options make this practical for teams of all sizes.

Call to action

Ready to prototype? Clone the lab notebook (starter code snippets above) into your environment, run the three agents, and report whether the PQC-driven agent changes exploration on your problem. If you want a guided workshop or help integrating PQC samplers into your RL stack, reach out — we run hands-on sessions that pair classical agent design with quantum primitives.
