Reinforcement Learning for Parameterized Quantum State Preparation: A Comparative Study
Gerhard Stenzel, Michael Kölle, Tobias Rohe, Julian Hager, Leo Sünkel, Maximilian Zorn and Claudia Linnhoff-Popien
Abstract: We extend directed quantum circuit synthesis (DQCS) with reinforcement learning from purely discrete gate selection to parameterized quantum state preparation with continuous single-qubit rotations Rx, Ry, and Rz. We compare two training regimes: a one-stage agent that jointly selects the gate type, the affected qubit(s), and the rotation angle; and a two-stage variant that first proposes a discrete circuit and subsequently optimizes the rotation angles with Adam using parameter-shift gradients. Using Gymnasium and PennyLane, we evaluate Proximal Policy Optimization (PPO) and Advantage Actor–Critic (A2C) on systems comprising two to ten qubits and on targets of increasing complexity with λ ranging from one to five. Whereas A2C does not learn effective policies in this setting, PPO succeeds under stable hyperparameters (one-stage: learning rate approximately 5 × 10⁻⁴ with a self-fidelity-error threshold of 0.01; two-stage: learning rate approximately 10⁻⁴). Both approaches reliably reconstruct computational basis states (between 83% and 99% success) and Bell states (between 61% and 77% success). However, scalability saturates for λ of approximately three to four and does not extend to ten-qubit targets even at λ = 2. The two-stage method offers only marginal accuracy gains while requiring around three times the runtime. For practicality under a fixed compute budget, we therefore recommend the one-stage PPO policy and outline avenues to improve scalability.
Proceedings of the 17th International Conference on Agents and Artificial Intelligence: ICAART (2026)
Citation:
Gerhard Stenzel, Michael Kölle, Tobias Rohe, Julian Hager, Leo Sünkel, Maximilian Zorn, Claudia Linnhoff-Popien. “Reinforcement Learning for Parameterized Quantum State Preparation: A Comparative Study”. Proceedings of the 17th International Conference on Agents and Artificial Intelligence: ICAART 2026. To appear.
Bibtex:
@inproceedings{stenzel2026parameterized,
  author    = {Stenzel, Gerhard and K{\"o}lle, Michael and Rohe, Tobias and Hager, Julian and S{\"u}nkel, Leo and Zorn, Maximilian and Linnhoff-Popien, Claudia},
  title     = {Reinforcement Learning for Parameterized Quantum State Preparation: A Comparative Study},
  booktitle = {Proceedings of the 17th International Conference on Agents and Artificial Intelligence (ICAART)},
  year      = {2026},
  note      = {To appear}
}

