A Quantum Neural Network (QNN) is a parameterized quantum circuit that is tuned by parameterized unitary operations, such as gates. QNNs can be seen as the quantum analogues of classical Neural Networks (NNs), and are typically used to provide speedups to hard computational Machine Learning (ML) problems. QNNs have also provided promising improvements to quantum chemistry and material science problems. Similarly to the classical case, the efficiency of QNN applications depends on a combination of the expressivity and the effectiveness of the training procedure, which optimizes a loss function in terms of the read-outs and the QNN parameters for specific applications. QNN training is often a non-deterministic polynomial time problem.
So far, the investigation of training QNNs has been a trial-and-error approach by empirically comparing the performance of standard classical optimizers on training QNNs’ loss functions. These studies are restricted to small system sizes due to the limited access to quantum machines of reasonable sizes and the exponential cost in simulating them classically. Most of the available results involve special cases of QNNs such as the quantum approximate optimization algorithm (QAOA) or extremely over-parameterized cases. However, one prominent result is that random initialization of parameters will lead to vanishing gradients for much smaller size QNNs than classical Neural Networks. The challenge in training QNNs mainly arises from the highly non-convex nature of the corresponding loss functions.
In this paper, the authors conducted a quantitative investigation on the landscape of loss functions for QNNs as a way to study their training issue. The work focuses on understanding the properties of local minima of loss functions such as i) the number of local minima depending on the architecture of QNNs, and ii) whether these local minima are benign or spurious ones, meaning that they are either close to the global minima or saddle points that can be escaped, or they are truly bad local minima that will hinder the training procedure. This work also provides theoretical proofs in order to find spurious local minima for under-parametrized QNNs. The authors identify a general condition of under-parameterized QNNs, in the sense that a dataset can be constructed such that the number of spurious local minima scales exponentially with the number of parameters. This conceptual paradox could be explained by interference affecting the qubits. Interference replaces the role of non-linear activation (in classical NNs) in creating bad local minima for QNNs. Therefore, training a dataset with simple variants of gradient-based methods is hard. Based on the proposed construction, training is almost optimal in terms of the dependence of the number of local minima on the number of parameters, since an almost matching upper bound is calculated. This upper bound also demonstrates a clear separation between QNNs and classical NNs. The authors find an upper bound 2p for the number of local minima for a QNN circuit with p parameters.
Alongside theoretical results, the authors investigate the practical performance of the common optimizers on the construction with p-parameters involving p-qubits. Unlike theoretical results, the experimental data suggests that the existence of spurious local minima can affect the optimization process, which increases the difficulty in training. The experiments being implemented are small enough that were run on an Intel Core i7-7700HQ Processor (2.80GHz) with 16G memory and the training is simulated with PyTorch.
The considered QNN instances are trained with three popular optimizers: Adam, RMSProp, and L-BFGS. The first two methods are variants of vanilla gradient descent with adaptive learning rate while the last method involves implementation of the approximate Newton method that utilizes second-order gradient information. The results show that for all instances and optimizers, under random initialization, the optimizations converge to local minima with non-negligible suboptimality (i.e., different from the global one by a non-negligible amount) with high probability. Also, as the number of bad local minima grows exponentially with the number of parameters in the construction, the success probability should also decay exponentially.
The overall results demonstrate the practical difficulty in training under-parameterized QNNs with gradient-based methods highlighting that gradient-based black-box methods may not potentially solve QNNs efficiently. A potential scope of work in the future could be designing a QNN architecture with a benign landscape when the data distribution is known. Based on the known fact that the landscape for optimizing a variational quantum ansatz can be benign when sufficiently parameterized, it could be beneficial to explore the change in the landscape with the increasing number of parameters, provided that the system size is fixed. Another direction of interest would be to consider alternative loss functions, such as those constructed from sums of QNN outputs with same architectures but different settings (a common setting in solving differential equations with QNNs for example). Furthermore, exploring algorithms beyond gradient-based methods that can solve the optimization problem efficiently and provably with minimum losses may lead to great improvements.