Convergence of the GMR search process. (a-f) A sample search in the augmented goal-policy space ζ, plotted for six selected self-refinement iterations. Only one dimension each of ζ_O and ζ_I is depicted, although the full space is higher-dimensional. The gray dots show the initial set of iterations, and the black dots show the newly estimated policy. The blue line marks the known goal of the task with the highest reward; the red cross on it marks the best policy achieved after convergence.