We thank the reviewers for their thorough and helpful comments. We are glad that all of them found our work exciting and agree that the problem we address is important. In particular, we appreciate the following comment from one of the reviewers, which matches our vision of enabling systematic stimuli generation: "These results are strong enough to make it worth investigating GOLI more, to determine how well it generalizes in practice. Publishing the paper will further incentivize these valuable investigations." While it is not feasible to demonstrate GOLI on all stimuli-generation tasks, these tasks share common requirements, such as ensuring that the stimuli are grammatical and meaningful and constraining where certain words appear in the stimuli, all of which GOLI can successfully encode and solve for.

------------------------------------------------------------

fMRI experiment: The coarse-grained language vs. non-language contrast in the cited work is used only to functionally _identify_ the language network/system (LS) in an individual, because many brain regions involved in higher-level cognition vary in their anatomical locations across individuals (Vázquez-Rodríguez et al., 2019). Once the LS is identified, studies have shown that the system's responses vary with different stimuli (Heilbron et al., 2022). However, what characterizes a "high-response" or "low-response" stimulus is currently unknown. To test possible hypotheses, we wish to generate sentences that are high-response _by construction_. The goal of the fMRI experiment is thus to find sentence stimuli that produce strong (high or low) activity in the LS. We show how GOLI can solve this previously unsolved problem.

----------

Symbolic constraints: Yes, it is possible to encode symbolic constraints of the nature you describe. See https://raw.githubusercontent.com/anonmyous-author/GOLI/main/goli_example.txt for a detailed example.

----------

Use of English: GOLI can be made to generate stimuli in any language by choosing an appropriate language model and mapping model. We will update Section 7 with this note.

------- R2 -------

Related work: 1. HotFlip and related methods use gradients to solve only the site-perturbation problem, not the site-selection problem. Our contribution is in recognizing the existence of both sub-problems and in showing how solving both makes GOLI effective (a minimal illustrative sketch of this two-step loop appears below, after the Prompting item). See https://raw.githubusercontent.com/anonmyous-author/GOLI/main/goli_vs_hotflip.txt for a use-case which GOLI's formulation can encode but HotFlip's cannot. 2. HotFlip does not address other components that can be added to the loss function, such as those discussed in Eq 6 of the paper. Being able to accommodate such losses is essential for generating "meaningful" stimuli.

----------

Std. deviations: This detail is indeed missing from the paper; we will fix this. The experiment was simply run 10 times with random seeds, similar to the setup in NeuroCF. Each run re-shuffles the training/validation split used to fine-tune the RoBERTa-based classifier on the augmented datasets (original + counterfactuals generated by GOLI or NeuroCF). We report the standard deviation (not standard error) of accuracy across the 10 runs.

----------

Model bias: Yes, we fully agree that the models themselves can be biased. We discuss this in detail in Section 7.

----------

Prompting: Great idea. We will attempt to get this into the final camera-ready.
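----------

As a concrete illustration of the two sub-problems discussed under Related work above, here is a minimal sketch of the site-selection + site-perturbation loop. This is not the paper's implementation: a randomly initialized toy embedding and classifier stand in for the actual models, and all names and sizes are illustrative.

```python
# Toy sketch of GOLI's two sub-problems (illustrative names, stub models).
import torch
import torch.nn.functional as F

vocab_size, emb_dim, seq_len = 100, 16, 6
torch.manual_seed(0)
embedding = torch.nn.Embedding(vocab_size, emb_dim)
classifier = torch.nn.Linear(emb_dim * seq_len, 2)  # stub property predictor

token_ids = torch.randint(vocab_size, (seq_len,))   # arbitrary seed sentence
target = torch.tensor([1])                          # desired property label

for step in range(5):
    one_hot = F.one_hot(token_ids, vocab_size).float().requires_grad_(True)
    embs = one_hot @ embedding.weight               # (seq_len, emb_dim)
    loss = F.cross_entropy(classifier(embs.reshape(1, -1)), target)
    loss.backward()
    grad = one_hot.grad                             # (seq_len, vocab_size)

    # Site selection: pick the position whose best single-token flip promises
    # the largest first-order decrease in the loss.
    cur = grad[torch.arange(seq_len), token_ids].unsqueeze(1)
    site = (cur - grad).max(dim=1).values.argmax()

    # Site perturbation: at that position, pick the replacement token with the
    # best first-order estimate (this inner step is what HotFlip solves).
    token_ids[site] = grad[site].argmin()
    print(f"step {step}: loss={loss.item():.3f}, flipped site {site.item()}")
```

HotFlip-style approaches perform only the inner (perturbation) step; treating site selection as its own sub-problem is what lets GOLI's formulation encode constraints on _where_ modifications may occur.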
----------

Experiment 2: Yes, we agree that the best way to validate the generated stimuli's properties is to test them on a new set of subjects in a follow-up fMRI experiment. We did not run this follow-up because fMRI studies are prohibitively expensive and time-consuming. However, we are very interested in this question; we will run these experiments, analyze them in depth, and prepare a separate manuscript describing insights from the study. In this work, we wanted to focus on the algorithm used to generate such stimuli, and hence to demonstrate that the stimuli meet the required goals (eliciting high or low predictions).

------- R3 -------

Question A: GOLI needs just a few ingredients: (i) a loss function such as cross-entropy or MSE; (ii) a trained classifier that predicts sentiment, part of speech, or any other property relevant to the task; (iii) a vocabulary of words the experimenter wants the generated stimuli to contain; and (iv) other loss objectives that impose felicity and adherence to grammar (similar to Eq 6 in the paper). With these, GOLI can generate meaningful stimuli for a variety of use-cases (a sketch of how the ingredients compose appears at the end of this response). See https://raw.githubusercontent.com/anonmyous-author/GOLI/main/goli_example.txt, where we explain in detail how a common psycholinguistic task can be encoded in GOLI.

----------

Question B: We are not sure what the reviewer's argument is against the evidence we show in the two cases, and we believe there may have been a misunderstanding of the general approach we propose. In both cases, we show how GOLI can produce sentence stimuli that satisfy non-trivial goals. The seed sentences used to prime GOLI are irrelevant.

----------

Question C: No! There may be a misunderstanding here. We _do not_ need stimuli to exist beforehand. We can start with *any* random sentence, even a meaningless string of random words. Starting with an arbitrary string such as S = "hat is for there" is sufficient for the gradient-based method to find modifications that satisfy the constraints imposed by the experimenter, turning the string into something entirely different and meaningful, such as G = "I placed the book there". Notice that the resulting string G has more words than the random seed string S: GOLI's formulation also allows inserting new words, in addition to modifying existing ones.
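----------

To make the ingredients from Question A concrete, here is a hedged sketch of how they can compose into a single differentiable objective. All component names, weights, and stub models below are assumptions for exposition, not the paper's exact objective (which is Eq 6).

```python
# Illustrative composition of the Question A ingredients (stub models,
# assumed weights); gradients of the total loss drive the edit loop above.
import torch
import torch.nn.functional as F

vocab_size, emb_dim, seq_len = 100, 16, 8
torch.manual_seed(0)
embedding = torch.nn.Embedding(vocab_size, emb_dim)
classifier = torch.nn.Linear(emb_dim, 2)        # (ii) property predictor (stub)
lm_head = torch.nn.Linear(emb_dim, vocab_size)  # stand-in for an LM fluency scorer
allowed = torch.zeros(vocab_size, dtype=torch.bool)
allowed[:50] = True                             # (iii) experimenter's vocabulary

# Per Question C, the seed can be arbitrary; appending extra padding slots to
# token_ids would additionally let the optimizer insert new words.
token_ids = torch.randint(vocab_size, (seq_len,))
target = torch.tensor([1])

one_hot = F.one_hot(token_ids, vocab_size).float().requires_grad_(True)
embs = one_hot @ embedding.weight

task_loss = F.cross_entropy(classifier(embs.mean(0, keepdim=True)), target)  # (i)+(ii)
fluency_loss = F.cross_entropy(lm_head(embs[:-1]), token_ids[1:])            # (iv) grammar proxy
vocab_loss = (one_hot * (~allowed).float()).sum()                            # (iii) stay in-vocabulary

total = task_loss + 0.5 * fluency_loss + 0.1 * vocab_loss  # weights are illustrative
total.backward()  # one_hot.grad now drives site selection and perturbation
```

The point is only that each experimenter requirement enters as one more differentiable term, so the same gradient machinery handles all of them jointly.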