**A huge thanks goes out to Konrad for a fun debate and collaborative look at this simulation on Twitter (see the threads __here__ and __here__ for more of Konrad's comments and arguments).**

Recently, Xuelong Zhao and Konrad Kording posted a report to bioRxiv [1] that used a cross-validation (CV) approach to analyze single-trial dynamics in LIP during decision making. The authors argue that the simplicity of CV makes it a better way to compare models than more complicated approaches like the Deviance Information Criterion (DIC) computed with Markov chain Monte Carlo (MCMC) methods, which we used in our paper [2].

To briefly summarize, Zhao & Kording propose randomly splitting each single-trial spike train into a training set (90% of the trial's bins) and a test set (the remaining 10%). They then fit a parameter (the slope of a ramp or the time of a step) to the training-set spikes from each trial via maximum likelihood, and use this fitted rate function to predict the test portion of the spike train. The model with the best predictive log likelihood, summed over all 10 partitions of the training and test sets and over all trials, is declared the winner. We refer to this method of splitting a single trial’s spike train into training and test portions as *within-trial CV*.
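As a concrete illustration, here is a minimal sketch of the within-trial CV procedure for the step model. This is our own reconstruction in Python, not Zhao & Kording's code; for simplicity it assumes Poisson spike counts and known pre/post-step rates, so only the step time is fit:

```python
import numpy as np

rng = np.random.default_rng(0)
DT = 0.01  # 10 ms bins

def step_rate(T, s, r0=40.0, r1=80.0):
    """Firing rate (sp/s) that steps from r0 to r1 at bin s."""
    return np.where(np.arange(T) < s, r0, r1)

def poisson_ll(y, rate, bins):
    """Poisson log likelihood of counts y over the given bins (log y! dropped)."""
    lam = rate[bins] * DT
    return np.sum(y[bins] * np.log(lam) - lam)

def within_trial_cv_score(y, test_frac=0.1):
    """Fit the step time by maximum likelihood on ~90% of a trial's bins,
    then score the held-out ~10% with the fitted rate function."""
    T = len(y)
    test = rng.choice(T, size=max(1, int(test_frac * T)), replace=False)
    train = np.setdiff1d(np.arange(T), test)
    # ML estimate of the step time from the training bins only
    s_hat = max(range(T + 1), key=lambda s: poisson_ll(y, step_rate(T, s), train))
    return poisson_ll(y, step_rate(T, s_hat), test)

# score one simulated stepping trial
y = rng.poisson(step_rate(50, 25) * DT)
score = within_trial_cv_score(y)
```

Note that the training and test bins come from the same trial, so the fitted step time is not independent of the held-out counts, which is the issue discussed below.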

Cross-validation in time series models is tricky, and not as trivial as this bioRxiv piece implies (see [3] for an overview of cross-validation methodology). As was pointed out on Twitter, splitting single trials in this way produces *dependent* training and test sets, violating a principal assumption of standard CV: that training and test data are i.i.d. The authors have suggested that this doesn’t matter because they only care about the relative ability of the ramp/step models to fit the data. However, that assertion will only hold in the unlikely event that the prediction biases from each model are equal and cancel out. Because step and bound hit times are estimated for each trial independently, the prediction biases may not average away as the number of trials increases, leaving us with an inconsistent estimator of the difference in predictive log likelihoods. We therefore decided to run some tests on simulated data to see if this was the case. A minimal requirement of model comparison methods is that they should work reliably on simulations, and we tested this extensively in our information criterion approach [2].

We simulated 100 sets of synthetic spike trains from a stepping model (simulation code is available __here__). Each dataset contained 1000 simulated trials (far more than we expect to get for a real cell). Each spike train was 50 bins long, with an initial firing rate of 40 sp/s and a rate of 80 sp/s after the step (assuming 10 ms bins). The step times were distributed uniformly on the interval [1, 50]. We find that Zhao & Kording's method of splitting the data within single trials and finding a maximum likelihood fit of the step (or bound hit) time on each trial misidentified 86/100 of the simulations as coming from a linear ramp-to-bound model instead of the true stepping model. This is an extremely high failure rate, especially considering the size of the simulated datasets. The results are shown in the left panel of the figure below.
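The simulation setup above can be reproduced in a few lines. This is a simplified Python sketch of the same generative process (the full simulation code is linked above):

```python
import numpy as np

rng = np.random.default_rng(1)
DT = 0.01            # 10 ms bins
T = 50               # bins per trial
R0, R1 = 40.0, 80.0  # firing rate before/after the step (sp/s)

def simulate_step_dataset(n_trials=1000):
    """Poisson spike counts from the stepping model; step times are
    drawn uniformly on [1, T], one per trial."""
    steps = rng.integers(1, T + 1, size=n_trials)
    # each trial's rate is R0 before its step time and R1 after
    rates = np.where(np.arange(T)[None, :] < steps[:, None], R0, R1)
    return rng.poisson(rates * DT), steps

counts, steps = simulate_step_dataset()
```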

Each point shows the mean predictive LL difference between the ramp and step model for a given simulation and the bars show the standard error of the 10-fold CV. Positive values mean the method classified the simulation as ramping. Ramp simulations were always identified as ramps (right panel).

We suggest that CV can still be applied to these models, but it requires splitting training and test sets *across trials*. In such an approach, 90% of the trials (the complete spike train of each trial) are used to train the model parameters. The models predict the rate on the remaining 10% of trials by marginalizing out the step (or bound hit) times for those trials. The predictive log likelihood for a held-out trial $y_{\text{test}}$ of length $T$, given spike rate parameters $\theta$ and assuming a uniform distribution over the step time $s$, becomes

$$\log p(y_{\text{test}} \mid \theta) = \log\left[\frac{1}{T}\sum_{s=1}^{T} p(y_{\text{test}} \mid \theta, s)\right].$$

We found that the across-trial CV procedure identified the correct model in all 100 step simulations (below). This shows that the step structure is obvious in the simulated spike trains, but within-trial CV misses it.
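In code, the marginalized predictive log likelihood for a held-out trial might look like the following. This is a sketch assuming a Poisson observation model; `r0` and `r1` stand in for the rate parameters θ that would be fit on the training trials:

```python
import numpy as np

DT = 0.01  # 10 ms bins

def step_predictive_ll(y_test, r0, r1):
    """log p(y_test | theta): Poisson log likelihood of a held-out trial,
    marginalizing the step time s over a uniform prior on [1, T]."""
    T = len(y_test)
    ll_s = np.empty(T)
    for s in range(1, T + 1):
        lam = np.where(np.arange(T) < s, r0, r1) * DT
        ll_s[s - 1] = np.sum(y_test * np.log(lam) - lam)  # log y! dropped
    # log[(1/T) * sum_s exp(ll_s)], computed with the log-sum-exp trick
    m = ll_s.max()
    return m + np.log(np.exp(ll_s - m).sum()) - np.log(T)
```

Because the step time is integrated out rather than fit to the held-out trial, no test data touches the training procedure.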

Zhao & Kording also considered models with an additional autoregressive spike history term on top of the ramp/step dynamics. We suspect that including spike history in the within-trial CV method would exacerbate the problem: because the history covariates for a given bin depend on the spike counts in the preceding bins, fitting the ML estimate of the spike history filter on the training bins likely requires handing the held-out test-set counts to the training procedure.

We acknowledge Zhao & Kording’s concern about the "possibility of coding bugs in large academic software." The key is validating your analysis, whether it seems complex or not. We were stunned to find no discussion of the risks and challenges of CV in time series in this bioRxiv preprint. In fact, Konrad discussed in [4] the perils of slicing test and training sets within correlated observation sets, quite similar to the criticisms we are laying out here. We think cross-validation is a wonderful tool, but it cannot be applied in any arbitrary fashion.

One final point we’d like to highlight is that the simplicity achieved by Zhao & Kording’s approach does not actually come from CV, as stated in their abstract. Simplicity ought to be attributed to the choice of a GLM for the ramp model and fitting bound hit times with maximum likelihood via coordinate ascent, rather than fitting a diffusion-to-bound model to spike data as in [2]. Fitting and comparing diffusion-to-bound models with CV would come with all the same computational baggage and take approximately K=10 times longer than an information criterion approach. Also, complete and appropriate across-trial CV with the ramp GLM would require more complex fitting using a method like expectation-maximization, MCMC, or variational Bayes. Accomplishing accurate CV for this ramp-step comparison surely requires more than a few simple lines of code!

Here, we only address the statistical methodology of Zhao & Kording, and not the specific application of modeling to the step-vs-ramp debate in LIP. We think tests including constant fluctuations in rate are reasonable, as we mentioned in [5] (though baseline firing rate fluctuations that are correlated in time would create complications even for across-trial CV). Because the model comparison metric they propose exhibits such high bias on idealized synthetic data, the results presented in [1] unfortunately can't tell us much about LIP dynamics during perceptual decisions. On a more positive note, we hope this shows the potential usefulness of public peer review. We would, of course, be delighted to see more modeling efforts that can teach us more about the diversity of LIP single-trial dynamics than our one-size-fits-all cartoon stepping model. And it would be great if those new approaches were simple.

[1] Zhao, X., & Kording, K. P. (2018). Rate fluctuations not steps dominate LIP activity during decision-making. bioRxiv. doi: __https://doi.org/10.1101/249672__

[2] Latimer, K. W., Yates, J. L., Meister, M. L., Huk, A. C., & Pillow, J. W. (2015). Single-trial spike trains in parietal cortex reveal discrete steps during decision-making. Science, 349(6244), 184-187. __http://science.sciencemag.org/content/349/6244/184__

[3] Arlot, S., & Celisse, A. (2010). A survey of cross-validation procedures for model selection. Statistics Surveys, 4, 40-79. doi: 10.1214/09-SS054. __https://projecteuclid.org/euclid.ssu/1268143839__

[4] Saeb, S., Lonini, L., Jayaraman, A., Mohr, D. C., & Kording, K. P. (2017). The need to approximate the use-case in clinical machine learning. GigaScience, 6(5), 1-9. doi: __https://doi.org/10.1093/gigascience/gix019__

[5] Latimer, K. W., Huk, A. C., & Pillow, J. W. (2017). No cause for pause: new analyses of ramping and stepping dynamics in LIP (Rebuttal to Response to Reply to Comment on Latimer et al. 2015). bioRxiv, 160994. doi: __https://doi.org/10.1101/160994__


**UPDATE:** The within-trial CV results depend a fair amount on the height of the step/ramp. Larger step sizes show more consistent classification, but I still believe these biases matter when applying the method to data. A 40 sp/s jump is sizable. The across-trial CV (with its ideal independent test/train sets) appears more consistent.