How do you design unbiased neuromodulation trials?


At the annual meeting of the Neuromodulation Society of the UK and Ireland (NSUKI; 3–4 November, Manchester, UK), Alan Batterham (Health & Social Care Institute, Teesside University, Middlesbrough, UK) assisted Sam Eldabe during his presentation on the future of neuromodulation research. Batterham discussed trial designs including head-to-head comparisons and superiority versus non-inferiority studies, offering advice from a statistician's standpoint on how they should be designed and conducted.

On head-to-head device comparison trials, Batterham maintained that a pivotal, definitive trial that is to be seen as unbiased both within and outside the field should be controlled and run by non-profit entities, ie, as a non-industry-sponsored trial.

“The main design options that you have in a head-to-head comparison trial are a parallel group randomised controlled trial and some sort of crossover design. Clearly a parallel group randomised trial has an increased sample size requirement compared to a crossover; all else equal, it needs about four times the number of patients. But a crossover has problems with the assumptions concerning washout of treatments and the elimination of potential carryover effects. There is also, as we found in our study of high-frequency stimulation versus sham (which we think is the first double-blind trial in the field), a substantial period effect. We found that the number of responders in the first period that people were exposed to, irrespective of treatment, was far higher than in the second period, likely due to some sort of expectancy effect. I think crossover trials do have a role depending on the context, but I would conclude that overall a parallel group randomised trial is cleaner and more robust in order to develop good inferences,” he explained.
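As a rough illustration of where a figure like "four times" can come from, the sketch below uses the standard textbook ratio for a continuous outcome in a two-period crossover versus a two-arm parallel trial. The within-patient correlation `rho` is a hypothetical input that was not stated in the talk, and the formula assumes equal variances and no carryover or period effects, which are exactly the assumptions Batterham warns about.

```python
def crossover_to_parallel_ratio(rho: float) -> float:
    """Total N of a 2x2 crossover divided by total N of a parallel-group trial.

    Textbook approximation for a continuous outcome with equal variances,
    the same target difference, alpha and power in both designs, and no
    carryover or period effects. `rho` is the within-patient correlation
    between the two treatment periods (an assumed, illustrative input).
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return (1.0 - rho) / 2.0


for rho in (0.0, 0.5, 0.8):
    ratio = crossover_to_parallel_ratio(rho)
    print(f"rho={rho:.1f}: parallel trial needs {1 / ratio:.1f}x the patients")
```

Under these assumptions a correlation of 0.5 gives the fourfold figure quoted above; a weaker correlation shrinks the crossover's advantage.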

When it comes to a superiority versus a non-inferiority design, Batterham pointed out that there is a trend toward more and more head-to-head non-inferiority trials as “they are often seen as a safe option because with some manipulation of the non-inferiority margin and other parameters, you can actually design a trial to show non-inferiority if you want to, more easily than you can show superiority versus another. The regulatory burden of proof is a reasonable assurance of safety and effectiveness and therefore if there are other approved devices and treatments on the market, it is logical to show that your new device is not substantially worse than the referenced treatment. You might not think it is better, but it could be almost as good but have other advantages like fewer side effects or it is cheaper, for example,” he said.

However, Batterham added, non-inferiority trials raise issues of sample size, planning and inference. The first issue, which is also the most crucial, is setting the non-inferiority margin a priori.

“You should choose this margin very carefully and justify it clinically. In lay terms, it is the largest difference that you would be prepared to accept in order to say that your new device was not substantially worse than an existing treatment, where the existing treatment is the best currently available device in this context. The regulatory bodies suggest that you should be looking to preserve a substantial proportion of the active control's effect versus placebo. So let us assume that your reference treatment or your active control is conventional SCS; then you should have already established in advance, from historical data, that that treatment is better than placebo before you proceed to do a non-inferiority trial. This relates to assumptions of what are known as assay sensitivity and constancy. You should establish that the treatment will be superior to placebo in the setting of the present non-inferiority trial, and this is a reason why a third arm in head-to-head comparison studies is seen as the gold standard: to have a placebo arm alongside your two treatment comparisons,” Batterham advised.

Giving a sample size calculation for a properly powered non-inferiority study, he stated, “We can assume a one-sided p value of 0.025 (which most regulatory bodies recommend), 90% power and a non-inferiority margin of a 10% difference in proportions. So you are prepared to accept a 10% difference in proportions between your new device and your reference device in terms of the proportion responding. We will say that the proportion responding in the active control is 50%, so say the proportion responding to conventional SCS is around 50%. You would need 526 patients per arm to run that study.”
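The 526-per-arm figure can be reproduced with the usual normal-approximation formula for comparing two proportions. This is a minimal sketch, not necessarily the method Batterham used: the function name and parameters are illustrative, and it assumes the new device truly has the same 50% response rate as the control.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm_noninferiority(p_control, p_new, margin,
                             alpha_one_sided=0.025, power=0.90):
    """Patients per arm for a non-inferiority test on two proportions.

    Normal approximation with unpooled variance. `margin` is the
    non-inferiority margin expressed as an absolute difference in
    proportions responding.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha_one_sided)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_new * (1 - p_new)
    # Distance between the assumed true difference and the margin boundary.
    effect = (p_new - p_control) + margin
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Inputs from the talk: 50% response in both arms, 10% margin,
# one-sided alpha of 0.025 and 90% power.
print(n_per_arm_noninferiority(0.50, 0.50, 0.10))  # prints 526
```

Halving the margin to 5% roughly quadruples the requirement, which shows how sensitive the design is to the choice of margin.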

For a superiority study, on the other hand, he said, “if you have an a priori belief that your new treatment is going to be superior to the reference treatment you could set it up with a 95% confidence interval, so two-sided p=0.05, 90% power, the same assumption with the response rate (50% response rate in active control) and a minimum clinically important difference of a 10% difference in proportions responding.”

For either kind of study to be properly powered, Batterham suggested, you would therefore need approximately 1,000 patients in total.

By contrast, comparing a device against placebo requires far fewer patients, because the difference you are looking for is much bigger than 10%: perhaps a 25% or 30% difference in the proportion responding between placebo and an active treatment. That study would need sample sizes of the order of 50–70 per group for a parallel group randomised trial.
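The contrast between the head-to-head superiority design and the placebo-controlled design can be checked with the same normal-approximation formula. The sketch below is illustrative only. The talk gives the size of the differences but not the placebo response rates, so the 25% and 20% placebo rates are assumptions chosen to give 25- and 30-point differences from a 50% active response.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm_superiority(p_control, p_new,
                          alpha_two_sided=0.05, power=0.90):
    """Patients per arm for a superiority test on two proportions.

    Normal approximation with unpooled variance and a two-sided alpha.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha_two_sided / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_new * (1 - p_new)
    effect = p_new - p_control
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Head-to-head superiority: 50% vs 60% responders (10-point difference).
head_to_head = n_per_arm_superiority(0.50, 0.60)
print(f"head-to-head: {head_to_head} per arm, {2 * head_to_head} in total")

# Placebo-controlled: assumed placebo rates of 25% and 20% vs 50% active.
for p_placebo in (0.25, 0.20):
    n = n_per_arm_superiority(p_placebo, 0.50)
    print(f"placebo {p_placebo:.0%} vs active 50%: {n} per arm")
```

With these inputs the head-to-head design needs just over 1,000 patients in total, while the placebo-controlled designs come out at 74 and 48 per arm, broadly in line with the ranges quoted above. Because the required sample size scales with the inverse square of the difference, a target difference three times larger needs roughly a ninth of the patients.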

Batterham stated, “I would like to issue a call for properly powered studies, be they head-to-head comparison studies or placebo-controlled.”

He referred to the paper “Power failure–Why small sample size undermines the reliability of neuroscience” published in 2013, remarking, “it is well appreciated, of course, that if you have low statistical power, you have less of a chance of detecting your targeted effect size, that is well known, but it is less well appreciated that low power also reduces the likelihood when you find a statistically significant effect that that is a true effect and often the effects are exaggerated…You cannot trust statistically significant results that emerge from underpowered studies; they are an exaggerated effect. Really, we should be looking at properly powered placebo-controlled studies in order to satisfy that step that our active control is superior before we move on to head-to-head comparisons.”
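The point about low power can be made concrete with the positive predictive value (PPV) formula used in the paper he cites: PPV = (power × R) / (power × R + α), where R is the pre-study odds that a tested effect is real. The sketch below is illustrative. The value R = 0.25 (one true effect for every four null effects tested) is an assumption chosen for the example, not a figure from the talk.

```python
def positive_predictive_value(power, prior_odds, alpha=0.05):
    """Probability that a statistically significant finding is a true effect.

    power      -- 1 - beta, the chance of detecting a true effect
    prior_odds -- R, pre-study odds that the effect is real (assumed input)
    alpha      -- type I error rate
    """
    true_positives = power * prior_odds
    return true_positives / (true_positives + alpha)


R = 0.25  # illustrative assumption: 1 real effect per 4 null effects tested
for power in (0.90, 0.50, 0.20):
    ppv = positive_predictive_value(power, R)
    print(f"power={power:.0%}: PPV={ppv:.2f}")
```

Under this assumption, a significant result from a study with 20% power is as likely to be false as true, whereas at 90% power roughly four in five significant results reflect real effects.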