Is it possible to get some kind of "local significance level" accounting only for the sequentiality of the analysis? #3
Maybe I have a reasonable answer. I tried to design both a single-stage and a multi-stage design with equal characteristics (except for the multiplicity correction, which is not [yet?] provided for multi-stage designs), i.e.: Bernoulli, power 80%, alpha 0.05, control rate = 0.33, d1 = 0.33, d0 = 0, type of control = MARGINAL. In the single-stage case, the output is below:
In the multi-stage case, instead, it is:
My conclusion is that the sequential familywise error rate for the multi-stage design should be exactly 0.05, just as the output says. I was in doubt because I was not able to identify the penalization for the sequential design. By comparing the two, though, I noted that the required sample size per group for the single-stage design was 37 (total = 148), while for the multi-stage design it is 20 + 20 = 40 (total = 160). Maybe the 3 extra units per group are the cost of the two-stage design? If so, it would be a very interesting price! The output now seems reasonably clear: as I expected, in the multi-stage case the app provides only the calculation for the sequentiality correction. Am I wrong? PS As you can see, the row:
does not show the entered value (0.33). Moreover, when downloading the .pdf and .docx reports, the tables are not those reported by the web app:
but look like these:
Hi Paolo,
The output from the multi-stage commands controls the FWER at the specified level alpha (i.e., it adjusts for both having multiple arms and multiple stages). What is returned are upper (efficacy) and lower (futility) stopping boundaries on the Z-scale. You can find the efficacy boundaries, for example, with
To convert to the p-value scale all you need to run is
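For example, as a minimal sketch (the boundary values below are placeholders, not taken from an actual design), converting Z-scale boundaries to the one-sided p-value scale is just the upper tail of the standard normal:

```r
# Hedged sketch: convert hypothetical Z-scale efficacy boundaries to one-sided p-values
e <- c(2.18, 2.18)                            # placeholder upper (efficacy) boundaries
p_boundaries <- pnorm(e, lower.tail = FALSE)  # equivalently 1 - pnorm(e)
p_boundaries
```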
In the single-stage case, there is a variety of multiple comparison procedures available (to adjust for having multiple arms) - the one selected in the code above is Holm-Bonferroni. This doesn't leverage the (approximately known) correlation between the test statistics, which is what gives rise to the maximal FWER (0.043) being below the allowed level (0.05). If you were to change the single-stage implementation to use Dunnett's correction, this would arguably be a fairer comparison of the cost of incorporating the interim analysis in terms of the maximal sample size, as the multi-stage commands implement only a generalised Dunnett test (though in the grand scheme of things it's not going to change much compared to using Holm-Bonferroni).
Hope that makes sense,
Michael
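To illustrate why the correlation matters (an illustrative sketch only, not multiarm's code; it assumes three experimental arms compared against a shared control under equal allocation, so the test statistics are equicorrelated with correlation 0.5):

```r
# Hedged sketch: maximal FWER of a Bonferroni/Holm threshold when the K test
# statistics share correlation 0.5, versus Dunnett's exact single-step threshold.
library(mvtnorm)

K     <- 3
alpha <- 0.05
Sigma <- matrix(0.5, K, K); diag(Sigma) <- 1   # assumed correlation structure

# One-sided Bonferroni critical value on the Z-scale
c_bonf <- qnorm(1 - alpha / K)

# Maximal FWER under the global null: P(any Z_k > c_bonf) = 1 - P(all Z_k <= c_bonf)
fwer_bonf <- 1 - as.numeric(pmvnorm(upper = rep(c_bonf, K), corr = Sigma))
fwer_bonf   # roughly 0.043 under these assumptions, i.e. below the nominal 0.05

# Dunnett's single-step critical value spends all of alpha by using Sigma directly,
# so it sits a little below c_bonf
c_dunnett <- qmvnorm(1 - alpha, tail = "lower.tail", corr = Sigma)$quantile
c_dunnett
```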
It makes a lot of sense. As for the 0.043, I have to apologize: had I tried selecting the Bonferroni correction, I would have realized by myself that the deviation from 0.05 was not due to Holm! Although I had sometimes used the R software provided by Derringer (2018) A simple correction for non-independent tests, related to Nyholt (2004) A simple correction for multiple testing for single-nucleotide polymorphisms in linkage disequilibrium with each other, I did not believe that the... cost of the correlations between tests could be shown so convincingly. Your app is a gold mine of useful suggestions! I preferred the Bonferroni-Holm correction because I expected that two of the comparisons would have very small p-values while the third would be on the boundary; using the step-down method, I thought I would give maximum power to this last one while sacrificing little or nothing for the others. Being conditioned by the rpact application, which uses for multi-stage designs the same scheme that multiarm uses for single-stage ones - i.e. it reports the new threshold for the p-values - I did not fully realize that for multi-stage designs you had decided to report only the stopping boundaries on the Z-scale, but with all the adjustments already included in them. It would be interesting to know the reason for this choice (the two ways seem entirely equivalent...). Certainly, using the quantiles of the normal distribution is just as simple, and it is even clearer once you know they are already adjusted for multiplicity too! Please let me know if it is possible to download tables equal to those printed by the shiny app (I have copied and pasted them, but I fear they are not complete), or whether I have done something wrong to get only a lot of ones and zeros (they look like model matrices...!). Or, alternatively, whether they make some sense the way they are now. Thank you again, Michael.
It's really just personal preference around how to specify the boundaries - in the past I've mostly worked with them defined in terms of Z, but as you say it is completely equivalent. I can easily add returning both in a future version. On the tables, it may be that I've updated the shiny app but not the package, as they run off different repositories. Can I check what version of multiarm you're using? (You can find this with the command below.)
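The exact call isn't shown above; presumably it is the standard way of querying an installed package's version:

```r
# Assumed snippet: report the installed version of the multiarm package
packageVersion("multiarm")
```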
Yes, I understand. But keep in mind that many R functions [e.g. multcomp::glht()] apply a multiplicity adjustment by default. Having to manage overlapping adjustments may become quite uncomfortable, as I will try to show. If the multi-stage output reports the fully adjusted Z boundary, there is a risk of printing plots showing confidence intervals with a twofold multiplicity correction: one applied in the shiny app, which is conveyed through the level parameter passed to glht, and the other applied by default by the function itself. To prevent the issue, I first have to get the fully adjusted p-value as you showed, then pass its ones' complement to the level parameter, and finally disable the default multiplicity adjustment which, according to the help file, is done by means of a dedicated parameter.
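To make this workaround concrete (a hedged sketch only: the toy data, the factor name `arm`, and the adjusted level 0.983 below are placeholders, not values from the app), multcomp lets you both pass a pre-adjusted confidence level and switch off its own adjustment:

```r
# Hedged sketch: combine an externally adjusted significance level with glht(),
# while switching off glht()'s own (default single-step) multiplicity adjustment.
library(multcomp)

# Toy data: a control arm and three experimental arms (for illustration only)
set.seed(1)
dat <- data.frame(
  arm = factor(rep(c("C", "E1", "E2", "E3"), each = 20)),
  y   = rnorm(80)
)
fit <- lm(y ~ arm, data = dat)

# "Dunnett" here only selects the many-to-one (each arm vs control) contrast matrix
gh <- glht(fit, linfct = mcp(arm = "Dunnett"))

adj_level <- 0.983   # placeholder: 1 minus an externally adjusted p-value threshold

# Tests without any additional multiplicity correction on top of the external one
summary(gh, test = adjusted("none"))

# Confidence intervals: univariate_calpha() replaces the default adjusted_calpha()
confint(gh, level = adj_level, calpha = univariate_calpha())

# Alternatively, the critical quantile itself can be supplied as a number,
# in which case it takes precedence over `level`
confint(gh, calpha = qnorm(adj_level))
```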
Note that the output of
even when the multiplicity correction is turned off by means of that parameter. In fact, glht interprets the "Dunnett Contrasts" statement as the many-to-one (each arm vs. control) contrast matrix, without any reference to Dunnett's multiplicity correction, which is always applied in addition (it does the same for the "Tukey Contrasts", i.e. the pairwise ones, to which it applies the so-called single-step correction; as an aside, both the Dunnett and Tukey single-step corrections take the correlations between the outcomes into account, as far as I know). However, the confidence level is reported at the bottom along with the quantile. When the
being instead:
but I still have to understand where the quantile comes from and why it is not equal to:
Here it seems that glht() makes some correction, since the p-values entered do not match the quantiles. Fortunately, it is possible to input the quantile directly, or both the quantile and the level. In the latter case the quantile prevails, while the level is only printed on screen without influencing the computations (but only if it is entered as a number! If the statement
Of course, the result obtained entering
Maybe I was unclear. The tables I showed above had been copied and pasted from a .docx that had been downloaded from the shiny app. I still do not know if and how it is possible to get a report file from R (this was one of the reasons why I had asked you for a technical document or a video tutorial). In any case, the version installed on my R instance is:
Unfortunately, producing confidence intervals with the desired coverage for a multi-arm multi-stage design isn't as simple as using the adjusted significance levels that preserve the familywise error rate (see, e.g., https://www.cytel.com/hubfs/0-library-0/presentations/MehtaSlides-IISA-Pune-2015.pdf), so it's not currently possible to use multiarm to find appropriate CIs. On a report, that's not currently a feature you can use in R, unfortunately - you can load the GUI offline through
I have to apologize for putting the cart before the horse with the confidence intervals! Admittedly, I was overlooking the specific sequentiality issue. While according to some opinions it is altogether impossible to compute them at all, according to at least two authors - who propose how to construct as many versions of them starting directly from the adjusted p-values - they are actually feasible. I thought it was correct to do so, on the strength of two conditions: that I would be using validated p-values, and that I would be using one (of two) validated methods. In fact, I still find it hard to see why, with the p-values being correct and the parameter distribution known, the intervals might not guarantee coverage. If Ludbrook's and Serlin's procedures are correct, two different kinds of p-values would exist, usable and not usable, which is bewildering to me. So it has now become impossible for me to keep deferring the specific literature on the subject (e.g. Jennison C, Turnbull BW. Group Sequential Methods with Applications to Clinical Trials. Chapman and Hall/CRC: London, 2000)! At present, confidence intervals are an indispensable requirement, and so this may be a fatal shortcoming of the multi-arm multi-stage design, at least until it is overcome. What a pity!
Dear professor Grayling,
As I anticipated in the previous post, I am a little embarrassed about interpreting the printed tables. I know it is surely my fault for not having studied your project in depth, as it is indeed complex.
Until I am more aware of how it works in detail, to simplify provisionally (I hope not unduly!), my aim is to get a kind of local alpha accounting for sequentiality that guarantees "the overall test" stays below, say, alpha = 0.05. I would then enter it as a sequentially adjusted alpha level into whatever multiple comparison procedure I do next.
As an example, I report the output of the rpact application, which, however, does not allow for multi-arm designs.
Here, in my understanding, the desired alpha level for the first interim would be 0.0294, which should be corrected for sequentiality only, since it refers to a two-arm design. It is equal to that of the final stage because the Pocock design was chosen. Setting the O'Brien & Fleming design, the two numbers become:
Two-sided local significance level 0.0052 0.0480
in agreement with my expectations.
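For reference, a minimal sketch of the kind of rpact call that yields these local significance levels (my reconstruction, assuming two equally spaced analyses; not necessarily the exact settings used in the app):

```r
# Hedged sketch: two-stage group sequential designs in rpact with two-sided alpha = 0.05
library(rpact)

# Pocock: equal local significance levels at both analyses (about 0.0294 each)
pocock <- getDesignGroupSequential(kMax = 2, alpha = 0.05, sided = 2,
                                   typeOfDesign = "P")
summary(pocock)

# O'Brien & Fleming: very strict first look, nearly full alpha at the final analysis
# (about 0.0052 and 0.0480)
obf <- getDesignGroupSequential(kMax = 2, alpha = 0.05, sided = 2,
                                typeOfDesign = "OF")
summary(obf)
```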
In the "Operating characteristics summary: "(key, error rates and others) I am unsure to correctly interpret the big deal of data so, to be safe, I would prefer to use a more simple and well-identified datum.
Please, may you tell me what is the number I am looking for?
In alternative, may you tell me which numbers should I look at to get the alpha level to use with the omnibus test and with the direct comparisons between control and experimental groups?
Thank you in advance!
Paolo
PS This information is, in fact, easily found in your function for the single-stage design, in which the multiplicity correction method can be chosen.
As an example:
Here the output is clearer: in particular, sample size in each arm = 18 and critical thresholds for multiple comparisons with the step-down method = (0.017, 0.025, 0.05). I am not sure, however, what exactly the "maximum familywise error rate" = 0.043 represents, i.e. why it is not 0.05 (I believe it is something that may be overlooked, since it is linked only to the three comparisons under the Bonferroni-Holm method and seems independent of the specific data).
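For what it's worth, those step-down thresholds match the usual Holm sequence alpha/K, alpha/(K-1), ..., alpha for K = 3 comparisons (a quick check, not package code):

```r
# Holm (step-down Bonferroni) thresholds for K = 3 comparisons at alpha = 0.05:
# the smallest p-value is compared with alpha/3, the next with alpha/2, the largest with alpha
alpha <- 0.05
K     <- 3
round(alpha / (K:1), 3)   # 0.017 0.025 0.050
```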
In the sequential section I would have expected something similar. However, the correction method is no longer selectable.
I guess there is a reason for this, but I hope there is a way to get the equivalent information, maybe in the tables ...
For instance, suppose the single-stage design above were to become one stage of two (for a total N = 72*2 = 144 samples), with the aim of decreasing the effect size to detect while allowing the trial to stop earlier if a more substantial effect were present: by how much would the Bonferroni-Holm multiple-comparison thresholds be penalized in the two stages using the Pocock design? Is there a simple way to answer this question with your package?
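To give a rough idea of the kind of calculation behind such thresholds (an illustrative sketch only, not multiarm's API: it assumes a common Pocock-style upper boundary shared across two equally sized stages and three arms, ignores futility stopping, and uses the standard correlation structure for shared-control comparisons under equal allocation):

```r
# Hedged sketch: common (Pocock-style) upper boundary c for a 3-arm, 2-stage design,
# chosen so that P(any of the 6 test statistics exceeds c under the global null) = alpha.
# Assumptions: equal allocation, equal stage sizes, shared control, no futility stopping.
library(mvtnorm)

K     <- 3      # experimental arms
J     <- 2      # stages
alpha <- 0.05   # one-sided FWER to control (illustrative)

# Correlation between Z_{k,j} (arm k vs control, analysis j):
#  - same arm, different stages: sqrt(information ratio) = sqrt(1/2)
#  - different arms, same stage: 1/2 (shared control)
#  - different arms, different stages: (1/2) * sqrt(1/2)
idx  <- expand.grid(arm = 1:K, stage = 1:J)
info <- idx$stage / J
n    <- nrow(idx)
Sigma <- matrix(0, n, n)
for (a in 1:n) for (b in 1:n) {
  t <- sqrt(min(info[a], info[b]) / max(info[a], info[b]))
  Sigma[a, b] <- if (idx$arm[a] == idx$arm[b]) t else 0.5 * t
}
diag(Sigma) <- 1

# Solve for the common critical value c
c_pocock <- qmvnorm(1 - alpha, tail = "lower.tail", corr = Sigma)$quantile
c_pocock
pnorm(c_pocock, lower.tail = FALSE)  # the corresponding one-sided "local" p-value threshold
```

Here the multiplicity over arms and over stages is handled jointly by the single constant c, in the spirit of the generalised Dunnett approach mentioned earlier, rather than by layering a Bonferroni-Holm step on top of a sequential boundary.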