Kaspar: And yeah, I mean these same people, that’s the illustrative example that goes back to Steven.
I think if these same people then have a p-value [00:16:30] of 4%, would they call this a trend to non-significance? Of course not. This is then getting into the realm of misuse of concepts, in my opinion. So the p-value should not be used to [00:16:45] qualify the level of evidence in a hypothesis test, because you can't, it's a binary thing.
Now, if you take the p-value as it was intended to be used, as kind of a continuous quantification of evidence against the null, then it's a fair [00:17:00] question to ask what is large evidence or not. And there is a paper by, uh, I think Pocock and colleagues where they provide suggestions of labels. So they say if the p-value is [00:17:15] above 0.1, then it's no evidence.
If it's between five and 10%, they call it insufficient evidence. If it's between one and 5%, they call it some evidence. And if it's between one per mille and 1%, it's strong evidence, and [00:17:30] everything below is overwhelming evidence. So again, this has nothing to do with statistical significance.
Alexander: Mm-hmm.
Kaspar: Or hypothesis testing.
It's just that you label verbally how much evidence you have against the null, but it's not about [00:17:45] type one error, it's not about type two error, it's not about statistical significance. If everybody kept these concepts apart in this way, we would be fine, but of course this is not what's happening. So this is the case of one hypothesis test.
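As a minimal sketch, the verbal labels Kaspar describes can be written as a simple lookup. The thresholds below are taken from the conversation rather than quoted from the Pocock paper itself, so check the original before reusing them.

```python
# Map a p-value to the verbal evidence labels described above.
# Thresholds follow the conversation; they are illustrative only.
def evidence_label(p: float) -> str:
    """Return a verbal label for the evidence against the null."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p > 0.10:
        return "no evidence"
    if p > 0.05:
        return "insufficient evidence"   # 5% to 10%
    if p > 0.01:
        return "some evidence"           # 1% to 5%
    if p > 0.001:
        return "strong evidence"         # 0.1% (one per mille) to 1%
    return "overwhelming evidence"       # below 0.1%

for p in (0.2, 0.07, 0.04, 0.005, 0.0002):
    print(f"p = {p:7.4f}: {evidence_label(p)}")
```

Note that these labels describe evidence against the null only; they say nothing about type one error, type two error, or statistical significance, which is exactly the distinction made above.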
And then yet another complication is [00:18:00] this issue of multiple testing. I mean, say you have 200 genes, you compare them between two groups and you find 10 p-values which are below 5%, and then you end up writing a paper with these 10 genes in [00:18:15] there.
Alexander: Yeah,
Kaspar: that's not accurate. But if you say, I compared 200 genes between these two groups and I found 10 p-values below 5%, and you write a paper about that, that's perfectly accurate.
That's fine, because then I [00:18:30] know as a statistician that this is exactly what I would expect if there was no difference at all between the two groups for these 200 genes, because of course 10 p-values below 5% is exactly what you would expect if there's no difference at all. Um, and I [00:18:45] think that is where things become clear.
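A minimal simulation of that expectation, assuming 200 independent two-sample t-tests on data with no true differences; the sample sizes are illustrative only, not taken from the conversation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_genes, n_per_group = 200, 20

# No true difference anywhere: both groups drawn from the same distribution.
group_a = rng.normal(size=(n_genes, n_per_group))
group_b = rng.normal(size=(n_genes, n_per_group))

p_values = stats.ttest_ind(group_a, group_b, axis=1).pvalue
print("p-values below 5%:", int(np.sum(p_values < 0.05)))   # typically around 10
print("expected under the global null:", 0.05 * n_genes)    # 10.0
```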
If people just report what they did, then everybody can do the multiple testing correction in their head. And then it also doesn't matter so much whether you report p-values or whatever, because then there is complete transparency about what you [00:19:00] did. And that is also something that people underappreciate in drug development.
I mean, why do we send in statistical analysis plans before we look at the data? Why do we remain blinded until data cutoff? One aspect is exactly this: you [00:19:15] pre-specify, you say what you're gonna do, you do it, and either you succeed or you don't. Only then can the hypothesis test that you run be assumed to have the operating characteristics that you pre-specified, power and type one error. [00:19:30] You still make assumptions, or you still hope that all the assumptions you made, uh, hold true approximately, but only then do the operating characteristics hold. And I think there is an underappreciation in other fields of that [00:19:45] feature. I mean, on the multiple testing issue, I didn't invent all this, of course, it's all available in the literature. There's a famous example where a group of researchers in the ISIS-2 trial submitted the trial to the Lancet, and the [00:20:00] Lancet came back and said, we'll publish this trial, but can you add, I dunno, a list of a hundred subgroup analyses to that.
Then the authors said, well, we'll do that, but we just want to add one additional subgroup. And the Lancet said, well, that's fine. And then they added the zodiac sign, [00:20:15] and of course that immediately illustrated
Alexander: Yeah.
Kaspar: the difficulty in interpretation. And this is actually something I have been doing in the past as well: if people go overboard with a multitude of analyses, [00:20:30] not pre-specified, just kind of exploratory, and then cherry-pick the ones they like, add something like the birth month or the zodiac sign. And then if they see small p-values for two or three zodiac signs as well, maybe that makes them think. Yeah, [00:20:45] and maybe initially don't say it's the zodiac sign, just say, I have here actually another subgroup. Look at these small p-values. Should I tell you what it is? Then of course they're very interested, and then it's, I dunno, Gemini and, uh, Libra or whatever. And then hopefully [00:21:00] everybody understands that we should not go overboard. So these are two different problems. One is the interpretation of a hypothesis test
in connection to the p-value. And the other thing is multiple testing, but of course I think they are related. And for somebody [00:21:15] who maybe is not so deep into statistics, it's easy to confuse the two. But, uh, in theory, these are two different problems.
Alexander: When it comes to subgroup analyses, uh, or subpopulations, I'm actually usually much more interested in [00:21:30] looking into the point estimates and the confidence intervals, because I want to see whether they are more or less in line, consistent. Do I see big, big changes? Yeah. Do I see [00:21:45] quantitative or even qualitative interactions? Yeah. That is something that I'm interested in.
So for example, when we run a clinical trial and it is significant, but there's, let's say, [00:22:00] 5% of the patients that carry most of the safety burden and you can clearly identify them, and you then get a label not for all, but for the 95% that don't carry this safety burden, then of course you want to look at: [00:22:15] okay, what are your estimates in this smaller population? And of course, when it's 5% by design, it will not be very different. Yeah. Still, you wanna have a description of your benefit-risk profile in these [00:22:30] patients. Yeah. Maybe it's 10% or 20%. Yeah. So then of course it's interesting to see, [00:22:45] from an estimation point of view, whether you have similar effects. And of course, in most cases the confidence interval will get wider with fewer patients, and then a p-value that maybe was significant before becomes bigger than [00:23:00] 5% afterwards.
Kaspar: So this is yet another topic that’s prone to multiple testing, kind of subgroup analysis, for example, in clinical trials.
And I think they may serve different purposes, and very often the purpose is not really [00:23:15] clarified upfront. Just as a habit, we show these forest plots, and if it's not clear upfront what they are supposed to be used for, then everybody can draw his or her own conclusions. I mean, let me mention two things here.
[00:23:30] There is literature, quite well published, and a little bit of a consensus on how you should appraise post-hoc subgroup findings. For example, has there ever been a biological rationale? Ideally, you [00:23:45] specify a rationale if you suspect there is an effect in a subgroup. I mean, if you run a clinical trial, you analyze 20 subgroups and nobody ever raised their hand and said, in this subgroup I actually think there is something different, and then you find something, that just doesn't give [00:24:00] it much credibility, because if it is so obvious that in this subgroup something should be different, why has nobody raised it before? No? And there is the interaction test, which we know notoriously doesn't have a lot of power. There is literature that helps you interpret [00:24:15] post-hoc subgroup findings.
Uh, and the other thing I'd like to mention is just to make a plug for a paper that we have coming out soon in Statistical Methods in Medical Research, because what you were describing is exactly what you're actually interested in. You can think of it as having two [00:24:30] extremes. Say you have a clinical trial and you have 20 subgroups.
One extreme is to say the overall estimate is also my best estimate in every subgroup.
Alexander: Well, at least from a, uh, precision point of view. [00:24:45]
Kaspar: Exactly. That's exactly the point. Then you have a very precise estimate in every subgroup, but the estimate might be biased. Yeah. The other extreme is when you say, and that's what we actually typically do, we look at [00:25:00] every subgroup-specific estimate.
Then you have an unbiased estimate in every subgroup, but with potentially low precision, because these subgroups tend to be small. These are the two extremes. Now the question is: maybe we should interpolate between [00:25:15] these two extremes and find something in between. And one way to do that is to think in terms of penalization methods.
And that is the proposal we make: kind of borrow a little bit from the other subgroups. If you have 19 subgroups that [00:25:30] are completely homogeneous and you have just one outlier, then maybe you should pull that outlier toward the others, if it is indeed an outlier. Yeah. But if you have things kind of all over the place, maybe you don't really have a homogeneous effect. I think you are [00:25:45] right, this is then less about testing; it's more about whether we can get an accurate estimate. Sometimes maybe you want to strike a different bias-variance trade-off than looking at the one extreme of using the overall estimate in every subgroup or [00:26:00] just looking at subgroup-specific estimates. Maybe something in between is more accurate. And actually, that's what we show in the paper: if you look at estimation efficiency, so the mean squared error, these [00:26:15] interpolations can actually dramatically reduce the mean squared error, both compared to the overall estimate and to subgroup-specific estimates.
And, uh, I think that's where, yeah, hopefully at some point we should get to.
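The following toy simulation is not the method from the paper Kaspar mentions; it is just a sketch of the underlying bias-variance idea, shrinking each subgroup estimate toward the overall estimate with an arbitrary fixed weight and comparing mean squared errors. All names and numbers below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
n_subgroups, n_per_subgroup, n_sims = 20, 30, 2000
true_effects = np.full(n_subgroups, 0.5)
true_effects[0] = 1.0                      # one outlying subgroup
lam = 0.5                                  # hypothetical fixed shrinkage weight

mse = {"overall in every subgroup": 0.0,
       "subgroup-specific": 0.0,
       "interpolated (shrunken)": 0.0}

for _ in range(n_sims):
    data = rng.normal(true_effects[:, None], 1.0,
                      size=(n_subgroups, n_per_subgroup))
    sub_est = data.mean(axis=1)            # subgroup-specific estimates
    overall = data.mean()                  # overall estimate used everywhere
    shrunk = lam * overall + (1 - lam) * sub_est
    mse["overall in every subgroup"] += np.mean((overall - true_effects) ** 2)
    mse["subgroup-specific"] += np.mean((sub_est - true_effects) ** 2)
    mse["interpolated (shrunken)"] += np.mean((shrunk - true_effects) ** 2)

for name, total in mse.items():
    print(f"{name:26s} MSE: {total / n_sims:.4f}")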
Alexander: Yeah, I think that's a perfect [00:26:30] illustration. Now let's get to some kind of wordings that you could use in the discussion of such papers. I think for subgroup analyses we should shy away from talking about "we have proven" or [00:26:45] "we have shown". I would more go towards sentences and words like the subgroups "suggest" or "indicate", or something like that.
Rather than just kind of very, very strong words. [00:27:00] What is your take on that?
Kaspar: We could even go further and just report what we have. I mean, I know you, you run a trial or you do some analysis, you do some study, and then of course you have to discuss and qualify things. So I appreciate, I can understand the need [00:27:15] to do that.
There should at least be a very comprehensive, factual description of what you have found, and then the discussion, of course: how should you qualify that? But yeah, there is a big push to exaggerate the findings, and it's [00:27:30] interesting. I mean, we typically call a p-value below alpha statistically significant, which in many journals and in science at large may open the door to get something published, because this carries a lot of [00:27:45] meaning when you have one pre-specified, pre-planned hypothesis test for which you run the study. But very often this p-value smaller than 5% is not [00:28:00] used in such a framework. And it's not without irony that in this Pocock paper they call a p-value between one and 5% some evidence. So they are not terribly excited about a p-value between one and 5%.
Yeah. The whole problem comes from the confusion with statistical significance. [00:28:15] And then there is one aspect we didn't even discuss, which people often bring up as well, and which I think adds to the confusion: in English, significant can mean statistically significant and potentially also clinically significant. And of course, we all know these have [00:28:30] nothing to do with each other a priori. So a statistically significant effect need not at all be clinically meaningful. Of course, in the design of a clinical trial, and that's what I work on a lot with teams, you should design a clinical trial [00:28:45] such that a statistically significant result is clinically meaningful.
But there is a lot of confusion around this as well, because the effect we power at say you have a hazard ratio. Yeah,
Alexander: it’s yet another thing. Yeah. That’s
Kaspar: yet another thing where, even in the minds of many statisticians, [00:29:00] these concepts are not so clear in my experience. So when you have 80% power to detect a hazard ratio of 0.75, if you get a p-value of exactly 5%, you will not see an effect of 0.75. You will see something like [00:29:15] 0.82. This is the quantity you need to discuss with the clinical team. I don't care so much about the 0.75, whether that's clinically meaningful. I care whether 0.82 is clinically meaningful, because if you end up with a trial that has a p-value of 5% and a point [00:29:30] estimate of 0.82, and then everybody just shrugs their shoulders and says, well, we thought we'd get 0.75, we will not change any practice because of something like 0.82, then you have, yeah, a statistically significant but clinically irrelevant result, [00:29:45] and you want to avoid that. So that's yet another complication: people imply clinical relevance from statistical significance, which of course, yeah, is a separate thing.
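A short sketch of the 0.75-versus-0.82 arithmetic, assuming the standard Schoenfeld approximation for a 1:1 randomized time-to-event trial; this is the generic textbook calculation, not a specific trial design.

```python
import math
from scipy import stats

alpha, power, hr_planned = 0.05, 0.80, 0.75
z_a = stats.norm.ppf(1 - alpha / 2)        # about 1.96
z_b = stats.norm.ppf(power)                # about 0.84

# Schoenfeld: required events d = 4 * (z_a + z_b)^2 / log(HR)^2
events = 4 * (z_a + z_b) ** 2 / math.log(hr_planned) ** 2

# If the p-value lands exactly on 5%, the observed log-HR sits z_a standard
# errors from zero, with standard error roughly 2 / sqrt(events).
hr_at_p_equal_5pct = math.exp(-z_a * 2 / math.sqrt(events))

print(f"required events for 80% power at HR 0.75: {events:.0f}")       # ~380
print(f"observed HR when p is exactly 5%: {hr_at_p_equal_5pct:.2f}")   # ~0.82
```

In other words, the effect size the trial is powered for is noticeably larger than the smallest effect that will still come out statistically significant, and it is the latter that needs to be clinically meaningful.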
Alexander: I think talking about clinical relevance is yet another completely different [00:30:00] topic. Thanks so much. I think that was super helpful talking about these things and giving some kind of clarity to this. Kaspar and I have, uh, agreed that we will record more of such episodes that go [00:30:15] a little bit into the kind of day-to-day challenges
that we all face, because I know from many discussions that there's debate about it. There's, you know, the fact that it's not really taught at universities, or there's not [00:30:30] enough time. Also, getting that knowledge within the usual kind of day-to-day work is not that easy. So thanks a lot, Kaspar, for this awesome discussion about p-values, hypothesis testing, [00:30:45] estimation, and all the other things.
Is there any final sentence that you want to have the listener go away with?
Kaspar: Thanks, Alexander, and I very much look forward to the next episodes. Maybe one thing I'll mention, and, uh, I think I'm sometimes [00:31:00] also critical of my own profession, but I encourage everybody, or every statistician: before you go back to clinicians or stakeholders and criticize them for not using the concepts appropriately, make sure you do your homework and understand the concepts yourself, [00:31:15] because that will enable you to have a better conversation, of course, with your stakeholders. And then make sure, or at least attempt, to have these concepts used, uh, properly.
Alexander: Yeah. And make sure that all the different statisticians in your [00:31:30] organization speak with one voice. That's another thing, as that will help with credibility, so that you don't have a clinician saying, yeah, but Fred, the other statistician, he was okay with that. Okay. Otherwise you get multiple [00:31:45] testing by choosing the right statistician. That's definitely something that we wanna avoid. Thanks a lot, Kaspar. Have a great time.
This show was created in [00:32:00] association with PSI. Thanks to Reine and her team at VVS working in the background, and thank you for listening. Reach your potential, lead great science, and serve patients. Just be an effective [00:32:15] statistician.