Kaspar: And yeah, I mean these same people, that’s the illustrative example that goes back to Steven.
I think if these same people then have a p-value [00:16:30] of 4%, would they call this a trend to non-significance? Of course not. This is then getting into the realm of misuse of concepts, in my opinion. So the p-value should not be used to [00:16:45] qualify the level of evidence in a hypothesis test, because you can't, it's a binary thing.
Now, if you take the p-value as it was intended to be used, as kind of a continuous quantification of evidence against the null, then it's a fair [00:17:00] question to ask what is large evidence or not. And there is a paper by, uh, I think Pocock and colleagues where they provide suggestions of labels. So they say if the p-value is [00:17:15] above 0.1, then it's no evidence.
If it's between five and 10%, they call it insufficient evidence. If it's between one and 5%, they call it some evidence. And if it's between one per mille and 1%, it's strong evidence, and [00:17:30] everything below is overwhelming evidence. So again, this has nothing to do with statistical significance.
Alexander: Mm-hmm.
Kaspar: Or hypothesis testing.
It's just that you label verbally how much evidence you have against the null, but it's not about [00:17:45] type one error, it's not about type two error, it's not about statistical significance. If everybody kept these concepts apart in this way, we would be fine, but of course this is not what's happening. So this is the case of one hypothesis test.
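As a minimal sketch, the verbal labels Kaspar describes can be written as a simple lookup. The thresholds below are taken from the conversation rather than quoted from the Pocock paper itself, so check the original before reusing them.

```python
# Map a p-value to the verbal evidence labels described above.
# Thresholds follow the conversation; they are illustrative only.
def evidence_label(p: float) -> str:
    """Return a verbal label for the evidence against the null."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p > 0.10:
        return "no evidence"
    if p > 0.05:
        return "insufficient evidence"   # 5% to 10%
    if p > 0.01:
        return "some evidence"           # 1% to 5%
    if p > 0.001:
        return "strong evidence"         # 0.1% (one per mille) to 1%
    return "overwhelming evidence"       # below 0.1%

for p in (0.2, 0.07, 0.04, 0.005, 0.0002):
    print(f"p = {p:7.4f}: {evidence_label(p)}")
```

Note that these labels describe evidence against the null only; they say nothing about type one error, type two error, or statistical significance, which is exactly the distinction made above.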
And then yet another complication is [00:18:00] this issue of multiple testing. I mean, say you have 200 genes, you compare them between two groups and you find 10 p-values which are below 5%, and then you end up writing a paper with these 10 genes in [00:18:15] there.
Alexander: Yeah,
Kaspar: that's not accurate. But if you say, I compared 200 genes between these two groups and I found 10 p-values below 5%, and you write a paper about that, that's perfectly accurate.
That's fine, because then I [00:18:30] know as a statistician that this is exactly what I would expect if there was no difference at all between the two groups for these 200 genes, because of course 10 p-values below 5% is exactly what you would expect if there's no difference at all. Um, and I [00:18:45] think that is where things become clear.
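A minimal simulation of that expectation, assuming 200 independent two-sample t-tests on data with no true differences; the sample sizes are illustrative only, not taken from the conversation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_genes, n_per_group = 200, 20

# No true difference anywhere: both groups drawn from the same distribution.
group_a = rng.normal(size=(n_genes, n_per_group))
group_b = rng.normal(size=(n_genes, n_per_group))

p_values = stats.ttest_ind(group_a, group_b, axis=1).pvalue
print("p-values below 5%:", int(np.sum(p_values < 0.05)))   # typically around 10
print("expected under the global null:", 0.05 * n_genes)    # 10.0
```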
If people just report what they did, then everybody can do the multiple testing correction in their head. And then it also doesn't matter so much whether you report p-values or whatever, because then there is complete transparency about what you [00:19:00] did. And that is also something that people underappreciate in drug development.
I mean, why do we send in statistical analysis plans before we look at the data? Why do we remain blinded until data cutoff? One aspect is exactly this: you [00:19:15] pre-specify, you say what you're gonna do, you do it, and either you succeed or you don't. Only then can the hypothesis test that you run be assumed to have the operating characteristics that you pre-specified, power and type one error. [00:19:30] You still make assumptions, or you still hope that all the assumptions you made, uh, hold true approximately, but only then do the operating characteristics hold. And I think there is an underappreciation in other fields of that [00:19:45] feature. I mean, on the multiple testing issue, I didn't invent all this, of course, it's all available in the literature. There's a famous example where a group of researchers in the ISIS-2 trial submitted the trial to the Lancet, and the [00:20:00] Lancet came back and said, we'll publish this trial, but can you add, I dunno, a list of a hundred subgroup analyses to that.
Then the authors said, well, we'll do that, but we just want to add one additional subgroup. And the Lancet said, well, that's fine. And then they added the zodiac sign, [00:20:15] and of course that immediately illustrated
Alexander: Yeah.
Kaspar: the difficulty in interpretation. And this is actually something I have been doing in the past as well: if people go overboard with a multitude of analyses, [00:20:30] not pre-specified, just kind of exploratory, and then cherry-pick the ones they like, add something like the birth month or the zodiac sign. And then if they see small p-values for two or three zodiac signs as well, maybe that makes them think. Yeah, [00:20:45] and maybe initially don't say it's the zodiac sign, just say, I have here actually another subgroup. Look at these small p-values. Should I tell you what it is? Then of course they're very interested, and then it's, I dunno, Gemini and, uh, Libra or whatever. And then hopefully [00:21:00] everybody understands that we should not go overboard. So these are two different problems. One is the interpretation of a hypothesis test
in connection to the p-value. And the other thing is multiple testing, but of course I think they are related. And for somebody [00:21:15] who maybe is not so deep into statistics, it's easy to confuse the two. But, uh, in theory, these are two different problems.
Alexander: When it comes to subgroup analyses, uh, or subpopulations, I'm actually usually much more interested in [00:21:30] looking into the point estimates and the confidence intervals, because I want to see whether they are more or less in line, consistent. Do I see big, big changes? Yeah. Do I see [00:21:45] quantitative or even qualitative interactions? Yeah. That is something that I'm interested in.
So for example, when we run a clinical trial and it is significant, but there's, let's say, [00:22:00] 5% of the patients that carry most of the safety burden and you can clearly identify them, and you then get a label not for all, but for the 95% that don't carry this safety burden, then of course you want to look at: [00:22:15] okay, what are your estimates in this smaller population? And of course, when it's 5% by design, it will not be very different. Yeah. Still, you wanna have a description of your benefit-risk profile in these [00:22:30] patients. Yeah. Maybe it's 10% or 20%. Yeah. So then of course it's interesting to see, [00:22:45] from an estimation point of view, whether you have similar effects. And of course, in most cases the confidence interval will get wider with fewer patients, and then a p-value that maybe was significant before becomes bigger than [00:23:00] 5% afterwards.
Kaspar: So this is yet another topic that’s prone to multiple testing, kind of subgroup analysis, for example, in clinical trials.
And I think they may serve different purposes, and very often the purpose is not really [00:23:15] clarified upfront. Just as a habit, we show these forest plots, and if it's not clear upfront what they are supposed to be used for, then everybody can draw his or her own conclusions. I mean, let me mention two things here.
[00:23:30] There is literature, quite well published, and a little bit of a consensus on how you should appraise post-hoc subgroup findings. For example, has there ever been a biological rationale? Ideally, you [00:23:45] specify a rationale if you suspect there is an effect in a subgroup. I mean, if you run a clinical trial, you analyze 20 subgroups and nobody ever raised their hand and said, in this subgroup I actually think there is something different, and then you find something, that just doesn't give [00:24:00] it much credibility, because if it is so obvious that in this subgroup something should be different, why has nobody raised it before? No? And there is the interaction test, which we know notoriously doesn't have a lot of power. There is literature that helps you interpret [00:24:15] post-hoc subgroup findings.
Uh, and the other thing I'd like to mention is just to make a plug for a paper that we have coming out soon in Statistical Methods in Medical Research, because what you were describing is exactly what you're actually interested in. You can think of it as having two [00:24:30] extremes. Say you have a clinical trial and you have 20 subgroups.
One extreme is to say the overall estimate is also my best estimate in every subgroup.
Alexander: Well, at least from a, uh, precision point of view. [00:24:45]
Kaspar: Exactly. That's exactly the point. Then you have a very precise estimate in every subgroup, but the estimate might be biased. Yeah. The other extreme is when you say, and that's what we actually typically do, we look at [00:25:00] every subgroup-specific estimate.
Then you have an unbiased estimate in every subgroup, but with potentially low precision, because these subgroups tend to be small. These are the two extremes. Now the question is: maybe we should interpolate between [00:25:15] these two extremes and find something in between. And one way to do that is to think in terms of penalization methods.
And that is the proposal we make: kind of borrow a little bit from the other subgroups. If you have 19 subgroups that [00:25:30] are completely homogeneous and you have just one outlier, then maybe you should pull that outlier toward the others, if it is indeed an outlier. Yeah. But if you have things kind of all over the place, maybe you don't really have a homogeneous effect. I think you are [00:25:45] right, this is then less about testing; it's more about whether we can get an accurate estimate. Sometimes maybe you want to strike a different bias-variance trade-off than looking at the one extreme of using the overall estimate in every subgroup or [00:26:00] just looking at subgroup-specific estimates. Maybe something in between is more accurate. And actually, that's what we show in the paper: if you look at estimation efficiency, so the mean squared error, these [00:26:15] interpolations can actually dramatically reduce the mean squared error, both compared to the overall estimate and to subgroup-specific estimates.
And, uh, I think that's where, yeah, hopefully at some point we should get to.
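The following toy simulation is not the method from the paper Kaspar mentions; it is just a sketch of the underlying bias-variance idea, shrinking each subgroup estimate toward the overall estimate with an arbitrary fixed weight and comparing mean squared errors. All names and numbers below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
n_subgroups, n_per_subgroup, n_sims = 20, 30, 2000
true_effects = np.full(n_subgroups, 0.5)
true_effects[0] = 1.0                      # one outlying subgroup
lam = 0.5                                  # hypothetical fixed shrinkage weight

mse = {"overall in every subgroup": 0.0,
       "subgroup-specific": 0.0,
       "interpolated (shrunken)": 0.0}

for _ in range(n_sims):
    data = rng.normal(true_effects[:, None], 1.0,
                      size=(n_subgroups, n_per_subgroup))
    sub_est = data.mean(axis=1)            # subgroup-specific estimates
    overall = data.mean()                  # overall estimate used everywhere
    shrunk = lam * overall + (1 - lam) * sub_est
    mse["overall in every subgroup"] += np.mean((overall - true_effects) ** 2)
    mse["subgroup-specific"] += np.mean((sub_est - true_effects) ** 2)
    mse["interpolated (shrunken)"] += np.mean((shrunk - true_effects) ** 2)

for name, total in mse.items():
    print(f"{name:26s} MSE: {total / n_sims:.4f}")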
Alexander: Yeah, I think that's a perfect [00:26:30] illustration. Now let's get to some kind of wordings that you could use in the discussion of such papers. I think for subgroup analyses we should shy away from talking about "we have proven" or [00:26:45] "we have shown". I would more go towards sentences and words like the subgroups "suggest" or "indicate", or something like that.
Rather than just kind of very, very strong words. [00:27:00] What is your take on that?
Kaspar: We could even go further and just report what we have. I mean, I know you, you run a trial or you do some analysis, you do some study, and then of course you have to discuss and qualify things. So I appreciate, I can understand the need [00:27:15] to do that.
There should at least be a very comprehensive, factual description of what you have found, and then the discussion, of course: how should you qualify that? But yeah, there is a big push to exaggerate the findings, and it's [00:27:30] interesting. I mean, we typically call a p-value below alpha statistically significant, which in many journals and in science at large may open the door to get something published, because this carries a lot of [00:27:45] meaning when you have one pre-specified, pre-planned hypothesis test for which you run the study. But very often this p-value smaller than 5% is not [00:28:00] used in such a framework. And it's not without irony that in this Pocock paper they call a p-value between one and 5% some evidence. So they are not terribly excited about a p-value between one and 5%.
Yeah. The whole problem comes from the confusion with statistical significance. [00:28:15] And then there is one aspect we didn't even discuss, which people often bring up as well, and which I think adds to the confusion: in English, significant can mean statistically significant and potentially also clinically significant. And of course, we all know these have [00:28:30] nothing to do with each other a priori. So a statistically significant effect need not at all be clinically meaningful. Of course, in the design of a clinical trial, and that's what I work on a lot with teams, you should design a clinical trial [00:28:45] such that a statistically significant result is clinically meaningful.
But there is a lot of confusion around this as well, because the effect we power at say you have a hazard ratio. Yeah,
Alexander: it’s yet another thing. Yeah. That’s
Kaspar: yet another thing where, even in the minds of many statisticians, [00:29:00] these concepts are not so clear in my experience. So when you have 80% power to detect a hazard ratio of 0.75, if you get a p-value of exactly 5%, you will not see an effect of 0.75. You will see something like [00:29:15] 0.82. This is the quantity you need to discuss with the clinical team. I don't care so much about the 0.75, whether that's clinically meaningful. I care whether 0.82 is clinically meaningful, because if you end up with a trial that has a p-value of 5% and a point [00:29:30] estimate of 0.82, and then everybody just shrugs their shoulders and says, well, we thought we'd get 0.75, we will not change any practice because of something like 0.82, then you have, yeah, a statistically significant but clinically irrelevant result, [00:29:45] and you want to avoid that. So that's yet another complication: people imply clinical relevance from statistical significance, which of course, yeah, is a separate thing.
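A short sketch of the 0.75-versus-0.82 arithmetic, assuming the standard Schoenfeld approximation for a 1:1 randomized time-to-event trial; this is the generic textbook calculation, not a specific trial design.

```python
import math
from scipy import stats

alpha, power, hr_planned = 0.05, 0.80, 0.75
z_a = stats.norm.ppf(1 - alpha / 2)        # about 1.96
z_b = stats.norm.ppf(power)                # about 0.84

# Schoenfeld: required events d = 4 * (z_a + z_b)^2 / log(HR)^2
events = 4 * (z_a + z_b) ** 2 / math.log(hr_planned) ** 2

# If the p-value lands exactly on 5%, the observed log-HR sits z_a standard
# errors from zero, with standard error roughly 2 / sqrt(events).
hr_at_p_equal_5pct = math.exp(-z_a * 2 / math.sqrt(events))

print(f"required events for 80% power at HR 0.75: {events:.0f}")       # ~380
print(f"observed HR when p is exactly 5%: {hr_at_p_equal_5pct:.2f}")   # ~0.82
```

In other words, the effect size the trial is powered for is noticeably larger than the smallest effect that will still come out statistically significant, and it is the latter that needs to be clinically meaningful.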
Alexander: I think talking about clinical relevance is yet another completely different [00:30:00] topic. Thanks so much. I think that was super helpful talking about these things and giving some kind of clarity to this. Kaspar and I have, uh, agreed that we will record more of such episodes that go [00:30:15] a little bit into the kind of day-to-day challenges
that we all face, because I know from many discussions that there's debate about it. There's, you know, the fact that it's not really taught at universities, or there's not [00:30:30] enough time. Also, getting that knowledge within the usual kind of day-to-day work is not that easy. So thanks a lot, Kaspar, for this awesome discussion about p-values, hypothesis testing, [00:30:45] estimation, and all the other things.
Is there any final sentence that you want to have the listener go away with?
Kaspar: Thanks, Alexander, and I very much look forward to the next episodes. Maybe one thing I'll mention, and, uh, I think I'm sometimes [00:31:00] also critical of my own profession, but I encourage everybody, or every statistician: before you go back to clinicians or stakeholders and criticize them for not using the concepts appropriately, make sure you do your homework and understand the concepts yourself, [00:31:15] because that will enable you to have a better conversation, of course, with your stakeholders. And then make sure, or at least attempt, to have these concepts used, uh, properly.
Alexander: Yeah. And make sure that all the different statisticians in your [00:31:30] organization speak with one voice. That's another thing, as that will help with credibility, so that you don't have a clinician saying, yeah, but Fred, the other statistician, he was okay with that. Okay. Otherwise you get multiple [00:31:45] testing by choosing the right statistician. That's definitely something that we wanna avoid. Thanks a lot, Kaspar. Have a great time.
This show was created in [00:32:00] association with PSI. Thanks to Reine and her team at VVS working in the background, and thank you for listening. Reach your potential, lead great science, and serve patients. Just be an effective [00:32:15] statistician.