Sunday, March 16, 2014

If at first you don't succeed...

If only post hoc analyses always brought out the inner skeptic in us all! Or came with red flashing lights instead of just a little token "caution" sentence buried somewhere. 

Post hoc analysis is when researchers go looking for patterns in data. (Post hoc is Latin for "after this.") Testing for statistically significant associations is not by itself a way to sort out the true from the false. (More about that here.) Still, many treat it as though it is - especially when they haven't been able to find a "significant" association, and turn to the bathwater to look for unexpected babies.

Even when researchers know the scientific rules and limitations, funny things happen along the way to a final research report. It's the problem of researchers' degrees of freedom: there's a lot of opportunity for picking and choosing, and changing horses mid-race. Researchers can succumb to the temptation of over-interpreting the value of what they're analyzing, with "convincing self-justification." (See the moving goalposts over time here, for example, as trialists are faced with results that didn't quite match their original expectations.)

And even if the researchers don't read too much into their own data, someone else will. That interpretation can quickly turn a statistical artifact into a "fact" for many people.

Let's look more closely at Significus' pet hate: post hoc analyses. There are dangers inherent in multiple testing when you don't have solid reasons for looking for a specific association. The more often you randomly dip into data without a well-founded target, the higher your chances of pulling out a result that will later prove to be a dud.

It's a little like fishing in a pond where there are random old shoes among the fish. The more often you throw your fishing line into the water, the greater your chances of snagging a shoe instead of a fish.

Here's a study designed to show this risk. The data tossed up significant associations such as: women were more likely to have a cesarean section if they preferred butter over margarine, or blue over black ink.
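The fishing-pond odds can be sketched in a few lines of code. This is a toy simulation, not data from any real study: under a true null hypothesis every "significant" result at the usual 0.05 threshold is a shoe, and the chance of snagging at least one grows quickly with the number of dips.

```python
import random

random.seed(1)

ALPHA = 0.05

def chance_of_false_positive(n_tests, alpha=ALPHA):
    # Probability that at least one of n independent null tests
    # comes out "significant" by chance alone.
    return 1 - (1 - alpha) ** n_tests

def simulate(n_tests, n_studies=10_000, alpha=ALPHA):
    # Under the null hypothesis, p-values are uniform on [0, 1];
    # count studies where at least one dip snags a "significant" p.
    hits = sum(
        any(random.random() < alpha for _ in range(n_tests))
        for _ in range(n_studies)
    )
    return hits / n_studies

for k in (1, 10, 20):
    print(k, round(chance_of_false_positive(k), 3), round(simulate(k), 3))
```

With 20 unplanned dips into the data, the chance of at least one spurious "significant" association is around 64% - worse odds than a coin toss.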

The problem is huge in areas where there's a lot of data to fish around in. For published genome-wide association studies, for example, over 90% of the "associations" with a disease couldn't consistently be found again. Often, researchers don't report how many tests were run before they found their "significant" results, which makes it impossible for others to know how big a problem multiple testing might be in their work.

The problem extends to subgroup analyses where there is not an established foundation for an association. The credibility of claims made on subgroups in trials is low. And it has serious consequences. For example, an early trial suggested only men with stroke-like symptoms benefit from aspirin - which stopped many doctors from prescribing aspirin to women.

How should you interpret post hoc and subgroup analyses then? If analyses were not pre-specified and based on established, plausible reasons for an association, then one study isn't enough to be sure.

With subgroups that weren't randomized as different arms of a trial, it's not enough that the average for one subgroup is higher than the average for another. Factors other than membership of that subgroup could be influencing the outcome. An interaction test is done to try to account for that.
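As a rough sketch of what an interaction test asks, here's a simple z-test for whether two subgroup effect estimates differ by more than chance would explain. The effect sizes and standard errors are made-up numbers, and real analyses use more careful methods:

```python
import math

def interaction_z(effect_a, se_a, effect_b, se_b):
    """Z-test for whether treatment effects differ between two
    subgroups, assuming the two estimates are independent and
    approximately normal - a minimal sketch of an interaction test."""
    diff = effect_a - effect_b
    se_diff = math.sqrt(se_a**2 + se_b**2)
    z = diff / se_diff
    # two-sided p-value from the normal distribution
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

# Illustrative numbers: a risk difference of 0.10 in one subgroup
# and 0.02 in the other, each with a standard error of 0.05.
z, p = interaction_z(0.10, 0.05, 0.02, 0.05)
print(round(z, 2), round(p, 3))
```

One subgroup looks like it benefits and the other doesn't, but the interaction p-value is around 0.26: the apparent difference between the subgroups is easily explained by chance.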

It's more complicated when it's a meta-analysis, because there are so many differences between one study and another. The exception here is an individual patient data meta-analysis, which can study differences between patients directly.

In the end, it comes down to being careful not to see a new hypothesis generated by research as a "fact" already proven by the study from which it came.

Post hoc, ergo propter hoc. This description of basic faulty logic - "after this, therefore because of this" - is as ancient as the language that made it famous. We've had millennia to snap out of the dangerous mental shortcut of seeing a cause where there's only coincidence. Yet we still hurtle like lemmings over cliffs into its alluring clutches.

Sunday, December 29, 2013

What's so good about "early," anyway?

"Early." It's one of those words like "new" and "fast," isn't it? As though they are inherently good, and their opposites - "late," "old" and "slow" - are somehow bad.

Believing in the value and virtue of being an early bird has deep roots in our cultural consciousness. It goes back at least as far as ancient Athens. Aristotle's treatise on household economics said that early rising was both virtuous and beneficial: "It is likewise well to rise before daybreak; for this contributes to health, wealth and wisdom."

But just as Gertrud came to suspect the benefits for her of being early weren't all they were cracked up to be, earliness isn't always better in other areas either. The "get in early!" assumption has an in-built tendency to lead us astray when it comes to detection of diseases and conditions. And even most physicians - just the people we often rely on to inform us - don't understand enough about the pitfalls that lead us to jump to conclusions about early detection too, well…early.

Pitfall number 1: Those who need it least get the most early detection

This one is a double-edged sword. Firstly, whether it's a screening program or research studying early detection, there tends to be a "worried well" or "healthy volunteer" effect (selection bias). It's easy to have higher than average rates of good health outcomes in people who are at low risk of bad ones anyway. This can lead to inflated perceptions of how much benefit is possible.

The other problem is an over-supply of fatalism among many people who may be able to materially benefit from early detection. Constant bombardment about all the things they could possibly be worrying about might even make it more likely that they shut out vital information - which could make it even more likely that they ignore symptoms, for example.

Pitfall number 2: Over-diagnosis from detecting people who would never have become ill from the condition detected

This one is called length bias. For many conditions, like cancers, there are dangerous ones that develop too quickly for a screening program to catch them. Early detection is actually better at picking up the ones that may never threaten a person's health. More people die with cancer than of it.
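Length bias can be made concrete with a toy simulation (the numbers are invented, not from any real cancer). Each case has a preclinical "window" during which a screen could catch it; fast, dangerous cases have short windows, slow and often harmless ones have long windows, so screening preferentially scoops up the slow ones:

```python
import random

random.seed(7)

# Hypothetical model: half of cases are fast and dangerous
# (detectable window under 1 year), half are slow and often
# harmless (window of 5-15 years).
cases = []
for _ in range(10_000):
    if random.random() < 0.5:
        cases.append(("fast", random.uniform(0, 1)))
    else:
        cases.append(("slow", random.uniform(5, 15)))

# A screening round at a random time catches a case with
# probability proportional to its detectable window.
MAX_WINDOW = 15
caught = [kind for kind, window in cases
          if random.random() < window / MAX_WINDOW]

print("slow share in population:",
      round(sum(k == "slow" for k, _ in cases) / len(cases), 2))
print("slow share among screen-detected:",
      round(caught.count("slow") / len(caught), 2))
```

Although slow cases are only half the population, they end up being around 95% of what screening detects - the screen-detected group is not a fair sample of the disease.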

So early detection means many people are fighting heroic battles that were never necessary. And some will actually be harmed by parts of screening processes that carry serious risks of their own (like colonoscopies), or by adverse effects of treatments they got but didn't need.

Add to those the number of people who are diagnosed as being "at risk" of conditions they will never have or which would have resolved without treatment, and the number harmed is depressingly huge.

This massive swelling of the numbers of people who have survived phantoms is spreading the shadow of angst ever wider (a subject I've written about in relation to cancer at Scientific American). Spend 10 minutes or so listening to Iona Heath on this subject - starting just past 2 minutes on this video. [And read @Deevybee's important comment and links about developmental conditions in early childhood below.]

Pitfall number 3: The statistical effect that means survival rates "improve" even if no one's life expectancy increases

This is lead-time bias. And it's why you should always be careful when you see survival rates in connection with early detection and treatment. Screening programs, by definition, are for people who have no symptoms (pre-clinical). So they cut short the part of your life where you don't know you have the disease. Even if the earlier diagnosis made no difference to the length of your life, the amount of time you lived with knowledge of the disease (disease "survival") is longer.

What we want is to move the needle on length and/or quality of life. For that to happen, there has to be safe and effective treatment, safe and effective screening procedures, and more people found at a time they can be helped than would have been found by diagnosing the condition once there were symptoms.

Here's an example. This person's disease began when they were 40 years old. They lived without any problem from it until 76 years old - then they died when they were 80. Their disease survival was less than 5 years. The proportion of their life that they "had" the disease was short.

Now here's the same person, with early detection that made no difference to when they died - but the needle on how long they have "had" the disease has shifted. So they now survive longer than 5 years with the disease. The "lead time" has changed, but survival in the way we mean it hasn't changed at all.
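The arithmetic of that example fits in a few lines. The age of 70 for the hypothetical screen detection is my assumption; the rest of the numbers come from the example above:

```python
def disease_survival(age_at_diagnosis, age_at_death):
    """'Survival' as usually reported: years lived with a diagnosis."""
    return age_at_death - age_at_diagnosis

# Same person, same biology, same death at 80.
symptomatic = disease_survival(76, 80)  # diagnosed when symptoms appeared
screened = disease_survival(70, 80)     # hypothetical earlier detection at 70

print(symptomatic, screened)  # 4 vs 10 "survival" years
print("counted as a 5-year survivor?",
      symptomatic >= 5, screened >= 5)
# Life expectancy is identical; only the lead time moved.
```

The 5-year survival statistic "improves" from a failure to a success, yet not one extra day of life was gained.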

Randomized trials are needed to establish that in fact early detection and intervention programs do more good than harm - some do, some don't.

More Statistically Funny on screening - "You have the right to remain anxious" and on over-diagnosis: here and here.

Here's a fact sheet about what you need to know about screening tests. And here's a little more technical primer of the 3 biases explained here.

Sunday, November 17, 2013

Does it work? Beware of the too-simple answer

Leonard is so lucky! He's just asked a very complicated question and he's not getting an over-confident and misleading answer. Granted, he was likely hoping for an easier one! But let's dive into it.

"Does": that auxiliary verb packs a punch. How do we know whether something does or doesn't work?   It would be great if that were simple, but unfortunately it's not.

I talk a lot here at Statistically Funny about the need for trials and systematic reviews of them to help us find the answers to these questions. But whether we're talking about trials or other forms of research, statistical techniques are needed to help make sense of what emerges from a study.

Too often, this aspect of research is going to lead us down a garden path. It's common for people to take the approach of relying only, or largely, on a statistical significance test of the null hypothesis: the assumption that there is no difference. So if a result is within the range that could occur by chance alone, the assumption of the null hypothesis stands. But if it's not within that range, it's "statistically significant."

However, a statistically significant result - especially from a single study - is often misunderstood and contributes to over-confidence about what we know. It's not a magic wand that finds out the truth. I wrote in some detail about testing for statistical significance this week over at my Scientific American blog, Absolutely Maybe. Leonard's statistician is a Bayesian: you can find out some more about that, too, in my post.

As chance would have it, there was also a lot of discussion this week in response to a paper published while I was writing that post. It called for a tightening of the threshold for significance, which isn't really the answer either. Thomas Lumley puts that into great perspective over at his wonderful blog, Biased and Inefficient: a very valuable read.

"It": now this part should be easy, right? Actually, this can be particularly tricky. The treatment you could be using may not be very much like the one that was studied. Even if it's a prescription drug, the dose or regimen you're facing might not be the same as the one used in studies. Or it might be used in conjunction with another intervention that could affect how it works.

Then there's the question of whether "it" is even what it says it is. Unlike prescription drugs, the contents of herbal remedies and dietary supplements aren't closely regulated to ensure that what it says on the label is what's inside. That was also recently in the news, and covered in detail here by Emily Willingham.

If it's a non-drug intervention, it's actually highly likely that the articles and other reports of the research never make clear exactly what "it" is. Paul Glasziou had a brainwave about this: he's started HANDI, the Handbook of Non-Drug Interventions. When a systematic review shows that something works, the HANDI team wants to dig out all the details and make sure we all know exactly what "it" is.

For example, if you heard that drinking water before meals can help you lose weight, and you want to try it, HANDI helpfully points out what that actually means is drinking half a liter of water before every meal AND having a low-calorie diet. HANDI is new, so there aren't many "it"s explained. But you can see them here.

"Work": this one really needs to get specific. As I point out in the slides from a talk I gave this month, you really need to be thinking about each possible outcome separately - and thinking about the possible adverse effects too. There can be complicated trade-offs between effects, and the quality of the evidence is going to vary for each of them.

Think of it this way: if you do a survey with 150 questions in it, there are going to be more answers to some of the questions than others. For example, if you had 400 survey respondents, they might all have answered the first easy question and there could be virtually no answers to a hard question near the end. So thinking "a survey of 400 people found…" an answer to that later question is going to be seriously misleading.
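A toy tally (invented numbers) shows why per-question denominators matter so much:

```python
# Toy survey data: None means the respondent skipped the question.
responses = {
    "q1_easy": [1] * 400,                  # everyone answered
    "q150_hard": [1] * 12 + [None] * 388,  # almost no one did
}

for question, answers in responses.items():
    answered = [a for a in answers if a is not None]
    print(f"{question}: n = {len(answered)} of {len(answers)} respondents")
```

"A survey of 400 people found..." would be true for the first question and wildly misleading for the second, where only 12 people actually weighed in.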

Then there's the question of how much it works for that particular outcome. Does a sliver of a benefit count as "working"? That might be enough for the person answering your question, but it might not be enough for you - especially if there are risks, costs or inconvenience involved.

And there's who did it work for in the research? Whether or not research results apply to a person in your situation can be straightforward, but it might not be.

And how high did researchers set the bar? Did the treatment effect have to be superior to doing nothing, or doing something else - or does the information come from comparing it to something else that itself may not be all that effective? You might assume that can't possibly happen, but it does more often than you'd think. You can find out about this here at Statistically Funny, where I tackle the issue of drugs that are "no worse (more or less)."

Finally, one of the most common trip-ups of all: did they really measure the outcome, or a proxy for it? If it's a proxy for the real thing, how good is it? The use of surrogate measures or biomarkers is increasing fast: you can learn more about why this can lead to an unreliable answer here.

So while there are many who might have told Leonard, "Yes, it's been proven to work in clinical trials" in a few seconds flat, I wonder how long it would take his statistician to answer the question? There are no stupid questions, but beware of the too-simple answer.

Monday, September 16, 2013

More than one kind of self-control

If you like reading randomized trials about skin and oral health treatments - and who doesn't? - you come across a few split-face and split-mouth ones. Instead of randomizing groups of people to different interventions so that a group of people can be a control group (parallel trials), sections of a person are randomized.

It's not only done with faces and teeth. Pairs of body parts can be randomized too, like arms or legs. These studies are sometimes called "within-person" trials. This kind of randomization means that you need fewer people in the trial, because you don't have to account for all the variations between human beings.

It has to be a treatment that affects only the specific area of the body treated, though. Anything that could have an influence on the "control" part is called a spill-over effect. There are still inevitably things that happen that affect the whole person, and those have to be accounted for with this kind of trial. Body part randomization is one of several ways a person can be their own control: the n of 1 trial is another way.
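Why does being your own control mean you need fewer people? A toy simulation (invented numbers, not any particular trial) makes the point: when each person is their own control, the big person-to-person differences cancel out of the comparison.

```python
import random

random.seed(3)

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# Toy model: scores vary a lot BETWEEN people (baseline sd 10),
# measurement noise is small (sd 2), and treatment adds 1 point.
def person():
    return random.gauss(50, 10)

# Parallel-groups trial: different people in each arm, so a
# treated-minus-control comparison carries both baselines.
parallel_diffs = [
    (person() + 1 + random.gauss(0, 2)) - (person() + random.gauss(0, 2))
    for _ in range(500)
]

# Split-face trial: each person is their own control, so the
# baseline cancels out of the within-person difference.
split_diffs = []
for _ in range(500):
    base = person()
    treated = base + 1 + random.gauss(0, 2)
    control = base + random.gauss(0, 2)
    split_diffs.append(treated - control)

print(round(variance(parallel_diffs)), round(variance(split_diffs)))
```

In this toy setup the within-person comparison has roughly 25 times less variance, which is why the same statistical precision can be had from far fewer participants.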

Randomizing sections didn't start in trials with people: it began with split-plot experiments in agricultural research. The idea was developed by the pioneer statistician, Sir Ronald Aylmer Fisher, who had done breeding experiments. He explained the technique in his classic 1925 text, "Statistical Methods for Research Workers."

It's great to see that neither blackheads nor treatment effects are hampering the Twilling sisters' style! They do seem to be at risk of susceptibility to the skincare industry's hard sells, though. Those issues are the subject of a post over at my Scientific American blog, called "Blemish: The truth about blackheads."

Sunday, July 28, 2013

Alleged effects include howling

When dogs howl at night, it's not the full moon that sets them off. Dogs are communicating for all sorts of reasons. We're just not all that good at understanding what they're saying.

We make so many mistakes in attributing cause and effect, for so many reasons, that it's almost surprising we get it right as often as we do. But all those mistaken beliefs we realize we have don't seem to teach us a lesson. Pretty soon after catching ourselves out, we're at it again, taking mental shortcuts, being cognitive misers.

It's so pervasive, you would think we would know this about ourselves, at least, even if we don't understand dogs. Yet we commonly under-estimate how much bias is affecting our beliefs. That's been dubbed the bias blind spot that we (allegedly) tend to live in.

Even taking all that into account, "effect" is an astonishingly over-used word, especially in research and science communication where you would hope people would be more careful. The maxim that correlation (happening at the same time) does not necessarily mean causation has spread far and wide, becoming something of a cliche along the way.

But does that mean that people are as careful with the use of the word "effect" as they are with the use of the "cause" word? Unfortunately not.

Take this common one: "Side effects include...." Well, actually, don't be so fast to swallow that one. Sometimes, genuine adverse effects will follow that phrase. But more often, the catalogue that follows is not adverse effects, but a list of adverse events - things that happened (or were reported). Some of them may be causally related, some might not be.

You have to look carefully at claims of benefits and harms. Even researchers who aren't particularly biased will word it carelessly. You will often hear that 14% experienced nausea, say - without it being pointed out that 13% of people on placebos also experienced nausea, and the difference wasn't statistically significant. Some adverse effects are well known, and it doesn't matter (diarrhea and antibiotics, say). That's not always so, though - a complex subject I'll get to on a future Statistically Funny, so watch this space.
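The nausea example can be checked with a simple two-proportion test. This is a rough sketch with made-up arm sizes (500 per group), using a pooled-standard-error z-test, which is fine for large samples:

```python
import math

def two_proportion_p(events_a, n_a, events_b, n_b):
    """Two-sided z-test for a difference between two proportions
    (pooled standard error) - a minimal large-sample sketch."""
    p_a, p_b = events_a / n_a, events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# 14% nausea on the drug vs 13% on placebo, 500 people per arm
p = two_proportion_p(70, 500, 65, 500)
print(round(p, 2))  # well above 0.05: the 1-point gap is easily chance
```

So "14% experienced nausea" sounds alarming on its own, but against the placebo rate the difference is nowhere near statistically significant.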

If the word "effect" is over-used, the word "hypothesis" is under-used. Although generating hypotheses is a critical part of science, hypotheses aren't really marketed as what they are: ideas in need of testing. Often the language is that of attribution throughout, with a little fig-leaf of a sentence tacked on about the need for confirmatory studies. In fact, we cannot take replication and confirmation for granted at all.

Occasionally, the word "effect" is used to name what is literally a hypothesis. That happened with "the Hawthorne effect." You can read more about that in my post, The Hawthorne effect: An old scientists' tale lingering "in the gunsmoke of academic snipers." Yes, Statistically Funny has a little sister at Scientific American now: Absolutely Maybe.

Sunday, June 30, 2013

Goldilocks and the three reviews

Goldilocks is right: that review is FAR too complicated. The methods section alone is 652 pages long! Which wouldn't be too bad if it weren't also a few years out of date. It took so long to do this review, and to put it through a rigorous enough quality check, that it was already out of date the day it was released. Something that happens often enough to be rather disheartening.

When methodology for systematic reviewing gets overly rococo, the point of diminishing returns is passed. That's a worry for a few reasons. For one, it's inefficient: more reviews could be done with the same resources. Secondly, more complex methodology can be both daunting and hard for researchers to carry out consistently. Thirdly, when a review gets very elaborate, reproducing or updating it isn't going to be easy either.

It's unavoidable for some reviews to be massive and complex undertakings, though, if they're going to get to the bottom of massive and complex questions. Goldilocks is right about review number 2, as well: that one is WAY too simple. And that's a serious problem, too.

Reviewing evidence needs to be a well-conducted research exercise. A great way to find out more about what goes wrong when it's not, is reading Testing Treatments. And see more on this here at Statistically Funny, too.

You need to check the methods section of every review before you take its conclusions seriously - even when it claims to be "evidence-based" or systematic. People can take far too many shortcuts. Fortunately, it's not often that a review gets as bad as the second one Goldilocks encountered here. The authors of that review decided to include only one trial for each drug "in order to keep the tables and figures to a manageable size." Gulp!

Getting to a good answer also quite simply takes some time and thought. Making real sense of evidence and the complexities of health, illness and disability is often just not suited to a "fast food" approach. As the scientists behind the Slow Science Manifesto point out, science needs time for thinking and digesting.

To cover more ground, people are looking for reasonable ways to cut corners, though. There are many kinds of rapid review, including reliance on previous systematic reviews for new reviews. These can be, but aren't always, rigorous enough for us to be confident about their conclusions.

You can see this process at work in the set of reviews discussed at Statistically Funny a few cartoons ago. Review number 3 there is in part based on review number 2 - without re-analysis. And then review number 4 is based on review number 3.

So if one review gets it wrong, other work may be built on weak foundations. Li and Dickersin suggest this might be a clue to the perpetuation of incorrect techniques in meta-analyses: reviewers who got it wrong in their review, were citing other reviews that had gotten it wrong, too. (That statistical technique, by the way, has its own cartoon.)

Luckily for Goldilocks, the bears had found a third review. It had sound methodology you can trust. It had been totally transparent from the start - included in PROSPERO, the international prospective register for systematic reviews. Goldilocks can get at the fully open review quickly via PubMed Health, and its data are in the Systematic Review Data Repository, open to others to check and re-use. Ahhh - just right!

I'm grateful to the Wikipedians who put together the article on Goldilocks and the three bears. That article pointed me to the fascinating discussion of "the rule of three" and the hold this number has on our imaginations.

Sunday, June 23, 2013

Studies of cave paintings have shown....

The mammoth has a good point. Ogg's father is making a classic error of logic. Not having found proof that something really happens is not the same as having definitive proof that it cannot possibly happen.

Ogg's family doesn't have the benefit of Aristotle's explanation of deductive reasoning. But even two thousand years after Aristotle got started, we still often fall into this trap.

In evidence-based medicine, a part of this problem is touched on by the saying, "absence of evidence is not evidence of absence." A study says "there's no evidence" of a positive effect, and people jump to the conclusion - "it doesn't work." Baby Ogg gets thrown out with the bathwater.

The same thing is happening when there are no statistically significant serious adverse effects reported, and people infer from that, "it's safe." 

This situation is the opposite of the problem of reading too much into a finding of statistical significance (covered here in Statistically Funny). Only in this case, people are over-interpreting non-significance. Maybe the researchers simply didn't study enough of the right people, or they weren't looking at the outcomes that later turn out to be critical.

Researchers themselves can over-interpret negative results. Or they might phrase their conclusions carelessly. Even if they avoid the language pitfalls here, journalists could miss the nuance (or think the researchers are just being wishy-washy) and spread the wrong message. And even if everyone else phrased it carefully, the reader might jump to that conclusion anyway.

When researchers say "there is no evidence that...", they generally mean they didn't find any, or enough of, a particular type of evidence that they would find convincing. Obviously, no one can ever be sure they have even seen all the evidence. And it doesn't mean everyone would agree with their conclusion, either. To be reasonably sure of a negative, you might need quite a lot of evidence.

On the other hand, when quite a lot of existing knowledge already makes something extremely unlikely to be real - that there's a community of giant blue swans with orange and pink polka dots on the Nile, say - even a small study exploring that hypothesis can give you reasonable confidence.
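The role of prior plausibility can be sketched with Bayes' rule. This toy calculation (the power and significance threshold are assumed values) estimates the chance that a "significant" finding reflects a real effect, for a well-founded hypothesis versus a polka-dotted-swan hunch:

```python
def prob_finding_is_real(prior, power=0.8, alpha=0.05):
    """Bayes' rule sketch: the chance a 'significant' result reflects
    a real effect, given the prior plausibility of the hypothesis.
    Assumes 80% power and a 0.05 significance threshold."""
    true_hits = power * prior          # real effects correctly detected
    false_hits = alpha * (1 - prior)   # false positives from null effects
    return true_hits / (true_hits + false_hits)

# A well-founded hypothesis vs a wildly implausible one
for prior in (0.5, 0.01):
    print(prior, round(prob_finding_is_real(prior), 2))
```

With a 50-50 prior, a significant result is about 94% likely to be real; with a 1-in-100 prior, the same "significant" result is real only about 14% of the time. The statistics are identical; the background knowledge is doing the work.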

Which brings us to the other side of this coin. Proving that something doesn't exist to the satisfaction of people who perhaps need to believe it most earnestly can be quite impossible. People trying to disprove the claim that vaccination causes autism, for example, are finding that despite the Enlightenment, our rational side can be vulnerable to hijacking. Voltaire hit that nail on the head in the 18th century: "The interest I have to believe a thing is no proof that such a thing exists."

Tuesday, May 21, 2013

He said, she said, then they said...

Conflicting studies can make life tough. A good systematic review could sort it out. It might be possible for the studies to be pooled into a meta-analysis. That can show you the spread of individual study results and what they add up to, at the same time.
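The pooling step of a meta-analysis can be sketched with a fixed-effect inverse-variance calculation, the simplest of the standard approaches. The three trials here are made-up numbers purely for illustration:

```python
import math

def pooled_estimate(effects, ses):
    """Fixed-effect inverse-variance meta-analysis, minimal sketch.
    Each study's effect is weighted by 1 / SE^2, so precise studies
    count for more in the combined result."""
    weights = [1 / se**2 for se in ses]
    total = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, effects)) / total
    pooled_se = math.sqrt(1 / total)
    return pooled, pooled_se

# Three hypothetical trials of the same treatment (mean differences)
effects = [0.30, 0.10, -0.05]
ses = [0.15, 0.10, 0.20]

est, se = pooled_estimate(effects, ses)
print(f"pooled effect {est:.2f}, "
      f"95% CI {est - 1.96*se:.2f} to {est + 1.96*se:.2f}")
```

Real meta-analyses also have to weigh heterogeneity between studies (random-effects models and the like), which is exactly where different review teams can start making different judgment calls.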

But what about when systematic reviews disagree? When the "he said, she said" of conflicting studies goes meta, it can be even more confusing. New layers of disagreement get piled onto the layers from the original research. Yikes! This post is going to be tough-going...

A group of us defined this discordance among reviews as: the review authors disagree about whether or not there is an effect, or the direction of effect differs between reviews. A difference in direction of effect can mean one review gives a "thumbs up" and another a "thumbs down."

Some people are surprised that this happens. But it's inevitable. Sometimes you need to read several systematic reviews to get your head around a body of evidence. Different groups of people approach even the same question in different but equally legitimate ways. And there are lots of different judgment calls people can make along the way. Those decisions can change the results the systematic review will get.

When and how they searched for studies - and what type and subject - means that it's not at all unusual for groups of reviewers to be looking at different sets of studies for much the same question.

After all that, different groups of people can interpret evidence differently. They often make different judgments about the quality of a study or part of one - and that could dramatically affect its value and meaning to them.

It's a little like watching a game of football where there are several teams on the field at once. Some of the players are on all the teams, but some are playing for only one or two. Each team has goal posts in slightly different places - and each team isn't necessarily playing by the same rules. And there's no umpire.

Here's an example of how you can end up with controversy and people taking different positions even when there's a systematic review. The area of some disagreement in this subset of reviews is about psychological intervention after trauma to prevent post-traumatic stress disorder (PTSD) or other problems:

Published in 2002; published in 2005; published in 2005; published in 2010; published in 2012; published in 2013.

The conclusions range from saying debriefing has a large benefit to saying there is no evidence of benefit and it seems to cause some PTSD. Most of the others, but not all, fall somewhere in between, leaning to "we can't really be sure". Most are based only on randomized trials, but one has none, and one has a mixture of study types.

The authors are sometimes big independent national or international agencies. A couple of others include authors of the studies they are reviewing. The definition of trauma isn't the same - they may or may not include childbirth, for example. The interventions aren't the same.

The quality of evidence is very low. And the biggest discordance - whether or not there is evidence of harm - hinges mostly on how much weight you put on one trial.

It's about debriefing. The debriefing group is much bigger than the control group because they stopped the trial early, and while it's complicated, that can be a source of bias.

The people in the debriefing group were at quite a lot higher risk of PTSD in the first place. Data for more than 20% of the people randomized are missing - and that biases the results too (it's called attrition bias). You can't be sure whether those people failed to return because they were depressed, for example. If so, that could change the results.

It's no wonder there's still a controversy here.

If you want to read more about debriefing, here's my post in Scientific American: Dissecting the controversy about early psychological response to disasters and trauma.

Thursday, May 9, 2013

They just Google THAT?!

I admit I needed Google to quickly find out that the category for bunny-shaped clouds is "zoomorphic". And I think Google is wonderful - and so does Tess. But...

There's just been another study published about the latest generation of doctors and their information and searching habits. Like Tess' friend, they rely pretty heavily on Googling. We could all be over-estimating, though, just how good people are at finding things with Google - including the biomedically trained.

Many of us assume that the "Google generation" or "digital natives" are as good at finding information as they are at using technology. A review in 2008 came to the conclusion that this was "a dangerous myth" and those things don't go hand in hand. It may not have gotten any better since then either.

Information literacy is about knowing when you need information, and knowing how to find and evaluate it. Google leads us to information that the crowd is basically endorsing. If the crowd has poor information literacy in health, then that can reinforce the problem.

This is an added complication for health consumers. While there's an increasing expectation that healthcare system decisions and clinical decisions be based on rigorous assessments of evidence, that's not really trickling down very fast. Patient information is generally still pretty old school.

What would it mean for patient information to be really evidence-based? I believe it includes using methods to minimize bias in finding and evaluating research to base the information on, and using evidence-based communication. Those ideas are gaining ground, for example in standards in England and Germany, and this evaluation by WHO Europe of one group of us putting these concepts into practice.

Missing critical information that can shift the picture is one of the most common ways that reviews of research can get it wrong. For systematic reviews of evidence, searching for information well is a critical and complex task.

This brings us to why Tess' talents, passions and chosen career are so important. We need health information specialists and librarians to link us with good information in many ways.

This week at the excellent annual meeting of the Medical Library Association in Boston (think lots of wonderful Tess'es!), there was a poster by Whitney Townsend and her colleagues at the Taubman Health Sciences Library (University of Michigan). Their assessment of 368 systematic reviews suggests that even systematic reviewers need help searching.

Google's great, but it doesn't mean we don't still need to "go to the library."

(Disclosure: I work in a library these days - the world's largest medical one, at the National Institutes of Health (NIH). If this has put you in the mood for honing your searching skills, there are some tips for searching PubMed Health here.)

Tuesday, April 23, 2013

Women and children overboard

It's the Catch-22 of clinical trials: to protect pregnant women and children from the risks of untested drugs....we don't test drugs adequately for them.

In the last few decades, in fact, we've been more concerned about the harms of research than about the harms of inadequately tested treatments - for everyone. But for "vulnerable populations," like pregnant women and children, the default was to exclude them.

And just in case any women might be, or might become, pregnant, it was often easier just to exclude us all from trials.

It got so bad that, by the late 1990s, the FDA realized regulations - and more - had to change for pregnant women, and for women generally. The NIH (National Institutes of Health) took action too. And so few drugs had enough safety and efficacy information for children that, even in official circles, children were being called "therapeutic orphans." Action began on that, too.

There is still a long way to go. But this month there was a sign that maybe times really are changing. The FDA approved Diclegis for nausea and vomiting in pregnancy. It's a new formulation of the key ingredients of Bendectin, the only other drug ever approved for that purpose in the USA. Nothing else has been shown to work.

Thirty years ago, the manufacturer withdrew Bendectin from the market because it was too expensive to keep defending it in the courts. It's a gripping story, involving the media, activists, junk science and some fraud. It had a major influence on clinical research, public opinion and more. You can read more about it in my guest blog at Scientific American, Catch-22, clinical trial edition: the double bind for women and children.

In dozens of court cases over Bendectin, judges and juries struggled with competing testimony about scientific evidence. In one hearing, a judge offered the unusual option of a "blue ribbon jury" or a "blue, blue ribbon jury": selecting only people who would be qualified to understand the complex testimony and issues of causation. The plaintiffs refused.

Ultimately, in one of the Bendectin cases, Daubert versus Merrell Dow Pharmaceuticals, the Supreme Court re-defined the rules around scientific evidence for US courts. The previous Frye Rule called for consensus. The 1972 Federal Rules of Evidence said "all relevant evidence is admissible."

The new Daubert standard determined that evidence must be "reliable" - grounded in "the methods and procedures of science" - not just relevant.

We still need everyone involved to better understand what reliable scientific evidence on clinical effects really means, though. You can read more about that here at Statistically Funny.