

Year : 2015  Volume : 18  Issue : 1  Page : 74-82

Statistics in clinical research: Important considerations

Howard Barkan
Affiliated Researcher and Consulting Statistician, School of Public Health, University of California Berkeley, Berkeley, CA 94704.7380, Saybrook University, Oakland, CA 94612, USA

Statistical analysis is one of the foundations of evidence-based clinical practice, a key in conducting new clinical research and in evaluating and applying prior research. In this paper, we review the choice of statistical procedures, analyses of the associations among variables and techniques used when the clinical processes being examined are still in process. We discuss methods for building predictive models in clinical situations, and ways to assess the stability of these models and other quantitative conclusions. Techniques for comparing independent events are distinguished from those used with events in a causal chain or otherwise linked. Attention then turns to study design, to the determination of the sample size needed to make a given comparison, and to statistically negative studies.

Introduction

Clinicians examine and intervene with individual patients. Their understanding of the clinical challenges they will need to address and of the likely past and future courses of the clinical conditions they are seeing, and their evaluations of the effectiveness and risks of their clinical actions and strategies, are all based on consideration of the characteristics and histories of clients similar to the one they are now seeing and with whom they may be about to intervene. Statistics is a key tool linking the multiplicity of potential observations of every client with the more abstract concepts of clinical entities, natural histories, clinical response and risks. These more abstract constructs are the foundation on which clinical decisions rest. On a more applied level, clinicians need to understand statistics well enough to follow and evaluate the empirical studies that provide an evidence base for clinical practices. Studies conducted decades ago found major lacunae in physicians' knowledge of statistics. [1],[2],[3] More recent studies have found this problem to be only somewhat reduced in magnitude. [4],[5],[6] It leads clinicians to mistrust, misunderstand and ignore the statistics in journal articles. [7] Several aspects of statistical concepts, methods and their application are key to their understanding and interpretation. These have been presented for practitioners in major clinical journals by excellent clinicians and statisticians (for initial papers in such series, cf. e.g. [8],[9],[10],[11],[12],[13],[14],[15],[16],[17] ). We will present these concepts and methods with the goals of strengthening clinicians' comprehension of statistical aspects of the clinical literature, their evaluation of the strengths and weaknesses of the analyses presented, and their active participation in research. The presentation in this paper is rooted in experience gained from studies conducted by the author [18],[19],[20] and the clinical literature.
We hope to help make these inherently abstract statistical concepts and techniques more intelligible in the applied world of clinical practice. We will begin by discussing aspects of measurement, sampling, and analytic goal that guide the choice of statistical techniques. The discussion will then turn to aspects of analytic design and conduct, which affect important finer details of the project's conduct and of the interpretation of its results. The reader is referred to the paper series referenced above [8-17] for more detailed discussions of particular statistical techniques.

Clinical processes are real world. Statistics is abstract. Two-way translation is important

Biological and clinical entities are complex, changing, multidimensional structures and processes which evolve over time. All research begins by selecting particular features of physical objects and segments of processes, which will be used in the research to represent those structures and processes. [11] These selected observations operationalize the abstract concept of a clinical entity or process into specified measurements. Statistics works with these operationalizations, modeling and analyzing properties and processes which are shared among groups of observations. Note that the external validity of the results of statistical analyses, while key to the value of those results, is importantly a function of measurement and sampling, that is, what was measured how, in which subjects, at which times, and how well those selected measurements represent the clinical entities and processes which the research is investigating. [21] A perfectly chosen and executed analysis will be at best misleading if it is conducted on the wrong data, on data collected using an inaccurate measurement technique or at the wrong time, and so on. To quote the frequent aphorism from introductory statistics courses, "Garbage in, garbage out." We will discuss the analysis of appropriately selected and measured data.
Evaluations of the validity of the measures collected as representations, of the modeling of causal processes, and of the generalizability of the results are all important to the value of statistical analyses but beyond the scope of this paper.

Measurement scaling

Certain aspects of measurement and sampling are key to which statistical techniques are appropriate. The first attribute, which indicates the appropriateness of and hence guides the choice among statistical procedures, is the scaling of the measurements being treated as variables in the analysis. Statistics represents measurements as scales. In terms of the appropriateness of statistical techniques, the key differentiation among scaling techniques is mathematical: what each number represents and which mathematical analyses of those numbers are valid. [9],[11],[14],[23],[24] Measurements can be classified as using nominal, ordinal, and interval scalings. Nominal scalings use distinct and mutually exclusive numbers to name each category of observation. Nominal scalings only classify observations. The numbers assigned in a nominal scale carry no further information about magnitude. Set theory, which deals with which observations belong in which groups and with how groups overlap, is the only mathematics appropriate for nominal scales. Clinical examples of nominal scalings include any notation that a disease is (simply) present or absent, the binary classification used in calculating incidence and prevalence rates and the sensitivity and specificity of diagnostic tests, demographic measures (such as gender and ethnic group), and disease classification systems such as the International Classification of Disease (ICD)-10 and the Diagnostic and Statistical Manual of Mental Disorders-5. Sensitivity and specificity, key indices of the strength of diagnostic findings as evidence of a disease, both begin by treating the finding and the disease as binomial nominal variables.
Binomials are attributes that are either present or absent. [25],[26],[27] Ordinal scalings are mathematically the next more complex. Ordinal scalings place observations in order, say from least to most, but are not able to specify or compare the differences between pairs of measurements. Many clinical measurements and indices and many psychological and attitude measurements are ordinally scaled: e.g. tumor grade, pain scales, and Likert attitude scales. Disease stage is an example of an ordinal scaling. Stage 4 cancers are "worse" than stage 3 cancers, which are in turn worse than stage 2 cancers, but the ordinal scaling of staging does not indicate how much worse. It is impossible to say, based on the assigned stage alone, whether the difference between stage 4 and stage 3 is more or less than the difference between stage 3 and stage 2. That is, the statement that one stage is "worse" than another derives from the association of stage differences with other factors such as duration of survival rather than from the measurement of stage itself. Ordinal scalings add the mathematics of inequalities to set theory as permissible mathematical operations. Interval scalings are mathematically the most complex of the measurement scales used. Interval scales place observations in order and specify both the magnitude of individual measurements and the distance between pairs of measurements. Interval scalings permit all of the basic arithmetic operations and the calculations based on those operations. Many widely used clinical observations are intervally scaled: e.g. anthropometric measurements of height and weight, blood pressure, and duration of time intervals. Two other scaling considerations are frequently mentioned for interval scales. The first is whether the source measurements are discrete (e.g. number of children in the household) or continuous (e.g. blood pressure).
This distinction bears on the source measurement and may influence how collected data are displayed graphically, but has no influence on the choice or calculation of statistical analyses. The second difference among interval scales is whether or not the scale has a true "0" point. Those with a true "0" point are sometimes called ratio scales because the presence of a true "0" point makes division and hence the calculation of ratios meaningful. Consider, for example, temperature. The Kelvin scale has a true zero point at absolute zero and hence is a ratio scale. The Centigrade and Fahrenheit scales have a zero point that is mathematically arbitrary and hence are interval scales. This difference bears on which conclusions regarding these measurements are meaningful. For example, it is meaningful to say that a temperature of 30 K is half a temperature of 60 K, while it is not valid to make the same statement regarding 30°F versus 60°F. This difference has no bearing on the choice of statistical procedures to analyze these data. The mathematical type of scaling is one of the principal determinants of the appropriateness of a particular statistical analysis for a particular dataset. [28] In general, statistical analyses which can be conducted on mathematically simpler scales, say nominal scales, can also be conducted on more complex scales. For example, the mode, i.e. identification of the most frequent observation, which is the principal statistic describing central tendency for nominally scaled variables, can also be used to describe distributions of ordinally and intervally scaled variables. On the other hand, statistical analyses designed for more complex scales often cannot be applied to mathematically simpler scales. For example, calculation of an average depends on the ability to add and divide the observed measurements.
These mathematical operations are not valid with ordinal and nominal scalings, which makes calculating the average of such a scaling in a sample invalid. Note that the greater power of the analyses available for interval scalings leads to a frequent temptation to treat measurements such as tumor stage, which are appropriately scaled ordinally, as though they were scaled intervally.

Descriptive statistics and measurement scaling: Single variables

Examinations of single variables use descriptive statistics to characterize the central tendency, the single best description of the sample of measurements, and variability. Descriptive statistics for single variables play important roles in research. Descriptive statistics summarize characteristics of the study and control groups in randomized trials. [18] To evaluate the baseline comparability of an investigation's study and control groups, proportions are examined when comparing nominally scaled variables such as gender. The median can be examined when comparing ordinally scaled variables such as urgency, while averages can be examined when comparing intervally scaled characteristics: e.g. group members' age, serum albumin, platelet count and other key hematologic indices. Descriptive statistics are also at the core of clinically relevant indices of prevalence and incidence, and of the evaluation of the sensitivity and specificity of diagnostic findings as evidence of particular conditions. In most general terms, a form of descriptive statistical analysis which is valid for simpler mathematical scalings can be used with mathematically more complex scalings. For example, the category containing the highest proportion of a nominal variable is termed the mode.
The mode is a valid analysis of nominally scaled variables. We can count the number of patients assigned each ICD-9 coded diagnosis. We can then compare these counts to evaluate which diagnosis was most frequent, that is, the mode. The mode can also be used to describe variables that are ordinally scaled (e.g. which stage of lung cancer is most frequent) and intervally scaled (e.g. what number of children per family is most frequent). In contrast, statistics designed specifically for more complex scalings may be invalid for measurements using mathematically simpler scalings. For example, it is valid to calculate the mean and standard deviation of the number of distant metastases per patient because our count of the number of distant metastases is intervally scaled: the difference between 0 and 1 distant metastases equals the difference between 3 and 4 distant metastases, which equals one. [22] In contrast, we cannot calculate the average lung cancer stage because we cannot add or divide stage measurements: Is it at all meaningful to say that stage 2 lung cancer is twice stage 1 lung cancer? The situation is even more clouded with nominally scaled variables. The numbers used as codes in the ICD-9 carry no direct implication of magnitude. It is not meaningful to say that the diagnosis of reticulosarcoma is twice the diagnosis of leptospirosis icterohemorrhagica merely because reticulosarcoma's ICD-9 code of 200.0 is twice the leptospirosis ICD-9 code of 100.0. [23],[29]

Descriptive statistics and measurement scaling: Multiple variables

Let us now turn our attention to the associations among variables, first paying attention to how we describe that association. The strength of the association between two variables is described by correlation coefficients. [30],[31] Correlation coefficients describe the strength of association between two variables of the same mathematical type.
Correlation coefficients typically range from "0" indicating no association to "−1" and "1", indicating perfect association. The square of the correlation coefficient can be interpreted as the proportion of the variance of one variable that is predicted by the other variable. The square of "1" equals the square of "−1" equals "1," indicating perfect association. The most frequently used correlation coefficients are phi and Cramer's V for nominal variables, Spearman's rho (or rank-order) correlation for ordinal variables, and Pearson's r (or product-moment) correlation for interval variables. Kappa is also often used for binomial nominal variables. Binomial variables are nominal variables with only two values: e.g. gender and the presence versus absence of a characteristic or disease. Kappa adjusts in its calculation for the agreement expected by chance alone. [32] This has made kappa a useful index in investigations of interobserver agreement among radiologists and other clinicians (there has been some argument about this interpretation of kappa, cf. [33] ). Note that agreement does not imply accuracy. Accuracy, assessed for binary classifications by sensitivity, specificity, and receiver operating characteristic curves, will not be discussed further in this paper. [25],[26],[27],[34],[35] For all but nominal variables, the sign of the correlation coefficient indicates the direction of the association. Positive correlation coefficients describe situations in which increases in value of one of the variables are associated with increases in the other variable, while negative coefficients describe situations in which increases in one of the variables are associated with decreases in the other. Correlation-based analyses using techniques such as factor analysis can be used to examine the associations among multiple measures used to investigate single events or conditions.
[36],[37] This technique can identify groupings and key measures, potentially reducing the length and increasing the efficiency of diagnostic evaluations. [38],[39] Patterns found in factor analysis can be helpful in exploring biological interactions and can indicate particular groupings which may have clinical implications. [40],[41],[42]

Measurement timing

The discussion so far has carried the implicit assumption that we are able to measure the entire course of the events we are studying. That may be true for many of the acute clinical events and processes in which cardiac anesthesiology plays a major role. However, this is clearly true neither for all long-term processes in cardiac anesthesia nor for those iatrogenic effects whose appearance is delayed, nor for cardiology, nor for clinical processes generally. Clinical and research data are often gathered within a limited time frame, while the processes receiving clinical attention and those being studied continue beyond that time frame's boundaries. The techniques of survival analysis and life-table statistics have been developed to address the challenges presented by what is termed "right censoring." [43],[44],[45],[46] Right censoring exists when a study is investigating a process that has reached a conclusion in some, but not all, of the subjects when the study ends, hence censoring information about that outcome. In situations such as this, the sample size of those at risk for a study's terminal event varies over the course of the study because that size is reduced by "1" every time one of the study's terminal events (say tumor recurrence or mortality) occurs, removing the person experiencing the event from the group at risk for it. Life-table analyses typically examine median time to the target event to avoid being biased by the long times to event of those in the sample who have not experienced the event by the time the study concludes and whose experience is right-censored.
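The shrinking at-risk group described above can be sketched as a product-limit (Kaplan-Meier) calculation. This is a minimal illustration under stated assumptions, not the method of any cited study: the function names `kaplan_meier` and `median_survival` and the follow-up data are hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times  -- follow-up time for each subject
    events -- True if the terminal event was observed at that time,
              False if the subject was right-censored at that time
    Returns a list of (time, estimated survival probability) pairs,
    one per distinct time at which at least one event occurred.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        n_events = sum(1 for ti, e in zip(times, events) if ti == t and e)
        n_leaving = sum(1 for ti in times if ti == t)
        if n_events:
            # Survival drops only when events occur among those still at risk.
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        # Both events and censored subjects leave the at-risk group.
        at_risk -= n_leaving
    return curve


def median_survival(curve):
    """First time at which estimated survival is at or below 50%, else None."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Hypothetical follow-up data (months); False marks right-censored subjects.
times = [2, 3, 4, 5, 8, 10, 12, 12]
events = [True, True, False, True, True, False, False, False]

curve = kaplan_meier(times, events)
observed = [t for t, e in zip(times, events) if e]
naive_mean = sum(observed) / len(observed)
print(curve)
print("median survival:", median_survival(curve))
print("naive mean of observed event times:", naive_mean)
```

With these illustrative data the product-limit median is 8 months, while averaging only the four observed event times gives 4.5 months, because the censored subjects with the longest follow-up are ignored by the naive average.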
Life-table experience is typically depicted using Kaplan-Meier survival curves, where "survival time" is taken to signify time to the designated final event of the process (e.g. reinfection, tumor recurrence or mortality). Appropriate evaluation of statistical significance also uses techniques, discussed below, which take this right censorship into account. It is important that studies whose samples are right-censored use such life-table based techniques. Studies in that situation that calculate survival time by averaging time to the terminal events which have occurred will produce biased estimates unless all of those terminal events have occurred, because right censorship will be excluding those with the potentially longest survival times.

Modeling associations and prediction

Correlations measure the strength and, for all types except nominal variables, the direction of associations between variables. Regression modeling provides the tools for making predictions from one or more independent variables to the dependent variable. [30],[31],[47],[48],[49],[50] The measurement and the completeness of the measurement of the dependent variable indicate which form of regression modeling is appropriate. If the dependent variable is a binomial, that is, a nominal variable with only two values, and it is known whether or not each member of the sample experienced that outcome, multiple logistic regression is used to model the effects of the independent variables on the odds of experiencing that outcome, yielding an odds ratio for each independent variable. [51],[52] When the outcome condition is relatively rare, and subject to some other constraints, these odds ratios can be treated as estimates of the relative risk each independent variable carries for the outcome. This model is appropriate for outcomes in, say, a study of surgical intervention in which the outcome of interest is short term and can be predicted to have occurred before discharge from hospital.
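The remark that odds ratios approximate relative risks only when the outcome is rare can be checked with simple 2x2 arithmetic. The sketch below assumes a single binary exposure; the function names and the counts are hypothetical and chosen only for illustration.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio from a 2x2 table.

    a -- exposed, outcome present      b -- exposed, outcome absent
    c -- unexposed, outcome present    d -- unexposed, outcome absent
    """
    return (a * d) / (b * c)


def relative_risk(a, b, c, d):
    """Relative risk (risk ratio) from the same 2x2 table."""
    return (a / (a + b)) / (c / (c + d))


# Hypothetical rare outcome: risks of 1% versus 0.5%.
rare = (10, 990, 5, 995)
# Hypothetical common outcome: risks of 40% versus 20%.
common = (400, 600, 200, 800)

for label, table in (("rare", rare), ("common", common)):
    print(label, "RR =", round(relative_risk(*table), 3),
          "OR =", round(odds_ratio(*table), 3))
```

In both hypothetical tables the relative risk is 2, but the odds ratio stays close to 2 only in the rare-outcome table; with the common outcome it is noticeably larger.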
In contrast, Cox proportional hazards regression is used when the outcome data are right censored, that is, when the outcome status of all subjects is not known (often because insufficient time has passed for the outcome to have occurred in all subjects in whom it may eventually occur). This is likely to be the case, for example, if the study is investigating delayed effects after therapeutic interventions such as postsurgical survival in cancer patients. Cox regression models the risk of the target outcome as a hazard function, which is a function of time and of the independent variables included in the model. The final principal form of regression modeling is (multiple) linear regression, which predicts a dependent variable measured on an interval scale based on the values of one or more predictors. For example, linear regression can be used to model the association of the natural log of urea with age [30] (taking the natural log of urea made the relationship of urea with age a straight line). Linear regressions predict straight lines (or planes or their multidimensional analogs). There are constraints on the type of distribution and on the associations among variables suitable for linear regression analysis. [30] Many clinical variables have exponential or other nonlinear associations. Discussion of the regression modeling of these processes and of their associations is beyond the scope of this paper. [53]

Result likelihood and stability

Clinical decisions and research need to move beyond the initial sample of measurements (of, say, the initial patient or group of patients) to reach more generalized conclusions. Say a change is noted in a laboratory measurement following an operative procedure. How likely is it that other patients undergoing that procedure will experience the same change?
Is that change other than the difference that would be seen in patients with the same clinical condition who are measured twice, but who do not undergo that procedure? What is the range of change in that laboratory measurement, which can be expected in future patients who do and who do not undergo that procedure? These questions explore the extent to which we can generalize from our particular clinical observations and the trustworthiness of those generalizations. These questions are in the arena of statistical inference. There have been many presentations of the general logic underlying statistical inference (cf. e.g. [54],[55],[56],[57] ). The reader is referred to those sources, and to any classical statistics or biostatistics text for the logic underlying classical tests of statistical significance. We will now first discuss alternatives to the point comparison represented by classical significance testing. Then, given the widespread use of classical significance testing, we will discuss several modifications necessary for its appropriate use in clinical studies. Classical tests of significance assess the likelihood of the study's actual results given a set of assumptions about the sources of the measures being compared. The tests are designed to support a point judgment about the likelihood of those source groups being identical. The statistical significance test result evaluates the likelihood of the results obtained were the data drawn from identical groups, saying nothing about the magnitude or stability of any differences that were actually found. Further, these tests refer to an arbitrary cutpoint (usually P < 0.05) to support conclusions about similarity versus difference. There is a longstanding argument that analyses should estimate the range of intergroup differences consistent with the collected data rather than ending with a single statement regarding statistical significance. 
[58],[59],[60],[61],[62],[63] Given that significance tests provide a point statement, while confidence intervals express a range of estimation, some advocate reporting both (e.g. [60],[61] ). Confidence intervals can also be calculated using what are termed "Bayesian" techniques. These techniques, initially presented by Thomas Bayes (1702-1761), treat probability as a degree of belief in a statement rather than as an estimate of frequency. In clinical practice, Bayesian techniques are used to calculate the positive predictive value of a diagnostic finding given prior beliefs about the finding's sensitivity and specificity and about the prevalence of the diseases being considered. [27] In the context of statistical inference, Bayesian techniques take into account prior beliefs about the statistics being compared by the test of significance. This is in contrast to classical tests of statistical significance and calculations of confidence intervals, which are based only on the sets of actual measurements and assumptions about the underlying population distributions. [63],[64],[65] The continual reassessment method, first proposed by O'Quigley in 1990, applies Bayesian techniques to toxicity data from dose-finding trials. [66] Bayesian techniques are used to apply new trial data cyclically to prior toxicity estimates (from the trial or initially from elsewhere) to re-estimate dose-toxicity curves and estimate the optimal dose in Phase 1 clinical trials. [67],[68],[69],[70],[71] As noted above, confidence intervals can be used to evaluate the likely range of intergroup differences. These confidence intervals go beyond traditional point computations of statistical significance, which refer only to the likelihood of the particular difference tested, by estimating the magnitude of the intergroup difference. They also estimate the expected stability of associations between variables.
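As a rough sketch of how an interval estimate differs from a single significance statement, the following computes a confidence interval around a difference in group means. It assumes a large-sample normal approximation (a t-based interval would be wider for samples this small), and the function name and the measurements are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def diff_in_means_ci(group_a, group_b, confidence=0.95):
    """Normal-approximation confidence interval for mean(group_a) - mean(group_b).

    Returns (lower bound, point estimate, upper bound).
    """
    diff = mean(group_a) - mean(group_b)
    # Standard error of the difference between two independent sample means.
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff - z * se, diff, diff + z * se


# Hypothetical laboratory values in two independent groups.
treated = [5.1, 4.8, 5.6, 5.0, 5.3, 4.9, 5.4, 5.2]
control = [4.6, 4.9, 4.4, 4.7, 4.5, 4.8, 4.3, 4.6]

low, estimate, high = diff_in_means_ci(treated, control)
print(f"difference = {estimate:.3f}, 95% CI {low:.3f} to {high:.3f}")
```

The interval reports both the size of the estimated difference and how far it could plausibly range, which a bare P value does not convey.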
Please note that confidence intervals can also be calculated around other statistics, ranging from the proportions and means calculated as descriptive statistics through correlation coefficients to regression coefficients. In each case, the confidence interval describes the stability of the point statistic calculated using a defined sample. The confidence interval estimates the boundaries likely to include, with a desired target probability (often 95%), the value of that statistic in the statistical population from which the sample was drawn.

Independent versus paired measurement

While there is serious discussion about alternatives to classical tests of statistical significance as evaluations of the generalizability of findings, as noted above, these classical tests continue to be widely used. [72],[73],[74] Several issues regarding the conduct of these tests and the interpretation of their findings recur. The first issue is whether or not the measures being compared are independent. [54],[75],[76] Tests of the statistical significance of differences in paired measurements differ from tests of independent measurements because in paired observations the first set of measurements is a precise prediction against which the second measurement is compared. Any difference, or any difference in a specified direction, is potentially of interest when comparing independent samples. The sets of measurements in repeat measurement of the same subjects are obviously related, with the second measurements being departures from first measurements that are already in the study sample. Paired analyses are also needed when the selection of samples is matched. Matching is often used in epidemiological studies to maximize comparability of the samples on all factors other than the factor whose influence is being compared (i.e. a risk factor in a cohort study or clinical outcome in a study using a case-control design).
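The practical consequence of pairing can be seen by computing both test statistics on the same repeated measurements. This is a sketch with hypothetical function names and hypothetical before/after values; the independent-samples statistic uses the unequal-variance (Welch) form, chosen here only for illustration.

```python
from math import sqrt
from statistics import mean, stdev, variance


def paired_t(first, second):
    """t statistic for paired measurements: tests the within-subject differences."""
    diffs = [s - f for f, s in zip(first, second)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


def independent_t(x, y):
    """Welch t statistic, treating the two sets as independent samples."""
    se = sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(y) - mean(x)) / se


# Hypothetical systolic pressures for the same six patients, before and after.
before = [120, 135, 150, 142, 128, 160]
after = [115, 131, 144, 139, 122, 155]

print("paired t:", round(paired_t(before, after), 2))
print("independent t:", round(independent_t(before, after), 2))
```

Every patient's value falls by a few units, so the paired statistic is large in magnitude; treating the same numbers as independent samples buries that consistent change under the between-patient variability.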

Adjustment for multiple outcomes

Classical tests of statistical significance assume there has been only a single examination of the relationship being investigated. This assumption is often violated. It is violated when there is a series of separate examinations of the association of a single dependent variable with multiple potential independent variables, or of a single independent variable with multiple potential effects. [77],[78] This can also happen by design in randomized controlled trials, when the Data Safety Monitoring Committee by protocol reviews the data at prespecified intervals. Associations can also be examined during the study's initial design phase and then re-examined in the full study. This is problematic because each analysis in a multiple comparison which uses a P < 0.05 threshold has a 1 in 20 chance of producing a false positive result when no true association exists. In essence, this means that if 20 independent tests are performed, one false positive result is expected on average, and the chance that at least one test will yield a false positive is about 64% (1 − 0.95^20). [79] This risk of a false positive can be mitigated in the design by adjusting the threshold for declaring statistical significance. The simplest but most conservative approach, the Bonferroni adjustment, divides the target P value by the number of comparisons made. Equally rigorous but less stringent techniques such as the false discovery rate are now in use. [80] All these techniques adjust individual comparison thresholds so that the overall error rate for all comparisons combined is held at the 0.05 level.

Statistical power and negative studies

Clinical studies can only be effective if the sample size is large enough to give the study a reasonable chance of finding the association hypothesized by the study's designers. The chance of a study yielding a statistically significant result if its hypothesis is true is termed its statistical power.
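The dependence of power on sample size can be sketched for the simple case of comparing two outcome proportions. This uses a textbook normal approximation for a two-sided test with equal group sizes, not the procedure of any cited reference; the function name and the example rates are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided test comparing two proportions.

    p1, p2       -- assumed true outcome rates in the two groups
    n_per_group  -- number of subjects in each group
    alpha        -- significance threshold (may be a multiplicity-adjusted value)
    Normal approximation; the negligible opposite-tail rejection region is ignored.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    p_bar = (p1 + p2) / 2
    # Standard error under the null hypothesis of equal rates.
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n_per_group)
    # Standard error under the assumed true rates.
    se_alt = sqrt(p1 * (1 - p1) / n_per_group + p2 * (1 - p2) / n_per_group)
    return z.cdf((abs(p1 - p2) - z_crit * se_null) / se_alt)


# Hypothetical trial: 40% versus 30% outcome rates (a 25% relative reduction).
for n in (50, 100, 200, 400):
    print(n, "per group -> power", round(power_two_proportions(0.40, 0.30, n), 2))
```

Under these assumptions a trial with 50 patients per group has well under a 50% chance of detecting the difference, while several hundred per group are needed to approach the conventional 80% target.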
There are established methods for calculating statistical power for studies given the planned analysis, sample size, and assumptions about the population from which the sample will be drawn. [81],[82] If studies achieve statistically significant results, the question of statistical power is moot. The power was de facto adequate. The real challenge is when study results fail to reach statistical significance. Over a period of decades, examinations of studies with statistically nonsignificant results have found many of the studies to have been underpowered. [83],[84],[85] Paralleling the earlier study by Freiman et al., [85] Moher et al. [84] reviewed 383 randomized trials published in three major journals, finding 102 which had failed to reach statistical significance. Of the 70 of these negative trials which examined binary or intervally scaled primary outcomes, only 16 (22.9%) had 80% power to detect a 25% difference in outcome rates, and only 36 (51.4%) had 80% power to detect the easier to find 50% difference in outcome rates. This problem continues. In a recent study examining papers published in British orthopedic journals, Sexton et al. [83] found 49 papers reporting findings that failed to reach statistical significance. Only three (6.1%) of those papers reported a statistical power analysis and had a sample size large enough to give the study adequate statistical power.

Comments

Clinicians practice with individual patients, while conclusions about care practices almost always involve considerations of aspects of the clinical courses followed by many. Statistics is one of the important tools to help bridge this gap. This paper has reviewed certain selected key aspects of the statistical approach to clinical events and care. Please note that many of the studies used as examples are clinically illuminating and methodologically sound. However, there are also aspects of design and execution which were subject to recurring methodological weaknesses.
These include statistical power analysis and sample size planning and the selection and conduct of appropriate analyses in light of the sampling and measurements used. Routine conduct of pilot studies before full studies are initiated could help strengthen study designs and lessen the threat of such methodological weaknesses. Hopefully the clinical reader will use these tools to understand the strengths and weaknesses of past work. One central goal in conducting methodologically robust studies is to build a sound evidence base for clinical care. These quantitative tools can contribute to building such a solid foundation.

Acknowledgments

The authors acknowledge the sincere efforts of Dr. Dave Nicholas in reviewing and developing the manuscript.

References


