While clinical trials have been the gold standard in shaping current medical guidelines, they carry their own limitations, the most important being limited generalizability to everyday patients, limited follow-up, and failure to achieve adequate power because of small sample size. Retrospective observational studies can alleviate all of these limitations to a large extent, especially in our era of "big data" and readily available computing power for analyzing it.
Observational studies of this kind are retrospective by nature: they capture what has already occurred and hence represent "real-world evidence", in contrast to tightly controlled randomized clinical trials (RCTs), where causation is sought. RCTs have come a long way, and some of them, such as the SPRINT trial, have even declared themselves "pragmatic" studies, with a less strictly controlled intervention. However, the majority of RCTs were not designed as such. Large observational studies using big data can therefore complement RCTs, provided they are appropriately designed and analyzed. Before embarking on any such study, it is essential to recognize the limitations and intricacies of the database to be used: sampling design (complex-survey/stratified vs simple random sampling), the target/sampled population, inpatient vs outpatient setting, and clinical vs administrative data. Each of these carries its own magnitude of inaccuracies and caveats that the investigator needs to be aware of and proactive in adjusting for.
The choice of statistical technique is important in such studies. Multivariable analysis using logistic regression is ubiquitous throughout the observational literature and is generally well accepted by the average clinician. However, big data can impose challenges on this technique. In addition to the possibility of inaccurate data, coding errors, and clerical errors (which we assume are randomly and equally distributed among our cohorts of interest), the sheer number of observations can produce spurious results through unanticipated interactions, and this becomes more likely the more covariates we include.
Traditionally, researchers are advised to limit the number of covariates in a regression so that there are at least ten positive outcomes (events) per covariate, in order to avoid overfitting (e.g. if the positive outcome is death and the total number of deaths in the sample is 100, it is advised to include no more than 10 covariates). With big data, this is a non-issue; the caveat lies instead in the number of covariates that the researcher will include: the more covariates, the higher the chance of unanticipated interactions or multicollinearity among them that the model does not account for (misspecification). Therefore, while one will try to adjust for as many confounders as possible, given the opportunity of having big data, it is essential to screen for potential interactions and collinearity (using correlation matrices, variance inflation factors, etc.) and subsequently adjust accordingly (with interaction terms, covariate removal from the model, or stratification). Unfortunately, including multiple or complex interactions in a model can significantly diminish its interpretability.
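As a rough illustration of the two screening steps described above, the following minimal sketch computes the covariate limit implied by the ten-events-per-covariate rule of thumb and the variance inflation factor (VIF) of each covariate. The function names, the synthetic covariates (`age`, `creatinine`, `egfr_like`) and their distributions are assumptions made purely for illustration; none of them come from any study data.

```python
import numpy as np

def max_covariates(n_events, events_per_variable=10):
    """Largest covariate count allowed by the events-per-variable rule of thumb."""
    return n_events // events_per_variable

def variance_inflation_factors(X):
    """VIF for each column of X (rows = observations, columns = covariates).

    VIF_j = 1 / (1 - R^2_j), where R^2_j is obtained by regressing
    column j on all remaining columns plus an intercept.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ coef
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic, made-up covariates: egfr_like is built to be nearly
# collinear with creatinine, while age is independent of both.
rng = np.random.default_rng(0)
age = rng.normal(65, 10, size=5000)
creatinine = rng.normal(1.0, 0.3, size=5000)
egfr_like = 120 - 60 * creatinine + rng.normal(0, 2, size=5000)

X = np.column_stack([age, creatinine, egfr_like])
vif = variance_inflation_factors(X)

print(max_covariates(100))  # 100 deaths -> at most 10 covariates
print(np.round(vif, 1))     # one VIF per column: age, creatinine, egfr_like
```

In this toy setup the two collinear covariates receive a large VIF while `age` stays close to 1. Thresholds of 5 or 10 are commonly quoted as a signal to remove or combine covariates, but any cut-off is a convention rather than a strict rule.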
Working with big data, we have the luxury of using various matching techniques (propensity matching, exact/coarse matching) that allow us to match the group of interest with controls while maintaining adequate power within all of the matching groups. This virtually eliminates the issue of multicollinearity, especially when we use exact/coarse matching.
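The idea behind exact/coarse matching can be sketched as follows: each patient is assigned to a stratum defined by coarsened covariates, and exposed patients are paired 1:1 with controls from the same stratum. This is a simplified sketch under stated assumptions, not the matching procedure of any particular study; the field names (`age`, `sex`, `comorbidity`, `exposed`), the bin choices and the synthetic cohort are all hypothetical.

```python
import random
from collections import defaultdict

def coarsen(patient):
    """Stratum key: age decade, sex, and comorbidity score capped at 3."""
    return (patient["age"] // 10, patient["sex"], min(patient["comorbidity"], 3))

def coarsened_exact_match(patients, seed=0):
    """Return 1:1 (exposed, control) pairs drawn from identical strata."""
    rng = random.Random(seed)
    strata = defaultdict(lambda: {"exposed": [], "control": []})
    for p in patients:
        group = "exposed" if p["exposed"] else "control"
        strata[coarsen(p)][group].append(p)

    pairs = []
    for groups in strata.values():
        # Shuffle controls so the choice within a stratum is random.
        rng.shuffle(groups["control"])
        # zip stops at the shorter list, so unmatched patients are dropped.
        pairs.extend(zip(groups["exposed"], groups["control"]))
    return pairs

# Synthetic cohort (made-up data): roughly 20% of patients are "exposed".
gen = random.Random(42)
cohort = [
    {
        "id": i,
        "age": gen.randint(40, 89),
        "sex": gen.choice("MF"),
        "comorbidity": gen.randint(0, 5),
        "exposed": gen.random() < 0.2,
    }
    for i in range(2000)
]

pairs = coarsened_exact_match(cohort)
print(len(pairs), "matched pairs")
```

Because every pair shares the same stratum, the matched covariates are balanced by construction. The price is that patients without a counterpart in their stratum are discarded, which is affordable only when the dataset is large.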
In our study (https://www.ericas.org/reviews/view/17), we aimed to quantify the effect of Acute Kidney Injury (AKI) on inpatient mortality in Clostridium difficile infection (CDI) patients (n=2,859,599), using both traditional multivariable logistic regression with interaction adjustments (odds ratio 3.16; 95% CI 3.02 - 3.30; p<0.001) and propensity matching (propensity-matched odds ratio 1.86; 95% CI 1.79 - 1.94; p<0.001). One can notice a narrower confidence interval and a smaller, more conservative effect size in the propensity-matched analysis, which likely represents a more accurate estimate of the impact of AKI on inpatient mortality in CDI patients.
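The comparison of the two confidence intervals can be checked with a few lines of arithmetic on the figures reported above. The sketch below computes the width of each interval on the absolute scale and on the multiplicative scale (upper bound divided by lower bound), which is the more natural scale for odds ratios. The implied standard error assumes a Wald-type interval on the log-odds scale; that is an assumption made here for illustration, not something stated about the original analysis.

```python
import math

def ci_summary(lower, upper):
    """Describe a 95% CI for an odds ratio on several scales."""
    return {
        "absolute_width": upper - lower,
        # Width on the multiplicative scale.
        "ratio_width": upper / lower,
        # Standard error of log(OR) implied by a Wald-type 95% CI (assumption).
        "se_log_or": (math.log(upper) - math.log(lower)) / (2 * 1.96),
    }

# Intervals as reported in the text above.
regression = ci_summary(3.02, 3.30)  # multivariable logistic regression
matched = ci_summary(1.79, 1.94)     # propensity-matched analysis

for name, summary in (("regression", regression), ("matched", matched)):
    print(name, {key: round(value, 4) for key, value in summary.items()})
```

On both scales the propensity-matched interval is the narrower of the two, although the difference is much smaller on the multiplicative scale than the absolute widths suggest.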
Dr Paris Charilaou, Department of Internal Medicine, Saint Peter's University Hospital/Rutgers-Robert Wood Johnson Medical School, New Brunswick, New Jersey, USA