Fake vs Synthetic Data: What’s the difference?

The ethical and accurate handling of data is paramount in the domain of clinical research. As the demand for data-driven clinical insights continues to grow, researchers face challenges in balancing the need for accuracy with the availability of data and the imperative to protect sensitive information. In situations where quality real patient data is not available, synthetic data can be the most reliable data source from which to derive predictive insights. Synthetic data can be more cost-effective and time-efficient in many cases than acquiring the equivalent real data.

Synthetic data must be differentiated from fake data. In recent years there has been much controversy over fake data detected in published journal articles that had previously passed peer review, particularly in an academic context. Because one study is generally built upon assumptions formed by the results of another, the prevalence of fake data has seriously damaged our ability to trust published scientific research, even where the study at hand contains no fake data itself. It has become clear that increased quality control standards for all published research need to be prioritised.

While synthetic data is not without its own pitfalls, the key difference between synthetic and fake data lies in its purpose and authenticity. Synthetic data is designed to emulate real-world data for specific use cases, maintaining statistical properties without revealing actual (individual) information. Fake data, on the other hand, is typically fabricated and may not adhere to any real-world patterns or statistics.

In clinical research, the use of real patient data is fraught with privacy concerns and other ethical considerations. Accurate and consistent patient data can also be hard to come by for other reasons such as heterogeneous recording methods or insufficient disease populations. Synthetic data is emerging as a powerful solution to navigate these limitations. While accurate synthetic data is not a trivial thing to generate, researchers can harness advanced algorithms and models built by expert data scientists to generate synthetic datasets that faithfully mimic the statistical properties and patterns of real-world patient and other data. This allows researchers to simulate and predict relevant clinical outcomes in situations where real data is not readily available, and do so without compromising individual patient privacy.

A large proportion of machine learning models in an AI context are currently being trained on synthetic rather than real data. This is largely because using generative models to create synthetic data tends to be much faster and cheaper than collecting real-world data. Real-world data can also at times lack sufficient diversity to make insights and predictions truly generalisable.

Both the irresponsible use of synthetic data and the generation and application of fake data in academic, industry and clinical research settings can have severe consequences. Whether stemming from dishonesty or incompetence, the misuse of fake data or inaccurate synthetic data poses a threat to the integrity of scientific inquiry.

The following sections define and distinguish between synthetic and fake data, and summarise the key applications of synthetic data in clinical research alongside the potential pitfalls associated with the unethical use of fake data.

Synthetic Data:

Synthetic data refers to artificially generated data that mimics the statistical properties and patterns of real-world data. It is created using various algorithms, models, or simulations to resemble authentic patient data as closely as possible. It may do so without containing any real-world identifying information about individual patients comprising the original patient sample from which it was derived.
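As a minimal sketch of what "mimicking statistical properties" can mean in practice, the Python example below fits a simple bivariate normal model to a small cohort and then samples new records from that model. The cohort values, the variable choice (age and systolic blood pressure) and the function names are all hypothetical, and a bivariate normal is only one of the simplest possible generators. Real projects typically need richer models and a formal privacy assessment.

```python
import math
import random
import statistics


def fit_bivariate_normal(rows):
    """Estimate means, standard deviations and correlation from (x, y) pairs."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (len(rows) - 1)
    return mean_x, mean_y, sd_x, sd_y, cov / (sd_x * sd_y)


def sample_synthetic(params, n, rng):
    """Draw n synthetic (x, y) pairs sharing the fitted means, spreads and correlation."""
    mean_x, mean_y, sd_x, sd_y, rho = params
    out = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mean_x + sd_x * z1
        # Mixing z1 into y induces the fitted correlation rho between x and y.
        y = mean_y + sd_y * (rho * z1 + math.sqrt(1 - rho ** 2) * z2)
        out.append((x, y))
    return out


# Hypothetical "real" cohort: (age in years, systolic blood pressure in mmHg).
real = [(34, 118), (45, 125), (52, 131), (61, 140), (38, 121),
        (70, 149), (48, 128), (57, 137), (66, 142), (41, 119)]

params = fit_bivariate_normal(real)
synthetic = sample_synthetic(params, 5000, random.Random(42))
synthetic_params = fit_bivariate_normal(synthetic)

print("real     :", [round(p, 2) for p in params])
print("synthetic:", [round(p, 2) for p in synthetic_params])
```

No synthetic record is copied from a real patient, yet the summary statistics of the two data sets should closely agree. Comparing fitted parameters in this way is a basic fidelity check, not a guarantee of privacy or of clinical validity.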

Synthetic data can be used in situations where privacy, security, or confidentiality concerns make it challenging to access or use real patient data. It can also be used in cases where an insufficient volume of quality patient data is available or where existing data is too heterogeneous to draw accurate inferences, such as is typically the case with rare diseases. It can potentially be employed in product testing to create realistic scenarios without subjecting real patients to unnecessary risk.

3 key use cases for synthetic data in clinical research

1. Privacy Preservation:

– Synthetic data allows researchers to conduct analyses and develop statistical models without exposing sensitive patient information. This is particularly crucial in the healthcare and clinical research sectors, where maintaining patient confidentiality is a legal and ethical imperative.

2. Robust Testing Environments:

– Clinical trials and other experiments related to product testing or behavioural interventions often necessitate testing in various scenarios. Synthetic data provides a versatile and secure testing ground, enabling researchers to validate algorithms and methodologies without putting real patients at risk.

3. Data Augmentation for Limited Datasets:

– In situations where obtaining a large and diverse dataset is challenging, synthetic data serves as a valuable tool for augmenting existing datasets. This aids in the development of more robust models and generalisable findings. A dataset can be made up of varying proportions of synthetic versus real-world data. For example, a real-world dataset may be fairly large but lack diversity on the one hand, or be small and overly heterogeneous on the other. The methods of generating synthetic data to augment these respective datasets would differ in each case.

Fake Data:

Fake data typically refers to data that is intentionally fabricated or inaccurate due to improper data handling techniques. In situations of misuse it is usually combined with real study data to give misleading results.

Fake data can be used ethically for various purposes, such as placeholder values in a database during development, creating fictional scenarios for training or educational purposes, or generating data for scenarios where realism is not crucial. Unfortunately, in the majority of notable academic and clinical cases it has been used with the deliberate intention to mislead by doctoring study results, and thus poses a serious threat to the scientific community and the general public.

There are three key concerns with fake data.

1. Academic Dishonesty:

– Some researchers may be tempted to fabricate data to support preconceived conclusions, meet publication deadlines or attain competitive research grants. After many high profile cases in recent years it has become apparent that this is a pervasive issue across academic and clinical research. This form of academic dishonesty undermines the foundation of scholarly pursuits and erodes the trust placed in research findings.

2. Mishaps and Ineptitude:

– Inexperienced researchers may inadvertently create fake data, whether due to poor data collection practices, computational errors, or other mishaps. This unintentional misuse can lead to inaccurate results, potentially rendering an entire body of research unreliable if it remains undetected.

3. Erosion of Trust and Reproducibility:

– The use of fake data contributes to the reproducibility crisis in scientific research. One study found that 70% of studies cannot be reproduced due to insufficient reporting of data and methods. When results cannot be independently verified, trust in the scientific process diminishes, hindering the advancement of knowledge. The addition of fake data into this scenario makes replication and thus verification of study results all the more challenging.

In an evolving clinical research landscape, the responsible and ethical use of data is paramount. Synthetic data stands out as a valuable tool in protecting privacy, advancing research, and addressing the challenges posed by sensitive information, assuming it is generated as accurately and responsibly as possible. On the other hand, the misuse of fake data undermines the integrity of scientific research, eroding trust and impeding the progress of knowledge and its real-world applications. It is important to stay vigilant against bias in data and to employ stringent quality control in all contexts of data handling.

The Role of Clinical-Translational Studies in Validation of Diagnostic Devices

Clinical-translational studies refer to research studies that bridge the gap between early-stage diagnostic development and real-world clinical application. In a diagnostics context these studies focus on translating promising diagnostic technologies from laboratory research (preclinical stage) to clinical practice, where they can be validated, assessed for clinical utility, and eventually integrated into routine healthcare settings.

The primary goal of clinical-translational studies for diagnostics is to evaluate the performance, accuracy, safety, and overall effectiveness of new diagnostic tests or devices in real-world patient populations. These studies play a critical role in determining whether the diagnostic technology can reliably detect specific diseases or conditions, guide treatment decisions, improve patient outcomes, and enhance the overall healthcare experience.

Key Characteristics of Clinical-Translational Studies for Diagnostics:

Validation of Diagnostic Accuracy:
In clinical-translational studies, diagnostic accuracy and reliability are rigorously validated. Sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) are assessed to determine how effectively the diagnostic test can identify true positive and true negative cases. These metrics provide essential insights into the precision and reliability of the test’s performance.
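The four accuracy metrics named above all follow from the 2x2 table of test results against a reference standard. The short Python sketch below shows the arithmetic. The counts are invented for illustration and the function name is an assumption, not part of any particular package.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Compute standard accuracy metrics from a 2x2 confusion table.

    tp/fn: diseased patients with a positive/negative test result.
    fp/tn: non-diseased patients with a positive/negative test result.
    """
    return {
        "sensitivity": tp / (tp + fn),  # diseased patients correctly detected
        "specificity": tn / (tn + fp),  # non-diseased patients correctly cleared
        "ppv": tp / (tp + fp),          # positive results that are true positives
        "npv": tn / (tn + fn),          # negative results that are true negatives
    }


# Hypothetical study: 100 diseased and 900 non-diseased participants.
metrics = diagnostic_metrics(tp=90, fp=30, fn=10, tn=870)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that PPV and NPV depend on disease prevalence in the study sample, so values estimated in one population may not carry over to a population where the condition is rarer or more common.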

Clinical Utility Evaluation:
Beyond accuracy, clinical-translational studies focus on evaluating the clinical utility of the diagnostic technology. The impact of the test on patient management, treatment decisions, and overall healthcare outcomes is carefully assessed. Real-world data is analysed to understand how the test guides appropriate clinical actions and leads to improved patient outcomes. This evaluation helps stakeholders better assess the value of the diagnostic test in clinical practice.

Inclusion of Diverse Patient Populations:
Clinical-translational studies encompass a wide range of patient populations to ensure the generalisability of the diagnostic test’s results. Studies are designed to include patients with various demographics, medical histories, and disease severities, making the findings applicable to real-world scenarios. Robust statistical analyses are employed to identify potential variations in test performance across different patient groups, enhancing the diagnostic test’s inclusivity and practicality.
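One simple way to look for variation in test performance across patient groups is to estimate sensitivity separately in each subgroup and attach a confidence interval to each estimate. The sketch below uses the Wilson score interval. The subgroup labels and counts are hypothetical, and a real analysis plan would also pre-specify the subgroups and account for multiple comparisons.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score confidence interval for a proportion (z=1.96 gives roughly 95%)."""
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half


# Hypothetical counts per subgroup: (true positives, diseased patients).
subgroups = {"age < 40": (45, 50), "age 40-64": (88, 100), "age 65+": (30, 40)}

report = {}
for name, (tp, diseased) in subgroups.items():
    low, high = wilson_interval(tp, diseased)
    report[name] = (tp / diseased, low, high)
    print(f"{name}: sensitivity {tp / diseased:.2f} (95% CI {low:.2f} to {high:.2f})")
```

Wide or weakly overlapping intervals indicate subgroups where the test may perform differently, or where more patients are needed before any conclusion can be drawn.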

Comparative Analyses:
In certain cases, comparative analyses are conducted in clinical-translational studies to evaluate the performance of the new diagnostic technology against existing standard-of-care tests or reference standards. Differences in accuracy and clinical utility are quantified using statistical methods, enabling stakeholders to make informed decisions regarding the adoption of the new diagnostic test or device.
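When the new test and the existing test are both applied to the same patients, one commonly used method for comparing their accuracy is McNemar's test, which looks only at the discordant pairs. The following is a minimal sketch of the exact (binomial) version. The counts are invented, and the choice of McNemar's test here is an illustrative assumption rather than a prescription for every study design.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value for paired test results.

    b: patients classified correctly by the new test only.
    c: patients classified correctly by the existing test only.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Under the null hypothesis, discordant pairs split 50/50 between b and c.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical result: 15 patients favour the new test, 5 favour the existing one.
p_value = mcnemar_exact(b=15, c=5)
print(f"exact McNemar p-value: {p_value:.4f}")
```

Patients on whom both tests agree do not contribute to the comparison, which is why the number of discordant pairs, rather than the total sample size, drives the power of this analysis.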

Use of Real-World Evidence:
Real-world evidence plays a pivotal role in clinical-translational studies. Data from routine clinical practice settings are collected to assess the test’s performance under authentic healthcare conditions. Advanced statistical techniques are employed to analyse real-world data, providing valuable insights into how the diagnostic test performs in real patient populations. This evidence informs the adoption and implementation of the test in clinical practice.

Compliance with Regulatory Guidelines:
Compliance with regulatory guidelines and standards is essential for the success of clinical-translational studies. Studies are designed and conducted following regulatory requirements set by health authorities, ensuring adherence to Good Clinical Practice (GCP) guidelines and ethical considerations to ensure data quality and to protect patient safety and privacy.

Conducting Longitudinal Studies:
For certain diagnostic technologies, particularly those used for monitoring or disease progression, longitudinal studies may be necessary. These studies are designed to assess the diagnostic device’s performance over time and identify potential variations or trends. Longitudinal analyses enable researchers to understand how the diagnostic test performs in the context of disease progression and treatment response.

Interdisciplinary Collaboration:
Clinical-translational studies involve collaboration among diverse stakeholders, such as clinicians, biostatisticians, regulatory experts, and industry partners. Biostatisticians play a pivotal role in facilitating effective communication and coordination among team members. This interdisciplinary collaboration ensures that all aspects of the research, from study design to data analysis and interpretation, are conducted with precision and expertise.

Clinical-translational studies in diagnostics demand a comprehensive and multidisciplinary approach, where biostatisticians play a vital role in designing robust studies, analysing complex data, and providing valuable insights. Through these studies, diagnostic technologies can be validated, and their clinical relevance can be determined, ultimately leading to improved patient care and healthcare outcomes.


Checklist for proactive regulatory compliance in medical device R&D projects

Meeting regulatory compliance in medical device research and development (R&D) is crucial to ensure the safety, efficacy, and quality of the device. Here are some strategies to help achieve regulatory compliance:

  1. Early Involvement of Regulatory Experts: Engage regulatory experts early in the R&D process. Their insights can guide decision-making and help identify potential regulatory hurdles from the outset. This proactive approach allows for timely adjustments to the development plan to meet compliance requirements.
  2. Stay Updated with Regulations: Medical device regulations are continually evolving. Stay abreast of changes in relevant regulatory guidelines, standards, and requirements in the target markets. Regularly monitor updates from regulatory authorities to ensure that the R&D process aligns with the latest compliance expectations.
  3. Build a Strong Regulatory Team: Assemble a team of professionals with expertise in regulatory affairs and compliance. This team should collaborate closely with R&D, quality, and manufacturing teams to ensure that compliance considerations are integrated throughout the product development lifecycle.
  4. Conduct Regulatory Gap Analysis: Perform a comprehensive gap analysis to identify any discrepancies between current practices and regulatory requirements. Address the gaps proactively to avoid potential compliance issues later in the development process.
  5. Implement Quality Management Systems (QMS): Establish robust QMS compliant with relevant international standards, such as ISO 13485. The QMS should cover all aspects of medical device development, from design controls to risk management and post-market surveillance.
  6. Adopt Design Controls: Implement design controls, as per regulatory guidelines (e.g., FDA Design Controls). This ensures that the R&D process is well-documented, and design changes are carefully managed and validated.
  7. Risk Management: Conduct thorough risk assessments and establish a risk management process. Identify potential hazards, estimate risk levels, and implement risk mitigation strategies throughout the R&D process.
  8. Clinical Trials and Data Collection: If required, plan and conduct clinical trials to collect essential data on safety and performance. Ensure that clinical trial protocols comply with regulatory requirements, and obtain appropriate ethics committee approvals.
  9. Preparation for Regulatory Submissions: Early preparation for regulatory submissions, such as pre-submissions (pre-IDE or pre-CE marking) or marketing applications, is essential. Compile all necessary documentation, including technical files, to support regulatory approvals.
  10. Engage with Regulatory Authorities: Maintain open communication with regulatory authorities throughout the development process. Seek feedback, clarify uncertainties, and address any questions or concerns to facilitate a smoother regulatory review.
  11. Post-Market Surveillance: Plan post-market surveillance activities to monitor the device’s performance and safety after commercialisation. This ongoing data collection ensures compliance with post-market requirements and facilitates timely response to adverse events.
  12. Training and Education: Provide continuous training and education to the R&D team and other stakeholders on regulatory requirements and compliance expectations. This ensures that all members are aware of their responsibilities in maintaining regulatory compliance.

By implementing these strategies, medical device R&D teams can navigate the complex landscape of regulatory compliance more effectively. Compliance not only ensures successful product development but also builds trust with customers, stakeholders, and regulatory authorities, paving the way for successful market entry and long-term success in the medical device industry.

Biostatistics checklist for regulatory compliance in clinical trials

  1. Early Biostatistical Involvement: Engage biostatisticians from the outset to ensure proper study design, data collection, and statistical planning that align with regulatory requirements.
  2. Compliance with Regulatory Guidelines: Stay updated with relevant regulatory guidelines (e.g., ICH E9, FDA guidance) to ensure statistical methods and analyses comply with current standards.
  3. Sample Size Calculation: Perform accurate sample size calculations to ensure the study has sufficient statistical power to detect clinically meaningful effects.
  4. Randomisation and Blinding: Implement appropriate randomisation methods and blinding procedures to minimise bias and ensure the integrity of the study.
  5. Data Quality Assurance: Establish data quality assurance processes, including data monitoring, validation, and query resolution, to ensure data integrity.
  6. Handling Missing Data: Develop strategies for handling missing data in compliance with regulatory expectations to maintain the validity of the analysis.
  7. Adherence to SAP: Strictly adhere to the Statistical Analysis Plan (SAP) to maintain transparency and ensure consistency in the analysis.
  8. Statistical Analysis and Interpretation: Conduct rigorous statistical analyses and provide accurate interpretation of the results, aligning with the study objectives and regulatory requirements.
  9. Interim Analysis (if applicable): Implement interim analysis following the SAP, if required, to monitor study progress and make data-driven decisions.
  10. Data Transparency and Traceability: Ensure data transparency and traceability through clear documentation, well-organised datasets, and proper archiving practices.
  11. Regulatory Submissions: Provide statistical sections for regulatory submissions, such as Clinical Study Reports (CSRs) or Integrated Summaries of Safety and Efficacy, as per regulatory requirements.
  12. Data Security and Privacy: Implement measures to protect data security and privacy, complying with relevant data protection regulations.
  13. Post-Market Data Analysis: Plan for post-market data analysis to assess long-term safety and effectiveness, as required by regulatory authorities.

By following this checklist, biostatisticians can play a pivotal role in ensuring that clinical trial data meets regulatory approval and maintains data integrity, contributing to the overall success of the regulatory process for medical products.
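Items 3 and 4 of the checklist lend themselves to a short illustration. The Python sketch below computes a per-arm sample size for comparing two means using the standard normal-approximation formula, then builds a permuted-block randomisation schedule. The effect size, standard deviation, block size, seed and function names are all hypothetical. An actual trial would take these from the protocol and SAP, and would usually use validated software.

```python
import math
import random
from statistics import NormalDist


def sample_size_per_arm(delta, sd, alpha=0.05, power=0.80):
    """Per-arm n for a two-sided, two-sample comparison of means (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)


def permuted_block_schedule(n_participants, block_size=4, seed=2024):
    """Allocate participants to arms A and B in shuffled blocks to keep arms balanced."""
    rng = random.Random(seed)  # a fixed seed makes the schedule reproducible for audit
    schedule = []
    while len(schedule) < n_participants:
        block = ["A"] * (block_size // 2) + ["B"] * (block_size // 2)
        rng.shuffle(block)
        schedule.extend(block)
    return schedule[:n_participants]


# Hypothetical design: detect a 5-unit difference when the standard deviation is 10.
n_per_arm = sample_size_per_arm(delta=5.0, sd=10.0)
schedule = permuted_block_schedule(2 * n_per_arm)
print("participants per arm:", n_per_arm)
print("first allocations:", schedule[:8])
```

The normal approximation slightly underestimates the sample size compared with a t-based calculation, and neither allows for dropout, so the figure is best read as a lower bound for planning.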

The Call for Responsible Regulations in Medical Device Innovation

In the seemingly fast-paced world of medical technology, the quest for innovation is ever-present. However, it is crucial to recognise that the engineering of medical devices should not mirror the recklessness and hubris of exploratory engineering exemplified by the recent OceanGate tragedy, where the stubborn blinkeredness of figures like Stockton Rush was not kept in check by sufficiently stringent regulations and safety standards. While it may seem in poor taste to criticise one who has lost their life under such tragic circumstances, the incident is emblematic of everything that can go wrong when the hubris of the innovator is left relatively unbridled in the service of short-term commercial gains. More troubling in this case is that American safety standards were in place to protect human life, yet the company was able to operate outside United States jurisdiction in order to bypass those standards. Fortunately, most medical device patients will not be receiving treatment over international waters. Despite this, there exist loopholes to be closed.

The jurisdictional loophole of “export only” medical device approval

As of 2022 the United States pulls in 41.8% of global sales revenue from medical devices. 10% of Americans currently have a medical device implanted and 80,000 Americans have died as a result of medical devices over the past 10 years. Interestingly, Americans have the 46th highest life expectancy in the world despite having disproportionately high access to the most advanced medical treatments, including medical devices. Perhaps more worryingly, thousands of medical devices manufactured in the United States are FDA approved for “Export Only”, meaning they do not pass muster for use by American citizens. This “Export Only” status is one factor that partially accounts for America’s disproportionate share of the global medical device market. Foreign recipients of such medical devices are just as often from developed countries with their own high regulatory standards, such as Australia, the United Kingdom and Europe, and have accepted the device based on its stamp of approval by the FDA. Patients in these countries are typically not made aware of the particular risks, nor told why the device has not been approved for use in the United States, nor even that it has failed to gain this approval in its country of origin.

Local regulators such as the TGA in Australia, those applying the MDR in Europe, or the MHRA in the UK all claim to have some of the most stringent regulatory standards in the world. Despite this, American devices designated “Export Only” by the FDA (there are roughly 4600 in total) get approved, predominantly due to differential device classification between the FDA and the importing country. By assigning a less risky class in the importing country, the device escapes the need for clinical trials and the high level of regulatory scrutiny it was subject to in the United States. While devices that include medicines, tissues or cells are designated high risk in Australia and require thorough clinical validation, implantable devices, for example, can require only a CE mark to be accepted by the TGA. This means that an implantable device such as a titanium shoulder replacement that has failed clinical studies in the United States and received an “Export Only” designation by the FDA can be approved by the TGA or under the MDR with very little burden of evidence.

Regulatory standards must begin to evolve at the pace of technology.

Of equal concern is the need for regulatory standards that dynamically keep up with the pace of innovation and the emergent complexity of the devices we are now on a trajectory to engineer.

It is no longer enough to simply prioritise safety, regulation, and stringent quality control standards. We now need regular re-assessments of the standards themselves to evaluate whether they in fact remain adequate to assess the novel case at hand. In many cases, even with current devices under validation, the answer to this question could well be “no”. It is quite possible that methods that would previously have seemed beyond consideration in the context of medical device evaluation, such as causal inference and agent-based models, may now become integrated into many a study protocol. Bayesian methods are also becoming increasingly important as a way of calibrating to increasing device complexity.
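To give a flavour of the Bayesian approach mentioned above, the sketch below updates a Beta prior on a device's adverse event rate with observed follow-up data, using the standard Beta-Binomial conjugate update. The prior, the event counts and the function names are hypothetical, and the credible interval is approximated by simulation. This is a deliberately simple illustration, not a recommended regulatory analysis.

```python
import random


def beta_posterior(prior_a, prior_b, events, n):
    """Conjugate update: Beta(a, b) prior plus binomial data gives a Beta posterior."""
    return prior_a + events, prior_b + (n - events)


def credible_interval(a, b, level=0.95, draws=20000, seed=1):
    """Approximate an equal-tailed credible interval by sampling from Beta(a, b)."""
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    lower_index = round((1 - level) / 2 * draws)
    upper_index = round((1 + level) / 2 * draws) - 1
    return samples[lower_index], samples[upper_index]


# Hypothetical data: 3 adverse events among 200 implanted patients, uniform Beta(1, 1) prior.
a, b = beta_posterior(1, 1, events=3, n=200)
posterior_mean = a / (a + b)
low, high = credible_interval(a, b)
print(f"posterior mean adverse event rate: {posterior_mean:.4f}")
print(f"approximate 95% credible interval: {low:.4f} to {high:.4f}")
```

One attraction of this framing is that the posterior from one study can serve as the prior for the next, so evidence accumulates as a device is iterated. The choice of prior needs to be justified and agreed in advance.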

When the stakes involve devices implanted in people’s bodies or software making life-altering decisions, the need for responsible innovation becomes paramount.

If an implantable device also has a software component, the need for caution increases, and exponentially so if the software is to be driven by AI. As these and other hybrid devices become the norm, there is a need to test and thoroughly validate: the reliability of the machine learning or AI algorithms used in the device; the failure rate of the software and how this rate changes over time; software security and susceptibility to hacking or unintended influence from external stimuli; and the many metrics of safety and efficacy of the physical device itself.

The Perils of Recklessness:

Known for his audacious approach to deep-sea exploration, Stockton Rush has become a symbol of recklessness and disregard for safety protocols. While such daring may be thrilling in certain fields, it has no place in the medical device industry. Creating devices that directly impact human lives demands meticulous attention to detail, adherence to rigorous safety standards, and a focus on patient welfare.

There have been several class action lawsuits in recent years related to medical device misadventure. Behemoth Johnson & Johnson has been subject to several class action lawsuits pertaining to its medical devices. A recent lawsuit brought against the company, along with five other vaginal mesh manufacturers, was able to establish that 4000 adverse events had been reported to the FDA, including serious and permanent injury leading to loss of quality of life. Another recent class action lawsuit relates to Johnson & Johnson surgical tools which are said to have caused burn injuries to at least 63 adults and children. These incidents are likely the result of recklessness in pushing these products to market and would have been avoidable had the companies involved chosen to conduct proper and thorough testing in both animals and humans. Proper testing occurs as much on the data side as in the lab and entails maintaining data integrity and statistical accuracy at all times.

Apple has recently been subject to legal action over its allegedly racially biased blood oxygen sensor which, as with similar devices by other manufacturers, is able to detect blood oxygen more accurately for lighter-skinned people than for darker-skinned people. Dark skin absorbs more light and can therefore give falsely elevated blood oxygen readings. It is being argued that users believing their blood oxygen levels to be higher than they actually were has contributed to higher incidences of death in this demographic, particularly during the pandemic. This lawsuit could likely have been avoided if the company had conducted more stringent clinical trials which recruited a broad spectrum of participants and stratified subjects by skin tone to fairly evaluate any differences in performance. If differences were identified, they should also have been transparently reported on the product label, if not also discussed openly in sales material, so that consumers could make an informed decision as to whether the watch was a good choice for them based on their own skin tone.

Ensuring Regulatory Oversight:

To prevent the emergence of medtech catastrophes of unimagined proportions, robust regulation and vigilant oversight are crucial as we move into a newer technological era, not just to redress current inadequacies in patient safeguarding but also to prepare for new ones. While innovation and novel ideas drive progress, they must be tempered with accountability. Regulatory bodies play a vital role in enforcing safety guidelines, conducting thorough evaluations, and certifying the efficacy of medical devices before they reach the market. Striking the right balance between promoting innovation and safeguarding patient well-being is essential for the industry’s long-term success.

Any device given “Export Only” status by the FDA, or indeed by any other regulatory authority, should necessitate further regulatory testing in the jurisdictions in which it is intended to be sold and should be flagged by local regulatory agencies as insufficiently validated. Currently this seems to be taking place more in word than in deed in many jurisdictions.

Stringent Quality Control Standards:

The gravity of medical device development calls for stringent quality control standards. Every stage of the development process, from design and manufacturing to post-market surveillance, must prioritise safety, reliability, and effectiveness. Employing best practices, such as adherence to recognised international standards, robust testing protocols, and continuous monitoring, helps identify and address potential risks early on, ensuring patient safety remains paramount.

Putting Patients First:

Above all, the focus of medical device developers should always be on patients. These devices are designed to improve health outcomes, alleviate suffering, and save lives. A single flaw or an overlooked risk could have devastating consequences. Therefore, a culture that fosters a sense of responsibility towards patients is vital. Developers must empathise with the individuals who rely on these devices and remain dedicated to continuous improvement, addressing feedback, and learning from past mistakes.

Making patient safety the very top priority is the only way to avoid costly lawsuits and bad publicity stemming from a therapeutic device that was released onto the market too early in the pursuit of short-term financial gain. While product development and proper validation are expensive and resource-intensive, cutting corners early in the process will inevitably lead to ramifications at a later stage of the product life cycle.

Allowing overseas patients access to “export only” medical devices is attractive to their respective companies as it allows data to be collected from the international patients who use the device, which can later be used as further evidence of safety in subsequent applications to the FDA for full regulatory approval. This may not always be an acceptable risk profile for the patients who have the potential to be harmed. Another benefit of “Export Only” status to American device companies is that marketing the device overseas can bring in much needed revenue that enables further R&D tweaks and clinical evaluation that will eventually result in FDA approval domestically. Ultimately it is the responsibility of national regulatory agencies globally to maintain strict classification and clinical evidence standards lest their citizens become unwitting guinea pigs.

Collaboration and Transparency:

The medical device industry should embrace a culture of collaboration and transparency. Sharing knowledge, research, and lessons learned can help prevent the repetition of past mistakes. Open dialogue among developers, regulators, healthcare professionals, and patients ensures a holistic approach to device development, wherein diverse perspectives contribute to better, safer solutions. This collaborative mindset can serve as a safeguard against the emergence of reckless practices.

The risks associated with medical devices demand a paradigm shift within the industry. Developers must strive to distance themselves from the medtech version of OceanGate and instead embrace responsible innovation. Rigorous regulation, stringent quality control standards, and a relentless focus on patient safety should be the guiding principles of medical device development. By prioritising patient well-being and adopting a culture of transparency and collaboration, the industry can continue to advance while ensuring that every device that enters the market has been meticulously evaluated and designed with the utmost care.

Further reading:

Law of the Sea and the Titan incident: The legal loophole for underwater vehicles – EJIL: Talk! (ejiltalk.org)

Drugs and Devices: Comparison of European and U.S. Approval Processes – ScienceDirect

https://www.theregreview.org/2021/10/27/salazar-addressing-medical-device-safety-crisis/

https://www.medtechdive.com/news/medtech-regulation-FDA-EU-MDR-2023-Outlook/641302/
https://www.marketdataforecast.com/market-reports/Medical-Devices-Market

FDA Permits ‘Export Only’ Medical Devices | Industrial Equipment News (ien.com)

FDA issues ‘most serious’ recall over Johnson & Johnson surgical tools (msn.com)

Jury Award in Vaginal Mesh Lawsuit Could Open Flood Gates | mddionline.com

Lawsuit alleges Apple Watch’s blood oxygen sensor ‘racially biased’; accuracy problems reported industry-wide – ABC News (inferse.com)

Effective Strategies for Regulatory Compliance

1. Establish a Regulatory Compliance Plan: Develop a comprehensive plan that outlines the regulatory requirements and compliance strategies for each stage of the product development process.

2. Engage with Regulatory Authorities Early: Build relationships with regulatory authorities and engage with them early in the product development process to ensure that all requirements are met.

3. Conduct Risk Assessments: Identify potential risks and hazards associated with the product and develop risk management strategies to mitigate those risks.

4. Implement Quality Management Systems: Establish quality management systems that ensure compliance with regulatory requirements and promote continuous improvement.

5. Document Everything: Maintain detailed records of all activities related to the product development process, including design, testing, and manufacturing, to demonstrate compliance with regulatory requirements.

Stata: Statistical Software for Regulatory Compliance in Clinical Trials

Stata is widely used in various research domains such as economics, biosciences, health and social sciences, including clinical trials. It has been utilised for decades in studies published in reputable scientific journals. While SAS has a longer history of being explicitly referenced by regulatory agencies such as the FDA, Stata can still meet regulatory compliance requirements in clinical trials. StataCorp actively engages with researchers, regulatory agencies, and industry professionals to address compliance needs and provide technical support, thereby maintaining a strong commitment to producing high-quality software and staying up to date with industry standards.

Stata’s commitment to accuracy, comprehensive documentation, integrated versioning, and rigorous certification processes provides researchers with a reliable and compliant statistical software for regulatory submissions. Stata’s worldwide reputation, excellent technical support, seamless verification of data integrity, and ease of obtaining updates further contribute to its suitability for clinical trials and regulatory compliance.

To facilitate regulatory compliance in clinical trials, Stata offers features such as data documentation and audit trails, allowing researchers to document and track data manipulation steps for reproducibility and transparency. Stata’s built-in “do-files” and “log-files” can capture commands and results, aiding in the audit trail process. Stata provides the flexibility to generate analysis outputs and tables in formats commonly required for regulatory reporting (e.g., PDF, Excel, or CSV). It also enables the automation of reproducible, fully formatted, publication-standard reports. Strong programming of tables, listings and figures (TLFs) and case report forms (CRFs) used to be the domain of SAS, which helps explain its early industry dominance. Development of SAS began in 1966 with funding from the National Institutes of Health. In recent years, however, Stata has arguably surpassed what is achievable in SAS with the same efficiency, particularly in the context of clinical trials.

Stata has extensive documentation of adaptive clinical trial design. Adaptive group sequential designs can be achieved using the group sequential design (GSD) functionality. The default graphs and tables produced by a GSD analysis arguably leave SAS in the dust, being more visually appealing and more easily interpretable. They are also more highly customisable than what can be produced in SAS. Furthermore, the Stata syntax used to produce them is minimal compared to the corresponding SAS commands, while still retaining full reproducibility.

Stata’s comprehensive causal inference suite enables experimental-style causal effects to be derived from observational data. This can be helpful in planning clinical trials based on observed patient data that is already available, with the process being fully documentable.

Advanced data science methods are being increasingly used in clinical trial design and planning as well as for follow-up exploratory analysis of clinical trial data. Stata has had both supervised and unsupervised machine learning capability in its own right for decades. Stata can also integrate with other tools and programming languages, such as Python via PyStata and packages such as PyTrials, if additional functionalities or specific formats are needed. This can be instrumental for advanced machine learning and other data science methods that go beyond native features and user-made packages in terms of customisability. Furthermore, using Python within the Stata interface allows for compliant documentation of all analyses. Python integration is also available in SAS via numerous packages and is able to eliminate some of the limitations of native SAS, particularly when it comes to graphical outputs.

Stata for FDA regulatory compliance

While the FDA does not mandate the use of any specific statistical software, they emphasise the need for reliable software with appropriate documentation of testing procedures. Stata satisfies the requirements of the FDA and is recognized as one of the most respected and validated statistical tools for analysing clinical trial data across all phases, from pre-clinical to phase IV trials. With Stata’s extensive suite of statistical methods, data management capabilities, and graphics tools, researchers can rely on accurate and reproducible results at every step of the analysis process.

When it comes to FDA guidelines on statistical software, Stata offers features that assist in compliance. Stata provides an intuitive Installation Qualification tool that generates a report suitable for submission to regulatory agencies like the FDA. This report verifies that Stata has been installed properly, ensuring that the software meets the necessary standards.

Stata offers several key advantages when it comes to FDA regulatory compliance for clinical trials. Stata takes reproducibility seriously and is the only statistical package with integrated versioning. This means that if you wrote a script to perform an analysis in 1985, that same script will still run and produce the same results today. Stata ensures the integrity and consistency of results over time, providing reassurance when submitting applications that rely on data and results from clinical trials.

Stata also offers comprehensive manuals that detail the syntax, use, formulas, references, and examples for all commands in the software. These manuals provide researchers with extensive documentation, aiding in the verification and validity of data and analyses required by the FDA and other regulatory agencies.

To further ensure computational validity, Stata undergoes extensive software certification testing. Millions of lines of certification code are run on all supported platforms (Windows, Mac, Linux) with each release and update. Any discrepancies or changes in results, output, behaviour, or performance are thoroughly reviewed by statisticians and software engineers before the updated software is made available to users. Stata’s accuracy is also verified through the National Institute of Standards and Technology (NIST) StRD numerical accuracy tests and the George Marsaglia Diehard random-number generator tests.

Data management in Stata

Stata’s Datasignature Suite and other similar features offer powerful tools for data validation, quality control, and documentation. These features enable users to thoroughly examine and understand their datasets, ensuring data integrity and facilitating transparent research practices. Let’s explore some of these capabilities:

1. Datasignature Suite:

The Datasignature Suite is a collection of commands in Stata that assists in data validation and documentation. Its core command, `datasignature`, computes a compact signature of the dataset from its variable names and values; the signature can be stored with the data and confirmed later to verify that nothing has changed. Related commands such as `dataex`, which produces a shareable example of the data, and `codebook`, which summarises variable types, values and missing data, help identify inconsistencies, outliers, and potential errors, allowing users to take appropriate corrective measures.

2. Variable labelling:

 Stata allows users to assign meaningful labels to variables, enhancing data documentation and interpretation. With the `label variable` command, users can provide descriptive labels to variables, making it easier to understand their purpose and content. This feature improves collaboration among researchers and ensures that the dataset remains comprehensible even when shared with others.

3. Value labels:

 In addition to variable labels, Stata supports value labels. Researchers can assign descriptive labels to specific values within a variable, transforming cryptic codes into meaningful categories. Value labels enhance data interpretation and eliminate the need for constant reference to codebooks or data dictionaries.

4. Data documentation:

Stata encourages comprehensive data documentation through features like variable and dataset-level documentation. Users can attach detailed notes and explanations to variables, datasets, or even individual observations, providing context and aiding in data exploration and analysis. Proper documentation ensures transparency, reproducibility, and facilitates data sharing within research teams or with other stakeholders.

5. Data transformation:

Stata provides a wide range of data transformation capabilities, enabling users to manipulate variables, create new variables, and reshape datasets. These transformations facilitate data cleaning, preparation, and restructuring, ensuring data compatibility with statistical analyses and modelling procedures.

6. Data merging and appending:

Stata allows users to combine multiple datasets through merging and appending operations. By matching observations based on common identifiers, researchers can consolidate data from different sources or time periods, facilitating longitudinal or cross-sectional analyses. This feature is particularly useful when dealing with complex study designs or when merging administrative or survey datasets.

7. Data export and import:

Stata offers seamless integration with various file formats, allowing users to import data from external sources or export datasets for further analysis or sharing. Supported formats include Excel, CSV, SPSS, SAS, and more. This versatility enhances data interoperability and enables collaboration with researchers using different software.

These features collectively contribute to data management best practices, ensuring data quality, reproducibility, and documentation. By leveraging the Datasignature Suite and other data management capabilities in Stata, researchers can confidently analyse their data and produce reliable results while maintaining transparency and facilitating collaboration within the scientific community.
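The signature and merge ideas in items 1 and 6 can also be sketched outside Stata. The Python below is a minimal illustration only: it does not reproduce the algorithm behind `datasignature` or Stata's merge syntax. The helper names `data_signature` and `merge_on_id`, the `_merge` status labels and the toy patient records are all assumptions made for the example.

```python
import hashlib


def data_signature(rows, columns):
    """Return a short checksum built from the column names and every value."""
    digest = hashlib.sha256()
    digest.update("|".join(columns).encode("utf-8"))
    for row in rows:
        # repr() keeps 1 and "1" distinct, so a change of type alters the signature.
        line = "|".join(repr(row[col]) for col in columns)
        digest.update(("\n" + line).encode("utf-8"))
    return f"{len(rows)}:{len(columns)}:{digest.hexdigest()[:16]}"


def merge_on_id(master, using, key):
    """One-to-one merge on a shared identifier, recording where each row came from."""
    using_by_key = {row[key]: row for row in using}
    merged = []
    for row in master:
        match = using_by_key.pop(row[key], None)
        status = "matched" if match is not None else "master only"
        merged.append({**row, **(match or {}), "_merge": status})
    for row in using_by_key.values():  # identifiers found only in the second dataset
        merged.append({**row, "_merge": "using only"})
    return merged


# Toy records, invented for illustration.
demographics = [
    {"patient_id": 1, "age": 54},
    {"patient_id": 2, "age": 61},
    {"patient_id": 3, "age": 47},
]
labs = [
    {"patient_id": 1, "hba1c": 6.9},
    {"patient_id": 3, "hba1c": 7.4},
    {"patient_id": 4, "hba1c": 8.1},
]

sig_original = data_signature(demographics, ["patient_id", "age"])

edited = [dict(row) for row in demographics]
edited[1]["age"] = 16  # simulate a transcription error (61 typed as 16)
sig_edited = data_signature(edited, ["patient_id", "age"])

print("unchanged data matches:", data_signature(demographics, ["patient_id", "age"]) == sig_original)
print("edited data matches:   ", sig_edited == sig_original)

merged = merge_on_id(demographics, labs, "patient_id")
for row in merged:
    print(row)
```

Storing the signature alongside an analysis log gives a quick check that the data analysed are the data that were locked, while the status column makes unmatched identifiers visible rather than silently dropped.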

Stata and maintaining CDISC standards. How does it compare to SAS?

Stata and SAS are both statistical software packages commonly used in the fields of data analysis, including in the pharmaceutical and clinical research industries. While they share some similarities, there are notable differences between the two when it comes to working with CDISC standards:

1. CDISC Support:

SAS has extensive built-in support for CDISC standards. SAS provides specific modules and tools, such as SAS Clinical Standards Toolkit, which offer comprehensive functionalities for CDASH, SDTM, and ADaM. These modules provide pre-defined templates, libraries, and validation rules, making it easier to implement CDISC standards directly within the SAS environment. Stata, on the other hand, does not have native, dedicated modules specifically designed for CDISC standards. However, Stata’s flexibility allows users to implement CDISC guidelines through custom programming and data manipulation.

2. Data Transformation:

SAS has robust built-in capabilities for transforming data into SDTM and ADaM formats. SAS provides specific procedures and functions tailored for SDTM and ADaM mappings, making it relatively straightforward to convert datasets into CDISC-compliant formats. Stata, while lacking specific CDISC-oriented features, offers powerful data manipulation functions that allow users to reshape, merge, and transform datasets. Stata users may need to develop custom programming code to achieve CDISC transformations.

3. Industry Adoption:

SAS has been widely adopted in the pharmaceutical industry and is often the preferred choice for CDISC-compliant data management and analysis. Many pharmaceutical companies, regulatory agencies, and clinical research organizations have established workflows and processes built around SAS for CDISC standards. Stata, although less commonly associated with CDISC implementation, is still a popular choice for statistical analysis across various fields, including healthcare and social sciences. Stata has the potential to make adherence to CDISC standards a more affordable option for small companies and therefore an increased priority.

4. Learning Curve and Community Support:

SAS has long been the default preference in the context of CDISC compliance and is what statistical programmers are used to. It is known for its comprehensive documentation and extensive user community, and resources including training materials, user forums, and user groups can facilitate learning and support for CDISC-related tasks. Stata also has an active user community and provides detailed documentation, but its community may be comparatively smaller in the context of CDISC-specific workflows. Stata has the advantage of reducing the amount of programming required to achieve CDISC compliance, for example in the creation of SDTM and ADaM data sets.
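To make the idea of custom programming for CDISC transformations (item 2) more concrete, here is a minimal sketch of mapping raw demographics into an SDTM-style DM structure. It is written in Python purely for illustration; the same steps would be written as ordinary data management code in Stata. The variable names STUDYID, DOMAIN, USUBJID, SUBJID, SITEID, AGE, AGEU and SEX follow the SDTM demographics domain, but the raw field names, the study identifier, the way USUBJID is composed and the checks in `check_dm` are assumptions, and this is in no way a validated implementation.

```python
STUDY_ID = "ABC-101"  # hypothetical study identifier

# Raw free-text values mapped to coded values; anything unrecognised becomes "U".
SEX_MAP = {"male": "M", "female": "F"}


def to_dm_record(raw):
    """Map one raw demographics record to SDTM-style DM variable names."""
    site = str(raw["site"]).zfill(3)
    subject = str(raw["subject"]).zfill(4)
    return {
        "STUDYID": STUDY_ID,
        "DOMAIN": "DM",
        # Assumed convention: study, site and subject joined into one unique key.
        "USUBJID": f"{STUDY_ID}-{site}-{subject}",
        "SUBJID": subject,
        "SITEID": site,
        "AGE": int(raw["age_years"]),
        "AGEU": "YEARS",
        "SEX": SEX_MAP.get(str(raw["gender"]).strip().lower(), "U"),
    }


def check_dm(records):
    """Return a list of problems; an empty list means these basic checks passed."""
    problems = []
    seen = set()
    for rec in records:
        if rec["USUBJID"] in seen:
            problems.append(f"duplicate USUBJID {rec['USUBJID']}")
        seen.add(rec["USUBJID"])
        if rec["SEX"] == "U":
            problems.append(f"{rec['USUBJID']}: sex could not be mapped")
    return problems


# Raw records as they might arrive from data capture, invented for illustration.
raw_rows = [
    {"site": 1, "subject": 7, "age_years": "54", "gender": "Female"},
    {"site": 1, "subject": 12, "age_years": 61, "gender": "male "},
    {"site": 2, "subject": 3, "age_years": 47, "gender": "n/a"},
]

dm = [to_dm_record(row) for row in raw_rows]
issues = check_dm(dm)

for rec in dm:
    print(rec)
print(issues)
```

Keeping the mapping and the checks as separate, documented steps is what makes a hand-written transformation auditable, whichever software it is written in.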

While SAS offers dedicated modules and tools specifically designed for CDISC standards, Stata provides flexibility and powerful data manipulation capabilities that can be leveraged to implement CDISC guidelines. The choice between SAS and Stata for CDISC-related work may depend on factors such as industry norms, organizational preferences, existing infrastructure, and individual familiarity with the software.

While SAS has historically been more explicitly associated with regulatory compliance in the clinical trial domain, Stata is fully equipped to fulfil regulatory requirements and has been utilised effectively in clinical research for decades. Researchers often choose the software they are most comfortable with and consider factors such as data analysis capabilities, familiarity, and support when deciding between SAS and Stata for their regulatory compliance needs.

It is important to note that compliance requirements can vary based on specific regulations and guidelines. Researchers are responsible for ensuring their analysis and reporting processes align with the appropriate regulatory standards and should consult relevant regulatory authorities when necessary.

The Devil’s Advocate: Stata for Clinical Study Design, Data Processing, & Statistical Analysis of Clinical Trials.

Stata is a powerful statistical analysis software that offers some advantages for clinical trial and medtech use cases compared to the more widely used SAS software. Stata provides an intuitive and user-friendly interface that facilitates efficient data management, data processing and statistical analysis. Its agile and concise syntax allows for reproducible and transparent analyses, enhancing the overall research process with more readily accessible insights.

Distinct from R, which incorporates S-based coding, both Stata and SAS have been implemented in C since 1985. All three packages can parse full Python within their environment for advanced machine learning capabilities, in addition to those available natively. In Stata’s case this is achieved through the pystata Python package. Despite this common C-based foundation, there are tangible differences between Stata and SAS syntax. Stata generally needs fewer lines of code than SAS to perform the same function and thus tends to be more concise. Stata also offers more flexibility in how you code, as well as more informative error statements, which makes debugging a quick and easy process, even for beginners.

When it comes to simulations and more advanced modelling, our experience has been that the Basic Edition of Stata (BE) is faster and uses less memory to perform the same task compared to Base SAS. Stata BE certainly has more inbuilt capabilities than you would ever need for the design and analysis of advanced clinical trials and sophisticated statistical modelling of all types. There is also the additional benefit of thousands of user-built packages, such as the popular interface to WinBUGS, that can be instantly installed as add-ons at no extra cost. Often these packages are designed to make existing Stata functions even more customisable for immense flexibility and programming efficiency. Both Stata and SAS represent stability and reliability and have enjoyed widespread industry adoption. SAS has been more widely adopted by big pharma, and Stata more so in public health and economic modelling.

It has been nearly a decade since the Biostatistics Collaboration of Australia (BCA), which determines biostatistics education nationwide, transitioned from teaching SAS and R as part of their Masters of Biostatistics programs to teaching Stata and R. This transition was initially made in anticipation of an industry-wide shift from SAS to Stata. Whether their predictions were accurate or not, the case for Stata use in clinical trials remains strong.

Stata is almost certainly a superior option for bootstrapped life science start-ups and SMEs. Stata licensing fees are in the low hundreds of pounds, with the ability to purchase quickly over the Stata website, while SAS licensing fees span the tens to hundreds of thousands and often involve a drawn-out process just to obtain a precise quote.

Working with a CRO that is willing to use Stata means that you can easily re-run any syntax provided from the study analysis to verify or adapt it later. Of course, open-source software such as R is also available, however Stata has the advantage of a reduced learning curve being both user-friendly and sufficiently sophisticated.

Stata for clinical trials

1. Industry Adoption:

Stata has gained significant popularity and widespread adoption in the field of clinical research. It is commonly used by researchers, statisticians, and healthcare professionals for the statistical analysis of clinical data.

2. Regulatory Compliance and CDISC standardisation:

Stata provides features and capabilities that support regulatory compliance requirements in clinical trials. While it may not have the same explicit recognition from CDISC as SAS, Stata does lend itself well to CDISC compliance and offers tools for documentation, data tracking, and audit trails to ensure transparency and reproducibility in analyses.

3. Comprehensive Statistical Procedures:

A key advantage of Stata is its extensive suite of built-in statistical functions and commands specifically designed for clinical trial data analysis. Stata offers methods for handling missing data and performing power calculations, together with a wide range of methods for analysing clinical trial data, from survival analysis, generalized linear models and mixed-effects models to causal inference and Bayesian simulation for adaptive designs. Preparatory tasks for clinical trials such as meta-analysis, sample size calculation and randomisation schedules are arguably easier to achieve in Stata than SAS. These built-in functionalities empower researchers to conduct various analyses within a single software environment.

4. Efficient Data Management:

Stata excels in delivering agile data management capabilities, enabling efficient data handling, cleaning, and manipulation. Its intuitive data manipulation commands allow researchers to perform complex transformations, merge datasets, handle missing data, and generate derived variables seamlessly.

Perhaps the greatest technical advantage of Stata over SAS in the context of clinical research is usability, along with greater freedom to keep open and refer to multiple data sets with multiple separate analyses at the same time. While SAS can keep many data sets in memory for a single project, Stata can keep many data sets in siloed memory for simultaneous use in different windows, making it possible to view or work on many different projects at once. This approach can make workflow easier because no data step is required to identify which data set you are referring to; instead, the appropriate sections of any data set can be merged with the active project as needed. Because siloing works similarly to tabs in a browser, the log, data or output of one project does not get mixed up with another. This is arguably an advantage for biostatisticians and researchers alike, who typically do need to compare unrelated data sets or the statistical results from separate studies side by side.

5. Interactive and Reproducible Analysis:

Stata provides an interactive programming environment that allows users to perform data analysis in a step-by-step manner. The built-in “do-file” functionality facilitates reproducibility by capturing all commands and results, ensuring transparency and auditability of the analysis process. The results and log window for each data set prints out the respective syntax required item by item. This syntax can easily be pasted into the do-file or the command line to edit or repeat the command with ease. SAS on the other hand tends to separate the results from the syntax used to derive it.

6. Graphics and Visualization:

While not traditionally known for this, Stata actually offers a wide range of powerful and customizable graphical capabilities. Researchers can generate high-quality, publication-standard plots and charts of any description needed to visualise clinical trial results. Common examples include survival curves, forest plots, spaghetti plots and diagnostic plots. Stata also has built-in options to perform all necessary assumption and model checking for statistical model development.

These visualisations facilitate the exploration of complex data patterns, as well as the presentation and communication of findings. There are many user-created customisation add-ons for data visualisation that rival what is possible in R.

The one area of Stata that users may find limiting is that, by default, each new graph replaces the previous one for the active data set. Unless graphs are given distinct names, this means that you do need to copy graphs as they are produced and save them into a document to compare multiple graphs side by side.

7. Active User Community and Support:

Like SAS, Stata has a vibrant user community comprising researchers, statisticians, and experts who actively contribute to discussions, share knowledge, and provide support. StataCorp, the company behind Stata, offers comprehensive documentation, online resources, and user forums, ensuring users have access to valuable support and assistance when needed. Often the resources available for Stata are more direct and more easily searchable than what is available for SAS when it comes to solving customisation quandaries. This is of course bolstered by the availability of myriad instant package add-ons.

Stata’s active and supportive user community is a notable advantage. Researchers can access extensive documentation, online forums, and user-contributed packages, which promote knowledge sharing and facilitate problem-solving. Additionally, Stata’s reputable technical support ensures prompt assistance for any software-related queries or challenges.
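As an illustration of two of the preparatory tasks mentioned in item 3, the sketch below computes a sample size and builds a permuted-block randomisation schedule. It is written in Python for illustration and is not Stata output. The sample size uses the standard normal-approximation formula for comparing two means, n per arm = 2(z_{1-α/2} + z_{1-β})² σ² / δ². The function names, block size, arm labels, seed and the example effect size are all assumptions.

```python
import math
import random
from statistics import NormalDist


def n_per_arm(delta, sd, alpha=0.05, power=0.80):
    """Normal-approximation sample size per arm for a two-sided comparison of two means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)


def block_schedule(n_participants, block_size=4, arms=("A", "B"), seed=2024):
    """Permuted-block allocation list; every complete block is balanced across arms."""
    if block_size % len(arms) != 0:
        raise ValueError("block_size must be a multiple of the number of arms")
    rng = random.Random(seed)  # fixed seed so the schedule can be regenerated for audit
    schedule = []
    while len(schedule) < n_participants:
        block = list(arms) * (block_size // len(arms))
        rng.shuffle(block)
        schedule.extend(block)
    # The final block is truncated if n_participants is not a multiple of block_size.
    return schedule[:n_participants]


# Hypothetical design: detect a 5-point difference when the standard deviation is 10.
n = n_per_arm(delta=5, sd=10)
schedule = block_schedule(2 * n)

print("participants per arm:", n)
print("first two blocks:", schedule[:8])
```

A t-based calculation, as dedicated power software typically performs, would give a slightly larger number than this normal approximation, so the sketch should be read as a first estimate only.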

While SAS and Stata have their respective strengths, Stata’s increasing industry adoption, statistical capabilities, data management features, reproducibility, visualisation add-ons, and support community make it a compelling choice for clinical trial data analysis.

As it stands, SAS remains the most widely used software in big pharma for clinical trial data analysis. Stata, however, offers distinct advantages in terms of user-friendliness, tailored statistical functionalities, advanced graphics, and a supportive user community. Consider adopting Stata to streamline your clinical trial analyses and unlock its vast potential for gaining insights from research outcomes. An in-depth overview of Stata 18 can be found here. A summary of its features for biostatisticians can be found here.

Further reading:

Using Stata for Handling CDISC Compliant Data Sets and Outputs (lexjansen.com)

Dynamic Systems Modelling and Complex Adaptive Systems (CAS) Techniques in Biomedicine and Public Health

Dynamical systems modelling is a mathematical approach to studying the behaviour of systems that change over time. These systems can be physical, biological, economic, or social in nature, and they are typically characterized by a set of variables that evolve according to certain rules or equations.

CAS (Complex Adaptive Systems) models are a specific type of dynamical systems model that are used to study systems that are complex, adaptive, and composed of many interconnected parts. These systems are often found in natural and social systems, and they are characterized by a high degree of uncertainty, nonlinearity, and emergence.

To build a dynamical systems model, one typically starts by identifying the variables that are relevant to the system being studied and the relationships between them. These relationships are usually represented by a set of equations or rules that describe how the variables change over time. The model is then simulated or analysed to understand the system’s behaviour under different conditions and to make predictions about its future evolution.

CAS models are often used to study systems that exhibit emergent behaviour, where the behaviour of the system as a whole is more than the sum of its parts. These models can help us understand how complex systems self-organize, adapt, and evolve over time, and they have applications in fields such as biology, economics, social science, and computer science.

Whatever the approach, a model is intended to represent the real system, but there are limits to the application of models. The reliability of any model often falls short when the model, with its parameter boundaries, is applied to a real-life context.

The previous article outlined some basic characteristics of complex adaptive systems (CAS). The CAS approach to modelling real-world phenomena requires a different approach from the more conventional predictive modelling paradigm. Complex adaptive systems such as ecosystems, biological systems, or social systems require looking at interacting elements and observing the patterns that arise, creating boundary conditions from these patterns, running experiments or simulations, and responding to the outcomes in an adaptive way.

To further delineate the complex systems domain in practical terms we can use the Cynefin framework developed by David Snowden et al. to contrast the Simple, Complicated, Complex and Chaotic domains. For the purpose of this article the Chaotic domain will be ignored.

Enabling constraints of CAS models

In contrast to the complex domain is the “known” or “simple” domain, represented by ordered systems such as a surgical operating theatre or a clinical trials framework. These ordered systems are rigidly constrained and can be planned and designed in advance based upon prior knowledge. In this context best practice can be applied because an optimal way of doing things is pre-determined.

The intermediary between the simple and complex domains is the “knowable” or “complicated” domain. An example is the biostatistical analysis of basic clinical data. Within a complicated system there is a right answer that we can discover and design for. In this domain we can apply good practice based on expert advice (not best practice), as a right and wrong way of doing things can be determined with analysis.

The complex domain represents a system that is in flux and not predictable in the linear sense. A complex adaptive system can be operating in a state that is anywhere from equilibrium to the edge of chaos. In order to understand the system state one should perform experiments that probe relationships between entities. Due to the lack of linearity, multiple experimental probes should occur in parallel, not in sequence, with the goal of better understanding processes. Emergent practice is determined in line with observed, evolving patterns. Ideally, interpretation of data should be decentralised and distributed to system users themselves rather than determined by a single expert in a centralised fashion.

As opposed to operating from a pre-supposed framework, the CAS structure should be allowed to emerge from the data under investigation. This avoids the confirmation bias that occurs when data are fitted to a predefined framework regardless of whether this framework best represents the data being modelled. Following on from this, model boundaries should also be allowed to emerge from the data itself.

Determining unarticulated needs from clusters of agent anecdotes or data points is one method of identifying where improvement needs to occur in service provision systems. An analogous approach applies to biological systems, for example if an agent-based model (ABM) were applied in a biomolecular context.

In understanding CAS, the dispositionality of system states, rather than linear causality, should be the focus. Rather than presuming an inherent certainty that “if I do A, B will result”, dispositional states arise as a result of A, which may result in B, but the evolution of which cannot be truly predicted.

“The longer you hold things in a state of transition the more options you’ve got.” Rather than proceeding through linear iterations based on a fully defined requirement, CAS modelling retains a degree of ambiguity, which should be explored rather than eliminated. This is the opposite of the standard statistical approach.

CAS modelling should include real-time feedback loops over multiple agents to avoid cognitive bias. In CAS modelling, every behaviour or interaction will produce unintended consequences. For this reason, David Snowden suggests that small, fast experiments should be run in parallel, so that any bad unintended consequences can be mitigated and the good ones amplified.

Modes of analysis and modelling:

System dynamics models (SDM)

  • A SDM simulates the movements of entities within the system and can be used to investigate macro behaviour of the system.
  • Changes to system state variables over time are modelled using differential equations.
  • SDMs are multi-dimensional, non-linear and include feedback mechanisms.

  • Visual representations of the model can be produced using stock and flow diagrams to summarise interdependencies between key system variables.
  • Dynamic hypotheses of the system model can be represented in a causal loop diagram
  • SDM is appropriate for modelling aggregate flows, trends, sub-system behaviour.
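The points above can be sketched with a very small system dynamics model. The example below is an assumption chosen for illustration, not a model taken from this article: a classic susceptible-infectious-recovered (SIR) structure with three stocks, two flows and a feedback loop, integrated with simple Euler steps. The function name, parameter values and step size are all hypothetical.

```python
def simulate_sir(s0, i0, r0, beta, gamma, days, dt=0.1):
    """Integrate dS/dt = -beta*S*I/N, dI/dt = beta*S*I/N - gamma*I, dR/dt = gamma*I."""
    s, i, r = float(s0), float(i0), float(r0)
    n = s + i + r
    history = [(0.0, s, i, r)]
    steps = int(round(days / dt))
    for step in range(1, steps + 1):
        # Flows between stocks. The infection flow depends on I itself,
        # which is the feedback loop that makes the system non-linear.
        infection_flow = beta * s * i / n
        recovery_flow = gamma * i
        s -= infection_flow * dt
        i += (infection_flow - recovery_flow) * dt
        r += recovery_flow * dt
        history.append((step * dt, s, i, r))
    return history


# Hypothetical parameters: 1,000 people, 10 initially infectious.
history = simulate_sir(s0=990, i0=10, r0=0, beta=0.3, gamma=0.1, days=160)

peak_time, _, peak_infectious, _ = max(history, key=lambda row: row[2])
print(f"peak of about {peak_infectious:.0f} infectious at day {peak_time:.1f}")
print(f"recovered by day 160: {history[-1][3]:.0f}")
```

The model tracks aggregate stocks only, so it describes macro behaviour such as the timing and size of the peak but says nothing about which individuals are affected.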

Agent based models (ABM)

  • ABMs can be used to investigate micro behaviour of the system from more of a bottom-up perspective, through intricate flows of individual-based activity.
  • State changes of individual agents are simulated by ABMs, rather than the broader entities captured by SDM.
  • Multiple types of agent can operate within the same modelled complex adaptive system.
  • Data within the ABM can be aggregated to infer more macro or top-down system behaviour.

Agents within the ABM can make decisions, engage in behaviour defined by simple rules and attributes, and learn from experience and from feedback from interactions with other agents or the modelled environment. This is as true in models of human systems as it is in molecular scale systems. In both cases agents can partake in communication on a one to one, one to many and one to location basis. Previously popular models such as discrete event simulation (DES) were implemented to model passive agents at a finite time, rather than the active “decision makers” over dynamic periods that are a feature of ABMs.
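
The following Python sketch illustrates the idea of agents following a simple rule and interacting one to one, with individual states then aggregated into a macro-level trajectory. The `Agent` class, the `informed` attribute and the transmission probability are hypothetical choices made purely for illustration.

```python
import random


class Agent:
    """A hypothetical agent with one attribute and one simple behavioural rule."""

    def __init__(self, informed=False):
        self.informed = informed

    def interact(self, other, p_transmit, rng):
        # One-to-one communication: an uninformed agent may learn from an informed one.
        if other.informed and not self.informed and rng.random() < p_transmit:
            self.informed = True


def run_abm(n_agents=100, n_steps=30, p_transmit=0.3, seed=1):
    rng = random.Random(seed)
    agents = [Agent(informed=(i == 0)) for i in range(n_agents)]
    counts = []
    for _ in range(n_steps):
        for agent in agents:
            agent.interact(rng.choice(agents), p_transmit, rng)
        # Aggregating individual states gives the macro-level (top-down) view.
        counts.append(sum(a.informed for a in agents))
    return counts


print(run_abm())
```

Each agent changes state individually, and the population-level curve emerges only when those states are summed.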

Hybrid Models

  • ABM and SDM are complementary techniques for simulating micro and macro level behaviour of complex adaptive systems, and therefore for engaging in exploratory style analysis of such systems.
  • Hybrid models emulate individual agent variability as well as variability in the behaviour of the aggregate entities they compose.
  • Hybrid models can simulate macro and micro level system behaviour in many areas of investigation, such as health service provision and biomedical science.

Hybrid models have the ability to combine two or more types of simulation within the same model. These models can combine SDMs and ABMs, or other techniques, to address both top-down and bottom-up, micro and macro dynamics in a single model that more closely captures whole system behaviour. This has the potential to alleviate many of the necessary trade-offs of using one of the simulation types alone.

As software capability develops we are seeing an increased application of hybrid modelling techniques. Previously widespread techniques such as DES and Markov models, which are one-dimensional, uni-directional and linear, are now proving inadequate to the task of modelling the complex adaptive and dynamic world we inhabit.

Model Validation Techniques

SDMs and ABMs are not fitted to observed data but instead use both qualitative and quantitative real world data to inform and develop the model and its parameters as a simulation of real world phenomena. For this reason model validation of SDMs and ABMs should be even more rigorous than for more traditional models fitted by methods such as maximum likelihood or least squares. Sensitivity analysis and validation tests such as behavioural validity tests can be used to compare model output against real-world data from organisations or experiments, relevant to the scale of the model being validated.

The structure of the model should also be validated, for example by:

  • Checking how the model behaves when subject to extreme parameter values.
  • Checking dimensional consistency, boundary adequacy and mass balance.
  • Sensitivity analysis – how sensitive is the model to changes in key parameters.
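
Two of the checks listed above can be sketched in a few lines of Python: a one-at-a-time sensitivity analysis and an extreme-condition test. The `toy_model` function is a hypothetical stand-in for a real simulation (a single stock with logistic growth), and the 10% perturbation size is an arbitrary assumption.

```python
def toy_model(params, steps=500, dt=0.1):
    """Hypothetical stand-in for a simulation: logistic growth of one stock."""
    x = params["initial"]
    for _ in range(steps):
        x += dt * params["growth"] * x * (1.0 - x / params["capacity"])
    return x


def one_at_a_time_sensitivity(model, baseline, change=0.10):
    """Relative change in model output when each parameter is raised by `change`."""
    base_output = model(baseline)
    result = {}
    for name in baseline:
        perturbed = dict(baseline)
        perturbed[name] = baseline[name] * (1.0 + change)
        result[name] = (model(perturbed) - base_output) / base_output
    return result


baseline = {"initial": 1.0, "growth": 0.5, "capacity": 100.0}
sensitivity = one_at_a_time_sensitivity(toy_model, baseline)
for name, effect in sensitivity.items():
    print(f"{name:9s} +10% -> output changes by {effect:+.1%}")

# Extreme-condition check: with zero growth the stock should never move.
extreme = dict(baseline, growth=0.0)
assert toy_model(extreme) == baseline["initial"]
```

In this toy case the long-run output is driven almost entirely by the capacity parameter, which is the kind of finding a sensitivity analysis is meant to surface before a model is trusted.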

Network Analysis

Data accrual from diverse data sources challenges and limitations

While complex systems theory has its origins in the mathematics of chaos theory, there are many contemporary examples where complex systems theory has been divorced from the mathematics and statistical modelling and applied in diverse fields such as business and healthcare or social services provision. Mathematical modelling adds validity to complex systems analysis. The problem with completing solely qualitative analysis without the empiricism of mathematical modelling, simulation and checking against a variety of real world data sets, the results

Latent Variable Modelling And The Chi Squared Exact Fit Statistic


Latent variable models are exploratory statistical models used extensively throughout clinical and experimental research in medicine and the life sciences in general. Psychology and neuroscience are two key sub-disciplines where latent variable models are routinely employed to answer a myriad of research questions, from the impact of personality traits on success metrics in the workplace (1) to measuring inter-correlated activity of neural populations in the human brain based on neuro-imaging data (2). Through latent variable modelling, dispositions, states or processes which must be inferred rather than directly measured can be linked causally to more concrete measurements.

Latent variable models are exploratory or confirmatory in nature in the sense that they are designed to uncover causal relationships between observable or manifest variables and corresponding latent variables in an inter-correlated data set. They use structural equation modelling (SEM), and more specifically factor analysis techniques, to determine these causal relationships, which allows the testing of numerous multivariate hypotheses simultaneously. A key assumption of SEM is that the model is fully correctly specified. The reason for this is that one small misspecification can affect all parameter estimations in the model, rendering inaccurate approximations which can combine in unpredictable ways (3).

With any postulated statistical model it is imperative to assess and validate the model fit before concluding in favour of the integrity of the model and interpreting results. The acceptable way to do this across all structural equation models is the chi squared (χ²) statistic.

A statistically significant χ² statistic is indicative of the following:

  • A systematically misspecified model, with the degree of misspecification a function of the χ² value.
  • The set of parameters specified in the model does not adequately fit the data, and thus the parameter estimates of the model are inaccurate. As χ² operates on the same statistical principles as the parameter estimation, it follows that in order to trust the parameter estimates of the model we must also trust the χ², and vice versa.
  • As a consequence there is a need for an investigation of where these misspecifications have occurred and a potential readjustment of the model to improve its accuracy.

While one or more incorrect hypotheses may have caused the model misspecification, the misspecification could equally have resulted from other causes. It is thus important to investigate the causes of a significant model fit test. In order to do this properly the following should be evaluated:

  • Heterogeneity:
      • Does the causal model vary between sub groups of subjects?
      • Are there any intervening within subject variables?
  • Independence:
      • Are the observations truly independent?
      • Latent variable models involve two key assumptions: that all manifest variables are independent after controlling for any latent variables, and that an individual’s position on a manifest variable is the result of that individual’s position on the corresponding latent variable (3).
  • Multivariate normality:
      • Is the multivariate normality assumption satisfied?


The study:

A 2015 meta-analysis of 75 latent variable studies drawn from 11 psychology journals has highlighted a tendency in clinical researchers to ignore the χ² exact fit statistic when reporting and interpreting the results of the statistical analysis of latent variable models (4).
97% of papers reported at least one appropriate model, despite the fact that 80% of these did not pass the criteria for model fit and the χ² exact fit statistic was ignored. Only 2% of studies overall concluded that the model does not fit at all, and one of these interpreted a model anyway (4).
The reasons given for ignoring the model fit statistic were that it is overly sensitive to sample size, that it penalises models when the number of variables is high, and a general objection to the logic of the exact fit hypothesis. Overall there was a broad consensus of preference for approximate fit indices (AFI).
AFI are instead applied in these papers to justify the models. This typically leads to questionable conclusions. In all, just 41% of studies reported χ² model fit results. 40% of the studies that failed to report a p value for the reported χ² value did report the degrees of freedom. When these degrees of freedom were used to cross check the unreported p values, all non-reported p values were in fact significant.
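
The cross check described above, recovering an unreported p value from a reported χ² and its degrees of freedom, can be sketched with the Python standard library alone. The function name `chi2_sf` and the example values are assumptions made for illustration; the upper tail probability is computed from the regularised incomplete gamma function using a standard series and continued fraction.

```python
import math


def chi2_sf(x, df):
    """Upper-tail p value of a chi-squared statistic x with df degrees of freedom."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    log_prefactor = -z + a * math.log(z) - math.lgamma(a)
    if z < a + 1.0:
        # Series expansion of the lower regularised incomplete gamma function P(a, z).
        term = 1.0 / a
        total = term
        n = a
        while abs(term) > abs(total) * 1e-15:
            n += 1.0
            term *= z / n
            total += term
        return 1.0 - total * math.exp(log_prefactor)
    # Continued fraction (modified Lentz method) for the upper function Q(a, z).
    tiny = 1e-300
    b = z + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, 1000):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        if abs(d) < tiny:
            d = tiny
        c = b + an / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < 1e-15:
            break
    return h * math.exp(log_prefactor)


# Hypothetical reported values: chi-squared = 85.3 on 40 degrees of freedom.
print(f"p = {chi2_sf(85.3, 40):.2g}")
```

Where a statistics library is available, its chi-squared survival function would normally be preferred over a hand-written routine such as this one.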
The model fit function was usually generated through maximum likelihood methods, however 43% of studies failed to report which fit function was used.
There was a further tendency to accept the approximate fit hypothesis when in fact there was little or no evidence of approximate fit. This lack of thorough model examination yields empirical evidence of questionable validity. 30% of studies showed custom selection of more lax cut-off criteria for the approximate fit statistics than was conventionally acceptable, while 53% failed to report on cut-off criteria at all.
Univariate normality was assessed in only 24% of studies (4).

Further explanation of χ² and model fit:

The larger the data set, the more that increasingly trivial discrepancies are detected as a source of model misspecification. This does not mean that trivial discrepancies become more important to the model fit calculation; it means that the level of certainty with which these discrepancies can be considered important has increased. In other words, the statistical power has increased. Model misspecification can be the result of both theoretically relevant and irrelevant or peripheral causal factors, which both need to be equally addressed. A significant model fit statistic indicating model misspecification is not trivial just because the causes of the misspecification are trivial. It is instead the case that trivial causes are having a significant effect and thus there is a significant need for them to be addressed. The χ² model fit test is the most sensitive way to detect misspecification in latent variable models and should be adhered to above other methods even when sample size is high. In the structural equation modelling context of multiple hypotheses, a rejection of model fit does not necessarily result in the rejection of each of the model’s hypotheses (4).
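
The relationship between sample size and power described above can be sketched numerically. Under maximum likelihood estimation the test statistic is commonly computed as T = (N − 1) × F, where F is the minimised discrepancy between the observed and model-implied covariance matrices. The discrepancy value, degrees of freedom and sample sizes below are hypothetical, and the closed-form p value used here is exact only for even degrees of freedom.

```python
import math


def chi2_sf_even_df(x, df):
    """Exact upper-tail p value of a chi-squared statistic, for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("the closed form used here requires positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def exact_fit_p_value(discrepancy, n, df):
    # Test statistic under maximum likelihood estimation: T = (N - 1) * F.
    return chi2_sf_even_df((n - 1) * discrepancy, df)


# Hypothetical values: the same small discrepancy evaluated at growing sample sizes.
F_MIN, DF = 0.05, 10
p_values = {n: exact_fit_p_value(F_MIN, n, DF) for n in (100, 200, 400, 800)}
for n, p in p_values.items():
    print(f"N = {n:4d}  chi-squared = {(n - 1) * F_MIN:6.2f}  p = {p:.4f}")
```

The misspecification itself is held constant across rows; only the certainty with which it is detected changes.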
Problems with AFI:

The AFI do provide a conceptually heterogeneous set of fit indices for each hypothesis, however none of these indices are accompanied by a critical value or significance level, and all except one arise from unknown distributions. The fit indices are a function of χ², but unlike the χ² fit statistic they do not have a verified statistical basis, nor do they present a statistically rigorous test of model fit. Despite this, satisfactory AFI values across hypotheses are being used to justify the invalidity of a significant χ² test.
Monte Carlo simulations of AFI concluded that it is not possible to determine universal cut-off criteria in any of the forms of model tested. Using AFI, the probability of correctly rejecting a misspecified model decreased with increasing sample size. This is the inverse of the χ² statistic. Another problem with AFI compared to χ² is that the more severe the model misspecification or correlated errors, the more unpredictable the AFI become. Again this is the inverse of what happens with the χ² statistic (4).
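
To illustrate how an approximate fit index is derived from χ², the sketch below computes RMSEA, one commonly reported AFI, from a χ² value, its degrees of freedom and the sample size. The choice of RMSEA and the input values are assumptions made for illustration only; the text above does not single out a particular index.

```python
import math


def rmsea(chi2, df, n):
    """RMSEA computed from a chi-squared statistic, its df and the sample size n."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


# Hypothetical reported values: chi-squared = 39.95 on 10 df with N = 800.
index = rmsea(39.95, 10, 800)
print(f"RMSEA = {index:.3f}")
```

The resulting number is a rescaling of χ² with no significance level of its own, so whether it counts as acceptable depends entirely on the cut-off the analyst chooses.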
The take away:

Based on the meta-analysis the following best practice principles are recommended in addition to adequate attention to the statistical assumptions of heterogeneity, independence and multivariate normality outlined above:

  1. Pay attention to distributional assumptions.
  2. Have a theoretical justification for your model.
  3. Avoid post hoc model modifications such as dropping indicators, allowing cross-loadings and correlated error terms.
  4. Avoid confirmation bias.
  5. Use an adequate estimation method.
  6. Recognise the existence of equivalent models.
  7. Justify causal inferences.
  8. Use clear reporting that is not selective.

Image:  

Michael Eid, Tanja Kutscher, Stability of Happiness, 2014, Chapter 13 – Statistical Models for Analyzing Stability and Change in Happiness
https://www.sciencedirect.com/science/article/pii/B9780124114784000138

References:
(1) Latent Variables in Psychology and the Social Sciences

(2) Structural equation modelling and its application to network analysis in functional brain imaging
https://onlinelibrary.wiley.com/doi/abs/10.1002/hbm.460020104

(3) Chapter 7: Assumptions in Structural Equation modelling
https://psycnet.apa.org/record/2012-16551-007

(4) A cautionary note on testing latent variable models
https://www.frontiersin.org/articles/10.3389/fpsyg.2015.01715/full

Do I need a Biostatistician?


“…. half of current published peer-reviewed clinical research papers … contain at least one statistical error… When just surgical related papers were analysed, 78% were found to contain statistical errors.”

Peer reviewed published research is the go to source for clinicians and researchers to advance their knowledge on the topic at hand. It is also currently the most reliable way available to do this. The rate of change in standard care, together with the exponential development and implementation of innovative treatments and styles of patient involvement, makes keeping up with the latest research paramount. (1)

Unfortunately, almost half of current published peer-reviewed clinical research papers have been shown to contain at least one statistical error, likely resulting in incorrect research conclusions being drawn from the results. When just surgical related papers were analysed, 78% were found to contain statistical errors due to incorrect application of statistical methods. (1)

Compared to 20 years ago all forms of medical research require the application of increasingly complex methodology, acquire increasingly varied forms of data, and require increasingly sophisticated approaches to statistical analysis. Subsequently the meta-analyses required to synthesise these clinical studies are increasingly advanced. Analytical techniques that would have previously sufficed and are still widely taught are now no longer sufficient to address these changes. (1)

The number of peer reviewed clinical research publications has increased over the past 12 years. Parallel to this, the statistical analyses contained in these papers are increasingly complex, as is the sophistication with which they are applied. For example, t tests and descriptive statistics were the go to statistical methodology for many highly regarded articles published in the 1970s and 80s. To rely on those techniques today would be insufficient, both in terms of being scientifically satisfying and, in all likelihood, in meeting current peer-review standards. (1)

Despite this, some concerning research has noted that these basic parametric techniques are still being misunderstood and misapplied reasonably frequently in contemporary research. They are also being increasingly relied upon (in line with the increase in research output) when in fact more sophisticated and modern analytic techniques would be better equipped and more robust in answering given research questions. (1)

Another contributing factor to statistical errors is of course ethical in nature. A recent online survey of consulting biostatisticians in America revealed that inappropriate requests to change or delete data to support a hypothesis were common, as was the desire to mould the interpretation of statistical results to fit in with expectations and established hypotheses, rather than interpreting results impartially. Ignoring violations of statistical assumptions that would deem the chosen statistical test inappropriate, and not reporting missing data that would bias results, were other unethical requests that were reported. (2)

The use of incorrect statistical methodology and tests leads to incorrect conclusions being widely published in peer reviewed journals. Due to the reliance of clinical practitioners and researchers on these conclusions, to inform clinical practice and research directions respectively, the end result is a stunting of knowledge and a proliferation of unhelpful practices which can harm patients. (1)

Often these errors are a result of clinicians performing statistical analyses themselves without first consulting a biostatistician to design the study, assess the data and perform any analyses in an appropriately nuanced manner. Another problem can arise when researchers rely on the statistical techniques of a previously published peer-reviewed paper on the same topic. It is often not immediately apparent whether a statistician was consulted on this established paper. Thus it is not certain whether the established paper took the best approach to begin with. This typically does not stop it becoming a benchmark for future comparable studies or deliberate replications. Further to this, it can very often be the case that the statistical methods used have since been improved upon and other more advanced or more robust methods are now available. It can also be the case that small differences in the study design or collected data between the established study and the present study mean that the techniques used in the established study are not the optimal techniques to address the statistical needs of the present study, even if the research question is the same or very similar.

Another common scenario which can lead to the implementation of non-ideal statistical practices is under-budgeting for biostatisticians on research grant applications. Often biostatisticians are on multiple grants, each with a fairly low amount of funding allocated to the statistical component due to tight or under budgeting. This limits the statistician’s ability to focus substantially on a specific area and make a more meaningful contribution in that domain. A lack of focus prevents them from becoming an expert in this particular niche and from engaging in innovation. This in turn can limit the quality of the science as well as the career development of the statistician.

In order to reform and improve the state and quality of clinical and other research today, institutions and individuals must assign more value to the role of statisticians in all stages of the research process. Two ways to do this are increased budgeting for and in turn increased collaboration with statistical professionals.


References:

(1) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6106004/

(2) https://annals.org/aim/article-abstract/2706170/researcher-requests-inappropriate-analysis-reporting-u-s-survey-consulting-biostatisticians

Transforming Skewed Data: How to choose the right transformation for your distribution

Innumerable statistical tests exist for application in hypothesis testing based on the shape and nature of the pertinent variable’s distribution. If however the intention is to perform a parametric test – such as ANOVA, Pearson’s correlation or some types of regression – the results of such a test will be more valid if the distribution of the dependent variable(s) approximates a Gaussian (normal) distribution and the assumption of homoscedasticity is met. In reality data often fails to conform to this standard, particularly in cases where the sample size is not very large. As such, data transformation can serve as a useful tool in readying data for these types of analysis by improving normality, homogeneity of variance or both.

For the purposes of transforming skewed data, the degree of skewness of a skewed distribution can be classified as moderate, high or extreme. Skewed data will also tend to be either positively (right) skewed with a longer tail to the right, or negatively (left) skewed with a longer tail to the left. Depending upon the degree of skewness and whether the direction of skewness is positive or negative, a different approach to transformation is often required. As a short-cut, uni-modal distributions can be roughly classified into the following transformation categories:


This article explores the transformation of a positively skewed distribution with a high degree of skewness. We will see how four of the most common transformations for skewness – square root, natural log, log to base 10, and inverse transformation – have differing degrees of impact on the distribution at hand. It should be noted that the inverse transformation is also known as the reciprocal transformation. In addition to the transformation methods offered in the table above, the Box-Cox transformation is also an option for positively skewed data that is >0. Further, the Yeo-Johnson transformation is an extension of the Box-Cox transformation which does not require the original data values to be positive.
The following example takes medical device sales in thousands for a sample of 2000 diverse companies. The histogram below indicates that the original data could be classified as “high(er)” positively skewed.
The skew is in fact quite pronounced – the maximum value on the x axis extends beyond 250 (sales volumes beyond 60 are so sparse as to make the extent of the right tail imperceptible) – it is however the highly leptokurtic shape of the distribution that lends this variable to being better classified as high rather than extreme. It is in fact log-normal, which is convenient for the present demonstration. From inspection it appears that the log transformation will be the best fit in terms of normalising the distribution.

Starting with a more conservative option, the square root transformation, a major improvement in the distribution is achieved already. The extreme observations contained in the right tail are now more visible. The right tail has been pulled in considerably and a left tail has been introduced. The kurtosis of the distribution has reduced by more than two thirds.

A natural log transformation proves to be an incremental improvement, yielding the following results:
This is quite a good outcome – the right tail has been reduced considerably while the left tail has extended along the number line to create symmetry. The distribution now roughly approximates a normal distribution. An outlier has emerged at around -4.25, while extreme values of the right tail have been eliminated. The kurtosis has again reduced considerably.

Applying a log to base 10 transformation instead yields the following:
In this case the values span a narrower range: the right tail sits closer in, the left tail extends less far than in the previous example, and the extreme value in the left tail has been brought in to around -2. Because log to base 10 is simply the natural log divided by a constant (ln 10), the shape of the distribution, including its skewness and kurtosis, is the same as for the natural log; only the scale changes. Either log transformation provides an ideal result here – successfully transforming the log-normally distributed sales data to normal.

In order to illustrate what happens when a transformation that is too extreme for the data is chosen, an inverse transformation has been applied to the original sales data below.
Here we can see that the right tail of the distribution has been brought in quite considerably, to the extent of increasing the kurtosis. Extreme values have been pulled in slightly but still extend sparsely out towards 100. The results of this transformation are far from desirable overall.
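
The four transformations compared above can be sketched in Python as follows. The simulated log-normal sample is a hypothetical stand-in for the sales data, and the `skewness` helper is written out here rather than taken from a library. Note that natural log and log to base 10 differ only by a constant factor, so their skewness is identical.

```python
import math
import random


def skewness(values):
    """Sample skewness (third standardised moment, population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


rng = random.Random(42)
# Hypothetical stand-in for the sales data: 2000 strictly positive log-normal values.
sales = [rng.lognormvariate(2.0, 1.0) for _ in range(2000)]

transformations = {
    "raw": lambda x: x,
    "square root": math.sqrt,
    "natural log": math.log,
    "log base 10": math.log10,
    "inverse": lambda x: 1.0 / x,
}

results = {}
for name, transform in transformations.items():
    results[name] = skewness([transform(x) for x in sales])
    print(f"{name:12s} skewness = {results[name]:6.2f}")
```

Comparing a summary statistic such as skewness alongside the histograms makes it easier to judge which transformation has gone far enough and which has gone too far.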

Something to note is that in this case the log transformation has caused data that was previously greater than zero to now be located on both sides of the number line. Depending upon the context, data containing zero may become problematic when interpreting or calculating the confidence intervals of un-back-transformed data. As log(1) = 0, logged values can be kept above zero by adding a constant to the original data so that the minimum raw value becomes >1. Reporting un-back-transformed data can be fraught at the best of times, so back-transformation of transformed data is recommended. Further information on back-transformation can be found here.
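
As a brief sketch of back-transformation, the Python snippet below computes a mean and 95% confidence interval on the natural log scale and then converts them back to the original units with the exponential function. The simulated sample and the use of a normal approximation (z = 1.96) are assumptions made for illustration.

```python
import math
import random
import statistics

rng = random.Random(7)
# Hypothetical strictly positive, right-skewed sample standing in for real data.
raw = [rng.lognormvariate(2.0, 1.0) for _ in range(2000)]

logged = [math.log(x) for x in raw]
mean_log = statistics.fmean(logged)
se_log = statistics.stdev(logged) / math.sqrt(len(logged))

# 95% confidence interval on the log scale (normal approximation, z = 1.96).
lower_log, upper_log = mean_log - 1.96 * se_log, mean_log + 1.96 * se_log

# Back-transform with exp(): this gives the geometric mean and its interval in
# the original units, which is not the same as the arithmetic mean of the raw data.
geometric_mean = math.exp(mean_log)
ci = (math.exp(lower_log), math.exp(upper_log))

print(f"geometric mean = {geometric_mean:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"arithmetic mean of raw data = {statistics.fmean(raw):.2f}")
```

The back-transformed interval is asymmetric around the geometric mean, which reflects the skew of the original data.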

Adding a constant to data is not without its impact on the transformation. As the below example illustrates, the effectiveness of the log transformation on the above sales data is diminished in this case by the addition of a constant to the original data.
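
The effect of adding a constant before a log transformation can be sketched by comparing the skewness of log(x) with that of log(x + c). The simulated data and the constant c = 1 are hypothetical; because every raw value is greater than zero, adding 1 guarantees that all logged values are positive.

```python
import math
import random


def skewness(values):
    """Sample skewness (third standardised moment, population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


rng = random.Random(42)
# Hypothetical log-normal sample; some values fall below 1, so their logs are negative.
raw = [rng.lognormvariate(2.0, 1.0) for _ in range(2000)]

CONSTANT = 1.0  # shifts the minimum raw value above 1
plain_logs = [math.log(x) for x in raw]
shifted_logs = [math.log(x + CONSTANT) for x in raw]

skew_plain = skewness(plain_logs)
skew_shifted = skewness(shifted_logs)
print(f"log(x):     min = {min(plain_logs):6.2f}, skewness = {skew_plain:5.2f}")
print(f"log(x + 1): min = {min(shifted_logs):6.2f}, skewness = {skew_shifted:5.2f}")
```

The shift compresses the left tail while leaving the right tail almost unchanged, so some positive skew remains after the transformation.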

Depending on the subsequent intentions for analysis this may be the preferred outcome for your data – it is certainly an adequate improvement and has rendered the data approximately normal for most parametric testing purposes.

Taking the transformation a step further and applying the inverse transformation to the sales + constant data again leads to a less optimal result for this particular set of data – indicating that the skewness of the original data is not quite extreme enough to benefit from the inverse transformation.

It is interesting to note that the peak of the distribution has been reduced, whereas an increase in leptokurtosis occurred for the inverse transformation of the raw distribution. This serves to illustrate how a small alteration in the data can completely change the outcome of a data transformation without necessarily changing the shape of the original distribution.

There are many varieties of distribution, with the below diagram depicting only the most frequently observed. If common data transformations have not adequately ameliorated your skewness, it may be more reasonable to select a non-parametric hypothesis test that is based on an alternate distribution.

Image credit: cloudera.com