Introducing Anatomise Biostats’ New Modular Pricing Strategy for Clinical Trial Projects


We are excited to announce a significant update to the pricing strategy for our clinical trial services. In response to feedback from our valued clients and industry stakeholders, Anatomise Biostats is introducing a modular pricing strategy. This new approach replaces our previous hourly, retainer, and project-based pricing models to offer greater transparency, efficiency, and value to our clients in the medtech, medical device, and pharmaceutical sectors.

What’s New?

Starting immediately, the modular pricing strategy will apply to all new clinical trial projects, with the following key features:

  • Base Package: A minimum commitment of £200,000 per annum, billed in six-monthly instalments. This base package covers core services, including biostatistics, quality control, and standard programming.
  • Add-On Services: For clients requiring additional expertise, we offer services such as bioinformatics or data science for an additional £50,000 per annum.
  • Multiple Concurrent Trials: If your project involves multiple concurrent clinical trials, an additional fee of £50,000 will apply for statistical programming to account for the complexity and scope of the work.
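To give a hypothetical worked example: a sponsor running two concurrent trials and also requiring bioinformatics support would pay the £200,000 base package plus £50,000 for the add-on service and £50,000 for the additional statistical programming – a total of £300,000 per annum.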

Why Are We Making These Changes?

At Anatomise Biostats, we strive to provide services that are not only robust and reliable but also tailored to the evolving needs of our clients. After careful consideration and extensive consultation with industry stakeholders, we have implemented these changes for the following reasons:

  1. Responding to Industry Feedback
    Our clients have consistently voiced the need for greater predictability and clarity in pricing. The modular strategy directly addresses these concerns by offering a clear, upfront pricing model.
  2. Simplifying the Bidding Process
    Typically, when engaging with a new vendor, clinical teams spend a lot of time in preliminary discussions trying to assess resource allocation and overall costs before any agreements are made. This can be time-consuming and uncertain. With our new pricing model, you’ll have a clear idea of costs before those initial discussions even begin, saving both time and resources.
    By presenting upfront pricing, we streamline the proposal process, enabling these discussions to be more productive and focused. There’s no need for lengthy back-and-forth before signing an NDA, and you can move faster towards getting a project underway.
  3. Standardising Pricing for Consistency
    In an industry where every project is unique, maintaining consistency in pricing can be a challenge. Our modular pricing strategy simplifies this by providing standardised rates that apply across various types of projects, making financial planning and forecasting much easier.
  4. Efficiency in Onboarding and Project Execution
    One of the key inefficiencies in traditional pricing models is the extensive time and money spent on human resources, onboarding, and integrating consultants. We have calculated that millions are wasted annually on these activities, which detract from the primary goal: delivering high-quality clinical trial results.
    With our modular approach, we eliminate the need for interviewing, background checks, and hours spent on induction SOPs, which are typical in Functional Service Provider (FSP) models. This allows us—and you—to focus on what truly matters: the success of your clinical trial.

Additional Benefits of the Modular Pricing Model

  1. Improved Budget Control and Forecasting
    The modular pricing model provides a clear understanding of costs upfront, allowing for better budget control and financial forecasting. This transparency helps avoid unexpected expenses later in the project, enabling you to allocate resources more effectively throughout the clinical trial lifecycle.
  2. Flexibility and Scalability
    Our modular approach offers flexibility, enabling clients to add or remove services as their project evolves. As your clinical trial progresses, you can easily scale up services such as bioinformatics or statistical programming as needed, without disrupting the workflow.
  3. Optimised Resource Allocation
    By knowing upfront what services are included and at what cost, your clinical teams can more efficiently allocate internal resources. This ensures that your in-house team remains focused on core project tasks, while specialised work is seamlessly managed by Anatomise Biostats.
  4. Reduced Administrative Burden
    With a clear pricing structure in place, there’s no need for protracted negotiations or multiple rounds of contract discussions. This reduces the administrative burden for both sides, allowing your team to concentrate on project execution rather than managing unnecessary paperwork or HR processes.
  5. Shorter Time to Project Start
    Our upfront, standardised pricing model allows projects to commence more quickly by eliminating the delays associated with cost discussions and vendor evaluations. This means faster project initiation, helping you meet critical timelines more efficiently.
  6. Clearer Accountability and Deliverables
    With defined packages and associated costs, there is greater clarity around deliverables. Both parties have a mutual understanding of what will be delivered and when, ensuring stronger project management and accountability throughout the process.

Our Commitment to Streamlining Your Experience

At the heart of this new pricing model is our commitment to making the clinical trial process more efficient, predictable, and ultimately profitable for our clients. By minimising unnecessary HR expenses, simplifying the proposal process, and removing bottlenecks in onboarding, we aim to streamline the experience of working with us.

Our ultimate goal is to make it easier and more straightforward for you to collaborate with us. With the modular pricing model, discussions with your clinical teams can focus on project strategy and deliverables, rather than getting bogged down in cost negotiations or time-consuming vendor assessments. This allows for faster decision-making and more effective project launches, helping you meet your timelines with greater confidence.

We understand that every clinical trial is a critical investment. Our goal is to support you with transparent, predictable pricing that allows you to focus on what matters most—moving your pipeline forward and achieving commercial success.

If you have any questions about this new pricing strategy or would like to discuss your upcoming project, please reach out to our team.

We look forward to helping you bring your innovations to market faster and more efficiently with Anatomise Biostats.

Distributed Ledger Technology for Clinical & Life Sciences Research: Some Use-Cases for Blockchain & Directed Acyclic Graphs

Applications of blockchain and other distributed ledger technology (DLT) such as directed acyclic graphs (DAG) to clinical trials and life sciences research are rapidly emerging.

Distributed ledger technology (DLT) has the potential to solve a myriad of problems that currently plague data collection, management and access in clinical and life sciences research, including clinical trials. DLT offers an innovative way of operating in environments where trust and integrity are paramount: paradoxically, it removes the need to trust any individual component while providing full transparency into the operation of the platform as a whole.

Currently, the two predominant forms of DLT are blockchain and directed acyclic graphs (DAGs). While quite distinct from one another, the two technologies were developed to serve similar purposes and address the same broad goals. In practice, blockchain and DAGs may each have optimal use-cases that differ in nature, or be better equipped to serve different goals – the nuances of which are best determined on a case-by-case basis.

Bitcoin is the first and best-known application of blockchain, but blockchain goes well beyond the realm of Bitcoin and cryptocurrency use-cases. One of the earliest and currently predominant DAG-based DLT platforms is IOTA, which has proven itself in a plethora of use-cases that go well beyond what blockchain can currently achieve, particularly within the realm of the Internet of Things (IoT). In fact, IOTA has operated an industry data marketplace since 2017, making it possible to store data streams, sell them via micro-transactions and access them via a web browser. For the purposes of this article we will focus on DLT applications in general and include use-cases in which blockchain or DAGs can be employed interchangeably. Before we begin, what is distributed ledger technology?

The IOTA Tangle has already been implemented in a range of use-cases that may be beneficially translated to clinical and life sciences research.

Source: iota.org. IOTA's Tangle is an example of directed acyclic graph (DAG) distributed ledger technology. IOTA has been operating an industry data marketplace since 2017.
DLT is a decentralised digital system that can be used to store data and record transactions in the form of a ledger or smart contract. Smart contracts can be set up to form a pipeline of conditioned (if-then) events, or transactions, much like an escrow in finance, which are shared across nodes on the network. Nodes both store data and process transactions, with multiple (if not all) nodes accommodating each transaction – hence the decentralisation. Transactions themselves are a form of dynamic data, while a data set is an example of static data.

Both blockchain and DAGs employ advanced cryptographic algorithms which, to date, have made them extremely difficult to compromise. This is a huge benefit in the context of sensitive data collection, such as patient medical records or confidential study data: data can be kept secure, private and untampered with, and shared efficiently with whomever requires access. Because each interaction or transaction is recorded, the integrity of the data is upheld in what is considered a "trustless" exchange. Because data is shared across multiple nodes for all involved to witness, records become harder to manipulate or change in an underhanded way. This is important in the collection of patient records or experimental data destined for statistical analysis. Any alterations to the data are recorded across the network for all participants to see, enabling true transparency. Transactions can take the form of smart contracts which are time-stamped and tied to a participant's identity via digital signatures.

In this sense DLT is able to speed up transactions and processes, while reducing cost, due to the removal of a middle-man or central authority overseeing each transaction or transfer of information. DLT can be public or private in nature. A private blockchain, for example, does have a trusted intermediary who decides who has access to the blockchain, who can participate on the network, and which data can be viewed by which participants. In the context of clinical and life sciences research this could be a consortium of interested parties, i.e. the research team, or an industry regulator or governing body. In a private blockchain, the transactions themselves remain decentralised, while the blockchain itself has built-in permission layers that allow full or partial visibility of data depending upon the stakeholder. This is necessary in the context of sharing anonymised patient data and blinding in randomised controlled trials.
Blockchain and Hashgraph are two examples of distributed ledger technology (DLT) with applications which could achieve interoperability across healthcare, medicine, insurance, clinical trials and life sciences research.

Source: Hedera Hashgraph whitepaper. Blockchain and Hashgraph are two examples of distributed ledger technology (DLT).
Due to the immutable nature of each ledger transaction, or smart contract, stakeholders are unable to alter or delete study data without consensus across the whole network. In this situation, an additional transaction is recorded and time-stamped on the blockchain, while the original transaction, which recorded the data in its original form, remains intact. This property helps to reduce the incidence of human error, such as data entry error, as well as any underhanded alterations with the potential to sway study outcomes.

In a clinical trials context, the job of the data monitoring committee, and any other form of auditing, becomes much more straightforward. DLT also allows for complete transparency in all financial transactions associated with the research. Funding bodies can see exactly where all funds are being allocated and at what time points. In fact, every aspect of the research supply chain, from inventory to event tracking, can be made transparent to the desired entities. Smart contracts operate among participants in the blockchain and also between the trusted intermediary and the DLT developer whose services have been contracted to build the platform framework, such as the private blockchain. The services contracts will need to be negotiated in advance so that the platform is tailored to adequately conform to individualised study needs. Once processes are in place and streamlined, the platform can be replicated in comparable future studies.

DLT can address the problem of duplicate records in study data or patient records, and make longitudinal data collection more consistent and reliable across multiple life cycles. Many disparate stakeholders, from doctor to insurer or researcher, can share in the same patient data source while maintaining patient privacy and improving data security. Patients can retain access to their data and decide with whom to share it, which clinical studies to participate in, and when to give or withdraw consent.

DLT, such as blockchain or DAGs, can improve collaboration by making the sharing of technical knowledge easier and by centralising data or medical records, in the sense that they are located on the same platform as every other transaction taking place. This results in easier shared access for key stakeholders, shorter negotiation cycles due to improved coordination, and more consistent and replicable clinical research processes.

From a statistician's perspective, DLT should result in data of higher integrity, which yields statistical analyses of greater accuracy and produces research with more reliable results that can be better replicated and validated in future work. Clinical studies will be streamlined by the removal of much bureaucracy and will therefore be more time- and cost-effective to implement as a whole. This is particularly important in a micro-environment with many moving parts and disparate stakeholders, such as the clinical trials landscape.


References and further reading:

From Clinical Trials to Highly Trustable Clinical Trials: Blockchain in Clinical Trials, a Game Changer for Improving Transparency?
https://www.frontiersin.org/articles/10.3389/fbloc.2019.00023/full#h4

Clinical Trials of Blockchain

Blockchain technology for improving clinical research quality
https://trialsjournal.biomedcentral.com/articles/10.1186/s13063-017-2035-z

Blockchain to Blockchains in Life Sciences and Health Care
https://www2.deloitte.com/content/dam/Deloitte/us/Documents/life-sciences-health-care/us-lshc-tech-trends2-blockchain.pdf

Simpson’s Paradox: the perils of hidden bias.

How Simpson’s Paradox Confounds Research Findings And Why Knowing Which Groups To Segment By Can Reverse Study Findings By Eliminating Bias.

Introduction
 
The misinterpretation of statistics, or even the "mis"-analysis of data, can occur for a variety of reasons and to a variety of ends. This article focuses on one such phenomenon contributing to the drawing of faulty conclusions from data – Simpson's Paradox.


At times a situation arises where the outcomes of a clinical research study depict the inverse of the expected (or essentially correct) outcomes. Depending upon the statistical approach, this could affect means, proportions or relational trends, among other statistics.
Some examples of this occurrence are a negative difference when a positive difference was anticipated, or a positive trend when a negative one would have been more intuitive – or vice versa. Another example commonly pertains to the cross-tabulation of proportions, where condition A is proportionally greater overall, yet when stratified by a third variable, condition B is greater in all cases. All of these examples can be instances of Simpson's paradox. Essentially, Simpson's paradox represents the possibility of supporting opposing hypotheses – with the same data.
Simpson's paradox can occur due to the effects of confounding, where a confounding variable is characterised by being related to both the independent variable and the outcome variable, and unevenly distributed across levels of the independent variable. Simpson's paradox can also occur without confounding, in the context of non-collapsibility.
For more information on the nuances of confounding versus non-collapsibility in the context of Simpson's paradox, see here.

In a sense, Simpson's paradox is merely an apparent paradox, and can be more accurately described as a form of bias. This bias most often results from a lack of insight into how an unknown, lurking variable is impacting upon the relationship between two variables of interest. Simpson's paradox highlights the fact that taking data at face value and using it to inform clinical decision-making can be highly misleading. The chances of Simpson's paradox (or bias) impacting a statistical analysis can be greatly reduced in many cases by a careful approach informed by proper knowledge of the subject matter. This highlights the benefit of close collaboration between researcher and statistician in developing an optimal statistical methodology that can be adapted on a per-case basis.

The following three-part series explores hypothetical clinical research scenarios in which Simpson's paradox can manifest.

Part 1

Simpson’s Paradox in correlation and linear regression


Scenario and Example

A nutritionist would like to investigate the relationships between diet and negative health outcomes. As higher weight has previously been associated with negative health outcomes, the researcher sets out to investigate the extent to which increased caloric intake contributes to weight gain. In researching the relationship between calorie intake and weight gain for a particular dietary regime, the nutritionist uncovers a rather unanticipated negative trend: as caloric intake increases, the weight of participants appears to go down. The nutritionist therefore starts recommending higher calorie intake as a way to lose weight. Weight does appear to go down with calorie intake; however, if we stratify the data by age group, a positive trend between weight and calorie intake emerges within each group. Overall, the elderly have the lowest calorie intake but the highest weight, while teens have the highest calorie intake but the lowest weight; this accounts for the negative overall trend but does not give an honest picture of the impact of calories on weight. In order to gain an accurate picture of the relationship between weight and calorie intake we have to know which variable to group or stratify the data by – in this case it is age. Once the data is stratified by five separate age categories, a positive trend between calories and weight emerges in each of the five categories. In general, the answer to which variable to stratify by or control for is not typically this obvious, and in most cases requires some theoretical background and a thorough examination of the available data, including associated variables for which information is at hand.
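To make the example concrete, the following SAS sketch (assuming a hypothetical data set named diet with variables weight, calories and age_group) contrasts the pooled regression with regressions fitted separately within each age group; the pooled slope for calories can be negative even though every within-group slope is positive.

/* Pooled analysis ignoring age: the calories slope can appear negative */
proc reg data=diet;
    model weight = calories;
run;

/* Stratified analysis: a separate regression within each age group */
proc sort data=diet;
    by age_group;
run;

proc reg data=diet;
    by age_group;
    model weight = calories;
run;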


Remedy

In the above example, age shows a negative relationship to the independent variable, calories, but a positive relationship to the dependent variable, weight. It is for this reason that a bit of data exploration and assumption checking before any hypothesis testing is so essential. Even with these practices in place it is possible to overlook the source of confounding, so caution is always encouraged.
 
Randomisation and Stratification:
In the context of a randomised controlled trial (RCT), participants should be randomly assigned to treatment groups as well as stratified by any pertinent demographic and other factors, so that these are evenly distributed across treatment arms (levels of the independent variable). This approach can help to minimise, although not eliminate, the chances of bias occurring in any such statistical context, predictive modelling or otherwise.

Linear Structural Equation Modelling:
If the data at hand is not randomised but observational, a different approach should be taken to detect causal effects in light of potential confounding or non-collapsibility. One such approach is linear structural equation modelling, where each variable is generated as a linear function of its parents, using a directed acyclic graph (DAG) with weighted edges. This is a more sophisticated approach than simply adjusting for a set number of variables, which is what would otherwise be needed in the absence of a randomisation protocol.

Hierarchical regression:
This example illustrated an apparent negative trend in the overall data masking a positive trend in each individual subgroup; in practice, the reverse can also occur.
In order to avoid drawing misguided conclusions from the data, the correct statistical approach must be employed. A hierarchical regression controlling for a number of potential confounding factors, as sketched below, can avoid drawing the wrong conclusion due to Simpson's paradox.
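As a minimal sketch, again assuming the hypothetical diet data set, the confounder can be entered into the model alongside the predictor of interest so that the calories effect is estimated within, rather than across, age groups:

/* Regression of weight on calories, adjusting for age group */
proc glm data=diet;
    class age_group;
    model weight = calories age_group / solution;
run;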

 

Article: Sarah Seppelt Baker


Reference:
The Simpson’s paradox unraveled, Hernan, M, Clayton, D, Keiding, N., International Journal of Epidemiology, 2011.

Part 2

Simpson’s Paradox in 2 x 2 tables and proportions


Scenario and Example

Simpson’s paradox can manifest itself in the analysis of proportional data and two by two tables. In the following example two pharmaceutical cancer treatments are compared by a drug company utilising a randomised controlled clinical trial design. The company wants to test how the new drug (A) compares to the standard drug (B) already widely in clinical use.  1000 patients were randomly allocated to each group. A chi squared test of remission rates between the two drug treatments is highly statistically significant, indicating that the new drug A is the more effective choice. At first glance this seems reasonable, the sample size is fairly large and equal number of patients have been allocated to each groups.
Drug treatment           A               B
Remission: Yes           798 (79.8%)     705 (70.5%)
Remission: No            202             295
Total sample size        1000            1000
The chi-square statistic for the difference in remission rates between treatment groups is 23.1569. The p-value is < .00001. The result is significant at p < .05.


When we take a closer look, the picture changes. It turns out the clinical trial team forgot to take into account the patients' stage of disease progression at the commencement of treatment. The table below shows that drug A was allocated to far more of the stage II patients (79.2% of them received drug A), while drug B was allocated to far more of the stage IV patients (79.8% of them received drug B).

                         Stage II                        Stage IV
Drug treatment           A               B               A               B
Remission: Yes           697 (87.1%)     195 (92.9%)     101 (50.5%)     510 (64.6%)
Remission: No            103             15              99              280
Total sample size        800             210             200             790
The chi-square statistic for the difference in remission rates between treatment groups for patients with stage II disease progression at treatment outset is 5.2969. The p-value is .021364. The result is significant at p < .05.


The chi-square statistic for the difference in remission rates between treatment groups for patients with stage IV disease progression at treatment outset is 13.3473. The p-value is .000259. The result is significant at p < .05.

Unfortunately, the analysis of tabulated data is no less prone to bias akin to Simpson's paradox than the analysis of continuous data. Given that stage II cancer is easier to treat than stage IV, this imbalance has given drug A an unfair advantage and has naturally led to a higher remission rate overall for drug A. When the treatment groups are divided by disease progression categories and re-analysed, we can see that remission rates are higher for drug B at both stage II and stage IV baseline disease progression. The resulting chi-squared statistics are wildly different from the first and statistically significant in the opposite direction to the first analysis. In causal terms, stage of disease progression affects difficulty of treatment and likelihood of remission. Patients at a more advanced stage of disease, i.e. stage IV, will be harder to treat than patients at stage II. For a fair comparison between two treatments, patients' stage of disease progression needs to be taken into account. In addition, some drugs may be more efficacious at one stage or the other, independent of the overall probabilities of achieving remission at either stage.
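The analyses above can be reproduced with a short SAS program; the sketch below enters the tabulated counts directly and requests both the pooled and the stage-stratified chi-square tests (the CMH option additionally produces a Cochran-Mantel-Haenszel test of the treatment effect adjusted for stage). The data set and variable names are illustrative.

/* Remission counts entered directly from the tables above */
data remission;
    input stage $ drug $ remission $ count;
    datalines;
II A Yes 697
II A No  103
II B Yes 195
II B No   15
IV A Yes 101
IV A No   99
IV B Yes 510
IV B No  280
;
run;

proc freq data=remission;
    weight count;
    tables drug*remission / chisq;             /* pooled analysis: appears to favour drug A */
    tables stage*drug*remission / chisq cmh;   /* stratified by stage: favours drug B in both strata */
run;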

Remedy

Randomisation and Stratification:
Of course in this scenario, stage of disease progression is not the only variable that needs to be accounted for in order to guard against biased results. Demographic variables such as age, sex, socio-economic status and geographic location are some examples of variables that should be controlled for in any similar analysis. As with the scenario in Part 1, this can be achieved through stratified random allocation of patients to treatment groups at the outset of the study. Using a randomised controlled trial design, where subjects are randomly allocated to each treatment group as well as stratified by pertinent demographic and diagnostic variables, will reduce the chances of inaccurate study results occurring due to bias; a sketch of one such allocation appears below.
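As a rough sketch, assuming a hypothetical data set named patients with one row per patient and a stage variable (plus any other stratification factors), a simple stratified allocation could be generated in SAS as follows: PROC SURVEYSELECT randomly selects half of each stratum, and the resulting selection flag is used to assign treatment.

/* Sort by the stratification variable(s), as required by PROC SURVEYSELECT */
proc sort data=patients;
    by stage;
run;

/* Randomly select half of each stratum; OUTALL keeps all patients with a Selected flag */
proc surveyselect data=patients out=allocation method=srs samprate=0.5
                  seed=20230101 outall;
    strata stage;
run;

/* Assign the selected half to drug A and the remainder to drug B */
data allocation;
    set allocation;
    length treatment $ 1;
    if Selected = 1 then treatment = 'A';
    else treatment = 'B';
run;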

Further examples of Simpson’s Paradox in 2 x 2 tables

Simpson’s paradox in case control and cohort studies

Case-control and cohort studies also involve analyses which rely on the 2×2 table. The calculation of their corresponding measures of association, the odds ratio and relative risk respectively, is unsurprisingly not immune to the effects of bias, in much the same way as the chi-square example above. This time, a reversed odds ratio or a relative risk in the opposite direction can occur if the pertinent bias has not been accounted for and controlled.

Simpson’s paradox in meta-analysis of case control studies

Following on from the example above, this form of bias can pose further problems in the context of meta-analysis. When combining results from numerous case-control studies, the confounders in question may or may not have been identified or controlled for consistently across all studies, and some studies will likely have identified different confounders for the same variable of interest. The odds ratios produced by the different studies can therefore be incompatible and lead to erroneous conclusions. Meta-analysis can therefore fall prey to the ecological fallacy as a result of systematic bias, where the odds ratio for the combined studies is in the opposite direction to the odds ratios of the separate studies. Imbalance in treatment arm size has also been found to act as a confounder in the context of meta-analysis of randomised controlled trials. Other methodological differences between studies may also be at play, such as differences in follow-up times or a very low proportion of observed events occurring in some studies, potentially due to a shortened follow-up time.

That's not to say that meta-analysis cannot be performed on these studies; inter-study variation is of course more common than not. As with all other analytical contexts, it is necessary to proceed with a high level of caution and attention to detail. On the whole, an approach of simply pooling study results is not reliable; more sophisticated meta-analytic techniques, such as random-effects models or Bayesian random-effects models that use Markov chain Monte Carlo algorithms to estimate the posterior distributions, are required to mitigate the inherent limitations of the meta-analytic approach. Random-effects models assume the presence of study-specific variance, which is a latent variable to be partitioned. Bayesian random-effects models can come in parametric, non-parametric or semi-parametric varieties, referring to the shape of the distributions of study-specific effects.
For more information on Simpson’s paradox in meta-analysis, see here.

https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/1471-2288-8-34

For more information on how to minimise bias in meta-analysis, see here.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3868184/

https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.4780110202

Part 3

Simpson’s Paradox & Cox Proportional Hazard Models

Time-to-event data is common in clinical science and epidemiology, particularly in the context of survival analysis. Unfortunately, the calculation of hazard rates in survival analysis is not immune to Simpson's paradox, as the mathematics behind Simpson's paradox is essentially the mathematics of conditional probability. In fact, Simpson's paradox in this context has the interesting characteristic of holding for some intervals of the time variable (failure time T) but not others. In this case Simpson's paradox would be observed in the effect of variable Y on the relationship between variable X and time interval T. The proportional hazards model can be seen as an extension of the 2 x 2 table, given that a similar type of data is used; the difference is that time is typically as much an outcome of interest in its relation to some factor X. In this context, Y could be said to be a covariate to X.

Scenario and example
 
A 2017 paper describes a scenario whereby the death rate due to tuberculosis was lower in Richmond than New York for both African-Americans and for Caucasian-Americans, yet lower in New York than Richmond when the two ethnic groups were combined.
For more details on this example as well as the mathematics behind it see here.
For more examples of Simpson’s paradox in Cox regression see here.


Site-specific bias

Factors contributing to bias in survival models can be different from those in more straightforward contexts. Many clinical and epidemiological studies include data from multiple sites. More often than not there is heterogeneity across sites. This heterogeneity can come in various forms and can result in within- and between-site clustering, or correlation, of observations on site-specific variables. This clustering, if not controlled for, can lead to Simpson's paradox in the form of hazard rate reversal, across some or all of time T, and has been found to be a common explanation of the phenomenon in this context. Site clustering can occur at the patient level, for example, due to site-specific selection procedures for the recruitment of patients (led by the principal investigator at each site), or differences in site-specific treatment protocols. Site-specific differences can occur intra- or internationally, and in the international case can be due, for example, to differences in national treatment guidelines or differences in drug availability between countries. Resource availability can also differ between sites, whether intra- or internationally. In any time-to-event analysis involving multiple sites (such as a Cox regression model), a site-level effect should be taken into account and controlled for in order to avoid bias-related inferential errors.
 


Remedy
 

Cox regression model including site as a fixed covariate:
Site should be included as a covariate in order to account for site-specific dependence of observations.

Cox regression model treating site as a stratification variable:
In cases where one or more covariates violate the proportional hazards (PH) assumption, as indicated by a lack of independence of the scaled Schoenfeld residuals from time, stratification may be more appropriate. Another option in this case is to add a time-varying covariate to the model. The choice made in this regard will depend on the sampling nuances of each particular study.

Cox shared frailty model:
In specific conditions the Cox shared frailty model may be more appropriate. This approach involves treating subjects from the same site as sharing the same frailty, and requires that each subject is not clustered across more than one level-two unit. While it is not appropriate for multi-membership multi-level data, it can be useful for more straightforward scenarios.
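Assuming a hypothetical data set named trial with variables time, event (1 = event, 0 = censored), treatment and site, the three options above might be sketched in SAS as follows.

/* Option 1: site as a fixed covariate */
proc phreg data=trial;
    class treatment site;
    model time*event(0) = treatment site;
run;

/* Option 2: site as a stratification variable */
proc phreg data=trial;
    class treatment;
    model time*event(0) = treatment;
    strata site;
run;

/* Option 3: shared frailty model with a random site effect */
proc phreg data=trial;
    class treatment site;
    model time*event(0) = treatment;
    random site;
run;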

In tailoring the approach to the specifics of the data, appropriate model adjustments should produce hazard ratios that more accurately estimate the true risk.

Transforming Skewed Data: How to choose the right transformation for your distribution

Innumerable statistical tests exist for application in hypothesis testing based on the shape and nature of the pertinent variable's distribution. If, however, the intention is to perform a parametric test – such as ANOVA, Pearson's correlation or some types of regression – the results of such a test will be more valid if the distribution of the dependent variable(s) approximates a Gaussian (normal) distribution and the assumption of homoscedasticity is met. In reality, data often fails to conform to this standard, particularly in cases where the sample size is not very large. As such, data transformation can serve as a useful tool in readying data for these types of analysis by improving normality, homogeneity of variance or both.

For the purposes of transforming skewed data, the degree of skewness of a skewed distribution can be classified as moderate, high or extreme. Skewed data will also tend to be either positively (right) skewed, with a longer tail to the right, or negatively (left) skewed, with a longer tail to the left. Depending upon the degree of skewness and whether the direction of skewness is positive or negative, a different approach to transformation is often required. As a short-cut, uni-modal distributions can be roughly classified into the following transformation categories:

  • Moderate positive skew: square root transformation
  • High positive skew: logarithmic transformation (natural log or log to base 10)
  • Extreme positive skew: inverse (reciprocal) transformation
  • Negative skew: reflect the data so that the skew becomes positive, then apply the transformation suited to the resulting degree of skew
This article explores the transformation of a positively skewed distribution with a high degree of skewness. We will see how four of the most common transformations for skewness – square root, natural log, log to base 10, and inverse transformation – have differing degrees of impact on the distribution at hand. It should be noted that the inverse transformation is also known as the reciprocal transformation. In addition to the transformation methods listed above, the Box-Cox transformation is also an option for positively skewed data that is greater than zero, and the Yeo-Johnson transformation is an extension of the Box-Cox transformation which does not require the original data values to be positive or greater than zero.
The following example takes medical device sales in thousands for a sample of 2000 diverse companies. The histogram below indicates that the original data could be classified as "high(er)" positively skewed.
The skew is in fact quite pronounced – the maximum value on the x axis extends beyond 250 (the frequency of sales volumes beyond 60 is so sparse as to make the extent of the right tail imperceptible) – it is, however, the highly leptokurtic nature of the distribution that lends this variable to be better classified as high rather than extreme. It is in fact log-normal – convenient for the present demonstration. From inspection it appears that the log transformation will be the best fit in terms of normalising the distribution.
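A minimal SAS sketch of the comparisons that follow, assuming the sales figures sit in a variable named sales within a data set named sales_data, is shown below; PROC UNIVARIATE can then be used to inspect the histogram, skewness and kurtosis of each transformed variable.

/* Apply the four candidate transformations to the positively skewed sales variable */
data sales_transformed;
    set sales_data;
    sqrt_sales  = sqrt(sales);     /* square root          */
    ln_sales    = log(sales);      /* natural log          */
    log10_sales = log10(sales);    /* log to base 10       */
    inv_sales   = 1 / sales;       /* inverse (reciprocal) */
run;

/* Compare the shape of the original and transformed distributions */
proc univariate data=sales_transformed;
    var sales sqrt_sales ln_sales log10_sales inv_sales;
    histogram / normal;
run;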

Starting with a more conservative option, the square root transformation, a major improvement in the distribution is already achieved. The extreme observations contained in the right tail are now more visible. The right tail has been pulled in considerably and a left tail has been introduced. The kurtosis of the distribution has been reduced by more than two thirds.

A natural log transformation proves to be an incremental improvement, yielding the following results:
This is quite a good outcome – the right tail has been reduced considerably while the left tail has extended along the number line to create symmetry. The distribution now roughly approximates a normal distribution. An outlier has emerged at around -4.25, while the extreme values of the right tail have been eliminated. The kurtosis has again reduced considerably.

Taking things a step further and applying a log to base 10 transformation yields the following:
In this case the right tail has been pulled in even further and the left tail extended less than in the previous example. Symmetry has improved and the extreme value in the left tail has been brought in to around -2. The log to base 10 transformation has provided an ideal result – successfully transforming the log-normally distributed sales data to normal.

In order to illustrate what happens when a transformation that is too extreme for the data is chosen, an inverse transformation has been applied to the original sales data below.
Here we can see that the right tail of the distribution has been brought in quite considerably, to the extent of increasing the kurtosis. Extreme values have been pulled in slightly but still extend sparsely out towards 100. The results of this transformation are far from desirable overall.

Something to note is that in this case the log transformation has caused data that was previously greater than zero to now be located on both sides of the number line. Depending upon the context, data containing zero may become problematic when interpreting or calculating the confidence intervals of un-back-transformed data. As log(1) = 0, any data containing values <= 1 can be made to yield transformed values > 0 by adding a constant to the original data so that the minimum raw value becomes > 1. Reporting un-back-transformed data can be fraught at the best of times, so back-transformation of transformed data is recommended. Further information on back-transformation can be found here.

Adding a constant to data is not without its impact on the transformation. As the example below illustrates, the effectiveness of the log transformation on the above sales data is effectively diminished in this case by the addition of a constant to the original data.
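For instance, the shifted transformation might be computed as below; the constant of 1 is purely illustrative and should be chosen so that the minimum raw value exceeds 1.

/* Log transformation after adding a constant so transformed values remain positive */
data sales_shifted;
    set sales_data;
    ln_sales_c = log(sales + 1);
run;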

Depending on the subsequent intentions for analysis, this may be the preferred outcome for your data – it is certainly an adequate improvement and has rendered the data approximately normal for most parametric testing purposes.

Taking the transformation a step further and applying the inverse transformation to the sales + constant data, again, leads to a less optimal result for this particular set of data – indicating that the skewness of the original data is not quite extreme enough to benefit from the inverse transformation.

It is interesting to note that the peak of the distribution has been reduced, whereas an increase in leptokurtosis occurred for the inverse transformation of the raw distribution. This serves to illustrate how a small alteration in the data can completely change the outcome of a data transformation without necessarily changing the shape of the original distribution.

There are many varieties of distribution, the below diagram depicting only the most frequently observed. If common data transformations have not adequately ameliorated your skewness, it may be more reasonable to select a non-parametric hypothesis test that is based on an alternate distribution.

Image credit: cloudera.com

Accessing Data Files when Using SAS via Citrix Receiver

A SAS licence can be prohibitively expensive for many use cases. Installing the software can also take up a surprising amount of hard disk space and memory. For this reason, many individuals with light or temporary usage needs choose to access a version of SAS which is licensed to their institution and therefore shared across many users. Are you trying, unsuccessfully, to access SAS remotely via your institution using Citrix Receiver? This step-by-step guide might help.

SAS syntax can differ based on whether a remote versus local server is used. An example of a local server is the computer you are physically using. When you have SAS installed on the PC you are using, you are accessing it locally. A remote server, on the other hand, allows you to access SAS without having SAS installed on your PC.

Client software such as Citrix Receiver allows you to access SAS, and other software, from a remote server. Citrix Receiver is often used by university students and new and/or light users. SAS requires different syntax in order to enable the remote server to access data files on a local computer.
For the purposes of this example we assume that the data file we wish to access is located locally, on a drive of the computer we are using.

It can be difficult to find the syntax for this on Google, where search results deal more with accessing remote data (libraries) using local SAS than the other way around. This can be a source of frustration for new users and SAS Technical Support are not always able to advise on the specifics of using SAS via Citrix Receiver.

The INFILE statement and the PROC IMPORT procedure are two popular options for reading a data file into SAS. INFILE offers greater flexibility and control, by way of manual specification of the data's characteristics, while PROC IMPORT offers greater automation, in some cases at the risk of formatting errors. The INFILE statement must always be used within a DATA step, whereas PROC IMPORT acts as a stand-alone procedure. The document below shows syntax for the INFILE statement and the PROC IMPORT procedure for local SAS compared to access via Citrix Receiver.

Embedded document: how to open a data file whe…

If you cannot see the document, please make sure that you are viewing the website in desktop mode.
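If the embedded document is unavailable, the sketch below gives a sense of the difference. The exact path convention depends on your institution's Citrix configuration; in many Citrix Receiver set-ups, local client drives are exposed to the remote session as \\Client\<drive>$, so a file stored locally at C:\data\mydata.csv might be referenced as in the second and third examples. File names and variable names here are illustrative.

/* PROC IMPORT with SAS installed locally: reference the file by its local path */
proc import datafile="C:\data\mydata.csv"
    out=work.mydata
    dbms=csv
    replace;
    getnames=yes;
run;

/* PROC IMPORT when SAS runs on a remote server accessed via Citrix Receiver:
   the local C: drive is typically exposed to the session as \\Client\C$ */
proc import datafile="\\Client\C$\data\mydata.csv"
    out=work.mydata
    dbms=csv
    replace;
    getnames=yes;
run;

/* Equivalent INFILE approach within a DATA step (remote-session path shown) */
data work.mydata;
    infile "\\Client\C$\data\mydata.csv" dsd firstobs=2;
    input id sales;    /* adjust the variable list to match the file's columns */
run;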

In SAS University Edition, data file input difficulties can occur for a different reason. In order for the LIBNAME statement to run without error, a shared folder must first be defined. If you are using SAS University Edition and experiencing an error when inputting data, the following videos may be helpful:
How to Set LIBNAME File Path (SAS University Edition) 
Accessing Data Files Via Citrix Receiver: for SAS University Edition

Troubleshooting check-list:

  • Was the “libref” appropriately assigned?
  • Was the file location referred to appropriately based on the user context?
  • Was the correct data file extension used?

While impractical for larger data sets, if all else fails, one can copy and paste the data from a data file into SAS using the DATALINES statement.
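A minimal illustration of this approach:

/* Reading values pasted directly into the program with a DATALINES statement */
data work.mydata;
    input id sales;
    datalines;
1 12.5
2 7.3
3 21.0
;
run;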