Conducting a clinical study requires several steps and tools.
Based on the scientific and biological question to be addressed, the sponsor has to design the study, define the number of volunteers, recruit them, monitor the study and collect data. Finally, once the data are available, the sponsor enters the mysterious world of statistics, which has its own languages: those of “statistical programming”.
What is a statistical programming language, and why do we need it?
Statistical software is needed to handle, check and analyse data. Existing software falls into two categories:
- “ready to use” systems
- systems based on lines of code.
Ready-to-use software is easier to pick up and is an attractive option for medical professionals. But statistics professionals generally prefer the command-language option, for several reasons. Saving the lines of code that generate statistical outputs is very useful for traceability, and it also makes it easy to change an option by editing a single element of the program. In addition, even if coding takes time up front, many analyses can be run faster with code. Finally, ready-to-use software offers less flexibility in terms of data manipulation and analysis options.
Are there different languages used in statistical programming?
Like many fields of science, statistical programming has a constantly changing landscape. Ten years ago, SAS® and R were the most widely used statistical programming languages. SPSS, Stata, Matlab and many other packages existed but had fewer users. More recently, we have seen a rise in R and Python users and a decline in SAS® customers. All these programming languages have one thing in common: they are completely incomprehensible to non-specialists.
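As a minimal sketch of what "changing an element of the program" can look like, here is a short Python example. The data values, the `summarise` function and its `statistic` option are all hypothetical and invented for illustration; they do not come from a real study.

```python
from statistics import mean, median

# Hypothetical data: change in a biomarker for two study arms (invented values).
DATA = {
    "placebo": [0.1, -0.2, 0.0, 0.3, 2.5],
    "active": [0.8, 1.1, 0.9, 1.4, 1.0],
}


def summarise(data, statistic="mean", digits=2):
    """Return one summary value per arm; `statistic` is the option to change."""
    functions = {"mean": mean, "median": median}
    summary_function = functions[statistic]
    return {
        arm: round(summary_function(values), digits)
        for arm, values in data.items()
    }


if __name__ == "__main__":
    # First version of the analysis.
    print(summarise(DATA))
    # Same saved program, with a single option changed.
    print(summarise(DATA, statistic="median"))
```

Because the whole analysis lives in a saved script, the second run differs from the first by one argument only, and both versions can be kept for traceability.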
SAS/R/Python… different languages of statistical programming
What are currently the key drivers for choosing a statistical language in clinical studies?
— FDA guidance —
Let's focus on the pharmaceutical industry. SAS® has long held a dominant position in this area, due to US Food and Drug Administration (FDA) guidance. It is well known that most submissions have used SAS®, and many people think SAS® is the only software that can be used to analyze clinical trial data. However, in 2015, the FDA published a Statistical Software Clarifying Statement, opening the door to the use of R for clinical trials: “FDA does not require use of any specific software for statistical analyses […]. However, the software package(s) used for statistical analyses should be fully documented in the submission, including version and build identification”. This position is reinforced by “A Guidance Document for the Use of R in Regulated Clinical Trial Environments”, published by the R Foundation for Statistical Computing (March 2018 version).
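To illustrate what documenting "version and build identification" might involve in an open-source setting, here is a hedged Python sketch that collects this information using only the standard library. The function name `describe_environment` and the package names passed to it are assumptions chosen for illustration; this is not an FDA-prescribed format.

```python
import json
import platform
from importlib import metadata


def describe_environment(packages=()):
    """Collect interpreter version, build identification and package versions."""
    build_number, build_date = platform.python_build()
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Report missing packages explicitly instead of failing.
            versions[name] = "not installed"
    return {
        "language": "Python",
        "version": platform.python_version(),
        "build": f"{build_number} ({build_date})",
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "packages": versions,
    }


if __name__ == "__main__":
    # Hypothetical list of packages used by an analysis.
    print(json.dumps(describe_environment(["numpy", "pandas"]), indent=2))
```

The resulting record could be stored alongside the statistical outputs so that the software used for each analysis can be identified later.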
— Distrust of freeware / Validation —
So why do so many pharmaceutical companies and CROs still pay for expensive SAS® licenses when the same work could be done with R? SAS® is developed by a for-profit company that vets all its code to ensure it returns correct results. R, on the other hand, is an open-source language: anyone can contribute by writing a package. So are results from R reliable? Those produced by the most widely used R packages certainly are; those produced by more esoteric R packages should be treated with caution. In addition, it is easier with SAS® to make results traceable (for example, by giving the outputs a time stamp) and stable (independent of package version). For all these reasons, SAS® remains widely used in pharma.
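For comparison, a time stamp can also be added to outputs by hand in a scripted language. The Python sketch below is only one possible approach: the `stamp_output` function, the program name and the short checksum in the footer are hypothetical additions for illustration, not a standard or validated procedure.

```python
import hashlib
from datetime import datetime, timezone


def stamp_output(table_text, program_name):
    """Append a footer with program name, UTC time stamp and content checksum."""
    generated_at = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
    # Short checksum of the table body, so later edits to the output are detectable.
    checksum = hashlib.sha256(table_text.encode("utf-8")).hexdigest()[:12]
    footer = (
        f"Program: {program_name} | Generated: {generated_at} | "
        f"SHA-256: {checksum}"
    )
    return f"{table_text}\n{footer}"


if __name__ == "__main__":
    # Hypothetical table and program name.
    print(stamp_output("Arm      N\nPlacebo  10\nActive   10", "t_demog.py"))
```

This covers the time stamp only; keeping results independent of package version would additionally require pinning the versions of the packages in use.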
— Tradition / Habits / Change —
When an organization has been using SAS® for many years and has developed plenty of working macros, translating that code from one programming language to another can be very costly in time and money, even more than one year of SAS® renewal fees. Tradition does not change quickly. But because of the large license fees demanded from universities, fewer and fewer students learn SAS®, and many newcomers entering the workforce already know how to use R but not SAS®. Add to this that R is a more intuitive language, and therefore easier to learn than SAS®, and better suited to modern analytic practices (such as microbiome analysis), and we can expect SAS® programming skills to become increasingly rare in the next few years. Several pharmaceutical companies and CROs have already started migrating from SAS® to open source, and many more will follow.
Declining interest in SAS versus rising interest in R
What is the future of statistical programming languages?
Since the early 2010s, machine learning has become more and more popular thanks to higher computing power and larger amounts of available data. Python and R are the two most commonly used languages for this technology today. Both are open source and completely free to use. Python was first released in 1991 and, unlike R, is not designed for statistics only. It has historically been used by software engineers and is recognized as being great for mathematical computations, with an elegant syntax. However, Python provides fewer statistics libraries than R, and producing reporting tables and data visualizations is more convoluted. Both Python and R are therefore well suited to data science, each with its own advantages.
Python for future with machine learning?
Finally, note that a more recent programming language is attracting more and more data scientists: Julia. Its use is not yet commonplace, but many data scientists may be won over by the speed, good memory management and good parallelism that Julia offers.
SAS, R, Python, Julia …
In conclusion, there are currently several statistical programming languages, all with the same aim: to analyze data sets and extract knowledge from them. Specialists in statistical programming must keep monitoring the constantly evolving landscape of these languages: the language used for your project today is not necessarily the one that will be used tomorrow.
– Benoit Douillard, Statistical Programmer & Biostatistician, Biofortis Mérieux NutriSciences –