Open Source in Life Sciences: Balancing Innovation and Compliance

Data science teams in the life sciences industry are undergoing a significant shift that will transform how clinical data is analyzed, opening the door to new possibilities for innovation. The transition isn’t happening overnight, nor should it: in a highly regulated environment, an incremental approach is what makes success possible.

For decades, SAS was “it” in life sciences. If you wanted to be a data scientist in the industry, you had to know SAS. Recently, however, organizations under growing pressure to cut costs and innovate faster have been moving toward open-source tools. But open source brings its own set of challenges, especially in such a tightly regulated industry. For the transition to work, teams must strike a careful balance between the dependability of established legacy systems and the adaptability that open-source tools bring. Understanding the trade-offs involved, and handling them effectively, is key.

Veteran data scientists who have spent decades working with SAS will be quick to argue its strengths. One benefit of proprietary systems is that they come pre-validated for regulatory requirements, making them a safe and reliable choice. Such systems also come with dedicated support for faster troubleshooting, ongoing updates, and expert guidance tailored to specific use cases. Open-source tool users, on the other hand, must rely on community-driven support, which can be more time-intensive, though no less reliable.

Global life sciences organizations also depend heavily on standardized workflows to maintain consistency and data integrity across teams, something SAS is built to provide. Though generally considered more flexible (a key attraction), open-source tools like R can sometimes create discrepancies that must be carefully managed to ensure seamless collaboration and maintain the integrity of shared data.

There’s also the human element to consider. After years spent mastering SAS, it’s understandable for data scientists to balk at moving away from a tool they know (and love) so well. Retraining teams and overhauling workflows is no easy feat, so what might seem like the usual resistance to change is, in this case, a practical reaction to the very real challenges and compromises that come with such a significant transition.

Open-source programming languages like Python and R have endeared themselves to data scientists through their versatility, their advanced libraries for machine learning and predictive modeling, and their active community support. One of the biggest draws is the sophisticated analysis these libraries enable: TensorFlow, PyTorch, and others offer prebuilt frameworks for machine learning models, supporting analysis of diverse datasets in ways that proprietary tools often can’t match, such as predicting attrition rates in medical trials or forecasting drug efficacy across diverse demographic groups.
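To make that concrete, here is a minimal sketch of the kind of attrition modeling described above, written in PyTorch. It is illustrative only: the synthetic data, the feature set, and the model shape are assumptions for the example, not a pipeline drawn from any real trial.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: synthetic tensors stand in for real trial records.
# Hypothetical features per participant might include age, distance to site,
# prior missed visits, and a baseline symptom score (here, random stand-ins).
torch.manual_seed(0)
n = 500
X = torch.randn(n, 4)
# Synthetic labels: 1 = dropped out of the trial, 0 = completed.
y = (X[:, 2] + 0.5 * X[:, 1] + 0.3 * torch.randn(n) > 0.8).float().unsqueeze(1)

# A small feed-forward classifier that scores attrition risk.
model = nn.Sequential(
    nn.Linear(4, 16),
    nn.ReLU(),
    nn.Linear(16, 1),
)
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

for epoch in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()

# Predicted dropout probability for one new (hypothetical) participant.
with torch.no_grad():
    prob = torch.sigmoid(model(torch.randn(1, 4))).item()
print(f"Predicted attrition risk: {prob:.2f}")
```

In practice, a team would replace the synthetic tensors with curated trial data and run the model through whatever validation and governance its compliance framework requires.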

The ability to share code, workflows, and methodologies seamlessly is critical to supporting global collaboration. Open-source languages enable this with code that can be shared, peer-reviewed, and reproduced across teams. They also let data scientists align their workflows to specific needs rather than being limited by a vendor’s roadmap, flexibility that is crucial for groundbreaking research requiring tailored solutions to unique problems. Whether it’s developing novel biomarkers or fine-tuning algorithms for predictive modeling, the autonomy offered by open-source languages can be truly game-changing.

And let’s not forget cost, the real elephant in the room. While proprietary systems like SAS require hefty licensing fees, open-source tools are, of course, free to license. For resource-constrained teams, this can be the deciding factor, freeing up budgets for infrastructure, hiring top talent, and other essential areas.

Fortunately, there is a hybrid approach that lets organizations combine the advantages of open-source and proprietary tools. It requires a unified environment in which teams can innovate with the tools they’re comfortable with while benefiting from the established compliance features of proprietary systems. But to fully realize the benefits of a hybrid approach, organizations must address key challenges associated with open-source tools, particularly around ensuring reproducibility and auditability.

These challenges can be overcome by implementing version control, model tracking, and automated documentation systems, including detailed audit trails, environment snapshots, and data management best practices. This not only ensures reproducibility and compliance but also lays the groundwork for more effective collaboration, since a centralized environment enhances communication and efficiency among global teams. By sharing insights, code, models, and datasets in a common workspace, organizations can eliminate silos and accelerate project timelines within a culture of shared knowledge and collective progress.
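As a minimal sketch of what such an audit trail and environment snapshot could look like in practice, the snippet below uses only the Python standard library to append one record per analysis run, pairing a dataset fingerprint with the versions of every installed package. The file layout and recorded fields are assumptions for illustration, not a prescribed standard.

```python
import hashlib
import json
import platform
import sys
from datetime import datetime, timezone
from importlib import metadata
from pathlib import Path

def record_audit_entry(dataset_path: str, analysis_name: str,
                       log_path: str = "audit_trail.jsonl") -> dict:
    """Append one audit record: what ran and when, a dataset fingerprint,
    and an environment snapshot of installed package versions."""
    data = Path(dataset_path).read_bytes()
    entry = {
        "analysis": analysis_name,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        # SHA-256 fingerprint so reviewers can confirm exactly which
        # version of the dataset the analysis ran against.
        "dataset_sha256": hashlib.sha256(data).hexdigest(),
        "python_version": sys.version,
        "platform": platform.platform(),
        # Environment snapshot: every installed package and its version.
        "packages": {d.metadata["Name"]: d.version
                     for d in metadata.distributions()},
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry

# Hypothetical usage: log an entry before running an analysis script.
# record_audit_entry("trial_042.csv", "interim-efficacy-model")
```

A line-per-record JSON log like this is easy to diff, version, and feed into whatever documentation system an organization already uses; dedicated model-tracking platforms provide the same information with more tooling around it.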

Of course, as data science workloads grow, scalable computing resources become essential. Integrating cloud-based and on-premises infrastructure ensures teams have the compute power needed for machine learning training, large-scale data analysis, and other demanding tasks. And with the right systems in place, organizations get the secure data access controls, encryption, and other features they need to ensure scalability does not come at the cost of security and compliance.

The life sciences industry is increasingly adopting open-source tools like R for regulatory submissions, with Roche and Novartis among those leading the way. Roche has enhanced workflow efficiency by addressing compliance and validation concerns through rigorous internal processes and close collaboration with regulators. Novartis, known for its embrace of R’s flexibility, participates in initiatives like the R Validation Hub, which establishes validation frameworks for regulated clinical trial environments. 

These examples highlight a pivotal shift: with robust validation and governance frameworks in place, open-source tools can achieve the reliability and compliance required to meet regulatory standards.

Transitioning to a hybrid open-source and proprietary model requires careful planning and adaptation. Here’s how organizations can overcome the obstacles:

1. Invest in Training

Upskilling teams in open-source tools is key. This isn’t about replacing SAS expertise but adding to it. Workshops, certifications, and hands-on projects can help teams feel more confident with new tools and make the transition smoother overall.

2. Start Small

Pilot projects are a great low-risk way to explore open-source tools. Python, for instance, can be used for exploratory analysis on a single trial, as in the sketch below.
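Here is a minimal sketch of what such a pilot might look like: a few lines of pandas to profile a single trial export. The file name and column names are placeholders, not a real dataset.

```python
import pandas as pd

# Hypothetical single-trial export; the file and column names are placeholders.
df = pd.read_csv("trial_042.csv")  # e.g. subject_id, arm, age, outcome_score

df.info()             # completeness check: dtypes and non-null counts
print(df.describe())  # distribution summary for numeric columns

# Per-arm summary of the (hypothetical) outcome measure.
print(df.groupby("arm")["outcome_score"].agg(["count", "mean", "std"]))
```

A pilot this small carries essentially no regulatory risk, yet it gives the team a concrete basis for comparing the open-source workflow against the established SAS one.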

3. Partner with Experts

Teaming up with technology vendors, consultants, or academic institutions experienced in open source can make the transition easier. These partners can offer guidance, share best practices, and provide technical support to help things go smoothly.

4. Focus on Change Management

Switching to open source isn’t just a technical shift—it’s a cultural one. Success depends on effective change management. That means clear communication, securing stakeholder buy-in, and laying out a solid roadmap to navigate resistance and ensure a smooth transition.

Striking the Right Balance

Balancing the reliability of proprietary systems with the flexibility of open-source tools can be challenging, but it can be done with a hybrid strategy that includes robust validation, training, and change management. The end result: a more innovative approach to data science, one that delivers the breakthroughs modern life sciences and humanity require, in a way that’s scalable and compliant.


About Christopher McSpiritt 

Christopher McSpiritt is a seasoned business architect, consultant, and product leader who has spent almost two decades specializing in helping life sciences organizations improve drug development processes through process reengineering efforts and the deployment of innovative software/analytics solutions. As VP of Life Sciences Strategy at Domino Data Lab, he leads Domino’s go-to-market and product strategy for the pharmaceutical industry.