
Big medicine, medical startups, and much of medical academia have historically been steeped in secrecy and proprietary research. That culture is a direct result of the privacy requirements around patient and medical data, but as AI supercharges everything from diagnostic medicine to patient treatment and care, it's time to change the paradigm. By embracing the principles of open source collaboration and striking a balance between protecting privacy and sharing knowledge, the industry can accelerate and democratize innovation, and possibly streamline the FDA approval process that can hold innovation hostage for up to a decade.
Medical AI solutions promise to help address the international doctor shortage by giving practitioners tools to assess and treat more patients faster. In pharmaceuticals, where bringing a new drug to market costs $2.6 billion and takes 10 to 15 years, these tools promise to accelerate development, enabling life-changing drugs to reach patients faster and at lower cost.
If an AI tool doesn't work as intended, however, the results can be catastrophic: misdiagnosis, defective medications, even death. Developers must build medical AI that meets the five nines of reliability, meaning their models work as intended 99.999 percent of the time, and in medicine the bar should arguably be higher still.
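To make that bar concrete, a quick back-of-the-envelope calculation shows what five nines implies at clinical scale; the inference volume below is a hypothetical, chosen only for illustration, not a real deployment figure:

```python
# What "five nines" implies in practice. The volume figure is hypothetical.
reliability = 0.99999            # five nines: the model behaves as intended
failure_rate = 1 - reliability   # i.e., 1 failure per 100,000 inferences

annual_inferences = 1_000_000    # hypothetical: reads per year across a hospital network
expected_failures = annual_inferences * failure_rate

print(f"Expected failures per year: {expected_failures:.0f}")  # -> 10
```

Even at five nines, harm scales with volume.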
With the transparency of open source all the more important in this domain, developers face two key challenges: understanding their AI solution's uncertainty and then communicating it effectively to practitioners.
Expanding the limits of medical AI with open source transparency
While the transparency of open source software helped democratize innovation in everything from smartphones to the internet, AI systems are more complicated to share. Unlike traditional software, which is a body of source code, an AI system comprises model parameters, training data, hyperparameters, training and inference code, random number generation, and auxiliary software frameworks, and each component must work in concert for the model to perform as desired.
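As a rough sketch of what a complete release would need to pin down, consider a manifest like the one below; the class and field names are illustrative, not an established standard:

```python
from dataclasses import dataclass, field

# Illustrative sketch of the components a fully open AI release would enumerate.
# Field names are hypothetical, not an established standard.
@dataclass
class OpenAIRelease:
    model_source_code: str        # e.g., a git commit URL for the architecture
    training_code: str            # the exact training and inference pipeline
    weights_uri: str              # the trained model parameters
    dataset_uris: list[str] = field(default_factory=list)   # the full training data
    hyperparameters: dict = field(default_factory=dict)     # learning rate, batch size, ...
    random_seed: int = 0          # for reproducible runs
    framework_versions: dict = field(default_factory=dict)  # e.g., {"torch": "2.3.1"}
```

Withhold any one of these fields and downstream developers can no longer reproduce, audit, or debug the system end to end.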
When developers can truly examine all these moving parts and understand system behavior via open source sharing, it bolsters innovation and reduces failures. Instead of continuously re-inventing the wheel or putting blind faith in systems that cannot be investigated, they can drive meaningful transformation.
Medical AI developers also face a steep data challenge, as incomplete, biased, or inconsistent datasets are often at the heart of model failure. Like all AI developers, they need enough of the right data to train their algorithms, and they must ensure their training data covers the full range of outputs their model can produce. They must also be wary of hallucinations; recall how Google's Med-Gemini made up a body part that doesn't exist.
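On the coverage point, a minimal sketch of an audit (the dataset, labels, and threshold below are invented for illustration) might simply count how often each output class appears in the training data:

```python
from collections import Counter

def underrepresented_labels(labels, all_classes, min_examples=100):
    """Flag output classes the training set barely covers (threshold is illustrative)."""
    counts = Counter(labels)
    return [c for c in all_classes if counts.get(c, 0) < min_examples]

# Hypothetical example: a chest X-ray dataset with three diagnostic labels.
labels = ["normal"] * 9000 + ["pneumonia"] * 950 + ["nodule"] * 50
print(underrepresented_labels(labels, ["normal", "pneumonia", "nodule"]))
# -> ['nodule']  # the model has almost nothing to learn "nodule" from
```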
With lives on the line, medical datasets don't just need to be large; they must also be representative of diverse populations, or some patients will be misdiagnosed or mistreated. Yet a 2023 systematic study of 48 healthcare AI models found that half of them demonstrated a high risk of bias, reflecting a need for developers in this field to course-correct for the biased history of medical research.
In 2019, for example, a study covered by MIT Technology Review showed that an algorithm applied to 70 million patients disproportionately favored white patients over Black patients when predicting the need for medical intervention. Once that bias was uncovered, researchers worked with the software maker to investigate how the data interacted with the model, identify the bias at play, iterate on a solution, and ultimately reduce the disparity by more than 80 percent.
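To see how such a disparity gets quantified, here is a minimal sketch of a subgroup audit; the flags and group names are invented, not data from that study:

```python
def selection_rates(flags_by_group):
    """Fraction of each group's patients the model flagged for intervention."""
    return {g: sum(flags) / len(flags) for g, flags in flags_by_group.items()}

def disparity_ratio(rates):
    """Ratio of the lowest to the highest selection rate; 1.0 means parity."""
    return min(rates.values()) / max(rates.values())

# Hypothetical flags (1 = recommended for intervention), not data from the 2019 study.
flags = {"group_a": [1, 1, 0, 1, 0, 1, 0, 0],
         "group_b": [1, 0, 0, 0, 0, 0, 0, 1]}
rates = selection_rates(flags)
print(rates, disparity_ratio(rates))  # a low ratio signals a model worth auditing
```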
When industry and academia collaborate, they innovate faster and create technology we can trust.
And when datasets are open source and accessible for all to review, the crowd has proven it will uncover troubling content even earlier, just as it did when it rooted out 1,000 URLs containing verified child sexual abuse material in the LAION-5B dataset that fuels AI image generators like Stable Diffusion and Midjourney.
Beyond this, if we can build a selection of open source datasets, researchers and developers will see the time between innovation, approval, and practical use shrink dramatically.
The HIPAA data hurdle
Even as doctors and AI developers are limited by their legal obligation to protect patient privacy, large corporations have gobbled up treasure troves of patient data. These businesses, looking to cash in, are disincentivizing open source sharing and making the collection of data even more challenging and costly.
Beyond this trend, there's the question of how to release open source medical data when it's protected by law. Thankfully, de-identification techniques are ever-improving, and patients are increasingly open to sharing their medical data for the greater good.
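Under HIPAA's Safe Harbor rule, de-identification means removing 18 categories of identifiers. A toy sketch conveys the idea, though real pipelines rely on vetted tooling and the record fields here are hypothetical:

```python
# Identifiers to drop outright under a Safe Harbor-style policy (illustrative subset).
DIRECT_IDENTIFIERS = {"name", "ssn", "phone", "email", "address", "mrn"}

def deidentify(record: dict) -> dict:
    """Toy Safe Harbor-style pass: drop direct identifiers, coarsen quasi-identifiers."""
    clean = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    if "zip" in clean:                        # keep only the first 3 digits of ZIP codes
        clean["zip"] = clean["zip"][:3] + "XX"
    if "age" in clean and clean["age"] > 89:  # ages over 89 must be aggregated
        clean["age"] = "90+"
    return clean

record = {"name": "Jane Doe", "mrn": "12345", "age": 93, "zip": "48109", "dx": "J18.9"}
print(deidentify(record))  # -> {'age': '90+', 'zip': '481XX', 'dx': 'J18.9'}
```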
Industry and the government have also come together to democratize medical innovation with projects like NIH’s Lung Image Database Consortium (LIDC)—which aimed to accelerate research and development in lung cancer screening by building a publicly available database of lung CT scans with annotated nodules. That 2007 effort has since been merged with the Image Database Resource Initiative (IDRI) to create a completed reference database of lung nodules on CT scans.
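That database remains accessible today; as a sketch, assuming the third-party pylidc package and its bundled annotation metadata, querying a scan's radiologist annotations looks something like this:

```python
# Sketch of querying LIDC-IDRI with the third-party pylidc package
# (assumes `pip install pylidc`; loading pixel data also requires the DICOM files).
import pylidc as pl

# Pull one scan and group its radiologist annotations by physical nodule.
scan = pl.query(pl.Scan).filter(pl.Scan.patient_id == "LIDC-IDRI-0001").first()
nodules = scan.cluster_annotations()

for i, annotations in enumerate(nodules):
    ratings = [a.malignancy for a in annotations]  # 1 (benign) .. 5 (malignant)
    print(f"nodule {i}: {len(annotations)} readers, malignancy ratings {ratings}")
```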
At the same time, Big Tech has taken to labeling "open weights" models like Google's MedGemma as "open source." While this is a step in the right direction, true open source requires sharing every component of the AI system. By opening up access while holding back pieces of the puzzle, these players are cementing their place in the medical AI space by inviting developers to use their tools and models. Even as this engenders innovation, it forces developers to put their faith in the parts of the system they cannot inspect, a risk they should not be forced to take.
Breaking medical AI barriers with a new paradigm
While the challenges of balancing privacy with open collaboration are real, medical AI is revolutionizing patient care. With the stakes too high to maintain the status quo, the path forward requires the industry to prioritize collective progress over proprietary advantage.
By establishing robust frameworks for truly open source medical AI, creating data sharing incentives that benefit patients and developers alike, and fostering genuine partnerships between industry, academia, and government, we can build a future where life-saving innovations reach more patients faster and more equitably. The innovators who take control of their data and embrace the transparency of open source collaboration will be poised to lead this charge.
About Dr. Jason Corso
Dr. Jason Corso is Co-founder and Chief Science Officer at Voxel51, and Toyota Professor of AI and Electrical Engineering & Computer Science at the University of Michigan. A veteran in the field of computer vision, Dr. Corso has dedicated over 20 years to academic research on video understanding, robotics and data science.
