Healthcare providers and their patients stand to benefit dramatically from AI technologies, thanks to their ability to leverage data at scale to reveal new insights. But for AI developers to perform the research that will feed the next wave of breakthroughs, they first need the right data and the tools to use it. Powerful new techniques are now available to extract and utilize data from complex objects like medical imaging, but leaders must know where to invest their organizations’ resources to fuel this transformation.
The Life Cycle of Machine Learning
The machine learning process that AI developers follow can be looked at in four parts:
1. Finding useful data
2. Ensuring quality and consistency
3. Performing labeling and annotation
4. Training and evaluation
When a layperson envisions creating an AI model, most of what they picture is concentrated in step four: feeding data into the system and analyzing it to arrive at a breakthrough. But experienced data scientists know the reality is much more mundane: by a commonly cited estimate, about 80% of their time goes to “data wrangling” (the comparatively dull work of steps one, two, and three), while only about 20% is spent on analysis.
Many facets of the healthcare industry have yet to adjust to the data demands of AI, particularly when dealing with medical imaging. Most of our existing systems aren’t built to be efficient feeders for this kind of computation. Why is finding, cleansing, and organizing data so difficult and time-consuming? Here’s a closer look at some of the challenges in each stage of the life cycle.
Challenges in Finding Useful Data
AI developers need a high volume of data to ensure the most accurate results. This means data may need to be sourced from multiple archiving systems: PACS, VNAs, EMRs, and potentially other types. The outputs of these systems vary, so researchers need to design workflows for initial data ingestion and possibly for ongoing ingestion of new data. Data privacy and security must also be strictly accounted for.
However, as an alternative to this manual process, a modern data management platform can use automated connectors, bulk loaders, and/or a web uploader interface to more efficiently ingest and de-identify data.
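As a rough illustration of the de-identification step mentioned above, the sketch below removes direct identifiers from a metadata record and swaps the patient ID for a salted hash. The field names, the short list of identifying fields, and the deidentify function are assumptions made for this example; a production de-identification profile covers many more fields and also handles identifiers burned into the pixels.

```python
import hashlib

# Hypothetical set of directly identifying fields; a real profile is far longer.
IDENTIFYING_FIELDS = {"PatientName", "PatientBirthDate", "PatientAddress"}


def deidentify(record: dict, salt: str) -> dict:
    """Return a copy of record without direct identifiers.

    PatientID is replaced by a salted SHA-256 pseudonym so that studies from
    the same patient can still be grouped after de-identification.
    """
    cleaned = {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}
    if "PatientID" in cleaned:
        digest = hashlib.sha256((salt + str(cleaned["PatientID"])).encode("utf-8"))
        cleaned["PatientID"] = digest.hexdigest()[:16]
    return cleaned
```

Because the same salt always yields the same pseudonym, records ingested at different times still link to one patient, provided the salt itself is kept secret.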
As part of this interfacing with various archives, AI developers often source data across imaging modalities, including MR and CT scans, x-rays, and potentially other types of imaging. This presents similar challenges to the archive problem—researchers can’t create just one workflow to use this data, but rather have to design systems for each modality. One step toward greater efficiency is using pre-built automated workflows (algorithms) that handle basic tasks, such as converting a file format.
Once AI researchers have ingested data into their platform, challenges still remain in finding the right subsets. Medical images and their associated metadata must be searchable to enable teams to efficiently locate them and add them to projects. This requires the image and metadata to be indexable and to obey certain standards.
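To make the idea of indexable metadata concrete, here is a minimal in-memory sketch of an inverted index keyed on field and value. The MetadataIndex class and the field names are hypothetical; a real platform would more likely rely on a database or search engine, but the principle of normalizing values so they can be matched consistently is the same.

```python
from collections import defaultdict


class MetadataIndex:
    """Tiny inverted index: (field, value) -> set of image IDs."""

    def __init__(self):
        self._index = defaultdict(set)

    def add(self, image_id: str, metadata: dict) -> None:
        for field, value in metadata.items():
            # Normalise case so "CT" and "ct" land in the same bucket.
            self._index[(field.lower(), str(value).lower())].add(image_id)

    def search(self, **criteria) -> set:
        """Return IDs of images matching every field=value criterion."""
        matches = [
            self._index.get((field.lower(), str(value).lower()), set())
            for field, value in criteria.items()
        ]
        return set.intersection(*matches) if matches else set()
```

For example, after adding a few images, a call such as idx.search(modality="CT", bodypart="brain") returns only the images that satisfy both criteria.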
Challenges in Ensuring Quality and Consistency
Researchers know that even if they can get the data they’re interested in (which is not always a given), this data is often not ready to be used in machine learning. It’s frequently disorganized, lacks quality control, and has inconsistent or absent labeling or other issues, such as unstructured text data.
Ensuring a consistent level of quality is crucial for machine learning in order to normalize training data and avoid bias. But manually performing quality checks simply isn’t practical—spreading this work between multiple researchers almost guarantees inconsistency, and it’s too large a task for one researcher alone.
Just as algorithms can be used to preprocess data at the ingestion step, they can also be applied for quality checks. For example, neuroimaging researchers can create rules within a research platform to automatically run MRIQC, a quality control app, when a new file arrives that meets their specifications. They can set further conditions to automatically exclude images that don’t meet their quality benchmark.
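The exclusion rule described above might look something like the following sketch, which assumes the quality control step has already produced a dictionary of numeric metrics per image. It does not call MRIQC itself; the metric name "snr", the threshold, and the QualityRule, passes_qc, and partition names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityRule:
    metric: str      # name of a metric in the QC report (hypothetical)
    minimum: float   # smallest acceptable value for that metric


def passes_qc(report: dict, rules: list) -> bool:
    """True only if every rule's metric is present and meets its minimum."""
    return all(
        rule.metric in report and report[rule.metric] >= rule.minimum
        for rule in rules
    )


def partition(reports: dict, rules: list) -> tuple:
    """Split image IDs into (included, excluded) lists, preserving order."""
    included, excluded = [], []
    for image_id, report in reports.items():
        (included if passes_qc(report, rules) else excluded).append(image_id)
    return included, excluded
```

Treating a missing metric as a failure is a deliberate choice in this sketch: an image that was never assessed is excluded rather than silently admitted to the training set.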
Challenges in Labeling and Annotation
Consistency is a recurring theme when evaluating machine learning data. In addition to needing data with consistent quality control, AI developers also need consistently labeled and annotated data. However, given that imaging data for AI will have been sourced from multiple locations and practitioners, researchers must design their own approaches to ensuring uniformity. Once again, performing this task manually is prohibitively time-consuming and risks introducing its own inconsistencies.
A research data platform can help AI developers configure and apply custom labels. This technology can use natural language processing to read radiology reports associated with images, automate the extraction of specific features, and apply them to the image’s metadata. Once applied, these labels become searchable, enabling the research team to find the specific cases of interest to their training.
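As a deliberately simplified stand-in for the natural language processing described above, the sketch below uses keyword patterns with a crude negation check to pull labels out of report text. The label vocabulary, the negation cues, and the extract_labels function are assumptions for illustration; real clinical text needs far more careful handling of negation, uncertainty, and context.

```python
import re

# Hypothetical label vocabulary: label -> pattern to look for in report text.
LABEL_PATTERNS = {
    "nodule": re.compile(r"\bnodules?\b", re.IGNORECASE),
    "effusion": re.compile(r"\beffusions?\b", re.IGNORECASE),
}
# Hypothetical negation cues; a whole-word match avoids hitting "nodule".
NEGATION = re.compile(r"\b(no|without|negative for)\b", re.IGNORECASE)


def extract_labels(report_text: str) -> list:
    """Return sorted labels mentioned in sentences that carry no negation cue."""
    found = set()
    for sentence in re.split(r"[.\n]+", report_text):
        if NEGATION.search(sentence):
            continue  # crude: skip the whole sentence if it contains a negation
        for label, pattern in LABEL_PATTERNS.items():
            if pattern.search(sentence):
                found.add(label)
    return sorted(found)
```

The returned labels could then be written into the image’s metadata so that they become searchable alongside the other fields.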
A data platform can also help standardize labeling within a blind multi-reader study, by giving readers a defined menu of labels that they apply once they’ve drawn the region of interest.
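A defined menu of labels can be enforced in code with an enumeration, as in this minimal sketch. The three label values and the record_annotation function are hypothetical; each study would define its own terms.

```python
from enum import Enum


class FindingLabel(Enum):
    # Hypothetical menu; a real study would define its own terms.
    BENIGN = "benign"
    MALIGNANT = "malignant"
    INDETERMINATE = "indeterminate"


def record_annotation(reader_id: str, roi_id: str, label: str) -> dict:
    """Accept an annotation only if the label is on the study's menu."""
    try:
        chosen = FindingLabel(label.strip().lower())
    except ValueError:
        allowed = ", ".join(item.value for item in FindingLabel)
        raise ValueError(
            f"'{label}' is not on the menu; choose from: {allowed}"
        ) from None
    return {"reader": reader_id, "roi": roi_id, "label": chosen.value}
```

Rejecting free-text entries at the point of capture means every reader’s annotations use the same vocabulary, so no later reconciliation of spellings or synonyms is needed.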
Challenges in Training and Evaluation
Once the research team reaches the training and scoring stage (hopefully, having reduced the upfront time investment), there are still opportunities to increase efficiency and optimize machine learning processes. A crucial consideration is ensuring comprehensive provenance. Without it, the work will not be reproducible and will not receive regulatory approval. Access logs, versions, and processing actions should be recorded to ensure the integrity of the model, and this recording should be automated to avoid omissions.
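One way to automate this kind of record keeping is an append-only log in which each entry ties an action to a hash of the exact data it touched. The sketch below is an illustration under assumptions: the entry fields, the JSON Lines file format, and the provenance_entry and append_entry names are hypothetical rather than any particular platform’s schema.

```python
import hashlib
import json
from datetime import datetime, timezone


def provenance_entry(actor: str, action: str, payload: bytes, params: dict) -> dict:
    """Build one log entry describing who did what to which exact bytes."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        "action": action,
        "params": params,
        # The content hash pins the entry to a specific version of the data.
        "sha256": hashlib.sha256(payload).hexdigest(),
    }


def append_entry(log_path: str, entry: dict) -> None:
    """Append the entry as one JSON line; existing lines are never rewritten."""
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry, sort_keys=True) + "\n")
```

If every processing step calls these functions as part of its normal execution, the log is produced as a side effect of the work rather than depending on someone remembering to write it.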
Researchers may wish to conduct their machine learning training within the same platform where their data already resides, or they may have a preferred machine learning system that is outside of the platform. In this case, a data platform with open APIs can enable the data that has been centralized and curated to interface with an outside tool.
Because the amount of data used in machine learning training is so massive, teams should seek efficiencies in how they share it amongst themselves and with their machine learning tools. A data platform can snapshot selected data and enable a machine learning trainer to access it in place, rather than requiring duplication.
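A snapshot that avoids duplication can be as simple as a manifest recording where each file lives and what its contents hashed to at snapshot time. The following sketch assumes files on a local filesystem that are small enough to read into memory; the create_snapshot and changed_files names are hypothetical, and a real platform would typically snapshot at the storage or database layer.

```python
import hashlib
from pathlib import Path


def _sha256(path: Path) -> str:
    # Reads the whole file at once; adequate for a sketch, not for huge files.
    return hashlib.sha256(path.read_bytes()).hexdigest()


def create_snapshot(paths: list) -> dict:
    """Record each file's location and content hash; no bytes are copied."""
    return {str(Path(p)): _sha256(Path(p)) for p in paths}


def changed_files(snapshot: dict) -> list:
    """List snapshot entries whose file is missing or no longer matches."""
    changed = []
    for path_str, digest in snapshot.items():
        path = Path(path_str)
        if not path.is_file() or _sha256(path) != digest:
            changed.append(path_str)
    return sorted(changed)
```

A trainer can read the files directly from the paths in the manifest and call changed_files beforehand to confirm that the data still matches what was selected.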
Maximizing the Value of Data
Healthcare organizations are beginning to recognize the value of their data as a true asset that can power discoveries and improve care. But to realize this goal, leaders must give their teams the tools to maximize the potential of their data efficiently, consistently, and in a way that optimizes it for present technologies and lays the foundation for future insights. With coordinated efforts, today’s leaders can give data scientists tools to help reverse the 80/20 time split and accelerate AI breakthroughs.
About Travis Richardson
Travis Richardson is Chief Strategist at Flywheel, a biomedical research data platform. His career has focused on his passions for data management, data quality, and application interoperability. At Flywheel, he is leveraging his data management and analytics experience to enable a new generation of innovative solutions for healthcare with enormous potential to accelerate scientific discovery and advance precision care.