
Healthcare organizations are experiencing a seismic shift in how they handle unstructured data. From digital pathology to high-resolution imaging, genomic sequencing, and machine-generated sensor data, the volume of data that doesn’t fit neatly into rows and columns of a database has exploded across hospitals, research labs, and academic institutions.
While this unstructured data is intrinsic to patient care and scientific discovery, it’s also creating a mounting crisis for enterprise IT organizations: ballooning storage costs, constrained infrastructure, security and privacy risks, and a mess to untangle for AI. One-third of the world’s data comes from the healthcare industry – and it’s growing faster than in most other sectors, according to RBC Capital. “Hospitals produce an average of 50 petabytes of data each year, with as much as 97% of that data going unused,” according to the World Economic Forum.
Behind the scenes lies a hard truth: "cold" data, files that haven't been accessed in a year or more, continues to occupy premium on-premises and cloud file storage at an outsize cost. A lack of visibility across disparate data silos compounds the problem: IT managers and storage administrators don't have enough insight to know whether data can be moved, archived, or deleted, and they need buy-in from departments before making changes.
A more nuanced, collaborative approach to unstructured data management is now possible for healthcare organizations. It can cut unnecessary storage and management costs while making data more useful and accessible to researchers, data scientists, and analytics teams, as well as the departments now driving AI initiatives.
The Growing Cost of Inaction
The sheer size of healthcare files is a major factor in escalating costs: a single X-ray or CT image can consume as much as 30 megabytes. If a facility captures just a few dozen images each day, they quickly fill many gigabytes of space each month. Sequencing just one person's genome can require as much as 200 gigabytes of storage.
Contributing to the problem is data hoarding, which occurs when researchers or clinicians hang on to files indefinitely “just in case.” Without tools to understand or classify their data, teams often keep all of it. IT infrastructure and operations, in turn, ends up supporting an ever-expanding storage environment that must meet security, compliance, backup and performance standards.
Addressing the Data Hoarding Dilemma
To break this cycle, healthcare IT organizations should consider collaborative unstructured data management strategies that involve both automation and user participation.
The goal is threefold:
- Accelerate data tiering (online archiving) of cold data to lower-cost storage without changing how users access the data.
- Give departmental users the tools and visibility they need to make informed decisions about their own data.
- Prepare data for safe AI ingestion.
This shift begins with data discovery and classification. A global index of unstructured data across on-prem and cloud storage can show which files are used frequently, which are inactive, where data resides, and how fast it’s growing.
With a more detailed picture of the data profile, IT infrastructure managers can define automated policies to tier cold data after a certain age (e.g., 12 months) to more affordable cloud-based object storage. These rules can be enforced centrally with transparent data movement, so users continue to see their files from the original location. This way, researchers don’t waste their valuable time identifying data for tiering and archiving.
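As an illustration of the policy described above, the following is a minimal sketch of how an age-based rule might identify tiering candidates by last-accessed time. The 12-month threshold and the `find_cold_files` helper are hypothetical; commercial unstructured data management platforms index metadata across silos at scale and move the data transparently rather than simply listing it.

```python
import time
from pathlib import Path

COLD_AGE_DAYS = 365  # example policy threshold: not accessed in 12 months


def find_cold_files(root, cold_age_days=COLD_AGE_DAYS):
    """Walk a directory tree and return (path, size) pairs for files whose
    last-accessed time predates the cold-data cutoff."""
    cutoff = time.time() - cold_age_days * 86400
    cold = []
    for path in Path(root).rglob("*"):
        if path.is_file():
            st = path.stat()
            if st.st_atime < cutoff:  # last access is older than the cutoff
                cold.append((str(path), st.st_size))
    return cold
```

A report like this, aggregated by department or share, is what lets IT set the automated tiering rule with confidence before any data moves.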
Key departmental users can also get read-only access to dashboards and reports where they can see their departmental data footprint and do their own analysis. These reports allow department heads, researchers and clinicians to tag additional files or folders for archiving, such as data from completed studies.
This dual-pronged strategy of automating age-based transparent tiering and involving users in additional tiering decisions can double or even triple the amount of data moved off premium file storage, saving a large institution six or seven figures annually on data storage and backup costs. Data is never deleted, just relocated to less expensive, durable storage. Better still, the data owners, who know the data best, stay involved in its management, building stronger relationships with IT.
An additional use case is ransomware protection. With data tiering to immutable object storage such as AWS S3 Object Lock or Azure Blob Immutable Storage, archived data cannot be modified or deleted. Many large healthcare organizations may not be protecting all of their data equally against ransomware actors; with the right economics, now they can.
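To make the immutability point concrete, here is a hedged sketch of how an archive upload to S3 might request a WORM (write-once, read-many) retention lock. The bucket name, key, and retention period are illustrative, and the target bucket must have been created with Object Lock enabled; the function only builds the request parameters that a boto3 `put_object` call would accept.

```python
from datetime import datetime, timedelta, timezone


def object_lock_params(bucket, key, retention_days):
    """Build S3 put_object parameters that request a COMPLIANCE-mode
    retention lock. In COMPLIANCE mode, no user can overwrite or delete
    the object version until the retain-until date passes."""
    retain_until = datetime.now(timezone.utc) + timedelta(days=retention_days)
    return {
        "Bucket": bucket,
        "Key": key,
        "ObjectLockMode": "COMPLIANCE",
        "ObjectLockRetainUntilDate": retain_until,
    }


# A real upload would pass these parameters to boto3 (names are examples):
#   s3 = boto3.client("s3")
#   s3.put_object(Body=data, **object_lock_params("archive-bucket", "scan.dcm", 2555))
```

Because the lock is enforced by the storage service itself, ransomware that compromises a file server cannot encrypt or delete the archived copies.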
The Benefits of Department Collaboration on Data Management
For collaborative data management to succeed, users must trust that moving or tiering data won’t interrupt their work while IT still achieves goals for lower costs and lower complexity. Benefits can include:
- Non-disruptive access: Tiered data should remain visible in the same file paths. If a user clicks on a file that has been archived, it should open just like any other file without the need to call IT.
- Accessible dashboards: Users should be able to see basic file metadata (owner, size, age, last accessed) on dashboards. This can reveal cold vs. hot data, growth trends, and cost implications.
- Easy metadata enrichment: Users can tag files as ready for archiving by IT. They should also be able to tag directories or folders by project name, clinical area, or research keywords. When needed, IT can apply AI tools to help by scanning files across large data sets, inspecting file content, and delivering a subset of data that can be tagged with keywords. Unstructured data management software can apply tags automatically by policy. Enriching file metadata this way makes it easier and faster for users to curate the data they need for projects.
- Support for Chargeback: Many organizations are deploying showback or chargeback models for IT services like storage. For example, a department might only be charged for the data it keeps on expensive primary storage; anything archived to the cloud is free. Detailed reporting on data usage and costs helps departments plan ahead.
- Storage Cost Savings: For a large healthcare system, annual savings in the millions of dollars are possible, since it can rely less on expensive primary storage and defer hardware purchases.
- Better Classification and Search for AI: Integral to AI is high quality unstructured data. Metadata enrichment through data tagging gives more structure and context to file data so that it can be categorized and leveraged in AI data workflows for clinical research. Authorized users can search for specific files and folders of interest across their entire data estate without IT assistance.
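The tagging and search capabilities in the list above can be sketched as a toy in-memory index. The `TagIndex` class and its methods are hypothetical stand-ins; a real unstructured data management product persists tags in a global index spanning on-prem and cloud silos, but the idea is the same: tag a folder once, then find every file under it by keyword.

```python
from pathlib import PurePosixPath


class TagIndex:
    """Toy metadata-enrichment index: tag folders with keywords, then
    search a file inventory by tag. Illustrative only."""

    def __init__(self):
        self._tags = {}  # folder path -> set of tags

    def tag_folder(self, folder, *tags):
        """Attach one or more keyword tags to a folder (e.g. a completed study)."""
        self._tags.setdefault(str(PurePosixPath(folder)), set()).update(tags)

    def search(self, tag, files):
        """Return every file that lives under a folder carrying the tag."""
        tagged = [f for f, tags in self._tags.items() if tag in tags]
        return [f for f in files
                if any(PurePosixPath(f).is_relative_to(t) for t in tagged)]
```

For example, tagging `/research/oncology` with `completed-study` lets an authorized user later pull every file from that study for archiving or for an AI training set, without asking IT to hunt through shares.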
Petabytes of unstructured data are a blessing and a curse for healthcare CIOs. This data is an asset for future research to improve facility operations, diagnostics, treatments and outcomes for patients. It’s what healthcare CEOs and boards are clamoring for in the race to be profitable and attract patients and clinicians. Yet this data comes with high costs and compliance risks if not managed systematically. Instead of looking for more budget to expand storage capacity in the data center, IT leaders should start by understanding and classifying their data. By doing so, they can take advantage of lower-priced storage while supporting departments with new data services.
About Krishna Subramanian
Krishna Subramanian is the COO, president, and co-founder of Komprise. In her career, Subramanian has built three successful venture-backed IT businesses and was named a “2021 Top 100 Women of Influence” by Silicon Valley Business Journal.