
In 1971, a small team at Boston’s Beth Israel Hospital, led by Dr. Howard Bleich and Dr. Warner Slack, booted up the hospital’s first Center for Clinical Computing. Their PDP-11 minicomputer stored lab results and a few hundred ICD-8 codes on nine-track tape. Each evening, residents lined up to run charge‐slip reports and marvel at the glow of the terminal. Fifty years later, we still tabulate charges, but the code set has grown from those few hundred entries to nearly 70,000 in ICD-10-CM, 75,000 in ICD-10-PCS, plus CPT, HCPCS, and HCC variants. The problem: complexity has scaled exponentially while human workflows have not.
The task has outpaced linear human processes. My goal in this post is to explain (without jargon) why the coding stack is broken, how large language models (LLMs) finally give us a viable alternative, and what a modern, GPU-native pipeline looks like in daily use.
Why the Old Stack Breaks Down
Combinatorial overload
The average inpatient stay touches 12 diagnosis codes, 7 procedure codes, at least 3 modifiers, and multiple payer edits. Multiply that by roughly 30 million annual discharges and you see why coders fall back on the same generic codes they memorized in school.
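To make that scale concrete, here is a quick back-of-the-envelope check using only the figures quoted above; treating each code or modifier as one independent decision is a simplifying assumption for illustration.

```python
# Back-of-the-envelope scale check using only the figures quoted above.
DX_PER_STAY = 12            # diagnosis codes on an average inpatient stay
PROC_PER_STAY = 7           # procedure codes
MODIFIERS_PER_STAY = 3      # modifiers (at least)
ANNUAL_DISCHARGES = 30_000_000

decisions_per_stay = DX_PER_STAY + PROC_PER_STAY + MODIFIERS_PER_STAY
annual_decisions = decisions_per_stay * ANNUAL_DISCHARGES

print(f"Coding decisions per stay: {decisions_per_stay}")   # 22
print(f"Coding decisions per year: {annual_decisions:,}")   # 660,000,000
```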
Dynamic payer logic
Every 90 days, Medicare refreshes NCCI edits and local coverage determinations. Commercial plans publish changes even faster through private portals that coders rarely see on time. Legacy rule engines update quarterly at best, so hospitals chase a moving target with stale rules.
Labor constraints
Industry bodies warn of a growing talent gap: the American Medical Association reports a 30 percent shortage of certified medical coders on the horizon. Training a new coder can take up to 18 months, and retention is slipping because routine charts feel like factory work. Burnout drives errors, errors drive denials, and the cycle feeds on itself.
Financial stakes
Each one-point drop in coding accuracy removes roughly two points of margin in risk-based contracts. A 300-bed hospital can lose $8-10 million per year on under-coded or denied claims. Boards now ask for real-time accuracy metrics, not retrospective audits.
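As a rough illustration of how those two figures relate, the sketch below applies the two-to-one accuracy-to-margin ratio to a hypothetical hospital's net patient revenue; the revenue figure is an assumption chosen for illustration, not data from any deployment.

```python
# Rough revenue-at-risk illustration. The two-to-one accuracy-to-margin ratio
# and the $8-10M loss range come from the text above; the net patient revenue
# figure below is a hypothetical value for a 300-bed hospital.
NET_PATIENT_REVENUE = 500_000_000          # USD per year (assumed)
MARGIN_POINTS_PER_ACCURACY_POINT = 2.0     # ratio quoted above

def margin_at_risk(accuracy_drop_points: float) -> float:
    """Dollars of margin lost for a given drop in coding accuracy."""
    lost_margin_pct = accuracy_drop_points * MARGIN_POINTS_PER_ACCURACY_POINT
    return NET_PATIENT_REVENUE * lost_margin_pct / 100

for drop in (0.5, 1.0, 2.0):
    print(f"{drop:.1f}-point accuracy drop -> ${margin_at_risk(drop):,.0f} of margin at risk")
```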
The net effect: human-centered workflows can no longer deliver the speed, scale, or precision the revenue cycle requires.
What Modern AI Brings to the Table, and Why It’s Finally Affordable
Large language models fine-tuned on clinical corpora shift coding from “find and type” to “infer and explain.” The difference is architectural:
- Few-shot calibration
Few-shot calibration is the turning point. Traditional models needed tens of thousands of labeled charts before they could understand local phrasing, so every deployment dragged on for months and still produced a one-size-fits-all model. A modern clinical language model learns a hospital's, or even a single provider's, documentation style from about five hundred historical charts. That compact sample is enough for the system to recognize shorthand like "rule out NSTEMI" for chest pain evaluation. When a new template appears or a specialist joins the group, the model can fine-tune overnight on a fresh handful of notes and keep its accuracy intact. The result is rapid launch, ongoing personalization, and coding that evolves in step with documentation practices instead of lagging behind them (a minimal sketch of assembling such a calibration set follows this list).
- Context windows that fit a full chart
Current clinical LLMs accept context windows of up to 32,000 tokens, enough for the History and Physical, op note, imaging summaries, and nursing flowsheets in one pass. The model sees the patient story as a single graph rather than fragments.
- Token-level attribution
Attention maps show exactly which sentence, lab value, or imaging finding triggered a code. Compliance and audit teams can export that rationale directly to a PDF packet.
- Confidence scoring
Probabilistic outputs let the system route high-certainty encounters straight to billing while flagging low-certainty charts for human review. This dynamic routing is where throughput gains multiply.
- Continuous back-prop on payer feedback
Every remit with CARC and RARC codes becomes fresh training data. The model fine-tunes nightly, so tomorrow morning it will block today's new denial reason automatically.
- Falling inference costs
GPU spot pricing, quantized weights, and serverless inference cut per-chart compute cost by more than 70 percent compared to 2021. Autonomous coding is no longer an ML science project; it is cheaper than offshore labor on a fully loaded basis.
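To make the few-shot calibration step concrete, here is a minimal sketch, assuming a simple prompt/completion JSONL format, of how roughly five hundred historical charts might be packaged for a fine-tuning job. The Chart structure, field names, and file layout are illustrative assumptions, not an actual production schema.

```python
import json
from dataclasses import dataclass

@dataclass
class Chart:
    """One historical encounter: the note text plus its coder-confirmed codes."""
    note_text: str
    confirmed_codes: list[str]   # e.g. ["I21.4", "99285"]

def build_calibration_set(charts: list[Chart], path: str, limit: int = 500) -> int:
    """Write prompt/completion pairs (JSONL) for a generic fine-tuning job.

    Roughly `limit` historical charts is the sample size cited above for
    learning a site's documentation style; extra charts are dropped here.
    """
    rows = 0
    with open(path, "w", encoding="utf-8") as f:
        for chart in charts[:limit]:
            record = {
                "prompt": f"Assign billing codes to this note:\n{chart.note_text}",
                "completion": " ".join(sorted(chart.confirmed_codes)),
            }
            f.write(json.dumps(record) + "\n")
            rows += 1
    return rows

# Example: the local shorthand "rule out NSTEMI" paired with the codes a coder
# actually billed teaches the model the site's phrasing.
charts = [Chart("CC: chest pain. Plan: rule out NSTEMI, serial troponins.",
                ["R07.9", "Z03.89"])]
print(build_calibration_set(charts, "calibration.jsonl"), "examples written")
```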
A Modern Coding Pipeline in Practice
Below is the blueprint we use in production environments. The numbers come from field deployments across multi-hospital systems.
| Stage | Tech Component | Operational Result |
| --- | --- | --- |
| Ingest | Real-time FHIR R4/R5 APIs (Bulk Export + Subscriptions); streaming HL7 v2.x feeds (ADT, ORU, ORM, DFT); secure SFTP/X12 gateways for legacy systems and payer 835/277 files | No manual file drops, no batch lag. |
| Interpret | A fleet of containerized GPU nodes runs a domain-tuned LLM that maps each document to ICD, CPT, HCPCS, and E&M codes. | 1,000+ charts per minute with average latency of 220 milliseconds. |
| Explain | The Bilateral Audit layer stores token-level rationales for every code. | Auditors download evidence in seconds; coders learn from highlights. |
| Route | Probabilistic splitter sends high-confidence encounters Straight-to-Bill; others flow to a coder review queue (see the sketch after the table). | 70 percent STB rate and 40 percent denial drop at day 30. |
| Learn | Nightly trainer ingests coder feedback and payer denial data, fine-tunes weights, and rolls out via canary release. | Accuracy improves 0.5 points per month with no downtime. |
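As a minimal illustration of the Route stage (and the confidence scoring described earlier), the sketch below routes a coded encounter by a probability threshold. The 0.95 cutoff, the CodedEncounter fields, and the queue names are assumptions chosen for illustration, not the production configuration, which tunes thresholds per payer and specialty.

```python
from dataclasses import dataclass

# Hypothetical cutoff; real deployments tune this per payer and specialty.
STRAIGHT_TO_BILL_THRESHOLD = 0.95

@dataclass
class CodedEncounter:
    encounter_id: str
    codes: list[str]            # proposed ICD/CPT/HCPCS codes
    confidence: float           # model's minimum per-code probability
    rationale: dict[str, str]   # code -> supporting sentence (token-level attribution)

def route(encounter: CodedEncounter) -> str:
    """Send high-certainty encounters straight to billing; flag the rest for review."""
    if encounter.confidence >= STRAIGHT_TO_BILL_THRESHOLD:
        return "straight_to_bill"
    return "coder_review_queue"

enc = CodedEncounter(
    encounter_id="E-1001",
    codes=["I21.4", "93010"],
    confidence=0.97,
    rationale={"I21.4": "Troponin elevated with non-ST-elevation MI documented."},
)
print(route(enc))  # straight_to_bill
```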
Market Status and Near Horizon
Adoption is moving from early pilots to system-wide contracts. A 2023 Frost & Sullivan report indicates that over 30% of healthcare organizations are piloting or planning autonomous coding solutions. Payers are leaning in because transparent audit logs reduce their own review costs. Regulators see potential to relieve the coder shortage and are drafting guardrails rather than bans.
The next milestones:
- Multimodal input
Adding DICOM imaging and waveform signals to the context window so procedure codes align with actual device IDs and implant registries.
- Synthetic pre-adjudication
Running a full payer rule simulation before claim generation, preventing denials rather than chasing them (see the sketch after this list).
- Edge inference
Deploying a lightweight model inside the EHR for real-time physician prompts while a heavier cloud model finalizes the claim.
- Real-time, point-of-care coding while the provider types
As clinical text streams into the note, the engine proposes ICD, CPT, and HCC codes on the fly, letting clinicians adjust documentation and resolve gaps before they ever hit "save."
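To sketch what a synthetic pre-adjudication pass could look like, the example below screens a draft claim against a procedure-pair edit table before the claim is generated. The edit pairs and claim codes are made-up placeholders; a real simulation would load the current NCCI files and payer-specific policies rather than a hard-coded set.

```python
# Minimal pre-adjudication sketch: screen a draft claim against a
# procedure-to-procedure edit table before it is ever submitted.
# The edit pairs below are illustrative placeholders, not real NCCI data.
EDIT_PAIRS: set[tuple[str, str]] = {
    ("80048", "80053"),   # hypothetical: basic panel bundled into comprehensive panel
}

def pre_adjudicate(claim_codes: list[str]) -> list[str]:
    """Return human-readable warnings for code pairs a payer would likely deny."""
    warnings = []
    for column1, column2 in EDIT_PAIRS:
        if column1 in claim_codes and column2 in claim_codes:
            warnings.append(
                f"{column1} is typically bundled into {column2}; "
                "remove it or document why a modifier applies."
            )
    return warnings

draft_claim = ["80048", "80053", "99213"]
for warning in pre_adjudicate(draft_claim):
    print("DENIAL RISK:", warning)
```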
The Road Ahead
Coding started as ink in a ledger, then punch cards, then desktop encoders. The workload outgrew each step. LLMs and scalable GPUs finally give us a platform that grows with complexity instead of buckling under it. Hospitals that adopt autonomous, explainable coding see tangible gains: faster cash, lower denials, happier clinicians, and continuous learning baked into the stack.
The choice is clear. Either keep hiring people to fight exponential complexity or deploy systems that learn at exponential speed. The residents lined up at that 1971 terminal would have taken the latter if they had the option. Now we do.
About Jot Sarup Singh
Jot Sarup Singh is Co-founder and Chief Product & Technology Officer at RapidClaims, the AI-driven revenue-cycle platform re-engineering US medical billing with large-language-model automation. Since co-launching the company in 2023, Jot has architected a GPU-native LLM pipeline that now supports more than 25 medical specialties with high autonomous accuracy, helping hospitals trim billing costs by up to 70 percent and integrate with dozens of leading EHRs in weeks rather than months.
Under his product leadership, RapidClaims has scaled 6× in recent quarters and attracted $11.1 million in venture funding, including an $8 million Series A round led by Accel and a $3.1 million seed round from Together Fund, Better Capital, Neon Fund, and prominent healthcare angels.