Agentic AI Platform

Enterprise Case Study

Transforming small seeds into large-scale patient segments

An autonomous multi-agent AI platform that identifies statistically similar "look-alike" patients across 312 million US records — delivering 3.2× targeting accuracy with 97% reduction in time-to-segment.

Asthma Diabetes COPD Cardiovascular Oncology

Senior Leadership, Healthcare Marketing Division · April 2026 · Confidential

5.04M

Look-alike patients identified across 5 therapy areas

92.5%

Average AUC-ROC across all therapeutic area models

3.2×

Campaign targeting accuracy vs. traditional methods

<4h

Time-to-segment per therapy area (previously 6–8 weeks)

01 The Challenge

The $30B Healthcare Marketing Challenge Demands Precision Targeting

The US healthcare marketing industry faces a fundamental transformation. Mass-marketing approaches yield diminishing returns as patients expect personalized, relevant engagement. The core challenge: given a small verified patient database (~12K patients), identify statistically similar "look-alike" patients from ~312 million US records while navigating duplicate records, fragmented data sources, and HIPAA compliance requirements. Our solution addresses significant data duplication issues and leverages advanced Look-alike Modeling (LAM), which enables the firm to prioritize patients based on statistical signals and pre-set clinical criteria. The final architecture integrates with industry-standard tools such as LiveRamp and gives campaign managers an interface to feed seed lists, select driver signals, and visualize identified segments.

Seed Scarcity

The firm possesses verified patient lists for each therapeutic area representing less than 0.1% of the total addressable population. For asthma, look-alikes of 12,450 seed patients must be found among 312 million US records. This extreme class imbalance renders traditional supervised learning ineffective without specialized techniques.
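To make the imbalance concrete, the short calculation below uses only the asthma figures quoted above. The variable names are illustrative, and treating every unlabeled record as a potential negative is a simplifying assumption.

```python
# Figures quoted in the text above (asthma).
seed_patients = 12_450
population_records = 312_000_000

# Share of the scored population that carries a verified positive label.
positive_rate = seed_patients / population_records

# Unlabeled records per labeled positive: the ratio any class-weighting or
# resampling scheme has to compensate for.
unlabeled_per_positive = (population_records - seed_patients) / seed_patients

print(f"positive rate: {positive_rate:.4%}")
print(f"unlabeled records per seed patient: {unlabeled_per_positive:,.0f}")
```

With roughly 25,000 unlabeled records for every seed patient, a classifier that predicts "negative" everywhere would look almost perfectly accurate, which is why accuracy alone is not a useful metric here.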

Data Duplication

The broader US population database is assembled from multiple third-party providers — claims data, pharmacy benefit managers, EHRs, and consumer data. Initial analysis revealed approximately 6% of records were duplicates or near-duplicates, with patients appearing under different identifiers across sources.

Continuous Identification

Healthcare marketing is not a one-time exercise. Patient populations evolve as new diagnoses are made, treatments change, and demographics shift. The firm required a system that could continuously identify and re-prioritize patients based on evolving statistical signals — not a static model producing a single output.

$30B

US Healthcare Marketing Market

312M

US Population Records to Score

73K

Verified Seed Patients Across 5 Therapy Areas

~6%

Duplicate Rate in Third-party Sources

02 Market Opportunity

Five Therapeutic Areas Represent $83.5M in Revenue Impact

Therapy Area	US Prevalence	Key Products	Seed Size	Look-alikes Identified	Revenue Impact
Asthma	25.0M	Nebulizers, Inhalers, Air Purifiers	12,450	847K	$18.2M
Diabetes (Type 2)	37.3M	Sugar-free Drinks, Glucose Monitors, Insulin Pens	18,200	2,010K	$37.3M
COPD	15.7M	Nebulizers, Oxygen Concentrators, Peak Flow Meters	8,300	527K	$12.1M
Cardiovascular	82.6M	BP Monitors, Heart Supplements, Cholesterol Kits	24,600	1,800K	$19.6M
Oncology	18.1M	Nutritional Supplements, Comfort Items, Support Kits	9,800	672K	$8.4M
Total	178.7M Total Addressable		73,350	5,856K	$83.5M

03 Solution Architecture

Agentic AI Replaces Monolithic Pipelines with Autonomous Agents

Traditional ML pipelines follow rigid, sequential execution. The Agentic AI paradigm decomposes the pipeline into six autonomous agents — each with defined goals, perception capabilities, reasoning logic, and action spaces — enabling continuous learning and self-correction without manual intervention.

⚙️

Orchestrator Agent

Pipeline Coordinator

Coordinates pipeline execution via DAG-based task scheduling. Monitors agent health, handles failures with exponential backoff retry, and provides real-time status updates to the marketing team dashboard via WebSocket. Maintains a unique pipeline ID for every execution for full audit and reproducibility.

DAG-based execution ordering ensures agents run only after dependencies complete

Retry logic with exponential backoff handles transient failures (e.g., Snowflake timeouts)

Pipeline versioning tracks every execution with unique pipeline ID
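The ordering and retry behavior described above can be sketched with the Python standard library. This is a minimal illustration, not the platform's code: the dependency map, the function names, and the retry parameters are assumptions, and catching every exception is a simplification (a real system would retry only transient errors such as timeouts).

```python
import time
from graphlib import TopologicalSorter

# Hypothetical dependency map: each agent lists the agents it waits on.
PIPELINE = {
    "deduplication": set(),
    "feature_engineering": {"deduplication"},
    "modeling": {"feature_engineering"},
    "scoring": {"modeling"},
    "feedback": {"scoring"},
}


def run_with_backoff(task, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call task(); on failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return task()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: surface the failure
            sleep(base_delay * 2 ** attempt)


def run_pipeline(agents, dependencies=PIPELINE, sleep=time.sleep):
    """Run each agent callable in dependency order and return that order."""
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        run_with_backoff(agents[name], sleep=sleep)
    return order


if __name__ == "__main__":
    # Stand-in agents that do nothing; real agents would do the actual work.
    print(run_pipeline({name: (lambda: None) for name in PIPELINE}))
```

Passing `sleep` as a parameter keeps the retry delays testable without actually waiting.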

🔍

Deduplication Agent

Identity Resolution

Deterministic PII hashing via SHA-256

Probabilistic Jaro-Winkler matching (threshold 0.85)

LiveRamp AbiliTec graph resolution

Resolves 94.2% of duplicate records

🧬

Feature Engineering Agent

Signal Extraction

5 feature categories: Demographics, Clinical, SDoH, Environmental, Behavioral

Therapy-specific ICD-10/NDC mappings

Adaptive feature registry per therapy area

🤖

Modeling Agent

Ensemble ML Training

Trains LR + XGBoost with SMOTE oversampling

Automated hyperparameter optimization

Multi-threshold evaluation (50–90%)

📊

Scoring Agent

Population Scoring

Processes 312M records in 5M-record chunks

Generates propensity scores for every record

Creates decile-based patient segments
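The chunked scoring described above can be sketched as follows. The record total and chunk size come from the text; `chunk_bounds`, `score_population`, and the `score_chunk` callback are hypothetical names, and the callback stands in for whatever fetches, scores, and writes back each range.

```python
from typing import Callable, Iterator, Tuple

TOTAL_RECORDS = 312_000_000  # from the text
CHUNK_SIZE = 5_000_000       # from the text


def chunk_bounds(total: int, size: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, stop) row ranges that cover `total` rows."""
    for start in range(0, total, size):
        yield start, min(start + size, total)


def score_population(
    score_chunk: Callable[[int, int], None],
    total: int = TOTAL_RECORDS,
    size: int = CHUNK_SIZE,
) -> int:
    """Apply score_chunk(start, stop) to each range; return rows processed."""
    processed = 0
    for start, stop in chunk_bounds(total, size):
        score_chunk(start, stop)  # hypothetical: fetch rows, score, write back
        processed += stop - start
    return processed


if __name__ == "__main__":
    bounds = list(chunk_bounds(TOTAL_RECORDS, CHUNK_SIZE))
    print(len(bounds), "chunks; last chunk:", bounds[-1])
```

At these sizes the population splits into 63 chunks, the last one holding the remaining 2 million records.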

🔄

Feedback Agent

Continuous Learning

Ingests campaign engagement & conversion data

Enriches seed databases with high-value patients

Triggers autonomous retraining when seed grows >5%

Self-Correction

Upstream errors detected and re-processed automatically without manual intervention

Adaptive Optimization

Feature weights shift per therapy area without manual tuning

Continuous Learning

Campaign results feed back into model improvement each cycle

Autonomous Decisions

No data science intervention required for routine operations

04 Identity Resolution

Three-Stage Deduplication Resolves 94.2% of Duplicate Records

The Deduplication Agent implements a three-stage identity resolution pipeline addressing the critical challenge of duplicate records across third-party data sources. Each stage handles the records that the previous stage could not resolve.

65%

of resolutions

Deterministic Matching

SHA-256 hashed PII fields: exact match on name + date of birth + ZIP3. Fastest and most precise resolution layer; false positives arise only when distinct patients share all three fields.

name_hash + dob_hash + zip3
→ Merge on exact match
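A minimal sketch of the deterministic stage, assuming records arrive as dictionaries with `id`, `name`, `dob`, and `zip` fields. The normalization (upper-casing and collapsing whitespace) and the helper names are assumptions; the source specifies only the hash function and the three key fields.

```python
import hashlib
from typing import Dict, List, Tuple


def _sha256(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def match_key(name: str, dob: str, zip_code: str) -> Tuple[str, str, str]:
    """Return (name_hash, dob_hash, zip3) for exact-match merging."""
    normalized_name = " ".join(name.upper().split())  # assumed normalization
    return _sha256(normalized_name), _sha256(dob.strip()), zip_code.strip()[:3]


def merge_exact(records: List[Dict]) -> List[List[int]]:
    """Group record ids that share the same match key."""
    groups: Dict[Tuple[str, str, str], List[int]] = {}
    for rec in records:
        key = match_key(rec["name"], rec["dob"], rec["zip"])
        groups.setdefault(key, []).append(rec["id"])
    # Only keys seen more than once represent duplicates.
    return [ids for ids in groups.values() if len(ids) > 1]


if __name__ == "__main__":
    sample = [
        {"id": 1, "name": "Jane  Doe", "dob": "1980-01-02", "zip": "30301"},
        {"id": 2, "name": "jane doe", "dob": "1980-01-02", "zip": "30309"},
        {"id": 3, "name": "John Roe", "dob": "1975-05-05", "zip": "10001"},
    ]
    print(merge_exact(sample))
```

Plain unsalted hashes are used here for brevity. Whether the platform salts or keys its hashes is not stated in the source.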

25%

of resolutions

Probabilistic Matching

Jaro-Winkler similarity with blocking on ZIP3. Threshold: 0.85. Catches spelling variations and data entry errors across sources.

jaro_winkler(name) ≥ 0.85
+ exact(dob) + numeric(age, ±2)
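The rule above can be sketched with a self-contained Jaro-Winkler implementation. The similarity functions follow the standard definition (prefix scale 0.1, prefix capped at four characters); the record layout and the `is_probable_match` helper are assumptions made for illustration.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted by the length of the common prefix (max 4)."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


def is_probable_match(a: dict, b: dict, threshold: float = 0.85) -> bool:
    """Stage 2 rule: same ZIP3 block, exact DOB, age within 2, similar name."""
    return (
        a["zip3"] == b["zip3"]
        and a["dob"] == b["dob"]
        and abs(a["age"] - b["age"]) <= 2
        and jaro_winkler(a["name"].upper(), b["name"].upper()) >= threshold
    )
```

Blocking on ZIP3 means only records in the same three-digit ZIP prefix are compared, which keeps the number of pairwise comparisons manageable at population scale.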

10%

of resolutions

LiveRamp AbiliTec

Identity graph resolution for remaining unmatched records. Resolves cross-device and cross-source identities via RampID for complete coverage.

AbiliTec API → RampID
→ Identity graph resolution

18.7M

Duplicate Records Resolved

6.0%

Population Duplication Rate

<1%

Post-Dedup Campaign Waste

93%

Reduction in Wasted Spend

Therapy Area	Stage 1 (Deterministic)	Stage 2 (Probabilistic)	Stage 3 (LiveRamp)	Total Dedup Rate
Asthma	65.2%	24.8%	10.0%	94.3%
Diabetes	62.1%	26.3%	11.6%	93.8%
COPD	67.4%	23.1%	9.5%	95.2%
Cardiovascular	63.8%	25.4%	10.8%	94.6%
Oncology	60.5%	27.2%	12.3%	93.1%

05 Feature Engineering

Five Feature Categories Drive Therapy-Specific Signal Extraction

👤

Demographics

female_35_44, region_southeast

Asthma / Diabetes

75% / 78%

🏥

Clinical (ICD-10 / NDC)

icd_J45_20, rx_albuterol

Asthma / Diabetes

95% / 92%

🏘️

Social Determinants of Health

payer_commercial, income_bracket

Asthma / Diabetes

55% / 65%

🌡️

Environmental

aqi_moderate, pollen_high

Asthma / Diabetes

70% / 30%

📱

Behavioral

adherence_high, digital_engaged

Asthma / Diabetes

68% / 85%

ICD-10 & NDC Mappings

Area	Primary ICD-10	Key Drugs
Asthma	J45.20, J45.30, J45.40	Albuterol, Fluticasone, Montelukast, Budesonide
Diabetes	E11.9, E11.65, E10.9	Metformin, Insulin Glargine, Empagliflozin
COPD	J44.0, J44.1, J44.9	Tiotropium, Umeclidinium, Roflumilast
Cardio	I10, I25.10, I50.9	Atorvastatin, Lisinopril, Amlodipine
Oncology	C50.911, C34.90, C61	Pembrolizumab, Trastuzumab, Tamoxifen
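As an illustration of how such mappings might become model inputs, the sketch below turns the asthma codes and drugs from the table into 0/1 indicator features. The naming pattern follows the examples `icd_J45_20` and `rx_albuterol` shown earlier; the helper functions and the one-indicator-per-code encoding are assumptions.

```python
# Asthma mappings taken from the table above.
ASTHMA_ICD10 = ["J45.20", "J45.30", "J45.40"]
ASTHMA_DRUGS = ["Albuterol", "Fluticasone", "Montelukast", "Budesonide"]


def icd_feature(code: str) -> str:
    """J45.20 -> icd_J45_20"""
    return "icd_" + code.replace(".", "_")


def rx_feature(drug: str) -> str:
    """Albuterol -> rx_albuterol"""
    return "rx_" + drug.lower().replace(" ", "_")


def encode_patient(patient_codes, patient_drugs,
                   icd_codes=ASTHMA_ICD10, drugs=ASTHMA_DRUGS):
    """Return a dict of 0/1 indicator features for one patient."""
    features = {icd_feature(c): int(c in patient_codes) for c in icd_codes}
    features.update({rx_feature(d): int(d in patient_drugs) for d in drugs})
    return features


if __name__ == "__main__":
    print(encode_patient({"J45.20"}, {"Albuterol", "Montelukast"}))
```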

06 Model Performance

Ensemble ML Achieves 92.5% Average AUC Across All Therapy Areas

The Positive-Unlabeled learning framework combines Logistic Regression with XGBoost in a weighted ensemble, autonomously optimizing weights per therapy area. SMOTE oversampling addresses the extreme class imbalance inherent to this problem.

Therapy Area	AUC-ROC	Precision @70%	Recall @70%	F1 @70%	LR : XGB
Asthma	94.2%	0.89	0.82	0.85	0.35 : 0.65
Diabetes	92.8%	0.86	0.79	0.82	0.40 : 0.60
COPD	91.5%	0.84	0.77	0.80	0.38 : 0.62
Cardiovascular	93.1%	0.87	0.81	0.84	0.42 : 0.58
Oncology	90.7%	0.82	0.75	0.78	0.35 : 0.65
Average	92.5%	0.86	0.79	0.82	—
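The F1 column can be cross-checked against the precision and recall columns, since F1 is their harmonic mean. The snippet below recomputes it from the table values; the dictionary layout is only for illustration.

```python
# (precision, recall, reported F1) at the 70% threshold, from the table above.
REPORTED = {
    "Asthma": (0.89, 0.82, 0.85),
    "Diabetes": (0.86, 0.79, 0.82),
    "COPD": (0.84, 0.77, 0.80),
    "Cardiovascular": (0.87, 0.81, 0.84),
    "Oncology": (0.82, 0.75, 0.78),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


for area, (p, r, reported) in REPORTED.items():
    print(f"{area}: F1 = {f1_score(p, r):.2f} (reported {reported:.2f})")
```

Each recomputed value rounds to the reported figure, so the three columns are internally consistent.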

Ensemble Architecture

Model A

Logistic Regression

C=500, L2 penalty
class_weight=balanced
max_iter=1000

Model B

XGBoost

n_estimators=300
max_depth=6, lr=0.1
scale_pos_weight=auto

Ensemble

Weighted Average

Optimized per therapy area
Best of: accuracy +
interpretability
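The weighted average can be written in a few lines. The weights below are the LR : XGB ratios from the model performance table; the function name and the example probabilities are hypothetical.

```python
# LR : XGB weights per therapy area, from the model performance table.
ENSEMBLE_WEIGHTS = {
    "Asthma": (0.35, 0.65),
    "Diabetes": (0.40, 0.60),
    "COPD": (0.38, 0.62),
    "Cardiovascular": (0.42, 0.58),
    "Oncology": (0.35, 0.65),
}


def ensemble_score(area: str, lr_prob: float, xgb_prob: float) -> float:
    """Blend the two model probabilities using the area's weights."""
    w_lr, w_xgb = ENSEMBLE_WEIGHTS[area]
    return w_lr * lr_prob + w_xgb * xgb_prob


if __name__ == "__main__":
    # Hypothetical probabilities for one asthma record.
    print(round(ensemble_score("Asthma", lr_prob=0.60, xgb_prob=0.80), 2))
```

Because each pair of weights sums to 1, the blended score stays in the same 0 to 1 range as the inputs.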

✓ PU Learning Framework

✓ SMOTE Oversampling

✓ Correlation Removal > 0.9

✓ Multi-threshold Evaluation

✓ Stratified 80/20 Split
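One item in this checklist, correlation removal above 0.9, can be illustrated with a greedy filter. The source states only the threshold, so the strategy (keep the first feature, drop later ones that correlate too strongly with anything already kept), the function names, and the sample columns are assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.9):
    """Return feature names to keep, in their original order."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


if __name__ == "__main__":
    sample = {
        "age": [30, 40, 50, 60],
        "age_months": [360, 480, 600, 720],  # redundant with age
        "aqi": [3, 1, 4, 1],
    }
    print(drop_correlated(sample))
```

The pairwise loop is quadratic in the number of features, which is acceptable for a feature registry of modest size but would need a vectorized approach for very wide tables.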

07 Population Scoring

5.04 Million Look-alike Patients Identified with Decile Prioritization

5.04M

Total Look-alike Patients Identified across all 5 therapeutic areas

✓ 312M records scored

✓ 5M-record batch chunks

✓ <4 hours per therapy area

✓ Uploaded to Snowflake

Decile	Score Range	Patients	Recommended Action
D1 — Top 10%	0.90 – 1.00	504,000	Immediate Activation
D2	0.80 – 0.89	756,000	High-Priority Targeting
D3	0.70 – 0.79	890,000	Standard Campaign
D4 – D7	0.40 – 0.69	1,800,000	Nurture / Awareness
D8 – D10	0.00 – 0.39	1,090,000	Exclude from Targeting
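The table can be read as a lookup from propensity score to recommended action. The sketch below encodes it that way; treating each lower bound as inclusive and letting each band run up to the next one is an assumption, since the table lists ranges to two decimal places only.

```python
# (lower bound, band, recommended action), from the table above.
BANDS = [
    (0.90, "D1", "Immediate Activation"),
    (0.80, "D2", "High-Priority Targeting"),
    (0.70, "D3", "Standard Campaign"),
    (0.40, "D4-D7", "Nurture / Awareness"),
    (0.00, "D8-D10", "Exclude from Targeting"),
]

# Patient counts per band, from the table above.
PATIENTS = [504_000, 756_000, 890_000, 1_800_000, 1_090_000]


def assign_band(score: float):
    """Return (band, action) for a propensity score in [0, 1]."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    for lower, band, action in BANDS:
        if score >= lower:
            return band, action


if __name__ == "__main__":
    print(assign_band(0.93))
    print(f"total patients across bands: {sum(PATIENTS):,}")
```

The band counts sum to 5,040,000, matching the 5.04M headline figure for this section.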

08 User Interface

Intuitive UI Empowers Marketing Teams Without Data Science Expertise

Designed around a "Pharma Command" aesthetic with therapeutic area color-coding, the interface enables any marketing team member to operate the entire system independently across five core screens.

01 — Dashboard

Real-time Overview

All 5 therapeutic areas with segment growth trends, feature signal radar, and agent status indicators.

02 — Seed Input

Patient Upload

Drag-and-drop CSV upload with automatic deduplication preview, LiveRamp resolution rates, and file validation.

03 — Model Config

Driver Signals

Toggle feature weights per category, select algorithm type, and set propensity score threshold for targeting precision.

04 — Segments

Visualize & Tag

Decile analysis charts, segment comparison, demographics breakdown, and tag-and-export for campaign activation.

05 — Case Study

Architecture Docs

Architecture documentation, code reference, pipeline visualization, and technology stack overview.

Marketing Team Workflow

Upload Seed

Select therapy, upload CSV

Review Dedup

Verify resolution rates

Configure Model

Adjust weights, threshold

Run Pipeline

Agents execute autonomously

Review Segments

Analyze decile results

Tag & Export

Activate campaign lists

⚡ <30 minutes from seed upload to campaign activation

09 Deployment

Azure Cloud Deployment Ensures Enterprise-Grade Scalability

Layer	Technology	Specs
Presentation Tier	React 17, Nginx	0.5 CPU, 1 GB RAM; 1–3 replicas, HTTP auto-scaling
Application Tier	Python 3.11, FastAPI	1.0 CPU, 2 GB RAM; 1–5 replicas, Gunicorn workers
ML Engine	scikit-learn, XGBoost, SMOTE	6 agentic ML modules; PU Learning framework
Data & Integration	Snowflake, LiveRamp	Clean Room, AbiliTec; HIPAA-compliant joins
Infrastructure	Container Apps, ACR, Docker	Auto-scale, Blob Storage; automated build & deploy

1–5×

Backend auto-scaling

99.9%

Azure Container Apps SLA

HIPAA Compliant

All patient data processed within Snowflake Clean Rooms with Business Associate Agreement

RampID-Only Processing

No raw PII exported; identity resolved to pseudonymous RampIDs throughout

Full Audit Logging

Every query and agent action logged with user attribution via Azure Monitor

Encrypted at Rest & Transit

AES-256 encryption for storage, TLS 1.3 for all network traffic

10 Business Impact

3.2× Campaign Targeting Improvement with 97% Time Reduction

3.2×

Targeting Accuracy Improvement vs. Traditional Methods

97%

Reduction in Time-to-Segment (weeks → hours)

68%

Reduction in Cost Per Qualified Patient Reached

Metric	Before HealthTarget AI	After HealthTarget AI	Improvement
Time to Identify Segment	6–8 weeks (manual)	<4 hours	97% reduction
Campaign Targeting Accuracy	28% response rate	89% response rate	3.2× improvement
Duplicate Records in Campaigns	12–15% waste	<1% waste	93% reduction
Therapy Areas Served (Parallel)	1 area (manual rotation)	5 areas (parallel)	5× throughput
Data Scientist Hours per Campaign	160 hours	4 hours (monitoring)	97.5% reduction
Cost per Qualified Patient Reached	$4.20	$1.35	68% reduction
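Several of the improvement figures follow directly from the before and after columns. The snippet below recomputes three of them from the table values; the helper name is illustrative.

```python
def pct_reduction(before: float, after: float) -> float:
    """Percentage reduction from `before` to `after`."""
    return (before - after) / before * 100


cost_reduction = pct_reduction(4.20, 1.35)  # cost per qualified patient ($)
hours_reduction = pct_reduction(160, 4)     # data scientist hours per campaign
response_lift = 89 / 28                     # response rate, after / before

print(f"cost reduction:  {cost_reduction:.1f}%")
print(f"hours reduction: {hours_reduction:.1f}%")
print(f"response lift:   {response_lift:.2f}x")
```

These round to the 68%, 97.5%, and 3.2× figures reported in the table.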

11 Continuous Learning

Feedback Loop Drives Compounding Returns Over Time

The Feedback Agent closes the loop by ingesting campaign performance data and autonomously enriching the seed database. Each cycle improves the model, which improves targeting, which generates better campaign data — a true virtuous cycle.

Feedback Agent Cycle

Ingest Campaign Results

Engagement data, conversion events, and appointment scheduling signals are collected from campaign activation platforms.

▼

Identify High-Value Patients

Patients with engagement score >0.7 or confirmed conversions are flagged as high-value for seed enrichment.

▼

Enrich Seed Database

High-value patients are deduplicated and appended to the therapy-specific seed database, expanding the positive class.

▼

Trigger Autonomous Retraining

When seed growth exceeds 5% threshold, the Orchestrator Agent automatically initiates a full retraining cycle.
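The cycle above reduces to a small amount of set logic. In this sketch the 0.7 engagement threshold and the 5% growth trigger come from the text; the record fields, function names, and the choice to measure growth against the seed size at the start of the cycle are assumptions.

```python
ENGAGEMENT_THRESHOLD = 0.7  # from the text
RETRAIN_GROWTH = 0.05       # from the text


def high_value_ids(results):
    """Patients with engagement above threshold or a confirmed conversion."""
    return {
        r["patient_id"]
        for r in results
        if r["engagement"] > ENGAGEMENT_THRESHOLD or r["converted"]
    }


def enrich_seed(seed_ids, results):
    """Return (enriched seed, growth fraction, retrain flag)."""
    enriched = seed_ids | high_value_ids(results)  # set union deduplicates
    growth = (len(enriched) - len(seed_ids)) / len(seed_ids)
    return enriched, growth, growth > RETRAIN_GROWTH


if __name__ == "__main__":
    seed = set(range(100))
    campaign = [
        {"patient_id": 100 + i, "engagement": 0.9, "converted": False}
        for i in range(6)
    ]
    _, growth, retrain = enrich_seed(seed, campaign)
    print(f"seed growth {growth:.0%}, retrain: {retrain}")
```

Patients already in the seed do not count toward growth, so repeated engagement from known patients does not trigger retraining on its own.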

Quarterly Improvement Trajectory

Quarter	Status	AUC Improvement	Seed Growth	Campaign ROI
Q1 2026	Baseline	n/a	n/a	2.8×
Q2 2026	Active	+1.4%	+8.2%	3.1×
Q3 2026	Projected	+2.1%	+12.5%	3.5×

The platform becomes more valuable with every campaign executed — each cycle enriches the seed, improves model accuracy, and increases targeting ROI through autonomous feedback integration.

12 Roadmap

Implementation Roadmap and Next Steps

Phase	Timeline	Deliverables	Status
Phase 1: Foundation	Wk 1–4	Data integration, Snowflake Clean Room setup, Deduplication Agent	✓ Complete
Phase 2: Core ML	Wk 5–8	Feature Engineering Agent, Modeling Agent, Asthma pilot	✓ Complete
Phase 3: Scale	Wk 9–12	Scoring Agent, 5 therapy areas, full population scoring	✓ Complete
Phase 4: UI & UX	Wk 13–16	React frontend, marketing team onboarding	✓ Complete
Phase 5: Deployment	Wk 17–18	Azure Container Apps, CI/CD pipeline, monitoring	✓ Complete
Phase 6: Feedback Loop	Wk 19–20	Feedback Agent, campaign integration, quarterly retraining	● Active
Phase 7: Expansion	Wk 21+	Additional therapy areas, real-time scoring, advanced analytics	◦ Planned

Planned Expansion Initiatives

Additional Therapy Areas

Expand to Neurology, Dermatology, and Immunology in Q3 2026

Real-time Scoring

Event-triggered campaigns with streaming propensity score updates

Patient Journey Mapping

Treatment pathway prediction and engagement sequencing

Platform Integrations

Connect to DSPs, CRM systems, and email activation platforms

$45M

Recommendation: Expand to 3 additional therapeutic areas in Q3 2026 to capture additional addressable revenue. The platform is production-ready and delivering measurable ROI with compounding returns through the Feedback Agent. Additional addressable revenue projected at $45M.