An autonomous multi-agent AI platform that identifies statistically similar "look-alike" patients across 312 million US records, delivering a 3.2× improvement in targeting accuracy and a 97% reduction in time-to-segment.
The US healthcare marketing industry faces a fundamental transformation. Mass marketing approaches yield diminishing returns as patients expect personalized, relevant engagement. The core challenge: given a small verified patient database (~12K patients), identify statistically similar "look-alike" patients from ~312 million US records while navigating duplicate records, fragmented data sources, and HIPAA compliance requirements. Our solution addresses significant data duplication issues and leverages advanced Look-alike Modeling (LAM), which enables the firm to prioritize patients based on statistical signals and pre-set clinical criteria. The final architecture integrates with industry-standard tools such as LiveRamp and provides an intuitive interface for campaign managers to feed seed lists, select driver signals, and visualize identified segments.
The firm possesses verified patient lists for each therapeutic area representing less than 0.1% of the total addressable population. For asthma, look-alikes must be identified among 312 million US records starting from just 12,450 seed patients. This extreme class imbalance renders traditional supervised learning ineffective without specialized techniques.
The broader US population database is assembled from multiple third-party providers — claims data, pharmacy benefit managers, EHRs, and consumer data. Initial analysis revealed approximately 6% of records were duplicates or near-duplicates, with patients appearing under different identifiers across sources.
Healthcare marketing is not a one-time exercise. Patient populations evolve as new diagnoses are made, treatments change, and demographics shift. The firm required a system that could continuously identify and re-prioritize patients based on evolving statistical signals — not a static model producing a single output.
| Therapy Area | US Prevalence | Key Products | Seed Size | Look-alikes Identified | Revenue Impact |
|---|---|---|---|---|---|
| Asthma | 25.0M | Nebulizers, Inhalers, Air Purifiers | 12,450 | 847K | $18.2M |
| Diabetes (Type 2) | 37.3M | Sugar-free Drinks, Glucose Monitors, Insulin Pens | 18,200 | 2,010K | $37.3M |
| COPD | 15.7M | Nebulizers, Oxygen Concentrators, Peak Flow Meters | 8,300 | 527K | $12.1M |
| Cardiovascular | 82.6M | BP Monitors, Heart Supplements, Cholesterol Kits | 24,600 | 1,800K | $19.6M |
| Oncology | 18.1M | Nutritional Supplements, Comfort Items, Support Kits | 9,800 | 672K | $8.4M |
| Total | 178.7M (total addressable) | | 73,350 | 5,856K | $83.5M |
Traditional ML pipelines follow rigid, sequential execution. The Agentic AI paradigm decomposes the pipeline into six autonomous agents — each with defined goals, perception capabilities, reasoning logic, and action spaces — enabling continuous learning and self-correction without manual intervention.
Coordinates pipeline execution via DAG-based task scheduling. Monitors agent health, handles failures with exponential backoff retry, and provides real-time status updates to the marketing team dashboard via WebSocket. Maintains a unique pipeline ID for every execution for full audit and reproducibility.
Upstream errors detected and re-processed automatically without manual intervention
Feature weights shift per therapy area without manual tuning
Campaign results feed back into model improvement each cycle
No data science intervention required for routine operations
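As a rough illustration of the Orchestrator Agent's DAG-based scheduling and exponential-backoff retry, the Python sketch below orders a few agents with the standard-library `graphlib` module and retries a failing step with doubling delays. The agent names, the linear dependency chain, the `run_with_backoff` and `run_pipeline` helpers, and the retry limits are illustrative assumptions, not the platform's actual implementation.

```python
import time
import uuid
from graphlib import TopologicalSorter


def run_with_backoff(task, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call task(); on failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return task()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: surface the failure
            sleep(base_delay * 2 ** attempt)


def run_pipeline(dag, tasks, sleep=time.sleep):
    """Run every task in dependency order under one unique pipeline ID."""
    pipeline_id = str(uuid.uuid4())  # handle for audit and reproducibility
    order = list(TopologicalSorter(dag).static_order())
    results = {name: run_with_backoff(tasks[name], sleep=sleep) for name in order}
    return pipeline_id, order, results


# Hypothetical dependency graph: each agent maps to the agents it waits on.
DAG = {
    "dedup": set(),
    "features": {"dedup"},
    "modeling": {"features"},
    "scoring": {"modeling"},
}

attempts = {"count": 0}


def flaky_features():
    """Simulated agent that fails twice before succeeding."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("upstream data not ready")
    return "features ready"


TASKS = {
    "dedup": lambda: "dedup done",
    "features": flaky_features,
    "modeling": lambda: "model trained",
    "scoring": lambda: "scores written",
}

delays = []  # record backoff delays instead of actually sleeping
pipeline_id, order, results = run_pipeline(DAG, TASKS, sleep=delays.append)
```

Injecting `sleep` keeps the example fast to run; a real scheduler would presumably also cap the delay and add jitter.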
The Deduplication Agent implements a three-stage identity resolution pipeline addressing the critical challenge of duplicate records across third-party data sources. Each stage handles what the previous cannot.
Stage 1 (Deterministic): exact match on SHA-256 hashes of PII fields (name + date of birth + ZIP3). Fastest and most precise resolution layer. No false positives.
Stage 2 (Probabilistic): Jaro-Winkler similarity with blocking on ZIP3. Threshold: 0.85. Catches spelling variations and data entry errors across sources.
Stage 3 (LiveRamp): identity graph resolution for remaining unmatched records. Resolves cross-device and cross-source identities via RampID for complete coverage.
| Therapy Area | Stage 1 (Deterministic) | Stage 2 (Probabilistic) | Stage 3 (LiveRamp) | Total Dedup Rate |
|---|---|---|---|---|
| Asthma | 65.2% | 24.8% | 10.0% | 94.3% |
| Diabetes | 62.1% | 26.3% | 11.6% | 93.8% |
| COPD | 67.4% | 23.1% | 9.5% | 95.2% |
| Cardiovascular | 63.8% | 25.4% | 10.8% | 94.6% |
| Oncology | 60.5% | 27.2% | 12.3% | 93.1% |
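The first two deduplication stages can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the record layout (`id`, `name`, `dob`, `zip`), the lowercase/strip normalization, the helper names, and the choice to compare full names only within a ZIP3 block are all hypothetical. Stage 3 (LiveRamp RampID resolution) depends on an external service and is not shown.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def deterministic_key(name, dob, zip_code):
    """Stage 1: SHA-256 over normalized name + date of birth + ZIP3."""
    normalized = f"{name.strip().lower()}|{dob}|{zip_code[:3]}"
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1, used2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    seq1 = [c for c, u in zip(s1, used1) if u]
    seq2 = [c for c, u in zip(s2, used2) if u]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted by a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def probabilistic_matches(records, threshold=0.85):
    """Stage 2: compare names only within the same ZIP3 block."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec["zip"][:3]].append(rec)
    pairs = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            if jaro_winkler(a["name"].lower(), b["name"].lower()) >= threshold:
                pairs.append((a["id"], b["id"]))
    return pairs


# Hypothetical records for illustration only.
records = [
    {"id": "a1", "name": "Jonathan Smith", "dob": "1980-02-14", "zip": "60614"},
    {"id": "b7", "name": "Jonathon Smith", "dob": "1980-02-14", "zip": "60657"},
    {"id": "c3", "name": "Maria Lopez", "dob": "1975-09-30", "zip": "60601"},
    {"id": "d9", "name": "Jonathan Smith", "dob": "1980-02-14", "zip": "10001"},
]
pairs = probabilistic_matches(records)
```

Blocking keeps the pairwise comparison tractable: only records sharing a ZIP3 prefix are compared, so `d9` is never scored against `a1` even though the names are identical.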
| Area | Primary ICD-10 | Key Drugs |
|---|---|---|
| Asthma | J45.20, J45.30, J45.40 | Albuterol, Fluticasone, Montelukast, Budesonide |
| Diabetes | E11.9, E11.65, E10.9 | Metformin, Insulin Glargine, Empagliflozin |
| COPD | J44.0, J44.1, J44.9 | Tiotropium, Umeclidinium, Roflumilast |
| Cardio | I10, I25.10, I50.9 | Atorvastatin, Lisinopril, Amlodipine |
| Oncology | C50.911, C34.90, C61 | Pembrolizumab, Trastuzumab, Tamoxifen |
The Positive-Unlabeled learning framework combines Logistic Regression with XGBoost in a weighted ensemble, autonomously optimizing weights per therapy area. SMOTE oversampling addresses the extreme class imbalance inherent to this problem.
| Therapy Area | AUC-ROC | Precision @70% | Recall @70% | F1 @70% | LR : XGB |
|---|---|---|---|---|---|
| Asthma | 0.89 | 0.82 | 0.85 | | 0.35 : 0.65 |
| Diabetes | 0.86 | 0.79 | 0.82 | | 0.40 : 0.60 |
| COPD | 0.84 | 0.77 | 0.80 | | 0.38 : 0.62 |
| Cardiovascular | 0.87 | 0.81 | 0.84 | | 0.42 : 0.58 |
| Oncology | 0.82 | 0.75 | 0.78 | | 0.35 : 0.65 |
| Average | 0.86 | 0.79 | 0.82 | | n/a |
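To make the LR : XGB weighting concrete, here is a small sketch of blending two models' propensity scores and picking the blend weight by grid search on a validation AUC. It is only an assumption about how per-therapy-area weights might be selected: the helper names, the grid, and the toy scores are hypothetical, and in practice the two score lists would come from fitted Logistic Regression and XGBoost models rather than hard-coded values.

```python
def auc_roc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def blend(lr_scores, xgb_scores, w_lr):
    """Weighted ensemble: w_lr on Logistic Regression, 1 - w_lr on XGBoost."""
    return [w_lr * a + (1 - w_lr) * b for a, b in zip(lr_scores, xgb_scores)]


def best_lr_weight(labels, lr_scores, xgb_scores, grid=None):
    """Return the LR weight on the grid that maximizes validation AUC."""
    grid = grid or [i / 20 for i in range(21)]  # 0.00, 0.05, ..., 1.00
    return max(grid, key=lambda w: auc_roc(labels, blend(lr_scores, xgb_scores, w)))


# Toy validation data (1 = seed patient, 0 = unlabeled), illustration only.
labels = [1, 1, 1, 0, 0, 0]
lr_scores = [0.9, 0.4, 0.8, 0.5, 0.2, 0.1]
xgb_scores = [0.6, 0.9, 0.3, 0.2, 0.7, 0.1]

w_lr = best_lr_weight(labels, lr_scores, xgb_scores)
ensemble_scores = blend(lr_scores, xgb_scores, w_lr)
```

Because the grid includes 0 and 1, the selected blend can never score worse on the validation set than either model alone.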
| Decile | Score Range | Patients | Recommended Action |
|---|---|---|---|
| D1 — Top 10% | 0.90 – 1.00 | 504,000 | Immediate Activation |
| D2 | 0.80 – 0.89 | 756,000 | High-Priority Targeting |
| D3 | 0.70 – 0.79 | 890,000 | Standard Campaign |
| D4 – D7 | 0.40 – 0.69 | 1,800,000 | Nurture / Awareness |
| D8 – D10 | 0.00 – 0.39 | 1,090,000 | Exclude from Targeting |
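The score-to-action mapping in the table can be expressed as a simple threshold lookup. The sketch below assumes each band is closed at its lower bound (so a score of 0.895 falls in the 0.80 band); the table does not specify how scores between the printed ranges are handled, and the names `ACTION_TIERS` and `recommended_action` are hypothetical.

```python
# (lower bound, action), checked from the highest band down.
ACTION_TIERS = [
    (0.90, "Immediate Activation"),
    (0.80, "High-Priority Targeting"),
    (0.70, "Standard Campaign"),
    (0.40, "Nurture / Awareness"),
    (0.00, "Exclude from Targeting"),
]


def recommended_action(score):
    """Map a propensity score in [0, 1] to its recommended action."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"propensity score out of range: {score}")
    for lower_bound, action in ACTION_TIERS:
        if score >= lower_bound:
            return action
```

For example, `recommended_action(0.85)` returns `"High-Priority Targeting"`.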
Designed around a "Pharma Command" aesthetic with therapeutic area color-coding, the interface enables any marketing team member to operate the entire system independently across five core screens.
All 5 therapeutic areas with segment growth trends, feature signal radar, and agent status indicators.
Drag-and-drop CSV upload with automatic deduplication preview, LiveRamp resolution rates, and file validation.
Toggle feature weights per category, select algorithm type, and set propensity score threshold for targeting precision.
Decile analysis charts, segment comparison, demographics breakdown, and tag-and-export for campaign activation.
Architecture documentation, code reference, pipeline visualization, and technology stack overview.
| Layer | Technology | Specs |
|---|---|---|
| Presentation Tier | React 17, Nginx | 0.5 CPU, 1 GB RAM; 1–3 replicas, HTTP auto-scaling |
| Application Tier | Python 3.11, FastAPI | 1.0 CPU, 2 GB RAM; 1–5 replicas, Gunicorn workers |
| ML Engine | scikit-learn, XGBoost, SMOTE | 6 agentic ML modules; PU Learning framework |
| Data & Integration | Snowflake, LiveRamp | Clean Room, AbiliTec; HIPAA-compliant joins |
| Infrastructure | Container Apps, ACR, Docker | Auto-scale, Blob Storage; automated build & deploy |
All patient data is processed within Snowflake Clean Rooms under a Business Associate Agreement
No raw PII exported; identity resolved to pseudonymous RampIDs throughout
Every query and agent action logged with user attribution via Azure Monitor
AES-256 encryption for storage, TLS 1.3 for all network traffic
| Metric | Before HealthTarget AI | After HealthTarget AI | Improvement |
|---|---|---|---|
| Time to Identify Segment | 6–8 weeks (manual) | <4 hours | 97% reduction |
| Campaign Targeting Accuracy | 28% response rate | 89% response rate | 3.2× improvement |
| Duplicate Records in Campaigns | 12–15% waste | <1% waste | 93% reduction |
| Therapy Areas Served (Parallel) | 1 area (manual rotation) | 5 areas (parallel) | 5× throughput |
| Data Scientist Hours per Campaign | 160 hours | 4 hours (monitoring) | 97.5% reduction |
| Cost per Qualified Patient Reached | $4.20 | $1.35 | 68% reduction |
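The Improvement column follows from the before and after values by simple arithmetic. The short sketch below reproduces three of the rows; the helper names are hypothetical, and rows given as ranges or bounds (such as 6–8 weeks or <1%) are left out because the result depends on which endpoint is used.

```python
def pct_reduction(before, after):
    """Relative reduction from before to after, in percent."""
    return (before - after) / before * 100


def fold_improvement(before, after):
    """How many times larger the after value is than the before value."""
    return after / before


hours_saved = pct_reduction(160, 4)        # data scientist hours per campaign
cost_saved = pct_reduction(4.20, 1.35)     # cost per qualified patient reached
response_gain = fold_improvement(28, 89)   # response rate, 28% to 89%

print(f"{hours_saved:.1f}% / {cost_saved:.0f}% / {response_gain:.1f}x")
```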
The Feedback Agent closes the loop by ingesting campaign performance data and autonomously enriching the seed database. Each cycle improves the model, which improves targeting, which generates better campaign data — a true virtuous cycle.
Engagement data, conversion events, and appointment scheduling signals are collected from campaign activation platforms.
Patients with engagement score >0.7 or confirmed conversions are flagged as high-value for seed enrichment.
High-value patients are deduplicated and appended to the therapy-specific seed database, expanding the positive class.
When seed growth exceeds the 5% threshold, the Orchestrator Agent automatically initiates a full retraining cycle.
The platform becomes more valuable with every campaign executed — each cycle enriches the seed, improves model accuracy, and increases targeting ROI through autonomous feedback integration.
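The feedback steps above (flag high-value patients, append them to the seed, retrain once growth passes 5%) can be sketched as follows. The field names (`patient_id`, `engagement_score`, `converted`), the helper names, and the use of set union on patient IDs as a stand-in for full deduplication are illustrative assumptions.

```python
def is_high_value(patient, min_engagement=0.7):
    """High-value: engagement score above 0.7 or a confirmed conversion."""
    return patient["engagement_score"] > min_engagement or patient["converted"]


def enrich_seed(seed_ids, campaign_results):
    """Append high-value patients to the seed; set union drops repeats."""
    new_ids = {p["patient_id"] for p in campaign_results if is_high_value(p)}
    return seed_ids | new_ids


def should_retrain(size_at_last_training, current_size, threshold=0.05):
    """Trigger retraining when seed growth exceeds the threshold."""
    growth = (current_size - size_at_last_training) / size_at_last_training
    return growth > threshold


# Hypothetical therapy-area seed of 100 patients and one campaign's results.
seed = {f"p{i}" for i in range(100)}
campaign_results = (
    [{"patient_id": "p0", "engagement_score": 0.9, "converted": True}]  # already in seed
    + [{"patient_id": f"n{i}", "engagement_score": 0.8, "converted": False}
       for i in range(1, 7)]                                            # six new, engaged
    + [{"patient_id": "n7", "engagement_score": 0.7, "converted": False},  # not above 0.7
       {"patient_id": "n8", "engagement_score": 0.2, "converted": True}]   # converted
)

enriched = enrich_seed(seed, campaign_results)
retrain = should_retrain(len(seed), len(enriched))
```

Here the seed grows from 100 to 107 patients (7%), which is above the 5% threshold, so `retrain` is `True`.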
| Phase | Timeline | Deliverables | Status |
|---|---|---|---|
| Phase 1: Foundation | Wk 1–4 | Data integration, Snowflake Clean Room setup, Deduplication Agent | ✓ Complete |
| Phase 2: Core ML | Wk 5–8 | Feature Engineering Agent, Modeling Agent, Asthma pilot | ✓ Complete |
| Phase 3: Scale | Wk 9–12 | Scoring Agent, 5 therapy areas, full population scoring | ✓ Complete |
| Phase 4: UI & UX | Wk 13–16 | React frontend, marketing team onboarding | ✓ Complete |
| Phase 5: Deployment | Wk 17–18 | Azure Container Apps, CI/CD pipeline, monitoring | ✓ Complete |
| Phase 6: Feedback Loop | Wk 19–20 | Feedback Agent, campaign integration, quarterly retraining | ● Active |
| Phase 7: Expansion | Wk 21+ | Additional therapy areas, real-time scoring, advanced analytics | ◦ Planned |
Expand to Neurology, Dermatology, and Immunology in Q3 2026
Event-triggered campaigns with streaming propensity score updates
Treatment pathway prediction and engagement sequencing
Connect to DSPs, CRM systems, and email activation platforms
Recommendation: Expand to three additional therapeutic areas (Neurology, Dermatology, and Immunology) in Q3 2026 to capture a projected $45M in additional addressable revenue. The platform is production-ready and delivering measurable ROI, with compounding returns through the Feedback Agent.