Evidence before adjectives

Clinical AI should earn trust in context.

Performance is more than a single headline metric. Caire’s evidence program is organized around technical validity, clinical utility, workflow impact, and ongoing real-world monitoring.

Evidence framework

Ask the whole question.

Technical validity

Does the model perform on appropriately curated and independent data with prespecified endpoints?

Clinical validity

Does performance hold across relevant sites, scanners, acquisition patterns, and patient subgroups?

Clinical utility

Does the output improve clinician orientation or decision workflow without introducing harmful automation bias?

Operational impact

Does deployment change meaningful pathway intervals, workload, escalation reliability, or disposition?

Human factors

Can intended users understand, access, acknowledge, and appropriately act on the output under pressure?

Lifecycle monitoring

Can shifts in data, workflow, delivery, performance, and user behavior be detected and governed?

Validation blueprint

From retrospective promise to prospective reality.

A credible evaluation separates algorithm performance from the performance of the deployed clinical system.

Define intended use

Specify population, modality, setting, user, output, clinical role, exclusions, and foreseeable misuse.

Lock the evaluation

Prespecify endpoints, thresholds, reference standard, adjudication, missing-data handling, and subgroup analysis.

Test independence

Evaluate across unseen institutions and relevant technical and demographic variation.

Study workflow

Measure what changes after deployment, including time intervals, alert burden, user behavior, and unintended consequences.

Monitor continuously

Review performance, delivery health, overrides, complaints, drift signals, and protocol changes under governance.
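
Delivery health is one of the easier items above to quantify. As a minimal sketch, assuming each notification is logged as a (sent, delivered) pair of timestamps in seconds, with None standing in for a failed delivery, the Python below reports a failure rate alongside median and 95th-percentile latency. The record layout and the function name summarize_latency are hypothetical and do not describe any Caire system.

```python
from statistics import median, quantiles

def summarize_latency(events):
    """Summarize notification delivery.

    events: list of (sent_s, delivered_s) tuples; delivered_s is None
    when the notification never arrived (hypothetical record layout).
    """
    if not events:
        raise ValueError("no events to summarize")
    # Latency is only defined for notifications that were delivered.
    latencies = [end - start for start, end in events if end is not None]
    failures = len(events) - len(latencies)
    if len(latencies) < 2:
        raise ValueError("need at least two delivered notifications")
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    cuts = quantiles(latencies, n=100, method="inclusive")
    return {
        "n": len(events),
        "failure_rate": failures / len(events),
        "median_s": median(latencies),
        "p95_s": cuts[94],
    }

# Hypothetical log: 100 deliveries taking 1..100 seconds, plus 25 failures.
log = [(0.0, float(t)) for t in range(1, 101)] + [(0.0, None)] * 25
print(summarize_latency(log))
```

Reporting failures separately keeps undelivered notifications from disappearing out of the latency figures, and the tail percentile shows delays that a mean or median would hide.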

Peer-reviewed research

Evidence for Caire ICH.

Three published retrospective studies evaluate Caire ICH across algorithm performance and AI-assisted physician interpretation. Results should be understood in the context of each study design and population.

Emergency physician reader study

Using an artificial intelligence software improves emergency medicine physician intracranial haemorrhage detection to radiologist levels

Five board-certified emergency physicians reviewed 532 non-contrast cranial CT scans before and after assistance from Caire ICH.

Read on PubMed →

+6.20% Mean emergency-physician accuracy increase with Caire ICH assistance; 95% CI for the difference 5.10%–7.29%, p=0.0092.

The authors concluded that prospective research with larger cohorts is needed to understand effects on ED logistics and patient outcomes.

External validation study

External Validation of an Artificial Intelligence Device for Intracranial Hemorrhage Detection

External retrospective validation used 510 non-contrast head CT scans: 402 with ICH and 108 without ICH.

Read on PubMed →

98.05% Accuracy
95% CI 96.44%–99.06%

97.52% Sensitivity
95% CI 95.50%–98.81%

100% Specificity
95% CI 96.67%–100%
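
A 100% point estimate still carries uncertainty, which is why the interval matters more than the headline. The Python sketch below computes a Wilson score interval for a proportion. It is an illustration only: the function name wilson_interval and the counts are hypothetical, and the interval method used in the publication is not stated on this page, so the output is not expected to reproduce the published bounds.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Two-sided Wilson score interval for a binomial proportion.

    z = 1.959964 corresponds to a nominal 95% interval.
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half), min(1.0, center + half)

# Hypothetical counts: every case correct, at three different sample sizes.
for n in (50, 100, 500):
    low, high = wilson_interval(n, n)
    print(f"{n}/{n} correct: 95% interval {low:.2%} to {high:.2%}")
```

With no errors observed, the lower bound rises as the sample grows but never reaches 100%.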

Radiologist reader study

Deep Learning System Boosts Radiologist Detection of Intracranial Hemorrhage

Three board-certified radiologists reviewed 532 non-contrast head CT scans in a retrospective multi-reader, multi-case study, before and after Caire ICH assistance.

Read the full article →

+6.15% Mean radiologist accuracy increase in absolute terms, from 87.70% to 93.85% (p=0.0095).

+4.60% sensitivity, +10.62% specificity, +5.71% agreement

Sensitivity, specificity, and inter-reader agreement increases were not statistically significant in this study.

These studies were retrospective and used enriched datasets. They do not by themselves establish prospective clinical outcomes or performance in every deployment population. Study disclosures and author affiliations are available in each publication.

Reporting discipline

Metrics need a denominator and a setting.

Measure	Why it matters	What must accompany it
Sensitivity / specificity	Characterizes finding-level discrimination	Confidence intervals, prevalence, reference standard, threshold, and cohort definition
Positive / negative predictive value	Reflects expected usefulness in the deployment population	Local prevalence and sampling design
Time to notification	Shows technical and delivery latency	Start/end timestamps, failures, and percentile distribution
Time to clinical action	Tests whether the full pathway changed	Workflow definition, comparator, adjustment, and clinical context
Subgroup performance	Surfaces uneven performance	Sample size, prespecified groups, uncertainty, and limitations
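
The predictive-value row can be made concrete with Bayes' rule. The Python sketch below holds sensitivity and specificity fixed and varies prevalence. The function name predictive_values and all input values are hypothetical and are not taken from any Caire study.

```python
def predictive_values(sensitivity: float, specificity: float, prevalence: float) -> tuple[float, float]:
    """Return (PPV, NPV) from sensitivity, specificity, and prevalence via Bayes' rule."""
    for name, value in (("sensitivity", sensitivity),
                        ("specificity", specificity),
                        ("prevalence", prevalence)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    # Expected fraction of the whole population falling in each cell.
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    # A predictive value is undefined when no results of that sign are expected.
    ppv = true_pos / (true_pos + false_pos) if (true_pos + false_pos) > 0 else float("nan")
    npv = true_neg / (true_neg + false_neg) if (true_neg + false_neg) > 0 else float("nan")
    return ppv, npv

# Hypothetical device: 95% sensitivity and 95% specificity.
# Compare an enriched test set (50% prevalence) with a low-prevalence setting (5%).
for prevalence in (0.50, 0.05):
    ppv, npv = predictive_values(0.95, 0.95, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.1%}, NPV {npv:.1%}")
```

With these hypothetical inputs, PPV falls from 95% at 50% prevalence to 50% at 5% prevalence, while NPV rises from 95% to about 99.7%. The same device can look very different once it leaves an enriched dataset.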

Research collaboration

Build evidence that survives scrutiny.

Discuss a study →