A realistic, safe, and scalable end-to-end roadmap for building an offline AI medical decision-support system on IoT/edge devices.
⚠ CAUTION This system is decision support only — it must never be marketed, labeled, or used as a diagnostic or treatment device without full regulatory clearance (FDA 510(k)/De Novo, CE MDR Class IIa+).
⚠ TIP Recommended starting point: NVIDIA Jetson Orin Nano — best price/performance ratio for running quantized LLMs + ML classifiers simultaneously.
1.2 High-Level Architecture Diagram
[Diagram 1 — see original .md file for interactive Mermaid diagram]
1.3 Data Flow (Request Lifecycle)
[Diagram 2 — see original .md file for interactive Mermaid diagram]
1.4 Software Stack
| Layer | Technology | Purpose |
| --- | --- | --- |
| OS | Ubuntu 22.04 LTS (ARM64) / JetPack 6 | Stable base with long-term support |
| Runtime | Python 3.10+, ONNX Runtime, llama.cpp | Model inference |
| LLM Server | llama.cpp server / Ollama | Quantized LLM serving |
| Vector DB | FAISS / Hnswlib | Local embedding retrieval |
| Relational DB | SQLite / DuckDB | Structured medical knowledge |
| UI | Flask/FastAPI + HTMX or Qt/PyQt5 | Lightweight local web or native UI |
| TTS/STT | Whisper.cpp (STT), Piper (TTS) | Voice I/O |
| Logging | SQLite audit log + syslog | Compliance & traceability |
2. Model Strategy
2.1 Hybrid Architecture (Recommended)
⚠ IMPORTANT No single model handles everything well. Use a hybrid approach: purpose-built ML classifiers for structured prediction + a small LLM for reasoning, explanation, and natural language interaction.
[Diagram 3 — see original .md file for interactive Mermaid diagram]
| | |
| --- | --- |
| **Output** | Top-5 probable conditions with calibrated probabilities |
| **Size** | < 10 MB (fits easily on any edge device) |
| **Inference** | < 5 ms on Raspberry Pi |
| **Alternative** | Scikit-learn Random Forest for simpler deployments |
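The "calibrated probabilities" output above can be sketched as temperature-scaled softmax over the classifier's raw scores. Everything here is illustrative: the `top5_conditions` helper, the example scores, and the temperature value (which should be fit on a held-out validation set, not hard-coded).

```python
import math

def top5_conditions(raw_scores: dict[str, float],
                    temperature: float = 1.5) -> list[tuple[str, float]]:
    """Convert raw classifier logits into a ranked top-5 list of
    (condition, probability). Temperature > 1 softens overconfident
    scores; in practice the value is fit on held-out validation data."""
    exps = {c: math.exp(s / temperature) for c, s in raw_scores.items()}
    total = sum(exps.values())
    probs = {c: e / total for c, e in exps.items()}
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:5]

# Hypothetical raw scores from the symptom classifier
scores = {"influenza": 3.1, "common cold": 2.4, "covid-19": 2.2,
          "pneumonia": 0.9, "bronchitis": 0.5, "sinusitis": 0.1}
for condition, p in top5_conditions(scores):
    print(f"{condition}: {p:.1%}")
```

Proper calibration (Platt scaling, isotonic regression) on validation data should replace the fixed temperature before any clinical evaluation.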
2.3 LLM Selection
| Model | Parameters | Quantized Size | Min RAM | Tokens/sec (Orin Nano) | Use Case |
| --- | --- | --- | --- | --- | --- |
| **Phi-3 Mini** | 3.8B | ~2.2 GB (Q4_K_M) | 4 GB | ~15–20 | Best quality/size ratio |
| **TinyLlama 1.1B** | 1.1B | ~700 MB (Q4) | 2 GB | ~30–40 | Fastest, RPi-compatible |
| **Mistral 7B** | 7B | ~4.5 GB (Q4_K_M) | 8 GB | ~8–12 | Highest quality (Orin only) |
| **Gemma 2 2B** | 2B | ~1.5 GB (Q4) | 3 GB | ~20–25 | Good multilingual |
| **Qwen2.5 3B** | 3B | ~2 GB (Q4) | 4 GB | ~15–18 | Strong reasoning |
⚠ TIP Recommended: Start with Phi-3 Mini (Q4_K_M) on Jetson Orin Nano — best balance of medical reasoning quality and inference speed. Fall back to TinyLlama for RPi-only deployments.
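The table above can drive automatic model selection at install time. A minimal sketch, assuming the quality ordering shown (an illustrative ranking, not a benchmark result) and roughly 1 GB of RAM headroom reserved for the OS and other services:

```python
# (model, min_ram_gb) taken from the table above, listed best-quality first.
# The ordering is an illustrative assumption, not a measured ranking.
MODELS = [
    ("Mistral 7B", 8),
    ("Phi-3 Mini", 4),
    ("Qwen2.5 3B", 4),
    ("Gemma 2 2B", 3),
    ("TinyLlama 1.1B", 2),
]

def pick_model(device_ram_gb: int, headroom_gb: int = 1) -> str:
    """Choose the highest-quality model whose minimum RAM requirement,
    plus headroom for the OS and services, fits on the device."""
    for name, min_ram in MODELS:  # already sorted best-quality first
        if min_ram + headroom_gb <= device_ram_gb:
            return name
    raise ValueError("Device RAM too small for any supported model")

print(pick_model(8))  # 8 GB Jetson Orin Nano -> Phi-3 Mini
print(pick_model(4))  # 4 GB Raspberry Pi 4 -> Gemma 2 2B
```

With the default 1 GB headroom, an 8 GB Orin Nano lands on Phi-3 Mini rather than Mistral 7B, matching the tip above.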
2.4 Supporting Models
| Task | Model | Size | Notes |
| --- | --- | --- | --- |
| Speech-to-Text | Whisper-small / Whisper-tiny | 150 MB / 75 MB | Via whisper.cpp; runs on CPU |
| Text-to-Speech | Piper TTS | ~20–50 MB per voice | ONNX-based, very fast |
| Text Embeddings | all-MiniLM-L6-v2 | ~80 MB | Vector search for RAG retrieval |
3. Dataset Requirements
3.1 Public Medical Datasets
Symptom–Disease Mapping
| Dataset | Type | Records | Source | License |
| --- | --- | --- | --- | --- |
| **Columbia Disease–Symptom KB** | Tabular | ~150 diseases, 400+ symptoms | Columbia Univ. | Research |
| **Symptom–Disease Dataset (Kaggle)** | Tabular | ~5K records, 130+ diseases | Kaggle community | CC0 / Open |
| **DDXPlus** | Tabular + Text | ~1.3M synthetic patients, 49 diseases | Mila / McGill | CC-BY |
| **MedQuAD** | Q&A text | ~47K Q&A pairs | NIH / NLM | Public domain |
| **PubMedQA** | Q&A text | ~1K expert, 211K+ artificial | PubMed | MIT |
Vitals & Clinical
| Dataset | Type | Records | Source |
| --- | --- | --- | --- |
| **MIMIC-IV** | EHR (structured) | ~430K admissions | PhysioNet (credentialed) |
| **eICU** | ICU vitals | ~200K stays | PhysioNet (credentialed) |
| **Heart Disease UCI** | Vitals + outcomes | ~920 records | UCI ML Repository |
| **Diabetes 130-Hospitals** | Clinical | ~100K records | UCI ML Repository |
Medical Knowledge for RAG
| Source | Type | Use |
| --- | --- | --- |
| **WHO ICD-11** | Disease classification | Standardized disease coding |
| **SNOMED CT** | Clinical terminology | Symptom/condition ontology |
| **UpToDate / BMJ Best Practice** | Clinical guidelines | RAG knowledge base (licensing required) |
| **OpenMedData / WikiDoc** | Articles | Open medical reference |
| **BNF / WHO Essential Medicines** | Drug reference | Medication information |
3.2 Data Preparation Pipeline
[Diagram 4 — see original .md file for interactive Mermaid diagram]
3.3 Data Cleaning Guidelines
- **De-identification** — Strip all PHI (names, dates, MRNs) per the HIPAA Safe Harbor method
- **Standardize terminology** — Map free-text symptoms to SNOMED CT or ICD-11 codes
- **Deduplication** — Remove duplicate patient records within and across datasets
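A minimal sketch of regex-based de-identification covering a few Safe Harbor identifier classes (dates, phone numbers, MRN-style IDs). A production pipeline would add NER-based name removal and the remaining identifier classes; the patterns and example note below are illustrative only.

```python
import re

# Each pattern maps one identifier class to a replacement token.
# This is a sketch: real Safe Harbor compliance covers 18 identifier
# classes and requires NER for names, addresses, etc.
PATTERNS = [
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"), "[DATE]"),
    (re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"), "[PHONE]"),
    (re.compile(r"\bMRN[:#]?\s*\d+\b", re.IGNORECASE), "[MRN]"),
]

def deidentify(text: str) -> str:
    """Replace matched identifiers with class tokens, in order."""
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    return text

note = "Seen 03/14/2024, MRN: 483920, callback 555-867-5309."
print(deidentify(note))
# -> "Seen [DATE], [MRN], callback [PHONE]."
```

Replacing identifiers with class tokens (rather than deleting them) preserves sentence structure for downstream terminology mapping.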
3.4 Bias Mitigation
⚠ WARNING Medical datasets are historically biased by demographics. Failing to address this creates unsafe predictions for underrepresented populations.
| Strategy | Implementation |
| --- | --- |
| **Demographic audit** | Measure performance across age, sex, ethnicity subgroups |
| **Stratified sampling** | Ensure proportional representation in train/test splits |
| **Oversampling** | SMOTE or ADASYN for underrepresented disease groups |
| **Fairness constraints** | Equalized odds or demographic parity during training |
| **Documentation** | Datasheets for Datasets (Gebru et al.) for every dataset used |
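The demographic-audit strategy above can be sketched as a per-subgroup accuracy check over held-out predictions. The record shape, field names, and the 10-point gap threshold are all hypothetical:

```python
from collections import defaultdict

def subgroup_accuracy(records: list[dict]) -> dict[str, float]:
    """Compute per-subgroup accuracy from prediction records shaped
    {'group': ..., 'pred': ..., 'label': ...} (illustrative fields)."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["group"]] += 1
        correct[r["group"]] += int(r["pred"] == r["label"])
    return {g: correct[g] / total[g] for g in total}

# Toy held-out predictions grouped by an age band
records = [
    {"group": "18-40", "pred": "flu", "label": "flu"},
    {"group": "18-40", "pred": "flu", "label": "cold"},
    {"group": "65+", "pred": "flu", "label": "flu"},
    {"group": "65+", "pred": "flu", "label": "flu"},
]
acc = subgroup_accuracy(records)
# Flag deployment if any subgroup trails the best by more than 10 points
worst_gap = max(acc.values()) - min(acc.values())
print(acc, f"gap={worst_gap:.2f}")
```

The same loop extends to sensitivity/specificity per subgroup, which matter more than raw accuracy for rare conditions.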
4. Training Approach
4.1 Decision Framework
[Diagram 5 — see original .md file for interactive Mermaid diagram]
| Approach | When to Use | Pros | Cons |
| --- | --- | --- | --- |
| **RAG** | Default starting point | No retraining needed; updatable knowledge; traceable citations | Requires good embeddings + retrieval pipeline |
| **Fine-tuning** | Need domain-specific reasoning patterns | Better domain understanding; smaller model can punch above weight | Expensive; risk of hallucination; hard to update |
| **Hybrid RAG + Light Fine-tune** | Production systems | Best of both worlds | More complex pipeline |
⚠ TIP Start with RAG. Fine-tune only if RAG retrieval quality is insufficient after optimization. Fine-tuning on medical data carries significant hallucination risk if not done carefully.
4.4 Fine-Tuning (If Needed)
```text
# LoRA / QLoRA Fine-Tuning Pipeline
1. Base model: Phi-3 Mini or Gemma 2 2B
2. Dataset: MedQuAD + curated clinical Q&A (≥10K examples)
3. Method: QLoRA (4-bit quantized LoRA)
   - LoRA rank: 16–64
   - Learning rate: 2e-4
   - Epochs: 3–5
   - Use PEFT + bitsandbytes
4. Hardware: single GPU (RTX 3090/4090) or cloud A100 for training
5. Export: merge LoRA weights → GGUF quantization → deploy via llama.cpp
```
4.5 Model Compression
| Technique | Savings | Quality Impact | Tools |
| --- | --- | --- | --- |
| **Post-Training Quantization (PTQ)** | 4× size reduction (FP16 → INT4) | Minimal (< 2% accuracy drop) | llama.cpp, GPTQ, AWQ |
| **Quantization-Aware Training (QAT)** | 4× with less quality loss | Very low | TensorRT, AIMET |
| **Pruning (unstructured)** | 50–90% sparsity | Moderate (needs fine-tuning) | Neural Magic, SparseML |
| **Knowledge Distillation** | Train smaller student model | Variable | Hugging Face, custom |
| **ONNX Optimization** | 1.5–3× inference speedup | None | ONNX Runtime, graph optimizations |
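For quick capacity planning, a quantized model file is roughly parameters × bits-per-weight / 8 bytes. The ~5% overhead factor below is a rough allowance for quantization scales and metadata, not a measured figure:

```python
def quantized_size_gb(params_billion: float, bits_per_weight: float,
                      overhead: float = 1.05) -> float:
    """Back-of-envelope model file size: parameters x bits / 8, with a
    small overhead allowance for scales and metadata (rough heuristic)."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8 * overhead
    return bytes_total / 1e9

# Phi-3 Mini (3.8B) at ~4.5 effective bits/weight (Q4_K_M mixes
# 4-bit and higher-precision tensors, so the average exceeds 4 bits)
print(f"{quantized_size_gb(3.8, 4.5):.1f} GB")  # -> 2.2 GB
```

This reproduces the ~2.2 GB Q4_K_M figure from the LLM selection table in 2.3; runtime RAM is higher once the KV cache and activations are added.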
5. Offline Knowledge System
5.1 Architecture
[Diagram 6 — see original .md file for interactive Mermaid diagram]
5.2 Knowledge Base Content
| Category | Content | Format | Size Estimate |
| --- | --- | --- | --- |
| Disease profiles | ~500–1000 conditions with symptoms, risk factors, epidemiology | | |

Total knowledge base: ~400 MB — easily fits on any edge device.
5.3 RAG Implementation
```python
# Pseudo-implementation of the local RAG pipeline
from sentence_transformers import SentenceTransformer
import faiss

# 1. Offline indexing (done once during setup)
embedder = SentenceTransformer("all-MiniLM-L6-v2")
chunks = load_medical_chunks_from_sqlite("knowledge.db")  # app-specific loader
# Normalize embeddings so inner product equals cosine similarity
vectors = embedder.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(384)  # all-MiniLM-L6-v2 produces 384-dim vectors
index.add(vectors)
faiss.write_index(index, "medical_index.faiss")

# 2. Runtime retrieval
def retrieve(query: str, top_k: int = 5) -> list[str]:
    q_vec = embedder.encode([query], normalize_embeddings=True)
    scores, indices = index.search(q_vec, top_k)
    return [chunks[i] for i in indices[0]]

# 3. LLM prompting with retrieved context
def generate_response(query: str) -> str:
    context = retrieve(query)
    prompt = f"""You are a medical decision-support assistant.
Based ONLY on the following medical references, answer the query.
Always state your confidence level and cite the source.
NEVER provide a diagnosis — only suggest possible conditions.

References:
{chr(10).join(context)}

Query: {query}
Response:"""
    return llm.generate(prompt)
```
5.4 Explainability Layer
| Component | Implementation | Purpose |
| --- | --- | --- |
| **Feature Attribution** | SHAP values for ML classifier | "Fever and cough contributed most to this prediction" |
| **Source Citation** | RAG chunk IDs → original guideline | "Based on WHO Malaria Treatment Guidelines (2023)" |
| **Confidence Score** | Calibrated probability from classifier + LLM self-assessment | "78% confidence (moderate)" |
| **Reasoning Chain** | LLM chain-of-thought prompting | Step-by-step reasoning visible to clinician |
| **Differential Summary** | Top-3 conditions with distinguishing features | "Consider X, Y, Z — differentiated by..." |
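The confidence-score row ("78% confidence (moderate)") implies a mapping from calibrated probability to a verbal band. A sketch with illustrative thresholds: the 30% floor matches the fail-safe rule in the 7.3 checklist, but the other cut-points are assumptions that must be set with clinical input.

```python
def confidence_band(p: float) -> str:
    """Map a calibrated probability to a verbal band. Thresholds are
    illustrative placeholders, not clinically validated values."""
    if p < 0.30:
        return "insufficient"  # triggers the fail-safe referral message
    if p < 0.60:
        return "low"
    if p < 0.85:
        return "moderate"
    return "high"

def format_confidence(p: float) -> str:
    """Render the confidence line shown in the UI."""
    return f"{p:.0%} confidence ({confidence_band(p)})"

print(format_confidence(0.78))  # -> "78% confidence (moderate)"
```

Showing the band alongside the raw percentage keeps the display glanceable while preserving the underlying number for the audit log.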
6. Edge Deployment
6.1 Deployment Pipeline
[Diagram 7 — see original .md file for interactive Mermaid diagram]
6.2 Step-by-Step Deployment
Step 1: Prepare Models
```bash
# Quantize LLM to GGUF (4-bit). Note: newer llama.cpp releases rename
# these tools to convert_hf_to_gguf.py and llama-quantize.
python llama.cpp/convert.py phi-3-mini/ --outfile phi3-mini-f16.gguf
./llama.cpp/quantize phi3-mini-f16.gguf phi3-mini-q4_k_m.gguf Q4_K_M
```

```python
# Export the ML classifier to ONNX
import xgboost
import onnxmltools
from onnxmltools.convert.common.data_types import FloatTensorType

model = xgboost.Booster(model_file="disease_classifier.json")
# initial_types declares the input tensor: dynamic batch size x feature
# count (replace 400 with the width of your symptom feature vector)
onnx_model = onnxmltools.convert_xgboost(
    model, initial_types=[("input", FloatTensorType([None, 400]))]
)
onnxmltools.utils.save_model(onnx_model, "disease_classifier.onnx")
```
Step 2: Set Up Device
```bash
# Jetson Orin Nano setup
sudo apt update && sudo apt install -y python3-pip cmake
pip3 install onnxruntime-gpu faiss-cpu flask piper-tts

# Build llama.cpp with CUDA
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && mkdir build && cd build
cmake .. -DGGML_CUDA=ON && cmake --build . -j$(nproc)
```
- ✅ Does not replace clinical judgment — HCP acts as learned intermediary
- ✅ "Decision support only" — never provides diagnosis or treatment orders
⚠ WARNING If any criterion is not met, the system becomes a Software as a Medical Device (SaMD) and requires FDA premarket review.
7.3 Mandatory Safety Features
```text
┌─────────────────────────────────────────────────────────┐
│ SAFETY IMPLEMENTATION CHECKLIST                         │
├─────────────────────────────────────────────────────────┤
│ ☑ Every output includes confidence score (0–100%)       │
│ ☑ Every output includes standard disclaimer             │
│ ☑ Red-flag conditions trigger URGENT referral notice    │
│ ☑ System never uses words "diagnose" or "prescribe"     │
│ ☑ All interactions logged with timestamps               │
│ ☑ Audit trail is tamper-evident (hash-chained)          │
│ ☑ Model version and knowledge base version logged       │
│ ☑ Fail-safe: if confidence < 30%, output "Insufficient  │
│   information — please consult a healthcare provider"   │
│ ☑ Emergency symptoms → immediate "SEEK EMERGENCY CARE"  │
└─────────────────────────────────────────────────────────┘
```
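The hash-chained audit-trail item can be sketched with SQLite and SHA-256: each row commits to the previous row's hash, so any edit or deletion is detectable when the chain is replayed. The schema and field names are illustrative.

```python
import hashlib
import json
import sqlite3
import time

db = sqlite3.connect(":memory:")  # use an on-disk file in production
db.execute("CREATE TABLE audit (id INTEGER PRIMARY KEY, ts REAL, "
           "entry TEXT, prev_hash TEXT, hash TEXT)")

def log_event(entry: dict) -> None:
    """Append an event, chaining its hash to the previous row."""
    row = db.execute("SELECT hash FROM audit ORDER BY id DESC LIMIT 1").fetchone()
    prev_hash = row[0] if row else "GENESIS"
    ts = time.time()
    payload = json.dumps(entry, sort_keys=True)
    digest = hashlib.sha256(f"{ts}|{payload}|{prev_hash}".encode()).hexdigest()
    db.execute("INSERT INTO audit (ts, entry, prev_hash, hash) VALUES (?,?,?,?)",
               (ts, payload, prev_hash, digest))

def verify_chain() -> bool:
    """Replay the log; any altered or removed row breaks verification."""
    prev = "GENESIS"
    for ts, payload, prev_hash, digest in db.execute(
            "SELECT ts, entry, prev_hash, hash FROM audit ORDER BY id"):
        expected = hashlib.sha256(f"{ts}|{payload}|{prev_hash}".encode()).hexdigest()
        if prev_hash != prev or digest != expected:
            return False
        prev = digest
    return True

log_event({"query": "fever, cough", "model": "phi3-mini-q4", "kb": "2024-06"})
log_event({"query": "chest pain", "model": "phi3-mini-q4", "kb": "2024-06"})
print(verify_chain())  # -> True
```

Periodically exporting the latest hash to write-once media (or printing it) anchors the chain against wholesale database replacement.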
7.4 Standard Disclaimer Template
```text
╔══════════════════════════════════════════════════════════╗
║  ⚠ IMPORTANT MEDICAL DISCLAIMER                          ║
║                                                          ║
║  This tool provides DECISION SUPPORT ONLY.               ║
║  It does NOT provide medical diagnoses or treatment.     ║
║                                                          ║
║  • Results are probabilistic suggestions, not diagnoses  ║
║  • Always consult a qualified healthcare professional    ║
║  • In case of emergency, seek immediate medical care     ║
║  • This system has not been evaluated by FDA/CE as a     ║
║    medical device                                        ║
║                                                          ║
║  Confidence: [XX]% | Model v[X.X] | KB v[YYYY-MM]        ║
╚══════════════════════════════════════════════════════════╝
```
7.5 Data Privacy (HIPAA-Aligned Principles)
| Principle | Implementation |
| --- | --- |
| **Data minimization** | Collect only clinically necessary inputs; no PII stored |
| **Local-only processing** | All data stays on device — no cloud, no telemetry |
| **Encryption at rest** | LUKS full-disk encryption on edge device |
| **Access control** | PIN/biometric auth for healthcare worker access |
| **Audit logging** | Every query/response logged with timestamp, user ID |
| **Data retention** | Configurable auto-purge (default: 30 days) |
| **Physical security** | Tamper-evident enclosure, Kensington lock |
8. UI/UX Design
8.1 Design Principles
Clinical simplicity — No visual clutter; every element serves a purpose
Glanceable results — Risk level visible in < 2 seconds
Accessible — Large fonts (≥16px), high contrast (WCAG AA), touch-friendly (48px targets)
Language-agnostic — Icon-heavy design, i18n-ready text
8.2 Screen Flow
[Diagram 8 — see original .md file for interactive Mermaid diagram]
⚠ NOTE This blueprint is a living document. Revisit each section as you progress through development phases. Start with Phase 1 (MVP) and validate before adding complexity.