Using a Large Language Model to generate synthetic patient data for improving patient-centered outcomes
Overview: Researchers who need high-quality anonymized data often find it difficult to access and use health and health-care data because of patient privacy safeguards and legal and intellectual property restrictions. LLMs can generate high-quality synthetic data in large volumes and at low cost while protecting patient privacy.
Benefits
- Low cost: Generate high volumes of high-quality synthetic patient data and patient health records.
- Private and secure: Minimize risk while enabling sharing between organizations and maintaining regulatory compliance.
- Improve model accuracy: Increase the size of your training data set without compromising privacy or violating compliance restrictions.
Approach
Step 1: Choose the best model for the task
The bookend platform has an optimized, instruction-tuned version of the RedPajama-INCITE-7B-Instruct model, which can follow instructions given in the prompt.
Step 2: Privacy & compliance watermarking
bookend runs models within a separate and isolated domain, eliminating the chance of unwanted exposure of model interactions and data.
All model interactions are watermarked in a permanent and independently verifiable ledger, which further supports compliance and auditing.
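The text does not describe how the ledger works, so the following is only a minimal sketch of one common way to make a log independently verifiable: a hash chain, where each entry commits to the one before it. The function names (`append_entry`, `verify`) and the record fields are hypothetical and are not bookend's API. Only hashes of the prompt and output are stored, so the ledger itself holds no patient data.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(prev_hash, record):
    """Hash the previous entry's hash together with this entry's record."""
    payload = json.dumps({"prev": prev_hash, "record": record}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(ledger, prompt, output):
    """Append a record of one model interaction (hypothetical record layout)."""
    prev_hash = ledger[-1]["hash"] if ledger else GENESIS
    record = {
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "output_sha256": hashlib.sha256(output.encode("utf-8")).hexdigest(),
    }
    entry = {"prev": prev_hash, "record": record, "hash": _entry_hash(prev_hash, record)}
    ledger.append(entry)
    return entry


def verify(ledger):
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in ledger:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True


ledger = []
append_entry(ledger, "Generate 5 synthetic records", "record batch A")
append_entry(ledger, "Generate 5 more synthetic records", "record batch B")
print(verify(ledger))
```

An auditor who holds a copy of the ledger can rerun `verify` without trusting whoever wrote the entries, which is the sense of "independently verifiable" assumed in this sketch.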
Step 3: In-prompt fine-tuning
bookend supports in-context learning and prompt optimization, making it easy to generate synthetic data from just a few examples.
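In-context learning with a few examples can be sketched as plain prompt assembly: an instruction, a handful of example records, and a request for more records in the same format. The helper `build_few_shot_prompt`, the prompt wording, and the record fields below are illustrative assumptions, not bookend's prompt format, and the example records are made up.

```python
import json


def build_few_shot_prompt(instruction, examples, n_records):
    """Combine an instruction, example records, and a request into one prompt."""
    lines = [instruction.strip(), ""]
    for i, example in enumerate(examples, start=1):
        lines.append(f"Example {i}:")
        lines.append(json.dumps(example, sort_keys=True))
        lines.append("")
    lines.append(
        f"Generate {n_records} new synthetic records in the same JSON format, one per line."
    )
    return "\n".join(lines)


# Entirely fictional example records; no real patient data.
EXAMPLES = [
    {"age": 54, "sex": "F", "diagnosis": "type 2 diabetes", "hba1c": 7.9},
    {"age": 67, "sex": "M", "diagnosis": "hypertension", "hba1c": 5.6},
]

prompt = build_few_shot_prompt(
    "You generate synthetic patient records. Never reproduce real individuals.",
    EXAMPLES,
    n_records=5,
)
print(prompt)
```

Serializing the examples with `json.dumps` keeps them in exactly the shape the model is asked to imitate, which tends to make the generated output easier to parse.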
Step 4: Integration
bookend provides simple and secure APIs for developers to integrate LLMs into enterprise applications. The API takes a prompt, examples, and instructions as input and generates new data as output in the instructed format, shape, and size (e.g., JSON or CSV).
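The API schema is not given here, so the sketch below only illustrates the described input and output: a request body carrying a prompt, examples, and instructions, and a parser that checks the returned JSON or CSV before it is used. The field names, `build_request`, and `parse_generated` are hypothetical, and no network call is made.

```python
import csv
import io
import json


def build_request(prompt, examples, instructions, output_format="json", n_records=10):
    """Build a hypothetical request body (field names are assumptions)."""
    if output_format not in ("json", "csv"):
        raise ValueError(f"unsupported output format: {output_format}")
    return {
        "prompt": prompt,
        "examples": examples,
        "instructions": instructions,
        "output": {"format": output_format, "records": n_records},
    }


def parse_generated(text, output_format, required_fields):
    """Parse generated text and keep only records with every required field filled."""
    if output_format == "json":
        records = json.loads(text)
    else:
        records = list(csv.DictReader(io.StringIO(text)))
    return [
        r for r in records
        if isinstance(r, dict)
        and all(r.get(field) not in (None, "") for field in required_fields)
    ]


request_body = build_request(
    prompt="Generate synthetic patient records.",
    examples=[{"age": 54, "sex": "F", "diagnosis": "type 2 diabetes"}],
    instructions="Return a JSON array of records.",
    output_format="json",
    n_records=2,
)

# Stand-in for a model response; the second record is incomplete and is dropped.
sample_json = '[{"age": 61, "sex": "F", "diagnosis": "asthma"}, {"age": 45, "sex": "M"}]'
valid = parse_generated(sample_json, "json", ["age", "sex", "diagnosis"])
print(len(valid))
```

Validating generated records before they enter a training set is a design choice of this sketch: model output can be malformed or incomplete, and filtering at the integration boundary keeps those records out of downstream data.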