Using a Large Language Model to generate synthetic patient data for improving patient-centered outcomes

Vivek Sriram
1 min read · Oct 3, 2023

Overview: Researchers who need high-quality anonymized data often struggle to access and use health and health-care data because of patient privacy safeguards and legal and intellectual property restrictions. LLMs can generate large volumes of high-quality synthetic data at low cost while protecting patient privacy.

Benefits

  • Low cost: Generate high volumes of high-quality synthetic patient data and patient health records.
  • Private and secure: Minimize risk while enabling sharing between organizations and maintaining regulatory compliance.
  • Improve model accuracy: Increase your training data set without compromising privacy or violating compliance restrictions.

Approach

Step 1: Choose the best model for the task

The bookend platform has an optimized, instruction-tuned version of the RedPajama-INCITE-7B-Instruct model, which can follow instructions given in the prompt.

Step 2: Privacy & compliance watermarking

bookend runs models within a separate, isolated domain, eliminating the chance of unwanted exposure of model interactions and data.

Watermark all model interactions in a permanent, independently verifiable ledger to support compliance and auditing.
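To make the idea of an independently verifiable record concrete, here is a minimal sketch of a hash-chained log in Python. It is not bookend's mechanism, which this post does not describe; the `append_entry` and `verify` names and the entry layout are assumptions made for illustration only.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder hash that precedes the first entry


def _digest(prompt, response, prev_hash):
    """Hash one interaction together with the hash of the previous entry."""
    body = {"prompt": prompt, "response": response, "prev_hash": prev_hash}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_entry(ledger, prompt, response):
    """Append a model interaction to the ledger (a plain list of dicts)."""
    prev_hash = ledger[-1]["hash"] if ledger else GENESIS_HASH
    ledger.append({
        "prompt": prompt,
        "response": response,
        "prev_hash": prev_hash,
        "hash": _digest(prompt, response, prev_hash),
    })
    return ledger


def verify(ledger):
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS_HASH
    for entry in ledger:
        expected = _digest(entry["prompt"], entry["response"], entry["prev_hash"])
        if entry["prev_hash"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True
```

Because each entry's hash covers the previous entry's hash, an auditor holding only the ledger can detect tampering without trusting whoever stored it.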

Step 3: In-prompt fine-tuning

bookend supports in-context learning and prompt optimizations, making it easy to generate synthetic data from just a few examples.
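As a rough illustration of in-context learning, the sketch below assembles a few-shot prompt from an instruction and a handful of example records. The function name `build_few_shot_prompt`, the prompt layout, and the record fields are hypothetical and are not taken from the bookend platform.

```python
import json


def build_few_shot_prompt(instruction, examples, n_records):
    """Combine an instruction and example records into one few-shot prompt."""
    lines = [instruction.strip(), ""]
    for i, example in enumerate(examples, start=1):
        lines.append(f"Example {i}:")
        # sort_keys keeps the field order stable across examples
        lines.append(json.dumps(example, sort_keys=True))
        lines.append("")
    lines.append(
        f"Generate {n_records} new records in the same JSON format, one per line:"
    )
    return "\n".join(lines)


# Hypothetical, fully synthetic example records
examples = [
    {"age": 54, "sex": "F", "diagnosis": "type 2 diabetes"},
    {"age": 61, "sex": "M", "diagnosis": "hypertension"},
]
prompt = build_few_shot_prompt(
    "You generate realistic but entirely fictional patient records.",
    examples,
    n_records=5,
)
```

The resulting string can be sent to any instruction-tuned model; the examples set the shape of the output, so no weights are updated.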

Step 4: Integration

bookend provides simple and secure APIs for developers to integrate LLMs into enterprise applications. The API takes a prompt, examples, and instructions as input and generates new data as output in the instructed shape and size (e.g., JSON or CSV).
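On the application side, it is usually worth checking that generated output really has the requested shape before using it. The sketch below assumes the model was asked for one JSON object per line; the field names in `REQUIRED_FIELDS` are a hypothetical schema, and no bookend API call is shown because this post does not document the API's endpoints or parameters.

```python
import csv
import io
import json

REQUIRED_FIELDS = ("age", "sex", "diagnosis")  # hypothetical schema


def parse_records(raw_output, required_fields=REQUIRED_FIELDS):
    """Keep only lines that are JSON objects containing every required field."""
    records = []
    for line in raw_output.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of failing the whole batch
        if isinstance(record, dict) and all(f in record for f in required_fields):
            records.append({f: record[f] for f in required_fields})
    return records


def to_csv(records, fields=REQUIRED_FIELDS):
    """Serialize validated records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(fields), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

Dropping malformed or incomplete lines keeps a single bad generation from contaminating the training set; a stricter pipeline could also validate value types and ranges.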
