Healthcare, NLP, Applied ML, Document AI

The patient voice doesn't live in the clinical record.

A globally recognized oncology research institute wanted to hear what patients say when no clinician is listening - we built the social-listening pipeline that surfaced it.

Clinical data is rich but bounded. It captures what was prescribed, what was tested, what was observed in clinic. It captures very little of what patients actually experience between appointments - the texture of side effects, the chronology of diagnosis, the language they use to describe what's happening to them. For an oncology research institute working on patient-centered care, that gap is consequential. Real-world treatment journeys, side-effect patterns, diagnostic sequences - these existed in patient-generated speech on public platforms, but in a form no clinical team could systematically read. The question: could we turn unstructured patient speech into research-grade evidence?

  1. Source the speech responsibly.

    We built a curated scraping layer over public Twitter content tied to breast cancer, treatments, diagnostic steps, and patient experiences. The corpus was scoped to a single language and geography to keep the research signal clean.

  2. Detect the entities that matter for oncology.

    Off-the-shelf NLP isn't tuned for oncology vocabulary. We built detection specifically for treatments and molecules, side effects, diagnostic steps, and cancer stage indicators - the entities that have to be extractable for the research to mean anything. The judgment call: precision over recall. A smaller corpus of well-extracted journeys beats a larger corpus of noisy ones.

  3. Reconstruct the journey, not just the data points.

    Detection is the input. The output that matters is the patient journey - diagnosis through treatment through side effects, sequenced over time per individual. We mapped detected treatments to known side-effect profiles and assembled journeys from the dispersed signal a single patient leaves across multiple posts.
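The detection and journey steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline that was built: the lexicons, the `KNOWN_PROFILES` table, the `Post` type, and every function name are hypothetical, and a production system would likely rely on curated clinical vocabularies and trained models rather than keyword lists. The sketch shows whole-word matching as one simple way to favor precision over recall, then groups detections per author in time order.

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime

# Hypothetical, tiny lexicons for illustration only.
LEXICON = {
    "treatment": ["tamoxifen", "chemotherapy", "radiotherapy", "mastectomy"],
    "side_effect": ["nausea", "fatigue", "hot flashes", "hair loss"],
    "diagnostic_step": ["mammogram", "biopsy", "ultrasound"],
    "stage": ["stage 1", "stage 2", "stage 3", "stage 4"],
}

# Illustrative placeholder profiles, not a clinical reference.
KNOWN_PROFILES = {
    "tamoxifen": {"hot flashes", "fatigue"},
    "chemotherapy": {"nausea", "fatigue", "hair loss"},
}

# Whole-word patterns only: "fatigued" will not match "fatigue".
# That loses some recall but keeps the extracted entities clean.
PATTERNS = {
    label: re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b", re.IGNORECASE
    )
    for label, terms in LEXICON.items()
}


@dataclass(frozen=True)
class Post:
    author: str
    posted_at: datetime
    text: str


def detect_entities(text):
    """Return (label, term) pairs found in text, in order of appearance."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), label, match.group(1).lower()))
    return [(label, term) for _, label, term in sorted(hits)]


def build_journeys(posts):
    """Group detections by author, ordered by the time each post was made."""
    journeys = defaultdict(list)
    for post in sorted(posts, key=lambda p: p.posted_at):
        day = post.posted_at.date().isoformat()
        for label, term in detect_entities(post.text):
            journeys[post.author].append((day, label, term))
    return dict(journeys)


def link_side_effects(journey):
    """Pair each side effect with earlier treatments whose profile lists it."""
    seen_treatments = []
    links = []
    for _, label, term in journey:
        if label == "treatment":
            seen_treatments.append(term)
        elif label == "side_effect":
            for treatment in seen_treatments:
                if term in KNOWN_PROFILES.get(treatment, set()):
                    links.append((treatment, term))
    return links
```

Calling `build_journeys` on a list of `Post` objects returns one time-ordered event list per author, even when the posts arrive out of order. Passing one of those lists to `link_side_effects` returns the treatment and side-effect pairs that agree with the assumed profiles.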

The research team gained access to real-time patient perspectives that don't appear in clinical records - the lived texture of treatment that statistics can't capture. The pipeline demonstrated that social media can serve as a research-grade data source for oncology, with a framework that extends to other platforms, other languages, and other diseases. Aggregated analytics on treatment pathways and side-effect frequency surfaced patterns the institute could now investigate clinically. The unlock: a research channel that complements clinical data and is built to extend.
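As a rough illustration of the aggregated analytics mentioned above, the sketch below counts side-effect mentions per patient and tallies ordered treatment sequences. The data shape (a dictionary mapping a patient key to a time-ordered list of `(label, term)` pairs) and both function names are assumptions made for this example, not a description of the delivered system.

```python
from collections import Counter


def side_effect_frequency(journeys):
    """Count how many patients mention each side effect at least once."""
    counts = Counter()
    for events in journeys.values():
        # A set ensures each patient contributes at most one count per term.
        counts.update({term for label, term in events if label == "side_effect"})
    return counts


def treatment_pathways(journeys):
    """Count ordered treatment sequences, collapsing consecutive repeats."""
    pathways = Counter()
    for events in journeys.values():
        path = []
        for label, term in events:
            if label == "treatment" and (not path or path[-1] != term):
                path.append(term)
        if path:
            pathways[" > ".join(path)] += 1
    return pathways
```

Counting per patient rather than per post keeps one prolific author from dominating the frequencies, and reporting only aggregates keeps individual journeys out of the output.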

In research workflows, the gap between clinical data and patient experience is a data engineering problem before it's a clinical one. Build the pipeline that captures the speech, and the research questions follow.

Working on a research domain where the most valuable data lives outside your institutional systems? We help research teams turn public, unstructured speech into structured evidence - responsibly and at scale.

