No More Mr. Nice Prompt: Three Eras of AI Clinical Data Extraction
February 2026
Three Approaches, Three Different Ways to Fail
If you’ve read my earlier post on the npj Digital Medicine paper, you know the pipeline we built for extracting structured data from kidney cancer pathology reports. In that post, I hinted that the actual Python code and prompting methods were already quite outdated by the time of publication (Spring 2025). As you may imagine, as of writing (February 2026), my personal favorite paper’s methods are now practically eligible for Medicare.
I thought in this post I’d quickly showcase how our extraction approach evolved across three major eras: from asking the model nicely to return JSON, to constrained decoding with JSON mode, to function calling with Pydantic validation. Each transition was driven both by hitting a wall with the previous approach and by the rapidly evolving field.
Act 1: “Ask Nicely for Reasoning & JSON”
The Approach
The original setup was straightforward: craft a detailed system prompt telling the model exactly what fields you want and what format to use, then call json.loads() on the response. To handle complex reports — multiple specimens (say a lung biopsy and a lymph node biopsy in the same report), multiple IHC tests (sometimes the same test but with different results depending on the specimen) — we broke extraction into a multi-step chain: first prompt to organize raw test results, second to group by specimen, third to structure by individual test result. The outputs of each step fed into the next. Below is a very simplified version of the last prompt in that chain. Note that it is based on a Jinja template, so the double braces indicate where you would inject text into this template at runtime from Python before sending it off to the LLM.
system:
You are a clinical data extraction assistant.
Your task is to process text from electronic medical records
into structured JSONs
user:
# Background
- An LLM has processed and organized relevant text from a pathology report
- You are being provided these segments based on specimens & IHC / FISH results
- Your role is to convert this segmented text into labeled,
standardized structured data
## Naming conventions
Standardize test names to those in the following list:
{{panel_test_names}}
Standardize test results to those in the following list:
{{panel_test_results}}
# Instructions
## Task 1, Reasoning
Review segmented text to standardize test names and results.
And consider the following...
## Task 2, standardize and organize test names and results
Return a JSON first with a summary of your reasoning,
then with standardized test names and results for each specimen.
Follow this template:
{
    "reasoning_summary": "summary of your reasoning",
    "specimen_X_TESTNAME": "RESULT",
    "specimen_Y_TESTNAME": "RESULT"
}
It is vital that the entirety of the returned text is a valid JSON.
# Segmented Text
Finally, here is the text you will be working with:
{{segmented_text}}
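The double-brace placeholders in the template above get filled at runtime. A minimal sketch of that rendering step with jinja2 (the template text here is abbreviated and the injected values are made up):

```python
from jinja2 import Template

# Abbreviated stand-in for the full prompt template shown above
PROMPT_TEMPLATE = Template(
    "Standardize test names to those in the following list:\n"
    "{{ panel_test_names }}\n"
    "Standardize test results to those in the following list:\n"
    "{{ panel_test_results }}"
)

# Inject the controlled vocabularies before sending the prompt to the LLM
prompt = PROMPT_TEMPLATE.render(
    panel_test_names="BAP1, CA-IX",
    panel_test_results="Positive, Negative",
)
print(prompt)
```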
And then at the end, you manually parse the chained outputs into your final data structure. Please note the very obnoxious key creation with the underscores; I didn’t want to do this. But what I found was that models of the day (early GPT-4) had tremendous issues with nested JSONs, so I had to rely on keys containing multiple indicators: the specimen name and the test name.
Where It Breaks
Malformed JSON. Every developer who’s done this knows the feeling of json.loads() raising a JSONDecodeError on a response that looks fine. Missing closing braces, trailing commas after the last item, unescaped quotes inside string values — these showed up constantly, with no pattern that made them predictable.
Conversational wrapping. Despite “Return only the JSON. No explanatory text.” being in the prompt, the model frequently added a preamble. “Based on the pathology report, here is the extracted data:” followed by the JSON. json.loads() does not appreciate this. I had to write a cleanup step to strip these out before parsing.
Vocabulary drift. This one matters more than it sounds. The prompt specified "Diffuse" as a valid result value. The model returned "Diffusely" — grammatically reasonable, programmatically wrong. Or "diffuse positive" instead of "Diffuse". Or "positive, diffusely". Every variant that wasn’t in your exact vocabulary required manual cleanup or a post-processing normalization step. And this isn’t academic: this was one of the most persistent failure modes documented in the npj paper. The root cause was that we were instructing the model to use controlled vocabulary rather than constraining it to do so.
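To make the wrapping and drift problems concrete, here is a hedged sketch of what that post-processing looked like in spirit. The helper names (`parse_model_json`, `normalize_result`) are illustrative, not from the paper's code:

```python
import json
import re

def parse_model_json(raw: str) -> dict:
    """Strip conversational wrapping, then parse the first JSON object found."""
    # Grab everything between the first '{' and the last '}'
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in model response")
    return json.loads(match.group(0))

def normalize_result(value: str, vocabulary: list[str]) -> str:
    """Map near-miss values like 'Diffusely' back onto the controlled vocabulary."""
    for term in vocabulary:
        if value.lower().startswith(term.lower()):
            return term
    return value  # leave unknown values for manual review

raw = ('Based on the pathology report, here is the extracted data: '
       '{"specimen_A_BAP1": "Diffusely"}')
data = parse_model_json(raw)
data = {k: normalize_result(v, ["Diffuse", "Positive", "Negative"])
        for k, v in data.items()}
print(data)  # {'specimen_A_BAP1': 'Diffuse'}
```

Every report type accumulated more of this glue code, which is exactly the "parsing tax" described below.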
The parsing tax. The multi-step chain solved the complexity problem — big reports broken into manageable extractions — but it meant writing substantial custom parsing code at the end to reassemble the pieces. Every new report type or schema change meant updating that parser.
What We Learned
Free-text prompting worked because, at the time, it was all we had. At scale, the combination of malformed output, vocabulary drift, and manual parsing means you’re spending engineering time on glue code rather than on improving extraction quality. The insight: instruction isn’t the same as constraint. You need the model’s output to be mechanically valid, not just instructed toward validity. No more asking nicely.
Act 2: Structured Output / JSON Mode
The Approach
The obvious next step: use constrained decoding to guarantee valid JSON. OpenAI’s response_format={"type": "json_object"} and vLLM’s guided decoding via grammar constraints both operate at the token level — the model can only produce tokens that form valid JSON matching your schema.
from enum import Enum
from typing import Optional

from pydantic import BaseModel, Field

class IHCResultPrimary(str, Enum):
    POSITIVE = "Positive"
    NEGATIVE = "Negative"
    OTHER = "Other"

class IHCResultModifier(str, Enum):
    DIFFUSE = "Diffuse"
    BOX_LIKE = "Box like"
    CUP_LIKE = "Cup like"

class IHCTestName(str, Enum):
    BAP1 = "BAP1"
    CA_IX = "CA-IX"
    OTHER = "Other"

class IHCTest(BaseModel):
    specimen: str = Field(..., description="Specimen used for test")
    test_name: IHCTestName = Field(..., description="Name of test")
    test_name_other: Optional[str] = Field(
        None, description="Name of test if not in options"
    )
    test_result: IHCResultPrimary = Field(..., description="Test Result")
    test_result_modifier: Optional[IHCResultModifier] = Field(
        None, description="Test result modifier if applicable"
    )
    test_result_other: Optional[str] = Field(
        None, description="Result if not in options"
    )

class IHCReport(BaseModel):
    reasoning: str = Field(..., description="Summary of reasoning")
    test_and_results: Optional[list[IHCTest]] = Field(
        None, description="List of tests and results"
    )

# Here we use the handy OpenAI parse method, which lets us pass the Pydantic
# model directly; for vLLM we would dump it to a JSON schema first
from openai import OpenAI

client = OpenAI()
response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    response_format=IHCReport,
)
parsed: IHCReport = response.choices[0].message.parsed
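The comment about dumping the model for vLLM deserves a quick sketch. Using a stand-in model (`MiniReport` is hypothetical, just to keep this self-contained), the Pydantic schema dump looks like this; the `guided_json` field shown in the comment is what vLLM's OpenAI-compatible server accepted at the time of writing, so check your version:

```python
from pydantic import BaseModel, Field

class MiniReport(BaseModel):
    """Stand-in for IHCReport, just to show the schema dump."""
    reasoning: str = Field(..., description="Summary of reasoning")

# Pydantic v2: dump the model to a plain JSON schema dict
schema = MiniReport.model_json_schema()

# With vLLM's OpenAI-compatible server you would then pass the schema
# through extra_body, roughly like:
# client.chat.completions.create(
#     model="my-local-model",  # hypothetical model name
#     messages=[{"role": "user", "content": prompt}],
#     extra_body={"guided_json": schema},
# )
print(schema["properties"]["reasoning"]["description"])
```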
The New Problems
It’s slower. Constrained decoding carries per-token overhead compared to unconstrained generation. For batch processing hundreds of reports, this adds up quickly.
Local models get stuck. When running open-source models on vLLM with grammar-based structured output, we hit cases where the model would enter a loop — continuously generating tokens but never completing. Reports would never finish processing. This wasn’t a fringe case.
The monolithic schema problem. With structured output, you have to write out the full schema upfront with every valid option enumerated. For complex pathology reports, that means one giant schema that every report gets forced through — whether the relevant fields appear in the report or not.
No built-in error recovery. When constrained decoding produces semantically wrong output (valid JSON, wrong values), there’s no natural mechanism for catching it and retrying with context about what went wrong. The model gave you valid JSON — as far as the framework is concerned, the job is done.
Model alignment is moving on. The post-training regimes for frontier models are increasingly oriented toward function calling and tool use, not schema-constrained generation. Using structured output started to feel like swimming against the current. Additionally, here we are still explicitly having the model fill out a reasoning field, but these days reasoning is built into models; we don’t need to ask for it anymore.
What We Learned
Structured output solves the syntax problem but not the semantics problem. Valid JSON can still be wrong. And the operational costs — latency, local model instability, monolithic schemas — made it harder to work with than the theoretical cleanliness suggested. The deeper issue is that you’re still fundamentally asking the model to fill out a form; you haven’t given it a more natural interface for expressing what it knows.
Act 3: Function Calling + Pydantic Validation
The Approach
Function calling inverts the dynamic: instead of telling the model “return this JSON shape,” you define tools the model can call, and the model decides when and how to call them. The model’s post-training makes this feel natural — it reasons about what information it has and selects the appropriate tool to record it.
Pydantic models define the function arguments. This is where the vocabulary drift problem gets solved properly: enums constrain the model to exact valid values, not as an instruction but as a type system. We don’t have to literally force the model to output our correct terms, we can just raise validation errors and recover.
import json
from enum import Enum
from typing import Optional

from pydantic import BaseModel, Field

class IHCResultPrimary(str, Enum):
    POSITIVE = "Positive"
    NEGATIVE = "Negative"
    OTHER = "Other"

class IHCResultModifier(str, Enum):
    DIFFUSE = "Diffuse"
    BOX_LIKE = "Box like"
    CUP_LIKE = "Cup like"

class IHCTestName(str, Enum):
    BAP1 = "BAP1"
    CA_IX = "CA-IX"
    OTHER = "Other"

class IHCTest(BaseModel):
    specimen: str = Field(..., description="Specimen used for test")
    test_name: IHCTestName = Field(..., description="Name of test")
    test_name_other: Optional[str] = Field(
        None, description="Name of test if not in options"
    )
    test_result: IHCResultPrimary = Field(..., description="Test Result")
    test_result_modifier: Optional[IHCResultModifier] = Field(
        None, description="Test result modifier if applicable"
    )
    test_result_other: Optional[str] = Field(
        None, description="Result if not in options"
    )

# --- tool implementation ---
def record_ihc_test(**kwargs) -> tuple[str, IHCTest]:
    ihc_test = IHCTest(**kwargs)  # raises ValidationError on bad arguments
    # Tool responses should be JSON-serializable strings for the model
    return json.dumps({"ok": True}), ihc_test

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "record_ihc_test",
            "description": "Record an IHC test result for a specimen",
            "parameters": IHCTest.model_json_schema(),  # pydantic v2
        },
    },
]
# Now a tool-calling loop; I'll keep this as simplified pseudo-code since
# the real thing is kinda long...
from pydantic import ValidationError

messages = [{"role": "system", "content": SYSTEM},
            {"role": "user", "content": prompt}]
tests, failures = [], 0
for _ in range(6):
    msg = client.chat.completions.create(
        model="gpt-5", messages=messages, tools=TOOLS
    ).choices[0].message
    messages.append(msg)
    if msg.tool_calls:
        for call in msg.tool_calls:
            try:
                out, t = record_ihc_test(**json.loads(call.function.arguments))
                tests.append(t)
            except ValidationError as e:
                # Feed the validation error back so the model can retry
                out = json.dumps({"ok": False, "error": str(e)})
                failures += 1
            messages.append({"role": "tool", "tool_call_id": call.id,
                             "name": call.function.name, "content": out})
        if failures >= 3:
            raise RuntimeError("too many validation failures")
        continue
    # No tool calls left: final message carries the report summary
    report = IHCReport.model_validate_json(msg.content)
    save_final_result(report, tests)
    break
The model calls record_ihc_test(specimen="A", test_name="...", ...) and Pydantic validates the arguments. "Diffusely" is simply not a valid IHCResultModifier: the model has to pick from the enum or the call fails, and we can just append the error message and retry!
The Modularity Win
The other big improvement: instead of one schema for the whole report, you define small tools that can be called once, many times, or not at all. A tool for recording specimen-level diagnoses. A separate tool for IHC results. A tool specifically for noting ambiguous findings. The model calls them in the order that makes sense for the report it’s actually looking at.
This ends up being a cleaner version of what the multi-step prompt chain in Act 1 was trying to accomplish — except the model orchestrates it rather than you hard-coding the sequence.
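A sketch of what that modular tool set might look like. The tool names and schemas here are illustrative (not from the paper's code); each small Pydantic model becomes one function the model can call zero or more times:

```python
from typing import Optional
from pydantic import BaseModel, Field

class SpecimenDiagnosis(BaseModel):
    """Hypothetical tool: record a specimen-level diagnosis."""
    specimen: str = Field(..., description="Specimen label, e.g. 'A'")
    diagnosis: str = Field(..., description="Diagnosis for this specimen")

class AmbiguousFinding(BaseModel):
    """Hypothetical tool: flag something for human review instead of guessing."""
    specimen: Optional[str] = Field(None, description="Specimen, if known")
    note: str = Field(..., description="What is ambiguous and why")

def as_tool(model: type[BaseModel], description: str) -> dict:
    """Wrap a Pydantic model as an OpenAI-style function tool definition."""
    return {
        "type": "function",
        "function": {
            "name": model.__name__,
            "description": description,
            "parameters": model.model_json_schema(),
        },
    }

TOOLS = [
    as_tool(SpecimenDiagnosis, "Record a specimen-level diagnosis"),
    as_tool(AmbiguousFinding, "Flag an ambiguous finding for human review"),
]
print([t["function"]["name"] for t in TOOLS])
```

The model then decides, per report, which of these to call and how often, rather than being forced through one monolithic schema.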
Validation as a Catch, Not a Guarantee
Pydantic validation catches structural and type-level errors automatically, and the retry loop handles those. But semantic errors, where the model genuinely interprets an ambiguous finding differently than a human reviewer would, still happen; no retry will fix them, and you need a human review flag.
Act 4: Real Function & In-Memory SQL as a State Layer
Wait, what? I thought there were only supposed to be three acts? Well, that doesn’t sound as good for a title. Here I show how I’m thinking about actually using function calling to perform more complex operations than just Pydantic validation.
Pydantic validates each function call in isolation. That’s fine for pathology reports, where each document is self-contained. But a lot of clinical extraction involves processing a series of notes across a patient timeline — clinic visits, lab results, treatment records — where what’s valid in note N depends on what was recorded from notes 1 through N-1.
The canonical example: you’re extracting lines of therapy from oncology notes in chronological order. When the model encounters a note documenting that a patient stopped pembrolizumab, it calls record_therapy_stop(...). But if note 1 was corrupted, or out of order, or the therapy name was spelled differently — there may be no corresponding start date in your records. Without a state layer, you’d write a stop date with no start date and have no idea until someone noticed the broken timeline downstream.
The fix is an in-memory SQLite database that gets built up as notes are processed, with referential integrity checks baked into the tool execution layer. The model doesn’t see the SQL — it just knows that calling a tool can fail with a constraint error, the same as any other validation failure, and gets told specifically why.
import sqlite3
from typing import List

def init_state_db() -> sqlite3.Connection:
    """
    In-memory state for ONE patient's extraction session.
    (You could swap ':memory:' for a real file if desired.)
    """
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS therapy_lines (
            patient_id TEXT NOT NULL,
            therapy_name TEXT NOT NULL,
            start_date TEXT,   -- ISO: YYYY-MM-DD
            stop_date TEXT,    -- ISO: YYYY-MM-DD
            stop_reason TEXT,
            PRIMARY KEY (patient_id, therapy_name)
        )
        """
    )
    conn.commit()
    return conn

def get_active_therapies(conn: sqlite3.Connection, patient_id: str) -> List[str]:
    rows = conn.execute(
        """
        SELECT therapy_name
        FROM therapy_lines
        WHERE patient_id = ?
          AND start_date IS NOT NULL
          AND stop_date IS NULL
        """,
        (patient_id,),
    ).fetchall()
    return [r["therapy_name"] for r in rows]

def record_therapy_start(
    conn: sqlite3.Connection,
    patient_id: str,
    therapy_name: str,
    start_date: str,
) -> None:
    """
    Insert a line of therapy, or fill in the start_date if the row already exists.
    """
    try:
        with conn:
            conn.execute(
                """
                INSERT INTO therapy_lines (patient_id, therapy_name, start_date)
                VALUES (?, ?, ?)
                """,
                (patient_id, therapy_name, start_date),
            )
    except sqlite3.IntegrityError:
        # Row already exists. Only set start_date if it wasn't set yet.
        with conn:
            conn.execute(
                """
                UPDATE therapy_lines
                SET start_date = ?
                WHERE patient_id = ?
                  AND therapy_name = ?
                  AND start_date IS NULL
                """,
                (start_date, patient_id, therapy_name),
            )

def record_therapy_stop(
    conn: sqlite3.Connection,
    patient_id: str,
    therapy_name: str,
    stop_date: str,
    stop_reason: str,
) -> None:
    """
    Enforces a simple temporal constraint:
    you can only stop a therapy that is currently active.
    """
    row = conn.execute(
        """
        SELECT start_date
        FROM therapy_lines
        WHERE patient_id = ?
          AND therapy_name = ?
          AND start_date IS NOT NULL
          AND stop_date IS NULL
        """,
        (patient_id, therapy_name),
    ).fetchone()
    if row is None:
        active = get_active_therapies(conn, patient_id)
        # Raise a "tool constraint error" you feed back to the model
        # the same way you'd feed back a Pydantic ValidationError.
        raise ValueError(
            "Cannot record stop: no active therapy start found. "
            f"patient_id={patient_id!r}, therapy_name={therapy_name!r}. "
            f"Active therapies: {active}"
        )
    with conn:
        conn.execute(
            """
            UPDATE therapy_lines
            SET stop_date = ?, stop_reason = ?
            WHERE patient_id = ?
              AND therapy_name = ?
            """,
            (stop_date, stop_reason, patient_id, therapy_name),
        )
The error message deliberately includes the list of active therapies the model can legally stop. A common failure mode is a name mismatch — note 1 says “pembrolizumab,” note 4 says “Keytruda” — and showing the model what names exist in state gives it the context to recognize the conflict and either reconcile the names or flag the ambiguity rather than guessing.
This constraint error flows back through the same retry loop as Pydantic errors:
def execute_tool_call(tool_name: str, tool_args: dict) -> str:
    try:
        if tool_name == "record_therapy_start":
            record_therapy_start(**tool_args)
            return "OK"
        elif tool_name == "record_therapy_stop":
            record_therapy_stop(**tool_args)
            return "OK"
        # ... other tools
    except (ValueError, ValidationError) as e:
        # Return error string — caller feeds this back to the model
        return f"CONSTRAINT_ERROR: {e}"
The broader pattern here is that the in-memory DB is doing what a stateless validator can’t: enforcing temporal consistency across a patient record. You could extend this to other constraints — a response assessment requires a prior imaging record, a dose reduction requires an active regimen, a second-line therapy start requires a first-line stop. Each constraint is a SQL check that runs before the function call succeeds, and any violation becomes structured feedback the model can act on.
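As a sketch of one such extension, here is a hypothetical `record_dose_reduction` tool following the same pattern as `record_therapy_stop` above: a dose reduction is only valid against an active regimen, and a violation raises the same kind of constraint error. The table setup is a simplified stand-in for the state DB above:

```python
import sqlite3

# Throwaway state DB mirroring the therapy_lines table, plus a dose log
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE therapy_lines (
        patient_id TEXT NOT NULL,
        therapy_name TEXT NOT NULL,
        start_date TEXT,
        stop_date TEXT,
        PRIMARY KEY (patient_id, therapy_name)
    )
    """
)
conn.execute(
    "CREATE TABLE dose_changes ("
    "patient_id TEXT, therapy_name TEXT, change_date TEXT, new_dose TEXT)"
)

def record_dose_reduction(conn, patient_id, therapy_name, change_date, new_dose):
    """Hypothetical tool: a dose reduction requires an active regimen."""
    row = conn.execute(
        """
        SELECT 1 FROM therapy_lines
        WHERE patient_id = ? AND therapy_name = ?
          AND start_date IS NOT NULL AND stop_date IS NULL
        """,
        (patient_id, therapy_name),
    ).fetchone()
    if row is None:
        raise ValueError(
            f"Cannot reduce dose: {therapy_name!r} is not an active regimen "
            f"for patient {patient_id!r}."
        )
    with conn:
        conn.execute(
            "INSERT INTO dose_changes VALUES (?, ?, ?, ?)",
            (patient_id, therapy_name, change_date, new_dose),
        )

# An active therapy can be dose-reduced; a name mismatch trips the constraint
conn.execute(
    "INSERT INTO therapy_lines VALUES ('p1', 'pembrolizumab', '2024-01-01', NULL)"
)
record_dose_reduction(conn, "p1", "pembrolizumab", "2024-03-01", "150 mg")
try:
    record_dose_reduction(conn, "p1", "Keytruda", "2024-03-01", "150 mg")
except ValueError as e:
    print("CONSTRAINT_ERROR:", e)
```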
Side-by-Side: Three Approaches
| Feature | Free-text JSON | Structured Output | Function Calling |
|---|---|---|---|
| Syntax validity | Sometimes | Always | Always |
| Vocabulary control | Prompt only | Schema | Enum |
| Missing fields | Random | Null inserted | Tool omitted |
| Error recovery | None | None | Retry loop |
| Modular extraction | No | No | Yes |
| Local model stability | Good | Poor | Good |
What Still Needs Work
Function calling with Pydantic validation is the right approach for production clinical extraction, but it doesn’t close the loop on everything.
Some semantic errors are invisible to the validator. Pydantic catches "Diffusely" — that’s a type error. It doesn’t catch the model confidently extracting the wrong diagnosis from a genuinely ambiguous report. Those errors only surface when a human reviews the output. Building good human-in-the-loop review workflows, with the model flagging its own uncertainty via a confidence field, is the part that requires as much design attention as the extraction itself.
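One way to sketch that confidence field, using the same Pydantic-enum pattern as the extraction tools. Everything here is hypothetical (the wrapper and threshold are illustrative, not from the paper's code):

```python
from enum import Enum
from pydantic import BaseModel, Field

class Confidence(str, Enum):
    HIGH = "High"
    MEDIUM = "Medium"
    LOW = "Low"

class ExtractedFinding(BaseModel):
    """Hypothetical wrapper: every extraction carries self-rated confidence."""
    value: str = Field(..., description="Extracted value")
    confidence: Confidence = Field(..., description="Model's self-rated confidence")
    rationale: str = Field(..., description="Why, quoting the source text")

def needs_review(finding: ExtractedFinding) -> bool:
    """Route anything below High confidence to the human review queue."""
    return finding.confidence is not Confidence.HIGH

f = ExtractedFinding(
    value="BAP1 loss",
    confidence="Low",
    rationale="Stain equivocal in block A2",
)
print(needs_review(f))  # True
```

The enum keeps the confidence values as controlled vocabulary too, so the review routing is a mechanical check rather than string matching.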
Schema design is still hard. Defining Pydantic models for complex nested reports — multiple specimens, conditional fields, IHC markers that only apply to certain specimen types — requires careful thought about what you actually want to capture and why. This is the same task specification problem from the npj paper, just expressed in Python types instead of prose instructions.
Tool design affects model behavior. How you decompose the extraction into tools matters. A single monolithic tool behaves differently than five small tools covering different parts of the report. Getting the decomposition right for your specific report type requires iteration — which is, again, mostly a problem about clarity of objectives rather than a technical problem. What I’ve found is that if there are too many tools and too many nested types, models can still get overwhelmed.
Related Work
Hein D, et al. Iterative refinement and goal articulation to optimize large language models for clinical information extraction. npj Digital Medicine 8, 301 (2025). Paper | Code