How to Solve Your Unstructured Data Problem

Unstructured data is information that’s difficult for software systems to store, search, or analyze.
It’s the kind of data you won’t find in a fixed, predefined format. It won’t fit neatly into a spreadsheet’s rows and columns. It might be a handwritten note in the margins of a form, an email chain, or even feedback on a social media platform.
None of that is difficult for humans to parse, because we understand meaning from context, but software systems have traditionally required explicit structure to interpret information. It’s why issues like data portability (transferring data between different systems) remain so challenging to solve.
Imagine a doctor writes a note that says, “Start metoprolol 25 mg bid.” A human expert would understand that means “begin taking 25 milligrams of metoprolol twice a day.”
But to an electronic health record (EHR) system that accepts specific key-value pairs, it’s meaningless. A human would need to “clean” and format the data. Perhaps something like:
{
  "medication": "metoprolol",
  "dosage": "25mg",
  "timing": {
    "repeat": {
      "frequency": 2,
      "period": 1,
      "periodUnit": "d"
    }
  },
  ...
}
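Producing that structure by hand doesn't scale, and naive automation is brittle. As a minimal, hypothetical sketch (not DocKnow's pipeline), here's roughly what a rule-based parser for that one sig format could look like in Python:

import re

# Times-per-day lookup for common Latin dosing abbreviations.
FREQUENCIES = {"qd": 1, "bid": 2, "tid": 3, "qid": 4}

def parse_sig(note):
    # One narrow pattern: "<drug> <dose> mg <frequency>".
    match = re.search(r"(\w+)\s+(\d+)\s*mg\s+(qd|bid|tid|qid)\b", note, re.IGNORECASE)
    if not match:
        raise ValueError(f"Unrecognized sig format: {note!r}")
    drug, dose, freq = match.groups()
    return {
        "medication": drug.lower(),
        "dosage": f"{dose}mg",
        "timing": {"repeat": {"frequency": FREQUENCIES[freq.lower()], "period": 1, "periodUnit": "d"}},
    }

print(parse_sig("Start metoprolol 25 mg bid"))

A pattern this narrow breaks the moment a clinician writes "twice daily" instead of "bid", reorders the phrase, or makes a typo, which is exactly why context-aware processing is needed.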
This becomes a problem at scale because unstructured data is an estimated 80% of all healthcare data.
And only about one in five organizations claims to be using unstructured data systematically.
Unsurprisingly, those organizations are more successful than the rest. According to Deloitte, executives who call unstructured data a key source of valuable insights are "24% more likely" to beat their operational goals.
Everyone else is… just not using 80% of their total data volume.

Of course, there’s a reason so few organizations take advantage of all that data. It’s hard to take advantage of. Humans usually don’t have time to sift through it all, and most software isn’t smart enough to figure it out on its own.
Making Unstructured Data Actually Usable
Simply digitizing, say, a clinical note as part of a downstream workflow doesn’t make it useful. Digitization isn’t equivalent to “usability.” Usability requires structure, context, normalization, and implementation.
That requires real-time natural language processing (NLP), patient data stored in integrated management systems, and complex data models to make sense of it all.
And, of course, a human pilot: someone to spot-check the output and handle the edge cases.
So, over the last five years, we’ve been collaborating with lab managers, retail pharmacists, and healthcare administrators to build a solution to do all of that: DocKnow.
DocKnow is an intelligent document processing platform built specifically for healthcare and life sciences. DocKnow’s AI has been trained on thousands of healthcare documents, includes out-of-the-box integrations with ICD and CPT taxonomies, and is designed with No-Data Architecture to ensure no PII, PHI, or any other sensitive data is accessible to Onymos.
Below, you’ll see how DocKnow uses techniques like semantic reasoning and named entity recognition (NER) to find a missing form field value (“Last Name”) in unstructured data:
What DocKnow Did Step-By-Step:
Step 1: Named entity recognition (NER)
The NER model scans the document and identifies potential person names. It recognizes "Madeline Johnson" as a person entity based on capitalization (both words are capitalized), its position in context, and common name patterns learned during model training.
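For illustration, here's what generic NER looks like with an off-the-shelf library; spaCy's small English model stands in here for DocKnow's healthcare-trained models:

import spacy  # assumes: pip install spacy && python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")
doc = nlp("New patient, Madeline Johnson, referred for consultation.")

# The pretrained model tags text spans it believes are person names.
for ent in doc.ents:
    if ent.label_ == "PERSON":
        print(ent.text, ent.start_char, ent.end_char)
# Expected: Madeline Johnson 13 29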
Step 2: Document structure understanding
The model recognizes the document type based on its header, the presence of subheaders like “REASON FOR VISIT,” and its overall formatting and language. It understands that, since this document is bundled with this particular form, “PATIENT: MADELINE” is likely a reference to the “First Name” field value “Madeline” under “PATIENT INFORMATION” in the primary document.
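As a toy sketch, assuming a simplified plain-text layout with the headers mentioned above (real scanned forms are far messier), a structure-aware field lookup might work like this:

import re

# Simplified stand-in for the scanned document.
document = """PATIENT INFORMATION
First Name: Madeline
Last Name:
DOB: 07/22/2010

REASON FOR VISIT
New patient, Madeline Johnson, referred for consultation."""

def extract_section(text, header):
    # A section runs from its ALL-CAPS header to the next blank line.
    match = re.search(rf"{re.escape(header)}\n(.*?)(?:\n\n|\Z)", text, re.DOTALL)
    return match.group(1) if match else ""

def extract_field(section, label):
    # Match "Label: value" on a single line; empty values return None.
    match = re.search(rf"{re.escape(label)}:[ \t]*(\S.*)", section)
    return match.group(1).strip() if match else None

info = extract_section(document, "PATIENT INFORMATION")
print(extract_field(info, "First Name"))  # Madeline
print(extract_field(info, "Last Name"))   # None -- the gap DocKnow must fill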
Step 3: Corroborating evidence
DocKnow then validates this logic by looking for corroboration. Later in the note, the full name "Madeline Johnson" appears again, in the same order, in the phrase "New patient, Madeline Johnson, referred for consultation…", confirming the parse. An MRN (medical record number) field is also present, indicating the note refers to a specific patient.
Finally, there’s no contradictory information suggesting a different name altogether (though, if there were, DocKnow would highlight both names for a human-in-the-loop reviewer).
Step 4: Confidence scoring
The system assigns a confidence score to this extraction based on:
- Multiple mentions of the name in a consistent format (HIGH confidence signal)
- Clear document structure with a labeled patient field (HIGH confidence signal)
- Common, recognizable name components (MEDIUM-HIGH confidence signal)
- No ambiguity or conflicting information (HIGH confidence signal)
We might estimate a resulting confidence score of 95%+ that “Johnson” is Madeline’s last name.
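For illustration only (the weights below are assumptions, not DocKnow's actual scoring), a weighted combination of those four signals might be computed like this:

# Assumed weights for the four signals above; each strength is in [0, 1].
SIGNAL_WEIGHTS = {
    "consistent_mentions": 0.30,
    "labeled_patient_field": 0.30,
    "recognizable_name": 0.15,
    "no_conflicts": 0.25,
}

def confidence(strengths):
    return sum(SIGNAL_WEIGHTS[name] * value for name, value in strengths.items())

score = confidence({
    "consistent_mentions": 1.0,    # HIGH
    "labeled_patient_field": 1.0,  # HIGH
    "recognizable_name": 0.8,      # MEDIUM-HIGH
    "no_conflicts": 1.0,           # HIGH
})
print(f"{score:.0%}")  # 97%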
Step 5: Field mapping (structuring the data)
DocKnow maps the extracted information to structured fields (JSON, by default):
{
  "firstName": "Madeline",
  "lastName": "Johnson",
  "dateOfBirthMmddyyyy": "07/22/2010",
  ...
}
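A minimal sketch of that mapping step, assuming the values extracted earlier and a hypothetical 0.95 acceptance threshold below which a record goes to human review:

import json

CONFIDENCE_THRESHOLD = 0.95  # assumed cutoff for automatic acceptance

def map_fields(extracted, score):
    if score < CONFIDENCE_THRESHOLD:
        raise ValueError("Below threshold: route to human-in-the-loop review")
    return json.dumps({
        "firstName": extracted["first_name"],
        "lastName": extracted["last_name"],
        "dateOfBirthMmddyyyy": extracted["dob"],
    }, indent=2)

print(map_fields({"first_name": "Madeline", "last_name": "Johnson", "dob": "07/22/2010"}, 0.97))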
To sum it up, DocKnow used contextual NER to identify the name, document structure understanding to know the name referred to the patient (not a doctor or family member), and confidence scoring to ensure accuracy.
It all worked together to extract “Johnson” from the unstructured text.
But this is just one use case, and my explanation still only scratches the surface (we didn't talk about conformal prediction or adaptive confidence scores).
If incomplete forms, disconnected systems, or plain low-quality data are costing your organization time, money, and insight, you’re not alone, but you don’t have to stay stuck. DocKnow solves the unstructured data problems that keep healthcare organizations from reaching their operational goals. Reach out, and let’s talk about what’s possible when your data actually works for you.