
How to Solve Your Unstructured Data Problem

A visualization showing unstructured data fields.

Unstructured data is information that’s difficult for software systems to store, search, or analyze.  

It’s the kind of data you won’t find in a fixed, predefined format. It won’t fit neatly into a spreadsheet’s rows and columns. It might be a handwritten note in the margins of a form, an email chain, or even feedback on a social media platform.

None of that is difficult for humans to parse, because we understand meaning from context, but software systems have traditionally required explicit structure to interpret information. It’s why issues like data portability (transferring data between different systems) remain so challenging to solve.

Imagine a doctor writes a note that says, “Start metoprolol 25 mg bid.” A human expert would understand that means “begin taking 25 milligrams of metoprolol twice a day.”

But to an electronic health record (EHR) system that accepts specific key-value pairs, it’s meaningless. A human would need to “clean” and format the data. Perhaps something like:

{
  "medication": "metoprolol",
  "dosage": "25mg",
  "timing": {
    "repeat": {
      "frequency": 2,
      "period": 1,
      "periodUnit": "d"
    }
  },
  ...
}
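To make that cleanup step concrete, here’s a minimal sketch of how a rule-based normalizer might handle this one sig pattern. The parse_sig helper, its regex, and the FREQUENCY_CODES table are all hypothetical and cover only “drug dose bid”-style orders; real clinical text is far messier, which is exactly why this approach rarely scales without NLP.

import json
import re

# Map common Latin sig abbreviations to a structured "repeat" object.
# (Illustrative only; real sigs have hundreds of variants.)
FREQUENCY_CODES = {
    "qd":  {"frequency": 1, "period": 1, "periodUnit": "d"},
    "bid": {"frequency": 2, "period": 1, "periodUnit": "d"},
    "tid": {"frequency": 3, "period": 1, "periodUnit": "d"},
    "qid": {"frequency": 4, "period": 1, "periodUnit": "d"},
}

SIG_PATTERN = re.compile(
    r"start\s+(?P<medication>[a-z]+)\s+(?P<dose>\d+)\s*mg\s+(?P<freq>qd|bid|tid|qid)",
    re.IGNORECASE,
)

def parse_sig(note):
    """Turn a simple free-text order into the key-value shape an EHR expects."""
    match = SIG_PATTERN.search(note)
    if match is None:
        return None  # A human (or a smarter model) has to handle this one.
    return {
        "medication": match.group("medication").lower(),
        "dosage": match.group("dose") + "mg",
        "timing": {"repeat": FREQUENCY_CODES[match.group("freq").lower()]},
    }

print(json.dumps(parse_sig("Start metoprolol 25 mg bid"), indent=2))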

This becomes a problem at scale because an estimated 80% of all healthcare data is unstructured.

And only about one in five organizations claims to be using unstructured data systematically. 

Unsurprisingly, those organizations are more successful than the rest. Executives who call unstructured data a key source of valuable insights “are 24% more likely” to beat their operational goals, according to Deloitte. 

Everyone else is… just not using 80% of their total data volume.

A visualization comparing structured and unstructured data.

Of course, there’s a reason so few organizations take advantage of all that data. It’s hard to take advantage of. Humans usually don’t have time to sift through it all, and most software isn’t smart enough to figure it out on its own.

Making Unstructured Data Actually Usable

Simply digitizing, say, a clinical note as part of a downstream workflow doesn’t make it useful. Digitization isn’t equivalent to “usability.” Usability requires structure, context, normalization, and implementation.

That requires real-time natural language processing (NLP), patient data stored in integrated management systems, and complex data models to make sense of it all.

And, of course, a human pilot: someone to spot-check the output and handle the edge cases.

So, over the last five years, we’ve been collaborating with lab managers, retail pharmacists, and healthcare administrators to build a solution to do all of that: DocKnow.

DocKnow is an intelligent document processing platform built specifically for healthcare and life sciences. DocKnow’s AI has been trained on thousands of healthcare documents, includes out-of-the-box integrations with ICD and CPT taxonomies, and is designed with No-Data Architecture to ensure no PII, PHI, or any other sensitive data is accessible to Onymos.

Below, you’ll see how DocKnow uses techniques like semantic reasoning and named entity recognition (NER) to find a missing form field value (“Last Name”) in unstructured data:

What DocKnow Did Step-By-Step:

Step 1: Named entity recognition (NER)

The NER model scans the document and identifies potential person names. It recognizes “Madeline Johnson” as a person entity based on capitalization (both words are capitalized), its position in context, and common name patterns learned during initial model training.
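DocKnow’s own models aren’t public, but you can get a feel for this step with an off-the-shelf NER library. The sketch below uses spaCy and assumes its small English model (en_core_web_sm) is installed; a general-purpose model like this tags “Madeline Johnson” as a person the same way a domain-tuned clinical model would, just less reliably on medical text.

import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = "PATIENT: MADELINE\nNew patient, Madeline Johnson, referred for consultation."
doc = nlp(text)

# Print every entity the model finds, along with its predicted label.
for ent in doc.ents:
    print(ent.text, ent.label_)

# Expected output should include something like:
#   Madeline Johnson PERSON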

Step 2: Document structure understanding

The model recognizes the document type based on its header, the presence of subheaders like “REASON FOR VISIT,” and its overall formatting and language. It understands that, since this document is bundled with this particular form, “PATIENT: MADELINE” is likely a reference to the “First Name” field value “Madeline” under “PATIENT INFORMATION” in the primary document.
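A heavily simplified, rule-based version of this step might look like the sketch below. It guesses the document type from its section headers and links the “PATIENT:” line back to the form’s “First Name” value; the section names and field labels come from the example above, and everything else is hypothetical.

# Subheaders that suggest a clinical visit note (illustrative list only).
VISIT_NOTE_SECTIONS = {"REASON FOR VISIT", "HISTORY OF PRESENT ILLNESS", "ASSESSMENT"}

def classify_document(lines):
    """Guess the document type from its section headers."""
    headers = {line.strip().rstrip(":").upper() for line in lines}
    return "visit_note" if headers & VISIT_NOTE_SECTIONS else "unknown"

def patient_line_matches_form(lines, form_first_name):
    """Check whether a 'PATIENT:' line refers to the form's First Name value."""
    for line in lines:
        if line.upper().startswith("PATIENT:"):
            value = line.split(":", 1)[1].strip()
            return value.upper() == form_first_name.upper()
    return False

note = [
    "REASON FOR VISIT:",
    "PATIENT: MADELINE",
    "New patient, Madeline Johnson, referred for consultation.",
]
print(classify_document(note))                      # visit_note
print(patient_line_matches_form(note, "Madeline"))  # True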

Step 3: Corroborating evidence

DocKnow then validates this logic by looking for corroboration. Later in the note, the full name “Madeline Johnson” appears again in the same order in the phrase “New patient, Madeline Johnson, referred for consultation…”, confirming the parse. An MRN field is also present, indicating the note refers to a specific patient.

Finally, there’s no contradictory information suggesting a different name altogether (though, if there were, DocKnow would highlight both names for a human-in-the-loop reviewer).
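In code, the corroboration check can be as simple as counting how often the candidate full name appears in the note and flagging any conflicting person entities for review. This sketch is hypothetical and assumes the person entities from the NER step are already available as a list of strings.

def corroborate_last_name(note_text, first_name, candidate_last, person_entities):
    """Look for supporting and conflicting evidence for a candidate last name."""
    full_name = first_name + " " + candidate_last
    supporting = note_text.count(full_name)

    # Any person entity that shares the first name but differs overall is a conflict.
    conflicts = [
        ent for ent in person_entities
        if ent.startswith(first_name + " ") and ent != full_name
    ]
    return {
        "supportingMentions": supporting,
        "conflicts": conflicts,  # non-empty means route to a human reviewer
        "corroborated": supporting > 0 and not conflicts,
    }

note = "PATIENT: MADELINE ... New patient, Madeline Johnson, referred for consultation."
print(corroborate_last_name(note, "Madeline", "Johnson", ["Madeline Johnson"]))
# {'supportingMentions': 1, 'conflicts': [], 'corroborated': True}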

Step 4: Confidence scoring

The system assigns a confidence score to this extraction based on:

  • Multiple mentions of the name in a consistent format (HIGH confidence signal)
  • Clear document structure with a labeled patient field (HIGH confidence signal)
  • Common, recognizable name components (MEDIUM-HIGH confidence signal)
  • No ambiguity or conflicting information (HIGH confidence signal)

We might estimate a resulting confidence score of 95%+ that “Johnson” is Madeline’s last name.
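DocKnow’s real scoring model is more involved than this (conformal prediction and adaptive confidence scores come up later in the post), but a toy weighted combination of the four signals above gives the flavor. Both the weights and the per-signal scores below are made up for illustration.

# Illustrative weights only; not DocKnow's real model.
SIGNAL_WEIGHTS = {
    "consistent_mentions":   0.30,  # multiple mentions in a consistent format
    "labeled_patient_field": 0.30,  # clear document structure with a labeled field
    "recognizable_name":     0.20,  # common, recognizable name components
    "no_conflicts":          0.20,  # no ambiguity or contradicting information
}

def confidence_score(signals):
    """Combine per-signal scores (0.0 to 1.0) into one weighted confidence value."""
    return sum(SIGNAL_WEIGHTS[name] * signals.get(name, 0.0) for name in SIGNAL_WEIGHTS)

signals = {
    "consistent_mentions":   1.0,   # HIGH
    "labeled_patient_field": 1.0,   # HIGH
    "recognizable_name":     0.85,  # MEDIUM-HIGH
    "no_conflicts":          1.0,   # HIGH
}
print(f"{confidence_score(signals):.0%}")  # 97%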

Step 5: Field mapping (structuring the data)

DocKnow maps the extracted information to structured fields (JSON, by default):

{
  "firstName": "Madeline",
  "lastName": "Johnson",
  "dateOfBirthMmddyyyy": "07/22/2010",
  ...
}
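The mapping step itself is mostly bookkeeping: once extraction and confidence checks pass, values are written into the target schema, and anything below the cutoff is routed to a reviewer instead. A sketch, using the field names from the JSON above and a hypothetical confidence threshold:

import json

CONFIDENCE_THRESHOLD = 0.95  # hypothetical cutoff for auto-accepting a field

def map_fields(extracted, confidences):
    """Copy extracted values into the output record, deferring low-confidence fields."""
    record, needs_review = {}, []
    for field, value in extracted.items():
        if confidences.get(field, 0.0) >= CONFIDENCE_THRESHOLD:
            record[field] = value
        else:
            needs_review.append(field)  # human-in-the-loop queue
    return {"record": record, "needsReview": needs_review}

extracted = {
    "firstName": "Madeline",
    "lastName": "Johnson",
    "dateOfBirthMmddyyyy": "07/22/2010",
}
confidences = {"firstName": 0.99, "lastName": 0.97, "dateOfBirthMmddyyyy": 0.99}
print(json.dumps(map_fields(extracted, confidences), indent=2))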

To sum it up, DocKnow used contextual NER to identify the name, document structure understanding to know the name referred to the patient (not a doctor or family member), and confidence scoring to ensure accuracy.

It all worked together to extract “Johnson” from the unstructured text.

But this is just one use case, and my explanation still only scratches the surface (we didn’t talk about conformal prediction or adaptive confidence scores).

If incomplete forms, disconnected systems, or plain low-quality data are costing your organization time, money, and insight, you’re not alone, but you don’t have to stay stuck. DocKnow solves the unstructured data problems that keep healthcare organizations from reaching their operational goals. Reach out, and let’s talk about what’s possible when your data actually works for you.

FAQ: Unstructured Data in Healthcare, DocKnow, and More

Why does unstructured data matter in healthcare?

Unstructured data contains the “why” behind every data point in your structured fields. A blood pressure reading of 160/95 is concerning, but the physician’s note explaining “patient stopped taking lisinopril three weeks ago due to persistent cough” transforms that number into actionable intelligence.

Multiply that context across thousands of patients, and you have insights about medication adherence, side effect patterns, and treatment effectiveness that structured data alone can never reveal. You’ll simultaneously improve patient outcomes and generate unique datasets that can translate into huge capital infusions and partnerships.

In fact, organizations trying to systematize their unstructured data are the primary growth driver in the healthcare and life sciences NLP market. Its value is expected to reach $16B over the next five years with a 25% CAGR.

When you see billions flowing into this market, you’re really seeing billions more in expected value creation.

What is semi-structured data?

Besides structured and unstructured data, there’s also semi-structured data. Examples of semi-structured data include categorized unstructured data and even the HTML on a web page.

It’s usually just considered a subset of unstructured data because the structure it does have is optional or inconsistent.

What is named entity recognition (NER)?

Named entity recognition (NER) is an AI technique that automatically identifies and classifies key entities (like people, medications, diagnoses, dates, and measurements) in unstructured text.

It’s one of the foundational technologies that turns narrative documentation into structured, machine-readable data. Healthcare NER is particularly challenging because it involves hundreds of thousands of terms, and many of them have multiple meanings. “MS” could mean multiple sclerosis, mitral stenosis, or morphine sulfate, depending on context.

But if the NER model is effective, its structured outputs can reliably feed data analytics platforms, populate databases, and enable data reconciliation.
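Disambiguating an abbreviation like “MS” usually comes down to the words around it. Here’s a purely illustrative sketch of that idea; production systems rely on trained models and full clinical vocabularies rather than tiny keyword lists like this one.

# Illustrative context keywords only; real systems use trained models and
# full clinical vocabularies rather than short keyword lists.
MS_SENSES = {
    "multiple sclerosis": {"neurology", "lesion", "relapse", "demyelinating"},
    "mitral stenosis":    {"valve", "murmur", "echocardiogram", "cardiology"},
    "morphine sulfate":   {"mg", "dose", "pain", "administered"},
}

def disambiguate_ms(sentence):
    """Pick the sense of 'MS' whose context keywords overlap the sentence most."""
    words = set(sentence.lower().split())
    return max(MS_SENSES, key=lambda sense: len(MS_SENSES[sense] & words))

print(disambiguate_ms("MS 4 mg administered for pain"))                        # morphine sulfate
print(disambiguate_ms("Patient with relapsing MS, new demyelinating lesion"))  # multiple sclerosis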

Do I need technical expertise to use DocKnow?

DocKnow is as accessible to laboratory specimen processors and hospital administrators as it is to data scientists. Even though the underlying technology is complex, the platform’s core functionality doesn’t require technical expertise to use.

However, for bioinformatics analysts and software engineers who do want (or need) to go deeper, DocKnow is highly configurable through its RESTful APIs, business rules, deployment options, and code visibility.

And if necessary, our professional services team is available to implement your unique requirements.

Does my data need to be cleaned up before DocKnow can use it?

No. Part of DocKnow’s value is handling messy, real-world healthcare data as-is.

In fact, one of the benefits is that DocKnow helps you identify data quality issues you didn’t know you had (e.g., missing information, inconsistencies, and gaps in documentation). If you have perfection upfront, you probably don’t have an unstructured data problem for DocKnow to solve in the first place.

