Using NLP Data to Classify Patient Segments in Clinical Trial Data

One of the most common and powerful approaches in NLP gives content experts the opportunity to label each data segment in a portion of the dataset; a model then learns from those labels and applies them to the rest of the dataset. Several key questions need to be answered when applying this approach in different environments. For example, how many “expert” labels do we need to create before the classification works effectively? How can we evaluate this in advance? Are we limiting ourselves to extracting from the data only what we already believe to be true? We will discuss that last question in our next post, on clustering.

We applied the classification technique to one of our client’s projects to determine patient segments in clinical trials. This information is not necessarily recorded in the clinical trial information directly, and when it is, it can appear in any one of a dozen data elements that are rarely labeled consistently. Imagine the patient segments as a tree structure: the root is the initial discovery of the disease prior to treatment, and the first level of branches is “yes” or “no” to initial treatment. The child nodes of the first-level branches are segments A, B, C, D and E.

Going one level deeper, if the patients within a trial have gone through prior treatment, they belong to either segment A or C. If they have not gone through prior treatment, they are classified as B, D or E.
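The tree can be written down as a small lookup structure. The sketch below is only illustrative: the branch names `prior_treatment` and `no_prior_treatment` and the helpers `candidate_segments` and `branch_of` are hypothetical, and the mapping assumes that previously treated patients fall into segments A or C while untreated patients fall into B, D or E.

```python
from typing import Dict, List

# Hypothetical encoding of the patient-segment tree: the root is the initial
# discovery of the disease, the first level is prior treatment (yes/no),
# and the leaves are the five segments.
SEGMENT_TREE: Dict[str, List[str]] = {
    "prior_treatment": ["A", "C"],
    "no_prior_treatment": ["B", "D", "E"],
}


def candidate_segments(had_prior_treatment: bool) -> List[str]:
    """Return the leaf segments reachable from the chosen first-level branch."""
    branch = "prior_treatment" if had_prior_treatment else "no_prior_treatment"
    return SEGMENT_TREE[branch]


def branch_of(segment: str) -> str:
    """Return the first-level branch that a leaf segment belongs to."""
    for branch, leaves in SEGMENT_TREE.items():
        if segment in leaves:
            return branch
    raise ValueError(f"unknown segment: {segment}")
```

Keeping the tree explicit makes it easy to check that a predicted segment is consistent with what is already known about a trial's prior-treatment status.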

With thousands of trials and the several categories mentioned above, classifying trials manually becomes a challenging task. To automate the process, we built a text classification model using NLP data preprocessing methods. Historical trials data that had previously been categorized into patient segments (labels) enabled our predictive model to learn how to categorize new, unlabeled data. In our initial test, we used 80% of the historical trials data as a training set and measured how accurately the model predicted the patient segment across the five branches on the remaining 20% of trials. The model’s accuracy varied depending on how many pre-labeled trials were included in the training set.
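To make the 80/20 setup concrete, here is a minimal sketch using only the Python standard library. The post does not say which model was used, so the multinomial Naive Bayes classifier below is an assumption chosen for brevity, and every function name (`split_80_20`, `train_naive_bayes`, `predict`, `holdout_accuracy`) is hypothetical. Records are assumed to be `(text, label)` pairs.

```python
import math
import random
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; real preprocessing would do more."""
    return text.lower().split()


def split_80_20(records, seed=0):
    """Shuffle labeled records and return (training set, held-out test set)."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


def train_naive_bayes(records):
    """Count labels and per-label words for a multinomial Naive Bayes model."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in records:
        label_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocabulary.add(token)
    return label_counts, word_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log-probability (add-one smoothing)."""
    label_counts, word_counts, vocabulary = model
    total = sum(label_counts.values())
    tokens = [t for t in tokenize(text) if t in vocabulary]  # skip unseen words
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)
        denom = sum(word_counts[label].values()) + len(vocabulary)
        for token in tokens:
            score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def holdout_accuracy(model, records):
    """Share of held-out records whose predicted label matches the expert label."""
    correct = sum(predict(model, text) == label for text, label in records)
    return correct / len(records)
```

A typical run would call `split_80_20` on the labeled trials, pass the training portion to `train_naive_bayes`, and report `holdout_accuracy` on the test portion. Repeating this with training sets of different sizes is one way to see how accuracy depends on the number of pre-labeled trials.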

These initial results were a good start and prompted us to keep improving. We refined our approach by first grouping the five segments into two super-segments and classifying each trial into one of them with the same technique, then applying a more refined classification within each super-segment.
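The two-stage idea can be sketched as a simple routing function. The post does not say how the five segments were grouped, so the grouping in `SUPER_SEGMENTS` is an assumption, and the keyword rules below are stand-ins for trained classifiers, included only so the example runs.

```python
# Hypothetical grouping of the five segments into two super-segments.
SUPER_SEGMENTS = {
    "prior_treatment": {"A", "C"},
    "no_prior_treatment": {"B", "D", "E"},
}


def classify_two_stage(text, super_classifier, segment_classifiers):
    """Pick a super-segment first, then a segment within that super-segment."""
    super_segment = super_classifier(text)
    segment = segment_classifiers[super_segment](text)
    if segment not in SUPER_SEGMENTS[super_segment]:
        raise ValueError(f"{segment!r} is not part of {super_segment!r}")
    return super_segment, segment


# Stand-in rules used only to exercise the routing; trained models go here.
def toy_super_classifier(text):
    if "treatment-naive" in text.lower():
        return "no_prior_treatment"
    return "prior_treatment"


toy_segment_classifiers = {
    "prior_treatment": lambda text: "C" if "relapsed" in text.lower() else "A",
    "no_prior_treatment": lambda text: "B",
}

result = classify_two_stage(
    "Relapsed patients after first-line therapy",
    toy_super_classifier,
    toy_segment_classifiers,
)
```

Each second-stage classifier only has to separate two or three segments, which is an easier problem than separating all five at once.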

Our results improved, but one key finding was that the keywords we expected to settle the segment assignments were not helpful. Take “metastatic”: you would think that word would give a very clear indication of the segment assignment, but in fact it did not. It is the words surrounding the term that make the difference. For instance, searching for the term “metastatic” alone would not work because “not metastatic” is also a very common term. Using n-grams is important in sorting through these issues.
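As a small illustration of why n-grams help here, the sketch below extracts unigrams and bigrams from a phrase. The helper names `ngrams` and `unigrams_and_bigrams` are hypothetical, and whitespace tokenization is a simplification.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def unigrams_and_bigrams(text):
    """Tokenize on whitespace and return unigram plus bigram features."""
    tokens = text.lower().split()
    return ngrams(tokens, 1) + ngrams(tokens, 2)


features = unigrams_and_bigrams("Patients with not metastatic disease")
# "metastatic" appears as a unigram whether or not it is negated, but the
# bigram "not metastatic" is a separate feature that carries the negation.
```

With bigrams in the feature set, a classifier can weight “not metastatic” differently from “metastatic” alone.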

It became clear that the word groupings and terms used around the key anchor words are extremely important in determining these segments accurately. We are currently considering additional techniques akin to sentiment analysis for solving these types of problems. The context in which the anchor term is found is as important as the term itself. Essentially, the approach would identify the anchor terms and then analyze the words surrounding each term to determine its meaning, or “sentiment”.
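One simple way to look at the words around an anchor term is a fixed-size context window. The sketch below is only one possible reading of the approach: the window width, the `NEGATION_CUES` list, and the function names are all hypothetical, and checking for negation is a much narrower task than full sentiment analysis.

```python
def context_windows(text, anchor, width=2):
    """Return (before, after) token lists around each occurrence of the anchor."""
    tokens = text.lower().split()
    windows = []
    for i, token in enumerate(tokens):
        if token.strip(".,;:()") == anchor:
            before = tokens[max(0, i - width):i]
            after = tokens[i + 1:i + 1 + width]
            windows.append((before, after))
    return windows


# Hypothetical negation cues; a real list would come from domain experts.
NEGATION_CUES = {"not", "non", "no", "without"}


def anchor_is_negated(text, anchor, width=2):
    """True if a negation cue appears in the window before any anchor match."""
    return any(
        NEGATION_CUES.intersection(before)
        for before, _ in context_windows(text, anchor, width)
    )
```

The windows returned by `context_windows` could also feed a trained model instead of a fixed cue list, so that the surrounding words are learned from the labeled trials.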

Accuracy is another topic for conversation. The accuracy of a predictive model is calculated as the sum of true positives and true negatives divided by the total population. In applying this process, however, we also identified errors made by the “experts” in their initial classification, so how accurate are they? We believe that accuracy will improve over time as our labeled training dataset grows and as other techniques are integrated to support this classification approach. The client is happy, but we are determined to improve our approach.
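The accuracy calculation can be written out directly. The sketch below shows the two-class form from confusion counts and the equivalent multi-segment form, where accuracy is the share of trials whose predicted segment matches the expert label. The function names and the counts in the usage line are made up for illustration.

```python
def accuracy_from_counts(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no predictions to evaluate")
    return (tp + tn) / total


def multiclass_accuracy(expert_labels, predicted_labels):
    """Share of trials whose predicted segment equals the expert label."""
    if len(expert_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not expert_labels:
        raise ValueError("no predictions to evaluate")
    matches = sum(e == p for e, p in zip(expert_labels, predicted_labels))
    return matches / len(expert_labels)


# Illustrative counts only: 40 TP, 45 TN, 10 FP, 5 FN out of 100 trials.
example = accuracy_from_counts(tp=40, tn=45, fp=10, fn=5)  # 85 / 100
```

Both functions treat the expert labels as ground truth, so any errors in those labels carry straight through to the reported accuracy.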

Check out our Case Studies for more examples of how Ozmosi can develop solutions for your data needs.
