In our previous post we described a technique for assigning categories to data based on input from content experts, captured in a “training database”.  This technique is effective for condensing large, text-heavy data into specific categories for summaries and improved visualization.  While this approach is useful for those purposes, it will not allow us to uncover new insights or trends, because we are imposing a preconceived and finite set of options, or in other words, what we already know.

The following describes our approach to using clustering techniques for exploring text-heavy data. We applied this technique to two different datasets: scientific journals and presidential speeches.

Dataset #1 – Scientific Journals

For the scientific journals, we analyzed over five hundred whitepapers on a specific disease area to identify trends and key topics for that disease area over the last few years.  To accomplish this, we created term-document matrices, an NLP technique that records the frequency of each word across the whitepaper abstracts available on pubmed.gov.  From these matrices we identified trends in research focus across products and approaches and tracked how they evolved.
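As an illustration, here is a minimal sketch of how such a matrix can be built with scikit-learn; the abstracts shown are placeholders for the text pulled from pubmed.gov, and this is one reasonable implementation rather than our exact pipeline.

```python
# A minimal sketch of building a term-document matrix from paper abstracts.
# The `abstracts` list is a placeholder for text pulled from pubmed.gov.
from sklearn.feature_extraction.text import CountVectorizer

abstracts = [
    "placeholder abstract about disease mechanisms ...",
    "placeholder abstract about a novel therapeutic approach ...",
]

# stop_words="english" drops common filler words. In the resulting matrix,
# each row is a document, each column a term, each cell a frequency count.
vectorizer = CountVectorizer(stop_words="english")
term_doc_matrix = vectorizer.fit_transform(abstracts)

print(term_doc_matrix.shape)                    # (n_documents, n_unique_terms)
print(vectorizer.get_feature_names_out()[:10])  # first few terms
```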

We further explored this data using text clustering and scatter plots of the resulting clusters. Text clustering does not require a labeled dataset; it is a way to explore the data and let the data tell the story in the form of clusters, and scatter plots make those clusters easy to visualize and interpret. We applied a clustering algorithm called K-means to the term-document matrices, where K is the number of clusters and is chosen manually.  This approach still needs more exploration since, as you can imagine, the results will vary depending on the number of clusters we are “looking for”.
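A sketch of that step, continuing from the snippet above; the choice of K = 5 is an assumption for illustration, not a recommendation.

```python
# A sketch of K-means clustering over the term-document matrix built above.
from sklearn.cluster import KMeans

K = 5  # the number of clusters we are "looking for" -- chosen manually
kmeans = KMeans(n_clusters=K, n_init=10, random_state=42)
labels = kmeans.fit_predict(term_doc_matrix)  # one cluster label per abstract

# To cluster similar words instead of abstracts, run the same code on the
# transposed matrix: kmeans.fit_predict(term_doc_matrix.T)
```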

The abstracts used for clustering have many characteristics, and some may be redundant. Some might wonder how the terms, or the relevant characteristics of each cluster, are represented in a scatter plot. To extract the relevant characteristics and visualize them, we used Principal Component Analysis (PCA). PCA builds new characteristics as linear combinations of the existing ones, constructing the best possible features to summarize the data, and we then cluster similar words using the above-mentioned clustering algorithm.
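Putting the pieces together, a sketch of projecting the matrix to 2D with PCA and plotting the clusters might look like this, again assuming the objects from the snippets above.

```python
# A sketch of projecting the clustered abstracts to 2D with PCA and plotting
# them. PCA needs a dense array, so the sparse matrix is converted first.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA(n_components=2)  # keep the two best summary features
coords = pca.fit_transform(term_doc_matrix.toarray())

# Each point is one abstract, colored by its K-means cluster; plotting terms
# instead works the same way on the transposed matrix.
plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```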

Dataset #2 – Presidential Speeches

We applied a similar approach to a set of presidential campaign speeches, using the techniques described above.  The speeches were by Barack Obama, Donald Trump, Jeb Bush, Hillary Clinton, and Ted Cruz. Here, we felt confident that we should see certain tendencies and similarities, and that the outcomes would help us calibrate our approach.

We created a 2D visualization of the term frequency-inverse document frequency (TF-IDF) matrix built from the word counts in the five candidates’ speeches. Note that the TF-IDF matrix is very wide: it has 5 rows (one for each candidate) and N columns (where N is the number of unique unfiltered words across all the speeches). It would be impossible for us to visualize an N-dimensional space where N is in the thousands; therefore, we must make use of dimensionality reduction techniques.
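A minimal sketch of how such a matrix can be assembled, with one concatenated “document” per candidate; the speech texts below are placeholders for the real transcripts.

```python
# A sketch of the 5 x N TF-IDF matrix, one row per candidate.
from sklearn.feature_extraction.text import TfidfVectorizer

speeches_by_candidate = {
    "Obama":   "concatenated campaign speech text ...",
    "Trump":   "concatenated campaign speech text ...",
    "Bush":    "concatenated campaign speech text ...",
    "Clinton": "concatenated campaign speech text ...",
    "Cruz":    "concatenated campaign speech text ...",
}

tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(speeches_by_candidate.values())
print(tfidf.shape)  # (5, N), where N is the number of unique words
```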

Principal component analysis (PCA) is one such technique. The general idea is that it maps a high-dimensional space to a low-dimensional one (in our case, 2D). It does this by finding the directions in the N-dimensional space that have the highest variance and redefining the axes to lie along those directions (as compared to the standard axes we think of when plotting a graph). In this way, the new axes (in our case PC1 and PC2) are linear combinations of the original N dimensions. Although not essential to the analysis, we note that after creating PC1, PCA ensures that PC2 is orthogonal (i.e., at a right angle) to PC1; this keeps the representation natural, since we are used to graphs whose x and y axes sit at 90-degree angles to one another.
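Continuing the sketch, the projection onto PC1 and PC2 could look like this, assuming the `tfidf` matrix and candidate names from the previous snippet.

```python
# A sketch of reducing the 5 x N TF-IDF matrix to the (PC1, PC2) plane.
# By construction, PC2 is orthogonal to PC1.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
coords = pca.fit_transform(tfidf.toarray())  # one (PC1, PC2) point per candidate

for name, (x, y) in zip(speeches_by_candidate.keys(), coords):
    plt.scatter(x, y)
    plt.annotate(name, (x, y))
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```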

We can see from the graph that Bush, Clinton, and Obama sit quite near to each other (with Bush and Obama being the closest), while Trump and Cruz sit far away, off in different directions. Several factors help explain this pattern; below we list a few, ordered by significance.

1) First, we note that this entire graph is based on the words used in the speeches; the terms and their frequencies are the only information we extracted, and this does not capture semantic meaning.

2) Bush, Clinton, and Obama are likely close to each other because they all ran a “typical” campaign. They covered the common key points that presidential campaigns have addressed in the past, so the terms they used are similar, and they end up close together in the 2D representation.

3) Both Trump and Cruz were atypical Republican candidates. They discussed and focused on less typical topics in their speeches (e.g., building a wall with Mexico). This can be seen along both the x and y axes.

4) Clinton and Obama appear close together since they are both typical Democrats who discuss the same content, so the words they use are similar. This also pulls in Bush, who was a typical Republican: although the candidates held differing opinions, they were discussing the same content and topics and using the same words. Recalling point (1), that only word frequencies matter and not semantic meaning, we can see how this brings Bush closer to Clinton and Obama.

PC1 tells us about policies and issues. PC2 denotes character: how candidates characterize themselves and each other in their campaign speeches.

After further analysis, such as finding the top and bottom 30 words that drove Policies/Issues (PC1) and Character (PC2) for these campaign speeches, we found that the top 30 words came mostly from the Democrats’ and Jeb Bush’s speeches, while the bottom 30 belonged to Donald Trump and Ted Cruz.
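For readers curious how such word lists can be extracted, the words that drive each component can be read off the PCA loadings; a sketch, assuming the fitted objects from the snippets above.

```python
# A sketch of reading the most influential words off the PCA loadings,
# using the fitted `pca` and `tfidf_vectorizer` from the previous snippets.
import numpy as np

words = tfidf_vectorizer.get_feature_names_out()
for i, label in enumerate(["Policies/Issues (PC1)", "Character (PC2)"]):
    order = np.argsort(pca.components_[i])   # ascending by loading
    print(label)
    print("  top 30:   ", list(words[order[-30:]][::-1]))
    print("  bottom 30:", list(words[order[:30]]))
```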

Conclusion

The results from the above analysis reinforced our confidence in this technique for summarizing information and highlighting themes. The most exciting part of this approach was that it could be used to develop “fingerprints” or themes. For example, each candidate showed separation from the other candidates along the axes above, yet their own speeches scored extremely close to each other, highlighting the consistency and reliability we would hope to see in an analysis of language and themes.  This could be applied more broadly to a body of scientific research to identify core themes in current research and perhaps even to identify missing themes.  More work and research are needed to explore this technique further, but for now we will continue to track the thinking and trends in our scientific literature using techniques like these.
