We’re getting close to resubmitting our NLP analysis of about 1,700 abstracts from Implementation Science. To get to that point, we had to trim or remove a bunch of figures and analyses that just didn’t fit or that had to be adjusted to address reviewer comments. They were still cool, though, so we wanted to share some of our findings.
Victoria and I used topic modeling, a method for clustering words into topics and then evaluating different configurations of those topics. Topic modeling presumes that topics are made of words that co-occur, and that documents (in this case, the abstracts) are made of several topics.
We then used several metrics to evaluate the qualities of these configurations. These can be seen in the figure below.
Let’s run through what each of these metrics means.
- Topic Size: The total number of tokens assigned to each topic; higher values indicate higher quality. There wasn’t a lot of variation in topic size, so we scaled the data to better look for patterns. Scaling involves transforming data points so the mean is zero and the standard deviation is one.
- Mean token length: The average number of characters in a topic’s top tokens, with longer words potentially indicative of better-quality topics. Here, Topic 16, which seems to be about practice and service context, scores the highest.
- Prominence: The number of unique abstracts in which a topic appears. This shows that Topic 18 (clearly systematic reviews) and Topic 10 (maybe RCTs?) appear most often across all abstracts.
- Coherence: How often each topic’s top tokens appear together in the same abstract; basically, how well a topic holds together. These values are negative, though we are still looking for the highest number (the one closest to zero). Topic 21 is the most coherent, and it looks to be generally about study evaluation.
- Exclusivity: How unique the top tokens in each topic are when compared to the tokens in other topics; a proxy measure of distinctness. Because these values were also very tight, we scaled them as well. Topic 21 shows up on top here, too.
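For anyone who wants to poke at these ideas, here is a rough Python sketch of three of the calculations described above: scaling, mean token length, and coherence. It is not the code from the paper. The function names (`scale`, `mean_token_length`, `coherence`) and the toy abstracts are made up for illustration, and the coherence formula is an assumption: a common log co-occurrence score with +1 smoothing, which may differ from what our tooling computes.

```python
import math
from statistics import mean, pstdev


def scale(values):
    """Z-score the values so their mean is 0 and standard deviation is 1."""
    mu = mean(values)
    # Population SD is an assumption here; some tools use the sample SD.
    # A list of identical values has SD 0 and would raise ZeroDivisionError.
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def mean_token_length(top_tokens):
    """Average number of characters across a topic's top tokens."""
    return mean(len(token) for token in top_tokens)


def coherence(top_tokens, documents):
    """Assumed co-occurrence coherence: for each pair of top tokens, add
    log((docs containing both + 1) / docs containing the higher-ranked one)."""
    doc_sets = [set(doc) for doc in documents]

    def doc_count(*tokens):
        # Number of documents that contain every token given.
        return sum(all(t in doc for t in tokens) for doc in doc_sets)

    score = 0.0
    for i in range(1, len(top_tokens)):
        for j in range(i):
            base = doc_count(top_tokens[j])
            if base == 0:
                continue  # token never appears; skip to avoid dividing by zero
            both = doc_count(top_tokens[i], top_tokens[j])
            score += math.log((both + 1) / base)
    return score


# Hypothetical, already-tokenized "abstracts" (invented, not from our corpus).
abstracts = [
    ["systematic", "review", "evidence"],
    ["systematic", "review", "trial"],
    ["trial", "randomized", "evidence"],
    ["practice", "service", "context"],
]

tight_topic = ["systematic", "review", "trial"]
loose_topic = ["systematic", "practice", "randomized"]

print(scale([10, 12, 14]))
print(mean_token_length(tight_topic))
print(coherence(tight_topic, abstracts), coherence(loose_topic, abstracts))
```

The topic whose top tokens actually share abstracts gets the higher score, which is the "higher is better" reading used in the list above. On a corpus this tiny the +1 smoothing can push scores above zero, so don't read anything into the sign here.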
The full paper will be available soon, and my hope is that I’ll drop some of the other analyses in this space. So, consider this the Blu-Ray extras.
Cover photo from Unsplash, pulled directly from here.