The Rogue Scholar science blog archive started work on improving the subject classification of blog posts, using an open approach to subject classification developed by CWTS and OpenAlex.
Based on my previous work on the importance of subject classification for scholarly works at DataCite, Rogue Scholar launched OECD Fields of Science subject area support for every blog post in February of this year. While this was a good start, there were three significant shortcomings:
- Subject area classification was at the blog level, not individual posts,
- With 48 subject areas, the OECD Fields of Science are too broad for some important use cases,
- Crossref metadata don't (yet) support subject area metadata for individual works.
When submitting a blog to be archived by Rogue Scholar, one of only a few questions I ask blog authors is for the subject area of their blog. While this approach is straightforward, the resulting categories can be too broad for use cases such as recommending similar posts: Rogue Scholar currently holds almost 20,000 social science blog posts and almost 5,000 computer and information science blog posts.
Applying a more granular subject area classification to each publication is a major challenge, as shown by the substantial curation effort needed to classify the biomedical literature with Medical Subject Headings (MeSH) at PubMed.
For Rogue Scholar, I always envisioned a semi-automated approach based on machine learning and human curation of edge cases. Having the full text of all blog posts available in machine-readable form makes this possible, but picking the right subject area classification and building a classification workflow from scratch are still major challenges.
The ideal subject area classification is more granular than OECD, covers all scholarly disciplines, is widely used (and documented), and is openly available. My main candidate five years ago, when I did the subject area work for DataCite, was the Australian and New Zealand Standard Research Classification (ANZSRC) Fields of Research. The CWTS/OpenAlex classification announced last year, with about 4,500 topics categorized into subfields and fields similar to the OECD Fields of Science, is a very attractive alternative: OpenAlex uses it, makes an open API available, and, more importantly, makes the pre-trained model available for download under an Apache 2.0 open source license. This makes it straightforward to run the classifier on Rogue Scholar infrastructure without depending on external APIs with their associated costs and rate limits.

As with many interesting collaborations, the trigger was a conversation: talking with Nees Jan van Eck from CWTS at a Dagstuhl seminar on Open Scholarly Information Systems in September finally got me started on using machine learning to classify Rogue Scholar blog posts into subject areas. Two months later, I have a classifier running with the CWTS/OpenAlex model provided via Hugging Face and integrated into the Rogue Scholar infrastructure.
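A minimal sketch of what running the model locally can look like, assuming the `transformers` library and a backend compatible with the published model weights are installed. The helper names `format_input` and `classify_post` are hypothetical, and the `<TITLE>`/`<ABSTRACT>` input markers are an assumption based on the model card that should be checked against the Hugging Face page. This is an illustration, not the Rogue Scholar implementation.

```python
# Model identifier as listed in the references of this post.
MODEL_ID = (
    "OpenAlex/bert-base-multilingual-cased-finetuned-"
    "openalex-topic-classification-title-abstract"
)


def format_input(title: str, abstract: str) -> str:
    """Combine title and abstract into one input string (assumed format)."""
    return f"<TITLE> {title}\n<ABSTRACT> {abstract}"


def classify_post(title: str, abstract: str, top_k: int = 5):
    """Return the pipeline output for the top_k candidate topics."""
    # Imported lazily so format_input can be used without the model installed.
    from transformers import pipeline

    classifier = pipeline("text-classification", model=MODEL_ID, top_k=top_k)
    # Truncate long inputs to the model's maximum sequence length.
    return classifier(format_input(title, abstract), truncation=True)
```

Calling `classify_post` downloads the model on first use; each returned candidate carries a `label` and a `score`.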
The next step is validating the classifications and fine-tuning the model. I have to decide if the classifications are correct, which matching score to use as a threshold, and whether providing the full text works better than using the abstract (OpenAlex uses title and abstract). Some of the additional questions are:
- Does the classifier work well for non-English content?
- Does the classifier sometimes match several topics, e.g. for interdisciplinary research?
- What is the percentage of blog posts not classified correctly and what are their characteristics?
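The threshold question, and the possibility of several matching topics, can be made concrete with a small sketch. It assumes the classifier returns a list of candidates, each with a `label` and a `score`; the function name `select_topics`, the default threshold of 0.5, the cap of three topics, and the example labels and scores are all made up for illustration and are not validated values.

```python
def select_topics(predictions, threshold=0.5, max_topics=3):
    """Keep the highest-scoring topics whose score meets the threshold."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [p for p in ranked if p["score"] >= threshold][:max_topics]


# Illustrative classifier output, not real model results.
predictions = [
    {"label": "Topic A", "score": 0.62},
    {"label": "Topic B", "score": 0.31},
    {"label": "Topic C", "score": 0.04},
]

selected = select_topics(predictions, threshold=0.3)
```

With these made-up scores and a threshold of 0.3, Topic A and Topic B are kept, which is how an interdisciplinary post could end up with more than one topic. An empty result would mark a post as unclassified and a candidate for human curation.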
For community validation I wrote an integration into the Rogue Scholar Slack (join via this link) using the workflow automation service n8n, also self-hosted on Rogue Scholar infrastructure.


Obviously I don't have the domain knowledge to validate many of the classifications, so I depend on the Rogue Scholar community to help me. I am storing the classifications with each blog post, including an openalex_validated boolean field to indicate when the proposed classification was validated (accepted/rejected) by a human.
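One way to picture such a stored classification is sketched below. Only the `openalex_validated` field name comes from this post; the other fields (`topic`, `score`, `accepted`) and the helper `record_review` are hypothetical, and the sketch assumes that `openalex_validated` marks a human review having taken place while a separate field records the outcome.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class TopicClassification:
    topic: str                        # proposed topic label (hypothetical field)
    score: float                      # classifier score (hypothetical field)
    openalex_validated: bool = False  # True once a human has reviewed it
    accepted: Optional[bool] = None   # review outcome (hypothetical field)


def record_review(item: TopicClassification, accepted: bool) -> TopicClassification:
    """Return a copy marked as human-reviewed with the given outcome."""
    return replace(item, openalex_validated=True, accepted=accepted)


proposed = TopicClassification(topic="Topic A", score=0.62)
reviewed = record_review(proposed, accepted=False)
```

Keeping the review flag separate from the outcome makes it possible to tell unreviewed proposals apart from rejected ones.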
Once the classifications look good, I can integrate them into the InvenioRDM repository backend that powers Rogue Scholar. Luckily the software is flexible enough to handle any subject area classification via custom vocabularies. As this might also be of interest to other repositories running the InvenioRDM software, I also built an integration into the Inveniosoftware Discord server, testing the classification with a few Zenodo records.

Finally, if this works, it might provide an interesting model for Crossref to provide subject area classification in the metadata schema, something the recently launched Crossref Metadata Advisory Group is thinking about.
Please use Slack, email, Mastodon, or Bluesky if you have any questions or comments regarding this work, and join the Rogue Scholar Slack if you want to provide feedback on concrete classifications.
References
- Van Eck, N. J., & Waltman, L. (2024, January 24). An open approach for classifying research publications. Leiden Madtrics. https://doi.org/10.59350/qc0px-76778
- Fenner, M. (2020, September 7). Making the most out of available Metadata. Front Matter. https://doi.org/10.53731/r79rdp1-97aq74v-ag4nb
- Fenner, M. (2025, February 17). Rogue Scholar starts subject area communities. Front Matter. https://doi.org/10.53731/9zb20-k8z13
- OpenAlex/bert-base-multilingual-cased-finetuned-openalex-topic-classification-title-abstract · Hugging Face. (n.d.). Retrieved November 26, 2025, from https://huggingface.co/OpenAlex/bert-base-multilingual-cased-finetuned-openalex-topic-classification-title-abstract
- Feeney, P. (2025, May 2). Metadata Advisory Group call for applications. Crossref Blog. https://doi.org/10.64000/n23nw-3d593
