The Rogue Scholar science blog archive started work on improving the subject classification of blog posts, using an open approach to subject classification developed by CWTS and OpenAlex.
Use cases
Classification of scholarly works into defined subject areas is important for two groups of use cases:
- Understanding. Fathom the kind of research topics that are covered in science blogs.
- Discovery. Find similar publications to the one you just read via Rogue Scholar.
A good example of the first use case is a study recently published by Catharina Ochsner and colleagues about scholarly blogs in Germany. One of its main outcomes was the disciplinary distribution of these blogs.

The majority of the blogs they found were in the humanities and social sciences. One important reason for this is de.hypotheses, the German-language version of the large blog portal Hypotheses, offering hosting and other services to humanities and social sciences blogs.
Five years ago in the Make Data Count project, we were interested in the subject classification of research data, as usage of datasets is highly dependent on community size and community standards. One big problem was that the standardized subject classification of datasets was mostly lacking, so we started making this easier for DataCite members: push for the OECD Fields of Science as the standard subject area vocabulary that can be mapped from other vocabularies, and show the OECD Fields of Science as a facet in the search results of DataCite Search.
The second use case is primarily driven by readers. After reading an interesting science blog post, they might want to read more posts with similar content. Or they have an idea about the content they are looking for, but it is poorly described using keywords. Blogging platforms have long provided this functionality, but it has been limited to a single blog and has relied on older search technology rather than semantic search.
Limitations
Rogue Scholar launched OECD Fields of Science subject area support for every blog post in February of this year. While this was a good start, there were three significant shortcomings:
- Subject area classification was at the blog level, not for individual posts,
- With 48 subject areas, the OECD Fields of Science are too broad for some important use cases,
- Crossref metadata don't support subject area metadata.
When a blog is submitted to be archived by Rogue Scholar, one of the few questions I ask blog authors is the subject area of their blog, chosen from the OECD Fields of Science list. As with classifying journals rather than journal articles (which, unfortunately, is the more common practice), this approach has limitations, in particular for multidisciplinary blogs and journals. One prominent example in Rogue Scholar is the SVPOW blog, which has been writing about sauropods since October 2007, but also frequently publishes about Open Access.
The six top-level and 42 second-level subject areas in the OECD Fields of Science are good enough for many high-level questions, such as R&D spending or doctoral students per country and/or over time. But more fine-grained questions need more granularity in the subject classification. There are, for example, almost 20,000 social science and almost 5,000 computer and information science blog posts in Rogue Scholar, and categories of that size give only a very coarse picture.
Using a more granular subject area classification applied to each publication is a major challenge, as can for example be seen in the major curation effort needed to classify the biomedical literature into Medical Subject Headings (MeSH) at PubMed.
The third major limitation is that publishers can't easily distribute subject classification information, as Crossref metadata can't hold that information. DataCite is better in this regard, but unfortunately subject area metadata is still only included in a small fraction of DataCite DOIs.
Ideas
For Rogue Scholar, I always envisioned a semi-automated approach based on machine learning and human curation of edge cases. Having the full-text of all blog posts available in machine-readable form makes this possible, but picking the right subject area classification and building a classification workflow from scratch are still major challenges.
The ideal subject area classification is more granular than OECD, covers all scholarly disciplines, is widely used (and documented), and openly available. My main candidate five years ago, when I did the subject area work for DataCite, was the Australian and New Zealand Standard Research Classification (ANZSRC) Fields of Research with more than 1000 subject areas, and used for example by the repository Figshare. The German Research Foundation (DFG) Subject Areas used by the Registry of Research Data Repositories (re3data), on the other hand, have some limitations, primarily that they were developed for a different use case - panels for grant evaluation.
The CWTS/OpenAlex classification, announced in January 2024, has about 4500 topics categorized into subfields and fields that map to the All Science Journal Classification (ASJC). It is a very attractive alternative, especially since OpenAlex uses it, makes an open API available, and, more importantly, makes the pre-trained model available for download under an Apache 2.0 open source license. This makes it straightforward to run the classifier on Rogue Scholar infrastructure and not depend on external APIs with associated costs and rate limits.

As with many interesting collaborations, the trigger was a conversation: talking with Nees Jan van Eck from CWTS at a Dagstuhl seminar on Open Scholarly Information Systems in September finally got me started on using machine learning to classify Rogue Scholar blog posts into subject areas.
Implementation
Two months later I now have a classifier running with the CWTS/OpenAlex model provided via Hugging Face, and integrated into the Rogue Scholar infrastructure.
The InvenioRDM repository platform that powers Rogue Scholar has had built-in subject classification with the OECD Fields of Science since v6.0, released in August 2021, and this can easily be extended with custom vocabularies. Rogue Scholar now has these four hierarchical OpenAlex vocabularies built in:

Any InvenioRDM instance can install these vocabularies in yaml format and make them available in the record editor and API, and I will also include them in the InvenioRDM Starter pre-built Docker image.
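As a rough illustration of what such a vocabulary file might contain, here is a minimal Python sketch that renders two subfields as YAML. The `id`/`scheme`/`subject` field names are modelled on InvenioRDM's subject vocabulary files and are an assumption here, the scheme name `OpenAlexSubfields` is hypothetical, and the two ID/label pairs follow ASJC numbering rather than the actual Rogue Scholar files.

```python
import json

# Hypothetical scheme name; the real vocabulary may use a different one.
SCHEME = "OpenAlexSubfields"

# Assumed ID/label pairs, following ASJC numbering.
SUBFIELDS = [
    ("https://openalex.org/subfields/1206", "Conservation"),
    ("https://openalex.org/subfields/1710", "Information Systems"),
]


def subject_entry(openalex_id: str, label: str, scheme: str = SCHEME) -> dict:
    """Build one vocabulary entry (field names are an assumption)."""
    return {"id": openalex_id, "scheme": scheme, "subject": label}


def to_yaml(entries: list) -> str:
    """Render a list of flat string dicts as a YAML sequence.

    json.dumps is used for quoting, since a JSON string is also a
    valid double-quoted YAML scalar.
    """
    lines = []
    for entry in entries:
        for i, (key, value) in enumerate(entry.items()):
            prefix = "- " if i == 0 else "  "
            lines.append(f"{prefix}{key}: {json.dumps(value)}")
    return "\n".join(lines) + "\n"


entries = [subject_entry(oid, label) for oid, label in SUBFIELDS]
yaml_text = to_yaml(entries)
print(yaml_text)
```

Writing the YAML by hand keeps the sketch free of third-party dependencies; a real export would more likely use a YAML library.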
I have used the subfields (that map to ASJC) to classify all participating blogs, and updated the Rogue Scholar submission form to use the OpenAlex subfields instead of the OECD Fields of Science. Please fill out the form (also linked from the footer of every Rogue Scholar page) if you don't agree with the suggested subfield (multiple subfields are not yet supported).
Rogue Scholar has created communities for each of the 252 subfields, and has reused existing OECD subject communities where it makes sense. It will take a few weeks before all blog posts are properly sorted into these new OpenAlex subfield communities.

The OpenAlex classifier is now running in the Rogue Scholar infrastructure and has started classifying posts using the blog post title and abstract (as the model was trained with this information). After classifying a few hundred blog posts, I found a score of at least 0.25 to be a good cutoff. With lower cutoffs the number of false positives was too high, whereas with higher cutoffs (e.g. 0.40) the number of false negatives was too high. More systematic testing of cutoff scores will follow in a few months, but it is clear that more than half of the classified posts currently have a score that is too low, and that false positives have to be removed. An example of a false positive is a post about the Barcelona Declaration on Open Research Information that was classified as topic Optics and Image Analysis with a score of 0.40.
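The cutoff logic can be sketched in a few lines of Python. This is only an illustration, not the Rogue Scholar code: it assumes the classifier returns a list of `{"label", "score"}` dictionaries (the shape Hugging Face text-classification pipelines produce), and the labels below are placeholders rather than real OpenAlex topics.

```python
CUTOFF = 0.25  # the cutoff reported above; subject to further testing


def pick_topic(predictions, cutoff=CUTOFF):
    """Return the highest-scoring prediction if it reaches the cutoff.

    Returns None when there are no predictions or the best score is
    below the cutoff, i.e. the post stays unclassified.
    """
    if not predictions:
        return None
    best = max(predictions, key=lambda p: p["score"])
    return best if best["score"] >= cutoff else None


# Placeholder predictions, not real model output.
confident = [
    {"label": "topic-a", "score": 0.62},
    {"label": "topic-b", "score": 0.11},
]
uncertain = [
    {"label": "topic-c", "score": 0.19},
    {"label": "topic-d", "score": 0.08},
]

print(pick_topic(confident))  # the topic-a prediction
print(pick_topic(uncertain))  # None
```

A cutoff alone cannot catch a confident but wrong prediction, such as the Barcelona Declaration example, which is why human curation remains part of the workflow.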
What I found so far is that non-English language posts classify well (14% are in other languages), and that posts that can't be classified are often not sciency enough, e.g. they talk about an upcoming webinar or general tools or techniques.
The subfield of the blog and the topic of the post, together with the corresponding topic subfield, are shown in the sidebar of the Rogue Scholar post. Again, it will take a few weeks to generate this information for all posts.

The post is also added to the corresponding subfield communities, in the example above Information Systems and Conservation. The community ID corresponds to the OpenAlex subfield ID (here 1206), and OpenAlex IDs can also be used in Rogue Scholar Search:
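To make the ID mapping concrete, here is a small Python sketch that derives a community ID from an OpenAlex subfield identifier. It assumes subfield IDs are four-digit numbers that may appear as full URLs of the form `https://openalex.org/subfields/1206`, and that, following ASJC numbering, the first two digits identify the parent field. The function names are hypothetical.

```python
import re

# Accepts "https://openalex.org/subfields/1206", "subfields/1206" or "1206".
_SUBFIELD_RE = re.compile(
    r"^(?:https://openalex\.org/)?(?:subfields/)?(\d{4})$"
)


def community_id(openalex_subfield: str) -> str:
    """Return the four-digit subfield ID used as the community ID."""
    match = _SUBFIELD_RE.match(openalex_subfield.strip())
    if match is None:
        raise ValueError(f"not an OpenAlex subfield ID: {openalex_subfield!r}")
    return match.group(1)


def parent_field_id(subfield_id: str) -> str:
    """Return the two-digit field prefix (assumes ASJC-style numbering)."""
    return community_id(subfield_id)[:2]


print(community_id("https://openalex.org/subfields/1206"))  # 1206
print(parent_field_id("1206"))  # 12
```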

Next steps
As I mentioned before, the next few weeks will be spent updating the subject classification data, tweaking the user interface, making the Rogue Scholar classifier instance handle more traffic, and generally responding to user feedback.
For community validation I wrote an integration into the Rogue Scholar Slack (join via this link) using the workflow automation service n8n, also self-hosted on Rogue Scholar infrastructure.


Very soon, I will convert this into a proper chatbot that can trigger the classification directly from Slack, or give feedback if a classification is wrong.
As this might also be of interest to other repositories running the InvenioRDM software, I also built a similar integration into the Inveniosoftware Discord server, testing the classification with a few Zenodo records:

As I am using the same metadata and classifier model as OpenAlex, I can compare their classifications with mine, since all Rogue Scholar posts should be archived in OpenAlex. The example post Reinvestigating the reported transition state structure of a concerted triple H-tunneling mechanism was classified into topic Chemical Reactions and Isotopes by both OpenAlex and Rogue Scholar. A systematic comparison of abstracts and full text as classifier input would also be interesting. Preliminary testing didn't show much difference, and the model was trained with abstracts.
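Such a comparison could look roughly like the Python sketch below. It assumes each OpenAlex work is available as a dictionary with a `primary_topic` object holding a topic `id`, and that the Rogue Scholar topic is available as an ID string. The helper names and the topic IDs in the sample data are hypothetical.

```python
def topic_key(topic_id):
    """Reduce 'https://openalex.org/T99999' or 't99999' to 'T99999'."""
    if not topic_id:
        return None
    return topic_id.rstrip("/").rsplit("/", 1)[-1].upper()


def same_topic(openalex_work, local_topic_id):
    """True if the OpenAlex primary topic matches the local topic ID."""
    primary = openalex_work.get("primary_topic") or {}
    key = topic_key(primary.get("id"))
    return key is not None and key == topic_key(local_topic_id)


def agreement_rate(pairs):
    """Share of (openalex_work, local_topic_id) pairs that agree."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(same_topic(work, tid) for work, tid in pairs) / len(pairs)


# Made-up sample data: one match, one mismatch, one work without a topic.
pairs = [
    ({"primary_topic": {"id": "https://openalex.org/T99999"}}, "T99999"),
    ({"primary_topic": {"id": "https://openalex.org/T99998"}}, "T99999"),
    ({"primary_topic": None}, "T99999"),
]
rate = agreement_rate(pairs)
print(f"{rate:.2f}")
```

Works without a primary topic count as disagreements here. A real comparison might want to report them separately.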
Finally, this approach (OpenAlex subfield and topic classification and pre-trained model made available with an open license) can also work for other Crossref members interested in subject classification, and might convince Crossref to add subject area classification to the metadata schema, something the recently launched Crossref Metadata Advisory Group is thinking about.
In 2026 Rogue Scholar will use this new infrastructure to address important use cases around Understanding and Discovery.
Please use Slack, email, Mastodon, or Bluesky if you have any questions or comments regarding this work, and join the Rogue Scholar Slack if you want to provide feedback on concrete classifications.
References
- Van Eck, N. J., & Waltman, L. (2024, January 24). An open approach for classifying research publications. Leiden Madtrics. https://doi.org/10.59350/qc0px-76778
- Ochsner, C., Pampel, H., Höfting, J., & Rothfritz, L. (2025). Scholarly blogs: An analysis of infrastructural aspects based on German scholarly blogs. Journal of Documentation, 81(7), 520–544. https://doi.org/10.1108/JD-02-2025-0053
- Tay, A. (2025, August 3). Why embedding vector search is probably one of the least objectionable use of AI for search. Aaron Tay's Musings About Librarianship. https://doi.org/10.59350/d22rx-srr93
- Fenner, M. (2020, September 7). Making the most out of available Metadata. Front Matter. https://doi.org/10.53731/r79rdp1-97aq74v-ag4nb
- Fenner, M. (2025, February 17). Rogue Scholar starts subject area communities. Front Matter. https://doi.org/10.53731/9zb20-k8z13
- Taylor, M. (2024, June 5). The SSP debate on “the open access movement has failed” — part 1: Speech for the motion. SVPOW. https://doi.org/10.59350/cbmpt-b8k31
- re3data Team. (2021, October 18). Reviewing the subject classification in re3data. Coref Blog. https://doi.org/10.59350/80bzn-asp22
- OpenAlex/bert-base-multilingual-cased-finetuned-openalex-topic-classification-title-abstract · Hugging Face. (n.d.). Retrieved November 26, 2025, from https://huggingface.co/OpenAlex/bert-base-multilingual-cased-finetuned-openalex-topic-classification-title-abstract
- Nielsen, L. H. (2021, August 5). InvenioRDM reaches major milestone—V6.0 released. Invenio. https://doi.org/10.63517/ep4p0-vnx69
- Hauschke, C., & Steglich, P. (2025). Von der Barcelona Declaration on Open Research Information zur Paris Conference on Open Research Information. https://doi.org/10.57689/DINI-BLOG.20250106
- Feeney, P. (2025, May 2). Metadata Advisory Group call for applications. Crossref Blog. https://doi.org/10.64000/n23nw-3d593
