Natural Language Processing Tools and Workflows for Improving Research Processes

Date

2024-12-16

Publisher

MDPI

Type

Article

Peer reviewed

Yes

Abstract

The modern research process involves refining a set of keywords until sufficiently pertinent results are obtained from acceptable sources. References and citations from the most relevant results can then be traced to related works. In this way, the set of keywords is developed iteratively to find the most relevant literature. However, because a keyword-based search only samples a corpus, it may be inadequate for capturing a broad or exhaustive understanding of a topic. Further, a keyword-based search depends on the underlying storage and retrieval technology and is essentially a syntactic search rather than a semantic one. To overcome these limitations, this paper explores the use of well-known natural language processing (NLP) techniques to support semantic search. It identifies where specific NLP techniques can be employed and what their primary benefits are, opening further opportunities to improve the research process. The proposed NLP methods were tested through different workflows on different datasets, and each workflow was designed to exploit latent relationships within the data to refine the keywords. The tests demonstrated an improvement in the identified literature compared with the literature retrieved using the keywords supplied by the end user. For example, one of the defined workflows reduced the number of search results by two orders of magnitude, while the remaining results contained a larger percentage of pertinent items.

Description

open access article

Keywords

Natural language processing, Hierarchical Dirichlet process, Latent Dirichlet allocation, Latent semantic indexing, Word2vec, Naive Bayes, Adaptive boost

Citation

Khan, N., Elizondo, D., Deka, L., and Molina-Cabello, M. A. (2024) Natural Language Processing Tools and Workflows for Improving Research Processes. Applied Sciences, 14 (24), 11731

Rights

Attribution 4.0 International
http://creativecommons.org/licenses/by/4.0/

Research Institute

Institute of Digital Research, Communication and Responsible Innovation