Distance-based clustering of document words
DOI:
https://doi.org/10.32968/psaie.2024.3.3
Keywords:
clustering, search strategies, covering disks
Abstract
The feasibility of automatic question generation relies heavily on organizing the words of the document into appropriate clusters. The primary aim is to form groups from the words of the document where the words within each group exhibit similarities based on certain predefined properties. Accurately uncovering similarities between words lays the groundwork for automatically determining the words to be highlighted as questions and offering alternatives for their substitution using knowledge-intensive methods.
The distance between two words is derived from how often they co-occur in sentences of the documents: the more sentences contain both words, the closer the two words are considered to be. The concept has been implemented with several different algorithms to enable comparison of the results and to reveal their respective strengths and weaknesses.
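The co-occurrence-based distance can be illustrated with a minimal Python sketch. The abstract does not give the exact formula, so the mapping from counts to distances used here, 1 / (1 + count), is an assumption chosen only because it decreases as the co-occurrence count grows. The function names, the whitespace tokenization, and the sample sentences are likewise hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences):
    """Count, for each unordered word pair, the sentences containing both words."""
    counts = Counter()
    for sentence in sentences:
        # Naive tokenization (an assumption): lowercase and split on whitespace.
        # Using a set means a pair is counted at most once per sentence.
        words = sorted(set(sentence.lower().split()))
        for a, b in combinations(words, 2):
            counts[(a, b)] += 1
    return counts


def distance(counts, a, b):
    """Assumed distance: 1 / (1 + co-occurrence count); 0 for identical words."""
    if a == b:
        return 0.0
    key = (a, b) if a < b else (b, a)  # pairs are stored in sorted order
    return 1.0 / (1.0 + counts[key])


# Hypothetical toy corpus, one string per sentence.
sentences = [
    "the cat chased the mouse",
    "the cat sat",
    "a dog chased a ball",
]
counts = cooccurrence_counts(sentences)

print(distance(counts, "cat", "the"))  # the pair shares two sentences
print(distance(counts, "cat", "dog"))  # the pair never co-occurs
```

Under this assumed formula, word pairs that never share a sentence sit at the maximum distance of 1, and the distance shrinks toward 0 as the number of shared sentences grows. Any of the clustering algorithms mentioned above could then operate on the resulting pairwise distances.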