Exploring lexical variation through synonym sets in human and AI-written scientific texts
DOI:
https://doi.org/10.35925/j.multi.2025.4.5
Keywords:
AI detection in scholarly texts, lexical variation metrics, synonym clustering, synonym set-based analysis
Abstract
We propose a synonym set-based framework for detecting stylistic and conceptual features that distinguish native scientific writing, non-native texts, and purely AI-generated texts. Using WordNet and POS-aware synonym clustering, we analyzed 12 aligned text pairs with four concept-level metrics: synonym-set coverage, lexical reduction ratio, collapsed type-token ratio, and Jaccard similarity. Native texts consistently exhibited higher conceptual overlap (Jaccard scores of 0.217–0.344 at moderate thresholds) with their AI-generated counterparts than non-native texts did. Coverage was slightly higher in native texts (mean difference ≈ +0.03), while non-native texts showed more vocabulary redundancy (mean reduction ≈ 0.03; mean rise in redundancy ≈ 0.05). These patterns suggest that non-native writing has lower lexical variety and higher redundancy. Our method enables researchers to identify lexical tendencies that help differentiate human-authored from AI-written texts in academic contexts.
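
The abstract names the four concept-level metrics but does not give their formulas. The sketch below is one plausible reading, not the authors' implementation. It assumes that coverage is the share of tokens found in the synonym lexicon, that the reduction ratio is the share of surface types lost when synonyms are collapsed, and that the collapsed type-token ratio is concept types divided by tokens. The `TOY_SYNSETS` table is a hypothetical stand-in for WordNet lookups, and every function name is illustrative.

```python
from typing import Dict, List, Set, Tuple

Token = Tuple[str, str]  # (lemma, POS tag), so clustering is POS-aware

# Hypothetical toy lexicon standing in for WordNet:
# maps (lemma, POS) to a synonym-set identifier.
TOY_SYNSETS: Dict[Token, str] = {
    ("show", "v"): "show.v",
    ("demonstrate", "v"): "show.v",
    ("exhibit", "v"): "show.v",
    ("method", "n"): "method.n",
    ("approach", "n"): "method.n",
    ("result", "n"): "result.n",
    ("finding", "n"): "result.n",
}


def collapse(tokens: List[Token], lexicon: Dict[Token, str]) -> List[str]:
    """Replace each token by its synonym-set id; unknown tokens keep a private id."""
    return [lexicon.get(tok, f"raw:{tok[0]}.{tok[1]}") for tok in tokens]


def concept_metrics(tokens: List[Token], lexicon: Dict[Token, str]) -> Dict[str, float]:
    """Compute the three single-text metrics under the assumed definitions."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    surface_types = set(tokens)
    concept_types = set(collapse(tokens, lexicon))
    covered = sum(1 for tok in tokens if tok in lexicon)
    return {
        # Assumed: fraction of tokens that belong to some synonym set.
        "coverage": covered / len(tokens),
        # Assumed: fraction of surface types removed by collapsing synonyms.
        "reduction_ratio": 1 - len(concept_types) / len(surface_types),
        # Assumed: distinct concepts per token.
        "collapsed_ttr": len(concept_types) / len(tokens),
    }


def jaccard(a: List[Token], b: List[Token], lexicon: Dict[Token, str]) -> float:
    """Jaccard similarity between the concept sets of two texts."""
    set_a: Set[str] = set(collapse(a, lexicon))
    set_b: Set[str] = set(collapse(b, lexicon))
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Illustrative aligned pair (invented tokens, not data from the study).
human = [("method", "n"), ("show", "v"), ("result", "n"),
         ("approach", "n"), ("demonstrate", "v"), ("finding", "n")]
ai = [("approach", "n"), ("exhibit", "v"), ("data", "n")]

human_metrics = concept_metrics(human, TOY_SYNSETS)
pair_overlap = jaccard(human, ai, TOY_SYNSETS)
print(human_metrics, pair_overlap)
```

In this toy pair, the six human-side tokens collapse to three concepts, which is why a text that rotates through synonyms can have a high surface vocabulary and a much lower concept-level one. The abstract's "moderate thresholds" are not modelled here, since the text does not say what is being thresholded.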