The MedCollect corpus was built manually for the purposes of linguistic analysis. It contains 2,206 articles (1,259,567 tokens) on the topic of health and medicine, of which 1,448 (864,472 tokens) are fake news and 758 (395,095 tokens) are control samples. The articles come from 179 different websites, although around 90% of them come from only 26 websites. The oldest article in the corpus dates from 2007, but 75% of the articles were published after 2020.
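As a quick consistency check on the figures above, the following minimal Python sketch recomputes the totals and derives the share of each class. The variable and function names are illustrative assumptions and are not part of any MedCollect tooling; only the four counts come from the description.

```python
# Article and token counts as reported in the corpus description.
COUNTS = {
    "fake": {"articles": 1448, "tokens": 864_472},
    "control": {"articles": 758, "tokens": 395_095},
}


def summarize(counts):
    """Return totals plus per-class shares and mean article length."""
    total_articles = sum(c["articles"] for c in counts.values())
    total_tokens = sum(c["tokens"] for c in counts.values())
    per_class = {}
    for label, c in counts.items():
        per_class[label] = {
            # Fraction of all articles / tokens that belong to this class.
            "article_share": c["articles"] / total_articles,
            "token_share": c["tokens"] / total_tokens,
            # Average number of tokens per article in this class.
            "mean_tokens": c["tokens"] / c["articles"],
        }
    return {
        "total_articles": total_articles,
        "total_tokens": total_tokens,
        "per_class": per_class,
    }


summary = summarize(COUNTS)
print(summary["total_articles"], summary["total_tokens"])
for label, stats in summary["per_class"].items():
    print(
        f"{label}: {stats['article_share']:.1%} of articles, "
        f"{stats['token_share']:.1%} of tokens, "
        f"{stats['mean_tokens']:.0f} tokens per article on average"
    )
```

The totals reproduce the reported 2,206 articles and 1,259,567 tokens. Fake news accounts for roughly two thirds of the corpus by both measures, so any comparison between the two classes should allow for this imbalance.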
To uncover the structural and hidden manipulative strategies characteristic of fake news, 707 articles (370,300 tokens) were selected for manual annotation. Of these, 322 (182,626 tokens) are fake news and 385 (187,626 tokens) are control samples.
6722 Szeged, Egyetem utca 2.
enyik@szte.hu
MTA–DE–SZTE Research Group for Theoretical Linguistics
Science for the Hungarian Language National Programme of the Hungarian Academy of Sciences (MTA)
Linguistic identification of fake news and pseudoscientific views
University of Szeged
Faculty of Humanities and Social Sciences
Department of General Linguistics
