IRMA: the 335-million-word Italian coRpus for studying MisinformAtion
The dissemination of false information on the internet has received considerable attention over the last decade. Misinformation often spreads faster than mainstream news, making manual fact-checking inefficient or, at best, labor-intensive. There is therefore an increasing need to develop methods for the automatic detection of misinformation. Although resources for creating such methods are available in English, other languages are often underrepresented in this effort. With this contribution, we present IRMA, a corpus containing over 600,000 Italian news articles (335+ million tokens) collected from 56 websites classified as ‘untrustworthy’ by professional fact-checkers. The corpus is freely available and comprises a rich set of text- and website-level data, representing a turnkey resource for testing hypotheses and developing automatic detection algorithms. It contains texts, titles, and dates (from 2004 to 2022), along with three types of semantic measures (i.e., keywords, topics at three different resolutions, and LIWC lexical features). IRMA also includes domain-specific information such as source type (e.g., political, health, conspiracy), quality, and higher-level metadata, including several metrics of incoming website traffic that allow researchers to investigate users' online behavior. IRMA constitutes the largest corpus of misinformation available today in Italian, making it a valid tool for advancing quantitative research on untrustworthy news detection and ultimately helping limit the spread of misinformation.

