
Wikipedia citations: Reproducible citation extraction from multilingual Wikipedia

Abstract: Wikipedia is an essential component of the open science ecosystem, yet it is poorly integrated with academic open science initiatives. Wikipedia Citations is a project that focuses on extracting and releasing comprehensive data sets of citations from Wikipedia. A total of 29.3 million citations were extracted from the English Wikipedia in May 2020. Following this one-off research project, we designed a reproducible pipeline that can process any Wikipedia dump in a cloud-based setting. To demonstrate its usability, we extracted 40.6 million citations in February 2023 and 44.7 million citations in February 2024. Furthermore, we equipped the pipeline with an adapted Wikipedia citation template translation module that processes multilingual Wikipedia articles in 15 languages, parsing their citations and mapping them into a generic structured citation template. This paper presents our open-source software pipeline for retrieving, classifying, and disambiguating citations on demand from a given Wikipedia dump.
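To make the idea of mapping language-specific citation templates into a generic one more concrete, here is a minimal sketch in Python. It is not the authors' pipeline: the function name `extract_citations`, the lookup tables `TEMPLATE_NAMES` and `PARAM_NAMES`, and the sample entries in them are all hypothetical, and the regular expression only handles flat templates without nesting.

```python
import re

# Hypothetical lookup tables (illustrative assumptions, not the pipeline's real data):
# localized template names and parameter names mapped to generic English ones.
TEMPLATE_NAMES = {"cite journal": "cite journal", "article": "cite journal"}
PARAM_NAMES = {
    "titre": "title",
    "auteur": "author",
    "année": "year",
    "périodique": "journal",
}

# Matches a flat template such as {{cite journal |title=... |doi=...}}.
# Nested templates are deliberately not handled in this sketch.
TEMPLATE_RE = re.compile(r"\{\{([^{}|]+)\|([^{}]*)\}\}")


def extract_citations(wikitext):
    """Return known citation templates found in wikitext as generic dicts."""
    citations = []
    for match in TEMPLATE_RE.finditer(wikitext):
        name = match.group(1).strip().lower()
        if name not in TEMPLATE_NAMES:
            continue  # not a citation template we know about
        fields = {}
        for part in match.group(2).split("|"):
            key, sep, value = part.partition("=")
            if not sep:
                continue  # skip positional parameters
            key = key.strip().lower()
            # Translate the parameter name if a mapping exists, else keep it.
            fields[PARAM_NAMES.get(key, key)] = value.strip()
        citations.append({"template": TEMPLATE_NAMES[name], **fields})
    return citations
```

A real dump would need a proper wikitext parser and much larger mapping tables; the sketch only shows the shape of the translation step described in the abstract.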

MIT Press

Natallia Kokash, Giovanni Colavizza

Quantitative Science Studies

2025



Related Results

Aberration of the citation

Multiple inherent biases related to different citation practices (e.g., self-citations, negative citations, wrong citations, multi-authorship-biased citations, honorary citatio...

Wayback machine: reincarnation to vanished online citations

Purpose: The purpose of this paper is to determine the rate of loss of online citations used as references in scholarly journals. It also intended to recover the vanish...

Citation analysis of computer systems papers

Citation analysis is used extensively in the bibliometrics literature to assess the impact of individual works, researchers, institutions, and even entire fields of study. In this ...

Interdependencies in Citation Metrics Using Dimensions (Case Study of Two NAUKMA Journals)

Quantitative data are increasingly influencing the evaluation of the effectiveness of research and researchers. Citations may be the main metric to assess the quality and value of ...

Self-citations, a trend prevalent across subject disciplines at the global level: an overview

Purpose: The present study aims to determine the prevailing trend of self-citations across 27 major subject disciplines at the global level. The study also examines the aspects like per...

Wikipedia: a tool to monitor seasonal diseases trends?

Objective: To explore the interest of Wikipedia as a data source to monitor seasonal disease trends in metropolitan France. Introduction: Today, the Internet, especially Wikipedia, is an im...

Exploiting Wikipedia Semantics for Computing Word Associations

Semantic association computation is the process of automatically quantifying the strength of a semantic connection between two textual units based on various lexi...

COVID-19 research in Wikipedia

Wikipedia is one of the main sources of free knowledge on the Web. During the first few months of the pandemic, over 5,200 new Wikipedia pages on COVID-19 were created, accumulatin...
