
Uncovering Companies Missing from the SABI Database: A Web Scraping Approach

This study evaluates the completeness and representativeness of the SABI database, a widely used commercial source for firm-level data in Spain and Portugal, by comparing it to BORME, the official Spanish business register. Using web scraping techniques, we collected and processed approximately 100,000 BORME publications in PDF format, covering the period from 2010 to 2023. These were transformed into a structured dataset comprising over 1.2 million companies, which we then matched against SABI records from the same period. Our analysis reveals that SABI covers only 38.3% of newly established companies, with significant underrepresentation of younger firms, small enterprises, specific sectors, and certain regions. Furthermore, we find clear evidence of survivorship bias: the longer a company has been dissolved, the less likely it is to appear in SABI. Sectoral and geographic disparities are also substantial, and the coverage is skewed toward firms with higher initial capital and specific legal forms. These findings suggest that SABI represents a non-random subset of the Spanish business population, and caution should be exercised when using it for empirical research. Adjustments for sample bias are recommended to improve the reliability of analyses based on this database.
Ediciones Profesionales de la Informacion SL
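
The abstract describes matching BORME companies against SABI records and reporting coverage overall and by firm characteristics. The following is a minimal sketch of that kind of computation, not the authors' actual pipeline: it assumes matching on normalized company names, and the function names, record fields ("name", "year"), and sample companies are all hypothetical and invented for illustration.

```python
import unicodedata
from collections import defaultdict


def normalize_name(name: str) -> str:
    """Uppercase, strip accents, drop periods, and collapse other punctuation.

    Assumption: name normalization is a reasonable matching key. Note that
    accent stripping also maps "Ñ" to "N".
    """
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Remove periods so "S.L." and "SL" normalize to the same token.
    upper = no_accents.upper().replace(".", "")
    cleaned = "".join(ch if ch.isalnum() else " " for ch in upper)
    return " ".join(cleaned.split())


def coverage_by_group(register, database_names, group_key):
    """Share of register records found in the database, per group value."""
    known = {normalize_name(n) for n in database_names}
    totals = defaultdict(int)
    matched = defaultdict(int)
    for record in register:
        group = record[group_key]
        totals[group] += 1
        if normalize_name(record["name"]) in known:
            matched[group] += 1
    return {group: matched[group] / totals[group] for group in totals}


# Hypothetical toy data standing in for the register and the commercial database.
register = [
    {"name": "Panadería López, S.L.", "year": 2010},
    {"name": "Talleres Ruiz SL", "year": 2010},
    {"name": "Nuevo Horizonte S.A.", "year": 2020},
    {"name": "Café Sol, S.L.", "year": 2020},
]
database_names = ["PANADERIA LOPEZ SL", "TALLERES RUIZ, S.L.", "CAFE SOL SL"]

coverage = coverage_by_group(register, database_names, "year")
print(coverage)
```

Grouping by another field (sector, region, legal form) only requires a different group_key. In practice, matching on an official identifier would be more robust than matching on names, where one is available in both sources.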

