
Japanese-Mobile-Receipt-OCR-1.3K: A Comprehensive Dataset Analysis and Fine-tuned Vision-Language Model for Structured Receipt Data Extraction

Springer Science and Business Media LLC
Abstract
We introduce Japanese-Mobile-Receipt-OCR-1.3K, a curated dataset of 1,300 real-world Japanese receipt images captured via mobile phones and annotated with 34,727 text entries.
We also present a fine-tuned vision–language model for end-to-end structured receipt extraction.
Our dataset analysis quantifies linguistic and layout characteristics that challenge receipt understanding.
These include a heavy-tailed token length distribution (mean 9.3 tokens, maximum 255), diverse text complexity across fields, and marked heterogeneity in character composition, with substantial proportions of Kanji, Kana, and numerals.
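The abstract does not say how character composition was measured. As a minimal sketch, assuming each text entry is NFKC-normalized and then bucketed by Unicode block (the bucket names and code-point ranges below are illustrative assumptions, not the paper's definitions), a per-script profile could be computed like this:

```python
import unicodedata
from collections import Counter


def char_category(ch: str) -> str:
    """Assign one character to a coarse script bucket (illustrative ranges only)."""
    cp = ord(ch)
    # CJK Unified Ideographs, Extension A, and the iteration mark 々
    if 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF or cp == 0x3005:
        return "kanji"
    if 0x3040 <= cp <= 0x309F:  # Hiragana block
        return "hiragana"
    # Katakana block (includes ー and ・) plus half-width katakana
    if 0x30A0 <= cp <= 0x30FF or 0xFF66 <= cp <= 0xFF9F:
        return "katakana"
    if ch.isdigit():  # ASCII and full-width digits
        return "numeral"
    if ch.isascii() and ch.isalpha():
        return "latin"
    if ch.isspace():
        return "space"
    return "other"  # punctuation, currency symbols, etc.


def script_profile(entries: list[str]) -> dict[str, float]:
    """Return the share of characters in each bucket across all entries."""
    counts = Counter()
    for text in entries:
        # NFKC folds full-width digits/Latin and half-width kana to standard forms
        for ch in unicodedata.normalize("NFKC", text):
            counts[char_category(ch)] += 1
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()} if total else {}
```

For example, the entry "カード1枚" splits into three katakana characters, one numeral, and one kanji, giving shares of 0.6, 0.2, and 0.2.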
We further assess semantic coverage by quantifying numeric, monetary, and temporal expressions, and by measuring named entity recognition (NER) coverage across common receipt fields.
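The abstract does not state how numeric, monetary, and temporal expressions were identified. A rule-based sketch, assuming simple regular expressions over NFKC-normalized text, could count how many entries contain each expression type. The patterns are hypothetical and far from exhaustive (for example, they ignore kanji numerals and 令和元年), and "numeric" deliberately overlaps the other two types:

```python
import re
import unicodedata

# Hypothetical patterns; applied after NFKC, so ￥ and full-width digits are already folded.
PATTERNS = {
    "monetary": re.compile(r"¥\s?\d[\d,]*|\d[\d,]*\s?円"),
    "temporal": re.compile(
        r"\d{4}[/.\-年]\d{1,2}[/.\-月]\d{1,2}日?"  # 2024/05/01, 2024年5月1日
        r"|(?:令和|平成)\d{1,2}年\d{1,2}月\d{1,2}日"  # Japanese era dates
        r"|\d{1,2}:\d{2}(?::\d{2})?"  # 12:34, 12:34:56
    ),
    "numeric": re.compile(r"\d+(?:[.,]\d+)*"),  # any number, including the above
}


def expression_coverage(entries: list[str]) -> dict[str, float]:
    """Fraction of entries containing at least one match of each expression type."""
    if not entries:
        return {}
    hits = dict.fromkeys(PATTERNS, 0)
    for text in entries:
        norm = unicodedata.normalize("NFKC", text)
        for name, pattern in PATTERNS.items():
            if pattern.search(norm):
                hits[name] += 1
    return {name: count / len(entries) for name, count in hits.items()}
```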
Leveraging these insights, we adapt a 3B-parameter vision–language backbone to produce structured JSON outputs capturing hierarchical field relationships and standardized numeric and currency formats.
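The abstract describes JSON outputs with hierarchical fields and standardized numeric and currency formats but does not publish the schema. A minimal sketch, assuming a hypothetical nested schema (`store`, `date`, `items`, `total`) and integer yen amounts, of how raw amount strings could be normalized before serialization:

```python
import json
import re
import unicodedata
from typing import Optional


def parse_yen(raw: str) -> Optional[int]:
    """Normalize strings such as '￥1,280' or '１，２８０円' to an integer yen amount."""
    text = unicodedata.normalize("NFKC", raw)  # full-width digits/commas -> ASCII
    match = re.search(r"-?\d[\d,]*", text)
    if match is None:
        return None
    return int(match.group().replace(",", ""))


def to_receipt_json(store: str, date: str, items: list[tuple[str, str]], total: str) -> str:
    """Assemble a nested receipt record; every field name here is hypothetical."""
    record = {
        "store": {"name": store},
        "date": date,  # assumed already normalized to ISO 8601 (YYYY-MM-DD)
        "items": [{"name": name, "price": parse_yen(price)} for name, price in items],
        "total": {"amount": parse_yen(total), "currency": "JPY"},
    }
    return json.dumps(record, ensure_ascii=False, indent=2)
```

Storing amounts as integers keeps them comparable across receipts that print ￥, 円, or full-width digits; a production schema would also need fields such as tax lines, discounts, and payment method.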
Extensive experiments show consistent improvements over strong baselines.
Our approach achieves notable gains in field naming consistency, hierarchical structure accuracy, and numeric formatting, alongside reductions in word error rate and character error rate.
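Word error rate (WER) and character error rate (CER) are both edit-distance ratios: the minimum number of substitutions, deletions, and insertions separating the hypothesis from the reference, divided by the reference length, counted over words for WER and over characters for CER. A minimal sketch follows; because Japanese is written without spaces, `wer` here assumes the caller supplies word tokens from some segmenter (the abstract does not say which tokenizer was used).

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance with unit costs, using two rolling rows."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix to hyp[:j]
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete r
                curr[j - 1] + 1,         # insert h
                prev[j - 1] + (r != h),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edits over characters of the reference."""
    return edit_distance(ref, hyp) / max(len(ref), 1)


def wer(ref_words: Sequence[str], hyp_words: Sequence[str]) -> float:
    """Word error rate over pre-segmented tokens."""
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)
```

For example, `cer("合計1280円", "合計1230円")` is 1/7, since one of the seven reference characters is substituted.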
This work delivers a complete pipeline, from dataset curation and statistical analysis to model adaptation and evaluation, and establishes a robust benchmark and practical methodologies for Japanese receipt understanding.

Related Results

Hubungan Perilaku Pola Makan dengan Kejadian Anak Obesitas (The Relationship between Eating Pattern Behavior and the Incidence of Childhood Obesity)
Zero to hero
Western images of Japan tell a seemingly incongruous story of love, sex and marriage – one full of contradictions and conflicting moral codes. We sometimes hear intriguing stories ...
Depth-aware salient object segmentation
Object segmentation is an important task which is widely employed in many computer vision applications such as object detection, tracking, recognition, and ret...
Utilizing Large Language Models for Geoscience Literature Information Extraction
Extracting information from unstructured and semi-structured geoscience literature is a crucial step in conducting geological research. The traditional machine learning extraction ...
Warehouse Receipt Guarantee Fund As Protection For Holders Or Recipients Of Warehouse Receipt Guarantee Rights
Introduction: The agricultural sector is the backbone of the Indonesian economy. However, farmers still face various challenges, including limited access to finance, fluctuating co...
Exploring Historical Labor Markets: Computational Approaches to Job Title Extraction
Historical job advertisements provide invaluable insights into the evolution of labor markets and societal dynamics. However, extracting structured information, such as job titles, ...
Everyday Life in the "Tourist Zone"
This article makes a case for the everyday while on tour and argues that the ability to continue with everyday routines and social relationships, while at the same time moving thro...
