Improved Script Identification Algorithm Using Unicode-Based Regular Expression Matching Strategy

Script identification is the first step in many natural language processing and text mining tasks, yet at present there is no open-source script identification algorithm for text. For this reason, in this study we analyze the Unicode encoding of each script and construct regular expressions in order to design an improved script identification algorithm. Because some scripts share common characters, it is impossible to count and summarize them. As a result, some extracted scripts are incomplete, which affects subsequent text processing tasks; furthermore, whenever a new script identification feature is required, the regular expression for each script must be readjusted. To improve the performance and scalability of script identification, we analyze the encoding range of each script provided on the official Unicode website and identify the shared characters. Using this approach, we can fully consider all 169 Unicode script types. The proposed method is scalable and does not require numbers, punctuation marks, or other symbols to be filtered out during script identification; instead, these items are included in the script identification results, thus preserving the integrity of the provided information. The experimental results show that the proposed algorithm performs almost as well as our previous script identification algorithm while improving on it.
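The general idea of regular-expression matching over Unicode ranges, with shared characters kept in the output, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the function name `identify_scripts`, the three coarse code-point ranges (the paper covers all 169 scripts using the official Unicode data), and the rule that shared characters such as digits, spaces, and punctuation are attached to the preceding script run.

```python
import re

# Coarse, illustrative code-point ranges for three scripts only.
# These are assumptions for this sketch, not the paper's tables.
SCRIPT_RANGES = {
    "Latin": r"A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F",
    "Cyrillic": r"\u0400-\u04FF",
    "Han": r"\u4E00-\u9FFF",
}

# One named group per script; every other character is treated as shared
# ("Common"): digits, whitespace, punctuation, symbols, unlisted scripts.
_TOKEN = re.compile(
    "|".join(f"(?P<{name}>[{rng}]+)" for name, rng in SCRIPT_RANGES.items())
    + "|(?P<Common>[^" + "".join(SCRIPT_RANGES.values()) + "]+)"
)


def identify_scripts(text):
    """Split text into (script, substring) runs without dropping characters.

    Shared characters are appended to the preceding script run (assumed
    rule); leading shared characters are prepended to the first script run.
    """
    segments = []  # list of [script, substring]
    pending = ""   # shared characters seen before any script run
    for match in _TOKEN.finditer(text):
        name, chunk = match.lastgroup, match.group()
        if name == "Common":
            if segments:
                segments[-1][1] += chunk
            else:
                pending += chunk
        elif segments and segments[-1][0] == name:
            # Same script continues after shared characters: merge the runs.
            segments[-1][1] += chunk
        else:
            segments.append([name, pending + chunk])
            pending = ""
    if pending:  # the text contained no script characters at all
        segments.append(["Common", pending])
    return [(script, run) for script, run in segments]


print(identify_scripts("Hello, мир 123 世界!"))
# [('Latin', 'Hello, '), ('Cyrillic', 'мир 123 '), ('Han', '世界!')]
```

Because no character is filtered out, concatenating the returned substrings reproduces the input exactly, which is one way to read the abstract's point about preserving the integrity of the information. Adding a script in this sketch only requires a new entry in `SCRIPT_RANGES`; the existing entries do not need to be readjusted.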

MDPI AG

Mamtimin Qasim, Wushour Silamu

Data

2025

