Improved Script Identification Algorithm Using Unicode-Based Regular Expression Matching Strategy
Script identification is the first step in many natural language processing and text mining tasks, yet at present there is no open-source script identification algorithm for text. In this study, we therefore analyze the Unicode encoding of each script and construct regular expressions in order to design an improved script identification algorithm. Because some scripts share common characters, those characters cannot be counted and summarized for any single script. As a result, some extracted scripts are incomplete, which affects subsequent text processing tasks; furthermore, whenever a new script identification feature is required, the regular expression for each script must be readjusted. To improve the performance and scalability of script identification, we analyze the encoding range of each script published on the official Unicode website and identify the shared characters, which allows us to design an improved script identification algorithm. With this approach, all 169 Unicode script types can be fully considered. The proposed method is scalable and does not require numbers, punctuation marks, or other symbols to be filtered out during script identification; instead, these items are included in the script identification results, which preserves the integrity of the provided information. The experimental results show that the proposed algorithm performs almost as well as our previous script identification algorithm while improving on it.
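The general idea of range-based script matching can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `SCRIPT_RANGES` table, the `Common` label, and the `identify_scripts` function are hypothetical names introduced here, and the three ranges are simplified approximations of the Unicode Script property rather than the full set of 169 scripts. Characters outside the listed ranges (digits, punctuation, spaces) are kept as their own runs instead of being filtered out.

```python
import re

# Hypothetical, simplified code point ranges for three scripts. They only
# approximate the Unicode Script property; a complete table would be
# derived from the script data published by Unicode.
SCRIPT_RANGES = {
    "Latin": r"A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F",
    "Cyrillic": r"\u0400-\u04FF",
    "Han": r"\u4E00-\u9FFF",
}

# One named group per script, each matching a maximal run of its characters.
_alternatives = [
    f"(?P<{name}>[{ranges}]+)" for name, ranges in SCRIPT_RANGES.items()
]

# Everything not covered above (digits, punctuation, whitespace, other
# scripts in this reduced table) is labeled "Common" rather than dropped.
_all_ranges = "".join(SCRIPT_RANGES.values())
_alternatives.append(f"(?P<Common>[^{_all_ranges}]+)")

SCRIPT_RE = re.compile("|".join(_alternatives))


def identify_scripts(text):
    """Split text into (label, run) pairs without discarding any character."""
    return [(m.lastgroup, m.group()) for m in SCRIPT_RE.finditer(text)]


if __name__ == "__main__":
    for label, run in identify_scripts("Hello, мир 123 世界!"):
        print(f"{label}: {run!r}")
```

In this sketch, supporting another script means adding one entry to `SCRIPT_RANGES`; the combined pattern is rebuilt from the table, so the existing entries do not need to be edited. Concatenating the returned runs reproduces the input text exactly.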