Sage Journals: Discover world-class research

Abstract

Recognizing handwritten Chinese documents can improve efficiency and productivity, which makes it a crucial task for power grid enterprises. This paper proposes a novel handwritten document recognition method to enhance recognition accuracy. First, spatial features are extracted from the input images using an inception module, which captures multi-scale spatial characteristics. Subsequently, a space channel parallel attention module is employed to emphasize significant features and suppress interference. The spatial features are then transformed by a bidirectional long short-term memory network, which predicts the probabilities of outputting Chinese characters. Finally, a transcription layer computes the prediction loss for each character, and the final prediction results are obtained after removing redundant placeholders. Validation experiments demonstrate that the accurate rate and correct rate of the proposed method reach 96.92% and 97.66%, respectively, indicating its effectiveness in capturing handwritten character features and improving accuracy, even under the challenge of diverse handwriting styles.

Keywords

Handwritten Chinese recognition attention BiLSTM connectionist temporal classification power grid documents

Introduction

Power system operations and maintenance rely on handwritten documentation through inspection sheets, operation tickets, and defect records for critical information storage. These paper documents serve as repositories of essential data on equipment status, fault conditions, and remedial procedures—factors that directly affect power grid stability and safety. Advanced handwritten document recognition technology enables efficient digitization of historical records with automated data extraction and comprehensive analysis capabilities. This technological advancement enhances operational efficiency, minimizes manual data entry expenses, and provides vital support for equipment assessment, fault diagnosis, and predictive maintenance protocols. As a consequence, it accelerates the digital transformation of power enterprises.

Nevertheless, there still exist several significant challenges due to the inherent characteristics of Chinese handwriting.

High variety: the Chinese language has a vast number of characters, with over 6763 commonly used characters according to the national standard.

Complex and similar structures: Chinese characters have complex structures that can vary significantly in their stroke formations, such as top-bottom, left-right, and enclosed structures. In addition, many characters are quite similar in structure but the meaning differs greatly, as shown in Figure 1.

Various handwriting styles: handwritten Chinese characters exhibit a wide range of styles and deformations among different writers, including variations in size, stroke thickness, and the use of cursive script, making the same character appear significantly different when written by different people.

In addition, there are two more specific challenges in the handwritten documentation of power grid companies. On the one hand, the writing conditions are often challenging. For instance, during equipment inspections in production workshops, the recording staff may lack access to a desk to record their observations and are forced to write while standing. Consequently, the handwriting may be distorted compared to standard shapes. On the other hand, power companies often utilize specialized vocabulary that is not commonly encountered. Therefore, developing a specialized recognition model for power companies is essential to overcome these technical challenges.

Figure 1.

Character pairs with similar structure.

To address these challenges, numerous handwritten Chinese recognition methods have been proposed, which can be primarily categorized into traditional methods and deep learning approaches.

The traditional methods involve three main steps: image preprocessing, feature extraction, and classification. Image preprocessing techniques first enhance the image quality. After that, feature extraction captures distinctive characteristics of the characters, with structural and statistical descriptors. Finally, classifiers leverage the extracted features to recognize the characters. The three steps are separated and operated independently. Deep learning methods, different from traditional ones, recognize characters using end-to-end convolutional neural networks (CNNs). The CNN architecture, with its hierarchical layers of feature extraction and classification, can automatically capture the spatial relationships and local dependencies within the input image. As a result, they can learn the distinctive features and patterns that define each character. Compared to traditional methods, deep learning methods have greatly improved recognition performance.

Among the deep learning methods, approaches based on convolutional recurrent neural network (CRNN)¹ have made significant progress in character recognition, which effectively addresses the challenges of variable text length, complex layouts, and diverse writing styles. The key steps of CRNN are as follows: (a) Spatial feature extraction. First of all, a CRNN model uses convolutional layers to extract spatial features from the input image. The visual patterns and structures present in the document, such as the shape and stroke of the text are obtained. (b) Sequence modeling. After that, the spatial features are fed to the recurrent layers in the CRNN model, which allows the sequential and contextual information inherent in the text to be obtained. This sequential modeling generates a sequence of character predictions at each time step, which is crucial for handling variable-length text and recognizing characters in the correct order. (c) Connectionist temporal classification (CTC). The CTC module aligns the input image sequence with the output text sequence, even though their lengths are different. This is achieved by considering all possible alignments and finding the one that maximizes the probability. In addition, it also introduces a “blank” label to handle variable length inputs and outputs. In fact, CRNN approaches resemble the human cognitive process of reading text. Both involve step-by-step character recognition to build overall understanding, utilizing contextual information for alignment, and handling variable length inputs.

Despite CRNN’s notable achievements in printed character recognition, challenging writing conditions and variations in power grid documentation styles continue to present significant obstacles. Additional research remains necessary in the domain of handwritten power grid document recognition. This paper presents a novel method to address the challenges in handwritten Chinese character recognition for power grid applications. The key contributions are as follows: (a) An inception-based feature extraction approach is proposed, which fuses multi-scale spatial features extracted from different receptive fields. (b) A novel spatial channel parallel attention (SCPA) mechanism is introduced to enhance relevant features while simultaneously suppressing interfering features. (c) Extensive experiments have been conducted on two public datasets and a self-collected dataset from the power grid industry. The results demonstrate that our approach significantly outperforms existing methods in terms of both recognition accuracy and edit distance metrics.

Related work

Traditional handwritten Chinese character recognition methods

Traditional handwritten Chinese character recognition methods can be divided into two steps: feature extraction and classification. Common features include gradient features, vector features, Gabor features, etc. After that, the features are sent to classifiers, such as support vector machine (SVM), K-nearest neighbor (K-NN) algorithm, multi-layer perceptron (MLP), CNN, recurrent neural network (RNN), hidden Markov models (HMMs), and random forest (RF) algorithm,^2,3 to identify and recognize the characters.

Liu et al.⁴ proposed a quadratic discriminant function based on discriminative learning, which combines the quadratic discriminant function with discriminative learning algorithms, using minimal classification error as the evaluation criterion. Long and Jin⁵ proposed an improved quadratic discriminant function classifier with a compact structure, which constructs a vector-based parameter subspace. This classifier employs linear discriminant analysis for dimension reduction, enabling the implementation of handwritten Chinese character recognition on embedded systems with low storage capacity. Zhu et al.⁶ first recognized multiple substructures by density-based clustering. These substructures are then used to further recognize the overall character with a modified quadratic discriminant function (MQDF) classifier. Zhou et al.⁷ proposed discriminative quadratic feature learning (DQFL), which mainly uses the statistical and spatial correlation of features to increase dimensionality; then, discriminative feature extraction (DFE) is employed to reduce it. Wei et al.⁸ used sparse coding to compact the parameters of MQDF and constructed two compact MQDF classifiers using maximum likelihood-based and K-SVD methods. Ma et al.⁹ developed an efficient two-dimensional Gaussian mixture model (GMM) HMM using the Kaldi toolkit. This system segments Chinese characters into basic components in horizontal and vertical directions according to hidden states, which is an improved version of 1D-GMM-HMM.

These traditional schemes have promoted the recognition of handwritten Chinese characters and have achieved promising results on specific datasets for a period of time. However, considering the complexity of actual applications, the accuracy of the models is far from enough. For example, Chhajro et al.¹⁰ reviewed the performance of SVM, K-NN, MLP, CNN, RNN, and RF, and found that CNN models outperform RF, SVM, and MLP to produce reliable results in terms of optimal accuracy.

Deep learning-based handwritten Chinese character recognition methods

Compared to traditional methods, the deep learning framework has much more powerful learning capabilities, which better captures the features of the handwritten characters and thus improves the recognition accuracy.¹¹ Deep learning methods can be further divided into individual character recognition and text line recognition.

Individual character recognition focuses solely on single characters. Cireşan and Meier¹² were the first to use a multiple column deep neural network for recognizing handwritten Chinese characters. Wu et al.¹³ subsequently proposed a recognition method based on a relaxation CNN. Zhang et al.¹⁴ combined deep learning with image decomposition to enhance the model’s recognition capabilities across different handwriting styles. Similarly, Zhong et al.¹⁵ integrated the GoogLeNet with Gabor features to enhance the model’s recognition capabilities. Yang et al.¹⁶ introduced the attention mechanism into handwritten Chinese character recognition, improving recognition accuracy by iteratively updating the text area of interest. Xu et al.¹⁷ proposed a multiple comparison attention model, expanding a single attention map into multiple ones, and calculating attention for different local areas of the text separately. Li et al.¹⁸ proposed a model for recognizing occluded single characters, using a CNN and a generative adversarial network (GAN) to repair the occluded parts, followed by character recognition using an improved GoogLeNet. Gondere et al.¹⁹ proposed a CNN for a classification task for Amharic handwritten character recognition. By collecting handwritten documents, applying data augmentation to build the dataset, and using multiple-task learning to enhance model performance, promising results were achieved.

Text line recognition involves the relationships between different Chinese characters, making it more complex than individual character recognition. Text line recognition can be further divided into two main categories: segmentation-based and segmentation-free methods. The segmentation-based methods use image segmentation frameworks to divide a whole line of text into multiple characters and then recognize each character individually. Peng et al.²⁰ used a fully convolutional network for end-to-end line text segmentation, followed by character recognition. Chen et al.²¹ introduced a novel learning method for segmenting and recognizing online handwritten Chinese text using a CNN as the core classifier. Wu et al.²² combined a deep network-based language model with a traditional language model to improve the results of recognizing text lines. Wang et al.²³ used weakly supervised learning to predict each character with only string-level annotations. The segmentation-free method does not perform explicit segmentation and position prediction of characters, which better addresses character adhesion and is more efficient than the segmentation-based methods.

A typical segmentation-free method is CRNN,¹ which uses a CNN and an RNN for feature extraction, followed by decoding through a HMM or CTC.²⁴ In the work of Su et al.,²⁵ each line of images was processed through a sliding window to extract features, and character HMM models were subsequently trained. Finally, the optimal string was determined based on the Viterbi decoding idea, ensuring the maximization of posterior probability. Du et al.²⁶ replaced the Viterbi decoding with a deep neural network to model the posterior probability of HMM states. In the CTC-based method, Messina and Louradour²⁷ proposed a multi-dimensional long short-term memory RNN that can recognize an entire line of handwritten Chinese characters without segmentation. Wu et al.²⁸ further proposed extracting context from multiple directions based on this. Liu et al.²⁹ introduced an SE module and residual structure into the network and used a context bundle search algorithm in CTC decoding, taking into account both semantic and visual information. Xie et al.³⁰ performed data augmentation on text lines and combined residual long short-term memory (LSTM) with CNNs, enhancing performance while accelerating speed.

In the last two years, some new attempts have emerged. Dan and Li³¹ utilized particle swarm optimization to design a CNN for recognizing handwritten Chinese characters, where each particle represents a different network architecture, and the optimal architecture is found by iteratively updating these particles until the best one is identified. Xu et al.³² utilized a CNN-based system to automatically evaluate penmanship in traditional Chinese handwriting. To increase usability, a mobile application was developed based on the CNN model, allowing users to assess their handwriting easily. Dong et al.³³ offered clear interpretability and an efficient learning approach. They developed the MetChar algorithm to optimize the weight distribution among fixed components and the HybridSelection algorithm to choose components and feed them into MetChar. These two algorithms facilitated the learning of the distance metric for handwritten Chinese characters. Ngo et al.³⁴ proposed an RNN transducer model for recognizing offline handwritten text in Japanese and Chinese. The model has three key components: a visual encoder using CNN and bidirectional LSTM (BiLSTM), a linguistic context encoder using embedded layers and LSTM, and a joint decoder that generates the label sequences. Min et al.³⁵ proposed a lightweight model based on shallow GoogLeNet and an error elimination algorithm. This improved shallow GoogLeNet reduces the number of training parameters while maintaining the depth of the Inception structure.

The proposed method

Algorithm framework

We propose a handwritten Chinese character recognition method based on CRNN to address the issue of handwritten document recognition in the field of power grid enterprises. As shown in Figure 2, this method extracts spatial features from the text images with CNN. The features are then used as the input of the BiLSTM for further processing, and the probability of outputting corresponding characters is reported. Finally, the transcription layer calculates the predicted loss of each character and removes any redundant placeholders to obtain the prediction results.

Figure 2.

CRNN LSTM CTC recognition framework. CRNN: convolutional recurrent neural network; LSTM: long short-term memory; CTC: onnectionist temporal classification; BiLSTM: bidirectional LSTM.

More importantly, to address the significant differences in writing styles among different individuals and the varying importance of features across different regions, we utilize the inception module and propose a “spatial channel parallel attention (SCPA)” module to better extract image features and improve recognition accuracy.

Feature extraction with inception and SCPA

Handwritten Chinese characters are characterized by complex structural patterns and diverse writing styles. Fixed receptive field feature extraction methods show inherent limitations in capturing such structural variations. To address these challenges, we propose an integrated feature extraction framework that combines Inception modules with SCPA. The inception module enables multi-scale feature extraction via parallel convolution branches, whereas SCPA enhances feature representation through spatial attention (SA) and channel attention (CA) mechanisms, leading to robust recognition performance.

As shown in Figure 3, the feature extraction process consists of five cascaded feature extraction units ( ${FEU}_{i}, i = 1, 2, 3, 4, 5$ ). Each FEU employs an inception module with parallel branches of varying receptive fields for initial feature extraction. Subsequently, the SCPA mechanism adaptively weights these features, emphasizing significant patterns while attenuating irrelevant ones. Finally, sigmoid activation and max-pooling layers are applied for non-linear mapping. The first FEU processes the raw handwritten image, with subsequent units operating on the features generated by their preceding unit.

Figure 3.

Feature extraction based on Inception and SCPA. SCPA: spatial channel parallel attention; MP: max pooling.

Inception multi-scale feature extraction

Handwritten documents exhibit various writing styles, including significant variations in size, stroke thickness, and ligatures. Therefore, multi-scale feature extraction is essential, and the inception module provides an efficient solution for this requirement.

As shown in Figure 4, the inception module consists of four parallel branches, each performing a distinct convolution or pooling operation on the input feature map $F_{i}$ . The feature dimensions throughout the network are denoted as $W \times H \times C$ , representing width, height, and channel dimensions, respectively. For example, the input feature map $F_{i}$ has dimensions of $W = 28$ , $H = 28$ , and $C = 192$ . The specific operations of these four branches are detailed as follows.

$1 \times 1$ convolution: This branch executes channel-wise feature projection with $1 \times 1$ convolutions to adjust channel relationships at fixed spatial dimensions for enhanced feature representation.

$3 \times 3$ convolution: This branch implements $3 \times 3$ convolutions for local spatial feature extraction with a moderate receptive field suited to intermediate-scale patterns.

$5 \times 5$ convolution: This branch applies $5 \times 5$ convolutions with an expanded receptive field to extract global structural patterns from broader spatial contexts.

Max pooling (MP): This branch performs $3 \times 3$ MP operations to reduce spatial dimensions and preserve dominant features for translation invariance.

The features from these four parallel branches are combined along the channel dimension to produce an output feature map

F_{out}

with comprehensive multi-scale spatial information.

Figure 4.

Inception module structure. MP: max pooling.

The $1 \times 1$ convolution is incorporated into the $3 \times 3$ , $5 \times 5$ , and MP branches for two crucial functions. First, the $1 \times 1$ convolution enables dense feature extraction through stacked convolution kernels within the same receptive field, which results in richer and more diverse features. Second, the $1 \times 1$ convolutions decrease the number of input channels. For example, in the $3 \times 3$ convolution branch, the $1 \times 1$ convolutions reduce the channels from 192 to 96. This design substantially decreases the overall computational complexity of the inception module, particularly for inputs with high-dimensional feature maps.

In a single inception module, the receptive fields are determined by the kernel sizes of $1$ , $3$ , and $5$ . The cascaded inception modules create a hierarchical structure of receptive fields. Specifically, cascaded $5 \times 5$ kernels extend the receptive field to $21 \times 21$ , achieving the maximum coverage. In contrast, repeated $1 \times 1$ kernels maintain the minimum receptive field of $1 \times 1$ . This comprehensive range of receptive fields accommodates the substantial variations in writing styles and stroke patterns across different individuals.

The inception module introduces additional computational complexity through its multi-branch architecture. Nevertheless, the computational overhead remains acceptable for character recognition tasks, which primarily operate in offline mode.

Spatial channel parallel attention

In addition to the inception module which provides various receptive fields, we also incorporate an attention mechanism to adjust the feature maps. This mechanism learns a set of weights to determine the importance of each location in the feature map. Mathematically, let $X \in R^{H \times W \times C}$ be the input feature, where $H$ , $W$ , and $C$ denote the height, width, and channels, respectively. The attention weight matrix $A$ is computed as

A = softmax (f (X))

(1)

where

f

represents a CNN that maps spatial and channel-wise features to importance scores. The attended feature map is then computed as

X^{attn} = A \otimes X

(2)

where

\otimes

denotes element-wise multiplication. Through the attention weights

A

, the mechanism enables selective feature emphasis by highlighting salient regions and suppressing irrelevant information.

As shown in Figure 5, the proposed SCPA consists of three parallel branches: CA, SA, and the original feature branch. Given an input feature $F_{in}$ , the output feature of SCPA is formulated as

F_{out} = F_{in} \oplus F_{CA} \oplus F_{SA}

(3)

where

\oplus

denotes element-wise addition.

F_{CA}

and

F_{SA}

represent the features weighted by CA and SA, respectively, which are computed as follows.

Figure 5.

Spatial channel parallel attention.

Figure 6 presents the pipeline of CA. The input feature $F_{in}$ has dimensions $W \times H \times C$ , where $W$ and $H$ denote the width and height of the feature, and $C$ denotes the number of channels.

The feature first undergoes max-pooling and average-pooling operations in the spatial dimension, resulting in features $F_{MP}$ and $F_{AP}$ , both with dimensions $1 \times 1 \times C$ .

$F_{MP}$ and $F_{AP}$ are processed through an MLP. The MLP consists of two fully connected layers with a ReLU activation in between. Both features share the same MLP architecture and weights, producing output features $F_{M-MLP}$ and $F_{A-MLP}$ .

The MLP outputs, $F_{M-MLP}$ and $F_{A-MLP}$ , are summed and passed through a sigmoid activation function to generate the CA weights $W_{CA}$ . These weights match the channel dimensionality of the input feature map, with each value indicating the importance of its corresponding channel.

The input feature $F_{in}$ is modulated through channel-wise multiplication with $W_{CA}$ to produce the recalibrated feature map:

F_{CA} = F_{in} \otimes W_{CA}

(4)

The SA mechanism operates similarly, as shown in Figure 7.

The input feature $F_{in}$ first undergoes max-pooling and average-pooling along the channel dimension. The resulting features are concatenated to form $F_{S-pooling}$ with dimensions $W \times H \times 2$ .

$F_{S-pooling}$ is processed through a convolution layer to generate $F_{conv}$ with dimensions $W \times H \times 1$ .

$F_{conv}$ is passed through a sigmoid activation function to produce the SA weights $W_{SA}$ with dimensions $W \times H \times 1$ .

The input feature $F_{in}$ is modulated through spatial-wise multiplication with $W_{SA}$ to obtain the refined feature map:

F_{SA} = F_{in} \otimes W_{SA}

(5)

Figure 6.

Channel attention. MP: max pooling. AP: average pooling. MLP: multi-layer perceptron.

Figure 7.

Spatial attention. MP: max pooling. AP: average pooling.

Through this design, SCPA enhances feature representation and consequently improves recognition accuracy. On the one hand, SCPA incorporates parallel SA and CA mechanisms, enabling the model to selectively emphasize relevant spatial locations and channel-wise features in the input data. The integration of these complementary attention mechanisms facilitates more effective data representation and discriminative feature extraction. On the other hand, SCPA preserves the original input features in its output pathway. This design ensures the retention of essential characteristics in diverse input data, particularly for variations in writing styles and patterns.

BiLSTM recurrent layer

After extracting spatial features, sequential modeling is essential for modeling temporal dependencies in handwritten text. The BiLSTM recurrent layer addresses this requirement by processing feature sequences bidirectionally. Its ability to maintain both preceding and succeeding contextual information, coupled with its memory mechanism, enables the effective handling of variable-length sequences and connected strokes in handwritten characters.

As shown in Figure 2, the spatial features extracted from convolutional layers are reshaped into sequential representations. These features are then processed through a BiLSTM network to generate character-level probability distributions over the predefined character set. The bidirectional architecture enables comprehensive modeling of long-range contextual dependencies within the sequence.

Figure 8 illustrates the architecture of the BiLSTM network, where $x_{t}$ represents the input sequence and $y_{t}$ denotes the output at time step $t$ . The network maintains two types of hidden states: $h_{\,f}^{t}$ for the forward LSTM unit and $h_{b}^{t}$ for the backward LSTM unit. The forward LSTM processes the sequence from time step $t = 1$ to $T$ , while the backward LSTM operates in reverse from $t = T$ to $1$ . This bidirectional processing enables the network to capture both past and future temporal dependencies, enhancing its capability to model complex sequential patterns in the character sequence.

Figure 8.

BiLSTM structure. LSTM: long short-term memory; BiLSTM: bidirectional LSTM.

The LSTM unit architecture, illustrated in Figure 9, comprises essential components for controlling information flow:

A forget gate that regulates the retention of previous state information.

An input gate that modulates the incorporation of new information into the cell state.

A cell state that serves as a memory mechanism for long-term information propagation.

An output gate that filters the cell state information for hidden state generation.

Figure 9.

Basic long short-term memory (LSTM) unit structure.

The spatial features are processed through the BiLSTM layer as follows:

The two-dimensional spatial features from the CNN are transformed into a one-dimensional sequence, preserving spatial relationships.

The sequence elements are processed bidirectionally through forward and backward LSTM units.

Hidden states from both directional processes are concatenated to form a comprehensive feature representation capturing bidirectional contextual information.

This integrated processing maintains spatial coherence while incorporating temporal dependencies, essential for modeling connected strokes and variable-width characters.

The spatial features require reshaping before serving as input to the BiLSTM network. Initially, these features have dimensions of

X \in R^{B \times C \times H \times W}

, where

B

denotes the batch size,

C

represents the channel dimension, and

H

W

indicate the height and width of the feature maps, respectively. Height-wise max-pooling along the vertical dimension transforms the four-dimensional tensor into a three-dimensional representation of shape

R^{B \times C \times W}

. Subsequently, dimension permutation yields the final shape

R^{B \times W \times C}

. This reorganization preserves sequential information while optimizing computational efficiency and maintaining feature discriminability.

The BiLSTM architecture implements adaptive information filtering through its gating mechanisms, enabling effective modeling of long-range dependencies in character sequences. Its integration of bidirectional processing and memory mechanisms provides robust handling of variable-length patterns inherent in handwritten Chinese characters.

CTC transcription layer

Conventional sequence recognition approaches based on softmax classification mandate rigid one-to-one input–output alignments, presenting significant challenges for handwritten Chinese character recognition with its inherent variations in writing styles and character lengths. The CTC²⁴ transcription layer overcomes these constraints through its blank token mechanism, eliminating the requirement for explicit sequence alignment. This formulation enables flexible sequence-to-sequence mapping, making it particularly suitable for processing variable-length handwritten text representations.

The transcription layer processes the probability distributions generated by the recurrent layer, mapping them to character sequences through maximum likelihood estimation at each position. Contemporary neural networks typically employ softmax activation, which enforces strict one-to-one alignment between input features and output probabilities. Such alignment constraints require fixed-length representations and precise spatial correspondence between input images and output sequences. These requirements prove particularly challenging given the inherent variability in handwriting styles. To address this limitation, we implement the CTC framework, which utilizes blank tokens and sequential modeling to handle variable-length sequences and misaligned predictions effectively.

In such a manner, the CTC framework addresses sequence recognition challenges through implicit alignment learning, eliminating the requirement for explicit input–output correspondences. Its blank token mechanism efficiently processes variable-length sequences, providing robust performance for handwritten Chinese character recognition where conventional Softmax-based approaches struggle with stylistic variations.

Loss function

The primary objective of text recognition is to minimize the divergence between predicted sequences and ground truth labels, naturally suggesting the application of character-level cross-entropy loss.

However, conventional character-level cross-entropy loss necessitates exact alignment between predictions and ground truth labels, which proves impractical for handwritten character recognition given the inherent variations in writing styles and character lengths. The CTC loss framework addresses this limitation by incorporating blank tokens, enabling end-to-end sequence learning without requiring explicit temporal alignment between input features and target labels.

To be specific, the input image undergoes convolutional processing to extract spatial features. These spatial features $x$ are subsequently processed by the BiLSTM network. The conditional probability of the BiLSTM network generating output sequence $y$ is given by:

p (y | x) = \sum_{v_{y} \in V_{y}} p (v_{y} | x)

(6)

where

v_{y}

denotes a valid path through the BiLSTM network yielding output

y

, and

V_{y}

represents the set of all possible paths producing output

y

. The path probability

p (v_{y} | x)

decomposes into:

p (v_{y} | x) = \prod_{t = 1}^{T} z_{v_{t}}^{t}

(7)

where

z_{v_{t}}^{t}

represents the activation state at time step

t

along path

v_{y}

, and

T

denotes the sequence length.

Utilizing the path formulation described above, the CTC loss function is formulated as the negative log-likelihood over the entire training set, eliminating the need for explicit character alignments:

L_{CTC} = - \sum_{(x, y) \in D} \ln p (y | x)

(8)

where

D

denotes the training dataset, and

p (y | x)

represents the conditional probability of the BiLSTM network generating target sequence

y

given input features

x

Experiments

Implementation details

Extensive experiments were conducted using the CASIA-HWDB dataset³⁶ and ICDAR-2013 dataset³⁷. The CASIA-HWDB dataset contains over 3,000,000 isolated character samples, while ICDAR-2013 comprises 91,000 samples. The PowerHW dataset, collected from power grid enterprise maintenance records, represents our proprietary dataset. This dataset contains more than 40,000 characters across 1572 distinct categories, contributed by 35 different annotators. All document images underwent binarization to standardize imaging quality. Figure 10 illustrates the character “electricity” from the PowerHW dataset, demonstrating significant variations in writing styles across different annotators. The experiments utilize a $90 : 10$ split ratio for training and testing across all datasets.

Figure 10.

Examples of the Chinese character of “electricity” in PowerHW.

All experiments were conducted on a server equipped with an Intel Core I9-10900F CPU, 64 GB RAM, and an NVIDIA GeForce RTX 3090 GPU. The model was implemented in PyTorch and optimized using the Adam algorithm. The training process consisted of 500 iterations with a batch size of 32, using an initial learning rate of $1 \times 10^{- 4}$ that decreased to $1 \times 10^{- 5}$ every 50 iterations.

Levenshtein edit distance is introduced to evaluate the performance of the method, where accuracy rate (AR) and correct rate (CR) are used.

{\begin{matrix} AR = \frac{N_{t} - D_{e} - I_{e} - S_{e}}{N_{t}} \\ CR = \frac{N_{t} - D_{e} - S_{e}}{N_{t}} \end{matrix}

(9)

The meaning of the notations are as follows:

$N_{t}$ : length of the labels.

$S_{e}$ : number of substitution error.

$D_{e}$ : number of deletion error.

$I_{e}$ : number of insertion error.

Results of the proposed method

The recognition results on real handwritten paperwork are first presented in Figure 11. It can be noticed that, Figure 11 (a) shows the original handwritten characters with a large number of connected and simplified strokes, which brings difficulties and challenges to the recognition. The recognition results of the proposed method are shown in Figure 11(b), which demonstrates that the vast majority of the results are correct. To be specific, the incorrect recognition results are shown in Figure 11(c), where errors occur due to personal writing habits, particularly ligatures that make the character more similar to another one in structure.

Figure 11.

Recognition results of PowerHW: (a) handwritten image, (b) results, and (c) error analysis.

We also evaluate the results using quantitative metrics. Table 1 compares our method with mainstream approaches in terms of AR and CR. On HWDB, our method achieves 96.92% AR and 97.66% CR. On ICDAR2013, the AR and CR values of the proposed method reach 92.67% and 93.18%, respectively. On PowerHW dataset, a specialized dataset for handwritten documents in power companies, we obtain 96.49% AR and 96.85% CR.

Table 1.

Accuracy comparison of mainstream methods.

Dataset	HWDB		ICDAR2013		PowerHW
Method	AR (%)	CR (%)	AR (%)	CR (%)	AR (%)	CR (%)
Wang et al.³⁸	92.04	93.24	88.79	90.67	92.76	93.02
Wu et al.²²	91.87	92.34	83.89	84.78	91.68	92.34
Wang et al.³⁹	92.86	93.22	89.66	90.35	93.08	93.22
Xie et al.³⁰	94.90	95.37	91.55	92.13	93.88	94.16
Wu et al.²⁸	94.29	94.76	86.64	87.43	94.53	94.87
Peng et al.²⁰	94.77	95.08	90.52	89.61	94.96	95.48
Fang et al.⁴⁰	95.35	95.94	91.38	92.15	95.33	95.74
Du et al.⁴¹	94.73	95.15	92.38	92.74	95.89	96.52
Proposed	96.92	97.66	92.67	93.18	96.49	96.85
Accuracy gain	1.57	1.72	0.29	0.44	0.60	0.33

AR: accurate rate; CR: correct rate.

The bold values represent the best performance metrics in each column.

In addition, we evaluate the robustness of the proposed model. As shown in Figure 12, we add Gaussian noise with zero mean ( $μ = 0$ ) and standard deviation $σ$ ranging from 0 to 1 at intervals of 0.1 to the test images of PowerHW. The accuracy decreases as $σ$ increases. Specifically, the performance degradation is mild when $σ \leq 0.4$ . However, when $σ > 0.4$ , the accuracy declines sharply. Notably, at $σ = 0.4$ , the model maintains high accuracy rates: approximately 95% on both HWDB and PowerHW datasets, and 90% on ICDAR2013 dataset. These results demonstrate the strong robustness of the proposed method.

Figure 12.

Curves of accuracy versus noise. Three subfigures, arranged from left to right, illustrate the data obtained from the CASIA-HWDB, ICDAR 2013, and PowerHW datasets, respectively. AR: accurate rate; CR: correct rate.

Analysis of the Results

The superior performance can be attributed to two key architectural components. On the one hand, the inception-based feature extraction module effectively handles writing style diversity by fusing multi-scale spatial features from different receptive fields. This multi-scale approach enables better adaptation to various handwriting characteristics through multiple kernel sizes, providing comprehensive feature representation at different scales.

On the other hand, the SCPA mechanism significantly enhances feature quality by simultaneously utilizing SA and CA. This dual attention mechanism effectively enhances relevant features while suppressing interfering ones, maintaining the essential characteristics of input features while providing more accurate feature representations for subsequent BiLSTM and CTC modules.

Comparison with state-of-the-art methods

The proposed method is compared with current mainstream methods, including Wang et al.,³⁸ Wu et al.,²² Wang et al.,³⁹ Xie et al.,³⁰ Wu et al.,²⁸ Peng et al.,²⁰ Fang et al.,⁴⁰, and Du et al.⁴¹ The experimental results are presented in Table 1.

Our method demonstrates superior performance across all datasets. On HWDB, it achieves an AR of 96.92% and a CR of 97.66%, surpassing all compared methods. The improvements over the second-best approach are 1.57% and 1.72% for AR and CR, respectively. On ICDAR2013, the method yields an AR of 92.67% and a CR of 93.18%, outperforming existing approaches with margins of 0.29% and 0.44% for AR and CR, respectively. On PowerHW, a specialized dataset collected from power enterprises, our method attains an AR of 96.49% and a CR of 96.85%. These results exceed the second-best performance by 0.60% and 0.33% for AR and CR, respectively. This quality gain proves the advantage of the proposed method.

Ablation study

The proposed method utilizes inception for feature extraction and SCPA for feature fusion. We evaluate the contribution of each module through ablation experiments on the PowerHW dataset.

To assess the effectiveness of the inception module, we compare its performance with single-size convolutions. Table 2 presents this comparison, excluding $1 \times 1$ convolution kernels as they only adjust channel dimensions without extracting local spatial features. The results show that the inception module achieves higher accuracy for both AR and CR metrics. These performance gains demonstrate that multi-size kernels in the inception module can better extract features across different writing styles.

Table 2.

Ablation study on feature extraction.

	AR (%)	CR (%)
$3 \times 3$ Conv only	94.86	95.07
$5 \times 5$ Conv only	95.33	95.72
Inception	96.49	96.85

AR: accurate rate; CR: correct rate.

The proposed method utilizes inception for feature extraction and SCPA for feature fusion, and the contribution of each module is verified through a series of ablation experiments. The experiment is conducted on the PowerHW dataset.

Table 3 presents the ablation study of different attention mechanisms: (a) no attention, (b) SA only, (c) CA only, and (d) the proposed SCPA. The results show that the baseline model without attention yields the lowest accuracy, while incorporating attention mechanisms consistently improves performance. Notably, the proposed SCPA, combining both SA and CA, achieves superior performance compared to models using either attention mechanism alone.

Table 3.

Ablation study on attention mechanisms.

Profile	AR (%)	CR (%)
Without attention	93.55	93.94
Spatial attention only	94.92	95.43
Channel attention only	94.76	95.73
SCPA (proposed)	96.49	96.85

AR: accurate rate; CR: correct rate; SCPA: space channel parallel attention.

Discussion

Contribution

This study introduces an innovative approach that addresses the complexity of diverse writing styles and enhances the recognition accuracy of handwritten Chinese characters in power enterprise documents. The proposed method integrates the inception module and a specialized SCPA mechanism to effectively extract multi-scale features from handwritten images. The BiLSTM layers perform sequence prediction for character recognition by capturing both past and future context in the sequence data, enabling high-precision character transformation and prediction. In the final stage, the CTC layer computes the prediction loss. Experiments on the CASIA-HWDB dataset, the ICDAR2013 dataset, and our custom PowerHW dataset have yielded impressive results, with the AR reaching 96.92% and the CR reaching 97.66%. These superior performance metrics validate the effectiveness of our proposed method.

Weakness

The CRNN approach first utilizes CNN to extract spatial features from the image and then employs RNN/LSTM to construct a temporal model from the spatial features. This approach performs effectively on text with standardized layout and formatting.

However, in handwritten documents, various non-standard patterns frequently occur, such as skewed writing, modifications, and misalignment, as shown in Figure 13. Our future research will focus on addressing the challenges of handwritten Chinese character recognition in complex document layouts with irregular formatting and structure.

Figure 13.

Weakness and failure cases of the proposed method: (a) skewed writing, (b) modifications, and (c) misalignment.

The CRNN approach first utilizes CNN to extract spatial features from the image and then employs RNN/LSTM to construct a temporal model from the spatial features. This approach is relatively well-suited for recognizing text with a standardized layout and formatting.

Conclusion

This paper proposes a character recognition method for handwritten documents in power systems, addressing the challenges of diverse writing styles and improving recognition accuracy. The method incorporates the Inception module and SCPA in the convolutional layer to effectively extract features from handwritten images. The recurrent layer employs BiLSTM to transform and predict character probability outputs, while the transcription layer calculates prediction loss for final results. Experiments on the CASIA-HWDB dataset, ICDAR2013 dataset, and our custom PowerHW dataset demonstrate that the AR and CR accuracies reach 96.92% and 97.66%, respectively. Ablation studies confirm the effectiveness of both the inception-based multi-scale feature extraction and the SCPA mechanism, while comprehensive experiments validate the method’s robustness.

Nevertheless, the proposed method shows limitations when processing poorly formatted characters, including skewed writing, modifications, and misalignment. Future work will focus on improving handwritten Chinese character recognition in unstructured document layouts.

Footnotes

Acknowledgment

Not applicable.

Authors’ contribution

Dajun Xiao and Xialing Xu: study conception and design; Lianfei Shan and Tao Liu: data collection; Dajun Xiao, Tao Liu and Xin Li: analysis and interpretation of results; Xialing Xu and Yue Zhang: draft manuscript preparation. Xialing Xu, Xin Li and Yue Zhang reviewed the results and approved the final version of the manuscript.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Science and Technology Project of State Grid Corporation of China (Central China Branch) (No. 52140023000P).

ORCID iD

Yue Zhang

Appendix 1

References

Shi

Bai

Yao

. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans Pattern Anal Mach Intell 2016; 39: 2298–2304.

Ahlawat

Choudhary

. Hybrid CNN-SVM classifier for handwritten digit recognition. Procedia Comput Sci 2020; 167: 2554–2560.

Chychkarov

Serhiienko

Syrmamiikh

, et al. Handwritten digits recognition using SVM, KNN, RF and deep learning neural networks. CMIS 2021; 2864: 496–509.

Liu

Sako

Fujisawa

. Discriminative learning quadratic discriminant function for handwriting recognition. IEEE Trans Neural Netw 2004; 15(2): 430–444.

Long

Jin

. Building compact MQDF classifier for large character set recognition by subspace distribution sharing. Pattern Recognit 2008; 41 : 2916–2925.

Zhu

Zhang

. A handwritten Chinese character recognition method combining sub-structure recognition. In: Proceedings of international conference on frontiers in handwriting recognition (ICFHR), 2016, pp.518–523. October 23–26, 2016, Shenzhen, China, IEEE New Jersey, USA

Zhou

Zhang

Yin

, et al. Discriminative quadratic feature learning for handwritten Chinese character recognition. Pattern Recognit 2016; 49: 7–18.

Wei

. Compact MQDF classifiers using sparse coding for handwritten Chinese character recognition. Pattern Recognit 2018; 76: 679–690.

Wang

. An open-source library of 2D-GMM-HMM based on Kaldi toolkit and its application to handwritten Chinese character recognition. In: Proceedings of the international conference on image and graphics. December 26 to 28, 2021. Haikou, China, pp.235–244. Springer, Berlin, Germany.

10.

Chhajro

Khan

, et al. Handwritten Urdu character recognition via images using different machine learning and deep learning techniques. Indian J Sci Technol 2020; 13(17): 1746–1754.

11.

Sánchez-DelaCruz

Loeza-Mejía

. Importance and challenges of handwriting recognition with the implementation of machine learning techniques: A survey. Appl Intell 2024; 54: 6444–6465.

12.

Cireşan

Meier

. Multi-column deep neural networks for offline handwritten Chinese character classification. In: Proceedings of the international joint conference on neural networks, July 12–17, 2015, Killarney, Ireland, pp.1–6. IEEE, USA.

13.

Fan

, et al. Handwritten character recognition by alternately trained relaxation convolutional neural network. In: Proceedings of international conference on frontiers in handwriting recognition, September 1 to 4, 2014. Crete Island, Greece, pp.291–296. IEEE, USA.

14.

Zhang

Bengio

Liu

. Online and offline handwritten Chinese character recognition: A comprehensive study and new benchmark. Pattern Recognit 2017; 61: 348–360.

15.

Zhong

Jin

Xie

. High performance offline handwritten Chinese character recognition using GoogLeNet and directional feature maps. In: Proceedings of the international conference on document analysis and recognition, August 23 to 26, 2015, Nancy, France, pp.846–850. IEEE, USA.

16.

Yang

Zhou

, et al. Improving offline handwritten Chinese character recognition by iterative refinement. In: Proceedings of IAPR international conference on document analysis & recognition, November 9 to 15, 2017, Kyoto, Japan, pp.5–10. IEEE, USA.

17.

Bai

Liu

. Multiple comparative attention network for offline handwritten Chinese character recognition. In: Proceedings of the international conference on document analysis and recognition, September 20 to 25, 2019, Sydney, Australia, pp.595–600. IEEE, USA.

18.

Song

Zhang

. Occluded offline handwritten Chinese character recognition using deep convolutional generative adversarial network and improved googLeNet. Neural Comput Appl 2020; 32: 4805–4819.

19.

Gondere

Schmidt-Thieme

Boltena

, et al. Handwritten Amharic character recognition using a convolutional neural network. Arch Data Sci Ser A 2020; 6 : 9.

20.

Peng

Jin

, et al. A fast and accurate fully convolutional network for end-to-end handwritten Chinese text segmentation and recognition. In: Proceedings of the international conference on document analysis and recognition, September 20 to 25, 2019, Sydney, Australia, pp.25–30. IEEE, USA.

21.

Chen

Zhang

Liu

. Improved learning for online handwritten Chinese text recognition with convolutional prototype network. In: Proceedings of the international conference on document analysis and recognition, August 21 to 26, 2023, San José, California, USA, pp.38–53. Springer, Berlin, Germany.

22.

Yin

Liu

. Improving handwritten Chinese text recognition using neural network language models and convolutional neural network shape models. Pattern Recognit 2017; 65: 251–264.

23.

Wang

Yin

, et al. Weakly supervised learning for over-segmentation based handwritten Chinese text recognition. In: Proceedings of international conference on frontiers in handwriting recognition, September 8 to 10, 2020, Dortmund, Germany, pp.157–162. IEEE, USA.

24.

Graves

Fernández

Gomez

, et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the international conference on machine learning; June 25 to 29, 2006, Pittsburgh, Pennsylvania, USA, pp.369–376. ACM, USA.

25.

Zhang

Guan

, et al. Off-line recognition of realistic Chinese handwriting using segmentation-free strategy. Pattern Recognit 2009; 42 : 167–182.

26.

Wang

Zhai

, et al. Deep neural network based hidden Markov model for offline handwritten Chinese text recognition. In: Proceedings of international conference on pattern recognition, December 4 to 8, 2016, Cancún, Mexico, pp.3428–3433. IEEE, USA.

27.

Messina

Louradour

. Segmentation-free handwritten Chinese text recognition with LSTM-RNN. In: Proceedings of the international conference on document analysis and recognition, August 23 to 26, 2015, Nancy, France, pp.171–175. IEEE, USA.

28.

Yin

Chen

, et al. Handwritten Chinese text recognition using separable multi-dimensional recurrent neural network. In: Proceedings of IAPR international conference on document analysis & recognition, November 9 to 15, 2017, Kyoto, Japan, pp.79–84. IEEE, USA.

29.

Liu

Sun

Kang

, et al. Searching from the prediction of visual and language model for handwritten Chinese text recognition. In: Proceedings of the international conference on document analysis and recognition, September 5 to 10, 2021, Lausanne, Switzerland pp.274–288. Springer Cham, Switzerland.

30.

Xie

Lai

Liao

, et al. High performance offline handwritten Chinese text recognition with a new data preprocessing and augmentation pipeline. In: Proceedings of the IAPR international workshop on document analysis systems, July 26 to 29, 2020, Wuhan, China, pp.45–59. Springer Cham, Switzerland.

31.

Dan

. Particle swarm optimization-based convolutional neural network for handwritten Chinese character recognition. J Adv Comput Intell Intell Inform 2023; 27: 165–172.

32.

Mittal

Ahmed

, et al. Assessing penmanship of Chinese handwriting: A deep learning-based approach. Read Writ 2024; 1: 1–21.

33.

Dong

Varde

Stevanovic

, et al. Interpretable distance metric learning for handwritten Chinese character recognition. In: Proceedings under IEEE International Conference of Big Data intelligence and computing, February 27 to March 2, 2019, Kyoto, Japan, pp.113–118. IEEE, USA.

34.

Ngo

Nguyen

, et al. Recurrent neural network transducer for Japanese and Chinese offline handwritten text recognition. In: Proceedings of the international conference on document analysis and recognition, September 5 to 10, 2021, Lausanne, Switzerland, pp.364–376. Springer Cham, Switzerland.

35.

Min

Zhu

Wang

. Offline handwritten Chinese character recognition based on improved GoogLeNet. In: Proceedings of the international conference on artificial intelligence and pattern recognition, June 26 to 28, 2020, Zhongshan, China. pp.42–46. ACM, USA.

36.

Liu

Yin

Wang

, et al. CASIA Online and Offline Chinese Handwriting Databases. In: Proceedings of the international conference on document analysis and recognition, September 18 to 21, 2011, Beijing, China, pp.37–41. IEEE, USA.

37.

Yin

Wang

Zhang

, et al. ICDAR 2013 Chinese handwriting recognition competition. In: Proceedings of the international conference on document analysis and recognition, August 25 to 28, 2013, Washington, DC, USA, pp.1464–1470. IEEE, USA.

38.

Wang

Chen

, et al. Deep knowledge training and heterogeneous CNN for handwritten Chinese text recognition. In: Proceedings of international conference on frontiers in handwriting recognition, October 9 to 12, 2017, Valencia, Spain, pp.84–89. IEEE, USA.

39.

Wang

Z-R

Wang

W-C

, et al. A comprehensive study of hybrid neural network hidden Markov model for offline handwritten Chinese text recognition. Int J Doc Anal Recognit 2018; 21: 241–251.

40.

Fang

Xie

Wang

, et al. Read like humans: Autonomous, bidirectional and iterative language modeling for scene text recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), June 19 to June 25, 2021, Online, pp.7098–7107. IEEE, USA.

41.

Chen

Jia

, et al. SVTR: Scene Text Recognition with a Single Visual Model. In: Proceedings of international joint conference on Artificial intelligence, July 23 to 29, 2022, Vienna, Austria, pp.884–890. IEEE, USA.

Attention-based handwritten Chinese recognition for power grid maintenance documents

Abstract

Keywords

Introduction

Related work

Traditional handwritten Chinese character recognition methods

Deep learning-based handwritten Chinese character recognition methods

The proposed method

Algorithm framework

Feature extraction with inception and SCPA

Inception multi-scale feature extraction

Spatial channel parallel attention

BiLSTM recurrent layer

CTC transcription layer

Loss function

Experiments

Implementation details

Results of the proposed method

Analysis of the Results

Comparison with state-of-the-art methods

Ablation study

Discussion

Contribution

Weakness

Conclusion

Footnotes

Acknowledgment

Authors’ contribution

Declaration of conflicting interests

Funding

ORCID iD

Appendix 1

References