CASIA Online and Offline Chinese Handwriting Databases

Caution: If your research interest is isolated Chinese character recognition (evaluating preprocessing, feature extraction, and classification algorithms), you can download the standard sample data without submitting an application form.

Introduction

 The online and offline Chinese handwriting databases, CASIA-OLHWDB and CASIA-HWDB, were built by the National Laboratory of Pattern Recognition (NLPR), Institute of Automation of Chinese Academy of Sciences (CASIA). The handwritten samples were produced by 1,020 writers using an Anoto pen on paper, so that both online and offline data were obtained. The samples include both isolated characters and handwritten texts (continuous scripts). We collected data from writers from 2007 to 2010, and completed the segmentation and annotation in 2010. The databases include six datasets of online data and six datasets of offline data; in each case, three are for isolated characters (DB1.0–1.2) and three are for handwritten texts (DB2.0–2.2). In both the online and the offline case, the datasets of isolated characters contain about 3.9 million samples of 7,356 classes (7,185 Chinese characters and 171 symbols), and the datasets of handwritten texts contain about 5,090 pages and 1.35 million character samples. All the data have been segmented and annotated at the character level, and each dataset is partitioned into standard training and test subsets.

A comprehensive description of the databases has been published at ICDAR 2011 (download the paper).

Character Sets

 For our handwriting data collection, we compiled a character set based on the standard sets GB2312-80 and the Modern Chinese Character List of Common Use (Common Set for short). GB2312-80 contains 6,763 Chinese characters: 3,755 in the level-1 set and 3,008 in the level-2 set. The Common Set contains 7,000 Chinese characters. We took the union of the two sets, which contains 7,170 characters, to support recognition of practical documents. We further added 15 Chinese characters that we had encountered in practice. We also collected a set of 171 symbols, including 52 English letters, 10 digits, and some frequently used punctuation marks and mathematical and physical symbols. The total number of character classes is thus 7,356.
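The counts quoted above can be cross-checked with a few lines of arithmetic. This is only a sanity-check sketch: the variable names are illustrative, and the overlap between GB2312-80 and the Common Set is not stated in the text but derived here by inclusion-exclusion from the quoted sizes.

```python
# Counts quoted in the text above.
GB2312_CHINESE = 3755 + 3008   # level-1 + level-2 sets of GB2312-80
COMMON_SET = 7000              # Modern Chinese Character List of Common Use
UNION = 7170                   # union of the two sets, as stated
EXTRA_CHINESE = 15             # additional Chinese characters
SYMBOLS = 171                  # letters, digits, punctuation, other symbols

# Derived (not stated in the text): size of the intersection,
# from |A and B| = |A| + |B| - |A or B|.
overlap = GB2312_CHINESE + COMMON_SET - UNION

chinese_classes = UNION + EXTRA_CHINESE
total_classes = chinese_classes + SYMBOLS

print(overlap, chinese_classes, total_classes)  # 6593 7185 7356
```

The derived totals match the 7,185 Chinese characters and 7,356 classes given in the Introduction.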
 For collecting handwritten texts, we asked each writer to hand-copy five texts. We compiled three sets of texts (referred to as versions V1–V3), mostly downloaded from news web pages, except for five texts of ancient Chinese poems in both V1 and V2. Each set contains 50 texts, each with 150–370 characters. The three sets were used in different stages of handwriting data collection. The texts in each set were further divided into 10 subsets (referred to as templates T1–T10), each containing five texts to be written by one writer.

Data Collection

 We collected handwriting data in three stages using three sets (versions) of templates. Each set has 10 templates to be written by 10 writers. A template has 13–15 pages of isolated characters and five pages of texts. For a template set, the isolated characters are divided into three groups: symbols, frequent Chinese characters, and low-frequency Chinese characters. The symbols are always on the first page, followed by the Chinese characters. The first six templates of a set print the same group of frequent Chinese characters in six different orders by rotating six equal parts, and the last four templates print the low-frequency Chinese characters in four different orders. Rotation ensures that each character is written equally often in different time intervals, which balances writing quality across characters. In addition, each template has five pages of different texts. The three sets (versions) of templates are summarized in Table I. V1 and V3 have the same set of isolated characters. The number of isolated Chinese characters in V1 and V3 is actually 7,184, not 7,185, because the templates of V1 were designed earliest. The templates of V3 inherited the isolated character set of V1 and updated the texts. The frequent Chinese character set of V1 and V3 is the level-1 set of GB2312-80, which has commonly been used as a standard set in Chinese character recognition research.

Version  Template  #pages  #symbols  #Chinese  #texts/chars
V1       T1-T6     20      171       3755      30/7,464
         T7-T10    19      171       3429      20/4,918
V2       T1-T6     20      171       3866      30/7,802
         T7-T10    18      171       3319      20/5,196
V3       T1-T6     20      171       3755      30/9,039
         T7-T10    19      171       3429      20/6,016
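The rotation of "six equal parts" described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used to print the templates: the text does not specify the rotation direction or how parts are sized when the character count is not divisible by the number of templates, so the helper names `split_parts` and `rotated_orders`, the cyclic shift by one part per template, and the near-equal chunking are all hypothetical choices.

```python
def split_parts(items, n_parts):
    """Split items into n_parts contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)
        parts.append(items[start:start + size])
        start += size
    return parts


def rotated_orders(items, n_templates):
    """Return one ordering per template; template k starts at part k (assumed cyclic shift)."""
    parts = split_parts(items, n_templates)
    orders = []
    for k in range(n_templates):
        rotated = parts[k:] + parts[:k]
        orders.append([ch for part in rotated for ch in part])
    return orders


# Stand-in for the frequent-character group: 12 integer "characters", six templates.
chars = list(range(12))
orders = rotated_orders(chars, 6)
for k, order in enumerate(orders, start=1):
    print(f"T{k}:", order)
```

Under this scheme every part occupies every position exactly once across the six templates, which is one way to obtain the balance over time intervals that the text describes.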


Data Partitioning

 The distribution of templates in the (online or offline) datasets DB1.0–1.2 is shown in Table V, and DB2.0–2.2 have the same partitioning. Compared with the isolated character datasets, the handwritten text dataset OLHWDB2.2 is missing a training writer of template V2-T9, and HWDB2.0 is missing a test writer of template V2-T3. In all databases, the ratio of training writers to test writers is 4:1.

Dataset  Partition  Template   #writers per template  Total
DB1.0    Train      V2 T1-T6   55:52:54:57:57:61      336
         Test       V2 T1-T6   14:14:14:14:14:14      84
DB1.1    Train      V1 T1-T6   9:8:8:10:9:9           53
                    V3 T1-T6   31:32:32:30:31:31      187
         Test       V1 T1-T6   2:3:3:1:2:2            13
                    V3 T1-T6   8:7:7:9:8:8            47
DB1.2    Train      V2 T7-T10  60:60:60:60            240
         Test       V2 T7-T10  15:15:15:15            60
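The writer counts in the table can be checked against the stated 4:1 ratio and the 1,020 writers mentioned in the Introduction. The snippet below is only a consistency check on the published numbers; the `ROWS` structure and its field order are illustrative and do not correspond to any file format shipped with the databases.

```python
from collections import defaultdict

# (dataset, partition, templates, writers per template), copied from the table above.
ROWS = [
    ("DB1.0", "train", "V2 T1-T6", [55, 52, 54, 57, 57, 61]),
    ("DB1.0", "test", "V2 T1-T6", [14, 14, 14, 14, 14, 14]),
    ("DB1.1", "train", "V1 T1-T6", [9, 8, 8, 10, 9, 9]),
    ("DB1.1", "train", "V3 T1-T6", [31, 32, 32, 30, 31, 31]),
    ("DB1.1", "test", "V1 T1-T6", [2, 3, 3, 1, 2, 2]),
    ("DB1.1", "test", "V3 T1-T6", [8, 7, 7, 9, 8, 8]),
    ("DB1.2", "train", "V2 T7-T10", [60, 60, 60, 60]),
    ("DB1.2", "test", "V2 T7-T10", [15, 15, 15, 15]),
]

# Sum writers per (dataset, partition).
totals = defaultdict(int)
for dataset, partition, _templates, writers in ROWS:
    totals[(dataset, partition)] += sum(writers)

for dataset in ("DB1.0", "DB1.1", "DB1.2"):
    train = totals[(dataset, "train")]
    test = totals[(dataset, "test")]
    print(dataset, train, test, train / test)

all_writers = sum(totals.values())
print("all writers:", all_writers)  # all writers: 1020
```

Each dataset comes out at exactly four training writers per test writer (336:84, 240:60, 240:60), and the eight rows together account for all 1,020 writers.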


Recommended Uses

1) Handwritten document segmentation
2) Handwritten character recognition
3) Text line recognition
4) Handwritten document retrieval
5) Writer adaptation
6) Writer identification