CASIA Online and Offline Chinese Handwriting Databases

This page provides standard datasets for evaluating isolated handwritten Chinese character recognition, including both feature data generated with existing feature extraction algorithms and the original character sample data. The test datasets of the ICDAR2013 competition (including text data) are also provided here.

Download Feature Data

To enable the evaluation of machine learning and classification algorithms on standard feature data, we provide feature data for the offline handwriting datasets HWDB1.0 and HWDB1.1 and the online handwriting datasets OLHWDB1.0 and OLHWDB1.1. The samples fall into the 3,755 classes of Chinese characters in the GB2312-80 level-1 set. The datasets HWDB1.1 and OLHWDB1.1 (300 writers each) are recommended for preliminary experiments on Chinese character recognition with the standard category set. The datasets HWDB1.0 and OLHWDB1.0 (420 writers each) can be added to HWDB1.1 and OLHWDB1.1 to enlarge the training set.
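As a minimal sketch of how the 3,755-class category set can be enumerated, the function below lists the 2-byte codes of the GB2312-80 level-1 set. The layout used here (rows 16 to 55 with 94 cells each, the last five cells of row 55 unassigned) comes from the GB2312 standard, not from this page, and the names gb2312_level1_codes and class_index are hypothetical.

```python
def gb2312_level1_codes():
    """Return the 2-byte GB2312 codes of the level-1 set, in code order."""
    codes = []
    for row in range(16, 56):            # level-1 characters occupy rows 16..55
        for cell in range(1, 95):        # each row has 94 cells
            if row == 55 and cell > 89:  # last five cells of row 55 are unassigned
                continue
            # In the 2-byte encoding, both row and cell are offset by 0xA0.
            codes.append(bytes([0xA0 + row, 0xA0 + cell]))
    return codes


codes = gb2312_level1_codes()
# Hypothetical mapping from a 2-byte GB code to a contiguous class index.
class_index = {code: i for i, code in enumerate(codes)}
```

With such a mapping, a 2-byte label read from a data file can be turned into a class index in the range 0 to 3,754. Since the HWDB1.0 and OLHWDB1.0 feature data cover 3,740 classes, some indices would have no samples in those datasets.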

The feature extraction methods are specified in the reference below, and the results reported there can be used for fair comparison:

C.-L. Liu, F. Yin, D.-H. Wang, Q.-F. Wang, Online and Offline Handwritten Chinese Character Recognition: Benchmarking on New Databases, Pattern Recognition, 46(1): 155-162, 2013.

For offline characters, the extracted feature is the 8-direction histogram of the normalization-cooperated gradient feature (NCGF), computed in combination with line density projection interpolation, a pseudo 2D normalization method. The resulting feature vector is 512-dimensional.

For online characters, the extracted feature is the 8-direction histogram of the original trajectory direction, computed in combination with pseudo 2D bi-moment normalization. The resulting feature vector is 512-dimensional.

The feature data of each dataset is partitioned into two subsets, one for training and one for testing. The numbers of writers and samples in each subset are shown in the table below.

Dataset      #class   Dimension   #writer (train)   #writer (test)   #sample (train)   #sample (test)
HWDB1.0      3,740    512         336               84               1,246,991         309,684
HWDB1.1      3,755    512         240               60               897,758           223,991
OLHWDB1.0    3,740    512         336               84               1,256,009         314,042
OLHWDB1.1    3,755    512         240               60               898,573           224,559

The format of the feature data files is described in fileFormat-mpf.pdf. In brief, each file begins with a header whose size is given by the first 4-byte integer in the file. The last two integers in the header give the number of samples in the file and the feature dimensionality. The header is followed by the records of all samples; each record consists of a 2-byte label (GB code) and the feature vector, with each dimension stored as one unsigned char (one byte).
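The layout described above can be read with a short parser. The following is a sketch under stated assumptions, not a reference implementation: it assumes that all header integers are 4-byte little-endian values and that the header size counts the size field itself, neither of which is stated here (fileFormat-mpf.pdf is the authoritative description). The function name read_mpf is hypothetical.

```python
import io
import struct


def read_mpf(stream):
    """Parse a feature file from a binary stream.

    Assumptions: integers are 4-byte little-endian, and the header size
    includes the 4-byte size field itself.
    Returns (labels, features): raw 2-byte GB codes and lists of byte values.
    """
    (header_size,) = struct.unpack("<i", stream.read(4))
    if header_size < 12:  # size field plus the two trailing integers
        raise ValueError("header too small")
    rest = stream.read(header_size - 4)
    if len(rest) != header_size - 4:
        raise ValueError("truncated header")
    # The last two integers of the header: sample count and dimensionality.
    num_samples, dim = struct.unpack("<ii", rest[-8:])

    labels, features = [], []
    record_size = 2 + dim
    for _ in range(num_samples):
        record = stream.read(record_size)
        if len(record) != record_size:
            raise ValueError("truncated sample record")
        labels.append(record[:2])          # 2-byte label (GB code), kept raw
        features.append(list(record[2:]))  # one unsigned byte per dimension
    return labels, features
```

A typical call would be `labels, features = read_mpf(open(path, "rb"))`, where path points to one writer's file. The labels are kept as raw bytes; decoding them with `label.decode("gb2312")` assumes the two bytes are stored in GB2312 byte order.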

To allow writer-specific data analysis, the feature data of each writer is stored in a separate file named after the writer index. The training and test data files of each dataset are packed in separate ZIP archives. Please click the links below to download them.


OLHWDB1.0tst (96MB)
OLHWDB1.1trn (274MB)
OLHWDB1.1tst (68MB)
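Because each archive holds one file per writer, the per-writer files can be read directly from a downloaded ZIP without unpacking it. The sketch below uses Python's standard zipfile module; the function name iter_writer_files is hypothetical, and taking the file name without its extension as the writer index is an assumption about how the files are named.

```python
import io
import zipfile
from pathlib import PurePosixPath


def iter_writer_files(archive):
    """Yield (writer_id, raw_bytes) for each file in a ZIP archive.

    `archive` is a path or a binary file-like object. The writer id is
    assumed to be the file name without its extension.
    """
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # skip directory entries
            writer_id = PurePosixPath(info.filename).stem
            yield writer_id, zf.read(info)
```

Each yielded byte string can be wrapped in `io.BytesIO` and passed to a parser for the feature file format, which keeps the samples grouped by writer.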

Download Character Sample Data

We provide the isolated character datasets HWDB1.1 (offline) and OLHWDB1.1 (online) for the study of character preprocessing and feature extraction methods. Each dataset is partitioned into a standard training set (240 writers) and a test set (60 writers). The download links are below. All of them can be decompressed using ALZip or other ZIP software. The format descriptions of offline characters (*.gnt) and online characters (*.pot) can be found on the Offline Database and Online Database pages, respectively.

HWDB1.1trn_gnt (1873MB), or download in two parts (part 1, part 2)
HWDB1.1tst_gnt (471MB)

OLHWDB1.1trn_pot (187MB)
OLHWDB1.1tst_pot (47MB)

The datasets HWDB1.0 and OLHWDB1.0 are provided as single files for training set and test set.

HWDB1.0train_gnt (2741MB)  AlZip RAR, or download in three parts (part 1, part 2, part 3)
HWDB1.0test_gnt (681MB) AlZip RAR
OLHWDB1.0train_pot (258MB) ZIP
OLHWDB1.0test_pot (65MB) ZIP

Download Competition Test Data

Based on the CASIA-HWDB and CASIA-OLHWDB databases, we organized Chinese Handwriting Recognition competitions in 2010, 2011 and 2013. The competition test data is now released for research. It consists of four datasets produced by 60 writers: offline character data, online character data, offline text data, and online text data. The data format specifications can be found on the Offline Database and Online Database pages.

Offline Character Data (448MB)
Online Character Data (45MB)
Offline Text Data (140MB)
Online Text Data (13MB)