CASIA Online and Offline Chinese Handwriting Databases

This page provides standard datasets for evaluating isolated handwritten Chinese character recognition. They include feature data generated with existing feature extraction algorithms and the original character sample data. The test datasets of the ICDAR 2013 competition (including text data) are also provided here.

Since February 2020, all offline character images and text pages (annotated at the text line level) have been downloadable from this page.

The full online data will be available soon.

Download Feature Data

To enable the evaluation of machine learning and classification algorithms on standard feature data, we provide the feature data of the offline handwriting datasets HWDB1.0 and HWDB1.1 and the online handwriting datasets OLHWDB1.0 and OLHWDB1.1. The samples fall into the 3,755 classes of Chinese characters in the GB2312-80 level-1 set. We suggest using HWDB1.1 and OLHWDB1.1 (300 writers) for preliminary experiments on Chinese character recognition with the standard category set. HWDB1.0 and OLHWDB1.0 (420 writers) can be added to HWDB1.1 and OLHWDB1.1 to enlarge the training set.

The feature extraction methods are specified in the reference below, and the results reported there can be used for fair comparison:

C.-L. Liu, F. Yin, D.-H. Wang, Q.-F. Wang, Online and Offline Handwritten Chinese Character Recognition: Benchmarking on New Databases, Pattern Recognition, 46(1): 155-162, 2013.

For offline characters, the extracted feature is the 8-direction histogram of the normalization-cooperated gradient feature (NCGF), combined with the pseudo 2D normalization method of line density projection interpolation. The resulting feature vector is 512-dimensional.

For online characters, the extracted feature is the 8-direction histogram of the original trajectory direction, combined with pseudo 2D bi-moment normalization. The resulting feature vector is 512-dimensional.

The feature data of each dataset is partitioned into two subsets, one for training and one for testing. The numbers of writers and samples in the files are shown in the table below.

Dataset      #class   Dimension   #writer (Train)   #writer (Test)   #sample (Train)   #sample (Test)
HWDB1.0      3,740    512         336               84               1,246,991         309,684
HWDB1.1      3,755    512         240               60               897,758           223,991
OLHWDB1.0    3,740    512         336               84               1,256,009         314,042
OLHWDB1.1    3,755    512         240               60               898,573           224,559


The format of the feature data files is described in fileFormat-mpf.pdf. In brief, each file begins with a header whose size is given by the first 4-byte integer in the file. The last two integers in the header give the number of samples in the file and the feature dimensionality. The header is followed by the records of all samples. Each record consists of a 2-byte label (GB code) and the feature vector, with each dimension stored as one unsigned char byte.
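The layout described above can be read with a few lines of Python. The following is a minimal sketch, not the official reader: it assumes little-endian byte order and that the sample count and dimensionality are 4-byte integers, neither of which is stated on this page, so fileFormat-mpf.pdf remains the authoritative specification. The function names parse_mpf and label_to_char are hypothetical.

```python
import struct


def parse_mpf(buf, label_size=2):
    """Parse the bytes of one feature file into (labels, features).

    Assumptions (not confirmed by this page): integers are little-endian,
    and the sample count and dimensionality are 4-byte integers.
    """
    # First 4-byte integer in the file: size of the header in bytes.
    (header_size,) = struct.unpack_from("<I", buf, 0)
    # Last two integers of the header: sample count, then dimensionality.
    num_samples, dim = struct.unpack_from("<II", buf, header_size - 8)

    record_size = label_size + dim
    expected = header_size + num_samples * record_size
    if len(buf) < expected:
        raise ValueError(f"file truncated: need {expected} bytes, got {len(buf)}")

    labels, features = [], []
    for i in range(num_samples):
        start = header_size + i * record_size
        # 2-byte label (GB code), kept as raw bytes.
        labels.append(bytes(buf[start:start + label_size]))
        # One unsigned byte per feature dimension, converted to ints 0..255.
        features.append(list(buf[start + label_size:start + record_size]))
    return labels, features


def label_to_char(label):
    """Decode a 2-byte GB code (assumes the stored byte order matches GB2312)."""
    return label.decode("gb2312", errors="replace")
```

For a real file, the bytes would be read first, for example with open(path, "rb").read(), and then passed to parse_mpf. Labels are returned as raw bytes so that a wrong byte-order assumption does not silently corrupt them.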

To allow writer-specific data analysis, the feature data of each writer is stored in a separate file named after the writer index. The training or test data files of a dataset are packed in a ZIP archive. Please click the links below for download.

HWDB1.0trn (397MB)
HWDB1.0tst (98MB)
HWDB1.1trn (287MB)
HWDB1.1tst (71MB)

OLHWDB1.0trn (384MB)
OLHWDB1.0tst (96MB)
OLHWDB1.1trn (274MB)
OLHWDB1.1tst (68MB)
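Because each archive holds one feature file per writer, the files can be read directly from the ZIP archive without unpacking it. The sketch below is only an illustration: the ".mpf" extension, the use of the file name stem as the writer index, and the function name iter_writer_files are assumptions, so check them against the contents of the downloaded archive.

```python
import zipfile
from pathlib import PurePosixPath


def iter_writer_files(zip_source, suffix=".mpf"):
    """Yield (writer_id, raw_bytes) for each per-writer file in the archive.

    zip_source may be a path or a binary file-like object.
    The suffix and the stem-as-writer-index convention are assumptions.
    """
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            name = PurePosixPath(info.filename)
            # Skip directory entries and unrelated files.
            if info.is_dir() or name.suffix.lower() != suffix:
                continue
            yield name.stem, archive.read(info)
```

Keeping the writer index alongside the raw bytes makes it straightforward to group samples by writer, for example for writer-adaptation experiments.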

Download Character Sample Data

We provide the isolated character datasets HWDB1.0-1.2 (offline) and OLHWDB1.0-1.2 (online) for the study of isolated character recognition and for pre-training classifiers for text line recognition. Each dataset is partitioned into a standard training set and a test set with disjoint writers. The download links are below. The format descriptions of offline characters (*.gnt) and online characters (*.pot) can be found on the Offline Database and Online Database pages, respectively.

The full datasets of HWDB1.0-1.2 can be downloaded below.

Gnt1.0Train part1 (962MB), part2 (973MB), part3 (983MB)
Gnt1.0Test (722MB)
Gnt1.1Train part1 (897MB), part2 (943MB)
Gnt1.1Test (468MB)
Gnt1.2Train part1 (998MB), part2 (987MB)
Gnt1.2Test (510MB)

The full datasets of OLHWDB1.0-1.2 can be downloaded below.

Pot1.0Train (273MB)
Pot1.0Test (68MB)
Pot1.1Train (189MB)
Pot1.1Test (47MB)
Pot1.2Train (196MB)
Pot1.2Test (50MB)

Download Textline (Page) Data

We provide the offline text line data of HWDB2.0-2.2 (stored in DGRL files, where each page contains multiple lines) and the online text line data of OLHWDB2.0-2.2 (stored in WPTT files) for the study of text line recognition. The text line statistics of each dataset can be found on the Offline Database and Online Database pages. Each dataset is partitioned into sets of pages (text lines) for training and testing.

The full datasets of HWDB2.0-2.2 can be downloaded below.

HWDB2.0Train (709MB)
HWDB2.0Test (175MB)
HWDB2.1Train (540MB)
HWDB2.1Test (136MB)
HWDB2.2Train (515MB)
HWDB2.2Test (128MB)

The full datasets of OLHWDB2.0-2.2 can be downloaded below.

WPTT2.0-Train (62MB)
WPTT2.0-Test (15MB)
WPTT2.1-Train (50MB)
WPTT2.1-Test (12MB)
WPTT2.2-Train (44MB)
WPTT2.2-Test (10MB) 

Download Competition Test Data

Based on the CASIA-HWDB and CASIA-OLHWDB databases, we organized Chinese Handwriting Recognition competitions in 2010, 2011 and 2013. The competition test data is now open for research. There are four datasets, generated by 60 writers: offline character data, online character data, offline text data, and online text data. The data format specifications can be found on the Offline Database and Online Database pages.

Offline Character Data (448MB)
Online Character Data (45MB)
Offline Text Data (140MB) in DGRL format
Online Text Data (12MB)