CASIA Online and Offline Chinese Handwriting Databases

Offline Database

 For offline data collection, the handwritten pages were scanned (at a resolution of 300 DPI) to obtain color images, which were then segmented and labeled using annotation tools. After annotation, the database images have background pixels labeled as 255 and foreground pixels in 255 gray levels (0-254). Binary images can therefore be obtained by simply setting the foreground pixels to 1 and the background pixels to 0.
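 The binarization rule described above can be sketched in a few lines of Python. This is only an illustration: the function name binarize and the sample patch are hypothetical and are not part of the database tools.

```python
def binarize(gray_pixels):
    """Map background (255) to 0 and every foreground gray level (0-254) to 1."""
    return bytes(0 if p == 255 else 1 for p in gray_pixels)

# A hypothetical 2x3 patch, stored row by row (not real CASIA data).
patch = bytes([255, 0, 128, 254, 255, 255])
binary = binarize(patch)
print(list(binary))  # [0, 1, 1, 1, 0, 0]
```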

Offline data examples

(a) Isolated character samples

(b) Handwritten text samples


 There are three datasets of isolated characters in the offline handwriting database. The statistics of these datasets are shown in Table 1. The datasets include 1,020 files, and each file (*.gnt) stores the concatenated gray-scale character images of one writer. The file format of *.gnt is specified in Table 2.

Table 1. Statistics of offline isolated character datasets

Dataset   #writers   #samples (total)   #symbol samples   #Chinese samples/#classes
HWDB1.0   420        1,680,258          71,122            1,609,136/3,866
HWDB1.1   300        1,172,907          51,158            1,121,749/3,755
HWDB1.2   300        1,041,970          50,981            990,989/3,319
Total     1,020      3,895,135          173,261           3,721,874/7,185

 HWDB1.0 includes 3,866 Chinese characters and 171 alphanumeric characters and symbols. Among the 3,866 Chinese characters, 3,740 are in the GB2312-80 level-1 set (which contains 3,755 characters in total).
 HWDB1.1 includes the 3,755 GB2312-80 level-1 Chinese characters and 171 alphanumeric characters and symbols.
 HWDB1.2 includes 3,319 Chinese characters and 171 alphanumeric characters and symbols. The set of Chinese characters in HWDB1.2 (3,319 classes) is disjoint from that of HWDB1.0.
 Together, HWDB1.0 and HWDB1.2 include 7,185 Chinese characters (7,185 = 3,866 + 3,319), which cover all 6,763 Chinese characters in GB2312.

Table 2. Format of offline isolated character data file (*.gnt)

Item            Type             Length               Instance       Comment
Sample size     unsigned int     4B                                  Number of bytes for one sample (byte count to next sample)
Tag code (GB)   char             2B                   "啊"=0xb0a1    Stored as 0xa1b0
Width           unsigned short   2B                                  Number of pixels in a row
Height          unsigned short   2B                                  Number of rows
Bitmap          unsigned char    Width*Height bytes                  Stored row by row
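
 As an illustration of the layout in Table 2, the following Python sketch reads samples from a *.gnt stream. It is not the authors' software. It assumes that all integer fields are little-endian, that the two tag-code bytes appear on disk in GB byte order (so that "啊" is the byte pair 0xb0 0xa1, which reads as 0xa1b0 when interpreted as a little-endian 16-bit integer), and that one-byte symbol codes are padded with a zero byte. The function name read_gnt is hypothetical, and the demo data is synthetic.

```python
import io
import struct

def read_gnt(stream):
    """Yield (label, width, height, bitmap) for each sample in a *.gnt stream."""
    while True:
        header = stream.read(10)  # 4B sample size + 2B tag code + 2B width + 2B height
        if len(header) < 10:
            return  # end of file
        sample_size, tag, width, height = struct.unpack("<I2sHH", header)
        bitmap = stream.read(width * height)  # gray levels, stored row by row
        if len(bitmap) != width * height:
            raise ValueError("truncated sample")
        # Assumption: skip any bytes the sample size says remain before the next sample.
        stream.read(max(0, sample_size - 10 - width * height))
        # Assumption: one-byte symbols are padded with a trailing zero byte.
        label = tag.rstrip(b"\x00").decode("gb2312", errors="replace")
        yield label, width, height, bitmap

# Synthetic one-sample file built in memory (not real CASIA data).
pixels = bytes([255, 0, 255, 0, 255, 0])  # 3 pixels wide, 2 rows
record = (struct.pack("<I", 10 + len(pixels)) + "啊".encode("gb2312")
          + struct.pack("<HH", 3, 2) + pixels)
samples = list(read_gnt(io.BytesIO(record)))
print(samples[0][:3])  # ('啊', 3, 2)
```

 For a real file, the same function could be called on a file opened in binary mode, for example open(path, "rb"), where path is the location of a downloaded *.gnt file.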

 Here are three example files (download the files), one for each dataset. You can view them using this software (download the software), which we developed.
 To obtain the full CASIA-HWDB1.0-1.2 datasets, please click here.


 The offline text databases were produced by the same writers as the isolated character datasets. Each person wrote five pages of given texts. One writer (no. 371) and four pages are missing because of data loss. Each page is stored in a *.dgr file named after the writer index and page number. In addition to the gray-scale image, the data file also includes the ground truth for text line segmentation, character segmentation, and character class labels (in GB codes). The statistics of the datasets and the format of the *.dgr file are shown in Table 3 and Table 4, respectively.

Table 3. Statistics of offline handwritten text datasets

Dataset   #writers   #pages   #lines   #characters/#classes   #out-of-class samples
HWDB2.0   419        2,092    20,495   538,868/1,222          1,106
HWDB2.1   300        1,500    17,292   429,553/2,310          172
HWDB2.2   300        1,499    14,443   380,993/1,331          581
Total     1,019      5,091    52,230   1,349,414/2,703        1,859

  Out-of-class samples are samples that fall outside the 7,356 classes of HWDB1.0-1.2 (7,185 Chinese characters plus 171 alphanumeric characters and symbols).

Table 4. Format of offline text data file (*.dgr)

Item                   Type            Length               Instance / Comment
File Header
Size of Header         int             4B                   Number of bytes: 36+strlen(illustr)
Format code            ASCII (char*)   8B                   "DGR"
Illustration           Text            Arbitrary            "#......\0"
Code type              ASCII (char*)   20B                  "ASCII", "GB", etc.
Code length            Short           2B                   1, 2, 4, etc.
Bits per pixel         Short           2B                   Typically 1 (B/W image), 8 (gray image)
Image Records (concatenated)
Image height           int             4B                   Height (pixels) of document image
Image width            int             4B                   Width (pixels) of document image
Line number            int             4B                   Number of lines in the image
Line Records (concatenated)
Char number            int             4B                   Number of characters in a line
Character Records (concatenated)
Label (code)           Code type       Code length          Each byte is 0xff (-1) for garbage
Top-left coordinates   Short int       2B + 2B              (top, left)
Height (H)             Short int       2B                   Height (pixels) of a character
Width (W)              Short int       2B                   Width (pixels) of a character
Bitmap                 BYTE            H*((W+7)/8) or H*W   Binary or gray image
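
 The layout in Table 4 can be read along the following lines. This Python sketch is not the authors' code: it assumes little-endian signed integers, takes the illustration length as the header size minus the 36 fixed bytes, reads a single image record, and rebuilds the page only for 8-bit gray data by pasting each character bitmap at its (top, left) position on a white canvas. The names read_dgr and render_page are hypothetical, and the demo page is synthetic.

```python
import io
import struct

def read_dgr(stream):
    """Parse the file header and one image record from a *.dgr stream."""
    (header_size,) = struct.unpack("<i", stream.read(4))
    format_code = stream.read(8).rstrip(b"\x00").decode("ascii")
    illustration = stream.read(header_size - 36)  # the other header fields total 36 bytes
    code_type = stream.read(20).rstrip(b"\x00").decode("ascii")
    code_length, bits_per_pixel = struct.unpack("<hh", stream.read(4))
    height, width, line_count = struct.unpack("<iii", stream.read(12))
    lines = []
    for _ in range(line_count):
        (char_count,) = struct.unpack("<i", stream.read(4))
        chars = []
        for _ in range(char_count):
            label = stream.read(code_length)  # all bytes 0xff marks garbage
            top, left, h, w = struct.unpack("<hhhh", stream.read(8))
            row_bytes = (w + 7) // 8 if bits_per_pixel == 1 else w
            chars.append({"label": label, "top": top, "left": left,
                          "height": h, "width": w,
                          "bitmap": stream.read(h * row_bytes)})
        lines.append(chars)
    header = {"format_code": format_code, "illustration": illustration,
              "code_type": code_type, "code_length": code_length,
              "bits_per_pixel": bits_per_pixel, "height": height, "width": width}
    return header, lines

def render_page(header, lines):
    """Rebuild an 8-bit page image (row by row) on a background of 255."""
    width, height = header["width"], header["height"]
    canvas = bytearray([255]) * (width * height)
    for chars in lines:
        for c in chars:
            w = c["width"]
            for r in range(c["height"]):
                start = (c["top"] + r) * width + c["left"]
                row = c["bitmap"][r * w:(r + 1) * w]
                # Keep the darker pixel where character boxes overlap.
                canvas[start:start + w] = bytes(
                    min(a, b) for a, b in zip(canvas[start:start + w], row))
    return bytes(canvas)

# Synthetic page built in memory (not real CASIA data): 4x5 image, one line, one character.
illustr = b"#demo\x00"
page = (struct.pack("<i", 36 + len(illustr)) + b"DGR".ljust(8, b"\x00") + illustr
        + b"GB".ljust(20, b"\x00") + struct.pack("<hh", 2, 8)
        + struct.pack("<iii", 4, 5, 1)   # image height, image width, line number
        + struct.pack("<i", 1)           # char number
        + "啊".encode("gb2312") + struct.pack("<hhhh", 1, 2, 2, 2)
        + bytes([0, 10, 20, 30]))
header, lines = read_dgr(io.BytesIO(page))
canvas = render_page(header, lines)
print(lines[0][0]["label"].decode("gb2312"), len(canvas))
```

 Labels are kept as raw bytes so that garbage labels (every byte 0xff) can be detected before decoding with the code type named in the header.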

 Here are five example files (download the files) corresponding to five pages written by one person. You can view them using this software (download the software), which we developed.
 We also provide reference C++ code showing how to recover the document image from a *.dgr file.
 To obtain the full CASIA-HWDB2.0-2.2 datasets, please click here.