CASIA Online and Offline Chinese Handwriting Databases

Online Database

 For handwriting data collection with the Anoto pen, all template pages were printed on paper with a dot pattern. On the printed template pages, each isolated character was written in the space below the pre-printed character, and each text was written on a separate page with the template text printed in the upper part. During writing, the online data (stroke trajectories, i.e. sequences of (x, y) coordinates) were recorded by the Anoto pen and later transferred to computers.

Online data examples

(a) Isolated character samples

(b) Handwritten text sample

CASIA-OLHWDB1.0-1.2

 There are three datasets of isolated characters in the online database. The statistics of these datasets are shown in Table 1. Together the datasets comprise 1,020 files, and each file (*.pot) stores the stroke trajectories of the concatenated pages written by one person. The *.pot file format is specified in Table 2.

Table 1. Statistics of online isolated character datasets

Dataset #writers #samples (total) #symbol samples #Chinese samples/#classes
OLHWDB1.0 420 1,694,741 71,806 1,622,935/3,866
OLHWDB1.1 300 1,174,364 51,232 1,123,132/3,755
OLHWDB1.2 300 1,042,912 51,181 991,731/3,319
Total 1,020 3,912,017 174,219 3,737,798/7,185


 OLHWDB1.0 includes 3,866 Chinese characters and 171 alphanumeric characters and symbols. Of the 3,866 Chinese characters, 3,740 are in the GB2312-80 level-1 set (which contains 3,755 characters in total).
 OLHWDB1.1 includes the 3,755 GB2312-80 level-1 Chinese characters and 171 alphanumeric characters and symbols.
 OLHWDB1.2 includes 3,319 Chinese characters and 171 alphanumeric characters and symbols. The set of Chinese characters in OLHWDB1.2 (3,319 classes) is disjoint from that of OLHWDB1.0.
 Together, OLHWDB1.0 and OLHWDB1.2 include 7,185 Chinese characters (7,185 = 3,866 + 3,319), which cover all 6,763 Chinese characters in GB2312.
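The level-1 membership mentioned above can be checked per character. The following is a minimal sketch, not part of the database tools: the function name `gb2312_level` is hypothetical, and the row ranges (rows 16 to 55 for level 1, rows 56 to 87 for level 2) come from the GB2312-80 standard layout rather than from this page.

```python
def gb2312_level(ch: str) -> int:
    """Return 1 or 2 for a GB2312 level-1/level-2 Chinese character, else 0.

    Assumption: row ("qu") numbers follow the GB2312-80 layout, where the
    first byte minus 0xA0 is the row; rows 16-55 hold the 3,755 level-1
    characters and rows 56-87 hold the 3,008 level-2 characters.
    """
    try:
        raw = ch.encode("gb2312")
    except UnicodeEncodeError:
        return 0  # not representable in GB2312 at all
    if len(raw) != 2:
        return 0  # single-byte (ASCII) symbol
    row = raw[0] - 0xA0
    if 16 <= row <= 55:
        return 1
    if 56 <= row <= 87:
        return 2
    return 0  # two-byte symbol area (rows 1-9)


if __name__ == "__main__":
    for ch in ["啊", "A"]:
        print(ch, gb2312_level(ch))
```

Applied to the class labels of a dataset, a check like this would separate level-1 classes from the remaining ones.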

Table 2. Format of online isolated character data file (*.pot)

Item Type Length Instance Comment
Sample size unsigned short 2B   Number of bytes for one sample (byte count to next sample)
Tag code (GB) DWORD 4B "啊"=0x0000b0a1, stored as 0xa1b00000 Only two bytes (GB2312 or GBK) are meaningful
Stroke number unsigned short 2B   Number of strokes in a sample
Strokes (concatenated). Each stroke is a point sequence from pen-down to lift
Coordinates (x, y) (concatenated) short 2B+2B   All values less than 32768
Stroke end (-1, 0) signed short 2B+2B    
Character end (-1, -1) signed short 2B+2B
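The layout in Table 2 can be read with a few lines of code. The sketch below is an illustration under stated assumptions, not the viewer software mentioned on this page: the names `read_pot` and `decode_tag` are hypothetical, all fields are assumed to be little-endian, and symbol classes are assumed to be stored as a single non-zero byte. The sample size field is read but not relied on; each sample is parsed up to its (-1, -1) end tag.

```python
import struct
from typing import BinaryIO, Iterator, List, Tuple

Stroke = List[Tuple[int, int]]


def decode_tag(tag: int) -> str:
    """Turn the 4-byte tag value into a character (only the low 2 bytes matter)."""
    code = tag & 0xFFFF
    # Assumption: Chinese characters use two bytes, symbols a single byte.
    raw = code.to_bytes(2, "big") if code > 0xFF else bytes([code])
    return raw.decode("gbk", errors="replace")


def read_pot(stream: BinaryIO) -> Iterator[Tuple[str, List[Stroke]]]:
    """Yield (character, strokes) for every sample in a *.pot stream."""
    while True:
        head = stream.read(8)  # sample size (2B) + tag code (4B) + stroke number (2B)
        if len(head) < 8:
            return  # end of file
        _sample_size, tag, _n_strokes = struct.unpack("<HIH", head)
        strokes: List[Stroke] = []
        current: Stroke = []
        while True:
            x, y = struct.unpack("<hh", stream.read(4))
            if (x, y) == (-1, -1):  # character end
                break
            if (x, y) == (-1, 0):  # stroke end
                strokes.append(current)
                current = []
            else:
                current.append((x, y))
        if current:  # tolerate a stroke without its own end marker
            strokes.append(current)
        yield decode_tag(tag), strokes
```

A file would then be processed with something like `with open(path, "rb") as f: for label, strokes in read_pot(f): ...`, where `path` points to one writer's *.pot file.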


 Three example files are available (download the files), one for each dataset; they can be viewed with software developed by us (download the software).
 If you want to get the full datasets of CASIA-OLHWDB1.0-1.2, please click here.

CASIA-OLHWDB2.0-2.2

 The online handwritten text datasets were produced by the same writers as the isolated character datasets. Each person wrote five pages of given texts. One writer (no. 671) and three pages (two pages of no. 328 and one page of no. 685) are missing because of data loss. Each page is stored in a *.ptts file named after the writer index and page number. In addition to the stroke trajectory data of the page, the data file also includes the ground truth for text line segmentation, character segmentation and character class labels (in GB codes). The statistics of the datasets and the format of the *.ptts file are shown in Table 3 and Table 4, respectively.

Table 3. Statistics of online handwritten text datasets

Dataset #writers #pages #lines #character/#class #out-of-class sample
OLHWDB2.0 420 2,098 20,573 540,009/1,214 1,282
OLHWDB2.1 300 1,500 17,282 429,083/2,256 255
OLHWDB2.2 299 1,494 14,365 379,812/1,303 581
Total 1,019 5,092 52,220 1,348,904/2,655 2,088

Out-of-class samples are samples that do not belong to any of the 7,356 classes in OLHWDB1.0-1.2 (7,185 Chinese characters plus 171 alphanumeric characters and symbols).

Table 4. Format of online text file (*.ptts)

Item Type Length Instance/Comment
File Header
Size of Header int 4B Number of bytes: 54 + strlen(illustration) + 1 (the illustration ends with a '\0')
Format code ASCII (char*) 8B "PTTS"
Illustration Text Arbitrary "#......\0"
Code type ASCII (char*) 20B "GB"
Code length short int 2B 2
Data type ASCII (char*) 20B "short"
Sample length int 4B  
Page index int 4B Corresponding to that in trajectory
Stroke number int 4B  
Strokes (concatenated)
Point number short 2B  
Points (concatenated)
Coordinates (x, y) (concatenated) unsigned short 2B+2B No PEN_DOWN or PEN_UP markers; the values are 10 times the original coordinates
Line number unsigned short 2B  
Lines (concatenated)
Line stroke number unsigned short 2B Number of strokes in a line
Line stroke index (concatenated) unsigned short 2B Index numbers of the strokes
Line char number unsigned short 2B Number of characters in the line
Chars (concatenated)
Tag code Code type Code length If the tag code equals 0xffff, it is an abnormal character.
Char stroke number unsigned short 2B Number of strokes in the character
Char stroke index (concatenated) unsigned short 2B Index numbers of the strokes
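Table 4 can be turned into a reader along the following lines. This is a sketch under assumptions rather than a reference implementation: the names `read_ptts`, `Page`, `Line` and `Char` are hypothetical, all fields are assumed to be little-endian, and the header size is assumed to count its own 4 bytes (which matches 4 + 8 + 20 + 2 + 20 = 54). Tag codes are returned as raw bytes because the table does not state their byte order.

```python
import struct
from dataclasses import dataclass
from typing import BinaryIO, List, Tuple


@dataclass
class Char:
    tag: bytes              # raw tag code; b"\xff\xff" marks an abnormal character
    stroke_ids: List[int]   # indices into Page.strokes


@dataclass
class Line:
    stroke_ids: List[int]
    chars: List[Char]


@dataclass
class Page:
    page_index: int
    strokes: List[List[Tuple[int, int]]]  # coordinates are 10x the original values
    lines: List[Line]


def _unpack(stream: BinaryIO, fmt: str) -> tuple:
    size = struct.calcsize(fmt)
    data = stream.read(size)
    if len(data) != size:
        raise EOFError("truncated .ptts data")
    return struct.unpack(fmt, data)


def read_ptts(stream: BinaryIO) -> Page:
    (header_size,) = _unpack(stream, "<i")
    header = stream.read(header_size - 4)
    if header[:4] != b"PTTS":
        raise ValueError("not a PTTS file")
    # The header ends with: code length (2B) followed by data type (20B).
    (code_length,) = struct.unpack_from("<h", header, len(header) - 22)
    _sample_length, page_index, n_strokes = _unpack(stream, "<iii")

    strokes = []
    for _ in range(n_strokes):
        (n_points,) = _unpack(stream, "<h")
        flat = _unpack(stream, "<%dH" % (2 * n_points))
        strokes.append(list(zip(flat[0::2], flat[1::2])))

    (n_lines,) = _unpack(stream, "<H")
    lines = []
    for _ in range(n_lines):
        (n_line_strokes,) = _unpack(stream, "<H")
        line_ids = list(_unpack(stream, "<%dH" % n_line_strokes))
        (n_chars,) = _unpack(stream, "<H")
        chars = []
        for _ in range(n_chars):
            tag = stream.read(code_length)
            (n_char_strokes,) = _unpack(stream, "<H")
            chars.append(Char(tag, list(_unpack(stream, "<%dH" % n_char_strokes))))
        lines.append(Line(line_ids, chars))
    return Page(page_index, strokes, lines)
```

The character segmentation can then be recovered by looking up each `Char.stroke_ids` entry in `Page.strokes`; dividing the coordinates by 10 restores the original scale.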


 Five example files are available (download the files), corresponding to the five pages written by one person. They can be viewed with software developed by us (download the software).
 If you want to get the full datasets of CASIA-OLHWDB2.0-2.2, please click here.