CASIA Online and Offline Chinese Handwriting Databases

Online Database

 For handwriting data collection with the Anoto pen, all the template pages were printed on paper with a dot pattern. On the printed template pages, each isolated character was written in the space below the pre-printed character, and each text was written on a separate page with the template text printed in the upper part of the page. During writing, the online data (stroke trajectories, i.e. sequences of (x, y) coordinates) were recorded by the Anoto pen and later transferred to computers.

Figure: Online data examples. (a) Isolated character samples; (b) Handwritten text sample.

CASIA-OLHWDB1.0-1.2

 There are three datasets of isolated characters in the online database. The statistics of these datasets are shown in Table 1. The datasets comprise 1,020 files, and each file (*.pot) stores the character samples written by one person. The *.pot file format is specified in Table 2.

Table 1. Statistics of online isolated character datasets

Dataset     #writers   #character samples
                       Total       Symbol    Chinese/#class
OLHWDB1.0   420        1,694,741   71,806    1,622,935/3,866
OLHWDB1.1   300        1,174,364   51,232    1,123,132/3,755
OLHWDB1.2   300        1,042,912   51,181    991,731/3,319
Total       1,020      3,912,017   174,219   3,737,798/7,185
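As a quick consistency check, the per-dataset counts in Table 1 can be summed and compared with the Total row. The sketch below only transcribes the table; the names `TABLE1` and `column_sums` are illustrative and not part of any released tool.

```python
# Counts transcribed from Table 1:
# (writers, total samples, symbol samples, Chinese samples, Chinese classes)
TABLE1 = {
    "OLHWDB1.0": (420, 1_694_741, 71_806, 1_622_935, 3_866),
    "OLHWDB1.1": (300, 1_174_364, 51_232, 1_123_132, 3_755),
    "OLHWDB1.2": (300, 1_042_912, 51_181, 991_731, 3_319),
}

def column_sums(table):
    """Sum writers, total, symbol and Chinese sample counts over all datasets."""
    return tuple(sum(row[i] for row in table.values()) for i in range(4))

# Each dataset's total is its symbol count plus its Chinese count.
for name, (_, total, symbol, chinese, _) in TABLE1.items():
    assert total == symbol + chinese, name

print(column_sums(TABLE1))
```

The class count in the Total row (7,185) is not a column sum: it is the number of distinct Chinese classes, 3,866 + 3,319, contributed by OLHWDB1.0 and OLHWDB1.2.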


 OLHWDB1.0 includes 3,866 Chinese characters and 171 alphanumeric characters and symbols. Among the 3,866 Chinese characters, 3,740 are in the GB2312-80 level-1 set (which contains 3,755 characters in total).
 OLHWDB1.1 includes the 3,755 GB2312-80 level-1 Chinese characters and 171 alphanumeric characters and symbols.
 OLHWDB1.2 includes 3,319 Chinese characters and 171 alphanumeric characters and symbols. The set of Chinese characters in OLHWDB1.2 (3,319 classes) is disjoint from that of OLHWDB1.0.
 OLHWDB1.0 and OLHWDB1.2 together include 7,185 Chinese characters (7,185 = 3,866 + 3,319), which cover all 6,763 Chinese characters in GB2312.
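The GB2312 figures quoted above (3,755 level-1 characters, 6,763 Chinese characters in total) can be reproduced by enumerating the code table. The sketch below assumes the standard GB2312 layout (level-1 characters under lead bytes 0xB0 to 0xD7, level-2 under 0xD8 to 0xF7, trail bytes 0xA1 to 0xFE) and relies on Python's built-in `gb2312` codec rejecting unassigned positions. Neither detail comes from the database documentation, and the helper name `gb2312_hanzi` is hypothetical.

```python
def gb2312_hanzi(first_lead, last_lead):
    """Return the Chinese characters whose GB2312 lead byte lies in the given range."""
    chars = []
    for lead in range(first_lead, last_lead + 1):
        for trail in range(0xA1, 0xFF):
            try:
                chars.append(bytes([lead, trail]).decode("gb2312"))
            except UnicodeDecodeError:
                # Unassigned position in the code table; skip it.
                continue
    return chars

level1 = gb2312_hanzi(0xB0, 0xD7)  # assumed level-1 range
level2 = gb2312_hanzi(0xD8, 0xF7)  # assumed level-2 range
print(len(level1), len(level2), len(level1) + len(level2))
```

A list such as `level1` can then be compared with the class labels found in a dataset, for example to see which level-1 characters a given dataset does not cover.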

Table 2. Format of online isolated character data file (*.pot)

Item                                  Type             Length   Instance                               Comment
Sample size                           unsigned short   2B       -                                      Number of bytes for one sample (byte count to next sample)
Tag code (GB)                         DWORD            4B       "啊"=0x0000b0a1, stored as 0xa1b00000  Only two bytes (GB2312 or GBK) are meaningful
Stroke number                         unsigned short   2B       -                                      Number of strokes in a sample
Strokes (concatenated)                -                -        -                                      Each stroke is a point sequence from pen-down to pen-lift
  Coordinates (x, y) (concatenated)   short            2B+2B    -                                      All values less than 32768
  Stroke end (-1, 0)                  signed short     2B+2B    -                                      -
Character end tag (-1, -1)            signed short     2B+2B    -                                      -
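The layout in Table 2 can be read with a few lines of Python. The following is a minimal sketch rather than the authors' reader. It assumes little-endian byte order (consistent with the "stored as" note for the tag code) and assumes that the sample size counts the whole sample, including the 8 leading bytes. The function names `parse_pot` and `tag_to_char` are hypothetical.

```python
import struct

def parse_pot(data):
    """Return a list of (tag_code, strokes) for the bytes of one *.pot file."""
    samples = []
    offset = 0
    while offset + 8 <= len(data):
        # Sample size (2B), tag code (4B), stroke number (2B).
        sample_size, tag_code, stroke_count = struct.unpack_from("<HIH", data, offset)
        if sample_size < 8:
            raise ValueError("implausible sample size at offset %d" % offset)
        end = offset + sample_size
        pos = offset + 8
        strokes, current = [], []
        while pos + 4 <= end:
            x, y = struct.unpack_from("<hh", data, pos)
            pos += 4
            if (x, y) == (-1, 0):      # stroke end marker
                strokes.append(current)
                current = []
            elif (x, y) == (-1, -1):   # character end marker
                break
            else:
                current.append((x, y))
        if len(strokes) != stroke_count:
            raise ValueError("stroke count mismatch at offset %d" % offset)
        samples.append((tag_code, strokes))
        offset = end                   # byte count to the next sample
    return samples

def tag_to_char(tag_code):
    """Decode the two meaningful bytes of a tag code as a GB character."""
    raw = struct.pack(">H", tag_code & 0xFFFF)
    # Assumption: single-byte symbols leave the high byte zero.
    return raw.lstrip(b"\x00").decode("gbk")
```

A file would typically be loaded with `parse_pot(open(path, "rb").read())`, where `path` points to one writer's *.pot file. Each stroke comes back as a list of (x, y) tuples in pen order.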


 Three example files, one for each dataset, are available (download the files); they can be viewed with the software developed by us (download the software).

The full datasets of CASIA-OLHWDB1.0-1.2 can be downloaded here.

CASIA-OLHWDB2.0-2.2

 The online handwritten text datasets were produced by the same writers as the isolated character datasets. Each person wrote five pages of given texts. One writer (no. 671) and three pages (two pages of no. 328 and one page of no. 685) are missing because of data loss. Each page is stored in a *.wptt file named after the writer index and page number. In addition to the stroke trajectory data of the page, the data file also includes the ground truth of text line segmentation and character class labels (text line transcripts in GB codes). The statistics of the datasets and the format of the *.wptt file are shown in Table 3 and Table 4, respectively.

Table 3. Statistics of online handwritten text datasets

Dataset     #writers   #pages   #lines   #character/#class   #out-of-class samples
OLHWDB2.0   420        2,098    20,573   540,009/1,214       1,282
OLHWDB2.1   300        1,500    17,282   429,083/2,256       255
OLHWDB2.2   299        1,494    14,365   379,812/1,303       581
Total       1,019      5,092    52,220   1,348,904/2,655     2,088

Out-of-class samples are samples that fall outside the 7,356 classes of OLHWDB1.0-1.2 (7,185 Chinese characters plus 171 alphanumeric characters and symbols).

Table 4. Format of online text file (*.wptt)

Item                                    Type             Length                   Instance
File header
  Size of header                        long int         4B                       Number of bytes: 54 + strlen(illustr) + 1 (there is a '\0' at the end of the illustration)
  Format code                           ASCII (char*)    8B                       “WPTT”
  Illustration                          Text             Arbitrary                “#......\0”
  Code type                             ASCII (char*)    20B                      “GB”
  Code length                           short int        2B                       2
  Data type                             ASCII (char*)    20B                      “short”
Sample length                           int              4B                       -
Page index                              int              4B                       Corresponding to that in trajectory
Stroke number                           int              4B                       -
Strokes (concatenated)
  Point number                          short            2B                       -
  Points (concatenated)
    Coordinates (x, y) (concatenated)   unsigned short   2B+2B                    No PEN_DOWN and PEN_UP; all coordinates are multiplied by 10.
Line number                             unsigned short   2B                       -
Lines (concatenated)
  Line stroke number                    unsigned short   2B                       -
  Line stroke index (concatenated)      unsigned short   2B*lineStrkNum           -
  Line char number                      unsigned short   2B                       -
  Chars (concatenated)
    Tag code                            Code type        codelength*lineCharNum   If the tag code equals 0xffff, it is an abnormal character.

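Sample length:
4+4+4+strkNum*[2+strkPtNum*4]+2+lineNum*[2+2*lineStrkNum+2+lineCharNum*codeLength],
where each bracketed term is summed over the individual strokes or lines, since strkPtNum, lineStrkNum and lineCharNum vary from one stroke or line to the next.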
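The sample length formula can be written out as a small function, which is handy for validating a parsed page against its stored sample length field. This is a sketch of the formula only; the function name `wptt_sample_length` and its argument layout are hypothetical.

```python
def wptt_sample_length(stroke_point_counts, lines, code_length=2):
    """Byte length of one page sample according to the sample length formula.

    stroke_point_counts: number of points in each stroke of the page.
    lines: one (line_stroke_count, line_char_count) pair per text line.
    """
    length = 4 + 4 + 4                           # sample length, page index, stroke number
    for point_count in stroke_point_counts:
        length += 2 + point_count * 4            # point number + (x, y) pairs
    length += 2                                  # line number
    for stroke_count, char_count in lines:
        length += 2 + 2 * stroke_count           # line stroke number + stroke indices
        length += 2 + char_count * code_length   # line char number + tag codes
    return length

# A page with two strokes (3 and 2 points) forming one line of a single character.
print(wptt_sample_length([3, 2], [(2, 1)]))
```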

 WPTT handwritten pages can be viewed using this software (download the software) developed by us. Example C++ code for reading data from a *.wptt file is given in WPTTRead.cpp.pdf.
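For readers who prefer Python, the layout in Table 4 can be parsed roughly as follows. This is a sketch under stated assumptions, not a port of WPTTRead.cpp: it assumes little-endian byte order, takes the illustration length from the header size (header size minus the 54 fixed bytes), interprets "multiplied by 10" as stored value = 10 x actual coordinate, and leaves tag codes as raw bytes because their byte order is not stated. The name `parse_wptt` and the returned dictionary keys are hypothetical.

```python
import io
import struct

HEADER_FIXED_BYTES = 54  # 4 + 8 + 20 + 2 + 20, per the header-size note in Table 4

def parse_wptt(data):
    """Parse the bytes of one *.wptt page into a dict (little-endian assumed)."""
    stream = io.BytesIO(data)

    def unpack(fmt):
        fmt = "<" + fmt
        return struct.unpack(fmt, stream.read(struct.calcsize(fmt)))

    def text_field(size):
        # Fixed-width ASCII field; keep the part before the first NUL byte.
        return stream.read(size).split(b"\0")[0].decode("ascii")

    (header_size,) = unpack("i")
    format_code = text_field(8)
    # The illustration has no fixed width; its length follows from the header size.
    illustration = stream.read(header_size - HEADER_FIXED_BYTES).rstrip(b"\0")
    code_type = text_field(20)
    (code_length,) = unpack("h")
    data_type = text_field(20)
    sample_length, page_index, stroke_count = unpack("iii")

    strokes = []
    for _ in range(stroke_count):
        (point_count,) = unpack("h")
        flat = unpack("%dH" % (2 * point_count))
        # Stored coordinates are assumed to be 10x the actual values.
        strokes.append([(x / 10, y / 10) for x, y in zip(flat[0::2], flat[1::2])])

    (line_count,) = unpack("H")
    lines = []
    for _ in range(line_count):
        (line_stroke_count,) = unpack("H")
        stroke_indices = list(unpack("%dH" % line_stroke_count))
        (char_count,) = unpack("H")
        codes = [stream.read(code_length) for _ in range(char_count)]
        lines.append({
            "stroke_indices": stroke_indices,
            "codes": codes,  # raw tag-code bytes, left undecoded
            "abnormal": [code == b"\xff" * code_length for code in codes],
        })

    return {
        "format_code": format_code, "illustration": illustration,
        "code_type": code_type, "code_length": code_length, "data_type": data_type,
        "sample_length": sample_length, "page_index": page_index,
        "strokes": strokes, "lines": lines,
    }
```

A page would typically be loaded with `parse_wptt(open(path, "rb").read())`. The stroke indices of each line refer to positions in the page-level `strokes` list, which is how the line segmentation ground truth is recovered.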
 The full datasets of CASIA-OLHWDB2.0-2.2 can be downloaded here.