CASIA-AHCDB: Chinese Ancient Handwritten Characters Database

 

The Chinese Ancient Handwritten Characters Database (CASIA-AHCDB) is designed for character recognition research. The database contains more than 2.2 million annotated character samples of 10,658 classes. The character samples come from more than 12,000 pages of annotated Chinese ancient handwritten documents. According to the source of the documents, the database is divided into two sub-databases: Complete Library in Four Sections (style1) and Ancient Buddhist Scriptures (style2). Each sub-database is further divided into three parts according to its intended use: a basic category set, an enhanced category set and a reserved category set. The basic category sets of style1 and style2 share the same 2,365 classes, while the enhanced category sets of style1 and style2 have no classes in common. The reserved category sets are not split into training and testing sets because they contain too few samples.

 

Style1 contains 25 books, numbered “book_01” to “book_25”. Among them, (book_01, book_02) were written by one person, as were (book_03, book_04), (book_05, book_06) and (book_07, book_08); the remaining books were each written by a different person. Books 01-20 form the training set and books 21-25 form the testing set.

 

Style2 contains Buddhist scripture documents from 10 different periods. The writer of each volume can no longer be identified. Volumes are numbered by period and volume; for example, volume 001 of period 01 is numbered “period_01/volume_001”. Buddhist scriptures from periods 09-10 form the training set and Buddhist scriptures from periods 01-08 form the testing set.
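The split rules described above can be sketched as a small helper. This is only an illustration: the function name `split_of` is hypothetical, and it assumes that a sample's path contains a `book_NN` or `period_NN` component in the numbering scheme described above, which may not match the layout of the downloaded archives.

```python
import re

# Assumption: paths contain "book_NN" (style1) or "period_NN" (style2).
_BOOK = re.compile(r"book_(\d{2})(?!\d)")
_PERIOD = re.compile(r"period_(\d{2})(?!\d)")


def split_of(path: str) -> str:
    """Return "train" or "test" for a style1 book or style2 period path."""
    match = _BOOK.search(path)
    if match:
        book = int(match.group(1))
        if 1 <= book <= 20:      # style1: books 01-20 are training data
            return "train"
        if 21 <= book <= 25:     # style1: books 21-25 are testing data
            return "test"
        raise ValueError(f"unknown book number in {path!r}")

    match = _PERIOD.search(path)
    if match:
        period = int(match.group(1))
        if period in (9, 10):    # style2: periods 09-10 are training data
            return "train"
        if 1 <= period <= 8:     # style2: periods 01-08 are testing data
            return "test"
        raise ValueError(f"unknown period number in {path!r}")

    raise ValueError(f"no book or period component in {path!r}")
```

For example, `split_of("period_01/volume_001")` falls in the style2 testing set under the rule stated above.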

 

Table I. Structure and Statistics of CASIA-AHCDB

Database Structure                             Classes    Characters
--------------------------------------------------------------------
CASIA-AHCDB
  Style1    Basic Category       Train           2,365       832,939
                                 Test            2,365       254,162
            Enhanced Category    Train           3,227        89,204
                                 Test            3,227        36,258
            Reserved Category                    3,819        19,763
  Style2    Basic Category       Train           2,365       728,423
                                 Test            2,365       204,547
            Enhanced Category    Train             783        71,179
                                 Test              783        19,597
            Reserved Category                    2,450         8,213
--------------------------------------------------------------------
Summation                                       12,229     2,264,285
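As a quick consistency check, the character counts in Table I can be summed per style and in total. The sketch below only restates the numbers from the table; the dictionary layout and the name `CHARACTERS` are illustrative choices, not part of the database.

```python
# Character counts copied from Table I, keyed by (style, category, split).
CHARACTERS = {
    ("style1", "basic", "train"): 832_939,
    ("style1", "basic", "test"): 254_162,
    ("style1", "enhanced", "train"): 89_204,
    ("style1", "enhanced", "test"): 36_258,
    ("style1", "reserved", "all"): 19_763,
    ("style2", "basic", "train"): 728_423,
    ("style2", "basic", "test"): 204_547,
    ("style2", "enhanced", "train"): 71_179,
    ("style2", "enhanced", "test"): 19_597,
    ("style2", "reserved", "all"): 8_213,
}

# Subtotal per style, then the grand total reported in the Summation row.
per_style = {}
for (style, _category, _split), count in CHARACTERS.items():
    per_style[style] = per_style.get(style, 0) + count

total = sum(CHARACTERS.values())
print(per_style)  # {'style1': 1232326, 'style2': 1031959}
print(total)      # 2264285
```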

 

Table II. GNTX Format

Item           Length                  Comment
----------------------------------------------------------------
Sample size    4 bytes                 Number of bytes for one sample
Unicode        4 bytes                 Unicode
Width          2 bytes                 Number of pixels in a row
Height         2 bytes                 Number of rows
Bitmap         width * height bytes    Stored row by row
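The record layout in Table II suggests a simple sequential reader. The following is a minimal sketch, not an official parser: the function name `read_gntx` is hypothetical, and it assumes that all integer fields are unsigned and little-endian, that the Unicode field holds a single code point, that the bitmap uses one byte per pixel, and that records are stored back to back with no padding. None of these details are stated in the table, so they should be checked against the actual files.

```python
import struct
from typing import BinaryIO, Iterator, Tuple

# Assumption: little-endian header with sample size (4 bytes),
# Unicode (4 bytes), width (2 bytes) and height (2 bytes).
HEADER = struct.Struct("<IIHH")


def read_gntx(stream: BinaryIO) -> Iterator[Tuple[str, int, int, bytes]]:
    """Yield (character, width, height, bitmap) for each sample in the stream."""
    while True:
        header = stream.read(HEADER.size)
        if not header:
            return  # clean end of file
        if len(header) < HEADER.size:
            raise ValueError("truncated sample header")

        # sample_size is read but not used for seeking in this sketch.
        sample_size, code_point, width, height = HEADER.unpack(header)

        # The bitmap is width * height bytes, stored row by row.
        bitmap = stream.read(width * height)
        if len(bitmap) != width * height:
            raise ValueError("truncated bitmap")

        yield chr(code_point), width, height, bitmap
```

A file would be read with `open(path, "rb")` and iterated, for example `for char, width, height, bitmap in read_gntx(f): ...`; pixel (row, column) is then at index `row * width + column` of `bitmap` under the row-by-row assumption.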

 

Data Download

    style1_basic_test
    style1_basic_train_part1
    style1_basic_train_part2
    style1_basic_train_part3
    style1_enhanced
    style2

 

Reference

Yue Xu, Fei Yin, Da-Han Wang, Xu-Yao Zhang, Zhaoxiang Zhang, Cheng-Lin Liu, CASIA-AHCDB: A large-scale Chinese ancient handwritten characters database, Proc. 15th ICDAR, Sydney, Australia, September 20-25, 2019, pp. 793-798.

 

Contact Information

Haidian | Beijing | China

Phone: (+86-10) 8254-4797

Fax: (+86-10) 8254-4594

Email: liucl@nlpr.ia.ac.cn

Website: www.nlpr.ia.ac.cn/pal/