CASIA Online Handwritten Flowchart Dataset

Introduction

The online handwritten flowchart dataset CASIA-OHFC was built by the National Laboratory of Pattern Recognition (NLPR), Institute of Automation of Chinese Academy of Sciences (CASIA). CASIA-OHFC contains 2,957 diagrams created from about 600 flowchart templates of varying complexity. The diagrams were drawn by 205 writers on Huawei tablets; each writer drew 15 diagrams on average, each from a different template. Each diagram consists of a number of handwritten strokes, and each stroke is a sequence of points recording the (x, y) coordinates, time, pressure, and pen-tip state. CASIA-OHFC covers 31 classes of symbols, including common graphic symbols and text. A typical template (diagram) has about 10 graphic symbols, several connecting arrows, and some text instances (inside the graphic symbols or beside the arrows, describing each operation). Two labels are provided for each stroke: the semantic class and the instance ID of the symbol it belongs to. The dataset is released in the Ink Markup Language (InkML) format. Figure 1 shows an example of an annotated online diagram.

Figure 1. An example of annotated online flowchart in CASIA-OHFC. Symbol classes are denoted by different colors. Symbol IDs are omitted for clean display.

 

The CASIA-OHFC dataset and the corresponding printed flowchart templates are packed as zip archives. Please use the links below to download them.

CASIA-OHFC (214MB)

flowchart templates (34.4MB)

A comprehensive description of the dataset has been published in IEEE Transactions on Multimedia (2021). Please refer to and cite: X.-L. Yun, Y.-M. Zhang, F. Yin and C.-L. Liu, "Instance GNN: A Learning Framework for Joint Symbol Segmentation and Recognition in Online Handwritten Diagrams," in IEEE Transactions on Multimedia, doi: 10.1109/TMM.2021.3087000.

Symbol Set

We collected 600 printed flowchart images from the Internet as the templates of CASIA-OHFC. The templates include text and 32 classes of graphic symbols. A full list of the symbol set can be found in Table 1.

Table 1. Flowchart Symbol Set

Data Partitioning

Although the collected flowchart templates contain 32 classes of graphic symbols, two extremely scarce classes (‘data storage’ and ‘sequential access storage’) were removed from the raw online handwritten data. CASIA-OHFC therefore has 30 graphic symbol classes, or 31 classes when text is counted. To keep writers’ drawing styles from affecting performance evaluation, the diagrams are randomly divided by writer into a training set, a validation set, and a test set, at a ratio of 7:1:2 over the 205 writers. This gives 143/20/42 writers in the training/validation/test sets, respectively. An overview of the dataset is shown in Table 2. The file lists of the training, validation, and test sets are stored in “CASIA-OHFC_TrainList.txt”, “CASIA-OHFC_ValidationList.txt”, and “CASIA-OHFC_TestList.txt”, respectively.

Table 2. An Overview of CASIA-OHFC

Dataset      #Classes  Partition   #Writers  #Templates  #Diagrams  #Strokes  #Symbols
CASIA-OHFC   31        Train       143       600         2073       592267    63368
                       Validation  20        215         286        86139     8728
                       Test        42        385         598        171313    18280
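The writer-level partition described above can be illustrated with a short Python sketch. It is only an illustration of a writer-disjoint 7:1:2 split and will not reproduce the official partition, so the released list files should be used for comparable results. The helper names `writer_of` and `split_by_writer` and the `seed` parameter are hypothetical, and the file-name layout (template ID, writer name, “revised”, joined by “_”) is taken from the Dataset format section.

```python
import random


def writer_of(file_name):
    """Extract the writer name from '<templateID>_<writer>_revised.inkml'."""
    stem = file_name.rsplit(".", 1)[0]      # drop the '.inkml' extension
    _template_id, rest = stem.split("_", 1)  # split off the template ID
    return rest.rsplit("_", 1)[0].strip()    # drop the trailing 'revised'


def split_by_writer(file_names, seed=0):
    """Assign whole writers to train/validation/test at a 7:1:2 ratio.

    Assumption: this is NOT the official split, only a writer-disjoint sketch.
    """
    writers = sorted({writer_of(f) for f in file_names})
    random.Random(seed).shuffle(writers)
    n_train = len(writers) * 7 // 10
    n_val = len(writers) // 10
    part_of = {}
    for i, writer in enumerate(writers):
        if i < n_train:
            part_of[writer] = "train"
        elif i < n_train + n_val:
            part_of[writer] = "validation"
        else:
            part_of[writer] = "test"
    splits = {"train": [], "validation": [], "test": []}
    for f in file_names:
        splits[part_of[writer_of(f)]].append(f)
    return splits
```

With 205 writers, the floor divisions in this sketch give 143, 20, and 42 writers, the same counts as in Table 2.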

Recommendations of usage

The dataset can be used for online and offline handwritten diagram recognition, i.e., stroke classification and symbol segmentation and recognition. We find that some classes in CASIA-OHFC are very hard to recognize because of ambiguous handwriting and a lack of training samples. To make experimental results more stable, we merge the ‘rounded rectangular’ symbol into the ‘process’ class, and combine the classes that have fewer than 90 symbol instances in the whole dataset into an ‘other’ category. Our experiments therefore use only 18 classes, including graphic symbols and ‘text’. Note that the ‘other’ class comprises 13 small classes: ‘stored data’, ‘oval callout’, ‘rectangular callout’, ‘off page connector’, ‘or’, ‘summing junction’, ‘card’, ‘internal storage’, ‘merge’, ‘extract’, ‘hard disk’, ‘annotation’ and ‘paper type’. We recommend that researchers evaluate flowchart recognition algorithms on these 18 classes, since performance on the minority classes can be highly unstable.
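The class merging recommended above could be applied with a small lookup such as the following Python sketch. The class names are written as they appear in this description; the exact label strings stored in the InkML files (for example upper-case “TEXT”) may differ, so the lower-casing and the names in `OTHER_CLASSES` are assumptions to check against the data. The function name `to_eval_class` is hypothetical.

```python
# The 13 small classes that are merged into 'other' (names as written above).
# Assumption: labels in the files match these names after lower-casing.
OTHER_CLASSES = {
    "stored data", "oval callout", "rectangular callout",
    "off page connector", "or", "summing junction", "card",
    "internal storage", "merge", "extract", "hard disk",
    "annotation", "paper type",
}


def to_eval_class(label):
    """Map an original class label to one of the 18 evaluation classes."""
    key = label.strip().lower()
    if key == "rounded rectangular":
        return "process"   # merged into the 'process' class
    if key in OTHER_CLASSES:
        return "other"     # fewer than 90 instances in the whole dataset
    return key             # every remaining class is kept unchanged
```

For example, `to_eval_class("card")` returns `"other"`, while a label that is not in either rule passes through in lower case.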

Dataset format

The dataset is released in InkML format. Each file name consists of the template ID, the writer name, and the word “revised”, separated by the character “_”, for example “123_张五_revised.inkml”. Each stroke is stored in a <trace> element with a unique ID, and each of its points has 5 channels: x-coordinate (X), y-coordinate (Y), pressure (F), pen-tip state (S), and timestamp (T). The pen-tip states “1”, “0”, and “2” denote pen down, pen move, and pen up, respectively. A shortened InkML file is shown below:

 

<ink xmlns="http://www.w3.org/TR/InkML">
<traceFormat>
    <channel name="X" type="decimal"/>
    <channel name="Y" type="decimal"/>
    <channel name="F" type="decimal"/>
    <channel name="S" type="integer"/>
    <channel name="T" type="decimal"/>
</traceFormat>
<annotation type="UI">2018_NLPR_Flowchart</annotation>
<annotation type="copyright">CASIA/NLPR/PAL</annotation>
<annotation type="template">186</annotation>
<annotation type="writer">林玮泽_revised</annotation>
<trace id="0">
    176.90005 69.88426 0.3385442 1 20063136,
    175.90057 67.88593 0.3844651 0 20063160,
    172.90213 67.88593 0.4772838 0 20063177,
    ...
    149.91411 122.84013 0.5744992 0 20063409,
    150.91359 119.84264 0.5725452 0 20063417,
    150.91359 119.84264 0.5725452 2 20063425
</trace>
<trace id="343">
    507.72565 244.74106 0.3634587 1 20336935,
    508.72516 245.74023 0.4191500 0 20336955,
    508.72516 248.73773 0.4997557 0 20336969,
    514.72198 254.73273 0.3898388 0 20337525,
    514.72198 254.73273 0.3898388 2 20337530
</trace>
<traceGroups xml:id="344">
    <annotation type="truth">Labeled Flowchart</annotation>
    <traceGroup xml:id="345">
        <annotation type="truth">TEXT</annotation>
        <traceView traceDataRef="0"/>
        <traceView traceDataRef="1"/>
        ...
        <traceView traceDataRef="19"/>
        <traceView traceDataRef="20"/>
        <traceView traceDataRef="21"/>
        <annotationXML href="TEXT_0"/>
    </traceGroup>
    <traceGroup xml:id="346">
        <annotation type="truth">ELLIPSE</annotation>
        <traceView traceDataRef="22"/>
        <annotationXML href="ELLIPSE_0"/>
    </traceGroup>
</traceGroups>
</ink>

A shortened inkml file example.
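As a minimal sketch of how the stroke data in the example above might be read, the Python snippet below turns the text of one <trace> element into a list of points. It assumes, as the example suggests, that points are separated by commas and that each point carries the five channels X, Y, F, S, T in that order, separated by whitespace. The names `Point` and `parse_trace` are hypothetical.

```python
from typing import List, NamedTuple


class Point(NamedTuple):
    x: float         # X channel
    y: float         # Y channel
    pressure: float  # F channel
    state: int       # S channel: 1 = pen down, 0 = pen move, 2 = pen up
    t: float         # T channel (timestamp)


def parse_trace(trace_text: str) -> List[Point]:
    """Parse the text content of one <trace> element into points."""
    points = []
    for chunk in trace_text.split(","):
        fields = chunk.split()
        if not fields:
            continue  # skip empty pieces, e.g. after a trailing comma
        if len(fields) != 5:
            raise ValueError(f"expected 5 channels, got {len(fields)}: {chunk!r}")
        x, y, f, s, t = fields
        points.append(Point(float(x), float(y), float(f), int(s), float(t)))
    return points
```

A stroke read this way should start with a point whose state is 1 and end with a point whose state is 2, which can serve as a quick sanity check on parsed files.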

 

Every symbol instance is stored in a <traceGroup> element with a unique ID. The content of the <annotation> element is the category of the symbol instance, such as “TEXT” or “ELLIPSE”. The “traceDataRef” attribute of each <traceView> element gives the ID of a stroke that belongs to the symbol. The “href” attribute of <annotationXML> gives the symbol category and the instance-level ID, separated by “_”. Note that the instance IDs in <annotationXML> are not always consecutive.
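The symbol annotations described above could be collected with Python's standard `xml.etree.ElementTree` module, as in the sketch below. It matches elements by local tag name so that it does not depend on the exact namespace URI, and it splits “href” at the last “_” in case a category name itself contains an underscore; both choices are assumptions rather than part of the dataset specification. The names `local_name` and `parse_symbols` are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def parse_symbols(inkml_text):
    """Return one dict per <traceGroup>: label, category, instance ID, strokes."""
    root = ET.fromstring(inkml_text)
    symbols = []
    for group in root.iter():
        if local_name(group.tag) != "traceGroup":
            continue  # skips the enclosing <traceGroups> element as well
        label, href, stroke_ids = None, None, []
        for child in group:
            name = local_name(child.tag)
            if name == "annotation":
                label = (child.text or "").strip()
            elif name == "traceView":
                stroke_ids.append(child.get("traceDataRef"))
            elif name == "annotationXML":
                href = child.get("href")
        # href looks like "TEXT_0": category and instance ID joined by "_".
        category, _, instance_id = (href or "").rpartition("_")
        symbols.append({
            "label": label,
            "category": category,
            "instance_id": instance_id,
            "strokes": stroke_ids,
        })
    return symbols
```

Because instance IDs are not always consecutive, it is safer to treat `instance_id` as an opaque key than as an index into a list.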

 

Condition of Use

The online handwritten flowchart dataset CASIA-OHFC, built by CASIA, is released for academic research free of charge under an agreement.

Commercial use of the database is subject to a charge. For a commercial license, please contact Fei Yin (fyin@nlpr.ia.ac.cn). The database for commercial use is enlarged to contain all the online handwritten flowcharts.

The application form for academic use of the dataset can be downloaded below:

English version

Chinese version

Conditions of Academic Use

  1. All samples in the databases under this agreement can only be used by the group of the named applicant and only for research purposes. No samples can be used for any commercial purpose.
  2. The Institute of Automation of CAS retains the copyright of all sample data in the databases.
  3. Publications of research results obtained on the database should acknowledge it appropriately. The recommended reference is below:

    X.-L. Yun, Y.-M. Zhang, F. Yin and C.-L. Liu, "Instance GNN: A Learning Framework for Joint Symbol Segmentation and Recognition in Online Handwritten Diagrams," in IEEE Transactions on Multimedia, doi: 10.1109/TMM.2021.3087000.

Contact

Cheng-Lin Liu (liucl@nlpr.ia.ac.cn), Yan-Ming Zhang (ymzhang@nlpr.ia.ac.cn)

National Laboratory of Pattern Recognition (NLPR)

Institute of Automation of Chinese Academy of Sciences

95 Zhongguancun East Road, Beijing 100190, P.R. China

Contact Information

Haidian | Beijing | China

Phone : (+86-10)8254-4797

Fax : (+86-10) 8254-4594

Email:liucl@nlpr.ia.ac.cn

Website:www.nlpr.ia.ac.cn/pal/