CASIA-PGDP5K: Plane Geometry Diagram Parsing Dataset

1. Introduction

The Plane Geometry Diagram Parsing Dataset (PGDP5K) was constructed by the National Laboratory of Pattern Recognition (NLPR), Institute of Automation of Chinese Academy of Sciences (CASIA). The dataset consists of 5,000 diagram samples composed of 16 shapes, covering 5 positional relations, 22 symbol types and 6 text types. The samples are labeled with fine-grained annotations at the primitive level, including primitive classes, locations and relationships. Of the 5,000 images, 1,813 non-duplicated images were selected from the Geometry3K dataset, and the other 3,187 were collected from three popular textbooks across grades 6-12 on mathematics curriculum websites by taking screenshots of the PDF books. Combined with geometric prior knowledge, these annotations can be used to generate intelligible geometric propositions automatically and uniquely.

Download: CASIA-PGDP5K.zip (80.7 MB)


2. Annotations

2.1. Geometric Primitives

We divided geometric primitives into 3 classes: point, line and circle, and we annotated their parsing positions and uniform pixel widths.

  • Point: The point class covers intersection points, tangent points, endpoints and independent points.
  • Line: The line class consists of solid lines, dashed lines and lines that mix solid and dashed parts. Note that among all collinear line segments, only the longest one is labeled.
  • Circle: The circle class includes complete circles and arcs.

Fig. 1 Examples of geometric primitives.
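
The labeling rule for collinear lines can be illustrated with a minimal sketch. It assumes that the points on one straight line are given as pixel coordinates; the function name `longest_segment` and the sample coordinates are hypothetical and are not part of the dataset tools.

```python
from itertools import combinations
from math import dist

def longest_segment(points):
    """Return the pair of points that are farthest apart.

    For points that all lie on one straight line, this pair gives the
    endpoints of the longest segment, i.e. the only segment that would
    be labeled under the rule described above.
    """
    return max(combinations(points, 2), key=lambda pair: dist(*pair))

# Hypothetical pixel coordinates of three collinear points A, B and C.
A, B, C = (10.0, 10.0), (40.0, 40.0), (90.0, 90.0)
segment = longest_segment([A, B, C])
print(segment)  # the segment from A to C; AB and BC are not labeled separately
```

Under this reading, the shorter segments AB and BC stay implicit and can be recovered from the points that lie on the labeled line.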

2.2. Non-geometric Primitives

For non-geometric primitives, we annotated the bounding box, symbol class and text class, and recorded the corresponding text contents.

  • Text: We divided texts into 6 classes: line, point, angle, length, degree and area. This makes fine-grained text classification a new sub-task of diagram parsing.

Fig. 2 Examples of text.

  • Symbol: We divided symbols into 6 super-classes and 16 sub-classes. The super-classes are perpendicular, angle, bar, parallel, arrow and head, where the angle, bar and parallel classes have multiple forms. We subdivided the heads into 2 classes to distinguish the different indication relations of different arrows.

Fig. 3 Examples of symbols.

2.3. Relationships

For primitive relations, we constructed a relation graph of the elementary relations among primitives (Fig. 4). We divided primitive relations into 4 classes: geo2geo, text2geo, sym2geo and sym2text. For relations among geometric primitives, we only constructed relations between point and line and between point and circle, because other high-level relations among geometric primitives can be derived from these two basic relations. We defined a two-tuple with multiple entities to represent the relation between primitives: points, symbols and texts serve as subjects, and the other related primitives serve as objects. Some relation tuples are shown in Fig. 5.

Fig. 4 Primitive relationship graph of plane geometry diagram.

Fig. 5 Relation tuples of the PGDP5K dataset. 'P#', 'L#', 'C#', 'T#' and 'S#' denote instances of point, line, circle, text and symbol, respectively.
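
As a rough illustration of how higher-level relations can be derived from the two basic geo2geo relations, the sketch below inverts point-to-object tuples and finds lines that meet at a shared point. The tuple layout, the sample values and both function names are assumptions made for this example; the actual annotation format is the one shown in Fig. 7.

```python
from collections import defaultdict

# Hypothetical geo2geo tuples in the form (subject point, [related objects]).
# The instance names follow the 'P#', 'L#', 'C#' convention of Fig. 5; the
# concrete values are made up for illustration.
geo2geo = [
    ("P0", ["L0", "C0"]),
    ("P1", ["L0"]),
    ("P2", ["L0", "L1"]),
    ("P3", ["L1", "C0"]),
]

def points_by_object(relations):
    """Invert point-to-object tuples into a map from each line or circle to its points."""
    grouped = defaultdict(list)
    for point, objects in relations:
        for obj in objects:
            grouped[obj].append(point)
    return dict(grouped)

def line_intersections(relations):
    """Return (line, line, point) triples for lines that share a point."""
    found = []
    for point, objects in relations:
        lines = sorted(o for o in objects if o.startswith("L"))
        for i, first in enumerate(lines):
            for second in lines[i + 1:]:
                found.append((first, second, point))
    return found

on_object = points_by_object(geo2geo)
crossings = line_intersections(geo2geo)
print(on_object["L0"])  # points lying on line L0, hence collinear
print(crossings)        # pairs of lines meeting at a common point
```

Collinearity and line intersection are not stored explicitly here; both follow from the point-on-line tuples alone, which matches the motivation given above for keeping only the two basic relations.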

2.4. Geometric Description Language

We formed high-level, comprehensible specifications of a geometric description language (GDL), which mainly consists of a list of geometric propositions formatted by proposition templates. As shown in Tab. 1, we defined 4 types of proposition templates for basic relations: Geometry Shape, Geo2Geo, Text2Geo and Sym2Geo.

  • Geometry Shape: Geometry shapes are the basic elements of high-level propositions. We give proposition templates for 5 types of fundamental geometry shapes: point, line, circle, angle and arc, where line, angle and arc have several equivalent expressions.
  • Geo2Geo: 3 types of proposition templates are defined for relations among geometric primitives: point lies on line, point lies on circle and point is center of circle.
  • Text2Geo: The relations of text with geometric primitives are divided into 6 types according to text class. Among these proposition templates, the ones for degree and length are not unique.
  • Sym2Geo: The propositions of symbols with geometric primitives are divided into 4 groups according to symbol class, and there are 2 proposition templates for the bar symbol.

Tab. 1 Geometric proposition templates of primitive relations. '$' represents geometric primitives and '&' denotes text content.

Relation Class   Primitive Class      Proposition Templates
--------------   ------------------   ---------------------
Geo Shape        Point                Point($)
                 Line                 Line($,$), Line($)
                 Circle               Circle($,radius_$)
                 Angle                Angle($,$,$), Angle($)
                 Arc                  Arc($,$), Arc($,$,$)

Geo2Geo          Point                PointLiesOnLine($,Line($,$))
                                      PointLiesOnCircle($,Circle($,radius_$))
                                      Circle($,radius_$)

Text(&)2Geo      Text_point           Point(&)
                 Text_line            Line(&)
                 Text_angle           Equals(MeasureOf(Angle($,$,$)),MeasureOf(angle &))
                 Text_degree          Equals(MeasureOf(Angle($,$,$)), &)
                                      Equals(MeasureOf(Arc($,$)), &)
                 Text_length          Equals(LengthOf(Line($,$)), &)
                                      Equals(LengthOf(Arc($,$)), &)
                 Text_area            -

Sym2Geo          Sym_perpendicular    Perpendicular(Line($,$), Line($,$))
                 Sym_angle            Equals(MeasureOf(Angle($,$,$)), MeasureOf(Angle($,$,$)))
                 Sym_bar              Equals(LengthOf(Line($,$)), LengthOf(Line($,$)))
                                      Equals(LengthOf(Arc($,$)), LengthOf(Arc($,$)))
                 Sym_parallel         Parallel(Line($,$), Line($,$))
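
To make the placeholder convention concrete, the following minimal sketch fills a proposition template by substituting '$' with primitive names from left to right and '&' with text content. The helper `fill_template` and the sample primitive names are hypothetical; the sketch only mirrors the notation of Tab. 1 and is not the generation procedure used to build the dataset.

```python
def fill_template(template, primitives=(), text=None):
    """Fill a Tab. 1 style template.

    Each '$' is replaced, left to right, by the next name in `primitives`;
    every '&' is replaced by `text`.
    """
    parts = template.split("$")
    if len(parts) - 1 != len(primitives):
        raise ValueError("number of primitives does not match the '$' placeholders")
    filled = parts[0]
    for name, rest in zip(primitives, parts[1:]):
        filled += name + rest
    if "&" in filled:
        if text is None:
            raise ValueError("template contains '&' but no text content was given")
        filled = filled.replace("&", text)
    return filled

# Hypothetical primitive names and text content.
on_line = fill_template("PointLiesOnLine($,Line($,$))", ["B", "A", "C"])
length = fill_template("Equals(LengthOf(Line($,$)), &)", ["A", "B"], text="5")
print(on_line)  # PointLiesOnLine(B,Line(A,C))
print(length)   # Equals(LengthOf(Line(A,B)), 5)
```

Passing the wrong number of primitives raises an error, which is a simple way to catch a mismatch between a relation tuple and the template chosen for it.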

2.5. Dataset Distributions

Fig. 6 displays the class distributions of geometry shape, symbol, text and relation. All of them clearly follow a long-tailed distribution. Note that text is treated as a special symbol and is recorded in the symbol distribution.

Fig. 6 Distributions of the PGDP5K dataset. (a)(b)(c)(d) denote the class distributions of shape, symbol, text and relation, respectively.

2.6. Dataset Formats

Fig. 7 Format of annotation

Fig. 8 Format of logic form


3. Condition of Use

  • The CASIA-PGDP5K: Plane Geometry Diagram Parsing Dataset, built by CASIA, is released for academic research free of cost under an agreement.
  • Commercial use of the dataset is subject to charge. For a possible commercial license, please contact Fei Yin (fyin@nlpr.ia.ac.cn).
  • The application form of the dataset for academic research can be downloaded below:


      English version

      Chinese version


Reference

A comprehensive description of the PGDP5K dataset is given in:

      Yihan Hao, Mingliang Zhang, Fei Yin and Lin-Lin Huang. PGDP5K: A Diagram Parsing Dataset for Plane Geometry Problems. In ICPR 2022.

The dataset was first used in the research work:

      Ming-Liang Zhang, Fei Yin, Yi-Han Hao and Cheng-Lin Liu. Plane Geometry Diagram Parsing. In IJCAI 2022.

If this dataset helps you, please cite the papers above.


Contact

Fei Yin (fyin@nlpr.ia.ac.cn)

National Laboratory of Pattern Recognition (NLPR)

Institute of Automation of Chinese Academy of Sciences

95 Zhongguancun East Road, Beijing 100190, P.R. China