This dataset was built with the help of Master Learner Company. It has been evaluated in the paper entitled 'Printed/Handwritten Texts and Graphics Separation in Complex Documents Using Conditional Random Fields'.
The 'Train' and 'Test' folders contain the training set (300 documents) and the test set (100 documents), respectively. Original images (JPG format) and their ground-truth annotations (XML format) can be found in the 'Image' and 'Annotations' folders, respectively. Each document image is manually labeled at region level with five categories: printed text, handwritten text, graphics, images and tables. Visualizations of the annotations can be found in the 'LabeledImage' folder, where bounding boxes of different colors represent different categories of content. Please note that the bounding boxes are only used to separate different categories of content; they should not be considered object instances with semantic or logical meaning. To obtain the foreground pixels in each bounding box, one should apply a binarization method. Please also note that a pixel may lie inside multiple bounding boxes of different categories; in that case, we assign the pixel the label of the bounding box with the smallest area.
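The labeling rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used to build the dataset: the function name `label_pixels`, the box tuple layout `(label, xmin, ymin, xmax, ymax)` with exclusive maximum coordinates, the integer category ids, and the fixed global threshold of 128 used as the binarization step are all hypothetical. Any binarization method can be substituted, and the boxes are assumed to have already been read from the XML annotations.

```python
import numpy as np

def label_pixels(gray, boxes, threshold=128):
    """Assign a category id to every foreground pixel of a grayscale page.

    gray:  2-D uint8 array (dark ink on a light background is assumed).
    boxes: list of (label, xmin, ymin, xmax, ymax); label is a positive int,
           xmax/ymax are exclusive. This layout is an assumption.
    Returns an int32 array of the same shape; 0 means background or unlabeled.
    """
    labels = np.zeros(gray.shape, dtype=np.int32)

    def area(box):
        _, x0, y0, x1, y1 = box
        return (x1 - x0) * (y1 - y0)

    # Paint the largest boxes first so that smaller boxes overwrite them.
    # Each pixel therefore ends up with the label of the smallest box covering it.
    for label, x0, y0, x1, y1 in sorted(boxes, key=area, reverse=True):
        labels[y0:y1, x0:x1] = label

    # Stand-in binarization: pixels darker than the threshold are foreground.
    foreground = gray < threshold
    labels[~foreground] = 0
    return labels
```

For example, with one large box of category 1 and a smaller box of category 2 nested inside it, foreground pixels in the overlap receive label 2, while the remaining foreground pixels of the large box receive label 1.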
Download TestPaper1.0.zip (2.04 GB)