MiikeMineStamps Dataset

posted on 2021-05-18, 15:56 authored by Paola Buitrago, Evgeny Toropov, Rajanie Prabha

The MiikeMineStamps dataset is a naturally long-tailed open-ended dataset in the domain of Japanese historical documents. It contains 5056 images of Japanese stamps that belong to 407 classes including two special ones. The stamps were extracted from a large compendium of historical documents from the Japanese company Mitsui Mi’ike Mine, one of the largest business archives in modern Japan that spans half a century, includes tens of thousands of documents, and has been widely used by labor historians, business historians, and others.

The dataset can be used as a benchmark for Open Long-Tailed Recognition (OLTR) challenges.

The data is available for free to researchers for non-commercial use.

-----------

Stamp naming conventions

-----------

- Stamp names consist of lowercase letters, underscores, and two special characters: "+" and "?".

- Most names contain only letters, e.g. "kuru".

- "+" in a name means that only one half of a stamp is visible, because two conforming documents were stamped together. Only the top of a stamp is visible if "+" comes after the stamp name (e.g. "kei+"), and only the bottom of a stamp is visible if "+" comes before the stamp name (e.g. "+kei").

- "?" in a name (e.g. "oomuta?unhontenin") means that one symbol in a stamp was not recognized.

- Name "??" is a special name that means that a stamp was not recognized.

- Name "+??" is a special name that means that a stamp was not recognized and only its bottom is visible.
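The conventions above can be turned into a small parser. The following is only a sketch: the `StampName` record, its field names, and the `parse_stamp_name` function are hypothetical and are not shipped with the dataset, and the sketch assumes that "+" appears at most once, at the start or the end of a name.

```python
import re
from dataclasses import dataclass

# The conventions allow lowercase letters, underscores, "+" and "?".
_ALLOWED = re.compile(r"^[a-z_+?]+$")


@dataclass(frozen=True)
class StampName:
    """Hypothetical record describing one stamp name."""
    base: str             # name with the "+" marker stripped
    visible_part: str     # "full", "top", or "bottom"
    unrecognized: bool    # True for the special name "??"
    unknown_symbols: int  # number of "?" in an otherwise readable name


def parse_stamp_name(name: str) -> StampName:
    if not _ALLOWED.match(name):
        raise ValueError(f"unexpected characters in stamp name: {name!r}")
    if name.startswith("+"):
        # "+" before the name: only the bottom half is visible.
        visible_part, base = "bottom", name[1:]
    elif name.endswith("+"):
        # "+" after the name: only the top half is visible.
        visible_part, base = "top", name[:-1]
    else:
        visible_part, base = "full", name
    # Assumption of this sketch: no further "+" inside the name.
    if not base or "+" in base:
        raise ValueError(f"unexpected '+' placement in stamp name: {name!r}")
    unrecognized = base == "??"
    unknown_symbols = 0 if unrecognized else base.count("?")
    return StampName(base, visible_part, unrecognized, unknown_symbols)
```

For example, `parse_stamp_name("+??")` yields a record with `visible_part == "bottom"` and `unrecognized == True`, while `parse_stamp_name("oomuta?unhontenin")` reports one unknown symbol on a fully visible stamp.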

-----------

Methodology

-----------

The MiikeMineStamps dataset was produced following the general principle of active learning. The process ran in cycles. In every cycle, a machine learning model first predicted bounding boxes and object classes for all unlabeled images in the Mitsui Mi’ike Mine documents dataset. Then, a subset of images was selected based on an adjustable criterion; images with a large number of high-uncertainty objects were preferred. The images and the predictions were passed to a team of human experts, who verified the labels and corrected them if necessary. The ML model was then retrained on all the verified data available, which completed the cycle.
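The image selection step of a cycle might look like the sketch below. The text only says that the criterion is adjustable, so everything specific here is an assumption: the function name `select_images_for_review`, the use of a per-object confidence score, the `confidence_cutoff` that marks an object as uncertain, and the fixed review `budget`.

```python
from typing import Dict, List


def select_images_for_review(
    predictions: Dict[str, List[float]],
    confidence_cutoff: float = 0.5,  # assumed: below this, an object counts as uncertain
    budget: int = 2,                 # assumed: images sent to the experts per cycle
) -> List[str]:
    """Rank images by how many uncertain objects the model predicted in them.

    `predictions` maps an image id to the model's confidence for each
    object it detected in that image.
    """
    def num_uncertain(scores: List[float]) -> int:
        return sum(1 for score in scores if score < confidence_cutoff)

    # Most uncertain objects first; ties are broken by image id so the
    # result is deterministic.
    ranked = sorted(
        predictions,
        key=lambda image_id: (-num_uncertain(predictions[image_id]), image_id),
    )
    return ranked[:budget]
```

With `{"a.jpg": [0.9, 0.95], "b.jpg": [0.3, 0.4, 0.8], "c.jpg": [0.2]}` and the defaults, the function returns `["b.jpg", "c.jpg"]`, since those images contain two and one low-confidence objects respectively.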

In the case of the open class set, we do not know object classes beforehand and cannot train an object detector model that looks for a specific set of classes. Instead, a two-step approach was employed. First, an object detector model finds instances of the generic “stamp” class; then, an image classification model is used to recognize a specific class in cropped-out images of “stamps”. This approach provides the advantage of transferring the difficulty of dealing with the open class set and the long-tail distribution from the detection to the classification setup, where there are more tools to manage it.

We apply the detection algorithm to the images to extract stamps, and resize these stamps to 80 by 80 pixels. Then, the cropped stamps are individually passed to the image classifier. While any off-the-shelf object detector architecture can be used for the “stamp” detection step, the image classification model must be able to handle the open class set and the long-tail challenges. We assume the number of instances per class varies from one to several hundred. Furthermore, we assume the existence of previously unseen classes.
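The detect, crop, resize, and classify flow can be sketched in plain Python as follows. This is an illustration under several assumptions, not the authors' implementation: images are modeled as nested lists of grayscale values, the `detect` and `classify` callables are hypothetical stand-ins for the trained models, and nearest-neighbor interpolation is used only because the text does not state which resizing method was applied.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Image = List[List[int]]          # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive

STAMP_SIZE = 80  # side length in pixels, as stated in the text


def crop(image: Image, box: Box) -> Image:
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def resize_nearest(image: Image, size: int = STAMP_SIZE) -> Image:
    """Resize to size x size with nearest-neighbor sampling (an assumption)."""
    if not image or not image[0]:
        raise ValueError("cannot resize an empty image")
    height, width = len(image), len(image[0])
    return [
        [image[y * height // size][x * width // size] for x in range(size)]
        for y in range(size)
    ]


def recognize_stamps(
    image: Image,
    detect: Callable[[Image], Sequence[Box]],
    classify: Callable[[Image], Optional[str]],
) -> List[Tuple[Box, Optional[str]]]:
    """Step 1: find generic "stamp" boxes. Step 2: classify each resized crop."""
    results = []
    for box in detect(image):
        stamp = resize_nearest(crop(image, box))
        # `classify` may return None for a stamp it cannot assign to a
        # known class; how unseen classes are handled is left to the model.
        results.append((box, classify(stamp)))
    return results
```

Because the detector only ever sees one generic class, swapping in a different detector does not change the rest of the flow; the open-set and long-tail handling lives entirely inside `classify`.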

More details can be found in the accompanying ICDAR2021 paper where the dataset is presented (P. A. Buitrago et al., 2021).

-----------

Acknowledgements

-----------

This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562. Specifically, it used the Bridges and Bridges-2 systems, which are supported by NSF award numbers ACI-1445606 and ACI-1928147, at the Pittsburgh Supercomputing Center (PSC). The work was made possible through the XSEDE Extended Collaborative Support Service (ECSS) program.