Data Exploration
To get a document to label, press "Next Sample" to move to the next document in the dataset. Duet helps you build a custom model by automatically choosing documents from the unlabeled dataset, a process called "automatic sampling". Duet's sampling technology selects the most informative document while respecting the original data distribution, which avoids selection bias. It uses the most recent model to pick the next document, choosing the example the model is most confused about so that your labeling effort goes where it helps most.
After you have labeled 5 segments for each entity/sub-entity, the "Next Sample" button shows a drop-down menu from which you can choose an entity to sample for. The default choice in this menu is "Any", which means the system decides which entity to sample for based on the state of the model. Choosing an entity tells the system that you want more examples to label for that entity, and it will present examples on which the model is confused about that entity. We recommend relying on automatic sampling once you have made some progress with your model by labeling a few examples and adding a few features.
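
The idea of sampling the document the model is "most confused about" can be illustrated with plain uncertainty sampling. The sketch below is not Duet's algorithm, which is not documented here; it is a generic illustration that assumes the model emits one probability per entity per document, and it leaves out the distribution-preserving step mentioned above. The names `pick_next_sample`, `uncertainty`, and the `scores` layout are hypothetical.

```python
from typing import Dict, Optional

ANY = "Any"  # mirrors the default choice in the "Next Sample" menu


def uncertainty(p: float) -> float:
    """Score how confused the model is: highest at p = 0.5, lowest at 0 or 1."""
    return 1.0 - abs(p - 0.5) * 2.0


def pick_next_sample(
    scores: Dict[str, Dict[str, float]], entity: str = ANY
) -> Optional[str]:
    """Pick the unlabeled document with the highest uncertainty.

    `scores` maps a document id to {entity name: predicted probability}.
    This layout is an assumption made for illustration only.
    """
    best_doc: Optional[str] = None
    best_score = -1.0
    for doc_id, per_entity in scores.items():
        if entity == ANY:
            # "Any": consider every entity and keep the most uncertain one.
            candidates = list(per_entity.values())
        elif entity in per_entity:
            # A specific entity: only that entity's prediction counts.
            candidates = [per_entity[entity]]
        else:
            continue
        score = max((uncertainty(p) for p in candidates), default=-1.0)
        if score > best_score:
            best_doc, best_score = doc_id, score
    return best_doc


example_scores = {
    "doc_a": {"address": 0.95, "phone": 0.55},
    "doc_b": {"address": 0.60, "phone": 0.05},
    "doc_c": {"address": 0.10, "phone": 0.90},
}
next_any = pick_next_sample(example_scores)
next_address = pick_next_sample(example_scores, entity="address")
```

With "Any", the sketch considers every entity; with a specific entity it ranks documents only by that entity's prediction, which is one plausible reading of what the menu choice does.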
You can also explore the unlabeled dataset by searching for one or more keywords. We recommend using search in the early phase of building your entity extractor to find positive examples for your entities/sub-entities, which helps the system pick up the semantics of your model. For example, in the following screenshot, we search for the word "Avenue" to find some US addresses to label.
You can search for documents that contain certain keywords, documents that do not contain certain keywords, or a particular file name (file-name search is only available for datasets that have a nested folder structure). The search filter shows input fields that, when filled in, translate to commands in the search field. Search terms are separated by commas. The "Includes" filter takes one or more keywords that appear in the documents you are looking for, while the "Excludes" filter takes one or more keywords that should not appear in them. Keywords typed in the Excludes field are displayed as -keyword in the search field. Combining these filters helps you find the specific documents you want to label. If the dataset is a zip file with a folder structure in which each document is a .txt file, you can also search by file name. Learn more about dataset structures here.
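
To make the mapping from the filter fields to the search field concrete, here is a minimal sketch of how Includes and Excludes keywords could be combined into one search string. It follows the two rules stated above (comma-separated terms, excluded keywords shown as -keyword). The function name `build_search_query` is hypothetical, and the exact spacing after each comma is an assumption. File-name search is left out because its syntax is not described here.

```python
from typing import Iterable, List


def build_search_query(
    includes: Iterable[str] = (), excludes: Iterable[str] = ()
) -> str:
    """Combine filter keywords into a single search-field string.

    Included keywords are written as-is; excluded keywords get a leading "-".
    Terms are joined with ", " (the space after the comma is an assumption).
    """
    terms: List[str] = []
    for keyword in includes:
        keyword = keyword.strip()
        if keyword:  # skip empty input fields
            terms.append(keyword)
    for keyword in excludes:
        keyword = keyword.strip()
        if keyword:
            terms.append("-" + keyword)
    return ", ".join(terms)


query = build_search_query(includes=["Avenue"], excludes=["Street", " Road "])
print(query)
```

For example, including "Avenue" while excluding "Street" and "Road" would narrow the results to documents that are more likely to contain avenue-style addresses.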