Data Exploration

To obtain documents to label, press "Next Sample" to move on to the next document in the dataset. Duet helps you build a custom model by automatically choosing documents from the unlabeled dataset, a process called "automatic sampling". Duet selects the most informative document while respecting the original data distribution, which avoids selection bias. It uses the most recent version of the model to pick the next document, so your labeling effort goes to the examples the model is most uncertain about. If you do not select a category for a document, the skipped document is automatically labeled as the global "Other". An instructional video on data exploration is available here.
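The idea behind automatic sampling can be illustrated with a generic uncertainty-sampling sketch. This is not Duet's actual algorithm, which is not described here. It assumes a hypothetical `predict_proba` function that returns the current model's class probabilities for a document, and it assumes that drawing a uniform random candidate pool is an acceptable stand-in for "respecting the original data distribution". The names `entropy`, `next_sample`, and `pool_size` are illustrative only.

```python
import math
import random


def entropy(probs):
    """Shannon entropy of a probability vector; higher means more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def next_sample(unlabeled, predict_proba, pool_size=50, rng=None):
    """Pick the next document to label (illustrative sketch, not Duet's method).

    A uniform random pool keeps candidates close to the dataset's own
    distribution; within that pool, the document whose predicted class
    probabilities have the highest entropy is returned.
    """
    rng = rng or random.Random()
    pool = rng.sample(unlabeled, min(pool_size, len(unlabeled)))
    return max(pool, key=lambda doc: entropy(predict_proba(doc)))
```

In this sketch, a document that the model scores at roughly equal probability for every category is chosen ahead of one that the model already classifies with high confidence.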

When selecting the next sample, you can use the drop-down menu to choose the category you'd like to sample from. The default is any category, but you can select any schema category other than "Other". This option is available only after you have left the getting started mode.

Alternatively, you can explore the dataset through keyword search or filename search (the latter is only available for datasets that have a nested folder structure). A search queries documents based on keywords that are present in, or absent from, the document. If you previewed the dataset you're using to teach the model and planned the schema around it, keyword search is a good way to find documents whose language you know ought to be associated with a specific category. The search filter shows input fields that, when filled in, translate to commands in the search field. Independent search terms are separated by commas. The "Includes" filter takes one or more keywords that must appear in the documents you are looking for, while the "Excludes" filter takes one or more keywords that must not appear in them. Keywords typed into the "Excludes" text field are displayed as -keyword in the search field. Combining these filters helps you find the specific documents you want to label. If the dataset is a zip file with a folder structure in which each document is a .txt file, you can also search by file name. Learn more about dataset structures here.
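The way the "Includes" and "Excludes" fields translate into the search field can be sketched as follows. This is a minimal illustration, not Duet's implementation: the function names `build_query` and `matches` are hypothetical, and the matching rules (case-insensitive substring match, every included keyword required) are assumptions, since the exact semantics are not specified here.

```python
def build_query(includes, excludes):
    """Combine filter keywords into one comma-separated search string.

    Excluded keywords are prefixed with "-", mirroring how they are
    displayed in the search field.
    """
    terms = [kw.strip() for kw in includes if kw.strip()]
    terms += ["-" + kw.strip() for kw in excludes if kw.strip()]
    return ", ".join(terms)


def matches(text, query):
    """Return True if the text satisfies every term in the query.

    Assumption: matching is a case-insensitive substring test, and all
    included keywords must be present.
    """
    lowered = text.lower()
    for raw_term in query.split(","):
        term = raw_term.strip().lower()
        if not term or term == "-":
            continue  # ignore empty terms
        if term.startswith("-"):
            if term[1:] in lowered:
                return False  # an excluded keyword is present
        elif term not in lowered:
            return False  # a required keyword is missing
    return True
```

For example, filling "Includes" with invoice and payment and "Excludes" with draft yields the search string `invoice, payment, -draft` under these assumptions.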

