Getting Started with Model Teaching

At a high level, creating a classification model in Duet is a three-step process: upload your unlabeled data, teach the model, then publish and deploy it. If you'd like to watch an instructional video on building a document classifier end to end, you can access it here.

You teach your Duet model the semantics of your schema in much the same way you would teach another person: iteratively and interactively. Teaching has three primary stages: defining a schema (which you can edit over time), labeling documents, and resolving prediction conflicts by reviewing suggested ML features and adding the relevant ones. These stages do not follow a strict order. As you edit your schema, add labels, resolve conflicts, and add features, the model continuously updates (that is, trains) in the background. After you label a document, the system presents the next one, and it brings feature suggestions and conflicts to your attention as they arise.

Duet encourages you to build your model as you explore the unlabeled dataset. Rather than committing to a full schema before you have looked at enough data points, build the schema (adding, renaming, deleting, and moving categories) as you label documents and add features to fix conflicts. After teaching your model, test it with test queries to see whether the schema categories meet your quality goals. If they do not, go back to teaching the model.

To start the iterative process and publish the first version of your model, define the simplest schema you can and add categories incrementally as you explore the dataset. Before publishing that first version, provide enough labels for each leaf category you have defined. In the example below, at least 10 labels would be suggested to satisfy the "Payment Methods" and "Orders" categories.

Parent categories do not require positive labels of their own, because labels assigned to their children count as positive labels for the parent. Categories that begin with "Other" do not require positive labels before publishing. As you continue to teach, however, labeling the "Other" categories teaches the system what falls outside every user-defined category, just as labeling the user-defined categories teaches it what belongs in them. This results in a higher-quality model.
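The way child labels roll up to a parent can be sketched in a few lines of Python. This is only an illustrative model, not Duet's implementation: the `Category` class, its fields, the "Billing" parent, and the label counts are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Category:
    """Hypothetical schema node; not part of any Duet API."""
    name: str
    direct_labels: int = 0  # documents labeled with exactly this category
    children: List["Category"] = field(default_factory=list)

    def positive_labels(self) -> int:
        # A parent's positives are its own labels plus those of all descendants.
        return self.direct_labels + sum(c.positive_labels() for c in self.children)

    def leaves(self) -> Iterator["Category"]:
        # Yield only the categories that have no children.
        if not self.children:
            yield self
        for child in self.children:
            yield from child.leaves()


# Illustrative schema: the parent has no direct labels of its own.
billing = Category("Billing", children=[
    Category("Payment Methods", direct_labels=5),
    Category("Orders", direct_labels=5),
    Category("Other Billing"),  # an "Other" category that is not yet labeled
])

print(billing.positive_labels())           # labels rolled up from the children
print([c.name for c in billing.leaves()])  # leaf categories only
```

In this sketch the parent ends up with positive labels even though no document was assigned to it directly, which mirrors the behavior described for parent categories.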

While you are getting started, searching for keywords helps you find documents relevant to your initial schema more quickly. For example, if you are looking for documents that fit a payment category, searching for "payment" will help you find relevant samples.

Quality metrics do not appear on a schema category until you have provided five labels for that category. When you reach that point, the system notifies you with a pop-up.

Feature suggestions and conflicts begin appearing as soon as the system finds them appropriate, but predictions do not appear in the document widget until you have labeled 10 documents in total. When you reach that point, the system notifies you with a pop-up.
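The two label thresholds just described (five labels per category for quality metrics, 10 labeled documents in total for predictions) can be expressed as a small check. This is a sketch under stated assumptions: the constant and function names are hypothetical, and it assumes each labeled document carries exactly one category, so the per-category counts sum to the total number of labeled documents.

```python
from typing import Dict, List

# Thresholds taken from the text; the names are hypothetical.
METRICS_MIN_LABELS_PER_CATEGORY = 5
PREDICTIONS_MIN_TOTAL_LABELS = 10


def categories_with_metrics(label_counts: Dict[str, int]) -> List[str]:
    """Categories that have enough labels for quality metrics to appear."""
    return [name for name, n in label_counts.items()
            if n >= METRICS_MIN_LABELS_PER_CATEGORY]


def predictions_active(label_counts: Dict[str, int]) -> bool:
    """True once the total number of labeled documents reaches the threshold.

    Assumes one category per labeled document.
    """
    return sum(label_counts.values()) >= PREDICTIONS_MIN_TOTAL_LABELS


counts = {"Payment Methods": 6, "Orders": 3}  # made-up counts
print(categories_with_metrics(counts))  # categories with five or more labels
print(predictions_active(counts))       # nine labels in total, below the threshold
```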

Once predictions are active, they will show both on the schema and on the document widget. In the example below, the predicted schema category is "Payment Methods", as indicated by the blue highlight.

If you click the "Publish" tab before labeling 5 documents for each leaf node, the system warns you that the model's quality will likely be insufficient and that you should teach it more.
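As a rough illustration, the pre-publish check can be thought of as follows. The function name and sample counts are hypothetical, and skipping categories that begin with "Other" is an assumption based on the earlier statement that those categories do not require positive labels before publishing; Duet's actual check may differ.

```python
from typing import Dict, List

MIN_LABELS_PER_LEAF = 5  # threshold stated in the text


def leaves_needing_labels(leaf_counts: Dict[str, int]) -> List[str]:
    """Leaf categories below the threshold.

    Assumption: categories whose names begin with "Other" are exempt.
    """
    return [name for name, n in leaf_counts.items()
            if not name.startswith("Other") and n < MIN_LABELS_PER_LEAF]


leaf_counts = {"Payment Methods": 5, "Orders": 2, "Other": 0}  # made-up counts
missing = leaves_needing_labels(leaf_counts)
if missing:
    print("Warning: teach more before publishing:", ", ".join(missing))
```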
