What is a model?¶
A model is a function deployed on a REST endpoint that takes an unstructured document as input and produces a JSON structure categorizing the document as output.
Models are built with machine learning to learn a specific problem space. As you add more examples and features, the model learns more and its predictions become more accurate. Because of this, models are usually specialists at specific tasks.
Some models are used in classifying documents (these models are called classifiers) while others are used to extract entities from text (these models are called extractors).
Do I need to know my final set of categories for my classifier upfront?¶
No. Duet enables you to build your category schema as you explore the unlabeled dataset. You don't need to commit to the full schema upfront. As you see more data points, you can edit the hierarchical schema by adding, renaming, deleting, or moving nodes, without having to repeat any labeling work.
How do I know I am done teaching?¶
You can check the quality metric in the upper right corner of the main window after each update of the model.
How do I design a good schema?¶
Create categories whose semantics do not overlap. The more distinguishable your categories are, the easier it is to teach their semantics to the system. The more they overlap, the harder the system is to teach and the more likely it is to confuse them.
Do you offer data splicing services natively in your platform?¶
We do not. If you have data that you want to splice into different columns and rows to fit our formatting requirements, we suggest the Power Query tool in Microsoft Excel.
What is the difference between classification and extraction?¶
In classification, the whole input document is tagged with a single category from a set of user-defined categories.
In entity extraction, a phrase with its character offset is extracted from the text (e.g. an address, a product name, etc.).
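The difference can be sketched with two illustrative output shapes. The field names below (`category`, `entity`, `phrase`, `offset`) are assumptions made for this example, not Duet's actual response schema.

```python
# Hypothetical output shapes; field names are illustrative, not Duet's schema.
text = "Ship the order to 12 Main Street, Springfield by Friday."

# Classification: a single category for the whole document.
classification = {"category": "Shipping request"}

# Extraction: a phrase plus the character offset where it starts in the text.
phrase = "12 Main Street, Springfield"
extraction = {"entity": "address", "phrase": phrase, "offset": text.find(phrase)}


def span_matches(text, extraction):
    """Check that the offset really points at the extracted phrase."""
    start = extraction["offset"]
    end = start + len(extraction["phrase"])
    return text[start:end] == extraction["phrase"]
```

Because an extraction carries an offset, the caller can locate the phrase in the original text, which a classification result does not allow.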
How many schema categories and features can I have in a model?¶
In Duet, you can create categories in a hierarchy with a maximum depth of 3 levels. Features per model are limited to 1MB of content (phrases), including inactive features.
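As a rough local check of these limits before building a schema, the sketch below measures the depth of a nested-dict schema and the total size of the feature phrases. It makes three assumptions: top-level categories count as level 1, "1MB" means 1,000,000 bytes, and phrases are measured as UTF-8 text. How Duet counts these internally is not documented here.

```python
MAX_DEPTH = 3
MAX_FEATURE_BYTES = 1_000_000  # assumption: "1MB" read as 10**6 bytes


def schema_depth(node):
    """Depth of a nested dict: each key is a category, each value its children."""
    if not node:
        return 0
    return 1 + max(schema_depth(children) for children in node.values())


def feature_bytes(phrases):
    """Total size of all feature phrases, measured as UTF-8 bytes (assumption)."""
    return sum(len(p.encode("utf-8")) for p in phrases)


# Hypothetical schema, with top-level categories counted as level 1.
schema = {
    "Billing": {"Invoices": {"Overdue": {}}, "Refunds": {}},
    "Support": {},
}

within_limits = (
    schema_depth(schema) <= MAX_DEPTH
    and feature_bytes(["refund", "late fee"]) <= MAX_FEATURE_BYTES
)
```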
What is the right ratio between labels and features?¶
The number of features should be at most 15-20% of the number of labels. If you have many features, balance them with enough labels. Too many features can produce a model that overfits the training data and generalizes poorly to unseen data once you deploy it.
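The guideline is simple arithmetic, sketched below. The function names are hypothetical, and the 0.20 default is the upper end of the 15-20% range, not a limit enforced by Duet.

```python
def feature_to_label_ratio(num_features, num_labels):
    """Features as a fraction of labels."""
    if num_labels <= 0:
        raise ValueError("num_labels must be positive")
    return num_features / num_labels


def has_too_many_features(num_features, num_labels, max_ratio=0.20):
    """True when features exceed the guideline share of labels."""
    return feature_to_label_ratio(num_features, num_labels) > max_ratio
```

For example, 150 features against 1,000 labels is a ratio of 0.15, which is within the guideline, while 300 features against the same 1,000 labels is 0.30 and calls for more labels or fewer features.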
How many labels should be added to each category?¶
Semantically difficult categories require more labels than semantically easy ones. In general, 80-100 labels per category is sufficient.
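To see which categories still fall short of that guideline, a label list can be tallied as in the sketch below. The helper name and the default minimum of 80 are assumptions for illustration. Categories with no labels at all do not appear in the tally, so they need a separate check.

```python
from collections import Counter


def under_labeled(labels, minimum=80):
    """Map each category with fewer than `minimum` labels to its label count."""
    counts = Counter(labels)
    return {category: n for category, n in counts.items() if n < minimum}


# Hypothetical labeling state: one category label per labeled document.
labels = ["Billing"] * 90 + ["Support"] * 35
needs_more = under_labeled(labels)
```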
What is a deployed version of a model?¶
Every time you update your model, you need to publish the latest changes to create a new version of the model, which you can start consuming once the publish operation completes. The published version is available through a REST endpoint that takes a single text document as input and produces a JSON response classifying the text into the proper category as output. Learn more about deployments here.
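A minimal sketch of consuming a published version from Python follows, using only the standard library. The endpoint URL, the `text` request field, and the `category` response field are placeholders assumed for this example. The real values, and any authentication headers, come from your own deployment.

```python
import json
import urllib.request

# Hypothetical endpoint; replace with the URL of your published version.
ENDPOINT = "https://example.invalid/models/my-model/versions/1/predict"


def build_request(document, endpoint=ENDPOINT):
    """Build a POST request carrying one text document as JSON."""
    body = json.dumps({"text": document}).encode("utf-8")  # assumed field name
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_category(response_body):
    """Read the predicted category from a JSON response (assumed field name)."""
    return json.loads(response_body)["category"]


# Sending the request (not executed here):
# with urllib.request.urlopen(build_request("Where is my refund?")) as resp:
#     print(parse_category(resp.read()))
```

Keeping request construction and response parsing in separate functions lets you test both without a live endpoint.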
How does the quality metric evolve over the course of the teaching process?¶
The quality metric for a category is not shown during the getting-started phase, when there are not yet enough labels and you cannot publish your model. Once you have added enough labels and Duet enables you to publish your first model version, the quality metric appears for categories with enough labels. Note that it does not appear for most of the nodes whose names start with the "other" prefix, since some of them don't have enough labels to show a quality metric value. The value fluctuates widely at the beginning of the teaching process, when there are not yet enough labels and features. As the model matures, the quality metric becomes much more stable and fluctuates within a very small range.
How are the quality metrics calculated and on which data?¶
The quality metric is calculated on the full sampling set (your unlabeled dataset associated with your model). It is calculated using the current model you have built so far.