
Getting Started with Model Teaching

At a high level, creating an entity extraction model in Duet is a three-step process: upload your unlabeled data, teach the model iteratively, and then publish and deploy the model.

You teach your Duet model the semantics of your schema much as you would teach another person: iteratively and interactively. Teaching has three primary stages:

  • defining a schema that can be edited over time as you explore the data,
  • labeling entities in documents, and
  • resolving prediction conflicts by reviewing suggested ML features and adding the relevant features.

These stages do not follow a strict order. As you edit your schema, add labels, resolve conflicts and add features, the model continuously updates (i.e., trains) in the background. Each document that you label is followed by another document, and the system brings feature suggestions and conflicts to your attention as they arise.

This article covers the basic stages and preliminary information for teaching an entity extraction model. More specialized topics, such as the various ways to label text, edit a schema, or fix conflicts by reviewing suggested features, are covered in subsequent articles.

To start the iterative process and publish the first version of your model, define the simplest schema you can and add sub-entities incrementally as you explore the dataset. You discover these requirements as you teach the system the semantics of your schema. For example, to extract billing addresses, you need to label a few addresses and, for each address, its sub-entities: street, city, state and zip.

Duet lets you build entity extractors that detect a single entity, such as a person name, an organization, or an IP clause in a contract. It also lets you build extractors that detect an entity together with its sub-entities by defining a hierarchical schema. For a US address, for example, it is usually not enough to extract the span of text that is an address; you typically also want the address sub-entities such as street, city, state and zip. Hierarchical schemas in Duet support one parent and up to two levels of children, so in the US address example you can divide street further into an apartment number and a street name. The goal of the hierarchy is to detect the right grouping of the sub-entities.
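The depth limit on hierarchical schemas can be illustrated with a small data structure. The sketch below is not Duet's API: the `Entity` class, `validate_schema` function and `MAX_CHILD_LEVELS` constant are hypothetical names chosen for illustration. It only models the rule stated above, one parent with at most two levels of children, using the US address example.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Assumption taken from the text: one parent plus at most two levels of children.
MAX_CHILD_LEVELS = 2


@dataclass
class Entity:
    """Hypothetical schema node; not a Duet class."""
    name: str
    children: list[Entity] = field(default_factory=list)

    def depth_below(self) -> int:
        """Return the number of child levels beneath this entity."""
        if not self.children:
            return 0
        return 1 + max(child.depth_below() for child in self.children)


def validate_schema(root: Entity) -> None:
    """Raise ValueError if the hierarchy is deeper than the allowed limit."""
    levels = root.depth_below()
    if levels > MAX_CHILD_LEVELS:
        raise ValueError(
            f"{root.name!r} has {levels} child levels; the limit is {MAX_CHILD_LEVELS}"
        )


# The US address schema from the text: street is split one level further.
address = Entity("Address", [
    Entity("Street", [Entity("Apartment Number"), Entity("Street Name")]),
    Entity("City"),
    Entity("State"),
    Entity("Zip"),
])
validate_schema(address)  # passes: two levels of children under "Address"
```

Starting with only `Entity("Address")` and adding children later mirrors the advice to begin with the simplest schema and grow it as you explore the data.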

To obtain addresses from your dataset to label, press “Next Sample” to move on to the next document in the dataset. Alternatively, search for a keyword that you expect to be associated with an address; the search returns documents that contain your keywords.

Colored solid underlines beneath tokens in the document correspond to the colored labels in the schema. These are predictions by the system that you need to verify or correct. In this example, they are correct. To add a label, select the proper entity or sub-entity in the schema on the right, then highlight the matching words in the document by clicking and dragging your cursor over them, or double-click a word to label that single word. In this example, select each entity in turn and drag over the text that matches it.

Once labeled, each entity appears with a colored highlight that matches the color of its schema category.
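One way to picture what the labels encode is as token spans, with each sub-entity span nested inside a parent span. The following sketch rests on that assumption; Duet's internal representation is not described here, and `Label` and `group_by_parent` are hypothetical names. It shows how sub-entity labels can be grouped under the parent "Address" label that contains them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Label:
    """Hypothetical label: an entity name over the token range [start, end)."""
    entity: str
    start: int  # index of the first labeled token (inclusive)
    end: int    # index one past the last labeled token (exclusive)


def group_by_parent(labels, parent="Address"):
    """Map each parent label to the sub-entity labels that lie inside it."""
    parents = [label for label in labels if label.entity == parent]
    groups = {p: [] for p in parents}
    for label in labels:
        if label.entity == parent:
            continue
        for p in parents:
            if p.start <= label.start and label.end <= p.end:
                groups[p].append(label)
                break
    return groups


# Illustrative document, split into tokens on whitespace.
tokens = "Ship to 12 Main St , Springfield , IL 62704 today".split()

labels = [
    Label("Address", 2, 10),  # "12 Main St , Springfield , IL 62704"
    Label("Street", 2, 5),    # "12 Main St"
    Label("City", 6, 7),      # "Springfield"
    Label("State", 8, 9),     # "IL"
    Label("Zip", 9, 10),      # "62704"
]

groups = group_by_parent(labels)
for parent_label, children in groups.items():
    print(parent_label.entity, [child.entity for child in children])
```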

To learn more about search queries and obtaining documents from your dataset, please read about data exploration.

Duet encourages you to build your entity extractor as you explore the unlabeled dataset, so schema edits are expected as you label entities within documents and add features to fix conflicts. Avoid committing to your full schema before you have looked at enough data points. Instead, build the schema (add, rename, delete or move entities) as you go. After you teach your model, test it with test queries to see whether the quality of the schema categories meets your target quality goals. If it does not, go back to teaching the model.
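The text does not say how quality is measured. As one possible reading, the sketch below scores predictions by exact-match precision and recall against hand-verified spans and compares both to a target. The metric, the 0.9 target, and the names `precision_recall` and `meets_target` are all assumptions made for illustration, not Duet behavior.

```python
def precision_recall(predicted: set, expected: set) -> tuple[float, float]:
    """Exact-match precision and recall over (entity, start, end) spans."""
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


def meets_target(predicted: set, expected: set, target: float = 0.9) -> bool:
    """Return True when both precision and recall reach the assumed target."""
    precision, recall = precision_recall(predicted, expected)
    return precision >= target and recall >= target


# Hand-verified spans for one test document (illustrative data).
expected = {("City", 6, 7), ("State", 8, 9), ("Zip", 9, 10)}
# Model output: the Zip span starts one token too early, so it does not match.
predicted = {("City", 6, 7), ("State", 8, 9), ("Zip", 8, 10)}

if not meets_target(predicted, expected):
    print("Below target: label more documents or add features, then retest.")
```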

Keep in mind a few things:

  • As you label the parent "Address" on the first document, the system automatically labels the tokens preceding and following your label as "Not Address".
  • For the first 3 documents, errors and predictions will not appear. Only feature suggestions will appear.
  • For the few entities you defined, it is highly recommended that you provide 5 labeled segments for each entity including the top parent entity ("Address") before you publish the first version of your model.
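Two of the points above can be sketched in code: the implicit "Not Address" labels and the recommendation of 5 labeled segments per entity before the first publish. The function names are hypothetical. The sketch also assumes that every token outside the parent span counts as "Not Address", which the text does not state precisely.

```python
from collections import Counter

# Recommended number of labeled segments per entity before the first publish.
MIN_SEGMENTS = 5


def implicit_negatives(num_tokens, parent_start, parent_end):
    """Token indices outside the parent span [parent_start, parent_end).

    Assumption: all tokens before and after the span are treated as
    "Not Address", not only the immediately adjacent ones.
    """
    return [i for i in range(num_tokens) if i < parent_start or i >= parent_end]


def entities_needing_labels(schema_entities, labeled_entities):
    """Map each under-labeled entity to the number of segments still missing."""
    counts = Counter(labeled_entities)
    return {
        name: MIN_SEGMENTS - counts[name]
        for name in schema_entities
        if counts[name] < MIN_SEGMENTS
    }


# An 11-token document with "Address" labeled over tokens 2 through 9.
negatives = implicit_negatives(11, 2, 10)

# Entity names of all segments labeled so far (illustrative data).
labeled_so_far = ["Address"] * 5 + ["Street"] * 5 + ["City"] * 3
missing = entities_needing_labels(["Address", "Street", "City"], labeled_so_far)
print(negatives, missing)
```

An empty result from `entities_needing_labels` would indicate that every entity in the schema, including the top parent, has reached the recommended count.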