Teach the Model¶
Teaching a model is an iterative process that continues until each entity in your schema reaches the quality you want. Teaching happens interactively. The first step is labeling a few segments of text that teach the system the semantics of your schema. For example, if you want to extract billing addresses, you need to label a few addresses. For each address, you also need to label its sub-entities, namely street, city, state, and zip.

Duet allows you to build entity extractors that detect a single entity such as a person name, an organization, or an IP clause in a contract. Additionally, Duet gives you the power to build extractors that detect an entity together with its sub-entities by defining hierarchical schemas. For example, if you want to extract a US address, it is usually not enough to extract the span of text that is the address; you typically also want to extract its sub-entities such as street, city, state, and zip. Hierarchical schemas in Duet support one parent and up to two levels of children, so in the US address example you can divide street further into an apartment number and a street name.

The goal of the hierarchy is to detect the right grouping of the sub-entities. If two US addresses appear in the same paragraph, the hierarchical schema helps the system group each street, city, state, and zip with the appropriate address. We highly recommend using hierarchical schemas in Duet whenever you need sub-entities that are grouped appropriately. Correctly grouped sub-entities are what make extracted entities actionable in downstream tasks that involve automation workflows.
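To make the structure concrete, here is a minimal sketch of the US-address schema as a nested Python dictionary. This is purely illustrative: Duet's internal schema representation is not public, and all names here are assumptions taken from the example above.

```python
# Hypothetical representation of a hierarchical schema (not Duet's actual
# data model): one parent entity with up to two levels of children.
ADDRESS_SCHEMA = {
    "Address": {                    # top parent entity
        "Street": {                 # first level of children
            "ApartmentNumber": {},  # second level: street subdivided further
            "StreetName": {},
        },
        "City": {},
        "State": {},
        "Zip": {},
    }
}

def depth(schema: dict) -> int:
    """Nesting depth below a node (a leaf entity has depth 0)."""
    if not schema:
        return 0
    return 1 + max(depth(child) for child in schema.values())
```

Under this sketch, `depth(ADDRESS_SCHEMA["Address"])` is 2, matching the "up to two levels of children" limit described above.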
To get started teaching a new entity extraction model, we strongly recommend putting one or more search keywords in the search bar to help find positive examples of your entities. To find addresses to label, you can press "Next Sample" to move to the next document in the dataset. Alternatively, you can search for a keyword that you expect to be positively associated with an address; searching queries the documents that contain your keywords. Read each document and label the spans of words that fit the entities in your schema. For example, we can search for the word "avenue" to find some US addresses in our dataset.
You can label entities and sub-entities in the following ways:
Select the proper entity or sub-entity on the right, then highlight words in the document by clicking and dragging your cursor over them, or by double-clicking to label a single word. Early in teaching, we recommend labeling both the sub-entities and the top parent entity, since labeling the parent teaches the system the proper grouping of the sub-entities. In this example, an address is being predicted in a document. The prediction is shown as solid underlines colored to match the schema categories on the right. It looks correct, so let's label it. Select "Address" on the right, click at "701", and drag all the way to "20004". Then select each sub-entity and label the corresponding address components.
Once labeled, each entity's label appears as a highlight over the segment, and each schema entity gets its own highlight color. You must select a schema entity on the right in order to see its labels in the document. Here, "Address" is selected, so its green highlight is visible. You can also see that the solid prediction underlines remain alongside the highlight.
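One way to picture what a labeled parent entity and its grouped sub-entities amount to is a set of character spans, as in the hedged sketch below. The field names and the full address text are invented for illustration (the document only mentions the tokens "701" and "20004"); Duet's actual label format is not exposed.

```python
# Hypothetical span-based view of a labeled address; not Duet's real format.
text = "701 Pennsylvania Avenue NW, Washington, DC 20004"

labels = [
    {"entity": "Address", "start": 0,  "end": 48},  # parent span
    {"entity": "Street",  "start": 0,  "end": 26},  # sub-entities inside it
    {"entity": "City",    "start": 28, "end": 38},
    {"entity": "State",   "start": 40, "end": 42},
    {"entity": "Zip",     "start": 43, "end": 48},
]

def children_of(parent, labels):
    """Sub-entity labels whose spans fall inside the parent's span."""
    return [l for l in labels
            if l is not parent
            and parent["start"] <= l["start"] and l["end"] <= parent["end"]]
```

The grouping the hierarchy provides is exactly this containment: every sub-entity span sits inside the parent "Address" span, so two addresses in one paragraph never mix up their components.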
After you label your first 3 documents, the system will ask you to verify a specific phrase in every new document you pull up through search or by clicking "Next Sample". In the example below, the system asks whether "a" is part of "Street". Answer yes (click the tick), no (click the "x"), or I don't know (click "?"); your answer submits a positive, negative, or undecided label, respectively. In this example, click "x" to submit a negative label of the token "a" for the entity "Street". Always answer the Verify question with the right label for the token the system asks about, as this helps the system improve the quality of the model.
When the system generates positive predictions that you want to mark as wrong, you can label one or more tokens as negative. To do so, click the "Flip Label" button and then click the token. The Flip Label button flips a single token to a positive label if it was predicted negative, or to a negative label if it was predicted positive, with respect to the entity selected on the right. Providing negative labels fixes the false positive predictions.
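The flip is a simple per-token toggle for the currently selected entity. The sketch below shows the idea with an assumed boolean-per-token representation; the function name and data layout are hypothetical, not part of Duet.

```python
# Hedged sketch of the "Flip Label" behavior: toggle one token's label
# (True = positive, False = negative) for the selected entity.
def flip_label(token_labels, token_index):
    """Flip a single token between positive and negative."""
    token_labels[token_index] = not token_labels[token_index]
    return token_labels

# A token falsely predicted positive for "Street" can be flipped negative:
street_labels = [True, True, False]  # per-token labels for "Street"
flip_label(street_labels, 1)         # token 1 was a false positive
```

Flipping the same token again restores its original label, which matches the button's symmetric behavior described above.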
In teaching, it is not enough to label US addresses and their components; it is equally important to tell the system which clues let you, as a subject matter expert, identify a span of text as a US address. This is what we call "teaching through features". When you label, the system will sometimes suggest features and ask you to confirm them. It phrases each suggestion as an English question to make clearer what the system is trying to figure out. In the example below, the system is asking whether "StreetNumber" is found at the end of an "Address". The system has not yet been taught that those numbers at the end of the address are a "Zip Code", so you would not want to add this feature. Meanwhile, the feature of "zip" coming at the end of an "Address" is a good one to add. Because suggested features are framed as plain English questions, you can judge whether each suggestion matches what you know as a subject matter expert.
Features will be suggested by the system when your labels don't match the system prediction. The two forms of feature suggestions are:
- Dictionary phrases, which can represent concepts like "city", and
- Context features, which combine two dictionary features into a lexical pattern. For example, a context feature composed of "city" preceding "state" helps capture something like "Bellevue, Washington".
Read more about feature suggestions at the bottom of this page.
The teaching loop is fully interactive: the model is updated every time you edit the schema, label a new document, or add a feature the system suggests when there are conflicts. To start the iterative process and publish the first version of your model, define the simplest schema you can and incrementally add more sub-entities as you explore the dataset. Keep in mind a few things:
- As you label the parent "Address" on the first document, it will automatically label the tokens preceding and following your label as "Not Address".
- For the first 3 documents, errors and predictions will not appear. Only feature suggestions will appear.
- For the few entities you defined, we highly recommend providing 5 labeled segments for each entity, including the top parent entity (address), before you publish the first version of your model.
Once you move to the next document, the top right of the page will update to reflect the total number of labeled documents you have added so far. The counters next to each entity on the right will also update after the model automatically retrains on the label you just submitted. You may see conflicts between what you labeled and what the system predicts. In that case, review the suggested features at the bottom right corner of the page and select the ones relevant to each entity; this fixes the conflicts and improves the overall quality of the model.
To edit the schema:
Plan your changes: which entities you'd like to add, delete, rename, or move to another parent node.
Select the pencil icon next to Categories on the right.
Make desired changes. Press "Done" to finalize.
As you update the model by editing the schema or adding labels, conflicts between your labels and the system's predictions will arise. The system suggests features to remedy these conflicts. Feature suggestions appear at the bottom right of the page; Duet shows only the five best suggestions at a time, and the panel is closed when there are none. Hovering over a suggested feature shows the category it is suggested for. Review the content of each feature before adding it: make sure it is relevant to the category and will help the system distinguish that category from the other categories.
Clicking a suggestion will open a popup where you can add it as a new feature or add the suggested phrases to an existing feature.
At the beginning of the model teaching process, the model won't have enough features to detect entities correctly, so you might see the system make far too many positive predictions. An example of this overshoot is shown below. When this occurs, a simple fix is to label one or more of the tokens predicted positive as negative for the parent entity "Address": click the "Flip Label" button and then click the token, as described above. These negative labels fix the false positive predictions.