Balancing the dataset


It is worth verifying that no linguistic structure in the phrase collection is represented by only one or two examples.

For example, the CANCELLATION intent might contain 200 phrases with the structure "I want to cancel the product" but only two phrases like "I do not want this product anymore". In this case, the structure "I do not want + the product" will be recognized much worse than "I want to cancel" / "I want to cancel the product".

In this situation, we recommend adding more paraphrases of the underrepresented structure so that both structures are recognized equally well. Having too few phrases with a given linguistic structure often also shows up in the confusion matrix discussed earlier.
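
The check described above can be approximated with a short script. The following is only a minimal sketch: the regular expressions in `STRUCTURE_PATTERNS`, the function names, and the `min_examples` threshold of 10 are assumptions made for illustration and are not part of the product. It counts how many phrases of one intent match each structure and lists the structures that have too few examples.

```python
import re
from collections import Counter

# Hypothetical structure patterns for the CANCELLATION intent.
# These regexes are illustrative assumptions; adapt them to your own data.
STRUCTURE_PATTERNS = {
    "i_want_to_cancel": re.compile(r"^i want to cancel\b", re.IGNORECASE),
    "i_do_not_want": re.compile(r"^i (do not|don't) want\b", re.IGNORECASE),
}


def count_structures(phrases, patterns=STRUCTURE_PATTERNS):
    """Count phrases per structure; unmatched phrases are counted as 'other'."""
    counts = Counter({name: 0 for name in patterns})
    for phrase in phrases:
        for name, pattern in patterns.items():
            if pattern.search(phrase.strip()):
                counts[name] += 1
                break
        else:
            # No pattern matched this phrase.
            counts["other"] += 1
    return counts


def underrepresented(counts, min_examples=10):
    """Return the structures with fewer than min_examples phrases (assumed threshold)."""
    return sorted(
        name for name, n in counts.items() if name != "other" and n < min_examples
    )


# Toy data mirroring the example above: 200 phrases versus 2 phrases.
phrases = (
    ["I want to cancel the product"] * 200
    + ["I do not want this product anymore"] * 2
)

counts = count_structures(phrases)
print(underrepresented(counts))  # prints ['i_do_not_want']
```

Structures reported by `underrepresented` are the candidates for additional paraphrases. Simple prefix patterns like these only catch surface forms, so a real collection would likely need broader patterns or a manual review of the phrases counted as `other`.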
