Phrases in the training dataset should be direct and simple, just like human-bot communication. When talking or writing to bots, people tend to be clear and concise. User utterances are often closer to search-engine queries than to well-formed sentences.
Of course, users who do not yet realize that they are writing or talking to a bot may build long and complex sentences, but for the sake of the training dataset's effectiveness, it is best to keep training phrases direct and simple.
|Bad phrase||Good phrase|
|I’m calling because I have a credit card and well I was hoping it provides some kind of insurance but I didn’t find anything about it is it possible that you checked that for me||can you check if my credit card provides some insurance|
|I’ve been abroad currently I have a personal account with your bank and I wanted to ask if there is a possibility to open a foreign currency account online cause I can’t go to the bank personally for now||is it possible to open a foreign currency account online if I have a personal account|
As you can see, good training phrases consist of a simple opening phrase:
Can you / Is it possible / I wanted to know if, etc.
followed by words conveying the essential meaning:
important verbs and keywords: credit card, provide insurance, open, foreign currency account, online
At the end of the day, it is these content words that the recognition is based on. This means that if a user provides a longer, non-standard utterance that contains some of these words, the model will still be able to assign an appropriate intent to it.
Searching for unique ways of asking things usually brings no benefit. It’s better to focus on the most common ways of expressing an idea and add simple paraphrases. The more such similar, concise utterances the dataset contains, the better the recognition.
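The role of content words can be illustrated with a toy sketch. This is not how an actual intent recognition model works internally; it is a simplified keyword-overlap scorer, and the intent names and keyword sets below are hypothetical, made up from the examples in the table above.

```python
import re

# Hypothetical intents and keyword sets, chosen only for illustration.
INTENT_KEYWORDS = {
    "card_insurance": {"credit", "card", "provides", "insurance"},
    "open_currency_account": {"open", "foreign", "currency", "account", "online"},
}


def guess_intent(utterance):
    """Return the intent whose keywords overlap most with the utterance."""
    # Lowercase the text and keep only alphabetic tokens.
    words = set(re.findall(r"[a-z]+", utterance.lower()))
    # Score each intent by the number of shared content words.
    scores = {name: len(words & kws) for name, kws in INTENT_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    # No shared content words means no intent can be assigned.
    return best if scores[best] > 0 else None


short = "can you check if my credit card provides some insurance"
long_ = (
    "I'm calling because I have a credit card and well I was hoping it "
    "provides some kind of insurance but I didn't find anything about it"
)
print(guess_intent(short))
print(guess_intent(long_))
```

In this sketch both the concise phrase and the long, rambling one resolve to the same hypothetical intent, because they share the same content words.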
|Bot question||Bad phrase||Good phrase|
|Do you have a business account in our bank?||oh I used to but decided to move it to another bank||no not currently|
|Do you have a loan in our bank?||unfortunately||I do yes|
|Do you use our mobile app?||once in a blue moon||not often to be honest|
Try to avoid abstract phrases that require complex meaning synthesis, unless you need a specific expression to be recognized and associated with a specific intent.
The chances that "once in a blue moon" occurs in user input are very low, so it is best to provide the model with representative data that it is likely to encounter.
It is also good practice to review user utterances once the bot goes public. You can then observe what user expressions look like and which utterances are most common, and add them to your training set to improve recognition.
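One way to find the most common utterances is to count them in your conversation logs. The sketch below assumes the logs are already available as a plain list of strings; the `logged` sample data and the helper names are hypothetical, and how you export logs depends on your platform.

```python
from collections import Counter


def normalize(utterance):
    """Lowercase and collapse whitespace so trivial variants count together."""
    return " ".join(utterance.lower().split())


def most_common_utterances(logged_utterances, top_n=3):
    """Return the top_n most frequent normalized utterances with their counts."""
    counts = Counter(normalize(u) for u in logged_utterances if u.strip())
    return counts.most_common(top_n)


# Hypothetical sample of logged user replies.
logged = [
    "No not currently",
    "no  not currently",
    "I do yes",
    "not often to be honest",
    "no not currently ",
    "I do yes",
]

print(most_common_utterances(logged, top_n=2))
# [('no not currently', 3), ('i do yes', 2)]
```

Frequent utterances found this way are good candidates for new training phrases, after a manual check that each one is assigned to the right intent.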