Comparison among different backbones. False rejection rates (FRRs) with the false alarm rate (FAR) fixed at once per hour:

| model                 | params(K) | epoch     | hi_xiaowen | nihao_wenwen |
|-----------------------|-----------|-----------|------------|--------------|
| GRU                   | 203       | 80(avg30) | 0.088901   | 0.083827     |
| TCN                   | 134       | 80(avg30) | 0.023494   | 0.029884     |
| DS_TCN                | 287       | 80(avg30) | 0.005357   | 0.006390     |
| DS_TCN(spec_aug)      | 287       | 80(avg30) | 0.008176   | 0.005075     |
| MDTC                  | 156       | 80(avg10) | 0.007142   | 0.005920     |
| MDTC_Small            | 31        | 80(avg10) | 0.005357   | 0.005920     |
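
The "FRR with FAR fixed at once per hour" metric used above can be sketched as follows. This is a minimal illustration, not the scoring script used to produce the table: the function name `frr_at_fixed_far`, its arguments, and the toy scores are all assumptions. It picks the lowest threshold whose false alarms per hour on negative audio stay within the budget, then reports the fraction of keyword utterances rejected at that threshold.

```python
def frr_at_fixed_far(pos_scores, neg_scores, neg_hours, max_fa_per_hour=1.0):
    """Return (threshold, frr) at the lowest threshold meeting the FAR budget.

    pos_scores: detection scores of utterances that contain the keyword.
    neg_scores: detection scores of utterances that do not contain it.
    neg_hours:  total duration of the negative audio, in hours.
    """
    # Every observed score is a candidate operating point.
    candidates = sorted(set(pos_scores) | set(neg_scores))
    for thr in candidates:
        # A negative utterance scoring at or above the threshold is a false alarm.
        false_alarms = sum(s >= thr for s in neg_scores)
        if false_alarms / neg_hours <= max_fa_per_hour:
            # A positive utterance scoring below the threshold is a false rejection.
            false_rejects = sum(s < thr for s in pos_scores)
            return thr, false_rejects / len(pos_scores)
    # No candidate meets the budget: the only safe choice rejects everything.
    return float("inf"), 1.0


# Toy, made-up scores: 4 keyword utterances and 2 hours of negative audio.
thr, frr = frr_at_fixed_far(
    pos_scores=[0.9, 0.8, 0.4, 0.95],
    neg_scores=[0.1, 0.2, 0.5, 0.85],
    neg_hours=2.0,
    max_fa_per_hour=0.5,
)
```

With these toy values the budget allows one false alarm in total, so the threshold lands at 0.8 and one of the four keyword utterances is rejected.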

Next, we train DS_TCN and FSMN models with CTC loss
and use CTC prefix beam search to decode and detect keywords.
Detection runs in either non-streaming or streaming fashion.
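
To make the decoding step concrete, here is a minimal, non-streaming sketch of CTC prefix beam search followed by a keyword lookup. It is a textbook version written for illustration, not the implementation in this repository: the function names, the use of plain probabilities instead of log-probabilities, and taking the hypothesis probability as the keyword score are all simplifying assumptions.

```python
from collections import defaultdict


def ctc_prefix_beam_search(probs, beam_size=3, blank=0):
    """probs: list of frames, each a list of per-token probabilities.

    Returns [(prefix, probability)] sorted from most to least likely.
    """
    # prefix -> (prob of paths ending in blank, prob of paths ending in non-blank)
    beams = {(): (1.0, 0.0)}
    for frame in probs:
        next_beams = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_b, p_nb) in beams.items():
            for token, p in enumerate(frame):
                if token == blank:
                    # Blank keeps the prefix unchanged.
                    next_beams[prefix][0] += (p_b + p_nb) * p
                elif prefix and token == prefix[-1]:
                    # Repeated token: collapses into the same prefix...
                    next_beams[prefix][1] += p_nb * p
                    # ...unless a blank separated the two occurrences.
                    next_beams[prefix + (token,)][1] += p_b * p
                else:
                    next_beams[prefix + (token,)][1] += (p_b + p_nb) * p
        ranked = sorted(next_beams.items(), key=lambda kv: sum(kv[1]), reverse=True)
        beams = {prefix: (p_b, p_nb) for prefix, (p_b, p_nb) in ranked[:beam_size]}
    return [(prefix, p_b + p_nb) for prefix, (p_b, p_nb) in beams.items()]


def find_keyword(hyps, keyword):
    """Return the score of the best hypothesis containing `keyword`, else None."""
    keyword = tuple(keyword)
    n = len(keyword)
    for prefix, score in hyps:  # hyps are already sorted best-first
        for start in range(len(prefix) - n + 1):
            if prefix[start:start + n] == keyword:
                return score
    return None


# Toy posteriors over {0: blank, 1, 2} for two frames (made-up numbers).
hyps = ctc_prefix_beam_search([[0.1, 0.8, 0.1], [0.1, 0.1, 0.8]])
score = find_keyword(hyps, keyword=(1, 2))
```

In a streaming setting the same search would be advanced one frame at a time and the keyword lookup repeated after each frame, rather than once at the end.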

Since the FAR is pretty low when using CTC loss,
the following results are FRRs with FAR fixed at once per 12 hours:

Comparison between Max-pooling and CTC loss.
The CTC model is fine-tuned from a base model trained on WenetSpeech (23 epochs).
FRRs with FAR fixed at once per 12 hours:


| model                 | loss        | hi_xiaowen | nihao_wenwen |
|-----------------------|-------------|------------|--------------|
| DS_TCN(spec_aug)      | Max-pooling | 0.051217   | 0.021896     |
| DS_TCN(spec_aug)      | CTC         | 0.056574   | 0.056856     |


Comparison between DS_TCN (pretrained on WenetSpeech, 23 epochs)
and FSMN (xiaoyunxiaoyun model released on modelscope).
FRRs with FAR fixed at once per 12 hours:

| model                 | params(K)   | hi_xiaowen | nihao_wenwen |
|-----------------------|-------------|------------|--------------|
| DS_TCN(spec_aug)      | 955         | 0.056574   | 0.056856     |
| FSMN(spec_aug)        | 756         | 0.031012   | 0.022460     |

Comparison between stream_score_ctc and score_ctc.
FRRs with FAR fixed at once per 12 hours:

| model                 | stream      | hi_xiaowen | nihao_wenwen |
|-----------------------|-------------|------------|--------------|
| DS_TCN(spec_aug)      | no          | 0.056574   | 0.056856     |
| DS_TCN(spec_aug)      | yes         | 0.132694   | 0.057044     |
| FSMN(spec_aug)        | no          | 0.031012   | 0.022460     |
| FSMN(spec_aug)        | yes         | 0.115215   | 0.020205     |

Note: when using CTC prefix beam search to detect keywords in the streaming case (detection runs at every frame),
we record the probability of a keyword in a decoding path as soon as the keyword appears in that path.
This probability actually keeps increasing over time, so the recorded value is lower than the final one,
which results in a higher False Rejection Rate in the Detection Error Tradeoff (DET) result.
The actual FRR will be lower than what the DET curve gives at a given threshold.
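
The effect described in this note can be illustrated with a toy example. The per-frame scores and helper names below are hypothetical and only show the reasoning: the DET evaluation sees the score recorded at the first frame where the keyword appears, while a deployed streaming detector can still fire at a later frame once the score has grown past the threshold.

```python
def first_hit_score(frame_scores):
    """Score recorded at the first frame where the keyword appears in a path."""
    for score in frame_scores:
        if score is not None:
            return score
    return None


def accepted_in_det(frame_scores, threshold):
    """What the DET evaluation sees: only the first recorded score."""
    score = first_hit_score(frame_scores)
    return score is not None and score >= threshold


def fires_in_practice(frame_scores, threshold):
    """What a running detector does: fire at any frame that crosses the threshold."""
    return any(s is not None and s >= threshold for s in frame_scores)


# Made-up keyword scores per frame; None means the keyword is not in any path yet.
frame_scores = [None, None, 0.35, 0.6, 0.8]
det_accept = accepted_in_det(frame_scores, threshold=0.5)
live_accept = fires_in_practice(frame_scores, threshold=0.5)
```

Here the utterance counts as a false rejection in the DET result (0.35 is below 0.5), although a live detector with the same threshold would trigger two frames later.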

For now, the model trained with CTC loss may not reach the best performance,
but it is more robust than the classification model trained with CE/Max-pooling loss.
For more results of the FSMN-CTC KWS model, see [modelscope](https://modelscope.cn/models/damo/speech_charctc_kws_phone-wenwen/summary).