Comparison among different backbones. FRRs with the FAR fixed at one false alarm per hour:

| model | params(K) | epoch | hi_xiaowen | nihao_wenwen |
|---|---|---|---|---|
| GRU | 203 | 80(avg30) | 0.088901 | 0.083827 |
| TCN | 134 | 80(avg30) | 0.023494 | 0.029884 |
| DS_TCN | 287 | 80(avg30) | 0.005357 | 0.006390 |
| DS_TCN(spec_aug) | 287 | 80(avg30) | 0.008176 | 0.005075 |
| MDTC | 156 | 80(avg10) | 0.007142 | 0.005920 |
| MDTC_Small | 31 | 80(avg10) | 0.005357 | 0.005920 |
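
As a rough illustration of how an "FRR at a fixed FAR" number can be read off a set of detection scores, the sketch below picks the lowest threshold that keeps false alarms within the allowed budget and then counts rejected keyword utterances. It is a simplified sketch, not the scoring script used for the table: the function name `frr_at_fixed_far`, the one-score-per-utterance inputs, and the rule "a score strictly above the threshold triggers" are all assumptions made for the example.

```python
def frr_at_fixed_far(keyword_scores, filler_scores, filler_hours,
                     max_fa_per_hour=1.0):
    """Return (threshold, frr) at the given false-alarm budget.

    keyword_scores: one detection score per positive (keyword) utterance.
    filler_scores:  one detection score per negative utterance.
    filler_hours:   total duration of the negative data, in hours.
    A detection is assumed to fire when score > threshold.
    """
    # Number of false alarms we may tolerate over the whole negative set.
    allowed_fa = int(max_fa_per_hour * filler_hours)

    # Highest negative scores first; the (allowed_fa + 1)-th one must NOT fire.
    neg = sorted(filler_scores, reverse=True)
    threshold = neg[allowed_fa] if allowed_fa < len(neg) else 0.0

    # A keyword utterance is falsely rejected if it does not exceed the threshold.
    false_rejects = sum(1 for s in keyword_scores if s <= threshold)
    return threshold, false_rejects / len(keyword_scores)


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    kw = [0.9, 0.8, 0.25, 0.95]
    neg = [0.5, 0.3, 0.2, 0.1]
    print(frr_at_fixed_far(kw, neg, filler_hours=1.0))
```

Tightening the budget (for example from once per hour to once per 12 hours) raises the threshold, so the FRR at the tighter operating point can only stay the same or grow.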

Next, we train the model with CTC loss, using the DS_TCN and FSMN backbones, and use CTC prefix beam search to decode and detect keywords. Detection runs in either non-streaming or streaming fashion.
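
For readers unfamiliar with the decoding step, here is a minimal pure-Python sketch of CTC prefix beam search followed by a keyword check. It is not the recipe's implementation: the helper names (`ctc_prefix_beam_search`, `detect_keyword`), the use of plain probabilities instead of log-probabilities, and the choice to score a keyword by the total probability of the best surviving hypothesis that contains it are assumptions made to keep the example short.

```python
from collections import defaultdict


def ctc_prefix_beam_search(probs, beam_size=3, blank=0):
    """probs: list of frames, each a list of per-token probabilities.
    Returns [(prefix_tuple, probability)] for the surviving hypotheses."""
    # Each prefix keeps two masses: paths ending in blank / in a non-blank.
    beams = {(): (1.0, 0.0)}
    for frame in probs:
        nxt = defaultdict(lambda: [0.0, 0.0])
        for prefix, (pb, pnb) in beams.items():
            for tok, p in enumerate(frame):
                if tok == blank:
                    # Blank keeps the prefix unchanged.
                    nxt[prefix][0] += (pb + pnb) * p
                elif prefix and tok == prefix[-1]:
                    # Repeat without a blank in between collapses into the prefix;
                    # a repeat after a blank starts a new token.
                    nxt[prefix][1] += pnb * p
                    nxt[prefix + (tok,)][1] += pb * p
                else:
                    nxt[prefix + (tok,)][1] += (pb + pnb) * p
        ranked = sorted(nxt.items(), key=lambda kv: kv[1][0] + kv[1][1],
                        reverse=True)
        beams = {k: (v[0], v[1]) for k, v in ranked[:beam_size]}
    return [(prefix, pb + pnb) for prefix, (pb, pnb) in beams.items()]


def contains(prefix, keyword):
    """True if `keyword` occurs as a contiguous token run inside `prefix`."""
    n = len(keyword)
    return any(prefix[i:i + n] == keyword for i in range(len(prefix) - n + 1))


def detect_keyword(probs, keyword, beam_size=3, blank=0):
    """Score of the best surviving hypothesis that contains the keyword."""
    keyword = tuple(keyword)
    hits = [score
            for prefix, score in ctc_prefix_beam_search(probs, beam_size, blank)
            if contains(prefix, keyword)]
    return max(hits, default=0.0)


if __name__ == "__main__":
    # Toy posteriors over {0: blank, 1, 2}; hypothetical values.
    frames = [[0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
    print(detect_keyword(frames, keyword=(1, 2)))
```

In the non-streaming case a function like `detect_keyword` would be called once on the whole utterance; in the streaming case the beam would be updated frame by frame and the keyword check repeated after each update.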

Since the FAR is very low when using CTC loss, the following results report FRRs with the FAR fixed at one false alarm per 12 hours.

Comparison between Max-pooling and CTC loss. The CTC model is fine-tuned from a base model trained on WenetSpeech (23 epochs). FRRs with the FAR fixed at one false alarm per 12 hours:

| model | loss | hi_xiaowen | nihao_wenwen |
|---|---|---|---|
| DS_TCN(spec_aug) | Max-pooling | 0.051217 | 0.021896 |
| DS_TCN(spec_aug) | CTC | 0.056574 | 0.056856 |

Comparison between DS_TCN (pretrained on WenetSpeech, 23 epochs) and FSMN (the xiaoyunxiaoyun model released on modelscope). FRRs with the FAR fixed at one false alarm per 12 hours:

| model | params(K) | hi_xiaowen | nihao_wenwen |
|---|---|---|---|
| DS_TCN(spec_aug) | 955 | 0.056574 | 0.056856 |
| FSMN(spec_aug) | 756 | 0.031012 | 0.022460 |

Comparison between stream_score_ctc (streaming) and score_ctc (non-streaming). FRRs with the FAR fixed at one false alarm per 12 hours:

| model | stream | hi_xiaowen | nihao_wenwen |
|---|---|---|---|
| DS_TCN(spec_aug) | no | 0.056574 | 0.056856 |
| DS_TCN(spec_aug) | yes | 0.132694 | 0.057044 |
| FSMN(spec_aug) | no | 0.031012 | 0.022460 |
| FSMN(spec_aug) | yes | 0.115215 | 0.020205 |

Note: when using CTC prefix beam search to detect keywords in the streaming case (detecting at each frame), we record the probability of a keyword in a decoding path as soon as the keyword appears in that path. In practice this probability keeps increasing over time, so the recorded value is lower than the final one, which results in a higher False Rejection Rate (FRR) in the Detection Error Tradeoff (DET) result. The actual FRR will be lower than the DET curve indicates at a given threshold.
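
The effect described in the note can be illustrated with a toy example. The per-frame scores below are made up, and the helper names `first_hit_score` and `best_score` are hypothetical; the sketch only shows why a score stored at the first appearance of the keyword can fall under a threshold that a frame-by-frame detector would still cross a few frames later.

```python
def first_hit_score(frame_scores):
    """Score stored when the keyword first appears in a decoding path.
    `None` marks frames where the keyword is not yet in any path."""
    for score in frame_scores:
        if score is not None:
            return score
    return 0.0


def best_score(frame_scores):
    """Highest keyword score reached at any frame of the utterance."""
    return max((s for s in frame_scores if s is not None), default=0.0)


if __name__ == "__main__":
    # Hypothetical keyword-path probabilities per frame, rising over time.
    scores = [None, None, 0.35, 0.6, 0.82]
    threshold = 0.5

    recorded = first_hit_score(scores)   # value that enters the DET curve
    reached = best_score(scores)         # value a live detector would see

    print("counted as rejected in DET:", recorded <= threshold)
    print("would fire when thresholding each frame:", reached > threshold)
```

With these numbers the utterance counts as a false rejection in the DET statistics, although a detector that compares every frame against the threshold would have triggered at the fourth frame.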

Currently, the model trained with CTC loss may not achieve the best performance, but it is more robust than the classification model trained with CE/Max-pooling loss.
For more results of the FSMN-CTC KWS model, see modelscope.