Jean Du b233d46552
[ctc] KWS with CTC loss training and CTC prefix beam search detection. (#135)
* add ctcloss training scripts.

* update compute_det_ctc

* fix typo.

* add fsmn model, can use pretrained kws model from modelscope.

* Add streaming detection for the CTC model. Add CTC model ONNX export. Add the CTC model's results to the README. For now, CTC model runtime is not supported.

* QA run.sh; the max-pooling training scripts remain compatible. Ready to PR.

* Add a streaming kws demo, support fsmn online forward

* fix typo.

* Align Stream FSMN and Non-Stream FSMN, both in feature extraction and model forward.

* fix repeated activation by adding an interval restriction.

* fix timestamp when subsampling!=1.

* fix flake8, update training script and README, give pretrained ckpt.

* fix quickcheck and flake8

* Add realtime CTC-KWS demo in README.

---------

Co-authored-by: dujing <dujing@xmov.ai>
2023-08-16 10:07:04 +08:00

Comparison among different backbones. All models use max-pooling loss; FRRs with FAR fixed at one false alarm per hour:

| model | params(K) | epoch | hi_xiaowen | nihao_wenwen |
|------------------|-----------|-----------|------------|--------------|
| GRU | 203 | 80(avg30) | 0.088901 | 0.083827 |
| TCN | 134 | 80(avg30) | 0.023494 | 0.029884 |
| DS_TCN | 287 | 80(avg30) | 0.005357 | 0.006390 |
| DS_TCN(spec_aug) | 287 | 80(avg30) | 0.008176 | 0.005075 |
| MDTC | 156 | 80(avg10) | 0.007142 | 0.005920 |
| MDTC_Small | 31 | 80(avg10) | 0.005357 | 0.005920 |

Next, we train models with CTC loss using the DS_TCN and FSMN backbones, and use CTC prefix beam search to decode and detect keywords. Detection can run in either a non-streaming or a streaming fashion.
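Conceptually, CTC prefix beam search can be sketched as below. This is a minimal, self-contained plain-Python version for illustration only; the actual decoder in wekws differs in details such as vocabulary handling, pruning, and keyword scoring:

```python
import math
from collections import defaultdict

def logaddexp(a, b):
    """log(exp(a) + exp(b)), safe for -inf inputs."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def ctc_prefix_beam_search(log_probs, beam_size=3, blank=0):
    """Minimal CTC prefix beam search over a (T, V) log-probability matrix.

    Each prefix keeps a pair (log_pb, log_pnb): the probability of the
    prefix with paths ending in blank vs. ending in its last token.
    Returns the top `beam_size` prefixes with their score pairs.
    """
    beams = {(): (0.0, -math.inf)}
    for t in range(len(log_probs)):
        next_beams = defaultdict(lambda: (-math.inf, -math.inf))
        for prefix, (pb, pnb) in beams.items():
            for v, logp in enumerate(log_probs[t]):
                if v == blank:
                    # Blank extends the same prefix, ending in blank.
                    nb, nnb = next_beams[prefix]
                    next_beams[prefix] = (logaddexp(nb, logaddexp(pb, pnb) + logp), nnb)
                elif prefix and v == prefix[-1]:
                    # Repeat of last token: collapses into the same prefix,
                    # or starts a new occurrence only after a blank.
                    nb, nnb = next_beams[prefix]
                    next_beams[prefix] = (nb, logaddexp(nnb, pnb + logp))
                    new = prefix + (v,)
                    nb, nnb = next_beams[new]
                    next_beams[new] = (nb, logaddexp(nnb, pb + logp))
                else:
                    new = prefix + (v,)
                    nb, nnb = next_beams[new]
                    next_beams[new] = (nb, logaddexp(nnb, logaddexp(pb, pnb) + logp))
        # Keep the best `beam_size` prefixes by total probability.
        beams = dict(sorted(next_beams.items(),
                            key=lambda kv: -logaddexp(*kv[1]))[:beam_size])
    return beams
```

For keyword detection, a keyword is considered activated when a prefix containing its token sequence exceeds a probability threshold.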

Since the FAR is very low when using CTC loss, the following results report FRRs with FAR fixed at one false alarm per 12 hours:
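To make the operating point concrete, one way to compute FRR at a fixed FAR is to pick the score threshold that yields the allowed false-alarm count on the negative (non-keyword) audio, then count rejections on the positive utterances. A hypothetical sketch; the function name and interface are illustrative, not part of wekws:

```python
def frr_at_far(neg_scores, neg_hours, pos_scores, target_far_per_hour=1 / 12):
    """Report (threshold, FRR) at a fixed false-alarm rate.

    neg_scores: peak keyword scores of detection candidates on negative audio
    neg_hours:  total duration of that negative audio, in hours
    pos_scores: peak keyword score of each positive (keyword) utterance
    """
    allowed = int(target_far_per_hour * neg_hours)  # max false alarms allowed
    # Only scores strictly above the threshold fire; setting the threshold
    # at the (allowed+1)-th highest negative score admits at most `allowed`
    # false alarms.
    neg_sorted = sorted(neg_scores, reverse=True)
    threshold = neg_sorted[allowed] if allowed < len(neg_sorted) else 0.0
    rejected = sum(1 for s in pos_scores if s <= threshold)
    return threshold, rejected / len(pos_scores)
```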

Comparison between max-pooling and CTC loss. The CTC model is fine-tuned from a base model pretrained on WenetSpeech (23 epochs, not converged). FRRs with FAR fixed at one false alarm per 12 hours:

| model | loss | hi_xiaowen | nihao_wenwen | model ckpt |
|------------------|-------------|------------|--------------|------------------|
| DS_TCN(spec_aug) | Max-pooling | 0.051217 | 0.021896 | dstcn-maxpooling |
| DS_TCN(spec_aug) | CTC | 0.056574 | 0.056856 | dstcn-ctc |

Comparison between DS_TCN (pretrained on WenetSpeech, 23 epochs, not converged) and FSMN (pretrained with the ModelScope-released xiaoyunxiaoyun model, fully converged). FRRs with FAR fixed at one false alarm per 12 hours:

| model | params(K) | hi_xiaowen | nihao_wenwen | model ckpt |
|------------------|-----------|------------|--------------|-----------|
| DS_TCN(spec_aug) | 955 | 0.056574 | 0.056856 | dstcn-ctc |
| FSMN(spec_aug) | 756 | 0.031012 | 0.022460 | fsmn-ctc |

The DS_TCN model with CTC loss may not reach its best performance here, because its pretraining phase was not sufficiently converged. We recommend using the pretrained FSMN model as the initial checkpoint when training your own model.

Comparison between stream_score_ctc and score_ctc. FRRs with FAR fixed at one false alarm per 12 hours:

| model | stream | hi_xiaowen | nihao_wenwen |
|------------------|--------|------------|--------------|
| DS_TCN(spec_aug) | no | 0.056574 | 0.056856 |
| DS_TCN(spec_aug) | yes | 0.132694 | 0.057044 |
| FSMN(spec_aug) | no | 0.031012 | 0.022460 |
| FSMN(spec_aug) | yes | 0.115215 | 0.020205 |

Note: when using CTC prefix beam search to detect keywords in the streaming case (detection at each frame), we record the probability of a keyword in a decoding path as soon as the keyword appears in that path. Since the probability typically keeps increasing over time, the recorded value is a lower bound, which inflates the False Rejection Rate on the Detection Error Tradeoff (DET) curve. At a given threshold, the actual FRR will be lower than the DET curve suggests.
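The per-frame detection logic described above, including the interval restriction against repeated activations, can be sketched roughly as follows. `StreamingDetector`, its interface, and the default interval are illustrative assumptions, not the actual wekws implementation:

```python
class StreamingDetector:
    """Sketch of per-frame keyword detection on top of prefix beam search.

    Once the keyword token sequence appears at the start of any decoding
    path, its probability is recorded at that frame (a lower bound, since
    it usually grows on later frames). `min_interval` frames must pass
    between two activations to suppress repeated triggers.
    """

    def __init__(self, keyword, threshold, min_interval=50):
        self.keyword = tuple(keyword)   # keyword as a token-id sequence
        self.threshold = threshold
        self.min_interval = min_interval
        self.last_fire = -min_interval  # allow firing on the first frame

    def step(self, frame_idx, beams):
        """beams: {prefix (tuple of token ids): probability in [0, 1]}."""
        for prefix, prob in beams.items():
            if (prefix[:len(self.keyword)] == self.keyword
                    and prob >= self.threshold
                    and frame_idx - self.last_fire >= self.min_interval):
                self.last_fire = frame_idx
                return True, prob
        return False, 0.0
```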

On some small-data KWS tasks, we believe the FSMN-CTC model is more robust than classification models trained with CE/max-pooling loss. For more information and results on the FSMN-CTC KWS model, see ModelScope.

For realtime CTC-KWS, the wave input must be processed in a streaming fashion, including feature extraction, keyword decoding and detection, and some postprocessing. A Python demo is provided; the core code is in wekws/bin/stream_kws_ctc.py, which you can refer to when implementing runtime code.
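The overall realtime loop can be sketched as below. All names here (`extract_feats`, the model's chunked forward with a cache, `detector.step`) are illustrative stand-ins, not the actual API of wekws/bin/stream_kws_ctc.py:

```python
def run_stream(audio_chunks, extract_feats, model, detector):
    """Hypothetical realtime loop: read audio in small chunks, extract
    features incrementally, run the streaming model with cached state
    (e.g. FSMN memory carried across chunks), and feed per-frame
    posteriors to the keyword detector.
    """
    cache = None        # model state carried across chunks
    frame_idx = 0
    hits = []           # (frame index, recorded keyword probability)
    for chunk in audio_chunks:
        feats = extract_feats(chunk)             # (num_frames, feat_dim)
        posteriors, cache = model(feats, cache)  # streaming forward pass
        for post in posteriors:
            fired, prob = detector.step(frame_idx, post)
            if fired:
                hits.append((frame_idx, prob))
            frame_idx += 1
    return hits
```

In a real deployment the chunks would come from a microphone ring buffer, and the feature extractor must keep its own overlap state so that streaming and non-streaming features stay aligned (see the "Align Stream FSMN and Non-Stream FSMN" commit above).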