FunASR Speech Recognition: Environment Setup and Inference
I. Environment Setup
Source code: FunASR
FunASR/README_zh.md at main · alibaba-damo-academy/FunASR · GitHub
1. Create a virtual environment
conda create -n funasr python==3.9 -y
conda activate funasr
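2. Download the model
Streaming speech recognition model: FunASR speech recognition model download
II. Running Inference
1. Streaming (real-time) speech recognition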
import os

import soundfile
from funasr import AutoModel

chunk_size = [0, 10, 5]  # [0, 10, 5] -> 600 ms, [0, 8, 4] -> 480 ms
encoder_chunk_look_back = 4  # number of chunks to look back for encoder self-attention
decoder_chunk_look_back = 1  # number of encoder chunks to look back for decoder cross-attention

model = AutoModel(model="paraformer-zh-streaming")

wav_file = os.path.join(model.model_path, "example/asr_example.wav")
speech, sample_rate = soundfile.read(wav_file)
chunk_stride = chunk_size[1] * 960  # 9600 samples = 600 ms at 16 kHz

cache = {}  # streaming state carried from one chunk to the next
total_chunk_num = int((len(speech) - 1) / chunk_stride + 1)
for i in range(total_chunk_num):
    speech_chunk = speech[i * chunk_stride:(i + 1) * chunk_stride]
    is_final = i == total_chunk_num - 1  # flush remaining text on the last chunk
    res = model.generate(
        input=speech_chunk,
        cache=cache,
        is_final=is_final,
        chunk_size=chunk_size,
        encoder_chunk_look_back=encoder_chunk_look_back,
        decoder_chunk_look_back=decoder_chunk_look_back,
    )
    print(res)
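Note: chunk_size configures the streaming latency. [0, 10, 5] means text is emitted at a granularity of 10*60 = 600 ms, with 5*60 = 300 ms of lookahead (future context). Each inference call takes 600 ms of audio as input (16000*0.6 = 9600 samples) and outputs the corresponding text. For the final audio chunk, set is_final=True to force the last words to be output.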
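The arithmetic in the note can be checked without loading any model. The following is a minimal sketch that assumes 16 kHz audio and 60 ms per chunk_size unit, as the note states; the helper name chunk_plan is hypothetical and not part of FunASR. It uses (num_samples - 1) // chunk_stride + 1 for the chunk count so that audio whose length is an exact multiple of the stride does not produce an empty trailing chunk.

```python
SAMPLE_RATE = 16000  # Hz, the sample rate assumed in the note
FRAME_MS = 60        # duration represented by one chunk_size unit


def chunk_plan(chunk_size, num_samples, sample_rate=SAMPLE_RATE):
    """Hypothetical helper: derive latency, lookahead, stride and chunk boundaries."""
    latency_ms = chunk_size[1] * FRAME_MS      # granularity at which text is emitted
    lookahead_ms = chunk_size[2] * FRAME_MS    # future context
    chunk_stride = latency_ms * sample_rate // 1000  # samples fed per call
    total = (num_samples - 1) // chunk_stride + 1    # number of calls needed
    boundaries = [
        (i * chunk_stride, min((i + 1) * chunk_stride, num_samples), i == total - 1)
        for i in range(total)
    ]
    return latency_ms, lookahead_ms, chunk_stride, boundaries


# 1.5 s of audio at 16 kHz = 24000 samples
latency_ms, lookahead_ms, chunk_stride, boundaries = chunk_plan([0, 10, 5], 24000)
print(latency_ms, lookahead_ms, chunk_stride)  # 600 300 9600
for start, end, is_final in boundaries:
    print(start, end, is_final)  # only the last chunk has is_final=True
```

Each (start, end, is_final) tuple corresponds to one speech[start:end] slice and the is_final flag passed along with it.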
2. Offline (non-streaming) speech recognition
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model_dir = "iic/SenseVoiceSmall"

model = AutoModel(
    model=model_dir,
    vad_model="fsmn-vad",  # enable VAD to split long audio into segments
    vad_kwargs={"max_single_segment_time": 30000},  # max segment length, in ms
    device="cuda:0",
)

# English example shipped with the model
res = model.generate(
    input=f"{model.model_path}/example/en.mp3",
    cache={},
    language="auto",  # "zh", "en", "yue", "ja", "ko", "nospeech"
    use_itn=True,
    batch_size_s=60,
    merge_vad=True,  # merge short fragments produced by VAD
    merge_length_s=15,
)
text = rich_transcription_postprocess(res[0]["text"])
print(text)
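Parameters:
model_dir: the model name, or the path to a model on local disk.
vad_model: enables VAD, which splits long audio into short segments. The measured inference time then covers both VAD and SenseVoice, i.e. the end-to-end pipeline time. To benchmark the SenseVoice model on its own, disable the VAD model.
vad_kwargs: configuration for the VAD model. max_single_segment_time is the maximum duration of a single segment cut by vad_model, in milliseconds.
use_itn: whether the output includes punctuation and inverse text normalization.
batch_size_s: enables dynamic batching; the value is the total audio duration in a batch, in seconds.
merge_vad: whether to merge the short audio fragments cut by the VAD model; the merged length is merge_length_s, in seconds.
ban_emo_unk: disables the emo_unk tag; once disabled, every sentence is assigned an emotion label.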
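To make the effect of merge_vad and merge_length_s more concrete, here is a small illustrative sketch of one way short VAD segments could be merged greedily up to a length limit. This is an assumption made for illustration only, not FunASR's internal implementation, and the function name merge_segments and the sample timestamps are hypothetical.

```python
def merge_segments(segments_ms, merge_length_s=15):
    """Greedily merge consecutive (start_ms, end_ms) segments.

    Illustrative only: FunASR's actual merging logic may differ.
    A segment joins the current group while the group's total span
    stays within merge_length_s seconds.
    """
    max_len_ms = merge_length_s * 1000
    merged = []
    for start, end in segments_ms:
        if merged and end - merged[-1][0] <= max_len_ms:
            merged[-1] = (merged[-1][0], end)  # extend the current group
        else:
            merged.append((start, end))        # start a new group
    return merged


# Hypothetical VAD output: five short segments, in milliseconds
vad_segments = [(0, 4000), (4500, 9000), (9500, 16000), (16500, 20000), (40000, 43000)]
print(merge_segments(vad_segments))
# [(0, 9000), (9500, 20000), (40000, 43000)]
```

Fewer, longer segments mean fewer calls to the recognizer, which is presumably why merging pairs well with the dynamic batching controlled by batch_size_s.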
To be continued...
Reference: https://github.com/modelscope/FunASR/blob/main/README_zh.md
Original article: https://blog.csdn.net/m0_60657960/article/details/145224489