Question about inference speed

#10

by cX1y - opened Sep 1

cX1y

Sep 1

This is great work. I deployed the model as a local service and ran some tests, and I have a few questions. My machine has an A100 (40 GB). I tested inference both with the NeMo framework and with the ONNX version from the onnx_asr project. My inference demo looks like this:
import time

import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.restore_from(restore_path="models/parakeet-tdt-0.6b-v3/parakeet-tdt-0.6b-v3.nemo")

# Inspect the transcribe() signature and the current decoding config.
help(asr_model.transcribe)

decoding_cfg = asr_model.cfg.decoding

# Switch the encoder to local (limited-context) attention.
asr_model.change_attention_model(self_attention_model="rel_pos_local_attn", att_context_size=[256, 256])

# Warm-up run (not timed).
output = asr_model.transcribe(['audio/car_16000.wav'], batch_size=32, timestamps=True, verbose=False)

# First timed run.
start_time = time.time()
output = asr_model.transcribe(['audio/car_16000.wav'], batch_size=128, timestamps=True, verbose=False)
processing_time = time.time() - start_time
print(f"Elapsed: {processing_time:.4f} s")
print(output[0].text)

# Second timed run on the same file.
start_time = time.time()
output = asr_model.transcribe(['audio/car_16000.wav'], batch_size=128, timestamps=True, verbose=False)
processing_time = time.time() - start_time
print(f"Elapsed: {processing_time:.4f} s")
print(output[0].text)

The repeated calls are there to simulate a warm start. My question is about inference time. After several inferences on the same 37 s audio file, the RTF is around 0.01. For shorter audio, such as 10 s, it is around 0.02 after several inferences. I assume this reflects some internal caching in the model. However, when I switch to a different audio file between inferences, the RTF only reaches about 0.1. Is this normal? In other words, if I keep processing different short audio files (each under 30 s), how can I reach an RTFx of 3000+, that is, an RTF of roughly 0.0003 or lower?
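For reference, here is a minimal sketch of how the two metrics relate, assuming the usual definitions (RTF is processing time divided by audio duration, and RTFx is its inverse). The function names are my own and are not part of NeMo or onnx_asr.

```python
def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: processing time per second of audio (lower is better)."""
    return processing_seconds / audio_seconds


def rtfx(processing_seconds: float, audio_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio per second of processing."""
    return audio_seconds / processing_seconds


def time_budget_seconds(audio_seconds: float, target_rtfx: float) -> float:
    """Processing time allowed for one clip in order to hit a target RTFx."""
    return audio_seconds / target_rtfx


if __name__ == "__main__":
    # A 37 s clip processed in 0.37 s corresponds to RTF = 0.01.
    print(f"RTF:  {rtf(0.37, 37.0):.4f}")
    print(f"RTFx: {rtfx(0.37, 37.0):.1f}")
    # Budget for a 10 s clip at a target RTFx of 3000.
    print(f"Budget: {time_budget_seconds(10.0, 3000.0) * 1000:.2f} ms")
```

Under these definitions, an RTFx of 3000 on a single 10 s clip leaves only a few milliseconds of processing time per clip.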

nithinraok

NVIDIA org Sep 10

If your audio samples are shorter than 10 minutes, you can skip the asr_model.change_attention_model(self_attention_model="rel_pos_local_attn", att_context_size=[256, 256]) call. This will improve RTFx.
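One way to apply this per file is sketched below, assuming uncompressed WAV input that Python's standard wave module can read. The helper names and the threshold constant are hypothetical, and the NeMo call appears only as a comment, copied from the snippet in the question.

```python
import wave

# Threshold taken from the advice above: clips under 10 minutes
# do not need the local-attention switch.
LOCAL_ATTENTION_THRESHOLD_S = 10 * 60


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def should_use_local_attention(duration_s: float,
                               threshold_s: float = LOCAL_ATTENTION_THRESHOLD_S) -> bool:
    """True only for clips at or above the threshold (10 minutes by default)."""
    return duration_s >= threshold_s


# Hypothetical usage with the model from the question:
# if should_use_local_attention(wav_duration_seconds("audio/car_16000.wav")):
#     asr_model.change_attention_model(
#         self_attention_model="rel_pos_local_attn",
#         att_context_size=[256, 256],
#     )
```

For a workload made up only of clips under 30 s, the check is always false, so the change_attention_model call can simply be dropped.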
