📍 FOSS Meetup, Kochi @ KeyValue, Smart Kochi.
Saturday, June 24, 2023
Whisper is an under-rated model released by OpenAI.

Size | Parameters | Required VRAM | Relative speed |
---|---|---|---|
tiny | 39 M | ~1 GB | ~32x |
base | 74 M | ~1 GB | ~16x |
small | 244 M | ~2 GB | ~6x |
medium | 769 M | ~5 GB | ~2x |
large | 1550 M | ~10 GB | 1x |
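Picking a size is mostly a trade-off between accuracy, VRAM and speed. With the openai-whisper package any of these checkpoints can be tried in a couple of lines; the audio file name below is a placeholder:

```python
import whisper

# "small" fits in roughly 2 GB of VRAM; swap in "medium" or "large"
# when accuracy matters more than speed
model = whisper.load_model("small")
result = model.transcribe("sample_malayalam_audio.mp3", language="ml")
print(result["text"])
```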
Whisper is competitive with state-of-the-art commercial and open-source systems.

Diagram from the Whisper research paper, p. 9
- It supports the following platforms:
A pre-trained model is a large model that has already been trained on a specific task. If we want to fit it to our own dataset, we train the pre-trained model further (fine-tuning) to build a new model that works very well for our task.

Picture from a fast.ai lesson covering the steps in fine-tuning a text classifier model
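As a rough illustration of what "starting from a pre-trained model" means in practice, the snippet below loads one of OpenAI's pre-trained Whisper checkpoints with Hugging Face Transformers; the choice of the small checkpoint is only an example:

```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# The pre-trained checkpoint that fine-tuning starts from;
# "openai/whisper-small" is just one possible size.
processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="Malayalam", task="transcribe"
)
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

print(f"{sum(p.numel() for p in model.parameters()):,} parameters to fine-tune")
```

Fine-tuning then continues training this model on our own dataset (for example a Common Voice Malayalam split) with the standard Hugging Face Seq2SeqTrainer recipe, which is broadly how the fine-tuned Malayalam models discussed next were produced.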
Model | WER (%) |
---|---|
tiny | 102.7 |
base | 122.9 |
small | 104.8 |
medium | 137.8 |
large-v1 | 107.1 |
large-v2 | 103.2 |
Malayalam models' performance in the Whisper fine-tuning event, according to the leaderboard
thennal/whisper-medium-ml
parambharath/whisper-small-ml
I was sceptical about the winning models because of:
thennal/whisper-medium-ml model card readme
Last commit in thennal/whisper-medium-ml
Datasets

1. Common Voice (Malayalam test split)
2. SMC Malayalam Speech Corpus (MSC)

Metrics

1. Word Error Rate (WER)
2. Character Error Rate (CER)
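These are the same wer() and cer() helpers the benchmarking code below calls; jiwer is one library that exposes them with exactly this signature (the choice of jiwer and the toy strings here are assumptions for illustration):

```python
from jiwer import cer, wer

reference = "മലയാളം ഭാഷ"   # ground-truth transcript (toy example)
hypothesis = "മലയാലം ഭാഷ"  # model prediction with one wrong character

print(f"WER: {100 * wer(reference, hypothesis):.2f}%")  # one of two words wrong -> 50%
print(f"CER: {100 * cer(reference, hypothesis):.2f}%")  # one of ~10 characters wrong
```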
Time for a new adventure
Dependencies:
numerize pandas librosa soundfile
Development library:
```python
from datasets import Audio, load_dataset


def load_common_voice_malayalam_dataset():
    # Malayalam test split of Common Voice 11
    dataset = load_dataset(
        "mozilla-foundation/common_voice_11_0",
        "ml",
        split="test",
    )
    # Whisper models expect 16 kHz audio
    dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
    # `normalise` and `is_target_text_in_range` are helper functions defined
    # outside this listing: they add a normalised transcript column and drop
    # unusable samples
    dataset = dataset.map(normalise)
    dataset = dataset.filter(is_target_text_in_range, input_columns=["norm_text"])
    return dataset
```
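The normalise and is_target_text_in_range helpers are not shown above. A plausible sketch, modelled on the evaluation script from the Hugging Face Whisper fine-tuning event (the exact implementation is an assumption), looks like this; "sentence" is Common Voice's transcript column:

```python
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

# Basic, language-agnostic text normaliser (lower-casing, punctuation removal)
normalizer = BasicTextNormalizer()


def normalise(batch):
    # Keep a normalised copy of the transcript for scoring
    batch["norm_text"] = normalizer(batch["sentence"])
    return batch


def is_target_text_in_range(ref):
    # Drop empty references and segments that should be ignored in scoring
    ref = ref.strip()
    return ref != "" and ref != "ignore time segment in scoring"
```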
```python
import time
from typing import List

import pandas as pd
from jiwer import cer, wer
from transformers import pipeline

# `normalizer`, `data`, `get_model_size` and `clear_gpu_memory` are helper
# functions defined outside this listing.


def evaluate_whisper_model_common_voice(
    model_name: str,           # The model name
    werlist: List[float],      # WER list
    cerlist: List[float],      # CER list
    modelsizelist: List[str],  # model size list
    timelist: List[float],     # time (s) list
    bs: int = 16,              # batch size. Default value is 16.
) -> None:
    # Build an ASR pipeline for the model under test on GPU 0
    whisper_asr = pipeline(
        "automatic-speech-recognition", model=model_name, device=0
    )

    dataset = load_common_voice_malayalam_dataset()

    # Transcribe the whole test split, collecting normalised predictions
    # and references
    predictions = []
    references = []
    start = time.time()
    for out in whisper_asr(data(dataset), batch_size=bs):
        predictions.append(normalizer(out["text"]))
        references.append(normalizer(out["reference"][0]))
    end = time.time()
    ...

    # Per-sample results
    df = pd.DataFrame({"predictions": predictions, "ground_truth": references})
    df["model_name"] = model_name
    df["wer"] = df.apply(
        lambda row: wer(normalizer(row["ground_truth"]), normalizer(row["predictions"])),
        axis=1,
    )
    df["cer"] = df.apply(
        lambda row: cer(normalizer(row["ground_truth"]), normalizer(row["predictions"])),
        axis=1,
    )
    df["total_time"] = end - start

    # Corpus-level WER/CER in percent
    rwer = wer(references, predictions)
    rwer = round(100 * rwer, 2)
    werlist.append(rwer)
    print(f"The WER of model: {rwer}")

    rcer = cer(references, predictions)
    rcer = round(100 * rcer, 2)
    cerlist.append(rcer)
    print(f"The CER of model: {rcer}")
    ...

    print(f"The model size is: {get_model_size(whisper_asr.model)}")
    modelsizelist.append(get_model_size(whisper_asr.model))
    df["model_size"] = get_model_size(whisper_asr.model)

    # Save per-sample results as a parquet file named after the model
    save_name = model_name.split("/")
    print(save_name)
    df.to_parquet(f"{save_name[0]}_{save_name[1]}_commonvoice.parquet")

    clear_gpu_memory()
```
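The evaluation function also leans on a few helpers that are not shown: data streams audio (plus the reference transcript) into the pipeline, get_model_size reports a readable parameter count, and clear_gpu_memory frees VRAM before the next model is loaded. A sketch of what they could look like; the use of numerize and torch here is an assumption:

```python
import gc

import torch
from numerize.numerize import numerize


def data(dataset):
    # Stream raw audio into the ASR pipeline, carrying the normalised
    # reference transcript along so it comes back in each pipeline output
    for item in dataset:
        yield {**item["audio"], "reference": item["norm_text"]}


def get_model_size(model) -> str:
    # Human-readable parameter count, e.g. "769M"
    return numerize(sum(p.numel() for p in model.parameters()))


def clear_gpu_memory():
    # Release cached GPU memory between model evaluations
    gc.collect()
    torch.cuda.empty_cache()
```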
I started with 6 fine-tuned Malayalam models and compared them with the 6 model versions released by OpenAI.
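Running the comparison then boils down to looping the benchmark function over the candidate checkpoints; the list below is an illustrative subset, not the full set of twelve models:

```python
werlist, cerlist, modelsizelist, timelist = [], [], [], []

# Illustrative subset of the evaluated checkpoints
models_to_evaluate = [
    "openai/whisper-small",
    "openai/whisper-medium",
    "thennal/whisper-medium-ml",
    "parambharath/whisper-small-ml",
]

for name in models_to_evaluate:
    evaluate_whisper_model_common_voice(
        name, werlist, cerlist, modelsizelist, timelist
    )
```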
Output from benchmarking tool
Word Error Rate in Common Voice-9 test split
Character Error Rate in Common Voice-9 test split
Output from benchmarking tool
Word Error Rate in MSC
Character Error Rate in MSC
GitHub project
https://github.com/kurianbenoy/malayalam_asr_benchmarking
Benchmarking results
Results on the SMC Malayalam Speech Corpus
https://huggingface.co/datasets/kurianbenoy/malayalam_msc_benchmarking
Results on Common Voice 11
https://huggingface.co/datasets/kurianbenoy/malayalam_common_voice_benchmarking
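Both result sets are published as datasets on the Hugging Face Hub, so they can be pulled down and inspected directly; a minimal sketch, assuming the Hub's default "train" split for the uploaded parquet files:

```python
from datasets import load_dataset

results = load_dataset(
    "kurianbenoy/malayalam_common_voice_benchmarking", split="train"
)
print(results.to_pandas()[["model_name", "wer", "cer", "model_size"]].head())
```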
Open LLM leaderboard in Hugging Face Spaces
thennal/whisper-medium-ml
Kurian Benoy || OpenAI Whisper and its amazing power to do fine-tuning, demonstrated on my mother tongue