Building a fine-tuned translation system for English-Malayalam
- Loading Data
- Transforming data into DataLoaders
- Training fine-tuned translation system
- Now let's translate with our trained models
- Thanks to
Hey, everyone. We are all familiar with translation systems such as Google Translate. So today, let's build a fine-tuned translation system that converts text from English to Malayalam. It's built using the blurr library, which Wayde Gilliam created on top of Hugging Face and fast.ai. Our translation system is fine-tuned on a KDE-specific dataset. You can find the trained model here.
Installation
! python3 -m pip install -Uqq datasets fastai==2.6.3
! python3 -m pip install -Uqq transformers[sentencepiece]
! python3 -m pip install -Uqq ohmeow-blurr==1.0.5
! python3 -m pip install -Uqq nltk
! python3 -m pip install -Uqq sacrebleu
! python3 -m pip install -Uqq git+https://github.com/huggingface/huggingface_hub#egg=huggingface-hub["fastai"]
Loading Data
A translation system is an example of a sequence-to-sequence model, a family of models typically used for tasks that involve generating new text. Training one requires a parallel dataset: the same sentences in both the source language and the target language (the language being translated into).
We use the KDE4 dataset, with English as the source language and Malayalam as the target language. Datasets like this are usually curated by community volunteers translating into their native language, and this one was probably produced by KDE community volunteers in Kerala. When volunteers localize such text into their local language, computer-science-specific terms are usually left in English.
import pandas as pd
from datasets import load_dataset
raw_datasets = load_dataset("kde4", lang1="en", lang2="ml", split="train[:1000]")
Most translation datasets come as JSON-like records with an id field and a translation field, where translation is an object holding both the en and ml texts.
raw_datasets[0]
from blurr.text.data.all import *
from blurr.text.modeling.all import *
from blurr.text.utils import *
from fastai.data.all import *
from fastai.callback.all import *
from fastai.learner import load_learner, Learner
from fastai.optimizer import *
from transformers import *
pretrained_model_name = "Helsinki-NLP/opus-mt-en-ml"
model_cls = AutoModelForSeq2SeqLM
hf_arch, hf_config, hf_tokenizer, hf_model = get_hf_objects(pretrained_model_name, model_cls=model_cls)
hf_arch, type(hf_config), type(hf_tokenizer), type(hf_model)
Transforming data into DataLoaders
translation_df = pd.DataFrame(raw_datasets["translation"], columns=["en", "ml"])
translation_df.head()
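To make the reshaping step above concrete, here is a minimal, dependency-free sketch of what happens when a list of translation objects becomes a two-column table. The sample records and their placeholder strings are hypothetical stand-ins, not actual rows from KDE4, and `to_rows` is an illustrative helper rather than part of pandas or blurr.

```python
# Hypothetical records shaped like the KDE4 entries described earlier:
# each has an "id" and a "translation" object with "en" and "ml" keys.
records = [
    {"id": "0", "translation": {"en": "<English text 0>", "ml": "<Malayalam text 0>"}},
    {"id": "1", "translation": {"en": "<English text 1>", "ml": "<Malayalam text 1>"}},
]


def to_rows(records, columns=("en", "ml")):
    """Flatten translation objects into one row (a tuple) per sentence pair."""
    return [tuple(rec["translation"][col] for col in columns) for rec in records]


rows = to_rows(records)
for en, ml in rows:
    print(f"{en} -> {ml}")
```

Each row pairs one English string with its Malayalam counterpart, which is the shape that `ColReader("en")` and `ColReader("ml")` read from.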
blocks = (Seq2SeqTextBlock(hf_arch, hf_config, hf_tokenizer, hf_model), noop)
dblock = DataBlock(blocks=blocks, get_x=ColReader("en"), get_y=ColReader("ml"), splitter=RandomSplitter())
dls = dblock.dataloaders(translation_df, bs=16)
dls.show_batch(dataloaders=dls, max_n=2, input_trunc_at=100, target_trunc_at=250)
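`RandomSplitter()` decides which rows go to training and which to validation. As a rough sketch of the idea (not fastai's actual implementation), the snippet below shuffles the row indices and holds out a fraction for validation. It assumes a 20% hold-out, which I believe is fastai's default `valid_pct`; the helper name `random_split` and the fixed seed are illustrative only.

```python
import random


def random_split(n_items, valid_pct=0.2, seed=42):
    """Shuffle indices 0..n_items-1 and hold out valid_pct of them for validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    cut = int(valid_pct * n_items)
    return indices[cut:], indices[:cut]  # (train indices, validation indices)


# The 1000 rows loaded earlier with split="train[:1000]"
train_idx, valid_idx = random_split(1000)
print(len(train_idx), len(valid_idx))
```

Under that assumption, our 1000 sentence pairs end up as 800 for training and 200 for validation.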
Training fine-tuned translation system
Using blurr High-level API
The bugs in ohmeow-blurr v1.0.4 have been fixed by the open-source maintainer.
from blurr.text.utils import BlurrText
NLP = BlurrText()
learn = BlearnerForTranslation.from_data(
    translation_df,
    pretrained_model_name,
    src_lang_name="English",
    src_lang_attr="en",
    trg_lang_name="Malayalam",
    trg_lang_attr="ml",
    dl_kwargs={"bs": 16},
)
metrics_cb = BlearnerForTranslation.get_metrics_cb()
learn.fit_one_cycle(1, lr_max=4e-5, cbs=[metrics_cb])
learn.show_results(learner=learn, input_trunc_at=500, target_trunc_at=250)
b = dls.one_batch()
len(b), b[0]["input_ids"].shape, b[1].shape
dls.show_batch(dataloaders=dls, input_trunc_at=250, target_trunc_at=250)
seq2seq_metrics = {"bleu": {"returns": "bleu"}, "meteor": {"returns": "meteor"}, "sacrebleu": {"returns": "score"}}
model = BaseModelWrapper(hf_model)
learn_cbs = [BaseModelCallback]
fit_cbs = [Seq2SeqMetricsCallback(custom_metrics=seq2seq_metrics)]
learn = Learner(
    dls,
    model,
    opt_func=partial(Adam),
    loss_func=PreCalculatedCrossEntropyLoss(),  # CrossEntropyLossFlat()
    cbs=learn_cbs,
    splitter=partial(blurr_seq2seq_splitter, arch=hf_arch),
)
learn.freeze()
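The `bleu` and `sacrebleu` entries in `seq2seq_metrics` both report BLEU-style scores, which are built on clipped n-gram precision: how many of the candidate's n-grams also appear in the reference, counting each one no more often than the reference contains it. The sketch below is a simplified illustration of that single ingredient on whitespace tokens. It is not the sacrebleu implementation, which also handles tokenization, combines several n-gram orders, and applies a brevity penalty. The function name and the English example sentences are assumptions for illustration.

```python
from collections import Counter


def clipped_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams found in the reference, with clipped counts."""
    cand_tokens = candidate.split()
    ref_tokens = reference.split()
    cand_ngrams = Counter(
        tuple(cand_tokens[i:i + n]) for i in range(len(cand_tokens) - n + 1)
    )
    ref_ngrams = Counter(
        tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1)
    )
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref_ngrams[ngram]) for ngram, count in cand_ngrams.items())
    return matched / total


candidate = "the the the the the the the"
reference = "the cat is on the mat"
# "the" occurs twice in the reference, so only 2 of the 7 candidate unigrams count.
print(clipped_precision(candidate, reference, n=1))
```

Clipping is what stops a degenerate translation that repeats one common word from scoring well.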
learn.lr_find(suggest_funcs=[minimum, steep, valley, slide])
learn.fit_one_cycle(15, lr_max=5e-4, cbs=fit_cbs)
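`fit_one_cycle` does not hold the learning rate at `lr_max`; it warms up to that value and then anneals back down. The following is a rough sketch of that shape, assuming what I understand to be fastai's defaults: start at `lr_max / 25`, peak after 25% of training, and finish near `lr_max / 1e5`, with cosine interpolation between those points. Treat the constants and the helper names as assumptions rather than a copy of the library's code.

```python
import math


def cos_anneal(start, end, pos):
    """Cosine interpolation from start (pos=0) to end (pos=1)."""
    return start + (1 + math.cos(math.pi * (1 - pos))) * (end - start) / 2


def one_cycle_lr(pos, lr_max, div=25.0, div_final=1e5, pct_start=0.25):
    """Learning rate at training progress pos, where pos runs from 0 to 1."""
    if pos < pct_start:
        # Warm-up phase: lr_max / div up to lr_max
        return cos_anneal(lr_max / div, lr_max, pos / pct_start)
    # Annealing phase: lr_max down to lr_max / div_final
    return cos_anneal(lr_max, lr_max / div_final, (pos - pct_start) / (1 - pct_start))


# lr_max=5e-4 is the value passed to fit_one_cycle above
for pos in (0.0, 0.25, 0.5, 1.0):
    print(f"progress {pos:.2f}: lr = {one_cycle_lr(pos, lr_max=5e-4):.2e}")
```

Under these assumptions, training starts at 2e-5, reaches 5e-4 a quarter of the way through the 15 epochs, and then decays towards 5e-9.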
learn.show_results(learner=learn, input_trunc_at=500, target_trunc_at=500)
test_text = "How are you doing"
outputs = learn.blurr_generate(test_text, key="translation_texts", num_return_sequences=3)
outputs
export_fname = "saved_model"
learn.metrics = None
learn.export(fname=f"{export_fname}.pkl")
from huggingface_hub import push_to_hub_fastai
push_to_hub_fastai(
    learn,
    "kurianbenoy/kde_en_ml_translation_model",
    commit_message="New version with 15 epoch of training",
)
Now let's translate with our trained models
test_text = "How are you doing"
inf_learn = load_learner(fname=f"{export_fname}.pkl")
inf_learn.blurr_translate(test_text)
test_text1 = "Add All Found Feeds to Akregator."
inf_learn = load_learner(fname=f"{export_fname}.pkl")
inf_learn.blurr_translate(test_text1)
test_text2 = "Subscribe to site updates (using news feed)."
inf_learn = load_learner(fname=f"{export_fname}.pkl")
inf_learn.blurr_translate(test_text2)
Expected: 'സൈറ്റുകളിലെ പുതുമകളറിയാന്\u200d വരിക്കാരനാകുക (വാര്\u200dത്താ ഫീഡുകള്\u200d ഉപയോഗിച്ചു്
Thanks to
- Wayde Gilliam - for creating blurr and for answering my questions about the translation parts.
- Kevin Bird - for helping edit the article.
- Ashwin Jayaprakash - for trying out the notebook and reporting issues, which were later fixed by Wayde in blurr.
fin.