Practical Deep Learning for Coders Course - Lesson 4
This blog-post series captures my weekly notes while I attend the fastai v5 course conducted by the University of Queensland with fast.ai. So off to week 4, where we get started with NLP and transformers.
- Introduction to lesson
- Why using a different framework - Transformers
- ULMFiT architecture
- Fundamental libraries in ML
- NLP notebook tokenization
- Test, Validation, Training Dataset
- Understanding metrics
- Pearson Coefficient
Introduction to lesson
Over 100 people watched live virtually, and the lesson was held in front of a live audience at the University of Queensland. Prof. John Williams opened the session by asking people interested in attending the hackathon organized at the end of the course to fill in a separate form.
At the start, Jeremy mentioned he would love the community to organize an online hackathon for folks attending remotely as well, but right now Jeremy and John don't have the capacity to organize one.
Today's lesson is something a lot of fast.ai course regulars are excited about, as it covers genuinely new material on transformers.
Why using a different framework - Transformers
Since this is a fastai course, it may feel a bit weird that today we are going to use a different library called Transformers.
Important: As practitioners, it's important for us to learn more than one framework.
Note: Differences between fastai and Transformers:
- Transformers provides a lot of state-of-the-art models, and the Tokenizers library, built with Rust, is really good at the moment.
- It's good to get exposure to a library that is not as layered as fast.ai; that layered design is what makes fast.ai so useful for beginners.
ULMFiT architecture
The idea of fine-tuning a pre-trained NLP model in this way was pioneered by an algorithm called Universal Language Model Fine-tuning for Text Classification, aka ULMFiT, which was first presented in a fastai course.
What's a pretrained model and what is finetuning?
Think of fine-tuning as tweaking a function: if the values of levers a and b are already good and close to optimal for a particular function, then you only need to tweak the value of c, which is much easier, right?
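To make this concrete, here is a minimal sketch of my own (not from the lesson) in PyTorch, assuming a toy function y = a*x + b + c where a and b are already "pretrained" and frozen, and only c is tuned on the new data:
import torch

# a and b are "pretrained" and kept fixed; c is the only lever we tweak
a = torch.tensor(2.0)                       # already good, frozen
b = torch.tensor(0.5)                       # already good, frozen
c = torch.tensor(0.0, requires_grad=True)   # fine-tuned on the new data

x = torch.linspace(0, 1, 20)
y_true = 2.0 * x + 0.5 + 1.3                # pretend the new data needs c = 1.3

opt = torch.optim.SGD([c], lr=0.1)
for _ in range(200):
    loss = ((a * x + b + c - y_true) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(c.item())  # ends up close to 1.3, only one lever had to move
Because only c had to move, this converges far more quickly than learning a, b and c from scratch.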
Steps in ULMFiT
The ULMFiT architecture consists of three steps (a rough code sketch follows the list):
- Train a language model on a general dataset like Wikipedia, so it becomes very good at predicting the next word. (One big difference in transformers compared to ULMFiT is that they use masking instead of predicting the next word.)
- Build an IMDb language model on top of the Wikipedia language model, i.e. fine-tune the general language model on the target corpus.
- In the third step the classifier comes in: based on the labels, it classifies sentences as positive, negative, etc.
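As a rough sketch of my own (not the exact code from the lesson), steps two and three look roughly like this with the fastai text API on the IMDb dataset, with the Wikipedia pretraining already baked into the AWD_LSTM weights:
from fastai.text.all import *

path = untar_data(URLs.IMDB)

# Step 2: fine-tune the Wikipedia-pretrained language model on IMDb text
dls_lm = TextDataLoaders.from_folder(path, is_lm=True, valid_pct=0.1)
learn_lm = language_model_learner(dls_lm, AWD_LSTM, metrics=accuracy)
learn_lm.fine_tune(1)
learn_lm.save_encoder('imdb_encoder')

# Step 3: train a sentiment classifier on top of the fine-tuned encoder
dls_clas = TextDataLoaders.from_folder(path, valid='test', text_vocab=dls_lm.vocab)
learn_clas = text_classifier_learner(dls_clas, AWD_LSTM, metrics=accuracy)
learn_clas.load_encoder('imdb_encoder')
learn_clas.fine_tune(1)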
Fundamental libraries in ML
Four fundamental libraries you always need in data science are:
- NumPy
- Pandas
- matplotlib
- PyTorch
Important: It looks pretty cool if you build state-of-the-art stuff, but if you don't know the fundamentals you will run into trouble. So I recommend getting started by first reading the Deep Learning for Coders book completely, and then Wes McKinney's Python for Data Analysis, 3E, which is available completely free online.
NLP notebook tokenization
Getting started with NLP for absolute beginners
It's only been a year or two since NLP started getting really good results; for computer vision, things have looked promising for a long time now.
- Tokenization is converting text blurbs into a set of small tokens.
- Numericalization is the process of converting these tokens into numbers for models to train on.
We used deberta-v3 as the base model, since some models are consistently found to give good results. There are also lots of pretrained models available publicly, which you can find just by searching, e.g. searching for "patent" in the Hugging Face models hub to find patent-specific models.
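As a minimal sketch of my own of both steps with the Transformers library, assuming the microsoft/deberta-v3-small checkpoint:
from transformers import AutoTokenizer

tokz = AutoTokenizer.from_pretrained('microsoft/deberta-v3-small')

text = "A platypus is an ornithorhynchus anatinus."
print(tokz.tokenize(text))      # tokenization: text -> list of subword tokens
print(tokz(text)['input_ids'])  # numericalization: tokens -> integer ids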
Test, Validation, Training Dataset
The most important concept in ML is creating:
- Test set
- Validation set
- Training set
Important: (Jeremy) Kaggle competitions are really a good way to create a good validation set... Beginners generally tend to overfit... In the real world outside of Kaggle you often won't know it's overfit. You just destroy value for organizations silently... You really don't get it until you screw it up a few times. See also: How (and why) to create a good validation set
A test set is a separate set of data that is not used by the ML model for learning; it's kept as a hold-out dataset for final testing.
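As a minimal sketch of my own (a purely random split; for time series or grouped data you would split more carefully, as the validation-set article linked above explains), here is one common way to carve out the three sets with scikit-learn:
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(1000)  # stand-in for a real dataset

# hold out the test set first, then split the remainder into training and validation
train_val, test = train_test_split(data, test_size=0.2, random_state=42)
train, valid = train_test_split(train_val, test_size=0.25, random_state=42)
print(len(train), len(valid), len(test))  # 600 200 200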
Understanding metrics
With the validation set, we measure metrics like accuracy which tell us how good our ML model is. In Kaggle, every competition has a metric that you are asked to optimize for.
Why is a metric different from the loss? Check the article written by Rachel Thomas: The problem with metrics is a big problem for AI.
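As a tiny illustration of my own of the difference: the loss is a smooth function that gradient descent can actually optimize, while the metric is the human-readable number we care about reporting. In PyTorch:
import torch
import torch.nn.functional as F

preds = torch.tensor([[2.0, 0.5], [0.3, 1.2], [1.5, 0.1]])  # raw model outputs
targets = torch.tensor([0, 1, 1])                            # true labels

loss = F.cross_entropy(preds, targets)                       # what training minimizes
metric = (preds.argmax(dim=1) == targets).float().mean()     # accuracy, what we report
print(loss.item(), metric.item())                            # related, but not the same number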
Pearson Coefficient
Understanding metrics is really key, especially in Kaggle competitions. According to this Kaggle competition page: "Submissions are evaluated on the Pearson correlation coefficient between the predicted and actual similarity scores."
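For reference (my notation, not from the lesson), the Pearson correlation coefficient between two variables x and y is:

$$ r_{xy} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}} $$

It always lies between -1 and 1: 1 means a perfect positive linear relationship, -1 a perfect negative one, and 0 no linear relationship at all.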
Jeremy's way of teaching this concept was to explain it with code so we could build intuition:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
df = fetch_california_housing(as_frame=True)
df = df["data"].join(df["target"]).sample(1000, random_state=52)
df.head()
np.set_printoptions(precision=2, suppress=True)
# get the correlation coefficient between every pair of columns (variables) in the dataframe
np.corrcoef(df, rowvar=False)
def relation_matrix(x, y):
    # pull out the single off-diagonal entry: the correlation between x and y
    return np.corrcoef(x, y)[0][1]
np.corrcoef(df.HouseAge, df.MedHouseVal)
relation_matrix(df.HouseAge, df.MedHouseVal)
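As a quick sanity check of my own (not from the lesson), the same number can be computed straight from the definition above:
# hand-rolled Pearson r, following the formula given earlier
def pearson(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

pearson(df.HouseAge, df.MedHouseVal)  # matches relation_matrix above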
When I ran through this code I was wondering how the [0, 1] element in this correlation matrix shows as 0.12, when relation_matrix returns something like 0.11658535. I asked this simple question on the forum and after a while I got an answer from one of the course TAs, Nick (n-e-w): the printed matrix is just rounded to two decimal places by the earlier np.set_printoptions(precision=2) call, so 0.11658535 displays as 0.12.
def show_corr(df, a, b):
    x, y = df[a], df[b]
    plt.scatter(x, y, alpha=0.5, s=4)
    plt.title(f"{a} vs {b}; r: {relation_matrix(x, y):.2f}")
show_corr(df, "MedInc", "AveRooms")
show_corr(df[df.AveRooms < 15], "MedInc", "AveRooms")
If you look at the two graphs, once we keep only the rows with AveRooms < 15 we notice a big difference in the r value, which shows that the Pearson coefficient is sensitive to outliers. That gives us an intuitive feel for what the metric is doing and how it's affected by outliers: even a small error on a few predictions can cause a big drop on the leaderboard and affect your position, because the Pearson correlation penalizes wrong predictions heavily.
Next week will be the fifth lesson and the last one for the month of May. The course will resume after a three-week break in June, when the monsoon season delights us here in Kerala with rain, and everyone else with more fastai.