= "" creds
This is my attempt to see how well we can build a NLP model for Natural Language Processing with Disaster Tweets.
According to competition you are required to :
In this competition, you’re challenged to build a machine learning model that predicts which Tweets are about real disasters and which one’s aren’t. You’ll have access to a dataset of 10,000 tweets that were hand classified. If this is your first time working on an NLP problem, we’ve created a quick tutorial to get you up and running.
Downloading Data
from pathlib import Path
= Path("~/.kaggle/kaggle.json").expanduser()
cred_path if not cred_path.exists():
=True)
cred_path.parent.mkdir(exist_ok
cred_path.write_text(creds)0o600) cred_path.chmod(
! kaggle competitions download -c nlp-getting-started
nlp-getting-started.zip: Skipping, found more recently modified local copy (use --force to force download)
! unzip nlp-getting-started.zip
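If the `unzip` binary isn't available, the archive can also be extracted with Python's standard `zipfile` module; a small equivalent sketch:

import zipfile

# Extract the competition files into the current directory
with zipfile.ZipFile("nlp-getting-started.zip") as zf:
    zf.extractall(".")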
import pandas as pd
= pd.read_csv("train.csv")
df df.head()
|   | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | NaN | NaN | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | NaN | NaN | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | NaN | NaN | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | NaN | NaN | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | NaN | NaN | Just got sent this photo from Ruby #Alaska as ... | 1 |
="object") df.describe(include
|   | keyword | location | text |
|---|---|---|---|
| count | 7552 | 5080 | 7613 |
| unique | 221 | 3341 | 7503 |
| top | fatalities | USA | 11-Year-Old Boy Charged With Manslaughter of T... |
| freq | 45 | 104 | 10 |
"input"] = df["text"] df[
Tokenization
from datasets import Dataset, DatasetDict
ds = Dataset.from_pandas(df)
ds
Dataset({
features: ['id', 'keyword', 'location', 'text', 'target', 'input'],
num_rows: 7613
})
= "microsoft/deberta-v3-small" model_nm
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokz = AutoTokenizer.from_pretrained(model_nm)
def tok_func(x):
    return tokz(x["input"])
tok_ds = ds.map(tok_func, batched=True)
Parameter 'function'=<function tok_func at 0x7f28da60b8b0> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
row = tok_ds[0]
row["input"], row["input_ids"]
('Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all',
 [1, 581, 65453, 281, 262, 18037, 265, 291, 953, 117831, 903, 4924, 17018, 43632, 381, 305, 2])
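As a quick sanity check, the ids can be mapped back to the subword strings produced by the tokenizer. A minimal sketch (`convert_ids_to_tokens` is a standard tokenizer method; the exact subwords are whatever DeBERTa's SentencePiece vocabulary yields):

# Map the token ids back to their subword strings
tokz.convert_ids_to_tokens(row["input_ids"])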
= tok_ds.rename_columns({"target": "labels"}) tok_ds
tok_ds
Dataset({
features: ['id', 'keyword', 'location', 'text', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 7613
})
tok_ds[0]
{'id': 1,
'keyword': None,
'location': None,
'text': 'Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all',
'labels': 1,
'input': 'Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all',
 'input_ids': [1, 581, 65453, 281, 262, 18037, 265, 291, 953, 117831, 903, 4924, 17018, 43632, 381, 305, 2],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
Validation, Training, Testing
= pd.read_csv("test.csv")
eval_df eval_df.head()
|   | id | keyword | location | text |
|---|---|---|---|---|
| 0 | 0 | NaN | NaN | Just happened a terrible car crash |
| 1 | 2 | NaN | NaN | Heard about #earthquake is different cities, s... |
| 2 | 3 | NaN | NaN | there is a forest fire at spot pond, geese are... |
| 3 | 9 | NaN | NaN | Apocalypse lighting. #Spokane #wildfires |
| 4 | 11 | NaN | NaN | Typhoon Soudelor kills 28 in China and Taiwan |
="object") eval_df.describe(include
|   | keyword | location | text |
|---|---|---|---|
| count | 3237 | 2158 | 3263 |
| unique | 221 | 1602 | 3243 |
| top | deluged | New York | 11-Year-Old Boy Charged With Manslaughter of T... |
| freq | 23 | 38 | 3 |
model_dataset = tok_ds.train_test_split(0.25, seed=34)
model_dataset
DatasetDict({
train: Dataset({
features: ['id', 'keyword', 'location', 'text', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 5709
})
test: Dataset({
features: ['id', 'keyword', 'location', 'text', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 1904
})
})
"input"] = eval_df["text"]
eval_df[= Dataset.from_pandas(eval_df).map(tok_func, batched=True) eval_ds
Training Models
from transformers import TrainingArguments, Trainer, DataCollatorWithPadding
bs = 128
epochs = 4
data_collator = DataCollatorWithPadding(tokenizer=tokz)
training_args = TrainingArguments("test-trainer")
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=2)
Some weights of the model checkpoint at microsoft/deberta-v3-small were not used when initializing DebertaV2ForSequenceClassification: ['lm_predictions.lm_head.bias', 'mask_predictions.dense.bias', 'mask_predictions.LayerNorm.bias', 'mask_predictions.classifier.weight', 'mask_predictions.LayerNorm.weight', 'lm_predictions.lm_head.LayerNorm.weight', 'lm_predictions.lm_head.dense.bias', 'mask_predictions.dense.weight', 'lm_predictions.lm_head.dense.weight', 'lm_predictions.lm_head.LayerNorm.bias', 'mask_predictions.classifier.bias']
- This IS expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-small and are newly initialized: ['classifier.weight', 'pooler.dense.weight', 'classifier.bias', 'pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
trainer = Trainer(
    model,
    training_args,
    train_dataset=model_dataset["train"],
    eval_dataset=model_dataset["test"],
    data_collator=data_collator,
    tokenizer=tokz,
)
trainer.train()
The following columns in the training set don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: location, text, id, input, keyword. If location, text, id, input, keyword are not expected by `DebertaV2ForSequenceClassification.forward`, you can safely ignore this message.
/opt/conda/lib/python3.8/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
***** Running training *****
Num examples = 5709
Num Epochs = 3
Instantaneous batch size per device = 8
Total train batch size (w. parallel, distributed & accumulation) = 8
Gradient Accumulation steps = 1
Total optimization steps = 2142
| Step | Training Loss |
|---|---|
| 500 | 0.491000 |
| 1000 | 0.406300 |
| 1500 | 0.323600 |
| 2000 | 0.265800 |
Saving model checkpoint to test-trainer/checkpoint-500
Configuration saved in test-trainer/checkpoint-500/config.json
Model weights saved in test-trainer/checkpoint-500/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-500/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-500/special_tokens_map.json
Saving model checkpoint to test-trainer/checkpoint-1000
Configuration saved in test-trainer/checkpoint-1000/config.json
Model weights saved in test-trainer/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-1000/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-1000/special_tokens_map.json
Saving model checkpoint to test-trainer/checkpoint-1500
Configuration saved in test-trainer/checkpoint-1500/config.json
Model weights saved in test-trainer/checkpoint-1500/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-1500/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-1500/special_tokens_map.json
Saving model checkpoint to test-trainer/checkpoint-2000
Configuration saved in test-trainer/checkpoint-2000/config.json
Model weights saved in test-trainer/checkpoint-2000/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-2000/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-2000/special_tokens_map.json
Training completed. Do not forget to share your model on huggingface.co/models =)
TrainOutput(global_step=2142, training_loss=0.3674473464210717, metrics={'train_runtime': 184.9649, 'train_samples_per_second': 92.596, 'train_steps_per_second': 11.581, 'total_flos': 222000241127892.0, 'train_loss': 0.3674473464210717, 'epoch': 3.0})
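The log only reports training loss because no `compute_metrics` function was passed to the Trainer. Since the competition is scored on F1, here is a hedged sketch of how a metric could be added, using sklearn's `f1_score`; the Trainer construction simply mirrors the one above:

import numpy as np
from sklearn.metrics import f1_score

def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) pair; pick the higher-scoring class per example
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, preds)}

trainer = Trainer(
    model,
    training_args,
    train_dataset=model_dataset["train"],
    eval_dataset=model_dataset["test"],
    data_collator=data_collator,
    tokenizer=tokz,
    compute_metrics=compute_metrics,
)

Calling `trainer.evaluate()` after training would then report F1 on the held-out 25% split.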
preds = trainer.predict(eval_ds).predictions.astype(float)
preds
The following columns in the test set don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: location, text, id, input, keyword. If location, text, id, input, keyword are not expected by `DebertaV2ForSequenceClassification.forward`, you can safely ignore this message.
***** Running Prediction *****
Num examples = 3263
Batch size = 8
array([[-2.78964901, 3.02934074],
[-2.77013326, 3.00309706],
[-2.74731326, 2.972296 ],
...,
[-2.8556931 , 3.08512282],
[-2.7085278 , 2.88177919],
[-2.7887187 , 3.00746083]])
1. Just happened a terrible car crash
2. Heard about #earthquake is different cities, stay safe everyone.
3. There is a forest fire at spot pond, geese are fleeing across the street, I cannot save them all.
The above are the first few samples from our test set; they all look like disaster tweets, and the raw logits agree (in each row the second column, the disaster class, is the larger of the two). This is my first iteration, mostly adapted from Jeremy's notebook on getting started with NLP, put together in about an hour.
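To turn these raw logits into a competition submission, each row's higher-scoring class can be taken as the prediction and written out alongside the ids from the test set. A minimal sketch, assuming the standard `sample_submission.csv` layout of `id` and `target` columns:

# Pick the higher-scoring class for each test tweet: 0 = not a disaster, 1 = disaster
pred_labels = preds.argmax(axis=1)

submission = pd.DataFrame({"id": eval_df["id"], "target": pred_labels})
submission.to_csv("submission.csv", index=False)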