This is my attempt to see how well we can build a NLP model for Natural Language Processing with Disaster Tweets.

According to competition you are required to :

In this competition, you’re challenged to build a machine learning model that predicts which Tweets are about real disasters and which one’s aren’t. You’ll have access to a dataset of 10,000 tweets that were hand classified. If this is your first time working on an NLP problem, we've created a quick tutorial to get you up and running.

Downloading Data

creds = ""
from pathlib import Path

cred_path = Path("~/.kaggle/kaggle.json").expanduser()
if not cred_path.exists():
    cred_path.parent.mkdir(exist_ok=True)
    cred_path.write_text(creds)
    cred_path.chmod(0o600)
! kaggle competitions download -c nlp-getting-started
nlp-getting-started.zip: Skipping, found more recently modified local copy (use --force to force download)
! unzip nlp-getting-started.zip
import pandas as pd
df = pd.read_csv("train.csv")
df.head()
id keyword location text target
0 1 NaN NaN Our Deeds are the Reason of this #earthquake M... 1
1 4 NaN NaN Forest fire near La Ronge Sask. Canada 1
2 5 NaN NaN All residents asked to 'shelter in place' are ... 1
3 6 NaN NaN 13,000 people receive #wildfires evacuation or... 1
4 7 NaN NaN Just got sent this photo from Ruby #Alaska as ... 1
df.describe(include="object")
keyword location text
count 7552 5080 7613
unique 221 3341 7503
top fatalities USA 11-Year-Old Boy Charged With Manslaughter of T...
freq 45 104 10
df["input"] = df["text"]

Tokenization

from datasets import Dataset, DatasetDict

ds = Dataset.from_pandas(df)
ds
Dataset({
    features: ['id', 'keyword', 'location', 'text', 'target', 'input'],
    num_rows: 7613
})
model_nm = "microsoft/deberta-v3-small"
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokz = AutoTokenizer.from_pretrained(model_nm)
def tok_func(x):
    return tokz(x["input"])


tok_ds = ds.map(tok_func, batched=True)
Parameter 'function'=<function tok_func at 0x7f28da60b8b0> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
row = tok_ds[0]
row["input"], row["input_ids"]

('Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all',
 [1,
  581,
  65453,
  281,
  262,
  18037,
  265,
  291,
  953,
  117831,
  903,
  4924,
  17018,
  43632,
  381,
  305,
  2])
tok_ds = tok_ds.rename_columns({"target": "labels"})
tok_ds
Dataset({
    features: ['id', 'keyword', 'location', 'text', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 7613
})
tok_ds[0]

{'id': 1,
 'keyword': None,
 'location': None,
 'text': 'Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all',
 'labels': 1,
 'input': 'Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all',
 'input_ids': [1,
  581,
  65453,
  281,
  262,
  18037,
  265,
  291,
  953,
  117831,
  903,
  4924,
  17018,
  43632,
  381,
  305,
  2],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Validation, Traning, Testing

eval_df = pd.read_csv("test.csv")
eval_df.head()
id keyword location text
0 0 NaN NaN Just happened a terrible car crash
1 2 NaN NaN Heard about #earthquake is different cities, s...
2 3 NaN NaN there is a forest fire at spot pond, geese are...
3 9 NaN NaN Apocalypse lighting. #Spokane #wildfires
4 11 NaN NaN Typhoon Soudelor kills 28 in China and Taiwan
eval_df.describe(include="object")
keyword location text
count 3237 2158 3263
unique 221 1602 3243
top deluged New York 11-Year-Old Boy Charged With Manslaughter of T...
freq 23 38 3
model_dataset = tok_ds.train_test_split(0.25, seed=34)
model_dataset
DatasetDict({
    train: Dataset({
        features: ['id', 'keyword', 'location', 'text', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 5709
    })
    test: Dataset({
        features: ['id', 'keyword', 'location', 'text', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1904
    })
})
eval_df["input"] = eval_df["text"]
eval_ds = Dataset.from_pandas(eval_df).map(tok_func, batched=True)

Training Models

from transformers import TrainingArguments, Trainer, DataCollatorWithPadding
bs = 128
epochs = 4
data_collator = DataCollatorWithPadding(tokenizer=tokz)
training_args = TrainingArguments("test-trainer")
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=2)
Some weights of the model checkpoint at microsoft/deberta-v3-small were not used when initializing DebertaV2ForSequenceClassification: ['lm_predictions.lm_head.bias', 'mask_predictions.dense.bias', 'mask_predictions.LayerNorm.bias', 'mask_predictions.classifier.weight', 'mask_predictions.LayerNorm.weight', 'lm_predictions.lm_head.LayerNorm.weight', 'lm_predictions.lm_head.dense.bias', 'mask_predictions.dense.weight', 'lm_predictions.lm_head.dense.weight', 'lm_predictions.lm_head.LayerNorm.bias', 'mask_predictions.classifier.bias']
- This IS expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-small and are newly initialized: ['classifier.weight', 'pooler.dense.weight', 'classifier.bias', 'pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
trainer = Trainer(
    model,
    training_args,
    train_dataset=model_dataset["train"],
    eval_dataset=model_dataset["test"],
    data_collator=data_collator,
    tokenizer=tokz,
)
trainer.train()
The following columns in the training set don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: location, text, id, input, keyword. If location, text, id, input, keyword are not expected by `DebertaV2ForSequenceClassification.forward`,  you can safely ignore this message.
/opt/conda/lib/python3.8/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
***** Running training *****
  Num examples = 5709
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 2142
[2142/2142 03:04, Epoch 3/3]
Step Training Loss
500 0.491000
1000 0.406300
1500 0.323600
2000 0.265800

</div> </div>

Saving model checkpoint to test-trainer/checkpoint-500
Configuration saved in test-trainer/checkpoint-500/config.json
Model weights saved in test-trainer/checkpoint-500/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-500/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-500/special_tokens_map.json
Saving model checkpoint to test-trainer/checkpoint-1000
Configuration saved in test-trainer/checkpoint-1000/config.json
Model weights saved in test-trainer/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-1000/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-1000/special_tokens_map.json
Saving model checkpoint to test-trainer/checkpoint-1500
Configuration saved in test-trainer/checkpoint-1500/config.json
Model weights saved in test-trainer/checkpoint-1500/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-1500/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-1500/special_tokens_map.json
Saving model checkpoint to test-trainer/checkpoint-2000
Configuration saved in test-trainer/checkpoint-2000/config.json
Model weights saved in test-trainer/checkpoint-2000/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-2000/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-2000/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)


TrainOutput(global_step=2142, training_loss=0.3674473464210717, metrics={'train_runtime': 184.9649, 'train_samples_per_second': 92.596, 'train_steps_per_second': 11.581, 'total_flos': 222000241127892.0, 'train_loss': 0.3674473464210717, 'epoch': 3.0})
</div> </div> </div>
preds = trainer.predict(eval_ds).predictions.astype(float)
preds
The following columns in the test set don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: location, text, id, input, keyword. If location, text, id, input, keyword are not expected by `DebertaV2ForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 3263
  Batch size = 8
[408/408 00:05]
array([[-2.78964901,  3.02934074],
       [-2.77013326,  3.00309706],
       [-2.74731326,  2.972296  ],
       ...,
       [-2.8556931 ,  3.08512282],
       [-2.7085278 ,  2.88177919],
       [-2.7887187 ,  3.00746083]])
1. Just happened a terrible car crash
2. Heard about #earthquake is different cities, stay safe everyone.
3. There is a forest fire at spot pond, geese are fleeing across the street, I cannot save them all.

The above are samples from our Test set, looks all disaster tweets which seems to have been predicted correctly. This is my first iteration in which I tried mostly editing from Jeremy's notebook on getting started with NLP in about 1 hour.

</div>