When I first started training models after the fast.ai course, I wasn't sure how much testing was actually necessary. I'm still not entirely sure, but I essentially copied the method used in a University of Texas paper, where they hand-labeled 500 tweets that had been held out from the "training" and "validation" sets.
I did the same and was very pleased to find that while my model was 99% accurate on the validation set, it was still 93.6% accurate on my hand-labeled test set.
I know I can improve my test set as well, but I also know it's important to finish the project once it reaches a level of accuracy that is actually useful.
Below is the code used for inference from the trained ULMFiT twitter model:
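As a rough sketch of what that inference step typically looks like with fastai's `load_learner` and `predict` (the exported model filename and the example tweet are placeholder assumptions, not the original values):

```python
# Minimal inference sketch, assuming the trained ULMFiT classifier was
# exported with learn.export() to 'twitter_ulmfit.pkl' (hypothetical name).
from fastai.text.all import load_learner

learn = load_learner('twitter_ulmfit.pkl')

# predict() returns the decoded class label, the class index,
# and the per-class probabilities for a single piece of text.
tweet = "Just finished the fast.ai course, feeling great!"
pred_class, pred_idx, probs = learn.predict(tweet)
print(pred_class, probs[pred_idx])
```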