Codebook Utilization and Unused Vocabulary
#2 · opened by JosephusCheung
I'm using your tokenizer to encode a dataset of images, and I've encountered something unexpected: a significant portion of the vocabulary is never emitted. In my run, approximately 29,933 vocabulary tokens were left unused, which seems to result in a somewhat low codebook utilization rate.
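For reference, this is roughly how I'm counting unused entries. It's a minimal sketch of the bookkeeping, not my full pipeline: the 65,536 codebook size and the synthetic token IDs in the example are placeholders, not your model's actual values.

```python
import numpy as np

def codebook_utilization(token_id_batches, vocab_size):
    """Count how many codebook entries never appear across a dataset.

    `token_id_batches` is any iterable of arrays of token IDs
    (e.g. one array per encoded image).
    """
    used = np.zeros(vocab_size, dtype=bool)
    for ids in token_id_batches:
        # Mark every ID that shows up at least once.
        used[np.asarray(ids).ravel()] = True
    unused = int((~used).sum())
    return unused, used.mean()

# Synthetic example just to show the counting logic (placeholder values):
rng = np.random.default_rng(0)
batches = (rng.integers(0, 65536, size=(32, 32)) for _ in range(100))
unused, utilization = codebook_utilization(batches, vocab_size=65536)
print(f"unused tokens: {unused}, utilization: {utilization:.2%}")
```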
I was wondering: is this expected behavior? Have you observed this before, or is it more likely a distribution issue with my particular dataset?
Any insights would be greatly appreciated.