Codebook Utilization and Unused Vocabulary
#2 · opened by JosephusCheung
I'm using your tokenizer to encode a dataset of images, and I've encountered something unexpected: a significant portion of the vocabulary is never emitted. In my run, approximately 29,933 vocabulary tokens were left unused, which seems to result in a somewhat low codebook utilization rate.
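For reference, this is roughly how I'm counting unused entries. It's a minimal sketch of the bookkeeping, not my full pipeline: the 65,536 codebook size and the synthetic token IDs in the example are placeholders, not your model's actual values.

```python
import numpy as np

def codebook_utilization(token_id_batches, vocab_size):
    """Count how many codebook entries never appear across a dataset.

    `token_id_batches` is any iterable of arrays of token IDs
    (e.g. one array per encoded image).
    """
    used = np.zeros(vocab_size, dtype=bool)
    for ids in token_id_batches:
        # Mark every ID that shows up at least once.
        used[np.asarray(ids).ravel()] = True
    unused = int((~used).sum())
    return unused, used.mean()

# Synthetic example just to show the counting logic (placeholder values):
rng = np.random.default_rng(0)
batches = (rng.integers(0, 65536, size=(32, 32)) for _ in range(100))
unused, utilization = codebook_utilization(batches, vocab_size=65536)
print(f"unused tokens: {unused}, utilization: {utilization:.2%}")
```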
I was wondering: is this expected behavior? Have you observed this before, or is it more likely a distribution issue with my particular dataset?
Any insights would be greatly appreciated.