Codebook Utilization and Unused Vocabulary

#2
by JosephusCheung - opened

I'm using your tokenizer to encode a dataset of images, and I've encountered something unexpected: a significant portion of the vocabulary is never used. In my run, approximately 29,933 codebook entries were left unused, which results in a fairly low codebook utilization rate.
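For reference, here is a minimal sketch of how I measured this. The helper function and the codebook size of 32,768 are placeholders on my side, not the tokenizer's actual API; I simply collect the code indices produced for each image and count how many distinct codebook entries ever appear:

```python
import numpy as np

def codebook_utilization(code_batches, codebook_size):
    """Count how many codebook entries appear at least once.

    `code_batches` is any iterable of arrays of discrete code indices,
    e.g. the per-image code grids produced by the tokenizer's encode step.
    Returns (num_used, num_unused, utilization_ratio).
    """
    used = np.zeros(codebook_size, dtype=bool)
    for codes in code_batches:
        used[np.unique(np.asarray(codes).ravel())] = True
    num_used = int(used.sum())
    return num_used, codebook_size - num_used, num_used / codebook_size

# Toy example with random codes drawn from a narrow range, just to show the
# shape of the result; real inputs would be the tokenizer's outputs.
rng = np.random.default_rng(0)
batches = [rng.integers(0, 1000, size=(16, 16)) for _ in range(100)]
used, unused, ratio = codebook_utilization(batches, codebook_size=32768)
print(f"used={used}, unused={unused}, utilization={ratio:.2%}")
```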

I was wondering: is this expected behavior? Have you observed this phenomenon before, or is it more likely that there's a distribution issue with my particular dataset?

Any insights would be greatly appreciated.
