Terrible trivia knowledge for the size

#15
by ChuckMcSneed - opened

Knows very little despite being 1T, very disappointing.

It has worse world knowledge than Llama 3 70B. I was expecting it to be at least comparable to DeepSeek in this regard. It seems some labs believe that real-world knowledge isn't that important. Talking with these models feels like talking to someone who has spent their whole life under a rock reading science books. That is what happens when you train with 90% synthetic data.

Do you have any examples of prompts that produced bad results?

@jebbam Very simple ones like "Which song is rapper viper best known for?" and other similar questions; Llama knows it, Ling does not. Try asking it about anything slightly outside the mainstream and you'll notice how limited it is.

Do you have any examples of prompts that produced bad results?

Just ask it anything specific about any TV show, anime, or game and it will hallucinate the answer most of the time.
For example, here is a random piece of trivia from Pokémon Crystal, and it's not even an obscure thing; it's almost impossible to skip if you play the game: https://www.serebii.net/crystal/dratini.shtml

Prompt: How do you obtain a Dratini with ExtremeSpeed in Pokemon Crystal?

Ling-1T (hallucinated the entire answer):
[Ling-1T screenshot]

GLM (answered correctly):
[GLM screenshot]

GLM, GLM Air, Kimi-K2, DeepSeek, and Ernie-300B all answered this correctly. Of all the models I tested, Ling/Ring 1T and Qwen (any size) are the only ones unable to answer this specific question; you can try it yourself (see the sketch below).
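For anyone who wants to reproduce the comparison, something like the following sketch should work against any OpenAI-compatible chat endpoint. The base URL, API key, and model IDs are placeholders, not the exact identifiers of any particular provider; substitute whatever you actually use (ZenMux, a local vLLM server, etc.).

```python
# Rough sketch: send the same trivia prompt to several models through an
# OpenAI-compatible endpoint and print the answers side by side.
# The base_url, api_key, and model IDs below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="https://your-provider.example/v1", api_key="YOUR_KEY")

PROMPT = "How do you obtain a Dratini with ExtremeSpeed in Pokemon Crystal?"
MODELS = ["ling-1t", "glm-4.6", "kimi-k2", "deepseek-chat"]  # placeholder IDs

for model in MODELS:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0,  # keep output as deterministic as possible for comparison
    )
    print(f"=== {model} ===")
    print(resp.choices[0].message.content.strip())
    print()
```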

Yes, I see that Ling (and Ring) didn't generate a correct answer, but Llama 3.3 Instruct gave "You'll Cowards Don't Even Smoke Crack", which AFAICT is the correct answer. Ling does appear to do well on scientific questions. So if you want an LLM that knows pop culture, Ling doesn't appear to be a good model; if you want technical answers, it is a good option.

Shinku's "90% synthetic data" claim seems dubious though.

@jebbam I wonder if it's really that good at STEM. Can you give me an example where Ling or Ring performs better than much smaller models? I'm not talking about Llama, which is old, but about Qwen 3 235B and GLM-4.6, which are ~3x smaller.

I don't know offhand if Ling/Ring outperform Qwen or GLM in STEM, both of which are very good models IMHO. To be clear, I'm not affiliated with any of these companies.

What is your basis for the claim that it was trained on 90% synthetic data?

I exaggerated that, but it's probably not that far from the truth: "Pre-training used over 20T high-quality tokens, with > 40% reasoning-dense data in later stages."

That is probably by design: they're sacrificing the model's recall ability in exchange for making it "smarter".

Sorry, I couldn't reproduce the Dratini/ExtremeSpeed case. From the interface it looks like you were using ZenMux for this. I've noted the case and we'll do some further evaluation.

RichardBian changed discussion status to closed
