“The doom lies in yourself, not in your name.”

#15
by jukofyork - opened

Continuation of Wur doomed!.

For longer text chunks or stories, https://pastebin.com works great and helps prevent the thread from slowing down!

🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧
🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛⬛🟧
🟧🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧🟧
⬜🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛🟧⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧⬛🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧⬛⬛⬛⬛🟧⬛⬛⬛⬛🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧⬛⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧⬛⬛⬛⬛🟧🟧🟧⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛🟧🟧🟧⬛⬛🟧⬜🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛⬛⬛⬛⬛🟧🟧⬜🟧🟧⬛⬛⬛⬛⬛⬛🟧🟧🟧⬛⬛⬛⬛⬛⬛🟧🟧⬜🟧⬛⬛🟧⬜🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛⬛⬛⬛🟧🟧⬜⬜⬜🟧🟧⬛⬛⬛⬛🟧🟧⬜🟧🟧⬛⬛⬛⬛🟧🟧⬜⬜🟧🟧⬛🟧⬜🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛⬛⬛🟧🟧⬜⬜⬜⬜⬜🟧🟧⬛⬛🟧🟧⬜⬜⬜🟧🟧⬛⬛🟧🟧⬜⬜⬜⬜🟧🟧🟧⬜🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛⬛🟧🟧⬜⬜⬜⬜⬜⬜⬜🟧🟧🟧🟧⬜⬜⬜⬜⬜🟧🟧🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧🟧⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧⬛⬛🟧⬜
⬜🟧⬛⬛🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧🟧⬛🟧⬜
⬜🟧⬛🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧⬛🟧⬜
⬜🟧🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧🟧🟧⬜

jukofyork pinned discussion

The doom is still buried within Command-A for sure.

Only another 38 days to go:

[image]

It's actually going really well and I'm pretty sure it will be mostly converged within another couple of days:

[image]

🤞

A step 601 preview - all with temperature = 0:

https://pastebin.com/GASKaHTk

https://pastebin.com/CRT81QLb

  • It's still messing up some line endings, but I can live with that if it works... This can likely be fixed later using the new class 0 random data if it turns out to be a problem.
  • The Grimdark story was noticeably (much!) better compared to the inverse.
  • The Battlestar Galactica story showed that even though Q8_0, F16 and BF16 all diverge slightly from F32, it doesn't clearly make them any worse (I actually liked the Q8_0 story best!).
| Size | Name                             |
|------|----------------------------------|
| 287M | command-a-03-2025-lora-Q8_0.gguf |
| 541M | command-a-03-2025-lora-F16.gguf  |
| 541M | command-a-03-2025-lora-BF16.gguf |
| 1.1G | command-a-03-2025-lora-F32.gguf  |
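
Those file sizes line up almost exactly with the bits-per-weight of each GGUF type. A quick back-of-the-envelope sketch (the parameter count is inferred from the F16 file, so the numbers are approximate and ignore GGUF metadata overhead):

```python
# Rough sanity check of the adapter file sizes against bits-per-weight.
# Parameter count is inferred from the F16 file (2 bytes/weight), so this
# is approximate and ignores GGUF header/metadata overhead.
MIB = 1024**2
n_params = 541 * MIB / 2  # ~284M LoRA parameters

bytes_per_weight = {
    "F32": 4.0,
    "F16": 2.0,
    "BF16": 2.0,
    # Q8_0 packs 32 int8 weights plus one f16 scale per block:
    "Q8_0": (32 + 2) / 32,  # = 1.0625 bytes/weight
}

for fmt, bpw in bytes_per_weight.items():
    print(f"{fmt:>4}: ~{n_params * bpw / MIB:.0f} MiB")
# -> F32 ~1082 MiB, F16/BF16 ~541 MiB, Q8_0 ~287 MiB
```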

It still has a way to go before it starts to converge, but I would think by step 1000 it will be pretty close:

[image]

566 responses in the previous thread! In the future we may be the reason for HF staff to implement a multi-page view for discussions.

This was posted on Hacker News today:

https://outsidetext.substack.com/p/how-does-a-blind-model-see-the-earth?selection=5413dcae-b9f4-4adb-8826-d48e3908de2a#:~:text=Wow%2C%20best%20rendition%20of%20the%20Global%20West%20so%20far

Absolutely fascinating!

That was really cool. Thanks for sharing!

Yeah, and llama-3.1:405b doing so well was quite a surprise too (and makes you a bit sad that everything seems to be moving away from large dense models).

I'm also finding this works well with Kimi-K2:

Use interesting/original names - No "Elara Voss" or "Kael Thorn" AI-slop names please.

There must be enough slop (and scraped discussions about slop) already in its training data for it to infer that you are fed up with "slop" in general, and not just these two specific examples!

Interestingly, this also seems to cause it to write slightly differently. If you look at the first story I created above, it breaks the paragraphs up very differently to normal and I think this in turn made the writing quite interesting and different: it reminded me of Cormac McCarthy's terse writing style, but with the deliberate lack of world explanation of the Malazan Book of the Fallen series that leaves you wondering WTF is going on!

Hopefully Kimi-K3 will have seen enough "it's not X, but Y" to avoid that too! 😱

Yeah, I have it set up locally using the lossless DF11 quant:

https://huggingface.co/DFloat11/FLUX.1-Krea-dev-DF11-ComfyUI

(the FP8 was noticeably worse)

along with the FP8 "scaled" text encoder from here:

https://huggingface.co/comfyanonymous/flux_text_encoders

(it makes almost no difference using FP8 for the text encoder)
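
For context on why the DF11 route is worth the hassle, here's a rough weight-memory comparison. The ~12B parameter count for the FLUX transformer and the ~11 bits/weight for DF11 are assumptions on my part (DFloat11 advertises roughly 70% of the BF16 size), and this ignores activations, the text encoders and the VAE:

```python
# Approximate weight-only memory footprint of the FLUX transformer.
# Assumes ~12B parameters and ~11 bits/weight for DF11 (lossless
# recompression of BF16); activations / text encoders / VAE not included.
n_params = 12e9

bits_per_weight = {
    "BF16 (original)": 16,
    "DF11 (lossless)": 11,
    "FP8  (lossy)":     8,
}

for name, bits in bits_per_weight.items():
    gib = n_params * bits / 8 / 1024**3
    print(f"{name}: ~{gib:.1f} GiB")
# -> BF16 ~22.4 GiB, DF11 ~15.4 GiB, FP8 ~11.2 GiB
```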

That's what I've noticed too when using imagegen models: FP8 hurts them a lot. Q8_0 GGUFs (yes, they exist for imagegen), however, are much less lossy.

it makes almost no difference using FP8 for the text encoder

So it's like the TTS models with an LLM -> neural codec. You can quantize the LLM part but not the neural codec part.

Q8_0 GGUFs (yes, they exist for imagegen)

Did they ever get GGUFs working for ByteDance-Seed/BAGEL-7B-MoT?

I've been using the DF11 but it's quite slow / spills over to CPU.

Hopefully Kimi-K3 will have seen enough "it's not X, but Y" to avoid that too!

You know, interestingly, I found that good old Mistral-Large can identify these and highlight them a lot better than Sonnet-4.5, Command-A and Kimi-K2 when prompted to find them in the text of a chapter.
(They can all do it with better prompting/examples, but I just found it interesting that a model predating this slop could do it.)

[image]
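
For what it's worth, a fair chunk of the blatant instances can be flagged mechanically before handing a chapter to any model. A crude sketch (the patterns are purely illustrative and miss paraphrased variants, which is where the model pass earns its keep):

```python
import re

# Crude flagger for the "wasn't X, but Y" / "not just X, but Y" family.
# The patterns are illustrative only; paraphrased variants still need a
# model (or a human) to catch.
SLOP_PATTERNS = [
    r"\b(?:was|is|were|are)n[’']?t\s+[^.,;]{1,60},\s*but\b",
    r"\bnot\s+(?:just|merely|only)\s+[^.,;]{1,60},\s*but\b",
]

def flag_slop(text: str) -> list[str]:
    """Return sentences matching any of the patterns above."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [
        s.strip()
        for s in sentences
        if any(re.search(p, s, re.IGNORECASE) for p in SLOP_PATTERNS)
    ]

sample = "The sword wasn't a weapon, but a promise. Rain fell on the keep."
print(flag_slop(sample))  # -> ["The sword wasn't a weapon, but a promise."]
```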

Tested Minimax-M2, they distilled gpt-ass, what a terrible model:

User:

What is 15 * 23?

Model:

<think>
The user asks: "What is 15 * 23?" This is a simple multiplication. 15 * 23 = 345. So answer: 345.

We should respond with the answer. There's no policy violation. So answer: 345.

We can also show steps: 15*20=300, 15*3=45, sum=345.

Thus final answer: 345.
</think>

\(15 \times 23 = 345\).

Besides wasting tokens on "safety", it hallucinates like crazy in other prompts.

They distilled gpt-ass, what a terrible model

What a waste :(

it hallucinates like crazy in other prompts.

So does gpt-ass!

@Downtown-Case @treehugg3 You might be interested. I think OLMo-2-0325-32B has comparable performance to Seed-OSS-36B. I have been using the checkpoints at this branch, at the end of the first stage of pretraining (at 6T tokens, before the "ingredient" datasets are added): https://huggingface.co/allenai/OLMo-2-0325-32B/tree/stage1-step721901-tokens6056B

For a base model, it has been surprisingly good at completions and IMO fairly slop-free and definitely uncensored. They did very little pretrain filtering which is interesting because most other models have extensive filtering these days. I also tried the checkpoint at 3T tokens. At 3T tokens it is less coherent but should be more steerable. I am going to experiment with some of the other checkpoints to find a sweet spot if it is not 6T. Get it while you can because HF is hosting over seven hundred checkpoints of the model in this one repo.

One idea is to continue from the last initial pretrain checkpoint and train on high-quality tokens specific to your targets. The paper talks a lot about this "mid-stage" pretraining that might be useful. https://arxiv.org/pdf/2501.00656
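
For anyone who wants to poke at the same checkpoint, the branch name just goes in the revision argument. An untested sketch (dtype/device settings are whatever your hardware allows, not a recommendation):

```python
# Load the stage-1 (6T token) OLMo-2-32B checkpoint from the branch above.
# Untested sketch: adjust dtype / device_map for your hardware; the 32B
# model needs ~64 GB in bf16 before any KV cache.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "allenai/OLMo-2-0325-32B"
revision = "stage1-step721901-tokens6056B"

tokenizer = AutoTokenizer.from_pretrained(repo, revision=revision)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    revision=revision,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "The doom lies in yourself, not in your name."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.8)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```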

Mmmm... I am interested, though with a caveat:

"max_position_embeddings": 4096

That's rather small :(

Seed-OSS base was interesting because it could grasp a huge context reasonably well, even with its relative sloppiness. I had similar high hopes for Falcon H1 34B base, but (at least the last time I checked it with llama.cpp) it just breaks when too many tokens are fed to it.

Potential trainability is interesting though. I do wonder what it'd be like if it were, say, continue-trained on a bunch of whole books. But that's no small financial endeavor.
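
At least the context length can be checked before committing to a download, since only the config JSON gets fetched. The Seed-OSS and Falcon H1 repo ids below are my guesses from the discussion, so double-check them:

```python
# Check declared context length without downloading any weights (only the
# config JSON is fetched). Repo ids other than the OLMo one are guesses.
from transformers import AutoConfig

candidates = [
    "allenai/OLMo-2-0325-32B",
    "ByteDance-Seed/Seed-OSS-36B-Base",   # assumed repo id
    "tiiuae/Falcon-H1-34B-Base",          # assumed repo id
]

for repo in candidates:
    cfg = AutoConfig.from_pretrained(repo, trust_remote_code=True)
    ctx = getattr(cfg, "max_position_embeddings", "n/a")
    print(f"{repo}: max_position_embeddings = {ctx}")
```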

MikuPad is being updated again!
