Here's part 1 [1]. Since his archive goes by date, it's a bit easier to guesstimate which part was written in which month.
[1] https://www.gilesthomas.com/2024/12/llm-from-scratch-1
Seems like you can filter by tag: https://www.gilesthomas.com/llm-from-scratch
It's interesting: 22 parts in under a year. Seems like a fun, up-to-date project. Karpathy did something very similar with nanochat (following nanoGPT).
The cost comparison between local RTX 3090 and cloud A100 clusters is useful, but I wonder if the author accounted for hidden overhead—like data transfer time for large datasets or the time spent debugging CUDA compatibility issues on local hardware.
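The data-transfer overhead is at least easy to ballpark. A rough sketch (the dataset size and bandwidth are assumptions of mine, not figures from the article):

    # Back-of-envelope upload time for a dataset at an assumed bandwidth.
    dataset_gb = 500          # hypothetical dataset size
    bandwidth_mbit_s = 1000   # assumed sustained 1 Gbit/s uplink
    seconds = dataset_gb * 8 * 1000 / bandwidth_mbit_s
    print(f"{seconds / 3600:.1f} hours")  # ~1.1 hours at full line rate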
It's based on a book (https://www.manning.com/books/build-a-large-language-model-f...). Is it a good book?
I have done a little bit of DL stuff (with Keras) before this. I'm currently in the attention chapter. The book gives you the code, but I feel like there is very little in the way of building intuition. Thankfully, there are tons of videos online to help with that.
I think it is a great guide. An extended tutorial, if you will (at least up to this point in my reading). Also, having the code right in front of you helps a lot. For example, I was under the impression that embedding vectors were static, like in word2vec. Turns out they are learnable parameters too. I wouldn't have been able to tell for sure if I didn't have the code right in front of me.
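To make that concrete: in the book's PyTorch code the embedding table is trained like any other weight. A minimal sketch (the sizes are made up, this is not the book's actual listing):

    import torch
    import torch.nn as nn

    # A tiny vocab of 10 tokens, each mapped to a 4-dim vector.
    emb = nn.Embedding(num_embeddings=10, embedding_dim=4)
    print(emb.weight.requires_grad)  # True: the table is a learnable parameter

    tokens = torch.tensor([1, 3, 3, 7])
    emb(tokens).sum().backward()     # any scalar "loss" works for the demo

    # Rows 1, 3 and 7 now have non-zero gradients, so an optimizer
    # step would move exactly those vectors.
    print(emb.weight.grad[1])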
> The book gives you the code, but I feel like there is very little in the way of building intuition.
There isn't really much intuition to begin with, and I don't really think building intuition will be useful, anyway. Even when looking at something as barebones as perceptrons, it's hard to really see "why" they work. Heck, even implementing a Markov chain from scratch (which can be done in an afternoon with no prior knowledge) can feel magical when it starts outputting semi-legible sentences.
It's like trying to build intuition when it comes to technical results like the Banach-Tarski paradox or Löb's theorem. Imo, understanding the math (which in the case of LLMs is actually quite simple) is orders of magnitude more valuable than "building intuition," whatever that might mean.
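To the Markov chain point: a word-level one really does fit in an afternoon. A rough sketch of the idea (mine, not anyone's exact code):

    import random
    from collections import defaultdict

    def build_chain(text):
        # Map each word to the list of words that follow it in the corpus.
        chain = defaultdict(list)
        words = text.split()
        for a, b in zip(words, words[1:]):
            chain[a].append(b)
        return chain

    def generate(chain, start, length=20):
        out = [start]
        for _ in range(length):
            followers = chain.get(out[-1])
            if not followers:
                break
            out.append(random.choice(followers))
        return " ".join(out)

    chain = build_chain("the cat sat on the mat and the dog sat on the rug")
    print(generate(chain, "the"))  # semi-legible output, no learning involved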
> Even when looking at something as barebones as perceptrons
I was thinking of something like "it is trying to approximate a non-linear function" (which is what an MLP is doing).
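That framing is easy to make concrete: a one-hidden-layer MLP can fit sin(x). A minimal PyTorch sketch (width, learning rate, and step count are arbitrary choices):

    import torch
    import torch.nn as nn

    # One hidden layer is enough to approximate a smooth 1-D function.
    model = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)

    x = torch.linspace(-3, 3, 256).unsqueeze(1)
    y = torch.sin(x)

    for _ in range(2000):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()

    print(loss.item())  # near zero: the net has approximated sin on [-3, 3]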
> Even when looking at something as barebones as perceptrons, it's hard to really see "why" they work.
Check out the Karpathy "Zero to Hero" videos, and try to follow along by building an MLP implementation in your own language of choice. He does a good job of building intuition because he doesn't skip much of anything.
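For a taste of what part 1 of that series builds, here is the core of a micrograd-style scalar autograd engine (a heavily condensed sketch of the idea, not his exact code):

    import math

    class Value:
        """A scalar that remembers how it was computed, for backprop."""
        def __init__(self, data, children=()):
            self.data = data
            self.grad = 0.0
            self._backward = lambda: None
            self._children = children

        def __add__(self, other):
            out = Value(self.data + other.data, (self, other))
            def _backward():
                self.grad += out.grad
                other.grad += out.grad
            out._backward = _backward
            return out

        def __mul__(self, other):
            out = Value(self.data * other.data, (self, other))
            def _backward():
                self.grad += other.data * out.grad
                other.grad += self.data * out.grad
            out._backward = _backward
            return out

        def tanh(self):
            t = math.tanh(self.data)
            out = Value(t, (self,))
            def _backward():
                self.grad += (1 - t * t) * out.grad
            out._backward = _backward
            return out

        def backward(self):
            # Topologically sort the graph, then apply the chain rule.
            topo, seen = [], set()
            def visit(v):
                if v not in seen:
                    seen.add(v)
                    for c in v._children:
                        visit(c)
                    topo.append(v)
            visit(self)
            self.grad = 1.0
            for v in reversed(topo):
                v._backward()

    # A single neuron: tanh(w*x + b), gradients via backprop.
    x, w, b = Value(0.5), Value(-2.0), Value(1.0)
    out = (w * x + b).tanh()
    out.backward()
    print(x.grad, w.grad)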