Himansh Raj

Building F75: Training a Mini GPT on a Single GPU

I recently built F75, a compact GPT-style language model, as a project for learning more about training and evaluating small language models on my NVIDIA RTX 5060 Ti (16 GB).

The goal wasn't to chase state-of-the-art benchmarks. The goal was to understand model behavior deeply by building, training, and iterating on a smaller architecture end to end.

Why I made F75

When you work with large foundation models, it's easy to lose intuition because the systems are massive and expensive to run. F75 was my way of going back to first principles:

  • define a clear architecture,
  • train it on a focused objective,
  • observe outputs closely,
  • and compare behavior across configuration changes.

This project helped me sharpen practical instincts around training dynamics and generation quality.

Model snapshot

Here are the core specs from F75:

  • Parameters: 107,648
  • Attention heads: 4
  • Embedding dimension: 32
  • Vocabulary: 61 tokens
  • Context length: 64
  • Transformer blocks: 8
  • Architecture: GPT-2 style
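To make the specs concrete, here is a minimal PyTorch sketch of a GPT-2-style model with exactly these dimensions. This is my own reconstruction under stated assumptions (pre-norm blocks, a 4x MLP expansion, an untied bias-free output head), not F75's actual source:

```python
import torch
import torch.nn as nn

VOCAB = 61       # vocabulary size
D_MODEL = 32     # embedding dimension
N_HEADS = 4      # attention heads
N_LAYERS = 8     # Transformer blocks
CTX = 64         # context length

class Block(nn.Module):
    """One pre-norm Transformer block (GPT-2 style)."""
    def __init__(self, dropout=0.0):
        super().__init__()
        self.ln1 = nn.LayerNorm(D_MODEL)
        self.attn = nn.MultiheadAttention(D_MODEL, N_HEADS,
                                          dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(D_MODEL)
        self.mlp = nn.Sequential(
            nn.Linear(D_MODEL, 4 * D_MODEL), nn.GELU(),
            nn.Linear(4 * D_MODEL, D_MODEL), nn.Dropout(dropout),
        )

    def forward(self, x):
        # causal mask: each position attends only to itself and earlier tokens
        t = x.size(1)
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), 1)
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + a
        return x + self.mlp(self.ln2(x))

class MiniGPT(nn.Module):
    def __init__(self, dropout=0.0):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, D_MODEL)
        self.pos = nn.Embedding(CTX, D_MODEL)
        self.blocks = nn.ModuleList(Block(dropout) for _ in range(N_LAYERS))
        self.ln_f = nn.LayerNorm(D_MODEL)
        self.head = nn.Linear(D_MODEL, VOCAB, bias=False)  # untied output head

    def forward(self, idx):
        # idx: (batch, time) integer token ids
        x = self.tok(idx) + self.pos(torch.arange(idx.size(1), device=idx.device))
        for blk in self.blocks:
            x = blk(x)
        return self.head(self.ln_f(x))  # (batch, time, vocab) logits
```

Under these assumptions the sketch comes out to exactly 107,648 trainable parameters, so the listed specs are internally consistent, even if F75's actual layer layout differs in detail.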

Despite being small, the model learned useful sequence patterns and basic spelling behavior, which was encouraging for such a compact setup.

Dropout vs non-dropout experiments

One of the key things I explored was how the model behaves with and without dropout.

I trained and evaluated two versions:

  • Non-dropout model
  • Dropout model (dropout=0.1)

Comparing generations from both versions gave me a clearer view into overfitting tendencies and output stability. This simple A/B setup made the project much more educational than training only one checkpoint.
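One detail worth flagging in this kind of A/B: dropout is only active in training mode, so both checkpoints must be switched to eval() before comparing generations, or masking noise gets mixed into the comparison. A minimal illustration with a toy stand-in module (the Toy class and its layer sizes are hypothetical, not F75's code):

```python
import torch
import torch.nn as nn

class Toy(nn.Module):
    """Tiny stand-in with a dropout layer; the real model exposes the
    same train()/eval() switch, as every nn.Module does."""
    def __init__(self, p):
        super().__init__()
        self.fc = nn.Linear(32, 32)
        self.drop = nn.Dropout(p)

    def forward(self, x):
        return self.drop(self.fc(x))

torch.manual_seed(0)
x = torch.randn(4, 32)

baseline = Toy(p=0.0)  # non-dropout variant
regular = Toy(p=0.1)   # dropout variant

# In eval mode dropout becomes the identity, so repeated forward passes
# are deterministic -- the fair basis for an A/B over generations.
regular.eval()
assert torch.equal(regular(x), regular(x))

# In train mode the dropout variant is stochastic: comparing generations
# without calling eval() first would conflate masking noise with the
# actual effect of regularization on the learned weights.
```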

Inference workflow

Running inference in F75 is intentionally simple:

  • run python main_infrence.py for the non-dropout model,
  • switch to dropout=0.1 and load the dropout checkpoint,
  • run inference again and compare outputs.
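The loop above can be sketched in a few lines. The generate helper and the checkpoint file names here are my assumptions, not the contents of F75's actual main_infrence.py:

```python
import torch

def generate(model, idx, max_new_tokens, ctx_len=64):
    """Autoregressive sampling: repeatedly feed the last ctx_len tokens
    back in and sample the next token from the softmax distribution."""
    model.eval()  # disable dropout so the two checkpoints compare fairly
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(idx[:, -ctx_len:])            # (B, T, vocab)
            probs = torch.softmax(logits[:, -1, :], -1)  # last position only
            nxt = torch.multinomial(probs, num_samples=1)
            idx = torch.cat([idx, nxt], dim=1)           # append sampled token
    return idx

# Swapping checkpoints between runs (paths are hypothetical):
#   model.load_state_dict(torch.load("f75_no_dropout.pt"))
#   model.load_state_dict(torch.load("f75_dropout.pt"))
```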

That tight loop made iteration fast and helped me reason about architecture and regularization choices in a very concrete way.

What I learned

F75 reinforced a few things for me:

  1. Small models are excellent for building intuition quickly.
  2. Controlled experiments (like dropout vs non-dropout) teach more than one-off training runs.
  3. Clean inference workflows make research iteration dramatically easier.

If you're learning LLM training, building a compact model like this is one of the best ways to understand what's really happening under the hood.

If you want to check out the project directly, here it is: F75 on GitHub.