Learning About Attention

Posted on Tue 04 March 2025 in posts

Introduction

Over the course of the last few months I set myself a goal of implementing more papers and algorithms from scratch.

I was interested both in basic computer science structures like linked lists and trees, and in ML/AI-specific topics like transformers and their latest advancements.

I intended to document more of the process along the way; instead, this post is a collective update on what I have implemented and learned so far and what I will do next. It is specifically about transformers.

Attention mechanism

A few months ago I followed Andrej Karpathy's tutorial on transformers (see here). After following the video and implementing it myself, I tried doing the same in tinygrad, which allowed me to easily train a character-level model on my M1 Mac. See also my exploration of flow matching here and the subsequent posts.

I decided to revisit the attention mechanism in PyTorch, but this time I wanted to implement more than just a decoder-only model and make the code more adaptable. Hence this code was born.

It contains an implementation based on the original Attention Is All You Need paper.
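As a rough sketch of the core computation (not the exact code from the repository), the scaled dot-product attention from the paper can be written in a few lines of PyTorch:

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq_len, head_dim)
    d_k = q.size(-1)
    # similarity scores scaled by sqrt(d_k), as in the original paper
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        # mask out disallowed positions (e.g. future tokens in a decoder)
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ v
```

Multi-head attention then just projects the inputs into several heads, applies this function to each head, and concatenates the results.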

Flash attention

I wanted to squeeze as much performance out of my M1 Mac as possible, and flash attention (see here) seemed like a great approach. It essentially computes the attention step block-wise and carefully manages the device's memory.

The original implementation is tailored to CUDA machines, so I decided to simply try a naive implementation in Python and PyTorch using MPS (no custom C++ MPS code for now).
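Conceptually, the block-wise computation with an online softmax looks roughly like this single-head sketch. It is an illustration of the idea rather than the actual benchmarked code; real flash attention additionally tiles over queries and manages on-chip memory explicitly:

```python
import math
import torch

def blockwise_attention(q, k, v, block_size=128):
    # Naive block-wise attention with an online softmax, in the spirit of
    # flash attention but without any fused kernels or SRAM management.
    # q, k, v: (seq_len, head_dim); single head, no masking, for clarity.
    seq_len, d = q.shape
    scale = 1.0 / math.sqrt(d)
    out = torch.zeros_like(q)
    row_max = torch.full((seq_len, 1), float("-inf"), device=q.device)
    row_sum = torch.zeros((seq_len, 1), device=q.device)

    for start in range(0, seq_len, block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]
        scores = (q @ k_blk.T) * scale                     # (seq_len, blk)
        blk_max = scores.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(row_max, blk_max)
        # rescale previous accumulators to the new running max
        correction = torch.exp(row_max - new_max)
        p = torch.exp(scores - new_max)
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ v_blk
        row_max = new_max

    # normalize by the accumulated softmax denominator at the very end
    return out / row_sum
```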

I added benchmarking code, which seemed to confirm the speedup. The result is here. This file contains essentially all the transformer architectures with flash attention added.
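A minimal timing harness for this kind of comparison might look like the sketch below. The function name and structure are my own illustration, not the repository's benchmark code; the main point is synchronizing the MPS device before reading the clock:

```python
import time
import torch

def benchmark(fn, *args, warmup=3, iters=10, device="mps"):
    # Warm up, synchronize the device, then report the average time per call.
    for _ in range(warmup):
        fn(*args)
    if device == "mps" and torch.backends.mps.is_available():
        torch.mps.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    if device == "mps" and torch.backends.mps.is_available():
        torch.mps.synchronize()
    return (time.perf_counter() - start) / iters
```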

Models

The ease of use and familiarity of the decoder-only model from the tutorial made it my current go-to test model. In the future I want to add encoder architectures as well.

The models are located here, here, and here, covering both the regular and flash-attention-based versions.

They are trained on the tiny Shakespeare dataset for simplicity.
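To give a sense of the architecture, a minimal GPT-style decoder block built on PyTorch's own nn.MultiheadAttention might look like this. The hyperparameters and layout here are illustrative and do not necessarily match the repository:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    # Minimal decoder block: masked self-attention + MLP,
    # each with a residual connection and pre-layer-norm.
    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # causal mask so each position only attends to earlier positions
        seq_len = x.size(1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                     device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out
        x = x + self.mlp(self.ln2(x))
        return x
```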

A note on tokenizers

The original naive character-based tokenizer lets one train a character-level model. This model works great and is quite impressive at generating text one character at a time.

I wanted to extend this to a word-level model, so I added a similarly simple and naive word tokenizer built entirely from the tiny Shakespeare dataset.
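Both tokenizers are in the same naive spirit: build a vocabulary directly from the training text and map tokens to integer ids. A minimal sketch (illustrative names, not the repository's code):

```python
def char_tokenizer(text):
    # vocabulary is simply every distinct character in the training text
    vocab = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(vocab)}
    itos = {i: ch for ch, i in stoi.items()}
    encode = lambda s: [stoi[c] for c in s]
    decode = lambda ids: "".join(itos[i] for i in ids)
    return encode, decode, len(vocab)

def word_tokenizer(text):
    # split on whitespace; no handling of punctuation or unknown words
    vocab = sorted(set(text.split()))
    stoi = {w: i for i, w in enumerate(vocab)}
    itos = {i: w for w, i in stoi.items()}
    encode = lambda s: [stoi[w] for w in s.split()]
    decode = lambda ids: " ".join(itos[i] for i in ids)
    return encode, decode, len(vocab)
```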

This new word-level model is harder to train and might need more of the tricks used when training and generating with larger LLMs. That is still a work in progress.

Optimizers

I learned about the Muon optimizer from this post and various discussions on X (formerly Twitter).

I decided to try my hand at implementing it in PyTorch. So far this is a work in progress.
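As I currently understand it, the core idea of Muon is to take the momentum-smoothed gradient of each 2D weight matrix and approximately orthogonalize it with a Newton-Schulz iteration before applying the update. A simplified sketch of that idea follows; the quintic coefficients are taken from the published Muon code, while Nesterov momentum, update scaling, and handling of non-2D parameters are omitted:

```python
import torch

def newton_schulz_orthogonalize(g, steps=5, eps=1e-7):
    # Approximately orthogonalize a 2D matrix with a quintic
    # Newton-Schulz iteration (coefficients from the published Muon code).
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (g.norm() + eps)
    transposed = x.size(0) > x.size(1)
    if transposed:
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        B = b * A + c * A @ A
        x = a * x + B @ x
    return x.T if transposed else x

class Muon(torch.optim.Optimizer):
    # Simplified sketch: SGD-style momentum on each 2D weight matrix,
    # followed by orthogonalization of the resulting update direction.
    def __init__(self, params, lr=0.02, momentum=0.95):
        super().__init__(params, dict(lr=lr, momentum=momentum))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None or p.ndim != 2:
                    continue
                state = self.state[p]
                buf = state.setdefault("momentum_buffer",
                                       torch.zeros_like(p.grad))
                buf.mul_(group["momentum"]).add_(p.grad)
                update = newton_schulz_orthogonalize(buf)
                p.add_(update, alpha=-group["lr"])
```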

Conclusion and Future Work

I will continue exploring transformers and their developments and improvements to better understand how LLMs work under the hood. In the future, I also plan to implement updated RNN architectures based on this work, and hopefully other new transformer variants (Linformer?).

While implementing these, I occasionally used AI tools to refine ideas and debug code issues, but all core implementations were built from scratch.