Trained a 0.8M model on business email generation.

Reddit r/LocalLLaMA / 3/21/2026

💬 Opinion | Tools & Practical Usage | Models & Research

Key Points

  • A 0.8M parameter model for business email generation was trained using a novel architecture called Silia that replaces FFN with attention blocks to reduce parameters while preserving performance.
  • The training used the email-datasets-20k dataset with an 80/20 data split and prompts like "Write a polite refusal email" to demonstrate outputs.
  • The architecture combines 2x Strawberry Blocks, scaled dot-product attention, rotary positional embeddings, QK norm, multi-head attention, SiLU non-linearity, and a weight-tied output projection; the author calls the resulting SwiGLU-into-attention variant Silia.
  • The work follows a prior 28M-model email experiment. The author reports that Silia reaches similar loss and performance to a standard attention + SwiGLU architecture on the same dataset and training config, with far fewer parameters.
Hi everyone,

I saw this post (28M model email experiment) and decided to train on the same dataset but with a different architecture.

The test prompt is the same as in the above post:

Prompt: Write a polite refusal email<|eop|>

Output:

```
Correcting Reason for Recent Experience

Dear [Officer's Name],

I hope this email finds you well.

As you know, [Company Name] has recently received your experience with [Product Name], a hugely successful funding round, and we’re committed to providing you with the billing records and targeted a highly value of [Client Name].

As you know, our rebranding initiative is currently undergoing a significant rebranding phase, and we want to sincerely apologize for the timing. We value your time and appreciate your understanding.

We understand your concerns swiftly and want to assure you that we will be moving forward. Please let us know your availability.

Sincerely,
[Name]
Junior Developer
[Company Name]<|eot|>
```

  1. <|eop|> means end-of-prompt and <|eot|> means end-of-text.

  2. <|eop|> is appended to the end of the prompt, and the model emits <|eot|> at the end of its generated output.
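
The two special tokens above can be handled with a pair of small helpers. This is a minimal sketch, not code from the Strawberry repo: the function names `build_prompt` and `extract_completion` are hypothetical, and it assumes the raw generation is available as a plain string that may or may not echo the prompt.

```python
EOP = "<|eop|>"  # end-of-prompt marker, appended by the user
EOT = "<|eot|>"  # end-of-text marker, emitted by the model

def build_prompt(instruction: str) -> str:
    """Append the end-of-prompt marker to a plain instruction."""
    return instruction + EOP

def extract_completion(generated: str) -> str:
    """Return the text after <|eop|> (if present) and before the first <|eot|>."""
    head, sep, tail = generated.partition(EOP)
    completion = tail if sep else head  # no <|eop|> means the prompt was not echoed
    return completion.partition(EOT)[0].strip()

if __name__ == "__main__":
    prompt = build_prompt("Write a polite refusal email")
    raw = prompt + " Dear [Name],<|eot|> trailing tokens"
    print(extract_completion(raw))
```

Cutting at the first <|eot|> discards anything the sampler produces after the model has signalled that the email is finished.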

I've been experimenting with a simple idea: completely removing the FFN by replacing the Linear layers in the SwiGLU FFN with Attention layers, which converts SwiGLU into something I call Silia (SiLU in attention). It achieved similar loss and performance (compared to a standard Attention + SwiGLU architecture) on the same dataset & training config with far fewer parameters.

This is the architecture diagram:

    Input tokens
        |
    [Token Embedding]
        |
    [2x Strawberry Blocks]
        |--- Scaled Dot Product Attention
        |    |--- Rotary Positional Embeddings
        |    |--- QK Norm
        |    |--- Multi-Headed Attention
        |--- SiLU non-linearity * Scaled Dot Product Attention
        |--- Scaled Dot Product Attention
        |
    [Output Projection (weight-tied)]
        |
    Next token logits
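
The diagram does not pin down the exact wiring inside a block, so the following is only a sketch of one possible reading, not the Strawberry implementation. It assumes two attention layers per block with a SiLU gate computed from the block input, and it leaves out multi-head splitting, rotary embeddings, QK norm, residual connections, and normalization. The name `silia_like_block` and all helper names are hypothetical. Plain Python lists are used so the sketch has no dependencies.

```python
import math
import random

def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m); both are lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def silu(x):
    return x / (1.0 + math.exp(-x))

def causal_attention(x, wq, wk, wv):
    """Single-head causal scaled dot-product attention over a (T x d) sequence."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d = len(q[0])
    out = []
    for t in range(len(x)):
        # Position t may only attend to positions 0..t.
        scores = [sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        weights = softmax(scores)
        out.append([sum(weights[s] * v[s][j] for s in range(t + 1))
                    for j in range(d)])
    return out

def random_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 0.2) for _ in range(cols)] for _ in range(rows)]

def make_attention(rng, d):
    """Random (Wq, Wk, Wv) weights for one attention layer."""
    return tuple(random_matrix(rng, d, d) for _ in range(3))

def silia_like_block(x, attn_a, attn_b):
    """ASSUMED wiring: attn_b(SiLU(x) * attn_a(x)), with attention where
    a SwiGLU FFN would use linear projections."""
    a = causal_attention(x, *attn_a)
    gated = [[silu(xi) * ai for xi, ai in zip(x_row, a_row)]
             for x_row, a_row in zip(x, a)]
    return causal_attention(gated, *attn_b)

if __name__ == "__main__":
    rng = random.Random(0)
    x = random_matrix(rng, 5, 8)  # 5 tokens, 8-dim embeddings
    y = silia_like_block(x, make_attention(rng, 8), make_attention(rng, 8))
    print(len(y), len(y[0]))
```

Because both layers are causal and the gate is elementwise, the output at position t depends only on tokens 0..t, so a block wired this way can still be trained for next-token prediction.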

I trained on the email-datasets-20k dataset, which was used in the post I linked above.

This is the model training config:

    {
      "dataset": {"data_division": 0.8, "load_from_file": true, "path": "data/email.bin"},
      "checkpoints": {"path": "bin/email", "interval": 1000, "create_checkpoints": true},
      "model_hyperparams": {"vocab_size": 8192, "block_size": 256, "n_layer": 2, "n_head": 4, "n_embd": 64},
      "optimizer_hyperparams": {"eps": 1e-08, "beta1": 0.9, "beta2": 0.99, "weight_decay": 0.001, "use_muon": false, "momentum": 0.95},
      "model_path": "bin/email/email.strawberry",
      "encoder_path": "bin/cl8k.bin",
      "init_from": "scratch",
      "seed": "auto",
      "gradient_accumulation_steps": 1,
      "batch_size": 16,
      "max_iters": 10000,
      "eval_interval": 1000,
      "log_interval": 100,
      "eval_iters": 100,
      "decay_lr": true,
      "lr_decay_iters": 10000,
      "learning_rate": 0.002,
      "cooldown_frac": 0.4,
      "warmup_iters": 500,
      "min_lr": 0.0002
    }
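
The learning-rate keys in the config suggest a warmup, hold, cooldown schedule. The post does not say how Strawberry interprets them, so this is a sketch under an assumption: linear warmup over `warmup_iters`, a constant phase at `learning_rate`, then a linear cooldown to `min_lr` over the last `cooldown_frac` of `lr_decay_iters`. The function name `lr_at` is hypothetical.

```python
def lr_at(step, peak=2e-3, min_lr=2e-4, warmup=500, total=10_000, cooldown_frac=0.4):
    """Learning rate at a given step under an ASSUMED warmup/hold/cooldown schedule."""
    if step < warmup:
        # Linear warmup from near zero up to the peak rate.
        return peak * (step + 1) / warmup
    cooldown_start = total - round(total * cooldown_frac)
    if step < cooldown_start:
        return peak
    if step >= total:
        return min_lr
    # Linear interpolation from peak down to min_lr across the cooldown window.
    progress = (step - cooldown_start) / (total - cooldown_start)
    return peak + (min_lr - peak) * progress

if __name__ == "__main__":
    for step in (0, 499, 3000, 6000, 8000, 10_000):
        print(step, lr_at(step))
```

Under this reading the cooldown starts at step 6000 and the rate is halfway between peak and minimum at step 8000.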

The model has 0.8M total params, of which 0.3M are non-embedding params. It has 2 blocks (4 attention layers & 2 activations in total) and 4 attention heads.
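
The parameter split can be sanity-checked from the config values. This is a rough check: it takes `vocab_size` and `n_embd` from the config and treats the reported 0.8M total as a rounded figure. Since the output projection is weight-tied, the embedding matrix is counted once, and rotary positional embeddings add no learned parameters.

```python
vocab_size = 8192        # from model_hyperparams
n_embd = 64              # from model_hyperparams
reported_total = 800_000  # "0.8M", rounded in the post

# Token embedding matrix, shared with the weight-tied output projection.
embedding_params = vocab_size * n_embd
non_embedding_params = reported_total - embedding_params

print(f"embedding:     {embedding_params:,}")
print(f"non-embedding: ~{non_embedding_params / 1e6:.1f}M")
```

The embedding table alone is 524,288 parameters, which leaves roughly 0.3M for the two blocks, in line with the reported split.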

I used my custom tokenizer with an 8k vocab size. It is just the Regex + BPE tokenizer that Andrej Karpathy built in one of his videos; the only difference is that I'm using the o200k_base regex pattern, which was used for GPT-4o.
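
For readers unfamiliar with the Regex + BPE approach, here is a compact sketch of the idea: split the text into chunks with a regex, then repeatedly merge the most frequent adjacent byte pair within chunks. This is not the author's tokenizer. `SPLIT_PATTERN` is a simplified stand-in, because the real o200k_base pattern relies on Unicode property classes that Python's built-in `re` module does not support, and the function names are hypothetical.

```python
import re
from collections import Counter

# Simplified stand-in for the o200k_base split pattern (an assumption).
SPLIT_PATTERN = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")

def pair_counts(chunks):
    """Count adjacent token pairs, never crossing chunk boundaries."""
    counts = Counter()
    for ids in chunks:
        counts.update(zip(ids, ids[1:]))
    return counts

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, vocab_size):
    """Learn merges until the vocabulary reaches vocab_size (256 byte tokens + merges)."""
    chunks = [list(piece.encode("utf-8")) for piece in SPLIT_PATTERN.findall(text)]
    merges = {}
    for new_id in range(256, vocab_size):
        counts = pair_counts(chunks)
        if not counts:
            break  # nothing left to merge
        pair = counts.most_common(1)[0][0]
        merges[pair] = new_id
        chunks = [merge(ids, pair, new_id) for ids in chunks]
    return merges

def encode(text, merges):
    out = []
    for piece in SPLIT_PATTERN.findall(text):
        ids = list(piece.encode("utf-8"))
        while len(ids) >= 2:
            # Apply the earliest-learned merge that is present in this chunk.
            pair = min(set(zip(ids, ids[1:])), key=lambda p: merges.get(p, float("inf")))
            if pair not in merges:
                break
            ids = merge(ids, pair, merges[pair])
        out.extend(ids)
    return out

def decode(ids, merges):
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():  # dict order matches training order
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8")

if __name__ == "__main__":
    text = "please reply, please reply, please reply"
    merges = train_bpe(text, 256 + 5)
    ids = encode(text, merges)
    print(len(text.encode("utf-8")), "bytes ->", len(ids), "tokens")
    print(decode(ids, merges) == text)
```

An 8k vocabulary as used in the post would correspond to `train_bpe(text, 8192)` on the full dataset, though a pure-Python loop like this would be slow at that scale.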

After tokenization the dataset had 5.5M total tokens (~20M chars in total). After the 80/20 split, I had 4.4M train tokens and 1.1M val tokens. I trained on the dataset for ~10 epochs.
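
These figures can be cross-checked against the training config. The sketch below assumes each optimizer step consumes `batch_size * block_size * gradient_accumulation_steps` tokens, which is the usual convention but is not stated in the post.

```python
total_tokens = 5_500_000   # reported after tokenization
total_chars = 20_000_000   # reported, approximate
data_division = 0.8        # from the config
batch_size, block_size, grad_accum, max_iters = 16, 256, 1, 10_000

train_tokens = round(total_tokens * data_division)
val_tokens = total_tokens - train_tokens

# ASSUMPTION: every sequence in a batch is a full block of block_size tokens.
tokens_per_step = batch_size * block_size * grad_accum
tokens_seen = tokens_per_step * max_iters

epochs = tokens_seen / train_tokens
chars_per_token = total_chars / total_tokens

print(f"train/val tokens: {train_tokens:,} / {val_tokens:,}")
print(f"tokens seen:      {tokens_seen:,}")
print(f"epochs:           {epochs:.1f}")
print(f"chars per token:  {chars_per_token:.1f}")
```

Under that assumption the run sees about 41M tokens, or roughly 9.3 passes over the training split, consistent with the "~10 epochs" figure.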

The final train & val losses were 1.65 & 1.68, respectively.

I've attached some screenshots of loss & demo generations.

Here's the github repo link: https://github.com/SrijanSriv211/Strawberry

You can download the model from here: https://github.com/SrijanSriv211/Strawberry/releases/tag/s0.2a

Thank you :)

submitted by /u/SrijSriv211