It's 2026, and everyone should have their own LLM benchmark!
I like writing, obviously, but oftentimes I get anxious about starting and can postpone it indefinitely, shuffling talking points in my mind and procrastinating in front of a blank page. Now, part of it is of course the creative process, but partly I just need a small initial push to get things rolling.
Wouldn’t it be nice to start from a first draft already generated by an LLM from my raw thoughts and observations? Ideally, this draft would be as close as possible to what I want to see in the end, in content, structure, and style. The only thing left was to find the best model for that.
Benchmarking Methodology
Luckily I already have a dozen or so blog posts which I could use for this experiment. The plan was the following:
- Preparation: Manually write prompts with my “thoughts and observations” for each suitable post, to be used later for generation. Of course this is not exactly a clean or scientifically rigorous setup, but this whole benchmark is not particularly serious anyway.
- Generation: Provide each prompt, plus all posts except the to-be-generated one as style references, to the generation models and ask them to produce a draft in my style. I used OpenCode for this to keep it close to real usage, even though it probably does not suit all the models equally well.
- Evaluation: Feed each original post, together with all the corresponding generated drafts, to all the evaluation models and ask them to score on a 1-10 scale how much work it would take to turn each draft into the finished article. All generated drafts are anonymized for this step to avoid any bias (roughly sketched below).
- Analysis: Average each generator+evaluator pair’s scores over all posts and produce a pretty result table (also sketched below). Instead of using any fancy plotting libraries, I produced the table as an SVG image programmatically from the raw data.
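To give a feeling for the evaluation step, here is a minimal Scala sketch of the anonymization and score extraction. The names (`Draft`, `anonymize`, `extractScore`) and the `SCORE: <n>` reply format are illustrative assumptions, not the actual code of my tool.

```scala
import scala.util.Random

final case class Draft(model: String, text: String)

object Evaluation:
  // Shuffle the drafts and hand out neutral labels ("Draft A", "Draft B", ...)
  // so the evaluator cannot tell which model produced which text.
  // The returned key maps labels back to model names for later de-anonymization.
  def anonymize(drafts: Seq[Draft], rng: Random = new Random()): (Map[String, String], Seq[(String, String)]) =
    val labelled = rng.shuffle(drafts).zipWithIndex.map { case (d, i) =>
      (s"Draft ${('A' + i).toChar}", d)
    }
    val key     = labelled.map { case (label, d) => label -> d.model }.toMap
    val payload = labelled.map { case (label, d) => label -> d.text }
    (key, payload)

  // Pull the 1-10 score out of an evaluator's reply, which is expected to end
  // with a line like "SCORE: 7" (this reply format is an assumption).
  private val ScoreLine = """(?i)score:\s*(\d{1,2})""".r
  def extractScore(reply: String): Option[Int] =
    ScoreLine.findAllMatchIn(reply).toSeq.lastOption
      .map(_.group(1).toInt)
      .filter(s => s >= 1 && s <= 10)
```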
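The analysis step then boils down to a group-and-average, assuming the raw results are collected as one record per (post, generator, evaluator) triple; again, the names are illustrative:

```scala
final case class Score(post: String, generator: String, evaluator: String, value: Double)

object Analysis:
  // Average every generator+evaluator pair over all posts. Repeated benchmark
  // runs simply contribute extra Score records to the same pair, so averaging
  // over runs comes for free.
  def averages(scores: Seq[Score]): Map[(String, String), Double] =
    scores
      .groupBy(s => (s.generator, s.evaluator))
      .view
      .mapValues(ss => ss.map(_.value).sum / ss.size)
      .toMap
```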
In both generation and evaluation prompts I intentionally avoided any specifics on what to concentrate on (voice, structure, tone) and let the models decide for themselves what was relevant.
Results
As seen from the table, both GPT-5.2 and Claude Sonnet 4.5 performed noticeably better than the others, although still far from perfect. I was expecting the models to show a bias towards their own texts, but none of them did; the two leaders actually gave preference to each other. As suspected, the last two places went to Mistral and TNG Chimera, the models least suited to being used through OpenCode.
According to my observations, the generated posts mostly struggled with brevity and with a natural, personal tone. All too often they fell into a “documentation trap”, producing overexplained, formal tutorials instead of personal blog posts. In their defense, some of my posts are indeed just dry technical instructions, but Claude and GPT were best at deciding which topic calls for which tone. Another challenge, unsurprisingly, was my somewhat dry humor, which was often either missing completely (making the text sound like a tutorial) or so forced and cringeworthy that it landed squarely in “How do you do, fellow kids?” territory.
All in all, even though some of the drafts were quite good, I am not sure I am tempted to use this scheme in the future. Not once did I feel the urge to start editing a draft I had read, perhaps because they just sound like someone else’s thoughts. I think I simply enjoy the process of writing too much to rob myself of it by automating it away.
Observations and Remarks
Nevertheless, I don’t consider this project a failure, as I still learned a few things from it.
- In the first version of the evaluation template I asked the model to give a score before providing a short reasoning behind it. However, this could accidentally lead the models to justify a random score instead of really evaluating the drafts. In the final version I require the reasoning first and only then the score (roughly sketched after this list).
- I was happy to have used Scala for the small CLI tool that automated the steps. It made the AI-proposed changes much easier to understand, review, and trust (compared to Python, for example) without the overhead that Rust would have brought.
- Assembling an SVG file from code was surprisingly easy and gave me full control over the resulting image (a stripped-down sketch follows this list). It probably would have been much harder to replicate with a visualization library or an HTML table/grid. I highly recommend this approach.
- During development I noticed that the models sometimes had problems understanding which generated post was written by which model, which led to confusing evaluations. This turned out to be a bug in OpenCode, which I reported and which was promptly fixed.
- In the end I ran the benchmark three times and averaged the scores, because after a single run one model was unbelievably better than all the rest. Still, because each request had a very short context, running the benchmark probably cost fewer tokens than writing the code for it (I used mostly GLM 4.7 for that and was impressed by its quality).
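For illustration, the reasoning-before-score ordering looks roughly like this; the wording below is a paraphrase, not my exact template:

```scala
object Prompts:
  // Build the evaluation prompt: the reasoning is requested first and the
  // numeric score only at the very end, so the model cannot anchor on a number
  // it happened to pick upfront. Wording and the "SCORE: <n>" format are illustrative.
  def evaluation(original: String, drafts: Seq[(String, String)]): String =
    val draftBlock = drafts.map { case (label, text) => s"## $label\n\n$text" }.mkString("\n\n")
    s"""You are comparing anonymized drafts against the original blog post.
       |
       |# Original post
       |
       |$original
       |
       |# Drafts
       |
       |$draftBlock
       |
       |For each draft, first explain in a few sentences how much work it would
       |take to turn it into the finished article, and only then give a score
       |from 1 to 10 on its own line in the form "SCORE: <n>".
       |""".stripMargin
```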
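And this is roughly what the SVG assembly amounts to: plain string concatenation on a fixed grid, written out to a file. Sizes, fonts, and layout here are placeholders rather than what my actual table uses.

```scala
import java.nio.file.{Files, Paths}

object SvgTable:
  // Lay the averaged scores out as a simple grid of <rect> and <text> elements.
  // `rows` are generator names, `cols` are evaluator names, `cell` looks up the
  // averaged score for a pair. All dimensions are arbitrary placeholders.
  def render(rows: Seq[String], cols: Seq[String], cell: (String, String) => Double): String =
    val cw = 90; val ch = 28; val x0 = 140; val y0 = 40
    val header = cols.zipWithIndex.map { case (c, j) =>
      s"""<text x="${x0 + j * cw + cw / 2}" y="${y0 - 10}" text-anchor="middle" font-size="12">$c</text>"""
    }
    val body = rows.zipWithIndex.flatMap { case (r, i) =>
      val label = s"""<text x="10" y="${y0 + i * ch + 18}" font-size="12">$r</text>"""
      val cells = cols.zipWithIndex.map { case (c, j) =>
        s"""<rect x="${x0 + j * cw}" y="${y0 + i * ch}" width="$cw" height="$ch" fill="none" stroke="black"/>""" +
          s"""<text x="${x0 + j * cw + cw / 2}" y="${y0 + i * ch + 18}" text-anchor="middle" font-size="12">${f"${cell(r, c)}%.1f"}</text>"""
      }
      label +: cells
    }
    val width  = x0 + cols.size * cw + 10
    val height = y0 + rows.size * ch + 10
    s"""<svg xmlns="http://www.w3.org/2000/svg" width="$width" height="$height">
       |${(header ++ body).mkString("\n")}
       |</svg>""".stripMargin

  def save(path: String, svg: String): Unit =
    Files.writeString(Paths.get(path), svg)
```

No dependencies, and the resulting file opens directly in the browser.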