Everything started with a model capable of predicting the next token in a sequence of tokens. The neural network that does this is built on the Transformer architecture, made famous by the paper "Attention Is All You Need".
Then researchers scaled the capabilities of these networks with more data and more compute, until we hit a ceiling. We found out that improvements in these systems are not linear: they require exponentially more data and compute. As of today, with the current architecture, we don't have enough of either to make much more progress.
Ā
Within a few years, though, we found many ways to improve and optimize these models. I'm not an ML engineer, so I won't go through the details, but I want to give some insight into a couple of techniques that are currently used and have made the models better:
Chain of Thought (CoT) is a prompting technique that encourages a model to generate step-by-step reasoning or intermediate explanations before delivering a final answer. There are several forms of CoT prompting, but the simplest approach is to include instructions in the prompt such as, "Please reason through this problem and explain your steps before giving the answer," which guides the model to articulate its logic prior to responding (see the sketch below).
Mixture of Experts: several specialized sub-networks ("experts") handle different parts of the input, and a gating network decides which expert(s) to activate for each input, enabling efficient, scalable, and accurate processing.
Knowledge Distillation: In this process, a large pre-trained "teacher" model transfers its knowledge to a smaller "student" model, enabling the smaller model to retain much of the accuracy of the larger model while being more computationally efficient.
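To make the CoT idea from the first point concrete, here is a minimal sketch in Python. Everything in it is hypothetical: `ask_llm` is just a stand-in for whatever LLM client you use; the only thing that matters is the extra instruction prepended to the prompt.

```python
# Minimal Chain-of-Thought sketch. `ask_llm` is a hypothetical stand-in for a
# real LLM client call; only the prompt construction matters here.

def ask_llm(prompt: str) -> str:
    # Replace with a real call to your model provider.
    return "<model output for: " + prompt[:60] + "...>"

question = "A train leaves at 14:10 and arrives at 16:45. How long is the trip?"

# Plain prompt: the model answers directly.
direct_answer = ask_llm(question)

# CoT prompt: we explicitly ask for intermediate reasoning before the answer.
cot_prompt = (
    "Please reason through this problem step by step and explain your steps "
    "before giving the final answer.\n\n" + question
)
reasoned_answer = ask_llm(cot_prompt)

print(direct_answer)
print(reasoned_answer)
```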
Ā
These techniques improved the models quite a bit and are still showing signs of improvement. Researchers also use fine-tuning, reinforcement learning, and other techniques to further improve model accuracy on our questions.
Ā
At this point, LLMs are able to write reasonably decent code. This is just the starting point, though. As people used LLMs, we found out something interesting: the prompt and the context act as a form of inference-time learning. We can drastically change and tweak the model's output based on the context we pass, as an article released in July 2025 demonstrated. The sketch below illustrates the idea.
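A minimal sketch of what "inference-time learning" looks like in practice, again with a hypothetical `ask_llm` helper: the model's weights never change, yet a handful of examples placed in the context steer both the label set and the output format.

```python
# In-context learning sketch: no fine-tuning, only the prompt/context changes.
# `ask_llm` is a hypothetical stand-in for a real LLM client call.

def ask_llm(prompt: str) -> str:
    return "<model output>"  # replace with a real call to your model provider

text = "The build failed again, wonderful."

# Zero-shot: no extra context, the model falls back on its general behaviour.
zero_shot = ask_llm(f"Classify the sentiment of: '{text}'")

# Few-shot: examples in the context act like ad-hoc training data at inference time.
few_shot_prompt = (
    "Classify sentiment as POSITIVE, NEGATIVE or SARCASTIC.\n"
    "Text: 'Great job, the demo went perfectly.' -> POSITIVE\n"
    "Text: 'The API is down for the third time today.' -> NEGATIVE\n"
    "Text: 'Oh sure, another meeting, exactly what I needed.' -> SARCASTIC\n"
    f"Text: '{text}' ->"
)
few_shot = ask_llm(few_shot_prompt)
```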
Ā
Now we might think: "alright, my network can learn at inference time for free, so I can just give it more and more context and the results will get better for my use case." Unfortunately, reality is more complex. There are two major limitations:
Context window is limited
Even with a long context window (let's say 1M tokens), another paper found that the more information we put in the context, the worse the results get, as the LLM "struggles" to retain all the information.
Ā
Therefore, unfortunately, we cannot just max out the context window, pass in all our data, and hope that the results will be better. Also, given that we usually pay per token, reaching the ceiling of our context window is very impractical from an economic perspective: at an illustrative price of $3 per million input tokens, a single request that fills a 1M-token window costs about $3 before a single output token is generated.
Ā
We need a set of techniques that lets us balance two competing goals: maximize the relevant context we provide to the LLM, so that we improve performance on the current task, while minimizing the overall amount of context, so that we don't degrade the LLM's performance and costs stay under control.
Ā
These techniques are the basis for "Context Engineering". Context engineering is the art and science of retrieving the relevant context so that the LLM can correctly perform the task we need, while keeping the context size under control.
Ā
The rise of vibe coding
One of the best scientists of our time posts a tweet, the Internet explodes, and a new term is born.
People are fascinated: it looks like software engineers are cooked! Even my mom can now build a software system, right?
Ā
Interestingly enough, it looks like the attention span of people in 2025 doesn't even cover a single tweet. If we read the whole thing, and especially the second part, Karpathy explains that this is not really coding: it's something that feels like it and produces artifacts you'd otherwise generally get by coding, but it's definitely not coding.
Ā
Tools like Lovable, Bolt, v0 and so on emerged very quickly, and people started playing with them and using them. These tools target non-developers, though, and that's not what we really care about here. We care about tools built for devs by devs.
Ā
In a short amount of time, we now have:
Cursor
Codex
Copilot
Claude Code
Gemini CLI
Windsurf
The list is so long that I don't want to spend too much time on it. You get the gist: we now have a lot of tools that help developers write code.
Ā
These tools are not just models: they are what we call agents. An agent is a software system that wraps a model (Gemini 2.5, GPT-5, Sonnet 4.5, etc.). An agent has a few things (a minimal sketch follows this list):
Access to tools like fetch, grep, terminal usage, filesystem, MCP servers
A way to inject context (through memory with RAG, rules, commands, workflows and much more, depending on the tool)
A system prompt and a definition of the agent: how it should behave and so on. Note that all the content that goes into the agent definition becomes part of the agent's context.
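To tie these pieces together, here is a deliberately simplified sketch of an agent loop. All the names (`ask_llm`, `TOOLS`, the message format) are hypothetical and not the API of any specific product; real agents are far more sophisticated, but the shape is roughly this: a system prompt plus injected context, a model call, and a dispatch loop over tools whose results are fed back into the context.

```python
# Minimal agent-loop sketch. Every name here is hypothetical; this is not the
# internal design of Cursor, Claude Code, etc., just the general shape.
import subprocess

def ask_llm(messages: list[dict]) -> dict:
    # Stand-in for a real model call. A real model would sometimes return
    # {"type": "tool_call", "tool": "grep", "args": [...]} instead.
    return {"type": "answer", "content": "<model output>"}

# Tools the agent can use: shell commands, filesystem access, MCP servers, ...
TOOLS = {
    "grep": lambda pattern, path=".": subprocess.run(
        ["grep", "-rn", pattern, path], capture_output=True, text=True
    ).stdout,
    "read_file": lambda path: open(path, encoding="utf-8").read(),
}

SYSTEM_PROMPT = "You are a coding agent. Inspect the repo with your tools before answering."

def run_agent(user_request: str, injected_context: str) -> str:
    # The agent definition, the injected rules/memory and the user request all
    # end up in the same context window.
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT + "\n\n" + injected_context},
        {"role": "user", "content": user_request},
    ]
    while True:
        reply = ask_llm(messages)
        if reply["type"] == "answer":
            return reply["content"]
        # The model asked for a tool: run it and feed the output back as context.
        output = TOOLS[reply["tool"]](*reply["args"])
        messages.append({"role": "tool", "content": output})
```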
Ā
Software engineers quickly figured out that in complex production systems, vibe coding (just writing what you need and hoping for the best) simply doesn't work.
Vibe coding falls short on security, testing, and many other aspects. It also focuses on writing code that barely works, not on creating maintainable systems.
There has to be a better way…
Spec-driven AI-development comes into play
It's all about context, tools, and implementing AI in the right phases of the SDLC.
The idea behind spec-driven development is that we don't just ask the AI to do something. We start with a high-level goal, break it down into tasks, analyze the architecture, and create stories; then the AI works on a task that has been refined together with a human, small enough to fit into a PR and relatively easy to implement.
Ā
There are several reasons why this approach works much better.
The first is that it introduces a much stronger Human-in-the-Loop (HITL) component. Instead of the human simply defining the high-level goal, they actively participate after each phase, reviewing artifacts generated by the LLM such as architecture diagrams, user stories, and other deliverables. This ensures continuous alignment between what the LLM produces and what's actually needed.
This process helps minimize drift. The key idea is straightforward: the more guidance we provide to the LLM throughout the process, the better it follows our intent. In contrast, when the LLM is left to operate entirely on its own (for example, with a single zero-shot prompt), each additional round of iteration performed by the agent itself increases the likelihood that the model will go off track and produce results that diverge significantly from the desired implementation.
Ā
The second main reason I've found is that interacting with the agent naturally forces us to think more deeply about the problem. Through this process, we often uncover edge cases, potential improvements, and flaws in our initial approach. The AI, therefore, isn't just a passive executor of our instructions: it becomes a copilot, helping us reason more critically and explore the problem space from new angles. This dynamic has repeatedly helped me identify issues and insights I likely wouldn't have discovered on my own.
Ā
Ā
Currently, there is no single approach to spec-driven development. As this is a relatively new practice, multiple vendors are exploring different routes. Since the market is crowded with tools and approaches, I'll briefly discuss a few that I've found interesting.
Even if these tools and approaches have different workflows and offer different capabilities, there is a common ground:
The vast majority of them save intermediate artifacts (plans, requirements, design documents, architecture) as Markdown files.
There is usually a way to inject more context, such as rules, commands, persistent knowledge, and so on. This lets the model know our codebase better and plan accordingly, using our coding standards, best practices, and preferences.
Code indexing is performed, so that when needed the tools can grep the relevant code to follow a pattern, look into a specific file, and find references to the required functions.
HITL is required from the creation of the requirements document all the way to the commit. Each step generates text artifacts that need to be reviewed; then the code itself needs to be reviewed to ensure it aligns with expectations before opening a PR.
Use it when you need it and don't overcomplicate things: the cool thing about these approaches is that you're not locked in. You're still 100% responsible for the code you write, so if you want to write a feature without AI, or if using AI doesn't make sense (because we know we'll implement that feature faster ourselves), we can just proceed on our own without breaking the approach.
To avoid bloating the context window, it's usually good practice to open a new chat every time we start a new operation.
Kiro
Kiro is an entire IDE (a VSCode fork) which offers a full UI to implement a spec-driven flow. The key idea is that, instead of starting from a chat window where we ask the model to perform a coding task, we create a requirements spec, saved as markdown, then from the spec we generate a design document, and from the design document, we create a set of tasks.
Ā
During the generation of all these artifacts, the system and the human work together to refine everything up to the point where the tasks are well defined and implementing them is relatively straightforward for the model.
The human can decide which task is implemented, and in what order, thanks to a simple yet effective UI.
Spec-kit
Spec-kit works through a CLI and can be used with many different agents (unlike Kiro, where you end up using their IDE). The idea is relatively simple: you create a spec, then an implementation plan, then you break down the tasks and implement each one.
All the specs are saved in a folder and can be versioned.
BMAD
BMAD is more complex and has a learning curve. Unlike Kiro and spec-kit, BMAD offers multiple agents, each with multiple workflows. The general BMAD workflow is similar to the others, but as it's more powerful it's also more complex to learn and use correctly. BMAD is more of a full end-to-end approach than a development methodology: with BMAD we do everything from the initial PRD to the actual implementation.
BMAD also lets you build custom agents, which makes it easy to tailor to custom needs.
(Embedded video: "BMAD vs. Spec Kit vs. OpenSpec: Which AI Coding Methodology is Best?" — the same project built three times, once per approach; BMAD took roughly eight hours, GitHub's Spec Kit under two, and OpenSpec about seven minutes.)
- *BMAD:* [https://github.com/bmad-code-org/BMAD-METHOD](https://github.com/bmad-code-org/BMAD-METHOD)
- *GitHub Spec Kit:* [https://github.com/github/spec-kit](https://github.com/github/spec-kit)
- *OpenSpec:* [https://github.com/Fission-AI/OpenSpec/](https://github.com/Fission-AI/OpenSpec/)
Deciding what to use depends on your use case and needs. This approach is also similar to what engineers at FAANG companies are doing to leverage AI correctly.
Ā
Apart from global context optimizations (like rules and skills), some companies are also developing tools to perform local context optimization. For example, if our codebase uses a specific version of a library, the model might not be trained on that version and might therefore hallucinate its API surface or miss its best practices. That's why companies like Tessl are building tools that help our agent work better with our codebase, based on the libraries and frameworks we're currently using. It all comes down to improving context quality in multiple ways!
Current limitations and future
Spec-driven AI development is still evolving quite fast: the tools are not stable and there are breaking changes.
Ā
A first limitation is how these tools fit into the rest of the SDLC.
Ideally, at some point we will have something that better integrates all aspects of the SDLC: our IDE will be able to "talk" to our PM tool (Jira, ADO, etc.), the breakdown of stories will be AI-assisted in a unified workspace, and the discovery phase will happen in tools like Miro, which will then generate complete requirements documents after product discovery.
Then, after the AI writes code, we will be able to perform automatic code reviews on GitHub, test our code, and ship it to production, with AI connected to our observability stack to perform preemptive alerting and log analysis.
This will be available at some point in the near future as vendors are already starting to expose these features.
Ā
Currently, there are multiple tools scattered across the ecosystem, and we're starting to see protocols that let them communicate (such as MCP and A2A).
Ā
Other limitations are inherited from the models (hallucinations, accuracy, context window, reasoning capabilities, tool usage), but ML researchers and engineers are working to overcome them on multiple fronts.
Ā
Month after month, the range of tasks we can tackle with AI becomes wider and wider, with the hope that we will build better systems in less time, with less stress and higher quality.
Ā
Software Engineering is changing again, and maybe faster than ever. People are excited and scared of losing their job at the same time. I strongly believe that this evolution will be net positive even if we have a lot of challenges ahead of us.
My recipe to survive and grow in this environment is the same: be curious, push yourself to the edge of knowledge in this field, build something that makes sense and, most importantly, enjoy the journey.