Best Practices in the World of Vibe Coding
It’s been almost three years since I last wrote about AI, all the way back in the GPT-3 days. I had totally forgotten I’d written that post, so rereading it now, after seeing how far AI has come in that time, was a bit strange.
I don’t think I ever would have guessed I’d be using AI on a daily basis the way I am now. Back then, it was pretty capable but still a bit clunky and awkward; today it just… works. Most of the time, at least.
In fact, there’s a whole new fad going around called “vibe coding”. In case you haven’t heard of it (though you probably have — it’s pretty ubiquitous at this point), vibe coding is a style of programming where you describe what you want in natural language and let AI generate the code. It’s often done without a lot of oversight, just trusting what AI is outputting.
And the thing is, it kind of works. A few days ago I had Claude Code generate a simple tool for personal use, one I really didn’t want to spend the time creating myself. It was more of a proof of concept, just to see if it was possible.
And indeed, in about 20 minutes or so, with some input from me along the way, I had a workable tool. It was a bit creepy, to be honest: I hadn’t written a single line of code, and yet I had this tool that just worked.
Or take last weekend, when I wanted shipmonk/dead-code-detector to support marking step definition methods in Behat contexts as used. Again, I knew that if I had to write it myself, it’d never get done. But after about 30 minutes with Claude Code, I had a workable PR that passed all tests. And again, not one line written by me.
But that got me thinking: where does this leave us as developers? What about all the best practices developed over decades, like separation of concerns, test- and behavior-driven development, SOLID principles, and design patterns? What’s the point when AI can just write and update all the code?
I’ve been working on a side project lately where I’ve gotten to explore these questions at length and I’ve had some interesting breakthroughs I wanted to share.
Table of Contents
- Does It Work?
- Defining the Contract
- Defining the Architecture
- Strict Typing
- So What Can AI DO Then?
- Redefining Your Role
- My Workflow
- Takeaways
Does It Work?
So again, the idea behind vibe coding is that AI can generate code that just works, often much faster than we can.
But the question is, what does it mean to “work”? That’s not just a rhetorical question, either.
Who defines what it means for the end product to work? When I had Claude Code generate that personal tool, I knew it worked because I tested it. I opened it up and used the functionality and it did what I wanted (after a couple of tweaks, also communicated to Claude).
And that works great for a simple tool like this one, only 1,200 lines of code or so. It was a perfect test case for just cutting AI loose to do its thing. I told it what I wanted, it created it, and I was happy. End of story.
But the trouble comes with more complicated systems, beyond just 1,200 lines of code. Or when it’s for actual business use, not just a fun personal side project with no real consequences if something fails.
What would happen if you asked AI to make a modification, and it decided to do some other random thing, too? I use Gemini at work every day and I’ve had it happen. My coworkers have often described Gemini as a drunk intern who sometimes knows what he’s doing. And that’s not just Gemini of course — that goes for any LLM.
So if it hallucinated some other random code into your project, how would you know, especially if you weren’t reviewing every line of code (which is kind of the point of vibe coding)?
Maybe you’d find out weeks or months down the road, when a user couldn’t do something that was previously supported or got a totally unexpected outcome.
OK, so maybe you create an AI agent to do code reviews, so it’ll let you know what’s changed. But that’s not infallible, either. And how far does it go? Do you have another agent review the code review agent’s work?
At some point, we need to answer the question: what does it mean to “work”? And conversely, how do we know when things fail?
Defining the Contract
The answer, as it happens, is those same best practices that seemed at risk of being rendered obsolete.
In my opinion we begin with automated tests. Tests are one of the best ways of codifying what it means for the code to “work”. When I do X, Y happens. If Y doesn’t happen, it doesn’t work.
If you have 100 unit tests that pass, and AI makes a change that makes 10 of them fail, it obviously did something it shouldn’t have. We have a real, concrete way of saying, “No, this no longer works the way it’s supposed to.”
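To make “when I do X, Y happens” concrete, here’s a tiny, entirely hypothetical example of a test pinning down one such behavior (neither class is from a real project):

```php
<?php

declare(strict_types=1);

use PHPUnit\Framework\TestCase;

// A hypothetical class, used only to illustrate the idea of tests as a contract.
final class Account
{
    public function __construct(private int $balance)
    {
    }

    public function withdraw(int $amount): void
    {
        $this->balance -= $amount;
    }

    public function balance(): int
    {
        return $this->balance;
    }
}

final class AccountTest extends TestCase
{
    public function testWithdrawingReducesTheBalance(): void
    {
        $account = new Account(balance: 500);

        $account->withdraw(100);

        // If this assertion ever fails, the contract is broken: withdrawing
        // no longer behaves the way the system is supposed to behave.
        self::assertSame(400, $account->balance());
    }
}
```

If AI (or anyone else) changes withdraw() in a way that breaks that assertion, the test suite says so immediately.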
It becomes a contract not just with AI, but also with anyone else who might work on the code, including yourself or other developers. The contract says, “This is what it means to work. If these things aren’t true, the system is broken.”
Who gets to define that contract? As usual, it comes back to the stakeholders: developers, product owners, and business stakeholders sit down to hash out the success criteria in detail.
And hey, if it’s your own side project, then you might fill all three of those roles. If so, that’s fine, but that means it falls to you to define what success means for your project.
But if the stakeholders are external, then they need to define those requirements, and the requirements need to be iteratively translated into executable tests that can programmatically validate that the success criteria are met. A nice way to do this is through behavior-driven development, using natural-language scenarios written in Gherkin-like syntax.
Scenario: User transfers money to another account
Given I have a balance of $500
And my friend has a balance of $200
When I transfer $100 to my friend
Then my balance should be $400
And my friend's balance should be $300
You can translate this into real, executable tests using tools like Cucumber or Behat.
These are pretty high-level tests, though. For example, in the scenario above, transferring funds from one account to another requires testing multiple components at once — probably creating a transaction, a source and destination account instance, etc., and verifying they all work together.
That’s good, but you should also have lower-level unit tests to confirm each unit works in isolation. And of course, yes, that means defining what it means for that unit to “work”.
Pretty soon, we get this really nice top-down approach: you create high-level, scenario-driven tests that outline the system’s broad behavior, and then narrow down to specific unit tests for individual components.
And now you have a strict contract that defines success, within which AI must operate. You know if something has gone awry because one or more levels of tests will fail. The tests are your early warning system, signaling that requirements aren’t being met — and since you took the time to define those requirements in detail and turn them into tests, you know these are real failures in the system’s behavior, not just errors that can be ignored.
Because your contract isn’t just with AI; it’s with the stakeholders as well. In a very real way, when all tests are green, you know you’ve met the stakeholders’ definition of success, at least as well as they defined it. The criteria may well evolve over time, and when they do, you add more tests.
Defining the Architecture
But I have news for you (which hopefully shouldn’t be news at all): making your code as testable as possible requires maintaining a certain quality of code. If you mix business logic with presentation, or let your infrastructure bleed into the domain layer, it’s going to be really hard, if not impossible, to test.
And just like that, we have separation of concerns. We have layers. And since it’s probably going to be hard to test without it, we likely need dependency injection. We have decoupled components, because you don’t want to set up 20 mocks just to test something.
Some dispute this, but tests generally drive clean code, because highly coupled, entangled code just isn’t very testable. What is testable are small, isolated components with few dependencies, which happens to be exactly what clean architecture looks like.
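As a small, hypothetical sketch of why that matters: when a handler receives its repository as an interface through its constructor, a unit test can swap in an in-memory double instead of standing up a real database (none of these class names come from a real project):

```php
<?php

declare(strict_types=1);

// Hypothetical names, purely to illustrate the shape of the dependency.
final class Order
{
    public function __construct(public string $id)
    {
    }
}

interface OrderRepositoryInterface
{
    public function save(Order $order): void;
}

final class PlaceOrderHandler
{
    // The dependency is an interface injected through the constructor,
    // so tests can provide whatever implementation they like.
    public function __construct(private OrderRepositoryInterface $orders)
    {
    }

    public function handle(string $orderId): void
    {
        $this->orders->save(new Order($orderId));
    }
}

// In a unit test, this in-memory double stands in for the real,
// database-backed repository: no mocks, no infrastructure.
final class InMemoryOrderRepository implements OrderRepositoryInterface
{
    /** @var list<Order> */
    public array $saved = [];

    public function save(Order $order): void
    {
        $this->saved[] = $order;
    }
}
```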
But now that means you need to plan out your architecture. How many layers will you have? How will they communicate with each other?
We can even codify those architectural decisions using a tool like Deptrac, which helps us enforce architectural boundaries. So, if AI imports something from the infrastructure layer within the domain layer, we’ll know about it.
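For example, assuming a typical src/Domain, src/Application, src/Infrastructure layout, a deptrac.yaml along these lines (the exact keys may vary between Deptrac versions) declares which layers are allowed to depend on which:

```yaml
deptrac:
  paths:
    - ./src
  layers:
    - name: Domain
      collectors:
        - type: directory
          value: src/Domain/.*
    - name: Application
      collectors:
        - type: directory
          value: src/Application/.*
    - name: Infrastructure
      collectors:
        - type: directory
          value: src/Infrastructure/.*
  ruleset:
    Domain: ~            # the domain depends on nothing else
    Application:
      - Domain
    Infrastructure:
      - Domain
      - Application
```

Run in CI, Deptrac then fails the build whenever a dependency crosses a boundary it shouldn’t, which is exactly the early warning you want when AI is generating a lot of the code.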
Strict Typing
Our tests, along with a well-defined architecture, will go a long way toward making sure our system does what it’s supposed to.
But I find that enforcing static type analysis also helps a lot in detecting potential bugs early. If you have a method that can return User|null but you treat it as though it just returns User, you can get a pretty nasty surprise when it returns null instead. Unfortunately the unit tests might not capture every possible eventuality, so you might not catch it until it’s already in production.
On the other hand, static type analyzers like PHPStan or Psalm will catch something like that right away and throw an error.
But in my opinion it really only works on the strictest settings. PHPStan on lower levels can still let a lot of issues pass through. Personally, I try my best to run everything at the max level, which means even handling mixed properly.
I find that AI isn’t always great at writing strongly typed code in a dynamically typed language like PHP, so insidious bugs like this can crop up quite easily. If you enforce strict type checking at the highest level possible, AI is confined to producing code that properly respects types. Going back to the previous example, that means that after getting the return value from the method that returns User|null, it has to handle the null case, or else PHPStan or Psalm will complain. And that means fewer bugs.
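To make that concrete, here’s a small, hypothetical sketch (none of these classes come from the project) of the kind of null handling the analyzers force you to write:

```php
<?php

declare(strict_types=1);

// Hypothetical classes, purely to illustrate what strict analysis catches.
final class User
{
    public function __construct(public string $name)
    {
    }
}

interface UserFinderInterface
{
    // The nullable return type is the whole point: callers must handle null.
    public function findByEmail(string $email): ?User;
}

final class WelcomeMailer
{
    public function __construct(private UserFinderInterface $users)
    {
    }

    public function welcomeMessageFor(string $email): string
    {
        $user = $this->users->findByEmail($email);

        // Without this guard, PHPStan and Psalm report that $user might be
        // null before ->name is accessed. With it, the code type-checks.
        if ($user === null) {
            throw new \RuntimeException(sprintf('No user registered with %s', $email));
        }

        return sprintf('Welcome, %s!', $user->name);
    }
}
```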
So What Can AI DO Then?
Ugh, all of that doesn’t sound like very much fun, does it?
Well, to me it does. But if you’re comparing it to the dream of just telling AI what you want and having it spit the whole thing out in 30 minutes, this is very different.
We’re back to having to actually design the system, rather than just setting AI loose to do its thing.
But with strictness comes clarity: we know what the system is actually supposed to do. And because of that clarity, we can define a clear contract, and enforce that contract with automated tools like BDD scenarios, unit tests, architectural boundary validation, and static type analysis.
Do you have to use all of those tools? No, of course not. But the more of them you use, and the more precisely you define what “works” means and enforce the code quality needed to prevent hard-to-detect bugs, the better AI can work within those guardrails to produce code that does what you need it to do. When it breaks out of those guardrails (and it will), your tools will catch it and force AI to stick to the plan.
I want to reword that a bit, because I don’t like talking about AI as though we’re forcing it to do anything. It can be a hugely valuable coding partner. But just as you wouldn’t hand off a project idea to a dev team without defining your requirements very clearly, you have to tell AI what the conditions of success are so it can test its output against them.
And Claude Code is very good at doing exactly that. You can tell it something like, “Implement the minimal code to pass these tests. All tests must pass, and phpstan must not return any errors,” and it’ll keep going until that is true.
What I suggest is telling AI up front what the requirements are. If using Claude, you can create a CLAUDE.md file in your root directory that briefly discusses the project, architectural philosophy, conditions for success, etc., and every new Claude instance will read that file automatically.
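As an illustration, a minimal CLAUDE.md for a project like the one described later in this post might look something like this (the specifics here are made up):

```markdown
# Project: Example App

## What this is
A PHP application built with a layered (domain / application / infrastructure) architecture.

## Architectural rules
- Domain code must not depend on application or infrastructure code.
- New features start as Behat scenarios, then unit tests, then minimal implementation.
- Prefer small, immutable value objects and constructor injection.

## Definition of done
- All Behat and PHPUnit tests pass.
- PHPStan passes at max level with no errors.
- Deptrac reports no architectural boundary violations.
```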
Redefining Your Role
AI can still do a lot of the work. It can even create the unit and integration tests. However, I strongly recommend verifying every test it generates to ensure it’s actually testing what you want it to test. If it isn’t, ask it to generate a test that asserts X, and it’ll do that. If you’re going to closely review any of its output, this is the area to do it, since you already know in detail which behaviors of the system need to be tested.
I like to use the traditional TDD approach: write a test that fails, then write the minimal code to make it pass. Then you know it’s doing something useful.
You want to keep a bit of a short leash on it through this process. If you let it, it’ll happily go off and try to create whole swaths of the project. Ideally, you want to work on one component, one slice of the system at a time (I’ll discuss my personal workflow a bit more below). Generate a small handful of tests for one component, and generate the code that makes those tests pass. Specify “minimal code” because again, otherwise it’ll generate code that does things it thinks you’re going to want later. But we just want to create what’s needed now, to make these tests pass — nothing more.
Pretty soon, you get into a sort of rhythm. Ask it to create tests, review those tests, ask it to make the tests pass. You can still do very little coding yourself if that’s what you want. It’s all up to you.
In this way you become something of an architect of the system, guiding the process and ensuring its success rather than doing a lot of the low-level work yourself. Of course, you’re free to jump in anywhere and in any way you want, but your role becomes a bit more supervisory.
If you’re anything like me, this will be an odd transition. It feels disconcerting to have these features coming together right before your eyes without you having to actually do the work. It almost feels like cheating. But, when you have your system well-defined and your tests in place, you can be confident that the code being generated actually does what you want it to do.
If you want, you can even create several agents, which works really well in Claude. You can create a test writer agent, a code implementer agent, and a code reviewer agent.
- Test writer: Obvious; it writes tests. It is forbidden to touch actual non-test code.
- Code implementer: It writes the minimal code to make the tests pass.
- Code reviewer: It is tasked with making sure all code adheres to your architectural philosophy: that the code is well-designed, that nothing is included beyond what the tests require, that it passes static analysis, that it conforms to architectural boundaries, and so on.
Claude can help you create subagents, where you can describe each of their roles, and it creates them for you. I’ve had great success with this.
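For reference, at the time of writing a project-level subagent is just a Markdown file under .claude/agents/ with a bit of YAML frontmatter describing it. A test-writer agent might look roughly like this; treat the exact fields as an approximation and defer to whatever Claude generates for you:

```markdown
---
name: test-writer
description: Writes PHPUnit and Behat tests for the feature being worked on. Never edits non-test code.
---

You write tests only. You are forbidden from modifying any file outside the
tests/ directory. For each requested behavior, write the smallest failing test
that captures it, and explain what the test asserts and why.
```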
For one of my projects, I also have a pre-commit validator, which I guess can be lumped in with a code reviewer, but it’s just how I did it. I got tired of running all the commands to verify and style my code, so I created a subagent to handle it. This was for TypeScript, so it makes sure there are no TypeScript errors, that ESLint runs without errors, that all tests pass, that my coverage threshold is met, and finally runs Prettier on it. It’s just a final catch-all in case I missed anything myself.
My Workflow
I like doing a top-down approach in narrow vertical slices.
So I’ll choose one feature I want to develop. I write scenarios in Gherkin syntax and have Behat generate the snippets to make the steps executable.
Each new feature is a use case in my application layer, so my test should, at some point, dispatch a command to an application service handler. I ask Claude to write out the code for the steps, but they will of course fail for now.
So let’s use a simple registration example, which is a simplified version of something I actually created.
Feature: User Registration
In order to use the system
As a new user
I need to be able to register an account
Scenario: Successfully register a new user
When I register with:
| email | bob@example.com |
| password | SecurePass123! |
Then I should be registered
When I run vendor/bin/behat --append-snippets, it creates two steps, like this:
<?php
declare(strict_types=1);
namespace App\Tests\UseCase;
use Behat\Step\Then;
use Behat\Step\When;
use Behat\Behat\Tester\Exception\PendingException;
use Behat\Behat\Context\Context;
use Behat\Gherkin\Node\PyStringNode;
use Behat\Gherkin\Node\TableNode;
/**
* Defines application features from the specific context.
*/
final class UserContext implements Context
{
/**
* Initializes context.
*
* Every scenario gets its own context instance.
* You can also pass arbitrary arguments to the
* context constructor through behat.yml.
*/
public function __construct()
{
}
#[When('I register with:')]
public function iRegisterWith(TableNode $table): void
{
throw new PendingException();
}
#[Then('I should be registered')]
public function iShouldBeRegistered(): void
{
throw new PendingException();
}
}
So I ask Claude to generate the steps, and we get something like this (after making sure types are statically enforced):
<?php
declare(strict_types=1);
namespace App\Tests\UseCase;
use App\Application\User\Command\RegisterUserCommand;
use App\Domain\User\User;
use App\Domain\User\ValueObject\UserId;
use Behat\Behat\Context\Context;
use Behat\Gherkin\Node\TableNode;
use Behat\Step\Then;
use Behat\Step\When;
use Symfony\Component\Uid\Uuid;
use Webmozart\Assert\Assert;
/**
* @internal
*/
final class UserContext implements Context
{
private TestContainer $container;
private ?string $registeredUserId = null;
public function __construct()
{
$this->container = new TestContainer();
}
#[When('I register with:')]
public function iRegisterWith(TableNode $table): void
{
$data = $table->getRowsHash();
Assert::keyExists($data, 'email', 'Email is required');
Assert::string($data['email'], 'Email must be a string');
Assert::keyExists($data, 'password', 'Password is required');
Assert::string($data['password'], 'Password must be a string');
// Generate client-side UUID (v7 is time-ordered, better for databases)
$this->registeredUserId = Uuid::v7()->toString();
$command = new RegisterUserCommand(
userId: $this->registeredUserId,
email: $data['email'],
password: $data['password'],
);
$this->container->getCommandBus()->dispatch($command);
}
#[Then('I should be registered')]
public function iShouldBeRegistered(): void
{
Assert::string($this->registeredUserId, 'No user ID was registered');
// Verify that the user can be retrieved by ID
$repository = $this->container->getUserRepository();
$user = $repository->getById(UserId::fromString($this->registeredUserId));
Assert::notNull($user, 'User should be retrievable by the ID that was registered');
Assert::eq($user->id, $this->registeredUserId, 'Retrieved user ID should match the registered ID');
}
}
I’m not saying that happened all at once; it took some iteration, and of course the TestContainer was created as well, along with test versions of the command bus and user repository.
Obviously this test won’t pass, not least because the classes don’t exist. So I’d have Claude create bare-bones stub classes just so PHPStan doesn’t throw fatal errors. At this point I technically only need the RegisterUserCommand class, since the test uses it directly; we don’t even need the handler yet, since it works in the background, waiting for commands to be dispatched.
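That stub is essentially just a data holder, something roughly like:

```php
<?php

declare(strict_types=1);

namespace App\Application\User\Command;

use App\Application\MessageBus\CommandInterface;

// Bare-bones command: just the data the handler will eventually need.
final readonly class RegisterUserCommand implements CommandInterface
{
    public function __construct(
        public string $userId,
        public string $email,
        public string $password,
    ) {
    }
}
```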
Next, we drop down to the domain layer. I know that to register a user, I’ll need a user ID, an email address, and a password.
So we’ll need four value objects: UserId, Email, PlainPassword, and HashedPassword.
For each, I have Claude create unit tests. I verify that those tests test what I want them to. If not, I ask it to create another test that does X, whatever that might be.
Once I’m happy, we have failing tests. So I ask Claude to generate the minimal code to make the tests pass. It’ll do that until all tests are green.
Rinse and repeat for the next value object, and the next, and the next. Then the same for the User class and its User::register() method.
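As an illustration (not the exact code from my project, and the API details like toString() and normalization are my guesses), the first unit test for something like Email might be as simple as this:

```php
<?php

declare(strict_types=1);

namespace App\Tests\Unit\Domain\User\ValueObject;

use App\Domain\User\ValueObject\Email;
use PHPUnit\Framework\TestCase;

final class EmailTest extends TestCase
{
    public function testItNormalizesAndExposesTheAddress(): void
    {
        $email = Email::fromString('Bob@Example.com');

        self::assertSame('bob@example.com', $email->toString());
    }

    public function testItRejectsAnInvalidAddress(): void
    {
        $this->expectException(\InvalidArgumentException::class);

        Email::fromString('not-an-email');
    }
}
```

And the minimal implementation that satisfies it:

```php
<?php

declare(strict_types=1);

namespace App\Domain\User\ValueObject;

final readonly class Email
{
    private function __construct(private string $value)
    {
    }

    public static function fromString(string $value): self
    {
        $normalized = strtolower(trim($value));

        // Reject anything that isn't a syntactically valid email address.
        if (filter_var($normalized, FILTER_VALIDATE_EMAIL) === false) {
            throw new \InvalidArgumentException(sprintf('"%s" is not a valid email address', $value));
        }

        return new self($normalized);
    }

    public function toString(): string
    {
        return $this->value;
    }
}
```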
Then we plug all that into the application handler and get something like:
<?php
declare(strict_types=1);
namespace App\Application\User\Command;
use App\Application\MessageBus\CommandHandlerInterface;
use App\Application\MessageBus\CommandInterface;
use App\Domain\User\PasswordHasherInterface;
use App\Domain\User\User;
use App\Domain\User\UserRepositoryInterface;
use App\Domain\User\ValueObject\Email;
use App\Domain\User\ValueObject\HashedPassword;
use App\Domain\User\ValueObject\PlainPassword;
use App\Domain\User\ValueObject\UserId;
final readonly class RegisterUserHandler implements CommandHandlerInterface
{
public function __construct(
private PasswordHasherInterface $passwordHasher,
private UserRepositoryInterface $userRepository,
) {
}
#[\Override]
public function __invoke(CommandInterface $command): void
{
$plainPassword = PlainPassword::fromString($command->password);
$hashedPassword = HashedPassword::fromHash(
$this->passwordHasher->hash($plainPassword),
);
$user = User::register(
UserId::fromString($command->userId),
Email::fromString($command->email),
$hashedPassword,
);
$this->userRepository->save($user);
}
}
And if everything is wired properly, the Behat scenario should now pass.
I might create a few other scenarios, like one for rejecting duplicate email addresses, but at this point all the major value objects are in place for this use case.
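One such scenario (the wording here is hypothetical) might be:

```gherkin
Scenario: Registration fails when the email is already taken
  Given a user is already registered with the email "bob@example.com"
  When I register with:
    | email    | bob@example.com |
    | password | SecurePass123!  |
  Then I should not be registered
  And I should be told that the email address is already in use
```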
Then I’d move up to the top level and have Claude create E2E tests that run the same scenarios as the use-case tests, but exercising the entire system through actual HTTP requests.
At this point, I’d have to tell Claude to create a controller that would wire everything together, but since we already did the lower-level work, it should really just have to construct the command and dispatch it to the application layer, just like our test did.
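Here’s a sketch of what that controller might end up looking like; the route, the CommandBusInterface, and the response shape are my assumptions, not the project’s actual code:

```php
<?php

declare(strict_types=1);

namespace App\Infrastructure\Http\Controller;

use App\Application\MessageBus\CommandBusInterface; // assumed interface
use App\Application\User\Command\RegisterUserCommand;
use Symfony\Component\HttpFoundation\JsonResponse;
use Symfony\Component\HttpFoundation\Request;
use Symfony\Component\HttpFoundation\Response;
use Symfony\Component\Routing\Attribute\Route;
use Symfony\Component\Uid\Uuid;

final readonly class RegisterUserController
{
    public function __construct(private CommandBusInterface $commandBus)
    {
    }

    #[Route('/api/users', methods: ['POST'])]
    public function __invoke(Request $request): Response
    {
        /** @var array{email?: string, password?: string} $payload */
        $payload = $request->toArray();

        // The thin controller only builds the command and hands it off;
        // all the real work happens in the application and domain layers.
        $command = new RegisterUserCommand(
            userId: Uuid::v7()->toString(),
            email: $payload['email'] ?? '',
            password: $payload['password'] ?? '',
        );

        $this->commandBus->dispatch($command);

        return new JsonResponse(['id' => $command->userId], Response::HTTP_CREATED);
    }
}
```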
And now we have one vertical slice of real functionality that works. All tests pass, PHPStan doesn’t throw any errors, and all architectural boundaries are respected. And Claude generated almost all of the code; I just orchestrated and oversaw the whole process.
Again, it’s a far cry from what most people think of when you say “vibe coding”. Sure, I could have just told Claude I wanted a feature that lets users register accounts. But then I wouldn’t absolutely know, beyond a doubt, that all requirements were being met.
Takeaways
It just so happens that the process of giving AI guardrails to operate within is the same process needed to build a robust, well-designed system architecture. It’s all the steps you should be taking anyway, but that so many people skip.
I believe that with AI, these same steps matter more than ever. They were best practices before, but now they’re essential if you want to give AI direction.
So, in a funny way, our best practices don’t disappear at all; they actually become more important. Of course, just like always, some people will totally ignore them — and just like always, they’ll likely come to regret it later.
Like I said, I’m not saying you absolutely must follow all of these rules. If you’re working within an existing system, you might not be able to totally change how you’re already doing things.
But to me, the basic principles are:
- Get very clear on requirements
- Codify those requirements in automated unit and integration tests, one feature at a time
- Get very clear on the architecture you want to use, and make sure the AI is aware of that decision
- Use static type analysis at the highest level your system will allow, to catch potential bugs
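One lightweight way to tie all of those checks together (the tool choices and paths here are assumptions, not a prescription) is a single Composer script that both you and the AI can run after every change:

```json
{
    "scripts": {
        "check": [
            "vendor/bin/phpstan analyse --no-progress",
            "vendor/bin/deptrac analyse",
            "vendor/bin/phpunit",
            "vendor/bin/behat"
        ]
    }
}
```

A single composer check run then tells you, and the AI, whether the whole contract still holds.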
If you do this, then you can trust AI to do as much or as little of the work as you like, and you can be relatively certain that you’re headed in the right direction.
Do I think AI will replace developers? I said no three years ago, and I would still say no. I do wonder where we’ll be when I look back in yet another three years, but I doubt that AI will suddenly be able to totally replace (good) developers.
I do think that it will become a more and more central part of our process — even to the point of being essential if we want to get things done quickly.
But being a core part of the process will never negate the need for best practices. Best practices will always be best practices, whether it’s humans doing the developing, or AI.