Feb 07, 2026
Here is a use case for you. You are building a service, say something as simple as a chatbot, that calls the Gemini API to generate responses. Everything works smoothly at first, but when you start load testing, you notice request throttling. After digging deeper, you realize that your Gemini tier supports only 40 concurrent requests. Now imagine this is a well-known application with 200k+ users, and during peak hours you can easily hit 100 or more concurrent requests. How do you manage this load without breaking the user experience? Do you ask users to come back later? Do you create 10 different Gemini API keys, which honestly doesn't sound that crazy?
But as a backend engineer, what is the right solution here, one that scales cleanly, stays within limits, and doesn't rely on shortcuts or hacks?
A semaphore works really nicely here. Think of it as a bucket that can hold only 40 tokens. Every time a user makes a request, one token is put into the bucket, and when the request finishes, that token is taken out. If the bucket is already full with 40 tokens, the next request doesn't fail or get rejected. It simply waits until there is room. As soon as one request completes and a token is removed, there is space in the bucket again and the waiting request can move forward. During busy times this might add a small delay, maybe a few hundred milliseconds, but the system stays stable. No errors for users, no crashes, just controlled flow using tokens and a bucket.
Let's try to implement the same idea for the above use case. Since we are going to do this in Go, I'm assuming your fundamentals of channels and concurrency are clear, but even if they aren't, don't worry, I'll explain everything along the way.
We'll build a semaphore-style architecture in Go to control concurrency using the token and bucket model we discussed earlier. In Go, channels fit this use case perfectly, because a channel can act like our bucket, and the values inside it act like tokens. By limiting the channel's capacity, we automatically limit how many concurrent requests can run at the same time, making channels a natural and clean way to implement this kind of controlled load handling.
var geminiSemaphore = make(chan struct{}, 40)
This creates a channel, or bucket, with a buffer size (capacity) of 40 tokens. The tokens could be of any type, such as int or string, but we use the empty struct type (struct{}) because it carries no data and takes up no memory. At most 40 tokens can exist in the bucket at any time, which means at most 40 Gemini API calls can run concurrently.
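To see the bucket fill up, here is a small sketch with a capacity of 3 instead of 40. The helper names newSemaphore and tryAcquire are made up for this illustration, and tryAcquire is deliberately non-blocking (unlike the blocking send we use in the rest of this post) only so the demo can print "bucket full" instead of hanging.

```go
package main

import "fmt"

// newSemaphore returns a buffered channel used as a counting semaphore.
// Hypothetical helper for this demo; limit is the bucket size.
func newSemaphore(limit int) chan struct{} {
	return make(chan struct{}, limit)
}

// tryAcquire tries to put a token in the bucket without blocking.
// It reports whether a slot was free. Hypothetical helper for this demo.
func tryAcquire(sem chan struct{}) bool {
	select {
	case sem <- struct{}{}:
		return true
	default:
		return false // bucket is full
	}
}

func main() {
	sem := newSemaphore(3)
	for i := 1; i <= 4; i++ {
		ok := tryAcquire(sem)
		// len(sem) is the number of tokens in the bucket, cap(sem) is its size.
		fmt.Printf("request %d acquired=%v in use=%d/%d\n", i, ok, len(sem), cap(sem))
	}
	<-sem // one request finishes and takes its token out
	fmt.Printf("after release: in use=%d/%d\n", len(sem), cap(sem))
}
```

The fourth attempt reports acquired=false because the bucket already holds 3 tokens. With a plain blocking send, that fourth request would wait instead.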
Next, we wrap the Gemini call with the semaphore:
// needs the "context" and "time" packages
func callGemini(ctx context.Context, prompt string) (string, error) {
	// put a token in the bucket; this blocks while 40 calls are already running
	geminiSemaphore <- struct{}{}
	// defer runs when the function returns
	// here it takes the token back out of the bucket
	// to make room for the next request
	// but why defer?
	// because it runs in every case: success, error, even a panic
	// so the token is always released and the slot is never leaked
	defer func() {
		<-geminiSemaphore // take a token out of the bucket
	}()
	// Gemini API call simulation
	time.Sleep(200 * time.Millisecond) // simulate API latency
	return "Gemini response for: " + prompt, nil
}
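}
Note that this code does not start any goroutines itself. Each request that calls callGemini, typically from its own goroutine, has to get a token first, so at most 40 of them can be inside the simulated Gemini call at the same time. But what happens to the 41st request? Let's break it down.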
When the API handler receives the 41st request, it still calls the callGemini function, but at this point the bucket is already full. Since there is no space to put a new token into the bucket, the line geminiSemaphore <- struct{}{} blocks. The function simply waits there until a slot frees up. As soon as one of the earlier goroutines finishes execution and removes its token from the bucket, space becomes available.
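The waiting request then places its token into the bucket and starts executing normally. Nothing fails, nothing crashes, it just waits its turn. One caveat: callGemini as written accepts a context but never checks it, so a waiting request keeps waiting even if the client has already disconnected. If you want to reroute or cancel a request, wrap the send in a select that also listens on ctx.Done(), and you can handle timeouts, cancellations, or client disconnects cleanly.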
This architecture continuously processes requests and makes room for new ones as older ones finish, in a controlled way. Requests beyond the limit wait briefly instead of failing, so users see a small delay rather than an error. This was only one use case, but the same token-and-bucket approach can be applied almost anywhere you need to control concurrency and protect downstream systems while still keeping your application responsive and reliable.
This blog focused on semaphore design and how a real-world scaling issue can be resolved with a straightforward scheme. It's a tiny concept that has a big impact, and once you get it, you'll find plenty of places in your Go applications where it fits. Hope you liked it!
If you found this explanation helpful, please share it with your friends and colleagues! If you have any questions or suggestions for future topics, feel free to ping me on Twitter / X or LinkedIn.