
How I Built a $0 AI-Powered SaaS: The 2026 Scale to Zero Blueprint

By JupiterGoals Team

While building JupiterGoals, I wanted to leverage the power of AI while keeping costs at an absolute minimum ($0 if possible) by relying on a smart mix of cloud and local resources.

This post breaks down the architecture and the specific engineering decisions I made to achieve that.

I’ve decided to use Google Cloud Platform (GCP) because it currently offers the most generous startup credits. While the goal is to keep costs at a minimum without relying on credits, they provide a safety net to scale up onto larger GPU instances for verification when needed.

High-Level Architecture

Here is an overview of the system:

Architecture of JupiterGoals

1. The Serverless Java Foundation

First, the core service needs to be serverless. Since I am building this product alongside my normal work, I’d rather spend time developing features than patching servers. Serverless handles scaling and updates automatically.

The challenge with Java in a serverless context is the “Cold Start.” Java by default starts relatively slowly (3-5 seconds boot time), and frameworks with heavy reflection can push this even higher. Additionally, the JVM relies on the JIT (Just-In-Time) compiler to optimize performance over time—a benefit you lose in ephemeral serverless functions that die after a few invocations.

The Solution: GraalVM Native Image

To address this, I used a GraalVM Native Image. It performs Ahead-of-Time (AOT) compilation, converting the Java application into a standalone native executable that starts in milliseconds rather than seconds.
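As a sketch, with Spring Boot's standard native tooling (assuming the GraalVM toolchain is installed; these are the stock Spring Boot Maven commands, not anything project-specific):

```shell
# Compile the application into a standalone native executable
./mvnw -Pnative native:compile

# Or build a containerized native image via Cloud Native Buildpacks
./mvnw spring-boot:build-image -Pnative
```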

2. Event-Driven & Resilient (Spring Modulith)

On the application side, I employed an Event-Driven Architecture combined with Domain-Driven Design (DDD) principles. I utilized Spring Modulith to structure the application modules.

This architecture is crucial for spot instances. If a process gets terminated abruptly, the transaction is rolled back, and the event remains in the queue to be retried when the system comes back online.

3. The Hybrid GPU Strategy

Inference is the most expensive part of an AI SaaS. To mitigate this, I designed a hybrid approach using a Redis Queue. This decouples the inference work from the core service and allows me to utilize heterogeneous compute resources:

  1. Cloud Spot Instances: I use Google Cloud L4 instances. The standard cost is around $600/month, while Spot instances run roughly 60% cheaper (around $250/month).
    • Risk: Spot instances can be Preempted (terminated) at any time.
    • Mitigation: The event-driven design ensures no data loss during preemption.
  2. Local “Bring Your Own” Compute: I have a local machine with an RTX 4000 series GPU (roughly comparable to a cloud L40S, which would cost around $1,500/month).
    • Implementation: I used a Cloudflare Tunnel to securely connect my local machine to the Redis instance.
    • Concurrency: Redis executes commands on a single thread, so queue operations are atomic; two workers can never claim the same job, and there are no lock-contention issues to manage.
    • Reliability: I utilized the Reliable Queue Pattern with LMOVE (or BLMOVE). This ensures that if a worker (local or cloud) picks up a job and crashes, the job is not lost—it is moved to a processing queue and can be recovered.
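The semantics of the Reliable Queue Pattern can be modeled in plain Java. This is an in-memory sketch only: the two deques stand in for two Redis lists, and in production the LMOVE/LREM commands shown in the comments do the same moves atomically on the Redis server.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// In-memory model of the Redis reliable-queue pattern. The deques stand in
// for Redis lists; comments show the real Redis commands a worker would run.
class ReliableQueue {
    final Deque<String> pending = new ArrayDeque<>();
    final Deque<String> processing = new ArrayDeque<>();

    // Producer enqueues a job:           LPUSH pending <job>
    void submit(String job) { pending.addFirst(job); }

    // Worker atomically claims a job:    LMOVE pending processing RIGHT LEFT
    // (BLMOVE is the blocking variant, so idle workers simply wait.)
    String claim() {
        String job = pending.pollLast();
        if (job != null) processing.addFirst(job);
        return job;
    }

    // Worker acknowledges on success:    LREM processing 1 <job>
    void ack(String job) { processing.remove(job); }

    // A reaper requeues jobs held by crashed workers:
    //                                    LMOVE processing pending LEFT RIGHT
    void requeue(String job) {
        if (processing.remove(job)) pending.addLast(job);
    }
}
```

The key property: a job is never only in a worker's memory. It always sits in exactly one of the two lists, so a crash between `claim` and `ack` leaves it recoverable in `processing`.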

Standard L4 instances: monthly cost

Spot L4 instances: monthly cost
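The Cloudflare Tunnel from the local-compute setup above can be sketched with a minimal cloudflared `config.yml`. The tunnel name, hostname, and file paths below are placeholders, not the real configuration:

```yaml
# cloudflared config.yml (names and paths are placeholders)
tunnel: local-gpu-worker
credentials-file: /etc/cloudflared/local-gpu-worker.json
ingress:
  # Expose the Redis TCP port through the tunnel
  - hostname: redis.example.com
    service: tcp://localhost:6379
  # Reject anything else
  - service: http_status:404
```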

4. Database Connection Management

In a serverless environment, connection exhaustion is a silent killer: each instance opens its own connection pool, and under load the platform can spin up many instances at once, each holding connections against the database's fixed limit. The fix is to keep every instance's pool very small and let the platform's scaling, not the pool, provide the concurrency.
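One standard mitigation is capping the HikariCP pool per instance. A sketch in Spring Boot's `application.properties` (the values are illustrative, not JupiterGoals' actual settings):

```properties
# Keep each serverless instance's pool tiny; many instances share one database.
spring.datasource.hikari.maximum-pool-size=2
spring.datasource.hikari.minimum-idle=0
# Release idle connections quickly so scaled-down instances free capacity.
spring.datasource.hikari.idle-timeout=30000
spring.datasource.hikari.max-lifetime=600000
```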

5. High-Performance Inference: vLLM vs. Ollama

For local development, Ollama is fantastic. However, for a production SaaS, it bottlenecks quickly.

Time to first token (source: Red Hat Developers; see the full benchmark comparison).

I found Ollama's throughput too slow for concurrent user requests, so I switched to vLLM, whose continuous batching and PagedAttention are designed for serving many requests at once.
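For illustration, a typical vLLM launch exposing an OpenAI-compatible endpoint; the model name and limits here are placeholders, not the ones JupiterGoals runs:

```shell
# Serve an OpenAI-compatible API with vLLM.
# --gpu-memory-utilization: fraction of VRAM vLLM may reserve for weights + KV cache
# --max-model-len: cap the context length so the KV cache fits an L4's 24 GB
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192
```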


Challenges I Faced

Key Learnings

Achieve your goals without the burnout

Join the waitlist for JupiterGoals AI today.

Join the Waitlist