If you’ve experimented with large language models (LLMs), you’ve likely encountered hallucinations and outdated answers. Retrieval-Augmented Generation (RAG) fixes that by grounding an LLM’s output in your own data. In this guide, you’ll learn how to build and deploy a production-ready RAG pipeline using Llama 3 for generation, Milvus for vector search, and Sealos for seamless cloud deployment. We’ll walk through architecture, deployment, code, and best practices—so you can go from zero to working system quickly and safely.
What You’ll Build
- A scalable RAG service that:
  - Ingests and indexes your documents
  - Retrieves relevant chunks via Milvus vector search
  - Generates final, grounded answers using Llama 3
- Deployed on Sealos, a Kubernetes-powered cloud operating system that simplifies app, database, storage, and domain management
By the end, you’ll have a working FastAPI service you can call with a question and get accurate, cited responses backed by your data.
Why RAG, Why Now?
- Accuracy and compliance: RAG reduces hallucinations by injecting facts from your domain-specific corpus.
- Cost and control: You can use open-source models (like Llama 3) and host your own data stack with Milvus.
- Speed to production: Platforms like Sealos make it easy to assemble cloud-native pieces—vector DB, GPU inference, API, and scaling—without wrestling with raw Kubernetes.
Core Concepts: How RAG Works
At a high level:
1. You index your knowledge base into a vector store (Milvus):
   - Split documents into chunks
   - Create vector embeddings for each chunk
   - Insert vectors and metadata into Milvus
2. At query time:
   - Embed the user’s question
   - Search Milvus for the top-k similar chunks
   - Build a prompt that includes the question and retrieved context
   - Ask an LLM (Llama 3) to answer based on this context
This flow grounds the LLM, boosting factual reliability and controllability.
Architecture Overview
We’ll use the following components:
- Llama 3 (Generation): Meta’s Llama 3 8B Instruct model, served via vLLM (OpenAI-compatible API)
- Embeddings: A lightweight open-source embedding model (e.g., BAAI/bge-small-en-v1.5)
- Milvus: A high-performance vector database for similarity search
- FastAPI: A simple API for your RAG service
- Sealos: To deploy and manage everything with minimal ops friction
Text diagram:
- Users → FastAPI /ask → Embedding model → Milvus → Retrieve top-k chunks → Prompt Builder → Llama 3 (vLLM) → Answer
On Sealos:
- Milvus runs as an app with persistent storage
- vLLM runs on GPU nodes
- FastAPI runs as a standard web app
- Object storage holds your documents (optional)
- Sealos provides DNS, TLS, secrets, and scaling
Learn more about Sealos at https://sealos.io.
Prerequisites
- A Sealos account and workspace
- Access to GPU nodes for Llama 3 inference (or use a CPU-friendly generation model for testing)
- A Hugging Face token with access to Llama 3 (accept the license on the model page)
- Basic Docker and Python familiarity
Local development requirements:
- Python 3.10+
- pip install packages (listed below)
- A set of documents to index (markdown, PDFs converted to text, HTML, etc.)
Step 1: Provision Your Infrastructure on Sealos
Sealos streamlines app deployment, storage, secrets, and networking. You can use the web console (App Launchpad) or CLI.
1.1 Milvus (Vector Database)
Option A: App Launchpad (recommended)
- In the Sealos console, open the App Store or Launchpad.
- Search for Milvus (standalone) and deploy with a persistent volume (e.g., 50–200 GB depending on your corpus).
- Note the service endpoint (e.g., milvus:19530 within the cluster, or assign an external address if needed).
Option B: Helm (if you prefer)
- Create a dedicated namespace, set storage class, and install Milvus standalone via the official Helm chart.
- Expose it internally via ClusterIP and deploy your RAG app in the same namespace for low-latency access.
Environment variables you’ll use in your app:
- MILVUS_URI=milvus:19530
- MILVUS_DB=default
1.2 Llama 3 with vLLM
We’ll serve Llama 3 8B Instruct via vLLM’s OpenAI-compatible server.
- Ensure you have GPU nodes in your Sealos cluster (NVIDIA drivers and runtime configured).
- Accept the Llama 3 license on Hugging Face, then set a secret HUGGING_FACE_HUB_TOKEN in Sealos.
Run the vLLM container (App Launchpad or YAML):
- Image: vllm/vllm-openai:latest
- Command example:
  - --model meta-llama/Meta-Llama-3-8B-Instruct
  - --dtype auto
  - --max-model-len 8192
  - --tensor-parallel-size 1 (or more if multi-GPU)
- Ports: 8000
- Env: HUGGING_FACE_HUB_TOKEN
- GPU: request appropriate GPU resources (e.g., 1x A10 or A100)
- Expose internally as vllm:8000, or assign a domain via the Sealos Gateway if you want external access
The server exposes OpenAI-compatible REST endpoints at /v1/chat/completions.
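Once the server is up, you can sanity-check it from inside the cluster with the OpenAI Python client (a minimal sketch; the base URL, API key, and model name assume the configuration described above):

```python
# Minimal smoke test against the vLLM OpenAI-compatible server.
from openai import OpenAI

client = OpenAI(base_url="http://vllm:8000/v1", api_key="dummy")

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```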
1.3 Object Storage (Optional)
If your documents are not yet in your repo, use Sealos Object Storage (S3-compatible) to upload your corpus. Your API or indexing job can then read the files from the bucket.
1.4 Secrets and Config
In Sealos, create environment variables and secrets for your apps:
- VLLM_BASE_URL=http://vllm:8000/v1
- VLLM_API_KEY=dummy (vLLM allows a dummy key by default; set one for consistency)
- MILVUS_URI=milvus:19530
- MILVUS_DB=default
Step 2: Data Preparation and Indexing
You’ll chunk documents, build embeddings, and insert vectors into Milvus. For simplicity, we’ll use the BAAI/bge-small-en-v1.5 embedding model (384-dimensional), which is fast and works well for many English corpora.
Install dependencies locally:
- pip install sentence-transformers pymilvus fastapi uvicorn openai numpy tqdm
If you prefer a single file for indexing and testing, use the example below.
2.1 Choose Chunking Strategy
General guidance:
- Chunk size: 300–800 tokens (roughly 1,200–3,200 characters)
- Overlap: 10–20% to preserve context across boundaries
- Keep metadata (source URL, title, section) for filtering and citations
2.2 Milvus Collection Schema
We’ll store:
- id: VarChar primary key
- embedding: FloatVector (dim=384)
- text: the chunk text
- source: where it came from
- doc_id: group chunks by document
- chunk_id: numeric index within the doc
We’ll use cosine similarity. In Milvus, set metric_type=IP and L2-normalize the embeddings, so that inner product is equivalent to cosine similarity.
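A minimal sketch of creating this collection with pymilvus (the max_length values and HNSW parameters are illustrative assumptions; adjust them for your corpus):

```python
# Create the rag_chunks collection described above (run once).
from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections, utility,
)

connections.connect(host="milvus", port="19530")  # from MILVUS_URI

fields = [
    FieldSchema(name="id", dtype=DataType.VARCHAR, is_primary=True, max_length=64),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=384),
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=65535),
    FieldSchema(name="source", dtype=DataType.VARCHAR, max_length=512),
    FieldSchema(name="doc_id", dtype=DataType.VARCHAR, max_length=256),
    FieldSchema(name="chunk_id", dtype=DataType.INT64),
]
schema = CollectionSchema(fields, description="RAG chunks")

if not utility.has_collection("rag_chunks"):
    collection = Collection("rag_chunks", schema)
    # IP metric over L2-normalized vectors gives cosine similarity
    collection.create_index(
        field_name="embedding",
        index_params={
            "index_type": "HNSW",
            "metric_type": "IP",
            "params": {"M": 16, "efConstruction": 200},
        },
    )
else:
    collection = Collection("rag_chunks")

collection.load()
```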
2.3 Indexing Script
Example: index.py
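A possible index.py along these lines (a sketch: load_documents() is a placeholder loader, as the notes below explain, and the collection is assumed to exist as created in section 2.2):

```python
# index.py - chunk documents, embed them, and insert into Milvus.
import hashlib
from pathlib import Path

from pymilvus import Collection, connections
from sentence_transformers import SentenceTransformer
from tqdm import tqdm

MILVUS_HOST, MILVUS_PORT = "milvus", "19530"   # from MILVUS_URI
COLLECTION_NAME = "rag_chunks"
EMBED_MODEL_NAME = "BAAI/bge-small-en-v1.5"    # 384-dimensional


def load_documents(root: str = "./docs"):
    """Placeholder loader: yields (doc_id, source, text) for each .md/.txt file."""
    for path in Path(root).rglob("*"):
        if path.suffix.lower() in {".md", ".txt"}:
            yield path.stem, str(path), path.read_text(encoding="utf-8", errors="ignore")


def chunk_text(text: str, chunk_size: int = 1500, overlap: int = 200):
    """Character-based chunking with overlap (see the guidance in section 2.1)."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back to preserve context across boundaries
    return chunks


def main():
    connections.connect(host=MILVUS_HOST, port=MILVUS_PORT)
    collection = Collection(COLLECTION_NAME)   # created as in section 2.2
    model = SentenceTransformer(EMBED_MODEL_NAME)

    for doc_id, source, text in tqdm(list(load_documents()), desc="Indexing"):
        chunks = chunk_text(text)
        if not chunks:
            continue
        # normalize_embeddings=True so IP search behaves like cosine similarity
        embeddings = model.encode(chunks, normalize_embeddings=True)
        ids = [hashlib.sha1(f"{doc_id}-{i}".encode()).hexdigest() for i in range(len(chunks))]
        # Column-based insert: order must match the schema fields from section 2.2
        collection.insert([
            ids,
            [e.tolist() for e in embeddings],
            chunks,
            [source] * len(chunks),
            [doc_id] * len(chunks),
            list(range(len(chunks))),
        ])

    collection.flush()
    collection.load()


if __name__ == "__main__":
    main()
```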
Notes:
- For real projects, replace load_documents() with your own loader (filesystem, Git repo, S3 bucket in Sealos).
- For large datasets, consider batch insertion and running this as a Sealos CronJob.
Step 3: Retrieval and Generation Service (FastAPI)
We’ll build a minimal FastAPI app to:
- Receive a question
- Embed it
- Search Milvus
- Compose a system/user prompt
- Call Llama 3 via vLLM
- Return a grounded answer with sources
3.1 FastAPI Service
Create app.py:
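A minimal app.py along these lines (a sketch: the environment variable names match the config above, and error handling is omitted for brevity):

```python
# app.py - FastAPI service: embed question, search Milvus, generate with Llama 3.
import os

from fastapi import FastAPI
from openai import OpenAI
from pydantic import BaseModel
from pymilvus import Collection, connections
from sentence_transformers import SentenceTransformer

MILVUS_URI = os.getenv("MILVUS_URI", "milvus:19530")
COLLECTION_NAME = os.getenv("COLLECTION_NAME", "rag_chunks")
EMBED_MODEL_NAME = os.getenv("EMBED_MODEL_NAME", "BAAI/bge-small-en-v1.5")
VLLM_BASE_URL = os.getenv("VLLM_BASE_URL", "http://vllm:8000/v1")
VLLM_API_KEY = os.getenv("VLLM_API_KEY", "dummy")
LLM_MODEL = os.getenv("LLM_MODEL", "meta-llama/Meta-Llama-3-8B-Instruct")

host, _, port = MILVUS_URI.partition(":")
connections.connect(host=host, port=port or "19530")
collection = Collection(COLLECTION_NAME)
collection.load()

embedder = SentenceTransformer(EMBED_MODEL_NAME)
llm = OpenAI(base_url=VLLM_BASE_URL, api_key=VLLM_API_KEY)

app = FastAPI()

SYSTEM_PROMPT = (
    "You are a helpful assistant. Answer ONLY from the provided context. "
    "If the context does not contain the answer, say you don't know. "
    "Cite sources as [1], [2], ... matching the numbered context blocks."
)


class AskRequest(BaseModel):
    question: str
    top_k: int = 5


@app.post("/ask")
def ask(req: AskRequest):
    # 1) Embed the question (normalized, to match the indexed vectors)
    query_vec = embedder.encode([req.question], normalize_embeddings=True)[0].tolist()

    # 2) Retrieve top-k chunks from Milvus (IP over normalized vectors = cosine)
    results = collection.search(
        data=[query_vec],
        anns_field="embedding",
        param={"metric_type": "IP", "params": {"ef": 64}},
        limit=req.top_k,
        output_fields=["text", "source", "doc_id", "chunk_id"],
    )

    contexts = [
        {
            "rank": i,
            "score": float(hit.distance),
            "text": hit.entity.get("text"),
            "source": hit.entity.get("source"),
        }
        for i, hit in enumerate(results[0], start=1)
    ]

    # 3) Build a prompt with numbered context blocks for citations
    context_block = "\n\n".join(f"[{c['rank']}] ({c['source']}) {c['text']}" for c in contexts)
    user_prompt = f"Context:\n{context_block}\n\nQuestion: {req.question}"

    # 4) Call Llama 3 via the vLLM OpenAI-compatible API
    completion = llm.chat.completions.create(
        model=LLM_MODEL,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0.2,
        max_tokens=512,
    )

    return {"answer": completion.choices[0].message.content, "contexts": contexts}
```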
Test locally:
- Start Milvus locally or port-forward to your Sealos Milvus
- Ensure vLLM is reachable at VLLM_BASE_URL
- uvicorn app:app --host 0.0.0.0 --port 8080
Example request:
- curl -X POST http://localhost:8080/ask -H "Content-Type: application/json" -d '{"question":"What does article-1 say about X?"}'
3.2 Containerize the Service
Dockerfile:
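A possible Dockerfile (a sketch; in real builds, pin package versions via a requirements.txt):

```dockerfile
FROM python:3.11-slim

WORKDIR /app

# Install runtime dependencies (pin exact versions for reproducible builds)
RUN pip install --no-cache-dir \
    fastapi uvicorn pymilvus sentence-transformers openai numpy

COPY app.py .

EXPOSE 8080
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]
```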
Build and push:
- docker build -t your-registry/rag-llama3:latest .
- docker push your-registry/rag-llama3:latest
You can use Sealos Image Hub or your own registry, then deploy via the Sealos console.
Step 4: Deploy on Sealos
You can deploy the FastAPI container via the Sealos Launchpad UI:
- Create new app
- Image: your-registry/rag-llama3:latest
- Ports: 8080
- Env:
  - MILVUS_URI=milvus:19530
  - MILVUS_DB=default
  - COLLECTION_NAME=rag_chunks
  - EMBED_MODEL_NAME=BAAI/bge-small-en-v1.5
  - VLLM_BASE_URL=http://vllm:8000/v1
  - VLLM_API_KEY=dummy
  - LLM_MODEL=meta-llama/Meta-Llama-3-8B-Instruct
- Resources: CPU/memory requests (e.g., 0.5 CPU, 1–2 GB RAM)
Optionally, expose the service with a public domain:
- Use Sealos Gateway to attach a domain and enable HTTPS
- Configure an Ingress if using YAML flows
Kubernetes YAML (if you prefer IaC):
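A sketch of a Deployment and Service (the rag-api name and image are illustrative; add the remaining env vars from above, ideally via a Secret or ConfigMap):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rag-api
spec:
  replicas: 1
  selector:
    matchLabels:
      app: rag-api
  template:
    metadata:
      labels:
        app: rag-api
    spec:
      containers:
        - name: rag-api
          image: your-registry/rag-llama3:latest
          ports:
            - containerPort: 8080
          env:
            - name: MILVUS_URI
              value: "milvus:19530"
            - name: VLLM_BASE_URL
              value: "http://vllm:8000/v1"
            - name: LLM_MODEL
              value: "meta-llama/Meta-Llama-3-8B-Instruct"
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
---
apiVersion: v1
kind: Service
metadata:
  name: rag-api
spec:
  selector:
    app: rag-api
  ports:
    - port: 80
      targetPort: 8080
```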
Then add an Ingress (or use Sealos domain management) to expose rag-api externally.
Step 5: Test the End-to-End Flow
- Verify the Milvus collection exists and is loaded: check with pymilvus or the Milvus dashboard
- Verify vLLM is serving Llama 3: curl http://vllm:8000/v1/models
- Verify the RAG API:
  - curl -X POST https://your-domain/ask -H "Content-Type: application/json" -d '{"question":"What is in article-1?"}'
  - Confirm the response includes an answer and contexts with source citations
If you see empty contexts:
- Check embeddings and ensure normalize_embeddings=True
- Ensure the collection index exists and collection.load() has been called
- Confirm MILVUS_URI is reachable from the API pod (same namespace recommended)
Step 6: Production Hardening and Optimizations
RAG quality depends on thoughtful retrieval, robust infrastructure, and safe prompting. Here’s a checklist.
Retrieval Quality
- Chunking:
- Use semantic chunking (e.g., by headings for Markdown)
- Tune chunk size and overlap based on your documents and LLM context window
- Embeddings:
- bge-small-en-v1.5 (384-d) is fast for general English
- Consider e5-large-v2 or bge-base/bge-large for higher accuracy (trade-off: speed/latency)
- Normalize embeddings for cosine similarity with IP metric in Milvus
- Indexing:
- HNSW: great for low-latency search, tune M and efConstruction
- IVF_FLAT/IVF_PQ: better for very large corpora, with quantization
- Maintain a separate scalar index for metadata filtering (e.g., source, date)
- Hybrid search:
- Combine lexical (BM25) and vector results; re-rank with cross-encoders (e.g., bge-reranker-large)
Prompting and Guardrails
- System prompts:
  - Force the model to use only the provided context; instruct it to say “I don’t know” for missing info
- Citations:
  - Include source indices and return them to the client
- Safety:
  - Filter prompts for PII or malicious instructions
  - Add content moderation where required
Caching and Cost Control
- Response caching:
  - Cache frequent Q&A pairs at the API layer (e.g., Redis); see the sketch after this list
- Embedding cache:
  - Cache embeddings for repeated queries
- Batch requests:
  - For indexing, batch embeddings to maximize throughput
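For example, a minimal response cache with Redis might look like this (a sketch; the Redis host, key scheme, and TTL are assumptions to adapt). In the /ask handler, check get_cached() before retrieval and call set_cached() after generating an answer:

```python
# Sketch of a simple answer cache keyed by a hash of the question.
import hashlib
import json

import redis

cache = redis.Redis(host="redis", port=6379, decode_responses=True)  # hypothetical host


def _key(question: str) -> str:
    return "rag:answer:" + hashlib.sha256(question.strip().lower().encode()).hexdigest()


def get_cached(question: str):
    hit = cache.get(_key(question))
    return json.loads(hit) if hit else None


def set_cached(question: str, payload: dict, ttl_seconds: int = 3600):
    cache.setex(_key(question), ttl_seconds, json.dumps(payload))
```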
Observability
- Metrics:
  - Track latency, hit rate (how often contexts actually contain the answer), token usage
- Logs:
  - Log query, retrieved sources, and anonymized outputs (respect privacy)
- Tracing:
  - Use OpenTelemetry to trace across API → Milvus → vLLM
- On Sealos:
  - Integrate with monitoring stacks you deploy alongside (Prometheus/Grafana) and set alerts
Data Lifecycle
- Updates:
  - Run nightly or on-demand indexing jobs (Sealos CronJob) to pick up new/changed docs
- Deletes:
  - Implement soft deletes or filters (e.g., an is_active flag) if hard deletes aren’t instant
- Multi-tenancy:
  - Separate collections per tenant or use a tenant_id scalar field for filtering
Security
- Secrets:
  - Store API keys and tokens in Sealos Secrets, not in images
- Network:
  - Restrict egress where possible; use NetworkPolicies within the cluster
- Access:
  - Use Sealos RBAC to limit who can deploy/modify apps
- TLS:
  - Terminate HTTPS at the Sealos Gateway or Ingress controller
Practical Applications
- Internal knowledge assistants: Answer employee questions using your wiki, Confluence, or docs
- Customer support copilots: Pull from product manuals and past tickets to reduce handling time
- Compliance and policy Q&A: Grounded answers citing the exact policy text
- Engineering search: Query codebases, design docs, and RFCs with precise snippets
- Research assistants: Surface relevant passages from large PDFs or scientific docs
Each use case benefits from careful source management and metadata filters (department, product, document type, publish date).
Troubleshooting Guide
- Llama 3 fails to load:
  - Ensure your GPU meets memory requirements (the 8B model typically needs ~16–24 GB of GPU RAM with paged attention)
  - Accept the model license on Hugging Face and set HUGGING_FACE_HUB_TOKEN
- Slow generation:
  - Reduce max_tokens and trim the amount of retrieved context you send
  - Use tensor parallelism if multiple GPUs are available
  - Consider quantized variants (e.g., AWQ) with vLLM
- Poor retrieval:
  - Increase chunk size or overlap
  - Use a stronger embedding model
  - Try HNSW with a higher ef (efSearch) at query time
- Empty results:
  - Confirm embeddings are normalized when using IP
  - Check that the collection is loaded (collection.load())
  - Verify that the field names match and output_fields include the right fields
Cost and Scaling Tips
- Scale Milvus vertically (CPU, RAM) and assign sufficient disk IOPS for large collections
- Scale your API horizontally; stateless FastAPI pods are easy to autoscale
- Assign GPUs to vLLM instances; scale out replicas for concurrency
- Use Sealos autoscaling policies to match demand patterns
- Cache popular answers and avoid regenerating identical responses
Extending the Pipeline
- Add a re-ranker:
  - Use a cross-encoder like bge-reranker-base to improve the ordering of retrieved chunks (see the sketch after this list)
- Structured outputs:
  - Ask Llama 3 for JSON-formatted answers; validate with a schema
- Tool use:
  - Add tools for code execution or database queries, but keep guardrails strict
- Multi-lingual:
  - Use multilingual embeddings (e.g., bge-m3) if your corpus spans languages
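A re-ranking sketch using a cross-encoder via sentence-transformers (it assumes the contexts structure returned by the /ask endpoint above; the model choice and keep count are illustrative):

```python
# Re-rank retrieved chunks with a cross-encoder before building the prompt.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")


def rerank(question: str, contexts: list[dict], keep: int = 3) -> list[dict]:
    """Score (question, chunk) pairs and keep the highest-scoring chunks."""
    pairs = [(question, c["text"]) for c in contexts]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(contexts, scores), key=lambda pair: pair[1], reverse=True)
    return [c for c, _ in ranked[:keep]]
```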
A Note on Model and License
Llama 3 is released under the Llama 3 Community License. Accept the license and ensure your usage complies with the terms. When deploying via vLLM, use a valid Hugging Face token to download weights at runtime.
Why Sealos for RAG?
Sealos (https://sealos.io) is a cloud operating system that makes Kubernetes accessible. For RAG, it gives you:
- One-click app deployments (Milvus, your API) via Launchpad
- Built-in object storage, secrets, domain/SSL management
- GPU scheduling for LLM inference
- Multi-tenant workspaces and clear cost boundaries
- GitOps and YAML support for teams that prefer IaC
This means faster iterations, fewer moving parts, and production-grade reliability without a DevOps marathon.
Summary and Next Steps
You built a fully functional RAG pipeline:
- Indexed your documents into Milvus with high-quality embeddings
- Deployed Llama 3 inference via vLLM on GPU
- Exposed a FastAPI service that performs retrieval and grounded generation
- Deployed everything on Sealos for an integrated, scalable setup
Key takeaways:
- RAG is the most practical way to inject domain knowledge into LLMs while reducing hallucinations
- Milvus provides fast and scalable vector search; choose the right index and embedding model
- Llama 3 offers strong open-source generation; pair with vLLM for performant serving
- Sealos simplifies deployment, scaling, and operations so you can focus on product features
Next steps:
- Add re-ranking for improved retrieval quality
- Implement response caching and analytics
- Expand your document loaders (PDF parsing, HTML cleaning, S3 ingestion)
- Harden security and observability for enterprise environments
With this foundation, you can confidently ship AI assistants that are accurate, auditable, and fast—powered by your data and deployed on a platform designed for cloud-native AI workloads.