
Tutorial: Build a RAG Pipeline

A Q&A system that retrieves relevant documents before answering:

  1. Embed a small document corpus with apiType: embedding
  2. Embed the user’s question with the same model
  3. Find the most relevant documents using cosine similarity
  4. Inject those documents as {{context}} into a chat prompt
  5. The LLM answers grounded in real data instead of hallucinating

By the end (~20 min) you’ll understand how to wire embeddings and chat completions together into a retrieval-augmented generation (RAG) pipeline.
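To see the shape of the pipeline before touching any APIs, here is a toy version with a deterministic bag-of-words "embedder" (purely illustrative — the real tutorial uses the OpenAI embedding model below, and `toy_embed`, `cosine`, and the sample vocabulary are inventions for this sketch):

```python
def toy_embed(text: str, vocab: list[str]) -> list[float]:
    """Count vocabulary words in the text -- a stand-in for a real embedding model."""
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

vocab = ["pipeline", "stages", "embedding", "thread", "tool"]
docs = [
    "the pipeline has four stages",
    "embedding turns text into vectors",
    "thread inputs enable multi-turn chat",
]
doc_vecs = [toy_embed(d, vocab) for d in docs]

# Steps 2-3 of the list above: embed the question, pick the closest document
query_vec = toy_embed("how does the pipeline work", vocab)
best = max(zip(doc_vecs, docs), key=lambda pair: cosine(query_vec, pair[0]))[1]
print(best)  # the pipeline has four stages
```

The real pipeline swaps `toy_embed` for an embedding model and feeds the retrieved text to a chat prompt, but the retrieval logic is exactly this.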


Step 1: Install Prompty

Install Prompty with the Jinja2 and OpenAI extras:

pip install "prompty[jinja2,openai]"

Create a .env file with your OpenAI key:

OPENAI_API_KEY=sk-your-key-here

Step 2: Create the embedding prompt

Create embed.prompty — this generates vector embeddings from text:

---
name: text-embedder
description: Generate embeddings for text input
model:
  id: text-embedding-3-small
  provider: openai
  apiType: embedding
  connection:
    kind: key
    apiKey: ${env:OPENAI_API_KEY}
inputs:
  - name: text
    kind: string
    default: Hello, world!
---
{{text}}

Step 3: Generate embeddings for your documents


Embed a small set of documents. In production you’d do this once and store the vectors in a database — here we keep everything in memory:

from prompty import invoke

documents = [
    "Prompty is a markdown file format for LLM prompts with YAML frontmatter.",
    "The pipeline has four stages: render, parse, execute, and process.",
    "Use apiType: embedding to generate vector embeddings from text.",
    "Thread inputs with kind: thread enable multi-turn conversations.",
    "Tool calling lets the model invoke your functions automatically.",
]

# Embed each document
doc_vectors = []
for doc in documents:
    vector = invoke("embed.prompty", inputs={"text": doc})
    doc_vectors.append(vector)

print(f"Embedded {len(doc_vectors)} documents")
print(f"Vector dimensions: {len(doc_vectors[0])}")  # 1536
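Re-embedding the corpus on every run costs API calls. A minimal sketch of persisting the vectors to disk with the standard library (`save_vectors`, `load_vectors`, and the `embeddings.json` filename are illustrative choices, not part of Prompty):

```python
import json
from pathlib import Path

def save_vectors(path: str, docs: list[str], vectors: list[list[float]]) -> None:
    """Persist documents and their vectors side by side as JSON."""
    Path(path).write_text(json.dumps({"documents": docs, "vectors": vectors}))

def load_vectors(path: str) -> tuple[list[str], list[list[float]]]:
    """Reload a previously saved corpus."""
    data = json.loads(Path(path).read_text())
    return data["documents"], data["vectors"]

# Round-trip with toy data
save_vectors("embeddings.json", ["doc one"], [[0.1, 0.2, 0.3]])
docs, vecs = load_vectors("embeddings.json")
print(docs, vecs)  # ['doc one'] [[0.1, 0.2, 0.3]]
```

For anything beyond a few thousand documents you'd move to a real vector store, but a flat file keeps the tutorial reproducible between runs.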

Step 4: Build the retriever

Write a retriever that finds the most relevant documents using cosine similarity. No vector database needed — this works in memory for small datasets:

import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(query_vector: list[float], top_k: int = 2) -> list[str]:
    """Find the top_k most similar documents."""
    scored = [
        (cosine_similarity(query_vector, dv), doc)
        for dv, doc in zip(doc_vectors, documents)
    ]
    scored.sort(key=lambda x: x[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]
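A quick sanity check on cosine similarity with hand-picked vectors (the function is repeated here so the snippet runs standalone): identical directions score 1.0 regardless of magnitude, orthogonal vectors score 0.0, and opposite directions score -1.0.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # 1.0  (same direction, magnitude ignored)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0  (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite)
```

Magnitude-invariance is why cosine similarity is the usual choice for comparing embeddings: a long document and a short query can still point the same way.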

Step 5: Create the Q&A prompt

Create rag-qa.prompty — a chat prompt that receives retrieved context:

---
name: rag-qa
description: Answer questions using retrieved context
model:
  id: gpt-4o-mini
  provider: openai
  apiType: chat
  connection:
    kind: key
    apiKey: ${env:OPENAI_API_KEY}
  options:
    temperature: 0.3
    maxOutputTokens: 1024
inputs:
  - name: question
    kind: string
    default: What is Prompty?
  - name: context
    kind: string
    default: ""
---
system:
You are a helpful assistant. Answer the user's question using ONLY the
provided context. If the context doesn't contain enough information,
say "I don't have enough information to answer that."

Context:
{{context}}

user:
{{question}}

The system prompt instructs the model to only use the provided context — this is what makes RAG effective at reducing hallucination.


Step 6: Wire the pipeline together

Now connect all the pieces: embed the query → retrieve → inject context → answer:

from prompty import invoke

def ask(question: str) -> str:
    # 1. Embed the question
    query_vector = invoke("embed.prompty", inputs={"text": question})

    # 2. Retrieve relevant documents
    relevant_docs = retrieve(query_vector, top_k=2)
    context = "\n\n".join(f"- {doc}" for doc in relevant_docs)

    # 3. Answer with grounded context
    answer = invoke("rag-qa.prompty", inputs={
        "question": question,
        "context": context,
    })
    return answer

# Try it out
print(ask("How does the Prompty pipeline work?"))
# → "The pipeline has four stages: render, parse, execute, and process..."

print(ask("What are thread inputs?"))
# → "Thread inputs with kind: thread enable multi-turn conversations..."

Here's the flow for a single question:

User question: "How does the pipeline work?"
  → embed.prompty   → [0.012, -0.045, 0.078, ...] (query vector)
  → retrieve()      → "The pipeline has four stages..."
                      "Use apiType: embedding to..."
  → rag-qa.prompty  → system: "Answer using ONLY the context..."
                      context: (retrieved docs)
                      user: "How does the pipeline work?"
  → LLM response    → "The pipeline has four stages: render, parse,
                       execute, and process."

What you learned:

✅ Creating an embedding prompt with apiType: embedding
✅ Generating vector embeddings from text with invoke()
✅ Building a cosine similarity retriever from scratch
✅ Injecting retrieved context into a chat prompt via {{context}}
✅ Wiring the full RAG pipeline: embed → retrieve → answer