Build a Chatbot with Sam Altman’s Blog

1. Overview

In this tutorial, you’ll build a terminal-based chatbot that answers questions about Sam Altman’s blog posts. The project follows a Retrieval-Augmented Generation (RAG) pattern by:

  1. Setting up CapybaraDB and OpenAI.
  2. Scraping Sam Altman’s blog content.
  3. Storing each blog article as a separate document in CapybaraDB, along with its publication date.
  4. Retrieving relevant articles based on user queries.
  5. Generating context-aware answers using OpenAI’s language model.

This approach ensures that the chatbot’s responses are grounded in actual blog content, reducing inaccuracies and enhancing reliability.


2. Set Up CapybaraDB and OpenAI

Before diving into scraping and interacting with the data, it’s essential to set up CapybaraDB for storing and retrieving your data and OpenAI for generating responses.

Install Dependencies

Ensure you have all the necessary Python libraries installed:

pip install requests beautifulsoup4 openai capybaradb python-dotenv

Environment Variables

Create a file named .env in your project directory to securely store your API credentials:

CAPYBARA_API_KEY=your_capybara_api_key
CAPYBARA_PROJECT_ID=your_capybara_project_id
OPENAI_API_KEY=your_openai_api_key

Security Tip: Add .env to your .gitignore to prevent accidentally committing sensitive information.
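
Before running the scripts below, you can confirm that the credentials load correctly. The snippet below is a small optional helper (the filename check_env.py is just a suggestion) that prints whether each variable is set, without revealing the values:

# check_env.py (optional helper): verify that credentials load from .env

import os
from dotenv import load_dotenv

# Load variables from the .env file in the current directory
load_dotenv()

for key in ("CAPYBARA_API_KEY", "CAPYBARA_PROJECT_ID", "OPENAI_API_KEY"):
    # Report "set" or "MISSING" without printing the secret itself
    print(f"{key}: {'set' if os.getenv(key) else 'MISSING'}")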

Basic Usage of CapybaraDB & OpenAI

We’ll use two main libraries:

  1. CapybaraDB:

    • Store your scraped articles.
    • Automatically create vector embeddings when EmbText is used.
  2. OpenAI:

    • Generate final responses to user queries.
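
The snippet below is a minimal sketch of how these two pieces fit together. The database and collection names are placeholders, and the calls mirror the ones used in the scripts later in this tutorial:

# quick_demo.py (hypothetical filename): the core CapybaraDB + OpenAI pattern in miniature

import openai
from dotenv import load_dotenv
from capybaradb import CapybaraDB, EmbText

load_dotenv()  # picks up CAPYBARA_* and OPENAI_API_KEY

# Wrapping text in EmbText tells CapybaraDB to chunk and embed it automatically
collection = CapybaraDB().db("demo_db").collection("demo_articles")
collection.insert([{"title": "Demo", "content": EmbText("Some text to search later.")}])

# Semantic query returns the chunks that best match the question
# (in the real scripts, insertion and querying happen in separate runs)
results = collection.query(query="What does the demo text say?", top_k=1)
print(results.get("matches", []))

# OpenAI generates the final, human-readable answer
completion = openai.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Summarize: Some text to search later."}],
)
print(completion.choices[0].message.content)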

With the dependencies installed and environment variables configured, you’re now ready to move on to scraping Sam Altman’s blog and storing the data into CapybaraDB.

3. Scrape Sam Altman’s Blog

Next, we’ll create a Python script responsible for scraping Sam Altman’s blog and saving the articles to CapybaraDB. This script should be run only once (or whenever new blog posts need to be synced) to populate the database.

Create scrape_and_save.py

Create a new Python script named scrape_and_save.py. This script handles the scraping of blog posts and saving them into CapybaraDB.

# scrape_and_save.py

import requests
from bs4 import BeautifulSoup
from dotenv import load_dotenv
from capybaradb import CapybaraDB, EmbText
from datetime import datetime, timezone
import time

# Load environment variables from .env
load_dotenv()

# Initialize CapybaraDB
capybara = CapybaraDB()
db = capybara.db("sam_altman_blog_db")
collection = db.collection("blog_articles")  # Each doc = 1 article

def scrape_sam_altman_blog_articles():
    base_url = "https://blog.samaltman.com/"
    articles = []

    for page_num in range(1, 13):  # Scraping up to 12 pages
        if page_num == 1:
            url = base_url  # First page doesn't have a page number in the URL
        else:
            url = f"{base_url}?page={page_num}"

        print(f"Fetching page {page_num}: {url}")
        response = requests.get(url)

        if not response.ok:
            print(f"Could not fetch page {page_num}. Status code: {response.status_code}. Stopping scrape.")
            break

        soup = BeautifulSoup(response.text, "html.parser")
        page_articles = soup.find_all("article", class_="post")

        # If we find no articles, likely we've reached the end
        if not page_articles:
            print(f"No articles found on page {page_num}. Stopping scrape.")
            break

        for entry in page_articles:
            # Extract Title and URL
            title_tag = entry.find("div", class_="post-title").find("h2").find("a")
            if title_tag:
                title = title_tag.get_text(strip=True)
                post_url = title_tag['href']
            else:
                print(f"Article on page {page_num} missing title. Skipping.")
                continue

            # Extract Date
            footer = entry.find("footer", class_="homepage-post-footer")
            if footer:
                date_span = footer.find("span", class_="posthaven-formatted-date")
                if date_span and date_span.has_attr("data-unix-time"):
                    unix_time = int(date_span["data-unix-time"])
                    # Convert the Unix timestamp to a readable UTC date
                    date = datetime.fromtimestamp(unix_time, tz=timezone.utc).strftime('%Y-%m-%d')
                else:
                    print(f"Article '{title}' on page {page_num} missing date. Skipping.")
                    continue
            else:
                print(f"Article '{title}' on page {page_num} missing footer. Skipping.")
                continue

            # Extract Content
            body_div = entry.find("div", class_="post-body").find("div", class_="posthaven-post-body")
            if body_div:
                # Extract the text, using newlines as separators and stripping extra whitespace
                content = body_div.get_text(separator="\n", strip=True)
            else:
                print(f"Article '{title}' on page {page_num} missing content. Skipping.")
                continue

            # Structure the document
            doc = {
                "title": title,
                "date": date,
                "url": post_url,
                "content": EmbText(content),
            }
            articles.append(doc)
            print(f"Scraped article: '{title}'")

        # Optional: Pause between page requests to be polite to the server
        time.sleep(1)  # Sleep for 1 second

    return articles

def insert_articles_into_capybara(articles):
    """Insert each article as a separate document into CapybaraDB."""
    try:
        if articles:
            response = collection.insert(articles)
            print(f"Inserted {len(articles)} articles into CapybaraDB.")
        else:
            print("No articles to insert.")
    except Exception as e:
        print("Error inserting articles:", e)

def main():
    # Scrape all articles
    scraped_articles = scrape_sam_altman_blog_articles()
    # Insert them into CapybaraDB
    insert_articles_into_capybara(scraped_articles)
    print("Scraping and insertion completed.")

if __name__ == "__main__":
    main()

Explanation of scrape_and_save.py

  • Scraping Logic:

    • Iterates through pages 1 to 12 of Sam Altman’s blog.
    • Extracts titles, URLs, dates, and content from each blog post.
    • Skips articles missing essential elements.
    • Waits 1 second between page requests to avoid overloading the server.
  • Insertion Logic:

    • Wraps content with EmbText for semantic querying.
    • Each article is saved as a separate document in CapybaraDB.

Run the Scraping and Saving Script

  1. Ensure Dependencies are Installed:

    pip install requests beautifulsoup4 openai capybaradb python-dotenv
  2. Set Up Environment Variables:

    CAPYBARA_API_KEY=your_capybara_api_key
    CAPYBARA_PROJECT_ID=your_capybara_project_id
    OPENAI_API_KEY=your_openai_api_key
  3. Run the Script:

    python scrape_and_save.py

Note: Run this script only once to sync blog articles with CapybaraDB. Rerun it only when new articles are available.

4. Query and Generate Answers

Now, we’ll create the chatbot that interacts with the user, queries CapybaraDB for relevant content, and uses OpenAI to generate answers based on that content.

Create chat_bot.py

Create a new Python script named chat_bot.py. This script will handle user interactions, query the database, and generate responses using OpenAI’s API.

# chat_bot.py

import openai
from dotenv import load_dotenv
from capybaradb import CapybaraDB

# Load environment variables from .env
load_dotenv()

# Initialize CapybaraDB
capybara = CapybaraDB()
db = capybara.db("sam_altman_blog_db")
collection = db.collection("blog_articles")

def query_capybara_db(user_query, top_k=3):
    """Returns the top_k chunks that best match the user's query."""
    try:
        results = collection.query(
            query=user_query,
            top_k=top_k
        )
        return results
    except Exception as e:
        print("Error querying the collection:", e)
        return {}

def generate_answer(user_query, context_chunks):
    """
    Creates a prompt using relevant context from CapybaraDB
    and calls OpenAI to generate an answer.
    """
    relevant_context = "\n\n".join(
        f"{i + 1}. {match.get('chunk', '')}" for i, match in enumerate(context_chunks)
    )

    prompt = f"""
You are a helpful assistant. Use the following context to answer the user question.

Context:
{relevant_context}

Question:
{user_query}

Answer:
"""

    try:
        response = openai.chat.completions.create(
            model="gpt-4",  # or "gpt-3.5-turbo"
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        print("Error generating answer:", e)
        return None

def chat_bot():
    print("\nWelcome to the Sam Altman Blog ChatBot!")
    print("Type 'exit' or 'quit' to stop.\n")

    while True:
        user_query = input("Ask a question about Sam Altman's blog >> ")
        if user_query.lower() in ["exit", "quit"]:
            print("Goodbye!")
            break

        results = query_capybara_db(user_query, top_k=3)
        chunks = results.get("matches", [])

        if not chunks:
            print("No relevant information found. Please try again.\n")
            continue

        answer = generate_answer(user_query, chunks)
        if answer:
            print(f"\nAnswer:\n{answer}\n")
        else:
            print("\nSorry, I couldn't generate an answer.\n")

def main():
    chat_bot()

if __name__ == "__main__":
    main()

Explanation of chat_bot.py

  • Initialization:

    • CapybaraDB Connection: Connects to the existing sam_altman_blog_db database and accesses the blog_articles collection.
    • Environment Variables: Loads API keys from the .env file.
  • Querying Logic (query_capybara_db):

    • Semantic Search: Uses CapybaraDB’s semantic search to find the top k relevant chunks based on the user’s query.
    • Error Handling: Logs any issues during the querying process.
  • Answer Generation (generate_answer):

    • Context Assembly: Combines the retrieved text chunks into a single context block.
    • Prompt Construction: Formats the prompt to instruct OpenAI to use the provided context to answer the question.
    • OpenAI API Call: Sends the prompt to OpenAI’s ChatCompletion endpoint using the specified model (gpt-4 recommended for better performance).
    • Temperature Setting: Controls the randomness of the response; 0.7 offers a balance between creativity and coherence. (A stricter, system-message variant of this call is sketched after this list.)
  • Chat Loop (chat_bot):

    • User Interaction: Prompts the user for questions in the terminal.
    • Processing: Retrieves relevant context from CapybaraDB and generates answers using OpenAI.
    • Termination: Users can exit the chat by typing 'exit' or 'quit'.
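
If you want answers that stay even closer to the retrieved excerpts, you can move the instructions into a system message and lower the temperature. The helper below is a sketch (the name generate_answer_strict is hypothetical); it uses the same OpenAI chat completions call as generate_answer:

# A stricter variant of generate_answer: system-message instructions + lower temperature
def generate_answer_strict(user_query, context_chunks):
    context = "\n\n".join(match.get("chunk", "") for match in context_chunks)
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer only from the provided blog excerpts. "
                    "If the excerpts do not cover the question, say you don't know."
                ),
            },
            {"role": "user", "content": f"Excerpts:\n{context}\n\nQuestion: {user_query}"},
        ],
        temperature=0.2,  # lower temperature keeps the answer closer to the source text
    )
    return response.choices[0].message.content.strip()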

Run the ChatBot

After successfully scraping and saving the blog articles, you can now interact with the chatbot.

  1. Ensure Dependencies are Installed (if not already done):

    pip install requests beautifulsoup4 openai capybaradb python-dotenv
  2. Set Up Environment Variables:

    Ensure your .env file contains the necessary API credentials as shown earlier.

  3. Run the ChatBot Script:

    Execute the chatbot script:

    python chat_bot.py

Example Interaction:

Welcome to the Sam Altman Blog ChatBot!
Type 'exit' or 'quit' to stop.

Ask a question about Sam Altman's blog >> What are Sam Altman's views on AI policy?

Answer:
Sam Altman emphasizes the importance of developing AI responsibly and ensuring that its benefits are widely distributed. He advocates for proactive measures in AI safety, ethical considerations, and global cooperation to mitigate potential risks associated with advanced artificial intelligence technologies.

Note: The quality and accuracy of the answers depend on the content of the scraped blog posts and the effectiveness of the semantic search.

5. Next Steps & Beyond

Congratulations! 🎉 You’ve built a Retrieval-Augmented Generation (RAG) system around Sam Altman’s blog using CapybaraDB for semantic retrieval and OpenAI for generating context-aware answers.

Where to Go from Here

  1. Fine-Tune Your Pipeline

    • Advanced Chunking: Adjust chunk sizes and overlaps in CapybaraDB to optimize retrieval accuracy.
    • Metadata Utilization: Incorporate additional metadata (e.g., tags, categories) to enable more granular and topic-specific queries.
  2. Scale the Dataset

    • Additional Data Sources: Incorporate other blogs, transcripts, or documents to broaden the chatbot’s knowledge base.
    • Automation: Set up periodic scraping and data insertion scripts to keep the database updated with new blog posts automatically.
  3. Model Options

    • Different LLMs: Experiment with other large language models from providers like Anthropic or Cohere to compare performance, cost, and output quality.
    • Model Fine-Tuning: Fine-tune models on your specific dataset for improved accuracy and relevance.
  4. Deploy to Production

    • API Integration: Build a simple API endpoint (using frameworks like Flask or FastAPI) to integrate the chatbot into web or mobile applications; a minimal FastAPI sketch follows this list.
    • User Interface Enhancements: Transition from a terminal-based interface to a more user-friendly web or desktop application for broader accessibility.
  5. Handle Data Privacy

    • Respect Scraping Policies: Ensure compliance with website terms of service and respect for data privacy.
    • Secure Data Handling: Implement secure storage and handling practices, especially if dealing with sensitive information.
  6. Performance Optimization

    • Caching Strategies: Implement caching for frequent queries to reduce latency and API costs.
    • Parallel Processing: Optimize scraping and data processing using asynchronous programming or multiprocessing to enhance efficiency.
  7. Enhance User Feedback

    • Feedback Loops: Allow users to provide feedback on the chatbot’s responses to iteratively improve accuracy and relevance.
    • Analytics and Monitoring: Track usage patterns and performance metrics to identify areas for improvement.
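
As a starting point for the API integration idea above, here is a minimal FastAPI sketch. It assumes scrape_and_save.py has already populated CapybaraDB and that chat_bot.py exposes query_capybara_db and generate_answer as defined earlier; the filename api_server.py and the /ask route are just suggestions:

# api_server.py (hypothetical filename): expose the chatbot over HTTP with FastAPI
from fastapi import FastAPI
from pydantic import BaseModel

# Reuses the functions defined in chat_bot.py earlier in this tutorial
from chat_bot import query_capybara_db, generate_answer

app = FastAPI()

class Question(BaseModel):
    query: str

@app.post("/ask")
def ask(question: Question):
    # Retrieve the most relevant chunks, then generate a grounded answer
    results = query_capybara_db(question.query, top_k=3)
    chunks = results.get("matches", [])
    if not chunks:
        return {"answer": None, "detail": "No relevant blog content found."}
    return {"answer": generate_answer(question.query, chunks)}

Run it with, for example, uvicorn api_server:app --reload, then send a POST request to /ask with a JSON body such as {"query": "What does Sam Altman say about startups?"}.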

Final Thoughts

By combining the semantic retrieval capabilities of CapybaraDB with the generative power of OpenAI, you’ve created a robust and intelligent chatbot that can effectively answer questions grounded in real content. This foundation opens up numerous possibilities for building more advanced and specialized AI-driven applications.

Keep experimenting, keep building, and most importantly—have fun with it! 🚀

If you encounter any challenges or have further questions, revisit the relevant sections of this tutorial.


Enjoy your new Sam Altman Blog ChatBot!