Build a Chatbot with Sam Altman’s Blog

1. Overview

In this tutorial, you’ll build a terminal-based chatbot that answers questions about Sam Altman’s blog posts. The project follows a Retrieval-Augmented Generation (RAG) pattern by:

  1. Setting up CapybaraDB and OpenAI.
  2. Scraping Sam Altman’s blog content.
  3. Storing each blog article as a separate document in CapybaraDB, along with its publication date.
  4. Retrieving relevant articles based on user queries.
  5. Generating context-aware answers using OpenAI’s language model.

This approach ensures that the chatbot’s responses are grounded in actual blog content, reducing inaccuracies and enhancing reliability.


2. Set Up CapybaraDB and OpenAI

Before diving into scraping and interacting with the data, it’s essential to set up CapybaraDB for storing and retrieving your data and OpenAI for generating responses.

Install Dependencies

Ensure you have all the necessary Python libraries installed:

pip install requests beautifulsoup4 openai capybaradb python-dotenv

Environment Variables

Create a file named .env in your project directory to securely store your API credentials:

CAPYBARA_API_KEY=your_capybara_api_key
CAPYBARA_PROJECT_ID=your_capybara_project_id
OPENAI_API_KEY=your_openai_api_key

Security Tip: Add .env to your .gitignore to prevent accidentally committing sensitive information.
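
Before running the scripts below, you can confirm that the credentials load correctly. The snippet below is a small optional helper (the filename check_env.py is just a suggestion) that prints whether each variable is set, without revealing the values:

# check_env.py (optional helper): verify that credentials load from .env

import os
from dotenv import load_dotenv

# Load variables from the .env file in the current directory
load_dotenv()

for key in ("CAPYBARA_API_KEY", "CAPYBARA_PROJECT_ID", "OPENAI_API_KEY"):
    # Report "set" or "MISSING" without printing the secret itself
    print(f"{key}: {'set' if os.getenv(key) else 'MISSING'}")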

Basic Usage of CapybaraDB & OpenAI

We’ll use two main libraries:

  1. CapybaraDB:

    • Store your scraped articles.
    • Automatically create vector embeddings when EmbText is used.
  2. OpenAI:

    • Generate final responses to user queries.
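
The snippet below is a minimal sketch of how these two pieces fit together. The database and collection names are placeholders, and the calls mirror the ones used in the scripts later in this tutorial:

# quick_demo.py (hypothetical filename): the core CapybaraDB + OpenAI pattern in miniature

import openai
from dotenv import load_dotenv
from capybaradb import CapybaraDB, EmbText

load_dotenv()  # picks up CAPYBARA_* and OPENAI_API_KEY

# Wrapping text in EmbText tells CapybaraDB to chunk and embed it automatically
collection = CapybaraDB().db("demo_db").collection("demo_articles")
collection.insert([{"title": "Demo", "content": EmbText("Some text to search later.")}])

# Semantic query returns the chunks that best match the question
# (in the real scripts, insertion and querying happen in separate runs)
results = collection.query(query="What does the demo text say?", top_k=1)
print(results.get("matches", []))

# OpenAI generates the final, human-readable answer
completion = openai.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Summarize: Some text to search later."}],
)
print(completion.choices[0].message.content)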

With the dependencies installed and environment variables configured, you’re now ready to move on to scraping Sam Altman’s blog and storing the data into CapybaraDB.

3. Scrape Sam Altman’s Blog

Next, we’ll create a Python script responsible for scraping Sam Altman’s blog and saving the articles to CapybaraDB. This script should be run only once (or whenever new blog posts need to be synced) to populate the database.

Create scrape_and_save.py

Create a new Python script named scrape_and_save.py. This script handles the scraping of blog posts and saving them into CapybaraDB.

# scrape_and_save.py

import requests
from bs4 import BeautifulSoup
from dotenv import load_dotenv
from capybaradb import CapybaraDB, EmbText
from datetime import datetime, timezone
import time

# Load environment variables from .env
load_dotenv()

# Initialize CapybaraDB
capybara = CapybaraDB()
db = capybara.db("sam_altman_blog_db")
collection = db.collection("blog_articles")  # Each doc = 1 article

def scrape_sam_altman_blog_articles():
    base_url = "https://blog.samaltman.com/"
    articles = []

    for page_num in range(1, 13):  # Scraping up to 12 pages
        if page_num == 1:
            url = base_url  # First page doesn't have a page number in the URL
        else:
            url = f"{base_url}?page={page_num}"

        print(f"Fetching page {page_num}: {url}")
        response = requests.get(url)

        if not response.ok:
            print(f"Could not fetch page {page_num}. Status code: {response.status_code}. Stopping scrape.")
            break

        soup = BeautifulSoup(response.text, "html.parser")
        page_articles = soup.find_all("article", class_="post")

        # If we find no articles, likely we've reached the end
        if not page_articles:
            print(f"No articles found on page {page_num}. Stopping scrape.")
            break

        for entry in page_articles:
            # Extract Title and URL
            title_tag = entry.find("div", class_="post-title").find("h2").find("a")
            if title_tag:
                title = title_tag.get_text(strip=True)
                post_url = title_tag['href']
            else:
                print(f"Article on page {page_num} missing title. Skipping.")
                continue

            # Extract Date
            footer = entry.find("footer", class_="homepage-post-footer")
            if footer:
                date_span = footer.find("span", class_="posthaven-formatted-date")
                if date_span and date_span.has_attr("data-unix-time"):
                    unix_time = int(date_span["data-unix-time"])
                    # Convert the Unix timestamp to a readable UTC date
                    date = datetime.fromtimestamp(unix_time, tz=timezone.utc).strftime('%Y-%m-%d')
                else:
                    print(f"Article '{title}' on page {page_num} missing date. Skipping.")
                    continue
            else:
                print(f"Article '{title}' on page {page_num} missing footer. Skipping.")
                continue

            # Extract Content
            body_div = entry.find("div", class_="post-body").find("div", class_="posthaven-post-body")
            if body_div:
                # Extract the text, using newlines as separators and stripping extra whitespace
                content = body_div.get_text(separator="\n", strip=True)
            else:
                print(f"Article '{title}' on page {page_num} missing content. Skipping.")
                continue

            # Structure the document
            doc = {
                "title": title,
                "date": date,
                "url": post_url,
                "content": EmbText(content),
            }
            articles.append(doc)
            print(f"Scraped article: '{title}'")

        # Optional: Pause between page requests to be polite to the server
        time.sleep(1)  # Sleep for 1 second

    return articles

def insert_articles_into_capybara(articles):
    """Insert each article as a separate document into CapybaraDB."""
    try:
        if articles:
            response = collection.insert(articles)
            print(f"Inserted {len(articles)} articles into CapybaraDB.")
        else:
            print("No articles to insert.")
    except Exception as e:
        print("Error inserting articles:", e)

def main():
    # Scrape all articles
    scraped_articles = scrape_sam_altman_blog_articles()
    # Insert them into CapybaraDB
    insert_articles_into_capybara(scraped_articles)
    print("Scraping and insertion completed.")

if __name__ == "__main__":
    main()

Explanation of scrape_and_save.py

  • Scraping Logic:

    • Iterates through pages 1 to 12 of Sam Altman’s blog.
    • Extracts titles, URLs, dates, and content from each blog post.
    • Skips articles missing essential elements.
    • Waits 1 second between page requests to avoid overloading the server.
  • Insertion Logic:

    • Wraps content with EmbText for semantic querying.
    • Each article is saved as a separate document in CapybaraDB.

Run the Scraping and Saving Script

  1. Ensure Dependencies are Installed:

    pip install requests beautifulsoup4 openai capybaradb python-dotenv
  2. Set Up Environment Variables:

    CAPYBARA_API_KEY=your_capybara_api_key
    CAPYBARA_PROJECT_ID=your_capybara_project_id
    OPENAI_API_KEY=your_openai_api_key
  3. Run the Script:

    python scrape_and_save.py

Note: Run this script only once to sync blog articles with CapybaraDB. Rerun it only when new articles are available.

4. Query and Generate Answers

Now, we’ll create the chatbot that interacts with the user, queries CapybaraDB for relevant content, and uses OpenAI to generate answers based on that content.

Create chat_bot.py

Create a new Python script named chat_bot.py. This script will handle user interactions, query the database, and generate responses using OpenAI’s API.

# chat_bot.py

import openai
from dotenv import load_dotenv
from capybaradb import CapybaraDB

# Load environment variables from .env
load_dotenv()

# Initialize CapybaraDB
capybara = CapybaraDB()
db = capybara.db("sam_altman_blog_db")
collection = db.collection("blog_articles")

def query_capybara_db(user_query, top_k=3):
    """Returns the top_k chunks that best match the user's query."""
    try:
        results = collection.query(
            query=user_query,
            top_k=top_k
        )
        return results
    except Exception as e:
        print("Error querying the collection:", e)
        return {}

def generate_answer(user_query, context_chunks):
    """
    Creates a prompt using relevant context from CapybaraDB
    and calls OpenAI to generate an answer.
    """
    relevant_context = "\n\n".join(
        f"{i + 1}. {match.get('chunk', '')}" for i, match in enumerate(context_chunks)
    )

    prompt = f"""
You are a helpful assistant. Use the following context to answer the user question.

Context:
{relevant_context}

Question:
{user_query}

Answer:
"""

    try:
        response = openai.chat.completions.create(
            model="gpt-4",  # or "gpt-3.5-turbo"
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        print("Error generating answer:", e)
        return None

def chat_bot():
    print("\nWelcome to the Sam Altman Blog ChatBot!")
    print("Type 'exit' or 'quit' to stop.\n")

    while True:
        user_query = input("Ask a question about Sam Altman's blog >> ")
        if user_query.lower() in ["exit", "quit"]:
            print("Goodbye!")
            break

        results = query_capybara_db(user_query, top_k=3)
        chunks = results.get("matches", [])

        if not chunks:
            print("No relevant information found. Please try again.\n")
            continue

        answer = generate_answer(user_query, chunks)
        if answer:
            print(f"\nAnswer:\n{answer}\n")
        else:
            print("\nSorry, I couldn't generate an answer.\n")

def main():
    chat_bot()

if __name__ == "__main__":
    main()

Explanation of chat_bot.py

  • Initialization:

    • CapybaraDB Connection: Connects to the existing sam_altman_blog_db database and accesses the blog_articles collection.
    • Environment Variables: Loads API keys from the .env file.
  • Querying Logic (query_capybara_db):

    • Semantic Search: Uses CapybaraDB’s semantic search to find the top k relevant chunks based on the user’s query.
    • Error Handling: Logs any issues during the querying process.
  • Answer Generation (generate_answer):

    • Context Assembly: Combines the retrieved text chunks into a single context block.
    • Prompt Construction: Formats the prompt to instruct OpenAI to use the provided context to answer the question.
    • OpenAI API Call: Sends the prompt to OpenAI’s ChatCompletion endpoint using the specified model (gpt-4 recommended for better performance).
    • Temperature Setting: Controls the randomness of the response; 0.7 offers a balance between creativity and coherence. (A stricter, system-message variant of this call is sketched after this list.)
  • Chat Loop (chat_bot):

    • User Interaction: Prompts the user for questions in the terminal.
    • Processing: Retrieves relevant context from CapybaraDB and generates answers using OpenAI.
    • Termination: Users can exit the chat by typing 'exit' or 'quit'.
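
If you want answers that stay even closer to the retrieved excerpts, you can move the instructions into a system message and lower the temperature. The helper below is a sketch (the name generate_answer_strict is hypothetical); it uses the same OpenAI chat completions call as generate_answer:

# A stricter variant of generate_answer: system-message instructions + lower temperature
def generate_answer_strict(user_query, context_chunks):
    context = "\n\n".join(match.get("chunk", "") for match in context_chunks)
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer only from the provided blog excerpts. "
                    "If the excerpts do not cover the question, say you don't know."
                ),
            },
            {"role": "user", "content": f"Excerpts:\n{context}\n\nQuestion: {user_query}"},
        ],
        temperature=0.2,  # lower temperature keeps the answer closer to the source text
    )
    return response.choices[0].message.content.strip()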

Run the ChatBot

After successfully scraping and saving the blog articles, you can now interact with the chatbot.

  1. Ensure Dependencies are Installed (if not already done):

    pip install requests beautifulsoup4 openai capybaradb python-dotenv
  2. Set Up Environment Variables:

    Ensure your .env file contains the necessary API credentials as shown earlier.

  3. Run the ChatBot Script:

    Execute the chatbot script:

    python chat_bot.py

Example Interaction:

Welcome to the Sam Altman Blog ChatBot!
Type 'exit' or 'quit' to stop.

Ask a question about Sam Altman's blog >> What are Sam Altman's views on AI policy?

Answer:
Sam Altman emphasizes the importance of developing AI responsibly and ensuring that its benefits are widely distributed. He advocates for proactive measures in AI safety, ethical considerations, and global cooperation to mitigate potential risks associated with advanced artificial intelligence technologies.

Note: The quality and accuracy of the answers depend on the content of the scraped blog posts and the effectiveness of the semantic search.

5. Next Steps & Beyond

Congratulations! 🎉 You’ve built a Retrieval-Augmented Generation (RAG) system around Sam Altman’s blog using CapybaraDB for semantic retrieval and OpenAI for generating context-aware answers.

Where to Go from Here

  1. Fine-Tune Your Pipeline

    • Advanced Chunking: Adjust chunk sizes and overlaps in CapybaraDB to optimize retrieval accuracy.
    • Metadata Utilization: Incorporate additional metadata (e.g., tags, categories) to enable more granular and topic-specific queries.
  2. Scale the Dataset

    • Additional Data Sources: Incorporate other blogs, transcripts, or documents to broaden the chatbot’s knowledge base.
    • Automation: Set up periodic scraping and data insertion scripts to keep the database updated with new blog posts automatically.
  3. Model Options

    • Different LLMs: Experiment with other large language models from providers like Anthropic or Cohere to compare performance, cost, and output quality.
    • Model Fine-Tuning: Fine-tune models on your specific dataset for improved accuracy and relevance.
  4. Deploy to Production

    • API Integration: Build a simple API endpoint (using frameworks like Flask or FastAPI) to integrate the chatbot into web or mobile applications; a minimal FastAPI sketch follows this list.
    • User Interface Enhancements: Transition from a terminal-based interface to a more user-friendly web or desktop application for broader accessibility.
  5. Handle Data Privacy

    • Respect Scraping Policies: Ensure compliance with website terms of service and respect for data privacy.
    • Secure Data Handling: Implement secure storage and handling practices, especially if dealing with sensitive information.
  6. Performance Optimization

    • Caching Strategies: Implement caching for frequent queries to reduce latency and API costs.
    • Parallel Processing: Optimize scraping and data processing using asynchronous programming or multiprocessing to enhance efficiency.
  7. Enhance User Feedback

    • Feedback Loops: Allow users to provide feedback on the chatbot’s responses to iteratively improve accuracy and relevance.
    • Analytics and Monitoring: Track usage patterns and performance metrics to identify areas for improvement.
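
As a starting point for the API integration idea above, here is a minimal FastAPI sketch. It assumes scrape_and_save.py has already populated CapybaraDB and that chat_bot.py exposes query_capybara_db and generate_answer as defined earlier; the filename api_server.py and the /ask route are just suggestions:

# api_server.py (hypothetical filename): expose the chatbot over HTTP with FastAPI
from fastapi import FastAPI
from pydantic import BaseModel

# Reuses the functions defined in chat_bot.py earlier in this tutorial
from chat_bot import query_capybara_db, generate_answer

app = FastAPI()

class Question(BaseModel):
    query: str

@app.post("/ask")
def ask(question: Question):
    # Retrieve the most relevant chunks, then generate a grounded answer
    results = query_capybara_db(question.query, top_k=3)
    chunks = results.get("matches", [])
    if not chunks:
        return {"answer": None, "detail": "No relevant blog content found."}
    return {"answer": generate_answer(question.query, chunks)}

Run it with, for example, uvicorn api_server:app --reload, then send a POST request to /ask with a JSON body such as {"query": "What does Sam Altman say about startups?"}.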

Final Thoughts

By combining the semantic retrieval capabilities of CapybaraDB with the generative power of OpenAI, you’ve created a robust and intelligent chatbot that can effectively answer questions grounded in real content. This foundation opens up numerous possibilities for building more advanced and specialized AI-driven applications.

Keep experimenting, keep building, and most importantly—have fun with it! 🚀

If you encounter any challenges or have further questions, revisit the relevant sections of this tutorial.


Enjoy your new Sam Altman Blog ChatBot!