Build a Chatbot with Sam Altman’s Blog
1. Overview
In this tutorial, you’ll build a terminal-based ChatBot that answers questions about Sam Altman’s blog posts. The project utilizes a Retrieval-Augmented Generation (RAG) pattern by:
- Setting Up CapybaraDB and OpenAI.
- Scraping Sam Altman’s blog content.
- Storing each blog article as a separate document in CapybaraDB, together with its title, URL, and publication date.
- Retrieving relevant articles based on user queries.
- Generating context-aware answers using OpenAI’s language model.
This approach ensures that the chatbot’s responses are grounded in actual blog content, reducing inaccuracies and enhancing reliability.
2. Set Up CapybaraDB and OpenAI
Before diving into scraping and interacting with the data, it’s essential to set up CapybaraDB for storing and retrieving your data and OpenAI for generating responses.
Install Dependencies
Ensure you have all the necessary Python libraries installed:
pip install requests beautifulsoup4 openai capybaradb python-dotenv
Environment Variables
Create a file named `.env` in your project directory to securely store your API credentials:
CAPYBARA_API_KEY=your_capybara_api_key
CAPYBARA_PROJECT_ID=your_capybara_project_id
OPENAI_API_KEY=your_openai_api_key
Security Tip: Add `.env` to your `.gitignore` to prevent accidentally committing sensitive information.
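As a quick sanity check that these keys load correctly, here is a minimal sketch using python-dotenv (the file name `check_env.py` and the loop are purely illustrative):

```python
# check_env.py -- illustrative sanity check; assumes the .env file above is in the working directory
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env into the process environment

for key in ("CAPYBARA_API_KEY", "CAPYBARA_PROJECT_ID", "OPENAI_API_KEY"):
    print(key, "is set" if os.getenv(key) else "is MISSING")
```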
Basic Usage of CapybaraDB & OpenAI
We’ll use two main libraries:
- CapybaraDB:
  - Store your scraped articles.
  - Automatically create vector embeddings when `EmbText` is used.
- OpenAI:
  - Generate final responses to user queries.
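To see how the two pieces fit together before we build the full scripts, here is a minimal end-to-end sketch. It only uses calls that appear later in this tutorial (`CapybaraDB()`, `db()`, `collection()`, `insert()`, `query()`, `EmbText`, and OpenAI's chat completions); the database, collection, and document contents are placeholder examples:

```python
# rag_sketch.py -- minimal retrieve-then-generate sketch (placeholder names and content)
import openai
from dotenv import load_dotenv
from capybaradb import CapybaraDB, EmbText

load_dotenv()

collection = CapybaraDB().db("demo_db").collection("demo_docs")

# EmbText tells CapybaraDB to chunk and embed this field automatically on insert.
collection.insert([{"title": "Example", "content": EmbText("Sam Altman writes about startups and AI.")}])

# Semantic search over the embedded field (embedding may take a moment after insert).
matches = collection.query(query="What does he write about?", top_k=1).get("matches", [])
context = matches[0].get("chunk", "") if matches else ""

# Hand the retrieved context to OpenAI to produce the final answer.
reply = openai.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": f"Context: {context}\n\nQuestion: What topics are covered?"}],
)
print(reply.choices[0].message.content)
```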
With the dependencies installed and environment variables configured, you’re now ready to move on to scraping Sam Altman’s blog and storing the data into CapybaraDB.
3. Scrape Sam Altman’s Blog
Next, we’ll create a Python script responsible for scraping Sam Altman’s blog and saving the articles to CapybaraDB. This script should be run only once (or whenever new blog posts need to be synced) to populate the database.
Create scrape_and_save.py
Create a new Python script named `scrape_and_save.py`. This script handles the scraping of blog posts and saving them into CapybaraDB.
```python
# scrape_and_save.py
import requests
from bs4 import BeautifulSoup
from dotenv import load_dotenv
from capybaradb import CapybaraDB, EmbText
from datetime import datetime
import time

# Load environment variables from .env
load_dotenv()

# Initialize CapybaraDB
capybara = CapybaraDB()
db = capybara.db("sam_altman_blog_db")
collection = db.collection("blog_articles")  # Each doc = 1 article


def scrape_sam_altman_blog_articles():
    base_url = "https://blog.samaltman.com/"
    articles = []

    for page_num in range(1, 13):  # Scraping up to 12 pages
        if page_num == 1:
            url = base_url  # First page doesn't have a page number in the URL
        else:
            url = f"{base_url}?page={page_num}"

        print(f"Fetching page {page_num}: {url}")
        response = requests.get(url)
        if not response.ok:
            print(f"Could not fetch page {page_num}. Status code: {response.status_code}. Stopping scrape.")
            break

        soup = BeautifulSoup(response.text, "html.parser")
        page_articles = soup.find_all("article", class_="post")

        # If we find no articles, likely we've reached the end
        if not page_articles:
            print(f"No articles found on page {page_num}. Stopping scrape.")
            break

        for entry in page_articles:
            # Extract Title and URL
            title_tag = entry.find("div", class_="post-title").find("h2").find("a")
            if title_tag:
                title = title_tag.get_text(strip=True)
                post_url = title_tag['href']
            else:
                print(f"Article on page {page_num} missing title. Skipping.")
                continue

            # Extract Date
            footer = entry.find("footer", class_="homepage-post-footer")
            if footer:
                date_span = footer.find("span", class_="posthaven-formatted-date")
                if date_span and date_span.has_attr("data-unix-time"):
                    unix_time = int(date_span["data-unix-time"])
                    # Convert Unix timestamp to readable date
                    date = datetime.utcfromtimestamp(unix_time).strftime('%Y-%m-%d')
                else:
                    print(f"Article '{title}' on page {page_num} missing date. Skipping.")
                    continue
            else:
                print(f"Article '{title}' on page {page_num} missing footer. Skipping.")
                continue

            # Extract Content
            body_div = entry.find("div", class_="post-body").find("div", class_="posthaven-post-body")
            if body_div:
                # Clean up the content by replacing multiple spaces/newlines
                content = body_div.get_text(separator="\n", strip=True)
            else:
                print(f"Article '{title}' on page {page_num} missing content. Skipping.")
                continue

            # Structure the document
            doc = {
                "title": title,
                "date": date,
                "url": post_url,
                "content": EmbText(content),
            }
            articles.append(doc)
            print(f"Scraped article: '{title}'")

        # Optional: Pause between page requests to be polite to the server
        time.sleep(1)  # Sleep for 1 second

    return articles


def insert_articles_into_capybara(articles):
    """Insert each article as a separate document into CapybaraDB."""
    try:
        if articles:
            collection.insert(articles)
            print(f"Inserted {len(articles)} articles into CapybaraDB.")
        else:
            print("No articles to insert.")
    except Exception as e:
        print("Error inserting articles:", e)


def main():
    # Scrape all articles
    scraped_articles = scrape_sam_altman_blog_articles()

    # Insert them into CapybaraDB
    insert_articles_into_capybara(scraped_articles)
    print("Scraping and insertion completed.")


if __name__ == "__main__":
    main()
```
Explanation of scrape_and_save.py
- Scraping Logic:
  - Iterates through pages 1 to 12 of Sam Altman’s blog.
  - Extracts the title, URL, date, and content of each blog post.
  - Skips articles missing essential elements.
  - Waits 1 second between page requests to avoid overloading the server.
- Insertion Logic:
  - Wraps each article’s content in `EmbText` so CapybaraDB chunks and embeds it for semantic querying.
  - Saves each article as a separate document in CapybaraDB.
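After running the script, you can spot-check that the articles landed in CapybaraDB with a quick one-off query (a minimal sketch; the sample search term is arbitrary):

```python
# verify_insert.py -- quick sanity check after scrape_and_save.py has run
from dotenv import load_dotenv
from capybaradb import CapybaraDB

load_dotenv()

collection = CapybaraDB().db("sam_altman_blog_db").collection("blog_articles")

# An empty "matches" list suggests the insert did not go through.
results = collection.query(query="startups", top_k=1)
print(results.get("matches", []))
```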
Run the Scraping and Saving Script
- Ensure Dependencies are Installed:
  pip install requests beautifulsoup4 openai capybaradb python-dotenv
- Set Up Environment Variables:
  CAPYBARA_API_KEY=your_capybara_api_key
  CAPYBARA_PROJECT_ID=your_capybara_project_id
  OPENAI_API_KEY=your_openai_api_key
- Run the Script:
  python scrape_and_save.py
Note: Run this script only once to sync blog articles with CapybaraDB. Rerun it only when new articles are available.
4. Query and Generate Answers
Now, we’ll create the chatbot that interacts with the user, queries CapybaraDB for relevant content, and uses OpenAI to generate answers based on that content.
Create chat_bot.py
Create a new Python script named `chat_bot.py`. This script will handle user interactions, query the database, and generate responses using OpenAI’s API.
```python
# chat_bot.py
import openai
from dotenv import load_dotenv
from capybaradb import CapybaraDB

# Load environment variables from .env
load_dotenv()

# Initialize CapybaraDB
capybara = CapybaraDB()
db = capybara.db("sam_altman_blog_db")
collection = db.collection("blog_articles")


def query_capybara_db(user_query, top_k=3):
    """Returns the top_k chunks that best match the user's query."""
    try:
        results = collection.query(
            query=user_query,
            top_k=top_k
        )
        return results
    except Exception as e:
        print("Error querying the collection:", e)
        return {}


def generate_answer(user_query, context_chunks):
    """
    Creates a prompt using relevant context from CapybaraDB
    and calls OpenAI to generate an answer.
    """
    relevant_context = "\n\n".join(
        f"{i + 1}. {match.get('chunk', '')}"
        for i, match in enumerate(context_chunks)
    )

    prompt = f"""
You are a helpful assistant. Use the following context to answer the user question.

Context:
{relevant_context}

Question: {user_query}

Answer:
"""

    try:
        response = openai.chat.completions.create(
            model="gpt-4",  # or "gpt-3.5-turbo"
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        print("Error generating answer:", e)
        return None


def chat_bot():
    print("\nWelcome to the Sam Altman Blog ChatBot!")
    print("Type 'exit' or 'quit' to stop.\n")

    while True:
        user_query = input("Ask a question about Sam Altman's blog >> ")
        if user_query.lower() in ["exit", "quit"]:
            print("Goodbye!")
            break

        results = query_capybara_db(user_query, top_k=3)
        chunks = results.get("matches", [])
        if not chunks:
            print("No relevant information found. Please try again.\n")
            continue

        answer = generate_answer(user_query, chunks)
        if answer:
            print(f"\nAnswer:\n{answer}\n")
        else:
            print("\nSorry, I couldn't generate an answer.\n")


def main():
    chat_bot()


if __name__ == "__main__":
    main()
```
Explanation of chat_bot.py
- Initialization:
  - CapybaraDB Connection: Connects to the existing `sam_altman_blog_db` database and accesses the `blog_articles` collection.
  - Environment Variables: Loads API keys from the `.env` file.
- Querying Logic (`query_capybara_db`):
  - Semantic Search: Uses CapybaraDB’s semantic search to find the top `k` relevant chunks based on the user’s query (see the sketch after this list for the shape of the results).
  - Error Handling: Logs any issues during the querying process.
- Answer Generation (`generate_answer`):
  - Context Assembly: Combines the retrieved text chunks into a single context block.
  - Prompt Construction: Formats the prompt to instruct OpenAI to use the provided context to answer the question.
  - OpenAI API Call: Sends the prompt to OpenAI’s chat completions endpoint using the specified model (`gpt-4` recommended for better performance).
  - Temperature Setting: Controls the randomness of the response; `0.7` offers a balance between creativity and coherence.
- Chat Loop (`chat_bot`):
  - User Interaction: Prompts the user for questions in the terminal.
  - Processing: Retrieves relevant context from CapybaraDB and generates answers using OpenAI.
  - Termination: Users can exit the chat by typing `exit` or `quit`.
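The chat loop only relies on each match exposing a `chunk` field, which is how `generate_answer` reads the results; any other fields in a match are implementation details of CapybaraDB. A small sketch to inspect what your queries actually return:

```python
# inspect_matches.py -- peek at the query results the chatbot consumes
from dotenv import load_dotenv
from capybaradb import CapybaraDB

load_dotenv()
collection = CapybaraDB().db("sam_altman_blog_db").collection("blog_articles")

results = collection.query(query="What does Sam Altman say about startups?", top_k=3)
for i, match in enumerate(results.get("matches", []), start=1):
    # chat_bot.py uses only the "chunk" text; print the whole match to see every field.
    print(f"{i}. {match.get('chunk', '')[:120]}...")
```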
Run the ChatBot
After successfully scraping and saving the blog articles, you can now interact with the chatbot.
- Ensure Dependencies are Installed (if not already done):
  pip install requests beautifulsoup4 openai capybaradb python-dotenv
- Set Up Environment Variables:
  Ensure your `.env` file contains the necessary API credentials as shown earlier.
- Run the ChatBot Script:
  python chat_bot.py
Example Interaction:
```
Welcome to the Sam Altman Blog ChatBot!
Type 'exit' or 'quit' to stop.

Ask a question about Sam Altman's blog >> What are Sam Altman's views on AI policy?

Answer:
Sam Altman emphasizes the importance of developing AI responsibly and ensuring that its benefits are widely distributed. He advocates for proactive measures in AI safety, ethical considerations, and global cooperation to mitigate potential risks associated with advanced artificial intelligence technologies.
```
Note: The quality and accuracy of the answers depend on the content of the scraped blog posts and the effectiveness of the semantic search.
5. Next Steps & Beyond
Congratulations! 🎉 You’ve built a Retrieval-Augmented Generation (RAG) system around Sam Altman’s blog using CapybaraDB for semantic retrieval and OpenAI for generating context-aware answers.
Where to Go from Here
- Fine-Tune Your Pipeline
  - Advanced Chunking: Adjust chunk sizes and overlaps in CapybaraDB to optimize retrieval accuracy.
  - Metadata Utilization: Incorporate additional metadata (e.g., tags, categories) to enable more granular and topic-specific queries.
- Scale the Dataset
  - Additional Data Sources: Incorporate other blogs, transcripts, or documents to broaden the chatbot’s knowledge base.
  - Automation: Set up periodic scraping and data insertion so the database stays up to date with new blog posts automatically (a minimal sketch follows below).
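A minimal sketch of the automation idea, assuming `scrape_and_save.py` lives in the same directory. Note that `main()` currently re-inserts every article, so you would want to add de-duplication (for example, by URL) before running this unattended:

```python
# periodic_sync.py -- illustrative scheduling loop (interval is an example)
import time

from scrape_and_save import main as sync_blog

SYNC_INTERVAL_SECONDS = 7 * 24 * 60 * 60  # once a week

while True:
    sync_blog()                      # re-scrape the blog and insert articles
    time.sleep(SYNC_INTERVAL_SECONDS)
```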
- Model Options
  - Different LLMs: Experiment with other large language models from providers like Anthropic or Cohere to compare performance, cost, and output quality (see the Anthropic sketch below).
  - Model Fine-Tuning: Fine-tune models on your specific dataset for improved accuracy and relevance.
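As one example of swapping providers, the OpenAI call inside `generate_answer` could be replaced with Anthropic's Python SDK (a hedged sketch: install with `pip install anthropic`, add `ANTHROPIC_API_KEY` to your `.env`, and check Anthropic's documentation for current model names):

```python
# anthropic_answer.py -- alternative answer generation; retrieval from CapybaraDB stays unchanged
import anthropic
from dotenv import load_dotenv

load_dotenv()
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def generate_answer_anthropic(prompt: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # illustrative model name
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text.strip()
```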
- Deploy to Production
  - API Integration: Build a simple API endpoint (using frameworks like Flask or FastAPI) to integrate the chatbot into web or mobile applications (a FastAPI sketch follows below).
  - User Interface Enhancements: Transition from a terminal-based interface to a more user-friendly web or desktop application for broader accessibility.
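Here is a minimal FastAPI sketch that wraps the existing retrieval and generation functions from `chat_bot.py` (the file and route names are illustrative; install with `pip install fastapi uvicorn` and run with `uvicorn api_server:app --reload`):

```python
# api_server.py -- minimal HTTP wrapper around the chatbot's retrieve-then-generate flow
from fastapi import FastAPI
from pydantic import BaseModel

from chat_bot import query_capybara_db, generate_answer

app = FastAPI()


class Question(BaseModel):
    query: str


@app.post("/ask")
def ask(question: Question):
    # Same flow as the terminal bot: retrieve relevant chunks, then generate an answer.
    results = query_capybara_db(question.query, top_k=3)
    chunks = results.get("matches", [])
    if not chunks:
        return {"answer": None, "detail": "No relevant information found."}
    return {"answer": generate_answer(question.query, chunks)}
```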
- Handle Data Privacy
  - Respect Scraping Policies: Ensure compliance with website terms of service and respect for data privacy.
  - Secure Data Handling: Implement secure storage and handling practices, especially if dealing with sensitive information.
- Performance Optimization
  - Caching Strategies: Cache frequent queries to reduce latency and API costs (a minimal sketch follows below).
  - Parallel Processing: Optimize scraping and data processing using asynchronous programming or multiprocessing to enhance efficiency.
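For the caching idea, even the standard library's `functools.lru_cache` goes a long way for repeated identical questions (a minimal in-process sketch; a shared cache such as Redis would be the next step for multi-user deployments):

```python
# cached_query.py -- in-memory cache around the existing CapybaraDB query helper
from functools import lru_cache

from chat_bot import query_capybara_db


@lru_cache(maxsize=256)
def cached_query(user_query: str, top_k: int = 3):
    # Repeated identical queries are served from memory instead of hitting CapybaraDB.
    return query_capybara_db(user_query, top_k)
```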
- Enhance User Feedback
  - Feedback Loops: Allow users to provide feedback on the chatbot’s responses to iteratively improve accuracy and relevance.
  - Analytics and Monitoring: Track usage patterns and performance metrics to identify areas for improvement.
Final Thoughts
By combining the semantic retrieval capabilities of CapybaraDB with the generative power of OpenAI, you’ve created a robust and intelligent chatbot that can effectively answer questions grounded in real content. This foundation opens up numerous possibilities for building more advanced and specialized AI-driven applications.
Keep experimenting, keep building, and most importantly—have fun with it! 🚀
If you encounter any challenges or have further questions, refer back to the steps above or consult the CapybaraDB and OpenAI documentation.
Enjoy your new Sam Altman Blog ChatBot!