Build a Chatbot with Sam Altman’s Blog
1. Overview
In this tutorial, you’ll build a terminal-based ChatBot that answers questions about Sam Altman’s blog posts. The project utilizes a Retrieval-Augmented Generation (RAG) pattern by:
- Setting Up CapybaraDB and OpenAI.
- Scraping Sam Altman’s blog content.
- Storing each blog article as a separate document in CapybaraDB, along with its title, URL, and publication date.
- Retrieving relevant articles based on user queries.
- Generating context-aware answers using OpenAI’s language model.
This approach ensures that the chatbot’s responses are grounded in actual blog content, reducing inaccuracies and enhancing reliability.
2. Set Up CapybaraDB and OpenAI
Before diving into scraping and interacting with the data, it’s essential to set up CapybaraDB for storing and retrieving your data and OpenAI for generating responses.
Install Dependencies
Ensure you have all the necessary Python libraries installed:
```bash
pip install requests beautifulsoup4 openai capybaradb python-dotenv
```
Environment Variables
Create a file named `.env` in your project directory to securely store your API credentials:
```env
CAPYBARA_API_KEY=your_capybara_api_key
CAPYBARA_PROJECT_ID=your_capybara_project_id
OPENAI_API_KEY=your_openai_api_key
```
Security Tip: Add `.env` to your `.gitignore` to prevent accidentally committing sensitive information.
Basic Usage of CapybaraDB & OpenAI
We’ll use two main libraries (a minimal usage sketch follows the list):
- CapybaraDB:
  - Store your scraped articles.
  - Automatically create vector embeddings when `EmbText` is used.
- OpenAI:
  - Generate final responses to user queries.
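Here is a minimal sketch of how the two libraries fit together. The database name, collection name, and sample text are placeholders; the CapybaraDB calls mirror the ones used later in this tutorial, and the OpenAI call uses the standard chat completions API:

```python
# minimal_usage.py — a rough sketch, not production code
from dotenv import load_dotenv
from capybaradb import CapybaraDB, EmbText
import openai

load_dotenv()  # reads CAPYBARA_API_KEY, CAPYBARA_PROJECT_ID, OPENAI_API_KEY from .env

# Connect to a database and collection (placeholder names)
capybara = CapybaraDB()
collection = capybara.db("my_database").collection("my_collection")

# Wrapping a field in EmbText asks CapybaraDB to chunk and embed it for semantic search
collection.insert([{"title": "Hello", "content": EmbText("Some text to embed...")}])

# Semantic query over the embedded field
results = collection.query(query="What does the text say?", top_k=3)

# Generate a response with OpenAI's chat completions API
reply = openai.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Say hello"}],
)
print(reply.choices[0].message.content)
```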
With the dependencies installed and environment variables configured, you’re now ready to move on to scraping Sam Altman’s blog and storing the data into CapybaraDB.
3. Scrape Sam Altman’s Blog
Next, we’ll create a Python script responsible for scraping Sam Altman’s blog and saving the articles to CapybaraDB. This script should be run only once (or whenever new blog posts need to be synced) to populate the database.
Create scrape_and_save.py
Create a new Python script named `scrape_and_save.py`. This script handles the scraping of blog posts and saving them into CapybaraDB.
```python
# scrape_and_save.py

import requests
from bs4 import BeautifulSoup
from dotenv import load_dotenv
from capybaradb import CapybaraDB, EmbText
from datetime import datetime
import time

# Load environment variables from .env
load_dotenv()

# Initialize CapybaraDB
capybara = CapybaraDB()
db = capybara.db("sam_altman_blog_db")
collection = db.collection("blog_articles")  # Each doc = 1 article

def scrape_sam_altman_blog_articles():
    base_url = "https://blog.samaltman.com/"
    articles = []

    for page_num in range(1, 13):  # Scraping up to 12 pages
        if page_num == 1:
            url = base_url  # First page doesn't have a page number in the URL
        else:
            url = f"{base_url}?page={page_num}"

        print(f"Fetching page {page_num}: {url}")
        response = requests.get(url)

        if not response.ok:
            print(f"Could not fetch page {page_num}. Status code: {response.status_code}. Stopping scrape.")
            break

        soup = BeautifulSoup(response.text, "html.parser")
        page_articles = soup.find_all("article", class_="post")

        # If we find no articles, likely we've reached the end
        if not page_articles:
            print(f"No articles found on page {page_num}. Stopping scrape.")
            break

        for entry in page_articles:
            # Extract Title and URL
            title_tag = entry.find("div", class_="post-title").find("h2").find("a")
            if title_tag:
                title = title_tag.get_text(strip=True)
                post_url = title_tag['href']
            else:
                print(f"Article on page {page_num} missing title. Skipping.")
                continue

            # Extract Date
            footer = entry.find("footer", class_="homepage-post-footer")
            if footer:
                date_span = footer.find("span", class_="posthaven-formatted-date")
                if date_span and date_span.has_attr("data-unix-time"):
                    unix_time = int(date_span["data-unix-time"])
                    # Convert Unix timestamp to readable date
                    date = datetime.utcfromtimestamp(unix_time).strftime('%Y-%m-%d')
                else:
                    print(f"Article '{title}' on page {page_num} missing date. Skipping.")
                    continue
            else:
                print(f"Article '{title}' on page {page_num} missing footer. Skipping.")
                continue

            # Extract Content
            body_div = entry.find("div", class_="post-body").find("div", class_="posthaven-post-body")
            if body_div:
                # Clean up the content by replacing multiple spaces/newlines
                content = body_div.get_text(separator="\n", strip=True)
            else:
                print(f"Article '{title}' on page {page_num} missing content. Skipping.")
                continue

            # Structure the document
            doc = {
                "title": title,
                "date": date,
                "url": post_url,
                "content": EmbText(content),
            }
            articles.append(doc)
            print(f"Scraped article: '{title}'")

        # Optional: Pause between page requests to be polite to the server
        time.sleep(1)  # Sleep for 1 second

    return articles

def insert_articles_into_capybara(articles):
    """Insert each article as a separate document into CapybaraDB."""
    try:
        if articles:
            response = collection.insert(articles)
            print(f"Inserted {len(articles)} articles into CapybaraDB.")
        else:
            print("No articles to insert.")
    except Exception as e:
        print("Error inserting articles:", e)

def main():
    # Scrape all articles
    scraped_articles = scrape_sam_altman_blog_articles()
    # Insert them into CapybaraDB
    insert_articles_into_capybara(scraped_articles)
    print("Scraping and insertion completed.")

if __name__ == "__main__":
    main()
```
Explanation of scrape_and_save.py
- Scraping Logic:
  - Iterates through pages 1 to 12 of Sam Altman’s blog.
  - Extracts titles, URLs, dates, and content from each blog post.
  - Skips articles missing essential elements.
  - Waits 1 second between page requests to avoid overloading the server.
- Insertion Logic:
  - Wraps content with `EmbText` for semantic querying.
  - Each article is saved as a separate document in CapybaraDB.
Run the Scraping and Saving Script
- Ensure Dependencies are Installed:

  ```bash
  pip install requests beautifulsoup4 openai capybaradb python-dotenv
  ```

- Set Up Environment Variables:

  ```env
  CAPYBARA_API_KEY=your_capybara_api_key
  CAPYBARA_PROJECT_ID=your_capybara_project_id
  OPENAI_API_KEY=your_openai_api_key
  ```

- Run the Script:

  ```bash
  python scrape_and_save.py
  ```
Note: Run this script only once to sync blog articles with CapybaraDB. Rerun it only when new articles are available.
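If you do rerun the scraper, one simple way to avoid inserting duplicate documents is to keep a local record of URLs that have already been saved. The sketch below uses a plain JSON file for that bookkeeping; the file name `inserted_urls.json` is a made-up convention, not a CapybaraDB feature:

```python
# incremental_sync.py — sketch of skipping already-inserted articles on rerun
import json
import os

# Reuses the functions defined in scrape_and_save.py
from scrape_and_save import scrape_sam_altman_blog_articles, insert_articles_into_capybara

SEEN_FILE = "inserted_urls.json"  # hypothetical local bookkeeping file

def load_seen_urls():
    if os.path.exists(SEEN_FILE):
        with open(SEEN_FILE) as f:
            return set(json.load(f))
    return set()

def save_seen_urls(urls):
    with open(SEEN_FILE, "w") as f:
        json.dump(sorted(urls), f, indent=2)

def main():
    seen = load_seen_urls()
    articles = scrape_sam_altman_blog_articles()
    new_articles = [a for a in articles if a["url"] not in seen]

    if new_articles:
        insert_articles_into_capybara(new_articles)
        save_seen_urls(seen | {a["url"] for a in new_articles})
    else:
        print("No new articles found; nothing inserted.")

if __name__ == "__main__":
    main()
```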
4. Query and Generate Answers
Now, we’ll create the chatbot that interacts with the user, queries CapybaraDB for relevant content, and uses OpenAI to generate answers based on that content.
Create chat_bot.py
Create a new Python script named `chat_bot.py`. This script will handle user interactions, query the database, and generate responses using OpenAI’s API.
```python
# chat_bot.py

import openai
from dotenv import load_dotenv
from capybaradb import CapybaraDB
from capybaradb import EmbText
from datetime import datetime
import time

# Load environment variables from .env
load_dotenv()

# Initialize CapybaraDB
capybara = CapybaraDB()
db = capybara.db("sam_altman_blog_db")
collection = db.collection("blog_articles")

def query_capybara_db(user_query, top_k=3):
    """Returns the top_k chunks that best match the user's query."""
    try:
        results = collection.query(
            query=user_query,
            top_k=top_k
        )
        return results
    except Exception as e:
        print("Error querying the collection:", e)
        return {}

def generate_answer(user_query, context_chunks):
    """
    Creates a prompt using relevant context from CapybaraDB
    and calls OpenAI to generate an answer.
    """
    relevant_context = "\n\n".join(
        f"{i + 1}. {match.get('chunk', '')}" for i, match in enumerate(context_chunks)
    )

    prompt = f"""
You are a helpful assistant. Use the following context to answer the user question.

Context:
{relevant_context}

Question:
{user_query}

Answer:
"""

    try:
        response = openai.chat.completions.create(
            model="gpt-4",  # or "gpt-3.5-turbo"
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        print("Error generating answer:", e)
        return None

def chat_bot():
    print("\nWelcome to the Sam Altman Blog ChatBot!")
    print("Type 'exit' or 'quit' to stop.\n")

    while True:
        user_query = input("Ask a question about Sam Altman's blog >> ")
        if user_query.lower() in ["exit", "quit"]:
            print("Goodbye!")
            break

        results = query_capybara_db(user_query, top_k=3)
        chunks = results.get("matches", [])

        if not chunks:
            print("No relevant information found. Please try again.\n")
            continue

        answer = generate_answer(user_query, chunks)
        if answer:
            print(f"\nAnswer:\n{answer}\n")
        else:
            print("\nSorry, I couldn't generate an answer.\n")

def main():
    chat_bot()

if __name__ == "__main__":
    main()
```
Explanation of chat_bot.py
- Initialization:
  - CapybaraDB Connection: Connects to the existing `sam_altman_blog_db` database and accesses the `blog_articles` collection.
  - Environment Variables: Loads API keys from the `.env` file.
- Querying Logic (`query_capybara_db`):
  - Semantic Search: Uses CapybaraDB’s semantic search to find the top `k` relevant chunks based on the user’s query.
  - Error Handling: Logs any issues during the querying process.
- Answer Generation (`generate_answer`):
  - Context Assembly: Combines the retrieved text chunks into a single context block.
  - Prompt Construction: Formats the prompt to instruct OpenAI to use the provided context to answer the question.
  - OpenAI API Call: Sends the prompt to OpenAI’s chat completions endpoint using the specified model (`gpt-4` recommended for better performance).
  - Temperature Setting: Controls the randomness of the response; `0.7` offers a balance between creativity and coherence.
- Chat Loop (`chat_bot`):
  - User Interaction: Prompts the user for questions in the terminal.
  - Processing: Retrieves relevant context from CapybaraDB and generates answers using OpenAI.
  - Termination: Users can exit the chat by typing `'exit'` or `'quit'`.
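The chat loop above treats every question independently. If you want follow-up questions to build on earlier turns, one possible extension is to accumulate a `messages` history and pass it to the chat completions call. The sketch below is illustrative, not part of the original `chat_bot.py`; `generate_answer_with_history` and the system prompt text are assumed names, and it relies on the same `openai` setup as the script above:

```python
# Sketch: multi-turn variant of generate_answer (assumes the setup from chat_bot.py)
import openai

history = [{"role": "system", "content": "You answer questions using excerpts from Sam Altman's blog."}]

def generate_answer_with_history(user_query, context_chunks):
    relevant_context = "\n\n".join(
        f"{i + 1}. {match.get('chunk', '')}" for i, match in enumerate(context_chunks)
    )
    # Append the new question (plus retrieved context) to the running conversation
    history.append({
        "role": "user",
        "content": f"Context:\n{relevant_context}\n\nQuestion:\n{user_query}",
    })
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=history,
        temperature=0.7,
    )
    answer = response.choices[0].message.content.strip()
    # Store the assistant's reply so later turns can refer back to it
    history.append({"role": "assistant", "content": answer})
    return answer
```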
Run the ChatBot
After successfully scraping and saving the blog articles, you can now interact with the chatbot.
- Ensure Dependencies are Installed (if not already done):

  ```bash
  pip install requests beautifulsoup4 openai capybaradb python-dotenv
  ```

- Set Up Environment Variables:

  Ensure your `.env` file contains the necessary API credentials as shown earlier.

- Run the ChatBot Script:

  ```bash
  python chat_bot.py
  ```
Example Interaction:
```text
Welcome to the Sam Altman Blog ChatBot!
Type 'exit' or 'quit' to stop.

Ask a question about Sam Altman's blog >> What are Sam Altman's views on AI policy?

Answer:
Sam Altman emphasizes the importance of developing AI responsibly and ensuring that its benefits are widely distributed. He advocates for proactive measures in AI safety, ethical considerations, and global cooperation to mitigate potential risks associated with advanced artificial intelligence technologies.
```
Note: The quality and accuracy of the answers depend on the content of the scraped blog posts and the effectiveness of the semantic search.
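If an answer seems off, it is often worth inspecting what the semantic search actually returned before adjusting the prompt or model. Here is a small debugging sketch that reuses `query_capybara_db` from `chat_bot.py`; the file name and sample question are placeholders:

```python
# debug_retrieval.py — print the raw chunks CapybaraDB returns for a query
from chat_bot import query_capybara_db  # importing also loads .env and connects to CapybaraDB

results = query_capybara_db("What does Sam Altman say about startups?", top_k=3)

for i, match in enumerate(results.get("matches", []), start=1):
    chunk = match.get("chunk", "")
    print(f"--- Match {i} ({len(chunk)} characters) ---")
    print(chunk[:300])  # first 300 characters of each retrieved chunk
```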
5. Next Steps & Beyond
Congratulations! 🎉 You’ve built a Retrieval-Augmented Generation (RAG) system around Sam Altman’s blog using CapybaraDB for semantic retrieval and OpenAI for generating context-aware answers.
Where to Go from Here
- Fine-Tune Your Pipeline
  - Advanced Chunking: Adjust chunk sizes and overlaps in CapybaraDB to optimize retrieval accuracy.
  - Metadata Utilization: Incorporate additional metadata (e.g., tags, categories) to enable more granular and topic-specific queries.
- Scale the Dataset
  - Additional Data Sources: Incorporate other blogs, transcripts, or documents to broaden the chatbot’s knowledge base.
  - Automation: Set up periodic scraping and data insertion scripts to keep the database updated with new blog posts automatically.
- Model Options
  - Different LLMs: Experiment with other large language models from providers like Anthropic or Cohere to compare performance, cost, and output quality.
  - Model Fine-Tuning: Fine-tune models on your specific dataset for improved accuracy and relevance.
- Deploy to Production
  - API Integration: Build a simple API endpoint (using frameworks like Flask or FastAPI) to integrate the chatbot into web or mobile applications; a rough sketch follows this list.
  - User Interface Enhancements: Transition from a terminal-based interface to a more user-friendly web or desktop application for broader accessibility.
- Handle Data Privacy
  - Respect Scraping Policies: Ensure compliance with website terms of service and respect for data privacy.
  - Secure Data Handling: Implement secure storage and handling practices, especially if dealing with sensitive information.
- Performance Optimization
  - Caching Strategies: Implement caching for frequent queries to reduce latency and API costs.
  - Parallel Processing: Optimize scraping and data processing using asynchronous programming or multiprocessing to enhance efficiency.
- Enhance User Feedback
  - Feedback Loops: Allow users to provide feedback on the chatbot’s responses to iteratively improve accuracy and relevance.
  - Analytics and Monitoring: Track usage patterns and performance metrics to identify areas for improvement.
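As a starting point for the API integration idea above, here is a rough FastAPI sketch that wraps the existing retrieval and generation functions. The endpoint name, request shape, and run command are assumptions, not part of the original scripts:

```python
# api.py — hedged sketch of exposing the chatbot over HTTP with FastAPI
from fastapi import FastAPI
from pydantic import BaseModel

# Reuses the functions defined in chat_bot.py
from chat_bot import query_capybara_db, generate_answer

app = FastAPI(title="Sam Altman Blog ChatBot API")

class Question(BaseModel):
    query: str

@app.post("/ask")  # hypothetical endpoint name
def ask(question: Question):
    results = query_capybara_db(question.query, top_k=3)
    chunks = results.get("matches", [])
    if not chunks:
        return {"answer": None, "detail": "No relevant context found."}
    answer = generate_answer(question.query, chunks)
    return {"answer": answer}

# Run with: uvicorn api:app --reload
```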
Final Thoughts
By combining the semantic retrieval capabilities of CapybaraDB with the generative power of OpenAI, you’ve created a robust and intelligent chatbot that can effectively answer questions grounded in real content. This foundation opens up numerous possibilities for building more advanced and specialized AI-driven applications.
Keep experimenting, keep building, and most importantly—have fun with it! 🚀
If you encounter any challenges or have further questions, refer back to the earlier sections of this tutorial or to the CapybaraDB and OpenAI documentation.
Enjoy your new Sam Altman Blog ChatBot!