Do you just copy and paste into an AI tool, or are you doing something smarter?
As promised, for anyone interested in how the "thing" works under the hood, here's a breakdown of the architecture. It's a full data pipeline now.
The Overall Flow
The system works in four main stages:
1. **Scrape & Parse:** Collects the raw data from the forum.
2. **Store & Structure:** Puts the data into a relational database.
3. **Enrich & Embed:** Adds layers of AI-generated meaning to the data.
4. **Retrieve & Generate:** Uses the enriched data to answer questions intelligently.
---
1. The Scraper Engine
This isn't a basic scraper. It's a hybrid system designed to be robust.
---
- **Authentication:** It uses a headless browser (`Playwright`) to handle the full login flow, including the 2FA step. Once logged in, it extracts the session cookies.
- **Scraping:** For the actual high-volume scraping, it uses the authenticated session with the `requests` library for speed. This is much faster than running a full browser for every page.
- **Parsing:** It uses `BeautifulSoup` to parse the HTML. Crucially, it has two different parser modes: one for the detailed structure of thread pages and another for the more compact structure of search result pages. It automatically detects the URL type and uses the correct parser.
- **Data Extraction:** It's not just grabbing post text. It pulls out everything relationally: `post_id`, `user_id`, `thread_id`, post number, timestamps, and it also parses out all the **quotes** and **reactions** for each post.
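As a sketch, the thread-vs-search parser dispatch might look like this. The `/threads/` and `/search/` path patterns and the function name are assumptions (typical XenForo-style routes), not the author's actual code:

```python
from urllib.parse import urlparse

def pick_parser_mode(url: str) -> str:
    """Pick a parser mode from the URL shape.

    Sketch only: '/threads/' and '/search/' are assumed
    path patterns, not the forum's confirmed routes.
    """
    path = urlparse(url).path
    if "/threads/" in path:
        return "thread"   # detailed per-post structure
    if "/search/" in path:
        return "search"   # compact result snippets
    return "unknown"
```

The scraper would call something like this once per fetched URL and hand the HTML to the matching `BeautifulSoup` routine.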
2. The Database (The "Long-Term Memory")
All the scraped data goes into a local **SQLite database**. This is the system's memory. Instead of a pile of messy JSON files, the data is organized into tables that are all linked together:
---
- `users`: Stores user info.
- `threads`: Stores thread titles and IDs.
- `posts`: The core table with all post content.
- `quotes`: A relational table linking which post quotes which other post.
- `reactions`: A table linking users and posts through reactions (e.g., 'Like').
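A minimal sketch of that layout in stdlib `sqlite3` (the column names are guesses from the description above, not the actual DDL):

```python
import sqlite3

# Hypothetical schema matching the five tables described above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users     (user_id INTEGER PRIMARY KEY, username TEXT);
CREATE TABLE threads   (thread_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE posts     (post_id INTEGER PRIMARY KEY,
                        thread_id INTEGER REFERENCES threads(thread_id),
                        user_id INTEGER REFERENCES users(user_id),
                        post_number INTEGER,
                        created_at TEXT,
                        content TEXT);
CREATE TABLE quotes    (quoting_post_id INTEGER REFERENCES posts(post_id),
                        quoted_post_id INTEGER REFERENCES posts(post_id));
CREATE TABLE reactions (post_id INTEGER REFERENCES posts(post_id),
                        user_id INTEGER REFERENCES users(user_id),
                        reaction TEXT);
""")

# The relational layout makes questions like "who quoted post 42?" a simple join:
conn.execute("INSERT INTO users VALUES (1, 'Judge Jules'), (2, 'OtherUser')")
conn.execute("INSERT INTO posts (post_id, user_id) VALUES (42, 1), (43, 2)")
conn.execute("INSERT INTO quotes VALUES (43, 42)")
quoters = conn.execute("""
    SELECT u.username
    FROM quotes q
    JOIN posts p ON p.post_id = q.quoting_post_id
    JOIN users u ON u.user_id = p.user_id
    WHERE q.quoted_post_id = 42
""").fetchall()
```

This is the payoff over "a pile of messy JSON files": quote and reaction relationships become one-line joins.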
3. The Enrichment Pipeline (The "Intelligence" Layer)
This is where the raw data gets its "intelligence." This is a multi-step, asynchronous process using Google's Gemini models.
a) AI Enrichment (`gemini-2.5-flash-lite`):
Every single post in the database is sent to Gemini Flash to generate:
- A concise 1-2 sentence **summary**.
- A list of 2-5 relevant **topic tags** (e.g., "settler expansion," "media bias").
- A **sentiment score** (positive, neutral, negative).
This process took about 10 hours for the ~9k posts, respecting API rate limits.
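A sketch of what the per-post round-trip could look like. The prompt wording and JSON schema here are my assumptions, and the actual Gemini API call is omitted; only the validation step is shown:

```python
import json

# Hypothetical prompt template; the actual wording isn't shown in the post.
ENRICH_PROMPT = (
    "Summarise the forum post below in 1-2 sentences, give 2-5 topic tags, "
    "and a sentiment (positive/neutral/negative). Reply as JSON with keys "
    '"summary", "tags", "sentiment".\n\nPOST:\n{post_text}'
)

def parse_enrichment(reply_text: str) -> dict:
    """Validate the model's JSON reply before writing it back to the database."""
    data = json.loads(reply_text)
    if data["sentiment"] not in {"positive", "neutral", "negative"}:
        raise ValueError("unexpected sentiment value")
    if not 2 <= len(data["tags"]) <= 5:
        raise ValueError("expected 2-5 tags")
    return data

# Example of a well-formed reply being checked and accepted:
sample_reply = ('{"summary": "The poster criticises media coverage of the conflict.", '
                '"tags": ["media bias", "settler expansion"], "sentiment": "negative"}')
enriched = parse_enrichment(sample_reply)
```

Validating before writing matters at this scale: one malformed reply out of ~9k shouldn't silently corrupt a row.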
b) Vector Embeddings (`gemini-embedding-001`):
This is the core of the RAG system. Every post's raw text is converted into a 768-dimensional vector (basically a list of 768 numbers). Think of it as a "semantic fingerprint" or a coordinate on a map of meaning. Posts that discuss similar concepts will have vectors that are mathematically close to each other. This is what allows us to search by *idea*, not just by keyword.
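"Mathematically close" here typically means cosine similarity between the two vectors; a minimal pure-Python version:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors: near 1.0 means
    the texts point the same way in 'meaning space', near 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

At query time the question is embedded with the same model and compared against the stored post vectors using exactly this kind of score.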
---
4. The Agent (The "Brain")
This is the `agent.py` script that you can interact with. It uses a method called **Retrieval-Augmented Generation (RAG)**.
Here's how it works when you ask it a question like, "What does Judge Jules think about settler expansion?":
- **Fuzzy Matching:** First, it uses fuzzy string matching to correct typos in the username ("Judge Julze" → "Judge Jules").
- **Hybrid Search:** It then performs a **hybrid search**. It does a fast SQL query to get all posts by "Judge Jules," and *then* it calculates the semantic similarity between your question's vector and the vector of each of that user's posts.
- **Context Retrieval:** It grabs the Top 50 most semantically relevant posts from that user.
- **Prompt Engineering:** It builds a new, complex prompt for a powerful AI model (`gemini-2.5-pro`). The prompt looks something like this:
```
You are an expert forum analyst. Answer the user's question based ONLY on the following context posts from the forum.

--- CONTEXT ---
Post 1 (by Judge Jules): "..."
Post 2 (by Judge Jules): "..."
Post 3 (by Judge Jules): "..."
--- END CONTEXT ---

User's Question: "What does Judge Jules think about settler expansion?"
```
- **Generation:** The AI then generates an answer, but its knowledge is restricted to *only* the context posts we provided. This forces it to be factual to our specific forum and prevents it from making things up.
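The first steps above can be sketched with stdlib tools. `difflib` is a stand-in here, since the post doesn't say which fuzzy matcher or similarity code `agent.py` actually uses:

```python
import difflib
import math

def correct_username(raw, known_users):
    """Fuzzy-correct a typo'd username, e.g. 'Judge Julze' -> 'Judge Jules'.
    difflib is an assumed stand-in for the agent's real matcher."""
    matches = difflib.get_close_matches(raw, known_users, n=1, cutoff=0.6)
    return matches[0] if matches else None

def top_k_posts(question_vec, user_posts, k=50):
    """Rank one user's posts (already pre-filtered by SQL) by cosine
    similarity to the question embedding and keep the top k."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(y * y for y in b)))
    return sorted(user_posts, key=lambda p: cos(question_vec, p[1]), reverse=True)[:k]
```

The SQL-first filter is the "hybrid" part: restricting to one user's posts before scoring means only a few hundred similarity computations per query instead of ~9k.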
TL;DR: It's a full-stack data analysis platform. It scrapes, structures, and enriches forum data, then uses a sophisticated RAG architecture to provide context-aware answers based on a semantic understanding of our own discussions.
Actually going to write a quick one myself now; interested to see what comes out. It won't be as robust, just a scraper, and then I'll manually send the output off for analysis and see what matches, as I'm sure it'll depend on the prompts given.
There isn't enough bandwidth to cover the utter tragedies across the planet atm.
Pretty much all of these are man made!
The info coming out of Sudan is horrific, tens of thousands of children starved to death. Murder, rape and starvation continues and millions of people are at risk.
I'm thinking of reverting to my previous mindset of, if it doesn't happen in my daily life, it's not my reality.
Selfish, but as you say, it's just too much
In your RAG, what are you storing? All the posts or a subset? Curious.
lol what a fucking muppet.
Mild punishment for October 7th imo.
May anyone who championed this be cursed for eternity.
Anyone who gaslit, called ppl who challenged a racist be similarly cursed.
Silence is complicity and too many cunts are silent.
OK, now I'm ready to ban this WUM cunt
Maybe death to all Israelis and their supporters would be mild punishment for the genocide they are inflicting on Palestine.
FROM THE RIVER TO THE SEA
FREE PALESTINE.
Ok that's too far. There are many Israelis who are just as dismayed at what's happening. Same as Nazi-occupied territories back in the 30s and 40s.
Blitz is an out and out cunt though. That much is true.
This.
I assume he was reversing the statement back to him, to show the absurdity of it, rather than actually meaning that.
Fair enough. However the cunt would never have picked up on it either.