
Middle East Violence (content may offend)

Do you just copy and paste into an AI tool, or are you doing something smarter?

I made a thing. It'll scrape a given thread, topic keyword, or username, put every post into a DB tagged with topic, sentiment, and category, and feed it all into a RAG setup for LLM analysis and context etc.

I'll post the details in a spoiler tag a bit later.
 
I was going to put it online and share it, but when I started a web UI I realised I was just distracting myself with another hobby project instead of working on more important things, so it's all CLI menu driven at the moment.
 
As promised, for anyone interested in how the "thing" works under the hood, here's a breakdown of the architecture. It's a full data pipeline now.

The Overall Flow
The system works in four main stages:
1. **Scrape & Parse:** Collects the raw data from the forum.
2. **Store & Structure:** Puts the data into a relational database.
3. **Enrich & Embed:** Adds layers of AI-generated meaning to the data.
4. **Retrieve & Generate:** Uses the enriched data to answer questions intelligently.

---

1. The Scraper Engine
This isn't a basic scraper. It's a hybrid system designed to be robust.
  • **Authentication:** It uses a headless browser (`Playwright`) to handle the full login flow, including the 2FA step. Once logged in, it extracts the session cookies.
  • **Scraping:** For the actual high-volume scraping, it uses the authenticated session with the `requests` library for speed. This is much faster than running a full browser for every page.
  • **Parsing:** It uses `BeautifulSoup` to parse the HTML. Crucially, it has two different parser modes: one for the detailed structure of thread pages and another for the more compact structure of search result pages. It automatically detects the URL type and uses the correct parser.
  • **Data Extraction:** It's not just grabbing post text. It pulls out everything relationally: `post_id`, `user_id`, `thread_id`, post number, timestamps, and it also parses out all the **quotes** and **reactions** for each post.
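A minimal sketch of that URL-dispatch step (the path patterns here are assumptions; the real forum's URL scheme may differ):

```python
from urllib.parse import urlparse

def detect_page_type(url):
    """Choose a parser mode from the URL shape (hypothetical patterns)."""
    path = urlparse(url).path
    if "/threads/" in path:
        return "thread"   # detailed thread-page parser
    if "/search/" in path:
        return "search"   # compact search-result parser
    return "unknown"
```

The scraper would then route the fetched HTML to whichever BeautifulSoup parser matches the detected type.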
---

2. The Database (The "Long-Term Memory")
All the scraped data goes into a local **SQLite database**. This is the system's memory. Instead of a pile of messy JSON files, the data is organized into tables that are all linked together:
  • `users`: Stores user info.
  • `threads`: Stores thread titles and IDs.
  • `posts`: The core table with all post content.
  • `quotes`: A relational table linking which post quotes which other post.
  • `reactions`: A table linking users and posts through reactions (e.g., 'Like').
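One plausible SQLite layout for those tables. The column names are guesses based on the fields listed above, not the actual schema:

```python
import sqlite3

SCHEMA = """
CREATE TABLE users   (user_id INTEGER PRIMARY KEY, username TEXT);
CREATE TABLE threads (thread_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE posts (
    post_id   INTEGER PRIMARY KEY,
    thread_id INTEGER REFERENCES threads(thread_id),
    user_id   INTEGER REFERENCES users(user_id),
    post_num  INTEGER,
    posted_at TEXT,
    content   TEXT
);
-- which post quotes which other post
CREATE TABLE quotes    (post_id INTEGER, quoted_post_id INTEGER);
-- who reacted to which post, and how (e.g. 'Like')
CREATE TABLE reactions (post_id INTEGER, user_id INTEGER, reaction TEXT);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Demo rows: one user, one thread, one post, joined back together.
conn.execute("INSERT INTO users VALUES (1, 'Judge Jules')")
conn.execute("INSERT INTO threads VALUES (10, 'Middle East Violence')")
conn.execute("INSERT INTO posts VALUES (100, 10, 1, 1, '2024-01-01T12:00:00', 'example post')")
author, title = conn.execute(
    "SELECT u.username, t.title FROM posts p "
    "JOIN users u USING (user_id) JOIN threads t USING (thread_id)"
).fetchone()
```

The relational layout is what makes questions like "who quotes whom" or "which posts got the most reactions" a single JOIN instead of a pass over JSON files.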
---

3. The Enrichment Pipeline (The "Intelligence" Layer)
This is where the raw data gets its "intelligence." This is a multi-step, asynchronous process using Google's Gemini models.

a) AI Enrichment (`gemini-2.5-flash-lite`):
Every single post in the database is sent to Gemini Flash to generate:
  • A concise 1-2 sentence **summary**.
  • A list of 2-5 relevant **topic tags** (e.g., "settler expansion," "media bias").
  • A **sentiment score** (positive, neutral, negative).
This process took about 10 hours for the ~9k posts, respecting API rate limits.
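The API call itself depends on the SDK, but the post-processing can be sketched. The JSON field names below are assumptions that mirror the three outputs listed above, not the actual response schema:

```python
import json

VALID_SENTIMENTS = {"positive", "neutral", "negative"}

def parse_enrichment(raw):
    """Validate the JSON an enrichment prompt asks the model to return."""
    data = json.loads(raw)
    if not isinstance(data.get("summary"), str):
        raise ValueError("missing summary")
    if not 2 <= len(data.get("tags", [])) <= 5:
        raise ValueError("expected 2-5 topic tags")
    if data.get("sentiment") not in VALID_SENTIMENTS:
        raise ValueError("bad sentiment value")
    return data

# Hypothetical model response for one post:
enriched = parse_enrichment(
    '{"summary": "Argues coverage of the conflict is biased.",'
    ' "tags": ["media bias", "middle east"],'
    ' "sentiment": "negative"}'
)
```

Validating each response before it hits the database is what keeps a 10-hour batch run from silently storing malformed rows.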

b) Vector Embeddings (`gemini-embedding-001`):
This is the core of the RAG system. Every post's raw text is converted into a 768-dimensional vector (basically a list of 768 numbers). Think of it as a "semantic fingerprint" or a coordinate on a map of meaning. Posts that discuss similar concepts will have vectors that are mathematically close to each other. This is what allows us to search by *idea*, not just by keyword.
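"Mathematically close" here usually means cosine similarity. A stdlib-only sketch:

```python
import math

def cosine_similarity(a, b):
    """1.0 = same direction (same meaning), 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

In practice you'd likely vectorise this over all stored embeddings at once (e.g. with numpy) rather than looping post by post.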

---

4. The Agent (The "Brain")
This is the `agent.py` script that you can interact with. It uses a method called **Retrieval-Augmented Generation (RAG)**.

Here's how it works when you ask it a question like, "What does Judge Jules think about settler expansion?":
  • **Fuzzy Matching:** First, it uses fuzzy string matching to correct typos in the username ("Judge Julze" → "Judge Jules").
  • **Hybrid Search:** It then performs a **hybrid search**. It does a fast SQL query to get all posts by "Judge Jules," and *then* it calculates the semantic similarity between your question's vector and the vector of each of that user's posts.
  • **Context Retrieval:** It grabs the Top 50 most semantically relevant posts from that user.
  • **Prompt Engineering:** It builds a new, complex prompt for a powerful AI model (`gemini-2.5-pro`). The prompt looks something like this:
    Code:
    You are an expert forum analyst. Answer the user's question based ONLY on the following context posts from the forum.
    
    --- CONTEXT ---
    Post 1 (by Judge Jules): "..."
    Post 2 (by Judge Jules): "..."
    Post 3 (by Judge Jules): "..."
    --- END CONTEXT ---
    
    User's Question: "What does Judge Jules think about settler expansion?"
  • **Generation:** The AI then generates an answer, but its knowledge is restricted to *only* the context posts we provided. This forces it to be factual to our specific forum and prevents it from making things up.
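The fuzzy-match and retrieval steps above can be sketched with the stdlib. `difflib` stands in for whatever matcher the script actually uses, and the toy 2-D vectors stand in for the real 768-dimensional embeddings:

```python
import difflib
import math

def match_username(query, known_users):
    """Correct typos like 'Judge Julze' -> 'Judge Jules'."""
    hits = difflib.get_close_matches(query, known_users, n=1, cutoff=0.8)
    return hits[0] if hits else None

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def top_k_posts(question_vec, user_posts, k=50):
    """Rank one user's posts by semantic similarity to the question."""
    ranked = sorted(user_posts,
                    key=lambda p: cosine(question_vec, p["vec"]),
                    reverse=True)
    return ranked[:k]
```

The SQL filter narrows the search to one user's posts first, so the similarity ranking only runs over that subset; the top-k survivors are then pasted into the context block of the prompt shown above.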

TL;DR: It's a full-stack data analysis platform. It scrapes, structures, and enriches forum data, then uses a sophisticated RAG architecture to provide context-aware answers based on a semantic understanding of our own discussions.
 
A- that's not even English.
B- I'm a full retard with this stuff... I'm fucked come the takeover of the Machines.
C- however, a machine seems to understand my intentions and purpose in posts better than some humans.
 

Actually going to write a quick one myself now, interested to see what comes out. Not as robust, just a scraper, and then I'll manually send it off for analysis and see what matches, as I'm sure it'll depend on the prompts given.
 

I started off just like that, with a scraper that saved to JSON files, then I attached those JSONs with some initial prompts to different LLMs to see the results.

I'm busy with a RAG implementation for a work project, so I used this scraper as a way to teach myself what works well and how best to implement it.
 
I would have thought the locals telling the cruise ship to fuck off should have been a hint.
 
There isn't enough bandwidth to cover the utter tragedies across the planet atm.
Pretty much all of these are man made!

The info coming out of Sudan is horrific, tens of thousands of children starved to death. Murder, rape and starvation continues and millions of people are at risk.
 

I'm thinking of reverting to my previous mindset of, if it doesn't happen in my daily life, it's not my reality.

Selfish, but as you say, it's just too much
 


'Cos what's even the point atm.
No one seems to really give a shit.

And anyone who does is pissing in the wind
 
In your RAG, what are you storing? All the posts or a subset? Curious.
 

All posts that get scraped are embedded and stored.

So far about 9k posts, 250 users, and about 10 threads I think. I'm not actively scraping more.

If I'm interested in a user, thread, or subject, then I initiate a targeted scrape. For example, if I asked the system to scrape for the topic "RAG", your post and my reply would end up scraped, tagged, embedded, and all of that would be stored.
 
lol what a fucking muppet.
Talk about phoning it in.
Thousands of posts about the Jews, and one miserable post about Sudan.
And imagine to yourself that the people of Sudan did not even slaughter thousands to trigger this atrocity for them.
 
Maybe death to all Israelis and their supporters would be mild punishment for the genocide they are inflicting on Palestine.

FREE PALESTINE.
Ok, that's too far. There are many Israelis who are just as dismayed at what's happening. Same as in the Nazi-occupied territories back in the 30s and 40s.

Blitz is an out and out cunt though. That much is true.
 
lol @ you lot.
What a pathetic bunch of leftist idiots.

You people are the vessel on which human suffering continues to deliver.
You have a problem with reality, not with me.

You have already ruined Europe with psychopathic immigration laws, now you are on a mission to ruin the Middle East.

I'll keep going, showing you how you are supporting a colonial bunch of uneducated, barbaric, sadistic, terror farming group of people in a fight against a modern, democratic, native and peace seeking people and nation.

It's tough I know, truth can be like that sometimes.
Seems that for the lot of you, truth is hard all the time.
 


May anyone who championed this be cursed for eternity.

Anyone who gaslit, called ppl who challenged a racist be similarly cursed.

Silence is complicity and too many cunts are silent.

You fucking muppet.
 

This Gaza photographer stages Hamas propaganda

Even the Bild has managed to pick up on what you pathetic terror supporting lot have not.
Everything is fair game for these barbaric people.
Every ounce of western morality that you may think that all humans share or hold dear, for them is nothing, it's a tool to be used against you.
It's evil to the core, this is Gaza.
 