Data Collection

How Sentry Analytics collects and processes data from multiple sources.

📊 Data Sources

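| Source | Data Type        | Method       | Update Frequency |
|--------|------------------|--------------|------------------|
| Kununu | Employee reviews | Web scraping | On-demand        |
| Google | Business reviews | Places API   | On-demand        |
| Reddit | Discussions      | Reddit API   | On-demand        |
| Indeed | Job vacancies    | Web scraping | On-demand        |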

🔧 Scraping Architecture

User Request → Smart Search → Scraping Queue → 4 Parallel Jobs → Database
                                       ┌──────────────┴──────────────┐
                                       ↓         ↓         ↓         ↓
                                    Kununu    Google    Reddit    Indeed
                                       ↓         ↓         ↓         ↓
                                       └──────────────┬──────────────┘
                                                      ↓
                                           Sentiment Analysis (AI)
                                                      ↓
                                           Topic Extraction (ABSA)
                                                      ↓
                                           Trust Score Calculation
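
The fan-out step in the diagram can be illustrated with a small sketch. This is not the actual Sentry Analytics queue: the four `scrape_*` functions are hypothetical stand-ins, and a thread pool is only one possible way to run the jobs in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the four real scrapers (assumption, not the real API).
def scrape_kununu(company):
    return [{"source": "kununu", "company": company}]

def scrape_google(company):
    return [{"source": "google", "company": company}]

def scrape_reddit(company):
    return [{"source": "reddit", "company": company}]

def scrape_indeed(company):
    return [{"source": "indeed", "company": company}]

SCRAPERS = {
    "kununu": scrape_kununu,
    "google": scrape_google,
    "reddit": scrape_reddit,
    "indeed": scrape_indeed,
}

def run_parallel_jobs(company):
    """Run one job per source concurrently and collect results by source name."""
    results = {}
    with ThreadPoolExecutor(max_workers=len(SCRAPERS)) as pool:
        futures = {name: pool.submit(fn, company) for name, fn in SCRAPERS.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception:
                # One failing source should not block the other three.
                results[name] = []
    return results
```

Calling `run_parallel_jobs("BMW")` returns a dictionary with one list of records per source, which a later stage could pass on to sentiment analysis.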

1. Kununu Scraper

Purpose: Employee reviews from the DACH region (Germany, Austria, Switzerland)

Data Collected

  • Review title and text
  • Rating (1-5 stars)
  • Pros and cons
  • Job role and location
  • Review date

Technology

  • Selenium with undetected-chromedriver
  • Anti-bot: delays, user-agent rotation
  • Rate: ~100 reviews/minute
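
As a rough illustration of the delay and user-agent rotation idea, the sketch below pauses for a random interval and then hands out the next user agent in a fixed rotation. The class name `PoliteSession`, its parameters, and the 1-5 second default are assumptions for this example; how the chosen user agent is applied to the Selenium driver is left out.

```python
import itertools
import random
import time

class PoliteSession:
    """Hypothetical helper: random pause plus round-robin user-agent rotation."""

    def __init__(self, user_agents, min_delay=1.0, max_delay=5.0,
                 sleep=time.sleep, rng=None):
        self._agents = itertools.cycle(user_agents)
        self._min_delay = min_delay
        self._max_delay = max_delay
        self._sleep = sleep                   # injectable so tests need not wait
        self._rng = rng or random.Random()

    def next_user_agent(self):
        # Wait a random interval before each page load, then rotate the agent.
        self._sleep(self._rng.uniform(self._min_delay, self._max_delay))
        return next(self._agents)
```

A caller would request `next_user_agent()` before each page load, so the pause and the rotation always happen together.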

2. Google Reviews

Purpose: Customer/employer reviews from Google Maps

Data Collected

  • Reviewer name
  • Rating (1-5 stars)
  • Review text
  • Date

Technology

  • Google Places API (official)
  • Rate: 50 requests/day (free tier)
  • Cached 24 hours
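
With a small daily request quota, the 24-hour cache does most of the work. A minimal in-memory sketch of such a cache follows; the real storage backend is not described here, and `TTLCache` and `get_or_fetch` are illustrative names.

```python
import time

class TTLCache:
    """Hypothetical time-to-live cache; entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds=24 * 60 * 60, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable clock for testing
        self._store = {}             # key -> (stored_at, value)

    def get_or_fetch(self, key, fetch):
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]          # still fresh: no API request is spent
        value = fetch(key)           # expired or missing: call the API once
        self._store[key] = (now, value)
        return value
```

Repeated lookups for the same company within 24 hours then cost one API request instead of one per lookup.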

3. Reddit Scraper

Purpose: Company discussions and sentiment

Data Collected

  • Post title and body
  • Comments
  • Upvotes/score
  • Subreddit source
  • Date

Subreddits

  • r/jobs, r/careerguidance
  • r/cscareerquestions
  • r/germany (DACH companies)

Technology

  • PRAW (Python Reddit API Wrapper)
  • OAuth2 authentication
  • Rate: 60 requests/minute
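
One way to stay within a budget of 60 requests per minute is a sliding-window limiter. The sketch below is a generic illustration, not the limiter used by the scraper or by PRAW; `SlidingWindowLimiter` and its parameters are assumptions.

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Hypothetical limiter: at most max_calls within any period-second window."""

    def __init__(self, max_calls=60, period=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self._max_calls = max_calls
        self._period = period
        self._clock = clock
        self._sleep = sleep
        self._calls = deque()        # timestamps of recent calls, oldest first

    def _drop_expired(self, now):
        while self._calls and now - self._calls[0] >= self._period:
            self._calls.popleft()

    def acquire(self):
        """Block until another call is allowed, then record it."""
        now = self._clock()
        self._drop_expired(now)
        if len(self._calls) >= self._max_calls:
            # Wait until the oldest recorded call leaves the window.
            self._sleep(self._period - (now - self._calls[0]))
            now = self._clock()
            self._drop_expired(now)
        self._calls.append(now)
```

Calling `acquire()` before each request keeps the request count within the window, at the cost of blocking the caller when the budget is used up.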

4. Indeed Vacancies

Purpose: Track job openings for turnover analysis

Data Collected

  • Job title
  • Location
  • Salary range
  • Posted date

Technology

  • Selenium scraping
  • Rate: ~50 jobs/minute

A single magic-search request starts a one-click analysis and creates 4 scraping jobs, one per source:

POST /api/scraping/magic-search
{
"company_name": "BMW",
"country": "de"
}

Response:

{
"status": "scraping_started",
"jobs_started": 4,
"message": "🚀 Analysis started!"
}
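
A client call to this endpoint might look like the sketch below, using only the Python standard library. The base URL is a placeholder, the helper names are illustrative, and authentication headers are omitted because they are not described here.

```python
import json
import urllib.request

def build_magic_search_request(base_url, company_name, country):
    """Build (but do not send) the POST request for the magic-search endpoint."""
    payload = json.dumps(
        {"company_name": company_name, "country": country}
    ).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/scraping/magic-search",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def start_analysis(base_url, company_name, country, timeout=30):
    """Send the request and return the decoded JSON response."""
    request = build_magic_search_request(base_url, company_name, country)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

For example, `start_analysis("http://localhost:8000", "BMW", "de")` would send the request shown above, assuming the API is served at that placeholder address.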

🔄 Job Management

Job Statuses

| Status    | Meaning            |
|-----------|--------------------|
| pending   | In queue           |
| running   | Currently scraping |
| completed | Finished           |
| failed    | Error occurred     |
| cancelled | Manually stopped   |

Monitor Jobs

GET /api/scraping/jobs/JOB_ID

Response:
{
"status": "running",
"reviews_collected": 156,
"progress_percent": 78
}
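
A client can poll this endpoint until the job reaches one of the terminal statuses (completed, failed, cancelled). The sketch below assumes a caller-supplied `fetch_status` function that performs the GET request and returns the decoded JSON; the polling interval and limit are arbitrary illustrative values.

```python
import time

# Statuses after which a job will not change any more.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}

def wait_for_job(fetch_status, job_id, interval=2.0, max_polls=300,
                 sleep=time.sleep):
    """Poll a job until it finishes; fetch_status(job_id) must return a dict."""
    for _ in range(max_polls):
        job = fetch_status(job_id)
        if job["status"] in TERMINAL_STATUSES:
            return job
        sleep(interval)              # pending or running: wait and retry
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

Keeping the HTTP call outside `wait_for_job` makes the loop easy to test and leaves the choice of HTTP client to the caller.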

⚡ Post-Processing

After scraping completes:

  1. Deduplication - Remove duplicates
  2. Sentiment Analysis - AI categorization (Gemini 2.5)
  3. Topic Extraction - ABSA for aspects
  4. Trust Score - Recalculate
  5. Views Refresh - Update materialized views
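
The deduplication step could be approximated as follows. This sketch assumes each review is a dictionary with `source` and `text` keys and treats two reviews as duplicates when they come from the same source and their text matches after normalizing whitespace and case; the actual matching rule may differ.

```python
import hashlib
import re

def review_fingerprint(review):
    """Hash the source plus the whitespace- and case-normalized review text."""
    text = re.sub(r"\s+", " ", review.get("text", "")).strip().lower()
    key = f"{review.get('source', '')}|{text}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

def deduplicate(reviews):
    """Keep the first occurrence of each fingerprint, preserving order."""
    seen = set()
    unique = []
    for review in reviews:
        fingerprint = review_fingerprint(review)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(review)
    return unique
```

Including the source in the fingerprint keeps identical text from two different platforms as separate records.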

🤖 Ethical Scraping

We implement responsible data collection:

  • ✅ Respectful delays (1-5 seconds)
  • ✅ User-agent rotation
  • ✅ Rate limiting compliance
  • ✅ Error backoff
  • ✅ Respect for robots.txt
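
Error backoff is commonly implemented as exponential backoff with jitter. The sketch below shows that general technique; it is an assumption about how the backoff might work, and the attempt count and delay bounds are illustrative defaults.

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=1.0,
                       max_delay=60.0, sleep=time.sleep, rng=random.random):
    """Call operation(); on failure wait up to base_delay * 2**attempt and retry."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise                # out of attempts: surface the last error
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: sleep a random fraction of the current ceiling so
            # that many clients do not retry at the same moment.
            sleep(ceiling * rng())
```

Doubling the ceiling after each failure gives a struggling site progressively more room to recover before the next attempt.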

Data collection is triggered on-demand via Smart Search or API.