Data Collection
How Sentry Analytics collects and processes data from multiple sources.
📊 Data Sources
| Source | Data Type | Method | Update Frequency |
|---|---|---|---|
| Kununu | Employee reviews | Web scraping | On-demand |
| Google | Business reviews | Places API | On-demand |
| Reddit | Discussions | Reddit API | On-demand |
| Indeed | Job vacancies | Web scraping | On-demand |
🔧 Scraping Architecture
```
User Request → Smart Search → Scraping Queue → 4 Parallel Jobs → Database
                                   ↓
              ┌───────────┬────────┴───┬───────────┐
              ↓           ↓            ↓           ↓
           Kununu       Google       Reddit      Indeed
              ↓           ↓            ↓           ↓
              └───────────┴────────┬───┴───────────┘
                                   ↓
                        Sentiment Analysis (AI)
                                   ↓
                        Topic Extraction (ABSA)
                                   ↓
                        Trust Score Calculation
```
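As a rough sketch, the fan-out into four parallel jobs could be expressed with a thread pool as below; the `scrape_*` functions are placeholders for the scrapers described in the following sections, and the real queue may use a task runner rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Placeholder entry points -- stand-ins for the real scrapers described below.
def scrape_kununu(company): ...
def scrape_google(company): ...
def scrape_reddit(company): ...
def scrape_indeed(company): ...

SCRAPERS = {
    "kununu": scrape_kununu,
    "google": scrape_google,
    "reddit": scrape_reddit,
    "indeed": scrape_indeed,
}

def run_parallel_jobs(company: str) -> dict:
    """Fan one Smart Search request out into four parallel scraping jobs."""
    results = {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(fn, company): name for name, fn in SCRAPERS.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:  # one failed source must not sink the others
                results[name] = {"status": "failed", "error": str(exc)}
    return results
```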
1. Kununu Scraper
Purpose: Employee reviews from the DACH region (Germany, Austria, Switzerland)
Data Collected
- Review title and text
- Rating (1-5 stars)
- Pros and cons
- Job role and location
- Review date
Technology
- Selenium with undetected-chromedriver
- Anti-bot: delays, user-agent rotation
- Rate: ~100 reviews/minute
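A minimal sketch of this setup, assuming a hypothetical URL pattern and CSS selectors (Kununu's real markup differs and changes over time):

```python
import random
import time

import undetected_chromedriver as uc
from selenium.webdriver.common.by import By

def scrape_kununu_reviews(company_slug: str, max_pages: int = 5) -> list[dict]:
    """Collect employee reviews for one company (selectors are hypothetical)."""
    options = uc.ChromeOptions()
    options.add_argument("--headless=new")
    driver = uc.Chrome(options=options)
    reviews = []
    try:
        for page in range(1, max_pages + 1):
            # URL pattern is an assumption; adjust to the real review listing.
            driver.get(f"https://www.kununu.com/de/{company_slug}/kommentare/{page}")
            time.sleep(random.uniform(1, 5))  # respectful delay between pages
            for card in driver.find_elements(By.CSS_SELECTOR, "[data-testid='review-card']"):
                reviews.append({
                    "title": card.find_element(By.CSS_SELECTOR, "h3").text,
                    "text": card.text,
                    "source": "kununu",
                })
    finally:
        driver.quit()
    return reviews
```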
2. Google Reviews
Purpose: Customer/employer reviews from Google Maps
Data Collected
- Reviewer name
- Rating (1-5 stars)
- Review text
- Date
Technology
- Google Places API (official)
- Rate: 50 requests/day (free tier)
- Cached 24 hours
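A minimal sketch of the Place Details call with `requests`; the response fields follow the public Places API, while place-ID lookup and the 24-hour cache are omitted for brevity:

```python
import requests

PLACES_DETAILS_URL = "https://maps.googleapis.com/maps/api/place/details/json"

def fetch_google_reviews(place_id: str, api_key: str) -> list[dict]:
    """Return the reviews exposed by the Place Details endpoint (up to 5)."""
    resp = requests.get(
        PLACES_DETAILS_URL,
        params={
            "place_id": place_id,
            "fields": "name,rating,reviews",  # small field mask to conserve quota
            "key": api_key,
        },
        timeout=10,
    )
    resp.raise_for_status()
    result = resp.json().get("result", {})
    return [
        {
            "author": r.get("author_name"),
            "rating": r.get("rating"),
            "text": r.get("text"),
            "time": r.get("time"),  # Unix timestamp
        }
        for r in result.get("reviews", [])
    ]
```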
3. Reddit Scraper
Purpose: Company discussions and sentiment
Data Collected
- Post title and body
- Comments
- Upvotes/score
- Subreddit source
- Date
Subreddits
- r/jobs, r/careerguidance
- r/cscareerquestions
- r/germany (DACH companies)
Technology
- PRAW (Python Reddit API Wrapper)
- OAuth2 authentication
- Rate: 60 requests/minute
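A minimal sketch using PRAW in read-only mode; the credentials and user-agent string are placeholders, and PRAW itself throttles requests to stay within the API rate limit:

```python
import praw

def fetch_reddit_mentions(company: str, client_id: str, client_secret: str) -> list[dict]:
    """Search the monitored subreddits for posts mentioning the company."""
    reddit = praw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent="sentry-analytics/0.1 (read-only research)",
    )
    subreddits = "jobs+careerguidance+cscareerquestions+germany"
    posts = []
    for submission in reddit.subreddit(subreddits).search(company, limit=50, sort="new"):
        posts.append({
            "title": submission.title,
            "body": submission.selftext,
            "score": submission.score,
            "subreddit": submission.subreddit.display_name,
            "created_utc": submission.created_utc,
            "num_comments": submission.num_comments,
        })
    return posts
```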
4. Indeed Vacancies
Purpose: Track job openings for turnover analysis
Data Collected
- Job title
- Location
- Salary range
- Posted date
Technology
- Selenium scraping
- Rate: ~50 jobs/minute
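A minimal Selenium sketch along the same lines as the Kununu scraper; the query URL and selectors are purely illustrative, since Indeed's markup changes frequently:

```python
import random
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

def scrape_indeed_vacancies(company: str, country: str = "de") -> list[dict]:
    """Collect open vacancies for one company (URL and selectors are hypothetical)."""
    driver = webdriver.Chrome()
    jobs = []
    try:
        driver.get(f"https://{country}.indeed.com/jobs?q={company}")
        time.sleep(random.uniform(1, 5))  # respectful delay
        for card in driver.find_elements(By.CSS_SELECTOR, "div.job-card"):
            jobs.append({
                "title": card.find_element(By.CSS_SELECTOR, "h2.job-title").text,
                "location": card.find_element(By.CSS_SELECTOR, ".job-location").text,
                "source": "indeed",
            })
    finally:
        driver.quit()
    return jobs
```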
📦 Magic Search
A single one-click request creates four scraping jobs, one per source:
```
POST /api/scraping/magic-search
{
  "company_name": "BMW",
  "country": "de"
}
```

Response:

```json
{
  "status": "scraping_started",
  "jobs_started": 4,
  "message": "🚀 Analysis started!"
}
```
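The same request from Python, assuming the API is served locally:

```python
import requests

resp = requests.post(
    "http://localhost:8000/api/scraping/magic-search",  # base URL is an assumption
    json={"company_name": "BMW", "country": "de"},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # {"status": "scraping_started", "jobs_started": 4, ...}
```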
🔄 Job Management
Job Statuses
| Status | Meaning |
|---|---|
| pending | In queue |
| running | Currently scraping |
| completed | Finished |
| failed | Error occurred |
| cancelled | Manually stopped |
Monitor Jobs
```
GET /api/scraping/jobs/JOB_ID

# Response
{
  "status": "running",
  "reviews_collected": 156,
  "progress_percent": 78
}
```
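A small polling helper built on the endpoint above (base URL and poll interval are assumptions):

```python
import time

import requests

def wait_for_job(job_id: str, base_url: str = "http://localhost:8000", poll_every: int = 10) -> dict:
    """Poll a scraping job until it reaches a terminal status."""
    terminal = {"completed", "failed", "cancelled"}
    while True:
        job = requests.get(f"{base_url}/api/scraping/jobs/{job_id}", timeout=10).json()
        print(f"{job['status']}: {job.get('progress_percent', 0)}% "
              f"({job.get('reviews_collected', 0)} reviews)")
        if job["status"] in terminal:
            return job
        time.sleep(poll_every)
```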
⚡ Post-Processing
After scraping completes:
1. Deduplication - remove duplicate reviews across sources (sketched below)
2. Sentiment Analysis - AI categorization with Gemini 2.5
3. Topic Extraction - aspect-based sentiment analysis (ABSA)
4. Trust Score - recalculate the company's trust score
5. Views Refresh - update materialized views
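A minimal sketch of the deduplication step, assuming each review dict carries `source` and `text` fields (the field names are illustrative):

```python
import hashlib

def deduplicate(reviews: list[dict]) -> list[dict]:
    """Drop exact duplicates by hashing source plus normalized review text."""
    seen, unique = set(), []
    for review in reviews:
        key = hashlib.sha256(
            (review["source"] + review["text"].strip().lower()).encode("utf-8")
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(review)
    return unique
```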
🤖 Ethical Scraping
We implement responsible data collection:
- ✅ Respectful delays (1-5 seconds)
- ✅ User-agent rotation
- ✅ Rate limiting compliance
- ✅ Exponential backoff on errors
- ✅ robots.txt compliance (sketched below)
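Two small helpers illustrating the robots.txt check and the error backoff, using only the standard library (the user-agent string is a placeholder):

```python
import random
import time
from urllib import robotparser
from urllib.parse import urlparse

def allowed_by_robots(url: str, user_agent: str = "SentryAnalyticsBot") -> bool:
    """Check a site's robots.txt before fetching a page."""
    parsed = urlparse(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, url)

def fetch_with_backoff(fetch, url: str, retries: int = 3):
    """Retry a fetch with exponential backoff plus jitter on transient errors."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt + random.random())  # ~1s, ~2s, ~4s plus jitter
```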
Data collection is triggered on-demand via Smart Search or API.