Glassdoor Reviews Harvester
Harvests employer reviews from Glassdoor using a 3-query SerpAPI strategy.
API Endpoint
GET /api/glassdoor/{company_name}?country=de
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| company_name | string | required | Company name |
| country | string | "de" | Country code (de, us, uk, at, ch, fr) |
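As a rough illustration of how a client might assemble the request URL from the two parameters above, here is a minimal sketch. The host in `BASE_URL` and the helper name `glassdoor_url` are assumptions for illustration only; the README does not say where the service is deployed.

```python
from urllib.parse import quote, urlencode

# Hypothetical host; the README does not state where the service runs.
BASE_URL = "http://localhost:8000"


def glassdoor_url(company_name: str, country: str = "de") -> str:
    """Build the URL for GET /api/glassdoor/{company_name}?country=..."""
    # Percent-encode the path segment so names with spaces or slashes stay intact.
    path = f"/api/glassdoor/{quote(company_name, safe='')}"
    return f"{BASE_URL}{path}?{urlencode({'country': country})}"


print(glassdoor_url("SAP"))
print(glassdoor_url("Deutsche Bank", country="us"))
```

Encoding the company name matters because many employer names contain spaces or ampersands that would otherwise break the path.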
Response
{
  "status": "success",
  "data": {
    "company_name": "SAP",
    "country": "de",
    "source": "glassdoor",
    "total_count": 20,
    "positive_count": 14,
    "negative_count": 2,
    "sentiment_ratio": 0.7,
    "reviews": [
      {
        "id": "https://glassdoor.de/...",
        "title": "Great place to work",
        "text": "Pros: Great culture Cons: Slow career growth",
        "rating": 4.2,
        "positive_text": "Great culture",
        "negative_text": "Slow career growth",
        "employee_status": "Current",
        "date": "2026-01-15"
      }
    ]
  },
  "remaining_quota": 197
}
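A client might read the aggregate fields like this. The sketch assumes that `sentiment_ratio` is `positive_count / total_count`, which matches the sample values (14 / 20 = 0.7) but is not stated in the README. The helper name `summarize` is hypothetical.

```python
import json

# Trimmed copy of the sample response (reviews omitted for brevity).
payload = json.loads("""
{
  "status": "success",
  "data": {
    "company_name": "SAP",
    "total_count": 20,
    "positive_count": 14,
    "negative_count": 2,
    "sentiment_ratio": 0.7,
    "reviews": []
  },
  "remaining_quota": 197
}
""")


def summarize(payload: dict) -> dict:
    """Recompute the ratio and count reviews that are neither positive nor negative."""
    data = payload["data"]
    total = data["total_count"]
    # Assumption: ratio = positive / total; guard against an empty result set.
    ratio = data["positive_count"] / total if total else 0.0
    other = total - data["positive_count"] - data["negative_count"]
    return {"ratio": round(ratio, 2), "neutral_or_unrated": other}


summary = summarize(payload)
print(summary)
```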
How It Works (v1 — SerpAPI Strategy)
Each company lookup runs 3 targeted SerpAPI Google Search queries (3 credits in total):
- Query 1: site:glassdoor.{domain} "{company}" "Pros" "Cons" reviews → structured Pros/Cons
- Query 2: site:glassdoor.{domain} "{company}" rating stars employee → ratings + Knowledge Panel
- Query 3: site:glassdoor.{domain} "{company}" "recommend" OR "CEO" review → recommendations
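The three query templates can be filled in with plain string formatting. This is a minimal sketch, not the project's actual code: the name `build_queries` is hypothetical, and it assumes `{domain}` stands for the suffix after `glassdoor.` (for example `de` or `co.uk`).

```python
from typing import List

# Templates copied from the three queries listed above.
QUERY_TEMPLATES = (
    'site:glassdoor.{domain} "{company}" "Pros" "Cons" reviews',
    'site:glassdoor.{domain} "{company}" rating stars employee',
    'site:glassdoor.{domain} "{company}" "recommend" OR "CEO" review',
)


def build_queries(company: str, domain: str) -> List[str]:
    """Return the three search strings for one company (one SerpAPI credit each)."""
    return [t.format(domain=domain, company=company) for t in QUERY_TEMPLATES]


for query in build_queries("SAP", "de"):
    print(query)
```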
Each query extracts: rating, positive_text (Pros), negative_text (Cons),
employee_status (Current/Former). Maximum results per company: 50
(configurable).
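One plausible way to split a snippet into `positive_text` and `negative_text` is a regular expression keyed on the "Pros:" and "Cons:" labels, as in the sample response. This is an illustrative sketch only; the harvester's real extraction rules are not shown in this README, and `split_pros_cons` is a hypothetical name.

```python
import re
from typing import Optional, Tuple

# Capture the text after "Pros:" up to "Cons:", then everything after "Cons:".
PROS_CONS = re.compile(
    r"Pros:\s*(?P<pros>.*?)\s*Cons:\s*(?P<cons>.*)",
    re.IGNORECASE | re.DOTALL,
)


def split_pros_cons(text: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (positive_text, negative_text), or (None, None) if the labels are missing."""
    match = PROS_CONS.search(text)
    if match is None:
        return None, None
    return match.group("pros").strip(), match.group("cons").strip()


print(split_pros_cons("Pros: Great culture Cons: Slow career growth"))
```

Snippets without both labels return `(None, None)` here; how the real harvester treats them is not documented.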
Supported Domains
| Country | Domain |
|---|---|
| Germany | glassdoor.de |
| USA | glassdoor.com |
| UK | glassdoor.co.uk |
| Austria | glassdoor.at |
| Switzerland | glassdoor.ch |
| France | glassdoor.fr |
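The table maps directly onto a lookup keyed by the `country` parameter. The sketch below assumes unsupported codes should raise an error; the service's actual behavior for unknown codes is not documented here, and `domain_for` is a hypothetical helper.

```python
# Country code -> Glassdoor domain, taken from the table above.
GLASSDOOR_DOMAINS = {
    "de": "glassdoor.de",
    "us": "glassdoor.com",
    "uk": "glassdoor.co.uk",
    "at": "glassdoor.at",
    "ch": "glassdoor.ch",
    "fr": "glassdoor.fr",
}


def domain_for(country: str = "de") -> str:
    """Resolve a country code (case-insensitive) to its Glassdoor domain."""
    try:
        return GLASSDOOR_DOMAINS[country.lower()]
    except KeyError:
        # Assumption: reject unknown codes rather than silently falling back to "de".
        raise ValueError(f"Unsupported country code: {country!r}") from None


print(domain_for("uk"))
```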
Cost
- 3 SerpAPI credits per company (3-query strategy)
- Monthly limit: 200 SerpAPI requests → ~66 companies/month (200 ÷ 3, rounded down)
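The budget arithmetic is simple integer division, and the same numbers can gate a scrape on the `remaining_quota` field. The function names below are hypothetical and only illustrate the calculation.

```python
MONTHLY_LIMIT = 200        # SerpAPI requests per month (from this README)
CREDITS_PER_COMPANY = 3    # one credit per query in the 3-query strategy


def companies_per_month(limit: int = MONTHLY_LIMIT,
                        per_company: int = CREDITS_PER_COMPANY) -> int:
    """Whole companies that fit in the monthly budget."""
    return limit // per_company


def can_scrape(remaining_quota: int,
               per_company: int = CREDITS_PER_COMPANY) -> bool:
    """True if enough credits remain for one full 3-query lookup."""
    return remaining_quota >= per_company


print(companies_per_month())           # 200 // 3
print(MONTHLY_LIMIT % CREDITS_PER_COMPANY)  # credits left over
print(can_scrape(197))
```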
Storage
Reviews fetched by the scraping job are saved to the reviews table
(source='glassdoor', trust_weight=0.20). Each review gets a deterministic
review_id (SHA-256 hash) to prevent duplicates on re-scrape.
Review ID Stability Contract (Updated 2026-03-03)
When Glassdoor source IDs are absent, a deterministic fallback ID is used:
review_id = f"glassdoor-{hashlib.sha256(f'{company_slug}-{text}'.encode()).hexdigest()[:16]}"
This replaces Python's built-in hash(), whose string hashes are randomized per
process, and ensures stable, cross-run duplicate detection with company-level
namespacing.
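Wrapped in a function, the fallback formula above can be checked for the two properties the contract relies on: the same input always yields the same ID, and the same text under a different company yields a different one. The function name `fallback_review_id` is an assumption; the formula itself is the one quoted above.

```python
import hashlib


def fallback_review_id(company_slug: str, text: str) -> str:
    """Deterministic fallback ID: 'glassdoor-' plus the first 16 hex chars of SHA-256."""
    digest = hashlib.sha256(f"{company_slug}-{text}".encode()).hexdigest()
    return f"glassdoor-{digest[:16]}"


a = fallback_review_id("sap", "Pros: Great culture Cons: Slow career growth")
b = fallback_review_id("sap", "Pros: Great culture Cons: Slow career growth")
c = fallback_review_id("siemens", "Pros: Great culture Cons: Slow career growth")

print(a == b)  # same company and text give the same ID on every run
print(a == c)  # a different company slug changes the ID
```

Truncating to 16 hex characters keeps 64 bits of the digest, which is a trade-off between ID length and collision risk.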
After the first scrape:
- Glassdoor reviews appear in the UI alongside Kununu/Google
- Trust Score reads from DB (no SerpAPI quota on every request)
- Re-scraping adds only new snippets (ON CONFLICT DO NOTHING)
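The re-scrape behavior can be sketched with an in-memory SQLite database standing in for the real one. The README does not name the database engine or the full schema, so the table layout and the `save_reviews` helper here are assumptions; only `review_id`, `source`, and `trust_weight` come from the text above. The `ON CONFLICT ... DO NOTHING` clause needs SQLite 3.24 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed minimal schema; the real reviews table likely has more columns.
conn.execute("""
    CREATE TABLE reviews (
        review_id    TEXT PRIMARY KEY,
        source       TEXT NOT NULL,
        trust_weight REAL NOT NULL,
        text         TEXT NOT NULL
    )
""")


def save_reviews(conn, rows):
    """Insert (review_id, text) rows, skipping existing IDs; return how many were new."""
    def count():
        return conn.execute("SELECT COUNT(*) FROM reviews").fetchone()[0]

    before = count()
    conn.executemany(
        "INSERT INTO reviews (review_id, source, trust_weight, text) "
        "VALUES (?, 'glassdoor', 0.20, ?) "
        "ON CONFLICT (review_id) DO NOTHING",
        rows,
    )
    conn.commit()
    return count() - before


first = save_reviews(conn, [("glassdoor-aaa", "Great culture"),
                            ("glassdoor-bbb", "Slow career growth")])
# Second scrape: one known snippet and one new snippet.
second = save_reviews(conn, [("glassdoor-aaa", "Great culture"),
                             ("glassdoor-ccc", "Good pay")])
print(first, second)
```

Because the ID is deterministic, the duplicate row is silently skipped on the second run and only the new snippet is stored.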
🔮 Roadmap: Glassdoor v2 (Selenium-based)
Planned for Sprint 9-10. Requires a Glassdoor account and session cookies.
The current SerpAPI approach extracts search-result snippets only. The planned
glassdoor_spider.py (already implemented, not yet deployed) uses Selenium with
session cookies to scrape:
- Full review text (not snippets)
- Sub-ratings: Work-Life Balance, Culture, Management, Career Growth
- CEO Approval % (exact number)
- "Would Recommend to Friend" % (exact number)
- Unlimited pagination (all pages)
Prerequisites: Glassdoor session cookies in
.credentials/glassdoor_cookies.json. These require periodic manual refresh,
because the cookies expire after ~30 days.