Glassdoor Reviews Harvester

Harvests employer reviews from Glassdoor using a 3-query SerpAPI strategy.

API Endpoint

GET /api/glassdoor/{company_name}?country=de

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| company_name | string | required | Company name |
| country | string | "de" | Country code (de, us, uk, at, ch, fr) |

Response

{
  "status": "success",
  "data": {
    "company_name": "SAP",
    "country": "de",
    "source": "glassdoor",
    "total_count": 20,
    "positive_count": 14,
    "negative_count": 2,
    "sentiment_ratio": 0.7,
    "reviews": [
      {
        "id": "https://glassdoor.de/...",
        "title": "Great place to work",
        "text": "Pros: Great culture Cons: Slow career growth",
        "rating": 4.2,
        "positive_text": "Great culture",
        "negative_text": "Slow career growth",
        "employee_status": "Current",
        "date": "2026-01-15"
      }
    ]
  },
  "remaining_quota": 197
}
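
A minimal client sketch for this endpoint, using only the standard library. The base URL passed in (for example `http://localhost:8000`) is an assumption, since no host is documented here, and the helper names `build_url`, `fetch_reviews`, and `summarize` are hypothetical.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Country codes listed in the Parameters table.
SUPPORTED_COUNTRIES = {"de", "us", "uk", "at", "ch", "fr"}


def build_url(base_url: str, company_name: str, country: str = "de") -> str:
    """Build the GET URL; the company name is percent-encoded as a path segment."""
    if country not in SUPPORTED_COUNTRIES:
        raise ValueError(f"unsupported country code: {country!r}")
    path = f"/api/glassdoor/{quote(company_name, safe='')}"
    return f"{base_url.rstrip('/')}{path}?{urlencode({'country': country})}"


def fetch_reviews(base_url: str, company_name: str, country: str = "de") -> dict:
    """Call the endpoint and return the decoded JSON body."""
    with urlopen(build_url(base_url, company_name, country), timeout=30) as resp:
        return json.load(resp)


def summarize(payload: dict) -> str:
    """One-line summary built from the documented response fields."""
    data = payload["data"]
    return (
        f"{data['company_name']} ({data['country']}): "
        f"{data['positive_count']}/{data['total_count']} positive, "
        f"quota left: {payload['remaining_quota']}"
    )
```

Percent-encoding the company name keeps names with spaces or slashes from breaking the path; how the server itself decodes such names is not specified here.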

How It Works (v1 — SerpAPI Strategy)

Uses 3 targeted SerpAPI Google Search queries per company (costs 3 credits):

  1. Query 1: site:glassdoor.{domain} "{company}" "Pros" "Cons" reviews → structured Pros/Cons
  2. Query 2: site:glassdoor.{domain} "{company}" rating stars employee → ratings + Knowledge Panel
  3. Query 3: site:glassdoor.{domain} "{company}" "recommend" OR "CEO" review → recommendations
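
The three query templates can be sketched as plain string formatting. The function name `build_queries` is hypothetical, and the sketch assumes the caller passes the full Glassdoor domain (for example `glassdoor.de`); how the real harvester assembles its queries is not shown in this document.

```python
from typing import List


def build_queries(company: str, domain: str) -> List[str]:
    """Return the three search query strings for one company."""
    # Shared prefix: restrict results to the Glassdoor domain and quote the name.
    site = f'site:{domain} "{company}"'
    return [
        f'{site} "Pros" "Cons" reviews',        # Query 1: structured Pros/Cons
        f'{site} rating stars employee',         # Query 2: ratings + Knowledge Panel
        f'{site} "recommend" OR "CEO" review',  # Query 3: recommendations
    ]


if __name__ == "__main__":
    for query in build_queries("SAP", "glassdoor.de"):
        print(query)
```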

Each query extracts: rating, positive_text (Pros), negative_text (Cons), employee_status (Current/Former). Maximum results per company: 50 (configurable).
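
The extraction code itself is not reproduced here. As an illustration only, the Pros/Cons split could look like the sketch below, which assumes snippets follow the `Pros: ... Cons: ...` shape of the sample `text` field in the response above. The regular expression and the name `split_pros_cons` are hypothetical.

```python
import re
from typing import Optional, Tuple

# Assumed snippet shape: "Pros: <text> Cons: <text>", as in the sample response.
_PROS_CONS = re.compile(r"Pros:\s*(?P<pros>.*?)\s*Cons:\s*(?P<cons>.*)", re.DOTALL)


def split_pros_cons(snippet: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (positive_text, negative_text), or (None, None) if no match."""
    match = _PROS_CONS.search(snippet)
    if match is None:
        return None, None
    return match.group("pros").strip(), match.group("cons").strip()
```

Applied to the sample snippet, this yields `Great culture` and `Slow career growth`, matching the `positive_text` and `negative_text` fields shown in the response.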

Supported Domains

| Country | Domain |
| --- | --- |
| Germany | glassdoor.de |
| USA | glassdoor.com |
| UK | glassdoor.co.uk |
| Austria | glassdoor.at |
| Switzerland | glassdoor.ch |
| France | glassdoor.fr |
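
The mapping from the `country` parameter to a domain can be expressed as a lookup table. This is a sketch: the names `GLASSDOOR_DOMAINS` and `domain_for` are hypothetical, and raising `ValueError` for an unsupported code is an assumption, since the service's actual behavior in that case is not documented here.

```python
# Country code -> Glassdoor domain, taken from the table above.
GLASSDOOR_DOMAINS = {
    "de": "glassdoor.de",
    "us": "glassdoor.com",
    "uk": "glassdoor.co.uk",
    "at": "glassdoor.at",
    "ch": "glassdoor.ch",
    "fr": "glassdoor.fr",
}


def domain_for(country: str = "de") -> str:
    """Resolve a country code to its Glassdoor domain (case-insensitive)."""
    try:
        return GLASSDOOR_DOMAINS[country.lower()]
    except KeyError:
        # Assumption: unsupported codes are rejected rather than defaulted.
        raise ValueError(f"unsupported country code: {country!r}") from None
```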

Cost

  • 3 SerpAPI credits per company (3-query strategy)
  • Monthly limit: 200 SerpAPI requests (1 credit each) → 66 companies/month (200 ÷ 3, rounded down)
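
The budget arithmetic can be checked against the `remaining_quota` field of the response. A small sketch, with the hypothetical helper name `companies_affordable`:

```python
CREDITS_PER_COMPANY = 3  # one credit per query, three queries per company
MONTHLY_LIMIT = 200      # monthly request limit stated above


def companies_affordable(remaining_quota: int,
                         credits_per_company: int = CREDITS_PER_COMPANY) -> int:
    """How many more companies fit into the remaining quota (rounded down)."""
    return max(remaining_quota, 0) // credits_per_company


if __name__ == "__main__":
    print(companies_affordable(MONTHLY_LIMIT))  # full month
    print(companies_affordable(197))            # after one company, as in the sample response
```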

Storage

Reviews fetched by the scraping job are saved to the reviews table (source='glassdoor', trust_weight=0.20). Each review gets a deterministic review_id (SHA-256 hash) to prevent duplicates on re-scrape.

Review ID Stability Contract (Updated 2026-03-03)

When Glassdoor source IDs are absent, a deterministic fallback ID is used:

review_id = f"glassdoor-{hashlib.sha256(f'{company_slug}-{text}'.encode()).hexdigest()[:16]}"

This replaces non-deterministic Python hash() behavior and ensures stable, cross-run duplicate detection with company-level namespacing.
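
For reference, the same expression wrapped in a function, with its import. The function name `fallback_review_id` is hypothetical. The point of the sketch is the contrast with the built-in `hash()`, whose result for strings is salted per interpreter process unless `PYTHONHASHSEED` is fixed.

```python
import hashlib


def fallback_review_id(company_slug: str, text: str) -> str:
    """Deterministic fallback ID: 'glassdoor-' plus 16 hex chars of SHA-256."""
    digest = hashlib.sha256(f"{company_slug}-{text}".encode()).hexdigest()
    return f"glassdoor-{digest[:16]}"


if __name__ == "__main__":
    a = fallback_review_id("sap", "Pros: Great culture Cons: Slow career growth")
    b = fallback_review_id("sap", "Pros: Great culture Cons: Slow career growth")
    c = fallback_review_id("siemens", "Pros: Great culture Cons: Slow career growth")
    print(a == b)  # same company and text give the same ID on every run
    print(a == c)  # the company slug namespaces otherwise identical text
```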

After the first scrape:

  • Glassdoor reviews appear in the UI alongside Kununu/Google
  • Trust Score reads from the database, so individual requests no longer consume SerpAPI quota
  • Re-scraping adds only new snippets (ON CONFLICT DO NOTHING)
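
The re-scrape behavior can be illustrated with an in-memory SQLite database standing in for the real one. The database engine, the `text` column, and the helper names `open_demo_db` and `save_reviews` are assumptions. Only the `reviews` table name, `review_id`, `source`, `trust_weight`, and the `ON CONFLICT DO NOTHING` clause come from this document. SQLite supports this clause from version 3.24.

```python
import sqlite3


def open_demo_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for the reviews table (schema is assumed)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE reviews ("
        "review_id TEXT PRIMARY KEY, source TEXT, trust_weight REAL, text TEXT)"
    )
    return conn


def save_reviews(conn: sqlite3.Connection, reviews) -> int:
    """Insert (review_id, text) pairs, skipping existing IDs; return rows added."""
    before = conn.total_changes
    conn.executemany(
        "INSERT INTO reviews (review_id, source, trust_weight, text) "
        "VALUES (?, 'glassdoor', 0.20, ?) "
        "ON CONFLICT (review_id) DO NOTHING",
        reviews,
    )
    conn.commit()
    return conn.total_changes - before


if __name__ == "__main__":
    db = open_demo_db()
    print(save_reviews(db, [("glassdoor-aaa", "first"), ("glassdoor-bbb", "second")]))
    # Re-scrape: one known ID, one new ID; only the new row is stored.
    print(save_reviews(db, [("glassdoor-aaa", "first"), ("glassdoor-ccc", "third")]))
```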

🔮 Roadmap: Glassdoor v2 (Selenium-based)

Planned for Sprints 9-10. Requires a Glassdoor account and session cookies.

The current SerpAPI approach extracts search-result snippets only. The planned glassdoor_spider.py (already implemented, not yet deployed) uses Selenium with session cookies to scrape:

  • Full review text (not snippets)
  • Sub-ratings: Work-Life Balance, Culture, Management, Career Growth
  • CEO Approval % (exact number)
  • "Would Recommend to Friend" % (exact number)
  • Unlimited pagination (all pages)

Prerequisites: Glassdoor session cookies in .credentials/glassdoor_cookies.json. The cookies expire after roughly 30 days, so the file needs a periodic manual refresh.
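
Because the refresh is manual, a staleness check before each run may help. The sketch below uses the file's modification time as a proxy for cookie age. This is an assumption: the internal format of the cookie file is not documented here, and the helper name `cookies_need_refresh` is hypothetical.

```python
import time
from pathlib import Path

COOKIE_PATH = Path(".credentials/glassdoor_cookies.json")
MAX_AGE_DAYS = 30  # approximate cookie lifetime stated above


def cookies_need_refresh(path=COOKIE_PATH, max_age_days=MAX_AGE_DAYS, now=None) -> bool:
    """True if the cookie file is missing or older than max_age_days (by mtime)."""
    path = Path(path)
    if not path.exists():
        return True
    now = time.time() if now is None else now
    age_days = (now - path.stat().st_mtime) / 86400
    return age_days >= max_age_days


if __name__ == "__main__":
    if cookies_need_refresh():
        print("Glassdoor cookies are missing or stale; refresh them manually.")
```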