# Date-Filtered Scraping

Status: Completed
Completed: 2025-12-20
Location: `vartovii/scrapers/kununu_spider.py`
## Summary

Implemented date-based filtering for the Kununu spider so that it only collects reviews from the last N months (configurable; default: 24).
## Features

- Configurable `months_back` parameter - the API accepts `months_back` (default: 24)
- Date cutoff filtering - reviews older than the cutoff date are skipped
- Early stopping - the crawl stops after 20 consecutive old reviews (reviews arrive newest-first, so once that many old reviews appear in a row, the remainder are older still)
- Incremental scraping - reviews already stored in the database are skipped (see the sketch after this list)
- Real-time progress tracking - the `reviews_collected` counter is updated per review
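The incremental-scraping check is not shown in the snippets below. Here is a minimal sketch of the idea, assuming a psycopg2 connection configured via environment variables and a `kununu_reviews` table with a `review_uuid` column (both names are hypothetical, not the actual schema):

```python
# Sketch only: load already-stored review IDs once per crawl, then skip any
# review whose ID is already in the set. Table and column names
# (kununu_reviews, review_uuid, company_slug) are illustrative.
import os
import psycopg2


def load_scraped_review_ids(company_slug: str) -> set[str]:
    """Return the set of review IDs already stored for this company."""
    conn = psycopg2.connect(
        host=os.environ.get("DB_HOST", "127.0.0.1"),  # Cloud SQL Proxy default
        dbname=os.environ["DB_NAME"],
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASSWORD"],
    )
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT review_uuid FROM kununu_reviews WHERE company_slug = %s",
                (company_slug,),
            )
            return {row[0] for row in cur.fetchall()}
    finally:
        conn.close()


# In the spider, a review would then be skipped if its ID is already known:
# if review["uuid"] in self.scraped_review_ids:
#     continue
```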
## Implementation Details

### Spider Changes
```python
# vartovii/scrapers/kununu_spider.py
from datetime import datetime, timedelta

import scrapy.exceptions

# Cutoff date derived from the months_back spider argument (approx. 30 days per month)
self.months_back = int(getattr(self, 'months_back', 24))
self.cutoff_date = datetime.now() - timedelta(days=self.months_back * 30)

# Early stop after 20 consecutive old reviews
if self.consecutive_old_reviews >= self.old_reviews_stop_threshold:
    raise scrapy.exceptions.CloseSpider("Date filtering complete")
```
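For context, a simplified sketch of how the cutoff check and the consecutive-old-review counter interact per review. The `created_at` field name, the dict-shaped review, and the method name are assumptions; `cutoff_date`, `consecutive_old_reviews`, and `old_reviews_stop_threshold` come from the snippet above:

```python
# Simplified per-review flow; the "created_at" field and dict-shaped review
# are assumptions, not the real Kununu payload.
from datetime import datetime
from scrapy.exceptions import CloseSpider


class KununuSpiderSketch:
    def handle_review(self, review: dict):
        created_at = datetime.fromisoformat(review["created_at"])
        if created_at < self.cutoff_date:
            # Older than the cutoff: count it, and stop the crawl once the
            # threshold of consecutive old reviews is reached.
            self.consecutive_old_reviews += 1
            if self.consecutive_old_reviews >= self.old_reviews_stop_threshold:
                raise CloseSpider("Date filtering complete")
            return None
        # A recent review resets the counter; only *consecutive* old reviews
        # indicate the listing has moved past the cutoff chronologically.
        self.consecutive_old_reviews = 0
        return review
```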
### API Changes
```python
# backend/scraping_api.py
from pydantic import BaseModel


class MagicSearchRequest(BaseModel):
    months_back: int = 24  # New field
```
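How `months_back` travels from the API request to the spider is not shown here. One plausible wiring, sketched under the assumption that the crawl is launched as a subprocess and that the spider is named `kununu` (both assumptions) - the `-a` spider argument is what `getattr(self, 'months_back', 24)` reads:

```python
# Hypothetical wiring from the API request to the spider; the endpoint path,
# spider name, and subprocess launch are assumptions, not the actual code in
# backend/scraping_api.py.
import subprocess

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class MagicSearchRequest(BaseModel):
    months_back: int = 24  # New field


@app.post("/scrape")  # illustrative path
def start_scrape(request: MagicSearchRequest):
    # Forward months_back as a Scrapy spider argument (-a)
    subprocess.Popen(
        ["scrapy", "crawl", "kununu", "-a", f"months_back={request.months_back}"]
    )
    return {"status": "started", "months_back": request.months_back}
```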
### Pipeline Changes
```sql
-- vartovii/pipelines/pipelines.py
-- After each review commit, update the counter:
UPDATE scraping_jobs SET reviews_collected = reviews_collected + 1 WHERE job_id = %s
```
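A minimal sketch of where that statement runs in the item pipeline, assuming the pipeline holds a psycopg2 connection as `self.conn` and the job ID is exposed as `spider.job_id` (both attribute names are assumptions):

```python
# Sketch of the counter update inside the item pipeline; self.conn and
# spider.job_id are assumed names, not confirmed by the source.
class ReviewPipeline:
    def process_item(self, item, spider):
        with self.conn.cursor() as cur:
            # ... INSERT of the review happens here ...
            cur.execute(
                "UPDATE scraping_jobs "
                "SET reviews_collected = reviews_collected + 1 "
                "WHERE job_id = %s",
                (spider.job_id,),
            )
        self.conn.commit()
        return item
```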
## Fixes Applied (Session 2025-12-20)

- ✅ Browser-like headers added to bypass 403 responses (see the settings sketch after this list)
- ✅ DB connectivity standardized (Cloud SQL Proxy + env-based DB config)
- ✅ Pipeline `company_name` normalization
- ✅ Company validation - requests for companies not found on Kununu are rejected
- ✅ Cancel-job endpoint fixed
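The browser-like headers from the first fix would typically live in the Scrapy settings via `DEFAULT_REQUEST_HEADERS`. A plausible configuration sketch - the exact header values used in `vartovii/config/settings.py` are an assumption:

```python
# vartovii/config/settings.py (illustrative values; the exact headers used
# to bypass the 403 are an assumption)
DEFAULT_REQUEST_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    "Accept": "application/json, text/plain, */*",
    "Accept-Language": "de-DE,de;q=0.9,en;q=0.8",
    "Referer": "https://www.kununu.com/",
}
```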
## Test Results

| Company | Reviews Collected | Time |
|---|---|---|
| Audi | 289+ | ~2 min |
| Bosch | 100+ | ~3 min |
## Related Files

- `vartovii/scrapers/kununu_spider.py`
- `backend/scraping_api.py`
- `vartovii/pipelines/pipelines.py`
- `vartovii/config/settings.py`