Date-Filtered Scraping

Status: Completed
Completed: 2025-12-20
Location: vartovii/scrapers/kununu_spider.py


Summary

Implemented date-based filtering for the Kununu spider so that it collects only reviews from the last N months (configurable via months_back; default: 24).

Features

  • Configurable months_back parameter - API accepts months_back (default: 24)
  • Date cutoff filtering - Reviews older than cutoff are skipped
  • Early stopping - Stops after 20 consecutive old reviews (an optimization that relies on reviews being listed in chronological order)
  • Incremental scraping - Skips already-scraped reviews in database
  • Real-time progress tracking - reviews_collected counter updated per review
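
The date cutoff, incremental skip, and early stop described above can be sketched as one filtering loop. This is a minimal illustration and not the spider's actual code: the function name, the review dictionary keys ("id", "date"), and the already_scraped set are assumptions. Resetting the counter whenever a recent review appears is inferred from the word "consecutive".

```python
from datetime import datetime, timedelta
from typing import Iterable, Iterator, Set


def filter_recent_reviews(
    reviews: Iterable[dict],
    cutoff_date: datetime,
    already_scraped: Set[str],
    stop_threshold: int = 20,
) -> Iterator[dict]:
    """Yield new reviews dated on or after cutoff_date.

    Stops once stop_threshold old reviews have been seen in a row.
    """
    consecutive_old = 0
    for review in reviews:
        if review["date"] < cutoff_date:
            # Date cutoff filtering: skip the review and extend the streak.
            consecutive_old += 1
            if consecutive_old >= stop_threshold:
                break  # Early stopping: assume everything after is older still.
            continue
        consecutive_old = 0  # A recent review breaks the streak (assumed behavior).
        if review["id"] in already_scraped:
            continue  # Incremental scraping: already stored in the database.
        yield review
```

In the spider itself, the early stop is signalled by raising CloseSpider rather than breaking out of a loop; the generator form here only makes the control flow easy to test in isolation.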

Implementation Details

Spider Changes

# vartovii/scrapers/kununu_spider.py
# Spider arguments may arrive as strings, so cast to int before use.
self.months_back = int(getattr(self, 'months_back', 24))
# Approximate cutoff: each month is counted as 30 days.
self.cutoff_date = datetime.now() - timedelta(days=self.months_back * 30)

# Early stop after 20 consecutive old reviews
if self.consecutive_old_reviews >= self.old_reviews_stop_threshold:
    raise scrapy.exceptions.CloseSpider("Date filtering complete")
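
Because the cutoff counts every month as 30 days, it drifts slightly from calendar months. The sketch below compares the two approaches; cutoff_by_calendar is a hypothetical alternative shown only for comparison and is not part of the spider.

```python
from datetime import datetime, timedelta


def cutoff_by_days(now: datetime, months_back: int) -> datetime:
    """Cutoff as computed in the spider: every month counted as 30 days."""
    return now - timedelta(days=months_back * 30)


def cutoff_by_calendar(now: datetime, months_back: int) -> datetime:
    """Hypothetical alternative: step back whole calendar months."""
    month_index = now.year * 12 + (now.month - 1) - months_back
    year, month = divmod(month_index, 12)
    month += 1
    # Clamp the day so that e.g. 31 March minus one month lands on the
    # last day of February instead of raising ValueError.
    first_of_next = datetime(year + (month == 12), month % 12 + 1, 1)
    last_day = (first_of_next - timedelta(days=1)).day
    return now.replace(year=year, month=month, day=min(now.day, last_day))
```

For a run on 2025-12-20 with months_back=24, the 30-day rule gives a cutoff of 2023-12-31, while stepping back calendar months gives 2023-12-20, so the current approach collects about 11 days fewer than two full calendar years.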

API Changes

# backend/scraping_api.py
class MagicSearchRequest(BaseModel):
    months_back: int = 24  # New field: how many months of reviews to collect
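
How the API hands months_back to the spider is not shown here. One common route is Scrapy's -a command-line option, which sets spider attributes as strings (hence the int(...) cast in the spider). The sketch below builds such a command; the spider name "kununu", the company_name argument, and the build_crawl_command helper are assumptions made for illustration.

```python
from typing import List


def build_crawl_command(company: str, months_back: int = 24) -> List[str]:
    """Build a `scrapy crawl` command that forwards months_back to the spider."""
    if months_back < 1:
        raise ValueError("months_back must be a positive integer")
    return [
        "scrapy", "crawl", "kununu",           # spider name is hypothetical
        "-a", f"company_name={company}",       # argument name is hypothetical
        "-a", f"months_back={months_back}",    # received by the spider as a string
    ]
```

The resulting list can be passed to subprocess.run; validating months_back before launching avoids starting a job whose cutoff lies in the future.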

Pipeline Changes

# vartovii/pipelines/pipelines.py
# After each review commit, update counter:
UPDATE scraping_jobs SET reviews_collected = reviews_collected + 1 WHERE job_id = %s
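
The increment happens inside the SQL statement, so the database applies it atomically and no read-modify-write race arises between concurrent writers. The following sketch reproduces the pattern with an in-memory SQLite database as a stand-in for the real database. The table layout beyond job_id and reviews_collected is an assumption, and SQLite uses ? placeholders where the pipeline's driver uses %s.

```python
import sqlite3

# In-memory stand-in for the scraping_jobs table (schema is assumed).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scraping_jobs ("
    "job_id TEXT PRIMARY KEY, "
    "reviews_collected INTEGER NOT NULL DEFAULT 0)"
)
conn.execute("INSERT INTO scraping_jobs (job_id) VALUES (?)", ("job-1",))


def increment_reviews_collected(conn: sqlite3.Connection, job_id: str) -> int:
    """Add one to the job's counter and return the number of rows updated."""
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "UPDATE scraping_jobs "
            "SET reviews_collected = reviews_collected + 1 "
            "WHERE job_id = ?",
            (job_id,),
        )
    return cursor.rowcount


# Simulate three stored reviews for the same job.
for _ in range(3):
    increment_reviews_collected(conn, "job-1")

count = conn.execute(
    "SELECT reviews_collected FROM scraping_jobs WHERE job_id = ?", ("job-1",)
).fetchone()[0]
```

A return value of 0 from the helper indicates that no job with the given job_id exists, which is worth logging rather than ignoring.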

Fixes Applied (Session 2025-12-20)

  1. ✅ Browser-like headers added (bypass 403)
  2. ✅ DB connectivity standardized (Cloud SQL Proxy + env-based DB config)
  3. ✅ Pipeline company_name normalization
  4. ✅ Fake company validation (reject if not on Kununu)
  5. ✅ Cancel job endpoint fixed

Test Results

| Company | Reviews Collected | Time   |
|---------|-------------------|--------|
| Audi    | 289+ reviews      | ~2 min |
| Bosch   | 100+ reviews      | ~3 min |

Related Files

  • vartovii/scrapers/kununu_spider.py
  • backend/scraping_api.py
  • vartovii/pipelines/pipelines.py
  • vartovii/config/settings.py