
Vacancy Scraping System

Implemented: 2025-12-21
Status: ✅ Live


How It Works

┌─────────────────────────────────────────────────────────────┐
│ User creates job: source = 'vacancies'                      │
└────────────────────────┬────────────────────────────────────┘
                         ▼
┌─────────────────────────────────────────────────────────────┐
│ Worker picks up job → calls _execute_vacancy_job()          │
└────────────────────────┬────────────────────────────────────┘
                         ▼
┌─────────────────────────────────────────────────────────────┐
│ Check SerpAPI quota (200/month limit)                       │
│  └─ Available? → Search Google for job listings             │
│  └─ Query: "site:indeed.de OR site:stepstone.de {company}"  │
└────────────────────────┬────────────────────────────────────┘
                         ▼
┌─────────────────────────────────────────────────────────────┐
│ SerpAPI returns results (usually 5-10 job links)            │
│  └─ Found < 5? → Add mock data as fallback (8 jobs)         │
└────────────────────────┬────────────────────────────────────┘
                         ▼
┌─────────────────────────────────────────────────────────────┐
│ Persist vacancy updates through shared repository seams     │
│  └─ Deduplicate by vacancy_hash                             │
│  └─ Preserve lifecycle state and job completion flow        │
└─────────────────────────────────────────────────────────────┘

Components

1. SerpAPI Client

File: vartovii/scrapy_services/serpapi_client.py

  • Uses the Google Search API (not Google Jobs, which is deprecated)
  • Query: site:indeed.de OR site:stepstone.de {company} jobs
  • Quota: 200 requests/month (tracked in vartovii/data/serpapi_usage.json)
  • Key: Set in .env as SERPAPI_KEY
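
The quota tracking backed by serpapi_usage.json could look roughly like the sketch below. This is an illustrative assumption, not the actual serpapi_client.py code: the QuotaTracker name, the JSON field names, and the method names are all hypothetical, but the behavior (monthly counter, hard stop at 200, automatic reset when the month changes) matches what this section describes.

```python
import json
from datetime import date
from pathlib import Path

class QuotaTracker:
    """Monthly request counter persisted to a JSON file (illustrative sketch)."""

    def __init__(self, path="vartovii/data/serpapi_usage.json", limit=200):
        self.path = Path(path)
        self.limit = limit

    def _load(self):
        month = date.today().strftime("%Y-%m")
        if self.path.exists():
            data = json.loads(self.path.read_text())
            if data.get("month") == month:
                return data
        # New month (or first run): counter resets automatically
        return {"month": month, "used": 0}

    def remaining(self):
        return self.limit - self._load()["used"]

    def consume(self):
        """Reserve one request; False means quota is exhausted."""
        data = self._load()
        if data["used"] >= self.limit:
            return False  # caller falls back to mock data
        data["used"] += 1
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps(data))
        return True
```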

2. Proxy Rotator

File: vartovii/config/proxies.py

  • 20 proxies configured (10 Ukrainian + 10 German)
  • Auto-rotation on each request
  • Failed proxies temporarily excluded
  • Config: vartovii/data/proxies.txt
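
The rotate-and-exclude behavior described above could be sketched as follows. The ProxyRotator name, method names, and the cooldown value are assumptions for illustration, not the real vartovii/config/proxies.py API; the point is the round-robin cycle plus a temporary ban list for failed proxies.

```python
import itertools
import time

class ProxyRotator:
    """Round-robin rotation with temporary exclusion of failed proxies (sketch)."""

    def __init__(self, proxies, cooldown_seconds=300):
        self._cycle = itertools.cycle(proxies)
        self._count = len(proxies)
        self._banned_until = {}  # proxy -> unix time when it may be retried
        self._cooldown = cooldown_seconds

    def next_proxy(self):
        now = time.time()
        for _ in range(self._count):  # at most one full pass over the pool
            proxy = next(self._cycle)
            if self._banned_until.get(proxy, 0) <= now:
                return proxy
        raise RuntimeError("all proxies are temporarily excluded")

    def mark_failed(self, proxy):
        """Exclude a proxy until its cooldown expires."""
        self._banned_until[proxy] = time.time() + self._cooldown
```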

3. Worker Integration

File: backend/scraper_worker.py

Method _execute_vacancy_job():

  1. Runs vacancy collection through the background worker/runtime flow
  2. Tries SerpAPI if quota available
  3. Falls back to mock data if SerpAPI finds < 5 results
  4. Persists vacancy lifecycle updates through the shared vacancy repository

Legacy CLI and helper paths in vartovii/ use the same shared vacancy repository seam and packaged runtime contract as the background worker/runtime flow, so persistence behavior stays aligned and drift is reduced between automated runs, manual refresh runs, and the main runtime surface.


Database Schema

Table: job_vacancies

Column        Type          Description
id            INTEGER       Primary key
company_slug  VARCHAR       Company identifier
job_title     VARCHAR       Position title
location      VARCHAR       Job location
job_url       VARCHAR(500)  Link to job posting
source        VARCHAR       'serpapi', 'mock', 'indeed'
vacancy_hash  VARCHAR       Unique hash for deduplication
is_active     BOOLEAN       Currently active
created_at    TIMESTAMP     First seen
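
One plausible way to derive the deduplication hash is from the fields that identify a posting. The exact fields and normalization used by the real system are not documented here, so this is a hypothetical sketch:

```python
import hashlib

def vacancy_hash(company_slug, job_title, location):
    """Stable dedup key from normalized identifying fields (illustrative)."""
    # Lowercase and strip so trivial formatting differences don't
    # produce duplicate rows for the same posting.
    key = "|".join(part.strip().lower() for part in (company_slug, job_title, location))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()
```

With a scheme like this, re-scraping the same posting yields the same hash, so the repository can skip the insert.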

Usage

Option 1: Via Dashboard/API

Create a scraping job with source: 'vacancies':

curl -X POST "http://localhost:8000/api/scraping/magic-search" \
-H "Content-Type: application/json" \
-d '{"company_name": "BMW", "sources": ["vacancies"]}'

Option 2: CLI Harvester

cd /Users/vitaliiradionov/Desktop/Vartovii
export SERPAPI_KEY="your_key"
python vartovii/scrapy_services/vacancy_harvester.py "BMW" --slug bmw-ag --save

Quota Management

SerpAPI Free Tier: 250 requests/month
Our Limit: 200 requests/month (50-request safety margin)

Check remaining quota:

from vartovii.scrapy_services.serpapi_client import get_serpapi_client
client = get_serpapi_client()
print(f"Remaining: {client.get_remaining_quota()}")

Quota resets automatically on the 1st of each month.


Files Reference

File                                           Purpose
vartovii/scrapy_services/serpapi_client.py     SerpAPI client with quota tracking
vartovii/config/proxies.py                     Proxy rotation module
vartovii/scrapy_services/vacancy_harvester.py  Standalone CLI tool
vartovii/data/proxies.txt                      20 configured proxies
vartovii/data/serpapi_usage.json               Monthly usage counter
backend/scraper_worker.py                      Worker with _execute_vacancy_job()
backend/scraping_job_runner.py                 Job-runner orchestration and refresh

Test Results

🔍 Starting vacancy harvest for Bosch...
Using SerpAPI (quota: 194)
SerpAPI found 4 vacancies
Using mock vacancy data as fallback
✅ Saved 12 new vacancies for Bosch
Total vacancies in DB: 12