# Vacancy Scraping System

Implemented: 2025-12-21 | Status: ✅ Live
## How It Works
```
┌──────────────────────────────────────────────────────────────┐
│ User creates job: source = 'vacancies'                       │
└────────────────────────┬─────────────────────────────────────┘
                         ↓
┌──────────────────────────────────────────────────────────────┐
│ Worker picks up job → calls _execute_vacancy_job()           │
└────────────────────────┬─────────────────────────────────────┘
                         ↓
┌──────────────────────────────────────────────────────────────┐
│ Check SerpAPI quota (200/month limit)                        │
│ └─ Available? → Search Google for job listings               │
│    └─ Query: "site:indeed.de OR site:stepstone.de {company}" │
└────────────────────────┬─────────────────────────────────────┘
                         ↓
┌──────────────────────────────────────────────────────────────┐
│ SerpAPI returns results (usually 5-10 job links)             │
│ └─ Found < 5? → Add mock data as fallback (8 jobs)           │
└────────────────────────┬─────────────────────────────────────┘
                         ↓
┌──────────────────────────────────────────────────────────────┐
│ Persist vacancy updates through shared repository seams      │
│ └─ Deduplicate by vacancy_hash                               │
│ └─ Preserve lifecycle state and job completion flow          │
└──────────────────────────────────────────────────────────────┘
```
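The flow above can be condensed into a short Python sketch. Everything here is a toy stand-in for the real worker code (the actual `_execute_vacancy_job()` lives in `backend/scraper_worker.py`); only the thresholds (200/month quota, 5-result minimum, 8 mock jobs) come from the diagram.

```python
# Toy sketch of the vacancy-job flow; all helpers are stand-ins,
# not the project's real functions.
MIN_RESULTS = 5   # below this, mock data is added as a fallback
MOCK_JOBS = 8     # size of the mock fallback set

def check_quota(used: int, limit: int = 200) -> bool:
    """True while the monthly SerpAPI budget is not exhausted."""
    return used < limit

def execute_vacancy_job(company: str, serpapi_hits: int, quota_used: int) -> int:
    """Return how many vacancies the job would collect for `company`."""
    found = serpapi_hits if check_quota(quota_used) else 0
    if found < MIN_RESULTS:
        found += MOCK_JOBS  # fallback mirrors the documented behavior
    return found

print(execute_vacancy_job("Bosch", serpapi_hits=4, quota_used=6))  # 4 found + 8 mock = 12
```

This reproduces the Test Results run below: 4 SerpAPI hits trigger the mock fallback, ending at 12 saved vacancies.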
## Components

### 1. SerpAPI Client

File: `vartovii/scrapy_services/serpapi_client.py`

- Uses the Google Search API (not Google Jobs, which is deprecated)
- Query: `site:indeed.de OR site:stepstone.de {company} jobs`
- Quota: 200 requests/month (tracked in `vartovii/data/serpapi_usage.json`)
- Key: set in `.env` as `SERPAPI_KEY`
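The request itself is a plain Google search routed through SerpAPI. The parameters below (`engine`, `q`, `api_key`) are SerpAPI's standard search parameters; the helper function is only an illustrative sketch, not the project's client.

```python
# Illustrative sketch: build the SerpAPI search URL for a company.
# build_search_url is a hypothetical helper, not part of serpapi_client.py.
from urllib.parse import urlencode

def build_search_url(company: str, api_key: str) -> str:
    # Restrict results to the two German job boards named above.
    query = f"site:indeed.de OR site:stepstone.de {company} jobs"
    params = {"engine": "google", "q": query, "api_key": api_key}
    return "https://serpapi.com/search?" + urlencode(params)

print(build_search_url("BMW", "demo-key"))
```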
### 2. Proxy Rotator

File: `vartovii/config/proxies.py`

- 20 proxies configured (10 Ukrainian + 10 German)
- Auto-rotation on each request
- Failed proxies temporarily excluded
- Config: `vartovii/data/proxies.txt`
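A rotator with these properties (round-robin over the pool, temporary exclusion of failed proxies) can be sketched as below. The real module is `vartovii/config/proxies.py` and its actual API may differ, so the class and method names here are assumptions.

```python
# Toy proxy rotator: round-robin with a cooldown for failed proxies.
# Names (ProxyRotator, get, mark_failed) are illustrative assumptions.
import itertools
import time

class ProxyRotator:
    def __init__(self, proxies: list[str], cooldown: float = 300.0):
        self._cycle = itertools.cycle(proxies)  # auto-rotation on each request
        self._failed: dict[str, float] = {}     # proxy -> failure timestamp
        self._cooldown = cooldown               # seconds a failed proxy sits out

    def get(self) -> str:
        """Next proxy in rotation, skipping any still in cooldown."""
        now = time.monotonic()
        while True:  # assumes at least one proxy is outside cooldown
            proxy = next(self._cycle)
            failed_at = self._failed.get(proxy)
            if failed_at is None or now - failed_at >= self._cooldown:
                return proxy

    def mark_failed(self, proxy: str) -> None:
        """Temporarily exclude a proxy after a failed request."""
        self._failed[proxy] = time.monotonic()

rotator = ProxyRotator(["ua-proxy-1", "de-proxy-1", "de-proxy-2"])
print(rotator.get())  # ua-proxy-1
rotator.mark_failed("de-proxy-1")
print(rotator.get())  # de-proxy-2 (de-proxy-1 skipped while cooling down)
```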
### 3. Worker Integration

File: `backend/scraper_worker.py`

Method `_execute_vacancy_job()`:

- Runs vacancy collection through the background worker/runtime flow
- Tries SerpAPI when quota is available
- Falls back to mock data when SerpAPI finds fewer than 5 results
- Persists vacancy lifecycle updates through the shared vacancy repository
Legacy CLI and helper paths in `vartovii/` use the same shared vacancy repository seam and the same packaged runtime contract as the background worker/runtime flow, so persistence behavior stays aligned across automated and manual refresh runs.
## Database Schema

Table: `job_vacancies`
| Column | Type | Description |
|---|---|---|
| id | INTEGER | Primary key |
| company_slug | VARCHAR | Company identifier |
| job_title | VARCHAR | Position title |
| location | VARCHAR | Job location |
| job_url | VARCHAR(500) | Link to job posting |
| source | VARCHAR | 'serpapi', 'mock', 'indeed' |
| vacancy_hash | VARCHAR | Unique hash for deduplication |
| is_active | BOOLEAN | Currently active |
| created_at | TIMESTAMP | First seen |
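Deduplication keys on `vacancy_hash`. The exact fields that feed the hash are not documented here; a plausible sketch, assuming the hash covers `company_slug`, `job_title`, and `location`, looks like this:

```python
# Sketch of vacancy_hash deduplication. The choice of hash inputs
# (company_slug + job_title + location) is an assumption.
import hashlib

def vacancy_hash(company_slug: str, job_title: str, location: str) -> str:
    key = f"{company_slug}|{job_title}|{location}".lower()
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

def dedupe(vacancies: list[dict]) -> list[dict]:
    """Keep the first row per hash, as an insert-if-new upsert would."""
    seen: set[str] = set()
    unique = []
    for v in vacancies:
        h = vacancy_hash(v["company_slug"], v["job_title"], v["location"])
        if h not in seen:
            seen.add(h)
            unique.append({**v, "vacancy_hash": h})
    return unique

rows = [
    {"company_slug": "bmw-ag", "job_title": "Engineer", "location": "Munich"},
    {"company_slug": "bmw-ag", "job_title": "Engineer", "location": "Munich"},
]
print(len(dedupe(rows)))  # 1
```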
## Usage

### Option 1: Via Dashboard/API

Create a scraping job with `source: 'vacancies'`:

```bash
curl -X POST "http://localhost:8000/api/scraping/magic-search" \
  -H "Content-Type: application/json" \
  -d '{"company_name": "BMW", "sources": ["vacancies"]}'
```
### Option 2: CLI Harvester

```bash
cd /Users/vitaliiradionov/Desktop/Vartovii
export SERPAPI_KEY="your_key"
python vartovii/scrapy_services/vacancy_harvester.py "BMW" --slug bmw-ag --save
```
## Quota Management

- SerpAPI free tier: 250 requests/month
- Our limit: 200 requests/month (safety margin)
Check remaining quota:

```python
from services.serpapi_client import get_serpapi_client

client = get_serpapi_client()
print(f"Remaining: {client.get_remaining_quota()}")
```
Quota resets automatically on the 1st of each month.
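Because usage is tracked per calendar month, the "automatic" reset falls out of keying the counter by `YYYY-MM`: the first request in a new month simply starts a fresh counter. A minimal sketch (the on-disk shape of `serpapi_usage.json` is an assumption):

```python
# Sketch of the monthly quota reset; the usage-dict shape is assumed.
import datetime

LIMIT = 200  # self-imposed cap under the 250/month free tier

def record_request(usage: dict, today: datetime.date) -> tuple[dict, bool]:
    """Return (updated usage, whether the request was allowed)."""
    month = today.strftime("%Y-%m")
    if usage.get("month") != month:        # new month: counter starts over
        usage = {"month": month, "used": 0}
    if usage["used"] >= LIMIT:
        return usage, False                # exhausted until next month
    return {**usage, "used": usage["used"] + 1}, True

u, ok = record_request({"month": "2025-11", "used": 200}, datetime.date(2025, 12, 1))
print(u, ok)  # {'month': '2025-12', 'used': 1} True
```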
## Files Reference

| File | Purpose |
|---|---|
| `vartovii/scrapy_services/serpapi_client.py` | SerpAPI client with quota tracking |
| `vartovii/config/proxies.py` | Proxy rotation module |
| `vartovii/scrapy_services/vacancy_harvester.py` | Standalone CLI tool |
| `vartovii/data/proxies.txt` | 20 configured proxies |
| `vartovii/data/serpapi_usage.json` | Monthly usage counter |
| `backend/scraper_worker.py` | Worker with `_execute_vacancy_job()` |
| `backend/scraping_job_runner.py` | Job-runner orchestration and refresh |
## Test Results

```
🔍 Starting vacancy harvest for Bosch...
Using SerpAPI (quota: 194)
SerpAPI found 4 vacancies
Using mock vacancy data as fallback
✅ Saved 12 new vacancies for Bosch
Total vacancies in DB: 12
```