Vacancy Scraping System

Implemented: 2025-12-21
Status: ✅ Production Ready


How It Works

┌─────────────────────────────────────────────────────────────┐
│ User creates job: source = 'vacancies'                      │
└────────────────────────┬────────────────────────────────────┘
                         ▼
┌─────────────────────────────────────────────────────────────┐
│ Worker picks up job → calls _execute_vacancy_job()          │
└────────────────────────┬────────────────────────────────────┘
                         ▼
┌─────────────────────────────────────────────────────────────┐
│ Check SerpAPI quota (200/month limit)                       │
│ └─ Available? → Search Google for job listings              │
│ └─ Query: "site:indeed.de OR site:stepstone.de {company}"   │
└────────────────────────┬────────────────────────────────────┘
                         ▼
┌─────────────────────────────────────────────────────────────┐
│ SerpAPI returns results (usually 5-10 job links)            │
│ └─ Found < 5? → Add mock data as fallback (8 jobs)          │
└────────────────────────┬────────────────────────────────────┘
                         ▼
┌─────────────────────────────────────────────────────────────┐
│ Save all vacancies to job_vacancies table                   │
│ └─ Deduplicate by vacancy_hash                              │
│ └─ Update job status to 'completed'                         │
└─────────────────────────────────────────────────────────────┘

Components

1. SerpAPI Client

File: vartovii/scrapy_services/serpapi_client.py

  • Uses Google Search API (not Google Jobs, which is deprecated)
  • Query: site:indeed.de OR site:stepstone.de {company} jobs
  • Quota: 200 requests/month (tracked in vartovii/data/serpapi_usage.json)
  • Key: Set in .env as SERPAPI_KEY
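
For orientation, here is a minimal sketch of that search call against SerpAPI's standard HTTP endpoint. The function name and result shaping are illustrative, not the actual serpapi_client.py code:

import os
import requests

SERPAPI_ENDPOINT = "https://serpapi.com/search.json"

def search_vacancies(company: str, num: int = 10) -> list[dict]:
    """Illustrative sketch: one Google search via SerpAPI for job links."""
    params = {
        "engine": "google",
        "q": f"site:indeed.de OR site:stepstone.de {company} jobs",
        "num": num,
        "api_key": os.environ["SERPAPI_KEY"],  # from .env
    }
    resp = requests.get(SERPAPI_ENDPOINT, params=params, timeout=30)
    resp.raise_for_status()
    # Plain Google results arrive under "organic_results"
    return [
        {"title": r.get("title"), "url": r.get("link")}
        for r in resp.json().get("organic_results", [])
    ]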

2. Proxy Rotator

File: vartovii/config/proxies.py

  • 20 proxies configured (10 Ukrainian + 10 German)
  • Auto-rotation on each request
  • Failed proxies temporarily excluded
  • Config: vartovii/data/proxies.txt
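
The rotation logic amounts to round-robin with a failure cooldown. Below is a hedged sketch, assuming one proxy URL per line in proxies.txt; the ProxyRotator name and the cooldown value are assumptions, not the actual proxies.py contents:

import itertools
import time

class ProxyRotator:
    """Illustrative sketch: rotate proxies, temporarily skipping failures."""

    def __init__(self, path="vartovii/data/proxies.txt", cooldown=300):
        with open(path) as f:
            self.proxies = [line.strip() for line in f if line.strip()]
        self.cycle = itertools.cycle(self.proxies)
        self.failed = {}          # proxy -> timestamp of last failure
        self.cooldown = cooldown  # seconds a failed proxy stays excluded

    def next_proxy(self) -> str:
        # Walk the cycle at most one full lap, skipping cooling-down proxies
        for _ in range(len(self.proxies)):
            proxy = next(self.cycle)
            failed_at = self.failed.get(proxy)
            if failed_at is None or time.time() - failed_at > self.cooldown:
                return proxy
        raise RuntimeError("all proxies are cooling down")

    def mark_failed(self, proxy: str) -> None:
        self.failed[proxy] = time.time()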

3. Worker Integration

File: backend/scraper_worker.py

Method _execute_vacancy_job():

  1. Connects to production DB directly
  2. Tries SerpAPI if quota available
  3. Falls back to mock data if SerpAPI finds < 5 results
  4. Saves directly to job_vacancies table
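
Condensed, that flow looks like the sketch below. get_serpapi_client() and get_remaining_quota() appear in the quota section further down; search_vacancies() follows the sketch in the SerpAPI section above; make_mock_vacancies() and save_vacancies() are hypothetical helper names standing in for whatever the worker actually calls:

def _execute_vacancy_job(self, job):
    vacancies = []
    client = get_serpapi_client()
    if client.get_remaining_quota() > 0:      # step 2: quota check
        vacancies = client.search_vacancies(job.company_name)
    if len(vacancies) < 5:                    # step 3: thin results -> mock fallback
        vacancies += make_mock_vacancies(job.company_slug)
    # step 4: dedup by vacancy_hash, then insert into job_vacancies
    saved = save_vacancies(self.db, job.company_slug, vacancies)
    self.db.mark_job_completed(job.job_id, saved)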

Database Schema

Table: job_vacancies

Column        Type          Description
------------  ------------  -----------------------------
id            INTEGER       Primary key
company_slug  VARCHAR       Company identifier
job_title     VARCHAR       Position title
location      VARCHAR       Job location
job_url       VARCHAR(500)  Link to job posting
source        VARCHAR       'serpapi', 'mock', 'indeed'
vacancy_hash  VARCHAR       Unique hash for deduplication
is_active     BOOLEAN       Currently active
created_at    TIMESTAMP     First seen
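
The exact recipe for vacancy_hash is not documented here. A plausible sketch, assuming a stable digest over the identifying fields:

import hashlib

def vacancy_hash(company_slug: str, job_title: str, job_url: str) -> str:
    # Normalize the title so casing/whitespace changes don't defeat dedup
    key = f"{company_slug}|{job_title.strip().lower()}|{job_url}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()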

Usage

Option 1: Via Dashboard/API

Create a scraping job with source: 'vacancies':

curl -X POST "http://localhost:8000/api/scraping/magic-search" \
-H "Content-Type: application/json" \
-d '{"company_name": "BMW", "sources": ["vacancies"]}'
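
The same request from Python, if you are scripting against the API (this mirrors the curl call above):

import requests

resp = requests.post(
    "http://localhost:8000/api/scraping/magic-search",
    json={"company_name": "BMW", "sources": ["vacancies"]},
    timeout=30,
)
print(resp.json())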

Option 2: Direct Database

INSERT INTO scraping_jobs (job_id, company_name, company_slug, source, status, country)
VALUES (gen_random_uuid(), 'BMW', 'bmw-ag', 'vacancies', 'pending', 'de');

Option 3: CLI Harvester

cd vartovii
export SERPAPI_KEY="your_key"
python scrapy_services/vacancy_harvester.py "BMW" --slug bmw-ag --save

Quota Management

SerpAPI Free Tier: 250/month
Our Limit: 200/month (safety margin)

Check remaining quota:

from scrapy_services.serpapi_client import get_serpapi_client  # run from the vartovii/ directory
client = get_serpapi_client()
print(f"Remaining: {client.get_remaining_quota()}")

Quota resets automatically on the 1st of each month.
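
What makes the reset automatic is keying the counter by month. A sketch of that behavior follows; the field names and the MONTHLY_LIMIT constant are assumptions, and the real serpapi_client.py may store the file differently:

import json
from datetime import date
from pathlib import Path

USAGE_FILE = Path("vartovii/data/serpapi_usage.json")
MONTHLY_LIMIT = 200  # self-imposed cap under the 250/month free tier

def get_remaining_quota() -> int:
    month = date.today().strftime("%Y-%m")
    try:
        usage = json.loads(USAGE_FILE.read_text())
    except FileNotFoundError:
        usage = {}
    if usage.get("month") != month:  # new month -> counter starts over
        usage = {"month": month, "used": 0}
        USAGE_FILE.write_text(json.dumps(usage))
    return MONTHLY_LIMIT - usage.get("used", 0)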


Files Reference

File                                           Purpose
---------------------------------------------  ----------------------------------
vartovii/scrapy_services/serpapi_client.py     SerpAPI client with quota tracking
vartovii/config/proxies.py                     Proxy rotation module
vartovii/scrapy_services/vacancy_harvester.py  Standalone CLI tool
vartovii/data/proxies.txt                      20 configured proxies
vartovii/data/serpapi_usage.json               Monthly usage counter
backend/scraper_worker.py                      Worker with _execute_vacancy_job()

Test Results

🔍 Starting vacancy harvest for Bosch...
Using SerpAPI (quota: 194)
SerpAPI found 4 vacancies
Using mock vacancy data as fallback
✅ Saved 12 new vacancies for Bosch
Total vacancies in DB: 12