Data Collection Architecture

Overview

Vartovii collects data from 4 primary sources for corporate intelligence using a scalable harvester pattern:

Source	Data Type	Method	Rate
Kununu	Employee reviews	Automated Harvester	~100 reviews/min
Google Reviews	Customer reviews	Places API	50 req/day (free tier)
Reddit	Company discussions	Reddit API	60 req/min
Indeed	Job vacancies	Public Job Feed	~50 jobs/min

Harvester Architecture

graph TD
    subgraph Sources
    K[Kununu]
    G[Google Maps]
    R[Reddit]
    end

    subgraph Collection Layer
    S1[Vacancy Harvester]
    S2[Review Harvester]
    S3[Social Harvester]
    end

    Q[Job Queue] --> P[PostgreSQL]
    S1 --> Q
    S2 --> Q
    S3 --> Q

1. Kununu Integration

Purpose: Collect employee reviews from kununu.com (DACH region).

Technology: Hybrid API/Browser automation.
Fields: Title, Rating (1-5), Text, Date, Position, Location.
Config: KUNUNU_CONFIG controls delay (1-3s) and batch size.

2. Google Reviews

Purpose: Collect customer sentiment.

API: Google Places API (Official).
Data: Author, Rating, Text, Date.
Limits: Cached for 24h to respect quotas.

3. Reddit Intelligence

Purpose: Unfiltered company discussions.

API: Official Reddit API (OAuth2).
Subreddits: r/jobs, r/careerguidance, r/germany.
Metrics: Sentiment Score, Upvotes, Comment volume.

4. Vacancy Intelligence (Indeed)

Purpose: Track hiring velocity and ghost jobs.

Metrics: Time-to-fill, Salary Ranges, Job Description keywords.

Review ID Determinism Contract (Updated 2026-03-04)

Spider-based fallback review IDs are deterministic and namespaced by company to prevent cross-company collisions under global review_id uniqueness.

Current pattern:

Use hashlib.sha256(...) for fallback IDs (no runtime-random uuid4()).
Include company_slug in hash key for namespacing.
Applies to Reddit, Glassdoor, and Google Maps spider fallback paths.

Anti-Bot Compliance & Ethics

Respectful Delays: Random intervals (1-5s) between requests.
Identification: Valid User-Agents.
Compliance: Respects robots.txt and Terms of Service.
Rate Limiting: Strict checking of platform limits.

Database Schema

CREATE TABLE reviews (
    id SERIAL PRIMARY KEY,
    company_slug VARCHAR,
    source VARCHAR,
    rating DECIMAL,
    sentiment VARCHAR,
    created_at TIMESTAMP DEFAULT NOW()
);

Overview​

Harvester Architecture​

1. Kununu Integration​

2. Google Reviews​

3. Reddit Intelligence​

4. Vacancy Intelligence (Indeed)​

Review ID Determinism Contract (Updated 2026-03-04)​

Anti-Bot Compliance & Ethics​

Database Schema​