
Data Collection Architecture

Overview

Vartovii collects corporate-intelligence data from four primary sources using a scalable harvester pattern:

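| Source         | Data Type           | Method              | Rate                    |
|----------------|---------------------|---------------------|-------------------------|
| Kununu         | Employee reviews    | Automated Harvester | ~100 reviews/min        |
| Google Reviews | Customer reviews    | Places API          | 50 req/day (free tier)  |
| Reddit         | Company discussions | Reddit API          | 60 req/min              |
| Indeed         | Job vacancies       | Public Job Feed     | ~50 jobs/min            |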

Harvester Architecture

graph TD
    subgraph Sources
        K[Kununu]
        G[Google Maps]
        R[Reddit]
        I[Indeed]
    end

    subgraph Collection Layer
        S1[Vacancy Harvester]
        S2[Review Harvester]
        S3[Social Harvester]
    end

    I --> S1
    K --> S2
    G --> S2
    R --> S3

    S1 --> Q[Job Queue]
    S2 --> Q
    S3 --> Q
    Q --> P[PostgreSQL]
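
The flow in the diagram (harvesters push to a shared job queue, which is then written to storage) can be sketched roughly as follows. This is a minimal illustration, not the actual Vartovii code: the `Record`, `Harvester`, `StaticReviewHarvester`, and `drain` names are hypothetical, and an in-memory list stands in for PostgreSQL.

```python
import queue
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Record:
    """One harvested item, tagged with the source it came from."""
    source: str
    payload: dict


class Harvester(ABC):
    """Base class: each harvester only knows how to fetch from its own source."""

    def __init__(self, jobs: "queue.Queue[Record]") -> None:
        self.jobs = jobs

    @abstractmethod
    def fetch(self) -> Iterable[Record]:
        ...

    def run(self) -> int:
        """Push everything fetched onto the shared queue; return the count."""
        count = 0
        for record in self.fetch():
            self.jobs.put(record)
            count += 1
        return count


class StaticReviewHarvester(Harvester):
    """Hypothetical stand-in that yields canned data instead of calling a site."""

    def fetch(self) -> Iterable[Record]:
        yield Record("kununu", {"rating": 4.0})
        yield Record("google", {"rating": 3.5})


def drain(jobs: "queue.Queue[Record]") -> list[Record]:
    """Stand-in for the PostgreSQL writer: empty the queue into a list."""
    out = []
    while not jobs.empty():
        out.append(jobs.get())
    return out


jobs: "queue.Queue[Record]" = queue.Queue()
pushed = StaticReviewHarvester(jobs).run()
stored = drain(jobs)
```

Keeping the queue between harvesters and the writer means a new source only needs a new `fetch` implementation; the storage side stays unchanged.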

1. Kununu Integration

Purpose: Collect employee reviews from kununu.com (DACH region).

  • Technology: Hybrid API/Browser automation.
  • Fields: Title, Rating (1-5), Text, Date, Position, Location.
  • Config: KUNUNU_CONFIG controls delay (1-3s) and batch size.
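
As an illustration of what such a config might control, here is a small sketch. Only the name `KUNUNU_CONFIG` and the 1-3s delay range come from the text above; the key names, the batch size value, and the helper functions are assumptions.

```python
import random

# Hypothetical shape for KUNUNU_CONFIG; the key names and the batch size
# value are assumptions. Only the 1-3s delay range comes from the text.
KUNUNU_CONFIG = {
    "delay_min_s": 1.0,
    "delay_max_s": 3.0,
    "batch_size": 20,
}


def next_delay(config: dict, rng: random.Random) -> float:
    """Pick a random pause, in seconds, within the configured range."""
    return rng.uniform(config["delay_min_s"], config["delay_max_s"])


def batches(items: list, size: int) -> list[list]:
    """Split items into consecutive batches of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


rng = random.Random(0)
delay = next_delay(KUNUNU_CONFIG, rng)
pages = batches(list(range(45)), KUNUNU_CONFIG["batch_size"])
```

With 45 items and a batch size of 20, `pages` holds three batches of 20, 20, and 5 items.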

2. Google Reviews

Purpose: Collect customer sentiment.

  • API: Google Places API (Official).
  • Data: Author, Rating, Text, Date.
  • Limits: Responses are cached for 24h to stay within API quotas.
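
One way to implement the 24h caching described above is a small time-to-live cache in front of the API call. The sketch below is an assumption about how this could look, not the production code: `TTLCache`, `get_reviews`, and `fake_fetch` are hypothetical names, and a fake clock replaces real time so the example runs without any network access.

```python
import time
from typing import Callable, Optional

CACHE_TTL_S = 24 * 60 * 60  # 24 hours, as stated above


class TTLCache:
    """Tiny in-memory cache whose entries expire after a fixed time-to-live."""

    def __init__(self, ttl_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_s = ttl_s
        self.clock = clock
        self._store: dict[str, tuple[float, list]] = {}

    def get(self, key: str) -> Optional[list]:
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self.clock() - stored_at >= self.ttl_s:
            del self._store[key]  # expired: force a fresh fetch
            return None
        return value

    def put(self, key: str, value: list) -> None:
        self._store[key] = (self.clock(), value)


def get_reviews(place_id: str, cache: TTLCache,
                fetch: Callable[[str], list]) -> list:
    """Return cached reviews when fresh; otherwise call `fetch` and cache it."""
    cached = cache.get(place_id)
    if cached is not None:
        return cached
    reviews = fetch(place_id)
    cache.put(place_id, reviews)
    return reviews


# Demo with a fake clock and a fake fetcher so no network call is made.
now = [0.0]
calls: list[str] = []


def fake_fetch(place_id: str) -> list:
    calls.append(place_id)
    return [{"author": "A.", "rating": 5}]


cache = TTLCache(CACHE_TTL_S, clock=lambda: now[0])
first = get_reviews("place-1", cache, fake_fetch)   # miss: calls fake_fetch
second = get_reviews("place-1", cache, fake_fetch)  # hit: served from cache
now[0] = CACHE_TTL_S + 1                            # jump past the 24h window
third = get_reviews("place-1", cache, fake_fetch)   # expired: fetches again
```

Three lookups result in only two upstream calls, which is the point of the cache when the daily quota is small.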

3. Reddit Intelligence

Purpose: Collect unfiltered company discussions.

  • API: Official Reddit API (OAuth2).
  • Subreddits: r/jobs, r/careerguidance, r/germany.
  • Metrics: Sentiment Score, Upvotes, Comment volume.
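
The overview table lists 60 req/min for the Reddit API. A client-side sliding-window limiter is one possible way to stay under that figure; the sketch below assumes this approach, and `SlidingWindowLimiter` is a hypothetical name. A fake clock is used so the example runs instantly.

```python
import time
from collections import deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most `max_calls` within any rolling `window_s` seconds."""

    def __init__(self, max_calls: int, window_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_calls = max_calls
        self.window_s = window_s
        self.clock = clock
        self._stamps: deque = deque()

    def wait_time(self) -> float:
        """Seconds to wait before the next call is allowed (0.0 if allowed now)."""
        now = self.clock()
        # Drop timestamps that have left the rolling window.
        while self._stamps and now - self._stamps[0] >= self.window_s:
            self._stamps.popleft()
        if len(self._stamps) < self.max_calls:
            return 0.0
        return self.window_s - (now - self._stamps[0])

    def record(self) -> None:
        """Note that a request was just sent."""
        self._stamps.append(self.clock())


# Demo with a fake clock: 60 requests at t=0 fill the window.
now = [0.0]
limiter = SlidingWindowLimiter(max_calls=60, window_s=60.0, clock=lambda: now[0])
for _ in range(60):
    limiter.record()
blocked_for = limiter.wait_time()  # window is full, must wait
now[0] = 60.0
free_again = limiter.wait_time()   # old timestamps have expired
```

A caller would sleep for `wait_time()` seconds before each request and call `record()` right after sending it.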

4. Vacancy Intelligence (Indeed)

Purpose: Track hiring velocity and detect ghost jobs (postings that stay open with no apparent intent to hire).

  • Metrics: Time-to-fill, Salary Ranges, Job Description keywords.
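
To make the time-to-fill and ghost-job ideas concrete, here is a rough sketch. It assumes that a posting's removal date approximates the fill date and that a posting open for 90 days or more is worth flagging; both the threshold and the `Vacancy` structure are illustrative assumptions, not taken from the source.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Assumption: a posting still open after this many days is flagged as a
# possible ghost job. The threshold is illustrative, not from the source.
GHOST_JOB_DAYS = 90


@dataclass
class Vacancy:
    title: str
    posted: date
    closed: Optional[date] = None  # None means the posting is still live


def time_to_fill(v: Vacancy) -> Optional[int]:
    """Days between posting and closing, or None while still open."""
    if v.closed is None:
        return None
    return (v.closed - v.posted).days


def is_possible_ghost(v: Vacancy, today: date) -> bool:
    """True when an open posting has been live for GHOST_JOB_DAYS or more."""
    return v.closed is None and (today - v.posted).days >= GHOST_JOB_DAYS


today = date(2024, 6, 1)
filled = Vacancy("Data Engineer", date(2024, 1, 10), date(2024, 2, 9))
stale = Vacancy("Senior Analyst", date(2024, 1, 2))
```

Here `time_to_fill(filled)` is 30 days, while `stale` has no time-to-fill yet and is flagged by `is_possible_ghost(stale, today)`.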

Anti-Bot Compliance & Ethics

  1. Respectful Delays: Random intervals (1-5s) between requests.
  2. Identification: Every request sends a valid, identifying User-Agent.
  3. Compliance: Respects robots.txt and each platform's Terms of Service.
  4. Rate Limiting: Request rates are checked strictly against each platform's limits.
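
The delay and robots.txt rules above can be sketched with Python's standard library. This is a minimal example under stated assumptions: the `USER_AGENT` string and the helper names are hypothetical, and the robots.txt content is an inline sample rather than a file fetched from a real site.

```python
import random
from urllib.robotparser import RobotFileParser

# Hypothetical User-Agent string; a real one should identify the operator.
USER_AGENT = "ExampleHarvester/1.0 (+https://example.com/bot)"


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, url: str) -> bool:
    """True when robots.txt allows USER_AGENT to request `url`."""
    return parser.can_fetch(USER_AGENT, url)


def polite_delay(rng: random.Random, low: float = 1.0, high: float = 5.0) -> float:
    """Random pause in seconds, using the 1-5s range from the list above."""
    return rng.uniform(low, high)


# Inline sample rules so the example runs without network access.
sample = "User-agent: *\nDisallow: /private/\n"
robots = build_robots(sample)
allowed = may_fetch(robots, "https://example.com/reviews/acme")
blocked = may_fetch(robots, "https://example.com/private/data")
pause = polite_delay(random.Random(42))
```

A harvester would skip any URL for which `may_fetch` returns `False` and sleep for `polite_delay(...)` seconds between the requests it does send.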

Database Schema

CREATE TABLE reviews (
    id           SERIAL PRIMARY KEY,
    company_slug VARCHAR,
    source       VARCHAR,
    rating       DECIMAL,
    sentiment    VARCHAR,
    created_at   TIMESTAMP DEFAULT NOW()
);
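
For illustration, a row for this table could be prepared in Python as shown below. The sketch assumes a psycopg-style driver (which uses `%s` placeholders); the `ReviewRow` class, the example slug, and the `"positive"` sentiment label are hypothetical. The `id` and `created_at` columns are left to their database defaults.

```python
from dataclasses import dataclass
from decimal import Decimal

# Parameterized statement in the %s placeholder style used by psycopg.
INSERT_REVIEW_SQL = (
    "INSERT INTO reviews (company_slug, source, rating, sentiment) "
    "VALUES (%s, %s, %s, %s)"
)


@dataclass(frozen=True)
class ReviewRow:
    company_slug: str
    source: str
    rating: Decimal
    sentiment: str

    def params(self) -> tuple:
        """Values in the same column order as INSERT_REVIEW_SQL."""
        return (self.company_slug, self.source, self.rating, self.sentiment)


row = ReviewRow("acme-gmbh", "kununu", Decimal("4.2"), "positive")
# With a live connection this would be:
#   cursor.execute(INSERT_REVIEW_SQL, row.params())
```

Passing values as parameters rather than formatting them into the SQL string avoids injection problems when review text or slugs come from external sites.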