Data Collection Architecture
Overview
Vartovii collects data from 4 primary sources for corporate intelligence using a scalable harvester pattern:
| Source | Data Type | Method | Rate |
|---|---|---|---|
| Kununu | Employee reviews | Automated Harvester | ~100 reviews/min |
| Google Reviews | Customer reviews | Places API | 50 req/day (free tier) |
| Company discussions | Reddit API | 60 req/min | |
| Indeed | Job vacancies | Public Job Feed | ~50 jobs/min |
Harvester Architecture
graph TD
subgraph Sources
K[Kununu]
G[Google Maps]
R[Reddit]
end
subgraph Collection Layer
S1[Vacancy Harvester]
S2[Review Harvester]
S3[Social Harvester]
end
Q[Job Queue] --> P[PostgreSQL]
S1 --> Q
S2 --> Q
S3 --> Q
1. Kununu Integration
Purpose: Collect employee reviews from kununu.com (DACH region).
- Technology: Hybrid API/Browser automation.
- Fields: Title, Rating (1-5), Text, Date, Position, Location.
- Config:
KUNUNU_CONFIGcontrols delay (1-3s) and batch size.
2. Google Reviews
Purpose: Collect customer sentiment.
- API: Google Places API (Official).
- Data: Author, Rating, Text, Date.
- Limits: Cached for 24h to respect quotas.
3. Reddit Intelligence
Purpose: Unfiltered company discussions.
- API: Official Reddit API (OAuth2).
- Subreddits: r/jobs, r/careerguidance, r/germany.
- Metrics: Sentiment Score, Upvotes, Comment volume.
4. Vacancy Intelligence (Indeed)
Purpose: Track hiring velocity and ghost jobs.
- Metrics: Time-to-fill, Salary Ranges, Job Description keywords.
Anti-Bot Compliance & Ethics
- Respectful Delays: Random intervals (1-5s) between requests.
- Identification: Valid User-Agents.
- Compliance: Respects
robots.txtand Terms of Service. - Rate Limiting: Strict checking of platform limits.
Database Schema
CREATE TABLE reviews (
id SERIAL PRIMARY KEY,
company_slug VARCHAR,
source VARCHAR,
rating DECIMAL,
sentiment VARCHAR,
created_at TIMESTAMP DEFAULT NOW()
);