# Data Quality & Validation System

**Last Updated:** 2026-02-06
Vartovii implements automated data quality checks to ensure clean, reliable company intelligence. This guide covers validation mechanisms, quality metrics, and troubleshooting.
## Overview
The Data Quality System consists of 4 layers:

```text
┌─────────────────────────────────────┐
│ 1. Entity Type Validation           │ ← Prevent contamination
│    (During scraping)                │
├─────────────────────────────────────┤
│ 2. Post-Scraping Alerts             │ ← Immediate feedback
│    (After job completes)            │
├─────────────────────────────────────┤
│ 3. Quality Metrics API              │ ← Monitoring dashboard
│    (/api/admin/data-quality)        │
├─────────────────────────────────────┤
│ 4. Manual Cleanup Tools             │ ← Admin actions
│    (DELETE /api/admin/review/{id})  │
└─────────────────────────────────────┘
```
## 1. Entity Type Validation

**Purpose:** Prevent scraping apartment/product reviews instead of company reviews.

**Location:** `vartovii/utils/entity_validation.py`

### How It Works

**68 detection keywords** across 3 categories:
| Category | Keywords (Examples) | Use Case |
|---|---|---|
| Apartment | apartment, flat, landlord, rent, tenant | Prevent apartment complex reviews |
| Product/Service | product, purchase, delivery, shipping | Prevent e-commerce product reviews |
| Company | employer, workplace, boss, salary | Identify valid company reviews |
**Scoring System:**

- Apartment keyword: +2 points
- Product keyword: +1 point
- Company keyword: -1 point
- Threshold: score ≥ 3 → skip review
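The scoring rules above can be sketched in a few lines. This is an illustrative approximation, not the actual module: the keyword sets below are only the examples from the table (the real `vartovii/utils/entity_validation.py` uses 68 keywords and also returns a confidence value), and the function names here are hypothetical.

```python
# Illustrative sketch of the keyword scoring; the keyword sets are just the
# examples from the table above, not the full 68-keyword list.
APARTMENT_KEYWORDS = {"apartment", "flat", "landlord", "rent", "tenant"}
PRODUCT_KEYWORDS = {"product", "purchase", "delivery", "shipping"}
COMPANY_KEYWORDS = {"employer", "workplace", "boss", "salary"}

SKIP_THRESHOLD = 3  # score >= 3 -> skip the review


def entity_score(text: str) -> int:
    """Score a text: each apartment hit +2, product hit +1, company hit -1."""
    words = set(text.lower().split())
    score = 2 * len(words & APARTMENT_KEYWORDS)
    score += 1 * len(words & PRODUCT_KEYWORDS)
    score -= 1 * len(words & COMPANY_KEYWORDS)
    return score


def should_skip(text: str) -> bool:
    """True when the text looks like apartment/product content, not a company review."""
    return entity_score(text) >= SKIP_THRESHOLD
```

For example, a review mentioning "landlord", "rent", and "apartment" scores 6 and is skipped, while one mentioning "employer", "salary", and "workplace" scores -3 and is kept.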
### Integration

**Reddit Scraper:**

```python
from vartovii.utils.entity_validation import should_skip_review

# Before yielding a review
should_skip, confidence, reason = should_skip_review(
    title=post_title,
    text=review_text,
)
if should_skip:
    self.validation_skipped_count += 1
    self.logger.info(f"⚠️ Skipped: {reason} (confidence: {confidence}%)")
    continue
```
**Google Scraper:**

```python
should_skip, confidence, reason = should_skip_review(
    title=place_name,
    text=review_text,
)
if should_skip:
    skipped_count += 1
    logger.info(f"⚠️ Skipped: {reason}")
    continue
```
### Test Results

```bash
python backend/scripts/test_validation.py
```
| Test Case | Result |
|---|---|
| Apartment complex | ✅ Skipped (100% confidence) |
| Product review | ✅ Skipped (67% confidence) |
| Company review | ✅ Kept (valid) |
## 2. Post-Scraping Alerts

**Purpose:** Immediate feedback after a scraping job completes.

**Location:** `vartovii/utils/scraping_alerts.py`

### Severity Levels
| Level | Icon | Meaning | Example |
|---|---|---|---|
| INFO | ℹ️ | Normal | "Found 50 reviews" |
| WARNING | ⚠️ | Attention needed | "20% missing sentiment" |
| ERROR | ❌ | Problem detected | "High empty review rate" |
| CRITICAL | 🚨 | Urgent action needed | "Validation failed completely" |
### Validation Checks

1. **Review Count**
   - ERROR: 0 reviews found
   - WARNING: below expected minimum
2. **Missing Sentiment**
   - ERROR: >50% missing sentiment
   - WARNING: >20% missing sentiment
3. **Empty Reviews**
   - ERROR: >30% with no text
   - WARNING: >10% with no text
### Usage

```python
from vartovii.utils.scraping_alerts import log_scraping_summary

# After scraping completes
alerts = log_scraping_summary(
    company_name="Valora",
    source="reddit",
    expected_min_reviews=10,
)
```
**Output (console logs):**

```text
============================================================
📊 Scraping Summary: Valora (reddit)
============================================================
ℹ️ Scraping Successful: Found 90 reviews for Valora from reddit
   Details: {'company': 'Valora', 'source': 'reddit', 'count': 90}
============================================================
Total alerts: 1
  INFO: 1
============================================================
```
## 3. Quality Metrics API

**Purpose:** Monitor data quality across all companies.

**Endpoint:** `GET /api/admin/data-quality`

### Query Parameters
```text
# All companies, last 30 days (default)
GET /api/admin/data-quality

# Specific company
GET /api/admin/data-quality?company_name=Valora

# Custom timeframe
GET /api/admin/data-quality?days=7
```
### Response Structure

```json
{
  "summary": {
    "total_reviews": 68319,
    "total_companies": 75,
    "total_sources": 6,
    "missing_sentiment": 37803,
    "missing_sentiment_pct": 55.3,
    "potential_duplicates": 20,
    "empty_reviews": 50
  },
  "recent_activity": {
    "days": 30,
    "by_source": [
      {"source": "reddit", "count": 5007, "avg_rating": 3.12},
      {"source": "kununu", "count": 115, "avg_rating": 3.46}
    ]
  },
  "sentiment_distribution": {
    "POSITIVE": 4032,
    "NEUTRAL": 3086,
    "NEGATIVE": 4553
  },
  "issues": {
    "duplicates": [
      {
        "company": "Lidl Deutschland",
        "text_preview": "Great workplace...",
        "count": 4,
        "review_ids": ["id1", "id2", ...]
      }
    ],
    "empty_reviews": [...]
  }
}
```
### Metrics Explained

| Metric | What It Tracks | Action Threshold |
|---|---|---|
| `missing_sentiment_pct` | Reviews without AI analysis | >20% → run ABSA |
| `potential_duplicates` | Same text, different IDs | >0 → review manually |
| `empty_reviews` | No positive/negative text | >10% → check scraper |
| `recent_activity` | Scraping health by source | Low counts → investigate |
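As a usage sketch, these action thresholds can be applied programmatically to the `summary` block of the API response. The endpoint and field names follow the response structure shown earlier; `BASE_URL` and the helper names are assumptions for a local setup, not part of the project.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed local dev server


def fetch_quality_summary(days: int = 30) -> dict:
    """Fetch the summary block from the data-quality endpoint."""
    url = f"{BASE_URL}/api/admin/data-quality?days={days}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["summary"]


def actionable_issues(summary: dict) -> list[str]:
    """Apply the action thresholds from the table above to a summary dict."""
    issues = []
    if summary.get("missing_sentiment_pct", 0) > 20:
        issues.append("run ABSA")
    if summary.get("potential_duplicates", 0) > 0:
        issues.append("review duplicates manually")
    total = summary.get("total_reviews", 0)
    if total and 100 * summary.get("empty_reviews", 0) / total > 10:
        issues.append("check scraper")
    return issues
```

On the example response above (55.3% missing sentiment, 20 potential duplicates, 50 empty reviews out of 68,319), this would flag ABSA and duplicate review, but not the scraper.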
## 4. Manual Cleanup Tools

### Delete Review

**Endpoint:** `DELETE /api/admin/review/{review_id}`

**Use Cases:**
- Remove duplicate reviews
- Delete apartment contamination
- Clean up invalid data
**Example:**

```bash
curl -X DELETE "http://localhost:8000/api/admin/review/abc-123"
```
**Response:**

```json
{
  "status": "success",
  "message": "Review abc-123 deleted",
  "aspects_deleted": 5
}
```
## Troubleshooting Guide

### Issue: High Missing Sentiment %

**Symptoms:**

- `/data-quality` shows >50% missing sentiment
- Trust Score calculation fails

**Causes:**
- ABSA not triggered after scraping
- Reviews scraped while ABSA was down
**Solution:**

```bash
# Trigger ABSA manually
curl -X POST "http://localhost:8000/api/admin/run-absa?company_name=Valora"
```
### Issue: Duplicate Reviews Detected

**Symptoms:**

- `/data-quality` shows `potential_duplicates > 0`
- Same review text with different IDs

**Causes:**
- Scraper ran multiple times
- Review updated/re-scraped
**Solution:**

```bash
# 1. Get duplicate review IDs from /data-quality
# 2. Delete duplicates (keep the newest)
curl -X DELETE "http://localhost:8000/api/admin/review/{older_review_id}"
```
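Step 2 can be scripted from the `issues.duplicates` groups returned by `/data-quality`. This sketch assumes, and you should verify against your data before deleting anything, that each group's `review_ids` list is ordered newest-first; the helper name is hypothetical.

```python
def duplicate_ids_to_delete(duplicate_group: dict) -> list[str]:
    """Keep the first (assumed newest) review ID; return the rest for deletion."""
    review_ids = duplicate_group.get("review_ids", [])
    return review_ids[1:]
```

Each returned ID is then passed to `DELETE /api/admin/review/{id}`.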
### Issue: Empty Reviews (No Text)

**Symptoms:**

- Reviews have a rating but no text
- `empty_reviews` > 10%

**Causes:**
- Source has rating-only reviews
- Scraper text extraction failed
**Solution:**

```bash
# Check scraper logs
tail -100 backend/logs/scraping_service.log | grep "empty"

# If text extraction failed, fix the scraper's XPath/selectors.
# If the source has valid rating-only reviews, these are OK (edge case).
```
### Issue: Entity Validation False Positives

**Symptoms:**

- Valid company reviews skipped
- `validation_skipped_count` too high

**Causes:**
- Company name contains trigger words (e.g., "Apartment Therapy" magazine)
- Keywords too broad
**Solution:**

```bash
# Edit vartovii/utils/entity_validation.py:
# adjust APARTMENT_KEYWORDS or the threshold (currently score >= 3)

# Test the changes
python backend/scripts/test_validation.py
```
## Best Practices

### ✅ Do This
1. **Run ABSA After Scraping**

   ```bash
   curl -X POST "/api/admin/run-absa?company_name={company}"
   ```

2. **Check Quality Metrics Weekly**

   ```bash
   curl "/api/admin/data-quality" | jq '.summary'
   ```

3. **Monitor Scraping Logs**

   ```bash
   tail -f backend/logs/scraping_service.log
   ```
### ❌ Don't Do This

- **Don't ignore warnings** - WARNING alerts indicate potential issues
- **Don't delete reviews without checking** - verify duplicates first
- **Don't bypass validation** - entity checking prevents contamination
## API Reference

### Endpoints Summary
| Endpoint | Method | Purpose |
|---|---|---|
| `/api/admin/data-quality` | GET | Get quality metrics |
| `/api/admin/review/{id}` | DELETE | Delete a review |
| `/api/admin/run-absa` | POST | Trigger sentiment analysis |
### Error Codes
| Code | Meaning | Action |
|---|---|---|
| 404 | Review/company not found | Check spelling |
| 500 | Database error | Check logs |
| 405 | Method not allowed | Verify HTTP method |
## Implementation Timeline
| Phase | Status | Completion Date |
|---|---|---|
| Phase 1: Valora Cleanup | ✅ | 2026-02-06 |
| Phase 2: Entity Validation | ✅ | 2026-02-06 |
| Phase 3: ABSA Auto-Trigger | ✅ | 2026-02-06 |
| Phase 4: Quality Metrics | ✅ | 2026-02-06 |
| Phase 5: Alerting System | ✅ | 2026-02-06 |
| Phase 6: Documentation | ✅ | 2026-02-06 |
## Further Reading

For issues or questions, check the scraping logs or contact the development team.