Data Quality & Validation System

Last Updated: 2026-02-06

Vartovii implements automated data quality checks to ensure clean, reliable company intelligence. This guide covers validation mechanisms, quality metrics, and troubleshooting.


Overview

The Data Quality System consists of 4 layers:

┌─────────────────────────────────────┐
│ 1. Entity Type Validation │ ← Prevent contamination
│ (During scraping) │
├─────────────────────────────────────┤
│ 2. Post-Scraping Alerts │ ← Immediate feedback
│ (After job completes) │
├─────────────────────────────────────┤
│ 3. Quality Metrics API │ ← Monitoring dashboard
│ (/api/admin/data-quality) │
├─────────────────────────────────────┤
│ 4. Manual Cleanup Tools │ ← Admin actions
│ (DELETE /api/admin/review/{id}) │
└─────────────────────────────────────┘

1. Entity Type Validation

Purpose: Prevent scraping apartment/product reviews instead of company reviews

Location: vartovii/utils/entity_validation.py

How It Works

68 Detection Keywords across 3 categories:

| Category        | Keywords (Examples)                     | Use Case                           |
|-----------------|------------------------------------------|------------------------------------|
| Apartment       | apartment, flat, landlord, rent, tenant  | Prevent apartment complex reviews  |
| Product/Service | product, purchase, delivery, shipping    | Prevent e-commerce product reviews |
| Company         | employer, workplace, boss, salary        | Identify valid company reviews     |

Scoring System:

  • Apartment keyword: +2 points
  • Product keyword: +1 point
  • Company keyword: -1 point
  • Threshold: Score ≥ 3 → Skip review
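
As a rough illustration of that scoring, here is a minimal sketch with abbreviated keyword lists. The full 68-keyword set and the confidence/reason calculation live in entity_validation.py and may differ in detail:

# Minimal sketch of the scoring described above (not the actual
# implementation). Keyword lists are abbreviated; the full sets live in
# vartovii/utils/entity_validation.py.
APARTMENT_KEYWORDS = {"apartment", "flat", "landlord", "rent", "tenant"}
PRODUCT_KEYWORDS = {"product", "purchase", "delivery", "shipping"}
COMPANY_KEYWORDS = {"employer", "workplace", "boss", "salary"}

SKIP_THRESHOLD = 3  # score >= 3 -> skip the review


def score_review(title: str, text: str) -> int:
    """Apartment hit: +2, product hit: +1, company hit: -1."""
    score = 0
    for word in f"{title} {text}".lower().split():
        if word in APARTMENT_KEYWORDS:
            score += 2
        elif word in PRODUCT_KEYWORDS:
            score += 1
        elif word in COMPANY_KEYWORDS:
            score -= 1
    return score


# Example: an apartment listing scores 6 (>= 3), so it would be skipped
print(score_review("2BR flat for rent", "Landlord never fixes anything"))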

Integration

Reddit Scraper:

from vartovii.utils.entity_validation import should_skip_review

# Before yielding review
should_skip, confidence, reason = should_skip_review(
    title=post_title,
    text=review_text,
)

if should_skip:
    self.validation_skipped_count += 1
    self.logger.info(f"⚠️ Skipped: {reason} (confidence: {confidence}%)")
    continue

Google Scraper:

should_skip, confidence, reason = should_skip_review(
    title=place_name,
    text=review_text,
)

if should_skip:
    skipped_count += 1
    logger.info(f"⚠️ Skipped: {reason}")
    continue

Test Results

python backend/scripts/test_validation.py

| Test Case         | Result                       |
|-------------------|------------------------------|
| Apartment complex | ✅ Skipped (100% confidence) |
| Product review    | ✅ Skipped (67% confidence)  |
| Company review    | ✅ Kept (valid)              |

2. Post-Scraping Alerts

Purpose: Immediate feedback after scraping completes

Location: vartovii/utils/scraping_alerts.py

Severity Levels

| Level    | Icon | Meaning              | Example                        |
|----------|------|----------------------|--------------------------------|
| INFO     | ℹ️   | Normal               | "Found 50 reviews"             |
| WARNING  | ⚠️   | Attention needed     | "20% missing sentiment"        |
| ERROR    | ❌   | Problem detected     | "High empty review rate"       |
| CRITICAL | 🚨   | Urgent action needed | "Validation failed completely" |

Validation Checks

  1. Review Count

    • ERROR: 0 reviews found
    • WARNING: Below expected minimum
  2. Missing Sentiment

    • ERROR: >50% missing sentiment
    • WARNING: >20% missing sentiment
  3. Empty Reviews

    • ERROR: >30% no text
    • WARNING: >10% no text
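
A minimal sketch of how these checks could map counts to alert levels (illustrative only; the real logic lives in scraping_alerts.py and may differ):

# Illustrative mapping from scrape counts to alert messages; the actual
# checks live in vartovii/utils/scraping_alerts.py.
def check_scraping_quality(total: int, missing_sentiment: int,
                           empty: int, expected_min: int) -> list[str]:
    alerts = []
    if total == 0:
        return ["ERROR: 0 reviews found"]
    if total < expected_min:
        alerts.append(f"WARNING: {total} reviews, expected at least {expected_min}")
    if missing_sentiment / total > 0.5:
        alerts.append("ERROR: >50% missing sentiment")
    elif missing_sentiment / total > 0.2:
        alerts.append("WARNING: >20% missing sentiment")
    if empty / total > 0.3:
        alerts.append("ERROR: >30% of reviews have no text")
    elif empty / total > 0.1:
        alerts.append("WARNING: >10% of reviews have no text")
    return alerts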

Usage

from vartovii.utils.scraping_alerts import log_scraping_summary

# After scraping completes
alerts = log_scraping_summary(
    company_name="Valora",
    source="reddit",
    expected_min_reviews=10,
)

Output (Console Logs):

============================================================
📊 Scraping Summary: Valora (reddit)
============================================================
ℹ️ Scraping Successful: Found 90 reviews for Valora from reddit
Details: {'company': 'Valora', 'source': 'reddit', 'count': 90}
============================================================
Total alerts: 1
INFO: 1
============================================================

3. Quality Metrics API

Purpose: Monitor data quality across all companies

Endpoint: GET /api/admin/data-quality

Query Parameters

# All companies, last 30 days (default)
GET /api/admin/data-quality

# Specific company
GET /api/admin/data-quality?company_name=Valora

# Custom timeframe
GET /api/admin/data-quality?days=7
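
For scripted monitoring, the endpoint can also be queried from Python. A sketch using the requests library, with the localhost base URL from the examples later in this guide:

import requests

# Fetch quality metrics for one company over the last 7 days.
resp = requests.get(
    "http://localhost:8000/api/admin/data-quality",
    params={"company_name": "Valora", "days": 7},
)
resp.raise_for_status()
print(resp.json()["summary"])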

Response Structure

{
  "summary": {
    "total_reviews": 68319,
    "total_companies": 75,
    "total_sources": 6,
    "missing_sentiment": 37803,
    "missing_sentiment_pct": 55.3,
    "potential_duplicates": 20,
    "empty_reviews": 50
  },
  "recent_activity": {
    "days": 30,
    "by_source": [
      {"source": "reddit", "count": 5007, "avg_rating": 3.12},
      {"source": "kununu", "count": 115, "avg_rating": 3.46}
    ]
  },
  "sentiment_distribution": {
    "POSITIVE": 4032,
    "NEUTRAL": 3086,
    "NEGATIVE": 4553
  },
  "issues": {
    "duplicates": [
      {
        "company": "Lidl Deutschland",
        "text_preview": "Great workplace...",
        "count": 4,
        "review_ids": ["id1", "id2", ...]
      }
    ],
    "empty_reviews": [...]
  }
}

Metrics Explained

| Metric                | What It Tracks              | Action Threshold         |
|-----------------------|-----------------------------|--------------------------|
| missing_sentiment_pct | Reviews without AI analysis | >20% → Run ABSA          |
| potential_duplicates  | Same text, different IDs    | >0 → Review manually     |
| empty_reviews         | No positive/negative text   | >10% → Check scraper     |
| recent_activity       | Scraping health by source   | Low counts → Investigate |
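
These thresholds are straightforward to automate. A hypothetical helper that checks the summary block from the response above against the table's action thresholds:

# Hypothetical helper: flag summary metrics that cross the action
# thresholds in the table above. `summary` is the "summary" object
# returned by GET /api/admin/data-quality.
def flag_issues(summary: dict) -> list[str]:
    flags = []
    if summary["missing_sentiment_pct"] > 20:
        flags.append("missing sentiment >20% → run ABSA")
    if summary["potential_duplicates"] > 0:
        flags.append("potential duplicates → review manually")
    total = summary["total_reviews"]
    if total and summary["empty_reviews"] / total > 0.10:
        flags.append("empty reviews >10% → check scraper")
    return flags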

4. Manual Cleanup Tools

Delete Review

Endpoint: DELETE /api/admin/review/{review_id}

Use Cases:

  • Remove duplicate reviews
  • Delete apartment contamination
  • Clean up invalid data

Example:

curl -X DELETE "http://localhost:8000/api/admin/review/abc-123"

Response:

{
  "status": "success",
  "message": "Review abc-123 deleted",
  "aspects_deleted": 5
}

Troubleshooting Guide

Issue: High Missing Sentiment %

Symptoms:

  • /data-quality shows >50% missing sentiment
  • Trust Score calculation fails

Causes:

  1. ABSA not triggered after scraping
  2. Reviews scraped while ABSA was down

Solution:

# Trigger ABSA manually
curl -X POST "http://localhost:8000/api/admin/run-absa?company_name=Valora"

Issue: Duplicate Reviews Detected

Symptoms:

  • /data-quality shows potential_duplicates > 0
  • Same review text with different IDs

Causes:

  1. Scraper ran multiple times
  2. Review updated/re-scraped

Solution:

# 1. Get duplicate review IDs from /data-quality
# 2. Delete duplicates (keep newest)
curl -X DELETE "http://localhost:8000/api/admin/review/{older_review_id}"
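
When a duplicate group is large, the two steps can be scripted. A sketch using the requests library; note the API does not document an ordering for review_ids, so verify which ID is the newest review before running anything like this:

import requests

BASE = "http://localhost:8000"

# Delete all but one copy of each duplicate group reported by the
# quality endpoint. keep_id below is a placeholder: review_ids has no
# documented order, so confirm the newest ID manually first.
report = requests.get(f"{BASE}/api/admin/data-quality").json()

for group in report["issues"]["duplicates"]:
    ids = group["review_ids"]
    keep_id = ids[0]  # placeholder: replace with the verified newest ID
    for review_id in ids:
        if review_id != keep_id:
            resp = requests.delete(f"{BASE}/api/admin/review/{review_id}")
            print(review_id, resp.json().get("status"))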

Issue: Empty Reviews (No Text)

Symptoms:

  • Reviews have rating but no text
  • empty_reviews > 10%

Causes:

  1. Source has rating-only reviews
  2. Scraper text extraction failed

Solution:

# Check scraper logs
tail -100 backend/logs/scraping_service.log | grep "empty"

# If extraction failed, fix the scraper's XPath/selectors
# If the source genuinely has rating-only reviews, these are OK (edge case)

Issue: Entity Validation False Positives

Symptoms:

  • Valid company reviews skipped
  • validation_skipped_count too high

Causes:

  1. Company name contains trigger words (e.g., "Apartment Therapy" magazine)
  2. Keywords too broad

Solution:

# Edit vartovii/utils/entity_validation.py
# Adjust APARTMENT_KEYWORDS or threshold
# Current threshold: score >= 3

# Test changes
python backend/scripts/test_validation.py
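
Individual edge cases can also be checked directly against should_skip_review before and after a keyword change (the title and text here are illustrative):

from vartovii.utils.entity_validation import should_skip_review

# Hypothetical edge case: a company whose name contains a trigger word.
# Per the scoring above, the company keywords in the text should outweigh it.
should_skip, confidence, reason = should_skip_review(
    title="Apartment Therapy",
    text="Great employer, supportive boss and a fair salary.",
)
print(should_skip, confidence, reason)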

Best Practices

✅ Do This

  1. Run ABSA After Scraping

    curl -X POST "/api/admin/run-absa?company_name={company}"
  2. Check Quality Metrics Weekly

    curl "/api/admin/data-quality" | jq '.summary'
  3. Monitor Scraping Logs

    tail -f backend/logs/scraping_service.log

❌ Don't Do This

  1. Don't ignore warnings - WARNING alerts indicate potential issues
  2. Don't delete reviews without checking - Verify duplicates first
  3. Don't bypass validation - Entity checking prevents contamination

API Reference

Endpoints Summary

| Endpoint                | Method | Purpose                    |
|-------------------------|--------|----------------------------|
| /api/admin/data-quality | GET    | Get quality metrics        |
| /api/admin/review/{id}  | DELETE | Delete review              |
| /api/admin/run-absa     | POST   | Trigger sentiment analysis |

Error Codes

| Code | Meaning                  | Action             |
|------|--------------------------|--------------------|
| 404  | Review/company not found | Check spelling     |
| 500  | Database error           | Check logs         |
| 405  | Method not allowed       | Verify HTTP method |

Implementation Timeline

| Phase                      | Status      | Completion Date |
|----------------------------|-------------|-----------------|
| Phase 1: Valora Cleanup    | ✅ Complete | 2026-02-06      |
| Phase 2: Entity Validation | ✅ Complete | 2026-02-06      |
| Phase 3: ABSA Auto-Trigger | ✅ Complete | 2026-02-06      |
| Phase 4: Quality Metrics   | ✅ Complete | 2026-02-06      |
| Phase 5: Alerting System   | ✅ Complete | 2026-02-06      |
| Phase 6: Documentation     | ✅ Complete | 2026-02-06      |

For issues or questions, check scraping logs or contact the development team.