Architecture

This document describes the overall architecture of the Data Sources Manager, explaining how the different components work together.

System Overview

The Data Sources Manager is a file-based system for tracking, scoring, and querying data sources. It consists of the following main components:

Data Source Metadata: JSON files that describe each data source
Schemas: JSON Schema definitions that enforce consistency
Configuration: Settings for categories, scoring weights, etc.
Tools: Python scripts for fetching, scoring, and indexing
Search Index: SQLite-based index for efficient queries
Documentation: Getting-started guides, usage guides, and API references

Component Architecture

Data Flow Diagram

+----------------+     +----------------+     +----------------+
| Data Sources   |     | Fetch Sources  |     | Source Health  |
| JSON Files     | --> | Python Tool    | --> | Metrics        |
+----------------+     +----------------+     +----------------+
        |                      |                     |
        v                      v                     v
+----------------+     +----------------+     +----------------+
| Score Sources  |     | Index Sources  |     | Search Index   |
| Python Tool    | --> | Python Tool    | --> | (SQLite)       |
+----------------+     +----------------+     +----------------+
        ^                      ^                     |
        |                      |                     v
+----------------+     +----------------+     +----------------+
| Configuration  |     | Schemas        |     | Query API      |
| JSON Files     |     | JSON Schema    |     | (Python)       |
+----------------+     +----------------+     +----------------+

Directory Structure

data-sources-manager/
├── data-sources/                # Data source metadata files
│   └── vulnerability/           # Grouped by category
│       ├── cve/                 # Subcategories
│       │   ├── nvd.json
│       │   ├── vendor-advisory.json
│       │   └── …
│       ├── exploit-db.json
│       └── …
├── schemas/                     # JSON schema definitions
│   ├── source.schema.json       # Source metadata schema
│   └── quality.schema.json      # Quality scoring schema
├── config/                      # Configuration files
│   ├── categories.json          # Category definitions
│   └── scoring-config.json      # Quality scoring weights
├── tools/                       # Python utilities
│   ├── fetch_sources.py         # Update source data
│   ├── score_sources.py         # Calculate quality scores
│   └── index_sources.py         # Build search index
├── docs/                        # Documentation
│   ├── index.md                 # Home page
│   ├── getting-started/         # Getting started guides
│   ├── usage/                   # Usage guides
│   └── api/                     # API references
└── .github/workflows/           # CI/CD automation
    ├── update-sources.yml       # Daily source updates
    └── lint-schemas.yml         # Schema validation
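
The layout above suggests that a source's category and subcategory can be read from its file path. The following is a minimal sketch under that assumption; `load_sources` is a hypothetical helper for illustration, not one of the project's tools, and it assumes every file under data-sources/ holds a single JSON object.

```python
import json
from pathlib import Path


def load_sources(root):
    """Yield (category, subcategory, metadata) for each JSON file under root.

    Assumes the layout root/<category>/[<subcategory>/]<name>.json.
    """
    root = Path(root)
    for path in sorted(root.rglob("*.json")):
        parts = path.relative_to(root).parts
        # parts[-1] is the file name; anything before it is a directory level.
        category = parts[0] if len(parts) > 1 else None
        subcategory = parts[1] if len(parts) > 2 else None
        with path.open(encoding="utf-8") as handle:
            yield category, subcategory, json.load(handle)
```

Under this sketch, a file at vulnerability/cve/nvd.json would be reported with category "vulnerability" and subcategory "cve", while vulnerability/exploit-db.json would have no subcategory.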

Key Components

1. Data Source Metadata

Each data source is represented by a JSON file that follows the source.schema.json schema. These files contain metadata about the source, including:

Basic information (ID, name, URL, description)
Categorization (category, sub-category, tags)
Quality metrics (quality_score, last_updated)
Format and other attributes
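
To make the field groups above concrete, here is an illustrative record built in Python and printed as JSON. The field names follow the list above, but their exact spelling (for example `sub_category`), the timestamp format, and all values are assumptions; the authoritative definition is source.schema.json.

```python
import json

# Hypothetical record: key spellings and values are invented for illustration.
example_source = {
    # Basic information
    "id": "example-source",
    "name": "Example Source",
    "url": "https://example.com/feed",
    "description": "Illustrative entry, not a real data source.",
    # Categorization
    "category": "vulnerability",
    "sub_category": "cve",
    "tags": ["example"],
    # Quality metrics
    "quality_score": 0,
    "last_updated": "2024-01-01T00:00:00Z",
    # Format
    "format": "json",
}

print(json.dumps(example_source, indent=2))
```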

2. JSON Schemas

The system uses JSON Schema to enforce consistency across all data source files:

source.schema.json: Defines the structure for data source metadata
quality.schema.json: Defines the structure for quality scoring attributes
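
The idea of schema enforcement can be sketched with a plain required-field check. This is a simplified stand-in for JSON Schema validation, not the project's validator: `REQUIRED_FIELDS` and `find_problems` are hypothetical names, and the field list is an assumed subset of what source.schema.json might require. A real check would load the schema file and run it through a JSON Schema validator.

```python
# Hypothetical subset of the fields source.schema.json might require.
REQUIRED_FIELDS = {
    "id": str,
    "name": str,
    "url": str,
    "category": str,
    "tags": list,
}


def find_problems(source):
    """Return a list of human-readable problems; an empty list means the check passed."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in source:
            problems.append(f"missing field: {field}")
        elif not isinstance(source[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    return problems


incomplete = {"id": "nvd", "name": "NVD", "url": "https://nvd.nist.gov", "category": "vulnerability"}
print(find_problems(incomplete))  # ['missing field: tags']
```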

3. Configuration

Configuration files in the config/ directory control various aspects of the system:

categories.json: Defines categories and tags for organizing sources
scoring-config.json: Sets the quality-scoring weights and the update schedules

4. Python Tools

The system includes several Python tools for managing data sources:

fetch_sources.py: Fetches data from each source, checks its health, and updates its metadata
score_sources.py: Calculates quality scores based on freshness, authority, etc.
index_sources.py: Builds a search index for efficient queries
validate_sources.py: Validates source files against the schemas
validate_api_keys.py: Validates API keys for external services
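
As an illustration of how a score based on freshness and authority might be computed, here is a weighted-sum sketch. It is not the logic of score_sources.py: the weights, the linear one-year freshness decay, the 0-100 scale, and the function names are all assumptions made for the example.

```python
from datetime import datetime, timezone

# Hypothetical weights; the real values would come from config/scoring-config.json.
WEIGHTS = {"freshness": 0.6, "authority": 0.4}


def freshness(last_updated, now, max_age_days=365):
    """Map age to [0, 1]: 1.0 if just updated, 0.0 at max_age_days or older."""
    age_days = (now - last_updated).total_seconds() / 86400
    return min(1.0, max(0.0, 1.0 - age_days / max_age_days))


def quality_score(last_updated, authority, now=None):
    """Weighted sum of component scores, scaled to 0-100.

    Expects timezone-aware datetimes and an authority value in [0, 1].
    """
    now = now or datetime.now(timezone.utc)
    components = {
        "freshness": freshness(last_updated, now),
        "authority": authority,
    }
    total = sum(WEIGHTS[name] * value for name, value in components.items())
    # Dividing by the weight total keeps the scale stable if weights do not sum to 1.
    return round(100 * total / sum(WEIGHTS.values()), 1)
```

With these assumed weights, a source updated at the moment of scoring with an authority of 0.5 would score 80.0, and a fully authoritative source last updated two years earlier would score 40.0.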

5. Search Index

The search index is a SQLite database (index.db) that supports lookups by category, tag, format, quality bucket, and source ID:

category_index: Sources organized by category
tag_index: Sources organized by tag
format_index: Sources organized by format
quality_index: Sources organized by quality buckets
source_lookup: Detailed information for each source by ID
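
The sketch below shows one way such an index could be laid out and queried with Python's built-in sqlite3 module. The names `category_index` and `source_lookup` come from the list above, but treating them as tables, the column layout, the sample rows, and the `sources_in_category` helper are assumptions for illustration only.

```python
import sqlite3

# In-memory database for the example; the real index would be a file such as index.db.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE source_lookup (id TEXT PRIMARY KEY, name TEXT, quality_score REAL);
    CREATE TABLE category_index (
        category TEXT,
        source_id TEXT REFERENCES source_lookup(id)
    );
    CREATE INDEX idx_category ON category_index (category);
    """
)

# Invented sample rows.
conn.executemany(
    "INSERT INTO source_lookup VALUES (?, ?, ?)",
    [("source-a", "Source A", 90.0), ("source-b", "Source B", 55.0)],
)
conn.executemany(
    "INSERT INTO category_index VALUES (?, ?)",
    [("vulnerability", "source-a"), ("vulnerability", "source-b")],
)


def sources_in_category(conn, category, min_score=0.0):
    """Return (id, quality_score) pairs for a category, best score first."""
    rows = conn.execute(
        """
        SELECT s.id, s.quality_score
        FROM category_index AS c
        JOIN source_lookup AS s ON s.id = c.source_id
        WHERE c.category = ? AND s.quality_score >= ?
        ORDER BY s.quality_score DESC
        """,
        (category, min_score),
    )
    return rows.fetchall()


print(sources_in_category(conn, "vulnerability", min_score=60))  # [('source-a', 90.0)]
```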

6. CI/CD Automation

GitHub Actions workflows automate various tasks:

update-sources.yml: Runs daily to fetch the latest data, update quality scores, and rebuild the index
lint-schemas.yml: Validates the JSON schemas on every push and pull request

Data Flow

The typical data flow in the system is as follows:

Source Addition: A contributor adds a new data source JSON file
Validation: The validate_sources.py tool validates the file against the schemas
Fetching: The fetch_sources.py tool checks the source's health and updates metadata
Scoring: The score_sources.py tool calculates quality scores based on freshness, authority, etc.
Indexing: The index_sources.py tool builds the search index
Querying: Users query the search index to find relevant sources
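
The stages above can be sketched as a chain of small functions. Every function here is a hypothetical stand-in with deliberately trivial logic, meant only to show how the output of one stage feeds the next; the real work is done by the tools named in the steps.

```python
def validate(source):
    """Stand-in for schema validation: require an id and a url."""
    return all(key in source for key in ("id", "url"))


def fetch(source):
    """Stand-in for a health check; a real tool would contact source['url']."""
    return {**source, "healthy": True}


def score(source):
    """Stand-in scoring rule: healthy sources score 100, others 0."""
    return {**source, "quality_score": 100 if source["healthy"] else 0}


def build_index(sources):
    """Group source ids by category, like a tiny in-memory category index."""
    index = {}
    for source in sources:
        category = source.get("category", "uncategorized")
        index.setdefault(category, []).append(source["id"])
    return index


def run_pipeline(raw_sources):
    """Validate, fetch, score, then index; invalid sources are dropped."""
    valid = [s for s in raw_sources if validate(s)]
    scored = [score(fetch(s)) for s in valid]
    return build_index(scored)
```

Keeping each stage as a separate function mirrors the separate-tools design: any stage can be rerun or replaced without touching the others.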

Design Principles

The Data Sources Manager follows these design principles:

Simplicity: Simple file-based storage with minimal dependencies
Consistency: Strict schema validation to ensure data quality
Modularity: Separate tools for different functions
Efficiency: SQLite-based search index for fast queries
Extensibility: Easy to add new sources and categories
Automation: Automated updates and validation

Future Enhancements

Potential future enhancements to the architecture include:

API Server: A REST API for remote access to the data
Web Interface: A web-based UI for browsing and managing sources
Analytics: Advanced analytics on source quality and usage
Notification System: Alerts for source changes or problems
Integration: Direct integration with downstream systems

Conclusion

The Data Sources Manager's architecture is designed to be simple, flexible, and efficient, making it easy to track, score, and query data sources for LLM-based projects.