# Architecture
This document describes the overall architecture of the Data Sources Manager, explaining how the different components work together.
## System Overview
The Data Sources Manager is designed as a simple yet efficient system for tracking, scoring, and querying data sources. It consists of the following main components:
- Data Source Metadata: JSON files that describe each data source
- Schemas: JSON Schema definitions that enforce consistency
- Configuration: Settings for categories, scoring weights, etc.
- Tools: Python scripts for fetching, scoring, and indexing
- Search Index: SQLite-based index for efficient queries
- Documentation: Comprehensive guides and references
## Component Architecture
### Data Flow Diagram
```
+----------------+     +----------------+     +----------------+
|  Data Sources  |     | Fetch Sources  |     | Source Health  |
|   JSON Files   | --> |  Python Tool   | --> |    Metrics     |
+----------------+     +----------------+     +----------------+
        |                      |                      |
        v                      v                      v
+----------------+     +----------------+     +----------------+
| Score Sources  |     | Index Sources  |     |  Search Index  |
|  Python Tool   | --> |  Python Tool   | --> |    (SQLite)    |
+----------------+     +----------------+     +----------------+
        ^                      ^                      |
        |                      |                      v
+----------------+     +----------------+     +----------------+
| Configuration  |     |    Schemas     |     |   Query API    |
|  JSON Files    |     |  JSON Schema   |     |   (Python)     |
+----------------+     +----------------+     +----------------+
```
### Directory Structure
```
data-sources-manager/
├── data-sources/              # Data source metadata files
│   └── vulnerability/         # Grouped by category
│       ├── cve/               # Subcategories
│       │   ├── nvd.json
│       │   ├── vendor-advisory.json
│       │   └── …
│       ├── exploit-db.json
│       └── …
├── schemas/                   # JSON schema definitions
│   ├── source.schema.json     # Source metadata schema
│   └── quality.schema.json    # Quality scoring schema
├── config/                    # Configuration files
│   ├── categories.json        # Category definitions
│   └── scoring-config.json    # Quality scoring weights
├── tools/                     # Python utilities
│   ├── fetch_sources.py       # Update source data
│   ├── score_sources.py       # Calculate quality scores
│   └── index_sources.py       # Build search index
├── docs/                      # Documentation
│   ├── index.md               # Home page
│   ├── getting-started/       # Getting started guides
│   ├── usage/                 # Usage guides
│   └── api/                   # API references
└── .github/workflows/         # CI/CD automation
    ├── update-sources.yml     # Daily source updates
    └── lint-schemas.yml       # Schema validation
```
## Key Components
### 1. Data Source Metadata
Each data source is represented by a JSON file that follows the `source.schema.json` schema. These files contain metadata about the source, including:
- Basic information (ID, name, URL, description)
- Categorization (category, sub-category, tags)
- Quality metrics (quality_score, last_updated)
- Format and other attributes
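A metadata file might look like the following. This is a sketch: the field names are taken from the list above, but the exact shape is defined by `source.schema.json`, and the values here (the NVD entry) are illustrative.

```json
{
  "id": "nvd",
  "name": "National Vulnerability Database",
  "url": "https://nvd.nist.gov/",
  "description": "NIST's repository of CVE vulnerability data",
  "category": "vulnerability",
  "sub_category": "cve",
  "tags": ["cve", "nist", "official"],
  "format": "json",
  "quality_score": 0.95,
  "last_updated": "2024-01-15T00:00:00Z"
}
```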
### 2. JSON Schemas
The system uses JSON Schema to enforce consistency across all data source files:
- `source.schema.json`: Defines the structure for data source metadata
- `quality.schema.json`: Defines the structure for quality scoring attributes
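The checks a JSON Schema validator enforces can be sketched in plain Python. This is a minimal stand-in, not the real validation logic: the actual tooling validates against `source.schema.json` itself, and the required fields and bounds below are assumptions based on the metadata fields listed above.

```python
def is_valid_source(source: dict) -> bool:
    """Minimal structural check mirroring what schema validation enforces.

    The required fields and the [0, 1] bound on quality_score are
    illustrative assumptions; the authoritative rules live in
    source.schema.json.
    """
    required = {"id": str, "name": str, "url": str, "category": str}
    for field, ftype in required.items():
        if not isinstance(source.get(field), ftype):
            return False
    score = source.get("quality_score")
    if score is not None:
        if not isinstance(score, (int, float)) or not 0 <= score <= 1:
            return False
    return True
```

In practice a library such as `jsonschema` would be pointed at the schema file directly rather than hand-coding checks like these.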
### 3. Configuration
Configuration files in the `config/` directory control various aspects of the system:

- `categories.json`: Defines categories and tags for organizing sources
- `scoring-config.json`: Sets weights for quality scoring and update schedules
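Weighted scoring of this kind typically reduces to a weighted average over per-factor scores. The sketch below shows the idea; the factor names and weight values are hypothetical, since the real weights are defined in `scoring-config.json`.

```python
def quality_score(factors: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-factor scores, each assumed to be in [0, 1]."""
    total_weight = sum(weights.values())
    return sum(factors[name] * w for name, w in weights.items()) / total_weight

# Hypothetical weights and factors; the real weights live in
# config/scoring-config.json.
weights = {"freshness": 0.4, "authority": 0.4, "coverage": 0.2}
factors = {"freshness": 0.9, "authority": 1.0, "coverage": 0.5}
score = quality_score(factors, weights)  # 0.9*0.4 + 1.0*0.4 + 0.5*0.2 = 0.86
```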
### 4. Python Tools
The system includes several Python tools for managing data sources:
- `fetch_sources.py`: Fetches data from sources and updates metadata
- `score_sources.py`: Calculates quality scores based on freshness, authority, etc.
- `index_sources.py`: Builds a search index for efficient queries
- `validate_sources.py`: Validates source files against the schemas
- `validate_api_keys.py`: Validates API keys for external services
### 5. Search Index
The search index is a SQLite database (`index.db`) that provides fast access to sources based on various criteria:

- `category_index`: Sources organized by category
- `tag_index`: Sources organized by tag
- `format_index`: Sources organized by format
- `quality_index`: Sources organized by quality buckets
- `source_lookup`: Detailed information for each source by ID
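Querying such an index needs only the standard-library `sqlite3` module. The table name below comes from the list above, but the column layout is an assumption (the real schema is whatever `index_sources.py` builds), and an in-memory database stands in for `index.db`.

```python
import sqlite3

# In-memory database standing in for index.db; column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE category_index (category TEXT, source_id TEXT)")
conn.execute("INSERT INTO category_index VALUES ('vulnerability', 'nvd')")
conn.execute("INSERT INTO category_index VALUES ('vulnerability', 'exploit-db')")

# Fetch all source IDs in a given category.
rows = conn.execute(
    "SELECT source_id FROM category_index WHERE category = ?",
    ("vulnerability",),
).fetchall()
source_ids = [row[0] for row in rows]
```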
### 6. CI/CD Automation
GitHub Actions workflows automate various tasks:
- `update-sources.yml`: Runs daily to fetch the latest data, update quality scores, and rebuild the index
- `lint-schemas.yml`: Validates the JSON schemas on every push and pull request
## Data Flow
The typical data flow in the system is as follows:
1. Source Addition: A contributor adds a new data source JSON file
2. Validation: The file is validated against the schema
3. Fetching: The `fetch_sources.py` tool checks the source's health and updates metadata
4. Scoring: The `score_sources.py` tool calculates quality scores based on freshness, authority, etc.
5. Indexing: The `index_sources.py` tool builds the search index
6. Querying: Users query the search index to find relevant sources
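The steps above can be compressed into a small end-to-end sketch: validate each record, attach a score, and write it into the index. Everything here is a stand-in for the real tools (the validation rule, the scoring step, and the `source_lookup` column layout are all assumptions), but it shows how the stages hand data to each other.

```python
import sqlite3

def run_pipeline(sources: list[dict]) -> sqlite3.Connection:
    """Validate, score, and index source records (illustrative stand-in
    for the fetch/score/index tools)."""
    conn = sqlite3.connect(":memory:")  # stand-in for index.db
    conn.execute(
        "CREATE TABLE source_lookup (id TEXT PRIMARY KEY, quality_score REAL)"
    )
    for src in sources:
        if "id" not in src:                     # validation step (simplified)
            continue
        score = src.get("quality_score", 0.0)   # scoring step (simplified)
        conn.execute(
            "INSERT INTO source_lookup VALUES (?, ?)", (src["id"], score)
        )
    return conn
```

Records that fail validation are simply skipped here; the real tools would more likely report an error so the contributor can fix the file.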
## Design Principles
The Data Sources Manager follows these design principles:
- Simplicity: Simple file-based storage with minimal dependencies
- Consistency: Strict schema validation to ensure data quality
- Modularity: Separate tools for different functions
- Efficiency: Optimized search index for fast queries
- Extensibility: Easy to add new sources and categories
- Automation: Automated updates and validation
## Future Enhancements
Potential future enhancements to the architecture include:
- API Server: A REST API for remote access to the data
- Web Interface: A web-based UI for browsing and managing sources
- Analytics: Advanced analytics on source quality and usage
- Notification System: Alerts for source changes or problems
- Integration: Direct integration with downstream systems
## Conclusion
The Data Sources Manager's architecture is designed to be simple, flexible, and efficient, making it easy to track, score, and query data sources for LLM-based projects.