Skip to content

Querying Data Sources

The Data Sources Manager provides multiple ways to query and access the data sources.

Using the Search Index

The most efficient way to query data sources is through the SQLite-based search index. The index is built by the index_sources.py tool and provides fast access to sources based on various criteria.

Basic Index Structure

The index contains several lookup tables:

  • category_index: Sources organized by category
  • tag_index: Sources organized by tag
  • format_index: Sources organized by format
  • quality_index: Sources organized by quality buckets
  • source_lookup: Detailed information for each source by ID

Opening the Index

from sqlitedict import SqliteDict

# Open the index
with SqliteDict("index.db") as db:
    # Use the index...
    pass

Query Examples

Finding Sources by Category

with SqliteDict("index.db") as db:
    # Get all vulnerability sources
    vulnerability_sources = db["category_index"].get("vulnerability", [])

    # Print the IDs
    print(f"Found {len(vulnerability_sources)} vulnerability sources:")
    for source_id in vulnerability_sources:
        print(f"- {source_id}")

Finding Sources by Tag

with SqliteDict("index.db") as db:
    # Get sources with a specific tag
    official_sources = db["tag_index"].get("official", [])

    # Print the IDs
    print(f"Found {len(official_sources)} official sources:")
    for source_id in official_sources:
        print(f"- {source_id}")

Finding High-Quality Sources

with SqliteDict("index.db") as db:
    # Get excellent quality sources (90-100 score)
    excellent_sources = db["quality_index"].get("excellent", [])

    # Print the IDs
    print(f"Found {len(excellent_sources)} excellent sources:")
    for source_id in excellent_sources:
        print(f"- {source_id}")

Getting Source Details

with SqliteDict("index.db") as db:
    # Get details for a specific source
    nvd = db["source_lookup"].get("nvd-cve")

    if nvd:
        print(f"National Vulnerability Database:")
        print(f"  URL: {nvd['url']}")
        print(f"  Quality: {nvd['quality_score']}")
        print(f"  Last Updated: {nvd['last_updated']}")

Finding Sources by Multiple Criteria

with SqliteDict("index.db") as db:
    # Get all vulnerability sources
    vulnerability_sources = db["category_index"].get("vulnerability", [])

    # Get all JSON format sources
    json_sources = db["format_index"].get("json", [])

    # Find vulnerability sources in JSON format
    json_vulnerability_sources = set(vulnerability_sources).intersection(json_sources)

    print(f"Found {len(json_vulnerability_sources)} JSON vulnerability sources")

    # Get details for each source
    for source_id in json_vulnerability_sources:
        source = db["source_lookup"].get(source_id)
        print(f"- {source['name']} (Quality: {source['quality_score']})")

Direct File Access

If you prefer to work directly with the source files, they are stored as JSON files in the data-sources/ directory, organized by category and subcategory:

data-sources/
└── vulnerability/
    ├── cve/
    │   ├── nvd.json
    │   └── vendor-advisory.json
    ├── exploit-db.json
    └── ...

You can read these files directly using standard file I/O:

import json
import glob
from pathlib import Path

# Find all source files
source_files = glob.glob("data-sources/**/*.json", recursive=True)

# Read each file
sources = []
for file_path in source_files:
    with open(file_path, "r") as f:
        source = json.load(f)
        sources.append(source)

# Filter sources based on criteria
high_quality_sources = [s for s in sources if s.get("quality_score", 0) >= 90]

print(f"Found {len(high_quality_sources)} high-quality sources")
for source in high_quality_sources:
    print(f"- {source['name']} (Quality: {source['quality_score']})")

Python API

The Data Sources Manager includes Python tools that can be imported and used in your own code:

from tools.index_sources import build_category_index, build_tag_index, build_source_lookup

# Load all sources
sources = load_sources()

# Build indices
category_index = build_category_index(sources)
tag_index = build_tag_index(sources)
source_lookup = build_source_lookup(sources)

# Use the indices
vulnerability_sources = category_index.get("vulnerability", [])
official_sources = tag_index.get("official", [])

# Find sources that match both criteria
for source_id in set(vulnerability_sources).intersection(official_sources):
    source = source_lookup.get(source_id)
    print(f"- {source['name']} (Quality: {source['quality_score']})")

Best Practices

  • Use the index for most queries, as it's optimized for fast lookups
  • Cache results when making multiple queries to avoid reopening the index
  • Consider quality scores when selecting sources to use
  • Check last_updated to ensure data is recent enough for your needs