Querying Data Sources¶
The Data Sources Manager provides three ways to query and access data sources: the search index, direct access to the JSON source files, and the Python API.
Using the Search Index¶
The most efficient way to query data sources is through the SQLite-based search index. The index is built by the index_sources.py
tool and provides fast access to sources based on various criteria.
Basic Index Structure¶
The index contains several lookup tables:
- category_index: Sources organized by category
- tag_index: Sources organized by tag
- format_index: Sources organized by format
- quality_index: Sources organized by quality bucket
- source_lookup: Detailed information for each source, keyed by ID
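As a rough illustration of the layout described above, the sketch below models the five tables as plain dictionaries. It assumes each lookup table maps a key (category, tag, format, or quality bucket) to a list of source IDs, which is how the query examples in this guide use them. The field values under source_lookup are placeholders rather than real data, and dangling_ids is a hypothetical helper, not part of the tool.

```python
# Illustrative stand-in for the index: plain dicts instead of SQLite tables.
# IDs and field values are placeholders, not real data.
index = {
    "category_index": {"vulnerability": ["nvd-cve"]},
    "tag_index": {"official": ["nvd-cve"]},
    "format_index": {"json": ["nvd-cve"]},
    "quality_index": {"excellent": ["nvd-cve"]},
    "source_lookup": {
        "nvd-cve": {
            "name": "National Vulnerability Database",
            "url": "https://example.invalid/nvd",
            "quality_score": 95,
            "last_updated": "2024-01-01",
        }
    },
}


def dangling_ids(index):
    """Return IDs referenced by a lookup table but missing from source_lookup."""
    known = set(index["source_lookup"])
    referenced = set()
    for table_name, table in index.items():
        if table_name == "source_lookup":
            continue
        for ids in table.values():
            referenced.update(ids)
    return referenced - known


print(dangling_ids(index))
```

A check like this can be useful after rebuilding the index, since every ID in a lookup table is expected to resolve to an entry in source_lookup.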
Opening the Index¶
from sqlitedict import SqliteDict

# Open the index; the context manager closes it on exit
with SqliteDict("index.db") as db:
    # Use the index...
    pass
Query Examples¶
Finding Sources by Category¶
with SqliteDict("index.db") as db:
    # Get all vulnerability sources
    vulnerability_sources = db["category_index"].get("vulnerability", [])

    # Print the IDs
    print(f"Found {len(vulnerability_sources)} vulnerability sources:")
    for source_id in vulnerability_sources:
        print(f"- {source_id}")
Finding Sources by Tag¶
with SqliteDict("index.db") as db:
    # Get sources with a specific tag
    official_sources = db["tag_index"].get("official", [])

    # Print the IDs
    print(f"Found {len(official_sources)} official sources:")
    for source_id in official_sources:
        print(f"- {source_id}")
Finding High-Quality Sources¶
with SqliteDict("index.db") as db:
    # Get excellent quality sources (90-100 score)
    excellent_sources = db["quality_index"].get("excellent", [])

    # Print the IDs
    print(f"Found {len(excellent_sources)} excellent sources:")
    for source_id in excellent_sources:
        print(f"- {source_id}")
Getting Source Details¶
with SqliteDict("index.db") as db:
    # Get details for a specific source (None if the ID is not indexed)
    nvd = db["source_lookup"].get("nvd-cve")
    if nvd:
        print("National Vulnerability Database:")
        print(f" URL: {nvd['url']}")
        print(f" Quality: {nvd['quality_score']}")
        print(f" Last Updated: {nvd['last_updated']}")
Finding Sources by Multiple Criteria¶
with SqliteDict("index.db") as db:
    # Get all vulnerability sources
    vulnerability_sources = db["category_index"].get("vulnerability", [])

    # Get all JSON format sources
    json_sources = db["format_index"].get("json", [])

    # Find vulnerability sources in JSON format
    json_vulnerability_sources = set(vulnerability_sources).intersection(json_sources)
    print(f"Found {len(json_vulnerability_sources)} JSON vulnerability sources")

    # Read the lookup table once instead of on every loop iteration
    source_lookup = db["source_lookup"]

    # Get details for each source, skipping IDs missing from the lookup
    for source_id in json_vulnerability_sources:
        source = source_lookup.get(source_id)
        if source:
            print(f"- {source['name']} (Quality: {source['quality_score']})")
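The intersection pattern above generalizes to any combination of criteria. The following is a minimal sketch of such a helper, not part of the Data Sources Manager itself: the name find_sources and its parameters are hypothetical, and the demo data is a placeholder standing in for an opened index.

```python
def find_sources(index, category=None, tag=None, fmt=None, quality=None):
    """Return the set of source IDs matching every criterion that is given."""
    criteria = [
        ("category_index", category),
        ("tag_index", tag),
        ("format_index", fmt),
        ("quality_index", quality),
    ]
    result = None
    for table_name, key in criteria:
        if key is None:
            continue  # criterion not requested
        ids = set(index[table_name].get(key, []))
        result = ids if result is None else result & ids
    # With no criteria at all, there is nothing to intersect
    return result if result is not None else set()


# Placeholder data standing in for an opened index (illustrative IDs only)
demo_index = {
    "category_index": {"vulnerability": ["nvd-cve", "exploit-db"]},
    "tag_index": {"official": ["nvd-cve"]},
    "format_index": {"json": ["nvd-cve"]},
    "quality_index": {"excellent": ["nvd-cve"]},
}

matches = find_sources(demo_index, category="vulnerability", fmt="json")
print(sorted(matches))
```

Because the helper relies only on item access and .get, it should accept an opened SqliteDict as well as a plain dictionary.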
Direct File Access¶
If you prefer to work directly with the source files, they are stored as JSON files in the data-sources/
directory, organized by category and subcategory:
data-sources/
└── vulnerability/
    ├── cve/
    │   ├── nvd.json
    │   └── vendor-advisory.json
    ├── exploit-db.json
    └── ...
You can read these files directly using standard file I/O:
import glob
import json

# Find all source files, including those in nested subcategory directories
source_files = glob.glob("data-sources/**/*.json", recursive=True)

# Read each file
sources = []
for file_path in source_files:
    with open(file_path, "r", encoding="utf-8") as f:
        source = json.load(f)
        sources.append(source)

# Filter sources based on criteria (sources without a score count as 0)
high_quality_sources = [s for s in sources if s.get("quality_score", 0) >= 90]
print(f"Found {len(high_quality_sources)} high-quality sources")
for source in high_quality_sources:
    print(f"- {source['name']} (Quality: {source['quality_score']})")
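When reading files directly, a single malformed file would stop the loop above with an exception. A more defensive variant might look like the sketch below, which uses pathlib and records files that fail to load instead of aborting. The function name read_source_files is hypothetical and not part of the project's tools.

```python
import json
from pathlib import Path


def read_source_files(root):
    """Load every *.json file under root, skipping files that fail to load."""
    sources, skipped = [], []
    for path in sorted(Path(root).rglob("*.json")):
        try:
            with path.open("r", encoding="utf-8") as f:
                sources.append(json.load(f))
        except (ValueError, OSError) as exc:
            # ValueError covers json.JSONDecodeError and bad encodings
            skipped.append((path, exc))
    return sources, skipped


sources, skipped = read_source_files("data-sources")
print(f"Loaded {len(sources)} sources, skipped {len(skipped)} files")
```

Returning the skipped paths alongside the loaded sources keeps the failure visible to the caller, who can decide whether to log it or treat it as an error.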
Python API¶
The Data Sources Manager includes Python tools that can be imported and used in your own code:
# Assumption: load_sources is exported by tools.index_sources alongside the builders
from tools.index_sources import (
    build_category_index,
    build_source_lookup,
    build_tag_index,
    load_sources,
)

# Load all sources
sources = load_sources()

# Build indices
category_index = build_category_index(sources)
tag_index = build_tag_index(sources)
source_lookup = build_source_lookup(sources)

# Use the indices
vulnerability_sources = category_index.get("vulnerability", [])
official_sources = tag_index.get("official", [])

# Find sources that match both criteria
for source_id in set(vulnerability_sources).intersection(official_sources):
    source = source_lookup.get(source_id)
    if source:
        print(f"- {source['name']} (Quality: {source['quality_score']})")
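The internals of the builder functions are not shown in this guide. As a rough mental model only, a category or tag index can be produced by grouping source IDs under each key. The sketch below is not the code in tools/index_sources.py: the helper name group_ids_by and the field names id, category, and tags are assumptions made for illustration.

```python
from collections import defaultdict


def group_ids_by(sources, field):
    """Group source IDs by the value(s) of one field (hypothetical helper)."""
    grouped = defaultdict(list)
    for source in sources:
        value = source.get(field)
        if value is None:
            continue
        # A field may hold one value ("category") or a list ("tags")
        keys = value if isinstance(value, list) else [value]
        for key in keys:
            grouped[key].append(source["id"])
    return dict(grouped)


# Placeholder records; field names are assumed for illustration
demo_sources = [
    {"id": "nvd-cve", "category": "vulnerability", "tags": ["official", "cve"]},
    {"id": "exploit-db", "category": "vulnerability", "tags": ["exploits"]},
]

print(group_ids_by(demo_sources, "category"))
print(group_ids_by(demo_sources, "tags"))
```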
Best Practices¶
- Use the index for most queries, as it's optimized for fast lookups
- Cache results when making multiple queries to avoid reopening the index
- Consider quality scores when selecting sources to use
- Check last_updated to ensure data is recent enough for your needs
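The last two practices can be combined into one filter. The sketch below is a minimal example under stated assumptions: it treats last_updated as an ISO 8601 string (the actual stored format is not specified here), treats timestamps without a time zone as UTC, and uses hypothetical helper names and placeholder records.

```python
from datetime import datetime, timedelta, timezone


def is_fresh(source, max_age_days, now=None):
    """Return True if last_updated is within max_age_days of now."""
    now = now or datetime.now(timezone.utc)
    raw = source.get("last_updated")
    if not raw:
        return False  # no timestamp: treat as stale
    updated = datetime.fromisoformat(raw)  # assumes ISO 8601 text
    if updated.tzinfo is None:
        updated = updated.replace(tzinfo=timezone.utc)  # assume UTC
    return now - updated <= timedelta(days=max_age_days)


def select_sources(sources, min_quality=90, max_age_days=30, now=None):
    """Keep sources that meet both the quality and freshness thresholds."""
    return [
        s for s in sources
        if s.get("quality_score", 0) >= min_quality
        and is_fresh(s, max_age_days, now)
    ]


# Placeholder records and a fixed reference time for a repeatable demo
fixed_now = datetime(2024, 6, 1, tzinfo=timezone.utc)
demo = [
    {"name": "recent-good", "quality_score": 95, "last_updated": "2024-05-20T00:00:00+00:00"},
    {"name": "stale-good", "quality_score": 95, "last_updated": "2023-01-01"},
    {"name": "recent-poor", "quality_score": 40, "last_updated": "2024-05-30"},
]
print([s["name"] for s in select_sources(demo, now=fixed_now)])
```

The thresholds of 90 points and 30 days are arbitrary defaults for the example; choose values that match how current your use case needs the data to be.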