Local Search System
The SignalWire Agents SDK includes a powerful local search system that provides DataSphere-compatible search functionality without external dependencies. This system uses advanced query preprocessing, local embeddings, and hybrid search techniques to enable agents to search through document collections offline.
Overview
The local search system provides:
- Offline Search: No external API calls or internet required
- Hybrid Search: Combines vector similarity and keyword search
- Document Processing: Supports multiple file formats (Markdown, PDF, DOCX, etc.)
- Smart Chunking: Intelligent document segmentation with context preservation
- Advanced Query Processing: NLP-enhanced query understanding
- Flexible Deployment: Local embedded mode or remote server mode
- SQLite Storage: Portable
.swsearch
index files
Architecture
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ Documents │───▶│ Index Builder │───▶│ .swsearch DB │
│ (MD, PDF, etc.) │ │ │ │ │
└─────────────────┘ └──────────────────┘ └─────────────────┘
│
▼
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ Agent │───▶│ Search Skill │───▶│ Search Engine │
│ │ │ │ │ │
└─────────────────┘ └──────────────────┘ └─────────────────┘
Installation Options
The search system uses optional dependencies to keep the base SDK lightweight. Choose the installation option that fits your needs:
Basic Search (~500MB)
pip install signalwire-agents[search]
Includes:
- Core search functionality
- Sentence transformers for embeddings
- SQLite FTS5 for keyword search
- Basic document processing (text, markdown)
Full Document Processing (~600MB)
pip install signalwire-agents[search-full]
Adds:
- PDF processing (PyPDF2)
- DOCX processing (python-docx)
- HTML processing (BeautifulSoup4)
- Additional file format support
Advanced NLP Features (~700MB)
pip install signalwire-agents[search-nlp]
Adds:
- spaCy for advanced text processing
- NLTK for linguistic analysis
- Enhanced query preprocessing
- Language detection
⚠️ Additional Setup Required:
python -m spacy download en_core_web_sm
Performance Note: Advanced NLP features provide significantly better query understanding, synonym expansion, and search relevance, but are 2-3x slower than basic search. Only recommended if you have sufficient CPU power and can tolerate longer response times.
NLP Backend Control: You can choose which NLP backend to use:
- NLTK (default): Fast processing, good for most use cases
- spaCy: Better quality but slower, requires model download
Configure via the nlp_backend
parameter in your search skill.
All Search Features (~700MB)
pip install signalwire-agents[search-all]
Includes everything above
⚠️ Additional Setup Required:
python -m spacy download en_core_web_sm
Performance Note: This includes advanced NLP features which improve search quality but increase response times.
Minimal Installation (Base SDK only)
pip install signalwire-agents
Search functionality will show helpful error messages when dependencies are missing.
Quick Start
1. Install Dependencies
pip install signalwire-agents[search-full]
2. Build a Search Index
# Build from the comprehensive concepts guide
sw-search docs/signalwire_agents_concepts_guide.md --output concepts.swsearch
# Build from multiple individual files
sw-search README.md docs/agent_guide.md docs/architecture.md --output knowledge.swsearch
# Build from mixed sources (files and directories)
sw-search docs/signalwire_agents_concepts_guide.md examples --file-types md,py --output comprehensive.swsearch
# Build from a directory (traditional approach)
sw-search docs --output docs.swsearch
# Include specific file types
sw-search docs --file-types md,txt,py
# Exclude patterns
sw-search docs --exclude "**/test/**,**/__pycache__/**"
3. Use in Your Agent
from signalwire_agents import AgentBase
class MyAgent(AgentBase):
def __init__(self):
super().__init__()
# Add search capability using the concepts guide
self.add_skill("native_vector_search", {
"tool_name": "search_docs",
"description": "Search the comprehensive SDK concepts guide for information",
"index_file": "concepts.swsearch",
"count": 5
})
agent = MyAgent()
agent.serve()
4. Test the Search
Ask your agent: "How do I create a new agent?" and it will search the comprehensive concepts guide to provide detailed answers.
Building Search Indexes
Search indexes are SQLite databases with the .swsearch
extension that contain processed documents, embeddings, and search metadata.
Basic Index Building
# Build index from the comprehensive concepts guide
sw-search docs/signalwire_agents_concepts_guide.md --output concepts.swsearch
# Build from multiple individual files
sw-search README.md docs/agent_guide.md docs/architecture.md --output knowledge.swsearch
# Build from mixed sources (files and directories)
sw-search docs/signalwire_agents_concepts_guide.md examples --file-types md,py --output comprehensive.swsearch
# Build from a directory (traditional approach)
sw-search docs --output docs.swsearch
# Include specific file types
sw-search docs --file-types md,txt,py
# Exclude patterns
sw-search docs --exclude "**/test/**,**/__pycache__/**"
Advanced Index Building
# Full configuration example with multiple sources
sw-search docs/signalwire_agents_concepts_guide.md ./examples README.md \
--output ./knowledge.swsearch \
--chunking-strategy sentence \
--max-sentences-per-chunk 8 \
--file-types md,txt,rst,py \
--exclude "**/test/**,**/__pycache__/**" \
--model sentence-transformers/all-mpnet-base-v2 \
--tags documentation,api \
--verbose
Supported File Types
Format | Extension | Requirements |
---|---|---|
Markdown | .md | Built-in |
Text | .txt | Built-in |
Python | .py | Built-in |
reStructuredText | .rst | Built-in |
.pdf | search-full | |
Word Documents | .docx | search-full |
HTML | .html | search-full |
JSON | .json | Built-in |
Index Structure
Each .swsearch
file contains:
- Document chunks with embeddings and metadata
- Full-text search index (SQLite FTS5)
- Configuration and model information
- Synonym cache for query expansion
Using the Search Skill
The native_vector_search
skill provides search functionality to your agents.
Basic Configuration
self.add_skill("native_vector_search", {
"tool_name": "search_knowledge",
"description": "Search the knowledge base",
"index_file": "knowledge.swsearch"
})
Advanced Configuration
NLP Backend Selection
Choose between NLTK (fast) and spaCy (better quality) for query processing:
# Fast NLTK processing (default)
self.add_skill("native_vector_search", {
"tool_name": "search_docs",
"index_file": "docs.swsearch",
"nlp_backend": "nltk" # Fast, good for most use cases
})
# Better quality spaCy processing
self.add_skill("native_vector_search", {
"tool_name": "search_docs",
"index_file": "docs.swsearch",
"nlp_backend": "spacy" # Slower but better quality, requires model download
})
Performance Comparison:
- NLTK: ~50-100ms query processing, good synonym expansion
- spaCy: ~150-300ms query processing, better POS tagging and entity recognition
Custom Embedding Models
# Use a different embedding model
self.add_skill("native_vector_search", {
"tool_name": "search_docs",
"index_file": "docs.swsearch",
"model": "sentence-transformers/all-MiniLM-L6-v2" # Smaller, faster model
})
Query Enhancement
The system automatically enhances queries using:
- Language detection
- POS tagging (with NLP dependencies)
- Synonym expansion using WordNet
- Keyword extraction
- Vector embeddings
Response Customization
self.add_skill("native_vector_search", {
"tool_name": "search_docs",
"index_file": "docs.swsearch",
# Customize responses for voice calls
"response_prefix": "Based on the documentation, here's what I found:",
"response_postfix": "Would you like me to search for more specific information?",
# Custom no-results message
"no_results_message": "I couldn't find information about '{query}'. Try rephrasing your question.",
# SWAIG function fillers for natural conversation
"swaig_fields": {
"fillers": {
"en-US": [
"Let me search the documentation",
"Checking our knowledge base",
"Looking that up for you"
]
}
}
})
Tag-Based Filtering
# Only search documents tagged with specific categories
self.add_skill("native_vector_search", {
"tool_name": "search_api_docs",
"index_file": "docs.swsearch",
"tags": ["api", "reference"], # Only search API docs
"description": "Search API reference documentation"
})
Complete Configuration Example
self.add_skill("native_vector_search", {
# Tool configuration
"tool_name": "search_docs",
"description": "Search SDK documentation for detailed information",
# Index configuration
"index_file": "docs.swsearch",
"build_index": True, # Auto-build if missing
"source_dir": "./docs", # Source for auto-build
"file_types": ["md", "txt"],
# Search parameters
"count": 5, # Number of results
"distance_threshold": 0.1, # Similarity threshold
"tags": ["documentation"], # Filter by tags
# NLP backend selection
"nlp_backend": "nltk", # or "spacy" for better quality
# Response formatting
"response_prefix": "Based on the documentation:",
"response_postfix": "Would you like more details?",
"no_results_message": "No information found for '{query}'",
# SWAIG configuration
"swaig_fields": {
"fillers": {
"en-US": ["Let me search for that", "Checking the docs"]
}
}
})
Multiple Search Instances
You can add multiple search instances for different document collections:
# Documentation search with spaCy for better quality
self.add_skill("native_vector_search", {
"tool_name": "search_docs",
"index_file": "docs.swsearch",
"nlp_backend": "spacy",
"description": "Search SDK documentation"
})
# Code examples search with NLTK for speed
self.add_skill("native_vector_search", {
"tool_name": "search_examples",
"index_file": "examples.swsearch",
"nlp_backend": "nltk",
"description": "Search code examples"
})
Local vs Remote Modes
The search skill supports both local and remote operation modes.
Local Mode (Default)
Pros:
- Faster (no network latency)
- Works offline
- Simple deployment
- Lower operational complexity
Cons:
- Higher memory usage per agent
- Index files must be distributed with each agent
- Updates require redeploying agents
Configuration:
self.add_skill("native_vector_search", {
"tool_name": "search_docs",
"index_file": "docs.swsearch", # Local file
"nlp_backend": "nltk" # Choose NLP backend
})
Remote Mode
Pros:
- Lower memory usage per agent
- Centralized index management
- Easy updates without redeploying agents
- Better scalability for multiple agents
- Shared resources
Cons:
- Network dependency
- Additional infrastructure complexity
- Potential latency
Configuration:
self.add_skill("native_vector_search", {
"tool_name": "search_docs",
"remote_url": "http://localhost:8001", # Search server
"index_name": "docs", # Index name on server
"nlp_backend": "nltk" # NLP backend for query preprocessing
})
Running a Remote Search Server
- Start the search server:
python examples/search_server_standalone.py
- The server provides HTTP API:
POST /search
- Search the indexesGET /health
- Health check and available indexesPOST /reload_index
- Add or reload an index
- Test the API:
curl -X POST "http://localhost:8001/search" \
-H "Content-Type: application/json" \
-d '{"query": "how to create an agent", "index_name": "docs", "count": 3}'
Automatic Mode Detection
The skill automatically detects which mode to use:
- If
remote_url
is provided → Remote mode - If
index_file
is provided → Local mode - Remote mode takes priority if both are specified
Advanced Configuration
Custom Embedding Models
CLI Reference
sw-search Command
sw-search <source_dir> [options]
Arguments:
source_dir
- Directory containing documents to index
Options:
--output FILE
- Output .swsearch file (default:<source_dir>.swsearch
)--chunk-size SIZE
- Chunk size in characters (default: 500)--chunk-overlap SIZE
- Overlap between chunks (default: 50)--file-types TYPES
- Comma-separated file extensions (default: md,txt,rst)--exclude PATTERNS
- Comma-separated glob patterns to exclude--languages LANGS
- Comma-separated language codes (default: en)--model MODEL
- Embedding model name (default: sentence-transformers/all-mpnet-base-v2)--tags TAGS
- Comma-separated tags to add to all chunks--verbose
- Show detailed progress information--validate
- Validate the created index after building
Subcommands:
validate - Validate Search Index
sw-search validate <index_file> [--verbose]
Validates an existing .swsearch index file and shows statistics.
search - Search Within Index
sw-search search <index_file> <query> [options]
Search within an existing .swsearch index file. This is useful for:
- Testing search quality and relevance
- Exploring index contents
- Debugging search results
- Scripting and automation
Search Options:
--count COUNT
- Number of results to return (default: 5)--distance-threshold FLOAT
- Minimum similarity score (default: 0.0)--tags TAGS
- Comma-separated tags to filter by--nlp-backend {nltk,spacy}
- NLP backend to use (default: nltk)--verbose
- Show detailed information including index stats--json
- Output results as JSON for scripting--no-content
- Hide content in results (show only metadata)
Examples:
# Build from the comprehensive concepts guide
sw-search docs/signalwire_agents_concepts_guide.md --output concepts.swsearch
# Build from multiple sources (files and directories)
sw-search docs/signalwire_agents_concepts_guide.md examples README.md \
--output comprehensive.swsearch \
--file-types md,py,txt \
--verbose
# Traditional directory-based approach
sw-search ./documentation \
--output knowledge.swsearch \
--chunking-strategy sentence \
--max-sentences-per-chunk 8 \
--file-types md,rst,txt \
--exclude "**/drafts/**" \
--tags documentation,help \
--verbose
# Validate an existing index
sw-search validate concepts.swsearch --verbose
# Search within an index
sw-search search concepts.swsearch "how to create an agent"
sw-search search concepts.swsearch "API reference" --count 3 --verbose
sw-search search concepts.swsearch "configuration" --tags documentation --json
# Use different NLP backends
sw-search search concepts.swsearch "deployment options" --nlp-backend nltk # Fast
sw-search search concepts.swsearch "deployment options" --nlp-backend spacy # Better quality
# Advanced search with filtering
sw-search search concepts.swsearch "deployment options" \
--count 10 \
--distance-threshold 0.1 \
--tags "deployment,production" \
--nlp-backend spacy \
--verbose
# JSON output for scripting
sw-search search concepts.swsearch "error handling" --json | jq '.results[0].content'
# Build multiple specialized indexes
sw-search docs/signalwire_agents_concepts_guide.md --output concepts.swsearch
sw-search examples --output examples.swsearch --file-types py,md
Index Validation
# Validate an existing index
python -c "
from signalwire_agents.search import SearchEngine
engine = SearchEngine('docs.swsearch')
print(f'Index stats: {engine.get_stats()}')
"
API Reference
SearchEngine Class
from signalwire_agents.search import SearchEngine
# Load an index
engine = SearchEngine("docs.swsearch")
# Perform search
results = engine.search(
query_vector=[...], # Optional: pre-computed query vector
enhanced_text="search query", # Enhanced query text
count=5, # Number of results
distance_threshold=0.0, # Minimum similarity score
tags=["documentation"] # Filter by tags
)
# Get index statistics
stats = engine.get_stats()
print(f"Total chunks: {stats['total_chunks']}")
print(f"Total files: {stats['total_files']}")
IndexBuilder Class
from signalwire_agents.search import IndexBuilder
# Create index builder
builder = IndexBuilder(
model_name="sentence-transformers/all-mpnet-base-v2",
chunk_size=500,
chunk_overlap=50,
verbose=True
)
# Build index
builder.build_index(
source_dir="./docs",
output_file="docs.swsearch",
file_types=["md", "txt"],
exclude_patterns=["**/test/**"],
tags=["documentation"]
)