Local Search System

The SignalWire Agents SDK includes a powerful local search system that provides DataSphere-compatible search functionality without external dependencies. This system uses advanced query preprocessing, local embeddings, and hybrid search techniques to enable agents to search through document collections offline.

Overview

The local search system provides:

Offline Search: No external API calls or internet required
Hybrid Search: Combines vector similarity and keyword search
Document Processing: Supports multiple file formats (Markdown, PDF, DOCX, etc.)
Smart Chunking: Intelligent document segmentation with context preservation
Advanced Query Processing: NLP-enhanced query understanding
Flexible Deployment: Local embedded mode or remote server mode
SQLite Storage: Portable .swsearch index files

Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Documents     │───▶│   Index Builder  │───▶│  .swsearch DB   │
│ (MD, PDF, etc.) │    │                  │    │                 │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                                         │
                                                         ▼
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│     Agent       │───▶│  Search Skill    │───▶│  Search Engine  │
│                 │    │                  │    │                 │
└─────────────────┘    └──────────────────┘    └─────────────────┘

Installation Options

The search system uses optional dependencies to keep the base SDK lightweight. Choose the installation option that fits your needs:

Basic Search (~500MB)

pip install signalwire-agents[search]

Includes:

Core search functionality
Sentence transformers for embeddings
SQLite FTS5 for keyword search
Basic document processing (text, markdown)

Full Document Processing (~600MB)

pip install signalwire-agents[search-full]

Adds:

PDF processing (PyPDF2)
DOCX processing (python-docx)
HTML processing (BeautifulSoup4)
Additional file format support

Advanced NLP Features (~700MB)

pip install signalwire-agents[search-nlp]

Adds:

spaCy for advanced text processing
NLTK for linguistic analysis
Enhanced query preprocessing
Language detection

⚠️ Additional Setup Required:

python -m spacy download en_core_web_sm

Performance Note: Advanced NLP features provide significantly better query understanding, synonym expansion, and search relevance, but are 2-3x slower than basic search. Only recommended if you have sufficient CPU power and can tolerate longer response times.

NLP Backend Control: You can choose which NLP backend to use:

NLTK (default): Fast processing, good for most use cases
spaCy: Better quality but slower, requires model download

Configure via the nlp_backend parameter in your search skill.

All Search Features (~700MB)

pip install signalwire-agents[search-all]

Includes everything above

⚠️ Additional Setup Required:

python -m spacy download en_core_web_sm

Performance Note: This includes advanced NLP features which improve search quality but increase response times.

Minimal Installation (Base SDK only)

pip install signalwire-agents

Search functionality will show helpful error messages when dependencies are missing.

Quick Start

1. Install Dependencies

pip install signalwire-agents[search-full]

2. Build a Search Index

# Build from the comprehensive concepts guide
sw-search docs/signalwire_agents_concepts_guide.md --output concepts.swsearch

# Build from multiple individual files
sw-search README.md docs/agent_guide.md docs/architecture.md --output knowledge.swsearch

# Build from mixed sources (files and directories)
sw-search docs/signalwire_agents_concepts_guide.md examples --file-types md,py --output comprehensive.swsearch

# Build from a directory (traditional approach)
sw-search docs --output docs.swsearch

# Include specific file types
sw-search docs --file-types md,txt,py

# Exclude patterns
sw-search docs --exclude "**/test/**,**/__pycache__/**"

3. Use in Your Agent

from signalwire_agents import AgentBase

class MyAgent(AgentBase):
    def __init__(self):
        super().__init__()
        
        # Add search capability using the concepts guide
        self.add_skill("native_vector_search", {
            "tool_name": "search_docs",
            "description": "Search the comprehensive SDK concepts guide for information",
            "index_file": "concepts.swsearch",
            "count": 5
        })

agent = MyAgent()
agent.serve()

4. Test the Search

Ask your agent: "How do I create a new agent?" and it will search the comprehensive concepts guide to provide detailed answers.

Building Search Indexes

Search indexes are SQLite databases with the .swsearch extension that contain processed documents, embeddings, and search metadata.

Basic Index Building

# Build index from the comprehensive concepts guide
sw-search docs/signalwire_agents_concepts_guide.md --output concepts.swsearch

# Build from multiple individual files
sw-search README.md docs/agent_guide.md docs/architecture.md --output knowledge.swsearch

# Build from mixed sources (files and directories)
sw-search docs/signalwire_agents_concepts_guide.md examples --file-types md,py --output comprehensive.swsearch

# Build from a directory (traditional approach)
sw-search docs --output docs.swsearch

# Include specific file types
sw-search docs --file-types md,txt,py

# Exclude patterns
sw-search docs --exclude "**/test/**,**/__pycache__/**"

Advanced Index Building

# Full configuration example with multiple sources
sw-search docs/signalwire_agents_concepts_guide.md ./examples README.md \
    --output ./knowledge.swsearch \
    --chunking-strategy sentence \
    --max-sentences-per-chunk 8 \
    --file-types md,txt,rst,py \
    --exclude "**/test/**,**/__pycache__/**" \
    --model sentence-transformers/all-mpnet-base-v2 \
    --tags documentation,api \
    --verbose

Supported File Types

Format	Extension	Requirements
Markdown	`.md`	Built-in
Text	`.txt`	Built-in
Python	`.py`	Built-in
reStructuredText	`.rst`	Built-in
PDF	`.pdf`	`search-full`
Word Documents	`.docx`	`search-full`
HTML	`.html`	`search-full`
JSON	`.json`	Built-in

Index Structure

Each .swsearch file contains:

Document chunks with embeddings and metadata
Full-text search index (SQLite FTS5)
Configuration and model information
Synonym cache for query expansion

Using the Search Skill

The native_vector_search skill provides search functionality to your agents.

Basic Configuration

self.add_skill("native_vector_search", {
    "tool_name": "search_knowledge",
    "description": "Search the knowledge base",
    "index_file": "knowledge.swsearch"
})

Advanced Configuration

NLP Backend Selection

Choose between NLTK (fast) and spaCy (better quality) for query processing:

# Fast NLTK processing (default)
self.add_skill("native_vector_search", {
    "tool_name": "search_docs",
    "index_file": "docs.swsearch",
    "nlp_backend": "nltk"  # Fast, good for most use cases
})

# Better quality spaCy processing
self.add_skill("native_vector_search", {
    "tool_name": "search_docs", 
    "index_file": "docs.swsearch",
    "nlp_backend": "spacy"  # Slower but better quality, requires model download
})

Performance Comparison:

NLTK: ~50-100ms query processing, good synonym expansion
spaCy: ~150-300ms query processing, better POS tagging and entity recognition

Custom Embedding Models

# Use a different embedding model
self.add_skill("native_vector_search", {
    "tool_name": "search_docs",
    "index_file": "docs.swsearch",
    "model": "sentence-transformers/all-MiniLM-L6-v2"  # Smaller, faster model
})

Query Enhancement

The system automatically enhances queries using:

Language detection
POS tagging (with NLP dependencies)
Synonym expansion using WordNet
Keyword extraction
Vector embeddings

Response Customization

self.add_skill("native_vector_search", {
    "tool_name": "search_docs",
    "index_file": "docs.swsearch",
    
    # Customize responses for voice calls
    "response_prefix": "Based on the documentation, here's what I found:",
    "response_postfix": "Would you like me to search for more specific information?",
    
    # Custom no-results message
    "no_results_message": "I couldn't find information about '{query}'. Try rephrasing your question.",
    
    # SWAIG function fillers for natural conversation
    "swaig_fields": {
        "fillers": {
            "en-US": [
                "Let me search the documentation",
                "Checking our knowledge base",
                "Looking that up for you"
            ]
        }
    }
})

Tag-Based Filtering

# Only search documents tagged with specific categories
self.add_skill("native_vector_search", {
    "tool_name": "search_api_docs",
    "index_file": "docs.swsearch", 
    "tags": ["api", "reference"],  # Only search API docs
    "description": "Search API reference documentation"
})

Complete Configuration Example

self.add_skill("native_vector_search", {
    # Tool configuration
    "tool_name": "search_docs",
    "description": "Search SDK documentation for detailed information",
    
    # Index configuration
    "index_file": "docs.swsearch",
    "build_index": True,  # Auto-build if missing
    "source_dir": "./docs",  # Source for auto-build
    "file_types": ["md", "txt"],
    
    # Search parameters
    "count": 5,  # Number of results
    "distance_threshold": 0.1,  # Similarity threshold
    "tags": ["documentation"],  # Filter by tags
    
    # NLP backend selection
    "nlp_backend": "nltk",  # or "spacy" for better quality
    
    # Response formatting
    "response_prefix": "Based on the documentation:",
    "response_postfix": "Would you like more details?",
    "no_results_message": "No information found for '{query}'",
    
    # SWAIG configuration
    "swaig_fields": {
        "fillers": {
            "en-US": ["Let me search for that", "Checking the docs"]
        }
    }
})

Multiple Search Instances

You can add multiple search instances for different document collections:

# Documentation search with spaCy for better quality
self.add_skill("native_vector_search", {
    "tool_name": "search_docs",
    "index_file": "docs.swsearch",
    "nlp_backend": "spacy",
    "description": "Search SDK documentation"
})

# Code examples search with NLTK for speed
self.add_skill("native_vector_search", {
    "tool_name": "search_examples", 
    "index_file": "examples.swsearch",
    "nlp_backend": "nltk",
    "description": "Search code examples"
})

Local vs Remote Modes

The search skill supports both local and remote operation modes.

Local Mode (Default)

Pros:

Faster (no network latency)
Works offline
Simple deployment
Lower operational complexity

Cons:

Higher memory usage per agent
Index files must be distributed with each agent
Updates require redeploying agents

Configuration:

self.add_skill("native_vector_search", {
    "tool_name": "search_docs",
    "index_file": "docs.swsearch",  # Local file
    "nlp_backend": "nltk"  # Choose NLP backend
})

Remote Mode

Pros:

Lower memory usage per agent
Centralized index management
Easy updates without redeploying agents
Better scalability for multiple agents
Shared resources

Cons:

Network dependency
Additional infrastructure complexity
Potential latency

Configuration:

self.add_skill("native_vector_search", {
    "tool_name": "search_docs",
    "remote_url": "http://localhost:8001",  # Search server
    "index_name": "docs",  # Index name on server
    "nlp_backend": "nltk"  # NLP backend for query preprocessing
})

Running a Remote Search Server

Start the search server:

python examples/search_server_standalone.py

The server provides HTTP API:

POST /search - Search the indexes
GET /health - Health check and available indexes
POST /reload_index - Add or reload an index

Test the API:

curl -X POST "http://localhost:8001/search" \
     -H "Content-Type: application/json" \
     -d '{"query": "how to create an agent", "index_name": "docs", "count": 3}'

Automatic Mode Detection

The skill automatically detects which mode to use:

If remote_url is provided → Remote mode
If index_file is provided → Local mode
Remote mode takes priority if both are specified

Advanced Configuration

Custom Embedding Models

CLI Reference

sw-search Command

sw-search <source_dir> [options]

Arguments:

source_dir - Directory containing documents to index

Options:

--output FILE - Output .swsearch file (default: <source_dir>.swsearch)
--chunk-size SIZE - Chunk size in characters (default: 500)
--chunk-overlap SIZE - Overlap between chunks (default: 50)
--file-types TYPES - Comma-separated file extensions (default: md,txt,rst)
--exclude PATTERNS - Comma-separated glob patterns to exclude
--languages LANGS - Comma-separated language codes (default: en)
--model MODEL - Embedding model name (default: sentence-transformers/all-mpnet-base-v2)
--tags TAGS - Comma-separated tags to add to all chunks
--verbose - Show detailed progress information
--validate - Validate the created index after building

Subcommands:

validate - Validate Search Index

sw-search validate <index_file> [--verbose]

Validates an existing .swsearch index file and shows statistics.

search - Search Within Index

sw-search search <index_file> <query> [options]

Search within an existing .swsearch index file. This is useful for:

Testing search quality and relevance
Exploring index contents
Debugging search results
Scripting and automation

Search Options:

--count COUNT - Number of results to return (default: 5)
--distance-threshold FLOAT - Minimum similarity score (default: 0.0)
--tags TAGS - Comma-separated tags to filter by
--nlp-backend {nltk,spacy} - NLP backend to use (default: nltk)
--verbose - Show detailed information including index stats
--json - Output results as JSON for scripting
--no-content - Hide content in results (show only metadata)

Examples:

# Build from the comprehensive concepts guide
sw-search docs/signalwire_agents_concepts_guide.md --output concepts.swsearch

# Build from multiple sources (files and directories)
sw-search docs/signalwire_agents_concepts_guide.md examples README.md \
    --output comprehensive.swsearch \
    --file-types md,py,txt \
    --verbose

# Traditional directory-based approach
sw-search ./documentation \
    --output knowledge.swsearch \
    --chunking-strategy sentence \
    --max-sentences-per-chunk 8 \
    --file-types md,rst,txt \
    --exclude "**/drafts/**" \
    --tags documentation,help \
    --verbose

# Validate an existing index
sw-search validate concepts.swsearch --verbose

# Search within an index
sw-search search concepts.swsearch "how to create an agent"
sw-search search concepts.swsearch "API reference" --count 3 --verbose
sw-search search concepts.swsearch "configuration" --tags documentation --json

# Use different NLP backends
sw-search search concepts.swsearch "deployment options" --nlp-backend nltk  # Fast
sw-search search concepts.swsearch "deployment options" --nlp-backend spacy  # Better quality

# Advanced search with filtering
sw-search search concepts.swsearch "deployment options" \
    --count 10 \
    --distance-threshold 0.1 \
    --tags "deployment,production" \
    --nlp-backend spacy \
    --verbose

# JSON output for scripting
sw-search search concepts.swsearch "error handling" --json | jq '.results[0].content'

# Build multiple specialized indexes
sw-search docs/signalwire_agents_concepts_guide.md --output concepts.swsearch
sw-search examples --output examples.swsearch --file-types py,md

Index Validation

# Validate an existing index
python -c "
from signalwire_agents.search import SearchEngine
engine = SearchEngine('docs.swsearch')
print(f'Index stats: {engine.get_stats()}')
"

API Reference

SearchEngine Class

from signalwire_agents.search import SearchEngine

# Load an index
engine = SearchEngine("docs.swsearch")

# Perform search
results = engine.search(
    query_vector=[...],  # Optional: pre-computed query vector
    enhanced_text="search query",  # Enhanced query text
    count=5,  # Number of results
    distance_threshold=0.0,  # Minimum similarity score
    tags=["documentation"]  # Filter by tags
)

# Get index statistics
stats = engine.get_stats()
print(f"Total chunks: {stats['total_chunks']}")
print(f"Total files: {stats['total_files']}")

IndexBuilder Class

from signalwire_agents.search import IndexBuilder

# Create index builder
builder = IndexBuilder(
    model_name="sentence-transformers/all-mpnet-base-v2",
    chunk_size=500,
    chunk_overlap=50,
    verbose=True
)

# Build index
builder.build_index(
    source_dir="./docs",
    output_file="docs.swsearch",
    file_types=["md", "txt"],
    exclude_patterns=["**/test/**"],
    tags=["documentation"]
)

Overview​

Architecture​

Installation Options​

Basic Search (~500MB)​

Full Document Processing (~600MB)​

Advanced NLP Features (~700MB)​

All Search Features (~700MB)​

Minimal Installation (Base SDK only)​

Quick Start​

1. Install Dependencies​

2. Build a Search Index​

3. Use in Your Agent​

4. Test the Search​

Building Search Indexes​

Basic Index Building​

Advanced Index Building​

Supported File Types​

Index Structure​

Using the Search Skill​

Basic Configuration​

Advanced Configuration​

NLP Backend Selection​

Custom Embedding Models​

Query Enhancement​

Response Customization​

Tag-Based Filtering​

Complete Configuration Example​

Multiple Search Instances​

Local vs Remote Modes​

Local Mode (Default)​

Remote Mode​

Running a Remote Search Server​

Automatic Mode Detection​

Advanced Configuration​

Custom Embedding Models​

CLI Reference​

sw-search Command​

validate - Validate Search Index​

search - Search Within Index​

Index Validation​

API Reference​

SearchEngine Class​

IndexBuilder Class​

Overview

Architecture

Installation Options

Basic Search (~500MB)

Full Document Processing (~600MB)

Advanced NLP Features (~700MB)

All Search Features (~700MB)

Minimal Installation (Base SDK only)

Quick Start

1. Install Dependencies

2. Build a Search Index

3. Use in Your Agent

4. Test the Search

Building Search Indexes

Basic Index Building

Advanced Index Building

Supported File Types

Index Structure

Using the Search Skill

Basic Configuration

Advanced Configuration

NLP Backend Selection

Custom Embedding Models

Query Enhancement

Response Customization

Tag-Based Filtering

Complete Configuration Example

Multiple Search Instances

Local vs Remote Modes

Local Mode (Default)

Remote Mode

Running a Remote Search Server

Automatic Mode Detection

Advanced Configuration

Custom Embedding Models

CLI Reference

sw-search Command

validate - Validate Search Index

search - Search Within Index

Index Validation

API Reference

SearchEngine Class

IndexBuilder Class