FileSense

A local-first semantic file search engine that combines traditional keyword search (BM25) with AI-powered semantic search (FAISS + sentence embeddings) to help you find your documents instantly.

Technology Stack

Python
FastAPI
React
TypeScript
FAISS
Sentence Transformers
SQLite
Docling

Key Challenges

  • Implementing hybrid search combining BM25 and dense vector search
  • Efficient text chunking for large documents
  • Fast retrieval across thousands of documents
  • Multi-format document processing (TXT, DOCX, PDF)
  • OCR support for scanned PDFs

Key Learnings

  • Hybrid search architectures
  • Vector embeddings and semantic search
  • BM25 lexical retrieval
  • Document processing pipelines
  • FastAPI with async operations
  • React state management for search interfaces

FileSense: Local-First Semantic File Search Engine

Overview

FileSense is a local-first semantic file search engine that combines traditional keyword search (BM25) with AI-powered semantic search (FAISS + sentence embeddings). Designed for personal knowledge management and document discovery, it finds documents by the meaning behind your queries, not just by matching keywords.

Core Features

Hybrid Search Architecture

  • BM25 Lexical Search: Fast keyword-based retrieval using the BM25 algorithm
  • Dense Vector Search: Semantic understanding using sentence embeddings and FAISS
  • Weighted Fusion: Combine both approaches with adjustable alpha parameter
  • Reciprocal Rank Fusion (RRF): Rank-based fusion for better result diversity
  • Parallel Execution: Both search methods run concurrently for optimal performance
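Weighted fusion can be sketched as follows. This is a minimal illustration, not FileSense's actual code: it assumes each retriever returns a mapping from chunk ID to raw score, and that scores are min-max normalized before mixing (an assumption, since BM25 and embedding similarity live on different scales). The function names are hypothetical; alpha follows the convention used by the search API, where 0 means BM25 only and 1 means dense only.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so BM25 and dense scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every hit as equally good.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def weighted_fusion(
    bm25: dict[str, float],
    dense: dict[str, float],
    alpha: float = 0.5,
) -> list[tuple[str, float]]:
    """Mix normalized scores: alpha=0 is pure BM25, alpha=1 is pure dense."""
    b, d = min_max(bm25), min_max(dense)
    fused = {
        # A chunk missing from one retriever contributes 0 from that side.
        doc: (1 - alpha) * b.get(doc, 0.0) + alpha * d.get(doc, 0.0)
        for doc in set(b) | set(d)
    }
    # Highest fused score first; ties broken by ID for stable output.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))
```

Lower alpha values favor exact keyword hits, while higher values favor semantically similar chunks that may share no words with the query.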

Multi-Format Document Support

  • TXT Files: Direct text extraction
  • DOCX Files: Microsoft Word document parsing
  • PDF Files: Text extraction with OCR support via Docling
  • Smart Chunking: Splits text into 1000-character chunks with a 100-character overlap
  • Hash-based Deduplication: Skip already-indexed files automatically
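The chunking parameters above (1000 characters, 100 overlap) correspond to a sliding window. The sketch below is a plain fixed-size splitter under that assumption; the function name is hypothetical, and FileSense's chunker may additionally respect sentence or paragraph boundaries.

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into windows of `size` chars that share `overlap` chars."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap  # each window starts `step` chars after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of indexing roughly 10% more text.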

Fast Retrieval Performance

  • FAISS Vector Index: Millisecond-level search across thousands of documents
  • Persistent Caching: BM25 and FAISS indices cached to disk
  • Incremental Updates: Add new documents without rebuilding entire index
  • Memory Mapping: Efficient FAISS index loading

Modern Web Interface

  • React + TypeScript: Clean, responsive frontend
  • Real-time Search: Debounced input with instant results
  • File Preview: Preview documents directly in the browser
  • File Actions: Open files and folders from the search interface
  • Keyboard Navigation: Power-user friendly keyboard shortcuts

How It Works

Indexing Pipeline

The indexing process transforms your documents into searchable chunks:

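# Import path follows the Pipeline location listed under Core Components
from src.services.pipeline import Pipeline

# Index a directory; files whose hash is already indexed are skipped
pipeline = Pipeline()
result = pipeline.index_dir("/path/to/your/documents")
print(f"Indexed {result['inserted']} documents")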

The Pipeline:

  1. File Discovery: Recursive directory scanning
  2. Deduplication: SHA256 hash check to skip duplicates
  3. Text Extraction: Format-specific loaders (TXT, DOCX, PDF)
  4. Smart Chunking: Split documents into 1000-character chunks with a 100-character overlap
  5. Dual Indexing: Build both BM25 and FAISS indices
  6. SQLite Storage: Persist metadata and content

Search Execution

curl -X POST http://localhost:8000/hybrid_search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "machine learning concepts",
    "k": 10,
    "alpha": 0.5
  }'

Search Parameters:

  • query: Your search query in natural language
  • k: Number of results to return (default: 10)
  • alpha: Balance between BM25 (0) and dense (1) search (default: 0.5)
  • deduplicate: Remove duplicate chunks from same file (default: true)
  • rerank: Apply RRF reranking (default: false)
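The same request can be built from Python. The sketch below uses only the standard library; the helper names are hypothetical, the defaults mirror the parameter list above, and because the response shape is not documented here it is returned as parsed JSON without interpretation.

```python
import json
import urllib.request


def build_hybrid_search_request(
    query: str,
    k: int = 10,
    alpha: float = 0.5,
    deduplicate: bool = True,
    rerank: bool = False,
    base_url: str = "http://localhost:8000",
) -> urllib.request.Request:
    """Build (but do not send) a POST request for /hybrid_search."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 (BM25 only) and 1 (dense only)")
    payload = {
        "query": query,
        "k": k,
        "alpha": alpha,
        "deduplicate": deduplicate,
        "rerank": rerank,
    }
    return urllib.request.Request(
        f"{base_url}/hybrid_search",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def hybrid_search(query: str, **params):
    """Send the request to a running FileSense server and decode the JSON body."""
    request = build_hybrid_search_request(query, **params)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the API server running, `hybrid_search("machine learning concepts", alpha=0.8)` would weight the dense results more heavily than the default.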

Result Fusion

FileSense combines results from both search methods:

  1. Parallel Execution: BM25 and dense search run simultaneously
  2. Score Fusion: Combine scores using weighted average or RRF
  3. Deduplication: Collapse multiple chunks from the same file into a single result
  4. Metadata Enrichment: Add file path, type, and preview

Technical Architecture

Core Components

  • Pipeline (src/services/pipeline.py): Central orchestrator coordinating all operations
  • Hybrid Search (src/services/hybrid_search.py): Core search logic with parallel execution
  • BM25 Retriever (src/services/bm25_ret.py): Lexical search with disk caching
  • Vector Index (src/services/idx.py): FAISS index management
  • Embeddings (src/services/emb.py): Sentence embedding generation (Qwen3-Embedding-0.6B)
  • Text Chunker (src/services/text_chunker.py): Smart document splitting
  • File Manager (src/services/file_manager.py): SQLite database operations
  • Document Loaders: Format-specific text extraction (TXT, DOCX, PDF)

Database Schema

SQLite database (./db/db.sqlite3) stores:

Files Table:

  • Document ID, file hash (SHA256)
  • File path, name, type, size
  • Text content, chunk index
  • Modified timestamp

Folders Table:

  • Folder path and recursive flag
  • Indexed file count
  • Last indexed timestamp
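One possible shape for these two tables is sketched below. The column names, types, and constraints are assumptions chosen to match the field descriptions above, not the project's actual DDL, and the helper functions are hypothetical.

```python
import sqlite3

# Hypothetical schema: one row per chunk in `files`, one row per folder.
SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    id          INTEGER PRIMARY KEY,
    file_hash   TEXT    NOT NULL,   -- SHA256 of the file contents
    file_path   TEXT    NOT NULL,
    file_name   TEXT    NOT NULL,
    file_type   TEXT    NOT NULL,
    file_size   INTEGER NOT NULL,
    content     TEXT    NOT NULL,   -- text of this chunk
    chunk_index INTEGER NOT NULL,
    modified_at TEXT    NOT NULL,
    UNIQUE (file_hash, chunk_index)
);
CREATE TABLE IF NOT EXISTS folders (
    folder_path        TEXT PRIMARY KEY,
    is_recursive       INTEGER NOT NULL DEFAULT 1,
    indexed_file_count INTEGER NOT NULL DEFAULT 0,
    last_indexed_at    TEXT
);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the tables if they are missing."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def is_indexed(conn: sqlite3.Connection, file_hash: str) -> bool:
    """Return True if any chunk with this file hash is already stored."""
    row = conn.execute(
        "SELECT 1 FROM files WHERE file_hash = ? LIMIT 1", (file_hash,)
    ).fetchone()
    return row is not None
```

A lookup like `is_indexed` is what lets the indexing pipeline skip files it has already processed.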

Caching Strategy

Persistent cache files in ./db/:

  • bm25_cache.pkl: BM25 retriever state
  • faiss_index.bin: FAISS vector index
  • db.sqlite3: Document metadata
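A load-or-build pattern for a pickle cache such as bm25_cache.pkl could look like the sketch below. The function name and the `build` callback are hypothetical; writing to a temporary file and renaming is an added safeguard against half-written caches, not something the source states FileSense does.

```python
import pickle
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_build(cache_path: Path, build: Callable[[], T]) -> T:
    """Return the cached object if present; otherwise build and persist it."""
    if cache_path.exists():
        with cache_path.open("rb") as fh:
            return pickle.load(fh)  # only unpickle files this app wrote itself
    state = build()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file first, then rename, so a crash mid-write
    # never leaves a truncated cache behind.
    tmp_path = cache_path.with_suffix(cache_path.suffix + ".tmp")
    with tmp_path.open("wb") as fh:
        pickle.dump(state, fh)
    tmp_path.replace(cache_path)
    return state
```

On the first run the expensive build happens once; later startups only pay the cost of reading the file from disk.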

API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| /health_check | GET | Health check |
| /search | POST | Legacy filename + content search |
| /hybrid_search | POST | Primary hybrid BM25 + dense search |
| /index | POST | Index all files in a directory |
| /index_files | POST | Index specific files by path |
| /unindex | POST | Remove files by ID, path, or hash |
| /indexed_files | GET | List indexed files with pagination |
| /preview_file | POST | Preview file content |
| /open_file | POST | Open file in default app |
| /open_folder | POST | Open folder in file manager |
| /folders | GET/POST | List/add indexed folders |

Technology Stack

Backend

  • Python 3.12+: Core language
  • FastAPI: Modern async web framework
  • Uvicorn: ASGI server
  • FAISS: Facebook AI Similarity Search for vector indexing
  • BM25s: Fast BM25 implementation with incremental updates
  • Sentence Transformers: Qwen3-Embedding-0.6B for embeddings
  • SQLite: Lightweight database for metadata
  • Docling: PDF OCR and document parsing
  • python-docx: Word document processing

Frontend

  • React 18: UI framework
  • TypeScript: Type-safe development
  • Vite: Fast build tool
  • Axios: HTTP client

Why I Built This

The Problem

Traditional file search has limitations:

  • Keyword Matching: Misses documents that use different words
  • No Context: Can't understand meaning or intent
  • Slow Performance: Linear search through all files
  • Format Limitations: Struggles with PDFs and complex formats
  • No Preview: Requires opening files to check relevance

The Solution

FileSense addresses these issues:

  • Semantic Understanding: Finds documents by meaning, not just keywords
  • Hybrid Approach: Combines lexical and semantic search for best results
  • Fast Retrieval: Millisecond search across thousands of documents
  • Format Agnostic: Handles TXT, DOCX, and PDF with OCR
  • Rich Preview: See document content before opening

Installation & Usage

Quick Start

# Clone the repository
git clone https://github.com/sidmanale643/file-sense.git
cd file-sense

# Create virtual environment and install dependencies
uv sync

# Activate the virtual environment
source .venv/bin/activate

# Start the API server
uvicorn main:app --reload

The API runs on http://localhost:8000 with auto-generated docs at /docs.

Frontend Setup

cd frontend
npm install
npm run dev

The UI runs on http://localhost:5173.

Development Status

Current Features

  • Hybrid BM25 + dense search
  • Multi-format document support
  • PDF OCR via Docling
  • Smart text chunking
  • Incremental index updates
  • Folder management
  • File preview and actions
  • Modern React UI
  • Persistent caching

Technical Highlights

  • Parallel search execution for low latency
  • Incremental BM25 updates that avoid refitting the full corpus
  • Memory-mapped FAISS index for efficiency
  • Hash-based deduplication
  • Thread-safe database operations

Impact & Vision

FileSense takes a hybrid approach to personal document search, combining the reliability of traditional keyword search with the semantic understanding of embeddings. Whether you're searching through research papers, documentation, or personal notes, FileSense helps you find what you need by understanding what you mean.

The project is actively maintained with ongoing improvements to search quality, performance, and user experience.

Designed by sidmanale643
© 2026. All rights reserved.