Implemented a new reproducible search benchmark suite

We've added a benchmark suite that evaluates search performance across multiple corpus sizes, pipeline configurations, and query types (keyword, semantic, and temporal). The tests use deterministic pseudo-embeddings, so they are fully reproducible and require no external API keys. This makes measurements of Precision@5, Recall@10, and MRR (mean reciprocal rank) repeatable from run to run. Our runs so far confirm that the hybrid search pipeline significantly outperforms keyword-only search at every corpus size tested.
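
For reviewers unfamiliar with the approach, here is a minimal sketch of how deterministic pseudo-embeddings and the three metrics could be computed. It is an illustration under stated assumptions, not the suite's actual code: the function names (`pseudo_embedding`, `precision_at_k`, `recall_at_k`, `mean_reciprocal_rank`), the SHA-256 seeding scheme, and the 8-dimensional vectors are all hypothetical.

```python
import hashlib
import math
import random


def pseudo_embedding(text: str, dim: int = 8) -> list[float]:
    """Return a unit vector that depends only on the text (hypothetical scheme).

    The text's SHA-256 digest seeds a private random generator, so the same
    input always yields the same vector and no embedding API is called.
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average of 1/rank of the first relevant result over all queries.

    A query with no relevant result in its ranking contributes 0.
    """
    if not runs:
        return 0.0
    total = 0.0
    for ranked, relevant in runs:
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)


if __name__ == "__main__":
    # Illustrative data only; these are not results from the benchmark.
    ranked = ["d3", "d1", "d7", "d2", "d9"]
    relevant = {"d1", "d2", "d4"}
    print(precision_at_k(ranked, relevant, 5))
    print(recall_at_k(ranked, relevant, 10))
    print(mean_reciprocal_rank([(ranked, relevant), (["a", "b"], {"a"})]))
    print(pseudo_embedding("hello") == pseudo_embedding("hello"))
```

Seeding from a content hash rather than a global random state keeps each document's vector independent of the order in which documents are processed, which is one simple way to get run-to-run reproducibility.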