feat: Implement RAG knowledge base

  • Added a POST endpoint for uploading and processing files into the knowledge base.
  • Implemented authentication and authorization checks so that only admin users can upload files.
  • Supported file types: .txt, .pdf, .docx, with content extraction and embedding generation.
  • Integrated content-extraction utilities for each supported file format.
  • Enhanced error handling for file-processing and extraction failures.
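The upload gate described above (admin-only access plus an extension whitelist) can be sketched as a small validation function; the names `validateUpload` and `UploadRequest` are illustrative, not the actual code:

```typescript
// Illustrative sketch of the admin-only upload check; names are assumptions.
const ALLOWED_EXTENSIONS = new Set([".txt", ".pdf", ".docx"]);

interface UploadRequest {
  userRole: string;
  filename: string;
}

// Returns null when the upload may proceed, otherwise an error message.
function validateUpload(req: UploadRequest): string | null {
  if (req.userRole !== "admin") return "Forbidden: admin role required";
  const dot = req.filename.lastIndexOf(".");
  if (dot < 0) return "Unsupported file type";
  const ext = req.filename.slice(dot).toLowerCase();
  if (!ALLOWED_EXTENSIONS.has(ext)) return `Unsupported file type: ${ext}`;
  return null;
}
```

Running the check before any parsing keeps extraction and embedding work behind the authorization boundary.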

chore: Add blacklisted selectors and domain blacklist for web scraping

  • Created a comprehensive list of selectors to blacklist during web scraping, so that non-content page elements are skipped.
  • Implemented a domain blacklist to prevent scraping from known heavy or restricted sources.
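A minimal sketch of the domain-blacklist check, assuming the blacklist stores bare domains and should also match subdomains; the example entries and the function name `isDomainBlacklisted` are illustrative:

```typescript
// Illustrative entries; the real blacklists in the repo are larger.
const BLACKLISTED_DOMAINS = new Set(["youtube.com", "facebook.com"]);

function isDomainBlacklisted(url: string): boolean {
  try {
    const host = new URL(url).hostname.replace(/^www\./, "");
    // Match the domain itself and any of its subdomains.
    return [...BLACKLISTED_DOMAINS].some(
      (d) => host === d || host.endsWith("." + d),
    );
  } catch {
    return true; // treat unparseable URLs as blocked
  }
}
```

Treating unparseable URLs as blocked keeps the crawler conservative: a malformed link is dropped rather than fetched.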

feat: Develop recursive URL crawler with worker threads

  • Implemented a recursive crawler that extracts links from fetched pages and follows them up to a specified depth.
  • Added support for domain prioritization, link-frequency sorting, and blacklist filtering.
  • Utilized worker threads for concurrent URL fetching to improve performance.
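The frontier logic behind those bullets (depth limit, frequency sorting, dedup, blacklist filtering) can be sketched as a pure function; the actual fetching via `worker_threads` is not shown, and `nextFrontier`/`CrawlTask` are assumed names:

```typescript
// Sketch of the depth-limited crawl frontier; fetching is delegated to
// worker threads in the real crawler and is omitted here.
interface CrawlTask {
  url: string;
  depth: number;
}

function nextFrontier(
  links: string[],
  depth: number,
  maxDepth: number,
  seen: Set<string>,
  isBlacklisted: (url: string) => boolean,
): CrawlTask[] {
  if (depth >= maxDepth) return []; // stop recursing at the depth limit
  // Count how often each link appears on the page, then sort by frequency
  // (descending) so frequently referenced URLs are crawled first.
  const freq = new Map<string, number>();
  for (const link of links) freq.set(link, (freq.get(link) ?? 0) + 1);
  return [...freq.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([url]) => url)
    .filter((url) => !seen.has(url) && !isBlacklisted(url))
    .map((url) => ({ url, depth: depth + 1 }));
}
```

Each returned task can then be dispatched to an idle worker thread, with `seen` updated as tasks are enqueued so the same URL is never fetched twice.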

refactor: Enhance vector utilities for embedding dimensions

  • Added utility functions to enforce expected embedding dimensions and to normalize vectors for embedding processing.
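Minimal sketches of what such helpers typically look like, assuming dimension mismatches are handled by zero-padding or truncation (the repo's actual policy and function names may differ):

```typescript
// Pad with zeros or truncate so downstream vector stores always receive a
// fixed-size embedding. (Assumed policy; the real utility may reject instead.)
function ensureDimensions(vec: number[], dim: number): number[] {
  if (vec.length === dim) return vec;
  return vec.length < dim
    ? [...vec, ...new Array<number>(dim - vec.length).fill(0)]
    : vec.slice(0, dim);
}

// Scale a vector to unit length so cosine similarity reduces to a dot product.
function normalize(vec: number[]): number[] {
  const norm = Math.sqrt(vec.reduce((sum, x) => sum + x * x, 0));
  return norm === 0 ? vec : vec.map((x) => x / norm);
}
```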

fix: Improve error handling in crawl-fetch worker

  • Enhanced error handling in the crawl-fetch worker to manage fetch timeouts and HTTP errors more effectively.
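One way to structure that hardening is to abort slow fetches with a timer and classify failures by kind; the sketch below parameterizes over a fetch-like function so both branches can be exercised without a network, and the names `fetchWithTimeout`/`classifyFetchError` are assumptions:

```typescript
// Sketch of a timeout-aware fetch path for the worker; not the actual code.
type FetchLike = (
  url: string,
  init: { signal: AbortSignal },
) => Promise<{ ok: boolean; status: number; text(): Promise<string> }>;

async function fetchWithTimeout(
  fetchFn: FetchLike,
  url: string,
  timeoutMs: number,
): Promise<string> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetchFn(url, { signal: controller.signal });
    if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`); // HTTP error
    return await res.text();
  } finally {
    clearTimeout(timer); // always clear the timer so the worker can exit
  }
}

// Map a thrown error onto the failure kinds the worker reports back.
function classifyFetchError(err: unknown): "timeout" | "http" | "network" {
  if (err instanceof Error) {
    if (err.name === "AbortError") return "timeout";
    if (/^HTTP \d{3}/.test(err.message)) return "http";
  }
  return "network";
}
```

Classifying errors this way lets the parent thread decide per kind, e.g. retrying timeouts while permanently skipping 4xx responses.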

chore: Add TypeScript definitions for external libraries

  • Added TypeScript definitions for mammoth and pdfjs-dist to improve type safety in file parsing utilities.
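For libraries that ship without types, such definitions usually live in ambient declaration files. A sketch of what these might contain, with shapes deliberately narrowed to what a parser would use (the repo's actual declarations may differ):

```typescript
// e.g. types/mammoth.d.ts -- declare only the surface the codebase relies on.
declare module "mammoth" {
  export function extractRawText(input: {
    buffer: Buffer;
  }): Promise<{ value: string }>;
}

// e.g. types/pdfjs-dist.d.ts
declare module "pdfjs-dist" {
  export function getDocument(src: Uint8Array | { data: Uint8Array }): {
    promise: Promise<unknown>;
  };
}
```

Declaring only the members actually called keeps the definitions small while still catching misuse at compile time.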
Pushed by subh05sus to subh05sus/ai-tutor at 7:52 PM on Nov 2, 2025.

subh05sus pushed several new features and refinements to the main branch: they added a RAG knowledge-base API with admin-only file uploads (TXT/PDF/DOCX), extraction and embedding generation, plus an enhanced web-scraping blacklist for selectors and domains. They also introduced a recursive URL crawler using worker threads, refactored vector utilities for embedding dimensions and normalization, improved error handling in the crawl-fetch worker, and added TypeScript definitions for external parsing libraries.