
feat: Implement RAG knowledge base

  • Added POST endpoint for uploading and processing files to the knowledge base.
  • Implemented user authentication and authorization checks to ensure only admin users can upload files.
  • Supported file types: .txt, .pdf, .docx with content extraction and embedding generation.
  • Integrated content extraction utilities for various file formats.
  • Enhanced error handling for file processing and extraction failures.
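The admin gate and file-type check described above could look like the following. This is a hypothetical sketch, not the repository's actual code; the `User` shape, `canUpload`, and `isSupportedFile` names are illustrative.

```typescript
// Hypothetical sketch: validation used before accepting a knowledge-base upload.
const ALLOWED_EXTENSIONS = new Set(['.txt', '.pdf', '.docx']);

interface User {
  id: string;
  role: 'admin' | 'user';
}

// Only admin users may upload files to the knowledge base.
function canUpload(user: User): boolean {
  return user.role === 'admin';
}

// Accept only the supported file types, matching extensions case-insensitively.
function isSupportedFile(filename: string): boolean {
  const dot = filename.lastIndexOf('.');
  if (dot === -1) return false;
  return ALLOWED_EXTENSIONS.has(filename.slice(dot).toLowerCase());
}
```

In a real endpoint these checks would run before content extraction and embedding generation, so unauthorized or unsupported uploads fail fast.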

chore: Add blacklisted selectors and domain blacklist for web scraping

  • Created a comprehensive list of selectors to blacklist during web scraping to avoid non-content elements.
  • Implemented domain blacklist to prevent scraping from known heavy or restricted sources.
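A minimal sketch of the two blacklists, with illustrative entries (the actual selector and domain lists in the repository are presumably much longer):

```typescript
// Illustrative samples; the real lists are more comprehensive.
const BLACKLISTED_SELECTORS = ['nav', 'footer', 'aside', '.advertisement', '#cookie-banner'];
const BLACKLISTED_DOMAINS = new Set(['facebook.com', 'youtube.com']);

// A URL is skipped if its host matches a blacklisted domain or any subdomain of one.
function isBlacklistedDomain(url: string): boolean {
  try {
    const host = new URL(url).hostname.replace(/^www\./, '');
    return (
      BLACKLISTED_DOMAINS.has(host) ||
      [...BLACKLISTED_DOMAINS].some((d) => host.endsWith('.' + d))
    );
  } catch {
    return true; // treat malformed URLs as blacklisted so they are never fetched
  }
}
```

The selector list would typically be joined into one CSS query and removed from the DOM before text extraction, so navigation, ads, and banners never reach the embedding pipeline.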

feat: Develop recursive URL crawler with worker threads

  • Implemented a recursive URL crawler that extracts and crawls URLs from web pages up to a specified depth.
  • Added support for domain prioritization, link frequency sorting, and blacklist filtering.
  • Utilized worker threads for concurrent URL fetching to improve performance.
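The crawl loop might be structured as below. This sketch injects a `fetchLinks` function as a stand-in for the worker-thread fetchers (the real implementation dispatches fetches to `worker_threads`); the depth limit, frequency counting, and prioritization mirror the bullets above.

```typescript
// Stand-in for the worker-thread fetcher: given a URL, return the links on that page.
type FetchLinks = (url: string) => Promise<string[]>;

// Depth-limited breadth-first crawl with link-frequency prioritization.
async function crawl(seed: string, maxDepth: number, fetchLinks: FetchLinks): Promise<string[]> {
  const seen = new Set<string>([seed]);
  const counts = new Map<string, number>();
  let frontier = [seed];

  for (let depth = 0; depth < maxDepth && frontier.length > 0; depth++) {
    // Fetch the whole frontier concurrently (worker threads in the real crawler).
    const results = await Promise.all(frontier.map(fetchLinks));
    const next: string[] = [];
    for (const links of results) {
      for (const link of links) {
        counts.set(link, (counts.get(link) ?? 0) + 1);
        if (!seen.has(link)) {
          seen.add(link);
          next.push(link);
        }
      }
    }
    // Visit frequently-linked URLs first at the next depth.
    frontier = next.sort((a, b) => counts.get(b)! - counts.get(a)!);
  }
  return [...seen];
}
```

Blacklist filtering (from the previous commit) would slot in as a check on each `link` before it is added to `next`.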

refactor: Enhance vector utilities for embedding dimensions

  • Added utility functions to ensure vector dimensions and normalize vectors for embedding processing.
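A plausible shape for those utilities, assuming "ensure dimensions" means padding or truncating to the embedding model's dimension and "normalize" means L2 normalization (common for cosine-similarity search):

```typescript
// Pad with zeros or truncate so the vector matches the expected embedding dimension.
function ensureDimensions(vec: number[], dim: number): number[] {
  if (vec.length === dim) return vec;
  if (vec.length > dim) return vec.slice(0, dim);
  return [...vec, ...new Array(dim - vec.length).fill(0)];
}

// L2-normalize so dot products behave as cosine similarity; zero vectors pass through.
function normalize(vec: number[]): number[] {
  const norm = Math.sqrt(vec.reduce((sum, x) => sum + x * x, 0));
  return norm === 0 ? vec : vec.map((x) => x / norm);
}
```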

fix: Improve error handling in crawl-fetch worker

  • Enhanced error handling in the crawl-fetch worker to manage fetch timeouts and HTTP errors more effectively.
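The timeout and HTTP-error handling could follow the standard `AbortController` pattern, sketched below under the assumption of Node 18+ (global `fetch`). The `timeoutMs` parameter and function name are illustrative.

```typescript
// Fetch a page, aborting after timeoutMs and rejecting on non-2xx responses.
async function fetchWithTimeout(url: string, timeoutMs = 10_000): Promise<string> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(url, { signal: controller.signal });
    if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
    return await res.text();
  } finally {
    clearTimeout(timer); // always clear, whether we succeeded, timed out, or errored
  }
}
```

The worker can catch these rejections, mark the URL as failed, and move on instead of stalling the whole crawl on one slow or broken page.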

chore: Add TypeScript definitions for external libraries

  • Added TypeScript definitions for mammoth and pdfjs-dist to improve type safety in file parsing utilities.
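Such definitions are typically minimal ambient module declarations, just wide enough to type the calls the parsers use. The shapes below are an illustrative sketch, not the repository's actual declarations:

```typescript
// types/mammoth.d.ts (sketch): only the .docx extraction entry point is declared.
declare module 'mammoth' {
  interface ExtractResult {
    value: string; // the extracted raw text
    messages: unknown[];
  }
  export function extractRawText(input: { buffer: Buffer }): Promise<ExtractResult>;
}

// types/pdfjs-dist.d.ts (sketch): loosely typed; refine as needed.
declare module 'pdfjs-dist' {
  export function getDocument(src: unknown): { promise: Promise<unknown> };
}
```
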
subh05sus/ai-tutor · 7:52 PM, Nov 2, 2025

subh05sus pushed several new features and refinements to the main branch: they added a RAG knowledge-base API with admin-only file uploads (TXT/PDF/DOCX), extraction and embedding generation, plus an enhanced web-scraping blacklist for selectors and domains. They also introduced a recursive URL crawler using worker threads, refactored vector utilities for embedding dimensions and normalization, improved error handling in the crawl-fetch worker, and added TypeScript definitions for external parsing libraries.
