feat: Implement RAG knowledge base

  • Added a POST endpoint for uploading and processing files into the knowledge base.
  • Implemented authentication and authorization checks so that only admin users can upload files.
  • Supported file types: .txt, .pdf, .docx, with content extraction and embedding generation.
  • Integrated content-extraction utilities for each supported file format.
  • Enhanced error handling for file-processing and extraction failures.
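The upload gate described above (admin-only access plus an extension whitelist) can be sketched as a small validation function; the names `validateUpload` and `UploadRequest` are illustrative, not the actual code:

```typescript
// Illustrative sketch of the admin-only upload check; names are assumptions.
const ALLOWED_EXTENSIONS = new Set([".txt", ".pdf", ".docx"]);

interface UploadRequest {
  userRole: string;
  filename: string;
}

// Returns null when the upload may proceed, otherwise an error message.
function validateUpload(req: UploadRequest): string | null {
  if (req.userRole !== "admin") return "Forbidden: admin role required";
  const dot = req.filename.lastIndexOf(".");
  if (dot < 0) return "Unsupported file type";
  const ext = req.filename.slice(dot).toLowerCase();
  if (!ALLOWED_EXTENSIONS.has(ext)) return `Unsupported file type: ${ext}`;
  return null;
}
```

Running the check before any parsing keeps extraction and embedding work behind the authorization boundary.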

chore: Add blacklisted selectors and domain blacklist for web scraping

  • Created a comprehensive list of selectors to blacklist during web scraping, so that non-content page elements are skipped.
  • Implemented a domain blacklist to prevent scraping from known heavy or restricted sources.
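A minimal sketch of the domain-blacklist check, assuming the blacklist stores bare domains and should also match subdomains; the example entries and the function name `isDomainBlacklisted` are illustrative:

```typescript
// Illustrative entries; the real blacklists in the repo are larger.
const BLACKLISTED_DOMAINS = new Set(["youtube.com", "facebook.com"]);

function isDomainBlacklisted(url: string): boolean {
  try {
    const host = new URL(url).hostname.replace(/^www\./, "");
    // Match the domain itself and any of its subdomains.
    return [...BLACKLISTED_DOMAINS].some(
      (d) => host === d || host.endsWith("." + d),
    );
  } catch {
    return true; // treat unparseable URLs as blocked
  }
}
```

Treating unparseable URLs as blocked keeps the crawler conservative: a malformed link is dropped rather than fetched.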

feat: Develop recursive URL crawler with worker threads

  • Implemented a recursive crawler that extracts links from fetched pages and follows them up to a specified depth.
  • Added support for domain prioritization, link-frequency sorting, and blacklist filtering.
  • Utilized worker threads for concurrent URL fetching to improve performance.
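The frontier logic behind those bullets (depth limit, frequency sorting, dedup, blacklist filtering) can be sketched as a pure function; the actual fetching via `worker_threads` is not shown, and `nextFrontier`/`CrawlTask` are assumed names:

```typescript
// Sketch of the depth-limited crawl frontier; fetching is delegated to
// worker threads in the real crawler and is omitted here.
interface CrawlTask {
  url: string;
  depth: number;
}

function nextFrontier(
  links: string[],
  depth: number,
  maxDepth: number,
  seen: Set<string>,
  isBlacklisted: (url: string) => boolean,
): CrawlTask[] {
  if (depth >= maxDepth) return []; // stop recursing at the depth limit
  // Count how often each link appears on the page, then sort by frequency
  // (descending) so frequently referenced URLs are crawled first.
  const freq = new Map<string, number>();
  for (const link of links) freq.set(link, (freq.get(link) ?? 0) + 1);
  return [...freq.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([url]) => url)
    .filter((url) => !seen.has(url) && !isBlacklisted(url))
    .map((url) => ({ url, depth: depth + 1 }));
}
```

Each returned task can then be dispatched to an idle worker thread, with `seen` updated as tasks are enqueued so the same URL is never fetched twice.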

refactor: Enhance vector utilities for embedding dimensions

  • Added utility functions to enforce expected embedding dimensions and to normalize vectors for embedding processing.
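Minimal sketches of what such helpers typically look like, assuming dimension mismatches are handled by zero-padding or truncation (the repo's actual policy and function names may differ):

```typescript
// Pad with zeros or truncate so downstream vector stores always receive a
// fixed-size embedding. (Assumed policy; the real utility may reject instead.)
function ensureDimensions(vec: number[], dim: number): number[] {
  if (vec.length === dim) return vec;
  return vec.length < dim
    ? [...vec, ...new Array<number>(dim - vec.length).fill(0)]
    : vec.slice(0, dim);
}

// Scale a vector to unit length so cosine similarity reduces to a dot product.
function normalize(vec: number[]): number[] {
  const norm = Math.sqrt(vec.reduce((sum, x) => sum + x * x, 0));
  return norm === 0 ? vec : vec.map((x) => x / norm);
}
```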

fix: Improve error handling in crawl-fetch worker

  • Enhanced error handling in the crawl-fetch worker to manage fetch timeouts and HTTP errors more effectively.
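One way to structure that hardening is to abort slow fetches with a timer and classify failures by kind; the sketch below parameterizes over a fetch-like function so both branches can be exercised without a network, and the names `fetchWithTimeout`/`classifyFetchError` are assumptions:

```typescript
// Sketch of a timeout-aware fetch path for the worker; not the actual code.
type FetchLike = (
  url: string,
  init: { signal: AbortSignal },
) => Promise<{ ok: boolean; status: number; text(): Promise<string> }>;

async function fetchWithTimeout(
  fetchFn: FetchLike,
  url: string,
  timeoutMs: number,
): Promise<string> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetchFn(url, { signal: controller.signal });
    if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`); // HTTP error
    return await res.text();
  } finally {
    clearTimeout(timer); // always clear the timer so the worker can exit
  }
}

// Map a thrown error onto the failure kinds the worker reports back.
function classifyFetchError(err: unknown): "timeout" | "http" | "network" {
  if (err instanceof Error) {
    if (err.name === "AbortError") return "timeout";
    if (/^HTTP \d{3}/.test(err.message)) return "http";
  }
  return "network";
}
```

Classifying errors this way lets the parent thread decide per kind, e.g. retrying timeouts while permanently skipping 4xx responses.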

chore: Add TypeScript definitions for external libraries

  • Added TypeScript definitions for mammoth and pdfjs-dist to improve type safety in file parsing utilities.
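For libraries that ship without types, such definitions usually live in ambient declaration files. A sketch of what these might contain, with shapes deliberately narrowed to what a parser would use (the repo's actual declarations may differ):

```typescript
// e.g. types/mammoth.d.ts -- declare only the surface the codebase relies on.
declare module "mammoth" {
  export function extractRawText(input: {
    buffer: Buffer;
  }): Promise<{ value: string }>;
}

// e.g. types/pdfjs-dist.d.ts
declare module "pdfjs-dist" {
  export function getDocument(src: Uint8Array | { data: Uint8Array }): {
    promise: Promise<unknown>;
  };
}
```

Declaring only the members actually called keeps the definitions small while still catching misuse at compile time.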
Pushed by subh05sus to subh05sus/ai-tutor at 7:52 PM on Nov 2, 2025.

subh05sus pushed several new features and refinements to the main branch: they added a RAG knowledge-base API with admin-only file uploads (TXT/PDF/DOCX), extraction and embedding generation, plus an enhanced web-scraping blacklist for selectors and domains. They also introduced a recursive URL crawler using worker threads, refactored vector utilities for embedding dimensions and normalization, improved error handling in the crawl-fetch worker, and added TypeScript definitions for external parsing libraries.