
lina_database_decoder
Created Apr 2026
Added a preprocessing step to the decoder pipeline that discards any database row containing fewer than two sign groups. Filtering out these short, less informative or noisy sequences lets the frequency analysis and random-init decipherment models focus on richer, higher-quality data and avoids wasted computation.
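
A minimal sketch of the filter, assuming each row exposes its sign groups as a list under a hypothetical sign_groups key:

```python
MIN_SIGN_GROUPS = 2  # rows below this threshold are dropped

def filter_rows(rows):
    """Keep only rows with at least MIN_SIGN_GROUPS sign groups.

    Assumes each row is a dict with a 'sign_groups' list; the actual
    field name in the database may differ.
    """
    return [row for row in rows
            if len(row.get("sign_groups", [])) >= MIN_SIGN_GROUPS]
```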
Added a new iterative search mechanism to the random-initialization decoding strategy, which now runs $N$ consecutive random cipher generations and retains only the one with the highest internal semantic consistency score (measured via WordNet). Also introduced a centralized summary output (outputs/strategy_summary.csv) to track and rank the performance of all active strategies. This upgrade substantially improves the quality of random-init outputs and streamlines benchmarking across approaches.
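
A sketch of the best-of-$N$ loop, with generate_random_cipher and score_consistency standing in for the project's actual generation and WordNet-based scoring functions:

```python
def best_of_n(n, generate_random_cipher, score_consistency):
    """Run n independent random cipher generations and keep the one
    with the highest internal semantic consistency score."""
    best_cipher, best_score = None, float("-inf")
    for _ in range(n):
        cipher = generate_random_cipher()
        score = score_consistency(cipher)
        if score > best_score:
            best_cipher, best_score = cipher, score
    return best_cipher, best_score
```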
Optimized the semantic consistency scoring process by caching synset lookups on a per-token basis. Previously, the meaning_scorer re-queried WordNet synsets for every pairwise comparison, leading to redundant overhead. By shifting to a synset_map, we significantly reduce computation time during translation analysis. 
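
A sketch of the caching, assuming NLTK's WordNet interface; build_synset_map is a hypothetical name for the helper that fills the synset_map:

```python
from nltk.corpus import wordnet as wn

def build_synset_map(tokens):
    # One WordNet query per distinct token; pairwise comparisons then
    # read from this map instead of calling wn.synsets repeatedly.
    return {tok: wn.synsets(tok) for tok in set(tokens)}
```

Pairwise scoring then becomes a pair of dictionary reads (synset_map[a], synset_map[b]) rather than two fresh WordNet queries per comparison.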
Replaced the meaning_scorer's dependency on large semantic embedding models with a lightweight, deterministic approach based on WordNet's Wu-Palmer similarity metric. This refactor improves maintainability by removing heavy external ML libraries such as sentence-transformers, while keeping a robust way to evaluate semantic consistency between translated tokens.
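
A minimal illustration of the metric using NLTK's WordNet bindings; the function name and best-pair aggregation are assumptions, not the scorer's exact code:

```python
from nltk.corpus import wordnet as wn

def token_similarity(word_a, word_b):
    """Deterministic semantic similarity via Wu-Palmer, in [0, 1].

    Takes the best score over all synset pairs; returns 0.0 when
    either word is unknown to WordNet.
    """
    scores = [
        sa.wup_similarity(sb)
        for sa in wn.synsets(word_a)
        for sb in wn.synsets(word_b)
    ]
    scores = [s for s in scores if s is not None]
    return max(scores, default=0.0)
```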
Improved the suffix-stripping mechanism in the meaning_scorer's linguistic fallback logic. Reordering the suffix list to prioritize longer matches and simplifying the stripping process makes domain vocabulary matching more accurate and robust, which improves the quality of meaning assessments when the primary transformer model is unavailable.
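
A sketch of the longest-first ordering; the suffix list here is illustrative, not the scorer's actual list:

```python
# Sorting longest-first ensures e.g. 'matches' strips 'es' to 'match'
# rather than stripping only 's' and leaving 'matche'.
SUFFIXES = sorted(["ation", "ness", "ing", "ed", "es", "s"],
                  key=len, reverse=True)

def strip_suffix(token, min_stem=3):
    """Strip the longest matching suffix, keeping at least min_stem chars."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem:
            return token[: -len(suffix)]
    return token
```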
Added a new meaning_scorer subcomponent to evaluate the quality of transcriptions. It uses sentence-transformers for embedding-based semantic coherence and domain relevance, with a robust offline fallback based on vocabulary overlap and heuristics. These scores are integrated to provide better diagnostic insight into the quality of generated decipherments.
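
A rough sketch of an embedding-based coherence score as this component might have computed it (before the later WordNet refactor); the model checkpoint and aggregation are assumptions:

```python
from sentence_transformers import SentenceTransformer, util

# The checkpoint is an assumption; the entry does not name the model.
model = SentenceTransformer("all-MiniLM-L6-v2")

def coherence_score(tokens):
    """Mean pairwise cosine similarity of token embeddings, as a rough
    measure of semantic coherence for a candidate transcription."""
    embeddings = model.encode(tokens)
    sims = util.cos_sim(embeddings, embeddings)
    n = len(tokens)
    off_diag = [sims[i][j].item()
                for i in range(n) for j in range(n) if i != j]
    return sum(off_diag) / len(off_diag) if off_diag else 0.0
```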
Added a new utility that fetches, cleans, and samples English words from online dictionary repositories to populate the project's word pool. This replaces the hardcoded CSV dependency with a refreshable, automated process for better scalability and data consistency. 
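
A sketch of such a utility; the repository URL and sampling parameters are illustrative, not necessarily the ones the project uses:

```python
import random
import urllib.request

# Illustrative source; the utility may pull from a different repository.
WORDLIST_URL = ("https://raw.githubusercontent.com/dwyl/"
                "english-words/master/words_alpha.txt")

def fetch_word_pool(sample_size=5000, min_len=3, max_len=10, seed=42):
    """Fetch, clean, and sample English words for the decoder's word pool."""
    with urllib.request.urlopen(WORDLIST_URL) as resp:
        words = resp.read().decode("utf-8").split()
    # Keep purely alphabetic words of reasonable length.
    cleaned = [w for w in words if w.isalpha() and min_len <= len(w) <= max_len]
    rng = random.Random(seed)  # seeded for reproducible pools
    return rng.sample(cleaned, min(sample_size, len(cleaned)))
```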
This update transitions the decoder to CSV-only outputs, replacing the JSON-based formats for consistency. It also adds a new random_init strategy, which uses a predefined word pool to generate random sign-to-word mappings for testing. These changes streamline the decipherment workflow and give a clearer structure for evaluating different translation strategies.
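
A minimal sketch of the random_init mapping and its CSV output, under the assumption that the word pool is at least as large as the sign inventory:

```python
import csv
import random

def random_init_mapping(signs, word_pool, seed=None):
    """Assign each distinct sign a word drawn from the pool.

    Sampling without replacement assumes len(word_pool) >= len(signs).
    """
    rng = random.Random(seed)
    return dict(zip(signs, rng.sample(word_pool, len(signs))))

def write_mapping_csv(mapping, path):
    """Persist a sign-to-word mapping as CSV, matching the move to
    CSV-only outputs."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["sign", "word"])
        writer.writerows(sorted(mapping.items()))
```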
