We updated the metadata preprocessing pipeline to include only English data. The generator now runs the langdetect library on each expected output string and yields the record only if that string is classified as English. This filter improves data quality by automatically discarding multi-language artifacts and other unsupported test cases from the dataset.
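
The filter described above could look roughly like the sketch below. It is not the repository's actual code: the function names `english_only` and `_default_detector`, the `(metadata, expected_output)` pair shape, and the injectable `detect_fn` parameter are all assumptions made for illustration. The only langdetect calls used are `detect` and `DetectorFactory.seed`.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple


def _default_detector() -> Callable[[str], str]:
    """Return langdetect's detect(), seeded so results are repeatable."""
    # Imported lazily so the filter can also be exercised with a stub detector.
    from langdetect import DetectorFactory, detect

    DetectorFactory.seed = 0  # langdetect is non-deterministic unless seeded
    return detect


def english_only(
    pairs: Iterable[Tuple[dict, str]],
    detect_fn: Optional[Callable[[str], str]] = None,
) -> Iterator[Tuple[dict, str]]:
    """Yield only the (metadata, expected_output) pairs whose output is English.

    The pair layout is hypothetical; adapt it to the real generator's records.
    """
    detect_fn = detect_fn or _default_detector()
    for metadata, expected in pairs:
        try:
            lang = detect_fn(expected)
        except Exception:
            # langdetect raises LangDetectException on empty or symbol-only
            # text; treat such records as non-English and skip them.
            continue
        if lang == "en":
            yield metadata, expected
```

Passing the detector in as a parameter keeps the filter easy to test without the real library, and seeding `DetectorFactory` avoids the run-to-run variation langdetect can otherwise show on short or ambiguous strings such as tweets.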

Filtered out non-English expected outputs in metadata preprocessing - zhengqunkoo/tweet-prediction