zhengqunkoo

tweet-prediction

Python · Jupyter Notebook · Shell

With @arjo129, for the unpossibly competition

Live activities

Fixed a critical logic error in beam_search where the new_top_k dictionary was being prematurely reset inside the iteration loop. By moving the initialization outside the loop, the algorithm now correctly aggregates all candidates before performing the top-k selection. This ensures the beam search maintains coherence across candidates throughout the expansion process.
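A minimal sketch of one expansion step after the stated fix, with the candidate pool initialized once outside the per-beam loop. All names here (expand_beam, toy_probs) are illustrative, not the repo's actual code:

```python
import heapq

def expand_beam(beams, next_word_probs, k):
    """One beam-search expansion step.

    beams: dict mapping candidate sequence (tuple of words) -> probability.
    next_word_probs: function (sequence) -> dict of next word -> probability.
    """
    # The fix: initialize the candidate pool ONCE, outside the per-beam
    # loop, so candidates from every beam compete in the same selection.
    new_top_k = {}
    for seq, prob in beams.items():
        for word, p in next_word_probs(seq).items():
            new_top_k[seq + (word,)] = prob * p
    # Keep only the k most probable candidates across all beams.
    return dict(heapq.nlargest(k, new_top_k.items(), key=lambda kv: kv[1]))

def toy_probs(seq):
    # Toy next-word distribution for demonstration only.
    return {"a": 0.6, "b": 0.4}

beams = {("<s>",): 1.0}
beams = expand_beam(beams, toy_probs, 2)
```

Resetting new_top_k inside the loop would instead make each beam overwrite its predecessors, so only the last beam's expansions would ever reach the top-k selection.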

The gcloud/bash script has been updated to reflect changes in the make_submissions.py interface. The command now uses the 'test' target and sets both the k and j parameters to 10, keeping the script aligned with current processing requirements for submission generation.

Improved the flexibility of model evaluation by allowing beam search parameter 'k' to exceed 3 while still restricting output to the top 3 predictions. Also updated 'make_submissions' to support batch processing multiple .pickle files simultaneously, streamlining the submission workflow for large datasets.

Updated the parsing logic in gcloud/metadata_preproc.py to use a more robust strip_prediction function for handling sequence delimiters. This change streamlines how model outputs are processed during beam search, replacing brittle indexing with a dedicated utility. It ensures cleaner output handling and resolves previous issues with trailing characters in predictions.
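A hedged sketch of what such a utility might look like. The helper name comes from the commit description, but the delimiter tokens shown here are assumptions:

```python
def strip_prediction(raw, start_token="<s>", end_token="</s>"):
    """Trim sequence delimiters from a decoded prediction.

    Replaces brittle fixed-position indexing: tokens are only removed
    when they actually appear at the expected ends of the sequence.
    """
    words = raw.split()
    if words and words[0] == start_token:
        words = words[1:]
    if words and words[-1] == end_token:
        words = words[:-1]
    return " ".join(words)
```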

Updated the parse_output function in metadata_preproc.py to ensure that empty strings and single-space characters are not captured as valid words during post-processing. This prevents downstream issues where noise or whitespace might be incorrectly interpreted as content tokens.
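The described filter could be as simple as the following sketch; the real parse_output may do more:

```python
def parse_output(raw):
    """Split raw model output into word tokens, dropping empty strings
    and whitespace-only tokens so noise is never treated as content."""
    return [token for token in raw.split(" ") if token.strip()]
```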

This commit fixes leftover merge noise in gcloud/metadata_preproc.py by dropping unused imports and keeping only the symbols the module actually relies on. The code change is small, but it helps reduce ambiguity in the preprocessing path and makes the file easier to maintain after repeated merge fixes. Practical effect: no new behavior, just a cleaner and less error-prone module.

To improve the developer workflow and make testing more accessible, a graceful fallback was introduced when importing the ijson parser. If the high-performance yajl2_cffi backend isn't installed, the application will now seamlessly default to the standard ijson library. Minor adjustments were also made to the testing logic to better handle reading evaluation files. This limits environment setup friction, making it significantly easier to run tests on systems without C bindings.
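A minimal sketch of the fallback, assuming the standard ijson package layout (the yajl2_cffi backend ships with ijson but needs CFFI and libyajl at runtime):

```python
def load_ijson():
    """Return the fastest available ijson backend.

    Prefer the C-accelerated yajl2_cffi backend, but degrade to the
    pure-Python parser so the code still runs without C bindings.
    """
    try:
        import ijson.backends.yajl2_cffi as backend
    except ImportError:
        import ijson as backend
    return backend
```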

Fixed an issue in the beam search algorithm where predictions could gain extra words or crash with an IndexError when processing the last word. The search logic was updated to maintain candidate states properly, multiply probabilities correctly, and finish the loop cleanly across all k candidate branches. Additionally, ijson now gracefully falls back to the default implementation if the CFFI backend is missing. This prevents runtime crashes while noticeably improving the reliability and accuracy of parsed predictions.

This change extends gcloud/metadata_preproc.py with a simple command-line evaluation path that runs test_model_twitter and prints predictions for the supplied arguments. It makes ad hoc testing easier without wiring up a separate harness. The practical effect is faster manual validation of model behavior during development.

To speed up arithmetic during the beam search process, sequence probabilities are now updated by directly multiplying them instead of converting to log-scale for addition. While logarithmic probabilities are typically used to avoid floating-point underflow, skipping the overhead of repeated log() operations provides a much faster execution path. This trade-off should result in a noticeable speedup when evaluating and selecting the top candidates, provided the sequences are short enough to remain within standard floating-point limits.
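The trade-off can be sketched as follows, assuming per-word probabilities are accumulated along each candidate sequence:

```python
import math

def seq_prob_direct(word_probs):
    """Direct multiplication: the fast path described above."""
    total = 1.0
    for p in word_probs:
        total *= p
    return total

def seq_prob_log(word_probs):
    """Log-space accumulation: slower, but safe against underflow
    when sequences grow long."""
    return math.exp(sum(math.log(p) for p in word_probs))
```

For the short sequences involved, both paths agree to within floating-point tolerance; the direct path simply skips the repeated log() and exp() calls.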

The beam_search function's completion criterion has been updated to count finished candidates properly before returning, replacing the crash-prone IndexError handler. A simple fallback import was also added to switch seamlessly to the base ijson library if the yajl2_cffi backend is unavailable. This provides much greater stability for sequence generation and deployment.

This change corrects a typo in a URL used by the submission script. It’s a small fix, but it likely prevents requests from going to the wrong endpoint or failing unexpectedly during submission generation. The practical effect is more reliable behavior in the gcloud submission workflow.

The data processing pipeline has been updated to support splitting large JSON scoring files into smaller chunks. A new bash script was also introduced to execute the submission generation script concurrently across these splits using nohup. This change improves overall efficiency by parallelizing the workload and ensuring already-evaluated tweets aren't redundantly processed.
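The splitting step might look like this in-memory sketch (the real pipeline splits a large JSON file on disk, and a bash script then launches one worker per chunk under nohup):

```python
def split_records(records, n_chunks):
    """Split a non-empty list of scoring records into at most n_chunks
    pieces of roughly equal size, so each piece can be processed by an
    independent worker."""
    chunk_size = -(-len(records) // n_chunks)  # ceiling division
    return [records[i:i + chunk_size]
            for i in range(0, len(records), chunk_size)]
```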

Processing potentially noisy metadata occasionally caused the langdetect library to throw exceptions, crashing the data preparation pipeline. The language detection step is now wrapped in a try-except block to silently skip problematic inputs and keep the generator flowing. Additionally, models are now checkpointed only every 20 epochs, significantly cutting disk I/O and boosting overall training speed.

This patch resolves a few minor issues in the data preprocessing and training script. File handling for the ijson parser was updated to binary read mode ('rb') to prevent decoding errors, and unused variables were removed. Additionally, the training loop logic was smoothed out to properly handle building a new model versus resuming from saved weights.

We updated the metadata preprocessing pipeline to only include English data. By integrating the langdetect library, the generator now evaluates the expected output string and ensures it is classified as English before yielding it. This simple filter helps maintain higher data quality by automatically discarding multi-language artifacts or unsupported test cases from the dataset.
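Combined with the exception guard described earlier, the filter could be sketched as:

```python
def is_english(text):
    """Classify text as English with langdetect, treating any detection
    error (empty or featureless input) as non-English.

    Sketch of the described filter; langdetect is imported lazily so
    the rest of the module works even where it is not installed.
    """
    from langdetect import detect
    try:
        return detect(text) == "en"
    except Exception:  # langdetect raises on inputs with no usable features
        return False
```

The generator would then only yield an example when is_english(expected_output) holds, dropping multi-language artifacts before they reach training.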
