Comparison of Outlier Detection Algorithms on String Data
arXiv cs.LG / 3/13/2026
Key Points
- A thesis posted to arXiv compares two outlier detection algorithms for string data: a variant of the local outlier factor (LOF) that uses a weighted Levenshtein distance, and an approach based on a hierarchical left regular expression learner.
- The first method adapts LOF to strings by estimating local data density with a weighted Levenshtein distance whose weights take hierarchical character classes into account.
- The second method uses a hierarchical left regular expression learner to infer a regex that describes the expected data; strings that deviate from it are treated as anomalies.
- Experiments on several datasets show that both algorithms can, in principle, detect outliers in string data. The regex-based approach does best when the expected data has a structure distinct from that of the outliers, while the LOF variants do best when edit distances clearly separate outliers from expected data.
- The work addresses a gap in string data outlier detection and suggests applications in data cleaning and anomaly detection for system log files.
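
To make the first method more concrete, here is a minimal Python sketch of the general idea: a Levenshtein distance whose substitution cost depends on character classes, fed into a simplified LOF score. It is not the thesis's implementation. The class hierarchy (digit, letter, other), the cost values (0.5 within a class, 1.0 across classes), the tie handling, and the names `char_class`, `sub_cost`, `weighted_levenshtein`, and `lof_scores` are all assumptions made for illustration.

```python
from typing import List


def char_class(c: str) -> str:
    """Coarse character class (assumed hierarchy: digit, letter, other)."""
    if c.isdigit():
        return "digit"
    if c.isalpha():
        return "letter"
    return "other"


def sub_cost(a: str, b: str) -> float:
    """Assumed weights: 0 if equal, 0.5 within a class, 1.0 across classes."""
    if a == b:
        return 0.0
    return 0.5 if char_class(a) == char_class(b) else 1.0


def weighted_levenshtein(s: str, t: str) -> float:
    """Edit distance with unit insert/delete cost and class-aware substitution."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [float(i)]
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1.0,                 # delete a
                curr[j - 1] + 1.0,             # insert b
                prev[j - 1] + sub_cost(a, b),  # substitute a with b
            ))
        prev = curr
    return prev[-1]


def lof_scores(data: List[str], k: int = 2) -> List[float]:
    """Simplified LOF: exactly k neighbours per string, ties broken by index."""
    n = len(data)
    dist = [[weighted_levenshtein(a, b) for b in data] for a in data]
    neigh = [
        sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for i in range(n)
    ]
    # Distance from each string to its k-th nearest neighbour.
    k_dist = [dist[i][neigh[i][-1]] for i in range(n)]

    def lrd(i: int) -> float:
        # Local reachability density; the floor avoids division by zero
        # when duplicates are present (an assumption of this sketch).
        reach = [max(k_dist[j], dist[i][j]) for j in neigh[i]]
        return 1.0 / max(sum(reach) / len(reach), 1e-9)

    lrds = [lrd(i) for i in range(n)]
    return [
        sum(lrds[j] for j in neigh[i]) / (len(neigh[i]) * lrds[i])
        for i in range(n)
    ]


if __name__ == "__main__":
    samples = ["2024-01-01", "2024-01-02", "2024-02-03", "2024-03-11", "hello world"]
    for text, score in zip(samples, lof_scores(samples, k=2)):
        print(f"{score:6.2f}  {text}")
```

Scores near 1 indicate a string whose neighbourhood is about as dense as its neighbours' neighbourhoods, and larger scores indicate likelier outliers. For larger datasets, a precomputed distance matrix could instead be passed to an existing LOF implementation.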




