Interpretation of User Comments for Detection of Malicious Websites
Automated understanding of natural language is a challenging problem that has remained open for decades. We investigate a special case of it: identifying concepts in natural-language text that are relevant to a specific given task. We have developed a set of general-purpose language interpretation techniques and applied them to the task of detecting malicious websites by analyzing the comments of website visitors. In this context, concepts relate to the behavior or contents of websites, such as the presence of pop-ups and false testimonials.
The developed algorithms are based on probabilistic topic models and other dimensionality reduction techniques applied to a special case of multi-label text classification, where concepts are the output labels. Using a concept graph, we integrate information about the target task with other relevant information, including relations among concepts and external knowledge sources. The system iterates between training a topic model on the partially labeled data and optimizing the parameters and the label assignments. We analyze several alternative versions of this mechanism, such as one that measures the quality of separation among topics and eliminates words that are not discriminative.
For the task of detecting malicious websites, we have developed an approach that applies machine-learning techniques to automatically collected data about websites and achieves 98% precision and 95% recall. We present a crowdsourcing system for collecting multiple-choice and free-text comments from website visitors, which is especially useful when other sources of information are insufficient or unreliable. We improve detection performance by considering the text features of the comments about a website. This performance gain is greater with unstructured free-text comments than with multiple-choice comments. Finally, we have evaluated our language interpretation framework and shown that the performance gain from the extracted concepts is related to the popularity of the website, and that task-based concepts are complementary to text features for obscure websites.
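For readers less familiar with the two reported metrics, the following minimal sketch shows how precision and recall are computed for a binary malicious/benign classifier. The function name `precision_recall` and the toy labels are assumptions made purely for illustration; they are not the dissertation's code, data, or results.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels, where 1 means 'malicious'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # malicious, flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # malicious, missed
    # Guard against division by zero when nothing is flagged or nothing is malicious.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy labels, not data from the dissertation.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

Precision is the fraction of flagged websites that are truly malicious, while recall is the fraction of truly malicious websites that get flagged.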
History
Date
2014-06-23
Degree Type
- Dissertation
Department
- Computer Science
Degree Name
- Doctor of Philosophy (PhD)