Carnegie Mellon University

Interpretation of User Comments for Detection of Malicious Websites

posted on 2024-05-09, 19:09 authored by Mehrbod Sharifi

Automated understanding of natural language is a challenging problem that has remained open for decades. We investigate a special case of it: identifying the concepts in natural-language text that are relevant to a specific given task. We have developed a set of general-purpose language interpretation techniques and applied them to the task of detecting malicious websites by analyzing comments from website visitors. In this context, concepts relate to the behavior or contents of websites, such as the presence of pop-ups and false testimonials.

The developed algorithms are based on probabilistic topic models and other dimensionality reduction techniques, applied to a special case of multi-label text classification in which concepts are the output labels. Using a concept graph, we integrate information about the target task with other relevant information, including relations among concepts and external knowledge sources. The system iterates between training a topic model on the partially labeled data and optimizing the parameters and the label assignments. We analyze several alternative versions of this mechanism, such as one that measures the quality of separation among topics and eliminates words that are not discriminative.
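As an illustration of what eliminating non-discriminative words can look like, here is a minimal sketch in Python. It is not the dissertation's algorithm: it assumes each word already has a weight per topic and uses Shannon entropy as the separation measure, which is one possible choice among many. The names `topic_entropy`, `discriminative_words`, and `max_entropy_ratio`, as well as the toy weights, are hypothetical.

```python
import math


def topic_entropy(probs):
    """Shannon entropy (in bits) of one word's distribution over topics."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def discriminative_words(word_topic_weights, max_entropy_ratio=0.9):
    """Keep words whose topic distribution is clearly non-uniform.

    word_topic_weights maps word -> list of non-negative weights, one per
    topic (a hypothetical input format). A word is dropped when its entropy
    exceeds max_entropy_ratio times the entropy of a uniform distribution.
    """
    kept = []
    for word, weights in word_topic_weights.items():
        total = sum(weights)
        if total == 0:
            continue  # the word carries no topic information at all
        probs = [w / total for w in weights]
        uniform_entropy = math.log2(len(probs))
        if topic_entropy(probs) <= max_entropy_ratio * uniform_entropy:
            kept.append(word)
    return kept


# Toy weights over three topics; the values are made up for illustration.
toy = {
    "popup": [0.90, 0.05, 0.05],
    "testimonial": [0.05, 0.90, 0.05],
    "website": [0.34, 0.33, 0.33],
}
print(discriminative_words(toy))  # ['popup', 'testimonial']
```

In this sketch, "website" is spread almost evenly across the topics, so it is filtered out, while the two words concentrated in a single topic are kept.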

For the task of detecting malicious websites, we have developed an approach that applies machine-learning techniques to automatically collected data about websites and achieves 98% precision and 95% recall. We present a crowdsourcing system for collecting multiple-choice and free-text comments from website visitors, which is especially useful when other sources of information are insufficient or unreliable. We improve detection performance by considering text features in the comments about a website. This performance gain is greater with unstructured free-text comments than with multiple-choice comments. Finally, we have evaluated our language interpretation framework and shown that the performance gain from the extracted concepts is related to the popularity of the website, and that task-based concepts are complementary to text features for obscure websites.
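For readers unfamiliar with the two metrics quoted above, the following Python sketch shows how precision and recall are computed for a binary "malicious or not" decision. The function name `precision_recall` and the toy labels are hypothetical and are not taken from the dissertation's data.

```python
def precision_recall(true_labels, predicted_labels):
    """Return (precision, recall) for binary labels, where 1 means malicious."""
    pairs = list(zip(true_labels, predicted_labels))
    true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
    false_pos = sum(1 for t, p in pairs if t == 0 and p == 1)
    false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Precision: fraction of sites flagged as malicious that really are.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    # Recall: fraction of truly malicious sites that were flagged.
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Toy labels for six websites; made up for illustration only.
truth = [1, 1, 1, 1, 0, 0]
predicted = [1, 1, 0, 0, 1, 0]
precision, recall = precision_recall(truth, predicted)
print(round(precision, 3), round(recall, 3))  # 0.667 0.5
```

Here two of the three flagged sites are truly malicious (precision 2/3), and two of the four malicious sites were caught (recall 1/2).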




Degree Type

  • Dissertation


Department

  • Computer Science

Degree Name

  • Doctor of Philosophy (PhD)


Advisor(s)

  • Jaime Carbonell
