As the volume of data generated on the internet grows exponentially, developing guided data collection methods becomes increasingly essential to the research process. This paper proposes an approach to building a self-guiding web-crawler that collects data specifically from extremist websites.
The web-crawler is guided by sentiment-based classification rules, which allow it to make decisions about the content of each webpage it downloads.
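As an illustration of how a classifier can steer a crawl, the following minimal sketch keeps a page and follows its links only when a relevance test accepts the page content. The function names (guided_crawl, fetch, extract_links, is_relevant), the page budget, and the in-memory toy web are all assumptions made for this example; the relevance test stands in for the classification rules and is not the classifier described in this paper.

```python
from collections import deque


def guided_crawl(seed, fetch, extract_links, is_relevant, max_pages=100):
    """Breadth-first crawl that only expands pages accepted by is_relevant."""
    frontier = deque([seed])   # URLs waiting to be downloaded
    seen = {seed}              # URLs already queued, to avoid revisits
    kept = []                  # URLs whose content passed the relevance test
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        page = fetch(url)
        fetched += 1
        if not is_relevant(page):
            continue           # discard the page and do not follow its links
        kept.append(url)
        for link in extract_links(page):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return kept


# Hypothetical in-memory "web": url -> (text, outgoing links).
TOY_WEB = {
    "a": ("relevant text", ["b", "c"]),
    "b": ("off-topic text", ["d"]),
    "c": ("relevant text", ["a", "d"]),
    "d": ("relevant text", []),
}

kept = guided_crawl(
    "a",
    fetch=lambda url: TOY_WEB[url],
    extract_links=lambda page: page[1],
    is_relevant=lambda page: page[0].startswith("relevant"),
)
print(kept)
```

In this toy run, page "b" is downloaded but rejected, so its links are never followed; page "d" is still reached because the accepted page "c" also links to it.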
First, content from 2,500 webpages was collected for each of four sentiment-based classes: pro-extremist websites, anti-extremist websites, neutral news sites discussing extremism, and sites with no discussion of extremism. Part-of-speech tagging was then used to find the most frequent keywords in these pages.
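The keyword step can be sketched as a frequency count over tagged tokens. This example assumes the text has already been tagged by some external part-of-speech tagger using Penn Treebank tags, and it assumes that keywords are restricted to nouns (tags beginning with "NN"); neither the tag set nor the noun restriction is stated in the abstract, and the function name top_keywords and the sample sentence are hypothetical.

```python
from collections import Counter


def top_keywords(tagged_tokens, tag_prefix="NN", k=3):
    """Return the k most frequent tokens whose tag starts with tag_prefix."""
    counts = Counter(
        token.lower()
        for token, tag in tagged_tokens
        if tag.startswith(tag_prefix)
    )
    return counts.most_common(k)


# Hypothetical tagger output: (token, Penn Treebank tag) pairs for one page.
tagged_page = [
    ("The", "DT"), ("group", "NN"), ("released", "VBD"), ("a", "DT"),
    ("statement", "NN"), ("about", "IN"), ("the", "DT"), ("group", "NN"),
    ("and", "CC"), ("its", "PRP$"), ("leaders", "NNS"),
]

keywords = top_keywords(tagged_page)
print(keywords)
```

Running the same count over every page in a class, and summing the counters, would give per-class keyword frequencies under these assumptions.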
Sentiment software was used in conjunction with classification software to generate a decision tree that could effectively discern the class into which a particular page falls. The resulting tree showed an 80% success rate at differentiating between the four classes and a 92% success rate at classifying extremist pages specifically.
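To make the idea of a sentiment-based decision tree concrete, the sketch below hand-codes a small tree over two features. Everything in it is an assumption for illustration: the features (a keyword count and a mean sentiment score around those keywords), the thresholds, and the branch order are hypothetical and are not the tree learned in this study, which was produced by classification software from the collected pages.

```python
from dataclasses import dataclass


@dataclass
class PageFeatures:
    keyword_hits: int         # hypothetical: count of extremism-related keywords
    keyword_sentiment: float  # hypothetical: mean sentiment near them, in [-1, 1]


def classify(page, min_hits=5, neutral_band=0.2):
    """Assign one of the four classes using illustrative, made-up thresholds."""
    if page.keyword_hits < min_hits:
        return "no-discussion"        # too few keywords to count as discussion
    if page.keyword_sentiment > neutral_band:
        return "pro-extremist"        # clearly positive sentiment toward keywords
    if page.keyword_sentiment < -neutral_band:
        return "anti-extremist"       # clearly negative sentiment toward keywords
    return "neutral-news"             # keywords present, sentiment near zero


samples = [
    PageFeatures(keyword_hits=1, keyword_sentiment=0.9),
    PageFeatures(keyword_hits=12, keyword_sentiment=0.6),
    PageFeatures(keyword_hits=9, keyword_sentiment=-0.5),
    PageFeatures(keyword_hits=20, keyword_sentiment=0.05),
]
labels = [classify(page) for page in samples]
print(labels)
```

A learned tree would choose its own split features and thresholds from the training pages; the fixed values here only show the shape of the resulting rules.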
The decision tree was then applied to a randomly selected sample of pages from each class. The results of this secondary test were similar to those of the primary test and hold promise for future studies using this framework.