----------------------------------------
GENERAL INFORMATION
----------------------------------------

1. Title of Dataset:  CASOS/IDeaS Labeled Trolling Dataset

2. Author Information

Author Contact Information
    Name: Joshua Uyheng
    Institution: Carnegie Mellon University
    Address: 5000 Forbes Avenue, Pittsburgh, Pennsylvania, USA, 15213
    Email: juyheng@cs.cmu.edu
    Office Phone Number: 412-508-0630


---------------------------------------
DATA & FILE OVERVIEW
---------------------------------------

Directory of Files:

   A. Filename:  kilt_data.csv
   
      Short description:  This CSV file contains a dataset of tweets (N = 4,917) labeled as trolling or not based on the consensus of three manual annotators.


Additional Notes on File Relationships, Context, or Content: In line with Twitter's terms of service, we only include the tweet ID and the binary label of whether or not the tweet has been annotated to be an instance of trolling. Various libraries are available to rehydrate tweets using only tweet IDs, such as the twarc library.
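
As an illustration of preparing the file for rehydration, the following is a minimal sketch in Python that copies the tweet IDs into a plain-text file with one ID per line, a format commonly accepted by hydration tools. It assumes that kilt_data.csv has a header row with the column names tweet_id and label given in the variable list; the function name write_id_file and the output filename are hypothetical.

```python
import csv


def write_id_file(csv_path, ids_path):
    """Copy the tweet_id column of a kilt_data.csv-style file into a
    text file with one ID per line. Returns the number of IDs written.

    Assumes a header row containing a 'tweet_id' column.
    """
    count = 0
    with open(csv_path, newline="", encoding="utf-8") as src, \
         open(ids_path, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            # IDs are handled as strings so that long numeric IDs are
            # never converted to floating point and rounded.
            dst.write(row["tweet_id"].strip() + "\n")
            count += 1
    return count
```

The resulting file could then be passed to a hydration tool, for example the hydrate command of twarc (something like "twarc2 hydrate ids.txt hydrated.jsonl"), assuming valid API credentials have been configured. Consult the twarc documentation for the exact invocation supported by the installed version.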


----------------------------------------------------------------------------------------------------------
DATA DESCRIPTION FOR: kilt_data.csv
----------------------------------------------------------------------------------------------------------

1. Number of variables: 2


2. Number of cases/rows: 4,917


3. Missing data codes: The dataset has no missing data; however, some tweets may not be available upon rehydration, for example because Twitter has suspended the account or the account owner has made their data private.

4. Variable List

    A. Name: tweet_id  
	
       Description: Denotes the numeric ID representing each individual tweet in the dataset. Tweet IDs are assigned by Twitter.

    B. Name: label  
	
       Description: Takes the value 1 if the tweet has been labeled as an instance of trolling and 0 otherwise. Labels are produced through the consensus of three manual annotators.
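
The two variables above can be read and checked with a short script. The following is a minimal sketch, assuming the CSV has a header row with the column names tweet_id and label; the function names load_labels and label_counts are hypothetical. Tweet IDs are kept as strings because they are long integers that some tools (including spreadsheet software) may round when treating them as numbers.

```python
import csv
from collections import Counter


def load_labels(csv_file):
    """Read an open kilt_data.csv-style file and return a dict mapping
    tweet_id (str) to label (int, 0 or 1).

    Assumes a header row with 'tweet_id' and 'label' columns.
    """
    labels = {}
    for row in csv.DictReader(csv_file):
        label = int(row["label"])
        if label not in (0, 1):
            raise ValueError(
                f"unexpected label {label!r} for tweet {row['tweet_id']}"
            )
        labels[row["tweet_id"].strip()] = label
    return labels


def label_counts(labels):
    """Count how many tweets carry each label value."""
    return Counter(labels.values())
```

To use it, open kilt_data.csv with newline="" and encoding="utf-8", pass the file object to load_labels, and compare the number of entries against the 4,917 rows documented above.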

-------------------------------------------------------
METHODOLOGICAL INFORMATION
-------------------------------------------------------

1. Software-specific information:

Name: Microsoft Excel
Version: 2016 
System Requirements: Windows or macOS
Open Source? (Y/N):  N

Additional Notes: The data were initially entered into Microsoft Excel and can be opened in Excel. However, for stability and access, the files have been saved in CSV format and can be viewed and analyzed on all operating systems, including Linux.


2. Equipment-specific information:

Manufacturer: Lenovo 
Model: IdeaPad 3

Additional Notes: The data were entered, cleaned, and formatted on this computer.


3. Date of data collection: 20200201-20200229

--------------------------------------------------
NOTES ON REPRODUCIBILITY 
--------------------------------------------------

While it may be virtually impossible to recreate these data exactly as shown in the dataset, researchers who follow the methodology described in the published paper can obtain a subset of our original annotated dataset and perform statistical analysis on the textual features of trolling versus non-trolling tweets. Machine learning models can also be trained on the rehydrated tweets following the methods in the paper.

The paper is available open access under a CC BY-NC-ND 4.0 license at the following link: https://doi.org/10.1016/j.ipm.2022.103012. It may be cited as follows:

Uyheng, J., Moffitt, J. D., & Carley, K. M. (2022). The language and targets of online trolling: A psycholinguistic approach for social cybersecurity. Information Processing & Management, 59(5), 103012. https://doi.org/10.1016/j.ipm.2022.103012
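
Because only a subset of the original tweets may be recoverable, it can be useful to report how many tweets of each label survived rehydration before running any analysis. The following is a minimal sketch in Python, assuming the labels are held in a dict mapping tweet ID strings to 0 or 1 and that the IDs of successfully rehydrated tweets are available as an iterable of strings; the function name rehydration_coverage is hypothetical.

```python
def rehydration_coverage(labels, hydrated_ids):
    """Summarise, per label, how many labeled tweets were recovered.

    labels:       dict mapping tweet_id (str) -> label (0 or 1)
    hydrated_ids: iterable of tweet_id strings returned by rehydration
    """
    hydrated = set(hydrated_ids)
    summary = {}
    for label in (0, 1):
        ids = [tid for tid, lab in labels.items() if lab == label]
        recovered = sum(1 for tid in ids if tid in hydrated)
        summary[label] = {
            "total": len(ids),
            "recovered": recovered,
            "missing": len(ids) - recovered,
        }
    return summary
```

Comparing the recovered counts for the two labels gives a quick indication of whether tweet loss has shifted the class balance relative to the original annotated dataset.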