Swearword Dictionary for Data Science Projects

My co-author Anatoliy Gruzd and his social media lab created the data repository for our project on online swearing contagion. If you need a dictionary of online swearwords for a #DataScience project, you can find our data repository and publication here: https://dataverse.scholarsportal.info/dataset.xhtml?persistentId=doi:10.5683/SP/J59UUG


Probability: How likely is a person to become a hate crime victim in the U.S.?

Bayes’ Theorem is one way of calculating a conditional probability, and it is especially useful for computing the actual probability of a rare event.

With the rise of the alt-right, white nationalism, and similar movements, hate crime has once again become a sensitive yet important issue. As an Asian immigrant myself, I wondered what my chance of becoming a hate crime victim is (and what the chance is for other demographic groups), and how large or small my chance is compared to theirs.

Here is the Excel file that includes my analysis: HateCrimeProbability

My analyses were based on the data from

Here’s my analytic approach:

(1) Calculate the probability that a person in the U.S. becomes a hate crime victim, given that the person is Asian (or belongs to another demographic group).

: This is a Bayes’ Theorem problem, defined as P(A|B), where A is becoming a hate crime victim and B is being Asian (or another demographic).

P(A|B) is calculated from three pieces of information.

P(A) = the probability that a person in the U.S. becomes a hate crime victim = Total number of hate crime victims / Total U.S. population
P(B) = the probability that a person in the U.S. is Asian (or another demographic) = the proportion of the Asian population in the U.S. Census.
P(B|A) = the known probability that a hate crime victim is Asian = Total number of Asian victims / Total number of hate crime victims.

Based on the information above,

P(A|B) = P(B|A)*P(A) / P(B)
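
As a minimal sketch of the formula above, the following Python snippet computes P(A|B) from raw counts. All the counts are hypothetical placeholders chosen to keep the arithmetic easy to follow; they are not the figures from my Excel file, and the function and variable names are my own illustration.

```python
import math


def posterior(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    if p_b <= 0:
        raise ValueError("P(B) must be positive")
    return p_b_given_a * p_a / p_b


# Hypothetical counts (NOT real data), used only for illustration.
total_population = 1_000_000   # everyone in the made-up country
total_victims = 100            # all hate crime victims
group_population = 50_000      # people in demographic group B
group_victims = 20             # victims who belong to group B

p_a = total_victims / total_population        # P(A)
p_b = group_population / total_population     # P(B)
p_b_given_a = group_victims / total_victims   # P(B|A)

p_a_given_b = posterior(p_b_given_a, p_a, p_b)
print(p_a_given_b)  # roughly 0.0004, subject to floating-point rounding

# Sanity check: the result should match victims in the group / group population.
print(math.isclose(p_a_given_b, group_victims / group_population))
```

With these inputs the total population and the total number of victims cancel out, so P(A|B) reduces to the number of victims in the group divided by the size of the group. That gives a quick way to double-check the spreadsheet.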

(2) Calculate the percent difference between the probability for Asians (and other demographic groups) and the probability for Whites (the group conventionally understood as the majority population).

: The percent difference is computed as

{(New number – Reference number) / Reference number} * 100

That is,

{(Asian P(A|B) – White P(A|B)) / White P(A|B)} * 100

(3) Thanks to my colleague Steve Doig, there’s an easier way to gauge the disproportion than the percent difference: simply divide one probability by the other.

That is,

Asian P(A|B) / White P(A|B).
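
Steps (2) and (3) can be sketched in a few lines of Python. The two probabilities below are hypothetical placeholders, not values from my spreadsheet, and the function names are made up for this example.

```python
import math


def percent_difference(new: float, reference: float) -> float:
    """{(New number - Reference number) / Reference number} * 100."""
    return (new - reference) / reference * 100


def simple_ratio(new: float, reference: float) -> float:
    """How many times as large the new probability is as the reference."""
    return new / reference


# Hypothetical P(A|B) values (NOT real data).
group_p = 0.0004       # probability for the group of interest
reference_p = 0.0001   # probability for the reference group

pct = percent_difference(group_p, reference_p)
ratio = simple_ratio(group_p, reference_p)
print(f"percent difference: {pct:.1f}%")
print(f"simple ratio: {ratio:.1f}")

# The two measures carry the same information: pct = (ratio - 1) * 100.
print(math.isclose(pct, (ratio - 1) * 100))
```

The ratio is usually easier to read aloud ("4 times as likely"), while the percent difference is the ratio shifted by one and scaled by 100.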



Overall, the actual probability of becoming a hate crime victim is very low (see column F). For example, an Asian person’s chance of being the victim of a hate crime in the U.S. is 0.000007. For African Americans it is higher, but still only 0.000052. These numbers are not surprising, because I computed them against the total U.S. population.

However, column G shows that the chance is highly disproportionate across racial and religious groups. For example, as an Asian, I’m only 85% more likely to be a hate crime victim than my white friends. If a person is Black/African American, however, the probability is 1185.74% higher than for his or her white friends. Even worse, if a person is Jewish or Muslim, the probability is 2889.15% or 2550.2% higher, respectively, than for a white friend.

Percent differences this large are hard to grasp. Using the simple ratio that Steve Doig recommended, the probability of being a hate crime victim is about 13 times as high for a Black American as for his or her white friends, about 30 times as high for a Jewish person, and about 27 times as high for a Muslim person.
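
The percent figures and the ratios quoted above are two views of the same numbers, since ratio = 1 + percent difference / 100. Here is a small Python check of that conversion, using only the percentages reported in this post; the dictionary and variable names are my own.

```python
# Percent differences relative to White, as reported in this post.
reported_pct = {
    "Black/African American": 1185.74,
    "Jewish": 2889.15,
    "Muslim": 2550.2,
}

# ratio = 1 + percent difference / 100
ratios = {group: 1 + pct / 100 for group, pct in reported_pct.items()}

for group, ratio in ratios.items():
    print(f"{group}: ratio {ratio:.2f}, about {round(ratio)} times")
```

Rounded to whole numbers, the ratios come out to 13, 30, and 27, which matches the figures in the paragraph above.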

Should we be concerned about the actual probability of this happening in real life, or about the “structured disproportion” in the likelihood? I’ll leave it as an open question.



Get Twitter Data (tweets + user info) with R

Get Twitter Data (Tweets + User Information) Using R (Script link HERE)
: This script is a simple set of code for capturing Twitter data. You will need to install the twitteR, RCurl, and ROAuth packages. You also need to set up an app for Twitter authorization (apps.twitter.com). The script lets you (1) collect tweets with the searchTwitter function; (2) extract the information associated with the users who tweeted; and (3) join the tweets and user info using the “merge” function.
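
To illustrate what the join in step (3) does, here is a minimal Python sketch. It is not the R script itself and it does not call the Twitter API. The records, the field names (screen_name, text, followers, location), and the variable names are all hypothetical, made up for this example. It keeps only tweets whose author has a user record, which is how R's merge behaves by default.

```python
# Hypothetical tweet records (NOT real data).
tweets = [
    {"screen_name": "user_a", "text": "first example tweet"},
    {"screen_name": "user_b", "text": "second example tweet"},
    {"screen_name": "user_a", "text": "third example tweet"},
    {"screen_name": "user_c", "text": "tweet with no user record"},
]

# Hypothetical user records (NOT real data).
users = [
    {"screen_name": "user_a", "followers": 10, "location": "Toronto"},
    {"screen_name": "user_b", "followers": 25, "location": "Phoenix"},
]

# Index users by the shared key so each tweet can look up its author.
users_by_name = {user["screen_name"]: user for user in users}

# Inner join: one merged row per tweet whose author is in the user table.
merged = [
    {**tweet, **users_by_name[tweet["screen_name"]]}
    for tweet in tweets
    if tweet["screen_name"] in users_by_name
]

for row in merged:
    print(row)
```

Each merged row carries both the tweet text and the author's profile fields, so tweet-level and user-level variables can be analyzed together.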

Twitter Profile Data Collection and Classification Model (API & Python)

Twitter User Description Python Script link HERE

This GitHub repository was developed by Hunter J. Priniski, my talented research assistant. It includes the scripts for Twitter profile data collection and the machine-learning (Random Forests) model for Twitter profile classification. The repository is not yet finalized and is updated from time to time.

The scripts run in a Python 3.4.0 environment.

Basic R-Siena Script

RScript link HERE

RSiena is used for longitudinal, actor-based statistical network modeling. It is an especially useful tool for those interested in social influence and social selection processes in social networks. I gave a small workshop on the basic use of RSiena at the Social Media Lab, Ryerson University, CA. Find the workshop ppt HERE.

This script is a simplified version of an R script for RSiena, adapted from the original manual by Snijders et al.