Labeling for awareness: My journey with HRF: Blue Witness

Marcos Morales
7 min read · Jul 29, 2021


As I wrapped up my time with Lambda School, I was given the opportunity to work on a product for the organization Human Rights First. As described in their about page, Human Rights First is an independent advocacy and action organization that challenges America to live up to its ideals, with a focus on pressing the American government and private companies to respect human rights and the rule of law (Source). The organization had been working on a new program centered on raising awareness of police use of force and on reporting incidents for people to see and understand. I was placed on the Data Science team, where I would use my expertise and knowledge to help prepare the program for release. This is my journey.

Blue Witness

Lambda School and HRF have been building the Blue Witness program for several months. The software was built, per HRF’s description, to report police uses of force throughout the United States, using social media as the source of all of its data (Source). It would be capable of reporting every incident, indicating what level of force was used, and displaying all of this on an interactive heatmap of the country.

Blue Witness display page (not released as of this time)

The website receives all of its data through a Twitter scraper that grabs any tweet that looks like a potential incident report. Even so, the Twitter scraper can only be so accurate. So how is the data cleaned up and verified, so that the incidents reported on the website are not only accurate but actual incidents in the first place? That is where the Natural Language Processing model comes in.
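I can’t reproduce the project’s actual scraper here, but a minimal sketch of how such a scraper might pull candidate tweets, using Twitter’s v2 search API through the tweepy library, could look like this (the query, keywords, and token are illustrative assumptions, not the project’s real ones):

```python
import tweepy

# Illustrative credentials; the real scraper's setup is not shown in this post.
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

# A hypothetical query for tweets that might describe police use of force.
query = '(police OR officer) (baton OR "tear gas" OR shoved OR shot) -is:retweet lang:en'

# Grab a batch of recent matches; these become candidates for the NLP model.
response = client.search_recent_tweets(query=query, max_results=100)
candidates = [tweet.text for tweet in (response.data or [])]
```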

The NLP model: BERT

BERT is a Natural Language Processing model developed by Google AI. Without going into the specifics and mathematical nuances of what the model entails, what a Natural Language Processing model does is break sentences, words, and letters down into numerical values and, via extremely complicated mathematical processes, become capable of understanding them. Of course, Blue Witness could have a team of moderators manually check every scraped incident and give it a classification, but this would be extremely work-heavy and time-intensive. So instead, HRF wanted us to develop an NLP model that could make these classifications and labels for us. The model was meant to read a tweet and classify the incident on a scale of six levels, from 0 to 5 (a code sketch of such a classifier follows the list):

  • Level 0 if there was no police presence at all
  • Level 1 if there was a nonviolent police presence
  • Level 2 if there was a physically violent police presence
  • Level 3 if blunt-force weapons were involved (such as batons)
  • Level 4 if chemical or electric weapons were discharged
  • Level 5 if lethal force was used, such as firearms or explosives.
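As a rough illustration (not the project’s actual code), here is how a six-level BERT classifier could be instantiated and queried with Hugging Face’s transformers library; the example tweet and checkpoint are assumptions:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Load a pretrained BERT with a fresh 6-way classification head (levels 0-5).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=6)

# Turn a tweet into token IDs, then ask the model for a force-level prediction.
inputs = tokenizer("Officers shoved a protester to the ground", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1).item())  # predicted level; the head is untrained here, so this is random
```

Until that classification head is fine-tuned on labeled tweets, its predictions are essentially random, which is exactly why the labeling effort described below mattered so much.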

However, NLP models tend to be trained on tens of thousands, if not hundreds of thousands, of training data points to function. They need thousands of already-labeled data instances so the model can learn to do its job correctly.

Consider this

There is an infant that does not know how to speak French. If you wish to teach that infant French, you will have to make them understand how the language functions, how it’s spoken, and how it’s spelled. All of this is done by having them hear samples of the language. Spoken language, although it can be taught, is mainly learned through hearing. The human brain hears the language and begins associating words and phrases with ideas or objects. The more of the language it hears, the more it learns, and the more complete an understanding of the language the brain will have. The NLP model, in this sense, functions much like a human brain. All forms of Machine Learning, or Artificial Intelligence, follow the same principle: they need data, or information, to learn the task they were given. In this case, we’re teaching the BERT model English, or to be specific, which English words distinguish a peaceful protest with no police presence from a policeman shoving someone into the ground.

A showcase of how an NLP model starts breaking down the values of words, and how it slowly learns a language (Source)

The hard journey of inputting 0s and 5s

When I was handed the project, I was shown a BERT model with roughly seven hundred fifty labeled tweets. The labeling task was assigned not only to my team but to the back-end and front-end teams of the project as well, with the objective of labeling as many data points as possible: five thousand tweets labeled and ready for the model to learn from. Come the next day, and I saw that we had made the great leap from seven hundred fifty to seven hundred sixty.

We were going at a very slow pace

With four weeks given until the project’s deadline, and a BERT model expecting around four thousand new data points, I took it upon myself, as the Machine Learning engineer of the team, to label as many tweets as I could. The process went as follows (a sketch of what a simple labeling helper could look like comes after the list):

→ An unlabeled tweet would be given to me

→ I would read the entire tweet, and consider what it was reporting

→ Based on the report, I would classify it on the 0–5 scale, and then send that tweet to the database.
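The team’s real tooling isn’t shown in this post, but a bare-bones labeling helper along these lines could look like the following sketch (the SQLite table and prompt are illustrative; the project used its own database):

```python
import sqlite3

# Illustrative local store; the real labels went to the project's database.
conn = sqlite3.connect("labels.db")
conn.execute("CREATE TABLE IF NOT EXISTS labels (tweet TEXT, force_rank INTEGER)")

def label_tweets(tweets):
    """Show each unlabeled tweet and record the human's 0-5 judgment."""
    for tweet in tweets:
        print("\n" + tweet)
        rank = input("Force level (0-5, anything else to skip): ").strip()
        if rank in {"0", "1", "2", "3", "4", "5"}:
            conn.execute("INSERT INTO labels VALUES (?, ?)", (tweet, int(rank)))
            conn.commit()

label_tweets(["Officers fired pepper spray into the crowd"])
```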

As simple as the task may seem, it’s far harder than anyone could imagine. Reading and understanding the nuances that separate a two from a three in a given tweet might not seem so aggravating at first, but once someone has to do it one hundred to two hundred times a day, it becomes a huge task. My first three weeks were fully dedicated to labeling as much data as I humanly could. Not only did I shape my schedule around getting as many labels done as I could, but I also had to avoid rushing or going too fast, because any mistake in my labeling would give the model information that would hurt it more than help it.

The Result

After three weeks of constant labeling, I was able to increase the training data by 4,500 data points. With the help of the team, we went from 750 data points to over 6,000.

Data labeled (as you can see, we reached upwards of 6,035 data points).
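With the labeled tweets in hand, the model could be retrained. The project’s actual training code isn’t reproduced here; below is a minimal sketch of what fine-tuning BERT on labeled tweets might look like with Hugging Face’s Trainer (the example data, hyperparameters, and output directory are all assumptions):

```python
import torch
from sklearn.model_selection import train_test_split
from transformers import (BertForSequenceClassification, BertTokenizerFast,
                          Trainer, TrainingArguments)

# Stand-in data; the real set held roughly 6,000 human-labeled tweets.
texts = ["No police at the march today", "Officers fired tear gas into the crowd"]
labels = [0, 4]
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=1/6, random_state=42)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

class TweetDataset(torch.utils.data.Dataset):
    """Wrap tokenized tweets and their 0-5 labels for the Trainer."""
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=6)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-force-levels", num_train_epochs=3),
    train_dataset=TweetDataset(train_texts, train_labels),
    eval_dataset=TweetDataset(test_texts, test_labels),
)
trainer.train()
```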

And now, with over five thousand data points to use for training, and one thousand data points to test the model’s performance on, a new BERT was trained. Once that was accomplished, I focused on writing code to create a statistical report of the model’s performance, measuring how correctly it labeled tweets based on its training, and the results ended up like the picture below:

What you see as the predicted labels is what the model predicted for certain tweets, while the true labels are the actual labels (inputted by humans) for those same tweets. This is called a Confusion Matrix, and it’s a nice visualization for understanding the performance of a model. Now, keep in mind, this is a Machine Learning project done on what could be considered a minute amount of data. Despite the small number of data points and cases given to the model, the performance was more than adequate! We could see some mistakes the model made classifying 0s and 1s, but considering that both of those numbers result in no report being made for the website, what mattered was the model correctly distinguishing a 0–1 from a 2–5. After all, the 0–1s would be scrapped, while the 2–5s would be sent to the administrator page. So I made a binary classification report to understand how well the model separates these two groups.
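For anyone who wants to reproduce this kind of report, here is a small sketch of how both matrices could be computed with scikit-learn (the label arrays are made up for illustration):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels on the 0-5 force scale; rows are true labels, columns predictions.
y_true = np.array([0, 1, 2, 3, 4, 5, 2, 0])
y_pred = np.array([0, 0, 2, 3, 4, 5, 3, 1])
print(confusion_matrix(y_true, y_pred, labels=list(range(6))))

# Collapse to the split the site cares about: 0-1 (no report) vs 2-5 (reported).
print(confusion_matrix(y_true >= 2, y_pred >= 2))
```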

This confusion matrix is a lot simpler, but it also showcases some very positive results. In fact, these results can be broken down into the numerical values seen below:

Precision is the fraction of predicted positive labels that were actually positive, while recall is the fraction of actually positive labels that the model managed to predict as positive. Both of these combine into the F1-score, their harmonic mean, which in simple terms showcases how accurate a model is. BERT showcased significantly high F1 scores, allowing the Blue Witness group to say that BERT can confidently separate nonviolent cases from violent cases (with some errors, of course; no system is ever perfect). I showcased my statistical reports to the teams and finished up my work with the model.
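As a quick illustration of those three metrics (again with made-up binary labels, where 1 marks a violent, reportable incident):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up binary labels: 1 = violent incident (levels 2-5), 0 = no report (0-1).
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]

print(precision_score(y_true, y_pred))  # of 5 predicted positives, 4 were real: 0.8
print(recall_score(y_true, y_pred))     # of 5 actual positives, 4 were caught: 0.8
print(f1_score(y_true, y_pred))         # harmonic mean of the two: 0.8
```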

Now, as my time with the team ends, and HRF prepares Blue Witness for release, I am confident that the BERT model will serve them well in filtering out tweets that are not incident reports of police use of force at all, and in classifying tweets that actually report police officers using different levels of force. It was a great experience, with a professional environment I learned much from and good coworkers who were willing to help. This is only one of my first steps as a data scientist, and I am excited to see what the future holds.


Marcos Morales

I’m a Guatemalan student at Lambda School focusing on data analysis and machine learning.