Ryan Arias Delafosse // Humanitarian Data Scientist

Project

Automated Humanitarian Data Classification

Summary


GitHub Repository


The objective of this framework is to detect, classify, and recommend a category for data collected in the context of natural disasters and humanitarian emergencies. The categories of humanitarian data collected at the onset of a crisis are usually limited to the most immediate needs of a population. By identifying these categories and classifying incoming data automatically, humanitarians and information managers can focus on collecting, cleaning, and using information while the classified data is analyzed and mapped.


Over 2,000 datasets were scraped from the Humanitarian Data Exchange (HDX). The datasets pulled had Humanitarian Exchange Language (HXL) tags, which identify the type of data in each column. This data was processed and distilled down to a smaller set of tags, keeping only those relevant to rapid needs assessments. These tags were then turned into categories for the model's training set. Word embeddings were created using fastText, an NLP library created by Facebook. A multi-layer perceptron (MLP) model was then trained on these word embeddings.
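The tag-distillation step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the shortlist in RELEVANT_TAGS, the helper names, and the sample headers are all assumptions made for the example. It relies only on the general HXL convention that a tag is a hashtag optionally followed by "+attribute" modifiers.

```python
import re
from collections import Counter

# Hypothetical shortlist of hashtags kept as categories.
# The real list used in the project is not given in the text.
RELEVANT_TAGS = {"#adm1", "#adm2", "#affected", "#date", "#indicator"}

# An HXL tag is a hashtag optionally followed by "+attribute" modifiers,
# for example "#affected +f +children".
HXL_PATTERN = re.compile(r"^\s*(#[A-Za-z]\w*)((?:\s*\+[A-Za-z]\w*)*)\s*$")


def base_hashtag(cell):
    """Return the lowercase hashtag without attributes, or None if not an HXL tag."""
    match = HXL_PATTERN.match(cell)
    return match.group(1).lower() if match else None


def build_training_pairs(headers, hxl_row):
    """Pair each column header with its hashtag, keeping only relevant categories."""
    pairs = []
    for header, cell in zip(headers, hxl_row):
        category = base_hashtag(cell)
        if category in RELEVANT_TAGS:
            pairs.append((header.strip().lower(), category))
    return pairs


# Invented sample: one header row and its HXL tag row.
headers = ["Province", "Total Affected", "Report Date", "Org Name"]
hxl_row = ["#adm1 +name", "#affected +total", "#date +reported", "#org +name"]

pairs = build_training_pairs(headers, hxl_row)
category_counts = Counter(category for _, category in pairs)
print(pairs)
print(category_counts)
```

Pairs like these (header text as input, hashtag as label) are the kind of training examples that could then be embedded and passed to a classifier. Dropping the attributes is a design choice made here to keep the label set small.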


Our classification accuracy averaged 91.8%, with some categories more successful than others. There were high levels of confidence on certain variables such as indicators, number affected, and administrative levels. There were interesting failures, such as confusion between time variables (“date” and “year” categories).
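A per-category breakdown like the one described above can be computed with a few lines of Python. The sketch below is illustrative only: the function names and the toy labels are invented for the example and do not reproduce the project's results.

```python
from collections import Counter


def per_category_accuracy(y_true, y_pred):
    """Fraction of correctly predicted examples for each true category."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {category: correct[category] / total[category] for category in total}


def top_confusions(y_true, y_pred, n=3):
    """Most frequent (true, predicted) pairs among the misclassified examples."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(n)


# Toy labels invented for illustration; they are not the project's data.
y_true = ["date", "date", "date", "year", "year", "affected", "adm1"]
y_pred = ["date", "year", "year", "year", "date", "affected", "adm1"]

accuracy = per_category_accuracy(y_true, y_pred)
confusions = top_confusions(y_true, y_pred, n=1)
print(accuracy)
print(confusions)
```

Listing the most common (true, predicted) pairs is one simple way to surface systematic mix-ups such as the "date" versus "year" confusion.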


The model was then tested on assorted rapid needs assessment datasets to measure its accuracy.

Tech

Python, classification problem, multi-layer perceptron

Organizations

Humanitarian Data Exchange

Year

2021