Commit 384ef3fc authored by Sina

proposal added

parent 5e9d6ab4
{
"cells": [],
"metadata": {},
"nbformat": 4,
"nbformat_minor": 5
}
\documentclass[12pt]{extarticle}
\usepackage[utf8]{inputenc}
\usepackage{cite}
\usepackage[headheight=20pt,headsep=0pt]{geometry}
\usepackage{graphicx}
\usepackage{wrapfig}
\usepackage{hyperref}
\usepackage{xcolor}
\usepackage{IEEEtrantools}
\graphicspath{ {./images/} }
\title{Project Proposal \\
Natural Language Processing with Disaster Tweets
}
\author{Sina Abbasi}
\date{Feb. 2022}
\begin{document}
\maketitle
% \leftskip=-0.5in
% \rightskip=-0.5in
\section*{Motivation}
\begin{wrapfigure}{r}{0.4\textwidth}
\begin{center}
% \centering
\includegraphics[width=0.3\textwidth,height=0.6\textwidth]{twitterfig1.png}
\end{center}
\caption{Percentage of each platform's users who get their news from that platform.}
\label{fig:twitter}
\end{wrapfigure}
Nowadays, thanks to smartphones and easy internet access across the world, free social media applications are among the best places to get breaking news instantly. News consumption on social media grew especially quickly after COVID-19 hit: according to a Pew Research Center survey conducted 31 August - 7 September 2020, about half of US adults (53\%) get their news ``often'' or ``sometimes'' from social media \cite{shearer2021news}. For example, about half of Twitter's users get their news there, as you can see in Fig. \ref{fig:twitter} \cite{shearer2021news}. People also use Twitter to report emergencies they are witnessing in real time. Therefore, many companies and news agencies are interested in monitoring Twitter for real-time breaking news. More importantly, organizations such as disaster relief agencies want to learn about disasters as soon as possible. This monitoring cannot be done by humans, because Twitter receives about 6,000 tweets per second on average.\\
Monitoring Twitter programmatically by analyzing the words of each sentence in isolation is not a promising approach. For example, the author of one tweet wrote: ``On plus side LOOK AT THE SKY LAST NIGHT IT WAS ABLAZE'' \cite{tw}; here the word ``ABLAZE'' is used metaphorically, not literally.\\
\section*{Project details}
In this project we are going to build a deep learning model that classifies whether a tweet refers to a real disaster or not. Since we will work on text data, this is a Natural Language Processing (NLP) task. The challenge is a binary classification task, and the model architecture (though not the trained weights) can be reused for classification tasks in other domains, e.g. stock market prediction based on tweets or news, sentiment classification of reviews on an online marketplace like Amazon, or spam detection.\\
The input of the task is text, and the binary output should determine whether the text describes a real disaster or not. The evaluation is based on the F1 score between the predicted and expected labels, where F1 is calculated as follows:
\begin{IEEEeqnarray}{rCl}
\label{eq:f1}
F1 &=& 2\times \frac{\text{precision}\times \text{recall}}{\text{precision}+\text{recall}},
\end{IEEEeqnarray}
where
\begin{IEEEeqnarray}{rCl}
\label{eq:precision}
\text{precision} &=& \frac{TP}{TP+FP},
\end{IEEEeqnarray}
\begin{IEEEeqnarray}{rCl}
\label{eq:recall}
\text{recall} &=& \frac{TP}{TP+FN},
\end{IEEEeqnarray}
and TP, FP and FN stand for the numbers of true positives, false positives and false negatives, respectively. $F1 = 1$ is the best value, corresponding to perfect precision and recall, and $F1 = 0$ is the worst.
\\
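For concreteness, here is a minimal Python sketch of Eq. \ref{eq:f1}; the confusion-matrix counts are made up purely for illustration:
\begin{verbatim}
# Illustrative (made-up) counts: 80 true positives,
# 20 false positives, 40 false negatives.
tp, fp, fn = 80, 20, 40

precision = tp / (tp + fp)                          # 0.800
recall    = tp / (tp + fn)                          # 0.667
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 3), round(recall, 3), round(f1, 3))
# -> 0.8 0.667 0.727
\end{verbatim}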
\section*{Data set}
In Fig. \ref{fig:dataSamples}, you can see that the data set includes five columns:\\
\texttt{id}: a unique identifier for each tweet\\
\texttt{keyword}: a particular keyword from the tweet (may be blank)\\
\texttt{location}: the location the tweet was sent from (may be blank)\\
\texttt{text}: the text of the tweet\\
\texttt{target}: denotes whether a tweet is about a real disaster (1) or not (0).\\
The training data has about 7.5k rows, and the target values are fairly balanced, as you can see in Fig. \ref{fig:targetdist}.\\
\begin{figure}[!t]
\centering
%\includegraphics[scale=0.4]{img/figure1.eps}
\includegraphics[width=1\textwidth]{images/dataSamples.png}
\caption{Data samples.}
\label{fig:dataSamples}
\end{figure}
\begin{figure}[!t]
\centering
%\includegraphics[scale=0.4]{img/figure1.eps}
\includegraphics[width=0.8\textwidth]{images/distribution.png}
\caption{Distribution of target values.}
\label{fig:targetdist}
\end{figure}
\section*{Competition details}
The task is an active Kaggle competition, currently with 830 competitors; you can find it \href{https://www.kaggle.com/c/nlp-getting-started/}{\textcolor{blue}{\textit{here}}}. The best score is 1.0 (a perfect score), and the top 50 scores range from 0.84 to 1.0.\\
\section*{Related works}
The authors of \cite{sit2019identifying} work on a similar data set, but with tweets related to a hurricane (Hurricane Irma). They collected the data from Twitter, aiming to first classify and then analyze disaster-related tweets. In their binary classification phase they compared LSTM, CNN, SVM, logistic regression and ridge classifiers, and in their evaluation the Long Short-Term Memory (LSTM) network fit the sequential structure of textual data best.
Looking through the top solutions on the Kaggle website that use an LSTM architecture for this task, we found and ran a notebook that achieved an F1 score of 0.81, which is a decent score for an NLP task. However, BERT, introduced by Google researchers in 2018 in ``BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding'' (already cited about 34,000 times), learns contextual representations of text better \cite{devlin2018bert}. Therefore, we are going to use this transformer for our task and try to achieve better F1 performance than the LSTM.
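As a rough sketch of how this fine-tuning could look, a BERT-based binary classifier can be set up with the Hugging Face \texttt{transformers} library along the following lines; the model name, learning rate, subset size, and single optimization step below are illustrative assumptions, not final design choices:
\begin{verbatim}
# Sketch: fine-tune a pre-trained BERT for binary tweet
# classification. Hyperparameters are placeholders.
import pandas as pd
import torch
from transformers import BertTokenizer, \
    BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Tiny subset so the example runs quickly.
df = pd.read_csv("train.csv").head(16)
enc = tokenizer(list(df["text"]), truncation=True,
                padding=True, return_tensors="pt")
labels = torch.tensor(df["target"].values)

# One illustrative optimization step; a real run would loop
# over mini-batches for several epochs.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**enc, labels=labels).loss
loss.backward()
optimizer.step()
\end{verbatim}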
\bibliographystyle{IEEEtran}
\bibliography{refs}
\end{document}
@misc{shearer2021news,
title={News use across social media platforms in 2020},
author={Shearer, Elisa and Mitchell, Amy},
year={2021},
howpublished={Pew Research Center}
}
@article{devlin2018bert,
title={{BERT}: Pre-training of deep bidirectional transformers for language understanding},
author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
journal={arXiv preprint arXiv:1810.04805},
year={2018}
}
@article{sit2019identifying,
title={Identifying disaster-related tweets and their semantic, spatial and temporal context using deep learning, natural language processing and spatial analysis: a case study of Hurricane Irma},
author={Sit, Muhammed Ali and Koylu, Caglar and Demir, Ibrahim},
journal={International Journal of Digital Earth},
year={2019},
publisher={Taylor \& Francis}
}
@misc{tw,
title = {Tweet: ``On plus side LOOK AT THE SKY LAST NIGHT IT WAS ABLAZE''},
howpublished = {\url{https://twitter.com/AnyOtherAnnaK/status/629195955506708480}}
}
%% Cell type:code id:790abd3e tags:
``` python
import pandas as pd
```
%% Cell type:code id:1c15fe13 tags:
``` python
# Load the Kaggle train/test splits and preview a few rows.
df_train = pd.read_csv('./train.csv')
df_test = pd.read_csv('./test.csv')
df_train.sample(5)
```
%% Output
id keyword location \
2333 3357 demolition Lisbon, Portugal
1491 2149 catastrophe Portugal
5318 7593 outbreak New York, NY
2252 3225 deluged Clearwater, FL
6653 9533 terrorist ????? ???? ????
text target
2333 Draw Day Demolition Daily football selection s... 0
1491 Alaska's #Wolves face catastrophe Denali Wolve... 0
5318 An outbreak of Legionnaires' disease in New Yo... 1
2252 @LisaToddSutton The reason I bring this up bc... 0
6653 #UdhampurAgain 2 terrorist shot dead.. #Udhampur 1
%% Cell type:code id:582b9cb8 tags:
``` python
import matplotlib.pyplot as plt

# Count tweets per class and plot the class balance.
target_count = df_train.target.value_counts()
plt.figure(figsize=(8,4))
plt.bar(['not a disaster','disaster'], target_count.values)
plt.title("Real disaster or not distribution")
```
%% Output
Text(0.5, 1.0, 'Real disaster or not distribution')
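%% Cell type:markdown tags:
As a quick sanity check of the class balance shown above, the per-class fractions can be printed directly (a small sketch reusing the variables already defined):
%% Cell type:code tags:
``` python
# Fraction of tweets per class (0 = not a disaster, 1 = disaster).
print(target_count / len(df_train))
```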