João Cepêda

Telecommunications Fraud Detection With Data Mining Techniques

Introduction

Since the beginning of commercial telecommunications services, companies have been struggling with the problem of fraud. In the telecommunications area, fraud is defined as the use of services without the intention of paying for them. The main consequence of this problem is the loss of a large amount of revenue by these companies. In 2013 alone, around 2% of revenue (about 46 billion USD out of 2.2 trillion) was lost to fraud worldwide. As a result, the telecom industry has been seeking ways to detect and prevent fraudulent attacks on its systems.

The volume of data generated by this industry is so large that data mining is presented as the best solution to support the detection and prediction process. Data mining studies patterns in data that can lead to new knowledge. In this case, the patterns that the data mining process tries to unveil are those of fraudulent behaviour or, on the contrary, of normal behaviour.

This project studies one of the main problems affecting the classification process in anomaly detection systems: the unbalanced data effect. As the name suggests, this effect arises when the data consist of instances that mostly belong to one of the classes. In this case, the most abundant class is most likely to be the normal class. Classification models (classifiers) built on data with this property tend to assign new instances to the most abundant class, which is clearly a problem for fraud detection because few or no frauds will be detected.
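The unbalanced data effect can be made concrete with a small sketch. The class ratio below (990 normal instances to 10 frauds) and the `evaluate` helper are assumptions chosen for illustration, not data or code from this project. The sketch shows that a classifier which always answers with the most abundant class can reach high accuracy while detecting no fraud at all.

```python
# Illustrative only: synthetic labels, not data from this project.
def evaluate(y_true, y_pred, positive="fraud"):
    """Return (accuracy, recall on the positive class)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_pos = sum(t == p == positive for t, p in zip(y_true, y_pred))
    positives = sum(t == positive for t in y_true)
    accuracy = correct / len(y_true)
    recall = true_pos / positives if positives else 0.0
    return accuracy, recall

# Assumed class ratio: 990 normal instances and 10 frauds (1% fraud).
y_true = ["normal"] * 990 + ["fraud"] * 10

# A degenerate classifier that always predicts the most abundant class.
y_pred = ["normal"] * len(y_true)

accuracy, recall = evaluate(y_true, y_pred)
print(f"accuracy = {accuracy:.3f}, fraud recall = {recall:.3f}")
```

Under these assumed counts the accuracy is 0.99 while the fraud recall is 0, which is why accuracy alone is a misleading measure of a fraud classifier.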

Objectives

The first objective of this project is to evaluate the performance of the Naive Bayes classifier on unbalanced data sets.

The second objective is to apply existing adaptations to the model and evaluate the performance improvements.

The third objective is to propose new solutions to counter the unbalanced data effect on the classification process.
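The first two objectives can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the two binary features, the record counts, the helper names, and the choice of replicating minority instances as the "existing adaptation", which the text above does not specify. It is a minimal categorical Naive Bayes written from scratch, not the implementation used in this project.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels, alpha=1.0):
    """Fit a categorical Naive Bayes model by counting (Laplace smoothing alpha)."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (label, feature index) -> value -> count
    values_seen = defaultdict(set)        # feature index -> distinct values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(label, i)][value] += 1
            values_seen[i].add(value)
    priors = {label: n / len(labels) for label, n in class_counts.items()}

    def likelihood(label, i, value):
        count = value_counts[(label, i)][value]
        return (count + alpha) / (class_counts[label] + alpha * len(values_seen[i]))

    return priors, likelihood

def predict(model, row):
    """Return the class with the largest prior times product of feature likelihoods."""
    priors, likelihood = model
    scores = {}
    for label, prior in priors.items():
        score = prior
        for i, value in enumerate(row):
            score *= likelihood(label, i, value)
        scores[label] = score
    return max(scores, key=scores.get)

def fraud_recall(model, rows, labels):
    """Fraction of true frauds that the model labels as fraud."""
    hits = sum(predict(model, r) == "fraud" for r, t in zip(rows, labels) if t == "fraud")
    return hits / labels.count("fraud")

# Hypothetical toy records: (is_international, is_night_call); 95 normal, 5 fraud.
normal = [(1, 1)] * 5 + [(1, 0)] * 14 + [(0, 1)] * 15 + [(0, 0)] * 61
fraud = [(1, 1)] * 3 + [(1, 0)] * 1 + [(0, 1)] * 1
rows = normal + fraud
labels = ["normal"] * len(normal) + ["fraud"] * len(fraud)

# Objective 1: train on the unbalanced data as it is.
plain_model = train_naive_bayes(rows, labels)

# Objective 2 (assumed adaptation): replicate each fraud 19 times
# so that both classes have 95 training instances.
balanced_rows = normal + fraud * 19
balanced_labels = ["normal"] * len(normal) + ["fraud"] * (len(fraud) * 19)
balanced_model = train_naive_bayes(balanced_rows, balanced_labels)

# Both models are scored on the original records (training data, for brevity).
recall_plain = fraud_recall(plain_model, rows, labels)
recall_balanced = fraud_recall(balanced_model, rows, labels)
print(f"fraud recall, unbalanced training: {recall_plain:.2f}")
print(f"fraud recall, balanced training:   {recall_balanced:.2f}")
```

With these made-up counts, the model trained on the unbalanced data labels every record as normal (fraud recall 0.00), because the class prior outweighs the feature likelihoods. The model trained on the replicated data recovers four of the five frauds (0.80), at the cost of flagging some normal records as fraud.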

Strategy

The results will be evaluated mainly empirically. This is the most suitable way to assess the output of the process, since in this field repeated experimentation is the best way to achieve good results.

To achieve this goal, we will first study previous solutions for each objective and then try to optimize them by switching some procedures, by making deeper changes to the algorithms, or even by adding new features to the classification process.