Friday, August 5, 2011

Data preparation code for Hearst Challenge Data Mining Competition

http://code.google.com/p/hearstchallenge/

This code project contains the data preprocessing scripts, written in Python, for the Hearst Challenge. They convert the Hearst Challenge data into
SVMlight format. The script currently supports only the SVMlight format, but it can easily be modified to
write the data in some other format.

1. Concatenate the Modeling files into a single file
Modify the directory path in concatFiles.py to point to the directory that contains the Modeling_n.csv files and run
python concatFiles.py
This creates a concatenated file, newModeling.csv
2. Data preprocessing
(i) Most of the categorical attributes and IDs are converted into binary nominal attributes,
e.g. ETECH_GROUP, ETHNICITY_DETAIL, EXPERIAN_INCOME_CD_V4, etc.
(ii) The attribute City contains 15478 distinct values. To convert it into a numerical value, I used the distance between each city
and the geographic center of the USA. Want to know how I did it? See distinctCityState.py and getCityCoordinates.py
(iii) Trait attributes are converted into 73 binary nominal attributes for the traits listed in the data dictionary
(iv) The code contains a lot of comments and is self-explanatory
(v) Place the concatenated Modeling file, the Validation file, and the files distFromCentreOfUSA_new.csv and traits.csv in a directory,
configure this directory as basedir in the file hearst2svmlite.py, and then run
python hearst2svmlite.py
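
For readers who want the idea behind steps (i) and (ii) without opening the scripts, here is a minimal sketch. It is not the code from hearst2svmlite.py: the function names are hypothetical, the center coordinates are an assumed point near Lebanon, Kansas (the original scripts may use a different one), and the distance is a plain great-circle (haversine) distance in kilometres.

```python
import math

# Assumed geographic center of the contiguous USA (near Lebanon, Kansas).
# The original scripts may use a different reference point.
CENTER_LAT, CENTER_LON = 39.8283, -98.5795
EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def build_index(rows, categorical_cols):
    """Assign a 1-based feature index to every (column, value) pair seen."""
    index = {}
    for row in rows:
        for col in categorical_cols:
            key = (col, row[col])
            if key not in index:
                index[key] = len(index) + 1
    return index


def one_hot(row, index, categorical_cols):
    """Return {feature_index: 1} for the categorical values present in row."""
    feats = {}
    for col in categorical_cols:
        key = (col, row[col])
        if key in index:  # unseen values are simply skipped
            feats[index[key]] = 1
    return feats


def to_svmlight_line(label, feats):
    """Format one example as '<+1|-1> idx:val ...' with increasing indices."""
    target = "+1" if label > 0 else "-1"
    pairs = ["%d:%g" % (i, feats[i]) for i in sorted(feats) if feats[i] != 0]
    return " ".join([target] + pairs)
```

A numeric feature such as the city distance can be appended under an index past the one-hot block, for example `feats[len(index) + 1] = haversine_km(lat, lon, CENTER_LAT, CENTER_LON)`, before calling `to_svmlight_line`.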

You will get the following files:
distinctVal.txt -
open_flag.train - SVMlight training file for open_flag
click_flag.train - SVMlight training file for click_flag
valid.test - scaled validation file in SVMlight format for prediction
valid_id.test - contains new_id,new_mailing_id for each line in the validation test file

The general problem with the data is that it contains a very low percentage of positives: in the training data, 92% are negatives and only 8% are positives, so handling this high class imbalance is a challenge.
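
One simple way to reduce this imbalance is to randomly undersample the negatives before training. The sketch below is only an illustration of that idea, not necessarily what I used for my submissions. The function names and the `neg_per_pos` parameter are hypothetical, and it assumes each line of the training file starts with a numeric label that is positive for the positive class.

```python
import random


def is_positive(line):
    """True if the SVMlight line carries a positive label (e.g. '+1' or '1')."""
    return float(line.split(None, 1)[0]) > 0


def undersample(lines, neg_per_pos=1.0, seed=0):
    """Keep all positives and a random subset of negatives.

    neg_per_pos is the number of negatives kept per positive example.
    """
    rng = random.Random(seed)
    lines = [ln for ln in lines if ln.strip()]  # ignore blank lines
    pos = [ln for ln in lines if is_positive(ln)]
    neg = [ln for ln in lines if not is_positive(ln)]
    keep = min(len(neg), int(round(len(pos) * neg_per_pos)))
    sample = pos + rng.sample(neg, keep)
    rng.shuffle(sample)  # avoid having all positives first
    return sample
```

With `neg_per_pos=1.0` the result is balanced; a larger value keeps more of the original data at the cost of more imbalance. Only the training files should be sampled this way, never valid.test.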

You are welcome to use the code and I will be happy to hear from you if you have anything to say.

I got this question from Vivek (vivek.vichare@gmail.com); you can find my answer below:

Hey Venkatesh,

I am novice to the field of data mining and came across your python code for Hearst challenge. I myself have tried to participate in the challenge and submitted using GBM algo but not with great results.

I tried your data preprocessing code on python and it works to create datasets for SVM. I wished to try out SVM by myself but am facing challenges as to how to run svmlight. What platform to use and how to go about. It would be great if you can provide some direction.

Thanks,

Vivek

Answer:
Hi Vivek,
Please post this question to my blog comments, as it will be useful for others who are following the code as well.

1. I updated the scripts yesterday; make sure you generate the SVMlight files with the latest version
2. There is a huge class imbalance in the data, 92% negative and only 8% positive, hence you need to use a technique such as sampling to handle it
3. There are more than 1000 features in the generated SVMlight file, so you may need to do some feature extraction / feature selection before training the model
4. For general instructions on running an SVM, refer to "A Practical Guide to Support Vector Classification"
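
As a starting point for point 3, here is a minimal sketch of one very simple kind of feature selection: dropping features that are non-empty in fewer than a given number of training rows. This is just an illustration, not part of the project scripts. The function names and the `min_rows` threshold are hypothetical, and it assumes plain SVMlight lines without trailing `#` comments.

```python
from collections import Counter


def parse_line(line):
    """Split an SVMlight line into (label, [(index, value_string), ...])."""
    parts = line.split()
    feats = []
    for part in parts[1:]:
        idx, val = part.split(":", 1)
        feats.append((int(idx), val))
    return parts[0], feats


def select_frequent_features(lines, min_rows=2):
    """Drop features that appear in fewer than min_rows lines.

    Returns (filtered_lines, kept_indices). Indices are not renumbered,
    which is fine because the SVMlight format is sparse.
    """
    parsed = [parse_line(ln) for ln in lines if ln.strip()]
    counts = Counter()
    for _, feats in parsed:
        counts.update(idx for idx, _ in feats)
    kept = {idx for idx, c in counts.items() if c >= min_rows}
    filtered = []
    for label, feats in parsed:
        pairs = ["%d:%s" % (idx, val) for idx, val in feats if idx in kept]
        filtered.append(" ".join([label] + pairs))
    return filtered, kept
```

The same `kept` set has to be applied to valid.test, otherwise the model will see features at prediction time that it never saw in training.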

Good Luck !!

Update:
I am currently in 49th place on the leaderboard with a score of 0.26307; the first-place score is 0.2272. You can see that the top 50 scores are very close. This is probably due to the difficulty of predicting click_flag: my models seem to perform decently in predicting open_flag, but fail miserably on click_flag. Any model that manages to predict click_flag even to a small extent will surely top the leaderboard.

*** Moved to 44th place on the leaderboard with a score of 0.25126