
PREDICTING INTEREST IN RENTAL APARTMENTS

Wai-Yip Tang and Dr. C.J. Veenman
Leiden Institute of Advanced Computer Science
[email protected]

OVERVIEW

Data has been growing at an exponential rate, while computing power and storage have become cheaper. As companies collect more data, they want to use it to improve their business. These companies either have their own data department or enlist help from outside sources. One way to do the latter is to bring the problem to a competition platform such as Kaggle and host a competition. The rewards in these competitions range from reputation points to large monetary prizes. I joined one such competition on Kaggle and report on my approach here.

RESEARCH QUESTION

How can the interest in rental apartments be accurately predicted from the content of their rental listings using data science methods?

KAGGLE WISDOM

Kaggle is one of the better-known platforms on which companies host competitions and data science enthusiasts compete. Anthony Goldbloom, CEO and founder of Kaggle, spoke at Extract SF 2015. He said that over the preceding year a trend had emerged in how Kaggle competitions are won. According to him, two winning approaches dominate almost every competition: Handcrafted Feature Engineering and Neural Networks.

Handcrafted Feature Engineering

This approach explores the data by plotting and understanding it, for example through domain knowledge, in order to generate creative features that correlate with the target variable. A machine learning algorithm is then run on those features. The winning algorithm has most often been an ensemble of decision trees, such as Random Forest and, in recent months, XGBoost. This approach spends a lot of time on generating features and little time on running and fine-tuning the algorithm. It works well with structured data.
Neural Networks

This approach uses neural networks and deep learning to learn from the data. It spends a lot of time on fine-tuning the neural network and little time on generating features. It works well with unstructured datasets and with problems involving images, video, speech and audio.

APPROACH

The general approach to this Kaggle competition consists of three phases:

1. Feature Extraction and Engineering
2. Data cleaning, Preprocessing and Reduction
3. Data mining and Evaluation

These phases aren't set in stone and can be visited multiple times.

PHASE 1

In phase 1, Feature Extraction and Engineering, the competitor tries to extract and generate as many features as possible. A basic way to do this is to convert every categorical value into a numerical one. This can be done in a few basic ways:

1. Encoding labels: Encode every categorical value into a numerical representation, for example with a label encoder.
2. Vectorize text into numerical feature vectors: Extract the text and use, for example, the Bag of Words or 'Bag of n-grams' representation to turn every word into a feature whose value is the number of occurrences of that word. Depending on the length of the text, normalise these values with a method such as Term Frequency Inverse Document Frequency (TF-IDF).
3. Extracting time information: Split date and time values into their numerical parts.
4. Feature engineering: Derive and generate new creative features from the existing dataset that correlate with the target variable. This could be anything from plotting many graphs to gain insight, to subgroup discovery, to manipulating and transforming the data.

PHASE 2

In phase 2, Data cleaning, Preprocessing and Reduction, the competitor tries to:

1. Clean the data of missing values, wrong values and outliers through deletion or normalisation. Methods to detect outliers include, for example, the Fast Fourier Transform (FFT), rolling medians and hierarchical clustering.
2. Prepare the data so that it can be used with the selected machine learning method.
3. Reduce the number of features from phase 1.

PHASE 3

In phase 3, Data mining and Evaluation, the competitor tries out different kinds of machine learning methods and builds models with them. One example is XGBoost, a gradient boosting algorithm. After implementing the machine learning methods, the competitor evaluates each model based on its Logarithmic Loss:

\[
\text{logloss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log p_{ij} \tag{1}
\]

Here N is the number of listings, M is the number of classes, y_{ij} is 1 if listing i belongs to class j and 0 otherwise, and p_{ij} is the predicted probability that listing i belongs to class j.

Images retrieved from:
http://3rdsectorlabs.com/services/data-cleaning-and-enriching/
http://www.onestopportal.com/images/featuresbanner.png
http://dmlc.ml/2016/12/14/GPU-accelerated-xgboost.html
http://copyrightuser.org/topics/text-and-data-mining/
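The basic conversions of phase 1 (label encoding, bag-of-words counts weighted by TF-IDF, and splitting timestamps) can be illustrated without any library. The following is a minimal, dependency-free sketch and not the code used in the competition; the function names, the whitespace tokenisation, the particular TF-IDF weighting and the toy values are all assumptions made for this example.

```python
import math
from collections import Counter
from datetime import datetime


def label_encode(values):
    """Map each distinct category to an integer, in sorted order."""
    mapping = {label: index for index, label in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def tfidf(documents):
    """Bag-of-words counts reweighted by a simple TF-IDF.

    tf  = occurrences of the word in the document / words in the document
    idf = log(number of documents / documents containing the word)
    """
    tokenised = [doc.lower().split() for doc in documents]
    # Count in how many documents each word appears at least once.
    doc_freq = Counter(word for tokens in tokenised for word in set(tokens))
    n_docs = len(documents)
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        vectors.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors


def split_timestamp(text, fmt="%Y-%m-%d %H:%M:%S"):
    """Split a timestamp string into its numerical parts."""
    ts = datetime.strptime(text, fmt)
    return {"year": ts.year, "month": ts.month, "day": ts.day,
            "hour": ts.hour, "weekday": ts.weekday()}


# Hypothetical toy values, not taken from the competition data.
codes, mapping = label_encode(["low", "high", "medium", "low"])
weights = tfidf(["bright spacious loft", "spacious studio"])
parts = split_timestamp("2016-06-24 07:54:24")
```

A word that occurs in every document, such as "spacious" here, receives a weight of zero under this weighting, which is the normalising effect TF-IDF is meant to have. In practice a library implementation would normally be used, for example scikit-learn's LabelEncoder and TfidfVectorizer, whose exact weighting differs from this sketch.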
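The Logarithmic Loss used for evaluation can be computed directly from its definition. The sketch below is a plain-Python illustration under stated assumptions: the function name, the clipping bound eps and the example predictions are hypothetical, and the clipping and row renormalisation are common safeguards rather than something taken from the poster.

```python
import math


def log_loss(y_true, probabilities, eps=1e-15):
    """Multi-class logarithmic loss.

    y_true        -- list of true class indices, one per listing
    probabilities -- list of rows; row i holds the predicted probability
                     of each class for listing i
    eps           -- assumed clipping bound that avoids log(0)
    """
    total = 0.0
    for true_class, row in zip(y_true, probabilities):
        # Clip to [eps, 1 - eps] and renormalise so the row sums to 1.
        clipped = [min(max(p, eps), 1.0 - eps) for p in row]
        norm = sum(clipped)
        # y_ij is 1 only for the true class, so the inner sum over
        # classes reduces to the log-probability of that class.
        total += math.log(clipped[true_class] / norm)
    return -total / len(y_true)


# Hypothetical predictions for three listings and three classes.
y_true = [0, 2, 1]
probabilities = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.2, 0.5, 0.3]]
loss = log_loss(y_true, probabilities)
```

Lower values are better: a model that assigns a probability close to 1 to the correct class for every listing approaches a loss of 0, while confident wrong predictions are penalised heavily.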
