PREDICTING INTEREST IN RENTAL APARTMENTS
WAI-YIP TANG AND DR. C.J. VEENMAN
LEIDEN INSTITUTE OF ADVANCED COMPUTER SCIENCE
[email protected]
OVERVIEW
Data has been growing at an exponential rate, whilst computing power and storage come at a lower cost. As companies collect more data, they want to use it to further improve their business. These companies either have their own data department or enlist assistance from outside sources. One option is to turn to a competition platform such as Kaggle and host a competition around the problem. These competitions can have rewards varying from reputation points to high monetary rewards. I joined such a competition on Kaggle to report on my approach to it.
RESEARCH QUESTION
How can the interest in rental apartments be accurately predicted from their rental listing content through the use of data science methods?
KAGGLE WISDOM
Kaggle is one of the better-known platforms where companies host competitions and data science enthusiasts compete. Anthony Goldbloom, CEO and founder of Kaggle, spoke at Extract SF 2015. He said that over the preceding year a trend had emerged in how Kaggle competitions are won. According to Goldbloom, two winning approaches seem to dominate almost every competition: Handcrafted Feature Engineering and Neural Networks.
Handcrafted Feature Engineering
This approach explores the data by plotting and understanding it, for example through domain knowledge, in order to generate creative features that correlate with the target variable. A machine learning algorithm is then run on these features. The winning algorithm has most often been an ensemble of decision trees such as Random Forest and, in recent months, XGBoost. This approach spends a lot of time on generating features and little time on running and fine-tuning the algorithm. It works well with structured data.
APPROACH
The general approach to this Kaggle competition consists of three phases:
1. Feature Extraction and Engineering
2. Data cleaning, Preprocessing and Reduction
3. Data mining and Evaluation
These phases aren't set in stone and can be visited multiple times.
PHASE 2
In phase 2, Data cleaning, Preprocessing and Reduction, the competitor tries to:
1. Clean the data of missing values, wrong values and outliers through deletion or normalisation. Methods to detect outliers include, for example, the Fast Fourier Transform (FFT), rolling medians or hierarchical clustering.
2. Prepare the data in such a way that it can be used with the selected machine learning method.
3. Reduce the number of features from phase 1.
PHASE 1
In phase 1, Feature Extraction and Engineering, the competitor tries to extract and generate as many features as possible. A basic way to do this is by converting every categorical value into a numerical one. This can be done in a few basic ways:
1. Encoding labels: Encode every categorical value into a numerical representation, for example with a label encoder.
2. Vectorize text into numerical feature vectors: Extract the text and use, for example, the Bag of Words or 'Bag of n-grams' representation to turn every word into a feature holding the count of that word. Depending on the length of the text, normalise these counts with a method such as Term Frequency Inverse Document Frequency (TF-IDF).
3. Extracting time information: Split date and time values into their numerical parts.
4. Feature engineering: Derive and generate new creative features from the existing dataset that correlate with the target variable. This could involve anything from plotting many graphs to gain insight, to subgroup discovery, or manipulating and transforming the data.
PHASE 3
In phase 3, Data mining and Evaluation, the competitor tries out different kinds of machine learning methods and builds models with them. One such example is XGBoost, a gradient boosting algorithm.
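The basic conversions listed for phase 1 (label encoding, bag of words with TF-IDF weighting, and splitting timestamps) can be sketched with the Python standard library alone. This is an illustrative sketch, not the code used in the competition: the function names, the sample values and the timestamp format are assumptions, and the TF-IDF shown is one simple unsmoothed variant (libraries commonly apply smoothing).

```python
import math
from collections import Counter
from datetime import datetime


def label_encode(values):
    """Map each distinct category to an integer, in order of first appearance."""
    mapping = {}
    codes = []
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
        codes.append(mapping[value])
    return codes, mapping


def bag_of_words(texts):
    """Count lowercase, whitespace-separated tokens per document."""
    return [Counter(text.lower().split()) for text in texts]


def tf_idf(counts):
    """Weight raw counts by term frequency times inverse document frequency."""
    n_docs = len(counts)
    # Number of documents in which each word occurs at least once.
    doc_freq = Counter(word for doc in counts for word in doc)
    weighted = []
    for doc in counts:
        total = sum(doc.values())
        weighted.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in doc.items()
        })
    return weighted


def split_timestamp(stamp, fmt="%Y-%m-%d %H:%M:%S"):
    """Break a timestamp string into its numerical parts."""
    moment = datetime.strptime(stamp, fmt)
    return {"year": moment.year, "month": moment.month, "day": moment.day,
            "hour": moment.hour, "weekday": moment.weekday()}


# Hypothetical listing fields, made up for illustration.
codes, mapping = label_encode(["low", "high", "low", "medium"])
counts = bag_of_words(["bright spacious loft", "spacious studio"])
weights = tf_idf(counts)
parts = split_timestamp("2016-06-24 07:54:24")
```

With this weighting, a word that occurs in every document (here "spacious") receives weight zero, which is the sense in which TF-IDF normalises raw counts.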
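For the outlier detection mentioned under phase 2, a rolling median check might look like the sketch below. The window size, the threshold and the use of the median absolute deviation (MAD) as the spread measure are assumptions chosen for illustration, as are the sample prices.

```python
from statistics import median


def rolling_median_outliers(values, window=5, threshold=3.0):
    """Return indices whose value deviates strongly from the local median.

    For each position, the median and the median absolute deviation (MAD)
    are taken over a centred window; a point is flagged when its distance
    from the median exceeds threshold * MAD.
    """
    half = window // 2
    flagged = []
    for i, value in enumerate(values):
        neighbourhood = values[max(0, i - half): i + half + 1]
        centre = median(neighbourhood)
        mad = median(abs(v - centre) for v in neighbourhood)
        # With a MAD of zero the window is nearly constant, so any
        # deviation from the median is flagged.
        if abs(value - centre) > threshold * mad:
            flagged.append(i)
    return flagged


# Hypothetical monthly rents with one mistyped price.
prices = [2400, 2500, 2450, 2550, 250000, 2500, 2600, 2450, 2500]
outliers = rolling_median_outliers(prices)
```

Flagged indices can then be deleted or corrected, in line with the cleaning options listed for phase 2.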
Neural Networks
The second approach uses neural networks and deep learning to learn from the data. It spends a lot of time on fine-tuning the neural network and little time on generating features. This approach works well with unstructured datasets, such as those in problems involving images, video, speech and audio.
After implementing the machine learning methods, the competitor evaluates each model based on its Logarithmic Loss:
\[
-\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log p_{ij} \tag{1}
\]
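Equation (1) can be computed directly, reading N as the number of samples, M as the number of classes, y_ij as 1 when sample i belongs to class j (and 0 otherwise), and p_ij as the predicted probability of that class. The sketch below assumes one-hot encoded labels; the clipping constant `eps` and the example rows are assumptions added for illustration, not part of the equation.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Multi-class logarithmic loss as in equation (1).

    y_true: list of one-hot rows (y_ij is 1 for the true class, else 0).
    y_prob: list of predicted probability rows (p_ij), same shape.
    """
    n = len(y_true)
    total = 0.0
    for truth_row, prob_row in zip(y_true, y_prob):
        for y_ij, p_ij in zip(truth_row, prob_row):
            # Clip probabilities away from 0 and 1 so log() stays finite.
            p_ij = min(max(p_ij, eps), 1 - eps)
            total += y_ij * math.log(p_ij)
    return -total / n


# Hypothetical predictions for two listings over three classes.
y_true = [[1, 0, 0], [0, 0, 1]]
y_prob = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
loss = log_loss(y_true, y_prob)
```

A lower value is better: only the probability assigned to the true class contributes, so confident wrong predictions are penalised heavily.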