Evaluating the Predictors of Bike Share Rentals
BI TECH CP303 Data Mining – Project 1
May 4, 2015
ID # 1413082
Introduction:
Bike share programs in the US have grown rapidly over the past five years, offering an attractive
transportation option in urban areas. Capital Bikeshare in Washington, D.C. is one of the largest, with
over 3,000 bikes at 350 stations across the city. Similar programs exist in many other cities, including
Pronto Cycle Share in Seattle. With only 500 bikes and 50 stations, though, there is clear potential
for expanding that newer program. This study aims to identify the factors that predict successful bike
stations in order to inform where new stations should be installed.
Data were drawn from three sources: rental records from Capital Bikeshare, bike station data with
counts of nearby amenities and road features from OpenStreetMap, and weather data from UC Irvine.
Over 2.4 million rental events from 2012 were examined, recorded across 347 bike stations whose
average daily rentals range from fewer than 2 to nearly 200.
Methods:
The three data sources were combined to build a model that predicts the average number of rentals
per day from a host of inputs. Details about each station's surroundings and daily weather were
merged with the rental events, creating one large data set with 119 predictors. Station density, a
potentially key driver of success, was also added by using each station's latitude and longitude to
count the number of other stations within 0.75 km. The data set was then narrowed by combining similar
predictors into logical groups (food, nightlife, health services, tourism, etc.) and removing variables that
were empty, duplicative, ambiguous, or had too few observations.
Next, the data were preprocessed for modeling. Categorical variables were expanded into
true/false indicator (dummy) variables, and all predictors were centered and scaled. The data were split
into two sets: 75% to train the model and the remainder to test it. Linear regression was used as the
prediction method, and multiple models were compared using forward and backward selection, ridge, and lasso
regression in order to reduce model size while maintaining prediction accuracy (a brief sketch of the
preprocessing and split follows; the full script is in the Code section).
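Assuming a data frame data_to_model like the one prepared in the Code section, the indicator expansion and 75/25 split could look like the sketch below (centering and scaling are applied later inside train()):
library(caret)
set.seed(1234)
# expand categorical variables into 0/1 indicator columns (the full script uses model.matrix instead)
dummies = dummyVars(rentals ~ ., data = data_to_model)
predictors = as.data.frame(predict(dummies, newdata = data_to_model))
predictors$rentals = data_to_model$rentals
# hold out 25% of the rows for testing
in_train = createDataPartition(y = predictors$rentals, p = 0.75, list = FALSE)
train_set = predictors[in_train, ]
test_set = predictors[-in_train, ]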
Comparison of Model Reduction Techniques
Among the five techniques compared (forward selection, backward selection, two ridge variants, and the
lasso), backward selection was used to determine the final model because of its low variability and
comparable mean RMSE.
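As a sketch, assuming the results object and backward_model fit in the Code section, the cross-validated errors and the retained predictors can be inspected:
# mean RMSE and its spread across folds for each of the five methods
summary(results)
bwplot(results, metric = 'RMSE')
# predictors kept at the chosen subset size of the backward-selection model
coef(backward_model$finalModel, id = backward_model$bestTune$nvmax)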
Results:
Among the model variables that describe a station's surroundings, a few types stood out as better
predictors of rentals. The number of other nearby stations scored high in model importance, suggesting
that a denser bike share network encourages use. Stations surrounded by common destinations such as
food, nightlife, hotels, and tourism also did well, as did stations in areas friendlier to non-car
travelers, as indicated by street crossings, traffic signals, and bus stations and stops.
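The importance rankings referenced above can be computed with caret's varImp; a brief sketch, assuming backward_model from the Code section:
# rank predictors by importance and plot the strongest ones
importance = varImp(backward_model)
plot(importance, top = 20)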
Based on these findings, the recommendation is to concentrate new stations in a small geographic
region within the most urban areas before expanding outward. Having stations close together in denser
areas appears to encourage use.
Discussion:
While the model results point to several variables that help predict rentals, the model itself
cannot be used to evaluate the potential of new stations. One reason is that it includes rental
event information as a predictor of rentals, and rental event data would obviously not be available
when evaluating proposals for new stations. Instead, the model could be run against Pronto rental
history and Seattle station data to test whether the variables hold an importance similar to that seen
in the Washington, D.C. and Capital Bikeshare data sets.
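A sketch of that idea, where pronto_data is a hypothetical data frame of Seattle stations prepared with the same predictors used to fit limited_model in the Code section:
# pronto_data is hypothetical; it would need the same columns as the training data
pronto_predictions = predict(limited_model, newdata = pronto_data)
RMSE(pronto_predictions, pronto_data$rentals)    # prediction error on the new city
cor(pronto_predictions, pronto_data$rentals)^2   # share of variance explained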
Appendix:
Station rentals shown by Customer Type.
Adjusted R² Comparison of All Models in Backward Selection
Code
All data analysis was done using R 3.1.3. Libraries used include dplyr, ggplot2, lubridate, caret, and sp.
library(dplyr)
library(ggplot2)
library(lubridate)
library(caret)
set.seed(1234) # set a seed
setwd('H:/BI Tech/Data Mining/Project1/')
usage = read.delim('usage_2012.tsv', sep = '\t', header = TRUE)
weather = read.delim('daily_weather.tsv', sep = '\t', header = TRUE)
stations = read.delim('stations.tsv', sep = '\t', header = TRUE)
# Summarize and merge data
#head(usage)
custs_per_day =
  usage %>%
  group_by(time_start = as.Date(time_start), station_start, cust_type) %>%
  summarize(no_rentals = n(),
            duration_mins = mean(duration_mins, na.rm = TRUE))
#head(custs_per_day)
# make date formats consistent
custs_per_day$time_start = ymd(custs_per_day$time_start)
weather$date = ymd(weather$date)
weather_rentals = merge(custs_per_day, weather,
                        by.x = 'time_start', by.y = 'date')
# group_by all factors and summarize continuous variables to generate a data frame to merge with stations
model_data =
  weather_rentals %>%
  group_by(
    station_start,
    cust_type,
    weekday,
    season_code,
    is_holiday,
    is_work_day,
    weather_code) %>%
  summarize(
    rentals = mean(no_rentals),
    duration = mean(duration_mins),
    temp = mean(temp),
    subjective_temp = mean(subjective_temp),
    humidity = mean(humidity),
    windspeed = mean(windspeed))
#head(model_data)
# Calculate number of stations within .75km of each station
library(sp)
stations_raw = stations
get.dists <- function(i) {
  ref.pt <- with(stations[i, ], c(long, lat))
  points <- as.matrix(with(stations[-i, ], cbind(long, lat)))
  dists <- spDistsN1(points, ref.pt, longlat = TRUE)
  return(length(which(dists < 0.75)))  # number of other stations within 0.75 km
}
stations$other_stations <- sapply(1:nrow(stations), get.dists)
# merge with stations
final_data = merge(model_data, stations,
                   by.x = 'station_start',
                   by.y = 'station')
data = final_data
rm(final_data)
# remove variables from the data that won't be used for modeling, e.g. lat/long
data_to_model =
  data %>%
  select(-station_start, -id, -terminal_name, -lat, -long)
dim(data_to_model)
head(data_to_model)
model = lm(rentals ~ ., data = data_to_model)
summary(model)
# some features don't exist around any of our stations, e.g. 'turning_loop'
table(data_to_model$turning_loop)
# remove using 'colSums' and 'which' functions
colSums(data_to_model[ , 15:143])
columns_to_remove = names(which(colSums(data_to_model[ , 15:143]) == 0)) # 'which' columns have a sum of 0
data_to_model = data_to_model[ , !(names(data_to_model) %in% columns_to_remove)]
# try model again
model = lm(rentals ~ ., data = data_to_model)
summary(model)
# some columns have NAs
table(data_to_model$vending_machine)
table(data_to_model$storage)
table(data_to_model$dojo)
table(data_to_model$tax_service)
table(data_to_model$telephone)
# Remove features with not enough observations
data_to_model =
  data_to_model %>%
  select(
    -vending_machine,
    -storage,
    -dojo,
    -tax_service,
    -telephone)
# try model again
model = lm(rentals ~ ., data = data_to_model)
summary(model)
# Convert categorical variables into factors
data_to_model$weekday = factor(data_to_model$weekday,
                               labels = 0:6,
                               levels = 0:6)
data_to_model$season_code = factor(data_to_model$season_code)
data_to_model$is_holiday = factor(data_to_model$is_holiday)
data_to_model$is_work_day = factor(data_to_model$is_work_day)
data_to_model$weather_code = factor(data_to_model$weather_code)
# try model again
model = lm(rentals ~ ., data = data_to_model)
summary(model)
# Remove is_work_day to maintain covariate independence
data_to_model$is_work_day = NULL
# try model again
model = lm(rentals ~ ., data = data_to_model)
summary(model)
write.table(data_to_model, 'Model_Data.tsv', sep = '\t', row.names = FALSE)
data = data_to_model
##############################################
# make dummy/indicator variables
data$weekday = factor(data$weekday)
data$season_code = factor(data$season_code)
data$weather_code = factor(data$weather_code)
no_factors = as.data.frame(model.matrix(rentals ~ .-1, data = data))
no_factors$rentals = data$rentals
# combine variables to reduce model size
no_factors$food = no_factors$fast_food + no_factors$restaurant + no_factors$cafe + no_factors$food_court +
  no_factors$food_cart + no_factors$bar.restaurant
no_factors$government = no_factors$embassy + no_factors$government + no_factors$townhall
no_factors$health_services = no_factors$doctors + no_factors$dentist + no_factors$clinic + no_factors$pharmacy +
  no_factors$hospital
no_factors$kindergarten = no_factors$internal_kindergarten + no_factors$kindergarten
no_factors$nightlife = no_factors$bar + no_factors$club + no_factors$pub + no_factors$nightclub + no_factors$bar.restaurant
no_factors$school = no_factors$school + no_factors$school..historic.
no_factors$tourism = no_factors$tourist + no_factors$artwork + no_factors$information + no_factors$museum +
  no_factors$sculpture + no_factors$tour_guide + no_factors$attraction + no_factors$landmark + no_factors$gallery +
  no_factors$arts_centre
no_factors$stripclub = no_factors$stripclub + no_factors$strip_club
no_factors =
  no_factors %>%
  select(-fast_food, -restaurant, -cafe, -food_court, -food_cart, -embassy, -townhall, -doctors, -dentist, -clinic,
         -pharmacy, -hospital, -internal_kindergarten, -bar, -club, -pub, -nightclub, -bar.restaurant, -school..historic.,
         -tourist, -artwork, -information, -museum, -sculpture, -tour_guide, -attraction, -landmark, -gallery,
         -parking_entrance, -parking_exit, -strip_club, -arts_centre, -bureau_de_change, -marker)
# Start Modeling
in_train = createDataPartition(y = no_factors$rentals,
                               p = 0.75,
                               list = FALSE)
train = no_factors[in_train, ]   # rows in the data that are part of the training set
test = no_factors[-in_train, ]   # rows that are held out for testing
# subset selection
forward_model = train(rentals ~ .,
                      data = na.omit(train),
                      method = 'leapForward',
                      preProcess = c('center', 'scale'),
                      # try models of size 1 - 99
                      tuneGrid = expand.grid(nvmax = 1:99),
                      trControl = trainControl(method = 'cv', number = 5))
backward_model = train(rentals ~ .,
                       data = na.omit(train),
                       method = 'leapBackward',
                       preProcess = c('center', 'scale'),
                       tuneGrid = expand.grid(nvmax = 1:99),
                       trControl = trainControl(method = 'cv', number = 5))
ridge_model = train(rentals ~ .,
                    data = na.omit(train),
                    method = 'ridge',
                    preProcess = c('center', 'scale'),
                    tuneLength = 10,
                    #tuneGrid = expand.grid(nvmax = 1:23),
                    trControl = trainControl(method = 'cv', number = 5))
ridge_model2 = train(rentals ~ .,
                     data = train,
                     method = 'foba',
                     preProcess = c('center', 'scale'),
                     tuneLength = 10,
                     trControl = trainControl(method = 'cv', number = 5))
lasso_model = train(rentals ~ .,
                    data = na.omit(train),
                    method = 'lasso',
                    preProcess = c('center', 'scale'),
                    tuneLength = 10,
                    trControl = trainControl(method = 'cv', number = 5))
#plot(backward_model)
#plot(varImp(backward_model))
# Compare models
results = resamples(list(forward_selection = forward_model,
                         backward_selection = backward_model,
                         ridge_regression = ridge_model,
                         ridge_regression2 = ridge_model2,
                         lasso_regression = lasso_model))
summary(results)
# limited model
train2 = train %>%
  select(rentals, cust_typeRegistered, cust_typeCasual, duration, hotel, post_box, crossing, food,
         other_stations, no_bikes, traffic_signals, nightlife, no_empty_docks, tourism, government, place_of_worship,
         bus_station, health_services, bus_stop, bank, atm, drinking_water, veterinary, public_building, turning_circle,
         post_office, theatre, school)
test2 = test %>%
  select(rentals, cust_typeRegistered, cust_typeCasual, duration, hotel, post_box, crossing, food,
         other_stations, no_bikes, traffic_signals, nightlife, no_empty_docks, tourism, government, place_of_worship,
         bus_station, health_services, bus_stop, bank, atm, drinking_water, veterinary, public_building, turning_circle,
         post_office, theatre, school)
limited_model = train(rentals ~ .,
                      data = na.omit(train2),
                      method = 'leapBackward',
                      preProcess = c('center', 'scale'),
                      tuneGrid = expand.grid(nvmax = 1:27),
                      trControl = trainControl(method = 'cv', number = 5))
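A possible final check (a sketch, assuming limited_model and test2 as defined above) is to score the held-out test set:
# predict on the 25% of rows held out from training and compute test-set accuracy
test_complete = na.omit(test2)
test_predictions = predict(limited_model, newdata = test_complete)
postResample(pred = test_predictions, obs = test_complete$rentals)  # RMSE and R-squared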