Evaluating the Predictors of Bike Share Rentals
BI TECH CP303 Data Mining – Project 1
May 4, 2015
ID # 1413082

Introduction:
Bike share programs in the US have grown rapidly in the past five years, offering an attractive transportation option in urban areas. Capital Bikeshare in Washington, D.C. is one of the largest, with over 3,000 bikes at 350 stations across the city. Similar programs exist in many other cities, including Pronto, Seattle's cycle share. With only 500 bikes and 50 stations, though, there is clearly potential for expanding that newer program. This study aims to understand what factors predict successful bike stations in order to inform where new stations should be installed.

Data was taken from three sources: rental data from Capital Bikeshare, bike station data with counts of nearby amenities and road features from OpenStreetMap, and weather data from UC Irvine. Over 2.4 million bike rental events occurring throughout 2012 were examined. These were recorded against 347 bike stations, which range in average daily rentals from less than 2 to nearly 200.

Methods:
The three data sources were combined and used to create a model that predicts the average number of rentals per day given a host of inputs. Details about each station's surroundings and weather data were merged with the rental events, creating one large data set with 119 predictors. Station density, a potentially key factor of success, was also added by using each station's latitude and longitude to calculate the number of other stations within 0.75 km. The data set was then narrowed by combining similar predictors into logical groups (food, nightlife, health services, tourism, etc.) and removing variables that were empty, duplicative, ambiguous, or had too few observations.

Next, the data was preprocessed for modeling. Categorical variables were expanded into multiple true/false indicators, and all predictors were centered and scaled. The data was split into two sets: 75% to train the model and the remainder to test. Linear regression was used as the prediction method. Multiple models were compared using forward and backward subset selection, ridge, and lasso regression techniques in order to reduce model size while maintaining prediction accuracy.

Figure: Comparison of Model Reduction Techniques

Among the five candidate models, the backward selection method was chosen for the final model due to its low variability and comparable mean RMSE.

Results:
Among the model variables that describe a station's surroundings, a few types stood out as better predictors of rentals. The number of other nearby stations scored high in model importance, suggesting that a denser bike share network encourages use. Stations also did well when surrounded by common destinations such as food, nightlife, hotels, and tourism, and when located in areas friendlier to non-car travel, as indicated by street crossings, traffic signals, and bus stations and stops.

Based on these findings, the recommendation is to concentrate new stations in a small geographic region in the most urban areas before expanding outward. Having stations close together and in denser areas appears to encourage use.
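As an illustrative sketch only (not part of the original analysis), the importance ranking behind these observations can be inspected from the fitted backward-selection model, using caret's generic variable-importance scoring and assuming the backward_model object defined in the Code appendix:

library(caret)
importance = varImp(backward_model)  # generic importance ranking from caret
plot(importance, top = 20)           # top-ranked predictors discussed above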
Discussion:
While the model results point to several variables that help predict rentals, the model itself cannot be used to evaluate the potential of new stations. One reason is that the model includes rental event information as a predictor of rentals, and that data would obviously not be available when evaluating proposals for new stations.

Instead, the model could be run using Pronto rental history and Seattle data to test whether the variables hold similar importance as in the Washington, D.C. and Capital Bikeshare data sets.

Appendix:
Figure: Station rentals shown by customer type.
Figure: Adjusted R² comparison of all models in backward selection.

Code
All data analysis was done using R 3.1.3. Libraries used include dplyr, ggplot2, lubridate, and caret.

library(dplyr)
library(ggplot2)
library(lubridate)
library(caret)

set.seed(1234)  # set a seed

setwd('H:/BI Tech/Data Mining/Project1/')

usage = read.delim('usage_2012.tsv', sep = '\t', header = TRUE)
weather = read.delim('daily_weather.tsv', sep = '\t', header = TRUE)
stations = read.delim('stations.tsv', sep = '\t', header = TRUE)

# Summarize and merge data
#head(usage)
custs_per_day = usage %>%
  group_by(time_start = as.Date(time_start), station_start, cust_type) %>%
  summarize(no_rentals = n(),
            duration_mins = mean(duration_mins, na.rm = TRUE))
#head(custs_per_day)

# make date formats consistent
custs_per_day$time_start = ymd(custs_per_day$time_start)
weather$date = ymd(weather$date)

weather_rentals = merge(custs_per_day, weather, by.x = 'time_start', by.y = 'date')

# group_by all factors and summarize continuous variables to generate
# a data frame to merge with the station data
model_data = weather_rentals %>%
  group_by(station_start, cust_type, weekday, season_code,
           is_holiday, is_work_day, weather_code) %>%
  summarize(rentals = mean(no_rentals),
            duration = mean(duration_mins),
            temp = mean(temp),
            subjective_temp = mean(subjective_temp),
            humidity = mean(humidity),
            windspeed = mean(windspeed))
#head(model_data)

# Calculate number of stations within .75km of each station
library(sp)
stations_raw = stations
get.dists <- function(i) {
  ref.pt <- with(stations[i, ], c(long, lat))
  points <- as.matrix(with(stations[-i, ], cbind(long, lat)))
  dists <- spDistsN1(points, ref.pt, longlat = T)
  return(length(which(dists < 0.75)))  # within .75km
}
stations$other_stations <- sapply(1:nrow(stations), get.dists)

# merge with stations
final_data = merge(model_data, stations, by.x = 'station_start', by.y = 'station')
data = final_data
rm(final_data)

# remove variables from the data that won't be used for modeling, e.g. lat/long
data_to_model = data %>%
  select(-station_start, -id, -terminal_name, -lat, -long)

dim(data_to_model)
head(data_to_model)

model = lm(rentals ~ ., data = data_to_model)
summary(model)
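# (Illustrative sketch, not part of the original analysis.) Predictors the full fit
# cannot estimate (e.g., all-zero columns) come back with NA coefficients, so listing
# them is another way to flag candidates for removal before the cleanup steps below.
names(which(is.na(coef(model))))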
# some features don't exist around any of our stations, e.g. 'turning_loop'
table(data_to_model$turning_loop)

# remove using 'colSums' and 'which' functions
colSums(data_to_model[ , 15:143])
columns_to_remove = names(which(colSums(data_to_model[ , 15:143]) == 0))  # 'which' columns have a sum of 0
data_to_model = data_to_model[ , !(names(data_to_model) %in% columns_to_remove)]

# try model again
model = lm(rentals ~ ., data = data_to_model)
summary(model)

# some columns have NAs
table(data_to_model$vending_machine)
table(data_to_model$storage)
table(data_to_model$dojo)
table(data_to_model$tax_service)
table(data_to_model$telephone)

# Remove features with not enough observations
data_to_model = data_to_model %>%
  select(-vending_machine, -storage, -dojo, -tax_service, -telephone)

# try model again
model = lm(rentals ~ ., data = data_to_model)
summary(model)

# Convert categorical variables into factors
data_to_model$weekday = factor(data_to_model$weekday, labels = 0:6, levels = 0:6)
data_to_model$season_code = factor(data_to_model$season_code)
data_to_model$is_holiday = factor(data_to_model$is_holiday)
data_to_model$is_work_day = factor(data_to_model$is_work_day)
data_to_model$weather_code = factor(data_to_model$weather_code)

# try model again
model = lm(rentals ~ ., data = data_to_model)
summary(model)

# Remove is_work_day to maintain covariate independence
data_to_model$is_work_day = NULL

# try model again
model = lm(rentals ~ ., data = data_to_model)
summary(model)

write.table(data_to_model, 'Model_Data.tsv', sep = '\t', row.names = FALSE)

data = data_to_model

##############################################
# make dummy/indicator variables
data$weekday = factor(data$weekday)
data$season_code = factor(data$season_code)
data$weather_code = factor(data$weather_code)

no_factors = as.data.frame(model.matrix(rentals ~ . - 1, data = data))
no_factors$rentals = data$rentals
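# (Illustrative sketch, not part of the original analysis.) model.matrix() is what
# expands each factor into one indicator column per level; for example, the
# seven-level weekday factor becomes seven 0/1 columns:
head(model.matrix(~ weekday - 1, data = data))
dim(no_factors)  # confirm the expanded predictor count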
# combine variables to reduce model size
no_factors$food = no_factors$fast_food + no_factors$restaurant + no_factors$cafe +
  no_factors$food_court + no_factors$food_cart + no_factors$bar.restaurant
no_factors$government = no_factors$embassy + no_factors$government + no_factors$townhall
no_factors$health_services = no_factors$doctors + no_factors$dentist + no_factors$clinic +
  no_factors$pharmacy + no_factors$hospital
no_factors$kindergarten = no_factors$internal_kindergarten + no_factors$kindergarten
no_factors$nightlife = no_factors$bar + no_factors$club + no_factors$pub +
  no_factors$nightclub + no_factors$bar.restaurant
no_factors$school = no_factors$school + no_factors$school..historic.
no_factors$tourism = no_factors$tourist + no_factors$artwork + no_factors$information +
  no_factors$museum + no_factors$sculpture + no_factors$tour_guide + no_factors$attraction +
  no_factors$landmark + no_factors$gallery + no_factors$arts_centre
no_factors$stripclub = no_factors$stripclub + no_factors$strip_club

# drop the component columns now that they are rolled into the groups above
no_factors = no_factors %>%
  select(-fast_food, -restaurant, -cafe, -food_court, -food_cart, -embassy, -townhall,
         -doctors, -dentist, -clinic, -pharmacy, -hospital, -internal_kindergarten,
         -bar, -club, -pub, -nightclub, -bar.restaurant, -school..historic., -tourist,
         -artwork, -information, -museum, -sculpture, -tour_guide, -attraction,
         -landmark, -gallery, -parking_entrance, -parking_exit, -strip_club,
         -arts_centre, -bureau_de_change, -marker)

# Start Modeling
in_train = createDataPartition(y = no_factors$rentals, p = 0.75, list = FALSE)
train = no_factors[in_train, ]   # rows in data that are part of train
test = no_factors[-in_train, ]   # rows in data that are not part of train

# subset selection
forward_model = train(rentals ~ ., data = na.omit(train),
                      method = 'leapForward',
                      preProcess = c('center', 'scale'),
                      tuneGrid = expand.grid(nvmax = 1:99),  # try models of size 1 - 99
                      trControl = trainControl(method = 'cv', number = 5))

backward_model = train(rentals ~ ., data = na.omit(train),
                       method = 'leapBackward',
                       preProcess = c('center', 'scale'),
                       tuneGrid = expand.grid(nvmax = 1:99),
                       trControl = trainControl(method = 'cv', number = 5))

ridge_model = train(rentals ~ ., data = na.omit(train),
                    method = 'ridge',
                    preProcess = c('center', 'scale'),
                    tuneLength = 10,
                    #tuneGrid = expand.grid(nvmax = 1:23),
                    trControl = trainControl(method = 'cv', number = 5))

ridge_model2 = train(rentals ~ ., data = train,
                     method = 'foba',
                     preProcess = c('center', 'scale'),
                     tuneLength = 10,
                     trControl = trainControl(method = 'cv', number = 5))

lasso_model = train(rentals ~ ., data = na.omit(train),
                    method = 'lasso',
                    preProcess = c('center', 'scale'),
                    tuneLength = 10,
                    trControl = trainControl(method = 'cv', number = 5))

#plot(backward_model)
#plot(varImp(backward_model))

# Compare models
results = resamples(list(forward_selection = forward_model,
                         backward_selection = backward_model,
                         ridge_regression = ridge_model,
                         ridge_regression2 = ridge_model2,
                         lasso_regression = lasso_model))
summary(results)

# limited model
train2 = train %>%
  select(rentals, cust_typeRegistered, cust_typeCasual, duration, hotel, post_box,
         crossing, food, other_stations, no_bikes, traffic_signals, nightlife,
         no_empty_docks, tourism, government, place_of_worship, bus_station,
         health_services, bus_stop, bank, atm, drinking_water, veterinary,
         public_building, turning_circle, post_office, theatre, school)
test2 = test %>%
  select(rentals, cust_typeRegistered, cust_typeCasual, duration, hotel, post_box,
         crossing, food, other_stations, no_bikes, traffic_signals, nightlife,
         no_empty_docks, tourism, government, place_of_worship, bus_station,
         health_services, bus_stop, bank, atm, drinking_water, veterinary,
         public_building, turning_circle, post_office, theatre, school)

limited_model = train(rentals ~ ., data = na.omit(train2),
                      method = 'leapBackward',
                      preProcess = c('center', 'scale'),
                      tuneGrid = expand.grid(nvmax = 1:27),
                      trControl = trainControl(method = 'cv', number = 5))
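# (Illustrative sketch, not part of the original analysis.) One way to check the final
# limited model against the 25% hold-out created above: predict on test2 and compute
# RMSE with caret's postResample(). Assumes the limited_model and test2 objects above.
test2_complete = na.omit(test2)
test_predictions = predict(limited_model, newdata = test2_complete)
postResample(pred = test_predictions, obs = test2_complete$rentals)  # RMSE and R-squared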