E6893 Big Data Analytics:
Financial Market Volatility
Final Project Presentation
Jimmy Zhong, Tim Wu, Oliver Zhou, John Terzis
December 22, 2014
Feature Selection/Extraction using Hadoop
• A MapReduce programming model is used to generate the feature matrix from raw price data across hundreds of symbols.
• Raw price data is first merged on timestamp with a fixed set of user-determined features for each symbol.
• Feature extraction is done in the reducer by computing forward- and backward-looking volatility values for each timestamp of each symbol (a minimal reducer sketch follows below).
• The resulting feature matrix contains over 300 columns, up from a starting point of 12.
• The feature matrix can be further transformed with a script that performs time-series clustering on intra-day price activity.
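The reducer step can be sketched as a Hadoop Streaming job in Python. This is a minimal illustration, not the project's actual code: the tab-separated mapper output, the WINDOW size, and the use of sample standard deviation as the volatility measure are all assumptions.

```python
#!/usr/bin/env python
# reducer.py -- minimal Hadoop Streaming sketch of the feature-extraction
# reducer; input format, window size, and volatility measure are assumptions.
import sys
from collections import defaultdict


def stdev(xs):
    # Sample standard deviation as a simple volatility proxy.
    if len(xs) < 2:
        return 0.0
    m = sum(xs) / len(xs)
    return (sum((x - m) ** 2 for x in xs) / (len(xs) - 1)) ** 0.5


# The mapper is assumed to emit "symbol \t timestamp,price".
prices = defaultdict(list)
for line in sys.stdin:
    symbol, payload = line.rstrip("\n").split("\t")
    ts, price = payload.split(",")
    prices[symbol].append((int(ts), float(price)))

WINDOW = 60  # ticks per volatility window (illustrative)

for symbol, series in sorted(prices.items()):
    series.sort()
    vals = [p for _, p in series]
    for i, (ts, _) in enumerate(series):
        back_vol = stdev(vals[max(0, i - WINDOW):i + 1])  # backward-looking
        fwd_vol = stdev(vals[i:i + WINDOW])               # forward-looking
        print("%s\t%d\t%.6f\t%.6f" % (symbol, ts, back_vol, fwd_vol))
```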
Supervised Learning on Spark using MLlib
• Spark was installed and pyspark was used to perform cross-validated ridge regression with stochastic gradient descent, with the goal of producing a regressor that can predict volatility over some forward-looking interval (60 minutes, 1 day, 10 days, etc.) for a given symbol.
• A combination of MLlib and scikit-learn was used, since MLlib did not yet have Python bindings for cross-validated splitting of the dataset.
• Spark was run on data held in HDFS (a minimal pyspark sketch follows below).
• Results were tested on a hold-out sample, and R^2 was calculated to show how much variance could be explained by the regressor.
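A minimal pyspark sketch of this workflow against the Spark 1.x MLlib API, assuming a comma-separated feature matrix on HDFS with the target volatility in the first column; the path, hyperparameters, and the single hold-out split (standing in for the scikit-learn cross-validation splitting mentioned above) are illustrative assumptions.

```python
from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint, RidgeRegressionWithSGD

sc = SparkContext(appName="VolatilityRidge")

def parse_line(line):
    # Assumed layout: forward volatility target first, features after.
    values = [float(x) for x in line.split(",")]
    return LabeledPoint(values[0], values[1:])

data = sc.textFile("hdfs:///user/e6893/feature_matrix.csv").map(parse_line)

# Hold out 20% of rows for final evaluation.
train, holdout = data.randomSplit([0.8, 0.2], seed=42)

model = RidgeRegressionWithSGD.train(train, iterations=200,
                                     step=0.01, regParam=0.1)

# R^2 on the hold-out sample: 1 - SS_res / SS_tot.
pairs = holdout.map(lambda p: (p.label, model.predict(p.features))).cache()
mean_y = pairs.map(lambda t: t[0]).mean()
ss_res = pairs.map(lambda t: (t[0] - t[1]) ** 2).sum()
ss_tot = pairs.map(lambda t: (t[0] - mean_y) ** 2).sum()
print("Hold-out R^2: %.4f" % (1.0 - ss_res / ss_tot))
```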
Time-Series Analysis: Forecasting Multiple Steps Ahead with a GARCH Model and Calculating VaR
Motivation: Real-world financial time series exhibit a property called volatility clustering: periods of relative calm are interrupted by bursts of volatility. An extreme market movement can pose a significant downside risk to an investor's security portfolio. Using the RHadoop ecosystem to forecast future volatility and calculate Value at Risk (VaR) can help investors prepare for losses arising from natural or man-made catastrophes, even of a magnitude not experienced before.
Algorithm (a Python sketch of the modeling steps follows below):
1. Used Pig and Python scripts to pre-process the raw data (AAPL), then loaded it into RStudio.
2. Applied R code (TimeSeriesAnalysis.R) and calculated the return in percent.
3. Applied GARCH modeling to forecast future volatility and calculate VaR.
4. Applied Extreme Value Theory (EVT) to fit a GPD distribution to the tails.
Result:
1. Forecast the volatility and the Value at Risk (VaR) at the 99% confidence level (the loss is expected to be exceeded only 1% of the time). In this example, AAPL (2008–2009), we calculated that with 99% probability the monthly return is above −4%.
2. Used a statistical hypothesis test (Ljung-Box) for autocorrelation in the squared returns (p-value ≈ 0, so the null hypothesis of no autocorrelation in the squared returns is rejected at the 1% significance level); hence a GARCH model should be employed in modeling the return time series.
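The modeling itself was done in R; below is a hedged Python sketch of the GARCH forecast, the 99% VaR, and the Ljung-Box check using the arch and statsmodels packages. The file name, the price column name, the GARCH(1,1) order, and the normal-quantile VaR formula are assumptions, and the EVT/GPD tail fit is omitted.

```python
# Python sketch of the GARCH/VaR steps (the project used R); file name,
# column name, GARCH(1,1) order, and normal-quantile VaR are assumptions.
import numpy as np
import pandas as pd
from arch import arch_model
from scipy.stats import norm
from statsmodels.stats.diagnostic import acorr_ljungbox

prices = pd.read_csv("aapl_2008_2009.csv", index_col=0, parse_dates=True)
returns = 100 * prices["close"].pct_change().dropna()  # returns in percent

# Ljung-Box on squared returns: a p-value near 0 rejects the null of no
# autocorrelation and motivates a GARCH model (Result 2 above).
lb = acorr_ljungbox(returns ** 2, lags=[10], return_df=True)
print("Ljung-Box p-value: %g" % lb["lb_pvalue"].iloc[0])

# Fit GARCH(1,1) and forecast volatility 10 steps ahead.
res = arch_model(returns, vol="Garch", p=1, q=1).fit(disp="off")
fc = res.forecast(horizon=10)
sigma = np.sqrt(fc.variance.iloc[-1])  # forecast std. dev. for h = 1..10

# 99% one-step VaR under normality: mu + z_{0.01} * sigma.
var_99 = fc.mean.iloc[-1, 0] + norm.ppf(0.01) * sigma.iloc[0]
print("99%% one-step VaR (%% return): %.2f" % var_99)
```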
[Figure slides: multi-step-ahead GARCH volatility forecast and VaR plots; tail of the AAPL % return data; quantile-quantile plot]
K-Means Clustering
• The goal is to relate different time intervals to stock volatility through clustering.
• Symbols: AIG, AMZN, PEP
• Vector Dimensions: Normalized Volume,
Symbol Volatility +1 Day, VIX Volatility +1
Day, Time Interval
• Time Intervals: Period of Day, Day of Week,
Fiscal Quarter, Year
• K-means clustering in R and Hadoop with cluster sizes of 3–4 (a clustering sketch in Python follows below).
• A Euclidean distance measure was used since all features are real-valued.
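The project ran k-means in R and Hadoop; the sketch below shows an equivalent setup in Python with scikit-learn, with the input file and column names as illustrative assumptions.

```python
# Python sketch of the clustering step (the project used R and Hadoop);
# the input file and column names are illustrative assumptions.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("symbol_intervals.csv")  # hypothetical per-symbol features
cols = ["norm_volume", "symbol_vol_1d", "vix_vol_1d", "time_interval"]

# Standardize so Euclidean distance weights each dimension equally.
X = StandardScaler().fit_transform(df[cols])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
df["cluster"] = km.labels_

# Inspect per-cluster feature means to look for time-interval structure.
print(df.groupby("cluster")[cols].mean())
```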
Cluster Results
• No strong correlation between time intervals and symbol volatility across all three sectors.
• No strong correlation between VIX volatility
and symbol volatility.
• There is a significant relationship between
volume and symbol volatility.
Logistic Regression
• The goal is to use a classification model to separate variables during feature selection and identify which ones provide the best predictive power.
• Stock Symbols Tested: AIG, AMZN, PEP
• Parameters in Dataset:
• Normalized Volume, Symbol Volatility +1 Day, VIX Volatility +1 Day, Time Interval
• The target was predicting when symbol volatility would rise above 0.25, which historically is a rough cutoff for regime changes from low- to high-volatility market cycles (a classification sketch follows below).
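A hedged scikit-learn sketch of this setup; the input file, column names, and 70/30 split are assumptions. The forward-looking symbol volatility supplies the label, so it is excluded from the feature columns to avoid leakage.

```python
# Sketch of the classification setup; file and column names are assumptions.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("symbol_features.csv")  # hypothetical feature file

# Label: does next-day symbol volatility exceed the 0.25 regime cutoff?
y = (df["symbol_vol_1d"] > 0.25).astype(int)
X = df[["norm_volume", "vix_vol_1d", "time_interval"]]  # label column excluded

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)

# AUC on held-out data: 1.0 = perfect ranking, 0.5 = random.
probs = clf.predict_proba(X_te)[:, 1]
print("AUC: %.3f" % roc_auc_score(y_te, probs))
```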
Logistic Regression Results
• Measured by AUC (area under the ROC curve)
• An AUC of 1 indicates a perfect classifier and 0.5 indicates a completely random one, with 0 indicating perfectly inverted predictions
• Little to no relationship between time intervals and symbol volatility, though this may be skewed by market crashes
• VIX volatility and symbol volatility appear to be almost randomly related
• There is a significant relationship between volume and symbol volatility.
Questions?