Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
CS 179 Database Project Instructor: Dr Eamonn Keogh Computer Science & Engineering Department 318 EBII University of California - Riverside Riverside, CA 92521 [email protected] Class web page www.cs.ucr.edu/~eamonn/cs179 Administration I Class Meeting Times Class Activities: Discussion M 03:10 p.m. - 04:00 p.m. SPR 2339 LAB F 02:10 p.m. - 05:00 p.m. ENGR2 129 (first 15 minutes rule) We will not meet every week. You are obliged to view the class web page every Monday morning to check for announcements. You are 100% responsible for any announcements/changes I might post to the web page. Administration II Presentation of Final Project: You will need to give a short group presentation in the last two weeks (details later). You must show up to one of these final presentation sessions, or take a failing grade (exception, you work out an alternate plan with me by the end of week 4). Note that the sessions may go very late!! You must be prepared to stay for the entire sessions. Sign ups for time slots will be made available on a first-come first-served basis later in the quarter. Administration III Groups: Groups may be of size 2 or 3. Only one person who did not get an A or B in CS 166 may be in a group. (I may make exceptions if the numbers require it). If you need to be a “group” of one, talk to me after class. You should take your responsibility to your group seriously. In most case I expect that everyone in the group will get the same grade, but I reserve the right to give different grades where warranted. Administration IIII Grading : • Project binder: 90% • Presentation (including demonstration of project): 10% Your project binder (exhaustive details in class handouts) is a document in which you prove to me (or any reader) that you solved the problem given to you using a good design process. It must be in the format explicitly stated in the handouts. Your presentation is your chance to review and highlight the quality of your work. Administration V Office Hours: I am normally in my office 6-7 days a week. You may visit me any time. If you wish to be 100% certain I am there you may make an appointment by email with at least 24 hours notice. (Note that if you make an appointment, and then fail to keep it or show up late, the grade for your entire group will suffer). If you email me, you must include “CS179” in the subject heading and note your group name (i.e. CS179-smith-jones-zoe) in the body. Administration VI Important: If a member of your group commits an act of academic dishonesty, all members of the group will receive a failing grade! Don’t know the exact definition of academic dishonesty? It is your job to find out! (This is true in general, not just for this class). http://www.cs.ucr.edu/content/students/index.php?choice=academdis http://cnas.ucr.edu/~cnas/student/dishonesty.pdf There are certain rules which must be followed in this class, they are made clear on the handouts, follow them or get a written exception from me. If you write In order to handle spatial data efficiently, as required in computer aided design, we decided to use an R-tree. We implemented it... Everyone in your group gets a failing grade. Instead you should write It was noted by Guttman [12] that “In order to handle spatial data efficiently, as required in computer aided… XXX 2004: “Similarity matching is useful in two aspects. First, it is a subroutine of many data mining tasks, such as classification, clustering, rule discovery, outlier detection, and query by contents. Second, it is important in its own right for exploratory data analysis.” It is possible to convert the subsequence matching problem into whole matching, by placing a sliding window of size …. A time series of length N is by definition a sequence of real numbers, and therefore can be considered as a point in N-dimensional space. This immediately suggests that … R-tree …. Since a time series may contain thousands of points,.. This phenomenon is known as the dimensionality curse problem, and in order to utilize the powers of SAMs we need to first perform dimensionality reduction. .. three steps: Establish a distance measure Disttrue for the raw data series. In this thesis, we focus on Euclidean distance Disttrue. Produce a feature extraction function F that reduces the dimensionality of the data from the original length N to n that can be handled by an appropriate index structure. Establish a distance measure Distfeature in the feature space (of n dimensions). The first dimensionality reduction technique proposed for indexing time series in the literature is to use the Discrete Fourier Transform. The basic idea is that any realistic signal can be characterized by the superposition of a finite number of sine/cosine waves, each of which is represented by a single complex number known as a Fourier coefficient. … and many Fourier coefficients have a very low amplitude and therefore can be discarded without much loss of information…. Keogh 2000: “Similarity search is useful in its own right as a tool for exploratory data analysis, and it is also an important subroutine of many data mining applications such as clustering , classification and mining of association rules. ”Keogh 2000: “it is possible to convert subsequence matching to whole matching by sliding a "window" of length n….” “A time series C = {c1…cn} with n datapoints can be considered as a point in ndimensional space. This immediately suggests that time series could be indexed by multidimensional index structure such as the R-tree and its many variant. Since realisticqueries typically contain 20 to 1,000 datapoints (i.e. n varies from 20 to 1000) and most multidimensional index structures have poor performance at dimensionalities greater than 8-12 [12], we need to first perform dimensionality reduction. following three steps. Establish a distance metric from a domain expert (in this case Euclidean distance). Produce a dimensionality reduction technique that reduces the dimensionality of the data from n to N, where N can be efficiently handled by your favorite index structure. Produce a distance measure defined ... The first technique suggested for dimensionality reduction of time series was the Discrete Fourier Transform (DFT) [1]. The basic idea of spectral decomposition is that any signal, no matter how complex, can be represented by the superposition of a finite number of sine (and/or cosine) waves, where each wave represented by a single complex number known as a Fourier coefficient [29]. ….. . many of the Fourier coefficients have very low amplitude and thus contribute little to reconstructed signal. These low amplitude coefficients can be discarded without much loss of information… So, what are spatial queries? Databases are applications which store data in a format which supports querying. Imagine we have a database of restaurants in California. The database should probably be able to support queries like… • Return a list of all vegetarian restaurants. • Return the phone number of Marios Pizza on 123 Spruce st. • Return the restaurants that have a 4-star or higher rating. However there are many reasonable queries that most of-theshelf database systems do not support…. • Return a list of all restaurants with 5 miles of my house. • Return (in order of distance) the 3 pizza restaurants nearest to UCR. Nearest neighbor query Range query Your project is to build a database that supports spatial queries, as well as classic database queries. Although you could do this from scratch, I highly recommend that you do this by building some code that sits on top of an off-the-shelf database (ie Microsoft Access, Oracle, FoxPro, PostgreSQL). I also highly recommend that you do this by implementing an R-tree. In some sense the sentence above, “Your project is to build a database that…”, is misleading. I won’t be grading the quality of your database directly. Your project is really to demonstrate your ability to design medium to large scale software. User Interface Spatial Search Engine (probably R-Tree) Classic Database Name Marios Pizza ID Type Phone Location Grade 888-1212 ITA 1 244, 365 D Joes Bugers 2 US 848-1298 34, 764 A Jo’s Mexican 3 MEX 878-1333 123, 32 A Sues Pasta 4 ITA 878-1342 876, 65 B ce Enter an address and we will find the location of the nearest Californian university obably R-Tree) 221 Baker Street, Riverside Exclude Religious Schools Exclude Cal States Location UCV 244, 365 Q 34, 764 S 123, 32 G 876, 65 W The nearest university is CSUSB. Click here for admissions information ce Click on the map and we will find the location of the nearest Californian university obably R-Tree) Exclude Religious Schools Exclude Cal States Location UCV 244, 365 Q 34, 764 S 123, 32 G 876, 65 W The nearest university is CSUSB. Click here for admissions information ce Choose a location and we will find the location of the nearest Californian university obably R-Tree) Location UCV 244, 365 Q 34, 764 S 123, 32 G 876, 65 W LAX Golden Gate Bridge Balboa Park, SD Ontario Mills Exclude Religious Schools Exclude Cal States The nearest university is CSUSB. Click here for admissions information ce obably R-Tree) The GPS unit tells me you are in UCR, Riverside California. Do you want to know the location of the nearest University? Exclude Religious Schools Exclude Cal States Location UCV 244, 365 Q 34, 764 S 123, 32 G 876, 65 W The nearest university is CSUSB. Click here for admissions information To begin, you must come up with an application area which has a spatial element (I.e Restaurants in Orange County, California brown bear sightings, Locations of car crashes in Riverside). You must write a two page description of the problem, in the first person. The project description should begin by informally explaining the domain from the customer’s perspective (“As a restaurant critic… ”). Then explaining the utility of database for the customer (“The database will allow me to … …it will also help me…”). After I approve the project description, I (and/or our TA) will assume the role of the customer (I may add some requirements). Thereafter anytime you have a question about what the customer wants, you must come to see me. If you make an assumption, and it is the wrong assumption, you will have to redo your work, or take a major grade penalty. How am I going to get the Spatial locations of 500 places? • • • The web. A GPS unit. Use a grid overlay. If you use a grid overlay you must do it very carefully, and document the process. Note that treating the problem as existing on a Euclidian plane is actually incorrect. Since the locations are on a sphere there will be an inherent error in the distances reported. This effect would not show up in an area the size of Riverside, but would show up for an area the size of California. However you may ignore it in this project. Important Reminder Do not leave here today thinking… “how am I going to code this R-tree thing”, or “what language should I use”. Leave here thinking… “How is our group going to elicit the problem, design, build and test this piece of software? What is the best design process to use? How are we going to convince the professor, (with the contents of our project binder) that we used a high quality process to solve this problem?”. In particular, you probably want to spend a few weeks researching the design process before you even consider the particular application problem.