Breadboard Models
Grant identifier: GOP 1.1.1-09/1-2009-0019
Version: v1.0
Date: 2011. 03. 31.
Prepared by: Webstar Csoport Kft.
Contributor: Ponte Kft.
Approved by: BDE Research Közhasznú Nonprofit Kft.
BREADBOARD MODELS
The purpose of the breadboard models is to build a working version of each module and to try out the algorithms. The working algorithms are tested so that we can decide whether they meet the functional requirements set for them. If an algorithm proves faulty, we try to fix it or look for a new one in its place. This is repeated until a version is produced that meets the expectations.
In this part we deal with efficiency only minimally (trivially inefficient solutions are avoided). Optimization will be carried out in a later phase.
NAMING
The product name used during the breadboard model phase: Product Opinion Finder and Analyzer, POFA for short.
TASK GROUPS
Based on their tasks, and also taking implementation aspects into account, the modules of the prototype can be organized into the following umbrella tasks:
Data collection:
o [1] controlling the crawlers.
Preprocessing:
o [2] slicer,
o [6] tokenizer,
o [3] entity list building,
o [4] category structure building,
o [5] entity property search (ideally this should be pluggable here sequentially).
Analysis:
o [7] comment categorization (entity + category assignment),
o [8] orientation,
o [9] usefulness.
Query:
o [10] presentation.
Training:
o [11] training.
Maintenance:
o [12] maintenance.
This grouping makes handling the dependency relations simpler. Going from top to bottom, the task groups are to some extent prerequisites of one another, yet each forms a well-contained unit. The internal operation of the individual task groups differs (e.g. Preprocessing can be executed sequentially, while the steps of Analysis may run in parallel).
The task of Data collection is to gather the data sheets and comments of the products found on the web pages specified in advance.
The task of Preprocessing is to extract the information useful to the system from the collected HTML pages and to store it in the appropriate form. In this step we do not yet look for relationships. Several Preprocessing task groups may run in parallel.
The task of Analysis is to uncover relationships between the items already stored in the system and, using them, to try to interpret the texts. The individual elements of this task group can also be invoked from outside. In the absence of external calls it continuously iterates over the comments and, forming a kind of cache, precomputes the values of each comment in advance (categories, orientation, usefulness).
The task of Query is to keep contact with the user. This task group includes the public web service, the interpretation of search expressions, the execution of the queries, the aggregation of the results and the presentation as well.
The task of Training is to collect the training examples arriving from user feedback and, using them, to train the system continuously or at regular intervals.
The task of Maintenance is to maintain the databases so that the system delivers the best possible performance.
DECOMPOSITION INTO CLASSES
crawler:
o loading the configuration XML;
slicer:
o building the entity list: will not be a separate module, the entities are collected from the source pages during slicing;
searching for the properties of entities;
building the category structure;
tokenization:
o comment ↔ entity, category assignment: not a separate module, the information can be extracted from the source pages and is stored during slicing;
comment orientation;
comment usefulness;
presentation;
user feedback, training;
maintenance;
event handler: implements the communication between the individual task groups (simplification of dependencies); see the sketch at the end of this section;
rule system: needed for extracting the useful parts of the pages; performs the matching of the rules;
selecting the domain-specific rules;
database connection: translating high-level DB requests into low-level DB commands, and executing them;
controller (optional).
We try to build the breadboard model in such a way that a minimal amount of code has to be rewritten when the prototype is created.
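As an illustration of the event handler mentioned above, the following is a minimal sketch of a notifier that decouples the task groups. The interface and method names are simplified assumptions for the example; they are not the exact PofaEvents/PofaNotifier API that appears later in the source code.

import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified event interface (the prototype's real one is PofaEvents).
interface PageEvents {
    void onNewPage(String url, String html);
}

// The notifier keeps a list of subscribers, so e.g. the crawler only needs to know
// the notifier, not the slicer or any other downstream module.
class Notifier implements PageEvents {
    private final List<PageEvents> subscribers = new ArrayList<PageEvents>();

    public void subscribe(PageEvents s) {
        subscribers.add(s);
    }

    public void onNewPage(String url, String html) {
        for (PageEvents s : subscribers)
            s.onNewPage(url, html); // fan the event out to every interested task group
    }

    public static void main(String[] args) {
        Notifier n = new Notifier();
        n.subscribe(new PageEvents() {
            public void onNewPage(String url, String html) {
                System.out.println("slicer got: " + url);
            }
        });
        n.onNewPage("http://example.com/review/1", "<html>...</html>");
    }
}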
TOOLS USED
The list of the external, non-standard tools used for building the breadboard model.
LANGUAGE
The elements of the breadboard model are written in Java and Python, using the Eclipse and NetBeans development environments.
PROJECT MANAGER
The Java part of the project is assembled with the Maven project management tool, so that the project remains portable and the prototype to be produced later can be assembled more easily.
PACKAGES
Below we list the non-standard packages used in the breadboard model, together with any modifications made to them.
JUnit v4.8.2
Website: http://www.junit.org/
Function: module for executing unit tests
crawler4j v2.2
Website: http://code.google.com/p/crawler4j/
Function: easily configurable, open-source web crawler.
Modifications:
Page.load(): the character-encoding handling had to be modified so that it recognizes the character encoding of the pages correctly
JTidy r938
Website: http://jtidy.sourceforge.net/
Function: HTML parser and DOM tree builder module; it can also handle and repair non-standard HTML pages.
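As a minimal sketch, a broken page could be parsed into a repaired DOM tree with JTidy roughly as follows (the input document here is an invented example):

import java.io.ByteArrayInputStream;
import java.io.InputStream;

import org.w3c.dom.Document;
import org.w3c.tidy.Tidy;

public class TidyExample {
    public static void main(String[] args) throws Exception {
        // A deliberately malformed page: the <p> element is never closed.
        InputStream html = new ByteArrayInputStream(
                "<html><body><p>unclosed".getBytes("UTF-8"));

        Tidy tidy = new Tidy();
        tidy.setQuiet(true);         // suppress progress messages
        tidy.setShowWarnings(false); // suppress repair warnings

        // parseDOM repairs the markup and returns a standard W3C DOM tree.
        Document dom = tidy.parseDOM(html, null);
        System.out.println(dom.getElementsByTagName("p").getLength()); // 1
    }
}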
Snowball
Website: http://snowball.tartarus.org/
Function: rule-based multilingual stemmer package
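A minimal stemming sketch with the Java bindings of Snowball; the org.tartarus.snowball class and package names are assumptions based on the standard Snowball Java distribution:

import org.tartarus.snowball.SnowballStemmer;
import org.tartarus.snowball.ext.englishStemmer;

public class StemExample {
    public static void main(String[] args) {
        // One stemmer class per supported language (englishStemmer,
        // hungarianStemmer, ...); here we stem a single English word.
        SnowballStemmer stemmer = new englishStemmer();
        stemmer.setCurrent("categories");
        stemmer.stem();
        System.out.println(stemmer.getCurrent()); // "categori"
    }
}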
neo4j REST v0.8 / neo4j Server v1.2 / neo4j embedded
Website:
http://neo4j.org/,
http://wiki.neo4j.org/content/Getting_Started_REST,
http://components.neo4j.org/neo4j-rest/
Function: standalone and embedded graph database server.
TOOLS
Firefox add-on: https://addons.mozilla.org/en-US/firefox/addon/2691
JSON
Website: http://www.json.org/
Function: package implementing the handling of the JSON data format
CREATED CLASSES
Helper classes created in addition to the modules.
Pair
Ordered pair template class.
PofaDomainRule, PofaDomainRuleList, PofaRuleMatcher
Classes handling the rule system that describes the structure of the pages belonging to the individual domains.
PofaDomainRule: a single rule, consisting of a (type, DOM path) pair
PofaDomainRuleList: a list of PofaDomainRules
PofaRuleMatcher: applies the given rules to an HTML page and returns the page fragments captured by the successfully matched rules.
PofaStopWords
Stores the stop words for every language supported by the prototype.
Neo4jDBInterface, PofaNeo4jDB, Neo4jRelationship
Classes handling the Neo4j database connection.
Neo4jDBInterface: provides low-level access
PofaNeo4jDB: provides high-level access
Neo4jRelationship: the list of relations needed for traversing the graph during a query
DATABASES
DOMAIN DB
The selected crawler module (crawler4j) provides itself with a standalone database so that it does not download the same URLs multiple times within the same run.
What we have to store beyond this is a weighting of the individual domain names, with which the frequency and order of revisits can be tuned. For this, the following data must be stored for every domain name:
domain name
importance weight: in what proportion users find useful comments here
change weight: how often the content changes
date of last visit
number of visits
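The document does not fix a revisit-scheduling formula; purely as a hypothetical illustration, the fields above could be combined as follows (the class name, field names and scoring formula are all invented for the example):

public class DomainRecord {
    String domainName;
    double importanceWeight; // ratio of useful comments users find here
    double changeWeight;     // how often the content changes
    long lastVisitMillis;    // date of last visit
    int visitCount;          // number of visits

    // Hypothetical priority: domains that are important, change often and have
    // not been visited for a long time float to the top of the revisit queue.
    double revisitPriority(long nowMillis) {
        double daysSinceVisit = (nowMillis - lastVisitMillis) / 86400000.0;
        return importanceWeight * changeWeight * daysSinceVisit;
    }

    public static void main(String[] args) {
        DomainRecord d = new DomainRecord();
        d.importanceWeight = 0.7;
        d.changeWeight = 0.5;
        d.lastVisitMillis = System.currentTimeMillis() - 3 * 86400000L;
        System.out.println(d.revisitPriority(System.currentTimeMillis())); // ~1.05
    }
}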
DOCUMENT STORE
Its task is to store the textual reviews downloaded by the Slicer, and to store along with them the basic information needed for later processing.
The following data must be stored:
unique identifier of the document (MD5 hash)
raw text (with HTML tags)
source URL
usefulness value
timestamp of the usefulness value
orientation value
timestamp of the orientation value
other labels attached during later processing
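A minimal sketch of storing such a document as a graph node through the Neo4jDBInterface class given in the source code below; the property names, example values and server URL are assumptions:

import org.json.JSONException;
import org.json.JSONObject;

import com.wcs.pofa.db.Neo4jDBInterface;
import com.wcs.pofa.db.Neo4jDBInterface.ServerErrorResponse;

public class StoreDocumentExample {
    public static void main(String[] args)
            throws JSONException, ServerErrorResponse {
        // Assumes a neo4j REST server listening on this URL.
        Neo4jDBInterface db = new Neo4jDBInterface("http://localhost:9999");

        // One node per downloaded review; property names are illustrative only.
        JSONObject props = new JSONObject();
        props.put("md5", "9e107d9d372bb6826bd81d3542a419d6");
        props.put("rawText", "<p>Great battery life...</p>");
        props.put("sourceUrl", "http://example.com/review/42");
        props.put("usefulness", 0.8);
        props.put("usefulnessTimestamp", System.currentTimeMillis());

        String nodeId = db.createNode(props);
        System.out.println("stored document node: " + nodeId);
    }
}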
ENTITY AND CATEGORY STORE
Storage of the searchable products and their categories. A single product may appear under several names, so these names must be collected into groups.
group identifier
o product name 1
o product name 2
o …
o product name n
product type category
o category name
product feature category
o main category name
subcategory name
frequent expressions category
o category name
GRAPH DATABASE
As a trial, the breadboard models use a graph database that contains all of the previously mentioned databases. So instead of the four planned databases we store all the data in a single shared graph (not counting the standalone databases of the individual packages, such as that of crawler4j).
The graph seems a good choice in the sense that there will be a great number of connections between the individual data items, and in most cases the structure of the individual items is not well defined either.
The "Domain DB", however, is not an integral part of the graph; the pages to be visited and their structure are described with a rule system in XML format. The information recorded by the crawler (importance, last visit, etc.) can also be stored in this XML.
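A sketch of how the items of the shared graph might be connected, again using the Neo4jDBInterface class from the listing below; the relationship type names and node properties are assumptions, the document does not fix them:

import java.io.IOException;

import org.json.JSONException;
import org.json.JSONObject;

import com.wcs.pofa.db.Neo4jDBInterface;
import com.wcs.pofa.db.Neo4jDBInterface.ServerErrorResponse;

public class GraphLinkExample {
    public static void main(String[] args)
            throws IOException, JSONException, ServerErrorResponse {
        Neo4jDBInterface db = new Neo4jDBInterface("http://localhost:9999");

        // Entity, category and comment all become plain nodes in one graph.
        String entity = db.createNode(new JSONObject().put("name", "Phone X"));
        String category = db.createNode(new JSONObject().put("name", "battery"));
        String comment = db.createNode(new JSONObject().put("md5", "abc123"));

        // Illustrative relationship types; the real type names are not fixed here.
        db.createRelationship(comment, entity, "MENTIONS", new JSONObject());
        db.createRelationship(comment, category, "ABOUT", new JSONObject());
    }
}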
CHANGES
While building the breadboard model, situations came up that we had not foreseen. Solving them sometimes also required modifying the original plans. These changes are collected here.
DOMAIN RULE SYSTEM
The originally planned NAVIGATION rule type has become unnecessary. The crawler would traverse the pages anyway, and building this rule into the existing crawler package would be complicated.
During the first runs of the crawler it turned out that the current rule system sometimes selects too much data on a page. In their present form the rules only express inclusion, i.e. they select a path on the DOM tree, and every piece of HTML content reachable on the selected path gets picked up. In many cases this selection also contains "garbage", so it seems worthwhile to introduce an additional rule that expresses exclusion.
The exclusion could also be given with a regular expression over a DOM path, relative to the match position of the including rule. For example:
<SELECT
TYPE="TEXT"
PATH="html>body>div>p#content-box"
EXCLUDING="(div#ad|div#navigation|div#share)"
/>
In the example we remove the ad, navigation and share parts from the otherwise selected content. This is useful because on many web pages a boilerplate header and footer is placed around each comment, while the text useful to us is not highlighted separately. With these rules the superfluous elements can be cut off.
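A minimal sketch of how such an exclusion pattern might be applied; representing the relative DOM paths as plain strings is an assumption made only for this illustration:

import java.util.regex.Pattern;

public class ExcludeRuleExample {
    public static void main(String[] args) {
        // Relative DOM paths of elements found under the including rule's match
        // position; in the prototype these would come from the parsed DOM tree.
        String[] relativePaths = { "p", "div#ad", "div#share", "p>em" };

        // The EXCLUDING attribute of the rule shown above.
        Pattern excluding = Pattern.compile("(div#ad|div#navigation|div#share)");

        for (String path : relativePaths)
            if (!excluding.matcher(path).matches())
                System.out.println("kept: " + path); // p, p>em
    }
}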
CRAWLER
The crawler does not perform a preliminary check on the downloaded pages for whether they contain usable data; instead it forwards every page it finds (but only from the permitted domains) to the Slicer. The Slicer breaks the page up into usable elements, which it forwards to the appropriate further modules. If a page contains no usable element, it forwards nothing.
SLICER
Language detection was added to the Slicer as well, because a precondition of correct indexing is that the text is lowercased according to the proper language, and the removal of stop words is also based on it.
This language detection does not replace the language detection performed in the Tokenizer, however, because here the text of the whole page is examined at once.
The functionality of the Slicer was extended with the collection of entities, and with the assignment of entities, categories and comments to one another. This information can be extracted from the source pages unambiguously and simply, so it is much simpler to store it during slicing than to figure it out afterwards with some heuristic.
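A minimal sketch of the language-aware lowercasing and stop-word removal step; the stop-word list and the detected language tag below are placeholders (in the prototype the PofaStopWords class supplies the real lists):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class StopWordExample {
    public static void main(String[] args) {
        // Placeholder stop-word list; PofaStopWords would supply the real list
        // for the detected language.
        Set<String> stopWords = new HashSet<String>(
                Arrays.asList("a", "az", "és", "the", "and"));

        String detectedLang = "hu"; // assumed output of the language detection
        Locale locale = new Locale(detectedLang);

        StringBuilder indexed = new StringBuilder();
        for (String token : "Az akkumulátor és a kijelző".split("\\s+")) {
            String t = token.toLowerCase(locale); // language-aware lowercasing
            if (!stopWords.contains(t))
                indexed.append(t).append(' ');
        }
        System.out.println(indexed.toString().trim()); // "akkumulátor kijelző"
    }
}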
SOURCE CODE
Neo4jDBInterface.java
package com.wcs.pofa.db;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.regex.Pattern;

import org.json.JSONArray;
import org.json.JSONException;
import org.json.JSONObject;

import com.wcs.pofa.db.Neo4jRelationship.Neo4jRelationshipDirection;
/**
* Wrapper class for communicating with a neo4j REST server.
*/
public final class Neo4jDBInterface {
/**
* Server response object
*/
public class ServerErrorResponse extends Throwable {
/**
*
*/
private static final long serialVersionUID = 4213803944885637897L;
private int returnCode;
private String response;
/**
* Indicates that the error is not a standard HTTP error (e.g. unexpected response)
*/
public static final int OTHER_ERROR = -1;
public ServerErrorResponse(int returnCode, String response) {
this .returnCode = returnCode;
this .response = response;
}
public int getReturnCode() { return returnCode; }
public String getResponse() { return response; }
@Override
public String toString() { return returnCode + ": " + response; }
}
/**
* Sends a request to the neo4j REST server.
* @param request Neo4jHttpRequestType (GET, POST, PUT, DELETE)
* @param path URL to send request to (path to node, relationship, etc.)
* @param data additional data (request body)
* @return Server response body if request successful
* @throws ServerErrorResponse on error response
*/
    private String sendRequest(Neo4jHttpRequestType request, String path, String data)
            throws ServerErrorResponse {
int responseCode = HttpURLConnection.HTTP_NO_CONTENT;
StringBuilder response = new StringBuilder();
try {
// Send data
URL url = new URL(path);
HttpURLConnection conn = (HttpURLConnection) url.openConnection ();
conn.addRequestProperty("Content-type", "application/json");
conn.addRequestProperty("Accept", "application/json");
OutputStreamWriter writer = null;
if (Neo4jHttpRequestType.GET != request) {
conn.setDoOutput(true);
switch (request) {
case PUT:
conn.setRequestMethod("PUT");
case POST:
writer = new OutputStreamWriter(conn.getOutputStream(), "UTF-8");
// append data
writer.write(data);
writer.flush();
break ;
case DELETE:
conn.setRequestMethod("DELETE");
break ;
}
}
// Get the response
IOException error = null;
BufferedReader reader;
responseCode = conn.getResponseCode();
try {
reader = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
} catch (IOException e) {
error = e;
reader = new BufferedReader(new InputStreamReader(conn.getErrorStream(), "UTF-8"));
}
String line;
while ((line = reader.readLine()) != null)
response.append(line);
reader.close();
if (null != writer)
writer.close();
// decide whether to return nicely or with an error
if (null == error)
return response.toString();
else
throw error;
        } catch (IOException e) {
            throw new ServerErrorResponse(responseCode, e.getMessage());
        }
    }

    public Neo4jDBInterface(String databaseUrl) throws ServerErrorResponse {
        String response = "";
        JSONObject answer;
        try {
            // query database server for root node
            response = sendRequest(Neo4jHttpRequestType.GET, databaseUrl + "/", null);
            answer = new JSONObject(response);
            nodeUrl = answer.getString("node") + "/";
            indexNodeUrl = answer.getString("node_index") + "/";
            indexRelationshipUrl = answer.getString("relationship_index") + "/";
            relationshipUrl = nodeUrl.replaceFirst("node", "relationship");
            nodeUrlRegex = Pattern.quote(nodeUrl);
            relationshipUrlRegex = Pattern.quote(relationshipUrl);
            rootNode = answer.getString("reference_node").replaceFirst(this.nodeUrl, "");
        } catch (JSONException e) {
            throw new ServerErrorResponse(ServerErrorResponse.OTHER_ERROR,
                    "Unknown server response:\n" + response);
        }
    }
/*
**********************************************************************************************
**************** */
/**
* HTTP request types required for communication with the neo4j REST server
*/
private static enum Neo4jHttpRequestType {
GET,
POST,
PUT,
DELETE
}
/**
* Result of traverse
*/
public static enum Neo4jTraverseResult {
NODE,
RELATIONSHIP,
PATH
}
/**
* Traverse order
*/
    public static enum Neo4jTraverseOrder {
        DEPTH_FIRST,
        BREADTH_FIRST
    }
/**
* Traverse return filter
*/
public static enum Neo4jTraverseReturnFilter {
ALL,
ALL_BUT_START_NODE
}
/**
* Traverse return uniqueness filter
*/
public static enum Neo4jTraverseUniqueness {
NODE_PATH,
NODE
}
/**
* ID of root node (relative URL to server)
*/
private String rootNode;
/**
* Node URL
*/
private String nodeUrlRegex;
private String nodeUrl;
/**
* Relationship URL
*/
private String relationshipUrlRegex;
private String relationshipUrl;
/**
* Index URLs
*/
private String indexNodeUrl;
private String indexRelationshipUrl;
/*
**********************************************************************************************
**************** */
/**
* Returns root node.
*/
public String getRootNode() {
return rootNode;
}
/**
* Creates a new node in the graph with initial properties set.
* @param properties a JSON with the properties to set
* @return server node ID of the new node
* @throws ServerErrorResponse if request couldn't be completed
*/
    public String createNode(JSONObject properties) throws ServerErrorResponse {
        String response = sendRequest(Neo4jHttpRequestType.POST,
                nodeUrl.substring(0, nodeUrl.length() - 1), properties.toString());
        try {
            JSONObject answer = new JSONObject(response);
            return (answer.getString("self").replaceFirst(this.nodeUrlRegex, ""));
        } catch (JSONException e) {
            throw new ServerErrorResponse(ServerErrorResponse.OTHER_ERROR,
                    "Unknown server response:\n" + response);
        }
    }
/**
* Removes the specified node.
* @param node node ID
* @throws ServerErrorResponse if request couldn't be completed
*/
public void removeNode(String node) throws ServerErrorResponse {
sendRequest(Neo4jHttpRequestType.DELETE, nodeUrl + node, null);
}
/**
* Gets all properties of specified node.
* @param node node ID
* @return server response body (JSON or empty string)
* @throws ServerErrorResponse if request couldn't be completed
*/
public JSONObject getNodeProperties(String node) throws ServerErrorResponse {
String response = sendRequest(Neo4jHttpRequestType.GET, nodeUrl + node + "/properties",
null);
try {
if (response.isEmpty()) response = "{}";
return (new JSONObject(response));
        } catch (JSONException e) {
            throw new ServerErrorResponse(ServerErrorResponse.OTHER_ERROR,
                    "Unknown server response:\n" + response);
        }
}
/**
* Replaces all properties on a node with the supplied set of properties.
* @param node node ID
* @param properties a JSON with the properties to set
* @throws ServerErrorResponse if request couldn't be completed
*/
    public void setNodeProperties(String node, JSONObject properties)
            throws ServerErrorResponse {
        sendRequest(Neo4jHttpRequestType.PUT, nodeUrl + node + "/properties",
                properties.toString());
    }
/**
* Removes all properties from a node.
* @param node node ID
* @throws ServerErrorResponse if request couldn't be completed
*/
public void removeNodeProperties(String node) throws ServerErrorResponse {
sendRequest(Neo4jHttpRequestType.DELETE, nodeUrl + node + "/properties", null);
}
/**
* Returns the value of specified property of specified node.
* @param node node ID
* @param property property name
* @return value of property
* @throws ServerErrorResponse if request couldn't be completed
*/
public Object getNodeProperty(String node, String property) throws ServerErrorResponse {
String response = sendRequest(Neo4jHttpRequestType.GET, nodeUrl + node + "/properties/" +
property, null);
try {
return new JSONObject("{\"a\":" + response + "}").get("a");
        } catch (JSONException e) {
            throw new ServerErrorResponse(ServerErrorResponse.OTHER_ERROR,
                    "Unknown server response:\n" + response);
        }
}
/**
* Changes a property of a node. Leaves all other properties intact.
* @param node node ID
* @param property name of property to change/create
* @param value value of property
* @throws ServerErrorResponse if request couldn't be completed
*/
    public void setNodeProperty(String node, String property, Object value)
            throws ServerErrorResponse {
        String val;
        // JSONObject.quote escapes quotes and backslashes inside string values
        if (value instanceof String) val = JSONObject.quote((String) value);
        else val = value.toString();
        sendRequest(Neo4jHttpRequestType.PUT, nodeUrl + node + "/properties/" + property, val);
    }
/**
* Removes specified property from specified node
* @param node node ID
* @param property property name
* @throws ServerErrorResponse if request couldn't be completed
*/
public void removeNodeProperty(String node, String property) throws ServerErrorResponse {
sendRequest(Neo4jHttpRequestType.DELETE, nodeUrl + node + "/properties/" + property,
null);
}
/*
**********************************************************************************************
**************** */
/**
* Create a new relationship between 2 nodes
* @param from start node (node ID) of relationship
* @param to end node (node ID) of relationship
* @param type type identifier of relationship
* @param properties properties to set for relationsip
* @return relationship ID
* @throws IOException if request couldn't be composed
* @throws ServerErrorResponse if request couldn't be completed
*/
    public String createRelationship(String from, String to, String type, JSONObject properties)
            throws IOException, ServerErrorResponse {
        JSONObject params;
        params = new JSONObject();
        try {
            params.put("to", nodeUrl + to);
            params.put("type", type);
            params.put("data", properties);
            String response = sendRequest(Neo4jHttpRequestType.POST,
                    nodeUrl + from + "/relationships", params.toString());
            try {
                JSONObject answer = new JSONObject(response);
                // "self" is a relationship URL, so strip the relationship prefix
                return (answer.getString("self").replaceFirst(this.relationshipUrlRegex, ""));
            } catch (JSONException e) {
                throw new ServerErrorResponse(ServerErrorResponse.OTHER_ERROR,
                        "Unknown server response:\n" + response);
            }
        } catch (JSONException e) {
            throw new IOException("Cannot compose request");
        }
    }
/**
* Gets relationship type and properties.
* @param relationship relationship ID
* @return a JSON object with type and data keys
* @throws ServerErrorResponse if request couldn't be completed
*/
public JSONObject getRelationship(String relationship) throws ServerErrorResponse {
String response = sendRequest(Neo4jHttpRequestType.GET, relationshipUrl + relationship,
null);
try {
JSONObject answer = new JSONObject(response);
JSONObject result = new JSONObject();
result.put("start", answer.getString("start").replaceFirst(nodeUrlRegex, ""));
result.put("end", answer.getString("end").replaceFirst(nodeUrlRegex, ""));
result.put("type", answer.get("type"));
result.put("data", answer.get("data"));
return (result);
        } catch (JSONException e) {
            throw new ServerErrorResponse(ServerErrorResponse.OTHER_ERROR,
                    "Unknown server response:\n" + response);
        }
    }
/**
* Removes specified relationship
* @param relationship relationship ID
* @throws ServerErrorResponse if request couldn't be completed
*/
public void removeRelationship(String relationship) throws ServerErrorResponse {
sendRequest(Neo4jHttpRequestType.DELETE, relationshipUrl + relationship, null);
}
/**
* Gets relationship type.
* @param relationship relationship ID
* @return relationship type identifier
* @throws ServerErrorResponse if request couldn't be completed
*/
public String getRelationshipType(String relationship) throws ServerErrorResponse {
String response = sendRequest(Neo4jHttpRequestType.GET, relationshipUrl + relationship,
null);
try {
JSONObject answer = new JSONObject(response);
return (answer.getString("type"));
        } catch (JSONException e) {
            throw new ServerErrorResponse(ServerErrorResponse.OTHER_ERROR,
                    "Unknown server response:\n" + response);
        }
}
/**
* Gets relationship properties.
* @param relationship relationship ID
* @return a JSON object with the properties
* @throws ServerErrorResponse if request couldn't be completed
*/
public JSONObject getRelationshipProperties(String relationship) throws ServerErrorResponse
{
String response = sendRequest(Neo4jHttpRequestType.GET, relationshipUrl + relationship,
null);
try {
JSONObject answer = new JSONObject(response);
return (answer.getJSONObject("data"));
        } catch (JSONException e) {
            throw new ServerErrorResponse(ServerErrorResponse.OTHER_ERROR,
                    "Unknown server response:\n" + response);
        }
}
/**
* Replaces all properties on a relationship with the supplied set of properties.
* @param relationship relationship ID
* @param properties JSON with the properties to set
* @throws ServerErrorResponse if request couldn't be completed
*/
    public void setRelationshipProperties(String relationship, JSONObject properties)
            throws ServerErrorResponse {
        sendRequest(Neo4jHttpRequestType.PUT, relationshipUrl + relationship + "/properties",
                properties.toString());
    }
/**
 * Removes all properties from the specified relationship.
 * @param relationship relationship ID
 * @throws ServerErrorResponse if the request couldn't be completed
*/
public void removeRelationshipProperties(String relationship) throws ServerErrorResponse {
sendRequest(Neo4jHttpRequestType.DELETE, relationshipUrl + relationship + "/properties",
null);
}
/**
* Returns the value of specified property of specified relationship.
* @param relationship relationship ID
* @param property property name
* @return value of property
* @throws ServerErrorResponse if request couldn't be completed
*/
    public Object getRelationshipProperty(String relationship, String property)
            throws ServerErrorResponse {
String response = sendRequest(Neo4jHttpRequestType.GET, relationshipUrl + relationship +
"/properties/" + property, null);
return JSONObject.stringToValue(response);
}
/**
* Changes a property of a relationship. Leaves all other properties intact.
* @param relationship relationship ID
* @param property property name to change/create
* @param value value to set
* @throws ServerErrorResponse if request couldn't be completed
*/
    public void setRelationshipProperty(String relationship, String property, Object value)
            throws ServerErrorResponse {
        String val;
        // JSONObject.quote escapes quotes and backslashes inside string values
        if (value instanceof String) val = JSONObject.quote((String) value);
        else val = value.toString();
        sendRequest(Neo4jHttpRequestType.PUT,
                relationshipUrl + relationship + "/properties/" + property, val);
    }
/**
* Remove specified property from specified relationship
* @param relationship relationship ID
* @param property property name
* @throws ServerErrorResponse if request couldn't be completed
*/
    public void removeRelationshipProperty(String relationship, String property)
            throws ServerErrorResponse {
        // the property name must be part of the URL for a single-property delete
        sendRequest(Neo4jHttpRequestType.DELETE,
                relationshipUrl + relationship + "/properties/" + property, null);
    }
/*
**********************************************************************************************
**************** */
    /**
     * Returns the selected relationships of a node
     * @param node node ID
     * @param relationships list of relationship types and directions to include
     * @return array of relationship information (like getRelationship())
     * @throws ServerErrorResponse if request couldn't be completed
     */
    public JSONArray getNodeRelationships(String node, ArrayList<Neo4jRelationship> relationships)
            throws ServerErrorResponse {
        StringBuilder target = new StringBuilder();
        if (relationships.size() != 0) {
            target.append(nodeUrl + node + "/relationships/"
                    + relationships.get(0).getDirection().toString().toLowerCase());
            target.append("/");
            target.append(relationships.get(0).getType());
            for (int i = 1; i != relationships.size(); i++)
                target.append("&" + relationships.get(i).getType());
        }
        else
            target.append(nodeUrl + node + "/relationships/all");
String response = sendRequest(Neo4jHttpRequestType.GET, target.toString(), null);
try {
JSONArray answer = new JSONArray(response);
JSONArray result = new JSONArray();
for (int i = 0; i != answer.length(); i++) {
JSONObject item = (JSONObject)answer.get(i);
result.put(new JSONObject().
put("start", item.getString("start").replaceFirst(this.nodeUrlRegex, "")).
put("end", item.getString("end").replaceFirst(this.nodeUrlRegex, "")).
put("type", item.getString("type")).
put("data", item.getString("data")).
put("relationship",
item.getString("self").replaceFirst(this.relationshipUrlRegex, ""))
);
}
return result;
        } catch (JSONException e) {
            throw new ServerErrorResponse(ServerErrorResponse.OTHER_ERROR,
                    "Unknown server response:\n" + response);
        }
    }
/*
**********************************************************************************************
**************** */
/**
* Adds a node to the index.
* @param node node ID to add to index
* @param key index key
* @param value index value
* @throws ServerErrorResponse if request couldn't be completed
*/
public void addNodeToIndex(String indexName, String node, String key, String value) throws
ServerErrorResponse {
sendRequest(Neo4jHttpRequestType.POST, indexNodeUrl + indexName + "/" + key + "/" +
value, "\"" + nodeUrl + node + "\"");
}
/**
* Removes a node from the index
* @param node node ID to remove from index
* @param key indexing key
* @param value index value
* @throws ServerErrorResponse if request couldn't be completed
*/
    public void removeNodeFromIndex(String node, String indexName, String key, String value)
            throws ServerErrorResponse {
sendRequest(Neo4jHttpRequestType.DELETE, indexNodeUrl + indexName + "/" + key + "/" +
value + "/" + node, null);
}
/**
* Queries DB index for (key, value) pair for matching nodes.
* @param key index key
* @param value index value
* @return JSONArray with node IDs and node properties
* @throws ServerErrorResponse if request couldn't be completed
*/
    public JSONArray queryNodeIndex(String indexName, String key, String value)
            throws ServerErrorResponse {
try {
String response = sendRequest(Neo4jHttpRequestType.GET, indexNodeUrl + indexName + "/"
+ key + "/" + value, null);
try {
JSONArray answer = new JSONArray(response);
JSONArray result = new JSONArray();
for (int i = 0; i != answer.length(); i++) {
JSONObject item = answer.getJSONObject(i);
JSONObject indexHit = new JSONObject();
indexHit.put("node", item.getString("self").replaceFirst(nodeUrlRegex, ""));
indexHit.put("data", item.get("data"));
result.put(indexHit);
}
return (result);
            } catch (JSONException e) {
                throw new ServerErrorResponse(ServerErrorResponse.OTHER_ERROR,
                        "Unknown server response:\n" + response);
            }
} catch (ServerErrorResponse e) {
if (HttpURLConnection.HTTP_NOT_FOUND == e.getReturnCode()) return new JSONArray();
throw e;
}
}
/**
* Adds a relationship to the index.
* @param relationship relationship ID to add to index
* @param key index key
* @param value index value
* @throws ServerErrorResponse if request couldn't be completed
*/
    public void addRelationshipToIndex(String indexName, String relationship, String key,
            String value) throws ServerErrorResponse {
sendRequest(Neo4jHttpRequestType.POST, indexRelationshipUrl + indexName + "/" + key + "/"
+ value, "\"" + relationshipUrl + relationship + "\"");
}
/**
* Removes a relationship from the index
* @param relationship relationship ID to remove from index
* @param key indexing key
* @param value index value
* @throws ServerErrorResponse if request couldn't be completed
*/
public void removeRelationshipFromIndex(String relationship, String indexName, String key,
String value) throws ServerErrorResponse {
sendRequest(Neo4jHttpRequestType.DELETE, indexRelationshipUrl + indexName + "/" + key +
"/" + value + "/" + relationship, null);
}
/**
* Queries DB index for (key, value) pair for matching relationship.
* @param key index key
* @param value index value
* @return JSONArray with relationship IDs and relationship properties
* @throws ServerErrorResponse if request couldn't be completed
*/
public JSONArray queryRelationshipIndex(String indexName, String key, String value) throws
ServerErrorResponse {
String response = sendRequest(Neo4jHttpRequestType.GET, indexRelationshipUrl + indexName
+ "/" + key + "/" + value, null);
//TODO: format response
try {
JSONArray answer = new JSONArray(response);
JSONArray result = new JSONArray();
for (int i = 0; i != answer.length(); i++) {
JSONObject item = answer.getJSONObject(i);
JSONObject indexHit = new JSONObject();
indexHit.put("node", item.getString("self").replaceFirst(nodeUrlRegex, ""));
indexHit.put("data", item.get("data"));
result.put(indexHit);
}
return (result);
        } catch (JSONException e) {
            throw new ServerErrorResponse(ServerErrorResponse.OTHER_ERROR,
                    "Unknown server response:\n" + response);
        }
}
/*
**********************************************************************************************
**************** */
    /**
     * General purpose graph traverser.
     * @param startNode node ID to start traversing from
     * @param returnType type of objects to return (nodes, relationships or paths)
     * @param order depth-first or breadth-first traversing
     * @param uniqueness uniqueness filter
     * @param relationships relationship types and directions to follow
     * @param pruneEvaluatorJS JavaScript evaluator to prune graph while traversing. If empty
     *        then maxDepth is used
     * @param returnFilter which nodes to return
     * @param maxDepth maximum depth to traverse from start node. Ignored if pruneEvaluatorJS
     *        is not empty.
     * @return JSONArray containing requested return objects (nodes, relationships or paths)
     *         of reached entities in graph.
     * @throws IOException if request couldn't be composed
     * @throws ServerErrorResponse if request couldn't be completed
     */
    public JSONArray traverse(String startNode, Neo4jTraverseResult returnType,
            Neo4jTraverseOrder order, Neo4jTraverseUniqueness uniqueness,
            ArrayList<Neo4jRelationship> relationships, String pruneEvaluatorJS,
            Neo4jTraverseReturnFilter returnFilter, int maxDepth)
            throws IOException, ServerErrorResponse {
        JSONObject request = new JSONObject();
        try {
            // traverse order
            switch (order) {
            case DEPTH_FIRST: request.put("order", "depth first"); break;
            case BREADTH_FIRST: request.put("order", "breadth first"); break;
            }
// uniqueness
            switch (uniqueness) {
            case NODE: request.put("uniqueness", "node"); break;
            case NODE_PATH: request.put("uniqueness", "node path"); break;
            }
// relationships
JSONArray relations = new JSONArray();
for (int i = 0; i != relationships.size(); i++)
relations.put(new JSONObject().
put("type", relationships.get(i).getType()).
put("direction", relationships.get(i).getDirection().toString().toLowerCase())
);
request.put("relationships", relations);
// prune evaluator
if (null != pruneEvaluatorJS && pruneEvaluatorJS.length() > 0)
request.put("prune evaluator",
new JSONObject().
put("language", "javascript").
put("body", pruneEvaluatorJS)
);
// return filter
JSONObject filter = new JSONObject();
filter.put("language", "builtin");
switch (returnFilter) {
case ALL: filter.put("name", "all"); break;
case ALL_BUT_START_NODE: filter.put("name", "all but start node"); break;
}
request.put("return filter", filter);
// max depth
request.put("max depth", maxDepth);
// send request
String response = sendRequest(Neo4jHttpRequestType.POST, nodeUrl + startNode +
"/traverse/" + returnType.toString().toLowerCase(), request.toString());
try {
JSONArray answer = new JSONArray(response);
JSONArray result = new JSONArray();
switch (returnType) {
case NODE:
for (int i = 0; i != answer.length(); i++) {
JSONObject item = answer.getJSONObject(i);
JSONObject resultItem = new JSONObject();
resultItem.put("node", item.getString("self").replaceFirst(nodeUrlRegex, ""));
resultItem.put("data", item.get("data"));
result.put(resultItem);
}
break ;
case RELATIONSHIP:
for (int i = 0; i != answer.length(); i++) {
JSONObject item = answer.getJSONObject(i);
JSONObject resultItem = new JSONObject();
resultItem.put("relationship",
item.getString("self").replaceFirst(relationshipUrlRegex, ""));
resultItem.put("start",
item.getString("start").replaceFirst(nodeUrlRegex,
""));
resultItem.put("end", item.getString("end").replaceFirst(nodeUrlRegex, ""));
resultItem.put("type", item.get("type"));
resultItem.put("data", item.get("data"));
result.put(resultItem);
}
break ;
case PATH:
for (int i = 0; i != answer.length(); i++) {
JSONObject item = answer.getJSONObject(i);
JSONObject resultItem = new JSONObject();
JSONArray itemNodes = item.getJSONArray("nodes");
resultItem.put("nodes", new JSONArray());
for (int j = 0; j != itemNodes.length(); j++)
resultItem.accumulate("nodes",
itemNodes.getString(j).replaceFirst(nodeUrlRegex, ""));
JSONArray itemRelations = item.getJSONArray("relationships");
resultItem.put("relationships", new JSONArray());
for (int j = 0; j != itemRelations.length(); j++)
resultItem.accumulate("relationships",
itemRelations.getString(j).replaceFirst(relationshipUrlRegex, ""));
resultItem.put("start",
item.getString("start").replaceFirst(nodeUrlRegex,
""));
resultItem.put("end", item.getString("end").replaceFirst(nodeUrlRegex, ""));
resultItem.put("length", item.get("length"));
result.put(resultItem);
}
break ;
}
return (result);
            } catch (JSONException e) {
                throw new ServerErrorResponse(ServerErrorResponse.OTHER_ERROR,
                        "Unknown server response:\n" + response);
            }
        } catch (JSONException e) {
            throw new IOException("Cannot compose request");
        }
    }
/*
**********************************************************************************************
**************** */
    /**
     * Returns the shortest path from startNode to endNode using the specified relationships only.
     * If the length of returned path is 0 then no path exists.
     * @param startNode Node ID to start path finding from.
     * @param endNode Node ID to reach to.
     * @param relationships List of relationships allowed to use.
     * @param maxDepth Maximum path length to search for.
     * @return A JSONObject containing the nodes and relationships on the shortest path.
     * @throws IOException if request couldn't be composed
     * @throws ServerErrorResponse if request couldn't be completed
     */
    public JSONObject findShortestPath(String startNode, String endNode,
            ArrayList<Neo4jRelationship> relationships, int maxDepth)
            throws IOException, ServerErrorResponse {
JSONObject request = new JSONObject();
// compose request
try {
request.put("to", nodeUrl + endNode);
request.put("algorithm", "shortestPath");
request.put("max depth", maxDepth);
// relationships
JSONArray relations = new JSONArray();
for (int i = 0; i != relationships.size(); i++)
relations.put(new JSONObject().
put("type", relationships.get(i).getType()).
put("direction", relationships.get(i).getDirection().toString().toLowerCase())
);
request.put("relationships", relations);
            try {
                String response = sendRequest(Neo4jHttpRequestType.POST,
                        nodeUrl + startNode + "/path", request.toString());
                try {
                    JSONObject answer = new JSONObject(response);
                    JSONObject result = new JSONObject();
                    // nodes
                    JSONArray itemNodes = answer.getJSONArray("nodes");
                    result.put("nodes", new JSONArray());
                    for (int j = 0; j != itemNodes.length(); j++)
                        result.accumulate("nodes",
                                itemNodes.getString(j).replaceFirst(nodeUrlRegex, ""));
// relationships
JSONArray itemRelations = answer.getJSONArray("relationships");
result.put("relationships", new JSONArray());
for (int j = 0; j != itemRelations.length(); j++)
result.accumulate("relationships",
itemRelations.getString(j).replaceFirst(relationshipUrlRegex, ""));
return result;
                } catch (JSONException e) {
                    throw new ServerErrorResponse(ServerErrorResponse.OTHER_ERROR,
                            "Unknown server response:\n" + response);
                }
} catch (ServerErrorResponse e) {
if (e.returnCode == HttpURLConnection.HTTP_NOT_FOUND) {
// HTTP 404 means no path found, return empty path
return new JSONObject().
put("start", startNode).
put("end", endNode).
put("length", 0).
put("nodes", new JSONArray()).
put("relationships", new JSONArray());
}
else
// other error
throw e;
}
        } catch (JSONException e) {
            throw new IOException("Cannot compose request");
        }
    }
}
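A short usage sketch of the class above; the server URL, the property values and the relationship type are assumptions, and a running neo4j REST server is presumed:

import java.io.IOException;
import java.util.ArrayList;

import org.json.JSONException;
import org.json.JSONObject;

import com.wcs.pofa.db.Neo4jDBInterface;
import com.wcs.pofa.db.Neo4jDBInterface.ServerErrorResponse;
import com.wcs.pofa.db.Neo4jRelationship;
import com.wcs.pofa.db.Neo4jRelationship.Neo4jRelationshipDirection;

public class Neo4jDBInterfaceExample {
    public static void main(String[] args)
            throws IOException, JSONException, ServerErrorResponse {
        Neo4jDBInterface db = new Neo4jDBInterface("http://localhost:9999");

        // create two nodes and connect them
        String a = db.createNode(new JSONObject().put("name", "Phone X"));
        String b = db.createNode(new JSONObject().put("name", "battery"));
        db.createRelationship(a, b, "HAS_CATEGORY", new JSONObject());

        // list the outgoing HAS_CATEGORY relationships of the first node
        ArrayList<Neo4jRelationship> rels = new ArrayList<Neo4jRelationship>();
        rels.add(new Neo4jRelationship("HAS_CATEGORY", Neo4jRelationshipDirection.OUT));
        System.out.println(db.getNodeRelationships(a, rels));
    }
}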
Neo4jRelationship.java
package com.wcs.pofa.db;
public class Neo4jRelationship {
/**
* Relationship directions
*/
public static enum Neo4jRelationshipDirection {
ALL,
IN,
OUT
}
    private String type;
    private Neo4jRelationshipDirection direction;
    public Neo4jRelationship(String type, Neo4jRelationshipDirection direction) {
        super();
        this.type = type;
        this.direction = direction;
    }
    public int hashCode() {
        int hashFirst = type != null ? type.hashCode() : 0;
        int hashSecond = direction != null ? direction.hashCode() : 0;
        return (hashFirst + hashSecond) * hashSecond + hashFirst;
    }

    public boolean equals(Object other) {
        if (other instanceof Neo4jRelationship) {
            Neo4jRelationship otherPair = (Neo4jRelationship) other;
            return ((this.type == otherPair.type ||
                    (this.type != null && otherPair.type != null &&
                     this.type.equals(otherPair.type))) &&
                    (this.direction == otherPair.direction ||
                    (this.direction != null && otherPair.direction != null &&
                     this.direction.equals(otherPair.direction))));
        }
        return false;
    }

    public String toString() {
        return "(" + type + ", " + direction + ")";
    }
    public String getType() {
        return type;
    }
    public void setType(String type) {
        this.type = type;
    }
    public Neo4jRelationshipDirection getDirection() {
        return direction;
    }
    public void setDirection(Neo4jRelationshipDirection direction) {
        this.direction = direction;
    }
}
Pair.java
package com.wcs.pofa;

import java.io.Serializable;
public class Pair<A, B> implements Serializable {
/**
*
*/
private static final long serialVersionUID = 1L;
private A first;
private B second;
    public Pair(A first, B second) {
        super();
        this.first = first;
        this.second = second;
    }

    public int hashCode() {
        int hashFirst = first != null ? first.hashCode() : 0;
        int hashSecond = second != null ? second.hashCode() : 0;
        return (hashFirst + hashSecond) * hashSecond + hashFirst;
    }

    @SuppressWarnings("unchecked")
    public boolean equals(Object other) {
        if (other instanceof Pair) {
            Pair otherPair = (Pair) other;
            return ((this.first == otherPair.first ||
                    (this.first != null && otherPair.first != null &&
                     this.first.equals(otherPair.first))) &&
                    (this.second == otherPair.second ||
                    (this.second != null && otherPair.second != null &&
                     this.second.equals(otherPair.second))));
        }
        return false;
    }

    public String toString() {
        return "(" + first + ", " + second + ")";
    }
public A getFirst() {
return first;
}
    public void setFirst(A first) {
        this.first = first;
    }
    public B getSecond() {
        return second;
    }
    public void setSecond(B second) {
        this.second = second;
    }
}
Pofa.java
package com.wcs.pofa;

import java.io.File;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;

import com.wcs.pofa.db.PofaNeo4jDB;
import com.wcs.pofa.db.Neo4jDBInterface.ServerErrorResponse;
import com.wcs.pofa.entities.PofaEntityListBuilder;
import com.wcs.pofa.events.PofaEvents;
import com.wcs.pofa.events.PofaNotifier;
import com.wcs.pofa.slicer.PofaSlicer;
/**
* Main class of POFA prototype.
*
*/
public class Pofa implements PofaEvents {
    private String configFile;
    private String databaseUrl;
    private String crawlerSettingsFile;

    private PofaNotifier notifier;
    private PofaNeo4jDB db;
    private PofaDataminerController dataminerController;
    private PofaSlicer slicer;
    private PofaEntityListBuilder entityList;
/**
* Loads the settings from the specified config file.
*/
private void loadConfig() {
try {
File file = new File(configFile);
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(file);
doc.getDocumentElement().normalize();
// load settings
            this.databaseUrl = doc.getElementsByTagName("database").item(0)
                    .getAttributes().getNamedItem("url").getNodeValue();
            this.crawlerSettingsFile = doc.getElementsByTagName("crawler").item(0)
                    .getAttributes().getNamedItem("file").getNodeValue();
} catch (Exception e) {
e.printStackTrace();
}
}
    public PofaNeo4jDB getDb() {
        return this.db;
    }
    public PofaSlicer getSlicer() {
        return this.slicer;
    }
    /**
     * Initialize the prototype.
     * @param configFile
     * @throws ServerErrorResponse
     */
public Pofa(String configFile) throws ServerErrorResponse {
System.out.println("Initializing " + this.getClass().getName() + "...");
// load settings
        this.configFile = configFile;
loadConfig();
// initialize modules
notifier = new PofaNotifier();
notifier.notifyRequest(this);
db = new PofaNeo4jDB(databaseUrl);
slicer = new PofaSlicer(db);
//entityList = new PofaEntityListBuilder(db);
// initialize other modules
dataminerController = new PofaDataminerController(notifier, crawlerSettingsFile);
System.out.println(this.getClass().getName() + " initialized.");
}
public void start() {
// start the dataminer process
dataminerController.start();
}
    public synchronized void onNewPage(PofaAbstractDataminer sender, String url, String page,
            PofaDomainInfo rule) {
        System.out.println(url);
        String html = PofaSlicer.cleanHTML(page);
        //double usefulness =
        slicer.process(url, html, rule);
        //double entityRatio = entityList.addEntities(slicer.getEntities());
    }

    public void onEntityFound(PofaAbstractSlicer sender, String entityName) {
    }

    public void onPageDownloaded(PofaAbstractCrawler sender, Object data) {
    }
public boolean onPageVisiting(PofaAbstractCrawler sender, String url) {
return false ;
}
public static void main(String argv[]) throws ServerErrorResponse {
Pofa prototype = new Pofa("c:\\users\\mikki\\workspace\\pofa\\settings.xml");
prototype.start();
/*
Neo4jDBInterface db = new Neo4jDBInterface("http://localhost:9999");
System.out.println(db.getRelationship("1"));
System.out.println(db.queryIndex("foo", "bar"));
        System.out.println(db.findShortestPath("0", "3", 10, Neo4jRelationshipDirection.ALL));
*/
}
}
PofaAbstractCrawler.java
package com.wcs.pofa;

public interface PofaAbstractCrawler {
}
PofaAbstractDataminer.java
package com.wcs.pofa;

public interface PofaAbstractDataminer {
}
PofaAbstractSlicer.java
package com.wcs.pofa;

public interface PofaAbstractSlicer {
}
PofaCrawler.java
package com.wcs.pofa.crawler;

import java.util.regex.Pattern;

import com.wcs.pofa.PofaAbstractCrawler;
import com.wcs.pofa.events.PofaNotifier;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;
/**
* Crawler to crawl all specified domains and parse valuable sites.
*
*/
public class PofaCrawler extends WebCrawler implements PofaAbstractCrawler {
private Pattern excludeFilter = Pattern.compile(
".*(\\.(css|js|bmp|gif|jpe?g"
+ "|png|tiff?|mid|mp2|mp3|mp4"
+ "|wav|avi|mov|mpeg|ram|m4v|pdf"
+ "|rm|smil|wmv|swf|wma|zip|rar|gz))$"
);
//private Neo4jDBInterface db;
private PofaNotifier notifier;
    public PofaCrawler() {
    }
public boolean shouldVisit(WebURL url) {
String href = url.getURL().toLowerCase();
if (excludeFilter.matcher(href).matches())
return false ;
if (notifier.onPageVisiting(this, url.getURL()))
return true ;
return false ;
}
/**
* Visit specified page
*/
public void visit(Page page) {
notifier.onPageDownload(this, page);
}
    // This function is called by controller to get the local data of this crawler when job is finished
public Object getMyLocalData() {
return null ;
}
// This function is called by controller before finishing the job.
public void onBeforeExit() {
System.out.println("Crawler " + getMyId() + " finished.");
}
public void onStart() {
Object data = getMyData();
if (data instanceof PofaNotifier) {
notifier = ((PofaNotifier)getMyData());
}
else {
//TODO: throw exception
}
}
}
PofaCrawlerController.java
package com.wcs.pofa.crawler;

import java.io.IOException;
import java.util.ArrayList;

import javax.management.modelmbean.XMLParseException;
import javax.xml.parsers.ParserConfigurationException;

import org.xml.sax.SAXException;

import com.wcs.pofa.PofaAbstractCrawler;
import com.wcs.pofa.PofaAbstractDataminer;
import com.wcs.pofa.PofaAbstractSlicer;
import com.wcs.pofa.PofaDomainInfo;
import com.wcs.pofa.events.PofaEvents;
import com.wcs.pofa.events.PofaNotifier;
import com.wcs.pofa.settings.crawler.PofaCrawlerXML;
import com.wcs.pofa.settings.crawler.PofaCrawlerXMLDomain;

import edu.uci.ics.crawler4j.crawler.CrawlController;
public class PofaCrawlerController implements PofaEvents {
private CrawlController controller;
private int crawlerCount;
private PofaNotifier notifier;
private String configFile;
private ArrayList<PofaDomainInfo> domainInfo;
private void loadConfig() {
domainInfo = new ArrayList<PofaDomainInfo>();
try {
PofaCrawlerXML config = new PofaCrawlerXML(configFile);
ArrayList<PofaCrawlerXMLDomain> domain = config.getDomains();
for (int i = 0; i != domain.size(); i++)
domainInfo.add(new PofaDomainInfo(domain.get(i)));
} catch (XMLParseException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
} catch (ParserConfigurationException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
} catch (SAXException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
} catch (IOException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
}
}
/**
* Creates a crawler controller class and adds the seed URLs specified in database.
*
* @param notifier a PofaNotifier instance to handle communication between objects.
* @param numberOfCrawlers Specifies the number of crawlers to use.
* @param rootFolder Base folder for crawler to store it's database.
     * @param configFile specifies the config file name to use (contains per-domain info:
     *        seed urls, accept urls, DOM rules)
* @throws Exception
*/
    public PofaCrawlerController(PofaNotifier notifier, int numberOfCrawlers,
            String rootFolder, String configFile) throws Exception {
System.out.println("Initializing " + this.getClass().getName() + "...");
this .notifier = notifier;
this .notifier.notifyRequest(this); // add ourselves to the notification chain
crawlerCount = numberOfCrawlers;
controller = new CrawlController(rootFolder);
controller.setPolitenessDelay(300); //TODO: select a useable politeness delay
this .configFile = configFile;
loadConfig();
//--- add seed urls --for (int i = 0; i != domainInfo.size(); i++) {
for (int j = 0; j != domainInfo.get(i).getSeedUrl().size(); j++) {
System.out.println("adding seed: " + domainInfo.get(i).getSeedUrl().get(j));
controller.addSeed(domainInfo.get(i).getSeedUrl().get(j));
}
}
System.out.println(this.getClass().getName() + " initialized.");
}
/**
* Starts the crawling.
*/
public void startCrawler() {
        this.controller.start(PofaCrawler.class, this.crawlerCount, notifier);
}
/**
* Callback event: a page was downloaded by a crawler object
*/
@Override
public boolean onPageVisiting(PofaAbstractCrawler sender, String url) {
if (sender instanceof PofaCrawler) {
for (int i = 0; i != domainInfo.size(); i++)
if (domainInfo.get(i).accept(url))
return true ;
}
return false ;
}
@Override
public void onEntityFound(PofaAbstractSlicer sender, String entityName) {
// TODO Auto-generated method stub
}
@Override
public void onNewPage(PofaAbstractDataminer sender, String url,
String page, PofaDomainInfo rule) {
// TODO Auto-generated method stub
}
@Override
public void onPageDownloaded(PofaAbstractCrawler sender, Object data) {
// TODO Auto-generated method stub
}
}
PofaCrawlerXML.java
package com.wcs.pofa.settings.crawler;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;

import javax.management.modelmbean.XMLParseException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;
/**
* Read and parse config XML of crawler.
*
*/
public class PofaCrawlerXML {
    private PofaCrawlerXMLDomains domains = null;
/**
* Parse XML file and create objects.
* @param configFile config file to parse
* @throws XMLParseException
* @throws ParserConfigurationException
* @throws IOException
* @throws SAXException
*/
    public PofaCrawlerXML(String configFile) throws XMLParseException,
            ParserConfigurationException, SAXException, IOException {
        File file = new File(configFile);
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        DocumentBuilder db = dbf.newDocumentBuilder();
        Document doc = db.parse(file);
        doc.getDocumentElement().normalize();
        ParseRoot(doc);
    }
/**
* Parse XML document
* @param root
* @throws XMLParseException
*/
private void ParseRoot(Document root) throws XMLParseException {
NodeList children = root.getChildNodes();
for (int i = 0; i != children.getLength(); i++) {
Node child = children.item(i);
if (child.getNodeName().equals("crawlersettings"))
ParseCrawlerSettings(child);
}
}
    /**
     * Parse <crawlersettings> node
     * @param root
     * @throws XMLParseException
     */
private void ParseCrawlerSettings(Node root) throws XMLParseException {
if (domains != null)
throw new XMLParseException("Only one <domains> node expected.");
NodeList children = root.getChildNodes();
for (int i = 0; i != children.getLength(); i++) {
Node child = children.item(i);
if (child.getNodeName().equals("domains"))
domains = new PofaCrawlerXMLDomains(child);
}
}
/**
* @return domains
*/
public ArrayList<PofaCrawlerXMLDomain> getDomains() {
if (null == domains)
return null ;
return domains.getDomains();
}
}
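As a hypothetical example, a crawler configuration with the structure that the parser above and the classes below expect; the tag and attribute names follow the parsing code, while the concrete domain, URLs and rule values are invented:

<crawlersettings>
  <domains>
    <domain root="http://example.com">
      <seed>http://example.com/products</seed>
      <accept>http://example.com/review/.*</accept>
      <rule type="TEXT" path="html>body>div>p#content-box"
            exclude="(div#ad|div#navigation|div#share)"/>
    </domain>
  </domains>
</crawlersettings>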
PofaCrawlerXMLAccept.java
package com.wcs.pofa.settings.crawler;

import org.w3c.dom.Node;

public class PofaCrawlerXMLAccept {
    private String url = null;

    public PofaCrawlerXMLAccept(Node root) {
        url = root.getTextContent();
    }

    /**
     * @return the url
     */
    public String getUrl() {
        return url;
    }
}
PofaCrawlerXMLDomain.java
package com.wcs.pofa.settings.crawler;

import java.util.ArrayList;

import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class PofaCrawlerXMLDomain {
    private ArrayList<PofaCrawlerXMLSeed> seed;
    private ArrayList<PofaCrawlerXMLAccept> accept;
    private ArrayList<PofaCrawlerXMLRule> rule;
    private String rootUrl;
public PofaCrawlerXMLDomain(Node root) {
seed = new ArrayList<PofaCrawlerXMLSeed>();
accept = new ArrayList<PofaCrawlerXMLAccept>();
rule = new ArrayList<PofaCrawlerXMLRule>();
rootUrl = root.getAttributes().getNamedItem("root").getNodeValue();
NodeList children = root.getChildNodes();
for (int i = 0; i != children.getLength(); i++) {
Node child = children.item(i);
if (child.getNodeName().equals("seed"))
seed.add(new PofaCrawlerXMLSeed(child));
else if (child.getNodeName().equals("accept"))
accept.add(new PofaCrawlerXMLAccept(child));
else if (child.getNodeName().equals("rule"))
rule.add(new PofaCrawlerXMLRule(child));
}
}
/**
* @return seed urls
*/
public ArrayList<PofaCrawlerXMLSeed> getSeedURLs() {
return seed;
}
/**
* @return accept urls
*/
public ArrayList<PofaCrawlerXMLAccept> getAcceptURLs() {
return accept;
}
/**
* @return rules
*/
public ArrayList<PofaCrawlerXMLRule> getRules() {
return rule;
}
/**
* @return root URL
*/
public String getRootUrl() {
return rootUrl;
}
}
PofaCrawlerXMLDomains.java
package com.wcs.pofa.settings.crawler;

import java.util.ArrayList;

import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class PofaCrawlerXMLDomains {
    private ArrayList<PofaCrawlerXMLDomain> domain;
public PofaCrawlerXMLDomains(Node root) {
domain = new ArrayList<PofaCrawlerXMLDomain>();
NodeList children = root.getChildNodes();
for (int i = 0; i != children.getLength(); i++) {
Node child = children.item(i);
if (child.getNodeName().equals("domain"))
domain.add(new PofaCrawlerXMLDomain(child));
}
}
/**
* @return domains
*/
public ArrayList<PofaCrawlerXMLDomain> getDomains() {
return domain;
}
}
PofaCrawlerXMLRule.java
package com.wcs.pofa.settings.crawler;

import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

public class PofaCrawlerXMLRule {
    private String type = null;
    private String path = null;
    private String exclude = null;
public PofaCrawlerXMLRule(Node root) {
NamedNodeMap attributes = root.getAttributes();
type = attributes.getNamedItem("type").getNodeValue();
path = attributes.getNamedItem("path").getNodeValue();
exclude = attributes.getNamedItem("exclude").getNodeValue();
}
/**
* @return the type
*/
public String getType() {
return type;
}
/**
* @return the path
*/
public String getPath() {
return path;
}
/**
* @return the exclude
*/
public String getExclude() {
return exclude;
}
}
PofaCrawlerXMLSeed.java
package com.wcs.pofa.settings.crawler;

import org.w3c.dom.Node;

public class PofaCrawlerXMLSeed {
    private String url = null;
public PofaCrawlerXMLSeed(Node root) {
url = root.getTextContent();
}
/**
* @return the url
*/
public String getUrl() {
return url;
}
}
PofaDataminerController.java
package com.wcs.pofa;

import java.io.IOException;
import java.util.ArrayList;

import javax.management.modelmbean.XMLParseException;
import javax.xml.parsers.ParserConfigurationException;

import org.xml.sax.SAXException;

import com.wcs.pofa.crawler.PofaCrawler;
import com.wcs.pofa.crawler.PofaCrawlerController;
import com.wcs.pofa.events.PofaEvents;
import com.wcs.pofa.events.PofaNotifier;
import com.wcs.pofa.settings.crawler.PofaCrawlerXML;
import com.wcs.pofa.settings.crawler.PofaCrawlerXMLDomain;

import edu.uci.ics.crawler4j.crawler.Page;
/**
* Controls the datamining process.
*
*/
public class PofaDataminerController implements PofaEvents, PofaAbstractDataminer {
    private String crawlerSettingsFile;
    private PofaNotifier notifier;
    private PofaCrawlerController crawlerController;
    private ArrayList<PofaDomainInfo> domainInfo;
/**
* Load the crawler configuration file.
*/
private void loadDomainXML() {
domainInfo = new ArrayList<PofaDomainInfo>();
try {
PofaCrawlerXML config = new PofaCrawlerXML(crawlerSettingsFile);
ArrayList<PofaCrawlerXMLDomain> domain = config.getDomains();
for (int i = 0; i != domain.size(); i++)
domainInfo.add(new PofaDomainInfo(domain.get(i)));
} catch (XMLParseException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (ParserConfigurationException e) {
e.printStackTrace();
} catch (SAXException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
/**
* Initialize the dataminer process.
*/
public PofaDataminerController(PofaNotifier notifier, String crawlerSettingsFile) {
System.out.println("Initializing " + this.getClass().getName() + "...");
this.notifier = notifier;
notifier.notifyRequest(this);
this.crawlerSettingsFile = crawlerSettingsFile;
loadDomainXML();
try {
crawlerController = new PofaCrawlerController(notifier,
"c:\\users\\mikki\\workspace\\pofa\\crawler", crawlerSettingsFile);
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
System.out.println(this.getClass().getName() + " initialized.");
}
/**
* Start the dataminer process.
*/
public void start() {
crawlerController.startCrawler();
}
@Override
public synchronized void onPageDownloaded(PofaAbstractCrawler sender, Object data) {
System.out.print("Crawler " + ((PofaCrawler) sender).getMyId() + ": ");
if (data instanceof Page) {
Page page = (Page)data;
System.out.println(page.getWebURL());
for (int i = 0; i != domainInfo.size(); i++) {
if (page.getWebURL().getURL().matches(domainInfo.get(i).getDomainRoot())) {
notifier.onNewPage(this,
page.getWebURL().getURL(),
page.getHTML(),
domainInfo.get(i));
break ;
}
}
}
else
System.out.println("unknown data");
}
@Override
public void onEntityFound(PofaAbstractSlicer sender, String entityName) {
// TODO Auto-generated method stub
}
@Override
public void onNewPage(PofaAbstractDataminer sender, String url,
String page, PofaDomainInfo rule) {
// TODO Auto-generated method stub
}
@Override
public boolean onPageVisiting(PofaAbstractCrawler sender, String url) {
// TODO Auto-generated method stub
return false ;
}
}
PofaDomainInfo.java
package com.wcs.pofa;
import java.util.ArrayList;
import com.wcs.pofa.PofaDomainRule.PofaPageElement;
import com.wcs.pofa.settings.crawler.PofaCrawlerXMLAccept;
import com.wcs.pofa.settings.crawler.PofaCrawlerXMLDomain;
import com.wcs.pofa.settings.crawler.PofaCrawlerXMLRule;
import com.wcs.pofa.settings.crawler.PofaCrawlerXMLSeed;
/**
* Stores basic info about a domain required for crawling and processing
* Stored data are:
* <li>seed URLs: start crawler on these sites</li>
* <li>accept URLs: process only this kind of URLs</li>
* <li>processing rules: structural information</li>
*
*/
public class PofaDomainInfo {
private String domainRoot;                 // domain root URL
private ArrayList<String> seedUrl;         // seed URLs
private ArrayList<String> acceptUrl;       // accepted URLs
private ArrayList<PofaDomainRule> rules;   // processing rules
/**
* Construct an empty domain info.
* @param domainRoot base URL of domain
*/
public PofaDomainInfo(String domainRoot) {
this.domainRoot = domainRoot;
this.seedUrl = new ArrayList<String>();
this.acceptUrl = new ArrayList<String>();
this.rules = new ArrayList<PofaDomainRule>();
}
/**
* Construct a domain info instance from a part of XML configuration file.
* @param node part of the config XML containing domain-specific info
*/
public PofaDomainInfo(PofaCrawlerXMLDomain node) {
this.seedUrl = new ArrayList<String>();
this.acceptUrl = new ArrayList<String>();
this.rules = new ArrayList<PofaDomainRule>();
this.domainRoot = node.getRootUrl();
ArrayList<PofaCrawlerXMLSeed> seeds = node.getSeedURLs();
for (int i = 0; i != seeds.size(); i++)
this.seedUrl.add(seeds.get(i).getUrl());
ArrayList<PofaCrawlerXMLAccept> accepts = node.getAcceptURLs();
for (int i = 0; i != accepts.size(); i++)
this.acceptUrl.add(accepts.get(i).getUrl());
ArrayList<PofaCrawlerXMLRule> rules = node.getRules();
for (int i = 0; i != rules.size(); i++) {
PofaCrawlerXMLRule rule = rules.get(i);
this.getRules().add(new PofaDomainRule(PofaPageElement.valueOf(rule.getType()),
rule.getPath(), rule.getExclude()));
}
}
/**
* Adds a rule to the domain rule list.
* @param rule The rule to add.
* @return The class instance itself
*/
public PofaDomainInfo addRule(PofaDomainRule rule) {
this.getRules().add(rule);
return this;
}
/**
* Search for a rule in the list of added rules that matches the given DOM path.
* @param path The DOM path to match. A rule accepts the given path and
* anything from that point (even ID and/or CLASS items).
* @return <i>PofaPageElement</i> of the first rule matching the specified
* path, or <b>null</b> if no rules match
* the path.
*/
/*
* TODO: match multiple rules (return a list of matching rules?)
*/
public PofaDomainRule.PofaPageElement matchingRule(String path) {
path = path.toLowerCase();
for (int i = 0; i != getRules().size(); i++)
if (getRules().get(i).matchRule(path))
return (getRules().get(i).getType());
return null ;
}
/**
* @return the domainRoot
*/
public String getDomainRoot() {
return domainRoot;
}
/**
* @return seed URLs
*/
public ArrayList<String> getSeedUrl() {
return seedUrl;
}
/**
* Tells if domain accepts this URL (that is if crawler should visit this URL or not).
* @return true if the domain accepts the URL
*/
public boolean accept(String url) {
for (int i = 0; i != acceptUrl.size(); i++)
if (url.matches(acceptUrl.get(i)))
return true;
return false;
}
/**
* @return the rules
*/
public ArrayList<PofaDomainRule> getRules() {
return rules;
}
}
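A short usage sketch of PofaDomainInfo follows; the URL patterns and the rule are invented examples, and the behaviour described in the comments follows from the methods above.

// hypothetical usage; all values are made-up examples
PofaDomainInfo info = new PofaDomainInfo("http://www\\.example\\.com/.*");
info.addRule(new PofaDomainRule(PofaDomainRule.PofaPageElement.TEXT,
        "html>body>div\\.comment", null));
// accept() tests the URL against the accept list with String.matches(),
// so with an empty accept list every URL is rejected:
boolean visit = info.accept("http://www.example.com/products/1"); // false here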
PofaDomainRule.java
package com.wcs.pofa;
/**
* Pair of CSS selector like path and page element type.
*
*/
public class PofaDomainRule {
/**
* Page element type.
*/
public static enum PofaPageElement {
THEME,
BREADCRUMBS,
FACTSHEET,
TEXT,
//NAVIGATION,
ENTITY
}
private PofaPageElement type;
private String path;
private String exclude;
/**
* Create a domain rule from a <i>PofaPageElement</i> type and a DOM path. Path elements
* should be separated by the 'greater than' character (>). The format of each element is:
* ELEMENT_NAME[#ID_TAG][.CLASS_TAG], where ELEMENT_NAME is the DOM element name, ID_TAG
* is an optional ID attribute of the DOM element and CLASS_TAG is an optional CLASS
* attribute of the DOM element. <b>Warning:</b> the DOM path is interpreted as a regular
* expression, so make sure that dot (.) chars before CLASS_TAG elements are escaped.
* The path string should be lower case (except for regexp special chars).
* @param ruleType Page element type.
* @param rulePath regexp describing the DOM path to match (regexp is matched from the
* beginning)
* @param excludePath regexp describing the DOM path to exclude from match relative to
* rulePath (if <b>null</b> no exclusion is made)
*/
public PofaDomainRule(PofaPageElement ruleType, String rulePath, String excludePath) {
setType(ruleType);
setPath("^" + rulePath + "(>.*|)$");
if (excludePath == null || excludePath.length() == 0)
setExclude(null);
else
setExclude("^" + rulePath + ">" + excludePath + "(>.*|)$");
}
private void setType(PofaPageElement type) {
this.type = type;
}
public PofaPageElement getType() {
return type;
}
private void setPath(String path) {
this.path = path;
}
public String getPath() {
return path;
}
public void setExclude(String exclude) {
this.exclude = exclude;
}
public String getExclude() {
return exclude;
}
/**
* Tries to match rule path against the supplied DOM path.
* @param path DOM path to match against.
* @return <b>True</b> if rule matches the supplied path (matching done via regexp),
* <b>false</b> otherwise.
*/
public boolean matchRule(String path) {
if (null == path)
return false ;
path = path.toLowerCase();
if (null == this.exclude || this.exclude.isEmpty())
// no exclusion rules, simply match paths
return path.matches(this.path);
else if (path.matches(this.path)) {
// paths match, check if exclusion rule doesn't match
return (!path.matches(this.exclude));
}
else {
// main path doesn't match
return false ;
}
}
}
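To make the path semantics concrete, here is a small sketch; the rule values are invented, and the results follow from the constructor's anchoring of the path regexp above.

// hypothetical rule: TEXT content under html>body>div.comment, except div.ad subtrees
PofaDomainRule rule = new PofaDomainRule(PofaDomainRule.PofaPageElement.TEXT,
        "html>body>div\\.comment", "div\\.ad");
rule.matchRule("html>body>div.comment>p");        // true: suffix paths are accepted
rule.matchRule("html>body>div.comment>div.ad>p"); // false: exclusion rule matches
rule.matchRule("html>body>div.footer");           // false: main path does not match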
PofaEntityListBuilder.java
package com.wcs.pofa.entities;
import java.io.IOException;
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import org.json.JSONException;
import org.json.JSONObject;
import com.wcs.pofa.db.Neo4jDBInterface;
import com.wcs.pofa.db.Neo4jDBInterface.ServerErrorResponse;
@Deprecated
//TODO: needs revising
public class PofaEntityListBuilder {
private Neo4jDBInterface db;
public PofaEntityListBuilder(Neo4jDBInterface db) {
this.db = db;
}
/**
* Insert an entity into database.
* @param entityName
* @return hash of the added entity, or an empty string if it was not inserted
*/
public String addEntity(String entityName) {
String result = addEntityWithNoClustering(entityName);
if (result.length() > 0) {
ArrayList<String> item = new ArrayList<String>();
item.add(result);
performClustering(item);
}
return result;
}
/**
* Same as addEntity but does not perform clustering. Good for bulk insert.
* @param entityName
* @return hash of added entity or empty string if entity was not added
*/
private String addEntityWithNoClustering(String entityName) {
String result = "";
//--- query if entity exists ---
MessageDigest digest;
try {
digest = MessageDigest.getInstance("MD5");
digest.update(entityName.getBytes(),0, entityName.length());
String hash = new BigInteger(1, digest.digest()).toString(16);
if (db.queryNodeIndex("entity", "hash", hash).length() == 0) {
// new item, insert it
result = hash;
JSONObject text = new JSONObject();
text.put("content", entityName);
//TODO: append fact sheet
// put into DB
String node = db.createNode(text);
// connect to CLASSIFYENTITY node (require later classification)
//db.createRelationship(db.getRefNodeClassifyEntity(), node, "", new JSONObject());
//db.addToIndex(node, "ENTITYNAMEHASH", hash);
}
else {
}
} catch (NoSuchAlgorithmException e) {
// TODO Auto-generated catch block
e.printStackTrace();
//} catch (IOException e) {
// // TODO Auto-generated catch block
// e.printStackTrace();
} catch (JSONException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (ServerErrorResponse e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
return result;
}
/**
* Inserts many entities at once into database.
* @param entities
* @return ratio of added/requested entities
*/
public double addEntities(ArrayList<String> entities) {
if (entities.size() == 0)
return 0;
ArrayList<String> added = new ArrayList<String>();
for (int i = 0; i != entities.size(); i++) {
String newHash = addEntityWithNoClustering(entities.get(i));
if (newHash.length() > 0)
added.add(newHash);
}
performClustering(added);
return (double) added.size() / entities.size(); // cast avoids integer division
}
/**
* Performs clustering of specified entities (requires hashes).
* @param newEntities
*/
private void performClustering(ArrayList<String> newEntities) {
//TODO: stub method
//TODO: create a 2nd version of this method which queries CLASSIFYENTITY node's relatives
//      to get entities to classify
}
}
PofaEvents.java
package com.wcs.pofa.events;
import com.wcs.pofa.PofaAbstractCrawler;
import com.wcs.pofa.PofaAbstractDataminer;
import com.wcs.pofa.PofaAbstractSlicer;
import com.wcs.pofa.PofaDomainInfo;
/**
* Common interface for handling events.
* @author mikki
*/
public interface PofaEvents {
// crawler to controller events
public boolean onPageVisiting(PofaAbstractCrawler sender, String url);
public void onPageDownloaded(PofaAbstractCrawler sender, Object data);
// dataminer to main events
public void onNewPage(PofaAbstractDataminer sender, String url, String page,
PofaDomainInfo rule);
// slicer events
public void onEntityFound(PofaAbstractSlicer sender, String entityName);
}
PofaNeo4jDB.java
package com.wcs.pofa.db;
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.Hashtable;
import java.util.Iterator;
import java.util.Locale;
import org.json.JSONArray;
import org.json.JSONException;
import org.json.JSONObject;
import com.wcs.pofa.db.Neo4jDBInterface.ServerErrorResponse;
import com.wcs.pofa.db.Neo4jRelationship.Neo4jRelationshipDirection;
/**
* DB connection handler class for Neo4j database server for high-level access.
*/
public class PofaNeo4jDB {
private Neo4jDBInterface dbInterface;
private String refNodeTokenize;
private String refNodeClassifyEntity;
private String refNodeNewFeedback;
private Hashtable<String, String> languageNode = new Hashtable<String, String>();
/**
* Create a DB connection to a Neo4j server.
* @param databaseUrl
* @throws ServerErrorResponse
*/
public PofaNeo4jDB(String databaseUrl) throws ServerErrorResponse {
System.out.println("Initializing " + this.getClass().getName() + "...");
dbInterface = new Neo4jDBInterface(databaseUrl);
lookupReferenceNodes();
System.out.println(this.getClass().getName() + " initialized.");
}
/**
* Caches the reference nodes of the DB for faster access.
* @throws ServerErrorResponse
*/
private void lookupReferenceNodes() throws ServerErrorResponse {
// query for control nodes
ArrayList<Neo4jRelationship> controlRel = new ArrayList<Neo4jRelationship>();
controlRel.add(new Neo4jRelationship("CONTROL", Neo4jRelationshipDirection.OUT));
JSONArray answerArray = dbInterface.getNodeRelationships(dbInterface.getRootNode(),
controlRel);
for (int i = 0; i != answerArray.length(); i++) {
String nodeID;
try {
nodeID = ((String)((JSONObject)answerArray.get(i)).get("end"));
String nodeName = (String)dbInterface.getNodeProperty(nodeID, "name");
// FIXME: don't use hardcoded strings
if (nodeName.equals("TOKENIZE"))
refNodeTokenize = nodeID;
else if (nodeName.equals("CLASSIFYENTITY"))
refNodeClassifyEntity = nodeID;
else if (nodeName.equals("NEWFEEDBACK"))
refNodeNewFeedback = nodeID;
} catch (JSONException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
// query language nodes
controlRel.clear();
controlRel.add(new Neo4jRelationship("LANGUAGE", Neo4jRelationshipDirection.OUT));
answerArray = dbInterface.getNodeRelationships(dbInterface.getRootNode(), controlRel);
for (int i = 0; i != answerArray.length(); i++) {
String nodeID;
try {
nodeID = ((String)((JSONObject)answerArray.get(i)).get("end"));
String languageName = (String)dbInterface.getNodeProperty(nodeID, "language");
languageNode.put(languageName, nodeID);
} catch (JSONException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
// create reference nodes if they don't exist
try {
if (refNodeTokenize == null) {
// TOKENIZE
refNodeTokenize = dbInterface.createNode(new JSONObject().put("name", "TOKENIZE"));
dbInterface.createRelationship(dbInterface.getRootNode(), refNodeTokenize, "CONTROL",
new JSONObject());
}
if (refNodeClassifyEntity == null) {
// CLASSIFYENTITY
refNodeClassifyEntity = dbInterface.createNode(new JSONObject().put("name", "CLASSIFYENTITY"));
dbInterface.createRelationship(dbInterface.getRootNode(), refNodeClassifyEntity,
"CONTROL", new JSONObject());
}
if (refNodeNewFeedback == null) {
// NEWFEEDBACK
refNodeNewFeedback = dbInterface.createNode(new JSONObject().put("name", "NEWFEEDBACK"));
dbInterface.createRelationship(dbInterface.getRootNode(), refNodeNewFeedback,
"CONTROL", new JSONObject());
}
} catch (JSONException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
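// Resulting layout, per the method above: the root node has CONTROL relationships
// to the TOKENIZE, CLASSIFYENTITY and NEWFEEDBACK work-queue nodes, and LANGUAGE
// relationships to one node per language.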
/**
* Returns DB interface class for low-level DB access.
*/
public Neo4jDBInterface getDBInterface() {
return dbInterface;
}
/**
* Returns the 'root' node ID of the given language. Creates the language node if it doesn't
* exist.
* @param language
* @return
* @throws ServerErrorResponse
* @throws JSONException
*/
public String getLanguageNode(String language) throws JSONException, ServerErrorResponse {
if (languageNode.containsKey(language))
return languageNode.get(language);
else {
String nodeID = dbInterface.createNode(new JSONObject().put("language", language));
languageNode.put(language, nodeID);
return nodeID;
}
}
public String getRefNodeTokenize() {
return refNodeTokenize;
}
public String getRefNodeClassifyEntity() {
return refNodeClassifyEntity;
}
public String getRefNodeNewFeedback() {
return refNodeNewFeedback;
}
public String getRootNode() {
return dbInterface.getRootNode();
}
/**
* Queries DB index for nodes belonging to specified key/value pair.
* @param indexName name of index to query
* @param indexKey Index key
* @param indexValue Index value
* @param locale Locale for converting to lower-case
* @return JSONArray containing matching node IDs and their content.
*/
public JSONArray queryNodeIndex(String indexName, String indexKey, String indexValue,
Locale locale) {
try {
return dbInterface.queryNodeIndex(indexName, indexKey,
URLEncoder.encode(indexValue.toLowerCase(locale), "UTF-8"));
} catch (UnsupportedEncodingException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (ServerErrorResponse e) {
switch (e.getReturnCode()) {
case HttpURLConnection.HTTP_NOT_FOUND: return new JSONArray();
}
// TODO Auto-generated catch block
e.printStackTrace();
}
return new JSONArray();
}
/**
* Adds multiple expressions to the node index. Expressions are added with lower-case only
* if they are not already in index.
* @param indexName name of index to use
* @param indexKey Key name to store index under
* @param expressions list of strings to use as index values
* @param nodeToIndex Node to add to index
* @return true if node wasn't previously indexed
*/
public boolean addNodeToIndex(String indexName, String indexKey,
ArrayList<String> expressions, String nodeToIndex) {
boolean alreadyIndexed = true;
// add expressions
for (Iterator<String> i = expressions.iterator(); i.hasNext(); ) {
String expression = i.next();
if (!addNodeToIndex(indexName, indexKey, expression, nodeToIndex))
alreadyIndexed = false;
}
return !alreadyIndexed;
}
/**
* Adds a single item to the node index. If node is already indexed it won't be added again.
* Returns true if item wasn't previously indexed.
* @param indexName name of index to use
* @param indexKey Key name to store index under
* @param indexValue Index value to store (needs to be converted to lower-case for easier
* access)
* @param node Node ID to add to index
* @return True if item wasn't previously indexed.
*/
public boolean addNodeToIndex(String indexName, String indexKey, String indexValue, String
node) {
boolean alreadyIndexed = false;
try {
indexValue = URLEncoder.encode(indexValue, "UTF-8");
JSONArray indexHit;
indexHit = dbInterface.queryNodeIndex(indexName, indexKey, indexValue);
if (indexHit.length() != 0) {
//find out if node is indexed
for (int i = 0; i != indexHit.length(); i++) {
JSONObject indexedNode = indexHit.getJSONObject(i);
if (indexedNode.getString("node").equals(node)) {
alreadyIndexed = true;
break ;
}
}
}
if (!alreadyIndexed)
//node not indexed yet, add it
dbInterface.addNodeToIndex(indexName, node, indexKey, indexValue);
} catch (UnsupportedEncodingException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
} catch (ServerErrorResponse e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (JSONException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
return !alreadyIndexed;
}
/**
* Creates a new relationship in the database only if there is no existing typed relation
* in the same direction (properties aren't checked).
* If a relationship exists, its properties are replaced with the new properties if
* replaceProperties is set.
* @param from start node
* @param to end node
* @param type relationship type
* @param properties relationship properties
* @param replaceProperties tells if existing properties should be replaced
* @return ID of created/existing relationship
* @throws IOException
* @throws ServerErrorResponse
*/
public String createNewRelationship(String from, String to, String type,
JSONObject properties, boolean replaceProperties) throws IOException, ServerErrorResponse {
ArrayList<Neo4jRelationship> relationships = new ArrayList<Neo4jRelationship>();
relationships.add(new Neo4jRelationship(type, Neo4jRelationshipDirection.OUT));
JSONObject path = dbInterface.findShortestPath(from, to, relationships, 1);
JSONArray pathRels;
try {
pathRels = path.getJSONArray("relationships");
} catch (JSONException e) {
throw new IOException("Cannot check existing relationships between nodes " + from + "
and " + to);
}
if (pathRels.length() != 0) {
// update relationship
String relationshipID;
try {
relationshipID = pathRels.getString(0);
if (replaceProperties)
dbInterface.setRelationshipProperties(relationshipID, properties);
return relationshipID;
} catch (JSONException e) {
throw new IOException("Cannot get existing relationship between nodes " + from +
" and " + to);
}
}
else {
// create new relationship
return dbInterface.createRelationship(from, to, type, properties);
}
}
}
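A brief usage sketch of the high-level DB class; the node IDs are hypothetical placeholders, while the server URL and the APPLIES_TO relationship type appear elsewhere in this prototype.

// sketch only; opinionNodeID and entityNodeID are assumed to exist in the DB,
// and the thrown IOException/ServerErrorResponse must be handled by the caller
PofaNeo4jDB db = new PofaNeo4jDB("http://localhost:7474/db/data");
String relID = db.createNewRelationship(opinionNodeID, entityNodeID,
        "APPLIES_TO", new JSONObject(), true); // replaces properties if the edge exists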
PofaNotifier.java
package com.wcs.pofa.events;
import java.util.Iterator;
import java.util.ArrayList;
import com.wcs.pofa.PofaAbstractCrawler;
import com.wcs.pofa.PofaAbstractDataminer;
import com.wcs.pofa.PofaAbstractSlicer;
import com.wcs.pofa.PofaDomainInfo;
/**
* Common class for handling event notifications between modules.
*
* @author mikki
*/
// TODO: create separate notifiers for processes
public class PofaNotifier {
private ArrayList<PofaEvents> list;
/**
* Put an object into the notification chain
* @param callback an implementation of PofaEvents interface
*/
public synchronized void notifyRequest(PofaEvents callback) {
if (!list.contains(callback))
list.add(callback);
}
/**
* Remove specified object from notification chain
*/
public synchronized void notifyCancel(PofaEvents callback) {
list.remove(callback);
}
public PofaNotifier() {
list = new ArrayList<PofaEvents>();
}
public synchronized void onPageDownload(PofaAbstractCrawler sender, Object data) {
for (Iterator<PofaEvents> i = list.iterator(); i.hasNext(); )
i.next().onPageDownloaded(sender, data);
}
public synchronized boolean onPageVisiting(PofaAbstractCrawler sender, String url) {
boolean result = false;
for (Iterator<PofaEvents> i = list.iterator(); i.hasNext(); )
if (i.next().onPageVisiting(sender, url))
result = true;
return result;
}
public synchronized void onEntityFound(PofaAbstractSlicer sender, String entityName) {
for (Iterator<PofaEvents> i = list.iterator(); i.hasNext(); )
i.next().onEntityFound(sender, entityName);
}
public synchronized void onNewPage(PofaAbstractDataminer sender, String url, String page,
PofaDomainInfo rule) {
for (Iterator<PofaEvents> i = list.iterator(); i.hasNext(); )
i.next().onNewPage(sender, url, page, rule);
}
}
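A wiring sketch for the notifier (the configuration file name is a made-up example): modules implementing PofaEvents register themselves via notifyRequest() and afterwards receive every broadcast.

PofaNotifier notifier = new PofaNotifier();
// PofaDataminerController registers itself with the notifier in its constructor
PofaEvents dataminer = new PofaDataminerController(notifier, "crawler_settings.xml");
// every registered listener now receives this event
notifier.onEntityFound(null, "Example Phone X");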
PofaPreprocessController.java
package com.wcs.pofa;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStreamWriter;
import java.text.DecimalFormat;
import java.text.NumberFormat;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.json.JSONArray;
import org.json.JSONException;
import org.json.JSONObject;
import com.wcs.pofa.db.Neo4jDBInterface;
import com.wcs.pofa.db.Neo4jRelationship;
import com.wcs.pofa.db.PofaNeo4jDB;
import com.wcs.pofa.db.Neo4jDBInterface.Neo4jTraverseOrder;
import com.wcs.pofa.db.Neo4jDBInterface.Neo4jTraverseResult;
import com.wcs.pofa.db.Neo4jDBInterface.Neo4jTraverseReturnFilter;
import com.wcs.pofa.db.Neo4jDBInterface.Neo4jTraverseUniqueness;
import com.wcs.pofa.db.Neo4jDBInterface.ServerErrorResponse;
import com.wcs.pofa.db.Neo4jRelationship.Neo4jRelationshipDirection;
import com.wcs.pofa.slicer.PofaSlicer;
import com.wcs.pofa.tokenizer.PofaTokenizer;
public class PofaPreprocessController {
private PofaNeo4jDB db;
private Neo4jDBInterface dbInterface;
private PofaTokenizer tokenizer;
private PofaStopWords stopWords;
public PofaPreprocessController(PofaNeo4jDB db) {
System.out.println("Initializing " + this.getClass().getName() + "...");
this.db = db;
this.dbInterface = db.getDBInterface();
this.tokenizer = new PofaTokenizer();
this.stopWords = new PofaStopWords();
System.out.println(this.getClass().getName() + " initialized.");
}
/**
* Starts the preprocessing process
*/
public void start() {
// TODO: add a timer which repeatedly checks db node TOKENIZE for new input
processNewInput();
}
private void processNewInput() {
// query db for untokenized nodes
String[] relTypes = new String[1];
relTypes[0] = "";
try {
System.out.println("Query untokenized opinions...");
ArrayList<Neo4jRelationship> relationships = new ArrayList<Neo4jRelationship>();
relationships.add(new Neo4jRelationship("", Neo4jRelationshipDirection.OUT));
JSONArray nodes = dbInterface.traverse(
db.getRefNodeTokenize(),
Neo4jTraverseResult.NODE,
Neo4jTraverseOrder.DEPTH_FIRST,
Neo4jTraverseUniqueness.NODE_PATH,
relationships,
null,
Neo4jTraverseReturnFilter.ALL_BUT_START_NODE,
1
);
relationships.clear();
relationships.add(new Neo4jRelationship("", Neo4jRelationshipDirection.IN));
for (int i = 0; i != nodes.length(); i++) {
JSONObject node = nodes.getJSONObject(i);
String nodeID = node.getString("node");
JSONObject metadata = node.getJSONObject("data");
//TODO: this is just for testing
if (metadata.getString("url").matches(".*mobilarena.*")) {
System.out.println("Skipping " + (i + 1) + "/" + nodes.length() + " [" + nodeID +
"]");
continue ;
}
// get node's raw HTML content
String content = metadata.getString("content");
content = PofaSlicer.stripHTML(content);
System.out.println("Processing " + (i + 1) + "/" + nodes.length() + " [" + nodeID +
"]...");
if (content.length() > 80)
System.out.println("
\"" + content.substring(0, 79) + "...\"");
else
System.out.println("
\"" + content + "\"");
// detect language
Locale language = tokenizer.languageDetect(content);
// store tokens
try {
long t1 = System.currentTimeMillis();
storeTokens(nodeID, language, content, 3);
long t2 = System.currentTimeMillis();
System.out.println("
time to store tokens: " + (t2 - t1) + " ms");
// add language as metadata to 'node'
metadata.put("language", language);
// update node
dbInterface.setNodeProperties(nodeID, metadata);
JSONArray toTokenize = dbInterface.getNodeRelationships(nodeID, relationships);
for (int j = 0; j != toTokenize.length(); j++) {
JSONObject rel = toTokenize.getJSONObject(j);
if (!rel.getString("type").equals(""))
continue ;
// remove node from TOKENIZE node
String relationshipID = rel.getString("relationship");
dbInterface.removeRelationship(relationshipID);
}
} catch (ClassNotFoundException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
System.out.println("Done.");
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (JSONException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (InstantiationException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IllegalAccessException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (ServerErrorResponse e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
private void storeTokens(String textNodeID, Locale language, String content,
int maxExpressionLength)
throws ClassNotFoundException, InstantiationException,
IllegalAccessException, IOException {
final String tokenIndexName = "token_" + language.getLanguage().toLowerCase();
final String relationTOKEN = "TOKEN_" + language.getLanguage().toUpperCase();
final String relationCONTAINS = "CONTAINS_" + language.getLanguage().toUpperCase();
HashSet<String> updatedTokens = new HashSet<String>();
ArrayList<String> expressions =
tokenizer.composeSubExpressions(
tokenizer.splitAtSeparators(content, language),
language, 1, maxExpressionLength, true, true
);
ArrayList<String> stems = PofaTokenizer.stemExpressions(expressions, language);
System.out.println("
document has " + expressions.size() + " tokens, processing...");
for (int i = 0; i != expressions.size(); i++) {
String token = expressions.get(i);
int wordCount = PofaTokenizer.countWords(token);
String stem = stems.get(i);
// query DB for token
JSONArray indexHit;
try {
indexHit = db.queryNodeIndex(tokenIndexName, "stem", stem, language);
if (indexHit.length() == 0) {
// new token
String tokenNode;
tokenNode = dbInterface.createNode(new JSONObject().
put("token", token).
put("stem", stem).
put("length", wordCount).
put("documents", 1).
put("count", 1)
);
// connect token to root node
dbInterface.createRelationship(db.getRootNode(), tokenNode, relationTOKEN,
new JSONObject());
// connect token to original text
dbInterface.createRelationship(textNodeID, tokenNode, relationCONTAINS,
new JSONObject());
// add token indices
db.addNodeToIndex(tokenIndexName, "stem", stem, tokenNode);
db.addNodeToIndex(tokenIndexName, "expression", token, tokenNode);
}
else {
// add token to stem
if (db.queryNodeIndex(tokenIndexName, "expression", token, language).length() == 0)
{
// add token to stem
JSONObject stemNode = indexHit.getJSONObject(0);
String stemNodeID = stemNode.getString("node");
JSONObject properties = stemNode.getJSONObject("data").accumulate("token", token);
dbInterface.setNodeProperties(stemNodeID, properties);
db.addNodeToIndex(tokenIndexName, "expression", token, stemNodeID);
}
String tokenNode = indexHit.getJSONObject(0).getString("node");
// update token counters
JSONObject properties = indexHit.getJSONObject(0).getJSONObject("data");
properties.put("count", properties.getInt("count") + 1);
if (!updatedTokens.contains(token))
properties.put("documents", properties.getInt("documents") + 1);
dbInterface.setNodeProperties(tokenNode, properties);
// connect token to original text (if not already connected)
db.createNewRelationship(textNodeID, tokenNode, relationCONTAINS, new JSONObject(),
false);
}
updatedTokens.add(token);
} catch (ServerErrorResponse e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (JSONException e) {
e.printStackTrace();
}
}
}
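// Net effect of storeTokens() on the graph, per the calls above: the root node gets a
// TOKEN_<LANG> edge to each distinct token node, the processed text node gets a
// CONTAINS_<LANG> edge to every token it contains, and each token node carries "count"
// (total occurrences) and "documents" (number of distinct texts) counters plus "stem"
// and "expression" entries in the per-language token index.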
private void addWhitespaces() {
// query db for untokenized nodes
final int minGramLength = 2;
final int maxGramLength = 5;
final int minOccurrence = 250;
try {
// ngrams:
// - String: ngram content
// - Pair<Integer, ..>: number of ngram occurrences
// - number of whitespaces BEFORE ngram
// - number of whitespaces AFTER ngram
FileOutputStream fos;
OutputStreamWriter out;
HashMap<String, Pair<Integer, Pair<Integer, Integer>>> ngrams = new HashMap<String,
Pair<Integer, Pair<Integer, Integer>>>();
try {
System.out.println("Reading ngrams...");
ObjectInputStream oin = new ObjectInputStream(new FileInputStream("ngrams_hu" +
minGramLength + "-" + maxGramLength + ".dat"));
try {
ngrams = (HashMap<String, Pair<Integer, Pair<Integer, Integer>>>)oin.readObject();
} catch (ClassNotFoundException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
oin.close();
} catch (FileNotFoundException e) {
System.out.println("Query opinions...");
ArrayList<Neo4jRelationship> relationships = new ArrayList<Neo4jRelationship>();
relationships.add(new Neo4jRelationship("OPINION", Neo4jRelationshipDirection.OUT));
JSONArray nodes = dbInterface.traverse(
db.getRootNode(),
Neo4jTraverseResult.NODE,
Neo4jTraverseOrder.DEPTH_FIRST,
Neo4jTraverseUniqueness.NODE_PATH,
relationships,
null,
Neo4jTraverseReturnFilter.ALL_BUT_START_NODE,
1
);
System.out.println("Building ngrams...");
fos = new FileOutputStream("input texts.txt");
out = new OutputStreamWriter(fos, "UTF-8");
for (int i = 0; i != nodes.length(); i++) {
JSONObject node = nodes.getJSONObject(i);
JSONObject metadata = node.getJSONObject("data");
// get node's raw HTML content
String content = metadata.getString("content");
// clean text
content = PofaSlicer.stripHTML(content);
Locale language = tokenizer.languageDetect(content);
if (!language.getLanguage().equals("hu")) continue;
out.write(content + "\n");
// create n-grams
for (int gramSize = minGramLength; gramSize <= maxGramLength; gramSize++) {
int limit = content.length() - gramSize - 2;
for (int j = 0; j <= limit; j++) {
ArrayList<String> grams = getNGram(content, language, j, gramSize + 2);
if (grams.size() != 0) {
String window = grams.get(0);
String charBefore;
String charAfter;
String gram;
if (j > 0 && j < limit) {
charBefore = window.substring(0, 1);
charAfter = window.substring(gramSize + 1);
gram = window.substring(1, gramSize + 1);
}
else {
if (j == 0) {
charBefore = " ";
charAfter = window.substring(gramSize, gramSize + 1);
gram = window.substring(0, gramSize);
}
else {
charBefore = window.substring(1, 2);
charAfter = " ";
gram = window.substring(2, gramSize + 2);
}
}
// process n-gram
Pair<Integer, Pair<Integer, Integer>> counters = ngrams.get(gram);
if (null == counters) {
counters = new Pair<Integer, Pair<Integer, Integer>>(0, new Pair<Integer,
Integer>(0, 0));
ngrams.put(gram, counters);
}
Pair<Integer, Integer> wsCount = counters.getSecond();
// increase occurrence count
counters.setFirst(counters.getFirst() + 1);
// preceding whitespace
if (charBefore.matches("\\s"))
wsCount.setFirst(wsCount.getFirst() + 1);
// following whitespace
if (charAfter.matches("\\s"))
wsCount.setSecond(wsCount.getSecond() + 1);
counters.setSecond(wsCount);
}
}
}
if (i % 100 == 0)
System.out.println((i + 1) + "/" + nodes.length());
}
ObjectOutputStream oout = new ObjectOutputStream(new FileOutputStream("ngrams_hu" +
minGramLength + "-" + maxGramLength + ".dat"));
oout.writeObject(ngrams);
oout.close();
}
///*
System.out.println("Writing to file...");
fos = new FileOutputStream("test.txt");
out = new OutputStreamWriter(fos, "UTF-8");
NumberFormat formatter = new DecimalFormat("0.00");
for (Iterator<String> i = ngrams.keySet().iterator(); i.hasNext(); ) {
String gram = i.next();
Pair<Integer, Pair<Integer, Integer>> counters = ngrams.get(gram);
Pair<Integer, Integer> wsCount = counters.getSecond();
double preceding = 0.0;
double following = 0.0;
if (wsCount.getFirst() > 0)
preceding = (1.0 * wsCount.getFirst()) / (1.0 * counters.getFirst()) * 100.0;
if (wsCount.getSecond() > 0)
following = (1.0 * wsCount.getSecond()) / (1.0 * counters.getFirst()) * 100.0;
if (counters.getFirst() >= minOccurrence && (preceding > 0.5 || following > 0.5)) {
out.write(gram
+
"\t"
+
formatter.format(preceding)
+
"\t"
+
formatter.format(following) + "\t" + counters.getFirst() + "\n");
}
}
out.close();
fos.close();
//*/
System.out.println("Sample sentence:");
Locale language = new Locale("hu");
fos = new FileOutputStream("sample_sentence" + minGramLength + "-" + maxGramLength +
".txt");
out = new OutputStreamWriter(fos, "UTF-8");
final String good = "Nagyon meg vagyok elégedve a telefonommal! Szerintem nagyon szuper :)";
int bestMatch = 9999999;
String best = "";
for (int insertLimit = 100; insertLimit <= 100; insertLimit += 1) {
for (int removeLimit = 100; removeLimit >= 100; removeLimit -= 5) {
String testText = "Nagyon megvagyok el égedve a telefonommal!Szerintemn agyon szuper:)";
//String testText = "Nagyonmegvagyokelégedveatelefonommal!Szerintemnagyonszuper:)";
// condense multiple whitespaces to 1 space
testText = testText.replaceAll("\\s+", " ");
int spaceBefore[] = new int[testText.length()];
int totalNgrams[] = new int[testText.length()];
String result = testText;
for (int i = testText.length() - 1; i >= 0; i--) {
int wsOccurrenceCount = 0;
int noWSCount = 0;
int totalNGramCount = 0;
// calculate whitespace occurrence probability
// preceding n-gram
for (int gramSize = minGramLength; gramSize <= Math.min(maxGramLength, i); gramSize++) {
ArrayList<String> grams = getNGram(testText, language, i + 1, -gramSize);
if (grams.size() > 0) {
String gram = grams.get(0);
Pair<Integer, Pair<Integer, Integer>> counters = ngrams.get(gram);
if (null != counters) {
Pair<Integer, Integer> wsCount = counters.getSecond();
totalNGramCount += counters.getFirst();
wsOccurrenceCount += wsCount.getSecond();
}
//break;
}
}
// following n-gram
for (int gramSize = minGramLength; gramSize <= Math.min(maxGramLength, i); gramSize++) {
ArrayList<String> grams = getNGram(testText, language, i + 1, gramSize);
if (grams.size() > 0) {
String gram = grams.get(0);
Pair<Integer, Pair<Integer, Integer>> counters = ngrams.get(gram);
if (null != counters) {
Pair<Integer, Integer> wsCount = counters.getSecond();
totalNGramCount += counters.getFirst();
wsOccurrenceCount += wsCount.getFirst();
}
//break;
}
}
/*
// calculate non-whitespace probability
for (int gramSize = Math.max(2, minGramLength); gramSize <= Math.min(maxGramLength, i); gramSize++) {
if (gramSize % 2 == 0)
continue;
int halfLength = gramSize / 2;
for (int j = i - halfLength; j < i + halfLength; j++) {
if (j < 0 || j > testText.length() - gramSize)
continue;
ArrayList<String> grams = getNGram(testText, language, j, gramSize);
if (grams.size() > 0) {
String gram = grams.get(0);
if (gram.substring(gramSize - (j - i + halfLength) - 1, gramSize - (j - i +
halfLength)).matches("\\s"))
continue;
Pair<Integer, Pair<Integer, Integer>> counters = ngrams.get(gram);
if (null != counters) {
totalNGramCount += counters.getFirst();
noWSCount += counters.getFirst();
}
}
}
}*/
spaceBefore[i] += wsOccurrenceCount;
totalNgrams[i] += totalNGramCount;
}
for (int i = 0; i != testText.length(); i++)
System.out.println(testText.charAt(i) + "\t" + spaceBefore[i] + "\t" +
totalNgrams[i] + "\t" + (100.0 * spaceBefore[i] / totalNgrams[i]) + "%");
/*
HashSet<String> results = new HashSet<String>();
while (true) {
testText = result;
for (int i = testText.length() - 1; i >= 0; i--) {
int wsOccurrenceCount = 0;
int noWSCount = 0;
int totalNGramCount = 0;
// calculate whitespace occurrence probability
// preceding n-gram
for (int gramSize = minGramLength; gramSize <= Math.min(maxGramLength, i); gramSize++) {
ArrayList<String> grams = getNGram(testText, language, i + 1, -gramSize);
if (grams.size() > 0) {
String gram = grams.get(0);
Pair<Integer, Pair<Integer, Integer>> counters = ngrams.get(gram);
if (null != counters) {
Pair<Integer, Integer> wsCount = counters.getSecond();
totalNGramCount += counters.getFirst();
wsOccurrenceCount += wsCount.getSecond();
}
//break;
}
}
// following n-gram
for (int gramSize = minGramLength; gramSize <= Math.min(maxGramLength, i); gramSize++) {
ArrayList<String> grams = getNGram(testText, language, i + 1, gramSize);
if (grams.size() > 0) {
String gram = grams.get(0);
Pair<Integer, Pair<Integer, Integer>> counters = ngrams.get(gram);
if (null != counters) {
Pair<Integer, Integer> wsCount = counters.getSecond();
totalNGramCount += counters.getFirst();
wsOccurrenceCount += wsCount.getFirst();
}
//break;
}
}
// calculate non-whitespace probability
for (int gramSize = Math.max(2, minGramLength); gramSize <= Math.min(maxGramLength, i); gramSize++) {
if (gramSize % 2 == 0)
continue;
int halfLength = gramSize / 2;
for (int j = i - halfLength; j < i + halfLength; j++) {
if (j < 0 || j > testText.length() - gramSize)
continue;
ArrayList<String> grams = getNGram(testText, language, j, gramSize);
if (grams.size() > 0) {
String gram = grams.get(0);
if (gram.substring(gramSize - (j - i + halfLength) - 1, gramSize - (j - i
+ halfLength)).matches("\\s"))
continue;
Pair<Integer, Pair<Integer, Integer>> counters = ngrams.get(gram);
if (null != counters) {
totalNGramCount += counters.getFirst();
noWSCount += counters.getFirst();
}
}
}
}
double wsChance = (1.0 * wsOccurrenceCount) / (1.0 * totalNGramCount);
if (wsChance >= insertLimit * 0.01) {
result = result.substring(0, i + 1) + " " + result.substring(i + 1);
}
}
result = result.replaceAll("\\s+", " ");
if (results.contains(result))
break;
results.add(result);
}
*/
/*
//while (true) {
testText = result;
for (int i = testText.length() - 1; i >= 0; i--) {
double chanceToInsert = 0.0; // before i-th char
int weightedAverageDivisor = 0;
String position = testText.substring(0, i + 1); //FIXME: only for debugging position display
// preceding n-grams
for (int gramSize = minGramLength; gramSize <= Math.min(maxGramLength, i); gramSize++) {
ArrayList<String> grams = getNGram(testText, language, i + 1, -gramSize);
for (Iterator<String> j = grams.iterator(); j.hasNext(); ) {
String gram = j.next();
Pair<Integer, Pair<Integer, Integer>> counters = ngrams.get(gram);
if (null != counters) {
Pair<Integer, Integer> wsCount = counters.getSecond();
if (wsCount.getSecond() >= minOccurrence) {
// calculate chances
int weight = 1; //gramSize; // * wsCount.getSecond();
weightedAverageDivisor += weight;
chanceToInsert += weight * ((1.0 * wsCount.getSecond()) / (1.0 *
counters.getFirst()));
}
}
}
}
*
if (testText.substring(i, i + 1).matches("\\s")) {
// check if whitespace can be erased
// following n-grams
for (int gramSize = minGramLength; gramSize <= Math.min(maxGramLength,
result.length() - (i + 1) - 1); gramSize++) {
ArrayList<String> grams = getNGram(testText, language, i, -gramSize);
for (Iterator<String> j = grams.iterator(); j.hasNext(); ) {
String gram = j.next();
Pair<Integer, Pair<Integer, Integer>> counters = ngrams.get(gram);
if (null != counters) {
Pair<Integer, Integer> wsCount = counters.getSecond();
if (wsCount.getFirst() >= minOccurrence) {
// calculate chances
int weight = 1; //gramSize; // * wsCount.getFirst();
weightedAverageDivisor += weight;
chanceToInsert += weight * ((1.0 * wsCount.getFirst()) / (1.0 *
counters.getFirst()));
}
}
}
}
// check remove
if (weightedAverageDivisor > 0) {
chanceToInsert /= (1.0 * weightedAverageDivisor);
//if (chanceToInsert < (removeLimit * 0.01))
// result = result.substring(0, i) + result.substring(i + 1);
}
}
else {
// check if whitespace needs to be added
// following x-grams
for (int gramSize = minGramLength; gramSize <= Math.min(maxGramLength,
result.length() - (i + 1)); gramSize++) {
ArrayList<String> grams = getNGram(testText, language, i, -gramSize);
for (Iterator<String> j = grams.iterator(); j.hasNext(); ) {
String gram = j.next();
Pair<Integer, Pair<Integer, Integer>> counters = ngrams.get(gram);
if (null != counters) {
Pair<Integer, Integer> wsCount = counters.getSecond();
if (wsCount.getFirst() >= 10) {
// calculate chances
int weight = 1; //gramSize; // * wsCount.getFirst();
weightedAverageDivisor += weight;
chanceToInsert += weight * ((1.0 * wsCount.getFirst()) / (1.0 *
counters.getFirst()));
}
}
}
}
// check insert
if (weightedAverageDivisor > 0) {
chanceToInsert /= (1.0 * weightedAverageDivisor);
if (chanceToInsert >= (insertLimit * 0.01))
result = result.substring(0, i + 1) + " " + result.substring(i + 1);
}
}
}
result = result.replaceAll("\\s+", " ");
if (results.contains(result))
break;
results.add(result);
//}
*/
// condense spaces
int similarity = calculateStringSimilarity(result, good, language);
if (similarity < bestMatch) {
//System.out.println(" " + insertLimit + "%/" + removeLimit + "%: \"" + result +
"\" " + similarity);
best = "
" + insertLimit + "%/" + removeLimit + "%: \"" + result + "\" " +
similarity;
bestMatch = similarity;
}
System.out.println("
" + insertLimit + "%/" + removeLimit + "%: \"" + result +
"\"");
//out.write("
" + insertLimit + "%/" + removeLimit + "%: \"" + result + "\"\n");
//System.out.println("
" + insertLimit + "%: \"" + result + "\"");
}
}
System.out.println("best " + minOccurrence + ": " + best);
out.close();
fos.close();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (JSONException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (ServerErrorResponse e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
System.out.println("Done.");
}
/**
* Returns the requested length N-grams from the input string.
* @param document Document to get N-gram from
* @param locale Locale used for lower-casing the extracted grams
* @param startPos starting position for extraction
* @param length length of requested n-gram (negative values extract <b>before</b>
* <code>startPos</code>)
* @return the list of requested length n-grams
*/
public static ArrayList<String> getNGram(String document, Locale locale, int startPos, int
length) {
ArrayList<String> result = new ArrayList<String>();
if (length >= 1) {
// look ahead
int nextPos = startPos + length;
if (nextPos <= document.length()) {
String gram = document.substring(startPos, nextPos).toLowerCase(locale);
result.add(gram);
int i = 0;
while (i != gram.length() && nextPos != document.length()) {
switch (gram.charAt(i)) {
case ' ':
//FALLTHROUGH
case '\t':
//FALLTHROUGH
case '\n':
//FALLTHROUGH
case '\r':
// remove whitespace and add a new char to the end
gram = gram.substring(0, i) + gram.substring(i + 1) + document.charAt(nextPos);
nextPos++;
result.add(gram.toLowerCase(locale));
break ;
default :
i++;
}
}
}
}
else if (length <= -1) {
// look behind
int nextPos = startPos + length;
if (nextPos >= 0) {
String gram = document.substring(nextPos, startPos);
nextPos--;
result.add(gram);
int i = gram.length() - 1;
while (i >= 0 && nextPos >= 0) {
switch (gram.charAt(i)) {
case ' ':
//FALLTHROUGH
case '\t':
//FALLTHROUGH
case '\n':
//FALLTHROUGH
case '\r':
// remove whitespace and add a new char to the front
gram = document.charAt(nextPos) + gram.substring(0, i) + gram.substring(i + 1);
nextPos--;
result.add(gram.toLowerCase(locale));
break ;
default :
i--;
}
}
}
}
return result;
}
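// Worked example, traced from the code above: for document "abc def",
// getNGram(document, locale, 0, 4) first records the raw window "abc " and then,
// since the window contains a whitespace, the condensed variant "abcd" (the space
// is dropped and the next document character is appended), returning ["abc ", "abcd"].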
public static int calculateStringSimilarity(String string1, String string2, Locale locale) {
int errors = 10 + Math.abs(69 - string1.length());
//"Nagyon megvagyok el �gedve a telefonommal!Szerintemn agyon szuper:)"
if (string1.matches("Nagyon meg vagyok el�gedve a telefonommal! Szerintem nagyon szuper
:\\)"))
return 0;
if (string1.matches("^Nagyon ")) errors--;
if (string1.matches(".* meg .*")) errors--;
if (string1.matches(".* vagyok .*")) errors--;
if (string1.matches(".* el�gedve .*")) errors--;
if (string1.matches(".* a .*")) errors--;
if (string1.matches(".* telefonommal! .*")) errors--;
if (string1.matches(".* Szerintem .*")) errors--;
if (string1.matches(".* nagyon .*")) errors--;
if (string1.matches(".* szuper .*")) errors--;
if (string1.matches(".* :\\)$")) errors--;
string1 = string1.toLowerCase(locale);
string2 = string2.toLowerCase(locale);
int score = 0;
int n = string1.length();
int m = string2.length();
if (0 == n) score = m;
else if (0 == m) score = n;
else {
int edits[][] = new int[n + 1][m + 1];
for (int i = 0; i <= n; i++)
edits[i][0] = i;
for (int i = 0; i <= m; i++)
edits[0][i] = i;
for (int i = 0; i != n; i++) {
for (int j = 0; j != m; j++) {
int cost = 0;
if (string1.charAt(i) != string2.charAt(j))
cost = 1;
edits[i + 1][j + 1] = cost + Math.min(Math.min(edits[i][j + 1], edits[i][j]),
edits[i + 1][j]);
}
}
score = edits[n][m];
}
return errors + score;
}
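// Note: beyond the hand-tuned "errors" penalties above, the core of this method is a
// standard Levenshtein edit distance. Example: "kitten" vs. "sitting" yields a
// distance of 3 (substitute k->s, substitute e->i, append g).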
public void calculateOpinionMeasures() {
Neo4jDBInterface dbInterface = db.getDBInterface();
System.out.println("Query opinions...");
ArrayList<Neo4jRelationship> relationships = new ArrayList<Neo4jRelationship>();
relationships.add(new Neo4jRelationship("OPINION", Neo4jRelationshipDirection.OUT));
try {
JSONArray nodes = dbInterface.traverse(
db.getRootNode(),
Neo4jTraverseResult.NODE,
Neo4jTraverseOrder.DEPTH_FIRST,
Neo4jTraverseUniqueness.NODE_PATH,
relationships,
null,
Neo4jTraverseReturnFilter.ALL_BUT_START_NODE,
1
);
System.out.println("Processing...");
FileOutputStream fos = new FileOutputStream("measures.txt");
OutputStreamWriter out = new OutputStreamWriter(fos, "UTF-8");
for (int i = 0; i != nodes.length(); i++) {
JSONObject node;
try {
node = nodes.getJSONObject(i);
JSONObject metadata = node.getJSONObject("data");
// get node's raw HTML content
String content = metadata.getString("content");
// clean text
content = PofaSlicer.stripHTML(content);
// detect language
Locale language = tokenizer.languageDetect(content);
if (!language.getLanguage().equals("hu")) continue;
// tokenize
ArrayList<String> tokens = PofaTokenizer.tokenize(content, language);
ArrayList<String> stems = PofaTokenizer.stemTokens(tokens, language);
// compute stats
int TotalWords = tokens.size();
int TotalSentences = 1;
int TotalSyllables = 0;
int TotalComplexWords = 0;
int TotalSpecials = 0; // smileys and other weird tokens
for (int j = 0; j != tokens.size(); j++) {
String token = tokens.get(j);
String stem = stems.get(j);
if (PofaTokenizer.isSeparator(token)) {
if (token.matches("\\.+|\\?+|\\!+"))
TotalSentences++;
if (token.length() > 1)
TotalSpecials++;
}
else {
TotalSyllables += (token.length() -
token.replaceAll("[aáeéiíoóöőuúüűAÁEÉIÍOÓÖŐUÚÜŰ]", "").length());
if (stem.length() -
stem.replaceAll("[aáeéiíoóöőuúüűAÁEÉIÍOÓÖŐUÚÜŰ]", "").length() > 3)
TotalComplexWords++;
}
}
// write results
out.write(
node.getInt("node") + "\t" +
TotalWords + "\t" +
TotalSentences + "\t" +
TotalSyllables + "\t" +
TotalComplexWords + "\t" +
TotalSpecials + "\t" +
"\"" + content + "\"\n"
);
} catch (JSONException e) {
} catch (ClassNotFoundException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (InstantiationException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IllegalAccessException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
if (i % 100 == 99)
System.out.println((i + 1) + "/" + nodes.length());
}
} catch (IOException e) {
e.printStackTrace();
} catch (ServerErrorResponse e) {
e.printStackTrace();
}
}
public static void main(String argv[]) throws ServerErrorResponse {
new PofaTokenizer();
PofaNeo4jDB db = new PofaNeo4jDB("http://localhost:7474/db/data");
PofaPreprocessController ppc = new PofaPreprocessController(db);
ppc.start();
//ppc.addWhitespaces();
//ppc.calculateOpinionMeasures();
/*
// query all opinions for DB export
Neo4jDBInterface dbInterface = db.getDBInterface();
ArrayList<Neo4jRelationship> relationships = new ArrayList<Neo4jRelationship>();
relationships.add(new Neo4jRelationship("OPINION", Neo4jRelationshipDirection.OUT));
try {
JSONArray opinionNodes = dbInterface.traverse(db.getRootNode(),
Neo4jTraverseResult.NODE,
Neo4jTraverseOrder.BREADTH_FIRST,
Neo4jTraverseUniqueness.NODE,
relationships,
"",
Neo4jTraverseReturnFilter.ALL_BUT_START_NODE,
1);
FileOutputStream fos = new FileOutputStream("opinion_export_en.csv");
OutputStreamWriter out = new OutputStreamWriter(fos, "UTF-8");
relationships.clear();
relationships.add(new Neo4jRelationship("APPLIES_TO", Neo4jRelationshipDirection.OUT));
int opinionID = 0;
for (int i = 0; i != opinionNodes.length(); i++) {
try {
JSONObject opinionNode = opinionNodes.getJSONObject(i);
String opinion = opinionNode.getJSONObject("data").getString("content");
opinion = PofaSlicer.stripHTML(opinion);
Locale language = ppc.tokenizer.languageDetect(opinion);
if (!language.getLanguage().equals("en"))
continue;
if (opinion.length() < 30 || opinion.length() > 500)
continue;
if (opinion.matches("^(@.*|Specifikáció.*)$"))
continue;
JSONArray categories = dbInterface.traverse(
opinionNode.getString("node"),
Neo4jTraverseResult.NODE,
Neo4jTraverseOrder.DEPTH_FIRST,
Neo4jTraverseUniqueness.NODE,
relationships,
null,
Neo4jTraverseReturnFilter.ALL_BUT_START_NODE,
1);
StringBuilder categoryItems = new StringBuilder();
for (int j = 0; j != categories.length(); j++) {
String categoryName = categories.getJSONObject(j).getJSONObject("data").getString("name");
categoryName = categoryName.replaceAll("Comments", ""); // assign back: String is immutable
if (0 == categoryName.length() || categoryName.matches("Home page|Főoldal"))
continue;
if (categoryItems.length() > 0)
categoryItems.append(";");
categoryItems.append(categoryName);
}
//out.write(opinionNode.getString("node") + "\t" + language.getLanguage() + "\t" +
//          opinion.length() + "\t" + opinion + "\n");
//opinion.replaceAll("\"", "\\\"");
out.write(opinionID + "\t" + opinionNode.getString("node") + "\t" +
language.getLanguage() + "\t" + categoryItems + "\t" + opinion + "\t0\t0\t0\n");
opinionID++;
} catch (JSONException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
"\t"
+
System.out.println("Written " + opinionID + " lines");
out.close();
fos.close();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
*/
/*
ArrayList<Neo4jRelationship> relationships = new ArrayList<Neo4jRelationship>();
relationships.add(new Neo4jRelationship("", Neo4jRelationshipDirection.OUT));
try {
JSONArray nodes = db.traverse(db.getRefNodeTokenize(),
Neo4jTraverseResult.NODE,
Neo4jTraverseOrder.DEPTH_FIRST,
relationships, null, Neo4jTraverseReturnFilter.ALL_BUT_START_NODE, 1);
System.out.println(nodes.length());
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
*/
//new PofaPreprocessController(db).start();
/*
ArrayList<Neo4jRelationship> relationships = new ArrayList<Neo4jRelationship>();
relationships.add(new Neo4jRelationship("", Neo4jRelationshipDirection.OUT));
JSONArray nodes = db.traverse(db.getRefNodeTokenize(),
Neo4jTraverseResult.NODE,
Neo4jTraverseOrder.DEPTH_FIRST,
relationships, null, Neo4jTraverseReturnFilter.ALL_BUT_START_NODE, 1);
System.out.println(nodes.length());
*/
}
}
PofaQueryEngine.java
package com.wcs.pofa.query;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Locale;
import java.util.regex.Pattern;
import org.json.JSONArray;
import org.json.JSONException;
import org.json.JSONObject;
import com.wcs.pofa.Pair;
import com.wcs.pofa.db.Neo4jDBInterface;
import com.wcs.pofa.db.Neo4jRelationship;
import com.wcs.pofa.db.PofaNeo4jDB;
import com.wcs.pofa.db.Neo4jDBInterface.Neo4jTraverseOrder;
import com.wcs.pofa.db.Neo4jDBInterface.Neo4jTraverseResult;
import com.wcs.pofa.db.Neo4jDBInterface.Neo4jTraverseReturnFilter;
import com.wcs.pofa.db.Neo4jDBInterface.Neo4jTraverseUniqueness;
import com.wcs.pofa.db.Neo4jDBInterface.ServerErrorResponse;
import com.wcs.pofa.db.Neo4jRelationship.Neo4jRelationshipDirection;
import com.wcs.pofa.query.PofaQueryResultList.ResultTriplet;
import com.wcs.pofa.slicer.PofaSlicer;
import com.wcs.pofa.tokenizer.PofaTokenizer;
/**
* Common query engine for processing user queries.
*
*/
public class PofaQueryEngine {
private PofaNeo4jDB db;
private PofaTokenizer tokenizer;
public PofaQueryEngine(PofaNeo4jDB db) {
this.db = db;
this.tokenizer = new PofaTokenizer();
}
/**
* Match parts of query string to entity names found in DB.
* Returns lists of matched names in order of match quality and matched text of original
* query.
* @param queryString
* @param locale
*/
private PofaQueryResultList<String, String, String> findEntities(String queryString, Locale
locale) {
PofaQueryResultList<String, String, String> result = new PofaQueryResultList<String,
String, String>();
String queryExpression = queryString;
//look for entities in query string
boolean reDo = true;
while (reDo) {
reDo = false;
//compose fixed length query expressions
ArrayList<String> expressions = tokenizer.composeSubExpressions(queryExpression,
locale, true, false);
for (int i = 0; i != expressions.size() && !reDo; i++) {
String matchedExpression = expressions.get(i);
JSONArray indexHit;
try {
indexHit = db.queryNodeIndex("entity", "name", matchedExpression, locale);
for (int j = 0; j != indexHit.length(); j++) {
String key = indexHit.getJSONObject(j).getString("node");
// find out true name of matched node
String trueName;
trueName = indexHit.getJSONObject(j).getJSONObject("data").getString("name");
// insert found result to result list
result.addResult(
matchedExpression,
compareMatch2Query(queryString, trueName, locale),
key,
trueName
);
reDo = true;
}
if (reDo) {
// remove matched part of query
queryExpression =
queryExpression.replaceFirst(Pattern.quote(matchedExpression), "|").
replaceAll("\\| \\||^\\| | \\|$", "").
replaceAll("^\\s+|\\s+$", "").
replaceAll("\\s+", " ");
}
} catch (JSONException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
return result;
}
/**
* Match parts of query string to category names found in DB.
* Returns lists of matched names in order of match quality and matched text of original
* query.
* @param queryString
* @param locale
*/
private PofaQueryResultList<String, String, String> findCategories(String queryString,
Locale locale) {
PofaQueryResultList<String, String, String> result = new PofaQueryResultList<String,
String, String>();
String queryExpression = queryString;
//look for entities in query string
boolean reDo = true;
while (reDo) {
reDo = false;
//compose fixed length query expressions
ArrayList<String> expressions = tokenizer.composeSubExpressions(queryExpression,
locale, true, true);
for (int i = 0; i != expressions.size() && !reDo; i++) {
String matchedExpression = expressions.get(i);
JSONArray indexHit;
try {
indexHit = db.queryNodeIndex("category", "name", matchedExpression, locale);
for (int j = 0; j != indexHit.length(); j++) {
String key = indexHit.getJSONObject(j).getString("node");
// find out true name of matched node
String trueName;
trueName = indexHit.getJSONObject(j).getJSONObject("data").getString("name");
// insert found result to result list
result.addResult(
matchedExpression,
compareMatch2Query(queryString, trueName, locale),
key,
trueName
);
reDo = true;
}
if (reDo) {
// remove matched part of query
queryExpression =
queryExpression.replaceFirst(Pattern.quote(matchedExpression), "|").
replaceAll("\\| \\||^\\| | \\|$", "").
replaceAll("^\\s+|\\s+$", "").
replaceAll("\\s+", " ");
}
} catch (JSONException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
return result;
}
private PofaQueryResultList<String, String, String> findEntitesFromCategories(String
queryString, PofaQueryResultList<String, String, String> categoryList, Locale locale) {
PofaQueryResultList<String, String, String> result = new PofaQueryResultList<String, String, String>();
// get entities for top categories
HashMap<String, Pair<Integer, JSONObject>> hits = new HashMap<String, Pair<Integer, JSONObject>>();
Neo4jDBInterface dbInterface = db.getDBInterface();
for (Iterator<String> i = categoryList.getMatches(); i.hasNext(); ) {
String matchedQuery = i.next();
double rankLimit = -1;
for (Iterator<ResultTriplet<String, String>> j = categoryList.getMatchElements(matchedQuery); j.hasNext(); ) {
ResultTriplet<String, String> item = j.next();
if (-1 == rankLimit)
rankLimit = item.getRank() * 0.3;
else if (item.getRank() < rankLimit)
break;
System.err.println("  checking: " + item.getID() + " \"" + item.getMatch() + "\"");
// get entities from categories
ArrayList<Neo4jRelationship> relationships = new ArrayList<Neo4jRelationship>();
relationships.add(new Neo4jRelationship("BELONGS_TO", Neo4jRelationshipDirection.ALL));
try {
JSONArray comments = dbInterface.traverse(
item.getID(),
Neo4jTraverseResult.NODE,
Neo4jTraverseOrder.BREADTH_FIRST,
Neo4jTraverseUniqueness.NODE,
relationships,
"" ,
Neo4jTraverseReturnFilter.ALL_BUT_START_NODE,
1
);
for (int k = 0; k != comments.length(); k++) {
try {
JSONObject node = comments.getJSONObject(k);
String key = node.getString("node");
if (!hits.containsKey(key))
hits.put(key, new Pair<Integer, JSONObject>(1, node.getJSONObject("data")));
else {
Pair<Integer, JSONObject> oldValue = hits.get(key);
oldValue.setFirst(oldValue.getFirst() + 1);
hits.put(key, oldValue);
}
String trueName = node.getJSONObject("data").getString("name");
// insert found result to result list
result.addResult(
item.getMatch(),
compareMatch2Query(queryString, trueName, locale),
key,
trueName
);
} catch (JSONException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (ServerErrorResponse e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
return result;
}
private void getComments(String nodeID) {
Neo4jDBInterface dbInterface = db.getDBInterface();
ArrayList<Neo4jRelationship> relationships = new ArrayList<Neo4jRelationship>();
relationships.add(new Neo4jRelationship("", Neo4jRelationshipDirection.OUT));
relationships.add(new Neo4jRelationship("APPLIES_TO", Neo4jRelationshipDirection.IN));
try {
JSONArray comments = dbInterface.traverse(
//nodeID,
db.getRefNodeClassifyEntity(),
Neo4jTraverseResult.NODE,
Neo4jTraverseOrder.BREADTH_FIRST,
Neo4jTraverseUniqueness.NODE,
relationships,
null,
Neo4jTraverseReturnFilter.ALL_BUT_START_NODE,
2
);
for (int i = 0; i != comments.length(); i++) {
String url;
try {
JSONObject data = comments.getJSONObject(i).getJSONObject("data");
url = data.getString("url");
if (!url.matches(".*mobilarena.*")) {
System.out.println("\""
+
PofaSlicer.stripHTML(data.getString("content"))
+
"\"");
System.out.println("
" + url);
}
} catch (JSONException e) {
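// nodes with missing url/content fields are skipped silently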
}
}
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (ServerErrorResponse e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
/**
* Interpret and execute query string.
* @param queryString
*/
public void query(String queryString, Locale locale) {
System.out.println("Query string: \"" + queryString + "\", language: " + locale.toString());
/*
* goal:
* - identify entities
* - gather relevant info for affected entities
* - store info in session (? - for faster access, logging)
* - display
* method:
* - find out entity names from query (stop if at any point search has results)
*   - 1: simple entity name matching
*   - 2: category name matching
*   - 3: full-text search
* - filter entity names using the remaining part of query
*   - if using a filter would result in empty set, ignore that filter and warn user
* - gather all comments separately for entities
* - sort comments
* - display
*
* - for every entity recommend a best match
* - store the others as 'similar matches'
*/
// get entities
PofaQueryResultList<String, String, String> foundEntities;
foundEntities = findEntities(queryString, locale);
// get categories
PofaQueryResultList<String, String, String> foundCategories;
foundCategories = findCategories(queryString, locale);
// get entities from categories
PofaQueryResultList<String, String, String> foundEntitiesFromCategories;
foundEntitiesFromCategories = findEntitesFromCategories(queryString, foundCategories, locale);
System.out.println("Best matching entity names:");
for (Iterator<String> i = foundEntities.getMatches(); i.hasNext(); ) {
String matchedQuery = i.next();
double rankLimit = -1;
for
(Iterator<ResultTriplet<String,
String>>
foundEntities.getMatchElements(matchedQuery); j.hasNext(); ) {
ResultTriplet<String, String> item = j.next();
if (-1 == rankLimit)
rankLimit = item.getRank() * 0.3;
else if (item.getRank() < rankLimit)
break ;
System.out.println(
"
" +
matchedQuery + ": " +
item.getRank() + " " +
"\"" + item.getMatch() + "\" " +
item.getID()
);
}
}
foundCategories,
j
=
System.out.println("Best matching category names:");
for (Iterator<String> i = foundCategories.getMatches(); i.hasNext(); ) {
String matchedQuery = i.next();
double rankLimit = -1;
for
(Iterator<ResultTriplet<String,
String>>
j
foundCategories.getMatchElements(matchedQuery); j.hasNext(); ) {
ResultTriplet<String, String> item = j.next();
if (-1 == rankLimit)
rankLimit = item.getRank() * 0.3;
else if (item.getRank() < rankLimit)
break ;
System.out.println(
"
" +
matchedQuery + ": " +
item.getRank() + " " +
"\"" + item.getMatch() + "\" " +
item.getID()
);
}
}
System.out.println("Best found entities (category -> entity):");
for (Iterator<String> i = foundEntitiesFromCategories.getMatches(); i.hasNext(); ) {
String matchedQuery = i.next();
double rankLimit = -1;
for
(Iterator<ResultTriplet<String,
String>>
j
foundEntitiesFromCategories.getMatchElements(matchedQuery); j.hasNext(); ) {
ResultTriplet<String, String> item = j.next();
if (-1 == rankLimit)
rankLimit = item.getRank() * 0.3;
else if (item.getRank() < rankLimit)
break ;
System.out.println(
"
" +
matchedQuery + ": " +
item.getRank() + " " +
"\"" + item.getMatch() + "\" " +
item.getID()
);
}
}
}
/**
* Calculates similarity of two strings (relative Levenshtein distance)
* @param string1
* @param string2
* @return 1.0 if strings match otherwise a real number in [0..1) interval
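* For example (illustrative values, not from the original spec): "kitten" vs. "sitting"
* has Levenshtein distance 3 and a longer length of 7, so the result is (7 - 3) / 7 = 0.571.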
*/
public static Double calculateStringSimilarity(String string1, String string2, Locale locale) {
string1 = string1.toLowerCase(locale);
string2 = string2.toLowerCase(locale);
int score = 0;
int n = string1.length();
int m = string2.length();
if (0 == n) score = m;
else if (0 == m) score = n;
else {
int edits[][] = new int[n + 1][m + 1];
for (int i = 0; i <= n; i++)
edits[i][0] = i;
for (int i = 0; i <= m; i++)
edits[0][i] = i;
for (int i = 0; i != n; i++) {
for (int j = 0; j != m; j++) {
int cost = 0;
if (string1.charAt(i) != string2.charAt(j))
cost = 1;
edits[i + 1][j + 1] = cost + Math.min(Math.min(edits[i][j + 1], edits[i][j]),
edits[i + 1][j]);
}
}
score = edits[n][m];
}
int max = Math.max(n, m);
if (0 == max)
return 1.0;
else
return (Double)((1.0 * (max - score)) / (1.0 * max));
}
/**
* Compares a result string to the original query string word-by-word and returns a
* similarity score based on how the words of the initial query are matched in the result.
* @param query Original query string
* @param match Found matching string to compare to the query
* @param locale Locale to use
* @return Similarity score
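* Illustrative walk-through (assumed inputs, not from the original spec): for query
* "nokia classic" and match "nokia 6700 classic" both query words match fully, the
* matched spans average 2.5 of 3 words (ratio 0.833), and the multi-word bonus is the
* similarity of "nokia classic" to "nokia 6700 classic" (about 0.722), giving a final
* score of roughly 0.78.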
*/
public static double compareMatch2Query(String query, String match, Locale locale) {
//TODO: filter both strings (remove special chars)
String[] queryParts = query.split("\\s+");
String[] matchParts = match.split("\\s+");
// calculate word-to-word similarities, average the best scores
double totalScore = 0;
int firstFullMatchQuery = -1;
int lastFullMatchQuery = -1;
int firstFullMatchResult = -1;
int lastFullMatchResult = -1;
for (int i = 0; i != queryParts.length; i++) {
double bestScore = 0;
for (int j = 0; j != matchParts.length; j++) {
double score = calculateStringSimilarity(matchParts[j], queryParts[i], locale);
if (score > bestScore) {
bestScore = score;
if (bestScore == 1.0) {
if (-1 == firstFullMatchResult) firstFullMatchResult = j;
lastFullMatchResult = j;
break ;
}
}
}
if (bestScore == 1.0) {
if (-1 == firstFullMatchQuery) firstFullMatchQuery = i;
lastFullMatchQuery = i;
}
totalScore += bestScore;
}
totalScore /= queryParts.length;
// calculate matching part length (word count) ratio relative to the longest string's word count
double avgMatchLen = 1.0 * (lastFullMatchQuery - firstFullMatchQuery + 1 +
lastFullMatchResult - firstFullMatchResult + 1) / 2;
double avgMatchLenRatio =
avgMatchLen / (1.0 * Math.max(queryParts.length, matchParts.length));
// modify score with match length ratio
totalScore *= avgMatchLenRatio;
// calculate multi-word match bonus
if (avgMatchLen > 1) {
// find the matching parts of both initial strings
String matchingQueryPart = "";
for (int i = firstFullMatchQuery; i <= lastFullMatchQuery; i++)
if (i == firstFullMatchQuery)
matchingQueryPart = queryParts[i];
else
matchingQueryPart += " " + queryParts[i];
String matchingResultPart = "";
for (int i = firstFullMatchResult; i <= lastFullMatchResult; i++)
if (i == firstFullMatchResult)
matchingResultPart = matchParts[i];
else
matchingResultPart += " " + matchParts[i];
double multiWordBonus = calculateStringSimilarity(matchingResultPart, matchingQueryPart, locale);
totalScore = (totalScore + multiWordBonus) / 2;
}
return totalScore;
}
public static void main(String argv[]) throws ServerErrorResponse, IOException {
PofaNeo4jDB db = new PofaNeo4jDB("http://localhost:7474/db/data");
PofaQueryEngine query = new PofaQueryEngine(db);
Locale locale = new Locale("hu");
/*
query.query("canon 50d fujifilm s2500hd", locale);
//query.query("canon 50d", locale);
System.out.println("----");
query.query("laptop, v�s�rl�s", locale);
System.out.println("----");
query.query("nokia classic", locale);
System.out.println("----");
*/
///*
long t1 = System.currentTimeMillis();
query.query("nokia classic samsung galaxy mobiltelefon", locale);
//query.getComments("1");
long t2 = System.currentTimeMillis();
//System.err.println("Query took: " + (t2 - t1) + " ms");
//*/
/*
ArrayList<Neo4jRelationship> relationships = new ArrayList<Neo4jRelationship>();
relationships.add(new Neo4jRelationship("VALUE", Neo4jRelationshipDirection.OUT));
JSONArray items = db.getDBInterface().traverse(
"174", //db.getRootNode(),
Neo4jTraverseResult.NODE,
Neo4jTraverseOrder.BREADTH_FIRST,
Neo4jTraverseUniqueness.NODE,
relationships,
null,
Neo4jTraverseReturnFilter.ALL_BUT_START_NODE,
1
);
for (int i = 0; i != items.length(); i++)
try {
System.out.println(items.getJSONObject(i).getJSONObject("data").getString("name"));
} catch (JSONException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
*/
//query.query("notebook laptop netbook", locale);
}
}
PofaQueryResultList.java
package com.wcs.pofa.query;
import java.util.HashMap;
import java.util.Iterator;
import java.util.TreeSet;
/**
* A special class that can hold different query results.
*/
public class PofaQueryResultList<Query, ID, Match> {
/**
* A triplet that holds (rank, ID, match) values as a search result.
* @param <_ID> type of database item IDs
* @param <_Match> type of match content
*/
public static class ResultTriplet<_ID, _Match> implements Comparable<ResultTriplet<_ID, _Match>> {
private Double rank;
private _ID id;
private _Match match;
public ResultTriplet(Double rank, _ID id, _Match match) {
this.rank = rank;
this.id = id;
this.match = match;
}
public Double getRank() {
return rank;
}
public _ID getID() {
return id;
}
public _Match getMatch() {
return match;
}
/**
* Reverse order based on rank (best match first)
*/
public int compareTo(ResultTriplet<_ID, _Match> other) {
if (this.rank > other.rank) return -1;
if (this.rank < other.rank) return 1;
return 0;
}
}
/*
* return:
*   list of:
*     - string: matched part of query
*     - ordered list of:
*       - double: match factor
*       - string: DB node ID: entity node
*       - string: entity node name
*/
private HashMap<Query, TreeSet<ResultTriplet<ID, Match>>> results;
public PofaQueryResultList() {
results = new HashMap<Query, TreeSet<ResultTriplet<ID, Match>>>();
}
/**
* Insert a new result into the result list.
* @param matchedQuery Part of original query that was matched
* @param rank rank of search result to be added
* @param id DB item ID of search result to be added
* @param match DB match content of search result to be added
*/
public void addResult(Query matchedQuery, Double rank, ID id, Match match) {
if (results.containsKey(matchedQuery)) {
results.get(matchedQuery).
add(new ResultTriplet<ID, Match>(rank, id, match));
}
else {
TreeSet<ResultTriplet<ID, Match>> items = new TreeSet<ResultTriplet<ID, Match>>();
items.add(new ResultTriplet<ID, Match>(rank, id, match));
results.put(matchedQuery, items);
}
}
/**
* Returns an iterator to the added original query part items.
* @return iterator
*/
public Iterator<Query> getMatches() {
return results.keySet().iterator();
}
/**
* Returns an iterator to a specific result query.
* @param matchedQuery
* @return
*/
public Iterator<ResultTriplet<ID, Match>> getMatchElements(Query matchedQuery) {
return results.get(matchedQuery).iterator();
}
/**
* Tells if result list is empty.
* @return
*/
public boolean isEmpty() {
return results.isEmpty();
}
}
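A minimal usage sketch of the result container (illustrative only; the demo values and the surrounding main method are assumptions, not part of the delivered source):

public static void main(String[] args) {
// hypothetical demo: two ranked hits for the matched query part "nokia"
PofaQueryResultList<String, String, String> list = new PofaQueryResultList<String, String, String>();
list.addResult("nokia", 0.9, "42", "Nokia 6700 Classic");
list.addResult("nokia", 0.7, "17", "Nokia E52");
for (java.util.Iterator<String> i = list.getMatches(); i.hasNext(); ) {
String match = i.next();
// elements come back best rank first (TreeSet with reversed compareTo)
for (java.util.Iterator<PofaQueryResultList.ResultTriplet<String, String>> j = list.getMatchElements(match); j.hasNext(); ) {
PofaQueryResultList.ResultTriplet<String, String> t = j.next();
System.out.println(match + ": " + t.getRank() + " " + t.getID() + " \"" + t.getMatch() + "\"");
}
}
}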
PofaRuleMatcher.java
package com.wcs.pofa;
import java.util.ArrayList;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import com.wcs.pofa.PofaDomainRule.PofaPageElement;
public class PofaRuleMatcher {
public static String getNodeText(Node node, boolean decorate, int depth) {
if (null == node)
return "";
StringBuilder result = new StringBuilder();
String indent = "";
if (decorate) {
for (int i = 0; i != depth * 2; i++)
result.append(" ");
indent = new String(result);
}
result.append("<" + node.getNodeName());
NamedNodeMap nodeAttrs = node.getAttributes();
for (int i = 0; i < nodeAttrs.getLength(); i++)
result.append(" " + nodeAttrs.item(i).getNodeName() + "=\"" + nodeAttrs.item(i).getNodeValue() + '"');
result.append(">");
if (decorate)
result.append("\n");
NodeList children = node.getChildNodes();
for (int i = 0; i < children.getLength(); i++) {
Node child = children.item(i);
switch (child.getNodeType()) {
case Node.ELEMENT_NODE:
result.append(getNodeText(child, decorate, depth + 1));
break ;
case Node.TEXT_NODE:
result.append(indent + child.getNodeValue());
if (decorate)
result.append("\n");
break ;
}
}
result.append(indent + "</" + node.getNodeName() + ">");
if (decorate)
result.append("\n");
return result.toString();
}
public static String getDocumentText(Document doc, boolean decorate) {
StringBuilder result = new StringBuilder();
NodeList children = doc.getChildNodes();
for (int i = 0; i < children.getLength(); i++) {
Node child = children.item(i);
switch (child.getNodeType()) {
case Node.ELEMENT_NODE:
result.append(getNodeText(child, decorate, 0));
break;
case Node.TEXT_NODE:
result.append(child.getNodeValue());
break;
}
}
return result.toString();
}
private static String domWalkerContent(Node node, String path, PofaDomainRule rule) {
String result = "";
if (null == node)
return result;
// compose absolute qualified node name
if ("" == path) path = node.getNodeName();
else path += ">" + node.getNodeName();
// include ID name (if present)
Node nodeAttr = node.getAttributes().getNamedItem("id");
if (null != nodeAttr) path += "#" + nodeAttr.getNodeValue();
// include class name (if present)
nodeAttr = node.getAttributes().getNamedItem("class");
if (null != nodeAttr) path += "." + nodeAttr.getNodeValue();
boolean match = rule.matchRule(path);
if (match) {
// collect content
result += "<" + node.getNodeName();
NamedNodeMap nodeAttrs = node.getAttributes();
for (int j = 0; j < nodeAttrs.getLength(); j++)
result
+=
"
"
+
nodeAttrs.item(j).getNodeName()
nodeAttrs.item(j).getNodeValue() + '"';
result += ">";
}
+
"=\""
+
NodeList children = node.getChildNodes();
for (int i = 0; i < children.getLength(); i++) {
Node child = children.item(i);
// evaluate DOM node
switch (child.getNodeType()) {
case Node.ELEMENT_NODE:
if (match)
result += domWalkerContent(child, path, rule);
else
domWalkerContent(child, path, rule);
break ;
case Node.TEXT_NODE:
if (match)
result += child.getNodeValue();
break ;
case Node.ENTITY_REFERENCE_NODE:
break ;
}
}
if (match)
result += "</" + node.getNodeName() + ">";
return result;
}
private static ArrayList<Pair<PofaPageElement, String>> domWalker(Node node, String path,
PofaDomainRule rule) {
ArrayList<Pair<PofaPageElement, String>> result = new ArrayList<Pair<PofaPageElement,
String>>();
if (null == node)
return result;
// compose absolute qualified node name
if ("" == path) path = node.getNodeName();
else path += ">" + node.getNodeName();
// include ID name (if present)
Node nodeAttr = node.getAttributes().getNamedItem("id");
if (null != nodeAttr) path += "#" + nodeAttr.getNodeValue();
// include class name (if present)
nodeAttr = node.getAttributes().getNamedItem("class");
if (null != nodeAttr) path += "." + nodeAttr.getNodeValue();
// match DOM path against supplied rules
NodeList children = node.getChildNodes();
if (rule.matchRule(path)) {
// found a rule: cumulate content
String content = "<" + node.getNodeName();
NamedNodeMap nodeAttrs = node.getAttributes();
for (int j = 0; j < nodeAttrs.getLength(); j++)
content += " " + nodeAttrs.item(j).getNodeName() + "=\"" + nodeAttrs.item(j).getNodeValue() + '"';
content += ">";
for (int i = 0; i < children.getLength(); i++) {
Node child = children.item(i);
switch (child.getNodeType()) {
case Node.ELEMENT_NODE:
content += domWalkerContent(child, path, rule);
break ;
case Node.TEXT_NODE:
content += child.getNodeValue();
break ;
}
}
content += "</" + node.getNodeName() + ">";
result.add(new Pair<PofaPageElement, String>(rule.getType(), content));
}
else {
// parse children
for (int i = 0; i < children.getLength(); i++) {
Node child = children.item(i);
if (Node.ELEMENT_NODE == child.getNodeType()) {
// process subtree
ArrayList<Pair<PofaPageElement, String>> subResult = domWalker(child, path,
rule);
// merge results
for (int j = 0; j != subResult.size(); j++)
result.add(subResult.get(j));
}
}
}
return result;
}
/**
* Extracts content from the supplied DOM using the supplied rules.
* @param nodes The DOM to extract content from.
* @param rules The rules to use.
* @return List of (page element type, content) pairs. Page element type is the <i>PofaPageElement</i> part
* of the matching rule path, content is the XHTML content of the DOM in the applied path (including tags
* and attributes).
*/
public static ArrayList<Pair<PofaPageElement, String>> ruleMatcher(NodeList nodes,
PofaDomainInfo rules) {
ArrayList<Pair<PofaPageElement, String>> result = new ArrayList<Pair<PofaPageElement,
String>>();
for (int i = 0; i != rules.getRules().size(); i++) {
for (int j = 0; j != nodes.getLength(); j++) {
ArrayList<Pair<PofaPageElement, String>> ruleResult = domWalker(nodes.item(j), "",
rules.getRules().get(i));
for (int k = 0; k != ruleResult.size(); k++)
result.add(ruleResult.get(k));
}
}
return result;
}
}
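As a concrete illustration of the path format that domWalker feeds to PofaDomainRule.matchRule (the example markup is assumed; the rule syntax itself is defined by PofaDomainRule elsewhere): for a <p> element inside <div id="main" class="content"> inside <body>, the composed path is body>div#main.content>p, built from tag names joined by '>' with '#' + id and '.' + class appended where present.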
PofaSlicer.java
package com.wcs.pofa.slicer;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.Locale;
import org.json.JSONArray;
import org.json.JSONException;
import org.json.JSONObject;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;
import com.wcs.pofa.Pair;
import com.wcs.pofa.PofaAbstractSlicer;
import com.wcs.pofa.PofaDomainInfo;
import com.wcs.pofa.PofaRuleMatcher;
import com.wcs.pofa.PofaUtils;
import com.wcs.pofa.PofaDomainRule.PofaPageElement;
import com.wcs.pofa.db.Neo4jDBInterface;
import com.wcs.pofa.db.PofaNeo4jDB;
import com.wcs.pofa.db.Neo4jDBInterface.ServerErrorResponse;
import com.wcs.pofa.tokenizer.PofaTokenizer;
/**
* Receives an (almost properly formatted) HTML as string and extracts parts of it described by page rules
* (PofaDomainRuleList). Stores the extracts (slices) in the DocumentStore DB and passes them further.
*/
public class PofaSlicer implements PofaAbstractSlicer {
private Tidy tidy = new Tidy();
private PofaNeo4jDB db;
private Neo4jDBInterface dbInterface;
private PofaTokenizer tokenizer;
public PofaSlicer(PofaNeo4jDB db) {
System.out.println("Initializing " + this.getClass().getName() + "...");
this.db = db;
this.dbInterface = db.getDBInterface();
this.tokenizer = new PofaTokenizer();
// initialize JTidy
tidy.setQuiet(true);
tidy.setHideComments(true);
tidy.setShowWarnings(false);
tidy.setShowErrors(0);
tidy.setXHTML(true);
System.out.println(this.getClass().getName() + " initialized.");
}
public static String stripHTML(String data) {
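// Illustrative behaviour (the example values are assumptions, not from the original spec):
// stripHTML("<p>Very <b>good</b>  phone.</p>") returns "Very good phone.".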
//TODO: handle escaped characters
// remove HTML tags
data = data.replaceAll("<(\".*?\"|'.*?'|.*?)*>", " ");
// collapse whitespaces
data = data.replaceAll("\\s+", " ");
// trim
data = data.replaceAll("(^\\s+|\\s+$)", "");
return data;
}
/**
* Clean downloaded HTML from unwanted content (such as scripts, style, comments)
* @param html
* @return
*/
public static String cleanHTML(String html) {
// remove script tags
html = html.replaceAll("(?s)<script[^>]*>.*?</script>|<style[^>]*>.*?</style>", " ");
// remove comments
html = html.replaceAll("<!--.*?-->", " ");
return html;
}
private String displayDOM(NodeList nodes, int indent) {
String dom = "";
String spaces = "";
for (int i = 0; i != indent; i++)
spaces += " ";
for (int i = 0; i != nodes.getLength(); i++) {
String nodeName;
Node node = nodes.item(i);
if (node.getNodeType() == Node.TEXT_NODE)
nodeName = "[TEXT] \"" + node.getNodeValue() + "\"";
else {
nodeName = node.getNodeName();
// include ID name (if present)
Node nodeAttr = node.getAttributes().getNamedItem("id");
if (null != nodeAttr) nodeName += "#" + nodeAttr.getNodeValue();
// include class name (if present)
nodeAttr = node.getAttributes().getNamedItem("class");
if (null != nodeAttr) nodeName += "." + nodeAttr.getNodeValue();
}
dom += spaces + nodeName + "\n" +
displayDOM(node.getChildNodes(), indent + 2);
}
return dom;
}
/**
* Processes the contents of a HTML page. Extracts parts of it described by the supplied rules.
*
* @param url The URL the page content was downloaded from
* @param pageContent The textual content of the page to analyze
* @param rules The rules to use for extraction
* @return Amount of new information found in content expressed by formula:<br>
* <b>number_of_new_slices / number_of_found_slices</b><br>
* where only textual slices are counted.
*/
public double process(String url, String pageContent, PofaDomainInfo rules) {
ArrayList<String> entityNames = new ArrayList<String>();
ArrayList<String> categoryNames = new ArrayList<String>();
ArrayList<String> commentNodeIDs = new ArrayList<String>();
JSONObject factSheet = new JSONObject();
// prepare
InputStream is = new ByteArrayInputStream(pageContent.getBytes());
// parse content
Document tidyDoc = tidy.parseDOM(is, null);
//System.err.println(displayDOM(tidyDoc.getChildNodes(), 0)); //FIXME: debug output
// generate slices
ArrayList<Pair<PofaPageElement, String>> matches;
matches = PofaRuleMatcher.ruleMatcher(tidyDoc.getChildNodes(), rules);
int newSlices = 0;
// metadata
JSONObject metaData = new JSONObject();
try {
metaData.put("url", url);
} catch (JSONException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
StringBuilder rawText = new StringBuilder();
// process metadata slices
for (int i = 0; i != matches.size(); i++) {
Pair<PofaPageElement, String> match = matches.get(i);
PofaPageElement matchType = match.getFirst();
String matchContent = match.getSecond();
switch (matchType) {
case THEME:
try {
String textContent = stripHTML(matchContent);
metaData.put("THEME", textContent);
rawText.append(textContent + "\n");
} catch (JSONException e) {
//TODO: error handling
}
break ;
case BREADCRUMBS:
{
// clean category name (remove HTML tags, starting and ending special chars)
String textContent =
PofaTokenizer.tokenListToExpression(
PofaTokenizer.cleanSeparators(
PofaTokenizer.tokenize(
stripHTML(matchContent), null
),
true
)
);
categoryNames.add(textContent);
rawText.append(textContent + "\n");
break ;
}
case ENTITY:
{
// clean entity name (remove HTML tags, starting and ending special chars)
String textContent =
PofaTokenizer.tokenListToExpression(
PofaTokenizer.cleanSeparators(
PofaTokenizer.tokenize(
stripHTML(matchContent), null
),
true
)
);
entityNames.add(textContent);
rawText.append(textContent + "\n");
break ;
}
case FACTSHEET:
{
String factSheetItem;
// TODO: look for hidden values too (inside html tag)
// remove html tags & separate keys from values
factSheetItem = matchContent.replaceAll("<(\".*?\"|'.*?'|.*?)*>|:|�|,\\s+", "\n");
// collapse whitespaces
factSheetItem = factSheetItem.replaceAll("\\n+", "\n");
// inner trim
factSheetItem = factSheetItem.replaceAll("( |\\t)+", " ");
// trim
factSheetItem = factSheetItem.replaceAll("(?m)(^\\s+|\\s+$|^\\n)", "");
// key: first non-empty line
// values: all other non-empty lines
String[] sheet = factSheetItem.split("\n");
String key = "";
ArrayList<String> values = new ArrayList<String>();
for (int j = 0; j != sheet.length; j++) {
String item = sheet[j];
if (item.length() != 0)
if (key.length() == 0)
key = item;
else
values.add(item);
}
// store factsheet
if (key.length() > 0 && values.size() > 0)
for (int j = 0; j != values.size(); j++)
try {
if (factSheet.has(key))
factSheet.accumulate(key, values.get(j));
else
factSheet.put(key, new JSONArray().put(values.get(j)));
} catch (JSONException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
break ;
}
default :
}
}
// process TEXT slices
for (int i = 0; i != matches.size(); i++) {
Pair<PofaPageElement, String> match = matches.get(i);
PofaPageElement matchType = match.getFirst();
String matchContent = match.getSecond();
if (PofaPageElement.TEXT == matchType) {
rawText.append(stripHTML(matchContent) + "\n");
// check if text is a new text
String hash = PofaUtils.getMD5(matchContent).toString(16);
try {
JSONArray indexHit = dbInterface.queryNodeIndex("opinion", "content", hash);
if (indexHit.length() == 0) {
// new text item, insert it
JSONObject text = new JSONObject();
for (Iterator<?> key = metaData.keys(); key.hasNext(); ) {
String keyName = (String)key.next();
text.put(keyName, metaData.get(keyName));
}
text.put("content", matchContent);
// put into DB
String commentNodeID = dbInterface.createNode(text);
db.addNodeToIndex("opinion", "content", hash, commentNodeID);
dbInterface.createRelationship(db.getRootNode(), commentNodeID, "OPINION", new
JSONObject());
dbInterface.createRelationship(db.getRefNodeTokenize(), commentNodeID, "", new
JSONObject());
newSlices++;
commentNodeIDs.add(commentNodeID);
}
else
commentNodeIDs.add(indexHit.getJSONObject(0).getString("node"));
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (JSONException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (ServerErrorResponse e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
// detect language based on extracted slice's raw text
Locale language = tokenizer.languageDetect(rawText.toString());
// add entities
ArrayList<String> entityIDs = addEntities(entityNames, commentNodeIDs, factSheet, language);
// add categories
addCategories(categoryNames, commentNodeIDs, entityIDs, language);
// add factsheet items as categories
addFactSheet(factSheet, commentNodeIDs, entityIDs, language);
return (double)newSlices / (double)matches.size();
}
/**
* Store entities in database. Update relevant indexes. Connect entities to comments.
* @param entityNames
* @param commentIDs
* @param facts
* @param locale
* @return
*/
private ArrayList<String> addEntities(ArrayList<String> entityNames, ArrayList<String>
commentIDs, JSONObject facts, Locale locale) {
ArrayList<String> result = new ArrayList<String>();
for (int i = 0; i != entityNames.size(); i++) {
//--- query if entity exists ---
try {
String entityName = entityNames.get(i);
String entityNodeID;
JSONArray indexHit = db.queryNodeIndex("entity", "name", entityName, locale);
if (indexHit.length() == 0) {
// new item, insert it
JSONObject properties = new JSONObject();
properties.put("name", entityName);
properties.put("factsheet", facts.toString());
// put into DB
entityNodeID = dbInterface.createNode(properties);
// connect to CLASSIFYENTITY node (require later classification)
dbInterface.createRelationship(db.getRefNodeClassifyEntity(), entityNodeID, "", new
JSONObject());
// index entity
ArrayList<String> indexExpressions = tokenizer.composeSubExpressions(entityName,
locale, true, false);
db.addNodeToIndex("entity", "name", indexExpressions, entityNodeID);
}
else {
entityNodeID = indexHit.getJSONObject(0).getString("node");
// compare stored fact sheet and current fact sheet
try {
JSONObject storedFactSheet = new JSONObject(indexHit.getJSONObject(0).getJSONObject("data").getString("factsheet"));
boolean factSheetUpdated = false;
for (Iterator<?> newKey = facts.keys(); newKey.hasNext(); ) {
String newKeyName = (String)newKey.next();
if (storedFactSheet.has(newKeyName)) {
// key exists, compare values
JSONArray storedValues = storedFactSheet.getJSONArray(newKeyName);
JSONArray newValues = facts.getJSONArray(newKeyName);
for (int j = 0; j != newValues.length(); j++) {
Object newValue = newValues.get(j);
boolean hasValue = false;
for (int k = 0; k != storedValues.length(); k++)
if (storedValues.get(k).equals(newValue)) {
hasValue = true;
break ;
}
if (!hasValue) {
storedValues.put(newValue);
factSheetUpdated = true;
}
}
}
else {
// key does not exist, add it
storedFactSheet.put(newKeyName, facts.get(newKeyName));
factSheetUpdated = true;
}
}
if (factSheetUpdated)
// fact sheet changed, store in db
dbInterface.setNodeProperty(entityNodeID, "factsheet", storedFactSheet.toString());
} catch (JSONException e) {
}
}
result.add(entityNodeID);
// connect entity to comments
for (int j = 0; j != commentIDs.size(); j++)
db.createNewRelationship(commentIDs.get(j), entityNodeID, "APPLIES_TO", new JSONObject(), false);
} catch (JSONException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (ServerErrorResponse e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
return result;
}
/**
* Store categories in database. Connect categories to entities and comments.
* @param categories
* @param commentIDs
* @param locale
*/
private void addCategories(ArrayList<String> categories, ArrayList<String> commentIDs,
ArrayList<String> entityIDs, Locale locale) {
for (int i = 0; i != categories.size(); i++) {
String categoryName = categories.get(i);
if (categoryName.length() > 0) {
// check if category exists
try {
String categoryNodeID;
JSONArray indexHit;
// make sure category name is not an entity name too
indexHit = db.queryNodeIndex("entity", "name", categoryName, locale);
if (indexHit.length() == 0) {
indexHit = db.queryNodeIndex("category", "name", categoryName, locale);
if (indexHit.length() == 0) {
// new category
categoryNodeID = dbInterface.createNode(new JSONObject().put("name", categoryName));
dbInterface.createRelationship(db.getRootNode(), categoryNodeID, "CATEGORY_BC",
new JSONObject());
// index category
ArrayList<String> indexExpressions = tokenizer.composeSubExpressions(categoryName, locale, true, true);
db.addNodeToIndex("category", "name", indexExpressions, categoryNodeID);
}
else
categoryNodeID = indexHit.getJSONObject(0).getString("node");
// connect category to comments
// TODO: if there were more index hits, why not connect to all of them?
for (int j = 0; j != commentIDs.size(); j++)
db.createNewRelationship(commentIDs.get(j), categoryNodeID, "APPLIES_TO", new
JSONObject(), false);
// connect category to entities
for (int j = 0; j != entityIDs.size(); j++)
db.createNewRelationship(entityIDs.get(j), categoryNodeID, "BELONGS_TO", new
JSONObject(), false);
}
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (JSONException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (ServerErrorResponse e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
}
private void addFactSheet(JSONObject facts, ArrayList<String> commentIDs, ArrayList<String>
entityIDs, Locale locale) {
for (Iterator<?> i = facts.keys(); i.hasNext(); ) {
String key = (String)i.next();
try {
String categoryKeyNodeID;
JSONArray indexHit = db.queryNodeIndex("category", "name", key, locale);
if (indexHit.length() == 0) {
// new category
categoryKeyNodeID = dbInterface.createNode(new JSONObject().put("name", key));
dbInterface.createRelationship(db.getRootNode(), categoryKeyNodeID, "CATEGORY_FSK",
new JSONObject());
// index category
ArrayList<String> indexExpressions = tokenizer.composeSubExpressions(key, locale,
true, false);
db.addNodeToIndex("category", "name", indexExpressions, categoryKeyNodeID);
}
else {
categoryKeyNodeID = indexHit.getJSONObject(0).getString("node");
}
// create sub-categories
JSONArray values = facts.getJSONArray(key);
for (int j = 0; j != values.length(); j++) {
String value = values.getString(j);
String valueNodeID;
indexHit = db.queryNodeIndex("category", "name", value, locale);
if (indexHit.length() == 0) {
// new category
valueNodeID = dbInterface.createNode(new JSONObject().put("name", value));
dbInterface.createRelationship(db.getRootNode(), valueNodeID, "CATEGORY_FSV", new
JSONObject());
dbInterface.createRelationship(categoryKeyNodeID, valueNodeID, "VALUE", new JSONObject());
// index category
ArrayList<String> indexExpressions = tokenizer.composeSubExpressions(value, locale, true, false);
db.addNodeToIndex("category", "name", indexExpressions, valueNodeID);
}
else {
valueNodeID = indexHit.getJSONObject(0).getString("node");
db.createNewRelationship(categoryKeyNodeID, valueNodeID, "VALUE", new JSONObject(), false);
}
// connect subcategory to entities
for (int k = 0; k != entityIDs.size(); k++)
db.createNewRelationship(entityIDs.get(k), valueNodeID, "CATEGORY_PV", new JSONObject(), false);
// connect subcategory to comments
for (int k = 0; k != commentIDs.size(); k++)
db.createNewRelationship(commentIDs.get(k), valueNodeID, "APPLIES_TO", new JSONObject(), false);
}
// connect main category to entities
for (int k = 0; k != entityIDs.size(); k++)
db.createNewRelationship(entityIDs.get(k), categoryKeyNodeID, "CATEGORY_PK", new JSONObject(), false);
// connect main category to comments
for (int k = 0; k != commentIDs.size(); k++)
db.createNewRelationship(commentIDs.get(k), categoryKeyNodeID, "APPLIES_TO", new JSONObject(), false);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (JSONException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (ServerErrorResponse e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
}
PofaStopWords.java
package com.wcs.pofa;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.HashSet;
import java.util.Hashtable;
import java.util.Locale;
import java.util.Scanner;
public class PofaStopWords {
private Hashtable<String, HashSet<String>> stopWords;
public PofaStopWords() {
stopWords = new Hashtable<String, HashSet<String>>();
// load stop words
FileInputStream fis;
Scanner scanner;
//FIXME: don't use hardcoded filenames
try {
fis = new FileInputStream("..\\stopwords_hu.txt");
scanner = new Scanner(fis, "UTF-8");
HashSet<String> words = new HashSet<String>();
while (scanner.hasNextLine()) {
String word = scanner.nextLine();
words.add(word);
}
scanner.close();
fis.close();
stopWords.put("hu", words);
fis = new FileInputStream("..\\stopwords_en.txt");
scanner = new Scanner(fis, "UTF-8");
words = new HashSet<String>();
while (scanner.hasNextLine()) {
String word = scanner.nextLine();
words.add(word);
}
scanner.close();
fis.close();
stopWords.put("en", words);
} catch (FileNotFoundException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
public HashSet<String> getStopWords(Locale locale) {
return stopWords.get(locale.toString());
}
}
PofaTokenizer.java
package com.wcs.pofa.tokenizer;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Locale;
import org.json.JSONArray;
import org.json.JSONObject;
import org.tartarus.snowball.SnowballStemmer;
import com.wcs.pofa.Pair;
import com.wcs.pofa.PofaStopWords;
import com.wcs.pofa.db.Neo4jRelationship;
import com.wcs.pofa.db.Neo4jDBInterface.Neo4jTraverseOrder;
import com.wcs.pofa.db.Neo4jDBInterface.Neo4jTraverseResult;
import com.wcs.pofa.db.Neo4jDBInterface.Neo4jTraverseReturnFilter;
import com.wcs.pofa.db.Neo4jDBInterface.Neo4jTraverseUniqueness;
import com.wcs.pofa.db.Neo4jRelationship.Neo4jRelationshipDirection;
import com.wcs.pofa.slicer.PofaSlicer;
import de.spieleck.app.cngram.NGramProfiles;
/**
* Tokenize a document
*/
public class PofaTokenizer {
private NGramProfiles nps;
private NGramProfiles.Ranker ranker;
private PofaStopWords stopWords;
private final static String specialWordSeparators =
"(\"|'|\\(|\\)|\\[|\\]|\\{|\\}|\\<|\\>|\\?|\\.|!|,|-|:|;)";
public PofaTokenizer() {
System.out.println("Initializing " + this.getClass().getName() + "...");
try {
this.nps = new NGramProfiles();
this.ranker = nps.getRanker();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
this.stopWords = new PofaStopWords();
System.out.println(this.getClass().getName() + " initialized.");
}
/**
* Tokenize input document.
* @param document Document to tokenize.
* @param locale locale to use for lower-case converting (if <b>null</b> no lower case conversion is done)
* @return list of tokens
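* Example (illustrative input, not from the original spec): tokenize("Jó telefon, de drága.",
* new Locale("hu")) returns ["jó", "telefon", ",", "de", "drága", "."].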
*/
public static ArrayList<String> tokenize(String document, Locale locale) {
// insert missing whitespaces at sentence ends
document = correctWhitespaces(document, locale);
ArrayList<String> words;
ArrayList<String> tokens = new ArrayList<String>();
// split document at word boundaries
words = splitDocument(document, locale);
// split words at special characters
for (int i = 0; i != words.size(); i++)
tokens.addAll(splitWord(words.get(i)));
// merge special tokens
ArrayList<String> result = mergeTokens(tokens);
return result;
}
/**
* Inserts missing whitespaces and removes unnecessary ones using a language pattern.
* TODO: Requires some kind of heuristics
* @param document
* @return
*/
public static String correctWhitespaces(String document, Locale locale) {
//TODO: write this
return document;
}
/**
* Splits up input at word boundaries.
* @param document document to split
* @param locale locale to use for lower-case converting (if <b>null</b> no lower case conversion is done)
* @return
*/
public static ArrayList<String> splitDocument(String document, Locale locale) {
//--- split at whitespaces ---
String words[] = document.split("\\s+");
ArrayList<String> result = new ArrayList<String>();
if (null == locale)
for (int i = 0; i != words.length; i++)
result.add(words[i]);
else
for (int i = 0; i != words.length; i++)
result.add(words[i].toLowerCase(locale));
return result;
}
/**
* Split up a word to multiple tokens if it begins or ends with special chars.
* @param word
* @return
*/
public static ArrayList<String> splitWord(String word) {
//--- split each word from prefixes and/or suffixes ---
//--- split up incoming word ---
final String splitter = specialWordSeparators;
word = word.replaceAll("(?=" + splitter + ")", " ");
word = word.replaceAll("(?<=" + splitter + ")", " ");
String[] splitted = word.split("\\s+");
//--- find prefixes and suffixes ---
int prefix = 0;
int suffix = splitted.length;
for (int i = 0; i != splitted.length; i++)
if (splitted[i].length() > 0 && !splitted[i].matches(splitter)) {
prefix = i - 1;
break ;
}
for (int i = splitted.length - 1; i > prefix; i--)
if (splitted[i].length() > 0 && !splitted[i].matches(splitter)) {
suffix = i + 1;
break ;
}
//--- merge internal parts (keep only prefixes and suffixes separately) ---
StringBuilder sb = new StringBuilder();
for (int i = prefix + 1; i != suffix; i++)
sb.append(splitted[i]);
ArrayList<String> result = new ArrayList<String>();
for (int i = 0; i <= prefix; i++)
if (splitted[i].length() > 0)
result.add(splitted[i]);
result.add(sb.toString());
for (int i = suffix; i < splitted.length; i++)
if (splitted[i].length() > 0)
result.add(splitted[i]);
return result;
}
/**
* Merges consequent tokens if they have a common meaning.
* TODO: Requires some kind of heuristics (common abbreviations, etc.)
* @param tokens
* @return
*/
public static ArrayList<String> mergeTokens(ArrayList<String> tokens) {
// merge same 1-length tokens
int first = -1;
int last = -1;
String merged = "";
for (int i = tokens.size() - 2; i >= 0; i--) {
if (tokens.get(i).length() == 1 && tokens.get(i).equals(tokens.get(i + 1))) {
if (last == -1) {
last = i + 1;
merged = tokens.get(i + 1);
}
first = i;
merged += tokens.get(i);
}
else if (first >= 0) {
for (int j = last; j >= first; j--)
tokens.remove(j);
tokens.add(first, merged);
first = -1;
last = -1;
merged = "";
}
}
// flush a pending merge that reaches the start of the token list
if (first >= 0) {
for (int j = last; j >= first; j--)
tokens.remove(j);
tokens.add(first, merged);
}
return tokens;
}
/**
* Tells if a word contains only special word separator characters.
* @param token word to analyze
* @return true if word consists only of separator chars
*/
public static boolean isSeparator(String token) {
return token.matches(specialWordSeparators + "+");
}
/**
* Removes all separator tokens from the list of tokens.
* @param tokenList list of tokens
* @return cleaned list of tokens
*/
public static ArrayList<String> cleanSeparators(ArrayList<String> tokenList, boolean onlyFromEdges) {
ArrayList<String> result = new ArrayList<String>();
if (onlyFromEdges) {
int first = -1;
int last = -1;
// find first non-separator
for (int i = 0; i != tokenList.size(); i++)
if (!isSeparator(tokenList.get(i))) {
first = i;
break ;
}
if (first != -1) {
// find last non-separator
for (int i = tokenList.size() - 1; i >= first; i--)
if (!isSeparator(tokenList.get(i))) {
last = i;
break ;
}
// cut middle
for (int i = first; i <= last; i++)
result.add(tokenList.get(i));
}
}
else
for (Iterator<String> i = tokenList.iterator(); i.hasNext(); ) {
String word = i.next();
if (!isSeparator(word))
result.add(word);
}
return result;
}
/**
* Merge the list of tokens to a string
* @param tokenList token list
* @return single string
*/
public static String tokenListToExpression(ArrayList<String> tokenList) {
StringBuilder result = new StringBuilder();
if (tokenList.size() != 0) {
result.append(tokenList.get(0));
for (int i = 1; i != tokenList.size(); i++)
result.append(" " + tokenList.get(i));
}
return result.toString();
}
/**
* Stem all tokens using the specified language's stemmer.
* @param language
* @param tokens
* @return
* @throws ClassNotFoundException
* @throws InstantiationException
* @throws IllegalAccessException
*/
public static ArrayList<String> stemTokens(ArrayList<String> tokens, Locale language)
throws ClassNotFoundException, InstantiationException, IllegalAccessException {
Class<?> stemClass = Class.forName("org.tartarus.snowball.ext." +
language.getDisplayLanguage(Locale.ENGLISH).toLowerCase(Locale.ENGLISH) +
"Stemmer");
SnowballStemmer stemmer = (SnowballStemmer) stemClass.newInstance();
ArrayList<String> result = new ArrayList<String>();
for (int i = 0; i != tokens.size(); i++) {
stemmer.setCurrent(tokens.get(i));
stemmer.stem();
result.add(stemmer.getCurrent());
}
return result;
}
/**
* Stem all expressions using the specified language's stemmer.
* @param language language to use for stemming
* @param expressions list of expressions, where expressions contain space separated words
* @return
* @throws ClassNotFoundException
* @throws InstantiationException
* @throws IllegalAccessException
*/
public static ArrayList<String> stemExpressions(ArrayList<String> expressions, Locale
language) throws ClassNotFoundException, InstantiationException, IllegalAccessException {
Class<?> stemClass = Class.forName("org.tartarus.snowball.ext." +
language.getDisplayLanguage(Locale.ENGLISH).toLowerCase(Locale.ENGLISH) +
"Stemmer" );
SnowballStemmer stemmer = (SnowballStemmer) stemClass.newInstance();
ArrayList<String> result = new ArrayList<String>();
for (int i = 0; i != expressions.size(); i++) {
String[] tokens = expressions.get(i).split(" ");
StringBuilder stemmedExpression = new StringBuilder();
for (int j = 0; j != tokens.length; j++) {
stemmer.setCurrent(tokens[j]);
stemmer.stem();
if (0 == j)
stemmedExpression.append(stemmer.getCurrent());
else
stemmedExpression.append(" " + stemmer.getCurrent());
}
result.add(stemmedExpression.toString());
}
return result;
}
/**
* Detect natural language.
*/
public Locale languageDetect(String document) {
ranker.reset();
ranker.account(document);
NGramProfiles.RankResult res = ranker.getRankResult();
return new Locale(res.getName(0));
}
/**
* Splits the input document at separator chars. In-word separators are not considered.
* The result list won't contain any separators that were used as splitters.
* @param document document to split
* @param locale locale to use for tokenization
* @return
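* Example (illustrative): splitAtSeparators("nokia, samsung galaxy", new Locale("hu"))
* yields ["nokia", "samsung galaxy"].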
*/
public ArrayList<String> splitAtSeparators(String document, Locale locale) {
ArrayList<String> result = new ArrayList<String>();
ArrayList<String> tokens = tokenize(document, locale);
StringBuilder expression = new StringBuilder();
for (Iterator<String> i = tokens.iterator(); i.hasNext(); ) {
String token = i.next();
if (isSeparator(token)) {
if (expression.length() != 0)
result.add(expression.toString());
expression = new StringBuilder();
}
else {
if (expression.length() == 0)
expression.append(token);
else
expression.append(" " + token);
}
}
if (expression.length() != 0)
result.add(expression.toString());
return result;
}
/**
* Create all multi-word subphrases from supplied expression.
* @param expression initial expression to process
* @param locale language locale to use for tokenization
* @param removeSeparators if true separator tokens will be removed from result
* @param removeStopWords if true 1 length expressions won't contain stopwords
* @return List of all sub-expressions.
*/
public ArrayList<String> composeSubExpressions(String expression, Locale locale, boolean
removeSeparators, boolean removeStopWords) {
ArrayList<String> result = tokenize(expression, locale);
if (removeSeparators)
result = cleanSeparators(result, false);
if (removeStopWords)
return composer(result, locale, 1, 0);
else
return composer(result, null, 1, 0);
}
/**
* Create all multi-word subphrases from supplied expression in the specified range.
* @param expression initial expression to process
* @param locale language locale to use for tokenization
* @param minLength minimum expression length (number of words)
* @param maxLength maximum expression length (if 0, all possible lengths will be generated)
* @param removeSeparators if true separator tokens will be removed from result
* @param removeStopWords if true 1 length expressions won't contain stopwords
* @return List of all sub-expressions.
*/
public ArrayList<String> composeSubExpressions(String expression, Locale locale, int
minLength, int maxLength, boolean removeSeparators, boolean removeStopWords) {
ArrayList<String> result = tokenize(expression, locale);
if (removeSeparators)
result = cleanSeparators(result, false);
if (removeStopWords)
return composer(result, locale, minLength, maxLength);
else
return composer(result, null, minLength, maxLength);
}
/**
* Create all multi-word subphrases from all supplied expressions.
* @param expressions initial expression to process
* @param locale language locale to use for tokenization
* @param removeSeparators if true separator tokens will be removed from result
* @return List of all sub-expressions.
*/
public ArrayList<String> composeSubExpressions(ArrayList<String> expressions, Locale
locale, boolean removeSeparators, boolean removeStopWords) {
ArrayList<String> result = new ArrayList<String>();
for (Iterator<String> i = expressions.iterator(); i.hasNext(); ) {
String expression = i.next();
ArrayList<String> subResult = tokenize(expression, locale);
if (removeSeparators)
subResult = cleanSeparators(subResult, false);
if (removeStopWords)
result.addAll(composer(subResult, locale, 1, 0));
else
result.addAll(composer(subResult, null, 1, 0));
}
return result;
}
/**
* Create multi-word subphrases from all supplied expressions in the specified range.
* @param expressions initial expression to process
* @param locale language locale to use for tokenization
* @param minLength minimum expression length (number of words)
* @param maxLength maximum expression length (if 0, all possible lengths will be generated)
* @param removeSeparators if true separator tokens will be removed from result
* @param removeStopWords if true 1 length expressions won't contain stopwords
* @return List of all sub-expressions.
*/
public ArrayList<String> composeSubExpressions(ArrayList<String> expressions, Locale
locale, int minLength, int maxLength, boolean removeSeparators, boolean removeStopWords) {
ArrayList<String> result = new ArrayList<String>();
for (Iterator<String> i = expressions.iterator(); i.hasNext(); ) {
String expression = i.next();
ArrayList<String> subResult = tokenize(expression, locale);
if (removeSeparators)
subResult = cleanSeparators(subResult, false);
if (removeStopWords)
result.addAll(composer(subResult, locale, minLength, maxLength));
else
result.addAll(composer(subResult, null, minLength, maxLength));
}
return result;
}
/**
* Generate all subexpressions from a list of single words (order of words will be maintained).
* Single word expressions will be generated without stopwords.<br>
* E.g.: <b>[this, is, a, test]</b> will generate:
* <li> "this is a test"
* <li> "this is a"
* <li> "is a test"
* <li> "this is"
* <li> "is a"
* <li> "a test"
* <li> "test"
* @param words list of words to use for composing
* @param locale language to use for stopword removal (if <b>null</b> no stopword removal is done)
* @param minLength the desired minimum number of words in output expressions
* @param maxLength the desired maximum number of words in output expressions (if set to <b>zero</b> it creates all possible lengths)
* @return
*/
private ArrayList<String> composer(ArrayList<String> words, Locale locale, int minLength,
int maxLength) {
if (maxLength > words.size() || maxLength <= 0) maxLength = words.size();
int minMultiWordLength = Math.max(2, minLength);
// compose all multi-word expressions
ArrayList<String> result = new ArrayList<String>();
for (int exprLen = maxLength; exprLen >= minMultiWordLength; exprLen--)
for (int startWord = 0; startWord != words.size() - exprLen + 1; startWord++) {
String fixedLenExpression = "";
for (int i = 0; i != exprLen; i++) {
if (0 == i)
fixedLenExpression = words.get(startWord + i);
else
fixedLenExpression += " " + words.get(startWord + i);
}
result.add(fixedLenExpression);
}
if (minLength <= 1) {
// add single words
if (null == locale) {
result.addAll(words);
}
else {
// do not add stopwords
HashSet<String> stops = stopWords.getStopWords(locale);
if (null == stops)
result.addAll(words);
else
for (int i = 0; i != words.size(); i++)
if (!stops.contains(words.get(i)))
result.add(words.get(i));
}
}
return result;
}
/**
* Returns the number of words in a string.
* NOTE: actually the number of spaces is counted (+1 for non-empty strings) so make sure:
* <li>all whitespaces are spaces,
* <li>there are no multiple spaces between words,
* <li>there are no starting or trailing spaces.
* @param sourceString
* @return
*/
public static int countWords(String sourceString) {
if (null == sourceString || sourceString.isEmpty())
return 0;
int count = 1;
final char [] chars = sourceString.toCharArray();
for (int i = 0; i < chars.length; i++)
if (chars[i] == ' ')
count++;
return count;
}
public static void main(String argv[]) throws ClassNotFoundException,
InstantiationException, IllegalAccessException, IOException {
PofaTokenizer token = new PofaTokenizer();
Locale locale = new Locale("hu");
String document = "Ez itt egy teszt dokumentum, ami helyesen van írva ,de van benne hiba is :( és néhány hiányzó szóköz:pl.itt...";
ArrayList<String> tokens = PofaTokenizer.tokenize(document, locale);
for (int i = 0; i != tokens.size(); i++ ) {
System.out.println(" \"" + tokens.get(i) + "\"");
}
/*
ArrayList<String> words = cleanSeparators(tokenize("Samsung <notebook > �kezetes cucc,
laptop, netbook", new Locale("hu")));
for (Iterator<String> i = words.iterator(); i.hasNext(); ) {
String word = i.next();
System.out.println(word + ": " + isSeparator(word));
}
*/
/*
PofaTokenizer token = new PofaTokenizer();
FileInputStream fis = new FileInputStream("c:\\users\\mikki\\workspace\\pofa\\texts.txt");
Scanner scanner = new Scanner(fis, "UTF-8");
FileOutputStream fos = new FileOutputStream("c:\\users\\mikki\\workspace\\pofa\\textsstemmed-en-800.txt", true);
OutputStreamWriter out = new OutputStreamWriter(fos, "UTF-8");
while (scanner.hasNextLine()) {
String document = scanner.nextLine();
long t1 = System.currentTimeMillis();
String lng = token.languageDetect(document).getLanguage();
long t2 = System.currentTimeMillis();
System.out.println(lng + " time: " + (t2 - t1));
ArrayList<String> tokens = token.tokenize(document);
ArrayList<String> stems = token.stemTokens("english", tokens);
for (int i = 0; i != tokens.size(); i++) {
if (i != 0)
out.write(" ");
out.write(stems.get(i));
}
out.write("\n");
System.out.println(document);
}
out.close();
fos.close();
scanner.close();
fis.close();
*/
}
}
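A short usage sketch for the sub-expression composer (illustrative; the input string is an assumption, and the expected output follows the composer javadoc above):

PofaTokenizer tokenizer = new PofaTokenizer();
ArrayList<String> subs = tokenizer.composeSubExpressions("nokia 6700 classic", new Locale("hu"), true, false);
// longest expressions first: "nokia 6700 classic", "nokia 6700", "6700 classic",
// then the single words "nokia", "6700", "classic" (no stopword removal requested)
System.out.println(subs);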
PofaUtils.java
package com.wcs.pofa;
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
public class PofaUtils {
private static MessageDigest digest;
static {
try {
digest = MessageDigest.getInstance("MD5");
} catch (NoSuchAlgorithmException e) {
e.printStackTrace();
}
}
//TODO: bigint -> hex conversion omits initial zeroes
public static BigInteger getMD5(String data) {
digest.reset();
byte[] bytes = data.getBytes();
digest.update(bytes, 0, bytes.length);
return (new BigInteger(1, digest.digest()));
}
public PofaUtils() {
}
}
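A possible fix for the TODO above (an illustrative sketch, not part of the delivered source) is to format the digest as a zero-padded 32-digit hex string instead of calling toString(16):

public static String getMD5Hex(String data) {
// String.format keeps leading zeroes, so hashes starting with zero bytes stay 32 characters long
return String.format("%032x", getMD5(data));
}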