Breadboard Models
Grant identifier: GOP 1.1.1-09/1-2009-0019
Version: v1.0
Date: 2011. 03. 31.
Prepared by: Webstar Csoport Kft.
Contributor: Ponte Kft.
Approved by: BDE Research Közhasznú Nonprofit Kft.
BREADBOARD MODELS
The purpose of the breadboard models is to build a working version of each module and to try out the algorithms. The working algorithms are tested so that we can decide whether they meet the functional requirements set for them. If an algorithm proves faulty, we try to fix it or look for a new one in its place. This is repeated until a version is produced that meets the expectations.
In this part we deal with efficiency only minimally (trivially inefficient solutions are avoided). Optimization will be carried out in a later phase.
NAMING
The product name used during the breadboard model phase: Product Opinion Finder and Analyzer, POFA for short.
TASK GROUPS
Based on their tasks, and also taking implementation aspects into account, the modules of the prototype can be organized into the following umbrella tasks:
Data collection:
o [1] controlling the crawlers.
Preprocessing:
o [2] slicer,
o [6] tokenizer,
o [3] entity list building,
o [4] category structure building,
o [5] entity property search (ideally this should be pluggable here sequentially).
Analysis:
o [7] comment categorization (entity + category assignment),
o [8] orientation,
o [9] usefulness.
Query:
o [10] presentation.
Training:
o [11] training.
Maintenance:
o [12] maintenance.
This grouping makes handling the dependency relations simpler. Going from top to bottom, the task groups are to some extent prerequisites of one another, yet each forms a well-contained unit. The internal operation of the individual task groups differs (e.g. Preprocessing can be executed sequentially, while the steps of Analysis may run in parallel).
The task of Data collection is to gather the data sheets and comments of the products found on the web pages specified in advance.
The task of Preprocessing is to extract the information useful to the system from the collected HTML pages and to store it in the appropriate form. In this step we do not yet look for relationships. Several Preprocessing task groups may run in parallel.
The task of Analysis is to uncover relationships between the items already stored in the system and, using them, to try to interpret the texts. The individual elements of this task group can also be invoked from outside. In the absence of external calls it continuously iterates over the comments and, forming a kind of cache, precomputes the values of each comment in advance (categories, orientation, usefulness).
The task of Query is to keep contact with the user. This task group includes the public web service, the interpretation of search expressions, the execution of the queries, the aggregation of the results and the presentation as well.
The task of Training is to collect the training examples arriving from user feedback and, using them, to train the system continuously or at regular intervals.
The task of Maintenance is to maintain the databases so that the system delivers the best possible performance.
DECOMPOSITION INTO CLASSES
crawler:
o loading the configuration XML;
slicer:
o building the entity list: will not be a separate module, the entities are collected from the source pages during slicing;
searching for the properties of entities;
building the category structure;
tokenization:
o comment ↔ entity, category assignment: not a separate module, the information can be extracted from the source pages and is stored during slicing;
comment orientation;
comment usefulness;
presentation;
user feedback, training;
maintenance;
event handler: implements the communication between the individual task groups (simplification of dependencies); see the sketch at the end of this section;
rule system: needed for extracting the useful parts of the pages; performs the matching of the rules;
selecting the domain-specific rules;
database connection: translating high-level DB requests into low-level DB commands, and executing them;
controller (optional).
We try to build the breadboard model in such a way that a minimal amount of code has to be rewritten when the prototype is created.
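As an illustration of the event handler mentioned above, the following is a minimal sketch of a notifier that decouples the task groups. The interface and method names are simplified assumptions for the example; they are not the exact PofaEvents/PofaNotifier API that appears later in the source code.

import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified event interface (the prototype's real one is PofaEvents).
interface PageEvents {
    void onNewPage(String url, String html);
}

// The notifier keeps a list of subscribers, so e.g. the crawler only needs to know
// the notifier, not the slicer or any other downstream module.
class Notifier implements PageEvents {
    private final List<PageEvents> subscribers = new ArrayList<PageEvents>();

    public void subscribe(PageEvents s) {
        subscribers.add(s);
    }

    public void onNewPage(String url, String html) {
        for (PageEvents s : subscribers)
            s.onNewPage(url, html); // fan the event out to every interested task group
    }

    public static void main(String[] args) {
        Notifier n = new Notifier();
        n.subscribe(new PageEvents() {
            public void onNewPage(String url, String html) {
                System.out.println("slicer got: " + url);
            }
        });
        n.onNewPage("http://example.com/review/1", "<html>...</html>");
    }
}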
TOOLS USED
The list of the external, non-standard tools used for building the breadboard model.
LANGUAGE
The elements of the breadboard model are written in Java and Python, using the Eclipse and NetBeans development environments.
PROJECT MANAGER
The Java part of the project is assembled with the Maven project management tool, so that the project remains portable and the prototype to be produced later can be assembled more easily.
PACKAGES
Below we list the non-standard packages used in the breadboard model, together with any modifications made to them.
JUnit v4.8.2
Website: http://www.junit.org/
Function: module for executing unit tests
crawler4j v2.2
Website: http://code.google.com/p/crawler4j/
Function: easily configurable, open-source web crawler.
Modifications:
Page.load(): the character-encoding handling had to be modified so that it recognizes the character encoding of the pages correctly
JTidy r938
Website: http://jtidy.sourceforge.net/
Function: HTML parser and DOM tree builder module; it can also handle and repair non-standard HTML pages.
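As a minimal sketch, a broken page could be parsed into a repaired DOM tree with JTidy roughly as follows (the input document here is an invented example):

import java.io.ByteArrayInputStream;
import java.io.InputStream;

import org.w3c.dom.Document;
import org.w3c.tidy.Tidy;

public class TidyExample {
    public static void main(String[] args) throws Exception {
        // A deliberately malformed page: the <p> element is never closed.
        InputStream html = new ByteArrayInputStream(
                "<html><body><p>unclosed".getBytes("UTF-8"));

        Tidy tidy = new Tidy();
        tidy.setQuiet(true);         // suppress progress messages
        tidy.setShowWarnings(false); // suppress repair warnings

        // parseDOM repairs the markup and returns a standard W3C DOM tree.
        Document dom = tidy.parseDOM(html, null);
        System.out.println(dom.getElementsByTagName("p").getLength()); // 1
    }
}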
Snowball
Website: http://snowball.tartarus.org/
Function: rule-based multilingual stemmer package
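A minimal stemming sketch with the Java bindings of Snowball; the org.tartarus.snowball class and package names are assumptions based on the standard Snowball Java distribution:

import org.tartarus.snowball.SnowballStemmer;
import org.tartarus.snowball.ext.englishStemmer;

public class StemExample {
    public static void main(String[] args) {
        // One stemmer class per supported language (englishStemmer,
        // hungarianStemmer, ...); here we stem a single English word.
        SnowballStemmer stemmer = new englishStemmer();
        stemmer.setCurrent("categories");
        stemmer.stem();
        System.out.println(stemmer.getCurrent()); // "categori"
    }
}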
neo4j REST v0.8 / neo4j Server v1.2 / neo4j embedded
Website:
http://neo4j.org/,
http://wiki.neo4j.org/content/Getting_Started_REST,
http://components.neo4j.org/neo4j-rest/
Function: standalone and embedded graph database server.
TOOLS
Firefox add-on: https://addons.mozilla.org/en-US/firefox/addon/2691
JSON
Website: http://www.json.org/
Function: package implementing the handling of the JSON data format
CREATED CLASSES
Helper classes created in addition to the modules.
Pair
Ordered pair template class.
PofaDomainRule, PofaDomainRuleList, PofaRuleMatcher
Classes handling the rule system that describes the structure of the pages belonging to the individual domains.
PofaDomainRule: a single rule, consisting of a (type, DOM path) pair
PofaDomainRuleList: a list of PofaDomainRules
PofaRuleMatcher: applies the given rules to an HTML page and returns the page fragments captured by the successfully matched rules.
PofaStopWords
Stores the stop words for every language supported by the prototype.
Neo4jDBInterface, PofaNeo4jDB, Neo4jRelationship
Classes handling the Neo4j database connection.
Neo4jDBInterface: provides low-level access
PofaNeo4jDB: provides high-level access
Neo4jRelationship: the list of relations needed for traversing the graph during a query
DATABASES
DOMAIN DB
The selected crawler module (crawler4j) provides itself with a standalone database so that it does not download the same URLs multiple times within the same run.
What we have to store beyond this is a weighting of the individual domain names, with which the frequency and order of revisits can be tuned. For this, the following data must be stored for every domain name:
domain name
importance weight: in what proportion users find useful comments here
change weight: how often the content changes
date of last visit
number of visits
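The document does not fix a revisit-scheduling formula; purely as a hypothetical illustration, the fields above could be combined as follows (the class name, field names and scoring formula are all invented for the example):

public class DomainRecord {
    String domainName;
    double importanceWeight; // ratio of useful comments users find here
    double changeWeight;     // how often the content changes
    long lastVisitMillis;    // date of last visit
    int visitCount;          // number of visits

    // Hypothetical priority: domains that are important, change often and have
    // not been visited for a long time float to the top of the revisit queue.
    double revisitPriority(long nowMillis) {
        double daysSinceVisit = (nowMillis - lastVisitMillis) / 86400000.0;
        return importanceWeight * changeWeight * daysSinceVisit;
    }

    public static void main(String[] args) {
        DomainRecord d = new DomainRecord();
        d.importanceWeight = 0.7;
        d.changeWeight = 0.5;
        d.lastVisitMillis = System.currentTimeMillis() - 3 * 86400000L;
        System.out.println(d.revisitPriority(System.currentTimeMillis())); // ~1.05
    }
}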
DOCUMENT STORE
Its task is to store the textual reviews downloaded by the Slicer, and to store along with them the basic information needed for later processing.
The following data must be stored:
unique identifier of the document (MD5 hash)
raw text (with HTML tags)
source URL
usefulness value
timestamp of the usefulness value
orientation value
timestamp of the orientation value
other labels attached during later processing
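A minimal sketch of storing such a document as a graph node through the Neo4jDBInterface class given in the source code below; the property names, example values and server URL are assumptions:

import org.json.JSONException;
import org.json.JSONObject;

import com.wcs.pofa.db.Neo4jDBInterface;
import com.wcs.pofa.db.Neo4jDBInterface.ServerErrorResponse;

public class StoreDocumentExample {
    public static void main(String[] args)
            throws JSONException, ServerErrorResponse {
        // Assumes a neo4j REST server listening on this URL.
        Neo4jDBInterface db = new Neo4jDBInterface("http://localhost:9999");

        // One node per downloaded review; property names are illustrative only.
        JSONObject props = new JSONObject();
        props.put("md5", "9e107d9d372bb6826bd81d3542a419d6");
        props.put("rawText", "<p>Great battery life...</p>");
        props.put("sourceUrl", "http://example.com/review/42");
        props.put("usefulness", 0.8);
        props.put("usefulnessTimestamp", System.currentTimeMillis());

        String nodeId = db.createNode(props);
        System.out.println("stored document node: " + nodeId);
    }
}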
ENTITY AND CATEGORY STORE
Storage of the searchable products and their categories. A single product may appear under several names, so these names must be collected into groups.
group identifier
o product name 1
o product name 2
o …
o product name n
product type category
o category name
product feature category
o main category name
subcategory name
frequent expressions category
o category name
GRAPH DATABASE
As a trial, the breadboard models use a graph database that contains all of the previously mentioned databases. So instead of the four planned databases we store all the data in a single shared graph (not counting the standalone databases of the individual packages, such as that of crawler4j).
The graph seems a good choice in the sense that there will be a great number of connections between the individual data items, and in most cases the structure of the individual items is not well defined either.
The "Domain DB", however, is not an integral part of the graph; the pages to be visited and their structure are described with a rule system in XML format. The information recorded by the crawler (importance, last visit, etc.) can also be stored in this XML.
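A sketch of how the items of the shared graph might be connected, again using the Neo4jDBInterface class from the listing below; the relationship type names and node properties are assumptions, the document does not fix them:

import java.io.IOException;

import org.json.JSONException;
import org.json.JSONObject;

import com.wcs.pofa.db.Neo4jDBInterface;
import com.wcs.pofa.db.Neo4jDBInterface.ServerErrorResponse;

public class GraphLinkExample {
    public static void main(String[] args)
            throws IOException, JSONException, ServerErrorResponse {
        Neo4jDBInterface db = new Neo4jDBInterface("http://localhost:9999");

        // Entity, category and comment all become plain nodes in one graph.
        String entity = db.createNode(new JSONObject().put("name", "Phone X"));
        String category = db.createNode(new JSONObject().put("name", "battery"));
        String comment = db.createNode(new JSONObject().put("md5", "abc123"));

        // Illustrative relationship types; the real type names are not fixed here.
        db.createRelationship(comment, entity, "MENTIONS", new JSONObject());
        db.createRelationship(comment, category, "ABOUT", new JSONObject());
    }
}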
CHANGES
While building the breadboard model, situations came up that we had not foreseen. Solving them sometimes also required modifying the original plans. These changes are collected here.
DOMAIN RULE SYSTEM
The originally planned NAVIGATION rule type has become unnecessary. The crawler would traverse the pages anyway, and building this rule into the existing crawler package would be complicated.
During the first runs of the crawler it turned out that the current rule system sometimes selects too much data on a page. In their present form the rules only express inclusion, i.e. they select a path on the DOM tree, and every piece of HTML content reachable on the selected path gets picked up. In many cases this selection also contains "garbage", so it seems worthwhile to introduce an additional rule that expresses exclusion.
The exclusion could also be given with a regular expression over a DOM path, relative to the match position of the including rule. For example:
<SELECT
TYPE="TEXT"
PATH="html>body>div>p#content-box"
EXCLUDING="(div#ad|div#navigation|div#share)"
/>
In the example we remove the ad, navigation and share parts from the otherwise selected content. This is useful because on many web pages a boilerplate header and footer is placed around each comment, while the text useful to us is not highlighted separately. With these rules the superfluous elements can be cut off.
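A minimal sketch of how such an exclusion pattern might be applied; representing the relative DOM paths as plain strings is an assumption made only for this illustration:

import java.util.regex.Pattern;

public class ExcludeRuleExample {
    public static void main(String[] args) {
        // Relative DOM paths of elements found under the including rule's match
        // position; in the prototype these would come from the parsed DOM tree.
        String[] relativePaths = { "p", "div#ad", "div#share", "p>em" };

        // The EXCLUDING attribute of the rule shown above.
        Pattern excluding = Pattern.compile("(div#ad|div#navigation|div#share)");

        for (String path : relativePaths)
            if (!excluding.matcher(path).matches())
                System.out.println("kept: " + path); // p, p>em
    }
}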
CRAWLER
The crawler does not perform a preliminary check on the downloaded pages for whether they contain usable data; instead it forwards every page it finds (but only from the permitted domains) to the Slicer. The Slicer breaks the page up into usable elements, which it forwards to the appropriate further modules. If a page contains no usable element, it forwards nothing.
SLICER
Language detection was added to the Slicer as well, because a precondition of correct indexing is that the text is lowercased according to the proper language, and the removal of stop words is also based on it.
This language detection does not replace the language detection performed in the Tokenizer, however, because here the text of the whole page is examined at once.
The functionality of the Slicer was extended with the collection of entities, and with the assignment of entities, categories and comments to one another. This information can be extracted from the source pages unambiguously and simply, so it is much simpler to store it during slicing than to figure it out afterwards with some heuristic.
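A minimal sketch of the language-aware lowercasing and stop-word removal step; the stop-word list and the detected language tag below are placeholders (in the prototype the PofaStopWords class supplies the real lists):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class StopWordExample {
    public static void main(String[] args) {
        // Placeholder stop-word list; PofaStopWords would supply the real list
        // for the detected language.
        Set<String> stopWords = new HashSet<String>(
                Arrays.asList("a", "az", "és", "the", "and"));

        String detectedLang = "hu"; // assumed output of the language detection
        Locale locale = new Locale(detectedLang);

        StringBuilder indexed = new StringBuilder();
        for (String token : "Az akkumulátor és a kijelző".split("\\s+")) {
            String t = token.toLowerCase(locale); // language-aware lowercasing
            if (!stopWords.contains(t))
                indexed.append(t).append(' ');
        }
        System.out.println(indexed.toString().trim()); // "akkumulátor kijelző"
    }
}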
SOURCE CODE
Neo4jDBInterface.java
package com.wcs.pofa.db;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.regex.Pattern;

import org.json.JSONArray;
import org.json.JSONException;
import org.json.JSONObject;

import com.wcs.pofa.db.Neo4jRelationship.Neo4jRelationshipDirection;
/**
* Wrapper class for communicating with a neo4j REST server.
*/
public final class Neo4jDBInterface {
/**
* Server response object
*/
public class ServerErrorResponse extends Throwable {
/**
*
*/
private static final long serialVersionUID = 4213803944885637897L;
private int returnCode;
private String response;
/**
* Indicates that the error is not a standard HTTP error (e.g. unexpected response)
*/
public static final int OTHER_ERROR = -1;
public ServerErrorResponse(int returnCode, String response) {
this .returnCode = returnCode;
this .response = response;
}
public int getReturnCode() { return returnCode; }
public String getResponse() { return response; }
@Override
public String toString() { return returnCode + ": " + response; }
}
/**
* Sends a request to the neo4j REST server.
* @param request Neo4jHttpRequestType (GET, POST, PUT, DELETE)
* @param path URL to send request to (path to node, relationship, etc.)
* @param data additional data (request body)
* @return Server response body if request successful
* @throws ServerErrorResponse on error response
*/
    private String sendRequest(Neo4jHttpRequestType request, String path, String data)
            throws ServerErrorResponse {
int responseCode = HttpURLConnection.HTTP_NO_CONTENT;
StringBuilder response = new StringBuilder();
try {
// Send data
URL url = new URL(path);
HttpURLConnection conn = (HttpURLConnection) url.openConnection ();
conn.addRequestProperty("Content-type", "application/json");
conn.addRequestProperty("Accept", "application/json");
OutputStreamWriter writer = null;
if (Neo4jHttpRequestType.GET != request) {
conn.setDoOutput(true);
switch (request) {
case PUT:
conn.setRequestMethod("PUT");
case POST:
writer = new OutputStreamWriter(conn.getOutputStream(), "UTF-8");
// append data
writer.write(data);
writer.flush();
break ;
case DELETE:
conn.setRequestMethod("DELETE");
break ;
}
}
// Get the response
IOException error = null;
BufferedReader reader;
responseCode = conn.getResponseCode();
try {
reader = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
} catch (IOException e) {
error = e;
reader = new BufferedReader(new InputStreamReader(conn.getErrorStream(), "UTF-8"));
}
String line;
while ((line = reader.readLine()) != null)
response.append(line);
reader.close();
if (null != writer)
writer.close();
// decide whether to return nicely or with an error
if (null == error)
return response.toString();
else
throw error;
        } catch (IOException e) {
            throw new ServerErrorResponse(responseCode, e.getMessage());
        }
    }

    public Neo4jDBInterface(String databaseUrl) throws ServerErrorResponse {
        String response = "";
        JSONObject answer;
        try {
            // query database server for root node
            response = sendRequest(Neo4jHttpRequestType.GET, databaseUrl + "/", null);
            answer = new JSONObject(response);
            nodeUrl = answer.getString("node") + "/";
            indexNodeUrl = answer.getString("node_index") + "/";
            indexRelationshipUrl = answer.getString("relationship_index") + "/";
            relationshipUrl = nodeUrl.replaceFirst("node", "relationship");
            nodeUrlRegex = Pattern.quote(nodeUrl);
            relationshipUrlRegex = Pattern.quote(relationshipUrl);
            rootNode = answer.getString("reference_node").replaceFirst(this.nodeUrl, "");
        } catch (JSONException e) {
            throw new ServerErrorResponse(ServerErrorResponse.OTHER_ERROR,
                    "Unknown server response:\n" + response);
        }
    }
/*
**********************************************************************************************
**************** */
/**
* HTTP request types required for communication with the neo4j REST server
*/
private static enum Neo4jHttpRequestType {
GET,
POST,
PUT,
DELETE
}
/**
* Result of traverse
*/
public static enum Neo4jTraverseResult {
NODE,
RELATIONSHIP,
PATH
}
/**
* Traverse order
*/
    public static enum Neo4jTraverseOrder {
        DEPTH_FIRST,
        BREADTH_FIRST
    }
/**
* Traverse return filter
*/
public static enum Neo4jTraverseReturnFilter {
ALL,
ALL_BUT_START_NODE
}
/**
* Traverse return uniqueness filter
*/
public static enum Neo4jTraverseUniqueness {
NODE_PATH,
NODE
}
/**
* ID of root node (relative URL to server)
*/
private String rootNode;
/**
* Node URL
*/
private String nodeUrlRegex;
private String nodeUrl;
/**
* Relationship URL
*/
private String relationshipUrlRegex;
private String relationshipUrl;
/**
* Index URLs
*/
private String indexNodeUrl;
private String indexRelationshipUrl;
/*
**********************************************************************************************
**************** */
/**
* Returns root node.
*/
public String getRootNode() {
return rootNode;
}
/**
* Creates a new node in the graph with initial properties set.
* @param properties a JSON with the properties to set
* @return server node ID of the new node
* @throws ServerErrorResponse if request couldn't be completed
*/
    public String createNode(JSONObject properties) throws ServerErrorResponse {
        String response = sendRequest(Neo4jHttpRequestType.POST,
                nodeUrl.substring(0, nodeUrl.length() - 1), properties.toString());
        try {
            JSONObject answer = new JSONObject(response);
            return (answer.getString("self").replaceFirst(this.nodeUrlRegex, ""));
        } catch (JSONException e) {
            throw new ServerErrorResponse(ServerErrorResponse.OTHER_ERROR,
                    "Unknown server response:\n" + response);
        }
    }
/**
* Removes the specified node.
* @param node node ID
* @throws ServerErrorResponse if request couldn't be completed
*/
public void removeNode(String node) throws ServerErrorResponse {
sendRequest(Neo4jHttpRequestType.DELETE, nodeUrl + node, null);
}
/**
* Gets all properties of specified node.
* @param node node ID
* @return server response body (JSON or empty string)
* @throws ServerErrorResponse if request couldn't be completed
*/
public JSONObject getNodeProperties(String node) throws ServerErrorResponse {
String response = sendRequest(Neo4jHttpRequestType.GET, nodeUrl + node + "/properties",
null);
try {
if (response.isEmpty()) response = "{}";
return (new JSONObject(response));
        } catch (JSONException e) {
            throw new ServerErrorResponse(ServerErrorResponse.OTHER_ERROR,
                    "Unknown server response:\n" + response);
        }
}
/**
* Replaces all properties on a node with the supplied set of properties.
* @param node node ID
* @param properties a JSON with the properties to set
* @throws ServerErrorResponse if request couldn't be completed
*/
    public void setNodeProperties(String node, JSONObject properties)
            throws ServerErrorResponse {
        sendRequest(Neo4jHttpRequestType.PUT, nodeUrl + node + "/properties",
                properties.toString());
    }
/**
* Removes all properties from a node.
* @param node node ID
* @throws ServerErrorResponse if request couldn't be completed
*/
public void removeNodeProperties(String node) throws ServerErrorResponse {
sendRequest(Neo4jHttpRequestType.DELETE, nodeUrl + node + "/properties", null);
}
/**
* Returns the value of specified property of specified node.
* @param node node ID
* @param property property name
* @return value of property
* @throws ServerErrorResponse if request couldn't be completed
*/
public Object getNodeProperty(String node, String property) throws ServerErrorResponse {
String response = sendRequest(Neo4jHttpRequestType.GET, nodeUrl + node + "/properties/" +
property, null);
try {
return new JSONObject("{\"a\":" + response + "}").get("a");
        } catch (JSONException e) {
            throw new ServerErrorResponse(ServerErrorResponse.OTHER_ERROR,
                    "Unknown server response:\n" + response);
        }
}
/**
* Changes a property of a node. Leaves all other properties intact.
* @param node node ID
* @param property name of property to change/create
* @param value value of property
* @throws ServerErrorResponse if request couldn't be completed
*/
    public void setNodeProperty(String node, String property, Object value)
            throws ServerErrorResponse {
        String val;
        // JSONObject.quote escapes quotes and backslashes inside string values
        if (value instanceof String) val = JSONObject.quote((String) value);
        else val = value.toString();
        sendRequest(Neo4jHttpRequestType.PUT, nodeUrl + node + "/properties/" + property, val);
    }
/**
* Removes specified property from specified node
* @param node node ID
* @param property property name
* @throws ServerErrorResponse if request couldn't be completed
*/
public void removeNodeProperty(String node, String property) throws ServerErrorResponse {
sendRequest(Neo4jHttpRequestType.DELETE, nodeUrl + node + "/properties/" + property,
null);
}
/*
**********************************************************************************************
**************** */
/**
* Create a new relationship between 2 nodes
* @param from start node (node ID) of relationship
* @param to end node (node ID) of relationship
* @param type type identifier of relationship
* @param properties properties to set for relationsip
* @return relationship ID
* @throws IOException if request couldn't be composed
* @throws ServerErrorResponse if request couldn't be completed
*/
    public String createRelationship(String from, String to, String type, JSONObject properties)
            throws IOException, ServerErrorResponse {
        JSONObject params;
        params = new JSONObject();
        try {
            params.put("to", nodeUrl + to);
            params.put("type", type);
            params.put("data", properties);
            String response = sendRequest(Neo4jHttpRequestType.POST,
                    nodeUrl + from + "/relationships", params.toString());
            try {
                JSONObject answer = new JSONObject(response);
                // "self" is a relationship URL, so strip the relationship prefix
                return (answer.getString("self").replaceFirst(this.relationshipUrlRegex, ""));
            } catch (JSONException e) {
                throw new ServerErrorResponse(ServerErrorResponse.OTHER_ERROR,
                        "Unknown server response:\n" + response);
            }
        } catch (JSONException e) {
            throw new IOException("Cannot compose request");
        }
    }
/**
* Gets relationship type and properties.
* @param relationship relationship ID
* @return a JSON object with type and data keys
* @throws ServerErrorResponse if request couldn't be completed
*/
public JSONObject getRelationship(String relationship) throws ServerErrorResponse {
String response = sendRequest(Neo4jHttpRequestType.GET, relationshipUrl + relationship,
null);
try {
JSONObject answer = new JSONObject(response);
JSONObject result = new JSONObject();
result.put("start", answer.getString("start").replaceFirst(nodeUrlRegex, ""));
result.put("end", answer.getString("end").replaceFirst(nodeUrlRegex, ""));
result.put("type", answer.get("type"));
result.put("data", answer.get("data"));
return (result);
        } catch (JSONException e) {
            throw new ServerErrorResponse(ServerErrorResponse.OTHER_ERROR,
                    "Unknown server response:\n" + response);
        }
    }
/**
* Removes specified relationship
* @param relationship relationship ID
* @throws ServerErrorResponse if request couldn't be completed
*/
public void removeRelationship(String relationship) throws ServerErrorResponse {
sendRequest(Neo4jHttpRequestType.DELETE, relationshipUrl + relationship, null);
}
/**
* Gets relationship type.
* @param relationship relationship ID
* @return relationship type identifier
* @throws ServerErrorResponse if request couldn't be completed
*/
public String getRelationshipType(String relationship) throws ServerErrorResponse {
String response = sendRequest(Neo4jHttpRequestType.GET, relationshipUrl + relationship,
null);
try {
JSONObject answer = new JSONObject(response);
return (answer.getString("type"));
        } catch (JSONException e) {
            throw new ServerErrorResponse(ServerErrorResponse.OTHER_ERROR,
                    "Unknown server response:\n" + response);
        }
}
/**
* Gets relationship properties.
* @param relationship relationship ID
* @return a JSON object with the properties
* @throws ServerErrorResponse if request couldn't be completed
*/
public JSONObject getRelationshipProperties(String relationship) throws ServerErrorResponse
{
String response = sendRequest(Neo4jHttpRequestType.GET, relationshipUrl + relationship,
null);
try {
JSONObject answer = new JSONObject(response);
return (answer.getJSONObject("data"));
        } catch (JSONException e) {
            throw new ServerErrorResponse(ServerErrorResponse.OTHER_ERROR,
                    "Unknown server response:\n" + response);
        }
}
/**
* Replaces all properties on a relationship with the supplied set of properties.
* @param relationship relationship ID
* @param properties JSON with the properties to set
* @throws ServerErrorResponse if request couldn't be completed
*/
    public void setRelationshipProperties(String relationship, JSONObject properties)
            throws ServerErrorResponse {
        sendRequest(Neo4jHttpRequestType.PUT, relationshipUrl + relationship + "/properties",
                properties.toString());
    }
/**
 * Removes all properties from the specified relationship.
 * @param relationship relationship ID
 * @throws ServerErrorResponse if the request couldn't be completed
*/
public void removeRelationshipProperties(String relationship) throws ServerErrorResponse {
sendRequest(Neo4jHttpRequestType.DELETE, relationshipUrl + relationship + "/properties",
null);
}
/**
* Returns the value of specified property of specified relationship.
* @param relationship relationship ID
* @param property property name
* @return value of property
* @throws ServerErrorResponse if request couldn't be completed
*/
    public Object getRelationshipProperty(String relationship, String property)
            throws ServerErrorResponse {
String response = sendRequest(Neo4jHttpRequestType.GET, relationshipUrl + relationship +
"/properties/" + property, null);
return JSONObject.stringToValue(response);
}
/**
* Changes a property of a relationship. Leaves all other properties intact.
* @param relationship relationship ID
* @param property property name to change/create
* @param value value to set
* @throws ServerErrorResponse if request couldn't be completed
*/
    public void setRelationshipProperty(String relationship, String property, Object value)
            throws ServerErrorResponse {
        String val;
        // JSONObject.quote escapes quotes and backslashes inside string values
        if (value instanceof String) val = JSONObject.quote((String) value);
        else val = value.toString();
        sendRequest(Neo4jHttpRequestType.PUT,
                relationshipUrl + relationship + "/properties/" + property, val);
    }
/**
* Remove specified property from specified relationship
* @param relationship relationship ID
* @param property property name
* @throws ServerErrorResponse if request couldn't be completed
*/
    public void removeRelationshipProperty(String relationship, String property)
            throws ServerErrorResponse {
        // the property name must be part of the URL for a single-property delete
        sendRequest(Neo4jHttpRequestType.DELETE,
                relationshipUrl + relationship + "/properties/" + property, null);
    }
/*
**********************************************************************************************
**************** */
    /**
     * Returns the selected relationships of a node
     * @param node node ID
     * @param relationships list of relationship types and directions to include
     * @return array of relationship information (like getRelationship())
     * @throws ServerErrorResponse if request couldn't be completed
     */
    public JSONArray getNodeRelationships(String node, ArrayList<Neo4jRelationship> relationships)
            throws ServerErrorResponse {
        StringBuilder target = new StringBuilder();
        if (relationships.size() != 0) {
            target.append(nodeUrl + node + "/relationships/"
                    + relationships.get(0).getDirection().toString().toLowerCase());
            target.append("/");
            target.append(relationships.get(0).getType());
            for (int i = 1; i != relationships.size(); i++)
                target.append("&" + relationships.get(i).getType());
        }
        else
            target.append(nodeUrl + node + "/relationships/all");
String response = sendRequest(Neo4jHttpRequestType.GET, target.toString(), null);
try {
JSONArray answer = new JSONArray(response);
JSONArray result = new JSONArray();
for (int i = 0; i != answer.length(); i++) {
JSONObject item = (JSONObject)answer.get(i);
result.put(new JSONObject().
put("start", item.getString("start").replaceFirst(this.nodeUrlRegex, "")).
put("end", item.getString("end").replaceFirst(this.nodeUrlRegex, "")).
put("type", item.getString("type")).
put("data", item.getString("data")).
put("relationship",
item.getString("self").replaceFirst(this.relationshipUrlRegex, ""))
);
}
return result;
        } catch (JSONException e) {
            throw new ServerErrorResponse(ServerErrorResponse.OTHER_ERROR,
                    "Unknown server response:\n" + response);
        }
    }
/*
**********************************************************************************************
**************** */
/**
* Adds a node to the index.
* @param node node ID to add to index
* @param key index key
* @param value index value
* @throws ServerErrorResponse if request couldn't be completed
*/
public void addNodeToIndex(String indexName, String node, String key, String value) throws
ServerErrorResponse {
sendRequest(Neo4jHttpRequestType.POST, indexNodeUrl + indexName + "/" + key + "/" +
value, "\"" + nodeUrl + node + "\"");
}
/**
* Removes a node from the index
* @param node node ID to remove from index
* @param key indexing key
* @param value index value
* @throws ServerErrorResponse if request couldn't be completed
*/
    public void removeNodeFromIndex(String node, String indexName, String key, String value)
            throws ServerErrorResponse {
sendRequest(Neo4jHttpRequestType.DELETE, indexNodeUrl + indexName + "/" + key + "/" +
value + "/" + node, null);
}
/**
* Queries DB index for (key, value) pair for matching nodes.
* @param key index key
* @param value index value
* @return JSONArray with node IDs and node properties
* @throws ServerErrorResponse if request couldn't be completed
*/
    public JSONArray queryNodeIndex(String indexName, String key, String value)
            throws ServerErrorResponse {
try {
String response = sendRequest(Neo4jHttpRequestType.GET, indexNodeUrl + indexName + "/"
+ key + "/" + value, null);
try {
JSONArray answer = new JSONArray(response);
JSONArray result = new JSONArray();
for (int i = 0; i != answer.length(); i++) {
JSONObject item = answer.getJSONObject(i);
JSONObject indexHit = new JSONObject();
indexHit.put("node", item.getString("self").replaceFirst(nodeUrlRegex, ""));
indexHit.put("data", item.get("data"));
result.put(indexHit);
}
return (result);
            } catch (JSONException e) {
                throw new ServerErrorResponse(ServerErrorResponse.OTHER_ERROR,
                        "Unknown server response:\n" + response);
            }
} catch (ServerErrorResponse e) {
if (HttpURLConnection.HTTP_NOT_FOUND == e.getReturnCode()) return new JSONArray();
throw e;
}
}
/**
* Adds a relationship to the index.
* @param relationship relationship ID to add to index
* @param key index key
* @param value index value
* @throws ServerErrorResponse if request couldn't be completed
*/
    public void addRelationshipToIndex(String indexName, String relationship, String key,
            String value) throws ServerErrorResponse {
sendRequest(Neo4jHttpRequestType.POST, indexRelationshipUrl + indexName + "/" + key + "/"
+ value, "\"" + relationshipUrl + relationship + "\"");
}
/**
* Removes a relationship from the index
* @param relationship relationship ID to remove from index
* @param key indexing key
* @param value index value
* @throws ServerErrorResponse if request couldn't be completed
*/
public void removeRelationshipFromIndex(String relationship, String indexName, String key,
String value) throws ServerErrorResponse {
sendRequest(Neo4jHttpRequestType.DELETE, indexRelationshipUrl + indexName + "/" + key +
"/" + value + "/" + relationship, null);
}
/**
* Queries DB index for (key, value) pair for matching relationship.
* @param key index key
* @param value index value
* @return JSONArray with relationship IDs and relationship properties
* @throws ServerErrorResponse if request couldn't be completed
*/
public JSONArray queryRelationshipIndex(String indexName, String key, String value) throws
ServerErrorResponse {
String response = sendRequest(Neo4jHttpRequestType.GET, indexRelationshipUrl + indexName
+ "/" + key + "/" + value, null);
//TODO: format response
try {
JSONArray answer = new JSONArray(response);
JSONArray result = new JSONArray();
for (int i = 0; i != answer.length(); i++) {
JSONObject item = answer.getJSONObject(i);
JSONObject indexHit = new JSONObject();
indexHit.put("node", item.getString("self").replaceFirst(nodeUrlRegex, ""));
indexHit.put("data", item.get("data"));
result.put(indexHit);
}
return (result);
        } catch (JSONException e) {
            throw new ServerErrorResponse(ServerErrorResponse.OTHER_ERROR,
                    "Unknown server response:\n" + response);
        }
}
/*
**********************************************************************************************
**************** */
    /**
     * General purpose graph traverser.
     * @param startNode node ID to start traversing from
     * @param returnType type of objects to return (nodes, relationships or paths)
     * @param order depth-first or breadth-first traversing
     * @param uniqueness uniqueness filter
     * @param relationships relationship types and directions to follow
     * @param pruneEvaluatorJS JavaScript evaluator to prune graph while traversing. If empty
     *        then maxDepth is used
     * @param returnFilter which nodes to return
     * @param maxDepth maximum depth to traverse from start node. Ignored if pruneEvaluatorJS
     *        is not empty.
     * @return JSONArray containing requested return objects (nodes, relationships or paths)
     *         of reached entities in graph.
     * @throws IOException if request couldn't be composed
     * @throws ServerErrorResponse if request couldn't be completed
     */
    public JSONArray traverse(String startNode, Neo4jTraverseResult returnType,
            Neo4jTraverseOrder order, Neo4jTraverseUniqueness uniqueness,
            ArrayList<Neo4jRelationship> relationships, String pruneEvaluatorJS,
            Neo4jTraverseReturnFilter returnFilter, int maxDepth)
            throws IOException, ServerErrorResponse {
        JSONObject request = new JSONObject();
        try {
            // traverse order
            switch (order) {
            case DEPTH_FIRST: request.put("order", "depth first"); break;
            case BREADTH_FIRST: request.put("order", "breadth first"); break;
            }
// uniqueness
            switch (uniqueness) {
            case NODE: request.put("uniqueness", "node"); break;
            case NODE_PATH: request.put("uniqueness", "node path"); break;
            }
// relationships
JSONArray relations = new JSONArray();
for (int i = 0; i != relationships.size(); i++)
relations.put(new JSONObject().
put("type", relationships.get(i).getType()).
put("direction", relationships.get(i).getDirection().toString().toLowerCase())
);
request.put("relationships", relations);
// prune evaluator
if (null != pruneEvaluatorJS && pruneEvaluatorJS.length() > 0)
request.put("prune evaluator",
new JSONObject().
put("language", "javascript").
put("body", pruneEvaluatorJS)
);
// return filter
JSONObject filter = new JSONObject();
filter.put("language", "builtin");
switch (returnFilter) {
case ALL: filter.put("name", "all"); break;
case ALL_BUT_START_NODE: filter.put("name", "all but start node"); break;
}
request.put("return filter", filter);
// max depth
request.put("max depth", maxDepth);
// send request
String response = sendRequest(Neo4jHttpRequestType.POST, nodeUrl + startNode +
"/traverse/" + returnType.toString().toLowerCase(), request.toString());
try {
JSONArray answer = new JSONArray(response);
JSONArray result = new JSONArray();
switch (returnType) {
case NODE:
for (int i = 0; i != answer.length(); i++) {
JSONObject item = answer.getJSONObject(i);
JSONObject resultItem = new JSONObject();
resultItem.put("node", item.getString("self").replaceFirst(nodeUrlRegex, ""));
resultItem.put("data", item.get("data"));
result.put(resultItem);
}
break ;
case RELATIONSHIP:
for (int i = 0; i != answer.length(); i++) {
JSONObject item = answer.getJSONObject(i);
JSONObject resultItem = new JSONObject();
resultItem.put("relationship",
item.getString("self").replaceFirst(relationshipUrlRegex, ""));
resultItem.put("start",
item.getString("start").replaceFirst(nodeUrlRegex,
""));
resultItem.put("end", item.getString("end").replaceFirst(nodeUrlRegex, ""));
resultItem.put("type", item.get("type"));
resultItem.put("data", item.get("data"));
result.put(resultItem);
}
break ;
case PATH:
for (int i = 0; i != answer.length(); i++) {
JSONObject item = answer.getJSONObject(i);
JSONObject resultItem = new JSONObject();
JSONArray itemNodes = item.getJSONArray("nodes");
resultItem.put("nodes", new JSONArray());
for (int j = 0; j != itemNodes.length(); j++)
resultItem.accumulate("nodes",
itemNodes.getString(j).replaceFirst(nodeUrlRegex, ""));
JSONArray itemRelations = item.getJSONArray("relationships");
resultItem.put("relationships", new JSONArray());
for (int j = 0; j != itemRelations.length(); j++)
resultItem.accumulate("relationships",
itemRelations.getString(j).replaceFirst(relationshipUrlRegex, ""));
resultItem.put("start",
item.getString("start").replaceFirst(nodeUrlRegex,
""));
resultItem.put("end", item.getString("end").replaceFirst(nodeUrlRegex, ""));
resultItem.put("length", item.get("length"));
result.put(resultItem);
}
break ;
}
return (result);
            } catch (JSONException e) {
                throw new ServerErrorResponse(ServerErrorResponse.OTHER_ERROR,
                        "Unknown server response:\n" + response);
            }
        } catch (JSONException e) {
            throw new IOException("Cannot compose request");
        }
    }
/*
**********************************************************************************************
**************** */
    /**
     * Returns the shortest path from startNode to endNode using the specified relationships only.
     * If the length of returned path is 0 then no path exists.
     * @param startNode Node ID to start path finding from.
     * @param endNode Node ID to reach to.
     * @param relationships List of relationships allowed to use.
     * @param maxDepth Maximum path length to search for.
     * @return A JSONObject containing the nodes and relationships on the shortest path.
     * @throws IOException if request couldn't be composed
     * @throws ServerErrorResponse if request couldn't be completed
     */
    public JSONObject findShortestPath(String startNode, String endNode,
            ArrayList<Neo4jRelationship> relationships, int maxDepth)
            throws IOException, ServerErrorResponse {
JSONObject request = new JSONObject();
// compose request
try {
request.put("to", nodeUrl + endNode);
request.put("algorithm", "shortestPath");
request.put("max depth", maxDepth);
// relationships
JSONArray relations = new JSONArray();
for (int i = 0; i != relationships.size(); i++)
relations.put(new JSONObject().
put("type", relationships.get(i).getType()).
put("direction", relationships.get(i).getDirection().toString().toLowerCase())
);
request.put("relationships", relations);
            try {
                String response = sendRequest(Neo4jHttpRequestType.POST,
                        nodeUrl + startNode + "/path", request.toString());
                try {
                    JSONObject answer = new JSONObject(response);
                    JSONObject result = new JSONObject();
                    // nodes
                    JSONArray itemNodes = answer.getJSONArray("nodes");
                    result.put("nodes", new JSONArray());
                    for (int j = 0; j != itemNodes.length(); j++)
                        result.accumulate("nodes",
                                itemNodes.getString(j).replaceFirst(nodeUrlRegex, ""));
// relationships
JSONArray itemRelations = answer.getJSONArray("relationships");
result.put("relationships", new JSONArray());
for (int j = 0; j != itemRelations.length(); j++)
result.accumulate("relationships",
itemRelations.getString(j).replaceFirst(relationshipUrlRegex, ""));
return result;
                } catch (JSONException e) {
                    throw new ServerErrorResponse(ServerErrorResponse.OTHER_ERROR,
                            "Unknown server response:\n" + response);
                }
} catch (ServerErrorResponse e) {
if (e.returnCode == HttpURLConnection.HTTP_NOT_FOUND) {
// HTTP 404 means no path found, return empty path
return new JSONObject().
put("start", startNode).
put("end", endNode).
put("length", 0).
put("nodes", new JSONArray()).
put("relationships", new JSONArray());
}
else
// other error
throw e;
}
        } catch (JSONException e) {
            throw new IOException("Cannot compose request");
        }
    }
}
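A short usage sketch of the class above; the server URL, the property values and the relationship type are assumptions, and a running neo4j REST server is presumed:

import java.io.IOException;
import java.util.ArrayList;

import org.json.JSONException;
import org.json.JSONObject;

import com.wcs.pofa.db.Neo4jDBInterface;
import com.wcs.pofa.db.Neo4jDBInterface.ServerErrorResponse;
import com.wcs.pofa.db.Neo4jRelationship;
import com.wcs.pofa.db.Neo4jRelationship.Neo4jRelationshipDirection;

public class Neo4jDBInterfaceExample {
    public static void main(String[] args)
            throws IOException, JSONException, ServerErrorResponse {
        Neo4jDBInterface db = new Neo4jDBInterface("http://localhost:9999");

        // create two nodes and connect them
        String a = db.createNode(new JSONObject().put("name", "Phone X"));
        String b = db.createNode(new JSONObject().put("name", "battery"));
        db.createRelationship(a, b, "HAS_CATEGORY", new JSONObject());

        // list the outgoing HAS_CATEGORY relationships of the first node
        ArrayList<Neo4jRelationship> rels = new ArrayList<Neo4jRelationship>();
        rels.add(new Neo4jRelationship("HAS_CATEGORY", Neo4jRelationshipDirection.OUT));
        System.out.println(db.getNodeRelationships(a, rels));
    }
}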
Neo4jRelationship.java
package com.wcs.pofa.db;
public class Neo4jRelationship {
/**
* Relationship directions
*/
public static enum Neo4jRelationshipDirection {
ALL,
IN,
OUT
}
    private String type;
    private Neo4jRelationshipDirection direction;
    public Neo4jRelationship(String type, Neo4jRelationshipDirection direction) {
        super();
        this.type = type;
        this.direction = direction;
    }
    public int hashCode() {
        int hashFirst = type != null ? type.hashCode() : 0;
        int hashSecond = direction != null ? direction.hashCode() : 0;
        return (hashFirst + hashSecond) * hashSecond + hashFirst;
    }

    public boolean equals(Object other) {
        if (other instanceof Neo4jRelationship) {
            Neo4jRelationship otherPair = (Neo4jRelationship) other;
            return ((this.type == otherPair.type ||
                    (this.type != null && otherPair.type != null &&
                     this.type.equals(otherPair.type))) &&
                    (this.direction == otherPair.direction ||
                    (this.direction != null && otherPair.direction != null &&
                     this.direction.equals(otherPair.direction))));
        }
        return false;
    }

    public String toString() {
        return "(" + type + ", " + direction + ")";
    }
    public String getType() {
        return type;
    }
    public void setType(String type) {
        this.type = type;
    }
    public Neo4jRelationshipDirection getDirection() {
        return direction;
    }
    public void setDirection(Neo4jRelationshipDirection direction) {
        this.direction = direction;
    }
}
Pair.java
package com.wcs.pofa;

import java.io.Serializable;
public class Pair<A, B> implements Serializable {
/**
*
*/
private static final long serialVersionUID = 1L;
private A first;
private B second;
    public Pair(A first, B second) {
        super();
        this.first = first;
        this.second = second;
    }

    public int hashCode() {
        int hashFirst = first != null ? first.hashCode() : 0;
        int hashSecond = second != null ? second.hashCode() : 0;
        return (hashFirst + hashSecond) * hashSecond + hashFirst;
    }

    @SuppressWarnings("unchecked")
    public boolean equals(Object other) {
        if (other instanceof Pair) {
            Pair otherPair = (Pair) other;
            return ((this.first == otherPair.first ||
                    (this.first != null && otherPair.first != null &&
                     this.first.equals(otherPair.first))) &&
                    (this.second == otherPair.second ||
                    (this.second != null && otherPair.second != null &&
                     this.second.equals(otherPair.second))));
        }
        return false;
    }

    public String toString() {
        return "(" + first + ", " + second + ")";
    }
public A getFirst() {
return first;
}
    public void setFirst(A first) {
        this.first = first;
    }
    public B getSecond() {
        return second;
    }
    public void setSecond(B second) {
        this.second = second;
    }
}
Pofa.java
package com.wcs.pofa;

import java.io.File;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;

import com.wcs.pofa.db.PofaNeo4jDB;
import com.wcs.pofa.db.Neo4jDBInterface.ServerErrorResponse;
import com.wcs.pofa.entities.PofaEntityListBuilder;
import com.wcs.pofa.events.PofaEvents;
import com.wcs.pofa.events.PofaNotifier;
import com.wcs.pofa.slicer.PofaSlicer;
/**
* Main class of POFA prototype.
*
*/
public class Pofa implements PofaEvents {
    private String configFile;
    private String databaseUrl;
    private String crawlerSettingsFile;

    private PofaNotifier notifier;
    private PofaNeo4jDB db;
    private PofaDataminerController dataminerController;
    private PofaSlicer slicer;
    private PofaEntityListBuilder entityList;
/**
* Loads the settings from the specified config file.
*/
private void loadConfig() {
try {
File file = new File(configFile);
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(file);
doc.getDocumentElement().normalize();
// load settings
            this.databaseUrl = doc.getElementsByTagName("database").item(0)
                    .getAttributes().getNamedItem("url").getNodeValue();
            this.crawlerSettingsFile = doc.getElementsByTagName("crawler").item(0)
                    .getAttributes().getNamedItem("file").getNodeValue();
} catch (Exception e) {
e.printStackTrace();
}
}
    public PofaNeo4jDB getDb() {
        return this.db;
    }
    public PofaSlicer getSlicer() {
        return this.slicer;
    }
    /**
     * Initialize the prototype.
     * @param configFile
     * @throws ServerErrorResponse
     */
public Pofa(String configFile) throws ServerErrorResponse {
System.out.println("Initializing " + this.getClass().getName() + "...");
// load settings
        this.configFile = configFile;
loadConfig();
// initialize modules
notifier = new PofaNotifier();
notifier.notifyRequest(this);
db = new PofaNeo4jDB(databaseUrl);
slicer = new PofaSlicer(db);
//entityList = new PofaEntityListBuilder(db);
// initialize other modules
dataminerController = new PofaDataminerController(notifier, crawlerSettingsFile);
System.out.println(this.getClass().getName() + " initialized.");
}
public void start() {
// start the dataminer process
dataminerController.start();
}
    public synchronized void onNewPage(PofaAbstractDataminer sender, String url, String page,
            PofaDomainInfo rule) {
        System.out.println(url);
        String html = PofaSlicer.cleanHTML(page);
        //double usefulness =
        slicer.process(url, html, rule);
        //double entityRatio = entityList.addEntities(slicer.getEntities());
    }

    public void onEntityFound(PofaAbstractSlicer sender, String entityName) {
    }

    public void onPageDownloaded(PofaAbstractCrawler sender, Object data) {
    }
public boolean onPageVisiting(PofaAbstractCrawler sender, String url) {
return false ;
}
public static void main(String argv[]) throws ServerErrorResponse {
Pofa prototype = new Pofa("c:\\users\\mikki\\workspace\\pofa\\settings.xml");
prototype.start();
/*
Neo4jDBInterface db = new Neo4jDBInterface("http://localhost:9999");
System.out.println(db.getRelationship("1"));
System.out.println(db.queryIndex("foo", "bar"));
        System.out.println(db.findShortestPath("0", "3", 10, Neo4jRelationshipDirection.ALL));
*/
}
}
PofaAbstractCrawler.java
package com.wcs.pofa;

public interface PofaAbstractCrawler {
}
PofaAbstractDataminer.java
package com.wcs.pofa;

public interface PofaAbstractDataminer {
}
PofaAbstractSlicer.java
package com.wcs.pofa;

public interface PofaAbstractSlicer {
}
PofaCrawler.java
package com.wcs.pofa.crawler;

import java.util.regex.Pattern;

import com.wcs.pofa.PofaAbstractCrawler;
import com.wcs.pofa.events.PofaNotifier;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;
/**
* Crawler to crawl all specified domains and parse valuable sites.
*
*/
public class PofaCrawler extends WebCrawler implements PofaAbstractCrawler {
private Pattern excludeFilter = Pattern.compile(
".*(\\.(css|js|bmp|gif|jpe?g"
+ "|png|tiff?|mid|mp2|mp3|mp4"
+ "|wav|avi|mov|mpeg|ram|m4v|pdf"
+ "|rm|smil|wmv|swf|wma|zip|rar|gz))$"
);
//private Neo4jDBInterface db;
private PofaNotifier notifier;
    public PofaCrawler() {
    }
public boolean shouldVisit(WebURL url) {
String href = url.getURL().toLowerCase();
if (excludeFilter.matcher(href).matches())
return false ;
if (notifier.onPageVisiting(this, url.getURL()))
return true ;
return false ;
}
/**
* Visit specified page
*/
public void visit(Page page) {
notifier.onPageDownload(this, page);
}
    // This function is called by controller to get the local data of this crawler when job is finished
public Object getMyLocalData() {
return null ;
}
// This function is called by controller before finishing the job.
public void onBeforeExit() {
System.out.println("Crawler " + getMyId() + " finished.");
}
public void onStart() {
Object data = getMyData();
if (data instanceof PofaNotifier) {
notifier = ((PofaNotifier)getMyData());
}
else {
//TODO: throw exception
}
}
}
PofaCrawlerController.java
package com.wcs.pofa.crawler;

import java.io.IOException;
import java.util.ArrayList;

import javax.management.modelmbean.XMLParseException;
import javax.xml.parsers.ParserConfigurationException;

import org.xml.sax.SAXException;

import com.wcs.pofa.PofaAbstractCrawler;
import com.wcs.pofa.PofaAbstractDataminer;
import com.wcs.pofa.PofaAbstractSlicer;
import com.wcs.pofa.PofaDomainInfo;
import com.wcs.pofa.events.PofaEvents;
import com.wcs.pofa.events.PofaNotifier;
import com.wcs.pofa.settings.crawler.PofaCrawlerXML;
import com.wcs.pofa.settings.crawler.PofaCrawlerXMLDomain;

import edu.uci.ics.crawler4j.crawler.CrawlController;
public class PofaCrawlerController implements PofaEvents {
private CrawlController controller;
private int crawlerCount;
private PofaNotifier notifier;
private String configFile;
private ArrayList<PofaDomainInfo> domainInfo;
private void loadConfig() {
domainInfo = new ArrayList<PofaDomainInfo>();
try {
PofaCrawlerXML config = new PofaCrawlerXML(configFile);
ArrayList<PofaCrawlerXMLDomain> domain = config.getDomains();
for (int i = 0; i != domain.size(); i++)
domainInfo.add(new PofaDomainInfo(domain.get(i)));
} catch (XMLParseException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
} catch (ParserConfigurationException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
} catch (SAXException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
} catch (IOException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
}
}
/**
* Creates a crawler controller class and adds the seed URLs specified in database.
*
* @param notifier a PofaNotifier instance to handle communication between objects.
* @param numberOfCrawlers Specifies the number of crawlers to use.
* @param rootFolder Base folder for crawler to store it's database.
     * @param configFile specifies the config file name to use (contains per-domain info:
     *        seed urls, accept urls, DOM rules)
* @throws Exception
*/
    public PofaCrawlerController(PofaNotifier notifier, int numberOfCrawlers,
            String rootFolder, String configFile) throws Exception {
System.out.println("Initializing " + this.getClass().getName() + "...");
this .notifier = notifier;
this .notifier.notifyRequest(this); // add ourselves to the notification chain
crawlerCount = numberOfCrawlers;
controller = new CrawlController(rootFolder);
controller.setPolitenessDelay(300); //TODO: select a useable politeness delay
this .configFile = configFile;
loadConfig();
//--- add seed urls --for (int i = 0; i != domainInfo.size(); i++) {
for (int j = 0; j != domainInfo.get(i).getSeedUrl().size(); j++) {
System.out.println("adding seed: " + domainInfo.get(i).getSeedUrl().get(j));
controller.addSeed(domainInfo.get(i).getSeedUrl().get(j));
}
}
System.out.println(this.getClass().getName() + " initialized.");
}
/**
* Starts the crawling.
*/
public void startCrawler() {
        this.controller.start(PofaCrawler.class, this.crawlerCount, notifier);
}
/**
* Callback event: a page was downloaded by a crawler object
*/
@Override
public boolean onPageVisiting(PofaAbstractCrawler sender, String url) {
if (sender instanceof PofaCrawler) {
for (int i = 0; i != domainInfo.size(); i++)
if (domainInfo.get(i).accept(url))
return true ;
}
return false ;
}
@Override
public void onEntityFound(PofaAbstractSlicer sender, String entityName) {
// TODO Auto-generated method stub
}
@Override
public void onNewPage(PofaAbstractDataminer sender, String url,
String page, PofaDomainInfo rule) {
// TODO Auto-generated method stub
}
@Override
public void onPageDownloaded(PofaAbstractCrawler sender, Object data) {
// TODO Auto-generated method stub
}
}
PofaCrawlerXML.java
package com.wcs.pofa.settings.crawler;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;

import javax.management.modelmbean.XMLParseException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;
/**
* Read and parse config XML of crawler.
*
*/
public class PofaCrawlerXML {
    private PofaCrawlerXMLDomains domains = null;
/**
* Parse XML file and create objects.
* @param configFile config file to parse
* @throws XMLParseException
* @throws ParserConfigurationException
* @throws IOException
* @throws SAXException
*/
    public PofaCrawlerXML(String configFile) throws XMLParseException,
            ParserConfigurationException, SAXException, IOException {
        File file = new File(configFile);
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        DocumentBuilder db = dbf.newDocumentBuilder();
        Document doc = db.parse(file);
        doc.getDocumentElement().normalize();
        ParseRoot(doc);
    }
/**
* Parse XML document
* @param root
* @throws XMLParseException
*/
private void ParseRoot(Document root) throws XMLParseException {
NodeList children = root.getChildNodes();
for (int i = 0; i != children.getLength(); i++) {
Node child = children.item(i);
if (child.getNodeName().equals("crawlersettings"))
ParseCrawlerSettings(child);
}
}
    /**
     * Parse <crawlersettings> node
     * @param root
     * @throws XMLParseException
     */
private void ParseCrawlerSettings(Node root) throws XMLParseException {
if (domains != null)
throw new XMLParseException("Only one <domains> node expected.");
NodeList children = root.getChildNodes();
for (int i = 0; i != children.getLength(); i++) {
Node child = children.item(i);
if (child.getNodeName().equals("domains"))
domains = new PofaCrawlerXMLDomains(child);
}
}
/**
* @return domains
*/
public ArrayList<PofaCrawlerXMLDomain> getDomains() {
if (null == domains)
return null ;
return domains.getDomains();
}
}
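As a hypothetical example, a crawler configuration with the structure that the parser above and the classes below expect; the tag and attribute names follow the parsing code, while the concrete domain, URLs and rule values are invented:

<crawlersettings>
  <domains>
    <domain root="http://example.com">
      <seed>http://example.com/products</seed>
      <accept>http://example.com/review/.*</accept>
      <rule type="TEXT" path="html>body>div>p#content-box"
            exclude="(div#ad|div#navigation|div#share)"/>
    </domain>
  </domains>
</crawlersettings>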
PofaCrawlerXMLAccept.java
package com.wcs.pofa.settings.crawler;

import org.w3c.dom.Node;

public class PofaCrawlerXMLAccept {
    private String url = null;

    public PofaCrawlerXMLAccept(Node root) {
        url = root.getTextContent();
    }

    /**
     * @return the url
     */
    public String getUrl() {
        return url;
    }
}
PofaCrawlerXMLDomain.java
package com.wcs.pofa.settings.crawler;

import java.util.ArrayList;

import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class PofaCrawlerXMLDomain {
    private ArrayList<PofaCrawlerXMLSeed> seed;
    private ArrayList<PofaCrawlerXMLAccept> accept;
    private ArrayList<PofaCrawlerXMLRule> rule;
    private String rootUrl;
public PofaCrawlerXMLDomain(Node root) {
seed = new ArrayList<PofaCrawlerXMLSeed>();
accept = new ArrayList<PofaCrawlerXMLAccept>();
rule = new ArrayList<PofaCrawlerXMLRule>();
rootUrl = root.getAttributes().getNamedItem("root").getNodeValue();
NodeList children = root.getChildNodes();
for (int i = 0; i != children.getLength(); i++) {
Node child = children.item(i);
if (child.getNodeName().equals("seed"))
seed.add(new PofaCrawlerXMLSeed(child));
else if (child.getNodeName().equals("accept"))
accept.add(new PofaCrawlerXMLAccept(child));
else if (child.getNodeName().equals("rule"))
rule.add(new PofaCrawlerXMLRule(child));
}
}
/**
* @return seed urls
*/
public ArrayList<PofaCrawlerXMLSeed> getSeedURLs() {
return seed;
}
/**
* @return accept urls
*/
public ArrayList<PofaCrawlerXMLAccept> getAcceptURLs() {
return accept;
}
/**
* @return rules
*/
public ArrayList<PofaCrawlerXMLRule> getRules() {
return rule;
}
/**
* @return root URL
*/
public String getRootUrl() {
return rootUrl;
}
}
PofaCrawlerXMLDomains.java
package com.wcs.pofa.settings.crawler;

import java.util.ArrayList;

import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class PofaCrawlerXMLDomains {
    private ArrayList<PofaCrawlerXMLDomain> domain;
public PofaCrawlerXMLDomains(Node root) {
domain = new ArrayList<PofaCrawlerXMLDomain>();
NodeList children = root.getChildNodes();
for (int i = 0; i != children.getLength(); i++) {
Node child = children.item(i);
if (child.getNodeName().equals("domain"))
domain.add(new PofaCrawlerXMLDomain(child));
}
}
/**
* @return domains
*/
public ArrayList<PofaCrawlerXMLDomain> getDomains() {
return domain;
}
}
PofaCrawlerXMLRule.java
package com.wcs.pofa.settings.crawler;

import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

public class PofaCrawlerXMLRule {
    private String type = null;
    private String path = null;
    private String exclude = null;
public PofaCrawlerXMLRule(Node root) {
NamedNodeMap attributes = root.getAttributes();
type = attributes.getNamedItem("type").getNodeValue();
path = attributes.getNamedItem("path").getNodeValue();
exclude = attributes.getNamedItem("exclude").getNodeValue();
}
/**
* @return the type
*/
public String getType() {
return type;
}
/**
* @return the path
*/
public String getPath() {
return path;
}
/**
* @return the exclude
*/
public String getExclude() {
return exclude;
}
}
PofaCrawlerXMLSeed.java
package com.wcs.pofa.settings.crawler;

import org.w3c.dom.Node;

public class PofaCrawlerXMLSeed {
    private String url = null;
public PofaCrawlerXMLSeed(Node root) {
url = root.getTextContent();
}
/**
* @return the url
*/
public String getUrl() {
return url;
}
}
PofaDataminerController.java
package com.wcs.pofa;

import java.io.IOException;
import java.util.ArrayList;

import javax.management.modelmbean.XMLParseException;
import javax.xml.parsers.ParserConfigurationException;

import org.xml.sax.SAXException;

import com.wcs.pofa.crawler.PofaCrawler;
import com.wcs.pofa.crawler.PofaCrawlerController;
import com.wcs.pofa.events.PofaEvents;
import com.wcs.pofa.events.PofaNotifier;
import com.wcs.pofa.settings.crawler.PofaCrawlerXML;
import com.wcs.pofa.settings.crawler.PofaCrawlerXMLDomain;

import edu.uci.ics.crawler4j.crawler.Page;
/**
* Controls the datamining process.
*
*/
public class PofaDataminerController implements PofaEvents, PofaAbstractDataminer {
    private String crawlerSettingsFile;
    private PofaNotifier notifier;
    private PofaCrawlerController crawlerController;
    private ArrayList<PofaDomainInfo> domainInfo;
/**
* Load the crawler configuration file.
*/
private void loadDomainXML() {
domainInfo = new ArrayList<PofaDomainInfo>();
try {
PofaCrawlerXML config = new PofaCrawlerXML(crawlerSettingsFile);
ArrayList<PofaCrawlerXMLDomain> domain = config.getDomains();
for (int i = 0; i != domain.size(); i++)
domainInfo.add(new PofaDomainInfo(domain.get(i)));
} catch (XMLParseException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (ParserConfigurationException e) {
e.printStackTrace();
} catch (SAXException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
/**
* Initialize the dataminer process.
*/
public PofaDataminerController(PofaNotifier notifier, String crawlerSettingsFile) {
System.out.println("Initializing " + this.getClass().getName() + "...");
this.notifier = notifier;
notifier.notifyRequest(this);
this.crawlerSettingsFile = crawlerSettingsFile;
loadDomainXML();
try {
crawlerController = new PofaCrawlerController(notifier,
"c:\\users\\mikki\\workspace\\pofa\\crawler", crawlerSettingsFile);
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
System.out.println(this.getClass().getName() + " initialized.");
}
/**
* Start the dataminer process.
*/
public void start() {
crawlerController.startCrawler();
}
@Override
public synchronized void onPageDownloaded(PofaAbstractCrawler sender, Object data) {
System.out.print("Crawler " + ((PofaCrawler) sender).getMyId() + ": ");
if (data instanceof Page) {
Page page = (Page)data;
System.out.println(page.getWebURL());
for (int i = 0; i != domainInfo.size(); i++) {
if (page.getWebURL().getURL().matches(domainInfo.get(i).getDomainRoot())) {
notifier.onNewPage(this,
page.getWebURL().getURL(),
page.getHTML(),
domainInfo.get(i));
break ;
}
}
}
else
System.out.println("unknown data");
}
@Override
public void onEntityFound(PofaAbstractSlicer sender, String entityName) {
// TODO Auto-generated method stub
}
@Override
public void onNewPage(PofaAbstractDataminer sender, String url,
String page, PofaDomainInfo rule) {
// TODO Auto-generated method stub
}
@Override
public boolean onPageVisiting(PofaAbstractCrawler sender, String url) {
// TODO Auto-generated method stub
return false ;
}
}
PofaDomainInfo.java
package com.wcs.pofa;
import java.util.ArrayList;
import com.wcs.pofa.PofaDomainRule.PofaPageElement;
import com.wcs.pofa.settings.crawler.PofaCrawlerXMLAccept;
import com.wcs.pofa.settings.crawler.PofaCrawlerXMLDomain;
import com.wcs.pofa.settings.crawler.PofaCrawlerXMLRule;
import com.wcs.pofa.settings.crawler.PofaCrawlerXMLSeed;
/**
* Stores basic info about a domain required for crawling and processing
* Stored data are:
* <li>seed URLs: start crawler on these sites</li>
* <li>accept URLs: process only this kind of URLs</li>
* <li>processing rules: structural information</li>
*
*/
public class PofaDomainInfo {
private String domainRoot;                 // domain root URL
private ArrayList<String> seedUrl;         // seed URLs
private ArrayList<String> acceptUrl;       // accepted URLs
private ArrayList<PofaDomainRule> rules;   // processing rules
/**
* Construct an empty domain info.
* @param domainRoot base URL of domain
*/
public PofaDomainInfo(String domainRoot) {
this.domainRoot = domainRoot;
this.seedUrl = new ArrayList<String>();
this.acceptUrl = new ArrayList<String>();
this.rules = new ArrayList<PofaDomainRule>();
}
/**
* Construct a domain info instance from a part of XML configuration file.
* @param node part of the config XML containing domain-specific info
*/
public PofaDomainInfo(PofaCrawlerXMLDomain node) {
this.seedUrl = new ArrayList<String>();
this.acceptUrl = new ArrayList<String>();
this.rules = new ArrayList<PofaDomainRule>();
this.domainRoot = node.getRootUrl();
ArrayList<PofaCrawlerXMLSeed> seeds = node.getSeedURLs();
for (int i = 0; i != seeds.size(); i++)
this.seedUrl.add(seeds.get(i).getUrl());
ArrayList<PofaCrawlerXMLAccept> accepts = node.getAcceptURLs();
for (int i = 0; i != accepts.size(); i++)
this.acceptUrl.add(accepts.get(i).getUrl());
ArrayList<PofaCrawlerXMLRule> rules = node.getRules();
for (int i = 0; i != rules.size(); i++) {
PofaCrawlerXMLRule rule = rules.get(i);
this.getRules().add(new PofaDomainRule(PofaPageElement.valueOf(rule.getType()),
rule.getPath(), rule.getExclude()));
}
}
/**
* Adds a rule to the domain rule list.
* @param rule The rule to add.
* @return The class instance itself
*/
public PofaDomainInfo addRule(PofaDomainRule rule) {
this.getRules().add(rule);
return this;
}
/**
* Search for a rule in the list of added rules that matches the given DOM path.
* @param path The DOM path to match. A rule accepts the given path and
* anything from that point (even ID and/or CLASS items).
* @return <i>PofaPageElement</i> of the first rule matching the specified
* path, or <b>null</b> if no rules match
* the path.
*/
/*
* TODO: match multiple rules (return a list of matching rules?)
*/
public PofaDomainRule.PofaPageElement matchingRule(String path) {
path = path.toLowerCase();
for (int i = 0; i != getRules().size(); i++)
if (getRules().get(i).matchRule(path))
return (getRules().get(i).getType());
return null ;
}
/**
* @return the domainRoot
*/
public String getDomainRoot() {
return domainRoot;
}
/**
* @return seed URLs
*/
public ArrayList<String> getSeedUrl() {
return seedUrl;
}
/**
* Tells if domain accepts this URL (that is if crawler should visit this URL or not).
* @return true if the domain accepts the URL
*/
public boolean accept(String url) {
for (int i = 0; i != acceptUrl.size(); i++)
if (url.matches(acceptUrl.get(i)))
return true;
return false;
}
/**
* @return the rules
*/
public ArrayList<PofaDomainRule> getRules() {
return rules;
}
}
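A short usage sketch of PofaDomainInfo follows; the URL patterns and the rule are invented examples, and the behaviour described in the comments follows from the methods above.

// hypothetical usage; all values are made-up examples
PofaDomainInfo info = new PofaDomainInfo("http://www\\.example\\.com/.*");
info.addRule(new PofaDomainRule(PofaDomainRule.PofaPageElement.TEXT,
        "html>body>div\\.comment", null));
// accept() tests the URL against the accept list with String.matches(),
// so with an empty accept list every URL is rejected:
boolean visit = info.accept("http://www.example.com/products/1"); // false here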
PofaDomainRule.java
package com.wcs.pofa;
/**
* Pair of CSS selector like path and page element type.
*
*/
public class PofaDomainRule {
/**
* Page element type.
*/
public static enum PofaPageElement {
THEME,
BREADCRUMBS,
FACTSHEET,
TEXT,
//NAVIGATION,
ENTITY
}
private PofaPageElement type;
private String path;
private String exclude;
/**
* Create a domain rule from a <i>PofaPageElement</i> type and a DOM path. Path elements
* should be separated by the 'greater than' character (>). The format of each element is:
* ELEMENT_NAME[#ID_TAG][.CLASS_TAG], where ELEMENT_NAME is the DOM element name, ID_TAG
* is an optional ID attribute of the DOM element and CLASS_TAG is an optional CLASS
* attribute of the DOM element. <b>Warning:</b> the DOM path is interpreted as a regular
* expression, so make sure that dot (.) chars before CLASS_TAG elements are escaped.
* The path string should be lower case (except for regexp special chars).
* @param ruleType Page element type.
* @param rulePath regexp describing the DOM path to match (regexp is matched from the
* beginning)
* @param excludePath regexp describing the DOM path to exclude from match relative to
* rulePath (if <b>null</b> no exclusion is made)
*/
public PofaDomainRule(PofaPageElement ruleType, String rulePath, String excludePath) {
setType(ruleType);
setPath("^" + rulePath + "(>.*|)$");
if (excludePath == null || excludePath.length() == 0)
setExclude(null);
else
setExclude("^" + rulePath + ">" + excludePath + "(>.*|)$");
}
private void setType(PofaPageElement type) {
this.type = type;
}
public PofaPageElement getType() {
return type;
}
private void setPath(String path) {
this.path = path;
}
public String getPath() {
return path;
}
public void setExclude(String exclude) {
this.exclude = exclude;
}
public String getExclude() {
return exclude;
}
/**
* Tries to match rule path against the supplied DOM path.
* @param path DOM path to match against.
* @return <b>True</b> if rule matches the supplied path (matching done via regexp),
* <b>false</b> otherwise.
*/
public boolean matchRule(String path) {
if (null == path)
return false ;
path = path.toLowerCase();
if (null == this.exclude || this.exclude.isEmpty())
// no exclusion rules, simply match paths
return path.matches(this.path);
else if (path.matches(this.path)) {
// paths match, check if exclusion rule doesn't match
return (!path.matches(this.exclude));
}
else {
// main path doesn't match
return false ;
}
}
}
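To make the path semantics concrete, here is a small sketch; the rule values are invented, and the results follow from the constructor's anchoring of the path regexp above.

// hypothetical rule: TEXT content under html>body>div.comment, except div.ad subtrees
PofaDomainRule rule = new PofaDomainRule(PofaDomainRule.PofaPageElement.TEXT,
        "html>body>div\\.comment", "div\\.ad");
rule.matchRule("html>body>div.comment>p");        // true: suffix paths are accepted
rule.matchRule("html>body>div.comment>div.ad>p"); // false: exclusion rule matches
rule.matchRule("html>body>div.footer");           // false: main path does not match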
PofaEntityListBuilder.java
package com.wcs.pofa.entities;
import java.io.IOException;
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import org.json.JSONException;
import org.json.JSONObject;
import com.wcs.pofa.db.Neo4jDBInterface;
import com.wcs.pofa.db.Neo4jDBInterface.ServerErrorResponse;
@Deprecated
//TODO: needs revising
public class PofaEntityListBuilder {
private Neo4jDBInterface db;
public PofaEntityListBuilder(Neo4jDBInterface db) {
this.db = db;
}
/**
* Insert an entity into database.
* @param entityName
* @return hash of the added entity, or an empty string if it was not inserted
*/
public String addEntity(String entityName) {
String result = addEntityWithNoClustering(entityName);
if (result.length() > 0) {
ArrayList<String> item = new ArrayList<String>();
item.add(result);
performClustering(item);
}
return result;
}
/**
* Same as addEntity but does not perform clustering. Good for bulk insert.
* @param entityName
* @return hash of added entity or empty string if entity was not added
*/
private String addEntityWithNoClustering(String entityName) {
String result = "";
//--- query if entity exists ---
MessageDigest digest;
try {
digest = MessageDigest.getInstance("MD5");
digest.update(entityName.getBytes(),0, entityName.length());
String hash = new BigInteger(1, digest.digest()).toString(16);
if (db.queryNodeIndex("entity", "hash", hash).length() == 0) {
// new item, insert it
result = hash;
JSONObject text = new JSONObject();
text.put("content", entityName);
//TODO: append fact sheet
// put into DB
String node = db.createNode(text);
// connect to CLASSIFYENTITY node (require later classification)
//db.createRelationship(db.getRefNodeClassifyEntity(), node, "", new JSONObject());
//db.addToIndex(node, "ENTITYNAMEHASH", hash);
}
else {
}
} catch (NoSuchAlgorithmException e) {
// TODO Auto-generated catch block
e.printStackTrace();
//} catch (IOException e) {
// // TODO Auto-generated catch block
// e.printStackTrace();
} catch (JSONException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (ServerErrorResponse e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
return result;
}
/**
* Inserts many entities at once into database.
* @param entities
* @return ratio of added/requested entities
*/
public double addEntities(ArrayList<String> entities) {
if (entities.size() == 0)
return 0;
ArrayList<String> added = new ArrayList<String>();
for (int i = 0; i != entities.size(); i++) {
String newHash = addEntityWithNoClustering(entities.get(i));
if (newHash.length() > 0)
added.add(newHash);
}
performClustering(added);
return (double) added.size() / entities.size(); // cast avoids integer division
}
/**
* Performs clustering of specified entities (requires hashes).
* @param newEntities
*/
private void performClustering(ArrayList<String> newEntities) {
//TODO: stub method
//TODO: create a 2nd version of this method which queries CLASSIFYENTITY node's relatives
//      to get entities to classify
}
}
PofaEvents.java
package com.wcs.pofa.events;
import com.wcs.pofa.PofaAbstractCrawler;
import com.wcs.pofa.PofaAbstractDataminer;
import com.wcs.pofa.PofaAbstractSlicer;
import com.wcs.pofa.PofaDomainInfo;
/**
* Common interface for handling events.
* @author mikki
*/
public interface PofaEvents {
// crawler to controller events
public boolean onPageVisiting(PofaAbstractCrawler sender, String url);
public void onPageDownloaded(PofaAbstractCrawler sender, Object data);
// dataminer to main events
public void onNewPage(PofaAbstractDataminer sender, String url, String page,
PofaDomainInfo rule);
// slicer events
public void onEntityFound(PofaAbstractSlicer sender, String entityName);
}
PofaNeo4jDB.java
package com.wcs.pofa.db;
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.Hashtable;
import java.util.Iterator;
import java.util.Locale;
import org.json.JSONArray;
import org.json.JSONException;
import org.json.JSONObject;
import com.wcs.pofa.db.Neo4jDBInterface.ServerErrorResponse;
import com.wcs.pofa.db.Neo4jRelationship.Neo4jRelationshipDirection;
/**
* DB connection handler class for Neo4j database server for high-level access.
*/
public class PofaNeo4jDB {
private Neo4jDBInterface dbInterface;
private String refNodeTokenize;
private String refNodeClassifyEntity;
private String refNodeNewFeedback;
private Hashtable<String, String> languageNode = new Hashtable<String, String>();
/**
* Create a DB connection to a Neo4j server.
* @param databaseUrl
* @throws ServerErrorResponse
*/
public PofaNeo4jDB(String databaseUrl) throws ServerErrorResponse {
System.out.println("Initializing " + this.getClass().getName() + "...");
dbInterface = new Neo4jDBInterface(databaseUrl);
lookupReferenceNodes();
System.out.println(this.getClass().getName() + " initialized.");
}
/**
* Caches the reference nodes of the DB for faster access.
* @throws ServerErrorResponse
*/
private void lookupReferenceNodes() throws ServerErrorResponse {
// query for control nodes
ArrayList<Neo4jRelationship> controlRel = new ArrayList<Neo4jRelationship>();
controlRel.add(new Neo4jRelationship("CONTROL", Neo4jRelationshipDirection.OUT));
JSONArray answerArray = dbInterface.getNodeRelationships(dbInterface.getRootNode(),
controlRel);
for (int i = 0; i != answerArray.length(); i++) {
String nodeID;
try {
nodeID = ((String)((JSONObject)answerArray.get(i)).get("end"));
String nodeName = (String)dbInterface.getNodeProperty(nodeID, "name");
// FIXME: don't use hardcoded strings
if (nodeName.equals("TOKENIZE"))
refNodeTokenize = nodeID;
else if (nodeName.equals("CLASSIFYENTITY"))
refNodeClassifyEntity = nodeID;
else if (nodeName.equals("NEWFEEDBACK"))
refNodeNewFeedback = nodeID;
} catch (JSONException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
// query language nodes
controlRel.clear();
controlRel.add(new Neo4jRelationship("LANGUAGE", Neo4jRelationshipDirection.OUT));
answerArray = dbInterface.getNodeRelationships(dbInterface.getRootNode(), controlRel);
for (int i = 0; i != answerArray.length(); i++) {
String nodeID;
try {
nodeID = ((String)((JSONObject)answerArray.get(i)).get("end"));
String languageName = (String)dbInterface.getNodeProperty(nodeID, "language");
languageNode.put(languageName, nodeID);
} catch (JSONException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
// create reference nodes if they don't exist
try {
if (refNodeTokenize == null) {
// TOKENIZE
refNodeTokenize = dbInterface.createNode(new JSONObject().put("name", "TOKENIZE"));
dbInterface.createRelationship(dbInterface.getRootNode(), refNodeTokenize, "CONTROL",
new JSONObject());
}
if (refNodeClassifyEntity == null) {
// CLASSIFYENTITY
refNodeClassifyEntity = dbInterface.createNode(new JSONObject().put("name", "CLASSIFYENTITY"));
dbInterface.createRelationship(dbInterface.getRootNode(), refNodeClassifyEntity,
"CONTROL", new JSONObject());
}
if (refNodeNewFeedback == null) {
// NEWFEEDBACK
refNodeNewFeedback = dbInterface.createNode(new JSONObject().put("name", "NEWFEEDBACK"));
dbInterface.createRelationship(dbInterface.getRootNode(), refNodeNewFeedback,
"CONTROL", new JSONObject());
}
} catch (JSONException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
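// Resulting layout, per the method above: the root node has CONTROL relationships
// to the TOKENIZE, CLASSIFYENTITY and NEWFEEDBACK work-queue nodes, and LANGUAGE
// relationships to one node per language.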
/**
* Returns DB interface class for low-level DB access.
*/
public Neo4jDBInterface getDBInterface() {
return dbInterface;
}
/**
* Returns the 'root' node ID of the given language. Creates the language node if it doesn't
* exist.
* @param language
* @return
* @throws ServerErrorResponse
* @throws JSONException
*/
public String getLanguageNode(String language) throws JSONException, ServerErrorResponse {
if (languageNode.containsKey(language))
return languageNode.get(language);
else {
String nodeID = dbInterface.createNode(new JSONObject().put("language", language));
languageNode.put(language, nodeID);
return nodeID;
}
}
public String getRefNodeTokenize() {
return refNodeTokenize;
}
public String getRefNodeClassifyEntity() {
return refNodeClassifyEntity;
}
public String getRefNodeNewFeedback() {
return refNodeNewFeedback;
}
public String getRootNode() {
return dbInterface.getRootNode();
}
/**
* Queries DB index for nodes belonging to specified key/value pair.
* @param indexName name of index to query
* @param indexKey Index key
* @param indexValue Index value
* @param locale Locale for converting to lower-case
* @return JSONArray containing matching node IDs and their content.
*/
public JSONArray queryNodeIndex(String indexName, String indexKey, String indexValue,
Locale locale) {
try {
return dbInterface.queryNodeIndex(indexName, indexKey,
URLEncoder.encode(indexValue.toLowerCase(locale), "UTF-8"));
} catch (UnsupportedEncodingException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (ServerErrorResponse e) {
switch (e.getReturnCode()) {
case HttpURLConnection.HTTP_NOT_FOUND: return new JSONArray();
}
// TODO Auto-generated catch block
e.printStackTrace();
}
return new JSONArray();
}
/**
* Adds multiple expressions to the node index. Expressions are added with lower-case only
* if they are not already in index.
* @param indexName name of index to use
* @param indexKey Key name to store index under
* @param expressions list of strings to use as index values
* @param nodeToIndex Node to add to index
* @return true if node wasn't previously indexed
*/
public boolean addNodeToIndex(String indexName, String indexKey,
ArrayList<String> expressions, String nodeToIndex) {
boolean alreadyIndexed = true;
// add expressions
for (Iterator<String> i = expressions.iterator(); i.hasNext(); ) {
String expression = i.next();
if (!addNodeToIndex(indexName, indexKey, expression, nodeToIndex))
alreadyIndexed = false;
}
return !alreadyIndexed;
}
/**
* Adds a single item to the node index. If node is already indexed it won't be added again.
* Returns true if item wasn't previously indexed.
* @param indexName name of index to use
* @param indexKey Key name to store index under
* @param indexValue Index value to store (needs to be converted to lower-case for easier
* access)
* @param node Node ID to add to index
* @return True if item wasn't previously indexed.
*/
public boolean addNodeToIndex(String indexName, String indexKey, String indexValue, String
node) {
boolean alreadyIndexed = false;
try {
indexValue = URLEncoder.encode(indexValue, "UTF-8");
JSONArray indexHit;
indexHit = dbInterface.queryNodeIndex(indexName, indexKey, indexValue);
if (indexHit.length() != 0) {
//find out if node is indexed
for (int i = 0; i != indexHit.length(); i++) {
JSONObject indexedNode = indexHit.getJSONObject(i);
if (indexedNode.getString("node").equals(node)) {
alreadyIndexed = true;
break ;
}
}
}
if (!alreadyIndexed)
//node not indexed yet, add it
dbInterface.addNodeToIndex(indexName, node, indexKey, indexValue);
} catch (UnsupportedEncodingException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
} catch (ServerErrorResponse e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (JSONException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
return !alreadyIndexed;
}
/**
* Creates a new relationship in the database only if there is no existing typed relation
* in the same direction (properties aren't checked).
* If a relationship exists, its properties are replaced with the new properties if
* replaceProperties is set.
* @param from start node
* @param to end node
* @param type relationship type
* @param properties relationship properties
* @param replaceProperties tells if existing properties should be replaced
* @return ID of created/existing relationship
* @throws IOException
* @throws ServerErrorResponse
*/
public String createNewRelationship(String from, String to, String type,
JSONObject properties, boolean replaceProperties) throws IOException, ServerErrorResponse {
ArrayList<Neo4jRelationship> relationships = new ArrayList<Neo4jRelationship>();
relationships.add(new Neo4jRelationship(type, Neo4jRelationshipDirection.OUT));
JSONObject path = dbInterface.findShortestPath(from, to, relationships, 1);
JSONArray pathRels;
try {
pathRels = path.getJSONArray("relationships");
} catch (JSONException e) {
throw new IOException("Cannot check existing relationships between nodes " + from + "
and " + to);
}
if (pathRels.length() != 0) {
// update relationship
String relationshipID;
try {
relationshipID = pathRels.getString(0);
if (replaceProperties)
dbInterface.setRelationshipProperties(relationshipID, properties);
return relationshipID;
} catch (JSONException e) {
throw new IOException("Cannot get existing relationship between nodes " + from +
" and " + to);
}
}
else {
// create new relationship
return dbInterface.createRelationship(from, to, type, properties);
}
}
}
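A brief usage sketch of the high-level DB class; the node IDs are hypothetical placeholders, while the server URL and the APPLIES_TO relationship type appear elsewhere in this prototype.

// sketch only; opinionNodeID and entityNodeID are assumed to exist in the DB,
// and the thrown IOException/ServerErrorResponse must be handled by the caller
PofaNeo4jDB db = new PofaNeo4jDB("http://localhost:7474/db/data");
String relID = db.createNewRelationship(opinionNodeID, entityNodeID,
        "APPLIES_TO", new JSONObject(), true); // replaces properties if the edge exists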
PofaNotifier.java
package com.wcs.pofa.events;
import java.util.Iterator;
import java.util.ArrayList;
import com.wcs.pofa.PofaAbstractCrawler;
import com.wcs.pofa.PofaAbstractDataminer;
import com.wcs.pofa.PofaAbstractSlicer;
import com.wcs.pofa.PofaDomainInfo;
/**
* Common class for handling event notifications between modules.
*
* @author mikki
*/
// TODO: create separate notifiers for processes
public class PofaNotifier {
private ArrayList<PofaEvents> list;
/**
* Put an object into the notification chain
* @param callback an implementation of PofaEvents interface
*/
public synchronized void notifyRequest(PofaEvents callback) {
if (!list.contains(callback))
list.add(callback);
}
/**
* Remove specified object from notification chain
*/
public synchronized void notifyCancel(PofaEvents callback) {
list.remove(callback);
}
public PofaNotifier() {
list = new ArrayList<PofaEvents>();
}
public synchronized void onPageDownload(PofaAbstractCrawler sender, Object data) {
for (Iterator<PofaEvents> i = list.iterator(); i.hasNext(); )
i.next().onPageDownloaded(sender, data);
}
public synchronized boolean onPageVisiting(PofaAbstractCrawler sender, String url) {
boolean result = false;
for (Iterator<PofaEvents> i = list.iterator(); i.hasNext(); )
if (i.next().onPageVisiting(sender, url))
result = true;
return result;
}
public synchronized void onEntityFound(PofaAbstractSlicer sender, String entityName) {
for (Iterator<PofaEvents> i = list.iterator(); i.hasNext(); )
i.next().onEntityFound(sender, entityName);
}
public synchronized void onNewPage(PofaAbstractDataminer sender, String url, String page,
PofaDomainInfo rule) {
for (Iterator<PofaEvents> i = list.iterator(); i.hasNext(); )
i.next().onNewPage(sender, url, page, rule);
}
}
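A wiring sketch for the notifier (the configuration file name is a made-up example): modules implementing PofaEvents register themselves via notifyRequest() and afterwards receive every broadcast.

PofaNotifier notifier = new PofaNotifier();
// PofaDataminerController registers itself with the notifier in its constructor
PofaEvents dataminer = new PofaDataminerController(notifier, "crawler_settings.xml");
// every registered listener now receives this event
notifier.onEntityFound(null, "Example Phone X");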
PofaPreprocessController.java
package com.wcs.pofa;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStreamWriter;
import java.text.DecimalFormat;
import java.text.NumberFormat;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.json.JSONArray;
import org.json.JSONException;
import org.json.JSONObject;
import com.wcs.pofa.db.Neo4jDBInterface;
import com.wcs.pofa.db.Neo4jRelationship;
import com.wcs.pofa.db.PofaNeo4jDB;
import com.wcs.pofa.db.Neo4jDBInterface.Neo4jTraverseOrder;
import com.wcs.pofa.db.Neo4jDBInterface.Neo4jTraverseResult;
import com.wcs.pofa.db.Neo4jDBInterface.Neo4jTraverseReturnFilter;
import com.wcs.pofa.db.Neo4jDBInterface.Neo4jTraverseUniqueness;
import com.wcs.pofa.db.Neo4jDBInterface.ServerErrorResponse;
import com.wcs.pofa.db.Neo4jRelationship.Neo4jRelationshipDirection;
import com.wcs.pofa.slicer.PofaSlicer;
import com.wcs.pofa.tokenizer.PofaTokenizer;
public class PofaPreprocessController {
private PofaNeo4jDB db;
private Neo4jDBInterface dbInterface;
private PofaTokenizer tokenizer;
private PofaStopWords stopWords;
public PofaPreprocessController(PofaNeo4jDB db) {
System.out.println("Initializing " + this.getClass().getName() + "...");
this.db = db;
this.dbInterface = db.getDBInterface();
this.tokenizer = new PofaTokenizer();
this.stopWords = new PofaStopWords();
System.out.println(this.getClass().getName() + " initialized.");
}
/**
* Starts the preprocessing process
*/
public void start() {
// TODO: add a timer which repeatedly checks db node TOKENIZE for new input
processNewInput();
}
private void processNewInput() {
// query db for untokenized nodes
String[] relTypes = new String[1];
relTypes[0] = "";
try {
System.out.println("Query untokenized opinions...");
ArrayList<Neo4jRelationship> relationships = new ArrayList<Neo4jRelationship>();
relationships.add(new Neo4jRelationship("", Neo4jRelationshipDirection.OUT));
JSONArray nodes = dbInterface.traverse(
db.getRefNodeTokenize(),
Neo4jTraverseResult.NODE,
Neo4jTraverseOrder.DEPTH_FIRST,
Neo4jTraverseUniqueness.NODE_PATH,
relationships,
null,
Neo4jTraverseReturnFilter.ALL_BUT_START_NODE,
1
);
relationships.clear();
relationships.add(new Neo4jRelationship("", Neo4jRelationshipDirection.IN));
for (int i = 0; i != nodes.length(); i++) {
JSONObject node = nodes.getJSONObject(i);
String nodeID = node.getString("node");
JSONObject metadata = node.getJSONObject("data");
//TODO: this is just for testing
if (metadata.getString("url").matches(".*mobilarena.*")) {
System.out.println("Skipping " + (i + 1) + "/" + nodes.length() + " [" + nodeID +
"]");
continue ;
}
// get node's raw HTML content
String content = metadata.getString("content");
content = PofaSlicer.stripHTML(content);
System.out.println("Processing " + (i + 1) + "/" + nodes.length() + " [" + nodeID +
"]...");
if (content.length() > 80)
System.out.println("
\"" + content.substring(0, 79) + "...\"");
else
System.out.println("
\"" + content + "\"");
// detect language
Locale language = tokenizer.languageDetect(content);
// store tokens
try {
long t1 = System.currentTimeMillis();
storeTokens(nodeID, language, content, 3);
long t2 = System.currentTimeMillis();
System.out.println("
time to store tokens: " + (t2 - t1) + " ms");
// add language as metadata to 'node'
metadata.put("language", language);
// update node
dbInterface.setNodeProperties(nodeID, metadata);
JSONArray toTokenize = dbInterface.getNodeRelationships(nodeID, relationships);
for (int j = 0; j != toTokenize.length(); j++) {
JSONObject rel = toTokenize.getJSONObject(j);
if (!rel.getString("type").equals(""))
continue ;
// remove node from TOKENIZE node
String relationshipID = rel.getString("relationship");
dbInterface.removeRelationship(relationshipID);
}
} catch (ClassNotFoundException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
System.out.println("Done.");
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (JSONException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (InstantiationException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IllegalAccessException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (ServerErrorResponse e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
private void storeTokens(String textNodeID, Locale language, String content,
int maxExpressionLength)
throws ClassNotFoundException, InstantiationException,
IllegalAccessException, IOException {
final String tokenIndexName = "token_" + language.getLanguage().toLowerCase();
final String relationTOKEN = "TOKEN_" + language.getLanguage().toUpperCase();
final String relationCONTAINS = "CONTAINS_" + language.getLanguage().toUpperCase();
HashSet<String> updatedTokens = new HashSet<String>();
ArrayList<String> expressions =
tokenizer.composeSubExpressions(
tokenizer.splitAtSeparators(content, language),
language, 1, maxExpressionLength, true, true
);
ArrayList<String> stems = PofaTokenizer.stemExpressions(expressions, language);
System.out.println("
document has " + expressions.size() + " tokens, processing...");
for (int i = 0; i != expressions.size(); i++) {
String token = expressions.get(i);
int wordCount = PofaTokenizer.countWords(token);
String stem = stems.get(i);
// query DB for token
JSONArray indexHit;
try {
indexHit = db.queryNodeIndex(tokenIndexName, "stem", stem, language);
if (indexHit.length() == 0) {
// new token
String tokenNode;
tokenNode = dbInterface.createNode(new JSONObject().
put("token", token).
put("stem", stem).
put("length", wordCount).
put("documents", 1).
put("count", 1)
);
// connect token to root node
dbInterface.createRelationship(db.getRootNode(), tokenNode, relationTOKEN,
new JSONObject());
// connect token to original text
dbInterface.createRelationship(textNodeID, tokenNode, relationCONTAINS,
new JSONObject());
// add token indices
db.addNodeToIndex(tokenIndexName, "stem", stem, tokenNode);
db.addNodeToIndex(tokenIndexName, "expression", token, tokenNode);
}
else {
// add token to stem
if (db.queryNodeIndex(tokenIndexName, "expression", token, language).length() == 0)
{
// add token to stem
JSONObject stemNode = indexHit.getJSONObject(0);
String stemNodeID = stemNode.getString("node");
JSONObject properties = stemNode.getJSONObject("data").accumulate("token", token);
dbInterface.setNodeProperties(stemNodeID, properties);
db.addNodeToIndex(tokenIndexName, "expression", token, stemNodeID);
}
String tokenNode = indexHit.getJSONObject(0).getString("node");
// update token counters
JSONObject properties = indexHit.getJSONObject(0).getJSONObject("data");
properties.put("count", properties.getInt("count") + 1);
if (!updatedTokens.contains(token))
properties.put("documents", properties.getInt("documents") + 1);
dbInterface.setNodeProperties(tokenNode, properties);
// connect token to original text (if not already connected)
db.createNewRelationship(textNodeID, tokenNode, relationCONTAINS, new JSONObject(),
false);
}
updatedTokens.add(token);
} catch (ServerErrorResponse e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (JSONException e) {
e.printStackTrace();
}
}
}
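// Net effect of storeTokens() on the graph, per the calls above: the root node gets a
// TOKEN_<LANG> edge to each distinct token node, the processed text node gets a
// CONTAINS_<LANG> edge to every token it contains, and each token node carries "count"
// (total occurrences) and "documents" (number of distinct texts) counters plus "stem"
// and "expression" entries in the per-language token index.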
private void addWhitespaces() {
// query db for untokenized nodes
final int minGramLength = 2;
final int maxGramLength = 5;
final int minOccurrence = 250;
try {
// ngrams:
// - String: ngram content
// - Pair<Integer, ..>: number of ngram occurrences
// - number of whitespaces BEFORE ngram
// - number of whitespaces AFTER ngram
FileOutputStream fos;
OutputStreamWriter out;
HashMap<String, Pair<Integer, Pair<Integer, Integer>>> ngrams = new HashMap<String,
Pair<Integer, Pair<Integer, Integer>>>();
try {
System.out.println("Reading ngrams...");
ObjectInputStream oin = new ObjectInputStream(new FileInputStream("ngrams_hu" +
minGramLength + "-" + maxGramLength + ".dat"));
try {
ngrams = (HashMap<String, Pair<Integer, Pair<Integer, Integer>>>)oin.readObject();
} catch (ClassNotFoundException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
oin.close();
} catch (FileNotFoundException e) {
System.out.println("Query opinions...");
ArrayList<Neo4jRelationship> relationships = new ArrayList<Neo4jRelationship>();
relationships.add(new Neo4jRelationship("OPINION", Neo4jRelationshipDirection.OUT));
JSONArray nodes = dbInterface.traverse(
db.getRootNode(),
Neo4jTraverseResult.NODE,
Neo4jTraverseOrder.DEPTH_FIRST,
Neo4jTraverseUniqueness.NODE_PATH,
relationships,
null,
Neo4jTraverseReturnFilter.ALL_BUT_START_NODE,
1
);
System.out.println("Building ngrams...");
fos = new FileOutputStream("input texts.txt");
out = new OutputStreamWriter(fos, "UTF-8");
for (int i = 0; i != nodes.length(); i++) {
JSONObject node = nodes.getJSONObject(i);
JSONObject metadata = node.getJSONObject("data");
// get node's raw HTML content
String content = metadata.getString("content");
// clean text
content = PofaSlicer.stripHTML(content);
Locale language = tokenizer.languageDetect(content);
if (!language.getLanguage().equals("hu")) continue;
out.write(content + "\n");
// create n-grams
for (int gramSize = minGramLength; gramSize <= maxGramLength; gramSize++) {
int limit = content.length() - gramSize - 2;
for (int j = 0; j <= limit; j++) {
ArrayList<String> grams = getNGram(content, language, j, gramSize + 2);
if (grams.size() != 0) {
String window = grams.get(0);
String charBefore;
String charAfter;
String gram;
if (j > 0 && j < limit) {
charBefore = window.substring(0, 1);
charAfter = window.substring(gramSize + 1);
gram = window.substring(1, gramSize + 1);
}
else {
if (j == 0) {
charBefore = " ";
charAfter = window.substring(gramSize, gramSize + 1);
gram = window.substring(0, gramSize);
}
else {
charBefore = window.substring(1, 2);
charAfter = " ";
gram = window.substring(2, gramSize + 2);
}
}
// process n-gram
Pair<Integer, Pair<Integer, Integer>> counters = ngrams.get(gram);
if (null == counters) {
counters = new Pair<Integer, Pair<Integer, Integer>>(0, new Pair<Integer,
Integer>(0, 0));
ngrams.put(gram, counters);
}
Pair<Integer, Integer> wsCount = counters.getSecond();
// increase occurrence count
counters.setFirst(counters.getFirst() + 1);
// preceding whitespace
if (charBefore.matches("\\s"))
wsCount.setFirst(wsCount.getFirst() + 1);
// following whitespace
if (charAfter.matches("\\s"))
wsCount.setSecond(wsCount.getSecond() + 1);
counters.setSecond(wsCount);
}
}
}
if (i % 100 == 0)
System.out.println((i + 1) + "/" + nodes.length());
}
ObjectOutputStream oout = new ObjectOutputStream(new FileOutputStream("ngrams_hu" +
minGramLength + "-" + maxGramLength + ".dat"));
oout.writeObject(ngrams);
oout.close();
}
///*
System.out.println("Writing to file...");
fos = new FileOutputStream("test.txt");
out = new OutputStreamWriter(fos, "UTF-8");
NumberFormat formatter = new DecimalFormat("0.00");
for (Iterator<String> i = ngrams.keySet().iterator(); i.hasNext(); ) {
String gram = i.next();
Pair<Integer, Pair<Integer, Integer>> counters = ngrams.get(gram);
Pair<Integer, Integer> wsCount = counters.getSecond();
double preceding = 0.0;
double following = 0.0;
if (wsCount.getFirst() > 0)
preceding = (1.0 * wsCount.getFirst()) / (1.0 * counters.getFirst()) * 100.0;
if (wsCount.getSecond() > 0)
following = (1.0 * wsCount.getSecond()) / (1.0 * counters.getFirst()) * 100.0;
if (counters.getFirst() >= minOccurrence && (preceding > 0.5 || following > 0.5)) {
out.write(gram
+
"\t"
+
formatter.format(preceding)
+
"\t"
+
formatter.format(following) + "\t" + counters.getFirst() + "\n");
}
}
out.close();
fos.close();
//*/
System.out.println("Sample sentence:");
Locale language = new Locale("hu");
fos = new FileOutputStream("sample_sentence" + minGramLength + "-" + maxGramLength +
".txt");
out = new OutputStreamWriter(fos, "UTF-8");
final String good = "Nagyon meg vagyok elégedve a telefonommal! Szerintem nagyon szuper :)";
int bestMatch = 9999999;
String best = "";
for (int insertLimit = 100; insertLimit <= 100; insertLimit += 1) {
for (int removeLimit = 100; removeLimit >= 100; removeLimit -= 5) {
String testText = "Nagyon megvagyok el égedve a telefonommal!Szerintemn agyon szuper:)";
//String testText = "Nagyonmegvagyokelégedveatelefonommal!Szerintemnagyonszuper:)";
// condense multiple whitespaces to 1 space
testText = testText.replaceAll("\\s+", " ");
int spaceBefore[] = new int[testText.length()];
int totalNgrams[] = new int[testText.length()];
String result = testText;
for (int i = testText.length() - 1; i >= 0; i--) {
int wsOccurrenceCount = 0;
int noWSCount = 0;
int totalNGramCount = 0;
// calculate whitespace occurrence probability
// preceding n-gram
for (int gramSize = minGramLength; gramSize <= Math.min(maxGramLength, i); gramSize++) {
ArrayList<String> grams = getNGram(testText, language, i + 1, -gramSize);
if (grams.size() > 0) {
String gram = grams.get(0);
Pair<Integer, Pair<Integer, Integer>> counters = ngrams.get(gram);
if (null != counters) {
Pair<Integer, Integer> wsCount = counters.getSecond();
totalNGramCount += counters.getFirst();
wsOccurrenceCount += wsCount.getSecond();
}
//break;
}
}
// following n-gram
for (int gramSize = minGramLength; gramSize <= Math.min(maxGramLength, i); gramSize++) {
ArrayList<String> grams = getNGram(testText, language, i + 1, gramSize);
if (grams.size() > 0) {
String gram = grams.get(0);
Pair<Integer, Pair<Integer, Integer>> counters = ngrams.get(gram);
if (null != counters) {
Pair<Integer, Integer> wsCount = counters.getSecond();
totalNGramCount += counters.getFirst();
wsOccurrenceCount += wsCount.getFirst();
}
//break;
}
}
/*
// calculate non-whitespace probability
for (int gramSize = Math.max(2, minGramLength); gramSize <= Math.min(maxGramLength, i); gramSize++) {
if (gramSize % 2 == 0)
continue;
int halfLength = gramSize / 2;
for (int j = i - halfLength; j < i + halfLength; j++) {
if (j < 0 || j > testText.length() - gramSize)
continue;
ArrayList<String> grams = getNGram(testText, language, j, gramSize);
if (grams.size() > 0) {
String gram = grams.get(0);
if (gram.substring(gramSize - (j - i + halfLength) - 1, gramSize - (j - i +
halfLength)).matches("\\s"))
continue;
Pair<Integer, Pair<Integer, Integer>> counters = ngrams.get(gram);
if (null != counters) {
totalNGramCount += counters.getFirst();
noWSCount += counters.getFirst();
}
}
}
}*/
spaceBefore[i] += wsOccurrenceCount;
totalNgrams[i] += totalNGramCount;
}
for (int i = 0; i != testText.length(); i++)
System.out.println(testText.charAt(i) + "\t" + spaceBefore[i] + "\t" +
totalNgrams[i] + "\t" + (100.0 * spaceBefore[i] / totalNgrams[i]) + "%");
/*
HashSet<String> results = new HashSet<String>();
while (true) {
testText = result;
for (int i = testText.length() - 1; i >= 0; i--) {
int wsOccurrenceCount = 0;
int noWSCount = 0;
int totalNGramCount = 0;
// calculate whitespace occurrence probability
// preceding n-gram
for (int gramSize = minGramLength; gramSize <= Math.min(maxGramLength, i); gramSize++) {
ArrayList<String> grams = getNGram(testText, language, i + 1, -gramSize);
if (grams.size() > 0) {
String gram = grams.get(0);
Pair<Integer, Pair<Integer, Integer>> counters = ngrams.get(gram);
if (null != counters) {
Pair<Integer, Integer> wsCount = counters.getSecond();
totalNGramCount += counters.getFirst();
wsOccurrenceCount += wsCount.getSecond();
}
//break;
}
}
// following n-gram
for (int gramSize = minGramLength; gramSize <= Math.min(maxGramLength, i); gramSize++) {
ArrayList<String> grams = getNGram(testText, language, i + 1, gramSize);
if (grams.size() > 0) {
String gram = grams.get(0);
Pair<Integer, Pair<Integer, Integer>> counters = ngrams.get(gram);
if (null != counters) {
Pair<Integer, Integer> wsCount = counters.getSecond();
totalNGramCount += counters.getFirst();
wsOccurrenceCount += wsCount.getFirst();
}
//break;
}
}
// calculate non-whitespace probability
for (int gramSize = Math.max(2, minGramLength); gramSize <= Math.min(maxGramLength, i); gramSize++) {
if (gramSize % 2 == 0)
continue;
int halfLength = gramSize / 2;
for (int j = i - halfLength; j < i + halfLength; j++) {
if (j < 0 || j > testText.length() - gramSize)
continue;
ArrayList<String> grams = getNGram(testText, language, j, gramSize);
if (grams.size() > 0) {
String gram = grams.get(0);
if (gram.substring(gramSize - (j - i + halfLength) - 1, gramSize - (j - i
+ halfLength)).matches("\\s"))
continue;
Pair<Integer, Pair<Integer, Integer>> counters = ngrams.get(gram);
if (null != counters) {
totalNGramCount += counters.getFirst();
noWSCount += counters.getFirst();
}
}
}
}
double wsChance = (1.0 * wsOccurrenceCount) / (1.0 * totalNGramCount);
if (wsChance >= insertLimit * 0.01) {
result = result.substring(0, i + 1) + " " + result.substring(i + 1);
}
}
result = result.replaceAll("\\s+", " ");
if (results.contains(result))
break;
results.add(result);
}
*/
/*
//while (true) {
testText = result;
for (int i = testText.length() - 1; i >= 0; i--) {
double chanceToInsert = 0.0; // before i-th char
int weightedAverageDivisor = 0;
String position = testText.substring(0, i + 1); //FIXME: only for debugging position display
// preceding n-grams
for (int gramSize = minGramLength; gramSize <= Math.min(maxGramLength, i); gramSize++) {
ArrayList<String> grams = getNGram(testText, language, i + 1, -gramSize);
for (Iterator<String> j = grams.iterator(); j.hasNext(); ) {
String gram = j.next();
Pair<Integer, Pair<Integer, Integer>> counters = ngrams.get(gram);
if (null != counters) {
Pair<Integer, Integer> wsCount = counters.getSecond();
if (wsCount.getSecond() >= minOccurrence) {
// calculate chances
int weight = 1; //gramSize; // * wsCount.getSecond();
weightedAverageDivisor += weight;
chanceToInsert += weight * ((1.0 * wsCount.getSecond()) / (1.0 *
counters.getFirst()));
}
}
}
}
*
if (testText.substring(i, i + 1).matches("\\s")) {
// check if whitespace can be erased
// following n-grams
for (int gramSize = minGramLength; gramSize <= Math.min(maxGramLength,
result.length() - (i + 1) - 1); gramSize++) {
ArrayList<String> grams = getNGram(testText, language, i, -gramSize);
for (Iterator<String> j = grams.iterator(); j.hasNext(); ) {
String gram = j.next();
Pair<Integer, Pair<Integer, Integer>> counters = ngrams.get(gram);
if (null != counters) {
Pair<Integer, Integer> wsCount = counters.getSecond();
if (wsCount.getFirst() >= minOccurrence) {
// calculate chances
int weight = 1; //gramSize; // * wsCount.getFirst();
weightedAverageDivisor += weight;
chanceToInsert += weight * ((1.0 * wsCount.getFirst()) / (1.0 *
counters.getFirst()));
}
}
}
}
// check remove
if (weightedAverageDivisor > 0) {
chanceToInsert /= (1.0 * weightedAverageDivisor);
//if (chanceToInsert < (removeLimit * 0.01))
// result = result.substring(0, i) + result.substring(i + 1);
}
}
else {
// check if whitespace needs to be added
// following x-grams
for (int gramSize = minGramLength; gramSize <= Math.min(maxGramLength,
result.length() - (i + 1)); gramSize++) {
ArrayList<String> grams = getNGram(testText, language, i, -gramSize);
for (Iterator<String> j = grams.iterator(); j.hasNext(); ) {
String gram = j.next();
Pair<Integer, Pair<Integer, Integer>> counters = ngrams.get(gram);
if (null != counters) {
Pair<Integer, Integer> wsCount = counters.getSecond();
if (wsCount.getFirst() >= 10) {
// calculate chances
int weight = 1; //gramSize; // * wsCount.getFirst();
weightedAverageDivisor += weight;
chanceToInsert += weight * ((1.0 * wsCount.getFirst()) / (1.0 *
counters.getFirst()));
}
}
}
}
// check insert
if (weightedAverageDivisor > 0) {
chanceToInsert /= (1.0 * weightedAverageDivisor);
if (chanceToInsert >= (insertLimit * 0.01))
result = result.substring(0, i + 1) + " " + result.substring(i + 1);
}
}
}
result = result.replaceAll("\\s+", " ");
if (results.contains(result))
break;
results.add(result);
//}
*/
// condense spaces
int similarity = calculateStringSimilarity(result, good, language);
if (similarity < bestMatch) {
//System.out.println(" " + insertLimit + "%/" + removeLimit + "%: \"" + result +
"\" " + similarity);
best = "
" + insertLimit + "%/" + removeLimit + "%: \"" + result + "\" " +
similarity;
bestMatch = similarity;
}
System.out.println("
" + insertLimit + "%/" + removeLimit + "%: \"" + result +
"\"");
//out.write("
" + insertLimit + "%/" + removeLimit + "%: \"" + result + "\"\n");
//System.out.println("
" + insertLimit + "%: \"" + result + "\"");
}
}
System.out.println("best " + minOccurrence + ": " + best);
out.close();
fos.close();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (JSONException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (ServerErrorResponse e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
System.out.println("Done.");
}
/**
* Returns the requested length N-grams from the input string.
* @param document Document to get N-gram from
* @param locale Locale used for lower-casing the extracted grams
* @param startPos starting position for extraction
* @param length length of requested n-gram (negative values extract <b>before</b>
* <code>startPos</code>)
* @return the list of requested length n-grams
*/
public static ArrayList<String> getNGram(String document, Locale locale, int startPos, int
length) {
ArrayList<String> result = new ArrayList<String>();
if (length >= 1) {
// look ahead
int nextPos = startPos + length;
if (nextPos <= document.length()) {
String gram = document.substring(startPos, nextPos).toLowerCase(locale);
result.add(gram);
int i = 0;
while (i != gram.length() && nextPos != document.length()) {
switch (gram.charAt(i)) {
case ' ':
//FALLTHROUGH
case '\t':
//FALLTHROUGH
case '\n':
//FALLTHROUGH
case '\r':
// remove whitespace and add a new char to the end
gram = gram.substring(0, i) + gram.substring(i + 1) + document.charAt(nextPos);
nextPos++;
result.add(gram.toLowerCase(locale));
break ;
default :
i++;
}
}
}
}
else if (length <= -1) {
// look behind
int nextPos = startPos + length;
if (nextPos >= 0) {
String gram = document.substring(nextPos, startPos);
nextPos--;
result.add(gram);
int i = gram.length() - 1;
while (i >= 0 && nextPos >= 0) {
switch (gram.charAt(i)) {
case ' ':
//FALLTHROUGH
case '\t':
//FALLTHROUGH
case '\n':
//FALLTHROUGH
case '\r':
// remove whitespace and add a new char to the front
gram = document.charAt(nextPos) + gram.substring(0, i) + gram.substring(i + 1);
nextPos--;
result.add(gram.toLowerCase(locale));
break ;
default :
i--;
}
}
}
}
return result;
}
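// Worked example, traced from the code above: for document "abc def",
// getNGram(document, locale, 0, 4) first records the raw window "abc " and then,
// since the window contains a whitespace, the condensed variant "abcd" (the space
// is dropped and the next document character is appended), returning ["abc ", "abcd"].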
public static int calculateStringSimilarity(String string1, String string2, Locale locale) {
int errors = 10 + Math.abs(69 - string1.length());
//"Nagyon megvagyok el �gedve a telefonommal!Szerintemn agyon szuper:)"
if (string1.matches("Nagyon meg vagyok el�gedve a telefonommal! Szerintem nagyon szuper
:\\)"))
return 0;
if (string1.matches("^Nagyon ")) errors--;
if (string1.matches(".* meg .*")) errors--;
if (string1.matches(".* vagyok .*")) errors--;
if (string1.matches(".* el�gedve .*")) errors--;
if (string1.matches(".* a .*")) errors--;
if (string1.matches(".* telefonommal! .*")) errors--;
if (string1.matches(".* Szerintem .*")) errors--;
if (string1.matches(".* nagyon .*")) errors--;
if (string1.matches(".* szuper .*")) errors--;
if (string1.matches(".* :\\)$")) errors--;
string1 = string1.toLowerCase(locale);
string2 = string2.toLowerCase(locale);
int score = 0;
int n = string1.length();
int m = string2.length();
if (0 == n) score = m;
else if (0 == m) score = n;
else {
int edits[][] = new int[n + 1][m + 1];
for (int i = 0; i <= n; i++)
edits[i][0] = i;
for (int i = 0; i <= m; i++)
edits[0][i] = i;
for (int i = 0; i != n; i++) {
for (int j = 0; j != m; j++) {
int cost = 0;
if (string1.charAt(i) != string2.charAt(j))
cost = 1;
edits[i + 1][j + 1] = cost + Math.min(Math.min(edits[i][j + 1], edits[i][j]),
edits[i + 1][j]);
}
}
score = edits[n][m];
}
return errors + score;
}
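// Note: beyond the hand-tuned "errors" penalties above, the core of this method is a
// standard Levenshtein edit distance. Example: "kitten" vs. "sitting" yields a
// distance of 3 (substitute k->s, substitute e->i, append g).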
public void calculateOpinionMeasures() {
Neo4jDBInterface dbInterface = db.getDBInterface();
System.out.println("Query opinions...");
ArrayList<Neo4jRelationship> relationships = new ArrayList<Neo4jRelationship>();
relationships.add(new Neo4jRelationship("OPINION", Neo4jRelationshipDirection.OUT));
try {
JSONArray nodes = dbInterface.traverse(
db.getRootNode(),
Neo4jTraverseResult.NODE,
Neo4jTraverseOrder.DEPTH_FIRST,
Neo4jTraverseUniqueness.NODE_PATH,
relationships,
null,
Neo4jTraverseReturnFilter.ALL_BUT_START_NODE,
1
);
System.out.println("Processing...");
FileOutputStream fos = new FileOutputStream("measures.txt");
OutputStreamWriter out = new OutputStreamWriter(fos, "UTF-8");
for (int i = 0; i != nodes.length(); i++) {
JSONObject node;
try {
node = nodes.getJSONObject(i);
JSONObject metadata = node.getJSONObject("data");
// get node's raw HTML content
String content = metadata.getString("content");
// clean text
content = PofaSlicer.stripHTML(content);
// detect language
Locale language = tokenizer.languageDetect(content);
if (!language.getLanguage().equals("hu")) continue;
// tokenize
ArrayList<String> tokens = PofaTokenizer.tokenize(content, language);
ArrayList<String> stems = PofaTokenizer.stemTokens(tokens, language);
// compute stats
int TotalWords = tokens.size();
int TotalSentences = 1;
int TotalSyllables = 0;
int TotalComplexWords = 0;
int TotalSpecials = 0; // smileys and other weird tokens
for (int j = 0; j != tokens.size(); j++) {
String token = tokens.get(j);
String stem = stems.get(j);
if (PofaTokenizer.isSeparator(token)) {
if (token.matches("\\.+|\\?+|\\!+"))
TotalSentences++;
if (token.length() > 1)
TotalSpecials++;
}
else {
TotalSyllables += (token.length() -
token.replaceAll("[aáeéiíoóöőuúüűAÁEÉIÍOÓÖŐUÚÜŰ]", "").length());
if (stem.length() -
stem.replaceAll("[aáeéiíoóöőuúüűAÁEÉIÍOÓÖŐUÚÜŰ]", "").length() > 3)
TotalComplexWords++;
}
}
// write results
out.write(
node.getInt("node") + "\t" +
TotalWords + "\t" +
TotalSentences + "\t" +
TotalSyllables + "\t" +
TotalComplexWords + "\t" +
TotalSpecials + "\t" +
"\"" + content + "\"\n"
);
} catch (JSONException e) {
} catch (ClassNotFoundException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (InstantiationException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IllegalAccessException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
if (i % 100 == 99)
System.out.println((i + 1) + "/" + nodes.length());
}
} catch (IOException e) {
e.printStackTrace();
} catch (ServerErrorResponse e) {
e.printStackTrace();
}
}
public static void main(String argv[]) throws ServerErrorResponse {
new PofaTokenizer();
PofaNeo4jDB db = new PofaNeo4jDB("http://localhost:7474/db/data");
PofaPreprocessController ppc = new PofaPreprocessController(db);
ppc.start();
//ppc.addWhitespaces();
//ppc.calculateOpinionMeasures();
/*
// query all opinions for DB export
Neo4jDBInterface dbInterface = db.getDBInterface();
ArrayList<Neo4jRelationship> relationships = new ArrayList<Neo4jRelationship>();
relationships.add(new Neo4jRelationship("OPINION", Neo4jRelationshipDirection.OUT));
try {
JSONArray opinionNodes = dbInterface.traverse(db.getRootNode(),
Neo4jTraverseResult.NODE,
Neo4jTraverseOrder.BREADTH_FIRST,
Neo4jTraverseUniqueness.NODE,
relationships,
"",
Neo4jTraverseReturnFilter.ALL_BUT_START_NODE,
1);
FileOutputStream fos = new FileOutputStream("opinion_export_en.csv");
OutputStreamWriter out = new OutputStreamWriter(fos, "UTF-8");
relationships.clear();
relationships.add(new Neo4jRelationship("APPLIES_TO", Neo4jRelationshipDirection.OUT));
int opinionID = 0;
for (int i = 0; i != opinionNodes.length(); i++) {
try {
JSONObject opinionNode = opinionNodes.getJSONObject(i);
String opinion = opinionNode.getJSONObject("data").getString("content");
opinion = PofaSlicer.stripHTML(opinion);
Locale language = ppc.tokenizer.languageDetect(opinion);
if (!language.getLanguage().equals("en"))
continue;
if (opinion.length() < 30 || opinion.length() > 500)
continue;
if (opinion.matches("^(@.*|Specifikáció.*)$"))
continue;
JSONArray categories = dbInterface.traverse(
opinionNode.getString("node"),
Neo4jTraverseResult.NODE,
Neo4jTraverseOrder.DEPTH_FIRST,
Neo4jTraverseUniqueness.NODE,
relationships,
null,
Neo4jTraverseReturnFilter.ALL_BUT_START_NODE,
1);
StringBuilder categoryItems = new StringBuilder();
for (int j = 0; j != categories.length(); j++) {
String categoryName = categories.getJSONObject(j).getJSONObject("data").getString("name");
categoryName = categoryName.replaceAll("Comments", ""); // assign back: String is immutable
if (0 == categoryName.length() || categoryName.matches("Home page|Főoldal"))
continue;
if (categoryItems.length() > 0)
categoryItems.append(";");
categoryItems.append(categoryName);
}
//out.write(opinionNode.getString("node") + "\t" + language.getLanguage() + "\t" +
//          opinion.length() + "\t" + opinion + "\n");
//opinion.replaceAll("\"", "\\\"");
out.write(opinionID + "\t" + opinionNode.getString("node") + "\t" +
language.getLanguage() + "\t" + categoryItems + "\t" + opinion + "\t0\t0\t0\n");
opinionID++;
} catch (JSONException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
"\t"
+
System.out.println("Written " + opinionID + " lines");
out.close();
fos.close();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
*/
/*
ArrayList<Neo4jRelationship> relationships = new ArrayList<Neo4jRelationship>();
relationships.add(new Neo4jRelationship("", Neo4jRelationshipDirection.OUT));
try {
JSONArray nodes = db.traverse(db.getRefNodeTokenize(),
Neo4jTraverseResult.NODE,
Neo4jTraverseOrder.DEPTH_FIRST,
relationships, null, Neo4jTraverseReturnFilter.ALL_BUT_START_NODE, 1);
System.out.println(nodes.length());
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
*/
//new PofaPreprocessController(db).start();
/*
ArrayList<Neo4jRelationship> relationships = new ArrayList<Neo4jRelationship>();
relationships.add(new Neo4jRelationship("", Neo4jRelationshipDirection.OUT));
JSONArray nodes = db.traverse(db.getRefNodeTokenize(),
Neo4jTraverseResult.NODE,
Neo4jTraverseOrder.DEPTH_FIRST,
relationships, null, Neo4jTraverseReturnFilter.ALL_BUT_START_NODE, 1);
System.out.println(nodes.length());
*/
}
}
PofaQueryEngine.java
package com.wcs.pofa.query;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Locale;
import java.util.regex.Pattern;
import org.json.JSONArray;
import org.json.JSONException;
import org.json.JSONObject;
import com.wcs.pofa.Pair;
import com.wcs.pofa.db.Neo4jDBInterface;
import com.wcs.pofa.db.Neo4jRelationship;
import com.wcs.pofa.db.PofaNeo4jDB;
import com.wcs.pofa.db.Neo4jDBInterface.Neo4jTraverseOrder;
import com.wcs.pofa.db.Neo4jDBInterface.Neo4jTraverseResult;
import com.wcs.pofa.db.Neo4jDBInterface.Neo4jTraverseReturnFilter;
import com.wcs.pofa.db.Neo4jDBInterface.Neo4jTraverseUniqueness;
import com.wcs.pofa.db.Neo4jDBInterface.ServerErrorResponse;
import com.wcs.pofa.db.Neo4jRelationship.Neo4jRelationshipDirection;
import com.wcs.pofa.query.PofaQueryResultList.ResultTriplet;
import com.wcs.pofa.slicer.PofaSlicer;
import com.wcs.pofa.tokenizer.PofaTokenizer;
/**
* Common query engine for processing user queries.
*
*/
public class PofaQueryEngine {
private PofaNeo4jDB db;
private PofaTokenizer tokenizer;
public PofaQueryEngine(PofaNeo4jDB db) {
this.db = db;
this.tokenizer = new PofaTokenizer();
}
/**
* Match parts of query string to entity names found in DB.
* Returns lists of matched names in order of match quality and matched text of original
* query.
* @param queryString
* @param locale
*/
private PofaQueryResultList<String, String, String> findEntities(String queryString, Locale
locale) {
PofaQueryResultList<String, String, String> result = new PofaQueryResultList<String,
String, String>();
String queryExpression = queryString;
//look for entities in query string
boolean reDo = true;
while (reDo) {
reDo = false;
//compose fixed length query expressions
ArrayList<String> expressions = tokenizer.composeSubExpressions(queryExpression,
locale, true, false);
for (int i = 0; i != expressions.size() && !reDo; i++) {
String matchedExpression = expressions.get(i);
JSONArray indexHit;
try {
indexHit = db.queryNodeIndex("entity", "name", matchedExpression, locale);
for (int j = 0; j != indexHit.length(); j++) {
String key = indexHit.getJSONObject(j).getString("node");
// find out true name of matched node
String trueName;
trueName = indexHit.getJSONObject(j).getJSONObject("data").getString("name");
// insert found result to result list
result.addResult(
matchedExpression,
compareMatch2Query(queryString, trueName, locale),
key,
trueName
);
reDo = true;
}
if (reDo) {
// remove matched part of query
queryExpression =
queryExpression.replaceFirst(Pattern.quote(matchedExpression), "|").
replaceAll("\\| \\||^\\| | \\|$", "").
replaceAll("^\\s+|\\s+$", "").
replaceAll("\\s+", " ");
}
} catch (JSONException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
return result;
}
/**
* Match parts of query string to category names found in DB.
* Returns lists of matched names in order of match quality and matched text of original
* query.
* @param queryString
* @param locale
*/
private PofaQueryResultList<String, String, String> findCategories(String queryString,
Locale locale) {
PofaQueryResultList<String, String, String> result = new PofaQueryResultList<String,
String, String>();
String queryExpression = queryString;
//look for entities in query string
boolean reDo = true;
while (reDo) {
reDo = false;
//compose fixed length query expressions
ArrayList<String> expressions = tokenizer.composeSubExpressions(queryExpression,
locale, true, true);
for (int i = 0; i != expressions.size() && !reDo; i++) {
String matchedExpression = expressions.get(i);
JSONArray indexHit;
try {
indexHit = db.queryNodeIndex("category", "name", matchedExpression, locale);
for (int j = 0; j != indexHit.length(); j++) {
String key = indexHit.getJSONObject(j).getString("node");
// find out true name of matched node
String trueName;
trueName = indexHit.getJSONObject(j).getJSONObject("data").getString("name");
// insert found result to result list
result.addResult(
matchedExpression,
compareMatch2Query(queryString, trueName, locale),
key,
trueName
);
reDo = true;
}
if (reDo) {
// remove matched part of query
queryExpression =
queryExpression.replaceFirst(Pattern.quote(matchedExpression), "|").
replaceAll("\\| \\||^\\| | \\|$", "").
replaceAll("^\\s+|\\s+$", "").
replaceAll("\\s+", " ");
}
} catch (JSONException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
return result;
}
private PofaQueryResultList<String, String, String> findEntitesFromCategories(String
queryString, PofaQueryResultList<String, String, String> categoryList, Locale locale) {
PofaQueryResultList<String, String, String> result = new PofaQueryResultList<String, String, String>();
// get entities for top categories
HashMap<String, Pair<Integer, JSONObject>> hits = new HashMap<String, Pair<Integer, JSONObject>>();
Neo4jDBInterface dbInterface = db.getDBInterface();
for (Iterator<String> i = categoryList.getMatches(); i.hasNext(); ) {
String matchedQuery = i.next();
double rankLimit = -1;
for (Iterator<ResultTriplet<String, String>> j = categoryList.getMatchElements(matchedQuery); j.hasNext(); ) {
ResultTriplet<String, String> item = j.next();
if (-1 == rankLimit)
rankLimit = item.getRank() * 0.3;
else if (item.getRank() < rankLimit)
break;
System.err.println("  checking: " + item.getID() + " \"" + item.getMatch() + "\"");
// get entities from categories
ArrayList<Neo4jRelationship> relationships = new ArrayList<Neo4jRelationship>();
relationships.add(new Neo4jRelationship("BELONGS_TO", Neo4jRelationshipDirection.ALL));
try {
JSONArray comments = dbInterface.traverse(
item.getID(),
Neo4jTraverseResult.NODE,
Neo4jTraverseOrder.BREADTH_FIRST,
Neo4jTraverseUniqueness.NODE,
relationships,
"" ,
Neo4jTraverseReturnFilter.ALL_BUT_START_NODE,
1
);
for (int k = 0; k != comments.length(); k++) {
try {
JSONObject node = comments.getJSONObject(k);
String key = node.getString("node");
if (!hits.containsKey(key))
hits.put(key, new Pair<Integer, JSONObject>(1, node.getJSONObject("data")));
else {
Pair<Integer, JSONObject> oldValue = hits.get(key);
oldValue.setFirst(oldValue.getFirst() + 1);
hits.put(key, oldValue);
}
String trueName = node.getJSONObject("data").getString("name");
// insert found result to result list
result.addResult(
item.getMatch(),
compareMatch2Query(queryString, trueName, locale),
key,
trueName
);
} catch (JSONException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (ServerErrorResponse e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
return result;
}
private void getComments(String nodeID) {
Neo4jDBInterface dbInterface = db.getDBInterface();
ArrayList<Neo4jRelationship> relationships = new ArrayList<Neo4jRelationship>();
relationships.add(new Neo4jRelationship("", Neo4jRelationshipDirection.OUT));
relationships.add(new Neo4jRelationship("APPLIES_TO", Neo4jRelationshipDirection.IN));
try {
JSONArray comments = dbInterface.traverse(
//nodeID,
db.getRefNodeClassifyEntity(),
Neo4jTraverseResult.NODE,
Neo4jTraverseOrder.BREADTH_FIRST,
Neo4jTraverseUniqueness.NODE,
relationships,
null,
Neo4jTraverseReturnFilter.ALL_BUT_START_NODE,
2
);
for (int i = 0; i != comments.length(); i++) {
String url;
try {
JSONObject data = comments.getJSONObject(i).getJSONObject("data");
url = data.getString("url");
if (!url.matches(".*mobilarena.*")) {
System.out.println("\""
+
PofaSlicer.stripHTML(data.getString("content"))
+
"\"");
System.out.println("
" + url);
}
} catch (JSONException e) {
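// nodes with missing url/content fields are skipped silently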
}
}
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (ServerErrorResponse e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
/**
* Interpret and execute query string.
* @param queryString
*/
public void query(String queryString, Locale locale) {
System.out.println("Query string: \"" + queryString + "\", language: " + locale.toString());
/*
* goal:
* - identify entities
* - gather relevant info for affected entities
* - store info in session (? - for faster access, logging)
* - display
* method:
* - find out entity names from query (stop if at any point search has results)
*   - 1: simple entity name matching
*   - 2: category name matching
*   - 3: full-text search
* - filter entity names using the remaining part of query
*   - if using a filter would result in empty set, ignore that filter and warn user
* - gather all comments separately for entities
* - sort comments
* - display
*
* - for every entity recommend a best match
* - store the others as 'similar matches'
*/
// get entities
PofaQueryResultList<String, String, String> foundEntities;
foundEntities = findEntities(queryString, locale);
// get categories
PofaQueryResultList<String, String, String> foundCategories;
foundCategories = findCategories(queryString, locale);
// get entities from categories
PofaQueryResultList<String, String, String> foundEntitiesFromCategories;
foundEntitiesFromCategories = findEntitesFromCategories(queryString, foundCategories, locale);
System.out.println("Best matching entity names:");
for (Iterator<String> i = foundEntities.getMatches(); i.hasNext(); ) {
String matchedQuery = i.next();
double rankLimit = -1;
for
(Iterator<ResultTriplet<String,
String>>
foundEntities.getMatchElements(matchedQuery); j.hasNext(); ) {
ResultTriplet<String, String> item = j.next();
if (-1 == rankLimit)
rankLimit = item.getRank() * 0.3;
else if (item.getRank() < rankLimit)
break ;
System.out.println(
"
" +
matchedQuery + ": " +
item.getRank() + " " +
"\"" + item.getMatch() + "\" " +
item.getID()
);
}
}
foundCategories,
j
=
System.out.println("Best matching category names:");
for (Iterator<String> i = foundCategories.getMatches(); i.hasNext(); ) {
String matchedQuery = i.next();
double rankLimit = -1;
for
(Iterator<ResultTriplet<String,
String>>
j
foundCategories.getMatchElements(matchedQuery); j.hasNext(); ) {
ResultTriplet<String, String> item = j.next();
if (-1 == rankLimit)
rankLimit = item.getRank() * 0.3;
else if (item.getRank() < rankLimit)
break ;
System.out.println(
"
" +
matchedQuery + ": " +
item.getRank() + " " +
"\"" + item.getMatch() + "\" " +
item.getID()
);
}
}
System.out.println("Best found entities (category -> entity):");
for (Iterator<String> i = foundEntitiesFromCategories.getMatches(); i.hasNext(); ) {
String matchedQuery = i.next();
double rankLimit = -1;
for
(Iterator<ResultTriplet<String,
String>>
j
foundEntitiesFromCategories.getMatchElements(matchedQuery); j.hasNext(); ) {
ResultTriplet<String, String> item = j.next();
if (-1 == rankLimit)
rankLimit = item.getRank() * 0.3;
else if (item.getRank() < rankLimit)
break ;
System.out.println(
"
" +
matchedQuery + ": " +
item.getRank() + " " +
"\"" + item.getMatch() + "\" " +
item.getID()
);
}
}
}
/**
* Calculates similarity of two strings (relative Levenshtein distance)
* @param string1
* @param string2
* @return 1.0 if strings match otherwise a real number in [0..1) interval
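* For example (illustrative values, not from the original spec): "kitten" vs. "sitting"
* has Levenshtein distance 3 and a longer length of 7, so the result is (7 - 3) / 7 = 0.571.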
*/
public static Double calculateStringSimilarity(String string1, String string2, Locale locale) {
string1 = string1.toLowerCase(locale);
string2 = string2.toLowerCase(locale);
int score = 0;
int n = string1.length();
int m = string2.length();
if (0 == n) score = m;
else if (0 == m) score = n;
else {
int edits[][] = new int[n + 1][m + 1];
for (int i = 0; i <= n; i++)
edits[i][0] = i;
for (int i = 0; i <= m; i++)
edits[0][i] = i;
for (int i = 0; i != n; i++) {
for (int j = 0; j != m; j++) {
int cost = 0;
if (string1.charAt(i) != string2.charAt(j))
cost = 1;
edits[i + 1][j + 1] = cost + Math.min(Math.min(edits[i][j + 1], edits[i][j]),
edits[i + 1][j]);
}
}
score = edits[n][m];
}
int max = Math.max(n, m);
if (0 == max)
return 1.0;
else
return (Double)((1.0 * (max - score)) / (1.0 * max));
}
/**
* Compares a result string to the original query string word-by-word and returns a
* similarity score based on how the words of the initial query are matched in the result.
* @param query Original query string
* @param match Found matching string to compare to the query
* @param locale Locale to use
* @return Similarity score
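* Illustrative walk-through (assumed inputs, not from the original spec): for query
* "nokia classic" and match "nokia 6700 classic" both query words match fully, the
* matched spans average 2.5 of 3 words (ratio 0.833), and the multi-word bonus is the
* similarity of "nokia classic" to "nokia 6700 classic" (about 0.722), giving a final
* score of roughly 0.78.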
*/
public static double compareMatch2Query(String query, String match, Locale locale) {
//TODO: filter both strings (remove special chars)
String[] queryParts = query.split("\\s+");
String[] matchParts = match.split("\\s+");
// calculate word-to-word similarities, average the best scores
double totalScore = 0;
int firstFullMatchQuery = -1;
int lastFullMatchQuery = -1;
int firstFullMatchResult = -1;
int lastFullMatchResult = -1;
for (int i = 0; i != queryParts.length; i++) {
double bestScore = 0;
for (int j = 0; j != matchParts.length; j++) {
double score = calculateStringSimilarity(matchParts[j], queryParts[i], locale);
if (score > bestScore) {
bestScore = score;
if (bestScore == 1.0) {
if (-1 == firstFullMatchResult) firstFullMatchResult = j;
lastFullMatchResult = j;
break ;
}
}
}
if (bestScore == 1.0) {
if (-1 == firstFullMatchQuery) firstFullMatchQuery = i;
lastFullMatchQuery = i;
}
totalScore += bestScore;
}
totalScore /= queryParts.length;
// calculate matching part length (word count) ratio relative to the longest string's word count
double avgMatchLen = 1.0 * (lastFullMatchQuery - firstFullMatchQuery + 1 +
lastFullMatchResult - firstFullMatchResult + 1) / 2;
double avgMatchLenRatio =
avgMatchLen / (1.0 * Math.max(queryParts.length, matchParts.length));
// modify score with match length ratio
totalScore *= avgMatchLenRatio;
// calculate multi-word match bonus
if (avgMatchLen > 1) {
// find the matching parts of both initial strings
String matchingQueryPart = "";
for (int i = firstFullMatchQuery; i <= lastFullMatchQuery; i++)
if (i == firstFullMatchQuery)
matchingQueryPart = queryParts[i];
else
matchingQueryPart += " " + queryParts[i];
String matchingResultPart = "";
for (int i = firstFullMatchResult; i <= lastFullMatchResult; i++)
if (i == firstFullMatchResult)
matchingResultPart = matchParts[i];
else
matchingResultPart += " " + matchParts[i];
double multiWordBonus = calculateStringSimilarity(matchingResultPart, matchingQueryPart, locale);
totalScore = (totalScore + multiWordBonus) / 2;
}
return totalScore;
}
public static void main(String argv[]) throws ServerErrorResponse, IOException {
PofaNeo4jDB db = new PofaNeo4jDB("http://localhost:7474/db/data");
PofaQueryEngine query = new PofaQueryEngine(db);
Locale locale = new Locale("hu");
/*
query.query("canon 50d fujifilm s2500hd", locale);
//query.query("canon 50d", locale);
System.out.println("----");
query.query("laptop, v�s�rl�s", locale);
System.out.println("----");
query.query("nokia classic", locale);
System.out.println("----");
*/
///*
long t1 = System.currentTimeMillis();
query.query("nokia classic samsung galaxy mobiltelefon", locale);
//query.getComments("1");
long t2 = System.currentTimeMillis();
//System.err.println("Query took: " + (t2 - t1) + " ms");
//*/
/*
ArrayList<Neo4jRelationship> relationships = new ArrayList<Neo4jRelationship>();
relationships.add(new Neo4jRelationship("VALUE", Neo4jRelationshipDirection.OUT));
JSONArray items = db.getDBInterface().traverse(
"174", //db.getRootNode(),
Neo4jTraverseResult.NODE,
Neo4jTraverseOrder.BREADTH_FIRST,
Neo4jTraverseUniqueness.NODE,
relationships,
null,
Neo4jTraverseReturnFilter.ALL_BUT_START_NODE,
1
);
for (int i = 0; i != items.length(); i++)
try {
System.out.println(items.getJSONObject(i).getJSONObject("data").getString("name"));
} catch (JSONException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
*/
//query.query("notebook laptop netbook", locale);
}
}
PofaQueryResultList.java
package com.wcs.pofa.query;
import java.util.HashMap;
import java.util.Iterator;
import java.util.TreeSet;
/**
* A special class that can hold different query results.
*/
public class PofaQueryResultList<Query, ID, Match> {
/**
* A triplet that holds (rank, ID, match) values as a search result.
* @param <_ID> type of database item IDs
* @param <_Match> type of match content
*/
public static class ResultTriplet<_ID, _Match> implements Comparable<ResultTriplet<_ID, _Match>> {
private Double rank;
private _ID id;
private _Match match;
public ResultTriplet(Double rank, _ID id, _Match match) {
this.rank = rank;
this.id = id;
this.match = match;
}
public Double getRank() {
return rank;
}
public _ID getID() {
return id;
}
public _Match getMatch() {
return match;
}
/**
* Reverse order based on rank (best match first)
*/
public int compareTo(ResultTriplet<_ID, _Match> other) {
if (this.rank > other.rank) return -1;
if (this.rank < other.rank) return 1;
return 0;
}
}
/*
* return:
*   list of:
*     - string: matched part of query
*     - ordered list of:
*       - double: match factor
*       - string: DB node ID: entity node
*       - string: entity node name
*/
private HashMap<Query, TreeSet<ResultTriplet<ID, Match>>> results;
public PofaQueryResultList() {
results = new HashMap<Query, TreeSet<ResultTriplet<ID, Match>>>();
}
/**
* Insert a new result into the result list.
* @param matchedQuery Part of original query that was matched
* @param rank rank of search result to be added
* @param id DB item ID of search result to be added
* @param match DB match content of search result to be added
*/
public void addResult(Query matchedQuery, Double rank, ID id, Match match) {
if (results.containsKey(matchedQuery)) {
results.get(matchedQuery).
add(new ResultTriplet<ID, Match>(rank, id, match));
}
else {
TreeSet<ResultTriplet<ID, Match>> items = new TreeSet<ResultTriplet<ID, Match>>();
items.add(new ResultTriplet<ID, Match>(rank, id, match));
results.put(matchedQuery, items);
}
}
/**
* Returns an iterator to the added original query part items.
* @return iterator
*/
public Iterator<Query> getMatches() {
return results.keySet().iterator();
}
/**
* Returns an iterator to a specific result query.
* @param matchedQuery
* @return
*/
public Iterator<ResultTriplet<ID, Match>> getMatchElements(Query matchedQuery) {
return results.get(matchedQuery).iterator();
}
/**
* Tells if result list is empty.
* @return
*/
public boolean isEmpty() {
return results.isEmpty();
}
}
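A minimal usage sketch of the result container (illustrative only; the demo values and the surrounding main method are assumptions, not part of the delivered source):

public static void main(String[] args) {
// hypothetical demo: two ranked hits for the matched query part "nokia"
PofaQueryResultList<String, String, String> list = new PofaQueryResultList<String, String, String>();
list.addResult("nokia", 0.9, "42", "Nokia 6700 Classic");
list.addResult("nokia", 0.7, "17", "Nokia E52");
for (java.util.Iterator<String> i = list.getMatches(); i.hasNext(); ) {
String match = i.next();
// elements come back best rank first (TreeSet with reversed compareTo)
for (java.util.Iterator<PofaQueryResultList.ResultTriplet<String, String>> j = list.getMatchElements(match); j.hasNext(); ) {
PofaQueryResultList.ResultTriplet<String, String> t = j.next();
System.out.println(match + ": " + t.getRank() + " " + t.getID() + " \"" + t.getMatch() + "\"");
}
}
}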
PofaRuleMatcher.java
package com.wcs.pofa;
import java.util.ArrayList;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import com.wcs.pofa.PofaDomainRule.PofaPageElement;
public class PofaRuleMatcher {
public static String getNodeText(Node node, boolean decorate, int depth) {
if (null == node)
return "";
StringBuilder result = new StringBuilder();
String indent = "";
if (decorate) {
for (int i = 0; i != depth * 2; i++)
result.append(" ");
indent = new String(result);
}
result.append("<" + node.getNodeName());
NamedNodeMap nodeAttrs = node.getAttributes();
for (int i = 0; i < nodeAttrs.getLength(); i++)
result.append(" " + nodeAttrs.item(i).getNodeName() + "=\"" + nodeAttrs.item(i).getNodeValue() + '"');
result.append(">");
if (decorate)
result.append("\n");
NodeList children = node.getChildNodes();
for (int i = 0; i < children.getLength(); i++) {
Node child = children.item(i);
switch (child.getNodeType()) {
case Node.ELEMENT_NODE:
result.append(getNodeText(child, decorate, depth + 1));
break ;
case Node.TEXT_NODE:
result.append(indent + child.getNodeValue());
if (decorate)
result.append("\n");
break ;
}
}
result.append(indent + "</" + node.getNodeName() + ">");
if (decorate)
result.append("\n");
return result.toString();
}
public static String getDocumentText(Document doc, boolean decorate) {
StringBuilder result = new StringBuilder();
NodeList children = doc.getChildNodes();
for (int i = 0; i < children.getLength(); i++) {
Node child = children.item(i);
switch (child.getNodeType()) {
case Node.ELEMENT_NODE:
result.append(getNodeText(child, decorate, 0));
break;
case Node.TEXT_NODE:
result.append(child.getNodeValue());
break;
}
}
return result.toString();
}
private static String domWalkerContent(Node node, String path, PofaDomainRule rule) {
String result = "";
if (null == node)
return result;
// compose absolute qualified node name
if ("" == path) path = node.getNodeName();
else path += ">" + node.getNodeName();
// include ID name (if present)
Node nodeAttr = node.getAttributes().getNamedItem("id");
if (null != nodeAttr) path += "#" + nodeAttr.getNodeValue();
// include class name (if present)
nodeAttr = node.getAttributes().getNamedItem("class");
if (null != nodeAttr) path += "." + nodeAttr.getNodeValue();
boolean match = rule.matchRule(path);
if (match) {
// collect content
result += "<" + node.getNodeName();
NamedNodeMap nodeAttrs = node.getAttributes();
for (int j = 0; j < nodeAttrs.getLength(); j++)
result
+=
"
"
+
nodeAttrs.item(j).getNodeName()
nodeAttrs.item(j).getNodeValue() + '"';
result += ">";
}
+
"=\""
+
NodeList children = node.getChildNodes();
for (int i = 0; i < children.getLength(); i++) {
Node child = children.item(i);
// evaluate DOM node
switch (child.getNodeType()) {
case Node.ELEMENT_NODE:
if (match)
result += domWalkerContent(child, path, rule);
else
domWalkerContent(child, path, rule);
break ;
case Node.TEXT_NODE:
if (match)
result += child.getNodeValue();
break ;
case Node.ENTITY_REFERENCE_NODE:
break ;
}
}
if (match)
result += "</" + node.getNodeName() + ">";
return result;
}
private static ArrayList<Pair<PofaPageElement, String>> domWalker(Node node, String path,
PofaDomainRule rule) {
ArrayList<Pair<PofaPageElement, String>> result = new ArrayList<Pair<PofaPageElement,
String>>();
if (null == node)
return result;
// compose absolute qualified node name
if ("" == path) path = node.getNodeName();
else path += ">" + node.getNodeName();
// include ID name (if present)
Node nodeAttr = node.getAttributes().getNamedItem("id");
if (null != nodeAttr) path += "#" + nodeAttr.getNodeValue();
// include class name (if present)
nodeAttr = node.getAttributes().getNamedItem("class");
if (null != nodeAttr) path += "." + nodeAttr.getNodeValue();
// match DOM path against supplied rules
NodeList children = node.getChildNodes();
if (rule.matchRule(path)) {
// found a rule: cumulate content
String content = "<" + node.getNodeName();
NamedNodeMap nodeAttrs = node.getAttributes();
for (int j = 0; j < nodeAttrs.getLength(); j++)
content += " " + nodeAttrs.item(j).getNodeName() + "=\"" + nodeAttrs.item(j).getNodeValue() + '"';
content += ">";
for (int i = 0; i < children.getLength(); i++) {
Node child = children.item(i);
switch (child.getNodeType()) {
case Node.ELEMENT_NODE:
content += domWalkerContent(child, path, rule);
break ;
case Node.TEXT_NODE:
content += child.getNodeValue();
break ;
}
}
content += "</" + node.getNodeName() + ">";
result.add(new Pair<PofaPageElement, String>(rule.getType(), content));
}
else {
// parse children
for (int i = 0; i < children.getLength(); i++) {
Node child = children.item(i);
if (Node.ELEMENT_NODE == child.getNodeType()) {
// process subtree
ArrayList<Pair<PofaPageElement, String>> subResult = domWalker(child, path,
rule);
// merge results
for (int j = 0; j != subResult.size(); j++)
result.add(subResult.get(j));
}
}
}
return result;
}
/**
* Extracts content from the supplied DOM using the supplied rules.
* @param nodes The DOM to extract content from.
* @param rules The rules to use.
* @return List of (page element type, content) pairs. Page element type is the <i>PofaPageElement</i> part
* of the matching rule path, content is the XHTML content of the DOM in the applied path (including tags
* and attributes).
*/
public static ArrayList<Pair<PofaPageElement, String>> ruleMatcher(NodeList nodes,
PofaDomainInfo rules) {
ArrayList<Pair<PofaPageElement, String>> result = new ArrayList<Pair<PofaPageElement,
String>>();
for (int i = 0; i != rules.getRules().size(); i++) {
for (int j = 0; j != nodes.getLength(); j++) {
ArrayList<Pair<PofaPageElement, String>> ruleResult = domWalker(nodes.item(j), "",
rules.getRules().get(i));
for (int k = 0; k != ruleResult.size(); k++)
result.add(ruleResult.get(k));
}
}
return result;
}
}
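As a concrete illustration of the path format that domWalker feeds to PofaDomainRule.matchRule (the example markup is assumed; the rule syntax itself is defined by PofaDomainRule elsewhere): for a <p> element inside <div id="main" class="content"> inside <body>, the composed path is body>div#main.content>p, built from tag names joined by '>' with '#' + id and '.' + class appended where present.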
PofaSlicer.java
package com.wcs.pofa.slicer;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.Locale;
import org.json.JSONArray;
import org.json.JSONException;
import org.json.JSONObject;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;
import com.wcs.pofa.Pair;
import com.wcs.pofa.PofaAbstractSlicer;
import com.wcs.pofa.PofaDomainInfo;
import com.wcs.pofa.PofaRuleMatcher;
import com.wcs.pofa.PofaUtils;
import com.wcs.pofa.PofaDomainRule.PofaPageElement;
import com.wcs.pofa.db.Neo4jDBInterface;
import com.wcs.pofa.db.PofaNeo4jDB;
import com.wcs.pofa.db.Neo4jDBInterface.ServerErrorResponse;
import com.wcs.pofa.tokenizer.PofaTokenizer;
/**
* Receives an (almost properly formatted) HTML as string and extracts parts of it described by page rules
* (PofaDomainRuleList). Stores the extracts (slices) in the DocumentStore DB and passes them further.
*/
public class PofaSlicer implements PofaAbstractSlicer {
private Tidy tidy = new Tidy();
private PofaNeo4jDB db;
private Neo4jDBInterface dbInterface;
private PofaTokenizer tokenizer;
public PofaSlicer(PofaNeo4jDB db) {
System.out.println("Initializing " + this.getClass().getName() + "...");
this.db = db;
this.dbInterface = db.getDBInterface();
this.tokenizer = new PofaTokenizer();
// initialize JTidy
tidy.setQuiet(true);
tidy.setHideComments(true);
tidy.setShowWarnings(false);
tidy.setShowErrors(0);
tidy.setXHTML(true);
System.out.println(this.getClass().getName() + " initialized.");
}
public static String stripHTML(String data) {
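// Illustrative behaviour (the example values are assumptions, not from the original spec):
// stripHTML("<p>Very <b>good</b>  phone.</p>") returns "Very good phone.".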
//TODO: handle escaped characters
// remove HTML tags
data = data.replaceAll("<(\".*?\"|'.*?'|.*?)*>", " ");
// collapse whitespaces
data = data.replaceAll("\\s+", " ");
// trim
data = data.replaceAll("(^\\s+|\\s+$)", "");
return data;
}
/**
* Clean downloaded HTML from unwanted content (such as scripts, style, comments)
* @param html
* @return
*/
public static String cleanHTML(String html) {
// remove script tags
html = html.replaceAll("(?s)<script[^>]*>.*?</script>|<style[^>]*>.*?</style>", " ");
// remove comments
html = html.replaceAll("<!--.*?-->", " ");
return html;
}
private String displayDOM(NodeList nodes, int indent) {
String dom = "";
String spaces = "";
for (int i = 0; i != indent; i++)
spaces += " ";
for (int i = 0; i != nodes.getLength(); i++) {
String nodeName;
Node node = nodes.item(i);
if (node.getNodeType() == Node.TEXT_NODE)
nodeName = "[TEXT] \"" + node.getNodeValue() + "\"";
else {
nodeName = node.getNodeName();
// include ID name (if present)
Node nodeAttr = node.getAttributes().getNamedItem("id");
if (null != nodeAttr) nodeName += "#" + nodeAttr.getNodeValue();
// include class name (if present)
nodeAttr = node.getAttributes().getNamedItem("class");
if (null != nodeAttr) nodeName += "." + nodeAttr.getNodeValue();
}
dom += spaces + nodeName + "\n" +
displayDOM(node.getChildNodes(), indent + 2);
}
return dom;
}
/**
* Processes the contents of a HTML page. Extracts parts of it described by the supplied rules.
*
* @param url The URL the page content was downloaded from
* @param pageContent The textual content of the page to analyze
* @param rules The rules to use for extraction
* @return Amount of new information found in content expressed by formula:<br>
* <b>number_of_new_slices / number_of_found_slices</b><br>
* where only textual slices are counted.
*/
public double process(String url, String pageContent, PofaDomainInfo rules) {
ArrayList<String> entityNames = new ArrayList<String>();
ArrayList<String> categoryNames = new ArrayList<String>();
ArrayList<String> commentNodeIDs = new ArrayList<String>();
JSONObject factSheet = new JSONObject();
// prepare
InputStream is = new ByteArrayInputStream(pageContent.getBytes());
// parse content
Document tidyDoc = tidy.parseDOM(is, null);
//System.err.println(displayDOM(tidyDoc.getChildNodes(), 0)); //FIXME: debug output
// generate slices
ArrayList<Pair<PofaPageElement, String>> matches;
matches = PofaRuleMatcher.ruleMatcher(tidyDoc.getChildNodes(), rules);
int newSlices = 0;
// metadata
JSONObject metaData = new JSONObject();
try {
metaData.put("url", url);
} catch (JSONException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
StringBuilder rawText = new StringBuilder();
// process metadata slices
for (int i = 0; i != matches.size(); i++) {
Pair<PofaPageElement, String> match = matches.get(i);
PofaPageElement matchType = match.getFirst();
String matchContent = match.getSecond();
switch (matchType) {
case THEME:
try {
String textContent = stripHTML(matchContent);
metaData.put("THEME", textContent);
rawText.append(textContent + "\n");
} catch (JSONException e) {
//TODO: error handling
}
break ;
case BREADCRUMBS:
{
// clean category name (remove HTML tags, starting and ending special chars)
String textContent =
PofaTokenizer.tokenListToExpression(
PofaTokenizer.cleanSeparators(
PofaTokenizer.tokenize(
stripHTML(matchContent), null
),
true
)
);
categoryNames.add(textContent);
rawText.append(textContent + "\n");
break ;
}
case ENTITY:
{
// clean entity name (remove HTML tags, starting and ending special chars)
String textContent =
PofaTokenizer.tokenListToExpression(
PofaTokenizer.cleanSeparators(
PofaTokenizer.tokenize(
stripHTML(matchContent), null
),
true
)
);
entityNames.add(textContent);
rawText.append(textContent + "\n");
break ;
}
case FACTSHEET:
{
String factSheetItem;
// TODO: look for hidden values too (inside html tag)
// remove html tags & separate keys from values
factSheetItem = matchContent.replaceAll("<(\".*?\"|'.*?'|.*?)*>|:|�|,\\s+", "\n");
// collapse whitespaces
factSheetItem = factSheetItem.replaceAll("\\n+", "\n");
// inner trim
factSheetItem = factSheetItem.replaceAll("( |\\t)+", " ");
// trim
factSheetItem = factSheetItem.replaceAll("(?m)(^\\s+|\\s+$|^\\n)", "");
// key: first non-empty line
// values: all other non-empty lines
String[] sheet = factSheetItem.split("\n");
String key = "";
ArrayList<String> values = new ArrayList<String>();
for (int j = 0; j != sheet.length; j++) {
String item = sheet[j];
if (item.length() != 0)
if (key.length() == 0)
key = item;
else
values.add(item);
}
// store factsheet
if (key.length() > 0 && values.size() > 0)
for (int j = 0; j != values.size(); j++)
try {
if (factSheet.has(key))
factSheet.accumulate(key, values.get(j));
else
factSheet.put(key, new JSONArray().put(values.get(j)));
} catch (JSONException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
break ;
}
default :
}
}
// process TEXT slices
for (int i = 0; i != matches.size(); i++) {
Pair<PofaPageElement, String> match = matches.get(i);
PofaPageElement matchType = match.getFirst();
String matchContent = match.getSecond();
if (PofaPageElement.TEXT == matchType) {
rawText.append(stripHTML(matchContent) + "\n");
// check if text is a new text
String hash = PofaUtils.getMD5(matchContent).toString(16);
try {
JSONArray indexHit = dbInterface.queryNodeIndex("opinion", "content", hash);
if (indexHit.length() == 0) {
// new text item, insert it
JSONObject text = new JSONObject();
for (Iterator<?> key = metaData.keys(); key.hasNext(); ) {
String keyName = (String)key.next();
text.put(keyName, metaData.get(keyName));
}
text.put("content", matchContent);
// put into DB
String commentNodeID = dbInterface.createNode(text);
db.addNodeToIndex("opinion", "content", hash, commentNodeID);
dbInterface.createRelationship(db.getRootNode(), commentNodeID, "OPINION", new
JSONObject());
dbInterface.createRelationship(db.getRefNodeTokenize(), commentNodeID, "", new
JSONObject());
newSlices++;
commentNodeIDs.add(commentNodeID);
}
else
commentNodeIDs.add(indexHit.getJSONObject(0).getString("node"));
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (JSONException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (ServerErrorResponse e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
// detect language based on extracted slice's raw text
Locale language = tokenizer.languageDetect(rawText.toString());
// add entities
ArrayList<String> entityIDs = addEntities(entityNames, commentNodeIDs, factSheet, language);
// add categories
addCategories(categoryNames, commentNodeIDs, entityIDs, language);
// add factsheet items as categories
addFactSheet(factSheet, commentNodeIDs, entityIDs, language);
return (double)newSlices / (double)matches.size();
}
/**
* Store entities in database. Update relevant indexes. Connect entities to comments.
* @param entityNames
* @param commentIDs
* @param facts
* @param locale
* @return
*/
private ArrayList<String> addEntities(ArrayList<String> entityNames, ArrayList<String>
commentIDs, JSONObject facts, Locale locale) {
ArrayList<String> result = new ArrayList<String>();
for (int i = 0; i != entityNames.size(); i++) {
//--- query if entity exists ---
try {
String entityName = entityNames.get(i);
String entityNodeID;
JSONArray indexHit = db.queryNodeIndex("entity", "name", entityName, locale);
if (indexHit.length() == 0) {
// new item, insert it
JSONObject properties = new JSONObject();
properties.put("name", entityName);
properties.put("factsheet", facts.toString());
// put into DB
entityNodeID = dbInterface.createNode(properties);
// connect to CLASSIFYENTITY node (require later classification)
dbInterface.createRelationship(db.getRefNodeClassifyEntity(), entityNodeID, "", new
JSONObject());
// index entity
ArrayList<String> indexExpressions = tokenizer.composeSubExpressions(entityName,
locale, true, false);
db.addNodeToIndex("entity", "name", indexExpressions, entityNodeID);
}
else {
entityNodeID = indexHit.getJSONObject(0).getString("node");
// compare stored fact sheet and current fact sheet
try {
JSONObject storedFactSheet = new JSONObject(indexHit.getJSONObject(0).getJSONObject("data").getString("factsheet"));
boolean factSheetUpdated = false;
for (Iterator<?> newKey = facts.keys(); newKey.hasNext(); ) {
String newKeyName = (String)newKey.next();
if (storedFactSheet.has(newKeyName)) {
// key exists, compare values
JSONArray storedValues = storedFactSheet.getJSONArray(newKeyName);
JSONArray newValues = facts.getJSONArray(newKeyName);
for (int j = 0; j != newValues.length(); j++) {
Object newValue = newValues.get(j);
boolean hasValue = false;
for (int k = 0; k != storedValues.length(); k++)
if (storedValues.get(k).equals(newValue)) {
hasValue = true;
break ;
}
if (!hasValue) {
storedValues.put(newValue);
factSheetUpdated = true;
}
}
}
else {
// key does not exist, add it
storedFactSheet.put(newKeyName, facts.get(newKeyName));
factSheetUpdated = true;
}
}
if (factSheetUpdated)
// fact sheet changed, store in db
dbInterface.setNodeProperty(entityNodeID, "factsheet", storedFactSheet.toString());
} catch (JSONException e) {
}
}
result.add(entityNodeID);
// connect entity to comments
for (int j = 0; j != commentIDs.size(); j++)
db.createNewRelationship(commentIDs.get(j), entityNodeID, "APPLIES_TO", new JSONObject(), false);
} catch (JSONException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (ServerErrorResponse e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
return result;
}
/**
* Store categories in database. Connect categories to entities and comments.
* @param categories
* @param commentIDs
* @param locale
*/
private void addCategories(ArrayList<String> categories, ArrayList<String> commentIDs,
ArrayList<String> entityIDs, Locale locale) {
for (int i = 0; i != categories.size(); i++) {
String categoryName = categories.get(i);
if (categoryName.length() > 0) {
// check if category exists
try {
String categoryNodeID;
JSONArray indexHit;
// make sure category name is not an entity name too
indexHit = db.queryNodeIndex("entity", "name", categoryName, locale);
if (indexHit.length() == 0) {
indexHit = db.queryNodeIndex("category", "name", categoryName, locale);
if (indexHit.length() == 0) {
// new category
categoryNodeID = dbInterface.createNode(new JSONObject().put("name", categoryName));
dbInterface.createRelationship(db.getRootNode(), categoryNodeID, "CATEGORY_BC",
new JSONObject());
// index category
ArrayList<String> indexExpressions = tokenizer.composeSubExpressions(categoryName, locale, true, true);
db.addNodeToIndex("category", "name", indexExpressions, categoryNodeID);
}
else
categoryNodeID = indexHit.getJSONObject(0).getString("node");
// connect category to comments
// TODO: if there were more index hits, why not connect to all of them?
for (int j = 0; j != commentIDs.size(); j++)
db.createNewRelationship(commentIDs.get(j), categoryNodeID, "APPLIES_TO", new
JSONObject(), false);
// connect category to entities
for (int j = 0; j != entityIDs.size(); j++)
db.createNewRelationship(entityIDs.get(j), categoryNodeID, "BELONGS_TO", new
JSONObject(), false);
}
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (JSONException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (ServerErrorResponse e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
}
private void addFactSheet(JSONObject facts, ArrayList<String> commentIDs, ArrayList<String>
entityIDs, Locale locale) {
for (Iterator<?> i = facts.keys(); i.hasNext(); ) {
String key = (String)i.next();
try {
String categoryKeyNodeID;
JSONArray indexHit = db.queryNodeIndex("category", "name", key, locale);
if (indexHit.length() == 0) {
// new category
categoryKeyNodeID = dbInterface.createNode(new JSONObject().put("name", key));
dbInterface.createRelationship(db.getRootNode(), categoryKeyNodeID, "CATEGORY_FSK",
new JSONObject());
// index category
ArrayList<String> indexExpressions = tokenizer.composeSubExpressions(key, locale,
true, false);
db.addNodeToIndex("category", "name", indexExpressions, categoryKeyNodeID);
}
else {
categoryKeyNodeID = indexHit.getJSONObject(0).getString("node");
}
// create sub-categories
JSONArray values = facts.getJSONArray(key);
for (int j = 0; j != values.length(); j++) {
String value = values.getString(j);
String valueNodeID;
indexHit = db.queryNodeIndex("category", "name", value, locale);
if (indexHit.length() == 0) {
// new category
valueNodeID = dbInterface.createNode(new JSONObject().put("name", value));
dbInterface.createRelationship(db.getRootNode(), valueNodeID, "CATEGORY_FSV", new
JSONObject());
dbInterface.createRelationship(categoryKeyNodeID, valueNodeID, "VALUE", new JSONObject());
// index category
ArrayList<String> indexExpressions = tokenizer.composeSubExpressions(value, locale, true, false);
db.addNodeToIndex("category", "name", indexExpressions, valueNodeID);
}
else {
valueNodeID = indexHit.getJSONObject(0).getString("node");
db.createNewRelationship(categoryKeyNodeID, valueNodeID, "VALUE", new JSONObject(), false);
}
// connect subcategory to entities
for (int k = 0; k != entityIDs.size(); k++)
db.createNewRelationship(entityIDs.get(k), valueNodeID, "CATEGORY_PV", new JSONObject(), false);
// connect subcategory to comments
for (int k = 0; k != commentIDs.size(); k++)
db.createNewRelationship(commentIDs.get(k), valueNodeID, "APPLIES_TO", new JSONObject(), false);
}
// connect main category to entities
for (int k = 0; k != entityIDs.size(); k++)
db.createNewRelationship(entityIDs.get(k), categoryKeyNodeID, "CATEGORY_PK", new JSONObject(), false);
// connect main category to comments
for (int k = 0; k != commentIDs.size(); k++)
db.createNewRelationship(commentIDs.get(k), categoryKeyNodeID, "APPLIES_TO", new JSONObject(), false);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (JSONException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (ServerErrorResponse e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
}
PofaStopWords.java
package com.wcs.pofa;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.HashSet;
import java.util.Hashtable;
import java.util.Locale;
import java.util.Scanner;
public class PofaStopWords {
private Hashtable<String, HashSet<String>> stopWords;
public PofaStopWords() {
stopWords = new Hashtable<String, HashSet<String>>();
// load stop words
FileInputStream fis;
Scanner scanner;
//FIXME: don't use hardcoded filenames
try {
fis = new FileInputStream("..\\stopwords_hu.txt");
scanner = new Scanner(fis, "UTF-8");
HashSet<String> words = new HashSet<String>();
while (scanner.hasNextLine()) {
String word = scanner.nextLine();
words.add(word);
}
scanner.close();
fis.close();
stopWords.put("hu", words);
fis = new FileInputStream("..\\stopwords_en.txt");
scanner = new Scanner(fis, "UTF-8");
words = new HashSet<String>();
while (scanner.hasNextLine()) {
String word = scanner.nextLine();
words.add(word);
}
scanner.close();
fis.close();
stopWords.put("en", words);
} catch (FileNotFoundException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
public HashSet<String> getStopWords(Locale locale) {
return stopWords.get(locale.toString());
}
}
PofaTokenizer.java
package com.wcs.pofa.tokenizer;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Locale;
import org.json.JSONArray;
import org.json.JSONObject;
import org.tartarus.snowball.SnowballStemmer;
import com.wcs.pofa.Pair;
import com.wcs.pofa.PofaStopWords;
import com.wcs.pofa.db.Neo4jRelationship;
import com.wcs.pofa.db.Neo4jDBInterface.Neo4jTraverseOrder;
import com.wcs.pofa.db.Neo4jDBInterface.Neo4jTraverseResult;
import com.wcs.pofa.db.Neo4jDBInterface.Neo4jTraverseReturnFilter;
import com.wcs.pofa.db.Neo4jDBInterface.Neo4jTraverseUniqueness;
import com.wcs.pofa.db.Neo4jRelationship.Neo4jRelationshipDirection;
import com.wcs.pofa.slicer.PofaSlicer;
import de.spieleck.app.cngram.NGramProfiles;
/**
* Tokenize a document
*/
public class PofaTokenizer {
private NGramProfiles nps;
private NGramProfiles.Ranker ranker;
private PofaStopWords stopWords;
private final static String specialWordSeparators =
"(\"|'|\\(|\\)|\\[|\\]|\\{|\\}|\\<|\\>|\\?|\\.|!|,|-|:|;)";
public PofaTokenizer() {
System.out.println("Initializing " + this.getClass().getName() + "...");
try {
this.nps = new NGramProfiles();
this.ranker = nps.getRanker();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
this.stopWords = new PofaStopWords();
System.out.println(this.getClass().getName() + " initialized.");
}
/**
* Tokenize input document.
* @param document Document to tokenize.
* @param locale locale to use for lower-case converting (if <b>null</b> no lower case conversion is done)
* @return list of tokens
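* Example (illustrative input, not from the original spec): tokenize("Jó telefon, de drága.",
* new Locale("hu")) returns ["jó", "telefon", ",", "de", "drága", "."].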
*/
public static ArrayList<String> tokenize(String document, Locale locale) {
// insert missing whitespaces at sentence ends
document = correctWhitespaces(document, locale);
ArrayList<String> words;
ArrayList<String> tokens = new ArrayList<String>();
// split document at word boundaries
words = splitDocument(document, locale);
// split words at special characters
for (int i = 0; i != words.size(); i++)
tokens.addAll(splitWord(words.get(i)));
// merge special tokens
ArrayList<String> result = mergeTokens(tokens);
return result;
}
/**
* Inserts missing whitespaces and removes unnecessary ones using a language pattern.
* TODO: Requires some kind of heuristics
* @param document
* @return
*/
public static String correctWhitespaces(String document, Locale locale) {
//TODO: write this
return document;
}
/**
* Splits up input at word boundaries.
* @param document document to split
* @param locale locale to use for lower-case converting (if <b>null</b> no lower case conversion is done)
* @return
*/
public static ArrayList<String> splitDocument(String document, Locale locale) {
//--- split at whitespaces ---
String words[] = document.split("\\s+");
ArrayList<String> result = new ArrayList<String>();
if (null == locale)
for (int i = 0; i != words.length; i++)
result.add(words[i]);
else
for (int i = 0; i != words.length; i++)
result.add(words[i].toLowerCase(locale));
return result;
}
/**
* Split up a word to multiple tokens if it begins or ends with special chars.
* @param word
* @return
*/
public static ArrayList<String> splitWord(String word) {
//--- split each word from prefixes and/or suffixes ---
//--- split up incoming word ---
final String splitter = specialWordSeparators;
word = word.replaceAll("(?=" + splitter + ")", " ");
word = word.replaceAll("(?<=" + splitter + ")", " ");
String[] splitted = word.split("\\s+");
//--- find prefixes and suffixes ---
int prefix = 0;
int suffix = splitted.length;
for (int i = 0; i != splitted.length; i++)
if (splitted[i].length() > 0 && !splitted[i].matches(splitter)) {
prefix = i - 1;
break ;
}
for (int i = splitted.length - 1; i > prefix; i--)
if (splitted[i].length() > 0 && !splitted[i].matches(splitter)) {
suffix = i + 1;
break ;
}
//--- merge internal parts (keep only prefixes and suffixes separately) ---
StringBuilder sb = new StringBuilder();
for (int i = prefix + 1; i != suffix; i++)
sb.append(splitted[i]);
ArrayList<String> result = new ArrayList<String>();
for (int i = 0; i <= prefix; i++)
if (splitted[i].length() > 0)
result.add(splitted[i]);
result.add(sb.toString());
for (int i = suffix; i < splitted.length; i++)
if (splitted[i].length() > 0)
result.add(splitted[i]);
return result;
}
/**
* Merges consequent tokens if they have a common meaning.
* TODO: Requires some kind of heuristics (common abbreviations, etc.)
* @param tokens
* @return
*/
public static ArrayList<String> mergeTokens(ArrayList<String> tokens) {
// merge same 1-length tokens
int first = -1;
int last = -1;
String merged = "";
for (int i = tokens.size() - 2; i >= 0; i--) {
if (tokens.get(i).length() == 1 && tokens.get(i).equals(tokens.get(i + 1))) {
if (last == -1) {
last = i + 1;
merged = tokens.get(i + 1);
}
first = i;
merged += tokens.get(i);
}
else if (first >= 0) {
for (int j = last; j >= first; j--)
tokens.remove(j);
tokens.add(first, merged);
first = -1;
last = -1;
merged = "";
}
}
// flush a pending merge that reaches the start of the token list
if (first >= 0) {
for (int j = last; j >= first; j--)
tokens.remove(j);
tokens.add(first, merged);
}
return tokens;
}
/**
* Tells if a word contains only special word separator characters.
* @param token word to analyze
* @return true if word consists only of separator chars
*/
public static boolean isSeparator(String token) {
return token.matches(specialWordSeparators + "+");
}
/**
* Removes all separator tokens from the list of tokens.
* @param tokenList list of tokens
* @return cleaned list of tokens
*/
public static ArrayList<String> cleanSeparators(ArrayList<String> tokenList, boolean onlyFromEdges) {
ArrayList<String> result = new ArrayList<String>();
if (onlyFromEdges) {
int first = -1;
int last = -1;
// find first non-separator
for (int i = 0; i != tokenList.size(); i++)
if (!isSeparator(tokenList.get(i))) {
first = i;
break ;
}
if (first != -1) {
// find last non-separator
for (int i = tokenList.size() - 1; i >= first; i--)
if (!isSeparator(tokenList.get(i))) {
last = i;
break ;
}
// cut middle
for (int i = first; i <= last; i++)
result.add(tokenList.get(i));
}
}
else
for (Iterator<String> i = tokenList.iterator(); i.hasNext(); ) {
String word = i.next();
if (!isSeparator(word))
result.add(word);
}
return result;
}
/**
* Merge the list of tokens to a string
* @param tokenList token list
* @return single string
*/
public static String tokenListToExpression(ArrayList<String> tokenList) {
StringBuilder result = new StringBuilder();
if (tokenList.size() != 0) {
result.append(tokenList.get(0));
for (int i = 1; i != tokenList.size(); i++)
result.append(" " + tokenList.get(i));
}
return result.toString();
}
/**
* Stem all tokens using the specified language's stemmer.
* @param language
* @param tokens
* @return
* @throws ClassNotFoundException
* @throws InstantiationException
* @throws IllegalAccessException
*/
public static ArrayList<String> stemTokens(ArrayList<String> tokens, Locale language)
throws ClassNotFoundException, InstantiationException, IllegalAccessException {
Class<?> stemClass = Class.forName("org.tartarus.snowball.ext." +
language.getDisplayLanguage(Locale.ENGLISH).toLowerCase(Locale.ENGLISH) +
"Stemmer");
SnowballStemmer stemmer = (SnowballStemmer) stemClass.newInstance();
ArrayList<String> result = new ArrayList<String>();
for (int i = 0; i != tokens.size(); i++) {
stemmer.setCurrent(tokens.get(i));
stemmer.stem();
result.add(stemmer.getCurrent());
}
return result;
}
/**
* Stem all expressions using the specified language's stemmer.
* @param language language to use for stemming
* @param expressions list of expressions, where expressions contain space separated words
* @return
* @throws ClassNotFoundException
* @throws InstantiationException
* @throws IllegalAccessException
*/
public static ArrayList<String> stemExpressions(ArrayList<String> expressions, Locale
language) throws ClassNotFoundException, InstantiationException, IllegalAccessException {
Class<?> stemClass = Class.forName("org.tartarus.snowball.ext." +
language.getDisplayLanguage(Locale.ENGLISH).toLowerCase(Locale.ENGLISH) +
"Stemmer" );
SnowballStemmer stemmer = (SnowballStemmer) stemClass.newInstance();
ArrayList<String> result = new ArrayList<String>();
for (int i = 0; i != expressions.size(); i++) {
String[] tokens = expressions.get(i).split(" ");
StringBuilder stemmedExpression = new StringBuilder();
for (int j = 0; j != tokens.length; j++) {
stemmer.setCurrent(tokens[j]);
stemmer.stem();
if (0 == j)
stemmedExpression.append(stemmer.getCurrent());
else
stemmedExpression.append(" " + stemmer.getCurrent());
}
result.add(stemmedExpression.toString());
}
return result;
}
/**
* Detect natural language.
*/
public Locale languageDetect(String document) {
ranker.reset();
ranker.account(document);
NGramProfiles.RankResult res = ranker.getRankResult();
return new Locale(res.getName(0));
}
/**
* Splits the input document at separator chars. In-word separators are not considered.
* The result list won't contain any separators that were used as splitters.
* @param document document to split
* @param locale locale to use for tokenization
* @return
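* Example (illustrative): splitAtSeparators("nokia, samsung galaxy", new Locale("hu"))
* yields ["nokia", "samsung galaxy"].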
*/
public ArrayList<String> splitAtSeparators(String document, Locale locale) {
ArrayList<String> result = new ArrayList<String>();
ArrayList<String> tokens = tokenize(document, locale);
StringBuilder expression = new StringBuilder();
for (Iterator<String> i = tokens.iterator(); i.hasNext(); ) {
String token = i.next();
if (isSeparator(token)) {
if (expression.length() != 0)
result.add(expression.toString());
expression = new StringBuilder();
}
else {
if (expression.length() == 0)
expression.append(token);
else
expression.append(" " + token);
}
}
if (expression.length() != 0)
result.add(expression.toString());
return result;
}
/**
* Create all multi-word subphrases from supplied expression.
* @param expression initial expression to process
* @param locale language locale to use for tokenization
* @param removeSeparators if true separator tokens will be removed from result
* @param removeStopWords if true 1 length expressions won't contain stopwords
* @return List of all sub-expressions.
*/
public ArrayList<String> composeSubExpressions(String expression, Locale locale, boolean
removeSeparators, boolean removeStopWords) {
ArrayList<String> result = tokenize(expression, locale);
if (removeSeparators)
result = cleanSeparators(result, false);
if (removeStopWords)
return composer(result, locale, 1, 0);
else
return composer(result, null, 1, 0);
}
/**
* Create all multi-word subphrases from supplied expression in the specified range.
* @param expression initial expression to process
* @param locale language locale to use for tokenization
* @param minLength minimum expression length (number of words)
* @param maxLength maximum expression length (if 0, all possible lengths will be generated)
* @param removeSeparators if true separator tokens will be removed from result
* @param removeStopWords if true 1 length expressions won't contain stopwords
* @return List of all sub-expressions.
*/
public ArrayList<String> composeSubExpressions(String expression, Locale locale, int
minLength, int maxLength, boolean removeSeparators, boolean removeStopWords) {
ArrayList<String> result = tokenize(expression, locale);
if (removeSeparators)
result = cleanSeparators(result, false);
if (removeStopWords)
return composer(result, locale, minLength, maxLength);
else
return composer(result, null, minLength, maxLength);
}
/**
* Create all multi-word subphrases from all supplied expressions.
* @param expressions initial expression to process
* @param locale language locale to use for tokenization
* @param removeSeparators if true separator tokens will be removed from result
* @return List of all sub-expressions.
*/
public ArrayList<String> composeSubExpressions(ArrayList<String> expressions, Locale
locale, boolean removeSeparators, boolean removeStopWords) {
ArrayList<String> result = new ArrayList<String>();
for (Iterator<String> i = expressions.iterator(); i.hasNext(); ) {
String expression = i.next();
ArrayList<String> subResult = tokenize(expression, locale);
if (removeSeparators)
subResult = cleanSeparators(subResult, false);
if (removeStopWords)
result.addAll(composer(subResult, locale, 1, 0));
else
result.addAll(composer(subResult, null, 1, 0));
}
return result;
}
/**
* Create multi-word subphrases from all supplied expressions in the specified range.
* @param expressions initial expression to process
* @param locale language locale to use for tokenization
* @param minLength minimum expression length (number of words)
* @param maxLength maximum expression length (if 0, all possible lengths will be generated)
* @param removeSeparators if true separator tokens will be removed from result
* @param removeStopWords if true 1 length expressions won't contain stopwords
* @return List of all sub-expressions.
*/
public ArrayList<String> composeSubExpressions(ArrayList<String> expressions, Locale
locale, int minLength, int maxLength, boolean removeSeparators, boolean removeStopWords) {
ArrayList<String> result = new ArrayList<String>();
for (Iterator<String> i = expressions.iterator(); i.hasNext(); ) {
String expression = i.next();
ArrayList<String> subResult = tokenize(expression, locale);
if (removeSeparators)
subResult = cleanSeparators(subResult, false);
if (removeStopWords)
result.addAll(composer(subResult, locale, minLength, maxLength));
else
result.addAll(composer(subResult, null, minLength, maxLength));
}
return result;
}
/**
* Generate all subexpressions from a list of single words (order of words will be maintained).
* Single word expressions will be generated without stopwords.<br>
* E.g.: <b>[this, is, a, test]</b> will generate:
* <li> "this is a test"
* <li> "this is a"
* <li> "is a test"
* <li> "this is"
* <li> "is a"
* <li> "a test"
* <li> "test"
* @param words list of words to use for composing
* @param locale language to use for stopword removal (if <b>null</b> no stopword removal is done)
* @param minLength the desired minimum number of words in output expressions
* @param maxLength the desired maximum number of words in output expressions (if set to <b>zero</b> it creates all possible lengths)
* @return
*/
private ArrayList<String> composer(ArrayList<String> words, Locale locale, int minLength,
int maxLength) {
if (maxLength > words.size() || maxLength <= 0) maxLength = words.size();
int minMultiWordLength = Math.max(2, minLength);
// compose all multi-word expressions
ArrayList<String> result = new ArrayList<String>();
for (int exprLen = maxLength; exprLen >= minMultiWordLength; exprLen--)
for (int startWord = 0; startWord != words.size() - exprLen + 1; startWord++) {
String fixedLenExpression = "";
for (int i = 0; i != exprLen; i++) {
if (0 == i)
fixedLenExpression = words.get(startWord + i);
else
fixedLenExpression += " " + words.get(startWord + i);
}
result.add(fixedLenExpression);
}
if (minLength <= 1) {
// add single words
if (null == locale) {
result.addAll(words);
}
else {
// do not add stopwords
HashSet<String> stops = stopWords.getStopWords(locale);
if (null == stops)
result.addAll(words);
else
for (int i = 0; i != words.size(); i++)
if (!stops.contains(words.get(i)))
result.add(words.get(i));
}
}
return result;
}
/**
* Returns the number of words in a string.
* NOTE: actually the number of spaces is counted (+1 for non-empty strings) so make sure:
* <li>all whitespaces are spaces,
* <li>there are no multiple spaces between words,
* <li>there are no starting or trailing spaces.
* @param sourceString
* @return
*/
public static int countWords(String sourceString) {
if (null == sourceString || sourceString.isEmpty())
return 0;
int count = 1;
final char [] chars = sourceString.toCharArray();
for (int i = 0; i < chars.length; i++)
if (chars[i] == ' ')
count++;
return count;
}
public static void main(String argv[]) throws ClassNotFoundException,
InstantiationException, IllegalAccessException, IOException {
PofaTokenizer token = new PofaTokenizer();
Locale locale = new Locale("hu");
String document = "Ez itt egy teszt dokumentum, ami helyesen van írva ,de van benne hiba is :( és néhány hiányzó szóköz:pl.itt...";
ArrayList<String> tokens = PofaTokenizer.tokenize(document, locale);
for (int i = 0; i != tokens.size(); i++ ) {
System.out.println(" \"" + tokens.get(i) + "\"");
}
/*
ArrayList<String> words = cleanSeparators(tokenize("Samsung <notebook > �kezetes cucc,
laptop, netbook", new Locale("hu")));
for (Iterator<String> i = words.iterator(); i.hasNext(); ) {
String word = i.next();
System.out.println(word + ": " + isSeparator(word));
}
*/
/*
PofaTokenizer token = new PofaTokenizer();
FileInputStream fis = new FileInputStream("c:\\users\\mikki\\workspace\\pofa\\texts.txt");
Scanner scanner = new Scanner(fis, "UTF-8");
FileOutputStream fos = new FileOutputStream("c:\\users\\mikki\\workspace\\pofa\\textsstemmed-en-800.txt", true);
OutputStreamWriter out = new OutputStreamWriter(fos, "UTF-8");
while (scanner.hasNextLine()) {
String document = scanner.nextLine();
long t1 = System.currentTimeMillis();
String lng = token.languageDetect(document).getLanguage();
long t2 = System.currentTimeMillis();
System.out.println(lng + " time: " + (t2 - t1));
ArrayList<String> tokens = token.tokenize(document);
ArrayList<String> stems = token.stemTokens("english", tokens);
for (int i = 0; i != tokens.size(); i++) {
if (i != 0)
out.write(" ");
out.write(stems.get(i));
}
out.write("\n");
System.out.println(document);
}
out.close();
fos.close();
scanner.close();
fis.close();
*/
}
}
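A short usage sketch for the sub-expression composer (illustrative; the input string is an assumption, and the expected output follows the composer javadoc above):

PofaTokenizer tokenizer = new PofaTokenizer();
ArrayList<String> subs = tokenizer.composeSubExpressions("nokia 6700 classic", new Locale("hu"), true, false);
// longest expressions first: "nokia 6700 classic", "nokia 6700", "6700 classic",
// then the single words "nokia", "6700", "classic" (no stopword removal requested)
System.out.println(subs);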
PofaUtils.java
package com.wcs.pofa;
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
public class PofaUtils {
private static MessageDigest digest;
static {
try {
digest = MessageDigest.getInstance("MD5");
} catch (NoSuchAlgorithmException e) {
e.printStackTrace();
}
}
//TODO: bigint -> hex conversion omits initial zeroes
public static BigInteger getMD5(String data) {
digest.reset();
byte[] bytes = data.getBytes();
digest.update(bytes, 0, bytes.length);
return (new BigInteger(1, digest.digest()));
}
public PofaUtils() {
}
}
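A possible fix for the TODO above (an illustrative sketch, not part of the delivered source) is to format the digest as a zero-padded 32-digit hex string instead of calling toString(16):

public static String getMD5Hex(String data) {
// String.format keeps leading zeroes, so hashes starting with zero bytes stay 32 characters long
return String.format("%032x", getMD5(data));
}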