Distributed Data Mining System in Java
Group Members: 王春笙, 林俊甫, 王慧芬

Overview of Project
• Project participants
  – 王春笙, 林俊甫, 王慧芬

Project Programming Tasks
• D92725002 林俊甫
  – Polling and reply multicast between client and server
  – Client/server socket programming
  – Client dynamic join and leave mechanism
  – Multi-thread programming
  – Synchronization mechanism
  – Data chunk maintenance and dispatching mechanism
  – Client/server communication link control

Project Programming Tasks (cont’d)
  – Client failure handling
    • Reassign the backup server if the failed client is the backup
    • Restore the failed client's work (with 王春笙)
  – Server failure handling
    • Backup server designation mechanism and logic design
  – RMI mechanism (with 王春笙)
  – Basic GUI

System Infrastructure
• System diagram
  [Diagram: clients (Client, Client, Client, ...) connected over a LAN to the Server/Coordinator. Mining data chunks flow to the clients; mining results flow back to the server.]

Basic Operation
• Server: listens for the multicast group query and replies; forks a thread to handle each client connection; waits for the client's processed result, then orders the client to get another file chunk.
• Client: once the server is found, connects to it; receives the server's instruction and invokes RMI to get the file chunk.
• Message sequence (in time order):
  1. Client: polling on port 4444, group 230.0.0.1, "@: who is server?"
  2. Server: "Servername: I am the server"
  3. Client: connect to <servername, port 4445>
  4. Server: "Client do: filechunk#"
  5. Client: "ok"
  6. Server: "Client do: next filechunk#"
  7, 8, ...: and so on

Port Assignment
• Port 4444: multicast
• Port 4445: TCP/IP socket connection
• Port 4446: RMI services

Finding A Server
• Once a client starts up, it queries periodically over the multicast group 230.0.0.1, port 4444, by sending the 1-byte string “@” to locate the server host.
• Once a server starts up, it forks a thread to deal with the query.
• Client steps:
  1. Client query: who is the server now?
  2. Listen for the server response every 3 sec.
  3. Connect to the server on port 4445.
  4. Use RMI to get a file chunk from the server.
  5. Process data mining and return the result to the server.
  6. Server failure detected -> if I am the backup, go to the backup server procedure; otherwise go to step 1.

File Dispatching
• The server maintains a file chunk pool (FileChunks). Chunk states: -1: empty, 0: available, 1: using, 2: used
• The server finds an available file chunk for a client, sets it to 1, and orders the client to get this file chunk by RMI. The file chunk is updated to 2 when the client returns the result.
• Recovery: when the server detects that a client's link is broken, it restores the file chunk allocated to that client to 0.
• The file chunk class is declared Serializable for RMI message passing to the backup server.
• The file chunk class uses synchronization for concurrency control.

Backup Server Selection
• The server maintains and assigns a unique id for each individual client.
• The unique id is incremented as a serial number.
• The client with the smallest id is assigned as the backup server.
• When a client fails, the server checks whether it was the backup server and, if so, restarts the selection process.

Nodes Maintenance
• The server maintains the connected clients' records in an ArrayList.
• The ArrayList is composed of objects of class Nodes, which records each client's detailed information.
  [Diagram: ArrayList "ht" (key, value); each Nodes entry holds Id, Address, Port, Work on, Status.]

RMI Services
• The RMI services are written as an independent program because both the server and the client (when acting as backup server) use them.
• The RMI services provide:
  – Backing up server data to the backup server
  – Getting a file chunk from the server
  – Returning the mining result to the server
  – Receiving nodes information from the server

Client Failure
• Server's actions:
  – Recovery
  – Reassignment
  – Redo the backup server selection if the failed node is the backup
• Client's actions:
  – Do nothing, unless told by the server to act as backup

Server Failure
• Server S and Client A (in time order):
  – The server runs the backup selection and chooses A as backup ("A: Do backup").
  1. A is told by S that it is the backup. A invokes RMI to get all server data (RMI get file, RMI reply).
  2. A periodically gets server services and file chunk data, while the normal "Client do #" / "do reply" exchange continues.
  3. The server crashes. A detects the broken communication link and starts the ServerAction class.
  4. A creates a server socket at 4445, forks a thread to listen to queries, and waits for connections.
• Client B (in time order):
  1. B receives instructions as discussed before.
  2. B detects the broken communication link and multicasts the query "who is the server now?" (B polling "@: who is server?"; A replies "I am the server"; B connects to A:4445).
  3. B knows A is the backup and reconnects to A.

Server/Client Life Cycle
  [Life-cycle diagram. Labels: Server; Client; evolve; Normal/Abnormal Termination.]

Project Programming Tasks
• D91725001 王春笙
  – Web log file preprocessing and separating
  – Web page traversal sequence parsing
  – Page item transferring and mapping
  – Web page sequential pattern mining
  – Mining results maintenance
  – RMI mining results transfer
  – Mining results lookup and display

Project Programming Tasks (cont’d)
  – Backup mechanism
    • A separate thread backs up server files and memory data
    • Restore the failed client's work (with 林俊甫)
  – RMI mechanism (with 林俊甫)
  – GUI global state refresh
  – System integration
    • Testing and debugging

Web Log File Format
• User IP
• Date
• Time
• Web page URL

Web File Preprocessing
• Select *.htm and *.html pages
• First sort by user ID
• Second sort by time
• Page sequences are separated by time
  – more than 30 seconds

Chunk Data Files
• Part*.ppp
    6023 2 1 1 2 8
    6024 1 1 206
    6025 7 1 1 1 1 1 1 1 2 5 17 18 19 20 11
    6026 3 1 1 1 144 145 338
    6027 2 1 1 2 9
    6028 3 1 1 1 2 8 3
• Items.ppp
    /~visualdep/htm/p5b.htm 168
    /~businessdep/student/picture.html 169
    /~comedu/inde.htm 170
    /~account/91tuition.htm 171
    /~stuaffair/life/procedure-17.htm 172
    /~stuaffair/life/procedure-25.htm 173

Apriori algorithm
  1: find all L1
  2: generate C2 from L1
  3: count C2 and find all L2
  4: k = 3
  5: generate and prune Ck from Lk-1
  6: count Ck and find all Lk
  7: if Lk is not empty then k++, goto 5

Apriori algorithm (cont’d)
• Join phase: s1 joins s2 if s1 (drop first) = s2 (drop last)
  – s1 = {a, b}, s2 = {b, a}: s1 join s2 => {a, b, a}
• Prune phase: delete a k-candidate if any of its (k-1)-subsequences is not large
• C and L are stored in a hash data structure

Mining Result Display
• Client frequent patterns
  – Web page ID
  – Support
  – Saved as *.pppl files
• Client frequent patterns
  – Web page ID
  – Support
  – Web page name

Backup Mechanism
• When the backup server is selected, that client starts a backup thread
• The backup thread loops every 0.5 second
• RMI data transfer
  – Chunk data files (part*.ppp, items.ppp)
  – Client information
  – File chunk information
    • determine MaxID and set “in use” to “available”
  – Frequent patterns information

System Integration
• Java class integration
  – Server component
  – Client component
  – Data mining component
  – GUI component
• Testing
• Debugging

Project Programming Tasks
• D92725001 王慧芬
  – Graphical User Interface

GUI
• Since this system performs data mining in a distributed way, its GUI provides four panels:
  – A system console
  – A result window
  – A connection table
  – A graphical network configuration

GUI (cont’d)
• The system console shows how the system proceeds

GUI (cont’d)
• The result window displays the progress and results of data mining

GUI (cont’d)
• A connection table lists the connection information of all on-line clients

GUI (cont’d)
• A connection table consists of 5 fields
  – NO: client-server connection id
  – IP address: client's IP address
  – Port: client's port number
  – Status: connection status, one of
    • 0: offline
    • 1: online
    • 2: file transfer from server to client
    • 3: client is doing data mining
    • 4: client returns the value back to the server when data mining has finished
    • 5: client is doing the backup and data mining at the same time
  – # chunk works on: during data mining and backup, the chunk number that the connection is working on

GUI (cont’d)
• A graphical network configuration follows the connection table to depict the dynamic network configuration

GUI (cont’d)
• In the dynamic network configuration, different client GIFs express the status:
  – Offline
  – Data mining
  – Backup and mining
  – On-line

GUI interface
• mw.showMsg()
  – provided by the GUI for the server/client module to show console messages
• mw.showResultString()
  – provided by the GUI for the server/client module to show the results of data mining
• Connection table
  – modified by the server/client module with connection information
  – read by the GUI every 0.01 second to depict the dynamic network configuration

GUI design
• Java Swing is used to generate labels, text, scrollbars, tables, etc.
• Java AWT 2D painting is used to animate the connection lines in the dynamic configuration panel
• ‘Photo Impact’ and ‘GIF animator’ are used to generate the node icons
• EasyRGB is used to tune the color harmonies

GUI design (cont’d)
• A new thread is forked from the GUI task to animate the connection lines in the dynamic configuration panel
  – it reads the table every 0.03 second and shows the connection status with a moving ball
  [Diagram: GUI task steps: generate system console; generate result panel; generate connection table; generate connection table animation.]

Installation
• Example: running one server and two clients
  – Create three folders: Ser (Server), Cli (Client 1), Cli2 (Client 2)
  – Extract the attached archive into the Ser folder; also download weblog10.zip into this folder and extract it
  – Extract the attached archive into the empty Cli and Cli2 folders
  – Open two DOS windows (windows 1 and 2) and change into the Ser folder
  – Open three DOS windows (windows 3, 4, 5); in windows 3 and 4 change into the Cli folder, in window 5 change into the Cli2 folder
  – In window 1, run compile.bat, then run rmi.bat
  – In window 2, run server.bat
  – In window 3, run compile.bat, then run rmi.bat
  – In window 4, run client.bat
  – In window 5, run compile.bat, then run client.bat
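
Sketch: file chunk pool. The File Dispatching slide describes a pool whose chunks move between the states -1, 0, 1 and 2, guarded by synchronization and declared Serializable. The following is a minimal sketch of that state machine, not the project's actual class: the name FileChunkPool and the methods acquire, complete, release and stateOf are hypothetical, and only the state codes come from the slide.

```java
import java.io.Serializable;
import java.util.Arrays;

/** Hypothetical file chunk pool; class and method names are illustrative only. */
class FileChunkPool implements Serializable {
    private static final long serialVersionUID = 1L;

    // State codes as listed on the File Dispatching slide.
    static final int EMPTY = -1, AVAILABLE = 0, USING = 1, USED = 2;

    private final int[] state;

    /** Creates a pool with the given capacity; the first loadedChunks slots hold data. */
    FileChunkPool(int capacity, int loadedChunks) {
        state = new int[capacity];
        Arrays.fill(state, EMPTY);
        for (int i = 0; i < Math.min(capacity, loadedChunks); i++) {
            state[i] = AVAILABLE;
        }
    }

    /** Finds an available chunk, marks it as using, and returns its index, or -1 if none. */
    synchronized int acquire() {
        for (int i = 0; i < state.length; i++) {
            if (state[i] == AVAILABLE) {
                state[i] = USING;
                return i;
            }
        }
        return -1;
    }

    /** Called when the client returns its mining result for the chunk. */
    synchronized void complete(int chunk) {
        if (state[chunk] == USING) state[chunk] = USED;
    }

    /** Recovery: the client's link broke, so its chunk becomes available again. */
    synchronized void release(int chunk) {
        if (state[chunk] == USING) state[chunk] = AVAILABLE;
    }

    synchronized int stateOf(int chunk) {
        return state[chunk];
    }

    public static void main(String[] args) {
        FileChunkPool pool = new FileChunkPool(4, 2);
        int first = pool.acquire();
        int second = pool.acquire();
        pool.complete(first);   // the client returned a result
        pool.release(second);   // the client's link broke
        System.out.println("chunk " + first + " state: " + pool.stateOf(first));
        System.out.println("chunk " + second + " state: " + pool.stateOf(second));
    }
}
```

Marking every method synchronized keeps the find-and-mark step atomic, so two client handler threads cannot be handed the same chunk.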
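
Sketch: backup server selection. The Backup Server Selection and Nodes Maintenance slides say that the server hands out serial ids, keeps client records in an ArrayList, picks the client with the smallest id as backup, and reselects when the backup fails. A minimal sketch under those rules follows. The class NodeRegistry, its method names, the starting id of 1, and the use of -1 for "no backup" or "no chunk" are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical client registry; names and sentinel values are illustrative only. */
class NodeRegistry {

    /** One connected client; the fields follow the Nodes Maintenance slide. */
    static final class Node {
        final int id;
        final String address;
        final int port;
        int workOn = -1;  // chunk number in progress; -1 for none is an assumption
        int status = 1;   // 1 = online in the connection table's status codes

        Node(int id, String address, int port) {
            this.id = id;
            this.address = address;
            this.port = port;
        }
    }

    private final List<Node> nodes = new ArrayList<>();
    private int nextId = 1;     // serial number; starting at 1 is an assumption
    private int backupId = -1;  // -1 means no backup is currently assigned

    /** Registers a client, assigns the next serial id, and picks a backup if none exists. */
    synchronized int join(String address, int port) {
        Node node = new Node(nextId++, address, port);
        nodes.add(node);
        if (backupId == -1) selectBackup();
        return node.id;
    }

    /** Removes a failed or departed client; reselects only if it was the backup. */
    synchronized void leave(int id) {
        nodes.removeIf(n -> n.id == id);
        if (id == backupId) selectBackup();
    }

    synchronized int backupId() {
        return backupId;
    }

    /** The client with the smallest id becomes the backup server. */
    private void selectBackup() {
        backupId = -1;
        for (Node n : nodes) {
            if (backupId == -1 || n.id < backupId) backupId = n.id;
        }
    }

    public static void main(String[] args) {
        NodeRegistry registry = new NodeRegistry();
        int a = registry.join("192.168.0.11", 5001);
        registry.join("192.168.0.12", 5002);
        System.out.println("backup: " + registry.backupId());
        registry.leave(a);
        System.out.println("backup after failure: " + registry.backupId());
    }
}
```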
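
Sketch: splitting page sequences by time. The Web File Preprocessing slide separates a user's page sequences when more than 30 seconds pass between requests. The sketch below shows one way to do that for a single user's hits that are already sorted by time. The class SessionSplitter, the Hit type, and the use of seconds as the time unit are assumptions; the slides do not show the project's own parsing code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical splitter for one user's time-sorted page requests. */
class SessionSplitter {

    /** One log entry reduced to a timestamp (in seconds) and a mapped page id. */
    static final class Hit {
        final long timeSeconds;
        final int pageId;

        Hit(long timeSeconds, int pageId) {
            this.timeSeconds = timeSeconds;
            this.pageId = pageId;
        }
    }

    /** Starts a new sequence whenever the gap to the previous hit exceeds maxGapSeconds. */
    static List<List<Integer>> split(List<Hit> sortedHits, long maxGapSeconds) {
        List<List<Integer>> sequences = new ArrayList<>();
        List<Integer> current = null;
        long previousTime = 0;
        for (Hit hit : sortedHits) {
            if (current == null || hit.timeSeconds - previousTime > maxGapSeconds) {
                current = new ArrayList<>();
                sequences.add(current);
            }
            current.add(hit.pageId);
            previousTime = hit.timeSeconds;
        }
        return sequences;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
                new Hit(0, 2), new Hit(10, 8), new Hit(50, 206), new Hit(60, 9));
        // The 40-second gap between the second and third hit starts a new sequence.
        System.out.println(split(hits, 30));
    }
}
```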
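
Sketch: Apriori join and prune for sequences. The Apriori slides join s1 and s2 when s1 without its first item equals s2 without its last item, then prune any candidate that has a shorter subsequence that is not large. The sketch below implements those two phases on sequences of page ids stored in hash sets. It assumes that a (k-1)-subsequence means the candidate with any single item removed; the slides do not say whether only contiguous subsequences count. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical candidate generation for sequential patterns. */
class SequenceCandidates {

    /** Returns s1 joined with s2, or null when s1 (drop first) differs from s2 (drop last). */
    static List<Integer> join(List<Integer> s1, List<Integer> s2) {
        if (s1.isEmpty() || s1.size() != s2.size()) return null;
        if (!s1.subList(1, s1.size()).equals(s2.subList(0, s2.size() - 1))) return null;
        List<Integer> joined = new ArrayList<>(s1);
        joined.add(s2.get(s2.size() - 1));
        return joined;
    }

    /** True when every subsequence made by removing one item is in the large set. */
    static boolean survivesPrune(List<Integer> candidate, Set<List<Integer>> large) {
        for (int i = 0; i < candidate.size(); i++) {
            List<Integer> sub = new ArrayList<>(candidate);
            sub.remove(i); // int argument: removes the item at index i
            if (!large.contains(sub)) return false;
        }
        return true;
    }

    /** Builds the candidate set Ck from the large set Lk-1 by joining, then pruning. */
    static Set<List<Integer>> generate(Set<List<Integer>> large) {
        Set<List<Integer>> candidates = new HashSet<>();
        for (List<Integer> s1 : large) {
            for (List<Integer> s2 : large) {
                List<Integer> joined = join(s1, s2);
                if (joined != null && survivesPrune(joined, large)) {
                    candidates.add(joined);
                }
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        Set<List<Integer>> l2 = new HashSet<>();
        l2.add(Arrays.asList(1, 2));
        l2.add(Arrays.asList(2, 1));
        l2.add(Arrays.asList(1, 1));
        System.out.println(generate(l2));
    }
}
```

Keeping the large set in a HashSet makes each prune lookup a constant-time hash probe on average, which matches the slide's note that C and L are stored in a hash data structure.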