Nebula: a cloud-based back end for SETI@home
David P. Anderson, Kevin Luong
Space Sciences Lab, University of California, Berkeley

SETI@home pipeline
● observation
● signal detection
● signal storage
● back-end processing: RFI detection/removal, persistent signal detection
● re-observation

Signal storage
● Using an SQL database (Informix)
● Signal types
  – spike, Gaussian, triplet, pulse, autocorrelation
● Database table hierarchy
  – tape → workunit group → workunit → result → (signal tables)

Pixelized sky position
● HEALPix: Hierarchical Equal-Area isoLatitude Pixelization
● ~51M pixels; the telescope beam covers ~1 pixel

Current back end: NTPCKR
● As signals are added, mark their pixels as "hot"
● Score hot pixels
  – DB-intensive
● Do RFI removal on high-scoring pixels, flag them for re-scoring

Problems with the current back end
● Signal DB is large
  – 5 billion signals, 10 TB
● Informix has limited speed
  – NTPCKR can't keep up with signal arrival
  – > 1 year to score all pixels
● labor-intensive
● non-scalable

Impact on science
● We haven't done scoring/reobservation in 10 years
● We wouldn't find an ET signal if it were there
● We don't have anything to tell volunteers
● We don't have a basis for writing papers

Nebula goals
● Short-term
  – RFI-remove and score all pixels in ~1 day for ~$100
  – stop doing sysadmin, start doing science
    ● e.g. continuous reobservation, experimenting with the scoring algorithm
● Long-term
  – generality: include other signal sources (SERENDIP)
  – provide outside access to scoring, signals, and raw data
● General
  – build expertise in cloud and big-data techniques
  – form relationships with cloud providers, e.g. Amazon

Design decisions
● Use the Amazon cloud (AWS) for the heavy lifting
  – for bursty usage, clouds are cheaper than in-house hardware
● Use flat files and the Unix filesystem
  – NoSQL DB systems don't buy us anything
● Software
  – C++ for compute-intensive stuff (use existing code)
  – Python for the rest

AWS features
● Simple Storage Service (S3): disk storage by the GB/month, accessed over HTTP
● Elastic Compute Cloud (EC2): VM hosting by the hour, various "node types"
● Elastic Block Store (EBS): disk storage by the GB/month, attached (mounted) to one EC2 node

Interfaces to AWS
● Web-based console
● Python APIs
  – Boto3: interface to S3 storage
  – Fabric: interface to EC2 nodes
(diagram: script.py on the local host talks to AWS over HTTP)

Nebula: the basic idea
● Dump the SETI@home database to flat files
● Upload the files to S3
● Split the files by pixel (~80M files)
  – remove RFI and redundant signals in the process
  – do this in parallel on EC2 nodes
● Score the pixels
  – do this in parallel on EC2 nodes

Moving data from Informix to S3
● Informix DB unload: 1-2 days
● Nebula upload script
  – use Unix "split" to make 2 GB chunks
  – upload the chunks in parallel
    ● thread pool / queue approach, 8 threads
  – S3 automatically reassembles the chunks
● Getting close to 1 Gb/s throughput

Pixelization
● Need to:
  – divide the TB-size files into 16M files
  – remove RFI and redundant signals
● Can't do this sequentially
  – a process can only have 1024 open files
  – it would take too long

Hierarchical pixelization
● Level 1
  – split the flat files 512 ways based on pixel
  – convert from ASCII to binary
  – remove redundant signals
● Level 2
  – split the level 1 files 256 ways
  – result: 130K level 2 files
● Level 3
  – split each level 2 file 512 ways
  – remove RFI

Pixelization on EC2
● Create N instances (t2.micro)
● Create a thread per node
● Create a queue of level 1 tasks
● To run a task:
  – get the input file from S3
  – run the pixelize program
  – upload the output files to S3
  – create next-level tasks
● Keep going until all tasks are done
● Kill the instances
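A minimal sketch of the thread/queue pattern described on the "Pixelization on EC2" slide: one worker thread per EC2 node, pulling tasks from a shared queue so the nodes stay busy until all levels are processed. The host list (ec2_hosts), the task list (level1_keys), and the remote helper pixelize_task.py are hypothetical names for illustration, not the actual Nebula code; the sketch also assumes Fabric 2's Connection API for running commands on the nodes.

    # Hypothetical sketch of the pixelization master script; names are illustrative.
    import queue
    import threading
    from fabric import Connection   # assumes Fabric 2.x

    ec2_hosts = ["node-1.example.com", "node-2.example.com"]   # hypothetical EC2 hostnames
    level1_keys = ["chunk_000", "chunk_001"]                   # hypothetical S3 keys of level-1 inputs

    task_queue = queue.Queue()
    for key in level1_keys:
        task_queue.put((1, key))          # (level, S3 key of input file)

    def worker(host):
        """One thread per EC2 node; keeps that node busy until the queue drains."""
        conn = Connection(host)
        while True:
            task = task_queue.get()
            if task is None:              # sentinel: all work is finished
                task_queue.task_done()
                return
            level, key = task
            # The remote helper fetches its input from S3, runs the C++ pixelize
            # program, uploads the output files, and prints the S3 keys of any
            # next-level tasks it created.
            result = conn.run(f"python pixelize_task.py {level} {key}", hide=True)
            for line in result.stdout.splitlines():
                task_queue.put((level + 1, line.strip()))
            task_queue.task_done()

    threads = [threading.Thread(target=worker, args=(h,)) for h in ec2_hosts]
    for t in threads:
        t.start()
    task_queue.join()                     # also waits for tasks added while running
    for _ in threads:
        task_queue.put(None)              # release the workers
    for t in threads:
        t.join()

The same master-script pattern (instantiate nodes, feed a task queue, tolerate slow tasks) reappears in the scoring stage below.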
Removing redundant signals
● Old way: for each signal, walk up the chain of DB tables
● New way:
  – create a bitmap file, indexed by result ID, saying whether the result is from a redundant tape
  – memory-map this file
  – given a signal, we can instantly see whether it's redundant (a minimal lookup sketch appears at the end of these notes)

Pixel scoring
● Assemble the signals in a disc centered at the pixel
● Compute the probability that they are noise
● Can be done independently for each pixel

Nebula scoring program
● Same code as NTPCKR
● Modified to get signals from flat files instead of the DB
● First try: remove all references to Informix
  – this failed; too intertwined
● Second try: keep Informix but don't use it

Parallelizing scoring
● Need to score 16M pixels
● Use about 1K nodes
● Want to minimize file transfers; reuse signal files already on a node
● Divide the pixels into adjacent "blocks" of 4^n pixels, say 1024
● Each block is a job (16K of them)
● Each job loops over its pixels, fetches and caches files, and creates and uploads an output file of (pixel, score) pairs
● Master script instantiates EC2 nodes, uses the thread/queue approach
  – keeps nodes busy even if some pixels take longer than others

Nebula user interface
● Configuration
  – AWS and Nebula config files
  – check out and build the SETI@home software
● Scripts
  – s3_upload.py, s3_status.py, s3_delete.py
  – pixelize.py
  – score.py
● Logging
● Amazon accounting tools

Status
● Mostly written and working
  – doing performance and cost tests
  – I think we'll meet our goals
● Code: seti_science/nebula
● Design docs are on Google
  – readable to the ucb_seti_dev group

Future directions
● Flat-file-centric architecture
  – assimilators write signals to flat files
  – load into an SQL DB if needed
● Amazon spot instances (auction pricing)
  – instances are killed if the price goes above your bid
● Amazon Elastic File System (upcoming)
  – shared mountable storage, at a price
● Incremental processing
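A minimal sketch of the memory-mapped bitmap lookup described on the "Removing redundant signals" slide above. The file name, the one-bit-per-result-ID layout, and the assumption that result IDs are dense integers are illustrative guesses, not the actual Nebula file format.

    # Hypothetical sketch: file name and bit layout are assumptions, not the
    # real Nebula format. One bit per result ID; a set bit means the result
    # came from a redundant tape.
    import mmap

    class RedundancyBitmap:
        def __init__(self, path="redundant_results.bin"):
            f = open(path, "rb")
            # Map the whole file read-only; the OS pages it in on demand, so a
            # lookup is one memory access instead of a walk up the DB table chain.
            self.bits = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

        def is_redundant(self, result_id):
            return bool(self.bits[result_id // 8] & (1 << (result_id % 8)))

    # Usage: drop signals whose parent result is flagged as redundant.
    # bitmap = RedundancyBitmap()
    # signals = [s for s in signals if not bitmap.is_redundant(s.result_id)]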