Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Improving Andriod App Development's Efficiency and Quality through Machine Learning Techinque 刘世泽 Lau Shyh Tzer, David Visiting Student, IIIS, Tsinghua University Summer 2013 BSc. Computer Science, The Chinese University of Hong Kong Background • Problem: The Growing of Android API • Level 1-18, over 20k API Methods • Difficult for developers to master the usage of Android API, especially the inexperienced developer • Aim: Adapt Machine Learning and Reverse Engineering Technique to Analyze the Usage Pattern • Possibly developer a helper tool to suggest/fix the Android API usage during development stage Workflow 1 Adapt reverse engineering technique to retrieve the needed data from packaged Android App (.apk) 2 Perform data mining on the result raw data to dig out interesting API usage pattern, relationship. 1 Reverse Engineering on Android App • Need to retrieve information from package (.apk) file not Perform static analysis source code not dynamic analysis • There’re basically three options: Retrieve .dex file 1 Dalvik bytecode .apk Disassembly 2 smali code .apk Decompile 3 .apk .jar (.class) bytecode Low level, Lack of analysis tools (Few) Good for hacking. Take time to familiar with smali code Becomes a Java problem Lots of analysis tool 1 Reverse Engineering on Android App Decompile .apk .jar (.class) bytecode • Use dex2jar open source tool: https://code.google.com/p/dex2jar/ • Support directly decompile .apk to .jar file linux sh dex2jar/d2j-dex2jar.sh someApk.apk windows dex2jar\d2j-dex2jar.bat someApk.apk • We can then redirect it into a Java problem and focus on the static analysis with Java bytecode 1 Reverse Engineering on Android App • In order to understand the usage pattern of Android API, we have to know the structure of the code • The easy and abstract approach to understand the structure of the code is to look at its Abstract Syntax Tree (AST) Generate AST from Java bytecode It’s obvious and easy to parse Java source code into Abstract Syntax Tree, but parse the bytecode not • is Bytecode is a set of instructions that JVM interpret to perform stack execution to run the • program The stack execution on the bytecode is different than the common program flow that we observe at source code • Example method = i3 * i1 return * i3 Bytecode i2 AST 2 Bytecode Outline Plugin for Eclipse http://asm.ow2.org/eclipse/index.ht ml Generate AST from Java bytecode • Intuitively, bytecode is interpreted by JVM as the stack execution, so we can ‘recover’ the code structure and construct the AST through simulating the JVM stack operation Example: Variable Assignment Sourc e code int i=1; = i ICONST1 byte code ISTORE0 1 1 Thread Stack Abstract Syntax Tree Example 2: From Previous Bytecode Example Source code method public int method(inti1,int i2){ int i3=i1*i2 return i3*2; } = i3 bytecode ILOAD1 ILOAD2 IMUL ISTORE3 ILOAD3 ICONST2 IMUL IRETURN * i1 i2 2 i1 i3 * Thread Stack return i2 * i3 2 Abstract Syntax Tree Generate AST from Java bytecode • There are various kinds of AST structure, such as condition statement, goto statement, compound statement, but they can all be ‘recovered’ from bytecode by using the previous technique to simulate the stack execution • However, read directly on the .class file result in binary format that useless for our parsing • So we need a systematic way to parse the bytecode ASM - Bytecode Engineering Library • ASM is an all purpose Java bytecode manipulation and analysis framework http://asm.ow2.org/ • It provides two powerful APIs: Core API and Tree API • Core API creates an interface of visiting bytecode • Tree API parses bytecode into Objects Refer to http://download.forge.objectweb.org/asm/as m4-guide.pdf for complete usage ASM - Bytecode Engineering Library • Tree API is particularly useful for generating the AST from bytecode • It provides two important interfaces: ClassNode and MethodNode which enables the developer to assess to the bytecode The bytecode (opcode) of the respective method is information directly stored at InsnList instructions ASM - Bytecode Engineering Library Usage: ClassReader cr = new ClassReader(“App1.class”); ClassNode cn = new ClassNode(); cr.accept(cn,0); ACM parse the whole class and store the respective objects over here Assess Class Information: All the information is stored at the properties of the object, so simply retrieve from them cn.name; <- the class name cn.field; <- class field variables cn.innerClasses <- class inner classes Assess Method Information: List<MethodNode> mnList = cn.methods; for(MethodNode mn:mnList){ mn.name <- method name mn.signature <- method signature (parameter and return value type) mn.instructions; <- the respective opcodes } Each class contains several methods, so it’s List<MethodNode> type ASM - Bytecode Engineering Library Assess Bytecode Intructions: InsnList insn = mn.instructions; Iterator itr=insn.iterator(); while(itr.hasNext()){ AbstractInsnNode ain=(AbstractInsnNode)itr.next(); int opcode = ain.getOpcode(); int type = ain.getType(); An abstract class to wrap the instructions. ASM separate instruction into 16 different kinds //simulate the stack execution here to generate AST } The type of the instructions is also defined in integer constant. For example: 4 = FIELD_INSN 5 = METHOD_INSN Detail of the instruction, like which store to which local variable, the Field Variable ID, invoked method’s signature are stored in this object as properties. The bytecode instruction is defined as an integer constant in ASM. For example, 3 = ICONST_0 21 = ILOAD 54 = ISTORE My Design of AST Generator • With the support of ASM Tree API, we can parse the bytecode and simulate each stack execution to construct the respective Abstract Syntax Tree • In order to suit with the needed data for data systematically mining, I designed a customized Abstract Syntax Tree structure Specifically designed class to suit each code structure Abstract Parent Class ASTNode Inherited ASTArithmeti cNode ASTArray Node ASTArray ValueNod e ASTConsta ntNode ASTFieldN ode ASTFuncti onNode ASTJump Node ASTLabel Node ASTLocalVa riableNode ASTMetho dNode ASTObject Node ASTRetur nNode ASTSwitc hNode ASTCastN ode ASTClass Node My Design of AST Generator ASTNode • • • • • • getASTKind setName getName setSignature getSignature setCallBy getCallBy setUsedBy getUsedBy setUsedAsObject getUsedAsObject Example: CallBy/UsedBy/UsedAsObje ct are stored as ArrayList<ASTNode> to handle multiple connections “Sample” sb setUsedAsObject setUsedBy text StringBuilder sb = new StringBuilder; String text = “Sample”; sb.append(text); String result = sb.toString(); setUsedBy append setCallBy toString setUsedBy result Object and its methods have doubly connections to ensure bidirectional traverse My Design of AST Generator addParameter are stored as ArrayList<ASTNode> to handle multiple connections ASTMethodNode • addParameter ASTLocalVaria bleNode • getPara Trick: Local variable are stored separately at JVM Method Area by index. So in order to track the changing of local variable assignment (such as one variable can be used multiple times), create a hash table to record the pointers reference to the local variable. So the update of the variable assignment can be done easily while parsing the bytecode setIndex getIndex • setVariableType getVariableType • setVariableValue getVariableValue ASTFieldNode • setFieldValue getFieldValue Trick: Same case with Local Variable, create a hash table to have pointers reference to the Field Variable. Be careful that the hash table is clear while accessing each method, but Field Value is tracked through the whole class Data Flow Analysis on the AST • With complete Abstract Syntax Tree for an Android App, it gives a very useful details to perform various kinds of static analysis • My research is mainly focused on its data flow analysis: 1 Where does the return value goes? 2 ?? = Builder.fromParts(String scheme, String ssp, String fragment); 3 This methods is called by what kind of object? Where are these arguments come from? Data Flow Analysis on the AST • Performed depth-first-search on the AST to trace the return value path, argument path and call by path • Trick: use hash table/list to record Collect data from Android API Invocation down the path to avoid infinite loop within the AST QQ Android ASTClassNode ASTMethodNode ASTClassNode ASTMethodNode 2 Mining on the Analysis Data Convert App into Jar format Generate AST from Jar file Define To-Do Analysis Format Preparing Mining Raw Data • Coded a web crawler to download free Android Apps from http://apk.gfan.com/ open Android Market • Successfully grabbed 10, 266 valid Android Apps and generated respective AST, analysis data through the self-developed ASTGenerator by using Amazon Web Service High Memory cluster. • Result data structure: 1 1 1 0 0 1 1 0 appname-android/net/Uri buildUpon-0 1 1 1 0 0 0 0 1 appname-android/net/Uri parse-0 Mining on the Analysis Data • Adapted Weka 3 to perform data mining task. http://www.cs.waikato.ac.nz/ml/weka/ • It’s convenient to convert the analysis raw data into Weka ARFF input format, especially its support of sparse matrix format Mining on the Analysis Data • Adapted Hierarchical Agglomerative Clustering in the result matrix to discover the apis’ relationship and their usage pattern Mining on the Analysis Data • The analysis result data from 10, 266 apps is huge (~50GB text file), it’s time-consuming and unnecessary to mine directly on them • Designed a MapReduce task and ran it at Hadoop to categorize the result data into methods by methods and compute the statistics of their invocation numbers • Then perform hierarchical clustering on methods that have enough data to discover meaningful pattern (like the number of invocation reached a threshold) Mining on the Analysis Data Total 19, 250 Android API methods discovered to be invoked at least one time among 10, 266 apps Methods that have high numbers of invocation don’t directly mean they’re having higher possible to find meaningful pattern. They’re more likely to be really common usage like UI elements, Log Methods that have average number of invocation, especially those indicating specific feature like geolocation, network, database might be the target of data mining Clustered Result: • Weka’s hierarchical clustering result in Newick Format, we can use software like Dendroscope to visualize it http://ab.inf.uni-tuebingen.de/software/dendroscope/ Some clustering result has obvious clusters which may implied an obvious usage pattern with this method android/net/Uri buildUpon Clustered Result: • Trick: Some analysis that shows many redundant on the column key (related APIs), perform a filter to throw away those no obvious relations (such as only related one time) before sending the data for clustering android/location/Location <init> May perform several trials and observe the best cut-off branch Clustered Result: • The result Newick Format can be parsed back to get the list of related APIs for each cluster android/location/Geocoder <init> There may have many unique usage pattern like this which probably an error pattern or special usage Mining on the Analysis Data • By tracing back the result Newick Tree, we got the related APIs for each methods • Interesting Results: Package Analysis Identify Good and Error Usage Pattern • Complete APIs Relationship • The clustering result shows the android/location package classes are strongly inner-connected. Classes like GpsSatellite, Location, Location Manager are highly relied on each other. • By having App’s name as identifier, we can trace back to the information of the nature of the app (its download times, rating, popularity) to determine feature, possible good/bad pattern of the result clusters. Perform clustering on all the useful data and retrieve the APIs relationship from the cluster after identify its usage pattern. It can eventually conclude a ‘Good’ usage pattern suggestion list when respective API methods are called Mining on the Analysis Data • Android API vs Java Library App’s Permissions vs Method invocation There are many Android APIs have data flow relations with Java Library methods, by digging into the methods’ clusters, we can discover some obvious usage pattern of Android APIs and Java Library methods. • Android Permissions information is stored at AndroidManifest.xml which is an binary-encoded XML file in .apk package. It can be decoded by using tool such as APK-tool (https://code.google.com/p/androidapktool/) to read it. • Android Permissions declare at statically at compile time and can’t grant dynamically (at run time) • So with the AST generated from the bytecode, we can check/traverse the AST to determine the correctness of one app’s declared permissions. • For example, one app may declared BroadcastReceiver permissions and then there’re no related methods/functions are found at the AST Thank you very much