Android APK Identification using Non Neural Network and Neural Network Classifier

The purpose of this study is to identify Android APK files by classifying them using an Artificial Neural Network (ANN) and Non Neural Network (NNN) classifiers. The ANN is the Multi-Layer Perceptron Classifier (MLPC), while the NNN classifiers are K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Decision Tree, Logistic Regression and Naïve Bayes. The results show that the accuracy of the NNN classifiers decreases when they are trained on larger datasets. The K-Nearest Neighbor algorithm achieves an accuracy of 91.2% on a dataset of 600 APKs and 88% on a dataset of 14170 APKs. The Support Vector Machine algorithm achieves 99.1% on the 600 APK dataset and 90.5% on the 14170 APK dataset. The Decision Tree algorithm achieves 99.2% on the 600 APK dataset and 90.8% on the 14170 APK dataset. In contrast, the accuracy of the Multi-Layer Perceptron Classifier increases with dataset size, reaching 99% on the 600 APK dataset and 100% on both the 7000 APK dataset and the 14170 APK dataset.


I. INTRODUCTION
Android malware continues to grow along with the number of Android Package Kit (APK) files, the packages for applications that run on the Android operating system. The large number of Android APKs gives attackers more opportunities to distribute malware for their own profit, and Android mobile phones infected by malware cause many losses to their users. Because malware development has increased from year to year, this research addresses the topic of Android malware. Intents are interfaces that connect interactions between Activities in an Android APK. In addition, Intents send data to other Activities, for example to other applications (Gmail, Google Maps, etc.). In essence, an Intent is a mechanism to perform an action and to communicate between application components.
Originality: Most of the journals in the literature review focus on permission features and rarely explore intent features. An Android APK requires an intent to activate an action or Activity, call a component or send data. Without intents, Android cannot perform these actions. Therefore, this research focuses on both permission features and intent features. Malware classification has previously been carried out with machine learning algorithms such as K-Nearest Neighbor, Support Vector Machine and Decision Tree. The average classification results are good, but classification performance decreases when a large dataset is used. An experiment was therefore carried out with a deep learning algorithm, namely the Multi-Layer Perceptron. In these experiments, accuracy continued to increase as the size of the dataset increased.
The aim of this study is to identify Android APK files by classifying them using the Multi-Layer Perceptron Classifier. The main contribution of this paper is to improve the accuracy of malware classification by applying the Multi-Layer Perceptron Classifier algorithm.
Some research questions in this study: RQ 1, How to extract malware dataset using permission feature and intent feature? RQ 2, What is the percentage of application of the K-NN algorithm, Support Vector Machine and Decision Tree? RQ 3, What is the percent increase in accuracy with the implementation of the Multi-Layer Perceptron algorithm? RQ 4, Is it effective to perform malware analysis using static methods?
This article is organized as follows: Section 2 presents related work on the classification of Android malware. Section 3 describes the methodology used. Section 4 describes the proposed method. Section 5 presents the results of the experiments that have been carried out. Section 6 summarizes the paper.

II. RELATED WORK
In this section, we compare this study with previous research on Android malware APKs. Attackers create malware using new methods of targeting users of Android mobile phones, and several studies have used effective tools to make malware detection as accurate as possible. Table I shows that many studies extract permission features, system calls, API calls and network information, but very rarely use intent features. This research therefore uses intent features in addition to permission features.

III. METHODOLOGY
In this section, the researcher discusses the research methodology for malware analysis and classification [11]. There are three approaches to malware analysis, namely static analysis [12], dynamic analysis and hybrid analysis. Malware identification or detection is treated as a supervised learning classification problem. The algorithms used are KNN, Support Vector Machine and Decision Tree, as well as the deep learning Multi-Layer Perceptron Classifier [13] [14].

A. Static Analysis
Static analysis [15] is a malware analysis method that analyzes source code. Reverse engineering is used to obtain the source code by converting the executable file into source code files. To analyze a malware APK file, for example, the APK file must first be reverse engineered. Static malware analysis does not need to run the application. In this research, reverse engineering is done using the JADX module from APKTOOL. The source code to be analyzed is the AndroidManifest.xml file, which is parsed to read the android-permission and android-intent entries. Some purposes of reverse engineering are:
• To learn the protocol of a program. For example, to create a command line Instagram client.
• To find out the API used by a program. For example, to learn how to turn on the camera flash as a flashlight.
• To find security bugs in a program.
• To find out whether a program violates copyright. For example, we suspect that a program uses a commercial library that we created without paying for a license.
• For forensic purposes. For example, we want to know the data format used by a program.

B. Dynamic Analysis
Malware is a threat to Android, and various methods are used to analyze it, one of which is dynamic analysis. Dynamic analysis of Android malware aims to understand its behavior and to improve the ability to detect it. The analysis is performed by running the malware code in a virtual environment to observe its actual behavior. The dynamic analysis method does not examine the source code, but runs the malware files in a controlled environment called a sandbox. This is useful because the behavior of the malware can be analyzed without the malware spreading to other systems. Observing the malware produces a log of its activity, which is then analyzed.

C. Hybrid Analysis
Hybrid malware analysis is a combination of static analysis and dynamic analysis: the malware is run in a controlled environment and its source code is also analyzed. Because it combines both views, hybrid analysis is the most complete of the three approaches for analyzing a malware sample.

D. K-Nearest Neighbor
K-Nearest Neighbor (KNN) [16] [17] is a classification algorithm that classifies a new data point according to its distance to the k nearest neighbors. The training dataset is projected into a multidimensional space, in which each training sample is represented as a point, and the space is divided into regions that describe the character of the data. The KNN classification [18] [19] process looks for the training points closest to the new point. Common distance measures are the Euclidean distance, Hamming distance, Manhattan distance and Minkowski distance.
Euclidean distance [20] is the straight-line distance between two points. Hamming distance [21] is the number of positions at which two binary vectors of equal length differ. Manhattan distance [22] is the distance d between two vectors in n-dimensional space, measured as the sum of the absolute differences of their coordinates. Minkowski distance is a distance between two points in a normed vector space that generalizes the Euclidean distance and the Manhattan distance. The K-Nearest Neighbor (KNN) [23] [24] algorithm classifies an object based on the training data closest to it. First the value of K is determined, usually as an odd number, which reduces ties in the vote, and then a vote is carried out among the K closest training points. The advantages of KNN are that it can handle training data that is very nonlinear and that it is easy to implement. The disadvantages of KNN are that the parameter K (number of nearest neighbors) must be specified, missing values are not handled implicitly, and the algorithm is sensitive to outliers, non-informative variables and high dimensionality. The computational cost is also quite high, because the distance from each test sample to the entire training data must be calculated.
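The distance measures and the voting step described above can be illustrated with a minimal sketch. The function names, the toy binary feature vectors and the choice of k = 3 with Hamming distance are assumptions made for illustration only; they are not the configuration used in this study.

```python
from collections import Counter


def minkowski(a, b, p):
    """Minkowski distance of order p; p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def euclidean(a, b):
    return minkowski(a, b, 2)


def manhattan(a, b):
    return minkowski(a, b, 1)


def hamming(a, b):
    """Number of positions at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(train_x, train_y, query, k=3, distance=hamming):
    """Majority vote among the k training points closest to the query."""
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical binary feature vectors (1 = feature present in the manifest).
train_x = [(1, 1, 0, 1), (1, 1, 1, 1), (1, 0, 1, 1),
           (0, 0, 0, 1), (0, 0, 1, 0), (0, 1, 0, 0)]
train_y = ["malware", "malware", "malware", "benign", "benign", "benign"]

prediction = knn_predict(train_x, train_y, (1, 1, 1, 0), k=3)
```

Because every query is compared with every training point, the cost of one prediction grows linearly with the size of the training set, which is the computational disadvantage noted above.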

E. Support Vector Machine
Support Vector Machine (SVM) was first presented by Boser, Guyon and Vapnik in 1992. Support Vector Machine is a supervised classification algorithm that finds the hyperplane with the largest margin. There are three main ideas in SVM: it is supervised, it is a classifier, and it seeks the hyperplane with the largest margin. The Support Vector Machine [25] [26] [27] works as follows. Support vectors are the data points from different classes that lie closest to each other. The hyperplane [28] [29] is the boundary that separates the classes between the support vectors. The max margin [30] is the distance between the support vectors and the hyperplane; this margin must be as large as possible so that the classifier can still separate data points that are similar to each other. For data that is not linearly separable, the SVM kernel trick [31] [32] [33] is used to map the data into additional dimensions in which a separating hyperplane can be created. The advantages of SVM are that supervised training gives control over classification accuracy and that the kernel trick allows nonlinear data to be classified. The disadvantages of SVM are that it is not well suited to large amounts of data and that the kernel trick is not easy to implement.
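The relationship between the hyperplane, the support vectors and the margin can be sketched for the linear case as follows. The weight vector w and bias b below are hypothetical values chosen by hand, not the result of training, and the helper names are illustrative only.

```python
import math


def decision_value(w, b, x):
    """Signed score w.x + b; its sign tells on which side of the hyperplane x lies."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w, b, x):
    """Class +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def distance_to_hyperplane(w, b, x):
    """Geometric distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(decision_value(w, b, x)) / norm


def margin_width(w):
    """Width of the margin between the two classes, 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical hyperplane x1 + x2 - 3 = 0 in a two-dimensional feature space.
w, b = (1.0, 1.0), -3.0

# The points (1, 1) and (2, 2) have scores -1 and +1, so they sit exactly
# on the margin boundaries and play the role of support vectors here.
support_negative, support_positive = (1.0, 1.0), (2.0, 2.0)
```

Training an SVM amounts to choosing w and b so that `margin_width(w)` is as large as possible while the training points stay on the correct side of the hyperplane.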

F. Decision Tree
The Decision Tree [34] [35] algorithm was developed by J. Ross Quinlan in 1975. The decision tree is a popular classification method because it is easy to interpret. It is a predictive model that uses a tree structure. A related term is Classification and Regression Tree (CART) [35] [36] [37], which is a type of decision tree. Decision trees convert data into a tree and a set of decision rules. The benefit of a Decision Tree (DT) is its ability to break down complex decision-making processes into simpler ones, so that decision-makers can better interpret the solution to a problem.
Building a Decision Tree model [38] [39] is like drawing an inverted tree in which the Root Node is at the top. An Internal Node has 1 input and at least 2 outputs. A Leaf Node is a final node, with 1 input and no output.
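How a node of the tree chooses the feature on which to split can be sketched with the Gini impurity, the criterion used by CART. Whether the experiments in this study used Gini or another criterion is not stated, so the criterion, the function names and the toy data are assumptions.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def split_gini(rows, labels, feature):
    """Weighted Gini impurity after splitting on one binary feature."""
    left = [y for x, y in zip(rows, labels) if x[feature] == 0]
    right = [y for x, y in zip(rows, labels) if x[feature] == 1]
    total = len(labels)
    return (len(left) * gini(left) + len(right) * gini(right)) / total


def best_feature(rows, labels):
    """Index of the feature whose split gives the lowest weighted impurity."""
    return min(range(len(rows[0])),
               key=lambda f: split_gini(rows, labels, f))


# Hypothetical rows: column 0 = SEND_SMS requested, column 1 = INTERNET requested.
rows = [(1, 1), (1, 0), (1, 1), (0, 1), (0, 0), (0, 1)]
labels = ["malware", "malware", "malware", "benign", "benign", "benign"]

root_feature = best_feature(rows, labels)
```

In this toy data the first column separates the two classes perfectly, so it becomes the Root Node; a full tree repeats the same selection on each child until the nodes are pure and become Leaf Nodes.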

G. Multi-Layer Perceptron Classifier
Multi-Layer Perceptron is a classification algorithm based on a deep neural network. This approach is very different from machine learning algorithms based on statistical science. By using a deep neural network, the model is expected to be more accurate than the machine learning models. Figure 2 shows the architecture of the Multi-Layer Perceptron Classifier used to classify the malware dataset.
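The forward pass of a Multi-Layer Perceptron can be sketched as follows. The layer sizes, the ReLU and softmax activations and the hand-picked weights are assumptions for illustration; they do not reproduce the architecture in Figure 2, and the weights of a real model are learned by backpropagation.

```python
import math


def relu(values):
    """Rectified linear activation applied element by element."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Convert raw scores to probabilities that sum to 1."""
    largest = max(values)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dense(weights, biases, inputs):
    """Fully connected layer: one weighted sum plus bias per output neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def mlp_forward(x, hidden_w, hidden_b, out_w, out_b):
    """Input layer -> one hidden layer (ReLU) -> output layer (softmax)."""
    hidden = relu(dense(hidden_w, hidden_b, x))
    return softmax(dense(out_w, out_b, hidden))


# Hypothetical weights: 3 input features, 2 hidden neurons, 2 output classes.
hidden_w = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]
hidden_b = [0.0, 0.0]
out_w = [[2.0, -1.0], [-1.0, 2.0]]
out_b = [0.0, 0.0]

probabilities = mlp_forward([1.0, 0.0, 0.0], hidden_w, hidden_b, out_w, out_b)
predicted_class = probabilities.index(max(probabilities))
```

For the malware dataset the input layer would have one neuron per permission or intent feature and the output layer one neuron per class label.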

IV. PROPOSED METHOD
A. Pipeline 1: Create Dataset
This stage creates a dataset from Android APK files that are labeled as malware or benign. The malware APK files are downloaded from the University of New Brunswick and are already labeled by malware type. The downloaded files are saved to local storage, then grouped by class and stored in the corresponding folders. Next, features are extracted from the Android APK files using reverse engineering. Many reverse engineering tools are in common use; in this research, reverse engineering uses the JADX module. The reverse engineering process produces several folders and files, including AndroidManifest.xml. Files and folders other than AndroidManifest.xml are deleted, while AndroidManifest.xml is parsed to read the permission and intent features. The result of the feature extraction process is the malware dataset. The next process is classification using machine learning or deep learning algorithms.
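The parsing step can be sketched with the Python standard library, assuming the reverse engineering tool has already produced a decoded, plain-text AndroidManifest.xml. The function names, the sample manifest and the binary (present or absent) encoding of each feature column are assumptions for illustration, not the authors' extraction script.

```python
import xml.etree.ElementTree as ET

# XML namespace used for android:* attributes in a decoded manifest.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"


def extract_features(manifest_xml):
    """Return the sorted permission names and intent action names in a manifest."""
    root = ET.fromstring(manifest_xml)
    permissions = {el.get(ANDROID_NS + "name")
                   for el in root.iter("uses-permission")}
    intents = {el.get(ANDROID_NS + "name") for el in root.iter("action")}
    permissions.discard(None)  # ignore elements without an android:name
    intents.discard(None)
    return sorted(permissions), sorted(intents)


def to_row(permissions, intents, columns):
    """Encode one APK as a 0/1 vector over a fixed list of feature columns."""
    present = set(permissions) | set(intents)
    return [1 if column in present else 0 for column in columns]


# Hypothetical manifest used only to exercise the functions above.
SAMPLE_MANIFEST = """<manifest
    xmlns:android="http://schemas.android.com/apk/res/android"
    package="com.example.app">
  <uses-permission android:name="android.permission.SEND_SMS"/>
  <uses-permission android:name="android.permission.INTERNET"/>
  <application>
    <activity android:name=".Main">
      <intent-filter>
        <action android:name="android.intent.action.MAIN"/>
      </intent-filter>
    </activity>
  </application>
</manifest>"""

COLUMNS = ["android.permission.INTERNET",
           "android.permission.READ_CONTACTS",
           "android.permission.SEND_SMS",
           "android.intent.action.MAIN"]

permissions, intents = extract_features(SAMPLE_MANIFEST)
row = to_row(permissions, intents, COLUMNS)
```

Repeating this for every APK and appending the class label to each row would yield one dataset row per APK, with one column per distinct permission or intent.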

B. Pipeline 2: Prepare Training Dataset malware
Before the malware dataset is used for training, a preparation stage is necessary. To generate a model from a machine learning or deep learning training process, a clean dataset must be used (no null or incorrect values in the features). The labels must also be consistent: samples in the malware classes must not be mixed with benign samples. If malware and benign samples are mixed, the resulting model will make errors and its performance will suffer. In addition to data cleaning, feature engineering is carried out, namely analyzing the features and selecting the most influential ones. This step is required because it also has a strong influence on the resulting model. The next step is to make each group in the dataset uniform: if there are five groups in the dataset, no group may contain samples from another group. For example, the Ransomware APK group must not be mixed with samples from the Riskware APK group.
For machine learning, the dataset is commonly divided into 70% training data and 30% testing data, although this ratio is not a requirement; some researchers use 60% training data and 40% testing data. For deep learning, the dataset is divided into training data, validation data and testing data, for example (training data + validation data) = 70% and testing data = 30%. Cross validation, in which the training and testing portions of the dataset swap positions, is also carried out to estimate the performance of the model generated by machine learning or deep learning. Data preparation is done for the following reasons:
1) The available data is not ideal and contains missing values. Missing data in the dataset degrades the performance of the model. The missing values must be filled in so that the dataset becomes complete, but they must not be filled in arbitrarily; the relevant features or dimensions of the dataset must be analyzed first.
2) There are different data formats. To avoid format differences in the dataset features, it is necessary to check and validate the dataset and to analyze its features.
3) The dataset is small or imbalanced, which is far from ideal in terms of quantity. Small datasets are not ideal for producing models with machine learning or deep learning, and can make the model invalid. Synthetic Minority Over-sampling Technique (SMOTE) is one way to balance the dataset before machine learning is done, in order to produce a good model.
4) The dependent variable and the independent variables are not clear, or the data has no label.

C. Pipeline 3: Training and Testing Process
This stage trains models on the malware dataset using the KNN, Support Vector Machine and Decision Tree algorithms. The dataset is split into 70% training data and 30% testing data. The Multi-Layer Perceptron Classifier algorithm is also trained at this stage. Training is additionally carried out with the training and testing portions of the dataset changing positions, which is better known as cross validation. This study uses 5-fold cross validation to get a more reliable estimate of model accuracy.
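A 70%/30% split can be sketched as below. Splitting each class separately (a stratified split) and the fixed random seed are assumptions added here so that both portions keep the same class proportions; the text does not state how the split was drawn.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_ratio=0.3, seed=0):
    """Return (train_indices, test_indices), sampling each class separately."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_ratio)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Hypothetical label column: 10 malware samples followed by 10 benign samples.
labels = ["malware"] * 10 + ["benign"] * 10
train_idx, test_idx = stratified_split(labels, test_ratio=0.3)
```

The returned index lists can then be used to select the matching rows of the feature matrix for training and for testing.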
Cross Validation (CV) is a method used to evaluate model performance, in which the data is separated into two subsets, namely data for the learning process and data for evaluation. The model or algorithm is trained on the learning subset and validated on the validation subset. The type of CV can be selected based on the size of the dataset. K-fold CV is used because it can reduce computation time while maintaining the accuracy of the estimate. 5-fold CV is one of the K-fold CVs used for selecting the best model because it tends to provide less biased accuracy estimates. In 5-fold CV, the dataset is divided into 5 folds of approximately equal size. Each of the 5 folds is used once for testing, while the other 4 folds are used as training data.
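The fold rotation described above can be sketched as follows. The function names and the `evaluate` callback are hypothetical, and the sketch assumes the samples have already been shuffled, since the folds are taken in index order.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for each of the k folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validate(n_samples, evaluate, k=5):
    """Average the score returned by evaluate(train, test) over all folds."""
    scores = [evaluate(train, test)
              for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


folds = list(k_fold_indices(12, k=5))
```

In practice `evaluate` would train a classifier on the training indices and return its accuracy on the test indices, so the averaged score uses every sample for testing exactly once.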

D. Pipeline 4: Prepare New APK data to be tested
At this stage the aim is to add new datasets. If in performing the classification and new variants of malware are found, before being entered into the dataset, the data must be feature extraction. Then retraining is carried out. The more datasets, the better the classification model in identifying malware APK.

E. Pipeline 5: Decision Classification Output Label
The last stage aims to produce a classification model and the model is ready for deployment. Testing the model before the model is ready for use, aims to anticipate model errors in identifying Android APK files.

V. EXPERIMENT AND RESULT
The experiment was conducted on a MacBook Air 2020 with 8 GB RAM and 256 GB storage, using the Python programming language in Jupyter Notebook and the JADX reverse engineering module made by APKTOOL. This section answers the research questions and reports the experimental results.
A. RQ 1, How to extract malware dataset using permission feature and intent feature?
This is a much-needed step, because it generates the malware dataset. APK files are downloaded and extracted, reverse engineered and parsed to read the permission features and intent features. The final result of feature extraction is the malware dataset. The permission features and intent features are then used to train machine learning and deep learning models in order to find the model with the best accuracy. Extracting the Android APK dataset, which consists of labeled benign APKs and malware APKs, took 2 weeks of processing running 24 hours a day. Table II and Table III show features generated by the feature extraction process from malware and benign Android APK files. The total number of downloaded APK files is 14,170. Each APK file is reverse engineered with the JADX APKTOOL module, which produces the source code of the APK. The AndroidManifest.xml file is then parsed for permission features and intent features, and the parse results are stored in the malware dataset. The malware dataset has 1178 permission and intent feature columns, or dimensions. Table II shows some of the permission features and Table III shows some of the intent features.

B. RQ 2, What is the percentage of application of the K-NN algorithm, Support Vector Machine and Decision Tree?
Precision is the ratio of correct positive predictions to all positive predictions. Precision answers the question "What percentage of the Android APK files predicted to be malware are actually malware?". Precision = (TP) / (TP + FP). Precision can be seen in Table V. F1 Score is the harmonic mean of precision and recall. F1 Score = 2 * (Recall * Precision) / (Recall + Precision). F1-Score can be seen in Table VI. Recall is the ratio of true positive predictions to the total number of actual positive data. Recall answers the question "What percentage of the Android APK files that are actually malware are predicted to be malware?". Recall = (TP) / (TP + FN). Recall can be seen in Table VII. The performance of the models generated by the K-Nearest Neighbor, Support Vector Machine and Decision Tree algorithms decreases on larger datasets. Table IV shows that the accuracy of the KNN, Support Vector Machine and Decision Tree classifiers decreased when a larger dataset was used; in this experiment, the larger the training dataset, the lower the accuracy, which suggests that these three algorithms are better suited to small datasets. Table V shows that precision decreased for the three non-neural network algorithms, Table VI shows that F1-Score decreased on the large dataset and Table VII shows that recall decreased on the large dataset. Because the three non-neural network algorithms were not suitable for the large dataset in this study, the dataset was also trained using a Neural Network.
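The precision, recall and F1 Score formulas can be computed directly from the confusion counts, as in the sketch below. The function names and the example predictions are hypothetical and do not correspond to the values in Tables IV to VII.

```python
def confusion_counts(y_true, y_pred, positive="malware"):
    """Count TP, FP, FN and TN, treating one label as the positive class."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision(tp, fp):
    """Precision = TP / (TP + FP)."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Recall = TP / (TP + FN)."""
    return tp / (tp + fn) if tp + fn else 0.0


def f1_score(p, r):
    """F1 = 2 * (Recall * Precision) / (Recall + Precision)."""
    return 2 * p * r / (p + r) if p + r else 0.0


def accuracy(tp, fp, fn, tn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical labels and predictions for eight APK files.
y_true = ["malware", "malware", "malware", "malware",
          "benign", "benign", "benign", "benign"]
y_pred = ["malware", "malware", "malware", "benign",
          "malware", "benign", "benign", "benign"]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
p, r = precision(tp, fp), recall(tp, fn)
```

For a dataset with several malware classes, the same counts can be taken once per class by treating that class as positive and all others as negative.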

C. RQ 3, What is the percent increase in accuracy with the implementation of the Multi-Layer Perceptron algorithm?
The results of the Multi-Layer Perceptron classification experiment show that performance increases as the dataset grows: the larger the dataset, the better the performance. The accuracy is 99% on the 600 APK dataset, 100% on the 7000 APK dataset and 100% on the 14170 APK dataset. Figure 4 shows the ROC of the model produced by the Artificial Neural Network Classifier on the malware dataset. ROC (Receiver Operating Characteristics) is a performance measurement tool for classification problems that is used in determining the threshold of the model. The classes are plotted as follows: Banking malware APK files have label 0, symbolized in light blue; Ransomware malware APK files have label 1, symbolized in orange; Riskware malware APK files have label 2, symbolized in blue; SMS malware APK files have label 3, symbolized in light blue; and Benign APK files have label 4, symbolized in orange. The y-axis represents the True Positive Rate (sensitivity) and the x-axis represents the False Positive Rate (1 - specificity). Figure 4 shows that the higher the True Positive Rate (sensitivity) and the smaller the False Positive Rate, the better the threshold. The Area Under Curve (AUC) value from the Artificial Neural Network validation results is 1, which shows that the accuracy results obtained are in the very good category.
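How an ROC curve and its AUC are obtained for one class can be sketched as follows. The sketch treats one label as positive and all other labels as negative (one-vs-rest), which is an assumption about how Figure 4 was produced; the scores are hypothetical, and both classes must be present in `y_true`.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by lowering the decision threshold.

    y_true holds 1 for the positive class and 0 for the rest;
    scores holds the predicted probability of the positive class.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])

    points = [(0.0, 0.0)]
    tp = fp = 0
    previous = None
    for score, label in ranked:
        # Emit a point each time the threshold moves to a new score value.
        if previous is not None and score != previous:
            points.append((fp / negatives, tp / positives))
        if label == 1:
            tp += 1
        else:
            fp += 1
        previous = score
    points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Scores that separate the two classes perfectly give an AUC of 1.
perfect = auc(roc_points([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))
# One negative sample ranked above a positive sample lowers the AUC.
imperfect = auc(roc_points([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.6]))
```

An AUC of 1 therefore means that every positive sample received a higher score than every negative sample, so some threshold separates the two without error.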
D. RQ 4, Is it effective to perform malware analysis using static methods?
The static method does not require running the malware in an isolated or controlled environment. Features are simply extracted from the malware APK file and stored in the malware dataset. The dataset is classified using the classification methods and the model is then tested with the extracted malware dataset. The results show that the method is effective for detecting whether an Android APK file is infected with malware or is normal. The static method is simple and works effectively in malware detection.

VI. CONCLUSION
Based on the results of the experiments conducted in this study, it can be concluded that classification using machine learning with the K-Nearest Neighbor, Support Vector Machine and Decision Tree algorithms produces good accuracy. However, the use of larger datasets causes a decrease in accuracy. This motivates the use of deep learning for training in order to achieve high accuracy on large datasets. On the 14170 APK dataset, the average accuracy is 88% for the K-Nearest Neighbor algorithm, 90.5% for the Support Vector Machine and 90.8% for the Decision Tree. Deep learning with the Multi-Layer Perceptron achieves 100% accuracy on the 14170 APK dataset.