Semantic-Based Email Spam Classification Using the BERT Method

  • Yunita Renta Hutagaol, Del Institute of Technology
  • Yulyani Arifin, Universitas Bina Nusantara

Abstract

Advances in technology have encouraged people around the world, including in Indonesia, to take advantage of them. The internet and gadgets are among these technologies. The rapid development of smartphones has not changed the function of one of the services they provide, namely the text messaging service known as email. Email is still used to send messages both to people the sender knows and to people the sender does not know, for a range of purposes that includes offering products or services. This creates the problem of classifying incoming email as spam or not spam (ham). Email classification in this study uses the BERT and Long Short-Term Memory (LSTM) algorithms. The study aims to evaluate and determine the most effective algorithm for categorizing spam email, and to identify whether a received email is spam or not spam. The results show that the XLNet algorithm achieves higher accuracy than the BERT, RoBERTa, and LSTM algorithms, with a value of 1.00. The precision, recall, F1-score, and accuracy of the BERT algorithm are also better than those of the LSTM algorithm.
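
The abstract reports precision, recall, F1-score, and accuracy. As a minimal sketch of how these four metrics are commonly computed for a binary spam/ham classifier, the Python snippet below derives them from true and predicted labels. The function name `spam_metrics`, the label strings, and the sample labels are illustrative assumptions; they are not taken from the study and do not reproduce its results.

```python
def spam_metrics(y_true, y_pred, positive="spam"):
    """Compute accuracy, precision, recall, and F1 for one positive class.

    y_true and y_pred are equal-length sequences of labels.
    `positive` is the label treated as the positive class (assumed "spam").
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count the four cells of the binary confusion matrix.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels for illustration only (not data from the study).
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "spam"]
y_pred = ["spam", "spam", "ham", "ham", "ham", "spam", "ham", "spam"]
metrics = spam_metrics(y_true, y_pred)
print(metrics)
```

An accuracy of 1.00, as reported for XLNet, corresponds to the case where every predicted label matches the true label on the evaluation set.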

 

Keywords: Spam Email, BERT Algorithm, RoBERTa Algorithm, XLNet, LSTM

Published
2024-11-04