|Hendrik Maulana||Department of Electrical Engineering, Universitas Indonesia, PO Box 16424, Indonesia|
|Riri Fitri Sari||Department of Electrical Engineering, Universitas Indonesia, PO Box 16424, Indonesia|
Stylometry is an authorship analysis technique based on statistics. Through stylometry, the author of a document can be identified with high accuracy, which poses a threat to the author's privacy. One branch of stylometry, the elimination of authorship identity, can instead protect writers' privacy. This study applies an authorship identity elimination method to the Federalist Papers corpus. The Federalist Papers are a well-known corpus that has been studied extensively, especially for authorship identification, because it contains 12 disputed texts. One identification method uses the support vector machine (SVM) algorithm, which identifies the authors of the disputed texts with 86% accuracy. The authorship identity elimination method changes the writing style of a text while maintaining its meaning. Long short-term memory (LSTM) is a deep learning algorithm that can predict words well. Using a model built with the LSTM algorithm, the writing style of the disputed documents in the Federalist Papers can be changed. As a result, 4 out of 12 disputed documents were changed from one author identity to another. The similarity between the changed documents and the originals ranges from 40% to 57%, which indicates that meaning is preserved. Our experimental results indicate that the proposed method can eliminate authorship identity well.
Authorship; Long short-term memory (LSTM); Obfuscation
Stylometry is the study of authorship style using statistics. Most research in the field refers to the study of the Federalist Papers by Mosteller and Wallace (1963). With the rise of computer-based stylometry, the Federalist Papers corpus gained popularity. Stylometry is divided into several subfields, namely authorship identification, authorship verification, authorship profiling, stylochronometry, and authorship elimination, with the majority of studies in the first three. In contrast, authorship elimination is generally used to conceal identity when authors do not want their identity to be revealed publicly. Authorship identification methods have progressed rapidly, currently achieving an accuracy of up to 90% (Iqbal et al., 2020). This rapid development poses a serious privacy threat to certain professionals, such as journalists and activists.
McDonald et al. (2012) proposed an authorship style transformation method in which the user manually edits the text to make it anonymous. Because manual techniques are limited, Almishari et al. (2013) proposed a machine translation method to automate authorship style changes. Other previously proposed methods use synonym replacement, sentence splitting, sentence merging, and paraphrasing, but the text they produce still contains misspellings. In addition, none of these studies lets users control the randomness of the output text. This paper proposes a method that produces better text and gives the user control over text randomness.
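This excerpt does not say how text randomness is controlled. One common mechanism in neural text generation is temperature sampling, sketched below purely as an assumption: the function names, the three-word vocabulary, and the probabilities are hypothetical and stand in for the next-word distribution a trained language model would output.

```python
import math
import random

def rescale(probs, temperature):
    """Return the distribution proportional to p ** (1 / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if not any(p > 0 for p in probs):
        raise ValueError("at least one probability must be positive")
    # Work in log space; zero-probability words stay at zero.
    logits = [math.log(p) / temperature if p > 0 else float("-inf") for p in probs]
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]

def sample_word(vocab, probs, temperature, rng):
    """Draw one word from the temperature-rescaled distribution."""
    return rng.choices(vocab, weights=rescale(probs, temperature), k=1)[0]

# Hypothetical next-word distribution (not taken from the paper).
vocab = ["government", "people", "union"]
probs = [0.7, 0.2, 0.1]
rng = random.Random(0)

for t in (0.5, 1.0, 2.0):
    dist = rescale(probs, t)
    print(t, [round(p, 3) for p in dist], sample_word(vocab, probs, t, rng))
```

A temperature below 1 concentrates probability on the most likely word, giving more conservative text, while a temperature above 1 flattens the distribution and gives more varied text.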
Research on authorship identity elimination requires a standardized dataset. The dataset must be well known, contain anonymous documents, and have been examined by a number of previous authorship attribution studies. One dataset that meets these criteria is the Federalist Papers. The corpus contains 85 documents: 51 written by Hamilton, 14 by Madison, 5 by Jay, 3 written jointly by Hamilton and Madison, and 12 whose authorship is disputed between Hamilton and Madison. Juola (2020) identified no fewer than 19 studies conducted on this corpus. One of them was conducted by Savoy (2013), who applied several algorithms and concluded that only the Naive Bayes and SVM algorithms produce relevant results. SVM is a promising classification technique in machine learning (Abdillah and Suwarno, 2016).
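The corpus composition described above can be tallied with a few lines of Python. This is only an arithmetic check of the stated counts; the dictionary labels are illustrative.

```python
# Document counts per author category, as stated in the text.
counts = {
    "Hamilton": 51,
    "Madison": 14,
    "Jay": 5,
    "Hamilton and Madison": 3,
    "Disputed": 12,
}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, n in counts.items():
    print(f"{label:22s} {n:3d}  {shares[label]:.1%}")
print(f"{'Total':22s} {total:3d}")
```

The counts sum to 85, and Hamilton's 51 documents account for 60% of the corpus, which is the source of the class imbalance.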
Rahguoy et al. (2018) split and merged sentences and replaced words using WordNet. WordNet is a lexical database that can be used to extract features from sentences (Santosh and Vardhan, 2015). This method reduces the confidence level of the authorship identification process by 20%, but not all of the resulting sentences are well formed. Karadzhov et al. (2017) examined other attributes: the word type ratio, the stopword ratio, the ratio of capitalized words, and the part-of-speech ratio. In contrast, Bakhteev and Khazov (2017) made changes at the sentence level by paraphrasing and modifying the content using the LSTM algorithm in an encoder–decoder arrangement. They obtained a fairly high sentence change rate, but the output still contained some spelling mistakes.
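Three of the attributes listed above can be illustrated with a minimal sketch. The definitions here are assumptions rather than the ones used by Karadzhov et al. (2017): the stopword list is a tiny illustrative set, "capitalized" is read as "starts with an uppercase letter", and the part-of-speech ratio is omitted because it needs a tagger.

```python
import re

# Tiny illustrative stopword list (an assumption, not the list used in any cited study).
STOPWORDS = {"the", "of", "to", "and", "a", "in", "is", "that", "it", "be", "by", "on"}

def style_ratios(text):
    """Return simple stylometric ratios for a text."""
    tokens = re.findall(r"[A-Za-z']+", text)
    if not tokens:
        return {"type_token": 0.0, "stopword": 0.0, "capitalized": 0.0}
    lowered = [t.lower() for t in tokens]
    n = len(tokens)
    return {
        # Distinct word types divided by total word tokens.
        "type_token": len(set(lowered)) / n,
        # Share of tokens that are stopwords.
        "stopword": sum(t in STOPWORDS for t in lowered) / n,
        # Share of tokens that start with an uppercase letter.
        "capitalized": sum(t[0].isupper() for t in tokens) / n,
    }

sample = "The powers of the Union are few and defined."
ratios = style_ratios(sample)
for name, value in ratios.items():
    print(f"{name}: {value:.3f}")
```

An obfuscation method can compare such ratios before and after rewriting to see how far the style has moved.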
An authorship elimination method must satisfy two requirements: the change in authorship identity must be confirmed, here by SVM classification, and the preservation of meaning must be confirmed. Semantic similarity can demonstrate meaning preservation, because the greater the semantic similarity value, the closer the meaning of two documents (Sitikhu et al., 2019). Several techniques can be used to calculate the similarity of documents, such as the Jaccard coefficient, the Dice coefficient, and cosine similarity. Afzali and Kumar (2017) found that cosine similarity performed best among the techniques they evaluated. In this research, authorship identity elimination can also be used to test how well an authorship identification method performs on documents that have been changed. The long short-term memory (LSTM) algorithm, which is based on a neural network, is used to rewrite the disputed documents because of its suitable performance in natural language generation (Lippi et al., 2019).
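The three similarity measures named above can be sketched as follows. This excerpt does not state how the documents are turned into sets or vectors, so whitespace tokenization and raw word counts are assumptions, and the two example sentences are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase whitespace tokenization (a deliberate simplification)."""
    return text.lower().split()

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| over the sets of word types."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def dice(a, b):
    """2 |A ∩ B| / (|A| + |B|) over the sets of word types."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    denom = len(sa) + len(sb)
    return 2 * len(sa & sb) / denom if denom else 0.0

def cosine(a, b):
    """Cosine of the angle between the two word-count vectors."""
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

original = "the people are the only legitimate fountain of power"
rewritten = "the people are the sole legitimate source of power"

print(f"Jaccard: {jaccard(original, rewritten):.3f}")  # 6 shared / 10 total types = 0.600
print(f"Dice:    {dice(original, rewritten):.3f}")     # 12 / 16 = 0.750
print(f"Cosine:  {cosine(original, rewritten):.3f}")   # 9 / 11 = 0.818
```

All three measures range from 0 to 1 here. Unlike the two set-based coefficients, cosine similarity takes word frequency into account.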
The analysis conducted in this study indicates that the Federalist Papers corpus is an unbalanced dataset because most articles (60%) were written by Hamilton. Therefore, the text data must be normalized before classification. Chi-square and cosine similarity methods indicate which author a text is likely to belong to: if two texts have a high similarity value, it is likely that they were written by the same person. Using the SVM algorithm, the authors of the disputed texts in the Federalist Papers can be identified with an accuracy of 86%. Using the LSTM algorithm, this accuracy can then be reduced by 19%. The documents produced by authorship identity elimination have a similarity of 39%–57% to the original documents, which indicates that the meaning of the documents does not change significantly. These results indicate that authorship identity elimination using the LSTM algorithm achieves suitable performance. For future research, a grid search can be used in parameter tuning to obtain better LSTM parameters. Other methods of modifying the writing style, such as splitting and merging sentences, can also be combined with LSTM text generation.
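The chi-square comparison mentioned above can be illustrated with a minimal sketch. The exact formulation used in the study is not given in this excerpt, so the version below is an assumption: it compares the observed counts of the most frequent words in two texts against the counts expected if both texts drew on the same word distribution. The function name and the `top_n` parameter are hypothetical.

```python
from collections import Counter

def chi_square(tokens_a, tokens_b, top_n=50):
    """Chi-square statistic over the top_n most frequent words of both texts.

    A lower value means the two word-frequency profiles are more alike.
    """
    if not tokens_a or not tokens_b:
        raise ValueError("both token lists must be non-empty")
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    joint = ca + cb
    # Share of all tokens that belong to text A.
    share_a = len(tokens_a) / (len(tokens_a) + len(tokens_b))
    stat = 0.0
    for word, joint_count in joint.most_common(top_n):
        # Expected counts if the word were spread in proportion to text length.
        expected_a = joint_count * share_a
        expected_b = joint_count * (1 - share_a)
        stat += (ca[word] - expected_a) ** 2 / expected_a
        stat += (cb[word] - expected_b) ** 2 / expected_b
    return stat

text_a = "to be or not to be".split()
text_b = "upon the whole upon the matter".split()

print(chi_square(text_a, text_a))  # identical profiles
print(chi_square(text_a, text_b))  # no shared words
```

Note that this statistic behaves as a distance rather than a similarity: a value near zero suggests similar style, whereas cosine similarity grows as texts become more alike.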
This work is supported by Universitas Indonesia under the Q1Q2 Grant Number NKB-0321/UN2.R3.1/HKP.05.00/2019.
Afzali, M., Kumar, S., 2017. Comparative Analysis of Various Similarity Measures for Finding Similarity of Two Documents. International Journal of Database Theory and Application, Volume 10(2), pp. 23–30
Abdillah, A., Suwarno, 2016. Diagnosis of Diabetes Using Support Vector Machines with Radial Basis Function Kernels. International Journal of Technology, Volume 7(5), pp. 849–858
Almishari, M., Gasti, P., Tsudik, G., Oguz, E., 2013. Privacy-Preserving Matching of Community-Contributed Content. In: Computer Security – ESORICS 2013, Lecture Notes in Computer Science, Volume 8134, https://doi.org/10.1007/978-3-642-40203-6_25.
Bakhteev, O., Khazov, A., 2017. Author Masking using Sequence-to-Sequence Models. In: 2017 CEUR Workshop Proceedings, http://ceur-ws.org/Vol-1866/paper_68.pdf.
Grishunin, S., Suloeva, S., Egorova, A., Burova, E., 2020. Comparison of Empirical Methods for the Reproduction of Global Manufacturing Companies Credit Ratings. International Journal of Technology, Volume 11(6), pp. 1223–1232
Iqbal, F., Debbabi, M., Fung, B., 2020. Authorship Analysis Approaches. Machine Learning for Authorship Attribution and Cyber Forensics, pp. 45–56
Juola, P., 2020. Authorship Studies and the Dark Side of Social Media Analytics. Journal of Computer Science, Volume 26(1), pp. 156–170
Kacmarcik, G., Gamon, M., 2006. Obfuscating Document Stylometry to Preserve Author Anonymity. In: 21st International Conference on Computational Linguistics, pp. 444–451
Karadzhov, G., Mihaylova, T., Kiprov, Y., Georgiev, G., 2017. The Case for Being Average: A Mediocrity Approach to Style Masking and Author Obfuscation. In: International Conference of the Cross-Language Evaluation Forum for European Languages, https://doi.org/10.48550/arXiv.1707.03736.
Khomytska, I., Teslyuk, V., Kryvinska, N., Bazylevych, I. 2020. Software-Based Approach towards Automated Authorship Acknowledgement—Chi-Square Test on One Consonant Group. Electronics, Volume 9(7), pp. 1–11
Lippi, A., Montemurro, M., Esposti, M., Cristadoro, G., 2019. Natural Language Statistical Features of LSTM-Generated Texts. IEEE Transactions on Neural Networks and Learning Systems, Volume 30
Mosteller, F., Wallace, D., 1963. Inference in an Authorship Problem. Journal of the American Statistical Association, Volume 58(302), pp. 275–309
McDonald, A., Afroz, S., Caliskan, A., Stole, A., 2012. Use Fewer Instances of the Letter "i": Toward Writing Style Anonymization. In: International Symposium on Privacy Enhancing Technologies, https://doi.org/10.1007/978-3-642-31680-7_16.
Merity, S., Keskar, N., Socher, R., 2018. Regularizing and Optimizing LSTM Language Models. In: International Conference on Learning Representations
Rahguoy, M., Giglou, H., Rahguoy, T., Zaeynali, H., 2018. Author Masking Directed by Author's Style. In: 2018 CEUR Workshop Proceedings, https://pan.webis.de/downloads/publications/papers/rahgouy_2018.pdf.
Santosh, D., Vardhan, B., 2015. Obtaining Feature and Sentiment-Based Linked Instance RDF Data from Unstructured Reviews using Ontology-Based Machine Learning. International Journal of Technology, Volume 2, pp. 198–206
Savoy, J., 2013. The Federalist Papers Revisited: A Collaborative Attribution Scheme. In: Proceedings of The American Society for Information Science and Technology, Volume 50(1), pp. 1–8
Sitikhu, P., Pahi, K., Thapa, P., Shakya, S., 2019. A Comparison of Semantic Similarity Methods for Maximum Human Interpretability. In: IEEE International Conference on Artificial Intelligence for Transforming Business and Society, https://doi.org/10.48550/arXiv.1910.09129.