Published: 01 Apr 2022
Volume: IJtech Vol 13, No 2 (2022)
DOI: https://doi.org/10.14716/ijtech.v13i2.4257
Hendrik Maulana | Department of Electrical Engineering, Universitas Indonesia, PO Box 16424, Indonesia
Riri Fitri Sari | Department of Electrical Engineering, Universitas Indonesia, PO Box 16424, Indonesia
Stylometry is an authorship analysis technique that uses statistics. Through stylometry, the authorship identity of a document can be determined with high accuracy, which poses a threat to the privacy of the author. Meanwhile, one branch of stylometry, authorship identity elimination, can provide privacy protection for writers. This study applies an authorship identity elimination method to the Federalist Paper corpus. The Federalist Paper is a well-known corpus that has been extensively studied, especially with authorship identification methods, because 12 of its texts are of disputed authorship. One identification method is the support vector machine (SVM) algorithm, through which the authorship of the disputed texts can be identified with 86% accuracy. The authorship identity elimination method changes the writing style while maintaining the meaning of the text. Long short-term memory (LSTM) is a deep learning-based algorithm that predicts words well. Through a model built with the LSTM algorithm, the writing style of the disputed documents in the Federalist Paper can be changed. As a result, 4 out of the 12 disputed documents can be changed from one author identity to another. The similarity level of the changed documents ranges from 40% to 57%, which indicates that the meaning of the original documents is preserved. Our experimental results show that the proposed method can eliminate authorship identity well.
Authorship; Long short-term memory (LSTM); Obfuscation
Stylometry is the science of analyzing authorship style using statistics. Most research in the field of stylometry traces back to Mosteller and Wallace's study of the Federalist Papers in 1963. With the advent of computer-based stylometry, the Federalist Papers corpus gained further popularity. Stylometry is classified into several subfields, namely authorship identification, authorship verification, authorship profiling, stylochronometry, and authorship elimination, with the majority of studies falling into the first three classes. In contrast, authorship elimination is generally used to conceal identity when an author does not want to be revealed publicly. Authorship identification methods have progressed rapidly and currently achieve an accuracy of up to 90% (Iqbal et al., 2020). This rapid development poses a serious threat to the privacy of certain professionals, such as journalists and activists.
McDonald et al. (2012) proposed a set of authorship style transformations that users can apply manually to anonymize their writing. Due to the limitations of manual techniques, Almishari et al. (2013) proposed a machine translation method to automate authorship style changes. Other previously proposed methods have studied the use of synonyms, sentence splitting, sentence combination, and paraphrasing, but the resulting texts in all of these studies still contain misspellings. In addition, no study provides randomness settings for users. This paper proposes a method that produces better text output and gives the user control over text randomness.
Research in the field of authorship identity elimination requires a standardized dataset. The dataset must be well known, must contain documents of unknown authorship, and should have been analyzed in a number of previous authorship attribution studies. One dataset that matches these criteria is the Federalist Paper corpus. The corpus contains 85 documents: 51 written by Hamilton, 14 written by Madison, 5 written by Jay, 3 written jointly by Hamilton and Madison, and 12 whose authorship is disputed between Hamilton and Madison. Juola (2020) identified no fewer than 19 studies conducted on this corpus. One of them was conducted by Savoy (2013), who applied several algorithms and concluded that only the Naive Bayes and SVM algorithms produced relevant results. SVM is one of the most promising classification techniques in the field of machine learning (Abdillah and Suwarno, 2016).
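For illustration, the following Python sketch shows how such an SVM attribution experiment on the Federalist Paper corpus could be set up with scikit-learn; the TF-IDF word features, the linear kernel, and the variable names are assumptions for this sketch, not the exact configuration used in the study.

```python
# Minimal sketch of SVM authorship attribution on the Federalist Papers.
# Corpus loading, the feature choice (TF-IDF word features), and the SVM
# settings are illustrative assumptions, not this study's exact setup.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def attribute_authors(known_texts, known_labels, disputed_texts):
    # known_texts: documents of undisputed authorship
    # known_labels: "Hamilton" or "Madison" for each known document
    # disputed_texts: the 12 documents of uncertain authorship
    vectorizer = TfidfVectorizer(lowercase=True, max_features=3000)
    X_known = vectorizer.fit_transform(known_texts)
    X_disputed = vectorizer.transform(disputed_texts)

    # Linear SVM, the classifier reported by Savoy (2013) to give relevant results
    clf = LinearSVC(C=1.0)
    clf.fit(X_known, known_labels)
    return clf.predict(X_disputed)
```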
Rahguoy et al. (2018) performed sentence splitting, sentence combination, and word replacement through WordNet. WordNet is a lexical database used to extract features from sentences (Santosh and Vardhan, 2015). Their method can reduce the confidence level of the authorship identification process by 20%, but not all of the resulting sentences are well formed. Other attributes, examined by Karadzhov et al. (2017), are the word-type ratio, the stopword ratio, the ratio of capitalized words, and the part-of-speech ratio. In contrast, Bakhteev and Khazov (2017) modified text at the sentence level by paraphrasing and changing content with the LSTM algorithm through an encoder–decoder technique. As a result, they obtained a fairly high rate of sentence change, but there were still some spelling mistakes.
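The sketch below illustrates the kind of WordNet-based word replacement described above, using NLTK's WordNet interface; the selection policy (the first alternative lemma of the first synset) and whitespace tokenization are simplifying assumptions, not the policy of the cited work.

```python
# Sketch of WordNet-based synonym replacement, one of the obfuscation steps
# discussed above. The selection policy is a simplifying assumption.
import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)  # WordNet data is needed once

def replace_with_synonyms(sentence):
    replaced = []
    for token in sentence.split():  # simple whitespace tokenization for illustration
        synonym = token
        synsets = wordnet.synsets(token)
        if synsets:
            # Pick the first lemma that differs from the original word
            for lemma in synsets[0].lemmas():
                candidate = lemma.name().replace("_", " ")
                if candidate.lower() != token.lower():
                    synonym = candidate
                    break
        replaced.append(synonym)
    return " ".join(replaced)

print(replace_with_synonyms("The government of the union is strong"))
```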
Two requirements must be fulfilled by the authorship elimination method: confirmation that the authorship identity has been modified, proven by SVM classification, and confirmation that the meaning is preserved. The semantic similarity method can show meaning preservation, because the greater the semantic similarity value, the greater the similarity of meaning between two documents (Sitikhu et al., 2019). Several techniques can be used to calculate the semantic similarity of documents, such as the Jaccard coefficient, the Dice coefficient, and cosine similarity. Afzali and Kumar (2017) found that cosine similarity performs best for calculating semantic similarity. Eliminating authorship identity in this research can also be used to test the performance of the authorship identification method after a document has been changed. The neural-network-based long short-term memory (LSTM) algorithm is used to regenerate the disputed documents because of its suitable performance in natural language generation (Lippi et al., 2019).
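To make the meaning-preservation check concrete, the sketch below computes cosine similarity between an original and a modified document over TF-IDF vectors; the TF-IDF weighting is an assumed choice for this sketch, and the study's exact document representation may differ.

```python
# Sketch of the semantic-similarity check: cosine similarity between vectors
# of the original and the obfuscated document. TF-IDF is an assumed weighting;
# any document-vector representation could be substituted.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def document_similarity(original_text, modified_text):
    vectorizer = TfidfVectorizer()
    vectors = vectorizer.fit_transform([original_text, modified_text])
    # cosine_similarity returns a matrix; the off-diagonal entry is the score
    return cosine_similarity(vectors[0], vectors[1])[0, 0]

# A score in roughly the 0.4-0.6 range would correspond to the 40%-57%
# similarity levels reported for the changed documents.
```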
The analysis conducted in this study indicates that the Federalist Paper corpus is an unbalanced dataset because most of the articles (60%) were written by Hamilton. Therefore, the text data must be normalized before classification. The chi-square and cosine similarity methods show the authorship tendency of a text: if two texts have a high similarity value, it is likely that they were written by the same person. Through the SVM algorithm, the authors of the disputed texts in the Federalist Paper can be identified with an accuracy of 86%. Using the LSTM algorithm, this accuracy can then be reduced by 19%. The documents changed by the elimination of authorship identity have a similarity level of 39%-57% with the original documents, which shows that their meaning changes only slightly. These results indicate that authorship identity elimination using the LSTM algorithm achieves suitable performance. For future research, a grid search can be used in parameter tuning to obtain better LSTM parameters, and other writing style modification methods, such as splitting and combining sentences, can be combined with LSTM text generation.
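As a rough illustration of the kind of word-level LSTM language model used for text regeneration, the Keras sketch below defines a next-word predictor; the vocabulary size, sequence length, embedding width, and number of LSTM units are placeholder values of the sort that the grid search suggested above would tune, and the architecture is an assumption rather than the exact model of this study.

```python
# Minimal word-level LSTM language model for next-word prediction, of the kind
# used to regenerate documents. All hyperparameters are placeholders that
# parameter tuning (e.g., a grid search) would adjust.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

VOCAB_SIZE = 5000   # assumed vocabulary size
SEQ_LENGTH = 20     # assumed number of context words per input sequence

def build_lstm_language_model(vocab_size=VOCAB_SIZE, embedding_dim=100, lstm_units=128):
    model = Sequential([
        Embedding(vocab_size, embedding_dim),       # word indices -> dense vectors
        LSTM(lstm_units),                           # sequence context
        Dense(vocab_size, activation="softmax"),    # probability of each next word
    ])
    model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
    return model

# Training on integer-encoded word sequences and sampling with a
# temperature/randomness setting would follow, as described in the paper.
```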
This work is supported by Universitas Indonesia under
the Q1Q2 Grant Number NKB-0321/UN2.R3.1/HKP.05.00/2019.
Abdillah, A., Suwarno, 2016. Diagnosis of Diabetes Using Support Vector Machines with Radial Basis Function Kernels. International Journal of Technology, Volume 7(5), pp. 849–858
Afzali, M., Kumar, S., 2017. Comparative Analysis of Various Similarity Measures for Finding Similarity of Two Documents. International Journal of Database Theory and Application, Volume 10(2), pp. 23–20
Almishari, M., Gasti, P., Tsudik, G., Oguz, E., 2013. Privacy-Preserving Matching of Community-Contributed Content. In: Crampton, J., Jajodia, S., Mayes, K. (eds), Computer Security – ESORICS 2013, Lecture Notes in Computer Science, Volume 8134, Springer, Berlin, Heidelberg, https://doi.org/10.1007/978-3-642-40203-6_25
Bakhteev, O., Khazov, A., 2017. Author Masking using Sequence-to-Sequence Models. In: 2017 CEUR Workshop Proceedings, http://ceur-ws.org/Vol-1866/paper_68.pdf
Grishunin, S., Suloeva, S., Egorova, A., Burova, E., 2020. Comparison of Empirical Methods for the Reproduction of Global Manufacturing Companies Credit Ratings. International Journal of Technology, Volume 11(6), pp. 1223–1232
Iqbal, F., Debbabi, M., Fung, B., 2020. Authorship Analysis Approaches. Machine Learning for Authorship Attribution and Cyber Forensics, pp. 45–56
Juola, P., 2020. Authorship Studies and the Dark Side of Social Media Analytics. Journal of Computer Science, Volume 26(1), pp. 156–170
Kacmarcik, G., Gamon, M., 2006. Obfuscating Document Stylometry to Preserve Author Anonymity. In: 21st International Conference on Computational Linguistics, pp. 444–451
Karadzhov, G., Mihaylova, T., Kiprov, Y., Georgiev, G., 2017. The Case for Being Average: A Mediocrity Approach to Style Masking and Author Obfuscation. In: International Conference of the Cross-Language Evaluation Forum for European Languages, https://doi.org/10.48550/arXiv.1707.03736
Khomytska, I., Teslyuk, V., Kryvinska, N., Bazylevych, I., 2020. Software-Based Approach towards Automated Authorship Acknowledgement—Chi-Square Test on One Consonant Group. Electronics, Volume 9(7), pp. 1–11
Lippi, A., Montemurro, M., Esposti, M., Cristadoro, G., 2019. Natural Language Statistical Features of LSTM-Generated Texts. IEEE Transactions on Neural Networks and Learning Systems, Volume 30
McDonald, A., Afroz, S., Caliskan, A., Stolerman, A., 2012. Use Fewer Instances of the Letter "i": Toward Writing Style Anonymization. In: International Symposium on Privacy Enhancing Technologies, https://doi.org/10.1007/978-3-642-31680-7_16
Merity, S., Keskar, N., Socher, R., 2018. Regularizing and Optimizing LSTM Language Models. In: International Conference on Learning Representations
Mosteller, F., Wallace, D., 1963. Inference in an Authorship Problem. Journal of the American Statistical Association, Volume 58(302), pp. 275–309
Rahguoy, M., Giglou, H., Rahguoy, T., Zaeynali, H., 2018. Author Masking Directed by Author's Style. In: 2018 CEUR Workshop Proceedings, https://pan.webis.de/downloads/publications/papers/rahgouy_2018.pdf
Santosh, D., Vardhan, B., 2015. Obtaining Feature and Sentiment-Based Linked Instance RDF Data from Unstructured Reviews using Ontology-Based Machine Learning. International Journal of Technology, Volume 2, pp. 198–206
Savoy, J., 2013. The Federalist Papers Revisited: A Collaborative Attribution Scheme. Proceedings of the American Society for Information Science and Technology, Volume 50(1), pp. 1–8
Sitikhu, P., Pahi, K., Thapa, P., Shakya, S., 2019. A Comparison of Semantic Similarity Methods for Maximum Human Interpretability. In: IEEE International Conference on Artificial Intelligence for Transforming Business and Society, https://doi.org/10.48550/arXiv.1910.09129