A Web-Forum Free of Disguised Profanity by Means of Sequence Alignment

Christian Mogollón-Pinzon; Sergio Rojas-Galeano

doi:10.11144/Javeriana.iyu20-2.wffd

Vol. 20 No. 2 (2016), Industrial and systems engineering

Published 2016-06-20

Christian Mogollón-Pinzon, BSc

Universidad Distrital Francisco José de Caldas

Sergio Rojas-Galeano, PhD

Universidad Distrital Francisco José de Caldas

Supplementary Files

Word file without figures (Spanish)

Figures 1 to 10 (Spanish)

Figures 11 to 19 and tables (Spanish)

Original 10CCC proceedings paper selected among best for extended version (Spanish)

Keywords

web forum
profanity detection
text analysis

How to Cite

Mogollón-Pinzon, C., & Rojas-Galeano, S. (2016). A Web-Forum Free of Disguised Profanity by Means of Sequence Alignment. Ingeniería y Universidad, 20(2), 239-266. https://doi.org/10.11144/Javeriana.iyu20-2.wffd

Abstract

Profanity is the use of offensive, obscene, or abusive vocables or expressions in public conversations. Today, a major source of text conversations is digital media such as forums, blogs, and social networks, where malicious users exploit the worldwide reach of these platforms to spread profanity aimed at insulting or denigrating opinions, names, or trademarks. Lexicon-based exact comparison is the most common filter used to prevent such attacks; however, ingenious users disguise profanity by transliterating or masking the original vocable while still conveying its intended meaning (e.g., by writing piss as P!55 or p.i.s.s), thereby defeating the filter. Recent approaches to this problem, inspired by the sequence alignment methods of comparative genomics in bioinformatics, have shown promise in unmasking such guises. Building upon those techniques, we have developed an experimental Web forum (ForumForte) in which user comments are cleaned of disguised profanity. In this paper we briefly discuss the techniques and the main engineering artefacts obtained during the development of the software. Empirical evidence reveals a filtering effectiveness between 84% and 97% at the vocable level, depending on the length of the profanity (for vocables of more than four letters), and 86% at the sentence level, when tested on two sets of real user-generated comments written in Spanish and Portuguese. These results suggest the suitability of the software as a language-independent tool.
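
To make the alignment idea in the abstract concrete, the following is a minimal sketch of how a global (Needleman-Wunsch style) alignment score could unmask guises such as P!55 or p.i.s.s. It is not the ForumForte implementation: the look-alike table, the scoring constants, the 0.8 threshold, and the function names (`canonical`, `alignment_score`, `similarity`, `is_disguised`) are all assumptions chosen for illustration.

```python
# Hypothetical look-alike table (an assumption, not taken from the paper).
LOOKALIKES = {"!": "i", "1": "i", "5": "s", "$": "s", "0": "o", "@": "a", "3": "e"}

# Illustrative scoring constants (assumptions).
MATCH, MISMATCH, GAP, FILLER_GAP = 2, -1, -1, 0


def canonical(ch: str) -> str:
    """Lower-case a character and map look-alike symbols to letters."""
    ch = ch.lower()
    return LOOKALIKES.get(ch, ch)


def alignment_score(candidate: str, vocable: str) -> int:
    """Global alignment score of a candidate token against a lexicon vocable."""

    def skip_cost(ch: str) -> int:
        # Skipping a non-alphanumeric filler such as '.' or '-' is free;
        # skipping a letter or digit pays the regular gap penalty.
        return GAP if ch.isalnum() else FILLER_GAP

    # Row 0: the empty candidate aligned against each prefix of the vocable.
    prev = [j * GAP for j in range(len(vocable) + 1)]
    for c in candidate:
        curr = [prev[0] + skip_cost(c)]
        for j, v in enumerate(vocable, start=1):
            sub = MATCH if canonical(c) == canonical(v) else MISMATCH
            curr.append(max(
                prev[j - 1] + sub,        # align c with v
                prev[j] + skip_cost(c),   # skip c (gap in the vocable)
                curr[j - 1] + GAP,        # skip v (gap in the candidate)
            ))
        prev = curr
    return prev[-1]


def similarity(candidate: str, vocable: str) -> float:
    """Score normalised by the best score an exact spelling would reach."""
    return alignment_score(candidate, vocable) / (MATCH * len(vocable))


def is_disguised(candidate: str, lexicon: list[str], threshold: float = 0.8) -> bool:
    """Flag the candidate if it aligns closely with any lexicon vocable."""
    return any(similarity(candidate, w) >= threshold for w in lexicon)


if __name__ == "__main__":
    for token in ["P!55", "p.i.s.s", "pass", "hello"]:
        print(token, is_disguised(token, ["piss"]))
```

Under these assumed scores, an exact-match lexicon lookup would miss both guises from the abstract, while the alignment tolerates symbol substitutions and inserted separators. A real filter would need a tuned threshold and a richer substitution table to limit false positives on innocent words.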

This journal is registered under a Creative Commons Attribution 4.0 International Public License. Thus, this work may be reproduced, distributed, and publicly shared in digital format, as long as the names of the authors and Pontificia Universidad Javeriana are acknowledged. Others are allowed to quote, adapt, transform, auto-archive, republish, and create based on this material, for any purpose (even commercial ones), provided the authorship is duly acknowledged, a link to the original work is provided, and it is specified if changes have been made. Pontificia Universidad Javeriana does not hold the rights of published works and the authors are solely responsible for the contents of their works; they keep the moral, intellectual, privacy, and publicity rights.

Approval to intervene in the work (review, copy-editing, translation, layout) and to disseminate it afterwards is granted through a use license, not through an assignment of rights. This means the journal and Pontificia Universidad Javeriana cannot be held responsible for any ethical malpractice by the authors. As a consequence of the protection granted by the use license, the journal is not required to publish recantations or modify information already published, unless the erratum stems from the editorial management process. Publishing contents in this journal does not generate royalties for contributors.
