Documentos duplicados y casi duplicados en el Web: detección con técnicas de hashing borroso [Duplicate and near-duplicate documents on the Web: detection with fuzzy hashing techniques]

García de Figuerola Paniagua, Luis Carlos; Gómez Díaz, Raquel; Alonso Berrocal, José Luis; Zazo Rodríguez, Ángel Francisco

doi:10.54886/SCIRE.V17I1.3895

Journal:

Scire: Representación y organización del conocimiento

ISSN: 1135-3716

Year of publication: 2011

Volume: 17

Issue: 1

Pages: 49-54

Type: Article

DOI: 10.54886/SCIRE.V17I1.3895 (open access via the publisher)

References

Bar-Ilan, J. (2005). Expectations versus reality: Search engine features needed for Web research at mid 2005. // Cybermetrics. 9:1 (2005).
Bharat, K.; Broder, A. (1999). Mirror, mirror on the web: A study of host pairs with replicated content. // Computer Networks. 31:11-16 (1999) 1579-1590.
Chowdhury, A. (2004). Duplicate data detection. http://gogamza.mireene.co.kr/wpcontent/uploads/1/XbsrPeUgh6.pdf (2011-01-13).
Chowdhury, A.; Frieder, O.; Grossman, D.; McCabe, M. (2002). Collection statistics for fast duplicate document detection. // ACM Transactions on Information Systems (TOIS). 20:2, 171-191 (2002). http://citeseerx.ist.psu.edu/viewdoc/download/doi=10.1.1.5.373&rep=rep1&type=pdf (2011-01-13).
Clarke, C. L.; Craswell, N.; Soboroff, I. (2009). Overview of the TREC 2009 Web Track. // Proceedings of the 18th Text REtrieval Conference, Gaithersburg, Maryland, 2009. 1-9.
Damerau, F. (1964). A technique for computer detection and correction of spelling errors. // Communications of the ACM. 7:3, 171-176.
Figuerola, C. G.; Alonso Berrocal, J. L.; Zazo Rodríguez, A. F.; Rodríguez Vázquez de Aldana, E. (2006). Diseño de spiders. // Tech. Rep. DPTOIA-IT-2006-002 (2006).
Figuerola, C. G.; Gómez Díaz, R.; Alonso Berrocal, J. L.; Zazo Rodríguez, A. F. (2010). Proyecto 7: un motor de recuperación web colaborativo. // Scire: Representación y Organización del Conocimiento. 16, 53-60 (2010).
Hamming, R. (1950). Error detecting and error correcting codes. // Bell System Technical Journal. 29:2, 147-160.
Kornblum, J. (2006). Identifying almost identical files using context triggered piecewise hashing. // Digital Investigation. 3, 91-97.
Kornblum, J. (2010). Beyond fuzzy hashing. // US Digital Forensic and Incident Response Summit 2010 (2010). http://computer-forensics.sans.org/community/summits/2010/files/19-beyond-fuzzy-hashing-kornblum.pdf (2011-01-13).
Kornblum, J. (2010). Fuzzy hashing and ssdeep. http://ssdeep.sourceforge.net/ (2011-01-13).
Levenshtein, V. (1966). Binary codes capable of correcting deletions, insertions, and reversals. // Soviet Physics Doklady. 10:8, 707-710.
Milenko, D. (2010). ssdeep 2.5: Python wrapper for ssdeep library. http://pypi.python.org/pypi/ssdeep (2011-01-13).
Navarro, G. (2001). A guided tour to approximate string matching. // ACM Computing Surveys (CSUR). 33:1, 31-88.
Pugh, W. Y; Henzinger, M. H. (2003). Detecting Duplicate and Near Duplicate Files. United States Patent 6,658,423.
Soukoreff, R.; MacKenzie, I. (2001). Measuring errors in text entry tasks: an application of the Levenshtein string distance statistic. // CHI'01 extended abstracts on Human factors in computing systems. 319-320. ACM.
Tan, P.; Steinbach, M.; Kumar, V.; et al. (2006). Introduction to data mining. Pearson Addison Wesley: Boston (2006).
Tridgell, A. (2002). Spamsum overview and code. http://samba.org/ftp/unpacked/junkcode/spamsum (2011-01-13).
Tridgell, A.; Mackerras, P. (2004). The rsync algorithm. http://dspace-prod1.anu.edu.au/bitstream/1885/40765/2/TR-CS-96-05.pdf (2011-01-13).
Yahoo! (2011). Yahoo Developer Network. http://developer.yahoo.com (2011-01-13).
Yerra, R.; Ng, Y. (2005). Detecting similar HTML documents using a fuzzy set information retrieval approach. // 2005 IEEE International Conference on Granular Computing. 2, 693-699. IEEE (2005).

Data source: Dialnet