Documentos duplicados y casi duplicados en el Web: detección con técnicas de hashing borroso

  1. García de Figuerola Paniagua, Luis Carlos
  2. Gómez Díaz, Raquel
  3. Alonso Berrocal, José Luis
  4. Zazo Rodríguez, Ángel Francisco
Journal: Scire: Representación y organización del conocimiento

ISSN: 1135-3716

Year of publication: 2011

Volume: 17

Issue: 1

Pages: 49-54

Type: Article

DOI: 10.54886/SCIRE.V17I1.3895


Bibliographic References

  • Bar-Ilan, J. (2005). Expectations versus reality: search engine features needed for Web research at mid 2005. // Cybermetrics. 9:1.
  • Bharat, K.; Broder, A. (1999). Mirror, mirror on the web: a study of host pairs with replicated content. // Computer Networks. 31:11-16, 1579-1590.
  • Chowdhury, A. (2004). Duplicate data detection. http://gogamza.mireene.co.kr/wpcontent/uploads/1/XbsrPeUgh6.pdf (2011-01-13).
  • Chowdhury, A.; Frieder, O.; Grossman, D.; McCabe, M. (2002). Collection statistics for fast duplicate document detection. // ACM Transactions on Information Systems (TOIS). 20:2, 171-191. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.5.373&rep=rep1&type=pdf (2011-01-13).
  • Clarke, C. L.; Craswell, N.; Soboroff, I. (2009). Overview of the TREC 2009 Web Track. // Proceedings of the 18th Text REtrieval Conference, Gaithersburg, Maryland, 2009. 1-9.
  • Damerau, F. (1964). A technique for computer detection and correction of spelling errors. // Communications of the ACM. 7:3, 171-176.
  • Figuerola, C. G.; Alonso Berrocal, J. L.; Zazo Rodríguez, A. F.; Rodríguez Vázquez de Aldana, E. (2006). Diseño de spiders. // Tech. Rep. DPTOIA-IT-2006-002.
  • Figuerola, C. G.; Gómez Díaz, R.; Alonso Berrocal, J. L.; Zazo Rodríguez, A. F. (2010). Proyecto 7: un motor de recuperación web colaborativo. // Scire: Representación y Organización del Conocimiento. 16, 53-60.
  • Hamming, R. (1950). Error detecting and error correcting codes. // Bell System Technical Journal. 29:2, 147-160.
  • Kornblum, J. (2006). Identifying almost identical files using context triggered piecewise hashing. // Digital Investigation. 3, 91-97.
  • Kornblum, J. (2010). Beyond fuzzy hash. // US Digital Forensic and Incident Response Summit 2010. http://computer-forensics.sans.org/community/summits/2010/files/19-beyond-fuzzy-hashing-kornblum.pdf (2011-01-13).
  • Kornblum, J. (2010). Fuzzy hashing and ssdeep. http://ssdeep.sourceforge.net/ (2011-01-13).
  • Levenshtein, V. (1966). Binary codes capable of correcting deletions, insertions, and reversals. // Soviet Physics Doklady. 10:8, 707-710.
  • Milenko, D. (2010). ssdeep 2.5: Python wrapper for the ssdeep library. http://pypi.python.org/pypi/ssdeep (2011-01-13).
  • Navarro, G. (2001). A guided tour to approximate string matching. // ACM Computing Surveys (CSUR). 33:1, 31-88.
  • Pugh, W.; Henzinger, M. H. (2003). Detecting duplicate and near-duplicate files. United States Patent 6,658,423.
  • Soukoreff, R.; MacKenzie, I. (2001). Measuring errors in text entry tasks: an application of the Levenshtein string distance statistic. // CHI '01 Extended Abstracts on Human Factors in Computing Systems. 319-320. ACM.
  • Tan, P.; Steinbach, M.; Kumar, V. (2006). Introduction to Data Mining. Boston: Pearson Addison Wesley.
  • Tridgell, A. (2002). Spamsum overview and code. http://samba.org/ftp/unpacked/junkcode/spamsum (2011-01-13).
  • Tridgell, A.; Mackerras, P. (2004). The rsync algorithm. http://dspace-prod1.anu.edu.au/bitstream/1885/40765/2/TR-CS-96-05.pdf (2011-01-13).
  • Yahoo! (2011). Yahoo Developer Network. http://developer.yahoo.com (2011-01-13).
  • Yerra, R.; Ng, Y. (2005). Detecting similar HTML documents using a fuzzy set information retrieval approach. // 2005 IEEE International Conference on Granular Computing. 2, 693-699. IEEE.