Organización automática de documentos mediante técnicas de análisis de redes (Automatic organization of documents using network analysis techniques)

  1. Carlos G. FIGUEROLA (1)
  2. José Luis ALONSO BERROCAL (1)
  3. Ángel ZAZO RODRÍGUEZ (1)

  (1) Universidad de Salamanca, Salamanca, Spain. ROR: https://ror.org/02f40zc51

Journal:
Scire: Representación y organización del conocimiento

ISSN: 1135-3716

Year of publication: 2017

Volume: 23

Issue: 2

Pages: 25-36

Type: Article

DOI: 10.54886/SCIRE.V1I2.4453 (open access)

Abstract

Automatic organization of documents can reveal the semantic structure of large document collections. This paper proposes modeling a document collection as a graph or network and then applying Social Network Analysis techniques. We describe a practical experiment carried out with a collection of newspaper articles and analyze the topic structure that results from applying community discovery techniques. The results look promising; as future work, we envisage applying and comparing different community discovery algorithms.
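
The general idea in the abstract (documents as nodes, similarity as weighted edges, communities as topics) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the bag-of-words cosine similarity, the `threshold` value, the toy headlines, and the use of a simple deterministic label propagation in place of whichever community discovery algorithm the authors applied.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_graph(docs, threshold=0.3):
    """Link two documents when their similarity reaches the (assumed) threshold."""
    vectors = [Counter(d.lower().split()) for d in docs]
    graph = {i: {} for i in range(len(docs))}
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            weight = cosine(vectors[i], vectors[j])
            if weight >= threshold:
                graph[i][j] = weight
                graph[j][i] = weight
    return graph


def label_propagation(graph, max_rounds=20):
    """Each node repeatedly adopts the label with the largest total edge weight
    among its neighbours. Nodes are visited in sorted order and ties go to the
    smallest label, so the result is deterministic."""
    labels = {node: node for node in graph}
    for _ in range(max_rounds):
        changed = False
        for node in sorted(graph):
            if not graph[node]:
                continue  # isolated document keeps its own label
            scores = defaultdict(float)
            for neighbour, weight in graph[node].items():
                scores[labels[neighbour]] += weight
            best = max(scores.items(), key=lambda kv: (kv[1], -kv[0]))[0]
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels


def communities(labels):
    """Group node ids by their final label."""
    groups = defaultdict(list)
    for node, label in labels.items():
        groups[label].append(node)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical toy collection: three sports items and three politics items.
docs = [
    "football match goal team",
    "team wins football league match",
    "goal scored football team",
    "election vote parliament government",
    "government wins election vote",
    "parliament debate government vote",
]
graph = build_graph(docs)
groups = communities(label_propagation(graph))
print(groups)
```

On this toy input the weak link created by the shared word "wins" falls below the threshold, so the sports and politics items form two separate groups. A real collection would need proper term weighting and an established community discovery implementation.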
