Big data techniques: large-scale text analysis for scientific and journalistic research

  1. Carlos Arcila-Calderón 1
  2. Eduar Barbosa-Caro 2
  3. Francisco Cabezuelo Lorenzo 3
  1. 1 Universidad de Salamanca, Salamanca, Spain. ROR https://ror.org/02f40zc51
  2. 2 Universidad del Norte, Barranquilla, Colombia. ROR https://ror.org/031e6xm45
  3. 3 Universidad de Valladolid, Valladolid, Spain. ROR https://ror.org/01fvbaw18

Journal: El profesional de la información

ISSN: 1386-6710, 1699-2407

Year of publication: 2016

Issue title: Datos

Volume: 25

Issue: 4

Pages: 623-631

Type: Article

DOI: 10.3145/EPI.2016.JUL.12 (open access via the publisher)

Abstract

This paper conceptualizes the term big data and describes its importance for scientific research in the social sciences and for journalistic practice. It explains techniques for analyzing textual data at large scale, including automated content analysis, data mining, machine learning, topic modeling, and sentiment analysis, which can support the generation of knowledge in the social sciences and of news in journalism. It sets out the infrastructure required for big data analysis through the deployment of distributed computing centers, and it assesses the main tools for obtaining information, covering both commercial software and programming languages such as Python or R.
