Big data techniques: large-scale text analysis for scientific and journalistic research
- Carlos Arcila-Calderón 1
- Eduar Barbosa-Caro 2
- Francisco Cabezuelo Lorenzo 3
- 1 Universidad de Salamanca
- 2 Universidad del Norte
- 3 Universidad de Valladolid
ISSN: 1386-6710, 1699-2407
Year of publication: 2016
Issue title: Datos
Volume: 25
Issue: 4
Pages: 623-631
Type: Article
Other publications in: El profesional de la información
Abstract
This paper conceptualizes the term big data and describes its importance for scientific research in the social sciences and for journalistic practice. It explains techniques for analyzing textual data at large scale, namely automated content analysis, data mining, machine learning, topic modeling, and sentiment analysis, which can help generate knowledge in the social sciences and news in journalism. It sets out the infrastructure needed for big data analysis through the deployment of distributed computing centers, and assesses the main tools for obtaining information, both commercial software and programming packages such as Python or R.