Q-analysis Based Clustering of Online News

David M.S. Rodrigues

Discontinuity, Nonlinearity, and Complexity

Dimitry Volchenkov (editor), Dumitru Baleanu (editor)

Discontinuity, Nonlinearity, and Complexity 3(3) (2014) 227--236 | DOI:10.5890/DNC.2014.09.002

Centre for Complexity and Design, Faculty of Mathematics, Computing and Technology, The Open University, Milton Keynes, MK7 6AA, UK

Abstract

With online publication and social media now the main channels for disseminating news, and with traditional print media in decline, it has become necessary to devise ways to automatically extract meaningful information from the plethora of available sources and to make that information readily available to interested parties. In this paper we present a method for the automated analysis of the underlying structure of online newspapers based on Q-analysis and modularity optimisation. We show how combining the two strategies allows the identification of well-defined news clusters that are free of noise (unrelated stories) and provides automated clustering of information on trending topics in news published online.
