Tag Archives: method

Digital Scholarship day of ideas: data [2]

This is the second session of the day I wanted to note in detail (the first is here). The session is Robert Procter, Professor of Social Informatics at the University of Warwick, on Big Data and the sociological imagination. These notes are written live from the live stream. So here we go:

The title has changed to Big Data and the Co-Production of Social Scientific Knowledge. The talk will explain a bit more on social informatics as a hybrid computer scientist and sociologist; the meaning of ‘big data’ and how academic sociology can use such data, including the development of new tools and methods of inquiry – see COSMOS – and concluding with remarks on how these elements may combine in an exciting understanding of how social science and technology may emerge through different stakeholders, including crowd-sourced approaches.

Social informatics is the inter-disciplinary study of the factors that shape the adoption of ICT and the social shaping of technology. Processes of innovation involving districted technologies are large in scale and involve a diverse range of publics, such as understanding social media as processes of large-scale social learning. Asking how social media works and how people can use it to further their aims. As it is public, social media makes it easier in many ways to see what is going on, as the technology makes much of the data available (although it’s not entirely straightforward).

Social media is Rob’s primary area of interest. Recent research includes work on the use of social media in scholarly communications to put research in the public domain. But the value of this is not entirely clear. The research identified both positive and negative viewpoints. It also looked at how academic publishers were responding to such changes in scholarly communications, such as supporting the use of social media as well as developing tools to trace and aggregate the use of research data. This showed mixed results.

Another research project, in conjunction with The Guardian, was on the use of Twitter during the 2011 riots in England. In particular, was social media important in spreading false information during such events? So the research looked at particular rumours identified in the corpus of Tweets. So how do rumours emerge in social media and how do people behave and respond to such rumours?

This leads to the question of how to analyse 2.5m Tweets which is qualitative data. Research needs to seek out structures and patterns to focus scarce human resources for closer analysis of the Tweets.

Savage and Burrows (2007) on empirical sociology, arguing that the best sociology is being done by the commercial sector as they have access to data, with academic sociology becoming irrelevant. However, newer sources of data provide for the enhanced relevance of academic sociology, and this is reinforced by the rise of open data initiatives. So we can feel more confident about the future of academic sociology.

But how this data is being used raises further issues, such as linking mood in social media with stock market movements, which confuses correlation and causation. Other analysis has focused on challenges to dictatorial regimes, the promotion of democracy and political change, and the self-organisation of social movements. Methodological challenges are concerned with dealing with the volume of data, so combining computational tools with sociological sensitivity and understanding of the world. But many sociologists are wary of the ‘computational turn’.

Returning to the England riots, looking at the rumour of rioters attacking a children’s hospital in Birmingham. This involves an interpretive piece of work focused on data that may provide useful and interesting results. So the rumour started with people reporting police congregating at the hospital, and so people inferred that the hospital was under threat. The computational component was to discover a useful structure in the data using sentiment and topic analysis – Tweets were divided into originals and retweets that combine into information flows, and some flows are bigger than others. Taking the size of the information flow as an indicator of significance can provide an indication of where to focus the analysis. Coding frames were used to capture the relevant ways people were responding to the information, including accepting and challenging Tweets. This coding was used to visualise how information flows through Twitter. The rumour was initially framed as a possibility but mushroomed, and different threads of the rumour emerged. The rumour initially spread without challenge, but later people began to Tweet alternative explanations for the police being near the hospital, i.e., a police station is next to the hospital. So rumours do not go unchallenged and people apply common-sense reasoning to rumours. While rumours grow quickly in social media, the crowd-sourcing effects of social media help in establishing what the likely truth is. This could be further enhanced through engagement from trusted sources such as news organisations or the police. It could also be augmented by computational work to help address such rumour flows (see Pheme).
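
The step of grouping Tweets into information flows and ranking the flows by size could be sketched roughly as below. This is only an illustration, not the project's actual pipeline: the record fields (`id`, `retweet_of`, `text`), the helper names and the sample Tweets are all invented for the example.

```python
from collections import Counter

# Hypothetical tweet records: `retweet_of` is None for an original Tweet,
# otherwise the id of the original Tweet being retweeted.
tweets = [
    {"id": 1, "retweet_of": None, "text": "Police gathering at the children's hospital?"},
    {"id": 2, "retweet_of": 1, "text": "RT ..."},
    {"id": 3, "retweet_of": 1, "text": "RT ..."},
    {"id": 4, "retweet_of": None, "text": "There is a police station next door"},
    {"id": 5, "retweet_of": 4, "text": "RT ..."},
    {"id": 6, "retweet_of": None, "text": "Unrelated tweet"},
]


def flow_sizes(records):
    """Count the size of each information flow (an original plus its retweets)."""
    sizes = Counter()
    for record in records:
        # A retweet joins the flow of its original; an original starts its own flow.
        root = record["retweet_of"] if record["retweet_of"] is not None else record["id"]
        sizes[root] += 1
    return sizes


def largest_flows(records, n=2):
    """Return the n largest flows as (original_id, size) pairs, biggest first."""
    return flow_sizes(records).most_common(n)
```

The largest flows returned by `largest_flows(tweets)` would then be the ones handed to human coders for closer reading, which is the sense in which flow size is used to focus scarce analytic effort.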

There is also the question of what the police were doing on Twitter at the time. In Manchester, accounts were created to disseminate what was happening and to draw the attention of the police to events, so acting to inform public services.

This research indicates innovation as a co-production. People are collectively experimenting and discovering the limitations and benefits of social media. Uses of social media are emergent and shaped through exploration.

On to the development of tools for sociologists to analyse ‘big’ social data, including COSMOS, to help interrogate large social media data. This also involves linking social media data with other data sets [and so links to the open data]. So COSMOS assists in forging interdisciplinary working between sociologists and computing scientists, providing interoperable analysis tools and evolving capabilities for analysis. In particular, points to the issue of the black-boxing of computational analysis, and COSMOS aims to make the computational processes as transparent as possible.
COSMOS tools include text analysis and social network analysis linked to other data sets. A couple of experimental tools are being developed on geolocation and on topic identification and clustering around related words. COSMOS research is looking at social media and civil society; hate speech and social media; citizen science; crime sensing; suicide clusters and social media; and the BBC and tweeting the Olympics. Points to an educational need for people to understand the public nature of social media, especially in relation to hate speech.

Social media as digital agora, on the role of social media in developing civil society and social resilience through sharing information, holding institutions to account, inter-subjective sense-making, cohesion and so forth.

Sociology beyond the academy and the co-production of scientific knowledge. Points to examples such as the Channel 4 fact checker as an example of wider data awareness and understanding and citizen journalism mobilises people to document and disseminate what is going on in the world. Also gives the example of sousveillance of the police as a counter to the rise of the surveillance state. The Guardian’s use of volunteers to analyse MP expenses. So ‘the crowd’ is involved in social science through collecting and analysing data and so sociology is spanning the academy and so boundaries of the academy are becoming more porous. These developments create an opportunity to realise a ‘public sociology’ (Burawoy 2005) but this requires greater facilitation from the academy through engaging with diverse stakeholders, provision of tools, new forms of scholarly communication, training and capacity building and developing more open dialogues on research problems. Points to public lab and hackathons as means for people to engage with and do (social) science themselves.

Digital Scholarship day of ideas: data

Live notes from the day.

Starting the day with Dorothy Miell, Head of CHSS, introducing the third annual day of ideas as a forum for those interested in digital scholarship across the University and College. Today has a mixture of internal and external speakers. Also mentions the other digital HSS activities, including the website and the other events listed there.
Today’s focus is on data as a contested but popular term. What does it mean for HSS, what traction does it have in the humanities, what currency does big data have for the humanities, and what are the implications of the computational turn for digital scholarship?
The event is being streamed on the website and the presentations will be posted there later.

Sian Bayne introducing Annette Markham as a theorist of the internet who is currently at Aarhus University. Her focus is on ethnographic research and the ethics of online research. She also has a good line in paper titles.

Annette Markham asking “‘Data’ what does that mean anyway?”. For the last five years or so she has been particularly pushing at thinking about method to better represent the complexities of 21st Century life. She works with STS, informatics, ethnographers, social scientists, linguists, machine learning scholars etc. The presentation is based on a series of workshops published in a First Monday special issue, October 2013.
Annette argues that we need to be careful about using the term data as it assumes we all mean the same thing. Taking a post-humanist perspective, or at least a non-positivist stance. It is our responsibility to critique the word “data”. For other researchers, data and big data are terms that seem unproblematic.

Annette is providing an overview of the debates on data and a provocation to start the day. Asking what method means for our forms of inquiry requires ‘method’ to be looked at sideways, or from above and below ‘method’, to take account of the epistemological and political conditions for inquiry. Such conditions include funding constraints and demands around, for example, developing evidence bases and requirements for the archiving of data. But the latter is problematic in terms of capturing and tracing ethnographic research and ‘data’. Also look below ‘method’ in terms of the practices of inquiry that involve the gathering and analysing of data as well as the practices of “writing up”.
The notion of framing inquiry (Goffman) involves drawing attention to some things and excluding others – those outside the frame. Changing the frame changes the focus of inquiry and the perspective on the phenomenon. Different images such as frames, a globe/ sphere, a cluster of connected nodes (sociogram or DNA) are used to critique the notion of a ‘frame’. A frame guides our view of the world but is often invisible until it is disrupted. So it is important to make the frame visible.
The term data acts as a frame but is highly ambiguous, yet is often perceived as being universally understood, i.e., it is not visible as a framing mechanism.
How are our research sensibilities being framed? To understand the question, we need to ask how we are framing culture, objects and processes of analysis, and how we frame legitimate inquiry. Culture is framed in internet studies through the changes due to the internet, as a networked culture, but also through our understanding of the internet as embodied informational spaces. Interfaces developed from an interest in architecturalised spaces towards standardised interfaces and then to simplification as represented by Google. This is linked to the rise of commercial interests in the internet. The frame of objects and processes of inquiry has not changed much, and has not changed sufficiently. Inquiry involves entangled processes of social interaction online, yet methods remain largely based on 19th century practices. Research models are generally based on linear (deductive) processes, which act to value linear research over the messy and complex. We are still expected to draw conclusions, for example. The framing of legitimate inquiry has gone backwards from the feminist work on situated knowledge and practice in the 1960s towards evidence and solutions based practices.
So what is data? An easy term to toss around to cover a lot of stuff. It is a vague term and arguably a powerful rhetorical term shaping what we see and think. The term comes from 18th century sciences and was popularised via translation of scientific works. As a term, ‘data’ was used as preceding an argument or analysis, so data is unarguable and pre-existing – it has an “itness”. Data cannot be falsified. Data as a term refers to what a researcher seeks and needs for inquiry. Yet there are alternative sociological approaches involving the collection of ‘material’ to construct ‘data’ through practices of interpretation. So a very different meaning from ‘data’ as more widely used.
Refers to boyd and Crawford’s 2011 provocations on big data and Baym’s 2013 work arguing that all social metrics are partial and non-representative and that there is ambiguity involved in decontextualising material from its context.
Technology is now pervasive in everyday life, as represented in a Galaxy S II advert. Experiences are flattened and equalised with everything else and then flattened again as informational bits that can be diffused and shared through technology.
Humans as data argument. She has nothing against data and computational analysis, as such analysis is important and powerful. But she wants to critique the idea that data speaks for itself and that human interaction with technology produces just data. Not all human experience is reducible to data points. Data is never raw, it is always filtered and framed. Data operates in a larger framework of inquiry, and other frameworks of inquiry exist that do not focus on data. Rather, this is inquiry that is focused on the analysis of phenomena, involving playing around with understanding those phenomena (not data).
Data functions powerfully as a term and acts as a frame on inquiry, and this should be subject to critique. Inquiry can and should be playful and generative in its entanglements with ‘the world’.

Q: what is the alternative to data? What is human experience reducible to?
A: that’s not the key question. We don’t want to think in terms of reduction which is how data generally frames inquiry.


This talk was followed by a fascinating use of crowd-sourced data coding by Prof Ken Benoit. This included completing an analysis of a UKIP manifesto during the course of the talk via CrowdFlower.

Social Network Analysis and Digital Data Analysis

Notes on a presentation by Pablo Paredes. The abstract for the seminar is:

This presentation will be about how to make social network analysis from social media services such as Facebook and Twitter. Although traditional SNA packages are able to analyse data from any source, the volume of data from these new services can make convenient the use of additional technologies. The case in the presentation will be about a study of the degrees of distance on Twitter, considering different steps as making use of streaming API, filtering and computing results.

The presentation is drawn from the paper: Fabrega, J. Paredes, P. (2013) Social Contagion and Cascade behaviours on Twitter. Information 4/2: 171-181.

These are my brief and partial notes on the seminar taken live (so “typos ahead!”).

Looking at gathering data from social network sites and on a research project on contagion in digital data.

Data access requires knowledge of the APIs for each platform but Apigee details the APIs of most social networks (although as an intermediary, this may lead to further issues in interfacing different software tools, e.g., Python tool kits may assist in accessing APIs directly rather than through Apigee). In their research, Twitter data was extracted using Python tools such as Tweepy (calls to Twitter) and NetworkX (a Python library for SNA) along with additional libraries including Apigee. These tools allow the investigation of different forms of SNA beyond ego-centric analysis.

Pablo presented a network diagram from Twitter using NodeXL as ego-networks, but direct access to the Twitter API would give more options for alternative network analysis. Analysing the diffusion of information on Twitter was not possible in NodeXL.

Used three degrees of influence theory from Christakis & Fowler 2008. Social influence diffuses to three degrees but not beyond due to noisy communication and technology/ time issues leading to information decay. For example, most RTs take place within 48 hrs so diffusion tends not to extend beyond a friend’s friend’s friend! This relates to network instability and loss of interest from users beyond three degrees, alongside increasing information competition, which becomes too intense beyond three degrees so that diffusion decays.

The direct research found a 3-5% RT rate in the diffusion of a single Tweet. RT rates were higher with the use of a hashtag and correlate with the number of followers of the originator, but correlate negatively with @_mentions in the original Tweet. This is possibly a result of @_mentions being seen as private conversations. Overall, less than 1% of RTs went beyond three degrees.
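
Measuring how many degrees a retweet travels from the originator can be illustrated with a breadth-first search over the cascade. The sketch below makes its own assumptions and is not the code from the paper: the edge list, user names and function names are invented, and each edge simply means that one user passed on the Tweet after receiving it from another.

```python
from collections import defaultdict, deque

# Hypothetical cascade: (source, retweeter) means `retweeter` passed on the
# Tweet after receiving it from `source`. "origin" wrote the original Tweet.
edges = [
    ("origin", "a"), ("origin", "b"),
    ("a", "c"), ("c", "d"), ("d", "e"),
]


def degrees_from_origin(edge_list, origin):
    """Breadth-first search giving each retweeter's distance from the originator."""
    children = defaultdict(list)
    for source, retweeter in edge_list:
        children[source].append(retweeter)
    depth = {origin: 0}
    queue = deque([origin])
    while queue:
        user = queue.popleft()
        for nxt in children[user]:
            if nxt not in depth:  # the first visit is the shortest distance
                depth[nxt] = depth[user] + 1
                queue.append(nxt)
    del depth[origin]  # the originator is not a retweeter
    return depth


def share_beyond(depths, limit=3):
    """Fraction of retweeters further than `limit` degrees from the originator."""
    if not depths:
        return 0.0
    return sum(1 for d in depths.values() if d > limit) / len(depths)
```

In this toy cascade one of the five retweeters sits at four degrees, so `share_beyond(degrees_from_origin(edges, "origin"))` gives one fifth; the finding reported above is that on real Twitter data this share was under 1%.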

Conclusion is that diffusion in digital networks is similar to that found in physical networks which implies that there are human barriers to communication in online spaces. But the research is limited due to the limits on access to Twitter API as well as privacy policies on Twitter API. Replicability becomes very difficult as a result and this issue is compounded as API versions change and so software libraries and tools no longer work or no longer work in the same way. Worth noting that there is no way of knowing how Twitter samples the 1% of Tweets provided through the API. Therefore, there is a need to access 100% of the Twitter data to provide a clear baseline for understanding Twitter samples and justify the network boundaries.

Points to the importance of writing code, with R/ Python preferable as they are easier to learn and have larger support communities.

Weeknotes [21022014]

In this last week, I have mainly been: