How implementing text-mining and other advanced analytics can help organizations understand complex email collections and social networks
There is great interest in automatically detecting relation networks from email collections, e-books, websites, blogs, micro-blogs and other social media. Various methods exist to automatically measure properties of such social networks, including but not limited to degree, closeness, betweenness, importance, prominence, local and global centrality, density, community structure, groups and leaders. This information can be derived either from the full text, by recognizing the named entities of persons, organizations and companies and their relations, or from the natural attributes (from, to, subject, cc, bcc, etc.) of email messages and other social network communication.
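To make the second route concrete, here is a minimal sketch of how two of these measures (degree centrality and density) could be computed from nothing more than the from, to and cc attributes. The sample messages, the addresses and the function names are assumptions made up for illustration; a real collection would be parsed from mailbox files first, and measures such as betweenness would normally come from a dedicated graph library.

```python
from collections import defaultdict

# Hypothetical sample: each message reduced to its sender and recipient headers.
messages = [
    {"from": "anna@example.com", "to": ["karenin@example.com"], "cc": ["vronsky@example.com"]},
    {"from": "vronsky@example.com", "to": ["anna@example.com"], "cc": []},
    {"from": "levin@example.com", "to": ["kitty@example.com"], "cc": ["anna@example.com"]},
]

def build_graph(msgs):
    """Return an undirected adjacency map: person -> set of correspondents."""
    graph = defaultdict(set)
    for msg in msgs:
        sender = msg["from"]
        for recipient in msg["to"] + msg["cc"]:
            if recipient != sender:
                graph[sender].add(recipient)
                graph[recipient].add(sender)
    return dict(graph)

def degree_centrality(graph):
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    n = len(graph)
    if n < 2:
        return {node: 0.0 for node in graph}
    return {node: len(neighbours) / (n - 1) for node, neighbours in graph.items()}

def density(graph):
    """Number of edges divided by the number of possible edges."""
    n = len(graph)
    if n < 2:
        return 0.0
    edges = sum(len(neighbours) for neighbours in graph.values()) / 2
    return edges / (n * (n - 1) / 2)

graph = build_graph(messages)
centrality = degree_centrality(graph)
for person, score in sorted(centrality.items(), key=lambda item: -item[1]):
    print(f"{person}: {score:.2f}")
print(f"density: {density(graph):.2f}")
```

In this toy network the most connected address is the one that appears in all three messages, which is exactly the kind of signal an investigator would want surfaced automatically.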
A great example of the automatic derivation of a relationship network for Leo Tolstoy’s Anna Karenina can be found here:
This graph (created by Sacha Franssen and Juliane Meyer), although not perfect, reflects an 80% accuracy rate and was derived in just a few minutes from Tolstoy’s lengthy book. The same techniques can be applied to large email collections from multiple custodians and from data gathered from social networks and other cloud-based data collections.
Analyzing organizational email networks and email lists can provide a wealth of social information that can support important decisions and novel interventions. Organizations can identify social roles and determine who the main communicators between certain groups are, who the influencers are, and what the real “information” organization of a group of people is, which can be very different from the official organization chart. Especially in the fields of eDiscovery and fraud investigation, this information is crucial and can be the difference between a successful and a failed investigation. With this analysis it is quite possible to identify experts on a topic, identify the real leaders or discover unexpected social groups.
Email is an even more complex format than one would think
Email is a very complex format with many different dimensions: not only is there a time scale, emails can also be sent, forwarded, copied or blind-copied to multiple individuals or groups, and replied to by any of them. In addition to this: (i) emails can contain other emails as attachments, and attachments can contain embedded documents (ZIP files, etc.), (ii) various additional information can be extracted from the email body and associated attachments, such as named entities, locations, events and other facts, (iii) emails and attachments can contain information in multiple languages, and (iv) emails and attachments contain various document and file properties which can be extracted for forensic investigations.
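The nesting in point (i) is easy to underestimate. The following sketch uses Python's standard email package to build a forwarded message that carries another email as an attachment, and then walks the resulting structure. The addresses, subjects and the helper name embedded_messages are hypothetical; real collections will add further layers such as archives and embedded documents that this sketch does not handle.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

# Build a hypothetical original email.
inner = EmailMessage()
inner["From"] = "a@example.com"
inner["To"] = "b@example.com"
inner["Subject"] = "Original note"
inner.set_content("Inner body")

# Build a second email that forwards the first one as an attachment.
outer = EmailMessage()
outer["From"] = "b@example.com"
outer["To"] = "c@example.com"
outer["Cc"] = "d@example.com"
outer["Subject"] = "Fwd: Original note"
outer.set_content("See attached.")
outer.add_attachment(inner)  # stored as a message/rfc822 part

# Serialize and re-parse, as one would with a message read from disk.
parsed = message_from_bytes(outer.as_bytes(), policy=policy.default)

def embedded_messages(msg):
    """Yield every email attached, at any depth, as message/rfc822."""
    for part in msg.walk():
        if part.get_content_type() == "message/rfc822":
            yield part.get_payload(0)

content_types = [part.get_content_type() for part in parsed.walk()]
print(content_types)
for attached in embedded_messages(parsed):
    print("embedded:", attached["From"], "->", attached["To"], "|", attached["Subject"])
```

Note that the embedded email has its own sender, recipient and subject, so it contributes relationships to the network that are invisible if only the outer headers are indexed.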
Goals of Email and Social Network Analyses
There are different search strategies to analyze email collections: (i) the personal email search strategy, where a particular email is searched for to solve a particular problem, and (ii) the more exploratory email search, where investigators do not know exactly what they are looking for but need to investigate large collections of email from multiple users to discover fraud and other criminal activities.
The second category has our prime interest, because in this case the investigator does not have all the knowledge needed to effectively search and analyze the data collection with plain full-text search techniques. The reason is that the prime users (often referred to as custodians), the social groups and the (code) words used by the custodians are often not known to outsiders.
Typical tasks for such investigators include:
- Identify parties, groups and communities
- Find experts and thought leaders on certain topics
- Find interesting email threads and display these in time
- Find relationship development by analyzing email patterns in time
- Identify meetings, disagreements, agreements, conspiracies, political decisions, hidden agendas, etc.
- Identify roles, discourse and intentions
- Determine how email communication differs from the official org chart
- Discover someone’s role from email: for instance, do they request meetings, advice, information and travel, or do they only passively act or follow in such events?
- Derive someone’s email bio, such as name, contact information, his or her role in the organization, communication frequency, communication periods, unsolicited communicators (spammers), recency, affiliation, longevity, reciprocity, centrality, connectedness, collectivity, magnitude, etc.
- Identify different email addresses from the same person
- Review, annotate and sort emails in a legal discovery or law enforcement investigative process
- Identify contact-based patterns
- Identify thematic and semantic patterns: find the discourse of conversations and go beyond email threads.
For each of these questions and goals, different techniques can be used to analyze and visualize large email collections and data from social networks.
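As an illustration of the "email bio" task, the sketch below derives a few of the listed properties (communication frequency and reciprocity) from a simple log of sender and recipient pairs. The log, the names and the definition of reciprocity used here (the share of a person's contacts with whom mail flows in both directions) are assumptions chosen for the example; other definitions are equally defensible.

```python
from collections import Counter

# Hypothetical log: one (sender, recipient) pair per delivered message.
log = [
    ("anna", "vronsky"), ("vronsky", "anna"), ("anna", "vronsky"),
    ("anna", "karenin"), ("levin", "anna"),
]

def email_bio(person, pairs):
    """Summarize how one person communicates within the collection."""
    sent = Counter(recipient for sender, recipient in pairs if sender == person)
    received = Counter(sender for sender, recipient in pairs if recipient == person)
    contacts = set(sent) | set(received)
    mutual = set(sent) & set(received)  # contacts with traffic in both directions
    return {
        "sent": sum(sent.values()),
        "received": sum(received.values()),
        "contacts": len(contacts),
        "reciprocity": len(mutual) / len(contacts) if contacts else 0.0,
    }

bio = email_bio("anna", log)
print(bio)
```

A low reciprocity combined with a high number of sent messages can, for instance, point to a broadcaster or an unsolicited communicator rather than a participant in real conversations.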
Text Analysis and Text Mining for Email and Social Network Analyses
But first, several pre-processing and text mining techniques have to be applied to the email before it can be analyzed or visualized. Think of:
- Identify attachments from email bodies
- Identify embedded emails and attachments with compound objects
- Identify duplicate emails based on sender, receiver, subject, body text and attachments
- Identify near-duplicate emails based on sender, receiver, subject, body text, and attachments
- Identify variations and synonyms for email addresses by identifying co-reference addresses, co-occurrence and exact name matching, person names, job titles, number of correspondence partners, etc.
- Identify email threads and thread hierarchy and timeline based on conversation topic
- Extract named entities from email bodies and attachments
- Extract patterns of interest, such as events and relationships, from email bodies and attachments
- Extract categories for email bodies and attachments
- Extract sentiments from email bodies and attachments
- Extract summaries of email bodies and attachments
- Identify emotional tone from email or attachment
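Two of these steps, duplicate and near-duplicate detection, can be sketched in a few lines. The approach shown here (a hash over normalized fields for exact duplicates, and Jaccard similarity over word shingles for near-duplicates) is one common technique and not necessarily what any particular product uses. The field names and sample emails are hypothetical, and attachments are left out to keep the example short.

```python
import hashlib
import re

def normalize(text):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()

def fingerprint(email):
    """SHA-256 over normalized sender, recipients, subject and body."""
    fields = [email["from"], ",".join(sorted(email["to"])), email["subject"], email["body"]]
    joined = "\x1f".join(normalize(field) for field in fields)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def shingles(text, k=3):
    """Set of overlapping k-word sequences in the normalized text."""
    words = normalize(text).split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    """Shared shingles divided by all shingles: 1.0 means identical wording."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

first = {"from": "Anna@example.com", "to": ["b@example.com"],
         "subject": "Budget", "body": "Please review the  budget draft today."}
second = {"from": "anna@example.com", "to": ["b@example.com"],
          "subject": "budget", "body": "please review the budget draft today."}
third = {"from": "anna@example.com", "to": ["b@example.com"],
         "subject": "budget", "body": "Please review the budget draft today, thanks."}

print(fingerprint(first) == fingerprint(second))  # exact duplicate after normalization
print(fingerprint(first) == fingerprint(third))   # differs, so compare wording instead
print(jaccard(first["body"], third["body"]))
```

In practice a similarity threshold decides what counts as a near-duplicate; the right value depends on the collection and has to be tuned.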
Email and Social Network Visualization
Once the investigation goals and tasks to perform are determined, we can also define a useful email visualization method. In general, these analyses can also be categorized as: (i) hierarchical, (ii) correlational, and (iii) temporal patterns. Given the different nature of these analyses, it is virtually impossible to capture them in one interface or in one visualization method. Therefore, as in the previous section, a visualization dashboard approach is best to capture these different dimensions of email.
A typical hierarchical email visualization is the Treemap. An example of this can be found below. Here, a large tree structure is projected onto a surface. Each color indicates an email thread (conversation or subject) and the squares can be either senders or receivers. Dynamic filters can be set and the hierarchy can be changed for different insights as needed.
An example of a tool for correlational visualization is NodeXL, a free Microsoft Excel plug-in that was available for Excel 2007 and supported spreadsheets of up to 64,000 rows.
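For readers who want to see how a tree is projected onto a surface, here is a minimal sketch of the simple slice-and-dice treemap layout, in which the rectangle is split in alternating directions and each child receives an area proportional to its message count. The thread names, sender names and counts are invented, and production tools typically use more refined layouts than this one.

```python
# Hypothetical hierarchy: thread -> sender -> number of messages.
tree = {
    "Budget": {"anna": 6, "levin": 2},
    "Travel": {"kitty": 2},
}

def size(node):
    """Total message count under a node (leaf: int, branch: dict)."""
    if isinstance(node, int):
        return node
    return sum(size(child) for child in node.values())

def layout(node, x, y, w, h, depth=0, path=()):
    """Slice-and-dice treemap: yield (path, x, y, w, h) for every leaf."""
    if isinstance(node, int):
        yield path, x, y, w, h
        return
    total = size(node)  # assumed to be greater than zero
    offset = 0.0
    for name, child in node.items():
        share = size(child) / total
        if depth % 2 == 0:  # even depth: split the width
            yield from layout(child, x + offset * w, y, share * w, h, depth + 1, path + (name,))
        else:               # odd depth: split the height
            yield from layout(child, x, y + offset * h, w, share * h, depth + 1, path + (name,))
        offset += share

rects = {path: box for path, *box in layout(tree, 0, 0, 100, 50)}
for path, (x, y, w, h) in rects.items():
    print("/".join(path), f"x={x:.1f} y={y:.1f} w={w:.1f} h={h:.1f}")
```

Changing the hierarchy, for example to sender first and thread second, only requires regrouping the input dictionary; the layout code stays the same.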
A good example of a Temporal visualization is the following graph, which shows email conversations between different custodians in time. Patterns and rhythms can be used to identify gaps or other irregularities in conversations (which may indicate missing data or sensitive off-line communication). More information on this can be found in Viegas, Golder and Donath: Visualizing email content: Portraying relationships from conversational histories, Proc. SIGCHI 2006, ACM Press, New York, 979-988.
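The gap detection mentioned above does not need a chart to get started. The sketch below flags every pause between consecutive messages that exceeds a chosen threshold. The dates, the seven-day threshold and the function name find_gaps are assumptions for illustration; a sensible threshold depends on the normal rhythm of the conversation being examined.

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, threshold):
    """Return (start, end) pairs where consecutive messages are further apart than threshold."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > threshold]

# Hypothetical send dates for one pair of custodians.
sent = [datetime(2012, 3, day) for day in (1, 2, 3, 5, 19, 20)]
gaps = find_gaps(sent, timedelta(days=7))

for start, end in gaps:
    print(f"silence from {start:%Y-%m-%d} to {end:%Y-%m-%d} ({(end - start).days} days)")
```

Each reported interval is only a lead: it may reflect a holiday just as well as missing data or a conversation that moved off-line.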
The Future of Email and Social Network Analyses
In the past five years many efforts have been made to visualize email. Clearly, there is no “final solution” yet, so there is room for more efforts! Of course, I am also very interested in related work.
Because of the different dimensions that email communication has and the different search goals, an email search and analysis dashboard is currently the best solution. Based on the characteristics of an email or social network collection, an investigator can select various analysis and visualization tools to gain more in-depth insight into the structure of such collections.