<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Community Clustering of Web Graph Data using PySpark</title>
<!-- JavaScript Libraries //-->
<script src="http://d3js.org/d3.v3.min.js"></script>
<!-- CSS Style //-->
<link rel="stylesheet" type="text/css" href="http://fonts.googleapis.com/css?family=Source+Sans+Pro:300,900|Source+Code+Pro:300">
<link rel="stylesheet" type="text/css" href="./public/style.css">
</head>
<body>
<div style="text-align:center;font-weight:600;font-size:25px">
<p>Clustering communities on web crawl data</p>
<p>Crawlers: Oluwaseyi Talabi, M. Rafay Aleem, Prashanth Rao, Nandita Dwivedi</p>
<span>Github Repository: <a href="https://github.com/mrafayaleem/community-clusters" target="_blank">community-clusters</a></span>
</div>
<h1>Problem Definition</h1>
<p>
In our final project, we look to apply big data techniques to perform large-scale
graph mining on web crawl data. We utilize the <a href="http://commoncrawl.org/the-data/">
Common Crawl</a> web dataset, which is an open dataset hosted on Amazon Web Services' cloud
platform. This is an extremely large dataset that consists of petabytes of web crawl
data. Our particular problem is to identify communities (or clusters) of related
web pages using nothing but the URLs and the links between them, via web graphs
built from the crawl data. Our work is inspired by
<a href="https://towardsdatascience.com/large-scale-graph-mining-with-spark-part-2-2c3d9ed15bb5">
this introductory Medium blog post</a> on this topic.
In particular, our project focuses on the following aspects:
<ol>
<li>URL data extraction and cleaning from Common Crawl database</li>
<li>Graph analysis on domain names to see how they are related to one another</li>
<li>Visualizing the relationships using graph visualization techniques</li>
</ol>
In addition to the base tasks listed above, we also treated this as a technology
review of different tools that are used for analyzing and visualizing graphs, some of which we list
in this report.
</p>
<h2>Motivation: Why graphs?</h2>
<p>
In recent years, there have been some <a href="https://arxiv.org/pdf/1111.3919.pdf">
interesting applications</a> of graph analysis on datasets in areas as diverse as social networks, biology and computer science.
In machine learning, graphs are particularly important for unsupervised learning, especially clustering. As datasets
grow larger, finding more efficient unsupervised clustering techniques to gain insights
from the data becomes increasingly important. Recently, the availability of big data tools combined with freely available open datasets
and relevant example code on GitHub has enabled widespread graph analysis in a host of real-world applications.
</p>
<p>
Graphs are a fundamental type of data structure that utilize the underlying substructures in
the data to capture relationships between its objects. A graph is a very simple data structure, represented by
<i>vertices</i> and <i>edges</i>, which
can be directed or undirected. In our case, since we are analyzing the relationship between
web URLs by just using the links that they connect to, we will be looking at undirected
graphs, where each web page is a <i>vertex</i>, and each <i>href</i>, or outgoing link from one
web page to another, is an <i>edge</i>.
</p>
<p>
For our future careers as data scientists/engineers, we felt that graph analysis and visualization were a fruitful topic through which to fine-tune our
data handling skills and gain exposure to new areas of big data analysis and visualization.
</p>
<h2>Tools used</h2>
<p>
In approaching this problem, we brainstormed a combination of approaches based on our reading of numerous
blogs and research papers that tackle the problem of graph analysis. We settled on the following tools:
<ul>
<li><a href="https://spark.apache.org/docs/2.3.1/api/python/pyspark.html">
Spark's Python API</a> for URL extraction and cleaning</li>
<li>Amazon Web Services' <a href="https://docs.aws.amazon.com/AmazonS3/latest/dev/Welcome.html">S3</a> interface
and its <i>boto3</i> Python client, to avoid large network file transfers when accessing Common Crawl data</li>
<li><a href="https://graphframes.github.io/quick-start.html">GraphFrames</a> in PySpark to
perform the heavy-lifting for graph analysis using example code from <a href="https://github.com/wsuen/pygotham2018_graphmining">
this GitHub repo</a></li>
<li><a href="https://d3js.org/">D3.js</a> library for graph visualization</li>
</ul>
</p>
<h1>Methodology</h1>
<h2>ETL Process</h2>
<p>
Since the Common Crawl dataset is extremely large, it is infeasible to store it locally for processing and information extraction.
Instead, we use the AWS S3 <i>boto3</i> client to stream data directly from the Common Crawl buckets. Streaming allows
us to process data in chunks without CPU or memory becoming a bottleneck. We feed this stream to a WARC parser, which processes
each record of HTTP response type and extracts the parent URL along with its child hrefs. During this process, we also remove any child
links lying under the parent domain to avoid skewing our analysis.
</p>
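<p>
A minimal sketch of this parsing step is shown below. It assumes the <i>warcio</i> and <i>BeautifulSoup</i> libraries and
is illustrative rather than a copy of our exact pipeline code; it only needs a file-like stream of WARC data.
</p>
<pre><code>
from warcio.archiveiterator import ArchiveIterator
from bs4 import BeautifulSoup

def parse_warc_stream(stream):
    """Yield (parent_url, child_href) pairs from a file-like WARC stream."""
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":        # keep only HTTP response records
            continue
        parent = record.rec_headers.get_header("WARC-Target-URI")
        soup = BeautifulSoup(record.content_stream().read(), "html.parser")
        for tag in soup.find_all("a", href=True):
            yield parent, tag["href"]
</code></pre>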
<p>
For the ETL process, we leverage <i>mapPartitionsWithIndex</i> instead of <i>map</i>. The reason is to set up S3 downloads
once per partition rather than once per record of a map operation. <i>mapPartitionsWithIndex</i> and <i>mapPartitions</i>
limit this download setup to one per partition and hence help avoid network congestion.
</p>
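<p>
A hedged sketch of this partition-level wiring in PySpark is shown below; the bucket name, partition count and the
<i>parse_warc_stream</i> helper from the previous snippet are illustrative assumptions.
</p>
<pre><code>
import boto3

def process_partition(index, warc_keys):
    """Runs once per partition: a single boto3 client is shared by all WARC files in it."""
    s3 = boto3.client("s3")                       # created once per partition, not per record
    for key in warc_keys:
        body = s3.get_object(Bucket="commoncrawl", Key=key)["Body"]
        yield from parse_warc_stream(body)

edges_rdd = (sc.parallelize(warc_keys, numSlices=8)
               .mapPartitionsWithIndex(process_partition))
</code></pre>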
<p>
The number of WARC files for a single month of crawl is usually around 72,000, with each file containing several segment paths.
To make our ETL process more flexible, we have added batching on top of our Spark job. For instance, if there are 100 segment
paths in a single WARC file and we use a batch size of 10, Spark will process the first 10 segment paths, store the
results in parquet format, and continue with the next set of 10. This allows us to checkpoint our work so that,
if any of these jobs fail, we do not lose all of the data processed so far and can resume from the point of failure.
</p>
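<p>
A simplified sketch of this batching loop is shown below; the batch size, schema and output paths are illustrative assumptions.
</p>
<pre><code>
BATCH_SIZE = 10

for batch_no, start in enumerate(range(0, len(segment_paths), BATCH_SIZE)):
    batch = segment_paths[start:start + BATCH_SIZE]
    edges_rdd = (sc.parallelize(batch, numSlices=len(batch))
                   .mapPartitionsWithIndex(process_partition))
    # Each batch is written to its own parquet directory as a checkpoint,
    # so a failed batch can be re-run without redoing earlier ones.
    spark.createDataFrame(edges_rdd, ["parent", "child"]) \
         .write.mode("overwrite") \
         .parquet("output/edges/batch=%04d" % batch_no)
</code></pre>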
<h2>Handling Common Crawl Data</h2>
<p>
Common Crawl web data is stored in three primary file formats: <i>WAT, WET and WARC</i>. These are efficient
storage formats for massive web archive data, and the full extent of the data structure is explained on the
<a href="http://commoncrawl.org/2014/04/navigating-the-warc-file-format/">source page</a>. Common Crawl generates
an extensive crawl roughly once a month (each a few petabytes in size), and stores all of the information in these
three file formats. The data is then indexed and the index information is made <a href="https://index.commoncrawl.org/">
publicly available</a>.
</p>
<p>The format we are primarily interested in is the WARC source files. Using some starter code from
<a href="https://github.com/commoncrawl/cc-pyspark">this GitHub repo</a>, we wrote some of our own code that parses
through a subset of the complete Common Crawl database using a list of download links (specified as S3 buckets) from the
index page mentioned above. Since parsing through the crawl data for the entire month all at once can be quite expensive,
we limit the number of WARC files that our code looks through. Given sufficient compute resources and run time, we imagine
that this same code would scale to be able to parse through an entire month's data without significant rework.
</p>
<p>
Our ETL code extracts the URLs of the parent web page through the HTML source inside the WARC files, and then parses through
the same HTML source for embedded <i>href</i> links that the parent page points to. We refer to these embedded hyperlinks as
"children". To simplify the extraction and show us direct parent-child relationships, we limit the search to only one level
of depth, and store both the parent and child URLs in a PySpark DataFrame. Additional filtering is done to remove duplicates
and we limit our domain search to only the topmost level. For example, <i>"www.cnn.com"</i> and <i>"https://www.cnn.com/2018/12/01"</i> are
both stored as <i>"cnn.com"</i>, with subdomains removed.
</p>
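<p>
A rough sketch of this normalization, using the standard-library <i>urlparse</i> with a simple two-label suffix heuristic,
is shown below. It is purely illustrative; a library such as <i>tldextract</i> handles multi-part suffixes like
<i>.co.uk</i> more robustly.
</p>
<pre><code>
from urllib.parse import urlparse

def to_top_domain(url):
    """Reduce a URL to its topmost registered domain, e.g. 'cnn.com'."""
    netloc = urlparse(url if "//" in url else "//" + url).netloc
    netloc = netloc.split(":")[0].lower()          # strip any port
    parts = netloc.split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else netloc

assert to_top_domain("https://www.cnn.com/2018/12/01") == "cnn.com"
assert to_top_domain("www.cnn.com") == "cnn.com"
</code></pre>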
<p>
Once we extract the relevant parent-child domain information, we export all the data to the <i>parquet</i> format. This step
greatly condenses the information to a few hundred megabytes at most. The parquet data is then read back into PySpark so that
we can perform graph analysis. We use PySpark's GraphFrame module to condense and filter information in the DataFrames to edges
and vertices, following which we can run clustering analyses on them.
</p>
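<p>
A hedged sketch of how the parquet edge list is turned into a GraphFrame; the column and path names are assumptions about our own schema.
</p>
<pre><code>
from pyspark.sql import functions as F
from graphframes import GraphFrame

edges = (spark.read.parquet("output/edges")
              .selectExpr("parent as src", "child as dst")
              .dropDuplicates())
vertices = (edges.select(F.col("src").alias("id"))
                 .union(edges.select(F.col("dst").alias("id")))
                 .distinct())
graph = GraphFrame(vertices, edges)
</code></pre>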
<h2>Label Propagation Analysis</h2>
<p>
<a href="https://arxiv.org/pdf/0709.2938.pdf">Label propagation analysis (LPA)</a> is a very useful technique in studying the
relationship between the vertices of a graph. In beginning our analysis, we performed an extensive literature
survey on the various available methods to perform community detection for studying web graphs.
<a href="https://arxiv.org/pdf/0906.0612.pdf">This paper by Santo Fortunato</a> discusses in detail the different
clustering techniques used in graph analysis, and their applications in real-world problems in biology, sociology, etc.
</p>
<p>
The key benefit of LPA over other graph clustering techniques is that it requires no prior information about
the communities, nor does it need a specific optimization objective. It utilizes a <i>label</i>, which in our case
is a unique identifier assigned to each vertex (or web domain name) in our graph. At the start, every vertex of the graph carries
its own unique label; as we iteratively update the vertex labels, each vertex takes on the
label shared by the majority of its neighbors. "Convergence" of the algorithm is achieved once every vertex holds the label
held by the majority of its neighbors, i.e. the labels stop changing. In a large enough graph, whether two vertices share a label tells us how closely
or distantly they should lie, and hence something about their relationship.
</p>
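<p>
With GraphFrames, running LPA is essentially a one-liner; the iteration count below is an illustrative choice, not a tuned value.
</p>
<pre><code>
# Assign a community label to every vertex, then inspect the largest communities.
communities = graph.labelPropagation(maxIter=5)
communities.groupBy("label").count().orderBy("count", ascending=False).show(10)
</code></pre>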
<h2>PageRank Analysis</h2>
<p>
Another useful technique for identifying the importance of specific domain names relative to the others is <i>
<a href="https://nlp.stanford.edu/IR-book/html/htmledition/pagerank-1.html">PageRank</a></i>. Given a particular graph
structure, running a PageRank analysis computes a score for each vertex of the graph depending on how many other vertices
point to it. Looking at the graph this way allows us to pinpoint nodes of "high centrality" very easily. In our case, we
visualize the vertices (or nodes) of our LPA-computed graph by sizing each node proportionally to its PageRank score.
This reveals a lot about the graph structure in a visual manner.
</p>
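<p>
A hedged GraphFrames sketch of this step; the damping and iteration parameters are illustrative defaults rather than tuned values.
</p>
<pre><code>
# Compute PageRank scores, then list the highest-ranked domains.
results = graph.pageRank(resetProbability=0.15, maxIter=10)
results.vertices.select("id", "pagerank") \
       .orderBy("pagerank", ascending=False).show(10)
</code></pre>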
<h2>Graph Visualization</h2>
<p>
Once the analysis is complete, we write our results to CSV files. We then postprocess the analysis data (nodes and edges)
using a Python script. The data is formatted and written to a JSON file that can then be read into D3 for visualization. This
step of our pipeline is not well suited for "huge data", since D3 is not really meant for that purpose. However,
in doing extensive research on graph visualization, we realized that there are not many tools that are both easy to
use and optimized for handling millions of graph nodes, hence we stick to visualizing a small subset of our data using D3.
</p>
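<p>
A simplified sketch of this postprocessing step is shown below. The file and column names are assumptions, but the output shape is
the standard <i>{"nodes": [...], "links": [...]}</i> layout that d3-force expects.
</p>
<pre><code>
import csv
import json

# Nodes carry the LPA community label and PageRank score used for colouring and sizing.
with open("nodes.csv") as f:
    nodes = [{"id": row["id"], "label": row["label"], "pagerank": float(row["pagerank"])}
             for row in csv.DictReader(f)]

with open("edges.csv") as f:
    links = [{"source": row["src"], "target": row["dst"]} for row in csv.DictReader(f)]

with open("graph.json", "w") as f:
    json.dump({"nodes": nodes, "links": links}, f)
</code></pre>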
<h1>Problems Faced</h1>
<h2>Data Cleaning</h2>
<p>
As one can imagine, parsing through a dataset that is several petabytes in size requires some forethought. We stood on the shoulders
of people who have tackled this issue before us, and looked at example code on GitHub. However, for our specific case, we not only
needed to extract and parse out the URL information, but also handle filtering and formatting issues that arose, for which we wrote
some custom cleaning code. Below is a summary of some of the issues we faced in our ETL phase:
<ul>
<li>Due to the size of the data, we spent significant time studying how to reduce disk space usage when testing and accessing the data through S3</li>
<li>For our specific needs, we faced some issues removing duplicate URLs while not losing information on parent-child relationships</li>
<li>Not all web pages had English characters, so we had to look up encoding standards to avoid slowdowns in Spark</li>
<li>We had issues with deciding how best to filter down the massive list of domains to a few domains that we care about -- for example, how
to subset the data such that we can study relationships between specific domain names.
</li>
</ul>
</p>
<h2>Visualization</h2>
<p>
Visualizing large graphs (with hundreds of thousands of nodes or more) is quite a challenge. In most cases, it doesn't make sense to visualize
this many nodes and edges all at once, since a lot of the relationships are obscured in such a large and complex graph. There are quite a few
graph visualization tools out there, and we faced some problems in deciding which to use. In the end, we chose D3 because of its ease of
use, great documentation, high aesthetic appeal and web-rendering capabilities.
</p>
<p>
In addition to the size of the data, the layout technique we use, called "force-directed clustering" (described in more detail in the next section),
is implemented in D3 in a way that scales with <i>O(n log n)</i> per iteration <a href="https://stackoverflow.com/a/7811354/1194761">
(see this thread)</a>. This makes rendering anything above 50,000 nodes very expensive, and it would take quite a long time to design
a system that performs the rendering on the server side rather than the client side. An alternative is to use a tool
like <a href="https://gephi.org/">Gephi</a>; however, it does not have web-based rendering capabilities and came with its own learning curve
and implementation difficulties.
</p>
<p>
In the end, we came to a compromise with respect to data size, aesthetics and ease of deployment on a web-interface, and chose D3 for our specific
case.
</p>
<h1>Results</h1>
<p>
Label propagation analysis begins by assigning a unique random label to each and every domain name (which form nodes in our graph).
As the LPA iterations progress, each node takes on the label of its neighbors based on the similarity of their children. The
intrinsic nature of the parent-child relationships results in the labels achieving "convergence", i.e. after a few iterations, all nodes
that are similar take on the labels of their neighbors, and once clusters are formed, the labels do not update any further.
Once we run our LPA analysis using PySpark GraphFrames, we pipe a condensed version of the output to a JSON file, which is used
as input to D3 for graph visualization.
</p>
<p>
D3 has some very useful utilities to help visualize graph structures. One technique that proved useful in our analysis was
<a href="https://github.com/d3/d3-force">"force-directed graphs"</a>. In this approach, we run a simple physics emulator for
positioning the vertices of the graph visually such that "forces" exist between elements. Elements that are closely connected in
the graph (i.e. share the same label from LPA) attract one another, whereas elements that are very different in their labels repel
one another. Visualizing the graph this way allows us to more easily pick up and identify communities of interest.
For example, domain names like "facebook.com" and "twitter.com" are more heavily connected to other domains, and these vertices
form central points in the graph, which is easily visible upon preliminary inspection.
</p>
<p>
D3 utilizes the label information from the LPA and runs a "force simulation", where nodes with similar labels "attract" one another,
and nodes with very different labels "repel" one another. This is a good way to visualize how similar or different two clusters are:
clusters that are close to one another are very similar, and those that are far apart (or are not connected at all) are very different.
</p>
<h2>Example: Shopping Domains</h2>
<p>
One case that we looked at involved studying the relationship between shopping website domains. Consider three shopping domains:
Amazon, eBay and Etsy. These are three well-known online vendors that sell a variety of products. We wanted to explore how these particular
domain names form clusters, and if so, how they are connected to each other. Our reasoning is that small-scale vendors that sell their
products on Etsy would not always do so on Amazon, and this should be apparent when looking at the clusters.
</p>
<div class="lpa row">
<p><b> Shopping Domains (etsy, ebay, amazon)</b></p>
<div class="column">
<span>May 2018</span><br/>
<img src="public/images/shopping-lpa-may.png" style="width:630px;height:500px"><br/>
<span>To view an interactive display of this plot, click <a href="http://mrafayaleem.com/community-clusters/public/index-shopping-may.html" target="_blank"><b>here</b></a></span>
</div>
<div class="column">
<span>October 2018</span><br/>
<img src="public/images/shopping-lpa-oct.png" style="width:630px;height:500px;margin-top:8px"><br/>
<span>To view an interactive display of this plot, click <a href="http://mrafayaleem.com/community-clusters/public/index-shopping-oct.html" target="_blank"><b>here</b></a></span>
</div>
</div><br/>
<p>
We show the results for the shopping domains above (fully interactive animations are linked below each image); the data for the two cases
was crawled a few months apart in 2018. It is clear that Amazon forms the biggest
cluster, which makes sense considering Amazon's scale of web presence and the sheer number of online vendors who link to Amazon. What is
interesting, however, is that the domain names that cluster around Etsy are quite distinct, highlighting that our graph is capturing something inherent
in the real world - in this case, vendors who sell their products on Etsy are not the same ones that also sell on Amazon.
</p>
<p>
Another observation can be made by following the node connections from Etsy's root node in the graph and seeing how it connects to neighboring
clusters. In the month of October, we can see a couple of reseller web domains that form a connection between Etsy and eBay.
This is interesting because it gives us information that we did not have otherwise - learning the names of resellers purely through a graph visualization
is a different way of gaining insights on data that is otherwise unused. One can easily imagine a use-case where a large graph mining operation
could be used to decide marketing or investment strategies based on graphical analysis and visualization techniques.
</p>
<h2>Example: News Domains</h2>
<p>
Another interesting use-case is to analyze the web content and connectivity between news domains. In this case, we choose several major US
news outlets and see how their web presence is linked with one another. News organizations link to each other
through a variety of stories, so we would expect these clusters to be more tightly mixed up.
</p>
<div class="lpa row">
<p><b> News Domains (cnn, foxnews, huffingtonpost, washingtonpost, nytimes, usatoday)</b></p>
<div class="column">
<span>May 2018</span><br/>
<img src="public/images/news-lpa-may.png" style="width:630px;height:500px"><br/>
<span>To view an interactive display of this plot, click <a href="http://mrafayaleem.com/community-clusters/public/index-news-may.html" target="_blank"><b>here</b></a></span>
</div>
<div class="column">
<span>October 2018</span><br/>
<img src="public/images/news-lpa-oct.png" style="width:630px;height:500px;margin-top:8px"><br/>
<span>To view an interactive display of this plot, click <a href="http://mrafayaleem.com/community-clusters/public/index-news-oct.html" target="_blank"><b>here</b></a></span>
</div>
</div><br/>
<p>
The clusters formed by analyzing news domain URLs indeed show the deep connectivity one might expect based on what we observe in the real world. Clusters for
certain domains such as the New York Times and the Washington Post end up close to one another. CNN's parent domain appears quite far away from Fox News,
as one might expect; however, CNN's money (finance) page appears closer to Fox News. On inspecting these kinds of connections over longer periods of time
(and with much larger amounts of data), it could be possible to gain deeper insights from such visualizations.
</p>
<p>
A similar technique can be extended to a large list of political blogs or fashion web pages to see if these connections reveal some inherent property
of society and politics. Graphs are indeed useful tools for studying these kinds of relationships.
</p>
<h2>Example: University Domains</h2>
<p>
To see how our technique applies to a range of domains, we also tried a case that studied university domains, as shown below. Universities are fundamentally
different in the sense that they do not link to each other as frequently as news domains might. In addition, the traffic to university web pages and the
number of external links that point to a university web page may not be large enough to form distinct clusters. On analyzing a subset of universities on the
US east coast, this is indeed what we observe. The clusters formed are far less dense, meaning that nowhere near as many web pages point
to university domains as to other domains, such as shopping or news. Some of this could be attributed to the fact that we are only looking at a subset of the
crawl data; however, there might be some inherent pattern behind the clusters we visualize.
</p>
<div class="lpa row">
<p><b> University Domains (harvard, princeton, nyu, cornell, dartmouth, brown, yale)</b></p>
<div class="column">
<span>May 2018</span><br/>
<img src="public/images/university-lpa-may.png" style="width:630px;height:500px"><br/>
<span>To view an interactive display of this plot, click <a href="http://mrafayaleem.com/community-clusters/public/index-university-may.html" target="_blank"><b>here</b></a></span>
</div>
<div class="column">
<span>October 2018</span><br/>
<img src="public/images/university-lpa-oct.png" style="width:630px;height:500px;margin-top:8px"><br/>
<span>To view an interactive display of this plot, click <a href="http://mrafayaleem.com/community-clusters/public/index-university-oct.html" target="_blank"><b>here</b></a></span>
</div>
</div>
<h3>PageRank</h3>
<p>
In addition to running LPA on the data, we also apply the PageRank algorithm to find the most dominant communities in each month's analysis.
Since we are looking at TLDs and not parent domains, we are able to obtain some useful geographical information on domains from multiple countries.
</p>
<div class="pagerank row">
<p><b> Shopping Domains (etsy, ebay, amazon)</b></p>
<div class="column">
<span>May 2018</span>
<img src="public/images/shopping-rank-may.png" style="width:100%;height:100%;margin-top:8px">
</div>
<div class="column">
<span>October 2018</span>
<img src="public/images/shopping-rank-oct.png" style="width:100%;height:100%;margin-top:8px">
</div>
</div>
<div class="pagerank row">
<p><b> News Domains (cnn, foxnews, huffingtonpost, washingtonpost, nytimes, usatoday)</b></p>
<div class="column">
<span>May 2018</span>
<img src="public/images/news-rank-may.png" style="width:100%;height:100%;margin-top:8px">
</div>
<div class="column">
<span>October 2018</span>
<img src="public/images/news-rank-oct.png" style="width:100%;height:100%;margin-top:8px">
</div>
</div>
<div class="pagerank row">
<p><b> University Domains (harvard, princeton, nyu, cornell, dartmouth, brown, yale)</b></p>
<div class="column">
<span>May 2018</span>
<img src="public/images/university-rank-may.png" style="width:100%;height:100%;margin-top:8px">
</div>
<div class="column">
<span>October 2018</span>
<img src="public/images/university-rank-oct.png" style="width:100%;height:100%;margin-top:8px">
</div>
</div>
<h1>Summary and Further Work</h1>
<p>
Common Crawl is a highly valuable open data source covering many aspects of the internet, and it has the potential to yield many useful insights beyond
the aspects we looked at in this project. Over the course of our work, we had to learn about various aspects of big data systems, cloud storage, encoding standards,
and how to extract relevant and meaningful data from large datasets. We had to be creative about how to visualize the graphs in an aesthetically pleasing way
while yielding useful visual insights through an interactive web interface. We used a lot of example code that is available for similar tasks, but customized it to suit
our goals and published our code openly on GitHub.
</p>
<p>
There are a number of areas where our work can be extended and refined. We hope to look into better ways to perform upstream data extraction and filtering
that further minimize the amount of data stored locally. Additional ways to identify parent-child relationships in the web URLs, and the use of metadata from
the WARC files, could also be explored. More user-interface features could be added to make the entire pipeline smoother and easier to adapt to specific tasks.
For example, an alternate data pipeline that utilizes graph databases (like <a href="https://neo4j.com/">Neo4j</a>) could massively speed up the querying capability and
allow us to filter down the domain names to our communities of interest more efficiently. With regard to visualization, there are numerous other tools out there that
could possibly handle large graph loads better than our approach, such as <a href="http://www.graphviz.org/">Graphviz</a> and <a href="http://sigmajs.org/">Sigma.js</a>.
The graph processing and visualization space is continuously evolving, and we aim to keep a close watch on these tools and techniques for our future careers.
</p>
<h1>Summary Score</h1>
<p>
Based on the project requirements, we include the below self-evaluation of our efforts for this project. We spent a good deal of effort in developing an ETL pipeline
using a combination of S3, boto3 and parquet formats to minimize local storage requirements. We also looked at how our approach would scale for much
larger datasets, and implemented our PySpark routines accordingly. Along the way, we learned a good deal about technologies such as Spark GraphFrames, S3, and interactive
visualization techniques for graphs in D3.
</p>
<table style="width:35%">
<tr><th>Data acquisition</th><th>3</th></tr>
<tr><th>ETL</th><th>3</th></tr>
<tr><th>Problem Motivation</th><th>2</th></tr>
<tr><th>Algorithmic work</th><th>1</th></tr>
<tr><th>Bigness/parallelization</th><th>3</th></tr>
<tr><th>UI</th><th>3</th></tr>
<tr><th>Visualization</th><th>3</th></tr>
<tr><th>Technologies</th><th>2</th></tr>
<tr><th><i>Total</i></th><th>20</th></tr>
</table>
</body>
</html>