arXiv:1701.00072v1 [cs.DB] 31 Dec 2016

Graph or Relational Databases: A Speed Comparison for Process Mining Algorithm

Jeevan Joishi, Ashish Sureka
Indraprastha Institute of Information Technology, Delhi (IIITD), New Delhi, India
ABB Corporate Research, Bangalore, India

Abstract

Process-Aware Information Systems (PAIS) are IT systems that manage and support business processes and generate large event logs from the execution of those processes. An event log is represented as a tuple of the form (CaseID, TimeStamp, Activity, Actor). Process Mining is an emerging area of research that deals with the study and analysis of business processes based on event logs. It aims to analyze event logs in order to discover business process models, enhance them, or check their conformance with an a priori model. The large volumes of event logs generated are stored in databases. Relational databases perform well for certain classes of applications, but there are other classes of applications for which they do not scale. A number of NoSQL databases have emerged to counter these scalability challenges. Discovering social networks from event logs is one of the most challenging and important Process Mining tasks. The Similar-Task and Sub-Contract algorithms are among the most widely used Organizational Mining techniques. Our objective is to investigate which type of database (relational or graph) performs better for Organizational Mining under Process Mining. Process Mining and graph databases can be brought together by modelling these Organizational Mining metrics in a graph database. We implement the Similar-Task and Sub-Contract algorithms on relational and NoSQL (graph-oriented) databases using only query language constructs. We conduct an empirical analysis on a large real-world dataset to compare the performance of a row-oriented database and a NoSQL graph-oriented database.
We benchmark performance factors such as query execution time, CPU usage and disk/memory space usage for the NoSQL graph-oriented database against the row-oriented database.

Keywords: Benchmarking, CYPHER, Graph Databases, MySQL, Neo4j, Organizational Mining, Process Mining, Performance Comparison, Relational Databases, SQL.

1 Research Motivation and Aim

PAIS such as ERP and CRM systems are IT systems that manage and support business processes. The data generated by the execution of activities within a PAIS takes the form of event logs (tuples of the form <CaseID, TimeStamp, Activity, Actor>). An event log contains information on the business process being considered (CaseID), the set of events (Activities) within that CaseID and the performer of each Activity (Actor), besides other information such as a TimeStamp and a unique identifier. Process Mining is an area of research that aims to analyze business processes based on event logs [1]. Organizations can use the insights gathered from this analysis to identify bottlenecks, if any, and to improve or enhance their business processes. For example, in the domain of Software Engineering, Gupta et al. mine bug report history to discover process maps, inefficiencies and inconsistencies [2].

Based on whether an a priori model exists or not, Process Mining is classified into three broad techniques, viz. Process Discovery, Process Conformance and Process Enhancement. Process Mining is also divided into three perspectives, viz. Control Flow, Organizational and Case, based on the type of attribute being considered from the event log [1]. The Control Flow perspective focuses on the lineage of business processes. The Organizational Mining perspective deals with techniques used to study the social structure within an organization [3], [4]. The Case perspective focuses on mining information within each process instance (CaseID). Organizational Mining is a Process Discovery technique that focuses on finding social networks between the Actors of the event log.
Various metrics for finding such sociograms are defined in [3].

Organizations have generally used Relational Databases (RDBMS) to store data. RDBMS handle tabular structures exceedingly well [5]. They generally focus on Online Transaction Processing (OLTP) applications but have not been found efficient for certain Online Analytical Processing (OLAP) applications that involve joins or analytical functions (Dense_Rank, Sum, Average, etc.) at large scale. Developers have faced problems in handling relationships with RDBMS, mostly because join-intensive queries lead to a JOIN BOMB (footnote 1). The reason is that relationships in an RDBMS can be modeled only by means of joins, so an increase in the connectedness of the data implies an increased number of joins. Join-intensive queries are an impediment to performance and scalability in a dynamic system with ever-changing business needs. Furthermore, complications arise when, in addition to modeling the relationships, we also need to weigh the strength of these relationships [5].

Recent trends in database technologies have seen the emergence of various NoSQL databases. These databases break away from the traditional one-size-fits-all philosophy employed by RDBMS and instead focus on specific use cases [6]. One such type of database is the Graph Database, built to cater to the linked data commonly found on social networking sites such as LinkedIn (footnote 2) and Facebook (footnote 3). Graph databases have emerged to address the issue of leveraging complex and dynamic relationships in highly connected data. In contrast to relational databases, where performance deteriorates as the size of the dataset increases, the performance of a graph database is expected to remain constant even as the dataset grows [5].

Footnote 1: http://neo4j.com/blog/demining-the-join-bomb-with-graph-queries/
Footnote 2: https://www.linkedin.com/
Footnote 3: https://www.facebook.com/
This is because queries are localized to a portion of the graph, so the execution time of each query depends only on the part of the graph traversed to satisfy that query rather than on the overall size of the graph.

A lot of research has been done on integrating data mining techniques directly into the DBMS [7], [8], [9]. This allows for better data management, allows primitives to be defined at the database level, and couples the applications tightly to the database. We aim to implement the Organizational Mining algorithms Similar-Task and Sub-Contract using only database language constructs, so that these applications are tightly coupled to the database. The aims of this study can be summarized as follows:

1. To investigate approaches for implementing the Similar-Task and Sub-Contract algorithms in the row-oriented database MySQL (footnote 4).
2. To examine approaches for adapting the Similar-Task and Sub-Contract algorithms to the graph-oriented database Neo4j (footnote 5).
3. To conduct a series of experiments to benchmark and compare the performance of the Similar-Task and Sub-Contract algorithms in Neo4j against MySQL.

2 Related Work and Research Contribution

In this section, we review the work most closely related to the study presented in this paper and list the novel contributions of our work in the context of existing work.

2.1 Implementation of Mining Algorithms in Relational Databases

Ordonez et al. did extensive work on implementing the k-means clustering algorithm in SQL [7]. They proposed three different SQL implementations of k-means to integrate it with an RDBMS. Experiments were performed with large numbers of clusters, efficient indexing, and optimized and rewritten queries. Ordonez et al. also presented SQL implementations of the EM algorithm that worked with high-dimensional data, a high number of clusters and very large datasets [8]. They proposed three different strategies, viz. Horizontal, Vertical and Hybrid. Ordonez et al.
proposed another SQL implementation of a clustering algorithm that merges Markov Chain Monte Carlo with the EM algorithm [9]. Sattler et al. described primitives for building and applying decision tree classifiers that were directly coupled to commercial databases and used in various classification problems [10].

Footnote 4: http://www.mysql.com/
Footnote 5: http://neo4j.com/

2.2 Implementation of Mining Algorithms in Graph Databases

Wang et al. studied structural pattern mining for large disk-based graph databases. They presented a novel ADI index structure and efficient algorithms for mining frequent patterns [11]. Wang et al. also proposed novel techniques for scalable mining on large disk-based graph databases [12]. Huan et al. presented techniques to find maximal frequent sub-graphs in graph databases [13]. Ozaki introduced the concept of hyperclique patterns to detect highly correlated sub-graphs in graph-structured databases. The approach considers a general ordering of sub-graphs and employs breadth-first and depth-first search with powerful pruning techniques based on various measures [14].

2.3 Performance Comparison between Relational and Graph Databases

Vicknair et al. compared relational databases and graph databases. Their work included recording and querying data provenance information [15]. McColl et al. evaluated the performance of a series of open-source graph databases. They used four different graph algorithms to evaluate performance on graphs of up to 256 million nodes [16]. Ciglan et al. benchmarked graph databases on graph traversal algorithms [17]. Macko et al. presented PIG, a performance introspection framework for graph databases. PIG provides techniques and tools to understand the performance of graph databases [18].

2.4 Performance Analysis of Process Mining Algorithm on other Architecture

Kundra et al.
investigate the application of parallelization to the Alpha Miner algorithm and use a Graphics Processing Unit (GPU) to run the computationally intensive parts of the algorithm in parallel. They report a maximum GPU speedup of 39-40 times over the same program run on a multi-core CPU [19]. Sachdev et al. investigate which type of database (relational or NoSQL) performs better for a Process Discovery application under Process Mining [20]. They benchmark and compare the performance of the alpha-miner algorithm on a row-oriented database and a NoSQL column-oriented database [20][21]. Anand et al. propose a Utility-Based Fuzzy Miner (UBFM) algorithm to efficiently mine a process model driven by a utility threshold, and conduct an experimental analysis to show the performance of the process mining algorithm on relational databases [22].

Table 1: Event Log

  CaseID  Activity  Actor
  1       A         Matt
  2       A         Matt
  1       B         Britney
  1       E         Matt
  2       E         Matt
  2       B         Britney
  3       A         Brad
  3       E         Matt
  4       A         Brad
  5       A         Brad
  3       B         Brad
  4       B         Britney
  4       E         Brad
  6       A         Brad
  5       B         Joan
  6       C         Joan
  5       E         Brad
  1       D         George
  6       D         George

Table 2: Actor-Activity Matrix

           A  B  C  D  E
  Matt     2  0  0  0  3
  Britney  0  3  0  0  1
  Brad     4  1  0  0  1
  Joan     0  1  1  0  0
  George   0  0  0  2  0

2.5 Novel Contributions

In the context of existing work, the study presented in this paper makes the following novel contributions. The work presented in this paper is an extended and detailed version of the paper by Joishi et al. [23].

1. While there has been work on implementing data mining algorithms in row-oriented databases, we are the first to implement Organizational Mining algorithms in relational databases.
2. While data mining algorithms such as frequent pattern mining have been implemented in graph databases, we believe we are the first to implement Organizational Mining algorithms in graph databases.
3.
We conduct a series of experiments to benchmark and compare the performance of Organizational Mining algorithms on graph databases against relational databases.

3 Similar-Task and Sub-Contract Algorithm

An example of an event log is shown in Table 1. Each row of the table is an event with a CaseID, the corresponding Activity and the Actor performing that Activity. We refer readers to [3], [4] for a better understanding of Organizational Mining metrics.

3.1 Similar-Task Algorithm

The Similar-Task algorithm, which comes under Organizational Mining, is a metric based on joint activities. It does not consider how individuals work together on shared cases but focuses on the activities they perform [3]. Similar-Task aims at finding the similarity between Actors based on the intersection of their Activities. The idea is that individuals performing similar tasks are more closely related to each other than individuals performing different tasks [3]. Similarity can be calculated using Cosine-Similarity, the Pearson Correlation Coefficient, Hamming Distance, etc. Based on the previous literature, we present the following adaptation of the Similar-Task algorithm.

The input to the Similar-Task algorithm is a 2-dimensional matrix that contains the frequency of each activity performed by each actor. This matrix is commonly referred to as the Actor-Activity Matrix. An example is shown in Table 2. For instance, Matt performs activity A twice and activity E thrice, and has no involvement in activities B, C and D. In this paper, we use Cosine-Similarity as the metric for measuring similarity between Actors based on the Activities they perform. Table 3 gives the similarity values between Actors obtained with Algorithm 1.

Algorithm 1: Similar-Task Algorithm
  Data: Actor-Activity Matrix (M)
  Result: Matrix with similarity values between Actors
  1 Get the number of rows of M into m.
  2 Get the number of columns of M into n.
  3 D[m][m] = Declare square matrix to store results.
  4 foreach i = 1 to m - 1 do
  5     P = Vector corresponding to the ith row.
  6     foreach j = i + 1 to m do
  7         Q = Vector corresponding to the jth row.
  8         Apply Cosine Similarity between the ith and jth rows:
                \cos(P, Q) = \frac{P \cdot Q}{\lVert P \rVert \, \lVert Q \rVert}    (1)
  9         Set D[i][j] = similarity value obtained in Step 8.

3.2 Sub-Contract Algorithm

Sub-Contract is another Organizational Mining metric, based on causal dependencies between Actors in carrying out a business process [3]. The Sub-Contract algorithm tries to find the number of times individual j executes its task in between two activities performed by individual i [3]. It considers dependencies between activities in the process model through a parameter commonly referred to as the causality fall factor (β). These dependencies can be obtained using Process Discovery techniques such as the α-miner algorithm. The Sub-Contract algorithm also considers direct or indirect succession (depth) between Actors, and whether sub-contraction between Actors occurs once or multiple times.

The Sub-Contract algorithm presented in Algorithm 2 considers indirect succession and multiplicity while ignoring dependencies between activities. Each ProcessInstance corresponds to a Case Identifier (CaseID) in the event log. The AuditTrailEntryList constitutes all the events pertaining to a particular CaseID. An AuditTrailEntry refers to an individual event [4]. For example, considering the events pertaining to Case 1 in Table 1, we have a sub-contraction between Matt and Britney: Matt performs an activity, Britney performs the next one, and Matt then performs another. The matrix entry corresponding to Matt and Britney is updated in m, followed by an update in D. The final result shown in Table 4 is obtained after all such sub-contractions have been identified from all cases in the event log.
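The Case 1 example can be traced with a small in-memory sketch. This is not the authors' SQL or Cypher implementation; it is one reading of the Sub-Contract pseudocode, written under stated assumptions: the name `subcontract_counts` is hypothetical, the fall factor is called `beta` with a default of 0.5 (the base that appears in the paper's Cypher snippet), each window size `k` is weighted by `beta ** (k - 2)`, and loop upper bounds are treated as exclusive.

```python
from collections import defaultdict


def subcontract_counts(case_actors, beta=0.5, depth=1):
    """Return (scores, normal) for the ordered actors of a single case.

    scores maps (contractor, subcontractor) to an unnormalized weight;
    normal is the sum of the weights used, for later normalization.
    """
    size = len(case_actors)
    scores = defaultdict(float)
    normal = 0.0
    if size < 3:
        # Fewer than three events cannot contain a sub-contraction.
        return scores, normal

    min_k = size if size < depth else depth + 1
    min_k = max(min_k, 3)

    for k in range(2, min_k):  # assumption: upper bound is exclusive
        weight = beta ** (k - 2)
        normal += weight
        seen = set()  # plays the role of the 0/1 matrix m
        for i in range(size - k):
            if case_actors[i] == case_actors[i + k]:
                # Every actor strictly between positions i and i + k
                # is recorded as working on behalf of case_actors[i].
                for j in range(i + 1, i + k):
                    seen.add((case_actors[i], case_actors[j]))
        for pair in seen:
            scores[pair] += weight
    return scores, normal


# Case 1 of Table 1, in event order: A (Matt), B (Britney), E (Matt), D (George).
case1 = ["Matt", "Britney", "Matt", "George"]
case1_scores, case1_normal = subcontract_counts(case1)
```

With `depth=1` only direct succession (`k = 2`) is examined, so the single detected pair is Matt followed by Britney followed by Matt. Summing `scores` and `normal` over all cases and dividing would give a normalized matrix; no attempt is made here to reproduce the values in Table 4.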
Table 3: Cosine-Similarity Values

           Matt   Britney  Brad   Joan  George
  Matt     -      -        -      -     -
  Britney  0.263  -        -      -     -
  Brad     0.719  0.298    -      -     -
  Joan     0.00   0.671    0.167  -     -
  George   0.00   0.00     0.00   0.00  -

Table 4: Sub-contract values

           Matt  Britney  George  Brad  Joan
  Matt     0     0        0.22    0     0
  Britney  0.22  0        0       0.22  0
  George   0     0        0       0     0
  Brad     0     0        0       0     0
  Joan     0     0        0       0.22  0

Algorithm 2: Sub-Contract Algorithm
  Data: β, depth, Len, Log
  Result: Normalized 2D Matrix D with subcontract values between Actors
   1 Declare Square Matrix D of size Len*Len. Initialize all elements to 0.
   2 Declare and initialize variable normal to 0.
   3 foreach ProcessInstance pi in the log do
   4     Get AuditTrailEntryList ates for pi.
   5     if size(ates) < 3 then
   6         continue to the next ProcessInstance, pi.
   7     Declare and initialize minK to 0.
   8     if size(ates) < depth then
   9         set minK = size(ates).
  10     else
  11         set minK = depth + 1.
  12     if minK < 3 then
  13         set minK = 3.
  14     foreach k := 2 to minK do
  15         Update normal by β^(k-2).
  16         m = Square matrix of Len*Len.
  17         foreach i := 0 to size(ates) - k do
  18             ate_i = get AuditTrailEntry at position i.
  19             ate_ik = get AuditTrailEntry at position i + k.
  20             if Actor(ate_i) = Actor(ate_ik) then
  21                 foreach j := i + 1 to i + k do
  22                     ate_j = get AuditTrailEntry at position j.
  23                     row = get row-position for Actor(ate_i).
  24                     col = get column-position for Actor(ate_j).
  25                     For valid (row, col) set m[row][col] = 1.
  26         foreach i := 0 to Len do
  27             foreach j := 0 to Len do
  28                 set D[i][j] = D[i][j] + m[i][j] * β^(k-2).
  29 Return NormalizedMatrix D. // divide each value by normal.

Algorithm 3: Similar-Task Algorithm in Graph Database
  Data: Actor-Activity Graph
  Result: Graph with similarity values between Actors
  1 Ai = Get an Actor 'i' from the Actor-Activity Graph.
  2 Aj = Get another Actor 'j' from the Actor-Activity Graph.
  3 Find intersecting Activities between Ai and Aj.
  4 Collect frequencies of Activities from the edges of intersecting Activities.
  5 Apply Cosine-Similarity with the values obtained in Step 4.
  6 Set [:SIMILARITY] between Ai and Aj with the value obtained in Step 5.
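As a cross-check on Table 3, the following Python sketch applies the matrix form of the Similar-Task computation (Algorithm 1) to the frequencies printed in Table 2. It is an in-memory illustration, not the SQL or Cypher implementation described in this paper; the names `AA_MATRIX`, `cosine` and `similar_task` are hypothetical, and the cosine is taken over the full row vectors of each Actor.

```python
import math

# Frequencies copied from Table 2 (columns A, B, C, D, E).
AA_MATRIX = {
    "Matt":    [2, 0, 0, 0, 3],
    "Britney": [0, 3, 0, 0, 1],
    "Brad":    [4, 1, 0, 0, 1],
    "Joan":    [0, 1, 1, 0, 0],
    "George":  [0, 0, 0, 2, 0],
}


def cosine(p, q):
    """Cosine similarity of two equal-length frequency vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def similar_task(matrix):
    """Compare every unordered pair of actors once (i < j)."""
    actors = list(matrix)
    result = {}
    for i in range(len(actors) - 1):
        for j in range(i + 1, len(actors)):
            pair = (actors[i], actors[j])
            result[pair] = cosine(matrix[actors[i]], matrix[actors[j]])
    return result


similarities = similar_task(AA_MATRIX)
for (a, b), value in similarities.items():
    print(f"{a:8s} {b:8s} {value:.3f}")
```

Rounded to three decimals, the pairs Matt/Britney, Matt/Brad, Britney/Brad, Britney/Joan and Brad/Joan come out as 0.263, 0.719, 0.298, 0.671 and 0.167, matching Table 3; every pair involving George is 0 because George shares no activity with the other Actors.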
Figure 1: Similar-Task implementation flow in RDBMS

4 Implementation of Algorithms on RDBMS

Due to limited space, we present only a few segments of our implementation. The entire code and implementation can be downloaded from our website (footnote 6).

4.1 Similar-Task Algorithm

The typical steps involved in implementing the Similar-Task algorithm in an RDBMS are shown in Fig. 1. We import the event log dataset into a table, dataset. A stored procedure creates the Actor-Activity matrix (a table in MySQL) from the dataset table. Another stored procedure uses the Actor-Activity matrix (AAMatrix) to calculate cosine-similarity, and the similarity values are collected in a Result Matrix. The SQL implementation of the Similar-Task algorithm involves Create, Read, Update, Delete (CRUD) statements (footnote 7). We define these statements either as single ad hoc SQL queries or as part of stored procedures.

1. To create the Actor-Activity matrix, we define a stored procedure that takes the table dataset as its input parameter.
   (a) We collect all distinct Activity values from the table dataset using a cursor (footnote 8).
   (b) We create a table AAMatrix to store the frequency of each Activity performed by the Actors. AAMatrix's schema is of the form (Actor, Activity1, Activity2, ...), where Actor is of type VARCHAR and is the PRIMARY KEY, and Activity1, Activity2, etc. are the Activities collected by the cursor and are of type INT.

Footnote 6: http://goo.gl/wMyUOS
Footnote 7: http://dev.mysql.com/doc
Footnote 8: http://dev.mysql.com/doc/refman/5.0/en/cursors.html

2. We populate the Actor-Activity matrix using INSERT and IF statements inside the stored procedure. For every (Actor, Activity) pair that is found, the corresponding value in AAMatrix is incremented by one.

       COUNT(IF (ACTIVITY='ACTIVITY1', 1, NULL))
       COUNT(IF (ACTIVITY='ACTIVITY2', 1, NULL))

   This combination of COUNT and IF expressions is used within an INSERT statement to populate AAMatrix.
3. Cosine-Similarity is calculated by another stored procedure that takes the Actor-Activity Matrix as its input parameter.
   A table InitSim with schema (SOURCEACTOR, TARGETACTOR, SIMILARITY) is created to store the similarity values as they are calculated. SOURCEACTOR and TARGETACTOR are of type VARCHAR, while SIMILARITY is of type DOUBLE. A join is applied to two instances of AAMatrix, and cosine-similarity is calculated for each pair of distinct Actors.

       AAMatrix T1 JOIN AAMatrix T2
       WHERE T1.ACTOR <> T2.ACTOR

   The similarity values obtained from calculations on the join are ordered first by T1.ACTOR and then by T2.ACTOR, and inserted into InitSim.
4. We create another table FinalSim with schema (SOURCEACTOR, ACTOR1, ACTOR2, ...) and populate it using values from InitSim. The schema is also created using cursors in the stored procedure, where each distinct Actor forms a column of the table FinalSim. The table FinalSim is populated in the same way as in Step 2.

4.2 Sub-Contract Algorithm

Like the Similar-Task algorithm, the implementation of the Sub-Contract algorithm involves ad hoc SQL queries and queries built dynamically in stored procedures. The implementation flow closely resembles Fig. 1 and is not shown here.

1. We import the event log dataset into a table, also named dataset, with schema (ID, CaseID, Actor, Activity), where ID is an auto-incrementing field and CaseID is the case identifier selected from the dataset. Actor and Activity are self-explanatory. For an efficient implementation of the algorithm, the data is re-ordered so that all events corresponding to a CaseID are stored together, in ascending order. We define a secondary table named organiseddata to store this ordered information.
2. Sub-contraction can only be detected if there are at least three (3) events in a particular CaseID. Joins are applied only when this criterion is met. Since each event in the table is assigned a unique, auto-incrementing ID, the two events of the actor responsible for a sub-contraction will always have an ID difference of at least 2.
   The following SQL snippet joins the table with itself for each CaseID to find the IDs of the actors responsible for sub-contraction.

       organiseddata AS T1 JOIN organiseddata AS T2
       ON T2.ID >= T1.ID + 2
       AND T1.ACTOR = T2.ACTOR
       AND T1.ACTIVITY <> T2.ACTIVITY
       ORDER BY DIFF ASC

3. Once the IDs of the Actors responsible for sub-contraction have been found in Step 2, all intermediate IDs are collected and their sub-contraction strength is calculated. We create a table RESULTTABLE (PERFORMER, ACTOR1, ACTOR2, ...) to store the sub-contraction values. Here, PERFORMER is the actor for whom sub-contraction is being considered, and ACTOR1, ACTOR2, etc. are the other Actors, whose columns are created dynamically by a stored procedure.

5 Implementation of Algorithms in Neo4j

5.1 Similar-Task Algorithm

The steps involved in implementing the Similar-Task algorithm in Neo4j (footnote 9) are shown in Fig. 2. Fig. 2(a) depicts how Actor-Activity information is maintained in Neo4j, while Fig. 2(b) presents a typical view of the graph after similarity calculation. We present the Similar-Task algorithm adapted for a graph database in Algorithm 3.

Figure 2: Similar-Task implementation flow in Graph Database. (a) Actor-Activity information in Graph Database. (b) Similarity values in Graph Databases.

Footnote 9: http://neo4j.com/docs/

1. We create nodes and relationships such that the Actor-Activity information is calculated and stored directly during dataset import. We create only unique Actor and Activity nodes and merge the relationship between them on any repetition. The relationship [:PERFORMS] connects an Actor node to an Activity node and carries a property times that records how often that Actor performed the Activity.
2. The calculation of Cosine-Similarity in Neo4j comprises three steps.
   (a) All intersecting Activities between a pair of Actors are found.
   (b) Using the value of the times property of the [:PERFORMS] relationships for all the intersecting activities found in Step 2(a), cosine-similarity is calculated.
   (c) The cosine-similarity value thus obtained is stored in the property similarity of the relationship [:SIMILARITY] between the Actors under consideration.

       MATCH (p1:Actor)-[x:PERFORMS]->(m:Activity)<-[y:PERFORMS]-(p2:Actor)
       WITH SUM(x.times * y.times) AS xyDotProduct,
            SQRT(REDUCE(xDot = 0.0, a IN COLLECT(x.times) | xDot + a^2)) AS xLength,
            SQRT(REDUCE(yDot = 0.0, b IN COLLECT(y.times) | yDot + b^2)) AS yLength,
            p1, p2
       MERGE (p1)-[s:SIMILARITY]-(p2)
       SET s.similarity = xyDotProduct / (xLength * yLength)

5.2 Sub-Contract Algorithm

We implement the Sub-Contract algorithm in the graph database using an approach similar to the one shown in Fig. 2. Unlike the Similar-Task implementation in CYPHER, however, only the CaseID is made unique, whereas Actor and Activity information is created for each event in the event log.

1. We create unique 'CASE' nodes for distinct CaseIDs. These Case nodes store information such as the case name and an incrementing counter, the occurrence ID (OccID), whose value increases as new Actor nodes for that CaseID are added. Within each CaseID, Actor nodes are created with information such as the actor name, the OccID (taken from the Case node) and the activity performed. Case nodes and Actor nodes are connected via the [:CONTAINS] relationship.
2. The second step involves finding the Actors responsible for sub-contraction. Note that OccIDs are always assigned in ascending order, and only Actors with the same name but a different Activity and OccID are responsible for sub-contraction. For each CaseID,
   (a) We find the OccIDs of the Actors responsible for sub-contraction.
   (b) We collect all intermediate OccIDs between the OccIDs found in Step 2(a).
   (c) A relationship [:RELATED_TO] is created from the starting Actor node found in Step 2(a) to the nodes of all intermediate OccIDs found in Step 2(b). A property value is set to 1 for use in subsequent steps.
       WITH commActorPath, n, (Actor2.OccID - Actor1.OccID) as sepDist
       WITH RANGE(head(nodes(commActorPath)).OccID+1,
                  last(nodes(commActorPath)).OccID-1) as intermediateIDs,
            n, head(nodes(commActorPath)).OccID as startID, sepDist
       UNWIND intermediateIDs as endID
       MATCH (person1:PERSON {OccID:startID})<--(n)-->(person2:PERSON {OccID:endID})
       MERGE (person1)-[:RELATED_TO {value:1, length:sepDist}]->(person2)

   We define commActorPath as the path used to find the Actors (with a common name) responsible for sub-contraction. The RANGE function collects all the intermediate IDs, which are then used to connect the sub-contracting Actors.
3. Sub-contraction strength between two Actors can only be set once the Actor nodes have been made unique. To do so, we create UNIQUEACTOR nodes for all distinct Actor names in the database.
4. The final step of the algorithm involves setting the sub-contract strength between UNIQUEACTOR nodes. For each CaseID,
   (a) We collect the start node and end node of each [:RELATED_TO] relationship.
   (b) We look up the UNIQUEACTOR nodes corresponding to the start node and end node found in Step 4(a).
   (c) We establish sub-contraction between the two UNIQUEACTOR nodes using a [:SUBCONTRACT] relationship with a property strength, whose value is updated using the value property of the [:RELATED_TO] relationship.

       MATCH (n)-[:CONTAINS]->()-[r:RELATED_TO]->()<-[:CONTAINS]-(n)
       MERGE (p:UNIQUEACTOR {name:startNode(r).name})-[rf:SUBCONTRACT]->(q:UNIQUEACTOR {name:endNode(r).name})
       SET rf.strength = CASE WHEN rf.strength IS NULL THEN r.value
                         ELSE rf.strength + (0.5^(l-2))*r.value END

6 Experimental Dataset

We conduct experiments on a large, publicly available real-world dataset downloaded from Business Process Intelligence 2014 (BPI 2014) (footnote 10). The dataset contains information on the Information Technology Infrastructure Library (ITIL) process of Rabobank Information and Communication Technology (ICT). ITIL is a process for addressing customer grievances regarding disruptions in ICT services.
A Service Desk Agent records the complete information about the problem in an Interaction record. We choose the 'Detail Incident Activity' log for our experiments. The dataset contains 466,737 records, and out of its seven fields we use the following three:

1. Incident_ID: The unique ID of a record in the Service Management tool. It is represented as CaseID in our data model.
2. IncidentActivity_Type: Identifies which type of activity takes place.
3. Assignment_Group: The team responsible for an activity.

Footnote 10: http://www.win.tue.nl/bpi/2014/start

Table 5: Number of Unique Actors per dataset size

  Dataset Size  Unique Actors
  65,000        150
  101,000       158
  219,500       220
  300,000       229
  466,737       242

Table 6: Data Load Time (Similar-Task)

  Unique Actors  Load Time (msec)
                 MySQL  Neo4j
  150            2467   3413
  158            2875   3362
  220            5966   4354
  229            5850   5877
  242            7819   6875

Figure 3: Data Load Time for Similar-Task Algorithm

7 Benchmarking and Performance Comparison

We conduct a series of experiments on the implementations of the Similar-Task and Sub-Contract algorithms. Our benchmarking system consists of an Intel Core 2 Duo (3M Cache, 2.1 GHz), 4 GB of DDR3 RAM and a 320 GB hard disk drive. We use Windows 8.1 with single-node setups of MySQL 5.6 and Neo4j 2.14. We ensure that only minimally required services are running during the analysis. We conduct experiments on a warmed-up cache, and the values recorded are an average over five runs of the implementations. To study scalability, we divide our event log dataset into five chunks of increasing size and conduct experiments that take into consideration both the size of the dataset and the number of unique actors in each chunk. Table 5 presents the statistics for each of these chunks.

7.1 Similar-Task Algorithm

Table 6 and Fig. 3 show the load time across different dataset sizes. The load time includes loading data into a table and processing it to generate and populate an Actor-Activity matrix.
The load time varies with the number of unique actors present in each chunk. We observe that both databases give similar performance. However, as the number of unique actors increases, Neo4j gives better load time performance and is seen to be about 1.25 times faster than MySQL. We believe that Neo4j's better performance is due to the fact that only unique Actor and Activity nodes are created during data import. MySQL, on the other hand, has a predefined schema in which the number of columns equals the number of distinct activities in the dataset. Hence, even if an actor has not performed an activity, a value (albeit zero or a default) has to be set in the respective column. Neo4j defines relationships only when they are discovered and thus gives better performance.

The core of the Similar-Task algorithm is the similarity calculation between Actors (Step 8 of Algorithm 1) and the update of the result table (Step 9 of Algorithm 1). Table 7 and Fig. 4 show the time taken to calculate cosine-similarity and to update the result in the Similar-Task algorithm as a function of the number of unique actors for the different dataset chunks (given in Table 5). It is interesting to note that the execution time for cosine-similarity calculation in MySQL is about 32 times better than in Neo4j. For write operations too, MySQL slightly outperforms Neo4j, by a factor of about 1.5.

Figure 4: Execution Time for Step-8 and Step-9 (Similar-Task). (a) Execution time for cosine-similarity calculation. (b) Time taken to update results.

Table 7: Execution Time for Step-8 and Step-9 (Similar-Task)

  Unique Actors  Execution Time (msec)
                 Step-8          Step-9
                 MySQL  Neo4j    MySQL  Neo4j
  150            225    9616     1907   2403
  158            372    11700    2844   2925
  220            713    14655    6292   3664
  229            903    29520    6703   7380
  242            1403   48891    8453   12223

We believe this is because the data needed for cosine-similarity calculation in MySQL is already available in tables, so fetching it only requires advancing a pointer to the next row.
Graph databases like Neo4j are not known to offer comparable constructs for matrices. Calculations in Neo4j are based on reading property values defined on nodes and relationships. Cosine-similarity computation in Neo4j requires matching the intersecting activities between the two actors concerned and collecting property values from the relationships connecting these intersecting activities, followed by the actual computation. It is for this reason that we see a sharp rise in cosine-similarity calculation time in Neo4j. Fig. 4(b) gives an estimate of the time required to update results. We observe that setting properties in Neo4j is more time consuming because existing relationships need to be merged with the updated property values, or new ones created if no such relationship exists. Updating results in MySQL, by contrast, only consists of updating values in the respective columns of an already defined table.

Figure 5: Disk Space Usage in Similar-Task algorithm. (a) Disk Space Usage for MySQL tables. (b) Disk Space Usage for Neo4j elements.

Table 8: Disk Space Usage (bytes) for MySQL tables (Similar-Task)

  Table     Dataset Size
            65,000   101,000  219,500   300,000   466,737
  Dataset   3686400  5783552  11026432  15220736  21544960
  AAMatrix  65536    65536    65536     81920     81920
  InitSim   1589248  1589248  1589248   3686400   3686400
  FinalSim  229376   262144   278528    491520    1589248

Table 9: Disk Space Usage (bytes) for Neo4j Elements (Similar-Task)

  Graph Element  Dataset Size
                 65,000   101,000  219,500  300,000  466,737
  Nodes          2820     2910     3075     3990     4215
  Relationships  770040   414315   479663   856809   983227
  Properties     1033856  563873   651203   1155011  1323439

Table 8 and Fig. 5(a) present the disk space taken by tables in MySQL, which includes both the space taken by the actual data and by indexes, if any. We refer readers to Section 4.1 for a better understanding of the tables associated with the implementation of the Similar-Task algorithm. Table 9 and Fig. 5(b) show the disk space taken by the various graph elements in Neo4j.
We observe that Neo4j uses almost 12 times less disk space than MySQL for the largest dataset. We believe this is because nodes and relationships in Neo4j are created only when needed, whereas MySQL needs to write values for all columns, which contributes to higher disk usage.

7.2 Sub-Contract Algorithm

Table 10 and Fig. 6 show the data load time across different dataset sizes. The load time includes loading the event log dataset, pre-processing it and writing it back to the database. Pre-processing in MySQL involves ordering the event log dataset by CaseID, whereas in Neo4j it involves assigning incremental occurrence identifiers to the Actor nodes within each Case node. We observe that, for a single-node setup, both databases give similar performance. However, as the dataset size increases, Neo4j gives better load-time performance and is seen to perform about 1.15 times faster than MySQL. Also, data load time in the Sub-Contract algorithm is about 5.5 times slower than in the Similar-Task algorithm.

Table 10: Data Load Time (msec) (Sub-Contract)

Dataset Size    MySQL    Neo4j
65000            6575     9567
101000           8390    10476
219500          14279    14873
300000          26437    25435
466738          43712    38234

Figure 6: Data Load Time for Sub-Contract Algorithm

Table 11: Execution Time (msec) for Sub-Contract Algorithm in MySQL

Dataset Size   Update Normal   Sub-Contract Detection   Update Result   Normalize Result
65000                     32                    11712            8296                 16
101000                    32                    11782            8138                 16
219500                    35                    11713            7940                 17
300000                    70                    11736            8094                 17
466737                    73                    11747            7754                 20

Table 12: Execution Time (msec) for Sub-Contract Algorithm in Neo4j

Dataset Size   Update Normal   Sub-Contract Detection   Update Result   Normalize Result
65000                    118                     1542            2077                  5
101000                   140                     1707            2773                  5
219500                   202                     2534            2369                  6
300000                   336                     3442            5261                  9
466737                   560                     4149            5334                  9

We observe that, as with the Similar-Task algorithm, data load time exhibits a similar pattern in the Sub-Contract algorithm.
With increasing dataset size, ordering tables by CaseID and writing them back to the database takes longer than creating nodes in Neo4j. However, load time is higher in the Sub-Contract algorithm than in the Similar-Task algorithm because Actor nodes are created for each event in the event log, whereas in the Similar-Task algorithm only unique nodes are created. We believe that setting property values on nodes for each event in the dataset incurs more write operations and thus takes more time than setting property values for unique nodes in the Similar-Task algorithm.

Table 11 displays the execution time of the four major steps of the Sub-Contract algorithm implemented in MySQL. These steps are updating the value of normal (Update Normal), detecting sub-contracting Actors (Sub-Contract Detection), writing the result back to the database (Update Result) and normalizing the result (Normalize Result). Table 12 shows the execution time of the same four steps of the Sub-Contract algorithm implemented in Neo4j. We record the execution time of the four steps as a function of dataset size and present the results in Fig. 7(a) and Fig. 7(b).

We observe that the Sub-Contract algorithm implemented in MySQL has nearly identical performance across dataset chunks. In Neo4j, on the other hand, its execution time grows roughly linearly with dataset size. We observe that detecting sub-contracting Actors in Neo4j attains a performance boost of up to about 7x over MySQL. Empirical analysis shows that write operations in MySQL are up to almost 4 times slower than in Neo4j. We believe that detecting sub-contracting actors in MySQL is a compute-intensive and hence time-consuming task.
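To make the detection step concrete, the sketch below counts sub-contracting pairs in a toy event log. It assumes the direct-succession definition of subcontracting from [3]: actor b is sub-contracted by actor a when, within one case, b performs an activity directly between two activities performed by a. The event tuples, actor names and function name are hypothetical, and the SQL and CYPHER implementations evaluated here may differ in detail, for example in how results are normalized.

```python
from collections import defaultdict

def count_subcontracts(events):
    """Count direct sub-contracting patterns in an event log.

    events: iterable of (case_id, timestamp, actor) tuples.
    Returns {(a, b): n}, where n is how often b worked directly
    between two steps of a within the same case.
    """
    # Group events by case so that each case is scanned independently
    by_case = defaultdict(list)
    for case_id, ts, actor in events:
        by_case[case_id].append((ts, actor))

    counts = defaultdict(int)
    for rows in by_case.values():
        # Order each case by timestamp, then slide a window of three actors
        actors = [actor for _, actor in sorted(rows)]
        for first, middle, last in zip(actors, actors[1:], actors[2:]):
            if first == last and first != middle:
                counts[(first, middle)] += 1
    return dict(counts)

# Hypothetical log: three cases, three actors
log = [
    ("c1", 1, "ann"), ("c1", 2, "bob"), ("c1", 3, "ann"),
    ("c2", 1, "ann"), ("c2", 2, "bob"), ("c2", 3, "ann"), ("c2", 4, "cat"),
    ("c3", 1, "bob"), ("c3", 2, "cat"), ("c3", 3, "bob"),
]
result = count_subcontracts(log)
```

The scan visits each case's events once in timestamp order. The relational formulation discussed in this section obtains the same triples through a self-join on CaseID.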
The operation is expensive because detecting sub-contracting actors involves retrieving all records for a particular CaseID and then applying a self-join on the result set. Joins are compute-intensive tasks in MySQL: conceptually, they involve a Cartesian product of the tables followed by selection on the join condition. On the other hand, detecting sub-contracting actors in Neo4j is equivalent to traversing relationships using index-free adjacency. In our opinion, index-free adjacency achieves better traversal because relationships are stored as first-class citizens in Neo4j and no computation is performed to derive them.

Figure 7: Execution Time for Sub-Contract Algorithm. (a) Execution time on MySQL. (b) Execution time on Neo4j.

Another major aspect that Fig. 7(a) and Fig. 7(b) bring forward is that a write operation in MySQL takes roughly the same amount of time for all dataset sizes. We believe that MySQL needs to write values, albeit zero or default, even for relations that do not exist. Neo4j's approach of creating relationships between nodes only when needed is more efficient and thus takes less time than MySQL. Although we observe a gradually increasing trend in Update Result (the write operation) in Neo4j, we conclude that write operations in Neo4j scale linearly with dataset size. Write operations in MySQL, on the other hand, take a fairly constant time for all dataset sizes, and that time is higher than in Neo4j.

Table 13: Disk Space Usage (bytes) for MySQL tables (Sub-Contract)

Table            Dataset Size
                    65000     101000     219500     300000     466737
Dataset           4734976    6832128   13123584   18366464   27836416
Organised Data    4734976    6832128   13123584   18366464   27836416
Result Matrix     1589248    1589248    1589248    1589248    1589248

Table 13 and Fig. 8(a) present the disk space (in bytes) taken by tables in MySQL.
These statistics include initial tables, intermediate tables, final tables and indexes, if any. Table 14 and Fig. 8(b) show the disk usage of the various graph elements in the implementation of the Sub-Contract algorithm for different dataset sizes. The disk space for nodes is contributed by three different node types, viz. Case nodes, Actor nodes and Unique Actor nodes. Three relationships contribute to relationship disk space: [:CONTAINS] relationships connect a Case node to Actor nodes, [:RELATED_TO] relationships connect Actors that satisfy the sub-contraction criteria, and [:SUBCONTRACT] relationships connect UNIQUEACTOR nodes to UNIQUEACTOR nodes and carry the actual sub-contraction value between the unique actors. Readers are referred to Section 5 for a better understanding of the tables and graph elements associated with the implementation of the Sub-Contract algorithm in MySQL and Neo4j. We observe that MySQL disk space usage is about 30 times lower than that of Neo4j.

Figure 8: Disk Space Usage for Sub-Contract Algorithm. (a) Disk space usage for MySQL tables. (b) Disk space usage for Neo4j elements.

Table 14: Disk Space Usage (bytes) for Neo4j elements (Sub-Contract)

Graph Element   Dataset Size
                     65000      101000      219500      300000       466737
Nodes               982212     1523732     3360798     4598454      7190330
Relationships    153477291   183955761   285778449   375437997    490033038
Properties       384189475   461537287   719874720   946265404   1238579332

We believe that Neo4j's disk space usage is higher than MySQL's because of the number of properties used to store the information needed for sub-contract detection. Each property in Neo4j takes 41 bytes and each relationship takes 33 bytes. Apart from this, Neo4j stores relationships using index-free adjacency, which means every relationship is explicitly stored as a record that its nodes reference directly, rather than being derived through an index lookup. This contributes to higher disk usage in Neo4j.
On the other hand, MySQL stores information in tables whose size is determined by the data types involved, and thus consumes less disk space.

We conduct a general experiment to study the variation of memory, disk and process parameters in MySQL and Neo4j using Performance Monitor. We use the SQL and CYPHER implementations of the Similar-Task algorithm for the experiment. Fig. 9(a) and Fig. 9(b) present bar graphs for various memory, process and disk related parameters. We observe that Neo4j achieves a higher level of caching and outperforms MySQL by a factor of 18. Although Neo4j is seen to incur 3 times more page faults per second, such page faults do not necessarily go to disk. This is further evident from the fact that MySQL incurs about 6 times more IO operations per second than Neo4j. Fig. 9(b) strengthens the point: MySQL spends almost 10 times more time in disk operations and performs about 6 times more disk transfers per second. Based on the experimental results, we conclude that Neo4j is more IO efficient than MySQL and that, with more physical memory, Neo4j's performance would improve significantly.

Figure 9: Comparison of memory and disk performance monitors for Similar-Task algorithm. (a) Statistics for memory and process parameters. (b) Statistics for disk parameters.

8 Conclusion

In this paper, we present the implementation of two Organizational Mining algorithms in Structured Query Language and CYPHER Query Language. We implement the Similar-Task and Sub-Contract algorithms both for the native SQL client and with Java APIs using memoization. Furthermore, we benchmark and present performance comparisons of the Similar-Task and Sub-Contract algorithms in MySQL and Neo4j. The Similar-Task implementation in MySQL is a one-tier application which uses only standard SQL queries and advanced stored procedures. Similarly, the implementation in Neo4j uses standard CYPHER queries.
We conclude that Neo4j is on average 1.25 times faster than MySQL in loading large datasets when only unique elements are created. Based on the experimental results, we conclude that similarity calculation in MySQL is about 32 times faster than in Neo4j. MySQL also outperforms Neo4j in the time taken for write operations: the time taken by MySQL is up to about 1.5 times lower than that of Neo4j. The disk space occupied by graph elements in Neo4j is almost 12 times lower than the disk space taken by tables in MySQL. We conclude that Neo4j is more efficient than MySQL in terms of storing only unique information in the database.

The Sub-Contract implementation in MySQL is a one-tier application which also uses standard SQL queries and advanced stored procedures. Similarly, the implementation in Neo4j uses native CYPHER queries. We also implement the Sub-Contract algorithm with the Java API using memoization. We conclude that Neo4j is on average 1.15 times faster than MySQL in loading large datasets when duplication of elements is allowed. Based on the experimental results, we conclude that traversing relationships to find sub-contracting actors in Neo4j is up to 7 times faster than in MySQL. Neo4j outperforms MySQL in the time taken for write operations: the time taken by Neo4j is up to 4 times lower than that of MySQL. However, the disk space taken by graph elements in Neo4j is over 30 times higher than in MySQL because of the need to store redundant information.

In general, we conclude that Neo4j gives better performance than MySQL in loading large datasets, with performance benefits of up to 25 percent. Tasks which involve traversing relationships followed by computation (like the Similar-Task algorithm) are time consuming in Neo4j. However, Neo4j performs better than MySQL where finding relationships is concerned (like the Sub-Contract algorithm) and is seen to perform up to 7 times better than MySQL. Neo4j also gives better write-time performance as the volume of data increases.
Based on our analysis of resource usage during the Similar-Task experiments, we conclude that Neo4j achieves a higher level of caching and incurs almost 6 times fewer disk IO operations. Our analysis reveals that Neo4j spends 10 times less time on disk operations, with on average 6 times fewer disk transfers per second.

References

[1] Wil M. P. Van Der Aalst, Ton Weijters, and Laura Maruster. Workflow mining: Discovering process models from event logs. Transactions on Knowledge and Data Engineering, pages 1128-1142, 2004.
[2] Monika Gupta and Ashish Sureka. Nirikshan: Mining bug report history for discovering process maps, inefficiencies and inconsistencies. In Proceedings of the 7th India Software Engineering Conference, ISEC '14, pages 1:1-1:10, 2014.
[3] Wil M. P. Van Der Aalst, Hajo A. Reijers, and Minseok Song. Discovering social networks from event logs. Computer Supported Cooperative Work, pages 549-593, 2005.
[4] Minseok Song and Wil M. P. Van Der Aalst. Towards comprehensive support for organizational mining. Decision Support Systems, pages 300-317, 2008.
[5] Ian Robinson, Jim Webber, and Emil Eifrem. Graph databases. 2013.
[6] Michael Stonebraker and Ugur Cetintemel. One size fits all: An idea whose time has come and gone. Proceedings of the 21st International Conference on Data Engineering, ICDE '05, pages 2-11, 2005.
[7] Carlos Ordonez. Programming the K-means Clustering Algorithm in SQL. (6):823-828, 2004.
[8] Carlos Ordonez and P. Cereghini. SQLEM: Fast Clustering in SQL using the EM Algorithm. International Conference on Management of Data, pages 559-570, 2000.
[9] David Sergio Matusevich and Carlos Ordonez. A clustering algorithm merging mcmc and em methods using sql queries. JMLR Workshop and Conference Proceedings, pages 61-76, 2004.
[10] K-U. Sattler and O. Dunemann. SQL Database Primitives for Decision Tree Classifiers. Conference on Information and Knowledge Management, pages 379-386, 2001.
[11] Wei Wang, Chen Wang, Yongtai Zhu, Baile Shi, Jian Pei, Xifeng Yan, and Jiawei Han. GraphMiner: A Structural Pattern-Mining System for Large Disk-based Graph Databases and Its Applications. Proceedings of the 2005 ACM SIGMOD international conference on Management of data, pages 879-881, 2005.
[12] Chen Wang, Wei Wang, Jian Pei, Yongtai Zhu, and Baile Shi. Scalable Mining of Large Disk-based Graph Databases. Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 316-325, 2004.
[13] Jun Huan, Wei Wang, Jan Prins, and Jiong Yang. SPIN: Mining Maximal Frequent Subgraphs from Graph Databases. Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 581-586, 2004.
[14] Tomonobu Ozaki and Takenao Ohkawa. Mining Correlated Subgraphs in Graph Databases. 12th Pacific-Asia Conference, PAKDD 2008 Osaka, Japan, May 20-23, 2008 Proceedings, pages 272-283, 2008.
[15] Chad Vicknair, Michael Macias, Zhendong Zhao, Xiaofei Nan, Yixin Chen, and Dawn Wilkins. A Comparison of a Graph Database and a Relational Database. Proceedings of the 48th Annual Southeast Regional Conference, 2010.
[16] Robert McColl, David Ediger, Jason Poovey, Dan Campbell, and David A. Bader. A Performance Evaluation of Open Source Graph Databases. Proceedings of the first workshop on Parallel programming for analytics applications, pages 11-18, 2014.
[17] Marek Ciglan, Alex Averbuch, and Ladislav Hluchy. Benchmarking Traversal Operations over Graph Databases. International Conference on Data Engineering Workshops, pages 186-189, 2012.
[18] Peter Macko, Daniel Margo, and Margo Seltzer. Performance Introspection of Graph Databases. Proceedings of the 6th International Systems and Storage Conference, 2013.
[19] Divya Kundra, Prerna Juneja, and Ashish Sureka. Vidushi: Parallel Implementation of Alpha Miner Algorithm and Performance Analysis on CPU and GPU Architecture. 2016.
[20] Astha Sachdev, Kunal Gupta, and Ashish Sureka. Khanan: Performance Comparison and Programming Alpha-Miner Algorithm in Column-Oriented and Relational Database Query Languages. 2015.
[21] Kunal Gupta, Astha Sachdev, and Ashish Sureka. Pragamana: Performance comparison and programming alpha-miner algorithm in relational database query language and nosql column-oriented using apache phoenix. In Proceedings of the Eighth International C* Conference on Computer Science and Software Engineering, C3S2E '15, pages 113-118, 2015.
[22] Kritika Anand, Nisha Gupta, and Ashish Sureka. Utility-Based Control Flow Discovery from Business Process Event Logs. 2015.
[23] Jeevan Joishi and Ashish Sureka. Vishleshan: Performance comparison and programming process mining algorithms in graph-oriented and relational database query languages. In Proceedings of the 19th International Database Engineering Applications Symposium, IDEAS '15, pages 192-197, 2015.