Programming Data Mining Essays

MIS 6473 – Data Mining – Take Home Final – Fall 2022 – Dr. Segall
DUE IN LAST CLASS Tuesday December 6, 2022 or
DR. SEGALL’s OFFICE BU 216 OR WALL BASKET ON
TUESDAY, DECEMBER 6, 2022 by 5PM!!
Rules of Take-Home Exam:
1.) You are allowed to use any resources except your classmates.
2.) Your submissions of Computer Outputs should look identical (e.g., elimination of text rollovers,
which may require an appropriate font size) to those you are provided, with the exception of your
“Index Number” appended to Databases to make your printouts unique.
Rules for Essays:
3.) All borrowed materials need to be properly cited, including any Figures and Tables. Direct quotes
of text must be clearly marked with quotation marks, and paraphrases of text written by others
must also be cited. See attached Examples of appropriate citation methods including List of
References at end of your essays.
4.) All essays of this exam are to be word-processed and stated page limits are minimums. A full
page of text is considered to be that with 1-inch margins on top, bottom, left & right of text in
normal font size and not more than normal double-line spacing.
As stated on Course Syllabus:
This exam is worth 15% of course grade of MIS 6473.
======================================================================================
COMPUTER ASSIGNMENT #1: [Jamsa Chapter 6: Keep SQL in Your Toolset]
CREATION OF DATABASES for Structured Query Language (SQL):
① CREATE TABLE Inputs & Outputs
SALES_REP_YourIndexNumber
CUSTOMER_YourIndexNumber
INVOICES_YourIndexNumber
ITEM_YourIndexNumber
INVOICE_LINE_YourIndexNumber
② DESCRIBE Table Outputs for each of 5 Tables.
③ SELECT * FROM each of the 5 above Tables.
Your submission should be printed in Landscape mode with no rollovers of texts as shown in handout
provided and also posted in Blackboard.
COMPUTER ASSIGNMENT #2 [Jamsa Chapter 6: Keep SQL in Your Toolset]
④ SINGLE TABLE QUERIES
(a.) Inputs for Single Table Queries for Exercises #1 to #9 as provided.
(b.) Outputs for Single Table Queries for Exercises #1 to #9 as provided with no rollovers of text.
COMPUTER ASSIGNMENT #3 [Jamsa Chapter 6: Keep SQL in Your Toolset]
⑤ MULTIPLE TABLE QUERIES
(a.) Inputs for Multiple Table Queries for Exercises #10 to #15 as provided.
(b.) Outputs for Multiple Table Queries for Exercises #10 to #15 as provided with no rollovers of text.
ESSAY #1:
Refer to the Journal Article provided as a class handout and posted in Blackboard Learn:
Amani & Fadlalla (2017): Data mining applications in accounting: A review of the literature and
organizing framework, International Journal of Accounting Information Systems, pp. 32-58.
Write an essay of at least 3 pages pertaining to the article that:
1.) Summarizes the highlights of this article.
2.) Gives your perspective on the article in light of your background and experience in accounting.
3.) Further evaluates the data mining related techniques referenced in this article by
selecting any of the References listed such as that by:
Fanning, K.M. & Cogger, K.O. (1998). Neural network detection of management fraud using
published financial data, International Journal of Intelligent Systems in Accounting, Finance and
Management, v.7. n.1, pp. 21-41.
and by searching the web for more recent publications related to your selected reference and
discussing these advances in data mining applications to accounting.
Write your 3-page minimum double-spaced essay, with citations for all borrowed materials (direct
quotes, paraphrases, and borrowed figures and tables), that discusses each of the above. Label your parts
(1.), (2.) and (3.). Any figures, tables and List of References included do not count toward the 3-page
minimum of text of essay. Use format for citations and additional References as shown at end of this
Take-Home Final and other as provided.
ESSAY #2:
This semester we took a journey through the ever-expanding world of Data Mining.
This semester we studied:
① Use of RapidMiner in JAMSA Chapter 1 on pages 13-17 for Titanic data set.
② Theory of databases and relational databases in JAMSA book Chapter 3.
③ Data Mining of nominal sized databases using SAS Enterprise Miner version 15.3
[The size of the data set we used for PVA (Paralyzed Veterans Agency) was about 78,000.]
④ Text Mining using module within SAS Enterprise Miner 15.3.
⑤ Tools for relational databases such as Structured Query Language (SQL) using Oracle 21c as
discussed in JAMSA Chapter 6.
[This is mining of data subject to specified criteria (e.g., queries) for nominal sized data sets.]
⑥ Data Mining and Data Visualization of “Big Data” using SAS Viya
[The size of “Orion Star Sports & Outdoors” data set we used was described on page 1-8 of
Chapter 1: Getting Started with SAS Viya of: 747,953 orders + 68,300 customers + 3,151 products
+ 64 suppliers + 648 employees = 820,116 data values.]
In this 3-page essay plus page of any necessary Reference citations, you are asked to describe:
1.) What you perceive to be the highlights of what you learned this Fall 2022 semester in MIS 6473
Data Mining.
2.) What part(s) of the MIS 6473 Data Mining course interested/intrigued you the most and why.
3.) What part(s) of the MIS 6473 Data Mining course you enjoyed the most and why.
4.) How you think Data Mining and related topics (e.g., Big Data, SQL, Relational databases, Text
Mining etc.) studied this semester could be used in your future career.
=============================================================================
REFERENCES are indicated using numbers as exponents or [1.] or by name of author(s)
such as Smith (2021) and Smith & Jones (2020) WITHIN body of each essay for borrowed
materials or paraphrases of text written by others with complete citations LISTED in
order of appearance within body of paper at end of essay in format such as:
[1.] Jamsa, K. (2021). Introduction to Data Mining and Analytics, Jones & Bartlett
Learning, pp.15-23.
[2.] Ibid., p.500.
[3.] Collica, R.S. (2011), Customer Segmentation and Clustering using SAS Enterprise
Miner, pp. 1-1 to 1-13, http://www.sas.com/store/books/categories/usage-andreference/customer-segmentation-and-clustering-using-sas-enterprise-miner-secondedition/prodBK_62640_en.html , viewed November 7, 2022.
[4.] Jamsa, K. (2021). op.cit., pp. 232-237.
======================================================================
International Journal of Accounting Information Systems 24 (2017) 32–58
Data mining applications in accounting: A review of the
literature and organizing framework
Farzaneh A. Amani a, Adam M. Fadlalla b,⁎
a Qatar University, Doha, Qatar
b Department of Accounting and Information Systems, College of Business and Economics, Qatar University, Doha, Qatar
Article info
Article history:
Received 22 September 2015
Received in revised form 5 November 2016
Accepted 20 December 2016
Available online 27 January 2017
Keywords:
Data mining
Accounting
Literature review
Framework
Prospective
Retrospective
Abstract
This paper explores the applications of data mining techniques in accounting and proposes an
organizing framework for these applications. A large body of literature reported on speciﬁc
uses of the important data mining paradigm in accounting, but research that takes a holistic
view of these uses is lacking. To organize the literature on the applications of data mining in
accounting, we create a framework that combines the two well-known accounting reporting
perspectives (retrospection and prospection), and the three well-accepted goals of data mining
(description, prediction, and prescription). The framework encapsulates a taxonomy of four
categories (retrospective-descriptive, retrospective-prescriptive, prospective-prescriptive, and
prospective-predictive) of data mining applications in accounting. The proposed framework revealed that the area of accounting that beneﬁted the most from data mining is assurance and
compliance, including fraud detection, business health and forensic accounting. The clear gaps
seem to be in the two prescriptive application categories (retrospective-prescriptive and prospective-prescriptive), indicating opportunities for beneﬁting from data mining in these application categories. The framework presents a holistic view of the literature and systematically
organizes it in a structurally logical and thematically coherent manner.
© 2017 Elsevier Inc. All rights reserved.
1. Introduction
In the era of rapidly changing, globalized economies, and highly competitive markets, organizations, to become competitively
relevant, need to consider, and, many a times, adopt or implement a wide variety of innovative management philosophies, approaches, and advanced information technologies (Dorsch and Yasin, 1998). In particular, artiﬁcial intelligence (AI) is important
to the future of the accounting profession (Elliott, 1992), and intelligent systems have empowered many enhancements in multidimensional analytical power and efﬁciency of the accounting processes (Granlund, 2011). Thus, there are clear calls that AI deserves added attention (Debreceny, 2011), and the existence of opportunities of massive scale for companies to better fully
leverage the analytical capability of their enterprise systems (White, 2004). An open question is: could the lack of full utilization
of these analytical capabilities be explained by the complexity of these systems as suggested by Kim et al., 2009, or could it be due
to other factors such as features speciﬁc to data mining techniques, or the nature of the intelligent accounting applications
themselves?
⁎ Corresponding author.
E-mail addresses: farzanakhoory@gmail.com (F.A. Amani), fadlalla@qu.edu.qa (A.M. Fadlalla).
http://dx.doi.org/10.1016/j.accinf.2016.12.004
Data mining is one of the most important current paradigms of advanced intelligent business analytics and decision support
tools. Such signiﬁcance is acknowledged by the major accounting professional bodies. The American Institute of Certiﬁed Public
Accountants (AICPA) has identiﬁed data mining as one of the top ten technologies for tomorrow, and the Institute of Internal Auditors (IIA) has listed data mining as one of the four research priorities (Koh and Low, 2004). In addition, the Chartered Global
Management Accountants (CGMA) has reported that >50% of corporate leaders rank big data and data mining among the top
ten corporate priorities that are fundamental for the data-driven era of business (CGMA, 2013). Data mining has been deﬁned
as the process of identifying valid, potentially novel, and ultimately understandable patterns in data (Pujari, 2001). It is also
known as the process of extracting or mining knowledge from massive amounts of data (Han et al., 2006) to improve decisions
in a particular discipline. The key focus of data mining is, therefore, to leverage the data assets of an organization to derive ﬁnancial or non-ﬁnancial beneﬁts. Thus data mining has been applied to almost all non-business as well as business disciplines, including accounting.
Data mining is reported to afford organizations a wide array of beneﬁts and capabilities; including effectively predicting future
trends of corporate development, helping managers make better decisions, and raising competitiveness of an enterprise (Xiao et
al., 2010; Yigitbasioglu and Velcu, 2012). It can also provide managers with logical and causal connections within a company’s
ﬁgures so that issues can be proactively tackled (Yigitbasioglu and Velcu, 2012). In addition, data mining can contribute towards
signiﬁcantly improving judgment, transaction, and compliance in auditing (Vasarhelyi et al., 2004), improve the quality of evidence supplied to auditors (Brown et al., 2007), and contribute to the efﬁciency of the overall audit (Chan and Vasarhelyi,
2011). Furthermore, data mining can facilitate electronic (Liang et al., 2001) and continuous (Brown et al., 2007; Vasarhelyi et
al., 2012) auditing, and has the potential to radically alter the managerial control systems’ role and execution in organizations
(Sutton et al., 2011; Granlund et al., 2013). Data mining enables organizations to more easily identify statistical relations
among performance measures (Ittner and Larcker, 2001), estimate the likelihood an event will occur, thereby supplementing
managers’ qualitative judgments (Rezaee et al., 2002), and provide a vehicle of control for both accuracy of the data and legitimacy of data requests. Not least, data mining can help organizations quickly discern patterns in data that would take years to discover using older techniques (Mauldin and Ruchala, 1999), identify disgruntled employees from patterns of their email
exchanges (Huerta et al., 2012), and empower regulatory agencies with real-time market surveillance and risk proﬁling of market
players (Williams, 2013).
Accounting is a bedrock of any enterprise and spans a wide range of tasks including internal and external reporting, costing,
estimating, evaluating, analyzing, and auditing. Many of these tasks involve a great deal of uncertainty and risk complexities. Accounting has a history of intelligent applications dating back more than three decades (Baldwin et al., 2006), and was one of the
earliest business disciplines to utilize data mining to better address these risks and complexities. A large body of research has
been published describing applications of data mining in accounting. Although many researchers offered literature reviews of
such research, these reviews have generally focused on a speciﬁc accounting domain and/or data mining technique (Coakley
and Brown, 2000; Yang, 2006; Calderon and Cheh, 2002; Wang, 2010; Ngai et al., 2011).
A more encompassing approach is a review that presents this body of knowledge in a manner that simultaneously takes into
consideration the multi-faceted nature of the two underlying disciplines of accounting and data mining. This approach can help in
addressing effectively questions such as: what is the current status of the amalgamation of accounting, a fundamental business
discipline, and data mining, a top ten future information systems technology? How pervasive is this critical technology in accounting, and is it uniformly used across all branches of accounting or is it limited to some and not others? When used, is it used with
similar or varying intensities across the different accounting domains, and what are the plausible explanations if there is variability in usage intensity? How much has accounting adopted of the various powerful capabilities (including goals, tasks, and techniques) that data mining has to offer? These questions cannot be answered by the existing reviews individually. In addition,
such an approach may provide a mechanism to organize the research on the applications of data mining in accounting in a structurally logical and thematically coherent manner. It is the purpose of this research to attempt to answer these questions, and to
propose an organizing framework for the literature on data mining applications in accounting. In so doing, the paper contributes
to extant literature: ﬁrst, a better understanding of the intersection of these two important disciplines; second, a macro-level perspective of the current status of research and practice on data mining applications in accounting; and third, a direction to potential opportunities for future research in this important domain. In addition, using a framework that succinctly organizes the
literature and summarizes its overall topology reveals the main research themes and patterns, provides deeper insights into
the underlying conceptual underpinnings and relationships, thereby leading to a better informed research and practice agenda.
Without such a reﬂective well-organized literature review, one is left with a fragmented landscape without a solid handle on
the true topography of the literature. Under such disjointed circumstances, it will be difﬁcult, at best, to ascertain the extent to
which the capabilities of a crucial technology of the 21st century have been leveraged in the core business discipline of accounting. The contributions of the paper are relevant to both researchers and practitioners with an interest in the application of data
mining in accounting.
The objective of this paper is thus to systematically examine published research on data mining applications in accounting to
understand the current status of, discern any central themes in, and offer an organizing framework for, this research. We propose
a framework that provides a comprehensive view of what has been accomplished by using data mining in accounting, what areas
in the accounting discipline have more and which ones have less utilization of this technology. The paper relies on interpretative
research using content analysis to understand the relevant literature. Extant research that describes applications of data mining in
accounting served as the primary data for understanding the nature of these applications and for mapping them into the organizing framework. The rest of the paper is organized as follows: section 2 provides a background and literature review, section 3
describes the research methodology, section 4 presents the proposed framework, section 5 presents and discusses results, and section 6 offers conclusions, limitations, and future research directions.
2. Background and literature review
Data mining is the application of speciﬁc algorithms for extracting patterns from data. It allows the automated discovery of
implicit patterns and interesting knowledge hidden in large amounts of data (Jiawei and Kamber, 2001). Data mining helps organizations to focus on the most important information and knowledge available in their existing databases. But it is only a tool; it
does not eliminate the need to know the business, to understand the data, or to understand the analytical methods involved
(Jackson, 2002). Data mining has three main goals: description, prediction, and prescription. Whereas description focuses on ﬁnding human-interpretable patterns describing the data, prediction involves using some variables or ﬁelds in the database to predict
unknown or future values of other variables of interest (Fayyad et al., 1996). On the other hand, prescription focuses on providing
the best solution for the given problem (Evans, 2013). These goals can be achieved by using many data mining tasks, including
classiﬁcation, clustering, prediction, outlier detection, optimization, and visualization. These tasks differ with the type of problem
to be solved as follows:
■ Classiﬁcation focuses on mapping data to predeﬁned qualitative discrete attribute set of classes, which could be binary or
multi-class.
■ Clustering focuses on segmenting the data to some meaningful classes or groups.
■ Prediction focuses on ﬁnding a future numerical value (forecasting) or non-numerical value (classiﬁcation).
■ Outlier Detection focuses on ﬁnding the data that signiﬁcantly deviates from the normal.
■ Optimization focuses on ﬁnding the best solution given some resources.
■ Visualization focuses on the visual presentation and understanding of data.
■ Regression focuses on estimation of a dependent variable from a set of independent variables.
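To make two of the tasks listed above concrete, the following is a minimal Python sketch of outlier detection and clustering on a toy data set. It is an illustration under stated assumptions, not a method taken from the paper: the invoice amounts, the z-score threshold, and the function names `find_outliers` and `two_means` are all hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical invoice amounts; the values are illustrative only.
amounts = [120.0, 135.0, 128.0, 110.0, 131.0, 990.0, 125.0, 118.0]


def find_outliers(values, z_threshold=2.0):
    """Outlier detection: flag values far from the mean, measured in standard deviations."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > z_threshold]


def two_means(values, iterations=10):
    """Clustering: split values into two groups with a tiny 1-D k-means (k = 2)."""
    low, high = min(values), max(values)  # initial centroids
    near_low, near_high = [], []
    for _ in range(iterations):
        # Assign each value to its nearest centroid.
        near_low = [v for v in values if abs(v - low) <= abs(v - high)]
        near_high = [v for v in values if abs(v - low) > abs(v - high)]
        # Move each centroid to the mean of its group.
        if near_low:
            low = mean(near_low)
        if near_high:
            high = mean(near_high)
    return near_low, near_high


outliers = find_outliers(amounts)
small, large = two_means(amounts)
print("Outliers:", outliers)
print("Clusters:", small, large)
```

Here both tasks single out the same unusually large amount, but they answer different questions: outlier detection asks which records deviate from the norm, while clustering asks how the records group together.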
A wide variety of data mining techniques exist, such as artiﬁcial neural networks (NNs), case-based reasoning (CBR), genetic
algorithms (GA), decision trees (DT), association rules (AR), support vector machines (SVM), regression, self-organizing maps
(SOM), k-nearest neighbor (KNN), naïve Bayes (NB), and fuzzy analysis. Each of these data mining techniques serves a particular
purpose, problem, and business need. Additional details on these techniques are readily available from a myriad of references on
data mining.
Many researchers have investigated the application of data mining in accounting. However, each of these researchers focused
on some specialized aspect of this broader topic, and none, to the best of our knowledge, has provided an all-encompassing overview. One of the early papers was that of Foltin and Garceau (1996), which demonstrated the differences between expert systems
and neural networks and the future of neural networks applications in accounting. Coakley and Brown (2000) covered the modeling issues of neural networks in accounting and ﬁnance and classiﬁed them by research question, type of output (continuous versus discrete), and the parametric nature of the model. Yang (2006) pointed out how data mining is useful in both auditing and
fraud detection. More speciﬁcally in auditing, Baldwin et al. (2006) highlighted the opportunities for AI in auditing, Calderon
and Cheh (2002) provided a roadmap for future neural networks research in auditing and risk assessment, and Koskivaara
(2004a) reviewed the use of neural networks in auditing and concluded that the focus was mainly on analytical review procedures. In the area of forensic accounting, Wang (2010) provided a review of data mining-based accounting-fraud detection research and summarized the data structures, algorithms, ﬁndings, and model performance evaluation with the aim of helping
the accountants in selecting the suitable data and data mining technologies for detecting fraud. Furthermore, Ngai et al. (2011)
explored the application of data mining techniques in the detection of ﬁnancial fraud, and Gray and Debreceny (2014) provided
a taxonomy to guide research on the application of data mining to fraud detection in ﬁnancial statement audits. More speciﬁcally,
Debreceny and Gray (2011) provided an overview of how data mining techniques can mine emails and how such techniques and
applications can be used by auditors as audit evidence. Ravi Kumar and Ravi (2007) provided a review of the application of data
mining in bankruptcy prediction in banks and ﬁrms during the period 1986–2006. Their review highlighted the techniques applied, sources of data, ﬁnancial ratios used, country of origin, time line of study, and the comparative performance of techniques
in terms of prediction accuracy. More broadly, Fisher et al. (2010) and Chakraborty et al. (2014) applied text and data mining to
automatically classify academic articles in accounting and improve understanding of the accounting lexicon. Thus, so far researchers have focused their reviews of data mining applications on a specific topical context. We did not find any research that provides a comprehensive view of data mining applications in the broader accounting context, and the paper will attempt to
address this gap.
3. Research methodology
Broadly following the methodology of a systematic review outlined in Tranﬁeld et al. (2003) and Khlif and Chalmers (2015),
our research methodology consists of seven steps:
Step 1. Scoping of the study: This study focuses on the application of data mining in accounting.
Step 2. Identiﬁcation of search terms: To frame the scope of the study, we identiﬁed keywords that we used as search terms to
capture relevant articles. We included accounting-related search terms such as accounting, ﬁnancial, auditing, costing, fraud, combined with any of the following data mining-related search terms such as data mining, AI, big data, machine learning, clustering,
decision tree, genetic algorithm, neural network, self-organizing map, regression, case-based reasoning, nearest neighbor, and
Bayes. We consider these keywords and their combinations to represent a reasonably broad set of search terms to unravel the
relevant literature.
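The pairing of search terms described in Step 2 can be sketched as a Cartesian product of the two term lists. This is only an illustration: the lists contain the examples named in the text (which says "such as", so they may not be exhaustive), and the `"..." AND "..."` query syntax and the function name `build_queries` are assumptions, since the paper does not state how queries were phrased in each database.

```python
from itertools import product

# Example terms named in Step 2; the paper's full lists may be longer.
accounting_terms = ["accounting", "financial", "auditing", "costing", "fraud"]
data_mining_terms = [
    "data mining", "AI", "big data", "machine learning", "clustering",
    "decision tree", "genetic algorithm", "neural network",
    "self-organizing map", "regression", "case-based reasoning",
    "nearest neighbor", "Bayes",
]


def build_queries(left_terms, right_terms):
    """Pair every accounting term with every data mining term (hypothetical query syntax)."""
    return [f'"{a}" AND "{d}"' for a, d in product(left_terms, right_terms)]


queries = build_queries(accounting_terms, data_mining_terms)
print(len(queries), "queries, for example:", queries[0])
```

With the terms listed here, the product yields 5 × 13 = 65 pairwise queries.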
Step 3. Identification of data sources: Our data sources consist of: (a) leading accounting journals¹, (b) all journals, not overlapping with (a), published by the American Accounting Association (AAA), (c) University Library e-Resources, which include subscription to >80 major electronic databases, (d) OhioLINK, which is an electronic database of >9000 journals from 101 publishers,
and (e) Google Scholar (see Fig. 1). We believe, these outlets represent a comprehensive collection of literary sources that will
cast a wide enough net to cover research relevant to the scope of the study.
Step 4. Article collection: We searched for literature on data mining applications in accounting using combinations of the
search terms speciﬁed in Step 2, without time or outlet constraint in the multiple electronic sources (similar to Grabski et al.,
2011; and Richardson et al., 2015). We also included articles from OhioLINK’s and Google Scholar’s “related papers” functionality
during the collection process.
Step 5. Article ﬁltering: A manual inspection and ﬁltering process is undertaken by the authors to only include papers that satisfy the following inclusion criteria: (1) describe a speciﬁc application of data mining in accounting, (2) explicitly describe what
data mining techniques have been utilized, and (3) the data mining goal and tasks are discernible from the paper. All other papers
that either did not describe a concrete data mining application, such as interpretive articles, commentaries, and literature reviews,
or did not provide enough detail to satisfy the inclusion criteria were excluded. A total of 209 papers satisﬁed the inclusion
criteria.
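The three inclusion criteria of Step 5 amount to a conjunction: a paper is kept only if all three hold. The authors applied them by manual inspection; the sketch below merely restates the rule in Python. The type `Screening`, the function `is_included`, and the candidate titles are hypothetical.

```python
from typing import NamedTuple


class Screening(NamedTuple):
    """Outcome of manually screening one candidate paper (hypothetical structure)."""
    title: str
    describes_application: bool   # criterion (1): a specific application in accounting
    names_techniques: bool        # criterion (2): techniques explicitly described
    goal_and_tasks_clear: bool    # criterion (3): goal and tasks are discernible


def is_included(s: Screening) -> bool:
    """A paper is included only when all three criteria are satisfied."""
    return s.describes_application and s.names_techniques and s.goal_and_tasks_clear


# Placeholder candidates, not real papers from the review.
candidates = [
    Screening("Hypothetical fraud-detection study", True, True, True),
    Screening("Hypothetical commentary on AI in auditing", False, False, False),
    Screening("Hypothetical study with unnamed technique", True, False, True),
]

included = [s.title for s in candidates if is_included(s)]
print(included)
```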
Step 6. Content evaluation: We utilized a data extraction form to capture an article’s:
• bibliographic details (including author(s), publication date, title, journal, volume, issue, pages);
• reporting focus (retrospective or prospective);
• accounting topic (e.g., accounting information systems, ﬁnancial accounting, managerial accounting, compliance and assurance);
• accounting sub-topic (e.g. ﬁnancial analysis, ﬁnancial performance, budgeting, asset management, cost management, auditing
cycle, business health, forensic accounting, tax compliance);
• data mining goal (description, prediction, or prescription);
• data mining task (e.g., association, classiﬁcation, clustering, estimation, exploration, forecasting, optimization); and,
• data mining technique(s) (e.g., regression, neural networks, genetic algorithms, decision trees, support vector machines, case-based reasoning).
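One way to picture the data extraction form of Step 6 is as a record type with one field per bullet above. The following Python sketch is an assumption about how such a form could be represented; the class name `ExtractionRecord`, its field names, and the sample record are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import List

REPORTING_FOCI = {"retrospective", "prospective"}
MINING_GOALS = {"description", "prediction", "prescription"}


@dataclass
class ExtractionRecord:
    """One row of a data extraction form (hypothetical field names)."""
    authors: List[str]
    year: int
    title: str
    journal: str
    reporting_focus: str          # "retrospective" or "prospective"
    accounting_topic: str         # e.g., "compliance and assurance"
    accounting_subtopic: str      # e.g., "forensic accounting"
    mining_goal: str              # "description", "prediction", or "prescription"
    mining_tasks: List[str] = field(default_factory=list)
    mining_techniques: List[str] = field(default_factory=list)

    def __post_init__(self):
        # Reject values outside the closed vocabularies named in the text.
        if self.reporting_focus not in REPORTING_FOCI:
            raise ValueError(f"unknown reporting focus: {self.reporting_focus}")
        if self.mining_goal not in MINING_GOALS:
            raise ValueError(f"unknown data mining goal: {self.mining_goal}")


# Placeholder record for illustration; not an actual paper from the review.
record = ExtractionRecord(
    authors=["A. Author"],
    year=2000,
    title="Hypothetical study",
    journal="Hypothetical Journal",
    reporting_focus="prospective",
    accounting_topic="compliance and assurance",
    accounting_subtopic="forensic accounting",
    mining_goal="prediction",
    mining_tasks=["classification"],
    mining_techniques=["neural networks"],
)
print(record.reporting_focus, record.mining_goal)
```

Validating the two closed vocabularies at entry time keeps later tallies by focus and goal consistent.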
Step 7. Synthesis and framework development: section 4 details this step.
Our search goal was to capture literature on as many data mining accounting applications as possible, and identify the nature
and the major areas of such applications. Our methodology is by no means without limitations. For example, many other search
terms could have been used. However, no search strategy could exhaust all possible relevant terms in either accounting or data
mining. We believe we have included the major outlets and search terms to capture the major literature related to the applications of data mining in accounting.
4. Proposed framework
Given the large amount of research that has been produced on the use of modern data mining technology in the ﬁeld of accounting, an obvious question is: can this research be presented in a structurally logical and thematically coherent manner? In an
attempt to answer this question in the afﬁrmative, we propose an organizing framework for the applications of data mining in
accounting (Fig. 2). Frameworks that organize literature succinctly summarize the topology of the literature, provide better understandability to complex relationships, and offer a convenient mechanism of mapping research in a given domain. We adopt the
methodology of juxtaposing elements from different entities to construct a framework, similar to the approach used by Preece
and Rombach (1994) and Richardson et al. (2015). While Preece and Rombach (1994) created a framework by combining measurement approaches from the disciplines of software engineering and human-computer interaction, Richardson et al. (2015)
built their framework by linking elements of a professional entity (accountant) with elements of a profession entity (accounting).
Our proposed framework combines characteristics of accounting discipline with characteristics of data mining discipline. Specifically, it combines the two well-known major reporting perspectives of accounting (retrospective and prospective) and the well-established three main data mining goals (description, prediction, and prescription).
The retrospective-prospective duality of reporting in accounting is manifested in the work of Birnberg (1980), Carnegie (2012),
and Owen (2013). Retrospective reporting deals primarily with reﬂective reporting of the historical ﬁnancial position of an organization mainly for ﬁnancial valuation, decision making, and/or compliance purposes. For example, preparing ﬁnancial statements
provides a retrospective summary of an organization’s ﬁnancial position at a point in time (balance sheet) or proﬁt or loss for a
span of time (income statement). Prospective reporting, on the other hand, is future-oriented and includes future financial outlooks, estimations, and projections. For example, when historical information is used for predicting some future aspect of an
organization, such as the direction or the magnitude of its future performance, growth, or any other financial performance or
health indicator, the focus of accounting reporting becomes prospective.
1 Top accounting journals are identified using the 2014 Thomson Reuters Journal Citation Reports with impact factor ≥ 1. These, alphabetically, are: Accounting, Auditing and
Accountability Journal; Accounting, Organizations and Society; Contemporary Accounting Research; International Journal of Accounting Information Systems; Journal of
Accounting and Economics; Journal of Accounting Research; Management Accounting Research; Review of Accounting Studies; and The Accounting Review.
Fig. 1. Search methodology.
Data mining has three main goals: description, prediction, and prescription. The main goal of descriptive data mining is business and data understanding (the what happened), the goal of predictive data mining is using the past to understand the future
(the what could happen), and the goal of prescriptive data mining is to achieve the best outcome (the what should happen). Descriptive data mining, the most commonly used and most well understood type, focuses on the use of data to understand the past
and present and, accordingly, make informed decisions. It uses techniques to categorize, characterize, consolidate, and visualize
data to convert it into useful information for the purposes of better data and business understanding. Descriptive data mining enables users to identify patterns and trends in data and discover problems and/or areas of opportunity. On the other hand, predictive data mining analyzes the past in an effort to predict the future by examining historical data, detecting patterns or
relationships in these data, and then extrapolating these relationships forward in time. For example, using predictive data mining,
a bank system might alert a credit card customer to a potentially fraudulent charge. Predictive data mining primarily informs future-oriented decision making. Finally, prescriptive data mining uses optimization techniques to identify the best alternatives to
minimize or maximize some objective function. The mathematical and statistical techniques of predictive data mining can be
combined with optimization to make decisions that take into account the uncertainty in the data (Evans, 2013). Whether the
goal is retrospective or prospective reporting, prescription (optimization) may be utilized to support maximizing or minimizing
these goals in the most resource-efficient manner. Indeed, data mining provides advanced techniques to facilitate descriptive, predictive, and prescriptive modeling, and thus it has a great deal to offer in direct support for the major reporting perspectives of
accounting.
Fig. 2. Proposed framework.
Juxtaposing the two major perspectives of accounting reporting and the three main goals of data mining, six combinations result. Only four of these combinations are logically feasible; namely, (1) descriptive data mining in retrospective reporting, (2) prescriptive data mining in retrospective reporting, (3) prescriptive data mining in prospective reporting, and (4) predictive data
mining in prospective reporting. These four combinations represent the major groupings of the proposed framework and are
used to organize the published research on accounting applications of data mining. We, respectively, refer to the categories
formed by these combinations as retrospective-descriptive, retrospective-prescriptive, prospective-prescriptive, and prospective-predictive (Fig. 2).
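The juxtaposition described above can be sketched in a few lines of Python. The snippet enumerates the six perspective-goal pairs and keeps the four that the framework treats as logically feasible. The variable names and the set-based encoding are illustrative assumptions, not part of the original framework.

```python
from itertools import product

# Two reporting perspectives and three data mining goals, as named in the text.
perspectives = ["retrospective", "prospective"]
goals = ["descriptive", "predictive", "prescriptive"]

# The four combinations the framework treats as logically feasible (Fig. 2).
# Encoding them as a set of tuples is an illustrative choice.
feasible = {
    ("retrospective", "descriptive"),
    ("retrospective", "prescriptive"),
    ("prospective", "prescriptive"),
    ("prospective", "predictive"),
}

# All 2 x 3 = 6 perspective-goal pairs.
all_pairs = list(product(perspectives, goals))

# Split the pairs into framework categories and excluded combinations.
categories = [f"{p}-{g}" for p, g in all_pairs if (p, g) in feasible]
excluded = [f"{p}-{g}" for p, g in all_pairs if (p, g) not in feasible]

print(len(all_pairs))
print(categories)
print(excluded)
```

The two excluded pairs are retrospective-predictive and prospective-descriptive, which matches the four-category framework named in the text.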
Retrospective-descriptive applications focus on business and data understanding from a historical viewpoint. The main data
mining tasks used by this strand of applications include exploration, clustering, association analysis, visualization, segmentation,
and pattern recognition. Retrospective-prescriptive and prospective-prescriptive applications emphasize efficiency, and thus utilize optimization and estimation as data mining tasks, yet they differ in their temporal orientation: retrospective-prescriptive applications focus on the past and the present, whereas prospective-prescriptive applications focus on the future. Finally,
prospective-predictive applications focus on a future business aspect using historical data, and employ data mining tasks such
as classiﬁcation, forecasting, and estimation. The proposed framework is comprehensive, simple, easy to understand, and empirically veriﬁable. Section 5 demonstrates how the proposed framework is capable of capturing the research on data mining applications in accounting.
5. Results and discussion
The review of the applications of data mining in accounting reveals various patterns relating to temporal trends; data mining
goals, tasks, and techniques used; primary accounting sub-domains and areas covered; current literature macro themes and patterns; results of mapping the literature to the various categories of the proposed framework, and intensity of coverage in each of
these categories. These ﬁndings are elaborated in the following sub-sections.
5.1. Temporal trends
Our search methodology identiﬁed a total of 209 applications (23 described in conference proceeding papers and 186 described in journal articles) of data mining in accounting between 1989 and 2014. There are clear upward trends in the application
of data mining in accounting between 1995 and 2001 and 2004–2014, with the highest number of applications in 2013 (Fig. 3). It
appears that accounting researchers and professionals have realized beneﬁts of applying data mining to accounting and thus have
shown greater propensity to adopt it over time. Of special interest is the quantum leap in the number of such applications since
2010. One possible reason for such a leap is the need for more modeling sophistication in the accounting practice following the
major worldwide ﬁnancial crisis of 2008 and the subsequent business meltdown.
Fig. 3. Number of data mining applications in accounting 1989–2014.
Fig. 4. Applications of data mining in accounting by data mining goals.
5.2. Goals, tasks, and techniques
The analysis shows that the vast majority (82%) of applications focused on predictive data mining, 11% on descriptive, and 7%
on prescriptive (Fig. 4). These patterns may be a reﬂection of the fact that prediction provides more strategic value to accounting
decision making as it embodies future outlook, strategic orientation, guidance, and positioning. Hence, the heavier emphasis on
prediction, as opposed to description and prescription, in applications of data mining in accounting.
Analysis of the reviewed papers revealed that the classiﬁcation data mining task is used by the vast majority (67%) of data
mining applications, followed by estimation (12%), clustering (6%), and optimization (5%), and that the least used data mining
tasks are pattern analysis (<0.5%), exploration (2.5%), and association (2.5%) (Fig. 5). There is a clear trend in the accounting
data mining applications to favor binary classiﬁcation as it seems to ﬁt many of the problems tackled in these applications; for
example, the objective being to classify into a binary class of: fraud/no fraud, bankruptcy/no bankruptcy, healthy/not healthy,
good performance/poor performance, etc. It seems, however, that accounting has yet to fully benefit from the many other important data mining tasks such as pattern and association analyses that have shown great benefits in other business disciplines
such as marketing (Cil, 2012; Liao and Chen, 2014).
Data mining is a multi-disciplinary approach that uses a variety of techniques from statistics, machine learning, databases, and
others. The analysis of the literature shows that neural networks is the most widely used technique (Table 1), and was used by
almost half (47%) of the applications. Such dominance of neural networks may be due to the nature of neural networks as a general problem solving technique that can be utilized in all data mining types, tasks, and business problems. Regression, a well-established and accepted method, comes a distant second and was used by 20% of the applications. Following behind are decision
trees used by 14%, support vector machines and genetic algorithms each used by 11% of the applications. Other less extensively
used techniques include: text mining, self-organizing maps, k-nearest neighbor, discriminant analysis, association rules, case-based reasoning, Bayesian networks, and k-means. There may be a familiarity gap in the accounting community with the more
advanced data mining techniques, which would explain the low usage of these techniques.
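The usage percentages quoted in this paragraph can be reproduced from the counts in Table 1 and the 209 applications reported in Section 5.1. The following is a minimal sketch of that arithmetic. The function and variable names are illustrative, and rounding to the nearest whole percent is an assumption about how the figures were derived.

```python
# Counts from Table 1 for the five most-used techniques, and the total number
# of applications (209) reported in Section 5.1.
TOTAL_APPLICATIONS = 209
technique_counts = {
    "Neural networks": 99,
    "Regression": 41,
    "Decision tree": 30,
    "Support vector machines": 23,
    "Genetic algorithms": 22,
}


def share_percent(count, total=TOTAL_APPLICATIONS):
    """Return the share of applications using a technique, as a whole percent."""
    return round(100 * count / total)


shares = {name: share_percent(n) for name, n in technique_counts.items()}
for name, pct in shares.items():
    print(f"{name}: {pct}%")

# Shares can sum to more than 100% because some applications
# used more than one technique (see the footnote to Table 1).
```

Under these assumptions the computed shares are 47%, 20%, 14%, 11%, and 11%, in line with the figures given in the text.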
5.3. Application areas
A topical analysis of data mining applications in accounting showed that almost two thirds (64%) of these applications focused
on assurance and compliance, one fourth (25%) on managerial accounting, and the remaining on ﬁnancial accounting and accounting information systems (AIS) (due to the small number of AIS applications, they are sometimes combined with ﬁnancial
Fig. 5. Applications of data mining in accounting by data mining tasks.
Table 1
Data mining techniques used in accounting applications⁎.
Data mining technique | Count
Neural networks | 99
Regression | 41
Decision tree | 30
Support vector machines | 23
Genetic algorithms | 22
Text mining | 15
Self-organizing maps | 13
Discriminant analysis; K-nearest neighbor; Bayesian networks | 9
Association rules; Case-based reasoning | 7
K-means; Fuzzy analysis | 6
Expert systems | 5
Data envelopment analysis | 4
Analytic hierarchy process; Principal component analysis; Hybrid | 3
Proprietary; Rough sets; Process mining | 2
Collocational networks; Digital analysis; OLAP; PMI (pointwise mutual information); Linear programming; Particle swarm optimization | 1
⁎ Some applications reported using more than one technique.
accounting applications) (Fig. 6). Fig. 7 further summarizes the number of applications in each accounting topic and sub-topic, and
the sections that follow provide more detail on each of these application areas. The varying intensity of using data mining across
the various branches of accounting may reﬂect the intensity of the need for advanced analytics in each of these branches. The
well-publicized auditing failures and corresponding bankruptcies as well as the tightening of regulatory legislations and oversight
may have necessitated the search for advanced technological support in the domain of assurance and compliance. Similarly, competitive pressures and the pursuit of corporate efﬁciencies may have created more need for data mining in managerial accounting
than in ﬁnancial accounting.
5.3.1. Data mining in AIS
Few authors examined the application of data mining in the area of AIS. For example, Wang et al. (2009) used self-organizing
maps, nominal data analysis, and the concept of entropy in building a chart of accounts structure for enterprise resource planning
(ERP) accounting information system. Although Wang et al. (2009) reported savings of effort and time as a result of utilizing data
mining techniques in addressing this core accounting problem that represents the initial step in an ERP accounting module implementation, their sample size was very limited (ﬁve data sets) and they did not shed any light on the accuracy or reliability of their
approach. Zheng (2011) built a resources, events, agents (REA)-based accounting information systems framework that, combined
with data warehousing, decision support systems, data mining, and other information technologies was adaptable to e-commerce
environments. Zheng (2011) suggested an enabling role of data mining in an accounting information system, particularly in an
e-commerce context, yet there is no clear reason why data mining cannot play a similar role in accounting information systems in
general.
It is not clear whether this thin research coverage of data mining applications in AIS is due to lack of reporting of such applications or due to true lack of such applications. If the former, it may be, and understandably so, because of the unwillingness to
reveal these applications for competitive considerations. If the latter, it is counter-intuitive as one would expect AIS to be a major
beneﬁciary from such key analytics technology, and thus represents a research gap and an opportunity to showcase the power of
data mining in AIS. One could also conjecture that AISs are viewed more as technology than as business systems and may thus be
tended to by technical staff rather than business people, thus missing the opportunity to leverage the business dimension of
these systems by incorporating advanced data mining applications that aim primarily at deriving business benefits.
Fig. 6. Data mining applications in primary accounting topics.
Fig. 7. Data mining applications in accounting topics and sub-topics (numbers in parentheses indicate the number of data mining applications in the corresponding
area).
5.3.2. Data mining in ﬁnancial accounting
Financial accounting applications mainly examined ﬁnancial performance and analysis. One of the earliest applications of data
mining in this area was that of Callen et al. (1996), in which they built a neural networks model to forecast quarterly accounting
earnings. This work benchmarked neural networks against linear time series forecasting models, and reported that the linear time
series models yielded better quarterly earnings forecasts than an artificial neural network model. Most of the research that followed this
early work reports the opposite finding on various problems; perhaps Callen et al.’s (1996)
experiment was difficult to replicate because their neural network model was not exactly specified. Back et al. (2001) used self-organizing maps to compare company performance extracted from numerical information versus that extracted from textual information
in annual reports. Similarly, but mainly focusing on the future, Kloptchenko et al. (2004) and Magnusson et al. (2005) used data
mining techniques to analyze quantitative and qualitative contents of ﬁnancial reports to predict future ﬁnancial performance;
both concluding that while textual content is more informative of future performance, quantitative content is more informative
of past performance. Although the primary strength of these studies is in leveraging the power of textual information, in addition
to that of quantitative information, their scope is very narrow, focusing on the international pulp and paper industry. Thus, the question of generalizability of their findings to other industries is still open.
To enhance ﬁnancial performance analysis, and taking into consideration both homogeneity of size and sector in their experiment, Hofmann and Lampe (2013) used clustering on balance sheet structure of logistics service providers. This study focused
only on the balance sheet and on macro-level variables, but ﬁnancial performance goes far beyond the balance sheet and spans
all other financial statements. Hence, Hofmann and Lampe's (2013) analysis is bound to have missed important variables that are relevant
to financial performance but fall outside the scope of the balance sheet. More specifically, to improve financial ratio analysis,
Landajo et al. (2007) developed a robust neural networks model for the cross-sectional analysis of accounting information, and
Eklund et al. (2008) used self-organizing maps to identify a set of ratios for deriving a ﬁnancial benchmarking model. The
strengths of these studies are that they took into consideration the error cost dimension in evaluating different models and selected ﬁnancial ratios based on their empirical reliability and validity in international comparisons. Perhaps feature selection techniques could have been a more direct data mining approach to selecting ratios relevant to a speciﬁc task – such as ﬁnancial
analysis.
Huang and Li (2011), using advanced text mining features (but not considering interdependence between risk factors), developed a multi-label text classiﬁcation k-nearest neighbor algorithm to identify risk factors of annual reports. Koskivaara (2004b),
an early adopter of SOMs and visualization for financial analysis – although in the context of a single medium-sized company,
used self-organizing maps for classifying and clustering accounting data for signaling unexpected ﬂuctuations. Focusing more
on discovering patterns of data quality issues, Alpar and Winkelsträter (2014) applied association rules to accounting transactions
data. Their results showed that not including error cost considerations in classiﬁcation models evaluation could lead to economically bad decisions. However, although their procedure may be useable in other companies, the discovered association rules are
company-speciﬁc.
Many authors used data mining in accounting at a more macro level. Spear and Leis (1997) developed well-specified, yet too-specific, multiple supervised neural network models to improve the choice of accounting method (full cost vs. successful effort)
for oil and gas producing companies. Beaudoin et al. (2010), using logistic regression, examined the potential effects of accounting
policy Statement No. 158 of Financial Accounting Standards on management actions. This is one of the few works that used a balanced matched-sample experimental design to contribute to the debate on costs and beneﬁts of accounting regulation. Lodhia and
Martin (2011), uniquely focusing on the use of data mining for carbon accounting and reporting, explored the use of text mining
in environmental accounting, and whether broader climate change issues were addressed in submissions made by corporations
and other stakeholders to regulatory agencies. Garnsey (2006) used clustering to derive related accounting concepts to improve
access to, and retrieval of, ﬁnancial accounting material. Other authors (such as Henry, 2006; Cho et al., 2010; Li, 2010; Davis and
Tama-Sweet, 2012; Huang et al., 2013; Huang et al., 2014) focused on the use of text mining in analyzing the narrative across
different disclosure outlets to examine the relationship between the tone of the disclosure, future performance, and investor and market reactions. Their overall consensus is that including predictor variables that capture the verbal content and writing style of earnings press releases results in more accurate predictions of market response to earnings announcements. In addition, the language
and verbal tone used in corporate environmental disclosures, in addition to their amount and thematic content, should be considered when investigating the relation between corporate disclosure and performance.
Data mining in ﬁnancial accounting has primarily focused on ﬁnancial performance and ratio analysis; such as forecasting
quarterly accounting earnings, comparing informational value of numerical versus textual data for performance measurement, ﬁnancial performance benchmarking, identifying risk factors in annual reports, visualization of patterns in accounting data, assessment of quality of accounting data underlying ﬁnancial reports, and the impact of management announcements and their tones
on market response, among others. These applications have primarily focused on description and prediction as goals, and used
clustering and classiﬁcation as data mining tasks. Neural networks and text mining are the most prevalent techniques of these
applications. Future research opportunities include more utilization of textual components of the ﬁnancial reports in predicting
ﬁnancial performance, importance of domain expertise in data mining applications, paying more attention to data quality issues,
importance of benchmarking of applications, going beyond financial ratios to capture relevant inputs for better prediction of financial
performance, sensitivity analysis of derived models to data characteristics, and the importance of considering variable relationships in the analysis.
5.3.3. Data mining in managerial accounting
Managerial accounting applications focused on major areas such as cost management, asset management, and budgeting and
pricing management.
5.3.3.1. Cost management. Data mining has been applied in the area of cost management at various costing levels: equipment, process, construction, product, and project. At the equipment level, data mining has been used for estimation of equipment
manufacturing cost (Chou et al., 2010; Chou et al., 2011), for improving the accuracy of equipment inspection and repair
(Chou and Tsai, 2012), and for tracing equipment replacement costs (Dessureault and Benito, 2012). The cost management strand of research on data mining applications demonstrated both strengths and weaknesses. Strengths include well-defined accuracy measures (Chou et al., 2010), use of hybrid paradigms (Chou et al., 2011; Kostakis et al., 2008), utilization of hierarchical
analysis approaches (Chou and Tsai, 2012), and highlighting of the importance of data understanding (Dessureault and Benito,
2012). Weaknesses include a very limited scope and sample size (Chou et al., 2010; Chou et al., 2011; Chou and Tsai,
2012), as well as use of limited variables (Dessureault and Benito, 2012) or simulated data (Kostakis et al., 2008). At the business
process level, data mining has been used for defining cost drivers in activity-based costing and improving production process
routing (Kostakis et al., 2008; Liu et al., 2012), as well as for intelligent transfer price decision making (Kirsch et al., 1991). At
the construction level, authors (Yu et al., 2006; Shi and Li, 2008; Migliaccio et al., 2011; Vouk et al., 2011) focused on the application of data mining to construction cost management, creating a neural networks system for simple, fast, and adequately accurate estimation of total or unit cost of construction, operation, and maintenance. Some of these studies are not replicable because
the variables used are not speciﬁed (e.g. Yu et al., 2006), and some are not integrated into the existing operational systems (e.g.
Shi and Li, 2008). Many authors applied data mining to product costing; namely, for forecasting product unit cost (Chang et al.,
2012), estimating product life-cycle cost (Seo et al., 2002; Yeh and Deng, 2012), estimating project design cost (Deng and Yeh,
2010), and estimating product manufacturing cost (Deng and Yeh, 2011). These studies utilized the power of hybrid data mining
modeling techniques to report their results, yet they are narrowly-focused on a single industry, company, and/or product, and
thus can hardly be considered of wide applicability. At the project level, data mining has been used to develop a project-level
cost control system (Zhao and Ding, 2009; Ji et al., 2010, 2011; Kaluzny et al., 2011; Petroutsatou et al., 2011), and to develop
a project-level cost estimation system (Shan and He, 2012). These cost estimation applications are not limited to tangible products
or projects, but also extend to estimation of cost for intangible projects such as software projects (Huang et al., 2007; Khalifelu
and Gharehchopogh, 2012).
5.3.3.2. Asset management. Inventory management, including inventory classiﬁcation, costing, optimization, and controlling is a key
factor that inﬂuences company competitiveness. In the area of inventory control, neural networks were used to optimize inventory level (Bansal et al., 1998a, 1998b; Reyes-Aldasoro et al., 1999), reporting a 50% reduction in inventory cost (from over a billion dollars to about half-a-billion dollars) while maintaining the same level of probability that a particular customer’s demand
will be satisﬁed. In addition, these papers describe the use of traditional statistical techniques to help determine the best neural
network type for a particular application. To better manage inventory, many data mining techniques were used, including decision
trees (Braglia et al., 2004), fuzzy neural networks (Li and Kuo, 2008), and genetic algorithms (Zeng et al., 2006). These studies
reported improved management processes and, consequently, reduced inventory holding costs by using hybrid data mining
models, including combinations of fuzzy systems and neural networks, integrating neural networks, analytic hierarchy process,
and genetic algorithms.
The predictive accuracy of data mining techniques on the LIFO/FIFO and ABC inventory classiﬁcation has been an active area of
research to minimize total inventory costs, including ordering cost, holding cost, purchase cost and transportation cost (Liang et
al., 1992; Altay Guvenir and Erel, 1998; Partovi and Anandarajan, 2002; Šimunović et al., 2009; Yu, 2011; Kabir and Hasin, 2013).
An interesting finding in these studies is that there was no significant difference between the backpropagation and the genetic algorithms learning methods in the predictive accuracy of neural networks for classifying inventory. Data mining has also
been used to improve inventory management in e-commerce environments (Chodak and Suchacka, 2012), resulting in better recommender systems that take into consideration product cost, which results in moving inventory items that otherwise remain dormant in e-stores. Many authors (Gaafar and Choueiki, 2000; Megala and Jawahar, 2006) addressed the material requirements
planning (MRP) lot-sizing problem using various data mining techniques. Related work used genetic algorithms (Lee et al., 2013) to develop an integrated model for lot-sizing with supplier selection and quantity discount; neural networks and genetic algorithms (Zhou et al.,
2009) to optimize the multi-objective function of selecting materials for a product and, similarly (Wu and Hsu, 2008), to design bill
of material configuration for reducing logistic costs for spare parts inventory; stochastic neuro-fuzzy models (Gumus et al., 2010) for inventory management in a multi-echelon environment; and neural networks (Wang, 2011) for classification of inventory risk level.
The overall focus of these authors is to achieve high solution quality at acceptable computational time, and the common ﬁnding is
that data mining techniques, used individually or integrated in hybrid models, are capable of solving the static or dynamic lot-sizing problem with notable consistency and reasonable accuracy.
In addition, data mining techniques have been used to improve the accuracy and efﬁciency of asset evaluation (Liu and Ren,
2009), for identifying important factors affecting intangible assets value (Tsai et al., 2012), and for improving prediction of cash
flow (Cheng and Roy, 2011). Accuracy of physical or intangible asset valuation is important to both investors and creditors, especially in the context of knowledge-based economies where assets are increasingly intangible knowledge as opposed to physical assets. Hence the need for novel approaches to the valuation of such intangible assets, as reported by Tsai et al. (2012),
who used feature selection, an important data-preprocessing step in data mining, to identify important and representative factors affecting intangible assets; they concluded that their model is simple and feasible and improves valuation accuracy and
efficiency.
5.3.3.3. Other. In addition to the broad categories of cost and asset management, this study revealed application of data mining to
other areas of managerial accounting. For budgeting, Chou (2009) and Tang (2009) developed a web-based case-based reasoning system for early cost budgeting to assist decision makers in project screening, and applied a fuzzy analytic hierarchy process for multi-criteria decision-making to improve budget allocation decisions. In the area of revenue management, Ragothaman and Lavin (2008)
used neural networks to curb improper revenue recognition practices by predicting ﬁrms that will restate their revenue. Their results show that the neural network model has superior predictive power for predicting revenue restatement ﬁrms compared to
the Multiple Discriminant Analysis (MDA) and Logit models; although the Logit and MDA models predict nonrevenue restatement
ﬁrms better. Moreover, when misclassiﬁcation costs are taken into consideration, the neural network model performs the best
with the lowest relative misclassiﬁcation costs. On the other hand, to improve revenue underpayment recovery process in a
healthcare organization, Hennigan and Chowlera (2011) developed a proprietary data mining algorithm to recover more than
$20 million in less than twenty months of implementation, and to increase staff productivity by 100% in less than six months, allowing
auditors to resolve 40 to 50 claims per day versus 20 per day prior to their system's implementation. For account reconciliation,
Chew and Robinson (2012) explored the application of natural language processing and achieved 100% precision and recall on a real-life dataset, suggesting that their approach is highly reliable and eliminated most of the manual work for their test problem, and pointing to the possibility of highly desirable improvements in information technology controls to reduce the cost of external audit
work. In the area of mergers and acquisitions, Shawver (2005) used neural networks for accurately predicting bank merger premiums to better price mergers, accrue competitive advantage in pricing merger offers, and enhance the possibility that the merger
will achieve its intended ﬁnancial, strategic, and/or operational synergy.
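The point made earlier in this paragraph about misclassification costs, namely that a model can rank differently once the two error types are priced differently, can be illustrated with a small sketch. All numbers below are hypothetical and are not taken from Ragothaman and Lavin (2008) or any other cited study. The cost formula is one common formulation of expected misclassification cost, assumed here for illustration, and the names are invented.

```python
def expected_misclassification_cost(fn_rate, fp_rate, prior_pos, cost_fn, cost_fp):
    """Expected cost per case: P(pos) * FN rate * C_FN + P(neg) * FP rate * C_FP."""
    return prior_pos * fn_rate * cost_fn + (1 - prior_pos) * fp_rate * cost_fp


def overall_error_rate(fn_rate, fp_rate, prior_pos):
    """Share of all cases that are misclassified, ignoring costs."""
    return prior_pos * fn_rate + (1 - prior_pos) * fp_rate


# Hypothetical error rates for two models (NOT from any cited study).
models = {
    "model_A": {"fn_rate": 0.20, "fp_rate": 0.30},
    "model_B": {"fn_rate": 0.40, "fp_rate": 0.10},
}

# Hypothetical setting: 10% of firms are positives (e.g., restatement firms),
# and missing a positive is assumed to cost 20 times as much as a false alarm.
PRIOR_POS = 0.10
COST_FN = 20.0
COST_FP = 1.0

costs = {
    name: expected_misclassification_cost(
        r["fn_rate"], r["fp_rate"], PRIOR_POS, COST_FN, COST_FP
    )
    for name, r in models.items()
}
errors = {
    name: overall_error_rate(r["fn_rate"], r["fp_rate"], PRIOR_POS)
    for name, r in models.items()
}

lowest_cost = min(costs, key=costs.get)
lowest_error = min(errors, key=errors.get)
print(lowest_cost, lowest_error)
```

With these assumed inputs, the model with the lower overall error rate is not the one with the lower expected cost, which is why evaluations that ignore error costs can favor the economically worse model.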
Data mining applications in managerial accounting mainly addressed cost management at different levels (product, equipment,
process, and project levels), asset (mainly inventory) management, among others with less emphasis. These applications cover
many speciﬁc implementations such as classifying, selecting, predicting, and optimizing inventory management, deﬁning cost
drivers, estimating and forecasting project and product cost, developing cost estimation models for product, equipment, and project, developing budgeting systems, predicting cash ﬂows, etc. The primary goals in these applications involve prediction and prescription. The main tasks are estimation and optimization, and the predominant technique is neural networks. Future research
opportunities in this domain include: exposure of these applications using web services, better identiﬁcation of relevant variables,
more sensitivity analysis of generated models, improving model parsimony, improving data handling, clearly differentiating between
causation and association, addressing issues of data availability, and cross-industry model validation.
5.3.4. Data mining in assurance and compliance
5.3.4.1. Auditing. Accounting transactions are becoming more complicated and easier to manipulate with the increasing use of
online systems and the proliferation of smart devices and the internet of things. This necessitates a more sophisticated auditing
profession, including an increasing use of the advanced techniques of data mining. Nor can one ignore the important role that information technology has to play in improving the efficiency of the monitoring and controlling process (Daigle and Lampe,
2005). Data mining has been applied throughout the auditing cycle: planning (such as engagement, risk assessment, design of
audit plan), conducting (mainly performing substantive audit tests), and reporting (audit report). Data mining has also been applied after the audit cycle, including to the impact and consequences of the auditor’s opinion.
In the engagement phase, data mining has been used to predict auditor selection (Kirkos et al., 2008, 2010) and switching
(Kirkos, 2012), to ﬁnd the optimal match between the audit project characteristics and auditor expertise in public construction
projects (Wang and Kong, 2012), and to classify the level of corporate audit costs and variation in audit fees (Curry and Peel,
1998; Beynon et al., 2004). In today’s information-rich environments, risk assessment involves recognizing patterns in the data,
such as complex data anomalies and discrepancies that may conceal one or more error or hazard conditions (Ramamoorti et
al., 1999). Calderon (1999) and Ramamoorti et al. (1999) studied the ability of neural networks to enhance auditors’ risk assessment and reported that neural network modeling is invaluable in directing internal auditor attention to those aspects of ﬁnancial,
operating, and compliance data most informative of high-risk audit areas, and thus enhances audit efﬁciency and effectiveness. Similarly, Davis et al. (1997) and Hwang et al. (2004) developed neural networks models to support auditors in conducting control risk
assessment; concluding that neural networks provide auditors an effective way to recognize patterns in the large number of control
variable inter-relationships that even experienced auditors cannot express. Likewise, Issa and Kogan (2014) proposed a predictive
logistic regression model as a tool for quality review of control risk assessments, and thus improve audit efﬁciency by focusing on
the concept of audit by exception. For audit planning, Ragothaman et al. (1995) developed a rule-based system that assists auditors
at the planning stage in the design of subsequent substantive tests, when material errors and irregularities in the ﬁnancial statements are probable; demonstrating that this system outperforms a model based on discriminant analysis in classifying ﬁrms into
error and non-error categories. Unfortunately, the sample size used in their study limits the generalizability of the generated rules.
Some of the interesting findings in the application of data mining in the audit engagement phase are that the level of debt is a
factor that inﬂuences the auditor choice decision, gross proﬁt is the variable that best predicts auditor switching, and that adoption of an effective auditor procurement process increases the likelihood that a company will engage and match the right auditor
at a fair price (Kirkos et al., 2008, 2010; Kirkos, 2012; Wang and Kong, 2012). In addition, Curry and Peel, 1998 report that neural
network models exhibit superior forecasting accuracy to their ordinary least squares counterparts in predicting the cross sectional
variation in corporate audit fees, although this differential reduces when the models are tested out of sample.
In the audit conducting phase, Argyrou and Andreev (2011) proposed a semi-supervised tool for clustering accounting databases
as an internal control procedure through usage of self-organizing maps to supplement internal controls, verify processing of accounting transactions, and assess accuracy of ﬁnancial statements. Their empirical results suggest that the proposed tool can compress a large number of accounting transactions, generating homogeneous, well-separated, and interpretable clusters. In
performing substantive tests, Coakley and Brown (1993) and Koskivaara (2000a) used neural networks in predicting patterns in
auditing monthly balances as part of auditors’ analytical review process, and suggested that neural networks recognize patterns
within ﬁnancial accounts as well as the dynamics and the relationships between these accounts more effectively than did ﬁnancial
ratio and regression methods. Koskivaara (2000b), focusing on the pre-processing of the data, investigated the ability of neural
networks for recognizing the dynamics and the relationships between ﬁnancial accounts values in order to detect unexpected
fluctuations. The findings indicate that the best results were achieved when all the data were preprocessed by scaling them either
linearly or linearly on a yearly basis, yet no explanation is provided for why this is the case, although the author cautioned about the stability of the proposed model. Along the same lines, Coakley (1995) suggested the use of neural networks
in pattern recognition of the investigation signals generated by analytical procedures. The results suggest that using an artificial neural network (ANN) to analyze
patterns of related fluctuations across numerous financial ratios provides a more reliable indication of the presence of material
errors than either traditional analytic procedures or pattern analysis, offers improved performance in recognizing material misstatements within the financial accounts, and provides insight into the plausible causes of the errors.
Koskivaara and Back (2007) proposed a neural network model for analytical review in continuous auditing of financial data,
namely for estimating the future revenues and expenses of an organization, and concluded that the neural network model is most successful for such estimates.
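The two preprocessing options mentioned above for Koskivaara (2000b), scaling linearly and scaling linearly on a yearly basis, can be sketched as follows. This is a minimal illustration, not the author's procedure: it assumes that "linearly" means min-max scaling to [0, 1] and that "on a yearly basis" means applying that scaling separately within each year. The function names and the balances are hypothetical.

```python
from collections import defaultdict

def scale_linear(values):
    """Min-max scale a list of numbers to [0, 1]; a constant series maps to 0.0."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]

def scale_linear_by_year(records):
    """Scale each (year, value) pair using the min and max of its own year."""
    by_year = defaultdict(list)
    for year, value in records:
        by_year[year].append(value)
    bounds = {y: (min(vs), max(vs)) for y, vs in by_year.items()}
    scaled = []
    for year, value in records:
        lo, hi = bounds[year]
        span = hi - lo
        scaled.append((year, (value - lo) / span if span else 0.0))
    return scaled

# Hypothetical account balances (illustrative numbers only).
records = [(2021, 100.0), (2021, 150.0), (2021, 200.0),
           (2022, 300.0), (2022, 350.0), (2022, 400.0)]
overall = scale_linear([value for _, value in records])  # one scale for all years
yearly = scale_linear_by_year(records)                   # a separate scale per year
```

Under these assumptions, the overall scaling preserves the growth between years, whereas the yearly scaling removes it and keeps only the within-year pattern.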
In the post-auditing phase, the informational content conveyed by the auditor’s going concern opinion has a substantial
impact on a firm’s current and future standing. Jones (1996) examined the abnormal stock returns surrounding the release of
the auditor’s going concern report using ordinary least squares regression, and found that mean abnormal returns
surrounding the release of the auditor’s report were lower for going concern opinions
than for clean opinions and that the magnitude of the abnormal returns depended on the extent to which the opinion type was
unexpected by investors. Bhimani et al. (2009) examined the influence of the release on the firm’s ability to continue and the possibility of subsequent default, revealing that the likelihood of default for firms that received a going concern opinion is 2.8 times
that of firms that received a clean opinion.
The auditing profession has become highly litigious in recent years. Misclassiﬁcation of a potential future bankruptcy candidate
as healthy is referred to as an audit failure, and may result in substantial litigation costs. For example, Ernst & Young was compelled to pay $400 million and KPMG paid $186.5 million for audit failures (Anandarajan et al., 2001). Blacconiere and DeFond
(1997) investigated the independent audit opinions of publicly-traded savings and loans that subsequently failed and the
resulting substantial independent auditor litigation, concluding that the only variable consistently related to independent auditor
litigation was client size – possibly because failures of larger ﬁrms are more costly to regulators than the failure of smaller ﬁrms;
in which case, regulators may be more likely to initiate lawsuits against auditors of larger ﬁrms. Chen et al. (2009a, 2009b) looked
F.A. Amani, A.M. Fadlalla / International Journal of Accounting Information Systems 24 (2017) 32–58
to the ability of data mining techniques, mainly neural networks, to predict fraud litigation in order to assist accountants in devising an audit strategy, and found that neural networks are highly capable of identifying potential lawsuits.
In addition, data mining has been applied to improve business processes. For example, Jans et al. (2013) explored the value added by process mining to audit practice, and Mueller-Wickop and Schultz (2013) demonstrated the benefits of process mining
in the audit domain through an algorithm that determines an activity sequence from accounting data to construct considerably improved business process models. These studies concluded that process mining, among its many benefits, allows the auditor to
conduct analyses not possible with existing audit tools, such as discovering the ways in which business processes are actually
being carried out in practice, and to identify social relationships between individuals.
5.3.4.2. Business health. In the audit conducting phase, the main goal is to assess the ﬁnancial position of a ﬁrm in order to decide
on the auditor opinion for the final reporting phase. In business health, researchers focused on three major areas in the application of data mining: financial viability, bankruptcy, and going concern. Financial viability or business failure can be defined as a
situation in which a firm cannot pay lenders, preferred stock shareholders, suppliers, or other creditors, or the firm goes into bankruptcy according to the law (Dimitras et al., 1996). Applications of data mining in this area include supporting the auditor’s
judgment about a client’s continued financial viability (Etheridge et al., 2000), predicting business failure (Ahn et al., 2000;
Chakraborty and Sharma, 2007; Tang and Chi, 2005; Huang et al., 2008; Youn and Gu, 2010; Benhayoun et al., 2013; Chen,
2013; Chen et al., 2013; Li et al., 2013), and classifying, predicting, and preventing bank failures (Alam et al., 2000; Tung et al.,
2004; Boyacioglu et al., 2009; Quek et al., 2009). A common feature of these studies is that almost all of them used a hybrid
data mining modeling approach. Some of their main findings include: using the overall error rate metric, a probabilistic neural network is the most reliable tool for predicting financial viability, but when the estimated relative costs of misclassification are considered, the best such predictor is the categorical learning neural network model. While neural networks and regression
demonstrate comparable Type I errors, neural networks show lower Type II errors for both in-sample and hold-out sample predictions. Additionally, interest coverage is the most important signal of business failure (Youn and Gu, 2010). Chen (2013) reported
that different neural network learning techniques have different accuracy of prediction across time horizons. Tang and Chi (2005)
investigated the inﬂuences of network architecture, variable selection, sample mixture of training and testing subsets on neural
network models’ learning and prediction capability.
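The finding that the "best" model changes once relative misclassification costs are considered can be illustrated with a small sketch. Everything below is hypothetical: the labels, the two models, and the 20:1 cost ratio are illustrative assumptions, not figures from the studies cited. The two error types are named descriptively (missed failure, false alarm) because the mapping to "Type I" and "Type II" varies between studies.

```python
def confusion_counts(actual, predicted):
    """Count errors for binary labels where 1 = failed firm and 0 = viable firm."""
    missed = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    false_alarm = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    return missed, false_alarm

def overall_error_rate(actual, predicted):
    """Share of all firms that are misclassified, ignoring the kind of error."""
    missed, false_alarm = confusion_counts(actual, predicted)
    return (missed + false_alarm) / len(actual)

def expected_cost(actual, predicted, cost_missed, cost_false_alarm):
    """Average misclassification cost per firm under the given unit costs."""
    missed, false_alarm = confusion_counts(actual, predicted)
    return (missed * cost_missed + false_alarm * cost_false_alarm) / len(actual)

# Hypothetical sample: 2 failed firms and 8 viable firms.
actual  = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
model_a = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # misses both failures, no false alarms
model_b = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # catches both failures, 3 false alarms

# Assumed costs: a missed failure is 20 times as costly as a false alarm.
cost_a = expected_cost(actual, model_a, cost_missed=20, cost_false_alarm=1)
cost_b = expected_cost(actual, model_b, cost_missed=20, cost_false_alarm=1)
```

In this sketch model A has the lower overall error rate (0.2 versus 0.3), yet model B has the far lower expected cost, so the ranking of the two models reverses once costs are weighted.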
Bankruptcy prediction is a critical topic that has been studied extensively and persistently in the accounting and ﬁnance literature. Many authors used data mining techniques for bankruptcy prediction (Jo et al., 1997; O’Leary, 1998; Yang et al., 1999;
Zhang et al., 1999; Charalambous et al., 2000; Lee et al., 2005; Min and Lee, 2005; McKee, 2007; Tsai and Wu, 2008; Chen et
al., 2009a, 2009b; Mokhatab Raﬁei et al., 2011; Shirata et al., 2011; Olson et al., 2012; Fedorova et al., 2013; Kasgari et al.,
2013; Korol, 2013; Serrano-Cinca and Gutiérrez-Nieto, 2013; Tinoco and Wilson, 2013). A somewhat surprising result is that
of Yang et al. (1999), in which back-propagation was reported to have failed to discriminate between bankrupt and non-bankrupt
firms and linear discriminant analysis was reported to be superior to probabilistic neural networks. On the other hand, Zhang et al.
(1999) reported that neural networks are robust to sampling variations in overall classification performance. The work of Shirata et al. (2011)
demonstrated the effectiveness of text mining in bankruptcy prediction, in that certain combinations of terms were effective
in distinguishing between bankrupt and non-bankrupt companies. More speciﬁcally, Pompe and Bilderbeek (2005) examined factors that inﬂuence bankruptcy prediction, noting that models generated from the ﬁnal annual report published prior to bankruptcy were less successful in the timely prediction of failure, and economic decline coincided with the deterioration of a model’s
performance. While all these authors use only quantitative measures, mainly ﬁnancial ratios, in their bankruptcy prediction
modeling, Anandarajan et al. (2001) used both qualitative and quantitative measures. Whereas Cho et al. (2009) developed an
integrated model combining statistical and AI techniques for bankruptcy prediction, others focused on the performance accuracy
of bankruptcy prediction models (Tseng and Hu, 2010; Kim and Kang, 2010; du Jardin, 2010),
with no conclusive agreement on which modeling technique offers the best predictive power. This conclusion is not surprising given the many different ways each technique can be parametrized and the specifics of the problem addressed. In a nutshell, there is no evidence that one data mining technique outperforms the others under all circumstances.
Auditing standards, namely SAS No. 126 in 2012, address the auditor’s responsibilities in an audit of ﬁnancial statements with
respect to evaluating whether there is substantial doubt about the entity’s ability to continue as a going concern. Determining the
going concern status of a company is not an easy task. For predicting going-concern qualiﬁcation, Peel (1989) used logistic regression and found that high gearing, low proﬁtability, and low ownership concentration were consistently associated with the
auditor’s decision to issue a going-concern qualification. Koh and Tan (1999), Lenard et al. (1995), and Koh and Low (2004) used logistic regression as well as neural networks and decision trees for predicting firm going concern status. The consensus of the first
two studies is that neural networks offer a robust model for auditors to support their assessment of going concern
opinion. Koh and Low (2004) reported the superiority of decision trees as a predictive model for going concern over neural networks and logistic regression. Kleinman and Anandarajan (1999) used non-financial variables as predictors of an auditor’s
going concern opinion, and for supporting an auditor’s going-concern assessment decision, highlighting the power of qualitative
data in predicting going-concern qualiﬁcation. Lenard et al. (2000) developed a going concern evaluation decision model based on
fuzzy clustering and a hybrid model of a statistical model and an expert system to identify categories of ﬁrms with particular
characteristics that may indicate whether or not the audit report of the ﬁrms requires a going concern modiﬁcation. Whereas
Lenard et al. (2001) examined decision making capabilities of a hybrid rule-based expert system and discriminant analysis to provide insight into the characteristics of firms that experience problems, but do not necessarily receive a going-concern modification, Martens et al. (2008) constructed an effective going concern predictive system using support vector machines and rule-based classifiers. Shirata and Sakagami (2008) used text mining for clarifying the difference between going-concern and non-going-concern companies by analyzing the nonfinancial (qualitative) information disclosed in financial reports. Doumpos et al.
(2005) developed a support vector machine model combining publicly available financial information and credit-risk rating indicators in explaining qualifications in audit reports. Their major conclusion is that linear and non-linear support vector machines
models are capable of distinguishing between qualiﬁed and unqualiﬁed ﬁnancial statements, at a point in time as well as over
time, with satisfactory accuracy. Spathis (2003) used logistic regression and ordinary least squares regression to test the extent
to which combinations of ﬁnancial and non-ﬁnancial information can be used to enhance the ability to discriminate between
the choices of a qualiﬁed or unqualiﬁed audit report. The qualiﬁcation decision is associated with ﬁnancial information such as
ﬁnancial distress and with non-ﬁnancial information such as ﬁrm litigation. Salterio (1996) used case-based reasoning in investigating whether precedents and the client’s preferred accounting policy affect auditors’ accounting policy judgments. Kirkos et al.
(2007a), Gaganis et al. (2007a), and Gaganis et al. (2007b) focused on using data mining for identifying qualiﬁed auditors’ opinions, and more speciﬁcally, Anandarajan and Anandarajan (1999) compared the predictive ability of neural networks, expert systems, and multiple discriminant analysis in determining what type (modiﬁed or disclaimer) of going concern report should be
issued. The ﬁndings of these studies indicate that the inclusion of credit rating in the models results in a considerable increase
both in terms of goodness of fit and classification accuracy, and that the results are mixed concerning the accuracy of industry-specific models, as opposed to general models. Furthermore, financial distress and profitability are reported to be strongly related to
qualiﬁed opinions, yet liquidity and auditor’s characteristics seem to be irrelevant to identifying qualiﬁed auditors’ opinions.
5.3.4.3. Forensic accounting. AICPA explicitly acknowledges the responsibility of auditors in fraud detection (Cullinan and Sutton,
2002). Detection of manipulated ﬁnancial statements by using normal audit procedures becomes an incredibly difﬁcult task
(Dikmen and Küçükkocaoğlu, 2010). “Fraud risk assessment is a highly complex process that is a part of every audit engagement.
Over time, regulatory requirements have steadily increased the amount of time and effort required of the auditor to assess fraud.
It follows, therefore, that fraud risk assessment presents an ideal opportunity for technological assistance” (Comunale et al., 2010).
The review of literature showed prevalent usage of data mining by researchers and practitioners to detect fraud. Researchers tackled different levels and areas of fraud. Some focused on detecting fraud risk at the more macro level of the audit engagement
(Comunale et al., 2010), and others focused on detecting fraud at the more micro level of business transactions (Debreceny
and Gray, 2010; Bella et al., 2009; Tackett, 2013). Whereas Debreceny and Gray (2010) researched journal entry fraud
using digit analysis and found that the distribution of ﬁrst digits of journal dollar amounts differed from that expected by
Benford’s Law, Bella et al. (2009) developed a four-stage self-organizing map fraud detection architecture of electronic billing records, and Tackett (2013) suggested the use of association rules in detecting fraud through ﬁnding patterns and relationships
when examining a company’s digital records. On the other hand, Bay et al. (2006) focused on identifying the irregularities at
the general ledger level, and Jans et al. (2010), Jans et al. (2011), and Owusu-Ansah et al. (2002) focused on detecting fraud at the business cycle or process level. While Jans et al. (2010) used descriptive data mining techniques for detecting and reducing risk of
internal fraud at the procurement cycle level, Jans et al. (2011) examined the effectiveness of fraud detection audit procedures
at the stock and warehousing cycle level, and Owusu-Ansah et al. (2002) employed business process mining for mitigating internal transactions fraud in the procurement processes. These authors found that size of the audit ﬁrm, auditor’s position tenure, and
auditor’s years of experience are statistically signiﬁcant predictors of fraud. Using a combination of Benford’s Law and neural networks, Busta and Weinberg (1998) focused on detecting manipulated ﬁnancial data in analytical review procedures, and Kim and
Vasarhelyi (2012) used data mining for detecting company level internal fraud.
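The first-digit analysis described above for journal entry amounts can be sketched as follows. Benford's Law gives the expected share of leading digit d as log10(1 + 1/d). The chi-square comparison used here is an assumed choice of test statistic, since the text does not say which test Debreceny and Gray (2010) applied, and both data sets are hypothetical.

```python
import math
from collections import Counter

def benford_expected(digit):
    """Benford's Law probability that the leading digit is `digit` (1 to 9)."""
    return math.log10(1 + 1 / digit)

def first_digit(amount):
    """Leading non-zero digit of a non-zero amount, read from scientific notation."""
    return int(f"{abs(amount):.15e}"[0])

def benford_chi_square(amounts):
    """Chi-square distance between observed first digits and Benford's Law."""
    amounts = [a for a in amounts if a != 0]
    counts = Counter(first_digit(a) for a in amounts)
    n = len(amounts)
    stat = 0.0
    for d in range(1, 10):
        expected = n * benford_expected(d)
        stat += (counts.get(d, 0) - expected) ** 2 / expected
    return stat

# Illustrative data only: powers of two are known to track Benford's Law closely,
# while amounts clustered just under a hypothetical 10,000 approval limit do not.
benford_like = [2 ** k for k in range(1, 101)]
suspicious = [9000 + 7 * i for i in range(100)]

stat_ok = benford_chi_square(benford_like)
stat_flagged = benford_chi_square(suspicious)
```

With nine digit classes there are 8 degrees of freedom, so a statistic above roughly 15.51 would be flagged at the 5% level. Such a flag indicates only a departure from the expected distribution and calls for follow-up, not a conclusion of fraud.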
Management fraud is a type of fraud that adversely affects stakeholders through misleading or fraudulent ﬁnancial statements
(FFS) (Elliott and Willingham, 1980), thus many researchers focused on detecting FFS with the help of data mining at different
levels: detection of top management fraud (Fanning and Cogger, 1998; Pai et al., 2011), detection of fraud based on prediction
of company future performance (Virdhagriswaran and Dakin, 2006), and detection of fraud in ﬁnancial reports (Kirkos et al.,
2007b; Ata and Seyrek, 2009; Deng, 2009; Zhou and Kapoor, 2011). Other researchers (Cerullo and Cerullo, 2006; Feroz et al.,
2000; Green and Choi, 1997; Hoogs et al., 2007; Huang et al., 2012; Jie and Wei, 2009; Krambia-Kapardis et al., 2010; Ogut et
al., 2009; Ravisankar et al., 2011; Tsaih et al., 2009; Yue et al., 2009; Zouboulidis and Kotsiantis, 2012; Kotsiantis et al., 2006;
Perols, 2011) used data mining for predicting FFS. An important finding of these authors is the ability of neural network
models to classify membership in SEC investigated versus non-investigated firms with high accuracy. One explanation for such
relative success of neural networks is their ability to use adaptive learning processes to determine what is important to distinguish true signal from noise. These studies also explored the effectiveness of combining financial and governance indicators, exogenous and endogenous factors, and feature selection to detect fraudulent financial statements. Along the same lines, the work of
Gaganis (2009) involved the use of data mining classiﬁcation techniques combining both ﬁnancial and nonﬁnancial data for the
identiﬁcation of FFS and concluded that classiﬁcation accuracy depends on the way the data is pre-processed, the objective function, and the search strategy of the model. Alden et al. (2012) used genetic algorithms in detecting patterns of FFS and concluded
that GAs and an estimation of distribution algorithm demonstrate a better ability to classify patterns of financial statement fraud than
the traditional logistic regression model. More specifically, Lin et al. (2003) developed an integrated fuzzy neural network
model to assess the risk of FFS. The fuzzy neural network model of Lin et al. (2003) outperformed most statistical models and
artiﬁcial neural networks reported in prior studies, and its performance compared favorably with a baseline Logit model. Liou
(2008) explored the differences and similarities between falsiﬁed ﬁnancial reporting detection and business failure prediction
models using logistic regression, neural networks, and decision trees and found that the financial factors used to detect fraudulent
reporting are helpful for predicting business failure. Welch et al. (1998) developed a data mining-based classiﬁer system for
modeling auditor decisions when estimating the likelihood of fraud by contractors developing bids for government contracts, and
reported that, in classification decision models involving simultaneous processing, genetic algorithms represent an innovative
heuristic approach that may produce improved models when compared to traditional mathematical approaches. Kochetova-Kozloski et al. (2011) used data mining for improving auditors’ judgments on fraudulent management events.
The search for indicators of ﬁnancial reporting fraud is not limited to using the numeric part of ﬁnancial statements, which
was the focus of many previous researchers, but it extends to the evaluation of the qualitative part. For example, Humpherys
et al. (2011) and Purda and Skillicorn (2014) evaluated and classiﬁed the management’s discussion and analysis section of
Form 10-K using linguistic credibility analysis and data mining techniques. Their ﬁndings indicate that writers of fraudulent disclosures may write more to appear credible while communicating less in actual content, and support the usefulness of linguistic
analyses by auditors to ﬂag questionable ﬁnancial disclosures and to assess fraud risk. Similarly, Yu et al. (2013) constructed
models based on data mining techniques for detecting and classifying violations of accounting information disclosure by
listed companies. Goel et al. (2010) and Goel and Gangolly (2012) examined qualitative textual content in annual reports to predict fraud, and found that textual information provides valuable clues pertaining to fraud prediction. Similarly, Gupta and Gill
(2012a) used support vector machines in detecting fraud in the qualitative part of ﬁnancial statements, and Gupta and Gill
(2012b) examined the efﬁcacy of decision trees, Naïve Bayes and genetic algorithms for preventing and detecting FFS. Larcker
and Zakolyukina (2012) focused on detecting financial statement manipulations in CEO and CFO narratives during quarterly earnings conference calls. These researchers reached conclusions similar to those of Humpherys et al. (2011) and Purda and
Skillicorn (2014), in that the use of linguistic features is an effective means of detecting fraud and results in significant improvement in the accuracy of detecting fraud in financial reports.
In the area of predicting earnings management, Tsai and Chiou (2009) developed neural networks and decision tree models to
be used by investors in predicting the level of earnings management in advance, and whether earnings are managed upward or
downward. The results of Tsai and Chiou (2009) indicate that using data mining techniques signiﬁcantly improved prediction of
earnings management and generated decision rules that help in detecting earnings management. On the other hand, Ezazi et al.
(2013) examined the usefulness of various data mining techniques in predicting earnings management, clearly questioning the
assumption of linearity for modeling the accrual process, and concluded that a non-linear approach to predicting earnings management is more effective than a linear approach. Focusing on detecting earnings management, Hoglund (2012) assessed the performance of different data mining techniques; more specifically, Hoglund (2013a) assessed the performance of the cross-sectional
Jones (1991) accrual model using a genetic algorithm. The results indicated the superiority of genetic algorithms compared to
other grouping methods. Addressing the problem of data availability in estimating time series, Hoglund (2013b) found that
a fuzzy linear regression-based Jones model outperforms the regression-based Jones model in detecting simulated earnings management when the estimation time series is short. Song et al. (2013) examined the association between earnings management
and assets misappropriation and found that misappropriation of assets has a signiﬁcant positive association with discretionary
accruals.
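For readers unfamiliar with the Jones (1991) accrual model referred to above, the regression is commonly written in the earnings management literature as shown below. The source does not reproduce the model, so the notation is an assumption: TA is total accruals, A is total assets at the end of the prior period, ΔREV is the change in revenues, PPE is gross property, plant, and equipment, i indexes firms, and t indexes periods.

```latex
\frac{TA_{it}}{A_{i,t-1}}
  = \alpha_1 \frac{1}{A_{i,t-1}}
  + \alpha_2 \frac{\Delta REV_{it}}{A_{i,t-1}}
  + \alpha_3 \frac{PPE_{it}}{A_{i,t-1}}
  + \varepsilon_{it}
```

The fitted part is usually read as non-discretionary accruals and the residual as discretionary accruals, the quantity that the studies above use as a proxy for earnings management.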
Data mining applications in assurance and compliance focused primarily on three main topics: auditing (including engagement, planning, conducting, and post-auditing phases), business health (including ﬁnancial viability, bankruptcy, and going concern), and forensic accounting (including fraud detection and earnings management). Speciﬁc applications include: auditor
selection and switching, auditor fee determination, supporting auditors in determining the level of substantive tests, identifying patterns in accounting data and analytical procedures, bankruptcy prediction, detection of fraudulent financial statements and reports, and detection
of earnings management. The main goal of applications in this domain is prediction, and the primary task is classification. The
predominant techniques are neural networks and regression. Future research opportunities include: enriching input with variables related to managerial characteristics, testing different approaches to classifier aggregation, testing different learning algorithms and model architectures, exploring different time granularities and data preprocessing approaches, expanding the scope
of model development to multiple firms and business types, selecting input variables more carefully, lengthening the prediction
horizon, including non-financial variables and more visual analysis, paying more attention to model comparison, and combining data
and text mining in predicting financial fraud.
5.3.4.4. Tax compliance. Data mining has also been used in taxation, such as the work of Denton et al. (1995), who used neural networks in classifying employees for tax purposes. For tax compliance at the company level, Wu (1994) used neural networks and
Kallio and Back (2011) used a self-organizing map to successfully identify companies that require further tax audit investigation. Wu et al. (2012) used data mining to
develop a screening framework that filters possibly non-compliant tax reports that may be subject to further auditing. The results of
Wu et al. (2012) show that their proposed data mining technique enhances the detection of tax evasion, and therefore can be
employed to effectively reduce or minimize losses from tax evasion.
5.3.5. Overall summary of ﬁndings of data mining application areas
In summary, it is clear that the overwhelming focus of using data mining in accounting, regardless of proposed framework category, is in the area of compliance and assurance. Thus, the “disconnect between the application domain of auditing and assurance and the technology domain of AI” that was reported by Baldwin et al. (2006) is not supported by these ﬁndings. One
could argue that the well-publicized financial scandals over the last two decades, such as those of Enron and WorldCom, presented
the greatest challenge to the compliance and assurance functions of the accounting profession. This may have consequently expanded the reach of the accounting researchers to incorporate modern technology paradigms such as data mining to improve
Table 2
Strengths, weaknesses, and recommendations for data mining applications in accounting.

CRISP-DM phase: Business understanding
Strengths: Technical justifications are well specified, e.g. compare performance of a data mining technique or techniques in addressing a given business problem.
Weaknesses: Consideration of business impact is generally lacking.
Recommendations: Business impact is as important as technical justification since the latter does not necessarily translate into the former. More attention needs to be given to questions such as: if some measure of performance (such as accuracy) is improved, how does this translate into business benefit? Such attention could lead to more acceptance to data mining applications in real business contexts, and thus reduce reliance on simulations and synthetic data.

CRISP-DM phase: Data understanding
Strengths: Many studies used benchmarked variables that are available in publicly published financial statements and reports.
Weaknesses: Sometimes clearly important variables are missing, for example in the case of using only balance sheet variables for predicting firm performance and ignoring variables in other financial statements.
Recommendations: Domain knowledge of the problem being addressed and its business dimensions is important to understand the most pertinent data, appropriate unit of analysis (for example, firm vs. firm-year), and temporal dimension (e.g. point-in-time vs. cross-sectional). Research conducted from a purely technical perspective or not using domain expertise may miss important data understanding considerations.

CRISP-DM phase: Data preparation
Strengths: Use of variable normalization to remove scale impact and variable derivation to enrich the attribute set.
Weaknesses: Data preparation is rarely discussed, and little attention is paid to variable selection techniques.
Recommendations: Including all relevant variables is key in developing properly-specified models. Missing such variables may result in an inadequate model that does not capture all important components of the problem being solved. As important is not including irrelevant and redundant variables. Strategies to include the proper set of attributes in the analysis include using feature selection techniques, variable clustering, and including in the analysis relationships between variables as well.

CRISP-DM phase: Model development
Strengths: Building on models specified in previous research.
Weaknesses: Rarely is the model specification complete.
Recommendations: Complete model specification is necessary for future replicability and benchmarking. Every data mining modeling technique provides multiple architectures and requires setting of many parameters. For example, a neural network model can be specified in terms of topology (Multi-layer perceptron, Radial Basis, etc.), information flow (feedforward vs. recurrent), learning function (most notably backpropagation), number of hidden layers, etc. Not providing all the specifics of a model makes it hard to replicate and benchmark the proposed model.

CRISP-DM phase: Model assessment
Strengths: Use of out-of-sample and k-fold cross validation techniques.
Weaknesses: Scant attention is paid to considerations of classification error costs and hence rarely a mention of financial impact of classification decision outcomes. In addition, very few utilized a well-formed experimental design in which an experimental group is matched with a control group. Many researches used accuracy to assess model performance.
Recommendations: It is key to take into account the impact of the cost of a false positive and that of a false negative when assessing models, as these costs are in many business situations far from equally weighted. Ignoring such considerations can lead to bad decisions. Although accuracy is one of the most popular model assessment measures, it is important to recognize that accuracy is not always an appropriate assessment metric, especially when samples are imbalanced. For example, if the cases of interest represent a small (<10%) proportion of the total sample, a model will report >90% accuracy without capturing any of the cases of interest; this is especially true in the case of fraud, auditing, and forensic accounting.

CRISP-DM phase: Model deployment
Strengths: Modeling for real-world business applications.
Weaknesses: Rarely a mention of model recalibration and variability of business environment.
Recommendations: Businesses operate in a continually changing environment and thus deployed models should be re-calibrated on regular basis to validate their relevance and performance under the new business realities.
capabilities of combating illegal reporting practices. Similar to what was reported by Gupta and Gill (2012b), the major focus of
data mining in forensics appears to still be on detection, with less emphasis on other, equally important, dimensions of fraud such
as prevention and mitigation. Managerial accounting also received a significant number of data mining applications, perhaps
due to the increasing desire to improve overall enterprise efficiencies in light of the ramifications of the 2008 financial crisis and
the subsequent business meltdown. Notable in the managerial accounting domain is the near absence of the application of data
mining to management control systems, a major tool to guard against risk and vulnerability. In the financial accounting domain,
the main focus was on financial and performance analysis, but little has been reported on the use of data mining in the key functions of recording and validating transactions to improve the accuracy of both data and reporting. The application of data mining
in accounting also seems to be thin in the areas of process and text mining. Accounting functions have many embedded processes,
such as ordering, billing, paying, and reconciling that may provide rich input for process mining. However, lack of easy access to
business process information might be one reason that process mining applications are limited. Similarly, although accounting is
numeric in the majority, it has an abundance of textual content that can beneﬁt from text mining. Yet, text mining applications of
accounting are relatively few, possibly due to the fact that text mining is itself a relatively newer, less familiar, branch of data mining. These limitations may be aggravated by limited coverage, if at all, of accounting curriculum of such advanced techniques.
We use the well-known Cross-Industry Standard Process for Data Mining (CRISP-DM) framework as a logical
structure to present the strengths, weaknesses, and recommendations for data mining applications in accounting (Table 2).
CRISP-DM is a widely accepted industry-, discipline-, technology-, and technique-agnostic guiding methodological standard for
implementing data mining projects. CRISP-DM is based on six phases: business understanding, data understanding, data preparation, modeling, testing, and deployment. The reported data mining applications in accounting manifest many strengths as well as
some weaknesses. Strengths include: well-reasoned technical justifications, use of publicly available variables, use of variable
normalization and derivation, incremental modeling approaches, unbiased model testing, and modeling of real-world problems.
Some of the weaknesses manifested throughout the previous research include: lack of consideration of business impact, omission
of important relevant variables, thin consideration of data preprocessing and preparation, limited information on model
specification, rare consideration of the cost of classification errors, and limited dialogue on model calibration to account for the changing business environment. Recommendations for more effective applications of data mining in accounting include: highlighting
business impact of the application, i…
