Business Analytics question on clustering
Question 1 (cont’d)
i. Discuss why Data Governance should be a priority and identify two areas that Data Governance covers that you think are essential for JamMed. Provide justification for your answer given the example scenario. [5 Marks]
ii. Given the context of this organization, identify two specific challenges you think they may face in implementing Data Governance. Ensure that your answer is specific to JamMed.
[4 Marks]
d. What issue(s) would you be trying to address if you were to recommend that JamMed invest in a data warehouse for applying analytics rather than using their traditional databases? [4 Marks]
e. JamMed is of the view that they need to invest in Big Data technologies because of the amount of data they have. Explain to them why this is not enough of a reason for them to invest in these technologies. Give an example of data that could lead them to consider Big Data technologies. [5 Marks]

Question 2 [Total 10 Marks]
a. Why is clustering considered to be an unsupervised technique? How is this different from decision trees? [3 Marks]
b. What does it mean for the output of the clustering method to have a high intra-class similarity and a low inter-class similarity? Why is this considered a good thing? [2 Marks]
c. Doctors and scientists are still trying to understand how different groups of people react to COVID-19. They have data on all the persons who have tested positive for COVID-19. They have used that data and applied clustering to determine these segments of COVID-19 patients. The results below show the output of applying K-Means to the data for k=2. Cluster 1 is highlighted in yellow and Cluster 2 is highlighted in blue.

Patient_ID   Age (Years)   Pre-existing Condition   Area Living   Duration Hospitalized   Died
1            10            0                        1             0                       0
2            45            0                        0             3                       0
3            35            1                        1             5                       0
4            70            1                        0             15                      0
5            60            0                        1             20                      1
6            25            1                        1             40                      0
7            89            1                        0             35                      0

Some additional information about the data above:
• Pre-existing Condition – 1 means the patient had a pre-existing medical condition; 0 means they did not.
• Duration Hospitalized – number of days in hospital; 0 means they were never admitted.
• Area Living – 1 is an urban area and 0 is a rural area.
• Died – 1 means the patient died and 0 means they did not.

How could you interpret the two clusters above?
What could be inferred from the information above, and how could it be used by the Ministry of Health and Wellness? [5 Marks]

Question 3 [Total 25 marks]
Jamaica Insurance Company has been experiencing several cases of fraud over the last three years. The dataset below represents a series of life insurance claims suspected of fraud involving different persons. You have been hired by Jamaica Insurance Company to identify possible cases of fraud that may exist in the dataset where a group of persons are making repeated claims to the insurance company. Conduct association mining on the dataset to identify possible social networks that exist in the data.

Claim Number   Policy Owner   Claim Type       Physician Name   Beneficiary
Claim 1        Mark-B         Dismemberment    Dr. S            Susan-R
Claim 2        Alice-T        Dismemberment    Dr. R            Carol-D
Claim 3        Mark-B         Accident         Dr. B            Bob-Y
Claim 4        Trisha-Z       Accident         Dr. S            John-B
Claim 5        Trisha-Z       Accident         Dr. S            John-B
Claim 6        Mark-B         Dismemberment    Dr. S            Susan-R

Complete the following:
a) Using a support threshold of 30%, calculate the support for the one-itemsets and four-itemsets ONLY. [8 marks]
b) Create rules from the frequent itemsets with four items ONLY. Use a confidence level of 95% to select the strong rules. [4 marks]
c) Calculate the lift for the strong rules. [4 marks]
d) Select one strong rule and provide an interpretation for the support, confidence and lift as it relates to the rule selected. [4 marks]
e) Make a recommendation to the company on how they should proceed in addressing fraud based on the trend identified in part d. Explain. [5 marks]
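
The measures asked for in Question 3 can be checked mechanically. The following is a minimal sketch, not a model answer: it assumes each claim is treated as one transaction of four items (policy owner, claim type, physician, beneficiary), that support is the fraction of claims containing an itemset, that confidence is count(itemset) / count(antecedent), and that lift is confidence / support(consequent). The function names and the column tags ("owner", "type", and so on) are hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Claims transcribed from the Question 3 table. Each value is tagged with a
# (hypothetical) column name so values from different columns cannot collide.
COLUMNS = ("owner", "type", "physician", "beneficiary")
CLAIMS = [
    ("Mark-B", "Dismemberment", "Dr. S", "Susan-R"),
    ("Alice-T", "Dismemberment", "Dr. R", "Carol-D"),
    ("Mark-B", "Accident", "Dr. B", "Bob-Y"),
    ("Trisha-Z", "Accident", "Dr. S", "John-B"),
    ("Trisha-Z", "Accident", "Dr. S", "John-B"),
    ("Mark-B", "Dismemberment", "Dr. S", "Susan-R"),
]
TRANSACTIONS = [frozenset(zip(COLUMNS, row)) for row in CLAIMS]
N = len(TRANSACTIONS)


def count(itemset):
    """Number of claims that contain every item in the itemset."""
    return sum(1 for t in TRANSACTIONS if itemset <= t)


def support(itemset):
    """Fraction of claims that contain the itemset."""
    return count(itemset) / N


def frequent_itemsets(size, min_support=0.30):
    """All itemsets of the given size whose support meets the threshold."""
    items = sorted(set().union(*TRANSACTIONS))
    candidates = (frozenset(c) for c in combinations(items, size))
    return {c: support(c) for c in candidates if support(c) >= min_support}


def strong_rules(itemset, min_confidence=0.95):
    """Rules antecedent -> consequent from one itemset, with confidence and lift."""
    rules = []
    for k in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(sorted(itemset), k)):
            consequent = itemset - antecedent
            confidence = count(itemset) / count(antecedent)
            if confidence >= min_confidence:
                lift = confidence / support(consequent)
                rules.append((antecedent, consequent, confidence, lift))
    return rules


if __name__ == "__main__":
    for size in (1, 4):
        for itemset, s in frequent_itemsets(size).items():
            print(size, sorted(v for _, v in itemset), f"support={s:.2f}")
    for itemset in frequent_itemsets(4):
        for a, c, conf, lift in strong_rules(itemset):
            print(sorted(v for _, v in a), "->", sorted(v for _, v in c),
                  f"confidence={conf:.2f} lift={lift:.2f}")
```

Brute-force enumeration is used here instead of Apriori pruning because the dataset has only six claims and twelve distinct items; on a larger dataset a candidate-pruning approach would be the more usual choice.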