Data mining is used to locate useful information by finding anomalies, patterns, or correlations within data. Association rules mining uses “if-then” statements to express the strongest relationships between items in a dataset. Real-world applications include medical diagnosis, purchasing-pattern analysis, website usage analysis, and content recommendation engines.
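As a minimal illustration of what an “if-then” rule measures, the sketch below checks how often one rule holds in a handful of transactions. The transactions, the item names, and the `rule_confidence` helper are all made up for illustration and are not part of the assignment data.

```python
# Toy transactions (made-up data, for illustration only).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
]

def rule_confidence(antecedent, consequent, transactions):
    """Fraction of transactions containing the antecedent that also contain the consequent."""
    with_antecedent = [t for t in transactions if antecedent <= t]
    if not with_antecedent:
        return 0.0
    with_both = [t for t in with_antecedent if consequent <= t]
    return len(with_both) / len(with_antecedent)

# Rule: if {bread} then {butter}
confidence = rule_confidence({"bread"}, {"butter"}, transactions)
print(f"if {{bread}} then {{butter}}: confidence = {confidence:.2f}")
```

Here three transactions contain bread and two of those also contain butter, so the rule holds in two out of three cases.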
To perform association rules analysis/mining, complete the following:
- Access the “UCI Machine Learning Repository,” https://archive.ics.uci.edu/datasets
Note: About 120 datasets in the repository are listed as suitable for clustering tasks. For this part of the exercise, choose two of these datasets, each with at least 10 attributes and 10,000 instances.
- Ensure that the datasets are suitable for association rules mining, that is, that they can be expressed as transactions.
- You may search for data in other repositories, such as Data.gov or Kaggle.
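One possible way to check a candidate dataset against the size requirements and to put a tabular file into transaction form is sketched below. It assumes the data is a CSV file with a header row and categorical values; the helper names `rows_to_transactions` and `meets_size_requirements` and the three-row sample are hypothetical. Continuous attributes would need to be binned into categories first.

```python
import csv
import io

def rows_to_transactions(rows):
    """Turn each tabular row (a dict) into a set of 'attribute=value' items."""
    return [
        {f"{name}={value}" for name, value in row.items() if value not in ("", None)}
        for row in rows
    ]

def meets_size_requirements(rows, min_attributes=10, min_instances=10_000):
    """Check the thresholds stated in the brief: 10 attributes, 10,000 instances."""
    if not rows:
        return False
    return len(rows[0]) >= min_attributes and len(rows) >= min_instances

# Tiny made-up sample standing in for a downloaded CSV file.
sample_csv = io.StringIO(
    "outlook,windy,play\n"
    "sunny,false,no\n"
    "rainy,true,no\n"
    "overcast,false,yes\n"
)
rows = list(csv.DictReader(sample_csv))
transactions = rows_to_transactions(rows)

print(sorted(transactions[0]))
print("Meets size requirements:", meets_size_requirements(rows))
```

The sample is far too small to qualify, so the check reports that it does not meet the requirements; a real dataset would be read from its downloaded file instead of the in-memory string.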
For each selected dataset, build an association rules model as follows:
- Explain the dataset and the type of information you wish to extract. Recall that the dataset must consist of transactions from which rules of the form If {x1, x2, …, xn} then {y1, y2, …, yk} can be derived.
- Explain the Apriori algorithm and how you will be using it in your analysis (list the steps, the intuition behind the mathematical representation, and address its assumptions).
- Identify the appropriate software packages.
- Preprocess the data, describe its characteristics, and visualize key characteristics such as popular items and choices.
- Build the association rules model by implementing the Apriori algorithm.
- Run the model (make predictions).
- Display the association rules results (quantitative and visual).
- Explain the meaning of each step in the context of the dataset.
- Interpret results and adjust your model.
- Validate the model, addressing support, confidence, lift, and conviction. Then, explain the results.
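The Apriori and validation steps above can be sketched in plain Python as follows. This is a minimal, unoptimized illustration under stated assumptions, not a reference implementation: the toy transactions, the thresholds, and the function names `apriori` and `generate_rules` are hypothetical. It uses the standard definitions support(X) = fraction of transactions containing X, confidence(X→Y) = support(X∪Y) / support(X), lift = confidence / support(Y), and conviction = (1 − support(Y)) / (1 − confidence).

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}

    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = support(s)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent

def generate_rules(frequent, min_confidence):
    """Derive rules from frequent itemsets with confidence, lift, and conviction."""
    rules = []
    for itemset, supp in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                consequent = itemset - antecedent
                confidence = supp / frequent[antecedent]
                if confidence < min_confidence:
                    continue
                lift = confidence / frequent[consequent]
                # Conviction is unbounded when the rule never fails.
                conviction = (
                    float("inf") if confidence == 1
                    else (1 - frequent[consequent]) / (1 - confidence)
                )
                rules.append({
                    "antecedent": set(antecedent), "consequent": set(consequent),
                    "support": supp, "confidence": confidence,
                    "lift": lift, "conviction": conviction,
                })
    return rules

# Made-up transactions and thresholds, for illustration only.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "jam"},
]
frequent = apriori(transactions, min_support=0.4)
rules = generate_rules(frequent, min_confidence=0.7)
for rule in rules:
    print(rule)
```

The prune step relies on the downward-closure property: an itemset can only be frequent if all of its subsets are frequent, which is what lets Apriori discard most candidates without counting them. In the notebook itself, an established package (for example, the `apriori` and `association_rules` functions in mlxtend's `frequent_patterns` module) would likely replace this hand-written version on datasets of 10,000 or more instances.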
Prepare a comprehensive technical report as a Jupyter notebook, including all code, code comments, outputs, plots, and analysis. Make sure the project documentation contains:
a) Problem statement
b) Algorithm of the solution
c) Analysis of the findings
d) References