COMP 4353 TSUSM Data Mining And Analysis Exercise

  • Find the attached files. Work on exercises 2.2, 2.4, 2.6, and 2.8; you don’t need to solve the other questions.
  • Summary
    Data sets are made up of data objects. A data object represents an entity. Data objects
    are described by attributes. Attributes can be nominal, binary, ordinal, or numeric.
    The values of a nominal (or categorical) attribute are symbols or names of things,
    where each value represents some kind of category, code, or state.
    Binary attributes are nominal attributes with only two possible states (such as 1 and
    0 or true and false). If the two states are equally important, the attribute is symmetric;
    otherwise it is asymmetric.
    An ordinal attribute is an attribute with possible values that have a meaningful order
    or ranking among them, but the magnitude between successive values is not known.
    A numeric attribute is quantitative (i.e., it is a measurable quantity) represented
    in integer or real values. Numeric attribute types can be interval-scaled or ratio-scaled. The values of an interval-scaled attribute are measured in fixed and equal
    units. Ratio-scaled attributes are numeric attributes with an inherent zero-point.
    Measurements are ratio-scaled in that we can speak of values as being an order of
    magnitude larger than the unit of measurement.
    Basic statistical descriptions provide the analytical foundation for data preprocessing. The basic statistical measures for data summarization include mean, weighted
    mean, median, and mode for measuring the central tendency of data; and range, quantiles, quartiles, interquartile range, variance, and standard deviation for measuring the
    dispersion of data. Graphical representations (e.g., boxplots, quantile plots, quantile–
    quantile plots, histograms, and scatter plots) facilitate visual inspection of the data and
    are thus useful for data preprocessing and mining.
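
    The measures listed above can be sketched in a few lines of Python using only the
    standard library. This is a minimal illustration, not a prescribed method: the function
    names and the sample values are made up for the example, the quartiles use the
    "inclusive" interpolation of statistics.quantiles (other quartile conventions give
    slightly different values), and the variance is the population variance (divide by N).

```python
import statistics


def summarize(values):
    """Return common central-tendency and dispersion measures for a list of numbers."""
    data = sorted(values)
    # Quartile convention is an assumption: "inclusive" linear interpolation.
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "modes": statistics.multimode(data),      # every value with the highest count
        "midrange": (data[0] + data[-1]) / 2,     # average of min and max
        "range": data[-1] - data[0],
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "variance": statistics.pvariance(data),   # population variance (divide by N)
        "std_dev": statistics.pstdev(data),
    }


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


if __name__ == "__main__":
    sample = [2, 4, 4, 5, 7, 9, 11]  # illustrative values only
    for name, value in summarize(sample).items():
        print(f"{name}: {value}")
```

    The five-number summary is then the minimum, q1, median, q3, and maximum, which are the
    values a boxplot draws.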
    Data visualization techniques may be pixel-oriented, geometric-based, icon-based, or
    hierarchical. These methods apply to multidimensional relational data. Additional
    techniques have been proposed for the visualization of complex data, such as text
    and social networks.
    Measures of object similarity and dissimilarity are used in data mining applications
    such as clustering, outlier analysis, and nearest-neighbor classification. Such measures of proximity can be computed for each attribute type studied in this chapter,
    or for combinations of such attributes. Examples include the Jaccard coefficient for
    asymmetric binary attributes and Euclidean, Manhattan, Minkowski, and supremum
    distances for numeric attributes. For applications involving sparse numeric data vectors, such as term-frequency vectors, the cosine measure and the Tanimoto coefficient
    are often used in the assessment of similarity.
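
    The proximity measures named above can be written compactly. The following is a minimal
    sketch in plain Python; the function names and the toy vectors are chosen for this
    example only. The Minkowski parameter is called q to match the notation used in the
    exercises, and the binary measure assumes vectors of 0/1 values in which 1 is the
    important (asymmetric) state.

```python
import math


def minkowski(x, y, q):
    """Minkowski distance of order q (q = 1 is Manhattan, q = 2 is Euclidean)."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def supremum(x, y):
    """Supremum (Chebyshev) distance: the largest per-attribute difference."""
    return max(abs(a - b) for a, b in zip(x, y))


def cosine_similarity(x, y):
    """Cosine of the angle between x and y; undefined for all-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))


def tanimoto(x, y):
    """Tanimoto coefficient: x.y / (x.x + y.y - x.y)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sum(a * a for a in x) + sum(b * b for b in y) - dot)


def jaccard_similarity(x, y):
    """Jaccard coefficient for asymmetric binary vectors; 0/0 matches are ignored."""
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    differ = sum(1 for a, b in zip(x, y) if a != b)
    if both + differ == 0:
        raise ValueError("Jaccard coefficient is undefined when no attribute is 1")
    return both / (both + differ)


if __name__ == "__main__":
    p, r = (1, 2), (4, 6)  # toy points, not taken from the exercises
    print(manhattan(p, r), euclidean(p, r), supremum(p, r), minkowski(p, r, 3))
    print(jaccard_similarity((1, 0, 1, 0, 0), (1, 1, 0, 0, 0)))
```

    The corresponding dissimilarity for the binary case is 1 minus the Jaccard coefficient.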
    Exercises
    2.1 Give three additional commonly used statistical measures that are not already illustrated in this chapter for the characterization of data dispersion. Discuss how they can
    be computed efficiently in large databases.
    2.2 Suppose that the data for analysis includes the attribute age. The age values for the data
    tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
    33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
    (a) What is the mean of the data? What is the median?
    (b) What is the mode of the data? Comment on the data’s modality (i.e., bimodal,
    trimodal, etc.).
    (c) What is the midrange of the data?
    (d) Can you find (roughly) the first quartile (Q1 ) and the third quartile (Q3 ) of the data?
    (e) Give the five-number summary of the data.
    (f) Show a boxplot of the data.
    (g) How is a quantile–quantile plot different from a quantile plot?
    2.3 Suppose that the values for a given set of data are grouped into intervals. The intervals
    and corresponding frequencies are as follows:
    age        frequency
    1–5        200
    6–15       450
    16–20      300
    21–50      1500
    51–80      700
    81–110     44
    Compute an approximate median value for the data.
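
    A common way to approximate the median of grouped data is linear interpolation inside
    the interval that contains the median: median ≈ L1 + ((N/2 − F) / f) × width, where L1
    is the lower boundary of that interval, N the total count, F the summed frequency of
    all lower intervals, and f the frequency of the median interval. The sketch below is
    one possible implementation under stated assumptions: the function name is hypothetical,
    each interval is given as (lower boundary, upper boundary, frequency), and the width is
    taken as upper minus lower. How to set boundaries for integer-valued ranges is a
    modelling choice left to the reader.

```python
def approximate_median(intervals):
    """Approximate the median of grouped data by interpolating in the median interval.

    intervals: list of (lower, upper, frequency) tuples sorted by lower boundary.
    """
    total = sum(freq for _, _, freq in intervals)
    if total == 0:
        raise ValueError("intervals must contain at least one observation")
    half = total / 2
    cumulative = 0  # summed frequency of all intervals below the current one
    for lower, upper, freq in intervals:
        if freq > 0 and cumulative + freq >= half:
            width = upper - lower
            return lower + (half - cumulative) / freq * width
        cumulative += freq
    raise ValueError("could not locate the median interval")


if __name__ == "__main__":
    # Toy grouped data, not the table from the exercise.
    toy = [(0, 10, 5), (10, 20, 10), (20, 30, 5)]
    print(approximate_median(toy))
```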
    2.4 Suppose that a hospital tested the age and body fat data for 18 randomly selected adults
    with the following results:
    age    23    23    27    27    39    41    47    49    50
    %fat   9.5   26.5  7.8   17.8  31.4  25.9  27.4  27.2  31.2

    age    52    54    54    56    57    58    58    60    61
    %fat   34.6  42.5  28.8  33.4  30.2  34.1  32.9  41.2  35.7
    (a) Calculate the mean, median, and standard deviation of age and %fat.
    (b) Draw the boxplots for age and %fat.
    (c) Draw a scatter plot and a q-q plot based on these two variables.
    2.5 Briefly outline how to compute the dissimilarity between objects described by the
    following:
    (a) Nominal attributes
    (b) Asymmetric binary attributes
    (c) Numeric attributes
    (d) Term-frequency vectors
    2.6 Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):
    (a) Compute the Euclidean distance between the two objects.
    (b) Compute the Manhattan distance between the two objects.
    (c) Compute the Minkowski distance between the two objects, using q = 3.
    (d) Compute the supremum distance between the two objects.
    2.7 The median is one of the most important holistic measures in data analysis. Propose several methods for median approximation. Analyze their respective complexity
    under different parameter settings and decide to what extent the real value can be
    approximated. Moreover, suggest a heuristic strategy to balance between accuracy and
    complexity and then apply it to all methods you have given.
    2.8 It is important to define or select similarity measures in data analysis. However, there
    is no commonly accepted subjective similarity measure. Results can vary depending on
    the similarity measures used. Nonetheless, seemingly different similarity measures may
    be equivalent after some transformation.
    Suppose we have the following 2-D data set:
          A1     A2
    x1    1.5    1.7
    x2    2      1.9
    x3    1.6    1.8
    x4    1.2    1.5
    x5    1.5    1.0
    (a) Consider the data as 2-D data points. Given a new data point, x = (1.4, 1.6) as a
    query, rank the database points based on similarity with the query using Euclidean
    distance, Manhattan distance, supremum distance, and cosine similarity.
    (b) Normalize the data set to make the norm of each data point equal to 1. Use Euclidean
    distance on the transformed data to rank the data points.
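
    One way to see why different measures can become equivalent after a transformation:
    for vectors scaled to unit length, the squared Euclidean distance satisfies
    ‖x − y‖² = 2 − 2 cos(x, y), so ranking by increasing Euclidean distance on the
    normalized data gives the same order as ranking by decreasing cosine similarity. The
    sketch below checks this on made-up points; the names and the toy data are illustrative
    only and are not the data set from the exercise.

```python
import math


def normalize(p):
    """Scale a vector to unit Euclidean norm (undefined for the zero vector)."""
    norm = math.sqrt(sum(c * c for c in p))
    return tuple(c / norm for c in p)


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))


def rank_by_normalized_euclidean(points, query):
    """Names of points, closest first, using Euclidean distance on unit vectors."""
    q = normalize(query)
    return sorted(points, key=lambda name: euclidean(normalize(points[name]), q))


def rank_by_cosine(points, query):
    """Names of points, most similar first, using cosine similarity."""
    return sorted(points, key=lambda name: cosine_similarity(points[name], query),
                  reverse=True)


if __name__ == "__main__":
    toy_points = {"p1": (1.0, 0.0), "p2": (1.0, 1.0), "p3": (0.0, 2.0)}  # toy data
    toy_query = (2.0, 1.0)
    print(rank_by_normalized_euclidean(toy_points, toy_query))
    print(rank_by_cosine(toy_points, toy_query))
```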
