Machine Learning Basics
Definition: A common definition of machine learning is (Mitchell, 1997): “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.”
The act of creating a prediction model from previously known data is called training, and the data used is called the training data or training set. After the model is created, it must be applied to a separate data set to test its effectiveness. The data used for this purpose is called the test data or test set.
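As a minimal sketch of the training/test distinction, the records can be shuffled and divided into two disjoint parts. The function name, the 25% test fraction, the seed and the sample records below are illustrative assumptions, not values taken from the project.

```python
import random

def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and return (training set, test set)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical records: (previous-semester GPA, outcome label)
records = [(3.6, "pass"), (1.9, "fail"), (2.8, "pass"), (3.1, "pass"),
           (1.5, "fail"), (2.2, "fail"), (3.9, "pass"), (2.5, "pass")]

train, test = train_test_split(records, test_fraction=0.25, seed=42)
print(len(train), "training records,", len(test), "test records")
```

The model is built only from the training part; the test part is held back so that the measured performance reflects data the model has not seen.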
Educational Data Mining
Data mining is the process of sorting through data and extracting information from existing databases. With the help of pattern mining and data analysis, hidden information can be obtained from huge datasets. Researchers now apply data mining strategies in the field of education.
They are exploring many dimensions of the education sector, and this area is now known as educational data mining. In education, data mining is applied by examining student performance and determining students’ standing from their academic records.
Educational datasets are collected from various sources such as interactive learning systems, computer-supported collaborative systems, and the administrative datasets of schools, colleges and universities. Well-known universities now use data mining methods to analyze patterns of student performance in these datasets, so that information can be extracted and decision making becomes easier for the management of institutions.
With the growing use of technology everywhere, educational institutions are now looking for hidden trends and patterns in their large datasets. Datasets can easily be collected from these sources once authorization is obtained. One purpose of an institution extracting information from its own dataset is to strengthen its prestige among other educational institutions. Another purpose is to support the careers of its students.
Data mining is often used to build predictive or inference models that aim to predict future trends or behaviors based on the analysis of structured data. In this context, prediction means constructing a model and using it to assess the class of an unlabeled example, or to estimate the value or value range of an attribute.
We have proposed a data mining process for evaluating school dropout and failure. The experiment was carried out on real information about 200 university students of Mehran University of Engineering and Technology. Data mining is meant to work in a way similar to the human brain: it uses historical information (experience) to learn. However, for data mining technology to get information out of the database, the user must “tell it” what the information looks like (i.e. what problem the user would like to solve).
It uses the description of that information to look for similar examples in the database, and uses these pieces of information from the past to develop a predictive model of what will happen in the future. The essential ingredient in building a successful predictive model is to have information in the database that describes what has happened in the past. Data mining tools are designed to “learn” from these past successes and failures (in theory, as a human being would), and then to predict what is going to happen next.
However, one of the major advantages of a data mining tool over the human mind is that the tool can automatically go through a very large database quickly and find even small patterns that may help produce a better prediction.
The main objectives of this proposed work are:
To understand, analyze and compare the different data mining prediction techniques used in education.
To identify and understand the student attributes that are mainly used for predicting student performance.
Predicting Student Performance
Predicting students’ performance by using data mining techniques to extract information from the academic datasets of universities has become an active research topic in the scientific community. Universities nowadays face challenges in analyzing the performance of their students. Being active in class is not, on its own, enough to assess a student’s performance, which is why we built a system that is intended to help improve student performance.
We focus on students’ profiles and characteristics to make the university management aware of each student’s performance and overall academic result. Another dimension of student performance is the dependence of student retention on it. To reduce student retention problems in universities, different researchers have proposed different methods to predict students’ performance in a future semester based on their performance in the previous one.
Student Data Attributes
To predict a student’s academic performance in the next semester from the student’s previous academic record, we have so far taken the data of two batches of Computer System (15CS and 16CS) and have considered the following attributes in our project:
Based on these parameters, the recommended system can be trained to predict students’ grades at any educational institution. We used the KNN algorithm to predict student academic performance.
K – Nearest Neighbor
K – Nearest Neighbor (KNN) is a supervised learning algorithm. It is a classic method for classifying samples based on similarity, and it is a non-parametric learning algorithm widely used in data mining. Its purpose is to use a database in which the data points are separated into several classes to predict the class of a new sample point by matching it with previous data.
We have used the KNN algorithm with the aim of obtaining more accurate results. KNN works by measuring the distance between a new sample and the samples in a data set. Classification is the process of analyzing input data and building a model that describes each class.
The K-NN algorithm can be used for:
Regression: predicting the numeric value a variable will take (if the variable varies with time, this is called ‘time series’ prediction).
Classification: predicting which category or class a case falls into.
An alternate way of understanding KNN is to think of it as calculating a decision boundary (or several boundaries when there are more than two classes), which is then used to classify new points.
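The two uses listed above can be illustrated with a minimal from-scratch sketch. It assumes Euclidean distance, K = 3 and two hypothetical features (previous GPA and attendance fraction); none of these choices, and none of the sample values, are taken from the project itself.

```python
import math
from collections import Counter

def k_nearest(train_X, train_y, query, k):
    """Return the targets of the k training points closest to the query."""
    order = sorted(range(len(train_X)),
                   key=lambda i: math.dist(train_X[i], query))
    return [train_y[i] for i in order[:k]]

def knn_classify(train_X, train_y, query, k=3):
    """Classification: majority vote among the k nearest neighbours."""
    votes = Counter(k_nearest(train_X, train_y, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train_X, train_y, query, k=3):
    """Regression: average of the k nearest neighbours' numeric targets."""
    neighbours = k_nearest(train_X, train_y, query, k)
    return sum(neighbours) / len(neighbours)

# Hypothetical features: (previous GPA, attendance fraction)
X = [(3.5, 0.95), (3.2, 0.90), (3.8, 0.85),
     (1.8, 0.60), (2.0, 0.55), (1.5, 0.70)]
labels = ["pass", "pass", "pass", "fail", "fail", "fail"]
next_gpa = [3.6, 3.1, 3.7, 1.9, 2.1, 1.6]

print(knn_classify(X, labels, (3.0, 0.80), k=3))   # predicted class
print(knn_regress(X, next_gpa, (3.0, 0.80), k=3))  # predicted numeric value
```

There is no separate training step here: the "model" is the stored data, and all of the work happens when a new point is queried.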
Feature extraction is the transformation of high-dimensional input data into a meaningful representation of reduced dimensionality. In other words, transforming the input data into a particular set of features is called feature extraction. The extracted representation often helps to improve the accuracy of a classifier. Feature extraction is performed on the raw data before the k-NN algorithm is applied to the transformed data in feature space.
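As a small illustration of reducing dimensionality before classification, a list of per-course marks can be collapsed into two summary features. The choice of features (average mark and spread of marks) and the sample marks are assumptions made for this sketch; the project’s actual attributes may differ.

```python
from statistics import mean, pstdev

def extract_features(course_marks):
    """Reduce a list of per-course marks to (average mark, spread of marks)."""
    return (mean(course_marks), pstdev(course_marks))

# Hypothetical raw data: six course marks per student
raw = {
    "student_a": [60.0, 70.0, 80.0, 75.0, 65.0, 70.0],
    "student_b": [40.0, 90.0, 55.0, 85.0, 45.0, 95.0],
}

# Each six-dimensional record becomes a two-dimensional feature vector
features = {name: extract_features(marks) for name, marks in raw.items()}
for name, vector in features.items():
    print(name, vector)
```

The resulting two-value vectors, rather than the raw mark lists, would then be the points whose distances k-NN compares.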
Issues Regarding Classification
Missing data values cause problems during both the training phase and the classification process itself. For example, data may be unavailable because of:
Deletion due to inconsistency with other recorded data
Missing data can be handled using the following approaches:
Data miners can ignore the missing data
Data miners can replace all missing values with a single global constant
Data miners can replace a missing value with its feature mean for the given class
Data miners and domain experts, together, can manually examine samples with missing values and enter a reasonable, probable or expected value
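The first three approaches above can be sketched as follows. The record layout (dictionaries with `None` marking a missing value) and the field names `gpa` and `result` are hypothetical and chosen only for illustration.

```python
from statistics import mean

def drop_incomplete(rows):
    """Approach 1: ignore any record that has a missing value."""
    return [r for r in rows if None not in r.values()]

def fill_constant(rows, field, constant):
    """Approach 2: replace every missing value with one global constant."""
    return [{**r, field: constant if r[field] is None else r[field]}
            for r in rows]

def fill_class_mean(rows, field, class_field):
    """Approach 3: replace a missing value with the feature mean of its class.

    Assumes every class has at least one known value for the field.
    """
    known = {}
    for r in rows:
        if r[field] is not None:
            known.setdefault(r[class_field], []).append(r[field])
    means = {cls: mean(values) for cls, values in known.items()}
    return [{**r, field: means[r[class_field]] if r[field] is None else r[field]}
            for r in rows]

rows = [
    {"gpa": 3.0, "result": "pass"},
    {"gpa": None, "result": "pass"},   # missing value
    {"gpa": 3.4, "result": "pass"},
    {"gpa": 1.8, "result": "fail"},
]

print(len(drop_incomplete(rows)))
print(fill_constant(rows, "gpa", 0.0)[1])
print(fill_class_mean(rows, "gpa", "result")[1])
```

Each function returns new records and leaves the input unchanged, so the approaches can be compared on the same data.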
In our case, the chance of missing values in the training data is very low. The training data is retrieved from the admission records of a particular institute, and the attributes used as input to the classification process are mandatory for each student. Any tuple found to have a missing value for an attribute is excluded from the training set, because the missing value cannot be predicted or set to a default. Given the low chance of missing data, ignoring such tuples should not adversely affect accuracy.
Methodology of the algorithm:
First, real data for almost 200 students is gathered and pre-processed.
The data set is then divided so that a model can be trained and tested with a particular algorithm.
The K-NN algorithm is then applied to the data set to build prediction models, and the predictions made by these models are compared using common evaluation criteria such as accuracy, precision, and recall.
The predictions for the test data set are compared with the actual outcomes to check the accuracy of the algorithm.
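The evaluation criteria named in the steps above can be computed as in the sketch below. The function name, the choice of “fail” as the positive class and the sample labels are assumptions made for illustration, not results from the project.

```python
def evaluate(actual, predicted, positive):
    """Return accuracy, precision and recall for one positive class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # true positives
    fp = sum(a != positive and p == positive for a, p in pairs)  # false positives
    fn = sum(a == positive and p != positive for a, p in pairs)  # false negatives
    correct = sum(a == p for a, p in pairs)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical test-set labels and model predictions
actual    = ["fail", "pass", "fail", "pass", "pass", "fail"]
predicted = ["fail", "pass", "pass", "pass", "fail", "fail"]

scores = evaluate(actual, predicted, positive="fail")
print(scores)
```

Accuracy counts all correct predictions, precision asks how many predicted failures were real, and recall asks how many real failures were caught.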
Advantages of K-NN:
KNN has several main advantages: simplicity, effectiveness, intuitiveness and competitive classification performance in many domains. It is robust to noisy training data and is effective if the training data is large.
Disadvantages of K-NN:
Despite the advantages given above, KNN has a few limitations. KNN can have poor run-time performance when the training set is large. It is very sensitive to irrelevant or redundant features because all features contribute to the similarity and thus to the classification.
Two other disadvantages of the method are:
With distance-based learning, it is not clear which type of distance and which attributes to use to produce the best results.
The computation cost is quite high because the distance from each query instance to every training sample must be computed.
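Both points can be seen in a small sketch: the nearest neighbour is found by scanning every training sample, and the answer can change with the distance measure. Euclidean and Manhattan distance are used here as two common examples; the sample points are hypothetical.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute differences along each attribute."""
    return sum(abs(x - y) for x, y in zip(a, b))

def nearest(train, query, distance):
    """Scan every training point: one distance computation per point."""
    return min(train, key=lambda point: distance(point, query))

train = [(0.0, 3.0), (2.0, 2.0)]
query = (0.0, 0.0)

print(nearest(train, query, euclidean))  # closest under Euclidean distance
print(nearest(train, query, manhattan))  # closest under Manhattan distance
```

For n training samples with d attributes, each query costs on the order of n × d operations, which is why run time grows with the size of the training set.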
Applications of K-NN:
KNN as a data mining technique has a wide variety of applications in classification as well as regression. Some of the applications of this method are mentioned below:
Student attrition and retention
Over time, the number of private educational institutions has grown to a remarkable extent. These institutions have become both sources of higher learning and business entities, so maximum student enrollment is their lifeline. For private institutions to survive, profitability, proper management and alignment are mandatory. In this respect, retaining students until they complete their degrees is essential.
That is why institutions are trying to find the factors that ultimately cause student attrition. After analyzing those factors, it is important for educational institutions to make strategic adjustments accordingly to improve student retention. The problem of student attrition and retention is not new for educational institutions.
It has been highlighted by researchers from the fields of data mining and information visualization, and it has now become a very common research problem. Researchers took notice of student attrition and retention when the problem reached a ratio of 50% in the colleges of Ontario. To reduce attrition rates, institutions should focus on student retention.
University students in all degree programs are motivated to enroll by a desire for personal accomplishment and the completion of a previously set goal. Mature-age students in all degree programs are often believed to be highly motivated to return to university to gain promotion in their employment and to improve their professional skills.
Kantanis (1999) observed that some mature-age students engage in studies because they want to enjoy personal advancement and achieve a higher status in their professional positions. Hence, motivation to embark on a career is clearly linked to expectations that the career will bring about the desired rewards and prestige.
Science is considered to be challenging; hence, students in Science Education feel proud once they achieve their goal of successfully completing the program. For instance, interest in the subject, perception of its usefulness, general desire to achieve, self-confidence, self-esteem, patience and persistence are factors that motivate students to engage in their studies.
In Science Education, some students are motivated to choose a program in this area by approval from significant others, while other students are motivated by the desire to overcome the perceived challenges of these programs as they acquire new knowledge and skills.
Social support is a factor that can affect the academic performance of students both negatively and positively. Social support networks have great value in enhancing academic performance, as students form friendship groups to exchange information on assignments and find out about tutorials and lecture schedules. Peer support and relationships have been found to enhance the persistence of students both directly and indirectly. Parker and Johnson (1981) note that student-to-student interactions have been shown to be an extremely effective form of learning.