Data Mining Project
December 5, 2024
• Use classical machine learning algorithms (e.g., from sklearn or xgboost) to train a
model on the preprocessed data.
• Deep learning frameworks (e.g., TensorFlow, PyTorch) are not allowed.
• Your goal is to develop the model that achieves the highest classification
accuracy on unseen data.
Once you submit your code, I will test your proposed model on a separate test dataset
(test_data.csv), which you will not have access to during the development phase. This
ensures a fair evaluation of your model’s performance on completely unseen data, replicating
real-world scenarios.
Submission Deadline: 28-12-2024 at 23:59. Any student who fails to submit their code
before this date and time will be excluded from the evaluation process.
2. Explanation of the Provided Code
You will receive a Python script (project_code.py) along with the datasets (train_data.csv
and test_data.csv). Both datasets will be located in the same folder of your working
directory. The script is structured to include the following key sections:
1. Preprocessing Section:
• This is where you can modify the code to clean and preprocess the data.
• Examples of valid modifications include:
– Handling missing values.
– Encoding categorical features.
– Scaling or normalizing features.
– Applying basic feature selection or dimensionality reduction.
2. Training Section:
• Select and implement a classical machine learning algorithm for classification (e.g.,
Random Forest, Logistic Regression, XGBoost, etc.).
• Only modify the model implementation.
• Hyperparameter tuning must not appear in this step: perform it separately and
leave it out of the final submitted code.
3. Evaluation Section:
• The evaluation section computes key metrics (e.g., accuracy, precision, recall,
F1-score, ROC-AUC, confusion matrix).
• You are not allowed to modify this section or to print falsified results. Any
tampering will result in strict disciplinary action.
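The preprocessing, training, and metric steps described above can be sketched as follows. This is only an illustrative sketch, not the contents of project_code.py: the data is a synthetic stand-in generated with make_classification (assumed numeric features and a binary target), and the chosen steps (median imputation, standardization, SelectKBest, a random forest) are arbitrary examples of the allowed modifications. Categorical columns, if the real data has any, would additionally need an encoder such as sklearn's OneHotEncoder.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the course data (assumption: numeric features, binary target).
X, y = make_classification(n_samples=500, n_features=20, n_informative=8,
                           random_state=0)

# Blank out about 5% of the entries so the imputation step has something to do.
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

# Hold out a validation split to estimate accuracy on unseen data.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Preprocessing and model chained in one pipeline, so every step is fitted
# on the training split only and reapplied unchanged at prediction time.
model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),         # handle missing values
    ("scale", StandardScaler()),                          # scale features
    ("select", SelectKBest(score_func=f_classif, k=10)),  # basic feature selection
    ("clf", RandomForestClassifier(n_estimators=200, random_state=0)),
])
model.fit(X_train, y_train)

y_pred = model.predict(X_val)
y_proba = model.predict_proba(X_val)[:, 1]  # positive-class scores for ROC-AUC

# The same kinds of metrics the evaluation section is described as reporting.
metrics = {
    "accuracy": accuracy_score(y_val, y_pred),
    "precision": precision_score(y_val, y_pred),
    "recall": recall_score(y_val, y_pred),
    "f1": f1_score(y_val, y_pred),
    "roc_auc": roc_auc_score(y_val, y_proba),
}
cm = confusion_matrix(y_val, y_pred)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
print(cm)
```

Wrapping the preprocessing in a Pipeline is one way to avoid leaking validation statistics into training; computing the metrics on a held-out split like this is meant for local checks only and does not replace the script's own evaluation section.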
3. Submission Requirements
You are required to submit a folder named after your Student Code (e.g., IA20, RSI12)
containing:
1. Code Submission:
• Include your Student Code in the script (e.g., IA20, RSI12, etc.).
• Specify the model used in the code (e.g., RandomForestClassifier, XGBoost,
etc.).
• Ensure your code executes without errors and generates the required results.csv
file.
2. Dataset Locations:
3. Execution Speed:
• Ensure preprocessing and training steps are efficient. Long execution times may
negatively impact evaluation.
4. Hyperparameter Tuning:
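One way to respect the Training Section rule that tuning stay out of the submitted code is to run the search in a separate scratch script and carry only the selected values over. The sketch below assumes a random forest, a small illustrative grid, and synthetic data; none of these come from the provided script.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (assumption for illustration only).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# --- Separate tuning script: NOT part of the submitted code ---
param_grid = {"n_estimators": [50, 100], "max_depth": [4, 8, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="accuracy",  # the project ranks models by classification accuracy
    cv=3,                # 3-fold cross-validation on the training data
)
search.fit(X, y)
best_params = search.best_params_
print(best_params, round(search.best_score_, 3))

# --- Submitted script: only the chosen values appear, no search is run ---
# In a real submission these would be written out as literal values.
final_model = RandomForestClassifier(random_state=0, **best_params)
final_model.fit(X, y)
```

Keeping the search out of the submitted script also helps with the execution-speed requirement, since a grid search retrains the model once per parameter combination and fold.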
4. Evaluation Criteria
Your performance in this project will be evaluated as follows:
Participation
• 5 points: Awarded to every student who attends the practical session (TP).
Theoretical Component
• 5 points: Based on theoretical exercises proposed during the examination (partial).
Project Scoring
Your models will be ranked based on their classification accuracy on the test dataset. Scores
will be distributed as follows:
• 2 points: Next 10 most accurate models.
• 0 points: All remaining submissions that fail to produce valid results or perform
poorly.
Disqualification
• Submissions that fail to execute properly or do not generate the required results.csv
file will be excluded from evaluation.
• Students who modify the evaluation function or tamper with the test process will face
serious penalties, including potential academic consequences.
5. Final Reminders
• Use only the following libraries: numpy, matplotlib, seaborn, sklearn, xgboost.
• Respect the project rules, as this is a test not just of your technical skills but also of
your integrity and adherence to guidelines.
This project provides a valuable opportunity to demonstrate your machine learning
proficiency and compete for top scores. Put in your best effort, and good luck!