Course Project Guideline - New

This document provides guidelines for a team course project on big data systems for data science. It outlines the project schedule and key dates, suggested project topics including data analysis and scalable data science tools and platforms, example datasets and sample projects, requirements for submitting a project proposal, final report, code, and presentation. The goal is for student teams to explore aspects of big data systems through developing a proposal, performing experiments, and reporting results.

Uploaded by

Caiyi Xu

CS4225/CS5425 Big Data Systems for Data Science

Course Project Guidelines


1. Overview
This is a team project. Each team comprises 4 or 5 members (CS4225 students can
form a team with CS5425 students and vice versa). Each team should identify a topic on
Big Data Systems for Data Science and conduct a course project under one of the
topics in the “Project Topics” section. You are encouraged to use other data science systems
covered in the lectures, such as those for data visualization and machine learning.
An important component of the course is the course project. Rather than having
more homework, the course project allows students the flexibility to explore some aspect of big
data systems on their own. The project involves the development of a proposal, experimental
results, and a final report. In proposing a project please keep the following in mind: it is better to
propose a reasonable project you can complete rather than a huge project that will be in an
intermediate state at the end of the term. To receive a final grade for the project, there must be
an experimental result that is shown (e.g. a graph, table, or chart) based on experiments that you
perform.
You are encouraged to use our school's cluster platform or a cloud platform for processing
big data. Microsoft has kindly offered us an education grant on Windows Azure for this course,
so you may be able to run your project and analytics on multiple virtual machines on
Azure. If you are interested in using those resources, please contact our teaching assistant.
The grading criteria are listed at the end of this guideline for your reference.

2. Project Schedule

Key dates      Action items                                  Mark (of final mark)
6 Sept 2020    Grouping (drop me an email)                   --
4 Oct 2020     Submit a project proposal (see Section 5)     10%
15 Nov 2020    Project presentation                          10%
22 Nov 2020    Project due: report must be submitted         20%

* All deadlines are at 11:59pm on the given date.


3. Project Topics
The topic of the course project can be ONE of the following:

• Data Analysis
In big data processing, data analysis is a critical step, since it draws conclusions from
new datasets that are important to a specific domain. Currently trending application domains
include healthcare, insurance, transportation, social media, etc. As part of this project, you
need to identify: a) A trending topic that is interesting to you. b) Novel and influential analytic
methods that are relevant to this topic. (You need to find some papers.) c) New datasets that are
not fully explored by data scientists. (Better if the datasets were released within the last year.)
Impact: you may create new data science insights that can impact society and improve
people’s lives. For example, by carefully analyzing taxi trajectory data, you may help to cut
down the waiting time for each user and make trips more environmentally friendly.
Examples: a) A real-time sentiment analysis of Twitter feeds with the NASDAQ index (i.e.,
analyze the correlation between Twitter feeds and hourly movements of the NASDAQ index). b)
Use deep learning techniques to predict stock prices.
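
To make example a) concrete, the correlation between an hourly sentiment score and the hourly index movement could be measured with the Pearson coefficient. The following is a minimal sketch only: the function name `pearson` and all sample values are hypothetical and made up for illustration, and a real project would compute this over a large dataset on a distributed platform.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)

# Hypothetical values, made up for illustration only (not real data).
hourly_sentiment = [0.10, 0.30, -0.20, 0.05, 0.40]
hourly_index_change = [0.2, 0.5, -0.4, 0.1, 0.6]
r = pearson(hourly_sentiment, hourly_index_change)
```

A value of `r` close to 1 or -1 would suggest a strong linear relationship; correlation alone does not show that one series drives the other.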

• Scalable Data Science Tools and Platforms
Scalable platforms and tools are very important for meeting the requirements of big data.
Although many new algorithms have been introduced, they may work only on a single
machine. In this project, we will develop scalable platforms and tools and make them open
source for public use. Here, scalability can be across multiple cores or multiple machines.
Examples include data mining and machine learning algorithms that were proposed
recently.
Impact: you may create new platforms and tools that address big data challenges that
have not been tackled before. For example, by developing an easy-to-use and efficient graph
processing platform (say, for finding shortest paths), you may help graph analysts improve
their productivity.
Examples: a) Visualized graph analytics: the study implements different ways of graph
visualization and offers an intuitive way of performing graph analytics. b) K-means optimization:
the study optimizes K-means on multi-core machines, and later in a distributed environment. It
studies different implementation strategies (e.g., early reduction) and hardware capabilities
(e.g., Ethernet vs. RDMA).
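
As an illustration of example b), the sketch below shows one K-means iteration with early reduction, interpreted here as combiner-style local aggregation: each partition is collapsed into per-cluster sums and counts before the global merge, so only a small summary per partition needs to be exchanged. This interpretation, the function names (`assign`, `local_reduce`, `kmeans_step`), and the sample points are assumptions made for illustration; partitions are simulated as in-memory lists rather than cores or machines.

```python
def assign(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
    )

def local_reduce(partition, centroids):
    """Early reduction: collapse one partition into per-cluster (sums, counts)."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for point in partition:
        k = assign(point, centroids)
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += point[d]
    return sums, counts

def kmeans_step(partitions, centroids):
    """One iteration: local partial sums per partition, then a global merge."""
    dim = len(centroids[0])
    total_sums = [[0.0] * dim for _ in centroids]
    total_counts = [0] * len(centroids)
    for sums, counts in (local_reduce(p, centroids) for p in partitions):
        for k in range(len(centroids)):
            total_counts[k] += counts[k]
            for d in range(dim):
                total_sums[k][d] += sums[k][d]
    # A cluster that received no points keeps its previous centroid.
    return [
        [s / total_counts[k] for s in total_sums[k]]
        if total_counts[k] else list(centroids[k])
        for k in range(len(centroids))
    ]

# Hypothetical toy data: two partitions, two initial centroids.
partitions = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(10.0, 10.0), (11.0, 10.0)],
]
centroids = [(0.0, 1.0), (9.0, 9.0)]
new_centroids = kmeans_step(partitions, centroids)
```

In a real deployment, `local_reduce` would run where the data lives, and only the `(sums, counts)` pairs would cross the network, which is where a comparison such as Ethernet vs. RDMA becomes relevant.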

• Self-proposed Projects
If you have ideas that do not fit exactly into what is listed above (e.g., you want to design
a totally new algorithm, or new statistical measures of interestingness), talk to the lecturer.
Impact: “The only limit to your impact is your imagination and commitment.” (Tony Robbins)
Examples: a new machine learning platform.

4. More Example Data Sets and Sample Projects


You can use any datasets on the internet. We prefer new and large datasets (better if
the datasets were released within the last year). Some possible sources:
https://www.kaggle.com/datasets
https://cloud.google.com/public-datasets/
https://github.com/caesar0301/awesome-public-datasets
https://github.com/openimages/dataset
http://www.cs.cmu.edu/~enron/
https://webobservatory.soton.ac.uk/
……

For sample big data projects, you can refer to the following links. However, we encourage
you to think beyond them and explore significantly new ideas (for example, identify
a trending topic that is interesting to you, and then apply novel and influential analytic
methods that are relevant to this topic).
https://www.coursera.org/specializations/big-data
http://hadoopproject.com/big-data-projects/
https://blog.kaspersky.com/cool-big-data-projects/8186/
……

5. Project Proposal
The purpose of the project proposal is to provide background material for the work that you
are to complete and to describe the actual experiments and expected results. The proposal should
be sufficiently detailed so that I can understand specifically what you are going to do. An important
part of the proposal is my understanding of your proposed work. If I think the project is too large
I will ask you to trim it down. The following is a general outline for the proposal. All page counts
are for 11pt. font, single spaced, single column.

1) Topic introduction (0.5 to 1 page)
2) Discussion of previous work and how it relates to your topic. This is one of the most
important parts of a proposal. Review state-of-the-art related work (you need to review
not only websites but also recent research articles). You may check out Google Scholar.
(1 to 2 pages)
3) Discussion of your experimental approach (including specific tools, methodologies,
experiments, etc.). Try to be as specific as possible. (1 to 2 pages)
4) Data sets to be used. Are they large and new? Justify your choice. (0.5 to 1 page)
5) Expected results. What exactly will your result be? What will you show in a table or chart?
Be specific. (1 page)
6) Project summary (0.5 to 1 page)
7) References (should include research articles and other sources, not just
web pages)

You need to submit your project proposal to LumiNUS, and the file name should include the
student IDs of all team members (usually starting with A). For example, if the group has four
members, the file name can be:
A1234567X-A1234567Y-A1234567Z-A1234567T-proposal.pdf
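
The naming convention can be sketched as a small helper. This is only an illustration: `submission_name` and its parameters are hypothetical names and are not part of any course tooling.

```python
def submission_name(student_ids, suffix, extension):
    """Join all student IDs with hyphens, then append the suffix and extension."""
    if not student_ids:
        raise ValueError("at least one student ID is required")
    return "-".join(student_ids) + f"-{suffix}.{extension}"

# Placeholder IDs taken from the example in this guideline.
team = ["A1234567X", "A1234567Y", "A1234567Z", "A1234567T"]
proposal_file = submission_name(team, "proposal", "pdf")
```

The same helper covers the other submissions in this guideline by changing the suffix and extension, for example `submission_name(team, "FinalReport", "zip")`.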

6. Submission Requirements
a) Final Report
You need to submit a report of at most 20 single-column pages on the
problem, the solution(s), and what will be demonstrated. All page counts are for 11pt. font,
single spaced, single column.

You should compare the different solutions qualitatively as well as provide an experimental
analysis. In general, the final report should contain the following:
1. Project introduction. This can be the same as for the proposal.
2. Methodology and experimentation. How did you perform the experiments? This can be a
revised version of the proposal.
3. Discussion of results.
4. Problems encountered and lessons learnt.
5. Personal contribution (for each student, written by the individual student).
6. Project summary.
7. References.
8. Workload of each group member (for example: which member completed which task, which
member wrote which section of the report).

b) Code
Code can be in any programming language (C, C++, Java, MATLAB, R…).

c) File Name
You need to compress all the documents and code into a zip (or rar) file, and the file name
should include the student IDs of all team members. For example, if the group has four members,
the file name should be:
A1234567X-A1234567Y-A1234567Z-A1234567T-FinalReport.zip
Note: Please ensure that the code and report are complete before submission! You need to
ensure that others can reproduce your work using the code and report that you provide.

7. Project Presentation

Presentations are submitted as videos. Your presentation should cover the same aspects of
the project mentioned above. You need to submit your video and slides to LumiNUS with the
following file name (assuming that your group has four members):

A1234567X-A1234567Y-A1234567Z-A1234567T-VideoSlide.zip

After submitting the presentation, you are encouraged to add more solid results and findings to
the report, beyond those presented in the video.

8. Grading Criteria
Although teams work on different projects, to make the grading fair and reasonable,
we will evaluate your work from the following perspectives. For each category, we give
some example issues for your reference.
(a) Linguistic ability: whether the report and slides are well prepared, whether figures and
tables are well presented.
(b) Complexity and novelty of the problem (you must carefully review the previous studies on
the same problem): whether the literature review is comprehensive (e.g., including both web
sources and research articles), whether a comparison between the previous studies and the
proposed study is clearly presented, whether you present the challenge/complexity of the
problem.
(c) Tools and algorithms: We expect you to use the systems learnt in the lectures. You are
welcome to use other systems and tools beyond those covered in the lectures. Their usage and
implementation must be clearly presented.
(d) Comprehensiveness of the analysis and findings: If you focus on system design and
performance analysis, we expect you to have some in-depth system analysis and optimization
reasoning. If you focus on data analytics, you need to carefully support your claims and
findings (say, with different approaches for the same tasks, with multiple data sets, from
different angles of the same data sets).
(e) Impact of your project or findings: Are your findings new? Are your findings
impactful/meaningful?
(f) Datasets used (Large? New?): a reasonable guideline for “Large” is that the data set should
be bigger than 10 GB (many current desktops have 4 GB to 8 GB of main memory), and the
data set should have been released within the last year. That means:
a. For the data set size, "over 10 GB" is a guideline. It is your job to explain why the data
set is sufficient for your findings. For example, if you want to analyze the long-term behavior
of a service, one day of data is clearly NOT sufficient.
b. For "new", please try to choose a data set released within the last year. We are not
interested in data sets that have been widely studied, unless you can justify that
you will study an old data set from a NEW angle.
(g) Oral presentation and demo: it will be a strong plus to have a demo. Your presentation
should be well prepared.

9. Submission Policies
For all submissions related to the course project:
No multiple submissions allowed: Each team should make sure that it submits
exactly once. If a team submits two versions, we will retain the earliest version and discard all
later versions. The team will then be graded on the earliest version. For this reason, if you
want to update your submission, you should delete your old submission first and then submit a
new one.

Policy on late submission: For fairness, reports submitted after the deadline but no more than
48 hours after the deadline will still be graded, with a penalty of 20%. Namely, I will first grade the
report normally, and then multiply the mark by 80% to get the final mark for that report. Reports
submitted more than 48 hours after the deadline will not be accepted and will receive a mark of 0.
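
The late-submission rule can be written out as a short calculation. This is a sketch of the arithmetic stated above, not an official grading script; the function name `final_report_mark` and its parameters are hypothetical.

```python
def final_report_mark(raw_mark, hours_late):
    """Apply the late policy: on time keeps the full mark, up to 48 hours late
    keeps 80% of it, and anything later receives 0."""
    if hours_late <= 0:
        return float(raw_mark)
    if hours_late <= 48:
        return raw_mark * 0.8
    return 0.0

# Example: a report graded 20 (out of the 20% report component), 10 hours late.
late_mark = final_report_mark(20, 10)
```

For the example values, the mark is 20 multiplied by 0.8; the same report submitted 49 hours late would receive 0.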

10. Plagiarism

You are reminded that plagiarism is a very SERIOUS offense, and disciplinary action
(including possibility of expulsion from the university) will be taken against any individual or team
found plagiarizing. The individual or team that is being plagiarized will also be punished if it is
found to have allowed the work to be plagiarized voluntarily.
