OOP Report
Project Report
COURSE: OBJECT-ORIENTED PROGRAMMING
(COURSE ID: IT3100E)
NEWS AGGREGATOR
GROUP 12
Class ID : 149323
Lecturer : Ph.D. Trịnh Tuấn Đạt
IT3100E – Object-Oriented Programming 20232
TABLE OF CONTENTS
WORK DISTRIBUTION
INTRODUCTION
Chapter 1. Statistics on the collected data
Chapter 2. System design
2.1. Problem description
2.2. Detailed system design
2.2.1. Package diagram
2.2.2. Class diagram
2.3. Object-Oriented Techniques applied
Chapter 3. Technologies used and notable algorithms
3.1. Programming languages and external libraries
3.1.1. Java
3.1.2. Python
3.2. Notable algorithms
3.2.1. Exact Search
3.2.2. Entity Recognition
3.2.3. Trend detection
3.2.4. Smart Search
Chapter 4. Instructions and demonstration
4.1. Setup guide
4.2. Data collection instructions
4.2.1. Scraping news sites
4.2.2. Processing data
4.3. App usage instructions
4.3.1. Login function
4.3.2. Base-user function
4.3.3. Manager function
4.3.4. Admin function
CONCLUSION
WORK DISTRIBUTION
● Tester
INTRODUCTION
In the blockchain field in particular, demand for a news aggregator tool has
grown because of the sheer volume of information circulating on social media
these days. A news aggregator helps readers filter and digest the important
information and keep up with the latest changes in the field.
          Quantity   Details
Sources   2          CryptoSlate, Blockchain News
The first is the “connect” package, which sets up and manages the connection
to a SQL server. Because we do not currently have access to a SQL server, this
package remains mostly unused.
3. Package engine:
This package contains the RunPython class, which provides a method for running
the Python scripts that handle data (searching, processing, filtering, finding
trends). It reads the “python_path.txt” file to find the location of
“python.exe” on your system. This “.txt” file must be created during
installation, before the project is run.
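As an illustration of the idea, a launcher of this kind could look like the sketch below. It is not the project's actual RunPython code: the class name PythonRunnerSketch, its method names and the script arguments are all hypothetical, and we assume the first non-blank line of “python_path.txt” holds the interpreter path.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a RunPython-style helper; names and signatures are assumptions. */
final class PythonRunnerSketch {

    /** Returns the interpreter location: the first non-blank line of the config file. */
    static String readPythonPath(Path configFile) {
        try {
            for (String line : Files.readAllLines(configFile, StandardCharsets.UTF_8)) {
                if (!line.trim().isEmpty()) {
                    return line.trim();
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException("Cannot read " + configFile, e);
        }
        throw new IllegalStateException(configFile + " does not contain a Python path");
    }

    /** Builds the command line: interpreter first, then the script, then its arguments. */
    static List<String> buildCommand(String pythonExe, String script, String... args) {
        List<String> command = new ArrayList<>();
        command.add(pythonExe);
        command.add(script);
        for (String arg : args) {
            command.add(arg);
        }
        return command;
    }

    /** Runs the script, forwards its output to this process, and returns the exit code. */
    static int run(Path configFile, String script, String... args)
            throws IOException, InterruptedException {
        List<String> command = buildCommand(readPythonPath(configFile), script, args);
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }
}
```

Passing the command as a list of separate strings avoids quoting problems when the interpreter path or an argument contains spaces.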
Inside “engine” is the “scraper” subpackage, which contains the abstract class
“Scraper”. It provides a standard template for creating child scraper classes
for individual news sites. Each scraper holds an ArrayList of “News” objects,
each carrying all the necessary attributes of an article.
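The relationship between the abstract class and its children can be sketched as follows. This is only an assumed shape, not the project's real code: NewsItem stands in for the “News” class with two illustrative fields, and the names ScraperSketch, collect, scrape and FixedListScraper are hypothetical. The example child returns a fixed article instead of downloading a page, so the sketch runs without any external library.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Minimal stand-in for the report's "News" class; the two fields are assumptions. */
final class NewsItem {
    final String title;
    final String url;

    NewsItem(String title, String url) {
        this.title = title;
        this.url = url;
    }
}

/** Hypothetical abstract scraper: shared storage here, site-specific work in children. */
abstract class ScraperSketch {
    private final List<NewsItem> articles = new ArrayList<>();

    /** Site-specific step: each child class fetches and parses its own pages here. */
    protected abstract List<NewsItem> collect();

    /** Template method: runs the site-specific step and stores what it found. */
    final void scrape() {
        articles.clear();
        articles.addAll(collect());
    }

    /** Read-only view of the articles gathered by the last scrape() call. */
    List<NewsItem> getArticles() {
        return Collections.unmodifiableList(articles);
    }
}

/** Example child class; a real one would download and parse pages, e.g. with jsoup. */
final class FixedListScraper extends ScraperSketch {
    @Override
    protected List<NewsItem> collect() {
        List<NewsItem> found = new ArrayList<>();
        found.add(new NewsItem("Example headline", "https://example.com/a"));
        return found;
    }
}
```

With this layout, supporting a new news site only requires a new child class that implements collect(); the storage and access logic stays in one place.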
4. Package graph:
This package is used to draw the bar graphs in the trending section.
5. Package login:
This package implements the login function of the app. A user can sign in as
an “Admin” or a “Manager”, or use the app as a default user. Each level grants
a different degree of access to the functionality of the app, in increasing
order: default, “Manager”, then “Admin”. Each higher level adds functions with
a greater impact on the application.
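One simple way to model such ordered access levels is an enum whose declaration order encodes the hierarchy. The sketch below is an assumption about how this could be done, not the code of the login package; the names AccessLevel and Feature are hypothetical, and the feature list only mirrors the restrictions described in the usage chapter of this report.

```java
/** Hypothetical access levels, declared from least to most privileged. */
enum AccessLevel {
    DEFAULT, MANAGER, ADMIN;

    /** True if this level has at least the privileges of the required level. */
    boolean atLeast(AccessLevel required) {
        return compareTo(required) >= 0;
    }
}

/** Hypothetical feature list with the minimum level each feature requires. */
enum Feature {
    SEARCH(AccessLevel.DEFAULT),
    IMPORT_DATA(AccessLevel.MANAGER),
    DELETE_ROWS(AccessLevel.MANAGER),
    MANAGE_USERS(AccessLevel.ADMIN);

    private final AccessLevel required;

    Feature(AccessLevel required) {
        this.required = required;
    }

    /** Checks whether a user at the given level may use this feature. */
    boolean allowedFor(AccessLevel level) {
        return level.atLeast(required);
    }
}
```

Because enum constants compare by declaration order, adding a new level between two existing ones does not require touching the permission checks.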
6. Package shape:
This package draws the backgrounds of the Search and Menu pages in the
application.
7. Package execute:
- ProcessData: formats the data collected from all websites, combines them
into one file, updates the lists of Tags and Categories, creates the
Inverted Indexes, and runs Entity Recognition.
- The “News” class overrides the toString method so that its objects can be
written to data files more easily.
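A minimal sketch of this technique is shown below. The class NewsRecord and its three fields are hypothetical stand-ins for the real “News” class, and the CSV quoting rule is our assumption about a reasonable output format.

```java
/** Hypothetical article class; the real "News" class likely has more attributes. */
final class NewsRecord {
    private final String title;
    private final String url;
    private final String date;

    NewsRecord(String title, String url, String date) {
        this.title = title;
        this.url = url;
        this.date = date;
    }

    /** Quotes a field and doubles embedded quotes, so commas inside it stay safe. */
    private static String csv(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    /** Returns one CSV line per article, so a writer can simply print the object. */
    @Override
    public String toString() {
        return String.join(",", csv(title), csv(url), csv(date));
    }
}
```

With toString overridden this way, writing an article to a file reduces to a single println call on the object.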
3.1.1. Java
- jsoup is a Java library that simplifies working with real-world HTML
and XML. It offers an easy-to-use API for URL fetching, data parsing,
extraction, and manipulation. In this project, jsoup is used to collect
data from news websites.
3.1.2. Python
- spaCy is a free, open-source Python library for natural language
processing (NLP), designed to process large volumes of text at high
speed. In this project, spaCy is used for Entity Recognition and Smart
Search. We use “en_core_web_lg”, the large pretrained spaCy pipeline
for English.
- Searching: when the program receives a query, the query is tokenized,
lowercased and lemmatized, and stop words and punctuation are removed,
giving the final list of tokens. Using the inverted indexes, only articles
that contain at least one of these tokens are considered. The grading
function then calculates a score for each of these articles, and the
results are sorted by score.
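The lookup-and-grade flow described above can be sketched in Java as follows. This is a simplified illustration under several assumptions, not the project's spaCy-based Python code: lemmatization is omitted, the stop-word list is a tiny hand-picked one, and the grading function simply sums how often each query token occurs in an article. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Simplified inverted-index search; tokenization and scoring are assumptions. */
final class InvertedIndexSketch {
    private static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("a", "an", "and", "in", "is", "of", "on", "or", "the", "to"));

    // token -> (article id -> number of occurrences of the token in that article)
    private final Map<String, Map<Integer, Integer>> index = new HashMap<>();

    /** Lowercases, splits on non-alphanumeric characters, and drops stop words. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    /** Adds one article to the index. */
    void add(int articleId, String text) {
        for (String token : tokenize(text)) {
            index.computeIfAbsent(token, k -> new HashMap<>())
                 .merge(articleId, 1, Integer::sum);
        }
    }

    /** Returns ids of articles containing at least one query token, best score first. */
    List<Integer> search(String query) {
        Map<Integer, Integer> scores = new HashMap<>();
        for (String token : new LinkedHashSet<>(tokenize(query))) {
            Map<Integer, Integer> postings =
                    index.getOrDefault(token, Collections.<Integer, Integer>emptyMap());
            for (Map.Entry<Integer, Integer> entry : postings.entrySet()) {
                scores.merge(entry.getKey(), entry.getValue(), Integer::sum);
            }
        }
        List<Integer> ranked = new ArrayList<>(scores.keySet());
        // Higher score first; ties are broken by the smaller article id.
        ranked.sort((a, b) -> {
            int byScore = Integer.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : Integer.compare(a, b);
        });
        return ranked;
    }
}
```

Because the search only walks the posting lists of the query tokens, articles that share no token with the query are never scored at all.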
The format options ensure that the data share the same format and handle
duplicates in the raw data file. The result is a new “formatted” file in the
same folder.
The “Merge” option then combines the newly formatted data with the existing
data, producing the “Data_full.csv” file in
“BlockChain/src/main/java/hust/soict/dsai/blockchain/engine/Data/”.
The next 2 options create “All_Categories.csv”, “All_Tags.csv” and
“Data_full_with_entites.csv” in that same folder.
Finally, the inverted indexes must be created before running the program so
that the data are properly indexed for accurate searching. This step also
creates “All_Tags.csv” in the folder mentioned above.
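The format and merge steps can be pictured with the short sketch below. It rests on assumptions that the report does not state: rows are already split into fields, formatting means trimming each field, and two rows count as duplicates when they share the same URL. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical format-and-merge step; the duplicate rule (same URL) is an assumption. */
final class MergeSketch {

    /** Brings one raw row into a common format by trimming every field. */
    static String[] format(String[] row) {
        String[] cleaned = new String[row.length];
        for (int i = 0; i < row.length; i++) {
            cleaned[i] = row[i].trim();
        }
        return cleaned;
    }

    /** Merges new rows into the existing ones, keeping the first row seen per URL. */
    static List<String[]> merge(List<String[]> existing, List<String[]> incoming,
                                int urlColumn) {
        // LinkedHashMap keeps the original row order while dropping duplicates.
        Map<String, String[]> byUrl = new LinkedHashMap<>();
        for (String[] row : existing) {
            String[] cleaned = format(row);
            byUrl.putIfAbsent(cleaned[urlColumn], cleaned);
        }
        for (String[] row : incoming) {
            String[] cleaned = format(row);
            byUrl.putIfAbsent(cleaned[urlColumn], cleaned);
        }
        return new ArrayList<>(byUrl.values());
    }
}
```

Formatting before comparing matters here: without the trim, the same URL with stray whitespace would be treated as a different article.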
When the app starts, a login button appears in the top-right corner; press it
to log in.
A pop-up window then asks for your username and password.
When you press the “Search” button in the left sidebar, the Search tab
appears on the screen like this:
In the middle of the page is the search bar, where you enter your search
query.
Below the search bar is a switch for choosing between smart search and exact
search. In this picture, the smart search option is selected.
At the bottom of the page is an optional field with 2 bars named “Tags” and
“Categories”. Choose a value from the option box on the right, then hit
“Confirm”; the bar shows the tag or category you have chosen. You can leave
this field blank if you do not want to filter.
After filling in the fields you want, press Enter, and the matching results
are presented:
B. Trend function
When you press the “Trend” button in the left sidebar, the Trend tab appears
on the screen.
The field at the bottom of the page contains all the information you must
fill in. “Start date” and “End date” set the time range for which you want to
see trends. In the “Option” box, choose which parameter the trends are based
on (tags or entities), and in the “Top” field, enter how many trends you want
to see. After filling in all the fields, press “Start find”, and the result
will be displayed like this:
With that, the trend search is complete.
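To make the role of the four inputs concrete, the sketch below shows one plausible reading of them. It assumes that a trend is simply a tag that occurs most often in articles published within the chosen date range, which may differ from the project's actual trend detection; the class TrendSketch, its nested Article class and the method topTags are all hypothetical.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical trend query: most frequent tags inside a date range (an assumption). */
final class TrendSketch {

    /** Minimal article view: publication date plus the tags attached to it. */
    static final class Article {
        final LocalDate date;
        final List<String> tags;

        Article(LocalDate date, List<String> tags) {
            this.date = date;
            this.tags = tags;
        }
    }

    /** Returns up to 'top' tags, most frequent first, for dates in [start, end]. */
    static List<String> topTags(List<Article> articles, LocalDate start,
                                LocalDate end, int top) {
        Map<String, Integer> counts = new HashMap<>();
        for (Article article : articles) {
            if (article.date.isBefore(start) || article.date.isAfter(end)) {
                continue; // outside the "Start date" .. "End date" range
            }
            for (String tag : article.tags) {
                counts.merge(tag, 1, Integer::sum);
            }
        }
        List<String> ranked = new ArrayList<>(counts.keySet());
        // Higher count first; ties are broken alphabetically for a stable result.
        ranked.sort((x, y) -> {
            int byCount = Integer.compare(counts.get(y), counts.get(x));
            return byCount != 0 ? byCount : x.compareTo(y);
        });
        return new ArrayList<>(ranked.subList(0, Math.min(top, ranked.size())));
    }
}
```

Switching the “Option” box from tags to entities would, under the same assumption, only change which list of labels is counted for each article.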
C. Data function
All the data can be viewed in a table like this:
D. Settings function
When the SQL information option is turned on, this table shows up:
Fill in all the required fields to connect to a database. The functions that
interact with the new database (search and trend) are still under development.
E. Exit function
To quit the app, simply press the exit button.
When you press the “Import Data” button in the left sidebar, the file
selection window appears on the screen like this:
After choosing the directory, selecting the file and pressing the “open”
button at the bottom, your file is added to the database. Be careful: the
added data might break the structure of the existing data (wrong format, not
indexed, ...). Because the data structure is fragile, only the manager and
admin can change it.
B. Data function
Unlike the base user’s data function, the manager’s version offers more
actions. The manager can delete one or more rows from the database by
selecting them and pressing the delete button next to the “export to csv”
button. Be careful: because the data structure is fragile, deleting data
might break it (wrong format, not indexed, ...).
To add a user, press the “add user” button, and a window pops up like this:
To delete users, select the rows containing the usernames you want to delete
and press “delete user”.
Because role assignment can affect many other users, only the admin can
perform this action.
CONCLUSION
The News Aggregator was developed from the knowledge taught in class and our
own research, so it still has many limitations. Part of this is due to the
group members’ lack of experience in building and developing applications.
After developing and testing the application, we would like to present the
following conclusions and development orientation:
1. Advantages
- The application implements all the required features of the problem.
- The amount of data collected is decent.
2. Disadvantages
- Some program bugs remain.
- The user interface is not very eye-catching.
3. Development orientation
With the above advantages and disadvantages, the group has a few new
development ideas for the application: