HA NOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY


──────── * ───────

Project Report
COURSE: OBJECT-ORIENTED PROGRAMMING
(COURSE ID: IT3100E)

NEWS AGGREGATOR
GROUP 12

Class ID : 149323
Lecturer : Ph.D. Trịnh Tuấn Đạt
IT3100E – Object-Oriented Programming 20232

TABLE OF CONTENTS

WORK DISTRIBUTION
INTRODUCTION
Chapter 1. Statistics on the collected data
Chapter 2. System design
2.1 Problem description
2.2 Detail System design
2.2.1. Package diagram
2.2.2. Class diagram:
2.3 Object-Oriented Techniques applied
Chapter 3. Technologies used and notable algorithms
3.1. Programming languages and external libraries
3.1.1. Java:
3.1.2 Python
3.2 Notable algorithms:
3.2.1 Exact Search
3.2.2 Entity Recognition
3.2.3 Trend detection
3.2.4 Smart Search
Chapter 4. Instructions and demonstration
4.1 Set up guide:
4.2 Collect data Instruction
4.2.1 Scraping news sites
4.2.2 Process data
4.3 App Usage Instruction
4.3.1 Login Function
4.3.2 Base-User function:
4.3.2 Manager Function
4.3.3 Admin Function
CONCLUSION


WORK DISTRIBUTION

FULL NAME           STUDENT ID  STUDENT EMAIL                   WORK DETAIL                                               WORK PERCENTAGE
Ngô Quang Đức       20225484    duc.nq225484@sis.hust.edu.vn    Collect, process data; Write report; Tester               20%
Đỗ Doãn Hoàng Du    20220060    du.ddh220060@sis.hust.edu.vn    Analyze data; Smart Search algorithm; Package designer    20%
Bùi Duy Anh         20225563    anh.bd225563@sis.hust.edu.vn    Collect, process data; Write report; Tester               20%
Lê Gia Huy          20225498    huy.lg225498@sis.hust.edu.vn    Main GUI coder; Write report; Package designer            20%
Nguyễn Tiến Thành   20225459    thanh.nt225459@sis.hust.edu.vn  Exact Search algorithm; Create slides; Draw UML diagrams  20%


INTRODUCTION

Information technology is a familiar term to all of us, and few would deny its
role in the development of society. However, in a developing country like ours,
the application of information technology is still limited. A large number of
agencies and organizations still carry out their daily work using traditional
methods. Computerization has therefore become essential to the country's
socio-economic development.

In the blockchain field specifically, the need for a news aggregator tool has
grown because of the sheer volume of information now available on social media.
A news aggregator helps readers filter and digest important information and
keep up with the latest changes in the field.

Our group's project is designed using Object-Oriented Programming in Java. Java
programs run on any platform for which an execution environment is available,
which makes installation and execution easier for end users.

Despite our efforts to refine the product, we acknowledge that there may be
shortcomings due to limitations in our knowledge and testing. We welcome
feedback from our lecturer and our peers to further improve the project.
Finally, we would like to express our sincere gratitude to Mr. Trinh Tuan Dat
for his guidance throughout the completion of this final project.


Chapter 1. Statistics on the collected data

                       Quantity  Details
Number of data type    1         2 (news articles)
Sources                2         CryptoSlate
                                 Blockchain News
Number of articles     25354     CryptoSlate: 17428
                                 Blockchain News: 7926
                                 Latest Date: 2024-04-04
                                 Oldest Date: 2017-09-17
Number of tags         11831
Number of categories   211


Chapter 2. System design

2.1 Problem description


The project involves building a data collection and analysis system for news
related to the blockchain field. The tasks to be performed are as follows:

- Find data sources: news websites, blogs, tweets about blockchain
- Develop a search engine on the collected data. The search feature should
be as versatile and intelligent as possible.
- Trend detection in articles
- Entity recognition in articles

2.2 Detail System design


2.2.1. Package diagram
The following diagram shows the relationship between the packages in our project.

2.2.2. Class diagram:


1. Package connect:

The “connect” package sets up and handles connections to an SQL server.
However, we do not currently have access to an SQL server, so this package
remains mostly unused.


3. Package engine:

This package contains the “RunPython” class, which provides a method to run the
Python files that handle data (searching, processing, filtering, finding
trends). It reads the “python_path.txt” file to get the location of
“python.exe” on your system. This text file must be created during the
installation process, before the project is run.
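The project's RunPython class is written in Java; the sketch below mirrors the same two steps in Python purely for illustration. It assumes that “python_path.txt” holds the interpreter path on its first line, and the function names `read_python_path` and `run_script` are hypothetical, not taken from the project.

```python
import subprocess
from pathlib import Path


def read_python_path(path_file):
    """Return the interpreter location stored on the first line of the text file."""
    text = Path(path_file).read_text(encoding="utf-8").strip()
    if not text:
        # Assumption: an empty file means the installation step was skipped.
        raise ValueError(f"{path_file} is empty; run the setup step first")
    return text.splitlines()[0].strip()


def run_script(path_file, script, *args):
    """Run `script` with the stored interpreter and return its standard output."""
    command = [read_python_path(path_file), str(script), *args]
    # check=True raises CalledProcessError if the script exits with an error.
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout
```

Passing the command as a list avoids shell quoting problems when the interpreter path contains spaces.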

“FilePathsManager” stores the paths to many of the non-Java files in the
project.

“CSVHandle” is responsible for writing scraped data to CSV files, loading data
from a CSV file into a table, and exporting or removing data.

Inside “engine” you can find the “scraper” subpackage, which contains the
abstract class “Scraper”. It provides a standard template for creating child
scraper classes for individual news sites. Each scraper holds an ArrayList of
“News” objects, each of which has all the necessary attributes of an article.
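The template idea behind the abstract Scraper class can be sketched as follows. The project's classes are written in Java; this is a Python illustration only. The two fields of `News`, the getter names, and the dictionaries standing in for parsed pages are assumptions made for the example.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class News:
    # Hypothetical subset of article attributes; the real class has more.
    title: str
    link: str


class Scraper(ABC):
    """Shared workflow; subclasses only say how to read each element."""

    def __init__(self):
        self.news_list = []  # plays the role of the ArrayList of News objects

    @abstractmethod
    def get_title(self, page):
        """Return the article title from a parsed page."""

    @abstractmethod
    def get_link(self, page):
        """Return the article link from a parsed page."""

    def get_news(self, page):
        # Inherited by every child: calls the site-specific getters.
        news = News(self.get_title(page), self.get_link(page))
        self.news_list.append(news)
        return news


class CryptoSlateScraper(Scraper):
    # Assumed layout: a dict standing in for the parsed HTML of one article.
    def get_title(self, page):
        return page["headline"].strip()

    def get_link(self, page):
        return page["url"]


class BlockChainNewsScraper(Scraper):
    def get_title(self, page):
        return page["title"].strip()

    def get_link(self, page):
        return page["href"]
```

Adding a scraper for another site then only requires a new subclass that overrides the getters; `get_news` itself stays unchanged.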


4.Package graph:

This package is used to draw the bar graphs in the trending section.

5. Package login:

This package is responsible for the login function in the app. A user's account
can be “Admin”, “Manager” or default, and each level has different access to
the functionality of the app. The order, from lowest to highest, is default,
“Manager”, then “Admin”. Each higher level adds functions with a greater impact
on the application.
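One way to model such ordered access levels is an ordered enumeration with a minimum level per feature. The sketch below is an illustration, not the project's login code; the feature names in `REQUIRED_ROLE` are hypothetical labels for functions described in Chapter 4.

```python
from enum import IntEnum


class Role(IntEnum):
    # Ordered from lowest to highest access level.
    DEFAULT = 0
    MANAGER = 1
    ADMIN = 2


# Hypothetical feature names mapped to the lowest role allowed to use them.
REQUIRED_ROLE = {
    "search": Role.DEFAULT,
    "trend": Role.DEFAULT,
    "import_data": Role.MANAGER,
    "delete_rows": Role.MANAGER,
    "manage_users": Role.ADMIN,
}


def can_use(role, feature):
    """A role may use a feature if it is at least the required level."""
    return role >= REQUIRED_ROLE[feature]
```

Because the roles are ordered, a higher role automatically gains every function of the roles below it.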


6. Package shape:
This package is used to create the background for the Search and Menu pages in
the application.

7. Package execute:

This package includes all classes that can be run:

- DashBoard: contains the main GUI and control logic

- RunCSL: starts the scraper for the CryptoSlate website

- RunBCN: starts the scraper for the BlockchainNews website

- ProcessData: formats the collected data from all websites, combines them
into one file, updates the list of Tags and Categories, creates Inverted
Indexes, and runs Entity Recognition.


2.3 Object-Oriented Techniques applied


- The “Connection” interface is implemented by “ConnectJDBC”, so other types
of connections can be added in the future (Abstraction - Inheritance).

- The abstract class “Scraper” is inherited by “CryptoSlateScraper” and
“BlockChainNewsScraper”. It declares abstract methods that must be
overridden to get each article element, creating a template that makes it
convenient to add scrapers for other news websites. Every child class
inherits the getNews method. When getNews is called, it calls the other
get-element methods, so each scraper behaves differently to suit the layout
of its website (Abstraction - Inheritance - Polymorphism).

- The “News” class overrides the toString method so that articles can be
written to data files more easily.

- Many classes have several constructors, allowing an object to be created in
multiple ways (Method Overloading).

- Methods providing similar functionality are bundled into appropriately named
classes and packages. In many classes, methods and attributes are given
private/default access modifiers when possible, and getters and setters are
provided if needed (Encapsulation, making the code easier to maintain and
debug).

Chapter 3. Technologies used and notable algorithms

3.1. Programming languages and external libraries
The GUI and control logic of the main program, as well as the data collection,
are written in Java. The algorithmic code for data analysis is implemented in
Python and is called from Java when needed. The programming languages and
external libraries are described in detail below.

3.1.1. Java:
- jsoup is a Java library that simplifies working with real-world HTML
and XML. It offers an easy-to-use API for URL fetching, data parsing,
extraction, and manipulation. In this project, jsoup is used to collect
data from news websites.

- OpenCSV is a library for writing, reading, serializing, deserializing,
and/or parsing .csv files. In this project, OpenCSV is used for reading
and writing data to .csv files.

3.1.2 Python
- spaCy is a free, open-source Python library that provides advanced
capabilities for natural language processing (NLP) on large volumes of
text at high speed. In this project, spaCy is used for Entity Recognition
and Smart Search. We used “en_core_web_lg”, the large pre-trained NLP
pipeline for English.

- pandas is a powerful, open-source Python library for data manipulation and
data analysis. It provides data structures and functions for performing
efficient operations on data. In this project, pandas is used to read and
process the CSV files.


3.2 Notable algorithms:

3.2.1 Exact Search


This algorithm returns the articles that contain the input string (through
string matching). For convenience, it ignores case as well as leading and
trailing whitespace.
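A minimal sketch of this kind of matching is shown below. It assumes each article is a dictionary and that the title and content fields are the ones searched; the report does not state which fields the project checks, so `fields` is a hypothetical parameter.

```python
def exact_search(query, articles, fields=("title", "content")):
    """Return the articles whose chosen fields contain the query string."""
    # Trim the query and fold case so that matching ignores both.
    needle = query.strip().casefold()
    if not needle:
        return []  # assumption: an empty query matches nothing
    return [
        article
        for article in articles
        if any(needle in article.get(field, "").casefold() for field in fields)
    ]
```

For example, the query "  BITCOIN " matches an article whose content contains "Bitcoin".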

3.2.2 Entity Recognition


We go through the title, summary, content, tags and category of each article
collected, and record notable named entities that the model detected. The
entities belong to one of the following tags:

+ PERSON: People, including fictional.
+ NORP: Nationalities or religious or political groups.
+ FAC: Buildings, airports, highways, bridges, etc.
+ ORG: Companies, agencies, institutions, etc.
+ GPE: Countries, cities, states.
+ LOC: Non-GPE locations, mountain ranges, bodies of water.
+ PRODUCT: Objects, vehicles, foods, etc. (Not services.)
+ EVENT: Named hurricanes, battles, wars, sports events, etc.
+ WORK_OF_ART: Titles of books, songs, etc.
+ LAW: Named documents made into laws.
+ LANGUAGE: Any named language.
+ PERCENT: Percentage, including “%”.
+ MONEY: Monetary values, including units.
+ QUANTITY: Measurements, as of weight or distance.
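The filtering step can be sketched as below. The English spaCy pipelines also emit labels that are not in the list above (for example DATE or CARDINAL), so this sketch assumes those are discarded. The function accepts any object whose `ents` items expose `text` and `label_`, as a spaCy `Doc` does; the name `collect_entities` and the removal of repeated entities are assumptions.

```python
# The 14 labels listed above.
KEPT_LABELS = {
    "PERSON", "NORP", "FAC", "ORG", "GPE", "LOC", "PRODUCT", "EVENT",
    "WORK_OF_ART", "LAW", "LANGUAGE", "PERCENT", "MONEY", "QUANTITY",
}


def collect_entities(doc):
    """Return unique (text, label) pairs with a kept label, in first-seen order."""
    seen = {}
    for ent in doc.ents:
        if ent.label_ in KEPT_LABELS:
            # dict keys keep insertion order and drop repeated entities
            seen.setdefault((ent.text, ent.label_), None)
    return list(seen)
```

With spaCy and the model installed, `doc` would be obtained by loading “en_core_web_lg” with `spacy.load` and calling the resulting pipeline on the article text.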

3.2.3 Trend detection


We count the number of occurrences of each named entity or tag between any two
chosen dates. The result is sorted by the number of occurrences in descending
order.
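The counting can be sketched with a counter over the articles that fall in the chosen range. The sketch assumes each article is a dictionary with a `date` and a list of `tags`, and that both end dates are inclusive; these names and the inclusive range are illustrative assumptions.

```python
from collections import Counter
from datetime import date


def top_trends(articles, start, end, top):
    """Count tags of articles dated within [start, end]; return the `top` most frequent."""
    counts = Counter()
    for article in articles:
        if start <= article["date"] <= end:
            counts.update(article["tags"])
    # most_common returns (tag, count) pairs sorted by count, descending
    return counts.most_common(top)
```

The same function works for named entities if each article carries a list of entities instead of tags.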

3.2.4 Smart Search


We used a combination of Inverted Indexing and a grading function to assign
points to each article based on its relevance to the search query.


- Inverted Indexing: before the mapping is created, each word is converted
to lowercase and lemmatized to its base form (e.g. “walked” and “walking”
are both turned into “walk”). Then it is mapped to its locations in the
data file (article ids) alongside the number of its occurrences in each
article. For example, a word with an inverted index (iv_id) of
“15443-14|16774-7” appears (in this or another form) 14 times in article
15443 and 7 times in article 16774.

- Searching: when the program gets an input, the input is tokenized,
converted to lowercase and lemmatized, and stop words and punctuation are
removed, giving the final list of tokens. With this list, the grading
function calculates the points for each relevant article, and the results
are sorted. Through the inverted indexes, only articles that contain at
least one of the tokens are considered.
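The two steps above can be sketched as follows. The project lemmatizes with spaCy; to stay self-contained, this sketch only lowercases words and strips punctuation, so `normalize` is a stand-in and not real lemmatization. The stop-word list and all function names are hypothetical.

```python
from collections import Counter

# Hypothetical tiny stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}


def normalize(word):
    """Lowercase and strip surrounding punctuation (stand-in for lemmatization)."""
    return word.lower().strip(".,;:!?\"'()[]")


def build_index(articles):
    """Map each normalized word to {article_id: number of occurrences}."""
    index = {}
    for article_id, text in articles.items():
        counts = Counter(normalize(w) for w in text.split())
        for word, count in counts.items():
            if word:
                index.setdefault(word, {})[article_id] = count
    return index


def encode_postings(postings):
    """Serialize {article_id: count} in the 'id-count|id-count' form."""
    return "|".join(f"{aid}-{count}" for aid, count in sorted(postings.items()))


def decode_postings(iv_id):
    """Parse an 'id-count|id-count' string back into {article_id: count}."""
    pairs = (entry.split("-") for entry in iv_id.split("|"))
    return {int(aid): int(count) for aid, count in pairs}


def prepare_query(query):
    """Tokenize, normalize, and drop stop words and empty tokens."""
    tokens = [normalize(w) for w in query.split()]
    return [t for t in tokens if t and t not in STOP_WORDS]
```

Decoding the example string “15443-14|16774-7” gives a mapping from article 15443 to 14 and from article 16774 to 7.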

The points of each article are given by:

$$p = \prod_{i=1}^{n} (x_i + c_n)$$

p : the total points
n : the number of tokens in the final list
$x_i$ : the number of occurrences of token i in the article
$c_n$ : the default point or penalty coefficient for each token, given by

$$c_n = 1 - \frac{1}{n^{\alpha}} \quad (\alpha > 0)$$

We can see that:
- $0 \le c_n < 1$ for all n ≥ 1.
- When n = 1, $c_n = 0$, which means that p = 0 if $x_1 = 0$ (the article does
not contain the token).
- When n approaches ∞, $c_n$ approaches 1, which means that even if $x_i = 0$,
the total points are barely lowered when multiplying by $(x_i + c_n)$.
- Overall, the function penalizes missing tokens more when there are fewer
total tokens, and less when the total number of tokens is high.
- α decides how fast the penalty coefficient converges to 1 as n
increases. In our testing, we found the algorithm worked better with 0 < α
< 1, and we chose α = 0.2.
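The grading function can be sketched directly from the formula. In this sketch the index is assumed to be a dictionary mapping each token to {article id: occurrences}; the function names and the tie-breaking by article id are assumptions.

```python
def penalty(n, alpha=0.2):
    """Penalty coefficient c_n = 1 - 1 / n**alpha for a query of n tokens."""
    return 1 - 1 / n ** alpha


def score(occurrences, alpha=0.2):
    """Product of (x_i + c_n) over the occurrence counts of all query tokens."""
    n = len(occurrences)
    if n == 0:
        return 0.0  # assumption: an empty query scores nothing
    c = penalty(n, alpha)
    points = 1.0
    for x in occurrences:
        points *= x + c
    return points


def rank(tokens, index, alpha=0.2):
    """Return the ids of articles containing at least one token, best first."""
    if not tokens:
        return []
    candidates = set()
    for token in tokens:
        candidates.update(index.get(token, {}))
    scored = [
        (score([index.get(t, {}).get(aid, 0) for t in tokens], alpha), aid)
        for aid in candidates
    ]
    # Highest points first; ties broken by article id (an assumption).
    return [aid for points, aid in sorted(scored, key=lambda pa: (-pa[0], pa[1]))]
```

For a two-token query with α = 0.2, the coefficient is about 0.13. An article containing the first token twice and the second not at all scores roughly 0.28, while an article containing each token once scores roughly 1.28, so the article that covers both tokens ranks higher.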


Chapter 4. Instructions and demonstration


4.1 Set up guide:
To function properly, the program needs to know the location of “python.exe”
on your system, and the necessary Python and Java libraries must be installed.
Inside the project you will find two batch files, “install.bat” and
“locate_python.bat”. Running “install.bat” installs pandas and spaCy on your
system and creates a text file containing the path to “python.exe” in the
appropriate location. If you already have the necessary libraries, you only
need to run “locate_python.bat” to record the Python path.

Java external libraries are downloaded and added to the project automatically
by Maven when you launch the project.

4.2 Collect data Instruction


This part of the project is not aimed at the end user so these functions do not
have a GUI and need to be run in the terminal or IDE.

4.2.1 Scraping news sites


When you run the main method in RunCSL, it asks for the start and end pages
you want to scrape. Each article's information is printed in the terminal and
added to a news list. After the scraper finishes (or stops because the Internet
connection has been lost for an extended period of time), the news list is
written to a CSV file called “CSL_raw.csv” in the
“BlockChain/src/main/java/hust/soict/dsai/blockchain/engine/Data/CSL”
folder. A list of links that the scraper could not scrape is also printed to
the terminal.
RunBCN works almost identically to RunCSL. The only difference is that it also
asks which category you want to include when scraping. The output data file is
“BCN_raw.csv” in
“BlockChain/src/main/java/hust/soict/dsai/blockchain/engine/Data/BCN”.


4.2.2 Process data


Running the main method in ProcessData will give you the following options:
- Format CryptoSlate
- Format BlockChain News
- Merge all data files
- Update tags and categories
- Named entity recognition
- Update inverted indexes.
All of these options need to be run in that order.

The format options will ensure the data are in the same format and handle
duplicates in the raw data file. The result will be a new “formatted” file in the
same folder.
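As an illustration of what a format step of this kind might do, the sketch below trims whitespace in every cell and keeps only the first row for each article link. The report does not say how duplicates are handled, so identifying duplicates by a `link` column and keeping the first occurrence are assumptions, as is the function name.

```python
import csv
import io


def format_rows(raw_csv_text, key="link"):
    """Trim every cell and drop rows whose `key` value was already seen."""
    reader = csv.reader(io.StringIO(raw_csv_text))
    header = next(reader, None)
    if header is None:
        return []  # empty input
    header = [name.strip() for name in header]
    key_index = header.index(key)
    seen = set()
    rows = []
    for fields in reader:
        cleaned = [field.strip() for field in fields]
        if len(cleaned) != len(header):
            continue  # assumption: malformed rows are skipped
        if cleaned[key_index] in seen:
            continue  # duplicate article
        seen.add(cleaned[key_index])
        rows.append(dict(zip(header, cleaned)))
    return rows
```

The cleaned rows could then be written to the new “formatted” file with `csv.DictWriter`.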

The “Merge” option then combines the newly formatted data with the existing
data, resulting in the “Data_full.csv” file in
“BlockChain/src/main/java/hust/soict/dsai/blockchain/engine/Data/”.
The next 2 options will create “All_Categories.csv”, “All_Tags.csv” and
“Data_full_with_entites.csv” in that same folder.

Finally, the inverted indexes must be created before running the program, so
that the data are properly indexed for accurate searching. This step creates
“All_Tags.csv”, also in the folder mentioned above.

4.3 App Usage Instruction


4.3.1 Login Function
To understand the login function, we first need to know the three roles: User,
Manager and Admin. Users can access the base functions, while managers and
admins can do much more. Because of their privileges, admins and managers need
to log in, while base users do not.


When the app starts, there is a login button in the top-right corner; press it
to log in.

A pop-up window asks for your user name and password.

You can then log in as a manager or an admin.

4.3.2 Base-User function:


A. Search function

When you press the “Search” button in the sidebar on the left, the search tab
shows up on the screen.


In the middle of the page is the search bar, where you enter your search
query.

Below the search bar is a switch to choose between smart search and exact
search.

At the bottom of the page is an optional field with two bars named “Tags” and
“Categories”. Choose a value from the option box on the right, then hit
“Confirm”; the bar will show the tag/category you have chosen. You can leave
this field blank.

After filling in the fields you want, hit Enter, and the results matching your
search will be presented.


B. Trend function
When you press the “Trend” button in the sidebar on the left, the corresponding
tab shows up on the screen.

The field at the bottom of the page shows all the information that must be
filled in. “Start date” and “End date” set the range of time for which you want
to see trends. In the “Option” option box, choose which trend parameter to use
(tag/entities), and in the “Top” field, enter how many trends you want to see.
After filling in all the fields, press “Start find”, and the result will be
displayed.

And just like that, you have performed the trend search with ease.

C. Data function
The data tab shows all the data in a table.

To export the database to CSV, press the “Export to CSV” button.

D. Settings function:
When the SQL information option is turned on, a table shows up.

Fill in all the required fields to connect to a database. The functions that
interact with the new database (search/trend) are still under development.
E. Exit function
When you want to quit the app, simply press the exit button.

4.3.2 Manager Function


A manager has all the functions that a user can perform, plus restricted
functions that only a manager or a higher role can use.

A. Import Data Function:

When you press the “Import Data” button in the sidebar on the left, the
corresponding tab shows up on the screen.


After you choose the file directory, choose the file and press the “Open”
button at the bottom, your file will be added to the database. Be careful: the
added data might affect the data structure of the existing data (wrong format,
not indexed, ...). Because the data structure is fragile, only managers and
admins can change it.

B.Data Function


Unlike the user's data function, the manager's has more options. A manager can
delete one or many rows in the database by selecting the rows and then pressing
the delete button next to the “Export to CSV” button. Be careful: because the
data structure is fragile, deleting data might affect it (wrong format, not
indexed, ...).

4.3.3 Admin Function


An admin has all the functions that a user can perform, plus restricted
functions that only an admin can use: promoting/adding managers/admins or
removing them.

To do that, simply log in as an admin and choose “Add Manager” in the
left-hand bar. A tab will show up.


To add a user, press the “add user” button; a window pops up.

Fill in the name, the password and the role (manager/default/admin). Then press
“add user” and it will be done.

To delete users, select the rows containing the usernames that you want to
delete and press “delete user”.

Because role assignment might affect many other users, only the admin can
perform that action.


CONCLUSION
The News Aggregator was developed based on the knowledge taught in class and
on self-research, so it still has many limitations. Part of this is due to the
group members' lack of experience in building and developing applications.
After developing and testing the application, we would like to present the
following conclusions and development orientation:

1. Advantages
- The application meets and completes the required features of the problem.
- The amount of data collected is decent.
2. Disadvantages
- The program still has bugs.
- The user interface is not really eye-catching.

3. Development orientation:

With the above advantages and disadvantages, the group has a few new
development ideas for the application:

- Develop database storage using structured query languages.
- Improve the graphical user interface.
- Fix the existing bugs.
- Update the scraper when page format changes.
