

Sunderland Repository records the research produced by the University of Sunderland including practice-based research and theses.

Learning Noise Web Data Prior to Elimination: Classification of Dynamic Web User Interests

Onyancha, Julius (2019) Learning Noise Web Data Prior to Elimination: Classification of Dynamic Web User Interests. Doctoral thesis, University of Sunderland.

Item Type: Thesis (Doctoral)


The amount of noise in web data is rapidly increasing, as is the number of users searching for information to suit their interests. The increase in web data has led to some critical issues, such as a high level of noise and irrelevant data. Given that the web is noisy, inconsistent and irrelevant by nature, finding useful information that defines a user's interests has become a challenge. Existing research acknowledges the need for machine learning tools capable of addressing the problems posed by noise web data. Identifying and eliminating noise web data is critical to the web usage mining process. As the web evolves and more web data sources emerge, the level of noise also increases.
Despite efforts by existing research to address noise in web data, a number of critical issues remain unresolved. For example, existing research treats noise web data as irrelevant data that does not form part of the main content of a web page. Current machine learning tools therefore focus on protecting the main content of a web page by eliminating noise or irrelevant data, such as advertisements, banners and external links. However, the main content of a web page can itself be noise when user interests are considered. The position taken by the proposed research is that noise web data can be useful when the interests of a user are considered prior to elimination.
To justify this position, a Noise Web Data Learning (NWDL) approach, which aims to learn noise web data prior to elimination, is proposed. To the best of our knowledge, learning noise web data prior to elimination has not been addressed by current and relevant research. The objective is to ensure that the interestingness of data on the web is defined by user interests over time. The proposed NWDL considers two key aspects: 1) the significance of the exit page in defining a user's level of interest in the web pages visited; and 2) the effect of dynamic changes in user interests on the classification of web pages.
Experiments conducted in this research show that the noise web data reduction process is user-centric, i.e., dynamic changes in user interests influence the interestingness of web data. As a result, what is currently identified and eliminated as noise can be useful when user interests and their changes over time are considered. The findings from validation and evaluation of the proposed NWDL against existing tools show that user interest over time significantly affects the importance of data on the web. Because current research mainly identifies and eliminates noise web data without defining user interests, the process it follows is not user-centric. A key contribution of the proposed research is to identify and learn noise web data, taking into account user interests as they change over time, prior to elimination. Ultimately, the proposed NWDL contributes towards minimising the loss of useful information otherwise considered noise by existing tools, as well as reducing the level of noise data suggested to a user.

Available under License Creative Commons Attribution Non-commercial No Derivatives.


More Information

Depositing User: Leah Maughan


Item ID: 13533


Date Deposited: 24 May 2021 15:12
Last Modified: 24 May 2021 15:15


Author: Julius Onyancha

University Divisions

Collections > Theses
