Expectation maximisation on unsupervised web-mined data using the probabilistic latent semantic analysis (PLSA) algorithm


The recent explosion of the web as an information source has brought new challenges to the annotation and classification of web documents, as well as to information filtering and collaborative filtering. Not only does the huge number of documents introduce performance issues, but there is an even bigger challenge in that most of the data are unlabeled. The arrival of social media and blogging has increased the volume of web data further. The machine learning community has therefore taken an interest in using unsupervised learning to classify such big data. A newer approach, semi-supervised learning, relies on the assumption that unlabeled data can help reveal interesting patterns that industry can use. This study uses the Probabilistic Latent Semantic Analysis (PLSA) algorithm, an unsupervised machine learning algorithm, to retrieve information based on latent classes of documents and terms. PLSA is the topic modeling tool used by the study to deduce hidden topics across documents: it looks at the terms and documents and uses the Expectation Maximization (EM) algorithm to infer which words belong to which topic. The model was used to discover whether given documents cover one or more topics. The study concluded that the PLSA algorithm is more efficient on preprocessed data than on raw web data, and that processing time was reduced when preprocessing was used to eliminate redundant latent variables. Increasing k, the number of topics, improved topic quality.
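The EM procedure behind PLSA can be sketched roughly as follows. This is a minimal illustration, not the study's implementation: it assumes a small dense document-term count matrix and NumPy, and the function name `plsa`, its parameters, and the toy data are hypothetical. The E-step computes the posterior P(z|d,w) for each topic z, and the M-step re-estimates P(w|z) and P(z|d) from the expected counts.

```python
import numpy as np


def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit PLSA by EM on a document-term count matrix of shape (docs, terms).

    Hypothetical helper for illustration only.
    Returns P(z|d) with shape (docs, topics) and P(w|z) with shape (topics, terms).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_terms = counts.shape
    eps = 1e-12  # guards against division by zero

    # Random initialisation; each row is normalised into a distribution.
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_terms))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: P(z|d,w) is proportional to P(z|d) * P(w|z).
        # joint has shape (docs, topics, terms); normalise over topics.
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]
        resp = joint / (joint.sum(axis=1, keepdims=True) + eps)

        # M-step: expected counts n(d,w) * P(z|d,w).
        expected = counts[:, None, :] * resp
        p_w_z = expected.sum(axis=0)  # (topics, terms)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + eps
        p_z_d = expected.sum(axis=2)  # (docs, topics)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + eps

    return p_z_d, p_w_z


# Toy corpus (made up): 4 documents over a 4-term vocabulary.
counts = np.array([
    [4, 3, 0, 0],
    [5, 2, 0, 0],
    [0, 0, 3, 4],
    [0, 0, 2, 5],
])
p_z_d, p_w_z = plsa(counts, n_topics=2)
print(np.round(p_z_d, 3))  # per-document topic mixture
print(np.round(p_w_z, 3))  # per-topic term distribution
```

The dense three-dimensional array keeps the sketch short but uses memory proportional to docs × topics × terms, so a real web-scale corpus would likely need a sparse formulation. The number of topics k corresponds to `n_topics` here.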



Software And Hardware

• Hardware:
  Processor: i3, i5
  RAM: 4 GB
  Hard disk: 16 GB
• Software:
  Operating system: Windows 2000/XP/7/8/10
  Tools: Anaconda, Jupyter, Spyder, Flask
  Frontend: Python
  Backend: MySQL