This lecture focused on methods of combining labeled and unlabeled data to learn a classifier. As a motivating example, suppose we would like to classify web pages as either fraudulent or not fraudulent. In this case, obtaining unlabeled data (i.e., web pages) is easy. However, labeling such data can be very costly, since it requires humans to manually inspect each web page and determine whether or not it is a scam. We might hope that, by making use of the unlabeled data in a clever way, we could learn a classifier without requiring as much labeled data as we would normally need. In this lecture, we consider two learning models: semi-supervised learning and active learning.
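To make the semi-supervised idea concrete, the sketch below implements self-training, one common way of combining a small labeled set with a large unlabeled set. It is a minimal illustration under assumed conditions (scikit-learn available, synthetic data standing in for real web pages), not the specific algorithms covered in the lecture.

```python
# Minimal self-training sketch: pseudo-label confident unlabeled points
# and retrain. Synthetic data stands in for labeled/unlabeled web pages.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Pretend only a small fraction of the labels is available (hand-labeled pages).
labeled = rng.rand(len(y)) < 0.05
X_lab, y_lab = X[labeled], y[labeled]
X_unlab = X[~labeled]

clf = LogisticRegression(max_iter=1000)
for _ in range(5):  # a few self-training rounds
    clf.fit(X_lab, y_lab)
    if len(X_unlab) == 0:
        break
    proba = clf.predict_proba(X_unlab)
    confident = proba.max(axis=1) > 0.95  # pseudo-label only confident predictions
    if not confident.any():
        break
    X_lab = np.vstack([X_lab, X_unlab[confident]])
    y_lab = np.concatenate([y_lab, proba[confident].argmax(axis=1)])
    X_unlab = X_unlab[~confident]

print("trained on", len(y_lab), "labeled + pseudo-labeled examples")
```

An active-learning variant would instead ask a human to label the unlabeled points the classifier is least confident about, rather than pseudo-labeling the points it is most confident about.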
Author(s): ROSARIO, RYAN ROBERT | Advisor(s): Wu, Yingnian

Abstract: Text classification typically performs best with large training sets, but short texts are very common on the World Wide Web. Can we use resampling and data augmentation to construct larger texts using similar terms? Several current methods exist for working with short text that rely on external data and contexts, or on workarounds. Our focus is to test a new preprocessing approach that uses resampling, inspired by the bootstrap, combined with data augmentation, by treating each short text as a population and sampling similar words from a semantic space to create a longer text. We use blog post titles collected from the Technorati blog aggregator as experimental data, with each title appearing in one of ten categories. We first test how well the raw short texts are classified using a variant of SVM designed specifically for short texts, as well as a supervised topic model and an SVM model that uses semantic vectors as features. We then build a semantic space and augment each short text with related terms under a variety of experimental conditions. We test the classifiers on the augmented data and compare performance to the aforementioned baselines. In most cases, performance on the augmented test sets exceeded that of the baseline classifiers.
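The core preprocessing step can be sketched roughly as follows: treat each short title as a population, resample its words bootstrap-style, and pad the text with near neighbors drawn from a semantic space. The toy word vectors, target length, and sampling details below are placeholders for illustration only; the dissertation builds its semantic space from real data, and its exact procedure may differ.

```python
# Rough sketch of bootstrap-style augmentation of a short text with
# semantically similar words. The tiny embeddings are placeholders for
# a learned semantic space.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical word vectors standing in for a real semantic space.
embeddings = {
    "market": np.array([0.9, 0.1, 0.0]),
    "stocks": np.array([0.8, 0.2, 0.1]),
    "economy": np.array([0.7, 0.3, 0.0]),
    "rally": np.array([0.6, 0.1, 0.3]),
    "recipe": np.array([0.0, 0.9, 0.2]),
}

def nearest_terms(word, k=2):
    """Return the k words closest to `word` by cosine similarity."""
    v = embeddings[word]
    scores = {
        w: float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
        for w, u in embeddings.items() if w != word
    }
    return [w for w, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]

def augment(title_words, target_len=12):
    """Treat the title as a population: resample its words with replacement
    and append semantically similar terms until the text reaches target_len."""
    longer = list(title_words)
    while len(longer) < target_len:
        w = rng.choice(title_words)                   # bootstrap-style resample
        longer.append(rng.choice(nearest_terms(w)))   # add a similar term
    return longer

print(augment(["market", "rally"]))
```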
A decision tree is a predictive model that can be used for either classification or regression [3]. Decision tree construction algorithms take a set of observations X and a vector of labels Y and build a tree that models the data through splits on the values of the input variables. Each interior node of such a tree represents a single input variable, and each leaf represents the predicted class label for an observation whose values follow the path from the root to that leaf. One of the major problems with decision trees is the tendency of a raw decision tree to overfit the training data. In these cases, the decision tree contains many paths from the root to the leaves that have no statistical validity and provide little predictive power to the model [2]. That is, the decision tree model does not generalize well to new, unseen values of X. If T is a decision tree and S is the sample of observations used to build T, then pruning removes subtrees of T that contribute little predictive power, yielding a smaller tree that generalizes better to observations outside of S.
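As a concrete illustration of pruning, the sketch below applies cost-complexity pruning via scikit-learn, selecting the pruning strength on a held-out split. This is a generic example of the technique, not necessarily the pruning criterion used in the paper.

```python
# Generic illustration of cost-complexity pruning with scikit-learn;
# the paper's specific pruning procedure may differ.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# An unpruned tree fits the training sample S almost perfectly but tends
# to generalize worse to observations outside of S.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Candidate pruning strengths (ccp_alpha) derived from the fitted tree.
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Choose the alpha whose pruned tree does best on held-out data.
best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    score = pruned.fit(X_train, y_train).score(X_val, y_val)
    if score >= best_score:
        best_alpha, best_score = alpha, score

print(f"unpruned validation accuracy: {full_tree.score(X_val, y_val):.3f}")
print(f"pruned (ccp_alpha={best_alpha:.4f}) validation accuracy: {best_score:.3f}")
```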