Matthew Joseph Kusner

I am pursuing my doctoral degree in Machine Learning under the direction of Kilian Weinberger at Washington University in St. Louis. My work spans document distances, differential privacy, budgeted learning, submodular optimization, dataset compression, and Bayesian optimization.

Site last updated: 05/23/15

Publications
[PDF] Matt J. Kusner, Yu Sun, Nicholas I. Kolkin, Kilian Q. Weinberger
From Word Embeddings To Document Distances
International Conference on Machine Learning (ICML), 2015

[PDF] Matt J. Kusner, Jacob R. Gardner, Roman Garnett, Kilian Q. Weinberger
Differentially Private Bayesian Optimization
International Conference on Machine Learning (ICML), 2015

[PDF] Zhixiang (Eddie) Xu, Matt J. Kusner, Kilian Q. Weinberger, Minmin Chen, Olivier Chapelle
Classifier Cascades and Trees for Minimizing Feature Evaluation Cost
Journal of Machine Learning Research (JMLR), 2014

[PDF] Matt J. Kusner, Wenlin Chen, Quan Zhou, Zhixiang (Eddie) Xu, Kilian Q. Weinberger, Yixin Chen
Feature-Cost Sensitive Learning with Submodular Trees of Classifiers
AAAI Conference on Artificial Intelligence (AAAI), 2014

[PDF] [Poster] Matt J. Kusner, Stephen Tyree, Kilian Q. Weinberger, Kunal Agrawal
Stochastic Neighbor Compression
International Conference on Machine Learning (ICML), 2014

[PDF] Jacob R. Gardner, Matt J. Kusner, Zhixiang (Eddie) Xu, Kilian Q. Weinberger, John P. Cunningham
Bayesian Optimization with Inequality Constraints
International Conference on Machine Learning (ICML), 2014

[PDF] Zhixiang (Eddie) Xu, Matt J. Kusner, Gao Huang, Kilian Q. Weinberger
Anytime Feature Learning
International Conference on Machine Learning (ICML), 2013

[PDF] Zhixiang (Eddie) Xu, Matt J. Kusner, Kilian Q. Weinberger, Minmin Chen
Cost-Sensitive Tree of Classifiers
International Conference on Machine Learning (ICML), 2013

WMD Code
Here is version 1.0 of Python and Matlab code for the Word Mover's Distance from the paper "From Word Embeddings to Document Distances": [code]
Here are the prerequisites:
- Python 2.7
- packages:
  - gensim
  - numpy
  - scipy
If you download Anaconda Python 2.7 it includes all of these: http://continuum.io/downloads

You'll also need to download:
- word2vec embedding trained on the Google News corpus: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing (described briefly here in 'Pre-trained word and phrase vectors': https://code.google.com/p/word2vec/)
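
For reference, here's a minimal sketch of loading these embeddings in Python with gensim (get_word_vectors.py handles this for you; the exact call depends on your gensim version, and on newer versions it lives at gensim.models.KeyedVectors.load_word2vec_format):

from gensim.models import Word2Vec

# Load the pre-trained Google News embeddings (binary word2vec format).
# This takes a few minutes and several GB of RAM.
model = Word2Vec.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin.gz', binary=True)

vec = model['king']  # a 300-dimensional numpy array for the word 'king'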

You'll need to build:
- python-emd-master/: just go into the directory and type: make
- If you want to use Matlab, you'll also have to build emd/. Just open Matlab, go to the directory, and type build_emd

The main files are:
- get_word_vectors.py: This extracts the word vectors and bag-of-words (BOW) vectors. Run this script first. You call it like this:

python get_word_vectors.py input_file.txt vectors.pk vectors.mat

The last argument saves a .mat file (I think this is technically required right now, but I will make it optional soon). The first argument is the text file you want to process; the script assumes it is in the following format:

doc1_label_ID \t word1 word2 word3 word4
doc2_label_ID \t word1 word2 word3 word4
...

Specifically, each document is on one line. The first item on the line (doc1_label_ID) is the label of the document. For example, if you have a set of tweets labeled by sentiment (e.g., positive, negative, neutral), this is where the label goes; see the file all_twitter_by_line.txt for an example. The label is followed by a tab character (\t), and then the words of the document, separated by spaces (multiple spaces are fine). The words can contain punctuation; the Python script strips it. A sketch of this parsing is shown below.
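
For concreteness, here's a minimal sketch of parsing one such line (the punctuation-stripping rule below is an assumption for illustration; the actual cleaning in get_word_vectors.py may differ):

import re

def parse_line(line):
    # Split off the label at the first tab; the rest is the document text.
    label, text = line.rstrip('\n').split('\t', 1)
    # Split on whitespace and strip punctuation from each word
    # (assumed cleaning rule, for illustration only).
    words = [re.sub(r'[^a-zA-Z0-9]', '', w).lower() for w in text.split()]
    return label, [w for w in words if w]

label, words = parse_line('positive\tGreat day, great coffee!\n')
# label == 'positive', words == ['great', 'day', 'great', 'coffee']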

The second argument is the name of the pickle file that saves the word vectors, and the third is a .mat file with the same results (used by the Matlab code later, if you like). After this script finishes, you'll run:

- wmd.py: This computes the distance matrix between all pairs of documents in the saved file above. You call it like this:

python wmd.py vectors.pk dist_matrix.pk

where vectors.pk was generated by the first script.
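
To make the computation concrete, here's a minimal sketch of the Word Mover's Distance between two documents, solved with scipy.optimize.linprog instead of the bundled python-emd solver (the function and variable names are illustrative, not from wmd.py):

import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist

def word_movers_distance(X1, w1, X2, w2):
    """WMD between two documents.
    X1: (n1, d) embeddings of the unique words in document 1
    w1: (n1,) normalized bag-of-words weights (sums to 1)
    X2, w2: the same for document 2
    """
    n1, n2 = len(w1), len(w2)
    C = cdist(X1, X2)  # ground cost: Euclidean distance between word vectors
    # Flatten the transport plan T into n1*n2 LP variables.
    A_eq, b_eq = [], []
    for i in range(n1):  # row sums: mass leaving word i must equal w1[i]
        row = np.zeros((n1, n2))
        row[i, :] = 1.0
        A_eq.append(row.ravel())
        b_eq.append(w1[i])
    for j in range(n2):  # column sums: mass entering word j must equal w2[j]
        col = np.zeros((n1, n2))
        col[:, j] = 1.0
        A_eq.append(col.ravel())
        b_eq.append(w2[j])
    res = linprog(C.ravel(), A_eq=np.asarray(A_eq), b_eq=np.asarray(b_eq),
                  bounds=(0, None))
    return res.fun

# Toy check with 2-d "embeddings": expected ~1.207 (= 0.5*1 + 0.5*sqrt(2))
X1, w1 = np.array([[0., 0.], [1., 0.]]), np.array([0.5, 0.5])
X2, w2 = np.array([[0., 1.]]), np.array([1.0])
print(word_movers_distance(X1, w1, X2, w2))

wmd.py itself uses the compiled EMD solver built above, which is much faster; the LP here just shows what is being solved.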

- wmd_mat.m: If you'd like to use Matlab instead of wmd.py, you can use wmd_mat.m: change the variable load_file to vectors.mat and save_file to whatever name you like.

Here's an example run using 'all_twitter_by_line.txt':
python get_word_vectors.py all_twitter_by_line.txt twitter_vec.pk twitter_vec.mat
python wmd.py twitter_vec.pk twitter_wmd_d.pk
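
Once these finish, you can inspect the output in Python. A sketch, assuming wmd.py pickles the document-by-document distance matrix directly (check wmd.py for the exact saved format):

import pickle

with open('twitter_wmd_d.pk', 'rb') as f:
    D = pickle.load(f)  # assumed: an (n_docs, n_docs) matrix of WMD distances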

Matlab:
>> wmd_mat (changing load_file to 'twitter_vec.mat' and save_file to whatever you like)


Let me know if you have any questions at mkusner AT wustl DOT edu. Please cite using the following BibTeX entry (instead of Google Scholar):

@inproceedings{kusner2015doc,
  title={From Word Embeddings To Document Distances},
  author={Kusner, M. J. and Sun, Y. and Kolkin, N. I. and Weinberger, K. Q.},
  booktitle={ICML},
  year={2015},
}


SNC Code
Here is code for the paper "Stochastic Neighbor Compression": [code]. See the README for details.

Contact Information
Office: 422A Jolley Hall
Address: Washington University. 1 Brookings Drive. St. Louis, MO 63130
Feel free to email me at: matt dot kusner at gmail dot com