Commit 5a1fc696 by wnma3mz

complete 10th

Parent: d357f42f
# A Gentle Introduction to the Bag-of-Words Model
Original article: [A Gentle Introduction to the Bag-of-Words Model](https://machinelearningmastery.com/gentle-introduction-bag-words-model/)
The bag-of-words model is a way of representing text data when modeling text with machine learning algorithms.
The bag-of-words model is simple to understand and implement and has seen great success in problems such as language modeling and document classification.
In this tutorial, you will discover the bag-of-words model for feature extraction in natural language processing.
After completing this tutorial, you will know:
- What the bag-of-words model is and why it is needed to represent text.
- How to develop a bag-of-words model for a collection of documents.
- How to use different techniques to prepare a vocabulary and score words.
Let’s get started.
A Gentle Introduction to the Bag-of-Words Model
Photo by [Do8y](https://www.flickr.com/photos/beorn_ours/5675267679/), some rights reserved.
## Tutorial Overview
This tutorial is divided into 6 parts; they are:
1. The Problem with Text
2. What is a Bag-of-Words?
3. Example of the Bag-of-Words Model
4. Managing Vocabulary
5. Scoring Words
6. Limitations of Bag-of-Words
## The Problem with Text
A problem with modeling text is that it is messy, and techniques like machine learning algorithms prefer well-defined, fixed-length inputs and outputs.
Machine learning algorithms cannot work with raw text directly; the text must be converted into numbers. Specifically, vectors of numbers.
> In language processing, the vectors x are derived from textual data, in order to reflect various linguistic properties of the text.
— Page 65, [Neural Network Methods in Natural Language Processing](http://amzn.to/2wycQKA), 2017.
This is called feature extraction or feature encoding.
A popular and simple method of feature extraction with text data is called the bag-of-words model of text.
## What is a Bag-of-Words?
A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.
The approach is very simple and flexible, and can be used in a myriad of ways for extracting features from documents.
A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:
1. A vocabulary of known words.
2. A measure of the presence of known words.
It is called a “*bag*” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
> A very common feature extraction procedures for sentences and documents is the bag-of-words approach (BOW). In this approach, we look at the histogram of the words within the text, i.e. considering each word count as a feature.
— Page 69, [Neural Network Methods in Natural Language Processing](http://amzn.to/2wycQKA), 2017.
The intuition is that documents are similar if they have similar content. Further, that from the content alone we can learn something about the meaning of the document.
The bag-of-words can be as simple or complex as you like. The complexity comes both in deciding how to design the vocabulary of known words (or tokens) and how to score the presence of known words.
We will take a closer look at both of these concerns.
## Example of the Bag-of-Words Model
Let’s make the bag-of-words model concrete with a worked example.
### Step 1: Collect Data
Below is a snippet of the first few lines of text from the book “[A Tale of Two Cities](https://www.gutenberg.org/ebooks/98)” by Charles Dickens, taken from Project Gutenberg.
> It was the best of times,
> it was the worst of times,
> it was the age of wisdom,
> it was the age of foolishness,
For this small example, let’s treat each line as a separate “document” and the 4 lines as our entire corpus of documents.
### Step 2: Design the Vocabulary
Now we can make a list of all of the words in our model vocabulary.
The unique words here (ignoring case and punctuation) are:
- “it”
- “was”
- “the”
- “best”
- “of”
- “times”
- “worst”
- “age”
- “wisdom”
- “foolishness”
That is a vocabulary of 10 words from a corpus containing 24 words.
### Step 3: Create Document Vectors
The next step is to score the words in each document.
The objective is to turn each document of free text into a vector that we can use as input or output for a machine learning model.
Because we know the vocabulary has 10 words, we can use a fixed-length document representation of 10, with one position in the vector to score each word.
The simplest scoring method is to mark the presence of words as a boolean value, 0 for absent, 1 for present.
Using the arbitrary ordering of words listed above in our vocabulary, we can step through the first document (“*It was the best of times*“) and convert it into a binary vector.
The scoring of the document would look as follows:
- “it” = 1
- “was” = 1
- “the” = 1
- “best” = 1
- “of” = 1
- “times” = 1
- “worst” = 0
- “age” = 0
- “wisdom” = 0
- “foolishness” = 0
As a binary vector, this would look as follows:
```
[1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
```
The other three documents would look as follows:
```
"it was the worst of times" = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
"it was the age of wisdom" = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]
"it was the age of foolishness" = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]
```
All ordering of the words is nominally discarded and we have a consistent way of extracting features from any document in our corpus, ready for use in modeling.
New documents that overlap with the vocabulary of known words, but may contain words outside of the vocabulary, can still be encoded, where only the occurrence of known words are scored and unknown words are ignored.
You can see how this might naturally scale to large vocabularies and larger documents.
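The worked example above can be sketched in a few lines of plain Python (a minimal illustration, not the article's code; tokenization here simply lowercases words and strips punctuation):

```python
# Minimal bag-of-words sketch for the four-line corpus above.
corpus = [
    "It was the best of times,",
    "it was the worst of times,",
    "it was the age of wisdom,",
    "it was the age of foolishness,",
]

def tokenize(text):
    # Ignore case and punctuation, as in the worked example.
    return [w.strip(",.").lower() for w in text.split()]

# Build the vocabulary in the order words are first seen.
vocab = []
for doc in corpus:
    for word in tokenize(doc):
        if word not in vocab:
            vocab.append(word)

def binary_vector(text):
    # 1 if the known word is present in the document, 0 otherwise.
    words = set(tokenize(text))
    return [1 if w in words else 0 for w in vocab]

print(vocab)
print(binary_vector(corpus[0]))  # [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
```

Unknown words in a new document are simply ignored, since only vocabulary positions are scored.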
## Managing Vocabulary
As the vocabulary size increases, so does the vector representation of documents.
In the previous example, the length of the document vector is equal to the number of known words.
You can imagine that for a very large corpus, such as thousands of books, that the length of the vector might be thousands or millions of positions. Further, each document may contain very few of the known words in the vocabulary.
This results in a vector with lots of zero scores, called a sparse vector or sparse representation.
Sparse vectors require more memory and computational resources when modeling and the vast number of positions or dimensions can make the modeling process very challenging for traditional algorithms.
As such, there is pressure to decrease the size of the vocabulary when using a bag-of-words model.
There are simple text cleaning techniques that can be used as a first step, such as:
- Ignoring case
- Ignoring punctuation
- Ignoring frequent words that don’t contain much information, called stop words, like “a,” “of,” etc.
- Fixing misspelled words.
- Reducing words to their stem (e.g. “play” from “playing”) using stemming algorithms.
A more sophisticated approach is to create a vocabulary of grouped words. This both changes the scope of the vocabulary and allows the bag-of-words to capture a little bit more meaning from the document.
In this approach, each word or token is called a “gram”. Creating a vocabulary of two-word pairs is, in turn, called a bigram model. Again, only the bigrams that appear in the corpus are modeled, not all possible bigrams.
> An N-gram is an N-token sequence of words: a 2-gram (more commonly called a bigram) is a two-word sequence of words like “please turn”, “turn your”, or “your homework”, and a 3-gram (more commonly called a trigram) is a three-word sequence of words like “please turn your”, or “turn your homework”.
— Page 85, [Speech and Language Processing](http://amzn.to/2vaEb7T), 2009.
For example, the bigrams in the first line of text in the previous section: “It was the best of times” are as follows:
- “it was”
- “was the”
- “the best”
- “best of”
- “of times”
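Extracting these word pairs (or any n-grams) is straightforward; here is a small sketch:

```python
# Sketch: slide a window of n tokens across the sentence.
def ngrams(tokens, n=2):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "it was the best of times".split()
print(ngrams(tokens))        # the five bigrams listed above
print(ngrams(tokens, n=3))   # the corresponding trigrams
```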
A vocabulary that tracks triplets of words is called a trigram model, and the general approach is called the n-gram model, where n refers to the number of grouped words.
Often a simple bigram approach is better than a 1-gram bag-of-words model for tasks like document classification.
> a bag-of-bigrams representation is much more powerful than bag-of-words, and in many cases proves very hard to beat.
— Page 75, [Neural Network Methods in Natural Language Processing](http://amzn.to/2wycQKA), 2017.
## Scoring Words
Once a vocabulary has been chosen, the occurrence of words in example documents needs to be scored.
In the worked example, we have already seen one very simple approach to scoring: a binary scoring of the presence or absence of words.
Some additional simple scoring methods include:
- **Counts**. Count the number of times each word appears in a document.
- **Frequencies**. Calculate the frequency that each word appears in a document out of all the words in the document.
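Both of these can be computed directly with Python's standard library; a small sketch:

```python
from collections import Counter

doc = "it was the best of times it was".split()

counts = Counter(doc)  # raw counts per word
total = len(doc)
freqs = {w: c / total for w, c in counts.items()}  # counts normalized by document length

print(counts["it"])  # 2
print(freqs["it"])   # 0.25
```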
### Word Hashing
You may remember from computer science that a [hash function](https://en.wikipedia.org/wiki/Hash_function) is a bit of math that maps data to a fixed size set of numbers.
For example, we use them in hash tables when programming where perhaps names are converted to numbers for fast lookup.
We can use a hash representation of known words in our vocabulary. This addresses the problem of having a very large vocabulary for a large text corpus because we can choose the size of the hash space, which is in turn the size of the vector representation of the document.
Words are hashed deterministically to the same integer index in the target hash space. A binary score or count can then be used to score the word.
This is called the “*hash trick*” or “*feature hashing*“.
The challenge is to choose a hash space to accommodate the chosen vocabulary size to minimize the probability of collisions and trade-off sparsity.
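A minimal sketch of the idea follows (it uses Python's built-in `hash` for illustration; note that `hash` on strings is randomized between runs, so a real system would use a stable hash function such as MurmurHash):

```python
# Sketch of the hashing trick: every word maps to one of n_features
# positions, so the vector size is fixed regardless of vocabulary size.
def hashed_vector(tokens, n_features=8):
    vec = [0] * n_features
    for token in tokens:
        vec[hash(token) % n_features] += 1  # collisions are possible by design
    return vec

vec = hashed_vector("it was the best of times".split())
print(len(vec), sum(vec))  # 8 positions, 6 word occurrences
```

A larger `n_features` lowers the collision probability at the cost of a sparser vector.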
### TF-IDF
A problem with scoring word frequency is that highly frequent words start to dominate in the document (e.g. larger score), but may not contain as much “informational content” to the model as rarer but perhaps domain specific words.
One approach is to rescale the frequency of words by how often they appear in all documents, so that the scores for frequent words like “the” that are also frequent across all documents are penalized.
This approach to scoring is called Term Frequency – Inverse Document Frequency, or TF-IDF for short, where:
- **Term Frequency**: is a scoring of the frequency of the word in the current document.
- **Inverse Document Frequency**: is a scoring of how rare the word is across documents.
The scores are a weighting where not all words are equally important or interesting.
The scores have the effect of highlighting words that are distinct (contain useful information) in a given document.
> Thus the idf of a rare term is high, whereas the idf of a frequent term is likely to be low.
— Page 118, [An Introduction to Information Retrieval](http://amzn.to/2hAR7PH), 2008.
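One common formulation of the weighting (several variants exist) can be sketched as follows, reusing the four-line corpus from the worked example:

```python
import math

docs = [
    "it was the best of times".split(),
    "it was the worst of times".split(),
    "it was the age of wisdom".split(),
    "it was the age of foolishness".split(),
]

def tf_idf(word, doc, docs):
    # Term frequency: how often the word appears in the current document.
    tf = doc.count(word) / len(doc)
    # Inverse document frequency: how rare the word is across all documents.
    df = sum(1 for d in docs if word in d)
    idf = math.log(len(docs) / df)
    return tf * idf

print(tf_idf("the", docs[0], docs))   # 0.0: "the" appears in every document
print(tf_idf("best", docs[0], docs))  # > 0: "best" is distinctive to this document
```

Words that occur in every document get an idf of zero and are effectively filtered out, while rarer, document-specific words are highlighted.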
## Limitations of Bag-of-Words
The bag-of-words model is very simple to understand and implement and offers a lot of flexibility for customization on your specific text data.
It has been used with great success on prediction problems like language modeling and document classification.
Nevertheless, it suffers from some shortcomings, such as:
- **Vocabulary**: The vocabulary requires careful design, most specifically in order to manage the size, which impacts the sparsity of the document representations.
- **Sparsity**: Sparse representations are harder to model both for computational reasons (space and time complexity) and also for information reasons, where the challenge is for the models to harness so little information in such a large representational space.
- **Meaning**: Discarding word order ignores the context, and in turn meaning of words in the document (semantics). Context and meaning can offer a lot to the model, that if modeled could tell the difference between the same words differently arranged (“this is interesting” vs “is this interesting”), synonyms (“old bike” vs “used bike”), and much more.
## Further Reading
This section provides more resources on the topic if you are looking to go deeper.
### Articles
- [Bag-of-words model on Wikipedia](https://en.wikipedia.org/wiki/Bag-of-words_model)
- [N-gram on Wikipedia](https://en.wikipedia.org/wiki/N-gram)
- [Feature hashing on Wikipedia](https://en.wikipedia.org/wiki/Feature_hashing)
- [tf–idf on Wikipedia](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
### Books
- Chapter 6, [Neural Network Methods in Natural Language Processing](http://amzn.to/2wycQKA), 2017.
- Chapter 4, [Speech and Language Processing](http://amzn.to/2vaEb7T), 2009.
- Chapter 6, [An Introduction to Information Retrieval](http://amzn.to/2vvnPHP), 2008.
- Chapter 6, [Foundations of Statistical Natural Language Processing](http://amzn.to/2vvnPHP), 1999.
## Summary
In this tutorial, you discovered the bag-of-words model for feature extraction with text data.
Specifically, you learned:
- What the bag-of-words model is and why we need it.
- How to work through the application of a bag-of-words model to a collection of documents.
- What techniques can be used for preparing a vocabulary and scoring words.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
# A Research to Engineering Workflow
Original article: [A Research to Engineering Workflow](http://dustintran.com/blog/a-research-to-engineering-workflow?from=hackcv&hmsr=hackcv.com&utm_medium=hackcv.com&utm_source=hackcv.com)
Going from a research idea to experiments is fundamental. But this step is typically glossed over with little explicit advice. In academia, the graduate student is often left toiling away—fragmented code, various notes and LaTeX write-ups scattered around. New projects often result in entirely new code bases, and those that do rely on past code are difficult to extend properly to the new project.
Motivated by this, I thought it’d be useful to outline the steps I personally take in going from research idea to experimentation, and how that then improves my research understanding so I can revise the idea. This process is crucial: given an initial idea, all my time is spent on this process; and for me at least, the experiments are key to learning about and solving problems that I couldn’t predict otherwise.[1](http://dustintran.com/blog/a-research-to-engineering-workflow?from=hackcv&hmsr=hackcv.com&utm_medium=hackcv.com&utm_source=hackcv.com#references)
## Finding the Right Problem
Before working on a project, it’s necessary to decide how ideas might jumpstart into something more official. Sometimes it’s as simple as having a mentor suggest a project to work on; or tackling a specific data set or applied problem; or having a conversation with a frequent collaborator and then striking up a useful problem to work on together. More often, I find that research is a result of a long chain of ideas which were continually iterated upon—through frequent conversations, recent work, longer term readings of subjects I’m unfamiliar with (e.g., [Pearl (2000)](http://dustintran.com/blog/a-research-to-engineering-workflow?from=hackcv&hmsr=hackcv.com&utm_medium=hackcv.com&utm_source=hackcv.com#pearl2000causality)), and favorite papers I like to revisit (e.g.,[Wainwright & Jordan (2008)](http://dustintran.com/blog/a-research-to-engineering-workflow?from=hackcv&hmsr=hackcv.com&utm_medium=hackcv.com&utm_source=hackcv.com#wainwright2008graphical), [Neal (1994)](http://dustintran.com/blog/a-research-to-engineering-workflow?from=hackcv&hmsr=hackcv.com&utm_medium=hackcv.com&utm_source=hackcv.com#neal1994bayesian)).
![img](http://dustintran.com/blog/assets/2017-06-03-fig0.png)
*A master document of all my unexplored research ideas.*
One technique I’ve found immensely helpful is to maintain a single master document.[2](http://dustintran.com/blog/a-research-to-engineering-workflow?from=hackcv&hmsr=hackcv.com&utm_medium=hackcv.com&utm_source=hackcv.com#references) It does a few things.
First, it has a bulleted list of all ideas, problems, and topics that I’d like to think more carefully about (Section 1.3 in the figure). Sometimes they’re as high-level as “Bayesian/generative approaches to reinforcement learning” or “addressing fairness in machine learning”; or they’re as specific as “Inference networks to handle memory complexity in EP” or “analysis of size-biased vs symmetric Dirichlet priors.” I try to keep the list succinct: subsequent sections go in depth on a particular entry (Section 2+ in the figure).
Second, the list of ideas is sorted according to what I’d like to work on next. This guides me to understand the general direction of my research beyond present work. I can continually revise my priorities according to whether I think the direction aligns with my broader research vision, and if I think the direction is necessarily impactful for the community at large. Importantly, the list isn’t just about the next publishable idea to work on, but generally what things I’d like to learn about next. This contributes long-term in finding important problems and arriving at simple or novel solutions.
Every so often, I revisit the list, resorting things, adding things, deleting things. Eventually I might elaborate upon an idea enough that it becomes a formal paper. In general, I’ve found that this process of iterating upon ideas within one location (and one format) makes the transition to formal paper-writing and experiments to be a fluid experience.
## Managing Papers
![img](http://dustintran.com/blog/assets/2017-06-03-fig5.png)
Good research requires reading *a lot* of papers. Without a good way of organizing your readings, you can easily get overwhelmed by the field’s hurried pace. (These past weeks have been especially notorious in trying to catch up on the slew of NIPS submissions posted to arXiv.)
I’ve experimented with a lot of approaches to this, and ultimately I’ve arrived at the [Papers app](http://papersapp.com/), which I highly recommend.[3](http://dustintran.com/blog/a-research-to-engineering-workflow?from=hackcv&hmsr=hackcv.com&utm_medium=hackcv.com&utm_source=hackcv.com#references)
The most fundamental utility in a good management system is a centralized repository which can be referenced back to. The advantage of having one location for this cannot be overstated, whether it be 8 page conference papers, journal papers, surveys, or even textbooks. Moreover, Papers is a nice tool for actually reading PDFs, and it conveniently syncs across devices as I read and star things on my tablet or laptop. As I cite papers when I write, I can go back to Papers and get the corresponding BibTeX file and citekey.
I personally enjoy taking painstaking effort in organizing papers. In the screenshot above, I have a sprawling list of topics as paper tags. These range from `applications`, `models`, and `inference` (each with subtags), and there are also miscellaneous topics such as `information-theory` and `experimental-design`. An important collection not seen in the screenshot is a tag called `research`, into which I bin all papers relevant to a particular research topic. For example, [the PixelGAN paper](https://arxiv.org/abs/1706.00531) presently highlighted is tagged into two topics I’ve currently been thinking a lot about—these are sorted into `research→alignment-semi` and `research→generative-images`.
## Managing a Project
![img](http://dustintran.com/blog/assets/2017-06-03-fig1.png)
*The repository we used for a recent arXiv preprint.*
I like to maintain one research project in one Github repository. They’re useful not only for tracking code but also in tracking general research progress, paper writing, and tying others in for collaboration. How Github repositories are organized is a frequent pain point. I like the following structure, based originally from [Dave Blei’s preferred one](http://www.cs.columbia.edu/~blei/seminar/2016_discrete_data/notes/week_01.pdf):
```
-- doc/
  -- 2017-nips/
    -- preamble/
    -- img/
    -- main.pdf
    -- main.tex
    -- introduction.tex
-- etc/
  -- 2017-03-25-whiteboard.jpg
  -- 2017-04-03-whiteboard.jpg
  -- 2017-04-06-dustin-comments.md
  -- 2017-04-08-dave-comments.pdf
-- src/
  -- checkpoints/
  -- codebase/
  -- log/
  -- out/
  -- script1.py
  -- script2.py
-- README.md
```
`README.md` maintains a list of todo’s, both for myself and collaborators. This makes it transparent how to keep moving forward and what’s blocking the work.
`doc/` contains all write-ups. Each subdirectory corresponds to a particular conference or journal submission, with `main.tex` being the primary document and individual sections written in separate files such as `introduction.tex`. Keeping one section per file makes it easy for multiple people to work on separate sections simultaneously and avoid merge conflicts. Some people prefer to write the full paper after major experiments are complete. I personally like to write a paper more as a summary of the current ideas and, as with the idea itself, it is continually revised as experiments proceed.
`etc/` is a dump of everything not relevant to other directories. I typically use it to store pictures of whiteboards during conversations about the project. Or sometimes as I’m just going about my day-to-day, I’m struck with a bunch of ideas and so I dump them into a Markdown document. It’s also a convenient location to handle various commentaries about the work, such as general feedback or paper markups from collaborators.
`src/` is where all code is written. Runnable scripts are written directly in `src/`, and classes and utilities are written in `codebase/`. I’ll elaborate on these next. (The other three are directories outputted from scripts, which I’ll also elaborate upon.)
## Writing Code
![img](http://dustintran.com/blog/assets/2017-06-03-fig2.png)
Any code I write now uses [Edward](http://edwardlib.org/). I find it to be the best framework for quickly experimenting with modern probabilistic models and algorithms.
On a conceptual level, Edward’s appealing because the language explicitly follows the math: the model’s generative process translates to specific lines of Edward code; then the proposed algorithm translates to the next lines; etc. This clean translation avoids future abstraction headaches when trying to extend the code with natural research questions: for example, what if I used a different prior, or tweaked the gradient estimator, or tried a different neural net architecture, or applied the method on larger scale data sets?
On a practical level, I most benefit from Edward by building off pre-existing model examples (in [`edward/examples/`](https://github.com/blei-lab/edward/tree/master/examples) or [`edward/notebooks/`](https://github.com/blei-lab/edward/tree/master/notebooks)), and then adapting it to my problem. If I am also implementing a new algorithm, I take a pre-existing algorithm’s source code (in [`edward/inferences/`](https://github.com/blei-lab/edward/tree/master/edward/inferences)), paste it as a new file in my research project’s `codebase/` directory, and then I tweak it. This process makes it really easy to start afresh—beginning from templates and avoiding low-level details.
When writing code, I always follow PEP8 (I particularly like the [`pep8`](https://pypi.python.org/pypi/pep8) package), and I try to separate individual scripts from the class and function definitions shared across scripts; the latter is placed inside `codebase/` and then imported. Maintaining code quality from the beginning is always a good investment, and I find this process scales well as the code gets increasingly more complicated and worked on with others.
**On Jupyter notebooks.** Many people use [Jupyter notebooks](http://jupyter.org/) as a method for interactive code development, and as an easy way to embed visualizations and LaTeX. I personally haven’t found success in integrating it into my workflow. I like to just write all my code down in a Python script and then run the script. But I can see why others like the interactivity.
## Managing Experiments
![img](http://dustintran.com/blog/assets/2017-06-03-fig3.png)
Investing in a good workstation or cloud service is a must. Features such as GPUs should basically be a given with [their wide availability](http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/), and one should have access to running many jobs in parallel.
After I finish writing a script on my local computer, my typical workflow is:
1. Run `rsync` to synchronize my local computer’s Github repository (which includes uncommitted files) with a directory on the server.
2. `ssh` into the server.
3. Start `tmux` and run the script. Among many things, `tmux` lets you detach the session so you don’t have to wait for the job to finish before interacting with the server again.
When the script is sensible, I start diving into experiments with multiple hyperparameter configurations. A useful tool for this is [`argparse`](https://docs.python.org/3/library/argparse.html). It augments a Python script with commandline arguments, where you add something like the following to your script:
```
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--batch_size', type=int, default=128,
                    help='Minibatch during training')
parser.add_argument('--lr', type=float, default=1e-5,
                    help='Learning rate step-size')
args = parser.parse_args()

batch_size = args.batch_size
lr = args.lr
```
Then you can run terminal commands such as
```
python script1.py --batch_size=256 --lr=1e-4
```
This makes it easy to submit server jobs which vary these hyperparameters.
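As a small illustration, one could generate such a grid of job commands with a few lines of Python (a sketch only; the grid values are hypothetical, and the flags follow the `argparse` example above):

```python
import itertools

# Hypothetical hyperparameter grid; each combination becomes one job.
batch_sizes = [128, 256]
learning_rates = [1e-4, 1e-5]

commands = [
    f"python script1.py --batch_size={bs} --lr={lr}"
    for bs, lr in itertools.product(batch_sizes, learning_rates)
]
for cmd in commands:
    print(cmd)
```

Each command can then be submitted as a separate server job.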
Finally, let’s talk about managing the output of experiments. Recall the `src/` directory structure above:
```
-- src/
  -- checkpoints/
  -- codebase/
  -- log/
  -- out/
  -- script1.py
  -- script2.py
```
We described the individual scripts and `codebase/`. The other three directories are for organizing experiment output:
- `checkpoints/` records saved model parameters during training. Use `tf.train.Saver` to save parameters as the algorithm runs every fixed number of iterations. This helps with running long experiments, where you might want to cut the experiment short and later restore the parameters. Each experiment outputs a subdirectory in `checkpoints/` with the convention `20170524_192314_batch_size_25_lr_1e-4/`. The first number is the date (`YYYYMMDD`); the second is the timestamp (`%H%M%S`); and the rest is hyperparameters.
- `log/` records logs for visualizing learning. Each experiment belongs in a subdirectory with the same convention as `checkpoints/`. One benefit of Edward is that for logging, you can simply pass an argument as `inference.initialize(logdir='log/' + subdir)`. Default TensorFlow summaries are tracked which can then be visualized using TensorBoard (more on this next).
- `out/` records exploratory output after training finishes; for example, generated images or matplotlib plots. Each experiment belongs in a subdirectory with the same convention as `checkpoints/`.
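The naming convention above can be generated mechanically; here is a minimal sketch (the helper name `experiment_subdir` is my own, not from the post):

```python
import datetime

def experiment_subdir(hparams, now=None):
    # Date (YYYYMMDD), timestamp (HHMMSS), then hyperparameters,
    # e.g. 20170524_192314_batch_size_25_lr_1e-4/
    now = now or datetime.datetime.now()
    stamp = now.strftime("%Y%m%d_%H%M%S")
    suffix = "_".join(f"{k}_{v}" for k, v in hparams.items())
    return f"{stamp}_{suffix}/"

print(experiment_subdir({"batch_size": 25, "lr": "1e-4"}))
```

Sorting these directory names lexicographically then also sorts them chronologically, which makes it easy to find the latest run.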
**On data sets.** Data sets are used across many research projects. I prefer storing them in the home directory `~/data`.
**On software containers.** [virtualenv](http://python-guide-pt-br.readthedocs.io/en/latest/dev/virtualenvs/) is a must for managing Python dependencies and avoiding difficulties with system-wide Python installs. It’s particularly nice if you like to write Python 2/3-agnostic code. [Docker containers](https://www.docker.com/) are an even more powerful tool if you require more from your setup.
## Exploration, Debugging, & Diagnostics
![img](http://dustintran.com/blog/assets/2017-06-03-fig4.png)
[TensorBoard](https://www.tensorflow.org/get_started/summaries_and_tensorboard) is an excellent tool for visualizing and exploring your model training. With TensorBoard’s interactivity, I find it particularly convenient in that I don’t have to configure a bunch of matplotlib functions to understand training. One only needs to percolate a bunch of `tf.summary`s on tensors in the code.
Edward logs a bunch of summaries by default in order to visualize how loss function values, gradients, and parameters change across training iterations. TensorBoard also includes wall time comparisons, and a sufficiently decorated TensorFlow code base provides a nice computational graph you can stare at. For nuanced issues I can’t diagnose with TensorBoard specifically, I just output things in the `out/` directory and inspect those results.
**Debugging error messages.** My debugging workflow is terrible. I percolate print statements across my code and find errors by process of elimination. This is primitive. Although I haven’t tried it, I hear good things about [TensorFlow’s debugger](https://www.tensorflow.org/programmers_guide/debugger).
## Improving Research Understanding
Interrogating your model, algorithm, and generally the learning process lets you better understand your work’s success and failure modes. This lets you go back to the drawing board, thinking deeply about the method and how it might be further improved. As the method indicates success, one can go from tackling simple toy configurations to increasingly large scale and high-dimensional problems.
From a higher level, this workflow is really about implementing the scientific method in the real world. No major ideas are necessarily discarded at each iteration of the experimental process, but rather, as in the ideal of science, you start with fundamentals and iteratively expand upon them as you have a stronger grasp of reality.
Experiments aren’t alone in this process either. Collaboration, communicating with experts from other fields, reading papers, working on both short and longer term ideas, and attending talks and conferences help broaden your perspective in finding the right problems and solving them.
## Footnotes & References
1 This workflow is specifically for empirical research. Theory is a whole other can of worms, but some of these ideas still generalize.
2 The template for the master document is available [`here`](https://github.com/dustinvtran/latex-templates).
3 There’s one caveat to Papers. I use it for everything: there are at least 2,000 papers stored in my account, and with quite a few dense textbooks. The application sifts through at least half a dozen gigabytes, and so it suffers from a few hiccups when reading/referencing back across many papers. I’m not sure if this is a bug or just inherent to me exploiting Papers almost *too* much.
1. Neal, R. M. (1994). *Bayesian Learning for Neural Networks* (PhD thesis). University of Toronto.
2. Pearl, J. (2000). *Causality*. Cambridge University Press.
3. Wainwright, M. J., & Jordan, M. I. (2008). Graphical Models, Exponential Families, and Variational Inference. *Foundations and Trends in Machine Learning*, *1*(1–2), 1–305.
| [A Research to Engineering Workflow](http://dustintran.com/blog/a-research-to-engineering-workflow?from=hackcv&hmsr=hackcv.com&utm_medium=hackcv.com&utm_source=hackcv.com) | |
| [Introduction to Information Theory and Why You Should Care](https://recast.ai/blog/introduction-information-theory-care/?from=hackcv&hmsr=hackcv.com&utm_medium=hackcv.com&utm_source=hackcv.com) | |
| [A Gentle Introduction to the Bag-of-Words Model](https://machinelearningmastery.com/gentle-introduction-bag-words-model/) | |
# A Research to Engineering Workflow
Original article: [A Research to Engineering Workflow](http://dustintran.com/blog/a-research-to-engineering-workflow?from=hackcv&hmsr=hackcv.com&utm_medium=hackcv.com&utm_source=hackcv.com)
Going from a research idea to experiments is fundamental, yet this step typically comes with little explicit advice. In academia, graduate students often end up with code scattered everywhere, assorted notes, and stray LaTeX write-ups. New projects tend to spawn entirely new code bases, and if they rely on past code, that code is hard to extend properly to the new projects.
Motivated by this, I thought it worth outlining the steps I personally take in going from a research idea to experimentation, and how that in turn improves my research understanding so I can revise the idea. This process is crucial: given an initial idea, all my time is spent on it; and for me at least, experiments are key to learning about and solving problems I couldn't have predicted. [1](http://dustintran.com/blog/a-research-to-engineering-workflow?from=hackcv&hmsr=hackcv.com&utm_medium=hackcv.com&utm_source=hackcv.com#references)
## Finding the Right Problems
Before working on a project, it is necessary to decide how an idea might gestate into something more formal. Sometimes it is as simple as a mentor suggesting a project; or tackling a specific dataset or applied problem; or a conversation with a frequent collaborator that carves out a useful problem to work on together. More often, I find that research is the result of a stream of ideas that are continually iterated upon: through frequent conversations, recent work, longer-term reading on subjects I'm unfamiliar with (e.g., [Pearl (2000)](http://dustintran.com/blog/a-research-to-engineering-workflow?from=hackcv&hmsr=hackcv.com&utm_medium=hackcv.com&utm_source=hackcv.com#pearl2000causality)), and favorite papers I like to revisit (e.g., [Wainwright & Jordan (2008)](http://dustintran.com/blog/a-research-to-engineering-workflow?from=hackcv&hmsr=hackcv.com&utm_medium=hackcv.com&utm_source=hackcv.com#wainwright2008graphical), [Neal (1994)](http://dustintran.com/blog/a-research-to-engineering-workflow?from=hackcv&hmsr=hackcv.com&utm_medium=hackcv.com&utm_source=hackcv.com#neal1994bayesian)).
[![IMG](https://camo.githubusercontent.com/f4f85e4b2e1b8fa4c5a0fa8ca7b3716fdd13010d/687474703a2f2f64757374696e7472616e2e636f6d2f626c6f672f6173736574732f323031372d30362d30332d666967302e706e67)](https://camo.githubusercontent.com/f4f85e4b2e1b8fa4c5a0fa8ca7b3716fdd13010d/687474703a2f2f64757374696e7472616e2e636f6d2f626c6f672f6173736574732f323031372d30362d30332d666967302e706e67)
*A master document of all my unexplored research ideas.*
One technique I find immensely helpful is maintaining a master document.[2](http://dustintran.com/blog/a-research-to-engineering-workflow?from=hackcv&hmsr=hackcv.com&utm_medium=hackcv.com&utm_source=hackcv.com#references) It does a few things.
First, it keeps a bulleted list of all the ideas, problems, and topics that I want to think about more carefully (Section 1.3 in the figure). Sometimes they are as high-level as "Bayesian/generative approaches to reinforcement learning" or "addressing fairness in machine learning"; sometimes they are as specific as "inference networks to handle memory complexity in EP" or "analysis of size-biased versus symmetric Dirichlet priors." I try to keep the list succinct: subsequent sections go into depth on particular entries (Section 2 in the figure).
Second, the list of ideas is sorted according to what I want to work on next. This guides my understanding of the general direction of my research beyond current work. I continually revise my priorities according to whether I think a direction aligns with my broader research vision, and whether I think it will have impact for the community at large. Importantly, the list is not just about the next publishable idea, but more generally about what I want to learn about next. This helps, long-term, in finding important problems and arriving at simple or novel solutions.
Every so often, I revisit the list, re-sorting, adding, and deleting entries. Eventually I may elaborate on an idea enough that it becomes a formal paper. Overall, I find that this process of iterating on ideas in one location (and one format) makes the transition to formal paper writing and experiments a fluid experience.
## Managing Papers
[![IMG](https://camo.githubusercontent.com/0c2534e468fd58112614fffc66d0cdd4bc1c5737/687474703a2f2f64757374696e7472616e2e636f6d2f626c6f672f6173736574732f323031372d30362d30332d666967352e706e67)](https://camo.githubusercontent.com/0c2534e468fd58112614fffc66d0cdd4bc1c5737/687474703a2f2f64757374696e7472616e2e636f6d2f626c6f672f6173736574732f323031372d30362d30332d666967352e706e67)
Good research requires reading *a lot* of papers. Without a good way of organizing your reading, you can easily be overwhelmed by the field's hurried pace. (The past few weeks have been especially notorious in trying to catch up with the flood of NIPS submissions posted to arXiv.)
I have experimented with many approaches, and ultimately I arrived at the [Papers app](http://papersapp.com/), which I highly recommend.
The most fundamental utility of a good management system is a centralized repository you can refer back to. The advantage of having one location for everything, whether an 8-page conference paper, a journal paper, a survey, or even a textbook, cannot be understated. Moreover, Papers is a nice tool for actually reading PDFs, and it conveniently syncs across devices as I read things on my tablet or laptop. When writing papers and citing references, I can go back to Papers and grab the corresponding BibTeX file and citekey.
I personally enjoy putting painstaking effort into organizing papers. In the screenshot above, I have a sprawling list of topics as paper tags. These range from `applications`, `models`, and `inference` (each with sub-tags), to miscellaneous topics such as `information-theory` and `experimental-design`. An important collection not shown in the screenshot is a tag called `research`, into which I bin all papers relevant to a particular research topic. For example, the currently highlighted [PixelGAN paper](https://arxiv.org/abs/1706.00531) is tagged into two topics I've been thinking about, filed under `research→alignment-semi` and `research→generative-images`.
## Managing a Project
[![IMG](https://camo.githubusercontent.com/b36c06c4a5c8578dff3414cb9d7f009582aef0df/687474703a2f2f64757374696e7472616e2e636f6d2f626c6f672f6173736574732f323031372d30362d30332d666967312e706e67)](https://camo.githubusercontent.com/b36c06c4a5c8578dff3414cb9d7f009582aef0df/687474703a2f2f64757374696e7472616e2e636f6d2f626c6f672f6173736574732f323031372d30362d30332d666967312e706e67)
*The repository we used for a recent arXiv preprint.*
I like to maintain one research project in one Github repository. Repositories are useful not only for tracking code but also for tracking general research progress and paper writing, and for tying in collaborators. How to organize a Github repository is a common pain point. I like the following structure, which originally comes from [Dave Blei's preferred structure](http://www.cs.columbia.edu/~blei/seminar/2016_discrete_data/notes/week_01.pdf):
```
-- doc/
  -- 2017-nips/
    -- preamble/
    -- img/
    -- main.pdf
    -- main.tex
    -- introduction.tex
-- etc/
  -- 2017-03-25-whiteboard.jpg
  -- 2017-04-03-whiteboard.jpg
  -- 2017-04-06-dustin-comments.md
  -- 2017-04-08-dave-comments.pdf
-- src/
  -- checkpoints/
  -- codebase/
  -- log/
  -- out/
  -- script1.py
  -- script2.py
-- README.md
```
`README.md` maintains a todo list, for myself as well as collaborators. This keeps it transparent how to move forward and what is blocking the work.
`doc/` contains all the write-ups. Each subdirectory corresponds to a particular conference or journal submission, with `main.tex` as the primary document and individual sections written in separate files such as `introduction.tex`. Keeping one file per section makes it easy for multiple people to work on different sections simultaneously and avoid merge conflicts. Some people prefer to write the full paper after the main experiments are complete. I personally prefer to write the paper as a summary of the current ideas, and, like the ideas themselves, it is revised continually as experiments proceed.
`etc/` is a dump for everything not relevant to the other directories. I typically use it to store pictures of whiteboards from conversations about the project. Sometimes, while going about my day-to-day, I'm struck by a bunch of ideas and dump them into a Markdown document. It is also a convenient place to handle miscellaneous commentary on the work, such as general feedback or paper markups from collaborators.
`src/` is where all the code lives. Runnable scripts are written directly in `src/`, and classes and utilities are written in `codebase/`. I'll elaborate on these next. (The other three are directories for output from the scripts, which I'll also elaborate on.)
## Writing Code
[![IMG](https://camo.githubusercontent.com/8044da6bcb1fce3603e59588316d3caffa1f6f1d/687474703a2f2f64757374696e7472616e2e636f6d2f626c6f672f6173736574732f323031372d30362d30332d666967322e706e67)](https://camo.githubusercontent.com/8044da6bcb1fce3603e59588316d3caffa1f6f1d/687474703a2f2f64757374696e7472616e2e636f6d2f626c6f672f6173736574732f323031372d30362d30332d666967322e706e67)
All code I write now uses [Edward](http://edwardlib.org/). I find it to be the best framework for quickly experimenting with modern probabilistic models and algorithms.
On a conceptual level, Edward is appealing because the language explicitly follows the mathematics: the model's generative process translates to specific lines of Edward code; the proposed algorithm then translates to the next lines; and so on. This clean translation avoids future abstraction headaches when trying to extend the code with natural research questions: for example, what if I use a different prior, or tweak the gradient estimators, or try a different neural network architecture, or apply the method on a larger-scale dataset?
On a practical level, I benefit most from Edward by building off pre-existing model examples (in [`edward/examples/`](https://github.com/blei-lab/edward/tree/master/examples) or [`edward/notebooks/`](https://github.com/blei-lab/edward/tree/master/notebooks)) and adapting them to my problem. If I am also implementing a new algorithm, I take the source code of a pre-existing algorithm (in [`edward/inferences/`](https://github.com/blei-lab/edward/tree/master/edward/inferences)), paste it as a new file in my research project's `codebase/` directory, and tweak it. This process makes it really easy to start from templates and avoid low-level details.
When writing code, I always follow PEP 8 (I particularly like the [`pep8`](https://pypi.python.org/pypi/pep8) package), and I try to separate individual scripts from the class and function definitions shared across scripts; the latter go inside `codebase/` and are imported from there. Maintaining code quality from the very beginning is always a good investment, and I find this process scales well as the code grows increasingly complicated and is worked on with others.
**On Jupyter notebooks.** Many people use [Jupyter notebooks](http://jupyter.org/) for interactive code development, and as an easy way to embed visualizations and LaTeX. I personally haven't succeeded in integrating them into my workflow. I like to write all my code in Python scripts and then run the scripts. But I can see why others enjoy the interactivity.
## Managing Experiments
[![IMG](https://camo.githubusercontent.com/ae3e351706f9e0ff04a8ee0e067021555e9c98a1/687474703a2f2f64757374696e7472616e2e636f6d2f626c6f672f6173736574732f323031372d30362d30332d666967332e706e67)](https://camo.githubusercontent.com/ae3e351706f9e0ff04a8ee0e067021555e9c98a1/687474703a2f2f64757374696e7472616e2e636f6d2f626c6f672f6173736574732f323031372d30362d30332d666967332e706e67)
Investing in a good workstation or cloud service is a must. Features such as GPUs should basically be [a given](http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/), and it should be possible to run many jobs in parallel.
After I finish writing a script on my local computer, my typical workflow is:
1. Run `rsync` to synchronize my local computer's Github repository (including uncommitted files) with a directory on the server;
2. `ssh` into the server;
3. Start `tmux` and run the script. Among its many uses, `tmux` lets you detach the session, so you don't have to wait for the job to finish before interacting with the server again.
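As a minimal sketch, step 1 above can be captured by a small helper that composes the `rsync` invocation; the host and paths here are placeholders, not from the original post:

```python
def sync_command(local_repo, host, remote_dir):
    """Compose the rsync call that mirrors the local repository,
    uncommitted files included, into a directory on the server."""
    # -a: archive mode, -v: verbose, -z: compress over the wire.
    return ["rsync", "-avz", local_repo + "/", host + ":" + remote_dir + "/"]

# Placeholder repository path, server, and remote directory.
cmd = sync_command("~/research/my-project", "me@server", "~/my-project")
print(" ".join(cmd))
# → rsync -avz ~/research/my-project/ me@server:~/my-project/
```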
Once the script is in reasonable shape, I start diving into experiments with multiple hyperparameter configurations. A helpful tool here is [`argparse`](https://docs.python.org/3/library/argparse.html). It augments a Python script with command-line arguments; you can add something like the following to your script:
```
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--batch_size', type=int, default=128,
                    help='Minibatch size during training')
parser.add_argument('--lr', type=float, default=1e-5,
                    help='Learning rate step-size')
args = parser.parse_args()

batch_size = args.batch_size
lr = args.lr
```
Then you can run terminal commands such as
```
python script1.py --batch_size=256 --lr=1e-4
```
This makes it easy to submit server jobs that sweep over these hyperparameters.
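A sweep over several configurations can be generated programmatically. A minimal sketch, assuming the script name and flags shown above; the grid values are illustrative:

```python
import itertools

# Hyperparameter grids; values mirror the argparse flags above.
batch_sizes = [128, 256]
lrs = ["1e-4", "1e-5"]

# One command line per point in the Cartesian product of the grids.
commands = [
    "python script1.py --batch_size=%s --lr=%s" % (bs, lr)
    for bs, lr in itertools.product(batch_sizes, lrs)
]
for cmd in commands:
    print(cmd)  # submit each to the job scheduler of your choice
```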
Finally, let's talk about managing the output of experiments. Recall the `src/` directory structure above:
```
-- src/
  -- checkpoints/
  -- codebase/
  -- log/
  -- out/
  -- script1.py
  -- script2.py
```
We described the individual scripts and `codebase/`. The other three directories organize experiment output:
- `checkpoints/` records model parameters saved during training. Use `tf.train.Saver` to store parameters every fixed number of iterations as the algorithm runs. This helps with long experiments, where you might want to cut an experiment short and restore the parameters later. Each experiment outputs a subdirectory in `checkpoints/` with the convention `20170524_192314_batch_size_25_lr_1e-4/`: the first number is the date (`YYYYMMDD`), the second is the timestamp (`%H%M%S`), and the rest are the hyperparameters.
- `log/` records logs for visualizing learning. Each experiment belongs in a subdirectory following the same convention as `checkpoints/`. One benefit of Edward is that, for logging, you can simply pass the argument `inference.initialize(logdir='log/' + subdir)`. Default TensorFlow summaries are tracked, which can then be visualized with TensorBoard (more on this next).
- `out/` records exploratory output after training finishes, such as generated images or matplotlib plots. Each experiment belongs in a subdirectory following the same convention as `checkpoints/`.
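The subdirectory naming convention above can be sketched as a small helper; `experiment_subdir` is a hypothetical name introduced here for illustration:

```python
import datetime

def experiment_subdir(hparams, now=None):
    """Build a name like 20170524_192314_batch_size_25_lr_1e-4:
    date stamp, then time stamp, then the hyperparameters."""
    now = now or datetime.datetime.now()
    stamp = now.strftime("%Y%m%d_%H%M%S")
    # Sort keys so the same configuration always yields the same name.
    suffix = "_".join("%s_%s" % (k, v) for k, v in sorted(hparams.items()))
    return stamp + "_" + suffix

subdir = experiment_subdir({"batch_size": 25, "lr": "1e-4"},
                           now=datetime.datetime(2017, 5, 24, 19, 23, 14))
print(subdir)
# → 20170524_192314_batch_size_25_lr_1e-4
```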
**On datasets.** Datasets are used across many research projects. I prefer storing them all in one home directory, `~/data`.
**On software containers.** [virtualenv](http://python-guide-pt-br.readthedocs.io/en/latest/dev/virtualenvs/) is a must for managing Python dependencies and avoiding the difficulties of a system-wide Python install. It is especially nice if you like writing Python 2/3-agnostic code. [Docker containers](https://www.docker.com/) are an even more powerful tool if you require more from your setup.
## Exploration, Debugging, and Diagnostics
[![IMG](https://camo.githubusercontent.com/de52d78ff8d22cdb62af0ed9fd2c2d9c33595337/687474703a2f2f64757374696e7472616e2e636f6d2f626c6f672f6173736574732f323031372d30362d30332d666967342e706e67)](https://camo.githubusercontent.com/de52d78ff8d22cdb62af0ed9fd2c2d9c33595337/687474703a2f2f64757374696e7472616e2e636f6d2f626c6f672f6173736574732f323031372d30362d30332d666967342e706e67)
[TensorBoard](https://www.tensorflow.org/get_started/summaries_and_tensorboard) is an excellent tool for visualizing and exploring model training. With TensorBoard's interactivity, I find it particularly convenient in that I don't have to configure a bunch of matplotlib functions to understand training. One only needs to percolate a bunch of `tf.summary` calls over tensors in the code.
Edward logs a bunch of summaries by default, to visualize how loss function values, gradients, and parameter values change across training iterations. TensorBoard also includes wall-clock time comparisons, and a sufficiently decorated TensorFlow code base provides a nice computational graph to stare at. For nuanced issues I can't diagnose with TensorBoard specifically, I just output things to the `out/` directory and inspect those results.
**Debugging error messages.** My debugging workflow is awful. I percolate print statements through my code and find bugs by process of elimination. This is primitive. Although I haven't tried it, I hear good things about [TensorFlow's debugger](https://www.tensorflow.org/programmers_guide/debugger).
## Improving Research Understanding
Interrogating your model, algorithm, and learning process in general lets you understand your work's success and failure modes. This lets you go back to the drawing board and think deeply about the method and how it might be improved further. As the method shows success, you can go from tackling simple toy configurations to increasingly large-scale and high-dimensional problems.
From a higher level, this workflow is really about implementing the scientific method in the real world. Important ideas are not necessarily discarded at each iteration of the experimental process; rather, in the ideals of science, you start from fundamentals and iteratively expand upon them as you gain a stronger grasp of reality.
Experiments aren't alone in this process either. Collaboration, communicating with experts from other fields, reading papers, working on both short- and longer-term ideas, and attending talks and conferences all help broaden your perspective in finding the right problems and solving them.
## Footnotes & References
1 This workflow is specifically for empirical research. Theory is a whole other can of worms, but some of these ideas still generalize.
2 The template for the master document is available [`here`](https://github.com/dustinvtran/latex-templates).
3 There's one caveat to Papers. I use it for everything: there are at least 2,000 papers stored in my account, along with quite a few dense textbooks. The application sifts through at least half a dozen gigabytes, so it suffers from a few hiccups when reading or referencing back across many papers. I'm not sure if this is a bug or just inherent to me exploiting Papers almost *too* much.
1. Neal, R. M. (1994). *Bayesian Learning for Neural Networks* (PhD thesis). University of Toronto.
2. Pearl, J. (2000). *Causality*. Cambridge University Press.
3. Wainwright, M. J., & Jordan, M. I. (2008). Graphical Models, Exponential Families, and Variational Inference. *Foundations and Trends in Machine Learning*, *1*(1–2), 1–305.
# A Gentle Introduction to the Bag-of-Words Model
Original article: [A Gentle Introduction to the Bag-of-Words Model](https://machinelearningmastery.com/gentle-introduction-bag-words-model/)
The bag-of-words model is a way of representing text data when modeling text with machine learning algorithms.
## Translation Contributors
| Date | Translator | Proofreader |
| --------------------------------------------------- | -------------------------------------------------- | ---- |
| [2017/09/25 Issue 1](https://hackcv.com/daily/p/1/) | [@wnma](https://github.com/wnma3mz) | |
| [2017/10/04 Issue 2](https://hackcv.com/daily/p/2/) | [@doordiey](https://github.com/doordiey) | |
| [2017/10/05 Issue 3](https://hackcv.com/daily/p/3/) | [@Arron206](https://github.com/Arron206) | |
| [2017/10/06 Issue 4](https://hackcv.com/daily/p/4/) | [@mllove](https://github.com/mllove) | |
| [2017/10/07 Issue 5](https://hackcv.com/daily/p/5/) | [@wnma](https://github.com/wnma3mz) | |
| [2017/10/08 Issue 6](https://hackcv.com/daily/p/6/) | [@doordiey](https://github.com/doordiey) | |
| [2017/10/09 Issue 7](https://hackcv.com/daily/p/7/) | [@mllove](https://github.com/mllove) | |
| [2017/10/10 Issue 8](https://hackcv.com/daily/p/8/) | [@AlexdanerZe](https://github.com/AlexdanerZe) | |
| [2017/10/11 Issue 9](https://hackcv.com/daily/p/9/) | [@exqlnet](https://github.com/exqlnet) | |
| [2017/10/11 Issue 10](https://hackcv.com/daily/p/10/) | [@aboutmydreams](https://github.com/aboutmydreams) | |
| [2017/10/13 Issue 11](https://hackcv.com/daily/p/11/) | [@wnma](https://github.com/wnma3mz) | |
| [2017/10/14 Issue 12](https://hackcv.com/daily/p/12/) | [@wnma](https://github.com/wnma3mz) | |
| [2017/10/15 Issue 13](https://hackcv.com/daily/p/13/) | | |
| [2017/10/16 Issue 14](https://hackcv.com/daily/p/14/) | | |
| [2017/10/17 Issue 15](https://hackcv.com/daily/p/15/) | | |
| [2017/10/18 Issue 16](https://hackcv.com/daily/p/16/) | | |
| [2017/10/19 Issue 17](https://hackcv.com/daily/p/17/) | | |
| [2017/10/20 Issue 18](https://hackcv.com/daily/p/18/) | | |
## Contribution Guide