The approach of focusing on the syntax of a programming language in introductory courses is understandable. Problem-solving is not a precise, well-defined skill. It's more of an overall ability that gets honed with practice. Teaching and grading it is therefore challenging. It's much easier to jump immediately into the syntax of some simple programming language statements. Such an approach is concrete and, in principle, easy to understand, but it totally skips the part about why and when we need those statements. Professors who survived in that environment as students usually go on to perpetuate the sink-or-swim approach when teaching other programmers.
In this course, I'd like to rectify that by focusing on both solving problems and learning to write Python code. To do that, we're going to follow an overall problem-solving strategy that involves designing a "work plan" or "algorithm," either on paper or in your head (when you get more experience). The plan helps us think about the problem long before we get to the coding phase. Part of the plan is to identify a suitable sequence of operations that solves our problem. This is the tricky bit, so we'll reduce the scope of the solution space by: (1) restricting ourselves to a common set of operations and data structures, (2) applying well-established methods we can call "working backwards" and "reducing to a known solution", and finally, (3) taking advantage of the topic-specific nature of this introductory course to adopt a program outline that'll work for most data science problems. When we do finally get to Python programming, we'll restrict ourselves to a useful subset of the language. The goal is to teach you to program, not to teach you the complete Python language.
Allow me to begin by making a distinction between <b>programming</b> (problem solving) and <b>coding</b> (expressing our solution in a particular programming language).
When we think about programming, we immediately think about programming languages because we express ourselves using specific language syntax. But that is like asking a physicist in which language they discuss physics. **Programming** is mostly about converting "word problems" (project descriptions) to an execution plan. The final act of **coding** (entering code) is required, of course, but learning to solve programming problems mentally is the most difficult and the most important part of the process.
## What is programming?
The same is true for natural languages. Learning to prove mathematical theorems is harder than learning to write up proofs in some natural language. In fact, much of the mathematical syntax is the same across natural languages just as it is for programming languages. Expressing your thoughts in Python or R, as you will do in the data science program, is the simplest part of the programming process. That said, writing correct code is often the most frustrating and time-consuming part of the process even for experienced programmers.
Programming is more about *what* to say rather than *how* to say it. Solving a problem with a computer means identifying a sequence of operations, each of which solves a piece of the overall problem. Each operation might itself be a sequence of suboperations. Expressing those operations in Python or R is not the hard part. Identifying which operations are necessary and their relative order is the hard part.
Let's start with an overall strategy for attacking programming problems.
## Problem-solving strategy
Regardless of the software we're trying to write, there is an overall problem-solving strategy that we can follow.
**Step one** in any problem-solving situation is to fully understand the problem and clearly identify the goal. It might sound obvious, but any fuzziness in our understanding of the problem could send us off in the wrong direction. In a data science setting, the goal is usually a question we're trying to answer, such as "*which sales regions show the fastest year-on-year growth?*" (summary statistics), "*which transactions are fraudulent?*" (classifier) or "*what will a stock price be at a future date?*" (predictor). We should be able to precisely articulate the goal and the expected output using English words. If we can't do that, then no amount of coding expertise in Python or R will solve the problem. We'll see some examples shortly.
**Step two** (or possibly part of step one) of the problem-solving process is to write out some input-output pairs by hand. Doing so helps us understand what the program will need to do and how it might do it. As we will see, this technique works not only for the overall input and output, but also for designing [functions](functions.ipynb) (reusable bits of code). **We can't automate operations with code if we can't identify and perform the operations manually.** Moreover, listing a bunch of cases usually highlights special cases, such as "when the input is negative, the output should be empty". In other words, the program should not crash with a negative number as input. Programmers call this *test-driven design*.
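To make the idea of hand-written input-output pairs concrete, here is a minimal sketch. The function `countdown` and its cases are hypothetical, invented only for illustration; the point is that the pairs, including the negative-input special case, are written down before the function body.

```python
# Hypothetical example: countdown(n) is not part of the course material.
# Input-output pairs written by hand *before* writing the function.
EXAMPLES = [
    (3, [3, 2, 1]),  # typical case
    (1, [1]),        # smallest non-empty case
    (0, []),         # boundary case
    (-2, []),        # special case: negative input gives empty output, no crash
]

def countdown(n):
    """Return the list n, n-1, ..., 1, or an empty list when n < 1."""
    return list(range(n, 0, -1))

def check_examples():
    """Return the pairs the function gets wrong (an empty list means all pass)."""
    return [(x, want, countdown(x)) for x, want in EXAMPLES if countdown(x) != want]

print(check_examples())  # [] when every hand-written pair is satisfied
```

Keeping the pairs in a plain list means a newly discovered special case is a one-line addition.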
In a job interview setting, this step means immediately trying to draw a few instances of the problem. For example, if asked to process a list of numbers in some way, begin by putting three or four numbers up on the board or on a piece of paper. This naturally brings up a number of important questions that the interviewer is expecting you to ask, such as where the data comes from and whether it can all fit in memory.
**Step three** is to figure out what data or input, our raw materials, we need to achieve the goal. Without the right data, we can't solve the problem. For example, I once mentored a student practicum team whose goal was to identify which customers of a website would upgrade to a professional account. The students only had data on users who had upgraded and no data on users who declined to upgrade. Whoops! You can't build an apples versus oranges classifier if you only have data on apples. If you don't have all the data you need, it's important to identify this requirement as part of the problem-solving process. Data acquisition often requires programming and we'll revisit the topic below as part of our generic program outline.
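The "apples only" trap can often be caught with a quick check before any modeling starts. The sketch below is an illustration under assumptions: the helper `missing_classes` and the label names are made up, and the labels simply mirror the practicum story.

```python
from collections import Counter

def missing_classes(labels, required):
    """Return the required class names that never appear in labels."""
    counts = Counter(labels)  # a Counter reports 0 for labels it has not seen
    return sorted(c for c in required if counts[c] == 0)

# Made-up labels mirroring the practicum story: only upgraders were recorded.
labels = ["upgraded", "upgraded", "upgraded"]
print(missing_classes(labels, {"upgraded", "declined"}))  # ['declined']
```

An empty result does not prove the data is sufficient, but a non-empty one means the classifier cannot be built yet.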
At this point, we've set the stage necessary to solve problems and we haven't thought about code at all. We started with the end result and then identified the data we need. The input-output pairs neatly bracket the computation we need to perform. At the beginning, we have the known data and, at the end, we have the expected output or work product. OK, on to the programming steps.
**Step four** is to identify the sequence of operations that will compute the expected result. Sometimes this is called an *algorithm* and involves planning out the specific operations and suboperations that chew on the input data, gradually transforming it into the expected output.
These first four steps are a key part of the so-called [Feynman technique](https://www.google.com/search?q=feynman+technique), which includes writing down a complete explanation of an assigned task or problem as you would explain it to a nonexpert. Until you can write it down simply, without confusing language or terms, you yourself don't understand the problem. There is no point in continuing until you get past this phase. (Faculty often joke that the best way to learn a new topic is to teach a class on that topic!)
In **step five**, we translate the operations in our plan to actual executable code. This step deserves an entire book, but here's a summary of my advice. Start with the simplest suboperations and make sure they work first. Then code the larger operations that use those suboperations. If there's a problem, you know that it is likely in the new code, not in the already-tested suboperations. In this phase, we'll normally find problems in our design from step four, so we'll typically repeat steps four and five. Testing functionality and fixing errors is called *debugging*.
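As a small sketch of coding the suboperations first, consider computing a variance. The functions `mean` and `variance` are illustrative choices, not something prescribed by the course: the simple piece is written and checked against a hand-worked case, and only then does the larger operation build on it.

```python
def mean(values):
    """Suboperation: arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

# Test the suboperation on a hand-worked case before building on it.
assert mean([2, 4, 6]) == 4

def variance(values):
    """Larger operation: population variance, built on the tested mean()."""
    m = mean(values)
    return mean([(v - m) ** 2 for v in values])

# Hand-worked: the mean is 4 and the squared deviations are 4, 0, 4,
# so the variance should be 8/3.
assert abs(variance([2, 4, 6]) - 8 / 3) < 1e-12
```

If the second assertion fails, the first one has already ruled out `mean` as the culprit, which narrows the search to the new code.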
Finally, **step six** is to check our overall results for correctness. The most obvious check is to compare the output of our program with the known input-output pairs from step two. Then, most importantly, test the program with input that was not considered in steps two through five. This is an important test of the program's generality. If the program gives incorrect output, it's back to step four to see what's wrong.
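One way to carry out both checks is sketched below, under assumptions: `my_median` stands in for "our program", the known pairs are invented, and the unseen inputs are generated randomly and compared against Python's standard `statistics.median` as a trusted reference. Such a reference is often unavailable in practice, in which case extra hand-worked cases play the same role.

```python
import random
import statistics

def my_median(values):
    """Median of a non-empty list: the middle value, or the mean of the two middle values."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Check 1: the hand-written input-output pairs.
known_pairs = [([5], 5), ([3, 1, 2], 2), ([4, 1, 3, 2], 2.5)]
assert all(my_median(x) == want for x, want in known_pairs)

# Check 2: inputs nobody wrote down, compared with a trusted reference.
rng = random.Random(0)  # fixed seed so the check is repeatable

def unseen_inputs_agree(trials=200):
    """Return True if my_median matches statistics.median on random lists."""
    for _ in range(trials):
        data = [rng.randint(-50, 50) for _ in range(rng.randint(1, 9))]
        if my_median(data) != statistics.median(data):
            return False
    return True

print(unseen_inputs_agree())
```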
And now for a dose of reality. The world is a big messy place and, since we know the least about a problem at the start, we typically need to repeat or bounce around through some or all of these steps. For example, let's say we're building an apples vs oranges classifier and the above process leads to a program that doesn't distinguish between the two fruits very well. Perhaps we only have data on size and shape. We might decide that the classifier needs data on color, so it's back to step three (and possibly step two), then on to step six to check the results again.
A program is a sequence of operations that transforms data or performs computations that ultimately lead to the expected output. *Programming* is the act of designing programs: identifying the operations and their appropriate sequence. In other words, programming is about coming up with a work plan intended for a computer, which we often describe in semi-precise English called *pseudocode*. This is **step four** from the previous section.
*Coding*, on the other hand, is the act of translating such high-level pseudocode to programming language syntax. As you gain more experience, it'll become easier and easier to go from a work plan in your head straight to code, without the pseudocode step.
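To illustrate the gap between a plan and its code, take the question of which sales region shows the fastest year-on-year growth. A pseudocode plan might read: for each region, compute growth as this year's sales minus last year's sales, divided by last year's sales; then report the region with the largest growth. The sketch below translates that plan to Python. The sales figures and the function name `fastest_growing` are made up for illustration.

```python
# Made-up sales figures: region -> (last year, this year)
sales = {
    "north": (100.0, 120.0),
    "south": (200.0, 210.0),
    "west": (50.0, 75.0),
}

def fastest_growing(sales):
    """Return (region, growth) for the region with the largest year-on-year growth."""
    # Plan: for each region, growth = (this year - last year) / last year
    growth = {
        region: (now - before) / before
        for region, (before, now) in sales.items()
    }
    # Plan: report the region with the largest growth
    best = max(growth, key=growth.get)
    return best, growth[best]

print(fastest_growing(sales))  # ('west', 0.5)
```

The comments are the pseudocode and the statements beneath them are the coding. Deciding that growth should be a ratio rather than a raw difference was the programming.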