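Stepping through the CPython source code
In the first and second parts of this series, we explored the ideas behind the execution and compilation of a Python program. We'll keep focusing on ideas in the following parts, but this time we make an exception and look at the actual code that brings those ideas to life.
Plan for today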
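The CPython codebase is around 350,000 lines of C code (excluding header files) and almost 600,000 lines of Python code. Undoubtedly, comprehending all of that at once would be a daunting task. Today we'll confine our study to the part of the source code that executes every time we run python. We'll start with the main() function of the python executable and step through the source code until we reach the evaluation loop, the place where Python bytecode gets executed.
Our goal is not to understand every piece of code we encounter but to highlight the most interesting parts, study them and, eventually, get an approximate idea of what happens at the very start of the execution of a Python program.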
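There are two more notices I should make. First, we won't step into every function. We'll give only a high-level overview of some parts and dive deep into others. Nevertheless, I promise to present functions in the order of execution. Second, with the exception of a few struct definitions, I'll leave the code as is. The only thing I'll allow myself is to add some comments and to rephrase existing ones. Throughout this post, all multi-line /**/ comments are original, and all single-line // comments are mine. With that said, let's begin our journey through the CPython source code.
Getting CPython
Before we can explore the source code, we need to get it. Let's clone the CPython repository: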
$ git clone https://github.com/python/cpython/ && cd cpython
The current master branch is the future CPython 3.10. We're interested in the latest stable release, which is CPython 3.9, so let's switch to the 3.9 branch:
$ git checkout 3.9
In the root directory we find the following contents:
$ ls -p
CODE_OF_CONDUCT.md Objects/ config.sub
Doc/ PC/ configure
Grammar/ PCbuild/ configure.ac
Include/ Parser/ install-sh
LICENSE Programs/ m4/
Lib/ Python/ netlify.toml
Mac/ README.rst pyconfig.h.in
Makefile.pre.in Tools/ setup.py
Misc/ aclocal.m4
Modules/ config.guess
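Some of the listed subdirectories are of particular importance to us in the course of this series:
- Grammar/ contains the grammar files we discussed last time.
- Include/ contains header files. They are used both by CPython and by the users of the Python/C API.
- Lib/ contains standard library modules written in Python. While some modules, such as argparse and wave, are written in Python entirely, many wrap C code. For example, the Python io module wraps the C _io module.
- Modules/ contains standard library modules written in C. While some modules, such as itertools, are intended to be imported directly, others are wrapped by Python modules.
- Objects/ contains the implementations of built-in types. If you want to understand how int or list is implemented, this is the ultimate place to go.
- Parser/ contains the old parser, the old parser generator, the new parser and the tokenizer.
- Programs/ contains source files that are compiled to executables.
- Python/ contains source files for the interpreter itself. This includes the compiler, the evaluation loop, the builtins module and many other interesting things.
- Tools/ contains tools useful for building and managing CPython. For example, the new parser generator lives here.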
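If you don't see a directory for tests and your heart starts beating faster, relax. It's Lib/test/. Tests are useful not only for CPython development but also for understanding how CPython works. For example, to understand what optimizations the peephole optimizer is expected to do, you can look at the tests in Lib/test/test_peepholer.py. And to understand what some piece of code of the peephole optimizer does, you can delete that piece of code, recompile CPython, run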
$ ./python.exe -m test test_peepholer
and see which tests fail.
Ideally, all we need to do to compile CPython is to run ./configure and make:
$ ./configure
$ make -j -s
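make will produce an executable named python, but don't be surprised to see python.exe on macOS. The .exe extension is used to distinguish the executable from the Python/ directory on a case-insensitive filesystem. Check out the Python Developer's Guide for more information on compiling.
At this point, we can proudly say that we've built our own copy of CPython: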
$ ./python.exe
Python 3.9.0+ (heads/3.9-dirty:20bdeedfb4, Oct 10 2020, 16:55:24)
[Clang 10.0.0 (clang-1000.10.44.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 2 ** 16
65536
Let's see what happens when we run it.
main()
The execution of CPython, like the execution of any other C program, starts with the main() function in Programs/python.c:
/* Minimal main program -- everything is loaded from the library */
#include "Python.h"
#ifdef MS_WINDOWS
int
wmain(int argc, wchar_t **argv)
{
return Py_Main(argc, argv);
}
#else
int
main(int argc, char **argv)
{
return Py_BytesMain(argc, argv);
}
#endif
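There isn't much going on there. The only thing worth mentioning is that on Windows CPython uses wmain() instead of main() as an entry point to receive argv as UTF-16 encoded strings. The effect of this is that on other platforms CPython performs an extra step of converting a char string to a wchar_t string. The encoding of a char string depends on the locale settings, and the encoding of a wchar_t string depends on the size of wchar_t. For example, if sizeof(wchar_t) == 4, the UCS-4 encoding is used. PEP 383 has more to say on this.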
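To make the char to wchar_t conversion more tangible, here is a small Python sketch of the idea behind PEP 383. It assumes the raw argument bytes are decoded as UTF-8 with the surrogateescape error handler; the encoding CPython actually uses depends on the locale, and the byte string below is made up for illustration.

```python
# Hypothetical raw command line argument containing a byte (0xff)
# that is not valid UTF-8.
raw_arg = b"caf\xc3\xa9-\xff"

# Decoding with "surrogateescape" never fails: each undecodable byte is
# mapped to a lone surrogate code point in the range U+DC80..U+DCFF.
decoded = raw_arg.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = decoded.encode("utf-8", errors="surrogateescape")

print(ascii(decoded))
print(restored == raw_arg)
```

The round trip is what lets arbitrary bytes received from the OS survive the conversion to a Python string and back.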
We find Py_Main() and Py_BytesMain() in Modules/main.c. What they essentially do is call pymain_main() with slightly different arguments:
int
Py_Main(int argc, wchar_t **argv)
{
_PyArgv args = {
.argc = argc,
.use_bytes_argv = 0,
.bytes_argv = NULL,
.wchar_argv = argv};
return pymain_main(&args);
}
int
Py_BytesMain(int argc, char **argv)
{
_PyArgv args = {
.argc = argc,
.use_bytes_argv = 1,
.bytes_argv = argv,
.wchar_argv = NULL};
return pymain_main(&args);
}
pymain_main() doesn't seem to do much either:
static int
pymain_main(_PyArgv *args)
{
PyStatus status = pymain_init(args);
if (_PyStatus_IS_EXIT(status)) {
pymain_free();
return status.exitcode;
}
if (_PyStatus_EXCEPTION(status)) {
pymain_exit_error(status);
}
return Py_RunMain();
}
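Nevertheless, we should stop on it for a little bit longer. Last time we learned that before a Python program starts executing, CPython does a lot of things to compile it. It turns out that CPython does a lot of things even before it starts compiling a program. These things constitute the initialization of CPython. We mentioned the initialization in part 1 when we said that CPython works in three stages:
- initialization
- compilation; and
- interpretation.
What pymain_main() does is call pymain_init() to perform the initialization and then call Py_RunMain() to proceed with the next stages.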
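The initialization stage
What does CPython do during the initialization? Let's ponder on this for a moment. At the very least it has to:
- find a common language with the OS to properly handle the encodings of arguments, environment variables, standard streams and the file system
- parse the command line arguments and read the environment variables to determine the options to run with
- initialize the runtime state, the main interpreter state and the main thread state
- initialize built-in types and the builtins module
- initialize the sys module
- set up the import system
- create the __main__ module.
Starting with CPython 3.8, all of that is done in three distinct phases:
- preinitialization
- core initialization; and
- main initialization.
The phases gradually introduce new capabilities. The preinitialization phase initializes the runtime state, sets up the default memory allocator and performs very basic configuration. There is no sign of Python yet. The core initialization phase initializes the main interpreter state and the main thread state, built-in types and exceptions, the builtins module, the sys module and the import system. At this point, you can use the "core" of Python. However, some things are not available yet. For example, the sys module is only partially initialized, and only the import of built-in and frozen modules is supported. After the main initialization phase, CPython is fully initialized and ready to compile and execute a Python program.
What's the benefit of having distinct initialization phases? In a nutshell, it allows us to tune CPython more easily. For example, one may set a custom memory allocator in the preinitialized state or override the path configuration in the core_initialized state. Such capabilities are important to the users of the Python/C API who extend and embed Python. PEP 432 and PEP 587 explain in greater detail why multi-phase initialization is a good idea.
Let's get back to the source code. The pymain_init() function mostly deals with the preinitialization and calls Py_InitializeFromConfig() at the end to perform the core and the main phases of the initialization: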
static PyStatus
pymain_init(const _PyArgv *args)
{
PyStatus status;
// Initialize the runtime state
status = _PyRuntime_Initialize();
if (_PyStatus_EXCEPTION(status)) {
return status;
}
// Initialize default preconfig
PyPreConfig preconfig;
PyPreConfig_InitPythonConfig(&preconfig);
// Perform preinitialization
status = _Py_PreInitializeFromPyArgv(&preconfig, args);
if (_PyStatus_EXCEPTION(status)) {
return status;
}
// Preinitialized. Prepare config for the next initialization phases
// Initialize default config
PyConfig config;
PyConfig_InitPythonConfig(&config);
// Store the command line arguments in `config->argv`
if (args->use_bytes_argv) {
status = PyConfig_SetBytesArgv(&config, args->argc, args->bytes_argv);
}
else {
status = PyConfig_SetArgv(&config, args->argc, args->wchar_argv);
}
if (_PyStatus_EXCEPTION(status)) {
goto done;
}
// Perform core and main initialization
status = Py_InitializeFromConfig(&config);
if (_PyStatus_EXCEPTION(status)) {
goto done;
}
status = _PyStatus_OK();
done:
PyConfig_Clear(&config);
return status;
}
_PyRuntime_Initialize() initializes the runtime state. The runtime state is stored in the global variable called _PyRuntime of type _PyRuntimeState, which is defined as follows:
/* Full Python runtime state */
typedef struct pyruntimestate {
/* Is running Py_PreInitialize()? */
int preinitializing;
/* Is Python preinitialized? Set to 1 by Py_PreInitialize() */
int preinitialized;
/* Is Python core initialized? Set to 1 by _Py_InitializeCore() */
int core_initialized;
/* Is Python fully initialized? Set to 1 by Py_Initialize() */
int initialized;
/* Set by Py_FinalizeEx(). Only reset to NULL if Py_Initialize() is called again. */
_Py_atomic_address _finalizing;
struct pyinterpreters {
PyThread_type_lock mutex;
PyInterpreterState *head;
PyInterpreterState *main;
int64_t next_id;
} interpreters;
unsigned long main_thread;
struct _ceval_runtime_state ceval;
struct _gilstate_runtime_state gilstate;
PyPreConfig preconfig;
// ... less interesting stuff for now
} _PyRuntimeState;
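The last field of _PyRuntimeState shown here, preconfig, holds the configuration that is used to preinitialize CPython. It's also used by the next phase to complete the configuration. Here's the extensively commented definition of PyPreConfig: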
typedef struct {
int _config_init; /* _PyConfigInitEnum value */
/* Parse Py_PreInitializeFromBytesArgs() arguments?
See PyConfig.parse_argv */
int parse_argv;
/* If greater than 0, enable isolated mode: sys.path contains
neither the script's directory nor the user's site-packages directory.
Set to 1 by the -I command line option. If set to -1 (default), inherit
Py_IsolatedFlag value. */
int isolated;
/* If greater than 0: use environment variables.
Set to 0 by -E command line option. If set to -1 (default), it is
set to !Py_IgnoreEnvironmentFlag. */
int use_environment;
/* Set the LC_CTYPE locale to the user preferred locale? If equals to 0,
set coerce_c_locale and coerce_c_locale_warn to 0. */
int configure_locale;
/* Coerce the LC_CTYPE locale if it's equal to "C"? (PEP 538)
Set to 0 by PYTHONCOERCECLOCALE=0. Set to 1 by PYTHONCOERCECLOCALE=1.
Set to 2 if the user preferred LC_CTYPE locale is "C".
If it is equal to 1, LC_CTYPE locale is read to decide if it should be
coerced or not (ex: PYTHONCOERCECLOCALE=1). Internally, it is set to 2
if the LC_CTYPE locale must be coerced.
Disable by default (set to 0). Set it to -1 to let Python decide if it
should be enabled or not. */
int coerce_c_locale;
/* Emit a warning if the LC_CTYPE locale is coerced?
Set to 1 by PYTHONCOERCECLOCALE=warn.
Disable by default (set to 0). Set it to -1 to let Python decide if it
should be enabled or not. */
int coerce_c_locale_warn;
#ifdef MS_WINDOWS
/* If greater than 1, use the "mbcs" encoding instead of the UTF-8
encoding for the filesystem encoding.
Set to 1 if the PYTHONLEGACYWINDOWSFSENCODING environment variable is
set to a non-empty string. If set to -1 (default), inherit
Py_LegacyWindowsFSEncodingFlag value.
See PEP 529 for more details. */
int legacy_windows_fs_encoding;
#endif
/* Enable UTF-8 mode? (PEP 540)
Disabled by default (equals to 0).
Set to 1 by "-X utf8" and "-X utf8=1" command line options.
Set to 1 by PYTHONUTF8=1 environment variable.
Set to 0 by "-X utf8=0" and PYTHONUTF8=0.
If equals to -1, it is set to 1 if the LC_CTYPE locale is "C" or
"POSIX", otherwise it is set to 0. Inherit Py_UTF8Mode value value. */
int utf8_mode;
/* If non-zero, enable the Python Development Mode.
Set to 1 by the -X dev command line option. Set by the PYTHONDEVMODE
environment variable. */
int dev_mode;
/* Memory allocator: PYTHONMALLOC env var.
See PyMemAllocatorName for valid values. */
int allocator;
} PyPreConfig;
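After the call to _PyRuntime_Initialize(), the _PyRuntime global variable is initialized to the defaults. Next, PyPreConfig_InitPythonConfig() initializes a new default preconfig, and then _Py_PreInitializeFromPyArgv() performs the actual preinitialization. What's the reason to initialize another preconfig if there is already one in _PyRuntime? Remember that many functions that CPython calls are also exposed via the Python/C API. So CPython just uses this API in the way it's designed to be used. Another consequence of this is that when you step through the CPython source code as we do today, you often encounter functions that seem to do more than you expect. For example, _PyRuntime_Initialize() is called several times during the initialization process. Of course, it does nothing on the subsequent calls.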
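_Py_PreInitializeFromPyArgv() reads command line arguments, environment variables and global configuration variables and, based on that, sets _PyRuntime.preconfig, the current locale and the memory allocator. It reads only those parameters that are relevant to the preinitialization phase. For example, it parses only the -E, -I and -X arguments.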
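The effect of the options handled during preinitialization can be observed from Python through sys.flags. The sketch below is only an illustration: it spawns child interpreters via subprocess, and the helper name preinit_flags is made up for this example.

```python
import subprocess
import sys

# Ask a child interpreter to report the flags derived from -E, -I and -X utf8.
REPORT = (
    "import sys; "
    "print(sys.flags.ignore_environment, sys.flags.isolated, sys.flags.utf8_mode)"
)

def preinit_flags(*options):
    """Run a child interpreter with the given options and return its report."""
    result = subprocess.run(
        [sys.executable, *options, "-c", REPORT],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.split()

isolated = preinit_flags("-I")         # -I implies -E
utf8 = preinit_flags("-X", "utf8")     # enables the UTF-8 mode
print(isolated, utf8)
```

Note that -I turns on both the isolated and the ignore_environment flags, which matches the comments on the isolated and use_environment fields of PyPreConfig.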
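At this point, the runtime is preinitialized, and pymain_init() starts to prepare config for the next initialization phase. Do not confuse config with preconfig. The former is a structure that holds most of the Python configuration. It's heavily used during the initialization stage and during the execution of a Python program as well. To get an idea of how config is used, I recommend looking over its lengthy definition: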
/* --- PyConfig ---------------------------------------------- */
typedef struct {
int _config_init; /* _PyConfigInitEnum value */
int isolated; /* Isolated mode? see PyPreConfig.isolated */
int use_environment; /* Use environment variables? see PyPreConfig.use_environment */
int dev_mode; /* Python Development Mode? See PyPreConfig.dev_mode */
/* Install signal handlers? Yes by default. */
int install_signal_handlers;
int use_hash_seed; /* PYTHONHASHSEED=x */
unsigned long hash_seed;
/* Enable faulthandler?
Set to 1 by -X faulthandler and PYTHONFAULTHANDLER. -1 means unset. */
int faulthandler;
/* Enable PEG parser?
1 by default, set to 0 by -X oldparser and PYTHONOLDPARSER */
int _use_peg_parser;
/* Enable tracemalloc?
Set by -X tracemalloc=N and PYTHONTRACEMALLOC. -1 means unset */
int tracemalloc;
int import_time; /* PYTHONPROFILEIMPORTTIME, -X importtime */
int show_ref_count; /* -X showrefcount */
int dump_refs; /* PYTHONDUMPREFS */
int malloc_stats; /* PYTHONMALLOCSTATS */
/* Python filesystem encoding and error handler:
sys.getfilesystemencoding() and sys.getfilesystemencodeerrors().
Default encoding and error handler:
* if Py_SetStandardStreamEncoding() has been called: they have the
highest priority;
* PYTHONIOENCODING environment variable;
* The UTF-8 Mode uses UTF-8/surrogateescape;
* If Python forces the usage of the ASCII encoding (ex: C locale
or POSIX locale on FreeBSD or HP-UX), use ASCII/surrogateescape;
* locale encoding: ANSI code page on Windows, UTF-8 on Android and
VxWorks, LC_CTYPE locale encoding on other platforms;
* On Windows, "surrogateescape" error handler;
* "surrogateescape" error handler if the LC_CTYPE locale is "C" or "POSIX";
* "surrogateescape" error handler if the LC_CTYPE locale has been coerced
(PEP 538);
* "strict" error handler.
Supported error handlers: "strict", "surrogateescape" and
"surrogatepass". The surrogatepass error handler is only supported
if Py_DecodeLocale() and Py_EncodeLocale() use directly the UTF-8 codec;
it's only used on Windows.
initfsencoding() updates the encoding to the Python codec name.
For example, "ANSI_X3.4-1968" is replaced with "ascii".
On Windows, sys._enablelegacywindowsfsencoding() sets the
encoding/errors to mbcs/replace at runtime.
See Py_FileSystemDefaultEncoding and Py_FileSystemDefaultEncodeErrors.
*/
wchar_t *filesystem_encoding;
wchar_t *filesystem_errors;
wchar_t *pycache_prefix; /* PYTHONPYCACHEPREFIX, -X pycache_prefix=PATH */
int parse_argv; /* Parse argv command line arguments? */
/* Command line arguments (sys.argv).
Set parse_argv to 1 to parse argv as Python command line arguments
and then strip Python arguments from argv.
If argv is empty, an empty string is added to ensure that sys.argv
always exists and is never empty. */
PyWideStringList argv;
/* Program name:
- If Py_SetProgramName() was called, use its value.
- On macOS, use PYTHONEXECUTABLE environment variable if set.
- If WITH_NEXT_FRAMEWORK macro is defined, use __PYVENV_LAUNCHER__
environment variable is set.
- Use argv[0] if available and non-empty.
- Use "python" on Windows, or "python3 on other platforms. */
wchar_t *program_name;
PyWideStringList xoptions; /* Command line -X options */
/* Warnings options: lowest to highest priority. warnings.filters
is built in the reverse order (highest to lowest priority). */
PyWideStringList warnoptions;
/* If equal to zero, disable the import of the module site and the
site-dependent manipulations of sys.path that it entails. Also disable
these manipulations if site is explicitly imported later (call
site.main() if you want them to be triggered).
Set to 0 by the -S command line option. If set to -1 (default), it is
set to !Py_NoSiteFlag. */
int site_import;
/* Bytes warnings:
* If equal to 1, issue a warning when comparing bytes or bytearray with
str or bytes with int.
* If equal or greater to 2, issue an error.
Incremented by the -b command line option. If set to -1 (default), inherit
Py_BytesWarningFlag value. */
int bytes_warning;
/* If greater than 0, enable inspect: when a script is passed as first
argument or the -c option is used, enter interactive mode after
executing the script or the command, even when sys.stdin does not appear
to be a terminal.
Incremented by the -i command line option. Set to 1 if the PYTHONINSPECT
environment variable is non-empty. If set to -1 (default), inherit
Py_InspectFlag value. */
int inspect;
/* If greater than 0: enable the interactive mode (REPL).
Incremented by the -i command line option. If set to -1 (default),
inherit Py_InteractiveFlag value. */
int interactive;
/* Optimization level.
Incremented by the -O command line option. Set by the PYTHONOPTIMIZE
environment variable. If set to -1 (default), inherit Py_OptimizeFlag
value. */
int optimization_level;
/* If greater than 0, enable the debug mode: turn on parser debugging
output (for expert only, depending on compilation options).
Incremented by the -d command line option. Set by the PYTHONDEBUG
environment variable. If set to -1 (default), inherit Py_DebugFlag
value. */
int parser_debug;
/* If equal to 0, Python won't try to write ``.pyc`` files on the
import of source modules.
Set to 0 by the -B command line option and the PYTHONDONTWRITEBYTECODE
environment variable. If set to -1 (default), it is set to
!Py_DontWriteBytecodeFlag. */
int write_bytecode;
/* If greater than 0, enable the verbose mode: print a message each time a
module is initialized, showing the place (filename or built-in module)
from which it is loaded.
If greater or equal to 2, print a message for each file that is checked
for when searching for a module. Also provides information on module
cleanup at exit.
Incremented by the -v option. Set by the PYTHONVERBOSE environment
variable. If set to -1 (default), inherit Py_VerboseFlag value. */
int verbose;
/* If greater than 0, enable the quiet mode: Don't display the copyright
and version messages even in interactive mode.
Incremented by the -q option. If set to -1 (default), inherit
Py_QuietFlag value. */
int quiet;
/* If greater than 0, don't add the user site-packages directory to
sys.path.
Set to 0 by the -s and -I command line options , and the PYTHONNOUSERSITE
environment variable. If set to -1 (default), it is set to
!Py_NoUserSiteDirectory. */
int user_site_directory;
/* If non-zero, configure C standard steams (stdio, stdout,
stderr):
- Set O_BINARY mode on Windows.
- If buffered_stdio is equal to zero, make streams unbuffered.
Otherwise, enable streams buffering if interactive is non-zero. */
int configure_c_stdio;
/* If equal to 0, enable unbuffered mode: force the stdout and stderr
streams to be unbuffered.
Set to 0 by the -u option. Set by the PYTHONUNBUFFERED environment
variable.
If set to -1 (default), it is set to !Py_UnbufferedStdioFlag. */
int buffered_stdio;
/* Encoding of sys.stdin, sys.stdout and sys.stderr.
Value set from PYTHONIOENCODING environment variable and
Py_SetStandardStreamEncoding() function.
See also 'stdio_errors' attribute. */
wchar_t *stdio_encoding;
/* Error handler of sys.stdin and sys.stdout.
Value set from PYTHONIOENCODING environment variable and
Py_SetStandardStreamEncoding() function.
See also 'stdio_encoding' attribute. */
wchar_t *stdio_errors;
#ifdef MS_WINDOWS
/* If greater than zero, use io.FileIO instead of WindowsConsoleIO for sys
standard streams.
Set to 1 if the PYTHONLEGACYWINDOWSSTDIO environment variable is set to
a non-empty string. If set to -1 (default), inherit
Py_LegacyWindowsStdioFlag value.
See PEP 528 for more details. */
int legacy_windows_stdio;
#endif
/* Value of the --check-hash-based-pycs command line option:
- "default" means the 'check_source' flag in hash-based pycs
determines invalidation
- "always" causes the interpreter to hash the source file for
invalidation regardless of value of 'check_source' bit
- "never" causes the interpreter to always assume hash-based pycs are
valid
The default value is "default".
See PEP 552 "Deterministic pycs" for more details. */
wchar_t *check_hash_pycs_mode;
/* --- Path configuration inputs ------------ */
/* If greater than 0, suppress _PyPathConfig_Calculate() warnings on Unix.
The parameter has no effect on Windows.
If set to -1 (default), inherit !Py_FrozenFlag value. */
int pathconfig_warnings;
wchar_t *pythonpath_env; /* PYTHONPATH environment variable */
wchar_t *home; /* PYTHONHOME environment variable,
see also Py_SetPythonHome(). */
/* --- Path configuration outputs ----------- */
int module_search_paths_set; /* If non-zero, use module_search_paths */
PyWideStringList module_search_paths; /* sys.path paths. Computed if
module_search_paths_set is equal
to zero. */
wchar_t *executable; /* sys.executable */
wchar_t *base_executable; /* sys._base_executable */
wchar_t *prefix; /* sys.prefix */
wchar_t *base_prefix; /* sys.base_prefix */
wchar_t *exec_prefix; /* sys.exec_prefix */
wchar_t *base_exec_prefix; /* sys.base_exec_prefix */
wchar_t *platlibdir; /* sys.platlibdir */
/* --- Parameter only used by Py_Main() ---------- */
/* Skip the first line of the source ('run_filename' parameter), allowing use of non-Unix forms of
"#!cmd". This is intended for a DOS specific hack only.
Set by the -x command line option. */
int skip_source_first_line;
wchar_t *run_command; /* -c command line argument */
wchar_t *run_module; /* -m command line argument */
wchar_t *run_filename; /* Trailing command line argument without -c or -m */
/* --- Private fields ---------------------------- */
/* Install importlib? If set to 0, importlib is not initialized at all.
Needed by freeze_importlib. */
int _install_importlib;
/* If equal to 0, stop Python initialization before the "main" phase */
int _init_main;
/* If non-zero, disallow threads, subprocesses, and fork.
Default: 0. */
int _isolated_interpreter;
/* Original command line arguments. If _orig_argv is empty and _argv is
not equal to [''], PyConfig_Read() copies the configuration 'argv' list
into '_orig_argv' list before modifying 'argv' list (if parse_argv
is non-zero).
_PyConfig_Write() initializes Py_GetArgcArgv() to this list. */
PyWideStringList _orig_argv;
} PyConfig;
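Many PyConfig fields surface in the sys module once the interpreter is running. As a rough illustration (not a complete mapping), the following sketch starts a child interpreter with a few options and prints what it sees. The -X demo=1 option and the trailing "extra" argument are arbitrary values chosen for this example.

```python
import json
import subprocess
import sys

# Code for the child: dump a few values that originate from PyConfig fields.
CHILD = (
    "import json, sys; "
    "print(json.dumps({"
    "'optimize': sys.flags.optimize, "
    "'write_bytecode': not sys.dont_write_bytecode, "
    "'xoptions': sys._xoptions, "
    "'argv': sys.argv}))"
)

result = subprocess.run(
    [sys.executable, "-O", "-B", "-X", "demo=1", "-c", CHILD, "extra"],
    capture_output=True, text=True, check=True,
)
config_view = json.loads(result.stdout)
print(config_view)
```

Here -O feeds optimization_level, -B clears write_bytecode, -X fills xoptions, and the Python options themselves are stripped from argv before it becomes sys.argv.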
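In the same way as pymain_init() called PyPreConfig_InitPythonConfig() to create a default preconfig, it now calls PyConfig_InitPythonConfig() to create a default config. It then calls PyConfig_SetBytesArgv() to store the command line arguments in config.argv and Py_InitializeFromConfig() to perform the core and main initialization phases. We move further from pymain_init() to Py_InitializeFromConfig():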
PyStatus
Py_InitializeFromConfig(const PyConfig *config)
{
if (config == NULL) {
return _PyStatus_ERR("initialization config is NULL");
}
PyStatus status;
// Yeah, call once again
status = _PyRuntime_Initialize();
if (_PyStatus_EXCEPTION(status)) {
return status;
}
_PyRuntimeState *runtime = &_PyRuntime;
PyThreadState *tstate = NULL;
// The core initialization phase
status = pyinit_core(runtime, config, &tstate);
if (_PyStatus_EXCEPTION(status)) {
return status;
}
config = _PyInterpreterState_GetConfig(tstate->interp);
if (config->_init_main) {
// The main initialization phase
status = pyinit_main(tstate);
if (_PyStatus_EXCEPTION(status)) {
return status;
}
}
return _PyStatus_OK();
}
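We can clearly see the separation between the initialization phases. The core phase is done by pyinit_core(), and the main phase is done by pyinit_main(). The pyinit_core() function initializes the "core" of Python. More specifically,
- It prepares the configuration: parses command line arguments, reads environment variables, calculates the path configuration, chooses the encodings for the standard streams and the file system and writes all of this to config.
- It applies the configuration: configures the standard streams, generates the secret key for hashing, creates the main interpreter state and the main thread state, initializes the GIL and takes it, enables the GC, initializes built-in types and exceptions, initializes the sys module and the builtins module and sets up the import system for built-in and frozen modules.
During the first step, CPython calculates config.module_search_paths, which will be later copied to sys.path. Otherwise, this step is not very interesting, so let's look at pyinit_config() that pyinit_core() calls to perform the second step: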
static PyStatus
pyinit_config(_PyRuntimeState *runtime,
PyThreadState **tstate_p,
const PyConfig *config)
{
// Set Py_* global variables from config.
// Initialize C standard streams (stdin, stdout, stderr).
// Set secret key for hashing.
PyStatus status = pycore_init_runtime(runtime, config);
if (_PyStatus_EXCEPTION(status)) {
return status;
}
PyThreadState *tstate;
// Create the main interpreter state and the main thread state.
// Take the GIL.
status = pycore_create_interpreter(runtime, config, &tstate);
if (_PyStatus_EXCEPTION(status)) {
return status;
}
*tstate_p = tstate;
// Init types, exception, sys, builtins, importlib, etc.
status = pycore_interp_init(tstate);
if (_PyStatus_EXCEPTION(status)) {
return status;
}
/* Only when we get here is the runtime core fully initialized */
runtime->core_initialized = 1;
return _PyStatus_OK();
}
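First, pycore_init_runtime() copies some of the config fields to the corresponding global configuration variables. These global variables were used to configure CPython before PyConfig was introduced and continue to be a part of the Python/C API.
Next, pycore_init_runtime() sets the buffering modes for the stdin, stdout and stderr file pointers. On Unix-like systems this is done by calling the setvbuf() library function.
Finally, pycore_init_runtime() generates the secret key for hashing, which is stored in the _Py_HashSecret global variable. The secret key is taken along with the input by the SipHash24 hash function, which CPython uses to compute hashes. The secret key is randomly generated each time CPython starts. The purpose of randomization is to protect a Python application from hash collision DoS attacks. Python and many other languages including PHP, Ruby, JavaScript and C# were once vulnerable to such attacks. An attacker could send a set of strings with the same hash to an application and dramatically increase the CPU time required to put these strings in a dictionary because they all happen to be in the same bucket. The solution is to supply a hash function with a randomly generated key unknown to the attacker. To learn more about the attack, check this presentation. To learn more about the hash algorithm, check PEP 456. If you need to generate the key deterministically in your program, set the PYTHONHASHSEED environment variable to some fixed value.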
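The effect of PYTHONHASHSEED is easy to check from Python. The following sketch, which assumes nothing beyond the standard library, runs a child interpreter twice with the same seed and compares the string hashes. The helper name string_hash and the seed 42 are arbitrary choices for this example.

```python
import os
import subprocess
import sys

def string_hash(seed):
    """Return hash('spam') as computed by a child interpreter with the given seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('spam'))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(result.stdout)

first = string_hash(42)
second = string_hash(42)
print(first == second)   # a fixed seed makes string hashes reproducible
```

Without a fixed seed, two runs would almost certainly print different hashes, because each process generates its own key.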
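In part 1 we learned that CPython uses a thread state to store thread-specific data, such as a call stack and an exception state, and an interpreter state to store interpreter-specific data, such as loaded modules and import settings. The pycore_create_interpreter() function creates an interpreter state and a thread state for the main OS thread. We haven't seen what these structures look like yet, so here's the definition of the interpreter state struct: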
// The PyInterpreterState typedef is in Include/pystate.h.
struct _is {
// _PyRuntime.interpreters.head stores the most recently created interpreter
// `next` allows us to access all the interpreters.
struct _is *next;
// `tstate_head` points to the most recently created thread state.
// Thread states of the same interpreter are linked together.
struct _ts *tstate_head;
/* Reference to the _PyRuntime global variable. This field exists
to not have to pass runtime in addition to tstate to a function.
Get runtime from tstate: tstate->interp->runtime. */
struct pyruntimestate *runtime;
int64_t id;
// For tracking references to the interpreter
int64_t id_refcount;
int requires_idref;
PyThread_type_lock id_mutex;
int finalizing;
struct _ceval_state ceval;
struct _gc_runtime_state gc;
PyObject *modules; // sys.modules points to it
PyObject *modules_by_index;
PyObject *sysdict; // points to sys.__dict__
PyObject *builtins; // points to builtins.__dict__
PyObject *importlib;
// A list of codec search functions
PyObject *codec_search_path;
PyObject *codec_search_cache;
PyObject *codec_error_registry;
int codecs_initialized;
struct _Py_unicode_state unicode;
PyConfig config;
PyObject *dict; /* Stores per-interpreter state */
PyObject *builtins_copy;
PyObject *import_func;
/* Initialized to PyEval_EvalFrameDefault(). */
_PyFrameEvalFunction eval_frame;
// See `atexit` module
void (*pyexitfunc)(PyObject *);
PyObject *pyexitmodule;
uint64_t tstate_next_unique_id;
// See `warnings` module
struct _warnings_runtime_state warnings;
// A list of audit hooks, see sys.addaudithook
PyObject *audit_hooks;
#if _PY_NSMALLNEGINTS + _PY_NSMALLPOSINTS > 0
// Small integers are preallocated in this array so that they can be shared.
// The default range is [-5, 256].
PyLongObject* small_ints[_PY_NSMALLNEGINTS + _PY_NSMALLPOSINTS];
#endif
// ... less interesting stuff for now
};
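The important thing to note here is that config belongs to the interpreter state. The configuration that was read before is stored in config of the newly created interpreter state. The thread state struct is defined as follows: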
// The PyThreadState typedef is in Include/pystate.h.
struct _ts {
// Double-linked list is used to access all thread states belonging to the same interpreter
struct _ts *prev;
struct _ts *next;
PyInterpreterState *interp;
// Reference to the current frame (it can be NULL).
// The call stack is accessible via frame->f_back.
PyFrameObject *frame;
// ... checking if recursion level is too deep
// ... tracing/profiling
/* The exception currently being raised */
PyObject *curexc_type;
PyObject *curexc_value;
PyObject *curexc_traceback;
/* The exception currently being handled, if no coroutines/generators
* are present. Always last element on the stack referred to be exc_info.
*/
_PyErr_StackItem exc_state;
/* Pointer to the top of the stack of the exceptions currently
* being handled */
_PyErr_StackItem *exc_info;
PyObject *dict; /* Stores per-thread state */
int gilstate_counter;
PyObject *async_exc; /* Asynchronous exception to raise */
unsigned long thread_id; /* Thread id where this tstate was created */
/* Unique thread state id. */
uint64_t id;
// ... less interesting stuff for now
};
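Every thread needs to have access to its thread state. When you spawn a new thread using the threading module, the thread starts executing a given target in the evaluation loop. It can access its thread state because the thread state is passed as an argument to the evaluation function.
After creating a thread state for the main OS thread, pycore_create_interpreter() initializes the GIL, which prevents multiple threads from working with Python objects at the same time. Threads wait for the GIL and take the GIL at the start of the evaluation loop.
If you write a C extension and create new threads from C, you need to manually take the GIL in order to work with Python objects. When you take the GIL, CPython associates the current thread with the corresponding thread state by storing the thread state in thread-specific storage (the pthread_setspecific() library function on Unix-like systems). That's the mechanism that allows any thread to access its thread state.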
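The thread state is not directly exposed to Python code, but some of its contents are. As a loose illustration, the sketch below uses threading.get_ident() and sys._current_frames(), which roughly correspond to the thread_id and frame fields of the thread state; the names seen and target are made up for this example.

```python
import sys
import threading

seen = {}

def target():
    # Each thread runs its target in its own stack of frames.
    ident = threading.get_ident()
    seen["worker_ident"] = ident
    # sys._current_frames() maps thread identifiers to the topmost frame
    # of each thread, much like following tstate->frame for every thread.
    seen["worker_frame_name"] = sys._current_frames()[ident].f_code.co_name

worker = threading.Thread(target=target)
worker.start()
worker.join()

seen["main_ident"] = threading.get_ident()
print(seen)
```

The worker reports a different identifier than the main thread, and its topmost frame belongs to target(), not to anything the main thread is executing.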
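After pycore_create_interpreter() creates the main interpreter state and the main thread state, pyinit_config() calls pycore_interp_init() to finish the core initialization phase. The code of pycore_interp_init() is self-explanatory: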
static PyStatus
pycore_interp_init(PyThreadState *tstate)
{
PyStatus status;
PyObject *sysmod = NULL;
status = pycore_init_types(tstate);
if (_PyStatus_EXCEPTION(status)) {
goto done;
}
status = _PySys_Create(tstate, &sysmod);
if (_PyStatus_EXCEPTION(status)) {
goto done;
}
status = pycore_init_builtins(tstate);
if (_PyStatus_EXCEPTION(status)) {
goto done;
}
status = pycore_init_import_warnings(tstate, sysmod);
done:
// Py_XDECREF() decreases the reference count of an object.
// If the reference count becomes 0, the object is deallocated.
Py_XDECREF(sysmod);
return status;
}
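The pycore_init_types() function initializes built-in types. But what does that mean? And what are types, really? As you probably know, everything you work with in Python is an object. Numbers, strings, lists, functions, modules, frame objects, user-defined classes and built-in types are all Python objects. A Python object is an instance of the PyObject struct or an instance of any other C struct that "inherits" (we'll see what that means in a moment) from PyObject. The PyObject struct has two fields: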
typedef struct _object {
_PyObject_HEAD_EXTRA // for debugging only
Py_ssize_t ob_refcnt;
PyTypeObject *ob_type;
} PyObject;
The ob_refcnt field stores a reference count, and the ob_type field points to the type of the object.
Here's an example of a simple Python object, float:
typedef struct {
PyObject ob_base; // expansion of the PyObject_HEAD macro
double ob_fval;
} PyFloatObject;
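Note how PyFloatObject "inherits" from PyObject. I say "inherits" because the C standard states that a pointer to any struct can be converted to a pointer to its first member and vice versa. This feature allows CPython to have functions that take any Python object as an argument by taking a pointer to PyObject, thus achieving polymorphism.
The reason why CPython can do something useful with PyObject is that the behavior of a Python object is determined by its type, and PyObject always has a type. A type "knows" how to create the objects of that type, how to calculate their hashes, how to add them, how to call them, how to access their attributes, how to deallocate them and much more. Types are also Python objects represented by the PyTypeObject structure. All types have the same type, which is PyType_Type. And the type of PyType_Type points to PyType_Type itself. If this explanation seems complicated, this example should not: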
$ ./python.exe -q
>>> type([])
<class 'list'>
>>> type(type([]))
<class 'type'>
>>> type(type(type([])))
<class 'type'>
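Both PyObject fields can be observed from Python as well. The following sketch is specific to CPython (sys.getrefcount() reports implementation details) and only illustrates the ideas of ob_type and ob_refcnt; it doesn't read the C struct itself.

```python
import sys

value = 3.14

# ob_type: every object knows its type, and types are objects too.
print(type(value) is float)
print(type(float) is type)
print(type(type) is type)

# The behavior comes from the type: the + operator ends up in the
# addition function registered on float.
total = type(value).__add__(value, 1.0)

# ob_refcnt: binding one more name to an object adds one reference.
obj = object()
before = sys.getrefcount(obj)
alias = obj
after = sys.getrefcount(obj)
print(after - before)
```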
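The fields of PyTypeObject are very well documented in the Python/C API Reference Manual. I only leave here the definition of the struct underlying PyTypeObject to give an idea of the amount of information that a Python type stores: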
// PyTypeObject is a typedef for struct _typeobject
struct _typeobject {
PyObject_VAR_HEAD // expands to
// PyObject ob_base;
// Py_ssize_t ob_size;
const char *tp_name; /* For printing, in format "<module>.<name>" */
Py_ssize_t tp_basicsize, tp_itemsize; /* For allocation */
/* Methods to implement standard operations */
destructor tp_dealloc;
Py_ssize_t tp_vectorcall_offset;
getattrfunc tp_getattr;
setattrfunc tp_setattr;
PyAsyncMethods *tp_as_async; /* formerly known as tp_compare (Python 2)
or tp_reserved (Python 3) */
reprfunc tp_repr;
/* Method suites for standard classes */
PyNumberMethods *tp_as_number;
PySequenceMethods *tp_as_sequence;
PyMappingMethods *tp_as_mapping;
/* More standard operations (here for binary compatibility) */
hashfunc tp_hash;
ternaryfunc tp_call;
reprfunc tp_str;
getattrofunc tp_getattro;
setattrofunc tp_setattro;
/* Functions to access object as input/output buffer */
PyBufferProcs *tp_as_buffer;
/* Flags to define presence of optional/expanded features */
unsigned long tp_flags;
const char *tp_doc; /* Documentation string */
/* Assigned meaning in release 2.0 */
/* call function for all accessible objects */
traverseproc tp_traverse;
/* delete references to contained objects */
inquiry tp_clear;
/* Assigned meaning in release 2.1 */
/* rich comparisons */
richcmpfunc tp_richcompare;
/* weak reference enabler */
Py_ssize_t tp_weaklistoffset;
/* Iterators */
getiterfunc tp_iter;
iternextfunc tp_iternext;
/* Attribute descriptor and subclassing stuff */
struct PyMethodDef *tp_methods;
struct PyMemberDef *tp_members;
struct PyGetSetDef *tp_getset;
struct _typeobject *tp_base;
PyObject *tp_dict;
descrgetfunc tp_descr_get;
descrsetfunc tp_descr_set;
Py_ssize_t tp_dictoffset;
initproc tp_init;
allocfunc tp_alloc;
newfunc tp_new;
freefunc tp_free; /* Low-level free-memory routine */
inquiry tp_is_gc; /* For PyObject_IS_GC */
PyObject *tp_bases;
PyObject *tp_mro; /* method resolution order */
PyObject *tp_cache;
PyObject *tp_subclasses;
PyObject *tp_weaklist;
destructor tp_del;
/* Type attribute cache version tag. Added in version 2.6 */
unsigned int tp_version_tag;
destructor tp_finalize;
vectorcallfunc tp_vectorcall;
};
Built-in types, such as int and list, are implemented by statically defining instances of PyTypeObject, like so:
PyTypeObject PyList_Type = {
PyVarObject_HEAD_INIT(&PyType_Type, 0)
"list",
sizeof(PyListObject),
0,
(destructor)list_dealloc, /* tp_dealloc */
0, /* tp_vectorcall_offset */
0, /* tp_getattr */
0, /* tp_setattr */
0, /* tp_as_async */
(reprfunc)list_repr, /* tp_repr */
0, /* tp_as_number */
&list_as_sequence, /* tp_as_sequence */
&list_as_mapping, /* tp_as_mapping */
PyObject_HashNotImplemented, /* tp_hash */
0, /* tp_call */
0, /* tp_str */
PyObject_GenericGetAttr, /* tp_getattro */
0, /* tp_setattro */
0, /* tp_as_buffer */
Py_TPFLAGS_DEFAULT | Py_TPFLAGS_HAVE_GC |
Py_TPFLAGS_BASETYPE | Py_TPFLAGS_LIST_SUBCLASS, /* tp_flags */
list___init____doc__, /* tp_doc */
(traverseproc)list_traverse, /* tp_traverse */
(inquiry)_list_clear, /* tp_clear */
list_richcompare, /* tp_richcompare */
0, /* tp_weaklistoffset */
list_iter, /* tp_iter */
0, /* tp_iternext */
list_methods, /* tp_methods */
0, /* tp_members */
0, /* tp_getset */
0, /* tp_base */
0, /* tp_dict */
0, /* tp_descr_get */
0, /* tp_descr_set */
0, /* tp_dictoffset */
(initproc)list___init__, /* tp_init */
PyType_GenericAlloc, /* tp_alloc */
PyType_GenericNew, /* tp_new */
PyObject_GC_Del, /* tp_free */
.tp_vectorcall = list_vectorcall,
};
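CPython also needs to initialize every built-in type. This is what we started our discussion of types with. All types require some initialization, for example, to add special methods, such as __call__() and __eq__(), to the type's dictionary and to point them to the corresponding tp_* functions. This common initialization is done by calling PyType_Ready() for each type: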
PyStatus
_PyTypes_Init(void)
{
// The names of the special methods "__hash__", "__call__", etc. are interned by this call
PyStatus status = _PyTypes_InitSlotDefs();
if (_PyStatus_EXCEPTION(status)) {
return status;
}
#define INIT_TYPE(TYPE, NAME) \
do { \
if (PyType_Ready(TYPE) < 0) { \
return _PyStatus_ERR("Can't initialize " NAME " type"); \
} \
} while (0)
INIT_TYPE(&PyBaseObject_Type, "object");
INIT_TYPE(&PyType_Type, "type");
INIT_TYPE(&_PyWeakref_RefType, "weakref");
INIT_TYPE(&_PyWeakref_CallableProxyType, "callable weakref proxy");
INIT_TYPE(&_PyWeakref_ProxyType, "weakref proxy");
INIT_TYPE(&PyLong_Type, "int");
INIT_TYPE(&PyBool_Type, "bool");
INIT_TYPE(&PyByteArray_Type, "bytearray");
INIT_TYPE(&PyBytes_Type, "str");
INIT_TYPE(&PyList_Type, "list");
INIT_TYPE(&_PyNone_Type, "None");
INIT_TYPE(&_PyNotImplemented_Type, "NotImplemented");
INIT_TYPE(&PyTraceBack_Type, "traceback");
INIT_TYPE(&PySuper_Type, "super");
INIT_TYPE(&PyRange_Type, "range");
INIT_TYPE(&PyDict_Type, "dict");
INIT_TYPE(&PyDictKeys_Type, "dict keys");
// ... 50 more types
return _PyStatus_OK();
#undef INIT_TYPE
}
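The effect of PyType_Ready() can be observed from Python. The following is a small sketch, assuming a CPython interpreter (the descriptor type names it prints are implementation details, and describe_slot() is a helper made up for this illustration). Slots filled in PyList_Type show up in list.__dict__ as slot wrappers, entries of list_methods show up as method descriptors, and a tp_hash of PyObject_HashNotImplemented shows up as __hash__ being None.

```python
# Inspect what PyType_Ready() put into the dictionary of the list type.
# Assumes CPython; the descriptor type names are implementation details.

def describe_slot(tp, name):
    """Return the type name of the object stored under `name` in tp.__dict__."""
    return type(tp.__dict__[name]).__name__

repr_kind = describe_slot(list, "__repr__")   # backed by tp_repr
len_kind = describe_slot(list, "__len__")     # backed by list_as_sequence
append_kind = describe_slot(list, "append")   # comes from list_methods (tp_methods)
hash_entry = list.__dict__["__hash__"]        # tp_hash is PyObject_HashNotImplemented

print(repr_kind, len_kind, append_kind, hash_entry)
```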
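Some built-in types require additional, type-specific initialization. For example, int needs to preallocate the small integers in the interp->small_ints array so that they can be reused, and float needs to determine how the current machine represents floating-point numbers.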
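Both pieces of type-specific initialization leave traces that are visible from Python. This is a hedged sketch that assumes CPython: the sharing of small integers is an implementation detail rather than a language guarantee, and make_int() is a helper invented for the example.

```python
import sys

def make_int(text):
    """Build an int at run time so the compiler cannot fold or share constants."""
    return int(text)

# Two independently created small integers are the same preallocated object.
small_shared = make_int("100") is make_int("100")

# The float format that was detected when the float type was initialized.
double_format = float.__getformat__("double")

print(small_shared, double_format, sys.float_info.mant_dig)
```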
Once the built-in types are initialized, pycore_interp_init() calls _PySys_Create() to create the sys module.
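Why is sys the first module to be created? It is very important, of course, because it holds the command-line arguments passed to the program (sys.argv), the list of path entries to search for modules (sys.path), a lot of system-specific and implementation-specific data (sys.version, sys.implementation, sys.thread_info, etc.) and various functions for interacting with the interpreter (sys.addaudithook(), sys.settrace(), etc.). The main reason to create the sys module so early, though, is to initialize sys.modules. It points to the interp->modules dictionary, which is also created by _PySys_Create() and acts as a cache for imported modules. It is the first place to look up a module, and it is where all loaded modules end up. The import system relies heavily on sys.modules.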
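The caching role of sys.modules is easy to check. A minimal sketch, using json purely as an example module:

```python
import importlib
import sys

import json  # the first import loads the module and stores it in sys.modules

cached = sys.modules["json"]
same_object = cached is json

# A repeated import is answered from the cache and returns the same object.
again = importlib.import_module("json")
reused = again is json

print(same_object, reused, "sys" in sys.modules)
```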
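After the call to _PySys_Create(), the sys module is only partially initialized. The functions and most of the variables are available, but invocation-specific data, such as sys.argv and sys._xoptions, and path-related configuration, such as sys.path and sys.exec_prefix, are set during the main initialization phase.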
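When the sys module is created, pycore_interp_init() calls pycore_init_builtins() to initialize the builtins module. Built-in functions such as abs(), dir() and print(), built-in types such as dict, int and str, built-in exceptions such as Exception and ValueError, and built-in constants such as False, Ellipsis and None are all members of the builtins module. Built-in functions are part of the module definition, but the other members have to be placed in the module's dictionary explicitly. The pycore_init_builtins() function does exactly that. Later on, frame->f_builtins will be set to this dictionary to look up names, so we do not need to import builtins directly.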
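To see that functions, types, exceptions and constants all live in the same module dictionary, we can look them up by hand. This is only an illustration: resolve() is a made-up helper, and sys._getframe() is a CPython-specific way to reach the current frame.

```python
import builtins
import sys

def resolve(name):
    """Look a name up directly in the dictionary of the builtins module."""
    return vars(builtins)[name]

abs_is_builtin = resolve("abs") is abs
has_exception = resolve("ValueError") is ValueError
has_constant = resolve("Ellipsis") is Ellipsis

# Every frame carries a reference to the builtins dictionary it resolves names in.
frame_builtins = sys._getframe().f_builtins
print(abs_is_builtin, has_exception, has_constant, "print" in frame_builtins)
```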
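The last step of the core initialization phase is performed by the pycore_init_import_warnings() function. You probably know that Python has a mechanism for issuing warnings, like so: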
$ ./python.exe -q
>>> import imp
<stdin>:1: DeprecationWarning: the imp module is deprecated in favour of importlib; ...
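Warnings can be ignored, turned into exceptions and displayed in various ways. CPython has filters to do that. Some filters are turned on by default, and the pycore_init_import_warnings() function is what turns them on. Most crucially, though, pycore_init_import_warnings() sets up the import system for built-in and frozen modules.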
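The three behaviours of filters can be tried with the warnings module. In this sketch legacy_api() is a hypothetical function that exists only to emit a warning, and the filters are installed temporarily inside catch_warnings() so that the interpreter's default filters are left alone.

```python
import warnings

def legacy_api():
    """A hypothetical function that warns about its own deprecation."""
    warnings.warn("legacy_api() is deprecated", DeprecationWarning, stacklevel=2)
    return 42

# Filter "always": the warning is recorded instead of being printed or suppressed.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy_api()
recorded = [str(w.message) for w in caught]

# Filter "error": the same warning is turned into an exception.
with warnings.catch_warnings():
    warnings.simplefilter("error")
    try:
        legacy_api()
        raised = False
    except DeprecationWarning:
        raised = True

# Filter "ignore": the warning is dropped silently.
with warnings.catch_warnings(record=True) as ignored:
    warnings.simplefilter("ignore")
    legacy_api()

print(recorded, raised, len(ignored))
```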
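Built-in and frozen modules are two special kinds of modules. What unites them is that they are compiled directly into the python executable. The difference is that built-in modules are written in C, while frozen modules are written in Python. How can a module written in Python be compiled into the executable? This is cleverly done by incorporating the marshaled code object of the module into the C source code.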
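An example of a frozen module is _frozen_importlib. It implements the core of the import system. To support the import of built-in and frozen modules, pycore_init_import_warnings() calls init_importlib(), and the very first thing init_importlib() does is import _frozen_importlib. It may seem that CPython has to import _frozen_importlib in order to import _frozen_importlib, but this is not the case. The _frozen_importlib module is a part of the universal API for importing any module. However, if CPython knows that it needs to import a frozen module, it can do so without relying on _frozen_importlib.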
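We can ask a running interpreter which modules are built-in and which are frozen. The sketch below assumes CPython, because _imp is a private module whose interface is not guaranteed to stay the same between versions.

```python
import sys
import _imp  # CPython-private module with low-level import helpers
import importlib._bootstrap

# Names of the modules that are written in C and linked into the executable.
sys_is_builtin = "sys" in sys.builtin_module_names

# _frozen_importlib is written in Python and frozen into the executable.
importlib_is_frozen = _imp.is_frozen("_frozen_importlib")

# The frozen module is also reachable under its regular package name.
same_module = sys.modules["_frozen_importlib"] is importlib._bootstrap

print(sys_is_builtin, importlib_is_frozen, same_module)
```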
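The _frozen_importlib module depends on two other modules. First, it needs the sys module to get access to sys.modules. Second, it needs the _imp module, which implements low-level import functions, including the functions for creating built-in and frozen modules. The problem is that _frozen_importlib cannot import any modules because the import statement depends on _frozen_importlib itself. The solution is to create the _imp module in init_importlib() and inject it and the sys module into _frozen_importlib by calling _frozen_importlib._install(sys, _imp). This bootstrapping of the import system ends the core initialization phase.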
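We leave pyinit_core() and enter pyinit_main(), which is responsible for the main initialization phase. The function performs some checks and calls init_interp_main() to do the work, which can be summarized as follows: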
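- Get the system's realtime and monotonic clocks, and ensure that time.time(), time.monotonic() and time.perf_counter() will work correctly.
- Finish the initialization of the sys module. This includes setting the path configuration variables, such as sys.path, sys.executable and sys.exec_prefix, and the invocation-specific variables, such as sys.argv and sys._xoptions.
- Add support for the import of path-based (external) modules. This is done by importing another frozen module called importlib._bootstrap_external. It enables the import of modules based on sys.path. The zipimport frozen module is imported as well. It enables the import of modules from ZIP archives.
- Normalize the names of the encodings for the file system and the standard streams. Set the error handlers used for encoding and decoding when dealing with the file system.
- Install the default signal handlers. These are the handlers that get executed when the process receives a signal like SIGINT. Custom handlers can be set up using the signal module.
- Import the io module and initialize sys.stdin, sys.stdout and sys.stderr. This is essentially done by calling io.open() on the file descriptors of the standard streams.
- Set builtins.open to io.OpenWrapper so that open() is available as a built-in function.
- Create the __main__ module, set __main__.__builtins__ to builtins and __main__.__loader__ to _frozen_importlib.BuiltinImporter.
- Import the warnings and site modules. The site module adds site-specific directories to sys.path. This is why sys.path normally contains a directory like /usr/local/lib/python3.9/site-packages/.
- Set interp->runtime->initialized = 1.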
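Several of the steps above leave results that a program can inspect as soon as it starts. The following sketch collects a few of them into a dictionary; startup_state is just a name chosen for the example, and the checks assume a regular CPython started from the command line.

```python
import builtins
import io
import sys

startup_state = {
    # sys was completed with path and invocation data.
    "path_is_list": isinstance(sys.path, list),
    "argv_is_list": isinstance(sys.argv, list),
    # The frozen module that implements path-based imports was loaded.
    "external_importer": "_frozen_importlib_external" in sys.modules,
    # open() was made available as a built-in function.
    "open_is_io_open": builtins.open is io.open,
    # The __main__ module was created.
    "main_module": "__main__" in sys.modules,
}

for key, value in startup_state.items():
    print(key, value)
```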
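The initialization of CPython is completed. The pymain_init() function returns, and we step into Py_RunMain() to see what else CPython does before it enters the evaluation loop.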
Running a Python program
The Py_RunMain() function does not look like the place where the action happens:
int
Py_RunMain(void)
{
int exitcode = 0;
pymain_run_python(&exitcode);
if (Py_FinalizeEx() < 0) {
/* Value unlikely to be confused with a non-error exit status or
other special meaning */
exitcode = 120;
}
// Free the memory that is not freed by Py_FinalizeEx()
pymain_free();
if (_Py_UnhandledKeyboardInterrupt) {
exitcode = exit_sigint();
}
return exitcode;
}
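First, Py_RunMain() calls pymain_run_python() to run Python. Second, it calls Py_FinalizeEx() to undo the initialization. The Py_FinalizeEx() function frees most of the memory that CPython is able to free, and the rest is freed by pymain_free(). Another important reason to finalize CPython is to call the exit functions, including the functions registered with the atexit module.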
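The ordering is easy to confirm: a function registered with atexit runs after the program's last statement, during finalization. The sketch starts a child interpreter so that the exit function runs in a process of its own; child_source and cleanup() are names made up for the example.

```python
import subprocess
import sys

child_source = """
import atexit

def cleanup():
    print("cleanup")

atexit.register(cleanup)
print("main done")
"""

# Run the program in a fresh interpreter and capture what it prints.
result = subprocess.run(
    [sys.executable, "-c", child_source],
    capture_output=True, text=True, check=True,
)
lines = result.stdout.splitlines()
print(lines)
```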
As you probably know, there are several ways to run python, namely:
- interactively:
$ ./cpython/python.exe
>>> import sys
>>> sys.path[:1]
['']
- from stdin:
$ echo "import sys; print(sys.path[:1])" | ./cpython/python.exe
['']
- as a command:
$ ./cpython/python.exe -c "import sys; print(sys.path[:1])"
['']
- as a script:
$ ./cpython/python.exe 03/print_path0.py
['/Users/Victor/Projects/tenthousandmeters/python_behind_the_scenes/03']
- as a module:
$ ./cpython/python.exe -m 03.print_path0
['/Users/Victor/Projects/tenthousandmeters/python_behind_the_scenes']
- and, less obviously, as a package run like a script (print_path0_package is a directory with __main__.py):
$ ./cpython/python.exe 03/print_path0_package
['/Users/Victor/Projects/tenthousandmeters/python_behind_the_scenes/03/print_path0_package']
I moved one level up from the cpython/ directory to show that different modes of invocation lead to different values of sys.path[0].
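If you'd like to reproduce the comparison without building CPython, the following sketch does it with whatever interpreter runs it. It makes a few assumptions: first_path_entry() is a helper written for the example, PYTHON* environment variables are dropped from the child's environment so that settings such as PYTHONSAFEPATH do not interfere, and paths are compared after os.path.realpath() because temporary directories are often symlinked.

```python
import os
import subprocess
import sys
import tempfile

SNIPPET = "import sys; print(sys.path[0])"

def first_path_entry(args, cwd):
    """Run a child interpreter and return the sys.path[0] it reports."""
    env = {k: v for k, v in os.environ.items() if not k.startswith("PYTHON")}
    out = subprocess.run(
        [sys.executable, *args],
        cwd=cwd, env=env, capture_output=True, text=True, check=True,
    )
    return out.stdout.rstrip("\n")

with tempfile.TemporaryDirectory() as tmp:
    tmp = os.path.realpath(tmp)
    script = os.path.join(tmp, "print_path0.py")
    with open(script, "w") as f:
        f.write(SNIPPET + "\n")

    as_command = first_path_entry(["-c", SNIPPET], cwd=tmp)
    as_script = os.path.realpath(first_path_entry([script], cwd=tmp))
    as_module = os.path.realpath(first_path_entry(["-m", "print_path0"], cwd=tmp))

    print(repr(as_command), as_script == tmp, as_module == tmp)
```

Run as a command, the child reports an empty string; run as a script, it reports the script's directory; run as a module, it reports the current working directory.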
The next function on our way, pymain_run_python(), computes the value of sys.path[0], prepends it to sys.path and runs Python in the appropriate mode according to config:
static void
pymain_run_python(int *exitcode)
{
PyInterpreterState *interp = _PyInterpreterState_GET();
PyConfig *config = (PyConfig*)_PyInterpreterState_GetConfig(interp);
// Prepend the search path to `sys.path`
PyObject *main_importer_path = NULL;
if (config->run_filename != NULL) {
// Calculate the search path for the case when the filename is a package
// (ex: directory or ZIP file) which contains __main__.py, store it in `main_importer_path`.
// Otherwise, leave `main_importer_path` unchanged.
// Handle other cases later.
if (pymain_get_importer(config->run_filename, &main_importer_path,
exitcode)) {
return;
}
}
if (main_importer_path != NULL) {
if (pymain_sys_path_add_path0(interp, main_importer_path) < 0) {
goto error;
}
}
else if (!config->isolated) {
PyObject *path0 = NULL;
// Compute the search path that will be prepended to `sys.path` for other cases.
// If running as script, then it's the directory where the script is located.
// If running as module (-m), then it's the current working directory.
// Otherwise, it's an empty string.
int res = _PyPathConfig_ComputeSysPath0(&config->argv, &path0);
if (res < 0) {
goto error;
}
if (res > 0) {
if (pymain_sys_path_add_path0(interp, path0) < 0) {
Py_DECREF(path0);
goto error;
}
Py_DECREF(path0);
}
}
PyCompilerFlags cf = _PyCompilerFlags_INIT;
// Print version and platform in the interactive mode
pymain_header(config);
// Import `readline` module to provide completion,
// line editing and history capabilities in the interactive mode
pymain_import_readline(config);
// Run Python depending on the mode of invocation (script, -m, -c, etc.)
if (config->run_command) {
*exitcode = pymain_run_command(config->run_command, &cf);
}
else if (config->run_module) {
*exitcode = pymain_run_module(config->run_module, 1);
}
else if (main_importer_path != NULL) {
*exitcode = pymain_run_module(L"__main__", 0);
}
else if (config->run_filename != NULL) {
*exitcode = pymain_run_file(config, &cf);
}
else {
*exitcode = pymain_run_stdin(config, &cf);
}
// Enter the interactive mode after executing a program.
// Enabled by `-i` and `PYTHONINSPECT`.
pymain_repl(config, &cf, exitcode);
goto done;
error:
*exitcode = pymain_exit_err_print();
done:
Py_XDECREF(main_importer_path);
}
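We won't follow all the paths, but assume that we run a Python program as a script. This leads us to the pymain_run_file() function, which checks whether the specified file can be opened, ensures that it is not a directory and calls PyRun_AnyFileExFlags(). The PyRun_AnyFileExFlags() function handles the special case when the file is a terminal (isatty(fd) returns 1). If this is the case, it enters the interactive mode: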
$ ./python.exe /dev/ttys000
>>> 1 + 1
2
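Otherwise, it calls PyRun_SimpleFileExFlags(). You should be familiar with the .pyc files that constantly pop up in __pycache__ directories alongside regular Python files. A .pyc file contains the marshaled code object of a module. It is used instead of the original .py file when we import the module, so that the compilation stage can be skipped. I guess you knew that, but did you know that it is possible to run .pyc files directly?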
$ ./cpython/python.exe 03/__pycache__/print_path0.cpython-39.pyc
['/Users/Victor/Projects/tenthousandmeters/python_behind_the_scenes/03/__pycache__']
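Here is a small, self-contained way to try this with any interpreter. It is only a sketch: the file names are invented for the example, and the source file is deleted before the run to make sure that the compiled file alone is being executed.

```python
import os
import py_compile
import subprocess
import sys
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    source_path = os.path.join(tmp, "hello.py")
    pyc_path = os.path.join(tmp, "hello.pyc")
    with open(source_path, "w") as f:
        f.write("print('hello from a .pyc file')\n")

    # Compile the source to a .pyc file next to it.
    py_compile.compile(source_path, cfile=pyc_path, doraise=True)
    os.remove(source_path)  # only the compiled file is left

    # Pass the .pyc file to the interpreter as if it were a script.
    result = subprocess.run(
        [sys.executable, pyc_path],
        capture_output=True, text=True, check=True,
    )
    pyc_output = result.stdout.strip()

print(pyc_output)
```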
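The PyRun_SimpleFileExFlags() function implements this logic. It checks whether the file is a .pyc file and whether it was compiled for the current CPython version, and if so, calls run_pyc_file(). If the file is not a .pyc file, it calls PyRun_FileExFlags(). Most importantly, though, PyRun_SimpleFileExFlags() imports the __main__ module and passes its dictionary to PyRun_FileExFlags() as both the global and the local namespace in which the file is executed.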
int
PyRun_SimpleFileExFlags(FILE *fp, const char *filename, int closeit,
PyCompilerFlags *flags)
{
PyObject *m, *d, *v;
const char *ext;
int set_file_name = 0, ret = -1;
size_t len;
m = PyImport_AddModule("__main__");
if (m == NULL)
return -1;
Py_INCREF(m);
d = PyModule_GetDict(m);
if (PyDict_GetItemString(d, "__file__") == NULL) {
PyObject *f;
f = PyUnicode_DecodeFSDefault(filename);
if (f == NULL)
goto done;
if (PyDict_SetItemString(d, "__file__", f) < 0) {
Py_DECREF(f);
goto done;
}
if (PyDict_SetItemString(d, "__cached__", Py_None) < 0) {
Py_DECREF(f);
goto done;
}
set_file_name = 1;
Py_DECREF(f);
}
// Check if a .pyc file is passed
len = strlen(filename);
ext = filename + len - (len > 4 ? 4 : 0);
if (maybe_pyc_file(fp, filename, ext, closeit)) {
FILE *pyc_fp;
/* Try to run a pyc file. First, re-open in binary */
if (closeit)
fclose(fp);
if ((pyc_fp = _Py_fopen(filename, "rb")) == NULL) {
fprintf(stderr, "python: Can't reopen .pyc file\n");
goto done;
}
if (set_main_loader(d, filename, "SourcelessFileLoader") < 0) {
fprintf(stderr, "python: failed to set __main__.__loader__\n");
ret = -1;
fclose(pyc_fp);
goto done;
}
v = run_pyc_file(pyc_fp, filename, d, d, flags);
} else {
/* When running from stdin, leave __main__.__loader__ alone */
if (strcmp(filename, "<stdin>") != 0 &&
set_main_loader(d, filename, "SourceFileLoader") < 0) {
fprintf(stderr, "python: failed to set __main__.__loader__\n");
ret = -1;
goto done;
}
v = PyRun_FileExFlags(fp, filename, Py_file_input, d, d,
closeit, flags);
}
flush_io();
if (v == NULL) {
Py_CLEAR(m);
PyErr_Print();
goto done;
}
Py_DECREF(v);
ret = 0;
done:
if (set_file_name) {
if (PyDict_DelItemString(d, "__file__")) {
PyErr_Clear();
}
if (PyDict_DelItemString(d, "__cached__")) {
PyErr_Clear();
}
}
Py_XDECREF(m);
return ret;
}
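What this function sets up can be observed from inside a script. The sketch below writes a throwaway script (probe.py, a name made up for the example) and runs it in a child interpreter. The script reports its module name, whether its globals are the dictionary of __main__, whether globals and locals are the same dictionary, and which loader was attached by set_main_loader().

```python
import os
import subprocess
import sys
import tempfile

probe_source = """
import sys
main = sys.modules["__main__"]
print(__name__)
print(globals() is vars(main))
print(globals() is locals())
print(type(__loader__).__name__)
"""

with tempfile.TemporaryDirectory() as tmp:
    probe_path = os.path.join(tmp, "probe.py")
    with open(probe_path, "w") as f:
        f.write(probe_source)
    result = subprocess.run(
        [sys.executable, probe_path],
        capture_output=True, text=True, check=True,
    )
    report = result.stdout.split()

print(report)
```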
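The PyRun_FileExFlags() function begins the compilation process. It runs the parser, gets the AST of the module and calls run_mod() to run the AST. It also creates a PyArena object, an arena allocator in which the parser and the compiler allocate the AST nodes and other short-lived objects, so that all of them can be freed at once with PyArena_Free():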
PyObject *
PyRun_FileExFlags(FILE *fp, const char *filename_str, int start, PyObject *globals,
PyObject *locals, int closeit, PyCompilerFlags *flags)
{
PyObject *ret = NULL;
mod_ty mod;
PyArena *arena = NULL;
PyObject *filename;
int use_peg = _PyInterpreterState_GET()->config._use_peg_parser;
filename = PyUnicode_DecodeFSDefault(filename_str);
if (filename == NULL)
goto exit;
arena = PyArena_New();
if (arena == NULL)
goto exit;
// Run the parser.
// By default the new PEG parser is used.
// Pass `-X oldparser` to use the old parser.
// `mod` stands for module. It's the root node of the AST.
if (use_peg) {
mod = PyPegen_ASTFromFileObject(fp, filename, start, NULL, NULL, NULL,
flags, NULL, arena);
}
else {
mod = PyParser_ASTFromFileObject(fp, filename, NULL, start, 0, 0,
flags, NULL, arena);
}
if (closeit)
fclose(fp);
if (mod == NULL) {
goto exit;
}
// Compile the AST and run.
ret = run_mod(mod, filename, globals, locals, flags, arena);
exit:
Py_XDECREF(filename);
if (arena != NULL)
PyArena_Free(arena);
return ret;
}
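run_mod() runs the compiler by calling PyAST_CompileObject(), gets the module's code object and calls run_eval_code_obj() to execute the code object. In between, it raises the exec event, which is CPython's way to notify auditing tools when something important happens inside the Python runtime. PEP 578 explains this mechanism.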
static PyObject *
run_mod(mod_ty mod, PyObject *filename, PyObject *globals, PyObject *locals,
PyCompilerFlags *flags, PyArena *arena)
{
PyThreadState *tstate = _PyThreadState_GET();
PyCodeObject *co = PyAST_CompileObject(mod, filename, flags, -1, arena);
if (co == NULL)
return NULL;
if (_PySys_Audit(tstate, "exec", "O", co) < 0) {
Py_DECREF(co);
return NULL;
}
PyObject *v = run_eval_code_obj(tstate, co, globals, locals);
Py_DECREF(co);
return v;
}
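The exec event can be observed with an audit hook. Since hooks cannot be removed once they are added, the sketch installs one in a child interpreter. It listens to the event raised by the built-in exec(), which uses the same event name as run_mod() and also passes the code object as its argument; the file name <audited> is arbitrary.

```python
import subprocess
import sys

child_source = """
import sys

seen = []

def hook(event, args):
    # Record the file name of every code object passed with the "exec" event.
    if event == "exec":
        seen.append(args[0].co_filename)

sys.addaudithook(hook)
exec(compile("pass", "<audited>", "exec"))
print(seen)
"""

result = subprocess.run(
    [sys.executable, "-c", child_source],
    capture_output=True, text=True, check=True,
)
audit_output = result.stdout.strip()
print(audit_output)
```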
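We already know from part 2 that the compiler works by:
- building a symbol table;
- creating a CFG of basic blocks; and
- assembling the CFG into a code object.
This is exactly what PyAST_CompileObject() does, so we won't discuss it.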
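The same pipeline that PyRun_FileExFlags() and run_mod() drive in C is exposed to Python code through ast.parse(), compile() and exec(). The following sketch is an analogy at the Python level rather than a trace of the C functions: it parses a one-line program, compiles the AST into a code object and executes it with one dictionary serving as both the global and the local namespace.

```python
import ast
import types

source = "answer = 6 * 7\n"

tree = ast.parse(source, filename="<demo>", mode="exec")  # parser: source -> AST
code = compile(tree, filename="<demo>", mode="exec")      # compiler: AST -> code object
namespace = {}
exec(code, namespace, namespace)                          # evaluation: run the code object

print(type(tree).__name__, isinstance(code, types.CodeType), namespace["answer"])
```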
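run_eval_code_obj() begins a chain of trivial function calls that eventually lead us to _PyEval_EvalCode(). I paste all these functions here so that you can see where the parameters of _PyEval_EvalCode() come from: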
static PyObject *
run_eval_code_obj(PyThreadState *tstate, PyCodeObject *co, PyObject *globals, PyObject *locals)
{
PyObject *v;
// The special case when CPython is embedded. We can safely ignore it.
/*
* We explicitly re-initialize _Py_UnhandledKeyboardInterrupt every eval
* _just in case_ someone is calling into an embedded Python where they
* don't care about an uncaught KeyboardInterrupt exception (why didn't they
* leave config.install_signal_handlers set to 0?!?) but then later call
* Py_Main() itself (which _checks_ this flag and dies with a signal after
* its interpreter exits). We don't want a previous embedded interpreter's
* uncaught exception to trigger an unexplained signal exit from a future
* Py_Main() based one.
*/
_Py_UnhandledKeyboardInterrupt = 0;
/* Set globals['__builtins__'] if it doesn't exist */
// In our case, it's been already set to the `builtins` module during the main initialization.
if (globals != NULL && PyDict_GetItemString(globals, "__builtins__") == NULL) {
if (PyDict_SetItemString(globals, "__builtins__",
tstate->interp->builtins) < 0) {
return NULL;
}
}
v = PyEval_EvalCode((PyObject*)co, globals, locals);
if (!v && _PyErr_Occurred(tstate) == PyExc_KeyboardInterrupt) {
_Py_UnhandledKeyboardInterrupt = 1;
}
return v;
}
PyObject *
PyEval_EvalCode(PyObject *co, PyObject *globals, PyObject *locals)
{
return PyEval_EvalCodeEx(co,
globals, locals,
(PyObject **)NULL, 0,
(PyObject **)NULL, 0,
(PyObject **)NULL, 0,
NULL, NULL);
}
PyObject *
PyEval_EvalCodeEx(PyObject *_co, PyObject *globals, PyObject *locals,
PyObject *const *args, int argcount,
PyObject *const *kws, int kwcount,
PyObject *const *defs, int defcount,
PyObject *kwdefs, PyObject *closure)
{
return _PyEval_EvalCodeWithName(_co, globals, locals,
args, argcount,
kws, kws != NULL ? kws + 1 : NULL,
kwcount, 2,
defs, defcount,
kwdefs, closure,
NULL, NULL);
}
PyObject *
_PyEval_EvalCodeWithName(PyObject *_co, PyObject *globals, PyObject *locals,
PyObject *const *args, Py_ssize_t argcount,
PyObject *const *kwnames, PyObject *const *kwargs,
Py_ssize_t kwcount, int kwstep,
PyObject *const *defs, Py_ssize_t defcount,
PyObject *kwdefs, PyObject *closure,
PyObject *name, PyObject *qualname)
{
PyThreadState *tstate = _PyThreadState_GET();
return _PyEval_EvalCode(tstate, _co, globals, locals,
args, argcount,
kwnames, kwargs,
kwcount, kwstep,
defs, defcount,
kwdefs, closure,
name, qualname);
}
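Recall that a code object describes what a piece of code does, but to execute a code object, CPython needs to create a state for it, which is what a frame object is. _PyEval_EvalCode() creates a frame object for a given code object with the specified parameters. In our case, most of the parameters are NULL, so little has to be done. Much more work is required when CPython executes, for example, a function's code object with different kinds of arguments passed. As a result, _PyEval_EvalCode() is nearly 300 lines long. We'll see what most of them are for in the next parts. For now, you can skim through _PyEval_EvalCode() to make sure that in the end it calls _PyEval_EvalFrame() to evaluate the created frame object: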
PyObject *
_PyEval_EvalCode(PyThreadState *tstate,
PyObject *_co, PyObject *globals, PyObject *locals,
PyObject *const *args, Py_ssize_t argcount,
PyObject *const *kwnames, PyObject *const *kwargs,
Py_ssize_t kwcount, int kwstep,
PyObject *const *defs, Py_ssize_t defcount,
PyObject *kwdefs, PyObject *closure,
PyObject *name, PyObject *qualname)
{
assert(is_tstate_valid(tstate));
PyCodeObject* co = (PyCodeObject*)_co;
PyFrameObject *f;
PyObject *retval = NULL;
PyObject **fastlocals, **freevars;
PyObject *x, *u;
const Py_ssize_t total_args = co->co_argcount + co->co_kwonlyargcount;
Py_ssize_t i, j, n;
PyObject *kwdict;
if (globals == NULL) {
_PyErr_SetString(tstate, PyExc_SystemError,
"PyEval_EvalCodeEx: NULL globals");
return NULL;
}
/* Create the frame */
f = _PyFrame_New_NoTrack(tstate, co, globals, locals);
if (f == NULL) {
return NULL;
}
fastlocals = f->f_localsplus;
freevars = f->f_localsplus + co->co_nlocals;
/* Create a dictionary for keyword parameters (**kwags) */
if (co->co_flags & CO_VARKEYWORDS) {
kwdict = PyDict_New();
if (kwdict == NULL)
goto fail;
i = total_args;
if (co->co_flags & CO_VARARGS) {
i++;
}
SETLOCAL(i, kwdict);
}
else {
kwdict = NULL;
}
/* Copy all positional arguments into local variables */
if (argcount > co->co_argcount) {
n = co->co_argcount;
}
else {
n = argcount;
}
for (j = 0; j < n; j++) {
x = args[j];
Py_INCREF(x);
SETLOCAL(j, x);
}
/* Pack other positional arguments into the *args argument */
if (co->co_flags & CO_VARARGS) {
u = _PyTuple_FromArray(args + n, argcount - n);
if (u == NULL) {
goto fail;
}
SETLOCAL(total_args, u);
}
/* Handle keyword arguments passed as two strided arrays */
kwcount *= kwstep;
for (i = 0; i < kwcount; i += kwstep) {
PyObject **co_varnames;
PyObject *keyword = kwnames[i];
PyObject *value = kwargs[i];
Py_ssize_t j;
if (keyword == NULL || !PyUnicode_Check(keyword)) {
_PyErr_Format(tstate, PyExc_TypeError,
"%U() keywords must be strings",
co->co_name);
goto fail;
}
/* Speed hack: do raw pointer compares. As names are
normally interned this should almost always hit. */
co_varnames = ((PyTupleObject *)(co->co_varnames))->ob_item;
for (j = co->co_posonlyargcount; j < total_args; j++) {
PyObject *name = co_varnames[j];
if (name == keyword) {
goto kw_found;
}
}
/* Slow fallback, just in case */
for (j = co->co_posonlyargcount; j < total_args; j++) {
PyObject *name = co_varnames[j];
int cmp = PyObject_RichCompareBool( keyword, name, Py_EQ);
if (cmp > 0) {
goto kw_found;
}
else if (cmp < 0) {
goto fail;
}
}
assert(j >= total_args);
if (kwdict == NULL) {
if (co->co_posonlyargcount
&& positional_only_passed_as_keyword(tstate, co,
kwcount, kwnames))
{
goto fail;
}
_PyErr_Format(tstate, PyExc_TypeError,
"%U() got an unexpected keyword argument '%S'",
co->co_name, keyword);
goto fail;
}
if (PyDict_SetItem(kwdict, keyword, value) == -1) {
goto fail;
}
continue;
kw_found:
if (GETLOCAL(j) != NULL) {
_PyErr_Format(tstate, PyExc_TypeError,
"%U() got multiple values for argument '%S'",
co->co_name, keyword);
goto fail;
}
Py_INCREF(value);
SETLOCAL(j, value);
}
/* Check the number of positional arguments */
if ((argcount > co->co_argcount) && !(co->co_flags & CO_VARARGS)) {
too_many_positional(tstate, co, argcount, defcount, fastlocals);
goto fail;
}
/* Add missing positional arguments (copy default values from defs) */
if (argcount < co->co_argcount) {
Py_ssize_t m = co->co_argcount - defcount;
Py_ssize_t missing = 0;
for (i = argcount; i < m; i++) {
if (GETLOCAL(i) == NULL) {
missing++;
}
}
if (missing) {
missing_arguments(tstate, co, missing, defcount, fastlocals);
goto fail;
}
if (n > m)
i = n - m;
else
i = 0;
for (; i < defcount; i++) {
if (GETLOCAL(m+i) == NULL) {
PyObject *def = defs[i];
Py_INCREF(def);
SETLOCAL(m+i, def);
}
}
}
/* Add missing keyword arguments (copy default values from kwdefs) */
if (co->co_kwonlyargcount > 0) {
Py_ssize_t missing = 0;
for (i = co->co_argcount; i < total_args; i++) {
PyObject *name;
if (GETLOCAL(i) != NULL)
continue;
name = PyTuple_GET_ITEM(co->co_varnames, i);
if (kwdefs != NULL) {
PyObject *def = PyDict_GetItemWithError(kwdefs, name);
if (def) {
Py_INCREF(def);
SETLOCAL(i, def);
continue;
}
else if (_PyErr_Occurred(tstate)) {
goto fail;
}
}
missing++;
}
if (missing) {
missing_arguments(tstate, co, missing, -1, fastlocals);
goto fail;
}
}
/* Allocate and initialize storage for cell vars, and copy free
vars into frame. */
for (i = 0; i < PyTuple_GET_SIZE(co->co_cellvars); ++i) {
PyObject *c;
Py_ssize_t arg;
/* Possibly account for the cell variable being an argument. */
if (co->co_cell2arg != NULL &&
(arg = co->co_cell2arg[i]) != CO_CELL_NOT_AN_ARG) {
c = PyCell_New(GETLOCAL(arg));
/* Clear the local copy. */
SETLOCAL(arg, NULL);
}
else {
c = PyCell_New(NULL);
}
if (c == NULL)
goto fail;
SETLOCAL(co->co_nlocals + i, c);
}
/* Copy closure variables to free variables */
for (i = 0; i < PyTuple_GET_SIZE(co->co_freevars); ++i) {
PyObject *o = PyTuple_GET_ITEM(closure, i);
Py_INCREF(o);
freevars[PyTuple_GET_SIZE(co->co_cellvars) + i] = o;
}
/* Handle generator/coroutine/asynchronous generator */
if (co->co_flags & (CO_GENERATOR | CO_COROUTINE | CO_ASYNC_GENERATOR)) {
PyObject *gen;
int is_coro = co->co_flags & CO_COROUTINE;
/* Don't need to keep the reference to f_back, it will be set
* when the generator is resumed. */
Py_CLEAR(f->f_back);
/* Create a new generator that owns the ready to run frame
* and return that as the value. */
if (is_coro) {
gen = PyCoro_New(f, name, qualname);
} else if (co->co_flags & CO_ASYNC_GENERATOR) {
gen = PyAsyncGen_New(f, name, qualname);
} else {
gen = PyGen_NewWithQualName(f, name, qualname);
}
if (gen == NULL) {
return NULL;
}
_PyObject_GC_TRACK(f);
return gen;
}
retval = _PyEval_EvalFrame(tstate, f, 0);
fail: /* Jump here from prelude on failure */
/* decref'ing the frame can cause __del__ methods to get invoked,
which can call back into Python. While we're done with the
current Python frame (f), the associated C stack is still in use,
so recursion_depth must be boosted for the duration.
*/
if (Py_REFCNT(f) > 1) {
Py_DECREF(f);
_PyObject_GC_TRACK(f);
}
else {
++tstate->recursion_depth;
Py_DECREF(f);
--tstate->recursion_depth;
}
return retval;
}
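Most of this function is about binding arguments, and the fields of the code object it reads are available from Python. The sketch below uses a made-up function f() with every kind of parameter (it needs Python 3.8+ for the positional-only marker) to show how co_argcount, co_kwonlyargcount and the CO_VARARGS and CO_VARKEYWORDS flags determine the layout of the local variables: positional parameters first, then keyword-only ones, then the slots for *args and **kwargs.

```python
import inspect

def f(a, b=2, /, c=3, *args, d, e=5, **kwargs):
    return a, b, c, args, d, e, kwargs

co = f.__code__
layout = {
    "co_argcount": co.co_argcount,              # a, b, c
    "co_posonlyargcount": co.co_posonlyargcount,  # a, b
    "co_kwonlyargcount": co.co_kwonlyargcount,  # d, e
    "has_varargs": bool(co.co_flags & inspect.CO_VARARGS),
    "has_varkeywords": bool(co.co_flags & inspect.CO_VARKEYWORDS),
    "co_varnames": co.co_varnames,
}

minimal = f(1, d=4)                    # defaults come from defs and kwdefs
full = f(1, 10, 20, 30, 40, d=4, z=9)  # extras go to *args and **kwargs

# Passing c both positionally and by keyword hits the "multiple values" check.
try:
    f(1, 2, 3, c=0, d=4)
    duplicate_error = None
except TypeError as exc:
    duplicate_error = type(exc).__name__

print(layout)
print(minimal, full, duplicate_error)
```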
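_PyEval_EvalFrame() is a wrapper around interp->eval_frame(), which is the frame evaluation function. It is possible to set interp->eval_frame() to a custom function. We could, for example, add a JIT compiler to CPython by replacing the default evaluation function with one that stores compiled machine code in a code object and can run such code. PEP 523 introduced this functionality in CPython 3.6.
By default, interp->eval_frame() is set to _PyEval_EvalFrameDefault(). This function, defined in Python/ceval.c, consists of nearly 3,000 lines. Today, though, we are interested in only one of them. Line 1741 begins what we've been waiting for so long: the evaluation loop.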
Conclusion
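We've discussed a lot today. We started with an overview of the CPython project, compiled CPython and stepped through its source code, studying the initialization stage along the way. I think this should give us a top-to-bottom understanding of what CPython does before it starts interpreting the bytecode. What happens after that is the subject of the next post.
Meanwhile, to solidify what we learned today and to learn more, I really recommend that you find some time to explore the CPython source code on your own. I bet you have many questions after reading this post, so you should have something to look for. Have a good time!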