fileio
The functionality provided by this module is used in Context.textFile()
for reading and in RDD.saveAsTextFile() for writing.
You can use this submodule with File.dump(), File.load() and
File.exists() to write, read and check for the existence of a file.
All methods transparently handle various schemes (for example http://,
s3:// and file://) and compression/decompression of .gz and
.bz2 files (among others).
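As a rough illustration of how a file name alone can carry both the scheme and the compression format, the sketch below splits a name into those two parts using only the standard library. The function `describe` and the `CODECS_BY_EXTENSION` mapping are hypothetical names introduced here; they are not part of pysparkling, and the library's own dispatch logic may differ.

```python
from urllib.parse import urlparse

# Hypothetical mapping from file extension to a codec label.
CODECS_BY_EXTENSION = {'.gz': 'gz', '.bz2': 'bz2'}


def describe(file_name):
    """Return (scheme, codec) inferred from a file name."""
    # Names without an explicit scheme are treated as local files here.
    scheme = urlparse(file_name).scheme or 'file'
    codec = None
    for extension, label in CODECS_BY_EXTENSION.items():
        if file_name.endswith(extension):
            codec = label
    return scheme, codec


print(describe('s3://bucket_name/data.txt.gz'))
print(describe('http://example.com/data.bz2'))
print(describe('/tmp/data.txt'))
```

A caller never has to state the scheme or the codec separately; both follow from the name.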
-
class
pysparkling.fileio.File(file_name) File object.
Parameters: file_name – Any file name.
-
static
resolve_filenames(all_expr) Resolve expressions to a list of file names.
Parameters: all_expr – A comma-separated list of expressions. The expressions can contain the wildcard characters * and ?. Spark datasets are also resolved to the paths of their individual partitions (i.e. my_data gets resolved to [my_data/part-00000, my_data/part-00001]). Returns: A list of file names. Return type: list
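The resolution rules described for `resolve_filenames` can be sketched for the local file system with the standard `glob` module. This is a minimal sketch, not the library's implementation: the name `resolve_filenames_sketch` is hypothetical, and it assumes that a dataset is a directory whose partitions are named `part-*`.

```python
import glob
import os
import tempfile


def resolve_filenames_sketch(all_expr):
    """Expand a comma-separated list of glob expressions to file names."""
    files = []
    for expr in all_expr.split(','):
        for match in sorted(glob.glob(expr.strip())):
            if os.path.isdir(match):
                # Treat a directory as a dataset: list its part-* files.
                files += sorted(glob.glob(os.path.join(match, 'part-*')))
            else:
                files.append(match)
    return files


# Build a small dataset directory to try it out.
root = tempfile.mkdtemp()
dataset = os.path.join(root, 'my_data')
os.mkdir(dataset)
for name in ('part-00000', 'part-00001', '_SUCCESS'):
    open(os.path.join(dataset, name), 'w').close()

resolved = resolve_filenames_sketch(dataset)
print([os.path.basename(f) for f in resolved])
```

The `_SUCCESS` marker file is skipped because only `part-*` entries are treated as partitions.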
-
classmethod
get_content(all_expr) Return all files that match one of the given expressions or that are in a folder matching one of them.
Parameters: all_expr – A list of expressions. The expressions can contain the wildcard characters * and ?. Returns: A list of file names. Return type: list
-
load() Load the data from a file.
Return type: io.BytesIO
-
static
-
class
pysparkling.fileio.TextFile(file_name) Derived from File.
Parameters: file_name – Any text file name.
File System
-
class
pysparkling.fileio.fs.FileSystem(file_name) Interface class for the file system.
Parameters: file_name (str) – File name.
-
static
resolve_filenames(expr) Resolve the given glob-like expression to filenames.
Return type: list
-
static
resolve_content(expr) Return all the files matching expr or in a folder matching expr.
Return type: list
-
load() Load a file to a stream.
Return type: io.BytesIO
-
load_text(encoding='utf8', encoding_errors='ignore') Load a file to a text stream, decoding it with the given encoding; encoding_errors sets how decoding errors are handled.
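To make the role of the two `load_text` arguments concrete, here is a minimal sketch that decodes a byte stream into a text stream. The helper name `load_text_sketch` is hypothetical, and returning an `io.StringIO` is an assumption made for this example; the source does not state the return type.

```python
import io


def load_text_sketch(byte_stream, encoding='utf8', encoding_errors='ignore'):
    """Decode a byte stream into a text stream."""
    text = byte_stream.read().decode(encoding, encoding_errors)
    return io.StringIO(text)


# \xff is not valid UTF-8, so it is dropped with encoding_errors='ignore'.
raw = io.BytesIO(b'caf\xc3\xa9\n\xffend\n')
print(load_text_sketch(raw).read())
```

With `encoding_errors='strict'` the same input would raise a `UnicodeDecodeError` instead of silently dropping the invalid byte.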
-
dump(stream) Dump a stream to a file.
Parameters: stream (io.BytesIO) – Input stream.
-
make_public(recursive=False) Make the file public (only on some file systems).
Parameters: recursive (bool) – Recurse. Return type: FileSystem
-
static
-
class
pysparkling.fileio.fs.Local(file_name) FileSystem implementation for the local file system.
-
static
resolve_filenames(expr) Resolve the given glob-like expression to filenames.
Return type: list
-
static
resolve_content(expr) Return all the files matching expr or in a folder matching expr.
Return type: list
-
load() Load a file to a stream.
Return type: io.BytesIO
-
load_text(encoding='utf8', encoding_errors='ignore') Load a file to a text stream, decoding it with the given encoding; encoding_errors sets how decoding errors are handled.
-
dump(stream) Dump a stream to a file.
Parameters: stream (io.BytesIO) – Input stream.
-
static
-
class
pysparkling.fileio.fs.GS(file_name) FileSystem implementation for Google Storage. Paths are of the form gs://bucket_name/file_path or gs://project_name:bucket_name/file_path.
-
project_name = None Set a default project name.
-
mime_type = 'text/plain' Default mime type.
-
static
resolve_filenames(expr) Resolve the given glob-like expression to filenames.
Return type: list
-
static
resolve_content(expr) Return all the files matching expr or in a folder matching expr.
Return type: list
-
load() Load a file to a stream.
Return type: io.BytesIO
-
load_text(encoding='utf8', encoding_errors='ignore') Load a file to a text stream, decoding it with the given encoding; encoding_errors sets how decoding errors are handled.
-
dump(stream) Dump a stream to a file.
Parameters: stream (io.BytesIO) – Input stream.
-
make_public(recursive=False) Make the file public (only on some file systems).
Parameters: recursive (bool) – Recurse. Return type: FileSystem
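The two path forms accepted by GS differ only in the optional project prefix. The sketch below shows one way to split such a path; `parse_gs_path` and its `default_project` argument are hypothetical names (the latter mirrors the role of the `project_name` class attribute), and this is not the parsing code used by pysparkling.

```python
def parse_gs_path(path, default_project=None):
    """Split a gs:// path into (project_name, bucket_name, file_path)."""
    if not path.startswith('gs://'):
        raise ValueError('not a gs:// path: ' + path)
    location, _, file_path = path[len('gs://'):].partition('/')
    if ':' in location:
        # Form gs://project_name:bucket_name/file_path
        project_name, bucket_name = location.split(':', 1)
    else:
        # Form gs://bucket_name/file_path: fall back to the default project.
        project_name, bucket_name = default_project, location
    return project_name, bucket_name, file_path


print(parse_gs_path('gs://bucket_name/file_path'))
print(parse_gs_path('gs://project_name:bucket_name/dir/file.txt'))
```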
-
-
class
pysparkling.fileio.fs.Hdfs(file_name) FileSystem implementation for HDFS.
-
static
resolve_filenames(expr) Resolve the given glob-like expression to filenames.
Return type: list
-
classmethod
resolve_content(expr) Return all the files matching expr or in a folder matching expr.
Return type: list
-
load() Load a file to a stream.
Return type: io.BytesIO
-
load_text(encoding='utf8', encoding_errors='ignore') Load a file to a text stream, decoding it with the given encoding; encoding_errors sets how decoding errors are handled.
-
dump(stream) Dump a stream to a file.
Parameters: stream (io.BytesIO) – Input stream.
-
static
-
class
pysparkling.fileio.fs.Http(file_name) FileSystem implementation for HTTP.
-
static
resolve_filenames(expr) Resolve the given glob-like expression to filenames.
Return type: list
-
load() Load a file to a stream.
Return type: io.BytesIO
-
load_text(encoding='utf8', encoding_errors='ignore') Load a file to a text stream, decoding it with the given encoding; encoding_errors sets how decoding errors are handled.
-
dump(stream) Dump a stream to a file.
Parameters: stream (io.BytesIO) – Input stream.
-
static
-
class
pysparkling.fileio.fs.S3(file_name) FileSystem implementation for S3. Use the environment variables AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID for authentication and use file paths of the form s3://bucket_name/filename.txt.
-
connection_kwargs = {} Keyword arguments for new connections. Example: set to {'anon': True} for anonymous connections.
-
classmethod
resolve_filenames(expr) Resolve the given glob-like expression to filenames.
Return type: list
-
classmethod
resolve_content(expr) Return all the files matching expr or in a folder matching expr.
Return type: list
-
load() Load a file to a stream.
Return type: io.BytesIO
-
load_text(encoding='utf8', encoding_errors='ignore') Load a file to a text stream, decoding it with the given encoding; encoding_errors sets how decoding errors are handled.
-
dump(stream) Dump a stream to a file.
Parameters: stream (io.BytesIO) – Input stream.
-
make_public(recursive=False) Make the file public (only on some file systems).
Parameters: recursive (bool) – Recurse. Return type: FileSystem
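Before pointing the S3 class at a path, it can help to check that the path has the expected form and that both credential variables are present. The helpers `split_s3_path` and `has_aws_credentials` below are hypothetical and use only the standard library; they do not reflect how pysparkling itself reads the path or the credentials.

```python
import os
from urllib.parse import urlparse


def split_s3_path(path):
    """Split s3://bucket_name/key into (bucket_name, key)."""
    parsed = urlparse(path)
    if parsed.scheme != 's3':
        raise ValueError('not an s3:// path: ' + path)
    return parsed.netloc, parsed.path.lstrip('/')


def has_aws_credentials(environ=os.environ):
    """Check that both credential variables are set and non-empty."""
    names = ('AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY')
    return all(environ.get(name) for name in names)


print(split_s3_path('s3://bucket_name/filename.txt'))
# Only one of the two variables is present in this example mapping.
print(has_aws_credentials({'AWS_ACCESS_KEY_ID': 'id'}))
```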
-
Codec
-
class
pysparkling.fileio.codec.Codec Codec.
-
compress(stream) Compress.
Parameters: stream (io.BytesIO) – Uncompressed input stream. Return type: io.BytesIO
-
decompress(stream) Decompress.
Parameters: stream (io.BytesIO) – Compressed input stream. Return type: io.BytesIO
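The compress/decompress pair works stream to stream: each method takes an io.BytesIO and returns a new one. The following sketch mirrors that interface with a pass-through base class and a gzip variant built on the standard library. The class names `CodecSketch` and `GzSketch` are hypothetical, and the pass-through behaviour of the base class is an assumption made for this example.

```python
import gzip
import io


class CodecSketch:
    """Pass-through codec: streams are returned unchanged."""

    def compress(self, stream):
        return stream

    def decompress(self, stream):
        return stream


class GzSketch(CodecSketch):
    """gzip variant of the same interface."""

    def compress(self, stream):
        return io.BytesIO(gzip.compress(stream.read()))

    def decompress(self, stream):
        return io.BytesIO(gzip.decompress(stream.read()))


original = io.BytesIO(b'hello world\n' * 3)
compressed = GzSketch().compress(original)
restored = GzSketch().decompress(compressed)
print(restored.read())
```

Because every codec has the same two methods, a caller can pick a codec from the file extension and apply it without knowing which format is involved.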
-
-
class
pysparkling.fileio.codec.Bz2 Implementation of Codec for bz2 compression.
-
compress(stream) Compress.
Parameters: stream (io.BytesIO) – Uncompressed input stream. Return type: io.BytesIO
-
decompress(stream) Decompress.
Parameters: stream (io.BytesIO) – Compressed input stream. Return type: io.BytesIO
-
-
class
pysparkling.fileio.codec.Gz Implementation of Codec for gz compression.
-
compress(stream) Compress.
Parameters: stream (io.BytesIO) – Uncompressed input stream. Return type: io.BytesIO
-
decompress(stream) Decompress.
Parameters: stream (io.BytesIO) – Compressed input stream. Return type: io.BytesIO
-
-
class
pysparkling.fileio.codec.Lzma Implementation of Codec for lzma compression. Needs Python >= 3.3.
-
compress(stream) Compress.
Parameters: stream (io.BytesIO) – Uncompressed input stream. Return type: io.BytesIO
-
decompress(stream) Decompress.
Parameters: stream (io.BytesIO) – Compressed input stream. Return type: io.BytesIO
-
-
class
pysparkling.fileio.codec.SevenZ Implementation of Codec for 7z compression. Needs the pylzma module.
-
compress(stream) Compress.
Parameters: stream (io.BytesIO) – Uncompressed input stream. Return type: io.BytesIO
-
decompress(stream) Decompress.
Parameters: stream (io.BytesIO) – Compressed input stream. Return type: io.BytesIO
-
-
class
pysparkling.fileio.codec.Tar Implementation of Codec for tar compression.
-
compress(stream) Compress.
Parameters: stream (io.BytesIO) – Uncompressed input stream. Return type: io.BytesIO
-
decompress(stream) Decompress.
Parameters: stream (io.BytesIO) – Compressed input stream. Return type: io.BytesIO
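A tar file is an archive of named members rather than a single compressed stream, so a stream-to-stream codec has to decide how members map to bytes. The sketch below shows one possible mapping with the standard `tarfile` module: writing stores the input as a single member, and reading concatenates all regular files. The function names and the member name `data` are hypothetical choices for this example and are not taken from pysparkling.

```python
import io
import tarfile


def tar_compress(stream, member_name='data'):
    """Wrap the bytes of stream in a tar archive with a single member."""
    payload = stream.read()
    archive = io.BytesIO()
    with tarfile.open(fileobj=archive, mode='w') as tar:
        info = tarfile.TarInfo(member_name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    archive.seek(0)  # Rewind so the caller can read from the start.
    return archive


def tar_decompress(stream):
    """Concatenate the contents of all regular files in a tar archive."""
    result = io.BytesIO()
    with tarfile.open(fileobj=stream, mode='r') as tar:
        for info in tar.getmembers():
            if info.isfile():
                result.write(tar.extractfile(info).read())
    result.seek(0)
    return result


print(tar_decompress(tar_compress(io.BytesIO(b'line 1\nline 2\n'))).read())
```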
-
-
class
pysparkling.fileio.codec.TarGz Implementation of Codec for .tar.gz compression.
-
compress(stream) Compress.
Parameters: stream (io.BytesIO) – Uncompressed input stream. Return type: io.BytesIO
-
decompress(stream) Decompress.
Parameters: stream (io.BytesIO) – Compressed input stream. Return type: io.BytesIO
-
-
class
pysparkling.fileio.codec.TarBz2 Implementation of Codec for .tar.bz2 compression.
-
compress(stream) Compress.
Parameters: stream (io.BytesIO) – Uncompressed input stream. Return type: io.BytesIO
-
decompress(stream) Decompress.
Parameters: stream (io.BytesIO) – Compressed input stream. Return type: io.BytesIO
-
-
class
pysparkling.fileio.codec.Zip Implementation of Codec for zip compression.
-
compress(stream) Compress.
Parameters: stream (io.BytesIO) – Uncompressed input stream. Return type: io.BytesIO
-
decompress(stream) Decompress.
Parameters: stream (io.BytesIO) – Compressed input stream. Return type: io.BytesIO
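A zip file also holds named members, so a zip codec faces the same mapping question between members and a single byte stream. This sketch uses the standard `zipfile` module and, as an assumption for illustration only, stores the input under one member called `data` and concatenates all members on reading. The function names are hypothetical and this is not the library's implementation.

```python
import io
import zipfile


def zip_compress(stream, member_name='data'):
    """Store the bytes of stream as a single member of a zip archive."""
    payload = stream.read()
    archive = io.BytesIO()
    with zipfile.ZipFile(archive, 'w', zipfile.ZIP_DEFLATED) as z:
        z.writestr(member_name, payload)
    archive.seek(0)  # Rewind so the caller can read from the start.
    return archive


def zip_decompress(stream):
    """Concatenate the contents of all members of a zip archive."""
    result = io.BytesIO()
    with zipfile.ZipFile(stream, 'r') as z:
        for name in z.namelist():
            result.write(z.read(name))
    result.seek(0)
    return result


print(zip_decompress(zip_compress(io.BytesIO(b'a,b\n1,2\n'))).read())
```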
-