fileio

The functionality provided by this module is used in Context.textFile() for reading and in RDD.saveAsTextFile() for writing.

You can use this submodule with File.dump(), File.load() and File.exists() to write, read and check for the existence of a file. All methods transparently handle various URL schemes (for example http://, s3:// and file://) and compression/decompression of .gz and .bz2 files (among others).
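To give a rough idea of what "transparently handle" can mean, the sketch below infers a scheme and a compression format from a file name alone, using only the standard library. It is not pysparkling's implementation; describe_path and COMPRESSION_BY_SUFFIX are hypothetical names made up for this illustration.

```python
from urllib.parse import urlsplit

# Hypothetical extension table, for illustration only.
COMPRESSION_BY_SUFFIX = {'.gz': 'gzip', '.bz2': 'bz2'}


def describe_path(file_name):
    """Return (scheme, compression) inferred from the file name alone."""
    # Names without an explicit scheme are treated as local files here.
    scheme = urlsplit(file_name).scheme or 'file'
    compression = None
    for suffix, name in COMPRESSION_BY_SUFFIX.items():
        if file_name.endswith(suffix):
            compression = name
    return scheme, compression


print(describe_path('s3://bucket_name/data.txt.gz'))
print(describe_path('http://example.com/log.bz2'))
print(describe_path('/tmp/plain.txt'))
```

A caller would then pick a storage backend from the scheme and a codec from the compression name, which is the division of labour the File System and Codec sections describe.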

class pysparkling.fileio.File(file_name)

File object.

Parameters:file_name – Any file name.
static resolve_filenames(all_expr)

Resolve expressions to file names.

Parameters:all_expr – A comma-separated list of expressions. The expressions can contain the wildcard characters * and ?. Spark datasets are also resolved to the paths of their individual partitions (e.g. my_data gets resolved to [my_data/part-00000, my_data/part-00001]).
Returns:A list of file names.
Return type:list
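The resolution rules described above can be sketched for the local file system with the standard glob module. This is an approximation under stated assumptions, not the library's code: resolve_filenames_sketch is a hypothetical name, and a directory is assumed to be a Spark dataset whose partitions are named part-*.

```python
import glob
import os
import tempfile


def resolve_filenames_sketch(all_expr):
    """Expand a comma-separated list of glob expressions to local file names."""
    files = []
    for expr in all_expr.split(','):
        for match in sorted(glob.glob(expr)):
            if os.path.isdir(match):
                # Assumption: a directory is a Spark dataset; list its part files.
                files.extend(sorted(glob.glob(os.path.join(match, 'part-*'))))
            else:
                files.append(match)
    return files


# Build a throwaway dataset directory and two loose files to resolve.
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, 'my_data'))
for parts in (('my_data', 'part-00000'), ('my_data', 'part-00001'),
              ('a.txt',), ('b.txt',)):
    with open(os.path.join(root, *parts), 'w') as f:
        f.write('x')

expr = os.path.join(root, 'my_data') + ',' + os.path.join(root, '?.txt')
resolved = resolve_filenames_sketch(expr)
print([os.path.basename(p) for p in resolved])
```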
classmethod get_content(all_expr)

Return all files that match, or are in a folder that matches, one of the given expressions.

Parameters:all_expr – A list of expressions. The expressions can contain the wildcard characters * and ?.
Returns:A list of file names.
Return type:list
exists()

Checks for either a file or a directory at this location.

Returns:True or False.
load()

Load the data from a file.

Return type:io.BytesIO
dump(stream=None)

Writes a stream to a file.

Parameters:stream – A BytesIO instance. A bytes object is also accepted and is converted to BytesIO.
Return type:File
make_public(recursive=False)

Makes the file public. Currently only supported on S3.

Parameters:recursive – Whether to apply this recursively.
Return type:File
class pysparkling.fileio.TextFile(file_name)

Derived from File.

Parameters:file_name – Any text file name.
load(encoding=u'utf8', encoding_errors=u'ignore')

Load the data from a file.

Parameters:
  • encoding (str) – The character encoding of the file.
  • encoding_errors (str) – How to handle encoding errors.
Return type:

io.StringIO

dump(stream=None, encoding=u'utf8', encoding_errors=u'ignore')

Writes a stream to a file.

Parameters:
  • stream – An io.StringIO instance. A basestring is also accepted and is converted to io.StringIO.
  • encoding – (optional) The character encoding of the file.
Return type:

TextFile
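The effect of the encoding and encoding_errors parameters can be illustrated with plain standard-library calls. The helper names load_text_sketch and dump_text_sketch are hypothetical, and the sketch only assumes that the parameters are passed on to Python's usual encode/decode machinery; it does not reproduce TextFile itself.

```python
import io


def load_text_sketch(raw, encoding='utf8', encoding_errors='ignore'):
    """Decode a byte stream into an io.StringIO."""
    return io.StringIO(raw.read().decode(encoding, encoding_errors))


def dump_text_sketch(stream, encoding='utf8', encoding_errors='ignore'):
    """Encode a str or io.StringIO into an io.BytesIO."""
    if isinstance(stream, str):
        # Plain strings are wrapped so both input kinds are read the same way.
        stream = io.StringIO(stream)
    return io.BytesIO(stream.read().encode(encoding, encoding_errors))


raw_bytes = dump_text_sketch('caf\u00e9\n').getvalue()  # UTF-8 encoded bytes
# Cut the two-byte sequence for the accented letter in half.
broken = io.BytesIO(raw_bytes[:4] + b'\n')

print(repr(load_text_sketch(io.BytesIO(raw_bytes)).read()))
print(repr(load_text_sketch(broken).read()))
```

With 'ignore', the undecodable byte is dropped silently, so the second call yields a shorter string instead of raising UnicodeDecodeError. Passing 'strict' instead would surface such problems early.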

File System

class pysparkling.fileio.fs.FileSystem(file_name)

Interface class for the file system.

Parameters:file_name (str) – File name.
static resolve_filenames(expr)

Resolve the given glob-like expression to filenames.

Return type:list
static resolve_content(expr)

Return all the files that match expr or are in a folder that matches expr.

Return type:list
exists()

Check whether the given file_name exists.

Return type:bool
load()

Load a file to a stream.

Return type:io.BytesIO
load_text(encoding='utf8', encoding_errors='ignore')

Load a file to a text stream.

Parameters:
  • encoding (str) – Text encoding.
  • encoding_errors (str) – How to handle encoding errors.
Return type:

io.StringIO

dump(stream)

Dump a stream to a file.

Parameters:stream (io.BytesIO) – Input stream.
make_public(recursive=False)

Make the file public (only on some file systems).

Parameters:recursive (bool) – Whether to apply this recursively.
Return type:FileSystem
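To show how a backend might plug into an interface of this shape, here is a toy in-memory implementation. Both classes are hypothetical stand-ins written for this example: they mirror the documented method names but are not derived from pysparkling, and building load_text() on top of load() is just one possible design.

```python
import io


class FileSystemSketch:
    """Hypothetical stand-in mirroring the documented method names."""

    def __init__(self, file_name):
        self.file_name = file_name

    def exists(self):
        raise NotImplementedError

    def load(self):
        raise NotImplementedError

    def load_text(self, encoding='utf8', encoding_errors='ignore'):
        # Text loading expressed in terms of the binary load().
        return io.StringIO(self.load().read().decode(encoding, encoding_errors))

    def dump(self, stream):
        raise NotImplementedError


class InMemory(FileSystemSketch):
    """Toy backend that keeps file contents in a class-level dict."""

    store = {}

    def exists(self):
        return self.file_name in self.store

    def load(self):
        return io.BytesIO(self.store[self.file_name])

    def dump(self, stream):
        self.store[self.file_name] = stream.read()
        return self


f = InMemory('mem://greeting.txt')
print(f.exists())
f.dump(io.BytesIO(b'hello'))
print(f.exists(), f.load_text().read())
```

Because every backend exposes the same few methods, code that reads or writes data does not need to know whether the bytes come from local disk, HTTP or an object store.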
class pysparkling.fileio.fs.Local(file_name)

FileSystem implementation for the local file system.

class pysparkling.fileio.fs.GS(file_name)

FileSystem implementation for Google Storage.

Paths are of the form gs://bucket_name/file_path or gs://project_name:bucket_name/file_path.
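The two path forms can be told apart by looking for a colon in front of the bucket name. The parser below is a minimal sketch of that idea; parse_gs_path and its default_project argument are hypothetical and are not part of the GS class.

```python
def parse_gs_path(file_name, default_project=None):
    """Split a gs:// path into (project_name, bucket_name, file_path)."""
    prefix = 'gs://'
    if not file_name.startswith(prefix):
        raise ValueError('not a gs:// path: ' + file_name)
    # Everything up to the first slash names the bucket (and maybe the project).
    location, _, file_path = file_name[len(prefix):].partition('/')
    project, sep, bucket = location.rpartition(':')
    if not sep:
        # No explicit project in the path: fall back to the default.
        project = default_project
    return project, bucket, file_path


print(parse_gs_path('gs://bucket_name/file_path'))
print(parse_gs_path('gs://project_name:bucket_name/file_path'))
```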

mime_type = 'text/plain'

Default mime type.

project_name = None

Set a default project name.

class pysparkling.fileio.fs.Hdfs(file_name)

FileSystem implementation for HDFS.

class pysparkling.fileio.fs.Http(file_name)

FileSystem implementation for HTTP.

class pysparkling.fileio.fs.S3(file_name)

FileSystem implementation for S3.

Use environment variables AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID for auth and use file paths of the form s3://bucket_name/filename.txt.

connection_kwargs = {}

Keyword arguments for new connections. Example: set to {'anon': True} for anonymous connections.

Codec

class pysparkling.fileio.codec.Codec

Codec.

compress(stream)

Compress.

Parameters:stream (io.BytesIO) – Uncompressed input stream.
Return type:io.BytesIO
decompress(stream)

Decompress.

Parameters:stream (io.BytesIO) – Compressed input stream.
Return type:io.BytesIO
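A codec with this stream-in, stream-out shape can be sketched with the standard gzip module. GzSketch is a hypothetical class written for this example and is not the library's Gz implementation; it only shows how compress() and decompress() can be exact inverses on io.BytesIO streams.

```python
import gzip
import io


class GzSketch:
    """Hypothetical gzip codec with the documented compress/decompress shape."""

    def compress(self, stream):
        # Read the whole uncompressed stream and return a new compressed one.
        return io.BytesIO(gzip.compress(stream.read()))

    def decompress(self, stream):
        # Read the whole compressed stream and return the original bytes.
        return io.BytesIO(gzip.decompress(stream.read()))


codec = GzSketch()
original = b'line one\nline two\n' * 100
packed = codec.compress(io.BytesIO(original))
unpacked = codec.decompress(packed)

print(len(original), len(packed.getvalue()))
print(unpacked.read() == original)
```

Reading the full stream into memory keeps the sketch short; for very large files a chunked approach would be preferable.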
class pysparkling.fileio.codec.Bz2

Implementation of Codec for bz2 compression.

class pysparkling.fileio.codec.Gz

Implementation of Codec for gz compression.

class pysparkling.fileio.codec.Lzma

Implementation of Codec for lzma compression.

Needs Python >= 3.3.

class pysparkling.fileio.codec.SevenZ

Implementation of Codec for 7z compression.

Needs the pylzma module.

class pysparkling.fileio.codec.Tar

Implementation of Codec for tar archives.

class pysparkling.fileio.codec.TarGz

Implementation of Codec for .tar.gz compression.

class pysparkling.fileio.codec.TarBz2

Implementation of Codec for .tar.bz2 compression.

class pysparkling.fileio.codec.Zip

Implementation of Codec for zip compression.