pydoop.hdfs — HDFS API

This module allows you to connect to an HDFS installation, read and write files, and get information on files, directories and global filesystem properties.

Configuration

The hdfs module is built on top of libhdfs, which is in turn a JNI wrapper around the Java file system code: therefore, for the module to work properly, the CLASSPATH environment variable must include the paths to all relevant Hadoop jars. Pydoop sets this up for you, but it needs to know where your Hadoop installation and your Hadoop configuration directory are located: if Pydoop cannot find these directories automatically, make sure that the HADOOP_HOME and HADOOP_CONF_DIR environment variables are set to the appropriate values.

Another important environment variable for this module is LIBHDFS_OPTS. It is used to set options for the JVM on top of which the module runs, most notably the amount of memory it uses. If LIBHDFS_OPTS is not set, the C libhdfs falls back to your system's default maximum heap size, typically 1 GB. In our experience, this is much more than most applications need and adds a lot of unnecessary memory overhead. For this reason, the hdfs module sets LIBHDFS_OPTS to -Xmx48m, a value we have found appropriate for most applications. If your needs are different, set the environment variable externally and it will override the above setting.
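
As a sketch, assuming a hypothetical Hadoop installation under /opt/hadoop, the variables can either be exported in the shell before starting Python or set from Python itself, provided this happens before pydoop.hdfs is first imported:

>>> import os
>>> os.environ['HADOOP_HOME'] = '/opt/hadoop'            # hypothetical install dir
>>> os.environ['HADOOP_CONF_DIR'] = '/opt/hadoop/conf'   # hypothetical conf dir
>>> os.environ['LIBHDFS_OPTS'] = '-Xmx128m'              # raise the JVM heap cap to 128 MB
>>> import pydoop.hdfs as hdfs                           # import only after the variables are set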

pydoop.hdfs.chmod(hdfs_path, mode, user=None)

Change file mode bits.

Parameters:
  • hdfs_path (string) – the path to the file or directory
  • mode (int) – the bitmask to set it to (e.g., 0777)
  • user (string) – perform the operation as this user; defaults to the current user
pydoop.hdfs.cp(src_hdfs_path, dest_hdfs_path, **kwargs)

Copy the contents of src_hdfs_path to dest_hdfs_path.

Additional keyword arguments, if any, are handled like in open(). If src_hdfs_path is a directory, its contents will be copied recursively.

pydoop.hdfs.dump(data, hdfs_path, **kwargs)

Write data to hdfs_path.

Additional keyword arguments, if any, are handled like in open().

pydoop.hdfs.get(src_hdfs_path, dest_path, **kwargs)

Copy the contents of src_hdfs_path to dest_path.

dest_path is forced to be interpreted as an ordinary local path (see abspath()). Additional keyword arguments, if any, are handled like in open().

pydoop.hdfs.load(hdfs_path, **kwargs)

Read the content of hdfs_path and return it.

Additional keyword arguments, if any, are handled like in open().
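
For example, dump() and load() make a simple round trip. This is a minimal sketch: it assumes a reachable HDFS and write access to the current user's HDFS home directory, where the relative path is resolved.

>>> import pydoop.hdfs as hdfs
>>> hdfs.dump('hello, world\n', 'hello.txt')
>>> hdfs.load('hello.txt')
'hello, world\n'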

pydoop.hdfs.ls(hdfs_path, user=None, recursive=False)

Return a list of hdfs paths.

Works in the same way as lsl(), except that list items are HDFS paths instead of dictionaries of properties.

pydoop.hdfs.lsl(hdfs_path, user=None, recursive=False)

Return a list of dictionaries of file properties.

If hdfs_path is a file, there is only one item corresponding to the file itself; if it is a directory and recursive is False, each list item corresponds to a file or directory contained by it; if it is a directory and recursive is True, the list contains one item for every file or directory in the tree rooted at hdfs_path.
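
The sketch below contrasts ls() and lsl(); 'mydir' and the host/port shown are hypothetical, and the assumption that each property dictionary carries the item's full path under the 'name' key should be checked against your Pydoop version.

>>> import pydoop.hdfs as hdfs
>>> hdfs.ls('mydir')
['hdfs://localhost:9000/user/me/mydir/a.txt', 'hdfs://localhost:9000/user/me/mydir/b.txt']
>>> [d['name'] for d in hdfs.lsl('mydir')]   # same paths, read from the property dicts
['hdfs://localhost:9000/user/me/mydir/a.txt', 'hdfs://localhost:9000/user/me/mydir/b.txt']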

pydoop.hdfs.mkdir(hdfs_path, user=None)

Create a directory and its parents as needed.

pydoop.hdfs.move(src, dest, user=None)

Move or rename src to dest.

pydoop.hdfs.open(hdfs_path, mode='r', buff_size=0, replication=0, blocksize=0, readline_chunk_size=16384, user=None)

Open a file, returning an hdfs_file object.

hdfs_path and user are passed to split(), while the other args are passed to the hdfs_file constructor.
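
A minimal write-then-read sketch (it assumes a reachable HDFS; 'test.txt' is a hypothetical file created under the user's HDFS home directory):

>>> import pydoop.hdfs as hdfs
>>> f = hdfs.open('test.txt', 'w')
>>> f.write('hello\n')
6
>>> f.close()
>>> f = hdfs.open('test.txt')   # mode defaults to 'r'
>>> f.read()
'hello\n'
>>> f.close()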

pydoop.hdfs.put(src_path, dest_hdfs_path, **kwargs)

Copy the contents of src_path to dest_hdfs_path.

src_path is forced to be interpreted as an ordinary local path (see abspath()). Additional keyword arguments, if any, are handled like in open().
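
Together, put() and get() move data between the local file system and HDFS. In this sketch, /tmp/local.txt and the HDFS file name are arbitrary placeholders:

>>> with open('/tmp/local.txt', 'w') as fp:
...     fp.write('some data\n')
...
>>> import pydoop.hdfs as hdfs
>>> hdfs.put('/tmp/local.txt', 'remote.txt')         # local -> HDFS
>>> hdfs.get('remote.txt', '/tmp/local_copy.txt')    # HDFS -> local
>>> open('/tmp/local_copy.txt').read()
'some data\n'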

pydoop.hdfs.rmr(hdfs_path, user=None)

Recursively remove files and directories.

pydoop.hdfs.path – Path Name Manipulations

pydoop.hdfs.path.abspath(hdfs_path, user=None, local=False)

Return an absolute path for hdfs_path.

The user arg is passed to split(). The local argument forces hdfs_path to be interpreted as an ordinary local path:

>>> import os
>>> os.chdir('/tmp')
>>> import pydoop.hdfs.path as hpath
>>> hpath.abspath('file:/tmp')
'file:/tmp'
>>> hpath.abspath('file:/tmp', local=True)
'file:/tmp/file:/tmp'
pydoop.hdfs.path.basename(hdfs_path)

Return the final component of hdfs_path.

pydoop.hdfs.path.dirname(hdfs_path)

Return the directory component of hdfs_path.

pydoop.hdfs.path.exists(hdfs_path, user=None)

Return True if hdfs_path exists in the default HDFS, else False.

pydoop.hdfs.path.isdir(path, user=None)

Return True if path refers to a directory; False otherwise.

pydoop.hdfs.path.isfile(path, user=None)

Return True if path refers to a file; False otherwise.

pydoop.hdfs.path.join(*parts)

Join path name components, inserting / as needed.

If any component looks like an absolute path (i.e., it starts with hdfs: or file:), all previous components will be discarded.

Note that this is not the reverse of split(), but rather a specialized version of os.path.join. No check is made to determine whether the returned string is a valid HDFS path.
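
A sketch of the intended behavior follows; the inputs are arbitrary and the exact handling of leading and trailing slashes may differ slightly from what is shown:

>>> import pydoop.hdfs.path as hpath
>>> hpath.join('hdfs://host:9000/user', 'me', 'data.txt')
'hdfs://host:9000/user/me/data.txt'
>>> hpath.join('ignored', 'file:/tmp', 'data.txt')   # the file: component discards 'ignored'
'file:/tmp/data.txt'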

pydoop.hdfs.path.kind(path, user=None)

Get the kind of item that the path references.

Return None if the path doesn’t exist.

pydoop.hdfs.path.split(hdfs_path, user=None)

Split hdfs_path into a (hostname, port, path) tuple.

Parameters:
  • hdfs_path (string) – an HDFS path, e.g., hdfs://localhost:9000/user/me
  • user (string) – user name used to resolve relative paths, defaults to the current user
Return type:

tuple

Returns:

hostname, port, path
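
For example (sketch; a file: path maps to the local file system, i.e., an empty hostname and port 0):

>>> import pydoop.hdfs.path as hpath
>>> hpath.split('hdfs://localhost:9000/user/me/data.txt')
('localhost', 9000, '/user/me/data.txt')
>>> hpath.split('file:/tmp/data.txt')
('', 0, '/tmp/data.txt')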

pydoop.hdfs.fs – File System Handles

class pydoop.hdfs.fs.hdfs(host='default', port=0, user=None, groups=None)

A handle to an HDFS instance.

Parameters:
  • host (string) – hostname or IP address of the HDFS NameNode. Set to an empty string (and port to 0) to connect to the local file system; set to 'default' (and port to 0) to connect to the default (i.e., the one defined in the Hadoop configuration files) file system.
  • port (int) – the port on which the NameNode is listening
  • user (string or None) – the Hadoop domain user name. Defaults to the current UNIX user. Note that, in MapReduce applications, since tasks are spawned by the JobTracker, the default user will be the one that started the JobTracker itself.
  • groups (list) – ignored. Included for backwards compatibility.

Note: when connecting to the local file system, user is ignored (i.e., it will always be the current UNIX user).
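
A minimal sketch showing the two most common ways to get a handle:

>>> from pydoop.hdfs.fs import hdfs
>>> fs = hdfs()           # the default file system, as per the Hadoop configuration
>>> local = hdfs('', 0)   # the local file system
>>> local.host
''
>>> fs.close()
>>> local.close()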

capacity()

Return the raw capacity of the filesystem.

Return type:int
Returns:the raw capacity
chmod(path, mode)

Change file mode bits.

Parameters:
  • path (string) – the path to the file or directory
  • mode (int) – the bitmask to set it to (e.g., 0777)
Raises:

IOError

chown(path, user='', group='')

Change file owner and group.

Parameters:
  • path (string) – the path to the file or directory
  • user (string) – Hadoop username. Set to ‘’ if only setting group
  • group (string) – Hadoop group name. Set to ‘’ if only setting user
Raises:

IOError

close()

Close the HDFS handle (disconnect).

copy(from_path, to_hdfs, to_path)

Copy file from one filesystem to another.

Parameters:
  • from_path (string) – the path of the source file
  • to_hdfs – the handle to destination filesystem
  • to_path (string) – the path of the destination file
Raises:

IOError

create_directory(path)

Create directory path (non-existent parents will be created as well).

Parameters:path (string) – the path of the directory
Raises:IOError
default_block_size()

Get the default block size.

Return type:int
Returns:the default blocksize
delete(path, recursive=True)

Delete path.

Parameters:
  • path (string) – the path of the file or directory
  • recursive (bool) – if path is a directory, delete it recursively when True
Raises:

IOError when recursive is False and the directory is non-empty

exists(path)

Check if a given path exists on the filesystem.

Parameters:path (string) – the path to look for
Return type:bool
Returns:True if path exists, else False
get_hosts(path, start, length)

Get hostnames where a particular block of a file (determined by start and length) is stored. Due to replication, a single block could be present on multiple hosts.

Parameters:
  • path (string) – the path of the file
  • start (int) – the start of the block
  • length (int) – the length of the block
Return type:

list

Returns:

list of hosts that store the block
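
For instance, to find the datanodes holding the first block of a hypothetical file big.dat, one can ask for the range covering one default block starting at offset 0 (sketch):

>>> from pydoop.hdfs.fs import hdfs
>>> fs = hdfs()
>>> bs = fs.default_block_size()
>>> hosts = fs.get_hosts('big.dat', 0, bs)   # hosts storing replicas of the first block
>>> fs.close()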

get_path_info(path)

Get information about path as a dict of properties.

Parameters:path (string) – a path in the filesystem
Return type:dict
Returns:path information
Raises:IOError
host

The actual hdfs hostname (empty string for the local fs).

list_directory(path)

Get list of files and directories for path.

Parameters:path (string) – the path of the directory
Return type:list
Returns:list of files and directories in path
Raises:IOError
move(from_path, to_hdfs, to_path)

Move file from one filesystem to another.

Parameters:
  • from_path (string) – the path of the source file
  • to_hdfs – the handle to destination filesystem
  • to_path (string) – the path of the destination file
Raises:

IOError

open_file(path, flags=0, buff_size=0, replication=0, blocksize=0, readline_chunk_size=16384)

Open an HDFS file.

Pass 0 as buff_size, replication or blocksize if you want to use the default values, i.e., the ones set in the Hadoop configuration files.

Parameters:
  • path (string) – the full path to the file
  • flags (string or int) – opening flags: 'r' or os.O_RDONLY for reading, 'w' or os.O_WRONLY for writing
  • buff_size (int) – read/write buffer size in bytes
  • replication (int) – HDFS block replication
  • blocksize (int) – HDFS block size
  • readline_chunk_size (int) – the amount of bytes that hdfs_file.readline() will use for buffering
Return type:

hdfs_file

Returns:

handle to the open file
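
For example (sketch; buffer size, replication and block size are left at their defaults):

>>> from pydoop.hdfs.fs import hdfs
>>> fs = hdfs()
>>> f = fs.open_file('test.txt', 'w')
>>> f.write('hello\n')
6
>>> f.close()
>>> f = fs.open_file('test.txt')   # flags default to read-only
>>> f.read()
'hello\n'
>>> f.close()
>>> fs.close()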

port

The actual hdfs port (0 for the local fs).

rename(from_path, to_path)

Rename file.

Parameters:
  • from_path (string) – the path of the source file
  • to_path (string) – the path of the destination file
Raises:

IOError

set_replication(path, replication)

Set the replication of path to replication.

Parameters:
  • path (string) – the path of the file
  • replication (int) – the replication value
Raises:

IOError

set_working_directory(path)

Set the working directory to path. All relative paths will be resolved relative to it.

Parameters:path (string) – the path of the directory
Raises:IOError
used()

Return the total raw size of all files in the filesystem.

Return type:int
Returns:total size of files in the file system
user

The user associated with this HDFS connection.

utime(path, mtime, atime)

Change file last access and modification times.

Parameters:
  • path (string) – the path to the file or directory
  • mtime (int) – new modification time in seconds
  • atime (int) – new access time in seconds
Raises:

IOError

walk(top)

Generate path information for all paths in the tree rooted at top (top included).

The top parameter can be either an HDFS path string or a dictionary of properties as returned by get_path_info().

Parameters:top (string or dict) – an HDFS path or path info dict
Return type:iterator
Returns:path infos of files and directories in the tree rooted at top
Raises:IOError
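
For instance, to print the kind and name of every entry under a hypothetical directory mydir, assuming the property dictionaries expose 'kind' and 'name' keys (the output shown is indicative):

>>> from pydoop.hdfs.fs import hdfs
>>> fs = hdfs()
>>> for info in fs.walk('mydir'):
...     print info['kind'], info['name']
...
directory hdfs://localhost:9000/user/me/mydir
file hdfs://localhost:9000/user/me/mydir/a.txt
>>> fs.close()
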
working_directory()

Get the current working directory.

Return type:str
Returns:current working directory

pydoop.hdfs.file – HDFS File Objects

class pydoop.hdfs.file.hdfs_file(raw_hdfs_file, fs, name, flags, chunk_size=16384)

Instances of this class represent HDFS file objects.

Objects of this class should not be instantiated directly. The preferred way to open an HDFS file is with the open() function; alternatively, hdfs.open_file() can be used.
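
For line-oriented processing, a simple sketch that assumes readline() returns an empty string at EOF ('lines.txt' is a hypothetical file):

>>> import pydoop.hdfs as hdfs
>>> f = hdfs.open('lines.txt')
>>> lines = []
>>> line = f.readline()
>>> while line:
...     lines.append(line)    # each line keeps its trailing newline
...     line = f.readline()
...
>>> f.close()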

available()

Number of bytes that can be read from this input stream without blocking.

Return type:int
Returns:available bytes
close()

Close the file.

flush()

Force any buffered output to be written.

fs

The file’s hdfs instance.

mode

The I/O mode for the file.

name

The file’s fully qualified name.

next()

Return the next input line, or raise StopIteration when EOF is hit.

pread(position, length)

Read length bytes of data from the file, starting from position.

Parameters:
  • position (int) – position from which to read
  • length (int) – the number of bytes to read
Return type:

string

Returns:

the chunk of data read from the file

pread_chunk(position, chunk)

Works like pread(), but data is stored in the writable buffer chunk rather than returned. Reads at most a number of bytes equal to the size of chunk.

Parameters:
  • position (int) – position from which to read
  • chunk (writable string buffer) – a C-like string buffer, such as the one returned by the create_string_buffer function in the ctypes module
Return type:

int

Returns:

the number of bytes read
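
A sketch, reusing the hypothetical test.txt file (containing 'hello\n') from the open() example above:

>>> from ctypes import create_string_buffer
>>> import pydoop.hdfs as hdfs
>>> f = hdfs.open('test.txt')
>>> buf = create_string_buffer(4)   # 4-byte writable buffer
>>> f.pread_chunk(0, buf)           # fill the buffer starting at offset 0
4
>>> buf.raw
'hell'
>>> f.close()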

read(length=-1)

Read length bytes from the file. If length is negative or omitted, read all data until EOF.

Parameters:length (int) – the number of bytes to read
Return type:string
Returns:the chunk of data read from the file
read_chunk(chunk)

Works like read(), but data is stored in the writable buffer chunk rather than returned. Reads at most a number of bytes equal to the size of chunk.

Parameters:chunk (writable string buffer) – a C-like string buffer, such as the one returned by the create_string_buffer function in the ctypes module
Return type:int
Returns:the number of bytes read
readline()

Read and return a line of text.

Return type:string
Returns:the next line of text in the file, including the newline character
seek(position, whence=0)

Seek to position in file.

Parameters:
  • position (int) – offset in bytes to seek to
  • whence (int) – defaults to os.SEEK_SET (absolute); other values are os.SEEK_CUR (relative to the current position) and os.SEEK_END (relative to the file’s end).
size

The file’s size in bytes. This attribute is initialized when the file is opened and updated when it is closed.

tell()

Get the current byte offset in the file.

Return type:int
Returns:current offset in bytes
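
A sketch, again using the hypothetical test.txt file containing 'hello\n':

>>> import pydoop.hdfs as hdfs
>>> f = hdfs.open('test.txt')
>>> f.seek(2)          # absolute; whence defaults to os.SEEK_SET
>>> f.tell()
2
>>> f.read(3)
'llo'
>>> f.tell()
5
>>> f.close()
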
write(data)

Write data to the file.

Parameters:data (string) – the data to be written to the file
Return type:int
Returns:the number of bytes written
write_chunk(chunk)

Write data from buffer chunk to the file.

Parameters:chunk (writable string buffer) – a C-like string buffer, such as the one returned by the create_string_buffer function in the ctypes module
Return type:int
Returns:the number of bytes written