The HDFS API

The HDFS API allows you to connect to an HDFS installation, read and write files, and get information on files, directories and global file system properties:

>>> import pydoop.hdfs as hdfs
>>> hdfs.mkdir("test")
>>> hdfs.dump("hello", "test/hello.txt")
>>> text = hdfs.load("test/hello.txt")
>>> print text
hello
>>> hdfs.ls("test")
['hdfs://localhost:9000/user/simleo/test/hello.txt']
>>> for k, v in hdfs.lsl("test")[0].iteritems():
...     print "%s = %s" % (k, v)
...
kind = file
group = supergroup
name = hdfs://localhost:9000/user/simleo/test/hello.txt
last_mod = 1333119543
replication = 1
owner = simleo
permissions = 420
block_size = 67108864
last_access = 1333119543
size = 5
>>> hdfs.cp("test", "test.copy")
>>> hdfs.ls("test.copy")
['hdfs://localhost:9000/user/simleo/test.copy/hello.txt']
>>> hdfs.get("test/hello.txt", "/tmp/hello.txt")
>>> with open("/tmp/hello.txt") as f:
...     print f.read()
...
hello
>>> hdfs.put("/tmp/hello.txt", "test.copy/hello.txt.copy")
>>> for x in hdfs.ls("test.copy"): print x
...
hdfs://localhost:9000/user/simleo/test.copy/hello.txt
hdfs://localhost:9000/user/simleo/test.copy/hello.txt.copy
>>> with hdfs.open("test/hello.txt", "r") as fi:
...     print fi.read(3)
...
hel
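
Global file system properties, mentioned above, are exposed by the connection object described in the next section. The following is a minimal sketch, assuming the connection object provides capacity(), used() and default_block_size() methods mirroring the corresponding libhdfs calls:

import pydoop.hdfs as hdfs

fs = hdfs.hdfs()  # connect to the default HDFS instance
# method names below are assumed to mirror hdfsGetCapacity,
# hdfsGetUsed and hdfsGetDefaultBlockSize from libhdfs
print "capacity (bytes): %d" % fs.capacity()
print "used (bytes): %d" % fs.used()
print "default block size (bytes): %d" % fs.default_block_size()
fs.close()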

Low-level API

Pydoop’s HDFS API can also be used at a lower level that mirrors Hadoop’s C HDFS API (libhdfs). The following example builds statistics of HDFS usage by block size; it does so by directly instantiating an hdfs object, which represents an open connection to an HDFS instance:

import pydoop.hdfs as hdfs

ROOT = "pydoop_test_tree"

def treewalker(fs, root_info):
  # recursively yield the path info of root_info and of all its descendants
  yield root_info
  if root_info['kind'] == 'directory':
    for info in fs.list_directory(root_info['name']):
      for item in treewalker(fs, info):
        yield item

def usage_by_bs(fs, root):
  # map each block size found under root to the total number of bytes
  # stored in files that use that block size
  stats = {}
  root_info = fs.get_path_info(root)
  for info in treewalker(fs, root_info):
    if info['kind'] == 'directory':
      continue
    bs = int(info['block_size'])
    size = int(info['size'])
    stats[bs] = stats.get(bs, 0) + size
  return stats

if __name__ == "__main__":
  fs = hdfs.hdfs()
  print "BS (MB)\tBYTES USED"
  for k, v in sorted(usage_by_bs(fs, ROOT).iteritems()):
    print "%d\t%d" % (k/2**20, v)
  fs.close()
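
The connection object can also perform file I/O directly. The sketch below is an illustration under assumptions rather than part of the example above: it assumes the constructor accepts explicit host and port arguments (mirroring hdfsConnect) and that open_file() takes a path and a mode string (mirroring hdfsOpenFile):

import pydoop.hdfs as hdfs

# host and port are assumptions matching the URLs shown earlier
fs = hdfs.hdfs(host="localhost", port=9000)
# write a small file through the low-level interface
f = fs.open_file("low_level_hello.txt", "w")
f.write("hello from the low-level API")
f.close()
# read back the first five bytes
f = fs.open_file("low_level_hello.txt")
print f.read(5)
f.close()
fs.close()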

Full source code for the block size statistics example is located under examples/hdfs in the Pydoop distribution. You should be able to run it from the Pydoop root directory with:

cd examples/hdfs
./run [TREE_DEPTH] [TREE_SPAN]

See the HDFS API reference for a complete feature list.