
lundi 11 janvier 2016

Storing blobs in Couchbase for Content Management

In my previous post, I talked about how to set up a flexible content management service using Couchbase as the metadata repository, on top of an Apache Chemistry server. The blobs themselves (pdf, pptx, docx, etc.) are stored in a separate file system or in a blob store.
Today, I would like to show how Couchbase can be used to store the blobs themselves, using a custom chunk manager. The idea is to store not only the metadata of a document (date of creation, creator, name, etc.) but also the blob itself.

The purpose of this new architecture is to reduce the number of different systems (and licences to pay) and also to benefit directly from the replication features offered by Couchbase.

First, let’s remember that Couchbase is not a blob store. It is a memory-first document store, with cache management tuned so that most of the data stored in Couchbase stays in RAM for fast querying. Data is also replicated between nodes inside the cluster (if replication is enabled) and optionally outside the cluster if XDCR is used. This is why a document stored in Couchbase cannot be larger than 20MB. This is a hard limit, and in real life 1MB is already a large document to store.

Knowing that, the question is: how can I store large binary data in Couchbase?
Simple answer: chunk it!

The new architecture looks now like this.



There are now two buckets in Couchbase:
  1. cmismeta : used to store metadata
  2. cmisstore : used to store blobs

When a folder is created, only the bucket cmismeta is modified with a new entry because, of course, a folder is not associated with any blob. It is simply a structure used by the user to organise the documents and navigate the folder tree. Folders are virtual. The entry point of the structure is the root folder, as described previously.

When a document (for instance a pdf or a pptx) is inserted into a folder, 3 things happen:
  • A JSON document containing all its metadata is inserted into the cmismeta bucket, with a unique key. Let’s say, for instance, that the document has the key L0NvdWNoYmFzZU92ZXJ2aWV3LnBwdHg=.
  • A new JSON document with the same key is created in the cmisstore bucket. This document contains the number of chunks, the maximum size of each chunk (the same for all chunks except the last one, which might be smaller) and the MIME type.
  • The blob attached to the document is chunked into binary pieces (the size depends on a parameter you can set in the project properties). By default, a chunk is 500KB. Each chunk is stored in the cmisstore bucket as a binary document, with the same key “L0NvdWNoYmFzZU92ZXJ2aWV3LnBwdHg=” as prefix and the suffix "::partxxx", where xxx is the chunk number (0, 1, 2, …).

For instance, if I insert a pptx called CouchbaseOverview.pptx whose size is 4476932 bytes into Couchbase, I get:
  • In bucket cmismeta, a json document called L0NvdWNoYmFzZU92ZXJ2aWV3LnBwdHg=

  • In bucket cmisstore, a json document also called L0NvdWNoYmFzZU92ZXJ2aWV3LnBwdHg=

  • 9 chunks containing binary data and called L0NvdWNoYmFzZU92ZXJ2aWV3LnBwdHg=::part0, L0NvdWNoYmFzZU92ZXJ2aWV3LnBwdHg=::part1, … , L0NvdWNoYmFzZU92ZXJ2aWV3LnBwdHg=::part8
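The chunk arithmetic above can be sketched in plain Java. BUFFER_SIZE and PART_SUFFIX mirror the project's constants, but their exact names here are assumptions:

```java
// Sketch of the chunk-count and key-naming scheme described above.
public class ChunkNaming {
    static final int BUFFER_SIZE = 500 * 1024;   // 500KB per chunk by default
    static final String PART_SUFFIX = "::part";

    // Number of chunks needed to hold `length` bytes.
    static long countParts(long length) {
        long nbparts = length / BUFFER_SIZE;
        if (length % BUFFER_SIZE > 0) nbparts++; // last, smaller chunk
        return nbparts;
    }

    // Key of chunk number i for the document with key dataId.
    static String partKey(String dataId, int i) {
        return dataId + PART_SUFFIX + i;
    }
}
```

With the default 500KB chunk size, a 4476932-byte file needs 9 chunks (8 full chunks plus a smaller last one), which matches the listing above.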

The CouchbaseStorageService is the class implementing the StorageService interface already used for local storage or S3 storage, as I showed in my previous post. The first difference is that it reuses the same CouchbaseCluster instance as the MetadataService, because only one Couchbase Environment should be instantiated, to save resources (RAM, CPU, network, etc.).

Now let’s see the writeContent method itself :

/**
 * The ContentStream is split into parts of at most BUFFER_SIZE bytes.
 */
public void writeContent(String dataId, ContentStream contentStream)
        throws StorageException {
    // count the number of parts
    long length = contentStream.getLength();
    long nbparts = length / BUFFER_SIZE;
    // the last part may be smaller than BUFFER_SIZE
    if (length - nbparts * BUFFER_SIZE > 0) nbparts++;

    JsonObject doc = JsonObject.empty();
    doc.put("count", nbparts);
    doc.put("mimetype", contentStream.getMimeType());
    doc.put("length", length);

    long totalLength = 0;
    int read = 0; // the number of bytes read for the current part
    byte[] byteArray = new byte[BUFFER_SIZE];
    for (int i = 0; i < nbparts; i++) {
        try {
            read = contentStream.getStream().read(byteArray, 0, BUFFER_SIZE);
            totalLength += read;
            writeContentPart(dataId + PART_SUFFIX + i, byteArray, read);
            doc.put(dataId + PART_SUFFIX + i, read);
        } catch (IOException e) {
            throw new StorageException("Unable to read part " + i);
        }
    }

    if (totalLength != length)
        throw new StorageException("Wrong number of bytes");

    JsonDocument jsondoc = JsonDocument.create(dataId, doc);
    bucket.upsert(jsondoc);
}

private void writeContentPart(String partId, byte[] bytesArray, int length)
        throws StorageException {
    // copy only the bytes actually read, not the whole buffer
    BinaryDocument bDoc = BinaryDocument.create(partId,
            Unpooled.copiedBuffer(bytesArray, 0, length));
    bucket.upsert(bDoc);
}

Now, how do we retrieve the file from Couchbase? The main idea is to get each part, concatenate them in the same order they were cut, and send the resulting byte array to the stream. There are probably many ways to do this; I simply implemented a straightforward one, using a single byte array into which each part is written.

private InputStream getInputStream(String dataId, StringBuffer mimeType)
        throws StorageException {
    JsonDocument doc = bucket.get(dataId);
    if (doc == null)
        throw new StorageException("Document not found");
    JsonObject json = doc.content();
    Integer nbparts = json.getInt("count");
    Integer length = json.getInt("length");

    if (nbparts == null || length == null || mimeType == null)
        throw new StorageException("Document invalid");
    mimeType.append(json.getString("mimetype"));

    byte[] byteArray = new byte[length];
    // for each part, read the content into byteArray
    int offset = 0;
    Integer partLength = null;

    for (int i = 0; i < nbparts; i++) {
        partLength = json.getInt(dataId + PART_SUFFIX + i);
        if (partLength == null)
            throw new StorageException("length of part " + i + " is mandatory");
        BinaryDocument bDoc =
                bucket.get(dataId + PART_SUFFIX + i, BinaryDocument.class);
        ByteBuf part = bDoc.content();
        byte[] dst = new byte[partLength];
        part.readBytes(dst);
        for (int k = 0; k < partLength; k++) {
            byteArray[k + offset] = dst[k];
        }
        offset += partLength;
        part.release();
    }
    return new ByteArrayInputStream(byteArray);
}

Finally, let’s see what happens in the Workbench tool provided by Apache Chemistry. I can see the document in the root folder and, if I double-click on it, the content is streamed from Couchbase and displayed in the associated viewer (here PowerPoint) based on the MIME type.

Workbench with the document opened in PowerPoint after a double click
Where can I find the code ?

The code implementing this CMIS server on top of Couchbase is available on GitHub here:

https://github.com/cecilelepape/cmis-couchbaseonly


Special thanks to Laurent Doguin.

jeudi 24 septembre 2015

Flexible and Scalable Content Management using Apache Chemistry OpenCmis and Couchbase

Is it possible to build a Content Management system that is both flexible and scalable? Flexible, so that I can choose my file storage and my metadata storage independently. Each of these parts should be scalable; I mean consistently and horizontally scalable. The idea is to build a CMS where I can add nodes when the number of documents or the number of customers grows, so that the response time to retrieve a document stays constant. And of course, the solution should follow standard CMS specifications for compatibility with existing clients.

Architecture

Yes it is. Let’s have a look at the architecture below :



At the bottom you can see two different scalable components:
  • Couchbase cluster: document and folder metadata are stored as simple JSON documents with unique identifiers. The JSON also contains the tree structure (parent id and children). Couchbase Server is well suited for this since it provides a built-in cache for high and consistent performance of key/value queries and is able to store documents in JSON format. Couchbase stores documents up to 20MB and can query them using views or N1QL (a SQL-like language to query JSON documents).
  • Distributed file system or distributed blob store: files are stored in a single container, using their identifier to retrieve them. There is no need for hierarchical folder storage. AWS S3 and OpenStack Swift are examples of this kind of storage. A local file system is available for testing purposes. The blob store can store large files as binary content.
At the application layer, the server implements the CMIS specification using Apache Chemistry OpenCmis framework. The server is a web application with a custom repository containing a metadata service to interact with Couchbase cluster for metadata storage and a storage service interacting with a distributed blob store for file content storage. 

At the client side, you can use AtomPub, SOAP or JSON to interact with the application. Apache Chemistry OpenCMIS provides several clients for testing (browser, workbench).

To make your application layer scalable, you can simply set up multiple application servers and add a load balancer on top of them, because each server is RESTful. Each request is sent to the load balancer, which chooses which application server will respond to it, as shown here:
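As a concrete illustration of this setup, a minimal nginx reverse-proxy configuration (host names and ports hypothetical) distributing requests across two such application servers could look like:

```nginx
# Two identical CMIS application servers behind one entry point.
upstream cmis_servers {
    server app1.example.com:8080;
    server app2.example.com:8080;
}

server {
    listen 80;
    location / {
        # Each request can go to either server, since the servers are stateless.
        proxy_pass http://cmis_servers;
    }
}
```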



Data modelling

The CMIS specification model includes document, folder, item and relationship objects. Item and relationship are optional. For now, let’s consider documents and folders. Let’s assume that each document belongs to a single folder and that a folder is composed of subfolders and documents.

Each object has a unique identifier (for instance a generated UUID). The root folder is a special document with a special identifier (for instance ‘@root@‘). Each folder knows its path, its parentId, its children (folders and documents), together with its name, its last modification date, etc.
Each document is a JSON object looking similar to a folder object, except that it doesn’t have children and it contains information about the content stream (length, name, MIME type).

For instance, suppose the root folder contains a subfolder folderA, and folderA contains a document doc1.pdf. Here is a sample of what the JSON documents can look like:
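An illustrative sketch of the two documents follows. The cmis:* property names are the standard CMIS property ids used by the mapping code later on; the keys, the "children" field name and the stream length are assumptions for this example.

folderA (key "1111-2222"):

```json
{
  "cmis:objectTypeId": "cmis:folder",
  "cmis:name": "folderA",
  "cmis:parentId": "@root@",
  "cmis:path": "/folderA",
  "children": ["3333-4444"]
}
```

doc1.pdf (key "3333-4444"):

```json
{
  "cmis:objectTypeId": "cmis:document",
  "cmis:name": "doc1.pdf",
  "cmis:parentId": "1111-2222",
  "cmis:contentStreamMimeType": "application/pdf",
  "cmis:contentStreamLength": 51200
}
```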


Repository

The repository is used by the CMIS framework to serve client queries, as part of its RESTful architecture. There are several methods, one for each kind of client query.

The getObject method retrieves a folder or a document and fills the objectInfos:

public ObjectData getObject(CallContext context, String objectId,
        String versionServicesId, String filter,
        Boolean includeAllowableActions, Boolean includeAcl,
        ObjectInfoHandler objectInfos) {

    boolean userReadOnly = checkUser(context, false);

    // get the file or folder
    CmisObject data = this.cbService.getCmisObject(objectId);

    // set defaults if values are not set
    boolean iaa = CouchbaseUtils.getBooleanParameter(includeAllowableActions, false);
    boolean iacl = CouchbaseUtils.getBooleanParameter(includeAcl, false);

    // gather properties (filterCollection is derived from the filter parameter)
    return compileObjectData(context, data, filterCollection, iaa, iacl,
            userReadOnly, objectInfos);
}
  1. gets the object identifier objectId from the request
  2. asks CouchbaseService cbService for the corresponding JsonDocument mapped into a CmisObject
  3. fills the ObjectInfoHandler response and sends it back to the client
Couchbase service

The metadata service is very simple: basically, it performs CRUD operations on Couchbase. It also maps CMIS objects to JsonDocuments and vice versa.

To connect to Couchbase, create an instance of CouchbaseService:

public class CouchbaseService {
    private Cluster cluster = null;
    private Bucket bucket = null;

    public CouchbaseService() {
        cluster = CouchbaseCluster.create();
        bucket = cluster.openBucket(BUCKET);
        // create the root node if it does not exist yet
        createRootFolderIfNotExists();
    }
}

To disconnect from Couchbase, call the close method:

public void close() {
    if (cluster != null) cluster.disconnect();
}

The CouchbaseService implements CRUD operations and the mapping between Couchbase JSON documents and CMIS objects. For instance, let's take a look at the getCmisObject method, which retrieves folder or document metadata based on its CMIS type.

public CmisObject getCmisObject(String objectId) {
    CmisObject data = new CmisObject(objectId);

    JsonDocument jsondoc = this.bucket.get(objectId);
    if (jsondoc == null) return null;

    JsonObject doc = jsondoc.content();
    java.util.Set<String> names = doc.getNames();

    for (String propId : names) {
        if (PropertyIds.NAME.equals(propId)) {
            data.setName(doc.getString(propId));
            data.setFileName(doc.getString(propId));
        } else if (PropertyIds.OBJECT_TYPE_ID.equals(propId)) {
            data.setType(doc.getString(propId));
        } else if (PropertyIds.CREATED_BY.equals(propId)) {
            data.setCreatedBy(doc.getString(propId));
        } else if (PropertyIds.LAST_MODIFIED_BY.equals(propId)) {
            data.setLastModifiedBy(doc.getString(propId));
        } else if (PropertyIds.CONTENT_STREAM_MIME_TYPE.equals(propId)) {
            data.setContentType(doc.getString(propId));
        } else if (PropertyIds.CREATION_DATE.equals(propId)) {
            Long time = doc.getLong(propId);
            GregorianCalendar cal = new GregorianCalendar();
            cal.setTimeInMillis(time);
            data.setCreationDate(cal);
        } else if (PropertyIds.LAST_MODIFICATION_DATE.equals(propId)) {
            Long time = doc.getLong(propId);
            GregorianCalendar cal = new GregorianCalendar();
            cal.setTimeInMillis(time);
            data.setLastModificationDate(cal);
        } else if (PropertyIds.PARENT_ID.equals(propId)) {
            data.setParentId(doc.getString(propId));
        } else if (PropertyIds.PATH.equals(propId)) {
            data.setPath(doc.getString(propId));
        } else if (CHILDREN.equals(propId)) {
            JsonArray jsa = doc.getArray(CHILDREN);
            int count = jsa.size();
            for (int i = 0; i < count; i++) {
                data.addChildren((String) jsa.get(i));
            }
        }
    }
    return data;
}
  1. retrieves the json document from couchbase using bucket.get(objectId)
  2. for each property in the JSON document, checks whether it is a CMIS property (based on the PropertyIds CMIS constants), converts the value if needed (for instance, dates are stored as longs and converted to GregorianCalendar) and fills the new CMIS object with the property value.
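The date conversion in step 2 is simple enough to show on its own, as plain Java with no Couchbase dependency:

```java
import java.util.GregorianCalendar;

// CMIS dates are persisted as epoch milliseconds in the JSON document
// and converted back to a GregorianCalendar when the object is mapped.
public class DateMapping {
    static GregorianCalendar fromMillis(long time) {
        GregorianCalendar cal = new GregorianCalendar();
        cal.setTimeInMillis(time);
        return cal;
    }

    static long toMillis(GregorianCalendar cal) {
        return cal.getTimeInMillis();
    }
}
```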

Storage service

There are currently two implementations of the storage: local (using the local file system) and remote (using AWS S3 storage). Each class implements the StorageService interface:

public interface StorageService {

    public String getStorageId();

    public void writeContent(String dataId, ContentStream contentStream)
            throws StorageException;

    public boolean deleteContent(String dataId);

    public ContentStream getContent(String dataId, BigInteger offset,
            BigInteger length, String filename) throws StorageException;

    public boolean exists(String dataId);
}

You can see that the storage is unaware of the folder structure. It stores binary content identified by a unique id.
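To give a feel for how small the local variant can be, here is a simplified sketch of a flat, id-keyed file store. It uses byte arrays instead of ContentStream and IOException instead of StorageException so that it stays self-contained; the real class implements the interface above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Simplified local-filesystem storage sketch: a flat directory where
// each blob is a file named after its unique id, with no folder hierarchy.
public class LocalStorageSketch {
    private final Path root;

    public LocalStorageSketch(Path root) throws IOException {
        this.root = Files.createDirectories(root);
    }

    public void writeContent(String dataId, byte[] content) throws IOException {
        Files.write(root.resolve(dataId), content);
    }

    public byte[] getContent(String dataId) throws IOException {
        return Files.readAllBytes(root.resolve(dataId));
    }

    public boolean exists(String dataId) {
        return Files.exists(root.resolve(dataId));
    }

    public boolean deleteContent(String dataId) throws IOException {
        return Files.deleteIfExists(root.resolve(dataId));
    }
}
```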


Where can I find the code ?

The code implementing this CMIS server on top of Couchbase is available on GitHub here:

https://github.com/cecilelepape/cmis-couchbase/tree/master/chemistry-opencmis-server-couchbase


Special thanks to David Maier and David Ostrovsky, who helped me with the architecture and the S3 storage.